
Azure Synapse Spark

Prerequisites

  • The following Azure resources are required to use Synapse Spark in Guzzle:

| Resource | Description | Steps for creating the resource | Information to retrieve (used for Guzzle environment configurations and for setting up computes/data stores) |
| --- | --- | --- | --- |
| Synapse Workspace | Synapse workspace to be used for Guzzle. | Steps to create synapse workspace | Development endpoint URL |
| Synapse Spark Pool | Spark pool added to the Synapse workspace that will be used by Guzzle. | Steps to setup synapse spark pool | - |
| App Registration | Register a service principal app for Guzzle and generate a secret. This app registration is used for Authentication, so the steps for Authentication setup can be skipped. | App Registration steps | Client Id, Client Secret and Tenant ID (from the App registration overview) |
| Blob Storage Account | Blob account used for shared storage. Grant the Storage Blob Data Contributor permission on the account, selecting the service principal (registered app) as Member. Soft delete must be disabled because the ABFS protocol is used for access (for Databricks); see Disable Blob Storage Soft Delete. | Steps to assign Storage Blob Data Contributor role to service principal | - |
| Key Vault | Key vault used to store the different credentials (assumes one key vault stores all the credentials required for Guzzle environment configurations and data stores). | Assign a Key Vault access policy | - |
  • These are the permissions to be granted to the different AAD principals (users/groups/service principals/managed identities) on the different resources (follow the standard Azure documentation to grant permissions):

| AAD principal (user/group/service principal/managed identity) | Resource | RBAC (permission) | Purpose |
| --- | --- | --- | --- |
| Guzzle VM system identity (or user defined identity) | Key Vault | Secret permission: Get, List, Set | The Guzzle VM needs to retrieve keys from Key Vault and also needs to save the passphrase in Key Vault for the Sync Passphrase function (see API Settings - Guzzle). |
| Azure Service principal (App Registration) | Primary storage ADLS | Blob Contributor | The Spark jobs read and write log files/temp files in the primary storage as part of the job run. Since the job runs using the service principal (app registration), it has to be granted read/write permission on the primary ADLS. |
| Azure Service principal (App Registration) | Synapse workspace | In the Add role assignment section: Role -> Synapse Admin | Guzzle submits jobs to the Synapse workspace and runs them on the Spark pool. For this it needs, at minimum, the Synapse Admin permission. |
| Azure Service principal (App Registration) | Key Vault | Secret permission: Get, List, Set | The jobs that run on the Spark pool are submitted using the service principal (app registration) and need to retrieve the data store credentials from Key Vault. |
| Azure Service principal (App Registration) | Blob Storage | Blob Reader | The jobs that run on the Spark pool are submitted using the service principal (app registration) and need to retrieve the Guzzle binaries from the shared storage (only read access to the binaries is needed). The Guzzle binaries are written into shared storage by the Guzzle VM using an account key or a service principal, depending on what is configured in Shared Storage - Guzzle. |
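The permission rows above can be kept as a small checklist and grouped per principal, which makes it easier to see everything one identity needs before requesting access. This is only an illustrative sketch: the `GRANTS` list and the `grants_by_principal` helper are hypothetical names, and the entries are copied from the table rather than read from Azure.

```python
from collections import defaultdict

# (principal, resource, permission) rows copied from the permissions table.
# This is a static checklist; nothing here talks to Azure.
GRANTS = [
    ("Guzzle VM identity", "Key Vault", "Secret permission: Get, List, Set"),
    ("Service principal (App Registration)", "Primary storage ADLS", "Blob Contributor"),
    ("Service principal (App Registration)", "Synapse workspace", "Synapse Admin"),
    ("Service principal (App Registration)", "Key Vault", "Secret permission: Get, List, Set"),
    ("Service principal (App Registration)", "Blob Storage", "Blob Reader"),
]


def grants_by_principal(grants):
    """Group (resource, permission) pairs under each principal."""
    grouped = defaultdict(list)
    for principal, resource, permission in grants:
        grouped[principal].append((resource, permission))
    return dict(grouped)


if __name__ == "__main__":
    # Print one block per principal, listing each resource and permission.
    for principal, items in grants_by_principal(GRANTS).items():
        print(principal)
        for resource, permission in items:
            print(f"  {resource}: {permission}")
```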

Guzzle Network Diagram for public endpoint

  • When the Synapse workspace is available on the public network (the default setup), the Developer endpoint can be reached over the internet. All of this traffic goes over the public endpoint.
  • The jobs that run on the Spark pool are submitted using the service principal (app registration).

Guzzle VM Connection Table for public endpoint

| # | Source | Target | Protocol / Port | Authentication Mechanism | Purpose of Connection | Traffic Type | Notes (latency, throughput, special security) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | Guzzle VM | Blob storage for Guzzle | HTTPS | Managed Identity (system-generated or user-defined identity of the Guzzle VM) | Store jars and third-party libraries on blob for jobs on Spark compute to read | Public | - |
| 2 | Guzzle VM | Azure SQL for Guzzle | HTTPS | AAD user credential / native SQL account | Read/write Guzzle audit and metadata | Public | - |
| 3 | Guzzle VM | Key Vault | HTTPS | Managed Identity (system-generated or user-defined identity of the Guzzle VM) | Get stored secrets and keys | Public | - |
| 4 | Guzzle VM | Synapse Developer Endpoint | HTTPS | App Registration | Submit jobs to Spark pool | Public | - |
| 5 | Spark pool | Synapse Dedicated Pool (using Azure Synapse Native datastore) | HTTPS | App Registration specified in the compute, or native user/password (the external data source includes an authentication method) | Ingestion: read and write (connector). Processing: run template SQL (JDBC connection). DQ/Recon: read (connector). | Public | |
| 6 | Guzzle VM | Synapse Dedicated Pool (using Azure Synapse) | HTTPS | Native user/password | Processing jobs run directly from Guzzle VM against API | Public | - |
| 7 | Spark pool | Guzzle VM | HTTPS | Temporary API key, which is part of the request and is decrypted using the passphrase in Key Vault | Spark job connects to the Guzzle API to retrieve config | - | - |
| 8 | Spark pool | Blob storage | HTTPS | App Registration | Spark jobs retrieve the config | - | - |
| 9 | Spark pool | Azure SQL | HTTPS | User/password or the App Registration specified | Update repo tables | - | - |
| 10 | Spark pool | Primary ADLS | HTTPS | App Registration | Read/write logs | - | - |
| 11 | Spark pool | Key Vault | HTTPS | App Registration | Fetch secrets from Key Vault | - | - |
| 12 | Spark pool | Synapse Developer Endpoint | HTTPS | App Registration | Connect to the Synapse Developer endpoint to stop the job | - | - |
  • Apart from this, there will be additional network traffic between the Spark pool and the sources and targets used in the ingestion/processing jobs.
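Since every connection listed above uses HTTPS, a quick way to sanity-check the network path from the Guzzle VM is to try opening a TCP connection to each target on port 443. The following is a minimal sketch using only the Python standard library; the helper names are hypothetical, and the hostnames in the usage note are placeholders to be replaced with your own resource names. A successful TCP connection only shows that the port is reachable, not that authentication will succeed.

```python
import socket
from urllib.parse import urlparse


def host_and_port(endpoint, default_port=443):
    """Split an endpoint URL or bare hostname into (host, port)."""
    # Bare hostnames get an https:// prefix so urlparse can find the host.
    parsed = urlparse(endpoint if "://" in endpoint else f"https://{endpoint}")
    return parsed.hostname, parsed.port or default_port


def can_connect(endpoint, timeout=5.0):
    """Return True if a TCP connection to the endpoint can be opened."""
    host, port = host_and_port(endpoint)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections and timeouts.
        return False
```

For example, `can_connect("https://myworkspace.dev.azuresynapse.net")` or `can_connect("myvault.vault.azure.net")` could be run from the Guzzle VM, where `myworkspace` and `myvault` stand in for your actual workspace and vault names.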

Guzzle Network Diagram for private endpoint

| Resource | Azure Documentation Link | Purpose |
| --- | --- | --- |
| Storage Account | Configure Azure Storage firewalls and virtual networks; Use private endpoints - Azure Storage | Disable public access and create a private endpoint to access blob privately. |
| SQL Server | Deny Public Network Access - Azure portal - Azure Database for MySQL | Disable public access and create 2 private endpoints: (1) Guzzle VM, (2) Synapse workspace. |
| Key Vault | Integrate Key Vault with Azure Private Link | Disable public access on Azure Key Vault and access it through a private endpoint. |
| Synapse Private Endpoint | Access control in Synapse workspace how to - Azure Synapse Analytics | Add all private endpoints to Synapse: (1) Storage Account, (2) Key Vault, (3) SQL Server, (4) Private Link Service. |
| Private Link Service With Load Balancer | Quickstart - Create a Private Link service - Azure portal - Azure Private Link | A Private Link service is needed to access the Guzzle VM in the private network. |
| Guzzle API Setting | - | Change the Guzzle API setting to the fully qualified domain name of the Private Link service. |
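Once public access is disabled and the private endpoints are in place, the resource hostnames would normally be expected to resolve to private IP addresses when looked up from inside the virtual network. The sketch below shows one rough way to check that with the Python standard library. It assumes the check is run from a machine inside the network with private DNS configured, and the helper names are hypothetical. Note that `is_private` also accepts loopback and link-local addresses, so treat the result as a coarse signal only.

```python
import ipaddress
import socket


def all_private(addresses):
    """Return True if the iterable is non-empty and every address is private."""
    addrs = list(addresses)
    return bool(addrs) and all(ipaddress.ip_address(a).is_private for a in addrs)


def resolved_addresses(hostname, port=443):
    """Resolve a hostname to the set of IP addresses that DNS returns for it."""
    infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return {info[4][0] for info in infos}


def resolves_privately(hostname):
    """Return True if every address the hostname resolves to is private."""
    return all_private(resolved_addresses(hostname))
```

A call such as `resolves_privately("mystorage.blob.core.windows.net")`, with `mystorage` replaced by your storage account name, returning `False` from inside the network would suggest that the lookup is still going to the public endpoint.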

Guzzle VM Connection Table for private endpoint

| # | Source | Target | Protocol / Port | Authentication Mechanism | Purpose of Connection | Traffic Type |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | Guzzle VM | Storage Account for Guzzle | HTTPS | Managed Identity | Store jars and third-party libraries on blob for Spark compute to read | Private |
| 2 | Guzzle VM | Azure SQL for Guzzle | HTTPS | AAD user credential / SAS | Read/write Guzzle audit and metadata | Private |
| 3 | Guzzle VM | Key Vault | HTTPS | Managed Identity | Get stored secrets and keys | Private |
| 4 | Guzzle VM | Databricks Control Plane | HTTPS | Managed Identity | Submit jobs to Databricks cluster | Private |
| 5 | Guzzle VM | Synapse Control Plane | HTTPS | Managed Identity | Submit jobs to Spark pool | Private |
| 6 | Guzzle VM | Synapse Dedicated Pool | HTTPS | Managed Identity | Submit jobs to Synapse dedicated pool | Private |
| 7 | Guzzle VM | Guzzle VM | HTTPS | Managed Identity | Update logs, retrieve configs from Guzzle | Private |

Synapse Compute Support Matrix

For Synapse compute support across the different data sources, refer to the official Guzzle documentation.

Guzzle Configurations

| Property | Description | Default Value | Required |
| --- | --- | --- | --- |
| Synapse workspace URL | URL of the Azure Synapse workspace. It is shown as the Development endpoint on the Synapse workspace overview page in the Azure portal. | None | Yes |
| Spark pool name | Name of the Spark pool that Guzzle will use. | None | Yes |
| Credential Type | Credential type used to connect to Azure Synapse. | Service principal | Yes |
| Client Id | Client ID (also called application ID) provided by Azure Active Directory. Registering an app in Azure Active Directory assigns it a unique ID. Provide the client ID of the application created in the steps above. | None | Yes |
| Client Secret | Azure Active Directory client secret. Provide the client secret of the application created in the steps above. Guzzle uses it to verify and generate the access key for authentication. | None | Yes |
| Tenant Id | Unique identifier of the Azure Active Directory instance, also called the directory ID. A tenant represents an organization: it is the dedicated instance of Azure AD that an organization or app developer receives at the beginning of a relationship with Microsoft. Provide the tenant ID of the application. | None | Yes |
| Driver Memory | Driver memory allocated for running jobs. | None | No |
| Driver Cores | Driver cores allocated for running jobs. | None | No |
| Executor Memory | Executor memory allocated for running jobs. | None | No |
| Executor Cores | Executor cores allocated for running jobs. | None | No |
| Number of executors | Number of Spark executors the job should run with. | None | No |
| Customize spark config | Additional Spark configuration options, given as config name and config value. | None | No |
| Custom cluster tags | Tags to apply to the cluster, given as tag name and tag value. | None | No |
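To make the properties above more concrete, the sketch below shows how such settings could map onto a Spark batch request against the Synapse Livy endpoint exposed under the workspace Development endpoint. This is an assumption made for illustration, not a description of how Guzzle submits jobs: the `COMPUTE` dictionary, its key names, and the helper functions are hypothetical, and the workspace, pool, file and class names are placeholders. Only the request URL and body are built here; no request is sent and no token is acquired.

```python
import json

# Livy API version segment used by the Synapse Spark batch REST endpoint.
LIVY_API_VERSION = "2019-11-01-preview"

# Hypothetical compute settings; the keys are illustrative, not Guzzle's schema.
COMPUTE = {
    "workspace_url": "https://myworkspace.dev.azuresynapse.net",
    "spark_pool_name": "guzzlepool",
    "driver_memory": "4g",
    "driver_cores": 2,
    "executor_memory": "8g",
    "executor_cores": 4,
    "num_executors": 2,
    "spark_config": {"spark.sql.shuffle.partitions": "64"},
    "cluster_tags": {"team": "data-eng"},
}

# Optional settings and the batch request field each one would map to.
OPTIONAL_FIELDS = {
    "driver_memory": "driverMemory",
    "driver_cores": "driverCores",
    "executor_memory": "executorMemory",
    "executor_cores": "executorCores",
    "num_executors": "numExecutors",
    "spark_config": "conf",
    "cluster_tags": "tags",
}


def batches_url(compute):
    """Build the batch submission URL from the workspace URL and pool name."""
    base = compute["workspace_url"].rstrip("/")
    pool = compute["spark_pool_name"]
    return f"{base}/livyApi/versions/{LIVY_API_VERSION}/sparkPools/{pool}/batches"


def build_batch_payload(compute, name, file, class_name, args=()):
    """Build the request body, including only the optional settings that are set."""
    payload = {"name": name, "file": file, "className": class_name, "args": list(args)}
    for key, field in OPTIONAL_FIELDS.items():
        if compute.get(key) is not None:
            payload[field] = compute[key]
    return payload


if __name__ == "__main__":
    body = build_batch_payload(COMPUTE, "sample-job", "abfss://jars@mystorage.dfs.core.windows.net/app.jar", "com.example.Main")
    print(batches_url(COMPUTE))
    print(json.dumps(body, indent=2))
```

Leaving optional settings out of the body when they are unset mirrors the "None / Not required" defaults in the table. Whether the service accepts a request without them depends on the Synapse API, so check its reference before relying on that.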

Run Guzzle Job with Synapse Spark

Guzzle Monitor UI