
# AWS Databricks

Guzzle supports computing environments on the AWS cloud. In a Guzzle AWS cloud setup, Databricks can be used to execute workloads. This article describes how to use AWS Databricks as a computing environment in Guzzle.

## Use of AWS Databricks as a computing environment

| Property | Description | Default Value | Required |
| --- | --- | --- | --- |
| Cluster type | There are two types of Databricks clusters available in Guzzle:<br>Data Engineering: recommended for automated workloads and for your BAU data loads.<br>Data Analytics: recommended for interactive queries with concurrent user support. This cluster type is configurable in Guzzle and can also execute workloads, but it is not recommended for BAU data loads. A Data Analytics cluster costs more per DBU than a Data Engineering cluster and is intended for interactive queries through Databricks notebooks in a shared environment where multiple people collaborate as a team. | Data Engineering | Yes |
| API URL | Specify the AWS Databricks environment URL. | None | Yes |
| API token | Specify the Databricks access token used to authenticate and communicate with the AWS Databricks environment. Users can generate a token from the Databricks workspace via User Settings -> Access Tokens -> Generate new token.<br>The token can be specified in two ways:<br>1. Manual: provide the API token directly.<br>2. Secret Manager: provide the name of the secret in AWS Secrets Manager where the API token is stored. | None | Yes |
| Cluster id (applicable for data analytics cluster) | Specify the analytics cluster that has been pre-created in the Databricks environment. Guzzle shows the available clusters once a valid API token and URL are configured. The UI shows the cluster name, but the cluster ID is stored in the underlying YML file. | None | Yes |
| Configure retry when job submit fails | Enable job resubmission when Guzzle is unable to submit the job successfully (for example, due to unavailable cloud resources or a control-plane error). By default, Guzzle does not retry when job submission fails.<br>Max retry: the maximum number of retries.<br>Min retry interval: the interval between retries, in milliseconds. | None | No |
| Spark version (applicable for data engineering cluster) | Select the Spark version that will run on the cluster. The drop-down is populated once a valid API token and URL are provided. | None | Yes |
| Enable cluster pool | To reduce cluster start time, you can attach the cluster to a predefined pool of idle instances for the driver and worker nodes. | False | No |
| Instance pool id (applicable when cluster pool is enabled) | Specify the predefined pool from the drop-down. Available predefined pools are listed in the drop-down once a valid API token and URL are configured. | None | Yes |
| Driver node type (applicable for data engineering cluster) | Specify the instance type for the driver node. | None | Yes |
| Worker node type (applicable for data engineering cluster) | Specify the instance type for the worker nodes. | None | Yes |
| Workers | Specify the fixed number of workers to use in the cluster. | 8 | Yes |
| Enable autoscaling | Enable autoscaling to choose an appropriate number of workers for jobs whose requirements change over time or are unknown. | False | No |
| Min workers (applicable when autoscaling is enabled) | Specify the minimum number of workers to use in the cluster. | 2 | Yes |
| Max workers (applicable when autoscaling is enabled) | Specify the maximum number of workers to use in the cluster. | 8 | Yes |
| On-demand nodes | Determines the number of on-demand nodes; the remaining nodes in the cluster will attempt to be spot nodes. The driver node is always on-demand. | 1 | Yes |
| Spot fall back to on-demand | If the EC2 spot price exceeds the bid, use on-demand instances instead of spot instances. This applies during both creation and the lifetime of the cluster. | False | No |
| Max spot price | The maximum price you are willing to pay for spot instances. | 100 | Yes |
| Instance profile | An instance profile is used to securely access AWS resources without using AWS keys. Refer to the Databricks documentation for how to create and configure instance profiles. Once you have created an instance profile, you can select it in the Instance profile drop-down list. Make sure the instance profile has permission to access the Guzzle config storage and AWS Secrets Manager. | None | Yes |
| Availability zone | Select the availability zone in which your cluster will launch. | None | Yes |
| Enable local storage auto-scaling | If you do not want to allocate a fixed number of EBS volumes at cluster creation time, use autoscaling local storage. When a worker begins to run low on disk, a new EBS volume is automatically attached before it runs out of disk space. | False | No |
| EBS volume type (applicable when local storage auto-scaling is disabled) | For instance types that do not have a local disk, or if you want to increase your Spark shuffle storage space, you can specify additional EBS volumes. Consider the Enable local storage auto-scaling option if you are not sure how to size them. | None | No |
| Volume counts (applicable when local storage auto-scaling is disabled) | The number of EBS volumes to provision for each instance. Users can choose up to 10 volumes per instance. | None | No |
| Volume size in GB (applicable when local storage auto-scaling is disabled) | The size of each EBS volume (in GB) for each instance. For general purpose SSD, this value must be within the range 32-4096. For throughput optimized HDD, this value must be within the range 500-4096. | None | No |
| Customize spark config | Used to provide custom Spark configuration properties in the cluster configuration.<br>Example: conf: spark.sql.broadcastTimeout, value: 5000 | None | |
| Customize environment variables | Used to specify environment variables available to the Spark compute.<br>Example: variable: MY_ENVIRONMENT_VARIABLE, value: VARIABLE_VALUE | None | |
| Init script | Configure cluster node initialization (init) scripts: shell scripts that run during startup for each cluster node before the Spark driver or worker JVM starts. You can use init scripts to install packages and libraries not included in the Databricks runtime, modify the JVM system classpath, set system properties and environment variables used by the JVM, or modify Spark configuration parameters, among other configuration tasks. | None | |
| Custom cluster tags | Cluster tags allow you to easily monitor the cost of cloud resources used by various groups in your organization. Tags are specified as key-value pairs and are applied to the cluster and its cloud resources (such as VMs and disk volumes), as well as to DBU usage reports. | None | |
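
If you choose the Secret Manager option for the API token, the token has to exist in AWS Secrets Manager under a name that you then enter in Guzzle. A minimal sketch of creating such a secret with boto3 follows; the secret name, region, and token value are placeholders, and whether Guzzle expects a plain string or a JSON payload as the secret value is not covered here.

```python
# Hedged sketch: store the Databricks API token in AWS Secrets Manager so that
# Guzzle's "Secret Manager" option can reference it by secret name.
# The secret name, region, and token value below are placeholders.
import boto3

secrets = boto3.client("secretsmanager", region_name="ap-southeast-1")
secrets.create_secret(
    Name="guzzle/databricks-api-token",   # the secret name you would enter in Guzzle
    SecretString="dapiXXXXXXXXXXXXXXXX",  # placeholder Databricks personal access token
)
```

Most of the remaining properties map closely onto the cluster specification used by the public Databricks REST API, the same environment the API URL and API token above authenticate against. The sketch below only illustrates that mapping and is not Guzzle's internal YML format; the field names are the standard Databricks `new_cluster` fields, and every concrete value (Spark version, instance types, instance profile ARN, init-script path, tags) is a placeholder assumption.

```python
# Illustrative sketch: how the Guzzle properties above correspond to a Databricks
# "new_cluster" specification (public Databricks REST API field names).
# This is NOT Guzzle's internal YML format; all concrete values are placeholders.
new_cluster = {
    "spark_version": "11.3.x-scala2.12",                 # Spark version
    "driver_node_type_id": "i3.xlarge",                  # Driver node type
    "node_type_id": "i3.xlarge",                         # Worker node type
    "autoscale": {"min_workers": 2, "max_workers": 8},   # Min/Max workers (autoscaling enabled)
    "aws_attributes": {
        "first_on_demand": 1,                            # On-demand nodes
        "availability": "SPOT_WITH_FALLBACK",            # Spot fall back to on-demand
        "spot_bid_price_percent": 100,                   # Max spot price
        "zone_id": "ap-southeast-1a",                    # Availability zone (placeholder)
        "instance_profile_arn": "arn:aws:iam::111122223333:instance-profile/guzzle-access",  # Instance profile (placeholder)
        "ebs_volume_type": "GENERAL_PURPOSE_SSD",        # EBS volume type
        "ebs_volume_count": 1,                           # Volume counts
        "ebs_volume_size": 100,                          # Volume size in GB
    },
    "spark_conf": {"spark.sql.broadcastTimeout": "5000"},             # Customize spark config
    "spark_env_vars": {"MY_ENVIRONMENT_VARIABLE": "VARIABLE_VALUE"},  # Customize environment variables
    "init_scripts": [{"s3": {"destination": "s3://example-bucket/init/install-libs.sh"}}],  # Init script (placeholder path)
    "custom_tags": {"team": "data-engineering"},         # Custom cluster tags (placeholder)
}
```

For a fixed-size cluster, the Workers property corresponds to `num_workers` in place of the `autoscale` block, and a Data Analytics setup would reference the pre-created cluster by its ID (`existing_cluster_id` in the Databricks Jobs API) rather than describe a new cluster.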

## Interface of AWS Databricks compute for cluster type: Data Engineering

## Interface of AWS Databricks compute for cluster type: Data Analytics