Guzzle supports Computing environments on the AWS Cloud. In Guzzle AWS cloud setup, we can use Databricks to execute our workloads. This article helps in using AWS Databricks as a computing environment in Guzzle.
|Cluster type||There are 2 types of Databricks clusters available in Guzzle:|
Data Engineering : Data engineering cluster is recommended for automated workloads. It is recommended to use it for your BAU data loads.
Data Analytics : Data analytics cluster is recommended for interactive queries along with concurrent user support. This cluster type is configurable in Guzzle, and it can also execute the workloads, but it is not recommended to use it for your BAU data loads. Data Analytics cluster is costlier than Data Engineering cluster for per DBU usage and meant for interactive queries through Databricks notebook in a shared environment where multiple people have to collaborate as a team.
|API URL||Specify AWS Databricks environment URL.||None||True|
|API token||Specify Databricks access token for authentication and to communicate with AWS databricks environment. User can generate token from databricks workspace using User Settings -> Access Tokens -> Generate new token option. |
To specify the token following options are available:
1. Manual: Provide api token directly.
2. Secret Manager: Use it through AWS Secret manager feature. Give value of the secret name where api token is stored in AWS secret manager instance.
|Cluster id (applicable for data analytics cluster)||Specify analytics cluster name which is precreated in Databricks environment. |
Guzzle will show available clusters once a valid API token and URL are configured.
UI will show cluster name, but it stores cluster ID in the underlying YML file
|Configure retry when job submit fails||Enable job resubmittion when Guzzle is not able to successfully submit the job (possibly due to issues like unavailability of cloud resources or error in control plane). by default, Guzzle not perform any retry when job submittion failed. |
Max retry: Provide the number of maximum retry
Min retry interval: Provide the interval time between retry in milliseconds
|Spark version (applicable for data engineering cluster)||Select the spark version that will run on cluster.|
This drop-down is populated once a valid API token and URL are provided.
|Enable cluster pool||To reduce cluster start time, you can attach a cluster to a predefined pool of idle instances, for the driver and worker nodes.||False||No|
|Instance pool id (applicable when cluster pool is enableded)||Specify the predefined pool from the drop-down. Available predefined pools are listed in the drop-down once a valid API token and URL are configured.||None||Yes|
|Driver node type (applicable for data engineering cluster)||Specify the instance type for the driver node.||None||Yes|
|Worker node type (applicable for data engineering cluster)||Specify the instance type for the worker node.||None||Yes|
|Workers||Specify the Fix number of workers to use in cluster.||8||Yes|
|Enable autoscaling||Enable autoscaling to choose the appropriate number of workers for job, whose requirements changes over time or requirements are unknown.||False||No|
|Min workers (applicable when autoscaling is enableded)||Specify the minimum number of workers to use in cluster.||2||Yes|
|Max workers (applicable when autoscaling is enableded)||Specify the maximum number of workers to use in cluster.||8||Yes|
|On-demand nodes||The on-demand size determines the number of on-demand nodes. The rest of the nodes in the cluster will attempt to be spot nodes. The driver will be on demand.||1||Yes|
|Spot fall back to on-demand||If the EC2 spot price exceeds the bid, use on-demand instances instead of spot instances. This applies during both creation and lifetime of the cluster.||False||No|
|Max spot price||Max price you are willing to pay for Spot instances.||100||Yes|
|Instance profile||Instance profile is used to securely access AWS resources without using AWS keys. Click here for more information about how to create and configure instance profiles. Once you have created an instance profile, you can select it in the Instance Profile drop-down list. Make sure your instance profile has permission to access the guzzle config storage and secret manager.||None||Yes|
|Availability zone||Select a specific cluster availability zone where your cluster will launch.||None||Yes|
|Enable local storage auto-scaling||If you don’t want to allocate a fixed number of EBS volumes at cluster creation time, use autoscaling local storage. When worker begins to run too low on disk, it automatically attaches a new EBS volume to the worker before it runs out of disk space.||False||No|
|EBS volume type (applicable when local storage auto-scaling is disabled)||For instance types that do not have a local disk, or if you want to increase your Spark shuffle storage space, you can specify additional EBS volumes. Consider the 'Enable local storage auto-scaling' option if you are not sure how to size them.||None||No|
|Volume counts (applicable when local storage auto-scaling is disabled)||Provision the number of volumes for each instance. Users can choose up to 10 volumes per instance.||None||No|
|Volume size in GB (applicable when local storage auto-scaling is disabled)||The size of each EBS volume (in GB) for each instance. For general purpose SSD, this value must be within the range 32 - 4096. For throughput optimized HDD, this value must be within the range 500 - 4096.||None||No|
|Customize spark config||Used to provide custom Spark configuration properties in a cluster configuration |
Ex: conf : spark.sql.broadcastTimeout
value : 5000
|Customize environment variables||Used to specify environment variables to use in spark computes.|
Ex: Variable : MY_ENVIRONMENT_VARIABLE.
Value : VARIABLE_VALUE
|Init script||Configure cluster node initialization or init-script that is a shell script that runs during startup for each cluster node before the Spark driver or worker JVM starts. You can use init scripts to install packages and libraries not included in the Databricks runtime, modify the JVM system classpath, set system properties and environment variables used by the JVM, or modify Spark configuration parameters, among other configuration tasks.||None|
|Custom cluster tags||Cluster tags allow you to easily monitor the cost of cloud resources used by various groups in your organization. You can specify tags as key-value pairs, it applies to cluster and its cloud resources like VMs and disk volumes, as well as DBU usage reports.||None|