Spark Parameters

  • When working with Apache Spark, you can use various parameters to configure and optimize your Spark jobs. These parameters can be set at different levels, such as activity, pipeline, and batch, depending on your specific requirements.
  • Spark parameters can be used in various ways. For example, you can use them to switch a Spark cluster to a higher configuration at runtime.
  • In Guzzle, you can pass or override Spark-related parameters.

  • There are four ways to pass Spark parameters in Guzzle (see the sketch after this list):
    • Pipeline activity level (Spark parameters set at the activity level in a pipeline apply only to that activity)
      • Pipeline level (Configuration Override)
      • Activity level (Activity Configurations)
    • Pipeline level (applicable to all activities inside the pipeline)
    • Runtime dialog
      • Override spark config / Override databricks settings
      • Additional parameters
    • Environment parameter
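
To make the precedence order concrete, here is a minimal Python sketch of how the four levels could combine, with activity-level values winning over pipeline, runtime, and environment values. The function and dictionary structures are illustrative assumptions, not Guzzle's actual implementation.

```python
# Illustrative sketch only: the merge function and structures below are
# assumptions used to explain precedence, not Guzzle's actual API.

def resolve_spark_params(environment, runtime, pipeline, activity):
    """Merge Spark parameters from lowest to highest precedence so that
    activity-level values override pipeline, runtime, and environment values."""
    resolved = {}
    for level in (environment, runtime, pipeline, activity):
        resolved.update(level)  # later (higher-precedence) levels overwrite earlier ones
    return resolved

params = resolve_spark_params(
    environment={"guzzle.spark.driver_memory": "1gb"},
    runtime={"guzzle.spark.num_executors": "4"},
    pipeline={"guzzle.spark.executor_memory": "2gb"},
    activity={"guzzle.spark.num_executors": "8"},  # wins over the runtime value
)
print(params)
# {'guzzle.spark.driver_memory': '1gb', 'guzzle.spark.num_executors': '8',
#  'guzzle.spark.executor_memory': '2gb'}
```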

Activity inside pipeline (Highest Precedence)#

  • You can set Spark configurations at the activity level inside a pipeline from the settings icon in the pipeline list.
  • Spark configurations can be set for each activity individually.

Pipeline Level#

  • You can provide common Spark configurations for all activities in a pipeline.
  • If a Spark config is not specified at the activity level inside the pipeline, Guzzle falls back to the pipeline-level configuration.

Pipeline Runtime#

  • If Spark parameters are not specified at the activity or pipeline level, you can set them at runtime.
  • Guzzle will pass the runtime Spark parameters to each activity inside the pipeline.

Environment Variable#

  • You can also use environment variables to define Spark configurations. Guzzle gives these the lowest precedence.

Compute specific Spark Configs#

Databricks#

| Parameter | Parameter Name | Default Value | Description |
| --- | --- | --- | --- |
| Cluster id | guzzle.spark.cluster_id | - | Set the Databricks cluster ID on which the job will be executed |
| Spark version | guzzle.spark.spark_version | - | Databricks Runtime is the set of core components that run on your clusters. All Databricks Runtime versions include Apache Spark and add components and updates that improve usability, performance, and security. |
| Enable cluster pool | guzzle.spark.instance_pool_id | - | Attach a cluster pool to the Databricks cluster |
| Enable auto-scaling | guzzle.spark.workers.autoscale.min_workers, guzzle.spark.workers.autoscale.max_workers | - | To allow Databricks to resize your cluster automatically, enable autoscaling for the cluster and provide the min and max range of workers |
| Driver node type | guzzle.spark.driver_node_type_id | - | Define the node type for the driver node |
| Worker node type | guzzle.spark.node_type_id | - | Define the node type for worker nodes |
| spark config | guzzle.spark.spark_conf | - | This parameter is mainly used for adding Spark properties |
| Number of workers | guzzle.spark.workers.num_workers | - | Define the number of worker nodes required for job execution |
| Autoscale min workers | guzzle.spark.workers.autoscale.min_workers | - | Define the minimum number of workers required for the workload |
| Autoscale max workers | guzzle.spark.workers.autoscale.max_workers | - | Define the maximum number of workers required for the workload |
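
The sketch below shows how these Databricks parameters might look as a set of key-value overrides (for example, via the runtime dialog's spark config override). The values and the nested map for guzzle.spark.spark_conf are assumptions for illustration, not recommended or default settings.

```python
# Hypothetical values for illustration; the keys come from the table above,
# but the structure and values are assumptions, not Guzzle defaults.
databricks_overrides = {
    "guzzle.spark.spark_version": "13.3.x-scala2.12",       # example Databricks Runtime
    "guzzle.spark.driver_node_type_id": "Standard_DS3_v2",   # driver node type
    "guzzle.spark.node_type_id": "Standard_DS3_v2",          # worker node type
    "guzzle.spark.workers.autoscale.min_workers": 2,
    "guzzle.spark.workers.autoscale.max_workers": 8,
    # Plain Spark properties passed via guzzle.spark.spark_conf
    # (shown here as a nested map, which is an assumption):
    "guzzle.spark.spark_conf": {"spark.sql.shuffle.partitions": "200"},
}
```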


Synapse spark#

| Parameter | Parameter Name | Default Value | Description |
| --- | --- | --- | --- |
| Number of executors | guzzle.spark.num_executors | 1 | Define the number of executors required for your Spark application |
| Driver memory | guzzle.spark.driver_memory | 1gb | Define driver node memory |
| Driver cores | guzzle.spark.driver_cores | 1 | Define the number of driver cores |
| Executor memory | guzzle.spark.executor_memory | 1gb | Define memory for each executor node |
| Executor Core | guzzle.spark.executor_cores | 1 | Define the number of executor cores |
| spark config | guzzle.spark.spark_conf | - | This parameter is mainly used for adding Spark properties |
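
As an example, a small Synapse Spark override might look like the sketch below; the values are illustrative assumptions. Since AWS EMR and AWS EMR Serverless expose the same parameter names (see the tables that follow), the same keys apply to those compute types as well.

```python
# Illustrative values only; keys are from the table above, values are assumptions.
synapse_overrides = {
    "guzzle.spark.num_executors": "2",      # default is 1
    "guzzle.spark.driver_memory": "4gb",    # default is 1gb
    "guzzle.spark.driver_cores": "2",       # default is 1
    "guzzle.spark.executor_memory": "4gb",  # default is 1gb
    "guzzle.spark.executor_cores": "2",     # default is 1
    "guzzle.spark.spark_conf": {"spark.sql.shuffle.partitions": "64"},
}
```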

AWS Glue#

| Parameter | Parameter Name | Default Value | Description |
| --- | --- | --- | --- |
| Number of executors | guzzle.spark.num_executors | 1 | Define the number of executors your Spark job will use to execute workloads |

AWS EMR#

| Parameter | Parameter Name | Default Value | Description |
| --- | --- | --- | --- |
| Number of executors | guzzle.spark.num_executors | - | Define the number of executors required for your Spark application |
| Driver memory | guzzle.spark.driver_memory | - | Define driver node memory |
| Driver cores | guzzle.spark.driver_cores | - | Define the number of driver cores |
| Executor memory | guzzle.spark.executor_memory | - | Define memory for each executor node |
| Executor Core | guzzle.spark.executor_cores | - | Define the number of executor cores |
| spark config | guzzle.spark.spark_conf | - | This parameter is mainly used for adding Spark properties |

AWS EMR Serverless#

| Parameter | Parameter Name | Default Value | Description |
| --- | --- | --- | --- |
| Number of executors | guzzle.spark.num_executors | - | Define the number of executors required for your Spark application |
| Driver memory | guzzle.spark.driver_memory | - | Define driver node memory |
| Driver cores | guzzle.spark.driver_cores | - | Define the number of driver cores |
| Executor memory | guzzle.spark.executor_memory | - | Define memory for each executor node |
| Executor Core | guzzle.spark.executor_cores | - | Define the number of executor cores |
| spark config | guzzle.spark.spark_conf | - | This parameter is mainly used for adding Spark properties |