
AWS Glue

Guzzle supports compute environments on the AWS Cloud. In a Guzzle AWS Cloud setup, AWS Glue can be used to execute workloads. This article explains how to use AWS Glue as a compute environment in Guzzle.

Note: AWS Glue compute can be used only when Guzzle is deployed on an AWS EC2 instance.

IAM Role permissions for AWS Glue Job

The IAM role referenced in the Guzzle AWS Glue compute must have the following permissions:

| Action | Resource | Description |
| --- | --- | --- |
| s3:GetObject | arn:aws:s3:::<guzzle-shared-storage-bucket>/* | To access jar files related to Guzzle and the Guzzle AWS Glue job script |
| secretsmanager:GetSecretValue | arn:aws:secretsmanager:<region>:<account>:secret:* | To access the Guzzle security passphrase and other secrets referred to in Guzzle configs |
| glue:BatchStopJobRun | arn:aws:glue:<region>:<account>:job/* | To cancel an AWS Glue job while it is running |
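
As a rough sketch, the three required permissions above could be assembled into an IAM policy document like this. The function name and the example bucket, region, and account values are hypothetical placeholders for illustration and are not part of Guzzle.

```python
import json


def guzzle_glue_required_policy(shared_bucket: str, region: str, account: str) -> dict:
    """Build an IAM policy document holding the three permissions listed above.

    All arguments are caller-supplied placeholders (hypothetical values).
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Read Guzzle jar files and the Guzzle AWS Glue job script
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{shared_bucket}/*"],
            },
            {
                # Read the Guzzle security passphrase and other secrets
                "Effect": "Allow",
                "Action": ["secretsmanager:GetSecretValue"],
                "Resource": [f"arn:aws:secretsmanager:{region}:{account}:secret:*"],
            },
            {
                # Cancel an AWS Glue job while it is running
                "Effect": "Allow",
                "Action": ["glue:BatchStopJobRun"],
                "Resource": [f"arn:aws:glue:{region}:{account}:job/*"],
            },
        ],
    }


if __name__ == "__main__":
    # Example values only; substitute your own bucket, region, and account ID.
    policy = guzzle_glue_required_policy(
        "my-guzzle-shared-bucket", "us-east-1", "123456789012"
    )
    print(json.dumps(policy, indent=2))
```

The printed JSON can be pasted into the role's policy editor and narrowed further, for example by restricting the secret ARN to the specific secrets Guzzle uses.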

A few other permissions must be assigned so that the AWS Glue job can access the Glue catalog, read data from storage, write logs to AWS CloudWatch, and so on. For more information, refer to the AWS Glue documentation. The following policy is an example:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:*"
      ],
      "Resource": [
        "arn:aws:s3:::<datastore-bucket>",
        "arn:aws:s3:::<datastore-bucket>/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "glue:GetConnection",
        "glue:GetDatabase",
        "glue:GetTable",
        "glue:CreateTable",
        "glue:GetUserDefinedFunctions"
      ],
      "Resource": [
        "arn:aws:glue:<region>:<account>:catalog",
        "arn:aws:glue:<region>:<account>:connection/*",
        "arn:aws:glue:<region>:<account>:database/*",
        "arn:aws:glue:<region>:<account>:table/*/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Resource": [
        "arn:aws:logs:*:*:/aws-glue/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "cloudwatch:PutMetricData"
      ],
      "Resource": [
        "*"
      ]
    }
  ]
}

Guzzle compute configuration properties for AWS Glue

| Property | Description | Default Value | Required |
| --- | --- | --- | --- |
| Authentication type | Select the authentication type to be used for accessing the AWS Glue service for various actions. Options: Service Role (uses the role assigned to the EC2 instance where Guzzle is deployed and the role assigned to the AWS Glue job to retrieve credentials for accessing the AWS Glue service) | Service Role | Yes |
| IAM role | Role assumed by the AWS Glue job, with permissions to access your data stores and AWS services | None | Yes |
| Glue version | Select the AWS Glue runtime version. Options: 3.0 | 3.0 | Yes |
| Worker type | Set the type of predefined worker that is allowed when a job runs. Options: G 1X, G 2X | G 1X | Yes |
| Automatically scale | AWS Glue will optimize costs and resource usage by dynamically scaling the number of workers up and down throughout the job run | False | Yes |
| Requested number of workers / Maximum number of workers | The number of workers you want AWS Glue to allocate to this job | 10 | Yes |
| Generate job insights | AWS Glue will analyze your job runs and provide insights on how to optimize your jobs and the reasons for job failures | True | No |
| Number of retries | Number of retries for jobs | 3 | No |
| Job timeout (minutes) | Set the execution time. The default is 2,880 minutes (48 hours) for a Glue ETL job | 2880 | No |
| Job metrics | Enable the creation of CloudWatch metrics when this job runs | True | No |
| Continuous logging | Enable logs in CloudWatch | True | No |
| Spark UI | Enable using the Spark UI for monitoring this job | True | No |
| Spark UI logs path | Spark UI logs path | None | No |
| Temporary path | Working directory. The path must be in the form s3://bucket/prefix/path/. It must end with a slash (/) and not include any files | None | No |
| Delay notification threshold (minutes) | Set a delay threshold in minutes. If the job runs longer than the specified time, Glue will send a delay notification via CloudWatch | None | No |
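
To make the properties above more concrete, the sketch below shows how they roughly correspond to fields of the AWS Glue CreateJob API and to Glue's special job parameters (such as `--TempDir` and `--enable-spark-ui`). How Guzzle itself maps these settings internally is not documented here, so treat this as an assumption. The property key names (`iam_role`, `worker_type`, and so on) and both function names are hypothetical. The API worker type value `G.1X` is assumed to correspond to the `G 1X` option. The sketch only builds a dictionary and does not call AWS.

```python
import re

# "s3://bucket/prefix/path/": an S3 URI that ends with a slash (no file name).
_S3_DIR = re.compile(r"^s3://[^/\s]+/(?:[^\s]*/)?$")


def is_valid_temp_path(path: str) -> bool:
    """Return True if path looks like an S3 directory URI ending in '/'."""
    return bool(_S3_DIR.match(path))


def glue_job_settings(props: dict) -> dict:
    """Translate the properties above into CreateJob-style fields.

    Keys of `props` are hypothetical names; defaults follow the table above.
    """
    args = {}
    if props.get("job_metrics", True):
        args["--enable-metrics"] = "true"
    if props.get("continuous_logging", True):
        args["--enable-continuous-cloudwatch-log"] = "true"
    if props.get("spark_ui", True):
        args["--enable-spark-ui"] = "true"
        if props.get("spark_ui_logs_path"):
            args["--spark-event-logs-path"] = props["spark_ui_logs_path"]
    if props.get("auto_scale", False):
        args["--enable-auto-scaling"] = "true"
    if props.get("job_insights", True):
        args["--enable-job-insights"] = "true"

    temp_path = props.get("temporary_path")
    if temp_path:
        if not is_valid_temp_path(temp_path):
            raise ValueError("Temporary path must look like s3://bucket/prefix/path/")
        args["--TempDir"] = temp_path

    settings = {
        "Role": props["iam_role"],  # required, no default
        "GlueVersion": props.get("glue_version", "3.0"),
        "WorkerType": props.get("worker_type", "G.1X"),
        "NumberOfWorkers": props.get("number_of_workers", 10),
        "MaxRetries": props.get("retries", 3),
        "Timeout": props.get("timeout_minutes", 2880),
        "DefaultArguments": args,
    }
    delay = props.get("delay_notification_minutes")
    if delay is not None:
        settings["NotificationProperty"] = {"NotifyDelayAfter": delay}
    return settings


if __name__ == "__main__":
    # Example role ARN and bucket are placeholders.
    print(glue_job_settings({
        "iam_role": "arn:aws:iam::123456789012:role/guzzle-glue",
        "temporary_path": "s3://my-bucket/guzzle/tmp/",
    }))
```

Validating the temporary path before submitting the job surfaces a malformed value (for example, one missing the trailing slash) early rather than at job run time.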

Interface for AWS Glue