AWS Glue
Guzzle supports compute environments on the AWS Cloud. When Guzzle is set up on AWS, you can use AWS Glue to execute your workloads. This article explains how to configure AWS Glue as a compute environment in Guzzle.
note
AWS Glue compute can be used only when Guzzle is deployed on an AWS EC2 instance.
IAM Role permissions for AWS Glue Job
The IAM role referenced in the Guzzle AWS Glue compute must have the following permissions:

Action | Resource | Description |
---|---|---|
s3:GetObject | arn:aws:s3:::<guzzle-shared-storage-bucket>/* | To read the Guzzle JAR files and the Guzzle AWS Glue job script |
secretsmanager:GetSecretValue | arn:aws:secretsmanager:<region>:<account>:secret:* | To read the Guzzle security passphrase and other secrets referenced in Guzzle configs |
glue:BatchStopJobRun | arn:aws:glue:<region>:<account>:job/* | To cancel an AWS Glue job while it is running |
A few other permissions must also be assigned so that the AWS Glue job can access the Glue Data Catalog, read data from storage, write logs to AWS CloudWatch, and so on. For more information, refer to the AWS Glue documentation.
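As an illustration, the three permissions in the table above can be expressed as an IAM policy document. The sketch below only builds and prints the JSON; the function name `build_glue_job_policy` and the bucket, region, and account values are placeholders introduced for this example, not names used by Guzzle. The additional permissions mentioned above (Glue Data Catalog, data storage, CloudWatch) are not included.

```python
import json


def build_glue_job_policy(bucket: str, region: str, account: str) -> dict:
    """Build an IAM policy document covering the three permissions listed above.

    This is a sketch only: it does not cover the extra permissions a Glue job
    typically needs for the Data Catalog, data storage, or CloudWatch logs.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Read Guzzle JAR files and the Guzzle AWS Glue job script
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/*",
            },
            {
                # Read the Guzzle security passphrase and other secrets
                "Effect": "Allow",
                "Action": "secretsmanager:GetSecretValue",
                "Resource": f"arn:aws:secretsmanager:{region}:{account}:secret:*",
            },
            {
                # Cancel a running AWS Glue job
                "Effect": "Allow",
                "Action": "glue:BatchStopJobRun",
                "Resource": f"arn:aws:glue:{region}:{account}:job/*",
            },
        ],
    }


if __name__ == "__main__":
    # Placeholder values; substitute your own bucket, region, and account ID.
    policy = build_glue_job_policy(
        "my-guzzle-shared-bucket", "us-east-1", "123456789012"
    )
    print(json.dumps(policy, indent=2))
```

The printed JSON could be attached to the role as an inline or managed policy; you may prefer to narrow the `secret:*` and `job/*` resources to specific secrets and jobs.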
Guzzle compute configuration properties for AWS Glue

Property | Description | Default Value | Required |
---|---|---|---|
Authentication type | Select the authentication type used to access the AWS Glue service for various actions | Service Role | Yes |
IAM role | Role assumed by the AWS Glue Job with permissions to access your data stores and AWS services | None | Yes |
Glue version | Select the AWS Glue runtime version | 3.0 | Yes |
Worker type | Select the type of predefined worker that is allocated when a job runs | G 1X | Yes |
Automatically scale | AWS Glue will optimize costs and resource usage by dynamically scaling the number of workers up and down throughout the job run | False | Yes |
Requested number of workers / Maximum number of workers | The number of workers you want AWS Glue to allocate to this job | 10 | Yes |
Generate job insights | AWS Glue will analyze your job runs and provide insights on how to optimize your jobs and the reasons for job failures | True | No |
Number of retries | Number of times to retry the job if it fails | 3 | No |
Job timeout (minutes) | Maximum execution time for the job, in minutes. The default is 2,880 minutes (48 hours) for a Glue ETL job | 2880 | No |
Job metrics | Enable the creation of CloudWatch metrics when this job runs | True | No |
Continuous logging | Enable logs in CloudWatch | True | No |
Spark UI | Enable using Spark UI for monitoring this job | True | No |
Spark UI logs path | Amazon S3 path where the Spark UI event logs are written | None | No |
Temporary path | Working directory. Path must be in the form s3://bucket/prefix/path/. It must end with a slash (/) and not include any files | None | No |
Delay notification threshold (minutes) | Set a delay threshold in minutes. If the job runs longer than the specified time, Glue sends a delay notification via CloudWatch | None | No |
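To make the table more concrete, the sketch below shows how these properties could correspond to the parameters of the AWS Glue `CreateJob` API (as exposed by boto3's `create_job`) and to Glue's special job parameters such as `--TempDir` and `--enable-spark-ui`. This is an assumption about the mapping for illustration, not a description of how Guzzle submits jobs internally. The function names `build_glue_job_args` and `is_valid_temp_path` are hypothetical, and the path check only enforces the `s3://` prefix, a bucket name, and the trailing slash. The returned dictionary is partial: required `CreateJob` fields such as `Name` and `Command` are left out.

```python
from typing import Optional


def is_valid_temp_path(path: str) -> bool:
    """Check that a path looks like s3://bucket/prefix/ and ends with a slash."""
    if not path.startswith("s3://") or not path.endswith("/"):
        return False
    bucket = path[len("s3://"):].split("/", 1)[0]
    return bool(bucket)


def build_glue_job_args(
    role_arn: str,
    glue_version: str = "3.0",
    worker_type: str = "G.1X",  # API identifier for the "G 1X" option
    auto_scale: bool = False,
    workers: int = 10,
    job_insights: bool = True,
    retries: int = 3,
    timeout_minutes: int = 2880,
    job_metrics: bool = True,
    continuous_logging: bool = True,
    spark_ui: bool = True,
    spark_ui_logs_path: Optional[str] = None,
    temp_path: Optional[str] = None,
    delay_threshold_minutes: Optional[int] = None,
) -> dict:
    """Map the table's properties to partial CreateJob keyword arguments."""
    if temp_path is not None and not is_valid_temp_path(temp_path):
        raise ValueError("Temporary path must look like s3://bucket/prefix/path/")

    # Glue special parameters are passed as string key/value pairs.
    default_args = {}
    if auto_scale:
        default_args["--enable-auto-scaling"] = "true"
    if job_insights:
        default_args["--enable-job-insights"] = "true"
    if job_metrics:
        default_args["--enable-metrics"] = "true"
    if continuous_logging:
        default_args["--enable-continuous-cloudwatch-log"] = "true"
    if spark_ui:
        default_args["--enable-spark-ui"] = "true"
        if spark_ui_logs_path is not None:
            default_args["--spark-event-logs-path"] = spark_ui_logs_path
    if temp_path is not None:
        default_args["--TempDir"] = temp_path

    args = {
        "Role": role_arn,
        "GlueVersion": glue_version,
        "WorkerType": worker_type,
        # With auto scaling enabled, this acts as the maximum number of workers.
        "NumberOfWorkers": workers,
        "MaxRetries": retries,
        "Timeout": timeout_minutes,
        "DefaultArguments": default_args,
    }
    if delay_threshold_minutes is not None:
        args["NotificationProperty"] = {"NotifyDelayAfter": delay_threshold_minutes}
    return args


if __name__ == "__main__":
    # Placeholder role ARN and bucket; replace with your own values.
    example = build_glue_job_args(
        "arn:aws:iam::123456789012:role/example-glue-role",
        temp_path="s3://example-bucket/guzzle/tmp/",
    )
    print(example)
```

The function defaults mirror the Default Value column, so calling it with only a role ARN reproduces the table's default configuration.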