We have removed Livy compute, the Quantity Resource option from Schedule, and other dead code from the system. This clean-up of redundant code improves code quality and makes the product easier to maintain.
Added APIs for exporting and importing Guzzle config packages, allowing users to manage configurations more efficiently.
AWS Glue now supports the Delta format using the datalake-formats approach, replacing the previous delta-connector based approach. This change applies to AWS Glue compute and improves data handling.
When preparing table dependencies, we have updated the symbol used to separate datastore, database, and table names from an underscore (_) to a dot (.), making it more consistent and intuitive.
Job execution flow has been enhanced in various services, including Databricks, Synapse, Glue, EMR EC2, and Serverless. We've introduced retry support for failed-to-submit jobs, ensuring a more robust and reliable processing experience. Additionally, users can now set timeout settings for Synapse and EMR compute in the manage section, allowing greater control over job execution.
We have eliminated dead code from the system, resulting in improved overall code cleanliness and efficiency. This optimization contributes to better performance and maintainability of the product.
Updated the internal parameter UI to include an override button, allowing users to explicitly override internal parameters.
Improved the behavior of internal parameters. If a user provides an unsupported internal parameter, the job execution will now fail with an invalid parameters error.
Enhanced the rerun pipeline and activity behavior. Now, when rerunning, a run dialog box is displayed on the monitor screen with pre-populated parameters for easier execution.
Removed dead code from the system to improve overall code cleanliness and efficiency.
Resolved an issue where read operations failed when the source file list was being read while delete operations ran concurrently. As part of the fix, the cached file list has been removed.
Added support for EMR EC2 compute config editor UI. Users can now utilize the EMR EC2 compute config editor to modify and customize their EMR EC2 compute configurations.
Added support for EMR Serverless compute config editor UI. Users can now utilize the EMR Serverless compute config editor to modify and customize their EMR Serverless compute configurations.
Resolved synchronization issue with the "Not Started" activity status. Fixed the problem where Guzzle was unable to update the "Not Started" activity and pipeline status to "Abort" when the pipeline configuration file was missing.
Fixed a stack overflow that occurred when the source contained a large number of columns, and improved performance in that scenario.
Fixed processing activity validation issue for governance entity name field
Fixed a Purview integration issue in the processing activity where the source entity was overwritten by the target entity when using the Spark engine
The processing merge operation behavior has been modified so that when a user specifies a merge column, it will be utilized during the insertion of new records. Previously, all columns were used for inserting new records.
Handle special characters in table and column names for Azure Synapse Native, Snowflake, and Redshift datastores
Added Purview integration support for Hive/Delta, JDBC, Azure SQL, Azure Synapse Analytics, and Azure Synapse Analytics Native datastores in the ingestion and processing modules
Resolved an issue where the Guzzle API stopped prematurely due to the stop action on a pipeline or batch. The termination process now handles this situation more gracefully.
Fixed an issue where the heartbeat thread stopped updating the heartbeat while the status sync thread continued to update the batch_control table status, leaving job_info and batch_control in an inconsistent state.
Added support for manual DAG pipelines. Users can define the pipeline execution flow based on activity execution status. Three transition types are supported: success_warning, failed_aborted, and completed. When an activity succeeds, all activities that depend on it through a "success_warning" or "completed" transition are executed; when an activity fails, its "failed_aborted" and "completed" dependents are executed.
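As a rough illustration, a manual DAG pipeline could be sketched like the following. The field names (activities, depends_on, transition) are assumptions for illustration and may differ from the actual Guzzle pipeline schema; only the three transition values come from the note above.

```yaml
# Hypothetical pipeline config illustrating manual DAG transitions.
# Field names (activities, depends_on, transition) are assumed, not the documented schema.
pipeline:
  activities:
    - name: load_orders
    - name: build_report
      depends_on:
        - activity: load_orders
          transition: success_warning   # run only when load_orders succeeds (or warns)
    - name: send_failure_alert
      depends_on:
        - activity: load_orders
          transition: failed_aborted    # run only when load_orders fails or is aborted
    - name: cleanup
      depends_on:
        - activity: load_orders
          transition: completed         # run once load_orders completes, regardless of outcome
```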
Improved the parsing and resolution of job parameters in the Guzzle activity. Previously, if this process failed, the job would remain in the "Not Started" state. This issue has been fixed, and now the system marks the job as failed and terminates the execution gracefully.
Introduced a new credential type option in the Azure Blob/ADLS Gen2 datastore. It uses service principal credentials to read and write data in Azure Synapse Spark compute.
UI Fix: Deleting a truncate partition column entry in the processing activity UI was removing multiple entries with the same name. Fixed; now only the selected partition entry is deleted.
The Azure Synapse Analytics connector failed to perform operations with the latest Azure Databricks DBR because of the database property we were passing to the connector; the latest connector requires the database name as part of the JDBC URL. Fixed; the database property is no longer passed as a separate attribute, and users must include it in the database URL.
Fixed an issue where the copy data tool was unable to load database metadata.
Fixed a backslash issue in Synapse parameter values; Synapse requires a double backslash compared to other computes.
Re-engineered the constraint check module. Previously it loaded source data into memory and performed SQL validation on the in-memory data; now it generates a SQL query and executes it on the source, so source data is no longer loaded into memory.
Added support in the External activity to call stored procedures for Azure SQL, Synapse, Snowflake, and Redshift
Fixed a memory issue in Databricks cluster API calls: calling the Databricks API multiple times with invalid credentials caused heap memory issues. Invalid API requests are now handled more gracefully.
Reviewed and handled API responses and SDK client connections more gracefully
Added REST datastore support in the External activity. Users can perform REST API calls using the External activity; the activity is marked successful for 2xx HTTP response codes and failed for any other HTTP status code.
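A hedged sketch of how such an activity might be configured; the property names below (type, rest, datastore, endpoint, method) are illustrative assumptions rather than the exact Guzzle schema.

```yaml
# Hypothetical External activity using a REST datastore.
# Field names are assumptions for illustration only.
activity:
  type: external
  rest:
    datastore: my_rest_datastore      # assumed: a configured Rest datastore
    endpoint: /api/v1/refresh-cache   # assumed: path appended to the datastore base URL
    method: POST
# Behaviour per the release note: a 2xx HTTP response marks the activity successful;
# any other status code marks it failed.
```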
Fixed an issue where the cluster remained in the execution state when the Guzzle activity was unable to send logs to the Guzzle API. The cluster is now terminated when execution completes, even if the logs were not sent to the Guzzle API.
Guzzle returned a 400 response code when a logs request contained non-UTF-8 characters. Fixed; logs requests with non-UTF-8 characters are now processed.
While resuming a batch, ADF and Synapse pipeline External activities were not skipped even if they had executed successfully. Fixed; successfully run External activities are now skipped.
Added a new timeout setting for NOT_STARTED activities. If an activity has not started execution within 15 minutes of submission, it is marked as ABORTED. The timeout can be changed under Manage -> Environment Config -> Timeout and Sync -> Job Heartbeat Configuration -> Not started job timeout.
Added Hudi support in ingestion activity for AWS Glue compute
Added dynamic activity support in pipelines. Using this feature, users can specify a datastore, SQL, and an activity configuration in the pipeline. When the pipeline executes, it prepares an activity list based on the SQL results, adding one or more activities for each row. To pass the activity name and parameters from the query result, users can add placeholders with the column name, such as #{column_name}. For now, this feature is only supported for JDBC, Azure SQL, Azure Synapse Analytics, Redshift, and Snowflake datastores.
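A minimal sketch of the idea, assuming illustrative property names (dynamic_activity, datastore, sql, parameters); only the #{column_name} placeholder syntax is taken from the release note.

```yaml
# Hypothetical dynamic activity block inside a pipeline.
# Property names are assumptions; the #{column_name} placeholder syntax is from the release note.
dynamic_activity:
  datastore: metadata_db                     # assumed JDBC/Azure SQL/Synapse/Redshift/Snowflake datastore
  sql: "SELECT activity_name, table_name FROM etl_control WHERE enabled = 1"
  activity: "#{activity_name}"               # activity name taken from each result row
  parameters:
    source_table: "#{table_name}"            # parameter value taken from each result row
# At run time, one or more activities are added to the pipeline for every row returned by the SQL.
```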
Added operation and affected row count details to the message for Processing template-based activities
Added Redshift datastore support for AWS and Azure deployment
Added Snowflake datastore support for AWS computes. Users can place Snowflake-related JAR files in the /guzzle/libs/<custom_directory_name> directory and configure the relative path in the additional jars configuration of the compute. Azure Databricks and AWS Databricks have the Snowflake JARs pre-configured, so no extra JARs are needed there.
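For example, a compute definition might reference the JARs roughly as below. The property name additional_jars and the file names are assumptions; only the /guzzle/libs placement and relative-path idea come from the note above.

```yaml
# Hypothetical compute config referencing Snowflake JARs placed under /guzzle/libs/<custom_directory_name>.
# The additional_jars property name and JAR file names are assumptions.
compute:
  name: emr-serverless-prod
  additional_jars:
    - snowflake/snowflake-jdbc-<version>.jar          # relative path under /guzzle/libs
    - snowflake/spark-snowflake_2.12-<version>.jar
# Not needed on Azure Databricks or AWS Databricks, where Snowflake JARs are pre-configured.
```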
Fixed a table name issue where Guzzle concatenated the database and table names, which caused problems for some JDBC sources. Guzzle no longer concatenates the database and table names.
Fixed a Delta partition column issue for Glue compute; previously it was unable to fetch partition column details.
Added Spark override options support for Azure Synapse, AWS Glue, EMR Serverless, and EMR EC2 computes. Users can override Spark options at runtime, at the pipeline level, or at the activity level within a pipeline.
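A minimal sketch of an activity-level override inside a pipeline, assuming spark_override / conf as the property names (they are illustrative, not the documented schema); the Spark option keys themselves are standard Spark settings.

```yaml
# Hypothetical Spark override at the activity level of a pipeline.
# Property names (spark_override, conf) are assumptions for illustration.
pipeline:
  activities:
    - name: heavy_aggregation
      spark_override:
        conf:
          spark.executor.memory: 8g
          spark.sql.shuffle.partitions: "400"
# A similar block could equally be supplied at runtime or at the pipeline level.
```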
Added support for Azure Synapse pipelines. Users can run an Azure Synapse pipeline using the External activity and an Azure Data Factory/Azure Synapse Workspace datastore.
Added Delta support for AWS EMR on Serverless compute
Handled nullable constraint check behaviour for the Azure Synapse Native datastore. Previously the activity failed if the source (dataframe) and target columns had different null constraints; this is fixed by updating the source (dataframe) column's nullable property to match the target table column.
Added support for overriding AWS Databricks compute options at the runtime, pipeline, and activity level.
Added Delta technology support for AWS Glue compute
JDBC, Snowflake, Azure SQL, and Azure Synapse Analytics Processing activities now run as part of the API and do not require Spark compute for execution.
If users plan to use other JDBC drivers that are not bundled with Guzzle, they must place them in the ${GUZZLE_PRODUCT_HOME}/api/libs directory as well as in ${GUZZLE_PRODUCT_HOME}/libs
Databricks multitask pipelines are submitted to Databricks compute, so the above-mentioned activities are not supported in Databricks multitask pipelines
Added support for Delta paths in all modules, so users can pass a Delta path in the table name field, such as delta./user/hive/warehouse/tablename, or an external storage path, such as delta.abfss://container@azurestorageacc.dfs.core.windows.net/databricks/tablename
Use the schema name when deriving columns from the table in the processing module
Fixed an issue where the pipeline was kept in the running state. With auto-dependency, if a parent activity fails, dependent child activities should be skipped and the pipeline terminated; however, in some cases the pipeline stayed in the running state when different types of activities with dependencies on each other were configured. Fixed; dependent child activities are now skipped and the pipeline is terminated when a parent activity fails.
The pipeline was triggering an activity re-run when the activity was manually terminated by the user from the UI. Fixed; the activity is now terminated and no re-run is triggered when it is manually terminated by the user.
Fixed an auto-create secondary table issue in the housekeeping job; it was failing due to unsupported SQL syntax
UI Fixes: Fixed a false alert when navigating away from the API screen without making changes, fixed the compute UI not showing the cluster list option when the key vault or secret is incorrect, and updated the Azure Synapse Native connector datastore and ingestion section labels to make them consistent with other labels
Replaced the two operations Truncate Table and Insert Into with the single atomic operation Insert Overwrite in the processing module. Click here to learn about the behaviour changes.
Added AWS Glue compute support to run Guzzle activities
Added a job_instance_id column to the constraint_check_summary and constraint_check_detail tables. Using this column, users can distinguish data by job and trace it back to the job run.
Added housekeeping support for job audit, job logs, and service logs. Job audit covers the job_info and job_info_param tables. An index is also created on the parent_job_instance_id column of the job_info table to improve housekeeping performance.
Updated data type of job_config, source_columns and sample_data columns in data_sampling_job table to increase the column storage capacity.
In ingestion, optimized and revised the data type validation section. For the validate data type rules, check out the sheet ingestion_validate_datatype_rules.xlsx.
In ingestion, removed the strict schema check feature.
In ingestion, revised the three existing schema derivation strategies and added two new ones. For more details, check out this documentation.
In ingestion, the default value of the validate data type checkbox is now false in YAML; when the YAML property for this checkbox is missing, Guzzle interprets it as unchecked.
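Illustratively, the ingestion YAML might carry a flag along these lines; the property name validate_data_type and the surrounding structure are assumptions, only the default/missing-property behaviour comes from the release note.

```yaml
# Hypothetical ingestion snippet: the validate data type checkbox as a YAML property.
# The property name is an assumption; only the default behaviour is from the release note.
ingestion:
  source:
    validate_data_type: false   # default; omitting the property is treated the same as false/unchecked
```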
UI Fixes: Fixed a business date selection issue in the monitor screen filter; fixed a duration field issue in the monitor screen, added support for years, months, and days units, and now only relevant duration units are shown instead of all units
The batch init catchup parameter was marking successfully run jobs as ABORTED for the given business date. Fixed; now only OPEN and FAILED batches are marked as ABORTED
When sampling an ingestion job, timestamp column values were shown as epoch seconds. Fixed; formatted values are now shown for timestamp columns
Added Databricks multitask job support as a Guzzle multitask pipeline
Guzzle previously supported retry for FAILED and WARNING activities in the pipeline. Retry is now supported for FAILED activities only
For each retry, Guzzle was creating a new job info record. It now reuses the same job info record for retries
In a Guzzle pipeline with auto dependency enabled, when an activity fails, only its dependent activities are stopped and the other activities continue their execution. Previously, execution of all pending activities stopped as soon as the first activity failed
Encrypted JWT token secret value while showing it in UI and storing it in a config file
Added infer schema support for XML and JSON file source in ingestion activity
UI Fixes: Fixed the active tab becoming hidden when too many tabs are open in the UI, fixed an extra space issue in the author config tab bar, updated the layout for the REST datastore, and removed duplicate tooltips from select components
Added retry support when the key vault secret API fails to fetch a secret value. A maximum of 5 retries are performed at 5-second intervals
The override Spark settings option in the pipeline was loading Spark configuration data using the system default compute configured inside guzzle.yml. It now loads Spark configuration data using the user's default compute, configured via the My Profile -> Default Compute option.
Support for Synapse Spark pool as a compute in the backend
Batches, pipelines, and external activities now run as Guzzle API threads instead of separate JVM processes
When parameters were passed to an external process, they could be evaluated and cause side effects. Fixed; parameters are now passed in single quotes and $ is escaped with a backslash to prevent evaluation
The batch stage was not marked as ABORTED when batch execution terminated abnormally. Fixed; the stage is now marked as ABORTED when batch execution terminates abnormally
If malformed YAML configs are present in a Guzzle deployment, then whenever Guzzle reads such a config it fails and keeps retrying until the retries are exhausted (currently 20 seconds). As a result, affected operations respond slowly. Examples of affected operations: login, page reload, git pull, and creating a new branch when git is enabled, as well as the background refresh of configs in the API.
Fixed issues when creating a datastore from the activity editor: the connection was created with the name undefined.yml, and if an existing connection had the same name as the datastore, the new connection name was not generated properly
Fixed an issue on the timeout and sync page when the Guzzle API is down
UI Fixes: Fixed the Table Dependency component sharing the same table name while switching tabs, fixed the SingleParametersInput component sharing the same state, error messages are now shown on the login screen for an invalid JWT key vault config, and fixed issues on the JWT settings page
Fixed repository database page issues: the cancel button is now disabled when any action is in progress, and the driver class input is optional
Fixed the ADLS Gen2 datastore editor when switching between the service principal and access key credential types
Support for the Spark engine with Delta technology in the processing activity
Keep generated access token expiry time as 90 days by default
Added support for user parameters, batch_id, stage_id, and environment parameters in pipeline resume (these parameters are supported when resuming)
Breaking change: Removed prev_business_ts. As an alternative, pass it explicitly when running stages; in a pipeline it can be deduced by passing business date - 1 minute to the child jobs
Added default user compute support in profile page
Added git commit message support in the Guzzle copy data tool
Added Azure Key Vault support in git integration to specify the client secret
Added test connection support in the Azure SQL and Azure Synapse datastores
UI Changes: links to documentation, tutorial, and resources; changelog link; renaming of form labels in Ingestion; clean-up of the Hive/Delta datastore config; updated icons; copy tool shortcut on the landing page; changelog link on the upgrade page; removed Quantity Resource support in the Admin UI
Showing configuration dependencies while performing rename operation
Updated the behavior of source generated columns, which were previously cascaded to the target automatically; they now need to be specified in the schema section to be added to the target