# Processing
Processing in Guzzle transforms data that has already been ingested into a data lake (or other target platform). This activity handles complex processing logic by implementing all data transformation rules.
Data Processing is the module that supports intra-ETL from staging to foundation and finally to analytics data tables. "Transforming data from one form to another" is the key purpose of the Data Processing module. Guzzle's processing module can read data from Hive, SQL Server, Azure Synapse, and other RDBMS/JDBC sources.
> **Note:** Processing is meant for data movement between homogeneous sources and targets; it cannot read data from source tables maintained in one technology and write it into target tables maintained in another.
## Source Section of the Processing Module

| Property | Description | Default Value | Required |
|---|---|---|---|
| Table + Filter | Specify the table to read data from; the table name can include a schema or database prefix. When Table is selected, Guzzle fetches all data from the specified table. You can also specify a Filter with record-level conditions to apply when querying the source. | None | Yes |
| SQL | Specify the SQL query to run as-is on the source. | None | Yes |
| Incremental | Controls the behaviour of the framework-generated w_current_record_flag column when the target is configured for a merge operation. By default (unchecked), the source is treated as a full dump: all target records are first marked with w_current_record_flag = 'N' before the merge, then the flag is set back to 'Y' for primary keys present in the source, while keys absent from the source remain 'N'. When checked, the source is treated as incremental, and target records are not first marked 'N'. | Unchecked | Yes |
| Configure Table Dependency | Manually specify source table dependencies and configure Apache Atlas lineage. | None | No |
| Additional Columns | Specify any additional framework-generated or derived columns to map to the target table by entering the Column Name, Value, and Framework Generated Column. Guzzle provides the following framework-generated columns: w_created_business_ts, w_created_ts, w_refresh_ts, w_current_record_flag, w_eff_start_date_ts, w_eff_end_date_ts, w_sequence_key, w_version_key. | None | No |
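
To make the Table + Filter and SQL read styles concrete, here is a minimal PySpark sketch of the two; the table name `staging.customer` and the filter expression are illustrative examples, not Guzzle's actual internals.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("source-read-sketch").getOrCreate()

# Table + Filter: read the specified table, then apply the record-level filter.
# "staging.customer" and the filter condition are assumed names for illustration.
table_df = spark.table("staging.customer").filter("load_date = '2024-01-01'")

# SQL: the query is run as-is on the source.
sql_df = spark.sql("""
    SELECT Id, name, gender
    FROM staging.customer
    WHERE load_date = '2024-01-01'
""")
```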
## Target Section of the Processing Module

| Property | Description | Default Value | Required |
|---|---|---|---|
| Table | Specify the target table to create or load based on the source. The table name can be parameterized with values passed at runtime, and may be prefixed with a database / schema name. | None | Yes |
| Operation | Select the operation to perform on the target table from the drop-down: append, overwrite, merge, effective date merge, update only, or delete. | Append | Yes |
| Template | Select the Guzzle template to use for target operations from the drop-down, such as delta, default, SQL Server, or delta_v2. | Unchecked | Yes |
| Primary Keys | Specify the list of primary key columns from the target table that uniquely identify a record, if the target operation is merge, update only, effective date merge, or delete. | None | No |
| Merge Columns | Specify the columns to update in the target table, if the target operation is merge or update only. If the target operation is effective date merge, specify the list of history columns used to maintain history in SCD Type 2 targets. Use this option in conjunction with the framework columns w_eff_start_date_ts, w_eff_end_date_ts, w_sequence_key, and w_version_key available in the Additional Columns source section by mapping them to the target table. | None | No |
| Soft Delete | Specify whether to perform a soft delete on the target table, where a flag column maintained in the target identifies whether a record is active or deleted. Use this option in conjunction with the framework column w_current_record_flag available in the Additional Columns source section by mapping it to the target table. | None | No |
| Truncate Partition Columns | Specify target table partition columns and their corresponding values to truncate those partitions before the target load. Truncate partition column values can also be passed as parameters. | None | No |
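
As a rough illustration of how Primary Keys and Merge Columns interact in a merge operation, the sketch below uses the Delta Lake merge API, assuming a Delta target named `analytics.customer` with primary key `Id` and merge column `gender`. The query Guzzle actually generates depends on the selected template; this only shows the shape of the operation.

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()

# Illustrative names: an assumed Delta target and a staging source.
target = DeltaTable.forName(spark, "analytics.customer")
source_df = spark.table("staging.customer")

# Match on the primary key; update only the configured merge column,
# and insert source rows whose key is not present in the target.
(target.alias("t")
    .merge(source_df.alias("s"), "t.Id = s.Id")
    .whenMatchedUpdate(set={"gender": "s.gender"})
    .whenNotMatchedInsertAll()
    .execute())
```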
## Processing Execution

The Processing module supports two types of execution engines:
- Spark
- Template
### Spark

- In the Spark engine, Guzzle loads the required source and target data into Spark compute, performs the operations, and writes the processed data to the target table.
- There are a few changes in this behaviour from Guzzle v2.4.0 onwards.
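
A simplified sketch of the Spark engine flow follows, with assumed table names `staging.orders` and `foundation.orders`; it is not Guzzle's actual code, only the load-operate-write shape described above.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# 1. Load the required source data into Spark compute
#    (for merge-style operations, target data is loaded too).
source_df = spark.table("staging.orders")  # assumed source table

# 2. Perform operations in Spark, e.g. derive a framework-style audit column.
processed_df = source_df.withColumn("w_created_ts", F.current_timestamp())

# 3. Write the processed data to the target table.
processed_df.write.mode("append").saveAsTable("foundation.orders")
```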
### Template

- In the Template engine, Guzzle prepares an operation query that is executed on the source technology itself; it does not load source and target data into compute memory.
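
For contrast with the Spark engine, the sketch below shows the kind of statement the Template engine might push down to the database. The exact SQL depends on the selected template (delta, default, SQL Server, delta_v2), and all names here are illustrative.

```python
# A sketch of a pushed-down merge statement; table and column names are
# assumed for illustration, and the real generated SQL is template-specific.
merge_sql = """
MERGE INTO analytics.customer AS t
USING staging.customer AS s
    ON t.Id = s.Id
WHEN MATCHED THEN UPDATE SET t.gender = s.gender
WHEN NOT MATCHED THEN INSERT (Id, name, gender) VALUES (s.Id, s.name, s.gender)
"""
# The statement runs on the source/target technology itself, so the data
# never has to be loaded into Spark compute memory.
```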
## Example of the Processing Module

Source table:

| Id | name | gender |
|---|---|---|
| 1 | ABC | male |
| 2 | XYZ | |

Target table (before merge):

| Id | name | gender |
|---|---|---|
| 0 | AWS | female |
| 2 | XYZ | male |

Operation: merge

Merge columns: gender

Target table (after merge):

| Id | name | gender |
|---|---|---|
| 0 | AWS | female |
| 1 | ABC | male |
| 2 | XYZ | male |
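
The result can be traced in plain Python. This sketch assumes `Id` is the primary key and that an empty source value does not overwrite the target's merge column, which matches the example output above; it illustrates the merge semantics and is not Guzzle code.

```python
# Example data from the tables above; None stands for the empty gender.
source = [(1, "ABC", "male"), (2, "XYZ", None)]
target = {0: ("AWS", "female"), 2: ("XYZ", "male")}

for key, name, gender in source:
    if key in target:
        old_name, old_gender = target[key]
        # Merge column is gender: overwrite it only when the source supplies
        # a value (assumed behaviour, consistent with the example output).
        target[key] = (old_name, gender if gender is not None else old_gender)
    else:
        # Key not present in target: insert the full source row.
        target[key] = (name, gender)

for key in sorted(target):
    print(key, *target[key])
# 0 AWS female
# 1 ABC male
# 2 XYZ male
```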