Processing
# Introduction

- The Processing module in Guzzle is used to process data within the same datastore. It supports processing data via the Spark engine or a template.
- In the Processing module, the source and target datastore technology must be the same. For example, we cannot choose Delta as the source and Hive as the target in Guzzle.
- It provides built-in framework columns, such as effective start date and effective end date, to handle SCD Type 2 efficiently.
- It supports common operations such as append, overwrite, and merge.
- For the source, we need to choose the datastore technology from which to bring data for processing.
- The Processing module supports both Source SQL and Source table.
- The Processing module provides Pre SQL and Post SQL hooks, which execute SQL statements before and after the read or write operation.
- It provides a feature for overriding table dependencies.
- Guzzle provides auto-create table functionality in the target section: if the target table is not present, Guzzle will create it. Only Hive, Delta, and Hudi are supported for auto-create in the Processing module.
# Spark Engine vs Template

Spark | Template |
---|---|
Spark brings data from the source into the cluster, applies the required transformations via the Spark DataFrame API, and writes the transformed result. | The template compiles and runs directly on the datastore technology and sends the response back to Guzzle. |
Spark uses connector libraries to connect to the datastore. | The template uses a JDBC connection to connect to data sources. |
We cannot modify the processing behavior of the Spark engine. | Templates are flexible, so users can customize existing templates or add new ones. |
The Spark engine processes data in the Spark cluster. | The template executes queries in the underlying datastore via the JDBC connection; Guzzle compute is not involved. |
# Default template for each datastore

Datastore | Spark | Template |
---|---|---|
Delta | Yes | Yes: Delta |
Hive | Yes | No |
Synapse | No | Yes: Synapse |
Snowflake | No | Yes: Snowflake |
JDBC | No | Yes: Select as per your database |
Azure SQL | No | Yes: SQL Server |
Apache Hudi | Yes ** (Coming Soon) | Yes ** (Coming Soon) |
note
- For other datastore technologies, we can use the default template. For example, we can use the default template for MySQL and PostgreSQL.
# How different operations work in Guzzle

Operation | Description | Spark | Template |
---|---|---|---|
Append | - The append operation appends source data to the target table. - The truncate partition option is not available for append. | - Spark brings the source data into the cluster and writes it using Spark's append write mode. | - It uses an INSERT INTO <target_table> SELECT statement. - Only a single query is executed to append records to the target database. |
Overwrite | - The overwrite operation overwrites target table records with source records. | It uses a temp table to store the processed data. Hive - If a partition column is configured -> it uses an INSERT OVERWRITE statement to overwrite only those partitions for which the source contains data; data in other partitions remains intact. - If a partition column is not configured -> it overwrites the full target table. Delta - If a partition column is configured -> it uses an INSERT OVERWRITE statement to overwrite data in that partition. - If a partition column is not configured -> it truncates the full target table and inserts the source data. | - The Synapse, Snowflake, and SQL Server templates use two operations: truncate the target table, then load data with an INSERT INTO <target_table> SELECT statement. - The Delta template uses an INSERT OVERWRITE <target_table> SELECT statement to replace the data in the target table. |
Merge | Incremental - If a source record is present in the target table, that target record is updated. - If a source record is not present in the target table, it is inserted. - Target records that are not present in the source remain unchanged. Full Load - Matching and inserting behave the same as incremental. - Target records that are not present in the source are marked as inactive. | - Spark brings the source and target data into memory. - It performs the merge operation and writes the result to a temp table. - It truncates the target table (or partition) and inserts the data from the temp table into the target table. | - The template uses a MERGE statement to perform both the incremental and the full merge operation. |
Effective Date Merge | Incremental - If a source record is present in the target table and a history column value has changed, the existing record is closed by setting the effective end date and setting the current record flag to N. - A new record with the updated values is then inserted, with the effective start date set and the current record flag set to Y. - If a source record is not present in the target table, it is inserted. - Target records that are not present in the source remain unchanged. Full - Matching and inserting behave the same as incremental. - Target records that are not present in the source are marked as inactive using the effective end date and current record flag. | - Spark brings the source and target data into memory, performs the merge operation, and writes the result to a temp table. - It truncates the target table (or partition) and inserts the data from the temp table into the target table. | The template uses a MERGE statement to perform both the incremental and the full merge operation. |
Update only | - Guzzle updates the target table records that are present in the source. It does not add any new records. | - Spark brings the source and target data into memory, performs the update operation, and writes the result to a temp table. - It truncates the target table (or partition) and inserts the data from the temp table into the target table. | The template uses a MERGE operation to update the target records from the source. |
Delete (Hard) | - The delete operation deletes all target records that are present in the source. | - Spark brings the source and target data into memory, performs the delete operation, and writes the result to a temp table. - It truncates the target table (or partition) and inserts the data from the temp table into the target table. | The template uses a MERGE operation to delete the records that are present in the source. |
Delete (Soft) | - It marks the current record flag as inactive (N). It does not delete any records. | Same as Delete (Hard). | Same as Delete (Hard). |
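As a rough, engine-agnostic illustration of the merge semantics above, the following Python sketch uses an in-memory SQLite table to contrast incremental merge with full load. This is a hypothetical demonstration with made-up data, not Guzzle's actual implementation (Guzzle uses Spark or a MERGE template).

```python
import sqlite3

# Illustrative target table with a "current" flag (hypothetical data).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE target (id INTEGER PRIMARY KEY, name TEXT, active TEXT)")
con.executemany("INSERT INTO target VALUES (?, ?, ?)",
                [(1, "alice", "Y"), (2, "bob", "Y")])
source = [(2, "bobby"), (3, "carol")]

# Incremental merge: update matching records, insert new ones,
# and leave target records that are missing from the source untouched.
for pk, name in source:
    con.execute(
        "INSERT INTO target (id, name, active) VALUES (?, ?, 'Y') "
        "ON CONFLICT(id) DO UPDATE SET name = excluded.name",
        (pk, name))

# Full load additionally marks target records missing from the source
# (here, only id 1) as inactive.
con.execute("UPDATE target SET active = 'N' WHERE id NOT IN (2, 3)")

print(con.execute("SELECT id, name, active FROM target ORDER BY id").fetchall())
# -> [(1, 'alice', 'N'), (2, 'bobby', 'Y'), (3, 'carol', 'Y')]
```

With the incremental flag on, the final `UPDATE` step would simply be skipped and id 1 would stay active.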
# Framework Columns

Framework column | Default Value | Data Type | Description | Append | Overwrite | Merge | Effective Date Merge | Update only | Delete (soft) | Delete (hard) |
---|---|---|---|---|---|---|---|---|---|---|
w_created_ts | - By default, Guzzle uses the current timestamp as the w_created_ts value. - If the user sets a custom value in the UI, Guzzle uses that value instead. | Timestamp | Records the timestamp at which the record was created. | Yes | Yes | Yes | Yes | No | No | No |
w_refresh_ts | - By default, Guzzle uses the current timestamp as the w_refresh_ts value. - If the user sets a custom value in the UI, Guzzle uses that value instead. | Timestamp | Records the timestamp at which the record was last updated. | Yes | Yes | Yes | Yes | Yes | Yes | No |
w_current_record_flag | - By default, Guzzle sets w_current_record_flag to Y (yes). - If a target record is superseded in a merge operation, or the record is inactive according to w_eff_start_date_ts and w_eff_end_date_ts, Guzzle sets w_current_record_flag to N. - Custom values are not allowed for w_current_record_flag. | String | A flag that indicates whether a particular record is the latest version. | Yes | Yes | Yes | Yes | Yes | Yes | No |
w_eff_start_date_ts | - By default, Guzzle sets the effective start date based on the SCD Type 2 start date. - If the user sets a custom date in the UI, Guzzle uses that value instead. | Timestamp | Defines the effective start date of the dimension record in SCD Type 2. | No | No | No | Yes | No | No | No |
w_eff_end_date_ts | - By default, Guzzle sets the effective end date based on the SCD Type 2 end date. - If the user sets a custom date in the UI, Guzzle uses that value instead. | Timestamp | Defines the effective end date of the dimension record in SCD Type 2. | No | No | No | Yes | No | No | No |
w_sequence_key | - By default, Guzzle generates the w_sequence_key value based on its implemented logic. - Custom values are not allowed for w_sequence_key. | Bigint | A system-generated surrogate key. | No | No | Yes** | Yes** | No | No | No |
w_version_key | - By default, Guzzle generates the w_version_key value based on its implemented logic. - Custom values are not allowed for w_version_key. | Bigint | Defines the version of each dimension record by primary key. | No | No | No | Yes** | No | No | No |
note
- Yes** => Only applicable on the Spark engine.
- Check the behavior of each framework column with the different operations.
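To make the SCD Type 2 framework columns concrete, here is a simplified sketch in plain Python of what an effective date merge does with w_eff_start_date_ts, w_eff_end_date_ts, and w_current_record_flag. The data and in-memory processing are purely illustrative; Guzzle's real engine runs through Spark or a MERGE template.

```python
HIGH_DATE = "9999-12-31 00:00:00"  # open-ended effective end date

# One existing dimension record (hypothetical data).
target = [{"id": 1, "name": "alice",
           "w_eff_start_date_ts": "2023-01-01 00:00:00",
           "w_eff_end_date_ts": HIGH_DATE,
           "w_current_record_flag": "Y"}]
source = [{"id": 1, "name": "alicia"}]
now = "2024-01-01 00:00:00"

for src in source:
    current = next(r for r in target
                   if r["id"] == src["id"] and r["w_current_record_flag"] == "Y")
    if current["name"] != src["name"]:        # a history column value changed
        current["w_eff_end_date_ts"] = now    # close the existing version
        current["w_current_record_flag"] = "N"
        target.append({**src,                 # open the new current version
                       "w_eff_start_date_ts": now,
                       "w_eff_end_date_ts": HIGH_DATE,
                       "w_current_record_flag": "Y"})

print([(r["name"], r["w_current_record_flag"]) for r in target])
# -> [('alice', 'N'), ('alicia', 'Y')]
```

The old version stays in the table with flag N and a closed date range, while the new version carries flag Y and an open-ended end date.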
# Special Configurations

Configuration | Description | Append | Overwrite | Merge | Effective Date Merge | Update | Soft Delete | Hard Delete |
---|---|---|---|---|---|---|---|---|
Primary Key | Used to link source and target records. | No | No | Yes | Yes | Yes | Yes | Yes |
Incremental | Used when we want to update only those target records that are present in the source; the remaining existing records in the target remain unchanged. | No | No | Yes | Yes | No | No | No |
Merge Column | - Before merging a record, Guzzle checks whether the data has changed in any of the given merge columns; if it has, the target record is updated, otherwise it is left as is. - If merge columns are not specified in the configuration, by default Guzzle considers all columns to be merge columns. | No | No | Yes | No | Yes | No | No |
History Column | - If a history column's data has changed, a new record is created in the target table; otherwise no new record is created. - If history columns are not specified in the configuration, by default Guzzle considers all columns to be history columns. | No | No | No | Yes | No | No | No |
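The merge column check described above can be sketched as a small helper. This is a hypothetical function, not Guzzle code: it compares only the configured merge columns, falling back to all shared columns when none are configured, as the table documents.

```python
def needs_update(src_row, tgt_row, merge_cols=None):
    """Return True when the target record should be updated.

    If no merge columns are configured, all columns shared by the source
    and target rows are compared (the documented default)."""
    cols = merge_cols or (src_row.keys() & tgt_row.keys())
    return any(src_row[c] != tgt_row[c] for c in cols)

# Only `age` is a merge column, and it changed -> update.
print(needs_update({"name": "bob", "age": 30}, {"name": "bob", "age": 29}, ["age"]))   # -> True
# Only `name` is a merge column, and it is unchanged -> no update.
print(needs_update({"name": "bob", "age": 30}, {"name": "bob", "age": 29}, ["name"]))  # -> False
```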
# Determining the update columns

Source columns: CustomerId, CustomerName, Age, Country
Target columns: CustomerId, CustomerName, Age

The steps below are used to determine the final list of columns that are part of the update section.
1. User-defined merge columns
CustomerName, Age
2. Fetch the source columns
CustomerId, CustomerName, Age, Country
3. Fetch the target columns
CustomerId, CustomerName, Age
4. Intersect the source (2) and target (3) columns
CustomerId, CustomerName, Age
5. Remove the framework columns from (4)
CustomerId, CustomerName, Age (no framework columns present)
6. Remove the primary key columns from (5); call the result the data columns
CustomerName, Age
7. Finally, intersect the merge columns (1) with the data columns (6)
CustomerName, Age
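The seven steps above can be sketched as a small helper. This is illustrative only; the framework column names follow the Framework Columns table earlier on this page.

```python
# Framework columns as documented in the Framework Columns table.
FRAMEWORK_COLUMNS = {"w_created_ts", "w_refresh_ts", "w_current_record_flag",
                     "w_eff_start_date_ts", "w_eff_end_date_ts",
                     "w_sequence_key", "w_version_key"}

def resolve_update_columns(merge_cols, source_cols, target_cols, primary_keys):
    common = [c for c in source_cols if c in set(target_cols)]    # step 4
    common = [c for c in common if c not in FRAMEWORK_COLUMNS]    # step 5
    data_cols = [c for c in common if c not in set(primary_keys)] # step 6
    if not merge_cols:               # default: all data columns are merge columns
        return data_cols
    return [c for c in data_cols if c in set(merge_cols)]         # step 7

print(resolve_update_columns(
    merge_cols=["CustomerName", "Age"],                           # step 1
    source_cols=["CustomerId", "CustomerName", "Age", "Country"], # step 2
    target_cols=["CustomerId", "CustomerName", "Age"],            # step 3
    primary_keys=["CustomerId"]))
# -> ['CustomerName', 'Age']
```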
# How Pre SQL and Post SQL work

- Pre SQL and Post SQL are used to execute SQL statements before and after the execution of the Processing operation.
# Spark Engine Order of statement execution

Operation | Description |
---|---|
Source: pre_sql | When we submit a Processing job, Guzzle first executes the source Pre SQL. |
Persist source SQL into temp table1 | Guzzle persists the result of the source SQL into a temporary table. |
Target: pre_sql | Guzzle executes the Pre SQL for the target. |
Execute Processing operation query and insert data in temp table2 | Once Pre SQL has executed at both ends, Guzzle runs the actual Processing operation and stores its result in a second temporary table. |
Truncate target table (if applicable) | In certain Processing operations, Guzzle truncates the target table. For example, an overwrite operation truncates the target table before loading. |
Copy from temp table2 to final table | Once the processed data is ready, Guzzle copies it from the temporary table to the target table. |
Source: post_sql | After completion of the Processing operation, Guzzle executes the Post SQL statement on the source. |
Target: post_sql | After the source Post SQL, Guzzle executes the target Post SQL. |
# Template Order of statement execution

Operation | Description |
---|---|
Source: pre_sql | When we submit a Processing job, Guzzle first executes the source Pre SQL. |
Target: pre_sql | Guzzle executes the Pre SQL for the target. |
Execute Processing operation query | Once Pre SQL has executed at both ends, Guzzle runs the actual Processing operation. |
Source: post_sql | After completion of the Processing operation, Guzzle executes the Post SQL statement on the source. |
Target: post_sql | After the source Post SQL, Guzzle executes the target Post SQL. |
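The statement ordering documented above can be summarized with a small driver sketch. The function and parameter names are hypothetical; `run_sql` stands in for the JDBC (or Spark) call and Guzzle's internals are not shown.

```python
def run_processing_job(run_sql, source_pre, target_pre, operation_sql,
                       source_post, target_post):
    """Execute a Processing job's statements in the documented order."""
    order = []
    for label, stmt in [("source pre_sql", source_pre),
                        ("target pre_sql", target_pre),
                        ("operation", operation_sql),
                        ("source post_sql", source_post),
                        ("target post_sql", target_post)]:
        if stmt:                  # pre/post hooks are optional
            order.append(label)
            run_sql(stmt)
    return order

executed = []
order = run_processing_job(executed.append, "/* src pre */", "/* tgt pre */",
                           "MERGE INTO ...", "/* src post */", "/* tgt post */")
print(order)
# -> ['source pre_sql', 'target pre_sql', 'operation', 'source post_sql', 'target post_sql']
```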
# Examples

## Example 1

Configurations
- Operation: Merge
- Primary Key: id
- Merge Column: -
- Incremental: False
Source
Target
Result:
- The record with id 1 is not present in the source, so Guzzle marks it as inactive (record is not present).
## Example 2

Configurations
- Operation: Merge
- Primary Key: id
- Merge Column: -
- Incremental: True
Source
Target
Result:
- Because the incremental flag is true, Guzzle does not affect any target records that are not present in the source.
## Example 3

Configurations
- Operation: Effective Date Merge
- Primary Key: id
- Merge Column: -
- Incremental: False
Source
Target
Result:
- In the Effective Date Merge operation, the record with id 2 matches, so Guzzle creates a new record for id 2 and closes the old record: its end_date is set and is_record_present becomes N.
## Example 4

Configurations
- Operation: Effective Date Merge
- Primary Key: id
- Merge Column: -
- Incremental: True
- History column: No history column
Source
Target
Result:
- In the Effective Date Merge operation, the records with id 2 and 3 match. For id 2 the gender and birth_date column values have changed, so a new record is created; for id 3 no column value has changed, so no new record is created.
- Because the incremental flag is true, Guzzle does not affect any target records that are not present in the source.
## Example 5

Configurations
- Operation: Effective Date Merge
- Primary Key: id
- Merge Column: -
- Incremental: False
- History column: name, gender
Source
Target
Result:
- Here the history columns are name and gender, so a new record is created for each record with a change in the name or gender column; in this example, that is id 1 and 2. For id 3 there is a change in the birthdate and salary columns, but these are not history columns, so Guzzle does not create a new record for id 3; it updates the latest values in the existing target record instead.
- Because the incremental flag is false, Guzzle marks target records that are not present in the source as inactive using the is_record_present_flag framework column.
## Example 6

Configurations
- Operation: Update
- Primary Key: id
- Merge Column: name, salary, and updated_time.
Source
Target
Result:
- Guzzle updates the target table records that are present in the source. It does not add any new records.
# FAQs

- How does Guzzle behave when an external and an internal template have the same name?
Ans. Guzzle will show only one option in the UI; the external template takes priority.
- Where can an external template be placed?
Ans. /Guzzle/default/ext/Processing/template/{new_temp}.yml