Ingestion Data Validation
#
Overview- Data Validation featured used to filter or capture noisy data.
- Ingestion support below data validation
- Data type validation
- Uniqueness
- Nullable
- SQL validation (custom validation)
- In some cases we don't want to filter noisy data instead just wants to capture noisy data, for instance, a single invalid record can flow into both target and reject section, this we can archive using discard settings.
note
For data type validation click here
#
Uniqueness- By enabling this checkbox, duplicate record(s) will be marked as an invalid record.
- For instance, a record containing duplicate city_id and country_id will be marked as an invalid record.
- In the case of source as a file, uniqueness only applies at the file level, meaning that duplicate records across multiple files will not mark as invalid.
note
This is different from SQL
distinct
, if guzzle found two duplicate records guzzle will mark both records as invalid.
#
Nullable- By disabling this checkbox, a record(s) containing a null value will be marked as an invalid record.
#
SQL validation- By specifying Spark SQL expression condition, if the expression returns `false`` then a record will be marked as an invalid record.
- For instance, the interest rate can't be greater than 10%
#
Discard setting- A record can be marked as invalid by one or more validation rules.
- After a record mark as invalid, there are two possibilities:
- Only flow into the Reject section
- Or flow into both Reject and Target section
- Since a record can have multiple validation rules thus it has its discard setting for each of them.
- For instance, here even if
id
gets duplicate will flow into the Target section and also flow into the Reject section with a validation error, but onlyid <> -1
will flow into the Target section and an invalid record will be transferred to the Reject section (if specified). - Discard settings for all types of validation rules going to by default true, to change that behavior use the Global Discard checkbox in Advance Settings.