- The Data Validation feature is used to filter or capture noisy data.
- Ingestion supports the following data validations:
- Data type validation
- SQL validation (custom validation)
- In some cases we don't want to filter out noisy data but only capture it. For instance, a single invalid record can flow into both the target and reject sections; this can be achieved using discard settings.
For data type validation click here
- When this checkbox is enabled, duplicate records are marked as invalid.
- For instance, a record containing duplicate city_id and country_id will be marked as an invalid record.
- When the source is a file, uniqueness applies only at the file level, meaning that duplicate records across multiple files will not be marked as invalid.
This is different from SQL `distinct`: if Guzzle finds two duplicate records, it marks both records as invalid.
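
The difference from `distinct` can be illustrated with a minimal Python sketch. This is not Guzzle's implementation; the function `mark_duplicates` and the sample rows are hypothetical and only mirror the behavior described above, where every record sharing a duplicated key is marked invalid rather than one copy being kept.

```python
from collections import Counter


def mark_duplicates(records, key_columns):
    """Split records into (valid, invalid).

    Hypothetical helper: every record whose key occurs more than once
    is treated as invalid, so no copy of a duplicate survives.
    """
    keys = [tuple(record[column] for column in key_columns) for record in records]
    counts = Counter(keys)
    valid, invalid = [], []
    for record, key in zip(records, keys):
        if counts[key] > 1:
            invalid.append(record)
        else:
            valid.append(record)
    return valid, invalid


# Illustrative data only: the first two rows share city_id and country_id.
rows = [
    {"city_id": 1, "country_id": 10, "name": "Pune"},
    {"city_id": 1, "country_id": 10, "name": "Poona"},
    {"city_id": 2, "country_id": 10, "name": "Mumbai"},
]
valid, invalid = mark_duplicates(rows, ["city_id", "country_id"])
```

With SQL `distinct`, one of the two matching rows would be kept; in this sketch both end up in `invalid`.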
- When this checkbox is disabled, records containing a null value are marked as invalid.
- When a Spark SQL expression is specified as a condition, a record is marked as invalid if the expression returns `false`.
- For instance, the interest rate can't be greater than 10%
- A record can be marked as invalid by one or more validation rules.
- After a record is marked as invalid, there are two possibilities:
- Only flow into the Reject section
- Or flow into both Reject and Target section
- Since a record can be subject to multiple validation rules, each rule has its own discard setting.
- For instance, here a record with a duplicate `id` will still flow into the Target section and will also flow into the Reject section with a validation error, but only records satisfying `id <> -1` will flow into the Target section; a record that fails this condition is transferred to the Reject section (if specified).
- The discard setting for all types of validation rules is true by default. To change that behavior, use the Global Discard checkbox in Advance Settings.
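
The routing described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Guzzle's implementation: the `Rule` class, the `route` function, and the sample records are hypothetical, and the sketch assumes a record is kept out of the Target section as soon as any failed rule has discard enabled.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Record = Dict[str, object]


@dataclass
class Rule:
    """Hypothetical validation rule with its own discard setting."""
    name: str
    is_valid: Callable[[Record], bool]
    discard: bool = True  # discard defaults to true, as stated above


def route(records: List[Record], rules: List[Rule]) -> Tuple[list, list]:
    """Return (target, reject); reject entries carry the failed rule names."""
    target, reject = [], []
    for record in records:
        failed = [rule for rule in rules if not rule.is_valid(record)]
        if failed:
            # Every invalid record is captured along with its validation errors.
            reject.append((record, [rule.name for rule in failed]))
        # Assumption: one failed rule with discard enabled is enough to
        # keep the record out of the target.
        if not any(rule.discard for rule in failed):
            target.append(record)
    return target, reject


# Illustrative data only.
records = [{"id": 1}, {"id": 1}, {"id": -1}, {"id": 2}]
id_counts = Counter(record["id"] for record in records)

rules = [
    # Uniqueness on id with discard turned off: capture but do not filter.
    Rule("unique_id", lambda r: id_counts[r["id"]] == 1, discard=False),
    # SQL-style condition id <> -1 with discard left at its default.
    Rule("id <> -1", lambda r: r["id"] != -1),
]

target, reject = route(records, rules)
```

Here the two records with `id` 1 reach both `target` and `reject`, while the record with `id` -1 reaches only `reject`.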