Skip to main content

Working with Multiple Files

This article outlines how to work with multiple files for source and target in Ingestion activity. The Source section of Ingestion activity provides the property File Pattern that allow user to specify the dynamic file name pattern using Glob expression language. More details on Glob expression can be found here

For Target section of Ingestion activity, Ingestion activity provides various settings to determine how the files are copied over from source folder or table into the target section

Support for File Pattern in Source Section#

Ingestion activity allows specifying the initial (or root) path when defining the data store as explained. The File Pattern setting is treated as relative to the path specified in the data store. Ingestion activity has the same treatment of File Pattern for all the types of.

Below describes different options of how File Pattern can be specified and its treatment by Ingestion activity

OptionsExampleDescription
Static file pathCSV/customer.csvThis will retrieve specific file from the file system
Recursivecustomer/*/data.csvThis will cause it to enumerate all the files recursively in the customer subfolder in the form the file system and file matching the patter data*.csv will be considered for processing
Dynamic folder and file pathcus*//customer customer_[0-9][0-9]//customerSince the first component of file pattern contains wildcard or regular expression, this will result in enumeration of all the files in the root directory, of which it will pick those which match the file pattern is used to recursively search the files within a given directory
Static folder name and dynamic sub-folder and file namecustomer/Asia/data_<sg, hk, my>/*.csvSince the first two level of folders are statically specified (and does not contain wildcard or reg-exp), it will enumerate all the files in the sub-directory customer/Asia and match the resultant files with the rest of file pattern
note

For optimal performance and to utilize the service-side filter, it is recommended to provide static directory name where possible to reduce the amount of metadata (file listing) that Ingestion activity has to retrieve before applying filename pattern. This is crucial if the source file system contains large of number of files

Using File Data Store in Target or Reject section#

Ingestion activity provides following properties which determines how the files and folders are generated when using File datastore as target.

PropertyDescriptionRequiredDefault Value
PathThis is the directory in which the target files and folder shall be created. This path is relative to the root path that is specified when defining the File datastoreYesNULL
Generate Single FileWhen set as true, this will generate a single file corresponding to each source file. Similarly, for when the source is a table it will generate one file for entire table for non-partitioned table or one data file per each partition in the respective partition folder. If the settings is set to false, it will create a folder for each source file or source table and generate multiple part files as an output in these folders. For partitioned table it will contain generate multiple part files within the partition folderNoFalse
Preserve HierarchyWhen set to true, it will mirror the folder structure as per source folder structure inside the target directory specified as per File Path. The entire folder structure of source as per "File Pattern" is mirrored.NoFalse

Apart from the above properties, there are additional settings that are meant to specify File Format and its associated properties. This is covered in detail in section :

note

Any existing file within the same folder and file name shall be overwritten

Illustration of how file based source are copied in Target#

Assuming that source file pattern results below files:

source
|
โ”œโ”€โ”€ folder1
โ”‚ โ””โ”€โ”€ users1.csv
|
โ””โ”€โ”€ folder2
โ”œโ”€โ”€ users2.csv
|
โ””โ”€โ”€ folder2_a
โ””โ”€โ”€ users2_a.csv

the folders and files that shall be generated in the target folder path (which is root path in Data store + File path specified in target section) shall be as per below:

Target config and its properties
Target PathPreserve HierarchyMerge Part File/ Generate Single FIlePartition defined in tranform tab ?Target FileNameExpected Output
/target/FalseFalseFalseNULL
target
|
โ””โ”€โ”€ data.json
โ”œโ”€โ”€ part-00000-xxxxxx.json
โ”œโ”€โ”€ part-00000-xxxxxx.json
โ””โ”€โ”€ _SUCCESS
/target/FalseTrueFalseNULL
target
|
โ””โ”€โ”€ data.json
/target/TrueFalseFalseNULL
target
|
โ”œโ”€โ”€ folder1
โ”‚ โ””โ”€โ”€ users1.json
| โ”œโ”€โ”€ part-00000-xxxxxx.json
โ”‚ โ””โ”€โ”€ _SUCCESS
|
โ””โ”€โ”€ folder2
โ”œโ”€โ”€ users2.json
| โ”œโ”€โ”€ part-00000-xxxxxx.json
| โ””โ”€โ”€ _SUCCESS
|
โ””โ”€โ”€ folder2_a
โ””โ”€โ”€ users2_a.json
โ”œโ”€โ”€ part-00000-xxxxxx.json
โ””โ”€โ”€ _SUCCESS
/target/TrueTrueFalseNULL
target
|
โ”œโ”€โ”€ folder1
โ”‚ โ””โ”€โ”€ users1.json
|
โ””โ”€โ”€ folder2
โ”œโ”€โ”€ users2.json
|
โ””โ”€โ”€ folder2_a
โ””โ”€โ”€ users2_a.json
/target/FalseFalseTrueNULL
target
|
โ””โ”€โ”€ data.json
โ”œโ”€โ”€ part-00000-xxxxxx.json
โ”œโ”€โ”€ part-00000-xxxxxx.json
โ””โ”€โ”€ _SUCCESS
/target/FalseTrueTrueNULL
target
|
โ””โ”€โ”€ data.json
/target/TrueFalseTrueNULL
target
|
โ””โ”€โ”€ data.json
โ”œโ”€โ”€ city=a
| โ”œโ”€โ”€ part-00000-xxxxxx.json
| โ””โ”€โ”€ _SUCCESS
โ”œโ”€โ”€ city=b
| โ”œโ”€โ”€ part-00000-xxxxxx.json
| โ””โ”€โ”€ _SUCCESS
โ””โ”€โ”€ city=c
โ”œโ”€โ”€ part-00000-xxxxxx.json
โ””โ”€โ”€ _SUCCESS
/target/TrueTrueTrueNULL
target
|
โ””โ”€โ”€ data.json
โ”œโ”€โ”€ city=a
| โ”œโ”€โ”€ part-00000-xxxxxx.json
| โ””โ”€โ”€ _SUCCESS
โ”œโ”€โ”€ city=b
| โ”œโ”€โ”€ part-00000-xxxxxx.json
| โ””โ”€โ”€ _SUCCESS
โ””โ”€โ”€ city=c
โ”œโ”€โ”€ part-00000-xxxxxx.json
โ””โ”€โ”€ _SUCCESS
/target/FalseFalseFalsetarget_file.json
target
|
โ””โ”€โ”€ data.json
โ”œโ”€โ”€ part-00000-xxxxxx.json
โ”œโ”€โ”€ part-00000-xxxxxx.json
โ””โ”€โ”€ _SUCCESS
/target/FalseTrueFalsetarget_file.json
target
|
โ””โ”€โ”€ target_file.json
/target/FalseFalseTruetarget_file.json
target
|
โ””โ”€โ”€ data.json
โ”œโ”€โ”€ part-00000-xxxxxx.json
โ”œโ”€โ”€ part-00000-xxxxxx.json
โ””โ”€โ”€ _SUCCESS
/target/FalseTrueTruetarget_file.json
target
|
โ””โ”€โ”€ target_file.json
note
  1. The suffix files or folder when preserve hierarchy is not selected is published ID that is generated
  2. The part file are generated by spark and the naming convention will depend on implementation of Spark connector

Illustration of how table based source are copied in target#

Assuming the source table is non partitioned, it shall generate the files as per below

Target Config and its properties values
Target PathPreserve HierarchyMerge Part File/ Generate Single FIlePartition defined in tranform tab ?Target FileNameExpected Output
/target/True/FalseFalseFalseNULL
target
|
โ””โ”€โ”€ data.json
โ”œโ”€โ”€ part-00000-xxxxxx.json
โ”œโ”€โ”€ part-00000-xxxxxx.json
โ””โ”€โ”€ _SUCCESS
/target/True/FalseTrueFalseNULL
target
|
โ””โ”€โ”€ data.json
/target/True/FalseTrueTrueNULL
target
|
โ””โ”€โ”€ data.json
โ”œโ”€โ”€ city=a
| โ”œโ”€โ”€ part-00000-xxxxxx.json
| โ””โ”€โ”€ _SUCCESS
โ”œโ”€โ”€ city=b
โ”œโ”€โ”€ part-00000-xxxxxx.json
โ””โ”€โ”€ _SUCCESS
/target/FalseFalseTruetarget_files.json
target
|
โ””โ”€โ”€ data.json
โ”œโ”€โ”€ part-00000-xxxxxx.json
โ”œโ”€โ”€ part-00000-xxxxxx.json
โ””โ”€โ”€ _SUCCESS
/target/FalseTrueTruetarget_files.json
target
|
โ””โ”€โ”€ target_files.json
note

The partitioned settings of the tales are taken from the table metadata (and not from Ingestion config)

Parallel Processing of Files#

When ingesting data from multiple source files, Ingestion activity will read and process individual files in separate threads. It will spawn a fixed number of threads which will pick one file at a time once it's done processing the previous file. Each of this thread will read the file, perform control total and schema validations and apply transformation before publishing it to target.

The number of threads that Ingestion activity spawns to process the files concurrently is determined by the parameter: guzzle.batchpipeline.threads which can be specified when running the activity or passed when calling the activity from Pipeline

Partial Load setting in Guzzle#

When processing multiple source files, Ingestion Activity will process individual files in separate threads. It reads the files, performs control total and schema validations and applies transformation before publishing it to target. A subset of files can fail during this process due to one of the below reasons: :

  1. When the control total of the file does not match with the actual file content.

  2. If validations are specified the number of records failing the validation breaches the reject threshold set for a given file.

  3. The file becomes unavailable when Ingestion activity is trying to process it.

Partial Load setting in Source section determines whether Ingestion activity should write the data to Target if a subset of files has failed the validations. Below describes the behavior of this setting:.

Truewill proceed to write to the target datastore excluding the files which had failure. The activity will be marked with the status WARNING
false (default)None of the data will be written to the to target datastore and job will be marked with status FAILED