Plugin version: 2.11.0

Batch source that reads from any distributed file system.

This source is used whenever you need to read from a distributed file system. For example, you might want to read in log files from S3 every hour and then store the logs in a Redshift table.

Configuration

Property	Macro Enabled?	Version introduced	Description
Reference Name	No		Required. This will be used to uniquely identify this source for lineage, annotating metadata, etc.
Path	Yes		Required. Path to file(s) to be read. If a directory is specified, terminate the path name with a '/'. The path uses filename expansion (globbing) to read files.
Format	Yes		Required. Format of the data to read. The format must be one of ‘avro’, ‘blob’, ‘csv’, ‘delimited’, ‘json’, ‘parquet’, ‘text’, or ‘tsv’. The ‘blob’ format also requires a schema that contains a field named ‘body’ of type ‘bytes’. If the format is ‘text’, the schema must contain a field named ‘body’ of type ‘string’.
Sample Size	Yes	6.4.0 / 2.6.0	Optional. The maximum number of rows in a file that are examined for automatic data type detection. Default is 1000.
Override	Yes	6.4.0 / 2.6.0	Optional. A list of columns and their corresponding data types for which automatic data type detection is skipped.
Delimiter	Yes		Optional. Delimiter to use when the format is ‘delimited’. This will be ignored for other formats.
Enable Quoted Values	Yes	6.7.0 / 2.9.0	Optional. Whether to treat content between quotes as a value. This value will only be used if the format is ‘csv’, ‘tsv’ or ‘delimited’. For example, if this is set to true, a line that looks like `1, "a, b, c"` will output two fields. The first field will have `1` as its value and the second will have `a, b, c` as its value. The quote characters will be trimmed. The newline delimiter cannot be within quotes. It also assumes the quotes are well enclosed, for example, `"a, b, c"`. If there is an unenclosed quote, for example `"a,b,c`, an error will occur.
Use First Row as Header	Yes	6.7.0 / 2.9.0	Optional. Whether to use the first line of each file as the column headers. Supported formats are 'text', 'csv', 'tsv', 'delimited'. Default is false.
Maximum Split Size	Yes		Optional. Maximum size in bytes for each input partition. Smaller partitions will increase the level of parallelism, but will require more resources and overhead. If the Format is blob, you cannot split the data. Default is 128 MB.
Regex Path Filter	Yes		Optional. Regular expression that file paths must match in order to be included in the input. The full file path is compared, not just the filename. If no value is given, no file filtering will be done. For more information about regular expression syntax, see https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html.
Path Field	Yes		Optional. Output field to place the path of the file that the record was read from. If not specified, the file path will not be included in output records. If specified, the field must exist in the output schema as a string.
Path Filename Only	Yes		Optional. Whether to only use the filename instead of the URI of the file path when a path field is given. Default is false.
Read Files Recursively	Yes		Optional. Whether files are to be read recursively from the path. Default is false.
Allow Empty Input	Yes		Optional. Whether to allow an input path that contains no data. When set to false, the plugin will error when there is no data to read. When set to true, no error will be thrown and zero records will be read. Default is false.
File System Properties	Yes		Optional. Additional properties to use with the InputFormat when reading the data.
File Encoding	Yes	6.3.0 / 2.5.0	Optional. File encoding for the source file. Default is UTF-8.
Output Schema	Yes		Required. The output schema for the data.
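
The effect of Enable Quoted Values on the example line `1, "a, b, c"` can be illustrated with Python's standard csv module. This is only an analogy for the behavior described in the table, not the plugin's own parser, and the helper name split_line is hypothetical.

```python
import csv

# The example line from the Enable Quoted Values description.
LINE = '1, "a, b, c"'


def split_line(line, quoted):
    """Split one comma-delimited line, optionally honoring double quotes.

    Hypothetical helper that mimics the documented behavior; it is not
    the plugin's implementation.
    """
    if not quoted:
        # Quotes are ordinary characters: every comma separates fields.
        return [field.strip() for field in line.split(",")]
    # Quotes enclose a single value and the quote characters are trimmed.
    # strict=True makes an unterminated quote raise csv.Error.
    reader = csv.reader([line], skipinitialspace=True, strict=True)
    return next(reader)


print(split_line(LINE, quoted=False))
print(split_line(LINE, quoted=True))
```

With quoting off, the line splits into four fields. With quoting on, it yields two fields, `1` and `a, b, c`. Passing the unenclosed input `"a,b,c` with quoting on raises csv.Error in this sketch, which mirrors the error described for unenclosed quotes.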

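Because Regex Path Filter is compared against the full file path, a pattern written for the filename alone will not select anything. The sketch below uses Python's re module as a stand-in for Java's Pattern, whose syntax differs in some details. It assumes the pattern must match the entire path; the bucket name, paths, and the helper filter_paths are all hypothetical.

```python
import re

# Hypothetical file paths, for illustration only.
PATHS = [
    "s3a://example-bucket/logs/2024/app.log",
    "s3a://example-bucket/logs/2024/app.log.tmp",
    "s3a://example-bucket/data/app.csv",
]


def filter_paths(paths, pattern):
    """Keep only paths that the pattern matches in full (assumed semantics)."""
    if not pattern:
        # No filter given: no file filtering is done.
        return list(paths)
    compiled = re.compile(pattern)
    return [p for p in paths if compiled.fullmatch(p)]


# A filename-only pattern matches none of the full paths.
print(filter_paths(PATHS, r"app\.log"))
# A pattern that accounts for the whole path selects the log file.
print(filter_paths(PATHS, r".*/logs/.*\.log"))
```

Under these assumptions the first call returns an empty list and the second returns only the `app.log` path, so patterns generally need a leading `.*` to cover the scheme and directories.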