...
The source is used when you need to read partitions of a TimePartitionedFileSet. For example, suppose there is an application that ingests data by writing to a TimePartitionedFileSet, where arrival time of the data is used as the partition key. You may want to create a pipeline that reads the newly-arrived files, performs data validation and cleansing, and then writes to a Table.
Configuration
Property | Macro Enabled? | Description |
---|---|---|
Dataset Name | Yes | Required. Name of the TimePartitionedFileSet from which the records are to be read from. |
Dataset Base Path | Yes | Optional. Base path for the TimePartitionedFileSet. Defaults to the name of the dataset. |
Duration | Yes | Required. Size of the time window to read with each run of the pipeline. The format is expected to be a number followed by an 's', 'm', 'h', or 'd' specifying the time unit, with 's' for seconds, 'm' for minutes, 'h' for hours, and 'd' for days. For example, a value of '5m' means each run of the pipeline will read 5 minutes of events from the TPFS source. |
Delay | Yes | Optional. Delay for reading from TPFS source. The value must be of the same format as the duration value. For example, a duration of '5m' and a delay of '10m' means each run of the pipeline will read events 5 minutes of data from 15 minutes before its logical start time to 10 minutes before its logical start time. The default value is 0. |
Output Schema | No | Required. The Avro schema of the record being read from the source as a JSON Object. |
Example
This example reads from a TimePartitionedFileSet named ‘webactivity’, assuming the underlying files are in Avro format:
...