Page Comparison

...

The source is used when you need to read partitions of a TimePartitionedFileSet. For example, suppose there is an application that ingests data by writing to a TimePartitionedFileSet, where arrival time of the data is used as the partition key. You may want to create a pipeline that reads the newly-arrived files, performs data validation and cleansing, and then writes to a Table.

Configuration

Property	Macro Enabled?	Description
Dataset Name	Yes	Required. Name of the TimePartitionedFileSet from which the records are to be read from.
Dataset Base Path	Yes	Optional. Base path for the TimePartitionedFileSet. Defaults to the name of the dataset.
Duration	Yes	Required. Size of the time window to read with each run of the pipeline. The format is expected to be a number followed by an 's', 'm', 'h', or 'd' specifying the time unit, with 's' for seconds, 'm' for minutes, 'h' for hours, and 'd' for days. For example, a value of '5m' means each run of the pipeline will read 5 minutes of events from the TPFS source.
Delay	Yes	Optional. Delay for reading from TPFS source. The value must be of the same format as the duration value. For example, a duration of '5m' and a delay of '10m' means each run of the pipeline will read events 5 minutes of data from 15 minutes before its logical start time to 10 minutes before its logical start time. The default value is 0.
Output Schema	No	Required. The Avro schema of the record being read from the source as a JSON Object.

Example

This example reads from a TimePartitionedFileSet named ‘webactivity’, assuming the underlying files are in Avro format:

...

Versions Compared

Old Version 11

New Version 12

Key

Configuration

Example