...

Use this source when you need to read partitions of a TimePartitionedFileSet. For example, suppose an application ingests data by writing to a TimePartitionedFileSet, using the arrival time of the data as the partition key. You may want to create a pipeline that reads the newly arrived files, performs data validation and cleansing, and then writes to a Table.

Configuration

Property

Macro Enabled?

Description

Dataset Name

Yes

Required. Name of the TimePartitionedFileSet from which the records are read.

Dataset Base Path

Yes

Optional. Base path for the TimePartitionedFileSet. Defaults to the name of the dataset.

Duration

Yes

Required. Size of the time window to read with each run of the pipeline. The value must be a number followed by a time unit: 's' for seconds, 'm' for minutes, 'h' for hours, or 'd' for days. For example, a value of '5m' means each run of the pipeline reads 5 minutes of events from the TPFS source.

Delay

Yes

Optional. Delay for reading from the TPFS source. The value must use the same format as the duration value. For example, a duration of '5m' and a delay of '10m' means each run of the pipeline reads 5 minutes of data, from 15 minutes before its logical start time to 10 minutes before its logical start time. The default value is 0.

Output Schema

No

Required. The Avro schema of the records being read from the source, as a JSON object.
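
The way Duration and Delay combine into a read window can be illustrated with a short sketch. This is not the plugin's implementation; it only reproduces the arithmetic described in the table above. The function names `parse_duration` and `read_window` are hypothetical, the default delay is written as '0s' so that it fits the documented number-plus-unit format, and treating the window as half-open (start included, end excluded) is an assumption.

```python
import re
from datetime import datetime, timedelta, timezone

# Map the documented unit suffixes to timedelta keyword arguments.
_UNITS = {"s": "seconds", "m": "minutes", "h": "hours", "d": "days"}


def parse_duration(text: str) -> timedelta:
    """Parse a value such as '5m' or '2h' (hypothetical helper)."""
    match = re.fullmatch(r"([0-9]+)([smhd])", text.strip())
    if match is None:
        raise ValueError(f"expected a number followed by s, m, h, or d: {text!r}")
    amount, unit = match.groups()
    return timedelta(**{_UNITS[unit]: int(amount)})


def read_window(logical_start: datetime, duration: str, delay: str = "0s"):
    """Return (start, end) of the window read by one run (hypothetical helper).

    The window ends `delay` before the logical start time and
    spans `duration` backwards from that point.
    """
    end = logical_start - parse_duration(delay)
    start = end - parse_duration(duration)
    return start, end


if __name__ == "__main__":
    run_time = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    start, end = read_window(run_time, duration="5m", delay="10m")
    print(start.isoformat(), end.isoformat())
```

With a duration of '5m' and a delay of '10m', a run whose logical start time is 12:00 covers 11:45 to 11:50, matching the description of the Delay property.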

Example

This example reads from a TimePartitionedFileSet named ‘webactivity’, assuming the underlying files are in Avro format:

...