Note: Datasets and the Parquet Time Partitioned Dataset Batch Source are deprecated and will be removed in CDAP 7.0.0.

Reads from a TimePartitionedFileSet whose data is in Parquet format.

The source is used when you need to read partitions of a TimePartitionedFileSet. For example, suppose there is an application that ingests data by writing to a TimePartitionedFileSet, where arrival time of the data is used as the partition key. You may want to create a pipeline that reads the newly-arrived files, performs data validation and cleansing, and then writes to a Table.

Configuration

Property	Enable Macro?	Description
Dataset Name	Yes	Required. Name of the TimePartitionedFileSet from which the records are to be read.
Dataset Base Path	Yes	Optional. Base path for the TimePartitionedFileSet. Defaults to the name of the dataset.
Duration	Yes	Required. Size of the time window to read with each run of the pipeline. The format is expected to be a number followed by an 's', 'm', 'h', or 'd' (specifying the time unit), with 's' for seconds, 'm' for minutes, 'h' for hours, and 'd' for days. For example, a value of '5m' means each run of the pipeline will read 5 minutes of events from the TPFS source.
Delay	Yes	Optional. Amount of time to delay reading from the TPFS source. The value must use the same format as the duration value. For example, a duration of '5m' and a delay of '10m' means each run of the pipeline will read 5 minutes of data, covering the window from 15 minutes before its logical start time to 10 minutes before its logical start time. Default is 0.
Output Schema	No	Required. The Parquet schema of the record being read from the source as a JSON Object.
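
The way Duration and Delay combine into a read window can be sketched in a few lines of Python. This is only an illustration of the arithmetic described in the table, not the plugin's implementation: the function names `parse_duration` and `read_window` are hypothetical, and the sketch does not say whether the window boundaries are inclusive or exclusive, since the description above does not specify that.

```python
import re
from datetime import datetime, timedelta, timezone
from typing import Optional, Tuple

# Maps the unit suffixes named in the table to timedelta keyword arguments.
_UNITS = {"s": "seconds", "m": "minutes", "h": "hours", "d": "days"}


def parse_duration(value: str) -> timedelta:
    """Parse a string such as '5m' or '2h' (hypothetical helper)."""
    match = re.fullmatch(r"(\d+)([smhd])", value.strip())
    if match is None:
        raise ValueError(f"expected a number followed by s, m, h, or d: {value!r}")
    amount, unit = match.groups()
    return timedelta(**{_UNITS[unit]: int(amount)})


def read_window(
    logical_start: datetime, duration: str, delay: Optional[str] = None
) -> Tuple[datetime, datetime]:
    """Return (window_start, window_end) for one pipeline run.

    The window ends `delay` before the logical start time and spans
    `duration`. A missing delay is treated as zero, matching the default.
    """
    delay_delta = parse_duration(delay) if delay else timedelta(0)
    window_end = logical_start - delay_delta
    window_start = window_end - parse_duration(duration)
    return window_start, window_end


if __name__ == "__main__":
    start = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    begin, end = read_window(start, duration="5m", delay="10m")
    print(begin.isoformat(), end.isoformat())
```

With a logical start time of 12:00, a duration of '5m', and a delay of '10m', the sketch yields a window from 11:45 to 11:50, which matches the example in the Delay description.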

Example

This example reads from a TimePartitionedFileSet named ‘webactivity’, assuming the underlying files are in Parquet format:

...
