Note: Datasets and the Parquet Time Partitioned Dataset Batch Source are deprecated and will be removed in CDAP 7.0.0
Reads from a TimePartitionedFileSet whose data is in Parquet format.
The source is used when you need to read partitions of a TimePartitionedFileSet. For example, suppose there is an application that ingests data by writing to a TimePartitionedFileSet, where arrival time of the data is used as the partition key. You may want to create a pipeline that reads the newly-arrived files, performs data validation and cleansing, and then writes to a Table.
Configuration
Property | Enable Macro? | Description |
---|---|---|
Dataset Name | Yes | Required. Name of the TimePartitionedFileSet from which the records are to be read from. |
Dataset Base Path | Yes | Optional. Base path for the TimePartitionedFileSet. Defaults to the name of the dataset. |
Duration | Yes | Required. Size of the time window to read with each run of the pipeline. The format is expected to be a number followed by an 's', 'm', 'h', or 'd' (specifying the time unit), with 's' for seconds, 'm' for minutes, 'h' for hours, and 'd' for days. For example, a value of '5m' means each run of the pipeline will read 5 minutes of events from the TPFS source. |
Delay | Yes | Optional. delay for reading from TPFS source. The value must be of the same format as the duration value. For example, a duration of '5m' and a delay of '10m' means each run of the pipeline will read events for 5 minutes of data from 15 minutes before its logical start time to 10 minutes before its logical start time. Default is 0. |
Output Schema | No | Required. The Parquet schema of the record being read from the source as a JSON Object. |
Example
This example reads from a TimePartitionedFileSet named ‘webactivity’, assuming the underlying files are in Parquet format:
...