Parquet Dynamic Partitioned Dataset Sink (Deprecated)
This plugin is no longer available as of July 26, 2024.
Sink for a PartitionedDataset that writes data in Parquet format and uses the values of one or more record fields to create partitions. All data for the run is written to partitions determined by the specified fields and their values.
Use this sink whenever you want to write to a PartitionedFileSet in Parquet format using a value from the record as a partition. For example, you might want to load historical data from a database and partition the dataset on the original creation date of the data.
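As a rough illustration of the idea (not the plugin's actual implementation), the sketch below groups records by the values of the chosen partition fields, so that each distinct combination of values maps to its own partition. The function name `partition_records`, the `created` field, and the sample records are all hypothetical.

```python
from collections import defaultdict


def partition_records(records, partition_fields):
    """Group records by the values of the given partition fields.

    Hypothetical helper for illustration only; the real sink writes each
    group to a Parquet partition rather than returning it.
    """
    partitions = defaultdict(list)
    for record in records:
        # The partition key is the tuple of values of the partition fields.
        key = tuple(record[field] for field in partition_fields)
        partitions[key].append(record)
    return dict(partitions)


# Hypothetical historical rows carrying their original creation date.
records = [
    {"id": 1, "name": "a", "created": "2016-01-01"},
    {"id": 2, "name": "b", "created": "2016-01-02"},
    {"id": 3, "name": "c", "created": "2016-01-01"},
]

partitions = partition_records(records, ["created"])
for key, rows in sorted(partitions.items()):
    print(key, [row["id"] for row in rows])
```

Passing more than one field name, for example `["created", "name"]`, would yield one partition per distinct combination of values.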
Configuration
Property | Macro Enabled? | Description |
---|---|---|
Dataset Name | Yes | Required. Name of the PartitionedFileSet to which records are written. If it doesn’t exist, it will be created. |
Dataset Base Path | Yes | Optional. Base path for the PartitionedFileSet. Defaults to a path based on the dataset name: [Namespace]/data/[Dataset name]. |
Partition Field Names | Yes | Required. One or more fields that will be used to partition the dataset. |
Compression Codec | No | Optional. Compression codec to use for the output data. Valid values are None, Snappy, GZip, and LZO. Default is None. |
Append to Existing Partition | No | Optional. Whether to allow appending to existing partitions. Default is No. |
Output Schema | Yes | Required. The Avro schema of the record being written to the sink as a JSON Object. |
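The Output Schema property expects an Avro record schema serialized as a JSON object. A minimal sketch of building one for the customer fields shown in the example that follows; the field types are assumptions (the source does not state them) and the record name `customer` is hypothetical.

```python
import json

# Assumed field types; the record name "customer" is hypothetical.
schema = {
    "type": "record",
    "name": "customer",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "first_name", "type": "string"},
        {"name": "last_name", "type": "string"},
        {"name": "street_address", "type": "string"},
        {"name": "city", "type": "string"},
        {"name": "state", "type": "string"},
        {"name": "zipcode", "type": "string"},
    ],
}

# Serialize to the JSON text that would be supplied as the schema value.
schema_json = json.dumps(schema)
print(schema_json)
```

Any field used for partitioning would also need to appear in this schema.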
Example
For example, suppose the sink receives input records from customers and purchases:
id | first_name | last_name | street_address | city | state | zipcode |
---|---|---|---|---|---|---|
1 | Douglas | Williams | 1, Vista Montana | San Jose | CA | 95134 |
2 | David | Johnson | 3, Baypointe Parkway | Houston | TX | 78970 |
3 | Hugh | Jackman | 5, Cool Way | Manhattan | NY | 67263 |
4 | Walter | White | 3828, Piermont Dr | Orlando | FL | 73498 |
5 | Frank | Underwood | 1609 Far St. | San Diego | CA | 29770 |
6 | Serena | Woods | 123 Far St. | Las Vegas | NV | 45334 |
If we choose purchase_date as the partition field, the sink will create a PartitionedDataset and populate the partitions as follows:
id | first_name | last_name | street_address | city | state | zipcode |
---|---|---|---|---|---|---|
2 | David | Johnson | 3, Baypointe Parkway | Houston | TX | 78970 |
3 | Hugh | Jackman | 5, Cool Way | Manhattan | NY | 67263 |
6 | Serena | Woods | 123 Far St. | Las Vegas | NV | 45334 |
id | first_name | last_name | street_address | city | state | zipcode |
---|---|---|---|---|---|---|
1 | Douglas | Williams | 1, Vista Montana | San Jose | CA | 95134 |