ORC Dynamic Partitioned Dataset Sink (Deprecated)

This plugin is no longer available as of July 26, 2024.

Sink for a PartitionedFileSet that writes data in ORC format and uses one or more record field values to create partitions. Each record in the run is written to the partition determined by the values of the specified fields.

Use this sink whenever you want to write to a PartitionedFileSet in ORC format using a value from the record as a partition. For example, you might want to load historical data from a database and partition the dataset on the original creation date of the data.

Configuration

| Property | Macro Enabled? | Description |
|---|---|---|
| Dataset Name | Yes | Required. Name of the PartitionedFileSet to which records are written. If it doesn’t exist, it will be created. |
| Dataset Base Path | No | Required. Base path for the PartitionedFileSet. Defaults to the name of the dataset. Default is [Namespace]/data/[Dataset name]. |
| Partition Field Names | Yes | Required. One or more fields that will be used to partition the dataset. |
| Compression Codec | No | Optional. Compression codec to use on the resulting data. Valid values are None, Snappy, and zlib. Default is None. |
| Append to Existing Partition | No | Optional. Whether to allow appending to existing partitions. This capability is disabled by default. Default is No. |
| Compression Chunk Size | No | Required if Compression Codec is set. Number of bytes in each compression chunk. |
| Bytes per stripe | No | Required if Compression Codec is set. Number of bytes in each stripe. |
| Rows between index entries | No | Required if Compression Codec is set. Number of rows between index entries (must be >= 1,000). |
| Create inline indices | No | Required if Compression Codec is set. Whether to create inline indices. |
| Output Schema | Yes | Required. The schema of the record being written to the sink as a JSON Object. |
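
The rules in the configuration table can be checked before a pipeline is deployed. The following Python sketch is only an illustration: the property keys (name, fieldNames, compressionCodec, compressionChunkSize, stripeSize, indexStride, createIndex, schema), the dataset name customers_orc, the numeric values, and the record-style schema layout are all assumptions made for this example and are not confirmed by this page.

```python
import json

# Assumed property keys; this page only confirms the display labels.
VALID_CODECS = {"None", "Snappy", "zlib"}
ORC_TUNING_KEYS = ("compressionChunkSize", "stripeSize", "indexStride", "createIndex")


def validate_sink_properties(props):
    """Return a list of problems found in a sink property dict."""
    problems = []
    # Dataset Base Path is not checked here because the table documents a default.
    for key in ("name", "fieldNames", "schema"):
        if not props.get(key):
            problems.append(f"missing required property: {key}")

    codec = props.get("compressionCodec", "None")
    if codec not in VALID_CODECS:
        problems.append(f"unsupported compression codec: {codec}")
    elif codec != "None":
        # The ORC tuning properties are required only when a codec is set.
        for key in ORC_TUNING_KEYS:
            if key not in props:
                problems.append(f"{key} is required when a codec is set")
        stride = props.get("indexStride")
        if stride is not None and int(stride) < 1000:
            problems.append("indexStride must be >= 1000")
    return problems


# Output Schema is passed as a JSON object; this record layout is an assumption.
schema = {
    "type": "record",
    "name": "customer",
    "fields": [
        {"name": column, "type": "string"}
        for column in ("id", "first_name", "last_name", "street_address",
                       "city", "state", "zipcode")
    ],
}

props = {
    "name": "customers_orc",       # hypothetical dataset name
    "fieldNames": "state",         # partition on the state column
    "compressionCodec": "Snappy",
    "compressionChunkSize": "262144",
    "stripeSize": "67108864",
    "indexStride": "500",          # deliberately below the documented minimum
    "createIndex": "true",
    "schema": json.dumps(schema),
}

print(validate_sink_properties(props))
```

With the values above, the only problem reported is the index stride, because 500 is below the documented minimum of 1,000 rows between index entries.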

Example

Suppose the sink receives the following input records from customers and purchases:

| id | first_name | last_name | street_address | city | state | zipcode |
|---|---|---|---|---|---|---|
| 1 | Douglas | Williams | 1, Vista Montana | San Jose | CA | 95134 |
| 2 | David | Johnson | 3, Baypointe Parkway | Houston | TX | 78970 |
| 3 | Hugh | Jackman | 5, Cool Way | Manhattan | NY | 67263 |
| 4 | Walter | White | 3828, Piermont Dr | Orlando | FL | 73498 |
| 5 | Frank | Underwood | 1609 Far St. | San Diego | CA | 29770 |
| 6 | Serena | Woods | 123 Far St. | Las Vegas | NV | 45334 |

If we choose purchase_date as the partition field, the sink will create a PartitionedFileSet and populate its partitions as follows:

| id | first_name | last_name | street_address | city | state | zipcode |
|---|---|---|---|---|---|---|
| 2 | David | Johnson | 3, Baypointe Parkway | Houston | TX | 78970 |
| 3 | Hugh | Jackman | 5, Cool Way | Manhattan | NY | 67263 |
| 6 | Serena | Woods | 123 Far St. | Las Vegas | NV | 45334 |

| id | first_name | last_name | street_address | city | state | zipcode |
|---|---|---|---|---|---|---|
| 1 | Douglas | Williams | 1, Vista Montana | San Jose | CA | 95134 |
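
The way records are routed to partitions by field value can be sketched in a few lines of Python. This is only an illustration of the grouping idea, not the plugin's implementation: the function name partition_records is hypothetical, state is used as the partition field purely for demonstration, and only a subset of the input columns is reproduced.

```python
from collections import defaultdict


def partition_records(records, partition_fields):
    """Group records by the tuple of values found in partition_fields."""
    partitions = defaultdict(list)
    for record in records:
        # The partition key is built from the values of the chosen fields.
        key = tuple(record[field] for field in partition_fields)
        partitions[key].append(record)
    return dict(partitions)


# A subset of the columns from the input records in the example.
customers = [
    {"id": 1, "first_name": "Douglas", "city": "San Jose", "state": "CA"},
    {"id": 2, "first_name": "David", "city": "Houston", "state": "TX"},
    {"id": 3, "first_name": "Hugh", "city": "Manhattan", "state": "NY"},
    {"id": 4, "first_name": "Walter", "city": "Orlando", "state": "FL"},
    {"id": 5, "first_name": "Frank", "city": "San Diego", "state": "CA"},
    {"id": 6, "first_name": "Serena", "city": "Las Vegas", "state": "NV"},
]

partitions = partition_records(customers, ["state"])
for key, rows in sorted(partitions.items()):
    print(key, [row["id"] for row in rows])
```

Partitioning on state yields five partitions, with records 1 and 5 sharing the CA partition. Passing more than one field name, for example ["state", "city"], produces one partition per distinct combination of values.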