Apache Kudu Sink

The Apache Kudu sink plugin is available in the Hub.

This plugin ingests data into Apache Kudu. It can be configured for both batch and real-time pipelines.

Table Creation

When the plugin is configured with macros for the table name, the master addresses, or both, table creation is deferred until the pipeline starts. If no macros are used, the table is created at deployment time. In both cases, the schema is validated.
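The timing rule above can be sketched as follows. This is a minimal illustration, not the plugin's actual code: the function names are hypothetical, and the pattern assumes the usual CDAP `${...}` macro syntax.

```python
import re

# Simplified check for CDAP-style macros such as ${table} or ${secure(key)}.
MACRO_PATTERN = re.compile(r"\$\{[^}]+\}")


def contains_macro(value: str) -> bool:
    """Return True if the property value contains at least one ${...} macro."""
    return bool(MACRO_PATTERN.search(value))


def table_creation_phase(table_name: str, master_addresses: str) -> str:
    """Report when the Kudu table would be created for the given settings.

    If either property contains a macro, its real value is unknown until the
    pipeline runs, so creation has to wait for pipeline start.
    """
    if contains_macro(table_name) or contains_macro(master_addresses):
        return "pipeline start"
    return "deployment"


print(table_creation_phase("users", "kudu-master-1:7051"))
print(table_creation_phase("${table}", "kudu-master-1:7051"))
```

The first call has no macros, so the table can be created at deployment; the second must wait until the macro is resolved at pipeline start.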

Querying from Impala

This plugin creates a table in Kudu. To query that table through Impala, run the following statement to register the Kudu table as an external table in Impala. You can run it from impala-shell or the Hue interface.

CREATE EXTERNAL TABLE `<table-name>` STORED AS KUDU TBLPROPERTIES( 'kudu.table_name' = '<table-name>', 'kudu.master_addresses' = '<kudu-master-1>:7051,<kudu-master-2>:7051' );

The kudu.master_addresses property need not be specified if Impala is started with the -kudu_impala configuration. For more information on how this can be configured, check here.
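If you register many tables, the statement above can be generated rather than typed by hand. The following is a minimal sketch: the helper name is hypothetical, it does no escaping of the table name, and it leaves out kudu.master_addresses when no masters are passed (the case where Impala already knows the Kudu masters).

```python
from typing import Optional, Sequence


def impala_external_table_ddl(
    table_name: str, masters: Optional[Sequence[str]] = None
) -> str:
    """Build the CREATE EXTERNAL TABLE statement for an existing Kudu table.

    masters is a list of "<hostname>:<port>" strings; pass None to leave out
    the 'kudu.master_addresses' property.
    """
    props = [f"'kudu.table_name' = '{table_name}'"]
    if masters:
        joined = ",".join(masters)
        props.append(f"'kudu.master_addresses' = '{joined}'")
    return (
        f"CREATE EXTERNAL TABLE `{table_name}` STORED AS KUDU "
        f"TBLPROPERTIES( {', '.join(props)} );"
    )


print(impala_external_table_ddl("events", ["kudu-master-1:7051", "kudu-master-2:7051"]))
print(impala_external_table_ddl("events"))
```

The generated text can then be pasted into impala-shell or Hue.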

Configuration

| Property | Macro Enabled? | Description |
| --- | --- | --- |
| Reference Name | No | Required. Name used to uniquely identify this sink for lineage, annotating metadata, etc. |
| Table Name | Yes | Required. The name of the Kudu table to which records are written. The plugin checks whether the table already exists. If it exists, the plugin compares the schema of the existing table with the write schema specified for the plugin; if they don't match, an error is thrown at configuration time. If the table doesn't exist, it is created. |
| Master Addresses | Yes | Required. The list of Kudu master hosts that this plugin will attempt to connect to, given as a comma-separated list of <hostname>:<port>. The connection is attempted after the plugin is initialized in the pipeline. |
| Columns to be used as hash keys (comma separated list of values) | No | Optional. Comma-separated list of input fields to be used as hash keys. All of these fields must be non-null. |
| No of buckets | No | Optional. Number of buckets the keys are split into. Default is 16. |
| Seed to randomize the mapping of rows to hash buckets | No | Optional. The seed value is used to randomize the mapping of rows to hash buckets. Setting a seed can provide some protection against denial-of-service attacks when the hashed columns contain user-provided values. Default is 1. |
| Compression Algorithm | No | Optional. The compression algorithm to be used for the columns. Several options are available; the same compression is applied to all fields. Default is Snappy. |
| Encoding Type | No | Optional. Block encoding for the columns. The same encoding is applied to all fields. Default is Auto. |
| User operations timeout in milliseconds | No | Optional. Timeout in milliseconds for user operations with Kudu. If you are writing large records, increasing this timeout is recommended. Default is 30000 milliseconds (30 seconds). |
| Administration operation timeout in milliseconds | No | Optional. Timeout in milliseconds for administrative operations, such as creating the table if it doesn't exist. This timeout mainly applies during the initialization phase of the plugin, when the table is created if it doesn't exist. Default is 30000 milliseconds. |
| Number of copies | No | Optional. Number of replicas that each tablet of the table will have. If not set, the server-side default is used, which is generally 1. Default is 1. |
| Rows to be cached before being flushed | No | Optional. Number of rows to be cached before being flushed. Default is 1000. |
| Specifies the number of boss threads to be used by the client | No | Optional. Number of boss threads used by the Kudu client to interact with the Kudu backend. Default is 1. |
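To make the required properties, the defaults, and the <hostname>:<port> format concrete, here is a minimal sketch of how a configuration could be resolved. It is not the plugin's code: the dictionary keys and function names are hypothetical, the default values are copied from the descriptions above, and macro values are not handled.

```python
from typing import Dict, List, Tuple

# Defaults taken from the property descriptions; key names are hypothetical.
DEFAULTS = {
    "buckets": 16,
    "seed": 1,
    "compression": "Snappy",
    "encoding": "Auto",
    "user_operation_timeout_ms": 30000,
    "admin_operation_timeout_ms": 30000,
    "replicas": 1,
    "rows_buffer": 1000,
    "boss_threads": 1,
}

REQUIRED = ("reference_name", "table_name", "master_addresses")


def parse_master_addresses(value: str) -> List[Tuple[str, int]]:
    """Split a comma-separated list of <hostname>:<port> into (host, port) pairs."""
    masters = []
    for entry in value.split(","):
        entry = entry.strip()
        host, sep, port = entry.rpartition(":")
        if not sep or not host or not port.isdigit():
            raise ValueError(f"expected <hostname>:<port>, got {entry!r}")
        masters.append((host, int(port)))
    return masters


def resolve_config(user_config: Dict[str, object]) -> Dict[str, object]:
    """Check required properties, fill in defaults, and parse the master list."""
    for key in REQUIRED:
        if not user_config.get(key):
            raise ValueError(f"missing required property: {key}")
    config = {**DEFAULTS, **user_config}
    config["masters"] = parse_master_addresses(str(config["master_addresses"]))
    return config


config = resolve_config({
    "reference_name": "kudu_sink",
    "table_name": "events",
    "master_addresses": "kudu-master-1:7051, kudu-master-2:7051",
    "rows_buffer": 5000,  # override one optional property
})
print(config["masters"], config["buckets"], config["rows_buffer"])
```

Only the three required properties have to be supplied; every optional property falls back to its documented default unless overridden.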

Data Type Mapping

| CDAP Schema Data Type | Kudu Data Type |
| --- | --- |
| int | int |
| short | short |
| string | string |
| bytes | binary |
| double | double |
| float | float |
| boolean | bool |
| union | first non-nullable type |
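The mapping above, including the union rule, can be expressed as a small lookup. This is a sketch under assumptions: the function name is hypothetical, and a union is represented as a list of type names in which "null" marks the nullable member.

```python
from typing import List, Union

# CDAP schema type -> Kudu type, as listed in the mapping above.
CDAP_TO_KUDU = {
    "int": "int",
    "short": "short",
    "string": "string",
    "bytes": "binary",
    "double": "double",
    "float": "float",
    "boolean": "bool",
}


def kudu_type(cdap_type: Union[str, List[str]]) -> str:
    """Return the Kudu type for a CDAP type name or a union of type names."""
    if isinstance(cdap_type, list):
        # A union maps to its first non-nullable member.
        non_null = [t for t in cdap_type if t != "null"]
        if not non_null:
            raise ValueError("union has no non-nullable type")
        cdap_type = non_null[0]
    if cdap_type not in CDAP_TO_KUDU:
        raise ValueError(f"unsupported CDAP type: {cdap_type}")
    return CDAP_TO_KUDU[cdap_type]


print(kudu_type("bytes"))
print(kudu_type(["null", "string"]))
```

A nullable string field, written as the union of "null" and "string", therefore resolves to the Kudu string type.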

Created in 2020 by Google Inc.