Apache Kudu Sink (Deprecated)
This plugin is no longer available as of July 26, 2024.
A plugin for ingesting data into Apache Kudu. It can be configured for both batch and real-time pipelines.
Table Creation
When the plugin is used in a pipeline and is configured with macros for the table name, the master addresses, or both, table creation is deferred until the pipeline starts. If neither property uses a macro, the table is created at deployment time. In both cases, the schema is validated.
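The timing rule above can be sketched as a small decision function. This is only an illustration, not the plugin's actual code: the function names are hypothetical, and it assumes macros are written in the `${name}` form.

```python
import re

# Assumption: macros appear as ${name}; adjust the pattern if your pipeline differs.
MACRO_PATTERN = re.compile(r"\$\{[^}]+\}")


def contains_macro(value: str) -> bool:
    """Return True if the property value contains at least one macro reference."""
    return bool(MACRO_PATTERN.search(value))


def table_creation_phase(table_name: str, master_addresses: str) -> str:
    """Return 'runtime' if either property uses a macro, else 'deployment'."""
    if contains_macro(table_name) or contains_macro(master_addresses):
        # The real values are unknown until the pipeline starts.
        return "runtime"
    return "deployment"


if __name__ == "__main__":
    print(table_creation_phase("events", "kudu-master-1:7051"))
    print(table_creation_phase("${table}", "kudu-master-1:7051"))
```

The point of the sketch is that a single macro in either property is enough to defer creation, because the table cannot be located until both values are resolved.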
Querying from Impala
Using this plugin creates a table within Kudu. To query that table through Impala, run the following query to create a reference to the Kudu table as an external table within Impala. You can run it from impala-shell or the HUE interface.
CREATE EXTERNAL TABLE `<table-name>` STORED AS KUDU
TBLPROPERTIES(
'kudu.table_name' = '<table-name>',
'kudu.master_addresses' = '<kudu-master-1>:7051,<kudu-master-2>:7051'
);
The kudu.master_addresses property need not be specified if Impala is started with the -kudu_impala configuration. For more information on how this can be configured, check here.
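If you create these references for several tables, the statement can be generated from the table name and the master list. The helper below is a hypothetical sketch (its name and signature are not part of the plugin or Impala). It only builds the statement text shown above and leaves out `kudu.master_addresses` when no masters are passed. It performs no escaping, so it should only be used with trusted table names.

```python
from typing import List, Optional


def impala_external_table_ddl(table_name: str,
                              masters: Optional[List[str]] = None) -> str:
    """Build the CREATE EXTERNAL TABLE statement for an existing Kudu table."""
    props = ["'kudu.table_name' = '{}'".format(table_name)]
    if masters:
        # Masters are joined into one comma-separated <host>:<port> list.
        joined = ",".join(masters)
        props.append("'kudu.master_addresses' = '{}'".format(joined))
    body = ",\n".join(props)
    return (
        "CREATE EXTERNAL TABLE `{}` STORED AS KUDU\n"
        "TBLPROPERTIES(\n{}\n);".format(table_name, body)
    )


if __name__ == "__main__":
    print(impala_external_table_ddl("events",
                                    ["kudu-master-1:7051", "kudu-master-2:7051"]))
```

The printed statement can then be pasted into impala-shell or HUE.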
Configuration
Property | Macro Enabled? | Description |
---|---|---|
Reference Name | No | Required. Name used to uniquely identify this sink for lineage, annotating metadata, etc. |
Table Name | Yes | Required. The name of the Kudu table to which records are written. The plugin checks whether the table already exists. If it does, the plugin compares the schema of the existing table with the write schema specified for the plugin; if they don't match, an error is thrown at configuration time. If the table doesn't exist, it is created. |
Master Addresses | Yes | Required. The list of Kudu master hosts that this plugin will attempt to connect to, given as a comma-separated list of <hostname>:<port>. The connection is attempted after the plugin is initialized in the pipeline. |
Columns to be used as hash keys (comma separated list of values) | No | Optional. Comma-separated list of fields from the input to use as hash keys. All the fields must be non-null. |
No of buckets | No | Optional. Number of buckets the keys are split into. Default is 16. |
Seed to randomize the mapping of rows to hash buckets | No | Optional. The seed used to randomize the mapping of rows to hash buckets. Setting the seed will ensure the hashed columns contain user-provided values. Default is 1. |
Compression Algorithm | No | Optional. The compression algorithm to use for the columns. The same compression is applied to all fields. Default is Snappy. |
Encoding Type | No | Optional. Block encoding for the columns. The same encoding is applied to all fields. Default is Auto. |
User operations timeout in milliseconds | No | Optional. Timeout in milliseconds for user operations with Kudu. If you are writing large records, it's recommended to increase this timeout. Default is 30000 milliseconds (30 seconds). |
Administration operation timeout in milliseconds | No | Optional. Timeout in milliseconds for administrative operations, such as creating the table if it doesn't exist. This timeout is mainly used during the initialize phase of the plugin, when the table is created if it doesn't exist. Default is 30000 milliseconds. |
Number of copies | No | Optional. Number of replicas that each tablet of the table will have. If not specified, the default set on the server side is used, which is generally 1. Default is 1. |
Rows to be cached before being flushed | No | Optional. Number of rows to be cached before being flushed. Default is 1000. |
Specifies the number of boss threads to be used by the client | No | Optional. Number of boss threads used in the Kudu client to interact with the Kudu backend. Default is 1. |
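As an illustration of the Master Addresses format, the following sketch parses a comma-separated list of `<hostname>:<port>` entries and rejects malformed ones. The function is hypothetical and is not how the plugin itself validates the property. It assumes that every entry carries an explicit numeric port.

```python
from typing import List, Tuple


def parse_master_addresses(value: str) -> List[Tuple[str, int]]:
    """Split a comma-separated '<hostname>:<port>' list into (host, port) pairs."""
    result = []
    for entry in value.split(","):
        entry = entry.strip()
        # rpartition splits on the last ':' so the port is always the tail.
        host, sep, port = entry.rpartition(":")
        if not sep or not host or not port.isdigit():
            raise ValueError("expected <hostname>:<port>, got {!r}".format(entry))
        result.append((host, int(port)))
    return result


if __name__ == "__main__":
    print(parse_master_addresses("kudu-master-1:7051, kudu-master-2:7051"))
```

Checking the value like this before deployment only applies when the property is not supplied through a macro, since a macro value is not known until the pipeline starts.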
Data Type Mapping
CDAP Schema Data Type | Kudu Data Type |
---|---|
int | int |
short | short |
string | string |
bytes | binary |
double | double |
float | float |
boolean | bool |
union | first non-nullable type |
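The mapping table can be expressed as a lookup, with the union rule resolving to the first member that is not null. This is a minimal sketch, not the plugin's implementation. It assumes a union is represented as a list of type names, such as `["null", "string"]`, and the names `KUDU_TYPE_BY_CDAP_TYPE` and `kudu_type` are hypothetical.

```python
from typing import List, Union

# Mirrors the mapping table above; type names are plain strings here.
KUDU_TYPE_BY_CDAP_TYPE = {
    "int": "int",
    "short": "short",
    "string": "string",
    "bytes": "binary",
    "double": "double",
    "float": "float",
    "boolean": "bool",
}


def kudu_type(cdap_type: Union[str, List[str]]) -> str:
    """Resolve a CDAP type name, or a union given as a list, to a Kudu type."""
    if isinstance(cdap_type, list):
        # Union rule: use the first member that is not "null".
        for member in cdap_type:
            if member != "null":
                return kudu_type(member)
        raise ValueError("union has no non-null member")
    if cdap_type not in KUDU_TYPE_BY_CDAP_TYPE:
        raise ValueError("unsupported CDAP type: {!r}".format(cdap_type))
    return KUDU_TYPE_BY_CDAP_TYPE[cdap_type]


if __name__ == "__main__":
    print(kudu_type("bytes"))
    print(kudu_type(["null", "string"]))
```

Types that are not listed in the table raise an error in this sketch rather than being silently mapped.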
Created in 2020 by Google Inc.