Hive Bulk Export Action

The Hive Bulk Export action plugin is available in the Hub.

Plugin version: 1.9.0-1.1.0

The Hive Bulk Export action takes a SELECT query as input, runs that query against a Hive table, and stores the results under the provided HDFS directory. The plugin converts the SELECT query into an INSERT OVERWRITE DIRECTORY Hive statement. When this statement is executed, Hive starts a MapReduce job that writes the results to the provided directory, so the output can consist of multiple files in that directory.
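For example, a minimal sketch of what the generated statement might look like, assuming a hypothetical employee table, an output directory of /tmp/export, and the default comma separator (the exact statement the plugin builds may differ):

  INSERT OVERWRITE DIRECTORY '/tmp/export'
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  SELECT id, name, salary FROM employee;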

Important: The Hive Bulk Export action works with Hive 2.3.3.

If anything other than a valid SELECT query is provided, the pipeline will fail to publish. This is because CDAP uses Apache Calcite to parse the statement and verify that it is a SELECT query and not any other type of SQL statement.
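For example, assuming a hypothetical employee table, a statement such as SELECT name FROM employee WHERE id > 100 passes this validation, while statements such as DROP TABLE employee or INSERT INTO employee VALUES (...) are rejected and the pipeline cannot be published.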

If the Overwrite Output Directory property is set to no and the output directory already exists, publishing the pipeline will fail. In that case, either remove the directory or allow it to be overwritten by setting the Overwrite Output Directory property to yes.
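For example, an existing output directory can be removed on the cluster with a command such as hdfs dfs -rm -r /tmp/hive before publishing the pipeline.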

You might use the Hive Bulk Export action to execute a SELECT query on one or more Hive tables and write the results to a provided directory location in CSV format.

Configuration

Property: Hive Metastore Username
Macro Enabled: Yes
Description: User identity for connecting to the specified Hive database. Required for databases that need authentication; optional for databases that do not.

Property: Hive Metastore Password
Macro Enabled: Yes
Description: Password used to connect to the specified database. Required for databases that need authentication; optional for databases that do not.

Property: JDBC Connection String
Macro Enabled: Yes
Description: Required. JDBC connection string, including the database name. Use auth=delegationToken; the CDAP platform provides the appropriate delegation token while running the pipeline.

Property: Select Statement
Macro Enabled: Yes
Description: Required. SELECT statement used to select values from one or more Hive tables.

Property: Output Directory
Macro Enabled: Yes
Description: Required. HDFS directory path where the exported data is written. If the directory does not exist, it is created. If it already exists, it is either overwritten or causes a failure at publish time, depending on the Overwrite Output Directory property.

Property: Overwrite Output Directory
Macro Enabled: Yes
Description: If yes is selected and the HDFS path exists, it is overwritten. If no is selected and the HDFS path exists, pipeline deployment fails while publishing the pipeline. Default is yes.

Property: Column Separator
Description: Delimiter used in the exported file. Values in each column are separated by this delimiter when writing to the output file. Default is comma (,).

Example

This example connects to a Hive database using the specified JDBC Connection String, which connects to the ‘mydb’ database of a Hive instance running on ‘localhost’, and runs the SELECT query as an INSERT OVERWRITE DIRECTORY statement. It writes the data to files under the /tmp/hive directory using a comma delimiter.

Hive Metastore Username: username
Hive Metastore Password: password
JDBC Connection String: jdbc:hive2://localhost:10000/mydb;auth=delegationToken
Select Statement: SELECT * FROM employee JOIN salary ON (employee.id = salary.id)
Output Directory: /tmp/hive
Overwrite Output Directory: yes
Column Separator: ,
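With these values, the plugin runs a statement roughly equivalent to the following sketch (the exact statement it builds may differ) and writes the resulting files under /tmp/hive on HDFS:

  INSERT OVERWRITE DIRECTORY '/tmp/hive'
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  SELECT * FROM employee JOIN salary ON (employee.id = salary.id);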
