PostgreSQL database plugin

PostgreSQL database plugin

Introduction

A separate database plugin to support PostgreSQL-specific features and configurations.

Use-Case

  • Users can choose and install PostgreSQL source and sink plugins.

  • Users should see PostgreSQL logo on plugin configuration page for better experience.

  • Users should get relevant information from the tool tip:

    • The tool tip for the connection string should be customized specifically to the PostgreSQL database,

    • The tool tip should describe accurately what each field is used for.

  • User should get a performance comparable to Sqoop by utilizing sqoop libraries for the data ingestion and egress.

  • Users should not have to specify any redundant configuration (ex: JDBC type in source plugin, columns in the sink plugin).

  • Users should get field level lineage for the source and sink that is being used.

  • Reference documentation should be updated to account for the changes.

  • The source code for PostgreSQL database plugin should be placed in repo under data-integrations org.

  • Integration tests for PostgreSQL database plugin should be added in the test repo.

  • The data pipeline using source and sink plugins should run on both mapreduce and spark engines.

User Stories

  • User should be able to install PostgreSQL specific database source and sink plugins from the Hub

  • Users should have each tool tip accurately describe what each field does

  • Users should get field level lineage information for the PostgreSQL source and sink 

  • Users should get a performance comparable to Sqoop when ingesting data from PostgreSQL and while writing data to PostgreSQL (within ~15% of the time taken for sqoop)

  • Users should be able to setup a pipeline avoiding specifying redundant information

  • Users should get updated reference document for PostgreSQL source and sink

  • Users should be able to read all the DB types

Plugin Type

Batch Source
Batch Sink 
Real-time Source
Real-time Sink
Action
Post-Run Action
Aggregate
Join
Spark Model
Spark Compute

Design Tips

PostgreSQL connector reference: https://jdbc.postgresql.org/download/postgresql-9.4.1211.jar

Existing database plugins: https://github.com/cdapio/hydrator-plugins/tree/develop/database-plugins

PostgreSQL datatypes mappings and conversions:

 

Design

The suggestion is to create maven submodule PostgreSQL under database-plugins repo.

 

Sink Properties

User Facing Name

Type

Description

Constraints

User Facing Name

Type

Description

Constraints

Label

String

Label for UI

 

Reference Name

String

Uniquely identified name for lineage

 

Host

String

PostgreSQL host

Required (defaults to localhost on UI)

Port

Number

Specific port where PostgreSQL running on

Optional

(default 5432)

Database

String

Database name to connect

Required

Username

String

DB username

Required

Password

Password

User password

Required

Transaction Isolation Level

Select

Transaction isolation level for queries run by this sink

 

Connection Arguments

Keyvalue

A list of arbitrary string tag/value pairs as connection arguments, list of properties

https://jdbc.postgresql.org/documentation/head/connect.html#connection-parameters

 

Table Name

String

Name of a database table to write to

 

Connect Timeout

Number

The timeout value used for socket connect operations. If connecting to the server takes longer than this value, the connection is broken. The timeout is specified in seconds and a value of zero means that it is disabled

 

Source Properties

 

User Facing Name

Type

Description

Constraints

User Facing Name

Type

Description

Constraints

Label

String

Label for UI

 

Reference Name

String

Uniquely identified name for lineage

 

Host

String

PostgreSQL host

Required (defaults to localhost on UI)

Port

Number

Specific port where PostgreSQL running on

Optional

(default 5432)

Database

String

Database name to connect

Required

Import Query

String

Query for import data

Valid SQL query

Username

String

DB username

Required

Password

String

User password

Required

Bounding Query

String

Returns max and min of split-By Filed

Valid SQL query

Split-By Field Name

String

Field name which will be used to generate splits

 

Number of Splits to Generate

Number

Number of splits to generate

 

Transaction Isolation Level

Select

Transaction isolation level for queries run by this sink

 

Connection Arguments

Keyvalue

A list of arbitrary string tag/value pairs as connection arguments, list of properties https://jdbc.postgresql.org/documentation/head/connect.html#connection-parameters

 

Connect Timeout

Number

The timeout value used for socket connect operations. If connecting to the server takes longer than this value, the connection is broken. The timeout is specified in seconds and a value of zero means that it is disabled

 

 

Action Properties

 

User Facing Name

Type

Description

Constraints

User Facing Name

Type

Description

Constraints

Label

String

Label for UI

 

Host

String

PostgreSQL host

Required (defaults to localhost on UI)

Port

Number

Specific port where PostgreSQL running on

Optional

(default 5432)

Database

String

Database name to connect

Required

Username

String

DB username

Required

Password

String

User password

Required

Connection Arguments

Keyvalue

A list of arbitrary string tag/value pairs as connection arguments, list of properties 

https://jdbc.postgresql.org/documentation/head/connect.html#connection-parameters

 

Database Command

String

Database command to run

Valid SQL query

Connect Timeout

Number

The timeout value used for socket connect operations. If connecting to the server takes longer than this value, the connection is broken. The timeout is specified in seconds and a value of zero means that it is disabled

 

 

Data Types Mapping

Postgres Data Type

CDAP Schema Data Type

Support

Comment

Postgres Data Type

CDAP Schema Data Type

Support

Comment

BIGINT

Schema.Type.LONG

+

 

BIGSERIAL

Schema.Type.LONG

+

Serial is autoincremented

BIT(N)

Schema.Type.STRING

+

Bit strings are strings of 1's and 0's

BIT VARYING(N)

Schema.Type.STRING

+

Bit strings are strings of 1's and 0's

BOOLEAN

Schema.Type.BOOLEAN

+

 

BYTEA

Schema.Type.BYTES

+

 

CHARACTER

Schema.Type.STRING

+

 

CHARACTER VARYING

Schema.Type.STRING

+

 

DOUBLE PRECISION

Schema.Type.DOUBLE

+

 

INTEGER

Schema.Type.INT

+

 

NUMERIC(p, s)/DECIMAL(p, s)

Schema.LogicalType.DECIMAL

+

 

REAL

Schema.Type.FLOAT

+

 

SMALLINT

Schema.Type.INT

+

 

SMALLSERIAL

Schema.Type.INT

+

Serial is autoincremented

SERIAL

Schema.Type.INT

+

Serial is autoincremented

TEXT

Schema.Type.STRING

+

 

DATE

Schema.LogicalType.DATE

+

 

TIME [ (P) ] [ WITHOUT TIME ZONE ]

Schema.LogicalType.TIME_MICROS

+

 

TIME [ (P) ] WITH TIME ZONE

Schema.Type.STRING

+

 

TIMESTAMP [ (P) ] [ WITHOUT TIME ZONE ]

Schema.LogicalType.TIMESTAMP_MICROS

+

 

TIMESTAMP [ (P) ] WITH TIME ZONE

Schema.LogicalType.TIMESTAMP_MICROS

+

Postgresql converts it to UTC(see "Time Stamps" section)

XML

Schema.Type.STRING

+

 

TSQUERY

Schema.Type.STRING

+

 

TSVECTOR

Schema.Type.STRING

+

 

TXID_SNAPSHOT

 

-

Postgresql specific, see documentation

UUID

Schema.Type.STRING

+

 

BOX

Schema.Type.STRING

+

 

CIDR

Schema.Type.STRING

+

 

CIRCLE

Schema.Type.STRING

+

 

INET

Schema.Type.STRING

+

 

INTERVAL

Schema.Type.STRING

+

Created in 2020 by Google Inc.