Introduction

A separate plugin that supports reading from multiple tables using JDBC. It is often used in conjunction with the DynamicMultiFileset sink to dump multiple tables to HDFS files in a single pipeline.

User Stories

  • Users should be able to specify a pattern that defines which tables to read from
  • Users should be able to specify a pattern that defines which schemas are used to list the tables
  • Users should be able to specify a whitelist of tables to read
  • Users should be able to specify tables to skip

Plugin Type

  •  Batch Source

Design

The source outputs one record for each row in the tables it reads, with each record containing an additional field that holds the name of the table the record came from. In addition, for each table that will be read, the plugin sets a pipeline argument whose key is 'multisink.[tablename]' and whose value is the schema of that table. This is what allows it to work with the DynamicMultiFileset sink.
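
As a rough illustration of this argument contract (a minimal sketch; the class and helper names below are hypothetical, not the plugin's actual code):

Code Block
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of the pipeline-argument contract described above.
// 'fetchTableSchema' is a hypothetical placeholder, not plugin code.
public class MultiSinkArgumentsSketch {

  static Map<String, String> buildArguments(Iterable<String> tables) {
    Map<String, String> arguments = new HashMap<>();
    for (String table : tables) {
      // Key: 'multisink.[tablename]', value: the schema of that table.
      arguments.put("multisink." + table, fetchTableSchema(table));
    }
    return arguments;
  }

  // Hypothetical helper: would normally derive the schema via JDBC metadata.
  static String fetchTableSchema(String table) {
    return "{\"type\":\"record\",\"name\":\"" + table + "\",\"fields\":[]}";
  }
}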

User Configurations

  • referenceName: This will be used to uniquely identify this source for lineage, annotating metadata, etc.
  • connectionString: The JDBC connection string to the database. For example: jdbc:mysql://HOST/DATABASE.
  • user: The username to use when connecting to the database.
  • password: The password to use when connecting to the database.
  • jdbcPluginName: The name of the JDBC plugin to use.
  • enableAutoCommit: Whether to enable auto-commit for queries run by this source. Defaults to false. This setting only matters if you are using a JDBC driver that does not support a false value for auto-commit, or a driver that does not support the commit call. For example, the Hive JDBC driver throws an exception whenever commit is called. For drivers like that, this should be set to true.
  • tableNamePattern: A pattern that defines which tables should be read from. Any table whose name matches the pattern will be read. If not specified, all tables will be read. Pattern syntax is specific to the type of database being connected to (see the sketch after this list).
  • whereClause: Filters which records are consumed from each table, e.g. where updated_at > '2018-08-20 00:00:00'. The where clause is applied to every table that is read, so every column mentioned in it must be present in each table.
  • schemaNamePattern: A pattern that defines which schemas are used to list the tables. Any schema whose name matches the pattern will be used. If not specified, all schemas will be used. Pattern syntax is specific to the type of database being connected to.
  • tableNameField: The name of the field that holds the table name. Must not be the name of any table column that will be read. Defaults to 'tablename'.
  • whiteList: Used in conjunction with tableNamePattern, this configuration specifies tables to be read. If no value is specified, all tables matching tableNamePattern will be read.
  • blackList: Used in conjunction with tableNamePattern, this configuration specifies tables to be skipped. By default the blacklist is empty, meaning no tables are skipped.
  • numSplits: The number of splits per table. Defaults to 1.
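
A minimal sketch of how these pattern and list settings could interact, using the standard JDBC DatabaseMetaData.getTables lookup (the filtering logic shown is illustrative, not the plugin's actual implementation):

Code Block
import java.sql.Connection;
import java.sql.DatabaseMetaData;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Illustrative sketch: list tables via JDBC metadata using the configured
// patterns, then apply the whiteList / blackList filters.
public class TableListingSketch {

  static List<String> listTables(Connection conn, String schemaNamePattern,
                                 String tableNamePattern, Set<String> whiteList,
                                 Set<String> blackList) throws SQLException {
    List<String> tables = new ArrayList<>();
    DatabaseMetaData meta = conn.getMetaData();
    // Pattern syntax here is whatever the underlying database supports.
    try (ResultSet rs = meta.getTables(null, schemaNamePattern,
                                       tableNamePattern, new String[] {"TABLE"})) {
      while (rs.next()) {
        String name = rs.getString("TABLE_NAME");
        // An empty whitelist means every matching table is eligible.
        boolean whiteListed = whiteList.isEmpty() || whiteList.contains(name);
        if (whiteListed && !blackList.contains(name)) {
          tables.add(name);
        }
      }
    }
    return tables;
  }
}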

Split Mechanism

To improve performance with large tables, internal per-table splits were added. Via the numSplits property, the user can set the number of splits per table. Each table that has exactly one primary key column will be split into the specified number of splits. For now, only the following primary key column types are supported (similar to the Database plugins):

  • NUMERIC
  • DECIMAL
  • BIT
  • BOOLEAN
  • INTEGER
  • TINYINT
  • SMALLINT
  • BIGINT
  • REAL
  • FLOAT
  • DOUBLE
  • CHAR
  • VARCHAR
  • LONGVARCHAR
  • DATE
  • TIME
  • TIMESTAMP

If the type of the primary key column is not supported for internal splits, or the table has more than one primary key column, only one split per table will be generated.

For example, given the table:

Code Block
+-----+----------+------------------+
| ID  | name     | email            |
+-----+----------+------------------+
| 0   | Samuel   | sjax@example.net |
| 1   | Alice    | a@example.net    |
| 2   | Bob      | b@example.net    |
| 3   | John     | j@example.net    |
+-----+----------+------------------+

if ID is the primary key column and numSplits is set to 2, then two splits will be generated with the following queries:

Code Block
SELECT * FROM TABLE WHERE ( ID >= 0 ) AND ( ID < 2 )
SELECT * FROM TABLE WHERE ( ID >= 2 ) AND ( ID <= 3 )
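
A minimal sketch of how such split boundaries could be computed for a numeric primary key, assuming evenly sized ranges between the column's MIN and MAX values (the class and method names are illustrative, not the plugin's actual code):

Code Block
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: divide [MIN, MAX] of the primary key column into
// numSplits ranges and build one query per range.
public class SplitSketch {

  static List<String> buildSplitQueries(Connection conn, String table,
                                        String keyColumn, int numSplits) throws SQLException {
    try (Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery(
             "SELECT MIN(" + keyColumn + "), MAX(" + keyColumn + ") FROM " + table)) {
      rs.next();  // MIN/MAX always returns exactly one row
      long min = rs.getLong(1);
      long max = rs.getLong(2);
      long chunk = Math.max(1, (max - min + 1) / numSplits);
      List<String> queries = new ArrayList<>();
      for (int i = 0; i < numSplits; i++) {
        long lo = min + i * chunk;
        // The last split is closed on the upper bound so no rows are missed.
        String upper = (i == numSplits - 1)
            ? "( " + keyColumn + " <= " + max + " )"
            : "( " + keyColumn + " < " + (lo + chunk) + " )";
        queries.add("SELECT * FROM " + table
            + " WHERE ( " + keyColumn + " >= " + lo + " ) AND " + upper);
      }
      return queries;
    }
  }
}

For the table above (min ID 0, max ID 3, numSplits 2), this produces exactly the two range queries shown in the example.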

Example

This example reads from all tables in the 'customers' database on host 'host123.example.net':

...