Introduction
A separate plugin that supports reading from multiple tables using JDBC. It is often used in conjunction with the DynamicMultiFileset sink to dump multiple tables to HDFS files in a single pipeline.
User Stories
- Users should be able to specify a pattern that defines which tables should be read
- Users should be able to specify a pattern that defines which schemas should be used to list the tables
- Users should be able to specify a white list of tables that should be read
- Users should be able to specify tables to be skipped
Plugin Type
- Batch Source
Design
The source will output a record for each row in the tables it reads, with each record containing an additional field that holds the name of the table the record came from. In addition, for each table that will be read, this plugin will set pipeline arguments where the key is 'multisink.[tablename]' and the value is the schema of the table. This is to make it work with the DynamicMultiFileset.
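As a rough illustration of this contract, the sketch below shows the kind of pipeline arguments the source would set for two hypothetical tables named 'customers' and 'orders'. The table names, columns, and the exact schema JSON are placeholders; the real serialization format and the API used to set arguments are defined by CDAP and are not shown here.

import java.util.HashMap;
import java.util.Map;

public class MultiSinkArgumentsSketch {
    public static void main(String[] args) {
        // One pipeline argument per table that will be read:
        //   key   = "multisink." + tableName
        //   value = that table's schema (illustrative JSON below, not the exact format)
        Map<String, String> pipelineArguments = new HashMap<>();
        pipelineArguments.put("multisink.customers",
            "{\"type\":\"record\",\"name\":\"customers\",\"fields\":["
          + "{\"name\":\"id\",\"type\":\"int\"},{\"name\":\"name\",\"type\":\"string\"}]}");
        pipelineArguments.put("multisink.orders",
            "{\"type\":\"record\",\"name\":\"orders\",\"fields\":["
          + "{\"name\":\"id\",\"type\":\"int\"},{\"name\":\"total\",\"type\":\"double\"}]}");

        // In addition, every record emitted by the source carries an extra field
        // (default name 'tablename') holding the table it came from, so the
        // DynamicMultiFileset sink can route it to the right output.
        pipelineArguments.forEach((k, v) -> System.out.println(k + " -> " + v));
    }
}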
User Configurations
- referenceName: This will be used to uniquely identify this source for lineage, annotating metadata, etc.
- connectionString: The JDBC connection string to the database. For example: jdbc:mysql://HOST/DATABASE.
- user: The username to use when connecting to the database.
- password: The password to use when connecting to the database.
- jdbcPluginName: The name of the JDBC plugin to use.
- enableAutoCommit: Whether to enable auto-commit for queries run by this source. Defaults to false. This setting only matters if you are using a JDBC driver that does not support a false value for auto-commit, or a driver that does not support the commit call. For example, the Hive JDBC driver will throw an exception whenever commit is called. For drivers like that, this should be set to true.
- tableNamePattern: A pattern that defines which tables should be read. Any table whose name matches the pattern will be read. If not specified, all tables will be read. Pattern syntax is specific to the type of database that is being connected to.
- whereClause: A filter that determines which records are read from each table, for example: where updated_at > '2018-08-20 00:00:00'. The where clause is applied to every table that is read, so all columns mentioned in the where clause must be present in each table.
- schemaNamePattern: A pattern that defines which schemas should be used to list the tables. Any schema whose name matches the pattern will be used. If not specified, all schemas will be used. Pattern syntax is specific to the type of database that is being connected to.
- tableNameField: The name of the field that holds the table name. Must not be the name of any table column that will be read. Defaults to 'tablename'.
- whiteList: Used in conjunction with tableNamePattern, this configuration specifies the tables to be read. If no value is specified, all tables matching the tableNamePattern will be read.
- blackList: Used in conjunction with tableNamePattern, this configuration specifies the tables to be skipped. By default the black list is empty, which means no tables will be skipped. A sketch of how the pattern, white list, and black list combine follows this list.
- numSplits: The number of splits per table. Defaults to 1.
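The sketch below shows one plausible way schemaNamePattern, tableNamePattern, whiteList, and blackList could combine, using the standard JDBC DatabaseMetaData.getTables call. It is not the plugin's actual code; the connection string, credentials, patterns, and list contents are placeholders.

import java.sql.Connection;
import java.sql.DatabaseMetaData;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class TableEnumerationSketch {
    public static void main(String[] args) throws Exception {
        String schemaNamePattern = "public";     // placeholder
        String tableNamePattern  = "sales_%";    // placeholder; syntax is database-specific
        Set<String> whiteList = new HashSet<>(Arrays.asList("sales_2018", "sales_2019"));
        Set<String> blackList = new HashSet<>(Arrays.asList("sales_tmp"));

        List<String> tablesToRead = new ArrayList<>();
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://HOST/DATABASE", "user", "password")) {
            DatabaseMetaData meta = conn.getMetaData();
            // getTables applies the schema and table name patterns on the database side.
            try (ResultSet rs = meta.getTables(null, schemaNamePattern, tableNamePattern,
                                               new String[] {"TABLE"})) {
                while (rs.next()) {
                    String table = rs.getString("TABLE_NAME");
                    // A non-empty white list restricts the matched tables further;
                    // the black list removes tables even if they matched everything else.
                    if (!whiteList.isEmpty() && !whiteList.contains(table)) continue;
                    if (blackList.contains(table)) continue;
                    tablesToRead.add(table);
                }
            }
        }
        System.out.println("Tables to read: " + tablesToRead);
    }
}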
Split Mechanism
To improve performance with large tables, internal splits per table were added. The numSplits property sets the number of splits per table. Each table that has exactly one primary key column will be split into the specified number of splits. For now, only the following primary key column types are supported (similar to the Database plugins):
NUMERIC
DECIMAL
BIT
BOOLEAN
INTEGER
TINYINT
SMALLINT
BIGINT
REAL
FLOAT
DOUBLE
CHAR
VARCHAR
LONGVARCHAR
DATE
TIME
TIMESTAMP
If the primary key column type is not supported for internal splits, or if the table has more than one primary key column, only one split will be generated for that table.
For example, for the table:
+----+--------+------------------+
| ID | name   | email            |
+----+--------+------------------+
| 0  | Samuel | sjax@example.net |
| 1  | Alice  | a@example.net    |
| 2  | Bob    | b@example.net    |
| 3  | John   | j@example.net    |
+----+--------+------------------+
if ID is the primary key column and numSplits is set to 2, there will be 2 splits with the following queries:
SELECT * FROM TABLE WHERE ( ID >= 0 ) AND ( ID < 2 )
SELECT * FROM TABLE WHERE ( ID >= 2 ) AND ( ID <= 3 )
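A simplified sketch of how such split queries could be derived for a numeric primary key is shown below: find the minimum and maximum key values (e.g. with SELECT MIN(ID), MAX(ID) FROM TABLE), divide the range into numSplits buckets, and emit one bounded query per bucket. The class and method names are hypothetical, and the actual plugin's splitter may differ in detail.

import java.util.ArrayList;
import java.util.List;

public class NumericSplitSketch {
    // Builds one query per split for an integer primary key, given the
    // MIN and MAX values of that key in the table.
    static List<String> buildSplits(String column, long min, long max, int numSplits) {
        List<String> queries = new ArrayList<>();
        long range = max - min + 1;
        long step = Math.max(1, range / numSplits);
        long lower = min;
        for (int i = 0; i < numSplits && lower <= max; i++) {
            boolean last = (i == numSplits - 1);
            long upper = last ? max : Math.min(max, lower + step - 1);
            // Use an inclusive upper bound only on the final split.
            String upperCondition = (last || upper == max)
                ? "( " + column + " <= " + upper + " )"
                : "( " + column + " < " + (upper + 1) + " )";
            queries.add("SELECT * FROM TABLE WHERE ( " + column + " >= " + lower + " ) AND " + upperCondition);
            lower = upper + 1;
        }
        return queries;
    }

    public static void main(String[] args) {
        // For the example table above: min ID = 0, max ID = 3, numSplits = 2,
        // which reproduces the two queries listed in this section.
        buildSplits("ID", 0, 3, 2).forEach(System.out::println);
    }
}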
Example
This example reads from all tables in the 'customers' database on host 'host123.example.net':
...