Introduction
A separate plugin to support reading from multiple tables using JDBC. It is often used in conjunction with the DynamicMultiFileset sink to dump multiple tables to HDFS files in a single pipeline.
User Stories
- Users should be able to specify a pattern that defines which tables should be read
- Users should be able to specify a pattern that defines which schemas should be used to list the tables
- Users should be able to specify a white list of tables that should be read
- Users should be able to specify tables that should be skipped
Plugin Type
- Batch Source
Design
The source will output a record for each row in the tables it reads, with each record containing an additional field that holds the name of the table the record came from. In addition, for each table that will be read, this plugin will set pipeline arguments where the key is 'multisink.[tablename]' and the value is the schema of the table. This is to make it work with the DynamicMultiFileset.
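For illustration, here is a minimal sketch of the argument-key convention described above. The table names are hypothetical, and the Map merely stands in for the pipeline's runtime arguments that the source would populate when preparing the run:

    import java.util.LinkedHashMap;
    import java.util.Map;

    public class MultiSinkArgumentsSketch {
      public static void main(String[] args) {
        // Stand-in for the pipeline's runtime arguments.
        Map<String, String> pipelineArguments = new LinkedHashMap<>();
        // Hypothetical tables discovered by the source:
        for (String table : new String[] {"customers", "orders"}) {
          // The real value is the table's schema; a placeholder is used here.
          pipelineArguments.put("multisink." + table, "<schema of " + table + ">");
        }
        // Prints: [multisink.customers, multisink.orders]
        System.out.println(pipelineArguments.keySet());
        // Each emitted record additionally carries its source table's name in
        // the field configured by tableNameField (default 'tablename').
      }
    }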
User Configurations
- referenceName: This will be used to uniquely identify this source for lineage, annotating metadata, etc.
- connectionString: The JDBC connection string to the database. For example: jdbc:mysql://HOST/DATABASE.
- user: The username to use when connecting to the database.
- password: The password to use when connecting to the database.
- jdbcPluginName: The name of the JDBC plugin to use.
- enableAutoCommit: Whether to enable auto-commit for queries run by this source. Defaults to false. This setting only matters if you are using a JDBC driver that does not support a false value for auto-commit, or that does not support the commit call. For example, the Hive JDBC driver throws an exception whenever commit is called; for drivers like that, this should be set to true.
- tableNamePattern: A pattern that defines which tables should be read from. Any table whose name matches the pattern will be read. If not specified, all tables will be read. Pattern syntax is specific to the type of database that is being connected to. See the selection sketch after this list for how this interacts with the schema pattern, white list, black list, and where clause.
- whereClause: Filters which records should be read from each table, e.g. where updated_at > '2018-08-20 00:00:00'. The where clause will be applied to every table that is read, so every column referenced in the where clause must be present in each table.
- schemaNamePattern: A pattern that defines which schemas should be used to list the tables. Any schema whose name matches the pattern will be read. If not specified, all schemas will be read. Pattern syntax is specific to the type of database that is being connected to.
- tableNameField: The name of the field that holds the table name. Must not be the name of any table column that will be read. Defaults to 'tablename'.
- whiteList: Used in conjunction with tableNamePattern, this configuration specifies the tables to be read. If no value is specified, all tables matching the tableNamePattern will be read; this is the default behavior.
- blackList: Used in conjunction with tableNamePattern, this configuration specifies the tables to be skipped. By default the black list is empty, which means no tables will be skipped.
- numSplits: The number of splits per table. Defaults to 1.
...
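To make the table-selection configurations concrete, the sketch below shows one way schemaNamePattern, tableNamePattern, whiteList, blackList, and whereClause could combine using standard JDBC metadata calls. The connection details, table names, and filtering logic are illustrative assumptions, not the plugin's actual implementation:

    import java.sql.Connection;
    import java.sql.DatabaseMetaData;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    public class TableSelectionSketch {
      public static void main(String[] args) throws Exception {
        // Hypothetical connection details.
        try (Connection conn = DriverManager.getConnection(
            "jdbc:mysql://HOST/DATABASE", "user", "password")) {
          DatabaseMetaData meta = conn.getMetaData();
          // schemaNamePattern and tableNamePattern map directly onto the JDBC
          // metadata lookup; null and "%" both mean "match everything".
          String schemaNamePattern = null;
          String tableNamePattern = "%";
          Set<String> whiteList = new HashSet<>(); // empty: keep all matches
          Set<String> blackList = new HashSet<>(Arrays.asList("audit_log")); // hypothetical

          List<String> tables = new ArrayList<>();
          try (ResultSet rs = meta.getTables(null, schemaNamePattern,
                                             tableNamePattern, new String[] {"TABLE"})) {
            while (rs.next()) {
              String table = rs.getString("TABLE_NAME");
              boolean whiteListed = whiteList.isEmpty() || whiteList.contains(table);
              if (whiteListed && !blackList.contains(table)) {
                tables.add(table);
              }
            }
          }
          // The optional whereClause is appended to the query built for every
          // table, so each column it references must exist in every table.
          String whereClause = "where updated_at > '2018-08-20 00:00:00'";
          for (String table : tables) {
            System.out.println("SELECT * FROM " + table + " " + whereClause);
          }
        }
      }
    }

Because the same where clause is applied to every selected table, a filter on a column like updated_at only works if that column is present in all of them.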