Hydrator 3.5 - Joins
Goals
Checklist
Use Cases
A data scientist wants to join data from multiple datasets to run a machine learning model. The first dataset, Customers, contains cust_id, cust_name and location. Second dataset, Purchases, contains purchase_id, cust_id and store. Third dataset, Transactions contains cust_id, tranasaction_data and purchase_amount. The data scientist wants to inner join these datasets on cust_id.
Customers:
Purchases:
Transactions:
Joining above three datasets would give us following output:
1 -> <Customers: (1, Alex, Canada), Purchases: (23, 1, Walmart), Transactions: (1, 2015-01-23, 20)>2 -> <Customers: (2, Bob, California), Purchases: (24, 2, Dstore), Transactions: (2, 2015-01-12, 400)>
2 -> <Customers: (2, Bob, California), Purchases: (25, 2, Walmart), Transactions: (2, 2015-01-12, 400)>
A data scientist wants to join data from multiple datasets. The first dataset, Customers, contains cust_id, cust_name and location. Second dataset, Purchases, contains purchase_id, cust_id and store. Third dataset, Transactions contains purchase_id, tranasaction_data and purchase_amount. The data scientist wants to inner join Customers and Purchases on cust_id and the resultant dataset can be joined to Transactions on purchase_id
Customers:
Purchases:
CustomersPurchases:
Transactions:
CustomersPurchasesTransactions:
A data scientist wants to join two or more datasets and wants to selects columns present in the output of join. For example, Customers has cust_id, cust_name and location. Second dataset, Purchases, has purchase_id, cust_id and store. After inner join of these 2 datasets on cust_id, output dataset should have cust_id, cust_name and purchase_id.
Customers:
Purchases:
CustomersPurchases:
User Stories
As a pipeline developer, I want to be able to join (inner, left outer, right outer, full outer) two or more stage outputs on some common fields, or do a cross join.
As a pipeline developer, I want to be able to get metrics on number of records in and records out of the join.
[UI] As a pipeline developer, I want to be able to see the schema of all input into the join, and the schema output by the join.
As a pipeline developer, I want to be able to choose whether the pipeline with the join runs with mapreduce or spark.
As a plugin developer, I want to be able to write a plugin that gets data from multiple stages joins them.
Design
To support Join on two or more stages, we will need a new plugin type. We propose 2 designs for this plugin. After the discussion, we have decided to implement join with design 1.
Design 1:
In this design we introduce new Joiner plugin type. The API for this plugin will look like:
Joiner.java
public interface Joiner<RECORD_KEY, RECORD> {
/**
* Returns record key based on record and stage name to which records belongs to.
*
* @param input record which needs to be joined
* @param stageName stage name to which records belogs to
* @return Join key
* @throws Exception if there is some error getting the key for join
*/
public RECORD_KEY joinOn(String stageName, RECORD record) throws Exception;
}
To understand the api, lets take a simple example, we want to inner join Customers dataset to Purchases on cust_id.
Customers:
cust_id | cust_name | location |
|---|---|---|
1 | Alex | Canada |
2 | Bob | California |
3 | Cathie | Texas |
Purchases:
purchase_id | cust_id | store |
|---|---|---|
23 | 1 | Walmart |
24 | 2 | Dstore |
25 | 2 | Walmart |
Created in 2020 by Google Inc.