Hydrator 3.5 - Joins

Hydrator 3.5 - Joins

Goals

Checklist

User stories documented (Albert/Vinisha)
User stories reviewed (Nitin)
Design documented (Albert/Vinisha)
Design reviewed (Terence/Andreas)
Feature merged ()
Examples and guides ()
Integration tests () 
Documentation for feature ()
Blog post

Use Cases

  1. A data scientist wants to join data from multiple datasets to run a machine learning model. The first dataset, Customers, contains cust_id, cust_name and location. Second dataset, Purchases, contains purchase_id, cust_id and store. Third dataset, Transactions contains cust_id, tranasaction_data and purchase_amount. The data scientist wants to inner join these datasets on cust_id. 

    Customers:

    Purchases:

    Transactions:

     

    Joining above three datasets would give us following output:
    1 -> <Customers: (1, Alex, Canada), Purchases: (23, 1, Walmart), Transactions: (1, 2015-01-23, 20)>

    2 -> <Customers: (2, Bob, California), Purchases: (24, 2, Dstore), Transactions: (2, 2015-01-12, 400)>

    2 -> <Customers: (2, Bob, California), Purchases: (25, 2, Walmart), Transactions: (2, 2015-01-12, 400)>

  2. A data scientist wants to join data from multiple datasets. The first dataset, Customers, contains cust_id, cust_name and location. Second dataset, Purchases, contains purchase_id, cust_id and store. Third dataset, Transactions contains purchase_id, tranasaction_data and purchase_amount. The data scientist wants to inner join Customers and Purchases on cust_id and the resultant dataset can be joined to Transactions on purchase_id

    Customers:

    Purchases:

    CustomersPurchases:

    Transactions:  

    CustomersPurchasesTransactions:

  3. A data scientist wants to join two or more datasets and wants to selects columns present in the output of join. For example, Customers has cust_id, cust_name and location. Second dataset, Purchases, has purchase_id, cust_id and store. After inner join of these 2 datasets on cust_id, output dataset should have cust_id, cust_name and purchase_id.

    Customers:

    Purchases:

    CustomersPurchases:

User Stories

  1. As a pipeline developer, I want to be able to join (inner, left outer, right outer, full outer) two or more stage outputs on some common fields, or do a cross join.

  2. As a pipeline developer, I want to be able to get metrics on number of records in and records out of the join.

  3. [UI] As a pipeline developer, I want to be able to see the schema of all input into the join, and the schema output by the join.

  4. As a pipeline developer, I want to be able to choose whether the pipeline with the join runs with mapreduce or spark.

  5. As a plugin developer, I want to be able to write a plugin that gets data from multiple stages joins them.

Design

To support Join on two or more stages, we will need a new plugin type. We propose 2 designs for this plugin. After the discussion, we have decided to implement join with design 1.

Design 1:

In this design we introduce new Joiner plugin type. The API for this plugin will look like:

Joiner.java
public interface Joiner<RECORD_KEY, RECORD> { /** * Returns record key based on record and stage name to which records belongs to. * * @param input record which needs to be joined * @param stageName stage name to which records belogs to * @return Join key * @throws Exception if there is some error getting the key for join */ public RECORD_KEY joinOn(String stageName, RECORD record) throws Exception; }

 

To understand the api, lets take a simple example, we want to inner join Customers dataset to Purchases on cust_id. 

Customers:

cust_id

cust_name

location

cust_id

cust_name

location

1

Alex

Canada

2

Bob

California

3

Cathie

Texas  

Purchases:

purchase_id

cust_id

store

purchase_id

cust_id

store

23

1

Walmart

24

2

Dstore

25

2

Walmart  

 

Created in 2020 by Google Inc.