Normalize
Introduction
Users sometimes need to transform one source row into multiple target rows. For example, attributes are stored in columns in one table or file and may need to be broken into multiple records - one record per column attribute. In general it allows one to convert columns to rows.
Use-Case
Convert wide rows and reducing data to it canonicalize form.
Reduce the need for restructuring the collection, as new types of data are being added.
Let’s assume that we are building a customer 360 Master table that aggregates data for a user from multiple sources. Each of the source has it’s own type of data to be added to a customer id. Instead of creating wide columns, normalization allows one to transform into it’s canonical form and update the customer 360 profile simultaneously from multiple sources.
Use Stories
User should be able to specify a record field mapping. The record field mapping, just maps the field in input schema to the field in the output schema. These are the fields that would be available in every record.
User should be able to specify a normalize field mapping. This will describe the input field name and to what output field it should be mapped to and where the value needs to be added.
Example
Example 1
Customer 360 Use-case
Assume we have a source ‘Customer Profile’ table and ‘Customer Purchase’ table.
Custom Profile Table
CustomerId | First Name | Last Name | Shipping Address | Credit Card | Billing Address | Last Update Date |
S23424242 | Joltie | Root | 32826 Mars Way, Marsville, MR, 24344 | 2334-232132-2323 | 32826 Mars Way, Marsville, MR, 24344 | 05/12/2015 |
R45764646 | Root | Joltie | 32423, Your Way, YourVille, YR, 65765 | 2343-12312-12313 | 32421, MyVilla Ct, YourVille, YR, 23423 | 04/03/2012 |
Build a pipeline that ingest the above data into a Customer 360 Master table that has the following schemas and configuration.
Input Schema
Customer ID
First Name
Last Name
Shipping Address
Credit Card
Billing Address
Last Update Date
Normalize
Record Field Mapping
Input Field : Customer ID, map to : ID
Input Field : Last Update Date, map to : Date
Record Field Normalizing
Input Field : First Name, to : Attribute Type
Value to : Attribute Value
Input Field : Last Name, to : Attribute Type
Value to : Attribute Value
Input Field : Credit Card, to : Attribute Type
Value to : Attribute Value
Input Field : Billing Address, to : Attribute Type
Value to : Attribute Value
Output Schema
ID
Attribute Type
Attribute Value
Date
Following is the output from running the pipeline with the configurations specified above. Let’s call this ‘Master Customer 360 Table'
ID | Attribute Type | Attribute Value | Date |
S23424242 | First Name | Joltie | 05/12/2015 |
S23424242 | Last Name | Root | 05/12/2015 |
S23424242 | Credit Card | 2334-232132-2323 | 05/12/2015 |
S23424242 | Billing Address | 32826 Mars Way, Marsville, MR, 24344 | 05/12/2015 |
R45764646 | First Name | Root | 04/03/2012 |
R45764646 | Last Name | Joltie | 04/03/2012 |
R45764646 | Credit Card | 2343-12312-12313 | 04/03/2012 |
R45764646 | Billing Address | 32421, MyVilla Ct, YourVille, YR, 23423 | 04/03/2012 |
Custom Purchase Table
Customer Id | Item ID | Item Cost | Purchase Date |
S23424242 | UR-AR-243123-ST | 245.67 | 08/09/2015 |
S23424242 | SKU-234294242942 | 67.90 | 10/12/2015 |
R45764646 | SKU-567757543532 | 14.15 | 06/09/2014 |
Now, let’s assume the user configures the pipeline for the above data to be added to ‘Master Customer 360 Table'
Following are schema and configuration
Input Schema
Customer Id
Item ID
Item Cost
Purchase Date
Normalize
Record Field Mapping
Input Field : Customer ID, to : ID
Input Field : Purchase Date, to : Date
Normalize Field Mapping
Input Field : Item ID, to : Attribute Type
Value to : Attribute Value
Input Field : Item Cost, to : Attribute Type
Value to : Attribute Value
Output Schema
ID
Attribute Type
Attribute Value
Date
Master Customer 360 Table would be updated as follows
ID | Attribute Type | Attribute Value | Date |
S23424242 | First Name | Joltie | 05/12/2015 |
S23424242 | Last Name | Root | 05/12/2015 |
S23424242 | Credit Card | 2334-232132-2323 | 05/12/2015 |
S23424242 | Billing Address | 32826 Mars Way, Marsville, MR, 24344 | 05/12/2015 |
R45764646 | First Name | Root | 04/03/2012 |
R45764646 | Last Name | Joltie | 04/03/2012 |
R45764646 | Credit Card | 2343-12312-12313 | 04/03/2012 |
R45764646 | Billing Address | 32421, MyVilla Ct, YourVille, YR, 23423 | 04/03/2012 |
S23424242 | Item ID | UR-AR-243123-ST | 08/09/2015 |
S23424242 | Item Cost | 245.67 | 08/09/2015 |
S23424242 | Item ID | SKU-234294242942 | 10/12/2015 |
S23424242 | Item Cost | 67.90 | 10/12/2015 |
R45764646 | Item ID | SKU-567757543532 | 06/09/2014 |
R45764646 | Item Cost | 14.15 | 06/09/2014 |
Example 2
Assume we got following data from any relational store as source
Name | Year | Month | Rent | Food | Transportation |
Joltie | 2016 | 02 | 600 | 100 | 50 |
Root | 2016 | 02 | 900 | 200 | 135 |
Vikram | 2016 | 02 | 550 | 145 | 75 |
Jon | 2016 | 02 | 750 | 175 | 125 |
Jockey | 2015 | 12 | 450 | 125 | 45 |
Configuration is specified as follows
Input Schema
Name, String
Year, Int
Month, Int
Rent, Float
Food, Float
Transportation, Float
Normalize
Record Field Mapping
Input Field : Name, to : Name
Input Field : Year, to : Year
Input Field : Month, to : Month
Record Field Normalizing
Input Field : Rent, to : ExpenditureType
Value to : Expenditure
Input Field : Food, to : ExpenditureType
Value to : Expenditure
Input Field : Transportation, to: ExpenditureType
Value to : Expenditure
Output Schema
Name, String
Year, Int
Month, Int
ExpenditureType, String
Expenditure, Float
Assume we got following data from any relational store as source
Name | Year | Month | ExpenditureType | Expenditure |
Joltie | 2016 | 02 | Rent | 600 |
Joltie | 2016 | 02 | Food | 100 |
Joltie | 2016 | 02 | Transportation | 50 |
Root | 2016 | 02 | Rent | 900 |
Root | 2016 | 02 | Food | 200 |
Root | 2016 | 02 | Transportation | 135 |
Vikram | 2016 | 02 | Rent | 550 |
Vikram | 2016 | 02 | Food |