Normalize

Normalize

 

Introduction

 

Users sometimes need to transform one source row into multiple target rows. For example, attributes are stored in columns in one table or file and may need to be broken into multiple records - one record per column attribute. In general it allows one to convert columns to rows. 

Use-Case

  • Convert wide rows and reducing data to it canonicalize form.

  • Reduce the need for restructuring the collection, as new types of data are being added.

  • Let’s assume that we are building a customer 360 Master table that aggregates data for a user from multiple sources. Each of the source has it’s own type of data to be added to a customer id. Instead of creating wide columns, normalization allows one to transform into it’s canonical form and update the customer 360 profile simultaneously from multiple sources. 

Use Stories

  • User should be able to specify a record field mapping. The record field mapping, just maps the field in input schema to the field in the output schema. These are the fields that would be available in every record. 

  • User should be able to specify a normalize field mapping. This will describe the input field name and to what output field it should be mapped to and where the value needs to be added. 

Example

Example 1

 

Customer 360 Use-case

 

Assume we have a source ‘Customer Profile’ table and ‘Customer Purchase’ table. 

 

Custom Profile Table

 

CustomerId

First Name

Last Name

Shipping Address

Credit Card

Billing Address

Last Update Date

S23424242

Joltie

Root

32826 Mars Way, Marsville,  MR, 24344

2334-232132-2323

32826 Mars Way, Marsville,  MR, 24344

05/12/2015

R45764646

Root

Joltie

32423, Your Way, YourVille, YR, 65765

2343-12312-12313

32421, MyVilla Ct, YourVille, YR, 23423

04/03/2012

 

Build a pipeline that ingest the above data into a Customer 360 Master table that has the following schemas and configuration.

 

Input Schema

  • Customer ID

  • First Name

  • Last Name

  • Shipping Address

  • Credit Card

  • Billing Address

  • Last Update Date

 

Normalize

  • Record Field Mapping

    • Input Field : Customer ID, map to : ID

    • Input Field : Last Update Date, map to : Date

  • Record Field Normalizing

    • Input Field : First Name, to : Attribute Type

      • Value to : Attribute Value

    • Input Field : Last Name, to : Attribute Type

      • Value to : Attribute Value

    • Input Field : Credit Card, to : Attribute Type

      • Value to : Attribute Value

    • Input Field : Billing Address, to : Attribute Type

      • Value to : Attribute Value

 

Output Schema

  • ID

  • Attribute Type

  • Attribute Value

  • Date

 

Following is the output from running the pipeline with the configurations specified above. Let’s call this ‘Master Customer 360 Table'

 

ID

Attribute Type

Attribute Value

Date

S23424242

First Name

Joltie

05/12/2015

S23424242

Last Name

Root

05/12/2015

S23424242

Credit Card

2334-232132-2323

05/12/2015

S23424242

Billing Address

32826 Mars Way, Marsville,  MR, 24344

05/12/2015

R45764646

First Name

Root

04/03/2012

R45764646

Last Name

Joltie

04/03/2012

R45764646

Credit Card

2343-12312-12313

04/03/2012

R45764646

Billing Address

32421, MyVilla Ct, YourVille, YR, 23423

04/03/2012

 

Custom Purchase Table

 

Customer Id

Item ID

Item Cost

Purchase Date

S23424242

UR-AR-243123-ST

245.67

08/09/2015

S23424242

SKU-234294242942

67.90

10/12/2015

R45764646

SKU-567757543532

14.15

06/09/2014

 

Now, let’s assume the user configures the pipeline for the above data to be added to ‘Master Customer 360 Table'

 

Following are schema and configuration

 

Input Schema

  • Customer Id

  • Item ID

  • Item Cost

  • Purchase Date

Normalize

  • Record Field Mapping

    • Input Field : Customer ID, to : ID

    • Input Field : Purchase Date, to : Date

  • Normalize Field Mapping

    • Input Field : Item ID, to : Attribute Type

      • Value to : Attribute Value

    • Input Field : Item Cost, to : Attribute Type

      • Value to : Attribute Value

Output Schema

  • ID

  • Attribute Type

  • Attribute Value

  • Date

 

Master Customer 360 Table would be updated as follows

 

ID

Attribute Type

Attribute Value

Date

S23424242

First Name

Joltie

05/12/2015

S23424242

Last Name

Root

05/12/2015

S23424242

Credit Card

2334-232132-2323

05/12/2015

S23424242

Billing Address

32826 Mars Way, Marsville,  MR, 24344

05/12/2015

R45764646

First Name

Root

04/03/2012

R45764646

Last Name

Joltie

04/03/2012

R45764646

Credit Card

2343-12312-12313

04/03/2012

R45764646

Billing Address

32421, MyVilla Ct, YourVille, YR, 23423

04/03/2012

S23424242

Item ID

UR-AR-243123-ST

08/09/2015

S23424242

Item Cost

245.67

08/09/2015

S23424242

Item ID

SKU-234294242942

10/12/2015

S23424242

Item Cost

67.90

10/12/2015

R45764646

Item ID

SKU-567757543532

 06/09/2014

R45764646

Item Cost

14.15

 06/09/2014

 

Example 2

Assume we got following data from any relational store as source

Name

Year

Month

Rent

Food

Transportation

Joltie

2016

02

600

100

50

Root

2016

02

900

200

135

Vikram

2016

02

550

145

75

Jon

2016

02

750

175

125

Jockey

2015

12

450

125

45

Configuration is specified as follows

 

Input Schema

  • Name, String

  • Year, Int

  • Month, Int

  • Rent, Float

  • Food, Float

  • Transportation, Float

Normalize

  • Record Field Mapping

    • Input Field : Name, to : Name

    • Input Field : Year, to : Year

    • Input Field : Month, to : Month

  • Record Field Normalizing

    • Input Field : Rent, to : ExpenditureType

      • Value to : Expenditure

    • Input Field : Food, to : ExpenditureType

      • Value to : Expenditure

    • Input Field : Transportation, to: ExpenditureType

      • Value to : Expenditure

Output Schema

  • Name, String

  • Year, Int

  • Month, Int

  • ExpenditureType, String

  • Expenditure, Float

 

Assume we got following data from any relational store as source

 

Name

Year

Month

ExpenditureType

Expenditure

Joltie

2016

02

Rent

600

Joltie

2016

02

Food

100

Joltie

2016

02

Transportation

50

Root

2016

02

Rent

900

Root

2016

02

Food

200

Root

2016

02

Transportation

135

Vikram

2016

02

Rent

550

Vikram

2016

02

Food

Created in 2020 by Google Inc.