...
As additional information for the source and target datasets, we may also want to show the associated properties such as the file path, the regex used, etc.
Store:
Based on the above example, we want the following pieces of information to be stored in the "FieldLevelLineage" dataset:
...
Row Key | Column Key | Value | Note |
---|---|---|---|
MyNamespace:HRFile | Properties | inputDir=/data/2017/hr regex=*.csv failOnError=false | One row per namespace per dataset |
MyNamespace:PersonFile | Properties | inputDir=/data/2017/person regex=*.csv failOnError=false | One row per namespace per dataset |
MyNamespace:EmployeeData | Properties | rowid=ID /* Should we store the schema too? What if it changes per run? */ | One row per namespace per dataset |
MyNamespace:EmployeeData:AllFields | ID | /* We may not need to store any value */ created_time:12345678 updated_time:12345678 last_updated_by:runid_X | One row per namespace per dataset |
MyNamespace:EmployeeData:AllFields | Name | ||
MyNamespace:EmployeeData:AllFields | Department | ||
MyNamespace:EmployeeData:AllFields | ContactDetails | ||
MyNamespace:EmployeeData:AllFields | JoiningDate | ||
MyNamespace:EmployeeData:ID:&lt;runidX-inverted-start-time&gt;:runidX | Lineage | Please see the full JSON below. | One row per run if field is part of target |
MyNamespace:EmployeeData:Name:<runidX-inverted-start-time>:runidX | Lineage | Similar JSON | One row per run if field is part of target |
MyNamespace:EmployeeData:ContactDetails:<runidX-inverted-start-time>:runidX | Lineage | Similar JSON | One row per run if field is part of target |
MyNamespace:EmployeeData:JoiningDate:<runidX-inverted-start-time>:runidX | Lineage | Similar JSON | One row per run if field is part of target |
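The `<runidX-inverted-start-time>` component of the row key can be sketched as follows. This is an illustrative Python sketch, assuming the start time is inverted against `Long.MAX_VALUE` so that a lexicographic scan returns the most recent run first; the helper names are hypothetical, not the actual platform API:

```python
MAX_LONG = 2**63 - 1  # matches Java's Long.MAX_VALUE, used for the inversion


def inverted_start_time(start_time_millis):
    """Invert the start time so lexicographic scans return newest runs first."""
    return MAX_LONG - start_time_millis


def make_lineage_row_key(namespace, dataset, field, start_time_millis, run_id):
    """Build a row key of the form ns:dataset:field:<inverted-start-time>:runid."""
    parts = [namespace, dataset, field,
             str(inverted_start_time(start_time_millis)), run_id]
    return ":".join(parts)


key = make_lineage_row_key("MyNamespace", "EmployeeData", "ID", 12345678, "runidX")
```

Because a newer run has a larger start time and therefore a smaller inverted value, a prefix scan on `MyNamespace:EmployeeData:ID:` naturally yields runs in most-recent-first order.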
JSON stored for the ID field:
```json
{
  "sources": [
    { "name": "PersonFile", "properties": { "inputPath": "/data/2017/persons", "regex": "*.csv" } },
    { "name": "HRFile", "properties": { "inputPath": "/data/2017/hr", "regex": "*.csv" } }
  ],
  "targets": [
    { "name": "Employee Data" }
  ],
  "operations": [
    {
      "inputs": [ { "name": "PersonRecord", "properties": { "source": "PersonFile" } } ],
      "outputs": [ { "name": "PersonRecord.body" } ],
      "name": "READ",
      "description": "Read Person file.",
      "properties": { "stage": "Person File Reader" }
    },
    {
      "inputs": [ { "name": "PersonRecord.body" } ],
      "outputs": [ { "name": "SSN" } ],
      "name": "PARSE",
      "description": "Parse the body field",
      "properties": { "stage": "Person File Parser" }
    },
    {
      "inputs": [ { "name": "HRRecord", "properties": { "source": "HRFile" } } ],
      "outputs": [ { "name": "HRRecord.body" } ],
      "name": "READ",
      "description": "Read HR file.",
      "properties": { "stage": "HR File Reader" }
    },
    {
      "inputs": [ { "name": "HRRecord.body" } ],
      "outputs": [ { "name": "Employee_Name" }, { "name": "Dept_Name" } ],
      "name": "PARSE",
      "description": "Parse the body field",
      "properties": { "stage": "HR File Parser" }
    },
    {
      "inputs": [ { "name": "Employee_Name" }, { "name": "Dept_Name" }, { "name": "SSN" } ],
      "outputs": [ { "name": "ID", "properties": { "target": "Employee Data" } } ],
      "name": "GenerateID",
      "description": "Generate unique Employee Id",
      "properties": { "stage": "Field Normalizer" }
    }
  ]
}
```
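At retrieval time, the stored JSON can be walked directly to answer questions such as "which stages were involved in producing ID". A minimal sketch, assuming the structure shown above (the function name is illustrative, not a platform API):

```python
import json


def stages_for(lineage_json):
    """Return the ordered stage names recorded in a stored lineage entry."""
    doc = json.loads(lineage_json)
    return [op["properties"]["stage"] for op in doc["operations"]]


# Abbreviated two-operation entry in the same shape as the stored JSON above.
entry = """
{
  "operations": [
    {"inputs": [{"name": "PersonRecord"}],
     "outputs": [{"name": "PersonRecord.body"}],
     "name": "READ", "properties": {"stage": "Person File Reader"}},
    {"inputs": [{"name": "PersonRecord.body"}],
     "outputs": [{"name": "SSN"}],
     "name": "PARSE", "properties": {"stage": "Person File Parser"}}
  ]
}
"""
```

Since the operations are stored pre-processed (see the notes below the JSON), this traversal needs no joins against other rows.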
A few things to note:
- When the platform receives the LineageGraph from the app, it processes the graph before storing the data so that retrieval is straightforward.
- In the above pipeline, the "HR File Parser" stage parses the body and generates the fields "Employee_Name", "Dept_Name", "Salary", and "Start_Date". However, the JSON stored for the ID field only contains the outputs of the "HR File Parser" stage that contribute to the ID field ("Employee_Name" and "Dept_Name").
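The pruning described above can be sketched as a backward walk from the target field: keep an operation only if one of its outputs is (transitively) needed to produce the target. This is an illustrative sketch, not the platform implementation; it keeps whole operations, whereas the stored JSON additionally drops the non-contributing outputs (e.g. "Salary", "Start_Date") from a kept operation:

```python
def prune_for_field(operations, target_field):
    """Keep only the operations whose outputs transitively feed target_field."""
    needed = {target_field}
    kept = []
    # Walk in reverse so each consumer registers the fields it needs
    # before its producers are examined.
    for op in reversed(operations):
        if {o["name"] for o in op["outputs"]} & needed:
            kept.append(op)
            # Everything this operation reads is now needed upstream.
            needed |= {i["name"] for i in op["inputs"]}
    kept.reverse()
    return kept


# Hypothetical mini-pipeline: the NORMALIZE step does not feed ID,
# so it is pruned from the lineage stored for the ID field.
ops = [
    {"name": "READ", "inputs": [{"name": "PersonRecord"}],
     "outputs": [{"name": "PersonRecord.body"}]},
    {"name": "PARSE", "inputs": [{"name": "PersonRecord.body"}],
     "outputs": [{"name": "SSN"}, {"name": "Salary"}]},
    {"name": "NORMALIZE", "inputs": [{"name": "Salary"}],
     "outputs": [{"name": "Salary_Normalized"}]},
    {"name": "GenerateID", "inputs": [{"name": "SSN"}],
     "outputs": [{"name": "ID"}]},
]
pruned = prune_for_field(ops, "ID")
```

Doing this pruning once at write time is what keeps the per-field rows self-contained and retrieval a simple row read.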
Retrieval: