...
As additional information for the source and target datasets, we may also want to show the associated properties such as the file path, the regex used, etc.
Store:
Based on the above example, we want the following pieces of information to be stored in the "FieldLevelLineage" dataset:
...
Row Key | Column Key | Value | Note |
---|---|---|---|
MyNamespace:HRFile | Properties | inputDir=/data/2017/hr regex=*.csv failOnError=false | One row per namespace per dataset |
MyNamespace:PersonFile | Properties | inputDir=/data/2017/person regex=*.csv failOnError=false | One row per namespace per dataset |
MyNamespace:EmployeeData | Properties | rowid=ID /* Should we store the schema too? What if it changes per run? */ | One row per namespace per dataset |
MyNamespace:EmployeeData:AllFields | ID | /* We may not need to store any value */ created_time:12345678 updated_time:12345678 last_updated_by:runid_X | One row per namespace per dataset |
MyNamespace:EmployeeData:AllFields | Name | ||
MyNamespace:EmployeeData:AllFields | Department | ||
MyNamespace:EmployeeData:AllFields | ContactDetails | ||
MyNamespace:EmployeeData:AllFields | JoiningDate | ||
MyNamespace:EmployeeData:ID:&lt;runidX-inverted-start-time&gt;:runidX | Lineage | Please see the full JSON below. | One row per run if field is part of target |
MyNamespace:EmployeeData:Name:<runidX-inverted-start-time>:runidX | Lineage | Similar JSON | One row per run if field is part of target |
MyNamespace:EmployeeData:ContactDetails:<runidX-inverted-start-time>:runidX | Lineage | Similar JSON | One row per run if field is part of target |
MyNamespace:EmployeeData:JoiningDate:<runidX-inverted-start-time>:runidX | Lineage | Similar JSON | One row per run if field is part of target |
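The `<runidX-inverted-start-time>` component of the row key can be sketched as follows. This is an illustrative Python sketch, assuming the start time is inverted against `Long.MAX_VALUE` so that a lexicographic scan returns the most recent run first; the helper names are hypothetical, not the actual platform API:

```python
MAX_LONG = 2**63 - 1  # matches Java's Long.MAX_VALUE, used for the inversion


def inverted_start_time(start_time_millis):
    """Invert the start time so lexicographic scans return newest runs first."""
    return MAX_LONG - start_time_millis


def make_lineage_row_key(namespace, dataset, field, start_time_millis, run_id):
    """Build a row key of the form ns:dataset:field:<inverted-start-time>:runid."""
    parts = [namespace, dataset, field,
             str(inverted_start_time(start_time_millis)), run_id]
    return ":".join(parts)


key = make_lineage_row_key("MyNamespace", "EmployeeData", "ID", 12345678, "runidX")
```

Because a newer run has a larger start time and therefore a smaller inverted value, a prefix scan on `MyNamespace:EmployeeData:ID:` naturally yields runs in most-recent-first order.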
JSON stored for the ID field:
```json
{
  "sources": [
    { "name": "PersonFile", "properties": { "inputPath": "/data/2017/persons", "regex": "*.csv" } },
    { "name": "HRFile", "properties": { "inputPath": "/data/2017/hr", "regex": "*.csv" } }
  ],
  "targets": [
    { "name": "Employee Data" }
  ],
  "operations": [
    {
      "inputs": [ { "name": "PersonRecord", "properties": { "source": "PersonFile" } } ],
      "outputs": [ { "name": "PersonRecord.body" } ],
      "name": "READ",
      "description": "Read Person file.",
      "properties": { "stage": "Person File Reader" }
    },
    {
      "inputs": [ { "name": "PersonRecord.body" } ],
      "outputs": [ { "name": "SSN" } ],
      "name": "PARSE",
      "description": "Parse the body field",
      "properties": { "stage": "Person File Parser" }
    },
    {
      "inputs": [ { "name": "HRRecord", "properties": { "source": "HRFile" } } ],
      "outputs": [ { "name": "HRRecord.body" } ],
      "name": "READ",
      "description": "Read HR file.",
      "properties": { "stage": "HR File Reader" }
    },
    {
      "inputs": [ { "name": "HRRecord.body" } ],
      "outputs": [ { "name": "Employee_Name" }, { "name": "Dept_Name" } ],
      "name": "PARSE",
      "description": "Parse the body field",
      "properties": { "stage": "HR File Parser" }
    },
    {
      "inputs": [ { "name": "Employee_Name" }, { "name": "Dept_Name" }, { "name": "SSN" } ],
      "outputs": [ { "name": "ID", "properties": { "target": "Employee Data" } } ],
      "name": "GenerateID",
      "description": "Generate unique Employee Id",
      "properties": { "stage": "Field Normalizer" }
    }
  ]
}
```
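At retrieval time, the stored JSON can be walked directly to answer questions such as "which stages were involved in producing ID". A minimal sketch, assuming the structure shown above (the function name is illustrative, not a platform API):

```python
import json


def stages_for(lineage_json):
    """Return the ordered stage names recorded in a stored lineage entry."""
    doc = json.loads(lineage_json)
    return [op["properties"]["stage"] for op in doc["operations"]]


# Abbreviated two-operation entry in the same shape as the stored JSON above.
entry = """
{
  "operations": [
    {"inputs": [{"name": "PersonRecord"}],
     "outputs": [{"name": "PersonRecord.body"}],
     "name": "READ", "properties": {"stage": "Person File Reader"}},
    {"inputs": [{"name": "PersonRecord.body"}],
     "outputs": [{"name": "SSN"}],
     "name": "PARSE", "properties": {"stage": "Person File Parser"}}
  ]
}
"""
```

Since the operations are stored pre-processed (see the notes below the JSON), this traversal needs no joins against other rows.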
A few things to note:
- When the platform receives the LineageGraph from the app, it processes the graph before storing the data so that retrieval is straightforward.
- In the above pipeline, the "HR File Parser" stage parses the body and generates the fields "Employee_Name", "Dept_Name", "Salary", and "Start_Date". However, the JSON stored for the ID field only contains the outputs of the "HR File Parser" stage that contribute to the ID field ("Employee_Name" and "Dept_Name").
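The pruning described above can be sketched as a backward walk from the target field: keep an operation only if one of its outputs is (transitively) needed to produce the target. This is an illustrative sketch, not the platform implementation; it keeps whole operations, whereas the stored JSON additionally drops the non-contributing outputs (e.g. "Salary", "Start_Date") from a kept operation:

```python
def prune_for_field(operations, target_field):
    """Keep only the operations whose outputs transitively feed target_field."""
    needed = {target_field}
    kept = []
    # Walk in reverse so each consumer registers the fields it needs
    # before its producers are examined.
    for op in reversed(operations):
        if {o["name"] for o in op["outputs"]} & needed:
            kept.append(op)
            # Everything this operation reads is now needed upstream.
            needed |= {i["name"] for i in op["inputs"]}
    kept.reverse()
    return kept


# Hypothetical mini-pipeline: the NORMALIZE step does not feed ID,
# so it is pruned from the lineage stored for the ID field.
ops = [
    {"name": "READ", "inputs": [{"name": "PersonRecord"}],
     "outputs": [{"name": "PersonRecord.body"}]},
    {"name": "PARSE", "inputs": [{"name": "PersonRecord.body"}],
     "outputs": [{"name": "SSN"}, {"name": "Salary"}]},
    {"name": "NORMALIZE", "inputs": [{"name": "Salary"}],
     "outputs": [{"name": "Salary_Normalized"}]},
    {"name": "GenerateID", "inputs": [{"name": "SSN"}],
     "outputs": [{"name": "ID"}]},
]
pruned = prune_for_field(ops, "ID")
```

Doing this pruning once at write time is what keeps the per-field rows self-contained and retrieval a simple row read.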
Retrieval: