This document explains the design for storage and retrieval of field-level lineage information.

Example:

Access Pattern:

...

Image Removed

 

In the lineage view, we first show high-level information, as shown below. Note that 'HR File', 'Person File', and 'Employee Data' are the names of the input and output datasets, as indicated by the Reference name in the plugin properties.

Image Removed

The next level of detail shows the clickable fields from the input and output datasets. Note that the 2D boxes represent fields belonging to the datasets. Since the input datasets are files, which do not have a schema yet, the plugin can provide any String name for their records. In this case we use "HR Record" and "Person Record" as the names.

Image Removed

 

Once the user clicks on a particular field, the field-level lineage graph can be displayed.

Example: Graph for the field ID, where circles represent fields and edges represent operations, with the operation names in bubbles.

Image Removed

Note that the "body" field is generated from "HR Record" as well as from "Person Record". To distinguish the two when storing them, we might need to prefix the field name with the stage name.

...

  1. For a given dataset, find the high-level lineage (the field mapping between source and destination datasets, not the detailed operations that caused the conversion) going in the backward direction within a given time range. Note that the response should be multi-level. For example, consider a case where the "Employee" dataset is generated from the "Person", "HR", and "Skills" datasets. The response would contain the field mappings between the source datasets ("Person", "HR", and "Skills") and the "Employee" dataset. However, the source datasets may themselves have been created or updated in the given time range, so the response should also include the field mappings between the datasets that produced the source datasets and the source datasets themselves.
  2. For a given dataset, find the high-level lineage (the field mapping between source and destination datasets, not the detailed operations that caused the conversion) going in the forward direction within a given time range. As with the query above, the response needs to be multi-level.
  3. Given a dataset and a field name, find the detailed lineage (the field mapping between the source and destination datasets along with the operations that caused the conversion) going in the backward direction. The response will only contain the operations belonging to a single level.
  4. Given a dataset and a field name, find the detailed lineage (the field mapping between the source and destination datasets along with the operations that caused the conversion) going in the forward direction. The response will only contain the operations belonging to a single level.
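
The multi-level backward query in item 1 can be sketched as a breadth-first walk that stops after a fixed number of hops. This is only an illustration: the `SOURCES_OF` index, the `backward_lineage` function, and the "RawPerson" dataset are hypothetical names introduced here, and time-range filtering is assumed to have happened when the index was built.

```python
from collections import deque

# Hypothetical in-memory index: destination dataset -> datasets it was
# written from within the queried time range. "RawPerson" is invented
# here only to show a second level.
SOURCES_OF = {
    "Employee": ["Person", "HR", "Skills"],
    "Person": ["RawPerson"],
}


def backward_lineage(dataset, max_levels):
    """Return (source, destination, level) edges found going backward."""
    edges = []
    seen = {dataset}
    queue = deque([(dataset, 1)])
    while queue:
        dest, level = queue.popleft()
        if level > max_levels:
            continue  # hop limit reached, do not expand further
        for src in SOURCES_OF.get(dest, []):
            edges.append((src, dest, level))
            if src not in seen:  # guard against cycles between datasets
                seen.add(src)
                queue.append((src, level + 1))
    return edges


if __name__ == "__main__":
    for edge in backward_lineage("Employee", max_levels=2):
        print(edge)
```

The forward query in item 2 would use the same walk over an index keyed by source dataset instead of destination dataset.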

REST API:

  1. Given a dataset and time range, get the high level lineage both in forward and backward direction.

    Code Block
    GET /v3/namespaces/<namespace-id>/endpoints/<endpoint-name>/fields/lineage?start=<start-ts>&end=<end-ts>&level=<level>
    
    
    Where:
    namespace-id: namespace name
    endpoint-name: name of the endpoint
    start-ts: starting timestamp(inclusive) in seconds
    end-ts: ending timestamp(exclusive) in seconds for lineage
    level: how many hops to make in backward/forward direction
    
    
    Sample response:
    [
      ...
      list of lineage mappings
      ...
    ]
    
    
    where each lineage mapping will be of the form:
    
    
    {
      "source": {
         "namespace": "ns",
         "name": "Person" 
      },
      "destination": {
         "namespace": "ns",
         "name": "Employee"
      },
      "fieldmap": [
         { "from": "id", "to": "id" },
         { "from": "first_name", "to": "name"},
         { "from": "last_name", "to": "name"}
      ] 
    }
  2. Given a dataset and field, find out the detailed lineage.

    Code Block
    GET /v3/namespaces/<namespace-id>/endpoints/<endpoint-id>/fields/<field-name>/lineage?start=<start-ts>&end=<end-ts>&direction=<backward/forward>
     
    Where:
    namespace-id: namespace name
    endpoint-id: endpoint name
    field-name: name of the field for which lineage information is to be retrieved
    start-ts: starting timestamp(inclusive) in seconds
    end-ts: ending timestamp(exclusive) in seconds for lineage
    direction: backward or forward
    
    
    Sample response:
    [
       ....
           list of connections between fields
       ....
    ]
    
    
    where each connection is as follows:
    {
      "from": "id",
      "to": "id",
      "operation": {
         "name": "IDENTITY",
         "description": "operation description" 
      }
    }
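
As a rough client-side illustration, the request path for the detailed lineage call above could be assembled as follows. The helper name `field_lineage_url` and its validation rules are assumptions made for this sketch; only the path template and the query parameter names come from the API description above.

```python
from urllib.parse import quote, urlencode


def field_lineage_url(namespace, endpoint, field, start_ts, end_ts, direction):
    """Build the path and query string for the detailed field lineage call.

    start_ts is inclusive and end_ts is exclusive, both in seconds.
    """
    if direction not in ("backward", "forward"):
        raise ValueError("direction must be 'backward' or 'forward'")
    if start_ts >= end_ts:
        raise ValueError("start_ts must be smaller than end_ts")
    # Percent-encode each path segment so names containing '/' or spaces
    # do not change the shape of the path.
    path = "/v3/namespaces/{}/endpoints/{}/fields/{}/lineage".format(
        quote(namespace, safe=""),
        quote(endpoint, safe=""),
        quote(field, safe=""),
    )
    query = urlencode({"start": start_ts, "end": end_ts, "direction": direction})
    return path + "?" + query


if __name__ == "__main__":
    print(field_lineage_url("ns", "Employee", "id", 1442863938, 1442881938, "backward"))
```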
    
    

Store:

Based on the above example, we want the following pieces of information to be stored in the "FieldLevelLineage" dataset:

...

Example: After one run of the pipeline shown above, the store will contain the following sample data.

Row Key | Column Key | Value | Note
MyNamespace:HRFile:<runidX-inverted-start-time>:runidX | Properties | inputDir=/data/2017/hr; regex=*.csv; failOnError=false | One row per namespace per dataset per run
MyNamespace:PersonFile:<runidX-inverted-start-time>:runidX | Properties | inputDir=/data/2017/person; regex=*.csv; failOnError=false | One row per namespace per dataset per run
MyNamespace:EmployeeData:<runidX-inverted-start-time>:runidX | Properties | rowid=ID /* should we store the schema too? what if it changes per run? */ | One row per namespace per dataset per run
MyNamespace:EmployeeData:AllFields:<runidX-inverted-start-time>:runidX | ID | /* We may not necessarily need to store any value */ created_time:12345678; updated_time:12345678; last_updated_by:runid_X | One row per namespace per dataset per run
MyNamespace:EmployeeData:AllFields:<runidX-inverted-start-time>:runidX | Name | |
MyNamespace:EmployeeData:AllFields:<runidX-inverted-start-time>:runidX | Department | |
MyNamespace:EmployeeData:AllFields:<runidX-inverted-start-time>:runidX | ContactDetails | |
MyNamespace:EmployeeData:AllFields:<runidX-inverted-start-time>:runidX | JoiningDate | |
MyNamespace:EmployeeData:<runidX-inverted-start-time>:runidX | Lineage | JSON representation of the LineageGraph provided by the app to the platform. | One row per run per target dataset
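
The row keys above embed an inverted start time, presumably so that a scan over one dataset returns the most recent runs first. A minimal sketch of one way to build such keys follows. The inversion formula (maximum signed 64-bit value minus the start time), the zero padding, and the `row_key` helper are all assumptions made for illustration; the actual store may well encode keys as bytes rather than strings.

```python
MAX_LONG = 2**63 - 1  # assumption: invert against the largest signed 64-bit value


def inverted_ts(start_time):
    """Later start times map to smaller numbers."""
    return MAX_LONG - start_time


def row_key(namespace, dataset, run_id, start_time, all_fields=False):
    """Compose a key in the shape used by the sample table above."""
    parts = [namespace, dataset]
    if all_fields:
        parts.append("AllFields")
    # Zero-pad to 19 digits so that string order matches numeric order.
    parts.append("{:019d}".format(inverted_ts(start_time)))
    parts.append(run_id)
    return ":".join(parts)


if __name__ == "__main__":
    keys = [
        row_key("MyNamespace", "EmployeeData", "runidX", 1000),
        row_key("MyNamespace", "EmployeeData", "runidY", 2000),
    ]
    # Sorting the keys lists the later run (runidY) first.
    print(sorted(keys))
```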

JSON stored for ID field:

Code Block
{
  "sources": [
    {
      "name": "PersonFile",
      "properties": {
        "inputPath": "/data/2017/persons",
        "regex": "*.csv"
      }
    },
    {
      "name": "HRFile",
      "properties": {
        "inputPath": "/data/2017/hr",
        "regex": "*.csv"
      }
    }
  ],
  "targets": [
    {
      "name": "Employee Data"
    }
  ],
  "operations": [
    {
      "inputs": [
        {
          "name": "PersonRecord",
          "properties": {
            "source": "PersonFile"
          }
        }
      ],
      "outputs": [
        {
          "name": "body"
        }
      ],
      "name": "READ",
      "description": "Read Person file.",
      "properties": {
        "stage": "Person File Reader"
      }
    },
    {
      "inputs": [
        {
          "name": "body"
        }
      ],
      "outputs": [
        {
          "name": "SSN"
        }
      ],
      "name": "PARSE",
      "description": "Parse the body field",
      "properties": {
        "stage": "Person File Parser"
      }
    },
    {
      "inputs": [
        {
          "name": "HRRecord",
          "properties": {
            "source": "HRFile"
          }
        }
      ],
      "outputs": [
        {
          "name": "body"
        }
      ],
      "name": "READ",
      "description": "Read HR file.",
      "properties": {
        "stage": "HR File Reader"
      }
    },
    {
      "inputs": [
        {
          "name": "body"
        }
      ],
      "outputs": [
        {
          "name": "Employee_Name"
        },
        {
          "name": "Dept_Name"
        }
      ],
      "name": "PARSE",
      "description": "Parse the body field",
      "properties": {
        "stage": "HR File Parser"
      }
    },
    {
      "inputs": [
        {
          "name": "Employee_Name"
        },
        {
          "name": "Dept_Name"
        },
        {
          "name": "SSN"
        }
      ],
      "outputs": [
        {
          "name": "ID",
          "properties": {
            "target": "Employee Data"
          }
        }
      ],
      "name": "GenerateID",
      "description": "Generate unique Employee Id",
      "properties": {
        "stage": "Field Normalizer"
      }
    }
  ]
}
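
To make the stored graph more concrete, the sketch below walks the operations backward from a field to the inputs that no operation produces. It uses a trimmed copy of the operations above (names only), and `trace_back` is a hypothetical helper, not part of any platform API. Because it matches fields purely by name, it also reproduces the "body" ambiguity noted earlier: both READ operations output a field called "body", so SSN appears to originate from HRRecord as well as PersonRecord.

```python
# Trimmed copy of the "operations" list above: only stage and field names.
OPERATIONS = [
    {"stage": "Person File Reader", "inputs": ["PersonRecord"], "outputs": ["body"]},
    {"stage": "Person File Parser", "inputs": ["body"], "outputs": ["SSN"]},
    {"stage": "HR File Reader", "inputs": ["HRRecord"], "outputs": ["body"]},
    {"stage": "HR File Parser", "inputs": ["body"],
     "outputs": ["Employee_Name", "Dept_Name"]},
    {"stage": "Field Normalizer",
     "inputs": ["Employee_Name", "Dept_Name", "SSN"], "outputs": ["ID"]},
]


def trace_back(field, operations):
    """Return (stages visited, origin fields) for the given field.

    Fields are matched by name only, so same-named fields from different
    stages are merged.
    """
    stages = []
    origins = set()
    visited = set()
    pending = [field]
    while pending:
        current = pending.pop()
        if current in visited:
            continue
        visited.add(current)
        producers = [op for op in operations if current in op["outputs"]]
        if not producers:
            origins.add(current)  # nothing produces it: it is a source input
        for op in producers:
            if op["stage"] not in stages:
                stages.append(op["stage"])
            pending.extend(op["inputs"])
    return stages, origins


if __name__ == "__main__":
    print(trace_back("ID", OPERATIONS))
    print(trace_back("SSN", OPERATIONS))
```

Qualifying each field with its stage name, as suggested earlier, would give the two "body" fields distinct identities and remove the false HRRecord origin for SSN.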

...

  1. Get the list of fields in the dataset.

    Code Block
    GET /v3/namespaces/<namespace-id>/datasets/<dataset-id>/fields?start=<start-ts>&end=<end-ts>
     
    Where:
    namespace-id: namespace name
    dataset-id: dataset name
    start-ts: starting timestamp(inclusive) in seconds
    end-ts: ending timestamp(exclusive) in seconds for lineage
     
    Sample Response:
    [
      {
        "name": "ID",
        "properties": {
          "creation_time": 12345678,
          "last_update_time": 12345688,
          "last_modified_run": "runid_x"
        }
      },
      {
        "name": "name",
        "properties": {
          "creation_time": 12345678,
          "last_update_time": 12345688,
          "last_modified_run": "runid_x"
        }
      },
      {
        "name": "Department",
        "properties": {
          "creation_time": 12345678,
          "last_update_time": 12345688,
          "last_modified_run": "runid_x"
        }
      },
      {
        "name": "ContactDetails",
        "properties": {
          "creation_time": 12345678,
          "last_update_time": 12345688,
          "last_modified_run": "runid_x"
        }
      },
      {
        "name": "JoiningDate",
        "properties": {
          "creation_time": 12345678,
          "last_update_time": 12345688,
          "last_modified_run": "runid_x"
        }
      }
    ]
  2. Get the properties associated with the dataset.

    Code Block
    GET /v3/namespaces/<namespace-id>/datasets/<dataset-id>/properties?start=<start-ts>&end=<end-ts>
    
    Where:
    namespace-id: namespace name
    dataset-id: dataset name
    start-ts: starting timestamp(inclusive) in seconds
    end-ts: ending timestamp(exclusive) in seconds for lineage
    Sample Response:
    [
       {
          "programRun": "run1",
          "properties": {
            "inputPath": "/data/2017/hr",
            "regex": "*.csv"
          } 
       },
       {
          "programRun": "run2",  
          "properties": {
            "inputPath": "/data/2017/anotherhrdata",
            "regex": "*.csv"
          }
       }
    ]
  3. Get the lineage associated with the particular field in a dataset.

    Code Block
    GET /v3/namespaces/<namespace-id>/datasets/<dataset-id>/fields/<field-name>/lineage?start=<start-ts>&end=<end-ts>
     
    Where:
    namespace-id: namespace name
    dataset-id: dataset name
    field-name: name of the field for which lineage information is to be retrieved
    start-ts: starting timestamp(inclusive) in seconds
    end-ts: ending timestamp(exclusive) in seconds for lineage

    Sample response:

    Code Block
    {
      "startTimeInSeconds": 1442863938,
      "endTimeInSeconds": 1442881938,
      "paths": [
       ....
           list of paths which represent the different ways the field is created 
       ....
      ] 
    }
     
    Each path will look as follows:
     {
      "sources": [
        {
          "name": "PersonFile",
          "properties": {
            "inputPath": "/data/2017/persons",
            "regex": "*.csv"
          }
        },
        {
          "name": "HRFile",
          "properties": {
            "inputPath": "/data/2017/hr",
            "regex": "*.csv"
          }
        }
      ],
      "targets": [
        {
          "name": "Employee Data"
        }
      ],
      "operations": [
        {
          "inputs": [
            {
              "name": "PersonRecord",
              "properties": {
                "source": "PersonFile"
              }
            }
          ],
          "outputs": [
            {
              "name": "body"
            }
          ],
          "name": "READ",
          "description": "Read Person file.",
          "properties": {
            "stage": "Person File Reader"
          }
        },
        {
          "inputs": [
            {
              "name": "body"
            }
          ],
          "outputs": [
            {
              "name": "SSN"
            }
          ],
          "name": "PARSE",
          "description": "Parse the body field",
          "properties": {
            "stage": "Person File Parser"
          }
        },
        {
          "inputs": [
            {
              "name": "HRRecord",
              "properties": {
                "source": "HRFile"
              }
            }
          ],
          "outputs": [
            {
              "name": "body"
            }
          ],
          "name": "READ",
          "description": "Read HR file.",
          "properties": {
            "stage": "HR File Reader"
          }
        },
        {
          "inputs": [
            {
              "name": "body"
            }
          ],
          "outputs": [
            {
              "name": "Employee_Name"
            },
            {
              "name": "Dept_Name"
            }
          ],
          "name": "PARSE",
          "description": "Parse the body field",
          "properties": {
            "stage": "HR File Parser"
          }
        },
        {
          "inputs": [
            {
              "name": "Employee_Name"
            },
            {
              "name": "Dept_Name"
            },
            {
              "name": "SSN"
            }
          ],
          "outputs": [
            {
              "name": "ID",
              "properties": {
                "target": "Employee Data"
              }
            }
          ],
          "name": "GenerateID",
          "description": "Generate unique Employee Id",
          "properties": {
            "stage": "Field Normalizer"
          }
        }
      ],
      "runs": [
        "runidX",
        "runidY",
        "runidZ"
      ]
    }
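
The sample response suggests that runs producing the same lineage graph are reported as a single path with a "runs" list. One possible way to derive that grouping from per-run graphs is sketched below. The function `group_runs_into_paths` and the idea of using a canonical JSON dump as the grouping key are assumptions for illustration, not a description of the actual implementation.

```python
import json


def group_runs_into_paths(graph_by_run):
    """Group run ids whose lineage graphs are identical.

    graph_by_run maps a run id to its lineage graph (a JSON-like dict).
    Returns a list of paths, each a copy of the graph plus a "runs" list.
    """
    paths = {}
    for run_id, graph in graph_by_run.items():
        # Canonical serialization: equal graphs yield equal keys
        # regardless of dictionary key order.
        key = json.dumps(graph, sort_keys=True)
        if key not in paths:
            paths[key] = dict(graph, runs=[])
        paths[key]["runs"].append(run_id)
    return list(paths.values())


if __name__ == "__main__":
    read_only = {"operations": [{"name": "READ", "outputs": ["body"]}]}
    read_parse = {"operations": [{"name": "READ"}, {"name": "PARSE"}]}
    result = group_runs_into_paths(
        {"runidX": read_only, "runidY": dict(read_only), "runidZ": read_parse}
    )
    print(json.dumps(result, indent=2))
```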

 

 

 

 

 

...