Checklist

  •  User Stories Documented
  •  User Stories Reviewed
  •  Design Reviewed
  •  APIs reviewed
  •  Release priorities assigned
  •  Test cases reviewed
  •  Blog post

Overview 

This feature will allow users to see the history of directives applied to a column of data.

Goals

Users should be able to see lineage information (i.e., the directives applied) for columns.

Storing lineage information should have minimal to no impact on the wrangler application.

User Stories

  • As a user, I should be able to see the directives applied to a column of data.

  • As a user, I should be able to see the directives applied to a column of data over any period of time.

  • As a user, I should be able to see how any column got to its current state, as well as the other columns that were impacted by it.
  • As a user, I should be able to add tags and properties to specific columns of data (stretch)

Design

After parsing the directives, save each column's directives in AST form, along with the necessary metadata (time, dataset/stream name/id, etc.).

Use TMS to send the information to the platform.

Unmarshal the information and store it in HBase.

Access to lineage should only be available through the platform.
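The steps above can be sketched end to end. Everything in this sketch is a stand-in assumption: a pipe-delimited String in place of a real AST serialization, a direct method call in place of TMS, and a HashMap in place of the HBase table.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the proposed flow: wrangler marshals per-column
// lineage info, the platform unmarshals it and stores it by key.
public class LineageFlowSketch {
  // Wrangler side: after parsing, serialize per-column lineage info.
  static String marshal(String dataset, String column, String directive, long ts) {
    return dataset + "|" + column + "|" + directive + "|" + ts;
  }

  // Platform side: unmarshal the payload and append it under a
  // dataset:column key (standing in for an HBase row).
  static void consume(String payload, Map<String, String> hbaseStandIn) {
    String[] parts = payload.split("\\|");
    String rowKey = parts[0] + ":" + parts[1];
    hbaseStandIn.merge(rowKey, parts[2] + "@" + parts[3], (a, b) -> a + ";" + b);
  }

  public static void main(String[] args) {
    Map<String, String> store = new HashMap<>();
    consume(marshal("logs", "body", "lowercase body", 1500000000L), store);
    consume(marshal("logs", "body", "drop body", 1500000100L), store);
    System.out.println(store.get("logs:body"));
    // prints: lowercase body@1500000000;drop body@1500000100
  }
}
```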

Questions

  • How to get source and sink datasets?
  • How to ensure this works with multiple transform nodes, even just wrangler nodes?
  • How to send and consume data from wrangler to CDAP?
  • Does ParseTree have all necessary information for every directive?

Approach

Computing Lineage Approach #1:

Store directives during the execution of each step.

Advantages:

  • Fewer assumptions

Disadvantages:

  • Requires adding a getter to each step class, plus (in roughly 30% of cases) a local variable
  • Slower

Computing Lineage Approach #2 (Preferred):

Compute lineage by backtracking through the directives, without looking at the data.

Advantages:

  • No instance variables added to step classes
  • Faster

Disadvantages:

  • Requires stricter rules on directives, i.e., every rename must provide both the old and the new name (see * below for why)
  • Relies on more assumptions

*Backtracking starts with columns A, B, C. The previous directive is "set-columns A B C". The directive before that is "lowercase <column>", where <column> is nameOfOwner. There is no way of knowing which column nameOfOwner refers to without looking at the data.
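By contrast, when every rename carries both the old and the new name, backtracking can resolve names without touching the data. A minimal sketch, where the String-array directive representation and the column names are hypothetical:

```java
import java.util.List;

// Sketch of backtracking (Approach #2), assuming every rename directive
// records both the old and the new column name.
public class BacktrackSketch {
  // Walk the directive list in reverse, mapping a final column name
  // back to the name it had before each rename.
  static String backtrack(String column, List<String[]> directives) {
    String name = column;
    for (int i = directives.size() - 1; i >= 0; i--) {
      String[] d = directives.get(i);
      // d = {"rename", oldName, newName}: resolvable without data
      if (d[0].equals("rename") && d[2].equals(name)) {
        name = d[1];
      }
    }
    return name;
  }

  public static void main(String[] args) {
    List<String[]> directives = List.of(
        new String[] {"lowercase", "body_1"},
        new String[] {"rename", "body_1", "first_name"});
    // "first_name" traces back to "body_1" with no access to the data.
    System.out.println(backtrack("first_name", directives));  // prints: body_1
  }
}
```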

Storage Approach #1:

Key is dataset name + column name + direction (down/up)

  • Each time a pipeline is run, each column's row is appended with more lineage information and a timestamp
  • On retrieval, read the single row and display entries until the requested time is reached


Storage Approach #2:

Key is dataset name + column name + time of execution + direction (down/up)

  • Each time a pipeline is run, each column gets its own new row in the HBase table
  • On retrieval, get all rows for the column that match the time range and display them by merging them together
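The two key layouts can be sketched as plain string keys. The separator and field order here are assumptions; "down"/"up" marks the lineage direction:

```java
// Hypothetical sketch comparing the two proposed row-key layouts.
public class RowKeySketch {
  // Approach #1: one row per dataset+column+direction; lineage entries
  // are appended to the same row on every pipeline run.
  static String keyApproach1(String dataset, String column, String direction) {
    return dataset + ":" + column + ":" + direction;
  }

  // Approach #2: one row per run; the execution timestamp is part of the
  // key, so a time-range scan gathers exactly the relevant rows.
  static String keyApproach2(String dataset, String column, long runTime, String direction) {
    return dataset + ":" + column + ":" + runTime + ":" + direction;
  }

  public static void main(String[] args) {
    System.out.println(keyApproach1("logs", "body", "down"));
    // prints: logs:body:down
    System.out.println(keyApproach2("logs", "body", 1500000000L, "down"));
    // prints: logs:body:1500000000:down
  }
}
```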


API changes

New Programmatic APIs

FieldLevelLineage is a Java class that contains all the information that needs to be sent to the CDAP platform.

An instance should be initialized per wrangler node, passing in the list of final columns (the schema) after parsing the directives.

store() takes a ParseTree and stores all the necessary information into the lineage.

Lineage for each column is kept in the lineage instance variable, which maps column names to ASTs.

FieldLevelLineage (Java):
public class FieldLevelLineage {
  private enum Type { ADD, DROP, MODIFY, READ, RENAME }
  class BranchingStepNode {
    boolean continueUp, continueDown;
    String directive;
    Map<String, Integer> upBranches, downBranches;
    // constructors, toString(), ...
  }
  private static final String PROGRAM = "wrangler";
  private final long startTime;
  private Set<String> startColumns;
  private final Set<String> currentColumns, endColumns;
  private final Map<String, List<BranchingStepNode>> lineage;

  public FieldLevelLineage(String[] columnNames) {...}

  // getters
  // helpers for store

  public void store(ParseTree tree) {
    /*
     * Go through the tree one directive at a time.
     * For each column associated with the directive, get its label.
     * Store the directive for this column into lineage based on that label.
     */
  }
 
  // toString()
}

The ParseTree should contain all columns affected by each directive.

Labels:

  • Every column should be labeled with one of: {Read, Drop, Modify, Add, Rename}
  • Read: the column's name or values are read, including directives that read values in order to modify the dataset, e.g. "filter-rows-on..."
  • Drop: the column is dropped
  • Add: the column is added
  • Rename: the column's name is replaced with another name
  • Modify: the column's values are altered in a way that doesn't fit any of the other categories, e.g. "lowercase"

For Read, Drop, Modify, and Add, the column and its label should look like -> Column: Name, Label: add.

For Rename, the column and its label should look like -> Column: body_5 DOB, Label: rename. Both names are needed (old then new); currently they are separated by a space. For a swap of A and B: "A B, Label: rename" and "B A, Label: rename".

For Read, Modify, and Add there is another option: instead of a column name, the ParseTree can return one of {"all columns", "all columns minus _ _ _ _ ", "all columns formatted %s_%d"} along with the label, e.g. Column: "all columns minus body", Label: add. Here "all columns" refers to all columns present in the dataset after execution of this step.

For Rename and Drop this option is not available; the names of all columns involved must be returned explicitly.

**Assumption (can be changed): in the ParseTree, all columns should be in order of impact, i.e. if the directive is "copy A A_copy", "A, Label: read" should come before "A_copy, Label: add".
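The ordering assumption can be illustrated for "copy A A_copy". The `Pair` holder and the `labelsForCopy` helper below are hypothetical, not part of the actual ParseTree API:

```java
import java.util.List;

// Hypothetical sketch of the assumed ParseTree output for one directive:
// an ordered list of (column, label) pairs, with the read of the source
// column reported before the add of its copy.
public class LabelSketch {
  record Pair(String column, String label) {}

  // Labeling for "copy <src> <dst>": src is read, dst is added.
  static List<Pair> labelsForCopy(String src, String dst) {
    return List.of(new Pair(src, "read"), new Pair(dst, "add"));
  }

  public static void main(String[] args) {
    for (Pair p : labelsForCopy("A", "A_copy")) {
      System.out.println(p.column() + ", Label: " + p.label());
    }
    // prints:
    // A, Label: read
    // A_copy, Label: add
  }
}
```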

Algorithm visual: example wrangler application --> lineage --> ASTs for each column (diagram omitted)

New REST APIs

Path: /v3/namespaces/{namespace-id}/datasets/{dataset-id}/columns/{column-id}/lineage?start=<start-ts>&end=<end-ts>&maxLevels=<max-levels>
Method: GET
Description: Returns the list of directives applied to the specified column in the specified dataset
Response: 200: Successful; response body TBD, but will contain a tree representation

Path: /v3/namespaces/{namespace-id}/streams/{stream-id}/columns/{column-id}/lineage?start=<start-ts>&end=<end-ts>&maxLevels=<max-levels>
Method: GET
Description: Returns the list of directives applied to the specified column in the specified stream
Response: 200: Successful; response body TBD, but will contain a tree representation

CLI Impact or Changes

TBD

UI Impact or Changes

  • Option 1: Add an interface to the metadata table when viewing a dataset to see the lineage of columns, possibly by clicking on a column (mockups omitted)
  • Option 2: Show all columns at once directly on the lineage tab when clicking on a dataset, with tabs to switch between field level and dataset level (mockup omitted)


Security Impact 

Should be none, TBD

Impact on Infrastructure Outages 

Storage in HBase; Impact TBD.

Test Scenarios

Test ID: 1
Test Description: Tests all directives
Expected Results: All Step subclasses are properly parsed, containing all correct columns with correct labels

Test ID: 2
Test Description: Multiple datasets/streams
Expected Results: Lineages are correctly shown between different datasets/streams

Test ID: 3
Test Description: Tests store()
Expected Results: FieldLevelLineage.store() always correctly stores each step

Releases

Release 4.3.0

Release 4.4.0

Related Work

  • Fixing TextDirectives and parsing of directives in general