Field-level Lineage

Field-level lineage lets users see which directives were applied to a specific column of data in a given timeframe. Users can see how a column was generated, how its values were manipulated, and which other columns were produced from it.

Labels

Every column involved in a directive must have exactly one associated label. The labels are: {READ, ADD, DROP, RENAME, MODIFY}.

  • READ: When the values of a column impact one or more other columns it is labeled as a READ column.

    • Ex1. copy <source> <destination>. In this case since the values of the entries of the source column are read in order to produce the destination column, the source column should be labeled as READ.

    • Ex2. filter-row-if-matched <column> <regex>. In this case since the values of the entries of the supplied column are read in order to filter the rows in the dataset, the column should be labeled as READ. This holds even though the supplied column is itself modified by the filtering: because its values are read, it is labeled READ rather than MODIFY.

  • ADD: When a column is generated by the directive, this column is labeled as an ADD column.

    • Ex1. copy <source> <destination>. In this case since the destination is a new column that is generated by this directive, it should be labeled as ADD.

  • DROP: When a column is dropped as a result of the directive, this column is labeled as a DROP column.

    • Ex1. drop <column>[,<column>*]. In this case since all the columns listed are dropped by this directive, all the listed columns should be labeled as DROP columns.

  • RENAME: When the name of a column is changed to another name, both the old and new name are labeled as RENAME columns. Note that neither column is labeled as ADD or DROP since no column is added or dropped, but instead a column's name is being replaced in place.

    • Ex1. rename <old> <new>. In this case since the name old is being replaced with the name new, both old and new should be labeled as RENAME. This is because one column's name is being changed/renamed from old to new.

    • Ex2. swap <column1> <column2>. In this case since the names column1 and column2 are simply being replaced with each other, both column1 and column2 should be labeled as RENAME. No records are being added or lost by this directive.

  • MODIFY: When the values of a column's entries are potentially changed, but are not read in a way that impacts other columns, it should be labeled as a MODIFY column.

    • Ex1. lowercase <column>. In this case, since the column doesn't impact any other column and its values are potentially modified, it should be labeled as MODIFY.

  • Bonus: Rather than labeling every column individually when the columns are all READ, ADD, or MODIFY columns, one of the following can be used in place of the column name: {"all columns", "all columns minus _ _ _ _", "all columns formatted %s_%d"}. The first covers the case where all columns present in the dataset at the end of the directive can be labeled the same. The second covers the case where all columns present at the end of the directive, except for a space-separated list of columns, can be labeled the same. The third covers the case where all columns present at the end of the directive whose names follow the format string, which supports %s and %d, can be labeled the same. Again, this only works for READ, ADD, or MODIFY.

    • Ex1. split-to-columns <column> <regex>. In this case since all the newly produced columns will have names formatted column_%d, all columns formatted column_%d can be labeled ADD, rather than each individual new column.

    • Ex2. parse-as-csv <column> <delimiter>. In this case since all the columns present at the end of this directive will have been produced by this directive except for column itself, all columns minus column can be labeled ADD, rather than each individual new column.

    • Ex3. Custom directive: lowercase-all. This custom directive changes all the record values to lowercase. In this case all columns present at the end of this directive will have been modified by this directive, so all columns can be labeled MODIFY, rather than each individual column.
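
The labeling rules and shorthands above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the actual implementation: the names Label, label_directive, format_to_regex, and expand_shorthand are hypothetical, only a handful of the directives from the examples are covered, and it is assumed that %s matches any non-empty text and %d matches one or more digits.

```python
import re
from enum import Enum


class Label(Enum):
    READ = "READ"
    ADD = "ADD"
    DROP = "DROP"
    RENAME = "RENAME"
    MODIFY = "MODIFY"


def label_directive(directive):
    """Return {column: Label} for a few example directives (hypothetical helper)."""
    name, *args = directive.split()
    if name == "copy":
        source, destination = args
        return {source: Label.READ, destination: Label.ADD}
    if name == "filter-row-if-matched":
        # The column's values are read to filter rows, so it is READ, not MODIFY.
        return {args[0]: Label.READ}
    if name == "drop":
        # Columns are given as a comma-separated list.
        return {column: Label.DROP for column in "".join(args).split(",")}
    if name in ("rename", "swap"):
        # Both names are RENAME; nothing is added or dropped.
        return {args[0]: Label.RENAME, args[1]: Label.RENAME}
    if name == "lowercase":
        return {args[0]: Label.MODIFY}
    raise ValueError(f"unsupported directive: {name}")


def format_to_regex(fmt):
    """Translate a %s/%d format string into a regex (assumed semantics)."""
    parts = re.split(r"(%s|%d)", fmt)
    pattern = "".join(
        ".+" if part == "%s" else r"\d+" if part == "%d" else re.escape(part)
        for part in parts
    )
    return re.compile(pattern)


def expand_shorthand(shorthand, final_columns):
    """Resolve a shorthand against the columns present at the end of the directive."""
    minus_prefix = "all columns minus "
    format_prefix = "all columns formatted "
    if shorthand == "all columns":
        return list(final_columns)
    if shorthand.startswith(minus_prefix):
        excluded = set(shorthand[len(minus_prefix):].split())
        return [c for c in final_columns if c not in excluded]
    if shorthand.startswith(format_prefix):
        regex = format_to_regex(shorthand[len(format_prefix):])
        return [c for c in final_columns if regex.fullmatch(c)]
    # Not a shorthand: treat it as a plain column name.
    return [shorthand]
```

For example, under these assumptions label_directive("copy a b") labels a as READ and b as ADD, and expand_shorthand("all columns formatted body_%d", ["body", "body_1", "body_2", "id"]) resolves to body_1 and body_2, which could then all be labeled ADD for split-to-columns.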

Created in 2020 by Google Inc.