CDAP provides a way to retrieve the lineage of dataset entities. A dataset entity can have an associated schema, which defines the fields in the dataset along with their data types. Field Level Lineage gives a user a more granular lineage view of a dataset. For a specified time range, the field lineage of a dataset shows all the fields that were computed for that dataset and the fields from source datasets that participated in computing them. Field lineage also shows the detailed operations that transformed fields of a source dataset into a field of the given dataset.

Example Use Case

The data analytics group at an airline company wants to run analytical queries on passenger information stored on SFTP servers. To run the workload, the passenger information is imported into Hadoop. Because the data can come from a variety of sources, it is normalized into a standard format during import. PII is also obfuscated.

Looking at the dataset in Hadoop, a data officer or business analyst wants to understand the meaning of the field "fullName" by seeing how it was produced. For example, a data officer might want to know that the "fullName" field was created by concatenating the "firstName" and "lastName" fields, both of which were extracted as positional fields from a CSV record in the source named "passengerList". In a triage or debugging scenario, an operations team or developer also wants to identify how a field was computed when fields show up with wrong values.

Concepts and Terminology

  • Field: A field identifies a column in a dataset. A field has a name and a data type.

  • EndPoint: An EndPoint defines the source or destination of the data, along with the namespace, that fields are read from or written to.

  • Field Operation: An operation defines a single computation on a field. It has a name and a description.

  • Read Operation: A type of operation that reads from the source EndPoint and creates a collection of fields.

  • Transform Operation: A type of operation that transforms a collection of input fields into a collection of output fields.

  • Write Operation: A type of operation that writes a collection of fields to the destination EndPoint.

  • Origin: The origin of a field is the name of the operation that output the field. The <origin, fieldName> pair uniquely identifies the field, because a field with the same name can appear in the outputs of multiple operations.
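
To illustrate why the origin is part of a field's identity, here is a minimal sketch in plain Java. The `FieldRef` class and the `OriginSketch` wrapper are hypothetical stand-ins written for this illustration and are not part of the CDAP API; the operation names "Read" and "Normalize" are likewise only examples.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

public class OriginSketch {

  /** Hypothetical stand-in for an <origin, fieldName> pair; not a CDAP class. */
  static final class FieldRef {
    final String origin;
    final String name;

    FieldRef(String origin, String name) {
      this.origin = origin;
      this.name = name;
    }

    @Override
    public boolean equals(Object o) {
      if (!(o instanceof FieldRef)) {
        return false;
      }
      FieldRef other = (FieldRef) o;
      // Two references denote the same field only if both origin and name match.
      return origin.equals(other.origin) && name.equals(other.name);
    }

    @Override
    public int hashCode() {
      return Objects.hash(origin, name);
    }
  }

  /** Counts how many distinct fields the given references denote. */
  static int distinctFields(FieldRef... refs) {
    Set<FieldRef> unique = new HashSet<>(Arrays.asList(refs));
    return unique.size();
  }

  public static void main(String[] args) {
    FieldRef raw = new FieldRef("Read", "address");
    FieldRef normalized = new FieldRef("Normalize", "address");
    // Same field name, different origins: these are different fields.
    System.out.println(raw.equals(normalized));
    System.out.println(distinctFields(raw, normalized, new FieldRef("Read", "address")));
  }
}
```

In this sketch, a field named "address" output by a read step and a field named "address" output by a later normalization step are treated as two separate fields, which is the ambiguity the <origin, fieldName> pair resolves.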

Field Lineage for CDAP Programs

Field lineage recording is supported in MapReduce and Spark programs.

...

Code Block
@Override
public void initialize() throws Exception {
  MapReduceContext context = getContext();
  List<Operation> operations = new ArrayList<>();

  Operation read = new ReadOperation("Read", "Read passenger information", EndPoint.of("ns", "passengerList"),
                                     "id", "firstName", "lastName", "address");
  operations.add(read);

  Operation concat = new TransformOperation("Concat", "Concatenated fields",
                                            Arrays.asList(InputField.of("Read", "firstName"),
                                            InputField.of("Read", "lastName")), "fullName");
  operations.add(concat);

  Operation normalize = new TransformOperation("Normalize", "Normalized field",
                                               Collections.singletonList(InputField.of("Read", "address")),
                                               "address");
  operations.add(normalize);

  Operation write = new WriteOperation("Write", "Wrote to passenger dataset", EndPoint.of("ns", "passenger"),
                                       Arrays.asList(InputField.of("Read", "id"),
                                                     InputField.of("Concat", "fullName"),
                                                     InputField.of("Normalize", "address")));
  operations.add(write);

  // Record field operation
  context.record(operations);
}
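
The recorded operations form a graph that can be walked backward from a written field to the source fields it depends on. The following is a rough sketch of that idea in plain Java, not CDAP's actual lineage computation: the `Op` class, `sourceFields`, and `exampleOps` are hypothetical names introduced here, and the sketch assumes every input of an operation contributes to every one of its outputs and that the operations contain no cycles.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class LineageTraceSketch {

  /** Hypothetical simplified operation: a name, {origin, field} inputs, and output field names. */
  static final class Op {
    final String name;
    final List<String[]> inputs;
    final List<String> outputs;

    Op(String name, List<String[]> inputs, List<String> outputs) {
      this.name = name;
      this.inputs = inputs;
      this.outputs = outputs;
    }
  }

  /** Builds an {origin, field} input reference. */
  static String[] in(String origin, String field) {
    return new String[] {origin, field};
  }

  /** Walks backward from <origin, field> until operations without inputs (reads) are reached. */
  static Set<String> sourceFields(Map<String, Op> ops, String origin, String field) {
    Op op = ops.get(origin);
    if (op == null || !op.outputs.contains(field)) {
      throw new IllegalArgumentException("Unknown field <" + origin + ", " + field + ">");
    }
    Set<String> result = new TreeSet<>();
    if (op.inputs.isEmpty()) {
      // A read operation: the field comes straight from the source.
      result.add(field);
      return result;
    }
    for (String[] input : op.inputs) {
      result.addAll(sourceFields(ops, input[0], input[1]));
    }
    return result;
  }

  /** Mirrors the Read, Concat, and Normalize operations from the example above. */
  static Map<String, Op> exampleOps() {
    Map<String, Op> ops = new HashMap<>();
    ops.put("Read", new Op("Read", Collections.emptyList(),
                           Arrays.asList("id", "firstName", "lastName", "address")));
    ops.put("Concat", new Op("Concat",
                             Arrays.asList(in("Read", "firstName"), in("Read", "lastName")),
                             Collections.singletonList("fullName")));
    ops.put("Normalize", new Op("Normalize",
                                Collections.singletonList(in("Read", "address")),
                                Collections.singletonList("address")));
    return ops;
  }

  public static void main(String[] args) {
    Map<String, Op> ops = exampleOps();
    System.out.println(sourceFields(ops, "Concat", "fullName"));   // prints [firstName, lastName]
    System.out.println(sourceFields(ops, "Normalize", "address")); // prints [address]
  }
}
```

Under these assumptions, tracing <Concat, fullName> leads back to the "firstName" and "lastName" fields read from "passengerList", which matches the explanation a data officer would expect in the use case.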

Field Lineage for CDAP Data Pipelines

Plugins in CDAP data pipelines can also record field lineage. All plugin types except sparkprogram support it. A plugin records lineage in its prepareRun() method, using the context passed to that method.

...