Introduction

Metadata is data about data, in other words, data that describes other data. There are many kinds of metadata, including:

Operational meta data describes the way that data was processed or created:

metrics: statistics about the data, and possibly about the processing that produced the data.
lineage: who produced this data, when, and how
audit: who or what accessed this data in what way (read or modified)

Technical meta data is associated with data and describes its technical properties:

checksums, number of records
format, schema, etc.

Business meta data is associated with data to tag, categorize, or inventory it, or to comply with some other business process. It is typically not intrinsic to the data, that is, it cannot be derived from the data itself.

Tags such as “confidential”, “pii”, “financial”
Properties such as “businessUnit:xyz” or “market:EMEA”

Applications for metadata are too numerous to list here. In the context of CDAP, meta data is used for two main purposes:

Data Governance:

Traceability: For a piece of data, where did it originate, how was it processed/transformed, where was it sent to, etc.
Compliance: Many enterprises are under strict regulations that require the ability to trace all data (and meta data) back to its origin and over its lifetime.

Discovery: Data scientists or business analysts use meta data to find data that they are interested in.

User Stories

[Trace back]

A credit card statement has a wrong charge and the customer complained about it. The bank needs to find out where the incorrect data originates from.
- Was the original data already incorrect? Then it needs to be identified for further action
- Was the data damaged during processing? If so, how was it processed, what were the pipelines/plugins that processed it, with what configuration?
A downstream process fails because its input data contains a field that does not comply with the schema. The operations team needs to determine why:
- What pipeline produced this data from what input data?
- What operations were applied to the input to produce this field?
A user notices that time stamps in a data set are in the wrong time zone, but only for some data. The operations team needs to find out:
- Where did the incorrect data come from? Is it one of the data providers that sends incorrect time stamps? Or is the problem in the pipeline that ingested the data?
In a data lake, various processes are responsible for dumping data from a variety of sources. The quality of the data varies depending on where it comes from. It is important for the user to identify which sources produce low-quality data by tracing back to them. The user can then apply additional pre-processing to such sources or simply quarantine them.
Organizations typically maintain one data lake into which different departments pump data. While analyzing data in the data lake, a data scientist needs additional information. For example, a field named 'resource' can have different meanings depending on where it originated. For the Admin Operation department, resource can simply represent a hardware unit, whereas for the Human Resource department, it can represent employee information. The ability to trace the field back to its source gives the data scientist more contextual information.

[Trace forward]

A data provider calls a bank's data lake operator and notifies him that the data received over a time period was wrong. The bank now needs to find out what other data was derived from this data, and reprocess it with the correct input data.
An ETL developer at the organization can trace forward the transformations applied to the data from its sources until it lands in the data lake, and use this information to optimize the data pipelines, for example by filtering records before performing joins [add more details..].
An admin can trace forward from the sources to figure out which sources are more popular (in terms of data quality, applicability for data scientists, etc.) and apply stronger policies to secure such data sources.

[Meta data provenance]

A data scientist notices that a data set is not tagged as “PII” even though it contains phone numbers. He calls the data lake operations team. The team that produces the data assures them that they have tagged this data as “PII”. The operations team wants to find out why the tag is missing (was it modified or removed after the fact, or was it missing at creation time?) and consults the audit logs/change history of the data set’s meta data.

[Discovery]

For a data experiment, a scientist wants to process credit card transactions that have been normalized to UTC time stamps. How can he find a dataset that has this data? And if that data does not exist, how can he find a data set with credit card transactions, and normalize the time stamps himself? He will search the meta data for:
1. Datasets that are tagged / described as credit card transactions
2. Datasets that have a time stamp field tagged “utc” or “normalized”

[Metrics as Metadata]

The data scientist further wants to understand the quality of the data. For this, he wants to see the processing metrics for each file in the data set:
1. how many records were processed
2. how many records were discarded due to schema/data validation errors

[Metadata propagation]

A developer wants to create a pipeline that reads from a dataset, applies some transformations, and propagates meta data from its input to its output. For example, if a field in the input data is tagged as “PII”, the corresponding field in the output data should also be tagged “PII”. However, if the pipeline anonymizes that field, it should not be tagged as “PII” in the output, but rather as “anonymized”.
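
The propagation rule in this story could look roughly like the following sketch. It is plain Python for illustration only and not a CDAP API: the function propagate_field_tags and its parameters are hypothetical, and it assumes the pipeline already knows which input field each output field derives from and which output fields it anonymizes.

```python
def propagate_field_tags(input_tags, field_mapping, anonymized_fields):
    """Compute tags for output fields from the tags of their input fields.

    input_tags:        dict of input field name -> set of tags
    field_mapping:     dict of output field name -> input field it derives from
    anonymized_fields: set of output field names that the pipeline anonymizes
    (All three structures are assumptions made for this illustration.)
    """
    output_tags = {}
    for out_field, in_field in field_mapping.items():
        # Copy so that the input metadata is never mutated.
        tags = set(input_tags.get(in_field, set()))
        if out_field in anonymized_fields and "PII" in tags:
            # An anonymized field no longer carries PII; mark it accordingly.
            tags.discard("PII")
            tags.add("anonymized")
        output_tags[out_field] = tags
    return output_tags


input_tags = {"phone": {"PII"}, "amount": {"financial"}}
field_mapping = {"phone_hash": "phone", "amount": "amount"}
result = propagate_field_tags(input_tags, field_mapping, {"phone_hash"})
```

In this example the output field phone_hash ends up tagged “anonymized” instead of “PII”, while amount keeps its “financial” tag unchanged.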

[Integrations]

An enterprise has a business meta data system, for example Atlas or Collibra, and would like to synchronize the CDAP metadata with that system.
1. Periodic batch import/export
2. Batch export of all meta data that has changed since last export
3. Tight integration through exposing all metadata changes via a message bus
4. Query external system from pipeline
5. Publish to external system from pipeline
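
Option 2 above (exporting only what changed since the last export) can be sketched with a simple watermark. This is an illustrative assumption rather than an existing CDAP feature: the function export_changes, the "modified" timestamp on each entry, and the entity names are all hypothetical.

```python
def export_changes(metadata, last_export_time):
    """Return the entries modified after last_export_time and a new watermark.

    metadata: dict of entity name -> {"tags": [...], "modified": <timestamp>}
    (hypothetical layout, chosen only for this sketch)
    """
    changed = {
        entity: entry
        for entity, entry in metadata.items()
        if entry["modified"] > last_export_time
    }
    # The next export starts from the newest change seen in this one.
    watermark = max(
        (entry["modified"] for entry in changed.values()),
        default=last_export_time,
    )
    return changed, watermark


metadata = {
    "dataset:transactions": {"tags": ["financial"], "modified": 100},
    "dataset:customers": {"tags": ["pii"], "modified": 250},
}
changed, watermark = export_changes(metadata, last_export_time=200)
```

A sketch like this only sees entities that still exist; exporting deletions would additionally require a change log.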

Required Platform Capabilities

[Trace back] Ability to trace a single record back to its origin

we would need to know

What run of which pipeline produced this record?
How was each field of the file (the output of that run) computed?
When traced back to the source, what input file was it in?

this can be accomplished by

adding the input file name and the id and run id of the pipeline to each record
computing field-level lineage for each run of a pipeline
possibly repeating this step for the pipeline that produced the input for this pipeline; etc.
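
A minimal sketch of the first and third steps, assuming that provenance is stored as three extra fields on each record and that a lookup from file name to the run that wrote it is available. The field names (_source_file, _pipeline_id, _run_id) and both functions are hypothetical; field-level lineage is not covered here.

```python
PROVENANCE_KEYS = ("_source_file", "_pipeline_id", "_run_id")


def annotate(record, source_file, pipeline_id, run_id):
    """Return a copy of the record with provenance fields added."""
    annotated = dict(record)
    annotated.update(
        _source_file=source_file, _pipeline_id=pipeline_id, _run_id=run_id
    )
    return annotated


def trace_back(record, file_provenance):
    """Follow provenance from a record back to its original input file.

    file_provenance maps a file name to the provenance of the run that wrote
    it; files without an entry are treated as original inputs.
    """
    steps = []
    seen_files = set()
    step = {key: record[key] for key in PROVENANCE_KEYS}
    while step is not None:
        steps.append(step)
        source_file = step["_source_file"]
        if source_file in seen_files:
            break  # guard against cyclic provenance data
        seen_files.add(source_file)
        step = file_provenance.get(source_file)
    return steps


# The "normalize" pipeline read a file that the "ingest" pipeline had written.
file_provenance = {
    "staged/tx.avro": {
        "_source_file": "landing/tx.csv",
        "_pipeline_id": "ingest",
        "_run_id": "run-1",
    }
}
record = annotate({"amount": 42}, "staged/tx.avro", "normalize", "run-7")
steps = trace_back(record, file_provenance)
```

The last element of steps names the original input file and the run that ingested it.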

[Trace forward]

we need to know for this dataset:

what files were received during this time frame?
what pipelines processed any of these files, and what were their outputs?
possibly transitively the same for pipelines that processed the outputs

this can be accomplished by

storing a lineage graph from dataset to dataset
annotating each file in a dataset with the run id of the pipeline that processed it
a tool recursively/transitively finds all files produced from the affected files
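
The transitive search in the last step amounts to a graph traversal. The sketch below assumes the file-level lineage is available as a list of (input file, run id, output file) edges; that representation and the function trace_forward are assumptions made for illustration.

```python
from collections import deque


def trace_forward(affected_files, edges):
    """Find every file derived, directly or transitively, from affected_files.

    edges: iterable of (input_file, run_id, output_file) tuples.
    Returns a dict of derived file -> run id that produced it.
    """
    outputs_by_input = {}
    for input_file, run_id, output_file in edges:
        outputs_by_input.setdefault(input_file, []).append((run_id, output_file))

    impacted = {}
    visited = set(affected_files)
    queue = deque(affected_files)
    while queue:  # breadth-first traversal of the lineage graph
        current = queue.popleft()
        for run_id, output_file in outputs_by_input.get(current, []):
            if output_file not in visited:
                visited.add(output_file)
                impacted[output_file] = run_id
                queue.append(output_file)
    return impacted


edges = [
    ("in/a.csv", "run-1", "stage/a.avro"),
    ("stage/a.avro", "run-2", "report/a.parquet"),
    ("in/b.csv", "run-3", "stage/b.avro"),
]
impacted = trace_forward(["in/a.csv"], edges)
```

Here impacted contains stage/a.avro and report/a.parquet, which are the files that would need to be reprocessed, while stage/b.avro is unaffected.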

[Meta Data Provenance]

we need to know

what meta data was associated with this field when it was ingested originally
what changes were made to the meta data afterwards, and who made them?

this can be accomplished by

storing a change log for all meta data in a retrievable way (more than just logging it)
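
One way to make such a change log retrievable is an append-only list of change records that can be replayed up to a point in time. The following is a minimal in-memory sketch; the classes MetadataChange and MetadataChangeLog, the action names, and the integer timestamps are hypothetical and not part of CDAP.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetadataChange:
    timestamp: int
    entity: str
    user: str
    action: str  # "add_tag" or "remove_tag"
    tag: str


class MetadataChangeLog:
    """Append-only log of tag changes (in-memory, for illustration)."""

    def __init__(self):
        self._changes = []

    def record(self, timestamp, entity, user, action, tag):
        if action not in ("add_tag", "remove_tag"):
            raise ValueError("unknown action: " + action)
        self._changes.append(MetadataChange(timestamp, entity, user, action, tag))

    def history(self, entity):
        """All changes for an entity, oldest first."""
        changes = [c for c in self._changes if c.entity == entity]
        return sorted(changes, key=lambda c: c.timestamp)

    def tags_at(self, entity, timestamp):
        """Replay the log to reconstruct the tags as of a point in time."""
        tags = set()
        for change in self.history(entity):
            if change.timestamp > timestamp:
                break
            if change.action == "add_tag":
                tags.add(change.tag)
            else:
                tags.discard(change.tag)
        return tags


log = MetadataChangeLog()
log.record(1, "dataset:customers", "producer-team", "add_tag", "PII")
log.record(5, "dataset:customers", "alice", "remove_tag", "PII")
```

With this log, the operations team in the provenance story could see that the “PII” tag was present at creation time and find out who removed it later.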

[Discovery]

we need to be able to

tag and annotate fields of a dataset’s schema and make that searchable
run complex queries (such as “credit card transactions AND timestamp:(UTC OR normalized)”)
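
To illustrate what such a query has to evaluate, the sketch below filters a small in-memory catalog by a dataset-level tag and by tags on one field. It does not parse a query language; the catalog layout and the function search are assumptions made for this example.

```python
def search(datasets, dataset_tag, field, any_field_tags):
    """Names of datasets that carry dataset_tag and whose given field has
    at least one of any_field_tags (compared case-insensitively)."""
    wanted = {tag.lower() for tag in any_field_tags}
    matches = []
    for dataset in datasets:
        if dataset_tag not in dataset["tags"]:
            continue
        field_tags = {t.lower() for t in dataset["field_tags"].get(field, set())}
        if field_tags & wanted:
            matches.append(dataset["name"])
    return matches


catalog = [
    {"name": "cc_tx_raw", "tags": {"credit card transactions"},
     "field_tags": {"timestamp": {"local"}}},
    {"name": "cc_tx_utc", "tags": {"credit card transactions"},
     "field_tags": {"timestamp": {"utc"}}},
    {"name": "web_logs", "tags": {"clickstream"},
     "field_tags": {"timestamp": {"utc"}}},
]
found = search(catalog, "credit card transactions", "timestamp", {"UTC", "normalized"})
```

Only cc_tx_utc matches: cc_tx_raw has the right dataset tag but no matching field tag, and web_logs has the field tag but not the dataset tag.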

[Metrics as Metadata]

we need to

store metrics as meta data during processing
retrieve these metrics as meta data during discovery
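
As a rough sketch, processing metrics could be stored as ordinary properties on the file they describe and read back during discovery. The property keys, the dict-based store, and both functions are hypothetical; "processed" is assumed to count all records read, including discarded ones.

```python
def record_file_metrics(metadata_store, file_path, processed, discarded):
    """Attach processing metrics to a file as metadata properties."""
    properties = metadata_store.setdefault(file_path, {})
    properties["metrics.records.processed"] = processed
    properties["metrics.records.discarded"] = discarded


def discard_ratio(metadata_store, file_path):
    """Fraction of processed records that were discarded for a file."""
    properties = metadata_store[file_path]
    processed = properties["metrics.records.processed"]
    if processed == 0:
        return 0.0
    return properties["metrics.records.discarded"] / processed


store = {}
record_file_metrics(store, "transactions/2018-01-01.avro", processed=1000, discarded=250)
ratio = discard_ratio(store, "transactions/2018-01-01.avro")
```

A data scientist could use a ratio like this to judge the quality of each file before using it in an experiment.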

[Metadata Propagation]

we need

programmatic APIs (for plugins) to access and publish meta data

meta data should only be written if the pipeline is successful

ways to configure a pipeline:

what meta data to retrieve from context/arguments/external
how to publish that meta data, and for what entities

minimum requirement is to have plugin APIs such that a custom plugin can do it

better: a Python/JavaScript action plugin to avoid compile/package/deploy
even better: a DSL or a set of configurations/directives for Hydrator to avoid coding

Requirements

Store

associate meta data with a file
associate meta data with a field of a dataset’s schema
retrieve meta data for non-CDAP entities
search meta data for non-CDAP entities
retrieve the change history for all meta data of an entity (and its sub-entities)

Lineage

File to file lineage
Field lineage

collect per plugin/transform/directive
present as graph or similar navigable UI

Pipeline

propagate meta data from source to sink
map input files to output files 1:1
conditional processing based on meta data
explicitly set meta data for an entity
associate processing metrics as meta data for the sink
define meta data based on condition

Integrations

query meta data for an entity from an external meta data system
publish meta data to an external meta data system
all meta data operations via message bus
batch import/export of meta data (only changes)
authorization for meta data through Ranger/Sentry/external auth provider

Current Roadmap

5.0:

Field-level meta data
Field-level lineage

5.1:

File/Partition/custom entity meta data
Integration with external meta data systems

5.2:

Metadata provenance
Operational metadata
Catalog of all data by metadata
