Introduction

Metadata is data about data, in other words, data that describes other data. There are many kinds of metadata, including:

Operational meta data describes the way that data was processed or created:

metrics: statistics about the data, and possibly about the processing that produced the data.
lineage: who produced this data when and how
audit: who or what accessed this data in what way (read or modified)

Technical meta data is associated with data and describes its technical properties:

checksums, number of records
format, schema, etc.

Business meta data is associated with data to tag, categorize, inventorize it, or comply with some other business process. It is typically not intrinsic to the data, that is, it cannot be derived from the data itself.

Tags such as “confidential”, “pii”, “financial”
Properties such as “businessUnit:xyz” or market “EMEA”
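
To make the three kinds of metadata concrete, here is a minimal sketch of how they could be grouped for a single dataset. The class name, field names, and all sample values (including the pipeline name and record counts) are hypothetical and only illustrate the categories described above; they are not CDAP's data model.

```python
from dataclasses import dataclass, field


@dataclass
class EntityMetadata:
    """Hypothetical container for the metadata attached to one dataset."""
    # Operational: how the data was processed or created.
    operational: dict = field(default_factory=dict)
    # Technical: properties of the data itself (format, schema, counts).
    technical: dict = field(default_factory=dict)
    # Business: tags and properties that cannot be derived from the data.
    business_tags: set = field(default_factory=set)
    business_properties: dict = field(default_factory=dict)


# Illustrative values only.
transactions = EntityMetadata(
    operational={"producedBy": "ingest-pipeline", "recordsDiscarded": 12},
    technical={"format": "avro", "numRecords": 1000},
    business_tags={"confidential", "pii", "financial"},
    business_properties={"businessUnit": "xyz", "market": "EMEA"},
)
```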

The applications of metadata are too numerous to list here. In the context of CDAP, meta data is used for two main purposes:

Data Governance:

Traceability: For a piece of data, where did it originate, how was it processed/transformed, where was it sent to, etc.
Compliance: Many enterprises are under strict regulations that require the ability to trace all data (and meta data) back to its origin and over its lifetime.

Discovery: Data scientists or business analysts use meta data to find data that they are interested in.

User Stories

[Trace back]

A credit card statement has a wrong charge and the customer complained about it. The bank needs to find out where the incorrect data originates from.
- Was the original data already incorrect? Then it needs to be identified for further action
- Was the data damaged during processing? If so, how was it processed, what were the pipelines/plugins that processed it, with what configuration?
A downstream process fails because its input data contains a field that does not comply with the schema. The operations team needs to determine why:
- What pipeline produced this data from what input data?
- What operations were applied to the input to produce this field?
A user notices that the time stamps in a data set are in the wrong time zone, but only for some data. The operations team needs to find out:
- Where did the incorrect data come from? Is it one of the data providers that sends incorrect time stamps? Or is the problem in the pipeline that ingested the data?
In a data lake, various processes are responsible for dumping data from a variety of sources. The quality of the data produced varies depending on where the data comes from. It is important for the user to identify which sources produce low-quality data by tracing back to them. The user can then apply additional pre-processing to such sources or simply quarantine them.
Organizations typically maintain one data lake into which different departments pump data. When analyzing such data, a data scientist needs additional information. For example, a field named 'resource' can have a different meaning depending on where it originated: for the Admin Operation department, a resource may simply be a hardware unit, whereas for the Human Resource department, it may be employee information. The ability to trace the field back to its source gives the data scientist more contextual information.
An Admin finds out that a certain dataset is very popular and used by many downstream consumers. He wishes to trace it back to its source in order to apply stronger policies to secure such data sources.
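
The trace-back stories above all amount to walking lineage upstream from a dataset to whatever produced it. A minimal sketch of that walk follows, assuming lineage is available as a simple mapping from each entity to the inputs it was derived from. The mapping, the entity names, and the function names are hypothetical, not a CDAP API.

```python
from collections import deque

# Hypothetical lineage: each entity maps to the inputs it was derived from.
DERIVED_FROM = {
    "statement": ["charges_clean"],
    "charges_clean": ["charges_raw", "merchants"],
    "charges_raw": ["provider_feed"],
}


def trace_back(entity, derived_from):
    """Return all upstream entities of `entity`, nearest first (breadth-first)."""
    seen = {entity}
    order = []
    queue = deque([entity])
    while queue:
        current = queue.popleft()
        for parent in derived_from.get(current, []):
            if parent not in seen:
                seen.add(parent)
                order.append(parent)
                queue.append(parent)
    return order


def origins(entity, derived_from):
    """Upstream entities with no recorded inputs, i.e. the original sources."""
    return [e for e in trace_back(entity, derived_from) if not derived_from.get(e)]
```

With this sample lineage, `origins("statement", DERIVED_FROM)` yields the two sources a bank would need to inspect first; the intermediate entities returned by `trace_back` identify the processing steps in between.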

[Trace forward]

A data provider calls a bank's data lake operator and notifies him that the data received over a time period was wrong. The bank now needs to find out what other data was derived from this data, and reprocess it with the correct input data.
An ETL developer at the organization can trace forward the transformations applied to the data from the sources until it lands in the data lake. He can use this information to optimize the data pipelines, for example, by filtering records before performing joins [add more details..].
An Admin can trace forward from the sources to figure out which of them are more popular (in terms of data quality, applicability for data scientists, etc.) and apply stronger policies to secure such data sources.
Knowledge about how the output of a pipeline is used by downstream consumers can help the pipeline developer optimize the pipeline. For example, he can apply a filter or normalization in the pipeline if he finds out that all consumers apply it.
An Admin found out that a source inappropriately contained sensitive information. Tracing forward helps him determine derived datasets that need to be (re-)classified as sensitive.
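
Tracing forward is the same kind of graph walk in the opposite direction: starting from a source, collect everything derived from it. The sketch below assumes a hypothetical mapping from each dataset to its direct consumers and uses it to re-classify derived datasets as sensitive, as in the last story. All names are illustrative.

```python
# Hypothetical lineage: each dataset maps to the datasets derived directly from it.
CONSUMED_BY = {
    "provider_feed": ["charges_raw"],
    "charges_raw": ["charges_clean"],
    "charges_clean": ["statement", "fraud_report"],
}


def trace_forward(source, consumed_by):
    """Return every dataset derived, directly or transitively, from `source`."""
    derived = set()
    stack = [source]
    while stack:
        current = stack.pop()
        for child in consumed_by.get(current, []):
            if child not in derived:
                derived.add(child)
                stack.append(child)
    return derived


def reclassify(source, consumed_by, tags, tag="sensitive"):
    """Add `tag` to the source and to every dataset derived from it."""
    for dataset in {source} | trace_forward(source, consumed_by):
        tags.setdefault(dataset, set()).add(tag)
    return tags
```

The same `trace_forward` result can serve as the list of datasets to reprocess when a provider reports that its data was wrong.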

[Meta data provenance]

A data scientist notices that a data set is not tagged as “PII” even though it contains phone numbers. He calls the data lake operations team. The team that produces the data asserts that it tagged this data as “PII”. The operations team wants to find out why the tag is missing - was it modified or removed after the fact or was it missing at creation time? - and consults the audit logs/change history of the data set’s meta data.
A data scientist notices a data set that carries a certain tag. The data scientist wants to know who added this tag and when it was added.
In case of a major security breach, the Admin of the data lake can validate the authenticity of the dataset based on its creation time.
As part of metadata provenance, data can be tagged with owner information. Data scientists can use such owner information to assign a weight to the dataset.
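
The questions in these stories (who added a tag, when, and whether it was later removed) can be answered from an append-only change history. The sketch below assumes such a history exists as a list of change records; the record layout, user names, and timestamps are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetadataChange:
    """One hypothetical entry in the change history of an entity's metadata."""
    timestamp: int  # epoch seconds
    user: str
    action: str     # "add" or "remove"
    tag: str


def tag_history(changes, tag):
    """All changes that touched `tag`, oldest first."""
    return sorted((c for c in changes if c.tag == tag), key=lambda c: c.timestamp)


def has_tag(changes, tag):
    """True if the most recent change to `tag` was an add."""
    history = tag_history(changes, tag)
    return bool(history) and history[-1].action == "add"


# Illustrative history: the producer tagged the data, a later job removed the tag.
CHANGES = [
    MetadataChange(100, "producer-team", "add", "PII"),
    MetadataChange(250, "cleanup-job", "remove", "PII"),
    MetadataChange(120, "analyst", "add", "financial"),
]
```

Here `has_tag(CHANGES, "PII")` is false, and the last entry of `tag_history(CHANGES, "PII")` shows which user removed the tag and when, which is exactly what the operations team in the first story needs.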

[Discovery]

For a data experiment, a scientist wants to process credit card transactions that have been normalized to UTC time stamps. How can he find a dataset that has this data? And if that data does not exist, how can he find a data set with credit card transactions, and normalize the time stamps himself? He will search the meta data for:
1. Datasets that are tagged / described as credit card transactions
2. Datasets that have a time stamp field tagged “utc” or “normalized”
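
The two searches above can be sketched as one query over a catalog that stores tags at both dataset and field level. The catalog layout, dataset names, and tag names below are assumptions made for illustration only.

```python
# Hypothetical catalog: dataset-level tags plus per-field tags.
CATALOG = [
    {"name": "cc_tx_utc", "tags": {"credit-card", "transactions"},
     "fields": {"ts": {"utc"}, "amount": set()}},
    {"name": "cc_tx_raw", "tags": {"credit-card", "transactions"},
     "fields": {"ts": set(), "amount": set()}},
    {"name": "hr_resources", "tags": {"hr"},
     "fields": {"resource": set()}},
]


def search(catalog, dataset_tags=(), field_tags=()):
    """Names of datasets that carry all `dataset_tags` and, if `field_tags`
    is given, have at least one field carrying any of those tags."""
    wanted_dataset_tags = set(dataset_tags)
    wanted_field_tags = set(field_tags)
    hits = []
    for dataset in catalog:
        if not wanted_dataset_tags <= dataset["tags"]:
            continue
        if wanted_field_tags and not any(
            wanted_field_tags & tags for tags in dataset["fields"].values()
        ):
            continue
        hits.append(dataset["name"])
    return hits
```

The scientist would first call `search(CATALOG, {"credit-card"}, {"utc", "normalized"})` and, if that returns nothing, fall back to `search(CATALOG, {"credit-card"})` and normalize the time stamps himself.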

[Fine-Grained Metadata]

In case of major security breach, the Admin of the data lake can validate the authenticity of each file in a dataset based on its creation time.
Data quality can vary within a dataset, based on the origin of each file. It is useful to assign data quality metadata to each file.

[Metrics as Metadata/Data Quality]

The data scientist further wants to understand the quality of the data. For this, he wants to see the processing metrics for each file in the data set:
1. how many records were processed
2. how many records were discarded due to schema/data validation errors
In a data lake, various processes are responsible for dumping data from a variety of sources. The quality of the data produced varies depending on where the data comes from. It is important for the user to identify which sources produce low-quality data by tracing back to them. The user can then apply additional pre-processing to such sources or simply quarantine them.
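
As a sketch of how per-file processing metrics could be turned into a data quality signal, the example below computes a discard rate per file and flags files above a threshold. It assumes that "processed" counts every record read and that "discarded" is the subset rejected by validation; the file names, counts, and the 10% threshold are hypothetical.

```python
# Hypothetical per-file metrics attached as metadata.
FILE_METRICS = {
    "feed_a/part-0": {"processed": 1000, "discarded": 5},
    "feed_b/part-0": {"processed": 800, "discarded": 200},
}


def discard_rate(metrics):
    """Fraction of processed records that were discarded (0.0 for empty files)."""
    processed = metrics["processed"]
    return metrics["discarded"] / processed if processed else 0.0


def low_quality_files(file_metrics, max_discard_rate=0.1):
    """Files whose discard rate exceeds the threshold, sorted by name."""
    return sorted(
        name for name, metrics in file_metrics.items()
        if discard_rate(metrics) > max_discard_rate
    )
```

Files returned by `low_quality_files` are candidates for additional pre-processing or quarantine, and tracing them back identifies the source that produced them.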

[Metadata propagation]

A developer wants to create a pipeline that reads from a dataset, applies some transformations, and propagates meta data from its input to its output. For example, if a field in the input data is tagged as “PII”, the corresponding field in the output data should also be tagged “PII”. However, if the pipeline anonymizes that field, it should not be tagged as “PII” in the output, but rather as “anonymized”.
A developer wants to create a pipeline that reads from a dataset, applies some transformation, and propagates some attributes of the source to its output. For example, he might want the output to be tagged with the file size of the input file.
Organizations typically maintain one data lake into which different departments pump data. When analyzing such data, a data scientist needs additional information. For example, a field named 'resource' can have a different meaning depending on where it originated: for the Admin Operation department, a resource may simply be a hardware unit, whereas for the Human Resource department, it may be employee information. Therefore, it would be best to annotate the sink dataset with the origin upon ingestion.
As part of the ingestion process, data can be tagged with owner information. Data scientists can use such owner information to assign a weight to the dataset.
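
The “PII” example from the first story can be sketched as a small propagation rule: field tags are copied from input to output, except that an anonymized field trades its “PII” tag for “anonymized”. The function and the field names are hypothetical; how a pipeline would declare which fields it anonymizes is left open here.

```python
def propagate_field_tags(input_tags, anonymized_fields):
    """Copy per-field tags from input to output.

    Fields listed in `anonymized_fields` lose the "PII" tag and gain
    "anonymized" instead; all other tags are carried over unchanged.
    """
    output_tags = {}
    for field_name, tags in input_tags.items():
        tags = set(tags)  # copy, so the input metadata is not modified
        if field_name in anonymized_fields and "PII" in tags:
            tags.discard("PII")
            tags.add("anonymized")
        output_tags[field_name] = tags
    return output_tags


# Illustrative input: two PII fields, of which only "phone" is anonymized.
INPUT_TAGS = {"phone": {"PII"}, "email": {"PII", "contact"}, "amount": set()}
OUTPUT_TAGS = propagate_field_tags(INPUT_TAGS, anonymized_fields={"phone"})
```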

[Integrations]

An enterprise has a business meta data system (for example, Atlas or Collibra) and would like to synchronize the CDAP metadata with that system.
1. Periodic batch import/export
2. Batch export of all meta data that has changed since last export
3. Tight integration through exposing all metadata changes via a message bus
4. Query external system from pipeline
5. Publish to external system from pipeline
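
Option 2, exporting only what has changed since the last export, is commonly done with a watermark: remember the newest change time that was exported and select only newer records next time. The sketch below assumes each metadata record carries an "updated" time stamp; the record layout and values are hypothetical, and the actual transfer to an external system is out of scope.

```python
def export_changes(records, last_export_ts):
    """Return the records changed after `last_export_ts` and the new watermark."""
    changed = [r for r in records if r["updated"] > last_export_ts]
    # If nothing changed, the watermark stays where it was.
    watermark = max((r["updated"] for r in changed), default=last_export_ts)
    return changed, watermark


# Illustrative metadata records with their last-modified times.
RECORDS = [
    {"entity": "dataset:cc_tx", "tags": ["pii"], "updated": 100},
    {"entity": "dataset:hr", "tags": ["hr"], "updated": 205},
    {"entity": "dataset:logs", "tags": [], "updated": 310},
]

batch, watermark = export_changes(RECORDS, last_export_ts=200)
```

Storing the returned watermark and passing it to the next call makes repeated exports incremental; a second call with the new watermark returns an empty batch until further changes arrive.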

Required Platform Capabilities

[Trace back] Ability to trace a single record back to its origin

...

associate meta data with a file
associate meta data with a field of a dataset (’s schema)
retrieve meta data for non-CDAP entities
search meta data for non-CDAP entities
retrieve the change history for all meta data of an entity (and its sub-entities)

Lineage

File to file lineage
Field lineage

collect per plugin/transform/directive
present as graph or similar navigable UI

Pipeline

propagate meta data from source to sink
map input files to output files 1:1
conditional processing based on meta data
explicitly set meta data for an entity
associate processing metrics as meta data for the sink
define meta data based on condition

Integrations

query meta data for an entity from an external meta data system
publish meta data to an external meta data system
all meta data operations via message bus
batch import/export of meta data (only changes)
authorization for meta data through Ranger/Sentry/external auth provider

Current Roadmap

5.0:

Field-level meta data : Metadata Custom Entities and Authorization
Field-level lineage

5.1:

File/Partition/custom entity meta data
Integration with external meta data systems

5.2:

Metadata provenance
Operational metadata
Catalog of all data by metadata

Versions Compared

Old Version 9

New Version Current

Key

Table of Contents

Introduction

User Stories

Required Platform Capabilities

Current Roadmap
