...
- Operational meta data describes the way that data was processed or created:
- metrics: statistics about the data, and possibly about the processing that produced the data.
- lineage: who produced this data, when, and how
- audit: who or what accessed this data in what way (read or modified)
- Technical meta data is associated with data and describes its technical properties:
- checksums, number of records
- format, schema, etc.
- Business meta data is associated with data to tag, categorize, or inventory it, or to comply with some other business process. It is typically not intrinsic to the data, that is, it cannot be derived from the data itself.
- Tags such as “confidential”, “pii”, “financial”
- Properties such as “businessUnit:xyz” or market “EMEA”
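The three categories above can be pictured as one record per dataset. The following is a minimal sketch, not CDAP's actual data model: the `DatasetMetadata` class, its field names, and the sample values (such as the `records.out` metric and the `avro` format) are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class DatasetMetadata:
    """Hypothetical container grouping the three kinds of metadata."""
    name: str
    # Operational: how the data was produced and accessed.
    metrics: Dict[str, int] = field(default_factory=dict)
    lineage: List[str] = field(default_factory=list)
    audit: List[str] = field(default_factory=list)
    # Technical: properties of the data itself (format, schema, checksums).
    technical: Dict[str, str] = field(default_factory=dict)
    # Business: assigned by people or processes, not derivable from the data.
    tags: Set[str] = field(default_factory=set)
    properties: Dict[str, str] = field(default_factory=dict)


statement = DatasetMetadata(name="card_statements")
statement.metrics["records.out"] = 1000
statement.technical["format"] = "avro"
statement.tags.update({"confidential", "pii", "financial"})
statement.properties["businessUnit"] = "xyz"
statement.properties["market"] = "EMEA"
```

Keeping business tags and properties separate from technical properties mirrors the distinction drawn above: only the former have to be supplied from outside the data.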
Metadata has too many applications to list here. In the context of CDAP, meta data is used for two main purposes:
- Data Governance:
- Traceability: For a piece of data, where did it originate, how was it processed/transformed, where was it sent to, etc.
- Compliance: Many enterprises are subject to strict regulations that require the ability to trace all data (and meta data) back to its origin and over its lifetime.
- Discovery: Data scientists or business analysts use meta data to find data that they are interested in.
User Stories
[Trace back]
- A credit card statement has a wrong charge and the customer has complained about it. The bank needs to find out where the incorrect data originated.
- Was the original data already incorrect? If so, it needs to be identified for further action.
- Was the data damaged during processing? If so, how was it processed, what were the pipelines/plugins that processed it, with what configuration?
- A downstream process fails because its input data contains a field that does not comply with the schema. The operations team needs to determine why:
- What pipeline produced this data from what input data?
- What operations were applied to the input to produce this field?
- A user notices that time stamps in a data set are in the wrong time zone, but only for some of the data. The operations team needs to find out:
- Where did the incorrect data come from? Is it one of the data providers that sends incorrect time stamps? Or is the problem in the pipeline that ingested the data?
[Trace forward]
- A data provider calls a bank's data lake operator and notifies him that the data the bank received over a certain time period was wrong. The bank now needs to find out what other data was derived from this data, and reprocess it with the correct input data.
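Both the trace-back and the trace-forward stories reduce to walking a lineage graph in one direction or the other. The sketch below assumes lineage is available as a simple mapping from each dataset to the datasets derived from it; the dataset names and the `DERIVED` mapping are hypothetical, and a real system would also record the pipeline, plugins, and configuration on each edge.

```python
from collections import deque
from typing import Dict, List, Set

# Hypothetical lineage edges: each dataset maps to the datasets derived from it.
DERIVED: Dict[str, List[str]] = {
    "provider_feed": ["raw_transactions"],
    "raw_transactions": ["clean_transactions"],
    "clean_transactions": ["statements", "fraud_scores"],
}


def trace_forward(start: str, edges: Dict[str, List[str]]) -> Set[str]:
    """Return every dataset reachable from `start`, breadth-first."""
    seen: Set[str] = set()
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for child in edges.get(current, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def invert(edges: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Flip the edges so that each dataset maps to its direct inputs."""
    inverse: Dict[str, List[str]] = {}
    for parent, children in edges.items():
        for child in children:
            inverse.setdefault(child, []).append(parent)
    return inverse


def trace_back(start: str, edges: Dict[str, List[str]]) -> Set[str]:
    """Return every dataset that `start` was (transitively) derived from."""
    return trace_forward(start, invert(edges))
```

With this mapping, `trace_forward("provider_feed", DERIVED)` lists everything that has to be reprocessed after a bad delivery, and `trace_back("statements", DERIVED)` lists the candidates for where a wrong charge could have originated.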
[Meta data provenance]
- A data scientist notices that a data set is not tagged as “PII” even though it contains phone numbers. He calls the data lake operations team. The team that produces the data insists that it tagged this data as “PII”. The operations team wants to find out why the tag is missing (was it modified or removed after the fact, or was it missing at creation time?) and consults the audit logs/change history of the data set’s meta data.
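Answering that question requires a change history for the meta data itself. As a minimal sketch, assume each tag change is recorded as an event with a logical time, an actor, and an action; the `TagChange` record and the sample actors are hypothetical and not an existing CDAP structure.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TagChange:
    time: int    # logical timestamp (assumption: larger means later)
    actor: str   # who or what made the change
    action: str  # "add" or "remove"
    tag: str


def explain_missing_tag(history: List[TagChange], tag: str) -> str:
    """Say whether `tag` was never set or was removed after the fact."""
    relevant = sorted((c for c in history if c.tag == tag), key=lambda c: c.time)
    if not relevant:
        return "never tagged"
    last = relevant[-1]
    if last.action == "remove":
        return f"removed by {last.actor} at {last.time}"
    return "currently tagged"


history = [
    TagChange(1, "producer-team", "add", "PII"),
    TagChange(7, "cleanup-job", "remove", "PII"),
]
print(explain_missing_tag(history, "PII"))
```

Here the history shows that the producing team did add the tag and that a later actor removed it, which is the distinction the operations team is after.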
[Discovery]
- For a data experiment, a scientist wants to process credit card transactions that have been normalized to UTC time stamps. How can he find a dataset that has this data? And if that data does not exist, how can he find a data set with credit card transactions, and normalize the time stamps himself? He will search the meta data for:
- Datasets that are tagged / described as credit card transactions
- Datasets that have a time stamp field tagged “utc” or “normalized”
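The two searches above can be sketched as a filter over dataset-level and field-level tags. The in-memory `CATALOG`, its layout, and the dataset names are assumptions for illustration; an actual implementation would query a search index rather than scan a dictionary.

```python
from typing import Dict, Iterable, List

# Hypothetical catalog: dataset name -> dataset-level tags and per-field tags.
CATALOG: Dict[str, dict] = {
    "cc_tx_utc": {
        "tags": {"credit-card", "transactions"},
        "fields": {"ts": {"utc", "normalized"}},
    },
    "cc_tx_raw": {
        "tags": {"credit-card", "transactions"},
        "fields": {"ts": set()},
    },
    "web_logs": {
        "tags": {"clickstream"},
        "fields": {"ts": {"utc"}},
    },
}


def search(catalog: Dict[str, dict],
           dataset_tags: Iterable[str],
           field_tags: Iterable[str] = ()) -> List[str]:
    """Names of datasets that carry all `dataset_tags` and, if `field_tags`
    is given, have at least one field carrying any of those tags."""
    wanted_dataset = set(dataset_tags)
    wanted_field = set(field_tags)
    hits = []
    for name, meta in catalog.items():
        if not wanted_dataset <= meta["tags"]:
            continue
        if wanted_field and not any(wanted_field & tags
                                    for tags in meta["fields"].values()):
            continue
        hits.append(name)
    return sorted(hits)
```

The scientist would first call `search(CATALOG, {"credit-card"}, {"utc", "normalized"})` and, if that returns nothing, fall back to `search(CATALOG, {"credit-card"})` to find data he can normalize himself.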
[Metrics as Metadata]
- The data scientist further wants to understand the quality of the data. For this, he wants to see the processing metrics for each file in the data set:
- how many records were processed
- how many records were discarded due to schema/data validation errors
[Metadata propagation]
- A developer wants to create a pipeline that reads from a dataset, applies some transformations, and propagates meta data from its input to its output. For example, if a field in the input data is tagged as “PII”, the corresponding field in the output data should also be tagged “PII”. However, if the pipeline anonymizes that field, it should not be tagged as “PII” in the output, but rather as “anonymized”.
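The propagation rule in this story can be sketched as a function over field-level tags. It assumes the pipeline can declare which input field each output field comes from and which output fields it anonymizes; the function name and its parameters are hypothetical.

```python
from typing import Dict, Set


def propagate_field_tags(input_tags: Dict[str, Set[str]],
                         field_mapping: Dict[str, str],
                         anonymized_fields: Set[str]) -> Dict[str, Set[str]]:
    """Copy tags from input fields to output fields.

    `field_mapping` maps each output field to the input field it derives from.
    For output fields listed in `anonymized_fields`, a "PII" tag is replaced
    by "anonymized"; all other tags are carried over unchanged.
    """
    output: Dict[str, Set[str]] = {}
    for out_field, in_field in field_mapping.items():
        tags = set(input_tags.get(in_field, set()))  # copy, do not mutate input
        if out_field in anonymized_fields and "PII" in tags:
            tags.discard("PII")
            tags.add("anonymized")
        output[out_field] = tags
    return output


input_tags = {"phone": {"PII"}, "amount": {"financial"}}
mapping = {"phone_hash": "phone", "phone": "phone", "amount": "amount"}
result = propagate_field_tags(input_tags, mapping, anonymized_fields={"phone_hash"})
```

In this example `phone_hash` ends up tagged “anonymized”, while the pass-through `phone` field keeps its “PII” tag.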
[Integrations]
- An enterprise has a business meta data system, for example Atlas or Collibra, and would like to synchronize the CDAP metadata with that system.
- Periodic batch import/export
- Batch export of all meta data that has changed since last export
- Tight integration through exposing all metadata changes via a message bus
- Query external system from pipeline
- Publish to external system from pipeline
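The "changed since last export" option above is commonly implemented with a watermark that records how far the previous export got. The sketch below assumes a change log of records that each carry a monotonically increasing `time`; the record layout and the `export_since` helper are hypothetical, and the actual hand-off to an external system is left out.

```python
from typing import Any, Dict, List, Tuple


def export_since(change_log: List[Dict[str, Any]],
                 watermark: int) -> Tuple[List[Dict[str, Any]], int]:
    """Return the changes newer than `watermark` and the new watermark."""
    batch = [c for c in change_log if c["time"] > watermark]
    new_watermark = max((c["time"] for c in batch), default=watermark)
    return batch, new_watermark


change_log = [
    {"time": 1, "entity": "cc_tx_raw", "change": "tag added: PII"},
    {"time": 5, "entity": "cc_tx_utc", "change": "property set: market=EMEA"},
    {"time": 9, "entity": "cc_tx_raw", "change": "tag removed: PII"},
]

first_batch, mark = export_since(change_log, watermark=0)
second_batch, mark = export_since(change_log, watermark=mark)
```

The first call exports all three changes; the second exports nothing because no change is newer than the stored watermark. A message-bus integration would replace the polling with a subscription to the same change records.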
Required Platform Capabilities
[Trace back] Ability to trace a single record back to its origin
...