...
- Operational meta data describes the way that data was processed or created:
- metrics: statistics about the data, and possibly about the processing that produced the data.
- lineage: who produced this data, when, and how
- audit: who or what accessed this data in what way (read or modified)
- Technical meta data is associated with data and describes its technical properties:
- checksums, number of records
- format, schema, etc.
- Business meta data is associated with data to tag, categorize, or inventory it, or to comply with some other business process. It is typically not intrinsic to the data, that is, it cannot be derived from the data itself.
- Tags such as “confidential”, “pii”, “financial”
- Properties such as “businessUnit:xyz” or market “EMEA”
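The three categories above could be modeled as one record per data entity. The following is a minimal sketch only: `EntityMetadata` and its field names are hypothetical and are not part of any CDAP API.

```python
from dataclasses import dataclass, field


@dataclass
class EntityMetadata:
    """Hypothetical container for the three metadata categories (not a CDAP class)."""
    # Operational: metrics, lineage, audit
    operational: dict = field(default_factory=dict)
    # Technical: checksums, number of records, format, schema
    technical: dict = field(default_factory=dict)
    # Business: tags and key/value properties that cannot be derived from the data
    tags: set = field(default_factory=set)
    properties: dict = field(default_factory=dict)


# Illustrative values only
meta = EntityMetadata(
    operational={"lineage": {"producer": "ingest-pipeline"}},
    technical={"format": "avro", "records": 1200},
    tags={"confidential", "pii"},
    properties={"businessUnit": "xyz", "market": "EMEA"},
)
```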
Metadata has too many applications to list here. In the context of CDAP, meta data is used for two main purposes:
- Data Governance:
- Traceability: For a piece of data, where did it originate, how was it processed/transformed, where was it sent to, etc.
- Compliance: Many enterprises are under strict regulations that require the ability to trace back all data (and meta data) to its origin and over its lifetime.
- Discovery: Data scientists or business analysts use meta data to find data that they are interested in.
User Stories
[Trace back]
- A credit card statement has a wrong charge and the customer has complained about it. The bank needs to find out where the incorrect data originated.
- Was the original data already incorrect? Then it needs to be identified for further action.
- Was the data damaged during processing? If so, how was it processed, what were the pipelines/plugins that processed it, with what configuration?
- A downstream process fails because its input data contains a field that does not comply with the schema. The operations team needs to determine why:
- What pipeline produced this data from what input data?
- What operations were applied to the input to produce this field?
- A user notices that the time stamps in a data set are in the wrong time zone, but only for some data. The operations team needs to find out:
- Where did the incorrect data come from? Is it one of the data providers that sends incorrect time stamps? Or is the problem in the pipeline that ingested the data?
- In a data lake, various processes are responsible for dumping data from a variety of sources. The quality of the data varies depending on its source. The user needs to identify which sources produce low-quality data by tracing back to them. The user can then apply additional pre-processing to such sources or simply quarantine them.
- Organizations typically maintain one data lake that receives data from different departments. When analyzing data in the data lake, a data scientist needs additional context. For example, a field named 'resource' can have different meanings depending on where it originated: for the Admin Operation department, a resource may simply represent a hardware unit, whereas for the Human Resources department, it may represent employee information. The ability to trace the field back to its source gives the data scientist this contextual information.
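The trace-back stories above amount to walking lineage edges from a dataset toward its origins. The sketch below assumes lineage is available as a simple mapping from each dataset to the (pipeline, input dataset) pairs that produced it. The mapping `PRODUCED_BY`, the dataset and pipeline names, and `trace_back` are all hypothetical and only illustrate the idea.

```python
from collections import deque

# Hypothetical lineage: dataset -> list of (pipeline, input dataset) that produced it
PRODUCED_BY = {
    "statements": [("billing-pipeline", "charges")],
    "charges": [
        ("ingest-pipeline", "provider-feed-a"),
        ("ingest-pipeline", "provider-feed-b"),
    ],
}


def trace_back(dataset):
    """Return (origins, steps).

    origins: datasets with no recorded producer (the original sources)
    steps:   (input dataset, pipeline, output dataset) hops visited on the way
    """
    origins, steps = [], []
    seen = {dataset}
    queue = deque([dataset])
    while queue:
        current = queue.popleft()
        producers = PRODUCED_BY.get(current, [])
        if not producers:
            origins.append(current)
        for pipeline, source in producers:
            steps.append((source, pipeline, current))
            if source not in seen:
                seen.add(source)
                queue.append(source)
    return origins, steps


origins, steps = trace_back("statements")
```

Here `origins` names the provider feeds to inspect, and `steps` lists the pipelines whose configuration would need to be reviewed along the way.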
[Trace forward]
- A data provider calls a bank's data lake operator and notifies him that the data received over a time period was wrong. The bank now needs to find out what other data was derived from this data, and reprocess it with the correct input data.
- An ETL developer at the organization can trace forward the transformations applied to the data from the sources until it lands in the data lake. The developer can use this information to optimize the data pipelines, for example by filtering records before performing joins [add more details..].
- An admin can trace forward from the sources to determine which sources are more popular (in terms of data quality, applicability for data scientists, etc.) and apply stronger policies to secure those data sources.
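Tracing forward is the same walk in the opposite direction: starting from a source, collect everything derived from it. The sketch below assumes a hypothetical `DERIVED` mapping from each dataset to its direct descendants, and it ignores the time window mentioned in the first story. All names are illustrative.

```python
# Hypothetical lineage edges: dataset -> datasets derived directly from it
DERIVED = {
    "provider-feed": ["raw-transactions"],
    "raw-transactions": ["daily-totals", "fraud-scores"],
    "daily-totals": ["monthly-report"],
}


def trace_forward(dataset):
    """Return the set of all datasets derived, directly or indirectly, from dataset."""
    affected = set()
    stack = [dataset]
    while stack:
        current = stack.pop()
        for child in DERIVED.get(current, []):
            if child not in affected:
                affected.add(child)
                stack.append(child)
    return affected


to_reprocess = trace_forward("provider-feed")
```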
[Meta data provenance]
- A data scientist notices that a data set is not tagged as “PII” even though it contains phone numbers. He calls the data lake operations team. The team that produces the data assures them that it has tagged this data as “PII”. The operations team wants to find out why the tag is missing (was it modified or removed after the fact, or was it missing at creation time?) and consults the audit logs/change history of the data set’s meta data.
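Answering this question requires a change history for the meta data itself. As a rough sketch, assume each metadata change is recorded as an entry with a time, an actor, an action, and a tag. The record layout, the sample entries, and `explain_missing_tag` are hypothetical.

```python
# Hypothetical audit records for one data set's meta data, oldest first
AUDIT_LOG = [
    {"time": "2018-03-01T10:00:00Z", "actor": "producer-team", "action": "add_tag", "tag": "PII"},
    {"time": "2018-03-01T10:05:00Z", "actor": "producer-team", "action": "add_tag", "tag": "financial"},
    {"time": "2018-03-02T09:30:00Z", "actor": "cleanup-job", "action": "remove_tag", "tag": "PII"},
]


def explain_missing_tag(log, tag):
    """Summarize what the change history says about a tag."""
    history = [entry for entry in log if entry["tag"] == tag]
    if not history:
        return "never added"
    last = history[-1]
    if last["action"] == "remove_tag":
        return f"removed by {last['actor']} at {last['time']}"
    return "currently present"


answer = explain_missing_tag(AUDIT_LOG, "PII")
```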
[Discovery]
- For a data experiment, a scientist wants to process credit card transactions that have been normalized to UTC time stamps. How can he find a dataset that has this data? And if that data does not exist, how can he find a data set with credit card transactions, and normalize the time stamps himself? He will search the meta data for:
- Datasets that are tagged / described as credit card transactions
- Datasets that have a time stamp field tagged “utc” or “normalized”
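The two searches above can be sketched as a filter over a metadata catalog. The catalog layout (dataset-level tags plus per-field tags), the dataset names, and `find_datasets` are assumptions made for illustration, not an existing search API.

```python
# Hypothetical catalog: dataset-level tags plus tags per field
CATALOG = [
    {"name": "cc_txn_raw", "tags": {"credit card transactions"}, "field_tags": {"ts": set()}},
    {"name": "cc_txn_utc", "tags": {"credit card transactions"}, "field_tags": {"ts": {"utc"}}},
    {"name": "web_logs", "tags": {"clickstream"}, "field_tags": {"ts": {"utc"}}},
]


def find_datasets(catalog, dataset_tag, field_tags=None):
    """Names of datasets carrying dataset_tag.

    If field_tags is given, at least one field must carry one of those tags.
    """
    results = []
    for ds in catalog:
        if dataset_tag not in ds["tags"]:
            continue
        if field_tags and not any(tags & field_tags for tags in ds["field_tags"].values()):
            continue
        results.append(ds["name"])
    return results


# First choice: transactions whose time stamps are already normalized
normalized = find_datasets(CATALOG, "credit card transactions", {"utc", "normalized"})
# Fallback: any credit card transactions, to be normalized by the scientist
fallback = find_datasets(CATALOG, "credit card transactions")
```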
[Metrics as Metadata]
- The data scientist further wants to understand the quality of the data. For this, he wants to see the processing metrics for each file in the data set:
- how many records were processed
- how many records were discarded due to schema/data validation errors
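One way to turn these two metrics into a quality signal is a per-file discard rate. The sketch below assumes that the processed count includes the discarded records. The file names, the numbers, and the threshold are illustrative only.

```python
# Hypothetical per-file processing metrics; "processed" is assumed to include discarded records
FILE_METRICS = {
    "part-0001": {"processed": 1000, "discarded": 5},
    "part-0002": {"processed": 800, "discarded": 200},
}


def discard_rate(metrics):
    """Fraction of processed records that were discarded (0.0 if nothing was processed)."""
    total = metrics["processed"]
    return metrics["discarded"] / total if total else 0.0


def low_quality_files(file_metrics, threshold):
    """Sorted names of files whose discard rate exceeds threshold."""
    return sorted(name for name, m in file_metrics.items() if discard_rate(m) > threshold)


suspect = low_quality_files(FILE_METRICS, 0.1)
```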
[Metadata propagation]
- A developer wants to create a pipeline that reads from a dataset, applies some transformations, and propagates meta data from its input to its output. For example, if a field in the input data is tagged as “PII”, the corresponding field in the output data should also be tagged “PII”. However, if the pipeline anonymizes that field, it should not be tagged as “PII” in the output, but rather as “anonymized”.
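The propagation rule in this story can be sketched as a function over field-level tags. It assumes the pipeline knows which input field each output field derives from and which output fields it anonymizes. The function name and the data shapes are hypothetical.

```python
def propagate_field_tags(input_tags, field_mapping, anonymized_fields):
    """Compute output field tags from input field tags.

    input_tags:        input field -> set of tags
    field_mapping:     output field -> input field it is derived from
    anonymized_fields: output fields that the pipeline anonymizes
    """
    output_tags = {}
    for out_field, in_field in field_mapping.items():
        # Copy so the input meta data is never mutated
        tags = set(input_tags.get(in_field, set()))
        if out_field in anonymized_fields and "PII" in tags:
            tags.discard("PII")
            tags.add("anonymized")
        output_tags[out_field] = tags
    return output_tags


result = propagate_field_tags(
    input_tags={"phone": {"PII"}, "email": {"PII"}, "amount": {"financial"}},
    field_mapping={"phone_hash": "phone", "email": "email", "amount": "amount"},
    anonymized_fields={"phone_hash"},
)
```

In this example the hashed phone field ends up tagged “anonymized”, while the untouched email field keeps its “PII” tag.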
[Integrations]
- An enterprise has a business meta data system, for example Atlas or Collibra, and would like to synchronize the CDAP metadata with that system:
- Periodic batch import/export
- Batch export of all meta data that has changed since last export
- Tight integration through exposing all metadata changes via a message bus
- Query external system from pipeline
- Publish to external system from pipeline
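The "batch export of all meta data that has changed since last export" option can be sketched with a watermark over an ordered change log. The change-record layout with a monotonically increasing `seq` number is an assumption, and `export_changes` is a hypothetical helper. How the batch is delivered to the external system is left out.

```python
# Hypothetical change log; "seq" is assumed to increase with every metadata change
CHANGE_LOG = [
    {"seq": 1, "entity": "dataset:charges", "change": "add_tag:financial"},
    {"seq": 2, "entity": "dataset:charges", "change": "add_tag:PII"},
    {"seq": 3, "entity": "dataset:web_logs", "change": "set_property:market=EMEA"},
]


def export_changes(change_log, last_export_seq):
    """Return (batch, new_seq).

    batch:   all changes recorded after last_export_seq
    new_seq: watermark to store for the next export
    """
    batch = [c for c in change_log if c["seq"] > last_export_seq]
    new_seq = max((c["seq"] for c in batch), default=last_export_seq)
    return batch, new_seq


batch, watermark = export_changes(CHANGE_LOG, last_export_seq=1)
```

Running the export again with the returned watermark yields an empty batch until new changes arrive.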
Required Platform Capabilities
[Trace back] Ability to trace a single record back to its origin
...