Metadata and Data Discovery

Overview

This is a design document for the Metadata and Data Discovery feature in CDAP. Metadata and Data Discovery will allow CDAP applications and datasets to be annotated with both business and system metadata. This will enable users to:

  1. Track lineage and provenance of datasets

  2. Discover datasets and applications based on Metadata

Requirements

Each requirement is listed as Id (Priority): Requirement, followed by its Description/Comments.

R1 (Priority: H): CDAP should provide a Java API to annotate metadata.

  • Java API is only for business metadata

  • Only the String data type will be supported for values

R2 (Priority: H): CDAP should have the ability to distinguish the type of metadata.

  • Scopes: SYSTEM, USER

  • If added using Java API, then it is Scope.USER

  • If added using REST API, then it is Scope.USER

  • If added via a Program (workflow token, data quality, etc.), it is Scope.SYSTEM

  • If added by the CDAP System, it is Scope.SYSTEM

  • Is Schema part of metadata? What about other Dataset properties?

R3 (Priority: H): Metadata can be associated with a Dataset (all metadata) as well as with an Application (business and generic metadata).

  • What System Metadata can be tagged on an Application?

    • creation time

    • created by

    • last update time

    • last updated by

    • ... any more?

  • What System Metadata can be tagged on a Dataset?

    • last updated by (program)

    • last update time

    • ... any more?

  • Is this for lineage tracking purposes only? Or is this general purpose? If it is for general-purpose storage and retrieval, then we may have to alter storage.

R4 (Priority: H): User should have the ability to annotate Applications or Datasets with business metadata using the REST API / CLI, and programmatically using the API.

R5 (Priority: H): User should have the ability to retrieve all business metadata using the REST API / CLI.

  • We should allow retrieval of System/Generic metadata as well.

R6 (Priority: H): User should have the ability to search for a Dataset or Application based on Generic Metadata.

  • Free-text search or exact match?

  • Do we need an auto-complete interface?

  • Candidates: Solr, Elastic Search, Apache Blur, Cloudera Search (index HBase), Lily

R7 (Priority: H): User should have the ability to search for a Dataset or Application based on Business Metadata.

R8 (Priority: H): User should have the ability to view metadata in the UI: in the Dataset View for a Dataset, and in the Application View for an Application.

R9 (Priority: H): User should have the ability to distinguish the type of metadata they are viewing in the UI.

R10 (Priority: H): User should have the ability to search for an Application or Dataset by its metadata in the UI.

R11 (Priority: M): CDAP system should automatically annotate datasets with machine-generated metadata.

Enumerate:

  • creation time

  • created by

  • last update time

  • last updated by

  • Any more? All updates? Last N updates?

R12 (Priority: M): User should have the ability to search for a Dataset based on machine-generated metadata; specifically, we start with field names rather than values.

R13 (Priority: M): User should have the ability to search for a Dataset based on the schema fields associated with the Dataset.

R14 (Priority: H): User should have the ability to publish business tags to the notification system.

R15 (Priority: L): User should have the ability to specify a filter on the machine-generated metadata to be associated with a dataset.

Enumerate. What kind of filters?

User Stories

In no specific order...

Each user story is listed as Id (Requirements fulfilled): Description.

U1 (fulfills R1): As a user, I should be able to tag applications, programs, streams and datasets with business metadata at deploy time

U2 (fulfills R4): As a user, I should be able to tag applications, programs, streams and datasets with business metadata after deployment

U3 (fulfills R2, R3, R6): As a user, I should be able to view business and system metadata separately

U4 (fulfills R11): As a user, I should be able to view metadata and runtime information that the CDAP system has automatically tagged applications and datasets with.

U5 (fulfills R5, R8): As a user, I should be able to retrieve all the business and system metadata associated with applications and datasets

U6 (no requirement listed): As a user, I should be able to view the lineage and provenance for a dataset in a given time period

U7 (fulfills R6, R7, R10): As a user, I should be able to discover/search datasets based on metadata - business or system

U8 (fulfills R13): As a user, I should be able to discover/search datasets based on fields in the dataset schema

U9 (fulfills R14): As a user, I should be able to publish metadata to the notification system

U10 (fulfills R15): As a user, I should be able to specify filters on the kind of metadata that is automatically added to datasets and applications

User Interactions

Interaction 1

  1. User goes to the metadata discovery page

  2. User types in a metadata keyword in the search box

  3. User gets a list of all possible metadata keys in the search results

  4. User selects (clicks) a single metadata key

  5. User gets a list of all datasets and applications that were annotated with the selected metadata keys

  6. User selects a dataset from the above list

  7. User is taken to the dataset detail page

  8. User has the option to view the lineage of the dataset, for a given time interval

Interaction 2

  1. User is shown all available metadata keys

  2. User follows Interaction 1 from point 4 onwards

Interaction 3

  1. User is on a page showing a lineage diagram for a dataset for the specified time interval

  2. User clicks on a program in the lineage diagram

  3. User is shown all runs for the selected program in the specified time interval

  4. User selects a runId

  5. User is shown all metadata for the selected run

Interaction 4

  1. User is on a dataset/application/program detail page

  2. User clicks a button to view metadata

  3. User is shown all metadata for the selected dataset/application/program/runId

Scope (3.2)

In Scope:

Metadata Annotation and Retrieval

  • Allow users to update and retrieve Scope.USER Metadata through REST endpoints

  • Automatically associate application, runId with dataset access

  • Publish business metadata to Kafka topic.

Lineage

  • View lineage on a dataset through REST API

Search and Discovery

  • Allow prefix-searching for a single metadata key or value. (Either just key, or a combination of key and value).

Out of Scope 

Metadata Annotation and Retrieval

  • No Java Clients/CLI support

  • Annotating dataset/stream accesses as read/write is out of scope. 

  • No support for annotating applications or datasets programmatically

Lineage

  • Lineage will not be directional.

Search and Discovery

  • No support for boolean expressions.

  • No free-text search.

Architecture

The Metadata Service will be implemented as a separate service that runs in YARN. One of the primary reasons for this is that it cuts across AppFabric and DataFabric. This will also allow us to scale it independently. This service will expose REST APIs for interaction, most of which will be public (routable), while some may be private (non-routable). This service will allow the following kinds of interactions:

  1. Direct interactions from users for setting and retrieving metadata for applications and datasets

  2. Interactions from AppFabric/Dataset Service during (the start/end of) program runs for annotating that a program accessed a dataset (as input or output).

This service will also be responsible for computing and serving the lineage of a dataset in a specified time interval.

An additional service may (TBD) also be exposed to fulfill the requirement of metadata search. This service may be a YARN service that runs Solr or Lucene to index metadata.

Design

Metadata System

Metadata Types

The metadata system supports the following kinds of metadata:

  1. Business (User) Metadata: Metadata that users annotate, using either the Java API or the REST API.

  2. System Metadata: Metadata that is annotated by the CDAP system. This may include two kinds of metadata:

    1. Generic Metadata: Metadata that is added by default to every application/dataset - created_by, creation_time, last_updated_by, last_update_time (any more?)

    2. Runtime Metadata: Metadata that contains runtime information like Workflow Token, Data Quality Stats, Program/Dataset Runtime Arguments, Preferences. We may not store this data in the metadata service, but may just reference from the current MDS with a runId.
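The metadata types above, combined with the scope rules listed under requirement R2, can be sketched as a small data model. This is only an illustration in Python: the class names, field names and the source labels (`java_api`, `rest_api`, `program`, `cdap_system`) are assumptions made for this sketch and are not part of any CDAP API.

```python
from dataclasses import dataclass
from enum import Enum


class Scope(Enum):
    USER = "USER"      # business metadata annotated by users
    SYSTEM = "SYSTEM"  # metadata annotated by the CDAP system or by programs


# Hypothetical labels for where a piece of metadata came from,
# mapped to scopes following the rules listed under R2.
_SOURCE_TO_SCOPE = {
    "java_api": Scope.USER,
    "rest_api": Scope.USER,
    "program": Scope.SYSTEM,
    "cdap_system": Scope.SYSTEM,
}


@dataclass(frozen=True)
class MetadataRecord:
    target_type: str  # APP, PROGRAM, DATASET or STREAM
    target_id: str
    key: str
    value: str        # only String values are supported (R1)
    scope: Scope


def scope_for(source: str) -> Scope:
    """Return the scope implied by the source of the annotation."""
    return _SOURCE_TO_SCOPE[source]
```

For example, `MetadataRecord("DATASET", "ns1.purchases", "owner", "alice", scope_for("rest_api"))` would carry `Scope.USER`.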

System Metadata Updates

Updates to system metadata will happen at four different times:

  1. App/Dataset Deployment: To update System Metadata

  2. Program start: To update the history table for the dataset access with the run

  3. During Program Run: To update dataset accesses for programs that update datasets (that are not set as input or output datasets) during their run 

  4. Program end: To update the history table for the dataset access with the end time (and the metadata values?)

Search

We will use Solr or Lucene as an external search and indexing engine. An overview of the investigation of various available options can be found at External Search and Indexing Engine Investigation.

CDAP Search System Service

To support Solr in fault-tolerant mode, there will be a new CDAP system service that acts as an adapter for the search and indexing engine in YARN.

The primary role of the proposed search system service will be to manage the external search and indexing engine and to act as a proxy between the CDAP master and that engine.

Indexing and Searching Metadata Records Using IndexedTable

In the 3.2 release, we will support search on metadata records using the key*=value* and key* formats rather than free-text search.

To do this, we will use an IndexedTable instance to store the metadata records. This gives us a second table that acts as an inverted index of the metadata records.
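To illustrate how an inverted index over sorted terms can answer prefix queries, here is a minimal in-memory sketch in Python. It does not use the IndexedTable API; the function names and the choice of "key:value" and "value" as index terms are assumptions made for this illustration.

```python
from bisect import bisect_left


def build_index(records):
    """Build a sorted inverted index from (target, key, value) triples.

    Each record contributes two index terms: "key:value" and "value".
    """
    entries = []
    for target, key, value in records:
        entries.append((f"{key}:{value}", target))
        entries.append((value, target))
    return sorted(entries)


def prefix_search(index, prefix):
    """Return all targets that have an index term starting with prefix."""
    terms = [term for term, _ in index]
    start = bisect_left(terms, prefix)  # first term >= prefix
    hits = set()
    for term, target in index[start:]:
        if not term.startswith(prefix):
            break  # matching terms are contiguous in sorted order
        hits.add(target)
    return hits
```

With this layout, a key prefix such as `owner` matches every "owner:..." term, a prefix such as `owner:ali` narrows the match to values of that key, and a bare value prefix matches the "value" terms.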

Store

The metadata system will be composed of the following three tables:

Business Metadata Table

This table contains the most recent metadata associated with a dataset or an application. It contains no historical data. Its purpose is to optimally serve the annotate and retrieve APIs for business metadata.

RowKey:

<target-type> | <target-id> | <metadata-type> | <key>

where:

<target-type>: APP/PROGRAM/DATASET/STREAM

<target-id>: app-id (ns+app) / program-id (ns+app+pgtype+pgm) / dataset-id (ns+dataset) / stream-id (ns+stream)

<metadata-type>: "p" for a property (key-value) record, "t" for a tag record

<key>: the key

Col:

Key:Value | Value

We store both Key:Value and Value in the column to make indexing easier using an IndexedTable. We will be able to support prefix searches with this approach as well.

When an app is deleted, its business metadata is also deleted from this table.
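The row key and column layout of this table can be sketched as follows. This is a Python illustration, not the actual storage code: the separator byte, the column names `kv` and `v`, and the use of a plain dict as the table are all assumptions.

```python
SEP = "\x00"  # assumed separator between row key parts

TARGET_TYPES = {"APP", "PROGRAM", "DATASET", "STREAM"}
METADATA_TYPES = {"p", "t"}  # property record / tag record


def row_key(target_type, target_id, metadata_type, key):
    """Build <target-type><target-id><metadata-type><key>."""
    if target_type not in TARGET_TYPES:
        raise ValueError(f"unknown target type: {target_type}")
    if metadata_type not in METADATA_TYPES:
        raise ValueError(f"unknown metadata type: {metadata_type}")
    return SEP.join([target_type, target_id, metadata_type, key])


def columns(key, value):
    """Store both Key:Value and Value so that either can be indexed."""
    return {"kv": f"{key}:{value}", "v": value}


def delete_target(table, target_type, target_id):
    """Remove every metadata row of one target, e.g. when an app is deleted."""
    prefix = SEP.join([target_type, target_id]) + SEP
    for rk in [rk for rk in table if rk.startswith(prefix)]:
        del table[rk]
```

Because the target type and id lead the row key, all metadata for one target is contiguous, so deleting an app's metadata amounts to a prefix scan.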

Lineage Table

This table contains historical information of program runs and the datasets that they accessed for read or write. It serves like an audit-trail. The primary goal of this table is to be able to compute the lineage for a dataset.

RowKey:

This table can have six kinds of row-keys (two for each kind of access described below).

Dataset access from a CDAP Program

d | <dataset-id> | <inverted-start-time> | p | <run-id> | <access-type>

p | <run-id> | <inverted-start-time> | d | <dataset-id> | <access-type>

where:

d: Identifies a dataset access

<dataset-id>: The dataset id containing ns+dataset

<inverted-start-time>: The start time of the run; inverted because HBase 0.96 does not support reverse scan.

p: Identifies that this access was made from a CDAP Program

<run-id>: The run id containing ns+app+program+runId

<access-type>: r/w/rw
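The two row keys for a dataset access, and the effect of inverting the start time, can be sketched in Python as below. Subtracting the timestamp from the maximum long value is one common way to invert it and is assumed here; the "-" separator follows the notation used in the scan examples of this document, and the function names are hypothetical.

```python
LONG_MAX = 2**63 - 1  # same value as Java's Long.MAX_VALUE

ACCESS_TYPES = {"r", "w", "rw"}


def inverted_time(start_ms):
    """Invert a timestamp so that later runs sort first in a forward scan."""
    # Zero-padded to 19 digits so that string order matches numeric order.
    return f"{LONG_MAX - start_ms:019d}"


def dataset_row_key(dataset_id, start_ms, run_id, access_type):
    """Row key used to find the programs that accessed a dataset."""
    if access_type not in ACCESS_TYPES:
        raise ValueError(f"unknown access type: {access_type}")
    return "-".join(["d", dataset_id, inverted_time(start_ms), "p", run_id, access_type])


def program_row_key(run_id, start_ms, dataset_id, access_type):
    """Row key used to find the datasets that a program run accessed."""
    if access_type not in ACCESS_TYPES:
        raise ValueError(f"unknown access type: {access_type}")
    return "-".join(["p", run_id, inverted_time(start_ms), "d", dataset_id, access_type])
```

Writing both keys for every access lets the lineage computation walk from a dataset to its programs and from a program back to its datasets with forward scans only.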

 

Stream access from a CDAP Program

s | <stream-id> | <inverted-start-time> | p | <run-id> | <access-type>

p | <run-id> | <inverted-start-time> | s | <stream-id> | <access-type>

where:

s: Identifies a stream access

<stream-id>: The stream id containing ns+stream

<inverted-start-time>: The start time of the run; inverted because HBase 0.96 does not support reverse scan.

p: Identifies that this access was made from a CDAP Program

<run-id>: The run id containing ns+app+program+runId

<access-type>: r/w/rw

 

Stream access from an external entity

s | <stream-id> | <inverted-start-time> | e | <external-entity-id> | <access-type>

e | <external-entity-id> | <inverted-start-time> | s | <stream-id> | <access-type>

where:

s: Identifies a stream access

<stream-id>: The stream id containing ns+stream

<inverted-start-time>: The start time of the run; inverted because HBase 0.96 does not support reverse scan.

e: Identifies that this access was made from an external entity

<external-entity-id>: The id of the external entity, which is a combination of the source query param specified in the modified Stream Write REST APIs below and an access timestamp (e.g., day timestamp). It is okay to have a timestamp in the external entity id, since we will not be computing lineage for external entities.

<access-type>: r/w/rw
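As a small illustration of combining the source with a day timestamp, the sketch below truncates the access time to the start of its UTC day. The ":" separator, the millisecond unit and the function name are assumptions for this sketch; the document does not specify the exact encoding.

```python
DAY_MS = 24 * 60 * 60 * 1000


def external_entity_id(source, access_ms):
    """Combine the source with the start of the day in which the access occurred."""
    day_start_ms = access_ms - (access_ms % DAY_MS)
    return f"{source}:{day_start_ms}"
```

With day granularity, all writes from the same source on the same day share one external entity id, which bounds the number of rows written per source.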

Columns:

<stop-time> | <metadata>

<stop-time> - Stop time of the program in ms, -1 if program is still running

<metadata> - JSON containing business and system metadata, even though storing system metadata could mean a duplication. This will be null for dataset and stream rows. This will not contain run information like runtime arguments, workflow token etc. These will be looked up from the run record table when required.

Lineage Computation

Lineage can be computed for streams and datasets using the lineage table. Lineage computation is a breadth first search on the lineage table.

Given a dataset dataset1 and a time range time1 and time2, the following scan and filter will compute one iteration of the breadth first search.

Scan:

  • start row: d-dataset1-inverted-time2

  • stop row: d-dataset1-inverted-min(running-program-start-times)

Filter:

  • stop time == -1 OR stop time >= time1

This scan will satisfy the following condition: (time1 <= stop-time < time2) OR (start-time < time2 AND (stop-time == -1 OR stop-time >= time2))

In terms of start times, the scan and filter select:

min(running program start times) <= start time < time2 AND (stop time == -1 OR stop time >= time1)

Because inverting the times reverses the direction of each inequality, this is equivalent to:

inverted min(running program start times) >= inverted start time > inverted time2 AND (stop time == -1 OR stop time >= time1)
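One iteration of this scan and filter can be sketched over an in-memory list that stands in for the dataset's rows, sorted by inverted start time. This is a Python illustration only: the tuple layout and function names are assumptions, and `earliest_start` plays the role of min(running-program-start-times), which is simply passed in here.

```python
from bisect import bisect_right

LONG_MAX = 2**63 - 1


def invert(ts):
    return LONG_MAX - ts


def scan_accesses(rows, time1, time2, earliest_start):
    """Return run ids that accessed the dataset within [time1, time2).

    rows: list of (inverted_start, run_id, stop_time), sorted ascending by
    inverted_start, as a forward scan over the row keys would return them.
    """
    keys = [row[0] for row in rows]
    # Scan range: earliest_start <= start time < time2, in inverted form.
    lo = bisect_right(keys, invert(time2))
    hi = bisect_right(keys, invert(earliest_start))
    # Filter: keep runs that are still running or stopped at or after time1.
    return [run_id for _, run_id, stop in rows[lo:hi]
            if stop == -1 or stop >= time1]
```

Results come back latest start first, which is the order an ascending scan over inverted timestamps produces.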

 

The above scan will give us all programs that accessed dataset dataset1. We can continue the scans with the programs to go further down the search. Note that each node in the graph will need a separate scan to move forward in the search.
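The full breadth-first search can be sketched as below, with dictionary lookups standing in for the per-node scans on the lineage table. This is a simplified Python illustration: the record layout and function names are assumptions, and edges are undirected, in line with the 3.2 scope.

```python
from collections import deque


def overlaps(start, stop, time1, time2):
    """True if a run started before time2 and is running or stopped at or after time1."""
    return start < time2 and (stop == -1 or stop >= time1)


def lineage(accesses, dataset, time1, time2):
    """Return the set of (dataset_id, run_id) edges reachable from dataset.

    accesses: iterable of (dataset_id, run_id, start_time, stop_time).
    """
    by_dataset, by_run = {}, {}
    for ds, run, start, stop in accesses:
        if overlaps(start, stop, time1, time2):
            by_dataset.setdefault(ds, set()).add(run)  # d ... p rows
            by_run.setdefault(run, set()).add(ds)      # p ... d rows
    edges = set()
    seen = {("d", dataset)}
    queue = deque([("d", dataset)])
    while queue:
        kind, node = queue.popleft()
        # Each lookup here corresponds to one scan in the real table.
        neighbours = by_dataset.get(node, ()) if kind == "d" else by_run.get(node, ())
        for other in neighbours:
            edges.add((node, other) if kind == "d" else (other, node))
            nxt = ("p" if kind == "d" else "d", other)
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return edges
```

The `seen` set ensures each dataset or run is expanded once, so the number of scans is bounded by the number of nodes in the lineage graph.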

 

 When is metadata stored in this table?

  • Every time a dataset access happens - It will make querying expensive

Created in 2020 by Google Inc.