Metadata and Data Discovery
- 1 Overview
- 2 Requirements
- 3 User Stories
- 4 User Interactions
- 4.1 Interaction 1
- 4.2 Interaction 2
- 4.3 Interaction 3
- 4.4 Interaction 4
- 5 Scope (3.2)
- 5.1 In Scope:
- 5.1.1 Metadata Annotation and Retrieval
- 5.1.2 Lineage
- 5.1.3 Search and Discovery
- 5.2 Out of Scope
- 5.2.1 Metadata Annotation and Retrieval
- 5.2.2 Lineage
- 5.2.3 Search and Discovery
- 6 Architecture
- 7 Design
- 8 Store
- 8.1 Business Metadata Table
- 8.2 Lineage Table
- 8.2.1 Lineage Computation
- 9 REST APIs
- 10 Java APIs
- 10.1 Applications
- 10.2 Datasets
- 10.2.1 UseDataset
- 10.2.2 getDataset()
- 10.3 Stream
- 11 Existing REST API changes
- 11.1 Stream
- 12 Notifications
- 13 Caveats/Assumptions
- 14 CLI
- 15 Questions/TODOs
Overview
This is a design document for the Metadata and Data Discovery feature in CDAP. Metadata and Data Discovery will allow CDAP applications and datasets to be annotated with both business and system metadata. This will enable users to:
Track lineage and provenance of datasets
Discover datasets and applications based on Metadata
Requirements
| Id | Requirement | Priority | Description/Comments |
|---|---|---|---|
| R1 | CDAP should provide a Java API to annotate metadata | H | |
| R2 | CDAP should have the ability to distinguish the type of metadata | H | |
| R3 | Metadata can be associated with a Dataset (all metadata) as well as an Application (Business Metadata and Generic Metadata) | H | |
| R4 | User should have the ability to annotate Applications or Datasets with business metadata using the REST API / CLI, and programmatically using the API | H | |
| R5 | User should have the ability to retrieve all business metadata using the REST API / CLI | H | |
| R6 | User should have the ability to search for a Dataset or Application based on Generic Metadata | H | |
| R7 | User should have the ability to search for a Dataset or Application based on Business Metadata | H | |
| R8 | User should have the ability to view metadata in the UI, in the Dataset View for Datasets and in the Application View for Applications | H | |
| R9 | User should have the ability to distinguish the type of metadata they are viewing in the UI | H | |
| R10 | User should have the ability to search for an Application or Dataset by its metadata in the UI | H | |
| R11 | CDAP system should automatically annotate datasets with machine-generated metadata | M | Enumerate: |
| R12 | User should have the ability to search for a Dataset based on machine-generated metadata; specifically, we start with field names rather than values | M | |
| R13 | User should have the ability to search for a Dataset based on the schema fields associated with the Dataset | M | |
| R14 | User should have the ability to publish business tags on the notification system | H | |
| R15 | User should have the ability to specify a filter on the machine-generated metadata to be associated with a dataset | L | Enumerate. What kind of filters? |
User Stories
In no specific order...
| Id | Description | Requirements fulfilled |
|---|---|---|
| U1 | As a user, I should be able to tag applications, programs, streams and datasets with business metadata at deploy time | R1 |
| U2 | As a user, I should be able to tag application, programs, streams and datasets with business metadata after deployment | R4 |
| U3 | As a user, I should be able to view business and system metadata separately | R2, R3, R6 |
| U4 | As a user, I should be able to view metadata and runtime information that the CDAP system has automatically tagged applications and datasets with. | R11 |
| U5 | As a user, I should be able to retrieve all the business and system metadata associated with applications and datasets | R5, R8 |
| U6 | As a user, I should be able to view the lineage and provenance for a dataset in a given time period | |
| U7 | As a user, I should be able to discover/search datasets based on metadata - business or system | R6, R7, R10 |
| U8 | As a user, I should be able to discover/search datasets based on fields in the dataset schema | R13 |
| U9 | As a user, I should be able to publish metadata to the notification system | R14 |
| U10 | As a user, I should be able to specify filters on the kind of metadata that is automatically added to datasets and applications | R15 |
User Interactions
Interaction 1
User goes to the metadata discovery page
User types in a metadata keyword in the search box
User gets a list of all possible metadata keys in the search results
User selects (clicks) a single metadata key
User gets a list of all datasets and applications that were annotated with the selected metadata key
User selects a dataset from the above list
User is taken to the dataset detail page
User has the option to view the lineage of the dataset, for a given time interval
Interaction 2
User is shown all available metadata keys
User follows Interaction 1 from step 4 (selecting a single metadata key) onwards
Interaction 3
User is on a page showing a lineage diagram for a dataset for the specified time interval
User clicks on a program in the lineage diagram
User is shown all runs for the selected program in the specified time interval
User selects a runId
User is shown all metadata for the selected run
Interaction 4
User is on a dataset/application/program detail page
User clicks a button to view metadata
User is shown all metadata for the selected dataset/application/program/runId
Scope (3.2)
In Scope:
Metadata Annotation and Retrieval
Allow users to update and retrieve Scope.USER Metadata through REST endpoints
Automatically associate application, runId with dataset access
Publish business metadata to a Kafka topic.
Lineage
View lineage on a dataset through REST API
Search and Discovery
Allow prefix-searching for a single metadata key or value. (Either just key, or a combination of key and value).
Out of Scope
Metadata Annotation and Retrieval
No Java Clients/CLI support
Annotating dataset/stream accesses as read/write is out of scope.
No support for annotating applications or datasets programmatically
Lineage
Lineage will not be directional.
Search and Discovery
No support for boolean expressions.
No free-text search.
Architecture
The Metadata Service will be implemented as a separate service that runs in YARN. One of the primary reasons for this is that it cuts across AppFabric and DataFabric. Running it separately will also allow us to scale it independently. This service will expose REST APIs for interaction, most of which will be public (routable) while some may be private (non-routable). This service will allow the following kinds of interactions:
Direct interactions from users for setting and retrieving metadata for applications and datasets
Interactions from AppFabric/Dataset Service during (the start/end of) program runs for annotating that a program accessed a dataset (as input or output).
This service will also be responsible for computing and serving the lineage of a dataset in a specified time interval.
An additional service may (TBD) also be exposed to fulfill the requirement of metadata search. This service may be a YARN service that runs Solr or Lucene to index metadata.
Design
Metadata System
Metadata Types
The metadata system supports the following kinds of metadata:
Business (User) Metadata: Metadata that users annotate, using either the Java API or the REST API.
System Metadata: Metadata that is annotated by the CDAP system. This may include two kinds of metadata:
Generic Metadata: Metadata that is added by default to every application/dataset - created_by, creation_time, last_updated_by, last_update_time (any more?)
Runtime Metadata: Metadata that contains runtime information like Workflow Token, Data Quality Stats, Program/Dataset Runtime Arguments, Preferences. We may not store this data in the metadata service, but may just reference from the current MDS with a runId.
System Metadata Updates
Updates to system metadata will happen at four different times:
App/Dataset Deployment: To update System Metadata
Program start: To update the history table for the dataset access with the run
During Program Run: To update dataset accesses for programs that update datasets (that are not set as input or output datasets) during their run
Program end: To update the history table for the dataset access with the end time (and the metadata values?)
Search
We will use Solr or Lucene as an external search and indexing engine. An overview of the investigation of various available options can be found at External Search and Indexing Engine Investigation.
CDAP Search System Service
To support Solr in fault-tolerant mode, a new CDAP system service will act as an adapter for the search and indexing engine in YARN.
The primary role of the proposed search system service will be to manage the external search and indexing engine and to act as a proxy between the CDAP master and that engine.
Indexing and Searching Metadata Records Using IndexedTable
In the 3.2 release, we will support search on metadata records using the key*=value* and key* formats rather than free-text search.
To do this, we will store the metadata records in an IndexedTable instance, which gives us a second table that acts as an inverted index of the metadata records.
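The prefix search described above can be sketched as follows. This is only an illustration, assuming an in-memory sorted list stands in for the index table; the class `MetadataIndex` and its methods `add` and `search` are hypothetical names and are not part of the CDAP API. It indexes both the "key:value" string and the bare value, so that a prefix on the key, on the key plus value, or on the value maps to one contiguous range of the index.

```python
import bisect


class MetadataIndex:
    """Toy inverted index: a sorted list of (indexed_string, target) pairs."""

    def __init__(self):
        # Kept sorted so that every prefix maps to one contiguous range.
        self._entries = []

    def add(self, target, key, value):
        # Index both "key:value" and the bare value (an assumption of this sketch).
        for indexed in (f"{key}:{value}", value):
            bisect.insort(self._entries, (indexed, target))

    def search(self, prefix):
        # Range scan: start at the first entry >= prefix and stop as soon as
        # an entry no longer starts with the prefix.
        start = bisect.bisect_left(self._entries, (prefix,))
        targets = []
        for indexed, target in self._entries[start:]:
            if not indexed.startswith(prefix):
                break
            if target not in targets:
                targets.append(target)
        return targets


index = MetadataIndex()
index.add("dataset:default.purchases", "owner", "finance")
index.add("app:default.PurchaseApp", "owner", "fraud-team")
index.add("dataset:default.users", "tier", "gold")

key_matches = index.search("owner")            # key prefix only
key_value_matches = index.search("owner:fi")   # key plus value prefix
```

Because the index is sorted, neither query needs a full scan, which is the property the inverted-index table is meant to provide.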
Store
The metadata system will be composed of the following tables:
Business Metadata Table
This table contains the most recent metadata associated with a dataset or an application. It contains no historical data. Its purpose is to optimally serve the annotate and retrieve APIs for business metadata.
RowKey:
| <target-type> | <target-id> | <metadata-type> | <key> |
|---|---|---|---|
where:
<target-type>: APP/PROGRAM/DATASET/STREAM
<target-id>: app-id(ns+app) / program-id(ns+app+pgtype+pgm) / dataset-id(ns+dataset)/stream-id(ns+stream)
<metadata-type>: "p" for a property (key-value) record, "t" for a tag record
<key>: the key
Col:
| <key>:<value> | <value> |
|---|---|
We store both Key:Value and Value in the column to make indexing easier using an IndexedTable. We will be able to support prefix searches with this approach as well.
When an app is deleted, its business metadata from this table is also deleted
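A minimal sketch of how one record of this table could be assembled from the components listed above. The function name `business_metadata_row`, the tuple representation of the row key, and the column names `kv` and `v` are assumptions made for illustration; the byte encoding of the actual row key is not specified here.

```python
TARGET_TYPES = ("APP", "PROGRAM", "DATASET", "STREAM")
METADATA_TYPES = ("p", "t")  # "p" = property (key-value) record, "t" = tag record


def business_metadata_row(target_type, target_id_parts, metadata_type, key, value):
    """Return (row_key, columns) for one business metadata record."""
    if target_type not in TARGET_TYPES:
        raise ValueError(f"unknown target type: {target_type}")
    if metadata_type not in METADATA_TYPES:
        raise ValueError(f"unknown metadata type: {metadata_type}")
    # Row key order: target-type, target-id, metadata-type, key.
    # A tuple stands in for the encoded key (assumption of this sketch).
    row_key = (target_type, *target_id_parts, metadata_type, key)
    # Store both "key:value" and the bare value so that both can be indexed.
    # The column names "kv" and "v" are hypothetical.
    columns = {"kv": f"{key}:{value}", "v": value}
    return row_key, columns


row_key, columns = business_metadata_row(
    "DATASET", ("default", "purchases"), "p", "owner", "finance"
)
```

Keeping the target type and target id at the front of the key means all metadata for one target is stored contiguously, so retrieving or deleting everything for an app or dataset is a single prefix scan.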
Lineage Table
This table contains historical information about program runs and the datasets that they accessed for read or write. It serves as an audit trail. The primary goal of this table is to make it possible to compute the lineage of a dataset.
RowKey:
This table can have six kinds of row-keys
Dataset access from a CDAP Program
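| d | <dataset-id> | <inverted-start-time> | p | <run-id> | <access-type> |
|---|---|---|---|---|---|
| p | <run-id> | <inverted-start-time> | d | <dataset-id> | <access-type> |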
where:
d: Identifies a dataset access
<dataset-id>: The dataset id containing ns+dataset
<inverted-start-time>: The start time of the run; inverted because HBase 0.96 does not support reverse scans.
p: Identifies that this access was made from a CDAP Program
<run-id>: The run id containing ns+app+program+runId
<access-type>: r/w/rw
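The effect of inverting the start time can be sketched as follows. The inversion shown (subtracting from the largest signed 64-bit value) is a common HBase pattern and is an assumption here, as are the helper names `invert_time` and `dataset_access_key`. The sketch shows that, with an inverted time in the key, a plain ascending scan returns the most recent runs first.

```python
import struct

LONG_MAX = 2**63 - 1  # largest signed 64-bit value


def invert_time(time_ms):
    # Later times map to smaller numbers, so they sort first in an ascending scan.
    return LONG_MAX - time_ms


def dataset_access_key(dataset_id, start_time_ms, run_id, access_type):
    # Component order: d, dataset-id, inverted-start-time, p, run-id, access-type.
    # Big-endian packing keeps byte order equal to numeric order for non-negative values.
    return (
        b"d",
        dataset_id.encode(),
        struct.pack(">q", invert_time(start_time_ms)),
        b"p",
        run_id.encode(),
        access_type.encode(),
    )


older = dataset_access_key("default.purchases", 1000, "run-a", "r")
newer = dataset_access_key("default.purchases", 2000, "run-b", "w")
scan_order = sorted([older, newer])  # ascending order, as a forward scan would return
```

Here `scan_order` lists the run that started at 2000 before the run that started at 1000, which is the ordering a reverse scan would otherwise have provided.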
Stream access from a CDAP Program
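| s | <stream-id> | <inverted-start-time> | p | <run-id> | <access-type> |
|---|---|---|---|---|---|
| p | <run-id> | <inverted-start-time> | s | <stream-id> | <access-type> |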
where:
s: Identifies a stream access
<stream-id>: The stream id containing ns+stream
<inverted-start-time>: The start time of the run; inverted because HBase 0.96 does not support reverse scans.
p: Identifies that this access was made from a CDAP Program
<run-id>: The run id containing ns+app+program+runId
<access-type>: r/w/rw
Stream access from an external entity
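| s | <stream-id> | <inverted-start-time> | e | <external-entity-id> | <access-type> |
|---|---|---|---|---|---|
| e | <external-entity-id> | <inverted-start-time> | s | <stream-id> | <access-type> |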
where:
s: Identifies a stream access
<stream-id>: The stream id containing ns+stream
<inverted-start-time>: The start time of the run; inverted because HBase 0.96 does not support reverse scans.
e: Identifies that this access was made from an external entity
<external-entity-id>: The id of the external entity, which is a combination of the source query param specified in the modified Stream Write REST APIs below and an access timestamp (eg. day timestamp). It is okay to have timestamp in the external entity id, since we will not be computing lineage for external entities.
<access-type>: r/w/rw
Columns:
| <stop-time> | <metadata> |
|---|---|
<stop-time> - Stop time of the program in ms, -1 if program is still running
<metadata> - JSON containing business and system metadata, even though storing system metadata could mean a duplication. This will be null for dataset and stream rows. This will not contain run information like runtime arguments, workflow token etc. These will be looked up from the run record table when required.
Lineage Computation
Lineage can be computed for streams and datasets using the lineage table. Lineage computation is a breadth first search on the lineage table.
Given a dataset dataset1 and a time range time1 and time2, the following scan and filter will compute one iteration of the breadth first search.
Scan
start row: d-dataset1-inverted-time2
stop row: d-dataset1-inverted-min(running-program-start-times)
Filter
stop time == -1 OR stop time >= time1
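This scan satisfies the condition (time1 <= stop-time < time2) OR (start-time < time2 AND (stop-time == -1 OR stop-time >= time2)), which simplifies to start-time < time2 AND (stop-time == -1 OR stop-time >= time1). With the scan's lower bound on the start time, this becomes:
min(running-program-start-times) <= start-time < time2 AND (stop-time == -1 OR stop-time >= time1)
In terms of the inverted start times stored in the row keys:
inverted-min(running-program-start-times) > inverted-start-time >= inverted-time2 AND (stop-time == -1 OR stop-time >= time1)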
The above scan will give us all programs that accessed dataset dataset1. We can continue the scans with the programs to go further down the search. Note that each node in the graph will need a separate scan to move forward in the search.
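The breadth-first search can be sketched as follows, assuming an in-memory list of access records stands in for the lineage table. The names `Access`, `overlaps` and `compute_lineage` are hypothetical, a linear pass replaces the row-key range scan, and the lower bound on the scan's stop row is left out for brevity. Only the filter start-time < time2 AND (stop-time == -1 OR stop-time >= time1) is applied.

```python
from collections import deque, namedtuple

# One record per (dataset, program) access; stop_time is -1 while the program is running.
Access = namedtuple("Access", "dataset program start_time stop_time")


def overlaps(access, time1, time2):
    # Started before time2, and either still running or stopped at or after time1.
    return access.start_time < time2 and (
        access.stop_time == -1 or access.stop_time >= time1
    )


def compute_lineage(accesses, dataset, time1, time2):
    """Breadth-first search; returns the set of (dataset, program) relations reached."""
    relations = set()
    start = ("d", dataset)
    visited = {start}
    queue = deque([start])
    while queue:
        kind, name = queue.popleft()
        # One "scan" per node in the graph.
        for access in accesses:
            if not overlaps(access, time1, time2):
                continue
            if kind == "d" and access.dataset == name:
                neighbour = ("p", access.program)
            elif kind == "p" and access.program == name:
                neighbour = ("d", access.dataset)
            else:
                continue
            relations.add((access.dataset, access.program))
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(neighbour)
    return relations
```

The search alternates between dataset nodes and program nodes, matching the note that each node needs its own scan. The relations are undirected, in line with lineage not being directional in this release.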
When is metadata stored in this table?
Every time a dataset access happens - It will make querying expensive