Discovery and Lineage

Discovery

Metadata can be used to tag CDAP components so that they are easy to identify and manage, which in turn makes them easier to discover.

For example, you can tag a dataset as experimental or an application as production. These entities can then be discovered by using search queries with the annotated metadata.

Using search, you can discover entities:

that have a particular value for any key in their properties;
that have a particular key with a particular value in their properties; or
that have a particular tag.

You can also find a dataset or a stream whose schema has a field with a given name, or a field with a given name and a given type.

To search metadata, you can use the Metadata HTTP RESTful API.
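As a rough illustration of the searches described above, the following sketch builds search request URLs. The endpoint path (/v3/namespaces/<namespace>/metadata/search), the "query" and "target" parameters, the "key:value" term syntax, and the host and port are assumptions based on the general shape of the Metadata HTTP RESTful API; check them against the API reference for your CDAP version. The function name and the example metadata values are hypothetical.

```python
from urllib.parse import urlencode


def build_search_url(base_url, namespace, query, targets=()):
    """Build a metadata search URL (hypothetical helper).

    ASSUMPTION: the path and the 'query'/'target' parameter names follow
    the Metadata HTTP RESTful API; verify them for your CDAP version.
    """
    params = [("query", query)]
    # Optionally restrict the search to entity types, e.g. "dataset".
    params.extend(("target", t) for t in targets)
    base = base_url.rstrip("/")
    return f"{base}/v3/namespaces/{namespace}/metadata/search?{urlencode(params)}"


# Any key with the value "experimental", restricted to datasets.
by_value = build_search_url(
    "http://localhost:11015", "default", "experimental", ["dataset"]
)

# A particular key with a particular value (assumed "key:value" syntax).
by_key_value = build_search_url(
    "http://localhost:11015", "default", "environment:production"
)

print(by_value)
print(by_key_value)
```

The URLs are only constructed here, not requested, so the sketch runs without a CDAP instance. Note that urlencode percent-encodes the colon in a key:value term.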

Lineage

Lineage can be retrieved for dataset and stream entities. A lineage shows, for a specified time range, every access of the entity's data and where each access originated.

For example, a stream may be written to by a worker, which may in turn have obtained its data from a combination of a dataset and a (different) stream. The data in those entities may itself have come from other entities. The number of levels of lineage to calculate is set in the request to view the lineage of a particular entity.
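A lineage request therefore carries three pieces of information: the entity, the time range, and the number of levels. The sketch below builds such a request URL. The path layout and the "start", "end", and "levels" parameter names are assumptions about the lineage endpoint and should be verified against the API reference; the function name, host, stream name, and timestamps are hypothetical.

```python
from urllib.parse import urlencode


def build_lineage_url(base_url, namespace, entity_type, entity_name,
                      start, end, levels=None):
    """Build a lineage request URL (hypothetical helper).

    ASSUMPTION: path layout and the 'start'/'end'/'levels' parameter
    names; verify them against the API reference for your CDAP version.
    """
    # Lineage is available for dataset and stream entities only.
    if entity_type not in ("datasets", "streams"):
        raise ValueError("lineage is available for datasets and streams only")
    params = [("start", start), ("end", end)]
    if levels is not None:
        # How many levels of upstream/downstream entities to calculate.
        params.append(("levels", levels))
    base = base_url.rstrip("/")
    return (f"{base}/v3/namespaces/{namespace}/{entity_type}/"
            f"{entity_name}/lineage?{urlencode(params)}")


# Two levels of lineage for a hypothetical stream named "events".
lineage_url = build_lineage_url(
    "http://localhost:11015", "default", "streams", "events",
    start=1442863938, end=1442881938, levels=2,
)
print(lineage_url)
```

Leaving out levels in this sketch simply omits the parameter, so whatever default the server applies would be used.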

In the case of streams, the lineage indicates whether each access read from or wrote to the stream.

In the case of datasets, lineage can indicate whether a dataset access was for reading, writing, or both, provided that the methods in the dataset have the appropriate annotations. If the annotations are absent, lineage can indicate only that a dataset access took place, not whether that access was for reading or writing.
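The rule for datasets can be modeled with a small sketch. This is not CDAP code: the enum, the function, and the string labels are hypothetical, and the handling of a mix of annotated and unannotated methods (unannotated calls are ignored once at least one annotated call is seen) is a simplifying assumption of the sketch.

```python
from enum import Enum


class Access(Enum):
    READ = "read"
    WRITE = "write"
    READ_WRITE = "read_write"
    UNKNOWN = "unknown"


def classify_access(annotations):
    """Classify a dataset access from the annotations of the methods called.

    `annotations` holds one entry per invoked method: "read", "write",
    "read_write", or None when the method carries no annotation.
    ASSUMPTION: unannotated calls are ignored if any call is annotated.
    """
    reads = any(a in ("read", "read_write") for a in annotations)
    writes = any(a in ("write", "read_write") for a in annotations)
    if reads and writes:
        return Access.READ_WRITE
    if reads:
        return Access.READ
    if writes:
        return Access.WRITE
    # No annotations at all: only the fact of an access is known.
    return Access.UNKNOWN


print(classify_access(["read", "write"]))
print(classify_access([None, None]))
```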