Metadata Microservices
Use the CDAP Metadata Microservices to set, retrieve, and delete the metadata annotations of applications, datasets, and other entities in CDAP.
Note: Metadata for versioned entities is not versioned, including entities such as applications, programs, schedules, and program runs. Additions to metadata in one version are reflected in all versions.
Metadata consists of properties (a list of key-value pairs) or tags (a list of keys). Metadata and their use are described in the Metadata and Lineage section.
The Microservices are divided into these sections:
Metadata properties
Metadata tags
Searching metadata
Viewing lineage
Field level lineage
Metadata for a run of a program
Metadata keys, values, and tags must conform to the CDAP alphanumeric extra extended character set, and are limited to 50 characters in length. The entire metadata object associated with a single entity is limited to 10K bytes in size.
There is one reserved word for property keys and values: tags, either as tags or TAGS. Tags themselves have no reserved words.
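The limits and the reserved word described above can be checked on the client before a request is sent. The following Python sketch is illustrative only: the function name validate_metadata is hypothetical, "10K bytes" is assumed to mean 10 * 1024 bytes measured on a UTF-8 JSON encoding of the properties and tags, and the character-set rule is not checked because it is not defined in this section.

```python
import json

MAX_LENGTH = 50                 # per key, value, and tag, as stated above
MAX_OBJECT_BYTES = 10 * 1024    # assumption: "10K bytes" read as 10 * 1024
RESERVED_WORDS = {"tags", "TAGS"}  # applies to property keys and values only


def validate_metadata(properties, tags):
    """Return a list of problems; an empty list means all checks passed."""
    problems = []
    for key, value in properties.items():
        for label, text in (("key", key), ("value", value)):
            if len(text) > MAX_LENGTH:
                problems.append(f"property {label} {text!r} is longer than {MAX_LENGTH}")
            if text in RESERVED_WORDS:
                problems.append(f"property {label} {text!r} is a reserved word")
    for tag in tags:
        # Tags have no reserved words, so only the length is checked.
        if len(tag) > MAX_LENGTH:
            problems.append(f"tag {tag!r} is longer than {MAX_LENGTH}")
    # Assumption: total size is approximated by the encoded JSON size.
    encoded = json.dumps({"properties": properties, "tags": sorted(tags)})
    if len(encoded.encode("utf-8")) > MAX_OBJECT_BYTES:
        problems.append(f"metadata object is larger than {MAX_OBJECT_BYTES} bytes")
    return problems
```

How CDAP itself measures the size of a metadata object may differ, so treat the size check as a rough guard rather than an exact reproduction of the server-side rule.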
All methods or endpoints described in this API have a base URL (typically http://<host>:11015 or https://<host>:10443) that precedes the resource identifier, as described in the Microservices Conventions. These methods return a status code, as listed in the Microservices Status Codes.
Note: Datasets are deprecated and will be removed in CDAP 7.0.0.
Metadata Properties
Annotating Properties
To annotate user metadata properties for an application, dataset, or other entities including custom entities, submit an HTTP POST request:
POST /v3/namespaces/<namespace-id>/<entity-details>/metadata/properties
or, for a particular program of a specific application:
POST /v3/namespaces/<namespace-id>/apps/<app-id>/<program-type>/<program-id>/metadata/properties
or, for a particular version of an artifact:
POST /v3/namespaces/<namespace-id>/artifacts/<artifact-id>/versions/<artifact-version>/metadata/properties
or, for a custom entity such as a field of a dataset:
with the metadata properties as a JSON string map of string-string pairs, passed in the request body:
New property keys will be added and existing keys will be updated. Existing keys not in the properties map will not be deleted.
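As a rough illustration, the request above could be assembled in Python as follows. This is a sketch under stated assumptions, not a definitive client: the host localhost, the namespace default, the application name MyApp, and the Content-Type header are all assumptions, and the request is only built, not sent. The helper merge_properties is hypothetical and simply models the update rule described above.

```python
import json
import urllib.request

BASE_URL = "http://localhost:11015"  # assumption: <host> is localhost


def build_set_properties_request(namespace, entity_details, properties):
    """Build (but do not send) the POST request that sets properties."""
    url = f"{BASE_URL}/v3/namespaces/{namespace}/{entity_details}/metadata/properties"
    body = json.dumps(properties).encode("utf-8")  # JSON map of string-string pairs
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},  # assumption
    )


def merge_properties(existing, update):
    """Model the documented rule: add new keys, update existing, delete none."""
    merged = dict(existing)
    merged.update(update)
    return merged


# Hypothetical entity: an application named MyApp in the default namespace.
request = build_set_properties_request(
    "default", "apps/MyApp", {"owner": "data-team", "env": "staging"}
)
```

Passing the request to urllib.request.urlopen would send it; authentication and error handling are omitted here.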
Parameter | Description |
---|---|
namespace-id | Namespace ID. |
entity-details | Hierarchical key-value representation of the entity. |
app-id | Name of the application. |
program-type | One of |
program-id | Name of the program. |
artifact-id | Name of the artifact. |
artifact-version | Version of the artifact. |
| Name of the dataset. |
| Name of the field. |
HTTP Responses
Status Codes | Description |
---|---|
| The properties were set. |
Note: When using this API, properties can be added to the metadata of the specified entity only in the user scope.
Retrieving Properties
To retrieve user metadata properties for an application, dataset, or other entities including custom entities, submit an HTTP GET request:
or, for a specific application:
or, for a particular program of a specific application:
or, for a particular version of an artifact:
or, for a custom entity such as a field of a dataset:
with the metadata properties returned as a JSON string map of string-string pairs, passed in the response body (pretty-printed):
Parameter | Description |
---|---|
| Namespace ID. |
| Hierarchical key-value representation of the entity. |
| Name of the application. |
| One of |
| Name of the program. |
| Name of the artifact. |
| Version of the artifact. |
| Name of the dataset. |
| Name of the field. |
| Optional scope filter. If not specified, properties in the |
Example
To get the creation time for a deployed data pipeline called POS_SALES, issue the following GET request:
The result is:
HTTP Responses
Status Codes | Description |
---|---|
| The requested properties were returned as a JSON string in the body of the response. The body can be empty if there are no properties associated with the entity or if the entity does not exist. |
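A retrieval request for the POS_SALES pipeline might be built as in the Python sketch below. Several details are assumptions rather than facts taken from this page: the GET path is assumed to mirror the POST template shown under Annotating Properties, the scope filter is assumed to be a query parameter named scope with the value system, and the host localhost and namespace default are placeholders. The request is only constructed, not sent.

```python
import json
import urllib.parse
import urllib.request


def build_get_properties_request(base_url, namespace, entity_details, scope=None):
    """Build (but do not send) a GET request for an entity's properties."""
    # Assumption: the GET path mirrors the POST template shown earlier.
    url = f"{base_url}/v3/namespaces/{namespace}/{entity_details}/metadata/properties"
    if scope is not None:
        # Assumption: the optional scope filter is a query parameter named "scope".
        url += "?" + urllib.parse.urlencode({"scope": scope})
    return urllib.request.Request(url, method="GET")


def parse_properties(body):
    """Decode a response body; an empty body yields an empty dict."""
    return json.loads(body) if body.strip() else {}


request = build_get_properties_request(
    "http://localhost:11015", "default", "apps/POS_SALES", scope="system"
)
```

Because an empty result is returned both when the entity has no properties and when it does not exist, a caller cannot tell those two cases apart from the response body alone.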
Deleting Properties
To delete all user metadata properties for an application, dataset, or other entities including custom entities, submit an HTTP DELETE request:
or, for all user metadata properties of a particular program of a specific application:
or, for a particular version of an artifact:
To delete a specific property for an application, dataset, or other entity, submit an HTTP DELETE request with the property key:
or, for a particular property of a program of a specific application:
or, for a particular version of an artifact:
or, for a custom entity such as a field of a dataset:
Parameter | Description |
---|---|
| Namespace ID. |
| Hierarchical key-value representation of the entity. |
| Name of the application. |
| One of |
| Name of the program. |
| Name of the artifact. |
| Version of the artifact. |
| Name of the dataset. |
| Name of the field. |
| Metadata property key. |
HTTP Responses
Status Codes | Description |
---|---|
| The method was called successfully: the properties were deleted, the specified key was deleted or was not present, or the entity itself was not present. |
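The two forms of deletion, all properties or a single key, can be sketched in Python as shown below. The path layout is an assumption: it reuses the properties path from Annotating Properties and appends the property key as a final path segment. The host, namespace, and entity names are placeholders, and nothing is sent over the network.

```python
import urllib.parse
import urllib.request


def build_delete_properties_request(base_url, namespace, entity_details, key=None):
    """Build (but do not send) a DELETE request for all properties or one key."""
    url = f"{base_url}/v3/namespaces/{namespace}/{entity_details}/metadata/properties"
    if key is not None:
        # Assumption: a specific property is addressed by appending its key.
        url += "/" + urllib.parse.quote(key, safe="")
    return urllib.request.Request(url, method="DELETE")


# Hypothetical entity: an application named MyApp in the default namespace.
delete_all = build_delete_properties_request("http://localhost:11015", "default", "apps/MyApp")
delete_one = build_delete_properties_request(
    "http://localhost:11015", "default", "apps/MyApp", key="owner"
)
```

Since a success status is returned even when the key or the entity is absent, repeating the same delete request is harmless.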
Metadata Tags
Adding Tags
To add user metadata tags for an application, dataset, or other entities including custom entities, submit an HTTP POST request:
or, for a particular program of a specific application:
or, for a particular version of an artifact:
or, for a custom entity such as a field of a dataset:
with the metadata tags, as a list of strings, passed in the JSON request body:
Parameter | Description |
---|---|
| Namespace ID. |
| Hierarchical key-value representation of the entity. |
| Name of the application. |
| One of |
| Name of the program. |
| Name of the artifact. |
| Version of the artifact. |
| Name of the dataset. |
| Name of the field. |
HTTP Responses
Status Codes | Description |
---|---|
| The tags were set. |
Retrieving Tags
To retrieve user metadata tags for an application, dataset, or other entities including custom entities, submit an HTTP GET request:
or, for a particular program of a specific application:
or, for a particular version of an artifact:
or, for a custom entity such as a field of a dataset:
with the metadata tags returned as a JSON string in the response body:
Parameter | Description |
---|---|
| Namespace ID. |
| Hierarchical key-value representation of the entity. |
| Name of the application. |
| One of |
| Name of the program. |
| Name of the artifact. |
| Version of the artifact. |
| Name of the dataset. |
| Name of the field. |
| Optional scope filter. If not specified, properties in the |
HTTP Responses
Status Codes | Description |
---|---|
| The requested tags were returned as a JSON string in the body of the response. The body can be empty if there are no tags associated with the entity or if the entity does not exist. |
Removing Tags
To delete all user metadata tags for an application, dataset, or other entities including custom entities, submit an HTTP DELETE request:
or, for all user metadata tags of a particular program of a specific application:
or, for a particular version of an artifact:
To delete a specific user metadata tag for an application, dataset, or other entity, submit an HTTP DELETE request with the tag:
or, for a particular user metadata tag of a program of a specific application:
or, for a particular version of an artifact:
or, for a custom entity such as a field of a dataset:
Parameter | Description |
---|---|
| Namespace ID. |
| Hierarchical key-value representation of the entity. |
| Name of the application. |
| One of |
| Name of the program. |
| Name of the artifact. |
| Version of the artifact. |
| Name of the dataset. |
| Name of the field. |
| Metadata tag. |
HTTP Responses
Status Codes | Description |
---|---|
| The method was called successfully: the tags were deleted, the specified tag was deleted or was not present, or the entity itself was not present. |
Searching for Metadata
CDAP supports searching metadata of entities. To find which applications, datasets, etc. have a particular metadata property or metadata tag, submit an HTTP GET request:
Parameter | Description |
---|---|
| Namespace ID. |
| Query term, as described below. Query terms are case-insensitive. |
| Restricts the search to either all or specified entity types: |
| Options for controlling cursors, limits, offsets, the inclusion of hidden and custom entities, and sorting: Option NameOption Value, Description, and Notes Format for an option: |
Entities that match the specified query and entity type are returned in the body of the response in JSON format:
HTTP Responses
Status Codes | Description |
---|---|
| Entity ID and metadata of entities that match the query and entity type(s) are returned in the body of the response. |
Query Terms
CDAP supports prefix-based search of metadata properties and tags across both user and system scopes. Search metadata of entities by using either a complete or partial name followed by an asterisk *.
Search for properties and tags by specifying one of:
Complete property key-value pair, separated by a colon, such as type:production
Complete property key with a partial value, such as type:prod*
Complete tags key with a complete or partial value, such as tags:production or tags:prod*, to search for tags only
Complete or partial value, such as prod*; this will return both properties and tags
Multiple search terms separated by space, such as type:prod* author:joe; this will return entities having either of the terms in their metadata.
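The matching rules listed above can be modeled with a small Python function. This is a toy model for reasoning about query terms, not CDAP's search implementation: the function names are hypothetical, and details such as how the server tokenizes values are not covered.

```python
def _value_matches(pattern, text):
    """Exact match, or prefix match when the pattern ends with an asterisk."""
    text = text.lower()
    if pattern.endswith("*"):
        return text.startswith(pattern[:-1])
    return text == pattern


def term_matches(term, properties, tags):
    """Check one query term against an entity's properties and tags."""
    term = term.lower()  # query terms are case-insensitive
    key, sep, value = term.partition(":")
    if not sep:
        # Bare value: look at both property values and tags.
        candidates = list(properties.values()) + list(tags)
        return any(_value_matches(term, c) for c in candidates)
    if key == "tags":
        # The tags key restricts the search to tags only.
        return any(_value_matches(value, t) for t in tags)
    return any(
        k.lower() == key and _value_matches(value, v)
        for k, v in properties.items()
    )


def query_matches(query, properties, tags):
    """Space-separated terms are combined with OR."""
    return any(term_matches(t, properties, tags) for t in query.split())
```

For instance, with properties {"type": "production", "author": "joe"} and no tags, the query type:dev author:joe matches because the second term matches, even though the first does not.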
Because CDAP also annotates entities with system metadata by default, as described in System Metadata, the following special search queries are also supported:
Artifacts or applications containing a specific plugin: plugin:<plugin-name>
Programs with a specific mode: batch or realtime
Applications with a specific program type: service:<service-name>, mapreduce:<mapreduce-name>, spark:<spark-name>, worker:<worker-name>, workflow:<workflow-name>
Datasets or views with schema field:
  field name only: field-name
  field name with a type: <field-name>:<field-type>, where field-type can be:
    simple types: int, long, boolean, float, double, bytes, string, enum
    complex types: array, map, record, union
With a schema as shown above, queries such as employee:record, employeeName:string, departments, departments:array can be issued.
Viewing Lineages
To view the lineage of a dataset, submit an HTTP GET request:
where:
Parameter | Description |
---|---|
| Namespace ID. |
|
|
| Name of the |
| Starting time-stamp of lineage (inclusive), in seconds. Supports |
| Ending time-stamp of lineage (exclusive), in seconds. Supports |
| Number of levels of lineage output to return. Defaults to 10. Determines how far back the provenance of the data in the lineage chain is calculated. |
| An optional set of |
| An optional |
See “Querying by a Time Range” in the Metrics Microservices for examples of the "now" time syntax.
For more information about collapsing lineage output, see the following section on Collapsing Lineage Output.
The lineage will be returned as a JSON string in the body of the response. The JSON describes lineage as a graph of connections between programs and datasets in the specified time range. The number of levels of the request (levels) determines the depth of the graph. This impacts how far back the provenance of the data in the lineage chain is calculated, as described in the Metadata and Lineage section.
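A lineage request URL could be assembled as in the following Python sketch. Only the parameter names levels, collapse, and rollup appear in this page; the path layout and the names start and end are assumptions made for illustration, as are the host, namespace, dataset name, and timestamps.

```python
import urllib.parse


def build_lineage_url(base_url, namespace, dataset, start, end,
                      levels=10, collapse=(), rollup=None):
    """Build a lineage URL for a dataset over a time range given in seconds."""
    # Assumption: "start" and "end" are the time-range parameter names.
    params = [("start", start), ("end", end), ("levels", levels)]
    # The collapse parameter may be repeated, once per collapse type.
    params += [("collapse", c) for c in collapse]
    if rollup is not None:
        params.append(("rollup", rollup))
    dataset_segment = urllib.parse.quote(dataset, safe="")
    # Assumption: lineage is exposed under the dataset's path.
    path = f"/v3/namespaces/{namespace}/datasets/{dataset_segment}/lineage"
    return f"{base_url}{path}?{urllib.parse.urlencode(params)}"


# Hypothetical dataset "purchases" and hypothetical epoch-second timestamps.
url = build_lineage_url(
    "http://localhost:11015", "default", "purchases",
    1442863938, 1442881938, collapse=["run", "access"],
)
```

The default of 10 for levels follows the parameter table above; a smaller value limits how far back the lineage graph is traced.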
Lineage JSON consists of three main sections:
Relations: contains information on data accessed by programs. Access type can be read, write, both, or unknown. It also contains the runid of the program that accessed the data, and the specifics of any component of a program that also accessed the data.
Data: contains Datasets that were accessed by programs.
Programs: contains information on programs (MapReduce, Spark, workers, etc.) that accessed the data.
Here is an example, pretty-printed:
HTTP Responses
Status Codes | Description |
---|---|
| Entity IDs of entities with the metadata properties specified were returned as a list of strings in the body of the response. |
| No entities matching the specified query were found. |
Collapsing Lineage Output
Lineage output can be collapsed by access, run, or component. Collapsing allows you to group all the lineage relations for the specified collapse-type together in the lineage output. Collapsing is useful when you do not want to differentiate between multiple lineage relations that only differ by the collapse-type.
For example, consider a program that wrote to a dataset in multiple runs over a specified time interval. If you do not want to differentiate between lineage relations involving different runs of this program (so you only want to know that a program accessed a data entity in a given time interval), you could provide the query parameter collapse=run to the lineage API. This would collapse the lineage relations in the output to group the multiple runs of the program together.
You can also collapse lineage relations by access (which groups together relations that differ only by access type) or by component (which groups together relations that differ only by component). The lineage HTTP RESTful API also allows you to use multiple collapse parameters in the same query.
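The grouping behind collapsing can be sketched in Python. This is a simplified model, not CDAP's implementation: each relation is represented as a flat dictionary with the hypothetical field names data, program, access, run, and component, and the sample values are invented.

```python
FIELDS = ("data", "program", "access", "run", "component")


def collapse(relations, by):
    """Group relations that differ only in the fields named in `by`."""
    by = tuple(by)
    groups = {}
    for rel in relations:
        # Relations sharing every non-collapsed field land in the same group.
        key = tuple((f, rel[f]) for f in FIELDS if f not in by)
        bucket = groups.setdefault(key, {f: set() for f in by})
        for f in by:
            bucket[f].add(rel[f])
    # Each collapsed field becomes a sorted list of the values it absorbed.
    return [
        dict(key, **{f: sorted(values) for f, values in bucket.items()})
        for key, bucket in groups.items()
    ]


relations = [
    {"data": "purchases", "program": "PurchaseFlow", "access": "write",
     "run": "run1", "component": "collector"},
    {"data": "purchases", "program": "PurchaseFlow", "access": "write",
     "run": "run2", "component": "collector"},
    {"data": "purchases", "program": "PurchaseFlow", "access": "read",
     "run": "run1", "component": "collector"},
]
by_run = collapse(relations, ["run"])
by_run_and_access = collapse(relations, ["run", "access"])
```

Passing several fields in by corresponds to supplying multiple collapse parameters in one query: the more fields are collapsed, the fewer distinct relations remain.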
Consider these relations from the output of a lineage API request:
Collapsing the above by run would group the runs together as:
Collapsing by access would produce:
Similarly, collapsing by component will generate:
Rolling Up Lineage Output
Lineage rollup allows you to group multiple entities together for a condensed view in the lineage output.
Currently, for the parameter rollup, only the value workflow is supported, which allows programs to be grouped together into workflows if multiple programs are created as part of a workflow.
For example, suppose you have a workflow that starts two programs to complete an associated task. If you do not want to differentiate between lineage relations involving different programs of this workflow, you could provide the query parameter rollup=workflow to the lineage API. This would roll up the lineage relations in the output to show corresponding workflows instead of individual programs.
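The effect of a workflow rollup can be pictured with the short Python sketch below. It is a toy model under stated assumptions: relations are flat dictionaries with hypothetical field names, and the mapping from each program to the workflow that started it is supplied by the caller, whereas CDAP derives that association itself.

```python
def rollup_workflow(relations, workflow_of):
    """Replace each program with its workflow, dropping duplicate relations."""
    rolled = []
    for rel in relations:
        # Programs that were not started by a workflow are kept as they are.
        program = workflow_of.get(rel["program"], rel["program"])
        candidate = dict(rel, program=program)
        if candidate not in rolled:
            rolled.append(candidate)
    return rolled


# Hypothetical workflow that starts two programs writing to the same dataset.
workflow_of = {"Cleaner": "NightlyWorkflow", "Aggregator": "NightlyWorkflow"}
relations = [
    {"data": "purchases", "program": "Cleaner", "access": "write"},
    {"data": "purchases", "program": "Aggregator", "access": "write"},
    {"data": "purchases", "program": "ReportService", "access": "read"},
]
rolled = rollup_workflow(relations, workflow_of)
```

Here the two workflow programs merge into a single relation for NightlyWorkflow, while ReportService, which is not part of any workflow, stays as an individual program.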
Consider these relations from the output of a lineage API request:
Rolling up the above using rollup=workflow would group the programs together as:
Field Level Lineage
Fields associated with the Dataset
Gets the fields that were associated with the dataset for the specified range of time:
where:
Parameter | Description |
---|---|
| Namespace ID. |
| Name of the |
| Starting time-stamp (inclusive), in seconds. Supports |
| Ending time-stamp (exclusive), in seconds. Supports |
| Optional |
| Optional flag. When set to true, the current fields of the dataset will be included irrespective of whether they have any lineage information or not. |
Following is a sample response:
HTTP Responses
Status Codes | Description |
---|---|
| Fields of dataset are returned as a list of strings in the body of the response. |
| Failure to parse the time range provided. |
Field Lineage Summary
Gets the field lineage summary for a specified field of a dataset. The field lineage summary consists of the sets of datasets and their respective fields used to compute the specified field of a dataset:
where:
Parameter | Description |
---|---|
| Namespace ID. |
| Name of the |
| Name of the |
| Starting time-stamp (inclusive), in seconds. Supports |
| Ending time-stamp (exclusive), in seconds. Supports |
|
|
The returned response consists of the direction in which the summary is requested and the datasets and fields that were responsible for the computation of a specified field. Currently, the only supported direction is incoming.
Following is a sample response:
HTTP Responses
Status Codes | Description |
---|---|
| Fields of dataset are returned as a list of strings in the body of the response. |
| Failure to parse the time range provided. |
Field Lineage Operations
Gets the details of operations responsible for computation of a specified field of a dataset for a specified range of time:
where:
Parameter | Description |
---|---|
| Namespace ID. |
| Name of the |
| Name of the |
| Starting time-stamp (inclusive), in seconds. Supports |
| Ending time-stamp (exclusive), in seconds. Supports |
|
|
A single field can be computed in multiple ways, where each unique way consists of a list of operations that participated in the computation and the list of programs that performed the computation. The returned response consists of the direction in which the operations are requested. Currently, the only supported direction is incoming. For the specified direction, the response includes the different ways that the field was computed.
Following is a sample response:
HTTP Responses
Status Codes | Description |
---|---|
| Fields of dataset are returned as a list of strings in the body of the response. |
| Failure to parse the time range provided. |
Metadata for Custom Entities
Metadata can also be associated with custom entities. In CDAP, entities are separated into two kinds:
CDAP Entities: These are system-defined entities that have special meaning in CDAP. Namespace, Application, Dataset, etc. are examples of CDAP Entities.
Custom Entities: These are user-defined entities that represent a resource that exists in CDAP.
Custom Entities are represented as a hierarchical key-value pair and can optionally have an explicitly defined type.
If a type is not specified then the last key in the hierarchy is considered as the type.
In Microservices, custom entities are represented as hierarchical key-value pairs. If the last key in the hierarchy is not a type, the type is specified as a query parameter.
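The type rule described above can be expressed as a small Python helper. This is a sketch for illustration only: the function name parse_entity and the sample paths are hypothetical, and the helper works on the entity portion of a path, not on a full request URL.

```python
def parse_entity(path, explicit_type=None):
    """Split a hierarchical key/value path and determine the entity type."""
    parts = [p for p in path.strip("/").split("/") if p]
    if not parts or len(parts) % 2 != 0:
        raise ValueError("expected one or more key/value pairs")
    # Pair up alternating segments: key, value, key, value, ...
    pairs = list(zip(parts[0::2], parts[1::2]))
    # Without an explicit type, the last key in the hierarchy is the type.
    entity_type = explicit_type if explicit_type is not None else pairs[-1][0]
    return pairs, entity_type


file_pairs, file_type = parse_entity("file/part-0")
jar_pairs, jar_type = parse_entity("jar/my-jar/versions/1.0.0", explicit_type="jar")
```

Without the explicit type, the second path would be treated as an entity of type versions, which is why the type has to be supplied separately in that case.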
For example, to add tags to a custom file entity in a dataset:
In the example above, the custom entity is a single key-value pair where file is the key and <file-name> is the value.
To add tags to a custom jar entity in a namespace:
In the example above, the custom entity consists of two key-value pairs. The first has key jar and value <jar-id>. The second has key versions and value <jar-version>. We pass jar as the type to specify the type of the entity, since the last key in the hierarchy is not the type in this case.
In both examples, the metadata tags are passed as a JSON list of strings in the request body:
Created in 2020 by Google Inc.