Metadata Custom Entities and Authorization

Introduction 

This design document covers the following:

  1. Metadata API to add metadata to elements inside CDAP which are not entities, for example files in a fileset or schema fields in a dataset.
  2. Plugin/Program API to add metadata to elements inside CDAP which are not entities, for example files in a fileset or schema fields in a dataset.
  3. Authorization for Metadata

Design

MetadataEntity Representation Format

One of the pain points of the existing metadata APIs is that they only allow metadata to be associated with CDAP entities. This is very restrictive for enterprises that want to tag and discover any resource in CDAP (for example a field of a schema or a file of a fileset), including resources which are not CDAP entities.

To solve this, we need a generic way of specifying metadata entities (both entities and non-entities) in CDAP. We propose the following generic representation for metadata annotations.

A map of string to string allows the user to specify any metadata entity in CDAP and also supports existing CDAP entities. For example:

  1. An existing CDAP entity, such as a dataset with datasetId myDataset, can be specified as:
    Map<namespace=myNamespace, dataset=myDataset>
  2. Field 'empName' of dataset 'myDataset' can be specified as:
    Map<namespace=myNamespace, dataset=myDataset, field=empName>
  3. File 'part01' of a fileset 'myFileset' can be specified as:
    Map<namespace=myNamespace, dataset=myFileset, file=part01>
  4. The free-form map above allows us to represent any resource in the CDAP metadata system, irrespective of whether it is present in CDAP or not. For example, an external MySQL table can be represented as:
    Map<database=myDatabase, table=myTable>
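
The examples above can be sketched with an insertion-ordered map. This is only an illustration, assuming a LinkedHashMap is an acceptable stand-in for the proposed representation; the class name MetadataEntityMaps and its method of are hypothetical and not part of CDAP.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not a CDAP class: builds the free-form maps shown above.
final class MetadataEntityMaps {

  // Builds an insertion-ordered, read-only map from alternating key/value arguments.
  static Map<String, String> of(String... keysAndValues) {
    if (keysAndValues.length % 2 != 0) {
      throw new IllegalArgumentException("Expected an even number of arguments");
    }
    Map<String, String> parts = new LinkedHashMap<>();
    for (int i = 0; i < keysAndValues.length; i += 2) {
      parts.put(keysAndValues[i], keysAndValues[i + 1]);
    }
    return Collections.unmodifiableMap(parts);
  }

  public static void main(String[] args) {
    // An existing CDAP entity: a dataset.
    System.out.println(of("namespace", "myNamespace", "dataset", "myDataset"));
    // A non-entity: a field of that dataset.
    System.out.println(of("namespace", "myNamespace", "dataset", "myDataset", "field", "empName"));
    // An external resource that does not live in CDAP at all.
    System.out.println(of("database", "myDatabase", "table", "myTable"));
  }
}
```

Insertion order is kept in this sketch because the order of the parts (namespace, then dataset, then field) carries the hierarchy of the resource.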

 

Metadata Resource
@Beta
public final class MetadataEntity {

  private List<KeyValue> details;

  private MetadataEntity(List<KeyValue> details) {
    this.details = details;
  }

  public List<KeyValue> getDetails() {
    return details;
  }

  public static Builder builder() {
    return new Builder();
  }

  public static class Builder {
    private final List<KeyValue> details = new LinkedList<>();

    public Builder add(String k, String v) {
      details.add(new KeyValue<>(k, v));
      return this;
    }

    public Builder fromEntityId(EntityId entityId) {
      ...
      return this;
    }

    public MetadataEntity build() {
      return new MetadataEntity(Collections.unmodifiableList(details));
    }
  }

  // Converts an EntityId to a MetadataEntity by delegating to the builder.
  public static MetadataEntity fromEntityId(EntityId entityId) {
    return builder().fromEntityId(entityId).build();
  }
}

Overview of API changes

Our existing metadata APIs will need to change to allow the user to specify the generic metadata target described above. All of our existing metadata APIs are built around EntityId as the target for the metadata system. Since an EntityId is just a set of key-value pairs with well-defined key names, it can easily be represented as the MetadataEntity presented above.

 

For example, a MapReduce program with ProgramId 'myProgram' can be represented as a map with the following key-value pairs:

Map<namespace=myNamespace, application=myApplication, appVersion=1.0.0-SNAPSHOT, programName=myProgram, programType=mapreduce>

 

CDAP internal metadata APIs will be changed to accept a MetadataEntity rather than an EntityId. For example, the following APIs in MetadataAdmin:

MetadataAdmin (Existing)
void addProperties(NamespacedEntityId namespacedEntityId, Map<String, String> properties)
  throws NotFoundException, InvalidMetadataException;

void addTags(NamespacedEntityId namespacedEntityId, String... tags)
  throws NotFoundException, InvalidMetadataException;

Set<MetadataRecord> getMetadata(NamespacedEntityId namespacedEntityId) throws NotFoundException;

will change to

MetadataAdmin (New)
void addProperties(MetadataEntity entity, Map<String, String> properties)
  throws NotFoundException, InvalidMetadataException;

void addTags(MetadataEntity entity, String... tags)
  throws NotFoundException, InvalidMetadataException;

Set<MetadataRecord> getMetadata(MetadataEntity entity) throws NotFoundException;

In addition to the new metadata APIs, we will also introduce new utility methods and public APIs that allow the user to add metadata by directly specifying an EntityId and/or to easily convert an EntityId to a MetadataEntity for the metadata system. This is shown in the MetadataEntity class documented earlier.

For backward compatibility, we will deprecate all the APIs which work with EntityId and change their implementation to convert the EntityId to a MetadataEntity.
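
The deprecate-and-delegate approach described above could look roughly like the following. This is a minimal sketch with simplified stand-in types: DatasetIdStandIn and MetadataAdminSketch are hypothetical names, and a plain map stands in for MetadataEntity so that the snippet compiles on its own.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified stand-in for a CDAP DatasetId; not the real class.
final class DatasetIdStandIn {
  final String namespace;
  final String dataset;

  DatasetIdStandIn(String namespace, String dataset) {
    this.namespace = namespace;
    this.dataset = dataset;
  }

  // Converts the id into the generic key-value form used by the new API.
  Map<String, String> toMetadataEntity() {
    Map<String, String> parts = new LinkedHashMap<>();
    parts.put("namespace", namespace);
    parts.put("dataset", dataset);
    return parts;
  }
}

// Hypothetical admin class showing an old overload delegating to the new one.
final class MetadataAdminSketch {
  // Records every call that reaches the new API, for demonstration only.
  final List<String> calls = new ArrayList<>();

  // New API: accepts the generic metadata entity.
  void addTags(Map<String, String> entity, String... tags) {
    calls.add(entity + " -> " + String.join(",", tags));
  }

  // Old API: kept for backward compatibility, converts the id and delegates.
  @Deprecated
  void addTags(DatasetIdStandIn datasetId, String... tags) {
    addTags(datasetId.toMetadataEntity(), tags);
  }
}
```

With this shape, existing callers keep compiling (with a deprecation warning) while all writes go through a single code path.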

 

Schema Fields as MetadataEntity

Allowing metadata to be associated with non-entities (MetadataEntity) will allow us to associate metadata with schema fields. Schema fields are an important kind of non-entity, and we need to discuss how we will associate metadata with schema fields, retrieve it, and display it in the UI.

Specifying Schema Field as MetadataEntity:

DatasetId myDataset = context.getDataset(....);
MetadataEntity.Builder builder = MetadataEntity.builder().fromEntityId(myDataset);
builder.add("field", "EmployeeSSN");
MetadataEntity employeeSSNField = builder.build();
 
metadataClient.addTags(employeeSSNField, "PII");

 

Retrieving Schema Fields through Metadata Search:

When a user performs a search that matches metadata associated with a schema field, ideally we should display the schema field. Our current UI does not support displaying non-entities, so we will display the dataset associated with the field instead.
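
One way to picture this fallback is to trim a matched metadata entity down to its enclosing dataset. The helper below is a hypothetical sketch (SearchResultDisplay and enclosingDataset are not CDAP names), and it assumes the entity is given as an insertion-ordered map whose parts run from namespace down to the field.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: trims a metadata entity down to its enclosing dataset.
final class SearchResultDisplay {

  // Returns the leading parts up to and including the "dataset" key,
  // or an empty map if the entity is not under a dataset.
  static Map<String, String> enclosingDataset(Map<String, String> entity) {
    Map<String, String> prefix = new LinkedHashMap<>();
    for (Map.Entry<String, String> part : entity.entrySet()) {
      prefix.put(part.getKey(), part.getValue());
      if ("dataset".equals(part.getKey())) {
        return prefix;
      }
    }
    return new LinkedHashMap<>();
  }

  public static void main(String[] args) {
    Map<String, String> field = new LinkedHashMap<>();
    field.put("namespace", "myNamespace");
    field.put("dataset", "myDataset");
    field.put("field", "EmployeeSSN");
    // The UI would show the dataset rather than the field itself.
    System.out.println(enclosingDataset(field));
  }
}
```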

Program/Plugin Level APIs

@Override
public void initialize() throws Exception {
  MapReduceContext context = getContext();
  MetadataContext metadataContext = context.getMetadataContext();
  metadataContext.addTags(resource, tags...);
}

The MetadataContext that a developer gets here will be a RemoteMetadataClient, which will find the MetadataService through service discovery.

 

Plugin APIs

We will extend the lineage APIs to also support metadata.

Field Level Lineage API

We will introduce a new interface called PluginMetadataWriter:

/**
 * Metadata Writer APIs for Plugins
 */
public interface PluginMetadataWriter {

  void addProperties(Map<String, String> properties);

  void addTags(String... tags);

  void addTags(Iterable<String> tags);

  void removeAllMetadata();

  void removeProperties();

  void removeProperties(String... keys);

  void removeTags();

  void removeTags(String... tags);
}

 

Destination from the Lineage APIs will become:

/**
 * Destination represents the dataset of which the fields will be part of.
 */
public class Destination implements PluginMetadataWriter {
  // Namespace associated with the Dataset.
  // This is required since programs can read the data from different namespace.
  String namespace;

  // Name of the Dataset
  String name;

  // Description associated with the Dataset.
  String description;

  // Properties associated with the Dataset.
  // This can potentially store plugin properties of the Sink for context.
  // For example in case of KafkaProducer sink, properties can include broker id, list of topics etc.
  Map<String, String> properties;

  // Metadata Information
  // Metadata information associated with the Destination dataset (the metadata for the dataset only;
  // individual schema field metadata should be recorded as part of FieldOperation).
  Map<String, String> metadataProperties;
  Set<String> tags;


  @Override
  public void addProperties(Map<String, String> properties) {
    // adds to metadataProperties
  }

  @Override
  public void addTags(String... tags) {
    // adds to tags
  }

  @Override
  public void addTags(Iterable<String> tags) {
    // adds to tags
  }

  @Override
  public void removeAllMetadata() {
    // remove all metadata (properties and tags)
  }

  @Override
  public void removeProperties() {
    // clears metadataProperties
  }

  @Override
  public void removeProperties(String... keys) {
    // clears the given keys from metadataProperties
  }

  @Override
  public void removeTags() {
    // removes all tags
  }

  @Override
  public void removeTags(String... tags) {
    // removes the specified tags from tags
  }
}

 

 

public class Input implements PluginMetadataWriter {
  // Schema field which is input to the operation
  Schema.Field field;

  // Source information if the field belongs to the source/dataset
  @Nullable
  Source source;

  // Create input from a Field. Since Schema can be nested, plain String cannot be
  // used to uniquely identify the Field in the Schema, as multiple Fields can have same name
  // but different nesting. In order to uniquely identify a Field from the Schema we will
  // need an enhancement in the CDAP platform so that each Field can hold the transient
  // reference to the parent Field. From these references, then we can create unique field path.
  public static Input ofField(Schema.Field field) {
  }

  // Create input from the Field which belongs to the Source
  public static Input ofField(Source source, Schema.Field field) {

  }

  // Metadata information associated with schema Field
  Map<String, String> metadataProperties;
  Set<String> tags;


  @Override
  public void addProperties(Map<String, String> properties) {
    // adds to metadataProperties
  }

  @Override
  public void addTags(String... tags) {
    // adds to tags
  }

  @Override
  public void addTags(Iterable<String> tags) {
    // adds to tags
  }

  @Override
  public void removeAllMetadata() {
    // remove all metadata (properties and tags)
  }

  @Override
  public void removeProperties() {
    // clears metadataProperties
  }

  @Override
  public void removeProperties(String... keys) {
    // clears the given keys from metadataProperties
  }

  @Override
  public void removeTags() {
    // removes all tags
  }

  @Override
  public void removeTags(String... tags) {
    // removes the specified tags from tags
  }
}
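
The commented method bodies in Destination and Input describe the same bookkeeping. A minimal in-memory sketch of that behavior follows. InMemoryMetadataHolder is a hypothetical name; it mirrors the PluginMetadataWriter method names but does not implement the interface, so that the snippet compiles on its own.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical holder mirroring the PluginMetadataWriter methods.
class InMemoryMetadataHolder {
  private final Map<String, String> metadataProperties = new HashMap<>();
  private final Set<String> tags = new LinkedHashSet<>();

  public void addProperties(Map<String, String> properties) {
    metadataProperties.putAll(properties);
  }

  public void addTags(String... tags) {
    this.tags.addAll(Arrays.asList(tags));
  }

  public void addTags(Iterable<String> tags) {
    for (String tag : tags) {
      this.tags.add(tag);
    }
  }

  // Removes both properties and tags.
  public void removeAllMetadata() {
    removeProperties();
    removeTags();
  }

  public void removeProperties() {
    metadataProperties.clear();
  }

  public void removeProperties(String... keys) {
    for (String key : keys) {
      metadataProperties.remove(key);
    }
  }

  public void removeTags() {
    tags.clear();
  }

  public void removeTags(String... tags) {
    this.tags.removeAll(Arrays.asList(tags));
  }

  // Read-only views, added here only so the state can be inspected.
  public Map<String, String> getProperties() {
    return Collections.unmodifiableMap(metadataProperties);
  }

  public Set<String> getTags() {
    return Collections.unmodifiableSet(tags);
  }
}
```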

 

We can modify the LineageRecorder interface so that its implementation records both metadata and lineage.

Authorization for Metadata

Allowing metadata to be added to a CDAP MetadataEntity that is not an entity raises the question of authorization enforcement (i.e. who can add metadata to these resources). Since these resources are not entities, we cannot define policies for them as of now.

Even though a CDAP MetadataEntity is not predefined, we can depend on the fact that these resources generally exist under some CDAP EntityId. For example, a schema field is always associated with a dataset, and a file in a fileset is always associated with the dataset itself. If no such relationship exists, we can depend on the fact that the resource exists under a namespace, and we can perform authorization on these EntityIds. For external resources which do not even exist under a namespace, we can enforce on the InstanceId if needed.

Since metadata always belongs to some MetadataEntity (an EntityId or a non-entity id) in CDAP, the enforcement will be done on the EntityId (see above for how we determine the entity to enforce on when the MetadataEntity is not an EntityId).

Operation | Privilege Required
Get Metadata (Property/Tag) | READ on the Entity with which the metadata is associated
Add Metadata (Property/Tag) | WRITE on the Entity to which the metadata is being added
Remove Metadata (Property/Tag) | WRITE/ADMIN on the Entity from which the metadata is being removed
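
The resolution rule and privileges described above can be sketched as follows. All names here (MetadataAuthorizationSketch, entityToEnforceOn, requiredPrivilege) are hypothetical, the returned strings are only an illustrative notation, and the sketch handles just the three cases mentioned in the text: dataset, namespace, and instance.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of picking the entity to enforce on; not a CDAP API.
final class MetadataAuthorizationSketch {

  enum Operation { GET, ADD, REMOVE }

  // Resolves the closest known CDAP entity for a metadata entity:
  // the dataset if one is present, otherwise the namespace, otherwise the instance.
  static String entityToEnforceOn(Map<String, String> metadataEntity) {
    String namespace = metadataEntity.get("namespace");
    String dataset = metadataEntity.get("dataset");
    if (namespace != null && dataset != null) {
      return "dataset:" + namespace + "." + dataset;
    }
    if (namespace != null) {
      return "namespace:" + namespace;
    }
    return "instance";
  }

  // Privilege per operation. The design lists WRITE/ADMIN for removal;
  // this sketch assumes WRITE is sufficient.
  static String requiredPrivilege(Operation operation) {
    return operation == Operation.GET ? "READ" : "WRITE";
  }

  public static void main(String[] args) {
    Map<String, String> field = new LinkedHashMap<>();
    field.put("namespace", "myNamespace");
    field.put("dataset", "myDataset");
    field.put("field", "empName");
    System.out.println(requiredPrivilege(Operation.ADD) + " on " + entityToEnforceOn(field));
  }
}
```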

 

Metadata in Transaction

In CDAP 5.0 we introduce the capability of adding metadata from program/pipeline runs. This raises the question of what happens to the metadata added in a pipeline run which failed. Metadata added by failed pipeline runs will be retained. Since, as of now, we expect metadata to be added to schema fields rather than to the data records themselves (which are written in a pipeline run that might fail, leading to no data being written), we can say that the two are not related to each other. However, a conditional metadata annotation will not lead to the expected results, for example tagging a schema field with a tag like "high" only if any of the entries written to the field has a value greater than 100. With a failed pipeline run we will end up with the schema field tagged with "high" even though none of the data records has a value greater than 100.

Metadata for Versions

Our initial thought was to keep metadata independent of application/artifact version, to keep the behavior consistent with authorization policy, but there are use cases where this model will not serve very well. For example, an enterprise may have two versions of various applications deployed in their CDAP instance: one version is in production and another is under active development. In such a scenario a user might want to tag all development versions of the applications with, say, the tag "dev" and all production versions with "prod", allowing the user to later discover them through search. Making metadata version independent will not work for this use case. So, we will support two ways of adding metadata to a versioned application/artifact:

  1. If a version is not specified while adding metadata to an application/artifact, the metadata will be added to all the existing versions.
  2. If a version is specified while adding metadata to an application/artifact, the metadata will be added only to the specified version.
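
The two rules above can be modeled in memory as follows. This is a hypothetical sketch, not CDAP code: VersionedMetadataSketch and its methods are made-up names, and a null version is used to mean "no version specified".

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory model of the two versioning rules.
final class VersionedMetadataSketch {
  // Tags per application version, keyed by version string.
  private final Map<String, Set<String>> tagsByVersion = new LinkedHashMap<>();

  void deployVersion(String version) {
    tagsByVersion.putIfAbsent(version, new LinkedHashSet<String>());
  }

  // A null version means "no version specified": tag every existing version.
  void addTag(String version, String tag) {
    if (version == null) {
      for (Set<String> tags : tagsByVersion.values()) {
        tags.add(tag);
      }
      return;
    }
    Set<String> tags = tagsByVersion.get(version);
    if (tags == null) {
      throw new IllegalArgumentException("Unknown version: " + version);
    }
    tags.add(tag);
  }

  Set<String> tagsOf(String version) {
    return tagsByVersion.get(version);
  }
}
```

In this model a version deployed after an unversioned tag was added does not receive that tag, which matches the wording "all the existing versions".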

Special Character Support for Metadata

Currently, our metadata system only allows the characters a-z, 0-9 and -. This puts a severe restriction on what users can store in our metadata system. For example, if someone wants to add a metadata value which has commas in it, the current system will not allow it. We will extend our current metadata system to allow common special characters to be stored. This will require changes to the characters which we use as separators while storing metadata.

Metadata Storage and Indexing

In the current implementation of MetadataDataset, the key which is stored is a toString representation of the EntityId, i.e. EntityType.entitydetails.key. For example, for a dataset it looks like:

<length-encoding>DatasetInstance<length-encoding>namespaceName<length-encoding>datasetName<length-encoding>metadataKey

Note: We store the old Id representation and not EntityIds to keep backward compatibility with previously serialized keys. During this release, when we upgrade the metadata store, we should definitely migrate all the keys away from the old Ids to a serialization form which is independent of EntityIds.

For more information, please refer to the earlier design documentation of our metadata store and the implementation here:

Storage Design

MdsKey

With the changes proposed in this design document, we will introduce a class called MetadataEntity which will be a list of key-value pairs. In a simple representation, the key will look like:

<length-encoding>namespace<length-encoding>nsOne<length-encoding>dataset<length-encoding>dsOne<length-encoding>metadataKey

 

Also, for a file in a PFS it will look something like this:

<length-encoding>namespace<length-encoding>nsOne<length-encoding>dataset<length-encoding>dsOne<length-encoding>partition<length-encoding>partitionOne<length-encoding>file<length-encoding>fileOne<length-encoding>metadataKey


We cannot store this with our current storage key, as the key would be something like this:

file:nsOne.datasetOne.PartitionOne.FileOne

Since files are not an EntityId in CDAP, CDAP does not know the hierarchy of this custom entity type. Hence CDAP will not be able to construct the MetadataEntity back, since the individual keys are not persisted in the above format. To solve this issue, we will now store the MetadataEntity information with all the key-value pairs. To maintain backward compatibility and support search based on the entity type, we will also store the key prefixed by the target entity type, as before. So finally the key will look something like this:

<length-encoding>file<length-encoding>namespace<length-encoding>nsOne<length-encoding>dataset<length-encoding>dsOne<length-encoding>partition<length-encoding>partitionOne<length-encoding>file<length-encoding>fileOne<length-encoding>metadataKey
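
To illustrate why persisting every key and value lets the MetadataEntity be reconstructed, here is a minimal round-trip sketch. It assumes each part is written as a 4-byte big-endian length followed by UTF-8 bytes; the actual MdsKey encoding may differ, and LengthEncodedKeySketch is a hypothetical name.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical codec; the real MdsKey encoding may differ.
final class LengthEncodedKeySketch {

  // Writes each part as a 4-byte big-endian length followed by its UTF-8 bytes.
  static byte[] encode(List<String> parts) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    for (String part : parts) {
      byte[] bytes = part.getBytes(StandardCharsets.UTF_8);
      byte[] length = ByteBuffer.allocate(4).putInt(bytes.length).array();
      out.write(length, 0, length.length);
      out.write(bytes, 0, bytes.length);
    }
    return out.toByteArray();
  }

  // Reads the parts back in the order they were written.
  static List<String> decode(byte[] key) {
    ByteBuffer buffer = ByteBuffer.wrap(key);
    List<String> parts = new ArrayList<>();
    while (buffer.hasRemaining()) {
      byte[] bytes = new byte[buffer.getInt()];
      buffer.get(bytes);
      parts.add(new String(bytes, StandardCharsets.UTF_8));
    }
    return parts;
  }

  public static void main(String[] args) {
    // Type prefix, then the key-value pairs, then the metadata key.
    List<String> parts = Arrays.asList("file", "namespace", "nsOne", "dataset", "dsOne",
        "partition", "partitionOne", "file", "fileOne", "metadataKey");
    System.out.println(decode(encode(parts)));
  }
}
```

Because both the keys and the values are stored, the decoder recovers every pair without needing to know the hierarchy of the custom entity type in advance.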

Note that it is important to store the keys prefixed by the type, because it limits our scan size when we retrieve metadata for an entity/non-entity. For example, consider the following scenario:

Let's say myStreamOne is tagged with myTagOne and myTagTwo, and myStreamViewOne is tagged with myTagThree.

EntityType:EntityDetails.MoreEntityDetails.MetadataKey
So it looks like this (note that the : and . are just for readability; currently we store a length encoding):

stream:myNamespaceOne.myStreamOne.myTagOne
stream:myNamespaceOne.myStreamOne.myTagTwo
stream_view:myNamespaceOne.myStreamOne.myStreamViewOne.myTagThree


If we change it to store the key-value parts of entities (without the entity-type prefix), the above will look like:

namespace=myNamespaceOne.stream=myStreamOne.myTagOne
namespace=myNamespaceOne.stream=myStreamOne.myTagTwo
namespace=myNamespaceOne.stream=myStreamOne.stream_view=myStreamViewOne.myTagThree


Now, when someone asks for all the metadata for myStreamOne, we do a prefix-based search to collect all the metadata keys, where the search prefix is (in the current implementation):

stream:myNamespaceOne.myStreamOne.

With our MetadataEntity change the search prefix will look like this:

namespace=myNamespaceOne.stream=myStreamOne.

The problem with the new key above is that it will also match
namespace=myNamespaceOne.stream=myStreamOne.stream_view=myStreamViewOne.myTagThree

and give us the metadata for the stream view, which is a child of the stream. We can of course filter these out as a post-processing step, but this is very bad for namespace searches because the scan will return metadata for everything inside the namespace. Such large scan results can easily be avoided if we store the keys prefixed by entity type. If an entity type is not known, we can store it as some constant like UNKNOWN_TYPE.
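
The effect of the entity-type prefix on scan size can be reproduced with a sorted set of row keys. This is an illustrative sketch only: PrefixScanSketch is a hypothetical name, and the readable : and . separators stand in for the length encoding used in storage.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical demo of prefix scans with and without an entity-type prefix.
final class PrefixScanSketch {

  // Mimics a prefix scan over sorted row keys: start at the prefix and stop
  // at the first key that no longer begins with it.
  static List<String> scan(SortedSet<String> rowKeys, String prefix) {
    List<String> matches = new ArrayList<>();
    for (String key : rowKeys.tailSet(prefix)) {
      if (!key.startsWith(prefix)) {
        break;
      }
      matches.add(key);
    }
    return matches;
  }

  static SortedSet<String> keys(String... rowKeys) {
    SortedSet<String> sorted = new TreeSet<>();
    for (String key : rowKeys) {
      sorted.add(key);
    }
    return sorted;
  }

  public static void main(String[] args) {
    String stream = "namespace=myNamespaceOne.stream=myStreamOne.";
    SortedSet<String> untyped = keys(
        stream + "myTagOne",
        stream + "myTagTwo",
        stream + "stream_view=myStreamViewOne.myTagThree");
    SortedSet<String> typed = keys(
        "stream:" + stream + "myTagOne",
        "stream:" + stream + "myTagTwo",
        "stream_view:" + stream + "stream_view=myStreamViewOne.myTagThree");

    // Without the type prefix the stream view's key is scanned as well.
    System.out.println(scan(untyped, stream).size());
    // With the type prefix only the stream's own keys are scanned.
    System.out.println(scan(typed, "stream:" + stream).size());
  }
}
```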

 

Search Queries:

We will maintain support for all search queries as listed here for backward compatibility. No new search capabilities will be added.

 

Upgrade:

We will need an upgrade step which migrates all the keys from the old storage format to the new one. During this upgrade we will also get rid of the old Id compatibility serialization form and use a serialization form which is independent of EntityId but maps directly to it, so that we can convert the serialized form into an EntityId as and when needed.

Open Questions

  1. How is metadata for a schema applied to external sinks (datasets) which CDAP does not know about, like a Kudu table?
    > Associated with external datasets.
  2. What are the different possibilities of search?
    1. Do we need to support mathematical operators such as >, <, <= etc.? In this case the data needs to be treated as numbers. Does the user need to specify the type of metadata being added?
    2. Do we need to support relational operators in search queries? For example: List all datasets 
    3. Metadata now has a class/type (business, operational, technical). Do we need capabilities to filter metadata on this?
  3. How are resources like files, partitions etc., which are not CDAP entities and which CDAP does not know about, presented in the UI when discovered through metadata?
    > To be designed

New REST APIs

  • As documented in the design. New APIs will be added to support interacting with metadata on non-entities.

Deprecated REST API

  • All existing Metadata APIs which are based on EntityId

CLI Impact or Changes

  • The CLI will need to support metadata being associated with non-entities.

UI Impact or Changes

  • UI should be able to show metadata for non-entities.

Security Impact 

  • Currently Metadata does not have authorization. We will be adding authorization to metadata.

 



 

Future work

Created in 2020 by Google Inc.