...
- Field level metadata tagging:
- As a developer of a MapReduce/Spark program, I want a programmatic way to look at the fields in a dataset and assign metadata/tags to them.
- As a developer of a MapReduce/Spark program, I want a programmatic way to read the metadata associated with the fields in the input dataset.
- As a developer of a plugin, I want a programmatic way to assign metadata/tags to the fields belonging to the output schema.
- As a developer of a plugin, I want a programmatic way to read the metadata/tags associated with the fields of the input schema.
- As a developer of a CDAP data pipeline, I want the ability to assign metadata/tags to the fields of a StructuredRecord. For example, if I am creating a pipeline that reads from a database table (say UserProfile), then when I populate the schema in the UI, I want to assign the tag PII=true to certain fields such as phone number and social security number.
- Is there a user story in which a CDAP data pipeline developer needs to read the metadata associated with a field while developing the pipeline?
- As a developer of a CDAP data pipeline, if I add a plugin such as the JavaScript transform, where I need to provide my own transform method, I should be able to read the tags associated with the fields of the input schema so that I can do tag-based processing in my JavaScript transform method.
- As a developer of a CDAP data pipeline, if I add a plugin such as the JavaScript transform, where I need to provide my own transform method, I should be able to assign tags to the fields belonging to the output schema.
- As a runner of a CDAP data pipeline, I want the ability to provide additional metadata/tags through runtime arguments. For example, in a test environment the pipeline runner might not want to obfuscate the PII fields, so they should be able to provide the runtime argument "userprofile.field.phonenumber.tags.PII=false".
- Is there a user story in which a CDAP data pipeline runner needs to read the metadata associated with a field while running the pipeline?
- As an Admin of the CDAP data platform, I should be able to see which tags are associated with a particular field of a given dataset.
- As an Admin of the CDAP data platform, I should be able to assign new tags/metadata to a particular field of a given dataset.
- As a Data Governance officer, I should be able to list the fields which are marked as "PII" in a given dataset.
- As a Data scientist, I want to get the list of datasets in which a field is marked with a given tag; for example, datasets in which the phone number field is marked with "anonymized=true".
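The runtime-argument story above can be sketched as follows. This is only an illustration and not a CDAP API: the class `FieldTagOverrides` is hypothetical, and the key layout `<dataset>.field.<field>.tags.<tag>` is an assumption inferred from the single example `userprofile.field.phonenumber.tags.PII=false`. Dataset and field names are matched exactly, which is also an assumption.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of CDAP.
final class FieldTagOverrides {
  // Assumed key layout: <dataset>.field.<field>.tags.<tag>
  private static final Pattern KEY =
      Pattern.compile("^([^.]+)\\.field\\.([^.]+)\\.tags\\.([^.]+)$");

  private FieldTagOverrides() { }

  /**
   * Returns a copy of the design-time tags (field name -> tag -> value) for one
   * dataset, with matching runtime arguments applied on top. Runtime arguments
   * for other datasets, or with a different key layout, are ignored.
   */
  static Map<String, Map<String, String>> apply(
      String dataset,
      Map<String, Map<String, String>> designTimeTags,
      Map<String, String> runtimeArgs) {
    // Copy the inner maps so that the caller's design-time tags are not modified.
    Map<String, Map<String, String>> result = new HashMap<>();
    for (Map.Entry<String, Map<String, String>> e : designTimeTags.entrySet()) {
      result.put(e.getKey(), new HashMap<>(e.getValue()));
    }
    for (Map.Entry<String, String> arg : runtimeArgs.entrySet()) {
      Matcher m = KEY.matcher(arg.getKey());
      if (m.matches() && m.group(1).equals(dataset)) {
        // group(2) is the field name, group(3) is the tag name.
        result.computeIfAbsent(m.group(2), f -> new HashMap<>())
              .put(m.group(3), arg.getValue());
      }
    }
    return result;
  }
}
```

With design-time tags `phonenumber -> {PII=true}` and the runtime argument from the story, `apply("userprofile", ...)` yields `phonenumber -> {PII=false}` while leaving the design-time map unchanged.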
- Finer granularity (File/Partition/Table in a database) metadata:
- As a developer of a CDAP program such as MapReduce/Spark, when I read a fileset dataset, I should be able to read the metadata associated with each individual file in the dataset.
- As a developer of a CDAP program such as MapReduce/Spark, when I write to a fileset dataset, I should be able to assign tags/metadata to each individual file in the dataset.
- As a developer of a CDAP program such as MapReduce/Spark, when I read a partitioned fileset dataset, I should be able to read the metadata associated with each partition in the dataset.
- As a developer of a CDAP program such as MapReduce/Spark, when I write to a partitioned fileset dataset, I should be able to assign tags/metadata to each partition in the dataset.
- As a developer of a CDAP Action plugin, I should be able to read the tags/metadata, such as a data quality score, associated with each individual file in a dataset.
- As a developer of a CDAP Action plugin, I should be able to read the tags/metadata associated with each partition of a partitioned fileset dataset.
- As a developer of a CDAP Action plugin, I should be able to assign tags/metadata, such as a data quality score, to each individual file in a fileset dataset.
- As a developer of a CDAP Action plugin, I should be able to assign tags/metadata to each partition of a partitioned fileset dataset.
- As an Admin of the CDAP platform, I should be able to list the tags/metadata associated with an individual file in a dataset.
- As an Admin of the CDAP platform, I should be able to assign/override the tags/metadata associated with an individual file in a dataset.
- As an Admin of the CDAP platform, I should be able to list the tags/metadata associated with an individual partition in a partitioned fileset dataset.
- As an Admin of the CDAP platform, I should be able to assign/override the tags/metadata associated with an individual partition in a partitioned fileset dataset.
- As a Data Governance officer, I should be able to search for files on HDFS by a specific tag/metadata; for example, "Owner=HR" returns all files owned by the HR department.
- As a Data Governance officer, I should be able to search for directories on HDFS by a specific tag/metadata; for example, "CreationDate=12/30/2017" returns all directories created on the specified date.
- As a Data scientist, for compliance reasons I want my analysis to use only the files that carry a certain tag, for example "SecurityCode=green".
- (What is the role of the CDAP pipeline developer and the CDAP pipeline runner in this section? Can they use this capability somehow?)
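The tag-based search stories above ("Owner=HR", "SecurityCode=green") amount to filtering files by a key/value tag. A minimal sketch follows, assuming the tags are already available in memory as a map from file path to tags. The class `FileTagSearch` and the linear scan are illustrative assumptions, not a CDAP API; a real metadata store would presumably answer this query from an index.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of CDAP.
final class FileTagSearch {
  private FileTagSearch() { }

  /**
   * Returns, in sorted order, every path whose tags contain exactly key=value.
   * tagsByPath maps a file or directory path to its tags (tag name -> value).
   */
  static List<String> findByTag(
      Map<String, Map<String, String>> tagsByPath, String key, String value) {
    List<String> matches = new ArrayList<>();
    for (Map.Entry<String, Map<String, String>> e : tagsByPath.entrySet()) {
      // Paths that lack the tag yield null from get(key) and are skipped.
      if (value.equals(e.getValue().get(key))) {
        matches.add(e.getKey());
      }
    }
    Collections.sort(matches);
    return matches;
  }
}
```

The same call covers the Data Governance search (`findByTag(tags, "Owner", "HR")`) and the Data scientist filter (`findByTag(tags, "SecurityCode", "green")`).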
- Store metadata along with the record:
- As a developer of a MapReduce/Spark program, I want the ability to read the tags/metadata associated with the files/partition/dataset in the map and reduce tasks (MapReduce) or executor tasks (Spark) so that I can emit them as an additional field in the record.
- As a CDAP plugin developer, in the transform method, I want an ability to read the tags/metadata associated with the files/partition/dataset so that I can emit them as an additional field of the StructuredRecord.
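Emitting source metadata as additional record fields, as described in the two stories above, could look roughly like the sketch below. To keep it self-contained, a plain `Map<String, Object>` stands in for a StructuredRecord. The class `RecordMetadataEnricher` and the `meta_` field prefix are hypothetical choices made for illustration, not part of CDAP.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of CDAP. A Map stands in for StructuredRecord.
final class RecordMetadataEnricher {
  // Assumed prefix that makes name clashes with data fields less likely.
  static final String PREFIX = "meta_";

  private RecordMetadataEnricher() { }

  /**
   * Returns a copy of the record with one extra field per source tag.
   * If the record already has a field with the prefixed name, the tag value
   * replaces it.
   */
  static Map<String, Object> enrich(
      Map<String, Object> record, Map<String, String> sourceTags) {
    // LinkedHashMap keeps the original field order, with tag fields appended.
    Map<String, Object> out = new LinkedHashMap<>(record);
    for (Map.Entry<String, String> tag : sourceTags.entrySet()) {
      out.put(PREFIX + tag.getKey(), tag.getValue());
    }
    return out;
  }
}
```

In a real plugin the output schema would also have to declare the extra fields; that step is omitted here.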
- Field level lineage:
- Metadata provenance:
- Metadata propagation:
...