Checklist

User Stories Documented
User Stories Reviewed
Design Reviewed
APIs reviewed
Release priorities assigned
Test cases reviewed
Blog post

Introduction

A data dictionary will be a way for users to define and describe columns that apply across all datasets in a namespace and allow users to enforce a common naming convention, type, and indicate if the column contains PII data. Anyone creating new datasets or browsing datasets would then be able to see and leverage this information in Tracker.

Goals

After the completion of this project users will have a single point of governance for their data fields across namespace.

Use Cases

A team would like to make sure that any new datasets they create use the same naming conventions for field names. For example, in all datasets, the accountId is always in the same format (not accountID or account_id), and always contains the same type of data (always a nullable string, not an int or long). Using the data dictionary, the team sets the name, format, and definition of the column, and enables compliance checking. Now, anyone browsing through the datasets available in the namespace through Tracker can see the definitions for all the columns. If the user finds a dataset which is out of compliance, he/she can take action to bring it into compliance.
A company has specific fields in their datasets that are sensitive such as SSN or full name. They would like these columns to be tracked across all datasets so that the users of the system know to treat it with a higher level of secrecy. By adding these columns to the data dictionary, they are able to identify these columns across any dataset and make sure they are keeping the data safe. As new datasets are created with the same columns, the data dictionary is automatically applied and the columns are marked as PII. If desired, someone would be able to write a script to automatically add tags to datasets containing PII columns and audit them to make sure only specific people have access to them.

User Stories

- As a user of Tracker, I would like to define and enforce a common data dictionary across all my datasets.
- As a user of Tracker, I would like to see when a specific field in a dataset is not in compliance with my data dictionary.
- As a user of Tracker, I would like to view the reasons why my specific field is not in compliance with my data dictionary.
- As a user of Tracker, I would like to mark specific fields as PII or not.
- As a user of Tracker, I would like to be able to easily add existing fields to the Data Dictionary.

Design

Dictionary

The data dictionary will be stored in a new custom dataset specific to a namespace and backed by a Table.
The name will be _dataDictionary
The schema of the dictionary table will be as follows:
- rowKey - the column name, all lowercase
- columnName - the column name with the case preserved
- columnType - the type of the column
- isNullable - is the column nullable or not
- isPII - is the column PII or not
- description - the description of the column
- datasets - a list of datasets containing the column
- numberUsing - will default to 1, and will be increased if other extensions start using this data

Config Options

As part of this, we will also need to introduce the notion of configuration preferences for Tracker in order to store the state of compliance checks for this instance.
We will create a new config dataset which will be a simple key-value store to keep track of this information. The goal would be to store additional config values in there as needed by future tracker features.
We will create a new config api handler to manage the configuration options.
Config keys will be in the form of <feature>.<option>. So for the compliance checks for the dictionary, the config key would be "dictionary.compliance" with a value of "true" or "false"

New Programmatic APIs

New REST APIs

Dictionary

We will need the following additional endpoints to support the data dictionary which will be added in a new handler class

Method

Endpoint

Description

Params

Sample Response

GET

/v1/dictionary

Returns the entire data dictionary for the namespace

none

On success, returns a 200.

{ "results" : [] }

{
  "results" : [
    {
      "columnName" : "testColumn1",
      "columnType" : "string",
      "isNullable" : true,
      "isPII" : false,
      "description" : "something something something"
    },
    {
      "columnName" : "testColumn2",
      "columnType" : "long",
      "isNullable" : false,
      "isPII" : true,
      "description" : "else else else"
    }
  ]
}

POST

/v1/dictionary

Returns the data dictionary related to the specified schema

payload - a schema containing a list of columns

On success, returns a 200.

{ "results" : [] }

{
  "results" : [
    {
      "columnName" : "testColumn1",
      "columnType" : "string",
      "isNullable" : true,
      "isPII" : false,
      "description" : "something something something"
    },
    {
      "columnName" : "testColumn2",
      "columnType" : "long",
      "isNullable" : false,
      "isPII" : true,
      "description" : "else else else"
    }
  ]
}

POST

/v1/dictionary/validate

Validates the given schema against the current data dictionary

payload - A json string containing the schema to validate

{ "status" : "OK", "results" : [] }

{
  "status" : "CONFLICT",
  "results" : [
    {
      "
      "columnName" : "nameOfColumnFromSchema",
      "expectedName" : "nameOfColumnFromDict",
      "columnType" : "string",
      "expectedType" : "string",
      "isNullable" : true,
      "expectedNullable" : true,
      "reason" : "The column case did not match the data dictionary."
    },
    {
      "columnName" : "nameOfColumnFromSchema",
      "reason" : "The column does not exist in the data dictionary."
    }
  ]
}

POST

/v1/dictionary/{column-name}

Add a column to the data dictionary

column-name - the name of the column to add (Case Sensitive) payload - a JSON string in the format:

{
  "columnType" : "String",
  "isNullable" : true,
  "isPII : false,
  "description" : "this is a description of the column"
}

On success, returns a 200.

PUT

/v1/dictionary/{column-name}

Update a row in the data dictionary

column-name - the name of the column to add (Case Sensitive)

payload - a JSON string in the format:

{
  "columnType" : "String",
  "isNullable" : true,
  "isPII : false,
  "description" : "this is a description of the column"
}

On success, returns a 200. If the field is not found, returns a 404.

DELETE /v1/dictionary/{column-name} Delete a column from the data dictionary column-name - the name of the column to delete (Case Sensitive) On success, returns a 200. If the field is not found, returns a 404. If the numberUsing > 1, return 409 conflict.

Configuration

Method

Endpoint

Description

Params

Sample Response

GET

/v1/config

Returns the entire tracker config as a key value map

none

{
  "config-key-1" : "config-value-1",
  "config-key-2" : "config-value-2"
}

If no values are set, an empty map is returned.

{}

GET

/v1/config/{config-key}

/v1/config/{config-key}?strict="true"

Returns the value for the given config key.

If strict is "true"

This will return a single configuration value with key strictly matching with config-key. This will be a faster operation for single key-value lookup

If strict is "false" or not provided
This will perform a row scan of the table, so partial values are also accepted.

config-key the key for the config to return

Query parameter "strict" which can be true or false

[{"config-key" : "config-value-1"}]

If multiple results are found:

[
  {"config-key-1" : "config-value-1"},
  {"config-key-2" : "config-value-3"}
]

Returns 404 if no config key was found.

[]

POST / PUT

/v1/config/{config-key}

Sets the value of the specified config key

config-key the key for the config to set payload the value for that config

{ "value" : "configValue" }

Returns 200 if the key was successfully set or updated Returns 400 and an error message if the key or value is invalid for some reason.

DELETE /v1/config/{config-key} Deleted the value of the specified config key config-key the key for the config value to delete Returns 200 if the key was deleted Returns 404 if the key was not found

UI Impact or Changes

Tracker will have a tab for data dictionary where users will be able to see and interact with the dictionary for that namespace.

Security Impact

Impact on Infrastructure Outages

Test Scenarios

Test ID	Test Description	Expected Results
1	Add new "column" to data dictionary	200
2	Add duplicate "column" to data dictionary	400
3	Delete an existing column from dictionary	200
4	Delete a non existing column from dictionary	400
5	Update column properties in dictionary with valid new properties	200
6	Update column properties in dictionary with invalid new properties	400

CDAP

Tracker Data Dictionary

Introduction

Goals

Use Cases

User Stories

Design

Dictionary

Config Options

New Programmatic APIs

New REST APIs

Dictionary

Configuration

UI Impact or Changes

Security Impact

Impact on Infrastructure Outages

Test Scenarios

Releases

Related Work

Future work