Tracker Data Dictionary
- 1 Introduction
- 2 Goals
- 2.1 Use Cases
- 3 User Stories
- 4 Design
- 4.1 Dictionary
- 4.2 Config Options
- 5 New Programmatic APIs
- 5.1 New REST APIs
- 5.1.1 Dictionary
- 5.1.2 Configuration
- 5.2 UI Impact or Changes
- 6 Security Impact
- 7 Impact on Infrastructure Outages
- 8 Test Scenarios
- 9 Releases
- 10 Related Work
- 11 Future work
Checklist
Introduction
A data dictionary lets users define and describe columns that apply across all datasets in a namespace. With it, users can enforce a common naming convention and type for each column and indicate whether the column contains PII. Anyone creating or browsing datasets can then see and use this information in Tracker.
Goals
After the completion of this project, users will have a single point of governance for their data fields across a namespace.
Use Cases
A team would like to make sure that any new datasets they create use the same naming conventions for field names. For example, in all datasets, accountId is always spelled the same way (not accountID or account_id) and always contains the same type of data (always a nullable string, not an int or long). Using the data dictionary, the team sets the name, format, and definition of the column, and enables compliance checking. Now, anyone browsing the datasets available in the namespace through Tracker can see the definitions for all the columns. A user who finds a dataset that is out of compliance can take action to bring it into compliance.
A company has specific fields in its datasets that are sensitive, such as SSN or full name. The company would like these columns to be tracked across all datasets so that users of the system know to treat them with a higher level of secrecy. By adding these columns to the data dictionary, the company can identify them across any dataset and make sure the data is kept safe. As new datasets are created with the same columns, the data dictionary is automatically applied and the columns are marked as PII. If desired, someone could write a script that automatically adds tags to datasets containing PII columns and audits them to make sure only specific people have access.
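Both use cases come down to comparing a dataset's schema against the dictionary. The following is a minimal sketch and not Tracker code: the `DictionaryEntry` type, the `(type, nullable)` schema shape, and both function names are hypothetical, and the dictionary is modeled as an in-memory mapping keyed by the lowercased column name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DictionaryEntry:
    """Hypothetical in-memory form of one dictionary definition."""
    column_name: str   # canonical spelling, e.g. "accountId"
    column_type: str   # e.g. "string"
    is_nullable: bool
    is_pii: bool


def find_violations(schema, dictionary):
    """Return readable problems for columns that deviate from the dictionary.

    schema: maps column name -> (type, nullable) for one dataset.
    dictionary: maps lowercased column name -> DictionaryEntry.
    """
    problems = []
    for name, (col_type, nullable) in schema.items():
        entry = dictionary.get(name.lower())
        if entry is None:
            continue  # columns absent from the dictionary are not checked
        if name != entry.column_name:
            problems.append(f"{name}: expected name {entry.column_name}")
        if col_type != entry.column_type:
            problems.append(f"{name}: expected type {entry.column_type}")
        if nullable != entry.is_nullable:
            problems.append(f"{name}: expected nullable={entry.is_nullable}")
    return problems


def pii_columns(schema, dictionary):
    """Return the dataset's columns that the dictionary marks as PII."""
    return sorted(
        name for name in schema
        if name.lower() in dictionary and dictionary[name.lower()].is_pii
    )


dictionary = {
    "accountid": DictionaryEntry("accountId", "string", True, False),
    "ssn": DictionaryEntry("ssn", "string", False, True),
}
schema = {"accountID": ("long", True), "ssn": ("string", False)}

violations = find_violations(schema, dictionary)
flagged = pii_columns(schema, dictionary)
```

In this sketch a case variant such as accountID is caught because lookup is case-insensitive, whereas a variant such as account_id lowercases to a different key and would need separate handling.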
User Stories
Design
Dictionary
The data dictionary will be stored in a new custom dataset specific to a namespace and backed by a Table.
The dataset name will be _dataDictionary.
The schema of the dictionary table will be as follows:
rowKey - the column name, all lowercase
columnName - the column name with the case preserved
columnType - the type of the column
isNullable - whether the column is nullable
isPII - whether the column contains PII
description - the description of the column
datasets - a list of datasets containing the column
numberUsing - defaults to 1 and is incremented if other extensions start using this data
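As an illustration of the schema above, one dictionary row could be modeled as follows. This is a sketch under assumptions: the `DictionaryRow` class and its Python field names are hypothetical, the types are inferred from the field descriptions, and the actual Table storage layout is not shown.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DictionaryRow:
    """Hypothetical in-memory model of one row in the dictionary table."""
    column_name: str                  # columnName, case preserved
    column_type: str                  # columnType
    is_nullable: bool                 # isNullable
    is_pii: bool                      # isPII
    description: str
    datasets: List[str] = field(default_factory=list)
    number_using: int = 1             # numberUsing defaults to 1

    @property
    def row_key(self) -> str:
        # rowKey is the column name in all lowercase.
        return self.column_name.lower()


row = DictionaryRow(
    column_name="accountId",
    column_type="string",
    is_nullable=True,
    is_pii=False,
    description="Unique account identifier",
    datasets=["purchases"],
)
variant = DictionaryRow("accountID", "string", True, False, "Same column, different case")

# Case variants of a name map to the same row key, so they share one entry.
same_key = row.row_key == variant.row_key
```

Keeping the original spelling in columnName while keying on the lowercase form means a lookup succeeds for any casing, and the preferred casing is still available for display and compliance checks.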
Config Options
As part of this work, we will also introduce configuration preferences for Tracker, which will store whether compliance checks are enabled for this instance.
We will create a new config dataset, a simple key-value store, to keep track of this information. Future Tracker features can store additional config values there as needed.
We will create a new config API handler to manage the configuration options.
Config keys will take the form <feature>.<option>. For the dictionary compliance checks, the config key will be "dictionary.compliance" with a value of "true" or "false".
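The key convention can be sketched with a small in-memory key-value store. Everything here is illustrative: `ConfigStore`, `config_key`, and `compliance_enabled` are hypothetical names, the rule that feature and option names are alphanumeric is an assumption, and treating an unset key as "false" is also an assumption that the design does not state.

```python
import re

# Assumption: feature and option names contain only letters and digits.
_KEY_PATTERN = re.compile(r"^[A-Za-z0-9]+\.[A-Za-z0-9]+$")


def config_key(feature, option):
    """Build a key of the form <feature>.<option>, rejecting malformed parts."""
    key = f"{feature}.{option}"
    if not _KEY_PATTERN.match(key):
        raise ValueError(f"invalid config key: {key!r}")
    return key


class ConfigStore:
    """Hypothetical stand-in for the config dataset: string keys, string values."""

    def __init__(self):
        self._values = {}

    def set(self, key, value):
        self._values[key] = value

    def get(self, key, default=None):
        return self._values.get(key, default)


def compliance_enabled(store):
    # Assumption: a missing key means compliance checks are disabled.
    return store.get(config_key("dictionary", "compliance"), "false") == "true"


store = ConfigStore()
before = compliance_enabled(store)
store.set(config_key("dictionary", "compliance"), "true")
after = compliance_enabled(store)
```

Storing the value as the string "true" or "false" keeps the store a plain string-to-string map, so later features can add keys without changing the dataset's shape.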
New Programmatic APIs
New REST APIs
Dictionary
We will need the following additional endpoints to support the data dictionary, which will be added in a new handler class.
Configuration
UI Impact or Changes
Tracker will have a data dictionary tab where users can see and interact with the dictionary for the namespace.
Security Impact
Impact on Infrastructure Outages
Test Scenarios
| Test ID | Test Description | Expected Result (HTTP status) |
|---|---|---|
| 1 | Add a new column to the data dictionary | 200 |
| 2 | Add a duplicate column to the data dictionary | 400 |
| 3 | Delete an existing column from the dictionary | 200 |
| 4 | Delete a nonexistent column from the dictionary | 400 |
| 5 | Update column properties in the dictionary with valid new properties | 200 |
| 6 | Update column properties in the dictionary with invalid new properties | 400 |
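The expected results in the table can be exercised against a small in-memory model. This is a sketch, not the planned handler: `InMemoryDictionary` and `VALID_TYPES` are hypothetical, the design does not define what makes properties invalid (here, an unrecognized columnType), and treating names that differ only by case as duplicates is inferred from the lowercase row key.

```python
# Assumption: the design does not list the recognized column types.
VALID_TYPES = {"string", "int", "long", "boolean", "float", "double", "bytes"}


class InMemoryDictionary:
    """Hypothetical model that returns the status codes from the test table."""

    def __init__(self):
        self._rows = {}

    @staticmethod
    def _valid(properties, require_type):
        # Assumption: "invalid" means an unrecognized columnType.
        if "columnType" not in properties:
            return not require_type
        return properties["columnType"] in VALID_TYPES

    def add(self, column_name, properties):
        key = column_name.lower()
        if key in self._rows or not self._valid(properties, require_type=True):
            return 400
        self._rows[key] = dict(properties, columnName=column_name)
        return 200

    def delete(self, column_name):
        if self._rows.pop(column_name.lower(), None) is None:
            return 400
        return 200

    def update(self, column_name, properties):
        key = column_name.lower()
        # Assumption: updating an unknown column is also rejected with 400.
        if key not in self._rows or not self._valid(properties, require_type=False):
            return 400
        self._rows[key].update(properties)
        return 200


d = InMemoryDictionary()
results = [
    d.add("accountId", {"columnType": "string"}),   # test 1
    d.add("accountID", {"columnType": "string"}),   # test 2: same lowercase key
    d.delete("accountId"),                          # test 3
    d.delete("accountId"),                          # test 4: already deleted
]
d.add("ssn", {"columnType": "string", "isPII": True})
results += [
    d.update("ssn", {"columnType": "long"}),        # test 5
    d.update("ssn", {"columnType": "not-a-type"}),  # test 6
]
```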
Releases
Related Work