Tracker Data Dictionary

Checklist

User Stories Documented
User Stories Reviewed
Design Reviewed
APIs reviewed
Release priorities assigned
Test cases reviewed
Blog post

Introduction 

A data dictionary lets users define and describe columns that apply across all datasets in a namespace. It allows users to enforce a common naming convention and type for each column, and to indicate whether the column contains PII. Anyone creating or browsing datasets can then see and use this information in Tracker.

Goals

After the completion of this project, users will have a single point of governance for their data fields across a namespace.

Use Cases

  • A team would like to make sure that any new datasets they create use the same naming conventions for field names. For example, in all datasets, the accountId field is always spelled the same way (not accountID or account_id) and always contains the same type of data (always a nullable string, not an int or long). Using the data dictionary, the team sets the name, format, and definition of the column, and enables compliance checking. Now, anyone browsing the datasets available in the namespace through Tracker can see the definitions for all the columns. If a user finds a dataset that is out of compliance, they can take action to bring it into compliance.

  • A company has specific fields in its datasets that are sensitive, such as SSN or full name. It would like these columns to be tracked across all datasets so that users of the system know to treat them with a higher level of secrecy. By adding these columns to the data dictionary, the company can identify them across any dataset and make sure the data is kept safe. As new datasets are created with the same columns, the data dictionary is automatically applied and the columns are marked as PII. If desired, someone could write a script to automatically add tags to datasets containing PII columns and audit them to make sure only specific people have access to them.
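
Both use cases come down to comparing a dataset's schema against the dictionary. The following is a minimal Python sketch of that comparison, not the Tracker implementation. The names Definition, check_compliance, and pii_columns, and the shape of the schema argument, are hypothetical. It assumes lookups use the lowercase column name, so it catches case variants such as accountID; a variant such as account_id would need extra normalization that is not shown here.

```python
from typing import Dict, List, NamedTuple, Tuple


class Definition(NamedTuple):
    """Hypothetical in-memory form of one dictionary entry."""
    name: str       # canonical spelling, e.g. "accountId"
    type: str       # expected column type
    nullable: bool  # whether the column may be null
    pii: bool       # whether the column holds PII


def check_compliance(dictionary: Dict[str, Definition],
                     schema: Dict[str, Tuple[str, bool]]) -> List[str]:
    """Return readable violations for one dataset schema.

    `schema` maps field name -> (type, nullable). Fields that have no
    dictionary entry are ignored.
    """
    violations = []
    for field_name, (field_type, nullable) in schema.items():
        definition = dictionary.get(field_name.lower())
        if definition is None:
            continue
        if field_name != definition.name:
            violations.append(f"{field_name}: expected name {definition.name}")
        if field_type != definition.type:
            violations.append(f"{field_name}: expected type {definition.type}")
        if nullable != definition.nullable:
            violations.append(
                f"{field_name}: expected nullable={definition.nullable}")
    return violations


def pii_columns(dictionary: Dict[str, Definition],
                schema: Dict[str, Tuple[str, bool]]) -> List[str]:
    """Return the fields of `schema` that the dictionary marks as PII."""
    result = []
    for field_name in schema:
        definition = dictionary.get(field_name.lower())
        if definition is not None and definition.pii:
            result.append(field_name)
    return sorted(result)


dictionary = {
    "accountid": Definition("accountId", "string", True, False),
    "ssn": Definition("ssn", "string", False, True),
}
schema = {
    "accountID": ("long", True),
    "ssn": ("string", False),
    "amount": ("double", False),
}
violations = check_compliance(dictionary, schema)
pii = pii_columns(dictionary, schema)
print(violations)
print(pii)
```

A tagging or audit script like the one described in the second use case could run pii_columns over every dataset in the namespace and tag those with a non-empty result.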

User Stories 

Design

Dictionary

  • The data dictionary will be stored in a new custom dataset specific to a namespace and backed by a Table.

  • The dataset name will be _dataDictionary.

  • The schema of the dictionary table will be as follows:

    • rowKey - the column name, all lowercase

    • columnName - the column name with the case preserved

    • columnType - the type of the column

    • isNullable - is the column nullable or not

    • isPII - is the column PII or not

    • description - the description of the column

    • datasets - a list of datasets containing the column

    • numberUsing - will default to 1, and will be increased if other extensions start using this data
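
As an illustration of the schema above, one row could be modeled as follows. This is a sketch in Python rather than the actual Table-backed dataset. The field names come from the schema; the class name DictionaryEntry and the choice to derive rowKey from columnName are assumptions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DictionaryEntry:
    """One row of the _dataDictionary table (hypothetical model)."""
    columnName: str          # column name with the case preserved
    columnType: str          # type of the column
    isNullable: bool         # whether the column is nullable
    isPII: bool              # whether the column contains PII
    description: str         # description of the column
    datasets: List[str] = field(default_factory=list)  # datasets containing it
    numberUsing: int = 1     # defaults to 1 per the schema

    @property
    def rowKey(self) -> str:
        # The row key is the column name, all lowercase.
        return self.columnName.lower()


entry = DictionaryEntry(
    columnName="accountId",
    columnType="string",
    isNullable=True,
    isPII=False,
    description="Unique account identifier",
)
print(entry.rowKey)  # accountid
```

Deriving rowKey from columnName keeps the two from drifting apart and means that accountId and accountID map to the same row.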

Config Options

  • As part of this, we will also need to introduce the notion of configuration preferences for Tracker in order to store the state of compliance checks for this instance.

  • We will create a new config dataset, a simple key-value store, to keep track of this information. The goal is to store additional config values there as needed by future Tracker features.

  • We will create a new config API handler to manage the configuration options.

  • Config keys will be in the form <feature>.<option>. For the dictionary compliance checks, the config key would be "dictionary.compliance" with a value of "true" or "false".
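
The key-value store and key format described above could look roughly like this. It is a hedged Python sketch: the class ConfigStore, its method names, and the character set allowed in a key are assumptions. The document specifies only the <feature>.<option> shape and the "true"/"false" string values.

```python
import re
from typing import Dict, Optional

# Assumption: feature and option names are limited to letters, digits,
# and underscores, separated by exactly one dot.
_KEY_PATTERN = re.compile(r"^[A-Za-z0-9_]+\.[A-Za-z0-9_]+$")


class ConfigStore:
    """Hypothetical in-memory stand-in for the config dataset."""

    def __init__(self) -> None:
        self._values: Dict[str, str] = {}

    def set(self, key: str, value: str) -> None:
        # Reject keys that do not follow <feature>.<option>.
        if not _KEY_PATTERN.match(key):
            raise ValueError(f"config key must be <feature>.<option>: {key!r}")
        self._values[key] = value

    def get(self, key: str, default: Optional[str] = None) -> Optional[str]:
        return self._values.get(key, default)

    def is_enabled(self, key: str) -> bool:
        # Values are stored as the strings "true" or "false";
        # a missing key counts as disabled.
        return self._values.get(key) == "true"


store = ConfigStore()
store.set("dictionary.compliance", "true")
print(store.is_enabled("dictionary.compliance"))  # True
```

Treating a missing key as disabled means compliance checking stays off until someone turns it on explicitly.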

New Programmatic APIs

New REST APIs

Dictionary

  • We will need the following additional endpoints to support the data dictionary which will be added in a new handler class

Configuration


UI Impact or Changes

Tracker will have a tab for the data dictionary, where users will be able to see and interact with the dictionary for that namespace.

Security Impact 

 

Impact on Infrastructure Outages 

 

Test Scenarios

Test ID   Test Description                                                     Expected Results
1         Add new "column" to data dictionary                                  200
2         Add duplicate "column" to data dictionary                            400
3         Delete an existing column from dictionary                            200
4         Delete a non-existent column from dictionary                         400
5         Update column properties in dictionary with valid new properties     200
6         Update column properties in dictionary with invalid new properties   400
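
The six scenarios can be exercised against a small in-memory model. The sketch below is only an illustration of the expected results, not the handler class. InMemoryDictionary and its methods are hypothetical, and the rule for what counts as an "invalid" property (an unknown name or a value of the wrong type) is an assumption.

```python
from typing import Any, Dict

# Assumption: only these properties may be set, with these value types.
ALLOWED_PROPERTIES = {
    "columnType": str,
    "isNullable": bool,
    "isPII": bool,
    "description": str,
}


class InMemoryDictionary:
    """Hypothetical model returning the result codes from the test table."""

    def __init__(self) -> None:
        self._rows: Dict[str, Dict[str, Any]] = {}

    @staticmethod
    def _valid(properties: Dict[str, Any]) -> bool:
        return all(
            name in ALLOWED_PROPERTIES
            and isinstance(value, ALLOWED_PROPERTIES[name])
            for name, value in properties.items()
        )

    def add(self, column_name: str, properties: Dict[str, Any]) -> int:
        key = column_name.lower()  # row key is the lowercase column name
        if key in self._rows or not self._valid(properties):
            return 400
        self._rows[key] = {"columnName": column_name, **properties}
        return 200

    def delete(self, column_name: str) -> int:
        if self._rows.pop(column_name.lower(), None) is None:
            return 400
        return 200

    def update(self, column_name: str, properties: Dict[str, Any]) -> int:
        row = self._rows.get(column_name.lower())
        if row is None or not self._valid(properties):
            return 400
        row.update(properties)
        return 200


d = InMemoryDictionary()
results = [
    d.add("accountId", {"columnType": "string", "isNullable": True}),  # test 1
    d.add("accountID", {"columnType": "string"}),                      # test 2
    d.update("accountId", {"isPII": False}),                           # test 5
    d.update("accountId", {"isPII": "yes"}),                           # test 6
    d.delete("accountId"),                                             # test 3
    d.delete("accountId"),                                             # test 4
]
print(results)  # [200, 400, 200, 400, 200, 400]
```

Because the row key is lowercase, adding accountID after accountId counts as a duplicate in this model.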

Releases

Related Work

 

Future work

Comments

Created in 2020 by Google Inc.