Checklist

User Stories Documented
User Stories Reviewed
Design Reviewed
APIs reviewed
Release priorities assigned
Test cases reviewed
Blog post

Introduction

Move CDAP application extensions, such as Cask Tracker and Wrangler, to the CDAP system namespace.

Goals

Currently, the CDAP applications for extensions such as Cask Tracker and Wrangler are created and run in the namespace in which the extension is enabled. Since applications are supported in the system namespace, we want to move these extensions to the system namespace and have them serve requests from all user namespaces. This reduces the overhead and resource footprint of the extensions.

User Stories

Breakdown of User-Stories
User Story #1
User Story #2
User Story #3

Design

Cover details on assumptions made, design alternatives considered, high level design

Approach

Moving Cask Tracker to the system namespace

The Tracker app contains two programs:

AuditLogFlow
- AuditLogConsumerFlowlet - Reads from the TMS audit topic and emits the payload string to the next stage. The read offset is persisted to a key-value table (default: _auditOffset).
- AuditLogPublisher - Deserializes the payload into an "AuditMessage" (it currently filters on the current namespace; that check will have to be skipped) and writes the audit message to "AuditLogTable" (custom dataset), "AuditMetricsCube" (backed by a Cube dataset), and "LatestEntityTable" (custom dataset).
TrackerService
- AuditLogHandler - Single endpoint, "query", which scans "AuditLogTable" based on the query parameters and returns the results.
- AuditMetricsHandler - Uses "AuditMetricsCube" to handle queries for "Top N Applications/Datasets/Programs" and "Histogram". Uses "LatestEntityTable" for the time-since endpoint (need to look into what that means).
- TrackerMeterHandler - Uses "AuditMetricsCube" and "LatestEntityTable" for the truth meter score.
- AuditTagsHandler - Uses "AuditTagsTable" to store tags and to promote/demote tags through REST endpoints. For some of the REST service methods, it also talks to the metadata service directly, located through ZooKeeper service discovery, to get metadata.
- DataDictionaryHandler - Similar to AuditTagsHandler, but uses the "dataDictionary" table.
- ConfigurationHandler - Stores, retrieves, and deletes configuration using a ConfigTable (key-value table).
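The change to AuditLogPublisher described above (skipping the current-namespace check) can be sketched as follows. This is a minimal illustration, not the Tracker code: the JSON message shape with a top-level "namespace" field, the `publish` function, and the in-memory list standing in for "AuditLogTable" are all assumptions made for the example.

```python
import json


def publish(payload, audit_log, only_namespace=None):
    """Deserialize one audit payload and append it to audit_log.

    only_namespace models the current per-namespace filter; passing None
    models the system-namespace behavior, where every message is kept.
    The message layout used here is hypothetical.
    """
    message = json.loads(payload)
    if only_namespace is not None and message["namespace"] != only_namespace:
        return False  # current behavior: drop messages from other namespaces
    audit_log.append(message)
    return True


payloads = [
    json.dumps({"namespace": "default", "entity": "dataset:purchases", "type": "ACCESS"}),
    json.dumps({"namespace": "sales", "entity": "dataset:orders", "type": "ACCESS"}),
]

per_namespace_log = []
for p in payloads:
    publish(p, per_namespace_log, only_namespace="default")

system_log = []
for p in payloads:
    publish(p, system_log)
```

In this sketch the namespace stays on each stored message, which is what lets the service side answer namespace-scoped queries later.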

Datasets

The Tracker app creates and uses 6 datasets:

AuditLogTable
AuditMetricsCube
LatestEntityTable
AuditTagsTable
DataDictionaryTable
ConfigTable

In terms of the Flow logic, we will read the audit messages from all namespaces and persist them to the datasets they are currently written to. This change will be straightforward.

For the Service, however, we have to namespace all the service endpoints. There might also be a corresponding change in the dataset key format to include the namespace information. This will require a change in the UI to pass along the namespace from which a particular query originates.
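One possible shape for a namespace-aware dataset key is a namespace prefix followed by a separator, so that a prefix scan returns only one namespace's rows. The sketch below is illustrative only: the separator choice, the function names, and the dict standing in for a table are assumptions, not the existing Tracker key format.

```python
# Hypothetical separator; a byte that cannot appear in a namespace name
# keeps "default" from matching keys that belong to "default2".
SEPARATOR = "\x00"


def make_row_key(namespace, entity_key):
    """Build a row key that carries the source namespace as a prefix."""
    if SEPARATOR in namespace:
        raise ValueError("namespace must not contain the separator")
    return namespace + SEPARATOR + entity_key


def scan_namespace(table, namespace):
    """Return the rows of one namespace, keyed by the original entity key."""
    prefix = namespace + SEPARATOR
    return {
        key[len(prefix):]: value
        for key, value in sorted(table.items())
        if key.startswith(prefix)
    }


table = {
    make_row_key("default", "tag:pii"): 3,
    make_row_key("default2", "tag:pii"): 7,
    make_row_key("sales", "tag:finance"): 1,
}
default_rows = scan_namespace(table, "default")
```

A real table scan would use start and stop row keys rather than filtering every entry; the filter here only keeps the example short.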

Example:

Previously, a user in the default namespace who wanted to get all the tags would hit the endpoint

/v3/namespaces/default/apps/TrackerApp/services/TrackerService/methods/v1/tags

Now that the service runs in the system namespace, the user will have to hit the following endpoint and provide the source namespace as a query parameter

/v3/namespaces/system/apps/TrackerApp/services/TrackerService/methods/v1/tags?namespace=default
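The mapping between the two endpoints above is mechanical, so a client such as the UI could derive the new path from the old one. The helper below is a sketch under that assumption; `to_system_endpoint` is a hypothetical name, and only the two example paths shown above come from this design.

```python
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Matches /v3/namespaces/<namespace>/apps/... and captures both parts.
_NAMESPACED_APP_PATH = re.compile(r"^/v3/namespaces/(?P<ns>[^/]+)/(?P<rest>apps/.+)$")


def to_system_endpoint(path):
    """Rewrite a per-namespace service path to its system-namespace form.

    The original namespace is moved into a "namespace" query parameter;
    any query parameters already present are preserved.
    """
    parts = urlsplit(path)
    match = _NAMESPACED_APP_PATH.match(parts.path)
    if match is None:
        raise ValueError("not a namespaced app path: " + path)
    query = parse_qsl(parts.query)
    query.append(("namespace", match.group("ns")))
    new_path = "/v3/namespaces/system/" + match.group("rest")
    return urlunsplit(("", "", new_path, urlencode(query), ""))


old = "/v3/namespaces/default/apps/TrackerApp/services/TrackerService/methods/v1/tags"
new = to_system_endpoint(old)
```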

Open questions:

1) What is the expectation for upgrading older Tracker apps? If a user had enabled Tracker in, say, 3 namespaces with CDAP 4.3, then enabling Tracker after the upgrade will run it in the system namespace. How do we transfer the data from the older Tracker datasets in those 3 namespaces into the Tracker datasets in the system namespace?

Data Prep Extension:

Connections are created using the Connection Service (connection-id creation should be improved to return a unique connection-id instead of using the name).
The connection-id uniquely identifies a connection, and connections to sources (such as GCS, HDFS, S3, and databases) are made using source-specific service handlers. We do not expect any changes in the logic of those handlers.
We will need to isolate connections by namespace, so that users do not see connections from other namespaces.
The schema for a source is obtained from the respective source service endpoint. We do not expect any change here.
Workspace data is stored in the workspace dataset, identified by workspace id (evaluate the id creation logic), which stores the data, the transformations, etc. (verify this).
The Directives service, which handles the workspace lifecycle, should handle namespaces and will need changes similar to the connection store. Listing all workspaces should list only the workspaces in the given namespace.
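The two connection-store points above (unique connection-ids and per-namespace isolation) can be sketched together. This is an in-memory illustration under assumed names: `ConnectionStore` and its methods are hypothetical and do not reflect the existing Data Prep classes, and using a UUID for the id is one possible choice rather than a decided one.

```python
import uuid


class ConnectionStore:
    """Toy store that keys every connection by (namespace, connection_id)."""

    def __init__(self):
        self._connections = {}

    def create(self, namespace, name, connection_type):
        # A generated id stays unique even when two connections share a name.
        connection_id = uuid.uuid4().hex
        self._connections[(namespace, connection_id)] = {
            "id": connection_id,
            "name": name,
            "type": connection_type,
        }
        return connection_id

    def get(self, namespace, connection_id):
        # Raises KeyError when the id belongs to a different namespace.
        return self._connections[(namespace, connection_id)]

    def list(self, namespace):
        return [
            connection
            for (ns, _), connection in self._connections.items()
            if ns == namespace
        ]


store = ConnectionStore()
first = store.create("default", "warehouse", "Database")
second = store.create("sales", "warehouse", "Database")
```

The same keying could apply to workspaces, so that listing workspaces returns only those of the requested namespace.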

Open questions

What is the schema registry service, and how is it used?
Directives service: should users be allowed to add custom directives? We might want to add namespace-level limitations after moving to the system namespace.
What are the requirements in terms of authorization, and where will it be handled?

Approach #1

Approach #2

API changes

New Programmatic APIs

New Java APIs introduced (both user facing and internal)

Deprecated Programmatic APIs

New REST APIs

Path	Method	Description	Response Code	Response
/v3/apps/<app-id>	GET	Returns the application spec for a given application	200 - On success; 404 - When application is not available; 500 - Any internal errors	

Deprecated REST API

Path	Method	Description
/v3/apps/<app-id>	GET	Returns the application spec for a given application

CLI Impact or Changes

Impact #1
Impact #2
Impact #3

UI Impact or Changes

Impact #1
Impact #2
Impact #3

Security Impact

What's the impact on Authorization and how does the design take care of this aspect

Impact on Infrastructure Outages

System behavior (if applicable, document the impact of downstream component failures [YARN, HBase, etc.]) and how the design takes care of these aspects

Test Scenarios

Test ID	Test Description	Expected Results
