Moving Cask Extensions to CDAP System Namespace
Checklist
- User Stories Documented
- User Stories Reviewed
- Design Reviewed
- APIs reviewed
- Release priorities assigned
- Test cases reviewed
- Blog post
Introduction
Move CDAP application extensions such as Cask Tracker and Wrangler to the CDAP system namespace.
Goals
Currently, the CDAP applications for extensions such as Cask Tracker and Wrangler are created and run in each namespace in which the extension is enabled. Since
User Stories
- Breakdown of User-Stories
- User Story #1
- User Story #2
- User Story #3
Design
Cover details on assumptions made, design alternatives considered, high level design
Approach
Moving Cask-Tracker to system namespace.
The Tracker app contains two programs:
- AuditLogFlow
  - AuditLogConsumerFlowlet - Reads from the TMS audit topic and emits the payload string to the next stage. The read offset is persisted to a key-value table (default: _auditOffset).
  - AuditLogPublisher - Deserializes the payload into an "AuditMessage" (it currently filters on the current namespace; that check will have to be skipped) and writes the audit message to "AuditLogTable" (custom dataset), "AuditMetricsCube" (backed by a cube dataset), and "LatestEntityTable" (custom dataset).
- TrackerService
  - AuditLogHandler - Single endpoint, "query", which scans "AuditLogTable" based on the query parameters and responds.
  - AuditMetricsHandler - Uses "AuditMetricsCube" to handle queries for "Top N Applications/Datasets/Programs" and "Histogram". Uses "LatestEntityTable" for the time-since endpoint (need to look into what that means).
  - TrackerMeterHandler - Uses "AuditMetricsCube" and "LatestEntityTable" for the truth meter score.
  - AuditTagsHandler - Uses "AuditTagsTable" to store tags and to promote/demote tags through REST endpoints. For some of the REST service methods, it also talks to the metadata service directly, using ZooKeeper service discovery, to get metadata.
  - DataDictionaryHandler - Similar to AuditTagsHandler, but uses the "dataDictionary" table.
  - ConfigurationHandler - Stores, retrieves, and deletes configuration using a ConfigTable (key-value table).
Datasets
The Tracker app creates and uses the following datasets:
- AuditLogTable
- AuditMetricsCube
- LatestEntityTable
- AuditTagsTable
- DataDictionaryTable
- ConfigTable
For the flow logic, we will read the audit messages from all namespaces and persist them to the same datasets that are written to today. This change should be straightforward.
For the service, however, we have to namespace all the service endpoints. There might also be a corresponding change in the dataset key format to include the namespace information. This will require a UI change to pass along the namespace for which a particular query is made.
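One possible shape for a namespace-aware dataset key is sketched below. This is only an illustration, not the Tracker implementation: the class name, the `:` separator, and the `rowKey`/`scanPrefix` helpers are hypothetical, and the sketch assumes the separator never occurs inside a namespace name.

```java
public class NamespacedRowKeySketch {
    // Hypothetical separator; assumed never to occur inside a namespace name.
    private static final char SEPARATOR = ':';

    // Builds a row key that carries the source namespace as a prefix.
    static String rowKey(String namespace, String entityKey) {
        if (namespace == null || namespace.isEmpty()) {
            throw new IllegalArgumentException("namespace must not be empty");
        }
        if (namespace.indexOf(SEPARATOR) >= 0) {
            throw new IllegalArgumentException("namespace must not contain '" + SEPARATOR + "'");
        }
        return namespace + SEPARATOR + entityKey;
    }

    // Prefix that a scan would use to read only one namespace's rows.
    static String scanPrefix(String namespace) {
        return rowKey(namespace, "");
    }

    public static void main(String[] args) {
        System.out.println(rowKey("default", "myTag"));
        System.out.println(scanPrefix("default"));
    }
}
```

Including the separator in the scan prefix keeps a scan for namespace "ns" from also matching rows that belong to "ns1".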
Example:
Earlier, a user in the default namespace who wanted to get all the tags would hit the endpoint:
/v3/namespaces/default/apps/TrackerApp/services/TrackerService/methods/v1/tags
Now, since the service will run in the system namespace, the user will have to hit the system namespace endpoint and provide the source namespace as a query parameter:
/v3/namespaces/system/apps/TrackerApp/services/TrackerService/methods/v1/tags?namespace=default
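The difference between the two endpoints can be sketched as a pair of path builders. The class and method names here are hypothetical and exist only for illustration; the paths themselves are the ones shown above. `URLEncoder.encode(String, Charset)` requires Java 10 or later.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class TrackerEndpointSketch {
    private static final String METHOD_PATH =
        "/apps/TrackerApp/services/TrackerService/methods/v1/tags";

    // Old form: the service runs in the user's own namespace.
    static String legacyTagsPath(String namespace) {
        return "/v3/namespaces/" + namespace + METHOD_PATH;
    }

    // Proposed form: the service runs in the system namespace and the
    // source namespace is passed as the "namespace" query parameter.
    static String systemTagsPath(String sourceNamespace) {
        String encoded = URLEncoder.encode(sourceNamespace, StandardCharsets.UTF_8);
        return "/v3/namespaces/system" + METHOD_PATH + "?namespace=" + encoded;
    }

    public static void main(String[] args) {
        System.out.println(legacyTagsPath("default"));
        System.out.println(systemTagsPath("default"));
    }
}
```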
Open questions:
1) What is the expectation for upgrading older Tracker apps? If the user had enabled Tracker in, say, 3 namespaces with CDAP 4.3, then after upgrading, enabling Tracker will run it in the system namespace. How do we transfer the data from the older Tracker datasets in those 3 namespaces into the system namespace Tracker datasets?
Data Prep Extension:
- Connections are created using the Connection Service (connection-id creation should be improved to return a unique connection-id instead of using the name).
- The connection-id uniquely identifies a connection, and connections to sources such as GCS, HDFS, S3, and databases are made using source-specific service handlers. We do not expect any changes in the logic of those handlers.
- We will need to isolate connections by namespace, so that users do not see connections from other namespaces.
- The schema for a source is obtained from the respective source service endpoint. We do not expect any change here.
- Workspace data is stored in the workspace dataset, identified by workspace id (evaluate the id creation logic); it stores the data, transformations, etc. (verify this).
- The Directives service, which handles the workspace lifecycle, should handle namespaces and will need changes similar to those in the connection store. Listing all workspaces should list only the workspaces from a given namespace.
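The namespace isolation and unique-id points above could look roughly like the following in-memory sketch. It is not the Connection Service code: the class, its methods, and the use of `UUID` for the connection-id are assumptions made for illustration, and a real store would be backed by a dataset rather than a map.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.UUID;

public class NamespacedConnectionStoreSketch {
    // Hypothetical connection record; the real one carries source properties too.
    static final class Connection {
        final String namespace;
        final String id;
        final String name;

        Connection(String namespace, String id, String name) {
            this.namespace = namespace;
            this.id = id;
            this.name = name;
        }
    }

    // namespace -> (connection-id -> connection)
    private final Map<String, Map<String, Connection>> byNamespace = new LinkedHashMap<>();

    // Creates a connection with a generated id, so that two connections
    // sharing a name (even across namespaces) never collide.
    Connection create(String namespace, String name) {
        Connection c = new Connection(namespace, UUID.randomUUID().toString(), name);
        byNamespace.computeIfAbsent(namespace, ns -> new LinkedHashMap<>()).put(c.id, c);
        return c;
    }

    // Lists only the connections that belong to the given namespace.
    List<Connection> list(String namespace) {
        return new ArrayList<>(
            byNamespace.getOrDefault(namespace, Collections.emptyMap()).values());
    }

    // Looks up a connection by id, but only within the given namespace.
    Optional<Connection> get(String namespace, String id) {
        return Optional.ofNullable(
            byNamespace.getOrDefault(namespace, Collections.emptyMap()).get(id));
    }

    public static void main(String[] args) {
        NamespacedConnectionStoreSketch store = new NamespacedConnectionStoreSketch();
        Connection a = store.create("ns1", "mydb");
        store.create("ns2", "mydb");
        System.out.println(store.list("ns1").size());
        System.out.println(store.get("ns2", a.id).isPresent());
    }
}
```

The same lookup pattern would apply to listing workspaces in the Directives service, with the workspace id in place of the connection-id.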
Open questions
- What is the schema registry service, and how is it used?
- Directives service: should users be allowed to add custom directives? We might want to add namespace-level limitations after moving to the system namespace.
- What are the requirements in terms of authorization, and where will it be handled?
Approach #1
Approach #2
API changes
New Programmatic APIs
New Java APIs introduced (both user facing and internal)
Deprecated Programmatic APIs
New REST APIs
Path | Method | Description | Response Code | Response |
---|---|---|---|---|
/v3/apps/<app-id> | GET | Returns the application spec for a given application | 200 - On success 404 - When application is not available 500 - Any internal errors | |
Deprecated REST API
Path | Method | Description |
---|---|---|
/v3/apps/<app-id> | GET | Returns the application spec for a given application |
CLI Impact or Changes
- Impact #1
- Impact #2
- Impact #3
UI Impact or Changes
- Impact #1
- Impact #2
- Impact #3
Security Impact
What is the impact on authorization, and how does the design take care of this aspect?
Impact on Infrastructure Outages
System behavior (if applicable, document the impact of downstream component failures [YARN, HBase, etc.]) and how the design takes care of these aspects.
Test Scenarios
Test ID | Test Description | Expected Results |
---|---|---|
Releases
Release X.Y.Z
Release X.Y.Z
Related Work
- Work #1
- Work #2
- Work #3
Future work
Created in 2020 by Google Inc.