Moving Cask Extensions to CDAP System Namespace

Checklist

  • User Stories Documented
  • User Stories Reviewed
  • Design Reviewed
  • APIs reviewed
  • Release priorities assigned
  • Test cases reviewed
  • Blog post

Introduction 

Move CDAP application extensions such as Cask Tracker and Wrangler to the CDAP system namespace.

Goals

Currently, the CDAP applications for extensions such as Cask Tracker and Wrangler are created and run in the namespace in which they are enabled. Since applications are supported in the system namespace, we want to move these extensions to the system namespace and serve requests from all user namespaces. This reduces the overhead and resource footprint of the extensions.

User Stories 

  • Breakdown of User-Stories 
  • User Story #1
  • User Story #2
  • User Story #3

Design

Cover details on assumptions made, design alternatives considered, high level design

Approach

Moving Cask-Tracker to system namespace.

The Tracker app contains two programs:

  • AuditLogFlow
    • AuditLogConsumerFlowlet - Reads from the TMS audit topic and emits the payload string to the next stage. The read offset is persisted to a key-value table (default: _auditOffset).
    • AuditLogPublisher - Deserializes the payload into an "AuditMessage" (it currently filters on the current namespace; that check will have to be skipped) and writes the audit message to "AuditLogTable" (custom dataset), "AuditMetricsCube" (backed by a cube dataset), and "LatestEntityTable" (custom dataset).

  • TrackerService
    • AuditLogHandler - Single endpoint "query"; scans "AuditLogTable" based on the query params and responds.
    • AuditMetricsHandler - Uses "AuditMetricsCube" to handle queries for "Top N Applications/Datasets/Programs" and "Histogram". Uses "LatestEntityTable" for the time-since endpoint (need to look into what that means).
    • TrackerMeterHandler - Uses "AuditMetricsCube" and "LatestEntityTable" for the truth meter score.
    • AuditTagsHandler - Uses "AuditTagsTable" to store tags and to promote/demote tags through REST endpoints. For some of the REST service methods, it also talks to the metadata service directly, using ZooKeeper service discovery, to get metadata.
    • DataDictionaryHandler - Similar to AuditTagsHandler, but uses the "dataDictionary" table.
    • ConfigurationHandler - Stores, retrieves, and deletes config using a ConfigTable (key-value table).


  Datasets 

   The Tracker app creates/uses six datasets:

  • AuditLogTable
  • AuditMetricsCube  
  • LatestEntityTable

  • AuditTagsTable
  • DataDictionaryTable
  • ConfigTable

In terms of the Flow logic, we will read the audit messages from all namespaces and persist them to the datasets they are currently written to. This change will be straightforward.
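The Flow change can be illustrated with a minimal sketch. It is not the Tracker code: the message fields ("namespace", "entity", "type"), the function name persist_audit_messages, and the in-memory dictionary standing in for "AuditLogTable" are all assumptions made for illustration. Passing a namespace_filter mimics the current per-namespace behavior; leaving it unset mimics the proposed system-namespace behavior.

```python
import json
from collections import defaultdict
from typing import Dict, Iterable, List, Optional


def persist_audit_messages(
    payloads: Iterable[str],
    audit_log: Dict[str, List[dict]],
    namespace_filter: Optional[str] = None,
) -> int:
    """Deserialize audit payloads and store them grouped by namespace.

    namespace_filter is hypothetical: when set, messages from other
    namespaces are skipped (current behavior); when None, messages from
    every namespace are persisted (proposed behavior).
    Returns the number of messages written.
    """
    written = 0
    for payload in payloads:
        message = json.loads(payload)
        namespace = message["namespace"]  # assumed field name
        if namespace_filter is not None and namespace != namespace_filter:
            continue
        audit_log[namespace].append(message)
        written += 1
    return written


# Usage: two messages from two different namespaces.
payloads = [
    json.dumps({"namespace": "default", "entity": "dataset1", "type": "ACCESS"}),
    json.dumps({"namespace": "dev", "entity": "stream1", "type": "CREATE"}),
]
all_namespaces_log = defaultdict(list)
persist_audit_messages(payloads, all_namespaces_log)
```

Because every stored message still carries its namespace, the service side can later restrict results to the namespace a query came from.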

For the Service, however, we have to namespace all the service endpoints. There might also be a corresponding change in the dataset key format to include the namespace information. This will require a change in the UI to pass along the namespace from which a particular query originates.
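One possible shape for a namespaced dataset key is sketched below. The design does not fix the key format; the ":" separator, the helper names make_key and scan_namespace, and the dictionary standing in for a table are assumptions. The sketch also assumes that namespace names can never contain the separator.

```python
from typing import Dict, List

KEY_SEPARATOR = ":"  # assumption: never appears in a namespace name


def make_key(namespace: str, row_key: str) -> str:
    """Prefix an existing row key with the namespace it belongs to."""
    if KEY_SEPARATOR in namespace:
        raise ValueError("namespace must not contain the key separator")
    return f"{namespace}{KEY_SEPARATOR}{row_key}"


def scan_namespace(table: Dict[str, str], namespace: str) -> List[str]:
    """Return the values of all rows that belong to one namespace."""
    prefix = namespace + KEY_SEPARATOR
    return [value for key, value in sorted(table.items()) if key.startswith(prefix)]


# Usage: rows from three namespaces share one table.
tags_table = {
    make_key("default", "tag1"): "pii",
    make_key("default", "tag2"): "finance",
    make_key("default2", "tag1"): "marketing",
    make_key("dev", "tag1"): "scratch",
}
default_tags = scan_namespace(tags_table, "default")
```

Including the separator in the scan prefix keeps a namespace such as "default2" from matching a scan for "default".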

Example:

Earlier, a user in the default namespace who wanted to get all the tags would hit the endpoint

/v3/namespaces/default/apps/TrackerApp/services/TrackerService/methods/v1/tags


However, since the service will now run in the system namespace, the user will have to hit the system namespace endpoint and provide the source namespace as a query param

/v3/namespaces/system/apps/TrackerApp/services/TrackerService/methods/v1/tags?namespace=default
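The mapping between the two endpoints above can be sketched as a small helper, for example for the UI layer. The function name to_system_endpoint is hypothetical; only the two URL shapes come from this page.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit


def to_system_endpoint(old_url: str) -> str:
    """Rewrite a per-namespace service URL to its system-namespace form.

    The original namespace is moved from the path into a "namespace"
    query param, after any query params that were already present.
    """
    parts = urlsplit(old_url)
    segments = parts.path.split("/")
    # Expected path shape: /v3/namespaces/<namespace>/apps/...
    if len(segments) < 5 or segments[1:3] != ["v3", "namespaces"]:
        raise ValueError(f"unexpected path: {parts.path}")
    source_namespace = segments[3]
    segments[3] = "system"
    query = parse_qsl(parts.query) + [("namespace", source_namespace)]
    return "/".join(segments) + "?" + urlencode(query)


new_url = to_system_endpoint(
    "/v3/namespaces/default/apps/TrackerApp/services/TrackerService/methods/v1/tags"
)
```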

Open questions:

1) What is the expectation for upgrading older Tracker apps? If the user had enabled Tracker in, say, 3 namespaces with CDAP 4.3, then after upgrading, when they enable Tracker, it will run in the system namespace. How do we transfer the data from the older Tracker datasets in those 3 namespaces into the system namespace Tracker datasets?


Data Prep Extension:


  • Connections are created using the Connection Service. (We should improve connection-id creation to return a unique connection-id instead of using the name.)
  • The connection-id is used to uniquely identify a connection, and connections to sources (GCS, HDFS, S3, Database, etc.) are made using source-specific service handlers. We do not expect any changes in the logic of those handlers.
  • We will need to isolate connections by namespace, so that users don't see connections from other namespaces.
  • The schema for a source is obtained from the respective source service endpoint. We do not expect any change here.
  • Workspace data is stored in the workspace dataset, identified by workspace id (evaluate the id creation logic), which stores the data, transformations, etc. (verify this).
  • The Directives service, which handles the workspace lifecycle, should handle namespaces and will need changes similar to the connection store. Listing all workspaces should only list the workspaces from a given namespace.
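The namespace isolation described in the list above might look like the following in-memory sketch. It is not the Data Prep connection store: the ConnectionStore class, its method names, and the use of a random UUID as the connection-id are assumptions chosen to show the idea of unique ids plus per-namespace listing.

```python
import uuid
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Connection:
    connection_id: str
    namespace: str
    name: str
    connection_type: str  # e.g. "GCS", "HDFS", "S3", "DATABASE"


class ConnectionStore:
    """Hypothetical store keyed by (namespace, connection-id)."""

    def __init__(self) -> None:
        self._connections: Dict[Tuple[str, str], Connection] = {}

    def create(self, namespace: str, name: str, connection_type: str) -> Connection:
        # A generated id (assumption: random UUID) instead of the name, so
        # two namespaces can hold connections with the same name.
        connection = Connection(uuid.uuid4().hex, namespace, name, connection_type)
        self._connections[(namespace, connection.connection_id)] = connection
        return connection

    def get(self, namespace: str, connection_id: str) -> Optional[Connection]:
        # A lookup from the wrong namespace finds nothing.
        return self._connections.get((namespace, connection_id))

    def list(self, namespace: str) -> List[Connection]:
        # Only connections created in the given namespace are visible.
        return [c for (ns, _), c in self._connections.items() if ns == namespace]


store = ConnectionStore()
sales = store.create("default", "sales-db", "DATABASE")
store.create("dev", "sales-db", "DATABASE")
visible_in_default = store.list("default")
```

The same keying scheme could apply to workspaces, so that listing workspaces returns only those of the requesting namespace.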



Open questions 

  1. What is the schema registry service and how is it used?
  2. Directives service - allowing users to add custom directives? We might want to add namespace-level limitations after moving to the system namespace.
  3. What is the requirement in terms of authorization, and where will it be handled?


Approach #1

Approach #2

API changes

New Programmatic APIs

New Java APIs introduced (both user facing and internal)

Deprecated Programmatic APIs

New REST APIs

Path | Method | Description | Response Code | Response
/v3/apps/<app-id> | GET | Returns the application spec for a given application | 200 - On success; 404 - When application is not available; 500 - Any internal errors |







Deprecated REST API

Path | Method | Description
/v3/apps/<app-id> | GET | Returns the application spec for a given application

CLI Impact or Changes

  • Impact #1
  • Impact #2
  • Impact #3

UI Impact or Changes

  • Impact #1
  • Impact #2
  • Impact #3

Security Impact 

What is the impact on authorization, and how does the design take care of this aspect?

Impact on Infrastructure Outages 

System behavior (if applicable, document the impact of downstream component failures such as YARN, HBase, etc.) and how the design takes care of these aspects.

Test Scenarios

Test ID | Test Description | Expected Results












Releases

Release X.Y.Z

Release X.Y.Z

Related Work

  • Work #1
  • Work #2
  • Work #3


Future work

Created in 2020 by Google Inc.