Schema on Read with Wrangler Directives - WIP
Checklist
- User Stories Documented
- User Stories Reviewed
- Design Reviewed
- APIs reviewed
- Release priorities assigned
- Test cases reviewed
- Blog post
Introduction
CDAP users may already have data in HDFS or HBase. Today, the only ways to bring that data into CDAP are to build a data pipeline or a CDAP application that re-processes it and writes it to CDAP datasets for further analysis. Letting users work with existing data without re-processing it would improve the user experience and give customers with large amounts of legacy data an easier on-ramp.
Goals
- Ease of adoption: Allow users to leverage their existing data in HDFS or HBase without having to re-process the data
- Usability: Create datasets from existing data in HDFS or HBase and provide a great user-experience.
User Stories
- As a user, I would like to create a dataset from existing data on HDFS (or HBase)
- As a user, I would like to apply a schema to the dataset that is created from existing data on HDFS (or HBase)
- As a user, I would like to apply transformations to existing data on HDFS (or HBase) to derive data with a pre-defined schema
- As a user, I would like to run explore queries on the dataset that is created from existing data on HDFS
- As a user, I would like to use the dataset as a source in data pipelines.
Design
- The custom InputFormat needs to depend on Wrangler to execute the directives. However, the explore classpath currently does not include custom InputFormat and SerDe classes from a CDAP application, so even if an application specifies a custom SerDe and InputFormat, they are not available to explore. In addition, because the custom InputFormat must be on the explore classpath, Hive queries cannot be run against that table from outside CDAP.
- This feature is more than schema on read or a view over HDFS data. It also wrangles each record read by a MapReduce or Spark job. Using Wrangler in a custom InputFormat means the directives are applied to every record read from a file, which can be computationally heavy, for example when a directive looks up a dataset, computes a string distance, or makes a REST call. These patterns are not recommended as best practices.
- Wrangler has a plugin context, so it can instantiate datasets. Inside a custom dataset, however, we cannot instantiate another dataset, so directives that access a dataset would not be supported.
- Wrangler depends on Guava 19 because it relies on the external `org.simmetrics` library, which depends on Guava 19, to calculate string distance. Adding Wrangler to the explore classpath could cause a version conflict.
- Wrangler now supports User Defined Directives. Each directive is a plugin, so these plugins would have to be loaded inside an InputFormat.
- Wrangler is a monolithic application, which makes it very hard to separate the directives that are compatible with the custom InputFormat in CDAP from those that are not.
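The per-record cost and the dataset restriction described above can be illustrated with a minimal sketch. It is not the Wrangler or CDAP API: `Directive`, `WranglingIterator`, `parseCsv`, `uppercase`, and `needsDataset` are hypothetical names chosen for illustration, and a plain iterator over lines stands in for a Hadoop record reader. The assumption is only that every directive runs once per record and that dataset-backed directives are rejected up front.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: Directive and WranglingIterator are hypothetical
// names, not part of the Wrangler or CDAP APIs.
final class WranglingReaderSketch {

  /** Hypothetical stand-in for a wrangler directive. */
  interface Directive {
    Map<String, Object> apply(Map<String, Object> row);

    /** True if the directive needs a dataset, which this sketch treats as unsupported. */
    default boolean needsDataset() {
      return false;
    }
  }

  /** Applies every directive to each raw line as it is read, as a record reader would. */
  static final class WranglingIterator implements Iterator<Map<String, Object>> {
    private final Iterator<String> lines;
    private final List<Directive> directives;

    WranglingIterator(Iterator<String> lines, List<Directive> directives) {
      for (Directive d : directives) {
        if (d.needsDataset()) {
          // Fail once at setup time instead of once per record.
          throw new IllegalArgumentException("Dataset-backed directives are not supported here");
        }
      }
      this.lines = lines;
      this.directives = directives;
    }

    @Override
    public boolean hasNext() {
      return lines.hasNext();
    }

    @Override
    public Map<String, Object> next() {
      Map<String, Object> row = new LinkedHashMap<>();
      row.put("body", lines.next());
      // Every directive runs for every record, so per-record cost adds up.
      for (Directive d : directives) {
        row = d.apply(row);
      }
      return row;
    }
  }

  /** Example directive: splits the "body" column on commas into named columns. */
  static Directive parseCsv(String... names) {
    return row -> {
      String[] parts = String.valueOf(row.get("body")).split(",", -1);
      Map<String, Object> out = new LinkedHashMap<>();
      for (int i = 0; i < names.length && i < parts.length; i++) {
        out.put(names[i], parts[i].trim());
      }
      return out;
    };
  }

  /** Example directive: upper-cases one column if it is present. */
  static Directive uppercase(String column) {
    return row -> {
      Map<String, Object> out = new LinkedHashMap<>(row);
      Object value = out.get(column);
      if (value != null) {
        out.put(column, value.toString().toUpperCase());
      }
      return out;
    };
  }

  /** Reads all lines through the wrangling iterator and collects the resulting rows. */
  static List<Map<String, Object>> readAll(List<String> lines, List<Directive> directives) {
    List<Map<String, Object>> rows = new ArrayList<>();
    Iterator<Map<String, Object>> it = new WranglingIterator(lines.iterator(), directives);
    while (it.hasNext()) {
      rows.add(it.next());
    }
    return rows;
  }

  public static void main(String[] args) {
    List<Map<String, Object>> rows = readAll(
        Arrays.asList("1, alice", "2, bob"),
        Arrays.asList(parseCsv("id", "name"), uppercase("name")));
    rows.forEach(System.out::println);
  }
}
```

Checking `needsDataset()` in the constructor is one possible way to separate compatible from incompatible directives; in this sketch it relies on each directive declaring its own requirements, which the monolithic Wrangler described above may not make easy.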
Approach
Approach #1
Approach #2
API changes
New Programmatic APIs
New Java APIs introduced (both user facing and internal)
Deprecated Programmatic APIs
New REST APIs
Path | Method | Description | Response Code | Response |
---|---|---|---|---|
/v3/apps/<app-id> | GET | Returns the application spec for a given application | 200 - On success; 404 - When application is not available; 500 - Any internal errors |  |
 |  |  |  |  |
Deprecated REST API
Path | Method | Description |
---|---|---|
/v3/apps/<app-id> | GET | Returns the application spec for a given application |
CLI Impact or Changes
- Impact #1
- Impact #2
- Impact #3
UI Impact or Changes
- Impact #1
- Impact #2
- Impact #3
Security Impact
What is the impact on authorization, and how does the design take care of this aspect?
Impact on Infrastructure Outages
System behavior (if applicable, document the impact of downstream component failures [YARN, HBase, etc.]) and how the design takes care of these aspects
Test Scenarios
Test ID | Test Description | Expected Results |
---|---|---|
 |  |  |
 |  |  |
 |  |  |
 |  |  |
Releases
Release X.Y.Z
Release X.Y.Z
Related Work
- Work #1
- Work #2
- Work #3
Future work
Created in 2020 by Google Inc.