Data Wrangler - CDAP 4.0
Goals
Adding a Data Wrangler will improve the overall user experience for creating schemas and make importing data easier.
Checklist
- User stories documented (Todd)
- Requirements documented (Todd)
- Requirements Reviewed
- Mockups Built
- Design Built
- Design Accepted
User Stories
- As a Hydrator user, I want a transform node that I can place after a source plugin in my pipeline and that allows me to graphically build a schema to be used in my pipeline.
- As a Hydrator user I want the new transform node type to be able to operate anywhere in my pipeline after a Source is defined.
- As a Hydrator user I want the new transform node to make a best effort to determine if the first row of a file I’m importing is different from the rest, so that I can quickly determine if a header row exists.
- As a Hydrator user I want the new transform node to understand common delimiters so that I can parse my data into columns and fields.
- As a Hydrator user I want the new transform node to allow configuration of column fields (name/type/reorder/include/drop/merge/split) from sources in my pipeline, using a graphical interface.
- As a Hydrator user I want the new transform node to provide easily visible statistics for data quality and a histogram for distribution. I want these to be viewable at the column level for each field in my source.
- As a Hydrator user I want the new transform node to provide a history of all steps I perform on a document.
Requirements
General
- The new tool can be instantiated as a new transform node type from inside Hydrator.
- The tool should also be accessible outside of Hydrator.
- The input for the tool should be a schema OR a JSON representation of sampled data. The new tool should not configure data sources.
- The output for the tool should be an output schema in JSON and the DSL for performing the transformations.
- State, including sample, should be preserved upon "saving" and returning from within the pipeline.
Supported Operations
Sample Data/Schema inference
- The tool should receive data for graphical presentation when it is in preview mode without explicit direction from the user.
- The tool should accept copy and paste, file upload, or an HTTP REST endpoint for sampling data.
- The default number of records/rows/documents sampled should be 1000, and the value should be user-definable.
- The tool should make a best effort attempt to determine delimiter.
- The tool should make a best effort attempt to determine if a header row is present.
- The tool should make a best effort attempt to determine if there are encapsulating delimiters, such as double quotes.
- The tool should allow user specification of delimiter, from the screen, or from a dropdown:
  - comma
  - semicolon
  - tab
  - pipe
  - caret
  - custom (any Unicode value)
- The tool will make a best effort attempt at determining type for each column.
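The best-effort inference described above could be prototyped roughly as follows. This is only a sketch, not the Wrangler implementation: it leans on Python's standard `csv.Sniffer` heuristics, and the names `inspect_sample`, `infer_type`, and `CANDIDATE_DELIMITERS`, as well as the three-type vocabulary (`int`, `float`, `string`), are assumptions made for illustration.

```python
import csv
import io

# Delimiters from the dropdown list: comma, semicolon, tab, pipe, caret.
CANDIDATE_DELIMITERS = ",;\t|^"


def infer_type(values):
    """Best-effort type guess for one column of raw strings (hypothetical helper)."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "string"
    for name, cast in (("int", int), ("float", float)):
        try:
            for v in non_empty:
                cast(v)
            return name
        except ValueError:
            continue
    return "string"


def inspect_sample(text, max_rows=1000):
    """Guess delimiter, quote character, header presence, and column types."""
    sniffer = csv.Sniffer()
    # Raises csv.Error when no candidate delimiter fits the sample.
    dialect = sniffer.sniff(text, delimiters=CANDIDATE_DELIMITERS)
    has_header = sniffer.has_header(text)

    # Skip blank lines, which csv.reader yields as empty lists.
    rows = [r for r in csv.reader(io.StringIO(text), dialect) if r]
    names = rows[0] if has_header else None
    data = (rows[1:] if has_header else rows)[:max_rows]

    columns = list(zip(*data))
    return {
        "delimiter": dialect.delimiter,
        "quotechar": dialect.quotechar,
        "has_header": has_header,
        "names": names,
        "types": [infer_type(col) for col in columns],
    }
```

Because every value here is a guess, a real tool would still let the user override the delimiter, header flag, and column types, as the requirements state.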
Column Operations
- Drop Columns. Columns should be droppable from one button. There should be a global option to show dropped columns, as grayed out in the UI.
- Reorder Columns. Columns should be draggable to reorder them.
- Rename Columns. The names should be an input field that is rename-able.
- Type. Type should be selectable from a drop down menu.
- Split. Columns should be able to be split based on an expression or delimiting character.
- Merge. Columns of the same type should be able to be merged with other columns based on the following operators:
  - String
    - Concatenate with char/space
    - Deduplicate
    - Replace (if < or >)
  - Numeric
    - Sum
    - Average
    - Subtract (limited to two columns)
    - Divide (limited to two columns)
    - Deduplicate
    - Replace (if < or >)
    - Modulus (limited to two columns)
- Data Quality Score. For each column, a data quality score should be presented to indicate the percentage of nulls and the percentage of outliers.
- Histogram. For each column, a histogram should display the count of each detected value in the column sample, or the count of values that fall within each numeric range.
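The data quality score and histogram requirements can be illustrated with a small sketch. The requirements do not define what counts as an outlier, so the code below assumes the common 1.5 x IQR rule. The function names (`quality_score`, `value_histogram`, `range_histogram`) and the use of `None` to represent nulls are also assumptions for illustration.

```python
from collections import Counter
from statistics import quantiles


def quality_score(values):
    """Percentage of nulls and outliers in a numeric column sample.

    Assumption: an outlier lies more than 1.5 * IQR outside the quartiles.
    """
    total = len(values)
    if total == 0:
        return {"null_pct": 0.0, "outlier_pct": 0.0}
    present = [v for v in values if v is not None]
    outliers = 0
    if len(present) >= 4:  # too few points to estimate quartiles otherwise
        q1, _, q3 = quantiles(present, n=4)
        spread = q3 - q1
        low, high = q1 - 1.5 * spread, q3 + 1.5 * spread
        outliers = sum(1 for v in present if v < low or v > high)
    return {
        "null_pct": 100.0 * (total - len(present)) / total,
        "outlier_pct": 100.0 * outliers / total,
    }


def value_histogram(values):
    """Count of each distinct non-null value in the sample."""
    return Counter(v for v in values if v is not None)


def range_histogram(values, bins=5):
    """Counts of non-null numeric values per equal-width range."""
    present = [v for v in values if v is not None]
    if not present:
        return []
    lo, hi = min(present), max(present)
    width = (hi - lo) / bins or 1.0  # avoid zero width when all values match
    counts = [0] * bins
    for v in present:
        # The maximum value falls into the last bin rather than past it.
        counts[min(int((v - lo) / width), bins - 1)] += 1
    return [(lo + i * width, lo + (i + 1) * width, counts[i]) for i in range(bins)]
```

The value histogram suits low-cardinality columns, while the range histogram is the natural fit for continuous numeric columns; which one to show per column is a UI decision the requirements leave open.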
Bulk Column Editing Operations should be possible from a single view to:
- Rename
- Drop
- Reorder
- Change Type
- Merge
- Deduplicate
Steps viewer to:
- View all previous steps.
- Roll back to a previous point. Rollback will destroy all operations between the current step and the rollback point. There will be no in-place editing of steps.
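The rollback semantics above (destructive, with no editing of intermediate steps) map naturally onto an append-only list that can only be truncated. The class below is a minimal sketch under that reading; `StepHistory` and its method names are hypothetical and not part of any CDAP API.

```python
class StepHistory:
    """Append-only list of steps that supports destructive rollback."""

    def __init__(self):
        self._steps = []

    def record(self, description):
        """Append a step and return its 1-based step number."""
        self._steps.append(description)
        return len(self._steps)

    def steps(self):
        """Return a copy of all steps so callers cannot edit history."""
        return list(self._steps)

    def rollback(self, step_number):
        """Keep steps 1..step_number and discard everything after them.

        Passing 0 discards the entire history. Returns the discarded steps.
        """
        if not 0 <= step_number <= len(self._steps):
            raise ValueError(f"no such step: {step_number}")
        discarded = self._steps[step_number:]
        del self._steps[step_number:]
        return discarded
```

For example, after recording "drop column a", "rename b to c", and "split c on ','", calling `rollback(1)` leaves only the first step and returns the two discarded ones.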
Future Considerations
- Date/time support as a field type, and date/time functions
Design
Check for header workflow: [mockup images not preserved], multiple selections available.
Created in 2020 by Google Inc.