Data Wrangler - CDAP 4.0

Owned by Todd Greenstein

Last updated: Nov 21, 2016 by Nitin Motgi

Goals

Adding a Data Wrangler will improve the overall user experience for creating schemas and make importing data easier.

Checklist

User stories documented (Todd)
Requirements documented (Todd)
Requirements Reviewed
Mockups Built
Design Built
Design Accepted

User Stories

As a Hydrator user, I want a transform node, placed after a source plugin, that allows me to graphically build a schema to be used in my pipeline.
As a Hydrator user I want the new transform node type to be able to operate anywhere in my pipeline after a Source is defined.
As a Hydrator user I want the new transform node to make a best effort to determine if the first row of a file I’m importing is different so that I can quickly determine if a header row exists.
As a Hydrator user I want the new transform node to understand common delimiters so that I can parse my data into columns and fields.
As a Hydrator user I want the new transform node to allow configuration of column fields (name/type/reorder/include/drop/merge/split) from sources in my pipeline, using a graphical interface.
As a Hydrator user I want the new transform node to provide easily visible statistics for data quality and a histogram for distribution. I want these to be viewable at the column level for each field in my source.
As a Hydrator user I want the new transform node to provide a history of all steps I perform on a document.

Requirements

General

The new tool can be instantiated as a new transform node type from inside Hydrator.
The tool should also be accessible outside of Hydrator.
The input for the tool should be a schema OR a JSON representation of sampled data. The new tool should not configure data sources.
The output for the tool should be an output schema in JSON and the DSL for performing the transformations.
State, including sample, should be preserved upon "saving" and returning from within the pipeline.

Supported Operations

Sample Data/Schema inference

- The tool should receive data for graphical presentation when it is in preview mode without explicit direction from the user.
- The tool should accept copy and paste, file upload, or an HTTP REST endpoint for sampling data.
- The default number of records/rows/documents sampled should be 1000, and the value should be user-definable.
- The tool should make a best effort attempt to determine delimiter.
- The tool should make a best effort attempt to determine if a header row is present.
- The tool should make a best effort attempt to determine if there are encapsulating delimiters, such as double quotes.
- The tool should allow user specification of delimiter, from the screen, or from a dropdown:
  - comma
  - semicolon
  - tab
  - pipe
  - caret
  - custom (any Unicode value)
- The tool will make a best effort attempt at determining type for each column.
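
The best effort detection described above could be sketched roughly as follows. This is only an illustration in Python, not the tool's actual implementation: the function names (guess_delimiter, guess_type, has_header, profile_sample), the heuristics, and the type names "int", "float" and "string" are all assumptions made for this example.

```python
import csv

# Candidate delimiters from the requirements list (a custom one could be appended).
CANDIDATE_DELIMITERS = [",", ";", "\t", "|", "^"]


def guess_delimiter(sample):
    """Pick the candidate that splits every line into the same number (>1) of fields.

    Double quotes are treated as encapsulating delimiters by csv.reader.
    Falls back to a comma when no candidate gives a consistent split.
    """
    lines = [line for line in sample.splitlines() if line.strip()]
    best, best_fields = ",", 1
    for delim in CANDIDATE_DELIMITERS:
        rows = list(csv.reader(lines, delimiter=delim, quotechar='"'))
        widths = {len(row) for row in rows}
        if len(widths) == 1:
            width = widths.pop()
            if width > best_fields:
                best, best_fields = delim, width
    return best


def guess_type(values):
    """Return the narrowest hypothetical type name that fits all non-empty values."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "string"
    for name, cast in (("int", int), ("float", float)):
        try:
            for v in non_empty:
                cast(v)
            return name
        except ValueError:
            continue
    return "string"


def has_header(rows):
    """Assume a header when a first-row cell is text but the rest of its column is numeric."""
    if len(rows) < 2:
        return False
    first, body = rows[0], rows[1:]
    for i, cell in enumerate(first):
        column = [row[i] for row in body if i < len(row)]
        if guess_type(column) != "string" and guess_type([cell]) == "string":
            return True
    return False


def profile_sample(sample):
    """Run delimiter, header, and per-column type detection on a text sample."""
    delim = guess_delimiter(sample)
    lines = [line for line in sample.splitlines() if line.strip()]
    rows = list(csv.reader(lines, delimiter=delim, quotechar='"'))
    header = has_header(rows)
    body = rows[1:] if header else rows
    width = len(rows[0]) if rows else 0
    types = [guess_type([r[i] for r in body if i < len(r)]) for i in range(width)]
    return {"delimiter": delim, "has_header": header, "types": types}
```

Because every guess can be wrong (for example, a file whose columns are all text gives the header check nothing to compare against), the user-specified delimiter and type selections in the requirements remain necessary as overrides.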

Column Operations

- Drop Columns. Columns should be droppable from one button. There should be a global option to show dropped columns, as grayed out in the UI.
- Reorder Columns. Columns should be draggable so that they can be reordered.
- Rename Columns. Each column name should be an editable input field.
- Type. Type should be selectable from a drop down menu.
- Split. Columns should be able to be split based on an expression or delimiting character.
- Merge. Columns of the same type should be able to be merged with other columns based on the following operators:
  - String
    - Concatenate with char/space
    - Deduplicate
    - Replace (if < or >)
  - Numeric
    - Sum
    - Average
    - Subtract (limited to two columns)
    - Divide (limited to two columns)
    - Deduplicate
    - Replace (if < or >)
    - Modulus (limited to two columns)
- Data Quality Score. For each column a data quality score should be presented to indicate percentage of nulls, and percentage of outliers.
- Histogram. For each column, a histogram should display the count of each detected value in the column sample, or the count of values that fall within each numeric range.
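
As a rough illustration of the per-column data quality score and histogram, the following Python sketch computes both for a sampled column. The requirements do not define what counts as an outlier, so the z-score rule and its threshold here are assumptions, as are the function names (quality_score, value_histogram, range_histogram) and the use of None to represent a null.

```python
from collections import Counter
from statistics import mean, pstdev


def quality_score(values, z_threshold=3.0):
    """Percentage of nulls and of outliers in a numeric column sample.

    Assumption: an outlier is a value more than z_threshold population
    standard deviations from the mean of the non-null values.
    """
    total = len(values)
    if total == 0:
        return {"null_pct": 0.0, "outlier_pct": 0.0}
    present = [v for v in values if v is not None]
    null_pct = 100.0 * (total - len(present)) / total
    outliers = 0
    if len(present) >= 2:
        mu, sigma = mean(present), pstdev(present)
        if sigma > 0:
            outliers = sum(1 for v in present if abs(v - mu) > z_threshold * sigma)
    return {"null_pct": null_pct, "outlier_pct": 100.0 * outliers / total}


def value_histogram(values):
    """Count of each detected (non-null) value in the column sample."""
    return Counter(v for v in values if v is not None)


def range_histogram(values, bins=10):
    """Counts of non-null numeric values in equal-width ranges.

    Returns a list of ((low, high), count); the maximum falls in the last range.
    """
    present = [v for v in values if v is not None]
    if not present:
        return []
    lo, hi = min(present), max(present)
    if lo == hi:
        return [((lo, hi), len(present))]
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in present:
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return [((lo + i * width, lo + (i + 1) * width), counts[i]) for i in range(bins)]
```

The value histogram suits low-cardinality columns, while the range histogram suits numeric columns with many distinct values. How the tool would choose between them is not specified in the requirements.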

Bulk Column Editing Operations should be possible from a single view to:

- Rename
- Drop
- Reorder
- Change Type
- Merge
- Deduplicate

Steps viewer to:

- View all previous steps.
- Rollback to a previous point. Rollback will destroy all operations between the current step and the rollback point. There will be no in-process editing of steps.
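
The steps viewer behavior can be sketched as a simple append-only history whose only mutation is truncation. This Python outline is an assumption about one way to model it. The Step and StepHistory names, and the choice to represent rows as dictionaries, are hypothetical and not part of the requirements.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Row = Dict[str, object]


@dataclass
class Step:
    description: str                          # text shown in the steps viewer
    apply: Callable[[List[Row]], List[Row]]   # transformation over the sample


@dataclass
class StepHistory:
    steps: List[Step] = field(default_factory=list)

    def add(self, description, apply):
        """Record a new step at the end of the history."""
        self.steps.append(Step(description, apply))

    def view(self):
        """List all previous steps, oldest first."""
        return [step.description for step in self.steps]

    def rollback(self, keep):
        """Keep the first `keep` steps and destroy every later one.

        Steps cannot be edited in place; the only way back is truncation.
        """
        if not 0 <= keep <= len(self.steps):
            raise IndexError("rollback point is outside the history")
        del self.steps[keep:]

    def replay(self, sample):
        """Re-run the remaining steps against a copy of the original sample."""
        rows = [dict(row) for row in sample]
        for step in self.steps:
            rows = step.apply(rows)
        return rows
```

Replaying from the original sample after a rollback avoids having to define an inverse for each operation, at the cost of recomputing the remaining steps.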

Future Considerations

Date/time support as a field type, and date/time functions

Design

Check for header workflow:
