Data Wrangler - CDAP 4.0

Goals

Adding a Data Wrangler will improve the overall user experience for creating schemas and make importing data easier.

Checklist

User stories documented (Todd)
Requirements documented (Todd)
Requirements Reviewed
Mockups Built
Design Built
Design Accepted

User Stories

  • As a Hydrator user I want a transform node, placed after a source plugin in my pipeline, that allows me to graphically build a schema to be used in that pipeline.

  • As a Hydrator user I want the new transform node type to be able to operate anywhere in my pipeline after a Source is defined.

  • As a Hydrator user I want the new transform node to make a best effort to determine if the first row of a file I’m importing is different from the rows that follow, so that I can quickly determine if a header row exists.

  • As a Hydrator user I want the new transform node to understand common delimiters so that I can parse my data into columns and fields.

  • As a Hydrator user I want the new transform node to allow configuration of column fields (name/type/reorder/include/drop/merge/split) from sources in my pipeline, using a graphical interface.

  • As a Hydrator user I want the new transform node to provide easily visible statistics for data quality and a histogram for distribution. I want these to be viewable at the column level for each field in my source.

  • As a Hydrator user I want the new transform node to keep a history of all steps I perform on a document and make that history available to me.
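
The delimiter and header-row stories above could be approached with simple heuristics over a handful of sampled lines. The following is a minimal sketch, not the Data Wrangler's actual logic: the function names, the candidate delimiter list, and the "a numeric column whose first cell is not numeric" rule are all assumptions made for illustration.

```python
import csv

# Assumed candidate set; the real tool may recognise others.
COMMON_DELIMITERS = [",", "\t", "|", ";"]


def guess_delimiter(lines):
    """Return the candidate that splits every sampled line into the same
    number of fields (more than one), preferring the widest split."""
    best, best_fields = None, 1
    for delim in COMMON_DELIMITERS:
        counts = {len(row) for row in csv.reader(lines, delimiter=delim)}
        if len(counts) == 1:
            n = counts.pop()
            if n > best_fields:
                best, best_fields = delim, n
    return best


def looks_numeric(value):
    try:
        float(value)
        return True
    except ValueError:
        return False


def first_row_differs(rows):
    """Best-effort header check: True when some column is numeric in every
    row after the first but not in the first row itself."""
    if len(rows) < 2:
        return False
    header, body = rows[0], rows[1:]
    for col, cell in enumerate(header):
        body_numeric = all(col < len(r) and looks_numeric(r[col]) for r in body)
        if body_numeric and not looks_numeric(cell):
            return True
    return False


sample = ["name,age,city", "Ann,34,Oslo", "Bob,41,Lima"]
delim = guess_delimiter(sample)
rows = list(csv.reader(sample, delimiter=delim))
print(repr(delim), first_row_differs(rows))
```

A heuristic like this can only suggest that a header exists; a file whose columns are all text gives it nothing to compare, so the user would still need a way to confirm or override the guess.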

Requirements

General 

  • The new tool can be instantiated as a new transform node type from inside Hydrator.

  • The tool should also be accessible outside of Hydrator.

  • The input for the tool should be either a schema or a JSON representation of sampled data. The new tool should not configure data sources.

  • The output for the tool should be an output schema in JSON and the DSL for performing the transformations.

  • State, including the sample, should be preserved upon "saving" and when returning to the tool from within the pipeline.
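
To make the input/output requirements above concrete, here is a minimal sketch of the contract: sampled records arrive as JSON, the user's steps are applied, and the result is an output schema in JSON plus a list of DSL lines. Everything specific here is hypothetical: the step dictionaries, the `rename`/`drop` DSL spelling, the record-style schema layout, and the type names are assumptions, not the format this document defines.

```python
import json

# Assumed mapping from JSON value types to schema type names.
TYPE_NAMES = {"str": "string", "int": "long", "float": "double", "bool": "boolean"}


def apply_steps(sample_json, steps):
    """Return (output_schema_json, dsl_lines) for the given sample and steps."""
    records = json.loads(sample_json)
    # Field order and types are taken from the first sampled record.
    fields = [
        (name, TYPE_NAMES.get(type(value).__name__, "string"))
        for name, value in records[0].items()
    ]
    dsl = []  # doubles as the history of steps performed
    for step in steps:
        op = step["op"]
        if op == "rename":
            fields = [
                (step["to"] if name == step["column"] else name, ftype)
                for name, ftype in fields
            ]
            dsl.append(f"rename {step['column']} {step['to']}")
        elif op == "drop":
            fields = [(n, t) for n, t in fields if n != step["column"]]
            dsl.append(f"drop {step['column']}")
        else:
            raise ValueError(f"unsupported step: {op}")
    schema = {
        "type": "record",
        "name": "output",
        "fields": [{"name": n, "type": t} for n, t in fields],
    }
    return json.dumps(schema), dsl


sample = '[{"id": 1, "fname": "Ann", "tmp": "x"}]'
steps = [
    {"op": "rename", "column": "fname", "to": "first_name"},
    {"op": "drop", "column": "tmp"},
]
schema_json, dsl = apply_steps(sample, steps)
print(schema_json)
print(dsl)
```

Keeping the DSL as an ordered list means the same structure can serve as the step history and be stored alongside the sample when state is saved.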

Supported Operations

Sample Data/Schema inference
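
As an illustration of what schema inference from sampled data could look like, the sketch below picks the narrowest type that fits every non-empty value in a column. The type names (`long`, `double`, `boolean`, `string`), the function names, and the rule that empty cells are ignored are assumptions for this example only.

```python
def infer_column_type(values):
    """Return the narrowest of long/double/boolean/string that fits
    every non-empty value in the column."""
    present = [v.strip() for v in values if v is not None and v.strip() != ""]
    if not present:
        return "string"

    def all_parse(parse):
        for v in present:
            try:
                parse(v)
            except ValueError:
                return False
        return True

    if all_parse(int):
        return "long"
    if all_parse(float):
        return "double"
    if all(v.lower() in ("true", "false") for v in present):
        return "boolean"
    return "string"


def infer_schema(header, rows):
    """Infer one field per header name; rows must all have len(header) cells."""
    columns = list(zip(*rows))
    return [
        {"name": name, "type": infer_column_type(col)}
        for name, col in zip(header, columns)
    ]


header = ["name", "age", "score", "active"]
rows = [["Ann", "34", "9.5", "true"], ["Bob", "41", "", "False"]]
print(infer_schema(header, rows))
```

Because inference runs on a sample, a later row may not fit the inferred type, so the result is best treated as a suggestion the user can override.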

Column Operations

Bulk Column Editing Operations should be possible from a single view to:

        Steps viewer to:

Future Considerations

  • Date/time support as a field type, and date/time functions

Design

Check for header workflow:

Created in 2020 by Google Inc.