Data Wrangler - CDAP 4.0

Goals

Adding a Data Wrangler will improve the overall user experience for creating schemas and make importing data easier.

Checklist

  • User stories documented (Todd)
  • Requirements documented (Todd)
  • Requirements Reviewed
  • Mockups Built
  • Design Built
  • Design Accepted

User Stories

  • As a Hydrator user in my pipeline after a source plugin, I want a transform node that allows me to graphically build a schema to be used in my pipeline.
  • As a Hydrator user I want the new transform node type to be able to operate anywhere in my pipeline after a Source is defined.
  • As a Hydrator user I want the new transform node to make a best effort to determine whether the first row of a file I’m importing differs from the rest so that I can quickly determine if a header row exists.
  • As a Hydrator user I want the new transform node to understand common delimiters so that I can parse my data into columns and fields.
  • As a Hydrator user I want the new transform node to allow configuration of column fields (name/type/reorder/include/drop/merge/split) from sources in my pipeline, using a graphical interface.
  • As a Hydrator user I want the new transform node to provide easily visible statistics for data quality and a histogram for distribution. I want these to be viewable at the column level for each field in my source.
  • As a Hydrator user I want the new transform node to keep a history of all steps I perform on a document and make that history available to me.

Requirements

General 

  • The new tool can be instantiated as a new transform node type from inside Hydrator.
  • The tool should also be accessible outside of Hydrator.
  • The input for the tool should be a schema OR a JSON representation of sampled data.  The new tool should not configure data sources.
  • The output for the tool should be an output schema in JSON and the DSL for performing the transformations.
  • State, including sample, should be preserved upon "saving" and returning from within the pipeline.  

Supported Operations

Sample Data/Schema inference

    • The tool should receive data for graphical presentation when it is in preview mode without explicit direction from the user.  
    • The tool should accept copy and paste, file upload, or an HTTP REST endpoint for sampling data.
    • The number of records/rows/documents sampled should default to 1000 and be user definable.
    • The tool should make a best effort attempt to determine delimiter.
    • The tool should make a best effort attempt to determine if a header row is present.
    • The tool should make a best effort attempt to determine if there are encapsulating delimiters (e.g. double quotes).
    • The tool should allow user specification of delimiter, from the screen, or from a dropdown:
      • comma
      • semicolon
      • tab
      • pipe
      • caret
      • Custom (any unicode value)
    • The tool will make a best effort attempt at determining type for each column.   
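
The best-effort detection described in the list above could be sketched roughly as follows. This is only an illustration in Python, not the tool's design: the function names (guess_delimiter, has_header, infer_type, profile_sample) and the heuristics are assumptions made for the example. The delimiter guess picks the candidate whose per-line count is most consistent, the header guess flags a text-only first row sitting above a numeric column, and type inference only distinguishes int, float, and string.

```python
import csv

def guess_delimiter(lines, candidates=",;\t|^"):
    """Return the candidate whose non-zero per-line count repeats on the most lines."""
    best, best_score = None, 0
    for delim in candidates:
        counts = {}
        for line in lines:
            n = line.count(delim)
            if n > 0:
                counts[n] = counts.get(n, 0) + 1
        score = max(counts.values(), default=0)
        if score > best_score:
            best, best_score = delim, score
    return best

def infer_type(values):
    """Return 'int', 'float', or 'string' for raw string values; blanks are ignored."""
    present = [v for v in values if v.strip() != ""]
    if not present:
        return "string"
    for name, cast in (("int", int), ("float", float)):
        try:
            for v in present:
                cast(v)
            return name
        except ValueError:
            continue
    return "string"

def has_header(rows):
    """Guess a header when a first-row cell is text but the column below it is numeric."""
    if len(rows) < 2:
        return False
    first, rest = rows[0], rows[1:]
    for i, cell in enumerate(first):
        column = [r[i] for r in rest if i < len(r)]
        if infer_type([cell]) == "string" and infer_type(column) != "string":
            return True
    return False

def profile_sample(text, max_rows=1000):
    """Profile pasted or uploaded text; max_rows mirrors the default sample size of 1000."""
    lines = [line for line in text.splitlines() if line.strip()][:max_rows]
    delimiter = guess_delimiter(lines) or ","  # fall back to comma if nothing matches
    rows = list(csv.reader(lines, delimiter=delimiter))  # csv handles double-quoted fields
    header = has_header(rows)
    data = rows[1:] if header else rows
    width = max((len(r) for r in rows), default=0)
    types = [infer_type([r[i] for r in data if i < len(r)]) for i in range(width)]
    return {"delimiter": delimiter, "has_header": header, "types": types}

if __name__ == "__main__":
    sample = 'name;age;score\n"Smith; Al";30;1.5\nbob;25;2\n'
    print(profile_sample(sample))
```

A sample whose columns are all text gives this heuristic no signal, so it reports no header; in that case the user would still need to confirm the header row by hand.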

Column Operations

    • Drop Columns.  Columns should be droppable from one button.  There should be a global option to show dropped columns, as grayed out in the UI.  
    • Reorder Columns.  Columns should be draggable to reorder them.
    • Rename Columns.  Each column name should be an editable input field.
    • Type.  Type should be selectable from a drop down menu.  
    • Split. Columns should be able to be split based on an expression or delimiting character.
    • Merge.  Columns of the same type should be able to be merged with other columns based on the following operators:
      • String
        • Concatenate with char/space
        • Deduplicate
        • Replace (if < or >)
      • Numeric
        • Sum 
        • Average
        • Subtract (limited to two columns)
        • Divide (limited to two columns)
        • Deduplicate
        • Replace (if < or >)
        • Modulus (limited to two columns)
    • Data Quality Score.  For each column a data quality score should be presented to indicate percentage of nulls, and percentage of outliers.  
    • Histogram.  For each column a histogram should display the count of each detected value in the column sample, or the count of values that fall within each numeric range.
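
As a rough illustration of the data quality score and histogram requirements above, the following Python sketch computes both for a single column. The requirements do not define what counts as an outlier, so the 1.5 x IQR rule used here is an assumption, as are the function names column_quality and histogram and the choice of equal-width bins for numeric ranges.

```python
import statistics
from collections import Counter

def column_quality(values):
    """Return percentage of nulls and of outliers; None marks a null value."""
    total = len(values)
    if total == 0:
        return {"null_pct": 0.0, "outlier_pct": 0.0}
    present = [v for v in values if v is not None]
    nulls = total - len(present)
    numeric = [v for v in present
               if isinstance(v, (int, float)) and not isinstance(v, bool)]
    outliers = 0
    if len(numeric) >= 4:
        # Assumed outlier rule: outside 1.5 * IQR from the quartiles.
        q1, _, q3 = statistics.quantiles(numeric, n=4)
        spread = q3 - q1
        low, high = q1 - 1.5 * spread, q3 + 1.5 * spread
        outliers = sum(1 for v in numeric if v < low or v > high)
    return {"null_pct": 100.0 * nulls / total,
            "outlier_pct": 100.0 * outliers / total}

def histogram(values, bins=None):
    """Count each distinct value, or count values per equal-width range if bins is set."""
    present = [v for v in values if v is not None]
    if bins is None:
        return dict(Counter(present))
    if not present:
        return []
    lo, hi = min(present), max(present)
    width = (hi - lo) / bins
    if width == 0:
        width = 1  # all values identical: everything lands in the first bin
    counts = [0] * bins
    for v in present:
        idx = min(int((v - lo) / width), bins - 1)  # the maximum goes in the last bin
        counts[idx] += 1
    return [(lo + i * width, lo + (i + 1) * width, counts[i]) for i in range(bins)]

if __name__ == "__main__":
    ages = [10, 11, 12, 13, 14, 15, 16, 1000, None, None]
    print(column_quality(ages))
    print(histogram(["a", "b", "a", None]))
    print(histogram([0, 1, 2, 3, 4, 10], bins=2))
```

Both functions work on the sampled rows only, so the percentages describe the sample rather than the full source.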

Bulk Column Editing Operations should be possible from a single view to:

    • Rename
    • Drop
    • Reorder
    • Change Type
    • Merge
    • Deduplicate 
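
One way to think about bulk editing is as a single pass that applies several column edits to every sampled row at once. The Python sketch below is a hypothetical illustration covering only rename, drop, and reorder; the function name apply_bulk_edits, its parameters, and the representation of rows as dictionaries are assumptions for the example.

```python
def apply_bulk_edits(rows, rename=None, drop=(), order=None):
    """Apply rename, drop, and reorder to every row in one pass.

    rows   -- list of dicts mapping column name to value
    rename -- dict of old name -> new name
    drop   -- original column names to remove
    order  -- new column names that should come first, in this order
    """
    rename = rename or {}
    result = []
    for row in rows:
        # Drop first (by original name), then rename what is left.
        kept = {rename.get(k, k): v for k, v in row.items() if k not in drop}
        if order:
            ordered = {k: kept[k] for k in order if k in kept}
            # Columns not mentioned in `order` keep their relative position at the end.
            ordered.update((k, v) for k, v in kept.items() if k not in ordered)
            kept = ordered
        result.append(kept)
    return result

if __name__ == "__main__":
    sample = [{"id": "1", "fname": "Ada", "tmp": "x"}]
    print(apply_bulk_edits(sample,
                           rename={"fname": "first_name"},
                           drop={"tmp"},
                           order=["first_name", "id"]))
```

The input rows are left unmodified, which keeps the original sample available if an edit has to be undone.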

Steps viewer to:

    • View all previous steps.
    • Rollback to a previous point.  Rollback will destroy all operations between the current step and the rollback point.  There will be no in-process editing of steps.
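
The rollback behaviour described above (everything after the rollback point is destroyed, and earlier steps cannot be edited) maps naturally onto an append-only list that can only be truncated. The following Python sketch is one hypothetical way to model it; the class name StepHistory and its methods are assumptions for illustration.

```python
class StepHistory:
    """Append-only list of steps that supports viewing and destructive rollback."""

    def __init__(self):
        self._steps = []

    def record(self, description):
        """Append a step and return its 1-based step number."""
        self._steps.append(description)
        return len(self._steps)

    def steps(self):
        """Return a copy of all steps so callers cannot edit history in place."""
        return list(self._steps)

    def rollback(self, step_number):
        """Keep steps 1..step_number and discard every later step.

        Passing 0 returns to the unmodified sample. The discarded steps are
        returned for display only; they are not kept for redo.
        """
        if not 0 <= step_number <= len(self._steps):
            raise ValueError("no such step: %r" % (step_number,))
        discarded = self._steps[step_number:]
        del self._steps[step_number:]
        return discarded

if __name__ == "__main__":
    history = StepHistory()
    history.record("drop column tmp")
    history.record("rename fname to first_name")
    history.record("split address on ','")
    print(history.rollback(1))  # discards the rename and the split
    print(history.steps())
```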

Future Considerations

  • Date/time support as a field type, and date/time functions

Design

Check for header workflow:

Created in 2020 by Google Inc.