...

The developer wants to load webpage click and view data (customer id, timestamp, action, url) into a partitioned fileset. After loading the data, the developer wants to de-duplicate records and calculate how many times each customer clicked and viewed over the past hour, past day, and past month.

User Stories:

  1. (3.4) A developer should be able to create pipelines that contain aggregations (GROUP BY -> count/sum/unique)
  2. (3.5) A developer should be able to control some parts of the pipeline running before others. For example, one source -> sink branch running before another source -> sink branch.
  3. (3.4) A developer should be able to use a Spark ML job as a pipeline stage
  4. (3.4) A developer should be able to rerun failed pipeline runs without reconfiguring the pipeline
  5. (3.4) A developer should be able to de-duplicate records in a pipeline
  6. (3.5) A developer should be able to join multiple branches of a pipeline
  7. (3.5) A developer should be able to use an Explore action as a pipeline stage
  8. (3.5) A developer should be able to create pipelines that contain Spark Streaming jobs
  9. (3.5) A developer should be able to create pipelines that run based on various conditions, including input data availability and Kafka events

...

Introduce a new plugin type "aggregator".  In general, to support more and more plugin types in a generic way, we want to refactor the config:

...


...

One problem with this is that the plugin property "functions" is itself JSON describing the plugins to use. This is not easy for somebody to configure by hand, but it could perhaps be simplified by a UI widget type.

 

Option 2: Keep the current structure of the config, with plugin type at the top level.

Code Block
{
  "stages": [
    {
      "name": "inputTable",
      "plugin": {
        "name": "Table",
        "type": "batchsource",  // new field
        "properties": {
        }
      }
    },
    {
      "name": "aggStage",
      "plugin": {
        "name": "GroupByAggregate",
        "type": "aggregator",  // new plugin type
        "properties": {
          "groupBy": "id",
          "functions": [
            {
              "columnName": "totalPrice",
              "plugin": {
                "name": "sum",
                "properties": {
                  "column": "price"
                }
              }
            },
            {
              "columnName": "numTransactions",
              "plugin": {
                "name": "count"
              }
            }
          ]
        }
      }
    }
  ],
  "connections": [
    { "from": "inputTable", "to": "aggStage" }
  ]
}

...


One problem with this is that the plugin property "functions" is itself JSON describing the plugins to use. This is not easy for somebody to configure by hand, but it could perhaps be simplified by a UI widget type.

 

Java APIs for plugin developers. It is basically MapReduce; 'Aggregation' is probably a bad name for this. We need to see whether this fits into Spark. Would we have to remove the emitters?

...

Note: This would also satisfy user story 5, since unique (de-duplication) can be implemented as an Aggregation plugin: group by the fields you want to unique, ignore the Iterable<> in aggregate, and just emit the group key.
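
For illustration, a minimal sketch of such a de-duplicating plugin, assuming a hypothetical Aggregation Java API with groupBy and aggregate methods and an Emitter type (the class, method, and helper names below are placeholders, not the reviewed API):

Code Block
// Hypothetical de-duplication aggregator. Aggregation, Emitter, and
// StructuredRecord stand in for whatever types the real plugin API exposes.
public class Dedup extends Aggregation<StructuredRecord, StructuredRecord, StructuredRecord> {

  // Emit only the fields we want to unique on as the group key.
  @Override
  public void groupBy(StructuredRecord input, Emitter<StructuredRecord> emitter) {
    // selectUniqueFields is a hypothetical helper that projects the key fields
    emitter.emit(selectUniqueFields(input));
  }

  // Ignore the grouped records entirely; emitting each group key exactly once
  // drops the duplicates.
  @Override
  public void aggregate(StructuredRecord groupKey, Iterable<StructuredRecord> records,
                        Emitter<StructuredRecord> emitter) {
    emitter.emit(groupKey);
  }
}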

Story 2: Control Flow (Not Reviewed, WIP)

...

Option 1: Introduce different types of connections: one for data flow, one for control flow.

...

Story 3: Spark ML in a pipeline

Add a plugin type "sparksink" that is treated like a sink. When present, a spark program will be used to read data, transform it, then send all transformed results to the sparksink plugin.

Code Block
{
  "stages": [
    {
      "name": "customersTable",
      "plugin": {
        "name": "Database",
        "type": "batchsource", ...
      }
    },    
    {
      "name": "categorizer",
      "plugin": {
        "name": "SVM",
        "type": "sparkMLsparksink", ...
      }
    },
    {
      "name": "models",
      "plugin": {
        "name": "Table",
        "type": "batchsink", ...
      }
    }
  ],
  "connections": [
    { "from": "customersTable", "to": "categorizer" },
    { "from": "categorizer", "to": "models" }
  ]
}
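
As a rough sketch, the Java API for a sparksink plugin might receive all transformed records as a single RDD per batch run (the type and method names below are assumptions, not a reviewed API):

Code Block
import org.apache.spark.api.java.JavaRDD;

// Hypothetical sparksink plugin API. The framework would run the source and
// transforms in a spark program, then hand the full result set to run(),
// e.g. so an SVM plugin can train a model and write it to the models table.
// SparkPluginContext is a placeholder for whatever context object the
// framework would provide.
public abstract class SparkSink<IN> {
  public abstract void run(SparkPluginContext context, JavaRDD<IN> input) throws Exception;
}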

Story 6: Join (Not Reviewed, WIP)

Add a join plugin type.  Different implementations could be inner join, left outer join, etc.
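
As a starting point for discussion, the join plugin API might look something like the sketch below (all names are illustrative; it assumes each input branch is identified by its stage name):

Code Block
import java.util.Collection;
import java.util.Map;

// Hypothetical join plugin API. An inner join implementation would emit output
// only when every branch has records for the key; a left outer join would also
// emit when optional branches have empty collections.
public interface Joiner<JOIN_KEY, RECORD, OUT> {

  // Extract the join key from a record coming from the given input stage.
  JOIN_KEY joinOn(String stageName, RECORD record);

  // Merge the records from each branch (keyed by stage name) that share a join key.
  Iterable<OUT> merge(JOIN_KEY joinKey, Map<String, Collection<RECORD>> records);
}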

...