...
Note: This would also satisfy user story 5, where a unique can be implemented as an Aggregation plugin: group by the fields you want to be unique, ignore the Iterable<> in aggregate, and just emit the group key.
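For example, a unique over a couple of fields could then appear in the pipeline config as an ordinary aggregator stage (a sketch only; the "Distinct" plugin name and the "uniqueFields" property are hypothetical, not decided names):

```json
{
  "name": "dedupe",
  "plugin": {
    "name": "Distinct",
    "type": "aggregator",
    "properties": { "uniqueFields": "customer_id,category" }
  }
}
```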
Story 2: Control Flow (Not Reviewed, WIP)
Option 1: Introduce different types of connections: one for data flow, one for control flow.
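Under option 1, the connections section might distinguish the two kinds explicitly (a sketch only; the "type" field and the stage names here are assumptions, not a decided format):

```json
"connections": [
  { "from": "customersTable", "to": "categorizer", "type": "data" },
  { "from": "categorizer", "to": "notifier", "type": "control" }
]
```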
...
Story 3: Spark ML in a pipeline
Add a plugin type "sparkML" that is treated like a transform. But instead of being a stage inside a mapper, it is a program in a workflow. The application will create a transient dataset to act as the input into the program, or an explicit source can be given. There is also a "sparksink" plugin type; when present, a Spark program will be used to read data, transform it, then send all transformed results to the sparksink plugin.
{
  "stages": [
    {
      "name": "customersTable",
      "plugin": { "name": "Database", "type": "batchsource", ... }
    },
    {
      "name": "categorizer",
      "plugin": { "name": "SVM", "type": "sparkML", ... }
    },
    {
      "name": "models",
      "plugin": { "name": "Table", "type": "batchsink", ... }
    }
  ],
  "connections": [
    { "from": "customersTable", "to": "categorizer" },
    { "from": "categorizer", "to": "models" }
  ]
}
Story 6: Join (Not Reviewed, WIP)
Add a join plugin type. Different implementations could be inner join, left outer join, etc.
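A join stage would naturally take multiple incoming connections, one per input. For example (the "InnerJoin" plugin name, the "join" type string, and the "condition" property are all hypothetical, for illustration only), an inner join of two sources might look like:

```json
{
  "stages": [
    { "name": "customers", "plugin": { "name": "Database", "type": "batchsource", ... } },
    { "name": "purchases", "plugin": { "name": "Database", "type": "batchsource", ... } },
    {
      "name": "joiner",
      "plugin": {
        "name": "InnerJoin",
        "type": "join",
        "properties": { "condition": "customers.id = purchases.customer_id" }
      }
    },
    { "name": "joined", "plugin": { "name": "Table", "type": "batchsink", ... } }
  ],
  "connections": [
    { "from": "customers", "to": "joiner" },
    { "from": "purchases", "to": "joiner" },
    { "from": "joiner", "to": "joined" }
  ]
}
```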
...