...
- User stories documented (Albert/Alvin)
- User stories reviewed (Nitin)
- Design documented (Albert/Alvin)
- Design reviewed (Andreas/Terence)
- Feature merged (Albert/Alvin)
- Examples and guides (Albert/Alvin)
- Integration tests (Albert/Alvin)
- Documentation for feature (Albert/Alvin)
- Blog post
...
The developer wants to load webpage click and view data (customer id, timestamp, action, url) into a partitioned fileset. After loading the data, the developer wants to de-duplicate records and calculate how many times each customer clicked and viewed over the past hour, past day, and past month.
User Stories:
- (3.4) A developer should be able to create pipelines that contain aggregations (GROUP BY -> count/sum/unique)
- (3.5) A developer should be able to create a pipeline with multiple sources and control some parts of the pipeline running before others. For example, one source -> sink branch running before another source -> sink branch.
- (3.4) A developer should be able to use a Spark ML job as a pipeline stage
- (3.4) A developer should be able to rerun failed pipeline runs without reconfiguring the pipeline
- (3.4) A developer should be able to de-duplicate records in a pipeline
- (3.5) A developer should be able to join multiple branches of a pipeline
- (3.5) A developer should be able to use an Explore action as a pipeline stage
- (3.5) A developer should be able to create pipelines that contain Spark Streaming jobs
- (3.5) A developer should be able to create pipelines that run based on various conditions, including input data availability and Kafka events
...
Introduce a new plugin type "aggregator". In general, to support more and more plugin types in a generic way, we want to refactor the config:
...
...
One problem with this is that the plugin property "functions" is itself JSON describing plugins to use. This is not easy to configure by hand, but it could perhaps be simplified by a UI widget type.
Option 2: Keep the current structure of the config, with plugin type at the top level.
{
  "sources": [
    {
      "name": "inputTable",
      "plugin": {
        "name": "Table",
        "properties": { }
      }
    }
  ],
  "aggregators": [
    {
      "name": "aggStage",
      "plugin": {
        "name": "GroupByAggregate",
        "properties": {
          "groupBy": "id",
          "functions": [
            {
              "columnName": "totalPrice",
              "plugin": {
                "name": "sum",
                "properties": {
                  "column": "price"
                }
              }
            },
            {
              "columnName": "numTransactions",
              "plugin": {
                "name": "count"
              }
            }
          ]
        }
      }
    }
  ],
  "connections": [
    { "from": "inputTable", "to": "aggStage" }
  ]
}
...
Like Option 1, this has the problem that the plugin property "functions" is itself JSON describing plugins to use, which is not easy to configure by hand, though it could perhaps be simplified by a UI widget type.
Java APIs for plugin developers. The model is basically MapReduce, so 'Aggregation' is probably a bad name for this. We need to see whether it fits into Spark, and whether we would have to remove the emitters.
...
Note: This would also satisfy user story 5, since a unique can be implemented as an Aggregation plugin: group by the fields you want to be unique, ignore the Iterable<> in aggregate, and just emit the group key.
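For example, de-duplicating the click/view data from the use case above might be configured as an aggregator stage like this (a sketch only; the field list and the idea that an empty "functions" list makes the stage emit one record per distinct key are assumptions, not decided behavior):

{
  "name": "dedupStage",
  "plugin": {
    "name": "GroupByAggregate",
    "type": "aggregator",
    "properties": {
      "groupBy": "customerId,timestamp,action,url",  // group by every field that should be unique
      "functions": "[]"  // no aggregate functions: emit just the group key (assumed behavior)
    }
  }
}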
Story 2: Control Flow (Not Reviewed, WIP)
...
Option 1: Introduce different types of connections: one for data flow, one for control flow.
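For example, the config might tag each connection with a type (a sketch; the "type" field and its values are hypothetical and do not exist in the current config):

{
  "connections": [
    { "from": "source1", "to": "sink1", "type": "data" },
    { "from": "sink1", "to": "source2", "type": "control" },  // second branch starts only after the first finishes
    { "from": "source2", "to": "sink2", "type": "data" }
  ]
}

This would cover the user story above of one source -> sink branch running before another.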
...
Story 3: Spark ML in a pipeline
Add a plugin type "sparkML" that is treated like a transform. But instead of being a stage inside a mapper, it is a program in a workflow. The application will create a transient dataset to act as the input into the program, or an explicit source can be given. Alternatively, a "sparksink" plugin type could be treated like a sink: when present, a spark program will be used to read data, transform it, then send all transformed results to the sparksink plugin.
{
  "stages": [
    {
      "name": "customersTable",
      "plugin": {
        "name": "Database",
        "type": "batchsource",
        ...
      }
    },
    {
      "name": "categorizer",
      "plugin": {
        "name": "SVM",
        "type": "sparkML",
        ...
      }
    },
    {
      "name": "models",
      "plugin": {
        "name": "Table",
        "type": "batchsink",
        ...
      }
    }
  ],
  "connections": [
    { "from": "customersTable", "to": "categorizer" },
    { "from": "categorizer", "to": "models" }
  ]
}
Story 6: Join (Not Reviewed, WIP)
Add a join plugin type. Different implementations could be inner join, left outer join, etc.
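For example, a pipeline joining two source branches might look like this (a sketch; the "InnerJoin" plugin name, the "join" plugin type name, and the "on" property are all hypothetical):

{
  "stages": [
    { "name": "customers", "plugin": { "name": "Table", "type": "batchsource", ... } },
    { "name": "purchases", "plugin": { "name": "Table", "type": "batchsource", ... } },
    {
      "name": "joinStage",
      "plugin": {
        "name": "InnerJoin",  // hypothetical; left outer join etc. would be other implementations
        "type": "join",
        "properties": { "on": "customers.id = purchases.customerId" }
      }
    },
    { "name": "joined", "plugin": { "name": "Table", "type": "batchsink", ... } }
  ],
  "connections": [
    { "from": "customers", "to": "joinStage" },
    { "from": "purchases", "to": "joinStage" },
    { "from": "joinStage", "to": "joined" }
  ]
}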
...