...
- User stories documented (Albert/Alvin)
- User stories reviewed (Nitin)
- Design documented (Albert/Alvin)
- Design reviewed (Andreas/Terence)
- Feature merged (Albert/Alvin)
- Examples and guides (Albert/Alvin)
- Integration tests (Albert/Alvin)
- Documentation for feature (Albert/Alvin)
- Blog post
...
The developer wants to load webpage click and view data (customer id, timestamp, action, url) into a partitioned fileset. After loading the data, the developer wants to de-duplicate records and calculate how many times each customer clicked and viewed over the past hour, past day, and past month.
User Stories:
- (3.4) A developer should be able to create pipelines that contain aggregations (GROUP BY -> count/sum/unique)
- (3.5) A developer should be able to create a pipeline with multiple sources, with one running after the other, and to control which parts of the pipeline run before others. For example, one source -> sink branch running before another source -> sink branch.
- (3.4) A developer should be able to use a Spark ML job as a pipeline stage
- (3.4) A developer should be able to rerun failed pipeline runs without reconfiguring the pipeline
- (3.4) A developer should be able to de-duplicate records in a pipeline
- (3.5) A developer should be able to join multiple branches of a pipeline
- (3.5) A developer should be able to use an Explore action as a pipeline stage
- (3.5) A developer should be able to create pipelines that contain Spark Streaming jobs
- (3.5) A developer should be able to create pipelines that run based on various conditions, including input data availability and Kafka events
Design:
Story 1: Group By -> Aggregations
Option 1:
Introduce a new plugin type "aggregation". In general, to support more plugin types in a generic way, we want to refactor the config:
Code Block |
---|
Option 1: { "stages": [ { "name": "inputTable", "plugin": { "name": "Table", "type": "batchsource", // new field "properties": { } } }, { "name": "aggStage", "plugin": { "name": "RecordAggregatorGroupByAggregate", "type": "aggregationaggregator", // new plugin type "properties": { "groupBy": "id", "functions": "{[ "totalPrice": { "namecolumnName": "sumtotalPrice", "propertiesplugin": { "columnname": "pricesum", } "properties": { }, "column": "price" "numTransactions": { } "name": "count" } } }", } { } } ], "connectionscolumnName": ["numTransactions", { "from": "inputTable", "to": "aggStage" } ] } Option 2"plugin": { "sources": [ { "name": "inputTablecount", "plugin": { } "name": "Table", "type": "batchsource", } // new field ]" "properties": { } } } ], "aggregationsconnections": [ { "namefrom": "aggStageinputTable", "groupBy"to": "idaggStage", } "aggregations": [ { "columnName": "totalPrice", "plugin": { ] } |
One problem with this is that the plugin property "functions" is itself JSON describing the plugins to use. This is not easy for somebody to configure by hand, though it could perhaps be simplified by a UI widget type.
Java APIs for plugin developers. This is basically MapReduce, so 'Aggregation' is probably a bad name for it. We need to see whether this fits into Spark. Would we have to remove the emitters?
Code Block |
---|
public abstract class Aggregation<GROUP_BY, RECORD_TYPE, OUTPUT_TYPE> {
public abstract void groupBy(RECORD_TYPE input, Emitter<GROUP_BY> emitter);
public abstract void aggregate(GROUP_BY groupKey, Iterable<RECORD_TYPE> groupRecords, Emitter<OUTPUT_TYPE> emitter);
}
@Plugin(type = "aggregation")
@Name("GroupByAggregate")
public class GroupByAggregate extends Aggregation<StructuredRecord, StructuredRecord, StructuredRecord> {
private final AggConfig config;
public static class AggConfig extends PluginConfig {
private String groupBy;
// ideally this would be Map<String, FunctionConfig> functions
private String functions;
}
public void configurePipeline(PipelineConfigurer configurer) {
Map<String, FunctionConfig> functions = gson.fromJson(config.functions, MAP_TYPE);
// for each function: usePlugin(id, type, name, properties);
}
public void groupBy(StructuredRecord input, Emitter<StructuredRecord> emitter) {
// key = new record from input with only fields in config.groupBy
Set<String> fields = new HashSet<>(Arrays.asList(config.groupBy.split(",")));
emitter.emit(recordSubset(input, fields));
}
public void initialize() {
Map<String, FunctionConfig> functions = gson.fromJson(config.functions, MAP_TYPE);
// for each function: instantiate the corresponding aggregationFunction plugin
}
public void aggregate(StructuredRecord groupKey, Iterable<StructuredRecord> groupRecords, Emitter<StructuredRecord> emitter) {
// reset all functions
for (StructuredRecord record : groupRecords) {
// for each function: function.update(record);
}
// build record from group key and function values
// for each function: val = function.aggregate();
// emit record
}
}
public abstract class AggregationFunction<RECORD_TYPE, OUTPUT_TYPE> {
public abstract void reset();
public abstract void update(RECORD_TYPE record);
public abstract OUTPUT_TYPE aggregate();
}
@Plugin(type = "aggregationFunction")
@Name("sum")
public class SumAggregation extends AggregationFunction<StructuredRecord, Number> {
private final SumConfig config;
private Number sum;
public static class SumConfig extends PluginConfig {
private String column;
}
public void update(StructuredRecord record) {
// get type of config.column, initialize sum to right type based on that
sum += (cast to the correct numeric type) record.get(config.column);
}
public Number aggregate() {
return sum;
}
} |
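For reference, the "count" function used in the example config could be written against the same AggregationFunction API. This is a minimal illustrative sketch, not a reviewed implementation:
Code Block |
---|
@Plugin(type = "aggregationFunction")
@Name("count")
public class CountAggregation extends AggregationFunction<StructuredRecord, Number> {
  private long count;

  @Override
  public void reset() {
    count = 0;
  }

  @Override
  public void update(StructuredRecord record) {
    // every record in the group contributes 1 to the count
    count++;
  }

  @Override
  public Number aggregate() {
    return count;
  }
} |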
Note: This would also satisfy user story 5 (de-duplication), since a unique can be implemented as an Aggregation plugin: group by the fields you want to be unique, ignore the Iterable<> in aggregate, and just emit the group key.
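A rough sketch of that idea, assuming the Aggregation API above and the recordSubset helper used in GroupByAggregate (names and imports are illustrative only):
Code Block |
---|
@Plugin(type = "aggregation")
@Name("Unique")
public class UniqueAggregate extends Aggregation<StructuredRecord, StructuredRecord, StructuredRecord> {
  private final UniqueConfig config;

  public static class UniqueConfig extends PluginConfig {
    // comma-separated list of fields that should be unique in the output
    private String fields;
  }

  public UniqueAggregate(UniqueConfig config) {
    this.config = config;
  }

  @Override
  public void groupBy(StructuredRecord input, Emitter<StructuredRecord> emitter) {
    // the group key is just the subset of fields to de-duplicate on
    Set<String> fields = new HashSet<>(Arrays.asList(config.fields.split(",")));
    emitter.emit(recordSubset(input, fields));
  }

  @Override
  public void aggregate(StructuredRecord groupKey, Iterable<StructuredRecord> groupRecords,
                        Emitter<StructuredRecord> emitter) {
    // ignore the grouped records and emit the key once, dropping duplicates
    emitter.emit(groupKey);
  }
} |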
Story 2: Control Flow (Not Reviewed, WIP)
Option 1: Introduce different types of connections: one for data flow, one for control flow.
Code Block |
---|
{
"stages": [
{
"name": "customersTable",
"plugin": {
"name": "Database",
"type": "batchsource", ...
}
},
{
"name": "customersFiles",
"plugin": {
"name": "TPFSParquet",
"type": "batchsink", ...
}
},
{
"name": "purchasesTable",
"plugin": {
"name": "Database",
"type": "batchsource"
}
},
{
"name": "purchasesFiles",
"plugin": {
"name": "TPFSParquet",
"type": "batchsink", ...
}
},
],
"connections": [
{ "from": "customersTable", "to": "customersFiles", "type": "data" },
{ "from": "customersFiles", "to": "purchasesTable", "type": "control" },
{ "from": "purchasesTable", "to": "purchasesFiles", "type": "data" }
]
} |
An alternative could be to introduce the concept of "phases", where connections between phases are always control flow and connections within a phase are data flow:
Code Block |
---|
{ "phases": [ { "name": "phase1", "stages": [ { "name": "sumcustomersTable", "propertiesplugin": { "columnname": "priceDatabase", "type": }"batchsource", ... } }, { "columnNamename": "numTransactionscustomersFiles", "plugin": { "name": "count" }: { } "name": "TPFSParquet", ] } ], "connectionstype": [ "batchsink", ... { "from": "inputTable", "to": "aggStage" } ], } public abstract class Aggregation<INPUT_TYPE, GROUP_BY, RECORD_TYPE, OUTPUT_TYPE> { "connections": [ public abstract groupBy(INPUT_TYPE input, Emitter<KeyValue<GROUP_BY, RECORD_TYPE>> emitter); public abstract aggregate(GROUP_BY groupKey, Iterable<RECORD_TYPE> groupRecords, Emitter<OUTPUT_TYPE> emitter); } @Plugin(type = "aggregation") @Name("record") public RecordAggregation extends Aggregation<StructuredRecord, StructuredRecord, StructuredRecord, StructuredRecord> { private static final AggConfig config;{ "from": "customersTable", "to": "customersFiles" } ] }, { "name": "phase2", "stages": [ public static class{ AggConfig extends PluginConfig { private String groupBy; "name": "purchasesTable", // ideally this would be Map<String, FunctionConfig> functions "plugin": { private String functions; } public void configurePipeline(PipelineConfigurer configurer) {"name": "Database", Map<String, FunctionConfig> functions = gson.fromJson(config.functions, MAP_TYPE); "type": "batchsource" for each function: } usePlugin(id, type, name, properties); }, public groupBy(StructuredRecord input, Emitter<KeyValue<StructuredRecord,{ StructuredRecord>> emitter) { // key = new record from input with only fields in config.groupBy"name": "purchasesFiles", // emitter.emit(new KeyValue<>(key, input));"plugin": { } public aggregate(StructuredRecord groupKey, Iterable<StructuredRecord> groupRecords, Emitter<StructuredRecord> emitter) { "name": "TPFSParquet", Map<String, FunctionConfig> functions = gson.fromJson(config.functions, MAP_TYPE);"type": "batchsink", ... for each function: } val = function.aggregate(groupRecords); for} (StructuredRecord record : groupRecords) { ], function.update(record); "connections": [ } // build record{ "from group key and function values": "purchasesTable", "to": "purchasesFiles" } ] for each function: } ] val = function.aggregate();"connections": [ // emit record { "from": "phase1", "to": "phase2" } ] } public abstract class AggregationFunction<RECORD_TYPE, OUTPUT_TYPE> { public abstract void update(RECORD_TYPE record); public abstract OUTPUT_TYPE aggregate(); } @Plugin(type = "aggregationFunction") @Name("sum") public SumAggregation extends AggregationFunction<StructuredRecord, Number> { private final SumConfig config; private Number sum; |
Option 2: Make it so that connections into certain plugin types imply control flow rather than data flow. For example, introduce a "condition" plugin type; connections into a condition imply control flow rather than data flow. Similarly, connections into an "action" plugin type would imply control flow.
Code Block |
---|
{ "stages": [ { "name": "customersTable", "plugin": { "name": "Database", public static class SumConfig extends PluginConfig {"type": "batchsource", ... private} String column; }, public void update(StructuredRecord record) { // get type of config.column, initialize sum to right type based on that "name": "customersFiles", "plugin": { sum += record; "name": "TPFSParquet", } public Number aggregate() { "type": "batchsink", ... return sum; } } |
Story 2: Multiple Sources
Option 1: Introduce the concept of "phases". Each phase has its own DAG. Phases can be connected, with connections between phases denoting control flow rather than data flow.
Code Block |
---|
{ "phases": [ }, { "name": "phase1afterDump", "stagesplugin": [{ { "name": "AlwaysRun", "nametype": "customersTablecondition", } "plugin": { }, { "name": "DatabasepurchasesTable", "type"plugin": "batchsource", ...{ "name": "Database", } "type": "batchsource" }, } }, { { "name": "customersFilespurchasesFiles", "plugin": { "name": "TPFSParquet", "type": "batchsink", ... } } }, ], "connections": [ { "from": "customersTable", "to": "customersFiles" } ] }, { "namefrom": "phase2customersFiles", "stagesto": [ : "afterDump" }, { "from": "afterDump", "to": "purchasesTable" }, { "namefrom": "purchasesTable", "to": "purchasesFiles" } "plugin": { "name": "Database", "type": "batchsource" } }, { ] } |
You could also say that connections into a source imply control flow, or that connections into an action imply control flow. More generally, certain plugin types (a run condition, for example) would imply control flow, while other plugin types imply data flow.
Story 3: Spark ML in a pipeline
Add a plugin type "sparksink" that is treated like a sink. When present, a spark program will be used to read data, transform it, then send all transformed results to the sparksink plugin.
Code Block |
---|
{ "stages": [ { "name": "purchasesFilescustomersTable", "plugin": { "name": "TPFSParquetDatabase", "type": "batchsinkbatchsource", ... } }, }{ ]"name": "categorizer", "connectionsplugin": [{ { "fromname": "purchasesTableSVM", "totype": "purchasesFilessparksink", }... ]} }, ] { "connections": [ { "from": "phase1", "toname": "phase2models", } ] } |
Option2: Introduce "condition" plugin type. Connections into a condition imply control flow rather than data flow.
Code Block |
---|
{ "stages": [ {"plugin": { "name": "Table", "nametype": "customersTablebatchsink", ... } "plugin": { }, ], "nameconnections": "Database",[ { "typefrom": "batchsourcecustomersTable", ... "to": "categorizer" } ] }} |
Story 6: Join (Not Reviewed, WIP)
Add a join plugin type. Different implementations could be inner join, left outer join, etc.
Code Block |
---|
{ }, "stages": [ { "name": "customersFilescustomers", "plugin": { "name": "TPFSParquetTable", "type": "batchsinkbatchsource", ... } }, { "name": "afterDumppurchases", "plugin": { "name": "AlwaysRunTable", "type": "conditionbatchsource", ... } }, { "name": "purchasesTablecustomerPurchaseJoin", "plugin": { "name": "Databaseinner", "type": "batchsourcejoin", } "properties": { }, { "nameleft": "purchasesFiles", "plugin": {customers.id", "nameright": "TPFSParquetpurchases.id", "typerename": "batchsink", ...customers.name:customername,purchases.name:itemname" } }, ], }, "connections": [ ... { "from": "customersTable", "to": "customersFiles" }, ], "connections": [ { "from": "customersFilescustomers", "to": "afterDumpcustomerPurchaseJoin" }, { "from": "afterDumppurchases", "to": "purchasesTablecustomerPurchaseJoin" }, { "from": "purchasesTablecustomerPurchaseJoin", "to": "purchasesFilessink" }, ] } |
Java API for the join plugin type: these might just be built into the app. Otherwise, the interface is basically the same as MapReduce.
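For illustration, a rough MapReduce-style sketch of what a join plugin interface could look like. This is hypothetical; the generics and method names are assumptions, not a reviewed design:
Code Block |
---|
public abstract class Join<JOIN_KEY, RECORD_TYPE, OUTPUT_TYPE> {
  // emit the key to join on for a record coming from the given input stage
  public abstract void joinOn(String stageName, RECORD_TYPE input, Emitter<JOIN_KEY> emitter);

  // given all records that share a join key, grouped by the stage they came from,
  // emit the joined output records (inner, left outer, etc. decide what to emit)
  public abstract void merge(JOIN_KEY joinKey, Map<String, List<RECORD_TYPE>> recordsByStage,
                             Emitter<OUTPUT_TYPE> emitter);
} |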