Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

  •  User stories documented (Albert/Alvin)
  •  User stories reviewed (Nitin)
  •  Design documented (Albert/Alvin)
  •  Design reviewed (Andreas/Terence)
  •  Feature merged (Albert/Alvin)
  •  Examples and guides (Albert/Alvin)
  •  Integration tests (Albert/Alvin) 
  •  Documentation for feature (Albert/Alvin)
  •  Blog post

...

  1. (3.4) A developer should be able to create pipelines that contain aggregations (GROUP BY -> count/sum/unique)
  2. (3.5) A developer should be able to control some parts of the pipeline running before others. For example, one source -> sink branch running before another source -> sink branch.
  3. (3.54) A developer should be able to use a Spark ML job as a pipeline stage
  4. (3.4) A developer should be able to rerun failed pipeline runs without reconfiguring the pipeline
  5. (3.4) A developer should be able to de-duplicate records in a pipeline
  6. (3.5) A developer should be able to join multiple branches of a pipeline
  7. (3.5) A developer should be able to use an Explore action as a pipeline stage
  8. (3.5) A developer should be able to create pipelines that contain Spark Streaming jobs
  9. (3.5) A developer should be able to create pipelines that run based on various conditions, including input data availability and Kafka events

...

Some problems with this is that the plugin property "functions" is itself a json describing plugins to use.  This is not easy for somebody to configure, but maybe it could be simplified by a UI widget type.

 

Option 2: Keep the current structure of the config, with plugin type at the top level.

Code Block
{
  "sources": [
    {
      "name": "inputTable",
      "plugin": {
        "name": "Table",
        "type": "batchsource",  // new field
        "properties": {
        }
      }
    }
  ],
  "aggregators": [
    {
      "name": "aggStage",
      "groupBy": "id",
      "functions": [
        {
          "columnName": "totalPrice",
          "plugin": {
            "name": "sum",
            "properties": {
              "column": "price"
            }
          }
        },
        {
          "columnName": "numTransactions",
          "plugin": {
            "name": "count"
          }
        }
      ]
    }
  ],
  "connections": [
    { "from": "inputTable", "to": "aggStage" } 
  ]
}

This would allow us to have more structure in the config, but then we lose the generalization of plugin types.

 

Java APIs for plugin developers.  It is basically mapreduce, 'Aggregation' is probably a bad name for this.  Need to see if this fits into Spark.  Would we have to remove the emitters?

...

Java APIs for plugin developers.  It is basically mapreduce, 'Aggregation' is probably a bad name for this.  Need to see if this fits into Spark.  Would we have to remove the emitters?

Code Block
public abstract class Aggregation<GROUP_BY, RECORD_TYPE, OUTPUT_TYPE> {
 
  public abstract void groupBy(RECORD_TYPE input, Emitter<GROUP_BY> emitter);
 
  public abstract void aggregate(GROUP_BY groupKey, Iterable<RECORD_TYPE> groupRecords, Emitter<OUTPUT_TYPE> emitter);
 
}

@Plugin(type = "aggregation")
@Name("GroupByAggregate")
public GroupByAggregate extends Aggregation<StructuredRecord, StructuredRecord, StructuredRecord> {
  private static final AggConfig config;
  
  public static class AggConfig extends PluginConfig {
    private String groupBy;
 
    // ideally this would be Map<String, FunctionConfig> functions
    private String functions;
  }
 
  public void configurePipeline(PipelineConfigurer configurer) {
    Map<String, FunctionConfig> functions = gson.fromJson(config.functions, MAP_TYPE);
    for each function:
      usePlugin(id, type, name, properties);
  }
 
  public groupBy(StructuredRecord input, Emitter<StructuredRecord> emitter) {
    // key = new record from input with only fields in config.groupBy
    Set<String> fields = config.groupBy.split(",");
    emitter.emit(recordSubset(input, fields));
  }
 
  public void initialize() {
    Map<String, FunctionConfig> functions = gson.fromJson(config.functions, MAP_TYPE);
    for each function:
      val = function.aggregate(groupRecords);
  }
 
  public void aggregate(StructuredRecord groupKey, Iterable<StructuredRecord> groupRecords, Emitter<StructuredRecord> emitter) {
    // reset all functions
    for (StructuredRecord record : groupRecords) {
      foreach function:
        function.update(record);
    }
    // build record from group key and function values
    for each function:
      val = function.aggregate();
    // emit record
  }
 
}
 
public abstract class AggregationFunction<RECORD_TYPE, OUTPUT_TYPE> {
  public abstract void reset();
  public abstract void update(RECORD_TYPE record);
  public abstract OUTPUT_TYPE aggregate();
}
 
@Plugin(type = "aggregationFunction")
@Name("sum")
public SumAggregation extends AggregationFunction<StructuredRecord, Number> {
  private final SumConfig config;
  private Number sum;
  
  public static class SumConfig extends PluginConfig {
    private String column;
  }
 
  public void update(StructuredRecord record) {
    // get type of config.column, initialize sum to right type based on that
    sum += (casted to correct thing) record.get(config.column);
  }
 
  public Number aggregate() {
    return sum;
  }
}

Note: This would also satisfy user story 5, where a unique can be implemented as a Aggregation plugin, where you group by the fields you want to unique, and ignore the Iterable<> in aggregate and just emit the group key.

Story 2: Control Flow (

...

Not Reviewed, WIP)

Option 1: Introduce different types of connections. One for data flow, one for control flow

...

Story 3: Spark ML in a pipeline

Add a plugin type "sparkMLsparksink" that is treated like a transform.  But instead of being a stage inside a mapper, it is a program in a workflow.  The application will create a transient dataset to act as the input into the program, or an explicit source can be givensink.  When present, a spark program will be used to read data, transform it, then send all transformed results to the sparksink plugin.

Code Block
{
  "stages": [
    {
      "name": "customersTable",
      "plugin": {
        "name": "Database",
        "type": "batchsource", ...
      }
    },    
    {
      "name": "categorizer",
      "plugin": {
        "name": "SVM",
        "type": "sparkMLsparksink", ...
      }
    },
    {
      "name": "models",
      "plugin": {
        "name": "Table",
        "type": "batchsink", ...
      }
    },
  ],
  "connections": [
    { "from": "customersTable", "to": "categorizer" },
 
  { "from": "categorizer", "to": "models" }
  ]
}

Story 6: Join (Not Reviewed, WIP)

Add a join plugin type.  Different implementations could be inner join, left outer join, etc.

...