...

Users should be able to run a Spark job that operates as a transform in an ETL Batch scenario.


API Design:

Below is the SparkTransform class that users can implement to define their Spark transform plugin.
Because it extends BatchConfigurable, users will be able to override:

  • configurePipeline - in order to add datasets and streams needed
  • prepareRun - configure the job before starting the run
  • onRunFinish - perform any end of run logic

SparkTransform also exposes a transform method, which gives access to an RDD representing data from previous stages of the pipeline.

  • transform - perform any transformations and return the transformed RDD

Code Block (java): SparkTransform

public abstract class SparkTransform<IN, OUT> extends BatchConfigurable<SparkPluginContext> implements Serializable {

  public static final String PLUGIN_TYPE = "sparktransform";

  private static final long serialVersionUID = -8600555200583639593L;

  /**
   * User Spark code which will be executed to compute the transformed RDD.
   *
   * @param context {@link SparkPluginContext} for this job
   * @param input the input from previous stages of the Batch run
   * @return the transformed {@link JavaRDD}
   */
  public abstract JavaRDD<OUT> transform(SparkPluginContext context, JavaRDD<IN> input) throws Exception;

}
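To illustrate what a user implementation of this class might look like, here is a hedged, self-contained sketch. SparkPluginContext, BatchConfigurable, and JavaRDD are List-backed stand-ins defined inline so the example compiles on its own (a real plugin would import them from the CDAP and Spark APIs), and NormalizeTransform is a hypothetical plugin, not part of the design.

```java
import java.io.Serializable;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Stand-ins for the CDAP/Spark types so this sketch is self-contained;
// a real plugin would import these from the CDAP and Spark APIs instead.
interface SparkPluginContext { }

abstract class BatchConfigurable<C> { }

// List-backed stand-in for Spark's JavaRDD, for illustration only.
class JavaRDD<T> {
  private final List<T> data;
  JavaRDD(List<T> data) { this.data = data; }
  <R> JavaRDD<R> map(Function<T, R> f) {
    return new JavaRDD<>(data.stream().map(f).collect(Collectors.toList()));
  }
  JavaRDD<T> filter(Predicate<T> p) {
    return new JavaRDD<>(data.stream().filter(p).collect(Collectors.toList()));
  }
  List<T> collect() { return data; }
}

abstract class SparkTransform<IN, OUT> extends BatchConfigurable<SparkPluginContext> implements Serializable {
  public static final String PLUGIN_TYPE = "sparktransform";
  public abstract JavaRDD<OUT> transform(SparkPluginContext context, JavaRDD<IN> input) throws Exception;
}

// Hypothetical user plugin: drops empty records and lower-cases the rest.
class NormalizeTransform extends SparkTransform<String, String> {
  @Override
  public JavaRDD<String> transform(SparkPluginContext context, JavaRDD<String> input) {
    return input.filter(s -> !s.isEmpty()).map(String::toLowerCase);
  }
}

public class Main {
  public static void main(String[] args) {
    JavaRDD<String> input = new JavaRDD<>(Arrays.asList("Hello", "", "WORLD"));
    List<String> output = new NormalizeTransform().transform(null, input).collect();
    System.out.println(output); // [hello, world]
  }
}
```

The user only implements transform; the framework supplies the context and the input RDD for the stage.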


Users will have access to a SparkPluginContext, which exposes the functionality that would be available to a regular Spark program via SparkContext, except that it excludes the following methods:

  • getMetrics
  • getWorkflowToken
  • getTaskLocalizationContext
  • getSpecification
  • getServiceDiscoverer
  • setExecutorResources


Implementation Summary:

  1. SmartWorkflow will break up any stage of SparkTransform type into its own phase, to be run in ETLSpark instead of ETLMapReduce. Each such stage is isolated into its own phase because ETLSpark doesn't support certain sources and transforms.
  2. Because SparkTransform extends BatchConfigurable, its prepareRun and onRunFinish will automatically be handled by ETLSpark, though we will need to special-case the type of context passed in.
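To make the lifecycle handling in point 2 concrete, here is a hedged sketch of the call order a driver would apply to a SparkTransform phase: prepareRun before the job, transform on the stage's data, then onRunFinish regardless of outcome. All types here are simplified stand-ins (Lists instead of RDDs, a stub context), not the real ETLSpark code; UpperCaseTransform and runPhase are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Simplified stand-ins; the real callbacks come from CDAP's BatchConfigurable.
interface SparkPluginContext { }

abstract class SparkTransform<IN, OUT> {
  public void prepareRun(SparkPluginContext context) { }                      // configure before the run
  public abstract List<OUT> transform(SparkPluginContext context, List<IN> input) throws Exception;
  public void onRunFinish(boolean succeeded, SparkPluginContext context) { }  // end-of-run logic
}

// Hypothetical stage used to exercise the lifecycle.
class UpperCaseTransform extends SparkTransform<String, String> {
  @Override
  public List<String> transform(SparkPluginContext context, List<String> input) {
    return input.stream().map(String::toUpperCase).collect(Collectors.toList());
  }
}

public class Main {
  // Hypothetical driver loop mimicking what ETLSpark would do for a SparkTransform phase.
  static <IN, OUT> List<OUT> runPhase(SparkTransform<IN, OUT> stage, List<IN> input) {
    SparkPluginContext context = new SparkPluginContext() { };
    stage.prepareRun(context);                               // 1. before the run starts
    boolean succeeded = false;
    try {
      List<OUT> output = stage.transform(context, input);    // 2. the user transform
      succeeded = true;
      return output;
    } catch (Exception e) {
      throw new RuntimeException(e);
    } finally {
      stage.onRunFinish(succeeded, context);                 // 3. always invoked, with the run's outcome
    }
  }

  public static void main(String[] args) {
    System.out.println(runPhase(new UpperCaseTransform(), Arrays.asList("a", "b"))); // [A, B]
  }
}
```

The try/finally shape is the key point: onRunFinish fires whether or not transform succeeded, with a flag indicating the outcome.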