...
User should be able to run a spark job that operates as a transform in ETL Batch scenario.
API Design:
Below is the SparkTransform class that users can implement to define their Spark transform plugin.
Because it extends BatchConfigurablePipelineConfigurable, users will be able to override:
- configurePipeline - in order to add datasets and streams needed
- prepareRun - configure the job before starting the run onRunFinish - perform any end of run logic
The SparkMLLib also exposes a transform method, which gives access to an RDD representing data from previous stages of the pipeline.
- transform - perform any transformations and return an RDD representing the transformed RDD.
Code Block | ||||
---|---|---|---|---|
| ||||
/** * Spark Transform stage. * * @param <IN> Type of input object * @param <OUT> Type of output object */ @Beta public abstract class SparkTransform<IN, OUT> extendsimplements BatchConfigurable<SparkPluginContext>PipelineConfigurable, implements Serializable { public static final String PLUGIN_TYPE = "sparktransform"; private static final long serialVersionUID = -8600555200583639593L8156450728774382658L; /** * Configure an ETL pipeline. * * @param pipelineConfigurer the configurer used to add required datasets and streams * @throws IllegalArgumentException if the given config is invalid * User Spark code which will be executed to compute the transformed RDD/ @Override public void configurePipeline(PipelineConfigurer pipelineConfigurer) throws IllegalArgumentException { //no-op } /** * Transform the input and return the output to be sent to the next stage in the pipeline. * * @param context {@link SparkPluginContext} for this job * @param input the input from previous stages of the Batch run. input data to be transformed * @throws Exception if there's an error during this method invocation */ public abstract JavaRDD<OUT> transform(SparkPluginContext context, JavaRDD<IN> input) throws Exception; } |
Users will have access to a
SparkPluginContext
which exposes functionality that would be available to a regular Spark program via SparkContext, except it excludes the following methods.:
- getMetrics
- getWorkflowToken
- getTaskLocalizationContext
- getSpecification
- getServiceDiscoverer
- setExecutorResources
Implementation Summary:
- SmartWorkflow will break up any stages of SparkTransform type into its own phase to be run in ETLSpark, instead of ETLMapReduce. It will be broken up into its own phase because ETLSpark doesn't support certain sources and transforms.Because SparkTransform extends BatchConfigurable, its prepareRun and onFinish will automatically be handled by ETLSpark, though we will need to special case for the type
- of context passed inThere will be special casing in ETLSparkProgram to call the user's SparkTransform class's transform method after reading from the source and before writing to the sink.