Spark Computation in Scala Analytics
The Spark Computation in Scala analytics plugin is available in the Hub.
Executes user-provided Spark code in Scala that transforms RDD to RDD with full access to all Spark features.
Use this plugin when you want complete control over the Spark computation. For example, you may want to join the input RDD with another Dataset and select a subset of the join result using Spark SQL.
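The join-and-filter scenario can be sketched with the DataFrame form of the transform function described under Configuration. This is a minimal illustration, not the plugin's prescribed approach. The input columns name and country_code, the view names, and the in-memory lookup table are all hypothetical; in a real pipeline the second dataset would be loaded from wherever it actually lives.

```scala
import org.apache.spark.sql.DataFrame

// Sketch only: column names, view names and lookup rows are assumptions.
def transform(df: DataFrame): DataFrame = {
  val spark = df.sparkSession
  import spark.implicits._

  // Stand-in for "another Dataset": a small hard-coded lookup table.
  val lookup = Seq(
    ("US", "United States", "NA"),
    ("FR", "France", "EU"),
    ("DE", "Germany", "EU")
  ).toDF("code", "country_name", "region")

  // Register both sides as temporary views so Spark SQL can reference them.
  df.createOrReplaceTempView("input_records")
  lookup.createOrReplaceTempView("country_lookup")

  // Join, then keep only a subset of the joined rows and columns.
  spark.sql(
    """SELECT i.name, l.country_name
      |FROM input_records i
      |JOIN country_lookup l ON i.country_code = l.code
      |WHERE l.region = 'EU'""".stripMargin)
}
```

The returned columns (here name and country_name) would need to match the output schema configured for the stage.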
Configuration
Property | Macro Enabled? | Description |
---|---|---|
Scala | Yes | Required. Spark code in Scala defining how to transform RDD to RDD. The code must implement a function called transform. The accepted signatures, an example, and the automatically included imports are listed under "Scala Code" after this table. |
Dependencies | Yes | Optional. Extra dependencies for the Spark program, given as a comma-separated list of URIs locating the dependency jars. A path may end with an asterisk (*) as a wildcard, in which case all files with the extension .jar under the parent path are included. |
Compile at Deployment Time | No | Optional. Whether to compile the code at deployment time. Turning it off is useful when some library classes are available only at run time, not at deployment time. Default is true. |

Scala Code

The transform function must have one of the following signatures:

    def transform(df: DataFrame) : DataFrame
    def transform(df: DataFrame, context: SparkExecutionPluginContext) : DataFrame

To operate on the lower-level RDD instead, use one of these forms:

    def transform(rdd: RDD[StructuredRecord]) : RDD[StructuredRecord]
    def transform(rdd: RDD[StructuredRecord], context: SparkExecutionPluginContext) : RDD[StructuredRecord]

For example:

    def transform(rdd: RDD[StructuredRecord], context: SparkExecutionPluginContext) : RDD[StructuredRecord] = {
      val outputSchema = context.getOutputSchema
      rdd
        .flatMap(_.get[String]("body").split("\\s+"))
        .map(s => (s, 1))
        .reduceByKey(_ + _)
        .map(t => StructuredRecord.builder(outputSchema).set("word", t._1).set("count", t._2).build)
    }

This performs a word count on the input field body and produces records with two fields, word and count.

The following imports are included automatically and are ready for the user code to use:

    import io.cdap.cdap.api.data.format._
    import io.cdap.cdap.api.data.schema._
    import io.cdap.cdap.etl.api.batch._
    import org.apache.spark._
    import org.apache.spark.api.java._
    import org.apache.spark.rdd._
    import org.apache.spark.sql._
    import org.apache.spark.SparkContext._
    import scala.collection.JavaConversions._
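The word count shown for the RDD signature could also be written against the DataFrame signature. The following is a minimal sketch under two assumptions: the input has a string column named body, and the stage's output schema declares word as a string and count as a long, since Spark's grouped count() yields a long column named count.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, explode, split}

// Sketch only: assumes a string input column "body".
def transform(df: DataFrame): DataFrame =
  df
    // Split each body on whitespace and emit one row per token.
    .select(explode(split(col("body"), "\\s+")).as("word"))
    // Count occurrences per token; the result columns are "word" and "count".
    .groupBy("word")
    .count()
```

Compared with the RDD form, this version does not build StructuredRecord instances by hand, so the column names and types must line up with the configured output schema.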
Created in 2020 by Google Inc.