{"type":"doc","content":[{"type":"paragraph","content":[{"text":"Apache Spark is used for in-memory cluster computing. It lets you load large sets of data into memory and query them repeatedly. This makes it suitable for both iterative and interactive programs. Similar to MapReduce, Spark can access datasets as both input and output. ","type":"text"},{"text":"Spark programs","type":"text","marks":[{"type":"em"}]},{"text":" in CDAP can be written in either Java or Scala.","type":"text"}]},{"type":"paragraph","content":[{"text":"To process data using Spark, specify ","type":"text"},{"text":"addSpark()","type":"text","marks":[{"type":"code"}]},{"text":" in your application specification:","type":"text"}]},{"type":"codeBlock","content":[{"text":"public void configure() {\n ...\n addSpark(new WordCountProgram());","type":"text"}]},{"type":"paragraph","content":[{"text":"It is recommended to implement the ","type":"text"},{"text":"Spark","type":"text","marks":[{"type":"code"}]},{"text":" interface by extending the ","type":"text"},{"text":"AbstractSpark","type":"text","marks":[{"type":"code"}]},{"text":" class, which allows the overriding of these three methods:","type":"text"}]},{"type":"bulletList","content":[{"type":"listItem","content":[{"type":"paragraph","content":[{"text":"configure()","type":"text","marks":[{"type":"code"}]}]}]},{"type":"listItem","content":[{"type":"paragraph","content":[{"text":"initialize()","type":"text","marks":[{"type":"code"}]}]}]},{"type":"listItem","content":[{"type":"paragraph","content":[{"text":"destroy()","type":"text","marks":[{"type":"code"}]}]}]}]},{"type":"paragraph","content":[{"text":"You can extend from the abstract class ","type":"text"},{"text":"AbstractSpark","type":"text","marks":[{"type":"code"}]},{"text":" to simplify the implementation:","type":"text"}]},{"type":"codeBlock","content":[{"text":"public class WordCountProgram extends AbstractSpark {\n @Override\n public SparkSpecification configure() {\n return SparkSpecification.Builder.with()\n .setName(\"WordCountProgram\")\n .setDescription(\"Calculates word frequency\")\n .setMainClassName(\"com.example.WordCounter\")\n .build();\n }\n ...\n}","type":"text"}]},{"type":"paragraph","content":[{"text":"The configure method is similar to the one found in service and MapReduce programs. It defines the name, description, and the class containing the Spark program to be executed by the Spark framework.","type":"text"}]},{"type":"paragraph","content":[{"text":"The ","type":"text"},{"text":"initialize()","type":"text","marks":[{"type":"code"}]},{"text":" method is invoked at runtime, before the Spark program is executed. Because many Spark programs do not need this method, the ","type":"text"},{"text":"AbstractSpark","type":"text","marks":[{"type":"code"}]},{"text":" class provides a default implementation that does nothing.","type":"text"}]},{"type":"paragraph","content":[{"text":"However, if your program requires it, you can override this method to obtain access to the ","type":"text"},{"text":"SparkConf","type":"text","marks":[{"type":"code"}]},{"text":" configuration and use it to set the ","type":"text"},{"text":"Spark properties","type":"text","marks":[{"type":"link","attrs":{"href":"https://spark.apache.org/docs/1.6.1/configuration.html"}}]},{"text":" for the program:","type":"text"}]},{"type":"codeBlock","content":[{"text":"import org.apache.spark.SparkConf;\n. . .\n@Override\nprotected void initialize() throws Exception {\n getContext().setSparkConf(new SparkConf().set(\"spark.driver.extraJavaOptions\", \"-XX:MaxDirectMemorySize=1024m\"));\n}","type":"text"}]},{"type":"paragraph","content":[{"text":"The ","type":"text"},{"text":"destroy()","type":"text","marks":[{"type":"code"}]},{"text":" method is invoked after the Spark program has finished. You could perform cleanup or send a notification of program completion, if that was required. Like ","type":"text"},{"text":"initialize()","type":"text","marks":[{"type":"code"}]},{"text":", since many Spark programs do not need this method, the ","type":"text"},{"text":"AbstractSpark","type":"text","marks":[{"type":"code"}]},{"text":" class also provides a default implementation for this method that does nothing.","type":"text"}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Spark and Resources","type":"text"}]},{"type":"paragraph","content":[{"text":"When a Spark program is configured, the resource requirements for both the Spark driver processes and the Spark executor processes can be set, both in terms of the amount of memory (in megabytes) and the number of virtual cores assigned.","type":"text"}]},{"type":"paragraph","content":[{"text":"If both the memory and the number of cores needs to be set, this can be done using:","type":"text"}]},{"type":"codeBlock","content":[{"text":"setExecutorResources(new Resources(1024, 2));","type":"text"}]},{"type":"paragraph","content":[{"text":"In this case, 1024 MB and two cores is assigned to each executor process.","type":"text"}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"CDAP Spark Program","type":"text"}]},{"type":"paragraph","content":[{"text":"The main class being set through the ","type":"text"},{"text":"setMainClass","type":"text","marks":[{"type":"code"}]},{"text":" or ","type":"text"},{"text":"setMainClassName","type":"text","marks":[{"type":"code"}]},{"text":" method inside the ","type":"text"},{"text":"Spark.configure()","type":"text","marks":[{"type":"code"}]},{"text":" method will be executed by the Spark framework. The main class must have one of these properties:","type":"text"}]},{"type":"orderedList","attrs":{"order":1},"content":[{"type":"listItem","content":[{"type":"paragraph","content":[{"text":"Extends from ","type":"text"},{"text":"SparkMain","type":"text","marks":[{"type":"code"}]},{"text":", if written in Scala","type":"text"}]}]},{"type":"listItem","content":[{"type":"paragraph","content":[{"text":"Have a ","type":"text"},{"text":"def main(args: Array[String])","type":"text","marks":[{"type":"code"}]},{"text":" method, if written in Scala","type":"text"}]}]},{"type":"listItem","content":[{"type":"paragraph","content":[{"text":"Implements ","type":"text"},{"text":"JavaSparkMain","type":"text","marks":[{"type":"code"}]},{"text":", if written in Java","type":"text"}]}]},{"type":"listItem","content":[{"type":"paragraph","content":[{"text":"Have a ","type":"text"},{"text":"public static void main(String[] args)","type":"text","marks":[{"type":"code"}]},{"text":" method, if written in Java","type":"text"}]}]}]},{"type":"paragraph","content":[{"text":"A user program is responsible for creating a ","type":"text"},{"text":"SparkContext","type":"text","marks":[{"type":"code"}]},{"text":" or ","type":"text"},{"text":"JavaSparkContext","type":"text","marks":[{"type":"code"}]},{"text":" instance, either inside the ","type":"text"},{"text":"run","type":"text","marks":[{"type":"code"}]},{"text":" methods of ","type":"text"},{"text":"SparkMain","type":"text","marks":[{"type":"code"}]},{"text":" or ","type":"text"},{"text":"JavaSparkMain","type":"text","marks":[{"type":"code"}]},{"text":", or inside their ","type":"text"},{"text":"main","type":"text","marks":[{"type":"code"}]},{"text":" methods.","type":"text"}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"CDAP SparkExecutionContext","type":"text"}]},{"type":"paragraph","content":[{"text":"CDAP provides a ","type":"text"},{"text":"SparkExecutionContext","type":"text","marks":[{"type":"code"}]},{"text":", which is needed to access datasets and to interact with CDAP services such as metrics and service discovery. It is only available to Spark programs that are extended from either ","type":"text"},{"text":"SparkMain","type":"text","marks":[{"type":"code"}]},{"text":" or ","type":"text"},{"text":"JavaSparkMain","type":"text","marks":[{"type":"code"}]},{"text":".","type":"text"}]},{"type":"paragraph","content":[{"text":"Scala:","type":"text"}]},{"type":"codeBlock","content":[{"text":"class MyScalaSparkProgram extends SparkMain { \n override def run(implicit sec: SparkExecutionContext): Unit = {\n val sc = new SparkContext\n val RDD[(String, String)] = sc.fromDataset(\"mydataset\")\n ...\n }\n}","type":"text"}]},{"type":"paragraph","content":[{"text":"Java:","type":"text"}]},{"type":"codeBlock","content":[{"text":"public class MyJavaSparkProgram implements JavaSparkMain {\n @Override\n public void run(JavaSparkExecutionContext sec) {\n JavaSparkContext jsc = new JavaSparkContext();\n JavaPairRDD rdd = sec.fromDataset(\"mydataset\");\n ...\n }\n}","type":"text"}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Spark and Datasets","type":"text"}]},{"type":"paragraph","content":[{"text":"Spark programs in CDAP can directly access ","type":"text"},{"text":"datasets","type":"text","marks":[{"type":"strong"}]},{"text":" similar to the way a MapReduce can. These programs can create Spark's Resilient Distributed Dataset (RDD) by reading a dataset and can also write RDD to a dataset. In Scala, implicit objects are provided for reading and writing datasets directly through the ","type":"text"},{"text":"SparkContext","type":"text","marks":[{"type":"code"}]},{"text":" and ","type":"text"},{"text":"RDD","type":"text","marks":[{"type":"code"}]},{"text":" objects.","type":"text"}]},{"type":"paragraph","content":[{"text":"In order to access a dataset in Spark, both the key and value classes have to be serializable. Otherwise, Spark will fail to read or write them. For example, the Table dataset has a value type of Row, which is not serializable. An ","type":"text"},{"text":"ObjectStore","type":"text","marks":[{"type":"code"}]},{"text":" dataset can be used, provided its classes are serializable.","type":"text"}]},{"type":"bulletList","content":[{"type":"listItem","content":[{"type":"paragraph","content":[{"text":"Creating an RDD from a dataset:","type":"text"}]},{"type":"bulletList","content":[{"type":"listItem","content":[{"type":"paragraph","content":[{"text":"Scala:","type":"text"}]},{"type":"codeBlock","content":[{"text":"val sc = new SparkContext\nval purchaseRDD = sc.readFromDataset[Array[Byte], Purchase](\"purchases\");","type":"text"}]}]},{"type":"listItem","content":[{"type":"paragraph","content":[{"text":"Java","type":"text"}]},{"type":"codeBlock","content":[{"text":"JavaSparkContext jsc = new JavaSparkContext();\nJavaPairRDD purchaseRDD = sec.fromDataset(\"purchases\");","type":"text"}]},{"type":"paragraph"}]}]}]},{"type":"listItem","content":[{"type":"paragraph","content":[{"text":"Writing an RDD to a dataset:","type":"text"}]},{"type":"bulletList","content":[{"type":"listItem","content":[{"type":"paragraph","content":[{"text":"Scala","type":"text"}]},{"type":"codeBlock","content":[{"text":"purchaseRDD.saveAsDataset(\"purchases\")","type":"text"}]},{"type":"paragraph"}]},{"type":"listItem","content":[{"type":"paragraph","content":[{"text":"Java","type":"text"}]},{"type":"codeBlock","content":[{"text":"sec.saveAsDataset(purchaseRDD, \"purchases\");","type":"text"}]}]}]}]}]},{"type":"panel","attrs":{"panelType":"note"},"content":[{"type":"paragraph","content":[{"text":"Note","type":"text","marks":[{"type":"strong"}]},{"text":": Spark programs can read or write to datasets in different namespaces using Cross Namespace Dataset Access by passing a ","type":"text"},{"text":"String","type":"text","marks":[{"type":"code"}]},{"text":" containing the namespace as an additional parameter before the dataset name parameter. (By default, if the namespace parameter is not supplied, the namespace in which the program runs is used.).","type":"text"}]}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Spark and Services","type":"text"}]},{"type":"paragraph","content":[{"text":"Spark programs in CDAP, including worker nodes, can discover Services. Service Discovery by worker nodes ensures that if an endpoint changes during the execution of a Spark program, due to failure or another reason, worker nodes will see the most recent endpoint.","type":"text"}]},{"type":"paragraph","content":[{"text":"Here is an example of service discovery in a Spark program:","type":"text"}]},{"type":"codeBlock","content":[{"text":"final ServiceDiscoverer serviceDiscover = sec.getServiceDiscoverer();\nJavaPairRDD ranksRaw = ranks.mapToPair(new PairFunction,\n byte[], Integer>() {\n @Override\n public Tuple2 call(Tuple2 tuple) throws Exception {\n URL serviceURL = serviceDiscover.getServiceURL(SparkPageRankApp.GOOGLE_TYPE_PR_SERVICE_NAME);\n if (serviceURL == null) {\n throw new RuntimeException(\"Failed to discover service: \" +\n SparkPageRankApp.GOOGLE_TYPE_PR_SERVICE_NAME);\n }\n try {\n URLConnection connection = new URL(serviceURL, String.format(\"transform/%s\",\n tuple._2().toString())).openConnection();\n try (\n BufferedReader reader = new BufferedReader(new InputStreamReader(connection.getInputStream(), Charsets.UTF_8))\n ) {\n String pr = reader.readLine();\n return new Tuple2(tuple._1().getBytes(Charsets.UTF_8), Integer.parseInt(pr));\n }\n } catch (Exception e) {\n LOG.warn(\"Failed to read the stream for service {}\",\n SparkPageRankApp.GOOGLE_PR_SERVICE, e);\n throw Throwables.propagate(e);\n }\n }\n});","type":"text"}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Spark Metrics","type":"text"}]},{"type":"paragraph","content":[{"text":"Spark programs in CDAP emit metrics, similar to a MapReduce program. CDAP collect system metrics emitted by Spark and display them in the CDAP UI. This helps in monitoring the progress and resources used by a Spark program. You can also emit custom user metrics from the worker nodes of your Spark program:","type":"text"}]},{"type":"codeBlock","content":[{"text":"final Metrics sparkMetrics = sc.getMetrics();\nJavaPairRDD ranksRaw = ranks.mapToPair(new PairFunction,\n byte[], Integer>() {\n @Override\n public Tuple2 call(Tuple2 tuple) throws Exception {\n if (tuple._2() > 100) {\n sparkMetrics.count(MORE_THAN_100_KEY, 1);\n }\n }\n});","type":"text"}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Spark in Workflows","type":"text"}]},{"type":"paragraph","content":[{"text":"Spark programs in CDAP can also be added to a workflow, similar to a MapReduce. The Spark program can get information about the workflow through the ","type":"text"},{"text":"SparkExecutionContext.getWorkflowInfo","type":"text","marks":[{"type":"code"}]},{"text":" method.","type":"text"}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Transactions and Spark","type":"text"}]},{"type":"paragraph","content":[{"text":"When a Spark program interacts with datasets, CDAP will automatically create a long-running transaction that covers the Spark job execution. A Spark job refers to a Spark action and any tasks that need to be executed to evaluate the action (see ","type":"text"},{"text":"Spark Job Scheduling","type":"text","marks":[{"type":"link","attrs":{"href":"https://spark.apache.org/docs/1.6.1/job-scheduling.html#scheduling-within-an-application"}}]},{"text":" for details).","type":"text"}]},{"type":"paragraph","content":[{"text":"You can also control the transaction scope yourself explicitly. It's useful when you want multiple Spark actions to be committed in the same transaction. For example, in Kafka Spark Streaming, you can persist the Kafka offsets together with the changes in the datasets in the same transaction to obtain exactly-once processing semantics.","type":"text"}]},{"type":"paragraph","content":[{"text":"When using an ","type":"text"},{"text":"explicit","type":"text","marks":[{"type":"em"}]},{"text":" transaction, you can access a dataset directly by calling the ","type":"text"},{"text":"getDataset()","type":"text","marks":[{"type":"code"}]},{"text":" method of the ","type":"text"},{"text":"DatasetContext","type":"text","marks":[{"type":"code"}]},{"text":" provided to the transaction. However, the dataset acquired through ","type":"text"},{"text":"getDataset()","type":"text","marks":[{"type":"code"}]},{"text":" cannot be used through a function closure. See the section on ","type":"text"},{"text":"Using Datasets in Programs","type":"text","marks":[{"type":"link","attrs":{"href":"/wiki/spaces/DOCS/pages/596476941"}}]},{"text":" for additional information.","type":"text"}]},{"type":"paragraph","content":[{"text":"Here is an example of using an explicit transaction in Spark:","type":"text"}]},{"type":"paragraph","content":[{"text":"Scala:","type":"text"}]},{"type":"codeBlock","content":[{"text":"// Perform multiple operations in the same transaction\nTransaction {\n // Create a standard wordcount RDD\n val wordCountRDD = sc.fromStream[String](\"stream\")\n .flatMap(_.split(\" \"))\n .map((_, 1))\n .reduceByKey(_ + _)\n\n // Save those words that have count > 10 to the \"aboveten\" dataset\n wordCountRDD\n .filter(_._2 > 10)\n .saveAsDataset(\"aboveten\")\n\n // Save all wordcount to an \"allcounts\" dataset\n wordCountRDD.saveAsDataset(\"allcounts\")\n\n // Updates to both the \"aboveten\" and \"allcounts\" datasets will be committed within the same transaction\n}\n\n// Perform RDD operations together with direct dataset access in the same transaction\nTransaction((datasetContext: DatasetContext) => {\n sc.fromDataset[String, Int](\"source\")\n .saveAsDataset(\"sink\")\n\n val table: Table = datasetContext.getDataset(\"copyCount\")\n table.increment(new Increment(\"source\", \"sink\", 1L))\n})","type":"text"}]},{"type":"paragraph","content":[{"text":"Java:","type":"text"}]},{"type":"codeBlock","content":[{"text":"@Override\npublic void run(JavaSparkExecutionContext sec) throws Exception {\n // Perform RDD operations together with direct dataset access in the same transaction\n sec.execute(new TransactionRunnable(sec));\n}\n\nstatic class TransactionRunnable implements TxRunnable, Serializable {\n\n private final JavaSparkExecutionContext sec;\n\n public TransactionRunnable(JavaSparkExecutionContext sec) {\n this.sec = sec;\n }\n\n @Override\n public void run(DatasetContext context) throws Exception {\n JavaPairRDD source = sec.fromDataset(\"source\");\n sec.saveAsDataset(source, \"sink\");\n\n Table table = context.getDataset(\"copyCount\");\n table.increment(new Increment(\"source\", \"sink\", 1L));\n }\n}","type":"text"}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Spark Versions","type":"text"}]},{"type":"paragraph","content":[{"text":"CDAP allows you to write Spark programs using either ","type":"text"},{"text":"Spark 2 or Spark","type":"text","marks":[{"type":"annotation","attrs":{"annotationType":"inlineComment","id":"18ec7659-ad19-4d41-9485-d2b76a076d66"}}]},{"text":" 3 with Scala 2.12. ","type":"text"}]},{"type":"paragraph","content":[{"text":"To use it, you must add the ","type":"text"},{"text":"cdap-api-spark3_2.12","type":"text","marks":[{"type":"code"}]},{"text":" Maven dependency:","type":"text"}]},{"type":"codeBlock","content":[{"text":". . .\n\n io.cdap.cdap\n cdap-api-spark3_2.12\n ${cdap.version}\n\n. . .","type":"text"}]},{"type":"paragraph"},{"type":"paragraph"}],"version":1}

Browser not supported