Spark Program in Python Action

The Spark Program in Python action plugin is available in the Hub.

Executes user-provided Spark code in Python. You can use this plugin when you want to run arbitrary Spark code.

Configuration

Property

Macro Enabled?

Description

Property

Macro Enabled?

Description

Python

Yes

Required. The self-contained Spark application written in Python. For example, the Naive Bayes Machine Learning from the official Spark documentation can be written as:

# Import libraries from pyspark import * from pyspark.mllib.classification import NaiveBayes, NaiveBayesModel from pyspark.mllib.linalg import Vectors from pyspark.mllib.regression import LabeledPoint def parseLine(line): parts = line.split(',') label = float(parts[0]) features = Vectors.dense([float(x) for x in parts[1].split(' ')]) return LabeledPoint(label, features) sc = SparkContext() data = sc.textFile("${input.path}").map(parseLine) # Split data aproximately into training (60%) and test (40%) training, test = data.randomSplit([0.6, 0.4], seed=0) # Train a naive Bayes model. model = NaiveBayes.train(training, 1.0) # Make prediction and test accuracy. predictionAndLabel = test.map(lambda p: (model.predict(p.features), p.label)) accuracy = 1.0 * predictionAndLabel.filter(lambda (x, v): x == v).count() / test.count() # Save and load model model.save(sc, "${output.path}")

With the input.path and output.path being provided at runtime.

Extra python libraries

Yes

Optional. Extra libraries for the PySpark program. It is a ',' separated list of URI for the locations of extra .egg, .zip and .py libraries.

Created in 2020 by Google Inc.