Provisioners create, initialize, and destroy the cloud environment that pipelines run in. Each provisioner exposes configuration settings that control what type of cluster it creates and deletes. Different provisioners create different types of clusters:

  • Google Dataproc: A fast, easy-to-use, fully managed cloud service for running Apache Spark and Apache Hadoop clusters in a simple, cost-efficient way.

  • Amazon Elastic MapReduce: A managed Hadoop framework that makes it easy, fast, and cost-effective to process vast amounts of data across dynamically scalable Amazon EC2 instances.

  • Remote Hadoop: Runs jobs on a pre-existing Hadoop cluster, whether on premises or in the cloud.
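
The create, initialize, and destroy lifecycle described above can be illustrated with a minimal sketch. This is not the product's actual API: the class names, method names, setting keys, and status strings below are all hypothetical, chosen only to show how one provisioner might create and delete a cluster while another attaches to a cluster that already exists.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class Cluster:
    """Hypothetical record describing the environment a pipeline runs in."""
    name: str
    status: str


class Provisioner(ABC):
    """Hypothetical base class: each provisioner is driven by its settings."""

    def __init__(self, settings: dict):
        # Configuration settings control what kind of cluster is handled.
        self.settings = settings

    @abstractmethod
    def create(self, name: str) -> Cluster:
        """Make a cluster available for the pipeline run."""

    def initialize(self, cluster: Cluster) -> Cluster:
        # Assumption: initialization simply marks the cluster as ready.
        cluster.status = "RUNNING"
        return cluster

    @abstractmethod
    def destroy(self, cluster: Cluster) -> Cluster:
        """Release the cluster after the pipeline run."""


class ManagedClusterProvisioner(Provisioner):
    """Stand-in for a provisioner that creates and deletes its own cluster."""

    def create(self, name: str) -> Cluster:
        return Cluster(name=name, status="CREATED")

    def destroy(self, cluster: Cluster) -> Cluster:
        cluster.status = "DELETED"
        return cluster


class RemoteClusterProvisioner(Provisioner):
    """Stand-in for a provisioner that uses a pre-existing cluster."""

    def create(self, name: str) -> Cluster:
        # Nothing is created; the cluster already exists.
        return Cluster(name=name, status="EXISTING")

    def destroy(self, cluster: Cluster) -> Cluster:
        # Assumption: a pre-existing cluster is left in place, not deleted.
        cluster.status = "DETACHED"
        return cluster


def run_lifecycle(provisioner: Provisioner, name: str) -> list:
    """Walk one cluster through the lifecycle and record each status."""
    statuses = []
    cluster = provisioner.create(name)
    statuses.append(cluster.status)
    provisioner.initialize(cluster)
    statuses.append(cluster.status)
    provisioner.destroy(cluster)
    statuses.append(cluster.status)
    return statuses
```

Calling run_lifecycle with each provisioner shows the same three steps producing different outcomes, which is the sense in which different provisioners handle different types of clusters.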


Created in 2020 by Google Inc.