Dynamic resource configuration

Users want to create dynamic pipelines for greater reusability and ease of operations. This guide walks you through configuring compute resources for pipelines at runtime.

Background

Batch pipelines can be orchestrated using the MapReduce or Spark engine. Engine resources (CPU and memory) are typically configured at design time and can be changed at runtime from the UI in the Resources tab. In addition, users can change the compute profile from the UI.

For dynamic pipelines, these resources should instead be configured via runtime arguments. The sections below show how to configure them.

Solution

Configuring Compute Profile

The compute profile can be configured at runtime using the system.profile.name runtime argument (preference). The value should include the scope and the profile name separated by a colon: scope:profileName.

The following example starts the pipeline called BQMerge on a profile called dp10 in system scope:
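
Below is a minimal sketch of one way to do this through the CDAP lifecycle REST API; the host (localhost:11015), the default namespace, and the DataPipelineWorkflow program name are assumptions to adjust for your environment:

import requests

# Assumptions: CDAP router reachable at localhost:11015 and the pipeline
# deployed in the "default" namespace. Batch pipelines run as the
# DataPipelineWorkflow program.
CDAP_HOST = "http://localhost:11015"
NAMESPACE = "default"
PIPELINE = "BQMerge"

# Runtime argument in scope:profileName form; the scope here is assumed
# to be SYSTEM for the system-scoped profile "dp10".
runtime_args = {"system.profile.name": "SYSTEM:dp10"}

url = (f"{CDAP_HOST}/v3/namespaces/{NAMESPACE}/apps/{PIPELINE}"
       f"/workflows/DataPipelineWorkflow/start")
requests.post(url, json=runtime_args).raise_for_status()

The same key/value pair can also be entered in the UI's runtime arguments dialog or saved as a preference so that every run uses the profile.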

Note: In 6.1.2, any configuration prefixed with system.profile is filtered out in the UI. This is fixed in 6.1.3.

Configuring Engine Resources

The engine resources, CPU (cores) and memory, can be configured using runtime arguments (preferences). To configure resources for the Spark driver and executors, use the following options (see the sketch after this list):

  • task.driver.system.resources.memory to configure the memory for the Spark driver.

    • Memory is configured in megabytes.

    • Example: Setting task.driver.system.resources.memory to 2048 sets the driver memory to 2 GB (2048 MB).

  • task.driver.system.resources.cores to configure the CPU (cores) for the Spark driver.

    • By default, the driver CPU is set to 1 core.

    • Example: Setting task.driver.system.resources.cores to 2 sets the driver cores to 2.

  • task.executor.system.resources.memory to configure the memory for the Spark executors.

    • Memory is configured in megabytes.

    • Example: Setting task.executor.system.resources.memory to 2048 sets the executor memory to 2 GB (2048 MB).

  • task.executor.system.resources.cores to configure the CPU (cores) for the Spark executors.

    • By default, the executor CPU is set to 1 core.

    • Example: Setting task.executor.system.resources.cores to 2 configures 2 cores for all executors.
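
As a sketch, these options can be passed as runtime arguments when starting the run, here again via the CDAP lifecycle REST API; the host, namespace, and pipeline name are the same assumptions as in the earlier example:

import requests

# Assumptions: CDAP router at localhost:11015, pipeline "BQMerge" in the
# "default" namespace (adjust for your environment).
CDAP_HOST = "http://localhost:11015"
NAMESPACE = "default"
PIPELINE = "BQMerge"

# 2 GB / 2 cores for the driver, 4 GB / 2 cores per executor.
# Memory values are in megabytes.
runtime_args = {
    "task.driver.system.resources.memory": "2048",
    "task.driver.system.resources.cores": "2",
    "task.executor.system.resources.memory": "4096",
    "task.executor.system.resources.cores": "2",
}

url = (f"{CDAP_HOST}/v3/namespaces/{NAMESPACE}/apps/{PIPELINE}"
       f"/workflows/DataPipelineWorkflow/start")
requests.post(url, json=runtime_args).raise_for_status()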

Configuring Compute Resources for Dataproc

When running on a Dataproc compute profile, the cluster configuration can be overridden at runtime with the following arguments (see the sketch after this list):

  • system.profile.properties.serviceAccount to set the service account for the Dataproc cluster.

  • system.profile.properties.masterNumNodes to set the number of master nodes.

  • system.profile.properties.masterMemoryMB to set the memory per master node.

  • system.profile.properties.masterCPUs to set the number of CPUs for the master.

  • system.profile.properties.masterDiskGB to set the disk in GB per master node.

  • system.profile.properties.workerNumNodes to set the number of worker nodes.

  • system.profile.properties.workerMemoryMB to set the memory per worker node.

  • system.profile.properties.workerCPUs to set the number of CPUs per worker node.

  • system.profile.properties.workerDiskGB to set the disk in GB per worker node.

  • system.profile.properties.stackdriverLoggingEnabled set to true to enable Stackdriver logging for the pipelines.

  • system.profile.properties.stackdriverMonitoringEnabled set to true to enable Stackdriver monitoring for the pipelines.

  • system.profile.properties.imageVersion to configure the Dataproc image version.

  • system.profile.properties.network to configure the network for the Dataproc cluster.
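
The sketch below combines a few of these properties with the profile selection from earlier; the profile name (dp10), pipeline name (BQMerge), host, namespace, and the specific values shown are all assumptions for illustration:

import requests

# Assumptions carried over from the earlier sketches: CDAP router at
# localhost:11015, pipeline "BQMerge" in the "default" namespace, and a
# Dataproc profile named "dp10" in the SYSTEM scope.
CDAP_HOST = "http://localhost:11015"
NAMESPACE = "default"
PIPELINE = "BQMerge"

# Override the Dataproc cluster shape and logging for this run only.
runtime_args = {
    "system.profile.name": "SYSTEM:dp10",
    "system.profile.properties.workerNumNodes": "4",
    "system.profile.properties.workerCPUs": "4",
    "system.profile.properties.workerMemoryMB": "16384",
    "system.profile.properties.workerDiskGB": "500",
    "system.profile.properties.stackdriverLoggingEnabled": "true",
}

url = (f"{CDAP_HOST}/v3/namespaces/{NAMESPACE}/apps/{PIPELINE}"
       f"/workflows/DataPipelineWorkflow/start")
requests.post(url, json=runtime_args).raise_for_status()

Because these are ordinary runtime arguments, the same keys can also be set in the UI's runtime arguments dialog or stored as preferences.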
