...

The first Spark job consists of the DataGenerator, Wrangler, and Trash stages. You can tell this is the case because the metrics for those stages update and complete before the metrics for the later stages start to change.

...

In this example, the metrics for DataGenerator, Wrangler, and Trash grow to the expected number of output records while Wrangler2 and Trash2 stay at zero. The second Spark job consists of the DataGenerator, Wrangler2, and Trash2 stages, and it runs only after the first job completes.
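This sequential behavior can be reproduced outside the pipeline framework. The following Scala sketch is a hypothetical, standalone illustration (the object name, column expressions, and record count are invented, and plain Spark actions stand in for the pipeline's sinks); it builds a branched computation with two actions, which Spark runs as two jobs, one after the other:

```scala
import org.apache.spark.sql.SparkSession

// Standalone sketch: a branched computation with two sinks runs as two
// sequential Spark jobs, because each sink corresponds to its own action.
object BranchJobsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("branch-jobs-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Stand-in for the DataGenerator source stage: 1 million generated records.
    val source = spark.range(1000000L).toDF("id")

    // Stand-ins for the two Wrangler branches.
    val wrangled  = source.withColumn("id_plus_one", $"id" + 1)
    val wrangled2 = source.withColumn("id_times_two", $"id" * 2)

    // Each action launches its own job. The first job (DataGenerator ->
    // Wrangler -> Trash) must finish before the second job
    // (DataGenerator -> Wrangler2 -> Trash2) starts.
    wrangled.count()   // first Spark job
    wrangled2.count()  // second Spark job

    spark.stop()
  }
}
```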

...

Caching

In the branch example above, even though the DataGenerator source stage is part of two Spark jobs, its output record count stays at 1 million instead of growing to 2 million. This means the source is not processed twice, which is achieved using Spark caching. As the first job runs, it caches the output of the DataGenerator stage so that the second job can reuse it.

This matters in most pipelines because it prevents the source from being read multiple times, which is especially important when the source can return different data at different times. For example, suppose a pipeline reads from a database table that is constantly being updated. If both Spark jobs read from the table, they would read different data because they would launch at different times. Caching ensures that only a single read is done, so all branches see the same data. Caching is also desirable for sources like the BigQuery source, where each read carries a monetary cost.
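The following Scala sketch shows the same idea in plain Spark. It is a hypothetical illustration, not the pipeline's actual implementation: the Parquet path and filter expressions are invented, and count() stands in for the sinks. Marking the source with cache() means the first job materializes the cached data and the second job reuses it instead of reading the source again:

```scala
import org.apache.spark.sql.SparkSession

// Sketch of the caching idea: cache the source once so the second job reuses
// the cached partitions rather than re-reading the underlying data.
object SourceCachingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("source-caching-sketch")
      .master("local[*]")
      .getOrCreate()

    // In the pipeline this would be the DataGenerator (or a database/BigQuery)
    // source stage; here a Parquet path stands in for it.
    val source = spark.read.parquet("/tmp/source-table").cache()

    val branch1 = source.filter("id % 2 = 0")   // Wrangler  -> Trash branch
    val branch2 = source.filter("id % 2 != 0")  // Wrangler2 -> Trash2 branch

    // First job reads the source once and populates the cache.
    branch1.count()
    // Second job reads the cached partitions, so the source is not read again
    // and both branches see exactly the same data.
    branch2.count()

    source.unpersist()
    spark.stop()
  }
}
```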

...