...

Note that the source has output more than the 1 million records it is configured to output. Remember that Spark populates its cache during a job, not as a separate job. This means that caching cannot be relied upon to prevent multiple reads from a source: when branches run as concurrent jobs, each job may scan the source before the cache has been filled. In most situations, it is not worth caching anything in a pipeline with parallel execution enabled. As such, parallel execution should only be used if multiple reads from the source are not a problem.
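The sketch below illustrates this behavior under some assumptions: the source path, column names, and session setup are hypothetical, and the two branches stand in for the pipeline's parallel branches. Because `cache()` is lazy, the cache is only populated while the first job that touches the data is running, so two jobs launched at the same time can each read the source independently.

```python
# A minimal sketch (hypothetical paths and columns) of why cache() does not
# prevent double reads when two Spark jobs are launched in parallel.
from concurrent.futures import ThreadPoolExecutor

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parallel-cache-sketch").getOrCreate()

# cache() is lazy: nothing is stored until a job actually computes the data.
source = spark.read.parquet("/data/source").cache()

branch_a = source.filter("amount > 100")
branch_b = source.groupBy("category").count()

# Launching both actions at once starts two jobs. Neither finds the cache
# populated yet, so each may scan /data/source on its own.
with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(branch_a.count), pool.submit(branch_b.count)]
    results = [f.result() for f in futures]

print(results)
```

With parallel execution disabled, the second branch would run only after the first job had finished materializing the cache, so only one read of the source would occur.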

Also note that one branch has processed twice as many records as the other. Even though both Spark jobs are launched at the same time, they may not begin at exactly the same moment. In addition, if your cluster does not have enough resources to run both jobs in their entirety simultaneously, one job may receive more resources than the other, causing them to finish at different times. In this example, with two Spark jobs running in parallel, the cluster must be roughly twice as large as the one needed with parallel execution disabled. More generally, a pipeline with N parallel jobs needs a cluster roughly N times as large to take full advantage of the parallelism.
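As a rough sizing illustration (the numbers below are purely hypothetical, not taken from this pipeline), the resource requirement scales linearly with the number of jobs you want to run fully in parallel:

```python
# Illustrative back-of-the-envelope sizing, not a recommendation:
# if one branch alone needs `executors_per_job` executors to run without
# queuing, N parallel jobs need roughly N times that many.
executors_per_job = 10   # hypothetical: executors one job needs on its own
parallel_jobs = 2        # number of branches launched at the same time

required_executors = executors_per_job * parallel_jobs
print(f"~{required_executors} executors needed to run "
      f"{parallel_jobs} jobs fully in parallel")
```

If the cluster provides fewer executors than this, the jobs still run, but they compete for the same slots and one of them will typically progress faster than the other, as seen above.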

...