...

Pipelines are executed on Hadoop clusters. Hadoop is a collection of open source technologies that read, process, and store data in a distributed fashion. Understanding a bit about Hadoop clusters will help you understand how pipelines are executed in parallel. Hadoop is associated with many different systems, but pipelines use the two most basic ones: HDFS (Hadoop Distributed File System) and YARN (Yet Another Resource Negotiator). HDFS stores intermediate data, and YARN coordinates workloads across the cluster.
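To make the division of labor concrete, here is a minimal sketch in the Hadoop Streaming style (an assumption; the pipelines described here may use a different execution model). The mapper and reducer are plain functions over text records: on a real cluster, YARN would schedule many mapper and reducer tasks in parallel, and HDFS would hold the shuffled intermediate data between the two phases.

```python
from itertools import groupby

def mapper(lines):
    """Map phase: emit a (word, 1) pair for every token.
    On a cluster, many mapper tasks run in parallel, each
    over its own split of the input stored in HDFS."""
    for line in lines:
        for word in line.split():
            yield word, 1

def reducer(pairs):
    """Reduce phase: sum the counts for each word.
    The sort here stands in for the Hadoop shuffle, which
    groups intermediate pairs by key before reducing."""
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    # Local smoke test; on a cluster the two phases would run as
    # separate tasks coordinated by YARN, not chained in one process.
    counts = dict(reducer(mapper(["hdfs yarn hdfs"])))
    print(counts)  # {'hdfs': 2, 'yarn': 1}
```

The key point is that neither function knows about the cluster: parallelism comes from YARN launching many copies of each phase, with HDFS as the shared storage between them.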

...