A MapReduce program is used to process data in batch. MapReduce programs can be written as in a conventional Hadoop system. Additionally, CDAP datasets can be accessed from MapReduce as both input and output.

...

or call addWorkflow() in your application and include your MapReduce in the workflow definition:

Code Block
public void configure() {
  ...
  // Run a MapReduce on the acquired data using a workflow
  addWorkflow(new PurchaseHistoryWorkflow());

...

The configure method is similar to the one found in applications. It defines the name and description of the MapReduce program. You can also specify the resources (memory and virtual cores) used by the mappers and reducers.
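As a sketch, assuming CDAP's AbstractMapReduce and Resources APIs (the program name, description, and resource values here are illustrative, not from the original):

```java
public class PurchaseHistoryBuilder extends AbstractMapReduce {

  @Override
  public void configure() {
    // Name and description shown in the CDAP UI and APIs
    setName("PurchaseHistoryBuilder");
    setDescription("Builds a per-user purchase history in batch");

    // Resources(memoryMB, virtualCores) requested per task;
    // these example values are placeholders
    setMapperResources(new Resources(1024, 1));
    setReducerResources(new Resources(2048, 2));
  }
}
```

This keeps resource tuning in the program definition itself rather than in cluster-wide Hadoop settings.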

The initialize() method is invoked at runtime, before the MapReduce is executed. Through the getContext() method you can obtain an instance of the MapReduceContext. It allows you to specify datasets to be used as input or output; it also provides access to the actual Hadoop job configuration, as though you were running the MapReduce directly on Hadoop. For example, you can specify the input and output datasets, the mapper and reducer classes, as well as the intermediate data format:
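A minimal sketch of such an initialize() method, assuming CDAP's Input/Output helpers and MapReduceContext.getHadoopJob(); the dataset names ("purchases", "history") and the mapper/reducer classes are hypothetical:

```java
@Override
public void initialize() throws Exception {
  MapReduceContext context = getContext();

  // Use CDAP datasets as the job's input and output
  context.addInput(Input.ofDataset("purchases"));
  context.addOutput(Output.ofDataset("history"));

  // Obtain the underlying Hadoop job to configure it directly
  Job job = context.getHadoopJob();
  job.setMapperClass(PurchaseMapper.class);
  job.setReducerClass(PerUserReducer.class);

  // Intermediate (map-output) key/value format
  job.setMapOutputKeyClass(Text.class);
  job.setMapOutputValueClass(Text.class);
}
```

Because getHadoopJob() exposes the standard org.apache.hadoop.mapreduce.Job, any configuration valid in plain Hadoop can be applied here as well.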

...