...
A MapReduce program can interact with a dataset by using it as an input or an output. The dataset needs to implement specific interfaces to support this, as described in the following sections.
A Dataset as the Input Source of a MapReduce Program
When you run a MapReduce program, you can configure it to read its input from a dataset. The source dataset must implement the BatchReadable interface, which requires two methods:
...
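To make the shape of the interface concrete, here is a minimal sketch using simplified stand-ins for CDAP's Split and SplitReader types (the real classes live in the CDAP API and carry additional methods such as initialization and cleanup); the InMemoryDataset class and its data are hypothetical:

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Simplified stand-in for CDAP's Split: one partition of the dataset.
class Split {
    final List<String> rows;
    Split(List<String> rows) { this.rows = rows; }
}

// Simplified stand-in for CDAP's SplitReader: iterates one split.
abstract class SplitReader<KEY, VALUE> {
    public abstract void initialize(Split split);
    public abstract boolean nextKeyValue();
    public abstract KEY getCurrentKey();
    public abstract VALUE getCurrentValue();
}

// The two methods BatchReadable requires: getSplits() partitions the data
// so that splits can be processed in parallel, and createSplitReader()
// returns a reader over a single split.
interface BatchReadable<KEY, VALUE> {
    List<Split> getSplits();
    SplitReader<KEY, VALUE> createSplitReader(Split split);
}

// A toy in-memory dataset implementing the interface (hypothetical).
class InMemoryDataset implements BatchReadable<Integer, String> {
    private final List<String> data = Arrays.asList("a", "b", "c", "d");

    @Override
    public List<Split> getSplits() {
        // Partition the four rows into two splits of two rows each.
        return Arrays.asList(new Split(data.subList(0, 2)),
                             new Split(data.subList(2, 4)));
    }

    @Override
    public SplitReader<Integer, String> createSplitReader(Split split) {
        return new SplitReader<Integer, String>() {
            private Iterator<String> it;
            private int key = -1;
            private String value;

            @Override public void initialize(Split s) { it = s.rows.iterator(); }
            @Override public boolean nextKeyValue() {
                if (!it.hasNext()) return false;
                key++;
                value = it.next();
                return true;
            }
            @Override public Integer getCurrentKey() { return key; }
            @Override public String getCurrentValue() { return value; }
        };
    }
}
```

The framework asks the dataset for its splits, then drives one SplitReader per split, feeding each key/value pair to a Mapper.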
```java
@UseDataSet("myTable")
KeyValueTable kvTable;
...

@Override
public void initialize() throws Exception {
  MapReduceContext context = getContext();
  ...
  context.addInput(Input.ofDataset("myTable", kvTable.getSplits(16, startKey, stopKey)));
}
```
A Dataset as the Output Destination of a MapReduce Program
Just as a MapReduce program can read its input from a dataset, it can also write to a dataset as its output destination, provided that dataset implements the BatchWritable interface:
...
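As a sketch of what this looks like, here is a simplified stand-in for the BatchWritable interface together with a toy in-memory implementation (the InMemoryWritable class is hypothetical; the real interface lives in the CDAP API):

```java
import java.util.HashMap;
import java.util.Map;

// Simplified stand-in for CDAP's BatchWritable: the single method a dataset
// must provide to serve as the output destination of a MapReduce program.
interface BatchWritable<KEY, VALUE> {
    void write(KEY key, VALUE value);
}

// A toy in-memory dataset implementing the interface (hypothetical).
class InMemoryWritable implements BatchWritable<String, Integer> {
    final Map<String, Integer> rows = new HashMap<>();

    @Override
    public void write(String key, Integer value) {
        // Persist one key/value pair emitted by the Reducer.
        rows.put(key, value);
    }
}
```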
The write() method is used to redirect all writes performed by a Reducer to the dataset. Again, the KEY and VALUE type parameters must match the output key and value type parameters of the Reducer.
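To illustrate the type constraint, here is a minimal sketch in which a reducer emits byte[] keys and byte[] values that match the dataset's type parameters; the ByteStore and SumReducer classes are hypothetical stand-ins, not CDAP classes:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a byte[]-keyed table: its write() accepts
// a byte[] key and a byte[] value.
class ByteStore {
    final Map<String, byte[]> rows = new HashMap<>();
    void write(byte[] key, byte[] value) { rows.put(new String(key), value); }
}

// A word-count reducer whose output types are byte[]/byte[]. Because they
// match the dataset's KEY and VALUE parameters, its writes can be
// redirected straight to the dataset.
class SumReducer {
    void reduce(String word, Iterable<Integer> counts, ByteStore out) {
        int sum = 0;
        for (int c : counts) sum += c;
        out.write(word.getBytes(), Integer.toString(sum).getBytes());
    }
}
```

If the reducer's output types did not match the dataset's type parameters, the program would fail to type-check.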
Multiple Output Destinations of a MapReduce Program
To write to multiple output datasets from a MapReduce program, begin by adding the datasets as outputs:
...
Note that the multiple-output write method, MapReduceTaskContext.write(String, KEY key, VALUE value), can only be used if the program has multiple outputs. Similarly, the single-output write method, MapReduceTaskContext.write(KEY key, VALUE value), can only be used if the program has a single output.
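The routing behavior of the multiple-output form can be sketched as follows. The context class below is a hypothetical stand-in that only records where each write goes (the real MapReduceTaskContext is provided by the CDAP runtime), and the output names "frequentWords" and "rareWords" are made up for the example:

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for MapReduceTaskContext that records which named
// output each key/value pair was written to.
class MapReduceTaskContext<KEY, VALUE> {
    final Map<String, List<Map.Entry<KEY, VALUE>>> written = new HashMap<>();

    // Multiple-output form: the first argument selects the destination dataset.
    void write(String output, KEY key, VALUE value) {
        written.computeIfAbsent(output, k -> new ArrayList<>())
               .add(new AbstractMap.SimpleEntry<>(key, value));
    }
}

// A reducer that routes each record to one of two outputs based on its count.
class RoutingReducer {
    void reduce(String word, int count, MapReduceTaskContext<String, Integer> ctx) {
        // Frequent words go to one dataset, rare words to another
        // (threshold and dataset names are arbitrary for this sketch).
        ctx.write(count >= 10 ? "frequentWords" : "rareWords", word, count);
    }
}
```

Each named output passed to write() must previously have been added as an output of the program.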
...