Order By is an aggregator plugin that sorts rows on the fields you specify, each in ascending or descending order.
Use-case
A user is processing a web access log and, as part of the data pipeline, is aggregating response codes. The user uses the “GROUP BY” aggregation plugin in the pipeline to count the web responses in each response category. Once the whole file has been aggregated, the user wants to sort the results by count before they are written to the output file.
User Stories
The user should be able to specify a single field or a composite of several fields on which to sort the output records.
The user should be able to specify a field of a basic type from within a nested structure as a sort field.
The user should specify, for each field, whether the records are sorted in ascending or descending order.
The user can specify only basic types (String, Int, Long, Short, Float, Double, Byte) as sort keys; if any other type is specified, an error is thrown to notify the user.
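The last two user stories can be illustrated with a minimal sketch. It assumes a record is represented as nested `Map` objects (a stand-in for StructuredRecord), that nested fields are addressed with a dotted path such as `address.zip`, and that a null value is rejected like any other unsupported type. The class `SortKeyValidator` and its methods are hypothetical names for illustration, not part of the plugin.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper: resolves a (possibly nested) sort field and checks its type. */
final class SortKeyValidator {
  // The basic types listed in the user stories, as their Java boxed classes.
  private static final Set<Class<?>> ALLOWED = new HashSet<>(Arrays.asList(
      String.class, Integer.class, Long.class, Short.class,
      Float.class, Double.class, Byte.class));

  /** Follows a dotted path (an assumed notation) through nested maps. */
  static Object resolve(Map<String, Object> record, String path) {
    Object current = record;
    for (String part : path.split("\\.")) {
      if (!(current instanceof Map) || !((Map<?, ?>) current).containsKey(part)) {
        throw new IllegalArgumentException("Field '" + path + "' does not exist");
      }
      current = ((Map<?, ?>) current).get(part);
    }
    return current;
  }

  /** Returns the value as a sort key, or throws if it is not of a basic type. */
  static Comparable<?> sortKey(Map<String, Object> record, String path) {
    Object value = resolve(record, path);
    if (value == null || !ALLOWED.contains(value.getClass())) {
      throw new IllegalArgumentException("Field '" + path + "' has unsupported sort type: "
          + (value == null ? "null" : value.getClass().getSimpleName()));
    }
    return (Comparable<?>) value;
  }

  public static void main(String[] args) {
    Map<String, Object> address = new HashMap<>();
    address.put("zip", 94306L);
    Map<String, Object> record = new HashMap<>();
    record.put("lastName", "Trump");
    record.put("address", address);

    System.out.println(sortKey(record, "lastName"));    // a String key
    System.out.println(sortKey(record, "address.zip")); // a nested Long key
    try {
      sortKey(record, "address"); // a nested structure is not a basic type
    } catch (IllegalArgumentException e) {
      System.out.println(e.getMessage());
    }
  }
}
```

Rejecting an unsupported key when the sort field is resolved gives the user the error early, rather than a failure in the middle of the sort.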
Example
The following simple example shows how Order By works.
Input
First Name | Last Name | Age | Zip Code
Joltie     | Root      | 29  | 32826
Henry      | Zilka     | 62  | 96789
Baby       | Trump     | 10  | 76563
Donald     | Trump     | 70  | 34566
Ivanka     | Trump     | 34  | 94306
Bipasha    | Basu      | 39  | 67543
BabyII     | Trump     | 10  | 32816
Configuration is specified as follows
Input Schema
First Name, String
Last Name, String
Age, Int
Zip Code, Long
Sort by
Last Name, Ascending
Age, Ascending
Zip Code, Descending
Output Schema
First Name, String
Last Name, String
Age, Int
Zip Code, Long
Output is as follows
First Name | Last Name | Age | Zip Code
Bipasha    | Basu      | 39  | 67543
Joltie     | Root      | 29  | 32826
Baby       | Trump     | 10  | 76563
BabyII     | Trump     | 10  | 32816
Ivanka     | Trump     | 34  | 94306
Donald     | Trump     | 70  | 34566
Henry      | Zilka     | 62  | 96789
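The ordering in this example can be checked with a plain in-memory sort. The sketch below is not the plugin itself: it feeds the input rows through a chained `java.util.Comparator` that mirrors the "Sort by" configuration (Last Name ascending, Age ascending, Zip Code descending). The class names `OrderByExample` and `Person` are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Reproduces the example with an in-memory sort (illustration only, not the plugin). */
final class OrderByExample {
  static final class Person {
    final String firstName;
    final String lastName;
    final int age;
    final long zipCode;

    Person(String firstName, String lastName, int age, long zipCode) {
      this.firstName = firstName;
      this.lastName = lastName;
      this.age = age;
      this.zipCode = zipCode;
    }
  }

  // Last Name ascending, then Age ascending, then Zip Code descending.
  static final Comparator<Person> ORDER =
      Comparator.comparing((Person p) -> p.lastName)
          .thenComparingInt(p -> p.age)
          .thenComparing(Comparator.comparingLong((Person p) -> p.zipCode).reversed());

  /** The input rows from the example, in their original order. */
  static List<Person> exampleInput() {
    return Arrays.asList(
        new Person("Joltie", "Root", 29, 32826L),
        new Person("Henry", "Zilka", 62, 96789L),
        new Person("Baby", "Trump", 10, 76563L),
        new Person("Donald", "Trump", 70, 34566L),
        new Person("Ivanka", "Trump", 34, 94306L),
        new Person("Bipasha", "Basu", 39, 67543L),
        new Person("BabyII", "Trump", 10, 32816L));
  }

  /** Returns a sorted copy, leaving the input list untouched. */
  static List<Person> sorted(List<Person> input) {
    List<Person> copy = new ArrayList<>(input);
    copy.sort(ORDER);
    return copy;
  }

  public static void main(String[] args) {
    for (Person p : sorted(exampleInput())) {
      System.out.println(p.firstName + ", " + p.lastName + ", " + p.age + ", " + p.zipCode);
    }
  }
}
```

The two rows that tie on Last Name and Age (Baby and BabyII) are separated only by the third key, which is why the larger Zip Code comes first.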
Implementation Tips
Investigate how the ‘Group Comparator’ and ‘Sort Comparator’ work together and how they can be used to achieve the functionality of this plugin.
Build a simple MapReduce program to understand how the above functionality works: implement a Sort Comparator using StructuredRecord.
If the above works, the Data Pipeline Application Template needs to be modified to set the sort comparator class, and this change should not affect the other plugins.
Design
The Order By plugin will use the secondary sort technique to sort the values (in ascending or descending order) passed to each reducer.
A skeleton of the composite key used by the plugin is shown below:
CompositeKeyWritable.java
/**
 * Custom Writable for the composite key.
 */
public class CompositeKeyWritable implements WritableComparable<CompositeKeyWritable> {
  private String structuredRecordJSON; // StructuredRecord, received as a JSON string from the mapper.
  private String sortFieldsJSON; // List of fields to sort by, received as a JSON string from the mapper.

  // write(DataOutput) and readFields(DataInput), required by Writable, are omitted from this skeleton.

  /**
   * This comparator controls the sort order of the keys.
   */
  @Override
  public int compareTo(CompositeKeyWritable other) {
    // Compare the StructuredRecord objects parsed from the JSON strings,
    // field by field, in the order and direction given by the sort fields.
  }
}
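The comparison step left open in `compareTo` could look roughly like the following sketch. It is a plain Java stand-in under stated assumptions, not the plugin's implementation: records are assumed to be already parsed from JSON into `Map` objects, every sort field is assumed to hold a non-null value of the same basic type in both records, and `SortFieldComparator`, `SortField` and `record` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for the comparison step of the composite key. */
final class SortFieldComparator implements Comparator<Map<String, Object>> {
  /** One "Sort by" entry: a field name and its direction. */
  static final class SortField {
    final String name;
    final boolean descending;

    SortField(String name, boolean descending) {
      this.name = name;
      this.descending = descending;
    }
  }

  private final List<SortField> sortFields;

  SortFieldComparator(List<SortField> sortFields) {
    this.sortFields = sortFields;
  }

  @Override
  @SuppressWarnings({"unchecked", "rawtypes"})
  public int compare(Map<String, Object> left, Map<String, Object> right) {
    for (SortField field : sortFields) {
      // Assumption: both values are non-null and of the same Comparable type.
      Comparable a = (Comparable) left.get(field.name);
      Comparable b = (Comparable) right.get(field.name);
      // Swap the operands for descending order instead of negating the result.
      int result = field.descending ? b.compareTo(a) : a.compareTo(b);
      if (result != 0) {
        return result; // the first field that differs decides the order
      }
    }
    return 0; // equal on every sort field
  }

  /** Builds a record with the field names used in the example. */
  static Map<String, Object> record(String lastName, int age, long zipCode) {
    Map<String, Object> r = new HashMap<>();
    r.put("Last Name", lastName);
    r.put("Age", age);
    r.put("Zip Code", zipCode);
    return r;
  }

  public static void main(String[] args) {
    List<Map<String, Object>> rows = new ArrayList<>(Arrays.asList(
        record("Trump", 10, 32816L), record("Basu", 39, 67543L), record("Trump", 10, 76563L)));
    rows.sort(new SortFieldComparator(Arrays.asList(
        new SortField("Last Name", false),
        new SortField("Age", false),
        new SortField("Zip Code", true))));
    for (Map<String, Object> row : rows) {
      System.out.println(row.get("Last Name") + ", " + row.get("Age") + ", " + row.get("Zip Code"));
    }
  }
}
```

Driving the comparison from a list of sort fields keeps the key class independent of any particular schema, which matches the idea of passing the sort fields along with each record.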