Introduction
Order By is an aggregator plugin that sorts rows by the fields you specify, with each field sorted in either ascending or descending order.
Use-case
A user is processing a web access log and, as part of the data pipeline, is aggregating response codes. The user uses the “GROUP BY” aggregation plugin in the pipeline to count the web responses in each response category. At the end of the file, the user wants to sort the results by count before they are written to the file.
User Stories
- The user should be able to specify a single field or a composite of fields for sorting the records in the output
- The user should be able to specify a field (of a basic type) from a nested structure for sorting the records
- The user should specify, for each field, how the records should be sorted (ascending or descending)
- The user can specify only basic types (String, Int, Long, Short, Float, Double, Byte) as keys; if any other type is specified, an error is thrown to notify the user
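The last user story can be illustrated with a minimal sketch of the type check. It is written in plain Java and makes several assumptions: the class name `SortKeyTypes`, the `validate` method, and the use of Java wrapper classes to stand in for the schema types are all hypothetical and are not part of any existing plugin API.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, for illustration only: rejects sort keys whose type
// is not one of the basic types listed in the user stories.
class SortKeyTypes {
    // Java wrapper classes stand in for String, Int, Long, Short, Float, Double, Byte.
    private static final Set<Class<?>> ALLOWED = new HashSet<>(Arrays.asList(
        String.class, Integer.class, Long.class, Short.class,
        Float.class, Double.class, Byte.class));

    // Throws IllegalArgumentException when the field's type cannot be used as a sort key.
    static void validate(String fieldName, Class<?> fieldType) {
        if (!ALLOWED.contains(fieldType)) {
            throw new IllegalArgumentException(
                "Field '" + fieldName + "' has unsupported sort key type "
                    + fieldType.getSimpleName());
        }
    }
}
```

Under these assumptions, `SortKeyTypes.validate("Age", Integer.class)` returns normally, while passing a type such as `java.util.List.class` raises an error that can be surfaced to the user at configuration time.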
Example
The following simple example shows how Order By works.
Input
First Name | Last Name | Age | Zip Code |
Joltie | Root | 29 | 32826 |
Henry | Zilka | 62 | 96789 |
Baby | Trump | 10 | 76563 |
Donald | Trump | 70 | 34566 |
Ivanka | Trump | 34 | 94306 |
Bipasha | Basu | 39 | 67543 |
BabyII | Trump | 10 | 32816 |
Configuration is specified as follows
- Input Schema
  - First Name, String
  - Last Name, String
  - Age, Int
  - Zip Code, Long
- Sort by
  - Last Name, Ascending
  - Age, Ascending
  - Zip Code, Descending
- Output Schema
  - First Name, String
  - Last Name, String
  - Age, Int
  - Zip Code, Long
Output is as follows
First Name | Last Name | Age | Zip Code |
Bipasha | Basu | 39 | 67543 |
Joltie | Root | 29 | 32826 |
Baby | Trump | 10 | 76563 |
BabyII | Trump | 10 | 32816 |
Ivanka | Trump | 34 | 94306 |
Donald | Trump | 70 | 34566 |
Henry | Zilka | 62 | 96789 |
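The ordering in this example can be reproduced with a short, standalone Java sketch. It uses only `java.util.Comparator`; the `Person` class and the `OrderByExample` wrapper are hypothetical names introduced for illustration, and the sketch does not use the plugin or any pipeline API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class OrderByExample {
    // Hypothetical row type used only for this illustration.
    static final class Person {
        final String firstName;
        final String lastName;
        final int age;
        final long zipCode;

        Person(String firstName, String lastName, int age, long zipCode) {
            this.firstName = firstName;
            this.lastName = lastName;
            this.age = age;
            this.zipCode = zipCode;
        }
    }

    // Last Name ascending, then Age ascending, then Zip Code descending.
    static Comparator<Person> sortOrder() {
        Comparator<Person> byLastName = Comparator.comparing(p -> p.lastName);
        Comparator<Person> byAge = Comparator.comparingInt(p -> p.age);
        Comparator<Person> byZipDesc =
            Comparator.comparingLong((Person p) -> p.zipCode).reversed();
        return byLastName.thenComparing(byAge).thenComparing(byZipDesc);
    }

    // The rows from the Input table above.
    static List<Person> input() {
        return Arrays.asList(
            new Person("Joltie", "Root", 29, 32826L),
            new Person("Henry", "Zilka", 62, 96789L),
            new Person("Baby", "Trump", 10, 76563L),
            new Person("Donald", "Trump", 70, 34566L),
            new Person("Ivanka", "Trump", 34, 94306L),
            new Person("Bipasha", "Basu", 39, 67543L),
            new Person("BabyII", "Trump", 10, 32816L));
    }

    // Returns a sorted copy and leaves the input list untouched.
    static List<Person> sorted(List<Person> rows) {
        List<Person> copy = new ArrayList<>(rows);
        copy.sort(sortOrder());
        return copy;
    }

    public static void main(String[] args) {
        for (Person p : sorted(input())) {
            System.out.println(p.firstName + " | " + p.lastName + " | "
                + p.age + " | " + p.zipCode + " |");
        }
    }
}
```

The two rows that tie on both Last Name and Age (Baby and BabyII) are separated only by the third key, which is why the descending Zip Code order places 76563 before 32816.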
Implementation Tips
- Investigate how ‘Group Comparator’ and ‘Sort Comparator’ work together and how they can be used to achieve the functionality for this plugin.
- Build a simple map-reduce program to understand how the above functionality works; implement the Sort Comparator using StructuredRecord.
- If the above works, the Data Pipeline Application Template needs to be modified to set the sort comparator class, and this shouldn’t affect the other plugins.
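As a starting point for the investigation above, the comparison logic that a sort comparator would need can be sketched independently of MapReduce. The sketch below is plain Java and rests on loud assumptions: a `Map<String, Object>` stands in for a StructuredRecord, the `SortKey` class is a hypothetical configuration shape, and no Hadoop or pipeline API is used. It also assumes every key field is non-null and holds values of the same comparable type in all records.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;

class RecordComparatorSketch {
    // Hypothetical config entry: one field name plus its sort direction.
    static final class SortKey {
        final String field;
        final boolean ascending;

        SortKey(String field, boolean ascending) {
            this.field = field;
            this.ascending = ascending;
        }
    }

    // Assumes both values are non-null and mutually comparable (same basic type).
    @SuppressWarnings("unchecked")
    private static int compareValues(Object a, Object b) {
        return ((Comparable<Object>) a).compareTo(b);
    }

    // Compares records key by key in the configured order;
    // the first key whose values differ decides the result.
    static Comparator<Map<String, Object>> comparatorFor(List<SortKey> keys) {
        return (left, right) -> {
            for (SortKey key : keys) {
                Object l = left.get(key.field);
                Object r = right.get(key.field);
                // Swap the operands for descending order instead of negating the result.
                int result = key.ascending ? compareValues(l, r) : compareValues(r, l);
                if (result != 0) {
                    return result;
                }
            }
            return 0;
        };
    }
}
```

In a real implementation the same field-by-field loop would presumably sit inside whatever comparator class the pipeline registers for sorting, reading fields from the record type instead of a map; how that class is wired in is exactly what the tips above leave to be investigated.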
Design
Checklist
- User stories documented
- User stories reviewed
- Design documented
- Design reviewed
- Feature merged
- Examples and guides
- Integration tests
- Documentation for feature
- Short video demonstrating the feature