Introduction
Users often want to sample a large dataset and pull only a few records for analysis. This transform lets them take a random sample of the data flowing through it. It should use the sampling method described for HEDIS reporting.
Use case(s)
- I would like to sample my member database to calculate the Adult BMI HEDIS measure. In this case, I would like to build a pipeline that pulls records from my member database, sorts them alphabetically using an OrderBy plugin (in development), and then applies the following sampling methodology. The inputs are a sample size and an oversampling percentage. The final sample size is calculated as Final Sample Size = Input Sample Size + (Input Sample Size * Over Sampling Percentage), rounded up to the next whole number. Every Nth member is then chosen, where N = (Total Records / Final Sample Size). The first member is chosen at position (random number between 0 and 1) * N, followed by every Nth member after that.
- As a data scientist, I would like to sample 20% of the records in the dataset for training a machine learning model. I would like to build a Hydrator pipeline with a transform where 1000 records go into the plugin and only 200 records come out for processing.
- I have a stream of items of large and unknown length, and I would like to randomly choose items from this stream such that each item is equally likely to be selected. I would like to use this transform with a Kafka queue in a Spark Streaming pipeline. (Reservoir Sampling Example)
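The systematic sampling steps in the first use case can be sketched as follows. This is only an illustration, not the plugin implementation: the function names are hypothetical, the oversample is assumed to be added to the input sample size, the oversampling percentage is assumed to be given on a 0 - 100 scale, and N is kept as a fractional step rather than rounded, which the use case does not specify.

```python
import math
import random


def final_sample_size(sample_size, oversampling_pct):
    """Input sample size plus the oversample, rounded up to a whole number."""
    return math.ceil(sample_size + sample_size * oversampling_pct / 100)


def systematic_sample(records, sample_size, oversampling_pct=0.0, rand=None):
    """Pick every Nth record from a list that is already sorted.

    rand is the random float between 0 and 1 that fixes the starting point;
    one is generated when it is not supplied (an assumption).
    """
    total = len(records)
    target = final_sample_size(sample_size, oversampling_pct)
    if total == 0 or target <= 0:
        return []
    if target >= total:
        # Not enough records to skip any: return everything.
        return list(records)
    step = total / target  # N = Total Records / Final Sample Size
    if rand is None:
        rand = random.random()
    start = rand * step  # position of the first member
    # Clamp the index to guard against floating-point rounding at the end.
    return [records[min(int(start + i * step), total - 1)] for i in range(target)]


members = sorted(f"member-{i:04d}" for i in range(1000))
sample = systematic_sample(members, sample_size=200, oversampling_pct=10)
```

With 1000 sorted members, an input sample size of 200, and 10% oversampling, the final sample size is 220 and roughly every 4.5th member is selected.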
User Story(s)
- As a Hydrator user, I would like to sample the records in my pipeline so that a large number of records go in, but only the specified number of records plus the oversampling percentage come out of the transform.
Plugin Type
- Aggregate (or possibly a Transform)
Configurables
This section defines properties that are configurable for this plugin.
User Facing Name | Type | Description | Constraints |
---|---|---|---|
Input Sample Size | String | The number of records that you would like to sample from the input records. | |
Input Sample Percentage | String | The % of records that you would like to sample from the input records. | 0 - 100 |
Oversampling Percentage | String | The % of extra records to include on top of the input sample size to account for oversampling. Defaults to 0. | 0 - 100 |
Sampling Type | String | Type of the Sampling algorithm that needs to be used to sample the data. | |
Random | String | Random float value between 0 and 1 to be used in Systematic Sampling. If not provided, plugin will | |
Total Records | String | Total number of input records to be used in Systematic Sampling. | |
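As a rough illustration of how these properties might be validated, the sketch below parses the string values and enforces the constraints from the table, plus the rule that one of Input Sample Size or Input Sample Percentage must be specified. The property keys (`sampleSize`, `samplePercentage`, and so on) and the function names are hypothetical, and requiring positive whole numbers for the sample size and total records is an assumption that the table does not state.

```python
def _number(props, key, cast, low=None, high=None, default=None):
    """Read an optional numeric property and check its range."""
    raw = (props.get(key) or "").strip()
    if not raw:
        return default
    value = cast(raw)
    too_low = low is not None and value < low
    too_high = high is not None and value > high
    if too_low or too_high:
        raise ValueError(f"{key} is out of range: {raw}")
    return value


def parse_sampling_config(props):
    """Turn raw string properties into typed, validated values."""
    config = {
        # Hypothetical keys; the real plugin may name these differently.
        "sample_size": _number(props, "sampleSize", int, low=1),
        "sample_percentage": _number(props, "samplePercentage", float, 0, 100),
        "oversampling_percentage": _number(
            props, "overSamplingPercentage", float, 0, 100, default=0.0
        ),
        "random": _number(props, "random", float, 0, 1),
        "total_records": _number(props, "totalRecords", int, low=1),
        "sampling_type": (props.get("samplingType") or "").strip() or None,
    }
    if config["sample_size"] is None and config["sample_percentage"] is None:
        raise ValueError(
            "Specify Input Sample Size or Input Sample Percentage"
        )
    return config


config = parse_sampling_config({"sampleSize": "200", "samplingType": "Systematic"})
```

Here the oversampling percentage falls back to its default of 0 because it was left out of the properties.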
Design / Implementation Tips
- One of Input Sample Size or Input Sample Percentage must be specified.
- Please follow the "Systematic Sampling Methodology" (starts on page 44) found in this document: https://drive.google.com/open?id=0B1DD6Nd_UiCZZzNBN1Z2ZHZHZUk for Input Sample Size
- Please use the Reservoir Sampling method described at http://blog.cloudera.com/blog/2013/04/hadoop-stratified-randosampling-algorithm/ which may require different input values.
- This should be a single plugin that allows the user to choose the method of sampling they would like to use. We should design this in such a way that additional sampling methods can be added to the same plugin.
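To make the last two tips concrete, here is a minimal sketch of classic single-pass reservoir sampling (often called Algorithm R) behind a small lookup table of sampling methods. It is not necessarily the variant described in the linked post, and the names `reservoir_sample`, `SAMPLERS`, and `sample` are hypothetical. The point is only that a new sampling method could be added by registering one more function.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Keep k items from a stream of unknown length, each equally likely."""
    if rng is None:
        rng = random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


# Registry of sampling methods keyed by the Sampling Type property.
SAMPLERS = {
    "reservoir": reservoir_sample,
}


def sample(records, sampling_type, **options):
    """Dispatch to the sampling method chosen by the user."""
    try:
        sampler = SAMPLERS[sampling_type.lower()]
    except KeyError:
        raise ValueError(f"Unknown sampling type: {sampling_type}") from None
    return sampler(records, **options)


picked = sample(iter(range(10_000)), "Reservoir", k=5)
```

The stream is consumed once and only k items are held in memory, which is what makes this approach suitable for input of unknown length.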
Design
Approach(s)
Properties
Security
Limitation(s)
Future Work
- Some future work – HYDRATOR-99999
- Another future work – HYDRATOR-99999
Test Case(s)
- Test case #1
- Test case #2
Sample Pipeline
Please attach one or more sample pipeline(s) and associated data.
Pipeline #1
Pipeline #2
Checklist
- User stories documented
- User stories reviewed
- Design documented
- Design reviewed
- Feature merged
- Examples and guides
- Integration tests
- Documentation for feature
- Short video demonstrating the feature