Introduction

Users often want to sample a large dataset and pull only a few records for analysis. This transform lets them take a random sample of the data flowing through it. It should use the sampling method described for HEDIS reporting.

Use case(s)

I would like to sample my member database to calculate the Adult BMI HEDIS measure. In this case, I would like to build a pipeline that pulls records from my member database, sorts them alphabetically using an OrderBy plugin (in development), and then applies the following sampling methodology. The user inputs a sample size and an oversampling percentage. The final sample size is calculated as Final Sample Size = Input Sample Size + (Input Sample Size * Over Sampling Percentage), rounded up to the next whole number. We then choose every Nth member, where N = (Total Records / Final Sample Size). The first member is chosen using (random number between 0 and 1) * N, and then every Nth member after that.
As a data scientist, I would like to sample 20% of the records in the dataset for training a machine learning model. I would like to build a Hydrator pipeline that uses a transform where 1000 records go into the plugin and only 200 records come out for processing.
I have a stream of items of large and unknown length, and I would like to randomly choose items from this stream such that each item is equally likely to be selected. I would like to use this transform with a Kafka queue in a Spark Streaming pipeline. (Reservoir Sampling example)
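The systematic methodology in the first use case can be sketched as follows. This is a minimal illustration and not the plugin's implementation: the function names are hypothetical, the oversample is read as an addition to the input sample size, and flooring fractional positions to a 0-based index is an assumption, since the use case does not say how positions are rounded.

```python
import math
import random


def final_sample_size(sample_size, oversampling_pct=0.0):
    """Input sample size plus the oversample, rounded up to a whole number."""
    # round() guards against float noise before taking the ceiling.
    return math.ceil(round(sample_size * (1 + oversampling_pct / 100.0), 9))


def systematic_sample(records, sample_size, oversampling_pct=0.0, rand=None):
    """Pick every Nth record, starting at a random offset in the first interval."""
    total = len(records)
    target = min(final_sample_size(sample_size, oversampling_pct), total)
    if target <= 0:
        return []
    interval = total / target      # N = Total Records / Final Sample Size
    if rand is None:
        rand = random.random()     # value in [0, 1)
    start = rand * interval        # fractional position of the first pick
    # Assumption: fractional positions are floored to a 0-based index,
    # and clamped so a caller-supplied rand of exactly 1.0 stays in range.
    return [
        records[min(int(start + i * interval), total - 1)]
        for i in range(target)
    ]
```

With 11 sorted records, a sample size of 2, 30% oversampling, and a random value of 0.2, this sketch selects the records at indices 0, 4, and 8.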

User Storie(s)

As a Hydrator user, I would like to sample the records in my pipeline so that a large number of records go in, but only the specified number of records plus the oversampling percentage come out of the transform.

Plugin Type

Aggregator (or possibly a transform)

Configurables

This section defines properties that are configurable for this plugin.

User Facing Name	Type	Description	Constraints
Input Sample Size	String	The number of records that you would like to sample from the input records.
Input Sample Percentage	String	The % of records that you would like to sample from the input records.	0 - 100
Oversampling Percentage	String	The % of extra records to include beyond the input sample size to account for oversampling. Defaults to 0.	0 - 100
Sampling Type	String	The sampling algorithm used to sample the data. For example: Systematic or Reservoir
Random	String	Random float value between 0 and 1 to be used in Systematic Sampling. If not provided, the plugin generates a random value internally.	0 - 1
Total Records	String	Total number of input records to be used in Systematic Sampling.

Design / Implementation Tips

One of Input Sample Size or Input Sample Percentage must be specified.
Please follow the "Systematic Sampling Methodology" (starts on page 44) found in this document: https://drive.google.com/open?id=0B1DD6Nd_UiCZZzNBN1Z2ZHZHZUk for Input Sample Size
Please use the Reservoir Sampling method described at http://blog.cloudera.com/blog/2013/04/hadoop-stratified-randosampling-algorithm/, which may require different input values.
This should be a single plugin that allows the user to choose the method of sampling they would like to use. We should design this in such a way that additional sampling methods can be added to the same plugin.
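For reference, the reservoir approach can be sketched with the textbook single-pass algorithm (often called Algorithm R). This is an illustrative sketch only: the function name and the optional `rng` parameter are hypothetical, and the variant described in the linked post may differ.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of up to k items from a stream of unknown length."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Replace a slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir
```

Because the stream is consumed once and only k items are held in memory, this method does not need the total record count up front, unlike the systematic method.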

Design

{
	"name": "Sampling",
	"plugin": {
		"name": "Sampling",
		"type": "batchaggregator",
		"label": "Sampling",
		"artifact": {
			"name": "sampling-aggregator-plugin",
			"version": "1.6.0",
			"scope": "SYSTEM"
		},
		"properties": {
			"samplingType": "Systematic",
			"sampleSize": "2",
			"random": "0.2",
			"overSamplingPercentage": "30",
			"totalRecords": "11"
		}
	}
}
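As a worked example, the values in this configuration give the following numbers, assuming the oversample is added to the input sample size and the result is rounded up:

```latex
\begin{aligned}
\text{Final Sample Size} &= \lceil 2 + 2 \times 0.30 \rceil = \lceil 2.6 \rceil = 3 \\
N &= \frac{\text{Total Records}}{\text{Final Sample Size}} = \frac{11}{3} \approx 3.67 \\
\text{Start} &= \text{random} \times N = 0.2 \times 3.67 \approx 0.73
\end{aligned}
```

Three of the 11 input records would therefore be emitted. How the fractional start position and the subsequent positions are rounded to record indices is not specified on this page.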

Approach(s)

Properties

Security

Properties

sampleSize: The number of records to sample from the input records.
samplePercentage: The percentage of records to sample from the input records. Either 'samplePercentage' or 'sampleSize' must be specified.
overSamplingPercentage: The percentage of extra records to include beyond the input sample size to account for oversampling. Used in Systematic Sampling.
samplingType: The sampling algorithm used to sample the data. For example: Systematic or Reservoir
random: Random float value between 0 and 1 to be used in Systematic Sampling. If not provided, the plugin generates a random value internally.
totalRecords: Total number of input records to be used in Systematic Sampling.
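The rule that one of 'sampleSize' or 'samplePercentage' must be present could be checked along these lines. This is a hedged sketch, not the plugin's validation code: the function name is hypothetical, and both the precedence of 'sampleSize' when both are given and the use of 'totalRecords' to turn a percentage into a count are assumptions.

```python
import math

SAMPLING_TYPES = {"systematic", "reservoir"}


def resolve_sample_size(props):
    """Validate string-valued properties and return the requested record count."""
    sampling_type = props.get("samplingType", "")
    if sampling_type.lower() not in SAMPLING_TYPES:
        raise ValueError(f"Unsupported samplingType: {sampling_type!r}")

    size = props.get("sampleSize")
    pct = props.get("samplePercentage")
    if not size and not pct:
        raise ValueError("Specify either sampleSize or samplePercentage")

    if size:
        # Assumption: sampleSize takes precedence when both are supplied.
        count = int(size)
        if count <= 0:
            raise ValueError("sampleSize must be a positive integer")
        return count

    pct_value = float(pct)
    if not 0 <= pct_value <= 100:
        raise ValueError("samplePercentage must be between 0 and 100")
    total = props.get("totalRecords")
    if not total:
        # Assumption: a percentage is converted to a count via totalRecords.
        raise ValueError("samplePercentage requires totalRecords")
    return math.ceil(int(total) * pct_value / 100.0)
```

For instance, a 'samplePercentage' of "20" with 'totalRecords' of "1000" resolves to 200 records, matching the data scientist use case.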

NFR

Only Performance measurement is in scope as part of NFR.

Limitation(s)

The user must provide the total number of records when the Sampling Type is Systematic.

Future Work

Some future work – HYDRATOR-99999
Another future work – HYDRATOR-99999

Test Case(s)

Test case #1
Test case #2

Sample Pipeline

Please attach one or more sample pipeline(s) and associated data.

Pipeline #1

Pipeline #2

Sample records with Systematic sampling
Sample records with Reservoir Sampling
Sample records with Systematic sampling along with over-sampling percentage

Sample Pipeline

sampling-systematic-cdap-data-pipeline.json

sampling-systematic_samplePercentage-cdap-data-pipeline.json

sampling-reservoir-cdap-data-pipeline.json

Checklist

User stories documented
User stories reviewed
Design documented
Design reviewed
Feature merged
Examples and guides
Integration tests
Documentation for feature
Short video demonstrating the feature
