Use-Case

User Stories

As a Hydrator user, I want to transform the feature data in a column of the source schema into n-grams, emitted in one of the columns of the output schema.
As a Hydrator user, I want a configuration option to specify the input schema column on which the transformation is performed.
As a Hydrator user, I want a configuration option to specify the number of terms in each n-gram produced from the input features.
As a Hydrator user, I want a configuration option to specify the output column name in which the n-grams are emitted.
As a Hydrator user, I want to specify the tokenization unit used to tokenize the input before it is converted to n-grams.

Conditions

End to End Example pipeline:

Stream -> NGramTransform -> TPFSAvro

Input source:

NGramTransform:

Mandatory inputs from user:

- Field to be used to transform input features into n-grams: "tokens"
- Number of terms in each n-gram: "2"
- Transformed field for sequence of n-gram: "ngrams"
- Tokenization unit: "words"

TPFSAvro Output

ngrams

[hi i,i heard,heard about,about spark]

[hdfs is,is a,a file,file system]

[spark is,is an,an engine]
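The output above can be reproduced with a minimal Python sketch of word-level n-gram generation. This is an illustration only, not the plugin's implementation: the function name `ngrams` and the input records are hypothetical, and the input token lists are inferred from the expected output since the example does not list the source data.

```python
def ngrams(tokens, n):
    """Return the space-joined n-grams over a sequence of tokens."""
    if n < 1:
        raise ValueError("n must be >= 1")
    # A sliding window of width n; yields nothing when len(tokens) < n.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical input records, inferred from the expected output above.
records = [
    {"tokens": ["hi", "i", "heard", "about", "spark"]},
    {"tokens": ["hdfs", "is", "a", "file", "system"]},
    {"tokens": ["spark", "is", "an", "engine"]},
]

for record in records:
    print(ngrams(record["tokens"], 2))
```

With 2 terms per n-gram, a record of k tokens produces k - 1 n-grams, which matches the four, four, and three entries shown in the output rows.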

Design

This is a plugin of type sparkcompute and is meant to work with Spark only.

Properties:

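**fieldToBeTransformed:** Column to be used to transform input features into n-grams.

**numberOfTerms:** Number of terms in each n-gram.

**tokenizationUnit:** Unit by which the input is tokenized before it is converted to n-grams.

**outputField:** Output column in which the sequence of n-grams is emitted.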

Input JSON:

{
  "name": "NGramTransform",
  "type": "sparkcompute",
  "properties": {
    "fieldToBeTransformed": "tokens",
    "numberOfTerms": "2",
    "tokenizationUnit": "word",
    "outputField": "ngrams"
  }
}
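As a rough illustration of how these properties might drive the transform, the following Python sketch parses the configuration and applies it to a single record. The helper `apply_ngram_config` is hypothetical and is not part of the plugin. The sketch assumes that the input field already holds a list of word tokens, that the output record carries only the n-gram column (as in the example output), and it does not interpret `tokenizationUnit`.

```python
import json

# The plugin configuration from the design; all property values are strings.
CONFIG = json.loads("""
{
  "name": "NGramTransform",
  "type": "sparkcompute",
  "properties": {
    "fieldToBeTransformed": "tokens",
    "numberOfTerms": "2",
    "tokenizationUnit": "word",
    "outputField": "ngrams"
  }
}
""")


def apply_ngram_config(record, config):
    """Hypothetical helper: build the output record described by config."""
    props = config["properties"]
    n = int(props["numberOfTerms"])  # convert the string property to an int
    if n < 1:
        raise ValueError("numberOfTerms must be >= 1")
    tokens = record[props["fieldToBeTransformed"]]
    grams = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    # Assumption: only the n-gram column is emitted, as in the example output.
    return {props["outputField"]: grams}


print(apply_ngram_config({"tokens": ["spark", "is", "an", "engine"]}, CONFIG))
```

Reading the input and output column names from the properties rather than hard-coding them mirrors the user stories, which ask for both to be configurable.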

Checklist