Use-Case

Transform input features (tokens in array form) into n-grams, using a parameter that sets the number of terms in each n-gram.
The transformed output is an array of n-grams, where each n-gram is a space-delimited string of n consecutive words.
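
The transformation described above amounts to a sliding window over the token array. The following is a minimal Python sketch of that idea, not the plugin's implementation. The function name `ngrams` is hypothetical, and returning an empty list when the input has fewer than n tokens is an assumption of this sketch.

```python
from typing import List


def ngrams(tokens: List[str], n: int) -> List[str]:
    """Return space-delimited n-grams built from n consecutive tokens."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    # Slide a window of width n over the tokens. When len(tokens) < n the
    # range is empty, so no n-grams are produced (an assumption of this sketch).
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


print(ngrams(["spark", "is", "an", "engine"], 2))
```

With n set to 2, each output string joins two adjacent tokens, so an array of k tokens yields k - 1 n-grams.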

User Stories

As a Hydrator user, I want to transform the input features data in a column of the source schema into an output schema that has a single column containing the n-gram data.
As a Hydrator user, I want a configuration option to specify the input schema column on which the transformation is performed.
As a Hydrator user, I want a configuration option to specify the number of terms used when transforming input features into n-grams.
As a Hydrator user, I want a configuration option to specify the output column name to which the n-grams are emitted.

Conditions

Example

Input source:

NGramTerms:

Mandatory inputs from user:

- Column to be used to transform input features into n-grams: "tokens"
- Number of terms in each n-gram: "2"
- Transformed column for the sequence of n-grams: "ngrams"

Output:

ngrams

[hi i, i heard, heard about, about spark]

[hdfs is, is file, file system]

[spark is, is an, an engine]
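
The output rows above can be reproduced with a small record-level sketch in Python. The input table is not shown in this section, so the token arrays below are inferred from the output and should be treated as an assumption. The helper `add_ngrams` is hypothetical and is not part of the plugin.

```python
from typing import Dict, List


def add_ngrams(record: Dict[str, List[str]], column: str, n: int,
               output: str) -> Dict[str, List[str]]:
    """Build a single-column output record holding the n-grams of record[column]."""
    tokens = record[column]
    grams = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    # The output schema carries only the n-gram column, as in the user stories.
    return {output: grams}


# Assumed input rows, reconstructed from the output listed above.
rows = [
    {"tokens": ["hi", "i", "heard", "about", "spark"]},
    {"tokens": ["hdfs", "is", "file", "system"]},
    {"tokens": ["spark", "is", "an", "engine"]},
]

for row in rows:
    print(add_ngrams(row, column="tokens", n=2, output="ngrams"))
```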

Design

Properties:

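**columnToBeTransformed:** Column to be used to transform input features into n-grams.

**noOfTerms:** Number of terms in each n-gram.

**outputColumn:** Transformed column for the sequence of n-grams.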

Input JSON:

{
  "name": "NGramTerms",
  "type": "sparkcompute",
  "properties": {
    "columnToBeTransformed": "tokens",
    "noOfTerms": "2",
    "outputColumn": "ngrams"
  }
}
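
As an illustration of how the properties in this JSON might be read, here is a hedged Python sketch. The function `parse_properties` and its checks are hypothetical; this document does not describe the plugin's actual validation. The sketch treats all three properties as mandatory, and converts `noOfTerms` because it is given as a string.

```python
import json

CONFIG = """
{
  "name": "NGramTerms",
  "type": "sparkcompute",
  "properties": {
    "columnToBeTransformed": "tokens",
    "noOfTerms": "2",
    "outputColumn": "ngrams"
  }
}
"""


def parse_properties(config_text: str) -> dict:
    """Extract and check the three plugin properties from the stage JSON."""
    props = json.loads(config_text)["properties"]
    # Hypothetical checks: every property must be present and non-empty.
    for key in ("columnToBeTransformed", "noOfTerms", "outputColumn"):
        if not props.get(key):
            raise ValueError(f"missing required property: {key}")
    # noOfTerms arrives as a string in the JSON, so convert it to an integer.
    n = int(props["noOfTerms"])
    if n < 1:
        raise ValueError("noOfTerms must be a positive integer")
    return {
        "column": props["columnToBeTransformed"],
        "n": n,
        "output": props["outputColumn"],
    }


print(parse_properties(CONFIG))
```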

Checklist