Introduction

An n-gram is a sequence of n tokens (typically words) for some integer n. NGram can be used to transform input features into n-grams.

Use-Case

Transform input features (tokens in array form) into n-grams, using a parameter that sets the number of terms in each n-gram.
The transformed output is an array of n-grams, where each n-gram is represented by a space-delimited string of n consecutive words.
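
The transformation described above can be sketched in a few lines of Python. This is only an illustration of the sliding-window idea, not the plugin's implementation; the function name `ngrams` is hypothetical, and returning an empty list when fewer than n tokens are available is an assumption about the desired behaviour.

```python
def ngrams(tokens, n):
    """Return space-delimited n-grams built from consecutive tokens."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    # Slide a window of width n over the tokens.
    # When len(tokens) < n the range is empty, so the result is [] (assumed behaviour).
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


print(ngrams(["hi", "i", "heard", "about", "spark"], 2))
# ['hi i', 'i heard', 'heard about', 'about spark']
```

An input of k tokens yields k - n + 1 n-grams, so a five-token array produces four 2-grams.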

User Stories

As a Hydrator user, I want to transform the data in a column from the source schema and emit the n-grams into an output schema that has a single column holding the n-gram data.
As a Hydrator user, I want a configuration option for specifying the column name from the input schema on which the transformation is to be performed.
As a Hydrator user, I want a configuration option for specifying the number of terms used in each n-gram when transforming the input features.
As a Hydrator user, I want a configuration option for specifying the output column name in which the n-grams will be emitted.

Conditions

The source field to be transformed can only be of array type.
Only a single column from the source schema can be transformed.
The output schema will have a single column of type string array.
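
The conditions above could be enforced with a check along these lines. This is a rough sketch only: the helper name `check_source_field` and the dict-based record are hypothetical, and requiring every element to be a string is an assumption (the conditions state only that the field must be an array).

```python
def check_source_field(record, column):
    """Return the token array for `column`, or raise if it breaks the conditions."""
    if column not in record:
        raise KeyError(f"column '{column}' is not present in the source record")
    value = record[column]
    # The source field must be an array.
    if not isinstance(value, list):
        raise TypeError(f"column '{column}' must be an array, got {type(value).__name__}")
    # Assumption: tokens are strings, so they can be joined into n-gram strings.
    if not all(isinstance(token, str) for token in value):
        raise TypeError(f"column '{column}' must contain only string tokens")
    return value


row = {"topic": "Spark", "tokens": ["spark", "is", "an", "engine"]}
print(check_source_field(row, "tokens"))
```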

Example

Input source:

topic	tokens
Java	[hi,i,heard,about,spark]
HDFS	[hdfs,is,file,system]
Spark	[spark,is,an,engine]

NGramTerms:

Mandatory inputs from user:

- Column to be used to transform input features into n-grams: "tokens"
- Number of terms in each n-gram: "2"
- Transformed column for the sequence of n-grams: "ngrams"

Output:

ngrams

[hi i,i heard,heard about,about spark]

[hdfs is,is file,file system]

[spark is,is an,an engine]

Design

Properties:

**columnToBeTransformed:** Column to be used to transform input features into n-grams.

**noOfTerms:** Number of terms in each n-gram.

**outputColumn:** Transformed column that holds the sequence of n-grams.

Input JSON:

{
  "name": "NGramTerms",
  "type": "sparkcompute",
  "properties": {
    "columnToBeTransformed": "tokens",
    "noOfTerms": "2",
    "outputColumn": "ngrams"
  }
}
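
To show how the three properties fit together, the following sketch reads the configuration above and applies it to one record from the example. It is illustrative only and does not use the Hydrator or Spark APIs; the `transform` function and the dict-based record are hypothetical stand-ins.

```python
import json

CONFIG = json.loads("""
{
  "name": "NGramTerms",
  "type": "sparkcompute",
  "properties": {
    "columnToBeTransformed": "tokens",
    "noOfTerms": "2",
    "outputColumn": "ngrams"
  }
}
""")


def transform(record, properties):
    """Build the single-column output record described by the properties."""
    column = properties["columnToBeTransformed"]
    n = int(properties["noOfTerms"])  # the value is a string in the JSON
    tokens = record[column]
    grams = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    # Only the output column is emitted; other source columns are dropped.
    return {properties["outputColumn"]: grams}


row = {"topic": "HDFS", "tokens": ["hdfs", "is", "file", "system"]}
print(transform(row, CONFIG["properties"]))
# {'ngrams': ['hdfs is', 'is file', 'file system']}
```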

Checklist

User stories documented
User stories reviewed
Design documented
Design reviewed
Feature merged
Examples and guides
Integration tests
Documentation for feature
Short video demonstrating the feature