Introduction
An n-gram is a sequence of n tokens (typically words) for some integer n.
The NGramTerms plugin transforms input features into n-grams.
Use-Case
- Transform input features (tokens in array form) into n-grams, using a parameter for the number of terms in each n-gram.
- The transformed output is an array of n-grams, where each n-gram is a space-delimited string of n consecutive words.
User Stories
- As a Hydrator user, I want to transform the input features in a column of the source schema into an output schema that has a single column containing the n-gram data.
- As a Hydrator user, I want a configuration option to specify the column from the input schema on which the transformation is performed.
- As a Hydrator user, I want a configuration option to specify the number of terms used when transforming input features into n-grams.
- As a Hydrator user, I want a configuration option to specify the output column name in which the n-grams are emitted.
Conditions
- The source field to be transformed must be of type string array.
- Only a single column from the source schema can be transformed.
- The output schema will have a single column of type string array.
- If the input sequence contains fewer than n strings, no output is produced.
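The conditions above can be illustrated with a minimal sketch in plain Python. The function name `ngram_terms` is hypothetical and is not part of the plugin. The sketch also assumes that "no output" means an empty list; the page does not say whether the plugin emits an empty array or skips the record.

```python
from typing import List


def ngram_terms(tokens: List[str], n: int) -> List[str]:
    """Return space-delimited n-grams built from n consecutive tokens.

    Hypothetical helper for illustration only, not the plugin's code.
    """
    if n < 1:
        raise ValueError("n must be a positive integer")
    # range() is empty when len(tokens) < n, so short inputs yield no n-grams.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


print(ngram_terms(["hdfs", "is", "file", "system"], 2))
print(ngram_terms(["spark"], 2))  # fewer than n tokens: empty result
```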
Example
Input source:
| topic | tokens |
|---|---|
| Java | [hi,i,heard,about,spark] |
| HDFS | [hdfs,is,file,system] |
| Spark | [spark,is,an,engine] |
NGramTerms:
Mandatory inputs from user:
- Column to be used to transform input features into n-grams: "tokens"
- Number of terms in each n-gram: "2"
- Transformed column for sequence of n-grams: "ngrams"
Output:
| ngrams |
|---|
| [hi i,i heard,heard about,about spark] |
| [hdfs is,is file,file system] |
| [spark is,is an,an engine] |
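The example can be reproduced with a short plain-Python sketch that stands in for the Spark transformation. The names `to_ngrams` and `transform` are hypothetical, records are modelled as dictionaries, and dropping records that produce no n-grams is an assumption rather than documented plugin behaviour.

```python
from typing import Dict, List


def to_ngrams(tokens: List[str], n: int) -> List[str]:
    # Join every window of n consecutive tokens with a single space.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def transform(records: List[Dict], column: str, n: int,
              output_column: str) -> List[Dict]:
    """Hypothetical stand-in for the plugin: one output column per record."""
    out = []
    for record in records:
        ngrams = to_ngrams(record[column], n)
        if ngrams:  # assumption: records with fewer than n tokens are skipped
            out.append({output_column: ngrams})
    return out


source = [
    {"topic": "Java", "tokens": ["hi", "i", "heard", "about", "spark"]},
    {"topic": "HDFS", "tokens": ["hdfs", "is", "file", "system"]},
    {"topic": "Spark", "tokens": ["spark", "is", "an", "engine"]},
]

for row in transform(source, "tokens", 2, "ngrams"):
    print(row["ngrams"])
```

Each output record carries only the `ngrams` column, matching the condition that the output schema has a single string-array column.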
Design
Properties:
- **columnToBeTransformed:** Column whose tokens are transformed into n-grams.
- **noOfTerms:** Number of terms in each n-gram.
- **outputColumn:** Output column in which the sequence of n-grams is emitted.
Input JSON:
{
  "name": "NGramTerms",
  "type": "sparkcompute",
  "properties": {
    "columnToBeTransformed": "tokens",
    "noOfTerms": "2",
    "outputColumn": "ngrams"
  }
}
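Because `noOfTerms` is supplied as a string in the JSON, it has to be converted to an integer before use. The sketch below shows one way to read and validate the three properties in Python. The `NGramTermsConfig` class, the `parse_config` function, and the rule that the value must be at least 1 are assumptions made for illustration; the page does not specify validation rules.

```python
import json
from dataclasses import dataclass

STAGE_JSON = """
{
  "name": "NGramTerms",
  "type": "sparkcompute",
  "properties": {
    "columnToBeTransformed": "tokens",
    "noOfTerms": "2",
    "outputColumn": "ngrams"
  }
}
"""


@dataclass(frozen=True)
class NGramTermsConfig:
    """Hypothetical typed view of the stage properties."""
    column_to_be_transformed: str
    no_of_terms: int
    output_column: str


def parse_config(stage_json: str) -> NGramTermsConfig:
    props = json.loads(stage_json)["properties"]
    no_of_terms = int(props["noOfTerms"])  # property values arrive as strings
    if no_of_terms < 1:  # assumed rule, not stated in the design
        raise ValueError("noOfTerms must be a positive integer")
    return NGramTermsConfig(
        column_to_be_transformed=props["columnToBeTransformed"],
        no_of_terms=no_of_terms,
        output_column=props["outputColumn"],
    )


print(parse_config(STAGE_JSON))
```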
Checklist
- User stories documented
- User stories reviewed
- Design documented
- Design reviewed
- Feature merged
- Examples and guides
- Integration tests
- Documentation for feature
- Short video demonstrating the feature