Introduction
An n-gram is a sequence of n tokens (typically words) for some integer n.
The NGramTransform plugin transforms input features into n-grams.
Use-Case
- Transform input features (tokens in array form) into n-grams, using a parameter for the number of terms in each n-gram.
- The transformed output is an array of n-grams, where each n-gram is represented by a space-delimited string of n consecutive words.
- A user wants to use tokens such as bigrams or other n-grams, instead of only unigrams, while developing features for supervised machine learning models.
User Stories
- As a Hydrator user, I want to transform the input feature data in a column of the source schema into an output schema that has a single column containing n-gram data.
- As a Hydrator user, I want a configuration for specifying the column name from the input schema on which the transformation is to be performed.
- As a Hydrator user, I want a configuration for specifying the number of terms used when transforming input features into n-grams.
- As a Hydrator user, I want a configuration for specifying the output column name in which the n-grams will be emitted.
Conditions
- The source field to be transformed can only be of type string array.
- The user can transform only a single field from the source schema.
- The output schema will have a single field of type string array.
- If the input sequence contains fewer than n strings, no output is produced.
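The conditions above can be illustrated with a minimal sliding-window sketch in plain Python. This is not the plugin's implementation; the function name `ngrams` is hypothetical, and the sketch assumes that "no output is produced" means an empty array for that record.

```python
def ngrams(tokens, n):
    """Return space-delimited n-grams built from n consecutive tokens.

    Hypothetical helper for illustration only; not part of the plugin.
    """
    if n < 1:
        raise ValueError("n must be a positive integer")
    # With fewer than n tokens the range is empty, so the result is [].
    # Assumption: this is how "no output is produced" is interpreted here.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


print(ngrams(["hi", "i", "heard", "about", "spark"], 2))
print(ngrams(["spark"], 2))
```

The first call prints the four bigrams of the five-token input; the second prints an empty list because the input has fewer than n tokens.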
Example
Input source:
topic | tokens |
---|---|
Java | [hi,i,heard,about,spark] |
HDFS | [hdfs,is,file,system] |
Spark | [spark,is,an,engine] |
NGramTransform:
Mandatory inputs from user:
- Field to be used to transform input features into n-grams: "tokens"
- Number of terms in each n-gram: "2"
- Transformed field for the sequence of n-grams: "ngrams"
Output:
ngrams |
---|
[hi i,i heard,heard about,about spark] |
[hdfs is,is file,file system] |
[spark is,is an,an engine] |
End to End Example pipeline:
Stream | Tokenizer | NGramTransform | TPFSAvro |
---|---|---|---|
Input source:
topic | sentence |
---|---|
java | hi i heard about spark |
HDFS | hdfs is a file system |
Spark | spark is an engine |
Tokenizer:
Mandatory inputs from user:
- Column on which tokenization is to be done: "sentence"
- Delimiter for tokenization: " "
- Output column name for tokenized data: "tokens"
NGramTransform:
Mandatory inputs from user:
- Field to be used to transform input features into n-grams: "tokens"
- Number of terms in each n-gram: "2"
- Transformed field for the sequence of n-grams: "ngrams"
TPFSAvro Output
ngrams |
---|
[hi i,i heard,heard about,about spark] |
[hdfs is,is a,a file,file system] |
[spark is,is an,an engine] |
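The Tokenizer and NGramTransform stages of this pipeline can be mimicked in plain Python to check the expected output. This is only a sketch: `tokenize`, `to_ngrams`, and `run_pipeline` are hypothetical names, the Stream source and TPFSAvro sink are replaced by in-memory lists, and nothing here uses the Hydrator or Spark APIs.

```python
def tokenize(sentence, delimiter=" "):
    """Stand-in for the Tokenizer stage: split a sentence on the delimiter."""
    return sentence.split(delimiter)


def to_ngrams(tokens, n):
    """Stand-in for the NGramTransform stage.

    Zips n shifted views of the token list; zip stops at the shortest view,
    so inputs with fewer than n tokens yield an empty list.
    """
    shifted = [tokens[i:] for i in range(n)]
    return [" ".join(window) for window in zip(*shifted)]


def run_pipeline(records, n=2):
    """Apply both stages and keep only the single output column 'ngrams'."""
    return [{"ngrams": to_ngrams(tokenize(r["sentence"]), n)} for r in records]


# Input rows copied from the example table above.
records = [
    {"topic": "java", "sentence": "hi i heard about spark"},
    {"topic": "HDFS", "sentence": "hdfs is a file system"},
    {"topic": "Spark", "sentence": "spark is an engine"},
]

output = run_pipeline(records)
for row in output:
    print(row["ngrams"])
```

Each printed list corresponds to one row of the TPFSAvro output table, with the bigrams in the same order as the source tokens.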
Design
This is a sparkcompute type of plugin and is meant to work with Spark only.
Properties:
- **fieldToBeTransformed:** Column to be used to transform input features into n-grams.
- **numberOfTerms:** Number of terms in each n-gram.
- **outputField:** Output column that holds the sequence of n-grams.
Input JSON:
{
  "name": "NGramTransform",
  "type": "sparkcompute",
  "properties": {
    "fieldToBeTransformed": "tokens",
    "numberOfTerms": "2",
    "outputField": "ngrams"
  }
}
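Note that `numberOfTerms` is passed as a string in the JSON. A short sketch of how such a stage configuration could be read and sanity-checked is shown below. The helper `parse_ngram_properties` and its checks are assumptions made for illustration; they are not the plugin's actual validation logic.

```python
import json

# Stage configuration copied from the design above.
CONFIG = """
{
  "name": "NGramTransform",
  "type": "sparkcompute",
  "properties": {
    "fieldToBeTransformed": "tokens",
    "numberOfTerms": "2",
    "outputField": "ngrams"
  }
}
"""

REQUIRED = ("fieldToBeTransformed", "numberOfTerms", "outputField")


def parse_ngram_properties(raw):
    """Hypothetical helper: return (input field, n, output field)."""
    props = json.loads(raw)["properties"]
    missing = [key for key in REQUIRED if key not in props]
    if missing:
        raise ValueError("missing properties: " + ", ".join(missing))
    # The property arrives as a string, so convert it before checking it.
    n = int(props["numberOfTerms"])
    if n < 1:
        raise ValueError("numberOfTerms must be a positive integer")
    return props["fieldToBeTransformed"], n, props["outputField"]


print(parse_ngram_properties(CONFIG))
```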
Checklist
- User stories documented
- User stories reviewed
- Design documented
- Design reviewed
- Feature merged
- Examples and guides
- Integration tests
- Documentation for feature
- Short video demonstrating the feature
...