An n-gram is a sequence of n tokens (typically words) for some integer n.
The NGramTransform plugin transforms input features into n-grams.
Use-Case
A user can add this plugin to their Hydrator pipeline to build bigrams from text data for statistical analysis. Example: the user wants to tokenize the text on '.' and then create bigrams.
Text: hello my friend.how are you today bye
Output bigrams: hello my, my friend, how are, are you, you today, today bye
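A minimal sketch of this use case in plain Java; the class and method names are illustrative only, not part of the plugin:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BigramExample {
  // Build n-grams by sliding a window of size n over the token list.
  static List<String> nGrams(List<String> tokens, int n) {
    List<String> result = new ArrayList<>();
    for (int i = 0; i <= tokens.size() - n; i++) {
      result.add(String.join(" ", tokens.subList(i, i + n)));
    }
    return result;
  }

  public static void main(String[] args) {
    String text = "hello my friend.how are you today bye";
    List<String> bigrams = new ArrayList<>();
    // Tokenize on '.' into sentences, then split each sentence into words.
    for (String sentence : text.split("\\.")) {
      bigrams.addAll(nGrams(Arrays.asList(sentence.trim().split(" ")), 2));
    }
    // Prints: [hello my, my friend, how are, are you, you today, today bye]
    System.out.println(bigrams);
  }
}
```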
User Stories
As a Hydrator user, I want to transform input feature data from a single column of the source schema into an output schema that has a single column holding the n-gram data.
As a Hydrator user, I want a configuration for specifying the column name in the input schema on which the transformation has to be performed.
As a Hydrator user, I want a configuration to specify the number of terms to use when transforming input features into n-grams.
As a Hydrator user, I want a configuration to specify the output column name to which the n-grams will be emitted.
Conditions
The source field to be transformed must be of type string array.
Only a single field from the source schema can be transformed.
The output schema will have a single field of type string array.
If the input sequence contains fewer than n strings, no output is produced (illustrated in the sketch below).
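The last condition follows directly from the sliding-window construction; a hypothetical `nGrams` helper makes it concrete:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FewerThanNExample {
  static List<String> nGrams(List<String> tokens, int n) {
    List<String> result = new ArrayList<>();
    // When tokens.size() < n the loop body never runs, so the result stays empty.
    for (int i = 0; i <= tokens.size() - n; i++) {
      result.add(String.join(" ", tokens.subList(i, i + n)));
    }
    return result;
  }

  public static void main(String[] args) {
    System.out.println(nGrams(Arrays.asList("hdfs", "is"), 3)); // prints: []
  }
}
```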
Example
Input source:

| topic | tokens |
| --- | --- |
| Java | [hi,i,heard,about,spark] |
| HDFS | [hdfs,is,file,system] |
| Spark | [spark,is,an,engine] |
NGramTransform:
Mandatory inputs from user:
Field to be used to transform input features into n-grams: "tokens"
Number of terms in each n-gram: "2"
Transformed field for the sequence of n-grams: "ngrams"
Output:

| ngrams |
| --- |
| [hi i,i heard,heard about,about spark] |
| [hdfs is,is file,file system] |
| [spark is,is an,an engine] |
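The space-joined bigrams above match what Spark's `org.apache.spark.ml.feature.NGram` feature transformer produces, so it is a plausible building block for this plugin (not confirmed here). The sketch below reproduces the example table with a local SparkSession; the class name and local setup are just for illustration:

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.ml.feature.NGram;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class NGramTransformExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("ngram-transform-example").master("local[*]").getOrCreate();

    // Mirror the "tokens" column from the input source above.
    List<Row> rows = Arrays.asList(
        RowFactory.create(Arrays.asList("hi", "i", "heard", "about", "spark")),
        RowFactory.create(Arrays.asList("hdfs", "is", "file", "system")),
        RowFactory.create(Arrays.asList("spark", "is", "an", "engine")));
    StructType schema = new StructType(new StructField[] {
        new StructField("tokens", DataTypes.createArrayType(DataTypes.StringType), false, Metadata.empty())
    });
    Dataset<Row> input = spark.createDataFrame(rows, schema);

    // numberOfTerms = 2 => bigrams, emitted as space-joined strings in "ngrams".
    NGram ngram = new NGram().setN(2).setInputCol("tokens").setOutputCol("ngrams");
    ngram.transform(input).select("ngrams").show(false);

    spark.stop();
  }
}
```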
End-to-End Example Pipeline:
Stream -> Tokenizer -> NGramTransform -> TPFSAvro
Input source:

| topic | sentence |
| --- | --- |
| java | hi i heard about spark |
| HDFS | hdfs is a file system |
| Spark | spark is an engine |
Tokenizer:
Mandatory inputs from user:
Column on which tokenization is to be done: "sentence"
Delimiter for tokenization: " " (space)
Output column name for tokenized data: "tokens"
NGramTransform:
Mandatory inputs from user:
Field to be used to transform input features into n-grams: "tokens"
Number of terms in each n-gram: "2"
Transformed field for the sequence of n-grams: "ngrams"
TPFSAvro Output:

| ngrams |
| --- |
| [hi i,i heard,heard about,about spark] |
| [hdfs is,is a,a file,file system] |
| [spark is,is an,an engine] |
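For reference, chaining Spark's own `Tokenizer` and `NGram` feature transformers mirrors the Tokenizer and NGramTransform stages of this pipeline. This is a local sketch, not the Hydrator plugins themselves; note that Spark's `Tokenizer` always splits on whitespace and lowercases its input:

```java
import java.util.Arrays;

import org.apache.spark.ml.feature.NGram;
import org.apache.spark.ml.feature.Tokenizer;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class EndToEndNGramExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("end-to-end-ngram").master("local[*]").getOrCreate();

    // Mirror the "sentence" column from the input source above.
    Dataset<Row> sentences = spark.createDataset(
        Arrays.asList("hi i heard about spark", "hdfs is a file system", "spark is an engine"),
        Encoders.STRING()).toDF("sentence");

    // Tokenizer stage: split each sentence on whitespace into a "tokens" array column.
    Dataset<Row> tokens = new Tokenizer()
        .setInputCol("sentence").setOutputCol("tokens").transform(sentences);

    // NGramTransform stage: numberOfTerms = 2, emitted into an "ngrams" array column.
    Dataset<Row> ngrams = new NGram()
        .setN(2).setInputCol("tokens").setOutputCol("ngrams").transform(tokens);

    ngrams.select("ngrams").show(false);
    spark.stop();
  }
}
```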
Design
This is a SparkCompute type of plugin and is meant to work with the Spark execution engine only.
Properties:
**fieldToBeTransformed:** Column to be used to transform input features into n-grams.
**numberOfTerms:** Number of terms in each n-gram.
**outputField:** Transformed column for the sequence of n-grams.
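A minimal sketch of what the plugin class could look like, assuming the `co.cask.cdap` Hydrator APIs; package names, schema handling, and validation are simplified and should be treated as assumptions rather than the final implementation:

```java
import java.util.ArrayList;
import java.util.List;

import co.cask.cdap.api.annotation.Description;
import co.cask.cdap.api.annotation.Name;
import co.cask.cdap.api.annotation.Plugin;
import co.cask.cdap.api.data.format.StructuredRecord;
import co.cask.cdap.api.data.schema.Schema;
import co.cask.cdap.api.plugin.PluginConfig;
import co.cask.cdap.etl.api.batch.SparkCompute;
import co.cask.cdap.etl.api.batch.SparkExecutionPluginContext;
import org.apache.spark.api.java.JavaRDD;

@Plugin(type = SparkCompute.PLUGIN_TYPE)
@Name("NGramTransform")
@Description("Transforms input features into n-grams.")
public class NGramTransform extends SparkCompute<StructuredRecord, StructuredRecord> {

  private final Config config;

  public NGramTransform(Config config) {
    this.config = config;
  }

  @Override
  public JavaRDD<StructuredRecord> transform(SparkExecutionPluginContext context,
                                             JavaRDD<StructuredRecord> input) throws Exception {
    // Output schema: a single string-array field holding the n-grams.
    String outputField = config.outputField;
    Schema outputSchema = Schema.recordOf(
        "outputRecord",
        Schema.Field.of(outputField, Schema.arrayOf(Schema.of(Schema.Type.STRING))));

    int n = config.numberOfTerms;
    String inputField = config.fieldToBeTransformed;

    return input.map(record -> {
      // Assumes the source array field is exposed as a List<String>.
      List<String> tokens = record.get(inputField);
      List<String> ngrams = new ArrayList<>();
      // Slide a window of size n over the tokens; fewer than n tokens => empty output.
      for (int i = 0; tokens != null && i <= tokens.size() - n; i++) {
        ngrams.add(String.join(" ", tokens.subList(i, i + n)));
      }
      return StructuredRecord.builder(outputSchema).set(outputField, ngrams).build();
    });
  }

  public static class Config extends PluginConfig {
    @Name("fieldToBeTransformed")
    @Description("Column to be used to transform input features into n-grams.")
    private String fieldToBeTransformed;

    @Name("numberOfTerms")
    @Description("Number of terms in each n-gram.")
    private Integer numberOfTerms;

    @Name("outputField")
    @Description("Transformed column for the sequence of n-grams.")
    private String outputField;
  }
}
```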