 Introduction


An n-gram is a sequence of n tokens (typically words) for some integer n.

The NGramTransform plugin transforms input features into n-grams.
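
As an illustration of the definition above, here is a minimal Python sketch of a sliding-window n-gram function. The function name `ngrams` and the choice to join tokens with a single space are assumptions made for this example; this is not the plugin's implementation.

```python
from typing import List


def ngrams(tokens: List[str], n: int) -> List[str]:
    """Return every run of n consecutive tokens, joined by a single space."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    # range() is empty when len(tokens) < n, so short inputs yield no n-grams.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


print(ngrams(["hi", "i", "heard", "about", "spark"], 2))
# ['hi i', 'i heard', 'heard about', 'about spark']
```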

Use-Case

  • A user can retrieve bigrams from text data in their Hydrator pipeline and use them for statistical analysis.
    Example: The user wants to use tokens such as bigrams/n-grams, instead of just unigrams, while developing features for supervised machine learning models. The user wants to tokenize the text on '.' and create bigrams from the words of each resulting segment.
    Text: hello my friend.how are you today bye
    Output for bigrams:
    hello my
    my friend
    how are
    are you
    you today
    today bye
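
The use case above can be sketched in a few lines of Python. The helper name `sentence_bigrams` is hypothetical, and the sketch assumes that words inside each '.'-delimited segment are separated by whitespace. Bigrams are not formed across a '.' boundary, so "friend how" does not appear.

```python
from typing import List


def sentence_bigrams(text: str, sentence_delimiter: str = ".") -> List[str]:
    """Split text on the delimiter, then pair adjacent words in each segment."""
    bigrams: List[str] = []
    for sentence in text.split(sentence_delimiter):
        words = sentence.split()  # assumption: words are whitespace-separated
        # zip(words, words[1:]) pairs each word with its successor.
        bigrams.extend(f"{first} {second}" for first, second in zip(words, words[1:]))
    return bigrams


for bigram in sentence_bigrams("hello my friend.how are you today bye"):
    print(bigram)
```

Run on the sample text, this prints the six bigrams listed above, one per line.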

User Stories

  • As a Hydrator user, I want to transform the input feature data in a column of the source schema into an output schema that has a single column containing the n-gram data.
  • As a Hydrator user, I want a configuration option for specifying the input schema column on which the transformation is performed.
  • As a Hydrator user, I want a configuration option for specifying the number of terms used when transforming input features into n-grams.
  • As a Hydrator user, I want a configuration option for specifying the output column name in which the n-grams are emitted.

Conditions

  • The source field to be transformed must be of type string array.
  • Only a single field from the source schema can be transformed.
  • Output schema will have a single field of type string array.
  • If the input sequence contains fewer than n strings, no output is produced.
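
The conditions above can be checked per record, as in the following Python sketch. The function `transform_record` and its error messages are hypothetical. The sketch also assumes that "no output is produced" means an empty array in the output field; the source does not say whether the record is dropped instead.

```python
from typing import Any, Dict, List


def transform_record(record: Dict[str, Any], field: str, n: int,
                     output_field: str) -> Dict[str, List[str]]:
    """Apply the conditions to one record and return a single-field output."""
    if n < 1:
        raise ValueError("number of terms must be a positive integer")
    tokens = record.get(field)
    # Condition: the source field must be an array of strings.
    if not isinstance(tokens, list) or not all(isinstance(t, str) for t in tokens):
        raise TypeError(f"field '{field}' must be a string array")
    # Condition: fewer than n tokens gives an empty result (assumed behavior).
    grams = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    # Condition: the output has exactly one field, of type string array.
    return {output_field: grams}


print(transform_record({"topic": "HDFS", "tokens": ["hdfs"]}, "tokens", 2, "ngrams"))
# {'ngrams': []}
```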

Example

Input source:

topic | tokens
Java  | [hi,i,heard,about,spark]
HDFS  | [hdfs,is,file,system]
Spark | [spark,is,an,engine]

NGramTransform:

Mandatory inputs from user:

    • Field to be used to transform input features into n-grams: "tokens"
    • Number of terms in each n-gram: "2"
    • Transformed field for sequence of n-grams: "ngrams"

Output:

ngrams

[hi i,i heard,heard about,about spark]

[hdfs is,is file,file system]

[spark is,is an,an engine]

End-to-end example pipeline:

Stream -> Tokenizer -> NGramTransform -> TPFSAvro

 

Input source:

 

topic | sentence
java  | hi i heard about spark
HDFS  | hdfs is a file system
Spark | spark is an engine


Tokenizer:

Mandatory inputs from user:

    • Column on which tokenization is to be done: "sentence"
    • Delimiter for tokenization: " " (a single space)
    • Output column name for tokenized data: "tokens"

NGramTransform:

Mandatory inputs from user:

    • Field to be used to transform input features into n-grams: "tokens"
    • Number of terms in each n-gram: "2"
    • Transformed field for sequence of n-grams: "ngrams"


TPFSAvro Output

ngrams

[hi i,i heard,heard about,about spark]

[hdfs is,is a,a file,file system]

[spark is,is an,an engine]
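
The two transform stages of this pipeline can be simulated in plain Python to check the expected output. This is only a sketch: the functions `tokenizer` and `ngram_transform` are hypothetical stand-ins for the Tokenizer and NGramTransform plugins, records are modeled as dictionaries, and the Stream source and TPFSAvro sink are replaced by an in-memory list and `print`. That the Tokenizer stage passes the other input fields through is also an assumption.

```python
from typing import Any, Dict, List

Record = Dict[str, Any]


def tokenizer(records: List[Record], column: str, delimiter: str,
              output_column: str) -> List[Record]:
    """Stand-in for the Tokenizer stage: split one column on the delimiter."""
    return [{**r, output_column: r[column].split(delimiter)} for r in records]


def ngram_transform(records: List[Record], field: str, n: int,
                    output_field: str) -> List[Record]:
    """Stand-in for NGramTransform: emit a single field holding the n-grams."""
    result = []
    for r in records:
        tokens = r[field]
        grams = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        result.append({output_field: grams})
    return result


source = [
    {"topic": "java", "sentence": "hi i heard about spark"},
    {"topic": "HDFS", "sentence": "hdfs is a file system"},
    {"topic": "Spark", "sentence": "spark is an engine"},
]

tokenized = tokenizer(source, "sentence", " ", "tokens")
for row in ngram_transform(tokenized, "tokens", 2, "ngrams"):
    print(row["ngrams"])
# ['hi i', 'i heard', 'heard about', 'about spark']
# ['hdfs is', 'is a', 'a file', 'file system']
# ['spark is', 'is an', 'an engine']
```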

 

Design

This is a plugin of type sparkcompute and is meant to work with Spark only.

Properties:

  • **fieldToBeTransformed:** Column to be used to transform input features into n-grams.
  • **numberOfTerms:** Number of terms in each n-gram.
  • **outputField:** Output column for the sequence of n-grams.

Input JSON:

    {
      "name": "NGramTransform",
      "type": "sparkcompute",
      "properties": {
        "fieldToBeTransformed": "tokens",
        "numberOfTerms": "2",
        "outputField": "ngrams"
      }
    }
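
As a sketch of how a stage configuration like the one above could be read and validated, the following Python parses the JSON and converts numberOfTerms from its string form to an integer. The `NGramConfig` class, the `parse_config` function, and the validation rules (all three properties required, numberOfTerms at least 1) are assumptions for illustration and do not describe the plugin's actual code.

```python
import json
from dataclasses import dataclass

REQUIRED_PROPERTIES = ("fieldToBeTransformed", "numberOfTerms", "outputField")


@dataclass(frozen=True)
class NGramConfig:
    field_to_be_transformed: str
    number_of_terms: int
    output_field: str


def parse_config(stage_json: str) -> NGramConfig:
    """Read the stage JSON and check the three properties (assumed mandatory)."""
    properties = json.loads(stage_json).get("properties", {})
    missing = [key for key in REQUIRED_PROPERTIES if not properties.get(key)]
    if missing:
        raise ValueError(f"missing properties: {', '.join(missing)}")
    # Property values arrive as strings, so numberOfTerms is converted here.
    number_of_terms = int(properties["numberOfTerms"])
    if number_of_terms < 1:
        raise ValueError("numberOfTerms must be at least 1")
    return NGramConfig(properties["fieldToBeTransformed"], number_of_terms,
                       properties["outputField"])


STAGE_JSON = """
{
  "name": "NGramTransform",
  "type": "sparkcompute",
  "properties": {
    "fieldToBeTransformed": "tokens",
    "numberOfTerms": "2",
    "outputField": "ngrams"
  }
}
"""

config = parse_config(STAGE_JSON)
print(config.field_to_be_transformed, config.number_of_terms, config.output_field)
# tokens 2 ngrams
```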


Checklist

  •  User stories documented 
  •  User stories reviewed 
  •  Design documented 
  •  Design reviewed 
  •  Feature merged 
  •  Examples and guides 
  •  Integration tests 
  •  Documentation for feature 
  •  Short video demonstrating the feature

...