Introduction


An n-gram is a sequence of n consecutive tokens (typically words), where n is a positive integer.

The NGramTransform plugin transforms input features into n-grams.

Use-Case

  • Transform input features (tokens in array form) into n-grams, using a parameter that sets the number of terms in each n-gram.
  • The transformed output is an array of n-grams, where each n-gram is a space-delimited string of n consecutive words.
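
The sliding-window behavior described in this use case can be sketched in a few lines of Python. This is only an illustration of the idea, not the plugin's code; the function name `ngrams` is hypothetical.

```python
from typing import List


def ngrams(tokens: List[str], n: int) -> List[str]:
    """Return every run of n consecutive tokens, each joined by one space."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    # range() is empty when len(tokens) < n, so short inputs give [] here.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


print(ngrams(["hi", "i", "heard", "about", "spark"], 2))
# ['hi i', 'i heard', 'heard about', 'about spark']
```

An input of k tokens yields k - n + 1 n-grams, which is why the five-token example produces four bigrams.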

User Stories

  • As a Hydrator user, I want to transform the input feature data in a column of the source schema into an output schema that has a single column containing the n-gram data.
  • As a Hydrator user, I want a configuration option to specify the input schema column on which the transformation is performed.
  • As a Hydrator user, I want a configuration option to specify the number of terms used when transforming input features into n-grams.
  • As a Hydrator user, I want a configuration option to specify the output column name in which the n-grams are emitted.

Conditions

  • The source field to be transformed must be of type string array.
  • Only a single field from the source schema can be transformed.
  • The output schema has a single field of type string array.
  • If the input sequence contains fewer than n strings, no output is produced.
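
As a rough illustration of how these conditions might be enforced per record, here is a minimal sketch. The function `transform_record` and the dict-based record are hypothetical stand-ins, and treating "no output" as skipping the record is an assumption; the plugin could equally emit an empty array.

```python
from typing import Any, Dict, List, Optional


def transform_record(record: Dict[str, Any], field: str, n: int,
                     output_field: str) -> Optional[Dict[str, List[str]]]:
    """Apply the listed conditions to a single input record."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    tokens = record.get(field)
    # Condition: the source field must be an array of strings.
    if not isinstance(tokens, list) or not all(isinstance(t, str) for t in tokens):
        raise TypeError(f"field '{field}' must be an array of strings")
    # Condition: fewer than n tokens produces no output.
    # Assumption: "no output" is modeled here by returning None (record skipped).
    if len(tokens) < n:
        return None
    grams = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    # Condition: the output has a single field of type string array.
    return {output_field: grams}


print(transform_record({"topic": "Spark", "tokens": ["spark", "is", "an", "engine"]},
                       "tokens", 2, "ngrams"))
# {'ngrams': ['spark is', 'is an', 'an engine']}
```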

Example

Input source:

| topic | tokens                   |
|-------|--------------------------|
| Java  | [hi,i,heard,about,spark] |
| HDFS  | [hdfs,is,file,system]    |
| Spark | [spark,is,an,engine]     |

NGramTransform:

Mandatory inputs from user:

    • Field to be used to transform input features into n-grams: "tokens"
    • Number of terms in each n-gram: "2"
    • Transformed field for the sequence of n-grams: "ngrams"

Output:

| ngrams                                 |
|----------------------------------------|
| [hi i,i heard,heard about,about spark] |
| [hdfs is,is file,file system]          |
| [spark is,is an,an engine]             |

End to End Example pipeline:

Stream -> Tokenizer -> NGramTransform -> TPFSAvro

 

Input source:

 

| topic | sentence               |
|-------|------------------------|
| java  | hi i heard about spark |
| HDFS  | hdfs is a file system  |
| Spark | spark is an engine     |


Tokenizer:

Mandatory inputs from user:

    • Column on which tokenization is to be done: "sentence"
    • Delimiter for tokenization: " "
    • Output column name for tokenized data: "tokens"

NGramTransform:

Mandatory inputs from user:

    • Field to be used to transform input features into n-grams: "tokens"
    • Number of terms in each n-gram: "2"
    • Transformed field for the sequence of n-grams: "ngrams"


TPFSAvro Output

| ngrams                                 |
|----------------------------------------|
| [hi i,i heard,heard about,about spark] |
| [hdfs is,is a,a file,file system]      |
| [spark is,is an,an engine]             |
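
The end-to-end flow above can be approximated in plain Python to check the expected output by hand. This is a sketch only: `tokenize`, `ngrams`, and `run_pipeline` are hypothetical helpers standing in for the Tokenizer and NGramTransform stages, and the Stream source and TPFSAvro sink are replaced by an in-memory list.

```python
from typing import Dict, List


def tokenize(sentence: str, delimiter: str = " ") -> List[str]:
    """Stand-in for the Tokenizer stage: split the sentence on the delimiter."""
    return sentence.split(delimiter)


def ngrams(tokens: List[str], n: int) -> List[str]:
    """Stand-in for the NGramTransform stage: space-joined runs of n tokens."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def run_pipeline(rows: List[Dict[str, str]], n: int = 2) -> List[Dict[str, List[str]]]:
    """Tokenize the 'sentence' column, then emit a single 'ngrams' column."""
    return [{"ngrams": ngrams(tokenize(row["sentence"]), n)} for row in rows]


rows = [
    {"topic": "java", "sentence": "hi i heard about spark"},
    {"topic": "HDFS", "sentence": "hdfs is a file system"},
    {"topic": "Spark", "sentence": "spark is an engine"},
]

for out in run_pipeline(rows):
    print(out["ngrams"])
# ['hi i', 'i heard', 'heard about', 'about spark']
# ['hdfs is', 'is a', 'a file', 'file system']
# ['spark is', 'is an', 'an engine']
```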

 

Design

Properties:

  • **fieldToBeTransformed:** Column whose tokens are transformed into n-grams.
  • **noOfTerms:** Number of terms in each n-gram.
  • **outputField:** Output column that holds the resulting sequence of n-grams.

Input JSON:

    {
      "name": "NGramTransform",
      "type": "sparkcompute",
      "properties": {
        "fieldToBeTransformed": "tokens",
        "noOfTerms": "2",
        "outputField": "ngrams"
      }
    }
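
Note that noOfTerms is given as a JSON string rather than a number, so a consumer of this configuration has to convert it before use. The following sketch shows one way that might look; `NGramConfig` and `parse_stage` are hypothetical names for illustration and are not part of the plugin.

```python
import json
from dataclasses import dataclass


@dataclass
class NGramConfig:
    field_to_be_transformed: str
    no_of_terms: int
    output_field: str


def parse_stage(stage_json: str) -> NGramConfig:
    """Read the three properties and convert noOfTerms from string to int."""
    props = json.loads(stage_json)["properties"]
    no_of_terms = int(props["noOfTerms"])  # raises ValueError if not numeric
    if no_of_terms < 1:
        raise ValueError("noOfTerms must be a positive integer")
    return NGramConfig(props["fieldToBeTransformed"], no_of_terms, props["outputField"])


STAGE_JSON = """
{
  "name": "NGramTransform",
  "type": "sparkcompute",
  "properties": {
    "fieldToBeTransformed": "tokens",
    "noOfTerms": "2",
    "outputField": "ngrams"
  }
}
"""

config = parse_stage(STAGE_JSON)
print(config)
```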

Checklist

  •  User stories documented 
  •  User stories reviewed 
  •  Design documented 
  •  Design reviewed 
  •  Feature merged 
  •  Examples and guides 
  •  Integration tests 
  •  Documentation for feature 
  •  Short video demonstrating the feature