
Introduction

An n-gram is a sequence of n tokens (typically words) for some integer n.

The NGramTransform plugin transforms input features into n-grams.

Use-Case

  • Transform input features (tokens in array form) into n-grams, using a parameter for the number of terms in each n-gram.
  • The transformed output will be an array of n-grams, where each n-gram is represented by a space-delimited string of n consecutive words.

User Stories

  • As a Hydrator user, I want to transform input features data in a column from the source schema into an output schema that has a single column containing the n-gram data.
  • As a Hydrator user, I want a configuration for specifying the column name from the input schema on which the transformation is to be performed.
  • As a Hydrator user, I want a configuration to specify the number of terms to be used when transforming input features into n-grams.
  • As a Hydrator user, I want a configuration to specify the output column name to which the n-grams will be emitted.

Conditions

  • The source field to be transformed can only be of type string array.
  • The user can transform only a single field from the source schema.
  • The output schema will have a single field of type string array.
  • If the input sequence contains fewer than n strings, no output is produced (see the sketch below).
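
A minimal Java sketch of the behaviour described above (illustrative only; the class and method names are assumptions, not the plugin's actual implementation):

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    public class NGramExample {

      // Builds space-delimited n-grams from a list of tokens.
      // Returns an empty list when the input has fewer than n tokens.
      static List<String> buildNGrams(List<String> tokens, int n) {
        List<String> ngrams = new ArrayList<>();
        for (int i = 0; i + n <= tokens.size(); i++) {
          ngrams.add(String.join(" ", tokens.subList(i, i + n)));
        }
        return ngrams;
      }

      public static void main(String[] args) {
        // Prints: [hi i, i heard, heard about, about spark]
        System.out.println(buildNGrams(Arrays.asList("hi", "i", "heard", "about", "spark"), 2));
        // Prints: [] -- fewer than n tokens produces no n-grams
        System.out.println(buildNGrams(Arrays.asList("hi"), 2));
      }
    }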

Example

Input source:

topic  | tokens
-------+---------------------------
Java   | [hi,i,heard,about,spark]
HDFS   | [hdfs,is,file,system]
Spark  | [spark,is,an,engine]

NGramTransform:

Mandatory inputs from user:

    • Field to be used to transform input features into n-grams: "tokens"
    • Number of terms in each n-gram: "2"
    • Transformed field for the sequence of n-grams: "ngrams"

Output:

ngrams
---------------------------------------
[hi i,i heard,heard about,about spark]
[hdfs is,is file,file system]
[spark is,is an,an engine]

End-to-end example pipeline:

Stream → Tokenizer → NGramTransform → TPFSAvro

Input source:

 

topic  | sentence
-------+------------------------
java   | hi i heard about spark
HDFS   | hdfs is a file system
Spark  | spark is an engine


Tokenizer:

Mandatory inputs from user:

    • Column on which tokenization is to be done: "sentence"
    • Delimiter for tokenization: " " (space)
    • Output column name for the tokenized data: "tokens"

NGramTransform:

Mandatory inputs from user:

    • Field to be used to transform input features into n-grams: "tokens"
    • Number of terms in each n-gram: "2"
    • Transformed field for the sequence of n-grams: "ngrams"


TPFSAvro Output:

ngrams
---------------------------------------
[hi i,i heard,heard about,about spark]
[hdfs is,is a,a file,file system]
[spark is,is an,an engine]

 

Design

Properties:

  • **fieldToBeTransformed:** Column to be used to transform input features into n-grams.
  • **numberOfTerms:** Number of terms in each n-gram.
  • **outputField:** Transformed column for the sequence of n-grams.
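
A possible shape for the plugin configuration, assuming the plugin follows the standard CDAP PluginConfig pattern (the class name is an assumption; the field names mirror the properties above):

    import co.cask.cdap.api.annotation.Description;
    import co.cask.cdap.api.annotation.Name;
    import co.cask.cdap.api.plugin.PluginConfig;

    // Sketch only -- assumes the co.cask.cdap plugin APIs of this era.
    public class NGramTransformConfig extends PluginConfig {

      @Name("fieldToBeTransformed")
      @Description("Column to be used to transform input features into n-grams.")
      private String fieldToBeTransformed;

      @Name("numberOfTerms")
      @Description("Number of terms in each n-gram.")
      private Integer numberOfTerms;

      @Name("outputField")
      @Description("Transformed column for the sequence of n-grams.")
      private String outputField;
    }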

Input JSON:

         {
           "name": "NGramTransform",
           "type": "sparkcompute",
           "properties": {
             "fieldToBeTransformed": "tokens",
             "numberOfTerms": "2",
             "outputField": "ngrams"
           }
         }
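
Because the plugin is of type sparkcompute, the transformation itself could be delegated to Spark ML's NGram transformer. A minimal sketch of that approach (the input Dataset tokensDF and the surrounding plugin wiring are assumptions):

    import org.apache.spark.ml.feature.NGram;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;

    // tokensDF is assumed to hold a string-array column named "tokens".
    NGram ngramTransformer = new NGram()
      .setN(2)                  // numberOfTerms
      .setInputCol("tokens")    // fieldToBeTransformed
      .setOutputCol("ngrams");  // outputField

    Dataset<Row> result = ngramTransformer.transform(tokensDF);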


Checklist

  • User stories documented 
  • User stories reviewed 
  • Design documented 
  • Design reviewed 
  • Feature merged 
  • Examples and guides 
  • Integration tests 
  • Documentation for feature 
  • Short video demonstrating the feature

 

 
