Introduction


An n-gram is a sequence of n tokens (typically words) for some integer n.

The NGramTerms plugin transforms input features into n-grams.

Use-Case

  • Transform input features (tokens in array form) into n-grams, using a parameter for the number of terms in each n-gram.
  • The transformed output is an array of n-grams, where each n-gram is a space-delimited string of n consecutive words.
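
The transformation described above can be sketched in a few lines. This is only an illustration of the sliding-window idea, not the plugin's implementation; the function name `ngram_terms` and its parameters are hypothetical.

```python
from typing import List


def ngram_terms(tokens: List[str], n: int) -> List[str]:
    """Return the n-grams of `tokens` as space-delimited strings.

    Hypothetical helper for illustration only; not part of the plugin.
    """
    if n < 1:
        raise ValueError("n must be a positive integer")
    # A window starting at index i covers tokens[i : i + n].
    # There are len(tokens) - n + 1 such windows; when that count is
    # zero or negative, range() is empty and the result is [].
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


print(ngram_terms(["hdfs", "is", "file", "system"], 2))
# prints ['hdfs is', 'is file', 'file system']
```

An input of k tokens yields k - n + 1 n-grams, so a four-token array with n = 2 produces three bigrams.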

User Stories

  • As a Hydrator user, I want to transform the input features in a column of the source schema into an output schema that has a single column holding the n-gram data.
  • As a Hydrator user, I want a configuration option to specify the column of the input schema on which the transformation is performed.
  • As a Hydrator user, I want a configuration option to specify the number of terms used when transforming input features into n-grams.
  • As a Hydrator user, I want a configuration option to specify the name of the output column to which the n-grams are emitted.

Conditions

  • The source field to be transformed must be of type string array.
  • Only a single column from the source schema can be transformed.
  • Output schema will have a single column of type string array.
  • If the input sequence contains fewer than n strings, no output is produced.
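
A minimal record-level sketch of these conditions follows. It assumes records are plain dictionaries, and it reads "no output is produced" as "the record is skipped"; both are assumptions made for illustration, as is the name `transform_record`. The actual plugin might instead emit an empty array.

```python
from typing import Any, Dict, List, Optional


def transform_record(
    record: Dict[str, Any], source_column: str, n: int, output_column: str
) -> Optional[Dict[str, List[str]]]:
    """Apply the stated conditions to one record (hypothetical helper)."""
    tokens = record.get(source_column)
    # Condition: the source field must be a string array.
    if not isinstance(tokens, list) or not all(isinstance(t, str) for t in tokens):
        raise TypeError(f"column '{source_column}' must be an array of strings")
    # Condition: fewer than n strings yields no output.
    # ASSUMPTION: modelled here as skipping the record (returning None).
    if len(tokens) < n:
        return None
    grams = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    # Condition: the output schema has a single string-array column,
    # so other input columns (for example "topic") are not carried over.
    return {output_column: grams}


row = {"topic": "Spark", "tokens": ["spark", "is", "an", "engine"]}
print(transform_record(row, "tokens", 2, "ngrams"))
# prints {'ngrams': ['spark is', 'is an', 'an engine']}
```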

Example

Input source:

topic | tokens
------|--------------------------
Java  | [hi,i,heard,about,spark]
HDFS  | [hdfs,is,file,system]
Spark | [spark,is,an,engine]

NGramTerms:

Mandatory inputs from the user:

    • Column to be used to transform input features into n-grams: "tokens"
    • Number of terms in each n-gram: "2"
    • Transformed column for the sequence of n-grams: "ngrams"

Output:

ngrams

[hi i,i heard,heard about,about spark]

[hdfs is,is file,file system]

[spark is,is an,an engine]

Design

Properties:

  • **columnToBeTransformed:** Column to be used to transform input features into n-grams.
  • **noOfTerms:** Number of terms in each n-gram.
  • **outputColumn:** Transformed column for the sequence of n-grams.

Input JSON:

    {
      "name": "NGramTerms",
      "type": "sparkcompute",
      "properties": {
        "columnToBeTransformed": "tokens",
        "noOfTerms": "2",
        "outputColumn": "ngrams"
      }
    }
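
Note that `noOfTerms` is passed as a string in the JSON above, so it has to be converted to an integer before use. The sketch below shows one way such a configuration might be read and checked. The function `parse_ngram_terms_config` and its validation rules are assumptions for illustration; they are not the plugin's or the platform's actual configuration handling.

```python
import json
from typing import Any, Dict

# The three properties listed under "Properties" above.
REQUIRED_PROPERTIES = ("columnToBeTransformed", "noOfTerms", "outputColumn")


def parse_ngram_terms_config(raw_json: str) -> Dict[str, Any]:
    """Read the stage JSON and return validated properties (hypothetical)."""
    stage = json.loads(raw_json)
    props = stage.get("properties", {})
    missing = [name for name in REQUIRED_PROPERTIES if name not in props]
    if missing:
        raise ValueError("missing required properties: " + ", ".join(missing))
    # noOfTerms arrives as a string such as "2"; int() raises ValueError
    # if it is not numeric.
    n = int(props["noOfTerms"])
    if n < 1:
        raise ValueError("noOfTerms must be a positive integer")
    return {
        "columnToBeTransformed": props["columnToBeTransformed"],
        "noOfTerms": n,
        "outputColumn": props["outputColumn"],
    }


EXAMPLE_STAGE = """
{
  "name": "NGramTerms",
  "type": "sparkcompute",
  "properties": {
    "columnToBeTransformed": "tokens",
    "noOfTerms": "2",
    "outputColumn": "ngrams"
  }
}
"""

config = parse_ngram_terms_config(EXAMPLE_STAGE)
print(config["noOfTerms"])
# prints 2
```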

Checklist

  •  User stories documented 
  •  User stories reviewed 
  •  Design documented 
  •  Design reviewed 
  •  Feature merged 
  •  Examples and guides 
  •  Integration tests 
  •  Documentation for feature 
  •  Short video demonstrating the feature