
Introduction


An n-gram is a sequence of n tokens (typically words) for some integer n. The NGram transform converts input features into n-grams.

Use-Case

Transform input features (tokens in array form) into n-grams, using a parameter that sets the number of terms in each n-gram.
The transformed output is an array of n-grams, where each n-gram is a space-delimited string of n consecutive words.
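
The transformation described above can be sketched in plain Python as a sliding window over the token array. This is only an illustrative sketch, not the plugin's implementation: the function name `ngrams` is hypothetical, and returning an empty list when the array has fewer than n tokens is an assumption made here.

```python
def ngrams(tokens, n):
    """Return space-delimited n-grams built from n consecutive tokens."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    # Slide a window of width n across the token list. A list shorter than n
    # yields no windows, so this sketch returns an empty list in that case.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


print(ngrams(["hi", "i", "heard", "about", "spark"], 2))
# ['hi i', 'i heard', 'heard about', 'about spark']
```

An array of k tokens produces k - n + 1 n-grams, so five tokens with n = 2 give four bigrams.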

User Stories

  • As a Hydrator user, I want to transform the data in a column from the source schema and emit the n-grams to an output schema that has a single column holding the n-gram data.
  • As a Hydrator user, I want a configuration option to specify the input schema column on which the transformation is performed.
  • As a Hydrator user, I want a configuration option to specify the number of terms in each n-gram.
  • As a Hydrator user, I want a configuration option to specify the output column name to which the n-grams are emitted.

Conditions

  • The source field to be transformed must be of array type.
  • Only a single column from the source schema can be transformed.
  • The output schema will have a single column of type string array.

Example

Input source:

| topic | tokens                   |
|-------|--------------------------|
| Java  | [hi,i,heard,about,spark] |
| HDFS  | [hdfs,is,file,system]    |
| Spark | [spark,is,an,engine]     |

NGramTerms:

Mandatory inputs from user:

    • Column to be used to transform input features into n-grams: "tokens"
    • Number of terms in each n-gram: "2"
    • Transformed column for the sequence of n-grams: "ngrams"

Output:

| ngrams                                      |
|---------------------------------------------|
| [hi i,i heard,heard about,about spark]      |
| [hdfs is,is file,file system]               |
| [spark is,is an,an engine]                  |
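
The example can be reproduced with a short record-level sketch. It assumes records are plain Python dictionaries and uses a hypothetical `transform` helper; the real plugin runs as a Spark compute stage, so this only illustrates the expected input-to-output mapping and the conditions listed earlier (array-typed source field, single output column).

```python
def transform(records, column_to_be_transformed, no_of_terms, output_column):
    """Emit one single-column record per input record, holding its n-grams."""
    output = []
    for record in records:
        tokens = record[column_to_be_transformed]
        # The source field must be an array (modelled here as a Python list).
        if not isinstance(tokens, list):
            raise TypeError(f"field '{column_to_be_transformed}' must be an array")
        grams = [" ".join(tokens[i:i + no_of_terms])
                 for i in range(len(tokens) - no_of_terms + 1)]
        # Only the n-gram column is kept; 'topic' is not carried over.
        output.append({output_column: grams})
    return output


rows = [
    {"topic": "Java", "tokens": ["hi", "i", "heard", "about", "spark"]},
    {"topic": "HDFS", "tokens": ["hdfs", "is", "file", "system"]},
    {"topic": "Spark", "tokens": ["spark", "is", "an", "engine"]},
]

for row in transform(rows, "tokens", 2, "ngrams"):
    print(row)
```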

 

Design

Properties:

  • **columnToBeTransformed:** Column to be used to transform input features into n-grams.
  • **noOfTerms:** Number of terms in each n-gram.
  • **outputColumn:** Transformed column for the sequence of n-grams.

Input JSON:

    {
      "name": "NGramTerms",
      "type": "sparkcompute",
      "properties": {
        "columnToBeTransformed": "tokens",
        "noOfTerms": "2",
        "outputColumn": "ngrams"
      }
    }
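
Because every property value in the configuration is a string, `noOfTerms` has to be converted to an integer before use. The following is a minimal sketch of reading and checking the three properties; `read_properties` is a hypothetical helper, and the checks shown (all three properties present, `noOfTerms` at least 1) are assumptions rather than the plugin's documented validation.

```python
import json

CONFIG = """
{
  "name": "NGramTerms",
  "type": "sparkcompute",
  "properties": {
    "columnToBeTransformed": "tokens",
    "noOfTerms": "2",
    "outputColumn": "ngrams"
  }
}
"""


def read_properties(config_text):
    """Return (source column, number of terms, output column) from the config."""
    props = json.loads(config_text)["properties"]
    required = ("columnToBeTransformed", "noOfTerms", "outputColumn")
    missing = [key for key in required if key not in props]
    if missing:
        raise ValueError("missing properties: " + ", ".join(missing))
    # Property values arrive as strings, so "2" is converted to the integer 2.
    no_of_terms = int(props["noOfTerms"])
    if no_of_terms < 1:
        raise ValueError("noOfTerms must be a positive integer")
    return props["columnToBeTransformed"], no_of_terms, props["outputColumn"]


print(read_properties(CONFIG))
# ('tokens', 2, 'ngrams')
```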

Checklist

  • User stories documented 
  • User stories reviewed 
  • Design documented 
  • Design reviewed 
  • Feature merged 
  • Examples and guides 
  • Integration tests 
  • Documentation for feature 
  • Short video demonstrating the feature