NGramTransform

Introduction

 

An n-gram is a sequence of n consecutive tokens (typically words) for some integer n.

The NGramTransform plugin transforms input features into n-grams.

Use-Case

  • A bio data scientist wants to study the sequence of nucleotides in an input stream of DNA sequencing data to identify the bonds.
    The input stream contains a DNA sequence, e.g. AGCTTCGA. The output contains the bigram sequence AG, GC, CT, TT, TC, CG, GA.

    Input source: 

    Mandatory inputs from user for NGramTransform:

    • Field to be used to transform input features into n-grams: "DNASequence"

    • Number of terms in each n-gram: "2"

    • Transformed field for the sequence of n-grams: "bigram"

    • Tokenization unit used to tokenize the input string before the n-grams are created: "Character"

    Output: 
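The bigram output described in this use case can be reproduced with a short sliding-window sketch. This is plain Python for illustration only, not the plugin's Spark implementation, and `char_ngrams` is a hypothetical helper name.

```python
from typing import List


def char_ngrams(sequence: str, n: int) -> List[str]:
    """Return every run of n consecutive characters in `sequence`.

    Tokenization unit is a single character, as in the DNA use case.
    If the sequence is shorter than n, the range is empty and so is the result.
    """
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]


print(char_ngrams("AGCTTCGA", 2))
# ['AG', 'GC', 'CT', 'TT', 'TC', 'CG', 'GA']
```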

 

 

User Stories

 

  • As a Hydrator user, I want to transform the input feature data in a column of the source schema into an output schema in which one of the columns contains the n-gram data.

  • As a Hydrator user, I want a configuration option to specify the column name from the input schema on which the transformation is performed.

  • As a Hydrator user, I want a configuration option to specify the number of terms used to transform input features into n-grams.

  • As a Hydrator user, I want a configuration option to specify the output column name in which the n-grams are emitted.

  • As a Hydrator user, I want to specify the tokenization unit used to tokenize the input before it is converted to n-grams.

Conditions

  • The source field to be transformed must be of type string.

  • Only a single field from the source schema can be transformed.

  • If the input sequence contains fewer than n strings, no output is produced.
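The conditions above can be sketched as a small guard function. Several details here are assumptions, not documented plugin behavior: the name `ngrams_or_empty`, the whitespace word tokenization, raising `TypeError` for a non-string field, and modelling "no output" as an empty list.

```python
from typing import Any, List


def ngrams_or_empty(value: Any, n: int) -> List[str]:
    """Apply the documented conditions to a single field value."""
    # Condition: the source field must be a string (error type is an assumption).
    if not isinstance(value, str):
        raise TypeError("source field must be of type string")

    # Assumption: word tokenization by splitting on whitespace.
    tokens = value.split()

    # Condition: fewer than n tokens means no output (modelled as an empty list).
    if len(tokens) < n:
        return []

    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


print(ngrams_or_empty("spark", 2))               # []
print(ngrams_or_empty("spark is an engine", 2))  # ['spark is', 'is an', 'an engine']
```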


End to End Example pipeline:       

Stream -> NGramTransform -> TPFSAvro

 

Input source:

 

| topic | sentence               |
|-------|------------------------|
| java  | hi i heard about spark |
| HDFS  | hdfs is a file system  |
| Spark | spark is an engine     |

 

NGramTransform:

Mandatory inputs from user:

TPFSAvro Output

| topic | sentence               | ngrams                                   |
|-------|------------------------|------------------------------------------|
| java  | hi i heard about spark | [hi i,i heard,heard about,about spark]   |
| HDFS  | hdfs is a file system  | [hdfs is,is a,a file,file system]        |
| Spark | spark is an engine     | [spark is,is an,an engine]               |
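The output rows listed above can be reproduced record by record. The following is a minimal sketch in plain Python, assuming word tokenization on whitespace and n = 2; `add_ngrams` and the dictionary-based records are hypothetical stand-ins for the plugin's structured records.

```python
from typing import Any, Dict


def add_ngrams(record: Dict[str, Any], source_field: str,
               output_field: str, n: int) -> Dict[str, Any]:
    """Return a copy of `record` with an extra field holding the word n-grams."""
    tokens = record[source_field].split()
    grams = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    # The input fields are passed through unchanged; only the output field is added.
    return {**record, output_field: grams}


rows = [
    {"topic": "java", "sentence": "hi i heard about spark"},
    {"topic": "HDFS", "sentence": "hdfs is a file system"},
    {"topic": "Spark", "sentence": "spark is an engine"},
]

for row in rows:
    print(add_ngrams(row, "sentence", "ngrams", 2))
```

Each printed record keeps `topic` and `sentence` and gains an `ngrams` list matching the corresponding row of the TPFSAvro output.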

 

Design

This is a plugin of type sparkcompute and is meant to work with Spark only.

Properties:

  • **fieldToBeTransformed:** Column whose input features are transformed into n-grams.

  • **numberOfTerms:** Number of terms in each n-gram.

  • **outputField:** Column that receives the sequence of n-grams.

  • **tokenizationUnit:** Unit into which the input string is tokenized.

Input JSON:

         {
           "name": "NGramTransform",
           "type": "sparkcompute",
           "properties": {
             "fieldToBeTransformed": "tokens",
             "numberOfTerms": "2",
             "tokenizationUnit": "word",
             "outputField": "ngrams"
           }
         }
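As an illustration of how these properties might be read and checked, here is a hedged sketch in plain Python. The function `read_properties` is hypothetical, and the validation rules are assumptions: the accepted units "word" and "character" are inferred from the two examples on this page, and case-insensitive matching is a guess.

```python
import json

STAGE_JSON = """
{
  "name": "NGramTransform",
  "type": "sparkcompute",
  "properties": {
    "fieldToBeTransformed": "tokens",
    "numberOfTerms": "2",
    "tokenizationUnit": "word",
    "outputField": "ngrams"
  }
}
"""

REQUIRED = ("fieldToBeTransformed", "numberOfTerms", "tokenizationUnit", "outputField")


def read_properties(stage_json: str) -> dict:
    """Parse the stage JSON and return validated, typed properties."""
    props = json.loads(stage_json)["properties"]

    missing = [key for key in REQUIRED if key not in props]
    if missing:
        raise ValueError(f"missing properties: {missing}")

    # Property values arrive as strings, so numberOfTerms is converted here.
    n = int(props["numberOfTerms"])
    if n < 1:
        raise ValueError("numberOfTerms must be a positive integer")

    # Assumption: only these two units exist, matched case-insensitively.
    unit = props["tokenizationUnit"].lower()
    if unit not in ("word", "character"):
        raise ValueError(f"unsupported tokenizationUnit: {unit}")

    return {
        "fieldToBeTransformed": props["fieldToBeTransformed"],
        "numberOfTerms": n,
        "tokenizationUnit": unit,
        "outputField": props["outputField"],
    }


print(read_properties(STAGE_JSON))
```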

 

Checklist

User stories documented 
User stories reviewed 
Design documented 
Design reviewed 
Feature merged 
Examples and guides 
Integration tests 
Documentation for feature 
Short video demonstrating the feature

 

 

Created in 2020 by Google Inc.