Introduction

An n-gram is a sequence of n tokens (typically words) for some integer n.

The NGramTransform plugin transforms input features into n-grams.

Use-Case

Transform input features (tokens in array form) into n-grams, using a parameter that sets the number of terms in each n-gram.
The transformed output is an array of n-grams, where each n-gram is a space-delimited string of n consecutive words.
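
The use case above can be sketched in a few lines. This is only an illustration in Python, not the plugin's implementation (the plugin is a Spark compute plugin); the function name `ngrams` is hypothetical.

```python
from typing import List


def ngrams(tokens: List[str], n: int) -> List[str]:
    """Return every run of n consecutive tokens as a space-delimited string."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    # Slide a window of width n over the tokens; the range is empty
    # when there are fewer than n tokens, so the result is an empty list.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


print(ngrams(["hi", "i", "heard", "about", "spark"], 2))
# ['hi i', 'i heard', 'heard about', 'about spark']
```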

User Stories

As a Hydrator user, I want to transform the input feature data in a column of the source schema into an output schema that has a single column holding the n-gram data.
As a Hydrator user, I want a configuration option for specifying the column of the input schema on which the transformation is performed.
As a Hydrator user, I want a configuration option for specifying the number of terms used to transform the input features into n-grams.
As a Hydrator user, I want a configuration option for specifying the output column name in which the n-grams are emitted.

Conditions

The source field to be transformed can only be of type string array.

The user can transform only a single field from the source schema.

Output schema will have a single field of type string array.

If the input sequence contains fewer than n strings, no output is produced.
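
The conditions above could be enforced roughly as follows. This is a sketch under stated assumptions: the helper name `transform_field` is hypothetical, and "no output" is modelled here as returning `None`, whereas the actual plugin may instead emit an empty array or skip the record.

```python
from typing import Any, List, Optional


def transform_field(value: Any, n: int) -> Optional[List[str]]:
    """Apply the stated conditions to one source field value."""
    # Condition: the source field must be an array of strings.
    if not isinstance(value, list) or not all(isinstance(t, str) for t in value):
        raise TypeError("source field must be an array of strings")
    # Condition: fewer than n strings produces no output
    # (assumption: represented as None in this sketch).
    if len(value) < n:
        return None
    # Output is a single array of strings, one entry per n-gram.
    return [" ".join(value[i:i + n]) for i in range(len(value) - n + 1)]


print(transform_field(["spark", "is", "an", "engine"], 2))
print(transform_field(["spark"], 2))
```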

Example

Input source:

topic	tokens
Java	[hi,i,heard,about,spark]
HDFS	[hdfs,is,file,system]
Spark	[spark,is,an,engine]

NGramTransform:

Mandatory inputs from user:

Field to be used to transform input features into n-grams: "tokens"

Number of terms in each n-gram: "2"

Transformed field for sequence of n-gram: "ngrams"

Output:

ngrams

[hi i,i heard,heard about,about spark]

[hdfs is,is file,file system]

[spark is,is an,an engine]

End to End Example pipeline:

Stream	Tokenizer	NGramTransform	TPFSAvro

Input source:

topic	sentence
java	hi i heard about spark
HDFS	hdfs is a file system
Spark	spark is an engine

Tokenizer:

Mandatory inputs from user:

- Column on which tokenization is to be done: "sentence"
- Delimiter for tokenization: " "
- Output column name for tokenized data: "tokens"

NGramTransform:

Mandatory inputs from user:

- Field to be used to transform input features into n-grams: "tokens"
- Number of terms in each n-gram: "2"
- Transformed field for sequence of n-gram: "ngrams"

TPFSAvro Output

ngrams

[hi i,i heard,heard about,about spark]

[hdfs is,is a,a file,file system]

[spark is,is an,an engine]

A bio data scientist wants to study the sequence of nucleotides in an input stream of DNA sequencing data to identify the bonds.
The input stream contains a DNA sequence, e.g. AGCTTCGA. The output contains the bigram sequence AG, GC, CT, TT, TC, CG, GA.

Input source:

DNASequence
AGCTTCGA

Mandatory inputs from user (NGramTransform):

- Field to be used to transform input features into n-grams: "DNASequence"
- Number of terms in each n-gram: "2"

Design

Properties:

**fieldToBeTransformed:** Column to be used to transform input features into n-grams.

**noOfTerms:** Number of terms in each n-gram.

**outputField:** Output column that holds the sequence of n-grams.

Input JSON:

{
  "name": "NGramTransform",
  "type": "sparkcompute",
  "properties": {
    "fieldToBeTransformed": "tokens",
    "noOfTerms": "2",
    "outputField": "ngrams"
  }
}

Checklist

User stories documented
User stories reviewed
Design documented
Design reviewed
Feature merged
Examples and guides
Integration tests
Documentation for feature
Short video demonstrating the feature

NGramTransform inputs and output for the DNA sequence AGCTTCGA:

- Transformed field for sequence of n-gram: "bigram"
- Tokenization unit used to tokenize the input string before the n-grams are created: "Character"

Output:

DNASequence	bigram
AGCTTCGA	[AG, GC, CT, TT, TC, CG, GA]
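
With the character tokenization unit, the same sliding window runs over single characters instead of words. The sketch below is illustrative only: the function name `char_ngrams` is hypothetical, and joining the characters without a separator is an assumption taken from the output shown above (AG rather than "A G").

```python
from typing import List


def char_ngrams(sequence: str, n: int) -> List[str]:
    """Return every run of n consecutive characters of the input string."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    # Slicing a string yields the n-character window directly.
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]


print(char_ngrams("AGCTTCGA", 2))
# ['AG', 'GC', 'CT', 'TT', 'TC', 'CG', 'GA']
```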

User Stories

As a Hydrator user, I want to transform the input feature data in a column of the source schema into an output schema that has the n-gram data in one of its columns.
As a Hydrator user, I want a configuration option for specifying the column of the input schema on which the transformation is performed.
As a Hydrator user, I want a configuration option for specifying the number of terms used to transform the input features into n-grams.
As a Hydrator user, I want a configuration option for specifying the output column name in which the n-grams are emitted.
As a Hydrator user, I want to specify the tokenization unit used to tokenize the input before it is converted to n-grams.

End to End Example pipeline:

Stream	NGramTransform	TPFSAvro

Input source:

topic	sentence
java	hi i heard about spark
HDFS	hdfs is a file system
Spark	spark is an engine

NGramTransform:

Mandatory inputs from user:

- Field to be used to transform input features into n-grams: "tokens"
- Number of terms in each n-gram: "2"
- Transformed field for sequence of n-gram: "ngrams"
- Tokenization unit: "words"

TPFSAvro Output

topic	sentence	ngrams
java	hi i heard about spark	[hi i,i heard,heard about,about spark]
HDFS	hdfs is a file system	[hdfs is,is a,a file,file system]
Spark	spark is an engine	[spark is,is an,an engine]
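
The record-level behaviour of this pipeline can be sketched as below. The sketch assumes that the field being transformed is the `sentence` column from the input table, that the "words" unit means splitting on whitespace, and that the input columns are carried through to the output; the helper name `add_ngrams` is hypothetical and not part of the plugin.

```python
from typing import Any, Dict, List


def add_ngrams(record: Dict[str, Any], field: str, n: int, output_field: str) -> Dict[str, Any]:
    """Return a copy of the record with an extra column holding word n-grams."""
    # Assumption: the "words" tokenization unit splits on whitespace.
    words: List[str] = str(record[field]).split()
    grams = [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    out = dict(record)  # keep the input columns untouched
    out[output_field] = grams
    return out


rows = [
    {"topic": "java", "sentence": "hi i heard about spark"},
    {"topic": "HDFS", "sentence": "hdfs is a file system"},
    {"topic": "Spark", "sentence": "spark is an engine"},
]
for row in rows:
    print(add_ngrams(row, "sentence", 2, "ngrams")["ngrams"])
```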

End to End Example pipeline

Input source:

topic	sentence
Java	Hello world / is the /basic application
HDFS	HDFS/ is a /file system
Spark	Spark /is engine for /bigdata processing

Design

This is a sparkcompute type of plugin and is meant to work with Spark only.

Properties:

**fieldToBeTransformed:** Column to be used to transform input features into n-grams.

**numberOfTerms:** Number of terms in each n-gram.

**outputField:** Output column that holds the sequence of n-grams.

**tokenizationUnit:** Unit into which the input string will be tokenized.

Input JSON:

{
  "name": "NGramTransform",
  "type": "sparkcompute",
  "properties": {
    "fieldToBeTransformed": "tokens",
    "numberOfTerms": "2",
    "tokenizationUnit": "word",
    "outputField": "ngrams"
  }
}
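
As a rough illustration of how such a stage configuration might be read and checked, the sketch below parses the JSON and converts the string-valued properties. It assumes the property key is `numberOfTerms` and that `tokenizationUnit` falls back to "word" when absent; the class `NGramConfig` and function `parse_config` are hypothetical names, and the set of accepted tokenization units is not validated because the page does not spell it out.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class NGramConfig:
    field_to_be_transformed: str
    number_of_terms: int
    output_field: str
    tokenization_unit: str


def parse_config(raw_json: str) -> NGramConfig:
    """Read the stage JSON and validate the NGramTransform properties."""
    props = json.loads(raw_json)["properties"]
    # Property values are strings in the stage JSON, so convert explicitly.
    n = int(props["numberOfTerms"])
    if n < 1:
        raise ValueError("numberOfTerms must be a positive integer")
    return NGramConfig(
        field_to_be_transformed=props["fieldToBeTransformed"],
        number_of_terms=n,
        output_field=props["outputField"],
        # Assumption: default to "word" when the property is missing.
        tokenization_unit=props.get("tokenizationUnit", "word"),
    )


example = """
{
  "name": "NGramTransform",
  "type": "sparkcompute",
  "properties": {
    "fieldToBeTransformed": "tokens",
    "numberOfTerms": "2",
    "tokenizationUnit": "word",
    "outputField": "ngrams"
  }
}
"""
config = parse_config(example)
print(config)
```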
