NGramTransform Spark Compute Analytics

The NGramTransform Spark Compute analytics plugin is available in the Hub.

Transforms the input features into n-grams, where an n-gram is a sequence of n consecutive tokens (typically words) for some integer n.

For example, a bioinformatics data scientist wants to study nucleotide sequences in an input stream of DNA sequencing data to identify the bonds. If the input stream contains the DNA sequence AGCTTCGA, the output contains the bigrams AG, GC, CT, TT, TC, CG, GA.
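The bigram example can be reproduced with a short sliding-window sketch. This is plain Python for illustration only, not the plugin's Spark code; the helper name `char_ngrams` is hypothetical, and each nucleotide letter is assumed to be one token.

```python
from typing import List


def char_ngrams(sequence: str, n: int) -> List[str]:
    """Return every run of n consecutive characters in sequence, left to right."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    # A sequence of length L has L - n + 1 windows; range() is empty when L < n.
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]


bigrams = char_ngrams("AGCTTCGA", 2)
print(bigrams)  # ['AG', 'GC', 'CT', 'TT', 'TC', 'CG', 'GA']
```

An 8-character sequence yields 8 - 2 + 1 = 7 bigrams, which matches the list in the example.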

Configuration

Property	Macro Enabled?	Description
Field Used To Transform Input Features Into N-Grams	No	Required. Name of the input field whose values are transformed into n-grams.
N-Gram Size	Yes	Required. Number of tokens in each n-gram.
Transformed Field	No	Required. Name of the output field that holds the sequence of n-grams.

Example

This example transforms the features in the `tokens` field into n-grams of size 2 and writes them to the output field `ngrams`.

Property	Value
Field Used To Transform Input Features Into N-Grams	`tokens`
N-Gram Size	`2`
Transformed Field	`ngrams`

Suppose NGramTransform receives the following input records:

topic	tokens
java	`[hi,i,heard,about,spark]`
hdfs	`[hdfs,is,a,file,system]`
spark	`[spark,is,an,engine]`

The output schema contains a single field, `ngrams`, which holds the transformed n-grams as a string array:

ngrams
`[hi i,i heard,heard about,about spark]`
`[hdfs is,is a,a file,file system]`
`[spark is,is an,an engine]`
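The mapping from the input records to the output records above can be sketched in plain Python. This is an illustration under stated assumptions, not the plugin's Spark implementation. The function name `ngram_transform` and the dict-based record representation are hypothetical. Joining tokens with a single space is inferred from the example output, and returning an empty array when a record has fewer tokens than the n-gram size is an assumption.

```python
from typing import Dict, List


def ngram_transform(
    records: List[Dict[str, object]],
    input_field: str,
    ngram_size: int,
    output_field: str,
) -> List[Dict[str, List[str]]]:
    """Build one output record per input record, keeping only the n-gram field."""
    if ngram_size < 1:
        raise ValueError("ngram_size must be a positive integer")
    output = []
    for record in records:
        tokens = list(record[input_field])
        # Slide a window of ngram_size tokens and join each window with a space.
        # Assumption: a record with fewer tokens than ngram_size yields [].
        ngrams = [
            " ".join(tokens[i:i + ngram_size])
            for i in range(len(tokens) - ngram_size + 1)
        ]
        # Other input fields (such as "topic") are not carried over.
        output.append({output_field: ngrams})
    return output


records = [
    {"topic": "java", "tokens": ["hi", "i", "heard", "about", "spark"]},
    {"topic": "hdfs", "tokens": ["hdfs", "is", "a", "file", "system"]},
    {"topic": "spark", "tokens": ["spark", "is", "an", "engine"]},
]

result = ngram_transform(records, "tokens", 2, "ngrams")
for row in result:
    print(row)
```

Each printed record contains only the `ngrams` key, mirroring the single-field output schema described above.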