NGramTransform Spark Compute Analytics (Deprecated)
This plugin is no longer available as of July 26, 2024.
Transforms the input features into n-grams, where an n-gram is a sequence of n consecutive tokens (typically words) for some integer n.
For example, a bioinformatics data scientist wants to study nucleotide sequences from a DNA sequencing input stream to identify bonds. If the input stream contains the DNA sequence AGCTTCGA, with each nucleotide treated as one token, the output contains the bigrams AG, GC, CT, TT, TC, CG, GA.
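The bigram example can be reproduced with a few lines of plain Python. This is a minimal sketch of the sliding-window idea only, not the plugin's implementation; the helper name `ngrams` is hypothetical.

```python
def ngrams(tokens, n):
    """Return every contiguous run of n tokens, in order.

    `ngrams` is a hypothetical helper for illustration, not part of the plugin.
    """
    if n < 1:
        raise ValueError("n must be a positive integer")
    # A sequence of k tokens yields max(k - n + 1, 0) n-grams.
    return [tokens[i:i + n] for i in range(len(tokens) - n + 1)]


# Treat each nucleotide as one token.
nucleotides = list("AGCTTCGA")
bigrams = ["".join(gram) for gram in ngrams(nucleotides, 2)]
print(bigrams)  # ['AG', 'GC', 'CT', 'TT', 'TC', 'CG', 'GA']
```

An input with fewer than n tokens produces an empty list, because no full window of n tokens fits.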
Configuration
Property | Macro Enabled? | Description |
---|---|---|
Field Used To Transform Input Features Into N-Grams | No | Required. Name of the input field whose features (tokens) are transformed into n-grams. |
N-Gram Size | Yes | Required. Size of each n-gram, that is, the number of tokens per n-gram. |
Transformed Field | No | Required. Name of the output field that holds the sequence of n-grams. |
Example
This example transforms the features in the “tokens” field into n-grams with an n-gram size of 2 and writes them to an output field named “ngrams”.
Property | Value |
---|---|
Field Used To Transform Input Features Into N-Grams | tokens |
N-Gram Size | 2 |
Transformed Field | ngrams |
Suppose NGramTransform receives the following input records:
topic | tokens |
---|---|
java | |
hdfs | |
spark | |
The output schema contains a single field, “ngrams”, which holds the transformed n-grams as an array of strings:
ngrams |
---|
|
|
|
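The record-level behavior described in this example can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the plugin's code: the function `ngram_transform` is hypothetical, the token values are invented for illustration, and joining the tokens of each n-gram with a single space is an assumption that mirrors how Spark ML's `NGram` feature transformer represents n-grams.

```python
def ngram_transform(records, input_field, output_field, n):
    """Map each record to a new record that holds only the n-gram field.

    Hypothetical helper for illustration; not part of the plugin.
    """
    output = []
    for record in records:
        tokens = record[input_field]
        # Assumption: the tokens of each n-gram are joined with one space.
        grams = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        # Other input fields, such as "topic", are dropped from the output.
        output.append({output_field: grams})
    return output


# Hypothetical input records; the token values are invented for illustration.
records = [
    {"topic": "java", "tokens": ["java", "is", "a", "language"]},
    {"topic": "hdfs", "tokens": ["hdfs", "stores", "files"]},
    {"topic": "spark", "tokens": ["spark"]},
]

result = ngram_transform(records, "tokens", "ngrams", 2)
for row in result:
    print(row)
```

With these invented inputs, the last record yields an empty array because a single token cannot form a bigram.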
Created in 2020 by Google Inc.