For example, a bio data scientist wants to study the sequence of nucleotides in an input stream of DNA sequencing data to identify the bonds. The input stream contains a DNA sequence, e.g. AGCTTCGA. The output contains the bigram sequence AG, GC, CT, TT, TC, CG, GA.
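The bigram example above can be reproduced with a sliding window over the characters of the sequence. This is a minimal sketch in plain Python to illustrate the idea; the function name `char_ngrams` is hypothetical and is not part of the plugin.

```python
def char_ngrams(sequence: str, n: int = 2) -> list[str]:
    """Return every run of n consecutive characters in sequence.

    Hypothetical helper for illustration only; not the plugin's code.
    """
    if n < 1:
        raise ValueError("n must be a positive integer")
    # Slide a window of width n across the string, one character at a time.
    # A sequence shorter than n yields an empty list.
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]


bigrams = char_ngrams("AGCTTCGA", 2)
print(bigrams)  # ['AG', 'GC', 'CT', 'TT', 'TC', 'CG', 'GA']
```

A sequence of length k produces k - n + 1 n-grams, so the 8-character input yields 7 bigrams.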

Configuration

| Property | Macro Enabled? | Description |
| --- | --- | --- |
| Field Used To Transform Input Features Into N-Grams | No | Required. Field to be used to transform input features into n-grams. |
| N-Gram Size | Yes | Required. NGram size. |
| Transformed Field | No | Required. Transformed field for sequence of n-gram. |

Example

This example transforms the features in the “tokens” field into n-grams with an n-gram size of 2 and writes them to the output field “ngrams”.

| Property | Value |
| --- | --- |
| Field Used To Transform Input Features Into N-Grams | tokens |
| N-Gram Size | 2 |
| Transformed Field | ngrams |

Suppose the NGramTransform receives the following input records:

| topic | tokens |
| --- | --- |
| java | [hi,i,heard,about,spark] |
| hdfs | [hdfs,is,a,file,system] |
| spark | [spark,is,an,engine] |

The output schema contains a single field, “ngrams”, which holds the transformed n-grams as an array of strings:

ngrams

[hi i,i heard,heard about,about spark]

[hdfs is,is a,a file,file system]

[spark is,is an,an engine]
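The mapping from the input records to the output above can be sketched in plain Python. This is an illustration of the transformation under the example configuration, not the plugin's actual implementation; the helper `token_ngrams` and the dictionary-based record layout are assumptions made for this sketch.

```python
def token_ngrams(tokens: list[str], n: int = 2) -> list[str]:
    """Join every run of n consecutive tokens with a single space.

    Hypothetical helper for illustration only; not the plugin's code.
    """
    if n < 1:
        raise ValueError("n must be a positive integer")
    # A token list shorter than n yields an empty list.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Input records from the example, modeled as plain dictionaries (an assumption).
records = [
    {"topic": "java", "tokens": ["hi", "i", "heard", "about", "spark"]},
    {"topic": "hdfs", "tokens": ["hdfs", "is", "a", "file", "system"]},
    {"topic": "spark", "tokens": ["spark", "is", "an", "engine"]},
]

# The output keeps only the transformed field, as in the example output schema.
output = [{"ngrams": token_ngrams(r["tokens"], 2)} for r in records]

for row in output:
    print(row["ngrams"])
```

Note that the “topic” field does not appear in the output, which matches the single-field output schema described above.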