Feature Generator

Introduction 

The Feature Generator plugin generates text-based features from a string field.

Use-case

A user has training data in which tweets are labeled as positive, neutral, or negative. The user wants to train a model (e.g., a decision tree) on this data and then use it to tag new tweets as positive, neutral, or negative.

User Stories

  • The user should be able to generate text-based features from a string field using HashingTF.

  • The user should be able to specify the number of features to use with HashingTF.

  • The user should be able to train and store a skip-gram model (Spark's Word2Vec) to use later for feature generation.

  • The user should be able to set vector size, min count, num partitions, num iterations, and window size when training the skip-gram model.

  • The user should be able to set which fileset and path to use when storing the skip-gram model.

  • The user should be able to generate text-based features from a string field using a stored skip-gram model (Spark's Word2Vec).

  • The user should be able to use the generated features to train a model or to make predictions.

Example

Skip-Gram (Spark's Word2Vec)

The following simple example shows how Spark's Word2Vec can be used for text-based feature generation with a skip-gram model.

The SkipGramFeatureTrainer fits a model on the specified input column with the parameters vectorSize: 3, minCount: 2, numPartitions: 1, numIterations: 1, and windowSize: 3, and saves the model to a fileset.

Suppose the SkipGramFeatureGenerator receives the following input records:

| offset | text             |
|--------|------------------|
| 1      | Spark ML plugins |
| 2      | Classes in Java  |

The SkipGramFeatureGenerator uses the saved model to generate records that contain all input fields along with the output fields specified in ``outputColumnMapping``.

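| offset | text             | result                                                              |
|--------|------------------|---------------------------------------------------------------------|
| 1      | Spark ML plugins | [0.040902843077977494, -0.010430609186490376, -0.04750693837801615] |
| 2      | Classes in Java  | [-0.04352385476231575, 3.2448768615722656E-4, 0.02223073500208557]  |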
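
Spark ML's Word2VecModel turns a document into a single vector by averaging the vectors of its words, which is why each ``result`` above has exactly vectorSize (3) entries. The following is a minimal sketch of that averaging step in plain Python, not the plugin's code. The function name, the hand-written word vectors, and the whitespace tokenization are assumptions made for illustration; words missing from the model are assumed to contribute nothing to the sum.

```python
from typing import Dict, List, Sequence


def average_word_vectors(
    tokens: Sequence[str],
    word_vectors: Dict[str, List[float]],
    vector_size: int,
) -> List[float]:
    """Average the vectors of the given tokens (hypothetical helper).

    Tokens without a vector add nothing to the sum, but the sum is still
    divided by the total number of tokens. An empty document yields zeros.
    """
    total = [0.0] * vector_size
    for token in tokens:
        vector = word_vectors.get(token)
        if vector is None:
            continue  # out-of-vocabulary word
        for i, value in enumerate(vector):
            total[i] += value
    if not tokens:
        return total
    return [value / len(tokens) for value in total]


# Illustrative vectors only; a real model learns these during training.
toy_vectors = {"spark": [1.0, 2.0, 3.0], "ml": [3.0, 2.0, 1.0]}
document_vector = average_word_vectors("spark ml".split(), toy_vectors, 3)
```

With the toy vectors above, ``document_vector`` is the element-wise mean of the two word vectors.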

 

HashingTF Feature Generator:

Suppose the feature generator receives the following records:

| offset | text                                |
|--------|-------------------------------------|
| 1      | Hi I heard about Spark              |
| 2      | Logistic regression models are neat |

The HashingTF Feature Generator transforms the ``text`` column into a fixed-length vector of size 10 and emits the resulting sparse vector as a combination of three columns: result_size, result_indices, and result_value.

| offset | text                                | result_size | result_indices  | result_value              |
|--------|-------------------------------------|-------------|-----------------|---------------------------|
| 1      | Hi I heard about Spark              | 10          | [3, 6, 7, 9]    | [2.0, 1.0, 1.0, 1.0]      |
| 2      | Logistic regression models are neat | 10          | [0, 2, 4, 5, 8] | [1.0, 1.0, 1.0, 1.0, 1.0] |
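
The hashing trick behind these three columns can be sketched in a few lines of plain Python: each term is hashed to a bucket in the range [0, numFeatures), and the value at a bucket is the number of terms that landed there. This is an illustration under stated assumptions, not the plugin's implementation. The function name is hypothetical, tokens are assumed to be split on whitespace, and CRC32 stands in for the hash function; Spark's own HashingTF uses a different hash (MurmurHash3 in recent versions), so the indices produced here will not match the table above.

```python
import zlib
from collections import Counter
from typing import List, Sequence, Tuple


def hashing_tf(tokens: Sequence[str], num_features: int) -> Tuple[int, List[int], List[float]]:
    """Return (size, indices, values) of a sparse term-frequency vector.

    Hypothetical helper: CRC32 is a stand-in hash, chosen only because it is
    stable across Python runs.
    """
    counts: Counter = Counter()
    for token in tokens:
        index = zlib.crc32(token.encode("utf-8")) % num_features
        counts[index] += 1.0  # colliding terms share a bucket and add up
    indices = sorted(counts)
    values = [counts[i] for i in indices]
    return num_features, indices, values


size, indices, values = hashing_tf("Hi I heard about Spark".split(), 10)
```

A value greater than 1.0, such as the 2.0 in the first row of the table, means that two terms hashed to the same index, either because a word repeats or because two different words collide.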

 

Design 

SkipGramFeatureTrainer:

SparkSink to train and store a skip-gram model (Spark's Word2Vec) to use later for feature generation.

Properties:

Input JSON Format

{
  "name": "FeatureTrainer",
  "type": "sparksink",
  "properties": {
    "fileSetName": "feature-generator",
    "path": "feature",
    "vectorSize": "3",
    "minCount": "2",
    "numPartitions": "1",
    "numIterations": "1",
    "windowSize": "3",
    "inputCol": "text"
  }
}
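
All property values arrive as strings, so the numeric settings have to be converted before they can be handed to the trainer. The sketch below shows one way such a check could look. The helper name, the choice of which properties are required, and the rule that every numeric setting must be a positive integer are assumptions for illustration and may be stricter than the actual plugin.

```python
from typing import Dict

# Property names taken from the trainer configuration in this design.
NUMERIC_KEYS = ("vectorSize", "minCount", "numPartitions", "numIterations", "windowSize")
STRING_KEYS = ("fileSetName", "path", "inputCol")


def parse_trainer_properties(properties: Dict[str, str]) -> Dict[str, object]:
    """Hypothetical helper: reject unknown keys and convert numeric strings."""
    known = set(NUMERIC_KEYS) | set(STRING_KEYS)
    unknown = [key for key in properties if key not in known]
    if unknown:
        raise ValueError(f"Unknown properties: {unknown!r}")

    parsed: Dict[str, object] = {}
    for key in STRING_KEYS:
        if not properties.get(key):
            raise ValueError(f"Missing required property: {key}")
        parsed[key] = properties[key]
    for key in NUMERIC_KEYS:
        if key in properties:  # assumed optional; omitted keys are left out
            value = int(properties[key])
            if value <= 0:
                raise ValueError(f"{key} must be a positive integer, got {value}")
            parsed[key] = value
    return parsed


example = parse_trainer_properties({
    "fileSetName": "feature-generator",
    "path": "feature",
    "vectorSize": "3",
    "minCount": "2",
    "numPartitions": "1",
    "numIterations": "1",
    "windowSize": "3",
    "inputCol": "text",
})
```

Rejecting unknown keys also catches property names with stray whitespace, such as ``"windowSize "``, which would otherwise be silently ignored.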

SkipGramFeatureGenerator:

SparkCompute to generate text-based features from a string field using a stored skip-gram model (Spark's Word2Vec).

The SparkCompute emits records containing the original input schema along with the transformed columns specified in ``outputColumnMapping``.

Properties:

Input JSON Format

{
  "name": "FeatureGenerator",
  "type": "sparkcompute",
  "properties": {
    "fileSetName": "feature-generator",
    "path": "feature",
    "outputColumnMapping": "text:result"
  }
}
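
The ``outputColumnMapping`` value pairs an input column with an output column, as in ``text:result``. A minimal sketch of parsing it is shown below. The helper name is hypothetical, and the comma-separated form for several pairs is an assumption, since this design only shows a single pair.

```python
from typing import Dict


def parse_output_column_mapping(mapping: str) -> Dict[str, str]:
    """Parse 'input:output' pairs into a dict (hypothetical helper).

    Assumes multiple pairs are separated by commas, e.g. 'a:b,c:d'.
    """
    result: Dict[str, str] = {}
    for pair in mapping.split(","):
        pair = pair.strip()
        if not pair:
            continue
        parts = [part.strip() for part in pair.split(":")]
        if len(parts) != 2 or not parts[0] or not parts[1]:
            raise ValueError(f"Expected 'input:output', got {pair!r}")
        result[parts[0]] = parts[1]
    if not result:
        raise ValueError("outputColumnMapping must contain at least one pair")
    return result


column_mapping = parse_output_column_mapping("text:result")
```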

 

HashingTFFeatureGenerator:

SparkCompute to generate text-based features from a string field using HashingTF.

The SparkCompute emits records containing the original input schema along with three extra columns (the sparse vector representation of the value) for every transformed column specified in ``outputColumnMapping``.

Properties:

{
  "name": "FeatureGenerator",
  "type": "sparkcompute",
  "properties": {
    "numFeatures": "16",
    "outputColumnMapping": "text:result"
  }
}
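
The output schema described above can be derived from the input fields and the column mapping. The sketch below appends the three sparse-vector columns for each mapped output, using the ``_size``, ``_indices``, and ``_value`` suffixes from the HashingTF example. The helper name and the error handling are assumptions for illustration; field types are left out.

```python
from typing import Dict, List

# Suffixes of the three columns that together represent one sparse vector.
SPARSE_SUFFIXES = ("_size", "_indices", "_value")


def expand_output_fields(input_fields: List[str], mapping: Dict[str, str]) -> List[str]:
    """Return the input field names plus three columns per mapped output.

    Hypothetical helper: `mapping` maps an input column to an output prefix.
    """
    fields = list(input_fields)
    for input_column, output_prefix in mapping.items():
        if input_column not in input_fields:
            raise ValueError(f"Input column {input_column!r} is not in the schema")
        for suffix in SPARSE_SUFFIXES:
            name = output_prefix + suffix
            if name in fields:
                raise ValueError(f"Output column {name!r} already exists")
            fields.append(name)
    return fields


output_fields = expand_output_fields(["offset", "text"], {"text": "result"})
```

For the mapping ``text:result`` and the input fields offset and text, this yields the five columns shown in the HashingTF example.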

Checklist

User stories documented 
User stories reviewed 
Design documented 
Design reviewed 
Feature merged 
Examples and guides 
Integration tests 
Documentation for feature 
Short video demonstrating the feature

Created in 2020 by Google Inc.