Introduction

The Feature Generator plugin generates text-based features from a string field.

Use-case

A user has training data in which tweets are labeled as positive, neutral, or negative. The user wants to train a model (e.g., a decision tree) on this data, then use it to tag new tweets as positive, neutral, or negative.

User Stories

The user should be able to generate text-based features from a string field using HashingTF.
The user should be able to specify the number of features to use with HashingTF.
The user should be able to train and store a skip-gram model (Spark's Word2Vec) to use later for feature generation.
The user should be able to set the vector size, min count, number of partitions, number of iterations, and window size when training the skip-gram model.
The user should be able to set which FileSet and path to use when storing the skip-gram model.
The user should be able to generate text-based features from a string field using a stored skip-gram model (Spark's Word2Vec).
The user should be able to use the generated features to train a model or to make predictions.

Example

Design

FeatureTrainer:

A SparkSink that trains and stores a skip-gram model (Spark's Word2Vec) for later use in feature generation.

Properties:

fileSetName: The name of the FileSet to save the model to.
path: Path within the FileSet to save the model to.
vectorSize: The dimension of the vector that each word is transformed into.
minCount: The minimum number of times a token must appear to be included in the Word2Vec model's vocabulary.
numPartitions: Number of partitions to use for the sentences of words.
numIterations: Maximum number of iterations (>= 0).
windowSize: The window size (context words from [-window, window]). Defaults to 5.
inputCol: The input column used to train the skip-gram model (Spark's Word2Vec).
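
To make the roles of minCount and windowSize concrete, here is a minimal sketch in plain Python. It is not the plugin's or Spark's implementation: the function names are hypothetical, and real skip-gram training adds steps (such as randomly shrinking the window) that are left out here. It only shows which tokens survive the vocabulary filter and which (center, context) pairs a fixed window produces.

```python
from collections import Counter


def build_vocab(sentences, min_count):
    """Keep tokens that appear at least min_count times across all sentences."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return {token for token, n in counts.items() if n >= min_count}


def skipgram_pairs(sentence, vocab, window_size):
    """Return (center, context) pairs for in-vocabulary tokens.

    Simplification: the window is fixed at window_size on both sides.
    """
    tokens = [t for t in sentence if t in vocab]
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window_size)
        hi = min(len(tokens), i + window_size + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs
```

With a minCount of 2, a token that appears only once in the training data never reaches the pair-generation step, so the stored model has no vector for it.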

Input JSON Format

{
    "name": "FeatureTrainer",
    "type": "sparksink",
    "properties": {
        "fileSetName": "feature-generator",
        "path": "feature",
        "vectorSize": "3",
        "minCount": "2",
        "numPartitions": "1",
        "numIterations": "1",
        "windowSize": "3",
        "inputCol": "text"
    }
}
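
All property values in the stage configuration are strings, so the numeric ones have to be converted before they reach the trainer. The sketch below shows one way this could look. The helper name is hypothetical, and the required/optional split is an assumption: only the windowSize default of 5 comes from this design, so other missing numeric properties are simply left out of the result.

```python
import json

# Property names come from the design; how the plugin validates them is assumed.
STRING_PROPERTIES = ("fileSetName", "path", "inputCol")
INT_PROPERTIES = ("vectorSize", "minCount", "numPartitions", "numIterations", "windowSize")
DEFAULTS = {"windowSize": 5}  # the only default stated in the design


def parse_trainer_properties(stage_json):
    """Convert the string-valued stage properties into typed values."""
    props = json.loads(stage_json)["properties"]
    parsed = {}
    for key in STRING_PROPERTIES:
        if not props.get(key):
            raise ValueError(f"Missing required property: {key}")
        parsed[key] = props[key]
    for key in INT_PROPERTIES:
        if key in props:
            parsed[key] = int(props[key])
        elif key in DEFAULTS:
            parsed[key] = DEFAULTS[key]
    return parsed
```

Because the lookup is by exact key, a property name with stray whitespace would be silently ignored, which is one reason to keep the keys in the JSON exact.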

FeatureGenerator:

A SparkCompute that generates text-based features from a string field using HashingTF or a stored skip-gram model (Spark's Word2Vec).

Properties:

featureGenerationType: The extraction type (HashingTF or Word2Vec) used to generate the text-based features.
fileSetName: The name of the FileSet to load the skip-gram model from.
path: Path within the FileSet to load the skip-gram model from.
numFeatures: The number of features to use with HashingTF.
columnMapping: Mapping from input columns to output columns. Each output column contains the feature vector generated for the corresponding input field, as an array.
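
The two generation types can be illustrated with a small plain-Python sketch. This is a simplified stand-in, not the plugin's code: CRC32 is used here only as a stable hash for illustration (Spark uses its own hash function), and averaging over known tokens is an assumption about how per-word vectors might be combined into one feature vector. The function names and the word_vectors dictionary are hypothetical.

```python
import zlib


def hashing_tf(tokens, num_features):
    """Count tokens into a fixed-size vector using the hashing trick."""
    vector = [0.0] * num_features
    for token in tokens:
        # Assumption: CRC32 stands in for the real hash function.
        index = zlib.crc32(token.encode("utf-8")) % num_features
        vector[index] += 1.0
    return vector


def average_word_vectors(tokens, word_vectors, vector_size):
    """Average the stored vectors of known tokens; unknown tokens are skipped."""
    total = [0.0] * vector_size
    known = 0
    for token in tokens:
        vec = word_vectors.get(token)
        if vec is None:
            continue
        known += 1
        for i, value in enumerate(vec):
            total[i] += value
    if known == 0:
        return total  # all zeros when no token is in the vocabulary
    return [value / known for value in total]
```

In both cases the output length is fixed in advance: numFeatures for HashingTF, and the vectorSize chosen at training time for Word2Vec. A smaller numFeatures makes hash collisions between different tokens more likely.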

Input JSON Format

{
    "name": "FeatureGenerator",
    "type": "sparkcompute",
    "properties": {
        "featureGenerationType" : "Word2Vec",
        "fileSetName": "feature-generator",
        "path": "feature",
        "columnMapping": "text:result"
    }
}
 
or 
 
{
    "name": "FeatureGenerator",
    "type": "sparkcompute",
    "properties": {
        "featureGenerationType" : "HashingTF",
        "numFeatures": "10",
        "columnMapping": "text:result"
    }
}
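
The columnMapping value in the examples above is a single "input:output" pair. The sketch below shows how such a value might be parsed and applied to a record. It assumes that multiple pairs would be comma-separated and that input text is tokenized on whitespace; neither detail is specified in this design, and the function names are hypothetical.

```python
def parse_column_mapping(mapping):
    """Parse "in1:out1,in2:out2" into a dict of input to output column names."""
    result = {}
    for pair in mapping.split(","):
        pair = pair.strip()
        if not pair:
            continue
        source, sep, target = pair.partition(":")
        if not sep or not source.strip() or not target.strip():
            raise ValueError(f"Invalid mapping entry: {pair!r}")
        result[source.strip()] = target.strip()
    return result


def add_features(record, mapping, generate):
    """Return a copy of record with one generated feature array per mapping."""
    out = dict(record)
    for source, target in mapping.items():
        # Assumption: whitespace tokenization of the input string.
        out[target] = generate(record[source].split())
    return out
```

Here generate is any function that turns a list of tokens into a list of numbers, so the same wiring works for either feature generation type.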

Checklist

User stories documented
User stories reviewed
Design documented
Design reviewed
Feature merged
Examples and guides
Integration tests
Documentation for feature
Short video demonstrating the feature
