Introduction 

The Feature Generator plugin generates text-based features from a string field.

Use-case

A user has training data in which tweets are labeled as positive, neutral, or negative. The user wants to train a model (e.g., a decision tree) on this data, then use it to tag new tweets as positive, neutral, or negative.

User Stories

  • The user should be able to generate text-based features from a string field using HashingTF.

  • The user should be able to specify the number of features to use with HashingTF.

  • The user should be able to train and store a skip-gram model (Spark's Word2Vec) to use later for feature generation.

  • The user should be able to set the vector size, min count, number of partitions, number of iterations, and window size when training the skip-gram model.

  • The user should be able to set which fileset and path to use when storing the skip-gram model.

  • The user should be able to generate text-based features from a string field using a stored skip-gram model (Spark's Word2Vec).

  • The user should be able to use generated features to train a model or for prediction.

Example

 

Design 

Feature Trainer:

A SparkSink that trains and stores a skip-gram model (Spark's Word2Vec) for later use in feature generation.

Properties:

  • fileSetName: The name of the FileSet to save the model to.
  • path: Path within the FileSet to save the model to.
  • vectorSize: The dimension of the vector that each word is transformed into.
  • minCount: The minimum number of times a token must appear to be included in the Word2Vec model's vocabulary.
  • numPartitions: Number of partitions to use for the sentences of words.
  • numIterations: Maximum number of iterations (>= 0).
  • windowSize: The window size (context words from [-window, window]). Defaults to 5.
  • inputCol: Input column used to train the skip-gram model (Spark's Word2Vec).
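
To make the roles of windowSize and minCount concrete, here is a minimal pure-Python sketch of how skip-gram training pairs could be derived from tokenized sentences. It is an illustration only, not the plugin's or Spark's implementation: the function name `skipgram_pairs` is hypothetical, and the sketch always uses the full window, which a real Word2Vec trainer may not do.

```python
from collections import Counter


def skipgram_pairs(sentences, window_size=5, min_count=2):
    """Return (center, context) pairs for a list of tokenized sentences.

    Hypothetical helper for illustration; not part of the plugin.
    """
    # Count how often each token appears across the whole corpus.
    counts = Counter(token for sentence in sentences for token in sentence)
    pairs = []
    for sentence in sentences:
        # Tokens seen fewer than min_count times are left out of the vocabulary.
        kept = [token for token in sentence if counts[token] >= min_count]
        for i, center in enumerate(kept):
            # Context positions range over [i - window_size, i + window_size].
            start = max(0, i - window_size)
            end = min(len(kept), i + window_size + 1)
            for j in range(start, end):
                if j != i:
                    pairs.append((center, kept[j]))
    return pairs


if __name__ == "__main__":
    corpus = [["good", "movie", "good"], ["bad", "movie", "rare"]]
    for pair in skipgram_pairs(corpus, window_size=1, min_count=2):
        print(pair)
```

A larger windowSize produces more context pairs per word, and a larger minCount shrinks the vocabulary by discarding rare tokens before any pairs are formed.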

Input JSON Format

{
    "name": "FeatureTrainer",
    "type": "sparksink",
    "properties": {
        "fileSetName": "feature-generator",
        "path": "feature",
        "vectorSize": "3",
        "minCount": "2",
        "numPartitions": "1",
        "numIterations": "1",
        "windowSize": "3",
        "inputCol": "text"
    }
}
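
Property values in a stage configuration like this one are strings, so the numeric settings have to be converted before they reach the trainer. The sketch below shows one way that could look. The function name `parse_trainer_properties` is hypothetical, and the only default applied is the windowSize default of 5 stated in the property list; how the plugin actually reads its configuration is not specified here.

```python
import json

# Numeric property names, taken from the Feature Trainer property list.
INT_PROPERTIES = ("vectorSize", "minCount", "numPartitions",
                  "numIterations", "windowSize")


def parse_trainer_properties(stage_json):
    """Return the stage properties with numeric values converted to int.

    Hypothetical helper for illustration; not part of the plugin.
    """
    stage = json.loads(stage_json)
    parsed = dict(stage["properties"])
    # windowSize defaults to 5 according to the property description.
    parsed.setdefault("windowSize", "5")
    for name in INT_PROPERTIES:
        if name in parsed:
            # int() raises ValueError on non-numeric input such as "abc".
            parsed[name] = int(parsed[name])
    return parsed


if __name__ == "__main__":
    config = """
    {
        "name": "FeatureTrainer",
        "type": "sparksink",
        "properties": {
            "fileSetName": "feature-generator",
            "path": "feature",
            "vectorSize": "3",
            "minCount": "2",
            "inputCol": "text"
        }
    }
    """
    print(parse_trainer_properties(config))
```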

FeatureGenerator:

A SparkCompute that generates text-based features from a string field using either HashingTF or a stored skip-gram model (Spark's Word2Vec).

Properties:

  • featureGenerationType: Extraction type (HashingTF or Word2Vec) used to generate the text-based features.
  • fileSetName: The name of the FileSet to load the skip-gram model from.
  • path: Path within the FileSet to load the skip-gram model from.
  • numFeatures: Number of features to use for HashingTF.
  • columnMapping: Mapping from input column to output column. Each output column contains, as an array, the feature vector generated for the corresponding input field.
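
The effect of numFeatures is easiest to see in a small sketch of the hashing trick that HashingTF is based on: each token is hashed to an index in a fixed-length vector, and the count at that index is incremented. The code below is a pure-Python illustration under stated assumptions. The function name `hashing_tf` is hypothetical, and `zlib.crc32` is a stand-in hash, so the indices will not match what Spark's HashingTF produces.

```python
import zlib


def hashing_tf(tokens, num_features=16):
    """Map tokens to a fixed-length term-frequency vector.

    Hypothetical helper for illustration; not part of the plugin.
    """
    vector = [0.0] * num_features
    for token in tokens:
        # Stand-in hash: Spark's HashingTF uses a different hash function,
        # so these indices are only illustrative.
        index = zlib.crc32(token.encode("utf-8")) % num_features
        vector[index] += 1.0
    return vector


if __name__ == "__main__":
    print(hashing_tf("good good movie".split(), num_features=16))
```

Because distinct tokens can hash to the same index, a small numFeatures value increases collisions, while a large value yields longer and sparser vectors.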

Input JSON Format

{
    "name": "FeatureGenerator",
    "type": "sparkcompute",
    "properties": {
        "featureGenerationType" : "Word2Vec",
        "fileSetName": "feature-generator",
        "path": "feature",
        "columnMapping": "text:result"
    }
}
 
or 
 
{
    "name": "FeatureGenerator",
    "type": "sparkcompute",
    "properties": {
        "featureGenerationType" : "HashingTF",
        "numFeatures": "16",
        "columnMapping": "text:result"
    }
}
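
As a rough illustration of how a columnMapping such as "text:result" could drive feature generation with a word-vector model, the sketch below parses the mapping and writes one feature array per mapped column. Several things here are assumptions rather than documented plugin behavior: the function names are hypothetical, multiple pairs are assumed to be comma-separated, text is tokenized by lowercasing and splitting on whitespace, and a document vector is formed by averaging the vectors of known words.

```python
def parse_column_mapping(mapping):
    """Parse 'input:output' pairs (comma-separated pairs are an assumption)."""
    result = {}
    for pair in mapping.split(","):
        source, target = (part.strip() for part in pair.split(":"))
        result[source] = target
    return result


def average_vector(tokens, word_vectors, vector_size):
    """Average the vectors of known tokens; unknown tokens are skipped."""
    total = [0.0] * vector_size
    known = 0
    for token in tokens:
        vec = word_vectors.get(token)
        if vec is not None:
            total = [a + b for a, b in zip(total, vec)]
            known += 1
    # With no known tokens, return the all-zero vector.
    return [x / known for x in total] if known else total


def generate_features(record, mapping, word_vectors, vector_size):
    """Return a copy of record with one feature array per mapped column."""
    output = dict(record)
    for source, target in parse_column_mapping(mapping).items():
        # Assumed tokenization: lowercase, then split on whitespace.
        tokens = record[source].lower().split()
        output[target] = average_vector(tokens, word_vectors, vector_size)
    return output


if __name__ == "__main__":
    # Toy word vectors standing in for a stored skip-gram model.
    vectors = {"good": [1.0, 0.0, 2.0], "movie": [3.0, 2.0, 0.0]}
    print(generate_features({"text": "Good movie"}, "text:result", vectors, 3))
```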

Checklist

  • User stories documented 
  • User stories reviewed 
  • Design documented 
  • Design reviewed 
  • Feature merged 
  • Examples and guides 
  • Integration tests 
  • Documentation for feature 
  • Short video demonstrating the feature