Introduction 

The feature generator plugins will be used to generate text-based features from a string field.

Use-case

A user has training data that labels various tweets as positive, neutral, or negative. The user wants to train a model (e.g., a decision tree) from the data, then use it to tag new tweets as positive, neutral, or negative.

User Stories

  • The user should be able to generate text-based features from a string field using HashingTF.

  • The user should be able to specify the number of features to use with HashingTF.

  • The user should be able to train and store a skip-gram model (Spark's Word2Vec) to use later for feature generation.

  • The user should be able to set the vector size, min count, number of partitions, number of iterations, and window size when training the skip-gram model.

  • The user should be able to set which FileSet and path to use when storing the skip-gram model.

  • The user should be able to generate text-based features from a string field using a stored skip-gram model (Spark's Word2Vec).

  • The user should be able to use the generated features to train a model or for prediction.

Example

 

Design 

SkipGramFeatureTrainer:

A SparkSink that trains and stores a skip-gram model (Spark's Word2Vec) for later use in feature generation.

Properties:

  • fileSetName: The name of the FileSet to save the model to.
  • path: Path within the FileSet to save the model to.
  • vectorSize: The dimension of the vector each word is transformed into.
  • minCount: The minimum number of times a token must appear to be included in the Word2Vec model's vocabulary.
  • numPartitions: Number of partitions for sentences of words.
  • numIterations: Maximum number of iterations (>= 0).
  • windowSize: The window size (context words from [-window, window]). Default is 5.
  • inputCol: Input column used to train the skip-gram model (Spark's Word2Vec).

Input JSON Format

{
    "name": "FeatureTrainer",
    "type": "sparksink",
    "properties": {
        "fileSetName": "feature-generator",
        "path": "feature",
        "vectorSize": "3",
        "minCount": "2",
        "numPartitions": "1",
        "numIterations ": "1",
        "windowSize ": "3",
        "inputCol": "text"
    }
}
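
For reference, below is a minimal sketch of the training logic the sink could wrap around Spark MLlib's Word2Vec, using the property values from the JSON above. The sc and textRdd placeholders (the SparkContext and an RDD of the inputCol values) are assumptions about what the plugin runtime provides, not part of the plugin API.

import org.apache.spark.SparkContext
import org.apache.spark.mllib.feature.{Word2Vec, Word2VecModel}
import org.apache.spark.rdd.RDD

val sc: SparkContext = ???          // provided by the Spark plugin runtime (assumption)
val textRdd: RDD[String] = ???      // values of the configured inputCol ("text")

// Tokenize each record, then train the skip-gram model with the configured properties.
val tokenized: RDD[Seq[String]] = textRdd.map(_.toLowerCase.split("\\s+").toSeq)
val model: Word2VecModel = new Word2Vec()
  .setVectorSize(3)      // vectorSize
  .setMinCount(2)        // minCount
  .setNumPartitions(1)   // numPartitions
  .setNumIterations(1)   // numIterations
  .setWindowSize(3)      // windowSize
  .fit(tokenized)

// Persist the model under the location backed by the configured FileSet and path.
model.save(sc, "<fileset-base-path>/feature")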

SkipGramFeatureGenerator:

A SparkCompute that generates text-based features from a string field using a stored skip-gram model (Spark's Word2Vec).

The SparkCompute will emit records containing the original input schema along with the transformed columns specified in outputMapping.

Properties:

  • fileSetName: The name of the FileSet to load the skip-gram model from.
  • path: Path within the FileSet to load the skip-gram model from.
  • outputMapping: Input column to output column mapping, where each output column will contain the generated feature vector for the corresponding input field as a double array.

Input JSON Format

{
    "name": "FeatureGenerator",
    "type": "sparkcompute",
    "properties": {
        "fileSetName": "feature-generator",
        "path": "feature",
        "outputMapping": "text:result"
    }
}
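
As a sketch of the generation step: the plugin loads the stored model and turns each value of the input column into a double array. One plausible approach, mirroring what Spark ML's Word2Vec transformer does for documents, is to average the vectors of the words found in the model's vocabulary; the textToFeatures helper below is hypothetical and not part of the plugin API.

import org.apache.spark.SparkContext
import org.apache.spark.mllib.feature.Word2VecModel

// Hypothetical helper: average the Word2Vec vectors of the words that appear
// in the model's vocabulary to get one fixed-size feature vector per text value.
def textToFeatures(model: Word2VecModel, text: String, vectorSize: Int): Array[Double] = {
  val vectors = model.getVectors                                   // word -> Array[Float]
  val words = text.toLowerCase.split("\\s+").filter(vectors.contains)
  if (words.isEmpty) Array.fill(vectorSize)(0.0)
  else {
    val sum = new Array[Double](vectorSize)
    for (w <- words; (v, i) <- vectors(w).zipWithIndex) sum(i) += v
    sum.map(_ / words.length)
  }
}

val sc: SparkContext = ???                                             // provided by the plugin runtime (assumption)
val model = Word2VecModel.load(sc, "<fileset-base-path>/feature")      // same FileSet/path used by the trainer
val result = textToFeatures(model, "some tweet text", vectorSize = 3)  // value of the "result" output column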

 

HashingTFFeatureGenerator:

A SparkCompute that generates text-based features from a string field using HashingTF.

The SparkCompute will emit records containing the original input schema along with 3 extra columns (representing the sparse vector of the value) for every transformed column specified in outputMapping.

Properties:

  • numFeatures: Number of features to be used for HashingTF.
  • outputMapping: Input column to output column mapping, where for each input column the output will contain 3 corresponding fields: <output>_size, <output>_indices, <output>_value. Together, the 3 columns give the sparse vector value for the input column.

Input JSON Format

{
    "name": "FeatureGenerator",
    "type": "sparkcompute",
    "properties": {
        "numFeatures": "16"
        "outputMapping": "text:result"
    }
}
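
Below is a minimal sketch of what the HashingTF transformation produces for one record, using Spark MLlib's HashingTF with the numFeatures value above; the tokenization shown (lowercase, whitespace split) is an assumption about the plugin's preprocessing.

import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.linalg.SparseVector

// Hash the tokens of the "text" field into a sparse term-frequency vector.
val hashingTF = new HashingTF(numFeatures = 16)
val tokens = "some tweet text".toLowerCase.split("\\s+").toSeq
val tf = hashingTF.transform(tokens).asInstanceOf[SparseVector]

// The sparse vector is split across the three output fields for "result".
val resultSize: Int = tf.size                  // result_size
val resultIndices: Array[Int] = tf.indices     // result_indices
val resultValues: Array[Double] = tf.values    // result_value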


Checklist

  • User stories documented 
  • User stories reviewed 
  • Design documented 
  • Design reviewed 
  • Feature merged 
  • Examples and guides 
  • Integration tests 
  • Documentation for feature 
  • Short video demonstrating the feature