Introduction

Term frequency-inverse document frequency (TF-IDF) is a feature vectorization method widely used in text mining to reflect the importance of a term to a document in the corpus. TF-IDF is separated into two parts: TF (with hashing) and IDF.

TF: HashingTF is a Transformer which takes sets of terms and converts those sets into fixed-length feature vectors.
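
The hashing step can be illustrated with a minimal pure-Python sketch. It is not the plugin's or Spark's implementation: the function name `hashing_tf`, the `num_features` parameter, and the use of CRC32 as the hash are assumptions made for illustration (Spark uses its own hash function). It only shows how terms can be mapped to indices of a fixed-length vector without first building a term-index map.

```python
import zlib
from collections import Counter


def hashing_tf(terms, num_features=16):
    """Map a list of terms to a fixed-length term-frequency vector.

    Hypothetical helper for illustration only; CRC32 is a stand-in hash
    chosen because it is deterministic across runs.
    """
    vector = [0.0] * num_features
    for term, count in Counter(terms).items():
        # The hash of the term, modulo the vector length, is its column index.
        index = zlib.crc32(term.encode("utf-8")) % num_features
        # Colliding terms share a column, so their counts are added together.
        vector[index] += count
    return vector


doc = "spark makes big data simple and spark is fast".split()
tf_vector = hashing_tf(doc, num_features=16)
print(len(tf_vector), sum(tf_vector))
```

The vector length depends only on `num_features`, never on the vocabulary size. A small `num_features` raises the chance that distinct terms collide in the same column.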

IDF: IDF is an Estimator which fits on a dataset and produces an IDFModel. The IDFModel takes feature vectors (generally created from HashingTF) and scales each column. Intuitively, it down-weights columns which appear frequently in a corpus.
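
The fit-then-scale behaviour described above can be sketched in plain Python. This is a simplified illustration, not the plugin's code: `fit_idf` and `transform` are hypothetical names, and the smoothed weight log((m + 1) / (df + 1)), where m is the number of documents and df the number of documents with a non-zero value in that column, is assumed to match the formula Spark documents for its IDF.

```python
import math


def fit_idf(tf_vectors):
    """Compute one IDF weight per column from a corpus of TF vectors."""
    num_docs = len(tf_vectors)
    num_features = len(tf_vectors[0])
    doc_freq = [0] * num_features
    for vec in tf_vectors:
        for j, value in enumerate(vec):
            if value > 0:
                # Count documents in which this column is non-zero.
                doc_freq[j] += 1
    # Smoothed IDF: columns present in every document get weight 0.
    return [math.log((num_docs + 1) / (df + 1)) for df in doc_freq]


def transform(tf_vectors, idf_weights):
    """Scale each column of every TF vector by its IDF weight."""
    return [[tf * w for tf, w in zip(vec, idf_weights)] for vec in tf_vectors]


# Toy corpus of three documents with four hashed columns each.
corpus_tf = [
    [1.0, 0.0, 2.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
]
idf_weights = fit_idf(corpus_tf)
tfidf = transform(corpus_tf, idf_weights)
print(idf_weights)
print(tfidf)
```

In this toy corpus, column 0 is non-zero in every document and is scaled to zero, while the rarer columns 1 and 2 keep a positive weight.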

Use-Case

A data scientist wants a text pre-processing step that returns fixed-length feature vectors, which can be passed to a learning algorithm for training and prediction.

User Stories

As a Hydrator user, I want to avoid computing a global term-index map, which can be expensive for a large corpus, when preparing data for training and prediction.
As a Hydrator user, I want to transform text into fixed-length feature vectors as a pre-processing step for training and prediction.

Conditions

Example

Design

This is a sparkcompute type of plugin and is meant to work with Spark only.

Checklist

User stories documented
User stories reviewed
Design documented
Design reviewed
Feature merged
Examples and guides
Integration tests
Documentation for feature
Short video demonstrating the feature
