Hashing TF and IDF

Introduction


Term frequency-inverse document frequency (TF-IDF) is a feature vectorization method widely used in text mining to reflect the importance of a term to a document in a corpus. TF-IDF is split into two parts: TF (with hashing) and IDF.

TF: HashingTF is a Transformer which takes sets of terms and converts those sets into fixed-length feature vectors.
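The hashing step can be illustrated with a minimal, Spark-free sketch. It assumes each term is hashed to a bucket index modulo a fixed vector length; the names hashing_tf, to_dense, and NUM_FEATURES are hypothetical, and CRC32 is swapped in here only because it is deterministic and in the standard library, so the bucket indices will not match those produced by Spark's HashingTF.

```python
import zlib
from collections import Counter

# Hypothetical vector length for illustration; real jobs typically use far more buckets.
NUM_FEATURES = 16


def hashing_tf(terms, num_features=NUM_FEATURES):
    """Return a sparse term-frequency vector as {bucket_index: count}.

    Each term is hashed straight to a bucket, so no global term-to-index
    map has to be built. Distinct terms may collide in the same bucket.
    """
    counts = Counter(
        zlib.crc32(term.encode("utf-8")) % num_features for term in terms
    )
    return dict(counts)


def to_dense(sparse, num_features=NUM_FEATURES):
    """Expand a sparse vector into a fixed-length list of floats."""
    dense = [0.0] * num_features
    for index, count in sparse.items():
        dense[index] = float(count)
    return dense


doc = "spark makes hashing simple and spark is fast".split()
tf = hashing_tf(doc)
print(to_dense(tf))
```

Whatever the input, the output always has NUM_FEATURES entries, which is what lets the vectors be fed to a learning algorithm without a vocabulary pass. The trade-off is that a smaller vector length increases the chance of hash collisions.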

IDF: IDF is an Estimator which fits on a dataset and produces an IDFModel. The IDFModel takes feature vectors (generally created from HashingTF) and scales each column. Intuitively, it down-weights columns which appear frequently in a corpus.
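The fit-then-scale behaviour described above can be sketched in plain Python. This is an illustration under stated assumptions, not the plugin's implementation: it uses the smoothed form idf = log((m + 1) / (df + 1)), where m is the number of documents and df is the number of documents in which a column is non-zero, and the names fit_idf and apply_idf are hypothetical.

```python
import math


def fit_idf(tf_vectors, num_features):
    """Estimator step: derive one IDF weight per column from sparse TF vectors."""
    num_docs = len(tf_vectors)
    doc_freq = [0] * num_features
    for vec in tf_vectors:
        for index, count in vec.items():
            if count > 0:
                doc_freq[index] += 1
    # Assumed smoothed formula: log((m + 1) / (df + 1)).
    return [math.log((num_docs + 1) / (df + 1)) for df in doc_freq]


def apply_idf(tf_vector, idf_weights):
    """Model step: scale each column of a TF vector by its IDF weight."""
    return {index: count * idf_weights[index] for index, count in tf_vector.items()}


# Three tiny documents as sparse TF vectors; column 0 occurs in every document.
corpus = [{0: 2, 1: 1}, {0: 1, 2: 3}, {0: 1}]
idf = fit_idf(corpus, num_features=4)
scaled = [apply_idf(vec, idf) for vec in corpus]
print(idf)
print(scaled)
```

Column 0 appears in all three documents, so its weight is log(4 / 4) = 0 and it contributes nothing after scaling, while the rarer columns 1 and 2 keep a positive weight.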

Use-Case

  • A data scientist wants a text pre-processing step that returns fixed-length feature vectors, which can be passed to a learning algorithm for training and prediction.

User Stories

  • As a Hydrator user, I want to avoid computing a global term-to-index map, which can be expensive for a large corpus, when preparing data for training and prediction.

  • As a Hydrator user, I want to transform text into fixed-length feature vectors as a pre-processing step for training and prediction.

Conditions

Example

Design

This is a plugin of type sparkcompute and is meant to work with Spark only.

 

Checklist

  • User stories documented 
  • User stories reviewed 
  • Design documented 
  • Design reviewed 
  • Feature merged 
  • Examples and guides 
  • Integration tests 
  • Documentation for feature 
  • Short video demonstrating the feature

Created in 2020 by Google Inc.