Hashing TF and IDF

Introduction

Term frequency-inverse document frequency (TF-IDF) is a feature vectorization method widely used in text mining to reflect the importance of a term to a document in a corpus. TF-IDF is split into two parts: TF (with hashing) and IDF.

TF: HashingTF is a Transformer that takes sets of terms and converts those sets into fixed-length feature vectors.
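The hashing trick behind HashingTF can be sketched in a few lines: each term is hashed to an index modulo the vector size, and the count at that index is incremented. This is a minimal illustration only; the function name `hashing_tf` and the use of CRC32 are assumptions (Spark internally uses MurmurHash3).

```python
import zlib


def hashing_tf(terms, num_features=16):
    """Map a list of terms to a fixed-length term-frequency vector.

    Illustrative sketch of the hashing trick: no global term-to-index
    map is needed, so the vector size is fixed regardless of vocabulary.
    CRC32 stands in for Spark's MurmurHash3 to keep the example
    dependency-free and deterministic.
    """
    vec = [0] * num_features
    for term in terms:
        index = zlib.crc32(term.encode("utf-8")) % num_features
        vec[index] += 1  # hash collisions simply add counts together
    return vec


print(hashing_tf(["spark", "hashing", "tf", "spark"]))
```

Note that two distinct terms can collide on the same index; with a sufficiently large `num_features`, collisions are rare enough in practice.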

IDF: IDF is an Estimator which fits on a dataset and produces an IDFModel. The IDFModel takes feature vectors (generally created by HashingTF) and scales each column. Intuitively, it down-weights columns that appear frequently in the corpus.

Use-Case

  • A data scientist wants a text pre-processing step that returns fixed-length feature vectors which can be passed to a learning algorithm for training and prediction.

User Stories

  • As a Hydrator user, I want to avoid computing a global term-to-index map, which can be expensive for a large corpus, during training and prediction.

  • As a Hydrator user, I want to transform text into fixed-length feature vectors as a pre-processing step for training and prediction.

Conditions

Example

Design

This is a SparkCompute plugin and is meant to work with Spark pipelines only.


Table of Contents

Checklist

  • User stories documented
  • User stories reviewed
  • Design documented
  • Design reviewed
  • Feature merged
  • Examples and guides
  • Integration tests
  • Documentation for feature
  • Short video demonstrating the feature

Created in 2020 by Google Inc.