Hashing TF and IDF
- Shashank
Introduction
Term frequency-inverse document frequency (TF-IDF) is a feature vectorization method widely used in text mining to reflect the importance of a term to a document in the corpus. TF-IDF is separated into two parts: TF (with hashing) and IDF.
TF: HashingTF is a Transformer which takes sets of terms and converts those sets into fixed-length feature vectors.
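The hashing step can be illustrated with a minimal plain-Python sketch. It is not the Spark implementation: the function name `hashing_tf`, the parameter `num_features`, and the use of `zlib.crc32` as the hash are assumptions chosen for illustration, and the hash function Spark uses is different. The point it shows is that each term is mapped directly to a vector index, so no term-to-index dictionary has to be built over the corpus.

```python
import zlib


def hashing_tf(terms, num_features=16):
    """Map a sequence of terms to a fixed-length term-frequency vector.

    Hypothetical helper for illustration only; crc32 is a stand-in hash.
    """
    vector = [0.0] * num_features
    for term in terms:
        # Hash the term and fold it into the range [0, num_features).
        index = zlib.crc32(term.encode("utf-8")) % num_features
        # Count every occurrence; distinct terms may collide on one index.
        vector[index] += 1.0
    return vector


doc = "spark hashing spark tf".split()
vec = hashing_tf(doc)
```

The vector length depends only on `num_features`, never on the vocabulary size. A smaller value saves memory but increases the chance that unrelated terms share an index.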
IDF: IDF is an Estimator which fits on a dataset and produces an IDFModel. The IDFModel takes feature vectors (generally created from HashingTF) and scales each column. Intuitively, it down-weights columns which appear frequently in a corpus.
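The fit-then-scale behaviour can be sketched in plain Python as follows. The helper names `fit_idf` and `transform` are hypothetical, and the smoothed formula log((m + 1) / (df + 1)), where m is the number of documents and df is the number of documents in which a column is non-zero, is assumed here to match the one documented for Spark MLlib. The sketch only illustrates how columns that appear in many documents end up with small weights.

```python
import math


def fit_idf(tf_vectors):
    """Compute one IDF weight per column from a list of equal-length TF vectors."""
    num_docs = len(tf_vectors)
    num_features = len(tf_vectors[0])
    doc_freq = [0] * num_features
    for vec in tf_vectors:
        for j, value in enumerate(vec):
            if value > 0:
                # Column j occurs in this document.
                doc_freq[j] += 1
    # Assumed smoothed formula: log((m + 1) / (df + 1)).
    return [math.log((num_docs + 1) / (df + 1)) for df in doc_freq]


def transform(tf_vectors, idf):
    """Scale each column of every TF vector by its IDF weight."""
    return [[tf * w for tf, w in zip(vec, idf)] for vec in tf_vectors]


corpus = [
    [1.0, 1.0, 0.0],
    [1.0, 0.0, 2.0],
    [1.0, 0.0, 0.0],
]
idf = fit_idf(corpus)
scaled = transform(corpus, idf)
```

In this toy corpus the first column is non-zero in every document, so its weight is log(4 / 4) = 0 and it contributes nothing after scaling, while the two rarer columns keep a positive weight.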
Use-Case
A data scientist wants a text pre-processing step that returns fixed-length feature vectors, which can be passed to a learning algorithm for training and prediction.
User Stories
As a hydrator user, I want to avoid computing a global term-to-index map, which can be expensive for a large corpus, when preparing data for training and prediction.
As a hydrator user, I want to transform text into fixed-length feature vectors as a pre-processing step for data training and prediction.
Conditions
Example
Design
This is a sparkcompute type of plugin and is meant to work with Spark only.
Checklist
- User stories documented
- User stories reviewed
- Design documented
- Design reviewed
- Feature merged
- Examples and guides
- Integration tests
- Documentation for feature
- Short video demonstrating the feature
Created in 2020 by Google Inc.