Collaborative Filtering Plugin
Russ Savage
Introduction
This plugin allows the user to build a collaborative filtering model using a Hydrator pipeline. This is useful for building recommendation engines, because users can build the ingestion pipelines for the training data in the same tool. Code for collaborative filtering already exists in Scala: https://github.com/caskdata/cdap-apps/blob/develop/MovieRecommender/src/main/scala/co/cask/cdap/apps/movierecommender/RecommendationBuilder.scala
This plugin would make that logic available in a Hydrator pipeline. It will be based on Spark MLlib: http://spark.apache.org/docs/latest/mllib-collaborative-filtering.html
Use case(s)
- A developer would like to build a recommendation engine that shows related products to users as they browse the developer's site, basing those recommendations on the ratings of similar users. Ultimately, the developer would like to expose this model through a web service that, given a set of preferences, returns a set of recommendations based on similar customers.
User Stories
- As a developer, I would like to train a collaborative filtering model using a Hydrator pipeline.
- As a developer, I would like to generate predictions for new records based on that model by providing the required input fields.
- As a developer, I would like the Hydrator plugin to work in batch and in real-time modes.
- As a developer, I should be able to provide the name of the file set in which the trained model is saved.
- As a developer, I should be able to provide the path of the file set.
- As a developer, I should be able to change the following parameters for this model:
  - numBlocks is the number of blocks used to parallelize computation (set to -1 to auto-configure) (Default -1).
  - rank is the number of latent factors in the model (Default 10).
  - iterations is the number of iterations of ALS to run. ALS typically converges to a reasonable solution in 20 iterations or fewer (Default 10).
  - lambda specifies the regularization parameter in ALS (Default 1.0).
  - implicitPrefs specifies whether to use the explicit feedback ALS variant or one adapted for implicit feedback data (Default false).
  - alpha is a parameter applicable to the implicit feedback variant of ALS that governs the baseline confidence in preference observations (Default 1.0; only used when implicitPrefs is set to true).
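To make the rank parameter more concrete: in a matrix factorization model of the kind ALS produces, each user and each item is represented by a vector of rank latent factors, and a predicted preference is the dot product of the two vectors. The following is a minimal illustrative sketch in plain Java, not the plugin's implementation; the class name, method name, and factor values are hypothetical.

```java
// Hypothetical illustration only; a trained ALS model would supply the factor vectors.
class LatentFactorScore {

    // Predicted preference = dot product of the user's and the item's latent-factor vectors.
    // Both vectors must have the same length, which corresponds to the rank parameter.
    static double predict(double[] userFactors, double[] itemFactors) {
        if (userFactors.length != itemFactors.length) {
            throw new IllegalArgumentException("factor vectors must have the same length (rank)");
        }
        double score = 0.0;
        for (int i = 0; i < userFactors.length; i++) {
            score += userFactors[i] * itemFactors[i];
        }
        return score;
    }

    public static void main(String[] args) {
        // Made-up factors for rank = 3.
        double[] user = {0.5, 1.0, -0.5};
        double[] item = {2.0, 3.0, 1.0};
        System.out.println(predict(user, item)); // prints 3.5
    }
}
```

A larger rank gives the model more factors per user and item, at the cost of more computation and a greater risk of overfitting, which is what lambda is meant to counteract.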
Plugin Type
- Batch Source
- Batch Sink
- Real-time Source
- Real-time Sink
- Action
- Post-Run Action
- Aggregate
- Join
- Spark Model
- Spark Compute
Configurables
This section defines properties that are configurable for this plugin.
User Facing Name | Type | Description | Constraints |
---|---|---|---|
numBlocks | int | The number of blocks used to parallelize computation (set to -1 to auto-configure) | |
rank | int | The number of latent factors in the model | |
iterations | int | The number of iterations of ALS to run. ALS typically converges to a reasonable solution in 20 iterations or less | |
lambda | double | Specifies the regularization parameter in ALS | |
implicitPrefs | boolean | Specifies whether to use the explicit feedback ALS variant or one adapted for implicit feedback data | |
alpha | double | A parameter applicable to the implicit feedback variant of ALS that governs the baseline confidence in preference observations | |
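As a rough sketch of how these properties might be read and checked, the class below parses them from a string map, falls back to the defaults listed in the user stories, and rejects obviously invalid values. It is plain Java and does not use the CDAP plugin API. The class name and the validation rules are assumptions, since the Constraints column does not define any yet.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical holder for the plugin properties; not the actual plugin config class.
class AlsConfigSketch {
    final int numBlocks;
    final int rank;
    final int iterations;
    final double lambda;
    final boolean implicitPrefs;
    final double alpha;

    AlsConfigSketch(Map<String, String> props) {
        // Defaults are the ones listed in the user stories.
        numBlocks = Integer.parseInt(props.getOrDefault("numBlocks", "-1"));
        rank = Integer.parseInt(props.getOrDefault("rank", "10"));
        iterations = Integer.parseInt(props.getOrDefault("iterations", "10"));
        lambda = Double.parseDouble(props.getOrDefault("lambda", "1.0"));
        implicitPrefs = Boolean.parseBoolean(props.getOrDefault("implicitPrefs", "false"));
        alpha = Double.parseDouble(props.getOrDefault("alpha", "1.0"));
        validate();
    }

    // Assumed constraints, chosen for illustration; the spec does not state them.
    private void validate() {
        if (numBlocks != -1 && numBlocks < 1) {
            throw new IllegalArgumentException("numBlocks must be -1 or a positive integer");
        }
        if (rank < 1) {
            throw new IllegalArgumentException("rank must be a positive integer");
        }
        if (iterations < 1) {
            throw new IllegalArgumentException("iterations must be a positive integer");
        }
        if (lambda < 0) {
            throw new IllegalArgumentException("lambda must not be negative");
        }
        // alpha is only used by the implicit feedback variant.
        if (implicitPrefs && alpha < 0) {
            throw new IllegalArgumentException("alpha must not be negative");
        }
    }

    public static void main(String[] args) {
        Map<String, String> props = new HashMap<>();
        props.put("rank", "20");
        props.put("implicitPrefs", "true");
        AlsConfigSketch config = new AlsConfigSketch(props);
        // Unset properties keep their defaults.
        System.out.println(config.rank + " " + config.iterations + " " + config.implicitPrefs);
    }
}
```

Whatever rules are finally agreed on could be recorded in the Constraints column so that the table and the validation stay in step.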
Design / Implementation Tips
- See Spark docs here: http://spark.apache.org/docs/latest/mllib-collaborative-filtering.html
- See an implementation of this in Spark here: https://github.com/caskdata/cdap-apps/blob/develop/MovieRecommender/src/main/scala/co/cask/cdap/apps/movierecommender/RecommendationBuilder.scala
- Additional example of building a movie recommendation engine here: https://databricks-training.s3.amazonaws.com/movie-recommendation-with-mllib.html
Design
Approach(s)
Properties
Security
Limitation(s)
Future Work
- Some future work – HYDRATOR-99999
- Another future work – HYDRATOR-99999
Test Case(s)
- Test case #1
- Test case #2
Sample Pipeline
Please attach one or more sample pipeline(s) and associated data.
Pipeline #1
Pipeline #2
Checklist
- User stories documented
- User stories reviewed
- Design documented
- Design reviewed
- Feature merged
- Examples and guides
- Integration tests
- Documentation for feature
- Short video demonstrating the feature
Created in 2020 by Google Inc.