Collaborative Filtering Plugin

Introduction

This plugin lets users build a collaborative filtering model in a Hydrator pipeline. This is useful for building recommendation engines, because the data ingestion can be built as a pipeline as well. Code for collaborative filtering already exists in Scala in the MovieRecommender app: https://github.com/caskdata/cdap-apps/blob/develop/MovieRecommender/src/main/scala/co/cask/cdap/apps/movierecommender/RecommendationBuilder.scala

This plugin would make that logic available in a Hydrator pipeline. It will be based on Spark MLlib: http://spark.apache.org/docs/latest/mllib-collaborative-filtering.html

Use case(s)

  • A developer would like to build a recommendation engine that shows related products to users as they browse the developer's site. The recommendations should be based on similar users and their ratings. The developer would ultimately like to expose this model through a web service that, given a set of preferences, returns a set of recommendations based on similar customers.

User Stories

  • As a developer, I would like to train a collaborative filtering model using a Hydrator pipeline.
  • As a developer, I would like to generate predictions for new records from that model by providing the required input fields.
  • As a developer, I would like the Hydrator plugin to work in both batch and realtime modes.
  • As a developer, I should be able to provide the name of the FileSet in which the trained model is saved.
  • As a developer, I should be able to provide the path of the FileSet.
  • As a developer, I should be able to change the following parameters for this model:
    • numBlocks is the number of blocks used to parallelize computation (set to -1 to auto-configure) (Default -1).
    • rank is the number of latent factors in the model (Default 10).
    • iterations is the number of iterations of ALS to run. ALS typically converges to a reasonable solution in 20 iterations or less (Default 10).
    • lambda specifies the regularization parameter in ALS (Default 1.0).
    • implicitPrefs specifies whether to use the explicit feedback ALS variant or one adapted for implicit feedback data (Default false).
    • alpha is a parameter applicable to the implicit feedback variant of ALS that governs the baseline confidence in preference observations (Default 1.0; only used when implicitPrefs is set to true).
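
To make the roles of rank, iterations and lambda concrete, the following is a minimal sketch of explicit-feedback alternating least squares in plain Python. It is an illustration only, not the Spark MLlib implementation the plugin would use: it has no blocking (numBlocks) and no implicit-feedback variant (implicitPrefs, alpha). The function names train_als, update_factors, solve and predict are hypothetical.

```python
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / m[i][i]
    return x


def update_factors(ratings_by_row, other, rank, lam):
    """Recompute one side's factors while the other side is held fixed.

    ratings_by_row maps an id to a list of (other_id, rating) pairs.
    Each factor vector solves (V^T V + lam * I) x = V^T r.
    """
    out = {}
    for row_id, observed in ratings_by_row.items():
        a = [[lam if i == j else 0.0 for j in range(rank)] for i in range(rank)]
        b = [0.0] * rank
        for other_id, rating in observed:
            v = other[other_id]
            for i in range(rank):
                b[i] += rating * v[i]
                for j in range(rank):
                    a[i][j] += v[i] * v[j]
        out[row_id] = solve(a, b)
    return out


def train_als(ratings, rank=10, iterations=10, lam=1.0, seed=0):
    """Train on (user, item, rating) triples; lam must be positive.

    rank is the number of latent factors, iterations the number of
    alternating sweeps, and lam the regularization parameter.
    """
    rng = random.Random(seed)
    by_user, by_item = {}, {}
    for user, item, rating in ratings:
        by_user.setdefault(user, []).append((item, rating))
        by_item.setdefault(item, []).append((user, rating))
    # Start from random item factors, then alternate between the two sides.
    items = {i: [rng.random() for _ in range(rank)] for i in by_item}
    users = {}
    for _ in range(iterations):
        users = update_factors(by_user, items, rank, lam)
        items = update_factors(by_item, users, rank, lam)
    return users, items


def predict(users, items, user, item):
    """Predicted rating is the dot product of the user and item factor vectors."""
    return sum(a * b for a, b in zip(users[user], items[item]))
```

In this sketch a larger lam shrinks the factor vectors and pulls predictions toward zero, while a larger rank lets the model fit more structure at the cost of solving larger linear systems for every user and item in each iteration.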

Plugin Type

  • Batch Source
  • Batch Sink 
  • Real-time Source
  • Real-time Sink
  • Action
  • Post-Run Action
  • Aggregate
  • Join
  • Spark Model
  • Spark Compute

Configurables

This section defines properties that are configurable for this plugin. 

User Facing Name | Type    | Description | Constraints
numBlocks        | int     | The number of blocks used to parallelize computation (set to -1 to auto-configure) |
rank             | int     | The number of latent factors in the model |
iterations       | int     | The number of iterations of ALS to run. ALS typically converges to a reasonable solution in 20 iterations or less |
lambda           | double  | Specifies the regularization parameter in ALS |
implicitPrefs    | boolean | Specifies whether to use the explicit feedback ALS variant or one adapted for implicit feedback data |
alpha            | double  | A parameter applicable to the implicit feedback variant of ALS that governs the baseline confidence in preference observations |
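
The Constraints column is not yet filled in. As a starting point for discussion, the sketch below shows one way the properties could be parsed and checked. Everything in it is an assumption rather than part of this spec: the AlsConfig class, the from_properties and parse_bool helpers, the idea that properties arrive as strings keyed by their user-facing names, and each individual constraint. The defaults are the ones listed in the user stories.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlsConfig:
    """Hypothetical holder for the plugin properties, with the documented defaults."""
    num_blocks: int = -1
    rank: int = 10
    iterations: int = 10
    lambda_: float = 1.0
    implicit_prefs: bool = False
    alpha: float = 1.0

    def validate(self):
        """Return a list of problems; these constraints are assumptions, not spec."""
        problems = []
        if self.num_blocks != -1 and self.num_blocks < 1:
            problems.append("numBlocks must be -1 (auto-configure) or a positive integer")
        if self.rank < 1:
            problems.append("rank must be a positive integer")
        if self.iterations < 1:
            problems.append("iterations must be a positive integer")
        if self.lambda_ < 0:
            problems.append("lambda must not be negative")
        if self.implicit_prefs and self.alpha < 0:
            problems.append("alpha must not be negative when implicitPrefs is true")
        return problems


def parse_bool(text):
    """Accept only 'true' or 'false' (case-insensitive) so typos are not silently ignored."""
    value = text.strip().lower()
    if value not in ("true", "false"):
        raise ValueError("expected 'true' or 'false', got %r" % text)
    return value == "true"


def from_properties(props):
    """Build an AlsConfig from string-valued properties keyed by user-facing name."""
    defaults = AlsConfig()
    return AlsConfig(
        num_blocks=int(props.get("numBlocks", defaults.num_blocks)),
        rank=int(props.get("rank", defaults.rank)),
        iterations=int(props.get("iterations", defaults.iterations)),
        lambda_=float(props.get("lambda", defaults.lambda_)),
        implicit_prefs=parse_bool(str(props.get("implicitPrefs", defaults.implicit_prefs))),
        alpha=float(props.get("alpha", defaults.alpha)),
    )
```

For example, from_properties({"rank": "20", "implicitPrefs": "true"}).validate() returns an empty list, while a rank of 0 would be reported as a problem. Returning every problem at once, instead of failing on the first, is a design choice meant to let a pipeline author fix all invalid properties in a single pass.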

Design / Implementation Tips

Design

Approach(s)

Properties

Security

Limitation(s)

Future Work

  • Some future work – HYDRATOR-99999
  • Another future work – HYDRATOR-99999

Test Case(s)

  • Test case #1
  • Test case #2

Sample Pipeline

Please attach one or more sample pipeline(s) and associated data. 

Pipeline #1

Pipeline #2

Checklist

  • User stories documented 
  • User stories reviewed 
  • Design documented 
  • Design reviewed 
  • Feature merged 
  • Examples and guides 
  • Integration tests 
  • Documentation for feature 
  • Short video demonstrating the feature

Created in 2020 by Google Inc.