Collaborative Filtering Plugin

Introduction

This plugin lets users build a collaborative filtering model in a Hydrator pipeline. It is intended for building recommendation engines, with the training data supplied through an ingestion pipeline. Code for collaborative filtering already exists in Scala: https://github.com/caskdata/cdap-apps/blob/develop/MovieRecommender/src/main/scala/co/cask/cdap/apps/movierecommender/RecommendationBuilder.scala

This plugin would make that logic available in a Hydrator pipeline. It will be based on Spark MLlib: http://spark.apache.org/docs/latest/mllib-collaborative-filtering.html

Use case(s)

  • A developer would like to build a recommendation engine that shows related products to users as they browse the developer's site. The recommendations should be based on similar users and their ratings. Ultimately, the developer would like to expose the model through a web service that accepts a set of preferences and returns recommendations based on similar customers.
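
To make the use case concrete, the sketch below shows how a trained matrix-factorization model (the kind ALS produces) can rank products for a user: every user and product is a vector of latent factors, and the predicted preference is their dot product. The factor values, the product names, and the `predict` and `recommend` helpers are made up for illustration; they are not part of the plugin or of Spark.

```python
from typing import Dict, List, Sequence, Set, Tuple


def predict(user_factors: Sequence[float], item_factors: Sequence[float]) -> float:
    """Predicted preference: dot product of the two latent-factor vectors."""
    return sum(u * v for u, v in zip(user_factors, item_factors))


def recommend(
    user_factors: Sequence[float],
    items: Dict[str, Sequence[float]],
    already_rated: Set[str],
    top_n: int,
) -> List[Tuple[str, float]]:
    """Return the top_n unrated items, highest predicted preference first."""
    scored = [
        (name, predict(user_factors, factors))
        for name, factors in items.items()
        if name not in already_rated
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Hypothetical rank-2 factors, as a trained model might hold them.
user = [1.0, 0.5]
products = {
    "laptop": [2.0, 0.0],    # score 2.0
    "mouse": [1.0, 1.0],     # score 1.5
    "novel": [0.0, 1.0],     # score 0.5
    "keyboard": [1.5, 2.0],  # score 2.5
}

top = recommend(user, products, already_rated={"laptop"}, top_n=2)
print(top)  # [('keyboard', 2.5), ('mouse', 1.5)]
```

A web service built on the model would do essentially this lookup and ranking per request, with the factor vectors loaded from the saved model rather than hard-coded.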

User Stories

  • As a developer, I would like to train a collaborative filtering model using a Hydrator pipeline.

  • As a developer, I would like to classify new records based on that model by providing the correct information.

  • As a developer, I would like the Hydrator plugin to work in both batch and real-time modes.

  • As a developer, I should be able to provide the name of the file set in which the trained model is saved.

  • As a developer, I should be able to provide the path of the file set.

  • As a developer, I should be able to change the following parameters for this model:

    • numBlocks is the number of blocks used to parallelize computation (set to -1 to auto-configure) (Default -1).

    • rank is the number of latent factors in the model (Default 10).

    • iterations is the number of iterations of ALS to run. ALS typically converges to a reasonable solution in 20 iterations or less (Default 10).

    • lambda specifies the regularization parameter in ALS (Default 1.0).

    • implicitPrefs specifies whether to use the explicit feedback ALS variant or one adapted for implicit feedback data (Default false).

    • alpha is a parameter applicable to the implicit feedback variant of ALS that governs the baseline confidence in preference observations (Default 1.0; only used when implicitPrefs is set to true).
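
For orientation, the roles of rank, lambda, and alpha can be read off the standard ALS objectives. The formulation below is the common textbook one and is stated here as an assumption about what the plugin optimizes; Spark MLlib's implementation may scale the regularization term differently. Here r_{ui} is the rating given by user u to item i, x_u and y_i are latent-factor vectors of length rank, and \lambda and \alpha are the lambda and alpha parameters.

```latex
% Explicit feedback (implicitPrefs = false): fit the observed ratings only.
\min_{X, Y} \sum_{(u,i)\ \text{observed}} \left( r_{ui} - x_u^{\top} y_i \right)^2
  + \lambda \left( \sum_u \lVert x_u \rVert^2 + \sum_i \lVert y_i \rVert^2 \right)

% Implicit feedback (implicitPrefs = true): fit binary preferences p_{ui},
% each weighted by a confidence c_{ui} that grows with alpha.
\min_{X, Y} \sum_{u,i} c_{ui} \left( p_{ui} - x_u^{\top} y_i \right)^2
  + \lambda \left( \sum_u \lVert x_u \rVert^2 + \sum_i \lVert y_i \rVert^2 \right),
\qquad c_{ui} = 1 + \alpha\, r_{ui},
\qquad p_{ui} = \begin{cases} 1 & r_{ui} > 0 \\ 0 & \text{otherwise} \end{cases}
```

Each ALS iteration alternates between solving for the user vectors with the item vectors fixed and the reverse, which is why the iterations parameter counts full alternation rounds.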

Plugin Type

Batch Source
Batch Sink 
Real-time Source
Real-time Sink
Action
Post-Run Action
Aggregate
Join
Spark Model
Spark Compute

Configurables

This section defines properties that are configurable for this plugin. 

| User Facing Name | Type | Description | Constraints |
| --- | --- | --- | --- |
| numBlocks | int | The number of blocks used to parallelize computation (set to -1 to auto-configure) | |
| rank | int | The number of latent factors in the model | |
| iterations | int | The number of iterations of ALS to run. ALS typically converges to a reasonable solution in 20 iterations or less | |
| lambda | double | Specifies the regularization parameter in ALS | |
| implicitPrefs | boolean | Specifies whether to use the explicit feedback ALS variant or one adapted for implicit feedback data | |
| alpha | double | A parameter applicable to the implicit feedback variant of ALS that governs the baseline confidence in preference observations | |
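
The properties above could be carried in a small configuration object that also checks its values before a model is trained. The sketch below is only an illustration: the `ALSConfig` class and its field names are hypothetical, the defaults are the ones listed in the user stories, and the validation rules are assumptions, since this document does not yet define any constraints.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ALSConfig:
    """Hypothetical holder for the plugin's ALS properties and their defaults."""

    num_blocks: int = -1          # numBlocks: -1 means auto-configure
    rank: int = 10                # number of latent factors
    iterations: int = 10          # number of ALS iterations
    lambda_: float = 1.0          # regularization ("lambda" is reserved in Python)
    implicit_prefs: bool = False  # False selects the explicit feedback variant
    alpha: float = 1.0            # only used when implicit_prefs is True

    def validate(self) -> None:
        """Raise ValueError for values outside the assumed constraints."""
        if self.num_blocks != -1 and self.num_blocks < 1:
            raise ValueError("numBlocks must be -1 (auto) or a positive integer")
        if self.rank < 1:
            raise ValueError("rank must be a positive integer")
        if self.iterations < 1:
            raise ValueError("iterations must be a positive integer")
        if self.lambda_ < 0:
            raise ValueError("lambda must not be negative")
        if self.implicit_prefs and self.alpha < 0:
            raise ValueError("alpha must not be negative")


# Defaults pass validation; an out-of-range value is rejected.
ALSConfig().validate()
try:
    ALSConfig(rank=0).validate()
except ValueError as err:
    print(err)  # rank must be a positive integer
```

Checking alpha only when implicit_prefs is true mirrors the note that alpha is ignored by the explicit feedback variant.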

Design / Implementation Tips

Design

Approach(s)

Properties

Security

Limitation(s)

Future Work

  • Some future work – HYDRATOR-99999

  • Another future work – HYDRATOR-99999

Test Case(s)

  • Test case #1

  • Test case #2

Sample Pipeline

Please attach one or more sample pipeline(s) and associated data. 

Pipeline #1

Pipeline #2

 

 

Checklist

User stories documented 
User stories reviewed 
Design documented 
Design reviewed 
Feature merged 
Examples and guides 
Integration tests 
Documentation for feature 
Short video demonstrating the feature

Created in 2020 by Google Inc.