How to Build a Recommeder System from Scratch

By Jill Cates

Agenda

  1. What is a recommender system?
  2. Why do we need recommender systems?
  3. How does it work?
  4. Collaborative Filtering
  5. Content-based Filtering
  6. Tutorial using MovieLens dataset

What is a Recommender System?

  • an application of machine learning
  • predicts a user's preference towards a given item
  • aims to drive user engagement

Examples of Recommenders

Examples of Recommenders

Examples of Recommenders

Why Do We Need Recommender Systems?

Before e-Commerce...

Things were sold exclusively in brick-and-mortar stores

limited inventory

mainstream products

Why Do We Need Recommender Systems?

e-Commerce...

Introduction of the online marketplace

unlimited inventory

niche products

Why Do We Need Recommender Systems?

"With the evolution of online retail, however, has come the revelation that being able to recategorize and rearrange products on the fly unlocks their real value." - Chris Anderson

Why Do We Need Recommender Systems?

The Tasting Booth Experiment (Iyengar and Lepper, 2000)

"30% of the consumers in the limited-choice condition subsequently purchased a jar of jam; in contrast, only 3% of the consumers in the extensive-choice condition did so"

Recommender Systems = Machine Learning

Recommender Systems = Machine Learning

Recommender Systems = Machine Learning

Recommender Systems = Machine Learning

Collaborative Filtering

  • Based on the assumption that similar users like similar things
  • "Customers who bought this item also bought..."
  • "Because you watched Movie X..."

User-item ("utility") matrix

Content-based Filtering

  • Generates recommendations using user and item features
  • Handles the cold-start problem for new users and items
  • User features: age, gender, spoken language
  • Item features: movie genre, year of release, cast

Tutorial

  • Build an item-item recommender using the MovieLens dataset

Step 1: Import Dependencies

We will be representing our data as a Pandas DataFrame.

  • a two-dimensional data structure
  • columns represent features, rows represent items
  • analogous to an Excel spreadsheet or SQL table
  • documentation can be found here
In [67]:
import numpy as np
import pandas as pd
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns

Step 2: Load the Data

Ratings

Dataframe contains 4 columns:

  • userId
  • movieId
  • rating
  • timestamp

We need this data to perform collaborative filtering.

In [69]:
ratings = pd.read_csv("https://s3-us-west-2.amazonaws.com/recommender-tutorial/ratings.csv")

ratings.head()
Out[69]:
userId movieId rating timestamp
0 1 1 4.0 964982703
1 1 3 4.0 964981247
2 1 6 4.0 964982224
3 1 47 5.0 964983815
4 1 50 5.0 964982931

Movies

Dataframe contains 3 columns:

  • movieId
  • title
  • genres

We need this data to interpret the results of our collaborative filtering recommender.

In [86]:
movies = pd.read_csv("https://s3-us-west-2.amazonaws.com/recommender-tutorial/movies.csv")

movies.head(3)
Out[86]:
movieId title genres
0 1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
1 2 Jumanji (1995) Adventure|Children|Fantasy
2 3 Grumpier Old Men (1995) Comedy|Romance

Step 3: Exploratory Data Analysis

Ratings contains users' ratings for a given movie. Let's see how many ratings, unique movies, and unique users are in our dataset.

In [84]:
n_ratings = len(ratings)
n_movies = ratings['movieId'].nunique()
n_users = ratings['userId'].nunique()
In [85]:
print(f"Number of ratings: {n_ratings}")
print(f"Number of unique movieId's: {n_movies}")
print(f"Number of unique users: {n_users}")
print(f"Average number of ratings per user: {round(n_ratings/n_users, 2)}")
print(f"Average number of ratings per movie: {round(n_ratings/n_movies, 2)}")
Number of ratings: 100836
Number of unique movieId's: 9724
Number of unique users: 610
Average number of ratings per user: 165.3
Average number of ratings per movie: 10.37

How many ratings did each user make? Let's use pandas' groupby() and count() to:

  1. group the data by userId's
  2. count the number of ratings for each userId.
In [61]:
user_freq = ratings[['userId', 'movieId']].groupby('userId').count().reset_index()
user_freq.columns = ['userId', 'n_ratings']
user_freq.head()
Out[61]:
userId n_ratings
0 1 232
1 2 29
2 3 39
3 4 216
4 5 44
In [87]:
mean_n_ratings = user_freq['n_ratings'].mean()
print(f"Mean number of ratings for a given user: {mean_n_ratings:.2f}.")
Mean number of ratings for a given user: 165.30.

Let's visualize the distribution of movie ratings in this dataset, and distribution of user rating frequency. We can do this using a Python package called seaborn.

In [90]:
sns.set_style("whitegrid")
plt.figure(figsize=(14,3))

# PLOT 1
plt.subplot(1,2,1)
ax = sns.countplot(x="rating", data=ratings, palette="viridis")
ax.set(title="Distribution of movie ratings")

# PLOT 2
plt.subplot(1,2,2)
ax = sns.kdeplot(user_freq['n_ratings'], shade=True, legend=False)
ax.set(title="Number of movies rated per user", xlabel="# ratings per user", ylabel="density")
plt.axvline(user_freq['n_ratings'].mean(), color="k", linestyle="--")
plt.show()

What are the highest and lowest rated movies?

To find the "best" and "worst" movies, we need to calculate the mean rating for each movie in our dataset. We can do this by grouping by movieId and calculating the mean of the rating column.

In [64]:
mean_rating = ratings.groupby('movieId')[['rating']].mean()
mean_rating.head()
Out[64]:
rating
movieId
1 3.920930
2 3.431818
3 3.259615
4 2.357143
5 3.071429

What are the highest and lowest rated movies?

In [79]:
lowest_rated = mean_rating['rating'].idxmin()
movies.loc[movies['movieId'] == lowest_rated]
Out[79]:
movieId title genres
2689 3604 Gypsy (1962) Musical
In [66]:
highest_rated = mean_rating['rating'].idxmax()
movies.loc[movies['movieId'] == highest_rated]
Out[66]:
movieId title genres
48 53 Lamerica (1994) Adventure|Drama

The highest rated movie is Lamerica?!

I'm sure that this movie is great but it doesn't sound like a familiar blockbuster. Let's dig deeper and see who rated this movie.

In [83]:
ratings.loc[ratings['movieId']==highest_rated]
Out[83]:
userId movieId rating timestamp
13368 85 53 5.0 889468268
96115 603 53 5.0 963180003

While Lamerica may have the "highest" average rating, it only received 2 ratings. This isn't a good measure of a movie's popularity. The quality of a movie's rating depends not only on the average rating but also on the number of ratings.

Bayesian Average

  • A weighted average that accounts for how many ratings there are
  • Useful when there isn't much data available
  • Used extensively in baseball statistics (i.e., batting average)
  • Calculate with the following equation:

$$r_{i} = \frac{C \times m + \Sigma{\text{ratings}}}{C+N} $$

where:

  • $C$ = our confidence (average number of ratings for a given movie)
  • $m$ = our prior (global average rating)
  • $N$ = total number of ratings for movie $i$

We first need to get the count and mean for each movie in our dataset. We can do this with a sophisticated groupby that applies both mean and count using the agg method.

In [99]:
movie_stats = ratings.groupby('movieId')[['rating']].agg(['count', 'mean'])
movie_stats.columns = movie_stats.columns.droplevel()
movie_stats.head()
Out[99]:
count mean
movieId
1 215 3.920930
2 110 3.431818
3 52 3.259615
4 7 2.357143
5 49 3.071429

Let's calculate the Bayesian average. We first write a function that computes the Bayesian average for a given movie.

In [103]:
C = movie_stats['count'].mean()
m = movie_stats['mean'].mean()

def bayesian_avg(ratings_i):
    bayesian_avg = (C*m+ratings_i.sum())/(C+ratings_i.count())
    return bayesian_avg

We can apply bayesian_avg to our entire ratings dataset using the agg method.

In [ ]:
bayesian_avg_ratings = ratings.groupby('movieId')['rating'].agg(bayesian_avg).reset_index()
bayesian_avg_ratings.columns = ['movieId', 'bayesian_avg']
bayesian_avg_ratings.head()

Which movies have the highest Bayesian average rating?

In [101]:
movie_stats = movie_stats.merge(bayesian_avg_ratings, on='movieId')
movie_stats = movie_stats.merge(movies[['movieId', 'title']])
movie_stats.sort_values('bayesian_avg', ascending=False).head()
Out[101]:
movieId count mean bayesian_avg title
277 318 317 4.429022 4.392070 Shawshank Redemption, The (1994)
659 858 192 4.289062 4.236457 Godfather, The (1972)
2224 2959 218 4.272936 4.227052 Fight Club (1999)
224 260 251 4.231076 4.192646 Star Wars: Episode IV - A New Hope (1977)
46 50 204 4.237745 4.190567 Usual Suspects, The (1995)

Using Bayesian averages, we can see that Lamerica is no longer the top movie.

Which movies have the lowest Bayesian average rating?

In [102]:
movie_stats.sort_values('bayesian_avg', ascending=True).head()
Out[102]:
movieId count mean bayesian_avg title
1172 1556 19 1.605263 2.190377 Speed 2: Cruise Control (1997)
2679 3593 19 1.657895 2.224426 Battlefield Earth (2000)
1372 1882 33 1.954545 2.267268 Godzilla (1998)
1144 1499 27 1.925926 2.296800 Anaconda (1997)
1988 2643 16 1.687500 2.306841 Superman IV: The Quest for Peace (1987)

Step 4: Transforming the Data

  • Need to transform data into user-item matrix for collaborative filtering
  • scipy.sparse_matrix: columns are movies and rows are users
  • Each cell is populated with a user's rating towards a movie
  • Empty cell = no rating available

In [37]:
from scipy.sparse import csr_matrix

def create_X(df):
    """
    Generates a sparse matrix from the ratings dataframe.
    
    Args:
        df: ratings dataframe
    
    Returns:
        X: sparse matrix
        user_mapper: dict that maps user id's to user indices
        user_inv_mapper: dict that maps user indices to user id's
        movie_mapper: dict that maps movie id's to movie indices
        movie_inv_mapper: dict that maps movie indices to movie id's
    """
    N = df['userId'].nunique()
    M = df['movieId'].nunique()

    user_mapper = dict(zip(np.unique(df["userId"]), list(range(N))))
    movie_mapper = dict(zip(np.unique(df["movieId"]), list(range(M))))
    user_inv_mapper = dict(zip(list(range(N)), np.unique(df["userId"])))
    movie_inv_mapper = dict(zip(list(range(M)), np.unique(df["movieId"])))
    
    user_index = [user_mapper[i] for i in df['userId']]
    movie_index = [movie_mapper[i] for i in df['movieId']]

    X = csr_matrix((df["rating"], (movie_index, user_index)), shape=(M, N))
    
    return X, user_mapper, movie_mapper, user_inv_mapper, movie_inv_mapper
In [31]:
X, user_mapper, movie_mapper, user_inv_mapper, movie_inv_mapper =create_X(ratings)

Calculating Sparsity of the Matrix

Let's see how sparse our matrix, $X$, is. We can calculate matrix density, $d$, with the following equation:

$d=\frac{\text{# non-zero elements}}{\text{total # elements}}$

In [32]:
density = X.count_nonzero()/(X.shape[0]*X.shape[1])

print(f"Matrix density: {round(sparsity*100,2)}%")
Matrix sparsity: 1.7%

Wow, our matrix is quite sparse. But don't be discouraged! User-item matrices are typically very sparse. A general rule of thumb is that your matrix density should be no lower than 0.5% to generate decent results.

How to save your sparse matrix

To save our sparse matrix for future analysis, we can use scipy's sparse matrix save_npz method.

In [112]:
from scipy.sparse import save_npz, load_npz

save_npz('user_item_matrix.npz', X)
Out[112]:
<9724x610 sparse matrix of type '<class 'numpy.float64'>'
	with 100836 stored elements in Compressed Sparse Row format>

We can load it again using load_npz.

In [113]:
X = load_npz('user_item_matrix.npz')
X
Out[113]:
<9724x610 sparse matrix of type '<class 'numpy.float64'>'
	with 100836 stored elements in Compressed Sparse Row format>

Step 5: Finding similar movies using k-Nearest Neighbours

k-Nearest Neighbours (kNN) is a classification algorithm that predicts the label of a given sample based on majority vote of its nearest $k$ neighbours.

Distance between two samples can be measured using cosine similarity, Euclidean distance, Manhattan distance, etc.

Cosine Similarity vs. Euclidean Distance

  • Cosine Similarity: measures similarity of two points in orientation (i.e., the cosine angle between $A$ and $B$)
    • the closer the cosine similarity is to 1, the more similar the items are

$$\text{similairity}=\frac{A \cdot B}{\mid{A}\mid\mid{B}\mid}$$

  • Euclidean Distance: measures distance between two items in a n-dimensional space (i.e., measures straight line from $A$ to $B$)
    • unlike cosine similarity, Euclidean distance takes magnitude into account

$$d(p,q) = \sqrt{(p_1-q_1)^2 + (p_2-q_2)^2 + ... + (p_n-q_n)^2}$$

Cosine Similarity vs. Euclidean Distance

  • A-B and D-E have the same cosine similarity
  • A-B has a smaller Euclidean distance than D-E

Let's create a function that finds $k$ similair movies for a given movie.

In [36]:
from sklearn.neighbors import NearestNeighbors

def find_similar_movies(movie_id, X, k, metric='cosine', show_distance=False):
    """
    Finds k-nearest neighbours for a given movie id.
    
    Args:
        movie_id: id of the movie of interest
        X: user-item utility matrix
        k: number of similar movies to retrieve
        metric: distance metric for kNN calculations
    
    Returns:
        list of k similar movie ID's
    """
    neighbour_ids = []
    
    movie_ind = movie_mapper[movie_id]
    movie_vec = X[movie_ind]
    k+=1
    kNN = NearestNeighbors(n_neighbors=k, algorithm="brute", metric=metric)
    kNN.fit(X)
    if isinstance(movie_vec, (np.ndarray)):
        movie_vec = movie_vec.reshape(1,-1)
    neighbour = kNN.kneighbors(movie_vec, return_distance=show_distance)
    for i in range(0,k):
        n = neighbour.item(i)
        neighbour_ids.append(movie_inv_mapper[n])
    neighbour_ids.pop(0)
    return neighbour_ids

We can test out our function by passing in a movieId. We'll create a dictionary that maps movieId to movie title so that we can better interpret our results.

In this case, movie_id = 1 is Toy Story. We'll set k = 10 and use our default metric, cosine similarity. This means that we're looking for the 10 most similar movies to Toy Story.

In [114]:
movie_titles = dict(zip(movies['movieId'], movies['title']))

movie_id = 1

similar_ids = find_similar_movies(movie_id, X, k=10)
movie_title = movie_titles[movie_id]

print(f"Because you watched {movie_title}...")
for i in similar_ids:
    print(movie_titles[i])
Because you watched Toy Story (1995)...
Toy Story 2 (1999)
Jurassic Park (1993)
Independence Day (a.k.a. ID4) (1996)
Star Wars: Episode IV - A New Hope (1977)
Forrest Gump (1994)
Lion King, The (1994)
Star Wars: Episode VI - Return of the Jedi (1983)
Mission: Impossible (1996)
Groundhog Day (1993)
Back to the Future (1985)

Let's repeat the process but this time using Euclidean distance.

In [48]:
movie_titles = dict(zip(movies['movieId'], movies['title']))

movie_id = 1
similar_ids = find_similar_movies(movie_id, X, k=10, metric="euclidean")

movie_title = movie_titles[movie_id]
print(f"Because you watched {movie_title}...")
for i in similar_ids:
    print(movie_titles[i])
Because you watched Toy Story (1995)...
Toy Story 2 (1999)
Mission: Impossible (1996)
Independence Day (a.k.a. ID4) (1996)
Bug's Life, A (1998)
Nutty Professor, The (1996)
Willy Wonka & the Chocolate Factory (1971)
Babe (1995)
Groundhog Day (1993)
Mask, The (1994)
Honey, I Shrunk the Kids (1989)

Dimensionality Reduction with Matrix Factorization

(Advanced)

Matrix Factorization

  • A linear algebra technique that can help discover latent features between users and movies
  • Latent features give a more compact representation of user tastes and item descriptions
  • Can enhance the quality of recommendations when $X$ is very sparse
  • Factorizes the user-item matrix into two "factor matrices":
    • user-factor matrix (n_users, k)
    • item-factor matrix (k, n_items)

Singular Value Decomposition

  • Singular Value Decomposition (SVD) is a type of matrix factorization that is used for data reduction and de-noising
  • scikit-learn has a class called TruncatedSVD that we can use to reduce our matrix from (n_users, n_movies) to (n_users, n_components) where n_components represents number of latent features.
In [148]:
from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components=25, n_iter=10)
Z = svd.fit_transform(X.T)
print(f"The shape of our compressed Z matrix is: {Z.shape}.")
The shape of our compressed Z matrix is: (610, 25).

Let's apply our find_similar_movies function to our new Z matrix.

In [149]:
movie_id = 1
similar_movies = find_similar_movies(movie_id, X=Z.T, metric='cosine', k=10)
movie_title = movie_titles[movie_id]

print(f"Because you watched {movie_title}:")
for i in similar_movies:
    print(movie_titles[i])
Because you watched Toy Story (1995):
Powder (1995)
Sabrina (1995)
Ace Ventura: When Nature Calls (1995)
Get Shorty (1995)
Casino (1995)
Four Rooms (1995)
Nixon (1995)
Copycat (1995)
Money Train (1995)
Assassins (1995)

When we reduce the dimensions of our original user-item matrix to (n_users, 30), we get the recommendations shown above.

Top N Recommender

  • Matrix factorization allows us to predict missing ratings in our original $X$ matrix
  • Reconstruct matrix by getting inner product of the user-factor matrix and movie-factor matrix
In [150]:
new_X = svd.inverse_transform(Z).T

print(f"Dimensions of original user-item matrix: {X.shape}, {type(X)}")
print(f"Dimensions of SVD-reconstructed X matrix: {new_X.shape}, {type(new_X)}")
Dimensions of original user-item matrix: (9724, 610), <class 'scipy.sparse.csr.csr_matrix'>
Dimensions of SVD-reconstructed X matrix: (9724, 610), <class 'numpy.ndarray'>

Let's try generaing top N recommendations for a user in our dataset. We'll look at userId 5. Which movies did this user rate highly?

In [156]:
userId = 5
user_preferences = ratings[(ratings['userId']==userId)&(ratings['rating']>=4)]
user_preferences = user_preferences.merge(movies[['movieId', 'title']])
user_preferences.sort_values('rating', ascending=False).head(5)
Out[156]:
userId movieId rating timestamp title
11 5 296 5.0 847434748 Pulp Fiction (1994)
8 5 247 5.0 847435337 Heavenly Creatures (1994)
21 5 595 5.0 847434832 Beauty and the Beast (1991)
20 5 594 5.0 847435238 Snow White and the Seven Dwarfs (1937)
19 5 590 5.0 847434747 Dances with Wolves (1990)

Now, let's predict which movies userId 5 will also like based on their previous ratings. We need to do the following:

  • get row from $X$ which represents userId=5
  • sort their predicted ratings in descending order

We can use np.argsort() to grab top $N$ indices for userId=5.

In [157]:
top_N = 10

movie_titles = dict(zip(movies['movieId'], movies['title']))

top_N_indices = new_X[user_mapper[userId]].argsort()[-top_N:][::-1]

print(f"Top {top_N} Recommendations for UserId {userId}:")
for i in top_N_indices:
    movie_id = movie_inv_mapper[i]
    print(movie_titles[movie_id])
Top 10 Recommendations for UserId 5:
Puppet Masters, The (1994)
Heat (1995)
Beautiful Girls (1996)
Chasers (1994)
Timecop (1994)
Little Women (1994)
Birdcage, The (1996)
Twelve Monkeys (a.k.a. 12 Monkeys) (1995)
Sliver (1993)
Wallace & Gromit: A Close Shave (1995)