import numpy as np
import pandas as pd
import os
from math import sqrt
import matplotlib.pyplot as plt
%matplotlib inline
Here I will go through one more approach to building the same movie recommendation engine. This time I won't use the K-Means algorithm; instead I will use a collaborative filtering approach.
At a high level, this approach finds users whose rating patterns are similar to the input user's and recommends the movies those users rated highly. Let's start by importing the data.
#Storing the movie information
movies_df = pd.read_csv('../input/movielens-20m-dataset/movie.csv')
#Storing the ratings information
ratings_df = pd.read_csv('../input/movielens-20m-dataset/rating.csv')
Now that we have the data imported into two data frames, let's inspect them.
movies_df.head(2)
| | movieId | title | genres |
|---|---|---|---|
| 0 | 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy |
| 1 | 2 | Jumanji (1995) | Adventure\|Children\|Fantasy |
ratings_df.head(2)
| | userId | movieId | rating | timestamp |
|---|---|---|---|---|
| 0 | 1 | 2 | 3.5 | 2005-04-02 23:53:47 |
| 1 | 1 | 29 | 3.5 | 2005-04-02 23:31:16 |
So the movies dataset contains the movies along with their IDs and genres, and the ratings data frame contains the ratings users gave to each movie ID.
Let's first analyze the movies data and modify it to suit our needs.
The title column combines the movie name and the year, which is not directly useful for our algorithm. Let's split the two into their own columns using regular expressions.
movies_df['year'] = movies_df.title.str.extract(r'(\(\d{4}\))', expand=False)
movies_df['year'] = movies_df.year.str.extract(r'(\d{4})', expand=False)
movies_df['title'] = movies_df.title.str.replace(r'\(\d{4}\)', '', regex=True)
movies_df['title'] = movies_df['title'].str.strip()
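To illustrate what these patterns do, here is a standalone sketch using Python's built-in `re` module on a single sample title (the title string is just an example):

```python
import re

title = "Toy Story (1995)"

# Capture the year together with its parentheses, e.g. "(1995)"
year_with_parens = re.search(r'(\(\d{4}\))', title).group(1)

# Strip the parentheses to keep just the four digits
year = re.search(r'\d{4}', year_with_parens).group(0)

# Remove the parenthesised year from the title and trim whitespace
clean_title = re.sub(r'\(\d{4}\)', '', title).strip()

print(clean_title, year)  # Toy Story 1995
```

The pandas `.str` methods above apply exactly this logic to every row of the column at once.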
Let's see what we have now.
movies_df.head(2)
| | movieId | title | genres | year |
|---|---|---|---|---|
| 0 | 1 | Toy Story | Adventure\|Animation\|Children\|Comedy\|Fantasy | 1995 |
| 1 | 2 | Jumanji | Adventure\|Children\|Fantasy | 1995 |
Let's refine the data further. We do not need the genres column in this algorithm, since we will compute similarity scores based on ratings alone. Let's drop it.
movies_df = movies_df.drop(columns='genres')
movies_df.head(2)
| | movieId | title | year |
|---|---|---|---|
| 0 | 1 | Toy Story | 1995 |
| 1 | 2 | Jumanji | 1995 |
Let's move on to the ratings data.
ratings_df.head(2)
| | userId | movieId | rating | timestamp |
|---|---|---|---|---|
| 0 | 1 | 2 | 3.5 | 2005-04-02 23:53:47 |
| 1 | 1 | 29 | 3.5 | 2005-04-02 23:31:16 |
We have no use for the timestamp column in this algorithm, so let's remove it.
ratings_df = ratings_df.drop(columns='timestamp')
ratings_df.head()
| | userId | movieId | rating |
|---|---|---|---|
| 0 | 1 | 2 | 3.5 |
| 1 | 1 | 29 | 3.5 |
| 2 | 1 | 32 | 3.5 |
| 3 | 1 | 47 | 3.5 |
| 4 | 1 | 50 | 3.5 |
Now that we have the processed datasets, we can start the collaborative filtering process.
We will follow this flow, at a high level, for the algorithm:

1. Take the input user's movie ratings.
2. Find users who have rated the same movies.
3. Compute a Pearson correlation similarity score between each of those users and the input user.
4. Select the most similar users.
5. Weight their ratings by the similarity scores and aggregate them per movie.
6. Recommend the movies with the highest aggregated scores.
Let's get the input user first.
input_user = [
{'title':'Heat', 'rating':5},
{'title':'GoldenEye', 'rating':3.5},
{'title':'Jumanji', 'rating':2},
{'title':"Sabrina", 'rating':5},
{'title':'Sudden Death', 'rating':4.5}
]
input_movies = pd.DataFrame(input_user)
input_movies
| | title | rating |
|---|---|---|
| 0 | Heat | 5.0 |
| 1 | GoldenEye | 3.5 |
| 2 | Jumanji | 2.0 |
| 3 | Sabrina | 5.0 |
| 4 | Sudden Death | 4.5 |
Since the ratings dataset works with movie IDs rather than titles, let's add the movie IDs to the input data frame.
inputId = movies_df[movies_df['title'].isin(input_movies['title'].tolist())]
input_movies = pd.merge(inputId, input_movies)
input_movies = input_movies.drop(columns='year')
input_movies
| | movieId | title | rating |
|---|---|---|---|
| 0 | 2 | Jumanji | 2.0 |
| 1 | 6 | Heat | 5.0 |
| 2 | 73608 | Heat | 5.0 |
| 3 | 7 | Sabrina | 5.0 |
| 4 | 915 | Sabrina | 5.0 |
| 5 | 9 | Sudden Death | 4.5 |
| 6 | 10 | GoldenEye | 3.5 |
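Notice that "Heat" and "Sabrina" each match two different movie IDs, because titles are not unique across years. One hypothetical way to disambiguate, if the input also carried a year, would be to merge on both columns. A minimal sketch with toy stand-in frames (not the real MovieLens data):

```python
import pandas as pd

# Toy stand-ins for movies_df and the user input (illustrative values only)
movies = pd.DataFrame({
    'movieId': [6, 73608],
    'title': ['Heat', 'Heat'],
    'year': ['1995', '2009'],
})
user_input = pd.DataFrame({
    'title': ['Heat'],
    'year': ['1995'],
    'rating': [5.0],
})

# Merging on title alone would return both IDs; adding year keeps only one
merged = pd.merge(movies, user_input, on=['title', 'year'])
print(merged[['movieId', 'title', 'year', 'rating']])
```

For this post I keep both matches, since the input user did not specify years.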
Next we can find candidate users in the ratings dataset: users who have rated the same movies as the input user.
similar_users = ratings_df[ratings_df['movieId'].isin(input_movies['movieId'].tolist())]
similar_users.head()
| | userId | movieId | rating |
|---|---|---|---|
| 0 | 1 | 2 | 3.5 |
| 423 | 4 | 6 | 3.0 |
| 424 | 4 | 10 | 4.0 |
| 451 | 5 | 2 | 3.0 |
| 519 | 6 | 7 | 5.0 |
Next let's group the rows by user ID, since each user will have rated multiple movies.
grouped_users = similar_users.groupby('userId')
grouped_users.get_group(138484)
| | userId | movieId | rating |
|---|---|---|---|
| 19999136 | 138484 | 2 | 3.0 |
| 19999138 | 138484 | 6 | 5.0 |
| 19999139 | 138484 | 10 | 3.0 |
To get better recommendations, let's sort these groups so that users who have more rated movies in common with the input come first.
grouped_users = sorted(grouped_users, key=lambda x: len(x[1]), reverse=True)
grouped_users[:2]
[(93152,           userId  movieId  rating
  13483207          93152        2     5.0
  13483210          93152        6     5.0
  13483211          93152        7     2.0
  13483212          93152        9     3.0
  13483213          93152       10     5.0
  13483405          93152      915     2.0
  13484838          93152    73608     5.0),
 (156,              userId  movieId  rating
  19847               156        2     5.0
  19851               156        6     4.0
  19852               156        7     4.0
  19853               156        9     3.0
  19854               156       10     4.0
  20215               156      915     4.0)]
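The sorted-groupby pattern used above can be sketched on toy data: iterating a groupby yields `(key, sub-frame)` pairs, which `sorted` orders by the length of each sub-frame (the values below are made up for illustration):

```python
import pandas as pd

# Toy ratings: user 2 rated three movies, user 3 two, user 1 one
df = pd.DataFrame({'userId': [1, 2, 2, 2, 3, 3],
                   'movieId': [10, 10, 20, 30, 10, 20]})

# Sort the (key, group) pairs by group size, largest first
groups = sorted(df.groupby('userId'), key=lambda x: len(x[1]), reverse=True)

print([name for name, g in groups])  # [2, 3, 1]
```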
Next we will compare the input user to all the other users and find the most similar ones based on their ratings. I am using the Pearson correlation coefficient to measure the relationship between users.
To keep the runtime reasonable for this post, I am truncating the list to iterate through fewer users. In the web app I will use the whole dataset for comparison.
grouped_users = grouped_users[0:100]
Calculate the Pearson correlation coefficient for each candidate user:
correlateDict = {}
for name, group in grouped_users:
    # Sort both frames by movieId so the two rating lists line up pairwise
    group = group.sort_values(by='movieId')
    inputMovies = input_movies.sort_values(by='movieId')
    nRatings = len(group)
    # Keep only the input movies this user has also rated
    temp_df = inputMovies[inputMovies['movieId'].isin(group['movieId'].tolist())]
    tempRatingList = temp_df['rating'].tolist()
    tempGroupList = group['rating'].tolist()
    # Pearson correlation components
    Sxx = sum([i**2 for i in tempRatingList]) - pow(sum(tempRatingList), 2)/float(nRatings)
    Syy = sum([i**2 for i in tempGroupList]) - pow(sum(tempGroupList), 2)/float(nRatings)
    Sxy = sum(i*j for i, j in zip(tempRatingList, tempGroupList)) - sum(tempRatingList)*sum(tempGroupList)/float(nRatings)
    if Sxx != 0 and Syy != 0:
        correlateDict[name] = Sxy/sqrt(Sxx*Syy)
    else:
        correlateDict[name] = 0
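As a sanity check, the Sxx/Syy/Sxy formulation above is just the textbook Pearson correlation, and on any pair of rating lists it agrees with NumPy's `np.corrcoef`. A self-contained sketch with made-up rating values:

```python
import numpy as np
from math import sqrt

x = [5.0, 3.5, 2.0, 5.0, 4.5]   # input user's ratings (example values)
y = [4.0, 3.0, 2.5, 4.5, 5.0]   # another user's ratings (example values)
n = len(x)

# Same component formulas as in the loop above
Sxx = sum(i**2 for i in x) - sum(x)**2 / n
Syy = sum(i**2 for i in y) - sum(y)**2 / n
Sxy = sum(i*j for i, j in zip(x, y)) - sum(x)*sum(y) / n

pearson = Sxy / sqrt(Sxx * Syy)

# Compare against NumPy's implementation
print(abs(pearson - np.corrcoef(x, y)[0, 1]) < 1e-9)  # True
```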
Let's convert the correlation coefficients to a DataFrame for a better view.
correlateDF = pd.DataFrame.from_dict(correlateDict, orient='index')
correlateDF.columns = ['similarityIndex']
correlateDF['userId'] = correlateDF.index
correlateDF.index = range(len(correlateDF))
correlateDF.head()
| | similarityIndex | userId |
|---|---|---|
| 0 | -0.417402 | 93152 |
| 1 | -0.783349 | 156 |
| 2 | -0.400249 | 903 |
| 3 | 0.470882 | 982 |
| 4 | 0.573070 | 1547 |
Now that we have the similarity scores, let's take the 50 users most similar to the input user based on those scores.
similarusers = correlateDF.sort_values(by='similarityIndex', ascending=False)[0:50]
similarusers.head()
| | similarityIndex | userId |
|---|---|---|
| 30 | 0.741620 | 57735 |
| 78 | 0.716115 | 1516 |
| 73 | 0.662652 | 132039 |
| 7 | 0.639863 | 5084 |
| 11 | 0.628077 | 11900 |
Now that we have identified the users most similar to the input user based on ratings, let's recommend some movies using this finding. First, let's join the similar-users data frame with the ratings data so we can see the actual rating each of these users gave to each movie.
similarusrsrating = similarusers.merge(ratings_df, on='userId', how='inner')
similarusrsrating.head()
| | similarityIndex | userId | movieId | rating |
|---|---|---|---|---|
| 0 | 0.74162 | 57735 | 1 | 3.0 |
| 1 | 0.74162 | 57735 | 2 | 1.0 |
| 2 | 0.74162 | 57735 | 3 | 1.0 |
| 3 | 0.74162 | 57735 | 6 | 2.5 |
| 4 | 0.74162 | 57735 | 7 | 2.0 |
Next we compute a weighted rating for each row, combining each user's rating with their similarity score to the input user. I am multiplying the similarityIndex column by the rating column and adding the result as a new column.
similarusrsrating['weightedRating'] = similarusrsrating['similarityIndex']*similarusrsrating['rating']
similarusrsrating.head()
| | similarityIndex | userId | movieId | rating | weightedRating |
|---|---|---|---|---|---|
| 0 | 0.74162 | 57735 | 1 | 3.0 | 2.22486 |
| 1 | 0.74162 | 57735 | 2 | 1.0 | 0.74162 |
| 2 | 0.74162 | 57735 | 3 | 1.0 | 0.74162 |
| 3 | 0.74162 | 57735 | 6 | 2.5 | 1.85405 |
| 4 | 0.74162 | 57735 | 7 | 2.0 | 1.48324 |
Let's aggregate this data per movie to get a more focused dataset. I am grouping the rows by movie ID and summing the similarity score and weighted rating columns for each movie.
tmpsimilarusrsrating = similarusrsrating.groupby('movieId').sum()[['similarityIndex','weightedRating']]
tmpsimilarusrsrating.columns = ['sum_similarityIndex','sum_weightedRating']
tmpsimilarusrsrating.head()
| movieId | sum_similarityIndex | sum_weightedRating |
|---|---|---|
| 1 | 12.491866 | 48.767799 |
| 2 | 14.792067 | 33.109592 |
| 3 | 9.766578 | 26.542068 |
| 4 | 4.332229 | 11.410606 |
| 5 | 10.121630 | 29.586209 |
Based on the aggregated dataset above, we are now ready to recommend movies for the input user. For that I am creating a new data frame to store the recommendations. The recommendation score for each movie ID is the weighted average of the ratings: the sum of the weighted ratings divided by the sum of the similarity scores.
recommend_movies = pd.DataFrame()
recommend_movies['weighted recom score'] = tmpsimilarusrsrating['sum_weightedRating']/tmpsimilarusrsrating['sum_similarityIndex']
recommend_movies['movieId'] = tmpsimilarusrsrating.index
recommend_movies.head()
| movieId | weighted recom score | movieId |
|---|---|---|
| 1 | 3.903964 | 1 |
| 2 | 2.238334 | 2 |
| 3 | 2.717643 | 3 |
| 4 | 2.633888 | 4 |
| 5 | 2.923068 | 5 |
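To see why this ratio is a weighted average, here is a tiny worked example with made-up numbers: two similar users, with similarity scores 0.8 and 0.4, rate the same movie 5.0 and 3.0 respectively.

```python
similarities = [0.8, 0.4]
ratings = [5.0, 3.0]

# Numerator: sum of similarity * rating, i.e. 0.8*5.0 + 0.4*3.0 = 5.2
sum_weighted = sum(s * r for s, r in zip(similarities, ratings))

# Denominator: sum of the similarity scores, i.e. 1.2
sum_sim = sum(similarities)

score = sum_weighted / sum_sim
print(round(score, 4))  # 4.3333
```

The score lands closer to 5.0 than to 3.0 because the more similar user's rating carries more weight.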
This is our recommended movie data set. To get the top movies, say the top 5, let's sort the data frame by score.
recommend_movies = recommend_movies.sort_values(by='weighted recom score', ascending=False)
recommend_movies.head()
| movieId | weighted recom score | movieId |
|---|---|---|
| 8339 | 5.0 | 8339 |
| 103659 | 5.0 | 103659 |
| 32361 | 5.0 | 32361 |
| 72647 | 5.0 | 72647 |
| 3599 | 5.0 | 3599 |
Finally, match the movie IDs against the original movies data frame to get the movie titles as well.
movies_df.loc[movies_df['movieId'].isin(recommend_movies.head()['movieId'].tolist())]
| | movieId | title | year |
|---|---|---|---|
| 3508 | 3599 | Anchors Aweigh | 1945 |
| 7757 | 8339 | Damn the Defiant! (H.M.S. Defiant) | 1962 |
| 9893 | 32361 | Come and Get It | 1936 |
| 14509 | 72647 | Zorn's Lemma | 1970 |
| 21327 | 103659 | Justice League: The Flashpoint Paradox | 2013 |