A Movie Recommender System using K-Means Clustering

Imports and Get the Dataset

I am using the MovieLens dataset for this. There are two data files included in the dataset:

The two files are imported into two separate dataframes.
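As a rough sketch of this step, the import could look like the snippet below. It assumes the two files are named movies.csv and ratings.csv and carry the usual MovieLens columns (movieId, title, genres and userId, movieId, rating, timestamp). The inline rows are illustrative stand-ins so the snippet runs without downloading anything.

```python
import io
import pandas as pd

# ASSUMPTION: file names and column layout follow the common MovieLens CSV release.
# The rows below are small illustrative samples, not the real dataset.
MOVIES_CSV = """movieId,title,genres
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,Jumanji (1995),Adventure|Children|Fantasy
3,Heat (1995),Action|Crime|Thriller
"""
RATINGS_CSV = """userId,movieId,rating,timestamp
1,1,4.0,964982703
1,3,4.0,964981247
2,2,3.5,1445714835
"""

# With the real dataset, pass the file paths instead of the StringIO buffers,
# e.g. pd.read_csv("movies.csv").
movies = pd.read_csv(io.StringIO(MOVIES_CSV))
ratings = pd.read_csv(io.StringIO(RATINGS_CSV))

print(movies.shape, ratings.shape)
print(ratings.head())
```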

Let's preview the dataframes.

Data Analysis

Let's analyze the data first to understand what we are working with.

To start the analysis, let's first take a subset of users and check their preferences.

2 Genre Analysis

We pick two genres from the movie list and filter the dataset to get each user's average rating for those two genres. Let's pick Horror and Thriller.
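A minimal sketch of that filtering, assuming the genres column holds pipe-separated genre names as in MovieLens. The helper name genre_average_ratings and the sample rows are made up for illustration.

```python
import pandas as pd

def genre_average_ratings(ratings, movies, genres):
    """Return one row per user with that user's mean rating for each genre.

    Hypothetical helper: users who rated no movie of a genre get NaN there.
    """
    columns = {}
    for genre in genres:
        # Select movies whose pipe-separated genre string contains this genre.
        genre_movies = movies[movies["genres"].str.contains(genre, regex=False)]
        genre_ratings = ratings[ratings["movieId"].isin(genre_movies["movieId"])]
        columns[f"avg_{genre.lower()}_rating"] = (
            genre_ratings.groupby("userId")["rating"].mean().round(2)
        )
    return pd.DataFrame(columns)

# Illustrative data only.
movies = pd.DataFrame({
    "movieId": [1, 2, 3, 4],
    "genres": ["Horror", "Thriller", "Horror|Thriller", "Comedy"],
})
ratings = pd.DataFrame({
    "userId": [1, 1, 2, 2, 2, 3],
    "movieId": [1, 2, 1, 2, 3, 4],
    "rating": [5.0, 2.0, 1.0, 4.0, 3.0, 4.0],
})

genre_avgs = genre_average_ratings(ratings, movies, ["Horror", "Thriller"])
print(genre_avgs)
```

User 3 only rated a comedy, so that user does not appear in the result at all.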

Now that we have a filtered dataframe showing average ratings for the two genres, let's refine the data further and keep only users who liked one genre more than the other.
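One simple way to express "liked one genre more than the other" is a minimum gap between the two averages. The sketch below takes that approach. The helper name keep_biased_users and the min_gap threshold are assumptions for illustration and may differ from the filter actually used here.

```python
import numpy as np
import pandas as pd

def keep_biased_users(genre_avgs, col_a, col_b, min_gap=1.0):
    """Keep users whose two genre averages differ by at least min_gap.

    Hypothetical helper: users missing either average are dropped first.
    """
    complete = genre_avgs.dropna(subset=[col_a, col_b])
    gap = (complete[col_a] - complete[col_b]).abs()
    return complete[gap >= min_gap]

# Illustrative averages only.
genre_avgs = pd.DataFrame(
    {
        "avg_horror_rating": [4.5, 3.0, 1.5, np.nan],
        "avg_thriller_rating": [2.0, 3.2, 4.0, 4.0],
    },
    index=pd.Index([1, 2, 3, 4], name="userId"),
)

biased = keep_biased_users(genre_avgs, "avg_horror_rating", "avg_thriller_rating")
print(biased)
```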

Let's draw a scatter plot to understand how the users are distributed.

Based on the plot above, we can see clear boundaries between users' ratings. Let's try to get clusters from this data using K-Means.
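A small sketch of this clustering step with scikit-learn's KMeans. The points are synthetic stand-ins for (average Horror rating, average Thriller rating) pairs, generated only so the example is self-contained.

```python
import numpy as np
from sklearn.cluster import KMeans

# ASSUMPTION: synthetic users, 20 who prefer Horror and 20 who prefer Thriller.
rng = np.random.default_rng(0)
horror_fans = rng.normal(loc=[4.2, 2.0], scale=0.3, size=(20, 2))
thriller_fans = rng.normal(loc=[2.0, 4.2], scale=0.3, size=(20, 2))
X = np.vstack([horror_fans, thriller_fans])

# Fit K-Means with two clusters and get one cluster label per user.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print(labels)
print(kmeans.cluster_centers_)
```

Changing n_clusters to 3 or 4 repeats the same experiment with more groups.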

Let's plot the clusters to get a better view.

So here we can see the data being divided into two groups/clusters:

Let's try to break the data into one more cluster.

Let's see how it looks now.

Now we see one more group added. So with the new clustering we have:

Let's add one more cluster and see the effect.

So how does it look now?

So we can keep on adding clusters and refining the data groups from our dataset. The clusters divide the users based on their preferences (as expressed in their ratings). But what is the optimum value for K, the number of clusters? Let's figure that out.

Select K or number of clusters

To get a suitable value for K (the number of clusters), let's perform the above steps for a range of K values and plot the errors for each. We will use the Elbow Method to choose the K value.
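A sketch of scanning a range of K values, assuming the error is K-Means inertia (the within-cluster sum of squared distances), which is the usual quantity for an elbow plot. The score plotted in this post may be a different one, and the helper name inertia_by_k and the data are made up for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

def inertia_by_k(X, k_values, random_state=0):
    """Fit K-Means once per K and return {K: inertia}. Hypothetical helper."""
    errors = {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(X)
        errors[k] = model.inertia_
    return errors

# ASSUMPTION: synthetic data with three well-separated groups.
rng = np.random.default_rng(1)
centers = np.array([[1.0, 1.0], [1.0, 4.5], [4.5, 2.5]])
X = np.vstack([rng.normal(c, 0.2, size=(30, 2)) for c in centers])

errors = inertia_by_k(X, range(1, 7))
for k, err in errors.items():
    print(k, round(err, 2))
```

On data like this the error falls steeply up to K = 3 and flattens afterwards, and that bend is the "elbow".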

Now let's plot the errors to get a visual. The optimum K value will be where the score (Y axis) has comparatively high values.

From the graph we can see that possible K values include 12, 32 and 57, amongst others. Beyond these the score dips sharply, so we won't go any further.

Let's pick K as 12 and perform the predictions.

Let's see how it looks now.

Let's add one more Genre

Let's add Fantasy as another genre and perform a similar analysis to the one above.

With the new dataset, let's do a prediction using 12 clusters.

Let's see how this looks now with 3 genres.

Now we can see how the clusters have changed. As the data input increases, the clusters become more refined. We won't add any more genres for now.

Cluster the Movies

Now that we have seen how we can cluster based on genres, let's change our approach and build the clusters based on user ratings of the movies.

Prepare the dataset

First we transform the input data so that it's easier to view and analyze the ratings across users and movies. We will build a pivot table showing users and their ratings for each movie.
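A minimal sketch of the pivot step with pandas, assuming the MovieLens column names userId, movieId, rating and title. The rows are illustrative.

```python
import pandas as pd

# Illustrative data only.
ratings = pd.DataFrame({
    "userId": [1, 1, 2, 3],
    "movieId": [1, 2, 1, 3],
    "rating": [4.0, 3.0, 5.0, 2.0],
})
movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Movie A", "Movie B", "Movie C"],
})

# Attach titles, then pivot: one row per user, one column per movie.
ratings_with_titles = ratings.merge(movies[["movieId", "title"]], on="movieId")
user_movie_ratings = ratings_with_titles.pivot_table(
    index="userId", columns="title", values="rating"
)

print(user_movie_ratings)
```

Every (user, movie) pair without a rating shows up as NaN in the result.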

We can see that the majority of the ratings are NA, which is understandable because not all users have rated all movies. So let's sort the data to have the most rated ones first.
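One way to bring the densest part of the table to the front is to keep the movies with the most ratings and then the users who rated most of those movies. The helper name most_rated_subset and the cut-off sizes are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def most_rated_subset(user_movie_ratings, n_movies, n_users):
    """Keep the n_movies most-rated movies, then the n_users most active users.

    Hypothetical helper: count() ignores NaN, so it counts actual ratings.
    """
    movie_counts = user_movie_ratings.count(axis=0).sort_values(
        ascending=False, kind="mergesort"
    )
    top_movies = user_movie_ratings[movie_counts.index[:n_movies]]
    user_counts = top_movies.count(axis=1).sort_values(
        ascending=False, kind="mergesort"
    )
    return top_movies.loc[user_counts.index[:n_users]]

# Illustrative pivot table only.
user_movie_ratings = pd.DataFrame(
    {
        "A": [5.0, 4.0, 3.0, 2.0],
        "B": [4.0, np.nan, np.nan, 5.0],
        "C": [3.0, 2.0, np.nan, 1.0],
        "D": [np.nan, np.nan, np.nan, 4.0],
    },
    index=pd.Index([1, 2, 3, 4], name="userId"),
)

subset = most_rated_subset(user_movie_ratings, n_movies=3, n_users=2)
print(subset)
```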

Now we have a good view of the ratings. There are still some NA values, but we can manage. Let's visualize this on a heatmap to identify the rating clusters.

This shows a visual of how users rated the movies. The white cells mark movies that a user didn't rate. We handle this next.

For the next steps, to keep the performance reasonable for this post, let me filter the dataset and work with a smaller subset. In the actual API that I will be deploying for the Recommender app, I will be using the full dataset.

Get the Sparse matrix

But the problem of NA values in the dataset still remains. To get around this, I will convert the dataset to a sparse CSR matrix.
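A sketch of the conversion using scipy.sparse.csr_matrix. It assumes missing ratings can be stored as 0, which is not a valid rating value, so only real ratings are kept as stored entries. The exact conversion used here may differ.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# Illustrative pivot table only.
user_movie_ratings = pd.DataFrame(
    {
        "Movie A": [5.0, np.nan, 3.0],
        "Movie B": [np.nan, np.nan, 4.0],
        "Movie C": [2.0, 1.0, np.nan],
    },
    index=pd.Index([1, 2, 3], name="userId"),
)

# ASSUMPTION: 0 stands for "not rated"; zeros are not stored in the CSR matrix.
sparse_ratings = csr_matrix(user_movie_ratings.fillna(0).to_numpy())

print(sparse_ratings.shape, sparse_ratings.nnz)
```

Here nnz is the number of stored entries, which equals the number of actual ratings.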

Now that we have the sparse matrix for the ratings, let's perform some predictions.

Perform Predictions

We will identify clusters based on the above ratings sparse frame and use K value as

I know it's a bit indecisive from the above which value to take for K. Let's select one of 2, 7, 12 or 17.

Let's take K as 12.
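A sketch of clustering the sparse ratings with K = 12 and attaching the result as a group column. The ratings are randomly generated placeholders, and the variable names are assumptions.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.cluster import KMeans

# ASSUMPTION: random placeholder ratings for 60 users and 15 movies,
# with roughly half of the cells left unrated.
rng = np.random.default_rng(42)
n_users, n_movies = 60, 15
scores = rng.integers(1, 6, size=(n_users, n_movies)).astype(float)
scores[rng.random((n_users, n_movies)) < 0.5] = np.nan
user_movie_ratings = pd.DataFrame(
    scores,
    index=pd.Index(range(1, n_users + 1), name="userId"),
    columns=[f"movie_{i}" for i in range(n_movies)],
)

# KMeans accepts a CSR matrix directly; unrated cells are stored as 0.
sparse_ratings = csr_matrix(user_movie_ratings.fillna(0).to_numpy())
kmeans = KMeans(n_clusters=12, n_init=10, random_state=0)
groups = kmeans.fit_predict(sparse_ratings)

# Keep the original ratings (with NaN) and add the cluster label per user.
clustered = user_movie_ratings.assign(group=groups)
print(clustered["group"].value_counts().sort_index())
```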

Let's visualize some of the clusters from above.

The group column shows which cluster group the user belongs to based on the ratings. Now that we have the clustered dataset, let's see what type of predictions we can perform with it.

Predictions from the cluster

Let's pick a cluster group to analyze.

This is how the ratings table looks for the cluster.

The blank cells are there because users didn't rate that specific movie. We can take the ratings from other users in the same cluster and average them to predict the rating for a blank cell. Let me demonstrate. First let me pick a movie.

For all other users, where the cell is blank for this movie in this cluster, the predicted rating will be
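The cluster-average idea described above can be sketched as follows. The small table stands in for one cluster's ratings, and the movie names are placeholders.

```python
import numpy as np
import pandas as pd

# Illustrative ratings for the users of a single cluster.
cluster = pd.DataFrame(
    {
        "Movie A": [4.0, np.nan, 5.0, np.nan],
        "Movie B": [3.0, 2.0, np.nan, 4.0],
    },
    index=pd.Index([10, 11, 12, 13], name="userId"),
)

movie = "Movie A"

# mean() skips NaN, so only users who actually rated the movie contribute.
predicted = cluster[movie].mean()

# Fill the blank cells for this movie with the cluster average.
filled = cluster[movie].fillna(predicted)

print(predicted)
print(filled)
```

Users 11 and 13 did not rate Movie A, so they receive the average of the ratings from users 10 and 12.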

Recommend Movies

Now that we have identified the clusters of users based on the similar ratings they gave to movies, we can use the cluster info to recommend movies to other users belonging to the same cluster. If we look at the mean rating of some movies in a specific cluster, we will know the specific taste of that cluster.

Now we can recommend movies to a specific user in a specific cluster. The method of recommending the movies will be:

Let's see how it works.

Let's pick user ID 83 and recommend movies to that user.
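One plausible reading of this recommendation step is: look up the user's cluster, find the movies that user has not rated, and rank them by the cluster's mean rating. The sketch below follows that reading. The helper name recommend_movies and the sample table are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def recommend_movies(clustered, user_id, n=3):
    """Return the top-n unrated movies for user_id, ranked by cluster mean.

    Hypothetical helper: expects a 'group' column holding cluster labels.
    """
    group = clustered.loc[user_id, "group"]
    peers = clustered[clustered["group"] == group].drop(columns="group")
    user_ratings = peers.loc[user_id]
    unrated = user_ratings[user_ratings.isna()].index
    # Movies that nobody in the cluster rated have no mean and are dropped.
    cluster_means = peers[unrated].mean().dropna()
    return cluster_means.sort_values(ascending=False).head(n)

# Illustrative clustered ratings only.
clustered = pd.DataFrame(
    {
        "A": [5.0, 4.0, 5.0, 1.0],
        "B": [np.nan, 4.0, 5.0, 1.0],
        "C": [np.nan, 2.0, 3.0, 5.0],
        "D": [np.nan, np.nan, np.nan, 5.0],
        "group": [0, 0, 0, 1],
    },
    index=pd.Index([83, 84, 85, 90], name="userId"),
)

recommendations = recommend_movies(clustered, user_id=83)
print(recommendations)
```

User 90 sits in a different cluster, so that user's high rating for movie D does not influence what user 83 is offered.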