Movie Recommendations using Collaborative Filtering

Here I will go through one more approach to building the same movie recommendation engine. This time I won't be using the K-Means algorithm; instead I will use a collaborative filtering approach.

At a high level, in this approach:

Get the datasets
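As a rough sketch of the import step, the snippet below reads two CSV sources into pandas data frames. The column names follow a MovieLens-style layout, which is an assumption on my part, and the inline sample rows stand in for the real files so the snippet runs on its own.

```python
import io
import pandas as pd

# Tiny inline stand-ins for the real CSV files (columns and rows are assumptions).
movies_csv = io.StringIO(
    "movieId,title,genres\n"
    "1,Toy Story (1995),Adventure|Animation|Children\n"
    "2,Jumanji (1995),Adventure|Children|Fantasy\n"
)
ratings_csv = io.StringIO(
    "userId,movieId,rating,timestamp\n"
    "1,1,4.0,964982703\n"
    "1,2,3.5,964981247\n"
    "2,1,5.0,847434962\n"
)

# With real data, pass a file path to read_csv instead of the StringIO object.
movies_df = pd.read_csv(movies_csv)
ratings_df = pd.read_csv(ratings_csv)

print(movies_df.head())
print(ratings_df.head())
```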

Now that we have the data imported into two data frames, let's inspect them.

The movies dataset contains the movies along with their IDs and genres. The ratings data frame contains the ratings that users gave to each movie ID.

Pre-Processing data

Movies Data

Let's first analyze the movies data and modify it to suit our needs.

The title column combines the movie name and the release year. That combined form is not useful for our algorithm, so let's split the two into their own columns. I am using a regular expression to do the split.
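One way the split could look is sketched below. It assumes titles follow the pattern "Name (YYYY)"; the data frame name, column names and sample rows are illustrative only.

```python
import pandas as pd

# Illustrative sample; the real data frame has many more rows.
movies_df = pd.DataFrame({
    "movieId": [1, 2],
    "title": ["Toy Story (1995)", "Jumanji (1995)"],
})

# Capture a four-digit year wrapped in parentheses, e.g. "(1995)" -> "1995".
movies_df["year"] = movies_df["title"].str.extract(r"\((\d{4})\)", expand=False)

# Remove the "(year)" part from the title and trim leftover whitespace.
movies_df["title"] = (
    movies_df["title"].str.replace(r"\s*\(\d{4}\)", "", regex=True).str.strip()
)

print(movies_df)
```

Titles without a year in parentheses would end up with a missing value in the year column, so it is worth checking for those on the real data.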

Let's see what we have now.

Let's refine the data further. We do not need the genres column in this algorithm, because we will calculate the similarity scores from ratings alone. Let's drop the genres column.
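Dropping a column is a one-liner in pandas. The sketch below assumes the column is called genres; the same call with "timestamp" would work for any other column we do not need.

```python
import pandas as pd

# Illustrative sample with the column we want to get rid of.
movies_df = pd.DataFrame({
    "movieId": [1, 2],
    "title": ["Toy Story", "Jumanji"],
    "year": ["1995", "1995"],
    "genres": ["Adventure|Animation", "Adventure|Children"],
})

# drop returns a new data frame; the original is left untouched until reassigned.
movies_df = movies_df.drop(columns=["genres"])

print(movies_df.columns.tolist())
```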

Ratings Data

Let's move on to the ratings data.

We have no use for the timestamp column in this algorithm, so let's remove it.

Collaborative Filtering Process

Now that we have the processed datasets, we can start the collaborative filtering process.

At a high level, the algorithm follows this flow:

Let's get the input user first.

Add Movie ID

Since the ratings dataset works with movie IDs and not names, let's add the movie IDs to the input dataset.
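A minimal sketch of this lookup is below. The input user's movies and ratings are made up for illustration, and the merge assumes the titles in the input match the cleaned titles in the movies data frame exactly.

```python
import pandas as pd

# Illustrative slice of the cleaned movies data frame.
movies_df = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Toy Story", "Jumanji", "Heat"],
    "year": ["1995", "1995", "1995"],
})

# Hypothetical input: movies the user has watched and the ratings they gave.
input_movies = pd.DataFrame({
    "title": ["Toy Story", "Heat"],
    "rating": [4.5, 3.0],
})

# Look up the IDs by title, then keep only the columns needed later.
input_movies = input_movies.merge(movies_df[["movieId", "title"]], on="title")
input_movies = input_movies[["movieId", "title", "rating"]]

print(input_movies)
```

Two different movies can share a title, for example remakes, so on the real data it may be safer to match on title and year together.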

Get similar users from dataset

Next we can pull the candidate users from the ratings dataset, that is, the users who have rated the same movies as the input user.

Next let's group the rows by user ID, since each user will have rated multiple movies.

For a better recommendation, let's sort the groups so that users who have the most rated movies in common with the input come first.
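The filter, group and sort steps could be sketched as follows. The variable names and sample ratings are illustrative assumptions.

```python
import pandas as pd

# Illustrative ratings; user 3 shares no movies with the input user.
ratings_df = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 2, 3],
    "movieId": [1, 2, 1, 2, 3, 4],
    "rating":  [4.0, 3.0, 5.0, 2.0, 4.0, 1.0],
})
input_movies = pd.DataFrame({"movieId": [1, 2, 3], "rating": [4.5, 3.0, 5.0]})

# Keep only ratings for movies the input user has also rated.
user_subset = ratings_df[ratings_df["movieId"].isin(input_movies["movieId"])]

# One (userId, rows) pair per user; rows holds that user's ratings for shared movies.
user_groups = list(user_subset.groupby("userId"))

# Users with the most movies in common with the input come first.
user_groups.sort(key=lambda pair: len(pair[1]), reverse=True)

print([(user_id, len(rows)) for user_id, rows in user_groups])
```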

Similar users compared to the input user

Next we compare the input user to all the other users and find similar ones based on their ratings. I am using the Pearson correlation coefficient to measure the relationship between users. It ranges from -1 to 1, with values near 1 meaning the two users rate movies in a similar way.

To keep this post fast to run, I am filtering the dataset so that we iterate through fewer users. In the web app I will use the whole dataset for the comparison.

Calculate Pearson Correlation Coefficient
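A minimal sketch of the calculation, assuming each user group is a (userId, data frame) pair and the input ratings live in a data frame with movieId and rating columns. The helper name pearson and the choice to return 0.0 when the correlation is undefined (all ratings identical) are my own assumptions for illustration.

```python
import math
import pandas as pd

def pearson(x, y):
    """Pearson correlation of two equal-length lists; 0.0 when it is undefined."""
    n = len(x)
    sxx = sum(v * v for v in x) - sum(x) ** 2 / n
    syy = sum(v * v for v in y) - sum(y) ** 2 / n
    sxy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
    if sxx == 0 or syy == 0:
        # One of the users gave the same rating everywhere, so there is no variance.
        return 0.0
    return sxy / math.sqrt(sxx * syy)

# Illustrative input user and two candidate users.
input_movies = pd.DataFrame({"movieId": [1, 2, 3], "rating": [5.0, 3.0, 1.0]})
user_groups = [
    (10, pd.DataFrame({"movieId": [1, 2, 3], "rating": [4.0, 3.0, 2.0]})),
    (11, pd.DataFrame({"movieId": [1, 2, 3], "rating": [1.0, 3.0, 5.0]})),
]

pearson_by_user = {}
for user_id, group in user_groups:
    # Align the two users on the movies they both rated.
    common = group.merge(input_movies, on="movieId", suffixes=("_user", "_input"))
    pearson_by_user[user_id] = pearson(
        common["rating_input"].tolist(), common["rating_user"].tolist()
    )

print(pearson_by_user)
```

In this toy data user 10 ranks the movies in the same order as the input user and user 11 in the opposite order, so they land at the two ends of the scale.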

Let's convert the correlation coefficients to a data frame for a better view.

Get Top 20 similar users

Now that we have the similarity scores, let's get the top 20 users who are most similar to the input user, based on those scores.
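Assuming the scores are held in a dictionary keyed by user ID, the conversion and the top 20 selection could be sketched like this. The column name similarityIndex and the sample scores are illustrative.

```python
import pandas as pd

# Illustrative similarity scores keyed by user ID.
pearson_by_user = {10: 0.95, 11: -0.40, 12: 0.70}

# Turn the {userId: score} mapping into a two-column data frame.
pearson_df = pd.DataFrame(
    list(pearson_by_user.items()), columns=["userId", "similarityIndex"]
)

# Highest similarity first; head(20) simply returns fewer rows when fewer exist.
top_users = pearson_df.sort_values("similarityIndex", ascending=False).head(20)

print(top_users)
```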

Recommendation Process

Now that we have identified the users who are most similar to the input user based on ratings, let's use them to recommend some movies for the input user.

Let's take the data frame we created above for the similar users and add the actual rating each of them gave to each movie.

Next we compute a weighted rating for each row from the user's rating and that user's similarity score with the input user. I am simply multiplying the similarity score column by the rating column and adding the result as a new column.
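Both steps, attaching the ratings and computing the weighted rating, could be sketched as below. The data frame and column names are assumptions carried over for illustration.

```python
import pandas as pd

# Illustrative similar users and ratings; user 99 is not among the similar users.
top_users = pd.DataFrame({"userId": [10, 12], "similarityIndex": [1.0, 0.5]})
ratings_df = pd.DataFrame({
    "userId":  [10, 10, 12, 12, 99],
    "movieId": [1, 2, 1, 3, 1],
    "rating":  [4.0, 2.0, 3.0, 5.0, 1.0],
})

# Attach every rating made by a similar user; other users drop out of the inner merge.
top_users_rating = top_users.merge(ratings_df, on="userId", how="inner")

# Weight each rating by how similar its author is to the input user.
top_users_rating["weightedRating"] = (
    top_users_rating["similarityIndex"] * top_users_rating["rating"]
)

print(top_users_rating)
```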

Let's group this data by movie ID to get a more focused dataset. I am grouping by movie ID and summing the similarity score and weighted rating columns, so that each row of the result describes one movie.

Provide Recommendation

Based on the grouped dataset above, we are now ready to recommend movies for the input user. For that I am creating an empty data frame to store the recommendations. For each movie I divide the summed weighted ratings by the summed similarity scores to get a weighted average, then store it in a column of the data frame. Each weighted average corresponds to a specific movie ID.
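A rough sketch of the scoring is below. It sums per movie ID, which is what yields one score per movie, and then divides the two sums. The names recommendation_df and score are illustrative, and the sample assumes positive similarity scores.

```python
import pandas as pd

# Illustrative weighted ratings from two similar users across three movies.
top_users_rating = pd.DataFrame({
    "movieId":         [1, 2, 1, 3],
    "similarityIndex": [1.0, 1.0, 0.5, 0.5],
    "weightedRating":  [4.0, 2.0, 1.5, 2.5],
})

# Sum similarity scores and weighted ratings per movie.
sums = top_users_rating.groupby("movieId")[["similarityIndex", "weightedRating"]].sum()

# Weighted average = sum(similarity * rating) / sum(similarity).
recommendation_df = pd.DataFrame({
    "movieId": sums.index,
    "score": (sums["weightedRating"] / sums["similarityIndex"]).to_numpy(),
})

print(recommendation_df)
```

If negative similarity scores are kept among the similar users, the denominator can shrink toward zero and distort the score, so filtering to positive scores first may be worth considering.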

This is our recommended movie dataset. To get the top movies, say the top 5, let's sort the data frame.

Finally, match the movie IDs against the original movies data frame to get the movie names too.
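The sort and the final lookup could be sketched as follows, assuming a score column on the recommendation data frame. All names and values here are illustrative.

```python
import pandas as pd

# Illustrative recommendation scores and movie titles.
recommendation_df = pd.DataFrame({"movieId": [1, 2, 3], "score": [3.67, 2.0, 5.0]})
movies_df = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Toy Story", "Jumanji", "Heat"],
})

# Best scores first, keep the top 5 (fewer if the data frame is smaller).
top_movies = recommendation_df.sort_values("score", ascending=False).head(5)

# A left merge keeps the ranking order while adding the title column.
top_movies = top_movies.merge(movies_df, on="movieId", how="left")

print(top_movies)
```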