
How Do Recommender Systems Work? (With a Python Example: Movie Recommender)


Nowadays we hear the words “Recommender systems” very often, mainly because companies use them for many different purposes: to increase sales (suggesting items during a purchase → Amazon: users who bought this also bought that), to give customers a better experience (film suggestions → Netflix), or in advertising, to target the right people based on similar preferences.
Recommender systems are basically systems that recommend things to people based on what everybody else did.
Here is an example of film suggestion taken from an online course. I want to thank Frank Kane for this very useful course on Data Science and Machine Learning with Python. Here is the course’s link in case you would like to go deeper into Data Science.
We’ll work through an example using the dataset provided in the course, because it’s not too big and this keeps the calculations fast. In any case there are a lot of resources online, such as the MovieLens dataset with 20M ratings, 465k tags, 27k movies and 138k users.

How do Recommender Systems work?

Recommender Systems, as we said earlier, are systems that recommend items to users. There are 2 kinds of Recommender systems:
  • User-based: the model finds similarities between users
  • Item-based: the model finds similarities between items
There are pros and cons to both of them; here is another article if you want to read further about this topic.
These systems are based on similarity, that is, on calculating the correlation between data: between users in the first case and between items in the second.
The correlation is a numerical value between -1 and 1 that indicates how strongly two variables are linearly related to each other. Correlation = 0 means no linear correlation, while >0 is a positive correlation and <0 is a negative correlation.
Here is a graphical visualisation (source: Wikipedia) of the correlation coefficient of 2 variables (x, y) for several sets of data:

Several sets of (x, y) points, with the Pearson correlation coefficient of x and y for each set. (source: Wikipedia)

We can see that a correlation of 1 or -1 does not refer to the slope of the data but only to how closely the data are related to each other. There are different methods to calculate the correlation coefficient; one of them is the Pearson method:

Formula of Pearson Correlation (source: Wikipedia)

So the correlation is the covariance of the two variables, X and Y, divided by the product of their standard deviations.
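Written out, the Pearson coefficient is the covariance divided by the product of the two standard deviations. The second line is the usual sample version, where the symbols x_i, y_i (the paired observations), their means and n (the number of pairs) are notation introduced here for illustration:

```latex
\rho_{X,Y} = \frac{\operatorname{cov}(X, Y)}{\sigma_X \, \sigma_Y}

r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
              {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \; \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```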
There are also other ways to study correlation, such as a scatter diagram, Spearman’s rank correlation coefficient, or the method of least squares.
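To make the Pearson formula and the point about slope concrete, here is a small sketch with NumPy. The helper name pearson and the sample values are made up for illustration; the three lines have very different slopes but a correlation of exactly 1 or -1.

```python
import numpy as np


def pearson(x, y):
    """Pearson correlation: covariance over the product of standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations from the mean of x
    dy = y - y.mean()  # deviations from the mean of y
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())


x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

steep = pearson(x, 10 * x + 3)     # steep rising line
shallow = pearson(x, 0.1 * x - 2)  # shallow rising line
falling = pearson(x, -2 * x)       # falling line

print(steep, shallow, falling)
```

The slope changes by a factor of 100 between the first two lines, yet the correlation stays the same, because only the strength of the linear relationship matters.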
In our model we’ll use the item-based approach, because a user-based system can be affected by people’s film tastes changing over time, and also because there are usually fewer films than users, which speeds up our calculations.
Let’s start by importing our dataset. Our starting point will be a merged dataset (let’s look at just the first 2 rows with “.head()”):

Dataset import
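As a minimal sketch, a merged table of this shape could be built as follows. The two small tables are made-up stand-ins (the course files and their names are not shown in this post); only the column names movie_id, title, user_id and rating come from the text.

```python
import pandas as pd

# Toy stand-in for the ratings table (assumed layout, made-up rows).
ratings = pd.DataFrame({
    "user_id":  [0, 0, 1, 1, 2],
    "movie_id": [50, 172, 50, 222, 172],
    "rating":   [5, 4, 4, 3, 5],
})

# Toy stand-in for the table that maps movie_id to title.
movies = pd.DataFrame({
    "movie_id": [50, 172, 222],
    "title": [
        "Star Wars (1977)",
        "Empire Strikes Back, The (1980)",
        "Star Trek: First Contact (1996)",
    ],
})

# One row per rating, with the film title attached through movie_id.
data = pd.merge(movies, ratings, on="movie_id")

print(data.head(2))
```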

As we can see, the dataset has 100k rows, one for each rating we have. The table contains the following information:
  • movie_id
  • title
  • user_id
  • rating
Before preparing our data and calculating all the correlations, a quick consideration: with this model we do not have a mathematical way to measure the accuracy, but we can use common sense and intuition. For example, we can check whether our list contains films that we already know should be correlated, such as the Star Wars or Star Trek series. The idea is that if someone has watched one film of a series and rated it highly, I would expect them to like the other films of the series as well.
So let’s check how many Star Trek films we have in our dataset. For this we can use a pandas function that lets us find a string of text in a column:

List of films containing “Star Trek”
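One way to do this search, as a sketch: Series.str.contains is the pandas string method that matches a substring in a text column. The toy table below is an assumption standing in for the merged dataset.

```python
import pandas as pd

# Toy merged table (made-up rows) with the same columns as in the text.
data = pd.DataFrame({
    "movie_id": [50, 222, 227, 222],
    "title": [
        "Star Wars (1977)",
        "Star Trek: First Contact (1996)",
        "Star Trek VI: The Undiscovered Country (1991)",
        "Star Trek: First Contact (1996)",
    ],
    "user_id": [0, 0, 1, 2],
    "rating": [5, 4, 3, 5],
})

# Boolean mask: True where the title contains the literal text "Star Trek".
mask = data["title"].str.contains("Star Trek", regex=False)

# Each film appears once per rating, so keep the distinct titles only.
star_trek_titles = data.loc[mask, "title"].unique()

print(star_trek_titles)
```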

Before starting with the correlation calculation, we need a table with one column per film and one row per user, where the values are the ratings. For this we can use the pandas function pivot_table as below:

Pivot with title as columns and user_id as rows
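A sketch of that pivot on a toy table (the rows are made up; the column names are the ones listed earlier). Films a user has not rated show up as NaN.

```python
import pandas as pd

# Toy ratings in long format: one row per (user, film) rating.
data = pd.DataFrame({
    "user_id": [0, 0, 1, 1, 2],
    "title":   ["Film A", "Film B", "Film A", "Film C", "Film B"],
    "rating":  [5, 4, 4, 3, 5],
})

# Rows are users, columns are film titles, values are ratings.
# pivot_table averages duplicates by default; here each pair appears once.
df = data.pivot_table(index="user_id", columns="title", values="rating")

print(df)
```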

Once we have this new table, we can calculate the correlation of the Star Trek column with all the others, and for this we can use the corrwith function:

X is the column of df (pivot table calculated before)
df is the pivot we calculated before
corr : result of the “corrwith” calculation
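A sketch of the corrwith call on a toy pivot table. The film names and ratings are made up, and “Film A” plays the role of the Star Trek column.

```python
import numpy as np
import pandas as pd

# Toy pivot table: rows are users, columns are films, NaN means "not rated".
df = pd.DataFrame(
    {
        "Film A": [5, 4, 1, 2, np.nan],
        "Film B": [5, 5, 2, 1, 4],
        "Film C": [1, 2, 5, 4, np.nan],
        "Film D": [np.nan, np.nan, 1, 2, np.nan],  # rated by only two users
    },
    index=pd.Index([0, 1, 2, 3, 4], name="user_id"),
)

# X is the column of the film we start from.
X = df["Film A"]

# Correlation of every column with X, using only users who rated both films.
corr = df.corrwith(X).sort_values(ascending=False)

print(corr)
```

In this toy data Film D reaches a correlation of 1.0 with Film A from just two shared ratings, which shows how films with very few ratings can produce misleadingly high values.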

We can clearly see that something went wrong with this result, since we expected to find other Star Trek films. Probably the problem is that we are considering all the films, even those with only a few ratings, and correlations computed from so few points are not reliable.
Let’s try keeping only the films with more than 100 ratings and see what happens:

corr result filtered with rating_count>100
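A sketch of this filter on toy data. The threshold is set to 3 here only because the toy table is tiny (the article uses 100), and the variable names min_ratings and rating_count are assumptions.

```python
import numpy as np
import pandas as pd

# Toy pivot table: rows are users, columns are films, NaN means "not rated".
df = pd.DataFrame(
    {
        "Film A": [5, 4, 1, 2, np.nan],
        "Film B": [5, 5, 2, 1, 4],
        "Film C": [1, 2, 5, 4, np.nan],
        "Film D": [np.nan, np.nan, 1, 2, np.nan],
    },
    index=pd.Index([0, 1, 2, 3, 4], name="user_id"),
)
corr = df.corrwith(df["Film A"])

min_ratings = 3  # the article uses 100 on the real dataset

# Number of non-NaN ratings per film.
rating_count = df.count()

# Keep only the films rated more than min_ratings times.
popular = rating_count[rating_count > min_ratings].index
corr_filtered = corr[popular].sort_values(ascending=False)

print(corr_filtered)
```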

Now the result looks more realistic, as we see other films of the Star Trek series in the result. We could run more tests with other films, but let’s assume that the result is good and apply it now to the whole dataset.
Pandas makes this very easy for us, and the function name is even shorter than before :) corr instead of corrwith.
We’ll pass min_periods=100, which makes pandas compute a correlation only for pairs of films that have at least 100 ratings in common, so we don’t need a separate filter anymore; we can also specify which correlation method to use, and in this case we’ll use Pearson.
This is the result:

Correlation Matrix
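A sketch of the corr call on a toy pivot table. min_periods is set to 3 instead of 100 only because the toy data has five users; the film names and ratings are made up.

```python
import numpy as np
import pandas as pd

# Toy pivot table: rows are users, columns are films, NaN means "not rated".
df = pd.DataFrame(
    {
        "Film A": [5, 4, 1, 2, np.nan],
        "Film B": [5, 5, 2, 1, 4],
        "Film C": [1, 2, 5, 4, np.nan],
        "Film D": [np.nan, np.nan, 1, 2, np.nan],  # only two ratings
    },
    index=pd.Index([0, 1, 2, 3, 4], name="user_id"),
)

# Film-by-film Pearson correlations; pairs with fewer than min_periods
# users in common are left as NaN.
corr_matrix = df.corr(method="pearson", min_periods=3)

print(corr_matrix)
```

Film D has only two ratings, so its whole row and column are NaN, including the diagonal entry.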

So we have calculated the Correlation Matrix for all films, resulting in a 1664x1664 matrix where both columns and rows are films. The diagonal of the matrix is all 1, because every film is perfectly correlated with itself, or NaN when the film has been filtered out for having fewer than 100 ratings.
Now that we have the Correlation Matrix comes the fun part: suggesting to the user the films (the output of our system) that best match their previous preferences (the input of our system).
Starting from the Correlation Matrix, we take the columns corresponding to the films the user has already watched and, for each column, drop the NaN values. We then multiply each value by the user’s rating for that film, using it as a weight (this boosts the correlations for the films the user rated highly, so the values are no longer between -1 and 1), and append the values from all the columns to a Series “user_corr”.
On the Series we need to do a few more operations:
  • group by title, summing the correlation values (this is needed because the same film can appear more than once)
  • drop all the films that the user has already watched
Once we have the final Series, we can sort the values in descending order (ascending=False) and suggest the first 5 films, or as many films as we want.
Let’s see the steps in code, applying what we said to user 0. This is the list of films the user watched:

Select the titles the user 0 watched

We now create the list of all films with all correlations multiplied by the ratings (integers from 1 to 5).

Let’s create the Series with all correlation weighted

We group by title so that there are no duplicate films, summing their weighted correlations:

We are making now groupby title

We create a list of the films we have seen (checking first that they are in the Series of all correlations) and then we drop them:

We now create a list of titles of the films watched to drop (if contained in our Series) and then we drop them with the last line of code

Once we have the final ordered list, we can print the result for our user, hoping they will like the suggestions :)

We can now print the output for the user where we show the results
Here is the output for the films watched
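The steps above can be sketched end to end as a single function. This is an illustration under assumptions, not the exact code from the course: the function name recommend, the toy correlation matrix and the toy ratings are made up, and pd.concat is used to collect the weighted columns into one Series.

```python
import numpy as np
import pandas as pd


def recommend(user_ratings, corr_matrix, top_n=5):
    """Suggest films from a Series of title -> rating and a film-by-film correlation matrix."""
    weighted = []
    for title, rating in user_ratings.items():
        if title not in corr_matrix.columns:
            continue  # film absent from the matrix, nothing to use
        sims = corr_matrix[title].dropna()  # correlations with this film
        weighted.append(sims * rating)      # weight by the user's rating
    if not weighted:
        return pd.Series(dtype=float)

    # All weighted correlations in one Series (titles can repeat).
    user_corr = pd.concat(weighted)
    # Sum the values of films that appear more than once.
    user_corr = user_corr.groupby(level=0).sum()
    # Remove the films the user has already watched.
    user_corr = user_corr.drop(user_ratings.index, errors="ignore")
    # Best matches first.
    return user_corr.sort_values(ascending=False).head(top_n)


titles = ["Film A", "Film B", "Film C", "Film D"]
corr_matrix = pd.DataFrame(
    [
        [1.0, 0.8, -0.5, np.nan],
        [0.8, 1.0, 0.2, 0.6],
        [-0.5, 0.2, 1.0, 0.1],
        [np.nan, 0.6, 0.1, 1.0],
    ],
    index=titles,
    columns=titles,
)

# The user loved Film A and did not like Film C much.
user_ratings = pd.Series({"Film A": 5, "Film C": 2})

suggestions = recommend(user_ratings, corr_matrix, top_n=2)
print(suggestions)
```

Film B comes out on top because it is strongly correlated with the film the user rated 5, and that correlation is multiplied by the higher rating.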

We have seen an example of how to suggest a list of films to an existing user, or to any other user, just by giving the system some input: the titles of the films watched and the ratings given. The more data we have, the more consistent the system will be.
We can also play with the system by changing its parameters, such as the 100-rating filter or the method used to calculate the correlation, or by weighting the ratings in a different way.
As you can see, there isn’t just one system: it’s possible to try different options that give different solutions, and to look for improvements along the way.
Consider also that this method can be applied to many other kinds of data, for example to suggest to customers which items they might like to buy.
At this point I’m happy if you have made it to the end, and follow me if you found this interesting or useful in some way!
Enjoy data science!
