
Recommender Systems — User-Based and Item-Based Collaborative Filtering


This is part 2 of my series on Recommender Systems. The last post was an introduction to RecSys. Today I’ll explain in more detail two types of Collaborative Filtering: User-Based Collaborative Filtering (UB-CF) and Item-Based Collaborative Filtering (IB-CF).
Let’s begin.

User-Based Collaborative Filtering (UB-CF)

Imagine that we want to recommend a movie to our friend Stanley. We could assume that similar people have similar taste. Suppose that Stanley and I have seen the same movies and rated them all almost identically. But Stanley hasn’t seen ‘The Godfather: Part II’ and I have. If I love that movie, it is logical to think that he will too. With that, we have created an artificial rating based on our similarity.
Well, UB-CF uses that logic and recommends items by finding users similar to the active user (the one to whom we are trying to recommend a movie). A specific application of this is the user-based Nearest Neighbor algorithm, which involves two tasks:
1. Find the K-nearest neighbors (KNN) of the user a, using a similarity function w to measure the distance between each pair of users.
2. Predict the rating that user a will give to all items the k neighbors have consumed but a has not, and look for the item j with the best predicted rating.
In other words, we are building a user-item matrix and predicting the ratings on items the active user has not seen, based on the other similar users. This technique is memory-based.
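The two steps above can be sketched in a few lines of Python. This is a toy example under stated assumptions: the rating matrix, the choice of cosine similarity, and k=2 are illustrative, not a production implementation.

```python
import numpy as np

# Toy user-item rating matrix (rows = users, cols = items, 0 = not rated).
ratings = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 1.0, 0.0],
    [1.0, 0.0, 5.0, 4.0],
])

def cosine_sim(u, v):
    """Cosine similarity computed over co-rated items only."""
    mask = (u > 0) & (v > 0)
    if not mask.any():
        return 0.0
    u, v = u[mask], v[mask]
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def predict(active, item, k=2):
    """Step 1: find the k most similar users who rated the item.
    Step 2: predict with a similarity-weighted average of their ratings."""
    sims = [(cosine_sim(ratings[active], ratings[other]), other)
            for other in range(len(ratings))
            if other != active and ratings[other, item] > 0]
    neighbors = sorted(sims, reverse=True)[:k]
    num = sum(s * ratings[o, item] for s, o in neighbors)
    den = sum(abs(s) for s, _ in neighbors)
    return num / den if den else 0.0
```

For the matrix above, `predict(0, 2)` blends the ratings that users 1 and 2 gave to item 2, weighted by how similar each of them is to user 0.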
Filling the blanks

PROS:

  • Easy to implement.
  • Context independent.
  • Often more accurate than other techniques, such as content-based filtering.

CONS:

  • Sparsity: the percentage of users who rate items is typically very low.
  • Scalability: the more K neighbors we consider (up to a certain threshold), the better the classification should be. Nevertheless, the more users there are in the system, the greater the cost of finding the nearest K neighbors.
  • Cold-start: new users have little to no information about them to be compared with other users.
  • New item: just like the last point, new items lack the ratings needed to create a solid ranking (more on this in ‘How to sort and rank items’).

Item-Based Collaborative Filtering (IB-CF)

Back to Stanley. Instead of focusing on his friends, we could focus on which items, out of all the options, are most similar to the ones we know he enjoys. This approach is known as Item-Based Collaborative Filtering (IB-CF).
We could divide IB-CF into two subtasks:
1. Calculate the similarity among the items:
  • Cosine-Based Similarity
  • Correlation-Based Similarity
  • Adjusted Cosine Similarity
  • 1-Jaccard distance
2. Calculate the prediction:
  • Weighted Sum
  • Regression
The difference between UB-CF and this method is that here we directly pre-calculate the similarity between the co-rated items, skipping the K-neighborhood search.
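The two subtasks can be sketched like this, pairing adjusted cosine similarity with a weighted-sum prediction. The matrix and function names are illustrative assumptions, not a reference implementation.

```python
import numpy as np

# Toy user-item rating matrix (rows = users, cols = items, 0 = not rated).
ratings = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 1.0, 0.0],
    [1.0, 0.0, 5.0, 4.0],
])

def adjusted_cosine(i, j):
    """Subtask 1: similarity of items i and j, subtracting each user's
    mean rating to compensate for different rating scales."""
    mask = (ratings[:, i] > 0) & (ratings[:, j] > 0)
    if not mask.any():
        return 0.0
    means = np.array([r[r > 0].mean() for r in ratings])
    u = ratings[mask, i] - means[mask]
    v = ratings[mask, j] - means[mask]
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0

def predict(user, item):
    """Subtask 2: weighted sum over the items the user has already rated."""
    rated = [j for j in range(ratings.shape[1]) if ratings[user, j] > 0]
    sims = np.array([adjusted_cosine(item, j) for j in rated])
    den = np.abs(sims).sum()
    return float(sims @ ratings[user, rated] / den) if den else 0.0
```

Note that the item-item similarities depend only on the rating matrix, so they can be pre-computed offline, which is exactly the advantage over the per-request neighborhood search of UB-CF.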

Slope One

Slope One is part of the Item-Based Collaborative Filtering family, introduced in the 2005 paper ‘Slope One Predictors for Online Rating-Based Collaborative Filtering’ by Daniel Lemire and Anna Maclachlan.
The main idea behind this model is the following:
Suppose we have two users, A and B, and two items, I and J. User A rated item I with 1 star and item J with 1.5 stars. User B rated item I with 2 stars. We can assume that the difference between the two items will be the same for user B as it was for user A. With this in mind, user B would rate item J as: 2 + (1.5 − 1) = 2.5.
Main idea behind Slope One
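That worked example can be reproduced with a small sketch of (weighted) Slope One. The dictionary layout and function name are my own illustrative choices, not the paper’s notation.

```python
# Ratings as {user: {item: rating}}; data matches the worked example above.
ratings = {
    "A": {"I": 1.0, "J": 1.5},
    "B": {"I": 2.0},
}

def slope_one(user, target):
    """Predict user's rating of target from the average rating deviation
    between target and each item the user has rated, weighted by how many
    users co-rated each pair."""
    num = den = 0.0
    for j, r_j in ratings[user].items():
        # Average deviation (target - j) over users who rated both items.
        devs = [r[target] - r[j] for r in ratings.values()
                if target in r and j in r]
        if devs:
            num += (r_j + sum(devs) / len(devs)) * len(devs)
            den += len(devs)
    return num / den if den else 0.0

print(slope_one("B", "J"))  # the worked example: 2 + (1.5 - 1) = 2.5
```

The pre-computable part is the table of pairwise average deviations, which is what makes the scheme cheap to update online when a new rating arrives.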
The authors focus on 5 objectives:
1. Easy to implement and maintain.
2. Updatable online: new ratings should change predictions quickly.
3. Efficient at query time: storage is the main cost.
4. It works with little user feedback.
5. Reasonably accurate, within ranges where a small gain in accuracy does not mean a great sacrifice of simplicity and scalability.

Recap

We saw User-Based and Item-Based Collaborative Filtering. The first focuses on filling a user-item matrix and recommending based on the users most similar to the active user. IB-CF, on the other hand, fills an item-item matrix and recommends based on similar items.
It is hard to explain all these subjects briefly, but understanding them is the first step to getting deeper into RecSys.
