Monday 5 October 2015

Predicting Likes: Inside A Simple Recommendation Engine's Algorithms

www.toptal.com
BY MAHMUD RIDWAN - TECHNICAL EDITOR @ TOPTAL
A recommendation engine (sometimes referred to as a recommender system) is a tool that lets algorithm developers predict what a user may or may not like among a list of given items. Recommendation engines are a pretty interesting alternative to search fields, as recommendation engines help users discover products or content that they may not come across otherwise. This makes recommendation engines a great part of web sites and services such as Facebook, YouTube, Amazon, and more.
Recommendation engines work ideally in one of two ways. It can rely on the properties of the items that a user likes, which are analyzed to determine what else the user may like; or, it can rely on the likes and dislikes of other users, which the recommendation engine then uses to compute a similarity index between users and recommend items to them accordingly. It is also possible to combine both these methods to build a much more robust recommendation engine. However, like all other information related problems, it is essential to pick an algorithm that is suitable for the problem being addressed.
Building a Recommendation Engine
In this tutorial, we will walk you through the process of building a recommendation engine that is collaborative and memory-based. This recommendation engine will recommend movies to users based on what they like and dislike, and will function like the second example that was mentioned before. For this project, we will be using basic set operations, a little mathematics, and Node.js/CoffeeScript. All source code relevant to this tutorial can be found here.

Sets and Equations

Before implementing a collaborative memory-based recommendation engine, we must first understand the core idea behind such a system. To this engine, each item and each user is nothing but identifiers. Therefore, we will not take any other attribute of a movie (for example, the cast, director, genre, etc.) into consideration while generating recommendations. The similarity between two users is represented using a decimal number between -1.0 and 1.0. We will call this number the similarity index. Finally, the possibility of a user liking a movie will be represented using another decimal number between -1.0 and 1.0. Now that we have modelled the world around this system using simple terms, we can unleash a handful of elegant mathematical equations to define the relationship between these identifiers and numbers.
In our recommendation algorithm, we will maintain a number of sets. Each user will have two sets: a set of movies the user likes, and a set of movies the user dislikes. Each movie will also have two sets associated with it: a set of users who liked the movie, and a set of users who disliked the movie. During the stages where recommendations are generated, a number of sets will be produced - mostly unions or intersections of the other sets. We will also have ordered lists of suggestions and similar users for each user.
To calculate the similarity index, we will use a variation of the Jaccard index formula. Originally known as “coefficient de communauté” (coined by Paul Jaccard), the formula compares two sets and produces a simple decimal statistic between 0 and 1.0:
similarity index
The formula involves the division of the number of common elements in either set by the number of all the elements (counted only once) in both sets. The Jaccard index of two identical sets will always be 1, while the Jaccard index of two sets with no common elements will always yield 0. Now that we know how to compare two sets, let us think of a strategy we can use to compare two users. As discussed earlier, the users, from the system’s point of view, are three things: an identifier, a set of liked movies, and a set of disliked movies. If we were to define our users’ similarity index based only on the set of their liked movies, we could directly use the Jaccard index formula:
jaccard index formula
Here, U1 and U2 are the two users we are comparing, and L1 and L2 are the sets of movies that U1 and U2 have liked, respectively. Now, if you think about it, two users liking the same movies are similar, then two users disliking the same movies should also be similar. This is where we modify the equation a little:
modified equasion
Instead of just considering the common likes in the formula’s numerator, we now add the number of common dislikes as well. In the denominator, we take the number of all the items that either user has liked or disliked. Now that we have considered both likes and dislikes in an independent sort of way, we should also think about the case where two users are polar opposites in their preferences. The similarity index of two users where one likes a movie and the other dislikes it shouldn’t be 0:
similarity index of two users
That’s one long formula! But it’s simple, I promise. It’s similar to our previous formula with a small difference in the numerator. We are now subtracting the number of conflicting likes and dislikes of the two users from the number of their common likes and dislikes. This causes the similarity index formula to have a range of values between -1.0 and 1.0. Two users having identical tastes will have a similarity index of 1.0 while two users having entirely conflicting tastes in movies will have a similarity index of -1.0.
Now that we know how to compare two users based on their taste in movies, we have to explore one more formula before we can start implementing our homebrewed recommendation engine algorithm:
recommendation engine algorithm
Let’s break this equation down a little. What we mean by P(U,M) is the possibility of a user U liking the movieMZL and ZD are the sum of the similarity indices of user U with all the users who have liked or disliked the movie M, respectively. |ML|+|MD| represents the total number of users who have liked or disliked the movie M. The result P(U,M) produces a number between -1.0 and 1.0.
That’s about it. In the next section, we can use these formulae to start implementing our collaborative memory-based recommendation engine.

Building the Recommendation Engine

We will build this recommendation engine as a very simple Node.js application. There will also be very little work on the front-end, mostly some HTML pages and forms (we will use Bootstrap to make the pages look neat). On the server side, we will use CoffeeScript. The application will have a few GET and POST routes. Even though we will have the notion of users in the application, we will not have any elaborate registration/login mechanism. For persistency, we will use the Bourne package available via NPM which enables an application to store data in plain JSON files, and perform basic database queries on them. We will use Express.js to ease the process of managing the routes and handlers.
At this point, if you are new to Node.js development, you might want to clone the GitHub repository so that it’s easier to follow this tutorial. As with any other Node.js project, we will begin by creating a package.json fileand installing a set of dependency packages required for this project. If you are using the cloned repository, the package.json file should already be there, from where installing the dependencies will require you to execute “$ npm install”. This will install all the packages listed inside the package.json file.
The Node.js packages we need for this project are:
We will build the recommendation engine by splitting all relevant methods into four separate CoffeeScript classes, each of which will stored under “lib/engine”: Engine, Rater, Similars, and Suggestions. The class Engine will be responsible for providing a simple API for the recommendation engine, and will bind the other three classes together. Rater will be responsible for tracking likes and dislikes (as two separate instances of the Rater class). Similars and Suggestions will be responsible for determining and tracking similar users and recommended items for the users, respectively.

Tracking Likes and Dislikes

Let us first begin with our Raters class. This is a simple one:
class Rater
 constructor: (@engine, @kind) ->
 add: (user, item, done) ->
 remove: (user, item, done) ->
 itemsByUser: (user, done) ->
 usersByItem: (item, done) ->
As indicated earlier in this tutorial, we will have one instance of Rater for likes, and another one for dislikes. To record that a user likes an item, we will pass them to “Rater#add()”. Similarly, to remove the rating, we will pass them to “Rater#remove()”.
Since we are using Bourne as a server-less database solution, we will store these ratings in a file named “./db-#{@kind}.json”, where kind is either “likes” or “dislikes”. We will open the database inside the constructor of the Rater instance:
constructor: (@engine, @kind) ->
 @db = new Bourne "./db-#{@kind}.json"
This will make adding rating records as simple as calling a Bourne database method inside our “Rater#add()” method:
@db.insert user: user, item: item, (err) =>
And it is similar to remove them (“db.delete” instead of “db.insert”). However, before we either add or remove something, we must ensure it doesn’t already exist in the database. Ideally, with a real database, we could have done it as a single operation. With Bourne, we have to do a manual check first; and, once the insertion or deletion is done, we need to make sure we recalculate the similarity indices for this user, and then generate a set of new suggestions. The “Rater#add()” and “Rater#remove()” methods will look something like this:
add: (user, item, done) ->
 @db.find user: user, item: item, (err, res) =>
  if res.length > 0
   return done()

  @db.insert user: user, item: item, (err) =>
   async.series [
    (done) =>
     @engine.similars.update user, done
    (done) =>
     @engine.suggestions.update user, done
   ], done

remove: (user, item, done) ->
 @db.delete user: user, item: item, (err) =>
  async.series [
   (done) =>
    @engine.similars.update user, done
   (done) =>
    @engine.suggestions.update user, done
  ], done
For brevity, we will skip the parts where we check for errors. This might be a reasonable thing to do in an article, but is not an excuse for ignoring errors in real code.
The other two methods, “Rater#itemsByUser()” and “Rater#usersByItem()” of this class will involve doing what their names imply - looking up items rated by a user and users who have rated an item, respectively. For example, when Rater is instantiated with kind = “likes”, “Rater#itemsByUser()” will find all the items the user has rated.

Finding Similar Users

Moving on to our next class: Similars. This class will help us compute and keep track of the similarity indices between the users. As discussed before, calculating the similarity between two users involves analyzing the sets of items they like and dislike. To do that, we will rely on the Rater instances to fetch the sets of relevant items, and then determine the similarity index for certain pairs of users using the similarity index formula.
Finding Similar Users
Just like our previous class, Rater, we will put everything in a Bourne database named “./db-similars.json”, which we will open in the constructor of Rater. The class will have a method “Similars#byUser()”, which will let us look up users similar to a given user through a simple database lookup:
@db.findOne user: user, (err, {others}) =>
However, the most important method of this class is “Similars#update()” which works by taking a user and computing a list of other users who are similar, and storing the list in the database, along with their similarity indices. It starts by finding the user’s likes and dislikes:
async.auto
 userLikes: (done) =>
  @engine.likes.itemsByUser user, done
 userDislikes: (done) =>
  @engine.dislikes.itemsByUser user, done
, (err, {userLikes, userDislikes}) =>
 items = _.flatten([userLikes, userDislikes])
We also find all the users who have rated these items:
async.map items, (item, done) =>
 async.map [
  @engine.likes
  @engine.dislikes
 ], (rater, done) =>
  rater.usersByItem item, done
 , done
, (err, others) =>
Next, for each of these other users, we compute the similarity index and store it all in the database:
async.map others, (other, done) =>
 async.auto
  otherLikes: (done) =>
   @engine.likes.itemsByUser other, done
  otherDislikes: (done) =>
   @engine.dislikes.itemsByUser other, done
 , (err, {otherLikes, otherDislikes}) =>
  done null,
   user: other
   similarity: (_.intersection(userLikes, otherLikes).length+_.intersection(userDislikes, otherDislikes).length-_.intersection(userLikes, otherDislikes).length-_.intersection(userDislikes, otherLikes).length) / _.union(userLikes, otherLikes, userDislikes, otherDislikes).length

, (err, others) =>
 @db.insert
  user: user
  others: others
 , done
Within the snippet above, you will notice that we have an expression identical in nature to our similarity index formula, a variant of the Jaccard index formula.

Generating Recommendations

Our next class, Suggestions, is where all the predictions take place. Like the class Similars, we rely on another Bourne database named “./db-suggestions.json”, opened inside the constructor.
Generating Recommendations and suggestions
The class will have a method “Suggestions#forUser()” to lookup computed suggestions for the given user:
forUser: (user, done) ->
 @db.findOne user: user, (err, {suggestions}={suggestion: []}) ->
  done null, suggestions
The method that will compute these results is “Suggestions#update()”. This method, like “Similars#update()”, will take a user as an argument. The method begins by listing all the users similar to the given user, and all the items the given user has not rated:
@engine.similars.byUser user, (err, others) =>
 async.auto 
  likes: (done) =>
   @engine.likes.itemsByUser user, done
  dislikes: (done) =>
   @engine.dislikes.itemsByUser user, done
  items: (done) =>
   async.map others, (other, done) =>
    async.map [
     @engine.likes
     @engine.dislikes
    ], (rater, done) =>
     rater.itemsByUser other.user, done
    , done
   , done
 , (err, {likes, dislikes, items}) =>
  items = _.difference _.unique(_.flatten items), likes, dislikes
Once we have all the other users and the unrated items listed, we can begin computing a new set of recommendations by removing any previous set of recommendations, iterating over each item, and computing the possibility of the user liking it based on available information:
@db.delete user: user, (err) =>
async.map items, (item, done) =>
  async.auto
   likers: (done) =>
    @engine.likes.usersByItem item, done
   dislikers: (done) =>
    @engine.dislikes.usersByItem item, done
  , (err, {likers, dislikers}) =>
   numerator = 0
   for other in _.without _.flatten([likers, dislikers]), user
    other = _.findWhere(others, user: other)
    if other?
    numerator += other.similarity

    done null,
     item: item
     weight: numerator / _.union(likers, dislikers).length

 , (err, suggestions) =>
Once that is done, we save it back to the database:
@db.insert
 user: user
 suggestions: suggestions
, done

4 comments: