The Event Recommendation Engine Challenge just finished on Kaggle.com. I managed to build a good model and finished 7th.
Given a dataset of users and events, we had to predict which events users would be interested in. We were given 38,000 users, 3 million events and a bunch of data about them (such as friends, attendance and interest in events).
First things first: preprocess the data and put it in a database. I had to go for a database since I couldn’t fit everything in RAM. I chose MongoDB because it’s just so easy to set up, though it wasn’t the best database choice for this task. I should experiment with Neo4j in the future.
Regarding preprocessing, what I struggled with most was location. Most users had an unformatted location string. Around half of the events had a well-formatted location (city, state and country), as well as GPS coordinates. At first, I used the Yahoo Placemaker API to convert user locations into coordinates, so that I could compute distances between users and events.
I then noticed that external data was not allowed. No problem: with 1.6 million events having both location strings and GPS coordinates, I was able to build my own database of spatial information. I could then match user locations and get coordinates without an external API.
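For illustration, here’s a minimal sketch of that idea. The dict keys (`city`, `state`, `country`, `lat`, `lng`) and the matching heuristic are my own assumptions, not the actual schema or code from the competition:

```python
from collections import defaultdict

def build_location_index(events):
    """Map a normalized location string to the average GPS coordinates
    of the events seen at that location.

    `events` is assumed to be an iterable of dicts with hypothetical
    'city', 'state', 'country', 'lat', 'lng' keys."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for e in events:
        key = ' '.join(
            str(e[k]).strip().lower()
            for k in ('city', 'state', 'country') if e.get(k)
        )
        if key and e.get('lat') is not None and e.get('lng') is not None:
            s = sums[key]
            s[0] += e['lat']
            s[1] += e['lng']
            s[2] += 1
    return {k: (lat / n, lng / n) for k, (lat, lng, n) in sums.items()}

def locate_user(location_string, index):
    """Try to match a free-form user location against the index."""
    norm = location_string.strip().lower()
    # naive matching: accept any indexed key that contains (or is
    # contained in) the user's location string
    for key, coords in index.items():
        if key in norm or norm in key:
            return coords
    return None
```

A real version would want fuzzier matching (abbreviations, misspellings), but the principle is the same: the well-formatted events act as a free geocoding database.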
Given a (user, event) pair, these are the features I used in my model:
- number of users attending, not attending, maybe attending and invited to the event;
- number of friends attending, not attending, maybe attending and invited to the event;
- location similarity between user and event;
- number of users attending the event who also attended events the user did;
- similarity between the event and events the user attended, based on clusters – I used KMeans (loving scikit-learn) to cluster similar events based on their words; I chose several values for the number of clusters, in order to capture different levels of granularity;
- same thing for events attended by friends;
- same thing for events the user (or his friends) didn’t attend;
- time to event, apparently the most important feature;
- similarity between the event’s word distribution and the average distribution of words for events the user attended;
- whether the user was invited or not.
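As an illustration of the cluster-based similarity features, here’s a rough sketch using scikit-learn. The function names, the example cluster counts and the overlap measure are my own assumptions, not the competition code:

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer

def cluster_events(event_texts, ks=(10, 50, 200)):
    """Cluster events by their word counts at several granularities.

    Returns a dict mapping each value of k to an array of cluster
    labels, one label per event."""
    X = CountVectorizer().fit_transform(event_texts)
    return {
        k: KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        for k in ks
    }

def cluster_overlap(event_idx, attended_idxs, labels):
    """Fraction of the user's attended events that share the candidate
    event's cluster -- yielding one feature per value of k."""
    if not attended_idxs:
        return 0.0
    target = labels[event_idx]
    return sum(labels[i] == target for i in attended_idxs) / len(attended_idxs)
```

Running `cluster_events` with several `ks` and computing `cluster_overlap` for each granularity gives a small family of similarity features per (user, event) pair.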
I didn’t manage to get anything out of user age and gender. I’m still wondering if (and how) that info can be used in some useful way.
In order to generate results, I went for the classifier approach (two classes: interested and not interested). I also tried ranking and regression, but classification worked best. I chose a Random Forest (again, scikit-learn) because it was able to work with missing values. I also added Logistic Regression (on the features that didn’t have missing values) and averaged the results.
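A minimal sketch of that blend, as I understand it (my own simplification, not the repository code — the blending weight here is a placeholder that should be tuned on held-out data):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

def blended_scores(X_rf, y, X_lr, X_rf_new, X_lr_new, w=0.5):
    """Blend Random Forest and Logistic Regression probabilities.

    X_rf may contain imputed placeholders for missing values (scikit-learn
    estimators historically required a full numeric matrix), while X_lr
    holds only the columns without missing values. `w` is a placeholder
    blending weight."""
    rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_rf, y)
    lr = LogisticRegression(max_iter=1000).fit(X_lr, y)
    p_rf = rf.predict_proba(X_rf_new)[:, 1]   # probability of "interested"
    p_lr = lr.predict_proba(X_lr_new)[:, 1]
    return w * p_rf + (1 - w) * p_lr
```

Events for a user can then be ranked by the blended score, highest first.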
The full code is on GitHub. Mind you, except for the scripts main.py and model.py, all other files contain code snippets I built on the fly while exploring the data in IPython.
Can you explain how you are using the “time to event” feature? We are judging events having the same timestamp.
Also, please explain how you use “number of users attending, not attending, maybe attending and invited to the event”. I am using the number of attending users only, but that is not helping much.
The “time to event” feature is the time between the event notification and the event start. So, for example, if the user got the notification on April 5th and the event is on April 8th, the time to event is 3 days.
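In code, that’s just a date difference — a trivial sketch:

```python
from datetime import datetime

def time_to_event(notification_time, event_start):
    """Days between the event notification and the event start."""
    return (event_start - notification_time).total_seconds() / 86400
```

For the example above, a notification on April 5th for an event on April 8th gives a value of 3 days.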
For the other features, I get the attendance list for that event (and the dataset from Kaggle also has users that are not attending, maybe attending or invited). I use all those numbers as features. You can see a description of all features at the end of this file: https://github.com/andreiolariu/kaggle-event-recommendation/blob/master/main.py
I checked your code on GitHub. I am also building a recommender system for a very similar problem. Please advise on the question below.
Given the class imbalance and the fact that I need to maximize MAP@K, what is the best loss function and evaluation criterion for evaluating the results of a binary classifier (random forest, GBM, etc.)? Did you use recall or precision as a scoring criterion to tune the hyperparameters of the random forest?
It’s hard to say. This problem was not really recommendation, it was classification. But in typical recommendation scenarios, using binary classifiers would be infeasible. If your problem is very similar to this one, then using what I’ve used here (MAP@K) should work.
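For reference, here’s a minimal sketch of mean average precision at k, the MAP@K metric mentioned above (the default cutoff and the names are my own choices):

```python
def apk(actual, predicted, k=200):
    """Average precision at k for one user.

    `actual` is the set of relevant events; `predicted` is the
    ranked list of recommendations. The k=200 default is arbitrary
    here -- use whatever cutoff your metric specifies."""
    predicted = predicted[:k]
    hits, score = 0, 0.0
    for i, p in enumerate(predicted):
        if p in actual and p not in predicted[:i]:
            hits += 1
            score += hits / (i + 1)   # precision at this position
    return score / min(len(actual), k) if actual else 0.0

def mapk(actuals, predicteds, k=200):
    """Mean of apk over all users."""
    return sum(apk(a, p, k) for a, p in zip(actuals, predicteds)) / len(actuals)
```

Scoring a ranking metric like this on a validation split is a reasonable way to tune hyperparameters when the leaderboard metric is MAP@K rather than plain accuracy.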
Hi Andrei, I’m starting to build an event-based recommendation system just like this (actually simpler). The thing is, I’m just a beginner (I know how a user-based or item-based recommendation system can make recommendations using users’ preferences) and can’t understand how you can use all those features.
Can you give some advice (books, articles or other background) so that I can understand your approach?
You can start with the book “Mining Massive Datasets”. The PDF is available online (www.mmds.org) and there’s also an online course based on it on Coursera.
Thank you for your kind reference. I will read the book.
I checked your code on GitHub and saw that you used two constants, 0.69 and 0.57, corresponding to the Random Forest and the Logistic Regression, for combining the results of the two models. Could you explain how you obtained those constants?
Hi, Tom. The procedure is something like this. You split the training data into A, B and C. You train the two models (Random Forest and Logistic Regression) on A and get predictions for B and C. Using a simple for loop going from 0.00 to 1.00 in 0.01 increments, you find the best performing blend using the predictions for B and their corresponding labels. Finally, you test the whole thing on C to get an accurate estimate of the ensemble’s performance.
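That loop can be sketched like this (the `metric` callback and variable names are placeholders, not the actual repository code):

```python
import numpy as np

def find_blend_weight(pred_rf, pred_lr, labels, metric):
    """Grid-search the blend weight on held-out predictions (split B).

    `metric(labels, blended)` should return a score where higher
    is better."""
    best_w, best_score = 0.0, -np.inf
    for w in np.linspace(0.0, 1.0, 101):   # 0.00, 0.01, ..., 1.00
        blended = w * pred_rf + (1 - w) * pred_lr
        score = metric(labels, blended)
        if score > best_score:
            best_w, best_score = w, score
    return best_w
```

The returned weight (and its complement) would play the role of the two constants, with split C reserved for the final, unbiased performance estimate.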
Thank you for your reply, Andrei. Happy New Year :D.