The Event Recommendation Engine Challenge just finished on Kaggle.com. I managed to build a good model and finished 7th.
Given a dataset of users and events, we had to predict which events users would be interested in. We were given 38,000 users, 3 million events and a bunch of data about them (like friends, attendance or interest in events).
The first step was to preprocess the data and put it in a database. I needed a database because I couldn’t fit everything in RAM. I chose MongoDB because it’s so easy to set up, but it wasn’t the best choice for this task. I should experiment with Neo4j in the future.
Regarding preprocessing, what I struggled with most was location. Most users had an unformatted location string. Around half of the events had a well-formatted location (city, state and country), as well as GPS coordinates. At first, I used the Yahoo Placemaker API to convert user locations into coordinates. This way, I could compute distances between users and events.
I then noticed that external data was not allowed. No problem. With 1.6 million events having both location strings and GPS coordinates, I was able to build a database of spatial information. I could then match user locations against it and get coordinates without an external API.
Given a (user, event) pair, these were the features I used in my model:
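A minimal sketch of how such a lookup could work, not the code from the repository. It assumes events arrive as (location string, latitude, longitude) tuples and that matching is done on a normalized string. The function names `normalize`, `build_gazetteer` and `haversine_km` are made up for illustration, and the sample coordinates are rough values.

```python
import math
from collections import defaultdict


def normalize(location):
    """Lowercase, drop commas and collapse whitespace so similar strings match."""
    return " ".join(location.lower().replace(",", " ").split())


def build_gazetteer(events):
    """Map each normalized location string to the mean of its observed coordinates.

    A plain mean is an assumption that works for nearby points only; it would
    misbehave for locations that straddle the 180th meridian.
    """
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for location, lat, lng in events:
        entry = sums[normalize(location)]
        entry[0] += lat
        entry[1] += lng
        entry[2] += 1
    return {key: (s[0] / s[2], s[1] / s[2]) for key, s in sums.items()}


def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lng) pairs."""
    lat1, lng1, lat2, lng2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lng2 - lng1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


# Hypothetical events that carry both a location string and coordinates.
events = [
    ("Toronto, Ontario, Canada", 43.65, -79.38),
    ("Toronto, Ontario, Canada", 43.67, -79.40),
    ("Jakarta, Indonesia", -6.2, 106.8),
]
gazetteer = build_gazetteer(events)

# An unformatted user location string resolves through the same normalization.
user_coords = gazetteer.get(normalize("toronto  ontario canada"))
unknown_coords = gazetteer.get(normalize("somewhere else"))  # None: no match
```

Exact matching on normalized strings will miss many free-form user locations, so a real version would likely need some fuzzy or partial matching on top of this.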
- number of users attending, not attending, maybe attending and invited to the event;
- number of friends attending, not attending, maybe attending and invited to the event;
- location similarity between user and event;
- number of users attending the event who also attended events the user attended;
- similarity between the event and events the user attended, based on clusters: I used KMeans (loving scikit-learn) to cluster together similar events, based on words; I chose a few different values for the number of clusters, in order to capture similarity at more than one level of granularity;
- same thing for events attended by friends;
- same thing for events the user (or their friends) didn’t attend;
- time until the event, apparently the most important feature;
- similarity between the event’s word distribution and the average distribution of words for events the user attended;
- whether the user was invited or not.
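To make the two word-based features above more concrete, here is a rough sketch under stated assumptions: each event is a row of word counts, the cluster feature is the share of the user's attended events that fall in the same cluster as the candidate event, and the word-distribution feature is a cosine similarity. The data is random and the helper names (`fit_cluster_labels`, `cluster_share`, `word_profile_similarity`) are hypothetical. The exact definitions used in the competition code may differ.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical data: 200 events described by counts of 50 words.
rng = np.random.default_rng(0)
event_words = rng.poisson(1.0, size=(200, 50)).astype(float)


def fit_cluster_labels(event_words, cluster_counts=(5, 20)):
    """Cluster the events once per value of k and keep each labelling."""
    return {
        k: KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(event_words)
        for k in cluster_counts
    }


def cluster_share(labels, attended_idx, candidate_idx):
    """Share of attended events that sit in the candidate event's cluster."""
    return float(np.mean(labels[attended_idx] == labels[candidate_idx]))


def word_profile_similarity(event_words, attended_idx, candidate_idx):
    """Cosine similarity between the candidate and the user's average event."""
    profile = event_words[attended_idx].mean(axis=0, keepdims=True)
    candidate = event_words[candidate_idx:candidate_idx + 1]
    return float(cosine_similarity(candidate, profile)[0, 0])


labels_by_k = fit_cluster_labels(event_words)
attended = [0, 1, 2, 3]   # events one user attended (made up)
candidate = 0             # event being scored for that user

# One feature per cluster count, plus the word-distribution similarity.
cluster_features = [cluster_share(labels_by_k[k], attended, candidate)
                    for k in sorted(labels_by_k)]
profile_feature = word_profile_similarity(event_words, attended, candidate)
```

The same two helpers can be reused with the events attended by friends, or the events not attended, by passing a different list of indices.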
I didn’t manage to get anything out of user age and gender. I’m still wondering if (and how) that info can be used in some useful way.
To generate results, I took a classification approach (two classes: interested and not interested). I also tried ranking and regression, but classification worked best. I chose a Random Forest (again, scikit-learn) because it was able to work with missing values. I also added Logistic Regression (on the features that didn’t have missing values) and averaged the results of the two models.
The full code is on GitHub. Mind you, except for the scripts main.py and model.py, all other files contain code snippets I built on the fly while exploring the data in IPython.
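The averaging step could look roughly like the sketch below. It runs on synthetic data, and the way missing values are handled is an assumption: they are replaced with a sentinel value before the forest sees them, which may not be what the original code did. The logistic regression only gets the columns that have no missing values.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the (user, event) feature matrix and labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # 1 = interested, 0 = not interested
X[rng.random(300) < 0.3, 3] = np.nan      # make the last feature partly missing

# Columns without any missing value are the ones the logistic regression uses.
complete_cols = ~np.isnan(X).any(axis=0)
X_logit = X[:, complete_cols]

# Assumption: encode missing values as a sentinel the trees can split on.
X_forest = np.where(np.isnan(X), -999.0, X)

train, test = slice(0, 200), slice(200, 300)
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_forest[train], y[train])
logit = LogisticRegression(max_iter=1000)
logit.fit(X_logit[train], y[train])

# Average the two models' probabilities for the "interested" class.
p_forest = forest.predict_proba(X_forest[test])[:, 1]
p_logit = logit.predict_proba(X_logit[test])[:, 1]
p_interested = (p_forest + p_logit) / 2

accuracy = float(np.mean((p_interested > 0.5) == y[test]))
```

An unweighted mean is the simplest way to combine the two models; the weights could also be tuned on held-out data.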