My first Kaggle competition (and how I ranked 3rd)

1. Intro
First, a few words about Kaggle. It’s a website/community for machine learning competitions. Companies and organizations share a problem (most of the time an actual real-world problem), provide a dataset and offer prizes for the best-performing models. Some examples of current competitions: predict customer retention, discover dark matter by how it bends light in space photos (AWESOME), predict diseases in patients based on their history (and win $3 million!) and so on.

I had been planning to join a competition on Kaggle ever since I found out about the website (in spring, I think), but I never found the time. And then I got an email about a new competition – detecting insults in comments. I had a little knowledge of text mining and some free time, so I downloaded the dataset and started coding.

Impermium, the company behind this competition, put up some prize money: $7000 for first place and $2500 for second. Third place got just the eternal glory, yay!

For the implementation, I used Python, the wonderful scikit-learn library (for the SVM implementation) and the neurolab library (for the neural network implementation).

2. General System Architecture
Here I’ll briefly describe the architecture of the model that performed best. Each piece of this architecture is expanded in the sections below.

First, all texts were preprocessed. Then they were fed into 3 different classifiers: a word-level SVM, a character-level SVM and a dictionary-based classifier. The output from each classifier, along with some other features, was fed into a neural network.

3. Tokenizing
This step was a lot more important than I first imagined. Here are some of the things I tried that improved (or at least seemed to improve) the model’s score (a minimal sketch of several of these steps follows the list):
– removing links, html entities and html code
– formatting whitespaces (removing duplicates, removing newlines and tabs)
– removing non-ascii characters (I didn’t think people curse using special characters; I would reconsider this decision, given more time)
– adding special tokens in texts for character groups such as: #$%#$ (some people curse like this), ?!???, !!!!!!
– removing repeated letters: coooool -> cool, niiiice -> niice (yes, this is not the best implementation, but it usually works)
– replacing smileys with 2 tokens, one for positive smileys and one for negative smileys (“saddies”?)
– removing dots inside words (some people put dots inside curse words – they’re not getting away with this!)
– grouping together sequences of one-letter words – like “f u c k” (some people split a curse word in letters – they’re not getting away with it!)
– trying to group consecutive words (like “fu ck”) using a dictionary (some people split curse words in 2 – they’re not getting away with it!)
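Here’s a minimal sketch of a few of these steps – the regexes and placeholder token names are illustrative, not the exact ones I used:

import re

def preprocess(text):
    # remove links, HTML entities and HTML tags
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)
    text = re.sub(r"&\w+;|<[^>]+>", " ", text)
    # remove non-ascii characters
    text = text.encode("ascii", "ignore").decode("ascii")
    # special tokens for character groups like #$%#$, ?!???, !!!!!!
    text = re.sub(r"[#$%&@*]{2,}", " _cursechars_ ", text)
    text = re.sub(r"[!?]{2,}", " _manypunct_ ", text)
    # collapse repeated letters: coooool -> cool, niiiice -> niice
    text = re.sub(r"(\w)\1{2,}", r"\1\1", text)
    # remove dots inside words: f.u.c.k -> fuck
    text = re.sub(r"(?<=\w)\.(?=\w)", "", text)
    # glue together sequences of one-letter words, like "f u c k"
    text = re.sub(r"\b(\w) (?=\w\b)", r"\1", text)
    # normalize whitespace
    return re.sub(r"\s+", " ", text).strip()

print(preprocess("coooool!!!! why are you so f u c k i n g annoying???"))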

4. Classifiers
The first classifier was an SVM on word n-grams, with n from 1 to 4. There’s not a lot to say here: I just imported it, fed it the n-grams generated from the tokenized text and let scikit-learn do its stuff.

The second classifier was another SVM, this time on character n-grams, with n from 4 to 10.
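A minimal sketch of both SVMs with scikit-learn – the TF-IDF weighting and the parameters are reasonable guesses, not necessarily what I actually used:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# word-level SVM: n-grams with n from 1 to 4
word_svm = make_pipeline(
    TfidfVectorizer(analyzer="word", ngram_range=(1, 4)),
    LinearSVC(),
)

# character-level SVM: n-grams with n from 4 to 10
char_svm = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(4, 10)),
    LinearSVC(),
)

# tiny placeholder data: 1 = insult, 0 = clean
texts = ["you are an idiot", "what a lovely day", "u r so stupid", "great article, thanks"]
labels = [1, 0, 1, 0]
word_svm.fit(texts, labels)
print(word_svm.decision_function(["you idiot"]))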

The third classifier was a custom-built, dictionary-based classifier. It used a curse-word dictionary (which I found online and then enriched with words I found in the misclassified examples). This classifier simply checked whether the text contained words from the dictionary, along with words like “you”, “your”, “yourself” (which you use when cursing at somebody). It then computed a simple score based on the distances between the curse words and the “you”-words.
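A rough sketch of the idea – the dictionary and the scoring formula below are illustrative; the real ones were tuned by hand:

# illustrative dictionaries – the real curse-word list was much bigger
CURSE_WORDS = {"idiot", "moron", "stupid"}
YOU_WORDS = {"you", "your", "yourself", "u", "ur"}

def dictionary_score(tokens):
    # score a tokenized text by how close curse words appear to "you"-words
    curse_positions = [i for i, t in enumerate(tokens) if t in CURSE_WORDS]
    you_positions = [i for i, t in enumerate(tokens) if t in YOU_WORDS]
    if not curse_positions or not you_positions:
        return 0.0
    score = 0.0
    for c in curse_positions:
        distance = min(abs(c - y) for y in you_positions)
        score += 1.0 / distance  # closer "you ... idiot" pairs weigh more
    return score

print(dictionary_score("you are a complete idiot".split()))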

The final classifier combined the previous ones. I used a neural network, but I’m sure other techniques could be applied here. The network had a hidden layer of 3 neurons and was trained using the function “train_rprop” (from the neurolab library). I took advantage of the network’s flexibility and added some more features as inputs (see the sketch after this list):
– the ratio of curse words
– the text length
– the ratio of *, ! and ? characters
– the ratio of capital letters (I should have used words in all caps instead)
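Here’s roughly how the combining network can be set up with neurolab – a sketch with placeholder data; the feature scaling and training parameters are assumptions:

import numpy as np
import neurolab as nl

# each row: [word_svm_score, char_svm_score, dict_score, curse_ratio,
#            text_length, punctuation_ratio, caps_ratio], scaled to [0, 1]
features = np.random.rand(100, 7)                          # placeholder feature matrix
targets = np.random.randint(0, 2, (100, 1)).astype(float)  # placeholder labels, 1 = insult

# 7 inputs, a hidden layer of 3 neurons, 1 output neuron
net = nl.net.newff([[0, 1]] * 7, [3, 1])
net.trainf = nl.train.train_rprop
net.train(features, targets, epochs=500, show=100, goal=0.01)

predictions = net.sim(features)  # values close to 1 mean "insult"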

5. Epilogue
I used k-fold cross-validation for model selection (scikit-learn comes with some nifty tools here as well).
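For example, with today’s scikit-learn (a sketch on placeholder data; back then the same tools lived under sklearn.cross_validation):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = ["you are an idiot", "have a nice day"] * 50  # placeholder data
labels = [1, 0] * 50

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 4)), LinearSVC())
scores = cross_val_score(model, texts, labels, cv=5)  # 5-fold cross-validation
print("mean accuracy: %.3f (+/- %.3f)" % (scores.mean(), scores.std()))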

In total, I worked on this for around one week. I tried a lot more models and approaches, but what I’ve shown here got the best score. I’d love to share the code, but it’s written so crappily (due to some hardcore time constraints) that I’m too ashamed to release it. Still, if anybody wants it, please contact me – I’ll share it after you explicitly agree to one condition: don’t make fun of it. Screw that, code available here.

I think 3rd place is a very good result (considering it’s my first competition ever). Still, I noticed that there were a lot of mislabeled examples (perhaps more than the 1% stated on the contest page). This might have influenced the final ranking (the difference between the winner’s score and the fifth-place score was less than 1%).

Then again, I always say “Fortune favors the brave” (I don’t believe in luck). As for what’s next, that light-bending dark matter detection competition sounds pretty cool!

The story of the Oscar predictions

Yesterday was uberVU’s third hackathon. Talking with one of the organizers, I found out that the guys there were planning to create an infographic about predicting the Oscar winners. uberVU.com was already tracking the volume of tweets and the sentiment for the nominees, so all the data was available.

Hmm, but could we make the infographic better? I thought about movie posters and how they include captions from reviews, like “A shock-a-minute masterpiece” from here. Could I get that kind of thing out of tweets and include it in the infographic?

Well, let’s try. I started writing some code to extract frequent captions from tweets, but there were too many noisy expressions that would have required some advanced filtering. I decided to stick to single words, and finally just to adjectives. The approach is inspired by a post on Edwin Chen’s blog.
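A minimal sketch of the adjective counting, assuming NLTK’s part-of-speech tagger (I don’t remember the exact tools I used):

from collections import Counter
import nltk  # needs the "punkt" and "averaged_perceptron_tagger" data packages

def adjective_counts(tweets):
    # count words tagged as adjectives (JJ, JJR, JJS) across all tweets
    counts = Counter()
    for tweet in tweets:
        for word, tag in nltk.pos_tag(nltk.word_tokenize(tweet)):
            if tag.startswith("JJ"):
                counts[word.lower()] += 1
    return counts

print(adjective_counts(["The Artist is a wonderful, charming movie"]).most_common(5))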

Unfortunately, the guys I hoped would help me with the infographic didn’t come. I’m not very good at Photoshop (I know how to crop and stuff, but an infographic requires a little more skill), so I decided to just build a tagcloud using wordle.net.

I sorted the movies by the number of adjectives they attracted. If movies were ranked by how many emotions they stir in their viewers, this would be the final ranking (from last to first):

[Tagclouds for each nominee, ranked from last to first: Extremely Loud & Incredibly Close, The Tree of Life, Midnight in Paris, Moneyball, War Horse, Hugo, The Descendants, The Help, The Artist]

The code is available on github.

Update: The guys from uberVU created the infographic, incorporating some of the stuff above, and posted it on TechCrunch. They used sentiment data to predict the winner (choosing “The Help”). In the end, the winner chosen by the jury proved to be “The Artist”.

El Clasico on Twitter

Saturday evening – at the theatre. It was the worst play I’ve ever seen. The room was half empty (which is… rare, to say the least). A few people around me were dozing off. Meanwhile, I was counting the sleeping people or analysing the beautiful decorations. I gave up trying to figure out what all the metaphors in the play meant; all I wanted was some subtitles to translate what the actors said into concrete facts and events.

Anyway, I then remembered that El Clasico (the match between Real Madrid and Barcelona) was just a couple of hours away, and I started wondering what impact this sporting event would have on Twitter.

As soon as I got back home, I hacked together some code to monitor a few Twitter keywords (“barcelona”, “real madrid”, “el clasico”, …) and left it running.

The next day, I checked out the “harvest”. The script had collected over 3000 tweets over an interval of about three and a half hours, ranging from 3 to almost 100 tweets per minute. I plotted the histogram and highlighted the first and second half (with grey shading) and the 4 goals (with red lines).
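The plotting part looked roughly like this – a sketch with made-up counts; the minute offsets and goal times below are placeholders, not the real ones:

import random
import matplotlib.pyplot as plt

# placeholder for the real per-minute tweet counts (about 3h30 of monitoring)
tweets_per_minute = [random.randint(3, 100) for _ in range(210)]

plt.bar(range(len(tweets_per_minute)), tweets_per_minute, width=1.0)

# shade the two halves and mark the goals with red lines
plt.axvspan(30, 78, color="grey", alpha=0.3)   # first half (offsets made up)
plt.axvspan(93, 141, color="grey", alpha=0.3)  # second half
for goal_minute in [52, 83, 110, 125]:         # placeholders, not the real goal times
    plt.axvline(goal_minute, color="red")

plt.xlabel("minute")
plt.ylabel("tweets per minute")
plt.show()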

The histogram shows spikes in tweets at the beginning of the match, as well as at the end of each half. A little more interesting is the behaviour after a goal – a short drop (everybody stops to watch the goal) followed by a spike (after checking the replays, people tweet about it).

[Histogram: tweets per minute during El Clasico – halves shaded in grey, goals marked with red lines]

Now let’s put some keywords on the chart. I took the most frequent words (filtering for an overall count > 25) and computed each word’s count over 5-minute windows, plus the mean and standard deviation of these counts. Going over the counts again, I plotted the words whose frequency in a window was a lot higher than their mean (I used (freq – mean) / st_deviation > 3.5), setting the text opacity according to this score. Code is available here.
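In code, the idea looks roughly like this (a sketch – the data structures are illustrative, not the actual implementation):

from collections import Counter

WINDOW = 5  # minutes

def spiking_words(tweets, threshold=3.5, min_count=25):
    # tweets: list of (minute, text) pairs
    overall = Counter()
    windows = {}  # window index -> Counter of word counts in that window
    for minute, text in tweets:
        w = minute // WINDOW
        windows.setdefault(w, Counter())
        for word in text.lower().split():
            overall[word] += 1
            windows[w][word] += 1

    spikes = []
    window_ids = sorted(windows)
    for word, total in overall.items():
        if total <= min_count:
            continue  # keep only frequent words (overall count > 25)
        counts = [windows[w][word] for w in window_ids]
        mean = sum(counts) / len(counts)
        std = (sum((c - mean) ** 2 for c in counts) / len(counts)) ** 0.5
        if std == 0:
            continue
        for w, c in zip(window_ids, counts):
            if (c - mean) / std > threshold:  # (freq - mean) / st_deviation > 3.5
                spikes.append((w * WINDOW, word, (c - mean) / std))
    return spikes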

The words this algorithm found are pretty good. It found 2 of the goal scorers, another player with the decisive pass, some typical end-of-the-match words, as well as the occasional spam (one hour or a few minutes before the start). A possible improvement: normalise the frequencies within each window – I’ll check it out in a future project.

uberVU hackathon – Relationship Tagcloud

Yesterday I took part in the second uberVU hackathon. The theme of the day was visualizing data. I joined the team trying to improve the tagcloud (my teammates were Alex Suciu and Dan Filimon).

We came up with the idea of highlighting relations in a tagcloud. More exactly, we presumed a normal phrase has the form “<noun – who?> <verb – what?> <noun – to whom?>” (and this can easily be expanded to include adjectives, adverbs and pronouns). This is, pretty much, the basis of natural language parsing. Since a full parser is very slow and inadequate for huge volumes of data (which was the case here), we thought of simplifying it. We would get lower accuracy than an advanced parser, but over tens of thousands of tweets (which is what we were working with) the results would (eventually, after adding some more work hours) be similar.
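Here’s a minimal sketch of the simplification, assuming NLTK’s POS tagger (our actual rules were a bit more involved):

from collections import Counter
import nltk  # needs the "punkt" and "averaged_perceptron_tagger" data packages

def verb_noun_pairs(tweets):
    # count adjacent (verb, noun) pairs as a cheap stand-in for full parsing
    pairs = Counter()
    for tweet in tweets:
        tagged = nltk.pos_tag(nltk.word_tokenize(tweet.lower()))
        for (w1, t1), (w2, t2) in zip(tagged, tagged[1:]):
            if t1.startswith("VB") and t2.startswith("NN"):
                pairs[(w1, w2)] += 1
    return pairs

print(verb_noun_pairs(["i want iphone so bad", "finally got iphone today"]).most_common(3))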

The code is available here. Unfortunately, since there weren’t enough frontend guys around, there’s no way to visualize the results. Still, we can read and comment on them. We tracked tweets containing the “iphone” keyword (about 5000 in total) and noticed an interesting fact in our results – people express a lot of possessiveness over iPhones. The second, third and sixth most frequent (verb, noun) relations were “have iphone”, “got iphone” and “want iphone”. A few places behind we also found “need iphone” and “buy iphone”.

An interesting future project would be to track the evolution of these pairs for a new product – from the rumours, through the first announcement and the release, up to a few months after, when pretty much everybody has it.

Interview with a Lady Gaga fan

A lot of people comment on YouTube videos. A LOT! Lady Gaga’s Bad Romance has more than 1 million comments. What are all those people talking about? Since I’m too lazy to actually read the comments, I’m taking a different approach: I’m building a Markov Model Lady Gaga Fan Simulator (MMLGFS).

The MMLGFS requires training, taking some comments as input. After that, you can actually talk to it: just provide the first 2 words and the model will output the rest of the phrase. Check out the code here.
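Under the hood it’s a simple order-2 Markov chain over words. A minimal sketch – the actual code handles a few more corner cases:

import random
from collections import defaultdict

def train(comments):
    # map each pair of consecutive words to the words that can follow it
    chain = defaultdict(list)
    for comment in comments:
        words = comment.lower().split()
        for w1, w2, w3 in zip(words, words[1:], words[2:]):
            chain[(w1, w2)].append(w3)
    return chain

def generate(chain, first, second, max_words=30):
    # start from two words and keep sampling a plausible next word
    output = [first, second]
    while len(output) < max_words:
        followers = chain.get((output[-2], output[-1]))
        if not followers:
            break
        output.append(random.choice(followers))
    return " ".join(output)

chain = train(["i am going to the paradise", "i am a dedicated fan"])
print(generate(chain, "i", "am"))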

Now that we have a virtual Lady Gaga fan, let’s ask him some questions:

Web Mining for Dummies reporter: Who are you?
Lady Gaga fan: I am… aspiring to become a succesful pokemon trainer.

WMfD: How would you describe Lady Gaga’s music?
LGf: Her music… is the best christmas gift ever.

WMfD: What’s the hidden meaning of this song?
LGf: It’s… about being in bed with another guy.

WMfD: And how do you feel about that?
LGf: I like… a few but i was apparently wrong as i caught her in all videos and tweet peoplechoiceawardsgaga.

WMfD: A lot of people think Jennifer Lopez is so much cooler. What do you think?
LGf: Jennifer Lopez… and pitbull is worse and they still wiped the floor is going to say stupid things btw is the latest dreamwork animation puss in boots.

WMfD: Thank you for this interview, mister fan, and have a nice day.
LGf: You are… going to the paradise because of this song so please support a dedicated and hungry newartist from seattle you will be funded for my mp3player anyone have suggestions like heavy metal or alternative.

Let’s try a different example (or people will say I’m making fun of Lady Gaga [which I am, btw]). This time, I’m choosing lecture 1 from Stanford’s Machine Learning course, presented by Andrew Ng. Side note: Stanford has announced a set of new free online courses starting January/February 2012, like Natural Language Processing or Game Theory.


There were only about 180 comments to train this model on, so it’s not as advanced as the previous one. Still, it does turn out to be silly:

WMfD: Why do you follow Andrew Ng’s lectures on Youtube?
Machine Learning student: It is… a much better experience than listening to my boring professor at our university.

WMfD: Did you find the lectures useful?
MLs: I know… some people who like to learn after my day job very useful and it will even make us realize more the complexity of our mind and the power of the creator who designed it.

WMfD: What would you tell Andrew Ng if you met him?
MLs: Andrew Ng… thanks stanford for the knowledge that i ve always been wanting to learn will not give it consciousness instead it will give me an opportunity learning something valuable for free but for real.

WMfD: Thank you for your time.
MLs: You are… a terrorist.