1. Intro
First, a few words about Kaggle. It’s a website/community for machine learning competitions. Companies and organizations share a problem (most of the time it’s an actual real world problem), provide a dataset and offer prizes for the best performing models. Some examples of current competitions: predict customer retention, discover dark matter by how it bends light in space photos (AWESOME), predict diseases in patients based on their history (and win $3 million!) and so on.
I was planning to join a competition on Kaggle since I found out about the website (in spring, I think), but I never found the time to. And then I got an email about a new competition – to detect insults in comments. I had a little knowledge on text mining, I also had some free time, so I downloaded the dataset and started coding.
Impermium, the company behind this competition, came with some prize money. First place got $7000, while second place got $2500. Third place got just the eternal glory, yay!
For implementation, I used python, the wonderful scikit-learn library (for the SVM implementation) and the neurolab library (for the neural network implementation).
2. General System Architecture
Here I’ll briefly describe the architecture of the model that performed the best. This architecture will be expanded afterwards.
First, all texts were preprocessed. Then they were fed into 3 different classifiers: word-level SVM, character-level SVM and a dictionary-based classifier. The output from each classifier, along with some other features, were fed into a neural network.
3. Tokenizing
This step was a lot more important than I first imagined. Here are some of the things that I tried and improved (or at least seemed to) the model score:
– removing links, html entities and html code
– formatting whitespaces (removing duplicates, removing newlines and tabs)
– removing non-ascii characters (I didn’t think people curse using special characters; I would reconsider this decision, given more time)
– adding special tokens in texts for character groups such as: #$%#$ (some people curse like this), ?!???, !!!!!!
– removing repeated letters: coooool -> cool, niiiice -> niice (yes, this is not the best implementation, but it usually works)
– replacing smileys by 2 tokens, one for positive smileys and one for negative smileys (“saddies”?)
– removing dots inside words (some people put dots inside curse words – they’re not getting away with this!)
– grouping together sequences of one-letter words – like “f u c k” (some people split a curse word in letters – they’re not getting away with it!)
– trying to group consecutive words (like “fu ck”) using a dictionary (some people split curse words in 2 – they’re not getting away with it!)
4. Classifiers
The first classifier was an SVM on word ngrams, with n from 1 to 4. Not a lot can be said here, I just imported it, fed it the ngrams generated from the tokenized text and let scikit-learn do its stuff.
The second classifier was another SVM, this time on character ngrams, with n from 4 to 10.
The third classifier was a custom build dictionary based classifier. It used a curse words dictionary (which I found online and then enriched with words I found in the misclassified examples). This classifier just looked if the text had words from the dictionary and also words like “you”, “your”, “yourself” (which you use when cursing somebody). It then computed a simple score based on the distances between the curse words and the “you”-words.
The final classifier was used to combine the previous ones. I used a neural network, but I’m sure other techniques can be applied here. The network had a hidden layer of 3 neurons and was trained using the function “train_rprop” (from the neurolab library). I took advantage of the network’s flexibility and added some more features as inputs:
– the ratio of curse words
– the text length
– the ratio of *, ! or ?
– the ratio of capital letter (should have used words in all caps instead)
5. Epilogue
I used k-folds cross-validation for model selection (scikit-learn comes with some nifty tools here also).
In total, I worked around one week for this. I tried a lot more models and approaches, but what I’ve showed here got the best score. I’d love to share the code, but it’s written so crappily (due to some hard core time constraints) that I’m too ashamed to release it. Still, if anybody wants it, please contact me – I’ll share it after explicitly agreeing to one condition: Don’t make fun of it. Screw that, code available here.
I think 3rd place is a very good result (considering it’s my first competition ever). Still, I noticed that there were a lot of mislabeled examples (perhaps more than the 1% stated on the contest page). This might have had an influence on the final ranking (the difference between the winner’s score and the fifth score was less than 1%).
Yet again, I always say “Fortune favors the brave” (I don’t believe in luck). Jumping to some actionable information, light bending dark matter detection sounds pretty cool!