It’s been quite a while since my last post here. It doesn’t mean I’ve forgotten about this place, just that I’ve had quite an intense summer (“work hard, play hard”). Now I’m excited to see the first results of my work: this week I’ll be at SYNASC 2012, in Timisoara, presenting my paper on summarizing microblogging streams. Here are the slides:
There’s been a lot of buzz over the past couple of years on predicting the outcome of events based on Twitter data. Having easy access to the thoughts of millions of people worldwide, tapping into the stream of short, cryptic and mostly useless tweets and trying to make some sense out of them attracted the interest of a lot of curious people.
Jessica Chung and Erik Tjong Kim Sang tried to predict the outcome of political elections. Johan Bollen found a correlation between Twitter and the stock market. Xiaofeng Wang tried to predict crime based on tweets.
When it comes to predicting Oscar winners, Liviu Lica and the guys from uberVU used overall sentiment, which worked in 2011, but failed in 2012. At the uberVU hackathon, I tried another approach, focused on adjectives, which (lucky me) seemed to work.
But a new study showed that Twitter messages are not useful when it comes to movie predictions. And I agree with them: all of the ideas above are flawed. People are noisy sensors. Aggregating over noisy sensors does not result in the right answer, just in an estimate of it (along with an uncertainty level).
But there is one way to reduce the uncertainty level down to a negligible value: use tweets from psychics. The problem with this approach is identifying “psychic” tweets. Obviously, there are very few psychics in the world, so identifying their tweets is not trivial.
I used a simple rule-based filtering approach: I picked only tweets that don’t contain a question (no ‘?’) and whose author expresses certainty about who the winner will be (the phrase ‘will win’ appears in the tweet, but ‘think’ and ‘hope’ do not).
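For the curious, here is a minimal sketch of what such a filter could look like in Python. It is not the original code: the function names and the (movie, text) input format are assumptions made for illustration, and the matching is plain case-insensitive substring search.

```python
from collections import Counter


def is_psychic_tweet(text):
    """Return True if the tweet passes the three rules described above."""
    lowered = text.lower()
    if "?" in lowered:              # rule 1: no questions
        return False
    if "will win" not in lowered:   # rule 2: must state a winner with certainty
        return False
    # rule 3: no hedging words (substring match, so "thinking" is rejected too)
    return not any(word in lowered for word in ("think", "hope"))


def count_psychic_votes(tweets):
    """tweets: iterable of (movie, text) pairs (a hypothetical input format).

    Returns a Counter mapping each movie to its number of "psychic" tweets.
    """
    return Counter(movie for movie, text in tweets if is_psychic_tweet(text))
```

Calling count_psychic_votes(...).most_common() on the labelled corpus would then give the movies sorted by number of surviving tweets.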
For the proof of concept, I used the corpus from the previous hackathon – 62000 tweets recorded in the week before the Oscars, each tweet assigned to the movie it refers to. The movie “The Tree of Life” has just 2100 tweets, while “The Artist” goes up to 19200. Out of the 62000 tweets, only 98 are left after filtering. Let’s see how they are distributed:
So there you have it – the power of psychic tweets, predicting the Oscar winner!
Disclaimer: While the data and results are real, I hope you enjoyed this April 1st prank 🙂
Yesterday was uberVU’s third hackathon. Talking with one of the organizers, I found out that the guys there were planning to create an infographic about predicting the Oscar winners. UberVU.com was already tracking the volume of tweets and the sentiment for the nominees, so all the data was available.
Hmm, but can we make the infographic better? I thought about the movie posters and how they include captions from reviews, like “A shock-a-minute masterpiece” from here. Could I get such stuff out of tweets and include it in the infographic?
Well, let’s try. I started writing some code to get frequent captions out of tweets, but there were too many noisy expressions that would require some advanced filtering. I decided to stick just to words and finally just to adjectives. The approach is inspired by a post on Edwin Chen’s blog.
Unfortunately, the guys I hoped would help me with the infographic didn’t come. I’m not very good at Photoshop (I know how to crop and stuff, but an infographic requires a little more skill). So I decided to just build a tagcloud using wordle.net.
I sorted the movies by the number of adjectives they attracted. If the movies are ranked by how much emotion they stir in their viewers, then this would be the final ranking (from last to first):
The code is available on github.
Update: The guys from uberVU have created the infographic, inserting some of the stuff above, and they have posted it on Techcrunch. They used sentiment data to predict the winner (choosing “The Help”). In the end, the winner chosen by the jury proved to be “The Artist”.
Saturday evening – at the theatre. It was the worst play I’ve ever seen. The room was half empty (which is ... rare, to say the least). A few people around me were dozing off. Meanwhile, I was counting sleeping people or analysing the beautiful decorations. I quit trying to figure out what all the metaphors in the play meant; all I wanted were some subtitles, to translate what the actors said into concrete facts and events.
Anyway, I then remembered that El Clasico (the match between Real Madrid and Barcelona) was just a couple of hours away. I started wondering what impact this sporting event would have on Twitter.
As soon as I got back home, I hacked some code to monitor a few Twitter keywords (“barcelona”, “real madrid”, “el clasico”, …) and then left it running.
Next day, I checked out the “harvest”. It collected over 3000 tweets over an interval of 3h30′, ranging from 3 to almost 100 tweets per minute. I plotted the histogram and highlighted the first and second half (with grey shading) and the 4 goals (with red lines).
The histogram highlights spikes in tweets at the beginning of the match, as well as at the end of each half. A little more interesting is the behaviour after a goal – a short drop (everybody stops to see the goal) followed by a spike (after checking the replays, people tweet about it).
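The tweets-per-minute series behind such a histogram is easy to compute. The sketch below is not the original code; it assumes each tweet comes with a timestamp in seconds (Unix time, for instance), and the function name is made up for illustration.

```python
from collections import Counter


def tweets_per_minute(timestamps):
    """Count tweets in each one-minute bin, from the first to the last minute.

    timestamps: iterable of tweet times in seconds (an assumed input format).
    Minutes with no tweets are kept as zeros so the series has no gaps.
    """
    minutes = Counter(int(ts // 60) for ts in timestamps)
    if not minutes:
        return []
    first, last = min(minutes), max(minutes)
    return [minutes.get(m, 0) for m in range(first, last + 1)]
```

The resulting list can be handed to any plotting tool as bar heights, with the goal times and the two halves drawn on top.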
Now let’s put some keywords on the chart. First, get the most frequent words (I kept those with an overall count > 25). Then compute each word’s count over 5-minute windows, plus the mean and standard deviation of these counts. Going over the counts again, I plot a word wherever its window frequency is a lot higher than its mean (I used (freq - mean) / st_deviation > 3.5). I set the text opacity depending on this score. Code is available here.
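Here is a rough sketch of that scoring step, written from the description above rather than copied from the linked code. The (timestamp, text) input format, the whitespace tokenisation and the use of the population standard deviation are all assumptions; the defaults mirror the numbers mentioned in the post.

```python
from collections import Counter, defaultdict
from statistics import mean, pstdev


def bursty_words(tweets, window_seconds=300, min_count=25, threshold=3.5):
    """Find (window, word, score) triples where a word spikes.

    tweets: iterable of (timestamp_in_seconds, text) pairs (assumed format).
    score = (window count - mean count) / standard deviation, per word.
    """
    window_counts = defaultdict(Counter)   # window index -> word counts
    totals = Counter()                     # overall word counts
    for ts, text in tweets:
        window = int(ts // window_seconds)
        for word in text.lower().split():  # naive whitespace tokenisation
            window_counts[window][word] += 1
            totals[word] += 1
    if not window_counts:
        return []

    windows = range(min(window_counts), max(window_counts) + 1)
    bursts = []
    for word, total in totals.items():
        if total <= min_count:             # keep only frequent words
            continue
        counts = [window_counts.get(w, {}).get(word, 0) for w in windows]
        mu, sigma = mean(counts), pstdev(counts)
        if sigma == 0:                     # perfectly flat word, nothing to flag
            continue
        for w, count in zip(windows, counts):
            score = (count - mu) / sigma
            if score > threshold:
                bursts.append((w, word, score))
    return sorted(bursts)
```

The score of each triple can be mapped straight to text opacity when drawing the word on the chart.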
The words this algorithm found are pretty good. It found 2 of the goal scorers, another player with the decisive pass, some typical end-of-the-match words, as well as the occasional spam (one hour or a few minutes before the start). Possible improvement: Normalise frequencies for each window – will check it out in a future project.
Yesterday I took part in the second uberVU hackathon. The theme of the day was visualizing data. I joined the team trying to improve the tagcloud (my teammates were Alex Suciu and Dan Filimon).
We came up with the idea of highlighting relations in a tagcloud. More exactly, we presumed a normal phrase is of the form “<noun – who?> <verb – what?> <noun – to whom?>” (and this can be easily expanded to include adjectives, adverbs and pronouns). This is, pretty much, the basis of natural language parsing. Since a parser is very slow and inadequate for huge volumes of data (which was the case here), we thought of simplifying it. We would have a lower accuracy than an advanced parser, but the results over tens of thousands of tweets (that’s what we were working on) would be (eventually, after adding some more work hours) similar.
The code is available here. Unfortunately, since there weren’t enough frontend guys around, there’s no way of visualizing the results. Still, we can read and comment on them. We tracked tweets containing the “iphone” keyword (about 5000 in total) and we noticed an interesting fact among our results – people express a lot of possession over iPhones. The second, third and sixth most frequent (verb, noun) relations were “have iphone”, “got iphone” and “want iphone”. Also, a few places behind we found “need iphone” and “buy iphone”.
An interesting future project would be to track these pairs for a new product and see how they evolve, from the rumours, through the first announcement and the release, and up to a few months after, when pretty much everybody has it.
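To give an idea of how (verb, noun) relations can be counted without a full parser, here is a toy sketch. It is not the hackathon code: the two tiny word lists are hypothetical stand-ins for a real part-of-speech lookup, and the rule (pair each verb with the next noun that follows it) is just one simple way to approximate the pattern.

```python
from collections import Counter

# Hypothetical mini-lexicons; a real run would need a proper POS lookup.
VERBS = {"have", "got", "want", "need", "buy"}
NOUNS = {"iphone", "ipad", "phone"}


def verb_noun_pairs(text):
    """Pair each verb with the first noun that follows it in the tweet."""
    pairs = []
    pending_verb = None
    for token in text.lower().split():
        word = token.strip(".,!?:;\"'#@")  # drop surrounding punctuation
        if word in VERBS:
            pending_verb = word
        elif word in NOUNS and pending_verb is not None:
            pairs.append((pending_verb, word))
            pending_verb = None
    return pairs


def count_pairs(tweets):
    """Count (verb, noun) pairs over an iterable of tweet texts."""
    counts = Counter()
    for text in tweets:
        counts.update(verb_noun_pairs(text))
    return counts
```

Sorting the counter with most_common() gives a ranking of relations like the one discussed above.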