The story of the Oscar predictions

Yesterday was uberVU’s third hackathon. Talking with one of the organizers, I found out that the guys there were planning to create an infographic predicting the Oscar winners. uberVU was already tracking the volume of tweets and the sentiment for the nominees, so all the data was available.

Hmm, but could we make the infographic better? I thought about movie posters and how they include captions from reviews, like “A shock-a-minute masterpiece”. Could I extract similar snippets from tweets and include them in the infographic?

Well, let’s try. I started writing some code to extract frequent captions from tweets, but there were too many noisy expressions that would have required advanced filtering. I decided to stick to single words, and finally just to adjectives. The approach is inspired by a post on Edwin Chen’s blog.
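The gist of the adjective approach can be sketched in a few lines. This is only an illustration: it matches words against a tiny hand-picked adjective list, whereas a real version would run a proper part-of-speech tagger (e.g. NLTK’s `pos_tag`) over each tweet; the tweets and the `ADJECTIVES` set below are made up for the example.

```python
import re
from collections import Counter

# Tiny stand-in lexicon for illustration only; a real version would
# use a POS tagger instead of a hand-picked word list.
ADJECTIVES = {"amazing", "brilliant", "boring", "beautiful", "stunning", "dull"}

def adjective_counts(tweets, keyword):
    """Count adjectives in tweets that mention the given keyword."""
    counts = Counter()
    for tweet in tweets:
        if keyword.lower() not in tweet.lower():
            continue
        for word in re.findall(r"[a-z']+", tweet.lower()):
            if word in ADJECTIVES:
                counts[word] += 1
    return counts

# Made-up example tweets:
tweets = [
    "The Artist was amazing, simply amazing",
    "Found The Artist a bit boring to be honest",
    "Brilliant cinematography in The Artist",
]
print(adjective_counts(tweets, "the artist").most_common(2))
# [('amazing', 2), ('boring', 1)]
```

The same counts then feed directly into a tagcloud, with each adjective sized by its frequency.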

Unfortunately, the guys I hoped would help me with the infographic didn’t come. I’m not very good with Photoshop (I know how to crop and such, but an infographic requires a little more skill), so I decided to just build a tagcloud instead.

I sorted the movies by the number of adjectives they attracted. If movies were ranked by how many emotions they stir in their viewers, this would be the final ranking (from last to first):


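The ranking step itself is just a sort over per-movie totals. A minimal sketch, with made-up counts (the real numbers came from the tracked tweet data):

```python
# Hypothetical per-movie adjective totals, for illustration only.
adjective_totals = {
    "The Artist": 340,
    "The Help": 512,
    "Hugo": 198,
}

# Rank from last to first, as in the post: fewest adjectives first.
ranking = sorted(adjective_totals, key=adjective_totals.get)
print(ranking)  # ['Hugo', 'The Artist', 'The Help']
```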
The code is available on GitHub.

Update: The guys from uberVU created the infographic, incorporating some of the material above, and it was posted on TechCrunch. They used the sentiment data to predict the winner (choosing “The Help”). In the end, the actual winner proved to be “The Artist”.

uberVU hackathon – Relationship Tagcloud

Yesterday I took part in the second uberVU hackathon. The theme of the day was data visualization. I joined the team trying to improve the tagcloud (my teammates were Alex Suciu and Dan Filimon).

We came up with the idea of highlighting relations in a tagcloud. More exactly, we assumed a normal phrase has the form “<noun – who?> <verb – what?> <noun – to whom?>” (and this can easily be expanded to include adjectives, adverbs and pronouns). This is, pretty much, the basis of natural language parsing. Since a full parser is very slow and inadequate for huge volumes of data (as was the case here), we thought of simplifying it. We would get lower accuracy than an advanced parser, but over tens of thousands of tweets (which is what we were working with), the results would eventually, after some more hours of work, be similar.
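The simplification can be sketched roughly like this: instead of a full parse, scan the token stream for a verb followed shortly after by a noun and count the resulting pairs. The word lists and tweets below are stand-ins for illustration; the hackathon version worked from actual part-of-speech tags rather than fixed lexicons.

```python
import re
from collections import Counter

# Stand-in word lists for illustration; a real version would use a
# POS tagger to decide which tokens are verbs and nouns.
VERBS = {"have", "got", "want", "need", "buy", "love"}
NOUNS = {"iphone", "ipad", "case"}

def verb_noun_pairs(tweets, window=3):
    """Count (verb, noun) pairs where the noun follows the verb
    within `window` tokens -- a crude proxy for a full parse."""
    pairs = Counter()
    for tweet in tweets:
        tokens = re.findall(r"[a-z']+", tweet.lower())
        for i, tok in enumerate(tokens):
            if tok not in VERBS:
                continue
            for nxt in tokens[i + 1 : i + 1 + window]:
                if nxt in NOUNS:
                    pairs[(tok, nxt)] += 1
                    break
    return pairs

# Made-up example tweets:
tweets = [
    "I want an iphone so badly",
    "finally got my iphone today!",
    "do you want the new iphone?",
]
print(verb_noun_pairs(tweets).most_common(1))
# [(('want', 'iphone'), 2)]
```

The `window` cutoff is the accuracy trade-off mentioned above: a wide window catches more real relations but also more spurious ones.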

The code is available here. Unfortunately, since there weren’t enough frontend guys around, there’s no way to visualize the results yet. Still, we can read and comment on them. We tracked tweets containing the keyword “iphone” (about 5000 in total) and noticed an interesting pattern in our results – people express a lot of possession over iPhones. The second, third and sixth most frequent (verb, noun) relations were “have iphone”, “got iphone” and “want iphone”. A few places behind, we also found “need iphone” and “buy iphone”.

An interesting future project would be to track the evolution of these pairs for a new product: from the rumours, through the first announcement and the release, and up to a few months after, when pretty much everybody has it.