uberVU hackathon – Relationship Tagcloud

Yesterday I took part in the second uberVU hackathon. The theme of the day was visualizing data. I joined the team trying to improve the tagcloud (my teammates were Alex Suciu and Dan Filimon).

We came up with the idea of highlighting relations in a tagcloud. More exactly, we assumed a normal phrase has the form “<noun – who?> <verb – what?> <noun – to whom?>” (and this can easily be expanded to include adjectives, adverbs and pronouns). This is, pretty much, the basis of natural language parsing. Since a full parser is very slow and inadequate for huge volumes of data (which was the case here), we thought of simplifying it. We would get lower accuracy than an advanced parser, but the results over tens of thousands of tweets (that’s what we were working on) would, eventually, after adding some more work hours, be similar.
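The actual hackathon code is linked below; the following is only a minimal sketch of the simplified extraction, assuming NLTK's off-the-shelf POS tagger (the tagger choice and the verb/noun pairing window are my own, not necessarily what we used on the day):

from collections import Counter
import nltk  # needs the 'punkt' and 'averaged_perceptron_tagger' data packages

def verb_noun_pairs(tweet, window=3):
    """Pair each verb with the first noun that follows it within a small window."""
    tagged = nltk.pos_tag(nltk.word_tokenize(tweet.lower()))
    for i, (word, tag) in enumerate(tagged):
        if tag.startswith('VB'):
            for next_word, next_tag in tagged[i + 1:i + 1 + window]:
                if next_tag.startswith('NN'):
                    yield (word, next_word)
                    break

tweets = ["i want an iphone so badly", "just got my new iphone today"]
relations = Counter(pair for t in tweets for pair in verb_noun_pairs(t))
print(relations.most_common(10))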

The code is available here. Unfortunately, since there weren’t enough frontend guys around, there’s no way of visualizing the results. Still, we can read and comment on them. We tracked tweets containing the “iphone” keyword (about 5000 in total) and noticed an interesting fact in our results – people express a lot of possession over iPhones. The second, third and sixth most frequent (verb, noun) relations were “have iphone”, “got iphone” and “want iphone”. A few places behind we also found “need iphone” and “buy iphone”.

An interesting future project would be to track the evolution of these pairs for a new product, from the rumours and the first announcement, through the release, and up to a few months after, when pretty much everybody has it.

Interview with a Lady Gaga fan

A lot of people comment on YouTube videos. A LOT! Lady Gaga’s Bad Romance has more than 1 million comments. What are all those people talking about? Since I’m too lazy to actually read the comments, I’m taking a different approach: I’m building a Markov Model Lady Gaga Fan Simulator (MMLGFS).

The MMLGFS requires training, taking some comments as input. After that, you can actually talk to it: just provide the first two words and the model will output the rest of the phrase. Check out the code here.
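A minimal sketch of the idea, a second-order (two-word context) Markov chain over the comment text, looks roughly like this (the names here are mine, not necessarily the ones in the linked code):

import random
from collections import defaultdict

def train(comments):
    """Map each pair of consecutive words to the words seen right after it."""
    model = defaultdict(list)
    for comment in comments:
        words = comment.lower().split()
        for i in range(len(words) - 2):
            model[(words[i], words[i + 1])].append(words[i + 2])
    return model

def generate(model, w1, w2, max_words=30):
    """Start from two seed words and keep sampling a plausible next word."""
    output = [w1, w2]
    while len(output) < max_words and (w1, w2) in model:
        w1, w2 = w2, random.choice(model[(w1, w2)])
        output.append(w2)
    return ' '.join(output)

comments = ["her music is the best thing ever", "her music is about being free"]
print(generate(train(comments), "her", "music"))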

Now that we have a virtual Lady Gaga fan, let’s ask him some questions:

Web Mining for Dummies reporter: Who are you?
Lady Gaga fan: I am… aspiring to become a succesful pokemon trainer.

WMfD: How would you describe Lady Gaga’s music?
LGf: Her music… is the best christmas gift ever.

WMfD: What’s the hidden meaning of this song?
LGf: It’s… about being in bed with another guy.

WMfD: And how do you feel about that?
LGf: I like… a few but i was apparently wrong as i caught her in all videos and tweet peoplechoiceawardsgaga.

WMfD: A lot of people think Jennifer Lopez is so much cooler. What do you think?
LGf: Jennifer Lopez… and pitbull is worse and they still wiped the floor is going to say stupid things btw is the latest dreamwork animation puss in boots.

WMfD: Thank you for this interview, mister fan, and have a nice day.
LGf: You are… going to the paradise because of this song so please support a dedicated and hungry newartist from seattle you will be funded for my mp3player anyone have suggestions like heavy metal or alternative.

Let’s try a different example (or people will say I am making fun of Lady Gaga [which I am, btw]). This time, I’m choosing lecture 1 from Stanford’s Machine Learning course, presented by Andrew Ng. Side note: Stanford announced a set of new free online courses starting January/February 2012, like Natural Language Processing or Game Theory:

[Image: Stanford’s free online courses announcement]

There were only about 180 comments for this model, so it’s not as advanced as the previous one. Still, it does turn out to be silly:

WMfD: Why do you follow Andrew Ng’s lectures on Youtube?
Machine Learning student: It is… a much better experience than listening to my boring professor at our university.

WMfD: Did you find the lectures useful?
MLs: I know… some people who like to learn after my day job very useful and it will even make us realize more the complexity of our mind and the power of the creator who designed it.

WMfD: What would you tell Andrew Ng if you met him?
MLs: Andrew Ng… thanks stanford for the knowledge that i ve always been wanting to learn will not give it consciousness instead it will give me an opportunity learning something valuable for free but for real.

WMfD: Thank you for your time.
MLs: You are… a terrorist.


Is winter really coming?

For the past couple of months I’ve been reading George R. R. Martin’s A Song of Ice and Fire. It’s a fantasy series, set in a medieval-like world, with a twisting plot and lots of catchy phrases. The most popular phrase is “Winter is coming”, always used in key moments, emphasising its double entendre. Another one I caught is “A Lannister always pays his debts”, the Lannisters being (duh) the bad guys. But.. are there more such phrases?

Well, let’s find out!

I downloaded the first 4 books and split the text into words. I built a tree in order to easily determine the frequent phrases (with a minimum frequency of 10 occurrences and a minimum length of 3 words). After constructing the tree and computing the frequencies, I moved the phrases into a list, so I could sort them by a score (score = phrase length in characters * log(frequency)). Check the code here.
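The linked code builds a tree; the sketch below uses a flat n-gram counter instead, which gives the same counts and the same score as long as you cap the phrase length (the MAX_WORDS cap, the file name and the function names are my own assumptions, not taken from the code):

import math
from collections import defaultdict

MIN_FREQ, MIN_WORDS, MAX_WORDS = 10, 3, 8

def count_phrases(words):
    """Count every sequence of MIN_WORDS..MAX_WORDS consecutive words."""
    counts = defaultdict(int)
    for i in range(len(words)):
        for n in range(MIN_WORDS, MAX_WORDS + 1):
            if i + n > len(words):
                break
            counts[tuple(words[i:i + n])] += 1
    return counts

def top_phrases(words, limit=10):
    """Score frequent phrases by length in characters times log(frequency)."""
    scored = []
    for phrase, freq in count_phrases(words).items():
        if freq >= MIN_FREQ:
            text = ' '.join(phrase)
            scored.append((len(text) * math.log(freq), text, freq))
    return sorted(scored, reverse=True)[:limit]

words = open('books.txt').read().lower().split()
for score, phrase, freq in top_phrases(words):
    print(round(score, 1), phrase, freq)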

So.. what are the top phrases?
1. lord commander of the kingsguard (24 occurrences)
2. grand maester pycelle (111)
3. the lord commander of the kingsguard (14)
4. the seven kingdoms (190)
5. the children of the forest (37)
6. of the nights watch (108)
7. the nights watch (244)
8. his mouth with the back of his hand (12)
9. prince aemon the dragonknight (20)
10. of the seven kingdoms (62)

Plotting the scores against the ranks of the phrases in the sorted list, we get Zipf’s PMF (wohooo):

[Plot: phrase score vs. rank, following Zipf’s distribution]

Analyzing the top 10 phrases, we see that 9 are characters/places/groups, while number 8 is.. plain weird. Not quite what I was expecting. Further down we get some of the catchy phrases I was looking for:
12. fear cuts deeper than swords (22 occurrences)
16. the night is dark and full of terrors (10)
24. a lannister always pays his debts (12)
40. you know nothing jon snow (20)
…..
1038. winter is coming (18)

Changing the way scores are computed might improve (or not) the ranking of catchy phrases. Since I am looking for frequent phrases composed of ordinary words, as opposed to some of the top 10 phrases (composed of uncommon words – for example, “grand maester pycelle”), I might use base word frequencies in the score equation. Another interesting idea is analyzing word distribution over the whole text (here).
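Just to illustrate one direction this could take (purely an assumption on my part, not something from the linked code): divide the score by the average rarity of the phrase’s words, so that phrases carried by uncommon proper nouns like “pycelle” get pushed down while phrases made of everyday words keep their rank:

import math

def adjusted_score(phrase_words, freq, word_counts, total_words):
    """word_counts is a Counter over all words in the books; total_words is its sum."""
    # Rarity of a word = negative log of its relative frequency in the text.
    rarity = sum(-math.log(word_counts[w] / total_words) for w in phrase_words)
    rarity /= len(phrase_words)
    return len(' '.join(phrase_words)) * math.log(freq) / rarity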

To end this, let’s see how these 5 phrases are faring on Google search:
– fear cuts deeper than swords: 588 000 results
– the night is dark and full of terrors: 92 000 results
– a lannister always pays his debts: 256 000 results
– you know nothing jon snow: 462 000 results – this is the surprise result (it even has a facebook page)
– winter is coming: 14 700 000 results