For the past couple of months I’ve been reading George R. R. Martin’s A Song of Ice and Fire. It’s a fantasy series, set in a medieval-like world, with a twisting plot and lots of catchy phrases. The most popular phrase is “Winter is coming”, always used in key moments to emphasise its double meaning. Another one I caught is “A Lannister always pays his debts”, the Lannisters being (duh) the bad guys. But... are there more such phrases?
Well, let’s find out!
I downloaded the first 4 books and split the text into words. I then built a tree in order to easily determine the frequent phrases (those with a minimum frequency of 10 occurrences and a minimum length of 3 words). After constructing the tree and computing the frequencies, I moved the phrases into a list so I could sort them by score (score = phrase’s length in characters * log(frequency)). Check the code here.
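For readers who want to try this, here is a minimal sketch of the steps above; it is not the original code. It assumes the tree is a word-level prefix tree where each node counts how often the phrase ending at that node occurs. The names `tokenize`, `build_tree` and `frequent_phrases`, the `max_len` cap on phrase length, and the choice to lowercase and drop apostrophes (which would explain “the nights watch” in the results) are all my own assumptions.

```python
import math
import re


def tokenize(text):
    # Assumption: lowercase, drop apostrophes, keep runs of letters only.
    cleaned = text.lower().replace("'", "").replace("’", "")
    return re.findall(r"[a-z]+", cleaned)


def build_tree(words, max_len=8):
    # Each node is {"count": int, "children": {word: node}}.
    # Every position in the text starts a path of up to max_len words,
    # so a node's count is the frequency of the phrase spelled by its path.
    root = {"count": 0, "children": {}}
    for start in range(len(words)):
        node = root
        for word in words[start:start + max_len]:
            node = node["children"].setdefault(
                word, {"count": 0, "children": {}}
            )
            node["count"] += 1
    return root


def frequent_phrases(root, min_freq=10, min_len=3):
    # Walk the tree and collect (score, phrase, frequency) triples,
    # with score = phrase length in characters * log(frequency).
    results = []
    stack = [(root, [])]
    while stack:
        node, path = stack.pop()
        for word, child in node["children"].items():
            if child["count"] < min_freq:
                continue  # a longer phrase can never be more frequent
            phrase_words = path + [word]
            if len(phrase_words) >= min_len:
                phrase = " ".join(phrase_words)
                score = len(phrase) * math.log(child["count"])
                results.append((score, phrase, child["count"]))
            stack.append((child, phrase_words))
    results.sort(reverse=True)
    return results
```

Calling `frequent_phrases(build_tree(tokenize(text)))` on the full text returns the phrases sorted from highest to lowest score. The base of the logarithm does not change the ranking, only the scale of the scores.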
So.. what are the top phrases?
1. lord commander of the kingsguard (24 occurrences)
2. grand maester pycelle (111)
3. the lord commander of the kingsguard (14)
4. the seven kingdoms (190)
5. the children of the forest (37)
6. of the nights watch (108)
7. the nights watch (244)
8. his mouth with the back of his hand (12)
9. prince aemon the dragonknight (20)
10. of the seven kingdoms (62)
Plotting each phrase’s score against its rank in the sorted list, we get a curve that looks like Zipf’s PMF (woohoo):
Analyzing the top 10 phrases, we see that 9 are characters/places/groups, while number 8 is.. plain weird. Not quite what I was expecting. Further down we get some of the catchy phrases I was looking for:
12. fear cuts deeper than swords (22 occurrences)
16. the night is dark and full of terrors (10)
24. a lannister always pays his debts (12)
40. you know nothing jon snow (20)
…..
1038. winter is coming (18)
Changing the way scores are computed might improve (or not) the ranking of catchy phrases. Since I am looking for frequent phrases made up of ordinary words, as opposed to some of the top 10 phrases (made up of uncommon words, for example “grand maester pycelle”), I might use base word frequencies in the score equation. Another interesting idea is analyzing word distribution over the whole text (here).
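I haven’t settled on a formula, but one possible way of bringing base word frequencies into the score could look like the sketch below. The function name `commonness_weighted_score` and the weighting itself (multiplying the original score by the mean log frequency of the phrase’s words) are purely hypothetical and untested on the books.

```python
import math
from collections import Counter


def commonness_weighted_score(phrase, phrase_freq, word_counts):
    # Hypothetical variant of: length in characters * log(frequency).
    # The extra factor is the mean log count of the individual words,
    # so phrases built from ordinary (frequent) words are boosted.
    words = phrase.split()
    mean_log = sum(math.log(word_counts[w]) for w in words) / len(words)
    return len(phrase) * math.log(phrase_freq) * mean_log


# Toy counts, invented for illustration only (not taken from the books).
toy_counts = Counter({"a": 1000, "b": 1000, "c": 1000,
                      "x": 10, "y": 10, "z": 10})
common = commonness_weighted_score("a b c", 20, toy_counts)
rare = commonness_weighted_score("x y z", 20, toy_counts)
```

With the same length and the same phrase frequency, `common` ends up larger than `rare`, which is the direction I am after. Whether this actually pushes “winter is coming” up the list would have to be checked on the real data.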
To end this, let’s see how these 5 phrases are faring on Google search:
– fear cuts deeper than swords: 588 000 results
– the night is dark and full of terrors: 92 000 results
– a lannister always pays his debts: 256 000 results
– you know nothing jon snow: 462 000 results – this is the surprise result (it even has a Facebook page)
– winter is coming: 14 700 000 results
Do you know I’ve also started reading the book? :D I am almost done with the first volume! Regarding the mining, I think some sentiment analysis involving different characters would be interesting. As a side note, I think the Lannisters are not the "bad" people in the book! Only Jaime Lannister and that sister of his! 😀 I actually appreciate Tyrion. 😛
Tyrion is sort of the evil genius, smart and crippled, always plotting against others. I appreciate his cunning, but he is still a bad guy. Also, you forgot Joffrey. I think I might try some sentiment analysis in a later project. Or maybe some topic detection starting from the main characters (using Latent Dirichlet Allocation). Really curious to see LDA in action.