Zipf's Law and Why Content Must Be Interesting

If you read a few of my blog posts, you’ll probably come across the idea of interesting content pretty soon. And it’s not just me, either: polyglots like Olly Richards, Steve Kaufmann, and many others have made a big deal about interesting content as well.

But why is this?

Well, the most obvious answer is that it’s interesting. It makes a lot of sense that if you’re reading, watching, or listening to something that interests you, you’re going to enjoy it more. This has a number of important results:

  • You’re more likely to keep going

  • You’re more likely to read, listen, and watch more

  • Because you’re enjoying it, you’re more likely to learn and to remember

These are all good reasons, and I can’t stress enough that they are completely correct. I could go into them in more detail, but others have already done so (if you want me to anyway, send me an email at alexander.decodinglatin.org and let me know).

But there’s another reason that isn’t highlighted nearly as often, and which I think is equally important. In fact, I’d say that this reason is strong enough to insist on interesting content even if none of the other reasons applied. That reason is Zipf’s Law.

Okay fine, not that kind of law…

Zipf’s law is a law of probability first proposed by American linguist George Kingsley Zipf, which effectively says that the number of times a word will occur in a given piece or work corresponds to a rule. The rule can be expressed two ways.

The first way says that (roughly) the words in a language occur with a frequency of 0.1/r (where r means rank). I know, you didn’t think you’d get a maths lesson, but bear with me (if it’s REALLY uninteresting, skip to the next paragraph). What this means is that the most common word in a language (in English it’s ‘the’) will have a frequency of roughly 0.1/1, which means it turns up about 10% of the time. So roughly 10% of all words in an English book will be ‘the’. The second most common word will have a frequency of 0.1/2, or 5%. And so on.
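If you'd rather see the numbers than do the division in your head, here's a minimal sketch in Python of that first way of putting it. The function name zipf_share is just my own label for illustration, and 0.1 is the rough constant from the paragraph above; real texts only follow this approximately.

```python
def zipf_share(rank, constant=0.1):
    """Approximate share of all words in a text taken up by the word at this rank.

    The default constant of 0.1 is the rough figure used above, not a measured value.
    """
    if rank < 1:
        raise ValueError("rank must be 1 or greater")
    return constant / rank


# Print the approximate share for the five most common words.
for rank in range(1, 6):
    print(f"rank {rank}: about {zipf_share(rank):.1%} of all words")
```

Running this prints about 10.0% for rank 1, 5.0% for rank 2, 3.3% for rank 3, 2.5% for rank 4, and 2.0% for rank 5.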

What you can see from this is that words drop in frequency pretty quickly, right? Well, that’s not all. The law can also be expressed in terms of the most common word in any given text. Looked at this way, if the most common word in a book occurs 1,000 times, then the second most common word will appear half as often (1/2), or 500 times, the third most common will appear a third as often (1/3), or roughly 330 times, the fourth a quarter as often (1/4), or 250 times, and so on. So the 50th most common word would only appear 20 times.
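The same arithmetic can be sketched in a few lines of Python. The name expected_count is hypothetical (my own, for illustration), and the 1,000 is simply the example figure from above, not a count from any real book.

```python
def expected_count(top_count, rank):
    """Rough number of times the word at this rank appears,
    given how often the most common word appears."""
    if rank < 1:
        raise ValueError("rank must be 1 or greater")
    return top_count / rank


# The example from the text: the most common word appears 1,000 times.
for rank in (1, 2, 3, 4, 50):
    print(f"rank {rank}: about {round(expected_count(1000, rank))} times")
```

This gives about 1000, 500, 333, 250, and 20 occurrences for ranks 1, 2, 3, 4, and 50.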

The result of this is that the first few words come up a lot, so they’re pretty easy to learn, but the rest don’t come up all that much. In fact, it pretty much means every word is unique, every word is rare (except for those top words that we just can’t get away from).

So what does any of this have to do with language learning? Two things:

  1. First, it means that the most common words are going to come up a lot in pretty much anything you read, so you don’t really need to worry about them all that much - you’ll learn them pretty quickly, and if not, you’ll still get lots of practice.

  2. Since, after those top words, every word becomes pretty much unique, the vocabulary that you’ll learn is directly related to the content you absorb.

Rule number 1 is great for us: it means we don’t need to stress about learning some list of the most common words. Rule number 2 has a rather different, rather interesting, conclusion.

Rule number 2 says that the words of a text are going to be peculiar to that text, that author, that genre, that subject-matter, and so on. Which means that if you read a lot of crime fiction, you’re going to learn the words for ‘murder’, ‘dead’, ‘clue’, and so on pretty quickly. The odds of you learning these words in a mathematics textbook, on the other hand, are pretty slim, let’s be honest. But maybe not impossible…

So what does this mean for content? It means the only logical way to learn is by absorbing content that’s interesting to you. Why? Because if it’s interesting to you, odds are it’ll contain a lot of the words and concepts that you already use in your native language. For example, if I love talking about music, I shouldn’t read a book about oratory, because it’s just not going to give me the words that I’m most likely to use - I never talk about oratory, so learning a whole lot of words about oratory isn’t going to feel like huge progress. But if I learnt the same number of words about music in Latin, then I’d feel like I’d learnt a whole lot more, because suddenly I’d be a whole lot closer to expressing my own interests and loves in Latin.

Make sense?

The short version is this: interesting content provides the most efficient way of learning vocabulary and grammar. Anything else is a waste of time.

So go find some interesting content!

P.S. You might object that what’s really interesting to you is too hard - well, I talk about how to read any Latin text in my book Decoding Latin: A User’s Guide, which you can get here or, if you sign up for my mailing list, you can get for free!