## War and Peace and Zipf’s Law

Elements of Data Scienceis in print now, available from Lulu.com and online booksellers. To celebrate, I’ll post some excerpts here, starting with one of my favorite examples, Zipf’s Law. You can read the complete chapter here, or run the Jupyter notebook on Colab.

In almost any book, in almost any language, if you count the number of unique words and the number of times each word appears, you will find a remarkable pattern: the most common word appears twice as often as the second most common – at least approximately – three times as often as the third most common, and so on.

In general, if we sort the words in descending order of frequency, there is an inverse relationship between the rank of the words – first, second, third, etc. – and the number of times they appear. This observation was most famously made by George Kingsley Zipf, so it is called Zipf’s law.

To see if this law holds for the words in *War and Peace*, we’ll make a Zipf plot, which shows:

- The frequency of each word on the y-axis, and
- The rank of each word on the x-axis, starting from 1.

In the previous chapter, we looped through the book and made a string that contains all punctuation characters. Here are the results, which we will need again.

all_punctuation = ',.-:[#]*/“’—‘!?”;()%@'

The following program reads through the book and makes a dictionary that maps from each word to the number of times it appears.

fp = open('2600-0.txt') for line in fp: if line.startswith('***'): break unique_words = {} for line in fp: if line.startswith('***'): break for word in line.split(): word = word.lower() word = word.strip(all_punctuation) if word in unique_words: unique_words[word] += 1 else: unique_words[word] = 1

In `unique_words`

, the keys are words and the values are their frequencies. We can use the `values`

function to get the values from the dictionary. The result has the type `dict_values`

:

freqs = unique_words.values() type(freqs)

dict_values

Before we plot them, we have to sort them, but the `sort`

function doesn’t work with `dict_values`

.

%%expect AttributeError freqs.sort()

AttributeError: 'dict_values' object has no attribute 'sort'

We can use `list`

to make a list of frequencies:

freq_list = list(unique_words.values()) type(freq_list)

list

And now we can use `sort`

. By default it sorts in ascending order, but we can pass a keyword argument to reverse the order.

freq_list.sort(reverse=True)

Now, for the ranks, we need a sequence that counts from 1 to `n`

, where `n`

is the number of elements in `freq_list`

. We can use the `range`

function, which returns a value with type `range`

. As a small example, here’s the range from 1 to 5.

range(1, 5)

range(1, 5)

However, there’s a catch. If we use the range to make a list, we see that “the range from 1 to 5” includes 1, but it doesn’t include 5.

list(range(1, 5))

[1, 2, 3, 4]

That might seem strange, but it is often more convenient to use `range`

when it is defined this way, rather than what might seem like the more natural way. Anyway, we can get what we want by increasing the second argument by one:

list(range(1, 6))

[1, 2, 3, 4, 5]

So, finally, we can make a range that represents the ranks from `1`

to `n`

:

n = len(freq_list) ranks = range(1, n+1) ranks

range(1, 20484)

And now we can plot the frequencies versus the ranks:

plt.plot(ranks, freq_list)

plt.xlabel('Rank')

plt.ylabel('Frequency')

plt.title("War and Peace and Zipf's law");

According to Zipf’s law, these frequencies should be inversely proportional to the ranks. If that’s true, we can write:

f = k / r

where r is the rank of a word, f is its frequency, and k is an unknown constant of proportionality. If we take the logarithm of both sides, we get

log f = log k – log r

This equation implies that if we plot f versus r on a log-log scale, we expect to see a straight line with intercept at log k and slope -1.

## 6.6. Logarithmic Scales

We can use `plt.xscale`

to plot the x-axis on a log scale.

plt.plot(ranks, freq_list) plt.xlabel('Rank') plt.ylabel('Frequency') plt.title("War and Peace and Zipf's law") plt.xscale('log')

And `plt.yscale`

to plot the y-axis on a log scale.

plt.plot(ranks, freq_list) plt.xlabel('Rank') plt.ylabel('Frequency') plt.title("War and Peace and Zipf's law") plt.xscale('log') plt.yscale('log')

The result is not quite a straight line, but it is close. We can get a sense of the slope by connecting the end points with a line. First, we’ll select the first and last elements from `xs`

.

xs = ranks[0], ranks[-1] xs

(1, 20483)

And the first and last elements from `ys`

.

ys = freq_list[0], freq_list[-1] ys

(34389, 1)

And plot a line between them.

plt.plot(xs, ys, color='gray') plt.plot(ranks, freq_list) plt.xlabel('Rank') plt.ylabel('Frequency') plt.title("War and Peace and Zipf's law") plt.xscale('log') plt.yscale('log')

The slope of this line is the “rise over run”, that is, the difference on the y-axis divided by the difference on the x-axis. We can compute the rise using `np.log10`

to compute the log base 10 of the first and last values:

np.log10(ys)

array([4.53641955, 0. ])

Then we can use `np.diff`

to compute the difference between the elements:

rise = np.diff(np.log10(ys)) rise

array([-4.53641955])

**Exercise:** Use `log10`

and `diff`

to compute the run, that is, the difference on the x-axis. Then divide the rise by the run to get the slope of the grey line. Is it close to -1, as Zipf’s law predicts? Hint: yes.