Browsed by
Month: November 2023



Probably Overthinking It is available to preorder now. You can get a 30% discount if you order from the publisher and use the code UCPNEW. You can also order from Amazon or, if you want to support independent bookstores, from

Recently I read a Scientific American article about superbolts, which are lightning strikes that “can be 1,000 times as strong as ordinary strikes”. This reminded me of distributions I’ve seen of many natural phenomena (like earthquakes, asteroids, and solar flares) where the most extreme examples are thousands of times bigger than the ordinary ones. So the article about superbolts made me wonder:

  1. Whether superbolts are really a separate category, or whether they are just extreme examples from a long-tailed distribution, and
  2. Whether the distribution is well-modeled by a Student t-distribution on a log scale, like many of the examples I’ve looked at.

The SciAm article refers to this paper from 2019, which uses data from the World Wide Lightning Location Network (WWLLN). That data is not freely available, but I contacted the authors of the paper, who kindly agreed to share a histogram of data collected from 2010 to 2018, including more than a billion lightning strokes (what is called a lightning strike in common usage is an event that can include more than one stroke).

For each stroke, the dataset includes an estimate of the energy released in 1 millisecond within a certain range of frequencies, reported in Joules. The following figure shows the distribution of these measurements on a log scale, along with a lognormal model. Specifically, it shows the tail distribution, which is the fraction of the sample greater than or equal to each value.
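To make the definition of the tail distribution concrete, here is a minimal sketch of how it can be computed and compared with a lognormal model. It is not the code from my notebook: the sample is synthetic (the WWLLN data is not public), and `empirical_tail` and the parameters of the synthetic sample are hypothetical names and values chosen for illustration.

```python
import numpy as np
from scipy.stats import norm


def empirical_tail(values):
    """Return the sorted unique values and the fraction of the sample >= each one."""
    values = np.asarray(values)
    xs, counts = np.unique(values, return_counts=True)
    # A reversed cumulative sum counts the observations >= each unique value.
    tail = counts[::-1].cumsum()[::-1] / len(values)
    return xs, tail


# Synthetic stand-in for log10(energy in J); these are NOT the WWLLN measurements.
rng = np.random.default_rng(1)
log_energy = rng.normal(loc=3.0, scale=0.7, size=10_000)

xs, tail = empirical_tail(log_energy)

# A lognormal model is a Gaussian fitted to the log-transformed values.
mu, sigma = log_energy.mean(), log_energy.std()
model_tail = norm.sf(xs, mu, sigma)  # sf is the survival (tail) function

print("largest gap between data and model:", np.abs(tail - model_tail).max())
```

Plotting `tail` and `model_tail` against `xs`, first with a linear y-axis and then with a log y-axis, gives the two views of the same comparison discussed in this post.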

On the left part of the curve, there is some daylight between the data and the model, probably because low-energy strokes are less likely to be detected and measured accurately. Other than that, we could conclude that the data are well-modeled by a lognormal distribution.

But with the y-axis on a linear scale, it’s hard to tell whether the tail of the distribution fits the model. We can see the low probabilities in the tail more clearly if we put the y-axis on a log scale. Here’s what that looks like.

On this scale it’s apparent that the lognormal model seriously underestimates the frequency of superbolts. In the dataset, the fraction of strokes that exceed 10^7.9 J is about 6 per 10^9. According to the lognormal model, it would be about 3 per 10^16, so the model is off by about 7 orders of magnitude.
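The size of that discrepancy is just the ratio of the two fractions, expressed as a power of ten. A quick check, using only the two numbers quoted above (reading them as 6 per billion and 3 per ten quadrillion):

```python
import math

observed = 6 / 10**9     # fraction of strokes above the threshold in the data
predicted = 3 / 10**16   # fraction expected under the lognormal model

ratio = observed / predicted
orders = math.log10(ratio)
print(f"ratio = {ratio:.1e}, about {orders:.1f} orders of magnitude")
```

The ratio is about 20 million, or a little more than 7 orders of magnitude.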

In this previous article, I showed that a Student t-distribution on a log scale, which I call a log-t distribution, is a good model for several datasets like this one. Here’s the lightning data again with a log-t model I chose to fit the data.

With the y-axis on a linear scale, we can see that the log-t model fits the data as well as the lognormal or better. And here’s the same comparison with the y-axis on a log scale.

Here we can see that the log-t model fits the tail of the distribution substantially better. Even in the extreme tail, the data fall almost entirely within the bounds we would expect to see by chance.
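Here is a sketch of what fitting the two models can look like with SciPy. It makes several assumptions: the data are synthetic, the parameters are hypothetical, and the models are fitted by maximum likelihood, which is not necessarily how I chose the log-t model shown in the figures.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Synthetic log10 magnitudes drawn from a t distribution (hypothetical parameters).
log_x = stats.t.rvs(df=5, loc=3.0, scale=0.6, size=20_000, random_state=rng)

# Lognormal model: a Gaussian fitted to the log-transformed values.
mu, sigma = stats.norm.fit(log_x)

# Log-t model: a Student t distribution fitted to the same log-transformed values.
df, loc, scale = stats.t.fit(log_x)

# Compare the probability each model assigns to an extreme value.
threshold = 8.0  # on the log10 scale
p_norm = stats.norm.sf(threshold, mu, sigma)
p_t = stats.t.sf(threshold, df, loc, scale)

print(f"estimated df: {df:.1f}")
print(f"tail probability, lognormal: {p_norm:.2g}")
print(f"tail probability, log-t:     {p_t:.2g}")
```

In the bulk of the distribution the two fitted models are close, but far out in the tail the Gaussian probability drops off much faster than the t probability, which is the pattern visible in the log-scale figure.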

One of the researchers who provided this data explained that if you look at data collected from different regions of the world during different seasons, the distributions have different parameters. And that suggests a reason the combined magnitudes might follow a t distribution, which can be generated by a mixture of Gaussian distributions with different variance.
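The connection between mixtures and the t distribution can be checked by simulation. A standard result is that if each observation is Gaussian with its own variance, and the variances follow an inverse-gamma distribution with shape and scale both equal to half the degrees of freedom, the mixture is exactly a Student t distribution. The sketch below assumes that particular mixing distribution; I am not claiming that regional and seasonal lightning parameters actually vary this way, and the value of `nu` is hypothetical.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
nu = 4        # hypothetical degrees of freedom
n = 200_000

# Each draw gets its own variance, sampled from an inverse-gamma distribution.
variances = stats.invgamma.rvs(a=nu / 2, scale=nu / 2, size=n, random_state=rng)

# Given its variance, each value is Gaussian with mean 0.
mixture = rng.normal(0.0, np.sqrt(variances))

# Compare the mixture to a Student t distribution with nu degrees of freedom.
ks = stats.kstest(mixture, stats.t(df=nu).cdf)
print("KS distance from t distribution:", ks.statistic)

# The mixture has far more extreme values than a single Gaussian would.
print("fraction beyond 5 in the mixture:", np.mean(np.abs(mixture) > 5))
print("fraction beyond 5 for a standard Gaussian:", 2 * stats.norm.sf(5))
```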

I would not say that these data are literally generated from a t distribution. The world is more complicated than that. But if we are particularly interested in the tail of the distribution — as superbolt researchers are — this might be a useful model.

The details of my analysis are in this Jupyter notebook, which you can run on Colab.

Thanks to Professors Robert Holzworth and Michael McCarthy for sharing the data from their paper and reading a draft of this post (with the acknowledgement that any errors are my fault, not theirs).

Life in a Lognormal World

At PyData Global 2023 I will present a talk, “Extremes, outliers, and GOATs: On life in a lognormal world”. It is scheduled for Wednesday 6 December at 11 am Eastern Time.

Here is the abstract:

The fastest runners are much faster than we expect from a Gaussian distribution, and the best chess players are much better. In almost every field of human endeavor, there are outliers who stand out even among the most talented people in the world. Where do they come from?

In this talk, I present as possible explanations two data-generating processes that yield lognormal distributions, and show that these models describe many real-world scenarios in natural and social sciences, engineering, and business. And I suggest methods — using SciPy tools — for identifying these distributions, estimating their parameters, and generating predictions.

You can buy tickets for the virtual conference here. If your budget for conferences is limited, PyData tickets are sold under a pay-what-you-can pricing model, with suggested donations based on your role and location.

My talk is based partly on Chapter 4 of Probably Overthinking It and partly on an additional exploration that didn’t make it into the book.

The exploration is motivated by this paper by Philip Gingerich, which takes the heterodox view that measurements in many biological systems follow a lognormal model rather than a Gaussian. Looking at anthropometric data, Gingerich reports that the two models are equally good for 21 of 28 measurements, “but whenever alternatives are distinguishable, [the lognormal model] is consistently and strongly favored.”

I replicated his analysis with two other datasets:

  • The Anthropometric Survey of US Army Personnel (ANSUR II), available from the Open Design Lab at Penn State.
  • Results of medical blood tests from supplemental material from “Quantitative laboratory results: normal or lognormal distribution?” by Frank Klawonn, Georg Hoffmann and Matthias Orth.

I used different methods to fit the models and compare them. The details are in this Jupyter notebook.

The ANSUR dataset contains 93 measurements from 4,082 male and 1,986 female members of the U.S. armed forces. For each measurement, I found the Gaussian and lognormal models that best fit the data and computed the mean absolute error (MAE) of the models.
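As a rough illustration of this kind of comparison, the sketch below fits both models to one synthetic measurement and computes the mean absolute difference between the empirical CDF and each model CDF. This is one plausible way to define the error, assumed here for illustration; the fitting method and error metric in the notebook may differ in detail. The data are synthetic, and `cdf_mae` is a hypothetical helper.

```python
import numpy as np
from scipy import stats


def cdf_mae(sample, dist):
    """Mean absolute difference between the empirical CDF and a model CDF."""
    xs = np.sort(sample)
    ecdf = np.arange(1, len(xs) + 1) / len(xs)
    return np.mean(np.abs(ecdf - dist.cdf(xs)))


rng = np.random.default_rng(4)

# Synthetic, right-skewed "weights" in kg; hypothetical values, not ANSUR data.
weights = rng.lognormal(mean=np.log(80), sigma=0.2, size=4000)

# Gaussian model: match the mean and standard deviation of the data.
gaussian = stats.norm(weights.mean(), weights.std())

# Lognormal model: match the mean and standard deviation of the logarithms.
logs = np.log(weights)
lognormal = stats.lognorm(s=logs.std(), scale=np.exp(logs.mean()))

mae_gaussian = cdf_mae(weights, gaussian)
mae_lognormal = cdf_mae(weights, lognormal)
print(f"Gaussian MAE:  {mae_gaussian:.4f}")
print(f"lognormal MAE: {mae_lognormal:.4f}")
```

Repeating this for every measurement yields one pair of errors per measurement, which is the kind of data a scatter plot of Gaussian error against lognormal error would show.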

The following scatter plot shows one point for each measurement, with the average error of the Gaussian model on the x-axis and the average error of the lognormal model on the y-axis.

  • Points in the lower left indicate that both models are good.
  • Points in the upper right indicate that both models are bad.
  • In the upper left, the Gaussian model is better.
  • In the lower right, the lognormal model is better.

These results are consistent with Gingerich’s. For many measurements, the Gaussian and lognormal models are equally good, and for a few they are equally bad. But when one model is better than the other, it is almost always the lognormal.

The most notable example is weight:

In these figures, the gray area shows the difference between the data and the best-fitting model. On the left, the Gaussian model does not fit the data very well; on the right, the lognormal model fits so well that the gray area is barely visible.

So why should measurements like these follow a lognormal distribution? For that you’ll have to come to my talk.

In the meantime, Probably Overthinking It is available to preorder now. You can get a 30% discount if you order from the publisher and use the code UCPNEW. You can also order from Amazon or, if you want to support independent bookstores, from

We Have a Book!

My copy of Probably Overthinking It has arrived!

If you want a copy for yourself, you can get a 30% discount if you order from the publisher and use the code UCPNEW. You can also order from Amazon or, if you want to support independent bookstores, from

The official release date is December 6, but since the book is in warehouses now, it might arrive a little early. While you wait, please enjoy this excerpt from the introduction…


Let me start with a premise: we are better off when our decisions are guided by evidence and reason. By “evidence,” I mean data that is relevant to a question. By “reason” I mean the thought processes we use to interpret evidence and make decisions. And by “better off,” I mean we are more likely to accomplish what we set out to do, and more likely to avoid undesired outcomes.

Sometimes interpreting data is easy. For example, one of the reasons we know that smoking causes lung cancer is that when only 20% of the population smoked, 80% of people with lung cancer were smokers. If you are a doctor who treats patients with lung cancer, it does not take long to notice numbers like that.
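To see how strong that evidence is, we can turn the two percentages into a relative risk. The sketch below uses only the two numbers in the paragraph above and Bayes’s theorem; it assumes the percentages describe the same population, and the variable names are mine.

```python
p_smoker = 0.20               # fraction of the population who smoked
p_smoker_given_cancer = 0.80  # fraction of lung cancer patients who were smokers

# By Bayes's theorem, P(cancer | group) is proportional to
# P(group | cancer) / P(group), so the unknown P(cancer) cancels in the ratio.
lift_smokers = p_smoker_given_cancer / p_smoker
lift_nonsmokers = (1 - p_smoker_given_cancer) / (1 - p_smoker)

relative_risk = lift_smokers / lift_nonsmokers
print(f"relative risk: about {relative_risk:.0f}")
```

With these numbers, smokers are about 16 times more likely than nonsmokers to have lung cancer.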

But interpreting data is not always that easy. For example, in 1971 a researcher at the University of California, Berkeley, published a paper about the relationship between smoking during pregnancy, the weight of babies at birth, and mortality in the first month of life. He found that babies of mothers who smoke are lighter at birth and more likely to be classified as “low birthweight.” Also, low-birthweight babies are more likely to die within a month of birth, by a factor of 22. These results were not surprising.

However, when he looked specifically at the low-birthweight babies, he found that the mortality rate for children of smokers is lower, by a factor of two. That was surprising. He also found that among low-birthweight babies, children of smokers are less likely to have birth defects, also by a factor of 2. These results make maternal smoking seem beneficial for low-birthweight babies, somehow protecting them from birth defects and mortality.

The paper was influential. In a 2014 retrospective in the International Journal of Epidemiology, one commentator suggests it was responsible for “holding up anti-smoking measures among pregnant women for perhaps a decade” in the United States. Another suggests it “postponed by several years any campaign to change mothers’ smoking habits” in the United Kingdom. But it was a mistake. In fact, maternal smoking is bad for babies, low birthweight or not. The reason for the apparent benefit is a statistical error I will explain in chapter 7.

Among epidemiologists, this example is known as the low-birthweight paradox. A related phenomenon is called the obesity paradox. Other examples in this book include Berkson’s paradox and Simpson’s paradox. As you might infer from the prevalence of “paradoxes,” using data to answer questions can be tricky. But it is not hopeless. Once you have seen a few examples, you will start to recognize them, and you will be less likely to be fooled. And I have collected a lot of examples.

So we can use data to answer questions and resolve debates. We can also use it to make better decisions, but it is not always easy. One of the challenges is that our intuition for probability is sometimes dangerously misleading. For example, in October 2021, a guest on a well-known podcast reported with alarm that “in the U.K. 70-plus percent of the people who die now from COVID are fully vaccinated.” He was correct; that number was from a report published by Public Health England, based on reliable national statistics. But his implication, that the vaccine is useless or actually harmful, is wrong.

As I’ll show in chapter 9, we can use data from the same report to compute the effectiveness of the vaccine and estimate the number of lives it saved. It turns out that the vaccine was more than 80% effective at preventing death and probably saved more than 7000 lives, in a four-week period, out of a population of 48 million. If you ever find yourself with the opportunity to save 7000 people in a month, you should take it.

The error committed by this podcast guest is known as the base rate fallacy, and it is an easy mistake to make. In this book, we will see examples from medicine, criminal justice, and other domains where decisions based on probability can be a matter of health, freedom, and life.
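A small calculation shows how the base rate produces this effect. The inputs below are hypothetical round numbers, not the figures from the Public Health England report, and the sketch assumes that vaccinated and unvaccinated people would otherwise face the same risk; `vaccinated_share_of_deaths` is a name I made up for the example.

```python
def vaccinated_share_of_deaths(coverage, effectiveness):
    """Fraction of deaths that occur among vaccinated people.

    coverage: fraction of the at-risk population that is vaccinated
    effectiveness: fractional reduction in the risk of death due to the vaccine
    Assumes both groups would have the same risk without the vaccine.
    """
    vaccinated_deaths = coverage * (1 - effectiveness)
    unvaccinated_deaths = 1 - coverage
    return vaccinated_deaths / (vaccinated_deaths + unvaccinated_deaths)


# Hypothetical inputs: 95% of the at-risk group vaccinated, 80% effectiveness.
share = vaccinated_share_of_deaths(coverage=0.95, effectiveness=0.80)
print(f"{share:.0%} of deaths are among the vaccinated")
```

With these inputs, about 79% of deaths are among vaccinated people, even though the vaccine cuts each person’s risk by 80%. When nearly everyone is vaccinated, most deaths occur among the vaccinated simply because there are so many more of them.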

The Ground Rules

Not long ago, the only statistics in newspapers were in the sports section. Now, newspapers publish articles with original research, based on data collected and analyzed by journalists, presented with well-designed, effective visualization. And data visualization has come a long way. When USA Today started publishing in 1982, the infographics on their front page were a novelty. But many of them presented a single statistic, or a few percentages in the form of a pie chart.

Since then, data journalists have turned up the heat. In 2015, “The Upshot,” an online feature of the New York Times, published an interactive, three-dimensional representation of the yield curve, a notoriously difficult concept in economics. I am not sure I fully understand this figure, but I admire the effort, and I appreciate the willingness of the authors to challenge the audience. I will also challenge my audience, but I won’t assume that you have prior knowledge of statistics beyond a few basics. Everything else, I’ll explain as we go.

Some of the examples in this book are based on published research; others are based on my own observations and exploration of data. Rather than report results from a prior work or copy a figure, I get the data, replicate the analysis, and make the figures myself. In some cases, I was able to repeat the analysis with more recent data. These updates are enlightening. For example, the low-birthweight paradox, which was first observed in the 1970s, persisted into the 1990s, but it has disappeared in the most recent data.

All of the work for this book is based on tools and practices of reproducible science. I wrote each chapter in a Jupyter notebook, which combines the text, computer code, and results in a single document. These documents are organized in a version-control system that helps to ensure they are consistent and correct. In total, I wrote about 6000 lines of Python code using reliable, open-source libraries like NumPy, SciPy, and pandas. Of course, it is possible that there are bugs in my code, but I have tested it to minimize the chance of errors that substantially affect the results.

My Jupyter notebooks are available online so that anyone can replicate the analysis I’ve done with the push of a button.