Gaussian Archives - Probably Overthinking It

Greatest GOAT of All Time?

June 27, 2026 AllenDowney

A recent post claims that the “most statistically dominant athlete” of all time was cricketer Don Bradman. It’s a bold claim – let’s see if it holds up.

In Chapter 4 of Probably Overthinking It, I listed a few examples of athletes who are considered the Greatest Of All Time (GOAT), and noted that in many cases they are not just a little better than the second-best, but much better. That is, they are outliers among outliers.

And I suggested that part of the explanation for this phenomenon is that the distribution of accomplishment in many fields (at least, the ones where accomplishment can be quantified) follows a lognormal distribution. That matters because the lognormal distribution has a long tail, which means that extreme values can be much farther from the mean than we would see in a Gaussian (normal) distribution.

So I was intrigued by this post about Don Bradman:

The most statistically dominant athlete ever isn’t Jordan or Messi — and you’ve probably never heard of him. 🏏

I plotted the career batting averages of the greatest Test cricketers of all time. They make a clean bell curve. In cricket, averaging 50 makes you an immortal legend, and the all-time record holders top out around 60 — about 2 standard deviations above the mean.

Then there’s Don Bradman: 99.94. More than 6 standard deviations out. The gap between him and the SECOND best player is bigger than the gap between #2 and the entire average. By the math, a batsman this good should appear about once in 7 billion.

To be this far ahead in basketball, Michael Jordan would’ve had to average 43 points in every single game he ever played. No one in any sport — not Jordan, not Gretzky, not Messi — is this far from their peers.

Not just the greatest. The single biggest outlier in the history of sport.

Let’s see if that’s true. In particular, I’ll investigate the “clean bell curve” – it implies a Gaussian model of the data, which is where that “once in 7 billion” comes from.

To give away the ending, here’s what I found:

The data don’t fit a Gaussian particularly well, especially in the tail, so the “one in 7 billion” might not be right.
I thought they might fit a lognormal better, but I was wrong.
It turns out that the data fit a Weibull distribution very well, and from that we can get a revised estimate of how much of a GOAT Bradman was.

As it turns out, he was a pretty goaty GOAT – but not quite one in 7 billion.

Load the data

I got the data from Kaggle (runs_of_batsmen.csv).

import pandas as pd

DATA_PATH = "runs_of_batsmen.csv"

df = pd.read_csv(DATA_PATH)

# Convert to numeric, coercing errors to NaN
df['Batting Average'] = pd.to_numeric(df['Batting Average'], errors='coerce')
df['Innings'] = pd.to_numeric(df['Innings'], errors='coerce')

I select only batters with 10 or more innings.

df = df.query('Innings >= 10')

Here’s what the distribution of batting averages looks like.

import seaborn as sns

sns.kdeplot(df['Batting Average'])
decorate()

_images/9e6e3e754fa48807f6d4f5b72552714ed01e3d2dec5cfb8bb37abb5ad845d7aa.png

That’s not much of a bell curve. It’s pretty clearly skewed to the right, even if we leave Bradman out of it. But let’s fit a Gaussian to it anyway.

The Gaussian Model

There are a lot of ways to fit a model to data. The one I like for applications like this is percentile matching – that is, finding a model that minimizes the average vertical distance between the empirical CDF of the data and the CDF of the model. That’s what fit_normal does.
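To make the idea concrete, here is a minimal sketch of percentile matching that uses only the standard library. It is not the implementation of fit_normal, which is not shown here; the names average_cdf_error and fit_normal_sketch, the synthetic data, and the simple pattern search are all assumptions for illustration.

```python
import random
from statistics import NormalDist, mean, stdev


def average_cdf_error(values, mu, sigma):
    """Mean vertical distance between the empirical CDF and a Gaussian CDF."""
    model = NormalDist(mu, sigma)
    xs = sorted(values)
    n = len(xs)
    # The empirical CDF at the i-th smallest value is (i + 1) / n.
    return sum(abs((i + 1) / n - model.cdf(x)) for i, x in enumerate(xs)) / n


def fit_normal_sketch(values, rounds=40):
    """Pattern search over (mu, sigma), starting from the sample moments."""
    mu, sigma = mean(values), stdev(values)
    best = average_cdf_error(values, mu, sigma)
    step = sigma / 4
    for _ in range(rounds):
        improved = False
        for dmu, dsigma in [(step, 0), (-step, 0), (0, step), (0, -step)]:
            cand_mu, cand_sigma = mu + dmu, sigma + dsigma
            if cand_sigma <= 0:
                continue
            err = average_cdf_error(values, cand_mu, cand_sigma)
            if err < best:
                mu, sigma, best, improved = cand_mu, cand_sigma, err, True
        if not improved:
            step /= 2  # no neighbor is better, so refine the search
    return mu, sigma, best


random.seed(1)
# Synthetic right-skewed data standing in for batting averages (hypothetical).
sample = [random.weibullvariate(25, 1.9) for _ in range(1000)]
start_error = average_cdf_error(sample, mean(sample), stdev(sample))
mu_hat, sigma_hat, fit_error = fit_normal_sketch(sample)
print(f"moments: {start_error:.4f}  percentile matching: {fit_error:.4f}")
```

Because the search starts from the sample mean and standard deviation and only accepts improvements, the fitted error can never be worse than the error of the moment-based fit. A general-purpose optimizer would do the same job with less code.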

avg_series = df["Batting Average"].dropna()
gaussian_model = fit_normal(avg_series)

Here’s what that looks like, plotting the tail distribution, which is the complement of the CDF. The shaded area shows the differences between the data and the model.

tail_data = TailDist.from_seq(avg_series)
plot_fit_with_area(tail_data, gaussian_model, kind="tail")
decorate(xlabel="Batting Average", title='Gaussian model')

_images/4b2c95fa3fda1563ca2714d6361e97b106ed1984db9479b168798ca548a9d940.png

This is not a great fit. There are clear differences in the shapes of the tail distributions.

And here’s the average error, which is related to the area between the curves.

mae_normal = average_error(avg_series, gaussian_model)
mae_normal

0.017745997600540485

When we plot the tail distributions on linear scales, it looks like the model might be good enough. The problem is that we can’t see what’s happening in the tail. For that, it’s useful to plot the tail distributions on a log-y scale.

import numpy as np

n = len(avg_series)
qs = np.linspace(tail_data.qs.min(), tail_data.qs.max(), 200)
tail_model = TailDist(gaussian_model.sf(qs), index=qs)

plot_model_bounds(gaussian_model, n=n, qs=qs, kind="tail", color="gray", alpha=0.2)
tail_model.plot(label="model")
tail_data.plot(label="data")
decorate(yscale="log", title='Gaussian model (log-y scale)')

_images/b8bccf875d3216eb826852be8de578e37cbfb0909184ce61df747e79bef1db5e.png

The model fits the left side of the distribution well enough, but after that, it diverges badly. Around 60, the difference between the model and the data is about an order of magnitude.

Nevertheless, we can use the Gaussian model to compute the probability of a batting average as high as Bradman’s 99.94.

BRADMAN = 99.94


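def probability_of_exceeding(model, x, n):
    """Return 'one in N' odds of exceeding x: for one draw, and for at least one of n draws."""
    p = model.cdf(x)                # probability that a single batter falls below x
    one_in = 1 / (1 - p)            # single batter: 1 / P(X > x)
    in_sample = 1 / (1 - p ** n)    # 1 / P(at least one of n batters exceeds x)
    return one_in, in_sample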

In the fitted distribution, the probability of a single batter achieving Bradman’s average is about one per 5.7 trillion.

one_in, in_sample = probability_of_exceeding(gaussian_model, BRADMAN, n)
one_in / 1e12

5.71160383940456

And even if we account for the sample size, the probability that at least one of 3438 batsmen reaches that level is about one in 1.7 billion. So, according to the Gaussian model, this outcome is basically impossible.

n, in_sample / 1e9

(3438, 1.6613158346144736)
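As a check on the arithmetic: if the chance that a single batter exceeds the threshold is q, the chance that at least one of n independent batters does is 1 - (1 - q)^n, which is close to n q when q is tiny. The sketch below plugs in the numbers reported above. The function name is mine, and it uses log1p and expm1 only to limit rounding error when q is very small.

```python
import math


def one_in_for_sample(one_in, n):
    """Convert single-draw odds (one in `one_in`) to odds for at least one of n draws."""
    q = 1 / one_in                      # P(one batter exceeds the threshold)
    # P(at least one of n exceeds) = 1 - (1 - q)**n, computed stably.
    p_any = -math.expm1(n * math.log1p(-q))
    return 1 / p_any


gaussian_one_in = 5.71160383940456e12   # value reported above
n = 3438
in_sample = one_in_for_sample(gaussian_one_in, n)
print(in_sample / 1e9)                  # close to gaussian_one_in / n / 1e9
```

So with odds this long, accounting for the sample size amounts to dividing the single-batter figure by the number of batters.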

But the Gaussian model doesn’t fit the data well – and there’s no reason it should. To see why not, let’s think about the process that generates the distribution of batting averages.

As a simplifying assumption, suppose a batsman has the same probability of getting out at any time; in that case, the number of runs in an inning might follow a negative binomial (NB) distribution. As we add up innings (or average over them), the total would eventually converge to a normal distribution, but most batsmen don’t have enough innings for that convergence. So for each batsman, the distribution of runs follows something between NB and normal. And when we combine batsmen, we get a mixture of those hybrids. There’s no obvious reason the results should fit a simple mathematical model.

But it turns out that they do.

The Weibull Model

The Weibull distribution is not the first thing I thought of – I tried a lognormal distribution first. But a Weibull distribution fits the data really, really well. The function fit_scipy_dist is a more general version of fit_normal that works with any of the SciPy distributions.

from scipy.stats import weibull_min

weibull_model = fit_scipy_dist(avg_series, weibull_min)
weibull_model.args

(1.8667199063293203, 0.9939717502596944, 22.486446489455503)

Here’s the result. Again, the gray area shows the difference between the data and the model.

plot_fit_with_area(tail_data, weibull_model, kind="tail")
decorate(xlabel="Batting Average", title="Weibull model")

_images/2c9b224b40833eea766b5d0cd386fc20182601b3ae5245f8f68f7ffca163a4cf.png

The gray area is too small to see. Here’s the average error.

mae_weibull = average_error(avg_series, weibull_model)
mae_weibull

0.002555441381799216

The average error of the Gaussian model is about 7x bigger.

mae_normal / mae_weibull

6.944396270223196

Here’s what the tail looks like on a log-y axis.

tail_model = TailDist(weibull_model.sf(qs), index=qs)

plot_model_bounds(weibull_model, n=n, qs=qs, kind="tail", color="gray", alpha=0.2)
tail_model.plot(label="model")
tail_data.plot(label="data")
decorate(yscale="log", title="Weibull model (log-y scale)")

_images/b42367fb1469d102aa278bd0b6f4c1a339ea7f78a8919fd2da3d27b61c2e3fc3.png

The model fits the data well – within the bounds of variability we expect for this sample size – except for Bradman, who is still an outlier.

Again, we can compute the probability that any batter exceeds Bradman’s level:

one_in, in_sample = probability_of_exceeding(weibull_model, BRADMAN, n)
one_in / 1e6

7.980438255436601

It’s about one per 8 million. And in a sample of 3438 batsmen, the chance that any one of them reaches that level is one per 2,322.

in_sample / 1e3

2.3217442928624186
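The one-in-8-million figure can be reproduced from the fitted parameters alone. SciPy's weibull_min takes a shape c, a location, and a scale, and its survival function is exp(-((x - loc) / scale)^c). The sketch below evaluates that formula with the standard library. The function name weibull_sf is mine, and I am assuming the three fitted values are (c, loc, scale) in that order, which is SciPy's usual convention.

```python
import math


def weibull_sf(x, c, loc, scale):
    """Survival function P(X > x) of a three-parameter Weibull distribution."""
    if x <= loc:
        return 1.0
    return math.exp(-(((x - loc) / scale) ** c))


# Fitted parameters reported above, assumed to be (shape, loc, scale).
c, loc, scale = 1.8667199063293203, 0.9939717502596944, 22.486446489455503
BRADMAN = 99.94

one_in = 1 / weibull_sf(BRADMAN, c, loc, scale)
print(one_in / 1e6)  # roughly 8, in line with the result above
```

Bradman's average is about 4.4 scale units above the location parameter, and raising that to the power 1.87 gives an exponent near 16, which is where the one-in-millions odds come from.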

So Bradman is still an outlier among outliers, but not quite the statistical anomaly that he would be in a normal distribution.

Finally, let’s think about why a Weibull distribution fits this dataset so well. In the data-generating process I suggested earlier, if a batsman has the same probability of getting out at any time, the number of runs in an inning might follow a negative binomial (NB) distribution. If we think of the run-generating process as continuous, the distribution of runs before an out would be exponential. And if we relax the assumption that the hazard rate is constant, the distribution of runs per inning would be Weibull.
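To see what relaxing the constant-hazard assumption means, here is a small sketch of the Weibull hazard function, h(x) = (k / λ) (x / λ)^(k - 1), where k is the shape and λ is the scale. With k = 1 it reduces to the constant hazard of the exponential distribution; with k > 1 the hazard rises with x, and with k < 1 it falls. The function name, the parameter values, and the evaluation points are assumptions for illustration, and the location parameter is left out for simplicity.

```python
def weibull_hazard(x, k, lam):
    """Hazard rate of a Weibull distribution with shape k and scale lam."""
    return (k / lam) * (x / lam) ** (k - 1)


xs = [5, 10, 20, 40]
constant = [weibull_hazard(x, 1.0, 20.0) for x in xs]   # exponential case
rising = [weibull_hazard(x, 1.5, 20.0) for x in xs]
falling = [weibull_hazard(x, 0.7, 20.0) for x in xs]
print(constant, rising, falling, sep="\n")
```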

I think that’s an intriguing first step, but at best it explains the distribution of runs per inning, not the distribution of batting averages across batsmen with, presumably, different hazard rates.

Appendix: The Lognormal Model

Just for completeness, here’s the lognormal model.

from scipy.stats import norm

log_series = np.log10(avg_series)
log_model = fit_normal(log_series, x0=norm.fit(log_series))
tail_log_data = TailDist.from_seq(log_series)

plot_fit_with_area(tail_log_data, log_model, kind="tail")
decorate(xlabel="log10(Batting Average)", title="Lognormal model")

_images/6344bb082e43613d7cc8910ef8f34769e420045404b8d576f8382bc64b0c8bc2.png

The average error is slightly worse than that of the Gaussian model.

mae_lognormal = average_error(log_series, log_model)
mae_lognormal

0.01885918996726559

And it doesn’t fit the tail well at all.

log_qs = np.log10(qs)
tail_model = TailDist(log_model.sf(log_qs), index=log_qs)

plot_model_bounds(log_model, n=n, qs=log_qs, kind="tail", color="gray", alpha=0.2)
tail_model.plot(label="model")
tail_log_data.plot(label="data")
decorate(xlabel="log10(Batting Average)", yscale="log", title="Lognormal model (log-y scale)")

_images/5d0319f985ddb2ca9f541683b89a467e19b0e12c468c4c00b5a969111401be6b.png

How Gaussian Is It?

May 2, 2022 AllenDowney

This article is an excerpt from the current draft of my book Probably Overthinking It, to be published by the University of Chicago Press in early 2023.
If you would like to receive infrequent notifications about the book (and possibly a discount), please sign up for this mailing list.
This book is intended for a general audience, so I explain some things that might be familiar to readers of this blog – and I leave out the Python code. After the book is published, I will post the Jupyter notebooks with all of the details!

How tall are you? How long are your arms? How far is it from the radiale landmark on your right elbow to the stylion landmark on your right wrist?

You might not know that last one, but the U.S. Army does. Or rather, they know the answer for the 6068 members of the armed forces they measured at the Natick Soldier Center (just a few miles from my house) as part of the Anthropometric Surveys of 2010-2011, abbreviated army-style as ANSUR-II.

In addition to the radiale-stylion length of each participant, the ANSUR dataset includes 93 other measurements “chosen as the most useful ones for meeting current and anticipated Army and [Marine Corps] needs.” The results were declassified in 2017 and are available to download from the Open Design Lab at Penn State.

Measurements like the ones in the ANSUR dataset tend to follow a Gaussian distribution. As an example, let’s look at the sitting height of the male participants, which is the “vertical distance between a sitting surface and the top of the head.” The following figure shows the distribution of these measurements as a dashed line and the Gaussian model as a shaded area.

The width of the shaded area shows the variability we would expect from a Gaussian distribution with this sample size. The distribution falls entirely within the shaded area, which indicates that the model is consistent with the data.

To quantify how well the model fits the data, I computed the maximum vertical distance between them; in this example, it is 0.26 percentile ranks, at the location indicated by the vertical dotted line. The deviation is barely visible.

Why should measurements like this follow a Gaussian distribution? The answer comes in three parts:

Physical characteristics like height depend on many factors, both genetic and environmental.
The contribution of these factors tends to be additive; that is, the measurement is the sum of many contributions.
In a randomly-chosen individual, the set of factors they have inherited or experienced is effectively random.

According to the Central Limit Theorem, the sum of a large number of random values follows a Gaussian distribution, at least approximately. Mathematically, the classical version of the theorem requires that the random values come from the same distribution, with finite variance, and that they are independent of each other.
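A quick simulation illustrates the theorem. The sketch below adds up independent draws from an exponential distribution, which is strongly skewed, and shows that the skewness of the sum shrinks toward zero, the value for a Gaussian, as the number of terms grows. The choice of the exponential distribution, the sample sizes, and the helper names are assumptions for illustration.

```python
import random
from statistics import mean, pstdev


def skewness(values):
    """Sample skewness: the mean cubed standardized deviation."""
    m, s = mean(values), pstdev(values)
    return mean(((v - m) / s) ** 3 for v in values)


def sum_of_exponentials(k):
    """Sum of k independent exponential draws with mean 1."""
    return sum(random.expovariate(1.0) for _ in range(k))


random.seed(2)
results = {}
for k in [1, 10, 50]:
    sums = [sum_of_exponentials(k) for _ in range(5000)]
    results[k] = skewness(sums)
    print(k, round(results[k], 2))
```

In theory the skewness of a sum of k such draws is 2/√k, so it falls from 2 for a single draw to about 0.28 for a sum of 50.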

Of course, genetic and environmental factors are more complicated than that. In reality, some contributions are bigger than others, so they don’t all come from the same distribution. And they are likely to be correlated with each other. And their effects are not purely additive; they can interact with each other in more complicated ways.

However, even when the requirements of the Central Limit Theorem are not met exactly, the combined effect of many factors will be approximately Gaussian as long as:

None of the contributions are much bigger than the others,
The correlations between them are not too strong,
The total effect is not too far from the sum of the parts.

Many natural systems satisfy these requirements, which is why so many distributions in the world are approximately Gaussian.

However, there are exceptions. In the ANSUR dataset, the measurement that is the worst match to the Gaussian model is the forearm length of the female participants, which is the distance I mentioned earlier between the radiale landmark on the right elbow and the stylion landmark on the right wrist.

The following figure shows the distribution of these measurements and a Gaussian model.

The maximum vertical distance between them is 4.2 percentile ranks, at the location indicated by the vertical dotted line; it looks like there are more measurements between 24 and 25 cm than we would expect in a Gaussian distribution.

There are two ways to think about this difference between the data and the model. One, which is widespread in the history of statistics and natural philosophy, is that the model represents some kind of ideal, and if the world fails to meet this ideal, the fault lies in the world, not the model.

In my opinion, this is nonsense. The world is complicated. Sometimes we can describe it with simple models, and it is often useful when we can. Sometimes, as in this case, a simple model fits the data surprisingly well. And when that happens, sometimes we find a reason the model works so well, which helps to explain why the world is as it is. But when the world deviates from the model, that’s a problem for the model, not a deficiency of the world.

Differences and Mixtures

I have cut the following section from the book. I still think it’s interesting, but it was in the way of more important things. Sometimes you have to kill your darlings.

So far I have analyzed measurements from male and female participants separately, and you might have wondered why. For some of these measurements, the distributions for men and women are similar, and if we combine them, the mixture is approximately Gaussian. But for some of them the distributions are substantially different; if we combine them, the result is not very Gaussian at all.

To show what that looks like, I computed the distance between the male and female distributions for each measurement and identified the distributions that are most similar and most different.

The measurement with the smallest difference between men and women is “buttock circumference”, which is “the horizontal circumference of the trunk at the level of the maximum protrusion of the right buttock”. The following figure shows the distribution of this measurement for men and women.

The two distributions are nearly identical, and both are well-modeled by a Gaussian distribution. As a result, if we combine measurements from men and women into a single distribution, the result is approximately Gaussian.

The measurement with the biggest difference between men and women is “neck circumference”, which is the circumference of the neck at the point of the thyroid cartilage. The following figure shows the distributions of this measurement for the male and female participants.

The difference is substantial. The average for women is 33 cm; for men it is 40 cm. The protrusion of the thyroid cartilage has been known since at least the 1600s as an “Adam’s apple”, named for the masculine archetype of the Genesis creation narrative. The origin of the term suggests that we are not the first to notice this difference.

There is some overlap between the distributions; that is, some women have thicker necks than some men. Nevertheless, if we choose a threshold between the two means, shown as a vertical line in the figure, we find fewer than 6% of women above the threshold, and fewer than 6% of men below it.

The following figure shows the distribution of neck size if we combine the male and female participants into a single sample.

The result is a distribution that deviates substantially from the Gaussian model. This example shows one of several reasons we find non-Gaussian distributions in nature: mixtures of populations with different means. That’s why Gaussian distributions are generally found within a species. If we combine measurements from different species, we should not expect Gaussian distributions.

Although I generally recommend CDFs as the best ways to visualize distributions, mixtures like this might be an exception. As an alternative, here is a KDE plot of the combined male and female measurements.

This view shows more clearly that the combined distribution is a mixture of distributions with different means; as a result, the mixture has two distinct peaks, known as modes.
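As a rough check on why the mixture has two peaks, the sketch below builds an equal-weight mixture of two Gaussian densities with the means mentioned above, 33 cm and 40 cm. The common standard deviation of 2.2 cm and the equal weights are assumptions I chose for illustration, not values taken from the ANSUR data.

```python
from statistics import NormalDist

women = NormalDist(mu=33, sigma=2.2)   # sigma is a hypothetical value
men = NormalDist(mu=40, sigma=2.2)


def mixture_pdf(x):
    """Density of an equal-weight mixture of the two groups."""
    return 0.5 * women.pdf(x) + 0.5 * men.pdf(x)


# Evaluate the density on a grid and find interior local maxima (the modes).
grid = [25 + 0.05 * i for i in range(461)]          # 25 cm to 48 cm
dens = [mixture_pdf(x) for x in grid]
modes = [
    round(grid[i], 2)
    for i in range(1, len(grid) - 1)
    if dens[i - 1] < dens[i] > dens[i + 1]
]

threshold = 36.5                                     # midpoint of the two means
women_above = 1 - women.cdf(threshold)
print(modes, round(women_above, 3))
```

An equal-weight mixture of two Gaussians with the same standard deviation has two modes only when the means are more than two standard deviations apart, which is the case under these assumptions.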

In subsequent chapters we’ll see other distributions that deviate from the Gaussian model and develop models to explain where they come from.

Watch your tail!

August 13, 2019 AllenDowney

For a long time I have recommended using CDFs to compare distributions. If you are comparing an empirical distribution to a model, the CDF gives you the best view of any differences between the data and the model.

Now I want to amend my advice. CDFs give you a good view of the distribution between the 5th and 95th percentiles, but they are not as good for the tails.

To compare both tails, as well as the “bulk” of the distribution, I recommend a triptych that looks like this:

There’s a lot of information in that figure. So let me explain.

The code for this article is in this Jupyter notebook.

Daily changes

Suppose you observe a random process, like daily changes in the S&P 500. And suppose you have collected historical data in the form of percent changes from one day to the next. The distribution of those changes might look like this:

If you fit a Gaussian model to this data, it looks like this:

It looks like there are small discrepancies between the model and the data, but if you follow my previous advice, you might look at these CDFs and conclude that the Gaussian model is pretty good.

If we zoom in on the middle of the distribution, we can see the discrepancies more clearly:

In this figure it is clearer that the Gaussian model does not fit the data particularly well. And, as we’ll see, the tails are even worse.

Survival on a log-log scale

In my opinion, the best way to compare tails is to plot the survival curve (which is the complementary CDF) on a log-log scale.

In this case, because the dataset includes positive and negative values, I shift them right to view the right tail, and left to view the left tail.
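Here is a minimal sketch of the computation behind such a plot: the empirical survival curve, with both axes converted to base-10 logarithms. The details of the shift are in the notebook, so the one used here (moving the smallest value to 1) is an assumption on my part, as are the function name and the synthetic data.

```python
import math
import random


def survival_points(values, shift):
    """Return (log10(x + shift), log10(S(x))) pairs for the empirical survival curve.

    S(x) is the fraction of values strictly greater than x, so the largest
    value (where S = 0) is dropped before taking logs.
    """
    xs = sorted(values)
    n = len(xs)
    points = []
    for i, x in enumerate(xs[:-1]):
        surv = (n - (i + 1)) / n        # fraction of values above x (no ties assumed)
        points.append((math.log10(x + shift), math.log10(surv)))
    return points


random.seed(3)
changes = [random.gauss(0, 1) for _ in range(1000)]   # stand-in for daily changes
shift = 1 - min(changes)                              # smallest shifted value is 1
points = survival_points(changes, shift)
print(points[0], points[-1])
```

Plotting these pairs gives the survival curve on a log-log scale; the last few points, where the survival fraction is smallest, are the ones that show tail behavior.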

Here’s what the right tail looks like:

This view is like a microscope for looking at tail behavior; it compresses the bulk of the distribution and expands the tail. In this case we can see a small discrepancy between the data and the model around 1 percentage point. And we can see a substantial discrepancy above 3 percentage points.

The Gaussian distribution has “thin tails”; that is, the probabilities it assigns to extreme events drop off very quickly. In the dataset, extreme values are much more common than the model predicts.

The results for the left tail are similar:

Again, there is a small discrepancy near -1 percentage points, as we saw when we zoomed in on the CDF. And there is a substantial discrepancy in the leftmost tail.

Student’s t-distribution

Now let’s try the same exercise with Student’s t-distribution. There are two ways I suggest you think about this distribution:

1) Student’s t is similar to a Gaussian distribution in the middle, but it has heavier tails. The heaviness of the tails is controlled by a third parameter, ν.

2) Also, Student’s t is a mixture of Gaussian distributions with different variances. The tail parameter, ν, is related to the variance of the variances.

For a demonstration of the second interpretation, I recommend this animation by Rasmus Bååth.
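The second interpretation can also be simulated directly. In the sketch below, each draw is Gaussian, but its variance is itself random: ν divided by a chi-squared variate with ν degrees of freedom, which is the standard construction of Student's t. The value ν = 3, the sample size, and the threshold of 4 are assumptions for illustration.

```python
import math
import random


def student_t_draw(nu):
    """One Student's t draw, built as a Gaussian with a random variance."""
    chi2 = random.gammavariate(nu / 2, 2)    # chi-squared with nu degrees of freedom
    sigma = math.sqrt(nu / chi2)             # random standard deviation
    return random.gauss(0, sigma)


random.seed(4)
n = 100_000
t_draws = [student_t_draw(3) for _ in range(n)]
z_draws = [random.gauss(0, 1) for _ in range(n)]

# Count draws more than 4 units from zero in each sample.
t_extreme = sum(abs(x) > 4 for x in t_draws)
z_extreme = sum(abs(x) > 4 for x in z_draws)
print(t_extreme, z_extreme)
```

The occasional draw with a large variance is what produces the heavy tails: values beyond 4 are far more common in the mixture than in the standard Gaussian sample.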

I used PyMC to estimate the parameters of a Student’s t model and generate a posterior predictive distribution. You can see the details in this Jupyter notebook.

Here is the CDF of the Student t model compared to the data and the Gaussian model:

In the bulk of the distribution, Student’s t-distribution is clearly a better fit.

Now here’s the right tail, again comparing survival curves on a log-log scale:

Student’s t-distribution is a better fit than the Gaussian model, but it overestimates the probability of extreme values. The problem is that the left tail of the empirical distribution is heavier than the right. But the model is symmetric, so it can only match one tail or the other, not both.

Here is the left tail:

The model fits the left tail about as well as possible.

If you are primarily worried about predicting extreme losses, this model would be a good choice. But if you need to model both tails well, you could try one of the asymmetric generalizations of Student’s t.

The old six sigma

The tail behavior of the Gaussian distribution is the key to understanding “six sigma events”.

John Cook explains six sigmas in this excellent article:

“Six sigma means six standard deviations away from the mean of a probability distribution, sigma (σ) being the common notation for a standard deviation. Moreover, the underlying distribution is implicitly a normal (Gaussian) distribution; people don’t commonly talk about ‘six sigma’ in the context of other distributions.”

This is important. John also explains:

“A six-sigma event isn’t that rare unless your probability distribution is normal… The rarity of six-sigma events comes from the assumption of a normal distribution more than from the number of sigmas per se.”

So, if you see a six-sigma event, you should probably not think, “That was extremely rare, according to my Gaussian model.” Instead, you should think, “Maybe my Gaussian model is not a good choice”.
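To put rough numbers on John's point, the sketch below compares the probability of exceeding six standard deviations under a Gaussian distribution and under Student's t with ν = 3, which has variance 3 and a closed-form CDF. The choice of ν = 3 is an assumption made for convenience, not a model fitted to any data in this article.

```python
import math
from statistics import NormalDist


def t3_sf(t):
    """Survival function of Student's t with 3 degrees of freedom (closed form)."""
    u = t / math.sqrt(3)
    return 0.5 - (u / (1 + u * u) + math.atan(u)) / math.pi


sigma_t3 = math.sqrt(3)                  # standard deviation of t with nu = 3
p_gauss = NormalDist().cdf(-6)           # P(Z > 6), using symmetry
p_t3 = t3_sf(6 * sigma_t3)               # P(T > 6 standard deviations)
print(f"Gaussian: 1 in {1 / p_gauss:,.0f}")
print(f"Student's t (nu=3): 1 in {1 / p_t3:,.0f}")
```

Under these assumptions, a six-sigma event is roughly a one-in-a-billion event for the Gaussian but closer to one in a thousand for the t distribution.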
