Probably Overthinking It

Bayesian Decision Analysis

January 9, 2026 AllenDowney

At PyData Global 2025 I presented a workshop on Bayesian Decision Analysis with PyMC. The video is available now.

This workshop is based on the first session of the Applied Bayesian Modeling Workshop I teach along with my colleagues at PyMC Labs. If you would like to learn more, it is not too late to sign up for the next offering, starting Monday January 12.

Resources:

The slides are here
The GitHub repository with the workshop material is here
You can run the notebooks on Colab: notebook without solutions, notebook with solutions

Here’s the abstract and description of the workshop.

Bayesian Decision Analysis with PyMC: Beyond A/B Testing

This hands-on tutorial introduces practical Bayesian inference using PyMC, focusing on A/B testing, decision-making under uncertainty, and hierarchical modeling. With real-world examples, you’ll learn how to build and interpret Bayesian models, evaluate competing hypotheses, and implement adaptive strategies like Thompson sampling. Whether you’re working in marketing, healthcare, public policy, UX design, or data science more broadly, these techniques offer powerful tools for experimentation, decision-making, and evidence-based analysis.

Description

Bayesian methods offer a natural and interpretable framework for updating beliefs with data, and PyMC makes it easy to apply these techniques in practice. In this tutorial, we’ll walk through a series of examples that demonstrate the core concepts:

Bayesian A/B Testing with the Beta-Binomial Model

Represent prior beliefs with the beta distribution
Use binomial likelihoods to model observed outcomes
Understand posterior distributions and credible intervals

Bayesian Bandits and Thompson Sampling

Go beyond hypothesis testing: estimate the probability of one version outperforming another
Use Thompson sampling to guide decision-making
Simulate and visualize an adaptive email campaign

Hierarchical Models for Partial Pooling and Prediction

Learn how to share information across variants
Use posterior predictive distributions to quantify uncertainty
Understand second-order probabilities

Hands-On Learning

Participants will follow along in Jupyter notebooks (hosted on Colab — no installation required). Exercises are embedded throughout, with guided solutions. Code is based on PyMC, ArviZ, and standard scientific Python libraries.

Prerequisites

Intermediate Python: basic familiarity with NumPy, plotting, and Jupyter notebooks
No prior experience with Bayesian statistics or PyMC is assumed
All materials run on Colab (no setup required)

The Raven Paradox

December 26, 2025 AllenDowney

Suppose you are not sure whether all ravens are black. If you see a white raven, that clearly refutes the hypothesis. And if you see a black raven, that supports the hypothesis in the sense that it increases our confidence, maybe slightly. But what if you see a red apple – does that make the hypothesis any more or less likely?

This question is the core of the Raven paradox, a problem in the philosophy of science posed by Carl Gustav Hempel in the 1940s. It highlights a counterintuitive aspect of how we evaluate evidence and confirm hypotheses.

No resolution of the paradox is universally accepted, but the most widely accepted is what I will call the standard Bayesian response. In this article, I’ll present this response, explain why I think it is incomplete, and propose an extension that might resolve the paradox.

Click here to run this notebook on Colab.

The Problem

The paradox starts with the hypothesis

A: All ravens are black

And the contrapositive hypothesis

B: All non-black things are non-ravens

Logically, these hypotheses are identical – if A is true, B must be true, and vice versa. So if we have a certain level of confidence in A, we should have exactly the same confidence in B. And if we observe evidence in favor of A, we should also accept it as evidence in favor of B, to the same degree.

Also, if we accept that a black raven is evidence in favor of A, we should also accept that a non-black non-raven is evidence in favor of B.

Finally, if a non-black non-raven is evidence in favor of B, we should also accept that it is evidence in favor of A.

Therefore, a red apple (which is a non-black non-raven) is evidence that all ravens are black.

If you accept this conclusion, it seems like every time you see a red apple (or a blue car, or a green leaf, etc.) you should think, “Now I am slightly more confident that all ravens are black”.

But that seems absurd, so we have two options:

Discover an error in the argument, or
Accept the conclusion.

As you might expect, many versions of (1) and (2) have been proposed.

The standard Bayesian response is to accept the conclusion but, quoth Wikipedia “argue that the amount of confirmation provided is very small, due to the large discrepancy between the number of ravens and the number of non-black objects. According to this resolution, the conclusion appears paradoxical because we intuitively estimate the amount of evidence provided by the observation of a green apple to be zero, when it is in fact non-zero but extremely small.”

It is true that when the number of non-ravens is large, the amount of evidence we get from each non-black non-raven is so small it is negligible. But I don’t think that’s why the conclusion is so acutely counterintuitive.

To clarify my objection, let me present a smaller example I’ll call the Roulette paradox.

The Roulette Paradox

An American roulette wheel has 36 pockets with the numbers 1 to 36, and two pockets labeled 0 and 00. The non-zero pockets are red or black, and the zero pockets are green.

Suppose we work in quality control at the roulette factory and our job is to check that all zero pockets are green. If we observe a green zero, that’s evidence that all zeros are green. But what if we observe a red 19?

In this example, the standard Bayesian response fails:

First, the number of non-zeros is not particularly large, so the weight of the evidence is not negligible.
Also, the Bayesian response doesn’t address what I think is actually the key: The non-green non-zero may or may not be evidence, depending on how it was sampled.

As I will demonstrate,

If we choose a pocket at random and it turns out to be a non-green non-zero, that is not evidence that all zeros are green.
But if we choose a non-green pocket and it turns out to be non-zero, that is evidence that all zeros are green.

In both cases we observe a non-green non-zero, but “observe” is ambiguous. Whether the observation is evidence or not depends on the sampling process that generated the observation. And I think confusion between these two scenarios is the foundation of the paradox.

The Setup

Let’s get into the details. Switching from roulette back to ravens, we will consider four scenarios:

You choose a random thing and it turns out to be a black raven.
You choose a random thing and it turns out to be a non-black non-raven.
You choose a random raven and it turns out to be black.
You choose a random non-black thing and it turns out to be a non-raven.

The key to the raven paradox is the difference between scenarios 2 and 4.

Scenario 2 is what most people imagine when they picture “observing a red apple”. And in this scenario, the red apple is irrelevant, exactly as intuition insists.
In Scenario 4, a red apple is evidence in favor of A, because we’re systematically checking non-black things to ensure they’re not ravens – so finding they aren’t is confirmation. But this sampling process is a more contrived interpretation of “observing a red apple”.

The reason for the paradox is that we imagine Scenario 2 and we are given the conclusion from Scenario 4.

It might not be obvious why the red apple is evidence in Scenario 4, but not Scenario 2. I think it will be clearer if we do the math.

The Math

We’ll start with a small world where there are only N = 9 ravens and M = 19 non-ravens. Then we’ll see what happens as we vary N and M.

I’ll use i to represent the unknown number of black ravens, which could be any value from 0 to N, and j to represent the unknown number of black non-ravens, from 0 to M.

We’ll use a joint distribution to represent beliefs about i and j; then we’ll use Bayes’s Theorem to update these beliefs when we see new data.

Let’s start with a uniform prior over all possible combinations of (i, j). For this prior, the probability of A is 10%. We’ll see later that the prior affects the strength of the evidence, but it doesn’t affect whether an observation is in favor of A or not.

Scenario 1

Now let’s consider the first scenario: we choose a thing at random from the universe of things, and we find that it is a black raven.

The likelihood for this observation is: i / (N + M), because i is the number of black ravens and N + M is the total number of things.

In this scenario the posterior probability of A is 20%. The posterior probability is higher than the prior, so the black raven is evidence in favor of A.

To quantify the strength of the evidence, we’ll use the log odds ratio, which is 0.81. Later we’ll see how the strength of the evidence depends on the prior distribution of i and j.

Before we go on, let’s also look at the marginal distribution of i (number of black ravens) before and after.

_images/73acb5dc58b1e987037e9ef21aff0c9bc014e6a3f841f015b7365f9352d648c7.png

As expected, observing a black raven increases our confidence that all ravens are black. The posterior distribution shifts toward higher values of i, and the probability that i = N increases.

In Scenario 1, the likelihood depends only on i, not on j, so the update doesn’t change our beliefs about j (the number of black non-ravens).

Finally, let’s visualize posterior joint distribution of i and j.

_images/3f8f1dc012592d11ac19f20c5698984fc6134a93c7eeec35c1ba1aed5913a1a2.png

Because we started with a uniform distribution and the data has no bearing on j, the joint posterior probabilities don’t depend on j.

In summary, Scenario 1 is consistent with intuition: a black raven is evidence in favor of A.

Scenario 2

In this scenario, we choose a thing at random from the universe of N + M things, and it turns out to be a red apple – which we will treat generally as a non-black non-raven.

The likelihood of this observation is: (M - j) / (N + M), because M - j is the number of non-black non-ravens and N + M is the total number of things.

In this scenario, the posterior probability of A is the same as the prior. In fact, the entire distribution of i is unchanged.

So the red apple is not evidence in favor of A or against it. This is consistent with the intuition that the red apple (or any non-black non-raven) is irrelevant.

However, the red apple is evidence about j, as we can confirm by comparing the marginal distribution of j before and after.

_images/b99c195fe739e438b9f9d21fbbfc7e12e264b93bff2de276ae0dbe7effc395dc.png

And here’s the posterior joint distribution of i and j.

_images/de8e2c353c1109b5324bd83ac12ec0262d1267021a2736b91d8f83ec641bc11d.png

Because the red apple has no bearing on i, the posterior probabilities in this scenario don’t depend on i.

In summary, Scenario 2 matches our intuition: a red apple (chosen at random) is not evidence about whether all ravens are black.

Scenario 3

In this scenario, we choose a raven first and then observe that it is black.

The likelihood for this observation is: i / N, because i is the number of black ravens and N is the total number of ravens.

In this scenario, the posterior probability of A is 20%, the same as in Scenario 1.So we conclude that the black raven is evidence in favor of A, with the same strength regardless of whether we are in:

Scenario 1: Select a random thing and it turns out to be a black raven or
Scenario 3: Select a random raven and it turns out to be black.

In fact, the entire posterior distribution is the same in both scenarios. That’s because the likelihoods in Scenarios 1 and 3 differ only by a constant factor, which is removed when the posterior distributions are normalized.

In summary, Scenario 3 is consistent with intuition: if we choose a raven and find that it is black, that is evidence in favor of A.

Scenario 4

In the last scenario, we first choose a non-black thing (from all non-black things in the universe), and then observe that it is a non-raven.

The likelihood of this observation is: (M - j) / (N - i + M - j) because M - j is the number of non-black non-ravens and N - i + M - j is the total number of non-black things.

This likelihood depends on both i and j, unlike Scenario 2. This is the key difference that makes Scenario 4 informative about whether all ravens are black.

The posterior probability of A is about 15%, which is greater than the prior, so the non-black non-raven is evidence in favor of A. The log odds ratio is about 0.47, which is smaller than in Scenarios 1 and 3, because there are more non-ravens than ravens. As we’ll see, the strength of the evidence gets smaller as M gets bigger.

Here is the marginal distribution of i (number of black ravens) before and after.

_images/90ae0b51191d2f3b068cc9a4d382e172f23d8ec19a5708c3e5f51677d0b966cd.png

And here’s the marginal distribution of j (number of black non-ravens) before and after.

_images/f7b88c2307331de2664b17bd98bdd246a61d436d1f5fc263d05e80ea706ce875.png

Finally, here’s the posterior joint distribution of i and j.

_images/83ec65bdd92130d86aad0ff90c7b99232096fffa2e14c5ec96647f66e8137d0c.png

In Scenario 4, the likelihood depends on both i and j, so the update changes our beliefs about both parameters.

And in Scenario 4 a non-black non-raven (chosen from non-black things) is evidence in favor of A. This might still be surprising, but let me suggest a way to think about it: in this scenario we are checking non-black things to make sure they are not ravens. If we find a non-black raven, that contradicts A. If we don’t, that supports A.

In all four scenarios, the results are consistent with intuition. So as long as you are clear about which scenario you are in, there is no paradox. The paradox is only apparent if you think you are in Scenario 2 and you imagine the result from Scenario 4.

In the context of the original problem:

If you walk out of your house and the first thing you see is a red apple (or a blue car, or a green leaf) that has no bearing on whether raven are black.
But if you deliberately select a non-black thing and check whether it’s a raven, and you find that it is not, that actually is evidence that all ravens are black – but consistent with the standard Bayesian response, it is so weak it is negligible.

Successive updates

In these examples, we started with a uniform prior over all combinations of i and j. Of course that’s not a realistic representation of what we believe about the world. So let’s consider the effect of other priors.

In general, different priors lead to different posterior distributions, and in this case they lead to different conclusions about the strength of the evidence. But they lead to the same conclusion about the direction of the evidence.

To demonstrate, let’s see what happens if we observe a series of black ravens (in Scenario 1 or 3). For simplicity, assume that we sample with replacement.

The following function computes multiple updates, starting with the uniform prior and then using the posterior from each update as the prior for the next.

This table shows the results in Scenario 1 (which is the same as in Scenario 3). For each iteration, the table shows the prior and posterior probability of A, and the log odds ratio.

Iteration	Prior	Posterior	LOR
0	0.100000	0.200000	0.810930
1	0.200000	0.284211	0.462624
2	0.284211	0.360000	0.348307
3	0.360000	0.427901	0.284942
4	0.427901	0.488715	0.245274
5	0.488715	0.543171	0.218261
6	0.543171	0.591920	0.198796
7	0.591920	0.635551	0.184196
8	0.635551	0.674590	0.172914
9	0.674590	0.709512	0.163995

As we see more ravens, the posterior probability of A increases, but the LOR decreases – which means that each raven provides weaker evidence than the previous one. In the long run the LOR converges to a value greater than 0 (about 0.11), which means that each raven provides at least some additional evidence, even when the prior is far from the uniform distribution we started with.

In the worst case, if the prior probability of A is 0 or 1, nothing we observe can change those beliefs, so nothing is evidence for or against A. But there is no prior where a black raven provides evidence against A.

[Proof: The likelihood of the observation is maximized when all ravens are black (i = N). Therefore, for any prior that gives non-zero probability to both A and its complement, the LOR is positive: these observations can never be evidence against A.]

The following table shows the results in Scenario 4, where we select a non-black thing and check that it is not a raven.

Iteration	Prior	Posterior	LOR
0	0.100000	0.149403	0.457933
1	0.149403	0.201006	0.359272
2	0.201006	0.253991	0.302582
3	0.253991	0.307217	0.264273
4	0.307217	0.359496	0.235611
5	0.359496	0.409837	0.212911
6	0.409837	0.457528	0.194344
7	0.457528	0.502141	0.178860
8	0.502141	0.543477	0.165785
9	0.543477	0.581514	0.154644

The pattern is similar. Each non-black thing that turns out not to be a raven is weaker evidence than the previous one. But it is always in favor of A – in this scenario, there is no prior where a non-black non-raven is evidence against A.

Varying `M`

Finally, let’s see how the strength of the evidence varies as we increase M, the number of non-ravens. The following function computes results in Scenario 4 for a range of values of M, holding constant the number of ravens, N = 9.

M	Prior	Posterior	LOR
20	0.1	0.147655	0.444110
50	0.1	0.124515	0.246875
100	0.1	0.114530	0.151946
200	0.1	0.108495	0.091022
500	0.1	0.104100	0.044751
1000	0.1	0.102331	0.025640

As M increases (more non-ravens in the universe), the strength of the evidence decreases. This is consistent with the standard Bayesian response, which notes that in a realistic scenario, the evidence is negligible.

Conclusion

The standard Bayesian response to the Raven paradox is correct in the sense that if a non-black non-raven is evidence that all ravens are black, it is so extremely weak. But that doesn’t explain why the roulette example – where the number of non-green non-zero pockets is relatively small – is still so contrary to intuition.

I think a better explanation for the paradox is the ambiguity of the word “observe”. If we are explicit about the sampling process that generates the observation, we find that a non-black non-raven may or may not be evidence that all ravens are black.

Scenario 2: If we choose a random thing and find that it is a non-black non-raven, that is not evidence.
Scenario 4: If we choose a non-black thing and find that it is a non-raven, that is evidence.

The first case is entirely consistent with intuition. The second case is less obvious, but if we consider smaller examples like a roulette wheel, and do the math, it can be reconciled with intuition.

Confusion between these scenarios causes the apparent paradox, and clarity about the scenarios resolves it.

Symmetry and Asymmetry

It might still seem strange that a black raven is always evidence for A and B, but a non-black non-raven may or may not be, depending on the sampling process. If A and B are logically identical, and a black raven supports A, it’s still not clear why a non-black non-raven doesn’t always support B.

After all, if we start with B, we conclude that a non-black non-raven is always evidence for B (and A), and a black raven may or may not be. Where does this asymmetry come from?

We broke the symmetry when we formulated “All ravens are black” as “Out of all ravens, how many are black?” This formulation first divides the world into ravens and non-ravens, then asks how many in each group are black.

Conversely, if we start with “All non-black things are non-ravens”, we formulate it as “Out of all non-black things, how many are ravens?” In this formulation, we divide the world into black and non-black things, then ask how many in each group are ravens.

The asymmetry is apparent when we parameterize the models. If we start with A, we define i to be the number of ravens that are black. And we find that in Scenario 1, the likelihood of a black raven depends on i, and in Scenario 2, the likelihood of a non-black non-raven does not.

If we start with B, we define i to be the number of non-black things that are non-ravens. Then in Scenario 1 we find that a non-black non-raven pertains to i, but a black raven does not.

So the symmetry is broken when we formulate the hypothesis in a way that is testable with data. In propositional logic, A and B are equivalent in the sense that evidence for one must be evidence for the other. In the Bayesian formulation, “How many ravens are black?” and “How many non-black things are non-ravens?” are not equivalent; evidence for one is not necessarily evidence for the other.

A critic might say that the Bayesian formulation is a non-resolution – that is, it doesn’t solve the original problem posed by Hempel; it only solves a related problem by making additional assumptions.

A Bayesian response is that the Raven Paradox is only problematic in the abstract world of propositional logic; as soon as we formulate the question in a way that connects it to the real world through observation, it disappears. So the Raven Paradox is similar to the principle of explosion – it demonstrates a brittleness in propositional logic that makes it unsuitable for reasoning about many real-world hypotheses.

SAT math scores: gender difference or selection bias?

December 16, 2025 AllenDowney

The video from my PyData Boston talk is up now:

Resources

The slides are here
Run the first notebook (Poincaré problem) on Colab
Run the second notebook (analysis of SAT data) on Colab

If you want to learn to do this kind of analysis, you can sign up for the January 2026 offering of the Applied Bayesian Modeling Workshop, which I teach along with my colleagues at PyMC Labs.

And as always, you can read Think Bayes in hard copy or free online.

Abstract

Why do male test takers consistently score about 30 points higher than female test takers on the mathematics section of the SAT? Does this reflect an actual difference in math ability, or is it an artifact of selection bias—if young men with low math ability are less likely to take the test than young women with the same ability?

This talk presents a Bayesian model that estimates how much of the observed difference can be explained by selection effects. We’ll walk through a complete Bayesian workflow, including prior elicitation with PreliZ, model building in PyMC, and validation with ArviZ, showing how Bayesian methods disentangle latent traits from observed outcomes and separate the signal from the noise.

The Lost Chapter

December 4, 2025 AllenDowney

I’m happy to report that Probably Overthinking It is available now in paperback. If you would like a copy, you can order from Bookshop.org and Amazon (affiliate links).

To celebrate, I’m publishing The Lost Chapter — that is, the chapter I cut from the published book. It’s about The Girl Named Florida problem, which might be the most counterintuitive problem in probability — even more than the Monty Hall problem.

When I started writing the book, I thought it would include more puzzles and paradoxes like this, but as the project evolved, it shifted toward real world problems where data help us answer questions and make better decisions. As much as The Girl Named Florida is challenging and puzzling, it doesn’t have much application in the real world.

But it got a new life in the internet recently, so I think this is a good time to publish! The following is an excerpt; you can read the complete chapter here.

The Girl Named Florida

The Monty Hall Problem is famously contentious. People have strong feelings about the answer, and it has probably started more fights than any other problem in probability. But there’s another problem that I think it’s even more counterintuitive – and it has started a good number of fights as well. It’s called The Girl Named Florida.

I’ve written about this problem before, and I’ve demonstrated the correct answer, but I don’t think I really explained why the answer is what it is. That’s what I’ll try to do here.

As far as I have found, the source of the problem is Leonard Mlodinow’s book, The Drunkard’s Walk, which pose the question like this:

In a family with two children, what are the chances, if one of the children is a girl named Florida, that both children are girls?

If you have not encountered this problem before, your first thought is probably that the girl’s name is irrelevant – but it’s not. In fact, the answer depends on how common the name is.

If you feel like that can’t possibly be right, you are not alone. Solving this puzzle requires conditional probability, which is one of the most counterintuitive areas of probability. So I suggest we approach it slowly – like we’re defusing a bomb.

We’ll start with two problems involving coins and dice, where the probabilities are relatively simple. These examples demonstrate three principles that will help when things get strange:

It is not always clear when the condition in a conditional probability is relevant, and our intuition can be unreliable.
A reliable way to compute conditional probabilities is to enumerate equally likely possibilities and count.
If someone does something rare, it is likely that they made more than one attempt.

Then, finally, we’ll solve The Girl Named Florida.

Tossing Coins

Let’s warm up with two problems related to coins and dice.

We’ll assume that coins are fair, so the probability of getting heads or tails is 1/2. And the outcome of one coin toss does not affect another, so even if the coin comes up heads ten times, the probability of heads on the next toss is 1/2.

Now, suppose I toss a coin twice where I can see the outcome and you can’t. I tell you that I got heads at least once, and ask you the probability that I got heads both times.

You might think, if the outcome of one coin does not affect the other, it doesn’t matter if one of the coins came up heads – the probability for the other coin is still 1/2.

But that’s not right; the correct answer is 1/3. To see why, consider this:

After I toss the coins, there are four equally likely outcomes: two heads, two tails, heads first and then tails, or tails first and then heads.
When I tell you that I got heads at least once, I rule out one of the possibilities, two tails.
The remaining three possibilities are still equally likely, so the probability of each is 1/3.
In one of the remaining possibilities, the other coin is also heads.

So the conditional probability is 1/3.

If that argument doesn’t entirely convince you, there’s another way to solve problems like this, called enumeration.

Enumeration

A conditional probability has two parts: a statement and a condition. Both are claims about the world that might be true or not, but they play different roles. A conditional probability is the probability that the statement is true, given that the condition is true. In the coin toss example, the statement is “I got heads both times” and the condition is “I tossed a coin twice and got heads at least once”.

We’ve seen that it can be tricky to compute conditional probabilities, so let me suggest what I think is the most reliable way to get the right answer and be confident that it’s correct. Here are the steps:

Make a list of equally likely outcomes,
Select the subset where the condition is true,
Within the subset where the condition is true, compute the fraction where the statement is also true.

This method is called enumerating the sample space, where the “sample space” is the list of outcomes. In the coin toss example, there are four possible outcomes, as shown in the following diagram.

_images/cac73dbcec0ebbfe16b64c983c335aaa366de414ea093c872a9fc77de6b4fd05.png

The shaded cells (both light and dark) show the three outcomes where the condition is true; the darker cell shows the one outcome where the statement is true. So the conditional probability is 1/3.

This example demonstrates one of the principles we’ll need to understand the puzzles: you have to count the combinations. If we know that the number of heads is either one or two, it is tempting to think these possibilities are equally likely. But there is only one way to get two heads, and there are two ways to get one heads. So the one-heads possibility is more likely.

In the next section, we’ll use this method to solve a problem involving dice. But I’ll start with a story that sets the scene.

Can We Get Serious Now?

The 2016 film Sully is based on the true story of Captain Chelsea Sullenberger, who famously and improbably landed a large passenger jet in the Hudson River near New York City, saving the lives of all 155 people on board.

In the aftermath of this emergency landing, investigators questioned his decision to ditch the airplane rather than attempt to land at one of two airports nearby. To demonstrate that these alternatives were feasible, they showed simulations of pilots landing successfully at both airports.

In the movie version of the hearing, Tom Hanks, who played Captain Sullenberger, memorably asks, “Can we get serious now?” Having seen the simulations, he says, “I’d like to know how many times the pilot practiced that maneuver before he actually pulled it off. Please ask how many practice runs they had.”

One of the investigators replies, “Seventeen. The pilot […] had seventeen practice attempts before the simulation we just witnessed.” And the audience gasps.

Of course this scene is fictionalized, but the logic of this exchange is consistent with the actual investigation. It is also consistent with the laws of probability.

If someone accomplishes an unlikely feat, you are right to suspect it was not their first try. And the more unlikely the feat, the more attempts you might guess they made.

I will demonstrate this point with coins and dice. Suppose I toss a coin and, based on the outcome, roll a die either once or twice. I don’t let you see the coin or the die, and you don’t know how many times I rolled, but I report that I rolled at least one six. Which do you think is more likely, that I rolled once or twice?

You might suspect that I rolled twice – and this time your intuition is correct. If I get a six, it is more likely that I rolled twice.

To see how much more likely, let’s enumerate the possibilities. The following diagram shows 72 equally likely outcomes.

_images/d3adf32d7822666f1be1efc833f3b66dd154ffea3ce8389ddd6c3afe62f20d5b.png

The left side shows 36 cases where I roll the die once, using a single digit to represent the outcomes. The right side shows 36 cases where I roll the die twice: the first digit represents the first roll; the second digit represents the second roll.

The shaded area indicates the outcomes where at least one die is a six. There are 17 in total, 6 when I roll the die once and 11 when I roll it twice. So if I tell you I rolled at least one six, the probability is 11/17 that I rolled the die twice, which is about 65%.

If you succeed at something difficult, it is likely you had more than one chance.

The Two Child Problems

Next we’ll solve two puzzles made famous by Martin Gardner in his Scientific American column in 1959. He posed the first like this:

Mr. Jones has two children. The older child is a girl. What is the probability that both children are girls?

The real world is complicated, so let’s assume that these problems are set in a world where all children are boys or girls with equal probability. With that simplification, there are four equally likely combinations of two children, shown in the following diagram.

_images/502b755caff85849b479f54f200cc4eeb645981da95d8f8a1c908832b6539896.png

The shaded areas show the families where the condition is true – that is, the first child is a girl. The darker area shows the only family where the statement is true – that is, both children are girls.

There are two possibilities where the condition is true and one of them where the statement is true, so the conditional probability is 1/2. This result confirms what you might have suspected: the sex of the older child is irrelevant. The probability that the second child is a girl is 1/2, regardless.

Now here’s the second problem, which I have revised to make it easier to compare with the first part:

Mr. Smith has two children. At least one of them is a [girl]. What is the probability that both children are [girls]?

Again, there are four equally likely combinations of two children, shown in the following diagram.

_images/eaad730bbc578ce03f54e6d28126d38225e8c1d36f22add213a361922c82eead.png

Now there are three possibilities where the condition is true – that is, at least one child is a girl. In one of them, the statement is true – that is, both children are girls. So the conditional probability is 1/3.

This problem is identical to the coin example, and it demonstrates the same principle: you have to count the combinations. There is only one way to have two girls, but there are two ways to have a boy and a girl.

More Variations

Now let’s consider a series of related questions where:

One of the children is a girl born on Saturday,
One of the children is a left-handed girl, and finally
One of the children is a girl named Florida.

To avoid real-world complications, let’s assume:

Children are equally likely to be born on any day of the week.
One child in 10 is left-handed.
One child out of 1000 is named Florida.
Children are independent of one other in the sense that the attributes of one (birth day, handedness, and name) don’t affect the attributes of the others.

Let’s also assume that “one of the children” means at least one, so a family could have two girls born on Saturday, or even two girls named Florida.

Saturday’s Child

In a family with two children, what are the chances, if one of the children is a girl born on Saturday, that both children are girls?

To answer this question, we’ll divide each of the four boy-girl combinations into 49 day-of-the-week combinations.

I’ll represent days with the digits 1 through 7, with 1 for Sunday and 7 for Saturday. And I’ll represent families with two digit numbers; for example, the number 17 represents a family where the first child was born on Sunday and the second child was born on Saturday.

The following diagram shows the four possible orders for boys and girls, and within them, the 49 possible orders for days of the week.

_images/42bb6c763be3e28fa2d943e0fe2e59b450f3466146f26cd34063a3646d195b1c.png

The shaded area shows the possibilities where the condition is true – that is, at least one of the children is a girl born on a Saturday. And the darker area shows the possibilities where the statement is true – that is, both children are girls.

There are 27 cases where the condition is true. In 13 of them, the statement is true, so the conditional probability is 13/27, which is about 48%.

So the day of the week is not irrelevant. If at least one child is a girl, the probability of two girls is about 33%. If at least one child is a girl born on Saturday, the probability of two girls is about 48%.

Now let’s see what happens if the girl is left-handed.

Left-handed girl

In a family with two children, what are the chances, if one of the children is a left-handed girl, that both children are girls?

Let’s assume that 1 child in 10 is left-handed, and if one sibling is left-handed, it doesn’t change the probability that the other is.

The following diagram shows the possible combinations, using “R” to represent a right-handed child and “L” to represent a left-handed child. Again, the shaded areas show where the condition is true; the darker area shows where the statement is true.

_images/fb32bf5f014d00840e8d73fb8405088d0378a13550e56894fe9452d437be5fd3.png

There are 39 combinations where at least one child is a left-handed girl. Of them, there are 19 combinations where both children are girls. So the conditional probability is 19/39, which is about 49%.

Now we’re starting to see a pattern. If the probability of a particular attribute, like birthday or handedness, is 1 in n, the number of cases where the condition is true is 4n-1 and the number of cases where the statement is true is 2n-1. So the conditional probability is (2n-1) / (4n-1).

Looking at the diagram, we can see where the terms in this expression come from. The multiple of four represents the segments of the L-shaped region where the condition is true; the multiple of two represents the segments where the statement is true. And we subtract one from the numerator and denominator so we don’t count the case in the lower-right corner twice.

In the days-of-the-week example, n is 7 and the conditional probability is 13/27, about 48%. In the handedness example, n is 10 and the conditional probability is 19/39, about 49%. And for the girl named Florida, who is 1 in 1000, the conditional probability is 1999/3999, about 49.99%. As n increases, the conditional probability approaches 1/2.

Going in the other direction, if we choose an attribute that’s more common, like 1 in 2, the conditional probability is 3/7, around 43%. And if we choose an attribute that everyone has, n is 1 and the conditional probability is 1/3.

In general for problems like these, the answer is between 1/3 and 1/2, closer to 1/3 if the attribute is common, and closer to 1/2 if it is rare.

But Why?

At this point I hope you are satisfied that the answers we calculated are correct. Enumerating the sample space, as we did, is a reliable way to compute conditional probabilities. But it might not be clear why additional information, like the name of a child or the day they were born, is relevant to the probability that a family has two girls.

The key is to remember what we learned from Sully: if you succeed at something improbable, you probably made more than one attempt.

If a family has a girl born on Saturday, they have done something moderately improbable, which suggests that they had more than one chance, that is, more than one girl. If they have a girl named Florida, which is more improbable, it is even more likely that they have two girls.

These problems seem paradoxical because we have a strong intuition that the additional information is irrelevant. The resolution of the paradox is that our intuition is wrong. In these examples, names and birthdays are relevant because they make the condition of the conditional probability more strict. And if you meet a strict condition, it is likely that you had more than one chance.

Be Careful What You Ask For

When I first wrote about these problems in 2011, a reader objected that the wording of the questions is ambiguous. For example, here’s Gardner’s version again (with my revision):

Mr. Smith has two children. At least one of them is a [girl]. What is the probability that both children are [girls]?

And here is the objection:

If we pick a family at random, ask if they have a girl, and learn that they do, the probability is 1/3 that the family has two girls.
But if we pick a family at random, choose one of the children at random, and find that she’s a girl, the probability is 1/2 that the family has two girls.

In either case, we know that the family has at least one girl, but the answer depends on how we came to know that.

I am sympathetic to this objection, up to a point. Yes, the question is ambiguous, but natural language is almost always ambiguous. As readers, we have to make assumptions about the author’s intent.

If I tell you that a family has at least one girl, without specifying how I came to know it, it’s reasonable to assume that all families with a girl are equally likely. I think that’s the natural interpretation of the question and, based on the answers Gardner and Mlodinow provide, that’s the interpretation they intended.

I offered this explanation to the reader who objected, but he was not satisfied. He replied at length, writing almost 4000 words about this problem, which is longer than this chapter. Sadly, we were not able to resolve our differences.

But this exchange helped me understand the difficulty of explaining this problem clearly, which helped when I wrote this chapter. So, if you think I succeeded, it’s probably because I had more than one chance.

Think DSP second edition

November 4, 2025 AllenDowney

I have started work on a second edition of Think DSP! You can see the current draft here.

I started this project in part because of this announcement:

Once in a while, a few of the Scicloj friends will meet to learn about signal processing, following the Think DSP book by Allen B. Downey, and implementing things in Clojure. Our notes will be published at Clojure Civitas.

If you know some Clojure and you want to learn DSP, click here to learn more.

The other reason I started the project is that Cursor (the AI-assisted IDE) helped get me unstuck. Here’s the problem: Think DSP was the last book I wrote in LaTeX before I switched to Jupyter notebooks. So I had code in Python modules and the text in LaTeX.

At some point I put the code in notebooks, with exercises and solutions in other notebooks. Some time later I converted the LaTeX to notebooks — but those notebooks only had code snippets, not the complete working code. And they were full of leftover LaTeX bits, like references to the figures, which were in all the wrong places.

Merging the text, code, and exercises into a single notebook was too daunting, so the book was idle for a while. And that’s where Cursor came in.

I drafted a plan to convert the notebooks to Markdown (using Jupytext) and then use AI tools to merge them. After I merged the first chapter, I checked it for issues and revised the prompt. After a few iterations, I had a plan document and a prompt document that worked pretty well. Cursor took a few minutes to merge the notebooks for each chapter, but the results were good.

I made a second pass to make the structure of the notebooks consistent. For example, each notebook starts with five or six cells that import libraries, download files, etc. It’s helpful if these cells are the same, and with the same tags, in every notebook. Revising the Markdown version of the notebooks and then converting to ipynb worked well.

The whole process took about 1.5 work days — and there are probably still some issues I’ll have to fix by hand — but it is nice to get the project unstuck!

It’s Levels

November 4, 2025 AllenDowney

A recent Reddit post asks “Amateur athletes of Reddit: what’s your ‘There’s levels to this shit’ experience from your sport?” Responses included:

We have some good runners who can win local races … And then you realise that if you put them in a 5000m race with Olympic-level athletes they’d get lapped at least 3 time and possibly 4 times.

Former NHL player putting in 10% effort was a harder, faster shot than mid-high tier beer league at 110%. The average adult player’s ceiling is buried somewhere deep under the worst NHL player’s basement floor.

And the thread includes more than one reference to Brian Scalabrine, who played in the NBA but was not a star. After he retired, he participated in a “Scallenge” where he played one-on-one against talented amateur players — and beat them by a combined score of 44-6. Explaining the gap in ability between the best amateurs and the worse professionals, he said “I’m way closer to LeBron James than you are to me.”

Brian Scalabrine is probably right, because professionals in many areas — not just athletics — really are on another level. And then there’s another level above that, and a level above that, too.

This phenomenon is surprising because it violates our intuition for the distribution of ability. We expect something like a bell curve — the Gaussian distribution — and what we get is a lognormal distribution with a tail that extends much, much farther.

This is the topic of Chapter 4 of Probably Overthinking It, where I show some examples and propose two explanations. To celebrate the imminent release of the paperback edition, here’s an excerpt (or, if you prefer video, I gave a talk based on this chapter).

Running Speeds

If you are a fan of the Atlanta Braves, a Major League Baseball team, or if you watch enough videos on the internet, you have probably seen one of the most popular forms of between-inning entertainment: a foot race between one of the fans and a spandex-suit-wearing mascot called the Freeze.

The route of the race is the dirt track that runs across the outfield, a distance of about 160 meters, which the Freeze runs in less than 20 seconds. To keep things interesting, the fan gets a head start of about 5 seconds. That might not seem like a lot, but if you watch one of these races, this lead seems insurmountable. However, when the Freeze starts running, you immediately see the difference between a pretty good runner and a very good runner. With few exceptions, the Freeze runs down the fan, overtakes them, and coasts to the finish line with seconds to spare.

But as fast as he is, the Freeze is not even a professional runner; he is a member of the Braves’ ground crew named Nigel Talton. In college, he ran 200 meters in 21.66 seconds, which is very good. But the 200 meter collegiate record is 20.1 seconds, set by Wallace Spearmon in 2005, and the current world record is 19.19 seconds, set by Usain Bolt in 2009.

To put all that in perspective, let’s start with me. For a middle-aged man, I am a decent runner. When I was 42 years old, I ran my best-ever 10 kilometer race in 42:44, which was faster than 94% of the other runners who showed up for a local 10K. Around that time, I could run 200 meters in about 30 seconds (with wind assistance).

But a good high school runner is faster than me. At a recent meet, the fastest girl at a nearby high school ran 200 meters in about 27 seconds, and the fastest boy ran under 24 seconds.

So, in terms of speed, a fast high school girl is 11% faster than me, a fast high school boy is 12% faster than her; Nigel Talton, in his prime, was 11% faster than him, Wallace Spearmon was about 8% faster than Talton, and Usain Bolt is about 5% faster than Spearmon.

Unless you are Usain Bolt, there is always someone faster than you, and not just a little bit faster; they are much faster. The reason, as you might suspect by now, is that the distribution of running speed is not Gaussian. It is more like lognormal.

To demonstrate, I’ll use data from the James Joyce Ramble, which is the 10 kilometer race where I ran my previously-mentioned personal record time. I downloaded the times for the 1,592 finishers and converted them to speeds in kilometers per hour. The following figure shows the distribution of these speeds on a logarithmic scale, along with a Gaussian model I fit to the data.

_images/38e6faeb6cb420938f6543d89a5eecacfcf15a85f0c19cc67cb5c4a2aa4fd78d.png

The logarithms follow a Gaussian distribution, which means the speeds themselves are lognormal. You might wonder why. Well, I have a theory, based on the following assumptions:

First, everyone has a maximum speed they are capable of running, assuming that they train effectively.
Second, these speed limits can depend on many factors, including height and weight, fast- and slow-twitch muscle mass, cardiovascular conditioning, flexibility and elasticity, and probably more.
Finally, the way these factors interact tends to be multiplicative; that is, each person’s speed limit depends on the product of multiple factors.

Here’s why I think speed depends on a product rather than a sum of factors. If all of your factors are good, you are fast; if any of them are bad, you are slow. Mathematically, the operation that has this property is multiplication.

For example, suppose there are only two factors, measured on a scale from 0 to 1, and each person’s speed limit is determined by their product. Let’s consider three hypothetical people:

The first person scores high on both factors, let’s say 0.9. The product of these factors is 0.81, so they would be fast.
The second person scores relatively low on both factors, let’s say 0.3. The product is 0.09, so they would be quite slow.

So far, this is not surprising: if you are good in every way, you are fast; if you are bad in every way, you are slow. But what if you are good in some ways and bad in others?

The third person scores 0.9 on one factor and 0.3 on the other. The product is 0.27, so they are a little bit faster than someone who scores low on both factors, but much slower than someone who scores high on both.

That’s a property of multiplication: the product depends most strongly on the smallest factor. And as the number of factors increases, the effect becomes more dramatic.

To simulate this mechanism, I generated five random factors from a Gaussian distribution and multiplied them together. I adjusted the mean and standard deviation of the Gaussians so that the resulting distribution fit the data; the following figure shows the results.

_images/ed0ef80c12cc2f01d6acf7b7ab92d2e87d9f2f68a8acd6e9f37bca812cb4f3bc.png

The simulation results fit the data well. So this example demonstrates a second mechanism that can produce lognormal distributions: the limiting power of the weakest link. If there are at least five factors that affect running speed, and each person’s limit depends on their worst factor, that would explain why the distribution of running speed is lognormal.

I suspect that distributions of many other skills are also lognormal, for similar reasons. Unfortunately, most abilities are not as easy to measure as running speed, but some are. For example, chess-playing skill can be quantified using the Elo rating system, which we’ll explore in the next section.

Chess Rankings

In the Elo chess rating system, every player is assigned a score that reflects their ability. These scores are updated after every game. If you win, your score goes up; if you lose, it goes down. The size of the increase or decrease depends on your opponent’s score. If you beat a player with a higher score, your score might go up a lot; if you beat a player with a lower score, it might barely change. Most scores are in the range from 100 to about 3000, although in theory there is no lower or upper bound.

By themselves, the scores don’t mean very much; what matters is the difference in scores between two players, which can be used to compute the probability that one beats the other. For example, if the difference in scores is 400, we expect the higher-rated player to win about 90% of the time.

If the distribution of chess skill is lognormal, and if Elo scores quantify this skill, we expect the distribution of Elo scores to be lognormal. To find out, I collected data from Chess.com, which is a popular internet chess server that hosts individual games and tournaments for players from all over the world. Their leader board shows the distribution of Elo ratings for almost six million players who have used their service. The following figure shows the distribution of these scores on a log scale, along with a lognormal model.

_images/d0e61e1f8103056729ece9d180e48bc67ef422d31cf5d46f7f5f44856e6b7a0e.png

The lognormal model does not fit the data particularly well. But that might be misleading, because unlike running speeds, Elo scores have no natural zero point. The conventional zero point was chosen arbitrarily, which means we can shift it up or down without changing what the scores mean relative to each other.

With that in mind, suppose we shift the entire scale so that the lowest point is 550 rather than 100. The following figure shows the distribution of these shifted scores on a log scale, along with a lognormal model.

_images/212090bf9c6718b7d51f02945d3cb353d999e9243739e5c24535e8f5f36f013e.png

With this adjustment, the lognormal model fits the data well.

Now, we’ve seen two explanations for lognormal distributions: proportional growth and weakest links. Which one determines the distribution of abilities like chess? I think both mechanisms are plausible.

As you get better at chess, you have opportunities to play against better opponents and learn from the experience. You also gain the ability to learn from others; books and articles that are inscrutable to beginners become invaluable to experts. As you understand more, you are able to learn faster, so the growth rate of your skill might be proportional to your current level.

At the same time, lifetime achievement in chess can be limited by many factors. Success requires some combination of natural abilities, opportunity, passion, and discipline. If you are good at all of them, you might become a world-class player. If you lack any of them, you will not. The way these factors interact is like multiplication, where the outcome is most strongly affected by the weakest link.

These mechanisms shape the distribution of ability in other fields, even the ones that are harder to measure, like musical ability. As you gain musical experience, you play with better musicians and work with better teachers. As in chess, you can benefit from more advanced resources. And, as in almost any endeavor, you learn how to learn.

At the same time, there are many factors that can limit musical achievement. One person might have a bad ear or poor dexterity. Another might find that they don’t love music enough, or they love something else more. One might not have the resources and opportunity to pursue music; another might lack the discipline and tenacity to stick with it. If you have the necessary aptitude, opportunity, and personal attributes, you could be a world-class musician; if you lack any of them, you probably can’t.

Outliers

If you have read Malcolm Gladwell’s book, Outliers, this conclusion might be disappointing. Based on examples and research on expert performance, Gladwell suggests that it takes 10,000 hours of effective practice to achieve world-class mastery in almost any field.

Referring to a study of violinists led by the psychologist K. Anders Ericsson, Gladwell writes:

The striking thing […] is that he and his colleagues couldn’t find any ‘naturals,’ musicians who floated effortlessly to the top while practicing a fraction of the time their peers did. Nor could they find any ‘grinds,’ people who worked harder than everyone else, yet just didn’t have what it takes to break the top ranks.”

The key to success, Gladwell concludes, is many hours of practice. The source of the number 10,000 seems to be neurologist Daniel Levitin, quoted by Gladwell:

“In study after study, of composers, basketball players, fiction writers, ice skaters, concert pianists, chess players, master criminals, and what have you, this number comes up again and again. […] No one has yet found a case in which true world-class expertise was accomplished in less time.”

The core claim of the rule is that 10,000 hours of practice is necessary to achieve expertise. Of course, as Ericsson wrote in a commentary, “There is nothing magical about exactly 10,000 hours”. But it is probably true that no world-class musician has practiced substantially less.

However, some people have taken the rule to mean that 10,000 hours is sufficient to achieve expertise. In this interpretation, anyone can master any field; all they have to do is practice! Well, in running and many other athletic areas, that is obviously not true. And I doubt it is true in chess, music, or many other fields.

Natural talent is not enough to achieve world-level performance without practice, but that doesn’t mean it is irrelevant. For most people in most fields, natural attributes and circumstances impose an upper limit on performance.

In his commentary, Ericsson summarizes research showing the importance of “motivation and the original enjoyment of the activities in the domain and, even more important, […] inevitable differences in the capacity to engage in hard work (deliberate practice).” In other words, the thing that distinguishes a world-class violinist from everyone else is not 10,000 hours of practice, but the passion, opportunity, and discipline it takes to spend 10,000 hours doing anything.

The Greatest of All Time

Lognormal distributions of ability might explain an otherwise surprising phenomenon: in many fields of endeavor, there is one person widely regarded as the Greatest of All Time or the G.O.A.T.

For example, in hockey, Wayne Gretzky is the G.O.A.T. and it would be hard to find someone who knows hockey and disagrees. In basketball, it’s Michael Jordan; in women’s tennis, Serena Williams, and so on for most sports. Some cases are more controversial than others, but even when there are a few contenders for the title, there are only a few.

And more often than not, these top performers are not just a little better than the rest, they are a lot better. For example, in his career in the National Hockey League, Wayne Gretzky scored 2,857 points (the total of goals and assists). The player in second place scored 1,921. The magnitude of this difference is surprising, in part, because it is not what we would get from a Gaussian distribution.

To demonstrate this point, I generated a random sample of 100,000 people from a lognormal distribution loosely based on chess ratings. Then I generated a sample from a Gaussian distribution with the same mean and variance. The following figure shows the results.

_images/aa055e28e1c194573c783404d077ef86fd93623c7b05684cc2dfbf955ba2cac2.png

The mean and variance of these distributions is about the same, but the shapes are different: the Gaussian distribution extends a little farther to the left, and the lognormal distribution extends much farther to the right.

The crosses indicate the top three scorers in each sample. In the Gaussian distribution, the top three scores are 1123, 1146, and 1161. They are barely distinguishable in the figure, and and if we think of them as Elo scores, there is not much difference between them. According to the Elo formula, we expect the top player to beat the #3 player about 55% of the time.

In the lognormal distribution, the top three scores are 2913, 3066, and 3155. They are clearly distinct in the figure and substantially different in practice. In this example, we expect the top player to beat #3 about 80% of the time.

In reality, the top-rated chess players in the world are more tightly clustered than my simulated players, so this example is not entirely realistic. Even so, Garry Kasparov is widely considered to be the greatest chess player of all time. The current world champion, Magnus Carlsen, might overtake him in another decade, but even he acknowledges that he is not there yet.

Less well known, but more dominant, is Marion Tinsley, who was the checkers (aka draughts) world champion from 1955 to 1958, withdrew from competition for almost 20 years – partly for lack of competition – and then reigned uninterrupted from 1975 to 1991. Between 1950 and his death in 1995, he lost only seven games, two of them to a computer. The man who programmed the computer thought Tinsley was “an aberration of nature”.

Marion Tinsley might have been the greatest G.O.A.T. of all time, but I’m not sure that makes him an aberration. Rather, he is an example of the natural behavior of lognormal distributions:

In a lognormal distribution, the outliers are farther from average than in a Gaussian distribution, which is why ordinary runners can’t beat the Freeze, even with a head start.
And the margin between the top performer and the runner-up is wider than it would be in a Gaussian distribution, which is why the greatest of all time is, in many fields, an outlier among outliers.

Cancer Survival Rates Are Misleading

October 27, 2025 AllenDowney

Five-year survival might be the most misleading statistic in medicine. For example, suppose 5-year survival for a hypothetical cancer is

91% among patients diagnosed early, while the tumor is localized at the primary site,
74% among patients diagnosed later, when the tumor has spread regionally to nearby lymph nodes or adjacent organs, and
16% among patients diagnosed late, when the tumor has spread to distant organs or lymph nodes.

What can we infer from these statistics?

If a patient is diagnosed early, it is tempting to think the probability is 91% that they will survive five years after diagnosis.
Looking at the difference in survival between early and late detection, it is tempting to conclude that more screening would save lives.
In a case where a patient is diagnosed late and dies of cancer, it is tempting to say that they would have survived if their cancer had been caught early.
And if 5-year survival increases over time, it is tempting to conclude that treatment has improved.

In fact, none of these inferences are correct.

Let’s take them one at a time.

Particularization

Here’s the first incorrect inference:

If a patient is diagnosed early, and 5-year survival is 91%, it is tempting to think the probability is 91% that they will survive five years after diagnosis.

This is almost correct in the sense that it applies to the past cases that were used to estimate the survival rate – of all patients in the dataset who were diagnosed early, 91% of them survived at least five years.

But it is misleading for two reasons:

Because it is based on past cases, it doesn’t apply to present cases if (1) the effectiveness of treatment has changed or – often more importantly – (2) diagnostic practices have changed.
Also, before interpreting a probability like this, which applies in general, it is important to particularize it for a specific case.

Factors that should be taken into account include the general health of the patient, their age, and the mode of detection. Some factors are causal – for example, general health directly improves the chance of survival. Other factors are less obvious because they are informational – for example, the mode of detection can make a big difference:

If a tumor is discovered because it is causing symptoms, it is more likely to be larger, more aggressive, and relatively late for a given stage – and all of those implications decrease the chance of survival.
If a tumor is not symptomatic, but discovered during a physical exam, it is probably larger, later, and more likely to cause mortality, compared to one discovered by high resolution imaging or a sensitive chemical test.
Conversely, tumors detected by screening are more likely to be slow-growing because of length-biased sampling – the probability of detection depends on the time between when a tumor is detectable and when it causes symptoms.

Taking age into account is complicated because it might be both causal and informational, with opposite implications. A young patient might be more robust and able to tolerate treatment, but a tumor detectable in a younger person is likely to have progressed more quickly than one that could only be discovered after more years of life. So the implication of age might be negative among the youngest and oldest patients, and positive in the middle-aged.

For some cancers, the magnitude of these implications is large, so the probability of 5-year survival for a particular patient might be higher than 91% or much lower.

Is More Screening Better?

Now let’s consider the second incorrect inference.

If 5-year survival is high when a cancer is detected early and much lower when it is detected late, it is tempting to conclude that more screening would save lives.

For example, in a recent video, Nassim Taleb and Emi Gal discuss the pros and cons of cancer screening, especially full-body MRIs for people who have no symptoms. At one point they consider this table of survival rates based on stage at diagnosis:

They note that survival is highest if a tumor is detected while localized at the primary site, lower if it has spread regionally, and often much lower if it has spread distantly.

They take this as evidence that screening for these cancers is beneficial. For example, at one point Taleb says, “Look at the payoff for pancreatic cancer – 10 times the survival rate.”

And Gal adds, “Colon cancer, it’s like seven times… The overarching insight is that you want to find cancer early… This table makes the case for the importance of finding cancer early.”

Taleb agrees, but this inference is incorrect: This table does not make the case that it is better to catch cancer early.

Catching cancer early is beneficial only if (1) the cancers we catch would otherwise cause disease and death, and (2) we have treatments that prevent those outcomes, and (3) these benefits outweigh the costs of additional screening. This table does not show that any of those things is true.

In fact, it is possible for a cancer to reproduce any row in this table, even if we have no treatment and detection has no effect on outcomes. To demonstrate, I’ll use a model of tumor progression to show that a hypothetical cancer could have the same survival rates as colon cancer – even if there is no effective treatment at all.

To be clear, I’m not saying that cancer treatment is not effective – in many cases we know that it is. I’m saying that we can’t tell, just looking at a survival rates, whether early detection has any benefit at all.

Markov Model

The details of the model are here and you can click here to run my analysis in a Jupyter notebook on Colab.

We’ll model tumor progression using a Markov chain with these states:

U1, U2, and U3 represent tumors that are undetected at each stage: local, regional, and distant.
D1, D2, D3 represent tumors that were detected/diagnosed at each stage.
And M represents mortality.

The following figure shows the states and transition rates of the model.

The transition probabilities are:

lams, two values that represent transition rates between stages,
kappas, three values that represent detection rates at each state,
mu the mortality rate from D3,
gamma the effectiveness of treatment.

If gamma > 0, the treatment is effective by decreasing the probability of progression to the next stage. In the models we’ll run for this example, gamma=0, which means that detection has no effect on progression.

Overall, these values are within the range we observe in real cancers.

The transition rates are close to 1/6, which means the average time at each stage is 6 simulated years.
The detection rate is low at the first stage, higher at the second, and much higher at the third.
The mortality rate is close to 1/3, so the average survival after diagnosis at the third stage is about 3 years.

In this model, death occurs only after a cancer has progressed to the third stage and been detected. That’s not realistic – in reality deaths can occur at any stage, due to cancer or other causes.

But adding more transitions would not make make the model better. The purpose of the model is to show that we can reproduce the survival rates we see in reality, even if there are no effective treatments. Making the model more realistic would increase the number of parameters, which would make it easier to reproduce the data, but that would not make the conclusion stronger. More realistic models are not necessarily better.

Results

We can simulate this model to compute survival rates and the distribution of stage at diagnosis

In the model, survival rates are 95% if localized, 72% if spread regionally, and 17% if spread distantly. For colon cancer, the rates from the SEER data are are 91%, 74%, and 16%. So the simulation results are not exactly the same, but they are close.

In the model, 38% of tumors are localized when diagnosed, 33% have spread regionally, and 29% have spread distantly. The actual distribution for colon cancer is 38%, 38%, and 24%. Again, the simulation results are not exactly the same, but close.

With more trial and error, I could probably find parameters that reproduce the results exactly. That would not be surprising, because even though the model is meant to be parsimonious, it has seven parameters and we are matching observations with only five degrees of freedom.

That might seem unfair, but it makes the point that there is not enough information in the survival table – even if we also consider the distribution of stages – to estimate the parameters of the model precisely. The data don’t exclude the possibility that treatment is ineffective, so they don’t prove that early detection is beneficial.

Again, it might be better to find cancer early, if the benefit of treatment outweighs the costs of false discovery and overdiagnosis – but that’s a different analysis, and 5-year survival rates aren’t part of it.

That’s why this editorial concludes, “Only reduced mortality rates can prove that screening saves lives… journal editors should no longer allow misleading statistics such as five year survival to be reported as evidence for screening.”

Counterfactuals

Now let’s consider the third incorrect inference.

In a case where a patient is diagnosed late and dies of cancer, it is tempting to say that they would have survived if their cancer had been caught early.

For example, later in the previous video, Taleb says, “Your mother, had she had a colonoscopy, she would be alive today… she’s no longer with us because it was detected when it was stage IV, right?” And Gal agrees.

That might be true, if treatment would have prevented the cancer from progressing. But this conclusion is not supported by the data in the survival table.

If someone is diagnosed late and dies, it is tempting to look at the survival table and think there’s a 91% chance they would have survived if they had been diagnosed earlier. But that’s not accurate – and it might not even be close.

First, remember what 91% survival means: among people diagnosed early, 91% survived five years after diagnosis. But among those survivors, an unknown proportion had tumors that would not have been fatal, even without treatment. Some might be non-progressive, or progress so slowly that they never cause disease or death. But in a case where the patient dies of cancer, we know their tumor was not one of those.

As a simplified example, suppose that of all tumors that are caught early, 50% would cause death within five years, if untreated, and 50% would not. Now imagine 100 people, all detected early and all treated: 50 would survive with or without treatment; out of the other 50, 41 survive with treatment – so overall survival is 91%. But if we know someone is in the second group, their chance of survival is 41/50, which is 82%.

And if the percentage of non-progressive cancers is higher than 50%, the survival rate for progressive cancers is even lower, holding overall survival constant. So that’s one reason the inference is incorrect.

To see the other reason, let’s be precise about the counterfactual scenario. Suppose someone was diagnosed in 2020 with a tumor that had spread distantly, and they died in 2022. Would they be alive in 2025 if they had been diagnosed earlier?

That depends on when the hypothetical diagnosis happens – if we imagine they were diagnosed in 2020, five year survival might apply (except for the previous point). But if it had spread distantly in 2020, we have to go farther back in time to catch it early. For example, if it took 10 years to progress, catching it early means catching it in 2010. In that case, being “alive today” would depend on 15-year survival, not 5-year.

It’s a Different Question

The five-year survival rate answers the question, “Of all people diagnosed early, how many survive five years?” That is a straightforward statistic to compute.

But the hypothetical asks a different question: “Of all people who died [during a particular interval] after being diagnosed late, how many would be alive [at some later point] if the tumor had been detected early?” That is a much harder question to answer – and five-year survival provides little or no help.

In general, we don’t know the probability that someone would be alive today, if they had been diagnosed earlier. Among other things, it depends on progression rates with and without treatment.

If many of the tumors caught early would not have progressed or caused death, even without treatment, the counterfactual probability would be low.
In any case, if treatment is ineffective, as in the hypothetical cancer we simulated, the counterfactual probability is zero.
At the other extreme, if treatment is perfectly effective, the probability is 100%.

It might be frustrating that we can’t be more specific about the probability of the counterfactual, but if someone you know was diagnosed late and died, and it bothers you to think they would have lived if they had been diagnosed earlier, it might be some comfort to realize that we don’t know that – and it could be unlikely.

Comparing Survival Rates

Now let’s consider the last incorrect inference.

If 5-year survival increases over time, it is tempting to conclude that treatment has improved.

This conclusion is appealing because if cancer treatment improves, survival rates improve, other things being equal. But if we do more screening and catch more cancers early, survival rates improve even if treatment is no more effective. And if screening becomes more sensitive, and detects smaller tumors, survival rates also improve.

For many cancers, all three factors have changed over time: improved treatment, more screening, and more sensitive screening. Looking only at survival rates, we can’t tell how much change in survival we should attribute to each.

And the answer is different for different sites. For example, this paper concludes:

In some cases, increased survival was accompanied by decreased burden of disease, reflecting true progress. For example, from 1975 to 2010, five-year survival for colon cancer patients improved … while cancer burden fell: Fewer cases … and fewer deaths …, a pattern explained by both increased early detection (with removal of cancer precursors) and more effective treatment. In other cases, however, increased survival did not reflect true progress. In melanoma, kidney, and thyroid cancer, five-year survival increased but incidence increased with no change in mortality. This pattern suggests overdiagnosis from increased early detection, an increase in cancer burden.

And this paper explains:

Screening detects abnormalities that meet the pathological definition of cancer but that will never progress to cause symptoms or death (non-progressive or slow growing cancers). The higher the number of overdiagnosed patients, the higher the survival rate.

It concludes that “Although 5-year survival is a valid measure for comparing cancer therapies in a randomized trial, our analysis shows that changes in 5-year survival over time bear little relationship to changes in cancer mortality. Instead, they appear primarily related to changing patterns of diagnosis.”

That conclusion might be stated too strongly. This response paper concludes “While the change in the 5-year survival rate is not a perfect measure of progress against cancer […] it does contain useful information; its critics may have been unduly harsh. Part of the long-run increase in 5-year cancer survival rates is due to improved […] therapy.”

But they acknowledge that with survival rates alone, we can’t say what part – and in some cases we have evidence that it is small.

Summary

Survival statistics are misleading because they suggest inferences they do not actually support.

Survival rates from the past might not apply to the present, and for a particular patient, the probability of survival depends on (1) causal factors like general health, and (2) informational factors like the mode of discovery (screening vs symptomatic presentation).
If survival rates are higher when tumors are discovered early, that doesn’t mean that more screening would be better.
And if a cancer is diagnosed late, and the patient dies, that doesn’t mean that if it had been diagnosed early, they would have lived.
Finally, if survival rates improve over time (or they are different in different places) that doesn’t mean treatment is more effective.

To be clear, all of these conclusions can be true, and in some cases we know they are true, at least in part. For some cancers, treatments have improved, and for some, additional screening would save lives. But to support these conclusions, we need other methods and metrics – notably randomized controlled trials that compare mortality. Survival rates alone provide little or no information, and they are more likely to mislead than inform.

The Foundation Fallacy

October 22, 2025 AllenDowney

At Olin College recently, I met with a group from the Kyiv School of Economics who are creating a new engineering program. I am very impressed with the work they are doing, and their persistence despite everything happening in Ukraine.

As preparation for their curriculum design process, they interviewed engineers and engineering students, and they identified two recurring themes: passion and disappointment — that is, passion for engineering and disappointment with the education they got.

One of the professors, reflecting on her work experience, said she thought her education had given her a good theoretical foundation, but when she went to work, she found that it did not apply — she felt like she was starting from scratch.

I suggested that if a “good theoretical foundation” is not actually good preparation for engineering work, maybe it’s not actually a foundation — maybe it’s just a hoop for the ones who can jump through it, and a barrier for the ones who can’t.

The engineering curriculum is based on the assumption that math (especially calculus) and science (especially physics) are (1) the foundations of engineering, and therefore (2) the prerequisites of engineering education. Together, these assumptions are what I call the Foundation Fallacy.

To explain what I mean, I’ll use an example that is not exactly engineering, but it demonstrates the fallacy and some of the rhetoric that sometimes obscures it.

A recent post on LinkedIn includes this image:

And this text:

What makes a data scientist a data scientist? Is it their ability to use R or Python to solve data problems? Partially. But just like any tool, I’d rather those making decisions with data truly understand the tools they’re using so that when something breaks, they can diagnose it.

As the image shows, running a linear regression in R or Python is just the tip of the iceberg. What lies beneath, including the theory, assumptions, and reasoning that make those models work, is far more substantial and complex.

ChatGPT can write the code. But it’s the data scientist who decides whether that model is appropriate, interprets the results, and translates them into sound decisions. That’s why I don’t just hand my students an R function and tell them to use it. We dig into why it works, not just that it works. The questions and groans I get along the way are all part of the process, because this deeper understanding is what truly sets a data scientist apart.

Most of the replies to this post, coming from people who jumped through the hoops, agree. The ones who hit a barrier, and the ones groaning in statistics classes, might have a different opinion.

I completely agree that choosing models, interpreting results, and making sound decisions are as important as programming skills. But I’m not sure the things in that iceberg actually develop those skills — in fact, I am confident they don’t.

And maybe for someone who knows these topics, “when something breaks, they can diagnose it.” But I’m not sure about that either — and I am quite sure it’s not necessary. You can understand multiple collinearity without a semester of linear algebra. And you can get what you need to know about AIC without a semester of information theory.

For someone building a regression model, a high-level understanding of causal inference is a lot more useful than the Gauss-Markov theorem. Also more useful: domain knowledge, understanding the context, and communicating the results. Maybe math and science classes could teach these topics, but the ones in this universe really, really don’t.

Everything I just said about linear regression also applies to engineering. Good engineers understand context, not just technology; they understand the people who will interact with, and be affected by, the things they build; and they can communicate effectively with non-engineers.

In their work lives, engineers hardly ever use calculus — more often they use computational tools based on numerical methods. If they know calculus, does that knowledge help them use the tools more effectively, or diagnose problems? Maybe, but I really doubt it.

My reply to the iceberg analogy is the car analogy: you can drive a car without knowing how the engine works. And knowing how the engine works does not make you a better driver. If someone is passionate about driving, the worst thing we can do is make them study thermodynamics. The best thing we can do is let them drive.

Simpson’s What?

October 16, 2025 AllenDowney

I like Simpson’s paradox so much I wrote three chapters about it in Probably Overthinking It. In fact, I like it so much I have a Google alert that notifies me when someone publishes a new example (or when the horse named Simpson’s Paradox wins a race).

So I was initially excited about this paper that appeared recently in Nature: “The geographic association of multiple sclerosis and amyotrophic lateral sclerosis”. But sadly, I’m pretty sure it’s bogus.

The paper compares death rates due to multiple sclerosis (MS) and amyotrophic lateral sclerosis (ALS) across 50 states and the District of Columbia, and reports a strong correlation.

This result is contrary to all previous work on these diseases – which might be a warning sign. But the author explains that this correlation has not been detected in previous work because it is masked when the analysis combines male and female death rates.

This could make sense, because death rates due to MS are higher for women, and death rates due to ALS are higher for men. So if we compare different groups with different proportions of males and females, it’s possible we could see something like Simpson’s paradox.

But as far as I know, the proportions of men and women are the same in all 50 states, plus the District Columbia – or maybe a little higher in Alaska. So an essential element of Simpson’s paradox – different composition of the subgroups – is missing.

Annoyingly, the “Data Availability” section of the paper only identifies the public sources of the data – it does not provide the processed data. But we can use synthesized data to figure out what’s going on.

Specifically, let’s try to replicate this key figure from the paper:

The x-axis is age adjusted death rates from MS; the y-axis is age-adjusted death rates from ALS. Each dot corresponds to one gender group in one state. The blue line fits the male data, with correlation 0.7. The pink line fits the female data, with correlation 0.75.

The black line is supposed to be a fit to all the data, showing the non-correlation we supposedly get if we combine the two groups. But I’m pretty sure that line is a mistake.

Click here to read this article with the Python code, or if you want to replicate my analysis, you can click here to run the notebook on Colab.

Synthetic Data

I used a random number generator to synthesize correlated data with the approximate distribution of the date in the figure. The following figure shows a linear regression for the male and female data separately, and a third line that is my attempt to replicate the black line in the original figure.

I thought the author might have combined the dots from the male and female groups into a collection of 102 points, and fit a line to that. That is a nonsensical thing to do, but it does yield a Simpson-like reversal in the slope of the line — and the sign of the correlation.

_images/3e7af43cbca8d3b134f479a3aa1cdd26131edd1562308344099d9f17782e7d57.png

The line for the combined data has a non-negligible negative slope, and the correlation is about -0.4 – so this is not the line that appears in the original figure, which has a very small correlation. So, I don’t know where that line came from.

In any case, the correct way to combine the data is not to plot a line through 102 points in the scatter plot, but to fit a line to the combined death rates in the 51 states. Assuming that the gender ratios in the states are close to 50/50, the combined rates are just the means of the male and female rates. The following figure shows what we get if we combine the rates correctly.

_images/21fd70a6c01747e4906e2c0e87286110ce9a96b1e6d1e145b5e69c2e987bf9f4.png

So there’s no Simpson’s paradox here – there’s a positive correlation among the subgroups, and there’s a positive correlation when we combine them. I love a good Simpson’s paradox, but this isn’t one of them.

On a quick skim, I think the rest of the paper is also likely to be nonsensical, but I’ll leave that for other people to debunk. Also, peer review is dead.

It gets worse

UPDATE: After I published the first draft of this article, I noticed that there are an unknown number of data points at (0, 0) in the original figure. They are probably states with missing data, but if they were included in the analysis as zeros — which they absolutely should not be — that would explain the flat line.

If we assume there are two states with missing data, that strengthens the effect in the subgroups, and weakens the effect in the combined groups. The result is a line with a small negative slope, as in the original paper.

The Poincaré Problem

September 25, 2025 AllenDowney

Selection bias is the hardest problem in statistics because it’s almost unavoidable in practice, and once the data have been collected, it’s usually not possible to quantify the effect of selection or recover an unbiased estimate of what you are trying to measure.

And because the effect is systematic, not random, it doesn’t help to collect more data. In fact, larger sample sizes make the problem worse, because they give the false impression of precision.

But sometimes, if we are willing to make assumptions about the data generating process, we can use Bayesian methods to infer the effect of selection bias and produce an unbiased estimate.

Click here to run this notebook on Colab.

Poincaré and the Baker

As an example, let’s solve an exercise from Chapter 7 of Think Bayes. It’s based on a fictional anecdote about the mathematician Henri Poincaré:

Supposedly Poincaré suspected that his local bakery was selling loaves of bread that were lighter than the advertised weight of 1 kg, so every day for a year he bought a loaf of bread, brought it home and weighed it. At the end of the year, he plotted the distribution of his measurements and showed that it fit a normal distribution with mean 950 g and standard deviation 50 g. He brought this evidence to the bread police, who gave the baker a warning.

For the next year, Poincaré continued to weigh his bread every day. At the end of the year, he found that the average weight was 1000 g, just as it should be, but again he complained to the bread police, and this time they fined the baker.

Why? Because the shape of the new distribution was asymmetric. Unlike the normal distribution, it was skewed to the right, which is consistent with the hypothesis that the baker was still making 950 g loaves, but deliberately giving Poincaré the heavier ones.

To see whether this anecdote is plausible, let’s suppose that when the baker sees Poincaré coming, he hefts k loaves of bread and gives Poincaré the heaviest one. How many loaves would the baker have to heft to make the average of the maximum 1000 g?

How Many Loaves?

Here are distributions with the same underlying normal distribution and different values of k.

mu_true, sigma_true = 950, 50

As k increases, the mean increases and the standard deviation decreases.

When k=4, the mean is close to 1000. So let’s assume the baker hefted four loaves and gave the heaviest to Poincaré.

At the end of one year, can we tell the difference between the following possibilities?

Innocent: The baker actually increased the mean to 1000, and k=1.
Shenanigans: The mean was still 950, but the baker selected with k=4.

Here’s a sample under the k=4 scenario, compared to 10 samples with the same mean and standard deviation, and k=1.

The k=4 distribution falls mostly within the range of variation we’d expect from the k=1 distribution (with the same mean and standard deviation). If you were on the jury and saw this evidence, would you convict the baker?

Ask a Bayesian

As a Bayesian approach to this problem, let’s see if we can use this data to estimate k and the parameters of the underlying distribution. Here’s a PyMC model that

Defines prior distributions for mu, sigma, and k, and
Uses a custom distribution that computes the likelihood of the data for a hypothetical set of parameters (see the notebook for details).

def make_model(sample):
    with pm.Model() as model:
        mu = pm.Normal("mu", mu=950, sigma=30)
        sigma = pm.HalfNormal("sigma", sigma=30)
        k = pm.Uniform("k", lower=0.5, upper=15)

        obs = pm.CustomDist(
            "obs",
            mu, sigma, k,
            logp=max_normal_logp,
            observed=sample,
        )
    return model

Notice that we treat k as continuous. That’s because continuous parameters are much easier to sample (and the log PDF function allows non-integer values of k). But it also make sense in the context of the problem – for example, if the baker sometimes hefts three loaves and sometimes four, we can approximate the distribution of the maximum with k=3.5.

The model runs quickly and the diagnostics look good. Here are the posterior distributions of the parameters compared to their known values.

Posterior distribution of mu showing the posterior mean is 940 compared to the true value 950.

Posterior distribution of sigma showing the posterior mean is 54 compared to the true value 50.

Posterior distribution of k showing the posterior mean is 5.5 compared to the true value 4.

With one year of data, we can recover the parameters pretty well. The true values fall comfortably inside the posterior distributions, and the posterior mode of k is close to the true value, 4.

But the posterior distributions are still quite wide. There is even some possibility that the baker is innocent, although it is small.

Conclusion

This example shows that we can use the shape of an observed distribution to estimate the effect of selection bias and recover the unbiased latent distribution. But we might need a lot of data, and the inference depends on strong assumptions about the data generating process.

Credits: I don’t remember where I got this example from (maybe here?), but it appears in Leonard Mlodinov, The Drunkard’s Walk (2008). Mlodinov credits Bart Holland, What Are the Chances? (2002). The ultimate source seems to be George Gamow and Marvin Stern, Puzzle Math (1958) – but their version is about a German professor, not Poincaré.

You can order print and ebook versions of Think Bayes 2e from Bookshop.org and Amazon.

Bayesian Decision Analysis with PyMC: Beyond A/B Testing

The Problem

The Roulette Paradox

The Setup

The Math

Scenario 1

Scenario 2

Scenario 3

Scenario 4

Successive updates

Varying M

Conclusion

Symmetry and Asymmetry

Related Reading

Resources

Abstract

The Girl Named Florida

Tossing Coins

Enumeration

Can We Get Serious Now?

The Two Child Problems

More Variations

Saturday’s Child

Left-handed girl

But Why?

Be Careful What You Ask For

Running Speeds

Chess Rankings

Outliers

The Greatest of All Time

Particularization

Is More Screening Better?

Markov Model

Results

Counterfactuals

It’s a Different Question

Comparing Survival Rates

Summary

Synthetic Data

It gets worse

Poincaré and the Baker

How Many Loaves?

Ask a Bayesian

Conclusion

Varying `M`