Planning for your midlife crisis

Yesterday I presented a talk at ODSC East 2026 called “Counterfactual Analysis with Bayesian Models: What Drives the Life Expectancy Gap?” Here’s the abstract:

Across nearly every country in the world, women live longer than men—but the size of this gap varies from about two years in some countries to more than twelve in others. What explains these differences, and how much of the gap can be closed?

In this talk, I present a practical approach to counterfactual analysis using Bayesian regression models. Using publicly available mortality data, we build a model that relates the life expectancy gap between men and women to differences in cause-specific death rates, including homicide, drug overdoses, traffic fatalities, smoking-related disease, and chronic illness.

The model generates posterior simulations that answer “what-if” questions. For example: How much smaller would the U.S. life expectancy gap be if homicide rates matched those in Western Europe?

The talk presents the workflow—from assembling global datasets to fitting interpretable Bayesian models with PyMC and generating counterfactual simulations. Attendees will learn how Bayesian models can support explainable modeling and analysis under uncertainty.

I think the talk went well, and we got some good questions at the end. There’s no recording, unfortunately, but my slides are here. And if you want to know more, I have a series of blog posts on Substack.

The fifth and final post is on the way. In the meantime, here’s a quick post on a related topic.

Are you middle-aged?

Here’s a question from Reddit’s Stupid Questions forum:

I always thought middle age was in your 40s but since life expectancy is around 75 or so, wouldn’t it be about 35?

If life expectancy is 75, you might think the midpoint is half that, which is 37.5. But if 75 is life expectancy at birth and you survive to age 37.5, your expected age at death is higher than 75, because the average at birth includes people who die before that age. So 37.5 is not halfway!

If we really want to find the midpoint – and it wouldn’t be Probably Overthinking It if we didn’t – we have to find the age where your expected remaining lifetime equals your current age.

Let’s do it.

Data

From the Human Mortality Database I downloaded life tables for the United States, combined and broken down for men and women. The following function reads and cleans a table.

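import pandas as pd

def read_life_table(filename):
    # Fixed-width text file; skip the two lines above the column header
    lt = pd.read_fwf(filename, skiprows=2, infer_nrows=200)
    # The open-ended last age group is written like '110+'; drop the '+' so Age is an int
    lt['Age'] = lt['Age'].str.replace('+', '', regex=False).astype(int)
    return lt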

Here are the first few rows of the combined table (see notes below for details).

blt = read_life_table('../data/bltper_1x1.txt')
blt.head()
   Year  Age       mx       qx    ax      lx    dx     Lx       Tx     ex
0  1933    0  0.06129  0.05861  0.25  100000  5861  95624  6089609  60.90
1  1933    1  0.00946  0.00941  0.50   94139   886  93696  5993985  63.67
2  1933    2  0.00435  0.00434  0.50   93253   405  93050  5900289  63.27
3  1933    3  0.00310  0.00310  0.50   92848   288  92704  5807239  62.55
4  1933    4  0.00239  0.00238  0.50   92560   221  92450  5714535  61.74

We’ll also read the female and male tables.

flt = read_life_table('../data/fltper_1x1.txt')
mlt = read_life_table('../data/mltper_1x1.txt')

The tables include data from 1933 to 2024, so we’ll select the most recent data.

year = blt['Year'].unique()[-1]
table = blt.query('Year == @year').set_index('Age')

The column we’ll use is ex, which is life expectancy as a function of age.

age = table.index.to_series()
ex = table['ex']

Life expectancy at birth is 79 years, so the naive midpoint is 39.5.

ex[0], ex[0] / 2
(79.08, 39.54)

But at age 40, expected remaining lifetime is 41.1, so 39.5 is not the midpoint.

ex[39], ex[40]
(42.04, 41.12)

This plot shows life expectancy at each age, compared to age.

ex.plot(label='Remaining life expectancy')
age.plot(label='Age')
decorate(ylabel='Years',
        title='Remaining life expectancy vs age, United States 2024')
[Figure: Remaining life expectancy vs age, United States 2024]

“Middle age” is where the lines cross, which we can compute by linear interpolation.

from scipy.interpolate import interp1d

inverse = interp1d(ex - age, age)
inverse(0)
array(40.58638743)

So the overall midpoint is 40.6 years. But as you might expect, it’s different for men and women. Let’s put the analysis we did in a function.

def get_midpoint(filename):
    lt = read_life_table(filename)
    year = lt['Year'].unique()[-1]
    table = lt.query('Year == @year').set_index('Age')

    age = table.index.to_series()
    ex = table['ex']

    inverse = interp1d(ex - age, age)
    return inverse(0)

And run it for men.

get_midpoint('../data/mltper_1x1.txt')
array(39.57142857)

And women.

get_midpoint('../data/fltper_1x1.txt')
array(41.56185567)

Men hit middle age at 39.6, women at 41.6.
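
To make the interpolation step concrete, here is a minimal sketch that finds the crossing point with `np.interp` instead of `interp1d`. The ages and life expectancies are made up for illustration and are not HMD values. One difference worth noting: `np.interp` expects increasing x-coordinates, so the arrays are reversed first.

```python
import numpy as np

# Hypothetical values for illustration only (not taken from the HMD tables)
ages = np.array([38, 39, 40, 41, 42])
ex = np.array([43.0, 42.0, 41.0, 40.0, 39.0])  # remaining life expectancy

# The midpoint is the age where remaining lifetime equals current age,
# that is, where ex - ages crosses zero.
diff = ex - ages  # [5, 3, 1, -1, -3], decreasing with age

# np.interp requires increasing x-coordinates, so reverse both arrays
midpoint = np.interp(0, diff[::-1], ages[::-1])
print(midpoint)  # 40.5
```

In this toy table the difference is +1 at age 40 and -1 at age 41, so the crossing falls halfway between them.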

The Gender Gap and Age

Finally, let’s see how the gender gap in life expectancy changes as a function of age.

ex_male = mlt.query('Year == @year').set_index('Age')['ex']
ex_female = flt.query('Year == @year').set_index('Age')['ex']
gap = ex_female - ex_male
gap.plot(label='')
decorate(ylabel='Years',
         title='Life expectancy gender gap vs age')
[Figure: Life expectancy gender gap vs age]

At birth the life expectancy gap is close to five years. At age 100, it is close to zero.

But just looking at the gap might be misleading. For a more complete picture let’s also look at the ratio.

ratio = ex_female / ex_male
ratio.plot(label='')

decorate(ylabel='Ratio',
         title='Life expectancy gender ratio (female / male)')
[Figure: Life expectancy gender ratio (female / male)]

The life expectancy ratio tells a more complicated story.

  • At birth, the ratio is 1.06, which means female babies live 6% longer, on average.
  • Around age 80, the ratio peaks at nearly 1.14 – so between female and male octogenarians, we expect the women to live 14% longer.
  • At advanced ages, the ratio declines steeply and actually crosses over after age 100 – although the crossover is minimal and might not be statistically valid.

To interpret these results, we can think about the causes of death that contribute to age-specific death rates at different stages of life.

  • In young adulthood, the causes of death that contribute most to gender gaps include road traffic, homicide, accidental injury, and drug use disorders.
  • In advanced adulthood, they include cancer, cardiovascular disease, respiratory disease, liver disease, diabetes, and suicide.

The causes that affect younger people have large gender gaps, but relatively low death rates. As people get older, these low-rate causes contribute less to age-specific death rates, and the higher-rate causes contribute more.

I think that’s a plausible explanation for the increasing ratio from age 0 to 80. For the decline that follows, I can only speculate that there is a selection effect: people who get to these advanced ages are likely to have better-than-average lifestyle histories (less smoking and drinking, better diet, more exercise) – and among people with better lifestyles, the gender gap is small.

Notes

Data credit: HMD. Human Mortality Database. Max Planck Institute for Demographic Research (Germany), University of California, Berkeley (USA), and French Institute for Demographic Studies (France). Available at [www.mortality.org].

Here are the columns of the 1×1 Period Life Tables:

  • Year: Calendar year to which the period life table refers.
  • Age: Exact age (x), in years, at the beginning of the interval ([x, x+1)).
  • mx: Central death rate at age (x): $m_x = d_x / L_x$.
  • qx: Probability of dying between ages (x) and (x+1): $q_x = d_x / l_x$.
  • ax: Average fraction of the interval lived by those who die in ([x, x+1)). Typically around 0.5 for most ages, lower for infants (reflecting higher early mortality within the year).
  • lx: Number of survivors at exact age (x), out of a radix (usually 100,000 births).
  • dx: Number of deaths between ages (x) and (x+1): $d_x = l_x \, q_x$.
  • Lx: Person-years lived between ages (x) and (x+1), approximately $L_x \approx l_x - (1 - a_x) \, d_x$.
  • Tx: Total person-years remaining above age (x): $T_x = \sum_{y \ge x} L_y$.
  • ex: Life expectancy at age (x): $e_x = T_x / l_x$.
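
As a quick sanity check on how the columns relate, the sketch below recomputes three of them from the others, using the age-0 row of the 1933 combined table shown earlier. It relies on the standard life-table identities (qx is dx / lx, mx is dx / Lx, ex is Tx / lx), which are stated here as assumptions about how the table was built.

```python
# Age-0 row of the 1933 combined table (values as printed by blt.head())
lx, dx, Lx, Tx = 100000, 5861, 95624, 6089609

qx = dx / lx  # probability of dying before age 1
mx = dx / Lx  # central death rate
ex = Tx / lx  # life expectancy at birth

print(round(qx, 5), round(mx, 5), round(ex, 2))  # 0.05861 0.06129 60.9
```

The results agree with the printed qx, mx, and ex for that row, up to rounding.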

The details are in this Jupyter notebook.

Attention, Chinese Readers

The Chinese edition of Probably Overthinking It is available now (also here)!

If you have the Chinese edition, there are two sections you won’t get to read — so I am including them here.

Here is an excerpt from Chapter 3, including the deleted paragraph:

In the Present

The women surveyed in 1990 rejected the childbearing example of their mothers emphatically. On average, each woman had 2.3 fewer children than her mother. If that pattern had continued for another generation, the average family size in 2018 would have been about 0.8. But it wasn’t.

In fact, the average family size in 2018 was very close to 2, just as in 1990. So how did that happen?

As it turns out, this is close to what we would expect if every woman had one child fewer than her mother. The following figure shows the actual distribution in 2018, compared to the result if we start with the 1990 distribution and simulate the “one child fewer” scenario.

[Figure: Distribution of family size in 2018, actual vs simulated “one child fewer” scenario]

The means of the two distributions are almost the same, but the shapes are different. In reality, there were more zero- and two-child families in 2018 than the simulation predicts, and fewer one-child families. But at least on average, it seems like women in the U.S. have been following the “one child fewer” policy for the last 30 years.

The scenario at the beginning of this chapter is meant to be light-hearted, but in reality governments in many places and times have enacted policies meant to control family sizes and population growth. Most famously, China implemented a one-child policy in 1980 that imposed severe penalties on families with more than one child. Of course, this policy is objectionable to anyone who considers reproductive freedom a fundamental human right. But even as a practical matter, the unintended consequences were profound.

Rather than catalog them, I will mention one that is particularly ironic: while this policy was in effect, economic and social forces reduced the average desired family size so much that, when the policy was relaxed in 2015 and again in 2021, average lifetime fertility increased to only 1.3, far below the level needed to keep the population constant, near 2.1. Since then, China has implemented new policies intended to increase family sizes, but it is not clear whether they will have much effect. Demographers predict that by the time you read this, the population of China will probably be shrinking [UPDATE: It is.]. The consequences of the one-child policy are widespread and will affect China and the rest of the world for a long time.

And here is an excerpt from Chapter 5, including the deleted explanation.

Child mortality

Fortunately, child mortality has decreased since 1900. The following figure shows the percentage of children who die before age 5 for four geographical regions, from 1900 to 2019. These data were combined from several sources by Gapminder, a foundation based in Sweden that “promotes sustainable global development […] by increased use and understanding of statistics.”

[Figure: Percentage of children who die before age 5, four geographical regions, 1900 to 2019]

In every region, child mortality has decreased consistently and substantially. The only exceptions are indicated by the vertical lines: the 1918 influenza pandemic, which visibly affected Asia, the Americas, and Europe; World War II in Europe (1939-1945); and the Great Leap Forward in China (1958-1962). In every case, these exceptions did not affect the long-term trend.

[COMMENT: I thought I was being diplomatic by referring generally to the Great Leap Forward — rather than the Great Chinese Famine or “Three Years of Great Famine” (三年大饥荒) — but apparently that was not enough.]

Although there is more work to do, especially in Africa, child mortality is substantially lower now, in every region of the world, than in 1900. As a result most people now are better new than used.

To demonstrate this change, I collected recent mortality data from the Global Health Observatory of the World Health Organization (WHO). For people born in 2019, we don’t know what their future lifetimes will be, but we can estimate it if we assume that the mortality rate in each age group will not change over their lifetimes.

Based on that simplification, the following figure shows average remaining lifetime as a function of age for Sweden and Nigeria in 2019, compared to Sweden in 1905.

[Figure: Average remaining lifetime as a function of age, Sweden and Nigeria in 2019 and Sweden in 1905]

Since 1905, Sweden has continued to make progress; life expectancy at every age is higher in 2019 than in 1905. And Swedes now have the new-better-than-used property. Their life expectancy at birth is about 82 years, and it declines consistently over their lives, just like a light bulb.

Unfortunately, Nigeria has one of the highest rates of child mortality in the world: in 2019, almost 8% of babies died in their first year of life. After that, they are briefly better used than new: life expectancy at birth is about 62 years; however, a baby who survives the first year will live another 65 years, on average.

Going forward, I hope we continue to reduce child mortality in every region; if we do, soon every person born will be better new than used. Or maybe we can do even better than that.

Field Sobriety Tests and the Base Rate Fallacy

In Chapter 9 of Probably Overthinking It I wrote about Drug Recognition Experts (DREs), who are law enforcement officers trained to recognize impaired drivers.

I reviewed the research papers that were supposed to evaluate the accuracy of DREs and I summarized my impressions like this:

What I found was a collection of studies that are, across the board, deeply flawed. Every one of them features at least one methodological error so blatant it would be embarrassing at a middle school science fair.

Recently the related topic of Field Sobriety Tests (FSTs) came up in this Reddit discussion, which links to this TV news report about sober drivers who were arrested based on FST results.

The TV report refers to this 2023 paper in JAMA Psychiatry. Because it’s recent, published in a good quality journal, and called “Evaluation of Field Sobriety Tests for Identifying Drivers Under the Influence of Cannabis: A Randomized Clinical Trial”, I thought it might address the problems I found in previous research.

Unfortunately, it has the same problems:

  • Selection bias: It excludes as subjects people with conditions that might cause them to fail an FST while sober – but these are exactly the people most vulnerable to false positive results.
  • Wrong metrics: The paper focuses on the true positive and false positive rates, and neglects the predictive value of the test – which is more relevant to the policy question.
  • Unrealistic base rate: In the test conditions, two thirds of the participants were impaired, which is almost certainly higher than the relevant fraction in the real world.

Despite all that, the false positive rate they reported is 49%, which means that nearly half of the sober participants were wrongly classified as impaired.

Let’s look at each of these problems more closely.

False Positives

The study tested 184 participants, 121 randomly assigned to the THC group and 63 to the placebo group. The THC group smoked cannabis cigarettes containing THC; the placebo group smoked cigarettes with almost none. Each participant was evaluated by one officer, who was “blinded to treatment assignment”. The paper reports

Officers classified 98 participants (81.0%) in the THC group and 31 (49.2%) in the placebo group as FST impaired.

The following table summarizes these results as a confusion matrix:

               FST Positive  FST Negative  Total
THC Group                98            23    121
Placebo Group            31            32     63
Total                   129            55    184

Let’s start with the most obvious problem: of 63 people in the placebo group, 31 were wrongly classified as impaired, so the false positive rate was 49%.

Although the tests “were administered by certified DRE instructors, the highest training level for impaired driving detection”, the results for sober participants were no better than a coin toss. That’s pretty bad, but in reality it’s probably worse, because of selection bias.

Selection Bias

The study recruited 261 people who met these requirements: “age 21 to 55 years, cannabis use 4 or more times in the past month, holding a valid driver’s license, and driving at least 1000 miles in the past year.”

But it excluded 62 recruits for reasons including “history of traumatic brain injury [and] significant medical conditions or psychiatric conditions”. They also excluded people with a positive urine test for nonprescription drugs or substance use disorder in the past year.

That’s a problem because people with these kinds of medical conditions are more likely to fail an FST – even if they are not actually impaired. By excluding them, the study excludes exactly the people most vulnerable to a false positive result.

A better experiment would recruit a representative sample of drivers, including people older than 55 and people with conditions that make it hard to pass a field sobriety test. The TV report highlights an example: an autistic man who was arrested for DUI because his autism-related differences were mistaken for impairment. I assume he would have been excluded from the study.

To see how much difference the selection criteria could make, suppose 20 of the excluded participants (about one third) had been assigned to the placebo group. And suppose that because of their conditions 16 of them were wrongly classified as impaired – that’s 80%, well above the 49% rate among included participants.

That would increase the number of false positives by 16 and the number of true negatives by 4, so the unbiased false positive rate might be 57%.

This is just a guess: it’s not clear how many were excluded specifically for medical conditions or how many of the excluded would have failed the FST. But this calculation gives us a sense of how big the bias could be.
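
The arithmetic behind that 57% can be sketched in a few lines. The 20 extra participants and their 16 failures are the hypothetical numbers from the scenario above, not data from the study.

```python
# Counts reported in the study's placebo group
false_pos, true_neg = 31, 32

# Hypothetical additions: 20 excluded people, 16 of whom fail the FST while sober
extra_fp, extra_tn = 16, 4

fpr_reported = false_pos / (false_pos + true_neg)
fpr_adjusted = (false_pos + extra_fp) / (false_pos + extra_fp + true_neg + extra_tn)
print(f"{fpr_reported:.1%} -> {fpr_adjusted:.1%}")  # 49.2% -> 56.6%
```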

As I wrote in Probably Overthinking It:

How can you estimate the number of false positives if you exclude from the study everyone likely to yield a false positive? You can’t.

And that brings us to the next problem.

Predictive Value

The paper reports:

Officers classified 98 participants (81.0%) in the THC group and 31 (49.2%) in the placebo group as FST impaired at the first evaluation

They quantify this difference as 31.8 percentage points, with 95% CI, 16.4-47.2 percentage points, and report a p-value < .001. Based on this analysis, they conclude:

FSTs administered by highly trained law enforcement officers differentiated between individuals receiving THC vs placebo

This conclusion is true in the sense that the difference in percentages is statistically significant, but the policy question is not whether THC exposure changes FST performance under laboratory conditions. The question is whether an FST result provides sufficiently strong evidence to justify detention or arrest.

For that, the false positive rate is relevant, and as we have discussed, it is probably more than 50%.

But even more important is the positive predictive value (PPV), which is the probability that a positive test is correct. In the confusion matrix, there are 129 positive tests, of which 98 are correct and 31 incorrect, so the PPV is 98 out of 129, about 76%.

Of the people who failed the FST, 76% were actually impaired. That might sound good enough for probable cause, but that conclusion is misleading because there is still another problem – the base rate.

Base Rate

In the study, two thirds of the participants were impaired. In the real world, it is unlikely that two thirds of drivers are impaired – or even two thirds of drivers who take an FST. So the base rate in the study is too high.

To see why that matters, we have to do a little math. First we’ll use the confusion matrix to compute one more metric, sensitivity, which is the percentage of impaired participants who were classified correctly.
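
Here is a minimal sketch of that computation from the confusion matrix counts. The variable names are mine; like the study, it treats group assignment as ground truth.

```python
# Confusion matrix counts from the study
tp, fn = 98, 23   # THC group: classified impaired / not impaired
fp, tn = 31, 32   # placebo group: classified impaired / not impaired

sensitivity = tp / (tp + fn)          # share of the THC group flagged
false_positive_rate = fp / (fp + tn)  # share of the placebo group flagged
print(f"{sensitivity:.3f} {false_positive_rate:.3f}")  # 0.810 0.492
```

So sensitivity is about 81%, the same figure the paper reports for the THC group.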

We can use sensitivity, along with the false positive rate we already computed, to figure out the positive predictive value of a test with a more realistic base rate.

Of all people pulled over and given a field sobriety test, how many do you think are impaired by THC? That’s a hard question to answer, so we’ll try a couple of values.

First, suppose the base rate is one third, rather than the two thirds in the study. If we imagine 100 drivers:

  • If 33 are impaired, and sensitivity is 81%, we expect 27 true positive results.
  • If 67 are not impaired, and the false positive rate is 49%, we expect 33 false positive results.

In that case the positive predictive value is 27 / (27 + 33), which means that only 45% of positive tests are correct. If we put those numbers in a table, the calculation might be clearer.

              Tests  Prob pos  Pos tests  Percent
Impaired         33     0.810     26.727   44.773
Not impaired     67     0.492     32.968   55.227

With a lower base rate, PPV is lower, which means that a positive test is weaker evidence of impairment. But even 45% might be too high.

If we suppose that 15% of drivers who take an FST are impaired, we can run the numbers again.

              Tests  Prob pos  Pos tests  Percent
Impaired         15     0.810     12.149   22.508
Not impaired     85     0.492     41.825   77.492

With 15% base rate, the predictive value of the test is only 23% – which means 77% of drivers identified as impaired would actually be sober.
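
The same calculation can be wrapped in a small function so you can try other base rates. This is a sketch: `positive_predictive_value` is a name I made up, and it assumes sensitivity and the false positive rate stay the same as the base rate changes.

```python
def positive_predictive_value(base_rate, sensitivity, false_positive_rate):
    """Probability that a positive test is correct, by Bayes's theorem."""
    true_pos = base_rate * sensitivity
    false_pos = (1 - base_rate) * false_positive_rate
    return true_pos / (true_pos + false_pos)

sensitivity = 98 / 121
fpr = 31 / 63

# Base rate in the study, then the two hypothetical base rates
for base_rate in [121 / 184, 0.33, 0.15]:
    ppv = positive_predictive_value(base_rate, sensitivity, fpr)
    print(f"{base_rate:.2f} {ppv:.3f}")
```

With the study's own base rate, the function recovers the PPV of 98 out of 129; with 33% and 15% it reproduces the two tables.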

In reality, the base rate depends on the context. At a checkpoint where every driver is stopped, the base rate might be lower than 15%. If a driver is stopped for driving erratically, the base rate might be relatively high. But even then, it is unlikely to be as high as 66%, as in the study.

Discussion

The JAMA Psychiatry study provides valuable data, but it suffers from the same methodological problems as previous DRE validation studies:

  1. High false positive rate: Nearly half of sober participants were incorrectly classified as impaired.
  2. Selection bias: The study excluded exactly the people most likely to be falsely accused, making it impossible to assess the true false positive rate in the general population.
  3. Unrealistic base rate: The base rate in the study is higher than what we expect in real-world use, which inflates the predictive value of the test.

Although I have been critical of the study, I agree with their interpretation of the results:

…the substantial overlap of FST impairment between groups and the high frequency at which FST impairment was suspected to be due to THC suggest that absent other indicators, FSTs alone may be insufficient to identify THC-specific driving impairment.

Emphasis mine.

Notes

In my interpretation of the results, I follow the methodology of the study, which treats assignment to the THC group as ground truth – that is, we assume that participants in the THC group were actually impaired and participants in the placebo group were not. And the paper reports:

Median self-reported highness (scale of 0 to 100, with higher scores indicating more impairment) at 30 minutes was 64 (IQR, 32-76) for the THC group and 13 (IQR, 1-28) for the placebo group (P < .001).

The THC group felt that they were more impaired, but based on the IQRs, it looks like there might be overlap. That complicates the interpretation of “impaired”, but for this analysis I use the study’s operational definition.

Click here to run this notebook on Colab.

Don’t Bet on the Super Bowl

If you have studied probability, you might be familiar with fractional odds, which represent the ratio of the probability something happens to the probability it doesn’t. For example, if the Seahawks have a 75% chance of winning the Super Bowl, they have a 25% chance of losing, so the ratio is 75 to 25, sometimes written 3:1 and pronounced “three to one”.

But if you search for “the odds that the Seahawks win”, you will probably get moneyline odds, also known as American odds. Right now, the moneyline odds are -240 for the Seahawks and +195 for the Patriots. If you are not familiar with this format, that means:

  • If you bet $100 on the Patriots and they win, you gain $195 – otherwise you lose $100.
  • If you bet $240 on the Seahawks and they win, you gain $100 – otherwise you lose $240.

If you are used to fractional odds, this format might make your head hurt. So let’s unpack it.
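
One way to read the format is as a rule that turns a moneyline number into a wager and a payout. The helper below is a hypothetical sketch of that rule, following the two bullet points above.

```python
def moneyline_to_bet(odds):
    """Return (wager, payout) in dollars for American (moneyline) odds.

    Hypothetical helper: negative odds give the wager needed to gain $100;
    positive odds give the gain on a $100 wager.
    """
    if odds < 0:
        return -odds, 100
    return 100, odds

print(moneyline_to_bet(-240))  # (240, 100)
print(moneyline_to_bet(+195))  # (100, 195)
```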

Suppose you think the Patriots have a 25% chance of winning. Under that assumption, we can compute the expected value of the first wager like this:

def expected_value(p, wager, payout):
    return p * payout - (1-p) * wager
expected_value(p=0.25, wager=100, payout=195)
-26.25

If the Patriots actually have a 25% chance of winning, the first wager has negative expected value – so you probably don’t want to make it.

Now let’s compute the expected value of the second wager – assuming the Seahawks have a 75% chance of winning:

expected_value(p=0.75, wager=240, payout=100)
15.0

The expected value of this wager is positive, so you might want to make it – but only if you have good reason to think the Seahawks have a 75% chance of winning.

Implied Probability

More generally, we can compute the expected value of each wager for a range of probabilities from 0 to 1.

import numpy as np

# ps is the probability that the Patriots win, so the Seahawks win with probability 1-ps
ps = np.linspace(0, 1)
ev_patriots = expected_value(ps, 100, 195)
ev_seahawks = expected_value(1-ps, 240, 100)

Here’s what they look like.

plt.plot(ps, ev_patriots, label='Bet on Patriots')
plt.plot(ps, ev_seahawks, label='Bet on Seahawks')
plt.axhline(0, color='gray', alpha=0.4)

decorate(xlabel='Actual probability Patriots win',
        ylabel='Expected value of wager')
[Figure: Expected value of each wager vs actual probability Patriots win]

To find the crossover point, we can set the expected value to 0 and solve for p. This function computes the result:

def crossover(wager, payout):
    return wager / (wager + payout)
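
For reference, the algebra behind `crossover` is short. Setting the expression in `expected_value` to zero and solving for p:

```latex
\begin{aligned}
p \cdot \text{payout} - (1 - p) \cdot \text{wager} &= 0 \\
p \, (\text{payout} + \text{wager}) &= \text{wager} \\
p &= \frac{\text{wager}}{\text{wager} + \text{payout}}
\end{aligned}
```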

Here’s crossover for a bet on the Patriots at the offered odds.

p1 = crossover(100, 195)
p1
0.3389830508474576

If you think the Patriots have a probability higher than the crossover, the first bet has positive expected value.

And here’s the crossover for a bet on the Seahawks.

p2 = crossover(240, 100)
p2
0.7058823529411765

If you think the Seahawks have a probability higher than this crossover, the second bet has positive expected value.

So the offered odds imply that the consensus view of the betting market is that the Patriots have a 33.9% chance of winning and the Seahawks have a 70.6% chance. But you might notice that the sum of those probabilities exceeds 1.

p1 + p2
1.0448654037886342

What does that mean?

The Take

The sum of the crossover probabilities determines “the take”, which is the share of the betting pool taken by “the house” – that is, the entity that takes the bets.

For example, suppose 1000 people take the first wager and bet $100 each on the Patriots. And 1000 people take the second wager and bet $240 on the Seahawks.

Here’s the total expected value of all of those wagers.

total = expected_value(ps, 100_000, 195_000) + expected_value(1-ps, 240_000, 100_000) 
plt.plot(ps, total, label='Total')
plt.axhline(0, color='gray', alpha=0.4)

decorate(xlabel='Actual probability Patriots win',
        ylabel='Total expected value of all wagers')
[Figure: Total expected value of all wagers vs actual probability Patriots win]

The total expected value is negative for all probabilities (or zero if the Patriots have no chance at all) – which means the house wins.
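
Another way to see this is to look at the house's profit in each outcome. This sketch uses the 1000 bettors on each side from the example and assumes the house simply keeps the losing wagers and pays the winning gains.

```python
# Bets from the example: 1000 people on each side
patriots_pool = 1000 * 100    # total wagered on the Patriots
seahawks_pool = 1000 * 240    # total wagered on the Seahawks
patriots_payout = 1000 * 195  # paid out if the Patriots win
seahawks_payout = 1000 * 100  # paid out if the Seahawks win

# The house keeps the losing side's wagers and pays the winning side's gains
house_if_patriots_win = seahawks_pool - patriots_payout
house_if_seahawks_win = patriots_pool - seahawks_payout
print(house_if_patriots_win, house_if_seahawks_win)  # 45000 0
```

With these bets the house breaks even if the Seahawks win and gains $45,000 if the Patriots win, so the bettors' total expected value is -45,000 times the probability that the Patriots win.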

How much the house wins depends on the actual probability. As an example, suppose the actual probability is the midpoint of the probabilities implied by the odds:

p = (p1 + (1-p2)) / 2
p
0.31655034895314055

In that case, here’s the expected take, assuming that the implied probability is correct.

take = -expected_value(p, 100_000, 195_000) - expected_value(1-p, 240_000, 100_000) 
take
14244.765702891316

As a percentage of the total betting pool, it’s a little more than 4%.

take / (100_000 + 240_000)
0.04189636971438623

We could have approximated that by computing the “overround”, which is the amount by which the sum of the implied probabilities exceeds 1.

(p1 + p2) - 1
0.04486540378863424

Don’t Bet

In summary, here are the reasons you should not bet on the Super Bowl:

  • If the implied probabilities are right (within a few percent) all wagers have negative expected value.
  • If you think the implied probabilities are wrong, you might be able to make a good bet – but only if you are right. The odds represent the aggregated knowledge of everyone who places a bet, which probably includes a lot of people who know more than you.
  • If you spend a lot of time and effort, you might find instances where the implied probabilities are wrong, and you might even make money in the long run. But there are better things you could do with your time.

Betting is a zero-sum game if you include the house and a negative-sum game for people who bet. If you make money, someone else loses – there is no net creation of economic value.

So, if you have the skills to beat the odds, find something more productive to do.

The Girl Born on Tuesday

Some people have strong opinions about this question:

In a family with two children, if at least one of the children is a girl born on Tuesday, what are the chances that both children are girls?

In this article, I hope to offer

  1. A solution to one interpretation of this question,
  2. An explanation of why the solution seems so counterintuitive,
  3. A discussion of other interpretations, and
  4. An implication of this problem for teaching and learning probability.

Let’s get started.

One interpretation

One reason this problem is contentious is that it is open to multiple interpretations. I’ll start by presenting just one – then we’ll get back to the ambiguity.

First, to avoid real-world complications, let’s assume an imaginary world where:

  • Every family has two children.
  • 50% of children are boys and 50% are girls.
  • All days of the week are equally likely birth days.
  • Genders and birth days are independent.

Second, we will interpret the question in terms of conditional probability; that is, we’ll compute P(B|A), where

  • A is “at least one of the children is a girl born on Tuesday”, and
  • B is “both children are girls”.

Under these assumptions and this interpretation, the answer is unambiguous – and it turns out to be 13/27 (about 48.1%).

But why?

This problem is counterintuitive because it elicits confusion between causation and evidence.

  • If a family has a girl born on a Tuesday, that does not cause the other child to be a girl.
  • But the fact that a family has a girl born on Tuesday is evidence that the other child is a girl.

To see why, imagine two families: the first has one girl and the other has ten girls. Suppose I choose one of the families at random, check to see whether they have a girl born on Tuesday, and find that they do.

Which family do you think I chose?

  • If I chose the family with one girl, the chance is only 1/7 (about 14%) that she was born on Tuesday.
  • If I chose the family with ten girls, the chance is about 79% that at least one of them was born on a Tuesday.
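
Those two numbers come from the same formula, sketched below under the assumptions of the imaginary world (birth days independent and equally likely). The function name is mine.

```python
from fractions import Fraction

def prob_at_least_one_tuesday(n_girls):
    """Probability that at least one of n_girls was born on a Tuesday,
    assuming birth days are independent and uniform over the week."""
    return 1 - Fraction(6, 7) ** n_girls

print(float(prob_at_least_one_tuesday(1)))   # about 0.143
print(float(prob_at_least_one_tuesday(10)))  # about 0.786
```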

And that’s the key to understanding the problem:

A family with more than one girl is more likely to have one born on Tuesday. Therefore, if a family has a girl born on a Tuesday, it is more likely that they have more than one girl.

That’s the qualitative argument. Now we’ll make it quantitative – with Bayes’s Theorem.

Bayes’s Theorem

Let’s start with four kinds of two-child families.

kinds = ['Boy Boy', 'Boy Girl', 'Girl Boy', 'Girl Girl']

Under our simplifying assumptions, these combinations are equally likely, so their prior probabilities are equal.

from fractions import Fraction

prior = pd.Series(Fraction(1, 4), kinds)
display(prior, 'prior')
             prior
Boy Boy      1/4
Boy Girl     1/4
Girl Boy     1/4
Girl Girl    1/4

Now for each kind of family, let’s compute the likelihood of a girl born on Tuesday:

  • If there are two boys, the probability of a girl born on Tuesday is 0.
  • If there is one girl, the probability she is born on Tuesday is 1/7.
  • If there are two girls, the probability at least one is born on Tuesday is 1 - (6/7)**2.

Let’s put those values in a list.

p = Fraction(1, 7)
likelihood = [0, p, p, 1 - (1-p)**2]
likelihood
[0, Fraction(1, 7), Fraction(1, 7), Fraction(13, 49)]

To compute the posterior probabilities, we multiply the prior and likelihood, then normalize so the results add up to 1.

posterior = prior * likelihood
posterior /= posterior.sum()
display(posterior, 'posterior')
             posterior
Boy Boy      0
Boy Girl     7/27
Girl Boy     7/27
Girl Girl    13/27

The posterior probability of two girls is 13/27. As always, Bayes’s Theorem is the chainsaw that cuts through the knottiest problems in probability.

Other versions

Everything so far is based on the interpretation of the question as a conditional probability. But many people have pointed out that the question is ambiguous because it does not specify how we learn that the family has a girl born on a Tuesday.

This objection is valid:

  1. The answer depends on how we get the information, and
  2. The statement of the problem does not say how.

There are many versions of this problem that specify different ways you might learn that a family has a girl born on a Tuesday, and you might enjoy the challenge of solving them.

In general, if we specify the process that generates the data, we can use simulation, enumeration, or Bayes’s Theorem to compute the conditional probability given the data.
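To show how much the process matters, here is a sketch of one alternative version. It assumes a hypothetical process that is not in the problem statement: a parent picks one of their two children at random and reports that child's sex and birth day. We then ask for the probability of two girls given that the report is "girl born on Tuesday".

```python
from fractions import Fraction
from itertools import product

TUESDAY = 2  # arbitrary label for the reported day
target = ('G', TUESDAY)

children = list(product('BG', range(7)))
families = list(product(children, repeat=2))
p_family = Fraction(1, len(families))

p_report = Fraction(0)         # P(report is "girl born on Tuesday")
p_report_and_gg = Fraction(0)  # P(that report, and both children are girls)

for family in families:
    # Hypothetical process: each child is the one reported with probability 1/2.
    for child in family:
        if child == target:
            weight = p_family * Fraction(1, 2)
            p_report += weight
            if all(sex == 'G' for sex, _ in family):
                p_report_and_gg += weight

print(p_report_and_gg / p_report)  # 1/2
```

Under this process the answer is 1/2 rather than 13/27, because a family with two girls is no longer more likely to produce the report than a family with one girl born on Tuesday.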

But what should we do if the data-generating process is not uniquely specified?

  • One option is to say that the question has no answer because it is ambiguous.
  • Another option is to specify a prior distribution of possible data-generating processes, compute the answer under each process, and apply the law of total probability.

Some of the people who choose the second option also choose a prior distribution so that the answer turns out to be 1/2. In my view, that is a correct answer to one interpretation, but that interpretation seems arbitrary – by choosing different priors, we can make the answer almost anything.

I prefer the interpretation I presented, because

  1. I believe it is what was intended by the people who posed the problem,
  2. It is consistent with the conventional interpretation of conditional probability,
  3. It yields an answer that seems paradoxical at first, so it is an interesting problem,
  4. The apparent paradox can be resolved in a way that sheds light on conditional probability and the idea of independent events.

So I think it’s a perfectly good problem – it’s just hard to express it unambiguously in natural language (as opposed to math notation).

But you don’t have to agree with me. If you prefer a different interpretation of the question, and it leads to a different answer, feel free to write a blog post about it.

What about independence?

I think the girl born on Tuesday carries a lesson about how we teach. In introductory probability, students often learn two ways to compute the probability of a conjunction. First they learn the easy way:

  • P(A and B) =  P(A) P(B)

But they are warned that this only applies if A and B are independent. Otherwise, they have to do it the hard way:

  • P(A and B) =  P(A) P(B|A)

But how do we know whether A and B are independent? Formally, they are independent if

  • P(B|A) = P(B)

So, in order to know which formula to use, you have to know P(B|A). But if you know P(B|A), you might as well use the second formula.

Rather than check independence by conditional probability, it is more common to assert independence by intuition. For example, if we flip two coins, we have a strong intuition that the outcomes are independent. And if the coins are known to be fair, this intuition is correct. But if there is any uncertainty about the probability of heads, it is not: the first outcome is evidence about the coin, and therefore about the second outcome.
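Here is a small sketch of the coin case. It assumes, purely for illustration, that the same coin is flipped twice and that its probability of heads is either 1/4 or 3/4, each equally likely; those values are hypothetical.

```python
from fractions import Fraction

# Hypothetical prior over the coin's probability of heads: {bias: prior probability}
biases = {Fraction(1, 4): Fraction(1, 2), Fraction(3, 4): Fraction(1, 2)}

# Given the bias, the flips are independent, so we average over the prior.
p_h1 = sum(prior * p for p, prior in biases.items())         # P(first flip is heads)
p_h1_h2 = sum(prior * p * p for p, prior in biases.items())  # P(both flips are heads)
p_h2_given_h1 = p_h1_h2 / p_h1

print(p_h1, p_h2_given_h1)  # 1/2 5/8
```

By symmetry the unconditional probability of heads on the second flip is also 1/2, but after seeing heads on the first flip it rises to 5/8, so the two outcomes are not independent.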

The coin example, along with Monty Hall, Bertrand’s Boxes, and many more, demonstrates the real lesson of the girl born on Tuesday: our intuition for independence is wildly unreliable.

Which means we might want to rethink the way we teach it.

In general

Previously I wrote about a version of this problem where the girl is named Florida. In general, if we are given that a family has at least one girl with a particular property, and the prevalence of the property is p, we can use Bayes’s Theorem to compute the probability of two girls.

I’ll use SymPy to represent the priors and the probability p.

from sympy import Rational

prior = pd.Series(Rational(1, 4), kinds)
display(prior, 'prior')
             prior
Boy Boy      1/4
Boy Girl     1/4
Girl Boy     1/4
Girl Girl    1/4

Here are the likelihoods in terms of p.

from sympy import symbols

p = symbols('p')

likelihood = [0, p, p, 1 - (1-p)**2]
likelihood
[0, p, p, 1 - (1 - p)**2]

And here are the posteriors.

posterior = prior * likelihood
posterior /= posterior.sum()

for kind, prob in posterior.items():
    print(kind, prob.simplify())
Boy Boy 0
Boy Girl -1/(p - 4)
Girl Boy -1/(p - 4)
Girl Girl (p - 2)/(p - 4)

So the general answer is (p-2) / (p-4).

If we plug in p = 1/7, we get 13/27 again.

prob = posterior['Girl Girl'].subs({p: Rational(1, 7)})
prob

Or for the girl named Florida, let’s assume one girl out of 1000 is named Florida.

prob = posterior['Girl Girl'].subs({p: Rational(1, 1000)})
prob
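As a cross-check that does not depend on SymPy, the closed form can be evaluated with exact fractions. This is a sketch; the function name `prob_two_girls` is illustrative.

```python
from fractions import Fraction

def prob_two_girls(p):
    """Closed form derived above: (p - 2) / (p - 4)."""
    p = Fraction(p)
    return (p - 2) / (p - 4)

print(prob_two_girls(Fraction(1, 7)))     # 13/27
print(prob_two_girls(Fraction(1, 1000)))  # 1999/3999
```

For the girl named Florida, with the assumed prevalence of 1 in 1000, the result is 1999/3999, which is just under 1/2.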

The following figure shows the probability of two girls as a function of the prevalence of the property.

import numpy as np
import matplotlib.pyplot as plt

xs = np.linspace(0, 1)
ys = (xs-2) / (xs-4)

plt.plot(xs, ys)
plt.xlabel('Prevalence of the property')
plt.ylabel('Conditional probability of two girls')
[Figure: conditional probability of two girls as a function of the prevalence of the property]

If the property is rare – like the name Florida – the conditional probability is close to 1/2. If the property is common – like having a name – the conditional probability is close to 1/3.

Objections

Here are some objections to the “girl born on Tuesday” problem along with my responses.

You have to model the message, not just the event

Objection.
The statement “at least one child is a girl born on Tuesday” should not be treated as a bare event in a probability space. It should be treated as the outcome of a random process that generates messages or facts we learn. Therefore, the probability space must include not only family composition, but also the mechanism by which that information is produced. Any solution that conditions only on the family outcomes is incomplete.

Response.
I agree that if the problem is interpreted as conditioning on a message (something that is said, reported, or chosen from among several true statements), then the reporting mechanism matters and must be modeled explicitly. However, I don’t think such a mechanism is required in all cases. It is standard and meaningful to interpret a question as conditioning on an event – an extensional property of outcomes – without introducing an additional random variable for how the information was obtained. That is the interpretation I adopt here.

Without a specified selection rule, symmetry forces the answer to 1/2

Objection.
If the problem does not specify how the information was obtained, then we must assume a symmetric rule for selecting which true statement is revealed. Under that assumption, conditioning on “at least one boy” or “at least one girl” must give the same answer, and applying the law of total probability forces the posterior probability to equal the prior. Therefore, the correct answer must be 1/2.

Response.
This conclusion follows only if we assume that the conditioning is on a message chosen from a symmetric set of alternatives. Under that interpretation, the result does depend on the selection rule, and 1/2 is a valid answer for one particular choice of rule. But if the conditioning is on an event rather than a message, there is no requirement that different events form a symmetric partition or that the law of total probability be applied across them in this way. Under the event-based interpretation, the argument forcing 1/2 does not apply.

The problem is ambiguous and therefore has no answer

Objection.
Because the problem does not specify how we learn that there is a girl born on Tuesday, it is fundamentally ambiguous. Since different interpretations lead to different answers, the question has no single correct solution.

Response.
It’s true that the problem is ambiguous as stated in natural language. One option is to declare it unanswerable. Another is to resolve the ambiguity by adopting a conventional default interpretation. I choose the latter: I interpret the question as a conditional probability defined on an explicit probability model and make that interpretation clear by enumerating the sample space. Under that interpretation, the answer is unambiguous and, in my view, interesting and instructive – even if other interpretations lead to different answers.

You are changing the sampling procedure

Objection.
Some people object that the 13/27 result comes from changing how families are selected. Conditioning on “at least one child is a girl born on Tuesday” oversamples families with more girls, so the conditional distribution no longer represents the original population of two-child families. From this perspective, the result feels like an artifact of biased sampling rather than a genuine probability update.

Response.
That description is accurate, but it is not a flaw. Conditioning is biased sampling: evidence changes the distribution of outcomes. Families with more girls really are more likely to satisfy the condition, and the conditional probability reflects that fact.
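A simulation makes the "biased sampling" description concrete. The sketch below draws random two-child families under the simplifying assumptions of this post and keeps only the families that satisfy the condition; the seed, the sample size, and the names are arbitrary choices of mine.

```python
import random

random.seed(17)
TUESDAY = 2  # arbitrary label for the day in the condition

def random_child():
    """Return a (sex, day) pair with sex and day independent and uniform."""
    return random.choice('BG'), random.randrange(7)

n_families = 200_000
selected = 0   # families with at least one girl born on Tuesday
two_girls = 0  # selected families where both children are girls

for _ in range(n_families):
    family = (random_child(), random_child())
    if ('G', TUESDAY) in family:
        selected += 1
        if all(sex == 'G' for sex, _ in family):
            two_girls += 1

fraction_selected = selected / n_families
estimate = two_girls / selected
print(fraction_selected, estimate)
```

About 14% of families pass the filter (27/196 in the exact calculation), and among them the fraction with two girls should land close to 13/27, compared to 1/4 in the unfiltered population.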

The day of the week seems irrelevant

Objection.
Tuesday has nothing to do with gender, so it feels wrong that adding this detail should change the probability. Since the day of the week does not cause a child to be a girl, it seems irrelevant to the question.

Response.
This objection reflects a common confusion between causal independence and evidential relevance. While the day of the week does not cause the other child’s gender, it provides evidence about the number of girls in the family. Evidence can change probabilities even when there is no causal connection.

The result depends on unrealistic independence assumptions

Objection.
The solution assumes that genders and days of the week are independent and uniformly distributed, which is not true in the real world. If those assumptions are relaxed, the answer changes.

Response.
That is correct, but those assumptions are not the source of the puzzle. Relaxing them changes the numerical value of the answer, but not the underlying logic. The same kind of reasoning applies under more realistic models.
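As a sketch of what relaxing the assumptions looks like, the function below takes the probability that a child is a girl and the probability of being born on the target day as parameters. It still assumes that the two children are independent of each other and that sex and day are independent for each child. The non-default values in the second call are hypothetical, chosen only to show that the number moves.

```python
from fractions import Fraction
from itertools import product

def prob_two_girls(p_girl, p_day):
    """P(both girls | at least one girl born on the target day)."""
    p_target = p_girl * p_day  # a given child is a girl born on the target day
    child_types = {
        'target girl': p_target,
        'other girl': p_girl - p_target,
        'boy': 1 - p_girl,
    }
    p_a = 0   # P(at least one target girl)
    p_ab = 0  # P(at least one target girl, and no boys)
    for (t1, q1), (t2, q2) in product(child_types.items(), repeat=2):
        if 'target girl' in (t1, t2):
            p_a += q1 * q2
            if 'boy' not in (t1, t2):
                p_ab += q1 * q2
    return p_ab / p_a

# The idealized world of this post
print(prob_two_girls(Fraction(1, 2), Fraction(1, 7)))  # 13/27

# Hypothetical values: slightly fewer girls, a slightly more common day
print(float(prob_two_girls(Fraction(49, 100), Fraction(3, 20))))
```

The second result differs from 13/27, but the calculation has the same structure: weight each kind of family by its probability, keep the ones consistent with the evidence, and normalize.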

The problem is artificial or pathological

Objection.
Some readers reject the problem not because the calculation is wrong, but because the setup feels artificial or unlike how information is learned in real life. From this view, the problem is a trick rather than a meaningful probability question.

Response.
Whether this is a flaw or a feature depends on the goal. The problem is artificial, but it is intended to expose how unreliable our intuitions about conditional probability and independence can be. In that sense, its artificiality is what makes it pedagogically useful. The underlying issue – determining how evidence bears on hypotheses – comes up in real-world problems all the time. And getting it wrong has real-world consequences.