The Frog Puzzle

Here’s a probability puzzle from a TED-Ed video called “Can you solve the frog riddle?” by Derek Abbott. It came up recently in this Reddit thread:

You’re stranded in a rainforest after accidentally eating a poisonous mushroom. To survive the poison, you need to lick a certain species of frog. Only female frogs produce the antidote. Male and female frogs occur in equal numbers and look identical, but male frogs have a distinctive croak.

You see one frog alone on a tree stump. In another direction, you hear the croak of a male frog coming from a clearing with two frogs. You can’t tell which one made the sound.

You only have time to go to one place. What are your chances of survival if you go to the clearing and lick both frogs? What if you go to the lone frog?

The second question is relatively easy: if we assume that you are equally likely to see a male or female frog, the probability is 50% that the lone frog is female.

The first question depends on how we interpret the puzzle. In particular, it hinges on the word “distinctive” – does that mean:

  • Only male frogs croak, and the sound is distinguishable from background noises, or
  • Both male and female frogs croak, but the male croak is distinguishable from the female croak.

Based on the answer presented in the video, the first meaning is intended. So we’ll start by solving that version.

But the second meaning makes the problem a little harder, so we’ll solve that one, too.

Only Male Frogs Croak

To solve the intended version of the puzzle, we’ll assume

  • Only male frogs croak, and
  • When two frogs appear together, their sexes are independent.

So we’ll start with a prior where all two-frog combinations are equally likely.

from sympy import Rational

hypo = ['FF', 'FM', 'MF', 'MM']
prior = Rational(1)

Now let’s think about the likelihood of the data under each scenario. In the video, the solution is based on these assumptions:

  • If both frogs are female, the probability of hearing the male croak is 0.
  • If either frog is male, the probability that one of them croaks is 1.

likelihood = [0, 1, 1, 1]

I’ll use a BayesTable to compute the posterior probability for each scenario.

import pandas as pd
import numpy as np

class BayesTable(pd.DataFrame):
    def __init__(self, hypo, prior=1, **options):
        columns = ['prior', 'likelihood', 'unnorm', 'posterior']
        super().__init__(index=hypo, columns=columns, **options)
        self.prior = prior
    
    def update(self, likelihood):
        self.likelihood = likelihood
        self.unnorm = self.prior * self.likelihood
        nc = self.unnorm.sum()
        self.posterior = self.unnorm / nc

table = BayesTable(hypo, prior)
table.update(likelihood)
table

    prior  likelihood  unnorm  posterior
FF      1           0       0          0
FM      1           1       1        1/3
MF      1           1       1        1/3
MM      1           1       1        1/3

From the table, we can extract the posterior probability that both frogs are male.

from sympy import init_printing
init_printing(use_latex=False)
table.posterior['MM']
1/3

With these assumptions, the probability is 1/3 that both frogs are male (and you die), so the probability is 2/3 that at least one is female (and you live).

And that’s the answer in the video.
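
As a sanity check, here is a minimal simulation sketch of this version of the puzzle. It assumes, as the video does, that a pair with at least one male always produces a croak and a pair of females never does. The function name `simulate_survival` and the seed are illustrative and not part of the original notebook.

```python
import random

def simulate_survival(n=100_000, seed=1):
    """Estimate P(at least one female | a male croak was heard)."""
    rng = random.Random(seed)
    heard = 0      # pairs that contain a male, so a croak is heard
    survived = 0   # of those, pairs that also contain a female
    for _ in range(n):
        pair = [rng.choice('FM') for _ in range(2)]
        if 'M' in pair:
            heard += 1
            survived += 'F' in pair
    return survived / heard

estimate = simulate_survival()
print(estimate)
```

Under these assumptions the estimate should land close to 2/3.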

Poisson (not Poison) Frogs

But is that the right likelihood? Suppose frogs are equally likely to croak at any instant in time, so their croaks follow a Poisson process. If we assume that these croaking processes are independent, two frogs would be more likely to croak, during a given interval, than one.

If the interval is much longer than the average time between croaks, the probability that either frog croaks approaches 1, which is consistent with the previous solution.

But if the interval is short – as it might be if you were deciding whether to approach the first frog – the probability of hearing a croak would be double if there are two male frogs rather than one.

In that case, the likelihood of the data would be:

half = Rational(1, 2)
likelihood = [0, half, half, 1]

And here are the posterior probabilities:

table = BayesTable(hypo, prior)
table.update(likelihood)
table

    prior  likelihood  unnorm  posterior
FF      1           0       0          0
FM      1         1/2     1/2        1/4
MF      1         1/2     1/2        1/4
MM      1           1       1        1/2

With Poisson frogs and a short interval, the probability of two male frogs is 1/2, so it doesn’t matter whether you approach the lone frog or the pair of frogs.
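
To see where the two limiting cases come from, here is a small sketch, assuming each male croaks as an independent Poisson process. With rate `rate` and listening time `interval`, the probability of hearing at least one croak from `n_males` frogs is 1 - exp(-n_males * rate * interval). The function name and the rate of 1 croak per time unit are assumptions made for illustration.

```python
import math

def prob_croak(n_males, rate, interval):
    """P(at least one croak) from n_males independent Poisson croakers."""
    return 1 - math.exp(-n_males * rate * interval)

# Ratio of P(croak | two males) to P(croak | one male)
ratio_short = prob_croak(2, 1.0, 0.01) / prob_croak(1, 1.0, 0.01)
ratio_long = prob_croak(2, 1.0, 100) / prob_croak(1, 1.0, 100)

print(ratio_short)  # close to 2 for a short interval
print(ratio_long)   # close to 1 for a long interval
```

The ratio works out to 1 + exp(-rate * interval), which is why the likelihoods [0, 1/2, 1/2, 1] are appropriate for short intervals and [0, 1, 1, 1] for long ones.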

Female Frogs Croak, Too

Now let’s think about the other interpretation of the puzzle: suppose both male and female frogs croak, but we can distinguish one from the other. And suppose male and female frogs croak at different rates, but they are still independent.

Assume that male frogs croak at a rate of 1 per time unit, and female frogs at a rate of r per time unit. In that case, if we start listening at a random time, the probability that we hear a male frog first is 1 / (r+1) if there’s only one male frog, and 1 if there are two male frogs.

So the likelihood in this case is:

from sympy import symbols

r = symbols('r')
likelihood = [0, 1 / (r+1), 1 / (r+1), 1]
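
The factor 1 / (r+1) is the probability that the male croaks first in a mixed pair. Here is a quick simulation sketch of that claim, assuming the waiting time until each frog’s next croak is exponential (rate 1 for the male, rate r for the female). The name `prob_male_first`, the sample size, and the seed are illustrative.

```python
import numpy as np

def prob_male_first(r, n=200_000, seed=17):
    """Estimate P(male croaks before female) in a mixed pair."""
    rng = np.random.default_rng(seed)
    male_wait = rng.exponential(scale=1.0, size=n)        # rate 1
    female_wait = rng.exponential(scale=1.0 / r, size=n)  # rate r
    return np.mean(male_wait < female_wait)

for rate in [0.5, 1.0, 2.0]:
    print(rate, prob_male_first(rate), 1 / (rate + 1))
```

For each value of r, the simulated fraction should be close to 1 / (r+1).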

And here are the posteriors:

table = BayesTable(hypo, prior)
table.update(likelihood)
table

    prior  likelihood     unnorm                    posterior
FF      1           0          0                            0
FM      1   1/(r + 1)  1/(r + 1)  1/((1 + 2/(r + 1))*(r + 1))
MF      1   1/(r + 1)  1/(r + 1)  1/((1 + 2/(r + 1))*(r + 1))
MM      1           1          1            1/(1 + 2/(r + 1))

In this scenario, here’s the probability you die.

prob_die = table.posterior['MM']
prob_die.simplify()
r + 1
─────
r + 3

If female frogs don’t croak, we get the same answer as in the first scenario.

prob_die.subs({r: 0})
1/3

If male and female frogs croak at the same rate, the probability that both frogs are male is 1/2.

prob_die.subs({r: 1})
1/2

But if female frogs croak much more often, the fact that a male croaked first is strong evidence that both are male, so the posterior probability is close to 1.

prob_die.subs({r: 1000}).evalf()
0.998005982053839

Assortative Mating

Now suppose that when we see two frogs together, their sexes are not independent; specifically, let’s assume that the probability of a same-sex pair is p, so the probability of a mixed-sex pair is 1-p. In this scenario, the priors (before we hear the croak) are not equal.

p = symbols('p')
prior = [p, 1-p, 1-p, p]

Here are the posterior probabilities, assuming again that both male and female frogs croak, possibly at different rates.

likelihood = [0, 1 / (r+1), 1 / (r+1), 1]
table = BayesTable(hypo, prior)
table.update(likelihood)
table

    prior  likelihood           unnorm                                posterior
FF      p           0                0                                        0
FM  1 - p   1/(r + 1)  (1 - p)/(r + 1)  (1 - p)/((p + 2*(1 - p)/(r + 1))*(r + 1))
MF  1 - p   1/(r + 1)  (1 - p)/(r + 1)  (1 - p)/((p + 2*(1 - p)/(r + 1))*(r + 1))
MM      p           1                p                p/(p + 2*(1 - p)/(r + 1))

table.posterior['MM'].simplify()
 p⋅(r + 1) 
───────────
p⋅r - p + 2

If p=1/2, this simplifies to the previous scenario.

table.posterior['MM'].subs({p: half}).simplify()
r + 1
─────
r + 3

And if r=0 (female frogs don’t croak), we get the answer presented in the video.

table.posterior['MM'].subs({p: half, r: 0}).simplify()
1/3

But depending on the assumptions, the probability can be as low as 0.

table.posterior['MM'].subs({p: 0, r: 1}).simplify()
0

Or as high as 1.

table.posterior['MM'].subs({p: 1, r: 0}).simplify()
1

Or anything in between. As is often the case with problems like these, the answer depends on a precise specification of the data-generating process.
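
For readers who want to try other values without SymPy, here is a minimal sketch of the same Bayesian update with exact fractions. It assumes the priors and likelihoods used in this section; the function name `prob_both_male` is illustrative.

```python
from fractions import Fraction

def prob_both_male(p, r):
    """Posterior probability of MM after hearing a male croak first.

    p: probability of a same-sex pair
    r: female croak rate relative to the male rate
    """
    p, r = Fraction(p), Fraction(r)
    # Unnormalized priors, as in the table: [p, 1-p, 1-p, p]
    prior = {'FF': p, 'FM': 1 - p, 'MF': 1 - p, 'MM': p}
    one_male = 1 / (r + 1)
    likelihood = {'FF': 0, 'FM': one_male, 'MF': one_male, 'MM': 1}
    unnorm = {h: prior[h] * likelihood[h] for h in prior}
    return unnorm['MM'] / sum(unnorm.values())

print(prob_both_male(Fraction(1, 2), 0))  # the answer in the video
print(prob_both_male(Fraction(1, 2), 1))  # equal croak rates
```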

Discussion

If all of this seems like more trouble than it’s worth, let me suggest a metacognitive shortcut for solving puzzles like this.

  1. Notice that in all probability puzzles, the answer is either 1/2 or 1/3.
  2. Also, the answer is always counterintuitive; otherwise it wouldn’t be a puzzle.
  3. Therefore, if your intuition says the answer is 1/2, it’s actually 1/3, and vice versa.

That might save you some time.

This notebook uses methods and materials from Think Bayes, second edition. If you like this sort of thing, you can read the whole book, and more examples, at allendowney.github.io/ThinkBayes2/.



Planning for your midlife crisis

Yesterday I presented a talk at ODSC East 2026, called “Counterfactual Analysis with Bayesian Models: What Drives the Life Expectancy Gap?” Here’s the abstract:

Across nearly every country in the world, women live longer than men—but the size of this gap varies from about two years in some countries to more than twelve in others. What explains these differences, and how much of the gap can be closed?

In this talk, I present a practical approach to counterfactual analysis using Bayesian regression models. Using publicly available mortality data, we build a model that relates the life expectancy gap between men and women to differences in cause-specific death rates, including homicide, drug overdoses, traffic fatalities, smoking-related disease, and chronic illness.

The model generates posterior simulations that answer “what-if” questions. For example: How much smaller would the U.S. life expectancy gap be if homicide rates matched those in Western Europe?

The talk presents the workflow—from assembling global datasets to fitting interpretable Bayesian models with PyMC and generating counterfactual simulations. Attendees will learn how Bayesian models can support explainable modeling and analysis under uncertainty.

I think the talk went well, and we got some good questions at the end. There’s no recording, unfortunately, but my slides are here. And if you want to know more, I have a series of blog posts on Substack.

The fifth and final post is on the way. In the meantime, here’s a quick post on a related topic.

Are you middle-aged?

Here’s a question from Reddit’s Stupid Questions forum:

I always thought middle age was in your 40s but since life expectancy is around 75 or so, wouldn’t it be about 35?

If life expectancy is 75, you might think the midpoint is half that, which is 37.5. But if 75 is life expectancy at birth and you survive to age 37.5, your life expectancy at that age is higher than 75. So 37.5 is not halfway!

If we really want to find the midpoint – and it wouldn’t be Probably Overthinking It if we didn’t – we have to find the age where your expected remaining lifetime equals your current age.

Let’s do it.

Data

From the Human Mortality Database I downloaded life tables for the United States, combined and broken down for men and women. The following function reads and cleans a table.

def read_life_table(filename):
    lt = pd.read_fwf(filename, skiprows=2, infer_nrows=200)
    lt['Age'] = lt['Age'].str.replace('+', '', regex=False).astype(int)
    return lt

Here are the first few rows of the combined table (see notes below for details).

blt = read_life_table('../data/bltper_1x1.txt')
blt.head()

   Year  Age       mx       qx    ax      lx    dx     Lx       Tx     ex
0  1933    0  0.06129  0.05861  0.25  100000  5861  95624  6089609  60.90
1  1933    1  0.00946  0.00941  0.50   94139   886  93696  5993985  63.67
2  1933    2  0.00435  0.00434  0.50   93253   405  93050  5900289  63.27
3  1933    3  0.00310  0.00310  0.50   92848   288  92704  5807239  62.55
4  1933    4  0.00239  0.00238  0.50   92560   221  92450  5714535  61.74

We’ll also read the female and male tables.

flt = read_life_table('../data/fltper_1x1.txt')
mlt = read_life_table('../data/mltper_1x1.txt')

The tables include data from 1933 to 2024, so we’ll select the most recent data.

year = blt['Year'].unique()[-1]
table = blt.query('Year == @year').set_index('Age')

The column we’ll use is ex, which is life expectancy as a function of age.

age = table.index.to_series()
ex = table['ex']

Life expectancy at birth is 79 years, so the naive midpoint is 39.5.

ex[0], ex[0] / 2
(79.08, 39.54)

But at age 40, expected remaining lifetime is 41.1, so 39.5 is not the midpoint.

ex[39], ex[40]
(42.04, 41.12)

This plot shows life expectancy at each age, compared to age.

ex.plot(label='Remaining life expectancy')
age.plot(label='Age')
decorate(ylabel='Years',
        title='Remaining life expectancy vs age, United States 2024')

[Figure: Remaining life expectancy vs age, United States 2024]

“Middle age” is where the lines cross, which we can compute by linear interpolation.

from scipy.interpolate import interp1d

inverse = interp1d(ex - age, age)
inverse(0)
array(40.58638743)

So the overall midpoint is 40.6 years. But as you might expect, it’s different for men and women. Let’s put the analysis we did in a function.

def get_midpoint(filename):
    lt = read_life_table(filename)
    year = lt['Year'].unique()[-1]
    table = lt.query('Year == @year').set_index('Age')

    # Index age by Age, as in the analysis above, so ex - age aligns on age
    age = table.index.to_series()
    ex = table['ex']

    inverse = interp1d(ex - age, age)
    return inverse(0)

And run it for men.

get_midpoint('../data/mltper_1x1.txt')
array(39.57142857)

And women.

get_midpoint('../data/fltper_1x1.txt')
array(41.56185567)

Men hit middle age at 39.6, women at 41.6.

The Gender Gap and Age

Finally, let’s see how the gender gap in life expectancy changes as a function of age.

ex_male = mlt.query('Year == @year').set_index('Age')['ex']
ex_female = flt.query('Year == @year').set_index('Age')['ex']
gap = ex_female - ex_male
gap.plot(label='')
decorate(ylabel='Years',
         title='Life expectancy gender gap vs age')

[Figure: Life expectancy gender gap vs age]

At birth the life expectancy gap is close to five years. At age 100, it is close to zero.

But just looking at the gap might be misleading. For a more complete picture let’s also look at the ratio.

ratio = ex_female / ex_male
ratio.plot(label='')

decorate(ylabel='Ratio',
         title='Life expectancy gender ratio (female / male)')

[Figure: Life expectancy gender ratio (female / male)]

The life expectancy ratio tells a more complicated story.

  • At birth, the ratio is 1.06, which means female babies live 6% longer, on average.
  • Around age 80, the ratio peaks at nearly 1.14 – so between female and male octogenarians, we expect the women to live 14% longer.
  • At advanced ages, the ratio declines steeply and actually crosses over after age 100 – although the crossover is minimal and might not be statistically valid.

To interpret these results, we can think about the causes of death that contribute to age-specific death rates at different stages of life.

  • In young adulthood, the causes of death that contribute most to gender gaps include road traffic, homicide, accidental injury, and drug use disorders.
  • In advanced adulthood, they include cancer, cardiovascular disease, respiratory disease, liver disease, diabetes, and suicide.

The causes that affect younger people have large gender gaps, but relatively low death rates. As people get older, these low-rate causes contribute less to age-specific death rates, and the higher-rate causes contribute more.

I think that’s a plausible explanation for the increasing ratio from age 0 to 80. For the decline that follows, I can only speculate that there is a selection effect: people who get to these advanced ages are likely to have better-than-average lifestyle histories (less smoking and drinking, better diet, more exercise) – and among people with better lifestyles, the gender gap is small.

Notes

Data credit: HMD. Human Mortality Database. Max Planck Institute for Demographic Research (Germany), University of California, Berkeley (USA), and French Institute for Demographic Studies (France). Available at [www.mortality.org].

Here are the columns of the 1×1 Period Life Tables:

  • Year: Calendar year to which the period life table refers.
  • Age: Exact age (x), in years, at the beginning of the interval ([x, x+1)).
  • mx: Central death rate at age (x): mx = dx / Lx.
  • qx: Probability of dying between ages (x) and (x+1): qx = dx / lx.
  • ax: Average fraction of the interval lived by those who die in ([x, x+1)). Typically around 0.5 for most ages, lower for infants (reflecting higher early mortality within the year).
  • lx: Number of survivors at exact age (x), out of a radix (usually 100,000 births).
  • dx: Number of deaths between ages (x) and (x+1): dx = lx * qx.
  • Lx: Person-years lived between ages (x) and (x+1), approximately lx - (1 - ax) * dx.
  • Tx: Total person-years remaining above age (x): the sum of Lx over all ages from (x) upward.
  • ex: Life expectancy at age (x): ex = Tx / lx.
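
The columns are linked by the usual life-table identities. As a rough check, here is a sketch that applies them to the first row of the combined table shown earlier (1933, age 0). The identities are the standard textbook definitions, which I am assuming match the ones HMD uses. The published values are rounded, so the checks are only approximate.

```python
# First row of the combined table (Year 1933, Age 0), as printed above
row = {'mx': 0.06129, 'qx': 0.05861, 'ax': 0.25, 'lx': 100000,
       'dx': 5861, 'Lx': 95624, 'Tx': 6089609, 'ex': 60.90}

dx_check = row['lx'] * row['qx']                    # deaths in the interval
Lx_check = row['lx'] - (1 - row['ax']) * row['dx']  # person-years lived
mx_check = row['dx'] / row['Lx']                    # central death rate
ex_check = row['Tx'] / row['lx']                    # life expectancy at age x

print(dx_check, row['dx'])
print(Lx_check, row['Lx'])  # differs slightly because ax is rounded to 0.25
print(mx_check, row['mx'])
print(ex_check, row['ex'])
```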

The details are in this Jupyter notebook.

Attention, Chinese Readers

The Chinese edition of Probably Overthinking It is available now (also here)!

If you have the Chinese edition, there are two sections you won’t get to read — so I am including them here.

Here is an excerpt from Chapter 3, including the deleted paragraph:

In the Present

The women surveyed in 1990 rejected the childbearing example of their mothers emphatically. On average, each woman had 2.3 fewer children than her mother. If that pattern had continued for another generation, the average family size in 2018 would have been about 0.8. But it wasn’t.

In fact, the average family size in 2018 was very close to 2, just as in 1990. So how did that happen?

As it turns out, this is close to what we would expect if every woman had one child fewer than her mother. The following figure shows the actual distribution in 2018, compared to the result if we start with the 1990 distribution and simulate the “one child fewer” scenario.

[Figure: actual distribution of family size in 2018, compared with the simulated “one child fewer” scenario]

The means of the two distributions are almost the same, but the shapes are different. In reality, there were more zero- and two-child families in 2018 than the simulation predicts, and fewer one-child families. But at least on average, it seems like women in the U.S. have been following the “one child fewer” policy for the last 30 years.

The scenario at the beginning of this chapter is meant to be light-hearted, but in reality governments in many places and times have enacted policies meant to control family sizes and population growth. Most famously, China implemented a one-child policy in 1980 that imposed severe penalties on families with more than one child. Of course, this policy is objectionable to anyone who considers reproductive freedom a fundamental human right. But even as a practical matter, the unintended consequences were profound.

Rather than catalog them, I will mention one that is particularly ironic: while this policy was in effect, economic and social forces reduced the average desired family size so much that, when the policy was relaxed in 2015 and again in 2021, average lifetime fertility increased to only 1.3, far below the level needed to keep the population constant, near 2.1. Since then, China has implemented new policies intended to increase family sizes, but it is not clear whether they will have much effect. Demographers predict that by the time you read this, the population of China will probably be shrinking [UPDATE: It is.]. The consequences of the one-child policy are widespread and will affect China and the rest of the world for a long time.

And here is an excerpt from Chapter 5, including the deleted explanation.

Child mortality

Fortunately, child mortality has decreased since 1900. The following figure shows the percentage of children who die before age 5 for four geographical regions, from 1900 to 2019. These data were combined from several sources by Gapminder, a foundation based in Sweden that “promotes sustainable global development […] by increased use and understanding of statistics.”

[Figure: percentage of children who die before age 5, for four geographical regions, 1900 to 2019]

In every region, child mortality has decreased consistently and substantially. The only exceptions are indicated by the vertical lines: the 1918 influenza pandemic, which visibly affected Asia, the Americas, and Europe; World War II in Europe (1939-1945); and the Great Leap Forward in China (1958-1962). In every case, these exceptions did not affect the long-term trend.

[COMMENT: I thought I was being diplomatic by referring generally to the Great Leap Forward — rather than the Great Chinese Famine or “Three Years of Great Famine” (三年大饥荒) — but apparently that was not enough.]

Although there is more work to do, especially in Africa, child mortality is substantially lower now, in every region of the world, than in 1900. As a result most people now are better new than used.

To demonstrate this change, I collected recent mortality data from the Global Health Observatory of the World Health Organization (WHO). For people born in 2019, we don’t know what their future lifetimes will be, but we can estimate it if we assume that the mortality rate in each age group will not change over their lifetimes.

Based on that simplification, the following figure shows average remaining lifetime as a function of age for Sweden and Nigeria in 2019, compared to Sweden in 1905.

[Figure: average remaining lifetime as a function of age for Sweden and Nigeria in 2019, compared to Sweden in 1905]

Since 1905, Sweden has continued to make progress; life expectancy at every age is higher in 2019 than in 1905. And Swedes now have the new-better-than-used property. Their life expectancy at birth is about 82 years, and it declines consistently over their lives, just like a light bulb.

Unfortunately, Nigeria has one of the highest rates of child mortality in the world: in 2019, almost 8% of babies died in their first year of life. After that, they are briefly better used than new: life expectancy at birth is about 62 years; however, a baby who survives the first year will live another 65 years, on average.

Going forward, I hope we continue to reduce child mortality in every region; if we do, soon every person born will be better new than used. Or maybe we can do even better than that.

Field Sobriety Tests and the Base Rate Fallacy

In Chapter 9 of Probably Overthinking It I wrote about Drug Recognition Experts (DREs), who are law enforcement officers trained to recognize impaired drivers.

I reviewed the research papers that were supposed to evaluate the accuracy of DREs and I summarized my impressions like this:

What I found was a collection of studies that are, across the board, deeply flawed. Every one of them features at least one methodological error so blatant it would be embarrassing at a middle school science fair.

Recently the related topic of Field Sobriety Tests (FSTs) came up in this Reddit discussion, which links to this TV news report about sober drivers who were arrested based on FST results.

The TV report refers to this 2023 paper in JAMA Psychiatry. Because it’s recent, published in a good quality journal, and called “Evaluation of Field Sobriety Tests for Identifying Drivers Under the Influence of Cannabis: A Randomized Clinical Trial”, I thought it might address the problems I found in previous research.

Unfortunately, it has the same problems:

  • Selection bias: It excludes as subjects people with conditions that might cause them to fail an FST while sober – but these are exactly the people most vulnerable to false positive results.
  • Wrong metrics: The paper focuses on the true positive and false positive rates, and neglects the predictive value of the test – which is more relevant to the policy question.
  • Unrealistic base rate: In the test conditions, two thirds of the participants were impaired, which is almost certainly higher than the relevant fraction in the real world.

Despite all that, the false positive rate they reported is 49%, which means that nearly half of the sober participants were wrongly classified as impaired.

Let’s look at each of these problems more closely.

False Positives

The study tested 184 participants, 121 randomly assigned to the THC group and 63 to the placebo group. The THC group smoked cannabis cigarettes containing THC; the placebo group smoked cigarettes with almost none. Each participant was evaluated by one officer, who was “blinded to treatment assignment”. The paper reports

Officers classified 98 participants (81.0%) in the THC group and 31 (49.2%) in the placebo group as FST impaired.

The following table summarizes these results as a confusion matrix:

               FST Positive  FST Negative  Total
THC Group                98            23    121
Placebo Group            31            32     63
Total                   129            55    184

Let’s start with the most obvious problem: of 63 people in the placebo group, 31 were wrongly classified as impaired, so the false positive rate was 49%.

Although the tests “were administered by certified DRE instructors, the highest training level for impaired driving detection”, the results for sober participants were no better than a coin toss. That’s pretty bad, but in reality it’s probably worse, because of selection bias.

Selection Bias

The study recruited 261 people who met these requirements: “age 21 to 55 years, cannabis use 4 or more times in the past month, holding a valid driver’s license, and driving at least 1000 miles in the past year.”

But it excluded 62 recruits for reasons including “history of traumatic brain injury [and] significant medical conditions or psychiatric conditions”. They also excluded people with a positive urine test for nonprescription drugs or substance use disorder in the past year.

That’s a problem because people with these kinds of medical conditions are more likely to fail an FST – even if they are not actually impaired. By excluding them, the study excludes exactly the people most vulnerable to a false positive result.

A better experiment would recruit a representative sample of drivers, including people older than 55 and people with conditions that make it hard to pass a field sobriety test. The TV report highlights an example: an autistic man who was arrested for DUI because his autism-related differences were mistaken for impairment. I assume he would have been excluded from the study.

To see how much difference the selection criteria could make, suppose 20 of the excluded participants (about one third) had been assigned to the placebo group. And suppose that because of their conditions 16 of them were wrongly classified as impaired – that’s 80%, somewhat higher than the rate among included participants.

That would increase the number of false positives by 16 and the number of true negatives by 4, so the unbiased false positive rate might be 57%.

This is just a guess: it’s not clear how many were excluded specifically for medical conditions or how many of the excluded would have failed the FST. But this calculation gives us a sense of how big the bias could be.
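
Here is the arithmetic behind that estimate as a short sketch. The counts for the excluded recruits (20 added to the placebo group, 16 of them failing the FST) are the hypothetical values from the text, not data from the study.

```python
def false_positive_rate(false_pos, true_neg):
    """Fraction of unimpaired participants classified as impaired."""
    return false_pos / (false_pos + true_neg)

# Observed placebo group: 31 false positives, 32 true negatives
observed = false_positive_rate(31, 32)

# Hypothetical: add 20 excluded recruits, 16 of whom fail the FST
adjusted = false_positive_rate(31 + 16, 32 + 4)

print(round(observed, 3), round(adjusted, 3))
```

The observed rate is about 49% and the hypothetical adjusted rate is about 57%.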

As I wrote in Probably Overthinking It:

How can you estimate the number of false positives if you exclude from the study everyone likely to yield a false positive? You can’t.

And that brings us to the next problem.

Predictive Value

The paper reports:

Officers classified 98 participants (81.0%) in the THC group and 31 (49.2%) in the placebo group as FST impaired at the first evaluation

They quantify this difference as 31.8 percentage points, with 95% CI, 16.4-47.2 percentage points, and report a p-value < .001. Based on this analysis, they conclude:

FSTs administered by highly trained law enforcement officers differentiated between individuals receiving THC vs placebo

This conclusion is true in the sense that the difference in percentages is statistically significant, but the policy question is not whether THC exposure changes FST performance under laboratory conditions. The question is whether an FST result provides sufficiently strong evidence to justify detention or arrest.

For that, the false positive rate is relevant, and as we have discussed, it is probably more than 50%.

But even more important is the positive predictive value (PPV), which is the probability that a positive test is correct. In the confusion matrix, there are 129 positive tests, of which 98 are correct and 31 incorrect, so the PPV is 98 out of 129, about 76%.

Of the people who failed the FST, 76% were actually impaired. That might sound good enough for probable cause, but that conclusion is misleading because there is still another problem – the base rate.

Base Rate

In the study, two thirds of the participants were impaired. In the real world, it is unlikely that two thirds of drivers are impaired – or even two thirds of drivers who take an FST. So the base rate in the study is too high.

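To see why that matters, we have to do a little math. First we’ll use the confusion matrix to compute one more metric, sensitivity, which is the percentage of impaired participants who were classified correctly. In the study, 98 of the 121 participants in the THC group were classified as impaired, so sensitivity is 81%.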

We can use sensitivity, along with the false positive rate we already computed, to figure out the positive predictive value of a test with a more realistic base rate.
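
As a minimal sketch, here is how the three metrics discussed so far follow from the confusion matrix. The variable names are illustrative. Following the study, assignment to the THC group is treated as ground truth.

```python
# Counts from the confusion matrix
true_pos, false_neg = 98, 23   # THC group: FST positive, FST negative
false_pos, true_neg = 31, 32   # placebo group: FST positive, FST negative

sensitivity = true_pos / (true_pos + false_neg)     # 98 / 121
false_pos_rate = false_pos / (false_pos + true_neg) # 31 / 63
ppv = true_pos / (true_pos + false_pos)             # 98 / 129

print(round(sensitivity, 3), round(false_pos_rate, 3), round(ppv, 3))
```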

Of all people pulled over and given a field sobriety test, how many do you think are impaired by THC? That’s a hard question to answer, so we’ll try a couple of values.

First, suppose the base rate is one third, rather than the two thirds in the study. If we imagine 100 drivers:

  • If 33 are impaired, and sensitivity is 81%, we expect 27 true positive results.
  • If 67 are not impaired, and the false positive rate is 49%, we expect 33 false positive results.

In that case the positive predictive value is 27 / (27 + 33), which means that only 45% of positive tests are correct. If we put those numbers in a table, the calculation might be clearer.

              Tests  Prob pos  Pos tests  Percent
Impaired         33     0.810     26.727   44.773
Not impaired     67     0.492     32.968   55.227

With a lower base rate, PPV is lower, which means that a positive test is weaker evidence of impairment. But even 45% might be too high.

If we suppose that 15% of drivers who take an FST are impaired, we can run the numbers again.

              Tests  Prob pos  Pos tests  Percent
Impaired         15     0.810     12.149   22.508
Not impaired     85     0.492     41.825   77.492

With 15% base rate, the predictive value of the test is only 23% – which means 77% of drivers identified as impaired would actually be sober.
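
The two tables are instances of the same calculation, which can be written as a function of the base rate. This is a sketch using the sensitivity and false positive rate from the study; the function name `positive_predictive_value` is illustrative.

```python
SENSITIVITY = 98 / 121     # impaired participants classified as impaired
FALSE_POS_RATE = 31 / 63   # placebo participants classified as impaired

def positive_predictive_value(base_rate,
                              sensitivity=SENSITIVITY,
                              false_pos_rate=FALSE_POS_RATE):
    """Probability that a positive FST is correct, by Bayes's theorem."""
    true_pos = base_rate * sensitivity
    false_pos = (1 - base_rate) * false_pos_rate
    return true_pos / (true_pos + false_pos)

for base_rate in [121 / 184, 0.33, 0.15, 0.05]:
    print(round(base_rate, 2), round(positive_predictive_value(base_rate), 3))
```

With the study’s base rate the function reproduces the PPV of about 76%. With base rates of 0.33 and 0.15 it reproduces the 45% and 23% in the tables. Lower base rates, as might apply at a checkpoint, give lower values still.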

In reality, the base rate depends on the context. At a checkpoint where every driver is stopped, the base rate might be lower than 15%. If a driver is stopped for driving erratically, the base rate might be relatively high. But even then, it is unlikely to be as high as 66%, as in the study.

Discussion

The JAMA Psychiatry study provides valuable data, but it suffers from the same methodological problems as previous DRE validation studies:

  1. High false positive rate: Nearly half of sober participants were incorrectly classified as impaired.
  2. Selection bias: The study excluded exactly the people most likely to be falsely accused, making it impossible to assess the true false positive rate in the general population.
  3. Unrealistic base rate: The base rate in the study is higher than what we expect in real-world use, which inflates the predictive value of the test.

Although I have been critical of the study, I agree with their interpretation of the results:

…the substantial overlap of FST impairment between groups and the high frequency at which FST impairment was suspected to be due to THC suggest that absent other indicators, FSTs alone may be insufficient to identify THC-specific driving impairment.

Emphasis mine.

Notes

In my interpretation of the results, I follow the methodology of the study, which treats assignment to the THC group as ground truth – that is, we assume that participants in the THC group were actually impaired and participants in the placebo group were not. And the paper reports:

Median self-reported highness (scale of 0 to 100, with higher scores indicating more impairment) at 30 minutes was 64 (IQR, 32-76) for the THC group and 13 (IQR, 1-28) for the placebo group (P < .001).

The THC group felt that they were more impaired, but based on the IQRs, it looks like there might be overlap. That complicates the interpretation of “impaired”, but for this analysis I use the study’s operational definition.
