When will the haunt begin?

One of the favorite board games at my house is Betrayal at House on the Hill.

A unique feature of the game is the dice, which yield three possible outcomes, 0, 1, or 2, with equal probability. When you add them up, you get some unusual probability distributions.

There are two phases of the game: During the first phase, players explore a haunted house, drawing cards and collecting items they will need during the second phase, called “The Haunt”, which is when the players battle monsters and (usually) each other.

So when does the haunt begin? It depends on the dice. Each time a player draws an “omen” card, they have to make a “haunt roll”: they roll six dice and add them up; if the total is less than the number of omen cards that have been drawn, the haunt begins.

For example, suppose four omen cards have been drawn. A player draws a fifth omen card and then rolls six dice. If the total is less than 5, the haunt begins. Otherwise the first phase continues.
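
To make the rule concrete, here is a minimal sketch of the haunt-roll calculation, assuming the rules exactly as described above (six dice with faces 0, 1, 2, and the haunt begins when the total is less than the number of omen cards drawn). The function names are mine, chosen for illustration; this is not the code from the notebook.

```python
from fractions import Fraction


def haunt_roll_distribution(num_dice=6):
    """Exact distribution of the total of `num_dice` dice with faces 0, 1, 2."""
    dist = {0: Fraction(1)}
    for _ in range(num_dice):
        new_dist = {}
        for total, prob in dist.items():
            # Each face comes up with probability 1/3.
            for face in (0, 1, 2):
                new_dist[total + face] = new_dist.get(total + face, 0) + prob / 3
        dist = new_dist
    return dist


def prob_haunt(num_omens, num_dice=6):
    """Probability that the roll total is less than the number of omens drawn."""
    dist = haunt_roll_distribution(num_dice)
    return sum(prob for total, prob in dist.items() if total < num_omens)


for omens in range(1, 14):
    print(omens, float(prob_haunt(omens)))
```

Using exact fractions avoids rounding questions; with one omen card the haunt starts only if all six dice show 0, which has probability (1/3)^6 = 1/729.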

Last time I played this game, I was thinking about the probabilities involved in this process. For example:

  1. What is the probability of starting the haunt after the first omen card?
  2. What is the probability of drawing at least 4 omen cards before the haunt?
  3. What is the average number of omen cards before the haunt?

My answers to these questions are in this notebook, which you can run on Colab.

Millennials are not getting married

In 2015 I wrote a paper called “Will Millennials Ever Get Married?” where I used data from the National Survey of Family Growth (NSFG) to estimate the age at first marriage for women in the U.S., broken down by decade of birth.

I found that women born in the 1980s and 90s were getting married later than previous cohorts, and I generated projections that suggest they are on track to stay unmarried at substantially higher rates.

Here are the results from that paper, based on 58 488 women surveyed between 1983 and 2015:

Percentage of women ever married, based on data up to 2015.

Each line represents a cohort grouped by decade of birth. For example, the top line represents women born in the 1940s.

The colored segments show the fraction of women who had ever been married as a function of age. For example, among women born in the 1940s, 82% had been married by age 25. Among women born in the 1980s, only 41% had been married by the same age.

The gray lines show projections I generated by assuming that going forward each cohort would be subject to the hazard function of the previous cohort. This method is likely to overestimate marriage rates.
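
The projection idea can be sketched in a few lines. This is an illustration under my own simplifying assumptions, not the code from the paper: I treat the hazard at each age as the probability of marrying at that age among women not yet married, and the numbers in the usage example are hypothetical, not NSFG estimates.

```python
def extend_with_borrowed_hazard(ever_married, donor_hazard):
    """Extend a cohort's ever-married curve using another cohort's hazard.

    ever_married: observed cumulative fractions, one per age index.
    donor_hazard: hazard per age index from the previous cohort, covering
        at least the ages to be projected.
    """
    projected = list(ever_married)
    for age in range(len(ever_married), len(donor_hazard)):
        still_unmarried = 1 - projected[-1]
        # Of those still unmarried, a fraction donor_hazard[age] marry at this age.
        projected.append(projected[-1] + still_unmarried * donor_hazard[age])
    return projected


# Hypothetical numbers, for illustration only.
observed = [0.0, 0.1]
hazard_from_previous_cohort = [0.0, 0.1, 0.5, 0.5]
print(extend_with_borrowed_hazard(observed, hazard_from_previous_cohort))
```

Because each cohort has been marrying later than the one before it, borrowing the previous cohort's hazard tends to push the projected curve up, which is why the method is likely to overestimate.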

These results show two trends:

  • Each cohort is getting married later than the previous cohort.
  • The fraction of women who never marry is increasing from one cohort to the next.

New data

Yesterday the National Center for Health Statistics (NCHS) released a new batch of data from surveys conducted in 2017-2019.  So we can compare the predictions from 2015 with the new data, and generate updated predictions.

The following figure shows the predictions from the previous figure, which are based on data up to 2015, compared to the new curves based on data up to 2019, which includes 70 183 respondents.

Percentage of women ever married, based on data up to 2019,
compared to predictions based on data up to 2015.

For women born in the 1980s, the fraction who have married is almost exactly as predicted. For women born in the 1990s, it is substantially lower.

New projections

The following figure shows projections based on data up to 2019.

Percentage of women ever married, based on data up to 2019,
with predictions based on data up to 2019.

The vertical dashed lines show the ages where we have the last reliable estimate for each cohort. The following table summarizes the results at age 28:

Decade of birth    1940s    1950s    1960s    1970s    1980s    1990s
% married before age 28
Percentage of women married by age 28, grouped by decade of birth.

The percentage of women married by age 28 has dropped quickly from each cohort to the next, by about 11 percentage points per decade.

The following table shows the same percentage at age 38; the last value, for women born in the 1990s, is a projection based on the data we have so far.

Decade of birth    1940s    1950s    1960s    1970s    1980s    1990s
% married before age 38
Percentage of women married by age 38, grouped by decade of birth.

Based on current trends, we expect barely half of women born in the 1990s to be married before age 38.

Finally, here are the percentages of women married by age 48; the last two values are projections.

Decade of birth    1940s    1950s    1960s    1970s    1980s    1990s
% married before age 48
Percentage of women married by age 48, grouped by decade of birth.

Based on current trends, we expect women born in the 1980s and 1990s to remain unmarried at rates substantially higher than previous generations.

Projections like these are based on the assumption that the future will be like the past, but of course, things change. In particular:

  • These data were collected before the COVID-19 pandemic. Marriage rates in 2020 will probably be lower than predicted, and the effect could continue into 2021 or beyond.
  • However, as economic conditions improve in the future, marriage rates might increase.

We’ll find out when we get the next batch of data in October 2022.

The code I used for this analysis is in this GitHub repository.

Whatever the question was, correlation is not the answer

Pearson’s coefficient of correlation, ρ, is one of the most widely reported statistics. But in my opinion, it is useless; there is no good reason to report it, ever.

Most of the time, what you really care about is either effect size or predictive value:

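  • To quantify effect size, report the slope of a regression line.
  • To quantify predictive value, report a measure of predictive error such as MAE, RMSE, or MAPE, whichever is most meaningful in context.

If there’s no reason to prefer one measure over another, report reduction in RMSE, because you can compute it directly from R².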

If you don’t care about effect size or predictive value, and you just want to show that there’s a (linear) relationship between two variables, use R², which is more interpretable than ρ, and exaggerates the strength of the relationship less.

In summary, there is no case where ρ is the best statistic to report. Most of the time, it answers the wrong question and makes the relationship sound more important than it is.

To explain that second point, let me show an example.

Height and weight

I’ll use data from the BRFSS to quantify the relationship between weight and height. Here’s a scatter plot of the data and a regression line:

The slope of the regression line is 0.9 kg / cm, which means that if someone is 1 cm taller, we expect them to be 0.9 kg heavier. If we care about effect size, that’s what we should report.

If we care about predictive value, we should compare predictive error with and without the explanatory variable.

  • Without the model, the estimate that minimizes mean absolute error (MAE) is the median; in that case, the MAE is about 15.9 kg.
  • With the model, MAE is 13.8 kg.

So the model reduces MAE by about 13%.

If you don’t care about effect size or predictive value, you are probably up to no good. But even in that case, you should report R² = 0.22 rather than ρ = 0.47, because

  • R² can be interpreted as the fraction of variance explained by the model; I don’t love this interpretation because I think the use of “explained” is misleading, but it’s better than ρ, which has no natural interpretation.
  • R² is generally smaller than ρ, which means it exaggerates the strength of the relationship less.

[UPDATE: Katie Corker corrected my claim that ρ has no natural interpretation: it is the standardized slope. In this example, we expect someone who is one standard deviation taller than the mean to be 0.47 standard deviations heavier than the mean. Sebastian Raschka does a nice job explaining this here.]

In general…

This dataset is not unusual. R² and ρ generally overstate the predictive value of the model.

The following figure shows the relationship between ρ, R², and the reduction in RMSE.

Values of ρ that sound impressive correspond to values of R² that are more modest and to reductions in RMSE which are substantially less impressive.

This inflation is particularly hazardous when ρ is small. For example, if you see ρ = 0.25, you might think you’ve found an important relationship. But that only “explains” 6% of the variance, and in terms of predictive value, only decreases RMSE by 3%.

In some contexts, that predictive value might be useful, but it is substantially more modest than ρ=0.25 might lead you to believe.
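
These conversions follow from two standard facts about least-squares linear regression: R² = ρ², and the RMSE of the residuals is the standard deviation of the dependent variable times sqrt(1 − R²). Here is a minimal sketch; the function names are mine, chosen for illustration.

```python
from math import sqrt


def r_squared(rho):
    """Coefficient of determination for simple linear regression."""
    return rho ** 2


def rmse_reduction(rho):
    """Fractional reduction in RMSE relative to always guessing the mean."""
    return 1 - sqrt(1 - rho ** 2)


for rho in (0.25, 0.47, 0.82):
    print(f"rho={rho:.2f}  R^2={r_squared(rho):.3f}  "
          f"RMSE reduction={rmse_reduction(rho):.3f}")
```

Because the standard deviation of the dependent variable cancels out of the ratio, the fractional reduction depends only on ρ.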

The details of this example are in this Jupyter notebook.

And the analysis I used to generate the last figure is in this notebook.

Fair cross-section

Abstract: The unusual circumstances of Curtis Flowers’ trials make it possible to estimate the probabilities that white and black jurors would vote to convict him, 98% and 68% respectively, and the probability a jury of his peers would find him guilty, 15%.


Curtis Flowers was tried six times for the same crime. Four trials ended in conviction; two ended in a mistrial due to a hung jury.

Three of the convictions were invalidated by the Mississippi Supreme Court, at least in part because the prosecution had excluded black jurors, depriving Flowers of the right to trial by a jury composed of a “fair cross-section of the community”.

In 2019, the fourth conviction was invalidated by the Supreme Court of the United States for the same reason. And on September 5, 2020, Mississippi state attorneys announced that charges against him would be dropped.

Because of the unusual circumstances of these trials, we can perform a statistical analysis that is normally impossible: we can estimate the probability that black and white jurors would vote to convict, and use those estimates to compute the probability that he would be convicted by a jury that represents the racial makeup of Montgomery County.


According to my analysis, the probability that a white juror in this pool would vote to convict Flowers, given the evidence at trial, is 98%. The same probability for black jurors is 68%. The difference is substantial.

The probability that Flowers would be convicted by a fair jury is only 15%, and the probability that he would be convicted four times out of six is less than 1%.

The following figure shows the probability of a guilty verdict as a function of the number of black jurors:

According to the model, the probability of a guilty verdict is 55% with an all-white jury. If the jury includes 5-6 black jurors, which would be representative of Montgomery County, the probability of conviction would be only 14-15%.

The shaded area represents a 90% credible interval. It is quite wide, reflecting uncertainty due to limits of the data. Also, the model is based on the simplifying assumptions that

  • All six juries saw essentially the same evidence,
  • The probabilities we’re estimating did not change substantially over the period of the trials,
  • Interactions between jurors had negligible effects on their votes,
  • If any juror refuses to convict, the result is a hung jury.

For the details of the analysis, you can

Thanks to the Law Office of Zachary Margulis-Ohnuma for their assistance with this article and for their continuing good work for equal justice.

Alice and Bob exchange data

Two questions crossed my desktop this week, and I think I can answer both of them with a single example.

On Twitter, Kareem Carr asked, “If Alice believes an event has a 90% probability of occurring and Bob also believes it has a 90% chance of occurring, what does it mean to say they have the same degree of belief? What would we expect to observe about both Alice’s and Bob’s behavior?”

And on Reddit, a reader of /r/statistics asked, “I have three coefficients from three different studies that measure the same effect, along with their 95% CIs. Is there an easy way to combine them into a single estimate of the effect?”

So let me tell you a story:

One day Alice tells her friend, Bob, “I bought a random decision-making box. Every time you press this button, it says ‘yes’ or ‘no’. I’ve tried it a few times, and I think it says ‘yes’ 90% of the time.”

Bob says he has some important decisions to make and asks if he can borrow the box. The next day, he returns the box to Alice and says, “I used the box several times, and I also think it says ‘yes’ 90% of the time.”

Alice says, “It sounds like we agree, but just to make sure, we should compare our predictions. Suppose I press the button twice; what do you think is the probability it says ‘yes’ both times?”

Bob does some calculations and reports the predictive probability 81.56%.

Alice says, “That’s interesting. I got a slightly different result, 81.79%. So maybe we don’t agree after all.”

Bob says, “Well let’s see what happens if we combine our data. I can tell you how many times I pressed the button and how many times it said ‘yes’.”

Alice says, “That’s ok, I don’t actually need your data; it’s enough if you tell me what prior distribution you used.”

Bob tells her he used a Jeffreys prior.

Alice does some calculations and says, “Ok, I’ve updated my beliefs to take into account your data as well as mine. Now I think the probability of ‘yes’ is 91.67%.”

Bob says, “That’s interesting. Based on your data, you thought the probability was 90%, and based on my data, I thought it was 90%, but when we combine the data, we get a different result. Tell me what data you saw, and let me see what I get.”

Alice tells him she pressed the button 8 times and it always said ‘yes’.

“So,” says Bob, “I guess you used a uniform prior.”

Bob does some calculations and reports, “Taking into account all of the data, I think the probability of ‘yes’ is 93.45%.”

Alice says, “So when we started, we had seen different data, but we came to the same conclusion.”

“Sort of,” says Bob, “we had the same posterior mean, but our posterior distributions were different; that’s why we made different predictions for pressing the button twice.”

Alice says, “And now we’re using the same data, but we have different posterior means. Which makes sense, because we started with different priors.”

“That’s true,” says Bob, “but if we collect enough data, eventually our posterior distributions will converge, at least approximately.”

“Well that’s good,” says Alice. “Anyway, how did those decisions work out yesterday?”

“Mostly bad,” says Bob. “It turns out that saying ‘yes’ 93% of the time is a terrible way to make decisions.”

If you would like to know how any of those calculations work, you can see the details in a Jupyter notebook:

And if you don’t want the details, here is the summary:

  • If two people have different priors OR they see different data, they will generally have different posterior distributions.
  • If two posterior distributions have the same mean, some of their predictions will be the same, but many others will not.
  • If you are given summary statistics from a posterior distribution, you might be able to figure out the rest of the distribution, depending on what other information you have. For example, if you know the posterior is a two-parameter beta distribution (or is well-modeled by one) you can recover it from the mean and second moment, or the mean and a credible interval, or almost any other pair of statistics.
  • If someone has done a Bayesian update using data you don’t have access to, you might be able to “back out” their likelihood function by dividing their posterior distribution by the prior.
  • If you are given a posterior distribution and the data used to compute it, you can back out the prior by dividing the posterior by the likelihood of the data (unless the prior contains values with zero likelihood).
  • If you are given summary statistics from two posterior distributions, you might be able to combine them. In general, you need enough information to recover both posterior distributions and at least one prior.
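
The numbers in the story can be approximately reproduced with conjugate beta updates. The sketch below is mine, not the notebook's code, and it rests on one assumption the story does not state: that Bob saw 13 ‘yes’ out of 14 presses, which is one dataset consistent with a Jeffreys prior, Beta(1/2, 1/2), and a posterior mean of 90%. Alice's uniform prior is Beta(1, 1).

```python
def update(prior, yes, no):
    """Conjugate update of a Beta(a, b) prior with yes/no counts."""
    a, b = prior
    return a + yes, b + no


def mean(beta):
    a, b = beta
    return a / (a + b)


def prob_two_yes(beta):
    """Predictive probability of 'yes' twice in a row: E[x^2] under Beta(a, b)."""
    a, b = beta
    return a * (a + 1) / ((a + b) * (a + b + 1))


alice_data = (8, 0)    # stated in the story
bob_data = (13, 1)     # ASSUMPTION: not stated in the story

alice = update((1, 1), *alice_data)      # uniform prior
bob = update((0.5, 0.5), *bob_data)      # Jeffreys prior

print(mean(alice), mean(bob))                  # both posterior means
print(prob_two_yes(alice), prob_two_yes(bob))  # predictions differ

# Each one folds in the other's data, keeping their own prior.
print(mean(update(alice, *bob_data)))
print(mean(update(bob, *alice_data)))
```

The closed-form results may differ from the figures in the story in the last digit or two; my guess is that the original calculation used a discrete approximation, but that is an assumption.
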
Maxima, Minima, and Mixtures

I am hard at work on the second edition of Think Bayes, currently working on Chapter 6, which is about computing distributions of minima, maxima and mixtures of other distributions.

Of all the changes in the second edition, I am particularly proud of the exercises. I present three new exercises from Chapter 6 below. If you want to work on them, you can use this notebook, which contains the material you will need from the chapter and some code to get you started.

Exercise 1

Henri Poincaré was a French mathematician who taught at the Sorbonne around 1900. The following anecdote about him is probably fabricated, but it makes an interesting probability problem.

Supposedly Poincaré suspected that his local bakery was selling loaves of bread that were lighter than the advertised weight of 1 kg, so every day for a year he bought a loaf of bread, brought it home and weighed it. At the end of the year, he plotted the distribution of his measurements and showed that it fit a normal distribution with mean 950 g and standard deviation 50 g. He brought this evidence to the bread police, who gave the baker a warning.

For the next year, Poincaré continued the practice of weighing his bread every day. At the end of the year, he found that the average weight was 1000 g, just as it should be, but again he complained to the bread police, and this time they fined the baker.

Why? Because the shape of the distribution was asymmetric. Unlike the normal distribution, it was skewed to the right, which is consistent with the hypothesis that the baker was still making 950 g loaves, but deliberately giving Poincaré the heavier ones.

To see whether this anecdote is plausible, let’s suppose that when the baker sees Poincaré coming, he hefts n loaves of bread and gives Poincaré the heaviest one. How many loaves would the baker have to heft to make the average of the maximum 1000 g?

Exercise 2

Two doctors fresh out of medical school are arguing about whose hospital delivers more babies. The first doctor says, “I’ve been at Hospital A for two weeks, and already we’ve had a day when we delivered 20 babies.”

The second doctor says, “I’ve only been at Hospital B for one week, but already there’s been a 19-baby day.”

Which hospital do you think delivers more babies on average? You can assume that the number of babies born in a day is well modeled by a Poisson distribution.

Exercise 3

Suppose I drive the same route three times and the fastest of the three attempts takes 8 minutes.

There are two traffic lights on the route. As I approach each light, there is a 40% chance that it is green; in that case, it causes no delay. And there is a 60% chance it is red; in that case it causes a delay that is uniformly distributed from 0 to 60 seconds.

What is the posterior distribution of the time it would take to drive the route with no delays?

The solution to this exercise is very similar to a method I developed for estimating the minimum time for a packet of data to travel through a path in the internet.

Again, here’s the notebook where you can work on these exercises. I will publish solutions later this week.

Think DSP v1.1

For the last week or so I have been working on an update to Think DSP. The latest version is available now from Green Tea Press. Here are some of the changes I made:

Running on Colab

All notebooks now run on Colab. Judging by my inbox, many readers find it challenging to download and run the code. Running on Colab is a lot easier.

If you want to try an example, here’s a preview of Chapter 1. And if you want to see where we’re headed, here’s a preview of Chapter 10. You can get to the rest of the notebooks from here.

No more thinkplot

For the first edition, I used a module called thinkplot that provides functions that make it easier to use Matplotlib. It also overrides some of the default options.

But since I wrote the first edition, Matplotlib has improved substantially. I found I was able to eliminate thinkplot with minimal changes. As a result, the code is simpler and the figures look better.

Still using thinkdsp

I provide a module called thinkdsp that contains classes and functions used throughout the book. I think this module is good for learners. It lets me hide details that would otherwise be distracting. It lets me present some topics “top-down”, meaning that we learn how to use some features before we know how they work.

And when you learn the API provided by thinkdsp, you are also learning about DSP. For example, thinkdsp provides classes called Signal, Wave, and Spectrum.

A Signal represents a continuous function; a Wave represents a sequence of discrete samples. So Signal provides make_wave, but Wave does not provide make_signal. When you use this API, you understand implicitly that this is a one-way operation: you can sample a Signal to get a Wave, but you cannot recover a Signal from a Wave.

On the other hand, you can convert from Wave to Spectrum and from Spectrum to Wave, which implies (correctly) that they are equivalent representations of the same information. Given one, you can compute the other.

I realize that not everyone loves it when a book uses a custom library like thinkdsp. When people don’t like Think DSP, this is the most common reason. But looking at thinkdsp with fresh eyes, I am doubling down; I still think it’s a good way to learn.

Less object-oriented

Nevertheless, I found a few opportunities to simplify the code, and in particular to make it less object-oriented. I generally like OOP, but I acknowledge that there are drawbacks. One of the biggest is that it can be hard to keep an inheritance hierarchy in your head and easy to lose track of which classes provide which methods.

I still think the template pattern is a good way to present a framework: the parent class provides the framework and child classes fill in the details.

However, based on feedback from readers, I have come to realize that object-oriented programming is not as universally known and loved as I assumed.

In several places I found that I could eliminate object-oriented features and simplify the code without losing explanatory power.

Pretty, pretty good

Coming back to this book after some time, I think it’s pretty good. If you are interested in digital signal processing, I think the computation-first approach is a good way to get started. And if you are not interested in digital signal processing, maybe I can change your mind!

Here are the links again:

Bayesian hypothesis testing

I have mixed feelings about Bayesian hypothesis testing. On the positive side, it’s better than null-hypothesis significance testing (NHST).

And it is probably necessary as an onboarding tool: Hypothesis testing is one of the first things future Bayesians ask about; we need to have an answer.

On the negative side, Bayesian hypothesis testing is often unsatisfying because the question it answers is not the most useful question to ask.

To explain, I’ll use an example from Bite Size Bayes, which is a series of Jupyter notebooks I am writing to introduce Bayesian statistics.

In Notebook 7, I present the following problem from David MacKay’s book, Information Theory, Inference, and Learning Algorithms:

“A statistical statement appeared in The Guardian on Friday January 4, 2002:

“When spun on edge 250 times, a Belgian one-euro coin came up heads 140 times and tails 110. ‘It looks very suspicious to me’, said Barry Blight, a statistics lecturer at the London School of Economics. ‘If the coin were unbiased the chance of getting a result as extreme as that would be less than 7%’.”

“But [asks MacKay] do these data give evidence that the coin is biased rather than fair?”

I start by formulating the question as an estimation problem. That is, I assume that the coin has some probability, x, of landing heads, and I use the data to estimate it.

If we assume that the prior distribution is uniform, which means that any value between 0 and 1 is equally likely, the posterior distribution looks like this:

Posterior distribution of x, which is the probability of heads, given a uniform prior.

This distribution represents everything we know about x given the prior and the data. And we can use it to answer whatever questions we have about the coin.

So let’s answer MacKay’s question: “Do these data give evidence that the coin is biased rather than fair?”

The question implies that we should consider two hypotheses:

  • The coin is fair.
  • The coin is biased.

In classical hypothesis testing, we would define a null hypothesis, choose a test statistic, and compute a p-value. That’s what the statistician quoted in The Guardian did. His null hypothesis is that the coin is fair. The test statistic is the difference between the observed number of heads (140) and the expected number under the null hypothesis (125). The p-value he reports is less than 7%, which he describes as “suspicious”.

In Bayesian hypothesis testing, we choose prior probabilities that represent our degree of belief in the two hypotheses. Then we compute the likelihood of the data under each hypothesis. The details are in Bite Size Bayes Notebook 12.

In this example the answer depends on how we define the hypothesis that the coin is biased:

  • If you know ahead of time that the probability of heads is exactly 56%, which is the fraction of heads in the dataset, the data are evidence in favor of the biased hypothesis.
  • If you don’t know the probability of heads, but you think any value between 0 and 1 is equally likely, the data are evidence in favor of the fair hypothesis.
  • And if you have knowledge about biased coins that informs your beliefs about x, the data might support the fair or biased hypothesis.
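
The first two cases can be checked with a short calculation. This is a sketch of my own, not the code from Notebook 12; it compares the likelihood of 140 heads in 250 spins under each hypothesis, using the fact that under a uniform prior on x every count from 0 to n is equally likely, so the marginal likelihood is 1 / (n + 1).

```python
from math import comb

HEADS, SPINS = 140, 250


def binomial_likelihood(p, k=HEADS, n=SPINS):
    """Probability of k heads in n spins if the probability of heads is p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


like_fair = binomial_likelihood(0.5)
like_biased_56 = binomial_likelihood(0.56)

# Uniform prior on x: the marginal likelihood of any count is 1 / (n + 1).
like_uniform = 1 / (SPINS + 1)

bf_biased_56_vs_fair = like_biased_56 / like_fair
bf_fair_vs_uniform = like_fair / like_uniform

print(bf_biased_56_vs_fair)  # greater than 1: favors "biased, x = 0.56"
print(bf_fair_vs_uniform)    # greater than 1: favors "fair"
```

The first Bayes factor comes out at roughly 6 to 1 in favor of the biased hypothesis, and the second at roughly 2 to 1 in favor of the fair hypothesis, so the same data point in opposite directions depending on what “biased” means.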

In the notebook I summarize these results using Bayes factors, which quantify the strength of the evidence. If you insist on doing Bayesian hypothesis testing, reporting a Bayes factor is probably a good choice.

But in most cases I think you’ll find that the answer is not very satisfying. As in this example, the answer is often “it depends”. But even when the hypotheses are well defined, a Bayes factor is generally less useful than a posterior distribution, because it contains less information.

The posterior distribution incorporates everything we know about the coin; we can use it to compute whatever summary statistics we like and to inform decision-making processes. We’ll see examples in the next two notebooks.

Correlation, determination, and prediction error

This tweet appeared in my feed recently:

I wrote about this topic in Elements of Data Science Notebook 9, where I suggest that using Pearson’s coefficient of correlation, usually denoted ρ, to summarize the relationship between two variables is problematic because:

  1. Correlation only quantifies the linear relationship between variables; if the relationship is non-linear, correlation tends to underestimate it.
  2. Correlation does not quantify the “strength” of the relationship in terms of slope, which is often more important in practice.

For an explanation of either of those points, see the discussion in Notebook 9. But that tweet and the responses got me thinking, and now I think there are even more reasons correlation is not a great statistic:

  1. It is hard to interpret as a measure of predictive power.
  2. It makes the relationship between variables sound more impressive than it is.

As an example, I’ll quantify the relationship between SAT scores and IQ tests. I know this is a contentious topic; people have strong feelings about the SAT, IQ, and the consequences of using standardized tests for college admissions.

I chose this example because it is a topic people care about, and I think the analysis I present can contribute to the discussion.

But a similar analysis applies in any domain where we use a correlation to quantify the strength of a relationship between two variables.

SAT scores and IQ

According to Frey and Detterman, “Scholastic Assessment or g? The relationship between the Scholastic Assessment Test and general cognitive ability”, the correlation between SAT scores and general intelligence (g) is 0.82.

That’s just one study, and if you read the paper, you might have questions about the methodology. But for now I will take this estimate at face value. If you find another source that reports a different correlation, feel free to plug in another value and run my analysis again.

In the notebook, I generate fake datasets with the same mean and standard deviation as the SAT and the IQ, and with a correlation of 0.82.

Then I use them to compute

  • The coefficient of determination, R²,
  • The mean absolute error (MAE),
  • Root mean squared error (RMSE), and
  • Mean absolute percentage error (MAPE).

In the SAT-IQ example, the correlation is 0.82, which is a strong correlation, but I think it sounds stronger than it is.

R² is 0.66, which means we can reduce variance by 66%. But that also makes the relationship sound stronger than it is.

Using SAT scores to predict IQ, we can reduce MAE by 44%, we can reduce RMSE by 42%, and we can reduce MAPE also by 42%.

Admittedly, these are substantial reductions. If you have to guess someone’s IQ (for some reason) your guesses will be more accurate if you know their SAT scores.

But any of these reductions in error is substantially more modest than the correlation might lead you to believe.

The same pattern holds over the range of possible correlations. The following figure shows R² and the fractional improvement in RMSE as a function of correlation:

For all values except 0 and 1, R² is less than correlation and the reduction in RMSE is even less than that.


Correlation is a problematic statistic because it sounds more impressive than it is.

Coefficient of determination, R², is a little better because it has a more natural interpretation: percentage reduction in variance. But reducing variance is usually not what we care about.

A better option is to choose a measure of error that is meaningful in context, possibly MAE, RMSE, or MAPE.

Which one of these is most meaningful depends on the cost function. Does the cost of being wrong depend on the absolute error, squared error, or percentage error? If so, that should guide your choice.

One advantage of RMSE is that we don’t need the data to compute it; we only need the variance of the dependent variable and either ρ or R². So if you read a paper that reports ρ, you can compute the corresponding reduction in RMSE.

But any measure of predictive error is more meaningful than reporting correlation or R².

The details of my analysis are in this Jupyter notebook.

The Girl Named Florida

In The Drunkard’s Walk, Leonard Mlodinow presents “The Girl Named Florida Problem”:

“In a family with two children, what are the chances, if [at least] one of the children is a girl named Florida, that both children are girls?”

I added “at least” to Mlodinow’s statement of the problem to avoid a subtle ambiguity.

I wrote about this problem in a previous article from 2011. As you can see in the comments, my explanation was not met with universal acclaim.

This time, I want to take a different approach.

First, to avoid some real-world complications, let’s assume that this question takes place in an imaginary city called Statesville where:

  • Every family has two children.
  • 50% of children are male and 50% are female.
  • All children are named after U.S. states, and all state names are chosen with equal probability.
  • Genders and names within each family are chosen independently.

Second, rather than solve it mathematically, I’ll demonstrate it computationally:
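
One way such a demonstration could look is an exact enumeration of every equally likely family in Statesville. This is a sketch of mine under the four assumptions listed above, with 50 state names and an arbitrary index standing in for “Florida”; it is not necessarily the approach taken in the original demonstration.

```python
from fractions import Fraction
from itertools import product

NUM_NAMES = 50          # one name per U.S. state
FLORIDA = ("girl", 0)   # index 0 stands in for "Florida" (arbitrary choice)


def prob_two_girls_given_florida():
    """P(both children are girls | at least one is a girl named Florida)."""
    # Every (gender, name) pair is equally likely for each child.
    children = list(product(("girl", "boy"), range(NUM_NAMES)))
    families_with_florida = 0
    families_with_two_girls = 0
    for first, second in product(children, repeat=2):
        if FLORIDA in (first, second):
            families_with_florida += 1
            if first[0] == "girl" and second[0] == "girl":
                families_with_two_girls += 1
    return Fraction(families_with_two_girls, families_with_florida)


answer = prob_two_girls_given_florida()
print(answer, float(answer))
```

Under these assumptions the enumeration gives 99/199, which is just under one half.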

Either way, I hope you enjoy getting your head around this problem.