Bootstrapping a Proportion

October 15, 2024 AllenDowney

It’s another installment in Data Q&A: Answering the real questions with Python. Previous installments are available from the Data Q&A landing page.

Here’s a question from the Reddit statistics forum.

How do I use bootstrapping to generate confidence intervals for a proportion/ratio? The situation is this:

I obtain samples of text with differing numbers of lines. From several tens to over a million. I have no control over how many lines there are in any given sample. Each line of each sample may or may not contain a string S. Counting lines according to S presence or S absence generates a ratio of S to S’ for that sample. I want to use bootstrapping to calculate confidence intervals for the found ratio (which of course will vary with sample size).

To do this I could either:

A. Literally resample (10,000 times) of size (say) 1,000 from the original sample (with replacement) then categorise S (and S’), and then calculate the ratio for each resample, and finally identify highest and lowest 2.5% (for 95% CI), or

B. Generate 10,000 samples of 1,000 random numbers between 0 and 1, scoring each stochastically as above or below original sample ratio (equivalent to S or S’). Then calculate CI as in A.

Programmatically A is slow and B is very fast. Is there anything wrong with doing B? The confidence intervals generated by each are almost identical.

The answer to the immediate question is that A and B are equivalent, so there’s nothing wrong with B. But in follow-up responses, a few related questions were raised:

Is resampling a good choice for this problem?
What size should the resamplings be?
How many resamplings do we need?

I don’t think resampling is really necessary here, and I’ll show some alternatives. And I’ll answer the other questions along the way.

Click here to run this notebook on Colab.

I’ll download a utilities module with some of my frequently-used functions, and then import the usual libraries.

Pallor and Probability

As an example, let’s use one of the exercises from Think Python:

The Count of Monte Cristo is a novel by Alexandre Dumas that is considered a classic. Nevertheless, in the introduction of an English translation of the book, the writer Umberto Eco confesses that he found the book to be “one of the most badly written novels of all time”.

In particular, he says it is “shameless in its repetition of the same adjective,” and mentions in particular the number of times “its characters either shudder or turn pale.”

To see whether his objection is valid, let’s count the number number of lines that contain the word pale in any form, including pale, pales, paled, and paleness, as well as the related word pallor. Use a single regular expression that matches all of these words and no others.

The following cell downloads the text of the book from Project Gutenberg.

download('https://www.gutenberg.org/cache/epub/1184/pg1184.txt');

We’ll use the following functions to remove the additional material that appears before and after the text of the book.

def is_special_line(line):
    return line.startswith('*** ')

def clean_file(input_file, output_file):
    reader = open(input_file)
    writer = open(output_file, 'w')

    for line in reader:
        if is_special_line(line):
            break

    for line in reader:
        if is_special_line(line):
            break
        writer.write(line)
        
    reader.close()
    writer.close()

clean_file('pg1184.txt', 'pg1184_cleaned.txt')

And we’ll use the following function to count the number of lines that contain a particular pattern of characters.

import re

def count_matches(lines, pattern):
    count = 0
    for line in lines:
        result = re.search(pattern, line)
        if result:
            count += 1
    return count

readlines reads the file and creates a list of strings, one for each line.

lines = open('pg1184_cleaned.txt').readlines()
n = len(lines)
n

There are about 61,000 lines in the file.

The following pattern matches “pale” and several related words.

pattern = r'\b(pale|pales|paled|paleness|pallor)\b'
k = count_matches(lines, pattern)
k

These words appear in 223 lines of the file.

p_est = k / n
p_est

0.0036372533028869677

So the estimated proportion is about 0.0036. To quantify the precision of that estimate, we’ll compute a confidence interval.

Resampling

First we’ll use the method OP called A – literally resampling the lines of the file. The following function takes a list of lines and selects a sample, with replacement, that has the same size.

def resample(lines):
    return np.random.choice(lines, len(lines), replace=True)

In a resampled list, the same line can appear more than once, and some lines might not appear at all. So in any resampling, the forbidden words might appear more times than in the original text, or fewer. Here’s an example.

np.random.seed(1)
count_matches(resample(lines), pattern)

In this resampling, the words appear in 201 lines, fewer than in the original (223).

If we repeat this process many times, we can compute a sample of possible values of k. Because this method is slow, we’ll only repeat it 101 times.

ks_resampling = [count_matches(resample(lines), pattern) for i in range(101)]

With these different values of k, we can divide by n to get the corresponding values of p.

ps_resampling = np.array(ks_resampling) / n

To see what the distribution of those values looks like, we’ll plot the CDF.

from empiricaldist import Cdf

Cdf.from_seq(ps_resampling).plot(label='resampling')
decorate(xlabel='Resampled proportion', ylabel='Density')

So that’s the slow way to compute the sampling distribution of the proportion. The method OP calls B is to simulate a Bernoulli trial with size n and probability of success p_est. One way to do that is to draw random numbers from 0 to 1 and count how many are less than p_est.

(np.random.random(n) < p_est).sum()

Equivalently, we can draw a sample from a Bernoulli distribution and add it up.

from scipy.stats import bernoulli

bernoulli(p_est).rvs(n).sum()

These values follow a binomial distribution with parameters n and p_est. So we can simulate a large number of trials quickly by drawing values from a binomial distribution.

from scipy.stats import binom

ks_binom = binom(n, p_est).rvs(10001)

Dividing by n, we can compute the corresponding sample of proportions.

ps_binom = np.array(ks_binom) / n

Because this method is so much faster, we can generate a large number of values, which means we get a more precise picture of the sampling distribution.

The following figure compares the CDFs of the values we got by resampling and the values we got from the binomial distribution.

Cdf.from_seq(ps_resampling).plot(label='resampling')
Cdf.from_seq(ps_binom).plot(label='binomial')
decorate(xlabel='Resampled proportion', ylabel='CDF')

If we run the resampling method longer, these CDFs converge, so the two methods are equivalent.

To compute a 90% confidence interval, we can use the values we sampled from the binomial distribution.

np.percentile(ps_binom, [5, 95])

array([0.0032458 , 0.00404502])

Or we can use the inverse CDF of the binomial distribution, which is even faster than drawing a sample. And it’s deterministic – that is, we get the same result every time, with no randomness.

binom(n, p_est).ppf([0.05, 0.95]) / n

array([0.0032458 , 0.00404502])

Using the inverse CDF of the binomial distribution is a good way to compute confidence intervals. But before we get to that, let’s see how resampling behaves as we increase the sample size and the number of iterations.

Sample Size

In the example, the sample size is more than 60,000, so the CI is very small. The following figure shows what it looks like for more moderate sample sizes, using p=0.1 as an example.

p = 0.1
ns = [50, 500, 5000]
ci_df = pd.DataFrame(index=ns, columns=['low', 'high'])

for n in ns:
    ks = binom(n, p).rvs(10001)
    ps = ks / n
    Cdf.from_seq(ps).plot(label=f"n = {n}")
    ci_df.loc[n] = np.percentile(ps, [5, 95])
    
decorate(xlabel='Proportion', ylabel='CDF')

As the sample size increases, the spread of the sampling distribution gets smaller, and so does the width of the confidence interval.

ci_df['width'] = ci_df['high'] - ci_df['low']
ci_df

	low	high	width
50	0.04	0.18	0.14
500	0.078	0.122	0.044
5000	0.0932	0.1072	0.014

With resampling methods, it is important to draw samples with the same size as the original dataset – otherwise the result is wrong.

But the number of iterations doesn’t matter as much. The following figure shows the sampling distribution if we run the sampling process 101, 1001, and 10,001 times.

p = 0.1
n = 100
iter_seq = [101, 1001, 100001]

for iters in iter_seq:
    ks = binom(n, p).rvs(iters)
    ps = ks / n
    Cdf.from_seq(ps).plot(label=f"iters = {iters}")
    
decorate()

The sampling distribution is the same, regardless of how many iterations we run. But with more iterations, we get a better picture of the distribution and a more precise estimate of the confidence interval. For most problems, 1001 iterations is enough, but if you can generate larger samples fast enough, more is better.

However, for this problem, resampling isn’t really necessary. As we’ve seen, we can use the binomial distribution to compute a CI without drawing a random sample at all. And for this problem, there are approximations that are even easier to compute – although they come with some caveats.

Approximations

If n is large and p is not too close to 0 or 1, the sampling distribution of a proportion is well modeled by a normal distribution, and we can approximate a confidence interval with just a few calculations.

For a given confidence level, we can use the inverse CDF of the normal distribution to compute a

score, which is the number of standard deviations the CI should span – above and below the observed value of p – in order to include the given confidence.

from scipy.stats import norm

confidence = 0.9
z = norm.ppf(1 - (1 - confidence) / 2)
z

1.6448536269514722

A 90% confidence interval spans about 1.64 standard deviations.

Now we can use the following function, which uses p, n, and this z score to compute the confidence interval.

def confidence_interval_normal_approx(k, n, z):
    p = k / n
    margin_of_error = z * np.sqrt(p * (1 - p) / n)
    
    lower_bound = p - margin_of_error
    upper_bound = p + margin_of_error
    return lower_bound, upper_bound

To test it, we’ll compute n and k for the example again.

n = len(lines)
k = count_matches(lines, pattern)
n, k

(61310, 223)

Here’s the confidence interval based on the normal approximation.

ci_normal = confidence_interval_normal_approx(k, n, z)
ci_normal

(0.003237348046298746, 0.00403715855947519)

In the example, n is large, which is good for the normal approximation, but p is small, which is bad. So it’s not obvious whether we can trust the approximation.

An alternative that’s more robust is the Wilson score interval, which is reliable for values of p close to 0 and 1, and sample sizes bigger than about 5.

def confidence_interval_wilson_score(k, n, z):    
    p = k / n
    factor = z**2 / n
    denominator = 1 + factor
    center = p + factor / 2
    half_width = z * np.sqrt((p * (1 - p) + factor / 4) / n)
    
    lower_bound = (center - half_width) / denominator
    upper_bound = (center + half_width) / denominator
    
    return lower_bound, upper_bound

Here’s the 90% CI based on Wilson scores.

ci_wilson = confidence_interval_wilson_score(k, n, z)
ci_wilson

(0.003258660468175958, 0.00405965209814987)

Another option is the Clopper-Pearson interval, which is what we computed earlier with the inverse CDF of the binomial distribution. Here’s a function that computes it.

from scipy.stats import binom

def confidence_interval_exact_binomial(k, n, confidence=0.9):
    alpha = 1 - confidence
    p = k / n

    lower_bound = binom.ppf(alpha / 2, n, p) / n if k > 0 else 0
    upper_bound = binom.ppf(1 - alpha / 2, n, p) / n if k < n else 1
    
    return lower_bound, upper_bound

And here’s the interval we get.

ci_binomial = confidence_interval_exact_binomial(k, n)
ci_binomial

(0.003245800032621106, 0.0040450171260805745)

A final alternative is the Jeffreys interval, which is derived from Bayes’s Theorem. If we start with a Jeffreys prior and observe k successes out of n attempts, the posterior distribution of p is a beta distribution with parameters a = k + 1/2 and b = n - k + 1/2. So we can use the inverse CDF of the beta distribution to compute a CI.

from scipy.stats import beta

def bayesian_confidence_interval_beta(k, n, confidence=0.9):
    alpha = 1 - confidence    
    a, b = k + 1/2, n - k + 1/2
    
    lower_bound = beta.ppf(alpha / 2, a, b)
    upper_bound = beta.ppf(1 - alpha / 2, a, b)
    
    return lower_bound, upper_bound

And here’s the interval we get.

ci_beta = bayesian_confidence_interval_beta(k, n)
ci_beta

(0.003254420914221609, 0.004054683138668112)

The following figure shows the four intervals we just computed graphically.

intervals = {
    'Normal Approximation': ci_normal,
    'Wilson Score': ci_wilson,
    'Clopper-Pearson': ci_binomial,
    'Jeffreys': ci_beta
}
y_pos = np.arange(len(intervals))

for i, (label, (lower, upper)) in enumerate(intervals.items()):
    middle = (lower + upper) / 2
    xerr = [[(middle - lower)], [(upper - middle)]]
    plt.errorbar(x=middle, y=i-0.2, xerr=xerr, fmt='o', capsize=5)
    plt.text(middle, i, label, ha='center', va='top')
    
decorate(xlabel='Proportion', ylim=[3.5, -0.8], yticks=[])

In this example, because n is so large, the intervals are all similar – the differences are too small to matter in practice. For smaller values of n, the normal approximation becomes unreliable, and for very small values, none of them are reliable.

The normal approximation and Wilson score interval are easy and fast to compute. On my old laptop, they take 1-2 microseconds.

%timeit confidence_interval_normal_approx(k, n, z)

1.04 µs ± 4.04 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

%timeit confidence_interval_wilson_score(k, n, z)

1.64 µs ± 28.6 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

Evaluating the inverse CDF of the binomial and beta distributions are more complex computations – they take about 100 times longer.

%timeit confidence_interval_exact_binomial(k, n)

195 µs ± 7.53 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

%timeit bayesian_confidence_interval_beta(k, n)

269 µs ± 4.6 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

But they still take less than 300 microseconds, so unless you need to compute millions of confidence intervals per second, the difference in computation time doesn’t matter.

Discussion

If you took a statistics class and learned one of these methods, you probably learned the normal approximation. That’s because it is easy to explain and, because it is based on a form of the Central Limit Theorem, it helps to justify time spent learning about the CLT. But in my opinion it should never be used in practice because it is dominated by the Wilson score interval – that is, it is worse than Wilson in at least one way and better in none.

I think the Clopper-Pearson interval is equally easy to explain, but when n is small, there are few possible values of k, and therefore few possible values of p – and the interval can be wider than it needs to be.

The Jeffreys interval is based on Bayesian statistics, so it takes a little more explaining, but it behaves well for all values of n and p. And when n is small, it can be extended to take advantage of background information about likely values of p.

For these reasons, the Jeffreys interval is my usual choice, but in a computational environment that doesn’t provide the inverse CDF of the beta distribution, I would use a Wilson score interval.

OP is working in LiveCode, which doesn’t provide a lot of math and statistics libraries, so Wilson might be a good choice. Here’s a LiveCode implementation generated by ChatGPT.

-- Function to calculate the z-score for a 95% confidence level (z ≈ 1.96)
function zScore
    return 1.96
end zScore

-- Function to calculate the Wilson Score Interval with distinct bounds
function wilsonScoreInterval k n
    -- Calculate proportion of successes
    put k / n into p
    put zScore() into z
    
    -- Common term for the interval calculation
    put (z^2 / n) into factor
    put (p + factor / 2) / (1 + factor) into adjustedCenter
    
    -- Asymmetric bounds
    put sqrt(p * (1 - p) / n + factor / 4) into sqrtTerm
    
    -- Lower bound calculation
    put adjustedCenter - (z * sqrtTerm / (1 + factor)) into lowerBound
    
    -- Upper bound calculation
    put adjustedCenter + (z * sqrtTerm / (1 + factor)) into upperBound
    
    return lowerBound & comma & upperBound
end wilsonScoreInterval

Data Q&A: Answering the real questions with Python

Ears Are Weird

September 16, 2024 AllenDowney

In a previous article, I looked at 93 measurements from the ANSUR-II dataset and found that ear protrusion is not correlated with any other measurement. In a followup article, I used principle component analysis to explore the correlation structure of the measurements, and found that once you have exhausted the information encoded in the most obvious measurements, the ear-related measurements are left standing alone.

I have a conjecture about why ears are weird: ear growth might depend on idiosyncratic details of the developmental environment — so they might be like fingerprints. Recently I discovered a hint that supports my conjecture.

This Veritasium video explains how we locate the source of a sound.

In general, we use small differences between what we hear in each ear — specifically, differences in amplitude, quality, time delay, and phase. That works well if the source of the sound is to the left or right, but not if it’s directly in front, above, or behind — anywhere on vertical plane through the centerline of your head — because in those cases, the paths from the source to the two ears are symmetric.

Fortunately we have another trick that helps in this case. The shape of the outer ear changes the quality of the sound, depending on the direction of the source. The resulting spectral cues makes it possible to locate sources even when they are on the central plane.

The video mentions that owls have asymmetric ears that make this trick particularly effective. Human ears are not as distinctly asymmetric as owl ears, but they are not identical.

And now, based on the Veritasium video, I suspect that might be a feature — the shape of the outer ear might be unpredictably variable because it’s advantageous for our ears to be asymmetric. Almost everything about the way our bodies grow is programmed to be as symmetric as possible, but ears might be programmed to be different.

Rip-off ETF?

September 4, 2024 AllenDowney

An article in a recent issue of The Economist suggests, right in the title, “Investors should avoid a new generation of rip-off ETFs”. An ETF is an exchange-traded fund, which holds a collection of assets and trades on an exchange like a single stock. For example, the SPDR S&P 500 ETF Trust (SPY) tracks the S&P 500 index, but unlike traditional index funds, you can buy or sell shares in minutes.

There’s nothing obviously wrong with that – but as an example of a “rip-off ETF”, the article describes “defined-outcome funds” or buffer ETFs, which “offer investors an enviable-sounding opportunity: hold stocks, with protection against falling prices. All they must do is forgo annual returns above a certain level, often 10% or so.”

That might sound good, but the article explains, “Over the long term, they are a terrible deal for investors. Much of the compounding effect of stock ownership comes from rallies.”

To demonstrate, they use the value of the S&P index since 1980: “An investor with returns capped at 10% and protected from losses would have made a real return of 403% over the period, a fraction of the 3,155% return offered by just buying and holding the S&P 500.”

So that sounds bad, but returns from 1980 to the present have been historically unusual. To get a sense of whether buffer ETFs are more generally a bad deal, let’s get a bigger picture.

Click here to run this notebook on Colab

The Dow Jones

The MeasuringWorth Foundation has compiled the value of the Dow Jones Industrial Average at the end of each day from February 16, 1885 to the present, with adjustments at several points to make the values comparable. The series I collected starts on February 16, 1885 and ends on August 30, 2024. The following cells download and read the data.

DATA_PATH = "https://github.com/AllenDowney/ThinkStats/raw/v3/data/"
filename = "DJA.csv"
download(DATA_PATH + filename)

djia = pd.read_csv(filename, skiprows=4, parse_dates=[0], index_col=0)
djia.head()

	DJIA
Date
1885-02-16	30.9226
1885-02-17	31.3365
1885-02-18	31.4744
1885-02-19	31.6765
1885-02-20	31.4252

To compute annual returns, we’ll start by selecting the closing price on the last trading day of each year (dropping 2024 because we don’t have a complete year).

annual = djia.groupby(djia.index.year).last().drop(2024)
annual

	DJIA
Date
1885	39.4859
1886	41.2391
1887	37.7693
1888	39.5866
1889	42.0394
…	…
2019	28538.4400
2020	30606.4800
2021	36338.3000
2022	33147.2500
2023	37689.5400

139 rows × 1 columns

Next we’ll compute the annual price return, which is the ratio of successive year-end closing prices.

annual['Ratio'] = annual['DJIA'] / annual['DJIA'].shift(1)
annual

	DJIA	Ratio
Date
1885	39.4859	NaN
1886	41.2391	1.044401
1887	37.7693	0.915861
1888	39.5866	1.048116
1889	42.0394	1.061960
…	…	…
2019	28538.4400	1.223384
2020	30606.4800	1.072465
2021	36338.3000	1.187275
2022	33147.2500	0.912185
2023	37689.5400	1.137034

139 rows × 2 columns

And the relative return as a percentage.

annual['Return'] = (annual['Ratio'] - 1) * 100

Looking at the years with the biggest losses and gains, we can see that most of the extremes were before the 1960s – with the exception of the 2008 financial crisis.

annual.dropna().sort_values(by='Return')

	DJIA	Ratio	Return
Date
1931	77.9000	0.473326	-52.667396
1907	43.0382	0.622683	-37.731743
2008	8776.3900	0.661629	-33.837097
1930	164.5800	0.662347	-33.765293
1920	71.9500	0.670988	-32.901240
…	…	…	…
1954	404.3900	1.439623	43.962264
1908	63.1104	1.466381	46.638103
1928	300.0000	1.482213	48.221344
1933	99.9000	1.666945	66.694477
1915	99.1500	1.816599	81.659949

138 rows × 3 columns

Here’s what the distribution of annual returns looks like.

from empiricaldist import Cdf

cdf_return = Cdf.from_seq(annual['Return'])
cdf_return.plot()

decorate(xlabel='Annual return (percent)', ylabel='CDF')

Immediately we see why capping returns at 10% might be a bad idea – this cap is exceeded almost 45% of the time, and sometimes by a lot!

1 - cdf_return(10)

0.4492753623188406

Long-Term Returns

We’ll use the following function to compute long-term returns. It takes a start date and a duration, and computes two ratios:

The total price return based on actual annual returns.
The total price return if annual returns are clipped at 0 and 10 – that is, any negative returns are set to 0 and any returns above 10 are set to 10.

def compute_ratios(start=1993, duration=30):
    end = start + duration
    interval = annual.loc[start: end]
    ratio = interval['Ratio'].prod()
    low, high = 1.0, 1.10
    clipped = interval['Ratio'].clip(low, high)
    ratio_clipped = clipped.prod()
    return start, end, ratio, ratio_clipped

With this function, we can replicate the analysis The Economist did with the S&P 500. Here are the results for the DJIA from the beginning of 1980 to the end of 2023.

compute_ratios(1980, 43)

(1980, 2023, 44.93751117788029, 15.356490985533199)

A buffer ETF over this period would have grown by a factor of more than 15 in nominal dollars, with no risk of loss. But an index fund would have grown by a factor of almost 45. So yeah, the ETF would have been a bad deal.

However, if we go back to the bad old days, an investor in 1900 would have been substantially better off with a buffer ETF held for 43 years – a factor of 7.2 compared to a factor of 2.8.

compute_ratios(1900, 43)

(1900, 1943, 2.8071864303140583, 7.225624631784611)

It seems we can cherry-pick the data to make the comparison go either way – so let’s see how things look more generally. Starting in 1886, we’ll compute price returns for all 30-year intervals, ending with the interval from 1993 to 2023.

duration = 30
ratios = [compute_ratios(start, duration) for start in range(1886, 2024-duration)]
ratios = pd.DataFrame(ratios, columns=['Start', 'End', 'Index Fund', 'Buffer ETF'])
ratios.index = ratios['Start']
ratios.tail()

	Start	End	Index Fund	Buffer ETF
Start
1989	1989	2019	13.160027	6.532125
1990	1990	2020	11.116693	6.368615
1991	1991	2021	13.797643	7.005476
1992	1992	2022	10.460407	6.368615
1993	1993	2023	11.417232	6.724757

Here’s what the returns look like for an index fund compared to a buffer ETF.

ratios['Index Fund'].plot()
ratios['Buffer ETF'].plot()

decorate(xlabel='Start year', ylabel='30-year price return')

The buffer ETF performs as advertised, substantially reducing volatility. But it has only occasionally been a good deal, and not in my lifetime.

According to ChatGPT, the primary reasons for strong growth in stock prices since the 1960s are “technological advancements, globalization, financial market innovation, and favorable monetary policies”. If you think these elements will generally persist over the next 30 years, you might want to avoid buffer ETFs.

Probably the Book

August 23, 2024 AllenDowney

Last week I had the pleasure of presenting a keynote at posit::conf(2024). When the video is available, I will post it here. In the meantime, you can read the slides, if you don’t mind spoilers.

For people at the conference who don’t know me, this might be a good time to introduce you to this blog, where I write about data science and Bayesian statistics, and to Probably Overthinking It, the book based on the blog, which was published by University of Chicago Press last December. Here’s an outline of the book with links to excerpts I’ve published in the blog and talks I’ve presented based on some of the chapters.

For your very own copy, you can order from Bookshop.org if you want to support independent bookstores, or Amazon if you don’t.

Twelve Excellent Chapters

In Chapter 1, we learn that no one is normal, everyone is weird, and everyone is about the same amount of weird. I published an excerpt from this chapter, and talked about it during this section of the SuperDataScience podcast. And it is featured in an interactive article at Brilliant.org, which includes this animation showing how measurements are distributed in multiple dimensions.

Chapter 2 is about the inspection paradox, which affects our perception of many real-world scenarios, including fun examples like class sizes and relay races, and more serious examples like our understanding of criminal justice and ability to track infectious disease. I published a prototype of this chapter as an article called “The Inspection Paradox is Everywhere“, and gave a talk about it at PyData NYC:

Chapter 3 presents three consequences of the inspection paradox in demography, especially changes in fertility in the United States over the last 50 years. It explains Preston’s paradox, named after the demographer who discovered it: if each woman has the same number of children as her mother, family sizes — and population — grow quickly; in order to maintain constant family sizes, women must have fewer children than their mothers, on average. I published an excerpt from this chapter, and it was discussed on Hacker News.

Chapter 4 is about extremes, outliers, and GOATs (greatest of all time), and two reasons the distribution of many abilities tends toward a lognormal distribution: proportional gain and weakest link effects. I gave a talk about this chapter for PyData Global 2023:

Chapter 5 is about the surprising conditions where something used is better than something new. Most things wear out over time, but sometimes longevity implies information, which implies even greater longevity. This property has implications for life expectancy and the possibility of much longer life spans. I gave a talk about this chapter at ODSC East 2024 — there’s no recording, but the slides are here.

Chapter 6 introduces Berkson’s paradox — a form of collision bias — with some simple examples like the correlation of test scores and some more important examples like COVID and depression. Chapter 7 uses collision bias to explain the low birthweight paradox and other confusing results from epidemiology. I gave a “Talk at Google” about these chapters:

Chapter 8 shows that the magnitudes of natural and human-caused disasters follow long-tailed distributions that violate our intuition, defy prediction, and leave us unprepared. Examples include earthquakes, solar flares, asteroid impacts, and stock market crashes. I gave a talk about this chapter at SciPy 2023:

The talk includes this animation showing how plotting a tail distribution on a log-y scale provides a clearer picture of the extreme tail behavior.

Chapter 9 is about the base rate fallacy, which is the cause of many statistical errors, including misinterpretations of medical tests, field sobriety tests, and COVID statistics. It includes a discussion of the COMPAS system for predicting criminal behavior.

Chapter 10 is about Simpson’s paradox, with examples from ecology, sociology, and economics. It is the key to understanding one of the most notorious examples of misinterpretation of COVID data. This is the first of three chapters that use data from the General Social Survey (GSS).

Chapter 11 is about the expansion of the Moral Circle — specifically about changes in attitudes about race, gender, and homosexuality in the U.S. over the last 50 years. I published an excerpt about the remarkable decline of homophobia since 1990, featuring lyrics from “A Message From the Gay Community“.

Chapter 12 is about the Overton Paradox, a name I’ve given to a pattern observed in GSS data: as people get older, their beliefs become more liberal, on average, but they are more likely to say they are conservative. This chapter is the basis of this interactive lesson at Brilliant.org. And I gave a talk about it at PyData NYC 2022:

There are still a few chapters I haven’t given a talk about, so watch this space!

Again, you can order the book from Bookshop.org if you want to support independent bookstores, or Amazon if you don’t.

Supporting code for the book is in this GitHub repository. All of the chapters are available as Jupyter notebooks that run in Colab, so you can replicate my analysis. If you are teaching a data science or statistic class, they make good teaching examples.

Chapter 1: Are You Normal? Hint: No.

Run the code on Colab

Run the code that prepares the BRFSS data

Run the code that prepares the Big Five data

Chapter 2: Relay Races and Revolving Doors

Run the code on Colab

Chapter 3: Defy Tradition, Save the World

Run the code on Colab

Chapter 4: Extremes, Outliers, and GOATs

Run the code on Colab

Run the code that prepares the BRFSS data

Run the code that prepares the NSFG data

Chapter 5: Bettter Than New

Run the code on Colab

Chapter 6: Jumping to Conclusions

Run the code on Colab

Chapter 7: Causation, Collision, and Confusion

Run the code on Colab

Run the code that prepares the NCHS data

Chapter 8: The Long Tail of Disaster

Run the code on Colab

Run the code that prepares the earthquake data

Run the code that prepares the solar flare data

Chapter 9: Fairness and Fallacy

Run the code on Colab

Chapter 10: Penguins, Pessimists, and Paradoxes

Run the code on Colab

Run the code that prepares the GSS data

Chapter 11: Changing Hearts and Minds

Run the code on Colab

Chapter 12: Chasing the Overton Window

Run the code on Colab

Too many bronze medals?

August 18, 2024 AllenDowney

In a recent video, Hank Green nerd-sniped me by asking a question I couldn’t not answer.

At one point in the video, he shows “a graph of the last 20 years of Olympic games showing the gold, silver, and bronze medals from continental Europe. And it “shows continental Europe having significantly more bronze medals than gold medals.”

Hank wonders why and offers a few possible explanations, finally settling on the one I think is correct:

… the increased numbers of athletes who come from European countries weight them more toward bronze, which might actually be a more randomized medal. Placing gold might just be a better judge of who is first, because gold medal winners are more likely to be truer outliers, while bronze medal recipients are closer to the middle of the pack. And so randomness might play a bigger role, which would mean that having a larger number of athletes gives you more bronze medal winners and more athletes is what you get when you lump a bunch of countries together.

In the following notebook, I use a simple simulation to show that this explanation is plausible. Click here to run the notebook on Colab. Or read the details below.

olympics

So Many Bronze¶

In a recent video, Hank Green nerd-sniped me by asking a question I couldn’t not answer.

No description has been provided for this image

Hank wonders why and offers a few possible explanations, finally settling on the one I think is correct:

… the increased numbers of athletes who come from European countries weight them more toward bronze, which might actually be a more randomized medal. Placing gold might just be a better judge of who is first, because gold medal winners are more likely to be truer outliers, while bronze medal recipients are closer to the middle of the pack. And so randomness might play a bigger role, which would mean that having a larger number of athletes gives you more bronze medal winners and more athletes is what you get when you lump a bunch of countries together.

In the following simulations, I show that this explanation is plausible. If you like this kind of analysis, you might like my book, Probably Overthinking It.

Click here to run this notebook on Colab.

In [1]:

try:
    import empiricaldist
except ImportError:
    !pip install empiricaldist

In [2]:

from os.path import basename, exists


def download(url):
    filename = basename(url)
    if not exists(filename):
        from urllib.request import urlretrieve

        local, _ = urlretrieve(url, filename)
        print("Downloaded " + local)


download(
    "https://github.com/AllenDowney/ProbablyOverthinkingIt/raw/book/examples/utils.py"
)

In [3]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)
plt.rcParams["figure.dpi"] = 75
plt.rcParams["figure.figsize"] = [6, 3.5]

Simulate the Olympics¶

The following function takes a random distribution, generates a population of athletes with random abilities, and returns the top three.

In [4]:

def generate(dist, n, label):
    """Generate the top 3 athletes from a country with population n.
    
    dist: distribution of ability
    n: population
    label: name of country
    """
    # generate a sample with the given size
    sample = dist.rvs(n)
    
    # select the top 3
    top3 = top_k = np.sort(sample)[-3:]
    
    # put the results in a DataFrame with country labels
    df = pd.DataFrame(dict(ability=top3))
    df['label'] = label
    return df

Here’s an example based on a normal distribution with mean 500 and standard deviation 100.

In [5]:

from scipy.stats import norm

dist = norm(500, 100)
generate(dist, 300, 'Example')

Out[5]:

	ability	label
0	746.324211	Example
1	772.016917	Example
2	885.273149	Example

Now let’s simulate the trials in two regions:

A single large country called “UnaGrandia”, with population of 30,000 athletes,
And a group of ten smaller countries called “MultiParvia” with 3,000 athletes each

In [6]:

def run_trials(dist):
    """Simulate the trials.
    
    dist: distribution of ability
    """
    # generate athletes from 10 countries with population 30
    dfs = [generate(dist, 3000, 'MultiParvia') for i in range(10)]
    
    # add in athletes from one country with population 300
    dfs.append(generate(dist, 30000, 'UnaGrandia'))
    
    # combine into a single DataFrame
    athletes = pd.concat(dfs)
    return athletes

The result is 33 athletes, 3 from UnaGrandia and 30 from the various countries of MultiParvia.

In [7]:

athletes = run_trials(dist)
athletes['label'].value_counts()

Out[7]:

label
MultiParvia    30
UnaGrandia      3
Name: count, dtype: int64

Here’s what the distribution of ability looks like.

In [8]:

from empiricaldist import Surv
from utils import decorate

surv_ability = Surv.from_seq(athletes['ability'], normalize=False)
surv_ability.plot(style='o', alpha=0.6, label='')
decorate(xlabel='Ability', ylabel='Rank', title='Distribution of ability')

Because we’ve selected the largest values from the distribution of ability, the result is skewed to the right — that is, there are a few extreme outliers who have the best chances of winning, and a middle of the pack that have fewer chances (with a reminder that it’s a pretty elite pack to be in the middle of).

Now let’s simulate the competition. The following function takes the distribution of ability and an additional parameter, std, that controls the randomness of the results.

When std is 0, the outcome of the competition depends only on the abilities of the athletes — the athlete with the highest ability wins every time.
As std increases, the outcome is more random, so an athlete with a lower ability has a better chance of beating an athlete with higher ability.

In [9]:

medals = ['Gold', 'Silver', 'Bronze']

def compete(dist, std=0):
    """Simulate a competition.
    
    dist: distribution of ability
    std: standard deviation of randomness
    """
    # run the trials
    athletes = run_trials(dist)
    
    # add a random factor to ability to get scores
    randomness = norm(0, std).rvs(len(athletes))
    athletes['score'] = athletes['ability'] + randomness
    
    # select and return athlete with top 3 scores
    podium = athletes.nlargest(3, columns='score')
    podium['medal'] = medals
    return podium

The result shows the abilities of each winner, which region they are from, their score in the competition, and the medal they won.

In [10]:

compete(dist, std=10)

Out[10]:

	ability	label	score	medal
2	920.202590	UnaGrandia	926.182143	Gold
0	876.618008	UnaGrandia	884.973475	Silver
1	876.623360	UnaGrandia	877.887775	Bronze

Now let’s simulate multiple events. The following function takes the distribution of ability again, along with the number of events and the amount of randomness in the outcomes.

In [11]:

def games(dist, num_events, std=0):
    """Simulate multiple games.
    
    dist: distribution of abilities
    num_events: how many events are contested
    """
    dfs = [compete(dist, std) for i in range(num_events)]
    results = pd.concat(dfs)
    xtab = pd.crosstab(results['label'], results['medal'])
    return xtab[medals]

The result is a table that shows the number of each kind of medal won by each region.

In [12]:

table = games(dist, 100, std=20)
table

Out[12]:

medal	Gold	Silver	Bronze
label
MultiParvia	53	49	69
UnaGrandia	47	51	31

The following function plots the results.

In [13]:

colors = ['#FFD700', '#C0C0C0', '#CD7F32']

def plot_results(tables):
    plt.figure(figsize=(10, 3))

    for i, table in enumerate(tables):
        ax = plt.subplot(1, 6, i+1)
        plt.axhline(y=500, ls='--', color='gray', alpha=0.4)
        table.loc['MultiParvia'].plot.bar()
        plt.xticks(rotation=45)
        for bar, color in zip(ax.patches, colors):
            bar.set_color(color)
            
        if i>0:
            plt.yticks([])
        xlabel = 2004 + 4*i
        decorate(xlabel=xlabel, ylim=[0, 700], legend=False)
        
        for spine in ax.spines.values():
            spine.set_visible(False)

If there is no randomness in the outcomes, each region wins about half of the medals, and there is no excess of bronze medals for MultiParvia.

In [14]:

tables = [games(dist, 1000, std=0) for i in range(6)]

plot_results(tables)

However, if we add enough randomness that the outcome is not certain, we see a pattern that resembles the actual data:

The two regions get about the same number of gold medals.
MultiParvia gets more bronze medals, and possibly more silver medals, too.

In [15]:

tables = [games(dist, 1000, std=20) for i in range(6)]

plot_results(tables)

The results here are more consistent that what we see in the real data because we simulated 1000 events.

If we increase the amount of randomness, the advantage of sending more athletes to the games is even stronger — and it looks like it has an effect on the number of gold medals as well.

In [16]:

tables = [games(dist, 1000, std=30) for i in range(6)]

plot_results(tables)

Lognormal distribution of ability¶

I was curious to know how the distribution of ability affects the result, so I tried the simulations with a lognormal distribution, too. This choice might be more realistic because the distribution of ability in many fields follows a lognormal distribution — see Chapter 4 of Probably Overthinking It or this article).

Here’s a lognormal distribution that’s a good match for the distribution of Elo scores in chess.

In [17]:

from scipy.stats import lognorm

m, s = 7.08959557, 0.32758329
dist2 = lognorm(s=s, scale=np.exp(m))

Here’s what it looks like.

In [21]:

low, high = 200, 3200
qs = np.linspace(low, high)
ps = dist2.cdf(qs) * 100
plt.plot(qs, ps)

decorate(xlabel="Ability", ylabel="Percentile rank")

If we run the competition with this distribution, we can see that the scale of abilities is higher.

In [19]:

compete(dist2, std=100)

Out[19]:

	ability	label	score	medal
0	4329.190762	UnaGrandia	4471.560302	Gold
2	4579.134646	UnaGrandia	4471.322015	Silver
2	4267.159885	MultiParvia	4272.824660	Bronze

And here’s what the results look like.

In [20]:

tables = [games(dist2, 1000, std=200) for i in range(6)]
plot_results(tables)

They are similar to the results with a normal distribution of abilities, so it seems like the shape of the distribution is not an essential reason for the excess of bronze medals.

Conclusions¶

I think Hank is right. If you have two regions with the same population, and one is allowed to send more athletes to the games, it is not much more likely to win gold medals, but notably more likely to win silver and bronze medals — and the size of the excess depends on how much randomness there is in the outcome of the events.

If you like this kind of analysis, you might like my book, Probably Overthinking It.

Bonus¶

One more look at the data — it didn’t really pan out.

In [36]:

dfs = [compete(dist, std=20) for i in range(2000)]
results = pd.concat(dfs)

In [37]:

from empiricaldist import Cdf

results['diff'] = results['score'] - results['ability']
for name, group in results.groupby('medal'):
    cdf = Cdf.from_seq(group['diff']) * 100
    cdf.plot(label=name)
    
decorate(xlabel='Under / over performance', ylabel='Percentile rank')

The code in this notebook and utils.py is under the MIT license.

In [ ]:

Where’s My Train?

July 25, 2024 AllenDowney

Yesterday I presented a webinar for PyMC Labs where I solved one of the exercises from Think Bayes, called “The Red Line Problem”. Here’s the scenario:

The Red Line is a subway that connects Cambridge and Boston, Massachusetts. When I was working in Cambridge I took the Red Line from Kendall Square to South Station and caught the commuter rail to Needham. During rush hour Red Line trains run every 7-8 minutes, on average.

When I arrived at the subway stop, I could estimate the time until the next train based on the number of passengers on the platform. If there were only a few people, I inferred that I just missed a train and expected to wait about 7 minutes. If there were more passengers, I expected the train to arrive sooner. But if there were a large number of passengers, I suspected that trains were not running on schedule, so I expected to wait a long time.

While I was waiting, I thought about how Bayesian inference could help predict my wait time and decide when I should give up and take a taxi.

I used this exercise to demonstrate a process for developing and testing Bayesian models in PyMC. The solution uses some common PyMC features, like the Normal, Gamma, and Poisson distributions, and some less common features, like the Interpolated and StudentT distributions.

The video is on YouTube now:

The slides are here.

This talk will be remembered for the first public appearance of the soon-to-be-famous “Banana of Ignorance”. In general, when the data we have are unable to distinguish between competing explanations, that uncertainty is reflected in the joint distribution of the parameters. In this example, if we see more people waiting than expected, there are two explanation: a higher-than-average arrival rate or a longer-than-average elapsed time since the last train. If we make a contour plot of the joint posterior distribution of these parameters, it looks like this:

The elongated shape of the contour indicates that either explanation is sufficient: if the arrival rate is high, elapsed time can be normal, and if the elapsed time is high, the arrival rate can be normal. Because this shape indicates that we don’t know which explanation is correct, I have dubbed it “The Banana of Ignorance”:

For all of the details, you can read the Jupyter notebook or run it on Colab.

The original Red Line Problem is based on a student project from my Bayesian Statistics class at Olin College, way back in Spring 2013.

Elements of Data Science

July 17, 2024 AllenDowney

I’m excited to announce the launch of my newest book, Elements of Data Science. As the subtitle suggests, it is about “Getting started with Data Science and Python”.

Order now from Lulu.com and get 20% off!

I am publishing this book myself, which has one big advantage: I can print it with a full color interior without increasing the cover price. In my opinion, the code is more readable with syntax highlighting, and the data visualizations look great!

In addition to the printed edition, all chapters are available to read online, and they are in Jupyter notebooks, where you can read the text, run the code, and work on the exercises.

Description

Elements of Data Science is an introduction to data science for people with no programming experience. My goal is to present a small, powerful subset of Python that allows you to do real work with data as quickly as possible.

Part 1 includes six chapters that introduce basic Python with a focus on working with data.

Part 2 presents exploratory data analysis using Pandas and empiricaldist — it includes a revised and updated version of the material from my popular DataCamp course, “Exploratory Data Analysis in Python.”

Part 3 takes a computational approach to statistical inference, introducing resampling method, bootstrapping, and randomization tests.

Part 4 is the first of two case studies. It uses data from the General Social Survey to explore changes in political beliefs and attitudes in the U.S. in the last 50 years. The data points on the cover are from one of the graphs in this section.

Part 5 is the second case study, which introduces classification algorithms and the metrics used to evaluate them — and discusses the challenges of algorithmic decision-making in the context of criminal justice.

This project started in 2019, when I collaborated with a group at Harvard to create a data science class for people with no programming experience. We discussed some of the design decisions that went into the course and the book in this article.

Density and Likelihood: What’s the Difference?

July 13, 2024 AllenDowney

It’s another installment in Data Q&A: Answering the real questions with Python. Previous installments are available from the Data Q&A landing page.

If you get this post by email, the formatting might be broken — if so, you might want to read it on the site.

likelihood

Density and Likelihood¶

Here’s a question from the Reddit statistics forum.

I’m a math graduate and am partially self taught. I am really frustrated with likelihood and probability density, two concepts that I personally think are explained so disastrously that I’ve been struggling with them for an embarrassingly long time. Here’s my current understanding and what I want to understand:

probability density is the ‘concentration of probability’ or probability per unit and the value of the density in any particular interval depends on the density function used. When you integrate the density curve over all outcomes x in X where X is a random variable and x are its realizations then the result should be all the probability or 1.

likelihood is the joint probability, in the discrete case, of observing fixed and known data depending on what parameter(s) value we choose. In the continuous case we do not have a nonzero probability of any single value but we do have nonzero probability within some infinitely small interval (containing infinite values?) [x, x+h] and maximizing the likelihood of observing this data is equivalent to maximizing the probability of observing it, which we can do by maximizing the density at x.

My questions are:

Is what I wrote above correct? Probability density and likelihood are not the same thing. But what the precise distinction is in the continuous case is not completely cut and dry to me. […]

I agree with OP — these topics are confusing and not always explained well. So let’s see what we can do.

Click here to run this notebook on Colab.

I’ll download a utilities module with some of my frequently-used functions, and then import the usual libraries.

In [1]:

from os.path import basename, exists

def download(url):
    filename = basename(url)
    if not exists(filename):
        from urllib.request import urlretrieve

        local, _ = urlretrieve(url, filename)
        print("Downloaded " + str(local))
    return filename

download('https://github.com/AllenDowney/DataQnA/raw/main/nb/utils.py')

Out[1]:

'utils.py'

In [2]:

# install the empiricaldist library, if necessary

try:
    import empiricaldist
except ImportError:
    !pip install empiricaldist

In [3]:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

from utils import decorate

Mass¶

I’ll start with a discrete distribution, so we can leave density out of it for now and focus on the difference between a probability mass function (PMF) and a likelihood function.

As an example, suppose we know that a hockey team scores goals at a rate of 3 goals per game on average. If we model goal scoring as a Poisson process — which is not a bad model — the number of goals per game follows a Poisson distribution with parameter mu=3.

The PMF of the Poisson distribution tells us the probability of scoring k goals in a game, for non-negative values of k.

In [4]:

from scipy.stats import poisson

ks = np.arange(12)
ps = poisson.pmf(ks, mu=3)

Here’s what it looks like.

In [5]:

plt.bar(ks, ps)
decorate(xlabel='Goals (k)', ylabel='Probability mass')

The PMF is a function of k, with mu as a fixed parameter.

The values in the distribution are probability masses, so if we want to know the probability of scoring exactly 5 goals, we can look it up like this.

In [6]:

ps[5]

Out[6]:

0.10081881344492458

It’s about 10%.

The sum of the ps is close to 1.

In [7]:

ps.sum()

Out[7]:

0.999928613371026

If we extend the ks to infinity, the sum is exactly 1. So this set of probability masses is a proper discrete distribution.

Likelihood¶

Now suppose we don’t know the goal scoring rate, but we observe 4 goals in one game. There are several ways we can use this data to estimate mu. One is to find the maximum likelihood estimator (MLE), which is the value of mu that makes the observed data most likely.

To find the MLE, we need to maximize the likelihood function, which is a function of mu with a fixed number of observed goals, k. To evaluate the likelihood function, we can use the PMF of the Poisson distribution again, this time with a single value of k and a range of values for mu.

In [8]:

k = 4
mus = np.linspace(0, 20, 201)
ls = poisson.pmf(k, mus)

Here’s what the likelihood function looks like.

In [9]:

plt.plot(mus, ls)
decorate(xlabel='mu', ylabel='Likelihood')

To find the value of mu that maximizes the likelihood of the data, we can use argmax to find the index of the highest value in ls, and then look up the corresponding element of mus.

In [10]:

i = np.argmax(ls)
mus[i]

Out[10]:

4.0

In this case, the maximum likelihood estimator is equal to the number of goals we observed.

That’s the answer to the estimation problem, but now let’s look more closely at those likelihoods. Here’s the likelihood at the maximum of the likelihood function.

In [11]:

np.max(ls)

Out[11]:

0.19536681481316454

This likelihood is a probability mass — specifically, it is the probability of scoring 4 goals, given that the goal-scoring rate is exactly 4.0.

In [12]:

poisson.pmf(4, mu=4)

Out[12]:

0.19536681481316454

So, some likelihoods are probability masses — but not all.

Density¶

Now suppose, again, that we know the goal scoring rate is exactly 3, but now we want to know how long it will be until the next goal. If we model goal scoring as a Poisson process, the time until the next goal follows an exponential distribution with a rate parameter, lam=3.

Because the exponential distribution is continuous, it has a probability density function (PDF) rather than a probability mass function (PMF). We can approximate the distribution by evaluating the exponential PDF at a set of equally-spaced times, ts.

SciPy’s implementation of the exponential distribution does not take lam as a parameter, so we have to set scale=1/lam.

In [13]:

from scipy.stats import expon

lam = 3.0
ts = np.linspace(0, 2.5, 201)
ps = expon.pdf(ts, scale=1/lam)

The PDF is a function of t with lam as a fixed parameter. Here’s what it looks like.

In [14]:

plt.plot(ts, ps)
decorate(xlabel='Games until next goal', ylabel='Density')

Notice that the values on the y-axis extend above 1. That would not be possible if they were probability masses, but it is possible because they are probability densities.

By themselves, probability densities are hard to interpret. As an example, we can pick an arbitrary element from ts and the corresponding element from ps.

In [15]:

ts[40], ps[40]

Out[15]:

(0.5, 0.6693904804452895)

So the probability density at t=0.5 is about 0.67. What does that mean? Not much.

To get something meaningful, we have to compute an area under the PDF. For example, if we want to know the probability that the first goal is scored during the first half of a game, we can compute the area under the curve from t=0 to t=0.5.

We can use a slice index to select the elements of ps and ts in this interval, and NumPy’s trapz function, which uses the trapezoid method to compute the area under the curve.

In [16]:

np.trapz(ps[:41], ts[:41])

Out[16]:

0.7769608771522626

The probability of a goal in the first half of the game is about 78%. To check that we got the right answer, we can compute the same probability using the exponential CDF.

In [17]:

expon.cdf(0.5, scale=1/lam)

Out[17]:

0.7768698398515702

Considering that we used a discrete approximation of the PDF, our estimate is pretty close.

This example provides an operational definition of a probability density: it’s something you can add up over an interval — or integrate — to get a probability mass.

Likelihood and Density¶

Now let’s suppose that we don’t know the parameter lam and we want to use data to estimate it. And suppose we observe a game where the first goal is scored at t=0.5. As we did when we estimated the parameter mu of the Poisson distribution, we can find the value of lam that maximizes the likelihood of this data.

First we’ll define a range of possible values of lam.

In [18]:

lams = np.linspace(0, 20, 201)

Then for each value of lam, we can evaluate the exponential PDF at the observed time t=0.5 — using errstate to ignore the “division by zero” warning when lam is 0.

In [19]:

t = 0.5

with np.errstate(divide='ignore'):
    ls = expon.pdf(t, scale=1/lams)

The result is a likelihood function, which is a function of lam with a fixed values of t. Here’s what it looks like.

In [20]:

plt.plot(lams, ls)
decorate(xlabel='Scoring rate (lam)', ylabel='Likelihood')

Again, we can use argmax to find the index where the value of ls is maximized, and look up the corresponding value of lams.

In [21]:

i = np.argmax(ls)
lams[i]

Out[21]:

2.0

If the first goal is scored at t=0.5, the value of lam that maximizes the likelihood of this observation is 2.

Now let’s look more closely at those likelihoods. If we select the corresponding element of ls, the result is a probability density.

In [22]:

ls[i]

Out[22]:

0.7357588823428847

Specifically, it’s the value of the exponential PDF with parameter 2, evaluated at t=0.5.

In [23]:

expon.pdf(0.5, scale=1/2)

Out[23]:

0.7357588823428847

So, some likelihoods are probability densities — but as we’ve already seen, not all.

Discussion¶

What have we learned?

In the first example, we evaluated a Poisson PMF at discrete values of k with a fixed parameter, mu. The results were probability masses.
In the second example, we evaluated the same PMF for possible values of a parameter, mu, with a fixed value of k. The result was a likelihood function where each point is a probability mass.
In the third example, we evaluated an exponential PDF at possible values of t with a fixed parameter, lam. The results were probability densities, which we integrated over an interval to get a probability mass.
In the fourth example, we evaluated the same PDF at possible values of a parameter, lam, with a fixed value of t. The result was a likelihood function where each point is a probability density.

A PDF is a function of an outcome — like the number of goals scored or the time under the first goal — given a fixed parameter. If you evaluate a PDF, you get a probability density. If you integrate density over an interval, you get a probability mass.

A likelihood function is a function of a parameter, given a fixed outcome. If you evaluate a likelihood function, you might get a probability mass or a density, depending on whether the outcome is discrete or continuous. Either way, evaluating a likelihood function at a single point doesn’t mean much by itself. A common use of a likelihood function is finding a maximum likelihood estimator.

As OP says, “Probability density and likelihood are not the same thing”, but the distinction is not clear because they are not completely distinct things, either.

A probability density can be a likelihood, but not all densities are likelihoods.
A likelihood can be a probability density, but not all likelihoods are densities.

I hope that helps!

Data Q&A: Answering the real questions with Python

License: Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International

PMFs and PDFs

July 6, 2024 AllenDowney

It’s another installment in Data Q&A: Answering the real questions with Python. Previous installments are available from the Data Q&A landing page.

If you get this post by email, the formatting is not good — you might want to read it on the site.

pmf_and_pdf

PMFs and PDFs¶

Here’s a question from the Reddit statistics forum.

I met a basic problem in pdf and pmf. I perform a grid approximation on bayesian problem. After normalizing the vector, I got a discretized pmf. Then I want to draw pmf and pdf on a plot to check if they are similar distribution. However, the pmf doesn’t resemble the pdf at all. The instruction told me that I need to sample from my pmf then draw a histogram with density for comparison. It really works, but why can’t I directly compare them?

In Think Bayes I used this kind of discrete approximation in almost every chapter, so this issue came up a lot! There are at least two good solutions:

Normalize the PDF and PMF so they are on the same scale, or
Use CDFs.

I’ll demonstrate both.

Click here to run this notebook on Colab.

I’ll download a utilities module with some of my frequently-used functions, and then import the usual libraries.

In [1]:

from os.path import basename, exists

def download(url):
    filename = basename(url)
    if not exists(filename):
        from urllib.request import urlretrieve

        local, _ = urlretrieve(url, filename)
        print("Downloaded " + str(local))
    return filename

download('https://github.com/AllenDowney/DataQnA/raw/main/nb/utils.py')

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

from utils import decorate

In [2]:

# install the empiricaldist library, if necessary

try:
    import empiricaldist
except ImportError:
    !pip install empiricaldist

Baseball¶

As an example, I’ll solve an exercise from Chapter 4 of Think Bayes:

In Major League Baseball, most players have a batting average between .230 and .300, which means that their probability of getting a hit is between 0.23 and 0.30.

Suppose a player appearing in their first game gets 3 hits out of 3 attempts. What is the posterior distribution for their probability of getting a hit?

To represent the prior distribution, I’ll use a beta distribution with parameters I chose to be consistent with the expected range of batting averages.

In [3]:

from scipy.stats import beta

a, b = 40, 110
beta_prior = beta(a, b)
beta_prior.mean(), beta_prior.ppf([0.15, 0.85])

Out[3]:

(0.26666666666666666, array([0.22937673, 0.30411982]))

Here’s what the PDF of this distribution looks like.

In [4]:

qs = np.linspace(0, 1, 201)
ps = beta_prior.pdf(qs)

In [5]:

plt.plot(qs, ps, label='prior')
decorate(xlabel='Batting average', ylabel='PDF')

Unlike probability masses, probability densities can exceed 1. But the area under the curve should be 1, which we can check using trapz, which is a NumPy function that computes numerical integrals using the trapezoid rule.

In [6]:

area = np.trapz(ps, qs)
area

Out[6]:

1.0

Within floating-point error, the area under the PDF is 1. To do the Bayesian update, I’ll put these values in a Pmf object and normalize it so the sum of the probability masses is 1.

In [7]:

from empiricaldist import Pmf

pmf_prior = Pmf(ps, qs, copy=True)
pmf_prior.normalize()

Out[7]:

200.0

Now we can use binom to compute the probability of the data for each possible batting average in qs.

In [8]:

from scipy.stats import binom

k = 3
n = 3
likelihood = binom.pmf(k, n, qs)

To compute the posterior distribution, we multiply the prior by the likelihood and normalize again so the sum of the posterior Pmf is 1.

In [9]:

pmf_posterior = pmf_prior * likelihood
pmf_posterior.normalize()

Out[9]:

0.020006971070059262

The result is a discrete approximation of the actual posterior distribution — so let’s see how close it is.

Because the beta distribution is the conjugate prior of the binomial likelihood function, the posterior is also a beta distribution, with parameters updated to reflect the data: 3 successes and 0 failures.

In [10]:

beta_posterior = beta(a+3, b)

Here’s what this theoretical posterior distribution looks like compared to our numerical approximation.

In [11]:

ps = beta_posterior.pdf(qs)
plt.plot(qs, ps, color='gray', label='beta')
pmf_posterior.plot(label='posterior')

decorate(xlabel='Batting average', ylabel='PDF/PMF')

Oops! Something has gone wrong. I assume this is what OP meant by “the pmf doesn’t resemble the pdf at all”.

The problem is that the PMF is normalized so the total of the probability masses is 1, and the PDF is normalized so the area under the curve is 1. They are not on the same scale, so the y-axis in this figure is different for the two curves.

To fix the problem, we can find the area under the PMF.

In [12]:

area = np.trapz(pmf_posterior.ps, pmf_posterior.qs)
area

Out[12]:

0.004999999999999999

And divide through to create a discrete approximation of the posterior PDF.

In [13]:

pdf_posterior = pmf_posterior / area

Now we can compare density to density.

In [14]:

ps = beta_posterior.pdf(qs)
plt.plot(qs, ps, color='gray', label='beta')
pdf_posterior.plot(label='posterior')

decorate(xlabel='Batting average', ylabel='PDF')

The curves are visually indistinguishable, and the numerical differences are small.

In [15]:

diff = ps - pdf_posterior.ps
np.max(np.abs(diff))

Out[15]:

4.973799150320701e-14

As an aside, note that the posterior and prior distributions are not very different. The prior mean is 0.267 and the posterior mean is 0.281. If a rookie goes 3 for 3 in their first game, that’s a good start, but it doesn’t mean they are the next Ted Williams.

In [16]:

pmf_prior.mean(), pmf_posterior.mean()

Out[16]:

(0.2666666666666667, 0.28104575163398693)

As an alternative to comparing PDFs, we could convert the PMF to a CDF, which contains the cumulative sum of the probability masses.

In [17]:

cdf_posterior = pmf_posterior.make_cdf()

And compare to the mathematical CDF of the beta distribution.

In [18]:

ps = beta_posterior.cdf(qs)
plt.plot(qs, ps, color='gray', label='beta')

cdf_posterior.step(label='posterior')

decorate(xlabel='Batting average', ylabel='CDF')

In this example, I plotted the discrete approximation of the CDF as a step function — which is technically what it is — to highlight how it overlaps with the beta distribution.

Discussion¶

Probability density is hard to define — the best I can do is usually, “It’s something that you can integrate to get a probability mass”. And when you work with probability densities computationally, you often have to discretize them, which can add another layer of confusion.

One way to distinguish PMFs and PDFs is to remember that PMFs are normalized so sum of the probability masses is 1, and PDFs are normalized so the area under the curve is 1. In general, you can’t plot PMFs and PDFs on the same axes because they are not in the same units.

Data Q&A: Answering the real questions with Python

License: Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International

Regrets and Regression

July 1, 2024 AllenDowney

It’s another installment in Data Q&A: Answering the real questions with Python. Previous installments are available from the Data Q&A landing page.

standardize

Standardization and Normalization¶

Here’s a question from the Reddit statistics forum.

I want to write a research article that has regression analysis in it. I normalized my independent variables and want to include in the article the results of a statistical test showing that all variables are normal. I normalized using the scale function in R and some custom normalization functions I found online but whatever I do the new data fails the Shapiro Wilkinson and KS test on some columns? What to do?

There might be a few points of confusion here:

One is the idea that the independent variables in a regression model have to follow a normal distribution. This is a common misconception. Ordinary least squares regression assumes that the residuals follow a normal distribution — the dependent and independent variables don’t have to.
Another is the idea that statistical tests are useful for checking whether a variable follows a normal distribution. As I’ve explained in other articles, these tests don’t do anything useful.
The question might also reveal confusion about what “normalization” means. In the context of regression models, it usually means scaling a variable so its range is between 0 and 1. Its sounds like OP might be using it to mean transforming a variable so it follows a normal distribution.

In general, transforming an independent variable to normal is not a good idea. It has no benefit, and it changes the estimated parameters for no justified reason. Let me explain.

Click here to run this notebook on Colab.

I’ll download a utilities module with some of my frequently-used functions, and then import the usual libraries.

In [1]:

from os.path import basename, exists

def download(url):
    filename = basename(url)
    if not exists(filename):
        from urllib.request import urlretrieve

        local, _ = urlretrieve(url, filename)
        print("Downloaded " + str(local))
    return filename

download('https://github.com/AllenDowney/DataQnA/raw/main/nb/utils.py')

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

from utils import decorate

In [2]:

# install the empiricaldist library, if necessary

try:
    import empiricaldist
except ImportError:
    !pip install empiricaldist

Happiness¶

As an example, I’ll use data from the World Happiness Report, which uses survey data from 153 countries to explore the relationship between self-reported happiness and six potentially predictive factors:

Income as represented by per capita GDP.
Social support
Healthy life expectancy at birth
Freedom to make life choices
Generosity
Perceptions of corruption

The dependent variable, happiness, is the national average of responses to the “Cantril ladder question” used by the Gallup World Poll:

Please imagine a ladder with steps numbered from zero at the bottom to 10 at the top. The top of the ladder represents the best possible life for you and the bottom of the ladder represents the worst possible life for you. On which step of the ladder would you say you personally feel you stand at this time?

The data they used is available here.

In [3]:

download('https://happiness-report.s3.amazonaws.com/2020/WHR20_DataForFigure2.1.xls')

Out[3]:

'WHR20_DataForFigure2.1.xls'

We can read the data into a Pandas DataFrame.

In [4]:

df = pd.read_excel('WHR20_DataForFigure2.1.xls')
df.head()

Out[4]:

	Country name	Regional indicator	Ladder score	Standard error of ladder score	upperwhisker	lowerwhisker	Logged GDP per capita	Social support	Healthy life expectancy	Freedom to make life choices	Generosity	Perceptions of corruption	Ladder score in Dystopia	Explained by: Log GDP per capita	Explained by: Social support	Explained by: Healthy life expectancy	Explained by: Freedom to make life choices	Explained by: Generosity	Explained by: Perceptions of corruption	Dystopia + residual
0	Finland	Western Europe	7.8087	0.031156	7.869766	7.747634	10.639267	0.954330	71.900826	0.949172	-0.059482	0.195445	1.972317	1.285190	1.499526	0.961271	0.662317	0.159670	0.477857	2.762835
1	Denmark	Western Europe	7.6456	0.033492	7.711245	7.579955	10.774001	0.955991	72.402504	0.951444	0.066202	0.168489	1.972317	1.326949	1.503449	0.979333	0.665040	0.242793	0.495260	2.432741
2	Switzerland	Western Europe	7.5599	0.035014	7.628528	7.491272	10.979933	0.942847	74.102448	0.921337	0.105911	0.303728	1.972317	1.390774	1.472403	1.040533	0.628954	0.269056	0.407946	2.350267
3	Iceland	Western Europe	7.5045	0.059616	7.621347	7.387653	10.772559	0.974670	73.000000	0.948892	0.246944	0.711710	1.972317	1.326502	1.547567	1.000843	0.661981	0.362330	0.144541	2.460688
4	Norway	Western Europe	7.4880	0.034837	7.556281	7.419719	11.087804	0.952487	73.200783	0.955750	0.134533	0.263218	1.972317	1.424207	1.495173	1.008072	0.670201	0.287985	0.434101	2.168266

I’ll select the columns we’re interested in and give them shorter names.

In [5]:

data = pd.DataFrame()

data['ladder'] = df['Ladder score']
data['log_gdp'] = df['Logged GDP per capita']
data['social'] = df['Social support']
data['life_exp'] = df['Healthy life expectancy']
data['freedom'] = df['Freedom to make life choices']
data['generosity'] = df['Generosity']
data['corruption'] = df['Perceptions of corruption']

Notice that GDP is already on a log scale, which is generally a good idea. The other variables are in a variety of units and different scales.

Now, contrary to the advice OP was given, let’s run a regression model without even looking at the data.

In [6]:

import statsmodels.formula.api as smf

formula = 'ladder ~ log_gdp + social + life_exp + freedom + generosity + corruption'

results_raw = smf.ols(formula, data=data).fit()
results_raw.params

Out[6]:

Intercept    -2.059377
log_gdp       0.229079
social        2.723318
life_exp      0.035307
freedom       1.776815
generosity    0.410566
corruption   -0.628162
dtype: float64

Because the independent variables are on different scales, the estimated parameters are in different units. For example, life_exp is healthy life expectancy at birth in units of years, so the estimated parameter, 0.035, indicates that a difference of one year of life expectancy between countries is associated with an increase of 0.035 units on the Cantril ladder.

Now let’s check retroactively whether this dataset is consistent with the assumption that the residuals follow a normal distribution. I’ll use the following functions to compare the CDF of the residuals to a normal model with the same mean and standard deviation.

In [7]:

from scipy.stats import norm
from empiricaldist import Cdf

def make_normal_model(sample):
    """Make a Cdf of a normal distribution.
    
    sample: sequence of numbers
    
    returns: Cdf object
    """
    m, s = sample.mean(), sample.std()
    qs = np.linspace(m - 4 * s, m + 4 * s, 101)
    ps = norm.cdf(qs, m, s)
    return Cdf(ps, qs)

In [8]:

def plot_normal_model(sample, **options):
    """Plot the Cdf of a sample and a normal model.
    
    sample: sequence of numbers
    """
    cdf_model = make_normal_model(sample)
    cdf_data = Cdf.from_seq(sample)

    cdf_model.plot(color='gray')
    cdf_data.plot(**options)

In [9]:

plot_normal_model(results_raw.resid, label='residuals')
decorate(xlabel='residuals', ylabel='CDF')

The normal model fits the residuals well enough that the deviations are not a concern. This regression model is just fine with no transformations required.

But suppose we follow the advice OP received and check whether the dependent and independent variables follow normal distributions.

Looking for trouble¶

Here are the distributions of the variables compared to a normal model. Sometimes the model fits pretty well, sometimes not.

In [10]:

column = 'ladder'
plot_normal_model(data[column], label=column)
decorate(xlabel=column, ylabel='CDF')

In [11]:

column = 'log_gdp'
plot_normal_model(data[column], label=column)
decorate(xlabel=column, ylabel='CDF')

In [12]:

column = 'social'
plot_normal_model(data[column], label=column)
decorate(xlabel=column, ylabel='CDF')

In [13]:

column = 'life_exp'
plot_normal_model(data[column], label=column)
decorate(xlabel=column, ylabel='CDF')

In [14]:

column = 'freedom'
plot_normal_model(data[column], label=column)
decorate(xlabel=column, ylabel='CDF')

In [15]:

column = 'generosity'
plot_normal_model(data[column], label=column)
decorate(xlabel=column, ylabel='CDF')

In [16]:

column = 'corruption'
plot_normal_model(data[column], label=column)
decorate(xlabel=column, ylabel='CDF')

If we ran a statistical test, most of these would fail. But we don’t care — regression models don’t require these variables to follow normal distributions. And when we make this kind of comparison across countries, there no reason to expect them to.

Standardize¶

Nevertheless, there are reasons we might want to transform the independent variables before running a regression model — one is to quantify the “importance” of the different factors. We can do that by standardizing the variables, which means transforming them to have mean 0 and standard deviation 1.

In the untransformed variables, the means vary in magnitude from about 0.1 to 64.

In [17]:

data.mean().sort_values()

Out[17]:

generosity    -0.014568
corruption     0.733120
freedom        0.783360
social         0.808721
ladder         5.473240
log_gdp        9.295706
life_exp      64.445529
dtype: float64

And the standard deviations vary from about 0.1 to 7.

In [18]:

data.std().sort_values()

Out[18]:

freedom       0.117786
social        0.121453
generosity    0.151809
corruption    0.175172
ladder        1.112270
log_gdp       1.201588
life_exp      7.057848
dtype: float64

And, as I’ve already noted, they are expressed in different units. There is no meaningful way to compare the parameter of life_exp, which is steps of the Cantril ladder per year of life expectancy, with the parameter of corruption, which is in steps of the ladder per percentage point. But we can make these comparisons meaningful by standardizing the variables — that is, subtracting off the mean and dividing by the standard deviation.

In [19]:

standardized = (data - data.mean()) / data.std()

After transformation, the means are all close to 0.

In [20]:

standardized.mean()

Out[20]:

ladder       -2.786442e-16
log_gdp       6.966105e-16
social       -3.715256e-16
life_exp     -5.108477e-16
freedom      -3.250849e-16
generosity   -2.612289e-17
corruption   -2.554239e-16
dtype: float64

And the standard deviations are all close to 1.

In [21]:

standardized.std()

Out[21]:

ladder        1.0
log_gdp       1.0
social        1.0
life_exp      1.0
freedom       1.0
generosity    1.0
corruption    1.0
dtype: float64

Here’s the regression with the transformed variables.

In [22]:

results_standardized = smf.ols(formula, data=standardized).fit()
results_standardized.params

Out[22]:

Intercept    -4.072828e-16
log_gdp       2.474749e-01
social        2.973701e-01
life_exp      2.240379e-01
freedom       1.881597e-01
generosity    5.603632e-02
corruption   -9.892975e-02
dtype: float64

The intercept is close to 0 — that’s a consequence of how the math works out. And now the magnitudes of the parameters indicate how much a change in each independent variable, expressed as a multiple of its standard deviation, is associated with a change in the dependent variables, also as a multiple of its standard deviation.

By this metric, social is the most important factor, followed closely by log_gdp and life_exp. generosity is the least important factor, at least as its quantified by this variable.

Normalizing¶

An alternative to standardization is normalization, which transforms a variable so its range is between 0 and 1. To be honest, I’m not sure what the point of normalization is — it makes the results a little harder to interpret, and it doesn’t have any advantages I can think of.

But just for completeness, here’s how it’s done.

In [23]:

normalized = (data - data.min()) / (data.max() - data.min())

As intended, the normalized variables have the same range.

In [24]:

normalized.min()

Out[24]:

ladder        0.0
log_gdp       0.0
social        0.0
life_exp      0.0
freedom       0.0
generosity    0.0
corruption    0.0
dtype: float64

In [25]:

normalized.max()

Out[25]:

ladder        1.0
log_gdp       1.0
social        1.0
life_exp      1.0
freedom       1.0
generosity    1.0
corruption    1.0
dtype: float64

But as a result they have somewhat different means and standard deviations.

In [26]:

normalized.mean()

Out[26]:

ladder        0.554455
log_gdp       0.565357
social        0.746725
life_exp      0.608947
freedom       0.668690
generosity    0.332345
corruption    0.754826
dtype: float64

In [27]:

normalized.std()

Out[27]:

ladder        0.212192
log_gdp       0.242351
social        0.185365
life_exp      0.223317
freedom       0.203633
generosity    0.176200
corruption    0.212124
dtype: float64

Here’s the regression with normalized variables.

In [28]:

results_normalized = smf.ols(formula, data=normalized).fit()
results_normalized.params

Out[28]:

Intercept    -0.030706
log_gdp       0.216678
social        0.340407
life_exp      0.212877
freedom       0.196069
generosity    0.067483
corruption   -0.098962
dtype: float64

We’ll look more closely at those results soon, but first let’s look at one more preprocessing option, transforming values to follow a normal distribution.

Transform to normal¶

Here’s how we transform values to a normal distribution.

In [29]:

n = len(data)
ps = np.arange(1, n+1) / (n + 1)
zs = norm.ppf(ps)

transformed = data.copy()
for column in transformed.columns:
    transformed.sort_values(column, inplace=True)
    transformed[column] = zs

The ps are the cumulative probabilities; the zs are the corresponding quantiles in a standard normal distribution (ppf computes the “percent point function” which is an unnecessary name for the inverse CDF). Inside the loop, we sort transformed by one of the columns and then assign the zs in that order.

The transformed means are all close to 0.

In [30]:

transformed.mean()

Out[30]:

ladder        4.644070e-17
log_gdp       4.644070e-17
social        1.161018e-17
life_exp      0.000000e+00
freedom       0.000000e+00
generosity    4.644070e-17
corruption   -4.644070e-17
dtype: float64

And the standard deviations are close to 1 (but not exact because the tails of the distribution are cut off).

In [31]:

transformed.std()

Out[31]:

ladder        0.975034
log_gdp       0.975034
social        0.975034
life_exp      0.975034
freedom       0.975034
generosity    0.975034
corruption    0.975034
dtype: float64

As intended, the transformed variables follow a normal distribution.

In [32]:

plot_normal_model(transformed['corruption'], label='corruption')
decorate(ylabel='CDF')

Here’s the regression model with the transformed values.

In [33]:

results_transformed = smf.ols(formula, data=transformed).fit()
results_transformed.params

Out[33]:

Intercept     5.425248e-18
log_gdp       2.382770e-01
social        3.642517e-01
life_exp      1.951131e-01
freedom       2.028179e-01
generosity    3.216042e-02
corruption   -4.517596e-02
dtype: float64

Now let’s compare the estimated parameters with the three transformations.

In [34]:

results_standardized.params.plot(style='o', label='standardized')
results_normalized.params.plot(style='x', label='normalized')
results_transformed.params.plot(style='+', label='transformed')

decorate(ylabel='Estimated parameter')

The results with standardized variables are the easiest to interpret, so in that sense they are “right”. The other transformations have no advantages over standardization, and they produce somewhat different parameters.

Discussion¶

Regression does not require the dependent or independent variables to follow a normal distribution. It only assumes that the distribution of residuals is approximately normal, and even that requirement is not strict — regression is quite robust.

You do not need to check whether any of these variables follows a normal distribution, and you definitely should not use a statistical test.

As part of your usual exploratory data analysis, you should look at the distribution of all variables, but that’s primarily to detect any anomalies, outliers, or really extreme distributions — not to check whether they fit normal distributions.

If you want to use the parameters of the model to quantify the importance of the independent variables, use standardization. There’s no reason to use normalization, and definitely no reason to transform to a normal distribution.

Data Q&A: Answering the real questions with Python

License: Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International