Testing Percentiles

April 28, 2024 AllenDowney

Here’s another installment in Data Q&A: Answering the real questions with Python. Previous installments are available from the Data Q&A landing page.

Testing percentiles

Here’s a question from the Reddit statistics forum.

I have two different samples (about 100 observations per sample) drawn from the same population (or that’s what I hypothesize; the populations may in fact be different). The samples and population are approximately normal in distribution.

I want to estimate the 85th percentile value for both samples, and then see if there is a statistically significant difference between these two values. I cannot use a normal z- or t-test for this, can I? It’s my current understanding that those tests would only work if I were comparing the means of the samples.

As an extension of this, say I wanted to compare one of these 85th percentile values to a fixed value; again, if I was looking at the mean, I would just construct a confidence interval and see if the fixed value fell within it…but the percentile stuff is throwing me for a loop.

This is […] related to a research project I’m working on (in my job).

There are two questions here. The first is about testing a difference in percentiles between two groups. The second is about the difference between a percentile from an observed sample and an expected value.

We’ll answer the first question with a permutation test, and we’ll answer the second in two ways: bootstrap resampling and a Gaussian model.

Click here to run this notebook on Colab.

I’ll download a utilities module with some of my frequently-used functions, and then import the usual libraries.

In [1]:

from os.path import basename, exists

def download(url):
    filename = basename(url)
    if not exists(filename):
        from urllib.request import urlretrieve

        local, _ = urlretrieve(url, filename)
        print("Downloaded " + str(local))
    return filename

download('https://github.com/AllenDowney/DataQnA/raw/main/nb/utils.py')

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

from utils import decorate

In [2]:

# install the empiricaldist library, if necessary

try:
    import empiricaldist
except ImportError:
    !pip install empiricaldist

Data

Since OP didn’t provide a dataset, we have to generate one. I’ll draw two samples from Gaussian distributions with the same standard deviation and different means.

In [3]:

np.random.seed(17)

mu = 10
sigma = 2
size = 100

group1 = np.random.normal(mu, sigma, size=size)
group2 = np.random.normal(mu+1, sigma, size=size)

Here’s what the distributions of the groups look like.

In [4]:

sns.kdeplot(group1, label='Group1')
sns.kdeplot(group2, label='Group2')

decorate(xlabel='Quantity',
         ylabel='Density',
         title='Distributions of the data')

If we compute the 85th percentile in both groups, we see a difference, as expected.

In [5]:

stat1 = np.percentile(group1, 85)
actual_stat2 = np.percentile(group2, 85)
stat1, actual_stat2

Out[5]:

(12.241826987876475, 13.057003640622057)

In [6]:

actual_diff = actual_stat2 - stat1
actual_diff

Out[6]:

0.8151766527455813

Now let’s see if a difference of that size would be likely if the two samples were actually drawn from the same distribution.

Testing the Difference

When we test a difference between two groups, the usual model of the null hypothesis is that the groups are actually identical. If that’s true, the two samples came from the same distribution, so we can combine them into a single sample.

In [7]:

pooled = np.concatenate([group1, group2])
n, m = len(group1), len(group2)

Now we can simulate the null hypothesis by permutation — that is, by shuffling the pooled data and splitting it into two groups with the same sizes as the originals. The following function generates two samples under this assumption, and returns the difference in their 85th percentiles.

In [8]:

def simulate_percentile_difference():
    np.random.shuffle(pooled)
    shuffled1 = pooled[:n]
    shuffled2 = pooled[n:]
    diff = np.percentile(shuffled1, 85) - np.percentile(shuffled2, 85)
    return diff

If we call it many times, the result is a sample from the distribution of differences under the null hypothesis.

In [9]:

np.random.seed(19)
sample_diff = [simulate_percentile_difference() for i in range(1001)]

Here’s what it looks like, with a vertical line at the observed difference. I’m plotting it with cut=0 so the estimated density doesn’t extend past the minimum and maximum of the data.

In [10]:

sns.kdeplot(sample_diff, label='', cut=0)
plt.axvline(actual_diff, ls=':', color='gray')
decorate(xlabel='Difference',
         ylabel='Density',
         title='Distribution of differences under H0')

The distribution is multimodal, which is the result of selecting a moderately high percentile from a moderately small dataset — the diversity of the results is limited. However, in this example we are interested in the tails of the distribution, so multimodality is not a problem.

To estimate a one-sided p-value, we can compute the fraction of the sample that exceeds the actual difference.

In [11]:

p_value_one_sided = (sample_diff >= actual_diff).mean()
p_value_one_sided

Out[11]:

0.04195804195804196

Or, for a two-sided p-value, we can compute the fraction of the sample that exceeds the actual difference in absolute value.

In [12]:

p_value_two_sided = (np.abs(sample_diff) > actual_diff).mean()
p_value_two_sided

Out[12]:

0.07892107892107893

In this example, the result of the one-sided test would be considered significant at the 5% significance level, but the two-sided test would not. So which is it?

I think it’s not worth worrying about. My interpretation of the results is the same either way: they are inconclusive. Under the null hypothesis, a difference as big as the one we saw would be unlikely, but we can’t rule out the possibility that the groups are identical — or nearly so — and the apparent difference is due to random variation.
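
For readers who want to reuse the permutation test outside this notebook, here is a minimal, self-contained sketch. The function name permutation_test_percentile, its parameters, and the use of np.random.default_rng are assumptions made for illustration, not part of the original notebook. Unlike the cell above, it computes the simulated differences in the same direction as the observed one (second group minus first), and the data are synthetic.

```python
import numpy as np

def permutation_test_percentile(a, b, q=85, iters=1001, seed=None):
    """Permutation test for a difference in the q-th percentile (b minus a).

    Hypothetical helper: returns the observed difference and estimated
    one-sided and two-sided p-values.
    """
    rng = np.random.default_rng(seed)
    a, b = np.asarray(a), np.asarray(b)
    observed = np.percentile(b, q) - np.percentile(a, q)

    pooled = np.concatenate([a, b])
    n = len(a)
    diffs = np.empty(iters)
    for i in range(iters):
        # Permute a copy so the caller's data is left untouched
        shuffled = rng.permutation(pooled)
        diffs[i] = np.percentile(shuffled[n:], q) - np.percentile(shuffled[:n], q)

    p_one_sided = np.mean(diffs >= observed)
    p_two_sided = np.mean(np.abs(diffs) >= abs(observed))
    return observed, p_one_sided, p_two_sided

# Synthetic data in the spirit of the example above (not the same random draws)
rng = np.random.default_rng(17)
group_a = rng.normal(10, 2, size=100)
group_b = rng.normal(11, 2, size=100)
observed, p1, p2 = permutation_test_percentile(group_a, group_b, seed=19)
print(observed, p1, p2)
```

Because the draws differ from the ones in the notebook, the p-values will not match the outputs above; the point is only the structure of the test.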

Testing a fixed value

Now let’s turn to the second question. Suppose we have reason to think that the actual value of the 85th percentile is 12.3, and we would like to know whether the data contradict this hypothesis.

In [13]:

expected = 12.3

We’ll test group1 first. Here’s the 85th percentile of group1 and its difference from the expected value.

In [14]:

actual_stat1 = np.percentile(group1, 85)
actual_diff1 = actual_stat1 - expected
actual_stat1, actual_diff1

Out[14]:

(12.241826987876475, -0.058173012123525325)

Let’s see if a difference of this magnitude is likely to happen under the null hypothesis. One way to model the null hypothesis is to create a dataset that is similar to the observed data, but where the 85th percentile is exactly as expected. We can do that by shifting the observed data by the observed difference.

In [15]:

shifted = group1 - actual_diff1
np.percentile(shifted, 85)

Out[15]:

12.3

The 85th percentile of the shifted data is the expected value, exactly.

Now, to generate samples under the null hypothesis, we can use the following function, which takes a sample, shifts it to have the expected value of the 85th percentile, generates a bootstrap resample of the shifted values, and returns the difference between the 85th percentile of the sample and the expected value.

In [16]:

def bootstrap_percentile(group):
    stat = np.percentile(group, 85) - expected
    shifted = group - stat
    resampled = np.random.choice(shifted, size=len(group), replace=True)
    return np.percentile(resampled, 85) - expected

If we call this function many times, we get a sample of the differences we expect under the null hypothesis.

In [17]:

np.random.seed(17)
sample1 = [bootstrap_percentile(group1) for i in range(1001)]

The following figure shows the distribution of the sample with a vertical line at the observed value.

In [18]:

sns.kdeplot(sample1, label='Sampling distribution')
plt.axvline(actual_diff1, ls=':', color='gray')
decorate(xlabel='Deviation',
         ylabel='Density',
         title='Distribution of deviations from expected under H0')

Without computing a p-value, we can see that a difference as big as actual_diff1 is entirely plausible under the null hypothesis. We can confirm that by computing a one-sided p-value.

In [19]:

p_value_one_side = (sample1 < actual_diff1).mean()
p_value_one_side

Out[19]:

0.4405594405594406

So the observed difference in the first group is not statistically significant. Now let’s do the same thing for the second group.

In [20]:

actual_stat2 = np.percentile(group2, 85)
actual_diff2 = actual_stat2 - expected
actual_diff2

Out[20]:

0.7570036406220559

In [21]:

np.random.seed(17)
sample2 = [bootstrap_percentile(group2) for i in range(1001)]

Here’s what the distribution of differences looks like under the null hypothesis, with a vertical line at the observed value.

In [22]:

sns.kdeplot(sample2, label='Group 2', cut=0)
plt.axvline(actual_diff2, ls=':', color='gray')
decorate(xlabel='Deviation',
         ylabel='Density',
         title='Distribution of deviations from expected under H0')

There are no differences in the sample that exceed the observed value.

In [23]:

np.max(sample2), actual_diff2

Out[23]:

(0.6391539586773582, 0.7570036406220559)

We can conclude that a difference as big as that is very unlikely under the null hypothesis. There’s not much point in computing a p-value more precisely than that, but if it’s required, we can estimate it if we assume that the tail of the sampling distribution is roughly Gaussian. In that case, we can fit a KDE to the sampling distribution like this.

In [24]:

from scipy.stats import gaussian_kde

kde = gaussian_kde(sample2)

And use a Pmf object to approximate the estimated density.

In [25]:

from empiricaldist import Pmf

qs = np.linspace(-2, 2, 201)
ps = kde.evaluate(qs)
pmf = Pmf(ps, qs)
pmf.normalize()

Out[25]:

49.99999999999998

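Then we can use the corresponding CDF to compute the probability of a value that exceeds the observed difference.

In [26]:

cdf = pmf.make_cdf()
# Caution: this evaluates the CDF at actual_diff (Group 2 minus Group 1, about 0.815),
# not at actual_diff2 (Group 2 minus the expected value, about 0.757).
p_value = 1 - cdf(actual_diff)
p_value

Out[26]:

3.319666203671634e-05

So the p-value is quite small. Strictly, the deviation being tested here is actual_diff2, which is a little smaller than actual_diff, so the tail probability at actual_diff2 is somewhat larger than the value shown. It still lies beyond the largest value in the simulated sample, so the conclusion does not change.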
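
As an alternative to discretizing the density with a Pmf, the tail area can be read directly from the fitted KDE with its integrate_box_1d method. The sketch below uses a synthetic stand-in for the sampling distribution and a hypothetical observed deviation; both are assumptions for illustration, not values from the notebook.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(17)
# Synthetic stand-in for a sampling distribution under the null hypothesis
null_sample = rng.normal(0, 0.25, size=1001)
observed = 1.0  # hypothetical observed deviation, far in the right tail

kde_null = gaussian_kde(null_sample)
# Integrate the estimated density from the observed value to infinity
p_tail = kde_null.integrate_box_1d(observed, np.inf)
print(p_tail)
```

Like the Pmf approach, this relies on the Gaussian kernels to extrapolate beyond the simulated values, so the result is only as trustworthy as that assumption.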

Model-based resampling

The bootstrap method in the previous section is a good choice if we are unsure about the distribution of the data, or if there are outliers. But multiple modes in the sampling distribution suggest that there might not be enough diversity in the data for bootstrapping to be reliable.

Fortunately, there is another way we might model the null hypothesis: using a Gaussian distribution. If we generate data from a continuous distribution, we expect the sampling distribution to be unimodal.

But there is a problem we have to solve first — we have to make an assumption about the standard deviation of the hypothetical Gaussian distribution. One option is to use the standard deviation of the data.

In [27]:

s = np.std(group1)

Now we need to find a Gaussian distribution with a given standard deviation that has the expected 85th percentile. We can do that by starting with a distribution centered at 0, computing its 85th percentile and then shifting it.

In [28]:

from scipy.stats import norm

dist0 = norm(0, s)
quantity = dist0.ppf(0.85)
quantity

Out[28]:

2.3232308032911324

ppf stands for “percent point function”, which is SciPy’s name for the quantile function, the inverse of the CDF: it takes a cumulative probability and returns the corresponding quantity.

In [29]:

center = expected - quantity
dist = norm(center, s)
dist.ppf(0.85)

Out[29]:

12.3
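
The shift works because the p-quantile of a normal distribution is the mean plus the standard deviation times the p-quantile of the standard normal. Here is a small sketch of that relationship; the helper name gaussian_with_percentile and the value sigma=2.0 are assumptions chosen for illustration.

```python
from scipy.stats import norm

def gaussian_with_percentile(target, sigma, p=0.85):
    """Return a frozen normal distribution with standard deviation `sigma`
    whose p-quantile equals `target`. Hypothetical helper for illustration."""
    z = norm.ppf(p)              # p-quantile of the standard normal
    center = target - sigma * z  # shift so the p-quantile lands on target
    return norm(center, sigma)

z85 = norm.ppf(0.85)
dist_check = gaussian_with_percentile(12.3, sigma=2.0)
print(z85, dist_check.mean(), dist_check.ppf(0.85))
```

The standard normal quantile for p=0.85 is about 1.036, so the center sits about 1.036 standard deviations below the target percentile.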

The following function takes one of the groups, fits a hypothetical model to it, generates a sample from the model, and returns the difference between the 85th percentile of the sample and the expected value.

In [30]:

def gaussian_percentile(group):
    s = np.std(group)
    dist0 = norm(0, s)
    quantity = dist0.ppf(0.85)
    center = expected - quantity
    dist = norm(center, s)
    sample = dist.rvs(size=len(group))
    return np.percentile(sample, 85) - expected

If we call this function many times, we get the sampling distribution of the test statistic under the null hypothesis.

In [31]:

np.random.seed(17)
sample3 = [gaussian_percentile(group1) for i in range(1001)]

Here’s what the distribution looks like, compared to the corresponding distribution from the bootstrapped model.

In [32]:

sns.kdeplot(sample1, label='Bootstrap model', cut=0)
sns.kdeplot(sample3, label='Gaussian model', cut=0)
plt.axvline(actual_diff1, ls=':', color='gray')
decorate(xlabel='Deviation',
         ylabel='Density',
         title='Distribution of deviations from expected under H0, Group 1')

The shapes of the distributions are different, but their ranges are comparable. And the conclusion is the same: a difference as big as actual_diff1 is entirely plausible under the null hypothesis.

In [33]:

p_value_one_side = (sample3 < actual_diff1).mean()
p_value_one_side

Out[33]:

0.48451548451548454

Now let’s try the same test with Group 2.

In [34]:

np.random.seed(17)
sample4 = [gaussian_percentile(group2) for i in range(1001)]

Here’s the result, along with the result from the bootstrap model.

In [35]:

sns.kdeplot(sample2, label='Bootstrap model', cut=0)
sns.kdeplot(sample4, label='Gaussian model', cut=0)
plt.axvline(actual_diff2, ls=':', color='gray')
decorate(xlabel='Deviation',
         ylabel='Density',
         title='Distribution of deviations from expected under H0, Group 2')

Again, the shapes of the distributions are different, but the conclusion is the same. A difference as big as actual_diff2 is unlikely under the null hypothesis.

In [36]:

p_value_one_side = (sample4 > actual_diff2).mean()
p_value_one_side

Out[36]:

0.003996003996003996

As usual, the two-sided p-value is bigger by a factor of two, roughly, but the difference never matters in practice.

In [37]:

p_value_two_sided = (np.abs(sample4) > actual_diff2).mean()
p_value_two_sided

Out[37]:

0.006993006993006993

Under this model of the null hypothesis, the probability is small that the 85th percentile of the data would exceed the expected value by so much.

Discussion

This example demonstrates a kind of inconsistency in hypothesis testing. We found that Group 1 is not significantly different from the expected value — in the technical sense of significantly — but Group 2 is. So that suggests that Group 1 and Group 2 are different from each other, but when we test that hypothesis, the difference is not statistically significant.

People who are new to hypothesis testing find results like this surprising, but they are not rare. Generally, they are a consequence of the logic of null hypothesis testing and the arbitrariness of the significance threshold.

I think it helps to interpret p-values qualitatively.

A p-value greater than 10% means that the observed effect is plausible under the null hypothesis, and could happen by chance.
A p-value less than 1% means that an observed effect is unlikely under the null hypothesis — so it is unlikely to have happened by chance.
Anything in between is inconclusive.

There is nothing special about 5%.
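
The qualitative reading above can be written as a small function. This is only a sketch of the rule of thumb; the function name interpret_p_value and the wording of the labels are assumptions, and the thresholds are the 10% and 1% cutoffs from the list.

```python
def interpret_p_value(p):
    """Qualitative reading of a p-value, following the rule of thumb above."""
    if not 0 <= p <= 1:
        raise ValueError("p-value must be between 0 and 1")
    if p > 0.10:
        return "plausible under the null hypothesis"
    if p < 0.01:
        return "unlikely under the null hypothesis"
    return "inconclusive"

# Rounded versions of three p-values computed earlier in this article
for p in [0.44, 0.042, 0.004]:
    print(p, interpret_p_value(p))
```

Under this reading, Group 1 versus the expected value is plausible, the comparison between groups is inconclusive, and Group 2 versus the expected value under the Gaussian model is unlikely.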

Data Q&A: Answering the real questions with Python

License: Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International

Small percentiles and missing data

April 26, 2024 AllenDowney

Here’s another installment in Data Q&A: Answering the real questions with Python. Previous installments are available from the Data Q&A landing page.

Bootstrapping percentiles

Here’s a question from the Reddit statistics forum.

I’m trying to figure out how to determine the confidence interval for the .2 percentile temperature for specific set of observed temperatures (all hourly temperatures during January, February, and December since 2000). I have recordings for 53128 of the 53424 possible hourly recordings.

How would I go about saying that I am X% sure that the actual .2 percentile value is between two numbers? Could anyone provide any insight on how to accomplish this. Thank you.

OP provided a link to the data, so this is a question we can answer! For computing confidence intervals, my first choice is bootstrap resampling, but as it turns out, it does not work well for this problem. I’ll show what goes wrong and how to fix it. Then we’ll answer a follow-up question about quantifying the effect of missing data.

Click here to run this notebook on Colab.

I’ll download a utilities module with some of my frequently-used functions, and then import the usual libraries.

In [1]:

from os.path import basename, exists

def download(url):
    filename = basename(url)
    if not exists(filename):
        from urllib.request import urlretrieve

        local, _ = urlretrieve(url, filename)
        print("Downloaded " + str(local))
    return filename

download('https://github.com/AllenDowney/DataQnA/raw/main/nb/utils.py')

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

from utils import decorate

In [2]:

# install the empiricaldist library, if necessary

try:
    import empiricaldist
except ImportError:
    !pip install empiricaldist

Data

I downloaded the data as a CSV file, which we can read into a Pandas DataFrame.

In [3]:

# download the data

download('https://github.com/AllenDowney/DataQnA/raw/main/data/temperature_data_ama.csv')

Out[3]:

'temperature_data_ama.csv'

In [4]:

col = 'Date/Time'
df = pd.read_csv('temperature_data_ama.csv', parse_dates=[col], index_col=col)
df.head()

Out[4]:

	tmpf
Date/Time
2000-01-01 00:00:00	NaN
2000-01-01 01:00:00	NaN
2000-01-01 02:00:00	NaN
2000-01-01 03:00:00	NaN
2000-01-01 04:00:00	NaN

There are 53424 measurements, of which 306 are missing.

In [5]:

len(df)

Out[5]:

53424

In [6]:

missing = df['tmpf'].isna()
missing.sum()

Out[6]:

306

The range of temperatures is from -9.9 degF to 89 degF.

In [7]:

data_clean = df['tmpf'].dropna()
data_clean.describe()

Out[7]:

count    53118.00000
mean        38.28920
std         13.71636
min         -9.90000
25%         29.00000
50%         37.00000
75%         47.00000
max         89.00000
Name: tmpf, dtype: float64

And the 0.2 percentile is 1 degF.

In [8]:

np.percentile(data_clean, 0.2)

Out[8]:

1.0

Basic bootstrap

The following function takes the cleaned data, resamples it, and computes the 0.2 percentile of the bootstrapped sample.

In [9]:

def bootstrap_percentile(data):
    resampled = np.random.choice(data, size=len(data), replace=True)
    return np.percentile(resampled, 0.2)

If we call this function 1001 times, we get a sample from the sampling distribution of the percentile.

In [10]:

np.random.seed(17)
sample = [bootstrap_percentile(data_clean) for i in range(1001)]

Here’s what that sample looks like.

In [11]:

sns.histplot(sample)
decorate(xlabel='')

Immediately we can see that something has gone wrong. The resampling process produces only 8 unique values.

In [12]:

np.unique(sample)

Out[12]:

array([0.    , 0.234 , 1.    , 1.0936, 1.2106, 1.4   , 1.517 , 1.9   ])

If we try to compute a CI by pulling percentiles from the sample, the results are not credible.

In [13]:

np.percentile(sample, [5, 95])

Out[13]:

array([1.    , 1.0936])

This example demonstrates a limitation of bootstrap resampling — it does not work well when there are a small number of unique values.
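
To see the effect in isolation, here is a sketch that bootstraps the 0.2 percentile of the same synthetic sample twice, once with continuous values and once rounded to whole numbers. The data, the sample size, and the assumption of whole-degree resolution are all made up for illustration; they are not the temperature data.

```python
import numpy as np

def bootstrap_percentiles(data, q, iters, rng):
    """Bootstrap the q-th percentile `iters` times."""
    return np.array([
        np.percentile(rng.choice(data, size=len(data), replace=True), q)
        for _ in range(iters)
    ])

rng = np.random.default_rng(17)
continuous = rng.normal(38, 14, size=20000)  # synthetic stand-in, not the real data
rounded = np.round(continuous)               # assumed whole-degree resolution

n_unique_continuous = len(np.unique(bootstrap_percentiles(continuous, 0.2, 200, rng)))
n_unique_rounded = len(np.unique(bootstrap_percentiles(rounded, 0.2, 200, rng)))
print(n_unique_continuous, n_unique_rounded)
```

With rounded data, the order statistics near the 0.2 percentile take only a few distinct values, so the bootstrap can only produce those values and a few interpolations between them.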

However, because the data are temperature measurements, they are actually continuous quantities. So one option is to replace bootstrapping with a model that generates continuous quantities. We’ll try that with a normal model, see that it does not work, and then try again with KDE.

Resampling from a normal model

If we look at the CDF of the data, it resembles the characteristic sigmoid shape of the normal distribution’s CDF.

In [14]:

from empiricaldist import Cdf

cdf_data = Cdf.from_seq(data_clean)
cdf_data.plot(label='data')

decorate(xlabel='Temperature (degF)',
         ylabel='CDF')

So let’s see how it compares to a normal model. I’ll estimate the parameters by computing the mean and standard deviation of the data.

In [15]:

from scipy.stats import norm

mu = np.mean(data_clean)
sigma = np.std(data_clean)
dist = norm(mu, sigma)

And compute the normal CDF within 4 standard deviations of the mean.

In [16]:

low, high = mu - 4*sigma, mu + 4*sigma
xs = np.linspace(low, high, 201)
ys = dist.cdf(xs)

Here’s what the model looks like compared to the data.

In [17]:

plt.plot(xs, ys, color='gray', label='Normal model')
cdf_data.plot(label='data')

decorate(xlabel='Temperature (degF)',
         ylabel='CDF')

It looks pretty good, but there are places where the data clearly deviate from the model. That’s enough to make me worry, but let’s proceed and see how it goes.

The following function takes the cleaned data, generates a random sample from the normal model, and returns the 0.2 percentile of the sample.

In [18]:

def resample_percentile_norm(data):
    resampled = dist.rvs(len(data))
    return np.percentile(resampled, 0.2)

If we call it 1001 times, we hope the result is a sample from the sampling distribution of the percentile.

In [19]:

np.random.seed(17)
sample2 = [resample_percentile_norm(data_clean) for i in range(1001)]

And at first glance it looks good.

In [20]:

sns.kdeplot(sample2, label='Sampling distribution')
decorate(xlabel='Temperature (degF)',
         ylabel='Density')

But notice that the range of the sampling distribution does not include the 0.2 percentile of the data, which is 1. We can compute a 90% CI, but again, it is not credible.

In [21]:

ci90 = np.percentile(sample2, [5, 95])
ci90

Out[21]:

array([-1.84803327, -0.46534009])

To see what went wrong, let’s look at the normal model and the data again, this time with the y axis on a log scale. The log scale is like a microscope that lets us see more clearly what is happening in the tail of the distribution.

In [22]:

plt.plot(xs, ys, color='gray', label='Normal model')
cdf_data.plot(label='data')

decorate(xlabel='Temperature (degF)',
         ylabel='CDF',
         yscale='log')

On a linear scale, it seemed like the normal model might be good enough; on a log scale, it is clear that the data deviate from the model in the left tail.

In retrospect, it is not a surprise if a simple two-parameter model fails to capture every detail of the distribution — the world is a complicated place. So let’s try a nonparametric approach.
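
The mismatch in the left tail can also be checked with a single number: the 0.2 percentile implied by the normal model. This sketch copies the mean and standard deviation from the describe() output above; describe() reports the sample standard deviation while np.std defaults to the population version, but with more than 50,000 observations the difference is negligible.

```python
from scipy.stats import norm

# Summary statistics copied from the describe() output above
mu_data, sigma_data = 38.28920, 13.71636

# 0.2 percentile implied by the normal model
model_q = norm(mu_data, sigma_data).ppf(0.002)
empirical_q = 1.0  # 0.2 percentile computed directly from the data

print(model_q, empirical_q)
```

The model puts the 0.2 percentile at about -1.2 degF, roughly two degrees below the value computed from the data, which is why the CI from the normal model falls entirely below 1.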

Resampling with KDE

We can use kernel density estimation (KDE) to model the distribution of the data, then use the model to resample. Here’s how we estimate the distribution.

In [23]:

from scipy.stats import gaussian_kde

kde = gaussian_kde(data_clean)

To see what the result looks like, we can approximate the density of the model with a discrete PMF.

In [24]:

from empiricaldist import Pmf

pmf_kde = Pmf(kde.pdf(xs), xs)
pmf_kde.normalize()

Out[24]:

1.8226580141477107

And then compare the CDF of the model with the CDF of the data.

In [25]:

pmf_kde.make_cdf().plot(color='gray', label='KDE model')
cdf_data.plot(label='data')
decorate(xlabel='Temperature (degF)',
         ylabel='CDF')

The result shows that KDE is doing what it is meant to do — fitting a continuous distribution to the data with minimal assumptions.

The following function takes the cleaned data, uses the KDE model to generate a random sample, and returns the 0.2 percentile of the sample.

In [26]:

def resample_percentile_kde(data):
    resampled = kde.resample(len(data))
    return np.percentile(resampled, 0.2)

If we call it 1001 times, we hope once again that the result is a sample from the sampling distribution of the percentile.

In [27]:

np.random.seed(17)
sample3 = [resample_percentile_kde(data_clean) for i in range(1001)]

And this time we get a better result. The sampling distribution looks good, and it contains the actual percentile of the data.

In [28]:

sns.kdeplot(sample3, label='Sampling distribution')
decorate(xlabel='Temperature (degF)',
         ylabel='Density')

And the width of the 90% CI is plausible.

In [29]:

np.percentile(sample3, [5, 95])

Out[29]:

array([0.27006571, 1.18702612])

So with a couple of false starts, we have answered the original question. But it turns out there’s more.
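
The KDE resampling steps can be collected into one function that does not depend on global variables. This is a sketch under stated assumptions: the name kde_percentile_ci and its parameters are hypothetical, the data are synthetic, and the number of iterations is kept small so it runs quickly.

```python
import numpy as np
from scipy.stats import gaussian_kde

def kde_percentile_ci(data, q=0.2, level=90, iters=201, seed=None):
    """Percentile CI for the q-th percentile by resampling from a KDE model.

    Hypothetical helper; `iters` is kept small here so the sketch runs quickly.
    """
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    kde_model = gaussian_kde(data)
    stats = np.empty(iters)
    for i in range(iters):
        # resample returns an array of shape (1, n) for one-dimensional data
        resampled = kde_model.resample(len(data), seed=rng)
        stats[i] = np.percentile(resampled, q)
    tail = (100 - level) / 2
    low, high = np.percentile(stats, [tail, 100 - tail])
    return low, high

rng = np.random.default_rng(17)
synthetic = np.round(rng.normal(38, 14, size=5000))  # synthetic stand-in data
low, high = kde_percentile_ci(synthetic, seed=17)
print(low, high)
```

Passing a generator through the seed argument keeps the resampling reproducible without touching NumPy's global random state.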

Fill missing values

In a follow-up message, OP wrote:

Just in case it helps any, here’s what I’m ultimately trying to accomplish with this endeavor… I am trying to come up with a plausible way of demonstrating that the .2 percentile value (1 degF) that is derived from this data set is sufficiently representative of what the value would be if there were no missing data points (hourly readings) from the dataset.

OK, that’s a different question! However, the resampling framework can be extended naturally to estimate the effect of missing data. Here’s a function that takes the original data — including NaNs — and fills the missing values with a random selection of valid values. For historical reasons, this way of filling missing values is called “hot deck imputation”.

In [30]:

data_nan = df['tmpf']
valid = data_nan.dropna()
missing = data_nan.isna()

def fill_missing(data):
    filled = data.copy()
    filled[missing] = np.random.choice(valid, size=missing.sum(), replace=True)
    return filled

To test it, we can check that the result has no NaNs.

In [31]:

filled = fill_missing(data_nan)
filled.isna().sum()

Out[31]:

0

Now we can include fill_missing as part of the resampling pipeline. The following function takes the original data, fills missing values, generates a sample from a KDE model, and returns the 0.2 percentile of the sample.

In [32]:

def resample_percentile_kde_fill(data):
    filled = fill_missing(data)
    kde = gaussian_kde(filled)
    resampled = kde.resample(len(data))
    return np.percentile(resampled, 0.2)

If we call it many times, we get a sample from a distribution that represents the uncertainty of the estimate due to a combination of missing data and random sampling.

In [33]:

np.random.seed(17)
sample4 = [resample_percentile_kde_fill(data_nan) for i in range(1001)]

Here’s what the result looks like, compared to the sampling distribution from the previous section, which represents only uncertainty due to random sampling.

In [34]:

sns.kdeplot(sample3, label='Sampling distribution')
sns.kdeplot(sample4, label='Sampling distribution with fill')
decorate(xlabel='Temperature (degF)',
         ylabel='Density')

The difference does not seem substantial, and the CIs are similar.

In [35]:

np.percentile(sample3, [5, 95])

Out[35]:

array([0.27006571, 1.18702612])

In [36]:

np.percentile(sample4, [5, 95])

Out[36]:

array([0.29608271, 1.18784825])

We can conclude that missing data does not have much effect on the CI.

To estimate the effect more precisely, we could run this again with 10,001 iterations rather than 1001. But I won’t bother because with only 306 missing values out of 53,424, I did not expect the missing data to affect the results by much, and this result confirms it. Rather than estimate the CI more precisely, I would conclude that missing data is not a problem, and drop it.

Discussion

Normally I am quick to recommend bootstrap resampling because “it just works”. It makes almost no assumptions about the distribution of the data, and it is easy to extend to almost any statistic. But as this example shows, it is not infallible — the kryptonite of bootstrapping is lack of diversity in the data.

To diagnose this problem, it is a good idea to explore the sampling distribution. If bootstrapping goes well, the sampling distribution should have many unique values, and the range should contain the estimate computed directly from the data, usually close to the middle of the CI.

If the results from bootstrapping fail these tests, think about other ways to model the data-generating process. If a parametric model fits the data well, you can use the data to estimate parameters and then use the model to generate simulated samples. Otherwise, consider a non-parametric approach like KDE.
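
The diagnostic checks described above can be sketched as a small helper. The function name check_bootstrap and the min_unique threshold are arbitrary assumptions, not established rules. The example input contains only the eight unique values reported by the basic bootstrap, without their frequencies, so it illustrates the checks rather than reproducing the earlier CI.

```python
import numpy as np

def check_bootstrap(sample, estimate, min_unique=50, level=90):
    """Rough diagnostics for a bootstrap sampling distribution.

    Hypothetical helper; the `min_unique` threshold is an arbitrary assumption.
    """
    sample = np.asarray(sample)
    tail = (100 - level) / 2
    low, high = np.percentile(sample, [tail, 100 - tail])
    n_unique = len(np.unique(sample))
    return {
        "n_unique": n_unique,
        "enough_unique": n_unique >= min_unique,
        "estimate_in_range": bool(sample.min() <= estimate <= sample.max()),
        "estimate_in_ci": bool(low <= estimate <= high),
    }

# The eight unique values produced by the basic bootstrap above
# (their frequencies are not reproduced here)
basic = np.array([0.0, 0.234, 1.0, 1.0936, 1.2106, 1.4, 1.517, 1.9])
report = check_bootstrap(basic, estimate=1.0)
print(report)
```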

For filling missing data, hot deck imputation ignores serial correlation and other statistical structure in the data, so the imputed values are likely to be unrealistic. But in this case that’s probably a feature, because the results overestimate the effect of missing data. As a result, we can make the argument, “Even if we assume that the missing data is highly variable, it has no substantial effect on the estimated percentile or the computed CI.”

If it were necessary to fill missing data with more realistic values, we could use a time series method like ARIMA or a Gaussian process.
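
For reference, here is a self-contained variant of the hot deck imputation used above. It takes a Series and computes the missing mask itself instead of relying on global variables; the function name hot_deck_fill, the seed parameter, and the tiny example Series are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def hot_deck_fill(series, seed=None):
    """Fill NaNs with values drawn at random (with replacement) from the
    observed values. Hypothetical helper, independent of the globals above."""
    rng = np.random.default_rng(seed)
    missing = series.isna()
    valid = series[~missing].to_numpy()
    filled = series.copy()
    filled[missing] = rng.choice(valid, size=missing.sum(), replace=True)
    return filled

example = pd.Series([30.0, np.nan, 28.0, 31.0, np.nan, 27.0])
filled_example = hot_deck_fill(example, seed=17)
print(filled_example)
```

Each missing entry is replaced by one of the observed values, and the observed entries are left unchanged.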

Data Q&A: Answering the real questions with Python

License: Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International

What does “strength” mean?

April 21, 2024 AllenDowney

Here’s another installment in Data Q&A: Answering the real questions with Python. Previous installments are available from the Data Q&A landing page.

What does “strength” mean?

Here’s a question from the Reddit statistics forum.

I am currently doing a uni assignment and one of my tasks is analysing the correlation between two variables. When I use the correlation function in Excel, it returns a correlation of -0.0377. When I use the same data to create a scatter plot, the trend line is positive. I need to identify the correlation strength and direction and thereby, I am confused by these opposing outcomes. Can somebody please explain why the correlation is showing as negative but the trend line is positive? What does this indicate in terms of the strength and direction of the relationship between the two variables?

To answer the immediate question, correlation and the slope of a linear regression line always have the same sign. Mathematically, both are proportional to the covariance of the x and y variables, that is, the dot product of the two variables after subtracting their means.
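
The relationship between the two can be checked numerically: the least-squares slope equals the correlation times the ratio of the standard deviations of y and x, and since standard deviations are positive, the signs must agree. The data below are synthetic and chosen only for illustration.

```python
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(18)
x = np.linspace(20, 50)                       # synthetic data for illustration
y = 70 - 0.05 * x + rng.normal(0, 2, len(x))

r = np.corrcoef(x, y)[0, 1]
slope = linregress(x, y).slope

# Least-squares slope equals r times the ratio of standard deviations
slope_from_r = r * np.std(y) / np.std(x)
print(r, slope, slope_from_r)
```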

So there is something strange going on. It might be a simple error — for example, maybe the correlation and regression were based on different data. Or it might be that the trend computed by Excel is something other than linear regression. For example, a line that minimizes mean absolute error (MAE) rather than mean squared error (MSE) can have a slope with the opposite sign of the correlation.

Without more information it’s hard to be sure what’s going on, but for this example it might not matter. The computed correlation is negative but very small. If we fit a line (other than a regression line) to the same data and the slope is positive but similarly small, that is not necessarily inconsistent. Within statistical uncertainty, both are indistinguishable from zero.

OP also asks, “What does this indicate in terms of the strength and direction of the relationship between the two variables?” So let’s answer that question, too.

Click here to run this notebook on Colab.

I’ll download a utilities module with some of my frequently-used functions, and then import the usual libraries.

In [1]:

from os.path import basename, exists

def download(url):
    filename = basename(url)
    if not exists(filename):
        from urllib.request import urlretrieve

        local, _ = urlretrieve(url, filename)
        print("Downloaded " + str(local))
    return filename

download('https://github.com/AllenDowney/DataQnA/raw/main/nb/utils.py')

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

from utils import decorate

Interpreting correlation and slope

When people talk about the strength of a relationship, they might mean correlation or they might mean the slope of a fitted line. But these measures of “strength” are not always consistent.

For example, suppose we are concerned about the health effects of weight gain, so we plot weight versus age from 20 to 50 years old. I’ll generate two fake datasets to demonstrate the point.

In [2]:

np.random.seed(18)
xs1 = np.linspace(20, 50)
ys1 = 75 + 0.02 * xs1 + np.random.normal(0, 0.15, len(xs1))

In [3]:

np.random.seed(18)
xs2 = np.linspace(20, 50)
ys2 = 65 + 0.2 * xs2 + np.random.normal(0, 3, len(xs2))

I used the same random seed to generate both, so they look similar, as we can see in these scatter plots.

In [4]:

from utils import underride

def text(x, y, string, **options):
    """Plot text using axis coordinates.
    """
    transform = plt.gca().transAxes
    options = underride(options, transform=transform, ha='left', va='top')
    plt.text(x, y, string, **options)

In [5]:

plt.plot(xs1, ys1, 'o', alpha=0.5)
text(0.05, 0.9, 'Fake dataset A')
decorate(xlabel='Age in years',
         ylabel='Weight in kg')

In [6]:

plt.plot(xs2, ys2, 'o', alpha=0.5)
text(0.05, 0.9, 'Fake dataset B')
decorate(xlabel='Age in years',
         ylabel='Weight in kg')

Nevertheless, they have substantially different correlations.

In [7]:

rho1 = np.corrcoef(xs1, ys1)[0][1]
rho1

Out[7]:

0.7579660563439401

In [8]:

rho2 = np.corrcoef(xs2, ys2)[0][1]
rho2

Out[8]:

0.4782776976576317

In the first dataset, the correlation is close to 0.75. In the second, it is close to 0.5. So we might think the first relationship is stronger.

But let’s look at the slopes of the regression lines. For the first dataset, the estimated slope is about 0.019 kilograms per year or about 0.56 kilograms over the 30-year range.

In [9]:

from scipy.stats import linregress

res1 = linregress(xs1, ys1)
res1.slope, res1.slope * 30

Out[9]:

(0.018821034903244386, 0.5646310470973316)

For the second dataset, the estimated slope is almost 10 times higher — about 0.18 kilograms per year or 5.3 kilograms per 30 years.

In [10]:

res2 = linregress(xs2, ys2)
res2.slope, res2.slope * 30

Out[10]:

(0.17642069806488855, 5.292620941946657)

According to the correlations, the first relationship is stronger. According to the slopes, the second relationship is stronger. So which is it? The answer depends on context.

In this example, the slope of the regression line indicates the magnitude of weight gain. If we are concerned about the health effects of weight gain, the second relationship is probably more important.

On the other hand, correlation indicates how well we can predict one value based on the other. If, for some reason, we are trying to guess someone’s weight, based on their age, the first relationship would be more important.
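The two measures are linked by a simple identity: for a least-squares line, the slope equals the correlation multiplied by the ratio of the standard deviations of the ys and the xs. So a dataset can combine a high correlation with a small slope if the ys vary little, and the reverse if they vary a lot. Here is a minimal sketch that checks the identity on a freshly generated dataset; the generator seed and the variable names are my own choices for illustration, not part of the examples above.

```python
import numpy as np
from scipy.stats import linregress

# Illustrative data in the style of fake dataset B (seed chosen arbitrarily)
rng = np.random.default_rng(18)
xs = np.linspace(20, 50)
ys = 65 + 0.2 * xs + rng.normal(0, 3, len(xs))

# Correlation, and the slope implied by it
rho = np.corrcoef(xs, ys)[0, 1]
slope_from_rho = rho * np.std(ys) / np.std(xs)

# Slope estimated directly by least squares
slope_fit = linregress(xs, ys).slope

print(slope_from_rho, slope_fit)
```

The two printed values should agree up to floating-point error, which is why the slope carries information about the scale of the ys that the correlation discards.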

Here are all the results in the same plot.

In [11]:

def make_plot(xs, ys, title):
    """Make a scatter plot with fitted line.
    """
    res = linregress(xs, ys)
    plt.plot(xs, ys, 'o', alpha=0.5)

    fx = np.array([xs.min(), xs.max()])
    fy = res.intercept + res.slope * fx
    plt.plot(fx, fy, '-')

    text(0.05, 0.9, title)
    text(0.05, 0.82, f'correlation = {res.rvalue:0.2f}')
    text(0.05, 0.74, f'slope = {res.slope:0.3f} kg/yr')
    decorate(xlabel='Age in years',
             ylabel='Weight in kg')

In [12]:

plt.figure(figsize=(6, 7))

plt.subplot(2, 1, 1)
make_plot(xs1, ys1, 'Fake dataset A')

plt.subplot(2, 1, 2)
make_plot(xs2, ys2, 'Fake dataset B')

Because of the way the plots are scaled, the slope looks smaller in the second figure, but that’s misleading. So this example is a reminder to look at the labels of the y axis — which is where the effect size often hides.

Minimizing MAE

Earlier I said a line that minimizes mean absolute error (MAE) rather than mean squared error (MSE) can have a slope with the opposite sign of the correlation. To demonstrate, I’ll use the following function to minimize MAE.

In [13]:

from scipy.optimize import minimize

def error_func(params, xs, ys):
    intercept, slope = params
    y_pred = intercept + slope * xs
    return np.mean(np.abs(y_pred - ys))

def minimize_mae(xs, ys):
    param0 = [0, 0]
    result = minimize(error_func, param0, args=(xs, ys), method='Nelder-Mead')
    assert result.success
    
    return result.x

Now I’ll generate a dataset where xs and ys are actually uncorrelated.

In [14]:

n = 100

np.random.seed(20)
xs = np.random.normal(0, 1, n)
ys = np.random.normal(0, 1, n)

In this dataset, the correlation is slightly negative and the slope of the fitted line is slightly positive.

In [15]:

corr = np.corrcoef(xs, ys)[0, 1]
intercept, slope = minimize_mae(xs, ys)

corr, slope

Out[15]:

(-0.08198650127894906, 0.04675271007547886)

Here’s what the scatter plot looks like with the minimum MAE line.

In [16]:

fxs = np.array([np.min(xs), np.max(xs)])
fys = intercept + slope * fxs

In [17]:

plt.plot(xs, ys, '.')
plt.plot(fxs, fys)
decorate()

To find this example, I generated datasets with different random number seeds. Out of the first 100 attempts, 19 yield correlation and slope with opposite signs.

In [18]:

count = 0
for i in range(100):
    np.random.seed(i)
    xs = np.random.normal(0, 1, n)
    ys = np.random.normal(0, 1, n)
    corr = np.corrcoef(xs, ys)[0, 1]
    intercept, slope = minimize_mae(xs, ys)
    if corr * slope < 0:
        count += 1
count

Out[18]:

19

So examples like this are not rare, if the actual correlation is close to zero.

Data Q&A: Answering the real questions with Python

License: Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International

What does a confidence interval mean?

April 17, 2024 AllenDowney

Here’s another installment in Data Q&A: Answering the real questions with Python. In general, I will try to focus on practical problems, but this one is a little more philosophical.

What does a confidence interval mean?

Here’s a question from the Reddit statistics forum (with an edit for clarity):

Why does a confidence interval not tell you that 90% of the time, [the true value of the population parameter] will be in the interval, or something along those lines?

I understand that the interpretation of confidence intervals is that with repeated samples from the population, 90% of the time the interval would contain the true value of whatever it is you’re estimating. What I don’t understand is why this method doesn’t really tell you anything about what that parameter value is.

This is, to put it mildly, a common source of confusion. And here is one of the responses:

From a frequentist perspective, the true value of the parameter is fixed. Thus, once you have calculated your confidence interval, one if two things are true: either the true parameter value is inside the interval, or it is outside it. So the probability that the interval contains the true value is either 0 or 1, but you can never know which.

This response is the conventional answer to this question — it is what you find in most textbooks and what is taught in most classes. And, in my opinion, it is wrong. To explain why, I’ll start with a story.

Suppose Frank and Betsy visit a factory where 90% of the widgets are good and 10% are defective. Frank chooses a part at random and asks Betsy, “What is the probability that this part is good?”

Betsy says, “If 90% of the parts are good, and you choose one at random, the probability is 90% that it is good.”

“Wrong!” says Frank. “Since the part has already been manufactured, one of two things must be true: either it is good or it is defective. So the probability is either 100% or 0%, but we don’t know which.”

Frank’s argument is based on a strict interpretation of frequentism, which is a particular philosophy of probability. But it is not the only interpretation, and it is not a particularly good one. In fact, it suffers from several flaws. This example shows one of them — in many real-world scenarios where it would be meaningful and useful to assign a probability to a proposition, frequentism simply refuses to do so.

Fortunately, Betsy is under no obligation to adopt Frank’s interpretation of probability. She is free to adopt any of several alternatives that are consistent with her commonsense claim that a randomly-chosen part has a 90% probability of being functional.

Now let’s see how this story relates to confidence intervals.

Click here to run this notebook on Colab

I’ll start by importing the usual libraries.

In [1]:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

Generating a confidence interval

Suppose that Frank is a statistics teacher and Betsy is one of his students. One day Frank teaches the class a process for computing confidence intervals that goes like this:

Collect a sample of size $n$.
Compute the sample mean, $m$, and the sample standard deviation, $s$.
If those estimates are correct, the sampling distribution of the mean is a normal distribution with mean $m$ and standard deviation $s / \sqrt{n}$.
Compute the 5th and 95th percentiles of this sampling distribution. The result is a 90% confidence interval.

As an example, Frank generates a sample with size 100 from a normal distribution with known parameters mean $\mu=10$ and standard deviation $\sigma=3$.

In [2]:

from scipy.stats import norm

mu = 10
sigma = 3

np.random.seed(17)
data = norm.rvs(mu, sigma, size=100)

Then Betsy uses the following function to compute a 90% CI.

In [3]:

def compute_ci(data):
    n = len(data)
    m = np.mean(data)
    s = np.std(data)
    sampling_dist = norm(m, s / np.sqrt(n))
    ci90 = sampling_dist.ppf([0.05, 0.95])
    return ci90

In [4]:

ci90 = compute_ci(data)
ci90

Out[4]:

array([ 9.78147291, 10.88758585])

In this example, we know that the actual population mean is 10 so we can see that this CI contains the population mean. But if we draw another sample, we might get a sample mean that is substantially higher or lower than $\mu$, and the CI we compute might not contain $\mu$.

To see how often that happens, we’ll use this function, which generates a sample, computes a 90% CI, and checks whether the CI contains $\mu$.

In [5]:

def run_experiment(mu, sigma):
    data = norm.rvs(mu, sigma, size=100)
    low, high = compute_ci(data)
    return low < mu < high

If we run this function 1000 times, we can count how often the CI contains $\mu$.

In [6]:

np.mean([run_experiment(mu, sigma) for i in range(1000)]) * 100

Out[6]:

90.60000000000001

The answer is close to 90% — that is, if we run this process many times, 90% of the CIs it generates contain $\mu$ and 10% don’t. So the CI-computing process is like a factory where 90% of the widgets are good and 10% are defective.

Now suppose Frank chooses a different value of $\mu$ and does not tell Betsy what it is. To simulate that scenario, I’ll choose a value from a random number generator with a specific seed.

In [7]:

np.random.seed(17)
unknown_mu = np.random.uniform(10, 20)

And just for good measure, I’ll generate a random value for $\sigma$, too.

In [8]:

unknown_sigma = np.random.uniform(2, 3)

Next Frank generates a sample from a normal distribution with those parameters, and gives the sample to Betsy.

In [9]:

data2 = norm.rvs(unknown_mu, unknown_sigma, size=100)

And Betsy uses the data to compute a CI.

In [10]:

compute_ci(data2)

Out[10]:

array([12.81278165, 13.73152148])

Now suppose Frank asks, “What is the probability that this CI contains the actual value of $\mu$ that I chose?”

Betsy says, “We have established that 90% of the CIs generated by this process contain $\mu$, so the probability that this CI contains $\mu$ is 90%.”

And of course Frank says “Wrong! Now that we have computed the CI, it is unknown whether it contains the true parameter, but it is not random. The probability that it contains $\mu$ is either 100% or 0%. We can’t say it has a 90% chance of containing $\mu$.”

Once again, Frank is asserting a particular interpretation of probability — one that has the regrettable property of rendering probability nearly useless. Fortunately, Betsy is under no obligation to join Frank’s cult.

Under most reasonable interpretations of probability, you can say that a specific 90% CI has a 90% chance of containing the true parameter. There is no real philosophical problem with that.
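One way to make Betsy's claim concrete is to repeat the whole game many times: each round, a new $\mu$ and $\sigma$ are chosen in secret, a sample is drawn, and a CI is computed. The sketch below does that and counts how often the CI contains the hidden $\mu$. It is a self-contained illustration, so it redefines the CI computation, and the function name `hidden_experiment`, the seed, and the number of rounds are my own assumptions.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(17)  # arbitrary seed for this sketch

def compute_ci(data):
    """90% CI for the mean, using the normal sampling distribution."""
    n = len(data)
    m = np.mean(data)
    s = np.std(data)
    return norm.ppf([0.05, 0.95], m, s / np.sqrt(n))

def hidden_experiment():
    """Choose mu and sigma in secret, draw a sample, check the CI."""
    mu = rng.uniform(10, 20)
    sigma = rng.uniform(2, 3)
    data = rng.normal(mu, sigma, size=100)
    low, high = compute_ci(data)
    return low < mu < high

# Fraction of rounds in which the CI contains the hidden mu
coverage = np.mean([hidden_experiment() for _ in range(2000)])
print(coverage)
```

If the fraction comes out near 0.9, then someone who bets at 90% on each individual CI would be well calibrated, which is all Betsy is claiming.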

But there might be practical problems.

Practical problems

The process we use to construct a CI takes into account variability due to random sampling, but it does not take into account other problems, like measurement error and non-representative sampling. To see why that matters, let’s consider a more realistic example.

Suppose we want to estimate the average height of adult male residents of the United States. If we define terms like “height”, “adult”, “male”, and “resident of the United States” precisely enough, we have defined a population that has a true, unknown average height. If we collect a representative sample from the population and measure their heights, we can use the sample mean to estimate the population mean and compute a confidence interval.

To demonstrate, I’ll use data from the Behavioral Risk Factor Surveillance System (BRFSS). Here’s an extract I prepared for Elements of Data Science, based on BRFSS data from 2021.

In [11]:

from os.path import basename, exists

def download(url):
    filename = basename(url)
    if not exists(filename):
        from urllib.request import urlretrieve

        local, _ = urlretrieve(url, filename)
        print("Downloaded " + str(local))
    return filename

download('https://github.com/AllenDowney/ElementsOfDataScience/raw/v1/data/brfss_2021.hdf')

Out[11]:

'brfss_2021.hdf'

In [12]:

brfss = pd.read_hdf('brfss_2021.hdf', 'brfss')

It includes data from 203,760 male respondents.

In [13]:

male = brfss.query('_SEX == 1')
len(male)

Out[13]:

203760

For 193,701 of them, we have their self-reported height recorded in centimeters.

In [14]:

male['HTM4'].count()

Out[14]:

193701

We can use this data to compute a sample mean and 90% confidence interval.

In [15]:

m = male['HTM4'].mean()
ci90 = compute_ci(male['HTM4'])
m, ci90

Out[15]:

(178.14807357731763, array([178.11896943, 178.17717773]))

Because the sample size is so large, the confidence interval is quite small — its width is only 0.03% of the estimate.

In [16]:

np.diff(ci90) / m * 100

Out[16]:

array([0.03267411])

So there is very little variability in this estimate due to random sampling. That means the estimate is precise, but that doesn’t mean it’s accurate.

For one thing, the measurements in this dataset are self-reported. If people tend to round up — and they do — that would make the estimated mean too high.

For another thing, it is difficult to construct a representative sample of a population as large as the United States. The BRFSS is run by people who know what they are doing, but nothing is perfect — it is likely that some groups are systematically overrepresented or underrepresented. And that could make the estimated mean too high or too low.

Given that there is almost certainly some measurement error and some sampling bias, it is unlikely that the actual population mean falls in the very small confidence interval we computed.

And that’s true in general — when the sample size is large, variability due to random sampling is small, which means that other sources of error are likely to be bigger. So as sample size increases, the probability decreases that the CI contains the true value.
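To illustrate that last point, here is a small simulation with made-up numbers: a true mean of 178 cm, a standard deviation of 8 cm, and a systematic measurement bias of 0.5 cm. All three values, along with the function name `ci_contains_truth`, are assumptions I chose for the sketch; they are not estimates from the BRFSS data.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)  # arbitrary seed

true_mu = 178.0  # hypothetical true population mean, in cm
sigma = 8.0      # hypothetical population standard deviation
bias = 0.5       # hypothetical systematic error, e.g. rounding up

def ci_contains_truth(n):
    """Draw a biased sample of size n; check whether the 90% CI covers true_mu."""
    data = rng.normal(true_mu + bias, sigma, size=n)
    m = np.mean(data)
    se = np.std(data) / np.sqrt(n)
    low, high = norm.ppf([0.05, 0.95], m, se)
    return low < true_mu < high

sample_sizes = [100, 1000, 10000]
coverage = [np.mean([ci_contains_truth(n) for _ in range(300)])
            for n in sample_sizes]

for n, c in zip(sample_sizes, coverage):
    print(n, c)
```

With a small sample, the CI is wide enough to absorb the bias most of the time. As the sample grows, the CI shrinks around the biased value and the fraction of intervals that contain the true mean should fall toward zero.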

Summary

The way confidence intervals are taught in most statistics classes is based on the frequentist interpretation of probability. But you are not obligated to adopt that interpretation, and there are good reasons you should not.

Some people will say that confidence intervals are a frequentist method that is inextricable from the frequentist interpretation. I don’t think that’s true — there is nothing about the computation of a confidence interval that depends on the frequentist interpretation. So you are free to interpret the CI under any philosophy of probability you like.

If you want to say that a 90% CI has a 90% chance of containing the true value, there is nothing wrong with that, philosophically. I think it is a meaningful and useful probabilistic claim.

However, it is only true if other sources of error — like sampling bias and measurement error — are small compared to variability due to random sampling.

For that reason, I think the best interpretation of a confidence interval, for practical purposes, is that it quantifies the precision of the estimate but says nothing about its accuracy.

Credit: I borrowed Frank and Betsy from my friend Ted Bunn. They first appeared in his blog post Who knows what evil lurks in the hearts of men? The Bayesian doesn’t care..

Standard deviation of a count

April 13, 2024 AllenDowney

This post is part of a new project with the working title Data Q&A: Answering the real questions with Python. In each installment, I’ll take a question from Reddit’s statistics forum and answer it, using Python code to demonstrate. My answer is in a Jupyter notebook — see the link below to run it in Colab.

Is taking the SD of a count variable helpful?

Here’s a question from the Reddit statistics forum.

A student brought this up to me in class this week and I had no idea how to answer. For some context they are doing in experiment that involves a count variable and then have a choice of what inferential stat test they want to run. If they pick a t-test they need to show SD error bars on their graph but one student kept telling me it isn’t possible. I’ve spent time looking around forums and asked my PI who also shrugged since we’re chemistry people. I was just wondering if someone can explain if finding the SD of a count variable is possible, and if it is, does it tell you anything important or is it just a waste of time?

In a follow-up, OP provided more context:

To be specific students were counting the amount of eggs laid on two different types of substrate in petri dishes (something like wood and grass). They were then given the option to choose whatever inferential stat test they thought best fits the data and a number of them chose unpaired t-tests. They used [statistical software] to do the actual test (plugging in the count numbers from 2 variables with 3 replicates, so 6 numbers in total) .

It was my understanding that because the data probably doesn’t follow a normal distribution, SD error bars wouldn’t really be helpful.

There are several questions going on here, so let’s start with the headline — is taking the SD of a count variable helpful?

Click here to run this notebook on Colab

I’ll download a utilities module with some of my frequently-used functions, and then import the usual libraries.

In [1]:

from os.path import basename, exists

def download(url):
    filename = basename(url)
    if not exists(filename):
        from urllib.request import urlretrieve

        local, _ = urlretrieve(url, filename)
        print("Downloaded " + str(local))
    return filename

download('https://github.com/AllenDowney/DataQnA/raw/main/nb/utils.py')

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

from utils import decorate

In [2]:

# install the empiricaldist library, if necessary

try:
    import empiricaldist
except ImportError:
    !pip install empiricaldist

Generating a dataset

Count data are often well modeled by a negative binomial distribution, so that’s what I’ll use to generate a dataset.

In [3]:

from scipy.stats import nbinom

n = 4
p = 0.5

np.random.seed(17)
data = nbinom.rvs(n, p, size=100)

Here’s what the distribution of values looks like. I’m using a Pmf object, which shows all of the values, rather than a histogram, which puts the values into bins.

In [4]:

from empiricaldist import Pmf

pmf = Pmf.from_seq(data)
pmf.bar()

decorate(xlabel='Count data', ylabel='PMF')

We can compute the mean and standard deviation of the data in the usual way.

In [5]:

m = np.mean(data)
m

Out[5]:

4.23

In [6]:

s = np.std(data)
s

Out[6]:

2.7160817366198686

Now the question is whether reporting the standard deviation of this dataset is useful as a descriptive statistic. I think it’s OK — it quantifies the spread of the distribution, just as it’s intended to.

The only problem is that when you report a mean and standard deviation, people often picture a normal distribution, and in this example, that picture is misleading. To see why, I’ll plot the distribution of the data again, along with a normal distribution with the same mean and standard deviation.

In [7]:

from scipy.stats import norm

qs = np.linspace(m - 4*s, m+4*s)
ps = norm(m, s).pdf(qs)

In [8]:

pmf.bar()
plt.plot(qs, ps, color='C1')

decorate(xlabel='Count data', ylabel='PMF')

The normal distribution is symmetric, unlike the distribution of the data, which is skewed to the right. And the normal distribution extends to negative values, which count data cannot.

So, rather than reporting mean and standard deviation, it might be better to report a median and interquartile range (IQR).

In [9]:

low, median, high = np.percentile(data, [25, 50, 75])
low, median, high

Out[9]:

(2.0, 4.0, 6.0)

In [10]:

iqr = high - low
median, iqr

Out[10]:

(4.0, 4.0)

A less common alternative would be to fit a negative binomial distribution to the data and report the estimated parameters. We can do that in this example by matching moments.

In [11]:

p_estimate = m / (s ** 2)
n_estimate = m * p_estimate / (1 - p_estimate)
n_estimate, p_estimate

Out[11]:

(5.685520002542022, 0.5733960499383226)
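For reference, these formulas come from the mean and variance of the negative binomial distribution in SciPy's parameterization, which are $n(1-p)/p$ and $n(1-p)/p^2$. Dividing the mean by the variance gives $p$, and substituting back gives $n$. As a quick sanity check, the sketch below applies the same formulas to the exact moments of a distribution with known parameters; the variable names are mine.

```python
from scipy.stats import nbinom

# Known parameters (the same ones used to generate the dataset)
n_true, p_true = 4, 0.5

# Exact mean and variance of the distribution
mean, var = nbinom.stats(n_true, p_true, moments='mv')
mean, var = float(mean), float(var)

# Method of moments: mean / var = p, then solve the mean for n
p_est = mean / var
n_est = mean * p_est / (1 - p_est)

print(n_est, p_est)
```

With exact moments, the formulas recover the parameters exactly. With a sample of 100, the sample mean and variance are noisy, so the estimates differ from the true values.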

With these parameters, we can compute the PMF of a negative binomial distribution, and see that it fits the data reasonably well.

In [12]:

qs = np.arange(np.max(data) + 3)
ps = nbinom(n_estimate, p_estimate).pmf(qs)

In [13]:

pmf.bar()
plt.plot(qs, ps, color='C1')

decorate(xlabel='Count data', ylabel='PMF')

So that’s my answer to the first question — there’s nothing wrong with reporting the standard deviation of a count variable, as long as we don’t assume that the distribution is normal.

What about those error bars?

OP also asked about showing “SD error bars”. I’m not sure what they mean, but I think they are confusing standard deviation and standard error.

Standard deviation is a descriptive statistic that quantifies the spread of a dataset.
Standard error is an inferential statistic that quantifies the precision of an estimate.

For example, suppose we use this sample to estimate the mean in the population. We might wonder how precise the estimate is. One way to answer that question is to ask how much the result varies if we run the experiment many times. We can answer that question by bootstrap resampling.

The following function takes a dataset as a parameter, draws a random sample from it with replacement, and returns the mean of the resampled data.

In [14]:

def bootstrap_mean(data):
    resampled = np.random.choice(data, size=len(data), replace=True)
    return np.mean(resampled)

If we call this function 1000 times, we get a sample from the sampling distribution of the mean.

In [15]:

sample = [bootstrap_mean(data) for i in range(1000)]

Here’s what the sampling distribution looks like.

In [16]:

sns.kdeplot(sample)
decorate(xlabel='Sample mean',
         ylabel='Density',
         title='Sampling distribution of the mean')

This distribution shows how precise our estimate of the mean is. If we run the experiment again, the mean could plausibly be as low as 3.5 or as high as 5.0.

There are two ways to summarize this distribution. First, the standard deviation of the sampling distribution is the standard error.

In [17]:

se = np.std(sample)
se

Out[17]:

0.26519081413201323

Second, the interval from the 5th to the 95th percentile is a 90% confidence interval.

In [18]:

ci90 = np.percentile(sample, [5, 95])
ci90

Out[18]:

array([3.82, 4.69])

To report the estimated mean and its standard error, you could write 4.23 ± 0.27. To report the estimated mean and its confidence interval, you could write 4.23 (CI90: 3.82, 4.69).

The standard error and confidence interval contain pretty much the same information, so I think it’s better to report one or the other, not both.
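For the mean, there is also a well-known analytic shortcut: the standard error is approximately $s / \sqrt{n}$, where $s$ is the sample standard deviation. The sketch below compares that formula with a bootstrap estimate on a freshly generated dataset. It uses its own random generator and seed, so the numbers will not match the outputs above exactly.

```python
import numpy as np
from scipy.stats import nbinom

rng = np.random.default_rng(17)  # arbitrary seed for this sketch

# Count data from the same kind of distribution used above
data = nbinom.rvs(4, 0.5, size=100, random_state=rng)

# Analytic standard error of the mean
se_analytic = np.std(data, ddof=1) / np.sqrt(len(data))

# Bootstrap standard error of the mean
boot_means = [np.mean(rng.choice(data, size=len(data), replace=True))
              for _ in range(1000)]
se_boot = np.std(boot_means)

print(se_analytic, se_boot)
```

The two estimates should be close. The advantage of the bootstrap is that it works the same way for statistics that have no simple formula.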

Hypothesis testing

Now let’s get to the last part of the question, whether it’s OK to run a t-test with count data. In general, a t-test works well if the variance of the data is not too big and the sample size is not too small. But it’s not easy to say how big or how small.

So, rather than worrying about when a t-test is OK or not, I suggest using simulations. To demonstrate, let’s suppose we have count data from two groups.

In [19]:

np.random.seed(17)

In [20]:

n = 4
p = 0.5
data1 = nbinom.rvs(n, p, size=30)
m1 = np.mean(data1)
m1

Out[20]:

4.333333333333333

In [21]:

n = 4
p = 0.55
data2 = nbinom.rvs(n, p, size=30)
m2 = np.mean(data2)
m2

Out[21]:

3.433333333333333

In [22]:

diff = m1 - m2
diff

Out[22]:

0.8999999999999999

It looks like there is a difference in the means, but we might wonder if it could be due to chance. To answer that question, we can simulate a world where there is actually no difference between the groups. One way to do that is a permutation test, where we combine the groups, shuffle, then split them again, and compute the difference in means.

In [23]:

def simulate_two_groups(data1, data2):
    n, m = len(data1), len(data2)
    data = np.append(data1, data2)
    np.random.shuffle(data)
    group1 = data[:n]
    group2 = data[n:]
    return group1.mean() - group2.mean()

Each time we call this function, it computes a difference in means under the null hypothesis that there is actually no difference between the groups.

In [24]:

simulate_two_groups(data1, data2)

Out[24]:

-0.4333333333333331

And if we run it 1000 times, we get a sample from the distribution of differences under the null hypothesis.

In [25]:

sample = [simulate_two_groups(data1, data2) for i in range(1000)]

Here’s what that distribution looks like:

In [26]:

sns.kdeplot(sample)
decorate(xlabel='Difference in means',
         ylabel='Density',
         title='Distribution of the difference under the null hypothesis')

The mean of this distribution is close to 0, as we expect if the two groups are actually the same.

In [27]:

np.mean(sample)

Out[27]:

-0.0017333333333333326

Now we can ask — under the null hypothesis, what is the probability that we would see a difference as big as the one we saw (in either direction)?

In [28]:

p_value = np.mean(np.abs(sample) > diff)
p_value

Out[28]:

0.136

The answer is about 14%, which means that if there is actually no difference between the groups, it would not be surprising to see a difference as big as diff by chance. We can conclude that this dataset does not provide strong evidence that there is a substantial difference between the groups.

In the dataset I generated, the sample size in each group is 30. In the dataset the OP asked about, the sample size in each group is only 3. With such a small sample, the permutation test might not work well. But with such a small sample, there is not much point in running a hypothesis test of any kind. I think it would be better to consider the experiment exploratory and report descriptive statistics only.
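To see how little a test can say with three observations per group, we can enumerate every possible way to split six values into two groups of three. There are only 20 of them, so even when the groups are completely separated, the smallest possible two-sided p-value is 2/20 = 0.1. The counts in this sketch are invented for illustration; they are not the OP's data.

```python
from itertools import combinations

import numpy as np

# Hypothetical egg counts, chosen so the groups do not overlap at all
group1 = np.array([12, 15, 14])
group2 = np.array([5, 7, 6])

pooled = np.concatenate([group1, group2])
observed = group1.mean() - group2.mean()

# Enumerate every split of the six values into two groups of three
diffs = []
for idx in combinations(range(len(pooled)), len(group1)):
    mask = np.zeros(len(pooled), dtype=bool)
    mask[list(idx)] = True
    diffs.append(pooled[mask].mean() - pooled[~mask].mean())
diffs = np.array(diffs)

# Exact two-sided p-value (small tolerance guards against rounding)
p_value = np.mean(np.abs(diffs) >= abs(observed) - 1e-12)

print(len(diffs), p_value)
```

So with this design, no outcome can reach the conventional 5% threshold in a two-sided permutation test, no matter how large the difference is.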

Data Q&A

April 9, 2024 AllenDowney

Today I’m starting a new project with the working title Data Q&A: Answering the real questions with Python. In each installment, I’ll take a question from Reddit’s statistics forum and answer it, using Python code to demonstrate. The first installment is a question about the harmonic mean, which is a recurring topic of discussion on Reddit. It’s in a Jupyter notebook — see the link below to run it in Colab.

Bootstrapping the harmonic mean

Here’s a question from the Reddit statistics forum.

Can you calculate a standard error for harmonic mean?

I’m trying to use harmonic mean instead of arithmetic because it might describe my data better (there are a few extreme outliers). My question is, I’ve read that you can’t calculate a standard deviation for the harmonic mean, so does that mean I can’t calculate a standard error then? That feels wrong.

The immediate question is how to compute the standard error of a harmonic mean. But there are two more questions here that I think are worth addressing:

Is the harmonic mean a good choice for summarizing this dataset?
Is it true that “You can’t calculate a standard deviation for the harmonic mean”?

Let’s answer the immediate question first, using bootstrap resampling. If you are not familiar with bootstrapping, there’s an introduction in Chapter 12 of Elements of Data Science.

Click here to run this notebook on Colab

I’ll download a utilities module with some of my frequently-used functions, and then import the usual libraries.

In [1]:

from os.path import basename, exists

def download(url):
    filename = basename(url)
    if not exists(filename):
        from urllib.request import urlretrieve

        local, _ = urlretrieve(url, filename)
        print("Downloaded " + str(local))
    return filename

download('https://github.com/AllenDowney/DataQnA/raw/main/nb/utils.py')

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

from utils import decorate

Generating a dataset

To demonstrate the behavior of the harmonic mean, I’ll generate two datasets, one from a Gaussian, and another with the same data plus a few outliers.

The first dataset contains n=100 values from a Gaussian distribution with mean 10 and standard deviation 1.

In [2]:

mu = 10
sigma = 1

np.random.seed(1)
data = np.random.normal(mu, sigma, size=100)

The (arithmetic) mean is close to 10, as expected.

In [3]:

np.mean(data)

Out[3]:

10.060582852075699

And the harmonic mean is not much different.

In [4]:

from scipy.stats import hmean

hmean(data)

Out[4]:

9.981387932002434

OP says the dataset contains “a few extreme outliers”. Without context, it’s hard to say how extreme they are. But in a Gaussian distribution, values 4-6 standard deviations from the mean might be considered extreme. So I’ll add in one of each.

In [5]:

data2 = np.concatenate([data, [14, 15, 16]])

Here’s what the distribution of the data looks like with the outliers.

In [6]:

sns.kdeplot(data2)
decorate(xlabel='Measurements',
         ylabel='Density',
         title='Distribution of measurements')

The outliers increase the mean of the data substantially.

In [7]:

np.mean(data2)

Out[7]:

10.20444937094728

They have a smaller effect on the harmonic mean — which seems like it’s the primary reason OP is using it.

In [8]:

hmean(data2)

Out[8]:

10.079025419217924

Bootstrapping

Now let’s compute the standard error of the harmonic mean using bootstrap resampling. The following function takes a dataset as a parameter, draws a random sample from it with replacement, and returns the harmonic mean of the resampled data.

In [9]:

def bootstrap_hmean(data):
    resampled = np.random.choice(data, size=len(data), replace=True)
    return hmean(resampled)

If we call this function 1000 times, we get a sample from the sampling distribution of the harmonic mean.

In [10]:

sample = [bootstrap_hmean(data2) for i in range(1000)]

Here’s what the sampling distribution looks like.

In [11]:

sns.kdeplot(sample)
decorate(xlabel='Harmonic mean',
         ylabel='Density',
         title='Sampling distribution of the harmonic mean')

The standard deviation of these values is an estimate of the standard error.

In [12]:

se = np.std(sample)
se

Out[12]:

0.11120255874179302

As an alternative to the standard error, we can estimate the 90% confidence interval by computing the 5th and 95th percentiles of the sample.

In [13]:

ci90 = np.percentile(sample, [5, 95])
ci90

Out[13]:

array([ 9.90600311, 10.27715017])

So that’s the answer to the immediate question — there’s nothing unusually difficult about computing the standard error of a harmonic mean.
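If you would rather not write the resampling loop yourself, SciPy provides `scipy.stats.bootstrap`, which does the same kind of computation. Here is a minimal sketch, assuming a reasonably recent version of SciPy. The dataset is a stand-in generated with a different random generator, so the results will not match the outputs above, and because I don't pass a seed to `bootstrap`, they will vary a little from run to run.

```python
import numpy as np
from scipy.stats import bootstrap, hmean

# Stand-in dataset: Gaussian values plus three high outliers
rng = np.random.default_rng(1)
data = rng.normal(10, 1, size=100)
data_with_outliers = np.concatenate([data, [14, 15, 16]])

# Percentile bootstrap for the harmonic mean; the data must be
# passed as a sequence of samples, hence the one-element tuple
res = bootstrap((data_with_outliers,), hmean,
                n_resamples=1000,
                vectorized=False,
                confidence_level=0.90,
                method='percentile')

se = res.standard_error
low, high = res.confidence_interval

print(se, (low, high))
```

I chose `method='percentile'` because it corresponds to the approach above, taking the 5th and 95th percentiles of the resampled statistics.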

Now let’s get to the other questions. First, what did OP mean by “I’ve read that you can’t calculate a standard deviation for the harmonic mean”? I’m not sure, but one possibility is that they read something else: that there is no mathematical formula for the standard error of the harmonic mean, as there is for the arithmetic mean. That’s true, which is why bootstrapping is particularly useful.

Why harmonic?¶

The other outstanding question is whether the harmonic mean is the best choice for summarizing this data. Without more context, I can’t say, but I can offer some general advice.

The harmonic mean is often a good choice when the quantities in the dataset are rates. As an example, suppose you drive to the store at 20 mph and then return on the same route at 30 mph. If you want to know your average speed for the round trip, you might be tempted to compute the arithmetic mean.

In [14]:

np.mean([20, 30])

Out[14]:

25.0

But that’s not right. To see why, suppose the store is 60 miles away, so it takes 3 hours on the way there and 2 hours on the way back. That’s 120 miles in 5 hours, which is 24 mph on average. The harmonic mean is the equivalent of this calculation, and produces the same answer (within floating-point error).

In [15]:

hmean([20, 30])

Out[15]:

23.999999999999996

So that’s a case where the harmonic mean naturally computes the quantity we’re interested in. More generally, the harmonic mean might be a good choice when the quantities are rates or ratios, but in my opinion this advice is often stated too strongly. It depends on the context, and on what question you are trying to answer.

For example, suppose you drive at 20 mph for an hour, and then 30 mph for an hour, and you want to know your average speed for the whole trip. In this case, the arithmetic mean is correct — you traveled 50 miles in 2 hours, so the average is 25 mph.
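
Both trips can be checked with the same calculation: average speed is total distance divided by total time. Here's a short sketch using only the standard library; the helper `average_speed` is a made-up name for illustration.

```python
from statistics import harmonic_mean, mean

def average_speed(legs):
    """legs: list of (distance in miles, time in hours) pairs."""
    total_distance = sum(d for d, _ in legs)
    total_time = sum(t for _, t in legs)
    return total_distance / total_time

# Equal distances: 60 miles at 20 mph (3 hours), then 60 miles at 30 mph (2 hours)
equal_distance = average_speed([(60, 3), (60, 2)])

# Equal times: 1 hour at 20 mph (20 miles), then 1 hour at 30 mph (30 miles)
equal_time = average_speed([(20, 1), (30, 1)])

print(equal_distance, harmonic_mean([20, 30]))  # these two should agree
print(equal_time, mean([20, 30]))               # and so should these
```

When each speed covers the same distance, the harmonic mean matches; when each speed lasts the same time, the arithmetic mean matches.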

If OP is using the harmonic mean only because the dataset has a few outliers, it might not be the best choice. The harmonic mean is relatively robust to outliers if they are above the mean, but it is relatively sensitive to outliers below the mean.

As an example, here’s another dataset with outliers 4-6 standard deviations below the mean.

In [16]:

data3 = np.concatenate([data, [6, 5, 4]])

As expected, the arithmetic mean gets dragged down by these outliers.

In [17]:

np.mean(data3)

Out[17]:

9.913187235024951

But the harmonic mean gets dragged down even more!

In [18]:

hmean(data3)

Out[18]:

9.684716917792475

So outliers alone are not a strong reason to use the harmonic mean.
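
The asymmetry follows from the definition: the harmonic mean is the reciprocal of the mean of the reciprocals, and a small value has a large reciprocal. Here's a small sketch with made-up numbers (not OP's data): twenty values of 10, plus one outlier that is 5 above or 5 below.

```python
from statistics import harmonic_mean, mean

base = [10] * 20      # hypothetical measurements, all equal to 10
high = base + [15]    # one outlier 5 above
low = base + [5]      # one outlier 5 below

# How far each mean moves away from 10 in each case
shifts = {
    name: (mean(x) - 10, harmonic_mean(x) - 10)
    for name, x in [("high", high), ("low", low)]
}

for name, (arith, harm) in shifts.items():
    print(f"{name} outlier: arithmetic {arith:+.3f}, harmonic {harm:+.3f}")
```

The high outlier moves the harmonic mean less than the arithmetic mean, and the low outlier moves it more.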

If there is reason to think the outliers are genuine errors, like bad measurements, the best choice might be a trimmed mean. In this example, trimming 10% from each end of the sorted data works well for all three datasets: the one with no outliers, the one with outliers above the mean, and the one with outliers below the mean.

In [19]:

from scipy.stats import trim_mean

trim_mean(data, proportiontocut=0.1)

Out[19]:

10.054689104329984

In [20]:

trim_mean(data2, proportiontocut=0.1)

Out[20]:

10.099867738778054

In [21]:

trim_mean(data3, proportiontocut=0.1)

Out[21]:

10.01327288288814

In all three cases, the trimmed mean is close to the mean of the original dataset, before the outliers were added.
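
To see what trimming does, here's a minimal pure-Python sketch. It follows the behavior described in the SciPy documentation for `trim_mean` (sort, drop a fraction `proportiontocut` from each end, average the rest), but `trimmed_mean` and the sample values are made up for illustration.

```python
def trimmed_mean(values, proportiontocut):
    """Mean after dropping a fraction of the sorted values from each end."""
    ordered = sorted(values)
    k = int(proportiontocut * len(ordered))  # number dropped from each end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

# Hypothetical data: mostly near 10, with one low and one high outlier
values = [1, 9, 10, 10, 10, 10, 10, 10, 11, 100]

print(sum(values) / len(values))   # plain mean, pulled up by the 100
print(trimmed_mean(values, 0.1))   # drops the 1 and the 100
```

With ten values and `proportiontocut=0.1`, one value is dropped from each end, so both outliers are removed before averaging.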
