What does “strength” mean?

April 21, 2024 AllenDowney

Here’s another installment in Data Q&A: Answering the real questions with Python. Previous installments are available from the Data Q&A landing page.


What does “strength” mean?¶

Here’s a question from the Reddit statistics forum.

I am currently doing a uni assignment and one of my tasks is analysing the correlation between two variables. When I use the correlation function in Excel, it returns a correlation of -0.0377. When I use the same data to create a scatter plot, the trend line is positive. I need to identify the correlation strength and direction and thereby, I am confused by these opposing outcomes. Can somebody please explain why the correlation is showing as negative but the trend line is positive? What does this indicate in terms of the strength and direction of the relationship between the two variables?

To answer the immediate question, correlation and the slope of a linear regression line always have the same sign. Mathematically, they are both related to the dot product of the x and y variables.
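As a quick check of that claim, here is a minimal sketch using made-up data (the seed, sample size, and coefficients are arbitrary assumptions, not OP's data). It relies on the standard identity that the least-squares slope equals the correlation times the ratio of the standard deviations, which are both positive, so the two must share a sign.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for illustration only
x = rng.normal(size=50)
y = -0.1 * x + rng.normal(size=50)  # weak relationship buried in noise

r = np.corrcoef(x, y)[0, 1]
slope, intercept = np.polyfit(x, y, 1)  # ordinary least-squares line

# slope = r * std(y) / std(x), and both standard deviations are positive,
# so the slope and the correlation cannot have opposite signs
slope_from_r = r * np.std(y) / np.std(x)
print(r, slope, slope_from_r)
```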

So there is something strange going on. It might be a simple error — for example, maybe the correlation and regression were based on different data. Or it might be that the trend computed by Excel is something other than linear regression. For example, a line that minimizes mean absolute error (MAE) rather than mean squared error (MSE) can have a slope with the opposite sign of the correlation.

Without more information it’s hard to be sure what’s going on, but for this example it might not matter. The computed correlation is negative but very small. If we fit a line (other than a regression line) to the same data and the slope is positive but similarly small, that is not necessarily inconsistent. Within statistical uncertainty, both are indistinguishable from zero.

OP also asks, “What does this indicate in terms of the strength and direction of the relationship between the two variables?” So let’s answer that question, too.

Click here to run this notebook on Colab.

I’ll download a utilities module with some of my frequently-used functions, and then import the usual libraries.

In [1]:

from os.path import basename, exists

def download(url):
    filename = basename(url)
    if not exists(filename):
        from urllib.request import urlretrieve

        local, _ = urlretrieve(url, filename)
        print("Downloaded " + str(local))
    return filename

download('https://github.com/AllenDowney/DataQnA/raw/main/nb/utils.py')

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

from utils import decorate

Interpreting correlation and slope¶

When people talk about the strength of a relationship, they might mean correlation or they might mean the slope of a fitted line. But these measures of “strength” are not always consistent.

For example, suppose we are concerned about the health effects of weight gain, so we plot weight versus age from 20 to 50 years old. I’ll generate two fake datasets to demonstrate the point.

In [2]:

np.random.seed(18)
xs1 = np.linspace(20, 50)
ys1 = 75 + 0.02 * xs1 + np.random.normal(0, 0.15, len(xs1))

In [3]:

np.random.seed(18)
xs2 = np.linspace(20, 50)
ys2 = 65 + 0.2 * xs2 + np.random.normal(0, 3, len(xs2))

I used the same random seed to generate both, so they look similar, as we can see in these scatter plots.

In [4]:

from utils import underride

def text(x, y, string, **options):
    """Plot text using axis coordinates.
    """
    transform = plt.gca().transAxes
    options = underride(options, transform=transform, ha='left', va='top')
    plt.text(x, y, string, **options)

In [5]:

plt.plot(xs1, ys1, 'o', alpha=0.5)
text(0.05, 0.9, 'Fake dataset A')
decorate(xlabel='Age in years',
         ylabel='Weight in kg')


In [6]:

plt.plot(xs2, ys2, 'o', alpha=0.5)
text(0.05, 0.9, 'Fake dataset B')
decorate(xlabel='Age in years',
         ylabel='Weight in kg')

Nevertheless, they have substantially different correlations.

In [7]:

rho1 = np.corrcoef(xs1, ys1)[0][1]
rho1

Out[7]:

0.7579660563439401

In [8]:

rho2 = np.corrcoef(xs2, ys2)[0][1]
rho2

Out[8]:

0.4782776976576317

In the first dataset, the correlation is close to 0.75. In the second, it is close to 0.5. So we might think the first relationship is stronger.

But let’s look at the slopes of the regression lines. For the first dataset, the estimated slope is about 0.019 kilograms per year or about 0.56 kilograms over the 30-year range.

In [9]:

from scipy.stats import linregress

res1 = linregress(xs1, ys1)
res1.slope, res1.slope * 30

Out[9]:

(0.018821034903244386, 0.5646310470973316)

For the second dataset, the estimated slope is almost 10 times higher — about 0.18 kilograms per year or 5.3 kilograms per 30 years.

In [10]:

res2 = linregress(xs2, ys2)
res2.slope, res2.slope * 30

Out[10]:

(0.17642069806488855, 5.292620941946657)

According to the correlations, the first relationship is stronger. According to the slopes, the second relationship is stronger. So which is it? The answer depends on context.

In this example, the slope of the regression line indicates the magnitude of weight gain. If we are concerned about the health effects of weight gain, the second relationship is probably more important.

On the other hand, correlation indicates how well we can predict one value based on the other. If, for some reason, we are trying to guess someone’s weight, based on their age, the first relationship would be more important.
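One way to make the "how well can we predict" reading concrete is the fraction of the spread in y that remains after fitting the line, which for a least-squares fit equals sqrt(1 - r^2). The sketch below regenerates two datasets in the spirit of the ones above; it uses NumPy's newer random generator, so the numbers will not match the outputs shown earlier, and the helper name `slope_and_unexplained` is hypothetical.

```python
import numpy as np

def slope_and_unexplained(xs, ys):
    """Return correlation, slope, and residual spread relative to std(ys)."""
    r = np.corrcoef(xs, ys)[0, 1]
    slope, intercept = np.polyfit(xs, ys, 1)
    residuals = ys - (intercept + slope * xs)
    rmse = np.sqrt(np.mean(residuals**2))
    return r, slope, rmse / np.std(ys)

rng = np.random.default_rng(18)  # not the same stream as np.random.seed(18)
xs = np.linspace(20, 50)
noise = rng.normal(0, 1, len(xs))
ys_a = 75 + 0.02 * xs + 0.15 * noise  # small slope, little noise
ys_b = 65 + 0.2 * xs + 3 * noise      # larger slope, much more noise

r_a, slope_a, frac_a = slope_and_unexplained(xs, ys_a)
r_b, slope_b, frac_b = slope_and_unexplained(xs, ys_b)
print(r_a, slope_a, frac_a)
print(r_b, slope_b, frac_b)
```

The third number in each row is the part of the variation the line does not explain; it depends only on the correlation, not on the slope.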

Here are all the results in the same plot.

In [11]:

def make_plot(xs, ys, title):
    """Make a scatter plot with fitted line.
    """
    res = linregress(xs, ys)
    plt.plot(xs, ys, 'o', alpha=0.5)

    fx = np.array([xs.min(), xs.max()])
    fy = res.intercept + res.slope * fx
    plt.plot(fx, fy, '-')

    text(0.05, 0.9, title)
    text(0.05, 0.82, f'correlation = {res.rvalue:0.2f}')
    text(0.05, 0.74, f'slope = {res.slope:0.3f} kg/yr')
    decorate(xlabel='Age in years',
             ylabel='Weight in kg')

In [12]:

plt.figure(figsize=(6, 7))

plt.subplot(2, 1, 1)
make_plot(xs1, ys1, 'Fake dataset A')

plt.subplot(2, 1, 2)
make_plot(xs2, ys2, 'Fake dataset B')

Because of the way the plots are scaled, the slope looks smaller in the second figure, but that’s misleading. So this example is a reminder to look at the labels of the y axis — which is where the effect size often hides.

Minimizing MAE¶

Earlier I said a line that minimizes mean absolute error (MAE) rather than mean squared error (MSE) can have a slope with the opposite sign of the correlation. To demonstrate, I’ll use the following function to minimize MAE.

In [13]:

from scipy.optimize import minimize

def error_func(params, xs, ys):
    intercept, slope = params
    y_pred = intercept + slope * xs
    return np.mean(np.abs(y_pred - ys))

def minimize_mae(xs, ys):
    param0 = [0, 0]
    result = minimize(error_func, param0, args=(xs, ys), method='Nelder-Mead')
    assert result.success
    
    return result.x

Now I’ll generate a dataset where xs and ys are actually uncorrelated.

In [14]:

n = 100

np.random.seed(20)
xs = np.random.normal(0, 1, n)
ys = np.random.normal(0, 1, n)

In this dataset, the correlation is slightly negative and the slope of the fitted line is slightly positive.

In [15]:

corr = np.corrcoef(xs, ys)[0, 1]
intercept, slope = minimize_mae(xs, ys)

corr, slope

Out[15]:

(-0.08198650127894906, 0.04675271007547886)

Here’s what the scatter plot looks like with the minimum MAE line.

In [16]:

fxs = np.array([np.min(xs), np.max(xs)])
fys = intercept + slope * fxs

In [17]:

plt.plot(xs, ys, '.')
plt.plot(fxs, fys)
decorate()

To find this example, I generated datasets with different random number seeds. Out of the first 100 attempts, 19 yield correlation and slope with opposite signs.

In [18]:

count = 0
for i in range(100):
    np.random.seed(i)
    xs = np.random.normal(0, 1, n)
    ys = np.random.normal(0, 1, n)
    corr = np.corrcoef(xs, ys)[0, 1]
    intercept, slope = minimize_mae(xs, ys)
    if corr * slope < 0:
        count += 1
count

Out[18]:

19

So examples like this are not rare, if the actual correlation is close to zero.

Data Q&A: Answering the real questions with Python

License: Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International


What does a confidence interval mean?

April 17, 2024 AllenDowney

Here’s another installment in Data Q&A: Answering the real questions with Python. In general, I will try to focus on practical problems, but this one is a little more philosophical.


What does a confidence interval mean?¶

Here’s a question from the Reddit statistics forum (with an edit for clarity):

Why does a confidence interval not tell you that 90% of the time, [the true value of the population parameter] will be in the interval, or something along those lines?

I understand that the interpretation of confidence intervals is that with repeated samples from the population, 90% of the time the interval would contain the true value of whatever it is you’re estimating. What I don’t understand is why this method doesn’t really tell you anything about what that parameter value is.

This is, to put it mildly, a common source of confusion. And here is one of the responses:

From a frequentist perspective, the true value of the parameter is fixed. Thus, once you have calculated your confidence interval, one of two things is true: either the true parameter value is inside the interval, or it is outside it. So the probability that the interval contains the true value is either 0 or 1, but you can never know which.

This response is the conventional answer to this question — it is what you find in most textbooks and what is taught in most classes. And, in my opinion, it is wrong. To explain why, I’ll start with a story.

Suppose Frank and Betsy visit a factory where 90% of the widgets are good and 10% are defective. Frank chooses a part at random and asks Betsy, “What is the probability that this part is good?”

Betsy says, “If 90% of the parts are good, and you choose one at random, the probability is 90% that it is good.”

“Wrong!” says Frank. “Since the part has already been manufactured, one of two things must be true: either it is good or it is defective. So the probability is either 100% or 0%, but we don’t know which.”

Frank’s argument is based on a strict interpretation of frequentism, which is a particular philosophy of probability. But it is not the only interpretation, and it is not a particularly good one. In fact, it suffers from several flaws. This example shows one of them — in many real-world scenarios where it would be meaningful and useful to assign a probability to a proposition, frequentism simply refuses to do so.

Fortunately, Betsy is under no obligation to adopt Frank’s interpretation of probability. She is free to adopt any of several alternatives that are consistent with her commonsense claim that a randomly-chosen part has a 90% probability of being functional.

Now let’s see how this story relates to confidence intervals.

Click here to run this notebook on Colab

I’ll start by importing the usual libraries.

In [1]:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

Generating a confidence interval¶

Suppose that Frank is a statistics teacher and Betsy is one of his students. One day Frank teaches the class a process for computing confidence intervals that goes like this:

Collect a sample of size $n$.
Compute the sample mean, $m$, and the sample standard deviation, $s$.
If those estimates are correct, the sampling distribution of the mean is a normal distribution with mean $m$ and standard deviation $s / \sqrt{n}$.
Compute the 5th and 95th percentiles of this sampling distribution. The result is a 90% confidence interval.

As an example, Frank generates a sample with size 100 from a normal distribution with known parameters mean $\mu=10$ and standard deviation $\sigma=3$.

In [2]:

from scipy.stats import norm

mu = 10
sigma = 3

np.random.seed(17)
data = norm.rvs(mu, sigma, size=100)

Then Betsy uses the following function to compute a 90% CI.

In [3]:

def compute_ci(data):
    n = len(data)
    m = np.mean(data)
    s = np.std(data)
    sampling_dist = norm(m, s / np.sqrt(n))
    ci90 = sampling_dist.ppf([0.05, 0.95])
    return ci90

In [4]:

ci90 = compute_ci(data)
ci90

Out[4]:

array([ 9.78147291, 10.88758585])

In this example, we know that the actual population mean is 10 so we can see that this CI contains the population mean. But if we draw another sample, we might get a sample mean that is substantially higher or lower than $\mu$, and the CI we compute might not contain $\mu$.

To see how often that happens, we’ll use this function, which generates a sample, computes a 90% CI, and checks whether the CI contains $\mu$.

In [5]:

def run_experiment(mu, sigma):
    data = norm.rvs(mu, sigma, size=100)
    low, high = compute_ci(data)
    return low < mu < high

If we run this function 1000 times, we can count how often the CI contains $\mu$.

In [6]:

np.mean([run_experiment(mu, sigma) for i in range(1000)]) * 100

Out[6]:

90.60000000000001

The answer is close to 90% — that is, if we run this process many times, 90% of the CIs it generates contain $\mu$ and 10% don’t. So the CI-computing process is like a factory where 90% of the widgets are good and 10% are defective.

Now suppose Frank chooses a different value of $\mu$ and does not tell Betsy what it is. To simulate that scenario, I’ll choose a value from a random number generator with a specific seed.

In [7]:

np.random.seed(17)
unknown_mu = np.random.uniform(10, 20)

And just for good measure, I’ll generate a random value for $\sigma$, too.

In [8]:

unknown_sigma = np.random.uniform(2, 3)

Next Frank generates a sample from a normal distribution with those parameters, and gives the sample to Betsy.

In [9]:

data2 = norm.rvs(unknown_mu, unknown_sigma, size=100)

And Betsy uses the data to compute a CI.

In [10]:

compute_ci(data2)

Out[10]:

array([12.81278165, 13.73152148])

Now suppose Frank asks, “What is the probability that this CI contains the actual value of $\mu$ that I chose?”

Betsy says, “We have established that 90% of the CIs generated by this process contain $\mu$, so the probability that this CI contains $\mu$ is 90%.”

And of course Frank says “Wrong! Now that we have computed the CI, it is unknown whether it contains the true parameter, but it is not random. The probability that it contains $\mu$ is either 100% or 0%. We can’t say it has a 90% chance of containing $\mu$.”

Once again, Frank is asserting a particular interpretation of probability — one that has the regrettable property of rendering probability nearly useless. Fortunately, Betsy is under no obligation to join Frank’s cult.

Under most reasonable interpretations of probability, you can say that a specific 90% CI has a 90% chance of containing the true parameter. There is no real philosophical problem with that.

But there might be practical problems.

Practical problems¶

The process we use to construct a CI takes into account variability due to random sampling, but it does not take into account other problems, like measurement error and non-representative sampling. To see why that matters, let’s consider a more realistic example.

Suppose we want to estimate the average height of adult male residents of the United States. If we define terms like “height”, “adult”, “male”, and “resident of the United States” precisely enough, we have defined a population that has a true, unknown average height. If we collect a representative sample from the population and measure their heights, we can use the sample mean to estimate the population mean and compute a confidence interval.

To demonstrate, I’ll use data from the Behavioral Risk Factor Surveillance System (BRFSS). Here’s an extract I prepared for Elements of Data Science, based on BRFSS data from 2021.

In [11]:

from os.path import basename, exists

def download(url):
    filename = basename(url)
    if not exists(filename):
        from urllib.request import urlretrieve

        local, _ = urlretrieve(url, filename)
        print("Downloaded " + str(local))
    return filename

download('https://github.com/AllenDowney/ElementsOfDataScience/raw/v1/data/brfss_2021.hdf')

Out[11]:

'brfss_2021.hdf'

In [12]:

brfss = pd.read_hdf('brfss_2021.hdf', 'brfss')

It includes data from 203,760 male respondents.

In [13]:

male = brfss.query('_SEX == 1')
len(male)

Out[13]:

203760

For 193,701 of them, we have their self-reported height recorded in centimeters.

In [14]:

male['HTM4'].count()

Out[14]:

193701

We can use this data to compute a sample mean and 90% confidence interval.

In [15]:

m = male['HTM4'].mean()
ci90 = compute_ci(male['HTM4'])
m, ci90

Out[15]:

(178.14807357731763, array([178.11896943, 178.17717773]))

Because the sample size is so large, the confidence interval is quite small — its width is only 0.03% of the estimate.

In [16]:

np.diff(ci90) / m * 100

Out[16]:

array([0.03267411])

So there is very little variability in this estimate due to random sampling. That means the estimate is precise, but that doesn’t mean it’s accurate.

For one thing, the measurements in this dataset are self-reported. If people tend to round up — and they do — that would make the estimated mean too high.

For another thing, it is difficult to construct a representative sample of a population as large as the United States. The BRFSS is run by people who know what they are doing, but nothing is perfect — it is likely that some groups are systematically overrepresented or underrepresented. And that could make the estimated mean too high or too low.

Given that there is almost certainly some measurement error and some sampling bias, it is unlikely that the actual population mean falls in the very small confidence interval we computed.

And that’s true in general — when the sample size is large, variability due to random sampling is small, which means that other sources of error are likely to be bigger. So as sample size increases, the probability decreases that the CI contains the true value.
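To illustrate that last point, here is a small simulation sketch. All of the numbers are assumptions chosen for illustration: a true mean of 178 cm, a standard deviation of 8 cm, and a constant measurement bias of 0.5 cm. The function `coverage` is a hypothetical helper that builds a 90% CI the same way as `compute_ci`, using 1.645 as the approximate 95th percentile of the standard normal distribution.

```python
import numpy as np

def coverage(n, bias, mu=178.0, sigma=8.0, iters=400, seed=0):
    """Fraction of 90% CIs that contain the true mean when every
    measurement is shifted by a constant bias."""
    rng = np.random.default_rng(seed)
    data = rng.normal(mu + bias, sigma, size=(iters, n))
    m = data.mean(axis=1)
    se = data.std(axis=1) / np.sqrt(n)
    z = 1.645  # approximate 95th percentile of the standard normal
    return np.mean((m - z * se < mu) & (mu < m + z * se))

for n in [100, 1000, 10000]:
    print(n, coverage(n, bias=0.5))
```

With no bias, coverage stays near 90% at every sample size. With a fixed bias, the interval shrinks around the wrong value as `n` grows, so coverage falls.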

Summary¶

The way confidence intervals are taught in most statistics classes is based on the frequentist interpretation of probability. But you are not obligated to adopt that interpretation, and there are good reasons you should not.

Some people will say that confidence intervals are a frequentist method that is inextricable from the frequentist interpretation. I don’t think that’s true — there is nothing about the computation of a confidence interval that depends on the frequentist interpretation. So you are free to interpret the CI under any philosophy of probability you like.

If you want to say that a 90% CI has a 90% chance of containing the true value, there is nothing wrong with that, philosophically. I think it is a meaningful and useful probabilistic claim.

However, it is only true if other sources of error — like sampling bias and measurement error — are small compared to variability due to random sampling.

For that reason, I think the best interpretation of a confidence interval, for practical purposes, is that it quantifies the precision of the estimate but says nothing about its accuracy.

Credit: I borrowed Frank and Betsy from my friend Ted Bunn. They first appeared in his blog post Who knows what evil lurks in the hearts of men? The Bayesian doesn’t care..

Standard deviation of a count

April 13, 2024 AllenDowney

This post is part of a new project with the working title Data Q&A: Answering the real questions with Python. In each installment, I’ll take a question from Reddit’s statistics forum and answer it, using Python code to demonstrate. My answer is in a Jupyter notebook — see the link below to run it in Colab.


Is taking the SD of a count variable helpful?¶

Here’s a question from the Reddit statistics forum.

A student brought this up to me in class this week and I had no idea how to answer. For some context they are doing an experiment that involves a count variable and then have a choice of what inferential stat test they want to run. If they pick a t-test they need to show SD error bars on their graph but one student kept telling me it isn’t possible. I’ve spent time looking around forums and asked my PI who also shrugged since we’re chemistry people. I was just wondering if someone can explain if finding the SD of a count variable is possible, and if it is, does it tell you anything important or is it just a waste of time?

In a follow-up, OP provided more context:

To be specific students were counting the amount of eggs laid on two different types of substrate in petri dishes (something like wood and grass). They were then given the option to choose whatever inferential stat test they thought best fits the data and a number of them chose unpaired t-tests. They used [statistical software] to do the actual test (plugging in the count numbers from 2 variables with 3 replicates, so 6 numbers in total) .

It was my understanding that because the data probably doesn’t follow a normal distribution, SD error bars wouldn’t really be helpful.

There are several questions going on here, so let’s start with the headline — is taking the SD of a count variable helpful?

Click here to run this notebook on Colab

I’ll download a utilities module with some of my frequently-used functions, and then import the usual libraries.

In [1]:

from os.path import basename, exists

def download(url):
    filename = basename(url)
    if not exists(filename):
        from urllib.request import urlretrieve

        local, _ = urlretrieve(url, filename)
        print("Downloaded " + str(local))
    return filename

download('https://github.com/AllenDowney/DataQnA/raw/main/nb/utils.py')

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

from utils import decorate

In [2]:

# install the empiricaldist library, if necessary

try:
    import empiricaldist
except ImportError:
    !pip install empiricaldist

Generating a dataset¶

Count data are often well modeled by a negative binomial distribution, so that’s what I’ll use to generate a dataset.

In [3]:

from scipy.stats import nbinom

n = 4
p = 0.5

np.random.seed(17)
data = nbinom.rvs(n, p, size=100)

Here’s what the distribution of values looks like. I’m using a Pmf object, which shows all of the values, rather than a histogram, which puts the values into bins.

In [4]:

from empiricaldist import Pmf

pmf = Pmf.from_seq(data)
pmf.bar()

decorate(xlabel='Count data', ylabel='PMF')

We can compute the mean and standard deviation of the data in the usual way.

In [5]:

m = np.mean(data)
m

Out[5]:

4.23

In [6]:

s = np.std(data)
s

Out[6]:

2.7160817366198686

Now the question is whether reporting the standard deviation of this dataset is useful as a descriptive statistic. I think it’s OK — it quantifies the spread of the distribution, just as it’s intended to.

The only problem is that when you report a mean and standard deviation, people often picture a normal distribution, and in this example, that picture is misleading. To see why, I’ll plot the distribution of the data again, along with a normal distribution with the same mean and standard deviation.

In [7]:

from scipy.stats import norm

qs = np.linspace(m - 4*s, m+4*s)
ps = norm(m, s).pdf(qs)

In [8]:

pmf.bar()
plt.plot(qs, ps, color='C1')

decorate(xlabel='Count data', ylabel='PMF')

The normal distribution is symmetric, unlike the distribution of the data, which is skewed to the right. And the normal distribution extends to negative values, which count data cannot.
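To put a number on that second point, this sketch computes how much probability a normal model with the reported mean and standard deviation assigns to negative counts. It uses only the standard library; `normal_cdf` is a hypothetical helper built on the error function.

```python
from math import erf, sqrt

m = 4.23   # sample mean reported above
s = 2.716  # sample standard deviation reported above, rounded

def normal_cdf(x, mean, std):
    """CDF of a normal distribution, written in terms of erf."""
    return 0.5 * (1 + erf((x - mean) / (std * sqrt(2))))

# probability the normal model assigns to impossible negative counts
p_negative = normal_cdf(0, m, s)
print(f"{p_negative:.3f}")
```

The result is roughly 6%, which is not negligible for a quantity that can never be negative.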

So, rather than reporting mean and standard deviation, it might be better to report a median and interquartile range (IQR).

In [9]:

low, median, high = np.percentile(data, [25, 50, 75])
low, median, high

Out[9]:

(2.0, 4.0, 6.0)

In [10]:

iqr = high - low
median, iqr

Out[10]:

(4.0, 4.0)

A less common alternative would be to fit a negative binomial distribution to the data and report the estimated parameters. We can do that in this example by matching moments.

In [11]:

p_estimate = m / (s ** 2)
n_estimate = m * p_estimate / (1 - p_estimate)
n_estimate, p_estimate

Out[11]:

(5.685520002542022, 0.5733960499383226)
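The moment-matching step can be checked with a round trip. In the parameterization used by `scipy.stats.nbinom`, the mean is n(1-p)/p and the variance is n(1-p)/p^2, so dividing the mean by the variance gives p. The sketch below (both function names are hypothetical) converts parameters to moments and back; it also makes explicit that the method only works when the variance exceeds the mean.

```python
def nbinom_moments(n, p):
    """Mean and variance of a negative binomial with parameters n and p."""
    mean = n * (1 - p) / p
    var = n * (1 - p) / p**2
    return mean, var

def nbinom_from_moments(mean, var):
    """Invert the moment equations to recover n and p."""
    if var <= mean:
        raise ValueError("moment matching needs variance > mean")
    p = mean / var
    n = mean * p / (1 - p)
    return n, p

mean, var = nbinom_moments(4, 0.5)
print(mean, var)                       # 4.0 8.0
print(nbinom_from_moments(mean, var))  # recovers n=4, p=0.5
```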

With these parameters, we can compute the PMF of a negative binomial distribution, and see that it fits the data reasonably well.

In [12]:

qs = np.arange(np.max(data) + 3)
ps = nbinom(n_estimate, p_estimate).pmf(qs)

In [13]:

pmf.bar()
plt.plot(qs, ps, color='C1')

decorate(xlabel='Count data', ylabel='PMF')

So that’s my answer to the first question — there’s nothing wrong with reporting the standard deviation of a count variable, as long as we don’t assume that the distribution is normal.

What about those error bars?¶

OP also asked about showing “SD error bars”. I’m not sure what they mean, but I think they are confusing standard deviation and standard error.

Standard deviation is a descriptive statistic that quantifies the spread of a dataset.
Standard error is an inferential statistic that quantifies the precision of an estimate.

For example, suppose we use this sample to estimate the mean in the population. We might wonder how precise the estimate is. One way to answer that question is to ask how much the result varies if we run the experiment many times. We can answer that question by bootstrap resampling.

The following function takes a dataset as a parameter, draws a random sample from it with replacement, and returns the mean of the resampled data.

In [14]:

def bootstrap_mean(data):
    resampled = np.random.choice(data, size=len(data), replace=True)
    return np.mean(resampled)

If we call this function 1000 times, we get a sample from the sampling distribution of the mean.

In [15]:

sample = [bootstrap_mean(data) for i in range(1000)]

Here’s what the sampling distribution looks like.

In [16]:

sns.kdeplot(sample)
decorate(xlabel='Sample mean',
         ylabel='Density',
         title='Sampling distribution of the mean')

This distribution shows how precise our estimate of the mean is. If we run the experiment again, the mean could plausibly be as low as 3.5 or as high as 5.0.

There are two ways to summarize this distribution. First, the standard deviation of the sampling distribution is the standard error.

In [17]:

se = np.std(sample)
se

Out[17]:

0.26519081413201323

Second, the interval from the 5th to the 95th percentile is a 90% confidence interval.

In [18]:

ci90 = np.percentile(sample, [5, 95])
ci90

Out[18]:

array([3.82, 4.69])

To report the estimated mean and its standard error, you could write 4.23 ± 0.27. To report the estimated mean and its confidence interval, you could write 4.23 (CI90: 3.82, 4.69).

The standard error and confidence interval contain pretty much the same information, so I think it’s better to report one or the other, not both.
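As a cross-check on the bootstrap result, the standard error of a mean can also be approximated by the usual formula, the sample standard deviation divided by the square root of the sample size. The sketch below compares the two on freshly generated count data; it uses NumPy's `default_rng`, so the data are similar in kind to the dataset above but not identical, and 2000 resamples is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(17)
data = rng.negative_binomial(4, 0.5, size=100)  # illustrative count data

# bootstrap: resample with replacement and collect the means
boot_means = [
    rng.choice(data, size=len(data), replace=True).mean()
    for _ in range(2000)
]
se_bootstrap = np.std(boot_means)

# analytic approximation: s / sqrt(n)
se_formula = np.std(data) / np.sqrt(len(data))

print(se_bootstrap, se_formula)
```

For a simple statistic like the mean, the two estimates should agree closely. The bootstrap earns its keep for statistics that have no convenient formula.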

Hypothesis testing¶

Now let’s get to the last part of the question, whether it’s OK to run a t-test with count data. In general, a t-test works well if the variance of the data is not too big and the sample size is not too small. But it’s not easy to say how big or how small.

So, rather than worrying about when a t-test is OK or not, I suggest using simulations. To demonstrate, let’s suppose we have count data from two groups.

In [19]:

np.random.seed(17)

In [20]:

n = 4
p = 0.5
data1 = nbinom.rvs(n, p, size=30)
m1 = np.mean(data1)
m1

Out[20]:

4.333333333333333

In [21]:

n = 4
p = 0.55
data2 = nbinom.rvs(n, p, size=30)
m2 = np.mean(data2)
m2

Out[21]:

3.433333333333333

In [22]:

diff = m1 - m2
diff

Out[22]:

0.8999999999999999

It looks like there is a difference in the means, but we might wonder if it could be due to chance. To answer that question, we can simulate a world where there is actually no difference between the groups. One way to do that is a permutation test, where we combine the groups, shuffle, then split them again, and compute the difference in means.

In [23]:

def simulate_two_groups(data1, data2):
    n, m = len(data1), len(data2)
    data = np.append(data1, data2)
    np.random.shuffle(data)
    group1 = data[:n]
    group2 = data[n:]
    return group1.mean() - group2.mean()

Each time we call this function, it computes a difference in means under the null hypothesis that there is actually no difference between the groups.

In [24]:

simulate_two_groups(data1, data2)

Out[24]:

-0.4333333333333331

And if we run it 1000 times, we get a sample from the distribution of differences under the null hypothesis.

In [25]:

sample = [simulate_two_groups(data1, data2) for i in range(1000)]

Here’s what that distribution looks like:

In [26]:

sns.kdeplot(sample)
decorate(xlabel='Difference in means',
         ylabel='Density',
         title='Distribution of the difference under the null hypothesis')

The mean of this distribution is close to 0, as we expect if the two groups are actually the same.

In [27]:

np.mean(sample)

Out[27]:

-0.0017333333333333326

Now we can ask — under the null hypothesis, what is the probability that we would see a difference as big as the one we saw (in either direction)?

In [28]:

p_value = np.mean(np.abs(sample) > diff)
p_value

Out[28]:

0.136

The answer is about 14%, which means that if there is actually no difference between the groups, it would not be surprising to see a difference as big as diff by chance. We can conclude that this dataset does not provide strong evidence that there is a substantial difference between the groups.

In the dataset I generated, the sample size in each group is 30. In the dataset the OP asked about, the sample size in each group is only 3. With such a small sample, the permutation test might not work well. But with such a small sample, there is not much point in running a hypothesis test of any kind. I think it would be better to consider the experiment exploratory and report descriptive statistics only.
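One way to see why so small a sample is a problem: with three values per group there are only 20 ways to split six numbers into two groups of three, so the permutation test can be run exhaustively, and its two-sided p-value can never fall below 2/20. The sketch below enumerates every split for hypothetical egg counts (these are invented, not OP's data) in which the groups are completely separated.

```python
from itertools import combinations

# Hypothetical egg counts, three replicates per substrate (not OP's data)
group_a = [12, 15, 11]
group_b = [3, 5, 4]

pooled = group_a + group_b
total = sum(pooled)
# with equal group sizes, comparing sums is equivalent to comparing means
observed = abs(sum(group_a) - sum(group_b))

splits = list(combinations(range(len(pooled)), len(group_a)))

extreme = 0
for chosen in splits:
    first = sum(pooled[i] for i in chosen)
    second = total - first
    if abs(first - second) >= observed:
        extreme += 1

p_value = extreme / len(splits)
print(len(splits), extreme, p_value)  # 20 2 0.1
```

Even with the most extreme data possible, the p-value is 0.1, so a test at the conventional 5% level could never reject the null hypothesis.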

Data Q&A

April 9, 2024 AllenDowney

Today I’m starting a new project with the working title Data Q&A: Answering the real questions with Python. In each installment, I’ll take a question from Reddit’s statistics forum and answer it, using Python code to demonstrate. The first installment is a question about the harmonic mean, which is a recurring topic of discussion on Reddit. It’s in a Jupyter notebook — see the link below to run it in Colab.


Bootstrapping the harmonic mean¶

Here’s a question from the Reddit statistics forum.

Can you calculate a standard error for harmonic mean?

I’m trying to use harmonic mean instead of arithmetic because it might describe my data better (there are a few extreme outliers). My question is, I’ve read that you can’t calculate a standard deviation for the harmonic mean, so does that mean I can’t calculate a standard error then? That feels wrong.

The immediate question is how to compute the standard error of a harmonic mean. But there are two more questions here that I think are worth addressing:

Is the harmonic mean a good choice for summarizing this dataset?
Is it true that “You can’t calculate a standard deviation for the harmonic mean”?

Let’s answer the immediate question first, using bootstrap resampling. If you are not familiar with bootstrapping, there’s an introduction in Chapter 12 of Elements of Data Science.

Click here to run this notebook on Colab

I’ll download a utilities module with some of my frequently-used functions, and then import the usual libraries.

In [1]:

from os.path import basename, exists

def download(url):
    filename = basename(url)
    if not exists(filename):
        from urllib.request import urlretrieve

        local, _ = urlretrieve(url, filename)
        print("Downloaded " + str(local))
    return filename

download('https://github.com/AllenDowney/DataQnA/raw/main/nb/utils.py')

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

from utils import decorate

Generating a dataset¶

To demonstrate the behavior of the harmonic mean, I’ll generate two datasets, one from a Gaussian, and another with the same data plus a few outliers.

The first dataset contains n=100 values from a Gaussian distribution with mean 10 and standard deviation 1.

In [2]:

mu = 10
sigma = 1

np.random.seed(1)
data = np.random.normal(mu, sigma, size=100)

The (arithmetic) mean is close to 10, as expected.

In [3]:

np.mean(data)

Out[3]:

10.060582852075699

And the harmonic mean is not much different.

In [4]:

from scipy.stats import hmean

hmean(data)

Out[4]:

9.981387932002434

OP says the dataset contains “a few extreme outliers”. Without context, it’s hard to say how extreme they are. But in a Gaussian distribution, values 4-6 standard deviations from the mean might be considered extreme. So I’ll add in one of each.

In [5]:

data2 = np.concatenate([data, [14, 15, 16]])

Here’s what the distribution of the data looks like with the outliers.

In [6]:

sns.kdeplot(data2)
decorate(xlabel='Measurements',
         ylabel='Density',
         title='Distribution of measurements')

The outliers increase the mean of the data substantially.

In [7]:

np.mean(data2)

Out[7]:

10.20444937094728

They have a smaller effect on the harmonic mean — which seems like it’s the primary reason OP is using it.

In [8]:

hmean(data2)

Out[8]:

10.079025419217924

Bootstrapping¶

Now let’s compute the standard error of the harmonic mean using bootstrap resampling. The following function takes a dataset as a parameter, draws a random sample from it with replacement, and returns the harmonic mean of the resampled data.

In [9]:

def bootstrap_hmean(data):
    resampled = np.random.choice(data, size=len(data), replace=True)
    return hmean(resampled)

If we call this function 1000 times, we get a sample from the sampling distribution of the harmonic mean.

In [10]:

sample = [bootstrap_hmean(data2) for i in range(1000)]

Here’s what the sampling distribution looks like.

In [11]:

sns.kdeplot(sample)
decorate(xlabel='Harmonic mean',
         ylabel='Density',
         title='Sampling distribution of the harmonic mean')

The standard deviation of these values is an estimate of the standard error.

In [12]:

se = np.std(sample)
se

Out[12]:

0.11120255874179302

As an alternative to the standard error, we can estimate the 90% confidence interval by computing the 5th and 95th percentiles of the sample.

In [13]:

ci90 = np.percentile(sample, [5, 95])
ci90

Out[13]:

array([ 9.90600311, 10.27715017])

So that’s the answer to the immediate question — there’s nothing unusually difficult about computing the standard error of a harmonic mean.

Now let’s get to the other questions. First, what did OP mean by “I’ve read that you can’t calculate a standard deviation for the harmonic mean”? I’m not sure, but one possibility is that they read something else: that there is no mathematical formula for the standard error of the harmonic mean, as there is for the arithmetic mean. That’s true, which is why bootstrapping is particularly useful.
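One way to build confidence in the bootstrap estimate is to try it on the arithmetic mean, where the formula s/√n is available for comparison. The following is a minimal sketch, not part of the original notebook: the function name `bootstrap_se`, the seeds, and the regenerated dataset are assumptions made for illustration.

```python
import numpy as np
from scipy.stats import hmean

def bootstrap_se(data, statistic, iters=1000, seed=None):
    """Estimate the standard error of `statistic` by bootstrap resampling."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    n = len(data)
    # Each resample draws n values with replacement from the data
    stats = [statistic(rng.choice(data, size=n, replace=True))
             for _ in range(iters)]
    return np.std(stats)

# Illustrative data (an assumption): Gaussian values plus three high outliers
rng = np.random.default_rng(1)
values = np.concatenate([rng.normal(10, 1, size=100), [14, 15, 16]])

# For the arithmetic mean, the bootstrap estimate can be checked
# against the analytic formula
se_formula = np.std(values, ddof=1) / np.sqrt(len(values))
se_boot_mean = bootstrap_se(values, np.mean, seed=2)

# For the harmonic mean, the same function applies unchanged
se_boot_hmean = bootstrap_se(values, hmean, seed=3)

print(se_formula, se_boot_mean, se_boot_hmean)
```

If the two estimates for the arithmetic mean agree closely, that is some reassurance that the same procedure gives a reasonable answer for the harmonic mean, where there is no formula to check against.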

Why harmonic?¶

The other outstanding question is whether the harmonic mean is the best choice for summarizing this data. Without more context, I can’t say, but I can offer some general advice.

The harmonic mean is often a good choice when the quantities in the dataset are rates. As an example, suppose you drive to the store at 20 mph and then return on the same route at 30 mph. If you want to know your average speed for the round trip, you might be tempted to compute the arithmetic mean.

In [14]:

np.mean([20, 30])

Out[14]:

25.0

But that’s not right. To see why, suppose the store is 60 miles away, so it takes 3 hours on the way there and 2 hours on the way back. That’s 120 miles in 5 hours, which is 24 mph on average. The harmonic mean is the equivalent of this calculation, and produces the same answer (within floating-point error).

In [15]:

hmean([20, 30])

Out[15]:

23.999999999999996

So that’s a case where the harmonic mean naturally computes the quantity we’re interested in. More generally, the harmonic mean might be a good choice when the quantities are rates or ratios, but in my opinion this advice is often stated too strongly. It depends on the context, and on what question you are trying to answer.

For example, suppose you drive at 20 mph for an hour, and then 30 mph for an hour, and you want to know your average speed for the whole trip. In this case, the arithmetic mean is correct — you traveled 50 miles in 2 hours, so the average is 25 mph.
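Both driving scenarios can be checked directly from total distance and total time. This is a small sketch; the helper `average_speed` is a hypothetical name introduced here for illustration.

```python
def average_speed(segments):
    """Average speed from a list of (distance_miles, hours) segments."""
    total_distance = sum(d for d, _ in segments)
    total_time = sum(t for _, t in segments)
    return total_distance / total_time

# Equal distances: 60 miles at 20 mph (3 hours), 60 miles at 30 mph (2 hours)
equal_distance = average_speed([(60, 3), (60, 2)])

# Equal times: 1 hour at 20 mph (20 miles), 1 hour at 30 mph (30 miles)
equal_time = average_speed([(20, 1), (30, 1)])

# The two candidate summaries of the speeds 20 and 30
harmonic = 2 / (1 / 20 + 1 / 30)
arithmetic = (20 + 30) / 2

print(equal_distance, harmonic)    # these two agree, up to floating-point error
print(equal_time, arithmetic)      # and so do these two
```

When the segments cover equal distances, the slower segment takes more time and gets more weight, which is what the harmonic mean captures. When the segments take equal time, each speed gets equal weight, so the arithmetic mean applies.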

If OP is using the harmonic mean only because the dataset has a few outliers, it might not be the best choice. The harmonic mean is relatively robust to outliers if they are above the mean, but it is relatively sensitive to outliers below the mean.

As an example, here’s another dataset with outliers 4-6 standard deviations below the mean.

In [16]:

data3 = np.concatenate([data, [6, 5, 4]])

As expected, the arithmetic mean gets dragged down by these outliers.

In [17]:

np.mean(data3)

Out[17]:

9.913187235024951

But the harmonic mean gets dragged down even more!

In [18]:

hmean(data3)

Out[18]:

9.684716917792475

So outliers alone are not a strong reason to use the harmonic mean.
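The asymmetry follows from the definition: the harmonic mean is the reciprocal of the mean of the reciprocals, so small values, which have large reciprocals, carry more weight. Here is a minimal sketch using a simplified dataset (100 values exactly equal to 10, an assumption made so the effect of the outliers is easy to isolate); the function `harmonic_mean` is written out by hand for illustration.

```python
import numpy as np

def harmonic_mean(x):
    """Reciprocal of the arithmetic mean of the reciprocals."""
    x = np.asarray(x, dtype=float)
    return len(x) / np.sum(1 / x)

# Simplified baseline (an assumption): 100 values exactly equal to 10
base = np.full(100, 10.0)
high = np.concatenate([base, [14, 15, 16]])   # outliers above
low = np.concatenate([base, [6, 5, 4]])       # outliers below

# How far each summary moves away from the baseline value of 10
shift_hmean_high = harmonic_mean(high) - 10
shift_hmean_low = harmonic_mean(low) - 10
shift_mean_high = np.mean(high) - 10
shift_mean_low = np.mean(low) - 10

print(shift_mean_high, shift_hmean_high)
print(shift_mean_low, shift_hmean_low)
```

The arithmetic mean shifts by the same amount in both directions, while the harmonic mean shifts less for the high outliers and more for the low ones.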

If there is reason to think the outliers are genuine errors, like bad measurements, the best choice might be a trimmed mean. In this example, trimming 10% of the data works well for all the datasets: the one with no outliers, outliers above the mean, and outliers below the mean.

In [19]:

from scipy.stats import trim_mean

trim_mean(data, proportiontocut=0.1)

Out[19]:

10.054689104329984

In [20]:

trim_mean(data2, proportiontocut=0.1)

Out[20]:

10.099867738778054

In [21]:

trim_mean(data3, proportiontocut=0.1)

Out[21]:

10.01327288288814

In all three cases, the estimated mean is close to the actual mean of the dataset, before the addition of outliers.
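To make the trimming step concrete, here is a hand-written sketch of a trimmed mean: sort the values, drop a fraction from each end, and average what remains. As I understand it, this is roughly what `trim_mean` does, but the function below is an illustration with a hypothetical name, not SciPy’s implementation.

```python
import numpy as np

def trimmed_mean(x, proportiontocut):
    """Mean after dropping a fraction of the values from each end."""
    x = np.sort(np.asarray(x, dtype=float))
    # Number of values to drop from each end, rounded down
    k = int(proportiontocut * len(x))
    return x[k:len(x) - k].mean()

# Ten made-up values with one low and one high outlier
example = [0, 9, 10, 10, 10, 10, 10, 10, 11, 50]

# Cutting 10% from each end drops the 0 and the 50
result = trimmed_mean(example, 0.1)
print(result)   # 10.0
```

Note that the proportion is cut from both ends, so trimming 10% removes 20% of the data in total.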

Think Python Goes to Production

March 8, 2024 AllenDowney

Think Python has moved into production, on schedule for the official publication date in July — but maybe earlier if things go well.

To celebrate, I have posted the next batch of chapters on the new site, up through Chapter 12, which is about Markov text analysis and generation, one of my favorite examples in the book. From there, you can follow links to run the notebooks on Colab.

And we have a cover!

The new animal is a ringneck parrot, I’ve been told. I will miss the Carolina parakeet that was on the old cover, which was particularly apt because it is an ex-parrot. Nevertheless, I think the new cover looks great!

Huge thanks to Sam Lau and Luciano Ramalho for their technical reviews. Both made many helpful corrections and suggestions that improved the book. Sam is an expert on learning to program with AI assistants. And Luciano was inspired by the turtles to make an improved module for turtle graphics in Jupyter, called jupyturtle. Here’s an example of what it looks like (from Chapter 5):

If you have a chance to check out the current draft, and you have any corrections or suggestions, please create an issue on GitHub.

And if you would like a copy of the book as soon as possible, you can read the Early Release version and order from O’Reilly here or pre-order the third edition from Amazon.

The Gender Gap in Political Beliefs Is Small

February 18, 2024 AllenDowney

In previous articles (here, here, and here) I’ve looked at evidence of a gender gap in political alignment (liberal or conservative), party affiliation (Democrat or Republican), and policy preferences.

Using data from the GSS, I found that women are more likely to say they are liberal, and more likely to say they are Democrats, by 5-10 percentage points. But in their responses to 15 policy questions that most distinguish conservatives and liberals, men and women give similar answers.

In other words, the political gap is mostly in what people say about themselves, not in what they believe about specific policy questions.

Now let’s see if we get similar results with ANES data. As with the GSS, I looked for questions where liberals and conservatives give different answers. From those, I selected questions about specific policies, plus four questions related to moral foundations, with preference for questions asked over a long period of time. Here are the 16 topics that met these criteria:

For each question, I identified one or more responses that were more likely to be given by conservatives, which is what I’m calling “conservative responses”.

Not every respondent was asked every question, so I used a Bayesian method based on item response theory to fill missing values. You can get the details of the method here.

As in the GSS data, the average number of conservative responses has gone down over time.

Men give more conservative responses than women, on average, but the difference is only half a question, and the gap is not getting bigger.

Among people younger than 30, the gap is closer to 1 question, on average. And it is not growing.

In summary:

In the ANES, there is no evidence of a growing gender gap in political alignment, party affiliation, or policy preferences.
In both the GSS and the ANES the gap in policy preferences is small and not growing.

The details of this analysis are in this Jupyter notebook.

What about economics?

Many of the questions in the previous section are about social issues. On economic issues some of the patterns are different. Here are 15 questions I selected that are mostly about federal spending.

Unlike the social issues, which trend liberal over time, responses to these questions are almost unchanged.

In the general population, the gender gap is about 0.5 questions and not growing.

Among young adults, the gender gap is smaller, and not growing.

On a total of 30 questions where conservatives and liberals disagree, men and women provide similar responses.

Think Python third edition!

February 15, 2024 AllenDowney

I am happy to announce the third edition of Think Python, which will be published by O’Reilly Media later this year.

You can read the online version of the book here. I’ve posted the Preface and the first four chapters — more on the way soon!

You can read the Early Release and pre-order from O’Reilly, or pre-order the third edition on Amazon.

Here is an excerpt from the Preface that explains…

What’s new in the third edition?

The biggest changes in this edition were driven by two new technologies — Jupyter notebooks and virtual assistants.

Each chapter of this book is a Jupyter notebook, which is a document that contains both ordinary text and code. For me, that makes it easier to write the code, test it, and keep it consistent with the text. For readers, it means you can run the code, modify it, and work on the exercises, all in one place.

The other big change is that I’ve added advice for working with virtual assistants like ChatGPT and using them to accelerate your learning. When the previous edition of this book was published in 2016, the predecessors of these tools were far less useful and most people were unaware of them. Now they are a standard tool for software engineering, and I think they will be a transformational tool for learning to program — and learning a lot of other things, too.

The other changes in the book were motivated by my regrets about the second edition.

The first is that I did not emphasize software testing. That was already a regrettable omission in 2016, but with the advent of virtual assistants, automated testing has become even more important. So this edition presents Python’s most widely-used testing tools, doctest and unittest, and includes several exercises where you can practice working with them.

My other regret is that the exercises in the second edition were uneven — some were more interesting than others and some were too hard. Moving to Jupyter notebooks helped me develop and test a more engaging and effective sequence of exercises.

In this revision, the sequence of topics is almost the same, but I rearranged a few of the chapters and compressed two short chapters into one. Also, I expanded the coverage of strings to include regular expressions.

A few chapters use turtle graphics. In previous editions, I used Python’s turtle module, but unfortunately it doesn’t work in Jupyter notebooks. So I replaced it with a new turtle module that should be easier to use. Here’s what it looks like in the notebooks.

Finally, I rewrote a substantial fraction of the text, clarifying places that needed it and cutting back in places where I was not as concise as I could be.

I am very proud of this new edition — I hope you like it!

The Political Gender Gap is Not Growing

February 11, 2024 AllenDowney

In a previous article, I used data from the General Social Survey (GSS) to see if there is a growing gender gap among young people in political alignment, party affiliation, or political attitudes. So far, the answer is no.

Young women are more likely than men to say they are liberal by 5-10 percentage points. But there is little or no evidence that the gap is growing.
Young women are more likely to say they are Democrats. In the 1990s, the gap was almost 20 percentage points. Now it is only 5-10 percentage points. So there’s no evidence this gap is growing — if anything, it is shrinking.
To 15 questions related to policies and attitudes, young men give slightly more conservative responses than women, on average, but the gap is small and consistent over time — there is no evidence it is growing.

Ryan Burge has done a similar analysis with data from the Cooperative Election Study (CES). Looking at stated political alignment, he finds that young women are more likely to say they are liberal by 5-10 percentage points. But there is no evidence that the gap is growing.

That leaves one other long-running survey to consider, the American National Election Studies (ANES). I have been meaning to explore this dataset for a long time, so this project is a perfect excuse.

This Jupyter notebook shows my analysis of alignment and party affiliation. I’ll get to beliefs and attitudes next week.

Alignment

This figure shows the percent who say they are liberal minus the percent who say they are conservative, for men and women ages 18-29.

It looks like the gender gap in political alignment appeared in the 1980s, but it has been nearly constant since then.

Affiliation

This figure shows the percent who say they are Democrats minus the percent who say they are Republicans, for men and women ages 18-29.

The gender gap in party affiliation has been mostly constant since the 1970s. It might have been a little wider in the 1990s, and might be shrinking now.

So what’s up with Gallup?

The results from GSS, CES, and ANES are consistent: there is no evidence of a growing gender gap in alignment, affiliation, or attitudes. So why does the Gallup data tell a different story?

Here’s the figure from the Financial Times article again, zooming in on just the US data.

First, I think this figure is misleading. As explained in this tweet, the data here have been adjusted by subtracting off the trend in the general population. As a result, the figure gives the impression that young men now are more likely to identify as conservative than in the past, and that’s not true. They are more likely to identify as liberal, but this trend is moving slightly slower than in the general population.

But misleading or not, this way of showing the data doesn’t change the headline result, which is that the gender gap in this dataset has grown substantially, from about 10 percentage points in 2010 to about 30 percentage points now.

On Twitter, the author of the FT article points out that one difference is that the sample size is bigger for the Gallup data than for the datasets I looked at, and that’s true. Sample size explains why the variability from year to year is smaller in the Gallup data, but it does not explain why we see a big trend in the Gallup data that does not exist in the other datasets.

As a next step, I would ideally like to access the Gallup data so I can replicate the analysis in the FT article and explore reasons for the discrepancy. If anyone with access to the Gallup data can and will share it with me, let me know.

Barring that, we are left with two criteria to consider: plausibility and preponderance of evidence.

Plausibility: The size of the changes in the Gallup data is at least surprising if not implausible. A change of 20 percentage points in 10 years is unlikely, especially in an analysis like this where we follow an age group over time, so the composition of the group changes slowly.

Preponderance of evidence: At this point we see a trend in one analysis of one dataset, and no sign of that result in several analyses of three other similar datasets.

Until we see better evidence to support the surprising claim, it seems most likely that the gender gap among young people is not growing, and is currently no larger than it has been in the past.

Political Alignment, Affiliation, and Attitudes

February 4, 2024 AllenDowney

Is there a growing gender gap in the U.S.?

Alignment

A recent article in the Financial Times suggests that among young people there is a growing gender gap in political alignment on a spectrum from liberal to conservative.

In last week’s post, I tried to replicate this result using data from the General Social Survey. I generated the following figure, which shows the percentage of liberals minus the percentage of conservatives from 1988 to 2021, among people 18 to 29 years old. The analysis is in this Jupyter notebook.

Women are more likely to say they are liberal by 5-10 percentage points. But there is little or no evidence that the gap is growing.

Party Affiliation

This figure shows the percentage of Democrats minus the percentage of Republicans from 1988 to 2021. The analysis is in this Jupyter notebook.

Women are more likely than men to say they are Democrats. In the 1990s, the gap was almost 20 percentage points. Now it is only 5-10 percentage points. So there’s no evidence this gap is growing — if anything, it is shrinking.

Attitudes and beliefs

To quantify political attitudes, I will take advantage of a method I used in Chapter 12 of Probably Overthinking It. In the General Social Survey, I chose 15 questions where there is the biggest difference in the responses of people who identify as liberal or conservative. Then I estimated the number of conservative responses from each respondent.

The following figure shows the average number of conservative responses for young men and women since 1974. The analysis is in this Jupyter notebook.

Men give slightly more conservative responses than women, on average, but the gap is small and consistent over time — there is no evidence it is growing.

In summary, GSS data provides no support for the claim that there is a growing gender gap in political alignment, affiliation, or attitudes.

Extremes, outliers, and GOATS

February 1, 2024 AllenDowney

The video from my PyData Global 2023 talk, Extremes, outliers, and GOATS, is available now:

The slides are here.

There are two Jupyter notebooks that contain the analysis I presented:

Here’s the abstract:

The fastest runners are much faster than we expect from a Gaussian distribution, and the best chess players are much better. In almost every field of human endeavor, there are outliers who stand out even among the most talented people in the world. Where do they come from?

In this talk, I present as possible explanations two data-generating processes that yield lognormal distributions, and show that these models describe many real-world scenarios in natural and social sciences, engineering, and business. And I suggest methods — using SciPy tools — for identifying these distributions, estimating their parameters, and generating predictions.

This talk is based on Chapter 4 of Probably Overthinking It. If you liked the talk, you’ll love the book 🙂

Thanks to the organizers of PyData Global and NumFOCUS!