Browsed by
Tag: Simpson's paradox

Simpson’s What?

Simpson’s What?

I like Simpson’s paradox so much I wrote three chapters about it in Probably Overthinking It. In fact, I like it so much I have a Google alert that notifies me when someone publishes a new example (or when the horse named Simpson’s Paradox wins a race).

So I was initially excited about this paper that appeared recently in Nature: “The geographic association of multiple sclerosis and amyotrophic lateral sclerosis”. But sadly, I’m pretty sure it’s bogus.

The paper compares death rates due to multiple sclerosis (MS) and amyotrophic lateral sclerosis (ALS) across 50 states and the District of Columbia, and reports a strong correlation.

This result is contrary to all previous work on these diseases – which might be a warning sign. But the author explains that this correlation has not been detected in previous work because it is masked when the analysis combines male and female death rates.

This could make sense, because death rates due to MS are higher for women, and death rates due to ALS are higher for men. So if we compare different groups with different proportions of males and females, it’s possible we could see something like Simpson’s paradox.

But as far as I know, the proportions of men and women are the same in all 50 states, plus the District Columbia – or maybe a little higher in Alaska. So an essential element of Simpson’s paradox – different composition of the subgroups – is missing.

Annoyingly, the “Data Availability” section of the paper only identifies the public sources of the data – it does not provide the processed data. But we can use synthesized data to figure out what’s going on.

Specifically, let’s try to replicate this key figure from the paper:

The x-axis is age adjusted death rates from MS; the y-axis is age-adjusted death rates from ALS. Each dot corresponds to one gender group in one state. The blue line fits the male data, with correlation 0.7. The pink line fits the female data, with correlation 0.75.

The black line is supposed to be a fit to all the data, showing the non-correlation we supposedly get if we combine the two groups. But I’m pretty sure that line is a mistake.

Click here to read this article with the Python code, or if you want to replicate my analysis, you can click here to run the notebook on Colab.

Synthetic Data

I used a random number generator to synthesize correlated data with the approximate distribution of the date in the figure. The following figure shows a linear regression for the male and female data separately, and a third line that is my attempt to replicate the black line in the original figure.

I thought the author might have combined the dots from the male and female groups into a collection of 102 points, and fit a line to that. That is a nonsensical thing to do, but it does yield a Simpson-like reversal in the slope of the line — and the sign of the correlation.

_images/3e7af43cbca8d3b134f479a3aa1cdd26131edd1562308344099d9f17782e7d57.png

The line for the combined data has a non-negligible negative slope, and the correlation is about -0.4 – so this is not the line that appears in the original figure, which has a very small correlation. So, I don’t know where that line came from.

In any case, the correct way to combine the data is not to plot a line through 102 points in the scatter plot, but to fit a line to the combined death rates in the 51 states. Assuming that the gender ratios in the states are close to 50/50, the combined rates are just the means of the male and female rates. The following figure shows what we get if we combine the rates correctly.

_images/21fd70a6c01747e4906e2c0e87286110ce9a96b1e6d1e145b5e69c2e987bf9f4.png

So there’s no Simpson’s paradox here – there’s a positive correlation among the subgroups, and there’s a positive correlation when we combine them. I love a good Simpson’s paradox, but this isn’t one of them.

On a quick skim, I think the rest of the paper is also likely to be nonsensical, but I’ll leave that for other people to debunk. Also, peer review is dead.

It gets worse

UPDATE: After I published the first draft of this article, I noticed that there are an unknown number of data points at (0, 0) in the original figure. They are probably states with missing data, but if they were included in the analysis as zeros — which they absolutely should not be — that would explain the flat line.

If we assume there are two states with missing data, that strengthens the effect in the subgroups, and weakens the effect in the combined groups. The result is a line with a small negative slope, as in the original paper.

Simpson’s paradox and real wages

Simpson’s paradox and real wages

I have good news and bad news. First the good news: after a decade of stagnation, real wages have been rising since 2010. The following figure shows weekly wages for full-time employees (source), which I adjusted for inflation and indexed so the series starts at 100.

Real wages in 2019 Q3 were about 5% higher than in 2010.

Now here’s the bad news: at every level of education, real wages are lower now than in 2000, or practically the same. The following figure shows real weekly wages grouped by educational attainment:

For people with some college or an associate degree, real wages have fallen by about 5% since 2000 Q1. People with a high school diploma or a bachelor’s degree are making less money, too. People with advanced degrees are making about the same, and high school dropouts are doing slightly better.

But the net change for every group is substantially less than the 5% increase we see if we put the groups together. How is that even possible?

The answer is Simpson’s paradox, which is when a trend appears in every subgroup, but “disappears or reverses when these groups are combined”. In this case, real wages are declining or stagnant in every subgroup, but when we put the groups together, wages are increasing.

In general, Simpson’s paradox can happen when there is a confounding variable that interacts with the variables you are looking at. In this example, the variables we’re looking at are real wages, education level, and time. So here’s my question: what is the confounding variable that explains these seemingly impossible results?

Before you read the next section, give yourself time to think about it.

Credit: I got this example from a 2013 article by Floyd Norris, who was the chief financial correspondent of The New York Times at the time. He responded very helpfully to my request for help replicating his analysis.

The answer

The key (as Norris explained) is that the fraction of people in each educational level has changed. I don’t have the number from the BLS, but we can approximate them with data from the General Social Survey (GSS). It’s not exactly the same because:

  1. The GSS represents the adult residents of the U.S.; the BLS sample includes only people employed full time.
  2. The GSS data includes number of years of school, so I used that to approximate the educational levels in the BLS dataset. For example, I assume that someone with 12 years of school has a high school diploma, someone with 16 years of school has a bachelor’s degree, etc.

With those caveats, the following figure shows the fraction of GSS respondents in each educational level, from 1973 to 2018:

During the relevant period (from 2000 to 2018), the fraction of people with bachelor’s and advanced degrees increased substantially, and the fraction of high school dropouts declined.

These changes are the primary reason for the increase in median real wages when we put all educational levels together. Here’s one way to think about it:

  1. If you compare two people with the same educational level, one in 2000 and one in 2018, the one in 2018 is probably making less money, in real terms.
  2. But if you compare two people, chosen at random, one in 2000 and one in 2018, the one in 2018 is probably making more money, because the one in 2018 probably has more education.

These changes in educational attainment might explain the paradox, but the explanation raises another question: The same changes were happening between 2000 and 2010, so why were real wages flat during that interval?

I’m not sure I know the answer, but it looks like wages at each level were falling more steeply between 2000 and 2010; after that, some of them started to recover. So maybe the decreases within educational levels were canceled out by the shifts between levels, with a net change close to zero.

And there’s one more question that nags me: Why are real wages increasing for people with less than a high school diploma? With all the news stories about automation and the gig economy, I expected people in this group to see decreasing wages.

The resolution of this puzzle might be yet another statistical pitfall: survivorship bias. The BLS dataset reports median wages for people who are employed full-time. So if people in the bottom half of the wage distribution lose their jobs, or shift to part-time work, the median of the survivors goes up.

And that raises one final question: Are real wages going up or not?