
Founded Upon an Error

A recent post on Reddit asks, “Why was Bayes’ Theory not accepted/popular historically until the late 20th century?”

Great question! As always, there are many answers to a question like this, and the good people of Reddit provide several. But the first and most popular answer is, in my humble opinion, wrong.

The story goes something like this: “Bayesian methods are computationally expensive, so even though they were known in the early days of modern statistics, they were not practical until the availability of computational power and the recent development of efficient sampling algorithms.”

This theory is appealing because, if we look at problems where Bayesian methods are currently used, many of them are large and complex, and would indeed have been impractical to solve just a few years ago.

I think it is also appealing because it rationalizes the history of statistics. Ignoring Bayesian methods for almost 100 years wasn’t a mistake, we can tell ourselves; we were just waiting for the computers to catch up.

Well, I’m sorry, but that’s bunk. In fact, we could have been doing Bayesian statistics all along, using conjugate priors and grid algorithms.

Conjugate Priors

A large fraction of common, practical problems in statistics can be solved using conjugate priors, and the solutions require almost no computation. For example:

  • Problems that involve estimating proportions can be solved using a beta prior and binomial likelihood function. In that case, a Bayesian update requires exactly two addition operations.
  • In the multivariate case, with a Dirichlet prior and a multinomial likelihood function, the update consists of adding two vectors.
  • Problems that involve estimating rates can be solved with a gamma prior and an exponential or Poisson likelihood function — and the update requires two additions.
  • For problems that involve estimating the parameters of a normal distribution, things are a little more challenging: you have to compute the mean and standard deviation of the data, and then perform about a dozen arithmetic operations.

For details, see Chapter 18 of Think Bayes. And for even more examples, see this list of conjugate priors. All of these could have been done with paper and pencil, or chalk and rock, at any point in the 20th century.
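As a concrete example, here is a minimal sketch of the beta-binomial update in Python. The function name and the uniform prior are my own choices, not from any particular library:

    def update_beta(alpha, beta, k, n):
        """Update a beta prior after observing k successes in n trials."""
        # The entire Bayesian update: add the successes and the failures.
        return alpha + k, beta + (n - k)

    # Example: start with a uniform Beta(1, 1) prior, observe 140 successes in 250 trials.
    alpha, beta = update_beta(1, 1, 140, 250)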

And these methods would be sufficient to solve many common problems in statistics, including everything covered in an introductory statistics class, and a lot more. In the time it takes for students to understand p-values and confidence intervals, you could teach them Bayesian methods that are more interesting, comprehensible, and useful.

In terms of computational efficiency, updates with conjugate priors border on miraculous. But they are limited to problems where the prior and likelihood can be well modeled by simple analytic functions. For other problems, we need other methods.

Grid Algorithms

The idea behind grid algorithms is to enumerate all possible values for the parameters we want to estimate and, for each set of parameters:

  1. Compute the prior probability,
  2. Compute the likelihood of the data,
  3. Multiply the priors and the likelihoods,
  4. Add up the products to get the total probability of the data, and
  5. Divide through to normalize the posterior distribution.

If the parameters are continuous, we approximate the results by evaluating the prior and likelihood at a discrete set of values, often evenly spaced to form a d-dimensional grid, where d is the number of parameters.

If there are n possible values and m elements in the dataset, the total amount of computation we need is proportional to the product n × m, which is practical for most problems. And in many cases we can do even better by summarizing the data; then the computation we need is proportional to n + m.
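Here is a rough sketch of a grid update for estimating a proportion, using a summarized dataset. The grid size, the uniform prior, and the data are arbitrary choices for illustration:

    import numpy as np
    from scipy.stats import binom

    qs = np.linspace(0, 1, 101)        # grid of possible values of the proportion
    prior = np.ones(len(qs))           # uniform prior (an arbitrary choice)
    k, m = 140, 250                    # summarized data: k successes in m trials

    likelihood = binom.pmf(k, m, qs)   # likelihood of the data at each grid point
    posterior = prior * likelihood     # multiply priors and likelihoods
    posterior /= posterior.sum()       # normalize so the posterior sums to 1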

For problems with 1-2 parameters — which includes many useful, real-world problems — grid algorithms are efficient enough to run on my 1982 vintage Commodore 64.

For problems with 3-4 parameters, we need a little more power. For example, in Chapter 15 of Think Bayes I solve a problem with 3 parameters, which takes a few seconds on my laptop, and in Chapter 17 I solve a problem that takes about a minute.

With some optimization, you might be able to estimate 5-6 parameters using a coarse grid, but at that point you are probably better off with Markov chain Monte Carlo (MCMC) or Approximate Bayesian Computation (ABC).

For more than six parameters, grid algorithms are not practical at all. But you can solve a lot of real-world problems with fewer than six parameters, using only the computational power that’s been available since 1970.

So why didn’t we?

Awful People, Bankrupt Ideas

In 1925, R.A. Fisher wrote, “… it will be sufficient … to reaffirm my personal conviction … that the theory of inverse probability is founded upon an error, and must be wholly rejected.” By “inverse probability”, he meant what is now called Bayesian statistics, and this is probably the nicest thing he ever wrote about it.

Unfortunately for Bayesianism, Fisher’s “personal conviction” carried more weight than most. Fisher was “the single most important figure in 20th century statistics”, at least according to this article. He was also, according to contemporaneous accounts, a colossal jerk who sat on 20th century statistics like a 400-pound gorilla, a raving eugenicist, even after World War II, and a paid denier of the link between smoking and lung cancer.

For details of the story, I recommend The Theory That Would Not Die, where Sharon Bertsch McGrayne writes: “If Bayes’ story were a TV melodrama, it would need a clear-cut villain, and Fisher would probably be the audience’s choice by acclamation.”

Among other failings, Fisher feuded endlessly with Karl Pearson, Egon Pearson, and Jerzy Neyman, to the detriment of statistics, science, and the world. But he and Neyman agreed about one thing: they were both rabid and influential anti-Bayesians.

The focus of their animosity was the apparent subjectivity of Bayesian statistics, particularly in the choice of prior distributions. But this concern is, in my personal conviction, founded upon an error: the belief that frequentist methods are less subjective than Bayesian methods.

All statistical methods are based on modeling decisions, and modeling decisions are subjective. With Bayesian methods, the modeling decisions are represented more explicitly, but that’s a feature, not a bug. As I.J. Good said, “The subjectivist [Bayesian] states his judgements, whereas the objectivist [frequentist] sweeps them under the carpet by calling assumptions knowledge, and he basks in the glorious objectivity of science.”

In summary, it would be nice to think it was reasonable to neglect Bayesian statistics for most of the 20th century because we didn’t have the computational power to make them practical. But that’s a rationalization. A much more substantial part of the reason is the open opposition of awful people with bankrupt ideas.

What’s new in Think Bayes 2?

I’m happy to report that the second edition of Think Bayes is available for preorder now.

Cover of Think Bayes second edition

What’s new in the second edition?

  • I wrote a new Chapter 1 that introduces conditional probability by using the Linda the Banker problem and data from the General Social Survey.
  • I added new chapters on survival analysis, linear regression, logistic regression, conjugate priors, MCMC, and ABC.
  • I added a lot of new examples and exercises, most from classes I taught using the first edition.
  • I rewrote all of the code using NumPy, SciPy, and Pandas (rather than basic Python types). The new code is shorter, clearer, and faster!
  • For every chapter, there’s a Jupyter notebook where you can read the text, run the code, and work on exercises. You can run the notebooks on your own computer or, if you don’t want to install anything, you can run them on Colab.

More generally, the second edition reflects everything I’ve learned in the 10 years since I started the first edition, and it benefits from the comments, suggestions, and corrections I’ve received from readers. I think it’s really good!

If you would like to preorder, click here.

Bayesian and frequentist results are not the same, ever

I often hear people say that the results from Bayesian methods are the same as the results from frequentist methods, at least under certain conditions. And sometimes it even comes from people who understand Bayesian methods.

Today I saw this tweet from Julia Rohrer: “Running a Bayesian multi-membership multi-level probit model with a custom function to generate average marginal effects only to find that the estimate is precisely the same as the one generated by linear regression with dummy-coded group membership.” [emphasis mine]

Which elicited what I interpret as good-natured teasing, like this tweet from Daniël Lakens: “I always love it when people realize that the main difference between a frequentist and Bayesian analysis is that for the latter approach you first need to wait 24 hours for the results.”

Ok, that’s funny, but there is a serious point here I want to respond to because both of these comments are based on the premise that we can compare the results from Bayesian and frequentist methods. And that’s not just wrong, it is an important misunderstanding.

You can’t compare results from Bayesian and frequentist methods because the results are different kinds of things. Results from frequentist methods are generally a point estimate, a confidence interval, and/or a p-value. Each of those results is an answer to a different question:

  • Point estimate: If I have to pick a single value, which one minimizes a particular cost function under a particular set of constraints? For example, which one minimizes mean squared error while being unbiased?
  • Confidence interval: If my estimated parameters are correct and I run the experiment again, how much would the results vary due to random sampling?
  • p-value: If my estimated parameters are wrong and the actual effect size is zero, what is the probability I would see an effect as big as the one I saw?

In contrast, the result from Bayesian methods is a posterior distribution, which is a different kind of thing from a point estimate, an interval, or a probability. It doesn’t make any sense to say that a distribution is “the same as” or “close to” a point estimate because there is no meaningful way to compute a distance between those things. It makes as much sense as comparing 1 second and 1 meter.

If you have a posterior distribution and someone asks for a point estimate, you can compute one. In fact, you can compute several, depending on what you want to minimize. And if someone asks for an interval, you can compute one of those, too. In fact, you could compute several, depending on what you want the interval to contain. And if someone really insists, you can compute something like a p-value, too.
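For example, suppose `qs` is an array of possible parameter values and `posterior` is an array of probabilities that sums to 1, as in a grid computation. Then the summaries are one-liners (a sketch, not code from the book):

    import numpy as np

    post_mean = np.sum(qs * posterior)       # point estimate that minimizes mean squared error
    post_map = qs[np.argmax(posterior)]      # the most probable single value
    cdf = np.cumsum(posterior)
    ci90 = (qs[np.searchsorted(cdf, 0.05)],  # one possible 90% credible interval
            qs[np.searchsorted(cdf, 0.95)])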

But you shouldn’t.

The posterior distribution represents everything you know about the parameters; if you reduce it to a single number, an interval, or a probability, you lose useful information. In fact, you lose exactly the information that makes the posterior distribution useful in the first place.

It’s like comparing a car and an airplane by driving the airplane on the road. You would conclude that the airplane is complicated, expensive, and not particularly good as a car. But that would be a silly conclusion because it’s a silly comparison. The whole point of an airplane is that it can fly.

https://slate.com/human-interest/2010/03/how-to-land-a-plane-on-a-highway.html

And the whole point of Bayesian methods is that a posterior distribution is more useful than a point estimate or an interval because you can use it to guide decision-making under uncertainty.

For example, suppose you compare two drugs and you estimate that one is 90% effective and the other is 95% effective. And let’s suppose that difference is statistically significant with p=0.04. For the next patient that comes along, which drug should you prescribe?

You might be tempted to prescribe the second drug, which seems to have higher efficacy. However:

  1. You are not actually sure it has higher efficacy; it’s still possible that the first drug is better. If you always prescribe the second drug, you’ll never know.
  2. Also, point estimates and p-values don’t help much if one of the drugs is more expensive or has more side effects.

With a posterior distribution, you can use a method like Thompson sampling to balance exploration and exploitation, choosing each drug in proportion to the probability that it is the best. And you can make better decisions by maximizing expected benefits, taking into account whatever factors you can model, including things like cost and side effects (which is not to say that it’s easy, but it’s possible).
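Here is a minimal sketch of Thompson sampling for the two-drug example, assuming we keep a count of successes and failures for each drug. The counts and the uniform priors are made up for illustration:

    import numpy as np

    rng = np.random.default_rng()

    successes = np.array([90, 95])   # hypothetical outcomes observed so far
    failures = np.array([10, 5])

    def choose_drug():
        # Draw one sample from each beta posterior (uniform priors assumed)
        # and prescribe the drug with the higher sampled efficacy.
        samples = rng.beta(successes + 1, failures + 1)
        return np.argmax(samples)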

Bayesian methods answer different questions, provide different kinds of answers, and solve different problems. The results are not the same as frequentist methods, ever.

Conciliatory postscript: If you don’t need a posterior distribution — if you just want a point estimate or an interval — and you conclude that you don’t need Bayesian methods, that’s fine. But it’s not because the results are the same.

Simpson’s paradox and real wages

I have good news and bad news. First the good news: after a decade of stagnation, real wages have been rising since 2010. The following figure shows weekly wages for full-time employees (source), which I adjusted for inflation and indexed so the series starts at 100.

Real wages in 2019 Q3 were about 5% higher than in 2010.

Now here’s the bad news: at every level of education, real wages are lower now than in 2000, or practically the same. The following figure shows real weekly wages grouped by educational attainment:

For people with some college or an associate degree, real wages have fallen by about 5% since 2000 Q1. People with a high school diploma or a bachelor’s degree are making less money, too. People with advanced degrees are making about the same, and high school dropouts are doing slightly better.

But the net change for every group is substantially less than the 5% increase we see if we put the groups together. How is that even possible?

The answer is Simpson’s paradox, which is when a trend appears in every subgroup, but “disappears or reverses when these groups are combined”. In this case, real wages are declining or stagnant in every subgroup, but when we put the groups together, wages are increasing.

In general, Simpson’s paradox can happen when there is a confounding variable that interacts with the variables you are looking at. In this example, the variables we’re looking at are real wages, education level, and time. So here’s my question: what is the confounding variable that explains these seemingly impossible results?

Before you read the next section, give yourself time to think about it.

Credit: I got this example from a 2013 article by Floyd Norris, who was the chief financial correspondent of The New York Times at the time. He responded very helpfully to my request for help replicating his analysis.

The answer

The key (as Norris explained) is that the fraction of people in each educational level has changed. I don’t have the numbers from the BLS, but we can approximate them with data from the General Social Survey (GSS). It’s not exactly the same because:

  1. The GSS represents the adult residents of the U.S.; the BLS sample includes only people employed full time.
  2. The GSS data includes the number of years of school, so I used that to approximate the educational levels in the BLS dataset. For example, I assume that someone with 12 years of school has a high school diploma, someone with 16 years of school has a bachelor’s degree, etc. (a rough version of that mapping is sketched below).
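Here is roughly what that mapping might look like, assuming the GSS data is in a DataFrame called `gss` with a column `educ` for years of school. The names and bin edges are my approximations:

    import pandas as pd

    # gss is a hypothetical DataFrame; educ is years of school completed.
    bins = [0, 11, 12, 15, 16, 20]
    labels = ['less than high school', 'high school', 'some college',
              "bachelor's degree", 'advanced degree']
    gss['degree'] = pd.cut(gss['educ'], bins=bins, labels=labels)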

With those caveats, the following figure shows the fraction of GSS respondents in each educational level, from 1973 to 2018:

During the relevant period (from 2000 to 2018), the fraction of people with bachelor’s and advanced degrees increased substantially, and the fraction of high school dropouts declined.

These changes are the primary reason for the increase in median real wages when we put all educational levels together. Here’s one way to think about it:

  1. If you compare two people with the same educational level, one in 2000 and one in 2018, the one in 2018 is probably making less money, in real terms.
  2. But if you compare two people, chosen at random, one in 2000 and one in 2018, the one in 2018 is probably making more money, because the one in 2018 probably has more education.
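Here is a toy example with made-up numbers that shows how both statements can be true at the same time:

    # Made-up wages and group shares, just to illustrate the arithmetic.
    wages_2000 = {'high school': 700, 'bachelor': 1200}
    wages_2018 = {'high school': 670, 'bachelor': 1150}   # lower in both groups
    share_2000 = {'high school': 0.7, 'bachelor': 0.3}
    share_2018 = {'high school': 0.4, 'bachelor': 0.6}    # but more degrees in 2018

    avg_2000 = sum(wages_2000[g] * share_2000[g] for g in wages_2000)   # 850
    avg_2018 = sum(wages_2018[g] * share_2018[g] for g in wages_2018)   # 958

Wages fall within each group, but the overall average rises because the mix shifts toward the higher-paid group.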

These changes in educational attainment might explain the paradox, but the explanation raises another question: The same changes were happening between 2000 and 2010, so why were real wages flat during that interval?

I’m not sure I know the answer, but it looks like wages at each level were falling more steeply between 2000 and 2010; after that, some of them started to recover. So maybe the decreases within educational levels were canceled out by the shifts between levels, with a net change close to zero.

And there’s one more question that nags me: Why are real wages increasing for people with less than a high school diploma? With all the news stories about automation and the gig economy, I expected people in this group to see decreasing wages.

The resolution of this puzzle might be yet another statistical pitfall: survivorship bias. The BLS dataset reports median wages for people who are employed full-time. So if people in the bottom half of the wage distribution lose their jobs, or shift to part-time work, the median of the survivors goes up.

And that raises one final question: Are real wages going up or not?

Berkson Goes to College

Suppose one day you visit Representative College, where the student body is a representative sample of the college population. You meet a randomly chosen student and you learn (because it comes up in conversation) that they got a 600 on the SAT Verbal test, which is about one standard deviation above the mean. What do you think they got on the SAT Math test?

  • A: 600 or more
  • B: Between 500 and 600 (above the mean)
  • C: Between 400 and 500 (below the mean)
  • D: 400 or less

If you chose B, you are right! Scores on the SAT Math and Verbal tests are correlated, so if someone is above average on one, they are probably above average on the other. The correlation coefficient is about 0.7, so people who get 600 on the verbal test get about 570 on the math test, on average.

Now suppose you visit Elite University, where the average score on both tests is 700. You meet a randomly chosen student and you learn (because they bring it up) that they got a 750 on the verbal test, which is about one standard deviation above the mean at E.U. What do you think they got on the math test?

  • A: 750 or more
  • B: Between 700 and 750 (above the mean)
  • C: Between 650 and 700 (below the mean)
  • D: 650 or less

If you chose B again, you are wrong! Among students at E.U., the correlation between test scores is negative. If someone is above average on one, they are probably below average on the other.

This is an example of Berkson’s paradox, which is a form of selection bias. In this case, the selection is the college admission process, which is partly based on exam scores. And the effect, at elite colleges and universities, is a negative correlation between test scores, even though the correlation in the general population is positive.

Data

To see how it works in this example, let’s look at some numbers. I got data from the National Longitudinal Survey of Youth 1997 (NLSY97), which “follows the lives of a sample of [8,984] American youth born between 1980-84”. The public data set includes the participants’ scores on several standardized tests, including the SAT and ACT.

About 1400 respondents took the SAT. Their average and standard deviation are close to the national average (500) and standard deviation (100). And the correlation is about 0.73. To get a sense of how strong that is, here’s what the scatter plot looks like.

Since the correlation is about 0.7, someone who is one standard deviation above the mean on the verbal test is about 0.7 standard deviations above the mean on the math test, on average. So at Representative College, if we select people with verbal scores near 600, their average math score is about 570.

Elite University

Now let’s see what happens when we select students for Elite University. Suppose that in order to get into E.U., your total SAT score has to be 1320 or higher. If we select students who meet or exceed that threshold, their average on both tests is about 700, and the standard deviation is about 50.

Among these students, the correlation between test scores is about -0.33, which means that if you are one standard deviation above the E.U. mean on one test, you are about 0.33 standard deviations below the E.U. mean on the other, on average.
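If you don’t have the survey data handy, here is a small simulation that shows the sign flip. The scores here are bivariate normal, so the numbers won’t match the NLSY97 results exactly:

    import numpy as np

    rng = np.random.default_rng(17)

    # Simulated SAT-like scores: mean 500, sd 100, correlation about 0.73.
    cov = [[100**2, 0.73 * 100 * 100],
           [0.73 * 100 * 100, 100**2]]
    sat_v, sat_m = rng.multivariate_normal([500, 500], cov, size=100_000).T

    elite = (sat_v + sat_m) >= 1320                       # Elite University rule
    print(np.corrcoef(sat_v, sat_m)[0, 1])                # about 0.73
    print(np.corrcoef(sat_v[elite], sat_m[elite])[0, 1])  # negative, though weaker
                                                          # than in the NLSY97 data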

The following figure shows why this happens:

The students who meet the admission requirements at Elite University form a triangle in the upper right, with a moderate negative correlation between test scores.

Specialized University

Of course, most admissions decisions are based on more than the sum of two SAT scores. But we get the same effect even if the details of the admission criteria are different. For example, suppose another school, Specialized University, admits students if either test score is 720 or better, regardless of the other score.

With this threshold, the mean for both tests is close to 700, the same as Elite University, and the standard deviations are a little higher. But again, the correlation is negative, and a little stronger than at E.U., about -0.38, compared to -0.33.
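Continuing the simulation sketch from the previous section, the either-or rule is one line:

    # Specialized University: admit if either score is 720 or better.
    spec = (sat_v >= 720) | (sat_m >= 720)
    print(np.corrcoef(sat_v[spec], sat_m[spec])[0, 1])    # negative again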

The following figure shows the distribution of scores for admitted students.

There are three kinds of students at Specialized University: good at math, good at language, and good at both. But the first two groups are bigger than the third, so the overall correlation is negative.

Sweep the Threshold

Now let’s see what happens as we vary the admissions requirements. I’ll go back to the previous version, where admission depends on the total of the two tests, and vary the threshold.

As we increase the threshold, the average total score increases and the correlation decreases. The following figure shows the results.

At Representative College, where the average total SAT is near 1000, test scores are strongly correlated. At Elite University, where the average is over 1400, the correlation is moderately negative.
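In the simulation sketch, the sweep is a short loop:

    # Sweep the admission threshold and watch the correlation fall.
    for threshold in range(1000, 1500, 100):
        admitted = (sat_v + sat_m) >= threshold
        r = np.corrcoef(sat_v[admitted], sat_m[admitted])[0, 1]
        print(threshold, round(r, 2))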

Secondtier College

But at a college that is selective but not elite, the effect might be even stronger than that. Suppose at Secondtier College (it’s pronounced “seh con’ tee ay'”), a student with a total score of 1220 or more is admitted, but a student with 1320 or more is likely to go somewhere else.

In that case, the average total score would be about 1260. So, based on the parameter sweep in the previous section, we would expect a weak positive correlation, but the correlation is actually strongly negative, about -0.8! The following picture shows why.

At Secondtier, if you meet a student who got a 690 on the math test, about one standard deviation above the mean, you should expect them to get a 580 on the verbal test, on average. That’s a remarkable effect.
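In the simulation sketch, Secondtier’s band of admitted students looks like this:

    # Secondtier College: good enough to get in, not good enough to go elsewhere.
    total = sat_v + sat_m
    secondtier = (total >= 1220) & (total < 1320)
    print(np.corrcoef(sat_v[secondtier], sat_m[secondtier])[0, 1])  # strongly negative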

Summary

Among the students at a given college or university, verbal skills and math skills might be strongly correlated, anti-correlated, or uncorrelated, depending on how the students are selected. This is an example of Berkson’s paradox.

If you enjoy this kind of veridical paradox, you might like my previous article “The Inspection Paradox Is Everywhere”. And if you like thinking about probability, you might like the second edition of Think Bayes (affiliate link), which will be published by O’Reilly Media later this month.

If you want to see the details of my analysis and run the code, click here to run the notebook on Colab.

Finally, if you have access to standardized test scores at a college or university, and you are willing to compute a few statistics, I would love to compare my results with some real-world data. For students who enrolled, I would need

  • Mean and standard deviation for each section of the SAT or ACT.
  • Correlations between the sections.

The results, if you share them, would appear as a dot on a graph, either labeled or unlabeled at your discretion.

The Retreat From Religion Continues

A few years ago I wrote an article for the Scientific American blog where I used data from the General Social Survey (GSS) to describe changes in religious affiliation in the U.S.

And in this longer article I described changes in religious belief as well, including belief in God, interpretation of the Bible, and confidence in religious institutions.

Those articles were based on GSS data released in 2017, which included interviews up to 2016. Now the GSS has released additional data from interviews conducted in 2017 and 2018, including young adults born in 1999 and 2000.

So it’s time to update the results.

Religious affiliation by cohort

The following figure shows religious affiliation by year of birth.

The youngest group in the survey, people born between 1996 and 2000, departs from several long-term trends:

  • They are more likely to identify as Protestant than the previous cohort, and slightly more likely to identify as Catholic.
  • And they are less likely to report no religious preference.

There are only 201 people in this group, so these results might be an anomaly. Data from other sources indicates that young adults are less religious than older groups.

Religious affiliation by year

The following figure shows religious affiliation by year of interview along with predictions based on a simple model of generational replacement.

Despite the reversal of trends in the previous figure, the long-term trends in religious affiliation continue:

  • The fraction of people who identify as Protestant or Christian is declining, and the decline is accelerating.
  • The fraction who identify as Catholic has started to decline.
  • The fraction with no religious affiliation is increasing and accelerating.

Based on the previous batch of data, I predicted that there would be more Nones than Catholics in 2020. With the most recent data, it looks like the cross-over might be ahead of schedule.

On these trends, there will be more Nones than Protestants sometime after 2030.

In the next article, I’ll present related trends in religious belief and confidence in religious institutions.

College Freshmen are More Godless Than Ever

In the last 30 years, college students have become much less religious. The fraction who say they have no religious affiliation has more than tripled, from about 10% to 34%. And the fraction who say they have attended a religious service in the last year fell from more than 85% to 66%.

I’ve been following this trend for a while, using data from the CIRP Freshman Survey, which has surveyed a large sample of entering college students since 1966.

The most recently published data is from “95,505 first-time, full-time freshmen entering 148 baccalaureate institutions” in Fall 2019.

Of course, college students are not a representative sample of the U.S. population. Furthermore, as rates of college attendance have increased, they represent a different slice of the population over time. Nevertheless, surveying young adults over a long interval provides an early view of trends in the general population.

Religious preference

Among other questions, the Freshman Survey asks students to select their “current religious preference” from a list of seventeen common religions, “Other religion,” “Atheist”, “Agnostic”, or “None.”  

The options “Atheist” and “Agnostic” were added in 2015.  For consistency over time, I compare the “Nones” from previous years with the sum of “None”, “Atheist” and “Agnostic” since 2015.
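In code, the adjustment is just a sum over columns. The DataFrame and column names here are hypothetical:

    # df holds the yearly percentages; sum the three categories since 2015.
    df['nones'] = df[['None', 'Atheist', 'Agnostic']].sum(axis=1)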

The following figure shows the fraction of Nones from 1969, when the question was added, to 2019, the most recent data available.

Percentage of students with no religious preference from 1969 to 2019.

The blue line shows data until 2015; the orange line shows data from 2015 through 2019. The gray line shows a quadratic fit.  The light gray region shows a 95% predictive interval.

The quadratic model continues to fit the data well and the most recent data point is above the trend line, which suggests that the “rise of the Nones” is still accelerating.
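For what it’s worth, here is a minimal sketch of that kind of fit, assuming the survey years and percentages are in NumPy arrays called `years` and `nones`. The names are mine, and the interval below is a crude residual-based approximation, not the method behind the figure:

    import numpy as np

    coefs = np.polyfit(years - 1969, nones, deg=2)            # quadratic fit
    fit = np.polyval(coefs, years - 1969)

    resid_sd = np.std(nones - fit)                            # spread of the residuals
    lower, upper = fit - 2 * resid_sd, fit + 2 * resid_sd     # rough 95% band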

Attendance

The survey also asks students how often they “attended a religious service” in the last year. The choices are “Frequently,” “Occasionally,” and “Not at all.” Respondents are instructed to select “Occasionally” if they attended one or more times, so a wedding or a funeral would do it.

The following figure shows the fraction of students who reported any religious attendance in the last year, starting in 1968. I discarded a data point from 1966 that seems unlikely to be correct.

Percentage of students who reported attending a religious service in the previous year.

About 66% of incoming college students said they attended a religious service in the last year, an all-time low in the history of the survey, and down more than 20 percentage points from the peak.

This curve is on trend, with no sign of slowing down.

Gender Gap

Female students are more religious than male students. The following graph shows the gender gap over time, that is, the difference in percentages of male and female students with no religious affiliation.

Difference in religious affiliation between male and female students.

The gender gap was growing until recently. It has shrunk in the last 3-4 years, but since it varies substantially from year to year, it is hard to rule out random variation.

Data Source

The American Freshman: National Norms Fall 2019
Stolzenberg, Aragon, Romo, Couch, McLennon, Eagan, and Kang,
Higher Education Research Institute, UCLA, June 2020

This and all previous reports are available from the HERI publications page.