A unique feature of the game is the dice, which yield three possible outcomes, 0, 1, or 2, with equal probability. When you add them up, you get some unusual probability distributions.
There are two phases of the game: During the first phase, players explore a haunted house, drawing cards and collecting items they will need during the second phase, called “The Haunt”, which is when the players battle monsters and (usually) each other.
So when does the haunt begin? It depends on the dice. Each time a player draws an “omen” card, they have to make a “haunt roll”: they roll six dice and add them up; if the total is less than the number of omen cards that have been drawn, the haunt begins.
For example, suppose four omen cards have been drawn. A player draws a fifth omen card and then rolls six dice. If the total is less than 5, the haunt begins. Otherwise the first phase continues.
Last time I played this game, I was thinking about the probabilities involved in this process. For example:
What is the probability of starting the haunt after the first omen card?
What is the probability of drawing at least 4 omen cards before the haunt?
What is the average number of omen cards before the haunt?
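To make the first question concrete, here is a minimal sketch that enumerates all 3^6 outcomes of a haunt roll (this is my own code, written against the rules as described above):

```python
from itertools import product
from collections import Counter

# Each die is 0, 1, or 2 with equal probability; a haunt roll uses six dice.
counts = Counter(sum(dice) for dice in product([0, 1, 2], repeat=6))
total = 3 ** 6
pmf = {k: v / total for k, v in sorted(counts.items())}

# The haunt starts after the first omen card only if the total is less than 1,
# that is, only if all six dice come up 0.
p_haunt_after_first = sum(p for k, p in pmf.items() if k < 1)
print(p_haunt_after_first)  # 1/729, about 0.14%
```

Answering the other two questions means chaining these rolls: after the nth omen card, the haunt begins if the total of six dice is less than n.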
Abstract: The unusual circumstances of Curtis Flowers’ trials make it possible to estimate the probabilities that white and black jurors would vote to convict him, 98% and 68% respectively, and the probability a jury of his peers would find him guilty, 15%.
Background
Curtis Flowers was tried six times for the same crime. Four trials ended in conviction; two ended in a mistrial due to a hung jury.
Three of the convictions were invalidated by the Mississippi Supreme Court, at least in part because the prosecution had excluded black jurors, depriving Flowers of the right to trial by a jury composed of a “fair cross-section of the community”.
Because of the unusual circumstances of these trials, we can perform a statistical analysis that is normally impossible: we can estimate the probability that black and white jurors would vote to convict, and use those estimates to compute the probability that he would be convicted by a jury that represents the racial makeup of Montgomery County.
Results
According to my analysis, the probability that a white juror in this pool would vote to convict Flowers, given the evidence at trial, is 98%. The same probability for black jurors is 68%. The difference is substantial.
The probability that Flowers would be convicted by a fair jury is only 15%, and the probability that he would be convicted four times out of six times is less than 1%.
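As a rough check on that last figure (this is not the full model, just a back-of-the-envelope binomial calculation under the assumption that the six trials were independent and each faced a fair jury with a 15% chance of conviction):

```python
from scipy.stats import binom

p = 0.15  # estimated probability that a fair jury convicts
print(binom.pmf(4, 6, p))  # exactly four convictions in six trials, about 0.5%
print(binom.sf(3, 6, p))   # four or more convictions, about 0.6%
```

Both values are below 1%, consistent with the result above.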
The following figure shows the probability of a guilty verdict as a function of the number of black jurors:
According to the model, the probability of a guilty verdict is 55% with an all-white jury. If the jury includes 5-6 black jurors, which would be representative of Montgomery County, the probability of conviction would be only 14-15%.
The shaded area represents a 90% credible interval. It is quite wide, reflecting uncertainty due to limits of the data. Also, the model is based on the simplifying assumptions that
All six juries saw essentially the same evidence,
The probabilities we’re estimating did not change substantially over the period of the trials,
Interactions between jurors had negligible effects on their votes,
If any juror refuses to convict, the result is a hung jury.
Two questions crossed my desktop this week, and I think I can answer both of them with a single example.
On Twitter, Kareem Carr asked, “If Alice believes an event has a 90% probability of occurring and Bob also believes it has a 90% chance of occurring, what does it mean to say they have the same degree of belief? What would we expect to observe about both Alice’s and Bob’s behavior?”
And on Reddit, a reader of /r/statistics asked, “I have three coefficients from three different studies that measure the same effect, along with their 95% CIs. Is there an easy way to combine them into a single estimate of the effect?”
So let me tell you a story:
One day Alice tells her friend, Bob, “I bought a random decision-making box. Every time you press this button, it says ‘yes’ or ‘no’. I’ve tried it a few times, and I think it says ‘yes’ 90% of the time.”
Bob says he has some important decisions to make and asks if he can borrow the box. The next day, he returns the box to Alice and says, “I used the box several times, and I also think it says ‘yes’ 90% of the time.”
Alice says, “It sounds like we agree, but just to make sure, we should compare our predictions. Suppose I press the button twice; what do you think is the probability it says ‘yes’ both times?”
Bob does some calculations and reports a predictive probability of 81.56%.
Alice says, “That’s interesting. I got a slightly different result, 81.79%. So maybe we don’t agree after all.”
Bob says, “Well let’s see what happens if we combine our data. I can tell you how many times I pressed the button and how many times it said ‘yes’.”
Alice says, “That’s ok, I don’t actually need your data; it’s enough if you tell me what prior distribution you used.”
Bob tells her he used a Jeffreys prior.
Alice does some calculations and says, “Ok, I’ve updated my beliefs to take into account your data as well as mine. Now I think the probability of ‘yes’ is 91.67%.”
Bob says, “That’s interesting. Based on your data, you thought the probability was 90%, and based on my data, I thought it was 90%, but when we combine the data, we get a different result. Tell me what data you saw, and let me see what I get.”
Alice tells him she pressed the button 8 times and it always said ‘yes’.
“So,” says Bob, “I guess you used a uniform prior.”
Bob does some calculations and reports, “Taking into account all of the data, I think the probability of ‘yes’ is 93.45%.”
Alice says, “So when we started, we had seen different data, but we came to the same conclusion.”
“Sort of,” says Bob, “we had the same posterior mean, but our posterior distributions were different; that’s why we made different predictions for pressing the button twice.”
Alice says, “And now we’re using the same data, but we have different posterior means. Which makes sense, because we started with different priors.”
“That’s true,” says Bob, “but if we collect enough data, eventually our posterior distributions will converge, at least approximately.”
“Well that’s good,” says Alice. “Anyway, how did those decisions work out yesterday?”
“Mostly bad,” says Bob. “It turns out that saying ‘yes’ 93% of the time is a terrible way to make decisions.”
If you would like to know how any of those calculations work, you can see the details in a Jupyter notebook:
And if you don’t want the details, here is the summary:
If two people have different priors OR they see different data, they will generally have different posterior distributions.
If two posterior distributions have the same mean, some of their predictions will be the same, but many others will not.
If you are given summary statistics from a posterior distribution, you might be able to figure out the rest of the distribution, depending on what other information you have. For example, if you know the posterior is a two-parameter beta distribution (or is well-modeled by one) you can recover it from the mean and second moment, or the mean and a credible interval, or almost any other pair of statistics.
If someone has done a Bayesian update using data you don’t have access to, you might be able to “back out” their likelihood function by dividing their posterior distribution by the prior.
If you are given a posterior distribution and the data used to compute it, you can back out the prior by dividing the posterior by the likelihood of the data (unless the prior contains values with zero likelihood).
If you are given summary statistics from two posterior distributions, you might be able to combine them. In general, you need enough information to recover both posterior distributions and at least one prior.
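If you want to see roughly where the numbers in the story come from without opening the notebook, here is a sketch using beta distributions. Alice’s data is given in the story; Bob’s is not, so I am assuming he saw 13 ‘yes’ and 1 ‘no’, which is one dataset consistent with his numbers. Two of the dialogue’s figures differ from these closed-form values in the second decimal place.

```python
def posterior_mean(a, b):
    """Mean of a Beta(a, b) posterior."""
    return a / (a + b)

def prob_two_yeses(a, b):
    """Posterior predictive probability of 'yes' twice in a row for a Beta(a, b) posterior."""
    return a / (a + b) * (a + 1) / (a + b + 1)

# Alice: uniform prior Beta(1, 1), then 8 presses, all 'yes' -> Beta(9, 1)
alice = (1 + 8, 1 + 0)

# Bob: Jeffreys prior Beta(0.5, 0.5); 13 'yes' and 1 'no' (my assumption) -> Beta(13.5, 1.5)
bob = (0.5 + 13, 0.5 + 1)

print(posterior_mean(*alice), posterior_mean(*bob))   # 0.9 and 0.9
print(prob_two_yeses(*alice), prob_two_yeses(*bob))   # about 0.818 and 0.816

# Alice updates her posterior with Bob's data
alice_all = (9 + 13, 1 + 1)
print(posterior_mean(*alice_all))                     # about 0.917

# Bob starts from his Jeffreys prior and uses all of the data
bob_all = (0.5 + 8 + 13, 0.5 + 0 + 1)
print(posterior_mean(*bob_all))                       # about 0.935
```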
I am hard at work on the second edition of Think Bayes, currently working on Chapter 6, which is about computing distributions of minima, maxima and mixtures of other distributions.
Of all the changes in the second edition, I am particularly proud of the exercises. I present three new exercises from Chapter 6 below. If you want to work on them, you can use this notebook, which contains the material you will need from the chapter and some code to get you started.
Exercise 1
Henri Poincaré was a French mathematician who taught at the Sorbonne around 1900. The following anecdote about him is probably fabricated, but it makes an interesting probability problem.
Supposedly Poincaré suspected that his local bakery was selling loaves of bread that were lighter than the advertised weight of 1 kg, so every day for a year he bought a loaf of bread, brought it home and weighed it. At the end of the year, he plotted the distribution of his measurements and showed that it fit a normal distribution with mean 950 g and standard deviation 50 g. He brought this evidence to the bread police, who gave the baker a warning.
For the next year, Poincaré continued the practice of weighing his bread every day. At the end of the year, he found that the average weight was 1000 g, just as it should be, but again he complained to the bread police, and this time they fined the baker.
Why? Because the shape of the distribution was asymmetric. Unlike the normal distribution, it was skewed to the right, which is consistent with the hypothesis that the baker was still making 950 g loaves, but deliberately giving Poincaré the heavier ones.
To see whether this anecdote is plausible, let’s suppose that when the baker sees Poincaré coming, he hefts n loaves of bread and gives Poincaré the heaviest one. How many loaves would the baker have to heft to make the average of the maximum 1000 g?
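If you want to check your answer by simulation before working it out, here is a minimal sketch (my code, not the book’s solution):

```python
import numpy as np

rng = np.random.default_rng(42)

def mean_of_max(n, iters=100_000):
    """Average weight of the heaviest of n loaves drawn from Normal(950 g, 50 g)."""
    loaves = rng.normal(950, 50, size=(iters, n))
    return loaves.max(axis=1).mean()

for n in range(2, 7):
    print(n, mean_of_max(n))
# The expected maximum first exceeds 1000 g at around n = 4.
```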
Exercise 2
Two doctors fresh out of medical school are arguing about whose hospital delivers more babies. The first doctor says, “I’ve been at Hospital A for two weeks, and already we’ve had a day when we delivered 20 babies.”
The second doctor says, “I’ve only been at Hospital B for one week, but already there’s been a 19-baby day.”
Which hospital do you think delivers more babies on average? You can assume that the number of babies born in a day is well modeled by a Poisson distribution.
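Here is one way to set up the update (not necessarily the intended solution): treat each anecdote as “the largest daily count over k days was m”, put a uniform prior on the daily rate, and compute the likelihood of that maximum under a Poisson model.

```python
import numpy as np
from scipy.stats import poisson

lams = np.linspace(1, 40, 400)            # grid of candidate birth rates (babies/day)
prior = np.ones_like(lams) / len(lams)    # uniform prior, an arbitrary choice

def posterior_given_max(m, days, lams, prior):
    """Posterior over lambda, given that the largest of `days` daily counts was m."""
    like = poisson.cdf(m, lams) ** days - poisson.cdf(m - 1, lams) ** days
    post = prior * like
    return post / post.sum()

post_a = posterior_given_max(20, 14, lams, prior)  # Hospital A: a 20-baby day in two weeks
post_b = posterior_given_max(19, 7, lams, prior)   # Hospital B: a 19-baby day in one week

print((lams * post_a).sum())  # posterior mean rate for Hospital A
print((lams * post_b).sum())  # posterior mean rate for Hospital B
```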
Exercise 3
Suppose I drive the same route three times and the fastest of the three attempts takes 8 minutes.
There are two traffic lights on the route. As I approach each light, there is a 40% chance that it is green; in that case, it causes no delay. And there is a 60% chance it is red; in that case it causes a delay that is uniformly distributed from 0 to 60 seconds.
What is the posterior distribution of the time it would take to drive the route with no delays?
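This is not the book’s solution, but here is a simulation sketch of the delay model, which gives the distribution you need for the likelihood. Assuming the only variability comes from the lights, the fastest observed time is the no-delay time plus the smallest of three total delays.

```python
import numpy as np

rng = np.random.default_rng(0)

def light_delay(size):
    """Delay at one light, in seconds: no delay with probability 0.4 (green),
    otherwise uniformly distributed between 0 and 60 seconds (red)."""
    red = rng.random(size) < 0.6
    return np.where(red, rng.uniform(0, 60, size), 0.0)

n = 100_000
# Total delay for each of three attempts (two lights per attempt),
# then the smallest total delay over the three attempts.
delays = light_delay((n, 3)) + light_delay((n, 3))
best_of_three = delays.min(axis=1)

# The likelihood of a candidate no-delay time t0 (in seconds) is the density of
# best_of_three near 480 - t0; here is a rough empirical version of that density.
hist, edges = np.histogram(best_of_three, bins=np.arange(0, 125, 5), density=True)
print(dict(zip(edges[:4], hist[:4])))  # density of small best-of-three delays
```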
The solution to this exercise is very similar to a method I developed for estimating the minimum time for a packet of data to travel through a path in the internet.
For the first edition, I used a module called thinkplot that provides functions that make it easier to use Matplotlib. It also overrides some of the default options.
But since I wrote the first edition, Matplotlib has improved substantially. I found I was able to eliminate thinkplot with minimal changes. As a result, the code is simpler and the figures look better.
Still using thinkdsp
I provide a module called thinkdsp that contains classes and functions used throughout the book. I think this module is good for learners. It lets me hide details that would otherwise be distracting. It lets me present some topics “top-down”, meaning that we learn how to use some features before we know how they work.
And when you learn the API provided by thinkdsp, you are also learning about DSP. For example, thinkdsp provides classes called Signal, Wave, and Spectrum.
A Signal represents a continuous function; a Wave represents a sequence of discrete samples. So Signal provides make_wave, but Wave does not provide make_signal. When you use this API, you understand implicitly that this is a one-way operation: you can sample a Signal to get a Wave, but you cannot recover a Signal from a Wave.
On the other hand, you can convert from Wave to Spectrum and from Spectrum to Wave, which implies (correctly) that they are equivalent representations of the same information. Given one, you can compute the other.
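For example, the basic round trip looks something like this (the specific frequency, duration, and framerate are just for illustration):

```python
from thinkdsp import SinSignal

# A Signal is a continuous function of time; sampling it yields a Wave.
signal = SinSignal(freq=440)
wave = signal.make_wave(duration=0.5, framerate=11025)

# A Wave and its Spectrum are equivalent representations:
# you can go back and forth between them.
spectrum = wave.make_spectrum()
wave2 = spectrum.make_wave()

# But there is no wave.make_signal(): you cannot recover a Signal from a Wave.
```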
I realize that not everyone loves it when a book uses a custom library like thinkdsp. When people don’t like Think DSP, this is the most common reason. But looking at thinkdsp with fresh eyes, I am doubling down; I still think it’s a good way to learn.
Less object-oriented
Nevertheless, I found a few opportunities to simplify the code, and in particular to make it less object-oriented. I generally like OOP, but I acknowledge that there are drawbacks. One of the biggest is that it can be hard to keep an inheritance hierarchy in your head and easy to lose track of what classes provide which methods.
I still think the template pattern is a good way to present a framework: the parent class provides the framework and child classes fill in the details.
However, based on feedback from readers, I have come to realize that object-oriented programming is not as universally known and loved as I assumed.
In several places I found that I could eliminate object-oriented features and simplify the code without losing explanatory power.
Pretty, pretty good
Coming back to this book after some time, I think it’s pretty good. If you are interested in digital signal processing, I think the computation-first approach is a good way to get started. And if you are not interested in digital signal processing, maybe I can change your mind!
The inspection paradox is a statistical illusion you’ve probably never heard of. It’s a common source of confusion, an occasional cause of error, and an opportunity for clever experimental design. And once you know about it, you see it everywhere.
The examples in the talk include social networks, transportation, education, incarceration, and more. And now I am happy to report that I’ve stumbled on yet another example, courtesy of John D. Cook.
For a multivariate normal distribution in high dimensions, nearly all the probability mass is concentrated in a thin shell some distance away from the origin.
John does a nice job of explaining this result, so you should read his article, too. But I’ll try to explain it another way, using a dartboard.
If you are not familiar with the layout of a “clock” dartboard, it looks like this:
I got the measurements of the board from the British Darts Organization rules, and drew the following figure with dimensions in mm:
Now, suppose I throw 100 darts at the board, aiming for the center each time, and plot the location of each dart. It might look like this:
Suppose we analyze the results and conclude that my errors in the x and y directions are independent and distributed normally with mean 0 and standard deviation 50 mm.
Assuming that model is correct, then, which do you think is more likely on my next throw, hitting the 25 ring (the innermost red circle), or the triple ring (the middlest red circle)?
It might be tempting to say that the 25 ring is more likely, because the probability density is highest at the center of the board and lower at the triple ring.
We can see that by generating a large sample, computing a 2-D kernel density estimate (KDE), and plotting the result as a contour.
In the contour plot, darker color indicates higher probability density. So it sure looks like the inner ring is more likely than the outer rings.
But that’s not right, because we have not taken into account the area of the rings. The total probability mass in each ring is the product of density and area (or more precisely, the density integrated over the area).
The 25 ring is more dense, but smaller; the triple ring is less dense, but bigger. So which one wins?
In this example, I cooked the numbers so the triple ring wins: the chance of hitting the triple ring is about 6%; the chance of hitting the 25 ring is about 4%.
If I were a better dart player, my standard deviation would be smaller and the 25 ring would be more likely. And if I were even worse, the double ring (the outermost red ring) might be the most likely.
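If you want to check this kind of calculation yourself, note that with independent normal errors in x and y, the distance from the center follows a Rayleigh distribution, so the probability of landing in any ring is a difference of two exponentials. The radii below are round numbers for illustration, not the exact measurements from the figure:

```python
import numpy as np

sigma = 50  # mm, standard deviation of the error in each direction

def prob_ring(r_inner, r_outer, sigma):
    """Probability that the distance from the center falls between r_inner and r_outer,
    assuming independent Normal(0, sigma) errors in x and y (Rayleigh-distributed distance)."""
    return np.exp(-r_inner**2 / (2 * sigma**2)) - np.exp(-r_outer**2 / (2 * sigma**2))

# Illustrative regions, not the exact ring boundaries:
print(prob_ring(0, 10, sigma))     # a 10 mm disk at the center: about 2%
print(prob_ring(95, 105, sigma))   # a 10 mm wide band near 100 mm: about 5%
```

Even though the density is much higher at the center, the band near 100 mm collects more probability mass, which is the area effect at work.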
Inspection Paradox?
It might not be obvious that this is an example of the inspection paradox, but you can think of it that way. The defining characteristic of the inspection paradox is length-biased sampling, which means that each member of a population is sampled in proportion to its size, duration, or similar quantity.
In the dartboard example, as we move away from the center, the area of each ring increases in proportion to its radius (at least approximately). So the probability mass of a ring at radius r is proportional to the density at r, weighted by r.
We can see the effect of this weighting in the following figure:
The blue line shows estimated density as a function of r, based on a sample of throws. As expected, it is highest at the center, and drops away like one half of a bell curve.
The orange line shows the estimated density of the same sample weighted by r, which is proportional to the probability of hitting a ring at radius r.
It peaks at about 60 mm. And the total density in the triple ring, which is near 100 mm, is a little higher than in the 25 ring, near 10 mm.
If I get a chance, I will add the dartboard problem to my talk as yet another example of length-biased sampling, also known as the inspection paradox.
UPDATE November 6, 2019: This “thin shell” effect has practical consequences. This excerpt from The End of Average talks about designing the cockpit of a plane for the “average” pilot, and discovering that there are no pilots near the average in 10 dimensions.
In the first article in this series, I looked at data from the General Social Survey (GSS) to see how political alignment in the U.S. has changed, on the axis from conservative to liberal, over the last 50 years.
In the second article, I suggested that self-reported political alignment could be misleading.
In the previous article, we looked at responses to this question: “Do you think most people would try to take advantage of you if they got a chance, or would they try to be fair?” and generated seven “headlines” to describe the results.
In this article, we’ll use resampling to see how much the results depend on random sampling. And we’ll see which headlines hold up and which might be overinterpretation of noise.
Overall trends
In the previous article we looked at this figure, which was generated by resampling the GSS data and computing a smooth curve through the annual averages.
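Here is a sketch of what one resampling iteration might look like. The column names and the smoothing method are my guesses, and the DataFrame below is a small synthetic stand-in for the GSS extract:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)

# Synthetic stand-in for the GSS extract (column names are assumptions)
gss = pd.DataFrame({
    'year': rng.choice(np.arange(1972, 2019, 2), size=5000),
    'fair': rng.choice([0, 0.5, 1], size=5000, p=[0.35, 0.05, 0.60]),
    'wtssall': rng.uniform(0.5, 2.0, size=5000),   # sampling weights
})

def resample_rows_weighted(df, column='wtssall'):
    """Resample rows with replacement, weighted by the sampling-weight column."""
    weights = df[column] / df[column].sum()
    indices = rng.choice(df.index, size=len(df), replace=True, p=weights)
    return df.loc[indices]

# One iteration: resample, take annual means, then smooth the annual means
# (a rolling mean stands in here for whatever smoother produced the figure).
sample = resample_rows_weighted(gss)
annual = sample.groupby('year')['fair'].mean()
smooth = annual.rolling(5, center=True, min_periods=1).mean()
print(smooth.tail())
```

Running this loop several times and plotting each smoothed curve shows how much the figure can change due to sampling alone.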
If we run the resampling process two more times, we get somewhat different results:
Now, let’s review the headlines from the previous article. Looking at different versions of the figure, which conclusions do you think are reliable?
Absolute value: “Most respondents think people try to be fair.”
Rate of change: “Belief in fairness is falling.”
Change in rate: “Belief in fairness is falling, but might be leveling off.”
In my opinion, the three figures are qualitatively similar. The shapes of the curves are somewhat different, but the headlines we wrote could apply to any of them.
Even the tentative conclusion, “might be leveling off”, holds up to varying degrees in all three.
Grouped by political alignment
When we group by political alignment, we have fewer samples in each group, so the results are noisier and our headlines are more tentative.
Here’s the figure from the previous article:
And here are two more figures generated by random resampling:
Now we see more qualitative differences between the figures. Let’s review the headlines again:
Absolute value: “Moderates have the bleakest outlook; Conservatives and Liberals are more optimistic.” This seems to be true in all three figures, although the size of the gap varies substantially.
Rate of change: “Belief in fairness is declining in all groups, but Conservatives are declining fastest.” This headline is more questionable. In one version of the figure, belief is increasing among Liberals. And it’s not at all clear that the decline is fastest among Conservatives.
Change in rate: “The Liberal outlook was declining, but it leveled off in 1990.” The Liberal outlook might have leveled off, or even turned around, but we could not say with any confidence that 1990 was a turning point.
Change in rate: “Liberals, who had the bleakest outlook in the 1980s, are now the most optimistic”. It’s not clear whether Liberals have the most optimistic outlook in the most recent data.
As we should expect, conclusions based on smaller sample sizes are less reliable.
Also, conclusions about absolute values are more reliable than conclusions about rates, which are more reliable than conclusions about changes in rates.
In the first article in this series, I looked at data from the General Social Survey (GSS) to see how political alignment in the U.S. has changed, on the axis from conservative to liberal, over the last 50 years.
In the second article, I suggested that self-reported political alignment could be misleading.
In this article we’ll look at results from questions related to “outlook”, that is, how the respondents see the world and people in it.
Specifically, the questions are:
fair: Do you think most people would try to take advantage of you if they got a chance, or would they try to be fair?
trust: Generally speaking, would you say that most people can be trusted or that you can’t be too careful in dealing with people?
helpful: Would you say that most of the time people try to be helpful, or that they are mostly just looking out for themselves?
Do people try to be fair?
Let’s start with fair. The responses are coded like this:
1 Take advantage
2 Fair
3 Depends
To put them on a numerical scale, I recoded them like this:
1 Fair
0.5 Depends
0 Take advantage
I flipped the axis so the more positive answer is higher, and put “Depends” in the middle. Now we can plot the mean response by year, like this:
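In code, the recoding and the annual means might look like this (the DataFrame and column names are hypothetical stand-ins for the GSS extract):

```python
import pandas as pd

# A few hypothetical rows standing in for the GSS extract
gss = pd.DataFrame({
    'year': [1975, 1975, 1976, 1976, 1976],
    'fair': [2, 1, 2, 3, 2],   # 1 = Take advantage, 2 = Fair, 3 = Depends
})

# Recode onto a 0-1 scale, with the more positive answer higher
gss['fair01'] = gss['fair'].map({1: 0.0, 2: 1.0, 3: 0.5})

# Mean response by year, which is what the figure plots
print(gss.groupby('year')['fair01'].mean())
```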
Looking at a figure like this, there are three levels we might describe:
Absolute value: “Most respondents think people try to be fair.”
Rate of change: “Belief in fairness is falling.”
Change in rate: “Belief in fairness is falling, but might be leveling off.”
For any of these qualitative descriptions, we could add quantitative estimates. For example, “About 55% of U.S. residents think people try to be fair”, or “Belief in fairness has dropped 10 percentage points since 1970”.
Statistically, the estimates of absolute value are probably reliable, but we should be more cautious estimating rates of change, and substantially more cautious talking about changes in rates. We’ll come back to this issue, but first let’s look at breakdowns by group.
Outlook and political alignment
In the previous article I grouped respondents by self-reported political alignment: Conservative, Moderate, or Liberal.
We can use these groups to see the relationship between outlook and political alignment. For example, the following figure shows the average response to the fairness question, grouped by political alignment and plotted over time:
Results like these invite comparisons between groups, and we can make those comparisons at several levels. Here are some potential headlines for this figure:
Absolute value: “Moderates have the bleakest outlook; Conservatives and Liberals are more optimistic.”
Rate of change: “Belief in fairness is declining in all groups, but Conservatives are declining fastest.”
Change in rate: “The Liberal outlook was declining, but it leveled off in 1990.” or “Liberals, who had the bleakest outlook in the 1980s, are now the most optimistic”.
Because we divided the respondents into three groups, the sample size in each group is smaller. Statistically, we need to be more skeptical about our estimates of absolute level, even more skeptical about rates of change, and extremely skeptical about changes in rates.
In the next article, I’ll use resampling to quantify the uncertainty of these estimates, and we’ll see how many of these headlines hold up.
In the previous article, I looked at data from the General Social Survey (GSS) to see how political alignment in the U.S. has changed, on the axis from conservative to liberal, over the last 50 years.
The GSS asks respondents where they place themselves on a 7-point scale from “extremely liberal” (1) to “extremely conservative” (7), with “moderate” in the middle (4).
In the previous article I computed the mean and standard deviation of the responses as a way of quantifying the center and spread of the distribution. But it can be misleading to treat categorical responses as if they were numerical. So let’s see what we can do with the categories.
The following plot shows the fraction of respondents who place themselves in each category, plotted over time:
My initial reaction is that these lines are mostly flat. If political alignment is changing in the U.S., it is changing slowly, and the changes might not matter much in practice.
If we look more closely, it seems like the number of people who consider themselves “extreme” is increasing, and the number of moderates might be decreasing. The following plot shows a closer look at the extremes.
There is some evidence of polarization here, but we should not make too much of it. People who consider themselves extreme are still less than 10% of the population, and moderates are still the biggest group, at almost 40%.
To get a better sense of what’s happening with the other groups, I reduced the number of categories to 3: “Conservative” at any level, “Liberal” at any level, and “Moderate”. Here’s what the plot looks like with these categories:
Moderates make up a plurality; conservatives are the next biggest group, followed by liberals.
From 1974 to 1990, the number of people who call themselves “Conservative” was increasing, but it has decreased ever since. And the number of “Liberals” has been increasing since 2000.
At least, that’s what this plot seems to show. We should be careful about over-interpreting patterns that might be random noise. And we might not want to take these categories too seriously, either.
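If you want to reproduce these breakdowns, the category fractions and the three-group collapse might look like this (with a hypothetical mini-extract standing in for the GSS data):

```python
import pandas as pd

# Hypothetical mini-extract; 'polviews' is the 7-point scale, 1-7
gss = pd.DataFrame({
    'year':     [1974, 1974, 1974, 2018, 2018, 2018],
    'polviews': [4, 6, 2, 4, 7, 1],
})

# Fraction of respondents in each of the seven categories, by year
print(pd.crosstab(gss['year'], gss['polviews'], normalize='index'))

# Collapse to three groups: 1-3 Liberal, 4 Moderate, 5-7 Conservative
three = pd.cut(gss['polviews'], bins=[0, 3, 4, 7],
               labels=['Liberal', 'Moderate', 'Conservative'])
print(pd.crosstab(gss['year'], three, normalize='index'))
```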
The hazards of self-reporting
There are several problems with self-reported labels like this.
First, political beliefs are multi-dimensional. “Conservative” and “liberal” are labels for collections of ideas that sometimes go together. But most people hold a mixture of these beliefs.
Also, these labels are relative; that is, when someone says they are conservative, what they often mean is that they are more conservative than the center of the population, or where they think the center is, for the population they have in mind.
Finally, nearly all survey responses are subject to social desirability bias, which is the tendency of people to give answers that make them look better or feel better about themselves.
Over time, the changes we see in these responses depend on actual changes in political beliefs, but they also depend on where the center of the population is, where people think the center is, and the perceived desirability of the labels “liberal”, “conservative”, and “moderate”.
So, in the next article we’ll look more closely at changes in beliefs and attitudes, not just labels.
I am planning to turn these articles into a case study for an upcoming Data Science class, so I welcome comments and questions.
Is the United States getting more conservative? With the rise of the alt-right, Republican control of Congress, and the election of Donald Trump, it might seem so.
Or is the country getting more liberal? With the 2015 Supreme Court decision supporting same-sex marriage, the incremental legalization of marijuana, and recent proposals to expand public health care, you might think so.
Or maybe the country is becoming more polarized, with moderates choosing sides and partisans moving to the extremes.
In a series of articles, I’ll use data from the General Social Survey (GSS) to explore these questions. The GSS goes back to 1972; every second year they survey a representative sample of U.S. residents and ask questions about their political beliefs. Many of the questions have been unchanged for almost 50 years, making it possible to observe long-term trends.
In this article, I’ll look at political alignment, that is, whether the respondents consider themselves liberal or conservative. In subsequent articles, I’ll explore their political beliefs on a range of topics.
Political alignment
From 1974 to the most recent cycle in 2018, the GSS asked the following question, “We hear a lot of talk these days about liberals and conservatives. I’m going to show you a seven-point scale on which the political views that people might hold are arranged from extremely liberal–point 1–to extremely conservative–point 7. Where would you place yourself on this scale?”
The following figure shows the distribution of responses in 1974 and 2018.
In 2018, it looks like there are more 1s (Extremely Liberal) and maybe more 7s (Extremely Conservative). So this figure provides some evidence of polarization.
We can get a better sense of the long term trend by taking the mean of the 7-point scale and plotting it over time. By treating this scale as a numerical quantity, I’m making assumptions about the spacing between the values. The numbers we get don’t mean much in absolute terms, but they provide a quick look at the trend.
It looks like the “center of mass” was increasing until about 1990, which means more conservative on this scale, and has been decreasing ever since. On average the country might be a little more conservative now than it was in 1974.
With the same caveat about treating this scale as a numerical quantity, we can also compute the standard deviation, which measures average distance from the mean, as a way of quantifying polarization.
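Both of these summaries are one-liners once the responses are in a DataFrame (again with a hypothetical mini-extract, and with the same caveat about treating the scale as numerical):

```python
import pandas as pd

# Hypothetical mini-extract; 'polviews' is the 7-point scale, 1-7
gss = pd.DataFrame({
    'year':     [1974, 1974, 1974, 2018, 2018, 2018],
    'polviews': [4, 5, 3, 2, 7, 4],
})

by_year = gss.groupby('year')['polviews']
print(by_year.mean())  # "center of mass" of the scale, by year
print(by_year.std())   # spread of the scale, a rough measure of polarization
```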
The trend is clearly increasing, indicating increasing polarization, but with the way we computed these numbers, it’s hard to get a sense of how substantial the increase is in practical terms.
In the next article, I’ll look more closely at changes in political alignment over time.
I am planning to turn these articles into a case study for an upcoming Data Science class, so I welcome comments and questions.