Browsed by
Tag: bayesian statistics

Emitter Detector Redux

Emitter Detector Redux

In the first edition of Think Bayes, I presented what I called the Geiger counter problem, which is based on an example in Jaynes, Probability Theory. But I was not satisfied with my solution or the way I explained it, so I cut it from the second edition.

I am re-reading Jaynes now, following the excellent series of videos by Aubrey Clayton, and this problem came back to haunt me. On my second attempt, I have a solution that is much clearer, and I think I can explain it better.

I’ll outline the solution here, but for all of the details, you can read the bonus chapter, or click here to run the notebook on Colab.

The Emitter-Detector Problem

Here’s the example from Jaynes, page 168:

We have a radioactive source … which is emitting particles of some sort … There is a rate p, in particles per second, at which a radioactive nucleus sends particles through our counter; and each particle passing through produces counts at the rate θ. From measuring the number {c1 , c2 , …} of counts in different seconds, what can we say about the numbers {n1 , n2 , …} actually passing through the counter in each second, and what can we say about the strength of the source?

As a model of the source, Jaynes suggests we imagine “N nuclei, each of which has independently the probability r of sending a particle through our counter in any one second”. If N is large and r is small, the number of particles emitted in a given second is well modeled by a Poisson distribution with parameter s=Nr, where s is the strength of the source.

As a model of the sensor, we’ll assume that “each particle passing through the counter has independently the probability ϕ of making a count”. So if we know the actual number of particles, n, and the efficiency of the sensor, ϕ, the distribution of the count is Binomial(n,ϕ).

With that, we are ready to solve the problem. Following Jaynes, I’ll start with a uniform prior for s, over a range of values wide enough to cover the region where the likelihood of the data is non-negligible. To represent distributions, I’ll use the Pmf class from empiricaldist.

ss = np.linspace(0, 350, 101)
prior_s = Pmf(1, ss)

For each value of s, the distribution of n is Poisson, so we can form the joint prior of s and n using the poisson function from SciPy. The following function creates a Pandas DataFrame that represents the joint prior.

def make_joint(prior_s, ns):
    ss = prior_s.qs
    S, N = np.meshgrid(ss, ns)
    ps = poisson(S).pmf(N) *
    joint = pd.DataFrame(ps, index=ns, columns=ss) = 'n' = 's'
    return joint

The result is a DataFrame with one row for each value of n and one column for each value of s.

To update the prior, we need to compute the likelihood of the data for each pair of parameters. However, in this problem the likelihood of a given count depends only on n, regardless of s, so we only have to compute it once for each value of n. Then we multiply each column in the prior by this array of likelihoods. The following function encapsulates this computation, normalizes the result, and returns the posterior distribution.

def update(joint, phi, c):
    ns = joint.index
    likelihood = binom(ns, phi).pmf(c)
    posterior = joint.multiply(likelihood, axis=0)
    return posterior

First update

Let’s test the update function with the first example, on page 178 of Probability Theory:

During the first second, c1 = 10 counts are registered. What can [we] say about the number n1 of particles?

Here’s the update:

c1 = 10
phi = 0.1
posterior = update(joint, phi, c1)

The following figures show the posterior marginal distributions of s and n.

posterior_s = marginal(posterior, 0)
posterior_n = marginal(posterior, 1)

The posterior mean of n is close to 109, which is consistent with Equation 6.116. The MAP is 99, which is one less than the analytic result in Equation 6.113, which is 100. It looks like the posterior probabilities for 99 and 100 are the same, but the floating-point results differ slightly.

Jeffreys prior

Instead of a uniform prior for s, we can use a Jeffreys prior, in which the prior probability for each value of s is proportional to 1/s. This has the advantage of “invariance under certain changes of parameters”, which is “the only correct way to express complete ignorance of a scale parameter.” However, Jaynes suggests that it is not clear “whether s can properly be regarded as a scale parameter in this problem.” Nevertheless, he suggests we try it and see what happens. Here’s the Jeffreys prior for s.

prior_jeff = Pmf(1/ss[1:], ss[1:])

We can use it to compute the joint prior of s and n, and update it with c1.

joint_jeff = make_joint(prior_jeff, ns)
posterior_jeff = update(joint_jeff, phi, c1)

Here’s the marginal posterior distribution of n:

posterior_n = marginal(posterior_jeff, 1)

The posterior mean is close to 100 and the MAP is 91; both are consistent with the results in Equation 6.122.

Robot A

Now we get to what I think is the most interesting part of this example, which is to take into account a second observation under two models of the scenario:

Two robots, [A and B], have different prior information about the source of the particles. The source is hidden in another room which A and B are not allowed to enter. A has no knowledge at all about the source of particles; for all [it] knows, … the other room might be full of little [people] who run back and forth, holding first one radioactive source, then another, up to the exit window. B has one additional qualitative fact: [it] knows that the source is a radioactive sample of long lifetime, in a fixed position.

In other words, B has reason to believe that the source strength s is constant from one interval to the next, while A admits the possibility that s is different for each interval. The following figure, from Jaynes, represents these models graphically.

For A, the “different intervals are logically independent”, so the update with c2 = 16 starts with the same prior.

c2 = 16
posterior2 = update(joint, phi, c2)

Here’s the posterior marginal distribution of n2.


The posterior mean is close to 169, which is consistent with the result in Equation 6.124. The MAP is 160, which is consistent with Equation 6.123.

Robot B

For B, the “logical situation” is different. If we consider s to be constant, we can – and should! – take the information from the first update into account when we perform the second update. We can do that by using the posterior distribution of s from the first update to form the joint prior for the second update, like this:

joint = make_joint(posterior_s, ns)
posterior = update(joint, phi, c2)
posterior_n = marginal(posterior, 1)

The posterior mean of n is close to 137.5, which is consistent with Equation 6.134. The MAP is 132, which is one less than the analytic result, 133. But again, there are two values with the same probability except for floating-point errors.

Under B’s model, the data from the first interval updates our belief about s, which influences what we believe about n2.

Going the other way

That might not seem surprising, but there is an additional point Jaynes makes with this example, which is that it also works the other way around: Having seen c2, we have more information about s, which means we can – and should! – go back and reconsider what we concluded about n1.

We can do that by imagining we did the experiments in the opposite order, so

  1. We’ll start again with a joint prior based on a uniform distribution for s
  2. Update it based on c2,
  3. Use the posterior distribution of s to form a new joint prior,
  4. Update it based on c1, and
  5. Extract the marginal posterior for n1.
joint = make_joint(prior_s, ns)
posterior = update(joint, phi, c2)
posterior_s = marginal(posterior, 0)

joint = make_joint(posterior_s, ns)
posterior = update(joint, phi, c1)
posterior_n2 = marginal(posterior, 1)

The posterior mean is close to 131.5, which is consistent with Equation 6.133. And the MAP is 126, which is one less than the result in Equation 6.132, again due to floating-point error.

Here’s what the new distribution of n1 looks like compared to the original, which was based on c1 only.


With the additional information from c2:

  • We give higher probability to large values of s, so we also give higher probability to large values of n1, and
  • The width of the distribution is narrower, which shows that with more information about s, we have more information about n1.


This is one of several examples Jaynes uses to distinguish between “logical and causal dependence.” In this example, causal dependence only goes in the forward direction: “s is the physical cause which partially determines n; and then n in turn is the physical cause which partially determines c”.

Therefore, c1 and c2 are causally independent: if the number of particles counted in one interval is unusually high (or low), that does not cause the number of particles during any other interval to be higher or lower.

But if s is unknown, they are not logically independent. For example, if c1 is lower than expected, that implies that lower values of s are more likely, which implies that lower values of n2 are more likely, which implies that lower values of c2 are more likely.

And, as we’ve seen, it works the other way, too. For example, if c2 is higher than expected, that implies that higher values of s, n1, and c1 are more likely.

If you find the second result more surprising – that is, if you think it’s weird that c2 changes what we believe about n1 – that implies that you are not (yet) distinguishing between logical and causal dependence.

Flipping USB Connectors

Flipping USB Connectors

I am not the first person to observe that it sometimes takes several tries to plug in a USB connector (specifically the rectangular Type A connector, which is not reversible). There are memes about it, there are cartoons about it, and on sites like Quora, people have asked about it more than a few times.

But I might be the first to use Bayesian decision analysis to figure out the optimal strategy for plugging in a USB connector. Specifically, I have worked out how long you should try on the first side before flipping, how long you should try on the second side before flipping again, how long you should try on the third side, and so on.

For a high-level view of the analysis, see this article in Towards Data Science.

For the details, you can  read the Jupyter notebook on the Think Bayes site or run it on Colab.

What’s new in Think Bayes 2?

What’s new in Think Bayes 2?

I’m happy to report that the second edition of Think Bayes is available for preorder now.

Cover of Think Bayes second edition

What’s new in the second edition?

  • I wrote a new Chapter 1 that introduces conditional probability using the Linda the Banker problem and data from the General Social Survey.
  • I added new chapters on survival analysis, linear regression, logistic regression, conjugate priors, MCMC, and ABC.
  • I added a lot of new examples and exercises, most from classes I taught using the first edition.
  • I rewrote all of the code using NumPy, SciPy, and Pandas (rather than basic Python types). The new code is shorter, clearer, and faster!
  • For every chapter, there’s a Jupyter notebook where you can read the text, run the code, and work on exercises. You can run the notebooks on your own computer or, if you don’t want to install anything, you can run them on Colab.

More generally, the second edition reflects everything I’ve learned in the 10 years since I started the first edition, and it benefits from the comments, suggestions, and corrections I’ve received from readers. I think it’s really good!

If you would like to preorder, click here.

Fair cross-section

Fair cross-section

Abstract: The unusual circumstances of Curtis Flowers’ trials make it possible to estimate the probabilities that white and black jurors would vote to convict him, 98% and 68% respectively, and the probability a jury of his peers would find him guilty, 15%.


Curtis Flowers was tried six times for the same crime. Four trials ended in conviction; two ended in a mistrial due to a hung jury.

Three of the convictions were invalidated by the Mississippi Supreme Court, at least in part because the prosecution had excluded black jurors, depriving Flowers of the right to trial by a jury composed of a “fair cross-section of the community“.

In 2019, the fourth conviction was invalidated by the Supreme Court of the United States for the same reason. And on September 5, 2020, Mississippi state attorneys announced that charges against him would be dropped.

Because of the unusual circumstances of these trials, we can perform a statistical analysis that is normally impossible: we can estimate the probability that black and white jurors would vote to convict, and use those estimates to compute the probability that he would be convicted by a jury that represents the racial makeup of Montgomery County.


According to my analysis, the probability that a white juror in this pool would vote to convict Flowers, given the evidence at trial, is 98%. The same probability for black jurors is 68%. So this difference is substantial.

The probability that Flowers would be convicted by a fair jury is only 15%, and the probability that he would be convicted four times out of six times is less than 1%.

The following figure shows the probability of a guilty verdict as a function of the number of black jurors:

According to the model, the probability of a guilty verdict is 55% with an all-white jury. If the jury includes 5-6 black jurors, which would be representative of Montgomery County, the probability of conviction would be only 14-15%.

The shaded area represents a 90% credible interval. It is quite wide, reflecting uncertainty due to limits of the data. Also, the model is based on the simplifying assumptions that

  • All six juries saw essentially the same evidence,
  • The probabilities we’re estimating did not change substantially over the period of the trials,
  • Interactions between jurors had negligible effects on their votes,
  • If any juror refuses to convict, the result is a hung jury.

For the details of the analysis, you can

Thanks to the Law Office of Zachary Margulis-Ohnuma for their assistance with this article and for their continuing good work for equal justice.

Alice and Bob exchange data

Alice and Bob exchange data

Two questions crossed my desktop this week, and I think I can answer both of them with a single example.

On Twitter, Kareem Carr asked, “If Alice believes an event has a 90% probability of occurring and Bob also believes it has a 90% chance of occurring, what does it mean to say they have the same degree of belief? What would we expect to observe about both Alice’s and Bob’s behavior?”

And on Reddit, a reader of /r/statistics asked, “I have three coefficients from three different studies that measure the same effect, along with their 95% CIs. Is there an easy way to combine them into a single estimate of the effect?”

So let me tell you a story:

One day Alice tells her friend, Bob, “I bought a random decision-making box. Every time you press this button, it says ‘yes’ or ‘no’. I’ve tried it a few times, and I think it says ‘yes’ 90% of the time.”

Bob says he has some important decisions to make and asks if he can borrow the box. The next day, he returns the box to Alice and says, “I used the box several times, and I also think it says ‘yes’ 90% of the time.”

Alice says, “It sounds like we agree, but just to make sure, we should compare our predictions. Suppose I press the button twice; what do you think is the probability it says ‘yes’ both times?”

Bob does some calculations and reports the predictive probability 81.56%.

Alice says, “That’s interesting. I got a slightly different result, 81.79%. So maybe we don’t agree after all.”

Bob says, “Well let’s see what happens if we combine our data. I can tell you how many times I pressed the button and how many times it said ‘yes’.”

Alice says, “That’s ok, I don’t actually need your data; it’s enough if you tell me what prior distribution you used.”

Bob tells her he used a Jeffreys prior.

Alice does some calculations and says, “Ok, I’ve updated my beliefs to take into account your data as well as mine. Now I think the probability of ‘yes’ is 91.67%.”

Bob says, “That’s interesting. Based on your data, you thought the probability was 90%, and based on my data, I thought it was 90%, but when we combine the data, we get a different result. Tell me what data you saw, and let me see what I get.”

Alice tells him she pressed the button 8 times and it always said ‘yes’.

“So,” says Bob, “I guess you used a uniform prior.”

Bob does some calculations and reports, “Taking into account all of the data, I think the probability of ‘yes’ is 93.45%.”

Alice says, “So when we started, we had seen different data, but we came to the same conclusion.”

“Sort of,” says Bob, “we had the same posterior mean, but our posterior distributions were different; that’s why we made different predictions for pressing the button twice.”

Alice says, “And now we’re using the same data, but we have different posterior means. Which makes sense, because we started with different priors.”

“That’s true,” says Bob, “but if we collect enough data, eventually our posterior distributions will converge, at least approximately.”

“Well that’s good,” says Alice. “Anyway, how did those decisions work out yesterday?”

“Mostly bad,” says Bob. “It turns out that saying ‘yes’ 93% of the time is a terrible way to make decisions.”

If you would like to know how any of those calculations work, you can see the details in a Jupyter notebook:

And if you don’t want the details, here is the summary:

  • If two people have different priors OR they see different data, they will generally have different posterior distributions.
  • If two posterior distributions have the same mean, some of their predictions will be the same, but many others will not.
  • If you are given summary statistics from a posterior distribution, you might be able to figure out the rest of the distribution, depending on what other information you have. For example, if you know the posterior is a two-parameter beta distribution (or is well-modeled by one) you can recover it from the mean and second moment, or the mean and a credible interval, or almost any other pair of statistics.
  • If someone has done a Bayesian update using data you don’t have access to, you might be able to “back out” their likelihood function by dividing their posterior distribution by the prior.
  • If you are given a posterior distribution and the data used to compute it, you can back out the prior by dividing the posterior by the likelihood of the data (unless the prior contains values with zero likelihood).
  • If you are given summary statistics from two posterior distributions, you might be able to combine them. In general, you need enough information to recover both posterior distributions and at least one prior.
Maxima, Minima, and Mixtures

Maxima, Minima, and Mixtures

I am hard at work on the second edition of Think Bayes, currently working on Chapter 6, which is about computing distributions of minima, maxima and mixtures of other distributions.

Of all the changes in the second edition, I am particularly proud of the exercises. I present three new exercises from Chapter 6 below. If you want to work on them, you can use this notebook, which contains the material you will need from the chapter and some code to get you started.

Exercise 1

Henri Poincaré was a French mathematician who taught at the Sorbonne around 1900. The following anecdote about him is probably fabricated, but it makes an interesting probability problem.

Supposedly Poincaré suspected that his local bakery was selling loaves of bread that were lighter than the advertised weight of 1 kg, so every day for a year he bought a loaf of bread, brought it home and weighed it. At the end of the year, he plotted the distribution of his measurements and showed that it fit a normal distribution with mean 950 g and standard deviation 50 g. He brought this evidence to the bread police, who gave the baker a warning.

For the next year, Poincaré continued the practice of weighing his bread every day. At the end of the year, he found that the average weight was 1000 g, just as it should be, but again he complained to the bread police, and this time they fined the baker.

Why? Because the shape of the distribution was asymmetric. Unlike the normal distribution, it was skewed to the right, which is consistent with the hypothesis that the baker was still making 950 g loaves, but deliberately giving Poincaré the heavier ones.

To see whether this anecdote is plausible, let’s suppose that when the baker sees Poincaré coming, he hefts n loaves of bread and gives Poincaré the heaviest one. How many loaves would the baker have to heft to make the average of the maximum 1000 g?

Exercise 2

Two doctors fresh out of medical school are arguing about whose hospital delivers more babies. The first doctor says, “I’ve been at Hospital A for two weeks, and already we’ve had a day when we delivered 20 babies.”

The second doctor says, “I’ve only been at Hospital B for one week, but already there’s been a 19-baby day.”

Which hospital do you think delivers more babies on average? You can assume that the number of babies born in a day is well modeled by a Poisson distribution.

Exercise 3

Suppose I drive the same route three times and the fastest of the three attempts takes 8 minutes.

There are two traffic lights on the route. As I approach each light, there is a 40% chance that it is green; in that case, it causes no delay. And there is a 60% chance it is red; in that case it causes a delay that is uniformly distributed from 0 to 60 seconds.

What is the posterior distribution of the time it would take to drive the route with no delays?

The solution to this exercise is very similar to a method I developed for estimating the minimum time for a packet of data to travel through a path in the internet.

Again, here’s the notebook where you can work on these exercises. I will publish solutions later this week.

Bayesian hypothesis testing

Bayesian hypothesis testing

I have mixed feelings about Bayesian hypothesis testing. On the positive side, it’s better than null-hypothesis significance testing (NHST).

And it is probably necessary as an onboarding tool: Hypothesis testing is one of the first things future Bayesians ask about; we need to have an answer.

On the negative side, Bayesian hypothesis testing is often unsatisfying because the question it answers is not the most useful question to ask.

To explain, I’ll use an example from Bite Size Bayes, which is a series of Jupyter notebooks I am writing to introduce Bayesian statistics.

In Notebook 7, I present the following problem from David MacKay’s book, Information Theory, Inference, and Learning Algorithms:

“A statistical statement appeared in The Guardian on Friday January 4, 2002:

“When spun on edge 250 times, a Belgian one-euro coin came up heads 140 times and tails 110. ‘It looks very suspicious to me’, said Barry Blight, a statistics lecturer at the London School of Economics. ‘If the coin were unbiased the chance of getting a result as extreme as that would be less than 7%’.”

“But [asks MacKay] do these data give evidence that the coin is biased rather than fair?”

I start by formulating the question as an estimation problem. That is, I assume that the coin has some probability, x, of landing heads, and I use the data to estimate it.

If we assume that the prior distribution is uniform, which means that any value between 0 and 1 is equally likely, the posterior distribution looks like this:

Posterior distribution of x, which is the probability of heads, given a uniform prior.

This distribution represents everything we know about x given the prior and the data. And we can use it to answer whatever questions we have about the coin.

So let’s answer MacKay’s question: “Do these data give evidence that the coin is biased rather than fair?”

The question implies that we should consider two hypotheses:

  • The coin is fair.
  • The coin is biased.

In classical hypothesis testing, we would define a null hypothesis, choose a test statistic, and compute a p-value. That’s what the statistician quoted in The Guardian did. His null hypothesis is that the coin is fair. The test statistic is the difference between the observed number of heads (140) and the expected number under the null hypothesis (125). The p-value he computes is 7%, which he describes as “suspicious”.

In Bayesian hypothesis testing, we choose prior probabilities that represent our degree of belief in the two hypotheses. Then we compute the likelihood of the data under each hypothesis. The details are in Bite Size Bayes Notebook 12.

In this example the answer depends on how we define the hypothesis that the coin is biased:

  • If you know ahead of time that the probability of heads is exactly 56%, which is the fraction of heads in the dataset, the data are evidence in favor of the biased hypothesis.
  • If you don’t know the probability of heads, but you think any value between 0 and 1 is equally likely, the data are evidence in favor of the fair hypothesis.
  • And if you have knowledge about biased coins that informs your beliefs about x, the data might support the fair or biased hypothesis.

In the notebook I summarize these results using Bayes factors, which quantify the strength of the evidence. If you insist on doing Bayesian hypothesis testing, reporting a Bayes factor is probably a good choice.

But in most cases I think you’ll find that the answer is not very satisfying. As in this example, the answer is often “it depends”. But even when the hypotheses are well defined, a Bayes factor is generally less useful than a posterior distribution, because it contains less information.

The posterior distribution incorporates everything we know about the coin; we can use it to compute whatever summary statistics we like and to inform decision-making processes. We’ll see examples in the next two notebooks.