Sadly, today is my last day at DrivenData, so it’s a good time to review one of the projects I’ve been working on, using probabilistic predictions from Zamba to find animals in camera trap videos, like this:

Most recently, I’ve been working on calibrating the predictions from convolutional neural networks (CNNs). I haven’t written about it, but at ODSC East 2023, I’m giving a talk about it:

Don’t Discard That Pangolin, Calibrate Your Deep Learning Classifier

Suppose you are an ecologist studying a rare species like a pangolin. You can use motion-triggered camera traps to collect data about the presence and abundance of species in the wild, but for every video showing a pangolin, you have 100 that show other species, and 100 more that are blank. You might have to watch hours of video to find one pangolin.

Deep learning can help. Project Zamba provides models that classify camera trap videos and identify the species that appear in them. Of course, the results are not perfect, but we can often remove 80% of the videos we don’t want while losing only 10-20% of the videos we want.

But there’s a problem. The output from deep learning classifiers is generally a “confidence score”, not a probability. If a classifier assigns a label with 80% confidence, that doesn’t mean there is an 80% chance it is correct. However, with a modest number of human-generated labels, we can often calibrate the output to produce more accurate probabilities, and make better predictions.

In this talk, I’ll present use cases based on data from Africa, Guam, and New Zealand, and show how we can use deep learning and calibration to save the pangolin… or at least the pangolin videos. This real-world problem shows how users of ML models can tune the results to improve performance on their applications.

The ODSC schedule isn’t posted yet, but I’ll fill in the details later.

Way back in 2013, I wrote this blog post explaining why you should never use a statistical test to check whether a sample came from a Gaussian distribution. I argued that data from the real world never come from a Gaussian distribution, or any other simple mathematical model, so the answer to the question is always no. And there are only two possible outcomes from the test:

If you have enough data, the test will reject the hypothesis that the data came from a Gaussian distribution, or

If you don’t have enough data, the test will fail to reject the hypothesis.

Either way, the result doesn’t tell you anything useful.

In this article, I will explore a particular example and demonstrate this relationship between the sample size and the outcome of the test. And I will conclude, again, that

Choosing a distribution is not a statistical question; it is a modeling decision. No statistical test can tell you whether a particular distribution is a good model for your data.

I’ll start by generating a sample that is actually from a lognormal distribution, then use the sample mean and standard deviation to make a Gaussian model. Here’s what the empirical distribution of the sample looks like compared to the CDF of the Gaussian distribution.

It looks like the Gaussian distribution is a pretty good model for the data, and probably good enough for most purposes.

According to the Anderson-Darling test, the test statistic is 1.7, which exceeds the critical value, 0.77, so at the 5% significance level, we can reject the hypothesis that this sample came from a Gaussian distribution. That’s the right answer, so it might seem like we’ve done something useful. But we haven’t.

Sample size

The result from the A-D test depends on the sample size. The following figure shows the probability of rejecting the null hypothesis as a function of sample size, using the lognormal distribution from the previous section.

When the sample size is more than 200, the probability of rejection is high. When the sample size is less than 100, the probability of rejection is low. But notice that it doesn’t go all the way to zero, because there is always a 5% chance of a false positive.

The critical value is about 120; at that sample size, the probability of rejecting the null is close to 50%.

So, again, if you have enough data, you’ll reject the null; otherwise you probably won’t. Either way, you learn nothing about the question you really care about, which is whether the Gaussian model is a good enough model of the data for your purposes.

That’s a modeling decision, and no statistical test can help. In the original article, I suggested some methods that might.

A recent question on Reddit asked about using resampling with logistic regression. The responses suggest two ways to do it, one parametric and one non-parametric. I implemented both of them and then invented a third, which is hybrid of the two.

Different ways of computing sampling distributions – and the statistics derived from them, like standard errors and confidence intervals – yield different results. None of them are right or wrong; rather, they are based on different modeling assumptions.

In this example, it is easy to implement multiple models and compare the results. If they were substantially different, we would need to think more carefully about the modeling assumptions they are based on and choose the one we think is the best description of the data-generating process.

But in this example, the differences are small enough that they probably don’t matter in practice. So we are free to choose whichever is easiest to implement, or fastest to compute, or convenient in some other way.

It is a common error to presume that the result of an analytic method is uniquely correct, and that results from computational methods like resampling are approximations to it. Analytic methods are often fast to compute, but they are always based on modeling assumptions and often based on approximations, so they are no more correct than computational methods.

The Centers for Disease Control and Prevention states on its website that “in the United States, cigarette smoking causes about 90% of lung cancers.” If S is the event “smokes cigarettes” and L is the event “has lung cancer,” then the probability 0.90 is expressed in probability notation as

P(S and L).

P(S | L).

P(L | S).

Let’s consider a scenario that’s not exactly what the question asks about, but will help us understand the relationships among these quantities:

Suppose 20% of people smoke, so out of 1000 people, we have 200 smokers and 800 nonsmokers.

Suppose 1% of nonsmokers get lung cancer, so out of 800 nonsmokers, there would be 8 cases of lung cancer, 0 caused by smoking.

And suppose 20% of smokers get lung cancer, 19% caused by smoking and 1% caused by something else (same as the nonsmokers). Out of 200 smokers, there would be 40 cases of lung cancer, 38 caused by smoking.

In this scenario, there are a total of 48 cases of lung cancer, 38 caused by smoking. So smoking caused 38/48 cancers, which is 79%.

P(S and L) is 40 / 1000, which is 4%.

P(S | L) = 40 / 48, which is 83%.

P(L | S) = 40 / 200, which is 20%.

From this scenario we can conclude:

The percentage of cases caused by smoking does not correspond to any of the listed probabilities, so the answer to the question is “None of the above”.

In order to compute these quantities, we need to know the percentage of smokers and the risk ratio for smokers vs nonsmokers.

In reality, the relationships among these quantities are complicated by time: the percentage of smokers changes over time, and there is a long lag between smoking and cancer diagnosis.

How can I find the least frequent value (antimode) between 2 modes in a bimodal distribution?

I’m only mildly self taught in anything in statistics so please be patient with my ignorance. I’ve found so little info on a Google search for “antimode” that I thought it was a word made up by the author of a random article.

Then I found one tiny mention on the Wikipedia page for “Multimodal distribution” but no citation or details beyond that it’s the least frequent value between modes.

What data do I need in order to find this number and what is the formula?

This site had a short mention of it and in their example listed:

Mode A: 33.25836

Mode B: 71.55446

Antimode: 55.06092

But I can’t seem to reverse engineer it with just this data.

Here’s the reply I wrote:

With continuous data, there is no off-the-shelf formula to compute modes or antimodes. You have to make some modeling decisions.

One option is to use kernel density estimation KDE. Adjust the parameters until, in your judgment, the result is a good representation of the distribution, and then you can read off maxima and minima.

“Tell me if you agree or disagree with this statement: Most men are better suited emotionally for politics than are most women.”

That’s one of the questions on the General Social Survey. In 1974, when it was first asked, 47% of respondents said they agreed. In 2018, the most recent survey where this question was asked, that percentage was down to 14%.

Last week I posed the same question to an audience at PyData NYC. Of 56 people who responded, none of them agreed with the statement (0%).

However, if we take the question literally, the correct answer is “I don’t know”. In fact, it is almost certain that either

Most men are better suited emotionally for politics than are most women, or

Most women are better suited emotionally for politics than are most men.

The reason is that saying that “most A are better than most B” is equivalent to saying that the median in group A is higher than the median in group B.

Here’s the proof: If one median is higher than the other, we can find a value between them where most members of group A are higher (by the definition of median) and most members of group B are lower. So unless the medians are identical, either “most A are better than most B” or the other way around.

And what does it mean to be “suited emotionally for politics”? I don’t know, but if we had to quantify it, it would be some combination of emotional factors. Men and women differ in many ways, on average; the difference is small for some factors and bigger for others. So it is likely that a combination of factors also differs between the groups. And even if the difference is very small, the medians are unlikely to be identical.

So, if we take the question literally, the best answer is “I don’t know”. But it’s clear that people don’t take the question literally. Instead, they are probably answering something like “Do you think men are so much better suited emotionally for politics that they should hold a large majority of positions of power?”

This talk is based on a chapter of my forthcoming book called Probably Overthinking It that is about using evidence and reason to answer questions and make better decisions. If you would like to get an occasional update about the book, please join my mailing list.

The results I reported are from 16 questions from the General Social Survey (GSS). If you would like to see the text of the questions, and answer them yourself, you can

In honor of NASA’s successful DART mission, here’s a relevant excerpt from my forthcoming book, Probably Overthinking It.

On March 11, 2022, an astronomer near Budapest, Hungary detected a new asteroid, now named 2022 EB5, on a collision course with Earth. Less than two hours later, it exploded in the atmosphere near Greenland. Fortunately, no large fragments reached the surface, and they caused no damage.

We have not always been so lucky. In 1908, a much larger asteroid entered the atmosphere over Siberia, causing an estimated 2 megaton explosion, about the same size as the largest thermonuclear device tested by the United States. The explosion flattened something like 80 million trees in an area covering 2100 square kilometers, almost the size of Rhode Island. Fortunately, the area was almost unpopulated; a similar-sized impact could destroy a large city.

These events suggest that we would be wise to understand the risks we face from large asteroids. To do that, we’ll look at evidence of damage they have done in the past: the impact craters on the Moon.

The largest crater on the near side of the moon, named Bailly, is 303 kilometers in diameter; the largest on the far side, the South Pole-Aitken basin, is roughly 2500 kilometers in diameter.

In addition to large, visible craters like these, there are innumerable smaller craters. The Lunar Crater Database catalogs nearly all of the ones larger than one kilometer, about 1.3 million in total. It is based on images taken by the Lunar Reconnaissance Orbiter, which NASA sent to the Moon in 2009.

The following figures show the distribution of their sizes on a log scale, compared to a lognormal model. The left figure shows the tail distribution on a linear scale; the right figure shows the same curve on a log scale.

Since the dataset does not include craters smaller than one kilometer, I cut off the model at the same threshold. We can assume that there are many smaller craters, but with this dataset we can’t tell what the distribution of their sizes looks like.

Looking at the figure on the left, we can see a discrepancy between the data and the model between 3 and 10 kilometers. Nevertheless, the model fits the logarithms of the diameters well, so we could conclude that the distribution of crater sizes is approximately lognormal.

However, looking at the figure on the right, we can see big differences between the data and the model in the tail. In the dataset, the fraction of craters as big as 100 kilometers is about 250 per million; according to the model, it would be less than 1 per million. The dotted line in the figure shows this difference.

Going farther into the tail, the fraction of craters as big as 1000 kilometers is about 3 per million; according to the model, it would be less than one per trillion. And the probability of a crater as big as the South Pole-Aitken basin is about 50 per sextillion.

If we are only interested in craters less than 10 kilometers in diameter, the lognormal model might be good enough. But for questions related to the biggest craters, it is way off.

The following figure shows the distribution of crater sizes again, compared to a log-t model [which is the abbreviated name I use for a Student-t distribution on a log scale].

The figure on the left shows that the model fits the data well in the middle of the distribution, a little better than the lognormal model. The figure on the right shows that the log-t model fits the tail of the distribution substantially better than the lognormal model.

It’s not perfect: there are more craters near 100 km than the model expects, and fewer craters larger than 1000 km. As usual, the world is under no obligation to follow simple rules, but this model does pretty well.

We might wonder why. To explain the distribution of crater sizes, it helps to think about where they come from. Most of the craters on the Moon were formed during a period in the life of the solar system called the “Late Heavy Bombardment”, about 4 billion years ago. During this period, an unusual number of asteroids were displaced from the asteroid belt – possibly by interactions with large outer planets – and some of them collided with the Moon.

We assume that some of them also collided with the Earth, but none of the craters they made still exist; due to plate tectonics and volcanic activity, the surface of Earth has been recycled several times since the Late Heavy Bombardment.

But the Moon is volcanically inert and it has no air or water to erode craters away, so its craters are visible now in almost the same condition they were 4 billion years ago (with the exception of some that have been disturbed by later impacts and lunar spacecraft).

Asteroids

As you might expect, there is a relationship between the size of an asteroid and the size of the crater it makes: in general, a bigger asteroid makes a bigger crater. So, to understand why the distribution of crater sizes is long-tailed, let’s consider the distribution of asteroid sizes.

The Jet Propulsion Laboratory (JPL) and NASA provide data related to asteroids, comets and other small bodies in our solar system. From their Small-Body Database, I selected asteroids in the “main asteroid belt” between the orbits of Mars and Jupiter.

There are more than one million asteroids in this dataset, about 136,000 with known diameter. The largest are Ceres (940 kilometers in diameter), Vesta (525 km), Pallas (513 km), and Hygeia (407 km). The smallest are less than one kilometer.

The following figure shows the distribution of asteroid sizes compared to a lognormal model.

As you might expect by now, the lognormal model fits the distribution of asteroid sizes in the middle of the range, but it does not fit the tail at all. Rather than belabor the point, let’s get to the log-t model, as shown in the following figure.

On the left, we see that the log-t model fits the data well near the middle of the range, except possibly near 1 kilometer.

On the right, we see that the model does not fit the tail of the distribution particularly well: there are more asteroids near 100 km than the model predicts. So the distribution of asteroid sizes does not strictly follow a log-t distribution. Nevertheless, it is clearly longer-tailed than a lognormal distribution, and we can use it to explain the distribution of crater sizes, as I’ll show in the next section.

Origins of Long-Tailed Distributions

One of the reasons long-tailed distributions are common in natural systems is that they are persistent; for example, if a quantity comes from a long-tailed distribution and you multiply it by a constant or raise it to a power, the result follows a long-tailed distribution.

Long-tailed distributions also persist when they interact with other distributions. When you add together two quantities, if either comes from a long-tailed distribution, the sum follows a long-tailed distribution, regardless of what the other distribution looks like. Similarly, when you multiply two quantities, if either comes from a long-tailed distribution, the product usually follows a long-tailed distribution. This property might explain why the distribution of crater sizes is long-tailed.

Empirically – that is, based on data rather than a physical model – the diameter of an impact crater depends on the diameter of the projectile that created it, raised to the power 0.78, and on the impact velocity raised to the power 0.44. It also depends on the density of the asteroid and the angle of impact.

As a simple model of this relationship, I’ll simulate the crater formation process by drawing asteroid diameters from the distribution in the previous section and drawing the other factors – density, velocity, and angle – from a lognormal distribution with parameters chosen to match the data. The following figure shows the results from this simulation along with the actual distribution of crater sizes.

In the center of the distribution (left) and in the tail (right), the simulation results are a good match for the data. This example suggests that the distribution of crater sizes can be explained by the relationship between the distributions of asteroid sizes and other factors like velocity and density.

In turn, there are physical models that might explain the distribution of asteroid sizes. Our best current understanding is that the asteroids in the asteroid belt were formed by dust particles that collided and stuck together. This process is called “accretion”, and simple models of accretion processes can yield long-tailed distributions.

So it may be that craters are long-tailed because of asteroids, and asteroids are long-tailed because of accretion.

In The Fractal Geometry of Nature, Benoit Mandelbrot proposes what he calls a “heretical” explanation for the prevalence of long-tailed distributions in natural systems: there may be only a few systems that generate long-tailed distributions, but interactions between systems might cause them to propagate.

He suggests that data we observe are often “the joint effect of a fixed underlying true distribution and a highly variable filter”, and “a wide variety of filters leave their asymptotic behavior unchanged”.

The long tail in the distribution of asteroid sizes is an example of “asymptotic behavior”. And the relationship between the size of an asteroid and the size of the crater it makes is an example of a “filter”. In this relationship, the size of the asteroid gets raised to a power and multiplied by a “highly-variable” lognormal distribution. These operations change the location and spread of the distribution, but they don’t change the shape of the tail.

When Mandelbrot wrote in the 1970s, this explanation might have been heretical, but now long-tailed distributions are more widely known and better understood. What was heresy then is orthodoxy now.

On September 25, 2022, Eliud Kipchoge ran the Berlin Marathon in 2:01:09, breaking his own world record by 30 seconds and taking another step in the progression toward a two-hour marathon.

In a previous article, I noted that the marathon record speed since 1970 has been progressing linearly over time, and I proposed a model that explains why we might expect it to continue. Based on a linear extrapolation of the data so far, I predicted that someone would break the two hour barrier in 2036, plus or minus five years.

Now it is time to update my predictions in light of the new record. The following figure shows the progression of world record speed since 1970 (orange dots), a linear fit to the data (green line) and a 90% predictive confidence interval (shaded area).

This model predicts that we will see a two-hour marathon in 2035 plus or minus 6 years. Since the last two points are above the long-term trend, we might expect to cross the finish line on the early end of that range.

If you like this sort of thing, you will like my forthcoming book, called Probably Overthinking It, which is about using evidence and reason to answer questions and guide decision making. If you would like to get an occasional update about the book, please join my mailing list.

I’m working on a book called Probably Overthinking It that is about using evidence and reason to answer questions and guide decision making. If you would like to get an occasional update about the book, please join my mailing list.

In the previous article, I used data from the General Social Survey (GSS) to show that polarization on an individual level has increased since the 1970s, but not by very much. I identified fifteen survey questions that distinguish conservatives and liberals; for each respondent, I estimated the number of conservative responses they would give to these questions.

Since the 1970s, the average number of conservative responses had decreased consistently. The spread of the distribution has increased slightly, but if we quantify the spread as a mean absolute difference, here’s what it means: in 1986, if you chose two people at random and asked them the fifteen questions, they would differ on 3.4 questions, on average. If you repeated the experiment in 2016, they would differ by 3.6.

That is not a substantial change.

But even without polarization at the individual level, there can be polarization at the level of political parties. So that’s what this article is about.

Here’s an overview of the results:

In the 1970s, there was not much difference between Democrats and Republicans. On average, their answers to the fifteen questions were about the same.

Since then, both parties have moved to the left (and nonpartisans, too). But the Democrats have moved faster, so the gap between the parties has increased.

Although it is tempting to interpret these results as a case of Democrats careening off to the left, that is a misleading way to tell the story, because it sounds like we are following a group of people over time. The groups we call Democrats and Republicans are not the same groups of people over time. The composition of the groups changes due to:

Generational replacement: In the conveyor belt of demography, when old people die, they are replaced by young people, and

Partisan sorting: No one is born Democrat or Republican; rather, they choose a party (or not) at some point in their lives, and this sorting process has changed over time.

These two factors explain the increasing ideological difference between Democrats and Republicans: polarization between the parties is not caused by people changing their minds; it is caused by changes in the composition of the groups.

Let me show you what I mean.

The race to the bottom

Each GSS respondent was asked “Generally speaking, do you usually think of yourself as a Republican, Democrat, Independent, or what?” Their responses were coded on a 7-point scale, but for simplicity, I’ve reduced it to three: Republican, Democrat, and Nonpartisan (which I think is more precise than “Independent”).

The following figure shows how the percentage of respondents in each group has changed over time:

The percentage of Democrats was decreasing until 2000; the percentage of Republicans has decreased since. But there is no sign of increasing partisanship. In fact, the percentage of Nonpartisans has increased to the point where they are now the plurality.

Now let’s see what these groups believe, on average. The following figure shows the estimated number of conservative responses for each group over time.

In the 1970s, there was not much difference between Democrats and Republicans. In fact, Democrats were more conservative than Nonpartisans. Since then, all three groups have moved to the left; that is, they give more liberal responses to the fifteen questions. But Democrats have moved faster than the other groups.

These are just averages; we can get a better view of what’s happening by looking at the distributions. The following figure shows the distribution of responses for the three groups in 1988 and 2018.

In 1988, the three groups were almost indistinguishable. In 2018, they still overlap substantially, but Democrats have shifted farther to the left than the other two groups. As a result, the difference in the means between the groups has increased, as shown in this figure:

Now it might be clearer why I chose 1988, which is close to the lowest point in the long-term trend, and 2018, which is the most recent point that is not subject to the effects of data collection during the pandemic.

It might be tempting, especially for conservative Republicans, to interpret these graphs as a case of Democrats going off the rails, but I think that’s misleading. Again, these groups are not made up of the same people over time, so this is not a story about people whose views are changing. It is a story about groups whose composition is changing.

Partisan sorting

In the 1970s, there was not much difference between Democrats and Republicans, in term of their political views. Now, in the 2020s, there is. So what changed?

To find out, let’s look at the relationship between conservatism, as measured by responses to the fifteen questions, and party affiliation. The following figure shows a scatter plot of these values in 1988 and 2018. The purple lines show a smoothed average of party affiliation as a function of conservatism.

In 1988, the relationship between these values was weak. Someone who gave conservative responses to the questions was only marginally more likely to identify as Republican; someone who gave liberal responses was only marginally more likely to identify as Democrat.

Since then, the relationship has grown stronger. The following figure shows the correlation between these values over time:

In 2018, the correlation was about 0.15, which is quite weak; in 2018 it was almost 0.4. That’s substantially higher, but it is still not a strong correlation. As you can see in the scatter plots, there are still people with liberal views who call themselves Republicans (even if Republicans have different names for them) and people with conservative views who consider themselves Democrats.

This is why I think it’s misleading to say that Democrats have moved to the left; rather, people have sorted themselves into parties according to their beliefs, at least more than they used to.

Consider this analogy: In my first year of middle school, we all took the same physical education class, so we all played the same sports. During some weeks, everyone played basketball; during other weeks, everyone wrestled. So that year, the wrestlers and the basketball players were the same height, on average.

The next year, we got to choose which sports to play. As you would expect, taller people were more likely to choose basketball and shorter people were more likely to wrestle. So, all of a sudden, the basketball players were taller than the wrestlers, on average.

Does that mean the basketball players got taller and the wrestlers got shorter? That would not be a reasonable interpretation, because they were not the same groups of people. The increased difference between the groups was entirely because of how they sorted themselves out.

Let’s see if the same is true for Democrats and Republicans.

Getting counterfactual

How much of the increased difference since the 1980s can be explained by partisan sorting? To answer that question, I used ordinal logistic regression to model the relationship between conservative views and party affiliation for each year of the survey. Then I used the model to simulate the sorting process under two scenarios:

With observed changes partisan sorting: In this version, I used the model from each year to simulating the sorting process each year — so we expect the results to be similar to the actual data.

With no change in partisan sorting: In this version, I built a model using data from the low correlation period (1973 to 1985) — then I used this model to simulate the sorting process for every year.

The second scenario is meant to simulate what would have happened if the sorting process had not changed. The following figure shows the results.

When the model includes the observed changes in partisan sorting, it matches the data well, which shows that the model includes the essential features that replicate the observed increase in the difference between Democrats and Republicans.

When we run the model with no increase in partisan sorting, there is no increase in the difference between Democrats and Republicans. This result shows that the observed increase in partisan sorting is sufficient to explain the entire increase in difference between the parties.

Alignment is not polarization

Compared to the 1980s, there is more alignment now between political ideology and political parties. Liberals are more likely to identify as Democrats and conservatives are more likely to identify as Republicans. As a result, the parties are more differentiated now than they were.

This alignment might not be a bad thing. In a two party system, it might be desirable if one party represents a more conservative world view than the other. If voters have only two options, the options should be different in ways voters care about. That way, at least, you know what you are voting for. The alternative, with no substantial difference between parties, is a recipe for voter frustration and disengagement.

Of course, there are problems with extreme partisanship. But I don’t think a moderate level of alignment — which is what we have — is necessarily a problem.