Author: AllenDowney

Probably the Book

Last week I had the pleasure of presenting a keynote at posit::conf(2024). When the video is available, I will post it here [UPDATE: here it is].

In the meantime, you can read the slides, if you don’t mind spoilers.

For people at the conference who don’t know me, this might be a good time to introduce you to this blog, where I write about data science and Bayesian statistics, and to Probably Overthinking It, the book based on the blog, which was published by University of Chicago Press last December. Here’s an outline of the book with links to excerpts I’ve published in the blog and talks I’ve presented based on some of the chapters.

For your very own copy, you can order from Bookshop.org if you want to support independent bookstores, or Amazon if you don’t.

Twelve Excellent Chapters

In Chapter 1, we learn that no one is normal, everyone is weird, and everyone is about the same amount of weird. I published an excerpt from this chapter, and talked about it during this section of the SuperDataScience podcast. And it is featured in an interactive article at Brilliant.org, which includes this animation showing how measurements are distributed in multiple dimensions.

Chapter 2 is about the inspection paradox, which affects our perception of many real-world scenarios, including fun examples like class sizes and relay races, and more serious examples like our understanding of criminal justice and ability to track infectious disease. I published a prototype of this chapter as an article called “The Inspection Paradox is Everywhere”, and gave a talk about it at PyData NYC:

Chapter 3 presents three consequences of the inspection paradox in demography, especially changes in fertility in the United States over the last 50 years. It explains Preston’s paradox, named after the demographer who discovered it: if each woman has the same number of children as her mother, family sizes — and population — grow quickly; in order to maintain constant family sizes, women must have fewer children than their mothers, on average. I published an excerpt from this chapter, and it was discussed on Hacker News.
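Here is a minimal sketch of the arithmetic behind Preston's paradox. It is not code from the book: the family-size distribution and the names `family_size_pmf` and `next_generation` are made up for illustration, and it ignores the difference between children and daughters.

```python
# Hypothetical distribution of family sizes (children per woman),
# invented for illustration; these are not data from the book.
family_size_pmf = {0: 0.2, 1: 0.2, 2: 0.3, 3: 0.2, 4: 0.1}


def mean(pmf):
    """Mean of a distribution stored as {value: probability}."""
    return sum(n * p for n, p in pmf.items())


def next_generation(pmf):
    """Family sizes in the next generation if every child copies their mother.

    A family with n children sends n people into the next generation,
    so each family size is weighted by n and then renormalized.
    """
    total = mean(pmf)
    return {n: n * p / total for n, p in pmf.items()}


mean_mothers = mean(family_size_pmf)
mean_daughters = mean(next_generation(family_size_pmf))

print(f"Mothers' generation:   {mean_mothers:.2f} children on average")
print(f"Daughters' generation: {mean_daughters:.2f} children on average")
```

With these made-up numbers, the average family size goes from 1.8 to about 2.7 in one generation, even though no one chose a bigger family than the one they grew up in.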

Chapter 4 is about extremes, outliers, and GOATs (greatest of all time), and two reasons the distribution of many abilities tends toward a lognormal distribution: proportional gain and weakest link effects. I gave a talk about this chapter for PyData Global 2023:

And here’s a related exploration I cut from the book.
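The proportional gain idea from Chapter 4 can be sketched with a small simulation. This is my own toy version, not the analysis from the book: the number of rounds, the range of gains, and the function names are assumptions chosen for illustration. If ability grows by a random percentage in each round, ability is a product of many factors, its logarithm is a sum, and the sum tends toward a normal distribution.

```python
import math
import random
import statistics

rng = random.Random(17)  # fixed seed so the run is repeatable


def simulate_ability(n_rounds=50):
    """Multiply a starting ability of 1 by a random gain in each round."""
    ability = 1.0
    for _ in range(n_rounds):
        ability *= rng.uniform(0.9, 1.2)  # assumed range of proportional gains
    return ability


def skewness(xs):
    """Sample skewness: the mean of the cubed z-scores."""
    mu = statistics.fmean(xs)
    sigma = statistics.pstdev(xs)
    return statistics.fmean([((x - mu) / sigma) ** 3 for x in xs])


abilities = [simulate_ability() for _ in range(5000)]
log_abilities = [math.log(a) for a in abilities]

# Ability itself has a long right tail; its logarithm is roughly symmetric,
# which is what we expect if ability is approximately lognormal.
print(f"skewness of ability:      {skewness(abilities):.2f}")
print(f"skewness of log(ability): {skewness(log_abilities):.2f}")
```

The skewness of the simulated abilities should be clearly positive, while the skewness of their logarithms should be close to zero.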

Chapter 5 is about the surprising conditions where something used is better than something new. Most things wear out over time, but sometimes longevity implies information, which implies even greater longevity. This property has implications for life expectancy and the possibility of much longer life spans. I gave a talk about this chapter at ODSC East 2024 — there’s no recording, but the slides are here.

Chapter 6 introduces Berkson’s paradox — a form of collision bias — with some simple examples like the correlation of test scores and some more important examples like COVID and depression. Chapter 7 uses collision bias to explain the low birthweight paradox and other confusing results from epidemiology. I gave a “Talk at Google” about these chapters:

Chapter 8 shows that the magnitudes of natural and human-caused disasters follow long-tailed distributions that violate our intuition, defy prediction, and leave us unprepared. Examples include earthquakes, solar flares, asteroid impacts, and stock market crashes. I gave a talk about this chapter at SciPy 2023:

The talk includes this animation showing how plotting a tail distribution on a log-y scale provides a clearer picture of the extreme tail behavior.

Chapter 9 is about the base rate fallacy, which is the cause of many statistical errors, including misinterpretations of medical tests, field sobriety tests, and COVID statistics. It includes a discussion of the COMPAS system for predicting criminal behavior.
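To see why the base rate matters, here is a short calculation using Bayes's theorem. The prevalence, sensitivity, and specificity are hypothetical numbers I chose for illustration, not figures from the book.

```python
def prob_condition_given_positive(prevalence, sensitivity, specificity):
    """P(condition | positive test), computed with Bayes's theorem."""
    true_positives = prevalence * sensitivity
    false_positives = (1 - prevalence) * (1 - specificity)
    return true_positives / (true_positives + false_positives)


# Hypothetical test: 1% of people have the condition, the test catches
# 90% of true cases, and it correctly clears 95% of people without it.
ppv = prob_condition_given_positive(
    prevalence=0.01, sensitivity=0.90, specificity=0.95
)
print(f"Probability of the condition given a positive test: {ppv:.1%}")
```

Even though this hypothetical test sounds accurate, only about 15% of positive results are true positives, because people without the condition greatly outnumber people with it.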

Chapter 10 is about Simpson’s paradox, with examples from ecology, sociology, and economics. It is the key to understanding one of the most notorious examples of misinterpretation of COVID data. This is the first of three chapters that use data from the General Social Survey (GSS).

Chapter 11 is about the expansion of the Moral Circle, specifically changes in attitudes about race, gender, and homosexuality in the U.S. over the last 50 years. I published an excerpt about the remarkable decline of homophobia since 1990, featuring lyrics from “A Message From the Gay Community”.

Chapter 12 is about the Overton Paradox, a name I’ve given to a pattern observed in GSS data: as people get older, their beliefs become more liberal, on average, but they are more likely to say they are conservative. This chapter is the basis of this interactive lesson at Brilliant.org. And I gave a talk about it at PyData NYC 2022:

There are still a few chapters I haven’t given a talk about, so watch this space!

Again, you can order the book from Bookshop.org if you want to support independent bookstores, or Amazon if you don’t.

Supporting code for the book is in this GitHub repository. All of the chapters are available as Jupyter notebooks that run in Colab, so you can replicate my analysis. If you are teaching a data science or statistics class, they make good teaching examples.

Chapter 1: Are You Normal? Hint: No.

Run the code on Colab

Run the code that prepares the BRFSS data

Run the code that prepares the Big Five data

Chapter 2: Relay Races and Revolving Doors

Run the code on Colab

Chapter 3: Defy Tradition, Save the World

Run the code on Colab

Chapter 4: Extremes, Outliers, and GOATs

Run the code on Colab

Run the code that prepares the BRFSS data

Run the code that prepares the NSFG data

Chapter 5: Better Than New

Run the code on Colab

Chapter 6: Jumping to Conclusions

Run the code on Colab

Chapter 7: Causation, Collision, and Confusion

Run the code on Colab

Run the code that prepares the NCHS data

Chapter 8: The Long Tail of Disaster

Run the code on Colab

Run the code that prepares the earthquake data

Run the code that prepares the solar flare data

Chapter 9: Fairness and Fallacy

Run the code on Colab

Chapter 10: Penguins, Pessimists, and Paradoxes

Run the code on Colab

Run the code that prepares the GSS data

Chapter 11: Changing Hearts and Minds

Run the code on Colab

Chapter 12: Chasing the Overton Window

Run the code on Colab

Too many bronze medals?

In a recent video, Hank Green nerd-sniped me by asking a question I couldn’t not answer.

At one point in the video, he shows “a graph of the last 20 years of Olympic games showing the gold, silver, and bronze medals from continental Europe.” And it “shows continental Europe having significantly more bronze medals than gold medals.”

Hank wonders why and offers a few possible explanations, finally settling on the one I think is correct:

… the increased numbers of athletes who come from European countries weight them more toward bronze, which might actually be a more randomized medal. Placing gold might just be a better judge of who is first, because gold medal winners are more likely to be truer outliers, while bronze medal recipients are closer to the middle of the pack. And so randomness might play a bigger role, which would mean that having a larger number of athletes gives you more bronze medal winners and more athletes is what you get when you lump a bunch of countries together.

In the following notebook, I use a simple simulation to show that this explanation is plausible. Click here to run the notebook on Colab. Or read the details below.

Where’s My Train?

Yesterday I presented a webinar for PyMC Labs where I solved one of the exercises from Think Bayes, called “The Red Line Problem”. Here’s the scenario:

The Red Line is a subway that connects Cambridge and Boston, Massachusetts. When I was working in Cambridge I took the Red Line from Kendall Square to South Station and caught the commuter rail to Needham. During rush hour Red Line trains run every 7-8 minutes, on average.

When I arrived at the subway stop, I could estimate the time until the next train based on the number of passengers on the platform. If there were only a few people, I inferred that I just missed a train and expected to wait about 7 minutes. If there were more passengers, I expected the train to arrive sooner. But if there were a large number of passengers, I suspected that trains were not running on schedule, so I expected to wait a long time.

While I was waiting, I thought about how Bayesian inference could help predict my wait time and decide when I should give up and take a taxi.

I used this exercise to demonstrate a process for developing and testing Bayesian models in PyMC. The solution uses some common PyMC features, like the Normal, Gamma, and Poisson distributions, and some less common features, like the Interpolated and StudentT distributions.

The video is on YouTube now:

The slides are here.

This talk will be remembered for the first public appearance of the soon-to-be-famous “Banana of Ignorance”. In general, when the data we have are unable to distinguish between competing explanations, that uncertainty is reflected in the joint distribution of the parameters. In this example, if we see more people waiting than expected, there are two explanations: a higher-than-average arrival rate or a longer-than-average elapsed time since the last train. If we make a contour plot of the joint posterior distribution of these parameters, it looks like this:

The elongated shape of the contour indicates that either explanation is sufficient: if the arrival rate is high, elapsed time can be normal, and if the elapsed time is high, the arrival rate can be normal. Because this shape indicates that we don’t know which explanation is correct, I have dubbed it “The Banana of Ignorance”:

For all of the details, you can read the Jupyter notebook or run it on Colab.
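If you want a feel for where the banana comes from without running PyMC, here is a rough grid approximation in plain Python. It is not the model from the talk: the observed count `k_obs`, the gamma prior on the arrival rate, the uniform prior on elapsed time, and the grid ranges are all assumptions I made up for this sketch. The key idea is that the number of passengers depends only on the product of arrival rate and elapsed time, so the data cannot tell the two apart.

```python
from math import exp, lgamma, log, sqrt

k_obs = 15  # assumed observation: passengers waiting on the platform


def poisson_pmf(k, mu):
    """Poisson probability of k events when the expected count is mu."""
    return exp(k * log(mu) - mu - lgamma(k + 1))


def rate_prior(lam, shape=4, scale=0.5):
    """Unnormalized gamma density; its mean is shape * scale = 2 per minute."""
    return lam ** (shape - 1) * exp(-lam / scale)


# Grids for the two unknowns (the prior on elapsed time is uniform)
lams = [0.5 + 0.05 * i for i in range(91)]  # arrival rate, passengers/minute
ts = [0.5 + 0.1 * j for j in range(146)]    # minutes since the last train

# Unnormalized posterior: prior times likelihood, with k ~ Poisson(lam * t)
grid = [
    (lam, t, rate_prior(lam) * poisson_pmf(k_obs, lam * t))
    for lam in lams
    for t in ts
]
total = sum(p for _, _, p in grid)
posterior = [(lam, t, p / total) for lam, t, p in grid]


def expect(f):
    """Posterior expectation of f(lam, t)."""
    return sum(p * f(lam, t) for lam, t, p in posterior)


mean_lam = expect(lambda lam, t: lam)
mean_t = expect(lambda lam, t: t)
cov = expect(lambda lam, t: (lam - mean_lam) * (t - mean_t))
sd_lam = sqrt(expect(lambda lam, t: (lam - mean_lam) ** 2))
sd_t = sqrt(expect(lambda lam, t: (t - mean_t) ** 2))
corr = cov / (sd_lam * sd_t)

print(f"posterior correlation of rate and elapsed time: {corr:.2f}")
```

The correlation comes out negative: a high arrival rate goes with a short elapsed time and vice versa, which is the curved ridge you see in the contour plot.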

The original Red Line Problem is based on a student project from my Bayesian Statistics class at Olin College, way back in Spring 2013.

Elements of Data Science

I’m excited to announce the launch of my newest book, Elements of Data Science. As the subtitle suggests, it is about “Getting started with Data Science and Python”.

Order now from Lulu.com and get 20% off!

I am publishing this book myself, which has one big advantage: I can print it with a full color interior without increasing the cover price. In my opinion, the code is more readable with syntax highlighting, and the data visualizations look great!

In addition to the printed edition, all chapters are available to read online, and they are in Jupyter notebooks, where you can read the text, run the code, and work on the exercises.

Description

Elements of Data Science is an introduction to data science for people with no programming experience. My goal is to present a small, powerful subset of Python that allows you to do real work with data as quickly as possible.

Part 1 includes six chapters that introduce basic Python with a focus on working with data.

Part 2 presents exploratory data analysis using Pandas and empiricaldist — it includes a revised and updated version of the material from my popular DataCamp course, “Exploratory Data Analysis in Python.”

Part 3 takes a computational approach to statistical inference, introducing resampling methods, bootstrapping, and randomization tests.

Part 4 is the first of two case studies. It uses data from the General Social Survey to explore changes in political beliefs and attitudes in the U.S. in the last 50 years. The data points on the cover are from one of the graphs in this section.

Part 5 is the second case study, which introduces classification algorithms and the metrics used to evaluate them — and discusses the challenges of algorithmic decision-making in the context of criminal justice.

This project started in 2019, when I collaborated with a group at Harvard to create a data science class for people with no programming experience. We discussed some of the design decisions that went into the course and the book in this article.

Density and Likelihood: What’s the Difference?

It’s another installment in Data Q&A: Answering the real questions with Python. Previous installments are available from the Data Q&A landing page.

If you get this post by email, the formatting might be broken — if so, you might want to read it on the site.

PMFs and PDFs

It’s another installment in Data Q&A: Answering the real questions with Python. Previous installments are available from the Data Q&A landing page.

If you get this post by email, the formatting is not good — you might want to read it on the site.

Regrets and Regression

It’s another installment in Data Q&A: Answering the real questions with Python. Previous installments are available from the Data Q&A landing page.

Have the Nones Leveled Off?

Last month Ryan Burge published “The Nones Have Hit a Ceiling”, using data from the 2023 Cooperative Election Study to show that the increase in the number of Americans with no religious affiliation has hit a plateau. Comparing the number of Atheists, Agnostics, and “Nothing in Particular” between 2020 and 2023, he found that “the share of non-religious Americans has stopped rising in any meaningful way.”

When I read that, I was frustrated that the HERI Freshman Survey had not published new data since 2019. I’ve been following the rise of the “Nones” in that dataset since one of my first blog articles.

As you might guess, the Freshman Survey reports data from incoming college students. Of course, college students are not a representative sample of the U.S. population, and as rates of college attendance have increased, they represent a different slice of the population over time. Nevertheless, surveying young adults over a long interval provides an early view of trends in the general population.

Well, I have good news! I got a notification today that HERI has published data tables for the 2020 through 2023 surveys. They are in PDF, so I had to do some manual data entry, but I have results!

Religious preference

Among other questions, the Freshman Survey asks students to select their “current religious preference” from a list of seventeen common religions, “Other religion,” “Atheist”, “Agnostic”, or “None.”

The options “Atheist” and “Agnostic” were added in 2015. For consistency over time, I compare the “Nones” from previous years with the sum of “None”, “Atheist” and “Agnostic” since 2015.

The following figure shows the fraction of Nones from 1969, when the question was added, to 2023, the most recent data available.

The blue line shows data until 2015; the orange line shows data from 2015 through 2019. The gray line shows a quadratic fit.  The light gray region shows a 95% predictive interval.

The quadratic model continues to fit the data well and the recent trend is still increasing, but if you look at only the last few data points, there is some evidence that the rate of increase is slowing.

But not for women

Now here’s where things get interesting. Until recently, female students have been consistently more religious than male students. But that might be changing. The following figure shows the percentages of Nones for male and female students (with a missing point in 2018, when this breakdown was not available).

Since 2019, the percentage of Nones has increased for women and decreased for men, and it looks like women may now be less religious. So the apparent slowdown in the overall trend might be a mix of opposite trends in the two groups.

The following graph shows the gender gap over time, that is, the difference in percentages of male and female students with no religious affiliation.

The gap was essentially unchanged from 1990 to 2020. But in the last three years it has changed drastically. It now falls outside the predictive range based on past data, which suggests a change this large would be unlikely by chance.

Similarly with attendance at religious services, the gender gap has closed and possibly reversed.

UPDATE: Ryan Burge looked at the gender gap in CES and GSS data and found similar results: especially among young people, the gender gap has either disappeared or crossed over. And Ryan pointed me to this article by Dan Cox and Kelsey Eyre Hammond which reports similar trends in data from the Survey Center on American Life.

Attendance

The survey also asks students how often they “attended a religious service” in the last year. The choices are “Frequently,” “Occasionally,” and “Not at all.” Respondents are instructed to select “Occasionally” if they attended one or more times, so a wedding or a funeral would do it.

The following figure shows the fraction of students who reported any religious attendance in the last year, starting in 1968. I discarded a data point from 1966 that seems unlikely to be correct.

There is a clear dip in 2021, likely due to the pandemic, but the last two data points have returned to the long-term trend.

Data Source

The data reported here are available from the HERI publications page. Since I entered the data manually from PDF documents, it’s possible I have made errors.

Should divorce be more difficult?

“The Christian right is coming for divorce next,” according to this recent Vox article, and “Some conservatives want to make it a lot harder to dissolve a marriage.”

As always when I read an article like this, I want to see data — and the General Social Survey has just the data I need. Since 1974, they have asked a representative sample of the U.S. population, “Should divorce in this country be easier or more difficult to obtain than it is now?” with the options to respond “Easier”, “More difficult”, or “Stay as is”.

Here’s how the responses have changed over time:

Since the 1990s, the percentage saying divorce should be more difficult has dropped from about 50% to about 30%. [The last data point, in 2022, may not be reliable. Due to disruptions during the COVID pandemic, the GSS changed some elements of their survey process — in the 2021 and 2022 data, responses to several questions have deviated from long-term trends in ways that might not reflect real changes in opinion.]

If we break down the results by political alignment, we can see whether these changes are driven by liberals, conservatives, or both.

Not surprisingly, conservatives are more likely than liberals to believe that divorce should be more difficult, by a margin of about 20 percentage points. But the percentages have declined in all groups — and fallen below 50% even among self-described conservatives.

As the Vox article documents, conservatives in several states have proposed legislation to make divorce more difficult. Based on the data, these proposals are likely to be unpopular.

To see my analysis, you can run this notebook on Colab. For similar analysis of other topics, see Chapter 11 of Probably Overthinking It.

Which Standard Deviation?

It’s another installment in Data Q&A: Answering the real questions with Python. Previous installments are available from the Data Q&A landing page.
