In a previous article, I looked at 93 measurements from the ANSUR-II dataset and found that ear protrusion is not correlated with any other measurement. In a followup article, I used principal component analysis to explore the correlation structure of the measurements, and found that once you have exhausted the information encoded in the most obvious measurements, the ear-related measurements are left standing alone.

I have a conjecture about why ears are weird: ear growth might depend on idiosyncratic details of the developmental environment — so they might be like fingerprints. Recently I discovered a hint that supports my conjecture.

This Veritasium video explains how we locate the source of a sound.

In general, we use small differences between what we hear in each ear: specifically, differences in amplitude, quality, time delay, and phase. That works well if the source of the sound is to the left or right, but not if it’s directly in front, above, or behind (anywhere on the vertical plane through the centerline of your head), because in those cases the paths from the source to the two ears are symmetric.

Fortunately we have another trick that helps in this case. The shape of the outer ear changes the quality of the sound, depending on the direction of the source. The resulting spectral cues make it possible to locate sources even when they are on the central plane.

The video mentions that owls have asymmetric ears that make this trick particularly effective. Human ears are not as distinctly asymmetric as owl ears, but they are not identical.

And now, based on the Veritasium video, I suspect that might be a feature — the shape of the outer ear might be unpredictably variable because it’s advantageous for our ears to be asymmetric. Almost everything about the way our bodies grow is programmed to be as symmetric as possible, but ears might be programmed to be different.

An article in a recent issue of The Economist suggests, right in the title, “Investors should avoid a new generation of rip-off ETFs”. An ETF is an exchange-traded fund, which holds a collection of assets and trades on an exchange like a single stock. For example, the SPDR S&P 500 ETF Trust (SPY) tracks the S&P 500 index, but unlike traditional index funds, you can buy or sell shares in minutes.

There’s nothing obviously wrong with that – but as an example of a “rip-off ETF”, the article describes “defined-outcome funds” or buffer ETFs, which “offer investors an enviable-sounding opportunity: hold stocks, with protection against falling prices. All they must do is forgo annual returns above a certain level, often 10% or so.”

That might sound good, but the article explains, “Over the long term, they are a terrible deal for investors. Much of the compounding effect of stock ownership comes from rallies.”

To demonstrate, they use the value of the S&P index since 1980: “An investor with returns capped at 10% and protected from losses would have made a real return of 403% over the period, a fraction of the 3,155% return offered by just buying and holding the S&P 500.”

So that sounds bad, but returns from 1980 to the present have been historically unusual. To get a sense of whether buffer ETFs are more generally a bad deal, let’s get a bigger picture.

The MeasuringWorth Foundation has compiled the value of the Dow Jones Industrial Average at the end of each day from February 16, 1885 to the present, with adjustments at several points to make the values comparable. The series I collected starts on February 16, 1885 and ends on August 30, 2024. The following cells download and read the data.

To compute annual returns, we’ll start by selecting the closing price on the last trading day of each year (dropping 2024 because we don’t have a complete year).
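The cell that performs this step is not shown here. As a minimal sketch, assuming the daily closing values are in a pandas Series with a DatetimeIndex, the annual figures could be computed as follows. The function name `annual_returns` and the toy data are hypothetical; the column names follow the table shown below, and an incomplete final year would need to be dropped before calling the function.

```python
import pandas as pd

def annual_returns(daily):
    """Compute year-over-year price returns from daily closing values.

    daily: Series of closing values with a DatetimeIndex (assumed format)
    """
    # closing value on the last trading day of each calendar year
    year_end = daily.groupby(daily.index.year).last()

    annual = pd.DataFrame({"DJIA": year_end})
    # ratio of each year's close to the previous year's close
    annual["Ratio"] = annual["DJIA"] / annual["DJIA"].shift(1)
    # percentage return; the first year has no predecessor, so it is NaN
    annual["Return"] = (annual["Ratio"] - 1) * 100
    annual.index.name = "Date"
    return annual

# toy data (made up), two observations per year
dates = pd.to_datetime(["2000-06-30", "2000-12-29", "2001-06-29", "2001-12-31"])
daily = pd.Series([100.0, 110.0, 120.0, 132.0], index=dates)
annual = annual_returns(daily)
```

The first year has no previous close, so its return is missing, which is presumably why the notebook calls `dropna()` before sorting.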

Looking at the years with the biggest losses and gains, we can see that most of the extremes were before the 1960s – with the exception of the 2008 financial crisis.

annual.dropna().sort_values(by='Return')

Date       DJIA     Ratio     Return
1931    77.9000  0.473326 -52.667396
1907    43.0382  0.622683 -37.731743
2008  8776.3900  0.661629 -33.837097
1930   164.5800  0.662347 -33.765293
1920    71.9500  0.670988 -32.901240
…           …         …          …
1954   404.3900  1.439623  43.962264
1908    63.1104  1.466381  46.638103
1928   300.0000  1.482213  48.221344
1933    99.9000  1.666945  66.694477
1915    99.1500  1.816599  81.659949

138 rows × 3 columns

Here’s what the distribution of annual returns looks like.

With this function, we can replicate the analysis The Economist did with the S&P 500. Here are the results for the DJIA from the beginning of 1980 to the end of 2023.
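The function itself is not shown in this text. As a rough sketch of the idea, assuming a buffer ETF is modeled the way The Economist describes it (annual returns capped at 10% and protected from losses, ignoring fees and dividends), a comparison could look like this. The name `compare_growth` and its parameters are hypothetical.

```python
import numpy as np

def compare_growth(returns_pct, cap=10.0, floor=0.0):
    """Return (index_growth, buffer_growth) as total growth factors.

    returns_pct: sequence of annual returns in percent
    cap, floor: assumed limits on the buffer ETF's annual return, in percent
    """
    returns_pct = np.asarray(returns_pct, dtype=float)

    # buy and hold: compound the raw annual returns
    index_growth = np.prod(1 + returns_pct / 100)

    # buffer ETF: each year's return is clipped to [floor, cap]
    buffered = np.clip(returns_pct, floor, cap)
    buffer_growth = np.prod(1 + buffered / 100)
    return index_growth, buffer_growth

# toy example (made-up returns): +20%, -10%, +5%
index_growth, buffer_growth = compare_growth([20, -10, 5])
```

In this toy example the buffer comes out slightly ahead (about 1.155 versus 1.134) because the avoided loss outweighs the capped gain; over a long run of strong years, the cap dominates.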

A buffer ETF over this period would have grown by a factor of more than 15 in nominal dollars, with no risk of loss. But an index fund would have grown by a factor of almost 45. So yeah, the ETF would have been a bad deal.

However, if we go back to the bad old days, an investor in 1900 would have been substantially better off with a buffer ETF held for 43 years – a factor of 7.2 compared to a factor of 2.8.

It seems we can cherry-pick the data to make the comparison go either way – so let’s see how things look more generally. Starting in 1886, we’ll compute price returns for all 30-year intervals, ending with the interval from 1993 to 2023.
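One way to compute growth over every 30-year interval is a rolling product of annual growth ratios. The following is a sketch under the same assumptions as the buffer model described by The Economist (cap of 10%, floor of 0%); `rolling_growth` and the column names are hypothetical, and the toy series uses a 2-year window only to keep the example small.

```python
import numpy as np
import pandas as pd

def rolling_growth(returns_pct, window=30, cap=10.0, floor=0.0):
    """Total growth factor over each rolling window of annual returns.

    returns_pct: Series of annual returns in percent, indexed by year
    """
    ratios = 1 + returns_pct / 100
    buffered = 1 + returns_pct.clip(lower=floor, upper=cap) / 100

    # each row is labeled with the last year of its window
    return pd.DataFrame({
        "index_fund": ratios.rolling(window).apply(np.prod, raw=True),
        "buffer_etf": buffered.rolling(window).apply(np.prod, raw=True),
    }).dropna()

# toy data (made up)
returns = pd.Series([20.0, -10.0, 5.0], index=[2001, 2002, 2003])
growth = rolling_growth(returns, window=2)
```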

The buffer ETF performs as advertised, substantially reducing volatility. But it has only occasionally been a good deal, and not in my lifetime.

According to ChatGPT, the primary reasons for strong growth in stock prices since the 1960s are “technological advancements, globalization, financial market innovation, and favorable monetary policies”. If you think these elements will generally persist over the next 30 years, you might want to avoid buffer ETFs.

Last week I had the pleasure of presenting a keynote at posit::conf(2024). When the video is available, I will post it here. In the meantime, you can read the slides, if you don’t mind spoilers.

For people at the conference who don’t know me, this might be a good time to introduce you to this blog, where I write about data science and Bayesian statistics, and to Probably Overthinking It, the book based on the blog, which was published by University of Chicago Press last December. Here’s an outline of the book with links to excerpts I’ve published in the blog and talks I’ve presented based on some of the chapters.

For your very own copy, you can order from Bookshop.org if you want to support independent bookstores, or Amazon if you don’t.

Chapter 2 is about the inspection paradox, which affects our perception of many real-world scenarios, including fun examples like class sizes and relay races, and more serious examples like our understanding of criminal justice and ability to track infectious disease. I published a prototype of this chapter as an article called “The Inspection Paradox is Everywhere”, and gave a talk about it at PyData NYC:

Chapter 3 presents three consequences of the inspection paradox in demography, especially changes in fertility in the United States over the last 50 years. It explains Preston’s paradox, named after the demographer who discovered it: if each woman has the same number of children as her mother, family sizes — and population — grow quickly; in order to maintain constant family sizes, women must have fewer children than their mothers, on average. I published an excerpt from this chapter, and it was discussed on Hacker News.

Chapter 4 is about extremes, outliers, and GOATs (greatest of all time), and two reasons the distribution of many abilities tends toward a lognormal distribution: proportional gain and weakest link effects. I gave a talk about this chapter for PyData Global 2023:

Chapter 5 is about the surprising conditions where something used is better than something new. Most things wear out over time, but sometimes longevity implies information, which implies even greater longevity. This property has implications for life expectancy and the possibility of much longer life spans. I gave a talk about this chapter at ODSC East 2024 — there’s no recording, but the slides are here.

Chapter 6 introduces Berkson’s paradox — a form of collision bias — with some simple examples like the correlation of test scores and some more important examples like COVID and depression. Chapter 7 uses collision bias to explain the low birthweight paradox and other confusing results from epidemiology. I gave a “Talk at Google” about these chapters:

Chapter 8 shows that the magnitudes of natural and human-caused disasters follow long-tailed distributions that violate our intuition, defy prediction, and leave us unprepared. Examples include earthquakes, solar flares, asteroid impacts, and stock market crashes. I gave a talk about this chapter at SciPy 2023:

The talk includes this animation showing how plotting a tail distribution on a log-y scale provides a clearer picture of the extreme tail behavior.

Chapter 9 is about the base rate fallacy, which is the cause of many statistical errors, including misinterpretations of medical tests, field sobriety tests, and COVID statistics. It includes a discussion of the COMPAS system for predicting criminal behavior.

Chapter 10 is about Simpson’s paradox, with examples from ecology, sociology, and economics. It is the key to understanding one of the most notorious examples of misinterpretation of COVID data. This is the first of three chapters that use data from the General Social Survey (GSS).

Chapter 12 is about the Overton Paradox, a name I’ve given to a pattern observed in GSS data: as people get older, their beliefs become more liberal, on average, but they are more likely to say they are conservative. This chapter is the basis of this interactive lesson at Brilliant.org. And I gave a talk about it at PyData NYC 2022:

There are still a few chapters I haven’t given a talk about, so watch this space!

Again, you can order the book from Bookshop.org if you want to support independent bookstores, or Amazon if you don’t.

Supporting code for the book is in this GitHub repository. All of the chapters are available as Jupyter notebooks that run in Colab, so you can replicate my analysis. If you are teaching a data science or statistics class, they make good teaching examples.

In a recent video, Hank Green nerd-sniped me by asking a question I couldn’t not answer.

At one point in the video, he shows “a graph of the last 20 years of Olympic games showing the gold, silver, and bronze medals from continental Europe.” And it “shows continental Europe having significantly more bronze medals than gold medals.”

Hank wonders why and offers a few possible explanations, finally settling on the one I think is correct:

… the increased numbers of athletes who come from European countries weight them more toward bronze, which might actually be a more randomized medal. Placing gold might just be a better judge of who is first, because gold medal winners are more likely to be truer outliers, while bronze medal recipients are closer to the middle of the pack. And so randomness might play a bigger role, which would mean that having a larger number of athletes gives you more bronze medal winners and more athletes is what you get when you lump a bunch of countries together.

In the following notebook, I use a simple simulation to show that this explanation is plausible. Click here to run the notebook on Colab. Or read the details below.

If you like this kind of analysis, you might like my book, Probably Overthinking It.

The following function takes a random distribution, generates a population of athletes with random abilities, and returns the top three.

In [4]:

import numpy as np
import pandas as pd

def generate(dist, n, label):
    """Generate the top 3 athletes from a country with population n.

    dist: distribution of ability
    n: population
    label: name of country
    """
    # generate a sample with the given size
    sample = dist.rvs(n)
    # select the top 3
    top3 = np.sort(sample)[-3:]
    # put the results in a DataFrame with country labels
    df = pd.DataFrame(dict(ability=top3))
    df['label'] = label
    return df

Here’s an example based on a normal distribution with mean 500 and standard deviation 100.

The next function generates athletes from two regions: a single large country called “UnaGrandia”, with a population of 30,000 athletes, and a group of ten smaller countries called “MultiParvia”, with 3,000 athletes each.

In [6]:

def run_trials(dist):
    """Simulate the trials.

    dist: distribution of ability
    """
    # generate athletes from 10 countries with population 3,000 each
    dfs = [generate(dist, 3000, 'MultiParvia') for i in range(10)]
    # add in athletes from one country with population 30,000
    dfs.append(generate(dist, 30000, 'UnaGrandia'))
    # combine into a single DataFrame
    athletes = pd.concat(dfs)
    return athletes

The result is 33 athletes, 3 from UnaGrandia and 30 from the various countries of MultiParvia.

Here’s what the distribution of ability looks like.

In [8]:

from empiricaldist import Surv
from utils import decorate

surv_ability = Surv.from_seq(athletes['ability'], normalize=False)
surv_ability.plot(style='o', alpha=0.6, label='')
decorate(xlabel='Ability', ylabel='Rank', title='Distribution of ability')

Because we’ve selected the largest values from the distribution of ability, the result is skewed to the right — that is, there are a few extreme outliers who have the best chances of winning, and a middle of the pack that have fewer chances (with a reminder that it’s a pretty elite pack to be in the middle of).

Now let’s simulate the competition.
The following function takes the distribution of ability and an additional parameter, std, that controls the randomness of the results.

When std is 0, the outcome of the competition depends only on the abilities of the athletes — the athlete with the highest ability wins every time.

As std increases, the outcome is more random, so an athlete with a lower ability has a better chance of beating an athlete with higher ability.

In [9]:

from scipy.stats import norm

medals = ['Gold', 'Silver', 'Bronze']

def compete(dist, std=0):
    """Simulate a competition.

    dist: distribution of ability
    std: standard deviation of randomness
    """
    # run the trials
    athletes = run_trials(dist)
    # add a random factor to ability to get scores
    randomness = norm(0, std).rvs(len(athletes))
    athletes['score'] = athletes['ability'] + randomness
    # select and return the athletes with the top 3 scores
    podium = athletes.nlargest(3, columns='score')
    podium['medal'] = medals
    return podium

The result shows the abilities of each winner, which region they are from, their score in the competition, and the medal they won.

In [10]:

compete(dist,std=10)

Out[10]:

      ability       label       score   medal
2  920.202590  UnaGrandia  926.182143    Gold
0  876.618008  UnaGrandia  884.973475  Silver
1  876.623360  UnaGrandia  877.887775  Bronze

Now let’s simulate multiple events.
The following function takes the distribution of ability again, along with the number of events and the amount of randomness in the outcomes.

In [11]:

def games(dist, num_events, std=0):
    """Simulate multiple events.

    dist: distribution of abilities
    num_events: how many events are contested
    std: standard deviation of randomness
    """
    dfs = [compete(dist, std) for i in range(num_events)]
    results = pd.concat(dfs)
    xtab = pd.crosstab(results['label'], results['medal'])
    return xtab[medals]

The result is a table that shows the number of each kind of medal won by each region.
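The call that produces this table is not shown here. As a rough, self-contained sketch of the same experiment (not the notebook's code), the version below hard-codes the normal distribution with mean 500 and standard deviation 100 and the populations described above. The name `simulate_games`, the seed, and the choice of 100 events are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

MEDALS = ["Gold", "Silver", "Bronze"]

def simulate_games(num_events, std, seed=0):
    """Count medals by region over num_events simulated events."""
    rng = np.random.default_rng(seed)
    counts = pd.DataFrame(0, index=["MultiParvia", "UnaGrandia"], columns=MEDALS)
    # 30 athletes from MultiParvia followed by 3 from UnaGrandia
    labels = ["MultiParvia"] * 30 + ["UnaGrandia"] * 3

    for _ in range(num_events):
        # top 3 from each of ten countries with 3,000 athletes
        small = np.sort(rng.normal(500, 100, size=(10, 3000)), axis=1)[:, -3:]
        # top 3 from one country with 30,000 athletes
        big = np.sort(rng.normal(500, 100, size=30000))[-3:]
        ability = np.concatenate([small.ravel(), big])

        # score is ability plus random noise on the day of the event
        score = ability + rng.normal(0, std, size=ability.size)

        # indices of the three highest scores, best first
        podium = np.argsort(score)[::-1][:3]
        for medal, i in zip(MEDALS, podium):
            counts.loc[labels[i], medal] += 1
    return counts

table = simulate_games(num_events=100, std=10)
```

Each column of `table` sums to the number of events, so the rows can be compared directly as shares of each medal.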

The results here are more consistent than what we see in the real data because we simulated 1000 events.

If we increase the amount of randomness, the advantage of sending more athletes to the games is even stronger — and it looks like it has an effect on the number of gold medals as well.

I was curious to know how the distribution of ability affects the result, so I tried the simulations with a lognormal distribution, too.
This choice might be more realistic because the distribution of ability in many fields follows a lognormal distribution (see Chapter 4 of Probably Overthinking It or this article).

Here’s a lognormal distribution that’s a good match for the distribution of Elo scores in chess.

The results with the lognormal distribution are similar to the results with a normal distribution of abilities, so it seems like the shape of the distribution is not an essential reason for the excess of bronze medals.

I think Hank is right. If you have two regions with the same population, and one is allowed to send more athletes to the games, it is not much more likely to win gold medals, but notably more likely to win silver and bronze medals — and the size of the excess depends on how much randomness there is in the outcome of the events.

from empiricaldist import Cdf

results['diff'] = results['score'] - results['ability']
for name, group in results.groupby('medal'):
    cdf = Cdf.from_seq(group['diff']) * 100
    cdf.plot(label=name)
decorate(xlabel='Under / over performance', ylabel='Percentile rank')

Copyright 2024 Allen Downey

The code in this notebook and utils.py is under the MIT license.

Yesterday I presented a webinar for PyMC Labs where I solved one of the exercises from Think Bayes, called “The Red Line Problem”. Here’s the scenario:

The Red Line is a subway that connects Cambridge and Boston, Massachusetts. When I was working in Cambridge I took the Red Line from Kendall Square to South Station and caught the commuter rail to Needham. During rush hour Red Line trains run every 7-8 minutes, on average.

When I arrived at the subway stop, I could estimate the time until the next train based on the number of passengers on the platform. If there were only a few people, I inferred that I just missed a train and expected to wait about 7 minutes. If there were more passengers, I expected the train to arrive sooner. But if there were a large number of passengers, I suspected that trains were not running on schedule, so I expected to wait a long time.

While I was waiting, I thought about how Bayesian inference could help predict my wait time and decide when I should give up and take a taxi.

I used this exercise to demonstrate a process for developing and testing Bayesian models in PyMC. The solution uses some common PyMC features, like the Normal, Gamma, and Poisson distributions, and some less common features, like the Interpolated and StudentT distributions.

This talk will be remembered for the first public appearance of the soon-to-be-famous “Banana of Ignorance”. In general, when the data we have are unable to distinguish between competing explanations, that uncertainty is reflected in the joint distribution of the parameters. In this example, if we see more people waiting than expected, there are two explanations: a higher-than-average arrival rate or a longer-than-average elapsed time since the last train. If we make a contour plot of the joint posterior distribution of these parameters, it looks like this:

The elongated shape of the contour indicates that either explanation is sufficient: if the arrival rate is high, elapsed time can be normal, and if the elapsed time is high, the arrival rate can be normal. Because this shape indicates that we don’t know which explanation is correct, I have dubbed it “The Banana of Ignorance”:

I’m excited to announce the launch of my newest book, Elements of Data Science. As the subtitle suggests, it is about “Getting started with Data Science and Python”.

I am publishing this book myself, which has one big advantage: I can print it with a full color interior without increasing the cover price. In my opinion, the code is more readable with syntax highlighting, and the data visualizations look great!

In addition to the printed edition, all chapters are available to read online, and they are in Jupyter notebooks, where you can read the text, run the code, and work on the exercises.

Description

Elements of Data Science is an introduction to data science for people with no programming experience. My goal is to present a small, powerful subset of Python that allows you to do real work with data as quickly as possible.

Part 1 includes six chapters that introduce basic Python with a focus on working with data.

Part 2 presents exploratory data analysis using Pandas and empiricaldist — it includes a revised and updated version of the material from my popular DataCamp course, “Exploratory Data Analysis in Python.”

Part 3 takes a computational approach to statistical inference, introducing resampling methods, bootstrapping, and randomization tests.

Part 4 is the first of two case studies. It uses data from the General Social Survey to explore changes in political beliefs and attitudes in the U.S. in the last 50 years. The data points on the cover are from one of the graphs in this section.

Part 5 is the second case study, which introduces classification algorithms and the metrics used to evaluate them — and discusses the challenges of algorithmic decision-making in the context of criminal justice.

This project started in 2019, when I collaborated with a group at Harvard to create a data science class for people with no programming experience. We discussed some of the design decisions that went into the course and the book in this article.

I’m a math graduate and am partially self taught. I am really frustrated with likelihood and probability density, two concepts that I personally think are explained so disastrously that I’ve been struggling with them for an embarrassingly long time. Here’s my current understanding and what I want to understand:

probability density is the ‘concentration of probability’ or probability per unit and the value of the density in any particular interval depends on the density function used. When you integrate the density curve over all outcomes x in X where X is a random variable and x are its realizations then the result should be all the probability or 1.

likelihood is the joint probability, in the discrete case, of observing fixed and known data depending on what parameter(s) value we choose. In the continuous case we do not have a nonzero probability of any single value but we do have nonzero probability within some infinitely small interval (containing infinite values?) [x, x+h] and maximizing the likelihood of observing this data is equivalent to maximizing the probability of observing it, which we can do by maximizing the density at x.

My questions are:

Is what I wrote above correct? Probability density and likelihood are not the same thing. But what the precise distinction is in the continuous case is not completely cut and dry to me. […]

I agree with OP — these topics are confusing and not always explained well. So let’s see what we can do.

I’ll start with a discrete distribution, so we can leave density out of it for now and focus on the difference between a probability mass function (PMF) and a likelihood function.

As an example, suppose we know that a hockey team scores goals at a rate of 3 goals per game on average.
If we model goal scoring as a Poisson process — which is not a bad model — the number of goals per game follows a Poisson distribution with parameter mu=3.

The PMF of the Poisson distribution tells us the probability of scoring k goals in a game, for non-negative values of k.
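As a minimal sketch of this step, here is one way to evaluate the Poisson PMF with SciPy. The range of `ks` (0 through 10) and the name `pmf_k` are assumptions chosen for illustration.

```python
import numpy as np
from scipy.stats import poisson

mu = 3                       # known goal-scoring rate, goals per game
ks = np.arange(11)           # possible numbers of goals, 0 through 10 (assumed range)
pmf_k = poisson.pmf(ks, mu)  # probability of each value of k, with mu fixed
```

Each element of `pmf_k` is a probability mass, and the elements sum to almost 1 because values of k above 10 are very unlikely when mu is 3.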

Now suppose we don’t know the goal scoring rate, but we observe 4 goals in one game.
There are several ways we can use this data to estimate mu.
One is to find the maximum likelihood estimator (MLE), which is the value of mu that makes the observed data most likely.

To find the MLE, we need to maximize the likelihood function, which is a function of mu with a fixed number of observed goals, k.
To evaluate the likelihood function, we can use the PMF of the Poisson distribution again, this time with a single value of k and a range of values for mu.
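The cell that computes the likelihoods is not shown here. A minimal sketch, assuming a grid of candidate rates from 0 to 20 in steps of 0.1, might look like this. The names `mus` and `ls` follow the text; the grid itself is an assumption.

```python
import numpy as np
from scipy.stats import poisson

k = 4                           # observed number of goals, held fixed
mus = np.linspace(0, 20, 201)   # candidate values of mu (assumed grid)
ls = poisson.pmf(k, mus)        # likelihood of each mu, given k
```

Unlike a PMF, the values in `ls` are not expected to add up to 1, because they are probabilities of the same outcome under different parameters, not probabilities of different outcomes.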

To find the value of mu that maximizes the likelihood of the data, we can use argmax to find the index of the highest value in ls, and then look up the corresponding element of mus.

In [10]:

i = np.argmax(ls)
mus[i]

Out[10]:

4.0

In this case, the maximum likelihood estimator is equal to the number of goals we observed.

That’s the answer to the estimation problem, but now let’s look more closely at those likelihoods.
Here’s the likelihood at the maximum of the likelihood function.

In [11]:

np.max(ls)

Out[11]:

0.19536681481316454

This likelihood is a probability mass — specifically, it is the probability of scoring 4 goals, given that the goal-scoring rate is exactly 4.0.

In [12]:

poisson.pmf(4,mu=4)

Out[12]:

0.19536681481316454

So, some likelihoods are probability masses — but not all.

Now suppose, again, that we know the goal scoring rate is exactly 3,
but now we want to know how long it will be until the next goal.
If we model goal scoring as a Poisson process, the time until the next goal follows an exponential distribution with a rate parameter, lam=3.

Because the exponential distribution is continuous, it has a probability density function (PDF) rather than a probability mass function (PMF).
We can approximate the distribution by evaluating the exponential PDF at a set of equally-spaced times, ts.

SciPy’s implementation of the exponential distribution does not take lam as a parameter, so we have to set scale=1/lam.
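The cell that evaluates the PDF is not shown here. As a sketch, assuming a grid of 201 equally spaced times from 0 to 2.5 games (a spacing of 0.0125, which places t=0.5 at index 40 as in the output below), the computation could be written like this. The names `ts` and `ps` follow the text; the grid is an assumption.

```python
import numpy as np
from scipy.stats import expon

lam = 3                            # goal-scoring rate, goals per game
ts = np.linspace(0, 2.5, 201)      # times in units of games (assumed grid)
ps = expon.pdf(ts, scale=1/lam)    # density at each time, with lam fixed
```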

The PDF is a function of t with lam as a fixed parameter.
Here’s what it looks like.

In [14]:

plt.plot(ts, ps)
decorate(xlabel='Games until next goal', ylabel='Density')

Notice that the values on the y-axis extend above 1. That would not be possible if they were probability masses, but it is possible because they are probability densities.

By themselves, probability densities are hard to interpret.
As an example, we can pick an arbitrary element from ts and the corresponding element from ps.

In [15]:

ts[40],ps[40]

Out[15]:

(0.5, 0.6693904804452895)

So the probability density at t=0.5 is about 0.67. What does that mean? Not much.

To get something meaningful, we have to compute an area under the PDF.
For example, if we want to know the probability that the first goal is scored during the first half of a game, we can compute the area under the curve from t=0 to t=0.5.

We can use a slice index to select the elements of ps and ts in this interval, and NumPy’s trapz function, which uses the trapezoid method to compute the area under the curve.

In [16]:

np.trapz(ps[:41],ts[:41])

Out[16]:

0.7769608771522626

The probability of a goal in the first half of the game is about 78%.
To check that we got the right answer, we can compute the same probability using the exponential CDF.

In [17]:

expon.cdf(0.5,scale=1/lam)

Out[17]:

0.7768698398515702

Considering that we used a discrete approximation of the PDF, our estimate is pretty close.

This example provides an operational definition of a probability density: it’s something you can add up over an interval — or integrate — to get a probability mass.

Now let’s suppose that we don’t know the parameter lam and we want to use data to estimate it.
And suppose we observe a game where the first goal is scored at t=0.5.
As we did when we estimated the parameter mu of the Poisson distribution, we can find the value of lam that maximizes the likelihood of this data.

First we’ll define a range of possible values of lam.

In [18]:

lams=np.linspace(0,20,201)

Then for each value of lam, we can evaluate the exponential PDF at the observed time t=0.5 — using errstate to ignore the “division by zero” warning when lam is 0.
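The cell that does this is not shown here. A minimal sketch of the step, using the same grid of `lams`, might look like the following; the variable name `ls` is reused from the Poisson example as an assumption.

```python
import numpy as np
from scipy.stats import expon

t = 0.5                           # observed time of the first goal, held fixed
lams = np.linspace(0, 20, 201)    # candidate values of lam

# 1/lams divides by zero when lam is 0; suppress the warning
with np.errstate(divide='ignore'):
    ls = expon.pdf(t, scale=1/lams)

# value of lam that maximizes the likelihood
lam_mle = lams[np.argmax(ls)]
```

For a single observation from an exponential distribution, the likelihood lam * exp(-lam * t) peaks at lam = 1/t, so with t=0.5 the maximum is at lam=2.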

In the first example, we evaluated a Poisson PMF at discrete values of k with a fixed parameter, mu. The results were probability masses.

In the second example, we evaluated the same PMF for possible values of a parameter, mu, with a fixed value of k. The result was a likelihood function where each point is a probability mass.

In the third example, we evaluated an exponential PDF at possible values of t with a fixed parameter, lam. The results were probability densities, which we integrated over an interval to get a probability mass.

In the fourth example, we evaluated the same PDF at possible values of a parameter, lam, with a fixed value of t. The result was a likelihood function where each point is a probability density.

A PDF is a function of an outcome, like the number of goals scored or the time until the first goal, given a fixed parameter.
If you evaluate a PDF, you get a probability density.
If you integrate density over an interval, you get a probability mass.

A likelihood function is a function of a parameter, given a fixed outcome.
If you evaluate a likelihood function, you might get a probability mass or a density, depending on whether the outcome is discrete or continuous.
Either way, evaluating a likelihood function at a single point doesn’t mean much by itself.
A common use of a likelihood function is finding a maximum likelihood estimator.

As OP says, “Probability density and likelihood are not the same thing”, but the distinction is not clear because they are not completely distinct things, either.

A probability density can be a likelihood, but not all densities are likelihoods.

A likelihood can be a probability density, but not all likelihoods are densities.