Go Get the Data

My mantra when I was working on Probably Overthinking It was “Go Get the Data.” If I wanted to use a result from prior work, I would get the data whenever possible and make my own visualization. Of course, that’s more work than copying and pasting a figure, but there are a lot of benefits. One is that I can often get newer data. Another is that I can check the results.

I was reminded of these benefits a few months ago when I was reading a very good book that I won’t name because I’m about to point out an error, and I don’t want to be a jerk about it. The book includes the following figure:

Source: A book I’m not identifying because the point of this example is not to name and shame.

The top line got my attention because I have worked on several projects with data from the General Social Survey (GSS), and this particular question is one of the examples in my Political Alignment Case Study. Plotting the fraction of people who think homosexuality is wrong, I generated this figure:

Source: Downey, Political Alignment Case Study.

These results are not the same (even accounting for the inverted y-axis).

So I checked the source, which is Steven Pinker’s The Better Angels of Our Nature, specifically this figure:

Source: Page 452 of Pinker, The Better Angels of Our Nature

It looks like Pinker’s figure is consistent with mine. So what went wrong with the first figure?

Here’s what I think happened: they wanted to invert the y-axis to show the percentage who think homosexuality is not wrong. But instead of flipping the y-axis, they seem to have flipped the x-axis, reversing the arrow of time! If you compare the other lines, it looks like they have all been reversed left-for-right.

Now, this is just one error in an otherwise excellent book. But it’s a reminder that this kind of thing can be avoided, if you Go Get the Data.
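
If you want to go get this data yourself, the computation is simple. Here is a rough sketch in pandas; the variable names follow the GSS codebook (HOMOSEX is coded 1 for "always wrong"), but check whatever extract you download:

    import pandas as pd

    # Hypothetical GSS extract; column names depend on the download
    gss = pd.read_csv("gss_extract.csv")

    # Fraction of respondents per year who chose "always wrong" (coded 1)
    valid = gss.dropna(subset=["homosex"])
    wrong = (valid["homosex"] == 1)
    fraction = wrong.groupby(valid["year"]).mean()
    fraction.plot(xlabel="Year", ylabel="Fraction saying 'always wrong'")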

Abstracts and keywords

As Probably Overthinking It approaches the finish line, there are just a few more tasks: I am working on the index and — as I have recently learned — I also have to write a 200-word abstract and a list of keywords for each chapter, plus a 250-word abstract for the whole book. I have come to two conclusions:

  • There is no good software for making an index from a PDF document.
  • ChatGPT is pretty good at writing chapter abstracts.

By converting each chapter to a simple text file (no figures and tables) I was able to get most of them under the ~4000 token limit. Then I asked, “write a one-paragraph abstract for this chapter with no more than 180 words. write about the chapter in the third person”.
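
(A rough way to check whether a chapter fits, using the rule of thumb of about four characters per token:)

    # Crude token estimate: roughly 4 characters per token
    with open("chapter08.txt") as f:    # hypothetical chapter file
        text = f.read()
    print(len(text) / 4)                # compare to the ~4000 token limit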

Here’s the unedited abstract for the last chapter:

The chapter titled “Chasing the Overton Window” explores the phenomenon of people considering themselves more conservative as they age, despite their views actually becoming more liberal over time. The chapter begins by examining the trends of self-identification and political views across different age groups. It then delves into the concept of the Overton window, which represents the range of acceptable public opinion on a given issue. The chapter explains how the center of public opinion has shifted toward liberalism over the past 50 years and how the relative terms “liberal” and “conservative” depend on the perceived center. Furthermore, it demonstrates that both liberals and conservatives have become more liberal over time. The chapter concludes by proposing an explanation for why people think they are becoming more conservative, even though their views are becoming more liberal, which involves the interplay of generational effects, the shifting center of public opinion, and the connotations associated with political labels.

ChatGPT June 10, 2023

It’s not great prose, but I think I can revise it into something acceptable without much effort.

Three of the chapters exceeded the token limit, so I asked for a summary of the first half, then a summary of the second half, then I asked, “Combine the following two abstracts into a single paragraph with no more than 180 words”. Here’s the combined abstract of Chapter 8:

This chapter delves into the distribution of natural and human-caused disasters, investigating their sizes, costs, prevalence, and characteristics within long-tailed distributions. Understanding the probabilities of major disasters is crucial for effective preparedness and response, despite the challenge of comprehending rare and large-scale events. By analyzing a dataset of 125 disasters, including hurricanes, earthquakes, floods, nuclear disasters, and terror attacks, the author demonstrates a pattern where doubling the rank of a disaster corresponds to halving its costs when plotted on a logarithmic scale. While exploring the limitations of the lognormal distribution in predicting the probabilities of large disasters, the author introduces Student’s t-distribution as a more suitable model for estimating the probabilities of extreme events. The chapter also examines lunar craters and their abundance and sizes, revealing the prevalence of long-tailed distributions and their connection to asteroid sizes. Additionally, it explores the occurrence of long-tailed distributions in stock market crashes and introduces the concept of black swans to emphasize their relevance in understanding rare and impactful events. Concluding the chapter, it discusses the challenges associated with predicting and comprehending rare, large events in a long-tailed world, with a specific focus on earthquake magnitudes and a comparison of prediction models.

ChatGPT June 10, 2023

Again, I think that’s a short editing distance away from acceptable — and a near-perfect 198 words.

It does pretty well with keywords, too:

  1. Disasters
  2. Long-tailed distributions
  3. Probabilities
  4. Preparedness
  5. Response
  6. Natural events
  7. Human-made incidents
  8. Lognormal distribution
  9. Student’s t-distribution
  10. Black swans

To generate the abstract for the whole book, I pasted the chapter abstracts and keywords into a single document and asked, “the following are chapter abstracts for a book called ‘Probably Overthinking It’. write a one-paragraph abstract for the whole book, more than 200 words, but not more than 250 words. write about the book in the third person. not too much like an ad”

Here’s the unedited result:

“Probably Overthinking It” is a thought-provoking book that delves into various statistical and cognitive paradoxes, challenging common intuitions and shedding light on the complexities of human behavior and measurement. Through a series of captivating chapters, the author explores topics such as normalcy and weirdness, inspection paradoxes, family size and population growth, lognormal distributions, expected remaining time, Berkson’s paradox, low birthweight paradox, long-tailed distributions, fairness and fallacy in algorithmic decision-making, Simpson’s paradox, the moral circle, the evolution of societal values, and the shifting perceptions of political ideologies. By analyzing real-world data and employing statistical models, the book uncovers the nuances and contradictions that emerge when multiple factors interact, leading to counterintuitive outcomes. Readers will gain a deeper understanding of statistical biases, the limitations of measurement, and the challenges inherent in making accurate predictions and drawing meaningful conclusions. “Probably Overthinking It” encourages readers to question their assumptions, approach complex problems with skepticism, and develop a more nuanced and informed perspective on the intricacies of human behavior and the world around us.

ChatGPT June 10, 2023

Surprisingly, the book abstract is only 169 words, despite my prompt. In general, the word counts are not precise, and more often over than under (which is why I asked for 180, not 200).

So, I have some editing to do, but I’d say ChatGPT saved me at least a few hours — and spared me from exactly the kind of writing I dislike the most.

Absence of Evidence

If anyone tells you that absence of evidence is not evidence of absence, you have my permission to slap them. Of course, my permission will not prevent you from getting slapped back or charged with assault. Regardless, absence of evidence is very often evidence of absence, and sometimes strong evidence.

To make this claim precise, I propose we use the Bayesian definition of evidence:

If an observation, D, would be more likely under a hypothesis, H, than under the alternative hypothesis, then D is evidence in favor of H. Conversely, if D is less likely under H than under the alternative, D is evidence against H.

As an example, suppose H is the hypothesis that unicorns exist. Since people have explored most of the world’s land mass, I’d say there’s a 99% chance we would have found unicorns if they existed.

So if D is the fact that we have not found unicorns, the probability of D is only 1% if unicorns exist, and 100% if they don’t. Therefore, D is evidence that unicorns don’t exist, with a likelihood ratio of 100:1.
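
In Bayesian terms, that likelihood ratio multiplies the prior odds. Here is the two-line version, with a made-up prior:

    # Likelihood of "no unicorns found" under each hypothesis
    p_not_found_if_exist = 0.01    # we'd almost surely have found them
    p_not_found_if_not = 1.00

    prior_odds = 1 / 1000          # made-up prior odds that unicorns exist
    posterior_odds = prior_odds * (p_not_found_if_exist / p_not_found_if_not)
    print(posterior_odds)          # 1e-05: 100 times less likely than before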

Let’s consider a more realistic example. In a recent article, The Economist discusses the hypothesis that social media use is a major cause of recent increases in rates of self-harm and suicide among teenage girls. To test this hypothesis, they propose an experiment:

Because smartphones were adopted at different rates in different countries, the timing of any increases they caused in suicides or self-harm should vary on this basis.

But their experiment came up empty:

[W]e could not find any statistical link between changes over time in the prevalence of either mobile-internet subscriptions or self-reported social-media use in a country, and changes over time in that country’s suicide or self-harm hospitalisation rates, for either boys or girls.

They conclude:

But if social media were the sole or main cause of rising levels of suicide or self-harm—rather than just one part of a complex problem—country-level data would probably show signs of their effect.

Since it did not, this negative result is evidence against the hypothesis. It may not be strong evidence; there are other reasons the experiment might have failed. And in light of other evidence, it is still plausible that social media is harmful to mental health.

Nevertheless, in this example, as in any reasonable experiment, absence of evidence is evidence of absence.

[In this 2015 article, I made a similar claim that we should stop saying correlation does not imply causation.]

Causation, Collision, and Confusion

Today I presented a talk about Berkson’s paradox at ODSC East 2023. If you missed it, the slides are here. When the video is available, I’ll post it here.

Abstract: Collision bias is the most treacherous error in statistics: it can be subtle, it is easy to induce it by accident, and the error it causes can be bigger than the effect you are trying to measure. It is the cause of Berkson’s paradox, the low birthweight paradox, and the obesity paradox, among other famous historical errors. And it might be the cause of your next blunder! Although it is best known in epidemiology, it appears in other fields of science, engineering, and business.

In this talk, I will present examples of collision bias and show how it can be caused by a biased sampling process or induced by inappropriate statistical controls; and I will introduce causal diagrams as a tool for representing causal hypotheses and diagnosing collision bias.

So, don’t tell anyone, but this talk is part of my stealth book tour!

  • It started in 2019, when I presented a talk at PyData NYC based on Chapter 2: Relay Races and Revolving Doors.
  • In 2022, I presented another talk at PyData NYC, based on Chapter 12: Chasing the Overton Window.
  • Today’s talk is based on Chapter 7: Causation, Collision, and Confusion.
  • In July I’m presenting a talk at SciPy based on Chapter 9: The Long Tail of Disaster.

And, if things go according to plan, I’ll present Chapter 1 at a book event at the Needham Public Library on December 7.

More chapters coming soon!

The Overton Paradox

Chapter 12 of Probably Overthinking It is about three trends that form what I’m calling the Overton Paradox:

  • Older people are more likely to say they are conservative.
  • And older people hold more conservative views.
  • But people don’t become more conservative as they get older — on average they get a little more liberal.

To demonstrate these trends, I used data from the General Social Survey.

Older people are more likely to say they are conservative:

And older people hold more conservative views:

But if we split people up by decade of birth, most cohorts don’t become more conservative as they get older; on average they become a little more liberal.

So if people become more liberal as they age, why are they more likely to say they are conservative?

I think the reason is that the perceived center of mass changes over time. Here’s how the average number of conservative responses has changed over the ~50 years of the GSS:

And it’s not just liberals going off the rails — all three groups have changed:

Let’s compare these changes to the average for people born in the 1940s:

In 1970, when they were in their 20s, this cohort was about as liberal as the average liberal. In 1990, when they were in their 40s, they were indistinguishable from the average moderate. In 2020, when they were in their 70s, they found themselves substantially right of center.

On average, they are more liberal now than they were in 1970, but the world has moved faster. They are more likely to say they are conservative because, relative to the center of mass, they are.
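
For readers who want to replicate these figures, here is the shape of the cohort computation in pandas; the column names are placeholders for whatever your GSS extract uses (cohort is year of birth, and conservative is a count of conservative responses):

    # Group respondents by decade of birth and survey year,
    # then average the conservative responses within each group
    gss["decade"] = (gss["cohort"] // 10) * 10
    by_cohort = gss.groupby(["decade", "year"])["conservative"].mean()

    # One line per cohort, showing how each changes as it ages
    by_cohort.unstack("decade").plot()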

LLM-Assisted Programming

I’ve been experimenting with programming assisted by Large Language Models (LLMs) like ChatGPT. I am amazed at how good it is, and it seems clear to me that the great majority of programming work will be LLM-assisted, starting now. Here are some of the examples I’ve tried.

Think Python

For most of the exercises in Think Python, if you paste the question into ChatGPT, you get a correct answer on the first try. For example, here’s one of the exercises in the chapter on dictionaries:

Prompt: If you did Exercise 7, you already have a function named has_duplicates that takes a list as a parameter and returns True if there is any object that appears more than once in the list.

Use a dictionary to write a faster, simpler version of has_duplicates.

And here’s the response:

ChatGPT: Here’s one possible implementation of the has_duplicates function using a dictionary to keep track of the elements in the list:

def has_duplicates(lst):
    """
    Returns True if there are any duplicates in the given list, False otherwise.
    """
    seen = {}
    for item in lst:
        if item in seen:
            return True
        seen[item] = True
    return False

ChatGPT: The function works by iterating through the list and adding each item to the seen dictionary. If an item is already in the dictionary, it means that we’ve seen it before and therefore there’s a duplicate in the list, so we return True. If we iterate through the entire list without finding any duplicates, we return False.

This implementation is faster and simpler than the previous one because checking for the presence of an item in a dictionary is much faster than iterating through the entire list each time. Also, using a dictionary allows us to keep track of the elements in the list using less code.
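
For what it’s worth, the idiomatic one-liner uses a set, which has the same constant-time membership test:

    def has_duplicates(lst):
        # A set keeps only unique elements, so any shrinkage means a duplicate
        return len(set(lst)) < len(lst)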

One of the later exercises asks what’s wrong with this class definition:

class Kangaroo:
    """A Kangaroo is a marsupial."""
    
    def __init__(self, name, contents=[]):
        """Initialize the pouch contents.
        name: string
        contents: initial pouch contents.
        """
        self.name = name
        self.pouch_contents = contents

Here’s the response:

ChatGPT: The Kangaroo class has a potential issue in its __init__ method with the contents parameter. Specifically, the default value for the contents parameter is a mutable object (i.e., a list), which can cause unexpected behavior.

When the __init__ method is called without a contents argument, it creates a new instance of the list object and assigns it to the contents parameter. However, since the default value for contents is a mutable object, the same list object is used as the default value for all Kangaroo instances that don’t provide a contents argument.

This means that if you create two Kangaroo instances without providing a contents argument, both instances will share the same list object. This can lead to unexpected behavior if you modify the list of one Kangaroo instance, as the changes will be reflected in the other instance as well.

To fix this issue, you can use a more appropriate default value for the contents parameter, such as None, and then create a new list instance inside the __init__ method if no contents argument is provided, like this:

class Kangaroo:
    
    def __init__(self, name, contents=None):
        """Initialize the pouch contents.
        name: string
        contents: initial pouch contents.
        """
        self.name = name
        if contents is None:
            self.pouch_contents = []
        else:
            self.pouch_contents = contents

So that’s pretty good.
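
A quick demo confirms the bug ChatGPT describes, using the original definition:

    kanga = Kangaroo("Kanga")
    roo = Kangaroo("Roo")
    kanga.pouch_contents.append("wallet")
    print(roo.pouch_contents)    # ['wallet'] -- both kangaroos share one list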

On Twitter, many people have helpfully pointed out that the exercises in Think Python, and solutions available on the web, are part of ChatGPT’s training corpus, so I am “testing on the training set”.

While that may be true, I think it misses the point several times over:

  1. The responses I saw were not my solutions, and based on a few Google searches, I could not find a source they were copied from. At worst, ChatGPT makes it easier and faster to find a correct solution than a conventional search. But it seems to be doing more than that, recombining elements of the training corpus in useful (possibly novel) ways.
  2. Even if LLMs do nothing more than retrieve solutions from a large database, that alone would accelerate the great majority of programming — so substantially that I think it will change the day-to-day work of a great majority of programmers.

Data Structures

One of my complaints about data structures classes is that most of the work students do is re-implementation of basic algorithms. Most of them have been available in libraries for a long time, which means there is an opportunity to spend less time on how they work and more time on how we use them to do more interesting things. I hope the prevalence of LLMs will accelerate this process.

Here’s one of the exercises from Data Structures and Information Retrieval in Python (aka DSIRP):

Prompt: Write a function called intersect that takes two LetterSet objects and returns a new LetterSet that represents the intersection of the two sets. In other words, the new LetterSet should contain only elements that appear in both sets. In a comment, identify the order of growth of this function.

I also provided the introduction to the question, which introduced the LetterSet object. Here’s the code ChatGPT generated:

def intersect(set1, set2):
    bits = [a and b for a, b in zip(set1.bits, set2.bits)]
    return LetterSet(bits)

And its explanation:

ChatGPT: The intersect function first creates a new list bits by using a list comprehension that computes the element-wise logical AND of the bits in the two LetterSet objects. The resulting list contains True values only for the elements that appear in both sets, and False values for all other elements.

The intersect function then creates a new LetterSet object using the bits list and returns it.

The order of growth of the intersect function is O(1), because the size of the LetterSet objects is fixed at 26 elements, and the function always performs a constant number of operations (i.e., create a new list using a list comprehension, create a new LetterSet object, and return it).
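
For readers who haven’t seen DSIRP, here is a minimal stand-in for LetterSet (the version in the book may differ) that makes the generated function runnable:

    class LetterSet:
        """Represents a set of lowercase letters as 26 booleans."""
        def __init__(self, bits):
            self.bits = bits

    letters = "abcdefghijklmnopqrstuvwxyz"
    set1 = LetterSet([c in "banana" for c in letters])
    set2 = LetterSet([c in "bandana" for c in letters])
    both = intersect(set1, set2)    # bits set for a, b, and n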

Unlike Think Python, DSIRP is not well known, and if there are answers to these exercises on the web, they are not easy to find.

But again, I don’t think it matters where the answers are coming from. The impact on the practice of programming is the same: the solution to basic programming problems can be generated quickly, which means programmers will be able to spend more time on higher-value activities.

Probability of Superiority

Let’s see how ChatGPT does with something a little harder:

Prompt: write a python function that takes the parameters of two poisson distributions and computes the probability of superiority

Here’s the code (I’ll skip the explanatory text):

from math import exp, factorial

def poisson_probability_of_superiority(lambda1, lambda2):
    """
    Calculates the probability of superiority between two Poisson distributions.
    
    Parameters:
    lambda1 (float): the parameter of the first Poisson distribution.
    lambda2 (float): the parameter of the second Poisson distribution.
    
    Returns:
    float: the probability that a random variable drawn from the first Poisson distribution is greater than a random variable drawn from the second Poisson distribution.
    """
    prob = 0
    for i in range(int(lambda1) + 1):
        for j in range(int(lambda2)):
            prob += (exp(-lambda1) * lambda1**i / factorial(i)) * (exp(-lambda2) * lambda2**j / factorial(j))
    
    return prob

I asked for a version that uses SciPy and NumPy, and got this:

import numpy as np
from scipy.stats import poisson

def poisson_probability_of_superiority(lambda1, lambda2):
    """
    Calculates the probability of superiority between two Poisson distributions.
    
    Parameters:
    lambda1 (float): the parameter of the first Poisson distribution.
    lambda2 (float): the parameter of the second Poisson distribution.
    
    Returns:
    float: the probability that a random variable drawn from the first Poisson distribution is greater than a random variable drawn from the second Poisson distribution.
    """
    k = np.arange(int(lambda1) + 1)
    pmf1 = poisson.pmf(k, lambda1)
    cdf2 = poisson.cdf(k - 1, lambda2)
    prob = np.sum(pmf1 * (1 - cdf2))
    
    return prob

This is close to being a good solution, but there are still a couple of problems. However, after just a few more iterations, I got a good solution to this problem, and a similar solution that works with exponential distributions. You can see the sequence of prompts and responses in this notebook.
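
The two problems, as far as I can tell: the sum stops at int(lambda1) + 1, which drops the upper tail of the first distribution, and 1 - cdf2 computes P(X1 <= X2) rather than P(X1 > X2). Here is a sketch of one way to patch both (not necessarily what ended up in the notebook):

    import numpy as np
    from scipy.stats import poisson

    def poisson_probability_of_superiority(lambda1, lambda2):
        """P(X1 > X2) for independent Poisson random variables."""
        # Sum far enough into the tail that the neglected mass is negligible
        lam = max(lambda1, lambda2)
        k = np.arange(int(lam + 10 * np.sqrt(lam) + 10))
        pmf1 = poisson.pmf(k, lambda1)
        cdf2 = poisson.cdf(k - 1, lambda2)    # P(X2 < k) = P(X2 <= k - 1)
        return np.sum(pmf1 * cdf2)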

I have no idea where ChatGPT’s solution came from. If you Google for a solution to this problem, the results are not very useful. If anyone reading this can find the source of this code, please let me know.

Looking at the sequence of solutions ChatGPT generated, I got several ideas for solving this problem that I might not have come up with. The solutions I got still need some work, but I could polish them off without much additional effort. And I got ChatGPT to generate a unit test as well:

def test_exponential_probability_of_superiority():
    # Test case with known result
    lambda1 = 0.5
    lambda2 = 1.0
    expected_prob = 0.368
    tol = 0.001
    prob = exponential_probability_of_superiority(lambda1, lambda2)
    assert abs(prob - expected_prob) < tol, f"Test case failed: expected {expected_prob}, but got {prob}"

I would not have confidence in this solution without additional refactoring and testing, but ChatGPT got me off to a very fast start on a non-trivial problem.
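
One footnote on that test: for independent exponentials, the probability of superiority has a closed form, and it doesn’t match. If the parameters are rates, P(X1 > X2) = lambda2 / (lambda1 + lambda2):

    # Analytic check for independent exponentials with rates lambda1 and lambda2
    lambda1, lambda2 = 0.5, 1.0
    print(lambda2 / (lambda1 + lambda2))    # 0.667, not the 0.368 the test expects

(0.368 is suspiciously close to 1/e, and it doesn’t match the closed form whether you read the parameters as rates or as means.) So the “known result” in the generated test is itself something to verify, which reinforces the previous paragraph: the tests need testing, too.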

Correlation in Elm

If you find yourself working with an unfamiliar language or libraries, LLMs can accelerate your learning — by a lot. Recently I started learning Elm, so I am not very good at it. And it is not a commonly-used language, so I expect the corpus to be limited. Let’s see how it does on a basic stats question:

Prompt: write an elm function to compute the correlation of two lists of floats

Here’s the response:

correlation : List Float -> List Float -> Maybe Float
correlation xs ys =
    let
        n = List.length xs
        meanX = List.sum xs / Float.fromInt n
        meanY = List.sum ys / Float.fromInt n
        numerator = List.foldl (\(x, y) acc -> acc + ((x - meanX) * (y - meanY))) 0 (List.zip xs ys)
        denominatorX = List.foldl (\x acc -> acc + ((x - meanX) * (x - meanX))) 0 xs
        denominatorY = List.foldl (\y acc -> acc + ((y - meanY) * (y - meanY))) 0 ys
        denominator = sqrt (denominatorX * denominatorY)
    in
    if denominator == 0 then
        Nothing
    else
        Just (numerator / denominator)

That’s pretty good, and the explanation is even better:

ChatGPT: The correlation function takes in two lists of floats xs and ys and computes the Pearson correlation coefficient between them. The result is returned as a Maybe Float, as it is possible for the denominator to be 0 if one of the lists is empty or if all the elements in one of the lists are the same, in which case the correlation is undefined.

The function first computes the mean of each list and then uses these means to calculate the numerator of the correlation coefficient using a fold over the zipped lists. It then calculates the denominators for each list separately, and computes the denominator of the correlation coefficient by taking the product of these two denominators and then taking the square root. Finally, the function checks if the denominator is 0 and returns Nothing in that case, or returns the correlation coefficient as a Just Float.

To someone getting the hang of Elm’s type system, this is very helpful.

I asked it to generate a version using map instead of foldl, and another using existing functions to compute mean and variance. You can see the entire exchange in this notebook.

Coding is different now

When I wrote about these examples on Twitter, I got more disagreement than I expected. Lots of people reminded me of the limitations of LLMs for generating code. But again, I think this is missing the point. Even if LLMs only solve simple programming problems, there are a lot of simple programming problems! And I conjecture that most programmers spend most of their time on things that ChatGPT could greatly accelerate — or just solve.

And we’ve only been using them for a few weeks! LLMs will get better, and we will get better at using them. So I stand by my conclusion: The great majority of coding will be LLM-assisted, starting now.

LLMs will also have a huge effect on how we teach and learn programming, but I’ll get to that later.

Addendum: The New Skills

Most of the skills programmers use now are also the skills they will need to work with LLMs. Breaking a problem down into smaller problems, and designing good interfaces between components, are still essential skills. One difference is that now, for each of those smaller problems, programmers need to decide whether it would be easier and faster to solve it themselves or start a conversation with an LLM.

After deciding whether to start a conversation, the next big question is how to compose the prompt. In particular, it makes a big difference how much information is included as a preamble. Just as we all got better at composing search terms, we’ll get better at composing prompts. (I have a hard time not being polite to ChatGPT. I wonder if that will persist, or we’ll start writing blunt imperatives.)

And a final question is when to stop a conversation and work with the code you have, or ask for further refinements. In my experiments, it felt like I reached a point of diminishing returns, where further refinements were likely to introduce new errors. On the other hand, asking for at least two versions of a function produced useful variations.

At least for now, we cannot assume that code produced by an LLM is correct, which means it needs extensive testing. People who are used to test-driven development (TDD) will have a head start with LLM-assisted programming. Of course, we can use LLMs to generate unit tests as well, but then we have to validate the unit tests, too.

Which brings me to what I think will be the most important skill for LLM-assisted programming: reading code. LLMs can generate code much faster than we can understand it, so the ability to read, understand, and check code will be critical.

The other skill that will become more important is meta-language, that is, the vocabulary we use to talk about programs. In my correlation in Elm example, I asked ChatGPT to “factor out the anonymous function”, and it knew exactly what I meant. In general, it seems to understand the meta-language of programming well, so it will be useful if we can speak it.

Most of the skills programmers need to work with LLMs are the skills they already have, but some of them will become more important, especially problem decomposition, reading code, and speaking the meta-language of programming.

How Many Typos?

When I started work at Brilliant a couple of weeks ago, I learned that one of my new colleagues, Michelle McSweeney, just published a book called OK, which is all about the word OK.

As we discussed the joys and miseries of publishing, Michelle mentioned that she had found a typo in the book after publication. So naturally I took it as a challenge to find the typo. While I was searching, I enjoyed the book very much. If you are interested in etymology, linguistics, and history, I recommend it!

As it turned out, I found exactly one typo. When I told Michelle, she asked me nervously which page it was on. Page 17. She looked disappointed – that was not the same typo she found.

Now, for people who like Bayesian statistics, this scenario raises some questions:

  1. After our conversation, how many additional typos should we expect there to be?
  2. If she and I had found the same typo, instead of different ones, how many typos would we expect?

As it happens, I used a similar scenario as an example in Think Bayes. So I was able to reuse some code and answer these questions.
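
Here is a minimal sketch of the kind of computation involved, assuming (unrealistically) that we know each reader’s probability of catching a typo:

    import numpy as np
    from scipy.stats import multinomial

    # Made-up detection probabilities for the two readers
    p1, p2 = 0.4, 0.4

    # Each typo is caught by one reader only, by both, or by neither
    probs = [p1 * (1 - p2), (1 - p1) * p2, p1 * p2, (1 - p1) * (1 - p2)]
    k1, k2, k_both = 1, 1, 0    # what we observed
    k_found = k1 + k2 + k_both

    # Uniform prior over the total number of typos, updated with the data
    ns = np.arange(k_found, 51)
    likelihood = np.array([
        multinomial.pmf([k1, k2, k_both, n - k_found], n, probs) for n in ns
    ])
    posterior = likelihood / likelihood.sum()

    # Expected number of typos that neither of us found
    print(np.sum(ns * posterior) - k_found)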

You can read my solution here.

You can also click here to run the notebook with the solution on Colab.

The Bayesian Killer App

It’s been a while since anyone said “killer app” without irony, so let me remind you that a killer app is software “so necessary or desirable that it proves the core value of some larger technology,” quoth Wikipedia. For example, most people didn’t have much use for the internet until the world wide web was populated with useful content and the first generation of browsers made it easy to access.

So what is the Bayesian killer app? That is, for people who don’t know much about Bayesian methods, what’s the application that demonstrates their core value? I have a nomination: Thompson sampling, also known as the Bayesian bandit strategy, which is the foundation of Bayesian A/B testing.

I’ve been writing and teaching about Bayesian methods for a while, and Thompson sampling is the destination that provides the shortest path from Bayes’s Theorem to a practical, useful method that is meaningfully better than the more familiar alternative, hypothesis testing in general and Student’s t test in particular.

So what does that path look like? Well, funny you should ask, because I presented my answer last November as a tutorial at PyData Global 2022, and the video has just been posted:

[Video: Bayesian Decision Analysis tutorial, PyData Global 2022]

The materials for the tutorial — including the slides and Jupyter notebooks — are in this repository.

And if you like the tutorial, you’ll love the game: here are the instructions for a game I designed that uses dice to implement Thompson sampling.
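
And if you just want the core idea in code, here is a minimal sketch of Thompson sampling for a two-arm Bernoulli bandit with Beta priors (a toy, not the tutorial’s implementation):

    import numpy as np

    rng = np.random.default_rng(42)
    true_rates = [0.04, 0.05]    # hypothetical conversion rates for A and B
    wins = np.zeros(2)
    losses = np.zeros(2)

    for _ in range(10_000):
        # Sample each arm's Beta posterior and play the arm that looks best
        samples = rng.beta(wins + 1, losses + 1)
        arm = np.argmax(samples)
        if rng.random() < true_rates[arm]:
            wins[arm] += 1
        else:
            losses[arm] += 1

    print(wins + losses)    # most of the plays go to the better arm

Exploration and exploitation both fall out of the posterior sampling, with no tuning parameters.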

Abstract

This tutorial is a hands-on introduction to Bayesian Decision Analysis (BDA), which is a framework for using probability to guide decision-making under uncertainty. I start with Bayes’s Theorem, which is the foundation of Bayesian statistics, and work toward the Bayesian bandit strategy, which is used for A/B testing, medical tests, and related applications. For each step, I provide a Jupyter notebook where you can run Python code and work on exercises. In addition to the bandit strategy, I summarize two other applications of BDA, optimal bidding and deriving a decision rule. Finally, I suggest resources you can use to learn more.

Outline

  • Problem statement: A/B testing, medical tests, and the Bayesian bandit problem
  • Prerequisites and goals
  • Bayes’s theorem and the five urn problem
  • Using Pandas to represent a PMF
  • Estimating proportions
  • From belief to strategy
  • Implementing and testing Thompson sampling
  • More generally: two other examples of BDA
  • Resources and next steps

Prerequisites

For this tutorial, you should be familiar with Python at an intermediate level. We’ll use NumPy, SciPy, and Pandas, but I’ll explain what you need to know as we go. You should be familiar with basic probability, but you don’t need to know anything about Bayesian statistics. I provide Jupyter notebooks that run on Colab, so you don’t have to install anything or prepare ahead of time. But you should be familiar with Jupyter notebooks.

What does skew mean?

My audience skews left; that is, the people who read my blog are more liberal, on average, than the general population. For example, if I surveyed my readers and asked where they place themselves on a scale from liberal to conservative, the results might look like this:

To be clear, I have not done a survey and this is fake data, but if it were real, we would conclude that my audience is more liberal, on average, than the general population. So in the normal use of the word skew, we might say that this distribution “skews to the left”.

But according to statisticians, that would be wrong, because within the field of statistics, skew has been given a technical meaning that is contrary to its normal use. Here’s how Wikipedia explains the technical definition:

positive skew: The right tail is longer; the mass of the distribution is concentrated on the left of the figure. The distribution is said to be right-skewed, right-tailed, or skewed to the right, despite the fact that the curve itself appears to be skewed or leaning to the left; right instead refers to the right tail being drawn out and, often, the mean being skewed to the right of a typical center of the data. A right-skewed distribution usually appears as a left-leaning curve.

https://en.wikipedia.org/wiki/Skewness

By this definition, we would say that the distribution of political alignment in my audience is “skewed to the right”. It is regrettable that the term was defined this way, because it’s very confusing.
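
To see the collision between the two senses, here is a quick check with SciPy on made-up data shaped like the figure above, with most of the mass on the left and a long right tail:

    import numpy as np
    from scipy.stats import skew

    rng = np.random.default_rng(1)
    sample = rng.exponential(size=10_000)    # leans left, long right tail
    print(skew(sample))                      # about +2, so "skewed to the right"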

Recently I ran a Twitter poll to see what people think skew means. Here are the results:

Interpreting these results is almost paradoxical: the first two responses, one for each definition, are almost equally common, which proves that the third response (that the term is ambiguous) is correct. If the statistically-literate people who follow me on Twitter don’t agree about what skew means, we have to treat it as ambiguous unless specified.

The comments suggest I’m not the only one who thinks the technical definition is contrary to intuition.

  • This has always been confusing for me, since the shape of a right-skewed distribution looks like it’s “leaning” to the left…
  • I learnt it as B, but there’s always this moment when I consciously have to avoid thinking it’s A.
  • This is one of those things where once I learned B was right, I hated it so much that I never forgot it.

It gets worse

If you think the definition of skew is bad, let’s talk about bias. In the context of statistics, bias is “a systematic tendency which causes differences between results and fact”. In particular, sampling bias is bias caused by a non-representative sampling process.

In my imaginary survey, the mean of the sample is less than the actual mean in the population, so we could say that my sample is biased to the left. Which means that the distribution is technically biased to the left and skewed to the right. Which is particularly confusing because in natural use, bias and skew mean the same thing.

So 20th century statisticians took two English words that are (nearly) synonyms, and gave them technical definitions that can be polar opposites. The result is 100 years of confusion.

For early statisticians, it seems like creating confusing vocabulary was a hobby. In addition to bias and skew, here’s a partial list of English words that are either synonyms or closely related, which have been given technical meanings that are opposites or subtly different.

  • accuracy and precision
  • probability and likelihood
  • efficacy and effectiveness
  • sensitivity and specificity
  • confidence and credibility

And don’t get me started on “significance”.


If you got this far, it seems like you are part of my audience, so if you want to answer a one-question survey about your political alignment, follow this link. Thank you!

My poll and this article were prompted by this excellent video about the Central Limit Theorem:

Around the 7:52 mark, a distribution that leans left is described as “skewed towards the left”. In statistics jargon, that’s technically incorrect, but in this context I think it is likely to be understood as intended.