January 2019 - Probably Overthinking It

Data visualization for academics

January 30, 2019 AllenDowney

One of the reasons I am excited about the rise of data journalism is that journalists are doing amazing things with visualization. At the same time, one of my frustrations with academic research is that the general quality of visualization is so poor.

One of the problems is that most academic papers are published in grayscale, so the figures don’t use color. But most papers are read in electronic formats now; the world is safe for color!

Another problem is the convention of putting figures at the end, which is an extreme form of burying the lede.

Also, many figures are generated by software with bad defaults: lines are too thin, text is too small, axis and grids lines are obtrusive, and when colors are used, they tend to be saturated colors that clash. And I won’t even mention the gratuitous use of 3-D.

But I think the biggest problem is the simplest: the figures in most academic papers do a poor job of communicating one point clearly.

I wrote about one example a few months ago, a paper showing that children who start school relatively young are more likely to be diagnosed with ADHD.

Here’s the figure from the original paper:

How long does it take you to understand the point of this figure? Now here’s my representation of the same data:

I believe this figure is easier to interpret. Here’s what I changed:

Instead of plotting the difference between successive months, I plotted the diagnosis rate for each month, which makes it possible to see the pattern (diagnosis rate increases month over month for the first six months, then levels), and the magnitude of the difference (from 60 to 90 diagnoses per 10,000, an increase of about 50%).
I shifted the horizontal axis to put the cutoff date (September 1) at zero.
I added a vertical line and text to distinguish and interpret the two halves of the plot.
I added a title that states the primary conclusion supported by the figure. Alternatively, I could have put this text in a caption.
I replaced the error bars with a shaded area, which looks better (in my opinion) and appropriately gives less visual weight to less important information.

I came across a similar visualization makeover recently. In this Washington Post article, Catherine Rampell writes, “Colleges have been under pressure to admit needier kids. It’s backfiring.”

Her article is based on this academic paper; here’s the figure from the original paper:

It’s sideways, it’s on page 29, and it fails to make its point. So Rampell designed a better figure. Here’s the figure from her article:

The title explains what the figure shows clearly: enrollment rates are highest for low-income students that qualify for Pell grants and lowest for low-income students who don’t qualify for Pell grants.

To nitpick, I might have plotted this data with a line rather than a bar chart, and I might have used a less saturated color. But more importantly, this figure makes its point clearly and compellingly.

Here’s one last example, and a challenge: this recent paper reports, “the number of scale points used in faculty teaching evaluations (e.g., whether instructors are rated on a scale of 6 vs. a scale of 10) substantially affects the size of the gender gap in evaluations.”

To demonstrate this effect, they show eight histograms on pages 44 and 45. Here’s page 44:

And here’s page 45:

With some guidance from the captions, we can extract the message:

Under the 6-point system, there is no visible difference between ratings for male and female instructors.
Under the 10-point system, in the least male-dominated subject areas, there is no visible difference.
Under the 10-point system, in the most male-dominated subject areas, there is a visibly obvious difference: students are substantially less likely to give female instructors a 9 or 10.

This is an important result — it makes me want to read the previous 43 pages. And the visualizations are not bad — they show the effect clearly, and it is substantial.

But I still think we could do better. So let me pose this challenge to readers: Can you design a visualization of this data that communicates the results so that

Readers can see the effect quickly and easily, and
Understand the magnitude of the effect in practical terms?

You can get the data you need from the figures, at least approximately. And your visualization doesn’t have to be fancy; you can send something hand-drawn if you want. The point of the exercise is the design, not the details.

I will post submissions in a few days. If you send me something, let me know how you would like to be acknowledged.

UPDATE: We discussed this example in class today and I presented one way we could summarize and visualize the data:

Students in the most male-dominated fields are less likely to give female instructors top scores, but only on a 10-point scale. The effect does not appear on a 6-point scale.

There are definitely things to do to improve this, but I generated it using Pandas with minimal customization. All the code is in this Jupyter notebook.

The library of data visualization

January 18, 2019 AllenDowney

Getting ready for my Data Science class (starting next week!) I am updating my data visualization library, looking for resources to help students learn about visualization.

Last week I asked Twitter to help me find resources, especially new ones. Here’s the thread. Thank you to everyone who responded!

I’ll try to summarize and organize the responses. I am mostly interested in books and web pages about visualization, rather than examples of it or tools for doing it.

There are lots of good books; to impose some order, I put them in three categories: newer work, the usual suspects, and moldy oldies.