Browsed by
Month: January 2019

Data visualization for academics

Data visualization for academics

One of the reasons I am excited about the rise of data journalism is that journalists are doing amazing things with visualization. At the same time, one of my frustrations with academic research is that the general quality of visualization is so poor.

One of the problems is that most academic papers are published in grayscale, so the figures don’t use color. But most papers are read in electronic formats now; the world is safe for color!

Another problem is the convention of putting figures at the end, which is an extreme form of burying the lede.

Also, many figures are generated by software with bad defaults: lines are too thin, text is too small, axis and grids lines are obtrusive, and when colors are used, they tend to be saturated colors that clash. And I won’t even mention the gratuitous use of 3-D.

But I think the biggest problem is the simplest: the figures in most academic papers do a poor job of communicating one point clearly.

I wrote about one example a few months ago, a paper showing that children who start school relatively young are more likely to be diagnosed with ADHD.

Here’s the figure from the original paper:

How long does it take you to understand the point of this figure? Now here’s my representation of the same data:

I believe this figure is easier to interpret. Here’s what I changed:

  1. Instead of plotting the difference between successive months, I plotted the diagnosis rate for each month, which makes it possible to see the pattern (diagnosis rate increases month over month for the first six months, then levels), and the magnitude of the difference (from 60 to 90 diagnoses per 10,000, an increase of about 50%).
  2. I shifted the horizontal axis to put the cutoff date (September 1) at zero.
  3. I added a vertical line and text to distinguish and interpret the two halves of the plot.
  4. I added a title that states the primary conclusion supported by the figure. Alternatively, I could have put this text in a caption.
  5. I replaced the error bars with a shaded area, which looks better (in my opinion) and appropriately gives less visual weight to less important information.

I came across a similar visualization makeover recently. In this Washington Post article, Catherine Rampell writes, “Colleges have been under pressure to admit needier kids. It’s backfiring.”

Her article is based on this academic paper; here’s the figure from the original paper:

It’s sideways, it’s on page 29, and it fails to make its point. So Rampell designed a better figure. Here’s the figure from her article:


The title explains what the figure shows clearly: enrollment rates are highest for low-income students that qualify for Pell grants and lowest for low-income students who don’t qualify for Pell grants.

To nitpick, I might have plotted this data with a line rather than a bar chart, and I might have used a less saturated color. But more importantly, this figure makes its point clearly and compellingly.

Here’s one last example, and a challenge: this recent paper reports, “the number of scale points used in faculty teaching evaluations (e.g., whether instructors are rated on a scale of 6 vs. a scale of 10) substantially affects the size of the gender gap in evaluations.”

To demonstrate this effect, they show eight histograms on pages 44 and 45. Here’s page 44:

And here’s page 45:

With some guidance from the captions, we can extract the message:

  1. Under the 6-point system, there is no visible difference between ratings for male and female instructors.
  2. Under the 10-point system, in the least male-dominated subject areas, there is no visible difference.
  3. Under the 10-point system, in the most male-dominated subject areas, there is a visibly obvious difference: students are substantially less likely to give female instructors a 9 or 10.

This is an important result — it makes me want to read the previous 43 pages. And the visualizations are not bad — they show the effect clearly, and it is substantial.

But I still think we could do better. So let me pose this challenge to readers: Can you design a visualization of this data that communicates the results so that

  1. Readers can see the effect quickly and easily, and
  2. Understand the magnitude of the effect in practical terms?

You can get the data you need from the figures, at least approximately. And your visualization doesn’t have to be fancy; you can send something hand-drawn if you want. The point of the exercise is the design, not the details.

I will post submissions in a few days. If you send me something, let me know how you would like to be acknowledged.

UPDATE: We discussed this example in class today and I presented one way we could summarize and visualize the data:

Students in the most male-dominated fields are less likely to give female instructors top scores, but only on a 10-point scale. The effect does not appear on a 6-point scale.

There are definitely things to do to improve this, but I generated it using Pandas with minimal customization. All the code is in this Jupyter notebook.

The library of data visualization

The library of data visualization

Getting ready for my Data Science class (starting next week!) I am updating my data visualization library, looking for resources to help students learn about visualization.

Last week I asked Twitter to help me find resources, especially new ones.  Here’s the thread.  Thank you to everyone who responded!

I’ll try to summarize and organize the responses.  I am mostly interested in books and web pages about visualization, rather than examples of it or tools for doing it.

There are lots of good books; to impose some order, I put them in three categories: newer work, the usual suspects, and moldy oldies.

Newer books

The following are some newer books (or at least new to me).

Fundamentals of Data Visualization, by Claus O. Wilke (online preview of a book forthcoming from O’Reilly)

Data Visualization: A practical introduction Kieran Healy (free online draft)

Data Visualization: Charts, Maps, and Interactive Graphics Robert Grant

Data Visualisation: A Handbook for Data Driven Design by Andy Kirk

Dear Data by Giorgia Lupi, Stefanie Posavec

Established books

The following are more established books that appear on most lists.

The Functional Art: An Introduction to Information Graphics and Visualization by Alberto Cairo

The Truthful Art: Data, Charts, and Maps for Communication by Alberto Cairo

Interactive Data Visualization for the Web by Scott Murray

Storytelling with Data: A Data Visualization Guide for Business Professionals by Cole Nussbaumer Knaflic

Beautiful Visualization: Looking at Data through the Eyes of Experts by Julie Steele

Designing Data Visualizations: Representing Informational Relationships by Noah Iliinsky, Julie Steele

Visualization Analysis and Design by Tamara Munzner

Visualize This: The FlowingData Guide to Design, Visualization, and Statistics by Nathan Yau

Data Points: Visualization That Means Something by Nathan Yau

Show Me the Numbers: Designing Tables and Graphs to Enlighten by Stephen Few

Now You See It: Simple Visualization Techniques for Quantitative Analysis by Stephen Few

Older books

The Visual Display of Quantitative Information by Edward R. Tufte

The Elements of Graphing Data by William S. Cleveland

Websites and blogs

Again, I mostly went for sites that are about visualization, rather than examples of it.

Junk Charts

The Graphic Continuum

information is beautiful

From Data to Viz

The Data Visualisation Catalogue

Storytelling with Data

FlowingData

More references and resources from MPA 635: DATA VISUALIZATION

Videos and podcasts

The Art of Data Visualization | Off Book | PBS Digital Studios

Data Stories A podcast on data visualization with Enrico Bertini and Moritz Stefaner

Python-specific resources

Python Plotting for Exploratory Data Analysis

The Python Graph Gallery

How to visualize data in Python