The library of data visualization

The library of data visualization

Getting ready for my Data Science class (starting next week!) I am updating my data visualization library, looking for resources to help students learn about visualization.

Last week I asked Twitter to help me find resources, especially new ones.  Here’s the thread.  Thank you to everyone who responded!

I’ll try to summarize and organize the responses.  I am mostly interested in books and web pages about visualization, rather than examples of it or tools for doing it.

There are lots of good books; to impose some order, I put them in three categories: newer work, the usual suspects, and moldy oldies.

Newer books

The following are some newer books (or at least new to me).

Fundamentals of Data Visualization, by Claus O. Wilke (online preview of a book forthcoming from O’Reilly)

Data Visualization: A practical introduction Kieran Healy (free online draft)

Data Visualization: Charts, Maps, and Interactive Graphics Robert Grant

Data Visualisation: A Handbook for Data Driven Design by Andy Kirk

Dear Data by Giorgia Lupi, Stefanie Posavec

Established books

The following are more established books that appear on most lists.

The Functional Art: An Introduction to Information Graphics and Visualization by Alberto Cairo

The Truthful Art: Data, Charts, and Maps for Communication by Alberto Cairo

Interactive Data Visualization for the Web by Scott Murray

Storytelling with Data: A Data Visualization Guide for Business Professionals by Cole Nussbaumer Knaflic

Beautiful Visualization: Looking at Data through the Eyes of Experts by Julie Steele

Designing Data Visualizations: Representing Informational Relationships by Noah Iliinsky, Julie Steele

Visualization Analysis and Design by Tamara Munzner

Visualize This: The FlowingData Guide to Design, Visualization, and Statistics by Nathan Yau

Data Points: Visualization That Means Something by Nathan Yau

Show Me the Numbers: Designing Tables and Graphs to Enlighten by Stephen Few

Now You See It: Simple Visualization Techniques for Quantitative Analysis by Stephen Few

Older books

The Visual Display of Quantitative Information by Edward R. Tufte

The Elements of Graphing Data by William S. Cleveland

Websites and blogs

Again, I mostly went for sites that are about visualization, rather than examples of it.

Junk Charts

The Graphic Continuum

information is beautiful

From Data to Viz

The Data Visualisation Catalogue

Storytelling with Data

FlowingData

More references and resources from MPA 635: DATA VISUALIZATION

Videos and podcasts

The Art of Data Visualization | Off Book | PBS Digital Studios

Data Stories A podcast on data visualization with Enrico Bertini and Moritz Stefaner

Python-specific resources

Python Plotting for Exploratory Data Analysis

The Python Graph Gallery

How to visualize data in Python

Race, religion, and politics

Race, religion, and politics

In their November 3, 2018 issue, The Economist published the following figure showing their analysis of data from YouGov.

For the second edition of Think Bayes, I plan to use it to demonstrate conditional probability.  In the following notebook, I replicate their analysis (loosely!) using data from the General Social Survey (GSS).

How to write a book

How to write a book

All of my books were written in LaTeX.  For a long time I used emacs to compose, pdflatex to convert to PDF, Hevea to convert to HTML, and a hacked-up version of plasTeX to convert to DocBook, which is one of the formats I can submit to my publisher, O’Reilly Media.

Recently I switched from emacs to Texmaker for composition, and I recommend it strongly.  I also use Overleaf for shared LaTeX documents, and I can recommend that, too.

However, the rest of the tools I use are pretty clunky.  The HTML I get from Hevea is not great, and my hacked version of plasTeX is just awful (which is not plasTeX’s fault).

Since I am starting some new book projects, I decided to rethink my tools.  So I asked Twitter, “If you were starting a new book project today, what typesetting language / development environment would you use?  LaTeX with Texmaker? Bookdown with RStudio? Jupyter?Other?”

I got some great responses.  You can read the whole thread yourself, but I will try to summarize it here.

LaTeX

Nelis Willers “wrote a 510 page book with LaTeX, using WinEdt and MiKTeX and CorelDraw for diagrams. Worked really well.”

Matt Boelkins likes “PreTeXt, hands down:  It has LaTeX and HTML as potential outputs among many. See the gallery of existing texts on the linked page.”

makusu recommends “Emacs org-mode. Easy to just write your content, seamless integration with latex, easy output to latex, PDF, markdown and HTML.”

AsciiDoc

Luciano Ramalho recommends “AsciiDoc, for sure. That’s how I wrote @fluentpython. It’s syntax more user-friendly than ReStructuredText and way more expressive than Markdown. AsciiDoc was *designed for* book publishing. It’s as expressive as DocBook, but it ain’t XML. With @asciidoctor you can render locally.”

JD Long provides a useful reminder: “It’s dependent on the publisher as well as the content of the book. I like Bookdown for R, but if I were doing a devops book for O’Reilly I’d write directly in AsciiDoc, for example. So I think context matters highly.”

Yves Hilpisch says “AsciiDoctor is my favorite these days. Clear syntax, nice output, fast rendering (HTML/PDF). Have custom Python scripts that convert @ProjectJupyter notebooks into text files from which I include code snippets automatically.”  His scripts are in this GitHub repository.

Markdown

Robert Talbot recommends “Markdown in a plain text editor, with Pandoc on the back end for the finished product.  This is assuming that the book is mostly text. If it involves code, I might lean more toward Jupyter and some kind of Binder based process.”  Here’s a blog post Robert wrote on the topic.

I got a recommendation for this blog post by Thorsten Ball, who uses Markdown, pandoc, and KindleGen.

One person recommended “writing Markdown then using pandoc to pass to LaTeX”, which is an interesting chain.

Visual Studio Code got a few mentions: “I haven’t written a full book using it, but VS Code plus markdown preview and other editing plugins is my current go-to for small articles”

“Bookdown in RStudio is wonderful to use.”

Jupyter

Chris Holdgraf is “working on a project to help people make nicely rendered online books from collections of Jupyter notebooks. We use it @ Berkeley for teaching at http://inferentialthinking.com.”

RestructuredText

Jason Moore recommends “your preferred text editor + RestructuredText + Sphinx = pdf/epub/html output; wrote my dissertation with it 6 years ago and was quite happy with the results.”

Matt Harrison uses his “own tools around rst (with conversion to LaTeX and epub).”

Other

Pollen: the book is a program

Raffaele Abate recommends “ScrivenerApp: I’ve used their Linux beta in past for a short, nonscientific, book and I can say it’s an amazing software for this purpose. I’ve read that is usable also for scientific publishing with profit.”

Lak Lakshmanan wrote, “I used Google docs for my previous book and for my current offer. Not as composable as latex, but amazing for collaboration. Books need to fine-grained reviews and edits by several people spread around the world. Nothing like Google docs for that.”

And the winner is…

For now I am working in LaTeX with Texmaker, but I have run it through pandoc to generate AsciiDoc, and that seems to work well.  I will work on the book and the conversion process at the same time.  At some point, I might switch over to editing in AsciiDoctor.  I also need to do a test run with O’Reilly to see if they can ingest the AsciiDoc I generate.

I will post updates as I work out the details.

Thank you to everyone who responded!

Cats and rats and elephants

Cats and rats and elephants

Wall decoration featuring an elephant.A few weeks ago I posted “Lions and Tigers and Bears“, which poses a Bayesian problem related to the Dirichlet distribution.  If you have not read it, you might want to start there.

Now here’s the follow-up question:

Suppose there are six species that might be in a zoo: lions and tigers and bears, and cats and rats and elephants. Every zoo has a subset of these species, and every subset is equally likely.

One day we visit a zoo and see 3 lions, 2 tigers, and one bear. Assuming that every animal in the zoo has an equal chance to be seen, what is the probability that the next animal we see is an elephant?

And here’s my solution

August causes ADHD

August causes ADHD

A recent paper in the New England Journal of Medicine, Attention Deficit–Hyperactivity Disorder and Month of School Enrollment, compares diagnosis rates for ADHD for children born in August and children born in September.   In states that have a September 1 cutoff for children to start school, students born in September are generally the oldest in their class, and children born in August the youngest.

And it turns out that children born in August are about 30% more likely to be diagnosed with ADHD, plausibly due to age-related differences in behavior.

The analysis in the paper uses null-hypothesis significance tests (NHST) and focuses on the difference between August and September births.  But if it is true that the difference in diagnosis rates is due to age differences, we should expect to see a “dose-response” curve with gradually increasing rates from September to August.

Fortunately, the article includes enough data for me to replicate and extend the analysis.  Here is the figure from the paper showing the month-to-month comparisons.

Note: there is a typographical error in the table, explained in my notebook, below.

Comparing adjacent months, only one of the differences is statistically significant.  But I think there are other ways to look at this data that might make the effect more apparent.  The following figure, from my re-analysis, shows diagnosis rates as a function of the difference, in months, between a child’s birth date and the September 1 cutoff:

For the first 9 months, from September to May, we see what we would expect if at least some of the excess diagnoses are due to age-related behavior differences. For each month of difference in age, we see an increase in the number of diagnoses.

This pattern breaks down for the last three months, June, July, and August. This might be explained by random variation, but it also might be due to parental intervention; if some parents hold back students born near the deadline, the observations for these months include some children who are relatively old for their grade and therefore less likely to be diagnosed.

We could test this hypothesis by checking the actual ages of these students when they started school, rather than just looking at their months of birth.  I will see whether that additional data is available; in the meantime, I will proceed taking the data at face value.

I fit the data using a Bayesian logistic regression model, assuming a linear relationship between month of birth and the log-odds of diagnosis.  The following figure shows the fitted models superimposed on the data.

Most of these regression lines fall within the credible intervals of the observed rates, so in that sense this model is not ruled out by the data.  But it is clear that the lower rates in the last 3 months bring down the estimated slope, so we should probably consider the estimated effect size to be a lower bound on the true effect size.

To express this effect size in a way that’s easier to interpret, I used the posterior predictive distributions to estimate the difference in diagnosis rate for children born in September and August.  The difference is 21 diagnoses per 10,000, with 95% credible interval (13, 30).

As a percentage of the baseline (71 diagnoses per 10,000), that’s an increase of 30%, with credible interval (18%, 42%).

However, if it turns out that the observed rates for June, July, and August are brought down by red-shirting, the effect could be substantially higher.  Here’s what the model looks like if we exclude those months:

Of course, it is hazardous to exclude data points because they violate expectations, so this result should be treated with caution.  But under this assumption, the difference in diagnosis rate would be 42 per 10,000.  On a base rate of 67, that’s an increase of 62%.

Here is the notebook with the details of my analysis:

Lions and tigers and bears

Lions and tigers and bears

Here’s another Bayes puzzle:

Suppose we visit a wild animal preserve where we know that the only animals are lions and tigers and bears, but we don’t know how many of each there are.

During the tour, we see 3 lions, 2 tigers, and one bear. Assuming that every animal had an equal chance to appear in our sample, estimate the prevalence of each species.

What is the probability that the next animal we see is a bear?

Solutions

Will Koehrsen posted an excellent solution here.  His solution is more general than mine, allowing for uncertainty about the parameters of the Dirichlet prior.

And vlad posted another good solution using WebPPL.

 

Cats and rats and elephants

Now that we solved the appetizer, we are ready for the main course…

Suppose there are six species that might be in a zoo: lions and tigers and bears, and cats and rats and elephants.  Every zoo has a subset of these species, and every subset is equally likely.

One day we visit a zoo and see 3 lions, 2 tigers, and one bear.  Assuming that every animal in the zoo has an equal chance to be seen, what is the probability that the next animal we see is an elephant?

The Game of Ur problem

The Game of Ur problem

Here’s a probability puzzle to ruin your week.

In the Royal Game of Ur, players advance tokens along a track with 14 spaces. To determine how many spaces to advance, a player rolls 4 dice with 4 sides. Two corners on each die are marked; the other two are not. The total number of marked corners — which is 0, 1, 2, 3, or 4 — is the number of spaces to advance.

For example, if the total on your first roll is 2, you could advance a token to space 2. If you roll a 3 on the next roll, you could advance the same token to space 5.

Suppose you have a token on space 13. How many rolls did it take to get there?

Hint: you might want to start by computing the distribution of k given n, where k is the number of the space and n is the number of rolls.  Then think about the prior distribution of n.

I’ll post a solution later this week, but I have to confess: I believe my solution is correct, but there is still part of it I am not satisfied with.

[UPDATE November 1, 2018]

Here’s the thread on Twitter where a few people discuss this problem.

And here’s my solution.  As you will see there are still some unresolved questions.

Here’s another solution from Austin Rochford, which estimates the posterior distribution by simulation.

And here’s a solution from vlad, also based on simulation, using WebPPL:

How tall is A?

How tall is A?

Here are a series of problems I posed in my Bayesian statistics class:

1) Suppose you meet an adult resident of the U.S. who is 170 cm tall. What is the probability that they are male?

2) Suppose I choose two U.S. residents at random and A is taller than B.  How tall is A?

3) In a room of 10 randomly chosen U.S. residents, A is the second tallest.  How tall is A?  And what is the probability that A is male?

As background: For adult male residents of the US, the mean and standard deviation of height are 178 cm and 7.7 cm. For adult female residents the corresponding stats are 163 cm and 7.3 cm.  And 51% of the adult population is female.

If you solve the problems in order, you can reuse code from the first two to solve the third.

Here’s my solution, using a grid algorithm and the libraries from Think Bayes:

When I tweeted about this problem, I heard from Colin Carroll, who wrote a solution using PyMC:

And vlad posted a this solution using WebPPL, a browser-based environment for probablistic programming:

You can run that solution at WebPPL.

The Dungeons and Dragons problem

The Dungeons and Dragons problem

Last week I posed this problem in my Bayesian Statistics class:

Suppose there are 10 people in my Dungeons and Dragons club; on any game day, each of them has a 70% chance of showing up.

Each player has one character and each character has 6 attributes, each of which is generated by rolling and adding up 3 6-sided dice.

At the beginning of the game, I ask whose character has the lowest attribute. The wizard says, “My constitution is 5; does anyone have a lower attribute?”, and no one does.

The warrior says “My strength is 16; does anyone have a higher attribute?”, and no one does.

How many characters are in the party?

My solution is in this Jupyter notebook:

The six R’s of debugging

The six R’s of debugging

In Modeling and Simulation yesterday I presented my six R’s of debugging:

Read: You have to read the code, and read what it really says, not what you think it says.  You have to read the documentation, read the error message, and read the Stack Overflow page that comes up when you Google the error message.

But sometimes the bug is in your head.  If the problem is your misunderstanding, you won’t find it by staring at the code…

Run: You also have to run the code, makes some changes, and run the code again.  Sometimes when you clean up the code, and you make a change that should have no effect, and it does, that gives you a hint.

But don’t just make random changes…

Ruminate: Take time to think!  What have you changed since the last time you had a working program?  What is the program doing wrong, and what kind of error could cause it?  What are you assuming that might be wrong?

Question everything, but don’t just sit in silence…

Rubber duck: You have to talk about it.  Find someone who’s willing to listen and explain the problem.  You might figure it out before they have a chance to say a word.  In that case you don’t even need a person.  A rubber duck will do.

Be persistent, but not too persistent…

Rest: If you’ve been at it a while, take a break.  Get away from the computer, do something else, and wait for your blood pressure to come down.  Some of the best places to find bugs are trains, showers, and bed, just before you fall asleep.

Finally, if you are pretty much stuck, you might have to be strategic…

Retreat: Get back to a previous working version and start building up the code.  Take smaller steps this time, or take different steps.  Spend some time building code that helps you debug, like functions that visualize your data structures.

I hope these suggestions are helpful.  There are six things to try, so if you are stuck on one, try another.  Debugging can be frustrating, but it is one of the most useful skills you can develop, and it applies to almost every domain, not just software development.

If you are good at debugging, you can do anything.