Local regression in Python

I love data visualization make-overs (like this one I wrote a few months ago), but sometimes the tone can be too negative (like this one I wrote a few months ago).

Sarah Leo, a data journalist at The Economist, has found the perfect solution: re-making your own visualizations. Here’s her tweet.

And here’s the link to the article, which you should go read before you come back here.

One of her examples is the noisy line plot on the left, which shows polling results over time.


Here’s Leo’s explanation of what’s wrong and why:

Instead of plotting the individual polls with a smoothed curve to show the trend, we connected the actual values of each individual poll. This happened, primarily, because our in-house charting tool does not plot smoothed lines. Until fairly recently, we were less comfortable with statistical software (like R) that allows more sophisticated visualisations. Today, all of us are able to plot a polling chart like the redesigned one above.

This confession made me realize that I am in the same boat they were in: I know about local regression, but I don’t use it because I haven’t bothered to learn the tools.

Fortunately, filling this gap in my toolkit took less than an hour. The StatsModels library provides lowess, which computes locally weighted scatterplot smoothing.

I grabbed the data from The Economist and read it into a Pandas DataFrame. Then I wrote the following function, which takes a Pandas Series, computes a LOWESS, and returns a Pandas Series with the results:

import numpy as np
import pandas as pd
from statsmodels.nonparametric.smoothers_lowess import lowess

def make_lowess(series):
    # Unpack the values and the (datetime) index of the Series
    endog = series.values
    exog = series.index.values

    # Compute the smoothed values and split the result into
    # the new index and the smoothed data
    smooth = lowess(endog, exog)
    index, data = np.transpose(smooth)

    # Reassemble the result as a Series with a datetime index
    return pd.Series(data, index=pd.to_datetime(index))

And here’s what the results look like:

The smoothed lines I got look a little different from the ones in The Economist article. In general, the results depend on the parameters we give LOWESS. You can see all the details in this Jupyter notebook.
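To see the effect of those parameters, here is a minimal sketch with the frac parameter of lowess exposed, applied to made-up weekly polling data (the variable names and numbers below are invented for illustration; they are not the data from the article):

import numpy as np
import pandas as pd
from statsmodels.nonparametric.smoothers_lowess import lowess

def make_lowess_frac(series, frac=2/3):
    # frac is the fraction of the data used for each local fit;
    # smaller values follow the data more closely.
    smooth = lowess(series.values, series.index.values, frac=frac)
    index, data = np.transpose(smooth)
    return pd.Series(data, index=pd.to_datetime(index))

# Made-up polling data: a noisy upward trend observed weekly.
rng = np.random.default_rng(42)
dates = pd.date_range("2016-07-01", "2018-12-31", freq="W")
values = 45 + 0.02 * np.arange(len(dates)) + rng.normal(0, 1.5, len(dates))
series = pd.Series(values, index=dates)

smoother = make_lowess_frac(series, frac=2/3)   # the statsmodels default
wigglier = make_lowess_frac(series, frac=0.2)   # tracks the noise more closely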

Thanks to Sarah Leo for inspiring me to learn to use LOWESS, and for providing the data I used to replicate the results.

Happiness, Mental Health, Drugs, Politics, and Language

The following are abstracts from 13 projects where students in my Data Science class explore public data sets related to a variety of topics. Each abstract ends with a link to a report where you can see the details.

A Deeper Dive into US Suicides

Diego Berny and Anna Griffin

The world’s suicide rate has been decreasing over the past decade, but unfortunately the United States’ rate is doing the exact opposite. Using data from the CDC and Our World in Data, we explored different demographics to see if there are any patterns of vulnerable populations. We found that the group most at risk is middle-aged men. Men’s suicide rate is nearly 4 times higher than women’s, and the group of adults between the ages of 45 and 59 has seen a 36.5% increase over the past 17 years. When comparing their methods of suicide to their female counterparts, we found that men tend to use more lethal means, resulting in fewer nonfatal suicide attempts. Read more

The Opioid Epidemic and Its Socioeconomic Effects

Daniel Connolly and Bryce Mann

Between 2002 and 2016, heroin use increased by 40%, while the use of other seemingly similar drugs declined in the same period. Using data from the National Survey on Drug Use and Health, we explore how the characteristics of opioid users have changed since the beginning of the epidemic. We find that so-called “late-starters” make up a new population of opioid users, as the average starting age of heroin users has increased by 2.5 years since 2002. We find a major discrepancy between the household incomes of users and nonusers as well, a discovery possibly related to socioeconomic factors like marriage. Read more

What is the Mother Tongue in U.S. Communities?

Allison Busa and Jordan Crawford-O’Banner

By watching the news, a person can assume that diversity is increasing rapidly in the United States. The current generation has been heralded as the most diverse in the history of the country. However, some Americans do not feel very positively about this change, and some even feel that the change is happening too rapidly. We decided to use data from the U.S. Census to put these claims to the test. Using linguistic Census data, we ask “Is cultural diversity changing over time?” and “How is it spread out?” With PMFs, we analyze the number of people who speak a language other than English at home (SONELAHs). There is a wide range of SONELAHs in the U.S., from only 2% of West Virginians to 42% of Californians. Within individual states, however, the variations are less extreme. Read More

Heroin and Alcohol: Could there be a relationship?

Daphka Alius

Alcohol abuse is a disease that affects millions in the US. Similarly, opioids have become a national health crisis, signaling a substantial increase in opioid use. The question under investigation is whether the same people who abuse heroin, a form of opioid, are also drinking consistently throughout the year. Using data from the National Survey on Drug Use and Health, I found that people who drink alcohol infrequently (< 30 days/year) consume heroin 1.7 times longer in a year than those who drink alcohol frequently (> 300 days/year). Additionally, the two variables are weakly correlated, with a Pearson correlation of -0.22. Read More

Does Health Insurance Type Lead To Opioid Addiction?

Micah Reid, Filipe Borba

The rate of opioid addiction has escalated into a crisis in recent years. Studies have linked health insurance with prescription painkiller overuse, but little has been done to investigate differences tied to health insurance type. We used data from the National Survey on Drug Use and Health from the year 2017 to single out variations in drug use and abuse prevalence and duration across these groups. We found that while those with private health insurance were more likely to have used opioids than those with Medicaid/CHIP or no health insurance (57.3% compared to 45% and 47.4%, respectively), those with Medicaid/CHIP or no health insurance were more likely to have abused opioids when controlling for past opioid use (24.6% and 27.2% versus 17.6%, respectively). Those with private health insurance were also more likely to have used opioids in the past, while those with Medicaid/CHIP or no health insurance were more likely to have continued their use. This suggests that even though those with private health insurance are more likely to use opioids, those without are more likely to continue use and begin misuse once started on opiates. Read more

Finding differences between Conservatives and Liberals

Siddharth Garimella

I looked through data from the General Social Survey (GSS) to gain a better understanding of which issues conservatives and liberals differ on most. After making some guesses of my own, I separated conservative and liberal respondents and sorted their effect sizes for every variable in the dataset segment I had available, ultimately finding three big differences between the two groups. My results suggest that conservatives most notably disagree more with same-sex relationships, tend to be slightly older, and attend religious events far more often than liberals do. Read more

Exploring OxyContin Use in the United States

Ariana Olson

According to the CDC, OxyContin is among the most common prescription opioids involved in overdose deaths. I explored variables related to OxyContin use, both medical and non-medical, from the 2014 National Survey on Drug Use and Health (NSDUH). I found that the median age at which respondents first tried OxyContin in a way that wasn’t prescribed to them is around 22, and that almost all respondents who had tried OxyContin non-medically did so before the age of 50. I also found that the overwhelming majority of respondents had never used OxyContin non-medically, but among those who had, there was an 82% probability that they had used it more than 12 months prior to the survey. People who used OxyContin also reported fewer days per year of non-medical use than of use overall, prescribed or not. Finally, I found that the median age at which people first used OxyContin in a way that wasn’t directed by a doctor increased with older age groups, and the minimum age of first trying OxyContin non-medically per age group tended to increase as the age of the groups increased. Read more

Subjective Class Compared to Income Class

Cassandra Overney

Back in my hometown, many people consider themselves middle class regardless of their incomes. I grew up confusing income class with subjective class. Now that I am living in a new environment, I am curious to see whether a discrepancy between subjective and income class exists throughout America. The main question I want to answer is: how does subjective class compare to income class?

Income is not the only factor that Americans associate with class, since most respondents consider themselves to be either working or middle class. However, there are some discernible differences in subjective class based on income. For example, respondents in the lowest income class are more likely to consider themselves working class than middle class (10.7% vs 6.3%), while respondents in the highest income class are more likely to consider themselves middle class than working class (13.3% vs 4.2%). Read more

The Contribution of the Opioid Epidemic to the Falling Life Expectancy in the United States

Sabrina Pereira

In recent years, a downward trend in the Average Life Expectancy (ALE) in the US has emerged. At the same time, the number of deaths by opioid poisoning has risen dramatically. Using mortality data from the Centers for Disease Control and Prevention, I create a model to quantify the effect of the increase of opioid-related deaths on the ALE in the US. According to the model, the ALE in 2017 would have been about 0.46 years higher if there had been no opioid-related deaths (79.06 years, compared to the observed 78.6 years). It is only recently that these deaths have created an observable effect this large. Read more

Exploring the Opioid Epidemic

Emma Price

People who use heroin are most likely to do so between the ages of 18 and 40, whereas people who misuse opiate pain relievers are consistently likely to misuse for the first time starting in their early teens. The portion of heroin and prescribed opiate users who stay in school until they complete high school is higher than that of people who do not use opiates; however, heroin users’ likelihood of staying in school through college drops very quickly. The rate at which people who misuse opiate pain relievers drop out of school generally follows that of non-users once the high school tipping point is past. Read more

Drug use patterns and correlations

Sreekanth Reddy Sajjala

For users of various regulated substances, exposure to and use of them varies greatly from substance to substance. The National Survey on Drug Use and Health dataset has extensive data that can allow us to view patterns and correlations in their usage. Only 40% of the people who have ever tried cocaine have used it in the past year, but almost 60% of those who have tried heroin use it at least once a week. People who have tried cannabis tend to try alcohol at an age about 15% lower than people who haven’t tried cannabis. Unless drug use patterns change drastically, if someone has consumed cannabis at any point in their life, they are over 20 times more likely to try heroin at some point in their life. Read more

Age and Generation Affect Happiness Levels in Marriage… A Little

Ashley Swanson

Among age, time, and cohort analyses, happiness levels in marriage are most drastically affected by the age of an individual up until their early 40s. Between age 20 and age 40, the reported percentage of happy marriages drops by 0.45 percentage points a year, nearly 10% over the course of those two decades. The following 4 decades see a rebound of about 8%, meaning that 90-year-olds are nearly as happy as 20-year-olds, with those in their early 40s experiencing the lowest levels of marital happiness. However, cohort effects have the highest explanatory value, with an r-value of 0.44. Those born in 1950 experience 13.3% fewer happy marriages than those born in 1900, and those born in 2000 experience an average of 10.5% more happy marriages than those born in 1950. Each of these variables has a small effect size per year, a fraction of a percentage point, but the sustained trends over the decades are significant enough to have real effects. Read more

Associations between screen time and kids’ mental health

MinhKhang Vu

Previous research on children and adolescents has suggested strong associations between screen time and their mental health, contributing to growing concerns among parents, teachers, counselors, and doctors about digital technology’s negative effects on children. Using the Census Bureau’s 2017 National Survey of Children’s Health (NSCH), I investigated a large (n=21,599) national random sample of 0- to 17-year-old children in the U.S. in 2017. The NSCH collects data on the physical and emotional health of American children every year, which includes information about their screen time and other comprehensive well-being measures. Children who spend 3 hours or more daily using computers are twice as likely to have an anxiety problem (CI 2.06–2.38) and four times as likely to experience depression (CI 3.97–5.11) as those who spend less than 3 hours. For kids spending 4 hours or more with computers, about 16% have some anxiety problems (CI 14.98–17.07), and 11% have recently experienced depression (CI 9.73–11.61). Along with the associations between screen time and diagnoses of anxiety and depression, how frequently a family has meals together also has strong linear relationships with both their children’s screen time and mental health. Children who do not have any meals with their family during the past week are twice as likely to have anxiety and three times as likely to experience depression as children who have meals with their family every day. However, in this study, I could not find any strong associations between the severity of kids’ mental illness and screen time, which leaves open the question of whether screen time directly affects children’s mental health. Read more

Stop worrying and love the black box

In many engineering classes, computational methods are treated with fear, uncertainty, and doubt. At the same time, analytic methods are presented as if they were magic.

I think we should spend more time on computational methods, which means cutting back on analytic methods. But I get a lot of resistance from faculty with a dread fear of black boxes.

They warn me that students have to know how these methods work in order to use them correctly; otherwise they are likely to produce nonsense results and accept them blindly.

And if they let students use computational tools at all, the order of presentation is usually “bottom-up”, that is, a lot of “how it works” before “what it does”, and not much “why you should care”.

In my books and classes, we often go “top-down”, learning to use tools first, and opening the hood only when it’s useful. It’s like learning to drive; knowing about internal combustion engines does not make you a better driver.

But a lot of people don’t like that analogy. Recently one of the good people I follow on Twitter wrote, “No, doing fancy analyses without understanding the basic statistical principles isn’t like driving a car without knowing the mechanics. It’s like driving a car while heavily intoxicated, being in all kinds of accidents without knowing it.”

I replied, “I don’t think there is a general principle here. Sometimes you can use black boxes safely. Sometimes you have to know how they work. Sometimes knowing how they work doesn’t actually help.”

So how do we know which scenario we’re in, and what should we do about it? I suggest the following flow chart:

Many black boxes can be used safely; that is, they produce accurate results over the range of relevant problems. In that case, we should ask whether it (really) helps to know how they work. In Scenario 1, the answer is no; we can stop worrying, stop teaching how it works, and use the time we save to teach more useful things.

Of course, some black boxes have sharp edges. They work when they work, but when they don’t, bad things happen. In that case, we should still ask whether it helps to know how they work. In Scenario 3, the answer is no again. In that case, we have to teach diagnosis: What happens when the black box fails? How can we tell? What can we do about it? Often we can answer these questions without knowing much about how the method works.

But sometimes we can’t, and students really need to open the hood. In that case (Scenario 2 in the diagram) I recommend going top down. Show students methods that solve problems they care about. Start with examples where the methods work, then introduce examples where they break. If the examples are authentic, they motivate students to understand the problems and how to fix them.

With this framework, I can explain more concisely my misgivings about how computational methods are taught:

The engineering curriculum is designed on the assumption that we are always in Scenario 2, but Scenarios 1 and 3 are actually more common.

Bayesian Zig-Zag Webinar

On February 13 I presented a webinar for the ACM Learning Center, entitled “The Bayesian Zig Zag: Developing Probabilistic Models Using Grid Methods and MCMC“. Eric Ma served as moderator, introducing me and joining me to answer questions at the end.

The example I presented is an updated version of the Boston Bruins Problem, which is in Chapter 7 of my book, Think Bayes. At the end of the talk, I generated a probabilistic prediction for the Bruins’ game against the Anaheim Ducks on February 15. I predicted that the Bruins had a 59% chance of winning, which they did, 3-0.
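For a sense of how a prediction like that comes together, here is a minimal sketch of the last step, assuming we already have posterior samples of each team’s goal-scoring rate (the gamma parameters below are invented placeholders, not the posteriors from the talk):

import numpy as np

rng = np.random.default_rng(17)

# Stand-ins for posterior samples of goal-scoring rates (goals per game);
# in the talk these come from Bayesian updates on past game data.
lam_bruins = rng.gamma(shape=9, scale=1/3, size=10000)
lam_ducks = rng.gamma(shape=8, scale=1/3, size=10000)

# Simulate one game per posterior sample with Poisson-distributed goals.
goals_bruins = rng.poisson(lam_bruins)
goals_ducks = rng.poisson(lam_ducks)

p_win = np.mean(goals_bruins > goals_ducks)
p_tie = np.mean(goals_bruins == goals_ducks)

# Split ties evenly as a crude stand-in for overtime and shootouts.
print(f"P(Bruins win) is about {p_win + p_tie / 2:.2f}")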

Does that mean I was right? Maybe.

According to the good people at the ACM, there were more than 3000 people registered for the webinar, and almost 900 who watched it live. I’m glad I didn’t know that while I was presenting 🙂

If you did not watch it live, you can view the recorded webinar at no cost other than registering and providing contact information.

Here are the slides I presented. And here is a static view of the Jupyter notebook with all of the code and results. You can also run the notebook on Binder.

Thanks to the ACM Learning Center for inviting me, to Eric for moderating, and to Chris Fonnesbeck and Colin Carroll for their help developing the example I presented.

Are men getting married later or never? Both.

Last week I wrote about marriage patterns for women in the U.S. Now let’s see what’s happening with men.

Again, I’m working with data from the National Survey of Family Growth, which surveyed 29,192 men in the U.S. between 2002 and 2017. I used Kaplan–Meier estimation to compute “survival” curves for the time until first marriage. The following figure shows the results for men grouped by decade of birth:


The colored lines show the estimated curves; the gray lines show projections based on moderate assumptions about future marriage rates. Two trends are apparent:

  • From one generation to the next, men have been getting married later. The median age at first marriage for men born in the 1950s was 26; for men born in the 1980s, it is 30; and for men born in the 1990s, it is projected to be 35.
  • The fraction of men never married at age 44 was 18% for men born in the 1950s, 1960s, and 1970s. It is projected to increase to 30% for men born in the 1980s and 37% for men born in the 1990s.

Of course, things could change in the future and make these projections wrong. But marriage rates in the last 5 years have been very low for both men and women. In order to catch up to previous generations, young men would have to start marrying at unprecedented rates, and they would have to start soon.

For details of the methods I used for this analysis, you can read my paper from SciPy 2015.

And for even more details, you can read this Jupyter notebook.
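If you just want the flavor of the estimation step, here is a minimal sketch using the lifelines library on made-up data; it is not the code from the paper, which is based on the full NSFG respondent files and handles details this toy example ignores:

import pandas as pd
from lifelines import KaplanMeierFitter

# Toy data: age at first marriage, or current age for respondents who
# have never been married (those observations are censored).
df = pd.DataFrame({
    "age": [24, 29, 35, 27, 44, 31, 26, 44],
    "married": [1, 1, 1, 1, 0, 1, 1, 0],   # 1 = married, 0 = censored
})

kmf = KaplanMeierFitter()
kmf.fit(df["age"], event_observed=df["married"], label="toy cohort")

# S(t) is the estimated fraction never married as a function of age.
print(kmf.survival_function_)
print("Estimated median age at first marriage:", kmf.median_survival_time_)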

As always, thank you to the good people who run the NSFG for making this data available.

The marriage strike continues

Last month The National Survey of Family Growth released new data from 5,554 respondents interviewed between 2015 and 2017. I’ve worked on several studies using data from the NSFG, so it’s time to do some updates!

In 2015 I gave a talk at SciPy called “Will Millennials Ever Get Married” and wrote a paper that appeared in the proceedings. I used data from the NSFG through 2013 to generate this plot showing marriage rates for women in the U.S. grouped by decade of birth:

The vertical axis, S(t), is the estimated survival curve, which is the fraction of women who have never been married as a function of age. The gray lines show projections based on the assumption that each cohort going forward will “inherit” the hazard function of the previous cohort.

If you are not familiar with survival functions and hazard functions, you might want to read the SciPy paper, which explains the methodology.
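The projection idea itself is simple enough to show in a few lines. Here is a minimal sketch with made-up hazard values; the real analysis estimates the hazard functions from the NSFG data:

import numpy as np

# Toy hazard function for an older cohort: probability of first marriage
# at each age, conditional on being unmarried at that age (made-up values).
ages = np.arange(18, 45)
hazard_old = np.full(len(ages), 0.06)

# The survival curve is the cumulative product of (1 - hazard).
S_old = np.cumprod(1 - hazard_old)

# Project a younger cohort observed only through age 30 by letting it
# "inherit" the older cohort's hazard for the remaining ages.
hazard_young_observed = np.full(13, 0.05)          # ages 18-30, also made up
hazard_young = np.concatenate([hazard_young_observed, hazard_old[13:]])
S_young = np.cumprod(1 - hazard_young)

print("Projected fraction never married at 44:", S_young[-1])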

Now let’s see how things look with the new data. Here’s the updated plot:

A few things to notice:

  1. Age at first marriage has been increasing for decades. Median age at first marriage has gone from 20 for women born in the 1940s to 27 for women born in the 1980s and looks likely to be higher for women born in the 1990s.
  2. The fraction of women unmarried at age 44 has gone from 7% for women born in the 1940s to 18% for women born in the 1970s. And according to the projections I computed, this fraction will increase to 30% for women born in the 1980s and 42% for women born in the 1990s.
  3. For women born in the 1980s and 1990s, the survival curves have been surprisingly flat for the last five years; that is, very few women in these cohorts have been married during this time. This is the “marriage strike” I mentioned in the SciPy paper, and it seems to be ongoing.

Because the marriage strike is happening at the same time in two cohorts, it may be a period effect rather than a cohort effect; that is, it might be due to external factors affecting both cohorts, rather than a generational change. For example, economic conditions might be discouraging marriage.

If so, the marriage strike might end when external conditions change. But at least for now, it looks like people will continue getting married later, and substantially more people will remain unmarried in the future.

In my next post, I will show the results of this analysis for men.

Data visualization for academics

One of the reasons I am excited about the rise of data journalism is that journalists are doing amazing things with visualization. At the same time, one of my frustrations with academic research is that the general quality of visualization is so poor.

One of the problems is that most academic papers are published in grayscale, so the figures don’t use color. But most papers are read in electronic formats now; the world is safe for color!

Another problem is the convention of putting figures at the end, which is an extreme form of burying the lede.

Also, many figures are generated by software with bad defaults: lines are too thin, text is too small, axis and grid lines are obtrusive, and when colors are used, they tend to be saturated colors that clash. And I won’t even mention the gratuitous use of 3-D.

But I think the biggest problem is the simplest: the figures in most academic papers do a poor job of communicating one point clearly.

I wrote about one example a few months ago, a paper showing that children who start school relatively young are more likely to be diagnosed with ADHD.

Here’s the figure from the original paper:

How long does it take you to understand the point of this figure? Now here’s my representation of the same data:

I believe this figure is easier to interpret. Here’s what I changed:

  1. Instead of plotting the difference between successive months, I plotted the diagnosis rate for each month, which makes it possible to see the pattern (diagnosis rate increases month over month for the first six months, then levels off) and the magnitude of the difference (from 60 to 90 diagnoses per 10,000, an increase of about 50%).
  2. I shifted the horizontal axis to put the cutoff date (September 1) at zero.
  3. I added a vertical line and text to distinguish and interpret the two halves of the plot.
  4. I added a title that states the primary conclusion supported by the figure. Alternatively, I could have put this text in a caption.
  5. I replaced the error bars with a shaded area, which looks better (in my opinion) and appropriately gives less visual weight to less important information (see the sketch after this list).
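Most of these changes take only a few lines of Matplotlib. Here is a minimal sketch with made-up numbers (the rates and standard errors below are invented for illustration; the real values are in the paper):

import numpy as np
import matplotlib.pyplot as plt

# Made-up diagnosis rates per 10,000 by birth month relative to the cutoff:
# rising month over month for six months, then leveling off.
months = np.arange(-6, 6)
rate = 60 + 5 * np.clip(months + 6, 0, 6)
stderr = np.full(len(months), 4.0)

fig, ax = plt.subplots()
ax.plot(months, rate)
# A shaded band instead of error bars gives the uncertainty less visual weight.
ax.fill_between(months, rate - stderr, rate + stderr, alpha=0.3)
# A vertical line marks the cutoff, shifted to zero on the horizontal axis.
ax.axvline(0, color="gray", linestyle=":")
ax.set_xlabel("Birth month relative to the September 1 cutoff")
ax.set_ylabel("ADHD diagnoses per 10,000")
plt.show()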

I came across a similar visualization makeover recently. In this Washington Post article, Catherine Rampell writes, “Colleges have been under pressure to admit needier kids. It’s backfiring.”

Her article is based on this academic paper; here’s the figure from the original paper:

It’s sideways, it’s on page 29, and it fails to make its point. So Rampell designed a better figure. Here’s the figure from her article:


The title clearly explains what the figure shows: enrollment rates are highest for low-income students who qualify for Pell grants and lowest for low-income students who don’t qualify for Pell grants.

To nitpick, I might have plotted this data with a line rather than a bar chart, and I might have used a less saturated color. But more importantly, this figure makes its point clearly and compellingly.

Here’s one last example, and a challenge: this recent paper reports, “the number of scale points used in faculty teaching evaluations (e.g., whether instructors are rated on a scale of 6 vs. a scale of 10) substantially affects the size of the gender gap in evaluations.”

To demonstrate this effect, they show eight histograms on pages 44 and 45. Here’s page 44:

And here’s page 45:

With some guidance from the captions, we can extract the message:

  1. Under the 6-point system, there is no visible difference between ratings for male and female instructors.
  2. Under the 10-point system, in the least male-dominated subject areas, there is no visible difference.
  3. Under the 10-point system, in the most male-dominated subject areas, there is a visibly obvious difference: students are substantially less likely to give female instructors a 9 or 10.

This is an important result — it makes me want to read the previous 43 pages. And the visualizations are not bad — they show the effect clearly, and it is substantial.

But I still think we could do better. So let me pose this challenge to readers: Can you design a visualization of this data that communicates the results so that

  1. Readers can see the effect quickly and easily, and
  2. Understand the magnitude of the effect in practical terms?

You can get the data you need from the figures, at least approximately. And your visualization doesn’t have to be fancy; you can send something hand-drawn if you want. The point of the exercise is the design, not the details.

I will post submissions in a few days. If you send me something, let me know how you would like to be acknowledged.

UPDATE: We discussed this example in class today and I presented one way we could summarize and visualize the data:

Students in the most male-dominated fields are less likely to give female instructors top scores, but only on a 10-point scale. The effect does not appear on a 6-point scale.

There are definitely things to do to improve this, but I generated it using Pandas with minimal customization. All the code is in this Jupyter notebook.
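In case it helps to see how little customization that takes, here is a sketch of the same kind of plot with hypothetical summary values (the real numbers come from the paper’s histograms, and the notebook linked above has the actual code):

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical fractions of evaluations giving top scores; the actual
# values would be read from the paper's histograms.
summary = pd.DataFrame(
    {"Male instructors": [0.45, 0.40],
     "Female instructors": [0.44, 0.30]},
    index=["6-point scale", "10-point scale"],
)

ax = summary.plot.bar(rot=0)
ax.set_ylabel("Fraction of top scores")
ax.set_title("Hypothetical example: in male-dominated fields,\n"
             "the gap appears only on the 10-point scale")
plt.tight_layout()
plt.show()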

The library of data visualization

Getting ready for my Data Science class (starting next week!), I am updating my data visualization library, looking for resources to help students learn about visualization.

Last week I asked Twitter to help me find resources, especially new ones.  Here’s the thread.  Thank you to everyone who responded!

I’ll try to summarize and organize the responses.  I am mostly interested in books and web pages about visualization, rather than examples of it or tools for doing it.

There are lots of good books; to impose some order, I put them in three categories: newer work, the usual suspects, and moldy oldies.

Newer books

The following are some newer books (or at least new to me).

Fundamentals of Data Visualization, by Claus O. Wilke (online preview of a book forthcoming from O’Reilly)

Data Visualization: A practical introduction by Kieran Healy (free online draft)

Data Visualization: Charts, Maps, and Interactive Graphics by Robert Grant

Data Visualisation: A Handbook for Data Driven Design by Andy Kirk

Dear Data by Giorgia Lupi and Stefanie Posavec

Established books

The following are more established books that appear on most lists.

The Functional Art: An Introduction to Information Graphics and Visualization by Alberto Cairo

The Truthful Art: Data, Charts, and Maps for Communication by Alberto Cairo

Interactive Data Visualization for the Web by Scott Murray

Storytelling with Data: A Data Visualization Guide for Business Professionals by Cole Nussbaumer Knaflic

Beautiful Visualization: Looking at Data through the Eyes of Experts by Julie Steele

Designing Data Visualizations: Representing Informational Relationships by Noah Iliinsky and Julie Steele

Visualization Analysis and Design by Tamara Munzner

Visualize This: The FlowingData Guide to Design, Visualization, and Statistics by Nathan Yau

Data Points: Visualization That Means Something by Nathan Yau

Show Me the Numbers: Designing Tables and Graphs to Enlighten by Stephen Few

Now You See It: Simple Visualization Techniques for Quantitative Analysis by Stephen Few

Older books

The Visual Display of Quantitative Information by Edward R. Tufte

The Elements of Graphing Data by William S. Cleveland

Websites and blogs

Again, I mostly went for sites that are about visualization, rather than examples of it.

Junk Charts

The Graphic Continuum

Information is Beautiful

From Data to Viz

The Data Visualisation Catalogue

Storytelling with Data

FlowingData

More references and resources from MPA 635: DATA VISUALIZATION

Videos and podcasts

The Art of Data Visualization | Off Book | PBS Digital Studios

Data Stories, a podcast on data visualization with Enrico Bertini and Moritz Stefaner

Python-specific resources

Python Plotting for Exploratory Data Analysis

The Python Graph Gallery

How to visualize data in Python

Race, religion, and politics

In their November 3, 2018 issue, The Economist published the following figure showing their analysis of data from YouGov.

For the second edition of Think Bayes, I plan to use it to demonstrate conditional probability.  In the following notebook, I replicate their analysis (loosely!) using data from the General Social Survey (GSS).
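As a preview of what “demonstrate conditional probability” means in practice, here is a minimal sketch with a made-up miniature dataset; the column names and values are invented for illustration, and the notebook uses the actual GSS variables:

import pandas as pd

# A made-up miniature dataset in the spirit of the GSS example.
df = pd.DataFrame({
    "race": ["White", "White", "Black", "Black", "White", "Black"],
    "relig": ["Protestant", "None", "Protestant", "Protestant", "Catholic", "None"],
    "party": ["Rep", "Dem", "Dem", "Dem", "Rep", "Dem"],
})

# Conditional probabilities P(party | race, relig) as a table of fractions.
cond = (df.groupby(["race", "relig"])["party"]
          .value_counts(normalize=True)
          .unstack(fill_value=0))
print(cond)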

How to write a book

All of my books were written in LaTeX.  For a long time I used emacs to compose, pdflatex to convert to PDF, Hevea to convert to HTML, and a hacked-up version of plasTeX to convert to DocBook, which is one of the formats I can submit to my publisher, O’Reilly Media.

Recently I switched from emacs to Texmaker for composition, and I recommend it strongly.  I also use Overleaf for shared LaTeX documents, and I can recommend that, too.

However, the rest of the tools I use are pretty clunky.  The HTML I get from Hevea is not great, and my hacked version of plasTeX is just awful (which is not plasTeX’s fault).

Since I am starting some new book projects, I decided to rethink my tools.  So I asked Twitter, “If you were starting a new book project today, what typesetting language / development environment would you use?  LaTeX with Texmaker? Bookdown with RStudio? Jupyter? Other?”

I got some great responses.  You can read the whole thread yourself, but I will try to summarize it here.

LaTeX

Nelis Willers “wrote a 510 page book with LaTeX, using WinEdt and MiKTeX and CorelDraw for diagrams. Worked really well.”

Matt Boelkins likes “PreTeXt, hands down:  It has LaTeX and HTML as potential outputs among many. See the gallery of existing texts on the linked page.”

makusu recommends “Emacs org-mode. Easy to just write your content, seamless integration with latex, easy output to latex, PDF, markdown and HTML.”

AsciiDoc

Luciano Ramalho recommends “AsciiDoc, for sure. That’s how I wrote @fluentpython. It’s syntax more user-friendly than ReStructuredText and way more expressive than Markdown. AsciiDoc was *designed for* book publishing. It’s as expressive as DocBook, but it ain’t XML. With @asciidoctor you can render locally.”

JD Long provides a useful reminder: “It’s dependent on the publisher as well as the content of the book. I like Bookdown for R, but if I were doing a devops book for O’Reilly I’d write directly in AsciiDoc, for example. So I think context matters highly.”

Yves Hilpisch says “AsciiDoctor is my favorite these days. Clear syntax, nice output, fast rendering (HTML/PDF). Have custom Python scripts that convert @ProjectJupyter notebooks into text files from which I include code snippets automatically.”  His scripts are in this GitHub repository.

Markdown

Robert Talbot recommends “Markdown in a plain text editor, with Pandoc on the back end for the finished product.  This is assuming that the book is mostly text. If it involves code, I might lean more toward Jupyter and some kind of Binder based process.”  Here’s a blog post Robert wrote on the topic.

I got a recommendation for this blog post by Thorsten Ball, who uses Markdown, pandoc, and KindleGen.

One person recommended “writing Markdown then using pandoc to pass to LaTeX”, which is an interesting chain.

Visual Studio Code got a few mentions: “I haven’t written a full book using it, but VS Code plus markdown preview and other editing plugins is my current go-to for small articles”

“Bookdown in RStudio is wonderful to use.”

Jupyter

Chris Holdgraf is “working on a project to help people make nicely rendered online books from collections of Jupyter notebooks. We use it @ Berkeley for teaching at http://inferentialthinking.com.”

RestructuredText

Jason Moore recommends “your preferred text editor + RestructuredText + Sphinx = pdf/epub/html output; wrote my dissertation with it 6 years ago and was quite happy with the results.”

Matt Harrison uses his “own tools around rst (with conversion to LaTeX and epub).”

Other

Pollen: the book is a program

Raffaele Abate recommends “ScrivenerApp: I’ve used their Linux beta in past for a short, nonscientific, book and I can say it’s an amazing software for this purpose. I’ve read that is usable also for scientific publishing with profit.”

Lak Lakshmanan wrote, “I used Google docs for my previous book and for my current offer. Not as composable as latex, but amazing for collaboration. Books need to fine-grained reviews and edits by several people spread around the world. Nothing like Google docs for that.”

And the winner is…

For now I am working in LaTeX with Texmaker, but I have run it through pandoc to generate AsciiDoc, and that seems to work well.  I will work on the book and the conversion process at the same time.  At some point, I might switch over to editing in AsciiDoctor.  I also need to do a test run with O’Reilly to see if they can ingest the AsciiDoc I generate.
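For the curious, the conversion step is a single pandoc call. Here is a sketch of how I might script it, assuming pandoc is installed and the book’s main file is named book.tex (a placeholder name, not my actual file):

import subprocess

# Convert the LaTeX source to AsciiDoc with pandoc. "book.tex" and
# "book.adoc" are placeholder filenames; pandoc must be on the PATH.
subprocess.run(
    ["pandoc", "book.tex", "--from=latex", "--to=asciidoc", "--output=book.adoc"],
    check=True,
)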

I will post updates as I work out the details.

Thank you to everyone who responded!