May 2019 - Probably Overthinking It

Foundations of data science?

May 16, 2019 AllenDowney

“Foundation” is one of several words I would like to ban from all discussion of higher education. Others include “liberal arts”, “rigor”, and “service class”, but I’ll write about them another time. Right now, “foundation” is on my mind because of a new book from Microsoft Research, Foundations of Data Science, by Avrim Blum, John Hopcroft, and Ravindran Kannan.

The goal of their book is to “cover the theory we expect to be useful in the next 40 years, just as an understanding of automata theory, algorithms, and related topics gave students an advantage in the last 40 years.”

As an aside, I am puzzled by their use of “advantage” here: who did those hypothetical students have an advantage over? I don’t think competitive advantage is the primary goal of learning. If a theory is useful, it helps you solve problems and make the world a better place, not just crush your enemies.

I am also puzzled by their use of “foundation”, because it can mean two contradictory things:

1. The most useful ideas in a field; the things you should learn first.
2. The most theoretical ideas in a field; the things you should use to write mathematical proofs.

Both kinds of foundation are valuable. If you identify the right things to learn first, you can give students powerful tools quickly; they can work on real problems, have impact, and are more likely to be excited about learning more. And if you find the right abstractions, you can build intuition, develop insight, make connections, and create new tools and ideas.

The problems come when we confuse these meanings, assume that the most abstract ideas are the most useful, and require students to learn them first. In higher education, confusion about “foundations” is the root of a lot of bad curriculum design.

For example, in the traditional undergraduate engineering curriculum, students take 1-2 years of math and science classes before they learn anything about engineering. These prerequisites are called the “Math and Science Death March” because so many students don’t get through them; in the U.S., about 40% of students who start an engineering program don’t finish it, largely because of the incorrect assumption that they need two years of theory before they can start engineering.

The introduction to Foundations of Data Science hints at the first meaning of “foundation”. The authors note that “increasingly researchers of the future will be involved with using computers to understand and extract usable information from massive data arising in applications,” which suggests that this book will help them do those things.

But the rest of the introduction makes it clear that the second meaning is what they have in mind.

“Chapters 2 and 3 lay the foundations of geometry and linear algebra respectively.”
“We give a from-first-principles description of the mathematics and algorithms for SVD.”
“The underlying mathematical theory of such random walks, as well as connections to electrical networks, forms the core of Chapter 4 on Markov chains.”
“Chapter 9 focuses on linear-algebraic problems of making sense from data, in particular topic modeling and non-negative matrix factorization.”

The “fundamentals” in this book are abstract, mathematical, and theoretical. The authors assert that learning them will give you an “advantage”, but if you are looking for practical tools to solve real problems, you might need to build on a different foundation.
