I have temperature readings taken over time (at 2-second intervals) from a computer that is cooled by an automatic fan. The temperature fluctuates between 55 and 65 in an approximately sine-wave fashion. I want to find the average time per cycle of the wave (the time to go from 55 up to 65 and back down to 55), averaged over an entire dataset that includes many such cycles. What sort of statistical analysis should I use?
[The following] is one of my datasets; it represents one of the system configurations. Temperature readings are taken every 2 seconds. Please show me how you would do it, and with which software. I would hope for something low-tech like LibreOffice or Excel; hopefully nothing too fancy is needed.
A few people recommended using an FFT, and I agreed, but I also suggested two other options. Then another person suggested autocorrelation.
I ran some experiments to see what each of these solutions looks like and what works best. If you are too busy for the details, I think the best option is computing the distance between zero crossings using a spline fitted to the smoothed data.
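Here's a minimal sketch of that zero-crossing approach, assuming the readings are in a NumPy array called temps, sampled every 2 seconds (the array name and the smoothing window are illustrative, not from the original analysis):

import numpy as np
from scipy.interpolate import UnivariateSpline

dt = 2  # sampling interval in seconds
t = np.arange(len(temps)) * dt

# Smooth the readings with a moving average and subtract the mean,
# so each cycle crosses zero twice
window = 21
smooth = np.convolve(temps, np.ones(window) / window, mode="same")
centered = smooth - smooth.mean()

# Fit a cubic spline and find its zero crossings
spline = UnivariateSpline(t, centered, s=0)
crossings = spline.roots()

# There are two crossings per cycle, so the period is twice
# the average spacing between consecutive crossings
period = 2 * np.mean(np.diff(crossings))

Because the period is averaged over many cycles, this estimate should be more robust than reading peaks off the plot by eye.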
For a long time I have recommended using CDFs to compare distributions. If you are comparing an empirical distribution to a model, the CDF gives you the best view of any differences between the data and the model.
Now I want to amend my advice. CDFs give you a good view of the distribution between the 5th and 95th percentiles, but they are not as good for the tails.
To compare both tails, as well as the “bulk” of the distribution, I recommend a triptych that looks like this:
There’s a lot of information in that figure. So let me explain.
Suppose you observe a random process, like daily changes in the S&P 500. And suppose you have collected historical data in the form of percent changes from one day to the next. The distribution of those changes might look like this:
If you fit a Gaussian model to this data, it looks like this:
It looks like there are small discrepancies between the model and the data, but if you follow my previous advice, you might look at these CDFs and conclude that the Gaussian model is pretty good.
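For reference, here's a sketch of how you might make that comparison yourself, plotting an empirical CDF against a Gaussian fitted by its sample moments (the array name changes is an assumption):

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

xs = np.sort(changes)
ecdf = np.arange(1, len(xs) + 1) / len(xs)  # empirical CDF

mu, sigma = xs.mean(), xs.std()
plt.plot(xs, ecdf, label="data")
plt.plot(xs, norm.cdf(xs, mu, sigma), label="Gaussian model")
plt.legend()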
If we zoom in on the middle of the distribution, we can see the discrepancies more clearly:
In this figure it is clearer that the Gaussian model does not fit the data particularly well. And, as we’ll see, the tails are even worse.
Survival on a log-log scale
In my opinion, the best way to compare tails is to plot the survival curve (which is the complementary CDF) on a log-log scale.
In this case, because the dataset includes positive and negative values, I shift them right to view the right tail, and left to view the left tail.
Here’s what the right tail looks like:
This view is like a microscope for looking at tail behavior; it compresses the bulk of the distribution and expands the tail. In this case we can see a small discrepancy between the data and the model around 1 percentage point. And we can see a substantial discrepancy above 3 percentage points.
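If you want to make this kind of plot yourself, here's a minimal sketch of one way to compute an empirical survival curve and put it on log-log axes (the shift is whatever makes the tail of interest positive; names are illustrative):

import numpy as np
import matplotlib.pyplot as plt

def plot_survival_loglog(values, shift=0):
    """Plot the empirical survival function (complementary CDF) on log-log axes."""
    xs = np.sort(values + shift)
    # Fraction of the sample strictly greater than each x
    sf = 1 - np.arange(1, len(xs) + 1) / len(xs)
    # Drop the last point, where the survival function is exactly 0
    plt.plot(xs[:-1], sf[:-1])
    plt.xscale("log")
    plt.yscale("log")

To view the left tail, you can negate the values and apply the same recipe.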
The Gaussian distribution has “thin tails”; that is, the probabilities it assigns to extreme events drop off very quickly. In the dataset, extreme values are much more common than the model predicts.
The results for the left tail are similar:
Again, there is a small discrepancy near -1 percentage points, as we saw when we zoomed in on the CDF. And there is a substantial discrepancy in the leftmost tail.
Student’s t-distribution
Now let’s try the same exercise with Student’s t-distribution. There are two ways I suggest you think about this distribution:
1) Student’s t is similar to a Gaussian distribution in the middle, but it has heavier tails. The heaviness of the tails is controlled by a third parameter, ν.
2) Also, Student’s t is a mixture of Gaussian distributions with different variances. The tail parameter, ν, is related to the variance of the variances.
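The second interpretation is easy to check by simulation. Here's a minimal sketch that draws variances from an inverse-gamma distribution and then draws Gaussian values with those variances; the result is Student's t with ν degrees of freedom (the value of ν here is illustrative):

import numpy as np

rng = np.random.default_rng()
nu = 3
n = 100_000

# sigma**2 ~ inverse-gamma(nu/2, nu/2), generated as 1 over a gamma variate
variances = 1 / rng.gamma(shape=nu / 2, scale=2 / nu, size=n)

# A Gaussian mixture over those variances is Student's t with nu degrees of freedom
samples = rng.normal(0, np.sqrt(variances))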
I used PyMC to estimate the parameters of a Student’s t model and generate a posterior predictive distribution. You can see the details in this Jupyter notebook.
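The notebook has the details; as a rough sketch (not the exact code or priors from the notebook), a Student's t model in PyMC might look like this, where changes stands in for the observed data:

import pymc as pm

with pm.Model() as model:
    nu = pm.Exponential("nu", 1 / 10)       # tail parameter
    mu = pm.Normal("mu", mu=0, sigma=1)     # location
    sigma = pm.HalfNormal("sigma", sigma=1) # scale
    obs = pm.StudentT("obs", nu=nu, mu=mu, sigma=sigma, observed=changes)

    idata = pm.sample()
    pred = pm.sample_posterior_predictive(idata)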
Here is the CDF of the Student t model compared to the data and the Gaussian model:
In the bulk of the distribution, Student’s t-distribution is clearly a better fit.
Now here’s the right tail, again comparing survival curves on a log-log scale:
Student’s t-distribution is a better fit than the Gaussian model, but it overestimates the probability of extreme values. The problem is that the left tail of the empirical distribution is heavier than the right. But the model is symmetric, so it can only match one tail or the other, not both.
Here is the left tail:
The model fits the left tail about as well as possible.
If you are primarily worried about predicting extreme losses, this model would be a good choice. But if you need to model both tails well, you could try one of the asymmetric generalizations of Student’s t.
The old six sigma
The tail behavior of the Gaussian distribution is the key to understanding “six sigma events”.
As John D. Cook explains: “Six sigma means six standard deviations away from the mean of a probability distribution, sigma (σ) being the common notation for a standard deviation. Moreover, the underlying distribution is implicitly a normal (Gaussian) distribution; people don’t commonly talk about ‘six sigma’ in the context of other distributions.”
This is important. John also explains:
“A six-sigma event isn’t that rare unless your probability distribution is normal… The rarity of six-sigma events comes from the assumption of a normal distribution more than from the number of sigmas per se.”
So, if you see a six-sigma event, you should probably not think, “That was extremely rare, according to my Gaussian model.” Instead, you should think, “Maybe my Gaussian model is not a good choice”.
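A quick computation shows how big the difference is. Under a Gaussian model, a six-sigma event is a one-in-a-billion rarity; under a Student's t model with a heavy tail (ν=3 here, just as an illustration), it is not rare at all:

from scipy.stats import norm, t

print(norm.sf(6))      # about 1e-9: roughly 1 in a billion
print(t.sf(6, df=3))   # about 0.005: roughly 1 in 200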
In the first article in this series, I looked at data from the General Social Survey (GSS) to see how political alignment in the U.S. has changed, on the axis from conservative to liberal, over the last 50 years.
In the second article, I suggested that self-reported political alignment could be misleading.
In the previous article, we looked at responses to this question:
Do you think most people would try to take advantage of you if they got a chance, or would they try to be fair?
And we generated seven “headlines” to describe the results.
In this article, we’ll use resampling to see how much the results depend on random sampling. And we’ll see which headlines hold up and which might be overinterpretation of noise.
Overall trends
In the previous article we looked at this figure, which was generated by resampling the GSS data and computing a smooth curve through the annual averages.
If we run the resampling process two more times, we get somewhat different results:
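The resampling itself is simple. Here's a minimal sketch, assuming the data is in a pandas DataFrame called gss with the recoded responses in a column named fair_recoded (both names are assumptions, and the actual analysis also takes sampling weights into account):

def resample_rows(df):
    """Return a bootstrap sample: rows drawn with replacement."""
    return df.sample(n=len(df), replace=True)

sample = resample_rows(gss)
means = sample.groupby("year")["fair_recoded"].mean()

Each run of this process yields a slightly different set of annual means, and therefore a slightly different smoothed curve.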
Now, let’s review the headlines from the previous article. Looking at different versions of the figure, which conclusions do you think are reliable?
Absolute value: “Most respondents think people try to be fair.”
Rate of change: “Belief in fairness is falling.”
Change in rate: “Belief in fairness is falling, but might be leveling off.”
In my opinion, the three figures are qualitatively similar. The shapes of the curves are somewhat different, but the headlines we wrote could apply to any of them.
Even the tentative conclusion, “might be leveling off”, holds up to varying degrees in all three.
Grouped by political alignment
When we group by political alignment, we have fewer samples in each group, so the results are noisier and our headlines are more tentative.
Here’s the figure from the previous article:
And here are two more figures generated by random resampling:
Now we see more qualitative differences between the figures. Let’s review the headlines again:
Absolute value: “Moderates have the bleakest outlook; Conservatives and Liberals are more optimistic.” This seems to be true in all three figures, although the size of the gap varies substantially.
Rate of change: “Belief in fairness is declining in all groups, but Conservatives are declining fastest.” This headline is more questionable. In one version of the figure, belief is increasing among Liberals. And it’s not at all clear that the decline is fastest among Conservatives.
Change in rate: “The Liberal outlook was declining, but it leveled off in 1990.” The Liberal outlook might have leveled off, or even turned around, but we could not say with any confidence that 1990 was a turning point.
Change in rate: “Liberals, who had the bleakest outlook in the 1980s, are now the most optimistic”. It’s not clear whether Liberals have the most optimistic outlook in the most recent data.
As we should expect, conclusions based on smaller sample sizes are less reliable.
Also, conclusions about absolute values are more reliable than conclusions about rates, which are more reliable than conclusions about changes in rates.
In the first article in this series, I looked at data from the General Social Survey (GSS) to see how political alignment in the U.S. has changed, on the axis from conservative to liberal, over the last 50 years.
In the second article, I suggested that self-reported political alignment could be misleading.
In this article we’ll look at results from questions related to “outlook”, that is, how the respondents see the world and people in it.
Specifically, the questions are:
fair: Do you think most people would try to take advantage of you if they got a chance, or would they try to be fair?
trust: Generally speaking, would you say that most people can be trusted or that you can’t be too careful in dealing with people?
helpful: Would you say that most of the time people try to be helpful, or that they are mostly just looking out for themselves?
Do people try to be fair?
Let’s start with fair. The responses are coded like this:
1 Take advantage
2 Fair
3 Depends
To put them on a numerical scale, I recoded them like this:
1 Fair
0.5 Depends
0 Take advantage
I flipped the axis so the more positive answer is higher, and put “Depends” in the middle. Now we can plot the mean response by year, like this:
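As a sketch, the recoding and plot might look like this in pandas (the names gss, fair, and fair_recoded are assumptions, not the exact code):

# Original codes: 1 = Take advantage, 2 = Fair, 3 = Depends
recode = {1: 0.0, 2: 1.0, 3: 0.5}
gss["fair_recoded"] = gss["fair"].map(recode)

# Mean response by year
gss.groupby("year")["fair_recoded"].mean().plot()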
Looking at a figure like this, there are three levels we might describe:
Absolute value: “Most respondents think people try to be fair.”
Rate of change: “Belief in fairness is falling.”
Change in rate: “Belief in fairness is falling, but might be leveling off.”
For any of these qualitative descriptions, we could add quantitative estimates. For example, “About 55% of U.S. residents think people try to be fair”, or “Belief in fairness has dropped 10 percentage points since 1970”.
Statistically, the estimates of absolute value are probably reliable, but we should be more cautious estimating rates of change, and substantially more cautious talking about changes in rates. We’ll come back to this issue, but first let’s look at breakdowns by group.
Outlook and political alignment
In the previous article I grouped respondents by self-reported political alignment: Conservative, Moderate, or Liberal.
We can use these groups to see the relationship between outlook and political alignment. For example, the following figure shows the average response to the fairness question, grouped by political alignment and plotted over time:
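A sketch of how such a figure might be produced, assuming the recoded responses and a group column holding the three alignment labels (column names are assumptions):

table = gss.pivot_table(index="year", columns="group",
                        values="fair_recoded", aggfunc="mean")
table.plot()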
Results like these invite comparisons between groups, and we can make those comparisons at several levels. Here are some potential headlines for this figure:
Absolute value: “Moderates have the bleakest outlook; Conservatives and Liberals are more optimistic.”
Rate of change: “Belief in fairness is declining in all groups, but Conservatives are declining fastest.”
Change in rate: “The Liberal outlook was declining, but it leveled off in 1990.” or “Liberals, who had the bleakest outlook in the 1980s, are now the most optimistic”.
Because we divided the respondents into three groups, the sample size in each group is smaller. Statistically, we need to be more skeptical about our estimates of absolute level, even more skeptical about rates of change, and extremely skeptical about changes in rates.
In the next article, I’ll use resampling to quantify the uncertainty of these estimates, and we’ll see how many of these headlines hold up.
In the previous article, I looked at data from the General Social Survey (GSS) to see how political alignment in the U.S. has changed, on the axis from conservative to liberal, over the last 50 years.
The GSS asks respondents where they place themselves on a 7-point scale from “extremely liberal” (1) to “extremely conservative” (7), with “moderate” in the middle (4).
In the previous article I computed the mean and standard deviation of the responses as a way of quantifying the center and spread of the distribution. But it can be misleading to treat categorical responses as if they were numerical. So let’s see what we can do with the categories.
The following plot shows the fraction of respondents who place themselves in each category, plotted over time:
My initial reaction is that these lines are mostly flat. If political alignment is changing in the U.S., it is changing slowly, and the changes might not matter much in practice.
If we look more closely, it seems like the number of people who consider themselves “extreme” is increasing, and the number of moderates might be decreasing. The following plot shows a closer look at the extremes.
There is some evidence of polarization here, but we should not make too much of it. People who consider themselves extreme are still less than 10% of the population, and moderates are still the biggest group, at almost 40%.
To get a better sense of what’s happening with the other groups, I reduced the number of categories to 3: “Conservative” at any level, “Liberal” at any level, and “Moderate”. Here’s what the plot looks like with these categories:
Moderates make up a plurality; conservatives are the next biggest group, followed by liberals.
From 1974 to 1990, the number of people who call themselves “Conservative” was increasing, but it has decreased ever since. And the number of “Liberals” has been increasing since 2000.
At least, that’s what this plot seems to show. We should be careful about over-interpreting patterns that might be random noise. And we might not want to take these categories too seriously, either.
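In case you want to reproduce this grouping, here's a minimal sketch, assuming the 7-point responses are in a column named polviews (1 = extremely liberal through 7 = extremely conservative):

def collapse(polviews):
    """Map the 7-point scale onto three labels."""
    if polviews <= 3:
        return "Liberal"
    elif polviews == 4:
        return "Moderate"
    else:
        return "Conservative"

gss["group"] = gss["polviews"].apply(collapse)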
The hazards of self-reporting
There are several problems with self-reported labels like this.
First, political beliefs are multi-dimensional. “Conservative” and “liberal” are labels for collections of ideas that sometimes go together. But most people hold a mixture of these beliefs.
Also, these labels are relative; when someone says they are conservative, they often mean that they are more conservative than the center of the population, or at least more conservative than where they think the center is, for whatever population they have in mind.
Finally, nearly all survey responses are subject to social desirability bias, which is the tendency of people to give answers that make them look better or feel better about themselves.
Over time, the changes we see in these responses depend on actual changes in political beliefs, but they also depend on where the center of the population is, where people think the center is, and the perceived desirability of the labels “liberal”, “conservative”, and “moderate”.
So, in the next article we’ll look more closely at changes in beliefs and attitudes, not just labels.
I am planning to turn these articles into a case study for an upcoming Data Science class, so I welcome comments and questions.
Is the United States getting more conservative? With the rise of the alt-right, Republican control of Congress, and the election of Donald Trump, it might seem so.
Or is the country getting more liberal? With the 2015 Supreme Court decision supporting same-sex marriage, the incremental legalization of marijuana, and recent proposals to expand public health care, you might think so.
Or maybe the country is becoming more polarized, with moderates choosing sides and partisans moving to the extremes.
In a series of articles, I’ll use data from the General Social Survey (GSS) to explore these questions. The GSS goes back to 1972; every second year they survey a representative sample of U.S. residents and ask questions about their political beliefs. Many of the questions have been unchanged for almost 50 years, making it possible to observe long-term trends.
In this article, I’ll look at political alignment, that is, whether the respondents consider themselves liberal or conservative. In subsequent articles, I’ll explore their political beliefs on a range of topics.
Political alignment
From 1974 to the most recent cycle in 2018, the GSS asked the following question, “We hear a lot of talk these days about liberals and conservatives. I’m going to show you a seven-point scale on which the political views that people might hold are arranged from extremely liberal–point 1–to extremely conservative–point 7. Where would you place yourself on this scale?”
The following figure shows the distribution of responses in 1974 and 2018.
In 2018, it looks like there are more 1s (Extremely Liberal) and maybe more 7s (Extremely Conservative). So this figure provides some evidence of polarization.
We can get a better sense of the long term trend by taking the mean of the 7-point scale and plotting it over time. By treating this scale as a numerical quantity, I’m making assumptions about the spacing between the values. The numbers we get don’t mean much in absolute terms, but they provide a quick look at the trend.
It looks like the “center of mass” was increasing until about 1990, which means more conservative on this scale, and has been decreasing ever since. On average the country might be a little more conservative now than it was in 1974.
With the same caveat about treating this scale as a numerical quantity, we can also compute the standard deviation, which measures average distance from the mean, as a way of quantifying polarization.
The trend is clearly increasing, indicating increasing polarization, but with the way we computed these numbers, it’s hard to get a sense of how substantial the increase is in practical terms.
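Both computations are one-liners in pandas, sketched here under the assumption that the responses are in a column named polviews:

grouped = gss.groupby("year")["polviews"]
mean_by_year = grouped.mean()  # center of mass of the 7-point scale
std_by_year = grouped.std()    # spread, used here as a rough measure of polarization
mean_by_year.plot()
std_by_year.plot()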
In the next article, I’ll look more closely at changes in political alignment over time.
I am planning to turn these articles into a case study for an upcoming Data Science class, so I welcome comments and questions.
Until recently, I was using FuncAnimation, provided by the matplotlib.animation package, as in this example from Think Complexity. The documentation of this function is pretty sparse, but if you want to use it, you can find examples.
For me, there are a few drawbacks:
It requires a back end like ffmpeg to display the animation. Based on my email, many readers have trouble installing packages like this, so I avoid using them.
It runs the entire computation before showing the result, so it takes longer to debug, and makes for a less engaging interactive experience.
For each element you want to animate, you have to use one API to create the element and another to update it.
For example, if you are using imshow to visualize an array, you would run
im = plt.imshow(a, **options)
to create an AxesImage, and then
im.set_array(a)
to update it. For beginners, this is a lot to ask. And even for experienced people, it can be hard to find documentation that shows how to update various display elements.
As another example, suppose you have a 2-D array and plot it like this:
plt.plot(a)
The result is a list of Line2D objects, one per column of the array. To update them, you have to traverse the list and invoke set_ydata() on each one.
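For example, the create/update asymmetry might look like this, where new_a stands in for the updated array:

lines = plt.plot(a)          # create: one Line2D per column of a
for line, column in zip(lines, new_a.T):
    line.set_ydata(column)   # update: one call per line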
Updating a display is often more complicated than creating it, and requires substantial navigation of the documentation. Wouldn’t it be nice to just call plt.plot(a) again?
Clear output
Recently I discovered a simpler alternative using clear_output() from IPython.display and sleep() from the time module. If you have Python and Jupyter, you already have these modules, so there’s nothing to install.
Here’s a minimal example using imshow:
%matplotlib inline

import numpy as np
from matplotlib import pyplot as plt
from IPython.display import clear_output
from time import sleep

n = 10
a = np.zeros((n, n))
plt.figure()

for i in range(n):
    plt.imshow(a)
    plt.show()
    a[i, i] = 1
    sleep(0.1)
    clear_output(wait=True)
The drawback of this method is that it is relatively slow, but for the examples I’ve worked on, the performance has been good enough.
In the ModSimPy library, I provide a function that encapsulates this pattern:
def animate(results, draw_func, interval=None):
    plt.figure()
    try:
        for t, state in results.iterrows():
            draw_func(state, t)
            plt.show()
            if interval:
                sleep(interval)
            clear_output(wait=True)
        # draw the final state again, since the loop ends with clear_output
        draw_func(state, t)
        plt.show()
    except KeyboardInterrupt:
        pass
results is a Pandas DataFrame that contains results from a simulation; each row represents the state of a system at a point in time.
draw_func is a function that takes a state and draws it in whatever way is appropriate for the context.
interval is the time between frames in seconds (not counting the time to draw the frame).
Because the loop is wrapped in a try statement that captures KeyboardInterrupt, you can interrupt an animation cleanly.
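Here's a hypothetical usage sketch; draw_state and results stand in for whatever your simulation produces:

def draw_state(state, t):
    """Draw one row of the results (here, as a bar chart) with the time as title."""
    state.plot(kind="bar")
    plt.title(f"t = {t}")

animate(results, draw_state, interval=0.1)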
And tomorrow I’m presenting a talk, “Generational Changes in Support for Gun Laws: A Case Study in Computational Statistics”:
Abstract: In the United States, support for gun control has been declining among all age groups since 1990; among young adults, support is substantially lower than among previous generations. Using data from the General Social Survey (GSS), I perform age-period-cohort analysis to measure generational effects. In this talk, I demonstrate a computational approach to statistics that replaces mathematical analysis with random simulation. Using Python and libraries like NumPy and StatsModels, we can define basic operations — like resampling, filling missing values, modeling, and prediction — and assemble them into a data analysis pipeline.
In the last 30 years, college students have become much less religious. The fraction who say they have no religious affiliation tripled, from about 10% to 30%. And the fraction who say they have attended a religious service in the last year fell from 85% to 70%.
One of the survey questions asks students to select their “current religious preference” from a choice of seventeen common religions, “Other religion,” “Atheist,” “Agnostic,” or “None.”
The options “Atheist” and “Agnostic” were added in 2015. For consistency with previous years, I compare the “Nones” from previous years with the sum of “None”, “Atheist” and “Agnostic” since 2015.
The following figure shows the fraction of Nones over the 50 years of the survey.
Percentage of students with no religious preference from 1968 to 2017.
The blue line shows actual data through 2017; the gray line shows a quadratic fit. The light gray region shows a 90% predictive interval.
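A quadratic fit like that can be sketched in a couple of lines of NumPy; the arrays years and percents are placeholders here, and the predictive interval in the figure comes from a more involved resampling computation, not from this fit alone:

import numpy as np

coeffs = np.polyfit(years, percents, deg=2)
fit = np.polyval(coeffs, years)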
For the first time since 2011, the fraction of Nones decreased this year, reverting to the trend line.
Another question asks students how often they “attended a religious service” in the last year. The choices are “Frequently,” “Occasionally,” and “Not at all.” Students are instructed to select “Occasionally” if they attended one or more times.
Here is the fraction of students who reported any religious attendance in the last year:
Percentage of students who reported attending a religious service in the previous year.
Slightly more students reported attending a religious service in 2017 than in the previous year, contrary to the long-term trend.
Female students are more religious than male students. The following graph shows the gender gap over time, that is, the difference in percentages of male and female students with no religious affiliation.
Difference in religious affiliation between male and female students.
The gender gap was growing until recently. It has shrunk in the last 3-4 years, but since it varies substantially from year to year, it is hard to rule out random variation.
Data from 2018 should be available soon; I’ll post an update when I can.
Data Source
Stolzenberg, E. B., Eagan, M. K., Aragon, M. C., Cesar-Davis, N. M., Jacobo, S., Couch, V., & Rios-Aguilar, C. (2019). The American Freshman: National Norms Fall 2017. Higher Education Research Institute, UCLA.
“Foundation” is one of several words I would like to ban from all discussion of higher education. Others include “liberal arts”, “rigor”, and “service class”, but I’ll write about them another time. Right now, “foundation” is on my mind because of a new book from Microsoft Research, Foundations of Data Science, by Avrim Blum, John Hopcroft, and Ravindran Kannan.
The goal of their book is to “cover the theory we expect to be useful in the next 40 years, just as an understanding of automata theory, algorithms, and related topics gave students an advantage in the last 40 years.”
As an aside, I am puzzled by their use of “advantage” here: who did those hypothetical students have an advantage over? I don’t think competitive advantage is the primary goal of learning. If a theory is useful, it helps you solve problems and make the world a better place, not just crush your enemies.
I am also puzzled by their use of “foundation”, because it can mean two contradictory things:
The most useful ideas in a field; the things you should learn first.
The most theoretical ideas in a field; the things you should use to write mathematical proofs.
Both kinds of foundation are valuable. If you identify the right things to learn first, you can give students powerful tools quickly, they can work on real problems and have impact, and they are more likely to be excited about learning more. And if you find the right abstractions, you can build intuition, develop insight, make connections, and create new tools and ideas.
The problems come when we confuse these meanings, assume that the most abstract ideas are the most useful, and require students to learn them first. In higher education, confusion about “foundations” is the root of a lot of bad curriculum design.
For example, in the traditional undergraduate engineering curriculum, students take 1-2 years of math and science classes before they learn anything about engineering. These prerequisites are called the “Math and Science Death March” because so many students don’t get through them; in the U.S., about 40% of students who start an engineering program don’t finish it, largely because of the incorrect assumption that they need two years of theory before they can start engineering.
The introduction to Foundations of Data Science hints at the first meaning of “foundation”. The authors note that “increasingly researchers of the future will be involved with using computers to understand and extract usable information from massive data arising in applications,” which suggests that this book will help them do those things.
But the rest of the introduction makes it clear that the second meaning is what they have in mind.
“Chapters 2 and 3 lay the foundations of geometry and linear algebra respectively.”
“We give a from-first-principles description of the mathematics and algorithms for SVD.”
“The underlying mathematical theory of such random walks, as well as connections to electrical networks, forms the core of Chapter 4 on Markov chains.”
“Chapter 9 focuses on linear-algebraic problems of making sense from data, in particular topic modeling and non-negative matrix factorization.”
The “fundamentals” in this book are abstract, mathematical, and theoretical. The authors assert that learning them will give you an “advantage”, but if you are looking for practical tools to solve real problems, you might need to build on a different foundation.