MA332 lecture notes, Fall 1998 Week 13 Tuesday Curve-fitting redux ------------------- We have two variables, an independent variable, x, which is supposed to allow us to predict the independent variable, y. Assume that the relationship between the two is known to be linear, in other words: y = a + bx + e Where the e is an "error" term that might contain measurement error, or might contain additional factors not contained in the model. For example, last time we talked about the relationship between message size and transit time: transit time = latency + packet size / bandwidth + e In this case, e probably represents a measurement error. In the handout, they use the example of predicting first-year GPA based on SAT scores. In this case, e captures not only the measurement error (since the SATs are not perfect) and all the other factors besides test scores that determine your GPA. GPA = a + b * SAT + e (I will leave aside the question of whether those two are really different things -- in other words, are the SATs measuring something imperfectly or something imperfect?). In both cases we have only two parameters and more than two data points (observations), so we will not be able to fit all the data exactly. Least Squares ------------- Instead, we choose values of the parameters a and b that minimize the sum of the squared errors. Given data xi and yi, and estimates of a and b, we find ei = yi - (a + b xi) and minimize sum ei^2 Last time we crammed the whole problem into matrix notation and derived: 1 x1 y1 A = 1 x2 b = y2 . . . . . . 1 xn yn Which leads to AtA x = At b In other words, if Ax = b is over constrained, multiply both sides by At! This time, let's do the calculus in non-matrix notation. Result: n Sxy - Sx Sy b = --------------- n Sx^2 - (Sx)^2 a = (Sy - b Sx) / n Can we convince ourselves that that's really the same thing as before? Goodness of fit --------------- How do we evaluate whether or not this is a good fit for the data? We could use: 1) the mean error, which is zero by construction (duh!) 2) the mean absolute error (not a bad choice) 3) the mean squared error (which is the thing we just minimized) The last two are good, but they depend on the particular dataset in a way that makes them hard to evaluate. An alternative is to look at the ratio of the variance in the errors compared to the variance in the original data. This ratio indicates how much variability (unpredictability) is left after we subtract away our predictions from each value. variability of original data = SST = sum (y - ybar)^2 where ybar is the calculated mean of the y values. variability after we subtract the predictions = SSE = sum (y - a - bx)^2 R^2 --- The reading provides an alternate derivation of the same thing, both leading to the result: R^2 = 1 - SSE / SST The value of R^2 is always between -1 and 1, with values near 0 indicating that SSE is not much smaller than SST, which means that the predictions weren't very good. Values near 1 (or -1) indicate that the model "explains" the data pretty well. Data modeling -------------- How are we going to use all this? Based on our development of the FFT, we expect that the DFT algorithm is O(n^2) and the FFT is O(n log n). Let's say we wanted to check that and see if it seems to be true for a given set of n. Model independent var dependent variable linear run time x = n quadratic run time x = n^2 n log n run time x = n log n A model is a simplification of the real world that captures at least some of the behavior of the real world, at least approximately. In this case, we have a theoretical model of program performance that leads us to expect certain relationships between run time and problem size. In the case of the DFT, we expect run time = a + b n^2 and for the FFT we expect run time = a + b n log n Using linear least squares we can estimate the parameters of a and b for a given dataset and a given model. Using R^2 we can compare how well different models fit a given dataset. How do we know if a given model is correct? We don't, really. On the other hand, some models fit the data so well it is hard not to think that it means something. Example: DFT n run time 32 1.0 64 5.4 128 26.4 Fit a linear model to the data and predict the run time for 256 (answer = 128) What are the xs? What are the ys? n = 3 x = (32, 64, 128) y = (1.0, 5.4, 26.4) Sx = 224 Sy = 32.8 Sx^2 = 21504 Sxy = 3756.8 3 * 3756.8 - 224 * 32.8 b = ----------------------- = .2737 3 * 21504 - 224 * 224 a = (32.8 - .2737 * 224) / 3 = -9.5 -9.5 + .2737 * 32 = -0.743 -9.5 + .2737 * 64 = 8.01 -9.5 + .2737 * 128 = 25.53 Errors = (1.743, -2.61, 0.87) Prediction for 256 (do we expect it to be too high or too low)? -9.5 + .2737 * 256 = 60.6 Actual = 127.9 Warning signs of a bad model: 1) poor predictions 2) patterns in the residuals In the next assignment you get to do some of this!