MA332 lecture notes, Fall 1998
Week 13 Tuesday

Curve-fitting redux
-------------------

We have two variables, an independent variable, x, which is
supposed to allow us to predict the independent variable, y.

Assume that the relationship between the two is known to
be linear, in other words:

   y = a + bx + e

Where the e is an "error" term that might contain measurement
error, or might contain additional factors not contained in
the model.

For example, last time we talked about the relationship
between message size and transit time:

	transit time = latency + packet size / bandwidth + e

In this case, e probably represents a measurement error.

In the handout, they use the example of predicting first-year
GPA based on SAT scores.  In this case, e captures not only
the measurement error (since the SATs are not perfect) and
all the other factors besides test scores that determine your
GPA.

	GPA = a + b * SAT + e

(I will leave aside the question of whether those two are
really different things -- in other words, are the SATs
measuring something imperfectly or something imperfect?).

In both cases we have only two parameters and more than
two data points (observations), so we will not be able to
fit all the data exactly.

Least Squares
-------------

Instead, we choose values of the parameters a and b that
minimize the sum of the squared errors.

Given data xi and yi, and estimates of a and b, we find

      ei = yi - (a + b xi)

and minimize

    sum  ei^2


Last time we crammed the whole problem into matrix notation
and derived:

         1 x1              y1
    A =  1 x2          b = y2
         . .               .
         . .               .
	 1 xn              yn

Which leads to

   AtA x = At b

In other words, if Ax = b is over constrained, multiply both
sides by At!


This time, let's do the calculus in non-matrix notation.

Result:

             n Sxy - Sx Sy
	b = ---------------
            n Sx^2 - (Sx)^2

	a = (Sy - b Sx) / n


Can we convince ourselves that that's really the same thing
as before?

Goodness of fit
---------------

How do we evaluate whether or not this is a good fit for the
data?  We could use:

1) the mean error, which is zero by construction (duh!)

2) the mean absolute error (not a bad choice)

3) the mean squared error (which is the thing we just minimized)

The last two are good, but they depend on the particular dataset
in a way that makes them hard to evaluate.

An alternative is to look at the ratio of the variance in the
errors compared to the variance in the original data.

This ratio indicates how much variability (unpredictability)
is left after we subtract away our predictions from each value.

variability of original data =

	    SST = sum (y - ybar)^2

where ybar is the calculated mean of the y values.

variability after we subtract the predictions =

	    SSE = sum (y - a - bx)^2

R^2
---

The reading provides an alternate derivation of the same thing,
both leading to the result:

     R^2 = 1 - SSE / SST

The value of R^2 is always between -1 and 1, with values near
0 indicating that SSE is not much smaller than SST, which means
that the predictions weren't very good.

Values near 1 (or -1) indicate that the model "explains" the
data pretty well.

Data modeling
--------------

How are we going to use all this?

Based on our development of the FFT, we expect that the DFT
algorithm is O(n^2) and the FFT is O(n log n).  Let's say
we wanted to check that and see if it seems to be true for
a given set of n.

Model	    independent var     dependent variable

linear	    run time		x = n
quadratic   run time		x = n^2
n log n	    run time		x = n log n


A model is a simplification of the real world that captures
at least some of the behavior of the real world, at least
approximately.

In this case, we have a theoretical model of program performance
that leads us to expect certain relationships between run time
and problem size.

In the case of the DFT, we expect

   run time = a + b n^2

and for the FFT we expect

   run time = a + b n log n

Using linear least squares we can estimate the parameters of a
and b for a given dataset and a given model.

Using R^2 we can compare how well different models fit a given
dataset.

How do we know if a given model is correct?  We don't, really.
On the other hand, some models fit the data so well it is hard
not to think that it means something.

Example:  DFT 

	  n       run time

	  32	  1.0
	  64	  5.4
	  128	  26.4

Fit a linear model to the data and predict the run time
for 256 (answer = 128)

What are the xs?  What are the ys?

     n = 3
     x = (32, 64, 128)   y = (1.0, 5.4, 26.4)

Sx = 224       Sy = 32.8
Sx^2 = 21504   Sxy = 3756.8

    3 * 3756.8 - 224 * 32.8 
b = ----------------------- = .2737
    3 * 21504 - 224 * 224 
    

a = (32.8 - .2737 * 224) / 3 = -9.5


-9.5 + .2737 * 32 = -0.743        
-9.5 + .2737 * 64 = 8.01
-9.5 + .2737 * 128 = 25.53

Errors = (1.743, -2.61, 0.87)

Prediction for 256 (do we expect it to be too high or too low)?

-9.5 + .2737 * 256 = 60.6

Actual = 127.9

Warning signs of a bad model:

1) poor predictions

2) patterns in the residuals

In the next assignment you get to do some of this!