Software Systems
Spring 2008

For today, you should have:

1) done the reading from Tanenbaum

2) done a semaphore puzzle

3) read Cow Book chapter 5 and done programming exercise 5-2


Outline:

1) reading questions from last time

2) address space experiment #2

3) semaphore puzzle (mutex solution)

4) programming exercise 5-2

5) workload modeling, Irlam study


For next time you should:

1) do Homework 2, which includes a semaphore problem and a
   programming problem

2) do the reading below

3) prepare for a quiz


Address space experiment #2
---------------------------

First some optional reading on Experiment #1:

1) http://lwn.net/Articles/91829/

2) http://kerneltrap.org/node/2450


And now for today's experiment:

cd address_space

Read the code in sleep.c

Make it and run it

make sleep
./sleep 1         # don't forget the ./

Now run two copies of it at the same time

(./sleep 1 &) ; ./sleep 2

What sense can we make of the output?


C Programming
-------------

Here is a BROKEN solution to Exercise 5-2 from the Cow Book.

http://wb/ss/code/bad_sphere.c

Please download it and get it working.  Think about your
debugging process.


#include <stdio.h>

// this is the old-school way to define constants
#define LINELEN 128

double sphere_volume(double radius);
{
  double volume;

  // this is the new way to define constants
  const float PI = 3.141592653;

  volume = 4.0 / 3.0 * PI * radius^2;
  return volume;
}

int main()
{
  double radius;
  char line[LINELEN];
  double volume;

  // prompt the user and get input
  printf("Enter the radius of the sphere in meters:\n");
  fgets(line, LINELEN, stdin);
  sscanf(line, "%f\n", radius);

  // compute the volume and print the result
  volume = sphere_volume(double radius);
  printf("The volume of a sphere with radius %f m is %f m^3\n", 
	 radius, volume);
}


What happens if you put a semicolon at the end of the #define?


Workload modeling
-----------------

System performance often depends on properties of the workload.
For example:

1) Reading contiguous blocks from disk can be hundreds of
   times faster than reading non-contiguous blocks.

2) Cache performance depends on temporal and spatial locality.


Often we are interested in statistical properties of a workload:

1) likelihood of contiguous access

2) cache hit rates

3) distribution of size, time between requests, etc.

4) correlations among various properties


A quick workload measurement
----------------------------

As an example, let's take a quick in-class measurement of the
distribution of file sizes on your laptops.

Move into a directory where you want to put class-related things.

cd ~/ss/code

Get some code from the class web page and unpack it:

wget http://wb/ss/code/irlam.tgz
tar -xzf irlam.tgz
cd irlam

Use the makefile to compile cdf:

make

Unpack xgraph and compile it:

tar -xzf xgraph-11.tar.gz
cd xgraph
xmkmf
make

Edit your ~/.bashrc and add the following:

alias xgraph=/complete/path/to/xgraph

Save it and reread it:

. ~/.bashrc

Check that it works:

type xgraph       # bash builtin; unlike which, it always reports aliases


Collect file size info:

cd ~/ss/code/irlam

sh collect_sizes

The result should be files named sizes.1, sizes.2, etc.,
one per filesystem.

wc sizes.1
./cdf sizes.1 | xgraph

This cdf is typical of many operating system workloads:

1) lots of small things
2) a few very large things

It is almost always a good idea to look at these things
under a log transform:

./cdf -t logx sizes.1 | xgraph

Now there's something we can look at.


Cumulative distribution functions
---------------------------------

Many of you are used to looking at histograms, which are a
binned, and therefore smoothed, estimate of a pdf.

With real data, histograms/pdfs are approximations at best
and DANGEROUSLY MISLEADING at worst.

Among other things, it is hard to choose the right smoothing
parameter (for a histogram, the bin width).


A cdf is a complete description of a dataset.

How to read one:

1) cdf is a mapping from values to percentiles

2) or from percentiles to values

3) it's important not to smooth or interpolate


It takes some getting used to, but once you have adjusted,
it is a powerful tool for exploratory data analysis.

Two steps:

1) exploration: find something interesting

2) presentation: figure out the best way to convey what you have
   found


Distribution models
-------------------

What you just plotted is called an empirical distribution,
because it comes from empirical data.

Many natural empirical distributions can be modeled by families
of continuous distributions.

For example, the distribution of human heights matches a normal
distribution (with carefully-chosen parameters) pretty well.

So you can _summarize_ the distribution with only two numbers:
the mean and the standard deviation.

Three common distributions in computer science:

1) exponential

2) lognormal

3) Pareto

Which of these distribution families is the best summary of
your file system data?

What would be omitted?


Reading questions
-----------------

Tanenbaum Chapter 2, pages 81-90

1) What is the difference between a thread and a process?


2) What state is shared by cooperating threads, and what data
   is not shared?


3) Why does each thread have its own stack?


4) Tanenbaum gives several examples of applications that would
   benefit from multiple threads.  See if you can think of another.


5) What does it mean to say that a system call is "blocking"
   or "non-blocking"?


Stop at the end of Section 2.2.2