Software Systems
Spring 2008

For today, you should have:

1) done the reading from Tanenbaum
2) done a semaphore puzzle
3) read Cow Book chapter 5 and done programming exercise 5-2

Outline:

1) reading questions from last time
2) address space experiment #2
3) semaphore puzzle (mutex solution)
4) programming exercise 5-2
5) workload modeling, Irlam study

For next time you should:

1) do Homework 2, which includes a semaphore problem and a
   programming problem
2) do the reading below
3) prepare for a quiz


Address space experiment #2
---------------------------

First some optional reading on Experiment #1:

1) http://lwn.net/Articles/91829/
2) http://kerneltrap.org/node/2450

And now for today's experiment:

    cd address_space

Read the code in sleep.c

Make it and run it:

    make sleep
    ./sleep 1         # don't forget the ./

Now run two copies of it at the same time:

    (./sleep 1 &) ; ./sleep 2

What sense can we make of the output?


C Programming
-------------

Here is a BROKEN solution to Exercise 5-2 from the Cow Book.

http://wb/ss/code/bad_sphere.c

Please download it and get it working.  Think about your
debugging process.

#include <stdio.h>

// this is the old-school way to define constants
#define LINELEN 128

double sphere_volume(double radius);
{
  double volume;

  // this is the new way to define constants
  const float PI = 3.141592653;

  volume = 4.0 / 3.0 * PI * radius^2;
  return volume;
}

int main()
{
  double radius;
  char line[LINELEN];
  double volume;

  // prompt the user and get input
  printf("Enter the radius of the sphere in meters:\n");
  fgets(line, LINELEN, stdin);
  sscanf(line, "%f\n", radius);

  // compute the volume and print the result
  volume = sphere_volume(double radius);
  printf("The volume of a sphere with radius %f m is %f m^3\n",
         radius, volume);
}

What happens if you put a semicolon at the end of the #define?


Workload modeling
-----------------

System performance often depends on properties of the workload.
For example:

1) Reading contiguous blocks from disk can be hundreds of times
   faster than reading non-contiguous blocks.
2) Cache performance depends on temporal and spatial locality.

Often we are interested in statistical properties of a workload:

1) likelihood of contiguous access
2) cache hit rates
3) distribution of size, time between requests, etc.
4) correlations among various properties


A quick workload measurement
----------------------------

As an example, let's do a quick in-class measurement of the
distribution of file sizes on your laptops.

Move into a directory where you want to put class-related things:

    cd ~/ss/code

Get some code from the class web page and unpack it:

    wget http://wb/ss/code/irlam.tgz
    tar -xzf irlam.tgz
    cd irlam

Use the makefile to compile cdf:

    make

Unpack xgraph and compile it:

    tar -xzf xgraph-11.tar.gz
    cd xgraph
    xmkmf
    make

Edit your ~/.bashrc and add the following:

    alias xgraph=/complete/path/to/xgraph

Save it and reread it:

    . ~/.bashrc

Check that it works:

    which xgraph

Collect file size info:

    cd ~/ss/code/irlam
    sh collect_sizes

The result should be files named sizes.1, sizes.2, etc., one per
filesystem.

    wc sizes.1
    ./cdf sizes.1 | xgraph

This cdf is typical of many operating system workloads:

1) lots of small things
2) a few very large things

It is almost always a good idea to look at these things under a
log transform:

    ./cdf -t logx sizes.1 | xgraph

Now there's something we can look at.


Cumulative distribution functions
---------------------------------

Many of you are used to looking at histograms, which are a smoothed
representation of a pdf.

With real data, histograms/pdfs are approximations at best and
DANGEROUSLY MISLEADING at worst.  Among other things, it is hard to
choose the right smoothing parameter.

A cdf is a complete description of a dataset.
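
To make that concrete, here is a minimal sketch of an empirical cdf
in C.  It is NOT the cdf program from irlam.tgz; the function names
(sort_values, ecdf, ecdf_inverse, print_cdf) are made up for
illustration.  The idea: sort the data, and the i-th smallest value
(counting from 1) gets cumulative fraction i/n.  Nothing is binned or
smoothed, so every data point can be recovered from the result.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

/* qsort comparison: ascending order of doubles. */
static int compare_doubles(const void *a, const void *b)
{
    double x = *(const double *)a;
    double y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Sort the data in place; the sorted array carries the whole cdf. */
void sort_values(double *values, size_t n)
{
    qsort(values, n, sizeof(values[0]), compare_doubles);
}

/* Value -> percentile: fraction of the n sorted values that are <= x. */
double ecdf(const double *sorted, size_t n, double x)
{
    size_t count = 0;

    assert(n > 0);
    while (count < n && sorted[count] <= x)
        count++;
    return (double)count / (double)n;
}

/* Percentile -> value: smallest data value whose cumulative fraction
   is >= p, for 0 < p <= 1.  Returns an actual data point; it does
   not interpolate between points. */
double ecdf_inverse(const double *sorted, size_t n, double p)
{
    size_t i;

    assert(n > 0 && p > 0.0 && p <= 1.0);
    for (i = 0; i < n; i++) {
        if ((double)(i + 1) / (double)n >= p)
            return sorted[i];
    }
    return sorted[n - 1];
}

/* Print one "value fraction" pair per line, one line per data point. */
void print_cdf(FILE *out, const double *sorted, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++)
        fprintf(out, "%g %g\n", sorted[i], (double)(i + 1) / (double)n);
}
```

To use it, fill an array of doubles with file sizes, call
sort_values on it, then call print_cdf(stdout, ...) to get the
pairs, or ecdf / ecdf_inverse to look up a single value or
percentile.
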

How to read one:

1) cdf is a mapping from values to percentiles
2) or from percentiles to values
3) it's important not to smooth or interpolate

It takes some getting used to, but once you have adjusted, it is a
powerful tool for exploratory data analysis.

Two steps:

1) exploration: find something interesting
2) presentation: figure out the best way to convey what you have
   found


Distribution models
-------------------

What you just plotted is called an empirical distribution, because
it comes from empirical data.

Many natural empirical distributions can be modeled by families of
continuous distributions.

For example, the distribution of human heights matches a normal
distribution (with carefully-chosen parameters) pretty well.

So you can _summarize_ the distribution with only two numbers.

Three common distributions in computer science:

1) exponential
2) lognormal
3) Pareto

Which of these distribution families is the best summary of your
file system data?

What would be omitted?


Reading questions
-----------------

Tanenbaum Chapter 2, pages 81-90

1) What is the difference between a thread and a process?

2) What state is shared by cooperating threads, and what data is
   not shared?

3) Why does each thread have its own stack?

4) Tanenbaum gives several examples of applications that would
   benefit from multiple threads.  See if you can think of another.

5) What does it mean to say that a system call is "blocking" or
   "non-blocking"?

Stop at the end of Section 2.2.2