Software Systems
Spring 2005

For today, you should have:

1) read the Magnetic Disk handout from Hennessy and Patterson
2) read 'The Memory Hierarchy' from Stallings
3) started Homework 2

Outline:

1) Homework 1 discussion
2) Homework 2 discussion
3) a little workload characterization

For next time you should:

1) keep working on Homework 2
2) prepare for a quiz on disk drives and the memory hierarchy


Homework 1
----------

Distance limits latency (sometimes)

Bandwidth is for big things
(more precisely, when you are dealing with big things, bandwidth
is _usually_ the limiting factor)

Latency is for little things
(likewise)

Size amortizes cost
(as a result, performance tuning often has the effect of making
systems in which both latency and bandwidth are relevant)

Total latency, bottleneck bandwidth
(when links are connected in series, the latencies add and the
end-to-end bandwidth is the minimum of the link bandwidths)
(what about parallel?)

Section 1.4:

Campus:      0.1 to 3 ms    (outlier at 25)
East coast:  2 to 50 ms     (outlier at 71)
West coast:  65 to 100 ms   (outliers at 40 and 180)
Asia:        190 to 250 ms  (outlier at 690)
Australia:   240 to 260 ms  (outlier at 30)

Google challenges:

What's the speed-of-light travel time to Australia?
What's the speed-of-light travel time to a geosynchronous satellite?

Section 1.6:

What is the point of the last two questions?


Homework 2
----------

Any problems running the program?

Any problems making plots?

Back to notes03.txt


Workload modeling
-----------------

We've already seen two examples where the performance of
a system depends on how you use it:

1) Reading contiguous blocks from disk can be hundreds of times
   faster than reading non-contiguous blocks.

2) Cache performance depends on temporal and spatial locality.

Often we are interested in statistical properties of a workload:

1) likelihood of contiguous access
2) cache hit rates
3) distribution of size, time between requests, etc.
4) correlations among various properties


A quick workload measurement
----------------------------

As an example, let's do a quick in-class measurement of the
distribution of file sizes on your laptops.

Move into a directory where you want to put class-related things.

    cd ~/ss/code

Get some code from the class web page and unpack it:

    wget wb/ss/code/irlam.tgz
    tar -xzf irlam.tgz
    cd irlam

Use the makefile to compile cdf:

    make

Unpack xgraph and compile it:

    tar -xzf xgraph-11.tar.gz
    cd xgraph
    xmkmf
    make

Edit your ~/.bashrc and add the following:

    alias xgraph=/complete/path/to/xgraph

Save it and reread it:

    . ~/.bashrc

Check that it works:

    which xgraph

Collect file size info:

    cd ~/ss/code/irlam
    sh collect_sizes

The result should be files named sizes.1, sizes.2, etc., one per
filesystem.

    wc sizes.1
    cdf sizes.1 | xgraph

This cdf is typical of many operating system workloads:

1) lots of small things
2) a few very large things

It is almost always a good idea to look at these things under
a log transform:

    cdf -t logx sizes.1 | xgraph

Now there's something we can look at.


Cumulative distribution functions
---------------------------------

Many of you are used to looking at histograms, which are a
smoothed representation of a pdf.

With real data, histograms/pdfs are approximations at best
and DANGEROUSLY MISLEADING at worst.

1) hard to choose the right smoothing parameter (e.g., the bin width)

A cdf is a complete description of a dataset.

How to read one:

1) cdf is a mapping from values to percentiles
2) or from percentiles to values
3) it's important not to smooth or interpolate

It takes some getting used to, but once you have adjusted,
it is a powerful tool for exploratory data analysis.

Two steps:

1) exploration: find something interesting
2) presentation: figure out the best way to convey what
   you have found

Three common distributions in computer science:

1) exponential
2) lognormal
3) Pareto
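

Going back to how to read a cdf: here is a minimal sketch in Python
of the two directions (value to percentile, percentile to value).
This is not the cdf program from irlam.tgz. The function names
and the sample sizes below are made up for illustration, and the
format of sizes.1 is not specified in these notes, so the sketch
works on a plain list of numbers. Both lookups return answers
taken directly from the data, with no smoothing or interpolation.

```python
import math
from bisect import bisect_right


def percentile_rank(sorted_vals, x):
    """Value -> percentile: fraction of the sample that is <= x."""
    return bisect_right(sorted_vals, x) / len(sorted_vals)


def value_at(sorted_vals, p):
    """Percentile -> value: smallest data point v with cdf(v) >= p.

    Always returns an actual element of the sample; nothing is
    interpolated between neighboring values.
    """
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    n = len(sorted_vals)
    idx = max(0, math.ceil(p * n) - 1)
    return sorted_vals[idx]


# Hypothetical file sizes in bytes (invented, not measured).
sizes = sorted([4096, 12, 0, 9000000, 300, 12, 70000, 4096])

print(percentile_rank(sizes, 4096))  # fraction of files <= 4096 bytes
print(value_at(sizes, 0.5))          # a median file size from the data
```

Keeping the sorted sample itself, rather than binned counts, is
what makes the cdf a complete description of the dataset: every
question about "how many are at most this big" can be answered
exactly from it.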