Software Systems Spring 2005 For today, you should have: 1) worked on your project 2) read Silberschatz and Galvin, I/O Systems, part 2 3) written a draft of your final report Outline: 1) questions from I/O Systems 2) order of growth 3) hashtables For next time you should: 1) work on your project 2) prepare for a quiz on I/O (last quiz!) 3) read the wikipedia article about distributed hash tables and "Looking up Data in P2P Systems" Order of growth --------------- We want to compare algorithms, but there are some problems: 1) relative performance might depend on hardware issues Solution: count abstract operations 2) relative performance might depend on the details of the problem Solution: compare worst case behavior (or sometimes average case; seldom best case) 3) relative performance depends on problem size Solution: express run time as a function of problem size and then compare the asymptotic behavior of the functions The what now? What we have been calling order of growth is also called asymptotic behavior because it describes the behavior of the function as (usually) n goes to infinity. A complexity class is the set of all functions with the same asymptotic behavior. All the linear functions belong to the set O(n), pronounced "Big-Oh of n" Strictly, to say that f(n) belongs to O(g(n)) is to say that there are values c and n* such that f(n) < c g(n) for all n > n* In other words, for large n, f is bounded by some multiple of g. To see why this is meaningful, consider why n^2 is not in O(n). Also, notice that the constant c is arbitrary, which is why, when classifying algorithms, we ignore leading constants. Properties of combination: If f(n) is O(g(n)), what can we say about k f(n)? If f1(n) and f2(n) are O(g(n)), what can we say about f1(n) + f2(n)? If f1(n) is O(g(n)), and f2(n) is O(h(n)), what can we say about f1(n) + f2(n)? If f1(n) is O(g(n)), and f2(n) is O(h(n)), what can we say about f1(n) * f2(n)? Examples: What is the order of growth of 1) n^3 + n^2 2) n^3 + 1000000000000000000000000000000000000000000 n^2 3) (n^2 + n) * (n + 1) Is this really a useful way to describe an algorithm? 1) For large problems, a "better" algorithm is usually better. 2) Sometimes it is MUCH better, like the difference between feasible or not. 3) Sometimes there is a crossover point where for small problems, the "worse" algorithm is better. 4) The location of the crossover point tends to depend on details of the implementation and hardware. The difference between a good implementation and a bad implementation is usually a constant factor, but the difference between a good algorithm and a bad algorithm is infinite! (as n goes to infinity) Hashtables ---------- A map is a data structure that contains a set of key-value pairs and that supports the operations add (key, value): add a new key-value pair to the set lookup (key): lookup and return the value associated with the given key In Python both of these operations are performed with the [] operator d = {} # create a new empty dictionary d['name'] = 'allen # add ('name', 'allen') print d['name'] # lookup ('name') Maps are sometimes called associative arrays, because they are similar to arrays, with the primary difference that in an array the key usually has to be an integer (in which case it's called an index). There are many ways to implement a map, but a hashtable is a particularly good one for many purposes because it can perform both add() and lookup() in O(1) time (that is, the same amount of time no matter how many pairs are in the set!). Hashtables are based on a "hash function" that maps from the key type to integers: h(k) Example, if the keys are strings, you could treat the letters as digits in base 26. The simplest implementation of the table is an array of p elements. To add a pair: compute i = h(k) mod p, and store the pair at table[i] To lookup compute i = h(k) mod p, and look at table[i] Why isn't this sufficient? Collision resolution 1) chaining 2) open addressing Taking chaining as an example, we've avoided searching a long list, but we still have to search one of the short lists. As the number of pairs (n) gets big, so do the short lists, so the average search time is n/p, which is still O(n). So how can we make lookup O(1)? By making the hashtable bigger! 1) Initially p = 2. 2) When n = p, we double the size of the array and rehash everything. Since n