CS231 lecture notes, Fall 1998 Week 14, Wednesday Topics: Huffman code and hash tables Huffman Code ------------ Last part of the assignment. 1) Traverse the Huffman tree, keeping track of the path to each node. 2) When you come to a lead node, add a new entry to the CodeTab Structures Code : an array of booleans CodeTab: an array of Codes, indexed by character (just like the frequency table). How do we keep track of the path? Write a recursive traversal with two extra arguments: 1) the length of the path so far, 2) an array of booleans containing the path so far. When we create a new Code, we can take the path array as an argument -- and MAKE A COPY OF IT, rather than store a reference to it. Why? Because if we store a reference to it, and it changes later, then the reference also sees the change. If we make a copy, we are free to modify the array. This is important, since we will need to change the array constantly during the traversal: public void getCodesFromTree (HuffTree tree, boolean[] path, int sofar) { if (tree == null) return; if (isLeaf (tree)) { // build a new Code based on the path so far, and add // a new entry to the code table } // add the next element to the path and make the first recursive call path[sofar] = true; getCodesFromTree (tree.left, path, sofar+1); // change the next element and make the second recursive call path[sofar] = false; getCodesFromTree (tree.right, path, sofar+1); } The Table ADT and the Hash table implementation ----------------------------------------------- Last topic for the semester. No time to do an assignment with them, but since they are an important data structure, I want to at least tell you about them. Read Chapter 9, Sections 9.1--9.3 are fair game for the exam. The Table ADT ------------- We have been using arrays for lots of things, since they are a versatile structure, but we have often run across a problematic limitation: 1) the indices of an array can only be integers We have worked around that limitation sometimes, by converting characters to integers, for example. We have also been confronted by another limitation, which is 2) the size of the array does not depend on the number of elements being store. If the array is too big, it wastes space. If it's too small, it might cause an error, or we might have to add complicated code to make the array bigger. The table ADT is an abstract data type intended to improve on arrays by generalizing and abstracting. Generalizing: Tables are more general than arrays because they can use any type as an index. The basic operation on a table is a lookup, where you provide a key (of any kind) and the table returns the corresponding value (of any kind). Abstracting: Like all ADTs, the table is defined by its set of operations, not by its implementation. As usual, there are many possible implementations for a table. They all have the same behavior, but different performance characteristics. Table entries ------------- The things that go in a table are entries, where each entry contains a key and a value. The key, K, is like a handle you use to look things up. The value, I, is the information you are storing. Example: in a dictionary, the word is the key and the information is the definition Example: at the credit card company, the CC# is the key and the information is the available credit for that account Table operations ---------------- Construct an empty table. Test for an empty table. Add a new entry to the table, where an entry is a pair (K, I) Retrieve, or lookup, an entry. You provide K; retrieve returns I Delete an entry from the table, given the key K Update an entry -- change the information associated with a key. same as deleting the old entry and adding a new one Enumerate the entries in the table. What are some possible implementations? --------------------------------------- 1) Create a Pair object and build a set of Pairs a) use an unsorted List of Pairs: insert is constant time, but lookup and delete require us to search the list, which is linear time b) use a sorted array of Pairs: we might have to resize the array periodically. insert is linear, but searching can be done by binary partioning, which is log time c) use a Heap! d) and so on 2) Hash tables are a remarkable idea that makes it possible (with certain assumptions) to insert, remove, and lookup things in a table in constant time! Hash tables ----------- Let's say that the pairs we are interested in (K, I) have types Tk and Ti. In order to build a hash table, we build an array of Ti, and we create a hash function that maps Tk onto the integers. For example, we already used one hash function: int hash (char c) { int index = c - 'a'; return index; } This function maps characters a-z onto the integers 0-25. Then we use the integers as indices into the array. Example: store the code associated with the letter 'c' 1) convert 'c' to 2, 2) store c's code at the 2th array location So far there is nothing new here, except a semi-obvious hack. The interesting thing happens when the hash function is not one-to-one! Real hash functions ------------------- Real hash functions lose information; that is, they map many input values to the same output value. In other words, many keys get mapped to the same location. In other words, given the output of the hash function, we don't know what the input was. A common example of a hash function is the modulus operator: int hash (char c) { int index = c - 'a'; return index % 6; } Now the only possible return values are 0,1,2,3,4,5 a maps to 0 b maps to 1 ... g maps to 0 h maps to 1 ... m maps to 0 n maps to 1 ... s maps to 0 t maps to 1 ... y maps to 0 z maps to 1 Given the output 0, we don't know whether the input was a,g,m,s,y Why would we do this? 1) the number of keys might be very large (how many credit card numbers are there? 1000 0000 0000 0000!) 2) this leaves us free to make the array size proportional to the number of entries, not the number of keys What's the price for this freedom? 1) we can't tell for sure which of several possible entries is stored at a given location (need to store the keys with the information, so we can check) 2) there might be more than one entry vying for the same location (we hope the number of collisions is small, if the array is a few times bigger than the number of entries) (regardless, we need to find a way to reconcile collisions) Simplest implementation: 1) each hash table entry is a List of pairs 2) to add an entry, you hash the key, and add the entry to the corresponding list 3) to lookup an entry, hash the key and search the corresponding list Statistically, we expect the number of entries per array location to be 0 or 1, and only occasionally greater than 1. Java Hashtable class -------------------- Dictionary is an abstract class that specifies the interface all Tables should have. Both the keys and the information are Objects. Check out... http://java.sun.com/products/jdk/1.1/docs/api/java.util.Dictionary.html Hashtable is an implementation of a Dictionary. http://java.sun.com/products/jdk/1.1/docs/api/java.util.Hashtable.html