CS231 lecture notes, Fall 1998
Week 14, Wednesday

Topics:  Huffman code and hash tables

Huffman Code
------------

Last part of the assignment.

1) Traverse the Huffman tree, keeping track of the path to each node.

2) When you come to a lead node, add a new entry to the CodeTab

Structures

Code : an array of booleans
CodeTab: an array of Codes, indexed by character (just like the
	 frequency table).

How do we keep track of the path?

Write a recursive traversal with two extra arguments:

1) the length of the path so far,
2) an array of booleans containing the path so far.

When we create a new Code, we can take the path array as an
argument -- and MAKE A COPY OF IT, rather than store a reference
to it.  Why?

Because if we store a reference to it, and it changes later,
then the reference also sees the change.

If we make a copy, we are free to modify the array.

This is important, since we will need to change the array
constantly during the traversal:

  public void getCodesFromTree (HuffTree tree, boolean[] path, int sofar) {

    if (tree == null) return;

    if (isLeaf (tree)) {
      // build a new Code based on the path so far, and add
      // a new entry to the code table
    }

    // add the next element to the path and make the first recursive call
    path[sofar] = true;
    getCodesFromTree (tree.left, path, sofar+1);

    // change the next element and make the second recursive call
    path[sofar] = false;
    getCodesFromTree (tree.right, path, sofar+1);
  }


The Table ADT and the Hash table implementation
-----------------------------------------------

Last topic for the semester.  No time to do an assignment with them,
but since they are an important data structure, I want to at least
tell you about them.

Read Chapter 9, Sections 9.1--9.3 are fair game for the exam.

The Table ADT
-------------

We have been using arrays for lots of things, since they are a
versatile structure, but we have often run across a problematic
limitation:

1) the indices of an array can only be integers

We have worked around that limitation sometimes, by converting
characters to integers, for example.

We have also been confronted by another limitation, which
is

2) the size of the array does not depend on the number of elements
   being store.  If the array is too big, it wastes space.  If it's
   too small, it might cause an error, or we might have to add
   complicated code to make the array bigger.

The table ADT is an abstract data type intended to improve on
arrays by generalizing and abstracting.

Generalizing:  Tables are more general than arrays because they
	       can use any type as an index.  The basic operation
	       on a table is a lookup, where you provide a key
	       (of any kind) and the table returns the corresponding
	       value (of any kind).

Abstracting:   Like all ADTs, the table is defined by its set of
	       operations, not by its implementation.  As usual,
	       there are many possible implementations for a table.
	       They all have the same behavior, but different
	       performance characteristics.

Table entries
-------------

The things that go in a table are entries, where each entry
contains a key and a value.

    The key, K, is like a handle you use to look things up.
    The value, I, is the information you are storing.

    Example: in a dictionary, the word is the key and the
    information is the definition

    Example: at the credit card company, the CC# is the key
    and the information is the available credit for that account


Table operations
----------------

Construct an empty table.

Test for an empty table.

Add a new entry to the table, where an entry is a pair (K, I)

Retrieve, or lookup, an entry.  You provide K; retrieve returns I

Delete an entry from the table, given the key K

Update an entry -- change the information associated with a key.
       same as deleting the old entry and adding a new one

Enumerate the entries in the table.


What are some possible implementations?
---------------------------------------

1) Create a Pair object and build a set of Pairs

   a) use an unsorted List of Pairs: insert is constant time,
      but lookup and delete require us to search the list,
      which is linear time

   b) use a sorted array of Pairs: we might have to resize the
      array periodically.  insert is linear, but searching can
      be done by binary partioning, which is log time

   c) use a Heap!

   d) and so on

2) Hash tables are a remarkable idea that makes it possible
   (with certain assumptions) to insert, remove, and lookup
   things in a table in constant time!


Hash tables
-----------

Let's say that the pairs we are interested in (K, I) have
types Tk and Ti.

In order to build a hash table, we build an array of Ti,
and we create a hash function that maps Tk onto the integers.

For example, we already used one hash function:

    int hash (char c) {
       int index = c - 'a';
       return index;
    }

This function maps characters a-z onto the integers 0-25.

Then we use the integers as indices into the array.

     Example: store the code associated with the letter 'c'

              1) convert 'c' to 2,
	      2) store c's code at the 2th array location

So far there is nothing new here, except a semi-obvious hack.

The interesting thing happens when the hash function is not
one-to-one!

Real hash functions
-------------------

Real hash functions lose information; that is, they map many
input values to the same output value.

In other words, many keys get mapped to the same location.

In other words, given the output of the hash function, we
don't know what the input was.

A common example of a hash function is the modulus operator:

    int hash (char c) {
       int index = c - 'a';
       return index % 6;
    }
  
Now the only possible return values are 0,1,2,3,4,5

a maps to 0
b maps to 1
  ...
g maps to 0
h maps to 1
  ...
m maps to 0
n maps to 1
  ...
s maps to 0
t maps to 1
  ...
y maps to 0
z maps to 1

Given the output 0, we don't know whether the input was a,g,m,s,y

Why would we do this?

1) the number of keys might be very large (how many credit card
   numbers are there?  1000 0000 0000 0000!)

2) this leaves us free to make the array size proportional to
   the number of entries, not the number of keys

What's the price for this freedom?

1) we can't tell for sure which of several possible entries is
   stored at a given location

   (need to store the keys with the information, so we can check)

2) there might be more than one entry vying for the same location

   (we hope the number of collisions is small, if the array
   is a few times bigger than the number of entries)

   (regardless, we need to find a way to reconcile collisions)


Simplest implementation:

1) each hash table entry is a List of pairs

2) to add an entry, you hash the key, and add the entry to
   the corresponding list

3) to lookup an entry, hash the key and search the corresponding
   list

Statistically, we expect the number of entries per array location
to be 0 or 1, and only occasionally greater than 1.


Java Hashtable class
--------------------

Dictionary is an abstract class that specifies the interface
all Tables should have.  Both the keys and the information
are Objects.

Check out...

http://java.sun.com/products/jdk/1.1/docs/api/java.util.Dictionary.html

Hashtable is an implementation of a Dictionary.

http://java.sun.com/products/jdk/1.1/docs/api/java.util.Hashtable.html