CS151 lecture notes, Fall 1999
Week 7, Wednesday

QUIZ!

Representation of characters
----------------------------

Inside the computer, everything is represented ultimately as bits.

Typically we think of a collection of bits as a binary number.

For example, the bits 0000 0011 usually represent the number 3.

Some time in the 50's, someone came up with a way to represent
characters, by assigning a number to each letter.

This MAPPING is arbitrary, except that the following are
probably good properties

1) the letters and digits should appear in order

   a) so we can alphabetize just by comparing integers

   b) so we can get from one letter to the next using arithmetic

2) the upper and lower case letters should differ by a single bit

   (this property is not obvious until you look at the representation
   in binary)

   A = 65 = 0100 0001
   a = 97 = 0110 0001

3) punctuation can be scattered around pretty much arbitrarily

4) EVERYONE SHOULD ADOPT THE SAME STANDARD
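
A small sketch of properties 1 and 2 in Java.  The class and method
names (CodeProperties, nextLetter, flipCase) are made up for
illustration; they are not part of any library.  The case bit is
computed from 'a' and 'A' rather than typed in as a number.

```java
public class CodeProperties {
    // The single bit that separates 'A' from 'a', computed from the
    // characters themselves rather than written as a magic number.
    static final int CASE_BIT = 'a' ^ 'A';

    // Property 1b: the next letter is one code higher.
    static char nextLetter(char c) {
        return (char) (c + 1);
    }

    // Property 2: flipping one bit switches the case of a letter.
    // Only meaningful for the letters A-Z and a-z.
    static char flipCase(char letter) {
        return (char) (letter ^ CASE_BIT);
    }

    public static void main(String[] args) {
        System.out.println('a' < 'b');                    // true
        System.out.println(nextLetter('a'));              // b
        System.out.println(Integer.toBinaryString('A'));  // 1000001
        System.out.println(Integer.toBinaryString('a'));  // 1100001
        System.out.println(flipCase('A'));                // a
    }
}
```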

The standard that was adopted is called ASCII (ask-key)
(American Standard Code for Information Interchange)

ASCII uses 7 bits, which means that there are 128 possible
characters (that's 2^7)

That's enough room for upper and lower case, punctuation,
and some special things like newlines, tabs, and bells
(these special ones are called control characters).

But that's not nearly enough for

1) letters with accent marks

2) other alphabets (Cyrillic)

3) languages that use ideographs


UNICODE
-------

The solution to this problem is the new and improved standard
called UNICODE.  Instead of 7 bits, it uses 16, which means
it has room for 65,536 characters, of which 34,168 have already
been allocated.

There is even a 31-bit version of UNICODE that has room for
2 billion characters, but most of them are not in use.

Of course, implementing new standards is hard, especially in
ways that are compatible with existing code.

So adoption of UNICODE has been slow.


Java uses UNICODE
-----------------

Internally, Java represents characters in UNICODE, although
not all Java environments currently support non-European
characters.

Interestingly, the codes for the ASCII characters (unaccented
letters, digits, and common punctuation) are the same in UNICODE
and ASCII.
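
A quick way to check this, as a sketch.  UnicodeDemo and codeOf are
made-up names; casting a char to int and Character.MAX_VALUE are
standard Java.

```java
public class UnicodeDemo {
    // Returns the numeric code Java stores for a character.
    static int codeOf(char c) {
        return (int) c;
    }

    public static void main(String[] args) {
        System.out.println(codeOf('A'));                  // 65, the same as in ASCII
        System.out.println('\u0041' == 'A');              // true
        System.out.println(codeOf(Character.MAX_VALUE));  // 65535, the largest 16-bit value
    }
}
```

This is only for looking at the representation.  It is not a
suggestion to put these numbers in real programs.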

For the most part, though, WE DON'T CARE what representation
we are using.

It is always better to write code that is portable and
abstract.  UNICODE and ASCII numbers should never appear
in your programs.

In other words, if you want to compare characters, use
character constants like 'a', not integers like 97.
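
For example, a test for lower-case letters can be written entirely
with character constants.  This is a sketch; CharCompare and
isLowerCaseLetter are made-up names, and the test only covers the
unaccented letters a through z.

```java
public class CharCompare {
    // True if c is one of the lower-case letters a through z.
    // No ASCII/UNICODE numbers appear anywhere.
    static boolean isLowerCaseLetter(char c) {
        return c >= 'a' && c <= 'z';
    }

    public static void main(String[] args) {
        System.out.println(isLowerCaseLetter('m'));  // true
        System.out.println(isLowerCaseLetter('M'));  // false
        System.out.println(isLowerCaseLetter('3'));  // false
    }
}
```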


Section 7.7
-----------

Looping and counting

    String fruit = "banana";
    int length = fruit.length();
    int count = 0;

    int index = 0;
    while (index < length) {
      if (fruit.charAt(index) == 'a') {
        count = count + 1;
      }
      index = index + 1;
    }
    System.out.println (count);

Don't forget about the ++ operator.

As an exercise, convert this so it uses indexOf to
find the letters.
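
One possible solution, as a sketch (try it yourself first).  The
names CountLetters and countLetter are made up.  It uses the
two-argument version of indexOf, which starts searching at a given
index, and returns -1 when the letter is not found.

```java
public class CountLetters {
    // Counts how many times target appears in s by jumping from one
    // occurrence to the next with indexOf.
    static int countLetter(String s, char target) {
        int count = 0;
        int index = s.indexOf(target);
        while (index != -1) {
            count++;
            // Resume the search just past the occurrence we found.
            index = s.indexOf(target, index + 1);
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countLetter("banana", 'a'));  // 3
    }
}
```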


Character arithmetic
--------------------

You can add integers to characters, but the result is an integer.

So, to "increment" a letter you can do something like

    char c = (char)('a' + 1);

To convert a character that is a digit to the corresponding number

    int x = (int)(letter - '0');

This works because we know that the digits are encoded in
increasing order.

A better way to do the same thing:

    int x = Character.digit (letter, 10);

Character is the name of a class that contains methods that
pertain to characters.  The second argument is the base.
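
A sketch comparing the two approaches.  CharArithmetic,
bySubtraction, and byCharacterDigit are made-up names;
Character.digit is the real library method.  One reason it is
better: it returns -1 for a character that is not a digit, instead
of a meaningless number.

```java
public class CharArithmetic {
    // Converts a digit character using arithmetic on its code.
    static int bySubtraction(char letter) {
        return letter - '0';
    }

    // Converts a digit character using the Character class (base 10).
    static int byCharacterDigit(char letter) {
        return Character.digit(letter, 10);
    }

    public static void main(String[] args) {
        System.out.println(bySubtraction('7'));     // 7
        System.out.println(byCharacterDigit('7'));  // 7
        System.out.println(bySubtraction('x'));     // 72, not a meaningful digit value
        System.out.println(byCharacterDigit('x'));  // -1, meaning "not a digit"
    }
}
```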


Strings are immutable
---------------------

toUpperCase and toLowerCase SOUND like they modify the
String, but they don't.

They return a new String.
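
A sketch of the difference.  ImmutableDemo and upperCopy are made-up
names.

```java
public class ImmutableDemo {
    // Returns an upper-case copy; the argument itself is untouched.
    static String upperCopy(String s) {
        return s.toUpperCase();
    }

    public static void main(String[] args) {
        String name = "banana";

        name.toUpperCase();               // result is thrown away
        System.out.println(name);         // banana

        String upper = upperCopy(name);   // keep the new String
        System.out.println(upper);        // BANANA
        System.out.println(name);         // banana
    }
}
```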


Strings are incomparable
------------------------

Don't use == to compare Strings!

Unfortunately, it is syntactically legal, but it does not
do what you want.

The equals method does.

You invoke it on one String and pass the other as an argument,
which is weird:

      s1.equals (s2)
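
A sketch showing the two giving different answers.  StringEquality,
sameObject, and sameContents are made-up names.  The second String
is built with new so that it is a separate object with the same
letters.

```java
public class StringEquality {
    // What == actually checks: are these the very same object?
    static boolean sameObject(String a, String b) {
        return a == b;
    }

    // What you usually want: do they contain the same characters?
    static boolean sameContents(String a, String b) {
        return a.equals(b);
    }

    public static void main(String[] args) {
        String s1 = "banana";
        String s2 = new String("banana");

        System.out.println(sameObject(s1, s2));    // false
        System.out.println(sameContents(s1, s2));  // true
    }
}
```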

compareTo is similar, except that it returns an integer
indicating which one is bigger, assuming that they are not
equal:

      s1.compareTo (s2)

returns negative if s1 comes before s2 in the alphabet,
        positive if s2 comes before s1,
        0 if they are the same

The actual return is the DIFFERENCE between the ASCII/UNICODE
codes for the first characters in the Strings that differ
(or the difference in length, if one String is just the
beginning of the other).

Usually that's more than you need to know.
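
A sketch of typical use.  CompareDemo and describeOrder are made-up
names.  Notice that the code only looks at the SIGN of the result,
which is usually all you should rely on.

```java
public class CompareDemo {
    // Says where s1 falls relative to s2, using only the sign of
    // the value returned by compareTo.
    static String describeOrder(String s1, String s2) {
        int result = s1.compareTo(s2);
        if (result < 0) {
            return "before";
        } else if (result > 0) {
            return "after";
        } else {
            return "same";
        }
    }

    public static void main(String[] args) {
        System.out.println(describeOrder("apple", "banana"));  // before
        System.out.println(describeOrder("pear", "peach"));    // after
        System.out.println(describeOrder("kiwi", "kiwi"));     // same

        // The first characters that differ are 'a' and 'd'.
        System.out.println("banana".compareTo("band"));        // -3
    }
}
```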


Sorting with compareTo
----------------------

One of the problems with compareTo is that it does not really
know the rules for alphabetizing things.

All it knows is the order things appear in the ASCII/UNICODE
tables.

For example, all the capital letters come before all the lowercase
letters.

This is a real problem in the world, when people use compareTo
blindly.
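
A sketch of what goes wrong.  TableOrder and sortedCopy are made-up
names; Arrays.sort is the library method, and for Strings it orders
them the same way compareTo does.

```java
import java.util.Arrays;

public class TableOrder {
    // Returns a sorted copy of names, leaving the original alone.
    static String[] sortedCopy(String[] names) {
        String[] copy = names.clone();
        Arrays.sort(copy);  // uses compareTo, so table order
        return copy;
    }

    public static void main(String[] args) {
        String[] names = { "apple", "Zebra", "banana" };

        // "Zebra" lands first because 'Z' comes before 'a' in the table.
        System.out.println(Arrays.toString(sortedCopy(names)));
        // [Zebra, apple, banana]
    }
}
```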

Example: faculty directory

Some names have capital letters in funny places

Some names are two words (they have a space in them)

     How does compareTo deal with spaces?

Some names begin with lower-case letters.

The history of the Colby directory indicates that these
things are being ignored or dealt with in an ad hoc way.

How can we deal with them ALGORITHMICALLY?
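
One possible approach, as a sketch and not the only answer: build a
"sort key" from each name and compare the keys instead of the names.
NameOrder, sortKey, and compareNames are made-up names, the example
names are invented, and the rule used here (ignore case, ignore
spaces) is an assumption.  A real directory might want different
rules.

```java
public class NameOrder {
    // Builds the key used for ordering: lower case, spaces removed.
    static String sortKey(String name) {
        String lower = name.toLowerCase();
        StringBuilder key = new StringBuilder();
        for (int i = 0; i < lower.length(); i++) {
            char c = lower.charAt(i);
            if (c != ' ') {
                key.append(c);
            }
        }
        return key.toString();
    }

    // Like compareTo, but compares the keys rather than the raw names.
    static int compareNames(String a, String b) {
        return sortKey(a).compareTo(sortKey(b));
    }

    public static void main(String[] args) {
        // Plain compareTo puts "de Groot" AFTER "DeLong", because
        // lower-case 'd' comes after upper-case 'D' in the table.
        System.out.println("de Groot".compareTo("DeLong") > 0);     // true

        // Comparing keys ("degroot" and "delong") puts it before.
        System.out.println(compareNames("de Groot", "DeLong") < 0); // true
    }
}
```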