CS151 lecture notes, Fall 1999

Week 7, Wednesday

QUIZ!

Representation of characters
----------------------------

Inside the computer, everything is ultimately represented as bits.
Typically we think of a collection of bits as a binary number.
For example, the bits 0000 0011 usually represent the number 3.

Some time in the 50's, someone came up with a way to represent
characters, by assigning a number to each letter.

This MAPPING is arbitrary, except that the following are probably
good properties:

1) the letters and digits should appear in order

   a) so we can alphabetize just by comparing integers
   b) so we can get from one letter to the next using arithmetic

2) the upper and lower case letters should differ by a single bit
   (this property is not obvious until you look at the
   representation in binary)

       A = 65 = 0100 0001
       a = 97 = 0110 0001

3) punctuation can be scattered around pretty much arbitrarily

4) EVERYONE SHOULD ADOPT THE SAME STANDARD

The standard that was adopted is called ASCII (ask-key)
(American Standard Code for Information Interchange).

ASCII uses 7 bits, which means that there are 128 possible
characters (that's 2^7).

That's enough room for upper and lower case, punctuation, and
some special things like newlines, tabs, and bells (sometimes
called control characters).

But that's not nearly enough for

1) letters with accent marks
2) other alphabets (Cyrillic)
3) languages that use ideographs

UNICODE
-------

The solution to this problem is the new and improved standard
called UNICODE. Instead of 7 bits, it uses 16, which means it
has room for 65,536 characters, of which 34,168 have already
been allocated.

There is even a 31-bit version of UNICODE that has room for
2 billion characters, but most of them are not in use.

Of course, implementing new standards is hard, especially in
ways that are compatible with existing code. So adoption of
UNICODE has been slow.

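To make the ASCII properties listed earlier concrete, here is a
minimal Java sketch. It prints the codes for 'A' and 'a' in binary
and flips the one bit that separates them. The class name CharCodes
and the methods toBinary and flipCase are made up for this
illustration; they are not part of any library. It relies on the
fact that a Java char holds a character code that agrees with ASCII
for these letters.

```java
// Hypothetical demo class; the names here are not from the notes.
class CharCodes {

    // Return the code of c as an 8-digit binary string.
    // Only meaningful for codes below 256, such as the ASCII range.
    static String toBinary(char c) {
        String bits = Integer.toBinaryString(c);
        while (bits.length() < 8) {
            bits = "0" + bits;
        }
        return bits;
    }

    // Flip the single bit that separates 'A' from 'a'.
    // Assumes c is an unaccented letter, A-Z or a-z.
    static char flipCase(char c) {
        int caseBit = 'A' ^ 'a';   // the one bit in which the two codes differ
        return (char) (c ^ caseBit);
    }

    public static void main(String[] args) {
        System.out.println("A = " + (int) 'A' + " = " + toBinary('A'));  // A = 65 = 01000001
        System.out.println("a = " + (int) 'a' + " = " + toBinary('a'));  // a = 97 = 01100001
        System.out.println(flipCase('q'));                               // Q
        System.out.println('a' < 'b');                                   // true
    }
}
```

Notice that the sketch never writes 65 or 97 itself; it gets the
codes from character constants. In real programs, prefer
Character.toUpperCase and Character.toLowerCase over bit tricks.
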
Java uses UNICODE
-----------------

Internally, Java represents characters in UNICODE, although not all
Java environments currently support non-European characters.

Interestingly, the 128 ASCII characters have the same codes in
UNICODE as they do in ASCII.

For the most part, though, WE DON'T CARE what representation we
are using. It is always better to write code that is portable
and abstract.

UNICODE and ASCII numbers should never appear in your programs.
In other words, if you want to compare characters, use character
constants like 'a', not integers like 97.

Section 7.7
-----------

Looping and counting

    String fruit = "banana";
    int length = fruit.length();
    int count = 0;

    int index = 0;
    while (index < length) {
        if (fruit.charAt(index) == 'a') {
            count = count + 1;
        }
        index = index + 1;
    }
    System.out.println (count);

Don't forget about the ++ operator.

As an exercise, convert this so it uses indexOf to find the letters.

Character arithmetic
--------------------

You can add integers to characters, but the result is an integer.
So, to "increment" a letter you can do something like

    char c = (char)('a' + 1);

To convert a character that is a digit to the corresponding number:

    int x = (int)(letter - '0');

This works because we know that the digits are encoded in
increasing order.

A better way to do the same thing:

    int x = Character.digit (letter, 10);

Character is the name of a class that contains methods that
pertain to characters. The second argument is the base.

Strings are immutable
---------------------

toUpperCase and toLowerCase SOUND like they modify the String,
but they don't; they return a new String.

Strings are incomparable
------------------------

Don't use == to compare Strings! Unfortunately, it is syntactically
legal, but it does not do what you want: it checks whether the two
variables refer to the same object, not whether the Strings contain
the same characters. The equals method does.

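Here is a small sketch of the difference between == and equals.
The class name StringEquality and its two helper methods are made
up for this example. It uses new String to force a second object
that contains the same letters as the first.

```java
// Hypothetical demo class; the names here are not from the notes.
class StringEquality {

    // True only if both variables refer to the very same object.
    static boolean sameObject(String s1, String s2) {
        return s1 == s2;
    }

    // True if the two Strings contain the same characters.
    static boolean sameCharacters(String s1, String s2) {
        return s1.equals(s2);
    }

    public static void main(String[] args) {
        String s1 = "banana";
        String s2 = new String("banana");  // a distinct object, same letters

        System.out.println(sameObject(s1, s2));      // false
        System.out.println(sameCharacters(s1, s2));  // true
    }
}
```
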
You invoke equals on one String and pass the other as an argument,
which is weird:

    s1.equals (s2)

compareTo is similar, except that it returns an integer indicating
which one is bigger, assuming that they are not equal:

    s1.compareTo (s2)

    returns negative if s1 comes before s2 in the alphabet,
            positive if s2 comes before s1,
            0 if they are the same

The actual return value is the DIFFERENCE between the ASCII/UNICODE
codes for the first characters in the Strings that differ (or, if
one String is a prefix of the other, the difference between their
lengths). Usually that's more than you need to know.

sorting with compareTo
----------------------

One of the problems with compareTo is that it does not really
know the rules for alphabetizing things. All it knows is the
order in which things appear in the ASCII/UNICODE tables.

For example, all the capital letters come before all the lowercase
letters.

This is a real problem in the world, when people use compareTo
blindly.

Example: faculty directory

    Some names have capital letters in funny places
    Some names are two words (they have a space in them)
        How does compareTo deal with spaces?
    Some names begin with lower-case letters.

The history of the Colby directory indicates that these things
are being ignored or dealt with in an ad hoc way.

How can we deal with them ALGORITHMICALLY?
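
One possible approach, offered as a sketch rather than THE answer:
before comparing, convert each name to a "sort key" in which the
troublesome features are removed, and then use compareTo on the
keys. The version below assumes the rule we want is "ignore case
and ignore spaces"; a real directory might want a different rule.
The class DirectorySort, its methods, and the sample names are all
made up for this example.

```java
import java.util.Arrays;

// Hypothetical demo class; the names here are not from the notes.
class DirectorySort {

    // Build the key used for comparison: lower case, spaces removed.
    // Assumption: case and spaces should not affect the ordering.
    static String sortKey(String name) {
        return name.toLowerCase().replace(" ", "");
    }

    // Compare two names by their keys, with the same sign
    // convention as compareTo.
    static int compareNames(String a, String b) {
        return sortKey(a).compareTo(sortKey(b));
    }

    // Return a sorted copy; the original array is left alone.
    static String[] sortNames(String[] names) {
        String[] sorted = names.clone();
        Arrays.sort(sorted, DirectorySort::compareNames);
        return sorted;
    }

    public static void main(String[] args) {
        // Made-up sample data.
        String[] names = {"de Vries", "Davis", "Van Dyke", "Vance", "MacDonald", "Mace"};

        String[] plain = names.clone();
        Arrays.sort(plain);  // plain compareTo order
        System.out.println(Arrays.toString(plain));
        // [Davis, MacDonald, Mace, Van Dyke, Vance, de Vries]

        System.out.println(Arrays.toString(sortNames(names)));
        // [Davis, de Vries, MacDonald, Mace, Vance, Van Dyke]
    }
}
```

With plain compareTo, "de Vries" lands at the end because lowercase
letters come after all the capitals, and "Van Dyke" comes before
"Vance" because a space has a smaller code than any letter. Sorting
by key fixes both. For rules that depend on the language being
sorted, Java also provides the class java.text.Collator.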