ASCII, Unicode, and the Universal Character System
The actual characters in documents are stored as numeric codes, and today the most common code set is the American Standard Code for Information Interchange (ASCII). ASCII codes extend from 0 to 127; for example, the ASCII code for A is 65, the ASCII code for B is 66, and so on.
On the other hand, the World Wide Web is just that today: worldwide. And plenty of scripts are not handled by ASCII, including Bengali, Armenian, Hebrew, Thai, Tibetan, Japanese Katakana, Arabic, and Cyrillic.
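As a rough illustration of the points above, the following Python sketch looks up the numeric codes for a couple of letters and checks whether a piece of text fits in the ASCII range. The helper names code_of and fits_in_ascii are made up for this example and are not part of any library.

```python
def code_of(ch: str) -> int:
    """Return the numeric code for a single character (hypothetical helper)."""
    return ord(ch)


def fits_in_ascii(text: str) -> bool:
    """Return True if every character has a code in the ASCII range 0 to 127."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False


print(code_of("A"), code_of("B"))   # 65 66
print(fits_in_ascii("Hello"))       # True

# The Hebrew word "shalom", written with escapes so this file stays ASCII.
hebrew = "\u05e9\u05dc\u05d5\u05dd"
print(fits_in_ascii(hebrew))        # False
```

The Hebrew text fails the check because its characters have codes well above 127, which is the gap that a larger character set has to fill.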
For that reason, the default character set specified for XML by W3C is Unicode, not ASCII. Unicode was originally designed around codes made up of 2 bytes, not 1, so they extend from 0 to 65,535 instead of just 0 to 127 (to make things easier, the Unicode codes 0 to 127 correspond to the ASCII codes, and the codes 128 to 255 match the ISO 8859-1, or Latin-1, character set). Unicode can therefore include many of the symbols commonly used in worldwide character and ideograph sets today.
Only about 40,000 Unicode codes are reserved at this point. Of these, about 20,000 are used for Han ideographs (although more than 80,000 such ideographs are defined) and about 11,000 for Korean Hangul syllables.
In practice, Unicode, like many parts of XML technology, is not fully supported on most platforms today. Windows 95/98 does not have full support for Unicode, although Windows NT and Windows 2000 come much closer (and XML Spy lets you use Unicode to write XML documents in Windows NT). What this means most often is that XML documents are written either in plain ASCII or in UTF-8, a variable-length encoding of Unicode built from 8-bit bytes: each ASCII character takes a single byte, and other characters take two or more. In practice, this is well suited to ASCII documents, because an ASCII document is already valid UTF-8, whereas the same document converted to 2-byte Unicode is twice as long. Here's how to specify the UTF-8 character encoding in an XML document:
<?xml version="1.0" encoding="UTF-8"?>
<DOCUMENT>
    <GREETING>
        Hello From XML
    </GREETING>
    <MESSAGE>
        Welcome to the wild and woolly world of XML.
    </MESSAGE>
</DOCUMENT>
In fact, XML processors today assume by default that your document is in UTF-8, so if you omit the encoding specification, UTF-8 is assumed. Because the ASCII codes are encoded identically in UTF-8, you'll have no trouble if you're writing XML documents in ASCII.
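To see the UTF-8 default in action, here is a minimal Python sketch that uses the standard library's xml.etree.ElementTree parser as a stand-in for "an XML processor". The choice of parser and the variable names are assumptions for illustration, and other processors may behave differently in edge cases.

```python
import xml.etree.ElementTree as ET

# A document with no encoding declaration, built from plain ASCII characters.
doc_without_encoding = (
    '<?xml version="1.0"?>'
    "<DOCUMENT><GREETING>Hello From XML</GREETING></DOCUMENT>"
)

# Pure ASCII text yields the same bytes whether encoded as ASCII or as UTF-8.
ascii_bytes = doc_without_encoding.encode("ascii")
utf8_bytes = doc_without_encoding.encode("utf-8")
same_bytes = ascii_bytes == utf8_bytes
print(same_bytes)

# The parser reads the ASCII bytes as UTF-8 without complaint.
greeting = ET.fromstring(ascii_bytes).find("GREETING").text
print(greeting)

# Non-ASCII content also parses with no declaration, as long as the bytes are UTF-8.
pi_bytes = "<SYMBOL>\u03c0</SYMBOL>".encode("utf-8")
symbol = ET.fromstring(pi_bytes).text
print(symbol == "\u03c0")
```

Passing bytes rather than a string matters here: it leaves the decoding decision to the parser, which is the situation the default applies to.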
Actually, not even 2-byte Unicode has enough space for all symbols in common use, so a new specification, the Universal Character System (UCS, more formally the Universal Character Set, also called ISO 10646), uses 4 bytes per symbol. This gives it a range of two billion symbols, far more than needed. You can specify that you want to use pure Unicode encoding in your XML documents by using the UCS-2 encoding (also called ISO-10646-UCS-2), which is compressed 2-byte UCS. You can also use UTF-16, a special encoding that represents UCS symbols in 2-byte units: for the first 65,536 codes the result corresponds to UCS-2, and symbols beyond that range are written as a pair of 2-byte units. Straight UCS encoding is referred to as UCS-4 (also called ISO-10646-UCS-4).
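The size differences between these encodings can be checked with a short Python sketch. Python has no codec named UCS-2 or UCS-4, so this example assumes utf-16-be and utf-32-be as stand-ins (the "-be" variants are used so that no byte-order mark is added to the count). The encoded_size helper and the sample strings are invented for the illustration.

```python
def encoded_size(text: str, encoding: str) -> int:
    """Return how many bytes text occupies in the given encoding (hypothetical helper)."""
    return len(text.encode(encoding))


samples = {
    "ASCII text": "Hello",
    "Greek pi (code 0x3C0)": "\u03c0",
    "Han ideograph beyond 65,535 (code 0x20000)": "\U00020000",
}

for label, text in samples.items():
    sizes = {
        enc: encoded_size(text, enc)
        for enc in ("utf-8", "utf-16-be", "utf-32-be")
    }
    print(label, sizes)
```

For the five ASCII characters the sizes come out as 5, 10, and 20 bytes, while the ideograph above 65,535 needs 4 bytes in all three encodings, including a pair of 2-byte units in UTF-16.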
You can write documents in a local character set and use a translation utility to translate them to Unicode, or you can insert the actual Unicode codes directly into your documents. For example, the Unicode code for π is 0x3C0 in hexadecimal, so you can insert π into your document with the character reference &#x3C0; (more on entities in the next chapter).
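As a quick check of how such a numeric reference behaves, the following Python sketch parses a small fragment containing &#x3C0; and its decimal equivalent &#960;. The MESSAGE fragment is made up for this example, and xml.etree.ElementTree is assumed as the parser.

```python
import xml.etree.ElementTree as ET

# The source text is pure ASCII; the parser expands each reference to one character.
fragment = "<MESSAGE>Hex: &#x3C0; Decimal: &#960;</MESSAGE>"
message = ET.fromstring(fragment).text

print(message.count("\u03c0"))   # how many pi characters the parser produced
print(hex(ord("\u03c0")))        # the code behind the character
```

Both forms name the same code, since 0x3C0 in hexadecimal is 960 in decimal, so the parsed text contains the character twice and no trace of the reference syntax.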
More character sets are available than those mentioned here; for a longer list, take a look at the list posted by the Internet Assigned Numbers Authority (IANA), at http://www.iana.org/assignments/character-sets.
Converting ASCII to Unicode
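If you want to convert ASCII files to straight Unicode, you can use the native2ascii program that comes with Sun Microsystems' Java Software Development Kit. On its own, native2ascii writes plain ASCII and represents any non-ASCII characters as \uXXXX Unicode escapes; to write the file in a Unicode encoding, use its -reverse and -encoding options, like this: native2ascii -reverse -encoding UTF-16 file.txt file.uni. You can also convert to a number of other encodings besides 2-byte Unicode, such as UTF-8, by naming them with -encoding.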
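If the Java tool is not at hand, the same kind of conversion can be sketched in a few lines of Python. This is an alternative to native2ascii, not a reimplementation of it; the convert_encoding function is hypothetical, and the file names simply reuse those from the command above.

```python
import tempfile
from pathlib import Path


def convert_encoding(src, dst, src_encoding="ascii", dst_encoding="utf-16"):
    """Re-encode the file at src into dst (hypothetical helper).

    Working on raw bytes avoids any platform-specific newline translation.
    """
    text = Path(src).read_bytes().decode(src_encoding)
    Path(dst).write_bytes(text.encode(dst_encoding))


# Demonstrate on a throwaway ASCII file inside a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "file.txt"
    dst = Path(tmp) / "file.uni"
    src.write_bytes(b"Hello From XML")

    convert_encoding(src, dst)
    print(src.stat().st_size, dst.stat().st_size)

    # Converting to UTF-8 instead leaves an ASCII file byte-for-byte unchanged.
    convert_encoding(src, dst, dst_encoding="utf-8")
    print(dst.read_bytes() == src.read_bytes())
```

Python's utf-16 codec writes a 2-byte byte-order mark followed by 2 bytes per character, so the 14-byte ASCII file grows to 30 bytes.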