author | aiju <aiju@phicode.de> | 2011-07-18 11:01:22 +0200 |
---|---|---|
committer | aiju <aiju@phicode.de> | 2011-07-18 11:01:22 +0200 |
commit | 8c4c1f39f4e369d7c590c9d119f1150a2215e56d (patch) | |
tree | cd430740860183fc01de1bc1ddb216ceff1f7173 /sys/doc/utf.ms | |
parent | 11bf57fb2ceb999e314cfbe27a4e123bf846d2c8 (diff) |
added /sys/doc
Diffstat (limited to 'sys/doc/utf.ms')
-rw-r--r-- | sys/doc/utf.ms | 1248 |
1 files changed, 1248 insertions, 0 deletions
diff --git a/sys/doc/utf.ms b/sys/doc/utf.ms new file mode 100644 index 000000000..7f822d689 --- /dev/null +++ b/sys/doc/utf.ms @@ -0,0 +1,1248 @@ +.HTML "Hello World or Καλημέρα κόσμε or こんにちは 世界 +.TL +Hello World +.br +or +.br +.ft R +Καλημέρα κόσμε +.ft +.br +or +.br +\f(Jpこんにちは 世界\fP +.AU +Rob Pike +Ken Thompson +.sp +rob,ken@plan9.bell-labs.com +.AB +.FS +Originally appeared, in a slightly different form, in +.I +Proc. of the Winter 1993 USENIX Conf., +.R +pp. 43-50, +San Diego +.FE +Plan 9 from Bell Labs has recently been converted from ASCII +to an ASCII-compatible variant of the Unicode Standard, a 16-bit character set. +In this paper we explain the reasons for the change, +describe the character set and representation we chose, +and present the programming models and software changes +that support the new text format. +Although we stopped short of full internationalization\(emfor +example, system error messages are in Unixese, not Japanese\(emwe +believe Plan 9 is the first system to treat the representation +of all major languages on a uniform, equal footing throughout all its +software. +.AE +.SH +Introduction +.PP +The world is multilingual but most computer systems +are based on English and ASCII. +The first release of Plan 9 [Pike90], a new distributed operating +system from Bell Laboratories, seemed a good occasion +to correct this chauvinism. +It is easier to make such deep changes when building new systems than +by refitting old ones. +.PP +The ANSI C standard [ANSIC] contains some guidance on the matter of +`wide' and `multi-byte' characters but falls far short of +solving the myriad associated problems. +We could find no literature on how to convert a +.I system +to larger character sets, although some individual +.I programs +had been converted. 
+This paper reports what we discovered as we +explored the problem of representing multilingual +text at all levels of an operating system, +from the file system and kernel through +the applications and up to the window system +and display. +.PP +Plan 9 has not been `internationalized': +its manuals are in English, +its error messages are in English, +and it can display text that goes from left to right only. +But before we can address these other problems, +we need to handle, uniformly and comfortably, +the textual representation of all the major written languages. +That subproblem is richer than we had anticipated. +.SH +Standards +.PP +Our first step was to select a standard. +At the time (January 1992), +there were only two viable options: +ISO 10646 [ISO10646] and Unicode [Unicode]. +The documents describing both proposals were still in the draft stage. +.PP +The draft of ISO 10646 was not +very attractive to us. +It defined a sparse set of 32-bit characters, +which would be +hard to implement +and have punitive storage requirements. +Also, the draft attempted to +mollify national interests by allocating +16-bit subspaces to national committees +to partition individually. +The suggested mode of use was to +``flip'' between separate national +standards to implement the international standard. +This did not strike us as a sound basis for a character set. +As well, transmitting 32-bit values in a byte stream, +such as in pipes, would be expensive and hard to implement. +Since the standard does not define a byte order for such +transmission, the byte stream would also have to carry +state to enable the values to be recovered. +.PP +The Unicode Standard is a proposal by a consortium of mostly American +computer companies formed +to protest the technical +failings of ISO 10646. +It defines a uniform 16-bit code based on the +principle of unification: +two characters are the same if they look the +same even though they are from different +languages. 
+This principle, called Han unification, +allows the large Japanese, Chinese, and Korean +character sets to be packed comfortably into a 16-bit representation. +.PP +We chose the Unicode Standard for its technical merits and because its +code space was better defined. +Moreover, +the Unicode Consortium was derailing the +ISO 10646 standard. +(Now, in 1995, +ISO 10646 is a standard +with one 16-bit group defined, +which is almost exactly the Unicode Standard. +As most people expected, the two standards bodies +reached a détente and +ISO 10646 and Unicode represent the same character set.) +.PP +The Unicode Standard defines an adequate character set +but an unreasonable representation. +It states that all characters +are 16 bits wide and are communicated and stored in +16-bit units. +It also reserves a pair of characters +(hexadecimal FFFE and FEFF) to detect byte order +in transmitted text, requiring state in the byte stream. +(The Unicode Consortium was thinking of files, not pipes.) +To adopt this encoding, +we would have had to convert all text going +into and out of Plan 9 between ASCII and Unicode, which cannot be done. +Within a single program, in command of all its input and output, +it is possible to define characters as 16-bit quantities; +in the context of a networked system with +hundreds of applications on diverse machines +by different manufacturers, +it is impossible. +.PP +We needed a way to adapt the Unicode Standard to the tools-and-pipes +model of text processing embodied by the Unix system. +To do that, we +needed an ASCII-compatible textual +representation of Unicode characters for transmission +and storage. +In the draft ISO standard there was an informative +(non-required) +Annex +called UTF +that provided a byte stream encoding +of the 32-bit ISO code. +The encoding uses multibyte sequences composed +from the 190 printable characters of Latin-1 +to represent character values larger +than 159. +.PP +The UTF encoding has several good properties. 
+By far the most important is that +a byte in the ASCII range 0-127 represents +itself in UTF. +Thus UTF is backward compatible with ASCII. +.PP +UTF has other advantages. +It is a byte encoding and is +therefore byte-order independent. +ASCII control characters appear in the byte stream +only as themselves, never as an element of a sequence +encoding another character, +so newline bytes separate lines of UTF text. +Finally, ANSI C's +.CW strcmp +function applied to UTF strings preserves the ordering of Unicode characters. +.PP +To encode and decode UTF is expensive (involving multiplication, +division, and modulo operations) but workable. +UTF's major disadvantage is that the encoding +is not self-synchronizing. +It is in general impossible to find the character +boundaries in a UTF string without reading from +the beginning of the string, although in practice +control characters such as newlines, +tabs, and blanks provide synchronization points. +.PP +In August 1992, +X-Open circulated a proposal for another UTF-like +byte encoding of Unicode characters. +Their major concern was that an embedded character +in a file name +(in particular a slash) +could be part of an escape sequence in UTF and +therefore confuse a traditional file system. +Their proposal would allow all 7-bit ASCII characters +to represent themselves +.I "and only themselves" +in text. +Multibyte sequences would contain only characters +with the high bit set. +We proposed a modification to the new UTF that +would address our synchronization problem. +Our proposal, which was originally known informally as UTF-2 and FSS-UTF, +is now referred to as UTF-8 and has been approved by ISO to become +Annex P to ISO 10646. 
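Editorial sketch, not part of the committed file: the properties claimed for the new encoding (ASCII bytes stand only for themselves, multibyte sequences use only bytes with the high bit set, and character boundaries can be found from any position) can be illustrated in portable C. The bit layout below is the standard UTF-8 layout for values up to 16 bits; the names `Rune16`, `encode`, `isstart`, `charstart`, and `countchars` are hypothetical and are not the Plan 9 library routines.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned short Rune16;	/* hypothetical name for a 16-bit character value */

/* Encode one 16-bit value as 1 to 3 bytes; returns the number of bytes written. */
static int
encode(unsigned char *s, Rune16 r)
{
	assert(s != NULL);
	if(r < 0x80){			/* ASCII: the byte is the character */
		s[0] = r;
		return 1;
	}
	if(r < 0x800){			/* 110xxxxx 10xxxxxx */
		s[0] = 0xC0 | (r >> 6);
		s[1] = 0x80 | (r & 0x3F);
		return 2;
	}
	s[0] = 0xE0 | (r >> 12);	/* 1110xxxx 10xxxxxx 10xxxxxx */
	s[1] = 0x80 | ((r >> 6) & 0x3F);
	s[2] = 0x80 | (r & 0x3F);
	return 3;
}

/* A byte begins a character unless it has the form 10xxxxxx. */
static int
isstart(unsigned char c)
{
	return (c & 0xC0) != 0x80;
}

/* Self-synchronization: back up from any byte p to the start of its character. */
static const unsigned char*
charstart(const unsigned char *base, const unsigned char *p)
{
	while(p > base && !isstart(*p))
		p--;
	return p;
}

/* Count the characters in n bytes of valid input by counting start bytes. */
static int
countchars(const unsigned char *s, int n)
{
	int i, c;

	c = 0;
	for(i = 0; i < n; i++)
		if(isstart(s[i]))
			c++;
	return c;
}
```

Under this layout no byte of a multibyte sequence is below 0x80, so a slash or newline byte in the stream can only be a real slash or newline, which is the file-name concern described above.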
+.PP +The model for text in Plan 9 is chosen from these +three standards*: +.FS +* ``That's the nice thing about standards\(emthere's so many to choose from.'' \- Andy Tannenbaum (no, the other one) +.FE +the Unicode character set encoded as a byte stream by +UTF-8, from +(soon to be) Annex P of ISO 10646. +Although this mixture may seem like a precarious position for us to adopt, +it is not as bad as it sounds. +ISO 10646 and the Unicode Standard have converged, +other systems such as Linux have adopted the same character set and encoding, +and the general feeling seems to be that Unicode and UTF-8 will be accepted as the way +to exchange text between systems. +The prognosis for wide acceptance is good. +.PP +There are a couple of aspects of the Unicode Standard we have not faced. +One is the issue of right-to-left text such as Hebrew or Arabic. +Since that is an issue of display, not representation, we believe +we can defer that problem for the moment without affecting our +ability to solve it later. +Another issue is diacriticals and `combining characters', +which cause overstriking of multiple Unicode characters. +Although necessary for some scripts, such as Thai, Arabic, and Hebrew, +such characters confuse the issues for Latin languages because they +generate multiple representations for accented characters. +ISO 10646 describes three levels of implementation; +in Plan 9 we decided not to address the issue. +Again, this can be labeled as a display issue and its finer points are still being debated, +so we felt comfortable deferring. Mañana. +.PP +Although we converted Plan 9 in the altruistic interests of +serving foreign languages, we have found the large character +set attractive for other reasons. The Unicode Standard includes many +characters\(emmathematical symbols, scientific notation, +more general punctuation, and more\(emthat we now use +daily in our work. 
We no longer test our imaginations +to find ways to include non-ASCII symbols in our text; +why type +.CW :-) +when you can use the character ☺? +Most compelling is the ability to absorb documents +and data that contain non-ASCII characters; our browser for the +Oxford English Dictionary +lets us see the dictionary as it really is, with pronunciation +in the IPA font, foreign phrases properly rendered, and so on, +.I "in plain text. +.PP +In the rest of this paper, except when +stated otherwise, the term `UTF' refers to the UTF-8 encoding +of Unicode characters as adopted by Plan 9. +.SH +C Compiler +.PP +The first program to be converted to UTF +was the C Compiler. +There are two levels of conversion. +On the syntactic level, +input to the C compiler +is UTF; on the semantic level, +the C language needs to define +how compiled programs manipulate +the UTF set. +.PP +The syntactic part is simple. +The ANSI C language standard defines the +source character set to be ASCII. +Since UTF is backward compatible with ASCII, +the compiler needs little change. +The only places where a larger character set +is allowed are in character constants, strings, and comments. +Since 7-bit ASCII characters can represent only +themselves in UTF, +the compiler does not have to be careful while looking +for the termination of a string or comment. +.PP +The Plan 9 compiler extends ANSI C to treat any Unicode +character with a value outside of the ASCII range as +an alphabetic. +To a Greek programmer or an English mathematician, +α is a sensible and now valid variable name. +.PP +On the semantic level, ANSI C allows, +but does not tie down, +the notion of a +.I "wide character +and admits string and character constants +of this type. +We chose the wide character type to be +.CW unsigned +.CW short . +In the libraries, the word +.CW Rune +is defined by a +.CW typedef +to be equivalent to +.CW unsigned +.CW short +and is +used to signify a Unicode character. 
+.PP +There are surprises; for example: +.P1 +L'x' \f1is 120\fP +\&'x' \f1is 120\fP +L'ÿ' \f1is 255\fP +\&'ÿ' \f1is -1, stdio \fPEOF\f1 (if \fPchar\f1 is signed)\fP +L'\f1α\fP' \f1is 945\fP +\&'\f1α\fP' \f1is illegal\fP +.P2 +In the string constants, +.P1 +"\f(Jpこんにちは 世界\fP" +L"\f(Jpこんにちは 世界\fP", +.P2 +the former is an array of +.CW chars +with 22 elements +and a null byte, +while the latter is an array of +.CW unsigned +.CW shorts +.CW Runes ) ( +with 8 elements and a null +.CW Rune . +.PP +The Plan 9 library provides an output conversion function, +.CW print +(analogous to +.CW printf ), +with formats +.CW %c , +.CW %C , +.CW %s , +and +.CW %S . +Since +.CW print +produces text, its output is always UTF. +The character conversion +.CW %c +(lower case) masks its argument +to 8 bits before converting to UTF. +Thus +.CW L'ÿ' +and +.CW 'ÿ' +printed under +.CW %c +will be identical, +but +.CW L'\f1α\fP' +will print as the Unicode +character with decimal value 177. +The character conversion +.CW %C +(upper case) masks its argument +to 16 bits before converting to UTF. +Thus +.CW L'ÿ' +and +.CW L'\f1α\fP' +will print correctly under +.CW %C , +but +.CW 'ÿ' +will not. +The conversion +.CW %s +(lower case) +expects a pointer to +.CW char +and copies UTF sequences up to a null byte. +The conversion +.CW %S +(upper case) expects a pointer to +.CW Rune +and +performs sequential +.CW %C +conversions until a null +.CW Rune +is encountered. +.PP +Another problem in format conversion +is the definition of +.CW %10s : +does the number refer to bytes or characters? +We decided that such formats were most +often used to align output columns and +so made the number count characters. +Some programs, however, use the count +to place blank-padded strings +in fixed-sized arrays. +These programs must be found and corrected. 
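Editorial sketch, not part of the committed file: the difference between counting bytes and counting characters in a field width can be shown in portable C. This assumes valid UTF-8 input; `nchars` and `padchars` are hypothetical helpers, not the Plan 9 `print` implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Count characters in valid UTF-8 by skipping continuation bytes (10xxxxxx). */
static size_t
nchars(const char *s)
{
	size_t n;

	assert(s != NULL);
	n = 0;
	for(; *s != '\0'; s++)
		if(((unsigned char)*s & 0xC0) != 0x80)
			n++;
	return n;
}

/*
 * Right-align s in a field `width` characters wide (characters, not bytes).
 * Returns the number of bytes written, excluding the terminating NUL,
 * or -1 if dst is too small to hold the padded string.
 */
static int
padchars(char *dst, size_t dstsize, const char *s, size_t width)
{
	size_t len, n, pad;

	len = strlen(s);		/* bytes */
	n = nchars(s);			/* characters */
	pad = n < width ? width - n : 0;
	if(pad + len + 1 > dstsize)
		return -1;
	memset(dst, ' ', pad);
	memcpy(dst + pad, s, len + 1);	/* copies the NUL as well */
	return (int)(pad + len);
}
```

A two-byte character padded to four columns occupies five bytes, so a program that sizes a fixed array from the field width alone would overflow; that is the class of program the paragraph above says must be found and corrected.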
+.PP +Here is a complete example: +.P1 +#include <u.h> + +char c[] = "\f(Jpこんにちは 世界\fP"; +Rune s[] = L"\f(Jpこんにちは 世界\fP"; + +main(void) +{ + print("%d, %d\en", sizeof(c), sizeof(s)); + print("%s\en", c); + print("%S\en", s); +} +.P2 +.PP +This program prints +.CW 23, +.CW 18 +and then two identical lines of +UTF text. +In practice, +.CW %S +and +.CW L"..." +are rare in programs; one reason is +that most formatted I/O is done in unconverted UTF. +.SH +Ramifications +.PP +All programs in Plan 9 now read and write text as UTF, not ASCII. +This change breaks two deep-rooted symmetries implicit in most C programs: +.IP 1. +A character is no longer a +.CW char . +.IP 2. +The internal representation (Rune) of a character now differs from its +external representation (UTF). +.PP +In the sections that follow, +we show how these issues were faced in the layers of +system software from the operating system up to the applications. +The effects are wide-reaching and often surprising. +.SH +Operating system +.PP +Since UTF is the only format for text in Plan 9, +the interface to the operating system had to be converted to UTF. +Text strings cross the interface in several places: +command arguments, +file names, +user names (people can log in using their native name), +error messages, +and miscellaneous minor places such as commands to the I/O system. +Little change was required: null-terminated UTF strings +are equivalent to null-terminated ASCII strings for most purposes +of the operating system. +The library routines described in the next section made that +change straightforward. +.PP +The window system, once called +.CW 8.5 , +is now rightfully called +.CW 8½ . 
+.SH +Libraries +.PP +A header file included by all programs (see [Pike92]) declares +the +.CW Rune +type to hold 16-bit character values: +.P1 +typedef unsigned short Rune; +.P2 +Also defined are several constants relevant to UTF: +.P1 +enum +{ + UTFmax = 3, /* maximum bytes per rune */ + Runesync = 0x80, /* can't appear in UTF sequence (<) */ + Runeself = 0x80, /* rune==UTF sequence (<) */ + Runeerror = 0x80, /* decoding error in UTF */ +}; +.P2 +(With the original UTF, +.CW Runesync +was hexadecimal 21 and +.CW Runeself +was A0.) +.CW UTFmax +bytes are sufficient +to hold the UTF encoding of any Unicode character. +Characters of value less than +.CW Runesync +only appear in a UTF string as +themselves, never as part of a sequence encoding another character. +Characters of value less than +.CW Runeself +encode into single bytes +of the same value. +Finally, when the library detects errors in UTF input\(embyte sequences +that are not valid UTF sequences\(emit converts the first byte of the +error sequence to the character +.CW Runeerror . +There is little a rune-oriented program can do when given bad data +except exit, which is unreasonable, or carry on. +Originally the conversion routines, described below, +returned errors when given invalid UTF, +but we found ourselves repeatedly checking for errors and ignoring them. +We therefore decided to convert a bad sequence to a valid rune +and continue processing. +(The ANSI C routines, on the other hand, return errors.) +.PP +This technique does have the unfortunate property that converting +invalid UTF byte strings in and out of runes does not preserve the input, +but this circumstance only occurs when non-textual input is +given to a textual program. +The Unicode Standard defines an error character, value FFFD, to stand for +characters from other sets that it does not represent. 
+The +.CW Runeerror +character is a different concept, related to the encoding rather than the character set, so we +chose a different character for it. +.PP +The Plan 9 C library contains a number of routines for +manipulating runes. +The first set converts between runes and UTF strings: +.P1 +extern int runetochar(char*, Rune*); +extern int chartorune(Rune*, char*); +extern int runelen(long); +extern int fullrune(char*, int); +.P2 +.CW Runetochar +translates a single +.CW Rune +to a UTF sequence and returns the number of bytes produced. +.CW Chartorune +goes the other way, reporting how many bytes were consumed. +.CW Runelen +returns the number of bytes in the UTF encoding of a rune. +.CW Fullrune +examines a UTF string up to a specified number of bytes +and reports whether the string begins with a complete UTF encoding. +All these routines use the +.CW Runeerror +character to work around encoding problems. +.PP +There is also a set of routines for examining null-terminated UTF strings, +based on the model of the ANSI standard +.CW str +routines, but with +.CW utf +substituted for +.CW str +and +.CW rune +for +.CW chr : +.P1 +extern int utflen(char*); +extern char* utfrune(char*, long); +extern char* utfrrune(char*, long); +extern char* utfutf(char*, char*); +.P2 +.CW Utflen +returns the number of runes in a UTF string; +.CW utfrune +returns a pointer to the first occurrence of a rune in a UTF string; +and +.CW utfrrune +a pointer to the last. +.CW Utfutf +searches for the first occurrence of a UTF string in another UTF string. +Given the synchronizing property of UTF-8, +.CW utfutf +is the same as +.CW strstr +if the arguments point to valid UTF strings. +.PP +It is a mistake to use +.CW strchr +or +.CW strrchr +unless searching for a 7-bit ASCII character, that is, a character +less than +.CW Runeself . +.PP +We have no routines for manipulating null-terminated arrays of +.CW Runes . 
+Although they should probably exist for completeness, we have +found no need for them, for the same reason that +.CW %S +and +.CW L"..." +are rarely used. +.PP +Most Plan 9 programs use a new buffered I/O library, BIO, in place of +Standard I/O. +BIO contains routines to read and write UTF streams, converting to and from +runes. +.CW Bgetrune +returns, as a +.CW Rune +within a +.CW long , +the next character in the UTF input stream; +.CW Bputrune +takes a rune and writes its UTF representation. +.CW Bungetrune +puts a rune back into the input stream for rereading. +.PP +Plan 9 programs use a simple set of macros to process command line arguments. +Converting these macros to UTF automatically updated the +argument processing of most programs. +In general, +argument flag names can no longer be held in bytes and +arrays of 256 bytes cannot be used to hold a set of flags. +.PP +We have done nothing analogous to ANSI C's locales, partly because +we do not feel qualified to define locales and partly because we remain +unconvinced of that model for dealing with the problems. +That is really more an issue of internationalization than conversion +to a larger character set; on the other hand, +because we have chosen a single character set that encompasses +most languages, some of the need for +locales is eliminated. +(We have a utility, +.CW tcs , +that translates between UTF and other character sets.) +.PP +There are several reasons why our library does not follow the ANSI design +for wide and multi-byte characters. +The ANSI model was designed by a committee, untried, almost +as an afterthought, whereas +we wanted to design as we built. +(We made several major changes to the interface +as we became familiar with the problems involved.) +We disagree with ANSI C's handling of invalid multi-byte sequences. +Also, the ANSI C library is incomplete: +although it contains some crucial routines for handling +wide and multi-byte characters, there are some serious omissions. 
+For example, our software can exploit +the fact that UTF preserves ASCII characters in the byte stream. +We could remove that assumption by replacing all +calls to +.CW strchr +with +.CW utfrune +and so on. +(Because of the weaker properties of the original UTF, +we have actually done so.) +ANSI C cannot: +the standard says nothing about the representation, so portable code should +.I never +call +.CW strchr , +yet there is no ANSI equivalent to +.CW utfrune . +ANSI C simultaneously invalidates +.CW strchr +and offers no replacement. +.PP +Finally, ANSI did nothing to integrate wide characters +into the I/O system: it gives no method for printing +wide characters. +We therefore needed to invent some things and decided to invent +everything. +In the end, some of our entry points do correspond closely to +ANSI routines\(emfor example +.CW chartorune +and +.CW runetochar +are similar to +.CW mbtowc +and +.CW wctomb \(embut +Plan 9's library defines more functionality, enough +to write real applications comfortably. +.SH +Converting the tools +.PP +The source for our tools and applications had already been converted to +work with Latin-1, so it was `8-bit safe', but the conversion to the Unicode +Standard and UTF is more involved. +Some programs needed no change at all: +.CW cat , +for instance, +interprets its argument strings, delivered in UTF, +as file names that it passes uninterpreted to the +.CW open +system call, +and then just copies bytes from its input to its output; +it never makes decisions based on the values of the bytes. +(Plan 9 +.CW cat +has no options such as +.CW -v +to complicate matters.) +Most programs, however, needed modest change. +.PP +It is difficult to +find automatically the places that need attention, +but +.CW grep +helps. +Software that uses the libraries conscientiously can be searched +for calls to library routines that examine bytes as characters: +.CW strchr , +.CW strrchr , +.CW strstr , +etc. 
+Replacing these by calls to +.CW utfrune , +.CW utfrrune , +and +.CW utfutf +is enough to fix many programs. +Few tools actually need to operate on runes internally; +more typically they need only to look for the final slash in a file +name and similar trivial tasks. +Of the 170 C source programs in the top levels of +.CW /sys/src/cmd , +only 23 now contain the word +.CW Rune . +.PP +The programs that +.I do +store runes internally +are mostly those whose +.I raison +.I d'être +is character manipulation: +.CW sam +(the text editor), +.CW sed , +.CW sort , +.CW tr , +.CW troff , +.CW 8½ +(the window system and terminal emulator), +and so on. +To decide whether to compute using runes +or UTF-encoded byte strings requires balancing the cost of converting +the data when read and written +against the cost of converting relevant text on demand. +For programs such as editors that run a long time with a relatively +constant dataset, runes are the better choice. +There are space considerations too, but they are more complicated: +plain ASCII text grows when converted to runes; UTF-encoded Japanese +shrinks. +.PP +Again, it is hard to automate the conversion of a program from +.CW chars +to +.CW Runes . +It is not enough just to change the type of variables; the assumption +that bytes and characters are equivalent can be insidious. +For instance, to clear a character array by +.P1 +memset(buf, 0, BUFSIZE) +.P2 +becomes wrong if +.CW buf +is changed from an array of +.CW chars +to an array of +.CW Runes . +Any program that indexes tables based on character values needs +rethinking. +Consider +.CW tr , +which originally used multiple 256-byte arrays for the mapping. +The naïve conversion would yield multiple 65536-rune arrays. +Instead Plan 9 +.CW tr +saves space by building in effect +a run-encoded version of the map. +.PP +.CW Sort +has related problems. 
+The cooperation of UTF and +.CW strcmp +means that a simple sort\(emone with no options\(emcan be done +on the original UTF strings using +.CW strcmp . +With sorting options enabled, however, +.CW sort +may need to convert its input to runes: for example, +option +.CW -t\f1α\fP +requires searching for alphas in the input text to +crack the input into fields. +The field specifier +.CW +3.2 +refers to 2 runes beyond the third field. +Some of the other options are hopelessly provincial: +consider the case-folding and dictionary order options +(Japanese doesn't even have an official dictionary order) or +.CW -M +which compares by case-insensitive English month name. +Handling these options involves the +larger issues of internationalization and is beyond the scope +of this paper and our expertise. +Plan 9 +.CW sort +works sensibly with options that make sense relative to the input. +The simple and most important options are, however, usually meaningful. +In particular, +.CW sort +sorts UTF into the same order that +.CW look +expects. +.PP +Regular expression-matching algorithms need rethinking to +be applied to UTF text. +Deterministic automata are usually applied to bytes; +converting them to operate on variable-sized byte sequences is awkward. +On the other hand, converting the input stream to runes adds measurable +expense +and the state tables expand +from size 256 to 65536; it can be expensive just to generate them. +For simple string searching, +the Boyer-Moore algorithm works with UTF provided the input is +guaranteed to be only valid UTF strings; however, it does not work +with the old UTF encoding. +At a more mundane level, even character classes are harder: +the usual bit-vector representation within a non-deterministic automaton +is unwieldy with 65536 characters in the alphabet. +.PP +We compromised. 
+An existing library for compiling and executing regular expressions +was adapted to work on runes, with two entry points for searching +in arrays of runes and arrays of chars (the pattern is always UTF text). +Character classes are represented internally as runs of runes; +the reserved value +.CW FFFF +marks the end of the class. +Then +.I all +utilities that use regular expressions\(emeditors, +.CW grep , +.CW awk , +etc.\(emexcept the shell, whose notation +was grandfathered, were converted to use the library. +For some programs, there was a concomitant loss of performance, +but there was also a strong advantage. +To our knowledge, Plan 9 is the only Unix-like system +that has a single definition and implementation of +regular expressions; patterns are written and interpreted +identically by all the programs in the system. +.PP +A handful of programs have the notion of character built into them +so strongly as to confuse the issue of what they should do with UTF input. +Such programs were treated as individual special cases. +For example, +.CW wc +is, by default, unchanged in behavior and output; a new option, +.CW -r , +counts the number of correctly encoded runes\(emvalid UTF sequences\(emin +its input; +.CW -b +the number of invalid sequences. +.PP +It took us several months to convert all the software in the system +to the Unicode Standard and the old UTF. +When we decided to convert from that to the new UTF, +only three things needed to be done. +First, we rewrote the library routines to encode and decode the +new UTF. This took an evening. +Next, we converted all the files containing UTF +to the new encoding. +We wrote a trivial program to look for non-ASCII bytes in +text files and used a Plan 9 program called +.CW tcs +(translate character set) to change encodings. +Finally, we recompiled all the system software; +the library interface was unchanged, so recompilation was sufficient +to effect the transformation. 
+The second two steps were done concurrently and took an afternoon. +We concluded that the actual encoding is relatively unimportant to the +software; the adoption of large characters and a byte-stream encoding +.I per +.I se +are much deeper issues. +.SH +Graphics and fonts +.PP +Plan 9 provides only minimal support for plain text terminals. +It is instead designed to be used with all character input and +output mediated by a window system such as +.CW 8½ . +The window system and related software are responsible for the +display of UTF text as Unicode character images. +For plain text, the window system must provide a user-settable +.I font +that provides a (possibly empty) picture for each Unicode character. +Fancier applications that use bold and Italic characters +need multiple fonts storing multiple pictures for each +Unicode value. +All the issues are apparent, though, +in just the problem of +displaying a single image for each character, that is, the +Unicode equivalent of a plain text terminal. +With 128 or even 256 characters, a font can be just +an array of bitmaps. With 65536 characters, +a more sophisticated design is necessary. To store the ideographs +for just Japanese as 16×16×1 bit images, +the smallest they can reasonably be, takes over a quarter of a +megabyte. Make the images a little larger, store more bits per +pixel, and hold a copy in every running application, and the +memory cost becomes unreasonable. +.PP +The structure of the bitmap graphics services is described at length elsewhere +[Pike91]. +In summary, the memory holding the bitmaps is stored in the same machine that has +the display, mouse, and keyboard: the terminal in Plan 9 terminology, +the workstation in others'. +Access to that memory and associated services is provided +by device files served by system +software on the terminal. 
One of those files, +.CW /dev/bitblt , +interprets messages written upon it as requests for actions +corresponding to entry points in the graphics library: +allocate a bitmap, execute a raster operation, draw a text string, etc. +The window system +acts as a multiplexer that mediates access to the services +and resources of the terminal by simulating in each client window +a set of files mirroring those provided by the system. +That is, each window has a distinct +.CW /dev/mouse , +.CW /dev/bitblt , +and so on through which applications drive graphical +input and output. +.PP +One of the resources managed by +.CW 8½ +and the terminal is the set of active +.I subfonts. +Each subfont holds the +bitmaps and associated data structures for a sequential set of Unicode +characters. +Subfonts are stored in files and loaded into the terminal by +.CW 8½ +or an application. +For example, one subfont +might hold the images of the first 256 characters of the Unicode space, +corresponding to the Latin-1 character set; +another might hold the standard phonetic character set, Unicode characters +with value 0250 to 02E9. +These files are collected in directories corresponding to typefaces: +.CW /lib/font/bit/pelm +contains the Pellucida Monospace character set, with subfonts holding +the Latin-1, Greek, Cyrillic and other components of the typeface. +A suffix on subfont files encodes (in a subfont-specific +way) the size of the images: +.CW /lib/font/bit/pelm/latin1.9 +contains the Latin-1 Pellucida Monospace characters with lower +case letters 9 pixels high; +.CW /lib/font/bit/jis/jis5400.16 +contains 16-pixel high +ideographs starting at Unicode value 5400. +.PP +The subfonts do not identify which portion of the Unicode space +they cover. Instead, a +font file, in plain text, +describes how to assemble subfonts into a complete +character set. 
+The font file is presented as an argument to the window system +to determine how plain text is displayed in text windows and +applications. +Here is the beginning of the font file +.CW /lib/font/bit/pelm/jis.9.font , +which describes the layout of a font covering that portion of +the Unicode Standard for which we have characters of typical +display size, using Japanese characters +to cover the Han space: +.P1 +18 14 +0x0000 0x00FF latin1.9 +0x0100 0x017E latineur.9 +0x0250 0x02E9 ipa.9 +0x0386 0x03F5 greek.9 +0x0400 0x0475 cyrillic.9 +0x2000 0x2044 ../misc/genpunc.9 +0x2070 0x208E supsub.9 +0x20A0 0x20AA currency.9 +0x2100 0x2138 ../misc/letterlike.9 +0x2190 0x21EA ../misc/arrows +0x2200 0x227F ../misc/math1 +0x2280 0x22F1 ../misc/math2 +0x2300 0x232C ../misc/tech +0x2500 0x257F ../misc/chart +0x2600 0x266F ../misc/ding +.P2 +.P1 +0x3000 0x303f ../jis/jis3000.16 +0x30a1 0x30fe ../jis/katakana.16 +0x3041 0x309e ../jis/hiragana.16 +0x4e00 0x4fff ../jis/jis4e00.16 +0x5000 0x51ff ../jis/jis5000.16 +\&... +.P2 +The first two numbers set the interline spacing of the font (18 +pixels) and the distance from the baseline to the top of the +line (14 pixels). +When characters are displayed, they are placed so as best +to fit within those constraints; characters +too large to fit will be truncated. +The rest of the file associates subfont files +with portions of Unicode space. +The first four such files are in the Pellucida Monospace typeface +and directory; others reside in other directories. The file names +are relative to the font file's own location. +.PP +There are several advantages to this two-level structure. +First, it simultaneously breaks the huge Unicode space into manageable +components and provides a unifying architecture for +assembling fonts from disjoint pieces. +Second, the structure promotes sharing. 
+For example, we have only one set of Japanese +characters but dozens of typefaces for the Latin-1 characters, +and this structure permits us to store only one copy of the +Japanese set but use it with any Roman typeface. +Also, customization is easy. +English-speaking users who don't need Japanese characters +but may want to read an on-line Oxford English Dictionary can +assemble a custom font with the +Latin-1 (or even just ASCII) characters and the International +Phonetic Alphabet (IPA). +Moreover, to do so requires just editing a plain text file, +not using a special font editing tool. +Finally, the structure guides the design of +caching protocols to improve performance and memory usage. +.PP +To load a complete Unicode character set into each application +would consume too +much memory and, particularly on slow terminal lines, would take +unreasonably long. +Instead, Plan 9 assembles a multi-level cache structure for +each font. +An application opens a font file, reads and parses it, +and allocates a data structure. +A message written to +.CW /dev/bitblt +allocates an associated structure held in the terminal, in particular, +a bitmap to act as a cache +for recently used character images. +Other messages copy these images to bitmaps such as the screen +by loading characters from subfonts into the cache on demand and +from there to the destination bitmap. +The protocol to draw characters is in terms of cache indices, +not Unicode character number or UTF sequences. +These details are hidden from the application, which instead +sees only a subroutine to draw a string in a bitmap from a +given font, functions to discover character size information, +and routines to allocate and to free fonts. +.PP +As needed, whole +subfonts are opened by the graphics library, read, and then downloaded +to the terminal. +They are held open by the library in an LRU-replacement list. +Even when the program closes a subfont, it is retained +in the terminal for later use. 
+When the application opens the subfont, it asks the terminal +if it already has a copy to avoid reading it from the file +server if possible. +This level of cache has the property that the bitmaps for, say, +all the Japanese characters are stored only once, in the terminal; +the applications read only size and width information from the terminal +and share the images. +.PP +The sizes of the character and subfont caches held by the +application are adaptive. +A simple algorithm monitors the cache miss rate to enlarge and +shrink the caches as required. +The size of the character cache is limited to 2048 images maximum, +which in practice seems enough even for Japanese text. +For plain ASCII-like text it naturally stays around 128 images. +.PP +This mechanism sounds complicated but is implemented by only about +500 lines in the library and considerably less in each of the +terminal's graphics driver and +.CW 8½ . +It has the advantage that only characters that are +being used are loaded into memory. +It is also efficient: if the characters being drawn +are in the cache the extra overhead is negligible. +It works particularly well for alphabetic character sets, +but also adapts on demand for ideographic sets. +When a user first looks at Japanese text, it takes a few +seconds to read all the font data, but thereafter the +text is drawn almost as fast as regular text (the images +are larger, so draw a little slower). +Also, because the bitmaps are remembered by the terminal, +if a second application then looks at Japanese text +it starts faster than the first. +.PP +We considered +building a `font server' +to cache character images and associated data +for the applications, the window system, and the terminal. +We rejected this design because, although isolating +many of the problems of font management into a separate program, +it didn't simplify the applications. +Moreover, in a distributed system such as Plan 9 it is easy +to have too many special purpose servers. 
+Making the management of the fonts the concern of only +the essential components simplifies the system and makes +bootstrapping less intricate. +.SH +Input +.PP +A completely different problem is how to type Unicode characters +as input to the system. +We selected an unused key on our ASCII keyboards +to serve as a prefix for multi-keystroke +sequences that generate Unicode characters. +For example, the character +.CW ü +is generated by the prefix key +(typically +.CW ALT +or +.CW Compose ) +followed by a double quote and a lower-case +.CW u . +When that character is read by the application, from the file +.CW /dev/cons , +it is of course presented as its UTF encoding. +Such sequences generate characters from an arbitrary set that +includes all of Latin-1 plus a selection of mathematical +and technical characters. +An arbitrary Unicode character may be generated by typing the prefix, +an upper case X, and four hexadecimal digits that identify +the Unicode value. +.PP +These simple mechanisms are adequate for most of our day-to-day needs: +it's easy to remember to type `ALT 1 2' for ½\^ or `ALT accent letter' +for accented Latin letters. +For the occasional unusual character, the cut and paste features of +.CW 8½ +serve well. A program called (perhaps misleadingly) +.CW unicode +takes as argument a hexadecimal value, and prints the UTF representation of that character, +which may then be picked up with the mouse and used as input. +.PP +These methods +are clearly unsatisfactory when working in a non-English language. +In the native country of such a language +the appropriate keyboard is likely to be at hand. +But it's also reasonable\(emespecially now that the system handles Unicode characters\(emto +work in a language foreign to the keyboard. +.PP +For alphabetic languages such as Greek or Russian, it is +straightforward to construct a program that does phonetic substitution, +so that, for example, typing a Latin `a' yields the Greek `α'. 
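A minimal sketch of such a phonetic-substitution filter, offered as an illustration rather than as the program the authors had in mind. The table `LATIN_TO_GREEK` and the functions `substitute` and `filter_stream` are hypothetical names, and the letter pairings are an assumed, partial mapping, not one taken from Plan 9.

```python
# Hypothetical Latin-to-Greek phonetic table. The pairings are illustrative
# assumptions and cover only part of the alphabet.
LATIN_TO_GREEK = {
    "a": "α", "b": "β", "g": "γ", "d": "δ", "e": "ε",
    "k": "κ", "l": "λ", "m": "μ", "n": "ν", "o": "ο",
    "p": "π", "r": "ρ", "s": "σ", "t": "τ",
}


def substitute(text, table=LATIN_TO_GREEK):
    """Replace each mapped Latin letter; pass every other character through."""
    return "".join(table.get(ch, ch) for ch in text)


def filter_stream(src, dst, table=LATIN_TO_GREEK):
    """Copy src to dst one line at a time, applying the substitution.

    src is any iterable of strings (for example sys.stdin) and dst is any
    object with a write method (for example sys.stdout).
    """
    for line in src:
        dst.write(substitute(line, table))
```

Because the filter only maps characters and copies everything else unchanged, it can sit in a pipeline without the programs on either side knowing it is there. A real input method would also need rules this sketch omits, such as choosing the word-final form of sigma.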
+Within Plan 9, such a program can be inserted transparently +between the real keyboard and a program such as the window system, +providing a manageable input device for such languages. +.PP +For ideographic languages such as Chinese or Japanese the problem is harder. +Native users of such languages have adopted methods for dealing with +Latin keyboards that involve a hybrid technique based on phonetics +to generate a list of possible symbols followed by menu selection to +choose the desired one. +Such methods can be +effective, but their design must be rooted in information about +the language unknown to non-native speakers. +.CW Cxterm , ( +a Chinese terminal emulator built by and for +Chinese programmers, +employs such a technique +[Pong and Zhang].) +Although the technical problem of implementing such a device +is easy in Plan 9\(emit is just an elaboration of the technique for +alphabetic languages\(emour lack of familiarity with such languages +has restrained our enthusiasm for building one. +.PP +The input problem is technically the least interesting but perhaps +emotionally the most important of the problems of converting a system +to an international character set. +Beyond that remain the deeper problems of internationalization +such as multi-lingual error messages and command names, +problems we are not qualified to solve. +With the ability to treat text of most languages on an equal +footing, though, we can begin down that path. +Perhaps people in non-English speaking countries will +consider adopting Plan 9, solving the input problem locally\(emperhaps +just by plugging in their local terminals\(emand begin to use +a system with at least the capacity to be international. +.SH +Acknowledgements +.PP +Dennis Ritchie provided consultation and encouragement. +Bob Flandrena converted most of the standard tools to UTF. +Brian Kernighan suffered cheerfully with several +inadequate implementations and converted +.CW troff +to UTF. 
+Rich Drechsler converted his Postscript driver to UTF. +John Hobby built the Postscript ☺. +We thank them all. +.SH +References +.LP +[ANSIC] \f2American National Standard for Information Systems \- +Programming Language C\f1, American National Standards Institute, Inc., +New York, 1990. +.LP +[ISO10646] +ISO/IEC DIS 10646-1:1993 +\f2Information technology \- +Universal Multiple-Octet Coded Character Set (UCS) \(em +Part 1: Architecture and Basic Multilingual Plane\fP. +.LP +[Pike90] R. Pike, D. Presotto, K. Thompson, H. Trickey, +``Plan 9 from Bell Labs'', +UKUUG Proc. of the Summer 1990 Conf., +London, England, +1990. +.LP +[Pike91] R. Pike, ``8½, The Plan 9 Window System'', USENIX Summer +Conf. Proc., Nashville, 1991, reprinted in this volume. +.LP +[Pike92] R. Pike, ``How to Use the Plan 9 C Compiler'', this volume. +.LP +[Pong and Zhang] Man-Chi Pong and Yongguang Zhang, ``cxterm: +A Chinese Terminal Emulator for the X Window System'', +.I +Software\(emPractice and Experience, +.R +Vol 22(10), 809-826, October 1992. +.LP +[Unicode] +\f2The Unicode Standard, +Worldwide Character Encoding, +Version 1.0, Volume 1\f1, +The Unicode Consortium, +Addison Wesley, +New York, +1991.