author cinap_lenrek <cinap_lenrek@felloff.net> 2015-09-24 12:14:08 +0200
committer cinap_lenrek <cinap_lenrek@felloff.net> 2015-09-24 12:14:08 +0200
commit 8003c8b1e2d5d6e2a22ca7e552b53e631db86df4
tree a92aa7ab3c2fea017159e0f080e7a878ce79f2e7 /sys/man
parent bba6d26ca26a60690d50b3fe41a8778abd66cff0
utf(6), rune(2): document 21-bit runes
Diffstat (limited to 'sys/man')
-rw-r--r-- sys/man/2/rune 2
-rw-r--r-- sys/man/6/utf 20
2 files changed, 12 insertions, 10 deletions
diff --git a/sys/man/2/rune b/sys/man/2/rune
index ca290115d..124692797 100644
--- a/sys/man/2/rune
+++ b/sys/man/2/rune
@@ -54,7 +54,7 @@ bytes starting at
and returns the number of bytes copied.
.BR UTFmax ,
defined as
-.B 3
+.B 4
in
.BR <libc.h> ,
is the maximum number of bytes required to represent a rune.
diff --git a/sys/man/6/utf b/sys/man/6/utf
index 92f7c9534..7d15b8185 100644
--- a/sys/man/6/utf
+++ b/sys/man/6/utf
@@ -7,7 +7,7 @@ based on the Unicode Standard and on the ISO multibyte
.SM UTF-8
encoding (Universal Character
Set Transformation Format, 8 bits wide).
-The Unicode Standard represents its characters in 16
+The Unicode Standard represents its characters in 21
bits;
.SM UTF-8
represents such
@@ -19,7 +19,7 @@ is shortened to
.PP
In Plan 9, a
.I rune
-is a 16-bit quantity representing a Unicode character.
+is a 32-bit quantity representing a Unicode character.
Internally, programs may store characters as runes.
However, any external manifestation of textual information,
in files or at the interface between programs, uses a
@@ -65,19 +65,21 @@ a rune x is converted to a multibyte
sequence
as follows:
.PP
-01. x in [00000000.0bbbbbbb] → 0bbbbbbb
+001. x in [00000000.00000000.0bbbbbbb] → 0bbbbbbb
.br
-10. x in [00000bbb.bbbbbbbb] → 110bbbbb, 10bbbbbb
+010. x in [00000000.00000bbb.bbbbbbbb] → 110bbbbb, 10bbbbbb
.br
-11. x in [bbbbbbbb.bbbbbbbb] → 1110bbbb, 10bbbbbb, 10bbbbbb
+011. x in [00000000.bbbbbbbb.bbbbbbbb] → 1110bbbb, 10bbbbbb, 10bbbbbb
+.br
+100. x in [000bbbbb.bbbbbbbb.bbbbbbbb] → 11110bbb, 10bbbbbb, 10bbbbbb, 10bbbbbb
.br
.PP
-Conversion 01 provides a one-byte sequence that spans the
+Conversion 001 provides a one-byte sequence that spans the
.SM ASCII
character set in a compatible way.
-Conversions 10 and 11 represent higher-valued characters
-as sequences of two or three bytes with the high bit set.
-Plan 9 does not support the 4, 5, and 6 byte sequences proposed by X-Open.
+Conversions 010, 011 and 100 represent higher-valued characters
+as sequences of two, three or four bytes with the high bit set.
+Plan 9 does not support the 5 and 6 byte sequences proposed by X-Open.
When there are multiple ways to encode a value, for example rune 0,
the shortest encoding is used.
.PP
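
The four conversions in the updated utf(6) table can be sketched in portable C as follows. This is an illustration only, not the libc implementation: the name rune_to_utf8 is hypothetical, uint32_t stands in for the 32-bit Rune type, and returning 0 for values wider than 21 bits is an assumption made for the sketch rather than documented behaviour.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not runetochar(2)): encode x into buf using the
 * conversions 001, 010, 011 and 100 from utf(6).  buf must have room
 * for 4 bytes, matching UTFmax after this change.  Returns the number
 * of bytes written, or 0 if x does not fit in 21 bits (an assumption
 * of this sketch).
 */
static int
rune_to_utf8(unsigned char *buf, uint32_t x)
{
	if(x <= 0x7F){			/* 001: 0bbbbbbb */
		buf[0] = (unsigned char)x;
		return 1;
	}
	if(x <= 0x7FF){			/* 010: 110bbbbb, 10bbbbbb */
		buf[0] = (unsigned char)(0xC0 | (x >> 6));
		buf[1] = (unsigned char)(0x80 | (x & 0x3F));
		return 2;
	}
	if(x <= 0xFFFF){		/* 011: 1110bbbb, 10bbbbbb, 10bbbbbb */
		buf[0] = (unsigned char)(0xE0 | (x >> 12));
		buf[1] = (unsigned char)(0x80 | ((x >> 6) & 0x3F));
		buf[2] = (unsigned char)(0x80 | (x & 0x3F));
		return 3;
	}
	if(x <= 0x1FFFFF){		/* 100: 11110bbb, then three 10bbbbbb */
		buf[0] = (unsigned char)(0xF0 | (x >> 18));
		buf[1] = (unsigned char)(0x80 | ((x >> 12) & 0x3F));
		buf[2] = (unsigned char)(0x80 | ((x >> 6) & 0x3F));
		buf[3] = (unsigned char)(0x80 | (x & 0x3F));
		return 4;
	}
	return 0;			/* wider than 21 bits: not encoded here */
}
```

Testing the ranges in ascending order means each value gets the shortest sequence that can hold it, which is the rule the man page states for values with multiple possible encodings.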