UTF-8 to UTF-32 Conversion
Dave Mielke
dave at mielke.cc
Sat Apr 17 14:55:24 UTC 2004
[quoted lines by John J. Boyer on 2004/04/17 at 09:13 -0500]
>For one of my projects I need to convert UTF-8 to UTF-32. However, I
>can't find information on which bits are set in the various bytes of a
>multi-byte UTF-8 character.
0X00 through 0X7F are literal, i.e. single-byte characters.
If bit 7 is set and bit 6 is clear, i.e. the range 0X80 through 0XBF, it's a
continuation byte containing six more bits. The first byte of a multi-byte
character is never within this range.
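As a rough sketch of the two rules so far, the top bits of a byte can be tested with a mask. The function name classify and its return strings are made up for illustration; they don't come from any library.

```python
def classify(byte):
    """Classify one UTF-8 byte by its top bits (hypothetical helper)."""
    if byte <= 0x7F:
        # Bit 7 clear: a literal single-byte character.
        return "single"
    if (byte & 0xC0) == 0x80:
        # Bit 7 set, bit 6 clear: a continuation byte carrying six bits.
        return "continuation"
    # Bits 7 and 6 both set: the first byte of a multi-byte character.
    return "lead"
```

For example, classify(0x41) gives "single" and classify(0x80) gives "continuation".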
If bits 7 and 6 are set but bit 5 isn't, i.e. the range 0XC0 through 0XDF, then
it's the first byte of a two-byte character and supplies its first 5 bits.
Together with the six bits of the continuation byte, the resultant value is an
11-bit character in the range 0 through 0X7FF.
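To make the two-byte case concrete, here is a minimal sketch that combines the 5 bits of the first byte with the 6 bits of the continuation byte. The name decode_two_byte is only illustrative, and the check is limited to the bit patterns described above.

```python
def decode_two_byte(first, second):
    """Combine a two-byte UTF-8 sequence into one value (illustrative)."""
    if (first & 0xE0) != 0xC0:
        raise ValueError("first byte is not in the range 0xC0 through 0xDF")
    if (second & 0xC0) != 0x80:
        raise ValueError("second byte is not a continuation byte")
    # Low 5 bits of the first byte, then the low 6 bits of the second.
    return ((first & 0x1F) << 6) | (second & 0x3F)
```

For the bytes 0XC3 0XA9 this gives (0X03 << 6) | 0X29 = 0XE9, which is e with an acute accent.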
Each time the first clear bit is moved one position to the right the length of
the multi-byte character increases by one byte and the number of value bits
carried by the first byte decreases by 1. Every non-leading byte, as mentioned
above, has bit 7 set and bit 6 clear, i.e. is within the range 0X80 through
0XBF, and appends six bits to the value. Here's a table to illustrate:
First  RangeOf    NumOf  Init  Totl  MaxUnicode
0-Bit  FirstByte  Bytes  Bits  Bits  Character
  7    0X00 0X7F    1      7     7   0X0000007F
  5    0XC0 0XDF    2      5    11   0X000007FF
  4    0XE0 0XEF    3      4    16   0X0000FFFF
  3    0XF0 0XF7    4      3    21   0X001FFFFF
  2    0XF8 0XFB    5      2    26   0X03FFFFFF
  1    0XFC 0XFD    6      1    31   0X7FFFFFFF
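
The table can be turned almost directly into a decoder. What follows is only a sketch under a few assumptions: the names LEAD_FORMS, utf8_to_code_points and code_points_to_utf32_le are invented for illustration, and the output is written as little-endian UTF-32 without a byte order mark. It follows the table as given, so it does not reject overlong encodings, surrogate values, or anything above 0X10FFFF, all of which the current UTF-8 definition (RFC 3629) disallows.

```python
import struct

# One entry per table row:
# (mask for the first byte, expected pattern, total bytes, value-bit mask)
LEAD_FORMS = [
    (0x80, 0x00, 1, 0x7F),
    (0xE0, 0xC0, 2, 0x1F),
    (0xF0, 0xE0, 3, 0x0F),
    (0xF8, 0xF0, 4, 0x07),
    (0xFC, 0xF8, 5, 0x03),
    (0xFE, 0xFC, 6, 0x01),
]

def utf8_to_code_points(data):
    """Decode UTF-8 bytes into a list of integer character values."""
    code_points = []
    i = 0
    while i < len(data):
        first = data[i]
        for mask, pattern, length, value_mask in LEAD_FORMS:
            if (first & mask) == pattern:
                break
        else:
            # A continuation byte (0x80-0xBF), or 0xFE/0xFF, cannot start
            # a character.
            raise ValueError("invalid first byte at offset %d" % i)
        tail = data[i + 1:i + length]
        if len(tail) != length - 1:
            raise ValueError("truncated character at offset %d" % i)
        value = first & value_mask
        for byte in tail:
            if (byte & 0xC0) != 0x80:
                raise ValueError("bad continuation byte after offset %d" % i)
            # Each continuation byte appends six more bits.
            value = (value << 6) | (byte & 0x3F)
        code_points.append(value)
        i += length
    return code_points

def code_points_to_utf32_le(code_points):
    """Pack each value as one 32-bit little-endian unit (no BOM)."""
    return struct.pack("<%dI" % len(code_points), *code_points)
```

For example, utf8_to_code_points(b"h\xe2\x82\xac") returns [0x68, 0x20AC], the letter h followed by the euro sign. A real converter would probably want to add the stricter RFC 3629 checks on top of this.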