UTF-8 bit by bit (2001)

栏目: IT技术 · 发布时间: 4年前

内容简介:Richard Suchenwirth 2001-02-28 - From a delightful debugging chat at theTcl chatroom, I was brought to write down what I think on UTF-8 analysis (cf.Unicode and UTF-8, see also).I imagine a UTF-8 string as a railroad. It operates single-unitThe xxxxx bits

Richard Suchenwirth 2001-02-28 - From a delightful debugging chat at theTcl chatroom, I was brought to write down what I think on UTF-8 analysis (cf.Unicode and UTF-8, see also).

I imagine a UTF-8 string as a railroad. It operates single-unit railcars (one-byte ASCII characters, to be known from the fact that the highest bit is 0), and trains (sequences of two or more bytes that together form a character). Each train consists of exactly one locomotive (you see I'm European) and one or more trailers . The locomotive indicates the length of the train, including itself, in the highest bits that form a consecutive row of 1's, and one 0 bit. Examples:

0xxxxxxx : I'm a railcar, just a single unit
 110xxxxx : I'm leading a train of length 2

The xxxxx bits are used for other purposes (locomotives carry some freight too ;-)

Trailers indicate that they are trailers by the initial bit sequence 10. This way, they can't be mistaken for railcars or locomotives. E.g.

10yyyyyy: I'm a trailer

The freight of the train is in the x's and y's. In that concrete case, a C program reported to have received the bytes C3 and A4. Written as binary, that's

Now for clarity we delimit the indicators with parens:

(110)00011 (10)100100

and can just remove them:

00011 100100
 11100100 => E4, the iso8859-1 value for German ä ("a umlaut").

The generalized rule for the indicator of each byte is "those bits from highest(leftmost) down, up to and including the first zero bit".

Now going the other way. In orthodox UTF-8, a NUL byte (\x00) is represented by a NUL byte. Plain enough. But in Tcl we sometimes want NUL bytes inside "binary" strings (e.g. image data), without them terminating it as a real NUL byte does. To represent a NUL byte without any physical NUL bytes, we treat it like a character above ASCII, which must be a minimum two bytes long:

(110)00000 (10)000000 => C0 80

Whoops. Took us a while, but now we can read UTF-8, bit by bit.

andrewsh 2010-03-12 - Please note that 0xc0 0x80 sequence is illegal in the "Real" UTF-8: [], []


以上就是本文的全部内容,希望本文的内容对大家的学习或者工作能带来一定的帮助,也希望大家多多支持 码农网

查看所有标签

猜你喜欢:

本站部分资源来源于网络,本站转载出于传递更多信息之目的,版权归原作者或者来源机构所有,如转载稿涉及版权问题,请联系我们

不是为了快乐

不是为了快乐

宗萨蒋扬钦哲仁波切 / 姚仁喜 / 深圳报业集团出版社 / 2013-1 / 38.00元

前行修持是一套完整的实修系统,它既是一切佛法修持的根基,又囊括了所有修持的精华,以及心灵之道上所需的一切;既适合入门者打造学佛基本功,也是修行人需要终生修持的心法。书中除了实际的方法指导之外,还不断启发佛法的珍贵与修持的必要,并处处可见对学佛者的鼓舞和纠正,其最终的用心,是让我们踏上不间断的修持之路,真正转化我们僵硬、散乱和困惑的心。 在现代人看来,快乐,理应是最值得追求的目标。我们希望生活......一起来看看 《不是为了快乐》 这本书的介绍吧!

CSS 压缩/解压工具
CSS 压缩/解压工具

在线压缩/解压 CSS 代码

html转js在线工具
html转js在线工具

html转js在线工具

RGB HSV 转换
RGB HSV 转换

RGB HSV 互转工具