Write out Unicode in Octal – Lubutu

Category: IT Technology · Published: 5 years ago

Write out Unicode in Octal

This is just a brief note about something that I came to realise recently. I've been working a lot with UTF-8 byte streams, Unicode characters, and all that, and I've come to realise that writing them out in hexadecimal is completely wrong. It's a form of obfuscation.

I'll walk through an example, which should explain very quickly why I feel we should be using octal. Take the Unicode codepoint U+2B776, which has this UTF-8 encoding in hexadecimal:

0x2B776	F0	AB	9D	B6

Obviously, right? This is the same thing in octal:

0533566	360	253	235	266
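To reproduce rows like the two above, here is a minimal Python sketch. The helper name `octal_dump` is my own, not something from the original note; it relies only on Python's built-in UTF-8 encoder and octal formatting.

```python
def octal_dump(codepoint):
    """Return the codepoint and its UTF-8 bytes, all written in octal."""
    encoded = chr(codepoint).encode("utf-8")
    # Codepoint gets a leading 0 as in the article; bytes are padded to 3 digits.
    return "0{:o}".format(codepoint), ["{:03o}".format(b) for b in encoded]


print(octal_dump(0x2B776))  # ('0533566', ['360', '253', '235', '266'])
```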

The first octal digit of a leading byte is 3, of a continuation byte is 2, and of an ASCII byte is 1 or 0. We only really need to worry about the other digits. Now, compare the other digits of each continuation byte with the octal of the codepoint.
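The first-digit rule can be sketched as a tiny classifier. The function name `classify` is hypothetical, and the sketch follows the simplified rule stated above: it does not reject bytes such as 0370-0377, which also start with 3 but never appear in valid UTF-8.

```python
def classify(byte):
    """Classify a byte (0-255) by the first digit of its three-digit octal form."""
    first_digit = byte >> 6  # the top two bits form the first octal digit
    if first_digit == 3:
        return "leading"
    if first_digit == 2:
        return "continuation"
    return "ascii"  # first digit is 0 or 1


print([classify(b) for b in b"\xf0\xab\x9d\xb6"])
# ['leading', 'continuation', 'continuation', 'continuation']
```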

0533566	360	2 53	2 35	2 66

Well, that's pretty straightforward. This is because the lower six bits of a UTF-8 continuation byte contain the whole of the value it contributes to the codepoint, and each octal digit represents three bits, so two octal digits hold this value exactly.
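A quick way to check this is to mask off the low six bits of each continuation byte and print them as two octal digits. This is only an illustrative sketch; `continuation_digits` is a name I made up for it.

```python
def continuation_digits(encoded):
    """Join the last two octal digits of every continuation byte in a sequence."""
    # 0o77 keeps the low six bits, which are exactly two octal digits.
    return "".join("{:02o}".format(b & 0o77) for b in encoded[1:])


encoded = chr(0x2B776).encode("utf-8")
print(continuation_digits(encoded))  # 533566
print("{:o}".format(0x2B776))        # 533566
```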

What about the leading byte? Okay, that's slightly trickier, although still easier than it would be in hexadecimal. As well as contributing a few bits to the codepoint, a leading byte also indicates how many bytes are expected in this codepoint's encoding (the others of which will be continuation bytes).

So here are the rules, given a leading byte 3xx (in octal):

If the middle digit is < 4 (i.e. 0, 1, 2, 3) then we count it, as we would in a continuation byte.
Otherwise (i.e. 4, 5, 6) we count only the last digit, but if the middle digit is 5 (i.e. is odd) then a 1 comes first.
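The two rules above can be written down digit by digit. The sketch below works on the three-character octal string rather than on bits, to mirror how you would read it by eye; `lead_payload` is a hypothetical helper name.

```python
def lead_payload(lead):
    """Octal digits that a leading byte '3xx' contributes to the codepoint."""
    if len(lead) != 3 or lead[0] != "3" or lead[1] not in "0123456":
        raise ValueError("not a UTF-8 leading byte in octal: " + repr(lead))
    middle, last = lead[1], lead[2]
    if middle in "0123":
        return middle + last  # count both digits, as in a continuation byte
    if middle == "5":
        return "1" + last     # odd middle digit: a 1 comes first
    return last               # middle digit 4 or 6: only the last digit counts


for lead in ("302", "342", "352", "360"):
    print(lead, "->", lead_payload(lead))
# 302 -> 02
# 342 -> 2
# 352 -> 12
# 360 -> 0
```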

You can often get away without knowing what the length of the sequence should be, but if you have to know then just look at the middle digit: 0-3 is 2-byte, 4-5 is 3-byte, and 6 is 4-byte.
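As a sketch of the length rule, the middle octal digit can be pulled out with a shift and a mask. The name `sequence_length` is my own, and the function rejects a middle digit of 7, which no valid leading byte has.

```python
def sequence_length(lead):
    """Number of bytes in a UTF-8 sequence, read off the leading byte's middle octal digit."""
    middle = (lead >> 3) & 0o7
    if lead >> 6 != 3 or middle == 7:
        raise ValueError("not a UTF-8 leading byte: {:03o}".format(lead))
    if middle <= 3:
        return 2
    if middle <= 5:
        return 3
    return 4  # middle digit is 6


print(sequence_length(0o302), sequence_length(0o352), sequence_length(0o360))  # 2 3 4
```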

With a little practice these are easy to learn. Much easier than learning the hexadecimal ones. Here are some examples, taken from Wikipedia (where they're written out in hexadecimal, making them a lot harder to read).

044	044
0242	3 02	2 42
020254	34 2	2 02	2 54
0201510	36 0	2 20	2 15	2 10

Edit: By request, here is an extra example of 35x.

0120254	35 2	2 02	2 54
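The example rows can be checked mechanically by gluing the octal digits together exactly as described and comparing against Python's own UTF-8 decoder. This is a sketch that assumes well-formed input written as three-digit octal strings; `decode_octal` is a hypothetical name and does no validation.

```python
def decode_octal(seq):
    """Decode one UTF-8 sequence given as three-digit octal strings, digit by digit."""
    lead, rest = seq[0], seq[1:]
    if lead[0] in "01":
        return int(lead, 8)      # ASCII byte: the byte is the codepoint
    middle, last = lead[1], lead[2]
    if middle in "0123":
        digits = middle + last   # 2-byte lead
    elif middle == "5":
        digits = "1" + last      # 3-byte lead with odd middle digit
    else:
        digits = last            # middle digit 4 or 6
    digits += "".join(b[1:] for b in rest)  # drop the leading 2 of each continuation
    return int(digits, 8)


examples = [
    ["044"],
    ["302", "242"],
    ["342", "202", "254"],
    ["360", "220", "215", "210"],
    ["352", "202", "254"],
]
for seq in examples:
    expected = ord(bytes(int(b, 8) for b in seq).decode("utf-8"))
    print("0{:o}".format(decode_octal(seq)), decode_octal(seq) == expected)
# 044 True
# 0242 True
# 020254 True
# 0201510 True
# 0120254 True
```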

I can now quite happily read UTF-8 sequences as Unicode codepoints in my head (something I had to do a lot of while working on a UTF-8 library), and with a little thought can write out codepoints in UTF-8 too. But only in octal! If I had to do it in hexadecimal then I wouldn't have a clue.
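Going the other way, from codepoint to UTF-8, can also be done purely on octal digits. The following is a rough sketch under my own assumptions: `encode_octal` is a made-up name, and it does not reject surrogates or values above U+10FFFF.

```python
def encode_octal(codepoint):
    """Write a codepoint as UTF-8 bytes in octal, working only on octal digits."""
    digits = "{:o}".format(codepoint)
    if codepoint < 0o200:        # ASCII: the byte is the codepoint
        return [digits.zfill(3)]
    if codepoint < 0o4000:       # 2-byte: lead takes the first two of four digits
        d = digits.zfill(4)
        return ["3" + d[:2], "2" + d[2:]]
    if codepoint < 0o200000:     # 3-byte: a leading 1 turns 34x into 35x
        d = digits.zfill(6)
        lead = ("35" if d[0] == "1" else "34") + d[1]
        return [lead, "2" + d[2:4], "2" + d[4:6]]
    d = digits.zfill(7)          # 4-byte: lead is 36x
    return ["36" + d[0], "2" + d[1:3], "2" + d[3:5], "2" + d[5:7]]


print(encode_octal(0x20AC))   # ['342', '202', '254']
print(encode_octal(0x2B776))  # ['360', '253', '235', '266']
```

Comparing the result against `chr(cp).encode("utf-8")` for a handful of boundary values (U+007F, U+0080, U+07FF, U+0800, U+FFFF, U+10000, U+10FFFF) is a reasonable sanity check for a sketch like this.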

So please, to make things easier on us humans, write out Unicode in octal.
