leetcode393. UTF-8 Validation

栏目: 编程工具 · 发布时间: 6年前

内容简介:检验整数数组能否构成合法的UTF8编码的序列。UTF8的字节编码规则如下:首先我们整理一下,每一种类型的UTF8字符包含什么样的规格:综上所述:

题目要求

A character in UTF8 can be from 1 to 4 bytes long, subjected to the following rules:

For 1-byte character, the first bit is a 0, followed by its unicode code.
For n-bytes character, the first n-bits are all one's, the n+1 bit is 0, followed by n-1 bytes with most significant 2 bits being 10.
This is how the UTF-8 encoding would work:

   Char. number range  |        UTF-8 octet sequence
      (hexadecimal)    |              (binary)
   --------------------+---------------------------------------------
   0000 0000-0000 007F | 0xxxxxxx
   0000 0080-0000 07FF | 110xxxxx 10xxxxxx
   0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
   0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
Given an array of integers representing the data, return whether it is a valid utf-8 encoding.

Note:
The input is an array of integers. Only the least significant 8 bits of each integer is used to store the data. This means each integer represents only 1 byte of data.

Example 1:

data = [197, 130, 1], which represents the octet sequence: 11000101 10000010 00000001.

Return true.
It is a valid utf-8 encoding for a 2-bytes character followed by a 1-byte character.
Example 2:

data = [235, 140, 4], which represented the octet sequence: 11101011 10001100 00000100.

Return false.
The first 3 bits are all one's and the 4th bit is 0 means it is a 3-bytes character.
The next byte is a continuation byte which starts with 10 and that's correct.
But the second continuation byte does not start with 10, so it is invalid.

检验整数数组能否构成合法的UTF8编码的序列。UTF8的字节编码规则如下:

  1. 每个UTF8字符包含1~4个字节
  2. 如果只包含1个字节,则该字节以0作为开头,剩下的位随意
  3. 如果包含两个或两个以上字节,则起始字节以n个1和1个0开头,例如,如果该UTF8字符包含两个字节,则第一个字节以110开头,同理,三个字符的第一个字节以1110开头。剩余的字节必须以10开头。

思路和代码

首先我们整理一下,每一种类型的UTF8字符包含什么样的规格:

  1. 只包含一个字节,该字节格式为0xxxxxxx,则转换为整数的话,该整数必须小于128(1000000)
  2. 包含多个字节,则头字节格式为110xxxxx, 1110xxxx, 11110xxx。而紧跟其后的字符必须格式为10xxxxxx。

综上所述:

  1. num<1000000: 单字节
  2. 10000000=<num<11000000: 多字节字符的跟随字节
  3. 11000000<=num<11100000: 两个字节的起始字节
  4. 11100000<=num<11110000: 三个字节的起始字节
  5. 11110000<=num<11111000: 四个字节的起始字节

下面分别是这题的两种实现:

递归实现:

private static final int ONE_BYTE = 128; //10000000
    private static final int FOLLOW_BYTE = 192; //11000000
    private static final int TWO_BYTE = 224; //11100000
    private static final int THREE_BYTE = 240;//11110000
    private static final int FOUR_BYTE = 248;//11111000
    public boolean validUtf8(int[] data) {
        return validUtf8(data, 0);
    }
    
    public boolean validUtf8(int[] data, int startAt) {
        if(startAt >= data.length) return true;
        int first = data[startAt];
        
        int followLength = 0;
        if(first < ONE_BYTE) {
            return validUtf8(data, startAt+1);
        }else if(first < FOLLOW_BYTE){
            return false;
        }else if(first <TWO_BYTE) {
            followLength = 2;
        }else if(first < THREE_BYTE) {
            followLength = 3;
        }else if(first < FOUR_BYTE) {
            followLength = 4;
        }else {
            return false;
        }
        if(startAt + followLength > data.length) return false; 
        for(int i = 1 ; i<followLength ; i++) {
            int next = data[startAt + i];
            if(next < ONE_BYTE || next >= FOLLOW_BYTE) {
                return false;
            }
        }
        return validUtf8(data, startAt + followLength);
    }

循环实现:

private static final int ONE_BYTE = 128; //10000000
    private static final int FOLLOW_BYTE = 192; //11000000
    private static final int TWO_BYTE = 224; //11100000
    private static final int THREE_BYTE = 240;//11110000
    private static final int FOUR_BYTE = 248;//11111000
    public boolean validUtf8(int[] data) {
        return validUtf8(data, 0);
    }
    
    public boolean validUtf8(int[] data, int startAt) {
        int followCount = 0;
        for(int num : data) {
            if(num < ONE_BYTE) {
                if(followCount != 0) {
                    return false;
                }
            }else if(num < FOLLOW_BYTE) {
                if(followCount == 0) {
                    return false;
                }
                followCount--;
            }else if(num < TWO_BYTE) {
                if(followCount != 0) {
                    return false;
                }
                followCount = 1;
            }else if(num < THREE_BYTE) {
                if(followCount != 0) {
                    return false;
                }
                followCount = 2;
            }else if(num < FOUR_BYTE) {
                if(followCount != 0) {
                    return false;
                }
                followCount = 3;
            }else {
                return false;
            }
        }
        return followCount == 0;
    }

以上所述就是小编给大家介绍的《leetcode393. UTF-8 Validation》,希望对大家有所帮助,如果大家有任何疑问请给我留言,小编会及时回复大家的。在此也非常感谢大家对 码农网 的支持!

查看所有标签

猜你喜欢:

本站部分资源来源于网络,本站转载出于传递更多信息之目的,版权归原作者或者来源机构所有,如转载稿涉及版权问题,请联系我们

Refactoring

Refactoring

Martin Fowler、Kent Beck、John Brant、William Opdyke、Don Roberts / Addison-Wesley Professional / 1999-7-8 / USD 64.99

Refactoring is about improving the design of existing code. It is the process of changing a software system in such a way that it does not alter the external behavior of the code, yet improves its int......一起来看看 《Refactoring》 这本书的介绍吧!

MD5 加密
MD5 加密

MD5 加密工具

Markdown 在线编辑器
Markdown 在线编辑器

Markdown 在线编辑器

UNIX 时间戳转换
UNIX 时间戳转换

UNIX 时间戳转换