Byte Ordered Stream Parsing

栏目: IT技术 · 发布时间: 5年前

内容简介：Byte Ordered Streams— 2020-01-22While on vacation in Japan I dug into streaming parsing, for fun. It's something that piqued my interest ever since I saw Mafintosh's

Byte Ordered Streams

— 2020-01-22

While on vacation in Japan I dug into streaming parsing, for fun. It's something that piqued my interest ever since I saw Mafintosh's csv-parser package back in 2014.

This led to the creation of the omnom library, and as part of it I came up with an ergonomic way to parse and write numbers with a specified endianness to and from streams. An example:

use std::fs::File;
use omnom::prelude::*;

let mut file = File::open("foo.txt")?;

file.write_le(56_i32)?;        // Write an i32 as little-endian to a byte stream.
let n: u16 = file.read_be()?;  // Read a u16 as big-endian from a byte stream.

The name omnom is a pun on the excellent nom package, which you should probably check out if you're looking to write a maintainable parser.

Bytes and Endianness

"Endianness" is a word that you might hear occasionally, but if you've never written network drivers or serialization formats you might not have had to learn about before. Basically what it means is "the order in which bytes form numbers".

A byte is the smallest representation of data in a computer, and in Rust it's referred to as a u8 . A u16 is two bytes long, which means it's two u8 s.

let b1 = [0u8];      // length 1 byte, size equal to u8
let b2 = [0u8, 0u8]; // length 2 bytes, size equal to u16

However in the second case, do we interpret the bytes from left-to-right, or right-to-left? Endianness refers to the order we read bytes in. Little-endian means we interpret bytes right-to-left. And big-endian means we interpret them left-to-right. Different protocols will interpret bytes in different orders, but it will usually say so in the protocol. The most important thing to be aware of is that there's a difference between the two.

In Rust's stdlib there exists "little-endian", "big-endian", and "native-endian". The latter refers to the system's default representation.

Reading Bytes from a stream

Say we wanted to read a u16 encoded as little-endian from a stream. We would write something like this:

use std::io::prelude::*;
use std::fs::File;
use std::mem;

let mut f = File::open("foo.txt")?;

let mut buf = [0; mem::size_of::<u16>()];
f.read_exact(&mut buf)?;
let num = u16::from_le_bytes(buf);

This uses mem::size_of to ensure we get the right size for the type we're trying to read. Then uses Read::read_exact to read bytes from the stream, and u16::from_le_bytes to convert it to a u16.

If you haven't seen the bytes methods before: they're fairly new, and have been implemented as part of Rust 1.32. In the tracking issue there's a fair bit of talk about operating on streams. Which is what we're trying to improve today with this post (:

To write a u16 as little-endian to a stream, we can do:

use std::io::prelude::*;
use std::fs::File;

let mut buf = File::create("foo.txt")?;

file.write_all(99_u16.to_le_bytes())?; // Write a u16 to a stream.

This approach uses u16::to_le_bytes and Write::write_all to write bytes to a stream.

As you can see writing bytes to a stream is somewhat straight forward; combining two API calls to create the buffer and write it. But expressively reading bytes is much more involved: a combination of 3 APIs that must be used in the right order to read from a stream.

Having to write this for each number read from a stream would feel repetitive, making it harder to read and write. But it's also not very discoverable for people who want to use these APIs. In order to figure it out you have to learn about several different APIs, and the interaction between them.

For the sake of this post I put together these examples. I know this is anecdotal evidence, but it took me about 10 minutes to find the right APIs and make something that compiles. This is after having written a library that makes extensive use of both. The end-result of the std APIs might look pretty decent when presented as-is. But we shouldn't underestimate the time it takes to find the APIs and make them work together.

ReadBytes / WriteBytes

The Iterator::collect function is one of the functions that best embodies Rust's values of expressiveness, performance, and correctness. There's entire articles written about just that one function. And rightfully so; it's because of the type system and the borrow checker that we can have a single API that can just as easily convert an Iterator to a HashMap , as to a Result<Vec> -- all without any performance penalty. That's quite impressive for a language that has such a strong focus on correctness as Rust.

It turns out we can write a similar API for number encoding and decoding number types as well. In omnom we've done this by implementing two traits for each number: ReadBytes and WriteBytes . The APIs look like this:

⁴

I'm not particularly attached to the naming of these methods. It's mostly that alternatives that I came up with, such as ByteOrderedRead , seemed really long. If this ever makes it to an RFC, the naming of traits is something might need to be reconsidered.

use std:io::{Read, Write};

pub trait ReadBytes: Sized {
    fn read_be<R: Read>(reader: &mut R) -> Result<Self>;
    fn read_le<R: Read>(reader: &mut R) -> Result<Self>;
    fn read_ne<R: Read>(reader: &mut R) -> Result<Self>;
}

pub trait WriteBytes {
    fn write_be<W: Write>(&self, writer: &mut W) -> Result<usize>;
    fn write_le<W: Write>(&self, writer: &mut W) -> Result<usize>;
    fn write_ne<W: Write>(&self, writer: &mut W) -> Result<usize>;
}

We implement this for each individual number type so that it knows how to encode and decode itself from streams. We then add various endian-based read/write methods to the Read and Write traits that can take any T: ReadBytes or T: WriteBytes and write it to themselves:

pub trait ReadExt: Read + Sized {
    fn read_be<B: ReadBytes>(&mut self) -> Result<B> { ... }
    fn read_le<B: ReadBytes>(&mut self) -> Result<B> { ... }
    fn read_ne<B: ReadBytes>(&mut self) -> Result<B> { ... }
}

pub trait WriteExt: Write + Sized {
    fn write_be<B: WriteBytes>(&mut self, num: B) -> Result<usize> { ... }
    fn write_le<B: WriteBytes>(&mut self, num: B) -> Result<usize> { ... }
    fn write_ne<B: WriteBytes>(&mut self, num: B) -> Result<usize> { ... }
}

impl<T: Read> ReadExt for T {}
impl<T: Write> WriteExt for T {}

When put together, this allow reading and writing bytes from any number type since they all implement ReadBytes and WriteBytes . This removes all boilerplate we saw earlier, and replaces it with a few straight forward APIs.

use std::fs::File;
use omnom::prelude::*;

let mut file = File::open("foo.txt")?;

file.write_le(56_i32)?;        // write an i32 as little-endian to a stream.
let n: u16 = file.read_be()?;  // read a u16 as big-endian from a stream.

Future possibilities

RFC

Making people use the omnom library for such a small API addition doesn't make much sense. I think this would be a fantastic addition to the standard library, making something commonplace both easier to author, review, and learn about. Perhaps this would make for a good RFC?

Floating point numbers

Since Rust 1.41 endianness conversions are now available for floats as well. This isn't part of omnom yet, but would be a 2-line change to add.

Multi-value reads and writes

Something else that would be a lot of fun to add would be multi-value reads and writes:

// Proposed
let head: [u32; 4] = file.read_be()?;

// Current
let b1: u32 = file.read_be()?;
let b2: u32 = file.read_be()?;
let b3: u32 = file.read_be()?;
let b4: u32 = file.read_be()?;
let head = [b1, b2, b3, b4];

Support for async streams

Another interesting avenue to explore would be to implement these methods for AsyncRead and AsyncWrite . Reading bytes asynchronously from streams is very common, which would create benefits comparable to their synchronous counterparts.

use async_std::fs::File;
use omnom::prelude::*;

let mut file = File::open("foo.txt").await?;

file.write_le(56_i32).await?;        // write an i32 as little-endian to a byte stream.
let n: u16 = file.read_be().await?;  // read a u16 as big-endian from a byte stream.

In async-std we've implemented Stream::collect using FromStream and IntoStream traits. I recently wrote about the challenges of these traits , and given how similar they'd be to the traits we're proposing in this post, they'd likely face the same challenges. In short: we can probably implement them today, but not as efficient as theoretically possible until we support async fn in traits.

Alternatives

Back in November I wrote a rather long survey of byte-ordered reads and writes , and discussed a few alternatives. The design implemented in omnom ended up being the third option we discussed. But it's probably worthwhile discussing the other two options we considered.

Alternative 1: Inherent Methods on Read/Write

Instead of implementing a new trait for each number type, we could add number-specific methods on the Read and Write traits. If we followed std's naming conventions of prefixing with be_ , le_ , and ne_ we would end up implementing 96 new methods on Read and Write (2 traits * 3 endianness kinds * 16 number types).

use std::io::prelude::*;

let mut reader = Cursor::new(vec![2, 5, 3, 0]);
assert_eq!(517, reader.read_u16_be()?);
assert_eq!(768, reader.read_u16_be()?);

That's a lot of methods that would be shown in the documentation of Read and Write , making it generally harder to browse and explore. But it would also prevent us from implementing multi-value reads / writes later on (seeFuture Possibilities).

Alternative 2: ByteOrder inspired

Another option would be to follow byteorder 's approach of adding 32 methods on the Read and Write traits, and using Endianness types as type parameters to choose the endianness.

use std::io::prelude::*;
use std::io::BigEndian;

let mut reader = Cursor::new(vec![2, 5, 3, 0]);
assert_eq!(517, reader.read_u16::<BigEndian>()?);
assert_eq!(768, reader.read_u16::<BigEndian>()?);

Compared to the first option that's still a lot of methods (16 per trait), and would also prevent us from implementing multi-value read/writes later on. Additionally we would need to introduce at least 3 additional Endianness types that are used to switch between endianness, which means additional imports in order to use the traits.

pub mod io {
    pub mod endian {
        pub struct BigEndian;
        pub struct LittleEndian;
        pub struct NativeEndian;
    }
}

⁵

Unfortunately Rust's current type system does not support constraining generic paramaters by enum members. This means that "endianness" needs to have 3 different structs instead of a single enum containing different kinds of endianness. When added to the stdlib it might make sense to create a submodule for this so that it still feels like we'd be contraining by an enum's member, even if we aren't actually doing that. Because familiarity and all.

Prior Art

byteorder

The byteorder crate uses a variation of the second alternative. byteorder was one of the first crates to provide convenience APIs for byte order aware operations, so the naming predates the additions to the stdlib.

As of Rust 1.32 numeric types have started to provide built-in methods such as to_le and from_le that partially cover the same functionality as byteorder . The purpose of this post is to share additional designs that could replace the remaining uses of byteorder .

tokio-byteorder

The tokio-byteorder crate is a port of byteorder for tokio 's own AsyncRead and AsyncWrite traits. The design mostly matches byteorder , with the exception that it operates on asynchronous byte streams instead of synchronous ones.

use std::io::Cursor;
use tokio_byteorder::{BigEndian, AsyncReadBytesExt};

let mut rdr = Cursor::new(vec![2, 5, 3, 0]);
assert_eq!(517, rdr.read_u16::<BigEndian>().await?);
assert_eq!(768, rdr.read_u16::<BigEndian>().await?);

tokio

Two months ago tokio added support for reading and writing numbers as big-endian only:

use tokio::io::{self, AsyncReadExt};
use std::io::Cursor;

let mut reader = Cursor::new(vec![2, 5, 3, 0]);
assert_eq!(517, reader.read_u16().await?);
assert_eq!(768, reader.read_u16().await?);

This currently includes 20 new methods total for AsyncRead and AsyncWrite . A design to add support for little-endian writes has been proposed, which closely follows the first alternative we outlined (e.g. write_u16_le ). If accepted it would introduce an additional 40 top-level methods for AsyncRead and AsyncWrite . At the time of writing tokio equivalents of stdlib features such as support for floating point numbers or platform-native endianness do not seem to have been proposed.

nom

nom is a parser-combinator library that operates on byte buffers instead of streams. As such it isn't quite the same as the others in this list. However it does have support for streaming operations, and provides several methods to read bytes with a given endianness:

use nom::number::streaming::be_u16;

let parser = |s| {
  be_u16::<(_, ErrorKind)>(s)
};

assert_eq!(parser(b"\x00\x01abcd"), Ok((&b"abcd"[..], 0x0001)));
assert_eq!(parser(b"\x01"), Err(Err::Incomplete(Needed::Size(2))));

This approach is different from any API we've discussed so far, though the naming scheme somewhat matches the first alternative.

Conclusion

In this post we've covered what endianness is, why streaming parsing is useful, and showed how byte-ordered aware parsing can be performed today. We've then gone on to introduce a novel approach for streaming byte-ordered aware parsing, inspired by the stdlib's collect function, and shown how to use a reference implementation. We then covered alternative API designs we and prior art.

I originally wanted to publish this post in early December after I came back from Japan. But the holidays happened, so I figured I'd make the effort now that we're approaching the end of the first month of the new year.

Overall I'm pretty excited about this API. It's very close to: "What if collect worked for numbers?"; being able to read and write numbers using 3 new methods on each trait. I think that's something that would be really nice.

let head: [u32; 4] = file.read_be()?;
file.write_le(56_i32)?;

Thanks to Joshua Gould (Technetos) and Florian Gilcher (Skade) for proof-reading this post.

以上就是本文的全部内容，希望本文的内容对大家的学习或者工作能带来一定的帮助，也希望大家多多支持码农网

查看所有标签

猜你喜欢:

Byte Ordered Stream Parsing

本站部分资源来源于网络，本站转载出于传递更多信息之目的，版权归原作者或者来源机构所有，如转载稿涉及版权问题，请联系我们。

码农书籍

SNS网站构建

Gavin Bell / 张卫星、李占波、徐静 / 机械工业出版社 / 2011-2 / 69.00元

过去的十年里，Web成为了非常重要的社交工具。社会活动已经超出了BBS这个概念，而指范围更广的互联网。大多数人对Facebook、MySpace以及Twitter并不陌生，事实上，现在很多人在网络上都有个人档案。社会媒体已经成为我们生活的一部分，它可以让我们的生活更加美好，也可以使其更糟糕，像公民新闻这样的表达已变得很常见。仅仅Facebook就有两亿注册用户。那么在这个新领域中到底有什么奥秘呢......一起来看看《SNS网站构建》这本书的介绍吧!

码农工具