内容简介:I've recently been writing a bit of parsing code in Rust, and I've been jumping back and forth between a few different parsing libraries - they all have different advantages and disadvantages, so I wanted to write up some notes here to help folks who are u
I've recently been writing a bit of parsing code in Rust, and I've been jumping back and forth between a few different parsing libraries - they all have different advantages and disadvantages, so I wanted to write up some notes here to help folks who are undecided choose what libraries and techniques to consider, and also to offer some suggestions for the future of the Rust parsing ecosystem.
I'm writing this after writing parsers for two fairly different usecases:
- a fairly simple binary format (mostly just serializing structs, but it also has varints, fixed-point numbers, various types of containers)
- a very simple templating language ( liquid-template -esque)
Some of these libraries are only/mostly applicable to one or the other - I will note when that is the case.
Last updated: April 2020- the Rust ecosystem is evolving quickly, things may have changed since this was written.
nom (v5.1.1)
nom is probably the most famous Rust parsing library - it's a parser-combinator library, which means that you write functions to parse little bits of your inputs, then you use the "combinators" that nom provides to combine them together into larger parsers. nom is good for both binary and textual formats, and can handle streaming data (where you start parsing before you have all of the data), which is nice if you want that.
I like nom a lot - I really enjoy using parser combinators, and nom is definitely my goto for lots of things, but it does come with a few caveats:
- The documentation isn't great, and figuring out how you're supposed to use it can be difficult. It's gotten a lot better now that it's no longer macro-based, but it can still be pretty frustrating to figure out.
-
The documentation being lacking exacerbates the fact that the selection of combinators and helper functions isn't as large as I'd like - for instance, I spent quite a while looking for something that operated like
take_until1
, just to find out that it doesn't exist in the current version of nom, and I'd have to implement it myself. - The error handling story is not good. It's possible that I'm just holding it wrong, since apparently error handling was improved in 5.0, but after reading all of the available docs and some of the source code, I wasn't able to figure out how to return custom error types in a way that's not extremely verbose. And that's just to get any information out - if you want to get spans and human-friendly error messages, that's probably even harder.
Overall, I liked using nom, and will probably continue to use it, especially for small to medium sized parsers where I care about performance. Hopefully the error handling will get better soon, and it can be my goto for more things.
LALRPOP (v0.18.1)
LALRPOP is a parser-generator, meaning that it takes in a file that describes the grammar you want to parse, and generates code that implements a parser for that grammar. You can check out some examples of what the grammar files look like here . It's designed primarily for textual formats.
It's primary goal is usability, and that really shows - since the grammar is designed specifically to work with Rust, it can directly generate the Rust structs that describe your AST, which is really really nice - you don't need to write any repetitive code at all, you can just describe your grammar and get an AST out. However, there's a big caveat here - LALRPOP splits out the parsing and lexing phases, and the built-in lexer is described by the author as being "intended to be a toy" , and it seems pretty annoying to write your own. So, if you have a grammar that requires a non-trivial lexer, you still will need to write Rust code.
The particular issue that bit me is if you have two "terminals" that are ambiguous, even if the grammar isn't ambiguous overall (because only one of the terminals will be able to be parsed at any given time), LALRPOP will not be able to parse the grammar. For instance, a silly example:
grammar; pub Main: () = { "foo: " <foo: Foo> => foo, "bar: " <bar: Bar> => bar, }; Foo: String = { <value: r"[A-Za-z0-9]+"> => value, }; Bar: String = { <value: r"[A-Z]+"> => value, };
This grammar parses either the string "foo: " followed by any number of alphanumeric characters, or the string "bar: " followed by any number of uppercase latin letters. This is non-ambiguous, but LALRPOP does not accept it, because "any alphanumeric character" is ambiguous with "any uppercase latin letter".
I ran into this issue and stopped using LALRPOP (since I was too lazy to figure out how to write my lexer the way LALRPOP wanted me to). Which is a shame, since LALRPOP is extremely nice otherwise! Being able to just write the grammar file and get structs out is amazing, and the grammar language seems quite nice. There's an open issue for fixing this, but unfortunately it seems pretty fundamental to LALRPOP's current design.
pest (v2.1.3)
pest is a parser-generator using PEG as an input, making it more powerful than LALRPOP (presumably at the cost of being slower, but I haven't tried to benchmark either of them). This makes it very easy to write a grammar and get a tree back, but unfortunately the tree that you get has no types associated with it (unlike LALRPOP), so you need to manually write the code to take the output of pest and write it into an AST that you'd actually want to use. This code tends to be pretty rote and easy to write, but you still do have to write it, which is a bit annoying, and keeping it in sync with a changing grammar seems extremely annoying and bug-prone.
pest also has a pretty nice error handling story, given that it gives you spans to work with so you can report errors fairly easily. I haven't used it enough to see how that really scales up for more complicated things, but superficially, it seems pretty nice.
What I went with
For the template parser, I went with pest, since it allowed me to get a prototype of the ground as quickly as possible, which is what I was optimizing for for that project. I don't think I'd be happy with it in the long term (keeping the untyped pest output to AST code in sync with the grammar long-term seems really frustrating and error prone), but I was very happy with it as a prototype.
For the binary protocol, I originally wrote a parser with nom, which I was quite happy with. However, once I wanted to implement serialization as well as parsing, after a quick detour trying to use serde, I decided to implement both parsing and serializing using a custom procmacro which I wrote myself (with the help of byteorder ), and I ended up happier with that solution than I was with the original nom parser - it ended up being fewer lines of code, as well as seeming simpler overall.
What I want in the future
I'd love to see a crate that combines the usability of LALRPOP's format with the expressivity of PEG parsers, for when I want to quickly hack together a prototype of something involving a parser, or for cases where optimization is not as important.
I'd also love to see nom get better error handling support and docs, as well as more combinators in its library.
And implementing all of this was also definitely a good reminder that writing a parser by hand can be a good choice a lot of the time as well! You don't always have to turn to a library to solve your problems.
I'll also close this out with the caveat that basically all of the parsing I've done has been for extremely small "toy" problems, and so this post ignores a lot of concerns that I'd have writing something more "real" - particularly, I think it's quite a bit harder to get good error messages out of a parser-generator than a parser-combinator or a handwritten parser, and error messages are generally a really important and underappreciated part of parser design.
以上就是本文的全部内容,希望本文的内容对大家的学习或者工作能带来一定的帮助,也希望大家多多支持 码农网
猜你喜欢:本站部分资源来源于网络,本站转载出于传递更多信息之目的,版权归原作者或者来源机构所有,如转载稿涉及版权问题,请联系我们。
精彩绝伦的jQuery
[美] Jake Rutter / 魏 忠 / 人民邮电出版社 / 2012-6 / 59.00元
内容简介: 通过创建更具交互性的富Web界面增强用户体验 本书阐述如何利用少量的JavaScript基础知识将jQuery框架整合进网站,以创建富Web界面,并建立兼容所有主流浏览器的交互性网站。作者是一位具有丰富经验的Web设计师和开发者,通过一系列指导性步骤清晰讲述了添加交互性以创建卓越Web应用的方法和技巧。 使用jQuery可以节省大量的开发时间,开发者在没有丰富编程经验......一起来看看 《精彩绝伦的jQuery》 这本书的介绍吧!