In the field of parsing, Lex and Yacc, as well as their respective successors flex and GNU Bison, have a sort of venerable status. And you could still use them today. But you should not. In this article we will explain why they have problems and show you some alternatives.
Lex and Yacc were the first popular and efficient lexer and parser generators; flex and Bison were the first widespread open-source versions compatible with the original software. Each of these programs has more than 30 years of history, which is an achievement in itself. For some people they are still the first software that comes to mind when talking about parsing. So, why should you avoid them? Well, we found a few reasons based on our experience developing parsers for our clients.
For example, we had to work with existing lexers in flex and found it difficult to add modern features, like Unicode support, or to make the lexer re-entrant (i.e., usable in many threads). With Bison our clients had trouble organizing large codebases, and we found it difficult to improve the efficiency of a parser without rewriting a large part of the grammar. The short version is that there are tools that are more flexible and productive, like ANTLR.
The first part of this article covers the history of these two programs, while the second analyzes their flaws. If you do not care about the history, you can jump to the second part using this handy table of contents.
Table of Contents
- History of Parsing Pioneers
- The Story of Lex
- The Story of Yacc
- The Open Source Versions
- Comparison between flex, Bison and modern parser generators
- A Glorious History Comes with its Set of Problems
- Summary of Differences
- Stability and Development of New Features
- Separation between Grammar and Code
- The Simplest Example Possible
- Unicode Support
- They Use Different Licenses
- Grammar Format
- Features of Lexing Algorithms
- Features of Parsing Algorithms
- Error Messages
- Community and Documentation
- Summary
History of Parsing Pioneers
We start from the beginning, of course, with a brief look at the history of these programs.
The Beginnings
Both programs have an interesting history, although Yacc’s story looks like a soap opera.
The Story of Lex
Lex is a lexer generator, that is to say a tool to generate lexical analyzers. In case you do not know what a lexer is, these are the basics. A lexer accepts normal text as input and organizes it into the corresponding tokens. All you need is a grammar that describes the rules to follow. So, for instance, you can tell Lex that any grouping of digits (i.e., a sequence of [0-9] characters) should be classified as an INT. A standalone lexer has few uses, because it basically just organizes sets of characters into categories. It can be used for things like recognizing the elements of a programming language to perform syntax highlighting. You can read more about the basics of parsing in our article: A Guide to Parsing: Algorithms and Terminology.
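To make this concrete, here is a minimal sketch of what such rules look like in Lex/flex syntax. The token name INT is illustrative: in a real project it would normally be a constant defined in the parser's generated header.

```
%%
[0-9]+      { return INT; }   /* any grouping of digits becomes an INT token */
[ \t\n]+    { /* skip whitespace */ }
```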
Mike Lesk and Eric Schmidt (who later became CEO of Google) developed it as proprietary software in 1975, in C. It was an important piece of software, both for its quality and for the development needs of that time. In fact, it became part of the POSIX standard: essentially any respectable OS needed to have a tool like it.
The Story of Yacc
Yacc is a parser generator, specifically a tool to generate LALR parsers. Essentially a parser groups tokens (like the ones generated by Lex) into logical structures. Again, all you need is a grammar that describes the rules to follow. For example, you can generate a parser that can take an IDENTIFIER (e.g., a variable), an EQUAL token, and an INT token and understand that together they form an assignment. A parser deals with much more complicated problems than those of a lexer.
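As a rough sketch, such an assignment rule could be written in Yacc syntax like this (the token names are illustrative; in practice they would be produced by the lexer):

```
%token IDENTIFIER EQUAL INT
%%
assignment : IDENTIFIER EQUAL INT ;
```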
Stephen Johnson developed Yacc during the 1970s, writing (and rewriting) it many times between 1973 and 1978 as proprietary software. He wrote the latest version in C, although he initially wrote it in B. It started as a practical tool to help the development of the B language. It was continually improved, and it effectively also helped advance the theory of parsing. Johnson was a computer scientist who made sure that every trick to improve speed was theoretically sound and grounded in computer science. That is a level of dedication that you do not often see in contemporary software. Yacc also became part of the POSIX standard.
Since Lex is used to generate lexers and Yacc to generate parsers, they were complementary and often used together. They were simply the best software available in their respective niches.
The Open Source Versions
As mentioned, the initial versions were proprietary software. This means that their use was restricted, which left some users dissatisfied. That is why several open source versions of both programs were created. These were often compatible with the grammars of the original software, but also added their own improvements. The Lex-compatible software that gained prominence was flex, which is still used today. It uses the BSD License and also supports C++. There are also countless implementations of Lex in different languages.
There were two major replacements for Yacc: Berkeley Yacc (byacc) and GNU Bison. The first is in the public domain, while the other obviously uses the GPL License. The original author of both programs was Robert Corbett and, confusingly, byacc was originally called bison. Furthermore, Bison was not originally compatible with Yacc, but it was made compatible by Richard Stallman, who brought it into the GNU world. In the meantime, Corbett rewrote byacc from scratch to be compatible with Yacc. Byacc was originally the most popular Yacc-compatible software, but now GNU Bison is (byacc is still developed). There are also ports of Bison in other languages.
In many ways Yacc was an innovative piece of software, because of its impact on the development of parsers. Lex was less innovative, but it also had a long-lasting influence: you will find that many standalone lexer generators are inspired by Lex. This is both because of its quality and because there is less need for standalone lexers, so Lex-inspired lexer generators are still good enough for common uses.
There is even software that is compatible with, and replaces, both Lex and Yacc together, like PLY in Python.
Comparison between flex, bison and modern parser generators
Now that the introduction is over, we can see why you should avoid these tools if you have the chance. We want to avoid just talking about what is possible and ideal, and focus on what is really feasible today. That is why we are going to compare flex and Bison with a modern parser generator: ANTLR.
A Glorious History Comes with its Set of Problems
Now that we know the history of these programs, we can already imagine some of the problems. These are very old programs, created to be compatible with even older software. Even worse, their design is based on really old principles that are not best practices today. Can you even imagine the quality of code that has been written and patched for 30 years?
Their main selling point was, and still is, compatibility with an old codebase. In short, there are mainly three cases to consider:
- if you have existing parsers in maintenance mode using flex and Bison, it makes sense to keep using them;
- if you have existing parsers that are actively developed, you should consider porting them to more productive tools; the effort of porting them will be repaid with greater productivity;
- if you need to develop new parsers from scratch, there are too many downsides to using them and you should opt for more productive tools, like ANTLR.
Summary of Differences
Here is a brief summary of the comparison between flex/Bison and ANTLR.
- Stability and Development of New Features
- flex and Bison are stable and maintained software but there is no active development. C++ support can be of limited quality.
- ANTLR is actively developed and new features are added periodically
- Separation between Grammar and Code
- flex and Bison maintain an old-school design with little support for readability or productivity.
- ANTLR is a modern parsing generator tool with a design that favors a portable grammar, usable for multiple target languages
- Unicode Support
- flex does not directly support Unicode
- ANTLR supports all flavors of Unicode and even makes it easy to select Unicode properties (e.g., Emoji)
- They Use Different Licenses
- flex uses a BSD license, while Bison uses the GPL. Bison waives the GPL for most generated parsers, but it can still be a problem
- ANTLR uses the BSD license
- Grammar Format
- Bison only supports BNF, which makes grammars more complicated
- ANTLR supports EBNF, which makes it easier to describe most languages
- Features of Lexing Algorithms
- flex supports regular expressions to define rules, which works for most elements, but adds complexity
- ANTLR supports context-free expressions to define rules, which simplifies defining some common elements
- Features of parsing algorithms
- Bison supports two parsing algorithms that cover all ranges of performance and languages. It gives cryptic error messages
- ANTLR supports one algorithm that works for all languages with usually a great performance
- Community and documentation
- flex and Bison have smaller and fractured communities, but they have good documentation
- ANTLR has one vibrant community and a good documentation
Stability and Development of New Features
Flex and Bison are very much stable software. They are maintained, but the development of new features is limited, if not absent. Furthermore, their code is stable, but not always of great quality. This is especially true if you consider that C and C++ are different languages, but they are often considered and used together because of their compatibility. Let me give you an example; this is a comment straight from the flex source code:
/* The c++ scanner is a mess. The FlexLexer.h header file relies on the
* following macro. This is required in order to pass the c++-multiple-scanners
* test in the regression suite. We get reports that it breaks inheritance.
* We will address this in a future release of flex, or omit the C++ scanner
* altogether. */
(from https://github.com/westes/flex/blob/master/src/flex.skl#L127)

So, yeah, stable code does not mean good-quality code. This is nothing unusual for an old project: if you change something, you break something. So you leave it as it is. But that is exactly why old software does not always have a stellar reputation.
ANTLR instead is more actively developed. It has been rewritten from scratch a few times over the years, so the code is of good quality. Furthermore, new features are added regularly that improve both parsing and supporting features. Some are small improvements, for example better support for selecting Unicode characters depending on their properties. There are also features that significantly improve the life of developers, such as robust support for semantic predicates. Other additions are useful for the support of the project, for instance the generation of diagrams of the parse tree for documentation purposes.
The design of Flex and Bison is stagnant and antiquated by principle. This is partly because their killer feature is compatibility, so it cannot be changed. It is also due to the fact that they are two different programs that must work together, so neither can evolve alone. Instead, ANTLR is developed as one piece of software that works both as a lexer and a parser generator, so it has more freedom to change.
Separation between Grammar and Code
Both Flex and Bison only support a grammar that contains actions (i.e., the code to manipulate the results of parsing is inside the grammar). ANTLR, on the other hand, allows you to separate the definition of the language from the code that handles the results.
A great advantage of this separation between grammar and normal code is that it allows you to reuse the grammar, with other projects or even with other languages. This is possible because ANTLR supports generating parsers in different languages from the same grammar, as long as the grammar does not contain actions. This is not always possible, since some complex languages require advanced manipulation to be parsed, but it is usually doable. This means that to parse a real programming language like Java you would probably need to use actions, but you do not need them to parse JSON or an average DSL.
ANTLR supports writing a grammar that contains only the description of the language. It is possible to write actions, as in Flex and Bison, but it is recommended to avoid them when possible. This allows you to write a cleaner and more understandable grammar. There is separation of concerns: the grammar describes the language to parse, while the code handles the results of the parsing.
Furthermore, code embedded inside a grammar lacks context. While a developer is editing the grammar, it is not obvious which arguments can be accessed inside each rule. Writing code in a grammar also means having no support from an IDE: there is no syntax highlighting or autocompletion. Instead, writing the code apart from the grammar, in a traditional source code file, allows you to take advantage of the usual IDE features.
The Simplest Example Possible
We are going to show a simple example of a rule written in Bison and in ANTLR, to get a feel for what it is like to work with them.
This is a typical simple rule written for Bison. The token NUM is defined in Flex (not shown).
sum : NUM '+' NUM { $$ = std::stoi($1) + std::stoi($3); }
Setting aside that the code is a bit cryptic, this combination of rule definition and traditional code is harder to read and understand. Imagine many rules like this one floating together, interspersed with long paragraphs of code. This is even more true when you consider that, in real cases, the grammar must also include code to manage the usual includes, namespaces, setup, etc.
Bison grammars must obviously also include code to integrate Flex tokens into Bison, since Flex is a separate program.
/* Bison setup and linking to Flex tokens */
%token NUM
// we indicate that the sum uses left associativity
%left '+'
On the other hand with ANTLR you can write simple rules like this one in the grammar.
sum : NUM PLUS NUM ;
Then ANTLR offers two simple ways to manipulate the results of the parsing: a visitor or a listener. Both are patterns that allow you to execute code while traversing the parse tree created by the parser. The difference between the two is that with a listener the tree is traversed automatically: you cannot change its path. With a visitor, instead, you can alter the traversal path as you wish.
An example method of a visitor for the previous rule could be the following.
antlrcpp::Any LanguageVisitor::visitSum(LanguageParser::SumContext *ctx) {
    // sum the two NUM tokens matched by the rule
    antlrcpp::Any result = std::stoi(ctx->NUM(0)->getText())
                         + std::stoi(ctx->NUM(1)->getText());
    return result;
}
The code is in a traditional source code file, together with the rest of the program code. So, it is easier to understand it and the IDE can support the developer as usual.
Unicode Support
Flex does not natively support Unicode, because it can only read 8-bit input. While this does not make it impossible to recognize Unicode characters, it means that Flex does not make it easy. Developers would have to create their own solution to recognize Unicode characters, which would require significant work that they would then have to maintain.
ANTLR has full support for Unicode, both the widespread UTF-8 encoding and others like UTF-16. It also supports Unicode properties, which simplify referencing ranges of characters. For example, the following is valid ANTLR code that creates a lexer rule to match all Emoji characters.
// select all Emoji characters
EMOJI : [\p{Emoji}] ;
They Use Different Licenses
Flex and Bison are two separate programs. Even though they are designed to work together, there are compatibility issues and incompatible features. For example, Bison supports the generation of parsers in C, C++ and Java, while Flex only supports C well. Flex has a permissive license (BSD), while Bison has a restrictive license (GPL).
ANTLR is one piece of software, which simplifies deployment, and has a permissive license (BSD).
In the simplest parsing projects, the restrictions of the GPL license are not relevant, since there is no need to include Bison itself with the generated parser. In addition, Bison provides a special exception to the GPL that excludes its applicability to parsers generated by Bison. However, there are some advanced usages that would require the inclusion of the parser generator itself. For example, an IDE that provides a way to create custom syntax highlighting rules might ship a parser generator. In that case the license could be an issue.
Grammar Format
Bison supports only a BNF-equivalent format to define rules, instead of the more flexible EBNF. You can see our article EBNF: How to Describe the Grammar of a Language for the main differences. The short version is that the grammar will be more verbose and complicated in Bison than in ANTLR. For example, rules that use lists of elements have to be defined in a rather awkward way.
lines : lines line | line ;
This rule is trying to say that a lines is a series of one or more line. However, because of the limitations of the BNF format, it is actually saying something much more complicated: a lines is a lines followed by a line, recursively, until it finds the last line. You can clearly see the issue in the graphical representation of the parse tree for a simple example.
Compare this definition with the more natural way, which is not supported by Bison.
lines: line+ ;
This rule is actually just saying that a lines is a series of one or more line. ANTLR supports this format both in the lexer and in the parser.
Features of Lexing Algorithms
Flex supports regular expressions, while ANTLR supports context-free expressions in the definition of lexer rules. The difference is that context-free expressions are more powerful and can match more complex structures.
Lexing is generally simpler than the second phase of proper parsing, so it requires less powerful algorithms. This means that flex, a tool dedicated to lexing, is strictly less powerful than ANTLR. That is because ANTLR is one tool that combines lexing and parsing, so it can employ a more powerful algorithm in both phases.
In practical terms, it means that it is easier to define complex lexical structures in ANTLR. For example, you can easily define recursive lexer rules in ANTLR.
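For instance, a classic use of recursion in an ANTLR lexer rule is matching nested block comments, something a plain regular expression cannot do. The following sketch uses the usual ANTLR idiom for this:

```
// a nested comment can contain other nested comments
NESTED_COMMENT : '/*' ( NESTED_COMMENT | . )*? '*/' ;
```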
For instance, this is how you define the common programming language element of a string between double quotes in flex.
\"([^"])*\"    { return STRING; }
This rule matches a \" (double quote), any instances of [^"] (any character that is not a double quote), and finally a \" again.
This is an equivalent rule defined in ANTLR.
STRING : '"' .*? '"' ;
It is more readable this way.
This limitation of flex has an impact on productivity, given that the lexer is where things are trickier to define and it is easier to make mistakes. That is because you are working character by character, without any context, so it is more difficult to conceptualize what is going on.
Both flex and ANTLR support a feature to activate the recognition of some tokens only in certain conditions. This feature is called start conditions (states) in flex and lexical modes in ANTLR. It allows you to create one lexer that parses multiple languages. For example, it would allow you to build one lexer for a markup language (e.g., XML, HTML), which contains both free text and markup code.
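As a rough sketch, lexical modes in ANTLR look like the following hypothetical markup-like lexer (the rule names are illustrative, and modes are only allowed in a dedicated lexer grammar):

```
// default mode: free text until we see a '<'
OPEN : '<' -> pushMode(INSIDE) ;
TEXT : ~'<'+ ;

mode INSIDE;
// inside markup: recognize identifiers until the closing '>'
CLOSE : '>' -> popMode ;
ID    : [a-zA-Z]+ ;
```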
Features of Parsing Algorithms
A defining characteristic of a parser generator is the algorithm it uses to create a parser. The theory of parsing has been extensively studied by computer scientists, who developed different parsing algorithms with different strengths and weaknesses.
Comparison of the Speed of the Algorithms
ANTLR uses an algorithm developed by its main author, Terence Parr, called ALL(*). This is an improvement over the traditional LL algorithm, which is less powerful. The LL algorithm is easier to implement manually and powerful enough to be useful, so historically many computer languages were purposefully designed to be parsed with an LL algorithm. The ALL(*) variant is more powerful than the basic LL algorithm and can handle most languages you will encounter.
Bison supports the generation of parsers using two algorithms: LALR and GLR. The first is quicker but less powerful than the one used by ANTLR; the second is potentially slower, but slightly more powerful (i.e., it can parse more languages).
In practical terms, both ALL(*) and GLR can parse most languages encountered by developers. The only difference is that GLR might allow you to write simpler rules in some cases. We are going to see which ones later.
There is no comprehensive performance comparison between ANTLR 4 and flex/Bison; however, there is a scientific paper that compares a few parser generators based on the same algorithms. In the following paragraphs we present its findings.
According to theoretical analysis, the algorithm used by ANTLR (i.e., ALL(*)) has a higher worst-case complexity (O(n⁴)) than the GLR algorithm (O(n³)). However, in practical usage ALL(*) has been shown to be quicker than GLR. Testing on a corpus of Java 6 source files found that:
In practice, we found GLL and GLR to be ~135x slower than ALL(*) on a corpus of 12,920 Java 6 library source files (123M) and 6 orders of magnitude slower on a single 3.2M Java file, respectively.
[..] For this experiment, ALL(*) outperforms the other parser generators and is only about 20% slower than the hand built parser in the Java compiler itself
In any case, both algorithms are used extensively in production systems, so they are quick enough for everyday use. ALL(*) and LALR parsers have about the same performance.
Comparison of the Power of the Algorithms
Before discussing the power of the algorithms, we have to understand one important difference: the different ways in which a grammar format and a parsing algorithm influence how you write a grammar. The format in which you write a grammar limits the formal rules that you can use to describe your language. It is basically a productivity issue: the more powerful the format, the easier it is to write rules (e.g., some formats allow you to easily express repetitions). We have already seen the difference in grammar formats between Bison and ANTLR, so this part should be clear. The parsing algorithm, instead, determines how you can combine the different rules or their parts. A specific parsing algorithm might make some combinations of rules impossible; i.e., each rule might be formally correct, but the whole grammar might not be valid.
In simple terms: a grammar format determines the syntax of a rule (e.g., the repetition of elements), while the parsing algorithm might limit the semantics of the rules (e.g., the combination of elements). Let’s see an example of rules that make a grammar invalid.
The GLR algorithm, which can be used by Bison, has the advantage of essentially allowing the developer to write a valid grammar in any way they like. As we said before, GLR is more powerful than ALL(*) in the sense that it is more flexible and allows simpler grammars. The main example is the issue of left-recursive rules. A left-recursive rule is a rule of the grammar that references itself at the beginning of at least one alternative. These kinds of rules are a common reason a grammar is invalid for some parsing algorithms.
In the following example, expr is a left-recursive rule, because the rule being defined (i.e., the expr before the colon) appears at the beginning of the definition of one of its own alternatives (the first alternative, with the expr after the colon).
expr : expr ('*'|'/') expr | [0-9]+ ;
GLR can handle grammars with mutual and indirect left-recursive rules, while ALL(*) can only handle direct left-recursive rules.
Traditional LL parsers cannot handle left-recursive rules at all, which makes it complicated to write grammars. This limitation is one of the reasons why many official grammars define expressions as a cascading series of recursive rules, rather than with just one rule. For example, you can see this limitation in the official grammar of Python. The other reason is that, by writing the grammar this way, the parser is usually quicker than with the more intuitive formulation.
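The cascading style looks like the following sketch, written in EBNF-like notation: each precedence level gets its own rule, and recursion only appears on the right-hand side or inside parentheses, so no rule is left-recursive.

```
expr   : term (('+'|'-') term)* ;
term   : factor (('*'|'/') factor)* ;
factor : NUM
       | '(' expr ')'
       ;
```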
This is an example of a grammar that uses direct left-recursive rules, which both ANTLR and Bison support.
expr : expr ('*'|'/') expr
     | expr ('+'|'-') expr
     | NUM
     ;
NUM : [0-9]+ ;
This is a grammar that uses indirect (mutual) left recursion, which only Bison supports.
expr : expr ('*'|'/') expr
     | expr ('+'|'-') expr
     | power   // mutually left-recursive with the rule "power"
     | NUM
     ;
// mutually left-recursive with the rule "expr"
power : expr '^' expr ;
A common effect of this limitation is that the developer ends up writing one very large expression rule that directly contains all the expressions supported by a language, because you cannot define each expression in a separate rule and then combine them. Generally, this is a minor inconvenience for productivity and performance.
Error Messages
The GLR algorithm is harder to debug than the ALL(*) algorithm. That is because the ALL(*) algorithm follows a pattern that seems natural to developers: top-down. It tries to parse a document from the root of the tree (i.e., the main rule of the grammar that encompasses all the input) down to the leaves (i.e., the individual elements). GLR does the opposite: it is a bottom-up algorithm that starts from the leaves of the tree and rises up until it finds the main rule.
Let’s imagine a simple grammar:
statement : assignment NEWLINE
assignment : identifier '=' expr
expr : NUMBER '+' NUMBER
A top-down parser would try to match the rule statement first, then assignment, and so on, until it matches the second NUMBER in the expression rule. A bottom-up parser would do the opposite: it would try to match a NUMBER token and rise up until it could match the whole rule for statement. This means that you might receive cryptic warning messages about a certain number of shift/reduce conflicts, without any clear indication of which rule contains the error.
For instance, this might happen when you define something like the following.
lines : lines line
      | line
      ;
line : WORD
     | /* empty */
     ;
You will probably receive a warning message, because any two WORD tokens could be separated by any number of empty line rules, so the Bison parser does not know what to do. However, Bison will not report a single mistake; instead it will say that there are a certain number of shift/reduce conflicts. The number of conflicts is obviously not random, but it depends on the internal structure of the generated parser, which is not evident from looking at the grammar. So, the number of conflicts seems random.
Community and Documentation
ANTLR has a community that is alive and vibrant. The community has provided more than a hundred grammars that can be freely used by everybody. There is also a mailing list that can help. The main author of the tool, Terence Parr, wrote the book The Definitive ANTLR 4 Reference, which explains the features of ANTLR and much more: the most useful patterns in parsing and the common issues encountered when parsing. We created a Mega Tutorial to help you with ANTLR and even an entire course to help you Using ANTLR Like a Professional.
Flex and Bison have smaller and less active communities. Bison has an official mailing list; flex does not even have that. Still, they both have a manual that provides ample information about the features and options of the respective tool.
ANTLR has an involved community that has contributed not just grammars, but even new target languages. Everybody can create and contribute a new target language or help improve the existing ones. For example, we built experimental support for Kotlin as a target language. Flex and Bison, instead, have a long tradition, but they are also less used now, and their fractured communities make it harder to find support.
Traditionally, open source tools have had issues producing comprehensive documentation, which can hurt productivity and slow the acquisition of proficiency in the tool. It is one of the hidden costs of open source software. However, in this case all three tools have good documentation: Flex and Bison have good official manuals, while ANTLR has a book that explains the tool and how to use it in the most common parsing situations.
Summary
If we had to sum up this article in one phrase, it would be this: flex and Bison still work, but every day they become less of a good choice. It is not that using them will doom your project, but there is really no good reason to pick them over ANTLR or any of the other modern parser generator tools available. If for some reason you do not like ANTLR, for instance because you want something simpler, there are also many other valid options. We have reviewed all the major parsing tools and libraries for Java, C#, Python and JavaScript. So pick one of them, any one of them, but not flex and Bison.