内容简介:One-pass CompilerLet’s look at what is a one-pass compiler and try to implement one.
One-pass Compiler
Primer
Let’s look at what is a one-pass compiler and try to implement one.
A one-pass compiler emits assembly (or binary code) right during parsing, without creating an intermediate representation, such as an AST. This is a rare technique that was used back in the days when computer memory was scarce. This limited both the language features that were possible and the quality of the produced code. But this techniques produced fast compilers that made Bill Gates envy.
Turbo Pascal
A notable example of a one-pass compiler is Turbo Pascal.
Fast compilation speed that the one-pass compiler architecture enabled is often cited as the decisive factor in the success of Turbo Pascal.
From Wikipedia :
Bill Gates saw the success of Turbo Pascal “in very personal terms, and ‘couldn’t understand why [Microsoft’s] stuff was so slow. He would bring in Greg Whitten [programming director of Microsoft languages] and yell at him for half an hour.’ He couldn’t understand why Kahn had been able to beat an established competitor like Microsoft . ”
Compiler
Let’s make a simple one-pass compiler. Not for a whole programming language, but just for simple arithmetic expressions, like the following:
4 + 2 * 10 + 3 * (5 + 1)
We’ll target x86-64 and will use flex and bison for generating our lexer and parser, respectively. I’ve used the Wikipedia bison example as an inspiration.
We start with defining our token type in the “yacc” file:
%token TOKEN_LPAREN <em>"("</em> %token TOKEN_RPAREN <em>")"</em> %token TOKEN_PLUS <em>"+"</em> %token TOKEN_STAR <em>"*"</em> %token <int> TOKEN_NUMBER <em>"number"</em>
Then we go onto defining our simple lexer in the “lex” file:
[ \r\n\t]* { continue; /* Skip blanks. */ } [0-9]+ { sscanf(yytext, <em>"<b>%d</b>"</em>, &yylval->value); return TOKEN_NUMBER; } <em>"*"</em> { return TOKEN_STAR; } <em>"+"</em> { return TOKEN_PLUS; } <em>"("</em> { return TOKEN_LPAREN; } <em>")"</em> { return TOKEN_RPAREN; } . { continue; /* Skip unexpected characters. */ }
And now, the grammar for our expression language. Let’s leave out the semantic actions, for now.
input: expr expr: <em>"number"</em> | expr <em>"+"</em> expr | expr <em>"*"</em> expr | <em>"("</em> expr <em>")"</em>
Above the grammar we specify the operators in order of increasing precedence:
%left <em>"+"</em> %left <em>"*"</em>
They are both left-associative: 2 + 3 + 4 means ((2 + 3) + 4). This is not a very important property for addition and multiplication, since the operations themselves are associative.
We need to directly emit some assembly code in each semantic action that we add to our grammar.
We can get fancier later if we need to, but for now, let’s define our emit function as an alias to printf
.
<em>#define emit printf </em>
Thus, we'll spit assembly instructions directly to the standard output channel, which we can pipe to a file if needed. And now, onto the semantic actions. Each time we encounter a number, we push it onto the stack:
<em>"number"</em> { emit(<em>" push $<b>%d</b>\n"</em>, $1); }
The order of when each semantic action is firing matters. So, when we encounter an operation, like addition, the two inner expressions have already been emitted. Thus, we can expect their values to be at the top of the stack. What we can do is, pop the values into some registers, perform the addition (in this case), and push the resulting value back onto the stack:
| expr <em>"+"</em> expr { emit(<em>" pop %%rax\n"</em>); emit(<em>" pop %%rbx\n"</em>); emit(<em>" add %%rbx, %%rax\n"</em>); emit(<em>" push %%rax\n"</em>); }
We need to use double percent signs for registers since this is a printf
format string.
We do the same for multiplication, except that the accumulator register %rax
is an implicit parameter of the mul
instruction.
| expr <em>"*"</em> expr { emit(<em>" pop %%rax\n"</em>); emit(<em>" pop %%rbx\n"</em>); emit(<em>" mul %%rbx\n"</em>); emit(<em>" push %%rax\n"</em>); }
What do we do when we encounter parethesis? We do nothing, since the inner expression is already emitted.
| <em>"("</em> expr <em>")"</em> { /* No action. */ }
Now, we can generate assembly snippets given an arithmetic expression.
However, a bunch of pushes and pops don’t make for a complete assembly listing.
We need a main
(assuming we link to libc
) or a _start
.
We can use a mid-rule
semantic action to generate our main
label:
input: { emit(<em>".global main\n"</em>); emit(<em>"main:\n"</em>); } expr { emit(<em>" pop %%rax\n"</em>); emit(<em>" ret\n"</em>); }
As the final semantic action, we pop the only expected value and return it from main
as the exit code of our program.
Now, if we feed this parser our original expression:
4 + 2 * 10 + 3 * (5 + 1)
It will emit the following assembly listing:
.global main main: push $4 push $2 push $10 pop %rax pop %rbx mul %rbx push %rax pop %rax pop %rbx add %rbx, %rax push %rax push $3 push $5 push $1 pop %rax pop %rbx add %rbx, %rax push %rax pop %rax pop %rbx mul %rbx push %rax pop %rax pop %rbx add %rbx, %rax push %rax pop %rax ret
If we pipe it into a file called test.s
, assemble it and execute it, the program will produce no output, however, we can check the exit code:
$ cc test.s -o test $ ./test $ echo $? 42
Which is the result of our arithmetic expression!
This is pretty much all we can cover in a short blog post.
I can imagine implementing variables by pushing their values onto the stack and remembering their stack offsets in a hash table… Something similar for functions’ signatures… It looks like a lot of global mutable state is needed for a one-pass compiler to work.
How do you optimize something like this? An optimizing compiler would constant-fold our expression into a single number ahead of time. Still, in one-pass case you only get a very myopic view of the code.
The Code
The complete code, including lexer, parser, and a makefile is available on GitHub as a gist . ■
Did you know, I’m writing a book about compilers? Unfortunately, the compiler described in the book is not one-pass. I know, right? But it’s a two-pass compiler that produces ARM assembly code. The book even teaches you enough assembly programming to get going. The examples are written in TypeScript, but it can be followed in any language, so check it out!
以上就是本文的全部内容,希望对大家的学习有所帮助,也希望大家多多支持 码农网
猜你喜欢:本站部分资源来源于网络,本站转载出于传递更多信息之目的,版权归原作者或者来源机构所有,如转载稿涉及版权问题,请联系我们。