One-pass Compiler Primer

栏目: IT技术 · 发布时间: 4年前

内容简介:One-pass CompilerLet’s look at what is a one-pass compiler and try to implement one.

One-pass Compiler

Primer

Vladimir Keleshev • 2020-05-21

Let’s look at what is a one-pass compiler and try to implement one.

A one-pass compiler emits assembly (or binary code) right during parsing, without creating an intermediate representation, such as an AST. This is a rare technique that was used back in the days when computer memory was scarce. This limited both the language features that were possible and the quality of the produced code. But this techniques produced fast compilers that made Bill Gates envy.

Turbo Pascal

A notable example of a one-pass compiler is Turbo Pascal.

Fast compilation speed that the one-pass compiler architecture enabled is often cited as the decisive factor in the success of Turbo Pascal.

From Wikipedia :

Bill Gates saw the success of Turbo Pascal “in very personal terms, and ‘couldn’t understand why [Microsoft’s] stuff was so slow. He would bring in Greg Whitten [programming director of Microsoft languages] and yell at him for half an hour.’ He couldn’t understand why Kahn had been able to beat an established competitor like Microsoft .

Compiler

Let’s make a simple one-pass compiler. Not for a whole programming language, but just for simple arithmetic expressions, like the following:

4 + 2 * 10 + 3 * (5 + 1)

We’ll target x86-64 and will use flex and bison for generating our lexer and parser, respectively. I’ve used the Wikipedia bison example as an inspiration.

We start with defining our token type in the “yacc” file:

%token TOKEN_LPAREN <em>"("</em>
%token TOKEN_RPAREN <em>")"</em>
%token TOKEN_PLUS   <em>"+"</em>
%token TOKEN_STAR   <em>"*"</em>
%token <int> TOKEN_NUMBER <em>"number"</em>

Then we go onto defining our simple lexer in the “lex” file:

[ \r\n\t]*   { continue; /* Skip blanks. */ }
[0-9]+       { sscanf(yytext, <em>"<b>%d</b>"</em>⁠, &yylval->value);
               return TOKEN_NUMBER; }
<em>"*"</em>          { return TOKEN_STAR; }
<em>"+"</em>          { return TOKEN_PLUS; }
<em>"("</em>          { return TOKEN_LPAREN; }
<em>")"</em>          { return TOKEN_RPAREN; }
.            { continue; /* Skip unexpected characters. */ }

And now, the grammar for our expression language. Let’s leave out the semantic actions, for now.

input:
  expr

expr:
    <em>"number"</em>
  | expr <em>"+"</em> expr
  | expr <em>"*"</em> expr
  | <em>"("</em> expr <em>")"</em>

Above the grammar we specify the operators in order of increasing precedence:

%left <em>"+"</em>
%left <em>"*"</em>

They are both left-associative: 2 + 3 + 4 means ((2 + 3) + 4). This is not a very important property for addition and multiplication, since the operations themselves are associative.

We need to directly emit some assembly code in each semantic action that we add to our grammar. We can get fancier later if we need to, but for now, let’s define our emit function as an alias to printf .

<em>#define emit printf
</em>

Thus, we'll spit assembly instructions directly to the standard output channel, which we can pipe to a file if needed. And now, onto the semantic actions. Each time we encounter a number, we push it onto the stack:

<em>"number"</em> { emit(<em>"  push $<b>%d</b>\n"</em>⁠, $1); }

The order of when each semantic action is firing matters. So, when we encounter an operation, like addition, the two inner expressions have already been emitted. Thus, we can expect their values to be at the top of the stack. What we can do is, pop the values into some registers, perform the addition (in this case), and push the resulting value back onto the stack:

| expr <em>"+"</em> expr  { emit(<em>"  pop %%rax\n"</em>);
                     emit(<em>"  pop %%rbx\n"</em>);
                     emit(<em>"  add %%rbx, %%rax\n"</em>);
                     emit(<em>"  push %%rax\n"</em>); }

We need to use double percent signs for registers since this is a printf format string.

We do the same for multiplication, except that the accumulator register %rax is an implicit parameter of the mul instruction.

| expr <em>"*"</em> expr  { emit(<em>"  pop %%rax\n"</em>);
                     emit(<em>"  pop %%rbx\n"</em>);
                     emit(<em>"  mul %%rbx\n"</em>);
                     emit(<em>"  push %%rax\n"</em>); }

What do we do when we encounter parethesis? We do nothing, since the inner expression is already emitted.

| <em>"("</em> expr <em>")"</em>   { /* No action. */ }

Now, we can generate assembly snippets given an arithmetic expression. However, a bunch of pushes and pops don’t make for a complete assembly listing. We need a main (assuming we link to libc ) or a _start .

We can use a mid-rule semantic action to generate our main label:

input:
                     { emit(<em>".global main\n"</em>);
                       emit(<em>"main:\n"</em>); }
    expr             { emit(<em>"  pop %%rax\n"</em>);
                       emit(<em>"  ret\n"</em>); }

As the final semantic action, we pop the only expected value and return it from main as the exit code of our program.

Now, if we feed this parser our original expression:

4 + 2 * 10 + 3 * (5 + 1)

It will emit the following assembly listing:

.global main
main:
  push $4
  push $2
  push $10
  pop %rax
  pop %rbx
  mul %rbx
  push %rax
  pop %rax
  pop %rbx
  add %rbx, %rax
  push %rax
  push $3
  push $5
  push $1
  pop %rax
  pop %rbx
  add %rbx, %rax
  push %rax
  pop %rax
  pop %rbx
  mul %rbx
  push %rax
  pop %rax
  pop %rbx
  add %rbx, %rax
  push %rax
  pop %rax
  ret

If we pipe it into a file called test.s , assemble it and execute it, the program will produce no output, however, we can check the exit code:

$ cc test.s -o test
$ ./test
$ echo $?
42

Which is the result of our arithmetic expression!

This is pretty much all we can cover in a short blog post.

I can imagine implementing variables by pushing their values onto the stack and remembering their stack offsets in a hash table… Something similar for functions’ signatures… It looks like a lot of global mutable state is needed for a one-pass compiler to work.

How do you optimize something like this? An optimizing compiler would constant-fold our expression into a single number ahead of time. Still, in one-pass case you only get a very myopic view of the code.

The Code

The complete code, including lexer, parser, and a makefile is available on GitHub as a gist .

Did you know, I’m writing a book about compilers? Unfortunately, the compiler described in the book is not one-pass. I know, right? But it’s a two-pass compiler that produces ARM assembly code. The book even teaches you enough assembly programming to get going. The examples are written in TypeScript, but it can be followed in any language, so check it out!


以上就是本文的全部内容,希望对大家的学习有所帮助,也希望大家多多支持 码农网

查看所有标签

猜你喜欢:

本站部分资源来源于网络,本站转载出于传递更多信息之目的,版权归原作者或者来源机构所有,如转载稿涉及版权问题,请联系我们

趋势红利

趋势红利

刘润 / 文化发展出版社(原印刷工业出版社) / 2016-6-1 / 45.00

【编辑推荐】 1、国内顶尖的互联网转型专家,海尔、百度等知名企业战略顾问刘润送给传统企业的转型、创新“导航仪”,这个时代企业家的必修课 站在近200年商业全景图角度,刘润发现三种企业类型(产品型、渠道型、营销型),针对不同企业类型定制转型战略(找到自己的未来红利),方便 传统企业对号入座:不走错路就是节省时间,适合自己的最有效率。 本书内容还源自芬尼克兹、红领集团、名创优品、必要......一起来看看 《趋势红利》 这本书的介绍吧!

Base64 编码/解码
Base64 编码/解码

Base64 编码/解码

UNIX 时间戳转换
UNIX 时间戳转换

UNIX 时间戳转换

HEX HSV 转换工具
HEX HSV 转换工具

HEX HSV 互换工具