One-pass Compiler Primer

栏目: IT技术 · 发布时间: 4年前

内容简介:One-pass CompilerLet’s look at what is a one-pass compiler and try to implement one.

One-pass Compiler

Primer

Vladimir Keleshev • 2020-05-21

Let’s look at what is a one-pass compiler and try to implement one.

A one-pass compiler emits assembly (or binary code) right during parsing, without creating an intermediate representation, such as an AST. This is a rare technique that was used back in the days when computer memory was scarce. This limited both the language features that were possible and the quality of the produced code. But this techniques produced fast compilers that made Bill Gates envy.

Turbo Pascal

A notable example of a one-pass compiler is Turbo Pascal.

Fast compilation speed that the one-pass compiler architecture enabled is often cited as the decisive factor in the success of Turbo Pascal.

From Wikipedia :

Bill Gates saw the success of Turbo Pascal “in very personal terms, and ‘couldn’t understand why [Microsoft’s] stuff was so slow. He would bring in Greg Whitten [programming director of Microsoft languages] and yell at him for half an hour.’ He couldn’t understand why Kahn had been able to beat an established competitor like Microsoft .

Compiler

Let’s make a simple one-pass compiler. Not for a whole programming language, but just for simple arithmetic expressions, like the following:

4 + 2 * 10 + 3 * (5 + 1)

We’ll target x86-64 and will use flex and bison for generating our lexer and parser, respectively. I’ve used the Wikipedia bison example as an inspiration.

We start with defining our token type in the “yacc” file:

%token TOKEN_LPAREN <em>"("</em>
%token TOKEN_RPAREN <em>")"</em>
%token TOKEN_PLUS   <em>"+"</em>
%token TOKEN_STAR   <em>"*"</em>
%token <int> TOKEN_NUMBER <em>"number"</em>

Then we go onto defining our simple lexer in the “lex” file:

[ \r\n\t]*   { continue; /* Skip blanks. */ }
[0-9]+       { sscanf(yytext, <em>"<b>%d</b>"</em>⁠, &yylval->value);
               return TOKEN_NUMBER; }
<em>"*"</em>          { return TOKEN_STAR; }
<em>"+"</em>          { return TOKEN_PLUS; }
<em>"("</em>          { return TOKEN_LPAREN; }
<em>")"</em>          { return TOKEN_RPAREN; }
.            { continue; /* Skip unexpected characters. */ }

And now, the grammar for our expression language. Let’s leave out the semantic actions, for now.

input:
  expr

expr:
    <em>"number"</em>
  | expr <em>"+"</em> expr
  | expr <em>"*"</em> expr
  | <em>"("</em> expr <em>")"</em>

Above the grammar we specify the operators in order of increasing precedence:

%left <em>"+"</em>
%left <em>"*"</em>

They are both left-associative: 2 + 3 + 4 means ((2 + 3) + 4). This is not a very important property for addition and multiplication, since the operations themselves are associative.

We need to directly emit some assembly code in each semantic action that we add to our grammar. We can get fancier later if we need to, but for now, let’s define our emit function as an alias to printf .

<em>#define emit printf
</em>

Thus, we'll spit assembly instructions directly to the standard output channel, which we can pipe to a file if needed. And now, onto the semantic actions. Each time we encounter a number, we push it onto the stack:

<em>"number"</em> { emit(<em>"  push $<b>%d</b>\n"</em>⁠, $1); }

The order of when each semantic action is firing matters. So, when we encounter an operation, like addition, the two inner expressions have already been emitted. Thus, we can expect their values to be at the top of the stack. What we can do is, pop the values into some registers, perform the addition (in this case), and push the resulting value back onto the stack:

| expr <em>"+"</em> expr  { emit(<em>"  pop %%rax\n"</em>);
                     emit(<em>"  pop %%rbx\n"</em>);
                     emit(<em>"  add %%rbx, %%rax\n"</em>);
                     emit(<em>"  push %%rax\n"</em>); }

We need to use double percent signs for registers since this is a printf format string.

We do the same for multiplication, except that the accumulator register %rax is an implicit parameter of the mul instruction.

| expr <em>"*"</em> expr  { emit(<em>"  pop %%rax\n"</em>);
                     emit(<em>"  pop %%rbx\n"</em>);
                     emit(<em>"  mul %%rbx\n"</em>);
                     emit(<em>"  push %%rax\n"</em>); }

What do we do when we encounter parethesis? We do nothing, since the inner expression is already emitted.

| <em>"("</em> expr <em>")"</em>   { /* No action. */ }

Now, we can generate assembly snippets given an arithmetic expression. However, a bunch of pushes and pops don’t make for a complete assembly listing. We need a main (assuming we link to libc ) or a _start .

We can use a mid-rule semantic action to generate our main label:

input:
                     { emit(<em>".global main\n"</em>);
                       emit(<em>"main:\n"</em>); }
    expr             { emit(<em>"  pop %%rax\n"</em>);
                       emit(<em>"  ret\n"</em>); }

As the final semantic action, we pop the only expected value and return it from main as the exit code of our program.

Now, if we feed this parser our original expression:

4 + 2 * 10 + 3 * (5 + 1)

It will emit the following assembly listing:

.global main
main:
  push $4
  push $2
  push $10
  pop %rax
  pop %rbx
  mul %rbx
  push %rax
  pop %rax
  pop %rbx
  add %rbx, %rax
  push %rax
  push $3
  push $5
  push $1
  pop %rax
  pop %rbx
  add %rbx, %rax
  push %rax
  pop %rax
  pop %rbx
  mul %rbx
  push %rax
  pop %rax
  pop %rbx
  add %rbx, %rax
  push %rax
  pop %rax
  ret

If we pipe it into a file called test.s , assemble it and execute it, the program will produce no output, however, we can check the exit code:

$ cc test.s -o test
$ ./test
$ echo $?
42

Which is the result of our arithmetic expression!

This is pretty much all we can cover in a short blog post.

I can imagine implementing variables by pushing their values onto the stack and remembering their stack offsets in a hash table… Something similar for functions’ signatures… It looks like a lot of global mutable state is needed for a one-pass compiler to work.

How do you optimize something like this? An optimizing compiler would constant-fold our expression into a single number ahead of time. Still, in one-pass case you only get a very myopic view of the code.

The Code

The complete code, including lexer, parser, and a makefile is available on GitHub as a gist .

Did you know, I’m writing a book about compilers? Unfortunately, the compiler described in the book is not one-pass. I know, right? But it’s a two-pass compiler that produces ARM assembly code. The book even teaches you enough assembly programming to get going. The examples are written in TypeScript, but it can be followed in any language, so check it out!


以上就是本文的全部内容,希望对大家的学习有所帮助,也希望大家多多支持 码农网

查看所有标签

猜你喜欢:

本站部分资源来源于网络,本站转载出于传递更多信息之目的,版权归原作者或者来源机构所有,如转载稿涉及版权问题,请联系我们

大数据时代

大数据时代

[英] 维克托•迈尔•舍恩伯格(Viktor Mayer-Schönberger) / 周涛 / 浙江人民出版社 / 2012-12 / 49.90元

《大数据时代》是国外大数据研究的先河之作,本书作者维克托•迈尔•舍恩伯格被誉为“大数据商业应用第一人”,拥有在哈佛大学、牛津大学、耶鲁大学和新加坡国立大学等多个互联网研究重镇任教的经历,早在2010年就在《经济学人》上发布了长达14页对大数据应用的前瞻性研究。 维克托•迈尔•舍恩伯格在书中前瞻性地指出,大数据带来的信息风暴正在变革我们的生活、工作和思维,大数据开启了一次重大的时代转型,并用三......一起来看看 《大数据时代》 这本书的介绍吧!

随机密码生成器
随机密码生成器

多种字符组合密码

XML、JSON 在线转换
XML、JSON 在线转换

在线XML、JSON转换工具

UNIX 时间戳转换
UNIX 时间戳转换

UNIX 时间戳转换