Some of you may not be aware, but I'm currently rewriting the GDScript implementation. It was discussed during the last Godot Sprint in Brussels and the core developers approved the idea.
The main rationale for the rewrite is that the code has changed so much, accumulating feature after feature, that it has become quite hard to understand. This is also an opportunity to clean up, tidy up, and modernize the code, as has happened with other parts of the engine that were rewritten in the past few years.
The objective is to have more "textbook-like" compiler code, which will ease maintenance both for those already acquainted with the Godot source and for newcomers who have studied compilers before. Hopefully this will make it harder to break something, given how critical this code is for engine users.
Why not a parser generator?
I've seen this question more than once: why can't we use a parser generator? Then we would just need to write a grammar specification and the generator would produce the whole parser for us.
The main problem with parser generators, and the main reason most modern language compilers don't use them, is that it is incredibly hard to provide meaningful error messages to the users of the language. With a hand-made parser we can inject checks, such as properly validating the path you put in a preload statement, among other things.
The GDScript grammar is also pretty much settled (except maybe for a few of the new features), so we wouldn't benefit from prototyping and iterating on a grammar definition, which is the main selling point of parser generators.
Rewriting the tokenizer
The first step when writing a compiler is to make the tokenizer (also called a scanner by some authors). Its responsibility is to read the characters of the source code and bundle them together into meaningful chunks called tokens. Each token carries information about what it means and where it is located. This makes the following compilation steps easier to manage, since they don't have to deal with minutiae such as comments and whitespace, which are not meaningful for the final execution.
I'm taking this opportunity to make the tokenizer a bit smarter (and maybe a bit dumber in some regards) in order to make the parsing simpler (I'll write about the parser when I get there). The new tokenizer emits special tokens when it detects an indentation change, as well as newline tokens. Since GDScript is indentation-based, this helps the parser identify the start and end of blocks. This approach is also used by Python, so I'm hardly being revolutionary here.
I'm also rewriting the --test gd* commands so they work with the new code. Those are very useful during this rewrite since they allow me to test each step of the compiler without needing to have everything ready. I can make sure the tokenizer is working properly before moving on to the parser.
What is a token?
A token is a data container that describes a group of characters in the source code as something meaningful for the programming language. The main thing to store is the token type (e.g. a symbol, an identifier, a keyword, etc.), which is a value taken from an enum, since there is only a limited variety of them. This is used by the parser to understand which statement to expect.
Another thing is the potential associated data. If it's, say, a + sign, the type Token::PLUS is enough to recognize it, but a Token::IDENTIFIER needs to carry the actual identifier being referred to. Literal values, such as numbers and strings, also need to be stored. The new tokenizer uses a single Variant to store this, since that is enough for every case.
It's also interesting to store positional information about the token: the line and column where it starts and ends. This is not needed for interpreting the code, but it is incredibly valuable when pointing out errors to the user.
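To make this concrete, here is a rough sketch of what such a token container could look like. This is illustrative C++ only, not the actual GDScript tokenizer code, and all the names in it are made up:

#include <string>
#include <variant>

// Hypothetical token kinds; the real tokenizer has many more.
enum class TokenType {
    PLUS,        // operators carry no extra data, the type is enough
    IDENTIFIER,  // needs the identifier's name
    LITERAL,     // needs the literal value (number, string, ...)
    NEWLINE,
    INDENT,
    DEDENT,
    TK_EOF,
};

struct Token {
    TokenType type = TokenType::TK_EOF;

    // Associated data when the type alone is not enough. The new tokenizer
    // uses a single Variant for this; std::variant plays a similar role here.
    std::variant<std::monostate, std::string, double, long long> value;

    // Positional information, used only for error reporting.
    int start_line = 0, start_column = 0;
    int end_line = 0, end_column = 0;
};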
Features of the new tokenizer
As mentioned, part of the goal of the rewrite is to make the tokenizer smarter, so let me show some of the new features.
Storing correct positional data
The old tokenizer stored only a line and a column per token, but many tokens span multiple columns and some span multiple lines. For example, a snippet like this one:
var x = "string"
x += """\
Multiline String
"""
Can now be tokenized like this:
0001 var x = "string"
     ^^^
--> var
-------------------------------------------------------
0001 var x = "string"
         ^
--> Identifier(StringName) x
-------------------------------------------------------
0001 var x = "string"
           ^
--> =
-------------------------------------------------------
0001 var x = "string"
             ^^^^^^^^
--> Literal(String) string
-------------------------------------------------------
0001 var x = "string"
                     ^
--> Newline
-------------------------------------------------------
0002 x += """\
     ^
--> Identifier(StringName) x
-------------------------------------------------------
0002 x += """\
       ^^
--> +=
-------------------------------------------------------
0002 x += """\
0003 Multiline String
0004 """
     ^^^^^^^^^^^^^^^^
--> Literal(String) Multiline String
-------------------------------------------------------
0004 """
        ^
--> Newline
-------------------------------------------------------
EOF
This is actually the output of the new tokenizer test. It reprints the relevant source lines and underlines the whole span of each token. This is meant to improve error messages by pointing the user to the exact spot.
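Just to illustrate the idea (this is not the actual test code, and the formatting details are assumptions), underlining a span that fits on a single line boils down to something like this:

#include <cstdio>
#include <string>

// Print a source line with carets under a token's column span.
// Columns are 1-based and inclusive, mirroring the output above.
void print_span(int line_number, const std::string &line, int start_col, int end_col) {
    std::printf("%04d %s\n", line_number, line.c_str());
    std::string underline(line.size(), ' ');
    for (int col = start_col; col <= end_col && col <= (int)line.size(); col++) {
        underline[col - 1] = '^';
    }
    std::printf("     %s\n", underline.c_str());
}

// Example: print_span(1, "var x = \"string\"", 1, 3) underlines the "var" keyword.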
Creating indentation and newline tokens
While the previous tokenizer kept track of indentation characters and newlines, it left it to the parser to decide the current indentation level and whether or not a newline was relevant. The new one generates a newline token only when the line is not empty, and also emits dedicated tokens for indentation changes. For example:
"start" "indent" "more indent" "two dedents"
Is tokenized like this:
0001 "start" ^^^^^^^ --> Literal(String) start ------------------------------------------------------- 0001 "start" ^ --> Newline ------------------------------------------------------- 0003 "indent" ^^^^ --> Indent ------------------------------------------------------- 0003 "indent" ^^^^^^^^ --> Literal(String) indent ------------------------------------------------------- 0003 "indent" ^ --> Newline ------------------------------------------------------- 0004 "more indent" ^^^^^^^^ --> Indent ------------------------------------------------------- 0004 "more indent" ^^^^^^^^^^^^^ --> Literal(String) more indent ------------------------------------------------------- 0004 "more indent" ^ --> Newline ------------------------------------------------------- 0005 "two dedents" ^ --> Dedent ------------------------------------------------------- 0005 "two dedents" ^ --> Dedent ------------------------------------------------------- 0005 "two dedents" ^^^^^^^^^^^^^ --> Literal(String) two dedents ------------------------------------------------------- 0005 "two dedents" ^ --> Newline ------------------------------------------------------- EOF
You can see that it detects a decrease of two indentation levels by emitting two Dedent tokens. It also doesn't emit a newline token for line 2 because that line is empty.
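The usual way to achieve this (and what Python does too) is to keep a stack of indentation levels: a deeper line pushes a level and emits an Indent token, a shallower one pops levels and emits one Dedent per popped level. Here is a hedged sketch of that idea; the names and the containers are mine, not the actual implementation:

#include <vector>

// Token kinds relevant to indentation (subset, for illustration only).
enum class Tok { INDENT, DEDENT };

std::vector<int> indent_stack = { 0 }; // indentation levels, innermost last
std::vector<Tok> pending_tokens;       // tokens produced for the current line

// Called once per non-empty line with that line's indentation width.
void handle_indentation(int indent_width) {
    if (indent_width > indent_stack.back()) {
        // Deeper than the current level: open one new block.
        indent_stack.push_back(indent_width);
        pending_tokens.push_back(Tok::INDENT);
    } else {
        // Shallower (or equal): close blocks until the levels match again.
        while (indent_width < indent_stack.back()) {
            indent_stack.pop_back();
            pending_tokens.push_back(Tok::DEDENT); // one Dedent per closed level
        }
        // If indent_width still doesn't equal indent_stack.back(), the
        // indentation is mismatched and an error is reported, as shown in
        // the next section.
    }
}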
Error reporting and leniency
The old tokenizer allowed for only one error; after that, it would just keep returning the same error whenever you asked for a new token. The new one returns the error but keeps scanning, so many errors can be detected at once. So this:
"invalid escape \h <- here" "indent" "mismatched unindent"
Gives this:
0001 "invalid escape \h <- here" ^^^^^^^^^^^^^^^^^^^^^^^^^^^ --> Literal(String) invalid escape <- here ------------------------------------------------------- 0001 "invalid escape \h <- here" ^^ --> Error(String) Invalid escape in string. ------------------------------------------------------- 0001 "invalid escape \h <- here" ^ --> Newline ------------------------------------------------------- 0002 "indent" ^^^^^^^^ --> Indent ------------------------------------------------------- 0002 "indent" ^^^^^^^^ --> Literal(String) indent ------------------------------------------------------- 0002 "indent" ^ --> Newline ------------------------------------------------------- 0003 "mismatched unindent" ^^^^^ --> Error(String) Unindent doesn't match the previous indentation level. ------------------------------------------------------- 0003 "mismatched unindent" ^^^^^ --> Dedent ------------------------------------------------------- 0003 "mismatched unindent" ^^^^^^^^^^^^^^^^^^^^^ --> Literal(String) mismatched unindent ------------------------------------------------------- 0003 "mismatched unindent" ^ --> Newline ------------------------------------------------------- 0004 ^ --> Dedent ------------------------------------------------------- EOF
You may notice that the error in the string comes after the string token itself. That's okay because the parser will collect all errors and report them sorted by position.
The indentation level is also correctly maintained even in case of a mismatch (that's why there are two Dedent tokens for a single Indent). This helps keep the parsing lenient in some special cases, which I'll talk more about when I write about the parser.
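The gist of the error handling, in a hedged sketch (illustrative names only, not the real code): instead of aborting, the tokenizer records an error with its own position and keeps scanning from where it is.

#include <string>
#include <vector>

// Minimal error record; the real tokenizer stores errors as tokens with
// positional data, like any other token.
struct ErrorToken {
    std::string message;
    int line = 0;
    int column = 0;
};

std::vector<ErrorToken> errors;

// Record a problem without stopping the scan. For example, an invalid
// escape like \h inside a string pushes an error pointing at the escape,
// while the string literal itself is still produced.
void push_error(const std::string &message, int line, int column) {
    errors.push_back({ message, line, column });
    // ...scanning continues from the current position...
}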
Recognizing extra tokens
Some characters and sequences aren't allowed in GDScript source code, but they can still be tokenized so that a better error message can be provided. Example:
? ` <<<<<<< ======= >>>>>>>
Result:
0001 ? ` <<<<<<< ======= >>>>>>>
     ^
--> ?
-------------------------------------------------------
0001 ? ` <<<<<<< ======= >>>>>>>
       ^
--> `
-------------------------------------------------------
0001 ? ` <<<<<<< ======= >>>>>>>
         ^^^^^^^
--> VCS conflict marker
-------------------------------------------------------
0001 ? ` <<<<<<< ======= >>>>>>>
                 ^^^^^^^
--> VCS conflict marker
-------------------------------------------------------
0001 ? ` <<<<<<< ======= >>>>>>>
                         ^^^^^^^
--> VCS conflict marker
-------------------------------------------------------
0001 ? ` <<<<<<< ======= >>>>>>>
                                ^
--> Newline
-------------------------------------------------------
EOF
The question mark can be recognized as an attempt to use the C-style ternary conditional operator, so we can show the user the correct operator for GDScript.
We also recognize conflict markers so you can easily notice that you have a merge conflict in the file.
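Detecting a conflict marker is simple enough; a hedged sketch (a made-up helper, not the real function) could look like this:

#include <string>

// Returns true if the source at position `pos` starts a VCS conflict marker:
// a run of seven identical '<', '=' or '>' characters.
bool is_conflict_marker(const std::string &source, size_t pos) {
    char c = source[pos];
    if (c != '<' && c != '=' && c != '>') {
        return false;
    }
    for (size_t i = 1; i < 7; i++) {
        if (pos + i >= source.size() || source[pos + i] != c) {
            return false;
        }
    }
    return true;
}

// When this matches, the tokenizer can emit a dedicated token so the error
// message can mention a merge conflict instead of a generic parse error.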
What comes next
Now that the tokenizer is pretty much done, I'll start rewriting the parser. The parser is responsible for reading the sequence of tokens and making sense of them, deciding whether each construct is a function, a variable, an expression, and so on. I'll write more about it when I have something to show.