Chapter 2. Parsing a Number

栏目: IT技术 · 发布时间: 4年前

内容简介:In the following two chapters, we will temporary leave the compiler created in the previous chapter, and will be working on a separate helper project, a calculator. It is an interesting thing on its own, and we will explore it on an isolated example. In th

Chapter 2. Parsing a Number This is a chapter from

Creating a compiler with Raku

In the following two chapters, we will temporary leave the compiler created in the previous chapter, and will be working on a separate helper project, a calculator. It is an interesting thing on its own, and we will explore it on an isolated example. In the next chapters, it will be integrated to the interpreter.

Finding a number

Calculators work with numbers, so the first thing we’ll do will be a parser that parses numbers. In the previous chapter, we only worked with non-negative integers, but there are much more number formats that a good calculator has to understand. It includes, for example, negative numbers and floating-point numbers, which can be also written in scientific notation. We also have to allow the cases when people omit zeros before the decimal point and type .5 instead of 0.5 .

Let’s iteratively create the parser for different kinds of numbers. As we are going to make a lot of changes in the grammar, a test suite comes for the rescue.

my @cases = 7, 77, -84;
for @cases -> $number {
    say "Test $number";
    say Number.parse($number);
}

The @cases array contains a list of numbers that will be tested against the  Number grammar. Here is its first version:

grammar Number {
    rule TOP {
        <number>
    }

    token number {
        \d+
    }
}

The whole grammar expects a single number, which is a sequence of digits. It works with our current test cases, and the program’s output is quite predictable:

Test 7
「7」
 number => 「7」
Test 77
「77」
 number => 「77」
Test -84
Nil

The first two numbers pass the exam, but the third one does not. For a negative number, the parse method returns Nil .

So, we need to extend the number token and add an optional minus sign in front of the number. An important thing that it still has to be a token, as you normally do not expect a space between the sign and the number.

token number {
'-'? \d+
}

This small change makes another half of integer numbers, negative integers, valid. The above test will now pass.

Test -84
「-84」
 number => 「-84」

But what if you spell a positive number as +7

my @cases = 7, 77, -84, '+7', 0;

A grammar parser always consumes strings, but we can let Raku convert positive numbers such as 7 or 77 to avoid extra quoting in the list of test cases. It the case of +7 , we have to preserve the plus sign, so this test case is an explicit string.

This number is not a valid number according to the grammar. As it expects the whole string to be a number, we cannot expect that the plus sign will be simply ignored. The parser needs to know about the possible sign at the beginning, thus, we need to add it to the token body:

token number {
<[+-]>? \d+
}

Let us also reduce the noise in the output, and do not print the parse tree:

for @cases -> $number {
my $test = Number.parse($number);
say ($test ?? 'OK ' !! 'NOT OK ') ~ $number;
}

The output now is partially compatible with the TAP (Test Anything Protocol) which is widely used in testing Perl and Raku modules.

OK 7
OK 77
OK -84
OK +7
OK 0

Add a non-valid number and you’ll immediately see that it fails, for example:

NOT OK 3.14

The first attempt to implement a parser for the flowing-point numbers can be as simple as this:

token number {
    <[+-]>? \d+ ['.' \d+]?
}

We just added an optional decimal part. But what if you meet a number with a point but with no integer part? It will not pass the test, but it is easy to solve by changing \d+ to  \d* in the left part of the regex:

token number {
    <[+-]>? \d* ['.' \d+]?
}

Unfortunately, this change breaks the token, as it can be applied to a single sign or even to an empty string. All these test cases are now OK :

my @cases =
    7, 77, -84, '+7', 0,
    3.14, -2.78, 5.0, '.5',
    '', '-', '+';

It is a bit tricky to try to express all the options in a single regex. It is much easier to explicitly list all the alternatives. As we know that the sign can appear in front of any number, let us group the alternatives in square brackets:

token number {
    <[+-]>? [
        | \d+ 
        | \d* ['.' \d+]
    ]
}

This approach makes the whole token more readable and more extendable. We can add another alternative to match the numbers in scientific notation.

token number {
<[+-]>? [
| \d+
| \d* ['.' \d+]
| \d+ <[eE]> <[+-]>? \d+
]
}

More test cases pass the grammar now:

'3E4', '-33E55', '3E-3', '-1E-2'

What we are missing are the numbers in scientific notation with non-integer mantissa, e.g., 3.14E2 or  .5E-3 . Another alternative can solve this issue:

token number {
    <[+-]>? [
        | \d+
        | \d* ['.' \d+]
        | \d+ <[eE]> <[+-]>? \d+
<strong>       | \d* ['.' \d+] <[eE]> <[+-]>? \d+</strong>
    ]
}

In this form, there are parts that repeat, such as \d+ or  \d* ['.' \d+] . In such compact rules that may be fine, but it is also possible to decompose it further and to introduce sub-tokens that are responsible for such repeated parts. The transformed  number token and its family look like this:

token number {
    <sign>? [
        | <integer>
        | <floating-point>
        | <integer> <exponent>
        | <floating-point> <exponent>
    ]
}

token sign {
    <[+-]>
}

token exp {
    <[eE]>
}

token integer {
    \d+
}

token floating-point {
    \d* ['.' <integer>]
}

token exponent {
    <exp> <sign>? <integer>
}

While there is much more code comparing to the previous version, each individual token is simpler to understand. Compare, for example, the previous and the current regexes for a number in scientific notation having a floating point number in its mantissa. Earlier, it was:

\d* ['.' \d+] <[eE]> <[+-]>? \d+

After the transformation, it became:

<floating-point> <exponent>

Looking at the alternatives of the main token,

| <integer>
| <floating-point>
| <integer> <exponent>
| <floating-point> <exponent>

you immediately see what they describe. Even more, we can reduce it further:

token number {
    <sign>? [
        | <integer>
        | <floating-point>
    ] <exponent>?
}

In other words, a number is either an integer or a floating-point value, which is preceded by an optional sign and can be followed by an optional exponent part. In this form, the description is so clean and compact. All the details (the regex “noise”) are hidden in the supplementary tokens.

Getting the value

The number has been parsed but what is its value? For a compiler, we need not only to check the validity of the number format but also to convert it from a string to a number, integer or floating-point. In this section, we will append actions to the Number grammar so that we can build the number and finally print it.

Let us start with integers and keep the number in a global variable.

my $n = 0;
class NumberActions {
    method integer($/) {
        $n = +$/;
    }
}

Everything looks simple here. An integer value is derived directly from converting the matched piece of the string to a numeric value using the + prefix operator. To see how it works, let us change the main test loop so that it prints the parsed value:

for @cases -> $number {
    my $test = Number.parse($number, :actions(NumberActions));
    
    if ($test) {
        say "OK $number = $n";
    }
    else {
        say "NOT OK $number";
    }
}

With non-negative integer numbers, it works well:

OK 7 = 7
OK 77 = 77
OK -84 = 84
OK +7 = 7
OK 0 = 0

The negative number did not work. It looks like the +7 string was processed correctly, but actually that’s not fully true as we completely ignored the sign. This time, the task is a bit more complicated. The first idea is to flip the sign if we meet the minus sign:

method sign($/) {
    $n *= -1 if ~$/ eq '-';
}

That does not work though, as the sign is parsed before the digits are parsed, and negating $n means applying a sign to zero. We can use a separate variable to keep the information about the sign but that’s not the best thing probably. But let’s do that because this will reveal another problem.

my $n = 0;
my $sign = 1;

class NumberActions {
method integer($/) {
$n = $sign * +$/;
}

method sign($/) {
$sign = -1 if ~$/ eq '-';
}
}

This helps to detect the first negative number, but it breaks all the rest. Of course, you can check if the sign was '+' but the problem is that the  $n and  $sign variables are global, and they must be reset before the next variable is parsed. This is a good time to move them into the actions class.

class NumberActions {
    has $.n = 0;
    has $!sign = 1;

    method integer($/) {
        $!n = $!sign * +$/;
    }

    method sign($/) {
        $!sign = -1 if ~$/ eq '-';
    }
}

The $n variable is intentionally made a public data member as we have to get the result somehow. You also need to change the test loop.

for @cases -> $number {
my $actions = NumberActions.new();
my $test = Number.parse($number, :actions($actions));

if ($test) {
say "OK $number = " ~ $actions.n;
}
else {
say "NOT OK $number";
}
}

The main change here is that you pass an instance of the NumberActions class to the parse method of the grammar. Now, in each iteration, the parser creates its own variables to keep the result.

We moved far enough to have all the integers be parsed correctly:

OK 7 = 7
OK 77 = 77
OK -84 = -84
OK +7 = 7
OK 0 = 0

For the floating-point numbers, it does not work as smoothly:

OK 3.14 = 14
OK -2.78 = -78
OK 5.0 = 0
OK .5 = 5
OK -5.3 = -3
OK -.3 = -3
OK 3E4 = 4
OK -33E55 = -55
OK 3E-3 = -3
OK 3.14E2 = 2
OK .5E-3 = -3

As you can see, either the decimal or the exponential part wins. In both cases, that was the last integer part that went through he grammar. Indeed, earlier, we transformed the grammar to factor out the repeating parts. The bell ran when we had to introduce the $sign variable but now we are suffering even more. All that needs to be handled differently. And here is how  AST can help.

Using AST

AST, or abstract syntax tree , is a mechanism that allows collecting and keeping data parsed at different stages. If the integer token is called twice when reading a floating-point number  3.14 , or if it was called three times for the number with an exponent, e.g.,  3.14E2 , all those integers can be kept in the AST and used later to build the value corresponding to the whole string.

There are two methods available for the $/ variable inside the methods of the actions class:  make and  made . With  make , you store a value (you assign an attribute to the current node of the parse tree). With  made , you read the earlier-stored value.

Add the following calls to the token actions:

method integer($/) {
    $/.make(+$/);
}

method sign($/) {
    $/.make(~$/ eq '-' ?? -1 !! 1);
}

The values are now saved even if the method is called more than once. To understand how that works, let us look inside the grammar object:

my $test = Number.parse($number, :actions($actions));
dd $test;

The dd routine is a Rakudo-specific tool that shows the inner structure of the object. For the input number  -84 , the following object is built after parsing:

Number $test = Match.new(list => (), hash => Map.new((:number(Match.new(list => (), hash => Map.new((:integer(Match.new(list => (), hash => Map.new(()), made => 84, orig => -84, pos => 3, from => 1)),:sign(Match.new(list => (), hash => Map.new(()), made => -1, orig => -84, pos => 1, from => 0)))), made => Any, orig => -84, pos => 3, from => 0)))), made => Any, orig => -84, pos => 3, from => 0)

Looks messy but you should be able to spot the two places we are most interested in:

made => 84, orig => -84, pos => 3, from => 1
made => -1, orig => -84, pos => 1, from => 0

The from and  pos keys are set to point to the first character and to the character after the last one that matches the regex. So, the first of these two sub-hashes are the results of parsing the digits (from position 1 to position 3 in the string  '-84' , thus  84 ). The second hash corresponds to the minus character (positions 0 to 1 in the same string).

The made attribute is set to 84 and −1 respectively, which confirms that the grammar was able to parse the number and its sign correctly.

These values can now be used to make the result in the parent token.

method number($/) {
    my $n = $<integer>.made;
    $n *= $<sign>.made if $<sign>;
    $/.make($n);
}

It accesses the integer value and the sign multiplier via the made attributes of the  $<integer> and the  $<sign> objects. The last line passes the result to the next level, and you can access it from the  TOP rule:

method TOP($/) {
    $/.make($<number>.made);
}

For the current task of parsing numbers, the TOP rule and method could be completely replaced with the content of the  number token and the action method (notice that  TOP is now a token, not a rule):

grammar Number {
token TOP {
<sign>? [
| <integer>
| <floating-point>
] <exponent>?
}

. . .
}

class NumberActions {
method TOP($/) {
my $n = $<integer>.made;
$n *= $<sign>.made if $<sign>;
$/.make($n);
}

. . .
}

If you look at the object returned by the parse method, you will see that it contains the following fields:

from => 0, orig => -84, made => -84, pos => 3

The made pair contains our desired negative integer. It is accessible from outside of the grammar and from outside of the actions using the same made attribute:

my $test = Number.parse($number, :actions($actions));

if ($test) {
say "OK $number = " ~ $test.made;
}

The task is complete, and we can move on to the integers with an exponential part, e. g., 3E4 . With such numbers, the  integer token is triggered twice, but that is not a problem, as both integer numbers reside in the corresponding  made attributes of  different objects.

Create an action to work with the exponential part:

method exponent($/) {
    my $e = $<integer>;
    $e *= -1 if $<sign> && ~$<sign> eq '-';
    $/.make($e);
}

And use the value to multiply the number:

method TOP($/) {
    my $n = $<integer>.made;
    $n *= $<sign>.made if $<sign>;
    <strong>$n *= 10 ** $<exponent>.made if $<exponent>;</strong>
    $/.make($n);
}

Run the test suite and examine what it produces for the numbers like 3E4 or  −1E−2 :

OK 3E4 = 30000
OK -33E55 =
-330000000000000000000000000000000000000000000000000000000
OK 3E-3 = 0.003
OK -1E-2 = -0.01

At this moment, the only token that has no associated action is floating-point . (The  exp token does not need any, as its only task is to match with either  e or  E ). Let’s look at it once again:

token floating-point {
    \d* ['.' <integer>]
}

When we were creating the token, we replaced the \d+ part with  <integer> . Actually, an optional sequence  \d* can be also replace with it:

token floating-point {
<integer>? ['.' <integer>]
}

There are two <integer> calls within the same token now! What do you write in the action in this case? It’s simple: if the name is mentioned more than once, you get an array and thus you can refer to the first occurrence as  $<integer>[0] , and as  $<integer>[1] to the second one.

The only problem is that in our case the first integer part is optional. If you parse 3.14 , you’ll get two elements as expected, but if you parse .14 , then 14 will go to the element with the index 0. One of the possible solutions is just to check the length of the array. Evaluating the value is a relatively simple task.

method floating-point($/) {
    my $int = 0;
    my $frac = 0;

    if $<integer>.elems == 2 {
        ($int, $frac) = $<integer>;
    }
    else {
        $frac = $<integer>[0];
    }

    my $n = $int + $frac / 10 ** $frac.chars;

    $/.make($n);
}

The TOP token also has to be updated to get the value of the floating-point number if it was parsed:

my $n = $<integer> ?? 
        $<integer>.made !! $<floating-point>.made;

The task seems to be solved. All the numbers, including those with a decimal point and an exponential part, are successfully handled:

OK 3.14 = 3.14
OK -2.78 = -2.78
OK 5.0 = 5
OK .5 = 0.5
OK -5.3 = -5.3
OK -.3 = -0.3

OK 3E-3 = 0.003

OK -1E-2 = -0.01

OK 3.14E2 = 314

OK .5E-3 = 0.0005

Final notes

In this chapter, we managed to convert strings to numbers, but what kind of numbers are they? To get the numbers, we were using operators that are available in Raku, such as binary + , arithmetic operators, and the power operator  ** . The result of the calculations that use all of those is a number that is an instance of one of the numeric types that Raku offers.

You can see the class name by explicitly printing it in the test loop:

say "OK $number = " ~ $test.made ~
    ' (' ~ $test.made.^name ~ ')';

All numbers that did not contain a decimal point became Ints, all the rest are Rat s:

OK 7 = 7 (Int)
OK 77 = 77 (Int)
OK -84 = -84 (Int)
OK +7 = 7 (Int)
OK 0 = 0 (Int)
OK 3.14 = 3.14 (Rat)
OK -2.78 = -2.78 (Rat)
OK 5.0 = 5 (Rat)
OK .5 = 0.5 (Rat)
OK -5.3 = -5.3 (Rat)
OK -.3 = -0.3 (Rat)
OK 3E4 = 30000 (Int)
OK -33E55 = -330000000000000000000000000000000000000000000000000000000 (Int)
OK 3E-3 = 0.003 (Rat)
OK -1E-2 = -0.01 (Rat)
OK 3.14E2 = 314 (Rat)
OK .5E-3 = 0.0005 (Rat)

Now, look again at the floating-point method in the actions class. Although its algorithm is straightforward and produces correct results, it is quite wordy and needs a few lines of code. Alternatively, you can pass this task to the host language itself! Let Raku parse the floating-point number for you:

my $n = +"$int.$frac";
$/.make($n);

Wait, what is "$int.$frac" ? It is a string that was matched by the  floating-point token during the parsing process, and it means that instead of reconstructing the string and converting it to a number, we can access it directly converting the  $/ object to a number directly:

method floating-point($/) {
    $/.make(+$/);
}

Does this code resemble something that you’ve already seen? The body of this method is exactly the same as the body of the integer method:

method integer($/) {
    $/.make(+$/);
}

Indeed, we allowed Raku to build the number for us when it contained digits only. We can delegate it again if we met a number with a decimal point, too.

But that’s not all. The numbers that our Number grammar allows are all valid Raku numbers, and it is possible to replace all our actions with a single line of code:

class NumberActions {
    method TOP($/) {
        $/.make(+$/);
    }
}

After this change, the types of the number differ a bit. Raku treats a number in the scientific notation as Num , not  Rat . You can confirm that buy running the test loop again:

OK 7 = 7 (Int)
OK 77 = 77 (Int)
OK -84 = -84 (Int)
OK +7 = 7 (Int)
OK 0 = 0 (Int)
OK 3.14 = 3.14 (Rat)
OK -2.78 = -2.78 (Rat)
OK 5.0 = 5 (Rat)
OK .5 = 0.5 (Rat)
OK -5.3 = -5.3 (Rat)
OK -.3 = -0.3 (Rat)
OK 3E4 = 30000 (Num)
OK -33E55 = -3.3e+56 (Num)
OK 3E-3 = 0.003 (Num)
OK -1E-2 = -0.01 (Num)
OK 3.14E2 = 314 (Num)
OK .5E-3 = 0.0005 (Num)

The output format here also depends on how Raku prints the numbers of different numerical types.

In this particular task, all our manual labour is replaced by the compiler actions in the host language. Of course, that was possible because we choose standard data formats that many programming languages can deal with. Do not be afraid to remove your own code as soon as you discovered a simpler way to solve the task. The techniques of working with AST that were demonstrated in this chapter, are the base for our future adventure in this book. Stay tuned!

P. S. An attentive reader may have noticed that the Number grammar does not include numbers such as 4. , where there is an integer part, a decimal point but no decimal part. These numbers are not allowed in Raku itself, so I did not include it to the grammar.


以上就是本文的全部内容,希望对大家的学习有所帮助,也希望大家多多支持 码农网

查看所有标签

猜你喜欢:

本站部分资源来源于网络,本站转载出于传递更多信息之目的,版权归原作者或者来源机构所有,如转载稿涉及版权问题,请联系我们

Python学习手册(第4版)

Python学习手册(第4版)

[美] Mark Lutz / 李军、刘红伟 / 机械工业出版社 / 2011-4 / 119.00元

Google和YouTube由于Python的高可适应性、易于维护以及适合于快速开发而采用它。如果你想要编写高质量、高效的并且易于与其他语言和工具集成的代码,《Python学习手册:第4 版》将帮助你使用Python快速实现这一点,不管你是编程新手还是Python初学者。本书是易于掌握和自学的教程,根据作者Python专家Mark Lutz的著名培训课程编写而成。 《Python学习手册:第......一起来看看 《Python学习手册(第4版)》 这本书的介绍吧!

HTML 编码/解码
HTML 编码/解码

HTML 编码/解码

MD5 加密
MD5 加密

MD5 加密工具

UNIX 时间戳转换
UNIX 时间戳转换

UNIX 时间戳转换