《编程语言实现模式》学习笔记（一）：LL(1)递归下降的词法分析器

栏目: 后端 · 前端 · 发布时间: 5年前

内容简介：在阅读的时候，人们会无意识地将字母合成单词，然后将合成的单词组成句子，之后考虑语法结构，最终明白句子的含义。在编译的过程中，编译器为了更好地进行和日常生活中的例子做类比，就如同我们阅读

一、什么是Lexer？

在阅读的时候，人们会无意识地将字母合成单词，然后将合成的单词组成句子，之后考虑语法结构，最终明白句子的含义。在编译的过程中，编译器为了更好地进行 语法分析 ，就也需要先将 字符流 识别分类为有意义的一个个单元，而这些单元便称之为 词法单元 ，编译器解析 字符流 到生成 词法单元 的过程便是 词法分析 ，即：

字符流 --> Lexer --词法分析--> Parser --语法分析--> AST

和日常生活中的例子做类比，就如同我们阅读 do jobs 这句话，我们会依次拿到 d ， o ，， j ， o ， b ， s 这么一堆字符，这些字符就好比 字符流 ，而经过 词法分析 ，我们会把这些字符组成 do 和 jobs 两个单词， do 和 jobs 就好比 词法单元 ，更进一步讲，他们可以表示成 <vt, 'do'> 和 <n, 'jobs'> ，而这么一个 词法单元 序列可经由 语法分析 器解析，判断是否是符合英语语法的句子，我们都知道 vt+n 是合法的语法结构，那么 语法分析器 就能够方便地利用 词法单元 里的信息来做解析了，这便是 词法分析 的意义所在，而 Lexer 便是 能处理输入字符流的识别器

二、词法单元(Token)类型

词法分析 的最终结果是产出 词法单元 流，每个 词法单元 则都有两个重要的属性： 词法类型 和 词法内容 ，所以我们需要事先定义好 词法类型
本文中将以分析“ [a, b, [c, dd]] ”这类语法为例。在这个例子里，我们可以明确包含的字符可以分为：

NAME ：表示变量名称，如“a”、“b”、“c”、“dd”都是变量名称
LBRACK ：表示左括号“ [ ”
RBRACK ：表示右括号“ ] ”
COMMA ：表示逗号“ , ”

所以，我们可以定义 TokenType 如下：

// TokenType.ts
export enum TokenType {
    NAME = 1,
    COMMA = 2,
    LBRACK = 3,
    RBRACK = 4
}

此外，还需要定义一个类，来表示 Token 这一数据结构：

import { TokenType } from './TokenType'

export default class Token {
    public type: TokenType
    public text: string
    constructor(type, text) {
        this.type = type
        this.text = text
    }
    public toString() {
        return `<'${this.text}', ${TokenType[this.type]}>`
    }
}

其中， NAME 比较特殊，它是一或多个字母所组成的，所以我们可以定义它的规则为： ('a' ... 'z'|'A' ... 'Z')+

三、规则实现

规则通常有以下这么几类，可以图示为：

具体而言：

alternatives? ， optional ，表示可选可不选，因此可以编码实现为：

if (/* 看到相应的向前看字符 */) {
    match(alternatives)
}

alternatives+ ， one or more ，表示一或多个，因此可以编码实现为：

do {
    match(alternatives)
} while(/* 看到相应的向前看字符 */)

alternatives* ， zero or more ，表示零或多个，因此可以编码实现为：

while (/* 看到相应的向前看字符 */) {
    match(alternatives)
}

所以，回到刚刚的例子，NAME的定义是 ('a'...'z'|'A'...'Z')+ ，所以它的实现是采用 do-while ：

private _NAME(): Token {
    let buf = ''
    do {
        buf += this._char // this._char表示当前读取到的字符
    } while (this._isLetter())
    return new Token(TokenType.NAME, buf)
}

而一般在 词法分析 中， 空白 是不重要的，我们一般的做法是跳过空白，空白可以定义为 (' ' | '\t' | '\n' | '\r')* ，因此它的实现是采用 while 语句：

private _WS(): void {
    while (
        this._char === ''
        || this._char === '\t'
        || this._char === '\n'
        || this._char === '\r'
    ) {
        this._consume()
    }
}

四、Lexer抽象类

为了方便地实现我们的词法解析器，我们需要定义一个 Lexer 抽象类，它包含了一些内部状态和公用操作方法，即：

状态：

_input: string
_index: number
_char: string | TokenType.EOF

方法：

consume() 方法：用以每次调用时都向前扫描一个字符，当扫描完毕后返回 EOF
match(x: string) 方法：确保 x 是输入流中的下一个字符，如果不是就报错

因此， Lexer 类实现如下：

import { TokenType } from './TokenType'

export default abstract class Lexer {
    protected _input: string
    protected _index: number
    protected _char: string | TokenType.EOF
    constructor(input: string) {
        this._input = input
        this._index = 0
        this._char = input.charAt(0)
    }
    public consume() {
        this._index++
        if (this._index >= this._input.length) {
            this._char = TokenType.EOF
        } else {
            this._char = this._input.charAt(this._index)
        }
    }
    public match(charToMatch: string) {
        if (this._char === charToMatch) {
            this.consume()
        } else {
            throw new Error(`Expecting ${charToMatch}; found ${this._char}`)
        }
    }
}

五、ListLexer

ListLexer 才是我们要实现的 词法解析器 。这里回到标题“基于LL(1)递归下降的Lexer”，这里 LL(1) 的 L 的意思是 Left-to-Right ，即表示：解析器 从左向右 解析输入内容，下降解析时也是按 从左到右 顺序遍历子节点，1则是每次只向前看一个字符。

所以我们是通过向前看一个字符串来判定 词法类型 的，因此 ListLexer 的实现主要是实现 nextToken() 方法，如下：

public nextToken(): Token {
    while (this._char !== TokenType.EOF) {
        // 通过查看向前看字符，确认词法类型
        switch(this._char) {
            // 如果向前看字符是' '，'\t'，'\n'，'\r'，则匹配空白
            case ' ':
            case '\t':
            case '\n':
            case '\r':
                this._WS()
                continue
            // 如果向前看字符是','，则匹配逗号
            case ',':
                this.consume()
                return new Token(TokenType.COMMA, ',')
            // 如果向前看字符是'['，则匹配左方括号
            case '[':
                this.consume()
                return new Token(TokenType.LBRACK, '[')
            // 如果向前看字符是']'，则匹配右方括号
            case ']':
                this.consume()
                return new Token(TokenType.RBRACK, ']')
            // 如果向前看字符不是以上情况，则可能是NAME，也有可能出错
            default:
                if (this._isLetter()) {
                    return this._NAME()
                }
                throw new Error(`invalid character: ${this._char}`)
        }
    }
    return new Token(TokenType.EOF, '<EOF>')
}

这里我们 NAME 的合法组成是字母，因此我们可以实现 _isLetter() 方法如下：

private _isLetter(): boolean {
    const charCode = (<string> this._char).charCodeAt(0)
    return charCode >= 'a'.charCodeAt(0) && charCode <= 'z'.charCodeAt(0)
        || charCode >= 'A'.charCodeAt(0) && charCode <= 'Z'.charCodeAt(0)
}

最终， ListLexer 的实现如下：

import { TokenType } from './TokenType'
import Lexer from './Lexer'
import Token from './Token'

export default class ListLexer extends Lexer {
    constructor(input: string) {
        super(input)
    }
    private _WS(): void {
        while (
            this._char === ' '
            || this._char === '\t'
            || this._char === '\n'
            || this._char === '\r'
        ) {
            this.consume()
        }
    }
    private _NAME(): Token {
        let buf = ''
        do {
            buf += this._char
            this.consume()
        } while (this._isLetter())
        return new Token(TokenType.NAME, buf)
    }
    private _isLetter(): boolean {
        const charCode = (<string> this._char).charCodeAt(0)
        return charCode >= 'a'.charCodeAt(0) && charCode <= 'z'.charCodeAt(0)
            || charCode >= 'A'.charCodeAt(0) && charCode <= 'Z'.charCodeAt(0)
    }
    public nextToken(): Token {
        while (this._char !== TokenType.EOF) {
            switch(this._char) {
                case ' ':
                case '\t':
                case '\n':
                case '\r':
                    this._WS()
                    continue
                case ',':
                    this.consume()
                    return new Token(TokenType.COMMA, ',')
                case '[':
                    this.consume()
                    return new Token(TokenType.LBRACK, '[')
                case ']':
                    this.consume()
                    return new Token(TokenType.RBRACK, ']')
                default:
                    if (this._isLetter()) {
                        return this._NAME()
                    }
                    throw new Error(`invalid character: ${this._char}`)
            }
        }
        return new Token(TokenType.EOF, '<EOF>')
    }
}

六、tokenizer

最终我们写出来的 tokenizer 为：

function tokenizer(input: string) {
    const result: Array<Token> = []
    const lexer: ListLexer = new ListLexer(input)
    try {
        let token: Token = lexer.nextToken()
        while (token.type !== TokenType.EOF) {
            result.push(token)
            token = lexer.nextToken()
        }
    } catch (e) {
        console.error('Tokenized error:', e.message)
    }
    return result
}

测试：

console.log(
    tokenizer('[a, c, ee, [a,b    ,c]]')
)

输出：

[ Token { type: 3, text: '[' },
  Token { type: 1, text: 'a' },
  Token { type: 2, text: ',' },
  Token { type: 1, text: 'c' },
  Token { type: 2, text: ',' },
  Token { type: 1, text: 'ee' },
  Token { type: 2, text: ',' },
  Token { type: 3, text: '[' },
  Token { type: 1, text: 'a' },
  Token { type: 2, text: ',' },
  Token { type: 1, text: 'b' },
  Token { type: 2, text: ',' },
  Token { type: 1, text: 'c' },
  Token { type: 4, text: ']' },
  Token { type: 4, text: ']' } ]

以上所述就是小编给大家介绍的《《编程语言实现模式》学习笔记（一）：LL(1)递归下降的词法分析器》，希望对大家有所帮助，如果大家有任何疑问请给我留言，小编会及时回复大家的。在此也非常感谢大家对码农网的支持！

查看所有标签

猜你喜欢:

本站部分资源来源于网络，本站转载出于传递更多信息之目的，版权归原作者或者来源机构所有，如转载稿涉及版权问题，请联系我们。

码农书籍

小圈子·大社交

Paul Adams / 王志慧 / 人民邮电出版社 / 2013-1 / 29.00元

网络正在脱离以内容为核心构建的方式，转向以人为核心重新构建。这样深远的变革将影响我们制定商业策略、设计以及营销和广告的方式。本书作者先后在谷歌和Facebook供职，对于社交网络有深入的研究和丰富的实战经验。他以学术界和工业界最新的调查研究为基础，阐述了人们如何通过社交圈子相互联系的规律，探讨了理念和品牌信息如何通过社交网络传播开来的过程。书中介绍了许多实际的例子，通过这些鲜活的实例，你将......一起来看看《小圈子·大社交》这本书的介绍吧!

码农工具