论文标题
实用的LR解析器
Practical LR Parser Generation
论文作者
论文摘要
解析是现代编译器中的一个基本构件,对于工业编程语言,这是一项令人惊讶的涉及任务。有许多可以自动生成解析器的方法,但是普遍的共识是,自动解析器的产生对于真实的编程语言是不切实际的:LR/LALR解析器在其支持的语法中被认为是太限制的,而LR Parsers通常被认为在实践中的效率过于效率。结果,几乎所有现代语言都使用手工编写的递归发育分析器,这是一个漫长而容易出错的过程,可极大地增加新的编程语言开发的障碍。 在这项工作中,我们证明,与普遍的共识相反,我们可以在两全其美的情况下拥有最好的:对于非常一般的,实用的语法类别(严格的Knuth的规范LR的超级超集),我们可以自动产生解析器,以及由此产生的Parser代码,以及生成程序本身,也可以高效。这一进步依赖于几个新想法,包括新颖的自动机优化程序;新的语法转换(“ CPS”);每个符号属性;递归的动作;以及我们称为XLR的规范LR解析的扩展,该解析赋予了偏移/减少解析器的限制性非确定性选择的力量。 借助这些成分,我们可以自动为几乎所有在直觉上易于解析的编程语言生成有效的解析器 - 我们通过在一个名为langcc的新软件工具中实现新算法,并在Golang 1.17.8和Python 3.9.9.9.12的语法规范上运行它们,通过实现新算法,这是我们在实验上支持的主张。该工具会自动处理两种语言,并且在标准代码库上运行时生成的代码比Golang的相应手写解析器快1.2倍,并且比Cpython Parser的速度分别快4.3倍。
Parsing is a fundamental building block in modern compilers, and for industrial programming languages, it is a surprisingly involved task. There are known approaches to generate parsers automatically, but the prevailing consensus is that automatic parser generation is not practical for real programming languages: LR/LALR parsers are considered to be far too restrictive in the grammars they support, and LR parsers are often considered too inefficient in practice. As a result, virtually all modern languages use recursive-descent parsers written by hand, a lengthy and error-prone process that dramatically increases the barrier to new programming language development. In this work we demonstrate that, contrary to the prevailing consensus, we can have the best of both worlds: for a very general, practical class of grammars -- a strict superset of Knuth's canonical LR -- we can generate parsers automatically, and the resulting parser code, as well as the generation procedure itself, is highly efficient. This advance relies on several new ideas, including novel automata optimization procedures; a new grammar transformation ("CPS"); per-symbol attributes; recursive-descent actions; and an extension of canonical LR parsing, which we refer to as XLR, which endows shift/reduce parsers with the power of bounded nondeterministic choice. With these ingredients, we can automatically generate efficient parsers for virtually all programming languages that are intuitively easy to parse -- a claim we support experimentally, by implementing the new algorithms in a new software tool called langcc, and running them on syntax specifications for Golang 1.17.8 and Python 3.9.12. The tool handles both languages automatically, and the generated code, when run on standard codebases, is 1.2x faster than the corresponding hand-written parser for Golang, and 4.3x faster than the CPython parser, respectively.