Paper Title
Unsupervised Translation of Programming Languages
Paper Authors
Paper Abstract
A transcompiler, also known as a source-to-source translator, is a system that converts source code from a high-level programming language (such as C++ or Python) to another. Transcompilers are primarily used for interoperability, and to port codebases written in an obsolete or deprecated language (e.g. COBOL, Python 2) to a modern one. They typically rely on handcrafted rewrite rules, applied to the source code abstract syntax tree. Unfortunately, the resulting translations often lack readability, fail to respect the target language conventions, and require manual modifications in order to work properly. The overall translation process is time-consuming and requires expertise in both the source and target languages, making code-translation projects expensive. Although neural models significantly outperform their rule-based counterparts in the context of natural language translation, their applications to transcompilation have been limited due to the scarcity of parallel data in this domain. In this paper, we propose to leverage recent approaches in unsupervised machine translation to train a fully unsupervised neural transcompiler. We train our model on source code from open source GitHub projects, and show that it can translate functions between C++, Java, and Python with high accuracy. Our method relies exclusively on monolingual source code, requires no expertise in the source or target languages, and can easily be generalized to other programming languages. We also build and release a test set composed of 852 parallel functions, along with unit tests to check the correctness of translations. We show that our model outperforms rule-based commercial baselines by a significant margin.
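To make the evaluation setup concrete, the following is a minimal sketch of the kind of function-level translation and unit-test check the abstract describes. The function pair and test cases here are hypothetical illustrations, not taken from the released 852-function test set:

```python
# Hypothetical parallel function pair: a C++ source function (shown as a
# string, for reference only) and a candidate Python translation, as a
# transcompiler might produce.
CPP_SOURCE = """
int max_subarray(const std::vector<int>& v) {
    int best = v[0], cur = v[0];
    for (size_t i = 1; i < v.size(); ++i) {
        cur = std::max(v[i], cur + v[i]);
        best = std::max(best, cur);
    }
    return best;
}
"""

# Candidate Python translation of the C++ function above.
def max_subarray(v):
    best = cur = v[0]
    for x in v[1:]:
        cur = max(x, cur + x)
        best = max(best, cur)
    return best

# Unit-test check: a translation is counted as correct only if it
# reproduces the reference outputs on a set of held-out inputs.
def check_translation():
    cases = [
        ([1, -2, 3, 4, -1], 7),
        ([-3, -1, -2], -1),
        ([5], 5),
    ]
    return all(max_subarray(inp) == out for inp, out in cases)
```

This output-equivalence style of checking is what lets correctness be measured automatically, without requiring the translated code to match a reference translation token by token.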