Paper Title
Linking Emergent and Natural Languages via Corpus Transfer
Paper Authors
Paper Abstract
The study of language emergence aims to understand how human languages are shaped by perceptual grounding and communicative intent. Computational approaches to emergent communication (EC) predominantly consider referential games in limited domains and analyze the learned protocol within the game framework. As a result, it remains unclear how the emergent languages from these settings connect to natural languages or provide benefits in real-world language processing tasks, where statistical models trained on large text corpora dominate. In this work, we propose a novel way to establish such a link by corpus transfer, i.e., pretraining on a corpus of emergent language for downstream natural language tasks, which is in contrast to prior work that directly transfers speaker and listener parameters. Our approach showcases non-trivial transfer benefits for two different tasks -- language modeling and image captioning. For example, in a low-resource setup (modeling 2 million natural language tokens), pretraining on an emergent language corpus with just 2 million tokens reduces model perplexity by $24.6\%$ on average across ten natural languages. We also introduce a novel metric to predict the transferability of an emergent language by translating emergent messages to natural language captions grounded on the same images. We find that our translation-based metric highly correlates with the downstream performance on modeling natural languages (for instance, $\rho=0.83$ on Hebrew), while topographic similarity, a popular metric in previous work, shows surprisingly low correlation ($\rho=0.003$), hinting that simple properties like attribute disentanglement from synthetic domains might not capture the full complexities of natural language. Our findings also indicate potential benefits of moving language emergence forward with natural language resources and models.
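As a rough illustration of the corpus-transfer recipe described in the abstract, the sketch below pretrains a small language model on an emergent-language token corpus, fine-tunes it on a low-resource natural-language corpus, and reports perplexity. The `TinyLM` architecture, hyperparameters, and the synthetic stand-in corpora are assumptions made for this sketch only, not the paper's actual implementation.

```python
# Minimal sketch of corpus transfer: pretrain on an emergent-language corpus,
# then fine-tune on a small natural-language corpus and measure perplexity.
# Architecture, data, and hyperparameters here are illustrative assumptions.
import math
import torch
import torch.nn as nn


class TinyLM(nn.Module):
    """A small next-token language model (illustrative, not the paper's model)."""

    def __init__(self, vocab_size: int, dim: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.rnn = nn.LSTM(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, tokens):                  # tokens: (batch, seq_len)
        hidden, _ = self.rnn(self.embed(tokens))
        return self.head(hidden)                # logits: (batch, seq_len, vocab)


def train_lm(model, batches, epochs: int = 1, lr: float = 1e-3):
    """Next-token training loop, shared by pretraining and fine-tuning."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for tokens in batches:                  # each batch: (batch, seq_len) LongTensor
            logits = model(tokens[:, :-1])
            loss = loss_fn(logits.reshape(-1, logits.size(-1)),
                           tokens[:, 1:].reshape(-1))
            opt.zero_grad()
            loss.backward()
            opt.step()


@torch.no_grad()
def perplexity(model, batches):
    """Exponentiated mean next-token cross-entropy over a held-out corpus."""
    loss_fn = nn.CrossEntropyLoss(reduction="sum")
    total_loss, total_tokens = 0.0, 0
    for tokens in batches:
        logits = model(tokens[:, :-1])
        total_loss += loss_fn(logits.reshape(-1, logits.size(-1)),
                              tokens[:, 1:].reshape(-1)).item()
        total_tokens += tokens[:, 1:].numel()
    return math.exp(total_loss / total_tokens)


if __name__ == "__main__":
    vocab = 1000
    # Random stand-ins for the emergent and natural-language corpora.
    emergent_batches = [torch.randint(0, vocab, (8, 32)) for _ in range(10)]
    natural_batches = [torch.randint(0, vocab, (8, 32)) for _ in range(10)]

    model = TinyLM(vocab)
    train_lm(model, emergent_batches)           # corpus-transfer pretraining
    train_lm(model, natural_batches)            # low-resource fine-tuning
    print("perplexity:", perplexity(model, natural_batches))
```

In the paper's setting, the natural comparison is the same model trained on the natural-language corpus alone (or pretrained on a control corpus); the abstract reports an average perplexity reduction of $24.6\%$ across ten languages for the emergent-language pretraining.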