Exploiting News Article Structure for Automatic Corpus Generation of Entailment Datasets

Paper Title

Exploiting News Article Structure for Automatic Corpus Generation of Entailment Datasets

Authors

Cruz, Jan Christian Blaise; Resabal, Jose Kristian; Lin, James; Velasco, Dan John; Cheng, Charibeth

Abstract

Transformers represent the state-of-the-art in Natural Language Processing (NLP) in recent years, proving effective even in tasks done in low-resource languages. While pretrained transformers for these languages can be made, it is challenging to measure their true performance and capacity due to the lack of hard benchmark datasets, as well as the difficulty and cost of producing them. In this paper, we present three contributions: First, we propose a methodology for automatically producing Natural Language Inference (NLI) benchmark datasets for low-resource languages using published news articles. Through this, we create and release NewsPH-NLI, the first sentence entailment benchmark dataset in the low-resource Filipino language. Second, we produce new pretrained transformers based on the ELECTRA technique to further alleviate the resource scarcity in Filipino, benchmarking them on our dataset against other commonly-used transfer learning techniques. Lastly, we perform analyses on transfer learning techniques to shed light on their true performance when operating in low-data domains through the use of degradation tests.
