Paper title
Shape of synth to come: Why we should use synthetic data for English surface realization
Authors
Abstract
The Surface Realization Shared Tasks of 2018 and 2019 were Natural Language Generation shared tasks with the goal of exploring approaches to surface realization from Universal-Dependency-like trees to surface strings for several languages. In the 2018 shared task there was very little difference in the absolute performance of systems trained with and without additional, synthetically created data, and a new rule prohibiting the use of synthetic data was introduced for the 2019 shared task. Contrary to the findings of the 2018 shared task, we show, in experiments on the English 2018 dataset, that the use of synthetic data can have a substantial positive effect - an improvement of almost 8 BLEU points for a previously state-of-the-art system. We analyse the effects of synthetic data, and we argue that its use should be encouraged rather than prohibited so that future research efforts continue to explore systems that can take advantage of such data.