Paper Title

Jointly Trained Transformers models for Spoken Language Translation

Authors

Vydana, Hari Krishna; Karafiát, Martin; Žmolíková, Kateřina; Burget, Lukáš; Černocký, Honza

Abstract

Conventional spoken language translation (SLT) systems are pipeline-based systems, with an Automatic Speech Recognition (ASR) system to convert the source modality from speech to text and a Machine Translation (MT) system to translate the source text into text in the target language. Recent progress in sequence-to-sequence architectures has reduced the performance gap between pipeline-based SLT systems (cascaded ASR-MT) and end-to-end approaches. Though end-to-end and cascaded ASR-MT systems are reaching comparable levels of performance, a large performance gap remains between MT models fed the ASR hypothesis and those fed the oracle text. This gap indicates that MT systems suffer large performance degradation when given noisy ASR hypotheses rather than oracle text transcripts. In this work, this degradation is reduced by creating an end-to-end differentiable pipeline between the ASR and MT systems: the SLT system is trained with the ASR objective as an auxiliary loss, and the two networks are connected through neural hidden representations. This training provides an end-to-end differentiable path w.r.t. the final objective function while exploiting the ASR objective for better SLT performance. The architecture improves the BLEU score from 36.8 to 44.5. Because of the multi-task training, the model also generates ASR hypotheses, which are consumed by a pre-trained MT model; combining the proposed system with this MT model increases the BLEU score by a further 1 point. All experiments are reported on the English-Portuguese speech translation task using the How2 corpus. The final BLEU score is on par with the best speech translation system on the How2 dataset, with no additional training data, no language model, and far fewer parameters.
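As a rough numerical sketch (not the paper's implementation), the joint objective can be illustrated as a shared speech encoder whose hidden states feed both an ASR head (the auxiliary loss) and a translation head, with the total loss a weighted sum of the two. All names, dimensions, and the interpolation weight below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_encoder(x, W):
    # Stand-in for the transformer speech encoder: one linear layer + tanh.
    return np.tanh(x @ W)

def cross_entropy(logits, targets):
    # Mean token-level cross-entropy computed from raw logits.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

# Toy dimensions (illustrative, not from the paper).
T, d_feat, d_model, src_vocab, tgt_vocab = 6, 8, 16, 10, 12

speech = rng.normal(size=(T, d_feat))                  # acoustic features
W_enc = rng.normal(size=(d_feat, d_model)) * 0.1
W_asr = rng.normal(size=(d_model, src_vocab)) * 0.1    # ASR head (auxiliary objective)
W_slt = rng.normal(size=(d_model, tgt_vocab)) * 0.1    # translation head

# Shared hidden representation: the differentiable connection between
# the ASR and MT networks described in the abstract.
h = toy_encoder(speech, W_enc)

asr_loss = cross_entropy(h @ W_asr, rng.integers(0, src_vocab, size=T))
slt_loss = cross_entropy(h @ W_slt, rng.integers(0, tgt_vocab, size=T))

lam = 0.3  # assumed auxiliary-loss weight, not taken from the paper
total_loss = (1.0 - lam) * slt_loss + lam * asr_loss
print(float(total_loss))
```

Since the total loss is a single differentiable function of the shared hidden states, gradients from the translation objective reach the speech encoder directly, which is the point of replacing the hard ASR-to-MT text interface with neural representations.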
