FT Speech: Danish Parliament Speech Corpus

Paper title

FT Speech: Danish Parliament Speech Corpus

Authors

Andreas Kirkedal, Marija Stepanović, Barbara Plank

Abstract

This paper introduces FT Speech, a new speech corpus created from the recorded meetings of the Danish Parliament, otherwise known as the Folketing (FT). The corpus contains over 1,800 hours of transcribed speech by a total of 434 speakers. It is significantly larger in duration, vocabulary, and amount of spontaneous speech than the existing public speech corpora for Danish, which are largely limited to read-aloud and dictation data. We outline design considerations, including the preprocessing methods and the alignment procedure. To evaluate the quality of the corpus, we train automatic speech recognition systems on the new resource and compare them to the systems trained on the Danish part of Språkbanken, the largest public ASR corpus for Danish to date. Our baseline results show that we achieve a 14.01 WER on the new corpus. A combination of FT Speech with in-domain language data provides comparable results to models trained specifically on Språkbanken, showing that FT Speech transfers well to this data set. Interestingly, our results demonstrate that the opposite is not the case. This shows that FT Speech provides a valuable resource for promoting research on Danish ASR with more spontaneous speech.
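
The abstract reports its baseline as a word error rate (WER). As a reminder of what that figure measures, the following is a minimal sketch of the standard definition: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and a recognizer hypothesis, divided by the number of reference words. It is not the authors' evaluation tooling; the function name `word_error_rate` and the Danish example sentences are made up for illustration, and real scoring pipelines usually normalize text (casing, punctuation, numerals) before comparing.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent, using whitespace tokenization (an assumption)."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return 100.0 * prev[-1] / len(ref)


# Hypothetical example: one of five reference words is dropped.
print(word_error_rate("det er en god dag", "det er god dag"))
```

Because insertions count as errors while the denominator is the reference length, WER can exceed 100 when the hypothesis is much longer than the reference.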
