Paper title
The Norwegian Parliamentary Speech Corpus
Paper authors
Paper abstract
The Norwegian Parliamentary Speech Corpus (NPSC) is a speech dataset with recordings of meetings from Stortinget, the Norwegian parliament. It is the first publicly available dataset containing unscripted Norwegian speech designed for training automatic speech recognition (ASR) systems. The recordings are manually transcribed and annotated with language codes and speakers, and there are detailed metadata about the speakers. The transcriptions exist in both normalized and non-normalized form, and non-standardized words are explicitly marked and annotated with standardized equivalents. To test the usefulness of this dataset, we have compared an ASR system trained on the NPSC with a baseline system trained only on manuscript-read speech. These systems were tested on an independent dataset containing spontaneous, dialectal speech. The NPSC-trained system performed significantly better, with a 22.9% relative improvement in word error rate (WER). Moreover, training on the NPSC is shown to have a "democratizing" effect in terms of dialects, as improvements are generally larger for dialects with higher WER from the baseline system.
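To make the reported metric concrete, the following is a minimal sketch of how WER and a relative WER improvement are conventionally computed. It assumes the standard definition of WER (word-level edit distance divided by the number of reference words) and the usual definition of relative improvement, (baseline - new) / baseline. The function names are hypothetical, and this is not the authors' evaluation code; the abstract does not state the underlying WER values, so none are assumed here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # aligning i reference words to nothing costs i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_improvement(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which new_wer is lower than baseline_wer."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer
```

With purely illustrative inputs, a baseline WER of 0.40 and a new WER of 0.30 would give a relative improvement of 25%. The 22.9% reported above is a ratio of this kind, so it describes the proportional reduction in errors, not a drop of 22.9 percentage points.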