Title
Subjective Evaluation of Noise Suppression Algorithms in Crowdsourcing
Authors
Abstract
The quality of speech communication systems, which include noise suppression algorithms, is typically evaluated in laboratory experiments according to ITU-T Rec. P.835, in which participants rate the background noise, the speech signal, and the overall quality separately. This paper introduces an open-source toolkit for conducting subjective quality evaluation of noise-suppressed speech in crowdsourcing. We followed ITU-T Rec. P.835 and P.808 and highly automated the process to prevent moderator errors. To assess the validity of our evaluation method, we compared the Mean Opinion Scores (MOS) calculated from ratings collected with our implementation against the MOS values from a standard laboratory experiment conducted according to ITU-T Rec. P.835. Results show high validity on all three scales, namely background noise, speech signal, and overall quality (average PCC = 0.961). Results of a round-robin test (N = 5) showed that our implementation is also a highly reproducible evaluation method (PCC = 0.99). Finally, we used our implementation in the INTERSPEECH 2021 Deep Noise Suppression Challenge as the primary evaluation metric, which demonstrates that it is practical to use at scale. The results are analyzed to determine why the best overall performance was achieved in terms of background noise and speech quality.
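The validity comparison described above reduces to two computations: a per-condition Mean Opinion Score (the average of listener ratings on a 1-5 scale) and the Pearson correlation coefficient (PCC) between the crowdsourced and laboratory MOS values. The sketch below is illustrative only and not from the paper's toolkit; the condition names and ratings are made up.

```python
# Illustrative sketch (not the paper's toolkit): per-condition MOS from
# individual 1-5 ratings, and the Pearson correlation (PCC) between
# crowdsourced and laboratory MOS values. All data below is hypothetical.
from statistics import mean


def mos(ratings_by_condition):
    """Mean Opinion Score per condition: the average of listener ratings."""
    return {cond: mean(r) for cond, r in ratings_by_condition.items()}


def pcc(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


# Hypothetical ratings for three conditions on one scale (e.g. overall quality)
crowd = mos({"c1": [4, 5, 4], "c2": [3, 3, 2], "c3": [2, 1, 2]})
lab = mos({"c1": [5, 4, 4], "c2": [3, 2, 3], "c3": [1, 2, 1]})
conds = sorted(crowd)
validity = pcc([crowd[c] for c in conds], [lab[c] for c in conds])
```

In the paper this correlation is computed per scale (background noise, speech signal, overall quality) across all conditions; a high PCC indicates that the crowdsourced ratings order the conditions the same way the laboratory ratings do.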