FluentNet: End-to-End Detection of Speech Disfluency with Deep Learning

Paper Title

FluentNet: End-to-End Detection of Speech Disfluency with Deep Learning

Authors

Tedd Kourkounakis, Amirhossein Hajavi, Ali Etemad

Abstract

Strong presentation skills are valuable and sought-after in workplace and classroom environments alike. Of the possible improvements to vocal presentations, disfluencies and stutters in particular remain among the most common and prominent factors in a speaker's delivery. Millions of people are affected by stuttering and other speech disfluencies, with the majority of the world having experienced mild stutters while communicating under stressful conditions. While there has been much research in the field of automatic speech recognition and language models, there is not yet a sufficient body of work on disfluency detection and recognition. To this end, we propose an end-to-end deep neural network, FluentNet, capable of detecting a number of different disfluency types. FluentNet consists of a Squeeze-and-Excitation Residual convolutional neural network, which facilitates the learning of strong spectral frame-level representations, followed by a set of bidirectional long short-term memory layers that aid in learning effective temporal relationships. Lastly, FluentNet uses an attention mechanism to focus on the important parts of speech to obtain better performance. We perform a number of different experiments, comparisons, and ablation studies to evaluate our model. Our model achieves state-of-the-art results by outperforming other solutions in the field on the publicly available UCLASS dataset. Additionally, we present LibriStutter: a disfluency dataset based on the public LibriSpeech dataset with synthesized stutters. We also evaluate FluentNet on this dataset, showing the strong performance of our model versus a number of benchmark techniques.
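
To make two of the building blocks named in the abstract more concrete, the following is a minimal pure-Python sketch of squeeze-and-excitation channel gating and of attention pooling over frame-level features. It is an illustration under assumptions, not the authors' implementation: the function names `squeeze_excite` and `attention_pool`, the weight matrices `w1` and `w2`, and the `query` vector are all hypothetical, the attention shown is generic dot-product attention rather than whichever variant the paper uses, and the residual convolutions and bidirectional LSTM layers are omitted entirely.

```python
import math


def squeeze_excite(feature_maps, w1, w2):
    """Reweight channels with a squeeze-and-excitation style gate.

    feature_maps: list of channels, each a list of floats.
    w1: bottleneck weights, shape [hidden][channels] (hypothetical).
    w2: expansion weights, shape [channels][hidden] (hypothetical).
    """
    # Squeeze: summarize each channel by its global average.
    squeezed = [sum(ch) / len(ch) for ch in feature_maps]
    # Excitation, step 1: bottleneck projection followed by ReLU.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    # Excitation, step 2: one sigmoid gate per channel.
    gates = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
        for row in w2
    ]
    # Scale every value in a channel by that channel's gate.
    return [[g * x for x in ch] for g, ch in zip(gates, feature_maps)]


def attention_pool(frames, query):
    """Softmax-weighted sum over time steps (generic dot-product attention).

    frames: list of per-frame feature vectors (e.g. recurrent-layer outputs).
    query: vector of the same dimension, assumed to be learned elsewhere.
    """
    scores = [sum(f * q for f, q in zip(frame, query)) for frame in frames]
    # Subtract the maximum score for numerical stability before exponentiating.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(frames[0])
    pooled = [sum(w * frame[d] for w, frame in zip(weights, frames)) for d in range(dim)]
    return pooled, weights
```

In a full model of the kind the abstract describes, the gated feature maps would pass through further convolutional and recurrent layers before pooling, and the pooled vector would feed a classifier over disfluency types.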

