Paper Title
SpeechFormer: A Hierarchical Efficient Framework Incorporating the Characteristics of Speech
Paper Authors
Paper Abstract
Transformer has obtained promising results in the field of cognitive speech signal processing, which is of interest in various applications ranging from emotion to neurocognitive disorder analysis. However, most works treat the speech signal as a whole, neglecting the pronunciation structure that is unique to speech and reflects the cognitive process. Meanwhile, Transformer carries a heavy computational burden due to its full attention operation. In this paper, a hierarchical efficient framework called SpeechFormer, which considers the structural characteristics of speech, is proposed; it can serve as a general-purpose backbone for cognitive speech signal processing. The proposed SpeechFormer consists of frame, phoneme, word and utterance stages in succession, each performing neighboring attention according to the structural pattern of speech with high computational efficiency. SpeechFormer is evaluated on speech emotion recognition (IEMOCAP & MELD) and neurocognitive disorder detection (Pitt & DAIC-WOZ) tasks, and the results show that it outperforms the standard Transformer-based framework while greatly reducing the computational cost. Furthermore, SpeechFormer achieves results comparable to the state-of-the-art approaches.
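The key efficiency idea in the abstract is that each stage restricts self-attention to a local neighborhood of frames rather than attending over the full sequence. The sketch below illustrates such window-restricted (neighboring) attention in NumPy; the function name, the single-head formulation, and the use of the input as query/key/value are simplifying assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def neighboring_attention(x, window):
    """Single-head self-attention where each position attends only to
    neighbors within `window` steps (illustrative sketch, not the
    authors' code). x: (T, d) array of frame features."""
    T, d = x.shape
    # Use the inputs directly as queries, keys and values for simplicity.
    scores = x @ x.T / np.sqrt(d)                    # (T, T) similarities
    # Mask out positions farther than `window` from the query position.
    idx = np.arange(T)
    scores[np.abs(idx[:, None] - idx[None, :]) > window] = -np.inf
    # Row-wise softmax over the remaining local neighbors.
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ x                               # (T, d) output

x = np.random.default_rng(0).standard_normal((8, 4))
out = neighboring_attention(x, window=2)
```

With a fixed window, the attention cost grows linearly in sequence length instead of quadratically, which is the source of the computational savings the abstract claims over full attention.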