Paper title
An Overview of Affective Speech Synthesis and Conversion in the Deep Learning Era
Paper authors
Paper abstract
Speech is the fundamental mode of human communication, and its synthesis has long been a core priority in human-computer interaction research. In recent years, machines have managed to master the art of generating speech that is understandable by humans. But the linguistic content of an utterance encompasses only a part of its meaning. Affect, or expressivity, has the capacity to turn speech into a medium capable of conveying intimate thoughts, feelings, and emotions -- aspects that are essential for engaging and naturalistic interpersonal communication. While the goal of imparting expressivity to synthesised utterances has so far remained elusive, following recent advances in text-to-speech synthesis, a paradigm shift is well under way in the fields of affective speech synthesis and conversion as well. Deep learning, as the technology which underlies most of the recent advances in artificial intelligence, is spearheading these efforts. In the present overview, we outline ongoing trends and summarise state-of-the-art approaches in an attempt to provide a comprehensive overview of this exciting field.