Paper Title
Can Visual Context Improve Automatic Speech Recognition for an Embodied Agent?
Paper Authors
Paper Abstract
The use of automatic speech recognition (ASR) systems is becoming omnipresent, ranging from personal assistants to chatbots and home and industrial automation systems. Modern robots are also equipped with ASR capabilities for interacting with humans, as speech is the most natural interaction modality. However, ASR in robots faces additional challenges compared to a personal assistant. Being an embodied agent, a robot must recognize the physical entities around it and therefore reliably recognize speech containing descriptions of such entities. However, current ASR systems are often unable to do so due to limitations in ASR training, such as generic datasets and open-vocabulary modeling. Adverse conditions during inference, such as noisy, accented, and far-field speech, further degrade transcription accuracy. In this work, we present a method to incorporate a robot's visual information into an ASR system and improve the recognition of spoken utterances containing visible entities. Specifically, we propose a new decoder biasing technique that incorporates the visual context while ensuring the ASR output does not degrade when the context is incorrect. We achieve a 59% relative reduction in word error rate (WER) compared to an unmodified ASR system.
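To make the idea concrete, below is a minimal sketch of one common form of decoder biasing, shallow fusion with a bias list built from detected entity labels; it is not the paper's exact technique. The entity names, the `apply_context_bias` function, and the `bonus` value are illustrative assumptions. Words that start or continue a visually detected entity name receive a moderate log-probability bonus during decoding, so correct visual context helps, while a deliberately small bonus keeps incorrect context from overriding strong acoustic evidence.

```python
import math

# Hypothetical entity labels produced by the robot's vision system.
visual_entities = ["coffee mug", "red ball"]

def apply_context_bias(prefix, next_word_logprobs, entities, bonus=1.5):
    """Boost the log-probability of candidate next words that start or
    continue a visual entity name; leave all other scores untouched.
    A moderate bonus limits degradation when the context is wrong."""
    words = prefix.split()
    boosted = dict(next_word_logprobs)
    for entity in entities:
        ewords = entity.split()
        # Longest suffix of the decoded prefix that is a proper prefix
        # of this entity name (k == 0 boosts the entity's first word).
        for k in range(min(len(words), len(ewords) - 1), -1, -1):
            if words[len(words) - k:] == ewords[:k]:
                next_entity_word = ewords[k]
                if next_entity_word in boosted:
                    boosted[next_entity_word] += bonus
                break
    return boosted

# Usage: "coffee" was just decoded, so the bias favors "mug" over the
# acoustically competitive alternatives.
scores = {"mug": math.log(0.30), "mag": math.log(0.35), "the": math.log(0.35)}
print(apply_context_bias("grab the coffee", scores, visual_entities))
```

In a full beam-search decoder this rescoring would be applied at every expansion step to each hypothesis; the sketch shows a single step for clarity.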