Paper Title
SAMU-XLSR: Semantically-Aligned Multimodal Utterance-level Cross-Lingual Speech Representation
Paper Authors
Paper Abstract
We propose SAMU-XLSR: a Semantically-Aligned Multimodal Utterance-level Cross-Lingual Speech Representation learning framework. Unlike previous work on speech representation learning, which learns multilingual contextual speech embeddings at the resolution of an acoustic frame (10-20 ms), this work focuses on learning multimodal (speech-text) multilingual speech embeddings at the resolution of a sentence (5-10 s), such that the embedding vector space is semantically aligned across different languages. We combine the state-of-the-art multilingual acoustic frame-level speech representation model XLS-R with the Language Agnostic BERT Sentence Embedding (LaBSE) model to create SAMU-XLSR, an utterance-level multimodal multilingual speech encoder. Although we train SAMU-XLSR only on multilingual transcribed speech data, cross-lingual speech-text and speech-speech associations emerge in its learned representation space. To substantiate these claims, we use the SAMU-XLSR speech encoder in combination with the pre-trained LaBSE text sentence encoder for cross-lingual speech-to-text translation retrieval, and SAMU-XLSR alone for cross-lingual speech-to-speech translation retrieval. We demonstrate both applications on cross-lingual text and speech translation retrieval tasks spanning several datasets.
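
The framework described in the abstract can be pictured as a frame-level XLS-R encoder whose pooled output is trained to match the LaBSE embedding of the utterance's transcript. Below is a minimal, illustrative PyTorch sketch of that training objective, assuming the public HuggingFace checkpoints facebook/wav2vec2-xls-r-300m and sentence-transformers/LaBSE; the mean pooling, single linear projection, and plain cosine loss are simplifying assumptions for illustration, not the paper's exact recipe.

import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer, Wav2Vec2Model

class SamuXlsrSketch(nn.Module):
    def __init__(self):
        super().__init__()
        # Frame-level multilingual speech encoder (XLS-R, 300M variant).
        self.xlsr = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-xls-r-300m")
        # Pool 10-20 ms frame embeddings into one utterance vector, then
        # project into LaBSE's 768-dimensional sentence-embedding space.
        self.proj = nn.Linear(self.xlsr.config.hidden_size, 768)

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        frames = self.xlsr(waveform).last_hidden_state   # (B, T, H)
        pooled = frames.mean(dim=1)                      # simple mean pooling (an assumption)
        return F.normalize(self.proj(pooled), dim=-1)    # unit-norm utterance embedding

# Frozen LaBSE text encoder provides the training target.
labse_tok = AutoTokenizer.from_pretrained("sentence-transformers/LaBSE")
labse = AutoModel.from_pretrained("sentence-transformers/LaBSE").eval()

def labse_embed(sentences):
    batch = labse_tok(sentences, padding=True, return_tensors="pt")
    with torch.no_grad():
        out = labse(**batch)
    # LaBSE's sentence embedding is the pooled [CLS] representation.
    return F.normalize(out.pooler_output, dim=-1)

model = SamuXlsrSketch()
wav = torch.randn(2, 16000 * 5)        # two dummy 5-second utterances at 16 kHz
speech_emb = model(wav)                # (2, 768)
text_emb = labse_embed(["hello world", "bonjour le monde"])  # paired transcripts
# Pull each utterance embedding toward its transcript's LaBSE embedding.
loss = (1 - F.cosine_similarity(speech_emb, text_emb)).mean()
loss.backward()

Keeping the text encoder frozen is what lets the speech encoder inherit LaBSE's cross-lingual alignment: transcripts in different languages that mean the same thing already map to nearby text embeddings, so speech in different languages is pulled toward the same region of the space.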
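Once speech and text share a semantically aligned space, the retrieval tasks mentioned above reduce to nearest-neighbor search by cosine similarity. A hedged sketch, with illustrative variable names:

import torch

def retrieve(speech_emb: torch.Tensor, cand_embs: torch.Tensor) -> int:
    # speech_emb: (768,) SAMU-XLSR embedding of one query utterance.
    # cand_embs:  (N, 768) embeddings of candidate translations; LaBSE text
    # embeddings for speech-to-text retrieval, or SAMU-XLSR embeddings of
    # target-language utterances for speech-to-speech retrieval.
    # All vectors are unit-normalized, so the dot product is the cosine score.
    scores = cand_embs @ speech_emb    # (N,)
    return int(scores.argmax())        # index of the best-scoring candidate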