Paper Title
Similarity Analysis of Self-Supervised Speech Representations
Paper Authors
Paper Abstract
Self-supervised speech representation learning has recently been a prosperous research topic. Many algorithms have been proposed for learning useful representations from large-scale unlabeled data, and their applications to a wide range of speech tasks have also been investigated. However, there has been little research focusing on understanding the properties of existing approaches. In this work, we aim to provide a comparative study of some of the most representative self-supervised algorithms. Specifically, we quantify the similarities between different self-supervised representations using existing similarity measures. We also design probing tasks to study the correlation between the models' pre-training loss and the amount of specific speech information contained in their learned representations. In addition to showing how various self-supervised models behave differently given the same input, our study also finds that the training objective has a higher impact on representation similarity than architectural choices such as building blocks (RNN/Transformer/CNN) and directionality (uni/bidirectional). Our results also suggest that there exists a strong correlation between pre-training loss and downstream performance for some self-supervised algorithms.
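The abstract refers to quantifying similarity between representations using existing similarity measures. One widely used measure for comparing neural network representations is linear Centered Kernel Alignment (CKA); the sketch below is a minimal illustration of such a measure, not necessarily the exact procedure used in the paper. It assumes frame-level features have already been extracted from two models for the same frames; the array names, shapes, and random data are hypothetical placeholders for real model outputs.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two representation matrices.

    X: (n_frames, d1) features from model A
    Y: (n_frames, d2) features from model B
    Both are assumed to be extracted from the same frames.
    """
    # Center each feature dimension.
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    # ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    cross = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, ord="fro")
    norm_y = np.linalg.norm(Y.T @ Y, ord="fro")
    return cross / (norm_x * norm_y)

# Hypothetical example: compare frame-level features from two models
# (e.g., a Transformer-based and an RNN-based self-supervised encoder).
rng = np.random.default_rng(0)
feats_model_a = rng.standard_normal((1000, 768))
feats_model_b = rng.standard_normal((1000, 512))
print(linear_cka(feats_model_a, feats_model_b))
```

A score near 1 indicates highly similar representations and a score near 0 indicates dissimilar ones; comparing such scores across models trained with different objectives or architectures is one way to study how training choices affect representation similarity, as the abstract describes.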