Towards localisation of keywords in speech using weak supervision

Paper title

Towards localisation of keywords in speech using weak supervision

Authors

Kayode Olaleye, Benjamin van Niekerk, Herman Kamper

Abstract

Developments in weakly supervised and self-supervised models could enable speech technology in low-resource settings where full transcriptions are not available. We consider whether keyword localisation is possible using two forms of weak supervision where location information is not provided explicitly. In the first, only the presence or absence of a word is indicated, i.e. a bag-of-words (BoW) labelling. In the second, visual context is provided in the form of an image paired with an unlabelled utterance; a model then needs to be trained in a self-supervised fashion using the paired data. For keyword localisation, we adapt a saliency-based method typically used in the vision domain. We compare this to an existing technique that performs localisation as part of the network architecture. While the saliency-based method is more flexible (it can be applied without architectural restrictions), we identify a critical limitation when using it for keyword localisation. Of the two forms of supervision, the visually trained model performs worse than the BoW-trained model. We show qualitatively that the visually trained model sometimes locates semantically related words, but this is not consistent. While our results show that there is some signal allowing for localisation, they also call for other localisation methods better matched to these forms of weak supervision.
