Paper Title
AVGZSLNet: Audio-Visual Generalized Zero-Shot Learning by Reconstructing Label Features from Multi-Modal Embeddings
Paper Authors
Paper Abstract
In this paper, we propose a novel approach for generalized zero-shot learning in a multi-modal setting, where we have novel classes of audio/video during testing that are not seen during training. We use the semantic relatedness of text embeddings as a means for zero-shot learning by aligning audio and video embeddings with the corresponding class label text feature space. Our approach uses a cross-modal decoder and a composite triplet loss. The cross-modal decoder enforces a constraint that the class label text features can be reconstructed from the audio and video embeddings of data points. This helps the audio and video embeddings to move closer to the class label text embedding. The composite triplet loss makes use of the audio, video, and text embeddings. It helps bring the embeddings from the same class closer and push away the embeddings from different classes in a multi-modal setting. This helps the network to perform better on the multi-modal zero-shot learning task. Importantly, our multi-modal zero-shot learning approach works even if a modality is missing at test time. We test our approach on the generalized zero-shot classification and retrieval tasks and show that our approach outperforms other models in the presence of a single modality as well as in the presence of multiple modalities. We validate our approach by comparing it with previous approaches and using various ablations.
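The abstract describes two training objectives: a cross-modal decoder that reconstructs class-label text features from audio and video embeddings, and a composite triplet loss over audio, video, and text embeddings. The following is a minimal PyTorch-style sketch of these two components, not the authors' implementation; the embedding dimensions, the simple MLP decoder, the particular anchor/positive/negative combinations, and the loss weighting are all assumptions made here for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalDecoder(nn.Module):
    """Illustrative decoder that maps an audio or video embedding back to the
    class-label text feature space (dimensions are assumptions)."""

    def __init__(self, emb_dim: int = 300, text_dim: int = 300):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(emb_dim, 512),
            nn.ReLU(),
            nn.Linear(512, text_dim),
        )

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        return self.net(emb)


def reconstruction_loss(decoder: CrossModalDecoder,
                        audio_emb: torch.Tensor,
                        video_emb: torch.Tensor,
                        text_feat: torch.Tensor) -> torch.Tensor:
    """Constrain both modality embeddings to reconstruct the class-label
    text feature, pulling them toward the text embedding space."""
    rec_audio = decoder(audio_emb)
    rec_video = decoder(video_emb)
    return F.mse_loss(rec_audio, text_feat) + F.mse_loss(rec_video, text_feat)


def composite_triplet_loss(audio_emb: torch.Tensor,
                           video_emb: torch.Tensor,
                           text_emb: torch.Tensor,
                           neg_audio: torch.Tensor,
                           neg_video: torch.Tensor,
                           neg_text: torch.Tensor,
                           margin: float = 1.0) -> torch.Tensor:
    """Sum of triplet terms across modality pairs: each anchor is pulled
    toward a same-class embedding of another modality and pushed away from
    a different-class one. The exact set of terms is an assumption."""
    triplet = nn.TripletMarginLoss(margin=margin)
    loss = triplet(text_emb, audio_emb, neg_audio)
    loss = loss + triplet(text_emb, video_emb, neg_video)
    loss = loss + triplet(audio_emb, text_emb, neg_text)
    loss = loss + triplet(video_emb, text_emb, neg_text)
    return loss
```

Under this sketch, the total training loss would be a weighted sum of the reconstruction and composite triplet terms; because alignment is enforced against the shared text space, either modality alone can still be matched to class-label text features at test time, which is consistent with the abstract's claim about missing modalities.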