Paper Title

Masked Modeling Duo: Learning Representations by Encouraging Both Networks to Model the Input

Authors

Daisuke Niizumi, Daiki Takeuchi, Yasunori Ohishi, Noboru Harada, Kunio Kashino

Abstract

Masked Autoencoders (MAE) is a simple yet powerful self-supervised learning method. However, it learns representations indirectly, by reconstructing masked input patches. Several methods learn representations directly by predicting the representations of masked patches; however, we argue that using all patches to encode the training-signal representations is suboptimal. We propose a new method, Masked Modeling Duo (M2D), that learns representations directly while obtaining training signals from masked patches only. In M2D, the online network encodes the visible patches and predicts the masked-patch representations, while the target network, a momentum encoder, encodes the masked patches. To better predict the target representations, the online network should model the input well, and the target network should also model it well so that its outputs agree with the online predictions. The learned representations should then model the input better. We validated M2D by learning general-purpose audio representations, where it set new state-of-the-art performance on tasks such as UrbanSound8K, VoxCeleb1, AudioSet20K, GTZAN, and SpeechCommandsV2. We additionally validate the effectiveness of M2D for images using ImageNet-1K in the appendix.
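As a rough illustration of the training-signal flow the abstract describes, one M2D-style step can be sketched in plain Python. This is not the authors' implementation: the toy scalar-weight "encoders", the mean-of-visible-codes "predictor", the mask ratio, and the EMA rate below are all invented for the sketch; only the overall structure (online network sees visible patches, target momentum encoder sees masked patches only, loss measures agreement) follows the paper's description.

```python
import random

def encode(patches, weight):
    # Toy "encoder": scale each patch vector by a scalar weight.
    # In M2D these would be ViT-style networks.
    return [[weight * x for x in p] for p in patches]

def m2d_step(patches, online_w, target_w, mask_ratio=0.6, ema=0.99):
    """One toy M2D-style training step; returns (loss, new_target_w)."""
    # Randomly split patches into visible (online input) and masked
    # (target input) subsets.
    idx = list(range(len(patches)))
    random.shuffle(idx)
    n_masked = int(len(patches) * mask_ratio)
    masked_idx, visible_idx = idx[:n_masked], idx[n_masked:]
    visible = [patches[i] for i in visible_idx]
    masked = [patches[i] for i in masked_idx]

    # Online network encodes the visible patches and predicts the
    # masked-patch representations; here the "prediction" is simply
    # the mean visible code, standing in for a learned predictor.
    online_codes = encode(visible, online_w)
    dim = len(online_codes[0])
    mean_code = [sum(c[d] for c in online_codes) / len(online_codes)
                 for d in range(dim)]
    predictions = [mean_code for _ in masked]

    # Target network (momentum encoder) encodes ONLY the masked
    # patches, so the training signal comes from masked patches alone.
    targets = encode(masked, target_w)

    # Agreement loss between predictions and target representations
    # (mean squared error over masked patches).
    loss = sum((p[d] - t[d]) ** 2
               for p, t in zip(predictions, targets)
               for d in range(dim)) / (len(masked) * dim)

    # Target weights track the online weights by exponential moving
    # average rather than by gradient descent.
    new_target_w = ema * target_w + (1 - ema) * online_w
    return loss, new_target_w
```

Only the online network would receive gradients from `loss`; the momentum update is what keeps the target's representations stable yet consistent with the online network, which is the mechanism the abstract credits for both networks modeling the input.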
