Paper Title

Multimodal Contrastive Learning with LIMoE: the Language-Image Mixture of Experts

Paper Authors

Basil Mustafa, Carlos Riquelme, Joan Puigcerver, Rodolphe Jenatton, Neil Houlsby

Paper Abstract

Large sparsely-activated models have obtained excellent performance in multiple domains. However, such models are typically trained on a single modality at a time. We present the Language-Image MoE, LIMoE, a sparse mixture of experts model capable of multimodal learning. LIMoE accepts both images and text simultaneously, while being trained using a contrastive loss. MoEs are a natural fit for a multimodal backbone, since expert layers can learn an appropriate partitioning of modalities. However, new challenges arise; in particular, training stability and balanced expert utilization, for which we propose an entropy-based regularization scheme. Across multiple scales, we demonstrate remarkable performance improvement over dense models of equivalent computational cost. LIMoE-L/16 trained comparably to CLIP-L/14 achieves 78.6% zero-shot ImageNet accuracy (vs. 76.2%), and when further scaled to H/14 (with additional data) it achieves 84.1%, comparable to state-of-the-art methods which use larger custom per-modality backbones and pre-training schemes. We analyse the quantitative and qualitative behavior of LIMoE, and demonstrate phenomena such as differing treatment of the modalities and the organic emergence of modality-specific experts.
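To make the two training ingredients named in the abstract more concrete (a shared contrastive objective over image and text embeddings, and an entropy-based regularization of expert routing), here is a minimal PyTorch sketch. It is an illustration under assumed names and hyperparameters (e.g. `temperature`, the way the two entropy terms are combined), not the authors' implementation; the paper applies such entropy terms per modality with further details not reproduced here.

```python
# Minimal sketch (not the paper's code) of a CLIP-style contrastive loss and an
# entropy-based regularizer on mixture-of-experts routing probabilities.
import torch
import torch.nn.functional as F


def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of aligned image/text pairs."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature              # (B, B) similarity matrix
    labels = torch.arange(logits.size(0), device=logits.device)
    # Image-to-text and text-to-image cross-entropy, averaged.
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))


def routing_entropy_regularizer(router_probs):
    """Entropy terms over routing distributions for the tokens of one modality.

    router_probs: (num_tokens, num_experts) softmax outputs of the router.
    Low per-token ("local") entropy encourages confident routing; high entropy
    of the average ("global") distribution encourages balanced expert use.
    Exact weights and thresholds in the paper differ from this sketch.
    """
    eps = 1e-9
    local = -(router_probs * (router_probs + eps).log()).sum(-1).mean()
    mean_probs = router_probs.mean(0)
    global_ = -(mean_probs * (mean_probs + eps).log()).sum()
    # Minimizing this term lowers local entropy and raises global entropy.
    return local - global_
```

In training, a regularizer of this kind would be computed separately for image tokens and text tokens and added, with small weights, to the contrastive loss.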
