Paper Title

Multimodal Contrastive Learning with LIMoE: the Language-Image Mixture of Experts

Paper Authors

Basil Mustafa, Carlos Riquelme, Joan Puigcerver, Rodolphe Jenatton, Neil Houlsby

Paper Abstract

Large sparsely-activated models have obtained excellent performance in multiple domains. However, such models are typically trained on a single modality at a time. We present the Language-Image MoE, LIMoE, a sparse mixture of experts model capable of multimodal learning. LIMoE accepts both images and text simultaneously, while being trained using a contrastive loss. MoEs are a natural fit for a multimodal backbone, since expert layers can learn an appropriate partitioning of modalities. However, new challenges arise; in particular, training stability and balanced expert utilization, for which we propose an entropy-based regularization scheme. Across multiple scales, we demonstrate remarkable performance improvement over dense models of equivalent computational cost. LIMoE-L/16 trained comparably to CLIP-L/14 achieves 78.6% zero-shot ImageNet accuracy (vs. 76.2%), and when further scaled to H/14 (with additional data) it achieves 84.1%, comparable to state-of-the-art methods which use larger custom per-modality backbones and pre-training schemes. We analyse the quantitative and qualitative behavior of LIMoE, and demonstrate phenomena such as differing treatment of the modalities and the organic emergence of modality-specific experts.
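To make the two training ingredients named in the abstract more concrete (a shared contrastive objective over image and text embeddings, and an entropy-based regularization of expert routing), here is a minimal PyTorch sketch. It is an illustration under assumed names and hyperparameters (e.g. `temperature`, the way the two entropy terms are combined), not the authors' implementation; the paper applies such entropy terms per modality with further details not reproduced here.

```python
# Minimal sketch (not the paper's code) of a CLIP-style contrastive loss and an
# entropy-based regularizer on mixture-of-experts routing probabilities.
import torch
import torch.nn.functional as F


def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of aligned image/text pairs."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature              # (B, B) similarity matrix
    labels = torch.arange(logits.size(0), device=logits.device)
    # Image-to-text and text-to-image cross-entropy, averaged.
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))


def routing_entropy_regularizer(router_probs):
    """Entropy terms over routing distributions for the tokens of one modality.

    router_probs: (num_tokens, num_experts) softmax outputs of the router.
    Low per-token ("local") entropy encourages confident routing; high entropy
    of the average ("global") distribution encourages balanced expert use.
    Exact weights and thresholds in the paper differ from this sketch.
    """
    eps = 1e-9
    local = -(router_probs * (router_probs + eps).log()).sum(-1).mean()
    mean_probs = router_probs.mean(0)
    global_ = -(mean_probs * (mean_probs + eps).log()).sum()
    # Minimizing this term lowers local entropy and raises global entropy.
    return local - global_
```

In training, a regularizer of this kind would be computed separately for image tokens and text tokens and added, with small weights, to the contrastive loss.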
