Paper Title
Mod-Squad: Designing Mixture of Experts As Modular Multi-Task Learners
Paper Authors
Paper Abstract
Optimization in multi-task learning (MTL) is more challenging than in single-task learning (STL), as the gradients from different tasks can be contradictory. When tasks are related, it can be beneficial to share some parameters among them (cooperation). However, some tasks require additional parameters with expertise in a specific type of data or discrimination (specialization). To address the MTL challenge, we propose Mod-Squad, a new model that is Modularized into groups of experts (a 'Squad'). This structure allows us to formalize cooperation and specialization as the process of matching experts and tasks. We optimize this matching process during the training of a single model. Specifically, we incorporate mixture of experts (MoE) layers into a transformer model, with a new loss that incorporates the mutual dependence between tasks and experts. As a result, only a small set of experts is activated for each task. This prevents the sharing of the entire backbone model between all tasks, which strengthens the model, especially when the training set size and the number of tasks scale up. More interestingly, for each task, we can extract the small set of experts as a standalone model that maintains the same performance as the large model. Extensive experiments on the Taskonomy dataset with 13 vision tasks and the PASCAL-Context dataset with 5 vision tasks show the superiority of our approach.
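To make the abstract's core idea concrete, below is a minimal PyTorch sketch of a task-conditioned MoE feed-forward layer combined with a mutual-information-style loss between tasks and experts. This is not the authors' released Mod-Squad code: the class and function names (`TaskMoELayer`, `mutual_dependence_loss`), the per-task router design, the top-k gating, and the uniform-task-prior assumption in the loss are all illustrative assumptions chosen to mirror what the abstract describes, and the paper's actual formulation and dispatch mechanism may differ.

```python
# Illustrative sketch only (hypothetical names; not the Mod-Squad reference code).
import torch
import torch.nn as nn


class TaskMoELayer(nn.Module):
    """MoE feed-forward layer with one router per task and top-k gating."""

    def __init__(self, dim, num_experts=16, num_tasks=13, k=2, hidden_mult=4):
        super().__init__()
        self.k = k
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden_mult * dim), nn.GELU(),
                          nn.Linear(hidden_mult * dim, dim))
            for _ in range(num_experts)
        )
        # One router per task: cooperation emerges when tasks route to shared
        # experts, specialization when a task uses experts others rarely pick.
        self.routers = nn.ModuleList(nn.Linear(dim, num_experts) for _ in range(num_tasks))

    def forward(self, x, task_id):
        # x: (batch, tokens, dim)
        probs = self.routers[task_id](x).softmax(dim=-1)        # (B, T, E)
        topv, topi = probs.topk(self.k, dim=-1)                  # keep k experts per token
        topv = topv / topv.sum(dim=-1, keepdim=True)             # renormalize gate weights
        gates = torch.zeros_like(probs).scatter(-1, topi, topv)  # sparse gates, (B, T, E)
        # Dense evaluation of all experts for clarity; real MoE code dispatches sparsely.
        expert_outs = torch.stack([e(x) for e in self.experts], dim=-2)  # (B, T, E, D)
        out = (gates.unsqueeze(-1) * expert_outs).sum(dim=-2)
        # Average routing probabilities feed the mutual-dependence loss below.
        return out, probs.mean(dim=(0, 1))                       # (E,)


def mutual_dependence_loss(task_expert_probs):
    """Negative mutual information I(T; E) between tasks and experts.

    task_expert_probs: (num_tasks, num_experts), each row an estimate of P(E | T).
    Assuming a uniform task prior, P(T, E) = P(E | T) / num_tasks. Maximizing
    I(T; E) pushes each task toward a small, consistent expert subset while
    keeping overall expert usage balanced; returning -I lets us minimize it.
    """
    num_tasks = task_expert_probs.shape[0]
    joint = task_expert_probs / num_tasks                        # P(T, E)
    p_e = joint.sum(dim=0, keepdim=True)                         # P(E)
    p_t = joint.sum(dim=1, keepdim=True)                         # P(T)
    mi = (joint * (joint.clamp_min(1e-9) / (p_t * p_e).clamp_min(1e-9)).log()).sum()
    return -mi


if __name__ == "__main__":
    layer = TaskMoELayer(dim=32, num_experts=8, num_tasks=3, k=2)
    x = torch.randn(4, 10, 32)
    usage = [layer(x, task_id)[1] for task_id in range(3)]
    loss = mutual_dependence_loss(torch.stack(usage))
    print(loss.item())
```

Under this reading, extracting a standalone per-task model amounts to keeping only the experts a task's router actually activates and discarding the rest, which is why the pruned model can match the full model's performance on that task.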