Paper Title

Multi-Task Learning with Multi-Query Transformer for Dense Prediction

Authors

Yangyang Xu, Xiangtai Li, Haobo Yuan, Yibo Yang, Lefei Zhang

Abstract

Previous multi-task dense prediction studies developed complex pipelines, such as multi-modal distillation in multiple stages or searching for task-relational contexts for each task. The core insight behind these methods is to maximize the mutual effects of the tasks. Inspired by recent query-based Transformers, we propose a simple pipeline named Multi-Query Transformer (MQTransformer) that is equipped with multiple queries from different tasks to facilitate reasoning among multiple tasks and simplify the cross-task interaction pipeline. Instead of modeling the dense per-pixel context among different tasks, we seek a task-specific proxy to perform cross-task reasoning via multiple queries, where each query encodes the task-related context. The MQTransformer is composed of three key components: a shared encoder, a cross-task query attention module, and a shared decoder. We first model each task with a task-relevant query. Then both the task-specific feature output by the feature extractor and the task-relevant query are fed into the shared encoder, thus encoding the task-relevant query from the task-specific feature. Secondly, we design a cross-task query attention module to reason about the dependencies among multiple task-relevant queries; this enables the module to focus only on query-level interaction. Finally, we use a shared decoder to gradually refine the image features with the reasoned query features from different tasks. Extensive experimental results on two dense prediction datasets (NYUD-v2 and PASCAL-Context) show that the proposed method is effective and achieves state-of-the-art results. Code and models are available at https://github.com/yangyangxu0/MQTransformer.
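The query-level interaction described above can be illustrated with a minimal sketch: task queries from all tasks are stacked and attend to one another with scaled dot-product attention, so cross-task reasoning happens at the query level rather than per pixel. This is an assumption-laden simplification, not the authors' implementation: the function name, shapes, and the use of identity projections (no learned W_q, W_k, W_v, no multi-head split) are illustrative choices.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_task_query_attention(queries):
    """Sketch of query-level cross-task attention (hypothetical helper).

    queries: array of shape (T, N, D) -- T tasks, N queries per task,
    D channels. Queries from all tasks are flattened together so that
    attention models dependencies across tasks, then reshaped back.
    Learned projections are omitted for brevity.
    """
    T, N, D = queries.shape
    q = queries.reshape(T * N, D)            # mix queries across tasks
    attn = softmax(q @ q.T / np.sqrt(D))     # (T*N, T*N) query-level affinity
    out = attn @ q                           # aggregate task-related context
    return out.reshape(T, N, D)
```

Because the interaction is restricted to T*N query vectors instead of dense per-pixel maps, its cost is independent of the image resolution, which is the simplification the abstract emphasizes.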
