Paper Title
Dynamic Group Transformer: A General Vision Transformer Backbone with Dynamic Group Attention
Paper Authors
Paper Abstract
Recently, Transformers have shown promising performance in various vision tasks. To reduce the quadratic computational complexity caused by each query attending to all keys/values, various methods constrain the range of attention to local regions, where each query only attends to keys/values within a hand-crafted window. However, these hand-crafted window partition mechanisms are data-agnostic and ignore the input content, so a query may attend to irrelevant keys/values. To address this issue, we propose Dynamic Group Attention (DG-Attention), which dynamically divides all queries into multiple groups and selects the most relevant keys/values for each group. Our DG-Attention can flexibly model more relevant dependencies without the spatial constraints used in hand-crafted window-based attention. Built on DG-Attention, we develop a general vision transformer backbone named Dynamic Group Transformer (DGT). Extensive experiments show that our models outperform state-of-the-art methods on multiple common vision tasks, including image classification, semantic segmentation, object detection, and instance segmentation.
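The abstract describes the core idea at a high level: group queries dynamically by content, then let each group attend only to its most relevant keys/values. The sketch below is a minimal, hedged illustration of that idea, not the paper's actual implementation; the grouping method (a plain k-means over query features) and the relevance score (dot product with the group centroid) are assumptions chosen for simplicity, and the function name `dg_attention_sketch` is hypothetical.

```python
import numpy as np

def dg_attention_sketch(Q, K, V, num_groups=2, top_k=4, iters=10, seed=0):
    """Illustrative sketch of dynamic group attention (assumptions noted above):
    1) cluster the N queries into `num_groups` groups via k-means (assumed grouping);
    2) for each group, pick the top_k keys most similar to the group centroid;
    3) run standard softmax attention restricted to those selected keys/values."""
    N, d = Q.shape
    rng = np.random.default_rng(seed)

    # Step 1: simple k-means over query features (content-dependent grouping).
    centroids = Q[rng.choice(N, num_groups, replace=False)].copy()
    for _ in range(iters):
        dists = ((Q[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        assign = dists.argmin(axis=1)
        for g in range(num_groups):
            if (assign == g).any():
                centroids[g] = Q[assign == g].mean(axis=0)

    out = np.zeros((N, V.shape[1]))
    for g in range(num_groups):
        idx = np.where(assign == g)[0]
        if idx.size == 0:
            continue
        # Step 2: relevance of each key to this group, via its centroid.
        scores = K @ centroids[g]
        sel = np.argsort(scores)[-top_k:]
        # Step 3: scaled-dot-product attention over the selected keys/values only.
        att = (Q[idx] @ K[sel].T) / np.sqrt(d)
        att = np.exp(att - att.max(axis=1, keepdims=True))
        att /= att.sum(axis=1, keepdims=True)
        out[idx] = att @ V[sel]
    return out
```

Because each query scores only `top_k` keys instead of all N, the attention cost per query drops from O(N) to O(top_k), while the grouping itself adapts to the input content rather than to a fixed spatial window.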