Paper Title
Dynamic Group Transformer: A General Vision Transformer Backbone with Dynamic Group Attention
Paper Authors
Paper Abstract
Recently, Transformers have shown promising performance in various vision tasks. To reduce the quadratic computational complexity caused by each query attending to all keys/values, various methods constrain the range of attention to local regions, where each query only attends to keys/values within a hand-crafted window. However, these hand-crafted window partition mechanisms are data-agnostic and ignore the input content, so a query may attend to irrelevant keys/values. To address this issue, we propose Dynamic Group Attention (DG-Attention), which dynamically divides all queries into multiple groups and selects the most relevant keys/values for each group. Our DG-Attention can flexibly model more relevant dependencies without the spatial constraints used in hand-crafted window-based attention. Built on DG-Attention, we develop a general vision transformer backbone named Dynamic Group Transformer (DGT). Extensive experiments show that our models outperform state-of-the-art methods on multiple common vision tasks, including image classification, semantic segmentation, object detection, and instance segmentation.
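The abstract describes the core idea at a high level: group queries dynamically by content, then let each group attend only to its most relevant keys/values. The sketch below is a minimal, hedged illustration of that idea, not the paper's actual implementation; the grouping method (a plain k-means over query features) and the relevance score (dot product with the group centroid) are assumptions chosen for simplicity, and the function name `dg_attention_sketch` is hypothetical.

```python
import numpy as np

def dg_attention_sketch(Q, K, V, num_groups=2, top_k=4, iters=10, seed=0):
    """Illustrative sketch of dynamic group attention (assumptions noted above):
    1) cluster the N queries into `num_groups` groups via k-means (assumed grouping);
    2) for each group, pick the top_k keys most similar to the group centroid;
    3) run standard softmax attention restricted to those selected keys/values."""
    N, d = Q.shape
    rng = np.random.default_rng(seed)

    # Step 1: simple k-means over query features (content-dependent grouping).
    centroids = Q[rng.choice(N, num_groups, replace=False)].copy()
    for _ in range(iters):
        dists = ((Q[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        assign = dists.argmin(axis=1)
        for g in range(num_groups):
            if (assign == g).any():
                centroids[g] = Q[assign == g].mean(axis=0)

    out = np.zeros((N, V.shape[1]))
    for g in range(num_groups):
        idx = np.where(assign == g)[0]
        if idx.size == 0:
            continue
        # Step 2: relevance of each key to this group, via its centroid.
        scores = K @ centroids[g]
        sel = np.argsort(scores)[-top_k:]
        # Step 3: scaled-dot-product attention over the selected keys/values only.
        att = (Q[idx] @ K[sel].T) / np.sqrt(d)
        att = np.exp(att - att.max(axis=1, keepdims=True))
        att /= att.sum(axis=1, keepdims=True)
        out[idx] = att @ V[sel]
    return out
```

Because each query scores only `top_k` keys instead of all N, the attention cost per query drops from O(N) to O(top_k), while the grouping itself adapts to the input content rather than to a fixed spatial window.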