Paper title
Iwin: Human-Object Interaction Detection via Transformer with Irregular Windows
Authors
Abstract
This paper presents a new vision Transformer, named Iwin Transformer, which is specifically designed for human-object interaction (HOI) detection, a detailed scene understanding task involving a sequential process of human/object detection and interaction recognition. Iwin Transformer is a hierarchical Transformer that progressively performs token representation learning and token agglomeration within irregular windows. The irregular windows, achieved by augmenting regular grid locations with learned offsets, 1) eliminate redundancy in token representation learning, which leads to efficient human/object detection, and 2) enable the agglomerated tokens to align with humans/objects of different shapes, which facilitates the acquisition of highly abstracted visual semantics for interaction recognition. The effectiveness and efficiency of Iwin Transformer are verified on the two standard HOI detection benchmark datasets, HICO-DET and V-COCO. Results show that our method outperforms existing Transformer-based methods by large margins (3.7 mAP gain on HICO-DET and 2.0 mAP gain on V-COCO) with fewer training epochs ($0.5 \times$).
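The abstract describes irregular windows as regular grid locations augmented with learned offsets. The NumPy sketch below illustrates only that general idea and is not the authors' implementation. The function names `bilinear_sample` and `irregular_window` are hypothetical, the offsets are passed in as a plain array rather than predicted by a network, and bilinear interpolation is an assumed way to read features at fractional positions.

```python
import numpy as np

def bilinear_sample(feat, ys, xs):
    """Read feat (H, W, C) at fractional coordinates ys, xs (same shape)."""
    H, W, _ = feat.shape
    ys = np.clip(ys, 0, H - 1)          # keep sample points inside the map
    xs = np.clip(xs, 0, W - 1)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, H - 1)
    x1 = np.minimum(x0 + 1, W - 1)
    wy = (ys - y0)[..., None]           # vertical interpolation weight
    wx = (xs - x0)[..., None]           # horizontal interpolation weight
    top = feat[y0, x0] * (1 - wx) + feat[y0, x1] * wx
    bot = feat[y1, x0] * (1 - wx) + feat[y1, x1] * wx
    return top * (1 - wy) + bot * wy

def irregular_window(feat, center, size, offsets):
    """Gather the tokens of one size x size window around center=(row, col).

    offsets has shape (size, size, 2) and holds a (dy, dx) displacement per
    grid location. All-zero offsets give an ordinary regular window; in a
    trained model the offsets would be predicted (assumption: supplied here).
    """
    cy, cx = center
    r = np.arange(size) - size // 2
    gy, gx = np.meshgrid(cy + r, cx + r, indexing="ij")  # regular grid
    ys = gy + offsets[..., 0]                            # shifted rows
    xs = gx + offsets[..., 1]                            # shifted columns
    return bilinear_sample(feat, ys, xs).reshape(size * size, -1)

# Toy feature map whose single channel equals 5 * row + col.
feat = np.arange(25, dtype=float).reshape(5, 5, 1)
regular = irregular_window(feat, (2, 2), 3, np.zeros((3, 3, 2)))
shift = np.zeros((3, 3, 2))
shift[..., 1] = 0.5                                      # move each point half a pixel right
shifted = irregular_window(feat, (2, 2), 3, shift)
```

With zero offsets the window reduces to the usual 3 x 3 patch around the center. Non-zero offsets let each sampling point move independently, which is what allows a window to follow the shape of a person or object rather than a fixed square.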