Paper Title
Effective Actor-centric Human-object Interaction Detection
Paper Authors
Paper Abstract
While Human-Object Interaction (HOI) detection has achieved tremendous advances in recent years, it remains challenging due to the complex interactions of multiple humans and objects occurring in images, which inevitably lead to ambiguities. Most existing methods either generate all human-object pair candidates and infer their relationships from successively cropped local features in a two-stage manner, or directly predict interaction points in a one-stage procedure. However, the lack of spatial configurations in two-stage methods and of reasoning steps in one-stage methods limits their performance in such complex scenes. To avoid this ambiguity, we propose a novel actor-centric framework. The main ideas are that, when inferring interactions: 1) non-local features of the entire image, guided by the actor's position, are obtained to model the relationship between the actor and its context, and then 2) an object branch generates a pixel-wise interaction-area prediction, where the interaction area denotes the object's central area. Moreover, we also use an actor branch to obtain the actor's interaction prediction and propose a novel composition strategy based on center-point indexing to generate the final HOI predictions. Thanks to the use of non-local features and the partly-coupled property of the human-object composition strategy, our proposed framework can detect HOIs more accurately, especially in complex images. Extensive experimental results show that our method achieves state-of-the-art performance on the challenging V-COCO and HICO-DET benchmarks and is more robust, especially in scenes with multiple persons and/or objects.
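The abstract describes the composition step only at a high level. As an illustration, the following minimal Python/NumPy sketch shows how a composition strategy based on center-point indexing could pair the actor branch's interaction scores with the object branch's pixel-wise interaction-area maps. All function names, tensor shapes, the confidence threshold, and the multiplicative score fusion are assumptions made for this sketch, not details taken from the paper.

    import numpy as np

    # Hypothetical tensor shapes (the abstract does not specify them):
    #   actor_scores:   (num_actors, num_verbs) -- per-actor interaction scores
    #   object_centers: (num_objects, 2)        -- detected object centers (x, y)
    #   area_maps:      (num_verbs, H, W)       -- pixel-wise interaction-area
    #                                              maps peaking at object centers

    def compose_hoi(actor_scores, object_centers, area_maps, thresh=0.5):
        """Pair each confident actor-verb prediction with the object whose
        center point scores highest in that verb's interaction-area map."""
        triplets = []
        for a, verbs in enumerate(actor_scores):
            for v, s_actor in enumerate(verbs):
                if s_actor < thresh:
                    continue
                # "Center-point indexing": read the verb's area map at every
                # detected object center and keep the best match.
                center_scores = [area_maps[v, int(y), int(x)]
                                 for (x, y) in object_centers]
                o = int(np.argmax(center_scores))
                # Fuse scores multiplicatively (an assumption, not the
                # paper's stated rule).
                triplets.append((a, v, o, s_actor * center_scores[o]))
        return triplets

    # Toy usage: 2 actors, 3 verbs, 2 objects on a 64x64 map.
    rng = np.random.default_rng(0)
    out = compose_hoi(rng.random((2, 3)),
                      np.array([[10, 20], [40, 50]]),
                      rng.random((3, 64, 64)))
    print(out)  # list of (actor_idx, verb_idx, object_idx, score) triplets

Note how this pairing is only partly coupled: the actor branch scores verbs without committing to an object, and the object branch scores candidate object centers without committing to an actor, so the final triplet is composed from the two independent predictions.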