Paper Title

Biasing Like Human: A Cognitive Bias Framework for Scene Graph Generation

Paper Authors

Xiaoguang Chang, Teng Wang, Changyin Sun, Wenzhe Cai

Paper Abstract

Scene graph generation is a sophisticated task because there is no specific recognition pattern (e.g., "looking at" and "near" show no conspicuous visual difference, whereas "near" can occur between entities of different morphologies). Thus, some scene graph generation methods are trapped in the most frequent relation predictions, caused by capricious visual features and trivial dataset annotations. Therefore, recent works have emphasized "unbiased" approaches to balance predictions for a more informative scene graph. However, humans' quick and accurate judgments about relations between numerous objects should be attributed to "bias" (i.e., experience and linguistic knowledge) rather than pure vision. To enhance model capability, inspired by the "cognitive bias" mechanism, we propose a novel three-paradigm framework that simulates how humans incorporate label linguistic features to guide vision-based representations, better mining hidden relation patterns and alleviating noisy visual propagation. Our framework is model-agnostic and can be applied to any scene graph model. Comprehensive experiments prove that our framework outperforms baseline modules on several metrics with a minimal parameter increment, and achieves new SOTA performance on the Visual Genome dataset.
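
For intuition, here is a minimal, hypothetical PyTorch sketch of the core idea described in the abstract: using word embeddings of the subject/object labels to bias vision-based relation features. This is not the authors' released code; the module names, dimensions, and the gating design are illustrative assumptions, since the abstract does not detail the three paradigms.

```python
# A minimal, hypothetical sketch (not the authors' code) of the core idea:
# gate vision-based relation features with linguistic embeddings of the
# subject/object labels so that prior knowledge biases the prediction.
import torch
import torch.nn as nn

class LinguisticBias(nn.Module):
    """Fuses label word embeddings with visual relation features.

    All dimensions and names here are assumptions for illustration;
    the abstract does not specify the paper's actual architecture.
    """
    def __init__(self, num_classes: int, num_predicates: int,
                 embed_dim: int = 200, visual_dim: int = 512):
        super().__init__()
        # One embedding per object class label (e.g., "person", "bike").
        self.label_embed = nn.Embedding(num_classes, embed_dim)
        # Map the (subject, object) label pair to a gate over visual channels.
        self.gate = nn.Sequential(
            nn.Linear(2 * embed_dim, visual_dim),
            nn.Sigmoid(),
        )
        self.classifier = nn.Linear(visual_dim, num_predicates)

    def forward(self, visual_feat: torch.Tensor,
                subj_labels: torch.Tensor,
                obj_labels: torch.Tensor) -> torch.Tensor:
        # visual_feat: (B, visual_dim) relation features from any SGG backbone,
        # which keeps the wrapper model-agnostic.
        pair = torch.cat([self.label_embed(subj_labels),
                          self.label_embed(obj_labels)], dim=-1)
        # Linguistic "bias": suppress or amplify visual channels per label pair,
        # damping noisy visual signals for visually ambiguous predicates.
        gated = visual_feat * self.gate(pair)
        return self.classifier(gated)

# Toy usage: 150 object classes and 50 predicates, the Visual Genome split sizes.
module = LinguisticBias(num_classes=150, num_predicates=50)
logits = module(torch.randn(4, 512),
                torch.randint(0, 150, (4,)),
                torch.randint(0, 150, (4,)))
print(logits.shape)  # torch.Size([4, 50])
```

In this sketch the gate realizes the "bias": label pairs that rarely co-occur with a predicate can suppress the corresponding visual channels, while common pairs amplify them, so linguistic experience guides, rather than replaces, the visual evidence.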
