Modulating Bottom-Up and Top-Down Visual Processing via Language-Conditional Filters

Authors

İlker Kesen, Ozan Arkan Can, Erkut Erdem, Aykut Erdem, Deniz Yuret

Abstract

How best to integrate linguistic and perceptual processing in multi-modal tasks that involve language and vision is an important open problem. In this work, we argue that the common practice of using language in a top-down manner, to direct visual attention over high-level visual features, may not be optimal. We hypothesize that also using language to condition the bottom-up processing from pixels to high-level features can benefit overall performance. To support our claim, we propose a U-Net-based model and perform experiments on two language-vision dense-prediction tasks: referring expression segmentation and language-guided image colorization. We compare results where either one or both of the top-down and bottom-up visual branches are conditioned on language. Our experiments reveal that using language to control the filters for bottom-up visual processing, in addition to top-down attention, leads to better results on both tasks and achieves competitive performance. Our linguistic analysis suggests that bottom-up conditioning improves the segmentation of objects, especially when the input text refers to low-level visual concepts. Code is available at https://github.com/ilkerkesen/bvpr.
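
To make the idea of language-controlled filters more concrete, the following is a minimal sketch of a language-conditional 1x1 convolution, in which the kernel is generated from a sentence embedding instead of being a fixed learned parameter. It is not the authors' implementation (see the repository linked above for that). The function name, the parameter names (`proj_weight`, `proj_bias`), the tensor shapes, and the use of a single linear projection to produce the kernel are all assumptions made for illustration.

```python
import numpy as np


def language_conditional_conv1x1(features, text_embedding, proj_weight, proj_bias):
    """Apply a 1x1 convolution whose kernel is predicted from a text embedding.

    Hypothetical shapes (assumptions for this sketch):
      features:       (C_in, H, W) visual feature map
      text_embedding: (D,) sentence representation
      proj_weight:    (C_out * C_in, D) linear layer that emits the kernel
      proj_bias:      (C_out * C_in,)
    Returns a (C_out, H, W) feature map.
    """
    c_in = features.shape[0]
    # Predict a flat kernel from the text, then reshape it to (C_out, C_in).
    kernel = (proj_weight @ text_embedding + proj_bias).reshape(-1, c_in)
    # A 1x1 convolution is a per-pixel linear map over the channel axis.
    return np.einsum("oc,chw->ohw", kernel, features)


# Toy usage with random values: the same image features are filtered
# differently depending on the text embedding.
rng = np.random.default_rng(0)
c_in, c_out, dim = 3, 5, 8
features = rng.standard_normal((c_in, 4, 4))
proj_weight = rng.standard_normal((c_out * c_in, dim))
proj_bias = rng.standard_normal(c_out * c_in)
text_a = rng.standard_normal(dim)
text_b = rng.standard_normal(dim)

out_a = language_conditional_conv1x1(features, text_a, proj_weight, proj_bias)
out_b = language_conditional_conv1x1(features, text_b, proj_weight, proj_bias)
```

In a U-Net-style model, a layer of this kind could in principle sit in the encoder (bottom-up) path, the decoder (top-down) path, or both, which corresponds to the comparison the abstract describes. The projection parameters would be learned jointly with the rest of the network.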
