Paper Title
ManiTrans: Entity-Level Text-Guided Image Manipulation via Token-wise Semantic Alignment and Generation
Paper Authors
Paper Abstract
Existing text-guided image manipulation methods aim to modify the overall appearance of an image or to edit a few objects in virtual or simple scenes, which is far from practical applications. In this work, we study a novel task: entity-level text-guided image manipulation in the real world. The task imposes three basic requirements: (1) editing the entity to be consistent with the text description, (2) preserving the text-irrelevant regions, and (3) merging the manipulated entity into the image naturally. To this end, we propose a new transformer-based framework built on a two-stage image synthesis method, namely \textbf{ManiTrans}, which can not only edit the appearance of entities but also generate new entities corresponding to the text guidance. Our framework incorporates a semantic alignment module to locate the image regions to be manipulated, and a semantic loss that helps align vision and language. We conduct extensive experiments on three real-world datasets, CUB, Oxford, and COCO, to verify that our method can distinguish relevant from irrelevant regions and achieves more precise and flexible manipulation than baseline methods. The project homepage is \url{https://jawang19.github.io/manitrans}.
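The abstract does not spell out how the semantic alignment module works; as a rough illustration only, the minimal sketch below shows one plausible form of token-wise alignment: scoring each image token of the first (tokenization) stage against the text tokens and thresholding to obtain a mask of regions to manipulate. All names here (`alignment_mask`, `text_emb`, `image_emb`, `tau`) are hypothetical and not taken from the paper.

```python
# Hypothetical sketch of token-wise semantic alignment (not the paper's code):
# score each image token against the text tokens, then threshold to get a
# boolean mask of image tokens relevant to the text-described entity.
import numpy as np

def alignment_mask(text_emb: np.ndarray, image_emb: np.ndarray,
                   tau: float = 0.25) -> np.ndarray:
    """text_emb: (T, D) text-token embeddings; image_emb: (N, D) image-token
    embeddings. Returns a (N,) boolean mask of text-relevant image tokens."""
    # L2-normalize so dot products become cosine similarities.
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    sim = v @ t.T            # (N, T) token-wise similarity matrix
    score = sim.max(axis=1)  # best-matching text token per image token
    return score > tau       # True => token belongs to the region to edit

# Toy usage: 4 text tokens, 16 image tokens, 8-dim embeddings.
rng = np.random.default_rng(0)
mask = alignment_mask(rng.normal(size=(4, 8)), rng.normal(size=(16, 8)))
print(mask.shape, mask.dtype)  # (16,) bool
```

Under this reading, tokens outside the mask would be kept fixed (requirement 2) while masked tokens are regenerated by the transformer conditioned on the text (requirements 1 and 3); the threshold `tau` trades off edit coverage against preservation of irrelevant regions.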