Title
Person Text-Image Matching via Text-Feature Interpretability Embedding and External Attack Node Implantation
Authors
Abstract
Person text-image matching, also known as text-based person search, aims to retrieve images of specific pedestrians using text descriptions. Although person text-image matching has made great research progress, existing methods still face two challenges. First, the lack of interpretability of text features makes it challenging to align them effectively with their corresponding image features. Second, the same pedestrian image often corresponds to multiple different text descriptions, and a single text description can correspond to multiple different images of the same identity. This diversity of text descriptions and images makes it difficult for a network to extract robust features that match the two modalities. To address these problems, we propose a person text-image matching method based on text-feature interpretability embedding and an external attack node. Specifically, we improve the interpretability of text features by endowing them with semantic information consistent with image features, achieving alignment between text and the image region features it describes. To address the challenges posed by the diversity of text and the corresponding person images, we treat the feature variation caused by this diversity as perturbation information and propose a novel adversarial attack-and-defense method to handle it. In the model design, graph convolution is used as the basic framework for feature representation, and the adversarial attack that text and image diversity exerts on feature extraction is simulated by implanting an additional attack node in the graph convolution layer, improving the robustness of the model against text and image diversity. Extensive experiments demonstrate the effectiveness and superiority of the proposed method over existing text-pedestrian image matching methods. The source code of the method is published at
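The attack-node implantation described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the choice to connect the attack node to every real node, and the toy graph are all illustrative assumptions; the point is only to show how appending one extra node to the feature and adjacency matrices injects perturbation-like information into every node's message passing.

```python
import numpy as np

def gcn_layer(X, A, W):
    """One graph-convolution layer: row-normalized adjacency (with
    self-loops) aggregates neighbor features, followed by ReLU."""
    A_hat = A + np.eye(A.shape[0])                 # add self-loops
    D_inv = np.diag(1.0 / A_hat.sum(axis=1))       # row-degree normalization
    return np.maximum(D_inv @ A_hat @ X @ W, 0.0)  # ReLU activation

def gcn_with_attack_node(X, A, W, attack_feat):
    """Implant one external attack node, connected here to every real
    node (an assumption), so its perturbation-like feature vector mixes
    into all messages; the attack node's own output row is discarded."""
    n = X.shape[0]
    X_aug = np.vstack([X, attack_feat[None, :]])   # append attack-node features
    A_aug = np.zeros((n + 1, n + 1))
    A_aug[:n, :n] = A                              # original graph
    A_aug[n, :n] = A_aug[:n, n] = 1.0              # attack node touches all nodes
    return gcn_layer(X_aug, A_aug, W)[:n]          # keep only real-node outputs

# Toy example: 4 word/region nodes with 3-dim features, projected to 2 dims.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0]], dtype=float)
W = rng.normal(size=(3, 2))
attack_feat = rng.normal(size=3)

H_clean = gcn_layer(X, A, W)                       # features without attack
H_attacked = gcn_with_attack_node(X, A, W, attack_feat)
assert H_clean.shape == H_attacked.shape == (4, 2)
```

Training against such perturbed features (defense) is what the paper's attack-and-defense scheme uses to make the extracted features robust to text and image diversity; here the sketch only shows the attack side of the forward pass.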