Paper Title

Transformers and CNNs both Beat Humans on SBIR

Authors

Omar Seddati, Stéphane Dupont, Saïd Mahmoudi, Thierry Dutoit

Abstract

Sketch-based image retrieval (SBIR) is the task of retrieving natural images (photos) that match the semantics and the spatial configuration of hand-drawn sketch queries. The universality of sketches extends the scope of possible applications and increases the demand for efficient SBIR solutions. In this paper, we study classic triplet-based SBIR solutions and show that a persistent invariance to horizontal flip (even after model finetuning) harms performance. To overcome this limitation, we propose several approaches and evaluate each of them in depth to check its effectiveness. Our main contributions are twofold: we propose and evaluate several intuitive modifications to build SBIR solutions with better flip equivariance, and we show that vision transformers are better suited to the SBIR task, outperforming CNNs by a large margin. We carried out numerous experiments and introduce the first models to outperform human performance on a large-scale SBIR benchmark (Sketchy). Our best model achieves a recall of 62.25% (at k = 1) on the Sketchy benchmark, compared to 46.2% for the previous state-of-the-art.
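The "classic triplet-based SBIR solutions" the abstract refers to train a shared embedding space in which a sketch is pulled toward its matching photo and pushed away from non-matching ones, typically via a triplet margin loss. A minimal NumPy sketch of that loss follows; the function name and the margin value are illustrative assumptions, not details from the paper.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Triplet margin loss over embedding vectors.

    anchor:   embedding of the sketch query
    positive: embedding of the matching photo
    negative: embedding of a non-matching photo
    The loss is zero once the positive is closer to the anchor
    than the negative by at least `margin`.
    """
    d_pos = np.linalg.norm(anchor - positive)  # sketch-to-match distance
    d_neg = np.linalg.norm(anchor - negative)  # sketch-to-mismatch distance
    return max(0.0, d_pos - d_neg + margin)
```

In a full SBIR pipeline this loss would be applied to embeddings produced by a CNN or vision-transformer backbone; the paper's flip-equivariance modifications concern how those backbones respond to horizontally mirrored sketches, not the loss itself.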
