Paper Title
The Dialog Must Go On: Improving Visual Dialog via Generative Self-Training
Paper Authors
Paper Abstract
Visual dialog (VisDial) is the task of answering a sequence of questions grounded in an image, using the dialog history as context. Prior work has trained dialog agents solely on VisDial data via supervised learning, or has leveraged pre-training on related vision-and-language datasets. This paper presents a semi-supervised learning approach for visually-grounded dialog, called Generative Self-Training (GST), that leverages unlabeled images from the Web. Specifically, GST first retrieves in-domain images through out-of-distribution detection and generates synthetic dialogs about those images via multimodal conditional text generation. GST then trains a dialog agent on both the synthetic and the original VisDial data. As a result, GST scales the amount of training data up to an order of magnitude beyond VisDial (from 1.2M to 12.9M QA pairs). For robust training on the synthetic dialogs, we also propose perplexity-based data selection and multimodal consistency regularization. Evaluation on the VisDial v1.0 and v0.9 datasets shows that GST achieves new state-of-the-art results on both. We further observe that GST is robust against both visual and textual adversarial attacks. Finally, GST yields strong performance gains in the low-data regime. Code is available at https://github.com/gicheonkang/gst-visdial.
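The perplexity-based data selection step the abstract mentions lends itself to a short illustration: the generator's own per-token log-probabilities score each synthetic answer, and high-perplexity (low-confidence) generations are dropped before the dialog agent is trained. The sketch below is a minimal example under stated assumptions, not the paper's exact implementation (see the linked repository for that); in particular, the dictionary layout and the threshold `max_ppl=20.0` are hypothetical.

```python
import math

def answer_perplexity(token_logprobs):
    """Perplexity of one generated answer, computed from the per-token
    log-probabilities the generator assigned while decoding it."""
    avg_nll = -sum(token_logprobs) / max(len(token_logprobs), 1)
    return math.exp(avg_nll)

def select_confident_qa(synthetic_pool, max_ppl=20.0):
    """Perplexity-based data selection: keep only synthetic QA pairs whose
    answer the generator scored as fluent (perplexity below a threshold).
    The threshold value here is an illustrative assumption."""
    return [qa for qa in synthetic_pool
            if answer_perplexity(qa["logprobs"]) <= max_ppl]

# Toy usage: the second, noisier generation exceeds the threshold and is dropped.
pool = [
    {"answer": "two people", "logprobs": [-0.1, -0.3]},        # ppl ~= 1.2
    {"answer": "maybe a dog, cat", "logprobs": [-3.2, -4.0]},  # ppl ~= 36.6
]
print(select_confident_qa(pool))  # keeps only the first QA pair
```

Filtering with the generator's own likelihood is a standard self-training heuristic: it needs no extra model and discards exactly the synthetic samples the teacher was least confident about.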