Paper Title
B-SCST: Bayesian Self-Critical Sequence Training for Image Captioning
Paper Authors
Paper Abstract
Bayesian deep neural networks (DNNs) can provide a mathematically grounded framework to quantify uncertainty in the predictions of image captioning models. We propose a Bayesian variant of the policy-gradient-based reinforcement learning training technique for image captioning models, which directly optimizes non-differentiable image captioning quality metrics such as CIDEr-D. We extend the well-known Self-Critical Sequence Training (SCST) approach for image captioning models by incorporating Bayesian inference, and refer to the result as B-SCST. The policy-gradient "baseline" in B-SCST is generated by averaging the predictive quality metric (CIDEr-D) of captions drawn from the distribution obtained using a Bayesian DNN model. We infer this predictive distribution using Monte Carlo (MC) dropout approximate variational inference. We show that B-SCST improves CIDEr-D scores on the Flickr30k, MS COCO, and VizWiz image captioning datasets compared to the SCST approach. We also provide a study of uncertainty quantification for the predicted captions, and demonstrate that it correlates well with the CIDEr-D scores. To our knowledge, this is the first such analysis, and it can improve the interpretability of image captioning model outputs, which is critical for practical applications.
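The following is a minimal sketch of the B-SCST training step the abstract describes: keep dropout active at training time so that each stochastic forward pass is one MC-dropout sample, score each sampled caption with CIDEr-D, and use the mean score across samples as the policy-gradient baseline. The `model.sample` method and the `cider_d` scorer are hypothetical placeholders standing in for the paper's actual model and metric implementations, not a confirmed API.

```python
import torch

def b_scst_loss(model, images, refs, cider_d, num_samples=5):
    """Hedged sketch of one B-SCST training step.

    `model.sample` and `cider_d` are hypothetical placeholders:
    `model.sample` is assumed to return one sampled caption per image
    plus the summed token log-probabilities, and `cider_d` a per-image
    CIDEr-D reward tensor.
    """
    model.train()  # keep dropout layers active: each forward pass is one
                   # MC-dropout sample from the approximate posterior
    log_probs, rewards = [], []
    for _ in range(num_samples):
        captions, log_prob = model.sample(images)   # hypothetical API
        log_probs.append(log_prob)                  # shape: (batch,)
        rewards.append(cider_d(captions, refs))     # shape: (batch,)
    rewards = torch.stack(rewards)                  # (num_samples, batch)
    log_probs = torch.stack(log_probs)              # (num_samples, batch)
    # B-SCST baseline: mean CIDEr-D across the MC-dropout samples.
    baseline = rewards.mean(dim=0, keepdim=True)
    advantage = rewards - baseline
    # REINFORCE update: raise the log-probability of captions scoring
    # above the baseline, lower it for those scoring below.
    return -(advantage.detach() * log_probs).mean()
```

In contrast to standard SCST, where the baseline is the score of a single greedily decoded caption, this baseline averages over multiple stochastic samples, which is the Bayesian modification the paper proposes.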