Title
Image to Language Understanding: Captioning approach
Authors
Abstract
Extracting context from visual representations is of great importance in the advancement of Computer Science. Expressing such content in Natural Language has a wide variety of applications, such as assisting the visually impaired. The task combines Computer Vision and Natural Language Processing techniques, which makes it a hard problem to solve. This project aims to compare different approaches to the image captioning problem. Specifically, the focus is on two different types of models: an encoder-decoder approach and a multi-modal approach. Within the encoder-decoder approach, inject and merge architectures were compared against a multi-modal image captioning approach based primarily on object detection. These approaches were compared on the basis of state-of-the-art sentence comparison metrics such as BLEU, GLEU, METEOR, and ROUGE, on a subset of the Google Conceptual Captions dataset containing 100k images. On the basis of this comparison, we observed that the best model was the Inception-injected encoder-decoder model. This best approach has been deployed as a web-based system: on uploading an image, the system outputs the best caption associated with that image.
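To make the "inject" encoder-decoder idea from the abstract concrete, the following is a minimal, illustrative sketch (not the authors' implementation) of how InceptionV3 image features can be injected into an LSTM decoder's input sequence in Keras. The vocabulary size, maximum caption length, and embedding dimension are assumed values chosen for illustration.

```python
# Minimal sketch of an "inject" captioning model: the image feature is
# prepended to the word sequence, so the LSTM sees it as a pseudo-word.
# vocab_size, max_len, and embed_dim are assumed, not from the paper.
import tensorflow as tf
from tensorflow.keras import layers, Model

vocab_size, max_len, embed_dim = 10000, 20, 256  # illustrative assumptions

# Encoder: pretrained InceptionV3, global-average-pooled to one vector.
cnn = tf.keras.applications.InceptionV3(include_top=False, pooling="avg")
cnn.trainable = False

image_in = layers.Input(shape=(299, 299, 3))
img_feat = layers.Dense(embed_dim)(cnn(image_in))    # project to embed_dim
img_feat = layers.Reshape((1, embed_dim))(img_feat)  # shape: (1, embed_dim)

caption_in = layers.Input(shape=(max_len,))
word_emb = layers.Embedding(vocab_size, embed_dim)(caption_in)

# Inject: concatenate the image feature in front of the word embeddings,
# then decode the joint sequence with an LSTM to predict the next word.
seq = layers.Concatenate(axis=1)([img_feat, word_emb])
hidden = layers.LSTM(512)(seq)
next_word = layers.Dense(vocab_size, activation="softmax")(hidden)

model = Model([image_in, caption_in], next_word)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```

By contrast, in the merge architecture mentioned in the abstract, the image feature would not enter the recurrent layer at all; it would instead be combined with the LSTM's output after decoding, just before the final word prediction.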