An Integrated Approach for Video Captioning and Applications

Paper Title

An Integrated Approach for Video Captioning and Applications

Authors

Soheyla Amirian, Thiab R. Taha, Khaled Rasheed, Hamid R. Arabnia

Abstract

Physical computing infrastructure, data gathering, and algorithms have recently advanced significantly in their ability to extract information from images and videos. The growth has been especially strong in image captioning and video captioning. However, most advances in video captioning still target short videos. In this research, we caption longer videos using only keyframes, which are a small subset of the total video frames. Instead of processing thousands of frames, we process only a few, depending on the number of keyframes. There is a trade-off between the number of frames processed and the speed of captioning, and our approach lets the user specify this trade-off between execution time and accuracy. In addition, we argue that linking images, videos, and natural language offers many practical benefits and immediate applications. From the modeling perspective, rather than designing and staging explicit algorithms that process videos and generate captions in complex pipelines, our contribution lies in designing hybrid deep learning architectures that caption long videos by captioning their keyframes. We consider the technology and methodology we have developed as steps toward the applications discussed in this research.
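
The abstract does not say how keyframes are chosen, so the following is only a minimal sketch of the general idea, assuming a simple frame-differencing heuristic. The names `frame_difference`, `select_keyframes`, and `budget` are hypothetical and do not come from the paper. The `budget` parameter stands in for the user-specified trade-off: a larger budget passes more frames to the captioning model (slower, potentially more accurate), a smaller one passes fewer. The captioning model itself is omitted.

```python
from typing import List, Sequence


def frame_difference(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean absolute difference between two equally sized frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def select_keyframes(frames: List[Sequence[float]], budget: int) -> List[int]:
    """Return the indices of at most `budget` keyframes, in temporal order.

    Assumption (not from the paper): a frame is a good keyframe candidate
    when it differs strongly from the frame immediately before it.
    """
    if budget < 1 or not frames:
        return []
    # Score each frame by how much it differs from its predecessor.
    scores = [
        (frame_difference(frames[i - 1], frames[i]), i)
        for i in range(1, len(frames))
    ]
    # Largest change first; ties are broken by the earlier frame index.
    scores.sort(key=lambda s: (-s[0], s[1]))
    # Always keep the first frame, then add the strongest changes.
    chosen = {0} | {i for _, i in scores[: budget - 1]}
    return sorted(chosen)


# Toy "video": six 4-pixel frames with scene changes at indices 2 and 4.
video = [
    [0, 0, 0, 0],
    [0, 0, 0, 1],
    [9, 9, 9, 9],
    [9, 9, 8, 9],
    [3, 3, 3, 3],
    [3, 3, 3, 3],
]
print(select_keyframes(video, budget=3))  # prints [0, 2, 4]
```

With `budget=3`, only three of the six frames would be captioned; raising the budget trades execution time for coverage of the video.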
