REVECA: Rich Encoder-decoder framework for Video Event CAptioner

Paper title

REVECA -- Rich Encoder-decoder framework for Video Event CAptioner

Authors

Jaehyuk Heo, YongGi Jeong, Sunwoo Kim, Jaehee Kim, Pilsung Kang

Abstract

We describe an approach used in the Generic Boundary Event Captioning challenge at the Long-Form Video Understanding Workshop held at CVPR 2022. We designed a Rich Encoder-decoder framework for Video Event CAptioner (REVECA) that uses spatial and temporal information from the video to generate a caption for the corresponding event boundary. REVECA uses frame position embedding to incorporate information from before and after the event boundary. Furthermore, it employs features extracted with the temporal segment network and a temporal-based pairwise difference method to learn temporal information. To learn the subject of an event, a semantic segmentation mask is adopted in the attentional pooling process. Finally, LoRA is applied to fine-tune the image encoder and improve learning efficiency. REVECA achieved an average score of 50.97 on the Kinetics-GEBC test data, an improvement of 10.17 over the baseline method. Our code is available at https://github.com/TooTouch/REVECA.
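The abstract mentions LoRA for fine-tuning the image encoder but gives no details. As a rough illustration of the general LoRA idea (a frozen weight matrix plus a trainable low-rank update), here is a minimal pure-Python sketch. It is not the authors' implementation: the class name `LoRALinear`, the helper `matmul`, the parameters `rank` and `alpha`, and the initialization scheme are all assumptions chosen for illustration.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


class LoRALinear:
    """Hypothetical linear layer: frozen weight W (out x in) plus a low-rank update B @ A."""

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        out_dim, in_dim = len(weight), len(weight[0])
        self.weight = weight          # frozen pretrained weight, never updated
        self.scale = alpha / rank     # scaling applied to the low-rank update
        # A starts with small random values and B starts at zero, so the
        # layer initially behaves exactly like the frozen pretrained layer.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(in_dim)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(out_dim)]

    def effective_weight(self):
        """Return W + scale * (B @ A)."""
        delta = matmul(self.B, self.A)
        return [[w + self.scale * d for w, d in zip(w_row, d_row)]
                for w_row, d_row in zip(self.weight, delta)]

    def forward(self, x):
        """x is a batch of input vectors (batch x in_dim); returns batch x out_dim."""
        w = self.effective_weight()
        w_transposed = [list(col) for col in zip(*w)]
        return matmul(x, w_transposed)

    def trainable_parameters(self):
        """Only A and B would be trained; W stays frozen."""
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])
```

In this sketch only `A` and `B` would receive gradient updates, so the number of trainable parameters grows with `rank * (in_dim + out_dim)` rather than `in_dim * out_dim`. That is the general reason LoRA is considered parameter-efficient.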

Paper title