Paper Title
SSGVS: Semantic Scene Graph-to-Video Synthesis
Paper Authors
Paper Abstract
As a natural extension of the image synthesis task, video synthesis has recently attracted considerable interest. Many image synthesis works use class labels or text as guidance. However, neither labels nor text can provide explicit temporal guidance, such as when an action starts or ends. To overcome this limitation, we introduce semantic video scene graphs as input for video synthesis, as they represent the spatial and temporal relationships between objects in the scene. Since video scene graphs are usually temporally discrete annotations, we propose a video scene graph (VSG) encoder that not only encodes the existing video scene graphs but also predicts the graph representations of unlabeled frames. The VSG encoder is pre-trained with different contrastive multi-modal losses. A semantic scene graph-to-video synthesis framework (SSGVS), based on the pre-trained VSG encoder, a VQ-VAE, and an auto-regressive Transformer, is proposed to synthesize a video given an initial scene image and a variable number of semantic scene graphs. We evaluate SSGVS and other state-of-the-art video synthesis models on the Action Genome dataset and demonstrate the positive significance of video scene graphs in video synthesis. The source code will be released.
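To make the described pipeline concrete, below is a minimal, hypothetical PyTorch sketch of the three components the abstract names: a VSG encoder that pools scene-graph node features into per-frame embeddings, a VQ-VAE-style tokenizer that maps frames to discrete codes, and an auto-regressive Transformer that predicts frame tokens while cross-attending to the graph embeddings. All class names, tensor shapes, and hyperparameters (VSGEncoder, FrameTokenizer, a 512-entry codebook, 64x64 frames) are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of an SSGVS-style pipeline; shapes and modules are
# illustrative assumptions based only on the abstract.
import torch
import torch.nn as nn


class VSGEncoder(nn.Module):
    """Stand-in for the pre-trained VSG encoder: pools per-frame scene-graph
    node features into one embedding per frame."""

    def __init__(self, node_dim=64, embed_dim=256):
        super().__init__()
        self.node_proj = nn.Linear(node_dim, embed_dim)
        self.node_attn = nn.TransformerEncoderLayer(embed_dim, nhead=4, batch_first=True)

    def forward(self, node_feats):                 # (B, T, N_nodes, node_dim)
        B, T, N, _ = node_feats.shape
        x = self.node_proj(node_feats).view(B * T, N, -1)
        x = self.node_attn(x).mean(dim=1)          # pool graph nodes per frame
        return x.view(B, T, -1)                    # (B, T, embed_dim)


class FrameTokenizer(nn.Module):
    """Toy VQ-VAE-style encoder: maps 64x64 frames to 16 discrete code indices."""

    def __init__(self, codebook_size=512, embed_dim=256):
        super().__init__()
        self.enc = nn.Conv2d(3, embed_dim, kernel_size=16, stride=16)
        self.codebook = nn.Embedding(codebook_size, embed_dim)

    def encode(self, frames):                      # (B, T, 3, 64, 64)
        B, T = frames.shape[:2]
        z = self.enc(frames.flatten(0, 1))         # (B*T, embed_dim, 4, 4)
        z = z.flatten(2).transpose(1, 2)           # (B*T, 16, embed_dim)
        codes = torch.cdist(z, self.codebook.weight.unsqueeze(0)).argmin(-1)
        return codes.view(B, -1)                   # (B, T*16) token indices


class SSGVSSketch(nn.Module):
    """Auto-regressive Transformer over frame tokens, cross-attending to the
    VSG embeddings so the graphs steer what happens and when."""

    def __init__(self, codebook_size=512, embed_dim=256):
        super().__init__()
        self.token_emb = nn.Embedding(codebook_size, embed_dim)
        layer = nn.TransformerDecoderLayer(embed_dim, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.head = nn.Linear(embed_dim, codebook_size)

    def forward(self, token_ids, graph_embeds):    # (B, L), (B, T, embed_dim)
        x = self.token_emb(token_ids)
        L = x.size(1)
        causal = torch.triu(torch.full((L, L), float("-inf")), diagonal=1).to(x.device)
        h = self.decoder(x, memory=graph_embeds, tgt_mask=causal)
        return self.head(h)                        # logits over the next codes


# Usage: tokenize the initial scene image, encode the scene graphs of the
# target frames, then predict the codes of future frames token by token.
init_frame = torch.randn(1, 1, 3, 64, 64)          # one initial scene image
node_feats = torch.randn(1, 8, 5, 64)              # 8 frames x 5 graph nodes each
tokens = FrameTokenizer().encode(init_frame)       # (1, 16)
graph_embeds = VSGEncoder()(node_feats)            # (1, 8, 256)
logits = SSGVSSketch()(tokens, graph_embeds)       # (1, 16, 512)
print(logits.shape)
```

In the framework described by the abstract, the VSG encoder would additionally be pre-trained with contrastive multi-modal losses and the tokenizer trained as a full VQ-VAE with a decoder; this sketch only illustrates how graph conditioning and auto-regressive token prediction could be wired together.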