Paper Title

Towards Spatio-Temporal Video Scene Text Detection via Temporal Clustering

Authors

Cai, Yuanqiang, Liu, Chang, Wang, Weiqiang, Ye, Qixiang

Abstract

With only bounding-box annotations in the spatial domain, existing video scene text detection (VSTD) benchmarks lack the temporal relations of text instances across video frames, which hinders the development of video text-related applications. In this paper, we systematically introduce a new large-scale benchmark, named STVText4, a well-designed spatio-temporal detection metric (STDM), and a novel clustering-based baseline method, referred to as Temporal Clustering (TC). STVText4 opens a challenging yet promising direction of VSTD, termed ST-VSTD, which targets detecting video scene texts in the spatial and temporal domains simultaneously. STVText4 contains more than 1.4 million text instances from 161,347 video frames of 106 videos, where each instance is annotated not only with a spatial bounding box and temporal range but also with four intrinsic attributes, including legibility, density, scale, and lifecycle, to facilitate the community. Through continuous propagation of identical texts in the video sequence, TC can accurately output the spatial quadrilaterals and temporal ranges of the texts, which sets a strong baseline for ST-VSTD. Experiments demonstrate the efficacy of our method and the great academic and practical value of STVText4. The dataset and code will be available soon.
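To make the ST-VSTD output format concrete, the following is a minimal sketch of how per-frame quadrilateral detections could be linked into tracks that each carry a temporal range. The abstract does not specify the actual Temporal Clustering algorithm, so greedy IoU-based linking across consecutive frames stands in for it here; all function names (`link_detections`, `iou`, `quad_to_box`) and the IoU threshold are illustrative assumptions, not the paper's method.

```python
# Hedged sketch of the ST-VSTD output format (spatial quadrilateral +
# temporal range). Greedy IoU linking is a stand-in for the paper's
# Temporal Clustering, whose details the abstract does not give.

def iou(a, b):
    """Intersection-over-union of axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def quad_to_box(quad):
    """Axis-aligned bounding box of a quadrilateral given as 4 (x, y) points."""
    xs, ys = [p[0] for p in quad], [p[1] for p in quad]
    return (min(xs), min(ys), max(xs), max(ys))

def link_detections(frames, iou_thresh=0.5):
    """frames: one list of quadrilaterals per video frame.
    Returns tracks as dicts with 'quads' (one per frame covered) and an
    inclusive temporal range ('start', 'end') in frame indices."""
    tracks, active = [], []          # active: indices of tracks still alive
    for t, quads in enumerate(frames):
        next_active, used = [], set()
        for ti in active:            # try to extend each live track
            last = quad_to_box(tracks[ti]['quads'][-1])
            best, best_iou = None, iou_thresh
            for j, q in enumerate(quads):
                if j in used:
                    continue
                score = iou(last, quad_to_box(q))
                if score >= best_iou:
                    best, best_iou = j, score
            if best is not None:     # matched: track continues into frame t
                used.add(best)
                tracks[ti]['quads'].append(quads[best])
                tracks[ti]['end'] = t
                next_active.append(ti)
        for j, q in enumerate(quads):  # unmatched detections start new tracks
            if j not in used:
                tracks.append({'quads': [q], 'start': t, 'end': t})
                next_active.append(len(tracks) - 1)
        active = next_active
    return tracks
```

For instance, a text box drifting slightly over frames 0-2 plus a second text appearing at frame 1 would yield two tracks with temporal ranges (0, 2) and (1, 2); the per-instance attributes (legibility, density, scale, lifecycle) described in the abstract would be additional annotations on top of such tracks.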
