Joint CNN and Transformer Network via Weakly Supervised Learning for Efficient Crowd Counting

Authors

Fusen Wang, Kai Liu, Fei Long, Nong Sang, Xiaofeng Xia, Jun Sang

Abstract

Fully supervised methods based on density map estimation are currently the mainstream approach to crowd counting. However, such methods require location-level annotation of every person in an image, which is time-consuming and laborious. Weakly supervised methods that rely only on count-level annotation are therefore urgently needed. Because CNNs are not well suited to modeling global context and the interactions between image patches, weakly supervised crowd counting with CNNs generally does not perform well. Weakly supervised models based on the Transformer were subsequently proposed to model global context and learn contrast features. However, the Transformer directly partitions crowd images into a series of tokens, which may not be a good choice because each pedestrian is an independent individual, and the network has a very large number of parameters. Hence, in this paper we propose a Joint CNN and Transformer Network (JCTNet) for crowd counting via weakly supervised learning. JCTNet consists of three parts: a CNN feature extraction module (CFM), a Transformer feature extraction module (TFM), and a counting regression module (CRM). In particular, the CFM extracts crowd semantic features and sends their patch partitions to the TFM to model global context, and the CRM predicts the number of people. Extensive experiments and visualizations demonstrate that JCTNet can effectively focus on crowd regions and obtains superior weakly supervised counting performance on five mainstream datasets. Compared with pure Transformer works, the number of model parameters is reduced by about 67% to 73%. We also attempt to explain why a model constrained only by count-level annotations can still focus on crowd regions. We believe our work can promote further research in this field.
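
The three-stage data flow described in the abstract (CFM, then patch partition into the TFM, then CRM) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the average-pool-and-project stand-in for the CNN, the single-head self-attention layer, the random untrained weights, every size (`stride`, `channels`, `patch`), and the L1 count loss are assumptions chosen only to show the tensor shapes and the fact that supervision touches a single scalar count.

```python
import numpy as np

def cnn_features(image, rng, stride=8, channels=16):
    """Hypothetical stand-in for the CFM: average-pool, then a random linear projection."""
    h, w, c = image.shape
    hs, ws = h // stride, w // stride
    pooled = image[:hs * stride, :ws * stride]
    pooled = pooled.reshape(hs, stride, ws, stride, c).mean(axis=(1, 3))
    proj = rng.standard_normal((c, channels)) / np.sqrt(c)
    return np.maximum(pooled @ proj, 0.0)  # ReLU, shape (hs, ws, channels)

def patch_tokens(fmap, patch=2):
    """Partition the feature map (not the raw image) into flattened patch tokens."""
    h, w, c = fmap.shape
    ph, pw = h // patch, w // patch
    x = fmap[:ph * patch, :pw * patch].reshape(ph, patch, pw, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (ph, pw, patch, patch, c)
    return x.reshape(ph * pw, patch * patch * c)

def self_attention(tokens, rng):
    """Assumed stand-in for the TFM: one single-head self-attention layer with a residual."""
    n, d = tokens.shape
    wq, wk, wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    q, k, v = tokens @ wq, tokens @ wk, tokens @ wv
    scores = q @ k.T / np.sqrt(d)
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return tokens + weights @ v  # every token attends to all others (global context)

def regress_count(tokens, rng):
    """Assumed stand-in for the CRM: mean-pool the tokens, then a linear map to one scalar."""
    d = tokens.shape[1]
    w = rng.standard_normal(d) / np.sqrt(d)
    return float(tokens.mean(axis=0) @ w)

def forward(image, seed=0):
    rng = np.random.default_rng(seed)
    fmap = cnn_features(image, rng)
    tokens = self_attention(patch_tokens(fmap), rng)
    return regress_count(tokens, rng)

def count_l1_loss(pred_count, true_count):
    """Count-level supervision: only the total count is compared (L1 is an assumption)."""
    return abs(pred_count - true_count)

image = np.random.default_rng(1).random((64, 64, 3))
pred = forward(image)
loss = count_l1_loss(pred, true_count=12.0)
```

With untrained random weights the predicted count is meaningless; the point is that no head positions or density maps appear anywhere in the loss, and that tokens are built from CNN features rather than from raw image patches.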

