Paper Title
Interactive Multi-scale Fusion of 2D and 3D Features for Multi-object Tracking
Paper Authors
Paper Abstract
Multiple object tracking (MOT) is a significant task in achieving autonomous driving. Traditional works attempt to complete this task based either on point clouds (PC) collected by LiDAR or on images captured by cameras. However, relying on a single sensor is not robust enough, because it might fail during the tracking process. On the other hand, fusing features from multiple modalities helps improve accuracy. As a result, new techniques that integrate features from multiple modalities across different sensors are being developed. Texture information from RGB cameras and 3D structure information from LiDAR have their respective advantages under different circumstances. However, it is not easy to achieve effective feature fusion because the two modalities carry completely distinct kinds of information. Previous fusion methods usually fuse only the top-level features after the backbones extract features from each modality. In this paper, we first introduce PointNet++ to obtain multi-scale deep representations of point clouds, making them suitable for our proposed interactive feature fusion between multi-scale image and point cloud features. Specifically, through multi-scale interactive query and fusion between pixel-level and point-level features, our method can obtain more discriminative features to improve the performance of multiple object tracking. In addition, we explore the effectiveness of pre-training on each single modality and fine-tuning the fusion-based model. The experimental results demonstrate that our method achieves good performance on the KITTI benchmark and outperforms other approaches that do not use multi-scale feature fusion. Moreover, the ablation studies indicate the effectiveness of multi-scale feature fusion and single-modality pre-training.
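To make the idea of querying pixel-level features with point-level features more concrete, below is a minimal sketch of one fusion step at a single scale. It is not the authors' implementation; the module name `PointPixelFusion`, the MLP fusion head, and the argument names (`proj_mat`, `image_size`, etc.) are assumptions introduced for illustration. The sketch assumes point features come from a PointNet++-style set abstraction level, image features from a CNN stage, and that a LiDAR-to-camera projection matrix maps 3D points onto the image plane.

```python
# A minimal sketch (assumed, not the paper's actual code) of one interactive
# fusion step between point-level and pixel-level features at a single scale.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PointPixelFusion(nn.Module):
    """Query image features at the projected locations of 3D points and
    fuse them with the corresponding point features."""

    def __init__(self, point_dim, image_dim, out_dim):
        super().__init__()
        # Simple MLP that mixes concatenated point and sampled pixel features.
        self.fuse = nn.Sequential(
            nn.Linear(point_dim + image_dim, out_dim),
            nn.ReLU(inplace=True),
            nn.Linear(out_dim, out_dim),
        )

    def forward(self, points_xyz, point_feats, image_feats, proj_mat, image_size):
        # points_xyz:  (B, N, 3)   3D points in LiDAR coordinates
        # point_feats: (B, N, Cp)  per-point features from the point branch
        # image_feats: (B, Ci, H, W) feature map from the image branch
        # proj_mat:    (B, 3, 4)   LiDAR-to-image projection matrix
        # image_size:  (img_h, img_w) of the original image, for normalization
        B, N, _ = points_xyz.shape
        ones = points_xyz.new_ones(B, N, 1)
        pts_h = torch.cat([points_xyz, ones], dim=-1)        # (B, N, 4)
        uvw = torch.bmm(pts_h, proj_mat.transpose(1, 2))     # (B, N, 3)
        uv = uvw[..., :2] / uvw[..., 2:3].clamp(min=1e-6)    # pixel coordinates

        # Normalize projected coordinates to [-1, 1] for grid_sample.
        img_h, img_w = image_size
        grid = torch.stack(
            [uv[..., 0] / (img_w - 1) * 2 - 1,
             uv[..., 1] / (img_h - 1) * 2 - 1], dim=-1
        ).unsqueeze(2)                                       # (B, N, 1, 2)

        # Bilinearly sample pixel features at the projected point locations.
        sampled = F.grid_sample(image_feats, grid, align_corners=True)
        sampled = sampled.squeeze(-1).transpose(1, 2)        # (B, N, Ci)

        # Fuse point-level and sampled pixel-level features per point.
        return self.fuse(torch.cat([point_feats, sampled], dim=-1))
```

In a multi-scale setting, one such module would presumably be applied at each scale pair (e.g., each PointNet++ set abstraction level queried against the CNN stage of matching resolution), with the fused features passed on to the tracking head.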