Paper Title
Learning to Detect Objects with a 1 Megapixel Event Camera
Paper Authors
Paper Abstract
Event cameras encode visual information with high temporal precision, low data rate, and high dynamic range. Thanks to these characteristics, event cameras are particularly suited to scenarios with high motion, challenging lighting conditions, and low-latency requirements. However, due to the novelty of the field, the performance of event-based systems on many vision tasks is still lower than that of conventional frame-based solutions. The main reasons for this performance gap are: the lower spatial resolution of event sensors compared to frame cameras; the lack of large-scale training datasets; and the absence of well-established deep learning architectures for event-based processing. In this paper, we address all of these problems in the context of an event-based object detection task. First, we publicly release the first high-resolution large-scale dataset for object detection. The dataset contains more than 14 hours of recordings from a 1 megapixel event camera in automotive scenarios, together with 25M bounding boxes of cars, pedestrians, and two-wheelers, labeled at high frequency. Second, we introduce a novel recurrent architecture for event-based detection and a temporal consistency loss for better-behaved training. The ability to compactly represent the sequence of events in the internal memory of the model is essential to achieving high accuracy. Our model outperforms feed-forward event-based architectures by a large margin. Moreover, our method does not require any reconstruction of intensity images from events, showing that training directly from raw events is not only possible but more efficient and more accurate than passing through an intermediate intensity image. Experiments on the dataset introduced in this work, for which both events and gray-level images are available, show performance on par with that of highly tuned and studied frame-based detectors.
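
The abstract does not give the paper's exact input representation; as a rough illustration of what "training directly from raw events" can look like, the sketch below (function name, bin count, and channel layout are illustrative assumptions, not the authors' encoding) accumulates raw events (x, y, timestamp, polarity) into a dense per-step tensor that a convolutional network can consume, with no intermediate intensity image.

import numpy as np

def events_to_tensor(x, y, t, p, height, width, num_bins=5):
    # Bin raw events into a (2 * num_bins, H, W) histogram volume: one
    # negative- and one positive-polarity channel per temporal bin.
    # Typically called once per fixed time slice of the event stream.
    vol = np.zeros((2 * num_bins, height, width), dtype=np.float32)
    if t.size == 0:
        return vol
    # Normalize timestamps into [0, num_bins) and clamp the last event in.
    t_norm = (t - t.min()) / max(float(t.max() - t.min()), 1e-9)
    b = np.minimum((t_norm * num_bins).astype(np.int64), num_bins - 1)
    ch = 2 * b + (p > 0).astype(np.int64)  # even: OFF events, odd: ON events
    np.add.at(vol, (ch, y, x), 1.0)        # accumulate per-pixel event counts
    return vol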
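
Likewise, the recurrent detector is only named in the abstract; the following PyTorch sketch is one plausible instantiation of the described idea, not the authors' architecture: a convolutional encoder per time step, a ConvLSTM cell (a common choice of convolutional recurrence) whose hidden state plays the role of the model's internal event memory, and SSD-style box/class heads. All layer sizes are made up for illustration, and the input channel count matches the 2 * 5 channels of the encoding sketch above.

import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    # A standard convolutional LSTM cell computing all four gates at once.
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        self.hid_ch = hid_ch
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = self.gates(torch.cat([x, h], dim=1)).chunk(4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c

class RecurrentEventDetector(nn.Module):
    def __init__(self, in_ch=10, hid_ch=64, num_anchors=9, num_classes=3):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.rnn = ConvLSTMCell(64, hid_ch)
        # Per-anchor box regression (4 coords) and class scores, SSD-style.
        self.box_head = nn.Conv2d(hid_ch, num_anchors * 4, 3, padding=1)
        self.cls_head = nn.Conv2d(hid_ch, num_anchors * num_classes, 3, padding=1)

    def forward(self, event_tensors):
        # event_tensors: (T, B, C, H, W) sequence of per-step event volumes.
        h = c = None
        outputs = []
        for step in event_tensors:         # iterate over time
            feat = self.encoder(step)
            if h is None:                  # lazily initialize the memory
                h = feat.new_zeros(feat.shape[0], self.rnn.hid_ch,
                                   *feat.shape[-2:])
                c = torch.zeros_like(h)
            h, c = self.rnn(feat, (h, c))  # memory persists across steps
            outputs.append((self.box_head(h), self.cls_head(h)))
        return outputs

At inference, the recurrent loop can be unrolled indefinitely over the stream one step at a time, which is what makes a recurrent formulation attractive for the low-latency setting the abstract describes.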
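
Finally, the temporal consistency loss is not defined in the abstract; one hedged reading is a regularizer that discourages the dense predictions of consecutive time steps from jumping abruptly, as sketched below. The smooth-L1 form and the detached previous-step target are assumptions for illustration, not the paper's formulation.

import torch
import torch.nn.functional as F

def temporal_consistency_loss(preds):
    # preds: list of (box_map, cls_map) pairs, one per time step, as
    # returned by the RecurrentEventDetector sketch above.
    loss = preds[0][0].new_zeros(())
    for (box_t, cls_t), (box_p, cls_p) in zip(preds[1:], preds[:-1]):
        # Treat the previous step as a fixed target so only the current
        # step is pulled toward it (an assumed design choice).
        loss = loss + F.smooth_l1_loss(box_t, box_p.detach())
        loss = loss + F.smooth_l1_loss(cls_t, cls_p.detach())
    return loss / max(len(preds) - 1, 1)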