LPYOLO: Low Precision YOLO for Face Detection on FPGA

Paper title

LPYOLO: Low Precision YOLO for Face Detection on FPGA

Authors

Günay, Bestami; Okcu, Sefa Burak; Bilge, Hasan Şakir

Abstract

In recent years, the number of edge computing devices, and of artificial intelligence applications running on them, has grown rapidly. In edge computing, decision making and computation move from servers to edge devices, so cheap, low-power devices are required. FPGAs consume very little power, lend themselves to parallel operations, and are well suited to running Convolutional Neural Networks (CNNs), the fundamental unit of an artificial intelligence application. Face detection in surveillance systems is the most anticipated application in the security market. In this work, the TinyYolov3 architecture, a CNN-based object detection method developed for embedded systems, is redesigned and deployed for face detection. The target board is the PYNQ-Z2, which carries a low-end Xilinx Zynq 7020 System-on-Chip (SoC). The redesigned TinyYolov3 model is defined at several bit-width precisions with the Brevitas library, which provides fundamental CNN layers and activations in integer-quantized form. The model is then trained in quantized form on the WiderFace dataset. To reduce latency and power consumption, the on-chip memory of the FPGA is configured to store all network parameters, and the last activation function is changed from Sigmoid to a rescaled HardTanh. A high degree of parallelism is also applied to the logic resources of the FPGA. The model is converted to an HLS-based application using the FINN framework and the FINN-HLS library, which contains the layer definitions in C++. The model is then synthesized and deployed. The CPU of the SoC uses a multithreading mechanism and is responsible for preprocessing, postprocessing, and TCP/IP streaming operations. As a result, the 4-bit precision model achieves 2.4 W total board power consumption, a throughput of 18 frames per second (FPS), and 0.757 mAP on the Easy category of WiderFace.
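The abstract does not say how the HardTanh is rescaled. As a rough illustration only, the sketch below assumes a HardTanh clipped to [-1, 1] and mapped affinely onto [0, 1], so that it covers the same output range as Sigmoid while needing only a clip, an add, and a multiply. The function names and the 4-bit unsigned quantizer are hypothetical and are not taken from the paper, Brevitas, or FINN.

```python
import math


def sigmoid(x: float) -> float:
    """Reference Sigmoid; needs an exponential, which is costly in FPGA logic."""
    return 1.0 / (1.0 + math.exp(-x))


def rescaled_hardtanh(x: float) -> float:
    """Assumed rescaling: clip to [-1, 1], then map affinely onto [0, 1]."""
    clipped = max(-1.0, min(1.0, x))
    return 0.5 * (clipped + 1.0)


def quantize_unsigned(value: float, bits: int = 4) -> int:
    """Map a value in [0, 1] to one of 2**bits uniform integer levels (hypothetical)."""
    levels = (1 << bits) - 1
    return round(value * levels)


if __name__ == "__main__":
    # Compare the two activations on a few sample inputs.
    for x in (-3.0, -0.5, 0.0, 0.5, 3.0):
        y = rescaled_hardtanh(x)
        print(f"x={x:+.1f}  sigmoid={sigmoid(x):.3f}  "
              f"rescaled_hardtanh={y:.3f}  4-bit level={quantize_unsigned(y)}")
```

Both functions are monotonic, agree at x = 0, and saturate toward 0 and 1. The piecewise-linear form also lands exactly on the ends of the quantized range, which is one plausible reason to prefer it in a low-precision pipeline.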
