RapidLayout: Fast Hard Block Placement of FPGA-optimized Systolic Arrays using Evolutionary Algorithms

Paper Title

RapidLayout: Fast Hard Block Placement of FPGA-optimized Systolic Arrays using Evolutionary Algorithms

Authors

Niansong Zhang, Xiang Chen, Nachiket Kapre

Abstract

Evolutionary algorithms can outperform conventional placement approaches, such as simulated annealing, analytical placement, and manual placement, on metrics such as runtime, wirelength, pipelining cost, and clock frequency when mapping FPGA hard-block-intensive designs such as systolic arrays onto Xilinx UltraScale+ FPGAs. For certain hard-block-intensive systolic array accelerator designs, the commercial-grade Xilinx Vivado CAD tool is unable to provide a legal routing solution without tedious manual placement constraints. Instead, we formulate an automatic FPGA placement algorithm for these hard blocks as a multi-objective optimization problem that targets wirelength squared and maximum bounding box size metrics. We build an end-to-end placement and routing flow called RapidLayout using the Xilinx RapidWright framework. RapidLayout runs 5-6$\times$ faster than Vivado with manual constraints and eliminates the weeks-long effort to generate placement constraints manually for the hard blocks. We also perform automated post-placement pipelining of the long wires inside each convolution block to target 650MHz URAM-limited operation. RapidLayout (1) outperforms the simulated annealer in VPR by 33% in runtime, 1.9-2.4$\times$ in wirelength, and 3-4$\times$ in bounding box size, and (2) beats the analytical placer UTPlaceF by 9.3$\times$ in runtime, 1.8-2.2$\times$ in wirelength, and 2-2.7$\times$ in bounding box size. We employ transfer learning from a base FPGA device to speed up placement optimization for similar FPGA devices in the UltraScale+ family by 11-14$\times$ compared with learning the placements from scratch.
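
As a rough illustration of the two objectives named in the abstract, the sketch below scores a candidate hard-block placement by total squared wirelength and by the largest bounding box over a set of blocks. It is not the RapidLayout implementation: the function names, the use of Manhattan distance, the half-perimeter definition of bounding box size, and the grouping of blocks into `units` are all assumptions made for this example.

```python
from typing import Dict, List, Tuple

Coord = Tuple[int, int]  # (column, row) of a hard block site; hypothetical encoding


def placement_objectives(
    placement: Dict[str, Coord],
    edges: List[Tuple[str, str]],
    units: List[List[str]],
) -> Tuple[int, int]:
    """Return (sum of squared Manhattan wirelengths, max half-perimeter bounding box).

    Both definitions are assumptions for illustration, not the paper's exact metrics.
    """
    # Objective 1: squared Manhattan distance summed over all block-to-block connections.
    wirelength_sq = 0
    for a, b in edges:
        (ax, ay), (bx, by) = placement[a], placement[b]
        wirelength_sq += (abs(ax - bx) + abs(ay - by)) ** 2

    # Objective 2: the largest bounding box (width + height) over all block groups.
    max_bbox = 0
    for unit in units:
        xs = [placement[name][0] for name in unit]
        ys = [placement[name][1] for name in unit]
        max_bbox = max(max_bbox, (max(xs) - min(xs)) + (max(ys) - min(ys)))

    return wirelength_sq, max_bbox


def dominates(p: Tuple[int, int], q: Tuple[int, int]) -> bool:
    """Pareto dominance for minimization: p is no worse everywhere and better somewhere."""
    return all(x <= y for x, y in zip(p, q)) and any(x < y for x, y in zip(p, q))
```

A multi-objective evolutionary search would evaluate each candidate placement with a function like `placement_objectives` and use a comparison such as `dominates` to decide which candidates survive to the next generation.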
