Paper Title
LEAPER: Fast and Accurate FPGA-based System Performance Prediction via Transfer Learning
Paper Authors
Paper Abstract
Machine learning has recently gained traction as a way to overcome the slow accelerator generation and implementation process on an FPGA. It can be used to build performance and resource usage models that enable fast early-stage design space exploration. However, ML-based modeling has three key limitations. First, training requires large amounts of data (features extracted from design synthesis and implementation tools), which is cost-inefficient because of the time-consuming accelerator design and implementation process. Second, a model trained for a specific environment cannot predict performance or resource usage for a new, unknown environment. In a cloud system, renting a platform for data collection to build an ML model can significantly increase the total cost of ownership (TCO) of the system. Third, ML-based models trained using a limited number of samples are prone to overfitting. To overcome these limitations, we propose LEAPER, a transfer learning-based approach for predicting performance and resource usage in FPGA-based systems. The key idea of LEAPER is to transfer an ML-based performance and resource usage model trained for a low-end edge environment to a new, high-end cloud environment, providing fast and accurate predictions for accelerator implementation. Experimental results show that LEAPER (1) achieves, on average across six workloads and five FPGAs, 85% accuracy when we use our transferred model for prediction in a cloud environment with 5-shot learning, and (2) reduces design space exploration time for accelerator implementation on an FPGA by 10x, from days to only a few hours.
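The abstract's core idea, transferring a performance model trained on a low-end edge FPGA to a high-end cloud FPGA using only a handful of labeled cloud samples ("5-shot" learning), can be illustrated with a minimal sketch. Everything below is a hypothetical toy (synthetic data, a random-forest base model, and a linear correction as the transfer step); it is not the paper's actual code, feature set, or transfer method.

```python
# Toy sketch of few-shot model transfer for performance prediction.
# Assumptions (not from the paper): synthetic features/labels, a random
# forest as the base model, and a learned linear map as the transfer step.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Plentiful "edge" training data: design features -> latency (arbitrary units).
X_edge = rng.random((200, 4))  # e.g. unroll factor, pipeline II, BRAM, DSP usage
y_edge = 3.0 * X_edge[:, 0] + 2.0 * X_edge[:, 1] + 0.5

base_model = RandomForestRegressor(n_estimators=50, random_state=0)
base_model.fit(X_edge, y_edge)

# Assume the cloud FPGA behaves like a scaled/shifted edge FPGA:
# y_cloud = 0.4 * y_edge + 1.0. Only 5 labeled cloud samples are available.
X_cloud = rng.random((5, 4))
y_cloud = 0.4 * (3.0 * X_cloud[:, 0] + 2.0 * X_cloud[:, 1] + 0.5) + 1.0

# Transfer step: fit a linear correction from base-model predictions
# on the 5 cloud samples to the observed cloud labels.
correction = LinearRegression()
correction.fit(base_model.predict(X_cloud).reshape(-1, 1), y_cloud)

# Predict latency for an unseen cloud design point without re-collecting
# a full cloud training set.
x_new = rng.random((1, 4))
y_pred = correction.predict(base_model.predict(x_new).reshape(-1, 1))
print(float(y_pred[0]))
```

The design point this illustrates is the cost argument from the abstract: the expensive data collection (200 samples here) happens once on the cheap edge platform, while the rented cloud platform only needs to produce 5 labeled samples for the transfer step.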