Paper Title

POAS: A high-performance scheduling framework for exploiting Accelerator Level Parallelism

Authors

Martínez, Pablo Antonio; Bernabé, Gregorio; García, Jose Manuel

Abstract

Heterogeneous computing is becoming mainstream across all computing domains. This new era in computer architecture brings a new paradigm called Accelerator Level Parallelism (ALP). In ALP, accelerators are used concurrently to provide unprecedented levels of performance and energy efficiency. Achieving this requires solving many problems, one of the most challenging being co-execution. This paper develops a scheduling framework called POAS, a general method for providing co-execution to generic applications. Unlike other scheduling approaches, POAS does not directly schedule applications. Instead, it is a generic model that transforms any application to make it suitable for co-execution, so that it can be executed in ALP environments. Our proposal is composed of four distinct steps: predict, optimize, adapt and schedule. During these phases, different modifications are applied to the application to make it suitable for execution in ALP environments. In this work we also apply our framework to a matrix multiplication case study, outlining the most important steps for porting the application with POAS. We evaluate our POAS-based implementation of matrix multiplication on a CPU/GPU/XPU environment using CPU cores, CUDA cores and tensor cores (XPU). Our experiments show that co-execution in the studied scenario can benefit from ALP, yielding speedups of up to 45% with respect to using only one accelerator. The demonstrated flexibility and potential of POAS make it an excellent candidate for achieving ALP in future computer systems.
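The abstract does not detail how work is divided among devices, so the following is only a minimal sketch of the general idea behind co-execution, not the POAS method itself. It splits the rows of a matrix multiplication among several devices in proportion to an assumed relative throughput, so that all devices would finish at roughly the same time. The function names `split_rows` and `makespan` and the throughput values are hypothetical and do not come from the paper.

```python
def split_rows(total_rows, throughputs):
    """Split total_rows among devices in proportion to their throughput.

    throughputs maps a device name to a relative throughput
    (rows per unit time). These values are assumptions supplied by
    the caller, not measurements from the paper.
    Returns a dict of device name -> row count that sums to total_rows.
    """
    total = sum(throughputs.values())
    # Ideal fractional share of rows for each device.
    shares = {d: total_rows * t / total for d, t in throughputs.items()}
    # Round down, then give the leftover rows to the devices with the
    # largest fractional remainders so the counts sum to total_rows.
    rows = {d: int(s) for d, s in shares.items()}
    leftover = total_rows - sum(rows.values())
    by_remainder = sorted(shares, key=lambda d: shares[d] - rows[d], reverse=True)
    for d in by_remainder[:leftover]:
        rows[d] += 1
    return rows


def makespan(rows, throughputs):
    """Time until the slowest device finishes its share of rows."""
    return max(rows[d] / throughputs[d] for d in rows)


if __name__ == "__main__":
    # Hypothetical relative throughputs for the three device types.
    speeds = {"cpu": 1.0, "gpu": 3.0, "xpu": 6.0}
    plan = split_rows(1000, speeds)
    print(plan)
    print("co-execution time:", makespan(plan, speeds))
    print("fastest device alone:", 1000 / max(speeds.values()))
```

Under this simple model every device finishes at about the same time, which is why co-execution can beat the fastest device running alone. A real system would also have to account for data transfer costs and prediction error, which this sketch ignores.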
