Title
An Intelligent Framework for Oversubscription Management in CPU-GPU Unified Memory
Authors
Abstract
This paper proposes a novel intelligent framework for oversubscription management in CPU-GPU unified virtual memory (UVM). We analyze the current rule-based methods for GPU memory oversubscription with unified memory, as well as current learning-based methods for other computer architecture components. We then identify the performance gap between the existing rule-based methods and the theoretical upper bound, along with the advantages of applying machine intelligence and the limitations of existing learning-based methods. The proposed framework consists of an access pattern classifier followed by a pattern-specific Transformer-based model that uses a novel loss function aimed at reducing page thrashing. A policy engine leverages the model's predictions to perform accurate page prefetching and pre-eviction. We evaluate our intelligent framework on 11 memory-intensive benchmarks from popular benchmark suites. Our solution outperforms the state-of-the-art (SOTA) methods for oversubscription management: compared to the baseline, it reduces the number of thrashed pages by 64.4\% under 125\% memory oversubscription, whereas the SOTA method reduces them by 17.3\%. Our solution achieves an average IPC improvement of 1.52X under 125\% memory oversubscription and 3.66X under 150\% memory oversubscription. It also outperforms existing learning-based methods for page address prediction, improving top-1 accuracy by 6.45\% on average (up to 41.2\%) for a single GPGPU workload and by 10.2\% on average (up to 30.2\%) for multiple concurrent GPGPU workloads.
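To make the described pipeline (pattern-specific Transformer predictor feeding a prefetch/pre-eviction policy engine) concrete, the following is a minimal PyTorch sketch. All names and hyperparameters here (PagePredictor, policy_step, the delta vocabulary size, the top-k choice) are illustrative assumptions, not the paper's implementation; the access pattern classifier and the novel thrashing-aware loss function are omitted.

    # Illustrative sketch only; hypothetical names throughout.
    import torch
    import torch.nn as nn

    class PagePredictor(nn.Module):
        """Pattern-specific Transformer that predicts the next page delta
        from a sequence of recent page-access deltas."""
        def __init__(self, num_deltas: int, d_model: int = 64,
                     nhead: int = 4, nlayers: int = 2):
            super().__init__()
            self.embed = nn.Embedding(num_deltas, d_model)
            layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, nlayers)
            self.head = nn.Linear(d_model, num_deltas)

        def forward(self, delta_seq: torch.Tensor) -> torch.Tensor:
            h = self.encoder(self.embed(delta_seq))
            return self.head(h[:, -1])  # logits over candidate next deltas

    def policy_step(logits: torch.Tensor, k: int = 4):
        """Toy policy engine: prefetch the k most likely next pages; treat the
        least likely pages as pre-eviction candidates under oversubscription."""
        probs = logits.softmax(dim=-1)
        prefetch = probs.topk(k).indices                   # fetch ahead of demand
        evict = probs.topk(k, largest=False).indices       # coldest predictions
        return prefetch, evict

In this sketch the predictor operates on page-access deltas rather than raw addresses, a common choice in learning-based prefetching that keeps the output vocabulary small; the actual representation used by the paper may differ.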