Paper Title
iGniter: Interference-Aware GPU Resource Provisioning for Predictable DNN Inference in the Cloud
Paper Authors
Paper Abstract
GPUs are essential to accelerating latency-sensitive deep neural network (DNN) inference workloads in cloud datacenters. To fully utilize GPU resources, spatial sharing of GPUs among co-located DNN inference workloads becomes increasingly compelling. However, GPU sharing inevitably brings severe performance interference among co-located inference workloads, as evidenced by an empirical measurement study of DNN inference on EC2 GPU instances. While existing works on guaranteeing inference performance service level objectives (SLOs) focus on either temporal sharing of GPUs or reactive GPU resource scaling and inference migration techniques, how to proactively mitigate such severe performance interference has received comparatively little attention. In this paper, we propose iGniter, an interference-aware GPU resource provisioning framework for cost-efficiently achieving predictable DNN inference in the cloud. iGniter consists of two key components: (1) a lightweight DNN inference performance model, which leverages practically accessible system and workload metrics to capture the performance interference; and (2) a cost-efficient GPU resource provisioning strategy that jointly optimizes the GPU resource allocation and adaptive batching based on our inference performance model, with the aim of achieving predictable performance of DNN inference workloads. We implement a prototype of iGniter based on the NVIDIA Triton inference server hosted on EC2 GPU instances. Extensive prototype experiments on four representative DNN models and datasets demonstrate that iGniter can guarantee the performance SLOs of DNN inference workloads with practically acceptable runtime overhead, while reducing the monetary cost by up to 25% in comparison to state-of-the-art GPU resource provisioning strategies.
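To make the joint optimization described in the abstract concrete, below is a minimal, hypothetical sketch of how an interference-aware provisioning decision might be made for a single workload: given a toy latency model, pick the smallest GPU share (lowest cost) and the largest batch size that still satisfy both the latency SLO and the arrival rate. The function names (`predict_latency_ms`, `provision`), the latency coefficients, and the interference factor are all illustrative assumptions, not iGniter's actual model, which is built from measured system and workload metrics.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Workload:
    name: str
    slo_ms: float        # end-to-end latency SLO in milliseconds
    arrival_rate: float  # request arrival rate in requests per second


def predict_latency_ms(gpu_share: float, batch: int, interference: float) -> float:
    """Hypothetical stand-in for iGniter's performance model: GPU execution time
    grows roughly linearly with batch size, shrinks with the allocated GPU share,
    and is inflated by an interference factor from co-located workloads.
    The coefficients below are illustrative, not measured."""
    fixed_ms, per_request_ms = 2.0, 0.25
    return (fixed_ms + per_request_ms * batch) / max(gpu_share, 1e-3) * (1.0 + interference)


def provision(wl: Workload, interference: float = 0.15) -> Optional[Tuple[float, int, float]]:
    """Pick the smallest GPU share (i.e., lowest cost) and the largest batch size
    that together satisfy both the latency SLO and the arrival rate, mirroring
    the joint optimization of GPU allocation and adaptive batching described in
    the abstract. Search granularity and constants are illustrative only."""
    for share in (round(0.1 * s, 1) for s in range(1, 11)):   # 10% .. 100% of one GPU
        for batch in range(32, 0, -1):                        # prefer larger batches
            serve_ms = predict_latency_ms(share, batch, interference)
            queueing_ms = 1000.0 * batch / wl.arrival_rate    # time to accumulate a batch
            meets_slo = queueing_ms + serve_ms <= wl.slo_ms
            meets_rate = 1000.0 * batch / serve_ms >= wl.arrival_rate
            if meets_slo and meets_rate:
                return share, batch, queueing_ms + serve_ms
    return None  # no feasible single-GPU placement under these assumptions


if __name__ == "__main__":
    wl = Workload(name="resnet50", slo_ms=50.0, arrival_rate=400.0)
    print(provision(wl))  # e.g., a small GPU share with a moderate batch size
```

In the full system, the predicted latency would come from the learned, interference-aware performance model rather than fixed coefficients, and the search would place multiple workloads across GPUs to minimize total monetary cost.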