Paper Title

ORCA: A Network and Architecture Co-design for Offloading µs-scale Datacenter Applications

Authors

Yifan Yuan, Jinghan Huang, Yan Sun, Tianchen Wang, Jacob Nelson, Dan R. K. Ports, Yipeng Wang, Ren Wang, Charlie Tai, Nam Sung Kim

Abstract

In response to the "datacenter tax" and "killer microseconds" problems of datacenter applications, diverse solutions have been proposed, including SmartNIC-based ones. Nonetheless, they often suffer from the high overhead of communication over network and/or PCIe links. To tackle the limitations of current solutions, this paper proposes ORCA, a holistic network and architecture co-design that leverages current RDMA and emerging cache-coherent off-chip interconnect technologies. Specifically, ORCA consists of four hardware and software components: (1) a unified abstraction of inter- and intra-machine communication managed by one-sided RDMA writes and cache-coherent memory writes; (2) efficient notification of requests to accelerators, assisted by cache coherence; (3) a cache-coherent accelerator architecture that directly processes requests received by the NIC; and (4) adaptive device-to-host data transfer for modern server memory systems consisting of both DRAM and NVM, exploiting state-of-the-art features in CPUs and PCIe. We prototype ORCA with a commercial system and evaluate three popular datacenter applications: an in-memory key-value store, a chain replication-based distributed transaction system, and deep learning recommendation model inference. The evaluation shows that ORCA provides 30.1% to 69.1% lower latency, up to 2.5x higher throughput, and 3x higher power efficiency than the current state-of-the-art solutions.
