Paper title
PALMED: Throughput Characterization for Superscalar Architectures -- Extended Version
Paper authors
Paper abstract
In a super-scalar architecture, the scheduler dynamically assigns micro-operations ($μ$OPs) to execution ports. The port mapping of an architecture describes how an instruction decomposes into $μ$OPs and lists, for each $μ$OP, the set of ports it can be mapped to. Compilers and performance debugging tools use it to characterize the throughput of a sequence of instructions executed repeatedly as the core component of a loop. This paper introduces a dual, equivalent representation: the resource mapping of an architecture is an abstract model in which, to be executed, an instruction must use a set of abstract resources, each representing a combination of execution ports. For a given architecture, finding a port mapping is an important but difficult problem. Building a resource mapping is a more tractable problem and yields a simpler, equivalent model. This paper describes Palmed, a tool that automatically builds a resource mapping for pipelined, super-scalar, out-of-order CPU architectures. Palmed does not require hardware performance counters and relies solely on runtime measurements. We evaluate the pertinence of our dual representation for throughput modeling by extracting a representative set of basic blocks from the compiled binaries of the SPEC CPU 2017 benchmarks. We compare the throughput predicted by existing machine models to that produced by Palmed and find accuracy comparable to state-of-the-art tools, achieving a mean square error rate below 10% on this workload on Intel's Skylake microarchitecture.
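To make the resource-mapping idea concrete, here is a minimal Python sketch of how such a model could predict the throughput of a loop body. It is an illustration under stated assumptions, not Palmed's implementation. The instruction names (`ADD`, `MUL`), the resource names (`r_p1`, `r_p01`), and the usage values are all hypothetical. The sketch also assumes that the steady-state cost of a kernel is the load on its most heavily used abstract resource, a rule the abstract does not spell out.

```python
from fractions import Fraction

# Hypothetical resource mapping (an assumption for illustration only):
# RESOURCE_USAGE[instr][resource] is the share of one cycle of that abstract
# resource consumed by one execution of the instruction.
# "r_p1" stands for port 1 alone; "r_p01" stands for the pair {port 0, port 1},
# which can serve two micro-operations per cycle, hence the 1/2 usage.
RESOURCE_USAGE = {
    "ADD": {"r_p01": Fraction(1, 2)},
    "MUL": {"r_p1": Fraction(1), "r_p01": Fraction(1, 2)},
}


def kernel_cycles(kernel):
    """Return the assumed steady-state cycles per iteration of a kernel.

    `kernel` maps an instruction name to its number of occurrences per
    iteration. Under the assumed rule, the cost is the load on the most
    heavily used abstract resource.
    """
    load = {}
    for instr, count in kernel.items():
        for resource, usage in RESOURCE_USAGE[instr].items():
            load[resource] = load.get(resource, Fraction(0)) + count * usage
    return max(load.values())


def kernel_ipc(kernel):
    """Return the predicted instructions per cycle for the kernel."""
    return sum(kernel.values()) / kernel_cycles(kernel)


if __name__ == "__main__":
    loop_body = {"ADD": 2, "MUL": 1}
    print("cycles per iteration:", kernel_cycles(loop_body))
    print("instructions per cycle:", kernel_ipc(loop_body))
```

In this toy mapping, two `ADD`s and one `MUL` load `r_p1` with 1 cycle and `r_p01` with 3/2 cycles, so the combined resource is the bottleneck. No scheduling of individual $μ$OPs onto ports is needed to obtain the estimate, which is the practical appeal of the dual representation.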