Reduction of the Random Access Memory Size in Adjoint Algorithmic Differentiation by Overloading

Title

Reduction of the Random Access Memory Size in Adjoint Algorithmic Differentiation by Overloading

Author

Naumann, Uwe

Abstract

Adjoint algorithmic differentiation by operator and function overloading is based on the interpretation of directed acyclic graphs resulting from evaluations of numerical simulation programs. The size of the computer system memory required to store the graph grows in proportion to the number of floating-point operations executed by the underlying program, and it quickly exceeds the available memory resources. Naive adjoint algorithmic differentiation therefore often becomes infeasible except for relatively simple numerical simulations. Access to the data associated with the graph can be classified as sequential or random. The latter refers to memory access patterns defined by the adjacency relationship between vertices within the graph. Sequentially accessed data can be decomposed into blocks. The blocks can be streamed across the system memory hierarchy, thus extending the amount of available memory, for example, to hard disks. Asynchronous I/O can help to mitigate the increased cost due to accesses to slower memory. Much larger problem instances can thus be solved without resorting to technically challenging user intervention such as checkpointing. Randomly accessed data should not have to be decomposed. Its block-wise streaming is likely to yield a substantial overhead in computational cost due to data accesses across blocks. Consequently, the size of the randomly accessed memory required by an adjoint should be kept minimal in order to eliminate the need for decomposition. We propose a combination of dedicated memory for adjoint $L$-values with the exploitation of remainder bandwidth as a possible solution. Test results indicate significant savings in random access memory size while preserving overall computational efficiency.
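The distinction between sequentially and randomly accessed data can be illustrated with a minimal sketch of naive tape-based adjoint differentiation by operator overloading. This is not the paper's implementation, and it does not include the proposed memory reduction; the names `Tape`, `Var`, `sin`, and `reverse_sweep` are hypothetical and chosen only for this illustration. The tape is written and read strictly in order, whereas the adjoint vector is indexed through the argument indices stored on the tape, which is the random access pattern the abstract refers to.

```python
import math


class Tape:
    """Records one entry per elementary operation (sequentially accessed)."""

    def __init__(self):
        self.entries = []   # (result index, ((argument index, partial), ...))
        self.num_vars = 0   # number of variable indices handed out so far

    def new_index(self):
        self.num_vars += 1
        return self.num_vars - 1

    def record(self, args_partials):
        idx = self.new_index()
        self.entries.append((idx, tuple(args_partials)))
        return idx


class Var:
    """Overloaded scalar type; each operation appends one vertex to the tape."""

    def __init__(self, tape, value, index=None):
        self.tape = tape
        self.value = value
        self.index = tape.new_index() if index is None else index

    def __add__(self, other):
        idx = self.tape.record([(self.index, 1.0), (other.index, 1.0)])
        return Var(self.tape, self.value + other.value, idx)

    def __mul__(self, other):
        idx = self.tape.record([(self.index, other.value),
                                (other.index, self.value)])
        return Var(self.tape, self.value * other.value, idx)


def sin(x):
    idx = x.tape.record([(x.index, math.cos(x.value))])
    return Var(x.tape, math.sin(x.value), idx)


def reverse_sweep(tape, output):
    # Naive scheme: one adjoint per variable, so this randomly accessed
    # vector grows with the number of recorded operations.
    adjoints = [0.0] * tape.num_vars
    adjoints[output.index] = 1.0
    # The tape itself is traversed sequentially, in reverse order.
    for result, args in reversed(tape.entries):
        a = adjoints[result]
        for arg, partial in args:
            adjoints[arg] += partial * a   # random access via stored indices
    return adjoints


tape = Tape()
x1 = Var(tape, 2.0)
x2 = Var(tape, 3.0)
y = x1 * x2 + sin(x1)          # three operations, hence three tape entries
adj = reverse_sweep(tape, y)   # adj[x1.index] = x2 + cos(x1), adj[x2.index] = x1
```

In this naive form both the tape and the adjoint vector grow with the number of operations. Only the tape lends itself to block-wise streaming, which is why the abstract argues for keeping the randomly accessed part as small as possible.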
