Paper title
CHEX: Multiversion Replay with Ordered Checkpoints
Paper authors
Abstract
In scientific computing and data science, it is often necessary to share application workflows and repeat results. Current tools containerize application workflows and share the resulting container for repeating results. Containerization does improve the sharing of results, but these tools do not improve the efficiency of replay. In this paper, we present the multiversion replay problem, which arises when multiple versions of an application are containerized and each version must be replayed to repeat results. To avoid executing each version separately, we develop CHEX, which checkpoints program state and determines when it is permissible to reuse program state across versions. It does so using system call-based execution lineage. Our capability to identify common computations across versions enables us to consider optimizing replay using an in-memory cache, based on a checkpoint-restore-switch system. We show that the multiversion replay problem is NP-hard, and propose efficient heuristics for it. CHEX reduces overall replay time by sharing common computations while avoiding the storage of a large number of checkpoints. We demonstrate that CHEX maintains lightweight package sharing and reduces the total time of multiversion replay by 50% on average.
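To make the idea of sharing common computations concrete, here is a minimal Python sketch. It is not the CHEX algorithm. It assumes each version is a sequence of steps, that a hypothetical step identifier (such as `"load"`) stands in for a match on code and system call-based lineage, and that the checkpoint cache is unbounded. The step names and costs are invented for illustration. Under these assumptions, versions merge into a prefix tree and each shared prefix is executed only once.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One step in the merged execution tree (hypothetical structure)."""
    cost: float = 0.0
    children: dict = field(default_factory=dict)


def build_tree(versions):
    """Merge versions into a prefix tree keyed by step identifier.

    Two versions share a node only if all earlier steps also match,
    which approximates "same lineage up to this point".
    """
    root = Node()
    for steps in versions:
        node = root
        for step_id, cost in steps:
            node = node.children.setdefault(step_id, Node(cost))
    return root


def shared_cost(node):
    """Total replay time when every tree node is executed exactly once."""
    return node.cost + sum(shared_cost(c) for c in node.children.values())


def separate_cost(versions):
    """Total replay time when each version is executed independently."""
    return sum(cost for steps in versions for _, cost in steps)


# Illustrative data only: three versions sharing a common prefix.
versions = [
    [("load", 4.0), ("clean", 3.0), ("train_a", 5.0)],
    [("load", 4.0), ("clean", 3.0), ("train_b", 6.0)],
    [("load", 4.0), ("plot", 1.0)],
]
tree = build_tree(versions)
print(separate_cost(versions))  # 30.0
print(shared_cost(tree))        # 19.0
```

The sketch leaves out the hard part. Once the cache is bounded, one must choose which checkpoints to keep and in what order to replay the branches, and this is the setting in which the abstract states the problem is NP-hard and heuristics are needed.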