Paper title
TeaMPI -- Replication-based Resilience without the (Performance) Pain
Paper authors
Paper abstract
In an era where we cannot afford to checkpoint frequently, replication is a generic way forward to construct numerical simulations that can continue to run even if hardware parts fail. Yet, replication is often not employed on larger scales, as naïvely mirroring a computation once effectively halves the machine size, and as keeping replicated simulations consistent with each other is not trivial. We demonstrate for the ExaHyPE engine -- a task-based solver for hyperbolic equation systems -- that it is possible to realise resiliency without major code changes on the user side, and we introduce a novel algorithmic idea where replication reduces the time-to-solution, so that the redundant CPU cycles are not burned "for nothing". Our work employs a weakly consistent data model where replicas run independently yet inform each other through heartbeat messages whether they are still up and running. Our key performance idea is to let the tasks of the replicated simulations share some of their outcomes, while we shuffle the actual task execution order per replica. This way, replicated ranks can skip some local computations and automatically start to synchronise with each other. Our experiments with a production-level seismic wave-equation solver provide evidence that this novel concept has the potential to make replication affordable for large-scale simulations in high-performance computing.
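The task-sharing idea in the abstract can be illustrated with a small toy simulation. This is a minimal sketch under strong assumptions, not the TeaMPI or ExaHyPE implementation: the function name `run_replicas`, the round-robin scheduling, the instantaneous delivery of shared outcomes, and the stand-in computation `task * task` are all hypothetical choices made for illustration. Each replica walks through the same task set in its own shuffled order, skips any task whose outcome has already arrived from another replica, and shares every outcome it computes.

```python
import random


def run_replicas(num_tasks, num_replicas, seed=0):
    """Toy model of replicas that shuffle task order and share outcomes.

    Assumption: a shared outcome reaches all replicas instantly; a real
    system would see message delays and share only some outcomes.
    """
    rng = random.Random(seed)

    # Every replica gets the same tasks in a different execution order.
    orders = []
    for _ in range(num_replicas):
        order = list(range(num_tasks))
        rng.shuffle(order)
        orders.append(order)

    results = [dict() for _ in range(num_replicas)]  # per-replica outcome cache
    computed = [0] * num_replicas                    # tasks computed locally
    positions = [0] * num_replicas                   # cursor into each order

    while any(len(cache) < num_tasks for cache in results):
        for r in range(num_replicas):
            # Skip tasks whose outcome is already in this replica's cache.
            while (positions[r] < num_tasks
                   and orders[r][positions[r]] in results[r]):
                positions[r] += 1
            if positions[r] == num_tasks:
                continue  # nothing left to compute on this replica

            task = orders[r][positions[r]]
            outcome = task * task  # stand-in for an expensive task
            computed[r] += 1

            # Share the outcome with every replica, including this one.
            for cache in results:
                cache.setdefault(task, outcome)

    return computed, results


if __name__ == "__main__":
    computed, results = run_replicas(num_tasks=100, num_replicas=2, seed=1)
    print("tasks computed per replica:", computed)
    print("all replicas complete:", all(len(c) == 100 for c in results))
```

In this idealised setting every task is computed exactly once across all replicas, so each replica does only a fraction of the work while still holding the full set of outcomes. With communication delays some tasks would be computed redundantly, which is where the reduction in time-to-solution would shrink.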