“Resilience has already become a prominent issue on current large-scale platforms,” the editors write in the preface to the book. “The advent of exascale computers with millions of cores and billion-parallelism is only going to worsen the scenario. The capacity to deal with errors and faults will be a critical factor for HPC applications to be deployed efficiently.”
The reference volume provides an overview of fault tolerance methods for HPC applications in two parts. In Part I, the editors, along with colleague Jack Dongarra, focus on checkpointing, “the de-facto standard technique for resilience in HPC protocols.” The authors present the main protocols, coordinated and hierarchical, and introduce probabilistic performance models to assess these protocols. Such models are necessary, they say, for minimizing bias when dealing with future hardware, which by definition does not yet exist. They also look at checkpointing combined with fault prediction and with replication. The coverage spans general-purpose techniques, including checkpoint and rollback recovery protocols, as well as application-specific methods such as ABFT, or Algorithm-Based Fault Tolerance. There’s also a section on how to cope with silent errors.
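To give a feel for what a probabilistic performance model of checkpointing looks like, here is a minimal sketch based on the widely cited Young/Daly first-order approximation. It is an illustration only and is not taken from the book: the function names, the 10-minute checkpoint cost and the 9-hour platform MTBF are all assumptions chosen for the example.

```python
import math

def young_daly_period(checkpoint_cost_s: float, platform_mtbf_s: float) -> float:
    """First-order optimal time between checkpoints (Young/Daly approximation).

    Assumes the checkpoint cost is small compared with the platform MTBF.
    """
    return math.sqrt(2.0 * checkpoint_cost_s * platform_mtbf_s)

def first_order_waste(period_s: float, checkpoint_cost_s: float,
                      platform_mtbf_s: float) -> float:
    """Approximate fraction of time lost to checkpointing and re-execution.

    C/T is the overhead of writing checkpoints; T/(2*MTBF) is the expected
    work lost per failure, averaged over the run. Recovery and downtime
    costs are ignored in this simplified form.
    """
    return checkpoint_cost_s / period_s + period_s / (2.0 * platform_mtbf_s)

# Hypothetical values, chosen only for illustration.
checkpoint_cost = 600.0      # 10 minutes to write one checkpoint
platform_mtbf = 9 * 3600.0   # one failure every 9 hours on average

best_period = young_daly_period(checkpoint_cost, platform_mtbf)
best_waste = first_order_waste(best_period, checkpoint_cost, platform_mtbf)
print(f"checkpoint every {best_period / 60:.0f} minutes, "
      f"waste about {best_waste:.0%}")
```

Checkpointing more often than this period pays too much in checkpoint overhead, while checkpointing less often loses too much work on each failure.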
The authors describe the problem in terms of scale, which they write is both an opportunity (“the most viable path to sustained petascale”) and a threat:
“Future platforms will enroll even more computing resources to enter the Exascale era. Current plans refer to systems either with 100,000 nodes, each equipped with 10,000 cores (the fat node scenario), or with 1,000,000 nodes, each equipped with 1,000 cores (the slim node scenario).
“Even if each node provides an individual MTBF (Mean Time Between Failures) of, say, one century, a machine with 100,000 such nodes will encounter a failure every 9 hours in average, which is larger than the execution time of many HPC applications. Worse, a machine with 1,000,000 nodes (also with a one-century MTBF) will encounter a failure every 53 minutes in average. Note that a one-century MTBF per node is an optimistic figure, given that each node is composed of several hundreds or thousands of cores.
“To further darken the picture, several types of errors need to be considered when computing at scale. In addition to classical fail-stop errors (such as hardware failures), silent errors (a.k.a silent data corruptions) must be taken into account. Contrarily to fail-stop failures, silent errors are not detected immediately, but instead after some arbitrary detection latency, which complicates methods to cope with them.”
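The failure intervals quoted above follow from simple arithmetic: if nodes fail independently and any single node failure interrupts the machine, the platform MTBF is the node MTBF divided by the number of nodes. The sketch below reproduces the two figures under those assumptions. The function name and the 365-day year are choices made for this example.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years

def platform_mtbf_hours(node_mtbf_years: float, num_nodes: int) -> float:
    """Platform MTBF in hours, assuming independent node failures and
    that any single node failure counts as a platform failure."""
    return node_mtbf_years * HOURS_PER_YEAR / num_nodes

# One-century node MTBF, as in the quoted passage.
fat_hours = platform_mtbf_hours(100, 100_000)           # 100,000 nodes
slim_minutes = platform_mtbf_hours(100, 1_000_000) * 60  # 1,000,000 nodes

print(f"100,000 nodes:   one failure every {fat_hours:.2f} hours")
print(f"1,000,000 nodes: one failure every {slim_minutes:.1f} minutes")
```

The results, 8.76 hours and about 52.6 minutes, round to the 9 hours and 53 minutes cited by the authors.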
Part II is labeled “Technical Contributions” and is organized into four chapters.
2) Errors and Faults by Ana Gainaru and Franck Cappello
3) Fault-Tolerant MPI by Aurélien Bouteiller
4) Using Replication for Resilience on Exascale Systems by Henri Casanova, Frederic Vivien and Dounia Zaidouni
5) Energy-Aware Checkpointing Strategies by Guillaume Aupy, Anne Benoit, Mohammed El Mehdi Diouri, Olivier Glück and Laurent Lefèvre
Each chapter focuses on a different aspect of resiliency at scale. Chapter five, for example, spotlights the connection between the power challenge and the resilience challenge.
“[F]ault tolerance and energy consumption are interrelated: fault tolerance consumes energy and some energy reduction techniques can increase error and failure rates,” writes the international team of HPC experts.
The 320-page book is available now in hardcover, eBook and Kindle editions. Part I of the book also appears in a slightly modified form in a May 2015 report [PDF].
Dr. Thomas Herault is a research scientist in the Innovative Computing Laboratory (ICL) at the University of Tennessee, Knoxville. Dr. Yves Robert is a professor in the Laboratory of Parallel Computing at the Ecole Normale Supérieure de Lyon, France, and a visiting research scholar in the ICL.