Fault Tolerance Techniques for HPC

zahed golabi No Comments

Among the chief challenges of deploying useful exascale machines, resilience looms large. Today’s error rates combined with tomorrow’s node counts cannot sustain a productive workflow without intervention. The significance of this issue has not gone unnoticed. A comprehensive collection of fault tolerance techniques are presented in one volume, called “Fault Tolerance Techniques for High-Performance Computing,” by editors Thomas Herault and Yves Robert — published last month by Springer Verlag.

“Resilience has already become a prominent issue on current large-scale platforms,” the editors write in the preface to the book. “The advent of exascale computers with millions of cores and billion-parallelism is only going to worsen the scenario. The capacity to deal with errors and faults will be a critical factor for HPC applications to be deployed efficiently.”

The reference volume provides an overview of various fault tolerance methods for HPC applications in two parts. In Part I, the editors along with colleague Jack Dongarra, focus on checkpointing, “the de-facto standard technique for resilience in HPC protocols.” The authors present the main protocols, coordinated and hierarchical, and introduce probabilistic performance models to assess these protocols. Such models are necessary, they say, for minimizing bias when dealing with future hardware, which by its definition does not yet exist. They look at checkpointing combined with fault prediction and with replication. General-purpose techniques, including checkpoint and rollback recovery protocols, as well as application-specific methods are considered, such as ABFT, or Algorithm based Fault Tolerance. There’s also a section on how to cope with silent errors.

The authors describe the problem in terms of scale, which they write is both an opportunity (“the most viable path to sustained petascale”) and a threat:

“Future platforms will enroll even more computing resources to enter the Exascale era. Current plans refer to systems either with 100,000 nodes, each equipped with 10,000 cores (the fat node scenario), or with 1,000,000 nodes, each equipped with 1,000 cores (the slim node scenario).

“Even if each node provides an individual MTBF (Mean Time Between Failures) of, say, one century, a machine with 100,000 such nodes will encounter a failure every 9 hours in average, which is larger than the execution time of many HPC applications. Worse, a machine with 1,000,000 nodes (also with a one-century MTBF) will encounter a failure every 53 minutes in average. Note that a one-century MTBF per node is an optimistic figure, given that each node is composed of several hundreds or thousands of cores.

“To further darken the picture, several types of errors need to be considered when computing at scale. In addition to classical fail-stop errors (such as hardware failures), silent errors (a.k.a silent data corruptions) must be taken into account. Contrarily to fail-stop failures, silent errors are not detected immediately, but instead after some arbitrary detection latency, which complicates methods to cope with them.”

Part II is labeled “Technical Contributions” and is organized into four chapters.

2) Errors and Faults by Ana Gaiaru and Franck Cappello

3) Fault-Tolerant MPI by Aurélien Bouteiller

4) Using Replication for Resilience on Exascale Systems by Henri Casanova, Frederic Vivien and Dounia Zaidouni

5) Energy-Aware Checkpointing Strategies by Guillaume Aupy, Anne Benoit, Mohammed El Mehdi Diori, Oliver Glück and Laurent Lefèvre

Each chapter focuses on a different aspect of resiliency at scale. Chapter five, for example, is important for spotlighting the connection that exists between the power challenge and the resilience challenge.

“[F]ault tolerance and energy consumption are interrelated: fault tolerance consumes energy and some energy reduction techniques can increase error and failure rates,” write the international team of HPC experts.

The 320-page book is available now in both hard cover, eBook and Kindle editions. Part I of the book also appears in a slightly-modified form in a May 2015 report [PDF].

Dr. Thomas Herault is a research scientist in the Innovative Computing Laboratory (ICL) at the University of Tennessee Knoxville, Tennessee. Dr. Yves Robert is a professor in the Laboratory of Parallel Computing at the Ecole Normale Supérieure de Lyon, France, and a visiting research scholar in the ICL.

The Power and Possibilities of Exascale Computing

Ali Moradi Alamdarloo No Comments

Eighteen zeroes. That is the ability to run a quintillion calculations per second and exascale computing using memory driven computing processes that will touch all aspects of our lives. The race to the Exascale is the space race of this century.

On the Road to Exascale

Ali Moradi Alamdarloo No Comments

Exascale computing

Exascale computing refers to computing systems capable of at least one exaFLOPS, or a billion billion calculations per second. Such capacity represents a thousandfold increase over the first petascale computer that came into operation in 2008. At a supercomputing conference in 2009, Computerworld projected exascale implementation by 2018. Exascale computing would be considered as a significant achievement in computer engineering, for it is believed to be the order of processing power of the human brain at neural level (functional might be lower). It is, for instance, the target power of the Human Brain Project.

Why do we need exascale computers ?

The only bad news is that we need more than exascale computing. Some of the key computational challenges, that face not just individual companies, but civilisation as a whole, will be enabled by exascale computing.

Everyone is concerned about climate change and climate modelling. The computational challenge for doing oceanic clouds, ice and topography are all tremendously important. And today we need at least two orders of magnitude improvement on that problem alone.

Controlled fusion – a big activity shared with Europe and Japan – can only be done with exascale computing and beyond. There is also medical modelling, whether it is life sciences itself, or the design of future drugs for every more rapidly changing and evolving viruses – again it’s a true exascale problem.

Exascale computing is really the medium and the only viable means of managing our future. It is probably crucial to the progress and the advancement of the modern age.

Sunway TaihuLight

The Sunway TaihuLight is a Chinese supercomputer which, as of June 2016, is ranked number one in the TOP500 list as the fastest supercomputer in the world, with a LINPACK benchmark rating of 93 petaflops. This is nearly three times as fast as the previous holder of the record, the Tianhe-2, which ran at 34 petaflops. As of June 2016, it is also ranked as the third most energy-efficient supercomputer in TOP500, with an efficiency of 6,051.30 MFLOPS/W. It was designed by the National Research Center of Parallel Computer Engineering & Technology (NRCPC) and is located at the National Supercomputing Center in Wuxi in the city of Wuxi, in Jiangsu province, China.