Sierra, Livermore’s next advanced technology high performance computing system, will join LLNL’s lineup of supercomputers in 2017–2018. The new system will provide computational resources that are essential for nuclear weapon scientists to fulfill the National Nuclear Security Administration’s stockpile stewardship mission through simulation in lieu of underground testing. Advanced Simulation and Computing (ASC) Program scientists and engineers will use Sierra to assess the performance of nuclear weapon systems and to perform nuclear weapon science and engineering calculations. These calculations are necessary to understand key physics issues, knowledge that later makes its way into the integrated design codes. This work on Sierra has important implications for other global and national challenges such as nonproliferation and counterterrorism.
For Sierra’s installation, the CORAL partnership (Collaboration of Oak Ridge, Argonne, and Livermore) was formed to procure high performance computers from multiple vendors. IBM was selected as the vendor for LLNL. The IBM-built Sierra supercomputer is projected to provide four to six times the sustained performance and five to seven times the workload performance of Sequoia, with a 125 petaFLOP/s peak. At approximately 11 megawatts, Sierra will also be about five times more power efficient than Sequoia. By combining two types of processor chips—IBM’s Power 9 processors and NVIDIA’s Volta graphics processing units (GPUs)—Sierra is designed for more efficient overall operations and is expected to be a promising architecture for extreme-scale computing.
In late 2016, LLNL acquired three small-scale “early access” (EA) versions of Sierra, consisting of IBM Minsky compute nodes with 20 Power 8 cores each and 4 NVIDIA Pascal GPUs. These small systems feature components only one generation behind those of Sierra. EA systems enable application porting and tuning in advance of the CORAL Sierra system delivery and acceptance (late 2017 to mid-2018). To enable this work, beta software co-designed by the CORAL laboratories and IBM is being installed on the EA systems.
“Resilience has already become a prominent issue on current large-scale platforms,” the editors write in the preface to the book. “The advent of exascale computers with millions of cores and billion-parallelism is only going to worsen the scenario. The capacity to deal with errors and faults will be a critical factor for HPC applications to be deployed efficiently.”
The reference volume provides an overview of fault tolerance methods for HPC applications in two parts. In Part I, the editors, along with colleague Jack Dongarra, focus on checkpointing, “the de-facto standard technique for resilience in HPC protocols.” The authors present the main protocols, coordinated and hierarchical, and introduce probabilistic performance models to assess them. Such models are necessary, they say, for minimizing bias when dealing with future hardware, which by definition does not yet exist. They examine checkpointing combined with fault prediction and with replication. General-purpose techniques, including checkpoint and rollback-recovery protocols, are considered, as well as application-specific methods such as ABFT (Algorithm-Based Fault Tolerance). There is also a section on how to cope with silent errors.
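The flavor of such performance models can be illustrated with the classic first-order result usually attributed to Young and Daly: given a platform MTBF mu and a checkpoint cost C, the period that minimizes expected time lost to checkpointing plus re-execution is roughly sqrt(2·C·mu). A minimal sketch (the numbers below are illustrative, not taken from the book):

```python
import math

def optimal_checkpoint_period(mtbf_s, ckpt_cost_s):
    """First-order Young/Daly approximation: the period T that
    minimizes the expected overhead C/T + T/(2*mtbf)."""
    return math.sqrt(2 * ckpt_cost_s * mtbf_s)

# Illustrative numbers: 9-hour platform MTBF, 10-minute checkpoint cost.
T = optimal_checkpoint_period(mtbf_s=9 * 3600, ckpt_cost_s=600)
print(f"checkpoint roughly every {T / 3600:.2f} hours")  # → 1.73 hours
```

The book's models refine this with coordinated vs. hierarchical protocol costs, prediction, and replication, but the square-root tradeoff is the baseline they all build on.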
The authors describe the problem in terms of scale, which they write is both an opportunity (“the most viable path to sustained petascale”) and a threat:
“Future platforms will enroll even more computing resources to enter the Exascale era. Current plans refer to systems either with 100,000 nodes, each equipped with 10,000 cores (the fat node scenario), or with 1,000,000 nodes, each equipped with 1,000 cores (the slim node scenario).
“Even if each node provides an individual MTBF (Mean Time Between Failures) of, say, one century, a machine with 100,000 such nodes will encounter a failure every 9 hours in average, which is larger than the execution time of many HPC applications. Worse, a machine with 1,000,000 nodes (also with a one-century MTBF) will encounter a failure every 53 minutes in average. Note that a one-century MTBF per node is an optimistic figure, given that each node is composed of several hundreds or thousands of cores.
“To further darken the picture, several types of errors need to be considered when computing at scale. In addition to classical fail-stop errors (such as hardware failures), silent errors (a.k.a silent data corruptions) must be taken into account. Contrarily to fail-stop failures, silent errors are not detected immediately, but instead after some arbitrary detection latency, which complicates methods to cope with them.”
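The failure-rate arithmetic in the quoted passage is easy to reproduce: assuming independent node failures, the MTBF of an N-node machine is the per-node MTBF divided by N. A quick check of the authors’ figures:

```python
node_mtbf_h = 100 * 365.25 * 24           # one-century MTBF per node, in hours

fat_nodes, slim_nodes = 100_000, 1_000_000
fat_mtbf_h = node_mtbf_h / fat_nodes            # fat-node scenario
slim_mtbf_min = node_mtbf_h / slim_nodes * 60   # slim-node scenario

print(f"fat-node machine: a failure every {fat_mtbf_h:.1f} hours")      # → 8.8, i.e. "every 9 hours"
print(f"slim-node machine: a failure every {slim_mtbf_min:.0f} minutes")  # → 53 minutes
```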
Part II is labeled “Technical Contributions” and is organized into four chapters.
2) Errors and Faults by Ana Gainaru and Franck Cappello
3) Fault-Tolerant MPI by Aurélien Bouteiller
4) Using Replication for Resilience on Exascale Systems by Henri Casanova, Frédéric Vivien, and Dounia Zaidouni
5) Energy-Aware Checkpointing Strategies by Guillaume Aupy, Anne Benoit, Mohammed El Mehdi Diouri, Olivier Glück, and Laurent Lefèvre
Each chapter focuses on a different aspect of resiliency at scale. Chapter 5, for example, is notable for spotlighting the connection between the power challenge and the resilience challenge.
“[F]ault tolerance and energy consumption are interrelated: fault tolerance consumes energy and some energy reduction techniques can increase error and failure rates,” write the international team of HPC experts.
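That interrelation can be made concrete with a simple first-order model (my own illustration, not taken from the chapter): if checkpoint I/O draws power P_io while computation draws P_comp, the expected energy overhead per unit of work is roughly (P_io·C)/T + (P_comp·T)/(2·mu), minimized at T = sqrt(2·mu·C·P_io/P_comp) rather than at the time-optimal sqrt(2·mu·C). A hedged sketch with made-up power figures:

```python
import math

def energy_optimal_period(mtbf_s, ckpt_cost_s, p_io_w, p_comp_w):
    """Period minimizing the first-order energy overhead
    (P_io*C)/T + (P_comp*T)/(2*mtbf); illustrative model only."""
    return math.sqrt(2 * mtbf_s * ckpt_cost_s * p_io_w / p_comp_w)

mtbf, C = 9 * 3600, 600
t_time = math.sqrt(2 * mtbf * C)                       # time-optimal period
t_energy = energy_optimal_period(mtbf, C, p_io_w=80, p_comp_w=200)

# When checkpoint I/O draws less power than computation,
# the energy-optimal schedule checkpoints more often.
print(t_energy < t_time)  # → True
```

In other words, the schedule that finishes fastest is not necessarily the one that burns the least energy, which is exactly the coupling the chapter explores.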
The 320-page book is available now in hardcover, eBook, and Kindle editions. Part I of the book also appears in slightly modified form in a May 2015 report [PDF].
Dr. Thomas Herault is a research scientist in the Innovative Computing Laboratory (ICL) at the University of Tennessee, Knoxville. Dr. Yves Robert is a professor in the Laboratory of Parallel Computing at the École Normale Supérieure de Lyon, France, and a visiting research scholar in the ICL.
The future of computing will be driven by constraints on power consumption. An exaflop system will be limited to no more than 20 MW of power, forcing co-design innovations in both hardware and software to improve overall efficiency. On the hardware side, processor designs are shifting to many-core architectures to increase the ratio of computational power to power consumption. Research and development efforts on other hardware components, such as the memory and interconnect, further enhance energy efficiency and overall reliability. On the software side, simulation codes and parallel programming models will need modifications to adapt to the increased concurrency and other new features of future architectures. Developing power-aware runtime systems is key to fully utilizing the limited resources. In this paper, we survey the current research in energy-efficient and power-constrained techniques in software, then present an analysis of these techniques as they apply to a specific high-performance computing use case.
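The 20 MW constraint translates directly into a required energy efficiency: one exaFLOPS within 20 MW means 50 gigaFLOPS per watt. A quick check of the arithmetic:

```python
exaflops = 1e18   # target: 10^18 floating-point operations per second
power_w = 20e6    # 20 MW power budget

required_gflops_per_w = exaflops / power_w / 1e9
# Roughly an order of magnitude beyond the ~6 GFLOPS/W of the most
# energy-efficient systems on the mid-2016 TOP500 list.
print(required_gflops_per_w)  # → 50.0
```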
Exascale systems to be deployed in the near future will come with deep hierarchical parallelism, will exhibit various levels of heterogeneity, will be prone to frequent component failures, and will face tight power consumption constraints. The notion of application performance in these systems becomes multi-criteria, with fault-tolerance and power consumption metrics to be considered in addition to sheer compute speed. As a result, many of the proven algorithmic techniques used in parallel computing for decades will not be effective in Exascale systems unless they are adapted or in some cases radically changed. The Dagstuhl seminar “Algorithms and Scheduling Techniques for Exascale Systems” was aimed at sharing open problems, new results, and prospective ideas broadly connected to the Exascale problem. This report provides a brief executive summary of the seminar and lists all the presented material.
Seminar September 15–20, 2013 – www.dagstuhl.de/13381
Eighteen zeroes. That is the ability to run a quintillion calculations per second: exascale computing, driven by memory-centric computing processes, will touch all aspects of our lives. The race to the exascale is the space race of this century.
The past 12 months encompassed a number of new developments in HPC, as well as an intensification of existing trends. TOP500 News takes a look at the top eight hits and misses of 2017.
Hit: Machine learning, the killer app for HPC
Machine learning, and the broader category of AI, continued to spread its influence across the HPC landscape in 2017. Web-based applications in search, ad-serving, language translation and image recognition continued to get smarter this year, as more sophisticated neural network models were developed. What’s new this year is the beginning of a trend that inserts this technology into a broad range of traditional HPC workflows.
In applications as distinct as weather modeling, financial risk analysis, astrophysics simulations, and diagnostic medicine, developers used machine learning software to improve the accuracy of their models and speed time-to-result. At the same time, conventional supercomputing platforms are also being used for machine learning R&D. In one of the most impressive computing demonstrations of the year, a poker-playing AI known as Libratus trained itself on the Bridges supercomputer at the Pittsburgh Supercomputing Center, and went on to crush four of the best professional players in the world. As more powerful GPUs make their way into supercomputers (see below), we should see a lot more cutting-edge machine learning research being performed on these machines.
Hit: NVIDIA makes Volta GPU a deep learning monster
NVIDIA intensified its dominance in the AI space, with the launch of its Volta V100 GPU in May. With special circuitry for tensor processing, the V100 put unprecedented amounts of deep learning processing power – 120 teraflops per chip – into the hands of anyone with a spare PCIe port. Amazon and Microsoft will be the earliest adopters of the technology, followed soon thereafter by Baidu.
In addition to its deep learning prowess, the V100 GPU also delivers 7 double precision teraflops, making it eminently suitable for conventional HPC setups. The devices are already being deployed in the Department of Energy’s two most powerful supercomputers, Summit and Sierra, both of which are expected to come online in the first half of 2018. Those systems promise to be in high demand for both traditional HPC simulations and machine learning applications.
Miss: Intel fumbles pre-exascale deployment, drops Knights Hill
In October, the Department of Energy reported that its 180-petaflop Aurora supercomputer, which was slated to be installed at Argonne National Lab next year, had been canceled. The system was to be powered by Knights Hill, Intel’s next-generation Xeon Phi processor. Instead, Aurora will be remade into a one exaflop system to be deployed in the 2020–2021 timeframe.
The rationale for the change in plans was not made clear, and as we wrote at the time, “something apparently went wrong with the Aurora work, and the Knights Hill chip looks like the prime suspect.” In November, Intel revealed it had dumped the Knights Hill product, without specifying any alternate roadmap for the Xeon Phi line.
Hit and Miss: AMD offers alternatives to Intel and NVIDIA silicon
In June, AMD launched EPYC, the chipmaker’s first credible alternative to Intel’s Xeon product line since the original Opteron processors. The EPYC 7000 series processors have more cores, more I/O connectivity, and better memory bandwidth than Intel’s “Skylake” Xeon CPUs, which were launched in July. Although AMD initially missed the opportunity to talk about the EPYC processors during ISC 2017, subsequent third-party testing and a more concerted effort by AMD at SC17 revealed that the EPYC processors had advantages for at least some HPC workloads. Nonetheless, Intel will prove difficult to dislodge from its position atop the datacenter food chain.
At SC17, AMD also talked up its Radeon Instinct GPUs (initially announced in December 2016), the chipmaker’s first serious foray into the machine learning datacenter space. These processors have plenty of flops to offer, but nothing approaching the performance of the V100 for deep learning, since the Radeon hardware lacks the specialized arithmetic units that NVIDIA added for neural net acceleration. AMD is counting on its more open approach to GPU software to lure CUDA customers away from NVIDIA’s clutches.
Hit: Cavium becomes the center of gravity for ARM-powered HPC
Cavium’s second-generation ThunderX2 ARM server SoC was soft-launched way back in May 2016, but it wasn’t until this year that the chip got some attention from users and vendors. The processor offers decent performance, superior memory bandwidth, and an abundance of external connectivity to distinguish it from other ARM chip vendors taking aim at the datacenter.
In January, the EU’s Mont-Blanc project selected the ThunderX2 for its phase three pre-exascale prototype, which will be constructed by Atos/Bull. The French computer-maker intends to productize the ARM-based Mont-Blanc design as an option on its Sequana supercomputer line. In November, Cray followed suit with a ThunderX2-powered XC50 blade, which will become the basis of the Isambard supercomputer in the UK. HPE, Gigabyte Technology, and Ingrasys also came up with their own versions of ThunderX2-based servers. With the ARM software ecosystem for the datacenter also starting to fill out, 2018 could be a breakout year for the architecture in high performance computing and elsewhere.
Hit: Microsoft inches its way back into HPC
Between Microsoft’s acquisition of Cycle Computing and the next upgrade of its FPGA-accelerated Azure cloud, Microsoft looks like it’s becoming a bigger HPC player, at least in terms of technology prowess. Although the company still offers plenty of NVIDIA GPUs in Azure for cloud customers interested in accelerating HPC, data analytics, and deep learning workloads, the long-term strategy appears to be moving toward an FPGA approach. If the company manages to pull this off, Microsoft could drive a lot more interest in reconfigurable computing from performance-minded users, while simultaneously becoming a technology leader in this area.
Hit: Quantum computing on the cusp
Perhaps the fastest-moving HPC technology of 2017 was quantum computing, which, in a fairly short space of time, grew from an obscure set of research projects into a technology battle between some of the biggest names in the industry. The most visible of these are IBM and Google, both of which built increasingly capable quantum computers over the past 12 months. IBM currently has a 20-qubit system available for early users, with a 50-qubit prototype waiting in the wings. The company even managed to collect a handful of paying customers for this early hardware. Meanwhile, Google is fiddling with a 22-qubit system, with a 49-qubit machine promised before the end of the year.
In October, Intel made its own quantum intentions known, with the revelation of a 17-qubit processor. For its part, Microsoft is working on a topological quantum computer, and while it has yet to field a working prototype, the company has come up with a software toolkit for the technology, complete with its own quantum computing programming language (Q#). In a similar vein, Atos/Bull launched a 40-qubit quantum simulator this year, softening the ground for the eventual hardware that everyone expects is right around the corner. 2018 is shaping up to be an even more exciting year for qubit followers.
Miss: Exascale computing fatigue
While exascale projects around the world made a lot of news in 2016, with the different players jockeying for position, this year the news has been a lot more subdued. Maybe that’s because the various efforts in China, Japan, Europe, and the US are now pretty well set in place, and are just methodically moving forward at their own pace. But with the rise of AI and machine learning, and more generally, data analytics, the artificial milestone of reaching an exaflop on double precision floating point math seems a lot less relevant.
Consider that the DOE’s 200-petaflop Summit supercomputer will deliver three peak exaflops of deep learning performance, and drawing on that capability with large-scale neural networks may dwarf any advances made with the first “true” exascale machines used for traditional modeling. In a Moor Insights and Strategy white paper, senior analyst Karl Freund writes: “It is becoming clear that the next big advances in HPC may not have to wait for exascale-class systems, but are being realized today using Machine Learning methodologies. In fact, the convergence of HPC and [machine learning] could potentially redefine what an exascale system should even look like.”
In a world where machine learning can outperform oncologists, poker players, and hedge fund analysts, it’s hard to argue with that assessment.
Exascale computing refers to computing systems capable of at least one exaFLOPS, or a billion billion calculations per second. Such capacity represents a thousandfold increase over the first petascale computer, which came into operation in 2008. At a supercomputing conference in 2009, Computerworld projected exascale implementation by 2018. Exascale computing would be considered a significant achievement in computer engineering, for it is believed to be the order of processing power of the human brain at the neural level (functional capacity might be lower). It is, for instance, the target power of the Human Brain Project.
The only bad news is that we need more than exascale computing. Tackling some of the key computational challenges that face not just individual companies, but civilisation as a whole, will be enabled by exascale computing.
Everyone is concerned about climate change and climate modelling. The computational challenges of modelling oceanic clouds, ice, and topography are all tremendously important, and today we need at least two orders of magnitude of improvement on that problem alone.
Controlled fusion – a big activity shared with Europe and Japan – can only be done with exascale computing and beyond. There is also medical modelling, whether it is life sciences itself, or the design of future drugs for ever more rapidly changing and evolving viruses – again, it is a true exascale problem.
Exascale computing is really the only viable means of managing our future. It is probably crucial to the progress and advancement of the modern age.
The Sunway TaihuLight is a Chinese supercomputer which, as of June 2016, is ranked number one in the TOP500 list as the fastest supercomputer in the world, with a LINPACK benchmark rating of 93 petaflops. This is nearly three times as fast as the previous holder of the record, the Tianhe-2, which ran at 34 petaflops. As of June 2016, it is also ranked as the third most energy-efficient supercomputer in the TOP500, with an efficiency of 6,051.30 MFLOPS/W. It was designed by the National Research Center of Parallel Computer Engineering & Technology (NRCPC) and is located at the National Supercomputing Center in Wuxi, in Jiangsu province, China.