Morning Parallel Session A
- On the Calculation Distribution between CPU and GPU in the Hybrid Supercomputing the Radiation Transport
- Case study: Bohrium – Powering Oceanography Simulation
- Roadmap to Exascale for Nek5000
- Port Out, Motherboard Home: Accelerating CASTEP on CPU-GPU clusters
Morning Parallel Session B
- MPI Storage Windows for a Data-centric Era
- Performance of HPC I/O Strategies at Scale
- Evaluation Methodology of an NVRAM-based Platform for the Exascale
- Scalable I/O Funnelling for ECMWF’s Integrated Forecast System
Afternoon Parallel Session A
- Progressive load balancing of asynchronous algorithms in distributed memory
- An Exploration of Fault Resilience Protocols for Large-Scale Application Execution on Exascale Computing Platforms
Afternoon Parallel Session B
- Speeding up a high-performance scientific code with GASPI shared notifications
- Efficient Gather-Scatter Operations in Nek5000 Using PGAS
On the Calculation Distribution between CPU and GPU in the Hybrid Supercomputing the Radiation Transport
Roman Uskov, Boris Chetverushkin, Vladimir Gasilov, Mikhail Markov, Olga Olkhovskaya and
Case study: Bohrium – Powering Oceanography Simulation
Mads Ohm Larsen and Dion Häfner
In the field of oceanography numerical simulations have been used for more than 50 years. These simulations often require long integration time and can run for several real-time months. They are thus often written in high-performance languages such as C or Fortran. Both of these are often thought of as being complex to use. Fortunately we have seen a shift in the scientific programming community towards focusing more on productivity as oppose to just performance. We would, however, like to keep the performance from these archaic languages. Academic code often has a limited lifespan because the developers shift as people graduate and new people arrive. Having to use a long time to understand the simulations will take away from the actual science being made. Veros is a Python translation of an already existing oceanography project written in Fortran. It utilizes NumPy for its vector computations. In order to rectify the performance loss Veros have chosen Bohrium as its computational back-end. Bohrium is a run-time framework that allows sequential interpreted code to be easily parallelized and possibly run on GPGPUs. This is done by just-in-time (JIT) compiling generated OpenMP, OpenCL, or CUDA kernels and running them on the appropriate hardware. Bohrium also support multiple front-ends, namely Python, C, and C++. In Python this is done with minimal intrusion, thus no annotations are needed, as long as the code utilize NumPy functions Bohrium will override these and replace them with JIT-compiled kernels. Bohrium has its own intermediate representation (IR), which is an instruction set gathered from the interpreted code up to a side effect, for example I/O. Well known compiler optimizations, such as constant folding and strength reduction, can be applied to the IR prior to generating the kernels. Other optimizations include using already established libraries such as BLAS for low-level linear algebra. Bindings to the appropriate BLAS library on your system is auto generated when Bohrium is compiled. This means, that if you have for example clBLAS installed, Bohrium will create bindings to it, that can be utilized directly from Python. These will also overwrite the NumPy methods already using BLAS for even better performance.
Using Bohrium of course comes with an overhead in form of generating the kernels. Fortunately this overhead is amortized for larger simulations. For the Veros project we see that Bohrium is roughly an order of magnitude faster than the same implementation using Fortran or NumPy in the benchmarks. However, a parallel Fortran implementation using MPI for communication is faster still. In the future we would like to utilize distributed memory systems with Bohrium, so we can run even larger problem sizes, using possibly multiple terabytes of memory.
Roadmap to Exascale for Nek5000
Adam Peplinski, Evelyn Otero, Paul Fischer, Stefan Kerkemeier, Jing Gong and Philipp Schlatter
Nek5000 is a highly scalable spectral element code for simulation of turbulent flows in complex
domains. As a Gordon Bell winner, it has strong-scaled to over a million MPI ranks, sustaining
over 60 percent efficiency on 1048576 ranks of Mira with just 2000 points per core. This
is in line with efficiency expectations for multigrid on this architecture at these scales. Moreover
a Nek5000-derived miniapp was developped, Nekbone, which has sustained 1.2 PFLOPS
on six million ranks of Sequoia (6% of peak).
In this talk, we will present the main characteristics of Nek5000 towards exascale performance
improvement. The overall efficiency of Nek5000 derives from:
- stable high-order discretizations that, for a given accuracy, require significantly less data
movement and fewer flops than their low-order counterparts.
- communication-minimal bases requiring only C0 continuity between elements and, hence, exchange of only surface data between processors.
- The use of fast matrix-free formulations based on tensor-product operators.
- Tensor-product contractions based on highly-optimized small matrix-matrix product kernels.
- An efficient and scalable communication framework, gslib, that has only O(n) or O(nlogP) complexity for both execute and setup phases, which has been deployed to over six million MPI ranks.
- A scalable p-multigrid solver that uses fast tensor-product-based smoothers coupled with an unstructured algebraic multigrid solver for the global coarse-grid solve, with corresponding parallel setup.
Port Out, Motherboard Home: Accelerating CASTEP on CPU-GPU clusters
Matthew Smith, Arjen Tamerus, and Phil Hasnip
Heterogeneous computer systems which use CPUs and GPUs in tandem to accelerate computation are well-established in HPC clusters, and are a candidate technological route to exascale computing. Optimal software performance on such massively-parallel systems involves the exploitation of distributed-memory parallelism on the CPU and the offloading of computationally intensive tasks to the GPU. A major goal of software design is therefore to marry these two elements and thereby maximise CPU and GPU computation while minimising CPU-CPU and CPU-GPU communications.
Here we present recent work undertaken towards achieving this goal for CASTEP, the UK’s premier quantum-mechanical materials-modeling software. We describe our approach towards enabling accelerator-compatability for this mature code, using our hybrid openaccmpi implementation, as well as our use of accelerator libraries including cufft and magma. The gains in performance afforded by these developments are illustrated with results from UK HPC facilities.
MPI Storage Windows for a Data-centric Era
Sergio Rivas-Gomez, Stefano Markidis, Erwin Laure and Sai Narasimhamurthy
Even though breaking the ExaFLOP barrier is expected to become one of the major computing milestones over the next decade, several challenges arise that remain of paramount importance for the success of the Exascale supercomputer. One such challenge is the bandwidth and access latency of the IO subsystem, projected to remain roughly constant in comparison with the concurrency of Exascale machines, that will increase approximately 4000×. In addition, with the integration of emerging deep-learning and data analytics applications on HPC, the chances for unexpected failures at Exascale will considerably raise as well. In order to overcome some of these limitations, upcoming large-scale systems will feature a variety of Non-Volatile RAM (NVRAM), next to traditional hard-disks and conventional DRAM. Emerging non-volatile solid-state technologies, such as flash, phase-change and spin-transfer torque memories, are used to decrease the existing gap between memory and storage. Hence, the integration of these technologies provides several advantages (e.g., data locality), that can potentially reduce the overall power consumption and IO access latency of HPC applications. In this presentation, we address the challenge of adapting MPI to the changes in the memory and storage hierarchies of Exascale supercomputers.
We present the concept of MPI storage windows, an extension to the MPI one-sided communication model that provides a unique interface for programming memory and storage. We illustrate its benefits for out-of-core execution and parallel I/O, as well as present a novel fault-tolerance mechanism based on this concept. Preliminary performance results indicate that our approach incurs in negligible performance differences on real-world applications compared to traditional MPI memory windows. Additionally, we present heterogeneous window allocations, that provide a unified virtual address space for memory and storage. Results on out-of-core execution show less than a 40% performance penalty incurred while exceeding the main memory capacity of compute nodes.
Performance of HPC I/O Strategies at Scale
Keeran Brabazon, Oliver Perks, Stefano Markidis, Ivy Bo Peng, Sergio Rivas Gomez and Adrian Jackson
In order to address the gap between compute and I/O moving towards the extreme and exascale, I/O strategies and subsystems are going to be more versatile and more complex. We have already seen a move to take advantage of fast storage nodes in the form of burst buffers in production HPC systems, as well as the use of parallel I/O libraries (such as MPI I/O and HDF5) and programmer-directed I/O. With this added complexity, a user of an HPC system needs to be informed of the advantages and disadvantages of different I/O strategies. In this presentation, we consider the performance of an HPC filesystem for a real-world application, rather than benchmarks or mini-apps. The application under consideration is a plasma physics simulation (iPIC3D) developed by KTH, Stockholm. Recent work at KTH has involved developing run times for efficient I/O at scale, and iPIC3D has been extended such that the I/O scheme used by the application can easily be switched at compile time. Performance data is gathered for weak scaling experiments of iPIC3D using different I/O methods, collected every day over a two-month period. Focus is kept on the performance of write operations, as this is important in traditional HPC simulations, in which initial data are read during initialisation, and snapshots of the simulation status are recorded at different points in simulation time. Program internal data is captured using the Arm MAP profiling tool, which gathers a rich set of performance metrics from within an application. Overhead of the tool is measured at less than 5% of overall run-time, meaning that performance data is gathered for a close to production run.
Variation in observed application performance are correlated to system load and I/O subsystem performance. Conclusions compare and contrast current I/O paradigms, as well as taking a forward looking view as to the suitability of different I/O run times for the extreme scale emerging in the coming years.
Evaluation Methodology of an NVRAM-based Platform for the Exascale
Juan F.R. Herrera, Suraj Prabhakaran, Michèle Weiland, and Mark Parsons
One of the major roadblocks to achieving the goal of building an HPC system capable of Exascale computation is the I/O bottleneck. Current systems are capable of processing data quickly, but speeds are limited by how fast the system is able to read and write data. This represents a significant loss of time and energy in the system. Being able to widen, and ultimately eliminate, this bottleneck would majorly increase the performance and efficiency of HPC systems. The NEXTGenIO project is investigating this issue by bridging the latency gap between memory and disk through the use of non-volatile memory, which will sit between conventional DDR memory and disk storage. In addition to the hardware that will be built as part of the project, the project will develop the software stack (from OS and runtime support to programming models and tools) that goes hand-in-hand with this new hardware architecture. This project addresses a key challenge not only for Exascale, but also for HPC and data intensive computing in general: the challenge of I/O performance.
An application suite of eight memory and I/O-bound applications have been selected alongside a set of test cases for each application, to evaluate the platform’s effectiveness regarding I/O performance and throughput. The application suite covers a wide range of fields, from computer-aided engineering to meteorology, computational chemistry, and machine learning. The output of the evaluation will document the benefits of the NEXTGenIO technology, and indicate its impact and future lines of development. Three measurement scenarios are defined to assess the specific benefits of the NEXTGenIO
- Baseline measurement in today’s systems.
- Measurements on the NEXTGenIO platform without the use of non-volatile memory.
- Measurements on the NEXTGenIO platform with the use of non-volatile memory.
The profiling tools Allinea MAP and Score-P are used to collect the metrics needed to evaluate the performance of the applications for each scenario. These tools have been extended to support performance analysis with non-volatile memory. In our presentation, we will present our methodology for evaluating the NEXTGenIO platform and show early memory and I/O profiling results for the applications. We will discuss how NVRAM will impact the performance of these applications.
Scalable IO Funelling for ECMWF’s Integrated Forecast System
James Hawkes, Tiago Quintino
ECMWF’s IFS (Integrated Forecast System) uses an IO server architecture to improve the scalability of data output. The IO server splits the global MPI communicator, dedicating some processes to the scientific model whilst funelling IO through dedicated IO processes. The IO processes are responsible for collating, buffering, and encoding output data; leaving the model processes free to continue computing without blocking, so long as the IO process buffers are not full. All of the output routines should scale efficiently with number of processes and number of fields, by avoiding global synchronization and distributing the fields evenly between IO processes. The funelling method allows control over contention of IO hardware, and also provides the opportunity to perform global operations (such as encoding) on collated output data – without expensive collective communications.
The IO server architecture was originally developed at Météo France, and later ported to the IFS atmospheric model. In this presentation, the authors describe the architecture of the IO server and its advantages compared to alternative methods. We present recent developments in coupling the non-atmospheric wave model with the IO server, and demonstrate the realized improvements to overall scalability and performance. Furthermore, we discuss the integration of the IO server with the downstream IO stack and future plans for closer integration with post-processing services.
Progressive load balancing of asynchronous algorithms in distributed memory
Justs Zarins, Michèle Weiland
As supercomputers are growing in size, running large scale, tightly-coupled applications efficiently is becoming more difficult. A key component of the problem is the cost of synchronisation which increases with system noise and performance variability. This affects even high-end HPC machines. An exciting and promising approach for addressing this problem is to stop enforcing synchronisation points. This results in what are known as “asynchronous” or “chaotic” algorithms; commonly they are iteratively convergent. The cores are allowed to compute using whatever latest data is available to them, which might be “stale”, instead of waiting for other threads to catch up. Existing applications of this methodology show good performance and fault tolerance with respect to their synchronous counterparts.
While asynchrony removes the computational cost of requiring all data to arrive at the same time, a different cost takes its place – progress imbalance. This is natural because synchronisation points exist to coordinate progress. An imbalance in progress can result in slower convergence or even failure to converge, as old data is used for updates. This can be countered by putting a strict bound on how stale data is allowed to be, but at a cost to performance. As an alternative, the authors of  introduce the idea of progressive load balancing – balancing asynchronous algorithms over time as opposed to balancing instantaneously. Instead of finetuning iteration rates, parts of the working set are periodically moved between computing threads on a node. As a result, progress imbalance is limited without adding a large overhead. The approach is similar to bounded staleness, but it continues to work efficiently in the presence of continuous progress imbalance which may be caused by, for example, hardware performance variability or workload imbalance. The authors tested the approach running Jacobi’s method on a single compute node and found it increased iteration rate and decreased progress imbalance between parts of the solution space. Here we present an extension of progressive load balancing to the distributed memory setting.
We start by evaluating the extent to which progress imbalance can be reduced by running independent load balancing on each node. Additionally, we evaluate an implementation where load balancing is allowed to take place across nodes periodically to account for cases where a whole node is slow. Finally, we draw conclusions about the benefits and challenges of this approach in the context of future exascale applications.
An Exploration of Fault Resilience Protocols for Large-Scale Application Execution on Exascale Computing Platforms
Daniel Dauwe, Sudeep Pasricha, Anthony A. Maciejewski, and Howard Jay Siegel
The probability of applications experiencing failures in today’s high performance computing (HPC) systems has increased significantly with the increase in the number of system nodes. It is expected that exascale-sized systems are likely to operate with mean time between failures (MTBF) of as little as a few minutes, causing frequent interrupts in application execution as well as substantially greater energy costs in a system. Periodic application checkpointing to a parallel file system has for years been the de-facto strategy for enabling resilience to failures in HPC platforms. This traditional checkpoint/restart protocol is widely used in HPC systems for mitigating the impact of failures on application performance. However, as system sizes approach exascale levels, the higher frequency of failures and the lengthy time required to checkpoint/restart an exascale-size application make traditional checkpointing impractical. A number of strategies have been proposed in recent years to enable systems of these extreme sizes to be resilient against failures.
Our work is one of the first to provide a comprehensive comparison among traditional checkpoint/restart and three state-of-the-art HPC resilience protocols that are being considered for use in exascale HPC systems. We demonstrate the importance of employing these state-of-the-art protocols and examine how each resilience protocol behaves as application sizes scale from what is considered large today through to exascale sizes. Because we experiment with applications that have multiple sets of execution characteristics, our analysis of the applications allows for the simultaneous investigation of each protocol’s ability to handle varying application sizes and reliability goals. Our results not only show the necessity of incorporating improved forms of resilience for future HPC systems, but also show that different resilience protocols perform better or worse when executing different types of applications. Additionally, our results show that the resilience protocol that is optimal for a particular application type can change as the application scales. Based on these results, we propose optimizations to the multi-level checkpointing approach, which we believe is one of the most promising fault resilience approaches for exascale complexity platforms. We devise techniques for multi-level checkpoint interval optimization, with an emphasis on performance efficiency as well as energy use. We demonstrate that distinct intervals exist when optimizing for either one metric or the other, and examine the sensitivity of this phenomena to changes in several system parameters and application characteristics.
Speeding up a high-performance scientific code with GASPI shared notifications
Dana Akhmetova, Roman Iakymchuk, Valeria Bartsch, Erwin Laure, Christian Simmendinger
For the HPC community it is very important to understand how practicable programming models for the coming Exascale computing era will look like. Therefore, experimenting with current parallel programming models and studying their interoperability, scalability and performance will provide valuable insights for this.
While message passing supports communication in distributed-memory systems by exchanging messages, the Partitioned Global Address Space (PGAS) programming model provides the concept of a global memory address space, physically located on different nodes, but accessible to all the processes. It is based on one-sided remote direct memory access (RDMA) driven communication supported directly by network. GASPI (Global Address Space Programming Interface), a PGAS API, shifts a paradigm from bulk-synchronous two-sided communication towards asynchronous communication . It represents an alternative to the MPI standard.
In our previous works, we have already experimented with GASPI and have ported a number of real-world applications to this model. The new implementations showed positive results and performed faster than their initial (MPI+OpenMP) versions at least on a large number of cores. In this study we experiment with a new GASPI feature called shared notifications in iPIC3D, a large plasma physics code for space weather applications written in C++ and MPI+OpenMP. To our knowledge, the GASPI shared notifications have never been used before. In our test runs they have shown very promising performance behaviour. We are using the GPI-2 library, an implementation of the GASPI standard of the PGAS programming model, developed by the Fraunhofer Institute for Industrial Mathematics ITWM. In this work we:
- port a large high-performance scientific real-world code to GASPI;
- use shared notifications, a new feature of GASPI, for the first time ever;
- analyse how suitable GASPI is for Exascale computing by providing performance and scaling tests with up to 8192 processes, and by comparing with MPI, the de-facto standard for distributed-memory programming;
- provide a step-by-step methodology with new GASPI features in a real-world application with discussions on interoperability issues;
- share our experience in the form of best-practice programming guides, including suggestions to programmers who may adopt our approach.
Our performance results show that GASPI shared notifications provide a promising new set of features to further pave the way to the Exascale era for scientific production codes.
Efficient Gather-Scatter Operations in Nek5000 Using PGAS
Niclas Jansson, Nick Johnson, and Michael Bareford
Gather-scatter operations are one of the key communication kernels used in the computational fluid dynamics (CFD) application Nek5000 for fetching data dependencies (gather), and spreading results to other nodes (scatter). The current implementation used in Nek5000 is the Gather-Scatter library, GS, which utilises different communication strategies: nearest neighbour exchange, message aggregation, and collectives, to efficiently perform communication on a given platform. GS is implemented using non-blocking, two-sided message passing via MPI and the library has proven to scale well to hundreds of thousands of cores. However, the necessity to match sending and receiving messages in the two-sided communication abstraction can quickly increase latency and synchronisation costs for very fine grained parallelism, in particular for the unstructured communication patterns created by unstructured CFD problems.
ExaGS is a re-implementation of the Gather-Scatter library, with the intent to use the best available programming model for a given architecture. We present our current implementation of ExaGS, based on the one-sided programming model provided by the Partitioned Global Address Space (PGAS) abstraction, using Unified Parallel C (UPC). Using a lock-free design with efficient point-to-point synchronisation primitives, ExaGS is able to reduce communication latency compared to the current two-sided MPI implementation. A detailed description of the library and implemented algorithms are given, together with a performance study of ExaGS when used together with Nek5000, and its co-design benchmarking application Nekbone.