Morning Parallel Session A

Morning Parallel Session B

Afternoon Parallel Session A

Afternoon Parallel Session B

On the Calculation Distribution between CPU and GPU in Hybrid Supercomputing of Radiation Transport
Roman Uskov, Boris Chetverushkin, Vladimir Gasilov, Mikhail Markov, Olga Olkhovskaya and
Mikhail Zhukovskiy


A Monte Carlo algorithm for radiation transport modelling is developed for simulating the interaction between ionizing radiation and the matter of a complex technical object on supercomputers of heterogeneous architecture (hybrid computing clusters, HCC). An HCC comprises a number of nodes, each of which includes a CPU and a GPU. The MPI, OpenMP and CUDA technologies are used for the supercomputing. The distribution of computing between nodes is carried out with MPI. Data exchange between nodes happens only at the beginning and at the end of the simulation; the MPI parallelization therefore scales almost without limit. GPU utilization and the distribution of calculations between CPU and GPU within a single node are handled by means of CUDA. The approach to GPU utilization is not obvious: every kernel was written by the authors of the software, as no ready-made GPU solutions exist for the problem in question.
The Monte Carlo simulation of radiation transport [2] is based on building random trajectories of the radiation particles. Different particles require different amounts of computation, and the various parts of the trajectory algorithm demand varying amounts of computing resources. The basic principle of the calculation distribution is to carry out computing with a high calculation load on the GPU and with a low one on the CPU. For instance, the construction of photon trajectories can be conditionally split into two parts: a geometrical part and a physical one. An analysis of the computational load shows the following. The geometrical component of the algorithm (tracing the object) requires a huge number of simple calculations and is therefore performed on the GPU. Conversely, the physical calculations (simulating the interaction acts between photon and atom) are carried out on the CPU. As for electron trajectories, all of the computing except for exchange operations is done on the GPU. The developed algorithm is implemented as parallel software on the HCC K-100, a prototype of an exascale system.
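As a hedged illustration of the trajectory-building step described above (a generic Monte Carlo sketch in Python, not the authors' CUDA/MPI code), the geometric part samples a free path between interactions while the physical part decides the outcome of each interaction:

```python
import math
import random

def sample_free_path(mu):
    """Sample the distance to the next interaction for a particle in
    matter with attenuation coefficient mu (inverse exponential
    transform of a uniform random number)."""
    return -math.log(random.random() or 1e-300) / mu

def build_trajectory(mu, absorb_prob, rng=random.random):
    """Build one random trajectory: alternate a geometric step
    (free-path sampling / tracing) with a physical step (interaction).
    Returns the number of scattering events before absorption.
    absorb_prob is an illustrative stand-in for the interaction physics."""
    scatters = 0
    while True:
        _step = sample_free_path(mu)   # geometric part (on GPU in the paper)
        if rng() < absorb_prob:        # physical part (on CPU in the paper)
            return scatters
        scatters += 1

random.seed(42)
paths = [sample_free_path(mu=2.0) for _ in range(100_000)]
mean_path = sum(paths) / len(paths)    # should approach 1/mu = 0.5
```

Because every trajectory is independent until the final tally, batches of such trajectories can be distributed across nodes with essentially no intermediate communication, which is what makes the MPI level of the algorithm scale so well.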

Case study: Bohrium – Powering Oceanography Simulation
Mads Ohm Larsen and Dion Häfner

In the field of oceanography, numerical simulations have been used for more than 50 years. These simulations often require long integration times and can run for several real-time months. They are thus often written in high-performance languages such as C or Fortran, both of which are often thought of as complex to use. Fortunately, we have seen a shift in the scientific programming community towards focusing more on productivity as opposed to just performance. We would, however, like to keep the performance of these archaic languages. Academic code often has a limited lifespan, because the developers change as people graduate and new people arrive. Having to spend a long time understanding the simulations takes away from the actual science being done. Veros is a Python translation of an existing oceanography project written in Fortran. It utilizes NumPy for its vector computations. In order to rectify the performance loss, Veros has chosen Bohrium as its computational back-end. Bohrium is a run-time framework that allows sequential interpreted code to be easily parallelized and possibly run on GPGPUs. This is done by just-in-time (JIT) compiling generated OpenMP, OpenCL, or CUDA kernels and running them on the appropriate hardware. Bohrium also supports multiple front-ends, namely Python, C, and C++. In Python this is done with minimal intrusion, so no annotations are needed: as long as the code utilizes NumPy functions, Bohrium will override these and replace them with JIT-compiled kernels. Bohrium has its own intermediate representation (IR), which is an instruction set gathered from the interpreted code up to a side effect, for example I/O. Well-known compiler optimizations, such as constant folding and strength reduction, can be applied to the IR prior to generating the kernels. Other optimizations include using already established libraries such as BLAS for low-level linear algebra.
Bindings to the appropriate BLAS library on your system are auto-generated when Bohrium is compiled. This means that if you have, for example, clBLAS installed, Bohrium will create bindings to it that can be utilized directly from Python. These will also override the NumPy methods that already use BLAS, for even better performance.

Using Bohrium of course comes with an overhead in the form of generating the kernels. Fortunately, this overhead is amortized in larger simulations. For the Veros project we see that Bohrium is roughly an order of magnitude faster than the same implementation using Fortran or NumPy in the benchmarks. However, a parallel Fortran implementation using MPI for communication is faster still. In the future we would like to utilize distributed-memory systems with Bohrium, so that we can run even larger problem sizes, using possibly multiple terabytes of memory.
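The lazy IR described above can be illustrated with a toy sketch (hypothetical class and method names; this is not Bohrium's implementation): array operations are recorded into an instruction batch rather than executed, and the batch is only run, where a real runtime would fuse and JIT-compile it, when a side effect forces materialisation:

```python
class LazyArray:
    """Toy model of Bohrium-style deferred execution: each arithmetic
    op appends an instruction to a shared batch instead of computing
    immediately."""
    def __init__(self, data, batch=None):
        self.data = list(data)
        self.batch = batch if batch is not None else []

    def _binop(self, other, name, fn):
        out = LazyArray([0.0] * len(self.data), self.batch)
        # Record the instruction; a real runtime would fuse the batch
        # into one JIT-compiled kernel before executing it.
        self.batch.append((name, fn, self, other, out))
        return out

    def __add__(self, other):
        return self._binop(other, "add", lambda a, b: a + b)

    def __mul__(self, other):
        return self._binop(other, "mul", lambda a, b: a * b)

    def flush(self):
        """Execute the recorded batch (the 'side effect' boundary)."""
        for _, fn, a, b, out in self.batch:
            bdat = b.data if isinstance(b, LazyArray) else [b] * len(a.data)
            out.data = [fn(x, y) for x, y in zip(a.data, bdat)]
        self.batch.clear()
        return self

x = LazyArray([1.0, 2.0, 3.0])
y = (x + 1.0) * 2.0        # nothing computed yet: two recorded instructions
pending = len(y.batch)     # 2
result = y.flush().data    # [4.0, 6.0, 8.0]
```

Collecting instructions up to a side effect is what lets optimizations such as constant folding, strength reduction, and kernel fusion see a whole expression at once instead of one NumPy call at a time.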

Roadmap to Exascale for Nek5000
Adam Peplinski, Evelyn Otero, Paul Fischer, Stefan Kerkemeier, Jing Gong and Philipp Schlatter

Nek5000 is a highly scalable spectral element code for simulation of turbulent flows in complex
domains. As a Gordon Bell winner, it has strong-scaled to over a million MPI ranks, sustaining
over 60 percent efficiency on 1048576 ranks of Mira with just 2000 points per core. This
is in line with efficiency expectations for multigrid on this architecture at these scales. Moreover,
a Nek5000-derived miniapp, Nekbone, was developed, which has sustained 1.2 PFLOPS
on six million ranks of Sequoia (6% of peak).
In this talk, we will present the main characteristics of Nek5000 on its path towards exascale
performance. The overall efficiency of Nek5000 derives from:

  • stable high-order discretizations that, for a given accuracy, require significantly less data
    movement and fewer flops than their low-order counterparts;
  • communication-minimal bases requiring only C0 continuity between elements and, hence, exchange of only surface data between processors;
  • fast matrix-free formulations based on tensor-product operators;
  • tensor-product contractions based on highly optimized small matrix-matrix product kernels;
  • an efficient and scalable communication framework, gslib, with only O(n) or O(n log P) complexity for both the execute and setup phases, which has been deployed on over six million MPI ranks;
  • a scalable p-multigrid solver that uses fast tensor-product-based smoothers coupled with an unstructured algebraic multigrid solver for the global coarse-grid solve, with corresponding parallel setup.
Exascale extensions of Nek5000 will be somewhat dependent on emergent architectures, but a significant trend is towards the use of accelerators (e.g., GPUs). At present, Nek5000 is being ported to GPUs using OpenACC and OCCA. Other aspects necessary for large-scale simulations, such as error estimators, non-conformal meshes and adaptive simulations, will also be discussed. Several members of the Nek5000 development team are part of the US Department of Energy co-design Center for Efficient Exascale Discretizations (CEED). Another consortium also driving the exascale development of Nek5000 is the SeRC Exascale Simulation Software Initiative (SESSI), which aims at improving the performance and scalability of a number of widely-used codes.
In summary, the overall performance of an exascale code derives from the product S_P = η_P · P · S_1, where S_P is the sustained flop rate on P processors, η_P is the strong-scaling efficiency, and S_1 is the sustained rate on a single processor. Nek5000 has an established track record of sustaining near-unity efficiency η_P for the anticipated exascale values of P. Boosting S_1 on future complex nodes is therefore the high priority of our current exascale development.
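The matrix-free tensor-product evaluation mentioned above can be sketched as follows (illustrative Python, not Nek5000's optimized Fortran/C kernels): applying a 1D operator along one axis of an n×n×n element reduces to a small matrix-matrix product with the n×(n·n) unfolding of the element data:

```python
def tensor_apply_x(D, u, n):
    """Apply a 1D operator D (n×n, list of lists) along the first axis
    of a flattened n×n×n element u: w[i,j,k] = sum_m D[i][m] * u[m,j,k].
    This is exactly a matrix-matrix product of D with the n×(n*n)
    unfolding of u, the small-matmul kernel at the core of spectral
    element evaluation."""
    w = [0.0] * (n * n * n)
    for i in range(n):
        for m in range(n):
            d = D[i][m]
            if d == 0.0:
                continue
            base_m, base_i = m * n * n, i * n * n
            for jk in range(n * n):        # fused j,k loop: one matmul row
                w[base_i + jk] += d * u[base_m + jk]
    return w

# Identity operator leaves the element unchanged
n = 3
I = [[1.0 if i == m else 0.0 for m in range(n)] for i in range(n)]
u = [float(t) for t in range(n * n * n)]
w = tensor_apply_x(I, u, n)   # w == u
```

Applying such contractions along each of the three axes evaluates a full 3D operator in O(n^4) work per element, instead of the O(n^6) cost of forming and applying the assembled element matrix, which is why the formulation is both matrix-free and fast.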

Port Out, Motherboard Home: Accelerating CASTEP on CPU-GPU clusters
Matthew Smith, Arjen Tamerus, and Phil Hasnip

Heterogeneous computer systems which use CPUs and GPUs in tandem to accelerate computation are well-established in HPC clusters, and are a candidate technological route to exascale computing. Optimal software performance on such massively-parallel systems involves the exploitation of distributed-memory parallelism on the CPU and the offloading of computationally intensive tasks to the GPU. A major goal of software design is therefore to marry these two elements and thereby maximise CPU and GPU computation while minimising CPU-CPU and CPU-GPU communications.

Here we present recent work undertaken towards achieving this goal for CASTEP, the UK’s premier quantum-mechanical materials-modelling software. We describe our approach to enabling accelerator compatibility for this mature code, using our hybrid OpenACC-MPI implementation, as well as our use of accelerator libraries including cuFFT and MAGMA. The gains in performance afforded by these developments are illustrated with results from UK HPC facilities.

MPI Storage Windows for a Data-centric Era
Sergio Rivas-Gomez, Stefano Markidis, Erwin Laure and Sai Narasimhamurthy

Even though breaking the ExaFLOP barrier is expected to become one of the major computing milestones of the next decade, several challenges remain of paramount importance for the success of an Exascale supercomputer. One such challenge is the bandwidth and access latency of the I/O subsystem, projected to remain roughly constant while the concurrency of Exascale machines increases approximately 4000×. In addition, with the integration of emerging deep-learning and data-analytics applications on HPC, the chances of unexpected failures at Exascale will rise considerably as well. In order to overcome some of these limitations, upcoming large-scale systems will feature a variety of Non-Volatile RAM (NVRAM) next to traditional hard disks and conventional DRAM. Emerging non-volatile solid-state technologies, such as flash, phase-change and spin-transfer torque memories, are used to narrow the existing gap between memory and storage. The integration of these technologies provides several advantages (e.g., data locality) that can potentially reduce the overall power consumption and I/O access latency of HPC applications. In this presentation, we address the challenge of adapting MPI to the changes in the memory and storage hierarchies of Exascale supercomputers.

We present the concept of MPI storage windows, an extension to the MPI one-sided communication model that provides a single interface for programming memory and storage. We illustrate its benefits for out-of-core execution and parallel I/O, and present a novel fault-tolerance mechanism based on this concept. Preliminary performance results indicate that our approach incurs negligible performance differences on real-world applications compared to traditional MPI memory windows. Additionally, we present heterogeneous window allocations, which provide a unified virtual address space spanning memory and storage. Results on out-of-core execution show that less than a 40% performance penalty is incurred when exceeding the main-memory capacity of the compute nodes.
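The idea of presenting storage through a memory interface can be illustrated with plain memory-mapped I/O (a conceptual stdlib sketch, not the proposed MPI storage-window API): a file is mapped into the address space so that the same load/store operations serve both memory and storage:

```python
import mmap
import os
import tempfile

# Create a file-backed "window": stores go to the mapped region and
# persist to storage, mirroring how a storage window exposes storage
# through the familiar one-sided put/get interface.
size = 4096
fd, path = tempfile.mkstemp()
os.ftruncate(fd, size)

with mmap.mmap(fd, size) as win:
    win[0:5] = b"hello"      # a 'put' into the window: an ordinary store
    win.flush()              # make the update durable (cf. window sync)

with open(path, "rb") as f:  # the data survived in the backing file
    persisted = f.read(5)

os.close(fd)
os.unlink(path)
```

The point of the sketch is the programming-model claim: code written against a memory-style window keeps working when the allocation is backed by storage, which is what enables transparent out-of-core execution and checkpoint-like durability.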

Performance of HPC I/O Strategies at Scale
Keeran Brabazon, Oliver Perks, Stefano Markidis, Ivy Bo Peng, Sergio Rivas Gomez and Adrian Jackson

As we move towards the extreme and exascale, I/O strategies and subsystems are becoming more versatile and more complex in order to address the gap between compute and I/O. We have already seen a move to take advantage of fast storage nodes in the form of burst buffers in production HPC systems, as well as the use of parallel I/O libraries (such as MPI I/O and HDF5) and programmer-directed I/O. With this added complexity, a user of an HPC system needs to be informed of the advantages and disadvantages of different I/O strategies. In this presentation, we consider the performance of an HPC filesystem for a real-world application, rather than benchmarks or mini-apps. The application under consideration is a plasma-physics simulation (iPIC3D) developed by KTH, Stockholm. Recent work at KTH has involved developing run times for efficient I/O at scale, and iPIC3D has been extended so that the I/O scheme used by the application can easily be switched at compile time. Performance data for weak-scaling experiments of iPIC3D using different I/O methods were collected every day over a two-month period. Focus is kept on the performance of write operations, as these are important in traditional HPC simulations, in which initial data are read during initialisation and snapshots of the simulation status are recorded at different points in simulation time. Program-internal data are captured using the Arm MAP profiling tool, which gathers a rich set of performance metrics from within an application. The overhead of the tool is measured at less than 5% of overall run time, meaning that performance data are gathered for a close-to-production run.

Variations in observed application performance are correlated with system load and I/O-subsystem performance. Our conclusions compare and contrast current I/O paradigms, and take a forward-looking view of the suitability of different I/O run times for the extreme scale emerging in the coming years.

Evaluation Methodology of an NVRAM-based Platform for the Exascale
Juan F.R. Herrera, Suraj Prabhakaran, Michèle Weiland, and Mark Parsons

One of the major roadblocks to achieving the goal of building an HPC system capable of Exascale computation is the I/O bottleneck. Current systems are capable of processing data quickly, but speeds are limited by how fast the system is able to read and write data. This represents a significant loss of time and energy in the system. Being able to widen, and ultimately eliminate, this bottleneck would greatly increase the performance and efficiency of HPC systems. The NEXTGenIO project is investigating this issue by bridging the latency gap between memory and disk through the use of non-volatile memory, which will sit between conventional DDR memory and disk storage. In addition to the hardware that will be built as part of the project, the project will develop the software stack (from OS and runtime support to programming models and tools) that goes hand-in-hand with this new hardware architecture. The project addresses a key challenge not only for Exascale, but also for HPC and data-intensive computing in general: the challenge of I/O performance.

An application suite of eight memory- and I/O-bound applications has been selected, alongside a set of test cases for each application, to evaluate the platform’s effectiveness regarding I/O performance and throughput. The application suite covers a wide range of fields, from computer-aided engineering to meteorology, computational chemistry, and machine learning. The output of the evaluation will document the benefits of the NEXTGenIO technology and indicate its impact and future lines of development. Three measurement scenarios are defined to assess the specific benefits of the NEXTGenIO platform:

  • Baseline measurement in today’s systems.
  • Measurements on the NEXTGenIO platform without the use of non-volatile memory.
  • Measurements on the NEXTGenIO platform with the use of non-volatile memory.

The profiling tools Allinea MAP and Score-P are used to collect the metrics needed to evaluate the performance of the applications in each scenario. These tools have been extended to support performance analysis with non-volatile memory. In our presentation, we describe our methodology for evaluating the NEXTGenIO platform and show early memory and I/O profiling results for the applications. We also discuss how NVRAM will impact the performance of these applications.

Scalable IO FuNnelling for ECMWF’s Integrated Forecast System
James Hawkes, Tiago Quintino

ECMWF’s IFS (Integrated Forecast System) uses an IO-server architecture to improve the scalability of data output. The IO server splits the global MPI communicator, dedicating some processes to the scientific model whilst funnelling IO through dedicated IO processes. The IO processes are responsible for collating, buffering, and encoding output data, leaving the model processes free to continue computing without blocking, as long as the IO-process buffers are not full. The output routines scale efficiently with the number of processes and the number of fields by avoiding global synchronization and distributing the fields evenly between IO processes. The funnelling method allows control over contention for the IO hardware, and also provides the opportunity to perform global operations (such as encoding) on the collated output data without expensive collective communications.

The IO server architecture was originally developed at Météo France, and later ported to the IFS atmospheric model. In this presentation, the authors describe the architecture of the IO server and its advantages compared to alternative methods. We present recent developments in coupling the non-atmospheric wave model with the IO server, and demonstrate the realized improvements to overall scalability and performance. Furthermore, we discuss the integration of the IO server with the downstream IO stack and future plans for closer integration with post-processing services.
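The funnelling pattern can be sketched with a bounded producer/consumer buffer (a toy threaded Python model with illustrative names, not IFS code): model workers hand fields to a dedicated IO worker and only block when the buffer is full:

```python
import queue
import threading

# Toy model of IO funnelling: "model" threads hand fields to a dedicated
# "IO" thread through a bounded buffer, so the model only stalls when
# the buffer is full (sizes and field names are illustrative).
buffer = queue.Queue(maxsize=8)
written = []

def io_server():
    while True:
        field = buffer.get()       # collating/encoding/writing would go here
        if field is None:          # shutdown sentinel
            break
        written.append(field)

def model_rank(rank, n_fields):
    for step in range(n_fields):
        buffer.put((rank, step))   # non-blocking unless the buffer is full

io = threading.Thread(target=io_server)
io.start()
workers = [threading.Thread(target=model_rank, args=(r, 4)) for r in range(3)]
for w in workers:
    w.start()
for w in workers:
    w.join()
buffer.put(None)
io.join()                          # all 12 fields funnelled through IO
```

Concentrating output in one consumer is also what creates the opportunity, noted above, to perform global operations on collated data without collective communication among the model processes.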

Progressive load balancing of asynchronous algorithms in distributed memory
Justs Zarins, Michèle Weiland

As supercomputers grow in size, running large-scale, tightly-coupled applications efficiently is becoming more difficult. A key component of the problem is the cost of synchronisation, which increases with system noise and performance variability. This affects even high-end HPC machines. An exciting and promising approach to addressing this problem is to stop enforcing synchronisation points. This results in what are known as “asynchronous” or “chaotic” algorithms; commonly they are iteratively convergent. Cores are allowed to compute using whatever latest data is available to them, which might be “stale”, instead of waiting for other threads to catch up. Existing applications of this methodology show good performance and fault tolerance compared to their synchronous counterparts.

While asynchrony removes the computational cost of requiring all data to arrive at the same time, a different cost takes its place: progress imbalance. This is natural, because synchronisation points exist to coordinate progress. An imbalance in progress can result in slower convergence, or even failure to converge, as old data is used for updates. This can be countered by putting a strict bound on how stale data is allowed to be, but at a cost to performance. As an alternative, the authors of [2] introduce the idea of progressive load balancing: balancing asynchronous algorithms over time, as opposed to balancing instantaneously. Instead of fine-tuning iteration rates, parts of the working set are periodically moved between computing threads on a node. As a result, progress imbalance is limited without adding a large overhead. The approach is similar to bounded staleness, but it continues to work efficiently in the presence of continuous progress imbalance, which may be caused by, for example, hardware performance variability or workload imbalance. The authors tested the approach running Jacobi’s method on a single compute node and found that it increased the iteration rate and decreased progress imbalance between parts of the solution space. Here we present an extension of progressive load balancing to the distributed-memory setting.

We start by evaluating the extent to which progress imbalance can be reduced by running independent load balancing on each node. Additionally, we evaluate an implementation where load balancing is allowed to take place across nodes periodically to account for cases where a whole node is slow. Finally, we draw conclusions about the benefits and challenges of this approach in the context of future exascale applications.
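For context, the kind of iteratively convergent method targeted by this work can be sketched with a synchronous Jacobi solve (an illustrative Python sketch, not the authors' implementation); the asynchronous variant would simply drop the barrier between sweeps and let each worker update its subdomain with whatever halo values are currently available, however stale:

```python
def jacobi_sweep(u, f, h2):
    """One synchronous Jacobi sweep for -u'' = f on a 1D grid with
    fixed boundary values. In the asynchronous variant each worker
    would sweep its own subdomain independently, reading possibly
    stale neighbour values instead of waiting at a barrier."""
    n = len(u)
    interior = [0.5 * (u[i - 1] + u[i + 1] + h2 * f[i]) for i in range(1, n - 1)]
    return [u[0]] + interior + [u[-1]]

n = 33
h2 = (1.0 / (n - 1)) ** 2
f = [1.0] * n            # -u'' = 1, u(0) = u(1) = 0
u = [0.0] * n
for _ in range(2000):
    u = jacobi_sweep(u, f, h2)

# Residual of the converged iterate (zero at the discrete solution)
res = max(abs(u[i - 1] - 2 * u[i] + u[i + 1] + h2 * f[i]) for i in range(1, n - 1))
```

Because each update only reads neighbouring values, the method still converges under staleness, which is exactly why progress imbalance degrades convergence speed gradually rather than breaking correctness outright.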

An Exploration of Fault Resilience Protocols for Large-Scale Application Execution on Exascale Computing Platforms
Daniel Dauwe, Sudeep Pasricha, Anthony A. Maciejewski, and Howard Jay Siegel

The probability of applications experiencing failures in today’s high performance computing (HPC) systems has increased significantly with the increase in the number of system nodes. It is expected that exascale-sized systems are likely to operate with mean time between failures (MTBF) of as little as a few minutes, causing frequent interrupts in application execution as well as substantially greater energy costs in a system. Periodic application checkpointing to a parallel file system has for years been the de-facto strategy for enabling resilience to failures in HPC platforms. This traditional checkpoint/restart protocol is widely used in HPC systems for mitigating the impact of failures on application performance. However, as system sizes approach exascale levels, the higher frequency of failures and the lengthy time required to checkpoint/restart an exascale-size application make traditional checkpointing impractical. A number of strategies have been proposed in recent years to enable systems of these extreme sizes to be resilient against failures.

Our work is one of the first to provide a comprehensive comparison between traditional checkpoint/restart and three state-of-the-art HPC resilience protocols that are being considered for use in exascale HPC systems. We demonstrate the importance of employing these state-of-the-art protocols and examine how each resilience protocol behaves as application sizes scale from what is considered large today up to exascale sizes. Because we experiment with applications that have multiple sets of execution characteristics, our analysis allows for the simultaneous investigation of each protocol’s ability to handle varying application sizes and reliability goals. Our results not only show the necessity of incorporating improved forms of resilience in future HPC systems, but also show that different resilience protocols perform better or worse when executing different types of applications. Additionally, our results show that the resilience protocol that is optimal for a particular application type can change as the application scales. Based on these results, we propose optimizations to the multi-level checkpointing approach, which we believe is one of the most promising fault-resilience approaches for exascale platforms. We devise techniques for multi-level checkpoint-interval optimization, with an emphasis on performance efficiency as well as energy use. We demonstrate that distinct intervals exist when optimizing for either one metric or the other, and examine the sensitivity of this phenomenon to changes in several system parameters and application characteristics.
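For background, the trade-off that checkpoint-interval optimization navigates is captured by Young's classical first-order formula (standard background material, not the authors' multi-level model); the parameter values below are illustrative:

```python
import math

def young_interval(checkpoint_cost, mtbf):
    """Young's first-order optimal checkpoint interval:
    tau = sqrt(2 * C * MTBF), where C is the time to write one
    checkpoint. Short MTBFs drive tau down until checkpoint
    overhead dominates, motivating multi-level schemes."""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)

def overhead_fraction(tau, checkpoint_cost, mtbf):
    """First-order expected overhead: checkpoint cost per interval plus
    the expected rework of half an interval after each failure."""
    return checkpoint_cost / tau + tau / (2.0 * mtbf)

# Today-like system: C = 300 s to a parallel file system, MTBF = 1 day
tau_today = young_interval(300.0, 86400.0)   # 7200 s, i.e. every 2 hours
# Exascale-like system: same C, MTBF = 30 minutes
tau_exa = young_interval(300.0, 1800.0)      # ~1039 s
```

At a 30-minute MTBF, even the optimal single-level interval loses more than half the machine time to checkpointing and rework, which is precisely why checkpointing to a parallel file system alone becomes impractical and multi-level protocols become attractive.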

Speeding up a high-performance scientific code with GASPI shared notifications
Dana Akhmetova, Roman Iakymchuk, Valeria Bartsch, Erwin Laure, Christian Simmendinger

For the HPC community it is very important to understand what practicable programming models for the coming Exascale computing era will look like. Experimenting with current parallel programming models and studying their interoperability, scalability and performance will therefore provide valuable insights.

While message passing supports communication in distributed-memory systems by exchanging messages, the Partitioned Global Address Space (PGAS) programming model provides the concept of a global memory address space that is physically located on different nodes but accessible to all processes. It is based on one-sided remote direct memory access (RDMA) communication supported directly by the network. GASPI (Global Address Space Programming Interface), a PGAS API, shifts the paradigm from bulk-synchronous two-sided communication towards asynchronous communication [3]. It represents an alternative to the MPI standard.

In our previous work, we have already experimented with GASPI and ported a number of real-world applications to this model. The new implementations showed positive results, performing faster than their initial (MPI+OpenMP) versions, at least on large numbers of cores. In this study we experiment with a new GASPI feature called shared notifications in iPIC3D, a large plasma-physics code for space-weather applications written in C++ with MPI+OpenMP. To our knowledge, GASPI shared notifications have never been used before; in our test runs they have shown very promising performance behaviour. We use the GPI-2 library, an implementation of the GASPI standard developed by the Fraunhofer Institute for Industrial Mathematics ITWM. In this work we:

  • port a large high-performance scientific real-world code to GASPI;
  • use shared notifications, a new feature of GASPI, for the first time ever;
  • analyse how suitable GASPI is for Exascale computing by providing performance and scaling tests with up to 8192 processes, and by comparing with MPI, the de-facto standard for distributed-memory programming;
  • provide a step-by-step methodology for using new GASPI features in a real-world application, with discussions of interoperability issues;
  • share our experience in the form of best-practice programming guides, including suggestions to programmers who may adopt our approach.

Our performance results show that GASPI shared notifications provide a promising new set of features to further pave the way to the Exascale era for scientific production codes.

Efficient Gather-Scatter Operations in Nek5000 Using PGAS
Niclas Jansson, Nick Johnson, and Michael Bareford

Gather-scatter operations are one of the key communication kernels used in the computational fluid dynamics (CFD) application Nek5000, fetching data dependencies (gather) and spreading results to other nodes (scatter). The current implementation used in Nek5000 is the Gather-Scatter library, GS, which utilises different communication strategies (nearest-neighbour exchange, message aggregation, and collectives) to perform communication efficiently on a given platform. GS is implemented using non-blocking, two-sided message passing via MPI, and the library has proven to scale well to hundreds of thousands of cores. However, the need to match sending and receiving messages in the two-sided communication abstraction can quickly increase latency and synchronisation costs for very fine-grained parallelism, in particular for the irregular communication patterns created by unstructured CFD problems.

ExaGS is a re-implementation of the Gather-Scatter library, with the intent to use the best available programming model for a given architecture. We present our current implementation of ExaGS, based on the one-sided programming model provided by the Partitioned Global Address Space (PGAS) abstraction, using Unified Parallel C (UPC). Using a lock-free design with efficient point-to-point synchronisation primitives, ExaGS is able to reduce communication latency compared to the current two-sided MPI implementation. A detailed description of the library and the implemented algorithms is given, together with a performance study of ExaGS when used with Nek5000 and its co-design benchmarking application, Nekbone.
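The gather-scatter operation itself can be sketched in a few lines (a toy serial Python version; GS and ExaGS perform the same direct-stiffness summation across MPI ranks or UPC threads rather than within one process):

```python
def gather_scatter(local_values, global_ids):
    """Toy gather-scatter (direct-stiffness summation): values that
    carry the same global id (shared element-boundary nodes) are
    summed, and the sum is written back to every copy. In Nek5000
    the copies live on different processors, so this accumulate-and-
    redistribute step is exactly the communication kernel."""
    sums = {}
    for v, g in zip(local_values, global_ids):   # gather: accumulate
        sums[g] = sums.get(g, 0.0) + v
    return [sums[g] for g in global_ids]         # scatter: redistribute

# Two 1D "elements" sharing the middle node (global id 1)
vals = [1.0, 2.0, 3.0, 4.0]   # element 1: nodes 0,1; element 2: nodes 1,2
gids = [0, 1, 1, 2]
out = gather_scatter(vals, gids)   # shared node receives 2.0 + 3.0 = 5.0
```

Because each shared id involves only the small set of processors owning a copy, the kernel maps naturally onto one-sided accumulate operations, which is what the PGAS/UPC re-implementation exploits to avoid two-sided message matching.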