Morning Parallel Session A

Morning Parallel Session B

Afternoon Parallel Session A

Afternoon Parallel Session B

A Task-Based Particle-in-Cell Method with Automatic Load Balancing using the AllScale Environment
Roman Iakymchuk, Herbert Jordan, Philipp Gschwandtner, Thomas Heller, Peter Thoman Xavier Aguilar, Thomas Fahringer, Erwin Laure, Stefano Markidis
The Particle-in-Cell (PIC) method is one of the most common and powerful numerical techniques for the simulation of fusion, astrophysical, and space plasmas. For instance, PIC simulations are used to study the interaction of the Earth’s electromagnetic field with the hot plasma emanated by the sun, the so-called solar wind. To note, high energy plasma in space can damage spacecrafts and put at risk the life of astronauts. Thus, it is important to enable efficient large-scale PIC simulations that are capable to predict different phenomena in space. The PIC method solves the kinetic equation of plasmas by first sampling plasma distribution functions with computational particles and then following their trajectories by solving the equation of motion for each particle. The electromagnetic field determining the particle trajectory is calculated by computing the Maxwell’s equations on a grid. The coupling between particle and both electric and magnetic fields is provided by the so-called interpolation functions (aka gather and scatter). Typically, parallel PIC simulations divide the simulation box in several equal-in-size domains with initially the same number of particles [1]. Each domain is assigned to a process that carries out the computation for the particles in the domain. When a particle exits the domain, it is communicated to a different domain. Because of the non-uniform configuration of the electromagnetic field in space, computational particles concentrate in relatively small spatial regions while few particles cover other spatial regions, resulting in load imbalance. Workload-imbalance is the most severe and limiting problem in large-scale PIC simulations, resulting in up to 60 % of the process imbalance.
In this talk, we present a new formulation of the PIC method to provide automatic loadbalancing using the AllScale toolchain1. The AllScale approach, which is based on task-centric nested recursive parallelism, aims to provide a unified programming system for the effective development of highly scalable, resilient and performance-portable parallel applications for Exascale systems. In the spirit of this approach, we redesign the PIC method: particles are placed inside individual cells rather than both structures stored separately, requiring extra mappings. Moreover, the field solver implements an explicit scheme which allows to eliminate global communications and synchronizations. This facilitates the accommodation of the nested, recursive, and asynchronous task parallelism that is in the core of the AllScale toolchain. Furthermore, our method employs dynamic load balancing: the simulation is recursively divided into smaller domains, forming a set of tasks to be computed. The granularity and distribution of those tasks is thereby controlled by the scheduler provided by the AllScale runtime system, leading to an even load balance where each processing unit spends approximately the same amount of time on processing its assigned share of particles. We demonstrate our new approach using a case study with severe load imbalance, the Earth’s Van Allen radiation belts.

LFRic and PSyclone: Building a Domain Specific Embedded
Language for weather and Climate models
Christopher Maynard

In common with many science applications, exascale computing presents a disruptive change for weather and climate models. However, the difficulty in porting and optimising legacy codes to new hardware is particularly acute for this domain as the software is large (O(106) lines of code), takes a long time develop (∼ 10 years for a new dynamical core) and is long-lived (typically ∼ 25 years or longer). These timescales are much longer than the changes in both processor architectures and programming models necessary to exploit these architectures. Moreover, highly scalable algorithms are necessary to exploit the necessary degree of parallelism exascale computers are likely to exhibit.

In collaboration with academic partners, the Met Office is developing a new dynamical core, called GungHo. By employing a mixed Finite Element Method on an unstructured mesh, the new dynamical core is designed to maintain the scientific accuracy of the current Unified Model (UM) dynamical core (ENDGame [2]), whilst allowing improved scalability by avoiding the singularities present at the poles of a lon-lat grid. A new atmospheric model and software infrastructure, named LFRic after Lewis Fry Richardson, is being developed to host the GungHo dynamical core, as the structured, lon-lat grid is inherent in the data structures of the UM. The software design is based on a layered architecture and a separation of concerns between the natural science code in which the mathematics is expressed and the computational science code where the parallelism and other other software and hardware specific performance optimisations are expressed. In particular, there are three layers. The top layer, the algorithm layer, is where high-level mathematical operations on global fields are performed. The bottom layer is the kernel layer where these operations are expressed on a single column of data. In between is the Parallelisation System or PSy layer, where the horizontal looping and parallelism is expressed. This abstraction, called PSyKAl, is written in Fortran 2003 using Object Orientation to encode the rules of the API. Moreover, a Python code called PSyclone can parse the algorithm and kernel layers and generate the Psy layer with different target programming models. In effect, the PSyKAl API and PSyclone are a Domain Specific Embedded Language (DSEL). Domain science code which conforms to this API can be written in serial and the code, parallelised and optimised for the targeted hardware architecture, is then generated automatically.

The model is under active development and indeed, the science and numerical analysis are still areas of active research. However, in order to assess the scientific performance of the model, sufficiently computationally challenging problems must be tackled. Thus, there is a requirement to run these research models efficiently and at scale on current hardware architectures.

In this paper, the software design and strategy for future architectures is presented. Moreover, some preliminary performance analysis is presented, including scaling analysis to a quarter of million cores on the Met Office Cray XC40. The use of redundant computation, shared memory threaded parallelism such as OpenMP and OpenACC and performance on different architectures such as CPUs, GPUs and ARM processors are also discussed. Furthermore, as I/O is a significant performance factor for weather and in particular climate models, some performance analysis of I/O at scale using the asynchronous IO server of the XIOS library is presented.

Leveraging SLEPc in modeling the earth’s magnetic environment
Nick Brown, Brian Hamilton, William Brown, Ciaran D Beggan, Brian Bainbridge, Susan Macmillan

The Model of the Earth’s Magnetic Environment (MEME) [1] is an important tool for calculating the earth’s magnetic field. In addition to being used directly for scientific study, this model is also a building block for a number of other BGS codes which have a wide variety of applications from oil and gas exploration to GPS positioning. The Earth’s internal magnetic field is generated by the motion of the conductive metallic fluid in the outer core. Changes in the motion of this fluid, which are happening all the time, cause variations of the shape and intensity of the magnetic field measured at the surface. In order to calculate the current magnetic environment, the MEME models uses a combination of current and historical data from magnetic survey satellites (such as the ESA Swarm mission) and observational sites around the globe.

Computationally the model requires the solving of differential equations which it does by calculating Eigenvalues and Eigenvectors of a matrix (built from the input data) via the GOLUB algorithm, where the matrix is tri-diagonalised using Givens rotations. This method is very stable, but neither fast nor parallelised. The small amount of parallelism already present in the code is used for building the matrix, which itself is very time consuming, and exhibits significant load imbalance. Due to the sequential nature of the solver all data must fit within a single memory space and this is currently a major limitation. Due to these memory issues the model is limited to around 10,000 parameters and this means that only a subset (often 1 in 20 points) of current satellite data can be studied. With the deployment of new observation satellites and technologies imminent, it is realistic that runs with over 100,000 parameters will be required in the future but this is far beyond what the model is currently capable of.

We have replaced the bespoke solver with the SLEPc [2] package which builds upon PETSc to provide Eigensolvers. There are two advantages to this; firstly we get numerous solvers out of the box which are trivial to swap in and out so we can experiment with performance and stability. Secondly we are able to leverage the existing PETSc parallelism mechanisms. SLEPc/PETSc favours decomposing the matrix in a row based fashion where a number of matrix rows reside on each processes. However this raises a challenge when building up the matrix due to the symmetry. If we naively built the matrix then it would result in a very uneven load (depending upon the number of points a process has in the upper part of the matrix) or a duplication of calculations. To address this we have developed an algorithm which evenly distributes the points in the matrix for building by including a subset of points in both the upper and lower parts of the matrix. Whilst some communication is required once building has been completed, the work is well balanced and hence far more efficient than the existing approach, inherently works with the distributed data we require and extra parallelism is possible by utilising multiple processes in the building of each row.

In this talk we will discuss our work modernising the solver and parallelism of this model, the suitability of SLEPc and our algorithm for balancing the matrix building. We will illustrate the performance and scalability of the new code on ARCHER and describe the process adopted for providing confidence in the accuracy of results (which is very important to the community.)

A directory/cache for leveraging the efficient use of distributed memory by task-based runtime systems
Tiberiu Rotaru, Bernd Lörwald, Nick Brown, Mirko Rahn, Olivier Aumage, Vicenc Beltran, Xavier Teruel, Jan Ciesko, Jakub Sístek

As the community progresses towards exascale the appearance of much more complex architectures than we are currently used to leveraging is likely. For instance, machines which are highly parallel, with a large number of multi-core nodes, deep memory hierarchies and complex interconnect topologies are around the corner. Efficiently programming these new systems will be a huge challenge, where programmers will not only need to fundamentally rethink their code to increase the level of parallelism by at least an order of magnitude, but to also address other issues such as resilience.

Task-based models [1] are one possible solution to this challenge, where parallelism is decomposed into many tasks. By rethinking their parallelism in the paradigm of tasks, one can significantly reduce synchronisation which is key for achieving high levels of concurrency. The underlying task model decouples the management of parallelism from computation. This relieves the application developers from dealing with lower level details such as scheduling, memory management and resilience concerns that are tricky to manage in large computing systems. However task-based models are not a silver bullet and this paradigm often focuses around providing the abstraction of a single shared address space to the application programmer. To scale beyond a single physical memory space then some sort of distribute technology (e.g. MPI or GASPI) must be combined. This interoperability is either at the task based runtime level (and implicit to the programmer) or involves explicit communications calls provided by the programmer within tasks of their application code.

We have developed an API for a Directory/Cache [2] which can be integrated with task-based runtimes and seamlessly (to the applications programmer) provides the abstraction of a single shared address space. Supporting interoperability between task-based models and distributed memory technologies, whilst memory is physically distributed amongst the nodes the Directory/Cache enables this to be presented to the applications programmer as a single, unified memory space. The directory tracks what data is physically stored where and the cache is used for performance to avoid frequently retrieving the same piece of remote data. The main purpose of the Directory/Cache is to provide a set of services that support task-based runtime systems efficiently running distributed applications, while being able to consistently manage data stored in distributed memory or in local caches.

We have developed a reference implementation of our Directory/Cache API which, to illustrate the abstract and generic nature of our API, has been or is being integrated with the runtimes of the OmpSs, StarPU, GPI-Space and PaRSEC task-based models. The Directory/Cache API allows runtimes to be completely independent from the physical representation of data and from the type of storage used. This facilitates access through the same interface to an extendable list of transport implementations using different communication libraries such as GASPI and MPI and even on-disk storage or tiered memory.

In this talk we will describe the main concepts behind our API and the underlying architecture of the reference implementation. We will also present the results of integrating the Directory/Cache with popular task-based models and specifically the performance and scalability that this affords to real-world applications utilising this technology.

Adapting CASTEP for the Exascale age: Hybrid OpenMP and vectorisation
Arjen Tamerus, Ed Higgins, Phil Hasnip

CASTEP [1] is a high-performance density functional theory code, used to simulate the chemical, electronic and mechanical properties of materials using a plane-wave basis set. Its use accounts for a significant percentage of the total compute cycles of Tier-1 and Tier-2 HPC facilities in the UK. Its parallel performance is achieved through an MPI and OpenMP based parallel implementation, which is efficient on current architectures. In this talk we present the work undertaken to prepare CASTEP for the Exascale generation, through the optimisation of the hybrid OpenMP-MPI parallel mode. This work also improves the intra-node scaling, and is achieved through optimisations to internal CASTEP routines, better use of threaded libraries and run-time optimisations. We will reflect on the challenges faced and how they affect performance and memory scalability.

With the limit of sequential performance clearly in sight and the move to wide vectors and SIMD-inspired accelerators to achieve FLOP targets, CASTEP has to adapt to make optimal use of these new technologies. We will discuss the in-progress work of improving the vectorisation capabilities of the code and improving memory management, benefitting highly parallel and heavily vectorised architectures like the Xeon Phi platform, as well as modern x86 CPUs like Intel’s Skylake-X and upcoming platforms.

OpenACC accelerator for the PN –PN-2 algorithm in Nek5000
Evelyn Otero, Jing Gong, Misun Min, Paul Fischer, Philipp Schlatter and Erwin Laure

Nek5000 is an open-source code for the simulation of incompressible flows. Nek5000 is widely used in a broad range of applications, including the study of thermal hydraulics in nuclear reactor cores, the modeling of ocean currents, and the study of stability, transition and turbulence on airplane wings. Exascale HPC architectures are increasingly prevalent in the Top500 list, with CPU based nodes enhanced by accelerators or co-processors optimized for floating-point calculations. We have previously presented a serial case studies of partially porting to parallel GPU-accelerated systems for Nek5000/Nekbone, see [1–3]. In this paper, we expand our previously developed work and take advantage of the optimized results to port the full version of Nek5000 to GPU-accelerated systems, especially regarding the PN –PN-2 algorithm. This latter algorithm is a way to de-couple the momentum from the pressure equations that does not lead to spurious pressure modes. It is more efficient than other methods, but it involves different approximation spaces for velocity (order N) and pressure (order N-2). The paper focuses on the technology watch of heterogeneous modelling and its impact on the exascale architectures (e.g. GPU accelerators system). In fact GPU accelerators can strongly speed up the most consuming parts of the code, running efficiently in parallel on thousands of cores. The goal of this work is to investigate if the PN –PN-2 algorithm can take advantage of hybrid architectures and be used in Nek5000 to improve its scalability to exascale. In this talk, we describe the GPU implementation of PN –PN-2 algorithm in Nek5000, namely:

  • The use of GPU-direct to communicate directly between GPU memory spaces without
    involving the CPU memory. For this work, we use an OpenACC accelerated version of
    Nek5000 which is already implemented in the MPI communication library gs [3].
  • The initial profiling and assessment of suitability of the code for the most time consuming
  • The implementation of the OpenACC version for the multigrid solver.

In addition we present the initial performance results of the OpenACC version of PN –PN-algorithm for a typical production problem. Finally we discuss the experience and the challenges we faced during this work.

First steps in porting the LFRic Weather and Climate model to the FPGAs of the EuroExa architecture
Mike Ashworth,Graham Riley, and Andrew Attwood

The EuroExa project proposes a High Performance Computing (HPC) architecture which is both scalable to exascale performance levels and delivers world-leading power efficiency. This is achieved through the use of low-power ARM processors together with closely coupled FPGA programmable components. In order to demonstrate the efficacy of the design, the EuroExa partners support a rich set of applications.

One such application is the new weather and climate model, LFRic (named in honour of Lewis Fry Richardson), which is being developed by the Met Office and its partners for operational deployment in the middle of the next decade. High quality forecasting of the weather on global, regional and local scales is of great importance to a wide range of human activities and exploitation of latest developments in HPC has always been of critical importance to the weather forecasting community.

The first EuroExa system is being built now and is due for first application access in mid2018. In order to prepare for this we have been porting the LFRic model to a Zynq UltraScale+ ZCU102 Evaluation Platform. The initial approach is to study the LFRic code at three levels: the full application (Fortran), compact applications or “mini-apps” and key computational kernels. An example of a kernel is the matrix-vector product which contributes significantly to the execution time in the Helmholtz solver and elsewhere. Our first steps have been to evaluate the performance on the ARM quad-core CPU A53, to use Vivado HLS to generate IP blocks to run on the UltraScale+ FPGA.

The matrix-vector updates have been extracted into a kernel test program and converted to C. There are dependencies between some of the updates across the horizontal mesh and a colouring scheme is used in LFRic, such that nodes within a single ‘colour’ can be computed simultaneously. This is used to produce independent computations for multi-threading with OpenMP and can be exploited for the FPGA acceleration as well. As with all accelerator based solutions, a key optimization strategy is to minimize the overhead of transferring data between the CPU and the FPGA. We shall discuss how we have approached this for the LFRic code.

The Vivado HLS and Design Suite is only one programming model for porting applications onto the FPGA. In early 2018 we shall be comparing this approach with the OmpSs@FPGA system and the Maxeler MaxJ compiler, and by the time of the workshop will be in a good position to be able to present comparisons of the performance achieved so far, ease-of-use, robustness and maturity of the tools.

An MPI-CUDA parallel direct FFT-based solver for large
scale simulations of coastal flows
Guillermo Oyarzun, Georgios A. Leftheriotis, Iason A. Chalmoukis and Athanassios A. Dimas

The simulation of coastal phenomena involves the numerical resolution of three-dimensional turbulent flows induced in the coastal zone, and mainly in the surf zone, by wave propagation (oblique to the shore), refraction, breaking and dissipation. For sake of accuracy, the mesh discretization must consist of millions of cells in order to capture all the physical phenomena. In such CFD simulations, the Poisson equation arises from the divergence constraint and must be solved once per time integration step. The solution of this implicit linear system is usually one of the most time-consuming and difficult to parallelize parts of scientific simulation codes. In flows with periodic boundary conditions the Fourier diagonalization is the best choice for its solution [1]. In a previous work we have focused our attention in the parallelization of a FFT solver in multicore-CPUs [2]. Currently, the hybridization of the supercomputer systems is a consolidated reality. However, the efficient implementation in hybrid CPU/GPU supercomputers it is not trivial. The stream processing model of the GPUs requires of massive parallelism within the application for achieving enough occupancy to hide the memory latency. Consequently, data rearrangements are needed for exploiting the coalesced memory accesses. Moreover, the MPI communication episodes are costly, because in hybrid systems each communication involves an additional cost of transferring data between CPU and GPU through the PCI-express. With this scenario in mind, we propose a direct FFT solver capable of engaging multiple CPU/GPU nodes. Our application context is the simulation of 3D coastal flows with periodic boundary conditions in two directions. The FFT is used to transform the original 3D system in a set of 1D subsystems. The data redundancies necessary for the stream processing are obtained by using the Schur decomposition method. The main idea relies in decomposing the 1D systems in groups of 3 × 3 independent linear systems, and reordering them in such a way that repeated coefficients can be reused by the threads within the same block. In addition, MPI communication costs are reduced because i) using the Schur decomposition we can avoid some of the MPI ALL TO ALL calls, and ii) we propose an overlapping strategy based in [3] that allows us to perform simultaneously MPI communications and computations.

NEXTGenSim: a workflow and hardware-aware workload simulator for HPC systems.
E. Farsarakis, N. Johnson

Efficient scheduling in any HPC system can make the difference between efficient and poor use of resources. Having the scheduler aware of the hardware configuration of nodes may offer opportunities to schedule more efficiently.

Previous workload simulators have focused predominantly on the study of the effect of different scheduling algorithms such as First Come First Served, or First Fit. As part of the NEXTGenIO project* we have developed NEXTGenSim, discrete event hardware and workflow aware scheduling simulator which estimates execution time of applications based on specific hardware characteristics such as NVRAM.

Our work is focused on the unique nature of jobs that form workflows, i.e. that have a data dependence on each other and must follow a specific order, and how changes in scheduling policy can benefit throughput of a system, with special consideration for novel HPC systems incorporating technologies such as NVRAM. Using NVRAM in a persistent storage state allows the scheduling of workflow jobs to nodes which are used in full or in part across the whole workflow. For example, if job A is followed by job B, it makes sense to run job B on the same nodes as job A. If persistent storage (via NVRAM) is available, then rather than moving results to a Lustre of GPFS filesystem, data can remain local to nodes between jobs, reducing latency in beginning job B. Should the nodes of A not be available when B is ready to execute, the data must be moved to relevant nodes or the job delayed.

By modelling the time to move data to and from Lustre/GPFS and the time saved by using persistent storage via NVRAM, we can experiment with different strategies for persisting intermediate products from workflow jobs. As input we can use both synthetic workloads and anonymised workloads from real HPC systems. For example, we can experiment with a strategy of only allowing job B to be run on the nodes of A, regardless of the delay time between them and how this impacts the overall completion of a workflow, or we can see how increasing or decreasing the time taken to move data to and from Lustre or GPFS changes the overall throughput of a system, and how using NVRAM storage when possible benefits this.

Extra-P: Automatic Empirical Performance Modeling of Parallel Programs
Alexandru Calotoiu, Torsten Hoefler, Sergei Shudler, and Felix Wolf

Once a program has been parallelized, its performance usually remains far from optimal. Too difficult is the process of performance optimization, which needs to consider the complex interplay between the algorithm and the hardware. Many parallel applications also suffer from latent performance limitations that may prevent them from scaling to larger problem or machine sizes. Often, such scalability bugs manifest themselves only when an attempt to scale the code is actually being made a point where remediation can already be difficult. Performance models allow such issues to be predicted before they become relevant. A performance model is a formula that expresses a performance metric of interest such as execution time or energy consumption as a function of one or more execution parameters such as the size of the input problem or the number of processors. However, deriving such models analytically from the code is so laborious that too many application developers shy away from the effort.

In this talk, we will present Extra-P, a new performance-modeling tool. It substantially improves both coverage and speed of performance modeling and analysis. Generating an empirical performance model automatically for each part of a parallel program with respect to the variation of one or more relevant parameters such as process count or problem size, it becomes possible to easily identify those parts that will reduce performance at larger core counts or when solving a bigger problem. Specialized heuristics traverse the search space rapidly and generate insightful performance models. We will discuss case studies with large-scale applications in which we uncover both previously known and unknown performance bottlenecks. As a specific example, we will show how Extra-P can support co-design for task-based programming.

Task-based programming offers an elegant way to express units of computation and the dependencies among them, making it easier to distribute the computational load evenly across multiple cores. Unfortunately, finding a good match between input size and core count usually requires significant manual experimentation. Using Extra-P we can find the isoefficiency function of a task-based program, which binds efficiency, core count, and the input size in one analytical expression.

Wavelet-Based Compression Algorithm
Patrick Vogler, Ulrich Rist

The steady increase of available computer resources has enabled engineers and scientists to use progressively more complex models to simulate a myriad of fluid flow problems. Yet, whereas modern high performance computers (HPC) have seen a steady growth in computing power, the same trend has not been mirrored by a significant gain in data transfer rates. Current systems are capable of producing and processing high amounts of data quickly, while the overall performance is oftentimes hampered by how fast a system can transfer and store the computed data. Considering that CFD researchers invariably seek to study simulations with increasingly higher spatial and temporal resolution, the imminent move to exascale computing will consequently only exacerbate this problem [1]. Using the otherwise wasted compute cycles to create a more compact form of a numerical dataset, one could alleviate the I/O bottleneck by exploiting it’s inherent statistical redundancies. Since effective data storage is a pervasive problem in information technology, much effort has already been spent on adapting existing compression algorithms for floating-point arrays.

In this context, Loddoch and Schmalzl [1] have extended the Joint Photographic Experts Group (JPEG) standard for volumetric floating-point arrays by applying the one-dimensional real-to-real discrete cosine transform (DCT) along the axis of each spatial dimension, using a variable-length code to encode the resulting DCT coefficients. Lindstrom [2], on the other hand, uses a lifting based integer-to-integer implementation of the discrete cosine transform, followed by an embedded block coding algorithm based on group testing. While these compression algorithms are simple and efficient in exploiting the low frequency nature of most numerical datasets, their major disadvantage lies in the non-locality of the basis functions of the discrete cosine transform. Thus, if a DCT coefficient is quantized, the effect of a lossy compression stage will be felt throughout the entire flow field [3]. To alleviate this, the numerical field is typically divided into small blocks and the discrete cosine transform is applied to each block one at a time. While partitioning the flow field also facilitates random-access read and write operations, this approach gives rise to block boundary artifacts which are synonymous with the JPEG compression standard.

In order to circumvent this problem we propose to adapt the JPEG-2000 (JP2) compression standard for volumetric floating-point arrays. In contrast to the baseline JPEG standard, JPEG-2000 employs a lifting-based one-dimensional discrete wavelet transform (DWT) that can be performed by either the reversible LeGall-(5,3) taps filter for lossless or the non reversible Daubechies-(9,7) tabs filter for lossy coding [3]. Due to its time-frequency representation, which identifies the time or location at which various frequencies are present in the original signal, the discrete wavelet transform allows for the entire frame to be decorrelated concurrently. This eliminates blocking artifacts at high compression ratios commonly associated with the JPEG standard. We therefore demonstrate the viability of a wavelet-based compression scheme for large-scale numerical datasets.

Controllable data precision in the Met Office Unified Model
Richard Gilham, and Paul Selwood

The Unified Model (UM) is a million-line, 30-year-old, but still very actively developed, Fortran codebase that is at the heart of the Met Office’s weather forecasts and climate research. Despite its age, the UM is a cutting-edge model scientifically, and is successfully used for research and operations at scales from a desktop computer, to several hundred compute nodes on very large supercomputers.

The huge socio-economic benefits of accurate forecasts mean that there is a constant push to improve the compute efficiency of the UM to free up resources for further improvements. For example, efforts to improve both shared and distributed memory parallelism have allowed significant increases in model resolution and ensemble size. The subject of this talk, however, is on exploring the possibility of reducing the precision of the model from double to single precision, and potentially beyond[1]. Indications from comparable codes show a possible 40% saving in compute. Moreover, trends in future HPC architectures indicate that single precision capability would be highly advantageous for portability as well as performance. Single precision compute in the UM’s numerical solver algorithms have already been proven and used operationally, providing significant compute savings.

In a proof-of-concept project, a specific scientific section of the UM was targeted and made ‘precision-aware’. A bottom-up approach ensured that the project was of tractable magnitude, and that the benefits may be readily ‘pulled-through’ from research to operations. The hardest challenges in the project were working within the conservative working practices for an operational model, and overcoming subtle but frustrating technical debt. As well as demonstrating the technical feasibility of the approach, a saving of approximately 5% on total model runtime was realised for negligible scientific impact. Savings within the targeted section were around 40%, in line with expectations. Future work would look to extend this methodology to more scientific sections of the model.

Running Distributed Computations Using the SpiNNaker platform
Alan Barry Stokes, Andrew Rowley, Christian Brenninkmeijer, Donal Fellow, Andrew Gait, Oliver Rhodes, and Steve Furber

The SpiNNaker Platform, is a well known neuromorphic computing platform within the Human Brain Project, which is designed to run large scale spiking neural networks in real time. The SpiNNaker platform consists of up to a million low power ARM processors (each of which run at 200MHz), and therefore for applications to make the most of the platform, they require to be highly parallel in nature. Neural networks and the shallow water equations are a perfect sample of these types of applications, as each individual neuron/cell can be evaluated in parallel. The 1 million core machine when running 100% is estimated to use 100 KW, and therefore is significantly cheaper to run than most traditional HPC clusters of relative scale. Such infrastructures could satisfy the issue of power consumption of exascale machines.

Due to the complexity of using the SpiNNaker platform, we have developed a software stack that maps applications described as a graph (where vertices represent computation and edges represent communication of data between vertices) onto the SpiNNaker platform whilst also managing the runtime execution and data extraction process for the applications. we believe that by representing the application problem as a graph, or equivalent format, provides more potential than sequentially written code bases for detecting areas of parallelism. Communication between vertices in SpiNNaker is executed through small data packets (32 or 64 bit packet sizes) which are multicasted throughout the network through routers on each chip. We have not yet implemented a MPI or Open MP interface, but instead support an asynchronous interrupt based API where cores are informed when a packet has been received. This does not mean MPI is not possible with SPiNNaker, the communications network lends itself to an MPI implementation with the restriction that the network packets are small (but we provide a 1 to 1 message format for larger data packets,). By splitting an application into such small components that utilises the 1 million core machine, it can potentially reduce the communications requirement by each core into manageable sizes for the SpiNNaker communication infrastructure.

We believe there are other sets of applications, apart from neural networks and shallow water equations, that can be easily represented by a graph and therefore can be executed upon such architectures efficiently. For such applications, the massive parallelism provided by the SpiNNaker platform can improve upon current infrastructure performances in terms of speed and energy consumption.

In this talk we will discuss the SpiNNaker architecture, software stack, and programming paradigm and try to relate lessons we have learnt whilst using the SpiNNaker architecture to challenges in reaching exascale machines and why applications can see improvements when written/rewritten, if possible, to a graph representation.

Performance Tracing of Heterogeneous Exascale Applications: Pitfalls and Opportunities
Holger Brunst, Christian Herold, Matthias Weber

Future Exascale systems are expected to introduce an unprecedented degree of heterogeneity as pointed out by DOE’s ASCR Program Manager Dr. Lucy Nowell at the recent VPA17 workshop at SC17. Dr. Nowell also stated that the induced adaption and redesign process of highly distributed algorithms and applications requires the support from the debugging and performance tools community. Because of that, we expect performance tools to be early adopters of the emerging system architectures, as they need to be one step ahead by definition. This position paper lists and discusses both tools pitfalls and opportunities arising from infrastructure and paradigm changes in the near future. It focuses on in-situ performance trace data processing, persistent memory exploitation, scripted applications and usability.

The sheer complexity of an Exascale hardware and software stack calls for a holistic performance reviewing at all layers of abstraction. Unfortunately, a non-perturbing monitoring solution with dedicated hardware is normally impracticable for economic reasons. Facultative software-based monitoring seems to be feasible but the resulting perturbation will grow with the number of monitored parameters. An iterative refinement process of a changing set of parameters seems to be most realistic, while only practical when performed in-situ in one and the same run due to long application runtimes and startup overheads. We will present how data collection, selection and visualization need to be rethought for the Exascale.

Node-local large persistent memory similar to Cray’s DataWarp or Intel’s 3D XPoint products will be located in very close proximity to the CPUs. This will have an impact on application file data handling and scheduling. I/O usage patterns that were prohibited performance-wise in the past might all of a sudden be top-notch. Understanding I/O performance requires a thorough picture of the actions in the increasingly deep I/O stack. Again, we expect that it will not be feasible to record all relevant information at the same time, which takes us back to the pre-mentioned runtime data selection. We will present new ways for in-depth I/O tracing and visualization designed for Exascale.

Scripted application workflows in python and alike are likely to gain further importance and are not easy to study with traditional performance tools due to their complex virtual runtime environment. Scripting approaches enable new science communities without traditional HPC background to enter the computing domain, which reveals tool usability issues to be discussed in the Exascale context. Our usability analysis is backed by customer feedback on the Vampir performance visualizer, which we develop in-house for many years.

In-flight ensemble processing for exascale
Jeff Cole, Bryan Lawrence, Grenville Lister, Yann Meursdesoif, Rupert Nash, and Michèle Weiland

Weather and climate science make heavy use of ensembles of model simulations to provide estimation of uncertainty arising from a range of causes. Current practice is to write each ensemble member (simulation) out to disk as it is running, and carry out an ensemble analysis at the end of the simulation. Such analysis will include simple statistics, as well as detailed analysis of some ensemble members. However, as model resolutions increase (with more data per simulation), and ensemble sizes increase (more instances), the storage and analysis of this data is becoming prohibitively expensive — many major weather and climate sites are looking at managing in excess of an exabyte of data within the next few years. This become problematic for an environment where we anticipate running such ensembles on exascale machines which may not themselves include local storage of sufficient size where data can be resident for long periods of analysis.

There are only two possible strategies to cope with this data deluge – data compression (including “thinning”, that is the removal of data from the output) and in-flight analysis. We discuss here some first steps with the latter approach. We exploit the XML IO server (XIOS) to manage the output from simulations and to carry out some initial analysis en-route to storage. We have achieved three specific ambitions: (1) We have adapted a current branch of the Met Office Unified Model to replace most of the diagnostic system with the XIOS. (2) We have exploited a single executable MPI environment to run multiple UM instances with output sent to XIOS, and (3) We have demonstrated that simple ensemble statistics can be calculated inflight, including both summary statistics of individual ensemble members, and cross-member statistics such as means and extremes.

With this ability, we can in principle avoid having all data needing to reside on fast disk when the ensemble simulation is complete. This would allow, for example, deployment on an exascale machine with burst-buffer migrating data directly to tape (or to the wide area network). There are some issues yet to be resolved. In particular, we need to manage the MPI context to explicitly mange errors propagating up from an ensemble member (an errant ensemble member could otherwise halt the execution), and we need to consider how to bring third party data into the XIOS context so that non-linear comparisons can be calculated and meaned at run time. Neither of these are expected to be very difficult, but they will involve further engineering. In the longer-term, in-flight analysis will have to address some sort of steering where not all ensemble members are output for the entire duration of the simulation, but even this interim method will help with data management. It will be possible to identify “interesting” ensemble members from summary statistics, and keep them online for more detailed analysis, while less (initially) interesting ensemble members can be more rapidly migrated to colder storage for later analysis.