Morning Parallel Session A
- Reducing the number of load rebalancings and increasing concurrency in mesh-based dynamically adaptive simulations
- Accelerating simulations of cerebrovascular blood flow through parallelization in time
Morning Parallel Session B
- High Performance Combinatorial Search with YewPar
- An Evaluation of the TensorFlow Programming Model for Solving Traditional HPC Problems
Afternoon Parallel Session A
- Leveraging hierarchical memories for micro-core architectures
- Developing An Extensible, Portable, Scalable Toolkit for Projection Based Particle Methods
- SMURFF: a High-Performance Framework for Matrix Factorization
Afternoon Parallel Session B
- Using modular supercomputing to propel Particle-in-Cell methods to exascale
- A portable runtime system approach to engineering design at exascale using the Uintah software
- Mitigating performance degradation of frequently synchronised applications in presence of coarse grain jitter
Reducing the number of load rebalancings and increasing concurrency in mesh-based dynamically adaptive simulations
D.E. Charrier, B. Hazelwood, T. Weinzierl
Accelerating simulations of cerebrovascular blood flow through parallelization in time
Rupert Nash, David Scott, Daniel Ruprecht and Derek Groen
Simulating blood flow in networks of human arteries (“computational hemodynamics”) allows us to better understand how flow properties in these arteries relate to the occurrence and progression of major cardiovascular diseases. The HemeLB Lattice Boltzmann Method (LBM) simulation environment has a long history of successful use in computational hemodynamics. However, for realistic, patient-specific geometries like the Circle of Willis, HemeLB scales only up to 12-25k cores on ARCHER, resulting in simulation times of several days. Parallel-in-time integration methods like the widely used Parareal algorithm can be used to unlock concurrency in the time dimension in order to extend parallel scaling, escape the “traps of strong and weak scaling” and reduce simulation times. While Parareal has been used successfully in combination with LBM codes, performance has so far only been explored for relatively simple benchmark problems. We will report on the ongoing effort of integrating an event-based variant of Parareal into HemeLB with the aim of extending parallel scaling on ARCHER and bringing down simulation times for geometries like the Circle of Willis to a day or less. Specific challenges of combining parallel-in-time integration with LBM will be illustrated and HemeLB-specific solutions presented. First performance tests on ARCHER will be shown.
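The Parareal iteration can be sketched on a toy problem. The sketch below is an illustration only: the scalar ODE du/dt = -u stands in for the LBM dynamics, and the propagator and function names are ours, not HemeLB's. The fine propagations of all time slices are independent, which is where the parallelism in time comes from.

```python
def fine(u, t0, t1, steps=100):
    """Fine propagator: many small forward-Euler steps for du/dt = -u."""
    dt = (t1 - t0) / steps
    for _ in range(steps):
        u = u + dt * (-u)
    return u

def coarse(u, t0, t1):
    """Coarse propagator: a single, cheap forward-Euler step."""
    return u + (t1 - t0) * (-u)

def parareal(u0, T, slices, iters):
    """Parareal: iteratively correct a coarse serial sweep with
    fine propagations that can run in parallel across time slices."""
    ts = [T * n / slices for n in range(slices + 1)]
    # Initial guess from a serial coarse sweep.
    u = [u0]
    for n in range(slices):
        u.append(coarse(u[n], ts[n], ts[n + 1]))
    for _ in range(iters):
        # Independent per-slice propagations: parallel in time.
        f = [fine(u[n], ts[n], ts[n + 1]) for n in range(slices)]
        g = [coarse(u[n], ts[n], ts[n + 1]) for n in range(slices)]
        new = [u0]
        for n in range(slices):
            # Correction: coarse prediction plus fine-minus-coarse defect.
            new.append(coarse(new[n], ts[n], ts[n + 1]) + f[n] - g[n])
        u = new
    return u[-1]
```

After `slices` iterations the result coincides with a serial fine integration; the payoff is that far fewer iterations usually suffice.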
High Performance Combinatorial Search with YewPar
Blair Archibald, Patrick Maier, Robert Stewart and Phil Trinder
We propose a novel HPC application domain, namely combinatorial search. Combinatorial search plays an important role in many application areas including industrial scheduling and allocation, artificial intelligence, computational biology and computational algebra. At the heart of (exact) combinatorial search are backtracking search algorithms that systematically, and dynamically, explore a search space to find objects of interest. Due to the nature of these algorithms, the amount of computation can grow exponentially with the input size. Parallelism is one way to address this exponential complexity and HPC machines, if used effectively, may enable the solving of important problem instances that are currently out of reach.
A task-parallel backtracking search algorithm splits the search tree into multiple subtrees or subtasks and speculatively explores each one in parallel. While seemingly simple, this approach poses many challenges. Subtrees vary hugely in size, often by orders of magnitude; searches typically follow specific search order heuristics that must be preserved as far as possible; and asynchronous knowledge sharing between tasks dynamically affects the set of active tasks, e.g. by pruning unnecessary tasks. Combinatorial search differs from traditional HPC workloads in the use of speculative parallelism, high degree of irregularity, asynchronous knowledge sharing, and the lack of floating point computation. However, in many respects it is well suited to HPC architectures: search problems are compute-heavy, perform little I/O, have few global synchronisation points, and tend to benefit from integer/bitset vectorisation.
Currently, most parallel combinatorial search is limited to multi-cores. However, given the need to tackle ever larger problems, it is essential to enable the use of HPC resources for this domain, ideally without requiring HPC expertise of the user. To this end we have developed YewPar, a framework for parallel combinatorial search, designed with HPC in mind. YewPar exposes high level parallel patterns (algorithmic skeletons) for users to specify search algorithms, while the framework manages all parallel coordination implicitly. This approach allows search experts to take advantage of multiple parallel architectures (multi-core, cluster, cloud, HPC) without detailed parallelism or architecture knowledge. YewPar is built on HPX, a standards-compliant C++ library designed for exascale systems that leverages asynchronous task parallelism and a global address space. We have previously demonstrated the generality of the pattern approach on four different combinatorial problems: Maximum Clique, k-Clique, Travelling Salesperson and Knapsack. Recent results for Maximum Clique show YewPar achieving an average speedup of 170 on 255 worker cores. Due to the high degree of irregularity, linear scaling cannot be expected. Combinatorial search has many important and compute-heavy applications. With YewPar we have begun to scale combinatorial search to HPC and hope that in the future, with exascale machines, we will be able to solve a wide range of problems that are currently out of reach.
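As an illustration of the kind of backtracking search YewPar parallelises, the sketch below shows a sequential branch-and-bound Maximum Clique search in Python. This is not the YewPar API: in YewPar each recursive call is a candidate subtask, and the incumbent (here a shared variable) would be shared asynchronously between workers so that other tasks can prune against it.

```python
def max_clique(adj):
    """Branch-and-bound maximum clique.
    adj: dict mapping each vertex to the set of its neighbours."""
    best = []  # incumbent; shared asynchronously between tasks in YewPar

    def expand(clique, candidates):
        nonlocal best
        if len(clique) > len(best):
            best = clique[:]  # knowledge sharing: improves the global bound
        for v in list(candidates):
            # Bound: prune subtrees that cannot beat the incumbent.
            if len(clique) + len(candidates) <= len(best):
                return
            # Each recursive call is an independent subtree: a candidate task.
            expand(clique + [v], candidates & adj[v])
            candidates = candidates - {v}

    expand([], set(adj))
    return best
```

Subtree sizes here vary wildly depending on how early pruning fires, which is exactly the irregularity the abstract describes.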
An Evaluation of the TensorFlow Programming Model for Solving Traditional HPC Problems
Jun Zhang, Stefano Markidis, Steven Wei Der Chien, Ivy Bo Peng and Erwin Laure
Deep-learning applications, such as pattern recognition, computer vision, speech recognition and natural language processing, are emerging on HPC systems. These applications use deep learning, a branch of machine learning, to determine the weights of artificial neural network nodes by minimizing a loss function. Such applications require the calculation of dense matrix-matrix and matrix-vector multiplications, also called tensorial operations.
In the last decade, the use of Graphics Processing Units (GPUs) has considerably sped up deep-learning computations, leading to a renaissance of artificial neural networks. Recently, the Nvidia Volta GPU and the Google Tensor Processing Unit (TPU) have been specially designed to support deep-learning workloads. Not only has hardware evolved quickly to support deep learning, but new programming models have also emerged for conveniently expressing tensorial operations and deep-learning computational paradigms. An example of such a new programming approach is TensorFlow, an open-source deep-learning library released by Google in 2015.
TensorFlow expresses algorithms as a computational graph where the nodes represent operations and the edges between nodes represent the data flow. The multi-dimensional variables, such as vectors and matrices, flowing from one operation to another are called tensors. For this reason, any algorithm in TensorFlow needs to be expressed as a computational graph. Execution of the computational-graph nodes is automatically assigned to different devices, such as GPUs and CPUs on different computing nodes, and communication is based on RDMA, using lossy compression (32-bit floating-point values are truncated to 16-bit representations).
The goal of this work is to evaluate the usability and expressiveness of the TensorFlow programming model for solving traditional HPC problems, such as the solution of a linear system with a Conjugate Gradient (CG) solver. In this work, we first express the CG solver as a computational graph and execute it on a distributed system with GPUs. We evaluate the difficulty of expressing traditional HPC algorithms as computational graphs and study the scalability of TensorFlow when solving large linear systems on accelerated systems. Finally, we quantify the impact of the loss of precision when using lossy communication. This study provides an initial investigation of new emerging programming models for HPC.
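For reference, the CG iteration being mapped onto a TensorFlow computational graph can be sketched in plain NumPy. Each line corresponds to a tensor operation (matrix-vector product, dot product, scaled addition) that a framework such as TensorFlow can place as a graph node on a CPU or GPU; the TensorFlow version in the paper would express the same arithmetic as graph operations.

```python
import numpy as np

def cg(A, b, iters=50, tol=1e-10):
    """Plain conjugate gradients for a symmetric positive-definite A.
    Every statement is a tensor operation a dataflow graph could schedule."""
    x = np.zeros_like(b)
    r = b - A @ x          # initial residual
    p = r.copy()           # initial search direction
    rs = r @ r
    if rs < tol:
        return x
    for _ in range(iters):
        Ap = A @ p
        alpha = rs / (p @ Ap)      # step length along p
        x = x + alpha * p
        r = r - alpha * Ap
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs) * p  # conjugate new search direction
        rs = rs_new
    return x
```

Lossy 16-bit communication, as described above, would perturb the residual and search-direction tensors exchanged between devices, which is the precision effect the study quantifies.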
Leveraging hierarchical memories for micro-core architectures
Nick Brown, Maurice Jamieson
Micro-core architectures combine many simple, low power and low on-chip memory cores onto a single processor package. The low power nature of these architectures means that there is potential for their use in future HPC and embedded systems, and their low cost makes them ideal for education and prototyping. However, there is a high barrier to entry in programming them due to the considerable complexity and immaturity of supporting tools. ePython is a Python virtual machine we have developed for the 16-core Epiphany III micro-core architecture, which fits in the 32KB of memory per core. Alongside it we developed an abstraction that supports seamlessly offloading functions in existing Python codes, running on a host CPU, to the micro-cores. In earlier work we introduced this abstraction and motivated it with a machine learning code for detecting lung cancer in 3D CT scans, where kernels for model training and inference ran in parallel on the micro-cores. However, the small amount of core memory severely limited the physical size of the images, which had to be interpolated to fit.
In addition to the small on-core memory, there is typically much larger, slower external memory. In order to take full advantage of these architectures one must leverage these hierarchies of memory, but a key question is how best to achieve this whilst maintaining good performance.
In this work we have addressed this challenge by splitting the memory abstraction into three choices:
- A mirroring of memory, where copies of external memory also exist on the micro-cores. Data must be copied manually between the different memory levels.
- Memory can be exposed from a specific level in the hierarchy to the micro-cores without an explicit copy being allocated. Reads and writes directly access this external memory, with accesses being either blocking or non-blocking (using the DMA engines). Abstractions around the non-blocking approach enable the programmer to leverage patterns such as double buffering and data streaming, overlapping compute and memory access for performance.
- By default, memory belongs to the hierarchical level where it is first declared. It is possible to override this via memory kinds. In our approach these are Python objects that follow a standard interface and sit outside the core ePython implementation. They define the behaviour of memory accesses at the level of hierarchy they represent.
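The memory-kind idea can be illustrated with a hypothetical sketch; the class and method names below are ours, not the actual ePython interface. Objects implementing a common read/write interface define how accesses behave at one level of the hierarchy, with mirroring requiring an explicit flush back to the backing store.

```python
class MemoryKind:
    """Hypothetical standard interface a memory kind must implement."""
    def read(self, offset, size): raise NotImplementedError
    def write(self, offset, data): raise NotImplementedError

class ExternalMemory(MemoryKind):
    """Backs accesses with slow external memory (modelled as a bytearray)."""
    def __init__(self, size):
        self.store = bytearray(size)
    def read(self, offset, size):
        return bytes(self.store[offset:offset + size])
    def write(self, offset, data):
        self.store[offset:offset + len(data)] = data

class MirroredMemory(MemoryKind):
    """The mirroring option above: an on-core copy of external memory
    that must be copied back explicitly."""
    def __init__(self, backing):
        self.backing = backing
        self.local = bytearray(backing.store)  # on-core mirror
    def read(self, offset, size):
        return bytes(self.local[offset:offset + size])
    def write(self, offset, data):
        self.local[offset:offset + len(data)] = data
    def flush(self):
        # Manual copy back down the hierarchy.
        self.backing.store[:] = self.local
```

Swapping one kind for another changes placement and access behaviour without touching the kernel code, which is the experimentation the abstract describes.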
Based upon this work we are now able to run the machine learning code with full-sized images. The programmer is able to experiment with choices around memory placement and access patterns without having to worry about the low-level complexities of data movement.
Developing An Extensible, Portable, Scalable Toolkit for Massively Parallel Incompressible SPH
The stability, accuracy, energy conservation and boundary-condition treatment of projection-based particle methods such as incompressible smoothed particle hydrodynamics (ISPH) have been greatly improved. However, compared with other particle-based methods, many computational and software-engineering challenges remain when using hundreds of millions of particles or more. In this talk we focus on scalable algorithms for post-petascale particle-method simulations: low-overhead domain decomposition and dynamic load balancing for irregular particle distributions and complex geometries; flexible parallel communication algorithms to ease user scientific-software development; particle ordering for cache-based computing architectures; and sparse-matrix bandwidth reduction together with efficient sparse linear solvers, which is the additional challenge distinct to projection-based particle methods. The implementation details presented here are intended to guide future development of projection-based particle applications on novel computing architectures.
SMURFF: a High-Performance Framework for Matrix Factorization
Tom Vander Aa and Tom Ashby
Recommender Systems (RS) have become very common in recent years and are useful in various real-life applications. The most popular ones are probably suggestions for movies on Netflix and books on Amazon. However, they can also be used in more unlikely areas such as drug discovery, where a key problem is the identification of candidate molecules that affect proteins associated with diseases. In RS one has to analyze large, sparse matrices, for example those containing the known movie or book ratings. Matrix Factorization (MF) is a technique that has been used successfully here. The idea of this method is to approximate the rating matrix R as a product of two low-rank matrices U and V. Predictions can be made from the approximation U × V, which is dense.
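The factorization idea can be illustrated with a toy example. The sketch below fits R ≈ U @ V.T with simple alternating least squares rather than the Bayesian sampling that BPMF uses; the matrix sizes, rank and regularization value are illustrative assumptions, not anything from SMURFF.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy rating matrix R (users x items); 0 marks an unknown rating.
R = np.array([[5, 3, 0],
              [4, 0, 1],
              [0, 1, 5]], dtype=float)
mask = R > 0   # which entries are observed
k = 2          # latent rank
lam = 0.1      # regularization strength (illustrative)

# Alternating least squares: fix V and solve for U, then fix U and solve
# for V; BPMF instead draws U and V from posterior distributions.
U = rng.standard_normal((3, k))
V = rng.standard_normal((3, k))
for _ in range(50):
    for i in range(3):  # update each user factor from its observed ratings
        idx = mask[i]
        A = V[idx].T @ V[idx] + lam * np.eye(k)
        U[i] = np.linalg.solve(A, V[idx].T @ R[i, idx])
    for j in range(3):  # update each item factor likewise
        idx = mask[:, j]
        A = U[idx].T @ U[idx] + lam * np.eye(k)
        V[j] = np.linalg.solve(A, U[idx].T @ R[idx, j])

pred = U @ V.T  # dense: predictions exist even for the unknown entries
```

The dense product fills in the unobserved entries, which is exactly what makes the factorization usable for recommendation.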
Bayesian probabilistic matrix factorization (BPMF) is one of the most popular algorithms for matrix factorization. Thanks to the Bayesian approach, BPMF has been shown to be more robust to over-fitting and free from the need for cross-validation. Yet BPMF is more computationally intensive and thus more challenging to implement for large datasets. In this work we present SMURFF, a high-performance, feature-rich framework to compose and construct different Bayesian matrix-factorization methods based on BPMF, for example Macau or GFA. Using the SMURFF framework one can easily vary: i) the type of matrix to be factored (dense or sparse); ii) the prior distribution the model is assumed to fit (multivariate normal, spike-and-slab, and others); or iii) the noise model (Gaussian noise, probit noise or others). The framework also allows combining different matrices and thus incorporating more, and different types of, information into the model.
The SMURFF framework has been implemented in C++ using the Eigen library for matrix manipulations, OpenMP for multi-core parallelization, and MPI and GASPI for multi-node parallelization. Performance results for one of the implemented methods (BPMF) have been reported previously. The framework has been used successfully in the H2020 ExCAPE project for large-scale runs of compound-activity prediction. We were able to reduce training time for a realistic dataset from 3 months with the original R implementation to 15 hours with the SMURFF C++ implementation.
Using modular supercomputing to propel Particle-in-Cell methods to exascale
Jorge Amaya, Diego Gonzalez-Herrero, Anke Kreuzer, Estela Suarez and Giovanni Lapenta
The most efficient applications in HPC take advantage of hardware optimizations that fine-tune cache management, vectorization, I/O and multi-threading. Applications are now developed to target the potential optimizations of a single computer architecture (Intel CPUs, IBM CPUs, GPUs, FPGAs, etc.). We investigate how applications can increase their performance by transferring sub-tasks to different hardware. The DEEP-EST project proposes a new “modular architecture” that allows a single application to use different hardware components. Two modules are used in the present work to perform simulations of space plasmas: the Booster module, composed of multiple Intel Xeon Phi KNL processors, and the Cluster module, composed of multiple Intel Haswell processors.
The xPic code is a Particle-in-Cell software used for the study of astrophysical plasmas. It is composed of two main sections: a) a Maxwell solver for the evolution of electromagnetic fields, and b) a particle solver for the transport of plasma ions and electrons that flow through the ambient electromagnetic fields. For an accurate description of astrophysical plasmas, trillions of ions and electrons have to be moved in the system. We divide the code so that the particle solver runs on the Booster module, taking advantage of the parallelization potential of its simple operations, while the field solver runs on the Cluster module, performing serial, communication-intensive tasks. This Cluster-Booster division of work allows for an important increase in efficiency.
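The Cluster-Booster division of labour can be sketched schematically; this is an illustrative toy in one dimension, not the xPic code. A communication-heavy field solve (the Cluster side) feeds an embarrassingly parallel per-particle push (the Booster side) on each time step.

```python
def deposit(positions, ncells):
    """Accumulate a toy charge density from particle positions."""
    rho = [0.0] * ncells
    for x in positions:
        rho[int(x) % ncells] += 1.0
    return rho

def solve_fields(rho):
    """Cluster side: serial, globally coupled field update (a crude
    1D integration of the charge density, zero-mean)."""
    E = [0.0]
    for r in rho[:-1]:
        E.append(E[-1] + r)
    mean = sum(E) / len(E)
    return [e - mean for e in E]

def push_particles(positions, velocities, E, dt, ncells):
    """Booster side: independent update of every particle, ideal for
    many simple cores."""
    for i in range(len(positions)):
        cell = int(positions[i]) % ncells
        velocities[i] += dt * E[cell]
        positions[i] = (positions[i] + dt * velocities[i]) % ncells

def step(positions, velocities, dt, ncells):
    rho = deposit(positions, ncells)   # moments sent Booster -> Cluster
    E = solve_fields(rho)              # fields computed on the Cluster
    push_particles(positions, velocities, E, dt, ncells)  # on the Booster
```

In the real code the two halves run on different modules, exchanging fields and particle moments rather than sharing memory.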
In this work we present the DEEP-EST architecture and the performance gains obtained by using the Cluster-Booster approach. We show an analysis of the code and a projection of its performance towards exascale using the Dimemas tool from the Barcelona Supercomputing Center.
A portable runtime system approach to engineering design at exascale using the Uintah software
Martin Berzins, John Schmidt, Damodar Sahasrabudhe, Alan Humphrey, Sidharth Kumar, Brad Peterson, Zhang Yang
Exascale computing poses many challenges, including having suitable problems to run at such scales and having software that solves those problems in a way that makes it possible to move quickly to the new low-power exascale designs with a minimum of code rewriting, while still being able both to generate suitable output and to visualize it. These challenges are being addressed by the CCMSC Center at the University of Utah using the Uintah software, in close collaboration with GE and Sandia Laboratories. The primary motivating problem is the design of a 1000 MWe GE coal boiler, as shown in Figure 1. Modeling the turbulent combustion in such a boiler in detail requires discretizing a structure of about 6000 m3 at mm scale, giving a grid with 6×10^12 cells and 6×10^13 turbulent combustion variables. In addition, a low-Mach-number approximation requires the solution of a system of 6×10^12 equations per timestep, and since the primary heat-transfer mechanism is radiation, everything is globally coupled.
The Utah Uintah software solves a task-based formulation of this problem at petascale using its Arches component. The tasks that specify the problem are executed in an asynchronous, out-of-order manner by Uintah's runtime system, which allows adaptive scalability on present architectures. A ray-tracing approach to radiation scales to the whole of the DOE Titan architecture using both CPUs and GPUs. Linear solves use preconditioned CG from the hypre library. Strong-scaling results are shown in Table 1, and weak-scaling results also exist.
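The asynchronous, out-of-order task execution at the heart of such a runtime can be illustrated with a minimal sketch (hypothetical, not the Uintah API): any task whose data dependencies are satisfied may run next, regardless of the order in which the tasks were specified.

```python
import random

def run_dag(tasks, deps):
    """Execute a task graph out of order.
    tasks: dict name -> callable; deps: dict name -> set of prerequisites."""
    done, order = set(), []
    pending = set(tasks)
    while pending:
        # Any task whose inputs are ready is eligible to run.
        ready = [t for t in pending if deps[t] <= done]
        t = random.choice(ready)  # asynchronous: no fixed program order
        tasks[t]()
        done.add(t)
        order.append(t)
        pending.remove(t)
    return order
```

Because independent tasks can run in any order (and, in the real runtime, concurrently on CPUs or GPUs), load imbalance and communication delays are hidden rather than serialized.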
In addressing the challenges of future exascale architectures such as the Argonne A21 architecture, for which we are an early user, it is important to achieve both portability and performance. The approach that we have taken is to evolve the runtime system to take advantage of very different architectures such as those based on GPUs, Intel KNLs and the Sunway TaihuLight. Examples of performance on these architectures will be shown. The second step is to strive for portable performance using portability libraries such as Kokkos from Sandia Labs. This involves extending the Uintah programming model so that the user writes Kokkos loops. The improvements in performance will be shown and the implications for using exascale machines like A21 described.
Mitigating performance degradation of frequently synchronised applications in presence of coarse grain jitter
Gladys Utrera and Jesus Labarta
Operating system (OS) noise exists on most computing platforms, and its effect on HPC systems can be devastating, especially for frequently synchronized applications. For this reason, OS noise is currently the objective of much research, including characterization, detection and mitigation techniques. Moreover, some authors have pointed out future sources of noise in extreme-scale systems due, for example, to fault-tolerance mechanisms (i.e. checkpoint/restart), among others. As part of the efforts to reduce OS noise, proposals ranging from the design of lightweight kernels to the use of non-blocking collectives have been analysed. Collective operations are the basis of many HPC applications and are frequently used in a regular manner in iterative processing. These operations are especially sensitive to OS noise: the delay caused by the CPU cycles stolen from one task is amplified by the collective operation, causing a load-imbalance effect on the application. Analysing this problem has been specifically recommended for the future design of runtime systems.
In this work, we propose a mechanism to take advantage of the idle cycles at the collective operation generated by the delay of another process executing OS activities. To that aim, we use these CPU cycles to make progress in the execution of the application by migrating the affected task to the CPU owned by the task that first arrived at the collective operation. Task migration is considered only within a node. The increasing availability of multicore systems, even more so in exascale systems, makes the approach feasible. In addition, shared memory within a node considerably reduces any memory-access penalty due to task migration (shared last-level cache).
The question is how the first task that arrives at the synchronization point on each node knows that there is a delay and which task is causing it. To that end, we study two approaches: 1) runtime detection; 2) prediction. The first approach is based on measuring CPU cycles per time unit using hardware counters and comparing them against an initial measurement. If the ratio is below a threshold, we conclude that CPU cycles were stolen. The second approach is based on the observation that OS activities are regular and follow a pattern; consequently, these interruptions can be predicted. The predictions for each activity can be made simply by keeping tables at each CPU with information about the main daemons (last execution time, duration and interval of execution), or with a more sophisticated technique that uses artificial intelligence.
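The simple table-based predictor described above can be sketched as follows; the table layout and field names are our assumptions, not the authors' implementation. Each CPU keeps, per daemon, its last start time, duration and execution interval, and extrapolates the next interruption.

```python
def predict_next(table, now):
    """Predict the earliest upcoming interruption.
    table: dict daemon -> (last_start, duration, interval), all in seconds.
    Returns (daemon, predicted_start, duration)."""
    best = None
    for daemon, (last, dur, interval) in table.items():
        nxt = last + interval
        while nxt < now:           # roll forward to the first occurrence
            nxt += interval        # at or after `now`
        if best is None or nxt < best[1]:
            best = (daemon, nxt, dur)
    return best
```

A task arriving first at a collective could consult such a prediction to decide whether migrating the delayed task onto its now-idle CPU is worthwhile.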
We present in this work an evaluation of the second alternative, using a simple prediction. In order to avoid the native noise of the platform on which we make the evaluations, we use just half of the available CPUs on each node. In addition, the noise is simulated and scaled, both to show the impact of the mechanism clearly and to have full control over the noise prediction. The evaluations were made varying the frequency and duration of the noise occurrences. The duration is expressed relative to the calculation time per iteration (i.e. the time between two consecutive collective operations).
Execution results on a multicore cluster with 48 CPUs per node, running on 8 nodes and executing MPI microbenchmarks, show that with perfect noise prediction the gain in performance is about 19% when the noise duration equals one calculation phase, and up to 24% for a duration of two calculation phases. Regarding prediction error, we observe that over-prediction (performing task migration when there is no noise) can carry more penalty than under-prediction (not performing task migration every time there is noise). For example, with 50% misprediction, performance degrades by 5% when doing task migration with respect to not doing it, while doing task migration for 50% of the times there is noise may increase performance by about 6%.
Consequently, task migration for coarse-grain noise is an attractive alternative for frequently synchronized applications. In addition, it is preferable to make fewer but more accurate predictions than to over-predict noise occurrences, so the prediction mechanism needs improvement. In this sense, we are working on optimizing runtime detection mechanisms, which are costly but more accurate.