Morning Parallel Session A

Morning Parallel Session B

Afternoon Parallel Session A

Afternoon Parallel Session B

Reducing the number of load rebalancings and increasing concurrency in mesh-based dynamically adaptive simulations
D.E. Charrier, B. Hazelwood, T. Weinzierl
ExaHyPE is a H2020 project where an international consortium develops a simulation engine for hyperbolic equation system solvers based upon highly accurate ADER-DG coupled to robust Finite Volumes. The engine is used to simulate earthquakes and astrophysical events. After a brief sketch of ADER-DG, we dive into our AMR code base forging dynamically adaptive Cartesian meshes from spacetrees. While we support arbitrary dynamic adaptivity, benchmarks reveal that it is performance wisely advantageous to have some regularity in the grid, as we then can use classic optimisation techniques for regular Cartesian grids locally. Furthermore, strongly dynamic AMR forces us to rebalance the computational load between the employed MPI ranks often. Data has to be moved around. This is time consuming.
We propose to impose some regularity on the grid, i.e. to take the adaptive compute grid and then to add additional cells such that the grid becomes block structured with regular Cartesian blocks. Furthermore, we propose to introduce a host grid which is finer than the compute grid. We embed the compute grid into the host grid and perform all domain decomposition on the host grid. Only if the host grid cannot accomodate the compute grid due to strong adaptivity changes, we rebalance.

Accelerating simulations of cerebrovascular blood flow through parallelization in time
Rupert Nash, David Scott, Daniel Ruprecht and Derek Groen

Simulating blood flow in networks of human arteries (“computational hemodynamics”) allows us to better understand how flow properties in these arteries relate to the occurrence, and degradation of major cardiovascular diseases. The HemeLB Lattice Boltzmann Method (LBM) simulation environment has a long history of successful use in computational hemodynamics. However, for realistic, patient-specific geometries like the Circle of Willis, HemeLB scales only up to 12-25k cores on ARCHER, resulting in simulation times of several days. Parallel-intime integration methods like the widely used Parareal algorithm [1] can be used to unlock concurrency in the time dimension in order to extend parallel scaling, escape the “traps of strong and weak scaling” and reduce simulation times. While Parareal has been used successfully in combination with LBM codes [3], performance has so far only been explored for relatively simple benchmark problems. We will report on the ongoing effort of integrating an eventbased variant of Parareal into HemeLB with the aim extending parallel scaling on ARCHER and bringing down simulation times for geometries like the Circle of Willis to a day or less. Specific challenges of combining parallel-in-time integration with LBM will be illustrated and HemeLB-specific solutions presented. First performance tests on ARCHER will be shown.

High Performance Combinatorial Search with YewPar
Blair Archibald, Patrick Maier, Robert Stewart and Phil Trinder

We propose a novel HPC application domain, namely combinatorial search. Combinatorial search plays an important role in many application areas including industrial scheduling and allocation, artificial intelligence, computational biology and computational algebra. At the heart of (exact) combinatorial search are backtracking search algorithms that systematically, and dynamically, explore a search space to find objects of interest. Due to the nature of these algorithms, the amount of computation can grow exponentially with the input size. Parallelism is one way to address this exponential complexity and HPC machines, if used effectively, may enable the solving of important problem instances that are currently out of reach.

A task-parallel backtracking search algorithm splits the search tree into multiple subtrees or subtasks and speculatively explore each one in parallel. While seemingly simple, this approach poses many challenges. Subtrees vary hugely in size, often by orders of magnitude; searches typically follow specific search order heuristics that must be preserved as far as possible; and asynchronous knowledge sharing between tasks dynamically affects the set of active tasks, e.g. by pruning unnecessary tasks. Combinatorial search differs from traditional HPC workloads in the use of speculative parallelism, high degree of irregularity, asynchronous knowledge sharing, and the lack of floating point computation. However, in many respects it is well suited to HPC architectures: search problems are compute-heavy, perform little I/O, have few global synchronisation points, and tend to benefit from integer/bitset vectorisation.

Currently, most parallel combinatorial search is limited to multi-cores. However, given the need to tackle ever larger problems, it is essential to enable the use of HPC resources for this domain, ideally without requiring HPC expertise of the user. To this end we have developed YewPar1, a framework for parallel combinatorial search, designed with HPC in mind. YewPar exposes high level parallel patterns (algorithmic skeletons) for users to specify search algorithms, while the framework manages all parallel coordination implicitly. This approach allows search experts to take advantage of multiple parallel architectures (multi-core, cluster, cloud, HPC) without detailed parallelism or architecture knowledge. YewPar is built on HPX [3], a standards compliant C++ library designed for exascale systems, that leverages asynchronous task parallelism and a global address space. We have previously demonstrated the generality of the pattern approach on four different combinatorial problems: Maximum Clique, k-Clique, Travelling Salesperson and Knapsack. Recent results for Maximum Clique show YewPar achieving an average speedup of 170 on 255 worker cores. Due to the high degree of irregularity, linear scaling cannot be not expected. Combinatorial search has many important and compute-heavy applications. With YewPar we have begun to scale combinatorial search to HPC and hope that in the future, with exascale machines, we will be able to solve a wide range of problems that are currently out of reach.

An Evaluation of the TensorFlow Programming Model for Solving Traditional HPC Problems
Steven Wei Der Chien, Chaitanya Prasad Sishtla, Stefano Markidis, Jun Zhang, Ivy Bo Peng and Erwin Laure

Computational intensive applications such as pattern recognition, and natural language processing, are increasingly popular on HPC systems. Many of these applications use deep-learning, a branch of machine learning, to determine the weights of artificial neural network nodes by minimizing a loss function. Such applications depend heavily on dense matrix multiplications, also called \emph{tensorial} operations. The use of Graphics Processing Unit (GPU) has considerably speeded up deep-learning computations, leading to a Renaissance of the artificial neural network. Recently, the NVIDIA Volta GPU [1] and the Google Tensor Processing Unit (TPU) have been specially designed to support deep-learning workloads. New programming models have also emerged for convenient expression of tensorial operations and deep-learning computational paradigms. An example of such new programming frameworks is TensorFlow, an open-source deep-learning library released by Google in 2015.

TensorFlow expresses algorithms as a computational graph where nodes represent operations and edges between nodes represent data flow. Multi-dimensional data such as vectors and matrices which flows between operations are called Tensors. For this reason, computation problems need to be expressed as a computational graph. In particular, TensorFlow supports distributed computation with flexible assignment of operation and data to devices such as GPU and CPU on different computing nodes. Computation on devices are based on optimized kernels such as MKL, Eigen and cuBLAS. Inter-node communication can be through TCP and RDMA.

This work attempts to evaluate the usability and expressiveness of the TensorFlow programming model for traditional HPC problems. As an illustration, we prototyped a distributed block matrix multiplication for large dense matrices which cannot be co-located on a single device and a Conjugate Gradient (CG) solver. We evaluate the difficulty of expressing traditional HPC algorithms using computational graphs and study the scalability of distributed TensorFlow on accelerated systems. Our preliminary result with distributed matrix multiplication shows that distributed computation on TensorFlow is extremely scalable. This study provides an initial investigation of new emerging programming models for HPC.

[1] Markidis, Stefano, Chien, Steven Wei Der, Laure, Erwin, Peng, Ivy Bo and Vetter, Jeffrey S., 2018, NVIDIA Tensor Core Programmability, Performance & Precision. arXiv preprint arXiv:1803.04014.

The work is funded by the European Commission through the SAGE project (Grant agreement no. 671500).


Leveraging hierarchical memories for micro-core architectures
Nick Brown, Maurice Jamieson

Micro-core architectures combine many simple, low power and low on-chip memory cores onto a single processor package. The low power nature of these architectures means that there is potential for their use in future HPC and embedded systems, and their low cost makes them ideal for education and prototyping. However there is a high barrier to entry in programming due to the considerable complexity and immaturity of supporting tools. ePython is a Python virtual machine we have developed for the 16-core Epiphany III microcore architecture which fits in the 32Kb per core memory. In combination we developed an abstraction that supports offloading functions in existing Python codes, running on a host CPU, seamlessly to the micro-cores. In [1] we introduced this abstraction and motivated it with a machine learning code for detecting lung cancer in 3D CT scans where kernels for model training and inference ran in parallel on the micro-cores. However the small amount of core memory severely limited the physical size of the images, which had to be interpolated to fit.

In addition to the small on-core memory, there is typically much larger, slower external memory. In order to take full advantage of these architectures one must leverage these hierarchies of memory, but a key question is how best to achieve this whilst maintaining good performance.

In this work we have addressed this challenge by splitting the memory abstraction into three choices:

  • A mirroring of memory where copies of external memory also exist on the micro-cores. A manually copying of data to and from the different memory levels is required.
  • Memory can be exposed from a specific level in the hierarchy to the micro-cores without an explicit copy being allocated. Reads and writes directly access this external memory, these accesses being blocking or non-blocking (using the DMA engines.) Abstractions around the non-blocking approach enables the programmer to leverage patterns such as double buffering and data streaming to overlap compute and memory access for performance.
  • By default memory belongs to the hierarchical level where it is first declared. It is possible to override this via memory kinds [2]. In our approach these are Python objects that follow a standard interface and sit outside of the core ePython implementation. They define the behaviour of memory access at the level of hierarchy they represent.

Based upon this work we are now able to run the machine learning code of with the full sized images. The programmer is able to experiment with choices around memory placement and access patterns without having to worry about the low level complexities of data movement.

Developing An Extensible, Portable, Scalable Toolkit for Massively Parallel Incompressible SPH
Xiaohu Guo

The stability, accuracy, energy conservation, boundary conditions of the projection based particle method such as incompressible smoothed particle hydrodynamics(ISPH) have been greatly improved[1]. However, there are still many challenges compared with other particle based methods from the perspective of computation and high performance software implementation when using hundreds of millions of particles above. In this talk, we are particularly concerning the scalable algorithms for the post peta-scale particle method based simulations, these algorithms are low overhead domain decomposition and dynamic load balancing involving irregular particle distributions and complex geometries, flexible parallel communications algorithms to facilitate user scientific software development. Particles ordering for cache-based computing architectures and reducing the sparse matrix bandwidth and efficient sparse linear solvers which is the additional distinct challenge for projection-based particle methods. The implementation details introduced here are intended to form future guidance for the new projection-based particle application development on the novel computing architectures.

SMURFF: a High-Performance Framework for Matrix Factorization
Tom Vander Aa and Tom Ashby

Recommender Systems (RS) have become very common in recent years and are useful in various real-life applications. The most popular ones are probably suggestions for movies on Netflix and books for Amazon. However, they can also be used in more unlikely area such drug discovery where a key problem is the identification of candidate molecules that affect proteins associated with diseases. In RS one has to analyze large and sparse matrices, for example those containing the known movie or book ratings. Matrix Factorization (MF) is a technique that has been successfully used here. The idea of this method is to approximate the rating matrix R as a product of two low-rank matrices U and V . Predictions can be made from the approximation U ⇥ V which is dense.

Bayesian probabilistic matrix factorization (BPMF) is one of the most popular algorithms for matrix factorization. Thanks to the Bayesian approach, BPMF has been proven to be more robust to data-overfitting and released from cross-validation. Yet BPMF is more computational intensive and thus more challenging to implement for large datasets. In this work we present SMURFF a high-performance feature-rich framework to compose and construct different Bayesian matrix-factorization methods, based on BPMF, for example Macau or GFA. Using the SMURFF framework one can easily vary: i) the type of matrix to be factored (dense or sparse); ii) the prior-distribution that you assume the model to fit to (multivariate-normal, spike andslab, and others); or iii) the noise model (Gaussian noise, Probit noise or others). The framework also allows to combine different matrices together and thus incorporate more and different types of information into the model.

The SMURFF framework has been implemented in C++ using the Eigen library for matrix manipulations, OpenMP for multi-core parallelization, and MPI and GASPI for multinode parallelization. Performance results of one of the imlemented methods (BPMF) can be found in [8]. The framework has been successfully used in the H2020 ExCAPE project to do large scale runs of compound-activity prediction. We were able to reduce training time for a realistic dataset from 3 months to 15 hours for the SMURFF C++ implementation compared to the original R implementation.

Using modular supercomputing to propel Particle-in Cell methods to exascale
Jorge Amaya, Diego Gonzalez-Herrero, Anke Kreuzer, Estela Suarez and Giovanni Lapenta

The most efficient applications in HPC take advantage of hardware optimizations that fine-tune cache management, vectorization, I/O and multi-threading. Applications are now developed to target the potential optimizations of a single computer architecture (Intel CPUs, IBM CPUs, GPUs, FPGAs, etc.). We investigate how applications increase their performances by transferring sub-tasks to different hardware. The DEEP-EST project proposes a new “modular architecture” allowing to use different hardware components for a single application. Two modules are used in the present work to perform simulations of space plasmas: the Booster module, composed of multiple Intel Xeon Phi KNL processors, and the Cluster module, composed of multiple Intel Haswell processors.

The xPic code is a Particle-in-Cell software used for the study of astrophysical plasmas. It is composed of two main sections: a) a Maxwell solver for the evolution of electromagnetic fields, and b) a particle solver for the transport of plasma ions and electrons that flow through the ambient electromagnetic fields. For an accurate description of astrophysical plasmas trillions of ions and electrons have to be moved in the system. We divide the code so the particle solver runs in the Booster module, taking advantage of the parallelization potential of its simple operations, and the field solver runs in the Cluster module performing serialized and communications intensive tasks. This Cluster-Booster division of work allows for an important increase in efficiency.

In this work we present the DEEP-EST architecture and the gains in performance obtained by the use of the Cluster-Booster approach. We show an analysis of the code and a projection of its performances towards exascale using the Dimemas tool of the Barcelona Supercomputer Center.

A portable runtime system approach to engineering design at exascale using the Uintah software
Martin Berzins, John Schmidt, Damodar Sahsbarude, Alan Humphrey, Sidharth Kumar, Brad Peterson, Zhang Yang

The many challenges of exascale computing including having suitable problems to run at such scales and having software that solves those problems in a way that makes it possible to quickly move to the new low-power designs at exascale with a minimum of code rewrites and at the same time to be able to both generate suitable; output and to visualize it. The challenges are being addressed by the CCMSC Center at the University of Utah using the UIntah software in close collaboration with GE and with Sandia Laboratories. The primary motivating problem is that of 1000Mwe GE coal boiler design as shown in Figure 1. Modeling the turbulent combustion in such a boiler in detail requires discretizating a structure that is about 6000 m3 at mm scale giving a grid with 6×1012 grid cells and 6×1013 turbulent combustion variables. In addition a low mach number approximation requires the solution of a system of 6×1012 equations per timestep and as the primary heat transfer mechanism is radiation everything is globally coupled.

The Utah Uintah software solves a task-based formulation of this problem in a petascale form by using its Arches component. The tasks that specify this problem are executed in an asynchronous and out of order manner by the Uintah’s runtime system. This allows adaptive scalability on present architectures. A raytracing approach to radiation scales to all of DOE Titan architecture using CPUs and GPUs. Linear solves are using preconditioned CG in the hypre code. Strong Scaling results are shown in Table1 and weak scaling results also exist.

In addressing the challenges of future exascale architectures such as the Argonne A21 architecture for which we are an early user, it is important to achieve both portability and performance. The approach that we have taken is to evolve the runtime system to take advantage of very different architectures such as those based on GPUs, Intel KNLs and the Sunway TiahuLight. Examples of performance on these architectures will be shown. The second step is to strive for portable performance using portability libraries such as Kokkos from Sandia Labs. This involves extending the Uintah programming model to ask the user to write Kokkos loops. The improvements in performance will be shown and the implications for using exascale machines like A21 described.

Mitigating performance degradation of frequently synchronised applications in presence of coarse grain jitter
Gladys Utrera and Jesus Labarta

Operating system (OS) noise exists on most computing platforms, and its effect on HPC systems can be devastating especially on frequently synchronized applications. For this reason, OS noise is currently the objective of many research works which include characterization, detection and mitigation techniques. Even more, some authors pointed out future sources of noise in extreme-scale systems due to for example fault tolerance mechanisms (i.e. checkpoint/restart) between others. As part of the efforts to reduce OS noise, proposals that goes from the design of lightweight kernels to the use of non-blocking collectives have been analysed. Collectives operations are the base of many HPC applications, which are frequently used in a regular manner in iterative processing. This kind of operations are especially sensitive to OS noise. The delay caused by the CPU cycles stolen from one task is amplified as a result of the collective operation, causing a load imbalance effect on the application. This problem is specially recommended by the authors in to be analysed in the future design of runtime systems.

In this work, we propose a mechanism to take advantage of the idle cycles at the collective operation generated by the delay of another process executing OS activities. To that aim, we use these CPU cycles to make progress in the execution of the application by migrating the affected task to the CPU owned by task that first arrived to the collective operation. Task migration is considered just within a node. The increasing availability of multicore systems, and even more important in exascale systems makes the approach feasible. In addition, shared memory within a node reduce considerably any memory access penalization due to task migration (shared last level cache).

The point is how the first task that arrives to the synchronization point at each node knows that there is a delay and which task is causing such delay. To that end, we study two approaches: 1) runtime detection; 2) prediction. The first approach is based on measuring CPU cycles per time unit using hardware counters and comparing them against an initial measurement. If the ratio is below a threshold, then we declare CPU cycles were stolen. The second approach is based on the observation that OS activities are regular and have a pattern. Consequently, any of these interruptions can be predicted. The predictions for each activity can be made by simply having tables at each CPU with information of the main daemons (last execution time, duration and intervals of execution) or with a more sophisticated technique that uses artificial intelligence.

We present in this work an evaluation of the second alternative, using a simple prediction. In order to avoid the native noise of the platform where we make the evaluations, we use just half of the available CPUs at each node. In addition, the noise is simulated and scaled to reflect the impact of the mechanism clearly and also to have whole control over the noise prediction. The evaluations were made varying frequency and duration of the noise occurrence. The duration is expressed relative to the calculation time iteration (i.e. the time between two consecutive collective operations).

Execution results on a multicore cluster with 48 CPUs at each node, running on 8 nodes and executing MPI microbenchmarks show that with perfect noise prediction, when having noise with duration equal to one calculation-phase, the gain in performance is about 19% and with two calculation-phase duration the gain can be up to 24%. About the error of the prediction, we observe that over-prediction, which is to perform task migration without having noise, can have more penalization than under-prediction, which is not doing task migration every time there is noise. For example, having 50% of misprediction the performance degrades by 5% doing task migration with respect to not doing it. While, doing task migration 50% of the times there is noise, may increase performance in about 6%.

Consequently, task migration for coarse grain noises is an attractive alternative for frequently synchronized applications. In addition, is preferable to have less predictions but accurate ones than overpredict noise occurrences. So, we need to improve the prediction mechanism. In this sense, we are working on optimizing runtime detection mechanisms which are costly but more accurate.