Exascale Applications Workshop
http://www.easc2018.ed.ac.uk/exascale-applications-workshop/ (20 Mar 2018)

EASC attendees are invited to register for a free Exascale Applications Workshop on 19 April (p.m.) – 20 April (a.m.) at Pollock Halls, Edinburgh.

This workshop will focus on communication models and their use in applications. The aim is to increase the scalability of scientific and industrial codes to exascale for those applications that require it.

The workshop will have 4 sessions:

  • Asynchronous execution (chair: Dr George Beckett, EPCC)
  • Interoperability (chair: Dr Mark Bull, EPCC)
  • Usage of libraries (chair: Dr Mirko Rahn, Fraunhofer ITWM)
  • Distributed tasks (chair: Dr Jakub Šístek, University of Manchester)

Registration is free – to register, simply email info@intertwine-project.eu stating your name, organisation name and email address, and any dietary requirements.

Full descriptions of each session and the proposed agenda can be found at https://www.intertwine-project.eu/exascale-applications-workshop-agenda

Thursday Paper Abstracts
http://www.easc2018.ed.ac.uk/thursday-paper-abstracts/ (2 Mar 2018)

Morning Parallel Session A

Morning Parallel Session B

Afternoon Parallel Session A

Afternoon Parallel Session B


Reducing the number of load rebalancings and increasing concurrency in mesh-based dynamically adaptive simulations
D.E. Charrier, B. Hazelwood, T. Weinzierl
ExaHyPE is an H2020 project in which an international consortium develops a simulation engine for hyperbolic equation system solvers based upon highly accurate ADER-DG coupled to robust Finite Volumes. The engine is used to simulate earthquakes and astrophysical events. After a brief sketch of ADER-DG, we dive into our AMR code base, which forges dynamically adaptive Cartesian meshes from spacetrees. While we support arbitrary dynamic adaptivity, benchmarks reveal that some regularity in the grid is advantageous for performance, as classic optimisation techniques for regular Cartesian grids can then be applied locally. Furthermore, strongly dynamic AMR forces us to rebalance the computational load between the employed MPI ranks often; data has to be moved around, which is time consuming.
We propose to impose some regularity on the grid, i.e. to take the adaptive compute grid and add additional cells such that the grid becomes block-structured with regular Cartesian blocks. Furthermore, we propose to introduce a host grid which is finer than the compute grid. We embed the compute grid into the host grid and perform all domain decomposition on the host grid. Only if the host grid cannot accommodate the compute grid due to strong adaptivity changes do we rebalance.
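The first idea can be illustrated with a toy 1D sketch (hypothetical, not ExaHyPE code): given per-cell refinement levels of an adaptive grid, raise every cell to the maximum level within its block, so each block becomes a regular Cartesian patch that classic optimisations can target.

```python
def regularise(levels, block):
    """Pad an adaptive 1D grid (per-cell refinement levels) so that each
    block of `block` consecutive cells is uniformly refined to the block's
    maximum level, trading a few extra cells for regular structure."""
    out = list(levels)
    for start in range(0, len(out), block):
        top = max(out[start:start + block])
        out[start:start + block] = [top] * len(out[start:start + block])
    return out

# An adaptive grid with levels 2-4 becomes two regular blocks.
padded = regularise([2, 3, 2, 2, 4, 4, 2, 2], block=4)
```

The extra cells added by the padding are redundant work, but they unlock regular-grid kernels inside each block.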

Accelerating simulations of cerebrovascular blood flow through parallelization in time
Rupert Nash, David Scott, Daniel Ruprecht and Derek Groen

Simulating blood flow in networks of human arteries (“computational hemodynamics”) allows us to better understand how flow properties in these arteries relate to the occurrence and degradation of major cardiovascular diseases. The HemeLB Lattice Boltzmann Method (LBM) simulation environment has a long history of successful use in computational hemodynamics. However, for realistic, patient-specific geometries like the Circle of Willis, HemeLB scales only up to 12-25k cores on ARCHER, resulting in simulation times of several days. Parallel-in-time integration methods like the widely used Parareal algorithm [1] can be used to unlock concurrency in the time dimension in order to extend parallel scaling, escape the “traps of strong and weak scaling” and reduce simulation times. While Parareal has been used successfully in combination with LBM codes [3], performance has so far only been explored for relatively simple benchmark problems. We will report on the ongoing effort of integrating an event-based variant of Parareal into HemeLB with the aim of extending parallel scaling on ARCHER and bringing down simulation times for geometries like the Circle of Willis to a day or less. Specific challenges of combining parallel-in-time integration with LBM will be illustrated and HemeLB-specific solutions presented. First performance tests on ARCHER will be shown.
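The Parareal idea can be sketched in a few lines: a cheap coarse propagator sweeps serially, while the expensive fine propagations, which are independent across time slices, run concurrently and feed an iterative correction. This is a generic illustration on a scalar ODE, not HemeLB code.

```python
import math

def parareal(coarse, fine, u0, n_slices, n_iters):
    """Minimal Parareal skeleton. coarse/fine propagate a state across one
    time slice; in a real run the fine propagations of each iteration would
    execute in parallel, one slice per group of ranks."""
    # Serial coarse sweep gives the initial guess at every slice boundary.
    u = [u0]
    for n in range(n_slices):
        u.append(coarse(u[n]))
    for _ in range(n_iters):
        # Independent fine (and old coarse) propagations: the parallel part.
        f = [fine(u[n]) for n in range(n_slices)]
        g_old = [coarse(u[n]) for n in range(n_slices)]
        # Serial predictor-corrector sweep: cheap coarse prediction,
        # corrected by the fine/coarse difference from the last iteration.
        new_u = [u0]
        for n in range(n_slices):
            new_u.append(coarse(new_u[n]) + f[n] - g_old[n])
        u = new_u
    return u

# Toy problem dy/dt = -y: the fine propagator is exact, coarse is one Euler step.
dt = 0.1
states = parareal(lambda y: y * (1 - dt), lambda y: y * math.exp(-dt),
                  1.0, n_slices=10, n_iters=5)
```

After a few iterations the slice-boundary states converge to the fine solution, so the wall-clock cost approaches that of one fine slice plus the serial coarse sweeps.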


High Performance Combinatorial Search with YewPar
Blair Archibald, Patrick Maier, Robert Stewart and Phil Trinder

We propose a novel HPC application domain, namely combinatorial search. Combinatorial search plays an important role in many application areas including industrial scheduling and allocation, artificial intelligence, computational biology and computational algebra. At the heart of (exact) combinatorial search are backtracking search algorithms that systematically, and dynamically, explore a search space to find objects of interest. Due to the nature of these algorithms, the amount of computation can grow exponentially with the input size. Parallelism is one way to address this exponential complexity and HPC machines, if used effectively, may enable the solving of important problem instances that are currently out of reach.

A task-parallel backtracking search algorithm splits the search tree into multiple subtrees or subtasks and speculatively explores each one in parallel. While seemingly simple, this approach poses many challenges. Subtrees vary hugely in size, often by orders of magnitude; searches typically follow specific search order heuristics that must be preserved as far as possible; and asynchronous knowledge sharing between tasks dynamically affects the set of active tasks, e.g. by pruning unnecessary tasks. Combinatorial search differs from traditional HPC workloads in its use of speculative parallelism, high degree of irregularity, asynchronous knowledge sharing, and lack of floating-point computation. However, in many respects it is well suited to HPC architectures: search problems are compute-heavy, perform little I/O, have few global synchronisation points, and tend to benefit from integer/bitset vectorisation.

Currently, most parallel combinatorial search is limited to multi-cores. However, given the need to tackle ever larger problems, it is essential to enable the use of HPC resources for this domain, ideally without requiring HPC expertise of the user. To this end we have developed YewPar, a framework for parallel combinatorial search designed with HPC in mind. YewPar exposes high-level parallel patterns (algorithmic skeletons) for users to specify search algorithms, while the framework manages all parallel coordination implicitly. This approach allows search experts to take advantage of multiple parallel architectures (multi-core, cluster, cloud, HPC) without detailed parallelism or architecture knowledge. YewPar is built on HPX [3], a standards-compliant C++ library designed for exascale systems that leverages asynchronous task parallelism and a global address space. We have previously demonstrated the generality of the pattern approach on four different combinatorial problems: Maximum Clique, k-Clique, Travelling Salesperson and Knapsack. Recent results for Maximum Clique show YewPar achieving an average speedup of 170 on 255 worker cores. Due to the high degree of irregularity, linear scaling cannot be expected. Combinatorial search has many important and compute-heavy applications. With YewPar we have begun to scale combinatorial search to HPC and hope that in the future, with exascale machines, we will be able to solve a wide range of problems that are currently out of reach.
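The structure of such a search — a shared incumbent updated asynchronously plus speculative subtree tasks — can be illustrated with a toy branch-and-bound knapsack. This is a hypothetical sketch, not YewPar's API (and Python threads add no real speedup, so it shows structure only).

```python
from concurrent.futures import ThreadPoolExecutor
import threading

class Incumbent:
    """Shared best-so-far value: the asynchronously shared 'knowledge'
    that lets one task prune work discovered by another."""
    def __init__(self):
        self.best = 0
        self.lock = threading.Lock()

    def update(self, value):
        with self.lock:
            if value > self.best:
                self.best = value

def branch_and_bound(items, capacity, idx, weight, value, inc):
    """Depth-first knapsack search over (weight, value) items; prunes any
    subtree that cannot beat the current incumbent."""
    inc.update(value)
    if idx == len(items):
        return
    remaining = sum(v for _, v in items[idx:])
    if value + remaining <= inc.best:  # bound: even taking everything left cannot win
        return
    w, v = items[idx]
    if weight + w <= capacity:
        branch_and_bound(items, capacity, idx + 1, weight + w, value + v, inc)
    branch_and_bound(items, capacity, idx + 1, weight, value, inc)

def parallel_search(items, capacity):
    """Split the root: each first-level decision becomes a speculative task."""
    inc = Incumbent()
    w0, v0 = items[0]
    with ThreadPoolExecutor() as pool:
        tasks = [pool.submit(branch_and_bound, items, capacity, 1, w0, v0, inc),
                 pool.submit(branch_and_bound, items, capacity, 1, 0, 0, inc)]
        for t in tasks:
            t.result()
    return inc.best

best = parallel_search([(3, 4), (4, 5), (2, 3)], capacity=6)
```

Note that the tasks are speculative: a task may be rendered useless by an incumbent found elsewhere, which is exactly the irregularity the abstract describes.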


An Evaluation of the TensorFlow Programming Model for Solving Traditional HPC Problems
Steven Wei Der Chien, Chaitanya Prasad Sishtla, Stefano Markidis, Jun Zhang, Ivy Bo Peng and Erwin Laure

Computationally intensive applications such as pattern recognition and natural language processing are increasingly popular on HPC systems. Many of these applications use deep learning, a branch of machine learning, to determine the weights of artificial neural network nodes by minimizing a loss function. Such applications depend heavily on dense matrix multiplications, also called tensorial operations. The use of Graphics Processing Units (GPUs) has considerably sped up deep-learning computations, leading to a renaissance of the artificial neural network. Recently, the NVIDIA Volta GPU [1] and the Google Tensor Processing Unit (TPU) have been specially designed to support deep-learning workloads. New programming models have also emerged for convenient expression of tensorial operations and deep-learning computational paradigms. An example of such new programming frameworks is TensorFlow, an open-source deep-learning library released by Google in 2015.

TensorFlow expresses algorithms as a computational graph where nodes represent operations and edges between nodes represent data flow. Multi-dimensional data such as vectors and matrices which flow between operations are called tensors. For this reason, computational problems need to be expressed as a computational graph. In particular, TensorFlow supports distributed computation with flexible assignment of operations and data to devices such as GPUs and CPUs on different computing nodes. Computation on devices is based on optimized kernels such as MKL, Eigen and cuBLAS. Inter-node communication can be through TCP or RDMA.

This work attempts to evaluate the usability and expressiveness of the TensorFlow programming model for traditional HPC problems. As an illustration, we prototyped a distributed block matrix multiplication for large dense matrices which cannot be co-located on a single device and a Conjugate Gradient (CG) solver. We evaluate the difficulty of expressing traditional HPC algorithms using computational graphs and study the scalability of distributed TensorFlow on accelerated systems. Our preliminary result with distributed matrix multiplication shows that distributed computation on TensorFlow is extremely scalable. This study provides an initial investigation of new emerging programming models for HPC.
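The block decomposition behind such a distributed matrix multiplication can be sketched framework-agnostically with NumPy; in distributed TensorFlow each block product below would be a graph node assigned to a device, with the edges carrying the block tensors between devices.

```python
import numpy as np

def block_matmul(A, B, bs):
    """Compute A @ B by tiling into bs x bs blocks. Each inner block product
    is independent given its inputs, which is what lets a graph scheduler
    place them on different devices when the full matrices do not fit on one."""
    n, m = A.shape[0], B.shape[1]
    C = np.zeros((n, m))
    for i in range(0, n, bs):
        for j in range(0, m, bs):
            for k in range(0, A.shape[1], bs):
                # One such block product per graph node / device.
                C[i:i+bs, j:j+bs] += A[i:i+bs, k:k+bs] @ B[k:k+bs, j:j+bs]
    return C

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 8))
B = rng.standard_normal((8, 8))
C = block_matmul(A, B, bs=4)
```

The scalability the abstract reports follows from this independence: only block-sized tensors move between devices, never the full matrices.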

[1] Markidis, Stefano, Chien, Steven Wei Der, Laure, Erwin, Peng, Ivy Bo and Vetter, Jeffrey S., 2018, NVIDIA Tensor Core Programmability, Performance & Precision. arXiv preprint arXiv:1803.04014.

The work is funded by the European Commission through the SAGE project (Grant agreement no. 671500).



Leveraging hierarchical memories for micro-core architectures
Nick Brown, Maurice Jamieson

Micro-core architectures combine many simple, low-power and low on-chip memory cores onto a single processor package. The low-power nature of these architectures means that there is potential for their use in future HPC and embedded systems, and their low cost makes them ideal for education and prototyping. However, there is a high barrier to entry in programming them due to the considerable complexity and immaturity of supporting tools. ePython is a Python virtual machine we have developed for the 16-core Epiphany III micro-core architecture which fits in the 32 KB per-core memory. Alongside it we developed an abstraction that supports seamlessly offloading functions in existing Python codes, running on a host CPU, to the micro-cores. In [1] we introduced this abstraction and motivated it with a machine learning code for detecting lung cancer in 3D CT scans, where kernels for model training and inference ran in parallel on the micro-cores. However, the small amount of core memory severely limited the physical size of the images, which had to be interpolated to fit.

In addition to the small on-core memory, there is typically much larger, slower external memory. In order to take full advantage of these architectures one must leverage these hierarchies of memory, but a key question is how best to achieve this whilst maintaining good performance.

In this work we have addressed this challenge by splitting the memory abstraction into three choices:

  • A mirroring of memory where copies of external memory also exist on the micro-cores. Manual copying of data to and from the different memory levels is required.
  • Memory can be exposed from a specific level in the hierarchy to the micro-cores without an explicit copy being allocated. Reads and writes directly access this external memory, and these accesses can be blocking or non-blocking (using the DMA engines). Abstractions around the non-blocking approach enable the programmer to leverage patterns such as double buffering and data streaming to overlap compute and memory access for performance.
  • By default memory belongs to the hierarchical level where it is first declared. It is possible to override this via memory kinds [2]. In our approach these are Python objects that follow a standard interface and sit outside of the core ePython implementation. They define the behaviour of memory access at the level of hierarchy they represent.
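The double-buffering pattern mentioned for the non-blocking case can be sketched generically. Here a Python thread stands in for the DMA engine prefetching the next tile while the current one is processed; this is illustrative only, not ePython's actual abstraction.

```python
import threading

def process_stream(fetch_tile, compute, n_tiles):
    """Double buffering: while computing on tile i in one buffer, prefetch
    tile i+1 into the other buffer (a stand-in for a non-blocking DMA),
    so slow external-memory access overlaps with compute."""
    buffers = [fetch_tile(0), None]
    results = []
    for i in range(n_tiles):
        cur, nxt = i % 2, (i + 1) % 2
        prefetcher = None
        if i + 1 < n_tiles:
            def prefetch(slot=nxt, idx=i + 1):
                buffers[slot] = fetch_tile(idx)
            prefetcher = threading.Thread(target=prefetch)
            prefetcher.start()
        results.append(compute(buffers[cur]))  # overlaps with the prefetch
        if prefetcher is not None:
            prefetcher.join()                  # the DMA-completion wait
    return results

external = list(range(10))  # stand-in for large, slow external memory
out = process_stream(lambda i: external[i], lambda x: x * x, 10)
```

Only two tile-sized buffers ever live in the small on-core memory, however large the external data set is.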

Based upon this work we are now able to run the machine learning code of [1] with the full-sized images. The programmer is able to experiment with choices around memory placement and access patterns without having to worry about the low-level complexities of data movement.


Developing An Extensible, Portable, Scalable Toolkit for Massively Parallel Incompressible SPH
Xiaohu Guo

The stability, accuracy, energy conservation and boundary conditions of projection-based particle methods such as incompressible smoothed particle hydrodynamics (ISPH) have been greatly improved [1]. However, many challenges remain compared with other particle-based methods from the perspective of computation and high-performance software implementation when using hundreds of millions of particles or more. In this talk we are particularly concerned with scalable algorithms for post-petascale particle-method simulations: low-overhead domain decomposition and dynamic load balancing for irregular particle distributions and complex geometries, and flexible parallel communication algorithms to facilitate user scientific software development. Particle ordering for cache-based computing architectures, reducing the sparse matrix bandwidth, and efficient sparse linear solvers are the additional challenges distinctive to projection-based particle methods. The implementation details introduced here are intended to form future guidance for developing new projection-based particle applications on novel computing architectures.
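One common recipe for the particle ordering mentioned above is sorting by a space-filling curve key. This hypothetical sketch (not the toolkit's code) orders particles along a 2D Morton/Z-order curve, so particles close in space end up close in memory, improving cache reuse and reducing the bandwidth of the assembled sparse matrix.

```python
def morton_key(ix, iy, bits=16):
    """Interleave the bits of integer cell coordinates (Z-order / Morton curve).
    Nearby (ix, iy) cells map to nearby keys along the curve."""
    key = 0
    for b in range(bits):
        key |= ((ix >> b) & 1) << (2 * b)
        key |= ((iy >> b) & 1) << (2 * b + 1)
    return key

def reorder_particles(particles, cell_size):
    """Sort (x, y) particles by the Morton key of their containing cell."""
    return sorted(particles,
                  key=lambda p: morton_key(int(p[0] / cell_size),
                                           int(p[1] / cell_size)))

pts = [(3.5, 0.2), (0.1, 0.1), (0.2, 3.8), (0.3, 0.4)]
ordered = reorder_particles(pts, cell_size=1.0)
```

A production code would typically recompute this ordering only every few timesteps, since particles move slowly relative to the cell size.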


SMURFF: a High-Performance Framework for Matrix Factorization
Tom Vander Aa and Tom Ashby

Recommender Systems (RS) have become very common in recent years and are useful in various real-life applications. The most popular ones are probably suggestions for movies on Netflix and books on Amazon. However, they can also be used in less obvious areas such as drug discovery, where a key problem is the identification of candidate molecules that affect proteins associated with diseases. In RS one has to analyze large and sparse matrices, for example those containing the known movie or book ratings. Matrix Factorization (MF) is a technique that has been successfully used here. The idea of this method is to approximate the rating matrix R as a product of two low-rank matrices U and V. Predictions can be made from the approximation U × V, which is dense.

Bayesian probabilistic matrix factorization (BPMF) is one of the most popular algorithms for matrix factorization. Thanks to the Bayesian approach, BPMF has been shown to be more robust to data overfitting and free from cross-validation. Yet BPMF is more computationally intensive and thus more challenging to implement for large datasets. In this work we present SMURFF, a high-performance, feature-rich framework to compose and construct different Bayesian matrix-factorization methods based on BPMF, for example Macau or GFA. Using the SMURFF framework one can easily vary: i) the type of matrix to be factored (dense or sparse); ii) the prior distribution that the model is assumed to fit (multivariate normal, spike-and-slab, and others); or iii) the noise model (Gaussian noise, probit noise or others). The framework also allows combining different matrices together and thus incorporating more and different types of information into the model.
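As a minimal illustration of the underlying factorization R ≈ U Vᵀ, the sketch below uses plain gradient descent on the observed entries, rather than the Bayesian posterior sampling that BPMF and SMURFF implement; it only shows the model being fitted, not the method.

```python
import numpy as np

def factorize(R, mask, rank, steps=5000, lr=0.02, reg=0.01):
    """Fit R ~ U @ V.T on observed entries only (mask == 1) by full-batch
    gradient descent with a small L2 penalty. BPMF instead samples U and V
    from their posterior, avoiding hand-tuned lr/reg."""
    rng = np.random.default_rng(1)
    n, m = R.shape
    U = 0.1 * rng.standard_normal((n, rank))
    V = 0.1 * rng.standard_normal((m, rank))
    for _ in range(steps):
        E = mask * (R - U @ V.T)      # error on observed entries only
        U += lr * (E @ V - reg * U)
        V += lr * (E.T @ U - reg * V)
    return U, V

R = np.array([[5.0, 3.0, 0.0],
              [4.0, 0.0, 1.0],
              [1.0, 1.0, 5.0]])
mask = (R > 0).astype(float)          # 0 marks a missing rating
U, V = factorize(R, mask, rank=2)
pred = U @ V.T                        # dense predictions, incl. missing cells
```

The dense product U @ V.T fills in the unobserved cells, which is exactly how predictions are read off the factorization.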

The SMURFF framework has been implemented in C++ using the Eigen library for matrix manipulations, OpenMP for multi-core parallelization, and MPI and GASPI for multi-node parallelization. Performance results of one of the implemented methods (BPMF) can be found in [8]. The framework has been successfully used in the H2020 ExCAPE project to do large-scale runs of compound-activity prediction. We were able to reduce training time for a realistic dataset from 3 months for the original R implementation to 15 hours for the SMURFF C++ implementation.


Using modular supercomputing to propel Particle-in-Cell methods to exascale
Jorge Amaya, Diego Gonzalez-Herrero, Anke Kreuzer, Estela Suarez and Giovanni Lapenta

The most efficient applications in HPC take advantage of hardware optimizations that fine-tune cache management, vectorization, I/O and multi-threading. Applications are now developed to target the potential optimizations of a single computer architecture (Intel CPUs, IBM CPUs, GPUs, FPGAs, etc.). We investigate how applications can increase their performance by transferring sub-tasks to different hardware. The DEEP-EST project proposes a new “modular architecture” that allows a single application to use different hardware components. Two modules are used in the present work to perform simulations of space plasmas: the Booster module, composed of multiple Intel Xeon Phi KNL processors, and the Cluster module, composed of multiple Intel Haswell processors.

The xPic code is Particle-in-Cell software used for the study of astrophysical plasmas. It is composed of two main sections: a) a Maxwell solver for the evolution of electromagnetic fields, and b) a particle solver for the transport of plasma ions and electrons that flow through the ambient electromagnetic fields. For an accurate description of astrophysical plasmas, trillions of ions and electrons have to be moved in the system. We divide the code so that the particle solver runs in the Booster module, taking advantage of the parallelization potential of its simple operations, while the field solver runs in the Cluster module, performing serialized and communication-intensive tasks. This Cluster-Booster division of work allows for a significant increase in efficiency.
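The particle-solver half of this split consists of simple, independent per-particle updates, which is why it maps well onto the many-core Booster. A minimal generic sketch of such a push kernel (1D, nearest-grid-point gather, not xPic code) looks like:

```python
import numpy as np

def push_particles(x, v, E_grid, dx, qm, dt, L):
    """Booster-side kernel sketch: gather the field at each particle,
    then advance velocity (kick) and position (drift) in a periodic box.
    Every particle is independent, so the loop is trivially parallel."""
    idx = (x / dx).astype(int) % len(E_grid)
    E_p = E_grid[idx]            # nearest-grid-point gather from the field
    v = v + qm * E_p * dt        # kick: charge-to-mass ratio times field
    x = (x + v * dt) % L         # drift, wrapped into the periodic domain
    return x, v

L, n_cells = 1.0, 16
dx = L / n_cells
E = np.zeros(n_cells)            # field delivered by the Cluster-side solver
x = np.array([0.1, 0.5, 0.9])
v = np.array([0.2, -0.1, 0.0])
x, v = push_particles(x, v, E, dx, qm=-1.0, dt=0.01, L=L)
```

The Cluster-side Maxwell solve that produces `E` is, by contrast, a globally coupled computation, which motivates placing it on the latency-optimised module.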

In this work we present the DEEP-EST architecture and the gains in performance obtained by the use of the Cluster-Booster approach. We show an analysis of the code and a projection of its performance towards exascale using the Dimemas tool of the Barcelona Supercomputing Center.


A portable runtime system approach to engineering design at exascale using the Uintah software
Martin Berzins, John Schmidt, Damodar Sahasrabudhe, Alan Humphrey, Sidharth Kumar, Brad Peterson, Zhang Yang

The many challenges of exascale computing include having suitable problems to run at such scales and having software that solves those problems in a way that makes it possible to move quickly to the new low-power exascale designs with a minimum of code rewrites, while at the same time being able both to generate suitable output and to visualize it. These challenges are being addressed by the CCMSC Center at the University of Utah using the Uintah software in close collaboration with GE and with Sandia Laboratories. The primary motivating problem is that of 1000 MWe GE coal boiler design, as shown in Figure 1. Modeling the turbulent combustion in such a boiler in detail requires discretizing a structure of about 6000 m³ at mm scale, giving a grid with 6×10^12 grid cells and 6×10^13 turbulent combustion variables. In addition, a low-Mach-number approximation requires the solution of a system of 6×10^12 equations per timestep, and as the primary heat transfer mechanism is radiation, everything is globally coupled.

The Uintah software solves a task-based formulation of this problem in a petascale form by using its Arches component. The tasks that specify this problem are executed in an asynchronous and out-of-order manner by Uintah’s runtime system. This allows adaptive scalability on present architectures. A ray-tracing approach to radiation scales to the whole of the DOE Titan machine using CPUs and GPUs. Linear solves use preconditioned CG from the hypre code. Strong scaling results are shown in Table 1, and weak scaling results also exist.

In addressing the challenges of future exascale architectures such as the Argonne A21 architecture, for which we are an early user, it is important to achieve both portability and performance. The approach that we have taken is to evolve the runtime system to take advantage of very different architectures such as those based on GPUs, Intel KNLs and the Sunway TaihuLight. Examples of performance on these architectures will be shown. The second step is to strive for portable performance using portability libraries such as Kokkos from Sandia Labs. This involves extending the Uintah programming model to ask the user to write Kokkos loops. The improvements in performance will be shown and the implications for using exascale machines like A21 described.


Mitigating performance degradation of frequently synchronised applications in presence of coarse grain jitter
Gladys Utrera and Jesus Labarta

Operating system (OS) noise exists on most computing platforms, and its effect on HPC systems can be devastating, especially on frequently synchronized applications. For this reason, OS noise is currently the objective of many research works covering characterization, detection and mitigation techniques. Moreover, some authors have pointed out future sources of noise in extreme-scale systems due, for example, to fault-tolerance mechanisms (i.e. checkpoint/restart), among others. As part of the efforts to reduce OS noise, proposals ranging from the design of lightweight kernels to the use of non-blocking collectives have been analysed. Collective operations are the basis of many HPC applications and are frequently used in a regular manner in iterative processing. This kind of operation is especially sensitive to OS noise: the delay caused by the CPU cycles stolen from one task is amplified by the collective operation, causing a load-imbalance effect on the application. Several authors have specifically recommended that this problem be analysed in the future design of runtime systems.

In this work, we propose a mechanism to take advantage of the idle cycles at the collective operation generated by the delay of another process executing OS activities. To that aim, we use these CPU cycles to make progress in the execution of the application by migrating the affected task to the CPU owned by the task that first arrived at the collective operation. Task migration is considered just within a node. The increasing availability of multicore systems, even more pronounced in exascale systems, makes the approach feasible. In addition, shared memory within a node considerably reduces any memory-access penalization due to task migration (shared last-level cache).

The point is how the first task that arrives at the synchronization point at each node knows that there is a delay and which task is causing it. To that end, we study two approaches: 1) runtime detection; 2) prediction. The first approach is based on measuring CPU cycles per time unit using hardware counters and comparing them against an initial measurement. If the ratio is below a threshold, then we declare that CPU cycles were stolen. The second approach is based on the observation that OS activities are regular and have a pattern; consequently, these interruptions can be predicted. The predictions for each activity can be made by simply keeping tables at each CPU with information about the main daemons (last execution time, duration and intervals of execution) or with a more sophisticated technique that uses artificial intelligence.
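The table-based prediction can be sketched simply: from observed (daemon, start time, duration) records, estimate each daemon's period and typical duration, then predict its next occurrence. The daemon name and timings below are hypothetical example data, not measurements from the paper.

```python
def build_daemon_table(log):
    """Per-CPU prediction table sketch: collect each daemon's observed
    start times and durations, then predict the next interruption as
    (last start + mean gap) with the mean observed duration."""
    table = {}
    for name, start, dur in log:
        entry = table.setdefault(name, {"starts": [], "durs": []})
        entry["starts"].append(start)
        entry["durs"].append(dur)
    pred = {}
    for name, e in table.items():
        starts, durs = sorted(e["starts"]), e["durs"]
        gaps = [b - a for a, b in zip(starts, starts[1:])]
        period = sum(gaps) / len(gaps)
        pred[name] = {"next": starts[-1] + period,
                      "duration": sum(durs) / len(durs)}
    return pred

# Hypothetical observations: a daemon that wakes up roughly every 10 time units.
log = [("kswapd", 0.0, 0.4), ("kswapd", 10.1, 0.5), ("kswapd", 19.9, 0.6)]
pred = build_daemon_table(log)
```

The first task to reach the collective would consult such a table to decide whether a peer is likely delayed by an interruption and, if so, whether migrating it is worthwhile.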

We present in this work an evaluation of the second alternative, using a simple prediction. In order to avoid the native noise of the platform where we make the evaluations, we use just half of the available CPUs at each node. In addition, the noise is simulated and scaled, both to reflect the impact of the mechanism clearly and to have full control over the noise prediction. The evaluations were made varying the frequency and duration of the noise occurrence. The duration is expressed relative to the calculation time per iteration (i.e. the time between two consecutive collective operations).

Execution results on a multicore cluster with 48 CPUs per node, running on 8 nodes and executing MPI microbenchmarks, show that with perfect noise prediction the gain in performance is about 19% when the noise duration equals one calculation phase, and up to 24% for a duration of two calculation phases. Regarding prediction error, we observe that over-prediction (performing task migration when there is no noise) can incur more penalization than under-prediction (not performing task migration every time there is noise). For example, with 50% misprediction, doing task migration degrades performance by 5% with respect to not doing it, while doing task migration on only 50% of the noise occurrences may still increase performance by about 6%.

Consequently, task migration for coarse-grain noise is an attractive alternative for frequently synchronized applications. In addition, it is preferable to make fewer but accurate predictions than to over-predict noise occurrences, so we need to improve the prediction mechanism. In this sense, we are working on optimizing runtime detection mechanisms, which are costly but more accurate.

Wednesday Paper Abstracts
http://www.easc2018.ed.ac.uk/wednesday-paper-abstracts/ (2 Mar 2018)

Morning Parallel Session A

Morning Parallel Session B

Afternoon Parallel Session A

Afternoon Parallel Session B


A Task-Based Particle-in-Cell Method with Automatic Load Balancing using the AllScale Environment
Roman Iakymchuk, Herbert Jordan, Philipp Gschwandtner, Thomas Heller, Peter Thoman, Xavier Aguilar, Thomas Fahringer, Erwin Laure, Stefano Markidis
The Particle-in-Cell (PIC) method is one of the most common and powerful numerical techniques for the simulation of fusion, astrophysical, and space plasmas. For instance, PIC simulations are used to study the interaction of the Earth’s electromagnetic field with the hot plasma emanating from the sun, the so-called solar wind. Notably, high-energy plasma in space can damage spacecraft and put at risk the lives of astronauts. Thus, it is important to enable efficient large-scale PIC simulations that are capable of predicting different phenomena in space. The PIC method solves the kinetic equation of plasmas by first sampling plasma distribution functions with computational particles and then following their trajectories by solving the equation of motion for each particle. The electromagnetic field determining the particle trajectory is calculated by solving Maxwell’s equations on a grid. The coupling between particles and both the electric and magnetic fields is provided by so-called interpolation functions (aka gather and scatter). Typically, parallel PIC simulations divide the simulation box into several equally sized domains with initially the same number of particles [1]. Each domain is assigned to a process that carries out the computation for the particles in the domain. When a particle exits the domain, it is communicated to a different domain. Because of the non-uniform configuration of the electromagnetic field in space, computational particles concentrate in relatively small spatial regions while few particles cover other spatial regions, resulting in load imbalance. Workload imbalance is the most severe and limiting problem in large-scale PIC simulations, resulting in up to 60% process imbalance.
In this talk, we present a new formulation of the PIC method that provides automatic load balancing using the AllScale toolchain. The AllScale approach, which is based on task-centric nested recursive parallelism, aims to provide a unified programming system for the effective development of highly scalable, resilient and performance-portable parallel applications for Exascale systems. In the spirit of this approach, we redesign the PIC method: particles are placed inside individual cells rather than storing the two structures separately, which would require extra mappings. Moreover, the field solver implements an explicit scheme, which eliminates global communications and synchronizations. This facilitates the accommodation of the nested, recursive, and asynchronous task parallelism that is at the core of the AllScale toolchain. Furthermore, our method employs dynamic load balancing: the simulation is recursively divided into smaller domains, forming a set of tasks to be computed. The granularity and distribution of those tasks is thereby controlled by the scheduler provided by the AllScale runtime system, leading to an even load balance where each processing unit spends approximately the same amount of time processing its assigned share of particles. We demonstrate our new approach using a case study with severe load imbalance, the Earth’s Van Allen radiation belts.
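The recursive task decomposition can be illustrated with a toy 1D sketch (not AllScale code): bisect the domain until each task's particle count falls below a threshold, so densely populated regions yield more, smaller tasks that a scheduler can then spread across processing units.

```python
def split_domain(cells, max_particles):
    """Recursively bisect a list of cells (each entry is that cell's particle
    count) until every task holds at most max_particles. Dense regions are
    split more deeply, mimicking granularity control by a task scheduler."""
    total = sum(cells)
    if total <= max_particles or len(cells) == 1:
        return [cells]
    mid = len(cells) // 2
    return (split_domain(cells[:mid], max_particles)
            + split_domain(cells[mid:], max_particles))

# Two crowded cells force extra splits; sparse regions stay as one task each.
tasks = split_domain([1, 1, 50, 30, 1, 1, 1, 1], max_particles=40)
```

Equal-sized static domains would put 82 of the 86 particles in one half; the recursive split instead yields tasks whose costs a runtime can balance.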

LFRic and PSyclone: Building a Domain Specific Embedded Language for Weather and Climate Models
Christopher Maynard

In common with many science applications, exascale computing presents a disruptive change for weather and climate models. However, the difficulty in porting and optimising legacy codes to new hardware is particularly acute for this domain, as the software is large (O(10^6) lines of code), takes a long time to develop (∼ 10 years for a new dynamical core) and is long-lived (typically ∼ 25 years or longer). These timescales are much longer than the changes in both processor architectures and the programming models necessary to exploit those architectures. Moreover, highly scalable algorithms are necessary to exploit the degree of parallelism exascale computers are likely to exhibit.

In collaboration with academic partners, the Met Office is developing a new dynamical core, called GungHo. By employing a mixed Finite Element Method on an unstructured mesh, the new dynamical core is designed to maintain the scientific accuracy of the current Unified Model (UM) dynamical core (ENDGame [2]), whilst allowing improved scalability by avoiding the singularities present at the poles of a lon-lat grid. A new atmospheric model and software infrastructure, named LFRic after Lewis Fry Richardson, is being developed to host the GungHo dynamical core, as the structured lon-lat grid is inherent in the data structures of the UM. The software design is based on a layered architecture and a separation of concerns between the natural science code, in which the mathematics is expressed, and the computational science code, where the parallelism and other software- and hardware-specific performance optimisations are expressed. In particular, there are three layers. The top layer, the algorithm layer, is where high-level mathematical operations on global fields are performed. The bottom layer is the kernel layer, where these operations are expressed on a single column of data. In between is the Parallelisation System or PSy layer, where the horizontal looping and parallelism are expressed. This abstraction, called PSyKAl, is written in Fortran 2003 using Object Orientation to encode the rules of the API. Moreover, a Python code called PSyclone can parse the algorithm and kernel layers and generate the PSy layer with different target programming models. In effect, the PSyKAl API and PSyclone are a Domain Specific Embedded Language (DSEL). Domain science code which conforms to this API can be written in serial, and the code, parallelised and optimised for the targeted hardware architecture, is then generated automatically.
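The three-layer separation can be caricatured in a few lines of Python (purely illustrative: the real layers are Fortran, and the PSy layer is generated by PSyclone rather than hand-written).

```python
def kernel_column(column):
    """Kernel layer: the science operation on a single vertical column."""
    return [2.0 * v for v in column]

def psy_layer(field, kernel):
    """PSy layer: the horizontal loop over columns. This is the only layer
    that would change when targeting OpenMP, OpenACC, etc., and it is the
    layer PSyclone generates."""
    return [kernel(col) for col in field]

def algorithm(field):
    """Algorithm layer: a high-level operation on a global field, written
    with no knowledge of parallelism."""
    return psy_layer(field, kernel_column)

# A 'global field' of two columns, each with two vertical levels.
doubled = algorithm([[1.0, 2.0], [3.0, 4.0]])
```

Because the kernel sees only one column and the algorithm sees only whole fields, the middle layer can be regenerated per architecture without touching the science code.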

The model is under active development and indeed, the science and numerical analysis are still areas of active research. However, in order to assess the scientific performance of the model, sufficiently computationally challenging problems must be tackled. Thus, there is a requirement to run these research models efficiently and at scale on current hardware architectures.

In this paper, the software design and strategy for future architectures is presented, together with some preliminary performance analysis, including scaling to a quarter of a million cores on the Met Office Cray XC40. The use of redundant computation, shared-memory threaded parallelism such as OpenMP and OpenACC, and performance on different architectures such as CPUs, GPUs and ARM processors are also discussed. Furthermore, as I/O is a significant performance factor for weather and in particular climate models, some performance analysis of I/O at scale using the asynchronous I/O server of the XIOS library is presented.


Leveraging SLEPc in modeling the earth’s magnetic environment
Nick Brown, Brian Hamilton, William Brown, Ciaran D Beggan, Brian Bainbridge, Susan Macmillan

The Model of the Earth’s Magnetic Environment (MEME) [1] is an important tool for calculating the earth’s magnetic field. In addition to being used directly for scientific study, this model is also a building block for a number of other BGS codes which have a wide variety of applications, from oil and gas exploration to GPS positioning. The Earth’s internal magnetic field is generated by the motion of the conductive metallic fluid in the outer core. Changes in the motion of this fluid, which are happening all the time, cause variations in the shape and intensity of the magnetic field measured at the surface. In order to calculate the current magnetic environment, the MEME model uses a combination of current and historical data from magnetic survey satellites (such as the ESA Swarm mission) and observational sites around the globe.

Computationally, the model requires the solution of differential equations, which it obtains by calculating eigenvalues and eigenvectors of a matrix (built from the input data) via the Golub algorithm, in which the matrix is tri-diagonalised using Givens rotations. This method is very stable, but neither fast nor parallelised. The small amount of parallelism already present in the code is used for building the matrix, which is itself very time-consuming and exhibits significant load imbalance. Due to the sequential nature of the solver, all data must fit within a single memory space, and this is currently a major limitation. Because of these memory constraints the model is limited to around 10,000 parameters, which means that only a subset (often 1 in 20 points) of current satellite data can be studied. With the deployment of new observation satellites and technologies imminent, it is realistic that runs with over 100,000 parameters will be required in the future, but this is far beyond what the model is currently capable of.

We have replaced the bespoke solver with the SLEPc [2] package, which builds upon PETSc to provide eigensolvers. There are two advantages to this: firstly, we get numerous solvers out of the box which are trivial to swap in and out, so we can experiment with performance and stability; secondly, we are able to leverage the existing PETSc parallelism mechanisms. SLEPc/PETSc favours decomposing the matrix in a row-based fashion, where a number of matrix rows reside on each process. However, this raises a challenge when building up the matrix, due to its symmetry. Building the matrix naively would result in either a very uneven load (depending upon the number of points a process has in the upper part of the matrix) or a duplication of calculations. To address this we have developed an algorithm which evenly distributes the points of the matrix for building, by including a subset of points in both the upper and lower parts of the matrix. Whilst some communication is required once building has completed, the work is well balanced and hence far more efficient than the existing approach; it inherently works with the distributed data we require, and extra parallelism is possible by utilising multiple processes in the building of each row.
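The imbalance problem can be seen in a simplified sketch (illustrative only, not the actual MEME/BGS algorithm): if only upper-triangle entries of a symmetric n×n matrix are computed, row i contributes n − i entries, so equal row counts per process give very unequal work. Balancing by cumulative entry count instead gives each process a near-equal share.

```python
def balance_symmetric_rows(n, nprocs):
    """Illustrative row assignment for building a symmetric n x n matrix:
    cut the rows so each process gets roughly total/nprocs upper-triangle
    entries (row i contributes n - i), not an equal number of rows."""
    total = n * (n + 1) // 2
    target = total / nprocs
    assignment = [[] for _ in range(nprocs)]
    owner, acc = 0, 0
    for i in range(n):
        assignment[owner].append(i)
        acc += n - i                      # work contributed by row i
        # Advance to the next owner once its cumulative quota is met.
        if acc >= target * (owner + 1) and owner < nprocs - 1:
            owner += 1
    return assignment

parts = balance_symmetric_rows(1000, 4)
loads = [sum(1000 - i for i in rows) for rows in parts]
```

With 1000 rows on 4 processes, a naive equal-row split would give the first process roughly seven times the work of the last; the cut-by-quota version keeps every load within about 1% of the ideal.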

In this talk we will discuss our work modernising the solver and parallelism of this model, the suitability of SLEPc, and our algorithm for balancing the matrix building. We will illustrate the performance and scalability of the new code on ARCHER and describe the process adopted for providing confidence in the accuracy of results (which is very important to the community).


A directory/cache for leveraging the efficient use of distributed memory by task-based runtime systems
Tiberiu Rotaru, Bernd Lörwald, Nick Brown, Mirko Rahn, Olivier Aumage, Vicenc Beltran, Xavier Teruel, Jan Ciesko, Jakub Sístek

As the community progresses towards exascale the appearance of much more complex architectures than we are currently used to leveraging is likely. For instance, machines which are highly parallel, with a large number of multi-core nodes, deep memory hierarchies and complex interconnect topologies are around the corner. Efficiently programming these new systems will be a huge challenge, where programmers will not only need to fundamentally rethink their code to increase the level of parallelism by at least an order of magnitude, but to also address other issues such as resilience.

Task-based models [1] are one possible solution to this challenge, where parallelism is decomposed into many tasks. By rethinking their parallelism in the paradigm of tasks, programmers can significantly reduce synchronisation, which is key to achieving high levels of concurrency. The underlying task model decouples the management of parallelism from computation. This relieves application developers from dealing with lower-level details such as scheduling, memory management and resilience concerns that are tricky to manage in large computing systems. However, task-based models are not a silver bullet, and this paradigm often focuses on providing the abstraction of a single shared address space to the application programmer. To scale beyond a single physical memory space, they must be combined with some distributed-memory technology (e.g. MPI or GASPI). This interoperability is either handled at the task-based runtime level (and is implicit to the programmer) or involves explicit communication calls provided by the programmer within the tasks of their application code.

We have developed an API for a Directory/Cache [2] which can be integrated with task-based runtimes and seamlessly (to the applications programmer) provides the abstraction of a single shared address space. It supports interoperability between task-based models and distributed-memory technologies: whilst memory is physically distributed amongst the nodes, the Directory/Cache presents it to the applications programmer as a single, unified memory space. The directory tracks what data is physically stored where, and the cache is used for performance, to avoid frequently retrieving the same piece of remote data. The main purpose of the Directory/Cache is to provide a set of services that support task-based runtime systems efficiently running distributed applications, while being able to consistently manage data stored in distributed memory or in local caches.
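The two core ideas can be sketched in a few lines of Python (a hypothetical toy, not the InterTwinE Directory/Cache API itself): a directory maps each data chunk to its home node, and a local cache satisfies repeated reads without touching the transport layer again.

```python
# Minimal sketch of a directory plus cache. `fetch_remote` stands in for
# whatever transport (MPI, GASPI, ...) a real implementation would use.

class DirectoryCache:
    def __init__(self, fetch_remote):
        self.directory = {}               # chunk id -> owning node
        self.cache = {}                   # chunk id -> locally cached data
        self.fetch_remote = fetch_remote
        self.remote_fetches = 0           # instrumentation for this demo

    def register(self, chunk_id, node):
        """Record which node physically holds a chunk."""
        self.directory[chunk_id] = node

    def read(self, chunk_id):
        """Tasks see one address space: a read either hits the local
        cache or is transparently fetched from the owning node."""
        if chunk_id not in self.cache:
            node = self.directory[chunk_id]
            self.cache[chunk_id] = self.fetch_remote(node, chunk_id)
            self.remote_fetches += 1
        return self.cache[chunk_id]

store = {("n1", "a"): [1, 2, 3]}          # pretend remote storage
dc = DirectoryCache(lambda node, cid: store[(node, cid)])
dc.register("a", "n1")
first, second = dc.read("a"), dc.read("a")   # second read hits the cache
```

A real implementation additionally needs coherence (invalidation on writes) and eviction, which this sketch deliberately omits.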

We have developed a reference implementation of our Directory/Cache API which, to illustrate the abstract and generic nature of our API, has been or is being integrated with the runtimes of the OmpSs, StarPU, GPI-Space and PaRSEC task-based models. The Directory/Cache API allows runtimes to be completely independent from the physical representation of data and from the type of storage used. This facilitates access through the same interface to an extendable list of transport implementations using different communication libraries such as GASPI and MPI and even on-disk storage or tiered memory.

In this talk we will describe the main concepts behind our API and the underlying architecture of the reference implementation. We will also present the results of integrating the Directory/Cache with popular task-based models and specifically the performance and scalability that this affords to real-world applications utilising this technology.


Adapting CASTEP for the Exascale age: Hybrid OpenMP and vectorisation
Arjen Tamerus, Ed Higgins, Phil Hasnip

CASTEP [1] is a high-performance density functional theory code, used to simulate the chemical, electronic and mechanical properties of materials using a plane-wave basis set. Its use accounts for a significant percentage of the total compute cycles of Tier-1 and Tier-2 HPC facilities in the UK. Its parallel performance is achieved through a hybrid MPI and OpenMP implementation, which is efficient on current architectures. In this talk we present the work undertaken to prepare CASTEP for the Exascale generation, through the optimisation of the hybrid OpenMP-MPI parallel mode. This work also improves the intra-node scaling, and is achieved through optimisations to internal CASTEP routines, better use of threaded libraries and run-time optimisations. We will reflect on the challenges faced and how they affect performance and memory scalability.

With the limit of sequential performance clearly in sight and the move to wide vectors and SIMD-inspired accelerators to achieve FLOP targets, CASTEP has to adapt to make optimal use of these new technologies. We will discuss the in-progress work of improving the vectorisation capabilities of the code and improving memory management, benefitting highly parallel and heavily vectorised architectures like the Xeon Phi platform, as well as modern x86 CPUs like Intel’s Skylake-X and upcoming platforms.


OpenACC accelerator for the PN –PN-2 algorithm in Nek5000
Evelyn Otero, Jing Gong, Misun Min, Paul Fischer, Philipp Schlatter and Erwin Laure

Nek5000 is an open-source code for the simulation of incompressible flows. Nek5000 is widely used in a broad range of applications, including the study of thermal hydraulics in nuclear reactor cores, the modeling of ocean currents, and the study of stability, transition and turbulence on airplane wings. HPC architectures on the path to exascale are increasingly prevalent in the Top500 list, with CPU-based nodes enhanced by accelerators or co-processors optimized for floating-point calculations. We have previously presented case studies of partially porting Nek5000/Nekbone to parallel GPU-accelerated systems, see [1–3]. In this paper, we expand our previous work and take advantage of the optimized results to port the full version of Nek5000 to GPU-accelerated systems, in particular the PN–PN-2 algorithm. This algorithm is a way to de-couple the momentum from the pressure equations that does not lead to spurious pressure modes. It is more efficient than other methods, but it involves different approximation spaces for velocity (order N) and pressure (order N-2). The paper focuses on the technology watch of heterogeneous modelling and its impact on exascale architectures (e.g. GPU-accelerated systems). In fact, GPU accelerators can strongly speed up the most time-consuming parts of the code, running efficiently in parallel on thousands of cores. The goal of this work is to investigate whether the PN–PN-2 algorithm can take advantage of hybrid architectures and be used in Nek5000 to improve its scalability towards exascale. In this talk, we describe the GPU implementation of the PN–PN-2 algorithm in Nek5000, namely:

  • The use of GPU-direct to communicate directly between GPU memory spaces without
    involving the CPU memory. For this work, we use an OpenACC-accelerated version of
    Nek5000 in which this is already implemented in the MPI communication library gs [3].
  • Initial profiling of the code, identifying the most time-consuming subroutines and
    assessing their suitability for acceleration.
  • The implementation of the OpenACC version of the multigrid solver.

In addition we present initial performance results of the OpenACC version of the PN–PN-2 algorithm for a typical production problem. Finally we discuss the experience and the challenges we faced during this work.


First steps in porting the LFRic Weather and Climate model to the FPGAs of the EuroExa architecture
Mike Ashworth, Graham Riley, and Andrew Attwood

The EuroExa project proposes a High Performance Computing (HPC) architecture which is both scalable to exascale performance levels and delivers world-leading power efficiency. This is achieved through the use of low-power ARM processors together with closely coupled FPGA programmable components. In order to demonstrate the efficacy of the design, the EuroExa partners support a rich set of applications.

One such application is the new weather and climate model, LFRic (named in honour of Lewis Fry Richardson), which is being developed by the Met Office and its partners for operational deployment in the middle of the next decade. High quality forecasting of the weather on global, regional and local scales is of great importance to a wide range of human activities and exploitation of latest developments in HPC has always been of critical importance to the weather forecasting community.

The first EuroExa system is being built now and is due for first application access in mid-2018. In order to prepare for this we have been porting the LFRic model to a Zynq UltraScale+ ZCU102 Evaluation Platform. The initial approach is to study the LFRic code at three levels: the full application (Fortran), compact applications or “mini-apps”, and key computational kernels. An example of a kernel is the matrix-vector product, which contributes significantly to the execution time in the Helmholtz solver and elsewhere. Our first steps have been to evaluate the performance on the quad-core ARM Cortex-A53 CPU and to use Vivado HLS to generate IP blocks to run on the UltraScale+ FPGA.

The matrix-vector updates have been extracted into a kernel test program and converted to C. There are dependencies between some of the updates across the horizontal mesh and a colouring scheme is used in LFRic, such that nodes within a single ‘colour’ can be computed simultaneously. This is used to produce independent computations for multi-threading with OpenMP and can be exploited for the FPGA acceleration as well. As with all accelerator based solutions, a key optimization strategy is to minimize the overhead of transferring data between the CPU and the FPGA. We shall discuss how we have approached this for the LFRic code.
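The colouring idea can be illustrated with a small greedy colouring sketch (illustrative only, not LFRic's actual scheme): cells that share a mesh dependency receive different colours, so all cells of one colour can be updated concurrently by OpenMP threads or FPGA pipelines.

```python
# Greedy graph colouring: neighbouring (dependent) cells get different
# colours; cells sharing a colour are independent and can run in parallel.

def greedy_colouring(adjacency):
    """adjacency: dict cell -> set of neighbouring (dependent) cells."""
    colour = {}
    for cell in adjacency:
        used = {colour[n] for n in adjacency[cell] if n in colour}
        c = 0
        while c in used:          # pick the smallest unused colour
            c += 1
        colour[cell] = c
    return colour

# A tiny 2x2 mesh where horizontally/vertically adjacent cells conflict.
adj = {0: {1, 2}, 1: {0, 3}, 2: {0, 3}, 3: {1, 2}}
colours = greedy_colouring(adj)
```

For this mesh two colours suffice, i.e. half the cells can be updated in each parallel sweep.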

The Vivado HLS and Design Suite is only one programming model for porting applications onto the FPGA. In early 2018 we shall compare this approach with the OmpSs@FPGA system and the Maxeler MaxJ compiler, and by the time of the workshop we will be in a good position to present comparisons of the performance achieved so far, together with the ease-of-use, robustness and maturity of the tools.


NEXTGenSim: a workflow and hardware-aware workload simulator for HPC systems.
E. Farsarakis, N. Johnson

Efficient scheduling in any HPC system can make the difference between efficient and poor use of resources. Having the scheduler aware of the hardware configuration of nodes may offer opportunities to schedule more efficiently.

Previous workload simulators have focused predominantly on studying the effect of different scheduling algorithms such as First Come First Served or First Fit. As part of the NEXTGenIO project we have developed NEXTGenSim, a discrete-event, hardware- and workflow-aware scheduling simulator which estimates the execution time of applications based on specific hardware characteristics such as NVRAM.

Our work is focused on the unique nature of jobs that form workflows, i.e. jobs that have a data dependence on each other and must follow a specific order, and on how changes in scheduling policy can benefit the throughput of a system, with special consideration for novel HPC systems incorporating technologies such as NVRAM. Using NVRAM in a persistent storage state allows the scheduling of workflow jobs to nodes which are used in full or in part across the whole workflow. For example, if job A is followed by job B, it makes sense to run job B on the same nodes as job A. If persistent storage (via NVRAM) is available, then rather than moving results to a Lustre or GPFS filesystem, data can remain local to nodes between jobs, reducing the latency in beginning job B. Should the nodes of A not be available when B is ready to execute, the data must be moved to the relevant nodes or the job delayed.

By modelling the time to move data to and from Lustre/GPFS and the time saved by using persistent storage via NVRAM, we can experiment with different strategies for persisting intermediate products from workflow jobs. As input we can use both synthetic workloads and anonymised workloads from real HPC systems. For example, we can experiment with a strategy of only allowing job B to be run on the nodes of A, regardless of the delay time between them and how this impacts the overall completion of a workflow, or we can see how increasing or decreasing the time taken to move data to and from Lustre or GPFS changes the overall throughput of a system, and how using NVRAM storage when possible benefits this.
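The kind of trade-off the simulator evaluates can be sketched with a toy cost model (illustrative numbers and names, not NEXTGenSim's real API): reusing job A's nodes keeps intermediate data in node-local NVRAM, while any other placement pays a round trip through the parallel file system.

```python
# Toy data-movement cost model for a two-job workflow A -> B.

def data_movement_time(data_gb, pfs_bw_gbs, reuse_nodes, wait_s=0.0):
    """Seconds spent moving intermediate data between jobs A and B.
    reuse_nodes: B runs on A's nodes, so data persists in NVRAM
    (possibly after waiting wait_s for those nodes to free up)."""
    if reuse_nodes:
        return wait_s
    # Otherwise: write to Lustre/GPFS after A, read back before B.
    return 2 * data_gb / pfs_bw_gbs

move = data_movement_time(data_gb=500, pfs_bw_gbs=10, reuse_nodes=False)
stay = data_movement_time(data_gb=500, pfs_bw_gbs=10, reuse_nodes=True,
                          wait_s=30)
```

Here moving 500 GB through a 10 GB/s file system twice costs 100 s, so waiting 30 s for A's nodes is the better policy; a simulator can sweep these parameters over whole anonymised workloads.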


Extra-P: Automatic Empirical Performance Modeling of Parallel Programs
Alexandru Calotoiu, Torsten Hoefler, Sergei Shudler, and Felix Wolf

Once a program has been parallelized, its performance usually remains far from optimal. The process of performance optimization, which must consider the complex interplay between the algorithm and the hardware, is simply too difficult. Many parallel applications also suffer from latent performance limitations that may prevent them from scaling to larger problem or machine sizes. Often, such scalability bugs manifest themselves only when an attempt to scale the code is actually being made, a point at which remediation can already be difficult. Performance models allow such issues to be predicted before they become relevant. A performance model is a formula that expresses a performance metric of interest, such as execution time or energy consumption, as a function of one or more execution parameters, such as the size of the input problem or the number of processors. However, deriving such models analytically from the code is so laborious that too many application developers shy away from the effort.

In this talk, we will present Extra-P, a new performance-modeling tool that substantially improves both the coverage and the speed of performance modeling and analysis. By automatically generating an empirical performance model for each part of a parallel program, with respect to the variation of one or more relevant parameters such as process count or problem size, it becomes possible to easily identify those parts that will reduce performance at larger core counts or when solving a bigger problem. Specialized heuristics traverse the search space rapidly and generate insightful performance models. We will discuss case studies with large-scale applications in which we uncover both previously known and unknown performance bottlenecks. As a specific example, we will show how Extra-P can support co-design for task-based programming.

Task-based programming offers an elegant way to express units of computation and the dependencies among them, making it easier to distribute the computational load evenly across multiple cores. Unfortunately, finding a good match between input size and core count usually requires significant manual experimentation. Using Extra-P we can find the isoefficiency function of a task-based program, which binds efficiency, core count, and the input size in one analytical expression.
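The flavour of empirical model generation can be shown with a toy fit (a simplified sketch in the spirit of Extra-P, not its actual search heuristic): measure runtimes at a few scales, fit each candidate model shape by least squares, and keep the best-fitting one.

```python
import numpy as np

def fit_model(p, t):
    """p: core counts, t: measured times. Fit t ~ c0 + c1 * f(p) for a
    few candidate terms f and return the name of the best fit."""
    candidates = {
        "p":       p.astype(float),
        "p log p": p * np.log2(p),
        "p^2":     p.astype(float) ** 2,
    }
    best, best_err = None, np.inf
    for name, f in candidates.items():
        A = np.vstack([np.ones_like(f), f]).T
        coeffs, *_ = np.linalg.lstsq(A, t, rcond=None)
        err = np.sum((A @ coeffs - t) ** 2)   # sum of squared residuals
        if err < best_err:
            best, best_err = name, err
    return best

p = np.array([2, 4, 8, 16, 32])
t = 3.0 + 0.5 * p * np.log2(p)        # synthetic "p log p" behaviour
model = fit_model(p, t)
```

The real tool searches a much richer space of terms (fractional exponents, logarithm powers, multiple parameters), but the principle of selecting the hypothesis with the smallest residual is the same.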


Wavelet-Based Compression Algorithm
Patrick Vogler, Ulrich Rist

The steady increase of available computer resources has enabled engineers and scientists to use progressively more complex models to simulate a myriad of fluid flow problems. Yet, whereas modern high-performance computers (HPC) have seen a steady growth in computing power, the same trend has not been mirrored by a significant gain in data transfer rates. Current systems are capable of producing and processing large amounts of data quickly, while overall performance is often hampered by how fast a system can transfer and store the computed data. Considering that CFD researchers invariably seek to study simulations with ever higher spatial and temporal resolution, the imminent move to exascale computing will only exacerbate this problem [1]. By using otherwise wasted compute cycles to exploit a numerical dataset’s inherent statistical redundancies, one could create a more compact representation and thereby alleviate the I/O bottleneck. Since effective data storage is a pervasive problem in information technology, much effort has already been spent on adapting existing compression algorithms for floating-point arrays.

In this context, Loddoch and Schmalzl [1] have extended the Joint Photographic Experts Group (JPEG) standard to volumetric floating-point arrays by applying the one-dimensional real-to-real discrete cosine transform (DCT) along the axis of each spatial dimension, using a variable-length code to encode the resulting DCT coefficients. Lindstrom [2], on the other hand, uses a lifting-based integer-to-integer implementation of the discrete cosine transform, followed by an embedded block coding algorithm based on group testing. While these compression algorithms are simple and efficient in exploiting the low-frequency nature of most numerical datasets, their major disadvantage lies in the non-locality of the basis functions of the discrete cosine transform. Thus, if a DCT coefficient is quantized, the effect of the lossy compression stage will be felt throughout the entire flow field [3]. To alleviate this, the numerical field is typically divided into small blocks and the discrete cosine transform is applied to each block one at a time. While partitioning the flow field also facilitates random-access read and write operations, this approach gives rise to the block boundary artifacts that are synonymous with the JPEG compression standard.

In order to circumvent this problem we propose to adapt the JPEG-2000 (JP2) compression standard to volumetric floating-point arrays. In contrast to the baseline JPEG standard, JPEG-2000 employs a lifting-based one-dimensional discrete wavelet transform (DWT) that can be performed with either the reversible LeGall (5,3) tap filter for lossless coding or the irreversible Daubechies (9,7) tap filter for lossy coding [3]. Due to its time-frequency representation, which identifies the time or location at which various frequencies are present in the original signal, the discrete wavelet transform allows the entire frame to be decorrelated at once. This eliminates the blocking artifacts at high compression ratios commonly associated with the JPEG standard. We therefore demonstrate the viability of a wavelet-based compression scheme for large-scale numerical datasets.
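The reversibility of the lifting scheme is easy to demonstrate in a simplified 1-D sketch (the boundary handling below is a fixed reflection rule chosen for brevity, not the full JPEG-2000 extension scheme): a predict step forms detail coefficients from odd samples, an update step smooths the even samples, and running the same integer steps backwards restores the input exactly.

```python
# Simplified 1-D lifting implementation of a reversible LeGall (5,3)-style
# integer wavelet transform, with exact reconstruction.

def dwt53_forward(x):
    """x: even-length list of ints -> (lowpass s, highpass d)."""
    n, m = len(x), len(x) // 2
    mirror = lambda i: i if i < n else 2 * n - 2 - i   # reflect off the end
    # Predict: detail coefficients from odd samples.
    d = [x[2 * k + 1] - ((x[2 * k] + x[mirror(2 * k + 2)]) >> 1)
         for k in range(m)]
    # Update: smooth coefficients from even samples.
    s = [x[2 * k] + ((d[max(k - 1, 0)] + d[k] + 2) >> 2) for k in range(m)]
    return s, d

def dwt53_inverse(s, d):
    """Exact inverse: undo update, then predict, using the same rules."""
    m = len(s)
    n = 2 * m
    mirror = lambda i: i if i < n else 2 * n - 2 - i
    x = [0] * n
    for k in range(m):                                  # even samples first
        x[2 * k] = s[k] - ((d[max(k - 1, 0)] + d[k] + 2) >> 2)
    for k in range(m):                                  # then odd samples
        x[2 * k + 1] = d[k] + ((x[2 * k] + x[mirror(2 * k + 2)]) >> 1)
    return x

signal = [10, 12, 14, 13, 11, 9, 8, 10]
s, d = dwt53_forward(signal)
restored = dwt53_inverse(s, d)
```

Because forward and inverse apply identical integer operations and the identical boundary rule, the round trip is bit-exact, which is what makes the (5,3) path suitable for lossless coding.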


Controllable data precision in the Met Office Unified Model
Richard Gilham, and Paul Selwood

The Unified Model (UM) is a million-line, 30-year-old, but still very actively developed, Fortran codebase that is at the heart of the Met Office’s weather forecasts and climate research. Despite its age, the UM is a cutting-edge model scientifically, and is successfully used for research and operations at scales from a desktop computer, to several hundred compute nodes on very large supercomputers.

The huge socio-economic benefits of accurate forecasts mean that there is a constant push to improve the compute efficiency of the UM to free up resources for further improvements. For example, efforts to improve both shared and distributed memory parallelism have allowed significant increases in model resolution and ensemble size. The subject of this talk, however, is exploring the possibility of reducing the precision of the model from double to single precision, and potentially beyond [1]. Indications from comparable codes show a possible 40% saving in compute. Moreover, trends in future HPC architectures indicate that single precision capability would be highly advantageous for portability as well as performance. Single precision compute in the UM’s numerical solver algorithms has already been proven and used operationally, providing significant compute savings.

In a proof-of-concept project, a specific scientific section of the UM was targeted and made ‘precision-aware’. A bottom-up approach ensured that the project was of tractable magnitude, and that the benefits may be readily ‘pulled-through’ from research to operations. The hardest challenges in the project were working within the conservative working practices for an operational model, and overcoming subtle but frustrating technical debt. As well as demonstrating the technical feasibility of the approach, a saving of approximately 5% on total model runtime was realised for negligible scientific impact. Savings within the targeted section were around 40%, in line with expectations. Future work would look to extend this methodology to more scientific sections of the model.
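The 'precision-aware' pattern can be illustrated in miniature (a toy in Python with NumPy, not UM code, where the same effect is achieved with Fortran kind parameters): the working precision becomes a parameter of the routine, so the same science code can run in single or double precision and the numerical impact can be measured directly.

```python
import numpy as np

def damped_sum(x, dtype):
    """A toy reduction performed entirely in the requested precision."""
    acc = dtype(0.0)
    for v in x.astype(dtype):
        acc = dtype(acc + v * dtype(0.1))
    return acc

x = np.linspace(0.0, 1.0, 10_000)
double = damped_sum(x, np.float64)     # reference result
single = damped_sum(x, np.float32)     # same code, half the precision
rel_err = abs(float(single) - float(double)) / abs(float(double))
```

For this benign reduction the single-precision answer differs only at the level of rounding noise, which is the kind of evidence needed before switching a model section to single precision; ill-conditioned sections would show much larger deviations and must stay in double.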


Running Distributed Computations Using the SpiNNaker platform
Alan Barry Stokes, Andrew Rowley, Christian Brenninkmeijer, Donal Fellow, Andrew Gait, Oliver Rhodes, and Steve Furber

The SpiNNaker platform is a well-known neuromorphic computing platform within the Human Brain Project, designed to run large-scale spiking neural networks in real time. It consists of up to a million low-power ARM processors (each running at 200 MHz), so applications must be highly parallel in nature to make the most of the platform. Neural networks and the shallow water equations are perfect examples of such applications, as each individual neuron or cell can be evaluated in parallel. The one-million-core machine, when fully loaded, is estimated to use 100 kW, and is therefore significantly cheaper to run than most traditional HPC clusters of comparable scale. Such infrastructures could address the power consumption problem of exascale machines.

Due to the complexity of using the SpiNNaker platform, we have developed a software stack that maps applications described as a graph (where vertices represent computation and edges represent communication of data between vertices) onto the SpiNNaker platform, whilst also managing the runtime execution and data extraction process for the applications. We believe that representing the application problem as a graph, or an equivalent format, provides more potential for detecting areas of parallelism than sequentially written code bases. Communication between vertices in SpiNNaker is carried out through small data packets (32- or 64-bit packet sizes) which are multicast throughout the network by routers on each chip. We have not yet implemented an MPI or OpenMP interface, but instead support an asynchronous, interrupt-based API where cores are informed when a packet has been received. This does not mean MPI is not possible with SpiNNaker: the communications network lends itself to an MPI implementation with the restriction that the network packets are small (and we provide a 1-to-1 message format for larger data packets). Splitting an application into components small enough to utilise the one-million-core machine can also reduce each core's communication requirement to a size manageable by the SpiNNaker communication infrastructure.
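The programming style described above can be caricatured in a few lines (a toy event-driven model, not the real SpiNNaker API): vertices do work only when a small multicast packet arrives via a callback, rather than through MPI-style two-sided calls.

```python
# Toy graph-of-vertices model: edges carry small packets, and delivery
# triggers an interrupt-style callback on the destination vertex.

class Vertex:
    def __init__(self, name):
        self.name, self.received = name, []

    def on_packet(self, payload):
        """Callback fired when a packet is delivered to this vertex."""
        self.received.append(payload)

def multicast(edges, source, payload):
    """Deliver one small packet along every outgoing edge of `source`,
    mimicking the router-based multicast on each SpiNNaker chip."""
    for dst in edges.get(source, []):
        dst.on_packet(payload)

a, b, c = Vertex("a"), Vertex("b"), Vertex("c")
edges = {"a": [b, c]}          # computation graph: a feeds b and c
multicast(edges, "a", 7)       # b and c both react; a stays idle
```

The graph structure makes the available parallelism explicit: any vertices with no path between them can be placed on different cores and run concurrently.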

We believe there are other sets of applications, apart from neural networks and the shallow water equations, that can be easily represented by a graph and can therefore be executed efficiently upon such architectures. For such applications, the massive parallelism provided by the SpiNNaker platform can improve upon current infrastructures in terms of speed and energy consumption.

In this talk we will discuss the SpiNNaker architecture, software stack and programming paradigm, relate lessons we have learnt whilst using the SpiNNaker architecture to the challenges of reaching exascale, and explain why applications can see improvements when written, or where possible rewritten, in a graph representation.


Performance Tracing of Heterogeneous Exascale Applications: Pitfalls and Opportunities
Holger Brunst, Christian Herold, Matthias Weber

Future Exascale systems are expected to introduce an unprecedented degree of heterogeneity, as pointed out by DOE’s ASCR Program Manager Dr. Lucy Nowell at the recent VPA17 workshop at SC17. Dr. Nowell also stated that the induced adaption and redesign process of highly distributed algorithms and applications requires support from the debugging and performance tools community. Because of that, we expect performance tools to be early adopters of the emerging system architectures, as they need to be one step ahead by definition. This position paper lists and discusses both tool pitfalls and opportunities arising from infrastructure and paradigm changes in the near future. It focuses on in-situ performance trace data processing, persistent memory exploitation, scripted applications and usability.

The sheer complexity of an Exascale hardware and software stack calls for holistic performance review at all layers of abstraction. Unfortunately, a non-perturbing monitoring solution with dedicated hardware is normally impracticable for economic reasons. Optional software-based monitoring is feasible, but the resulting perturbation will grow with the number of monitored parameters. An iterative refinement process over a changing set of parameters seems most realistic, yet it is only practical when performed in-situ within one and the same run, due to long application runtimes and startup overheads. We will present how data collection, selection and visualization need to be rethought for the Exascale.

Node-local large persistent memory, similar to Cray’s DataWarp or Intel’s 3D XPoint products, will be located in very close proximity to the CPUs. This will have an impact on application file data handling and scheduling. I/O usage patterns that were prohibitive performance-wise in the past might suddenly become top-notch. Understanding I/O performance requires a thorough picture of the actions in the increasingly deep I/O stack. Again, we expect that it will not be feasible to record all relevant information at the same time, which takes us back to the aforementioned runtime data selection. We will present new ways of in-depth I/O tracing and visualization designed for Exascale.

Scripted application workflows in Python and the like are likely to gain further importance, and they are not easy to study with traditional performance tools due to their complex virtual runtime environment. Scripting approaches enable new science communities without a traditional HPC background to enter the computing domain, which reveals tool usability issues to be discussed in the Exascale context. Our usability analysis is backed by customer feedback on the Vampir performance visualizer, which we have developed in-house for many years.


In-flight ensemble processing for exascale
Jeff Cole, Bryan Lawrence, Grenville Lister, Yann Meursdesoif, Rupert Nash, and Michèle Weiland

Weather and climate science makes heavy use of ensembles of model simulations to provide estimates of the uncertainty arising from a range of causes. Current practice is to write each ensemble member (simulation) out to disk as it runs, and to carry out an ensemble analysis at the end of the simulation. Such analysis will include simple statistics as well as detailed analysis of some ensemble members. However, as model resolutions increase (with more data per simulation) and ensemble sizes increase (more instances), the storage and analysis of this data is becoming prohibitively expensive: many major weather and climate sites are looking at managing in excess of an exabyte of data within the next few years. This becomes problematic in an environment where we anticipate running such ensembles on exascale machines which may not themselves include local storage of sufficient size for data to be resident during long periods of analysis.

There are only two possible strategies to cope with this data deluge: data compression (including “thinning”, that is, the removal of data from the output) and in-flight analysis. We discuss here some first steps with the latter approach. We exploit the XML IO server (XIOS) to manage the output from simulations and to carry out some initial analysis en route to storage. We have achieved three specific ambitions: (1) we have adapted a current branch of the Met Office Unified Model to replace most of the diagnostic system with XIOS; (2) we have exploited a single-executable MPI environment to run multiple UM instances with output sent to XIOS; and (3) we have demonstrated that simple ensemble statistics can be calculated in-flight, including both summary statistics of individual ensemble members and cross-member statistics such as means and extremes.

With this ability, we can in principle avoid having all data reside on fast disk when the ensemble simulation is complete. This would allow, for example, deployment on an exascale machine with a burst buffer migrating data directly to tape (or to the wide area network). There are some issues yet to be resolved. In particular, we need to manage the MPI context to explicitly handle errors propagating up from an ensemble member (an errant ensemble member could otherwise halt the execution), and we need to consider how to bring third-party data into the XIOS context so that non-linear comparisons can be calculated and averaged at run time. Neither of these is expected to be very difficult, but they will involve further engineering. In the longer term, in-flight analysis will have to address some form of steering where not all ensemble members are output for the entire duration of the simulation, but even this interim method will help with data management. It will be possible to identify “interesting” ensemble members from summary statistics and keep them online for more detailed analysis, while less (initially) interesting ensemble members can be migrated more rapidly to colder storage for later analysis.
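The cross-member statistics described above can be computed one member at a time, so the full ensemble never needs to be held together on fast disk. A minimal sketch (ours, not the XIOS implementation) of such running in-flight statistics:

```python
# Running cross-member ensemble statistics, computed "in flight" one
# member's field at a time (our sketch, not the XIOS implementation).
# Only the accumulators live in memory; the ensemble itself need not.

def ensemble_stats(member_fields):
    """member_fields: iterable of equal-length lists, one per member."""
    count = 0
    mean = lo = hi = None
    for field in member_fields:
        count += 1
        if mean is None:
            mean, lo, hi = list(field), list(field), list(field)
        else:
            for i, v in enumerate(field):
                mean[i] += (v - mean[i]) / count   # running mean
                lo[i] = min(lo[i], v)              # cross-member extremes
                hi[i] = max(hi[i], v)
    return mean, lo, hi

# three ensemble members, each with a two-point field:
mean, lo, hi = ensemble_stats([[1.0, 4.0], [3.0, 0.0], [2.0, 2.0]])
```

The same accumulators can be updated per output timestep, which is exactly what makes the statistics available before any member reaches storage.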

Tuesday Paper Abstracts http://www.easc2018.ed.ac.uk/tuesday-paper-abstracts/ Fri, 02 Mar 2018 16:17:14 +0000 http://www.easc2018.ed.ac.uk/?p=371 Morning Parallel Session A

Morning Parallel Session B

Afternoon Parallel Session A

Afternoon Parallel Session B


On the Calculation Distribution between CPU and GPU in the Hybrid Supercomputing the Radiation Transport
Roman Uskov, Boris Chetverushkin, Vladimir Gasilov, Mikhail Markov, Olga Olkhovskaya and Mikhail Zhukovskiy

 

An algorithm for Monte Carlo radiation transport modelling is developed for simulating the interaction between ionizing radiation and the matter of complex technical objects using supercomputers of heterogeneous architecture (hybrid computing clusters, HCC). An HCC comprises a number of nodes, each containing a CPU and a GPU. The MPI, OpenMP and CUDA technologies are used for supercomputing. Distribution of computing between nodes is carried out with MPI. Data exchange only happens between nodes at the beginning and at the end of the simulation; therefore, the MPI parallelization scales almost perfectly. GPU utilization and the distribution of calculations between CPU and GPU within a single node are handled by means of CUDA. The approach to GPU utilization is not obvious: every kernel is written by the author of the software, as no off-the-shelf GPU solutions exist for the problem in question.
The Monte Carlo simulation of radiation transport [2] is based on building random trajectories of the radiation particles. Different particles require different amounts of computation to construct; moreover, various parts of the trajectory algorithm demand various amounts of computing resources. The basic principle of the distribution between CPU and GPU is to carry out computation with high calculation load on the GPU and with low load on the CPU. For instance, the construction of photon trajectories can be split into a geometrical part and a physical one. Analysis of the computational load shows the following: the geometrical component of the algorithm (tracing the object) requires a huge number of simple calculations and is therefore performed on the GPU, whereas the physical calculations (simulating the interaction between photon and atom) are carried out on the CPU. For electron trajectories, all of the computation except for exchange operations is performed on the GPU. The developed algorithm is implemented as parallel software on the exascale-system prototype HCC K-100 (http://www.kiam.ru/MVS/resourses/k100.html).
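The geometry/physics split can be illustrated with a toy Monte Carlo photon calculation. This is our own sketch, not the authors’ code: the free-path sampling stands in for the “geometrical” tracing work offloaded to the GPU, and the interaction sampling for the “physical” part kept on the CPU (a pure absorber here, so the Beer–Lambert law gives the expected answer):

```python
import math
import random

# Toy Monte Carlo photon transport through a homogeneous slab (our
# sketch, not the authors' code). The "geometry" step samples the free
# path; the "physics" step samples the interaction (pure absorption).

def transmitted_fraction(mu, thickness, n_photons, rng):
    survived = 0
    for _ in range(n_photons):
        # geometry: distance to the first interaction, ~ Exp(mu)
        x = -math.log(1.0 - rng.random()) / mu
        if x >= thickness:
            survived += 1      # photon escapes the slab
        # physics: otherwise the photon is absorbed (nothing to do here)
    return survived / n_photons

rng = random.Random(42)
frac = transmitted_fraction(mu=1.0, thickness=1.0, n_photons=20000, rng=rng)
# Beer-Lambert predicts exp(-mu * thickness), i.e. about 0.368
```

In the real code each trajectory also scatters and branches, which is why the per-particle work varies so strongly and motivates the CPU/GPU split.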

Case study: Bohrium – Powering Oceanography Simulation
Mads Ohm Larsen and Dion Häfner

In the field of oceanography, numerical simulations have been used for more than 50 years. These simulations often require long integration times and can run for several real-time months. They are thus often written in high-performance languages such as C or Fortran, both of which are often thought of as complex to use. Fortunately, we have seen a shift in the scientific programming community towards focusing more on productivity as opposed to just performance. We would, however, like to keep the performance of these archaic languages. Academic code often has a limited lifespan because developers change as people graduate and new people arrive; having to spend a long time understanding the simulations takes away from the actual science being done. Veros is a Python translation of an existing oceanography project written in Fortran. It utilizes NumPy for its vector computations. To rectify the resulting performance loss, Veros has chosen Bohrium as its computational back-end. Bohrium is a run-time framework that allows sequential interpreted code to be easily parallelized and possibly run on GPGPUs. This is done by just-in-time (JIT) compiling generated OpenMP, OpenCL, or CUDA kernels and running them on the appropriate hardware. Bohrium also supports multiple front-ends, namely Python, C, and C++. In Python this is done with minimal intrusion: no annotations are needed; as long as the code utilizes NumPy functions, Bohrium will override these and replace them with JIT-compiled kernels. Bohrium has its own intermediate representation (IR), an instruction set gathered from the interpreted code up to a side effect, for example I/O. Well-known compiler optimizations, such as constant folding and strength reduction, can be applied to the IR prior to generating the kernels. Other optimizations include using established libraries such as BLAS for low-level linear algebra.
Bindings to the appropriate BLAS library on your system are auto-generated when Bohrium is compiled. This means that if you have, for example, clBLAS installed, Bohrium will create bindings to it that can be utilized directly from Python. These will also overwrite the NumPy methods already using BLAS, for even better performance.

Using Bohrium of course comes with an overhead in the form of generating the kernels. Fortunately, this overhead is amortized for larger simulations. For the Veros project we see that Bohrium is roughly an order of magnitude faster than the same implementation using Fortran or NumPy in the benchmarks. However, a parallel Fortran implementation using MPI for communication is faster still. In the future we would like to utilize distributed-memory systems with Bohrium, so we can run even larger problem sizes, using possibly multiple terabytes of memory.
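The deferred-execution idea behind Bohrium’s IR can be sketched in a few lines. The following toy (ours, vastly simpler than Bohrium’s actual IR) records operations lazily as an expression tree and constant-folds it before anything would be “executed”:

```python
# Toy lazy-evaluation IR with constant folding (our sketch, far simpler
# than Bohrium's). Operations build a tree instead of computing; a side
# effect (e.g. I/O) would trigger folding and kernel generation.

class Lazy:
    def __init__(self, op, args):
        self.op, self.args = op, args
    def __add__(self, other):
        return Lazy('+', [self, other])
    def __mul__(self, other):
        return Lazy('*', [self, other])

def const(v):
    return Lazy('const', [v])

def fold(node):
    """Constant-fold the recorded expression before code generation."""
    if node.op == 'const':
        return node
    args = [fold(a) for a in node.args]
    if all(a.op == 'const' for a in args):
        a, b = args[0].args[0], args[1].args[0]
        return const(a + b if node.op == '+' else a * b)
    return Lazy(node.op, args)

expr = const(2) * const(3) + const(4)   # nothing is computed yet
folded = fold(expr)                     # a side effect would force this
```

In Bohrium the folded instruction stream is then JIT-compiled into an OpenMP, OpenCL, or CUDA kernel; the toy stops at the folded tree.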


Roadmap to Exascale for Nek5000
Adam Peplinski, Evelyn Otero, Paul Fischer, Stefan Kerkemeier, Jing Gong and Philipp Schlatter

Nek5000 is a highly scalable spectral element code for simulation of turbulent flows in complex domains. As a Gordon Bell winner, it has strong-scaled to over a million MPI ranks, sustaining over 60 percent efficiency on 1048576 ranks of Mira with just 2000 points per core. This is in line with efficiency expectations for multigrid on this architecture at these scales. Moreover, a Nek5000-derived miniapp, Nekbone, was developed, which has sustained 1.2 PFLOPS on six million ranks of Sequoia (6% of peak).
In this talk, we will present the main characteristics of Nek5000 on its path towards exascale performance. The overall efficiency of Nek5000 derives from:

  • Stable high-order discretizations that, for a given accuracy, require significantly less data movement and fewer flops than their low-order counterparts.
  • Communication-minimal bases requiring only C0 continuity between elements and, hence, exchange of only surface data between processors.
  • The use of fast matrix-free formulations based on tensor-product operators.
  • Tensor-product contractions based on highly-optimized small matrix-matrix product kernels.
  • An efficient and scalable communication framework, gslib, that has only O(n) or O(n log P) complexity for both execute and setup phases, and has been deployed to over six million MPI ranks.
  • A scalable p-multigrid solver that uses fast tensor-product-based smoothers coupled with an unstructured algebraic multigrid solver for the global coarse-grid solve, with corresponding parallel setup.
Exascale extensions of Nek5000 will be somewhat dependent on emergent architectures, but a significant trend is towards the use of accelerators (e.g., GPUs). At present, Nek5000 is being ported to GPUs using OpenACC and OCCA. Other aspects necessary for large-scale simulations, such as error estimators, non-conformal meshes and adaptive simulations, will also be discussed. Several members of the Nek5000 development team are part of the US Department of Energy co-design Center for Efficient Exascale Discretizations (CEED). Another consortium also driving the exascale development of Nek5000 is the SeRC Exascale Simulation Software Initiative (SESSI), which aims at performance and scalability improvements for a number of widely-used codes.
In summary, the overall performance of an exascale code derives from the product S_P = η_P · P · S_1, where S_P is the sustained flop rate on P processors, η_P is the strong-scaling efficiency, and S_1 is the single-process rate. Nek5000 has an established track record of sustaining near-unity efficiency for the anticipated exascale values of P. Boosting S_1 on future complex nodes is the high priority of our current exascale development.

Port Out, Motherboard Home: Accelerating CASTEP on CPU-GPU clusters
Matthew Smith, Arjen Tamerus, and Phil Hasnip

Heterogeneous computer systems which use CPUs and GPUs in tandem to accelerate computation are well-established in HPC clusters, and are a candidate technological route to exascale computing. Optimal software performance on such massively-parallel systems involves the exploitation of distributed-memory parallelism on the CPU and the offloading of computationally intensive tasks to the GPU. A major goal of software design is therefore to marry these two elements and thereby maximise CPU and GPU computation while minimising CPU-CPU and CPU-GPU communications.

Here we present recent work undertaken towards achieving this goal for CASTEP, the UK’s premier quantum-mechanical materials-modelling software. We describe our approach to enabling accelerator compatibility for this mature code, using our hybrid OpenACC+MPI implementation, as well as our use of accelerator libraries including cuFFT and MAGMA. The gains in performance afforded by these developments are illustrated with results from UK HPC facilities.


MPI Storage Windows for a Data-centric Era
Sergio Rivas-Gomez, Stefano Markidis, Erwin Laure and Sai Narasimhamurthy

Even though breaking the ExaFLOP barrier is expected to become one of the major computing milestones of the next decade, several challenges remain of paramount importance for the success of the Exascale supercomputer. One such challenge is the bandwidth and access latency of the I/O subsystem, projected to remain roughly constant while the concurrency of Exascale machines increases approximately 4000×. In addition, with the integration of emerging deep-learning and data-analytics applications on HPC, the chances of unexpected failures at Exascale will considerably rise as well. To overcome some of these limitations, upcoming large-scale systems will feature a variety of Non-Volatile RAM (NVRAM), next to traditional hard disks and conventional DRAM. Emerging non-volatile solid-state technologies, such as flash, phase-change and spin-transfer torque memories, are used to decrease the existing gap between memory and storage. The integration of these technologies provides several advantages (e.g., data locality) that can potentially reduce the overall power consumption and I/O access latency of HPC applications. In this presentation, we address the challenge of adapting MPI to the changes in the memory and storage hierarchies of Exascale supercomputers.

We present the concept of MPI storage windows, an extension to the MPI one-sided communication model that provides a single interface for programming memory and storage. We illustrate its benefits for out-of-core execution and parallel I/O, and present a novel fault-tolerance mechanism based on this concept. Preliminary performance results indicate that our approach incurs negligible performance differences on real-world applications compared to traditional MPI memory windows. Additionally, we present heterogeneous window allocations, which provide a unified virtual address space for memory and storage. Results on out-of-core execution show less than a 40% performance penalty while exceeding the main memory capacity of the compute nodes.
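The mechanism underlying storage windows can be illustrated with a plain memory-mapped file. This sketch uses Python’s mmap rather than the MPI interface, but shows the same principle: stores through the “window” persist in storage beyond the mapping, which is what enables out-of-core data sets and recovery after failure:

```python
import mmap
import os
import tempfile

# Illustration of the storage-window idea with a memory-mapped file
# (our sketch; the real feature extends MPI one-sided windows).

path = os.path.join(tempfile.mkdtemp(), "window.dat")
size = 4096
with open(path, "wb") as f:
    f.write(b"\0" * size)              # back the "window" with storage

with open(path, "r+b") as f:
    win = mmap.mmap(f.fileno(), size)  # the mapped "storage window"
    win[0:5] = b"hello"                # looks like an ordinary store
    win.flush()                        # persist, cf. a window sync
    win.close()

with open(path, "rb") as f:
    data = f.read(5)                   # the data outlives the mapping
```

In the MPI extension the same pattern is expressed through the familiar window-allocation and synchronisation calls, so application code need not distinguish memory from storage.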


Performance of HPC I/O Strategies at Scale
Keeran Brabazon, Oliver Perks, Stefano Markidis, Ivy Bo Peng, Sergio Rivas Gomez and Adrian Jackson

In order to address the widening gap between compute and I/O as we move towards extreme and exascale systems, I/O strategies and subsystems are going to become more versatile and more complex. We have already seen a move to take advantage of fast storage nodes in the form of burst buffers in production HPC systems, as well as the use of parallel I/O libraries (such as MPI-IO and HDF5) and programmer-directed I/O. With this added complexity, a user of an HPC system needs to be informed of the advantages and disadvantages of different I/O strategies. In this presentation, we consider the performance of an HPC filesystem for a real-world application, rather than benchmarks or mini-apps. The application under consideration is a plasma physics simulation (iPIC3D) developed by KTH, Stockholm. Recent work at KTH has involved developing runtimes for efficient I/O at scale, and iPIC3D has been extended so that the I/O scheme used by the application can easily be switched at compile time. Performance data was gathered for weak-scaling experiments of iPIC3D using different I/O methods, collected every day over a two-month period. The focus is on the performance of write operations, as this is what matters in traditional HPC simulations, where initial data are read during initialisation and snapshots of the simulation state are recorded at different points in simulation time. Program-internal data is captured using the Arm MAP profiling tool, which gathers a rich set of performance metrics from within an application. The overhead of the tool is measured at less than 5% of overall run-time, meaning that performance data is gathered for close-to-production runs.

Variations in observed application performance are correlated with system load and I/O subsystem performance. Our conclusions compare and contrast current I/O paradigms, and take a forward-looking view on the suitability of different I/O runtimes for the extreme scales emerging in the coming years.


Evaluation Methodology of an NVRAM-based Platform for the Exascale
Juan F.R. Herrera, Suraj Prabhakaran, Michèle Weiland, and Mark Parsons

One of the major roadblocks to building an HPC system capable of Exascale computation is the I/O bottleneck. Current systems are capable of processing data quickly, but speeds are limited by how fast the system is able to read and write data. This represents a significant loss of time and energy in the system. Being able to widen, and ultimately eliminate, this bottleneck would greatly increase the performance and efficiency of HPC systems. The NEXTGenIO project is investigating this issue by bridging the latency gap between memory and disk through the use of non-volatile memory, which will sit between conventional DDR memory and disk storage. In addition to the hardware built as part of the project, the project will develop the software stack (from OS and runtime support to programming models and tools) that goes hand-in-hand with this new hardware architecture. This addresses a key challenge not only for Exascale, but for HPC and data-intensive computing in general: the challenge of I/O performance.

An application suite of eight memory- and I/O-bound applications has been selected, alongside a set of test cases for each application, to evaluate the platform’s effectiveness regarding I/O performance and throughput. The application suite covers a wide range of fields, from computer-aided engineering to meteorology, computational chemistry, and machine learning. The output of the evaluation will document the benefits of the NEXTGenIO technology, and indicate its impact and future lines of development. Three measurement scenarios are defined to assess the specific benefits of the NEXTGenIO technology:

  • Baseline measurement in today’s systems.
  • Measurements on the NEXTGenIO platform without the use of non-volatile memory.
  • Measurements on the NEXTGenIO platform with the use of non-volatile memory.

The profiling tools Allinea MAP and Score-P are used to collect the metrics needed to evaluate the performance of the applications for each scenario. These tools have been extended to support performance analysis with non-volatile memory. We will present our methodology for evaluating the NEXTGenIO platform and show early memory and I/O profiling results for the applications. We will discuss how NVRAM will impact the performance of these applications.


Scalable IO FuNnelling for ECMWF’s Integrated Forecast System
James Hawkes, Tiago Quintino

ECMWF’s IFS (Integrated Forecast System) uses an IO-server architecture to improve the scalability of data output. The IO server splits the global MPI communicator, dedicating some processes to the scientific model whilst funnelling IO through dedicated IO processes. The IO processes are responsible for collating, buffering, and encoding output data, leaving the model processes free to continue computing without blocking, so long as the IO-process buffers are not full. All of the output routines should scale efficiently with the number of processes and the number of fields, by avoiding global synchronization and distributing the fields evenly between IO processes. The funnelling method allows control over contention for IO hardware, and also provides the opportunity to perform global operations (such as encoding) on collated output data, without expensive collective communications.
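The funnelling scheme can be sketched as follows. This toy model (ours, not the IFS code) shows an even, round-robin field distribution and the per-IO-process buffering that lets model ranks hand off output without blocking:

```python
from collections import deque

# Toy IO-server funnelling model (our sketch, not the IFS code):
# fields are distributed evenly across dedicated IO processes by
# round-robin, and each IO process buffers output so model ranks only
# block when a buffer is full.

class IOServer:
    def __init__(self, n_io, capacity):
        self.buffers = [deque() for _ in range(n_io)]
        self.capacity = capacity

    def owner(self, field_id):
        return field_id % len(self.buffers)   # even field distribution

    def submit(self, field_id, data):
        buf = self.buffers[self.owner(field_id)]
        if len(buf) >= self.capacity:
            return False                      # model rank would block
        buf.append((field_id, data))          # non-blocking hand-off
        return True

    def drain(self, io_rank):
        out = list(self.buffers[io_rank])
        self.buffers[io_rank].clear()
        return out                            # collate/encode/write here

srv = IOServer(n_io=2, capacity=8)
for fid in range(6):
    srv.submit(fid, data="step0/field%d" % fid)
```

Because ownership is a pure function of the field id, no global synchronisation is needed to decide where a field goes.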

The IO server architecture was originally developed at Météo France, and later ported to the IFS atmospheric model. In this presentation, the authors describe the architecture of the IO server and its advantages compared to alternative methods. We present recent developments in coupling the non-atmospheric wave model with the IO server, and demonstrate the realized improvements to overall scalability and performance. Furthermore, we discuss the integration of the IO server with the downstream IO stack and future plans for closer integration with post-processing services.


Progressive load balancing of asynchronous algorithms in distributed memory
Justs Zarins, Michèle Weiland

As supercomputers grow in size, running large-scale, tightly-coupled applications efficiently is becoming more difficult. A key component of the problem is the cost of synchronisation, which increases with system noise and performance variability; this affects even high-end HPC machines. An exciting and promising approach to addressing this problem is to stop enforcing synchronisation points. This results in what are known as “asynchronous” or “chaotic” algorithms; commonly they are iteratively convergent. The cores are allowed to compute using whatever latest data is available to them, which might be “stale”, instead of waiting for other threads to catch up. Existing applications of this methodology show good performance and fault tolerance relative to their synchronous counterparts.

While asynchrony removes the computational cost of requiring all data to arrive at the same time, a different cost takes its place: progress imbalance. This is natural, because synchronisation points exist to coordinate progress. An imbalance in progress can result in slower convergence, or even failure to converge, as old data is used for updates. This can be countered by putting a strict bound on how stale data is allowed to be, but at a cost to performance. As an alternative, the authors of [2] introduce the idea of progressive load balancing: balancing asynchronous algorithms over time, as opposed to instantaneously. Instead of fine-tuning iteration rates, parts of the working set are periodically moved between computing threads on a node. As a result, progress imbalance is limited without adding a large overhead. The approach is similar to bounded staleness, but it continues to work efficiently in the presence of continuous progress imbalance, which may be caused by, for example, hardware performance variability or workload imbalance. The authors tested the approach running Jacobi’s method on a single compute node and found it increased the iteration rate and decreased progress imbalance between parts of the solution space. Here we present an extension of progressive load balancing to the distributed-memory setting.
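A toy example of such a chaotic iteration, our sketch rather than the authors’ code: Jacobi-style updates for a 1-D Laplace problem applied in an arbitrary, unsynchronised order, using whatever neighbour values happen to be available, still converge to the fixed point:

```python
import random

# Chaotic relaxation on a 1-D Laplace problem (our sketch). Updates run
# in an arbitrary order using whatever neighbour values are currently
# available ("stale" or fresh); for this diagonally dominant system the
# iteration still converges to the fixed point, a straight line between
# the boundary values.

def chaotic_relaxation(n, sweeps, rng):
    u = [0.0] * n
    u[0], u[-1] = 1.0, 0.0                      # boundary conditions
    interior = list(range(1, n - 1))
    for _ in range(sweeps):
        rng.shuffle(interior)                   # no synchronised order
        for i in interior:
            u[i] = 0.5 * (u[i - 1] + u[i + 1])  # uses available data
    return u

u = chaotic_relaxation(n=11, sweeps=400, rng=random.Random(1))
# converged solution: u[i] close to 1 - i/10
```

Progress imbalance would correspond to some indices being updated far more often than others; the progressive load balancing described above limits that by migrating parts of the index set between workers.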

We start by evaluating the extent to which progress imbalance can be reduced by running independent load balancing on each node. Additionally, we evaluate an implementation where load balancing is allowed to take place across nodes periodically to account for cases where a whole node is slow. Finally, we draw conclusions about the benefits and challenges of this approach in the context of future exascale applications.


An Exploration of Fault Resilience Protocols for Large-Scale Application Execution on Exascale Computing Platforms
Daniel Dauwe, Sudeep Pasricha, Anthony A. Maciejewski, and Howard Jay Siegel

The probability of applications experiencing failures in today’s high performance computing (HPC) systems has increased significantly with the increase in the number of system nodes. It is expected that exascale-sized systems are likely to operate with mean time between failures (MTBF) of as little as a few minutes, causing frequent interrupts in application execution as well as substantially greater energy costs in a system. Periodic application checkpointing to a parallel file system has for years been the de-facto strategy for enabling resilience to failures in HPC platforms. This traditional checkpoint/restart protocol is widely used in HPC systems for mitigating the impact of failures on application performance. However, as system sizes approach exascale levels, the higher frequency of failures and the lengthy time required to checkpoint/restart an exascale-size application make traditional checkpointing impractical. A number of strategies have been proposed in recent years to enable systems of these extreme sizes to be resilient against failures.

Our work is one of the first to provide a comprehensive comparison among traditional checkpoint/restart and three state-of-the-art HPC resilience protocols that are being considered for use in exascale HPC systems. We demonstrate the importance of employing these state-of-the-art protocols and examine how each resilience protocol behaves as application sizes scale from what is considered large today through to exascale sizes. Because we experiment with applications that have multiple sets of execution characteristics, our analysis allows for the simultaneous investigation of each protocol’s ability to handle varying application sizes and reliability goals. Our results not only show the necessity of incorporating improved forms of resilience for future HPC systems, but also show that different resilience protocols perform better or worse when executing different types of applications. Additionally, our results show that the resilience protocol that is optimal for a particular application type can change as the application scales. Based on these results, we propose optimizations to the multi-level checkpointing approach, which we believe is one of the most promising fault-resilience approaches for exascale-complexity platforms. We devise techniques for multi-level checkpoint interval optimization, with an emphasis on performance efficiency as well as energy use. We demonstrate that distinct intervals exist when optimizing for either one metric or the other, and examine the sensitivity of this phenomenon to changes in several system parameters and application characteristics.
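For context, the classic single-level result that such interval optimisation generalises is Young’s approximation, τ ≈ √(2CM) for checkpoint cost C and MTBF M; the multi-level optimisation in the talk goes beyond this, and the formula below is only the textbook baseline:

```python
import math

# Young's approximation for the optimal single-level checkpoint
# interval: balancing checkpoint overhead against expected rework after
# a failure gives tau ~= sqrt(2 * C * M) for checkpoint cost C and
# mean time between failures M. (Textbook result, shown for context.)

def young_interval(checkpoint_cost, mtbf):
    return math.sqrt(2.0 * checkpoint_cost * mtbf)

# e.g. 60 s checkpoints on a system with a 5-minute (300 s) MTBF:
tau = young_interval(checkpoint_cost=60.0, mtbf=300.0)
```

Note how aggressively the interval shrinks as the MTBF drops towards the few-minute range projected for exascale machines, which is exactly why single-level checkpointing becomes impractical.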


Speeding up a high-performance scientific code with GASPI shared notifications
Dana Akhmetova, Roman Iakymchuk, Valeria Bartsch, Erwin Laure, Christian Simmendinger

For the HPC community it is very important to understand what practicable programming models for the coming Exascale computing era will look like. Experimenting with current parallel programming models and studying their interoperability, scalability and performance will therefore provide valuable insights.

While message passing supports communication in distributed-memory systems by exchanging messages, the Partitioned Global Address Space (PGAS) programming model provides the concept of a global memory address space, physically located on different nodes but accessible to all processes. It is based on one-sided remote direct memory access (RDMA) communication supported directly by the network. GASPI (Global Address Space Programming Interface), a PGAS API, shifts the paradigm from bulk-synchronous two-sided communication towards asynchronous communication [3]. It represents an alternative to the MPI standard.

In our previous work, we have already experimented with GASPI and ported a number of real-world applications to this model. The new implementations showed positive results and performed faster than their initial (MPI+OpenMP) versions, at least on large numbers of cores. In this study we experiment with a new GASPI feature called shared notifications in iPIC3D, a large plasma physics code for space-weather applications written in C++ and MPI+OpenMP. To our knowledge, GASPI shared notifications have never been used before; in our test runs they have shown very promising performance behaviour. We use the GPI-2 library, an implementation of the GASPI standard developed by the Fraunhofer Institute for Industrial Mathematics ITWM. In this work we:

  • port a large high-performance scientific real-world code to GASPI;
  • use shared notifications, a new feature of GASPI, for the first time ever;
  • analyse how suitable GASPI is for Exascale computing by providing performance and scaling tests with up to 8192 processes, and by comparing with MPI, the de-facto standard for distributed-memory programming;
  • provide a step-by-step methodology with new GASPI features in a real-world application with discussions on interoperability issues;
  • share our experience in the form of best-practice programming guides, including suggestions to programmers who may adopt our approach.

Our performance results show that GASPI shared notifications provide a promising new set of features to further pave the way to the Exascale era for scientific production codes.


Efficient Gather-Scatter Operations in Nek5000 Using PGAS
Niclas Jansson, Nick Johnson, and Michael Bareford

Gather-scatter operations are one of the key communication kernels used in the computational fluid dynamics (CFD) application Nek5000 for fetching data dependencies (gather), and spreading results to other nodes (scatter). The current implementation used in Nek5000 is the Gather-Scatter library, GS, which utilises different communication strategies: nearest neighbour exchange, message aggregation, and collectives, to efficiently perform communication on a given platform. GS is implemented using non-blocking, two-sided message passing via MPI and the library has proven to scale well to hundreds of thousands of cores. However, the necessity to match sending and receiving messages in the two-sided communication abstraction can quickly increase latency and synchronisation costs for very fine grained parallelism, in particular for the unstructured communication patterns created by unstructured CFD problems.

ExaGS is a re-implementation of the Gather-Scatter library, with the intent of using the best available programming model for a given architecture. We present our current implementation of ExaGS, based on the one-sided programming model provided by the Partitioned Global Address Space (PGAS) abstraction, using Unified Parallel C (UPC). Using a lock-free design with efficient point-to-point synchronisation primitives, ExaGS is able to reduce communication latency compared to the current two-sided MPI implementation. A detailed description of the library and the implemented algorithms is given, together with a performance study of ExaGS used with Nek5000 and its co-design benchmarking application, Nekbone.
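The gather-scatter kernel itself is easy to state; our sketch below (not the gslib or ExaGS API) shows the serial semantics: values at mesh nodes sharing a global id are reduced (“gather”) and the result is written back to every local copy (“scatter”):

```python
# Serial semantics of a gather-scatter operation (our sketch, not the
# gslib or ExaGS API). Mesh nodes shared between elements carry the
# same global id; "gather" reduces all local contributions per id, and
# "scatter" writes the reduced value back to every local copy -- the
# exchange Nek5000 needs at element faces.

def gather_scatter(global_ids, values, op=lambda a, b: a + b):
    reduced = {}
    for gid, v in zip(global_ids, values):        # gather: reduce per id
        reduced[gid] = op(reduced[gid], v) if gid in reduced else v
    return [reduced[gid] for gid in global_ids]   # scatter: write back

# two elements sharing the node with global id 7:
out = gather_scatter([3, 7, 7, 9], [1.0, 2.0, 5.0, 4.0])
```

The parallel difficulty, which GS and ExaGS address, is that the copies of an id live on different ranks, so the reduce-and-broadcast must be realised with neighbour exchange, aggregation, or one-sided communication.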

Call For Participation http://www.easc2018.ed.ac.uk/call-for-participation/ Mon, 02 Oct 2017 08:00:26 +0000 http://www.easc2018.ed.ac.uk/?p=68 You are warmly invited to participate in the 5th Exascale Applications and Software Conference (EASC 2018), to be held in Edinburgh, Scotland on the 17-19 April 2018.

Topics of interest

The organisers seek novel contributions in all areas associated with applications, tools, software programming models and libraries, and other technologies necessary to exploit future exascale systems, including:

  • enabling and optimising applications for exascale in any area;
  • developing and enhancing algorithms for exascale systems;
  • aiding the exploitation of massively parallel systems through tools, e.g. performance analysis, debugging, development environments;
  • programming models and libraries for exascale;
  • exascale runtimes and system software;
  • evaluating best practice in HPC concerning large-scale facilities and application execution;
  • novel uses of current generation and future exascale systems;
  • new hardware technologies and their exploitation to solve exascale challenges.

Research and/or experience that brings together current theory and practice is particularly welcome.

Participation Guidelines

We seek contributions in the form of a one page abstract for oral or poster presentation.

Poster submissions on recent or ongoing research in the topics of interest listed above are warmly welcomed. Poster presenters will have the opportunity to present a lightning talk (1-2 minutes) introducing the topic of their poster. Please send a brief abstract (max. 100 words, plus one image) by 1st of March 2018.

There will be a prize for the best paper and for the best student contribution.

There will be an opportunity to submit papers to a special issue of JPDC. All proceedings submissions will be peer-reviewed.

Important dates

  • Submissions open: 9th October 2017
  • Submissions close: Extended to 11th December 2017
  • Author notification: 31st January 2018
  • Poster submissions close: 1st March 2018
  • Early bird registrations close: 28th March 2018
  • Regular registrations close: 8th April 2018
  • Conference begins: 17th April 2018
Sponsorship http://www.easc2018.ed.ac.uk/sponsorship/ Mon, 02 Oct 2017 07:54:50 +0000 http://www.easc2018.ed.ac.uk/?p=152 Sponsoring the conference is a great opportunity to have your brand seen by approximately 100 academics, industry professionals, and decision-makers from across the HPC sector.

Our pre-arranged packages offer great value, but if you are looking for something more customised, we would be happy to work with you. If your organisation would like to sponsor the event, please contact us at info@easc2018.ed.ac.uk.

Sponsorship packages

  • Link to company website from conference website
  • Inclusion in emailed programme to delegates prior to conference
  • Company logo on marketing materials & website
  • Logo included in conference programme
  • Exhibition stand
  • Company advertising in conference centre
  • Promotional material in delegate pack
  • List of delegate contact details (from opt-in list)
  • Full-page advert in conference programme
  • Mentioned during conference opening and closing speeches
  • Sponsorship of conference dinner
  • Complimentary registrations

  • Link to company website from conference website
  • Inclusion in emailed programme to delegates prior to conference
  • Company logo on marketing materials & website
  • Logo included in conference programme
  • Exhibition stand
  • Company advertising in conference centre
  • Promotional material in delegate pack
  • List of delegate contact details (from opt-in list)
  • Half-page advert in conference programme
  • Sponsorship of conference drinks reception
  • Complimentary registrations

  • Link to company website from conference website
  • Inclusion in emailed programme to delegates prior to conference
  • Company logo on marketing materials & website
  • Logo included in conference programme
  • Exhibition stand
  • Advert in conference programme
  • Complimentary registration

Images
Johnny Durnan (CC BY-SA 2.0), Lukacs (CC BY 2.5), Rosser1954 (CC BY-SA 4.0), via Wikimedia Commons

Conference Themes http://www.easc2018.ed.ac.uk/conference-themes/ Fri, 22 Sep 2017 13:36:44 +0000 http://www.easc2018.ed.ac.uk/?p=78
Getting to Edinburgh http://www.easc2018.ed.ac.uk/getting-to-edinburgh/ Fri, 22 Sep 2017 12:38:11 +0000 http://www.easc2018.ed.ac.uk/?p=53 There are a variety of ways to get to us!

Fly

Edinburgh International Airport has links with over 140 destinations worldwide. If there’s no direct flight from your location, you can likely get here with just one stopover.

Train

Edinburgh has two major railway stations – Haymarket and Waverley. You can reach either station via any of the UK’s major rail routes. Both are located in the city centre, making it easy to travel from the station to our conference.

Drive

Edinburgh is easily reachable by car from anywhere in the UK.
