Enabling Technologies
Since the first comprehensive study on HPC challenges issued by the US Defense Advanced Research Projects Agency (DARPA), four major challenges to achieving Exascale have been reported:
- The Energy and Power Challenge: no combination of currently available technologies will be able to deliver HPC systems at connected wattages well below 100 MW.
- The Memory and Storage Challenge: state-of-the-art technologies are not yet mature enough to meet the I/O bandwidth requirements within an acceptable power envelope.
- The Concurrency and Locality Challenge: future applications may have to support more than a billion separate threads in order to use the hardware efficiently.
- The Resilience Challenge: systems become increasingly sensitive to their operating environment, given the component counts of supercomputers and operating points at low voltage and high temperature.
The EESI2 experts have identified several technologies that address the main roadblocks to building the proper software infrastructure for exascale systems.
Numerical algorithms
Scope: Numerical analysis underlies all numerical computation in every application area. Algorithms are given particular focus in order to understand how exascale affects these technologies. EESI2 studies the following technologies:
- Dense linear algebra
- Graph partitioning
- Sparse direct methods
- Iterative methods
- Eigenvalue problems
- Optimization & control
- Structured & unstructured grids
- Monte Carlo
- Tensors
- Fast Multipole Methods
Challenges:
Memory access is a major bottleneck in computations. Algorithms therefore need to maximise the number of useful calculations per memory access. This challenges scheduling methods, memory affinity schemes and load-balancing methods.
Load-balancing, synchronisation, communication and fault tolerance mechanisms generate considerable overhead, which hinders the path toward exascale.
Additionally, some technologies solve issues but create new problems, e.g. stability of the overall system or the overhead of asynchronous methods. For example, octree-based data structures face irregular computation patterns with load-balancing issues.
New applications, such as Big Data, and low-rank approximation and compression methods require methods to be adapted to matrix structures.
Path:
Memory issues can be tackled by using blocking/tiling and communication hiding methods.
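As an illustration of blocking/tiling, the following minimal sketch compares a reference triple loop with a tiled variant of a dense matrix product. It is written in plain Python for readability only; the function names and the `tile` parameter are illustrative assumptions, and production codes would rely on tuned libraries.

```python
def matmul_naive(a, b):
    """Reference triple loop: C = A * B for square matrices (lists of lists)."""
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_tiled(a, b, tile=2):
    """Same product, computed tile by tile.

    Each (ii, kk, jj) step touches only a tile-sized block of A, B and C,
    so the data of a block is reused many times while it is close to the
    processor instead of being reloaded from main memory.
    """
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, tile):
        for kk in range(0, n, tile):
            for jj in range(0, n, tile):
                for i in range(ii, min(ii + tile, n)):
                    for k in range(kk, min(kk + tile, n)):
                        aik = a[i][k]  # reused across the whole j loop
                        for j in range(jj, min(jj + tile, n)):
                            c[i][j] += aik * b[k][j]
    return c
```

Both functions perform the same arithmetic; only the order in which memory is traversed changes, which is the point of the technique.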
Some overheads can be addressed by expressing computations at multiple levels of abstraction as task graphs and making use of data-driven schedulers.
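The idea of a task graph driven by data availability can be sketched as follows. This is a sequential toy scheduler under stated assumptions, not the design of any particular runtime: the names `run_task_graph` and `example_tasks` are hypothetical, and a real data-driven scheduler would dispatch ready tasks to many workers.

```python
from collections import deque


def run_task_graph(tasks):
    """Run tasks as soon as all of their inputs are available.

    tasks maps a task name to (function, [names of input tasks]).
    Returns a dict mapping each task name to its result.
    """
    pending = {name: len(deps) for name, (_, deps) in tasks.items()}
    consumers = {name: [] for name in tasks}
    for name, (_, deps) in tasks.items():
        for dep in deps:
            consumers[dep].append(name)

    # Tasks without inputs are ready immediately.
    ready = deque(name for name, count in pending.items() if count == 0)
    results = {}
    while ready:
        name = ready.popleft()
        func, deps = tasks[name]
        results[name] = func(*(results[dep] for dep in deps))
        # Producing this result may unlock tasks that were waiting for it.
        for consumer in consumers[name]:
            pending[consumer] -= 1
            if pending[consumer] == 0:
                ready.append(consumer)

    if len(results) != len(tasks):
        raise ValueError("task graph contains a cycle")
    return results


example_tasks = {
    "a": (lambda: 2, []),
    "b": (lambda: 3, []),
    "sum": (lambda x, y: x + y, ["a", "b"]),
    "square": (lambda s: s * s, ["sum"]),
}
```

Calling `run_task_graph(example_tasks)["square"]` returns 25. No global barrier appears anywhere: execution order follows only from the data dependencies.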
Algorithms based on modular frameworks can be scalable to support alternative scheduling, load-balancing methods or memory affinity schemes.
Asynchronous/chaotic relaxation methods make better use of parallelism.
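A minimal sketch of chaotic relaxation for a linear system A x = b is given below. It simulates asynchrony sequentially: at each step an arbitrary component is updated from a view of the iterate that may be a few updates old, as would happen when processes do not wait for each other. The function name, the `max_delay` parameter and the test system are illustrative assumptions, and the matrix is assumed to be strictly diagonally dominant so that the iteration converges.

```python
import random


def chaotic_relaxation(a, b, updates=2000, max_delay=2, seed=0):
    """Solve a x = b by updating one component at a time, in arbitrary
    order, using possibly outdated values of the other components.

    Assumption: `a` is strictly diagonally dominant.
    """
    rng = random.Random(seed)
    n = len(b)
    history = [[0.0] * n]  # history[-1] is the current iterate
    for _ in range(updates):
        i = rng.randrange(n)  # arbitrary component, no fixed sweep order
        delay = rng.randint(0, min(max_delay, len(history) - 1))
        stale = history[-1 - delay]  # possibly outdated view of x
        off_diag = sum(a[i][j] * stale[j] for j in range(n) if j != i)
        new = list(history[-1])
        new[i] = (b[i] - off_diag) / a[i][i]
        history.append(new)
        if len(history) > max_delay + 1:
            history.pop(0)  # keep only the views that can still be read
    return history[-1]


# Hypothetical test system with exact solution [1, 2, 3].
A = [[4.0, 1.0, 1.0], [1.0, 5.0, 2.0], [1.0, 2.0, 6.0]]
B = [9.0, 17.0, 23.0]
```

Because no update waits for a complete sweep to finish, the same scheme maps naturally onto processes that run at different speeds; the price is that the convergence behaviour depends on the matrix and on how stale the values may become.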
Most issues require addressing the trade-off between speed, accuracy and reproducibility, the impact of fault tolerance on algorithm design, and uncertainty quantification.
Application auto-tuning will become a commonplace feature.
Scientific software engineering, software ecosystem, programmability
Scope: development, operation and maintenance of software
Challenges: multiple challenges arise from the long lifetime of codes and the lack of high-level programming.
The management of complexity is still in its infancy, with a lack of methods that support the high-level design and quality management of exascale applications.
Tools for error and performance analysis should provide better insight.
Talent is rare, given the scarcity of software engineering training for scientists and the lack of education curricula.
Path: Expand most current HPC software development beyond its algorithm- and programming-centric view to offer a better understanding of the (re-)design and quality management processes. This should provide appropriate methods and tools to support such processes.
Roadmap: the most popular programming interfaces (MPI and OpenMP) have introduced substantial extensions, e.g. non-blocking collectives, which enable overlap of communication and computation, or neighbourhood collectives.
Disruptive technologies
Scope: Methods for identifying disruptions, and technologies that tackle such disruptions.
Challenges: EESI2 has identified three areas of concern:
- variability in resources (resource performance, resource availability),
- change in the level of abstraction at which applications are written,
- change in the execution model, from synchronous to asynchronous.
Some technologies have been identified as sources of disruption:
- in hardware: new memory or packaging,
- in software: virtualization.
Path: Some technologies could tackle identified issues:
- auto-management mechanisms at system level,
- dynamic adaptation or malleable programming models with runtimes,
- resource management that supports variability,
- dynamic load balancing,
- programming models that separate, at application level, the algorithmic part from the specificities of the resources (OmpSs, OpenACC, …),
- rapid prototyping programming environments (eDSLs, Perl, Python, …),
- runtimes that support asynchrony (such as task-based or data-flow schedulers, or non-blocking MPI communications).
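The dynamic load balancing item in the list above can be illustrated with a small simulation. It compares a static, contiguous partition of tasks with a scheme in which each worker pulls the next task as soon as it becomes idle. The function names and the cost list are hypothetical, and task costs are assumed to be known only so that the simulation can compute finishing times.

```python
import heapq


def static_makespan(costs, workers):
    """Finishing time when tasks are split into contiguous blocks up front."""
    chunk = -(-len(costs) // workers)  # ceiling division
    return max(sum(costs[i:i + chunk]) for i in range(0, len(costs), chunk))


def dynamic_makespan(costs, workers):
    """Finishing time when each idle worker pulls the next task from a
    shared queue, so no worker stays idle while work remains."""
    free_at = [0] * workers  # time at which each worker becomes idle
    heapq.heapify(free_at)
    for cost in costs:
        start = heapq.heappop(free_at)  # earliest available worker
        heapq.heappush(free_at, start + cost)
    return max(free_at)


# One expensive task followed by cheap ones (irregular workload).
example_costs = [8, 1, 1, 1, 1, 1, 1, 2]
```

With two workers, `static_makespan(example_costs, 2)` gives 11 while `dynamic_makespan(example_costs, 2)` gives 8: the second worker absorbs all the cheap tasks while the first one is busy with the expensive one.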
Hardware and software vendors
Scope: assessing the Exascale roadmap by investigating the state of the art and market trends, both in hardware and software. EESI2 leverages its network of contacts with HPC vendors and cooperates with ETP4HPC (the European Technology Platform for HPC, an industry-led forum) to develop this work package.
Challenges: Several challenges are known:
On the hardware side:
- energy efficiency of all system components,
- bandwidth requirements of the memory system,
- increased requirements in resilience,
- routing of interconnection networks.
On the software side:
- scalability to a large number of tasks,
- programming models and tools,
- techniques for checkpoint and restart and, more generally, fault tolerance,
- maintenance of legacy codes,
- development of mini-apps and benchmarks,
- virtualization.
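The checkpoint and restart item listed above can be sketched at application level as follows. This is a minimal illustration under stated assumptions: the state is a small dictionary stored as JSON, the function names and the `fail_at` parameter (used to simulate a failure) are hypothetical, and large-scale systems would need far more elaborate, often multi-level, schemes.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write the state to a temporary file, then rename it atomically so
    that a crash during the write never leaves a corrupt checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def load_checkpoint(path):
    """Return the last saved state, or a fresh state if none exists."""
    if not os.path.exists(path):
        return {"step": 0, "total": 0}
    with open(path) as f:
        return json.load(f)


def run(path, steps, every=10, fail_at=None):
    """Sum the integers 0..steps-1, checkpointing every `every` steps.

    If `fail_at` is set, a failure is simulated when that step is reached.
    """
    state = load_checkpoint(path)  # resume from the last checkpoint, if any
    while state["step"] < steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated node failure")
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % every == 0:
            save_checkpoint(path, state)
    return state["total"]
```

A run interrupted at step 57 restarts from the checkpoint taken at step 50 and redoes only the lost steps. The checkpoint interval trades the cost of writing state against the amount of work lost on failure.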