Optimising the OpenMP implementation of the MD modelling package TINKER

Posted by a.hay on 6 August 2014 - 10:00am

By Weronika Filinger, Application Developer at EPCC.

Do you use scientific codes in your research? In this article I briefly describe the process I followed to optimise the parallel performance of TINKER, a computational chemistry package, as part of the EPCC/SSI APES project.

TINKER can be used to perform molecular modelling and molecular dynamics simulations. Originally written in Fortran 77 and currently being ported to Fortran 90, it has already been parallelised for a shared memory environment using OpenMP. The code does not scale well with an increasing number of cores, and the scaling is even poorer on AMD architectures. In my investigation I used a cluster hosted by EPCC consisting of 24 compute nodes, each with four 16-core AMD Opteron 6276 2.3 GHz Interlagos processors, giving a total of 64 cores per node that share memory. As the current parallelisation of TINKER is purely for shared memory, all my investigations were restricted to a single node.

On running the code on this system, one quickly finds that the improvement is not proportional to the number of threads used. The largest speedup obtained in this case was 3.63, observed for 16 threads, implying a parallel efficiency (speedup divided by thread count) of only about 23%. The corresponding speedups observed for 32 and 64 threads were 3.59 and 3.15, respectively. In this instance, using more threads does not give better performance. Moreover, taking a parallel efficiency of at least 60% as an acceptable minimum means that it would not really be worthwhile running TINKER on more than 4 threads.
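The efficiency figures follow directly from the reported speedups. As a minimal sketch, the arithmetic can be reproduced in a few lines of Python; the variable and function names are illustrative only, and the speedups are the ones quoted above.

```python
# Speedups reported above for TINKER on a single 64-core node.
measured_speedup = {16: 3.63, 32: 3.59, 64: 3.15}


def parallel_efficiency(speedup, threads):
    """Parallel efficiency = speedup / thread count (1.0 is ideal scaling)."""
    return speedup / threads


efficiencies = {n: parallel_efficiency(s, n) for n, s in measured_speedup.items()}

for n, eff in efficiencies.items():
    print(f"{n:2d} threads: speedup {measured_speedup[n]:.2f}, efficiency {eff:.1%}")

# Speedup that would be needed to reach the 60% threshold at each thread count.
required_speedup = {n: 0.60 * n for n in (4, 8, 16)}
```

At 16 threads a speedup of 9.6 would be needed to reach 60% efficiency, far above the 3.63 that was measured.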

If you ever want to optimise a code, your first step after baselining the performance should be to look at the compiler optimisations. Care is required though, as changing compiler flags can degrade the performance of a code as well as improve it. Regardless, many scientific applications do not use compilers to their maximal advantage. Compilers have tens of optimisation options, which can be confusing; I usually look at the flags controlling the optimisation levels (-O0, -O1, -O2 and -O3) and the platform-specific flags. After determining that the compiler settings have been chosen correctly for the given platform, it is a good idea to check whether the parallel directives are inhibiting the compiler optimisations by compiling the code with and without the OpenMP enabling flag (e.g. -fopenmp for gfortran) and executing the code on only one thread. For TINKER, the parallel version on one thread was slightly faster than the serial execution. This implies that the parallel directives are not inhibiting the compiler from performing optimisations.

After sorting out the compiler settings, it is necessary to understand what is happening in the parallel sections of the code. In parallel sections, which are enabled through compiler directives, OpenMP forks a number of threads that execute those parts of the code concurrently. This is what, hopefully, makes your code go faster. Yet just because the code has been parallelised correctly, it does not mean it has been done in the most efficient way. Therefore, the next step is to profile the parallelised code executed on different numbers of threads.

You need to reduce load imbalance between the threads and, at the same time, improve cache use and reduce false sharing. As such, it is necessary to find which OpenMP schedule is the most suitable for each parallel do region. The schedule determines how the loop iterations are divided amongst the threads. The code was optimised using different schedules, each chosen to suit the parallel region being addressed, and up to 50% improvement was obtained (when using gfortran).
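To illustrate why the choice of schedule matters, the sketch below simulates three ways of dividing loop iterations among threads. It is a toy model in Python, not OpenMP and not TINKER code: the triangular workload (iteration i costs n - i units, as in a loop over pairs) is a hypothetical example, scheduling overhead and cache effects are ignored, and the function names are my own.

```python
import heapq


def static_block(costs, nthreads):
    """One contiguous block per thread, like schedule(static) with no chunk size."""
    base, extra = divmod(len(costs), nthreads)
    loads, start = [], 0
    for t in range(nthreads):
        size = base + (1 if t < extra else 0)
        loads.append(sum(costs[start:start + size]))
        start += size
    return loads


def static_cyclic(costs, nthreads, chunk=1):
    """Chunks dealt out round-robin, like schedule(static, chunk)."""
    loads = [0.0] * nthreads
    for index, start in enumerate(range(0, len(costs), chunk)):
        loads[index % nthreads] += sum(costs[start:start + chunk])
    return loads


def dynamic(costs, nthreads, chunk=1):
    """Each chunk goes to the thread that becomes free first, like schedule(dynamic, chunk)."""
    heap = [(0.0, t) for t in range(nthreads)]
    for start in range(0, len(costs), chunk):
        busy_until, t = heapq.heappop(heap)
        heapq.heappush(heap, (busy_until + sum(costs[start:start + chunk]), t))
    loads = [0.0] * nthreads
    for busy_until, t in heap:
        loads[t] = busy_until
    return loads


def imbalance(loads):
    """Largest thread load divided by the mean load; 1.0 is perfectly balanced."""
    return max(loads) / (sum(loads) / len(loads))


# Hypothetical triangular workload: iteration i costs (n - i) units of work.
n = 1000
costs = [float(n - i) for i in range(n)]

for name, scheme in [("static", static_block), ("static,1", static_cyclic), ("dynamic,1", dynamic)]:
    print(f"{name:10s} imbalance on 8 threads: {imbalance(scheme(costs, 8)):.3f}")
```

In this model the contiguous static schedule leaves the first thread with almost twice the average work, while the cyclic and dynamic schedules are nearly balanced. On real hardware, small chunks can hurt cache reuse and increase false sharing, which is why the best schedule has to be found per region by measurement.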

Taking into consideration that the parallel coverage is about 75% on 8 threads, a speedup of 5.3 seems to be a fairly good result. The question is why the corresponding behaviour does not extend to larger numbers of threads. The answer seems to be related to Non-Uniform Memory Access (NUMA) effects. NUMA architectures, which are common nowadays, allow systems to be built with lots of cores that share a large common bank of memory. Accessing this memory, however, has different costs, which depend on the distance between the core and the memory location. The machine used at EPCC is a cc-NUMA architecture – the cc means that it is cache coherent. Caches sit in front of the memory banks to hide some of the latency of accessing them. Each node has 8 NUMA regions, each containing 8 cores (4 modules of 2 cores). To investigate to what degree the NUMA effects affect the performance of the TINKER code, the Linux numactl command was used.
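To see why a speedup of 5.3 on 8 threads counts as a fairly good result, here is a rough bound, assuming that "parallel coverage" means the fraction of the 8-thread run time spent inside parallel regions, and that those regions scale perfectly. This reading is my assumption. Reading coverage as a fraction of the single-thread run time instead would give an Amdahl limit of about 2.9 on 8 threads, which is below the reported 5.3. The function name is illustrative only.

```python
def speedup_bound(coverage, threads):
    """Upper bound on speedup, assuming `coverage` is the fraction of the
    `threads`-thread run time spent in perfectly scaling parallel regions."""
    serial_part = 1.0 - coverage        # takes the same time on one thread
    parallel_part = coverage * threads  # takes `threads` times longer on one thread
    # Single-thread time relative to a multi-thread time normalised to 1.
    return serial_part + parallel_part


bound = speedup_bound(0.75, 8)
achieved = 5.3  # speedup reported in the text
fraction_of_bound = achieved / bound

print(f"bound {bound:.2f}, achieved {achieved}, i.e. {fraction_of_bound:.0%} of the bound")
```

Under this assumption the bound is 6.25, so 5.3 is about 85% of what the parallel regions could deliver on 8 threads.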

Using 8 threads and placing them in different configurations showed that the best performance was observed when there was only one thread per NUMA region. There is a drop in performance for two threads per NUMA region, but it is not as significant as for four and eight threads. This means that the best performance is obtained when the sharing of hardware resources between the threads is kept to a minimum. The difference between one and two threads per NUMA region is not that great, but using more than 2 threads per NUMA region clearly causes memory bandwidth contention. Similarly, when 16 threads are used, the memory bandwidth is saturated even though only 2 threads are placed in each NUMA region. This, we believe, is what causes the major performance bottleneck in TINKER.
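The placements described above can be sketched as follows. This Python helper only builds core lists and command lines; it does not run anything. It assumes that core IDs are numbered contiguously within each NUMA region (region 0 holds cores 0 to 7, and so on), which is not guaranteed and should be checked with `numactl --hardware` on the actual machine. The program name `./program` and the function names are hypothetical, and the text does not say which cores within a region were used, so the sketch simply takes the first ones.

```python
NUMA_REGIONS = 8      # from the text: 8 NUMA regions per node
CORES_PER_REGION = 8  # 64 cores per node / 8 regions


def placement(nthreads, threads_per_region):
    """Core IDs for `nthreads` threads packed `threads_per_region` to a NUMA region."""
    if not 1 <= threads_per_region <= CORES_PER_REGION:
        raise ValueError("threads_per_region must be between 1 and 8")
    if nthreads % threads_per_region != 0:
        raise ValueError("nthreads must be a multiple of threads_per_region")
    regions_needed = nthreads // threads_per_region
    if regions_needed > NUMA_REGIONS:
        raise ValueError("not enough NUMA regions for this placement")
    cores = []
    for region in range(regions_needed):
        # ASSUMPTION: cores are numbered contiguously inside each region.
        first_core = region * CORES_PER_REGION
        cores.extend(range(first_core, first_core + threads_per_region))
    return cores


def numactl_command(cores, program="./program"):
    """Command line restricting the run to the given cores (program name is hypothetical)."""
    core_list = ",".join(str(c) for c in cores)
    return f"OMP_NUM_THREADS={len(cores)} numactl --physcpubind={core_list} {program}"


# The four 8-thread configurations: 1, 2, 4 and 8 threads per NUMA region.
for per_region in (1, 2, 4, 8):
    print(numactl_command(placement(8, per_region)))
```

Note that `--physcpubind` restricts the whole process to the listed cores; threads may still move between those cores unless they are pinned individually.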

There is no easy fix that would improve the performance of TINKER any further without a radical change to the code base. It may be possible to gain some improvement by restructuring the data structures to improve data locality. However, there is a limit to how much that would improve the performance, and it would require a great deal of effort. It seems more feasible to adapt TINKER to use a distributed memory model, but that would also require significant restructuring of the code.

The specifics of the optimisation process are code dependent. It is therefore natural that some codes will benefit more from optimisation than others. The problem is that scientific applications were, and still are, written with a focus on 'getting the science bit right', with little consideration given to performance until much later. This should change.