A few years ago we had the rare opportunity to program on an 8-GPU monster machine provided by NVIDIA. That was an experience! The goal was to parallelise a 3D wave modelling tool in the frequency domain (Helmholtz equation) on a machine with multiple GPUs. The motivation was to run problems of a realistic size, which are usually too large to fit in the memory of a single GPU. In effect, we are facing the same issue as 20 years ago with CPUs.

###### The monster machine

For our numerical experiments NVIDIA provided a Westmere-based 12-core machine connected to 8 Tesla 2050 GPUs, as shown in the figure above. The 12-core machine has 48 GB of RAM. Each socket has 6 Intel(R) Xeon(R) X5670 CPU cores @ 2.93GHz and is connected through 2 PCI buses to 4 graphics cards. Note that two GPUs share one PCI bus connected to a socket. Each GPU consists of 448 cores with a clock rate of 1.5 GHz and has 3 GB of memory.

###### Multi-GPU approach

There are several ways to use multiple GPUs:

- The data-parallel approach, where all matrix-vector and vector-vector operations are split between multiple GPUs. The advantage of this approach is that it is relatively easy to implement. However, matrix-vector multiplication requires exchanging data between GPUs, which can lead to significant data transfer times if the computational part is small.
- Splitting the algorithm, where different parts of the algorithm are executed on different devices. For instance, the solver is executed on one GPU and the preconditioner on another. In this way the communication between GPUs is minimized. However, this approach requires an individual solution for each algorithm.
- The domain-decomposition approach, where the original continuous or discrete problem is decomposed into parts which are executed on different GPUs, and the overlapping information (halos) is exchanged by data transfer. For our wave simulation in the frequency domain, however, this approach can have convergence difficulties at higher frequencies (see Ernst, Gander).

We chose the data-parallel approach and the split of the algorithm, and compared the multi-core CPU implementation with the multi-GPU one. We left out the domain-decomposition approach because convergence of the Helmholtz solver is not guaranteed.
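To make the communication cost of the data-parallel approach more concrete, here is a minimal CPU-only sketch in Python. It is not our GPU code: the function names `split_rows` and `matvec_data_parallel` are hypothetical, the matrix is stored densely for brevity, and each "device" is just a loop iteration. The point it illustrates is that every row block needs entries of `x` owned by other devices.

```python
def split_rows(n_rows, n_devices):
    """Return (start, stop) row ranges, one contiguous block per device."""
    base, extra = divmod(n_rows, n_devices)
    ranges, start = [], 0
    for d in range(n_devices):
        stop = start + base + (1 if d < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges


def matvec_data_parallel(A, x, n_devices):
    """Compute y = A x, one row block per (hypothetical) device.

    Each block reads the whole vector x, including entries that live on
    other devices. On real hardware this is the data exchange that can
    dominate the run time when the computational part is small.
    """
    y = [0.0] * len(A)
    for start, stop in split_rows(len(A), n_devices):
        # On a multi-GPU machine this loop body would run on one GPU.
        for i in range(start, stop):
            y[i] = sum(a_ij * x_j for a_ij, x_j in zip(A[i], x))
    return y
```

Vector-vector operations split the same way, but a dot product additionally needs a reduction of the per-device partial sums on the host.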

###### Computations on Multi-GPU

You can do computations on multiple GPUs by either:

- pushing different contexts to different GPUs, or
- creating multiple threads on the CPU, each of which communicates with one GPU.

For our purposes we have chosen the second option, since it was easier to understand and implement.
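The thread-per-GPU pattern can be sketched as follows. This is an illustrative Python analogue using only the standard library, not our actual implementation: `DeviceWorker` and `partial_dot` are hypothetical names, and no GPU is touched. Each worker is started once, keeps serving tasks from its own queue, and is only shut down at the end of the application.

```python
import queue
import threading


class DeviceWorker(threading.Thread):
    """Long-lived host thread that owns one (hypothetical) device."""

    def __init__(self, device_id):
        super().__init__(daemon=True)
        self.device_id = device_id
        self.tasks = queue.Queue()

    def run(self):
        # A real implementation would bind this thread to its GPU once,
        # here, and then keep serving tasks for the whole application.
        while True:
            item = self.tasks.get()
            if item is None:  # sentinel: shut the worker down
                break
            func, args, result_box = item
            result_box.put(func(self.device_id, *args))

    def submit(self, func, *args):
        """Queue func(device_id, *args); return a queue holding the result."""
        result_box = queue.Queue(maxsize=1)
        self.tasks.put((func, args, result_box))
        return result_box

    def shutdown(self):
        self.tasks.put(None)
        self.join()


def partial_dot(device_id, x, y, n_devices):
    """Dot product over the entries assigned to one device (strided split)."""
    return sum(x[i] * y[i] for i in range(device_id, len(x), n_devices))
```

A caller would create one `DeviceWorker` per GPU, call `start()` on each, submit `partial_dot` to every worker, and add up the results from the returned queues. Error handling inside the worker is omitted in this sketch.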

###### Issues

Implementation on multi-GPUs requires careful consideration of possibilities and optimization options. The issues we encountered during our work are listed below:

- Multi-threading implementation: each thread should live as long as the application. This is crucial when using one CPU thread per GPU. Note that when pushing contexts this is not an issue.
- Because of limited GPU memory size, large problems need multiple GPUs.
- Efficient memory reuse to avoid allocation/deallocation. Because of the memory limitations, memory should be reused as much as possible, especially in the multigrid method. In our work we create a pool of vectors on the GPU and reuse them during the whole solution time.
- Limit CPU-to-GPU and GPU-to-CPU communication.
- Back then, it was beneficial to use texture memory when possible, but it was not easy as each GPU needs its own texture reference.
- Coalescing is difficult since each matrix row has a different number of elements.
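The idea of a vector pool can be illustrated with a small Python sketch. The class name `VectorPool` and its methods are hypothetical, and plain Python lists stand in for device memory; the sketch only shows the allocate-once, reuse-many-times pattern, not our GPU implementation.

```python
class VectorPool:
    """Fixed set of preallocated work vectors that are handed out and returned."""

    def __init__(self, n_vectors, length):
        self.length = length
        # All storage is allocated here, once; nothing is allocated later.
        self._free = [[0.0] * length for _ in range(n_vectors)]

    def acquire(self):
        """Hand out a free vector; fail loudly instead of allocating more."""
        if not self._free:
            raise RuntimeError("vector pool exhausted")
        return self._free.pop()

    def release(self, vec):
        """Return a vector to the pool so a later step can reuse it."""
        self._free.append(vec)
```

In a multigrid cycle, temporary vectors for residuals and corrections on each level would be acquired at the start of a step and released at the end, so the memory footprint stays fixed for the whole solve.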

###### Helmholtz solver

The Helmholtz equation describes time-harmonic wave propagation in the frequency domain and has applications in many fields of science and technology, e.g. in aeronautics, marine technology, geophysics, and optics. In particular, we consider the Helmholtz equation discretized by a second-order finite difference scheme.

As a solver for the Helmholtz equation (wave equation in frequency domain) we use Bi-CGSTAB with a shifted Laplacian multigrid preconditioner.
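For reference, the objects involved can be written down in a common textbook notation. The symbols here are assumptions on our part and are not taken from the implementation: $u$ is the wave field, $k$ the wavenumber, $f$ the source, $h$ the grid spacing, and $\beta_1, \beta_2$ the real and imaginary shift parameters. The sign of the imaginary shift and the boundary conditions vary between references and are not specified here.

```latex
% Helmholtz equation
-\Delta u - k^2 u = f

% Second-order (7-point) finite difference discretization on a uniform grid
\frac{6u_{i,j,l} - u_{i-1,j,l} - u_{i+1,j,l} - u_{i,j-1,l} - u_{i,j+1,l}
      - u_{i,j,l-1} - u_{i,j,l+1}}{h^2} - k^2 u_{i,j,l} = f_{i,j,l}

% Shifted Laplacian operator used as a preconditioner
M = -\Delta - (\beta_1 - \mathrm{i}\,\beta_2)\, k^2
```

The preconditioner $M$ differs from the Helmholtz operator only in the complex shift, which damps the problem enough for multigrid to work on it.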

Since Bi-CGSTAB is a collection of vector additions, dot products and matrix-vector multiplications, its multi-GPU version is straightforward. We have seen that the speedup on multiple GPUs is smaller than on a single GPU because of the data transfer between CPU and GPU. However, it is possible to compute a problem of a realistic size on multiple GPUs, and the computation is still many times faster than on the 12-core Westmere.
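To show how few building blocks are involved, here is a minimal sketch of the textbook Bi-CGSTAB iteration in plain Python. It is unpreconditioned and uses a dense matrix, so it is not the solver we ran; the helper names `dot`, `axpy` and `matvec` are our own choice for this illustration. On multiple GPUs, each of these three helpers is what would be split across devices.

```python
def dot(a, b):
    """Inner product; conjugates the first argument so complex data works."""
    return sum(complex(ai).conjugate() * bi for ai, bi in zip(a, b))


def axpy(alpha, x, y):
    """Return alpha * x + y."""
    return [alpha * xi + yi for xi, yi in zip(x, y)]


def matvec(A, x):
    """Dense matrix-vector product."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def norm(a):
    return abs(dot(a, a)) ** 0.5


def bicgstab(A, b, tol=1e-10, max_iter=100):
    """Unpreconditioned Bi-CGSTAB with zero initial guess."""
    n = len(b)
    x = [0.0] * n
    r = list(b)                # r = b - A x with x = 0
    r_hat = list(r)            # fixed shadow residual
    rho = alpha = omega = 1.0
    v = [0.0] * n
    p = [0.0] * n
    b_norm = norm(b) or 1.0
    for _ in range(max_iter):
        rho_new = dot(r_hat, r)
        beta = (rho_new / rho) * (alpha / omega)
        p = axpy(beta, axpy(-omega, v, p), r)   # p = r + beta (p - omega v)
        v = matvec(A, p)
        alpha = rho_new / dot(r_hat, v)
        s = axpy(-alpha, v, r)                  # s = r - alpha v
        if norm(s) <= tol * b_norm:             # converged after half step
            return axpy(alpha, p, x)
        t = matvec(A, s)
        omega = dot(t, s) / dot(t, t)
        x = axpy(omega, s, axpy(alpha, p, x))   # x += alpha p + omega s
        r = axpy(-omega, t, s)                  # r = s - omega t
        rho = rho_new
        if norm(r) <= tol * b_norm:
            break
    return x
```

Each iteration costs two matrix-vector products, a handful of dot products and several vector updates. The dot products are the steps that force a synchronisation between devices.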

The shifted Laplace preconditioner consists of a coarse grid correction based on the Galerkin method with matrix-dependent prolongation, and Gauss-Seidel as a smoother. The implementation of the coarse grid correction on multiple GPUs is straightforward, since its main ingredient is the matrix-vector multiplication. The coarse grid matrices are constructed on the CPU and then transferred to the GPUs.

The Gauss-Seidel smoother on multiple GPUs requires an adaptation of the algorithm. We use eight-color Gauss-Seidel, since the Helmholtz equation is given in three dimensions and the computations at each discretization point should be done independently of its neighbours to allow parallelism.
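One way to see why eight colors suffice in 3D is the following Python sketch. It assumes a 27-point stencil (each point coupled to all neighbours within one grid step in every direction) and a coloring by the parity of the three grid indices; both the stencil assumption and the function names are ours, chosen for illustration only.

```python
from itertools import product


def color(i, j, k):
    """Color 0..7 from the parity of each grid index."""
    return (i % 2) + 2 * (j % 2) + 4 * (k % 2)


def points_by_color(nx, ny, nz):
    """Group all grid points into eight lists, one per color."""
    groups = [[] for _ in range(8)]
    for i, j, k in product(range(nx), range(ny), range(nz)):
        groups[color(i, j, k)].append((i, j, k))
    return groups


def neighbours_share_color(nx, ny, nz):
    """True if any two points coupled by a 27-point stencil share a color."""
    offsets = [o for o in product((-1, 0, 1), repeat=3) if o != (0, 0, 0)]
    for i, j, k in product(range(nx), range(ny), range(nz)):
        for di, dj, dk in offsets:
            ni, nj, nk = i + di, j + dj, k + dk
            if 0 <= ni < nx and 0 <= nj < ny and 0 <= nk < nz:
                if color(i, j, k) == color(ni, nj, nk):
                    return True
    return False
```

Because no two coupled points share a color, a smoothing sweep can process the eight groups one after another while updating all points within a group in parallel.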

###### Conclusions

We implemented a solver for the three-dimensional Helmholtz wave equation on multiple GPUs. To keep double-precision convergence, the solver (Bi-CGSTAB method) is implemented on the GPU in double precision and the preconditioner in single precision.

Two multi-GPU approaches have been considered: the data-parallel approach and a split of the algorithm.

For the data-parallel approach, we were able to solve larger problems than on one GPU and get better performance than the multi-threaded CPU implementation. However, due to the communication between the GPUs and the CPU, the resulting speedups were considerably smaller than for the single-GPU implementation.

To minimize the communication but still be able to solve large problems, we introduced a split of the algorithm. In this case the speedup on multiple GPUs over the multi-core implementation is similar to that of a single GPU.

See the full comparison of the multi-GPU implementation with a single-GPU and a multi-threaded CPU implementation on a realistic problem size here, or request a copy by filling in the contact form or by sending us an email.

We would like to thank NVIDIA Corporation again (in particular François Courteille) for access to their many-core multi-GPU architecture.

This work has been presented during the following conferences: