VCRS matrix compression for GPUs

Have you considered using compression to solve huge linear systems of equations? How can lossy compression speed up numerical solvers without affecting accuracy? Do you need compression that is suitable for GPU(s)?

Well, the good news is that we have developed a matrix compression scheme that you can use on GPUs. We call it VCRS – Very Compressed Row Storage. In this post, we first focus on why we need matrix compression. Then we describe what VCRS is and how to use it for compression. Finally, we give a few examples from real-world applications.

Why matrix compression?

If you are doing modelling or simulation of a physical process, most of the time you end up with differential equations describing this process. Very often we can’t solve these differential equations analytically in continuous space. Therefore, we need to discretise them and solve them numerically. Discretisation can be viewed as representing your differential equation as a system of linear equations in matrix form

A x = b

where A is the matrix, x is the solution vector, and b is the right-hand side vector.

Most of the time, the matrix A is huge (millions of rows) and sparse (few non-zero elements per row). We can’t simply invert this matrix to get the solution, since matrix inversion is very costly in terms of memory and computation. To solve this system of linear equations, we use iterative methods. Very often we can speed up iterative methods using a preconditioner M^{-1} such that

M^{-1}A x = M^{-1}b

Basically, the preconditioner M^{-1} is a matrix that is close to the inverse of the matrix A, in other words M^{-1} \approx A^{-1}.

If M^{-1}A is the identity matrix, then we automatically obtain the solution of our system of linear equations.
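
To make the idea concrete, here is a minimal sketch (our own illustration, not the solver used in our work) of a diagonal (Jacobi) preconditioner M^{-1} inside a simple Richardson iteration:

```python
import numpy as np

n = 20
A = (np.diag(2.0 * np.ones(n))                   # 1D Poisson matrix: 2 on the diagonal,
     + np.diag(-1.0 * np.ones(n - 1), 1)         # -1 on the off-diagonals
     + np.diag(-1.0 * np.ones(n - 1), -1))
b = np.ones(n)
M_inv = np.diag(1.0 / np.diag(A))                # M^{-1} ~ A^{-1}: here just the inverse diagonal

x = np.zeros(n)
for _ in range(1000):
    x = x + M_inv @ (b - A @ x)                  # iterate on M^{-1} A x = M^{-1} b
print(np.linalg.norm(b - A @ x))                 # the residual shrinks as we iterate
```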

Sometimes the original matrix A can be implemented matrix-free, meaning that its elements are calculated on the fly and not stored. However, most of the time we have to store the preconditioner matrix M in memory. The less storage both matrices take, the larger the problems we can compute. This is especially important if we use accelerators, for example a GPU with limited memory.

An important aspect of a preconditioner is that it does not have to be exact. It is acceptable to have a preconditioner that is an approximation of an approximation.

Therefore, lossy compression of the preconditioner enables bigger problems to be computed on accelerators.

There are many ways to compress a matrix. In this blog we suggest storing the preconditioner matrix M using VCRS compression. This way we take two pigeons with one bean:

  • speed up the iterative solver with a preconditioner,
  • obtain a preconditioner that is suitable for GPU(s).

What is VCRS?

VCRS stands for Very Compressed Row Storage format. This method was developed during Hans's PhD at TU Delft. The VCRS format was inspired by the well-known CSR (Compressed Sparse Row) format.

CSR format versus VCRS format

To illustrate the compression, let’s consider a small matrix A from a one-dimensional Poisson equation with Dirichlet boundary conditions, see the picture above.

The CSR format consists of two integer arrays and one floating-point array (a small example follows the list):

  1. The non-zero elements of the matrix A are stored consecutively, row by row, in the floating-point array \text{data}.
  2. The column index of each element is stored in the integer array \text{cidx}.
  3. The second integer array \text{first} contains the location of the beginning of each row.
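
To make this concrete, here is a small example of the three CSR arrays, assuming a 5×5 tridiagonal 1D Poisson matrix (2 on the diagonal, -1 off the diagonal); the exact matrix in the picture above may differ in its boundary rows:

```python
# CSR arrays for an assumed 5x5 tridiagonal 1D Poisson matrix.
data  = [2, -1,  -1, 2, -1,  -1, 2, -1,  -1, 2, -1,  -1, 2]   # non-zeros, row by row
cidx  = [0,  1,   0, 1,  2,   1, 2,  3,   2, 3,  4,   3, 4]   # column index of each non-zero
first = [0, 2, 5, 8, 11, 13]                                  # start of each row in data/cidx
```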

To take advantage of the redundancy in the column indices of a matrix constructed by a finite-difference or finite-element discretisation on a structured mesh, we introduce a new sparse storage format: VCRS.

The VCRS format consists of five integer arrays and one floating-point array (a construction sketch in Python follows the list):

  1. The first array \text{col offset} contains the column index of the first non-zero element of each row.
  2. The second array \text{num row} contains the number of non-zero elements per row.
  3. The third array \text{col data} represents a unique set of indices per row, calculated as the column indices of the non-zero elements in the row (\text{cidx}) minus \text{col offset}. Here, the row numbers 0 and 4 in the matrix A have the same set of indices, \text{col data} = {0 1}, and the row numbers 1 and 2 have the same set of indices as well, \text{col data} = {0 1 2}.
  4. To reduce the redundancy in this array, we introduce a fourth array \text{col pointer} that contains an index per row, pointing at the starting position in \text{col data}.
  5. The same approach as in item 4 is also applied to the array \text{data} containing the values of the non-zero elements per row, i.e., each set of values is listed only once.
  6. Therefore, here too we need an additional array of pointers per row, \text{data pointer}, pointing at the position in \text{data} of the first non-zero value of each row.

Why is VCRS better than CSR?

At first glance, it seems that the VCRS format is based on more arrays than the CSR format: six versus three, respectively. However, the large arrays in the CSR format are cidx and data, and they contain redundant information in the form of repeated indices and values. For small matrices the overhead of VCRS can be significant, but for large matrices it can be beneficial to use VCRS, especially on GPUs with a limited amount of memory.

Summarising, the following factors motivate the use of the VCRS format:

  1. The CSR format of a large matrix contains a large amount of redundancy, especially if the matrix arises from finite-difference discretisations;
  2. The amount of redundancy of a matrix A can vary depending on the accuracy and storage requirements, giving the opportunity to use lossy compression;
  3. The exact representation of the matrix is not required for the preconditioner; an approximation might be sufficient for the convergence of the solver.

The lossy compression of preconditioners can be very beneficial for a GPU implementation, as it allows the data to be stored on hardware with a limited amount of memory while still taking advantage of its speed compared to CPU hardware.

VCRS with lossy compression

Of course, you can already use the VCRS format as described above. If you want to get even more out of the VCRS format, here are two mechanisms to adjust the data redundancy:

  • Quantization
  • Row classification

Quantization

Quantization is a lossy compression technique that compresses a range of values to a single value. It has well-known applications in image processing and digital signal processing.

With lossy compression, as opposed to lossless compression, some information is lost.

However, we need to make sure that the data loss does not affect the accuracy of the solution. The simplest example of quantization is rounding a real number to the nearest integer value.

The quantization technique can be used to make the matrix elements in different rows similar to each other for better compression. The quantization mechanism is based on the maximum and minimum values of a matrix and on a number of so-called bins, or sample intervals.

The figure above illustrates the quantization process for a matrix with values on the interval [0, 1]. In this example the number of bins is set to 5, meaning there are 5 intervals [\,0.2(i-1),\ 0.2i\,),\ i = 1, 2, 3, 4, 5. The matrix entries are normally distributed between 0 and 1, as shown by the black dots connected with the solid line. By applying quantization, the matrix values that fall into a bin are assigned a new value equal to the bin centre. Therefore, instead of the whole range of matrix entries, we only get 5 distinct values. Obviously, the larger the number of bins, the more accurate the representation of the matrix entries.
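
As an illustration, here is a minimal quantization sketch, assuming uniform bins between the minimum and maximum value; the function name and the bin count are our own:

```python
import numpy as np

# Replace every value by the centre of the bin it falls into.
def quantize(values, num_bins):
    vmin, vmax = values.min(), values.max()
    width = (vmax - vmin) / num_bins
    # bin index of each value, clamped so that the maximum lands in the last bin
    idx = np.minimum(((values - vmin) / width).astype(int), num_bins - 1)
    return vmin + (idx + 0.5) * width

vals = np.random.rand(10)           # matrix entries in [0, 1]
print(quantize(vals, num_bins=5))   # only 5 distinct values remain
```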

Row classification

Next, we introduce row classification as a mechanism to define the similarity of two different matrix rows. Given a sorted array of rows and a tolerance, we can easily search for two rows that are similar within that tolerance. The main assumption for row comparison is that the rows have the same number of non-zero elements.

Let R_i = \{ a_{i1} \; a_{i2} \; \dots \; a_{in}\} be the i-th row of matrix A of length n and R_j = \{ a_{j1} \; a_{j2} \; \dots \; a_{jn}\} be
the j-th row of A.

The comparison of two rows is summarized in Algorithm 3 below. If R_i is not smaller than R_j and R_j is not smaller than R_i, then the rows R_i and R_j are “equal within the given tolerance \lambda”. Algorithm 4 then describes the comparison of two complex values, and Algorithm 5 compares two floating-point numbers.

The figure below illustrates the classification of a complex matrix entry a_{ij}. Within a distance \lambda, the numbers are assumed to be equal to a_{ij}. Then, a_{ij} is smaller than the numbers in the dark gray area and larger than the numbers in the light gray area.
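
Since Algorithms 3-5 are given as figures, the sketch below is our own reconstruction of the idea: values are compared with the tolerance \lambda, complex numbers first by real part and then by imaginary part, and rows element by element; two rows are classified as equal when neither is “smaller” than the other:

```python
# Hedged reconstruction of the row-classification comparison (not the original code).

def compare_floats(a, b, lam):
    if a < b - lam: return -1
    if a > b + lam: return +1
    return 0                                    # equal within the tolerance

def compare_complex(a, b, lam):
    c = compare_floats(a.real, b.real, lam)
    return c if c != 0 else compare_floats(a.imag, b.imag, lam)

def compare_rows(row_i, row_j, lam):
    """Assumes both rows have the same number of non-zero elements."""
    for a, b in zip(row_i, row_j):
        c = compare_complex(complex(a), complex(b), lam)
        if c != 0:
            return c                            # first significant difference decides
    return 0                                    # rows are equal within lam

# rows whose entries differ by less than lam are classified as equal
print(compare_rows([2.0, -1.05], [2.02, -1.0], lam=0.1))   # prints 0
```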

Impact of quantization and row classification

The number of bins and the tolerance \lambda influence:

  1. the compression: the fewer bins and/or the larger \lambda, the better the matrix is compressed;
  2. the accuracy: the more bins and the smaller \lambda, the more accurate the matrix operations (such as matrix-vector multiplication);
  3. the computational time: the fewer bins and/or the larger \lambda, the faster the computations;
  4. the memory usage: the fewer bins and/or the larger \lambda, the less memory is used;
  5. the speedup on modern hardware (calculated as the ratio of the computational time of the algorithm using the original matrix to the computational time using the compressed matrix).

There is a trade-off between performance and accuracy: the more accurate the compressed matrix, the slower the matrix-vector multiplication.

Example: Helmholtz equation

From our previous posts, you might know that one of the problems we had to solve is the Helmholtz equation – the wave equation in the frequency domain.

 

Real part of the preconditioner
Imaginary part of the preconditioner

Using the VCRS format to store this matrix results in three to four times smaller memory requirements and three to four times faster matrix-vector multiplication, depending on the compression parameters. In general, we observed compression factors between 3 and 20, depending on the matrix.

Based on our experiments, the most reasonable parameter choice would be a tolerance λ = 0.1 and a number of bins equal to 100 000.

Summarising our experiments for the whole solver (Bi-CGSTAB preconditioned with a shifted-Laplace matrix-dependent multigrid method), we can conclude that the VCRS format can be used:

  • to reduce the memory required for the preconditioner,
  • to increase the performance of the solver on different hardware platforms (CPUs, GPUs),
  • with minimal effect on the accuracy of the solver.

Example: Reservoir simulation

The VCRS compression can also be used in other applications where the solvers are based on a preconditioned system of linear equations. For example, an iterative solver for linear equations is also an important part of a reservoir simulator.

The linear solver appears within a Newton step when solving the discretised non-linear partial differential equations describing fluid flow in porous media. The basic partial differential equations include a mass-conservation equation, Darcy’s law, and an equation of state relating the fluid pressure to its density. In its original form the values of the matrix are scattered. Although the matrix looks full due to the scattered entries, the most common number of non-zero elements per row is 7, while the maximum number of elements per row is 210.

The distribution of the matrix values is shown in the figure below. Note that the matrix has real-valued entries only. It can be seen that there is a large variety of matrix values, which makes the quantization and row classification effective.

Using the VCRS format to store this matrix results in two to three times smaller memory requirements and two to three times faster matrix-vector multiplication, depending on the compression parameters. 

Summary

In this post

  • we introduced the VCRS (Very Compressed Row Storage) format,
  • we showed that the VCRS format not only reduces the size of the stored matrix by a certain factor but also increases the efficiency of matrix-vector computations,
  • we introduced two parameters for lossy compression: the number of bins and the tolerance \lambda,
  • we concluded that with a proper choice of these compression parameters, the effect of the lossy compression on the whole solver is minimal,
  • we used VCRS compression on CPUs and GPUs,
  • we applied VCRS to real-world applications.

What matrix compression are you using? Let us know in the comment box below.



Heterogeneous Map-Reduce: Seismic Imaging Application (Part 2/3)

This article is the second part of the series about the heterogeneous map-reduce approach:
Part 1 – Heterogeneous Map-Reduce: meet the Task System (Part 1/3)
Part 2 – [You are here] – Heterogeneous Map-Reduce: Seismic Imaging Application (Part 2/3)
Part 3 – Heterogeneous Map-Reduce: Scientific Visualisation (Part 3/3)

In the previous part we described the heterogeneous map-reduce framework. Here, we start with an example from a seismic imaging application.

Problem setting

The oil and gas industry makes use of computationally intensive algorithms such as reverse-time migration and full waveform inversion to provide an image of the subsurface. The image is obtained by sending wave energy into the subsurface and recording the signal reflected back to the surface from interfaces with different physical properties. A seismic wave is usually generated by shots at known frequencies, placed close to the surface on land or close to the water surface at sea. Returning waves are usually recorded in time by hydrophones in a marine environment or by geophones during land acquisition. The goal of seismic imaging is to transform the seismograms into a spatial image of the subsurface.

Migration algorithms produce an image of the subsurface given seismic data measured at the surface. In particular, pre-stack depth migration produces the depth locations of reflectors by mapping seismic data from the time domain to the depth domain, assuming that a sufficiently accurate velocity model is available. The classic imaging principle is based on the correlation of the forward-propagated wavefield from a source and a backward-propagated wavefield from the receivers. In the frequency domain the image is calculated as follows:

\text{Image}(x,y,z) = \sum_{\text{shots}}\sum_{\omega} W_{\text{source}}^*(x,y,z,\omega)\, W_{\text{receivers}}(x,y,z,\omega)

where W_{\text{source}}(x,y,z,\omega) denotes the wavefield propagated from the source, W_{\text{receivers}}(x,y,z,\omega) the wavefield propagated from the receivers, and \omega the frequency. That means for every shot and every frequency we need to simulate the wave propagation twice.
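
As a toy illustration of this imaging condition, the sketch below accumulates the correlation of placeholder wavefields over shots and frequencies; in the real application each wavefield comes from solving a Helmholtz problem, and here only the real part of the correlation is kept for simplicity:

```python
import numpy as np

nx, ny, nz = 8, 8, 4
n_shots, n_freqs = 3, 5
image = np.zeros((nx, ny, nz))
rng = np.random.default_rng(0)

for _ in range(n_shots):
    for _ in range(n_freqs):
        # placeholder wavefields; in practice: two wave-propagation solves per term
        W_src = rng.standard_normal((nx, ny, nz)) + 1j * rng.standard_normal((nx, ny, nz))
        W_rec = rng.standard_normal((nx, ny, nz)) + 1j * rng.standard_normal((nx, ny, nz))
        image += np.real(np.conj(W_src) * W_rec)   # W_source^* x W_receivers
```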

Levels of parallelism

Level 1: The highest level of parallelization for frequency-domain migration is over the shots. Each shot is treated independently. We assume that the migration volume for one shot is computed on one compute node connected to none, one or more GPUs.

Level 2: The next level of parallelism involves the frequencies. For each frequency, a linear system of equations needs to be solved.

Level 3: The third level of parallelism includes matrix decomposition, where the matrix for the linear system of equations is decomposed into subsets of rows that fit on a single GPU.

Level 4: The last level of parallelism for migration in frequency domain is parallelization of matrix-vector multiplications (MVMs) and vector-vector operations.

Task system (Map-Reduce)

Here we describe how the task system works for the migration in frequency domain.

Read more about the task system in the previous post: Heterogeneous Map-Reduce: Meet the Task System

 

Mostly we will use the first level of parallelism. The server, or ‘master node’, creates one task per source. Each task is added to an “Available” queue. When a client requests a task, that task is moved from the queue to the “Running” list.

As we have seen earlier, the migration algorithm consists of forward and backward modelling in the frequency domain for each source. Therefore, the second level of parallelism for migration consists of parallelization over all frequencies for each source.

Let’s assume that we have N_s sources and N_\omega frequencies. Then, one task consists of computations of one frequency \omega_{s_i} for a given source {s_i}, i=1,...,N_s. In total, we have N_\omega\cdot N_s tasks.

For each frequency, a linear system of equations needs to be solved.  The matrix size and memory requirements are the same for each frequency, but the lower frequencies require less compute time than the higher ones. Here, we assume that one frequency for one source in the frequency domain fits in one compute node. At this point, the auto load-balancing of the tasks comes into play. There is no need to know beforehand how to distribute the shots over the compute nodes.
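
A toy sketch of how such a task system can be organised is shown below; the names follow the queue/list wording above, but the code is our own simplification of the task system described in Part 1:

```python
from queue import Queue

N_s, N_omega = 4, 8                      # sources and frequencies (example sizes)
available = Queue()                      # the "Available" queue on the master
running = []                             # the "Running" list

for s in range(N_s):
    for w in range(N_omega):
        available.put((s, w))            # one task = one frequency of one source

def client_requests_task():
    task = available.get()               # clients pull tasks whenever they are free,
    running.append(task)                 # which gives automatic load balancing
    return task

src, freq = client_requests_task()
print(f"client got source {src}, frequency index {freq}")
```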

If a compute node is connected to one or more GPUs, we can make use of the third level of parallelism and decompose the matrix across GPU(s). However, this is done within a task.

Migration movie

The movie starts with the velocity model. We run migration on the SEG/EAGE Overthrust model, which represents an acoustic constant-density medium with complex, layered structures and faults. Then the camera rotates and shows the migrated image, which initially appears to be empty. The migrated data from each shot is added and shown in the movie as it is received by the server (the ‘Reduce’ part of the algorithm). We can see how the layers become visible. At the end we see the final image obtained using migration in the frequency domain.

More details on the simulation

The volume has a size of 1000×1000×620 m³. The problem is discretized on a grid with 301×301×187 points and a spacing of 3.33 m in each coordinate direction.

The discretization for migration in the frequency domain is 2nd-order in space. A Ricker wavelet with a peak frequency of 15 Hz is chosen for the source, and the maximum frequency in this experiment is 30 Hz. Note that by reducing the maximum frequency, we can increase the grid spacing. For instance, by choosing a maximum frequency of 8 Hz, the grid spacing can be chosen as 25 m in each direction. The line of sources is located at a depth of 10 m and is equally spaced with an interval of 18.367 m in the x-direction.

The receivers are equally distributed in the two horizontal directions with the same spacing as the sources, at the depth of 20 m. The sampling interval for the modelled seismic data is 4 ms. The maximum simulation time is 0.5 s. For migration in the frequency domain we chose a frequency interval of 2 Hz.

What is next?

In this post we showed how to use the heterogeneous map-reduce framework and the task system on an application from seismic imaging: migration in the frequency domain.

In the next post of this series we are going to have a look at scientific visualisation.



Parallel wave modelling on an 8-GPU monster machine


A few years ago we had the rare opportunity to program an 8-GPU monster machine provided by NVIDIA. That was an experience! The goal was to parallelise a 3D wave modelling tool in the frequency domain (Helmholtz equation) on a machine with multiple GPUs. The motivation was to run problems of a realistic size that are usually too large to fit in the memory of one GPU. In fact, we were facing the same issue as with CPUs 20 years ago.

The monster machine
Nvidia System with 8 GPUs

For our numerical experiments NVIDIA provided a Westmere-based 12-core machine connected to 8 Tesla 2050 GPUs, as shown in the figure above. The 12-core machine has 48 GB of RAM. Each socket holds 6 Intel(R) Xeon(R) X5670 CPU cores @ 2.93 GHz and is connected through 2 PCI-e buses to 4 graphics cards. Note that two GPUs share one PCI-e bus connected to a socket. Each GPU consists of 448 cores with a clock rate of 1.5 GHz and has 3 GB of memory.

Multi-GPU approach

There are several approaches to deal with multi-GPU:

  1. The data-parallel approach, where all matrix-vector and vector-vector operations are split between multiple GPUs. The advantage of this approach is that it is relatively easy to implement. However, matrix-vector multiplication requires exchange of data between different GPUs, which can lead to significant data-transfer times if the computational part is small.
  2. Split of the algorithm, where different parts of the algorithm are executed on different devices. For instance, the solver is executed on one GPU and the preconditioner on another. In this way the communication between GPUs is minimized. However, this approach requires an individual solution for each algorithm.
  3. The domain-decomposition approach, where the original continuous or discrete problem is decomposed into parts which are executed on different GPUs, and the overlapping information (halos) is exchanged by data transfer. For our wave simulation in the frequency domain this approach can, however, have difficulties with convergence at higher frequencies (see Ernst, Gander).

We have chosen the data-parallel approach and the split of the algorithm, and made a comparison between multi-core and multi-GPU. We leave out the domain-decomposition approach because the convergence of the Helmholtz solver is not guaranteed. A schematic of the data-parallel idea is sketched below.
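
The sketch below illustrates the data-parallel idea with plain NumPy arrays standing in for the GPUs: the matrix rows and the result vector are split across devices, each device multiplies its own block of rows with the full input vector, and the partial results are gathered:

```python
import numpy as np

n, num_gpus = 12, 4
A = np.random.rand(n, n)                             # dense stand-in for the sparse matrix
x = np.random.rand(n)

row_blocks = np.array_split(np.arange(n), num_gpus)  # rows owned by each "GPU"
partial = [A[rows, :] @ x for rows in row_blocks]    # local matrix-vector product per device
y = np.concatenate(partial)                          # gather the result (data transfer in practice)

assert np.allclose(y, A @ x)
```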

Computations on Multi-GPU

You can do computations on multi-GPU by either:

  • pushing different contexts to different GPUs, or
  • creating multiple threads on the CPU, where each of them communicates with one GPU.

 

Figures: multi-GPU by pushing contexts, and multi-GPU by multi-threading.

For our purposes we have chosen the second option, since it was easier to understand and implement.

Issues

Implementation on multi-GPUs requires careful consideration of possibilities and optimization options. The issues we encountered during our work are listed below:

  • Multi-threading implementation, where the lifetime of a thread should be as long as that of the application. This is crucial for the multi-threading way of implementing multi-GPU. Note that in the case of pushing contexts this is not an issue.
  • Because of the limited GPU memory size, large problems need multiple GPUs.
  • Efficient memory reuse to avoid allocation/deallocation. Due to memory limitations the memory should be reused as much as possible, especially in the multigrid method. In our work we create a pool of vectors on the GPU and reuse them during the whole solution time.
  • Limit communications CPU\rightarrowGPU and GPU\rightarrowCPU.
  • Back then, it was beneficial to use texture memory when possible, but it was not easy since each GPU needs its own texture reference.
  • Coalescing is difficult since each matrix row has a different number of elements.

Helmholtz solver

The Helmholtz equation represents the time-harmonic wave propagation in the frequency domain and has applications in many fields of science and technology, e.g. in aeronautics, marine technology, geophysics, and optical problems. In particular we consider the Helmholtz equation discretized by a second order finite difference scheme.

As a solver for the Helmholtz equation (wave equation in frequency domain) we use Bi-CGSTAB with a shifted Laplacian multigrid preconditioner.

Since Bi-CGSTAB is a collection of vector additions, dot products and matrix-vector multiplications, the multi-GPU version of Bi-CGSTAB is straightforward. We have seen that the speedup on multiple GPUs is smaller than on a single GPU due to the data transfer between CPU and GPUs. However, it is possible to compute a problem of a realistic size on multiple GPUs, and the computation on multi-GPU is still many times faster than on the 12-core Westmere.

The shifted-Laplace preconditioner consists of a coarse-grid correction based on the Galerkin method with matrix-dependent prolongation, and Gauss-Seidel as a smoother. The implementation of the coarse-grid correction on multi-GPU is straightforward, since its main ingredient is matrix-vector multiplication. The coarse-grid matrices are constructed on the CPU and then transferred to the GPUs.

The Gauss-Seidel smoother on multi-GPU requires adaptation of the algorithm. We use eight-colour Gauss-Seidel, since the Helmholtz equation is given in three dimensions and the computations at each discretization point should be independent of the neighbours to allow parallelism.
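
A minimal sketch of the eight-colour ordering (our own illustration): each grid point gets its colour from the parities of its (i, j, k) indices, so a 7-point stencil never couples two points of the same colour and all points of one colour can be smoothed in parallel:

```python
import numpy as np

def colour(i, j, k):
    return (i % 2) + 2 * (j % 2) + 4 * (k % 2)     # colour index in 0..7

n = 4
grid = np.fromfunction(colour, (n, n, n), dtype=int)
for c in range(8):
    points = np.argwhere(grid == c)                # all points of one colour
    # smooth these points in parallel (on the GPU in the real implementation)
    print(f"colour {c}: {len(points)} independent points")
```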

Conclusions

We implemented the three-dimensional Helmholtz wave equation on multiple GPUs. To keep the double-precision convergence, the solver (the Bi-CGSTAB method) is implemented on the GPU in double precision and the preconditioner in single precision.

Two multi-GPU approaches have been considered: data parallel approach and a split of the algorithm.

For the data-parallel approach, we were able to solve larger problems than on one GPU and got better performance than the multi-threaded CPU implementation. However, due to the communication between the GPUs and the CPU, the resulting speedups were considerably smaller than for the single-GPU implementation.

To minimize the communication while still being able to solve large problems, we introduced the split of the algorithm. In this case the speedup of the multi-GPU version over the multi-core implementation is similar to that of the single GPU.

See the full comparison of the multi-GPU implementation to a single-GPU and a multi-threaded CPU implementation on a realistic problem size here, or request a copy by filling in the contact form or by sending us an email.

We would like to thank NVIDIA Corporation again (in particular François Courteille) for access to their many-core multi-GPU architecture.

This work has been presented during the following conferences:

  • GTC 2012, San Jose, US
  • ENUMATH 2011, Leicester, UK