VCRS matrix compression for GPUs

Have you considered using compression to solve huge linear systems of equations? How can lossy compression speed up numerical solvers without affecting accuracy? Do you need compression that is suitable for GPU(s)?

Well, the good news is that we have developed a matrix compression scheme that you can use on GPUs. We call it VCRS – Very Compressed Row Storage. In this post, we first focus on “why do we need matrix compression”. Then we describe “what is VCRS” and “how to use it for compression”. Finally, we give a few examples from real-world applications.

Why matrix compression?

If you are modelling or simulating a physical process, most of the time you end up with differential equations describing this process. Very often we can’t solve these differential equations analytically in continuous space. Therefore, we need to discretise them and solve numerically. Discretisation can be viewed as representing your differential equation as a system of linear equations in matrix form

A x = b

where A is the matrix, x is the solution vector, and b is the right-hand side vector.

Most of the time, the matrix A is huge (millions of rows) and sparse (few non-zero elements per row). We can’t simply invert this matrix to get the solution, since matrix inversion is very costly in terms of both memory and computation. To solve this system of linear equations, we use iterative methods. Very often we can speed up iterative methods using a preconditioner M^{-1} such that

M^{-1}A x = M^{-1}b

Basically, the preconditioner M^{-1} is a matrix which is close to the inverse of the matrix A, in other words M^{-1} \approx A^{-1}.

If M^{-1}A is the identity matrix, then we automatically obtain the solution of our system of linear equations: x = M^{-1}b.

Sometimes, the original matrix A can be implemented matrix-free, meaning that its elements are calculated on the fly and not stored. However, most of the time we have to store the preconditioner matrix M in memory. The less storage both matrices take, the larger the problems we can compute. This is especially important if we use accelerators, for example a GPU with limited memory.

An important aspect of a preconditioner is that it does not have to be exact. It is acceptable to have a preconditioner that is an approximation of an approximation.

Therefore, lossy compression of the preconditioner enables bigger problems to be computed on accelerators.

There are many ways to compress a matrix. In this post, we suggest compressing the preconditioner matrix M using VCRS compression. This way we take two pigeons with one bean:

  • speed up the iterative solver with a preconditioner,
  • obtain a preconditioner suitable for GPU(s).

What is VCRS?

VCRS stands for Very Compressed Row Storage format. This method was developed during Hans’s PhD at TU Delft. The VCRS format was inspired by the well-known CSR (Compressed Sparse Row) format.

CSR format versus VCRS format

To illustrate compression, let’s consider a small matrix A from a one-dimensional Poisson equation with Dirichlet boundary conditions, see the picture above.

The CSR format consists of two integer arrays and one floating-point array:

  1. The non-zero elements of the matrix A are stored consecutively, row by row, in the floating-point array \text{data}.
  2. The column index of each element is stored in the integer array \text{cidx}.
  3. The second integer array \text{first} contains the location of the beginning of each row.
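To make this concrete, here is a small sketch in plain Python (our own illustration; the 5×5 Poisson matrix below is an assumed example matching the description, with the Dirichlet boundary conditions eliminated):

```python
# A 5x5 tridiagonal 1D Poisson matrix (an assumed example):
#
#     [ 2 -1  0  0  0]
#     [-1  2 -1  0  0]
# A = [ 0 -1  2 -1  0]
#     [ 0  0 -1  2 -1]
#     [ 0  0  0 -1  2]

# CSR arrays:
data  = [ 2.0, -1.0,        # row 0
         -1.0,  2.0, -1.0,  # row 1
         -1.0,  2.0, -1.0,  # row 2
         -1.0,  2.0, -1.0,  # row 3
         -1.0,  2.0]        # row 4
cidx  = [0, 1,
         0, 1, 2,
         1, 2, 3,
         2, 3, 4,
         3, 4]
first = [0, 2, 5, 8, 11, 13]  # first[i] = start of row i in data/cidx
```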

To take advantage of the redundancy in the column indices of a matrix constructed by a discretisation with finite differences or finite elements on structured meshes, we introduce a new sparse storage format: VCRS.

The VCRS format consists of five integer arrays and one floating-point array:

  1. The first array \text{col offset} contains the column index of the first non-zero element of each row.
  2. The second array \text{num row} contains the number of non-zero elements per row.
  3. The third array \text{col data} represents a unique set of indices per row, calculated as the column indices of the non-zero elements in the row (\text{cidx}) minus \text{col offset}. Here, the row numbers 0 and 4 of the matrix A have the same set of indices, \text{col data} = {0 1}, and the row numbers 1 and 2 have the same set of indices as well, \text{col data} = {0 1 2}.
  4. To reduce redundancy in this array, we introduce a fourth array \text{col pointer} that contains an index per row, pointing at the starting position of that row’s set in \text{col data}.
  5. The same deduplication approach is applied to the array \text{data} containing the values of the non-zero elements per row, i.e., each unique set of values is stored only once.
  6. Therefore, we also need an additional array of pointers per row, \text{data pointer}, pointing at the position of the first non-zero value of the row in \text{data}.
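Putting the six arrays together, here is a minimal sketch (plain Python, our own illustration rather than the original implementation) that builds the VCRS arrays from the CSR arrays above:

```python
def csr_to_vcrs(data, cidx, first):
    """Build the six VCRS arrays from CSR arrays; a sketch, not tuned for speed."""
    n_rows = len(first) - 1
    col_offset, num_row = [], []
    col_data, col_pointer = [], []
    val_data, data_pointer = [], []
    seen_cols, seen_vals = {}, {}  # pattern -> start index, for deduplication

    for i in range(n_rows):
        lo, hi = first[i], first[i + 1]
        offset = cidx[lo]                 # column of first non-zero in row i
        col_offset.append(offset)
        num_row.append(hi - lo)

        cols = tuple(c - offset for c in cidx[lo:hi])  # shifted column indices
        if cols not in seen_cols:                      # store each pattern once
            seen_cols[cols] = len(col_data)
            col_data.extend(cols)
        col_pointer.append(seen_cols[cols])

        vals = tuple(data[lo:hi])
        if vals not in seen_vals:                      # store each value set once
            seen_vals[vals] = len(val_data)
            val_data.extend(vals)
        data_pointer.append(seen_vals[vals])

    return col_offset, num_row, col_data, col_pointer, val_data, data_pointer
```

For the Poisson matrix above, rows 1 to 3 share one index pattern and rows 0 and 4 share another, so each pattern, and likewise each unique set of values, is stored only once.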

Why is VCRS better than CSR?

At first glance, the VCRS format seems to be based on more arrays than the CSR format: six versus three. However, the large arrays in the CSR format are cidx and data, and they contain redundant information: repeated indices and values of the matrix. For small matrices, the overhead of the extra arrays can be significant; for large matrices, however, VCRS can be beneficial, especially on GPUs with a limited amount of memory.

Summarising, the following factors contribute to the usage of the VCRS format:

  1. The CSR format of a large matrix contains a large amount of redundancy, especially if the matrix arises from finite-difference discretisations;
  2. The amount of redundancy kept in the stored matrix A can be varied depending on the accuracy and storage requirements, which opens the opportunity to use lossy compression;
  3. The exact representation of the matrix is not required for the preconditioner; an approximation might be sufficient for the convergence of the solver.

The lossy compression of preconditioners can be very beneficial for a GPU implementation, as it allows storing the data on hardware with a limited amount of memory while taking advantage of its speed compared to CPU hardware.

VCRS with lossy compression

Of course, you can already use the VCRS format as described above. To get even more out of it, here are two mechanisms to adjust the data redundancy:

  • Quantization
  • Row classification

Quantization

Quantization is a lossy compression technique that maps a range of values to a single value. It has well-known applications in image processing and digital signal processing.

With lossy compression, as opposed to lossless compression, some information is lost.

However, we need to make sure that this loss of information does not affect the accuracy of the solution. The simplest example of quantization is rounding a real number to the nearest integer.

The quantization technique can be used to make the matrix elements in different rows similar to each other, which improves compression. The quantization mechanism is based on the maximum and minimum values of the matrix and on a number of so-called bins, or sample intervals.

The figure above illustrates the quantization of a matrix with values on the interval [0, 1]. In this example the number of bins is set to 5, meaning there are 5 intervals [0.2(i-1),\ 0.2i),\ i = 1, 2, 3, 4, 5. The matrix entries are normally distributed between 0 and 1, as shown by the black dots connected with the solid line. By applying quantization, the matrix values that fall into a bin are assigned a new value equal to the bin centre. Therefore, instead of the whole range of matrix entries, we only get 5 distinct values. Obviously, the larger the number of bins, the more accurate the representation of the matrix entries.
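As a sketch of this binning step (our own illustration using NumPy; the function name and parameters are made up):

```python
import numpy as np

def quantize(values, num_bins):
    """Map each value to the centre of its bin; assumes values.max() > values.min()."""
    vmin, vmax = values.min(), values.max()
    width = (vmax - vmin) / num_bins
    # Bin index per value, clipped so that vmax falls into the last bin.
    idx = np.minimum(((values - vmin) / width).astype(int), num_bins - 1)
    return vmin + (idx + 0.5) * width  # bin centres

vals = np.random.rand(10)
print(quantize(vals, num_bins=5))      # at most 5 distinct values remain
```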

Row classification

Next, we introduce row classification as a mechanism to define the similarity of two different matrix rows.
Given a sorted array of rows and a tolerance, we can easily search for two rows that are similar within that tolerance. The main assumption for row comparison is that the rows have the same number of non-zero elements.

Let R_i = \{ a_{i1} \; a_{i2} \; \dots \; a_{in}\} be the i-th row of matrix A of length n and R_j = \{ a_{j1} \; a_{j2} \; \dots \; a_{jn}\} be
the j-th row of A.

The comparison of two rows is summarised in Algorithm 3 below. If R_i is not smaller than R_j and R_j is not smaller than R_i, then the rows R_i and R_j are “equal within the given tolerance \lambda”.
Algorithm 4 then describes the comparison of two complex values, and Algorithm 5 compares two floating-point numbers.

The figure below illustrates the classification of a complex matrix entry a_{ij}. Within a distance \lambda, numbers are assumed to be equal to a_{ij}. Then, a_{ij} is smaller than the numbers in the dark gray area and larger than the numbers in the light gray area.
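In the spirit of Algorithms 3-5 (which are not reproduced here), a simplified sketch of comparing two real-valued rows within a tolerance \lambda could look like this:

```python
def less_than(a, b, lam):
    """Compare two floats: a is smaller only if it differs from b by more than lam.
    (For complex entries, one could compare real parts first, then imaginary parts.)"""
    return a < b - lam

def row_less_than(Ri, Rj, lam):
    """Lexicographic row comparison; rows must have the same length."""
    for a, b in zip(Ri, Rj):
        if less_than(a, b, lam):
            return True
        if less_than(b, a, lam):
            return False
    return False

def rows_equal(Ri, Rj, lam):
    """Ri and Rj are 'equal within tolerance lam' if neither is smaller."""
    return not row_less_than(Ri, Rj, lam) and not row_less_than(Rj, Ri, lam)
```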

Impact of quantization and row classification

The number of bins and the tolerance \lambda influence:

  1. the compression: the fewer bins and/or the larger \lambda, the better the matrix is compressed;
  2. the accuracy: the more bins and the smaller \lambda, the more accurate the matrix operations (such as matrix-vector multiplication);
  3. the computational time: the fewer bins and/or the larger \lambda, the faster the computations;
  4. the memory usage: the fewer bins and/or the larger \lambda, the less memory is used;
  5. the speedup on modern hardware, calculated as the ratio of the computational time of the algorithm using the original matrix to the computational time using the compressed matrix.

There is a trade-off between performance and accuracy: the more accurate the compressed matrix, the slower the matrix-vector multiplication.

Example: Helmholtz equation

From our previous posts, you might know that one of the problems we had to solve is the Helmholtz equation – the wave equation in the frequency domain.

 

[Figures: real part and imaginary part of the preconditioner.]

Using the VCRS format to store this matrix results in three to four times smaller memory requirements and three to four times faster matrix-vector multiplication, depending on the compression parameters. In general, we observed compression factors between 3 and 20, depending on the matrix.

Based on our experiments, the most reasonable parameter choice would be a tolerance \lambda = 0.1 and 100 000 bins.

Summarising our experiments for the whole solver (Bi-CGSTAB preconditioned with shifted Laplace matrix-dependent multigrid method), we can conclude that the VCRS format can be used

  • to reduce the memory footprint of the preconditioner,
  • to increase the performance of the solver on different hardware platforms (CPUs, GPUs),
  • with minimal effect on the accuracy of the solver.

Example: Reservoir simulation

The VCRS compression can also be used in other applications where the solvers are based on a preconditioned system of linear equations. For example, an iterative solver for linear equations is also an important part of a reservoir simulator.

It appears within a Newton step to solve the discretised non-linear partial differential equations describing fluid flow in porous media. The basic partial differential equations include a mass-conservation equation, Darcy’s law, and an equation of state relating the fluid pressure to its density. In its original form, the values of the matrix are scattered. Although the matrix looks full due to the scattered entries, the most common number of non-zero elements per row is 7; the maximum number of elements per row, however, is 210.

The distribution of the matrix values is shown in the figure below. Note that the matrix has real-valued entries only. There is a large variety of matrix values, which makes quantization and row classification effective.

Using the VCRS format to store this matrix results in two to three times smaller memory requirements and two to three times faster matrix-vector multiplication, depending on the compression parameters. 

Summary

In this post

  • we introduced the VCRS (Very Compressed Row Storage) format,
  • we showed that the VCRS format not only reduces the size of the stored matrix (by a factor of 3 to 20 in our experiments) but also increases the efficiency of matrix-vector computations,
  • we introduced two parameters for lossy compression: the number of bins and the tolerance \lambda,
  • we concluded that with a proper choice of these compression parameters, the effect of the lossy compression on the whole solver is minimal,
  • we used VCRS compression on CPUs and GPUs,
  • we applied VCRS to real-world applications.

What matrix compression are you using? Let us know in the comment box below.



How to Solve the “Dependency Hell” with Python

If you are a programmer, you know the pain of installing 3rd-party libraries or updating them, especially if you are working on different platforms. After manually installing a library, you often find out that a dependency (yet another library used by the one you want to install) needs to be installed as well.

Dependency hell

There is another dimension to this problem: different platforms have different package systems. To name a few: Mac OS X uses MacPorts or brew, Ubuntu uses apt, and CentOS uses yum. If you are like me, you might prefer to use Mac OS X at home and one of the Linux distributions on a computational cluster. All these systems/distributions have their own libraries with different versions, compiler versions, versions of dependencies, etc.

Even if you get the libraries you need, they might have different versions on different platforms, because distributions ship their own versions. There is even an official term for this: dependency hell.

Dependency hell is a colloquial term for the frustration of some software users who have installed software packages which have dependencies on specific versions of other software packages. Michael Jang (2006). Linux annoyances for geeks. O’Reilly Media. ISBN 9780596552244. Retrieved 2012-02-16.

 

How nice would it be to have all of this done automatically? No more configure scripts! A huge time saver!

The idea is to have the same set of libraries across all (Unix-like) systems. Ideally, you want to compile your software on different platforms and have the same versions of the required libraries on each of them.

What libraries are we talking about? If you are programming, then there is a good chance you want to install CMake, gcc, Qt, you name it.

Solution

We can break down the installation of a library into 6 actions for each platform that you are using:

  1. install dependencies (which have to be taken into account),
  2. download the package on each platform,
  3. patch (if necessary, apply small compatibility changes),
  4. configure,
  5. compile the library,
  6. install.

There are two ways to implement the solution:

  1. The most obvious approach: do every action manually.
  2. The smart approach: automate the actions with a magic Python script. One program, and the “dependency hell” problem is solved.

Magic Python script(s)

To implement the smart approach, i.e., to automate the installation process, we suggest writing a Python script. Actually, two Python scripts: one contains the description of the classes (we will discuss it below), and the second one applies these classes to each library you need to install.

Python script #1

Let’s start with the first Python script, which contains a base class called ExternalLibrary().

The most important function in this class is Run(), which executes steps 1-6 from the previous section. Note that this function is recursive, because step 1, “install dependencies”, also calls Run() for each dependency.

The code snippet below shows the function Run() and the steps 1-6 to install a library.
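(The original snippet is not reproduced here; the following is a minimal sketch, with attribute and method names that are our assumptions.)

```python
class ExternalLibrary:
    """Base class describing how to install one 3rd-party library (a sketch)."""

    def __init__(self, name, url="", source_dir="",
                 config_options=(), dependencies=()):
        self.name = name
        self.url = url
        self.source_dir = source_dir
        self.config_options = list(config_options)
        self.dependencies = list(dependencies)

    def Run(self):
        # Step 1: install the dependencies first. This call is what makes
        # Run() recursive: every dependency installs its own dependencies.
        for dep in self.dependencies:
            dep.Run()
        self.Download()    # step 2: fetch the package
        self.Patch()       # step 3: apply small compatibility changes
        self.Configure()   # step 4: build-system specific (see below)
        self.Compile()     # step 5: compile the library
        self.Install()     # step 6: install into a common prefix

    # Placeholders; Configure() is overridden in the derived classes below.
    def Download(self): pass
    def Patch(self): pass
    def Configure(self): pass
    def Compile(self): pass
    def Install(self): pass
```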

Let us have a closer look at step 4: configure. Configuration is done by build systems, which create a makefile to compile the library itself and its dependencies. Different libraries use different build systems: to name a few, Qt and gcc use gnuconf, while CMake and MySQL use CMake on Linux platforms. To distinguish between different build systems, we created two derived classes: ExternalLibraryCMake() and ExternalLibraryGNUConf(). The only difference between the two is the Configure() function.
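Continuing the sketch above, the two derived classes could look as follows (again, the details are assumptions):

```python
import subprocess

class ExternalLibraryCMake(ExternalLibrary):
    """Libraries configured with CMake (e.g. MySQL)."""
    def Configure(self):
        # Generate the makefiles with cmake in the source directory.
        subprocess.check_call(["cmake", "."] + self.config_options,
                              cwd=self.source_dir)

class ExternalLibraryGNUConf(ExternalLibrary):
    """Libraries configured with a GNU configure script (e.g. Qt, gcc)."""
    def Configure(self):
        # Run the library's own ./configure script.
        subprocess.check_call(["./configure"] + self.config_options,
                              cwd=self.source_dir)
```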

 

Python script #2

The second script is a list of the libraries you need, with their specific parameters. For example, let’s consider installing CMake 3.4.3. Of course, we have to know

  • where to download the library,
  • how to set up the source directory,
  • the configuration parameters,
  • the dependencies.

Prior to configuring and compiling a library (steps 4 and 5), the environment must be set up in such a way that the library knows where to find the libraries it depends on. To do that, the Python program creates a simple bash script that sets variables such as LD_LIBRARY_PATH or PATH, as sketched below.
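A minimal sketch of this step (the install_dir attribute, i.e. the library’s install prefix, is our assumption):

```python
import os

def write_env_script(path, dependencies):
    """Write a bash script exporting the paths of already-installed
    dependencies, so that configure/compile can find them (a sketch)."""
    with open(path, "w") as f:
        f.write("#!/bin/bash\n")
        for dep in dependencies:
            prefix = dep.install_dir  # assumed attribute: install prefix
            f.write(f'export PATH="{prefix}/bin:$PATH"\n')
            f.write(f'export LD_LIBRARY_PATH="{prefix}/lib:$LD_LIBRARY_PATH"\n')
    os.chmod(path, 0o755)  # make the generated script executable
```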

 

Then the Python script will look like this:
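(The original listing is not reproduced here; the sketch below continues the classes from script #1, with an assumed CMake download URL and a hypothetical Qt dependency for illustration only.)

```python
# Script #2: the list of libraries to install, with their parameters.
cmake = ExternalLibraryCMake(
    name="cmake-3.4.3",
    url="https://cmake.org/files/v3.4/cmake-3.4.3.tar.gz",  # assumed URL
    source_dir="cmake-3.4.3",
)

qt = ExternalLibraryGNUConf(
    name="qt-5.6.0",
    url="...",  # the Qt 5.6.0 download location goes here
    source_dir="qt-everywhere-opensource-src-5.6.0",
    config_options=["-opensource", "-confirm-license"],
    dependencies=[cmake],  # hypothetical dependency, for illustration only
)

qt.Run()  # installs cmake first (recursively), then Qt itself
```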

 

As you can see, there are some checks to avoid doing the same operation twice, e.g., if the package has already been downloaded or extracted.

A recursive function can easily produce the dependency graph. This graph is a visual check that all dependencies are correctly set up:

 

Part of the dependency graph for our in-house code

How to handle updates?

A package update can easily be added with another small class that depends on the earlier version of the package; only a couple of functions need to be changed. For example, once a more recent version becomes available, say Qt 5.7.0, we add a simple derived class ExternalLibraryQt570(ExternalLibraryQt560) based on the previous library version Qt 5.6.0. The code will look like this:
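(The original listing is missing here; the following is our sketch, continuing the classes defined above.)

```python
class ExternalLibraryQt560(ExternalLibraryGNUConf):
    """Qt 5.6.0, with its full installation description."""
    def __init__(self):
        super().__init__(name="qt-5.6.0",
                         url="...",  # Qt 5.6.0 download location
                         source_dir="qt-everywhere-opensource-src-5.6.0")

class ExternalLibraryQt570(ExternalLibraryQt560):
    """Qt 5.7.0: only the version-specific bits are overridden."""
    def __init__(self):
        super().__init__()
        self.name = "qt-5.7.0"
        self.url = "..."  # Qt 5.7.0 download location
        self.source_dir = "qt-everywhere-opensource-src-5.7.0"
```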

Parallel compilation

A nice side effect of using a Python script for the automatic installation of 3rd-party libraries is parallel compilation. At runtime, your Python script can check how many processors are available on your computer and automatically launch a parallel compilation, utilising all the CPU power you’ve got.
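For instance, the compile step could pass the detected core count to make (a sketch, continuing the class above):

```python
import os
import subprocess

# Inside ExternalLibrary, the compile step can use all available cores:
def Compile(self):
    ncores = os.cpu_count() or 1                    # detect processor count
    subprocess.check_call(["make", f"-j{ncores}"],  # e.g. "make -j8"
                          cwd=self.source_dir)
```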

Summary

Let us summarise the advantages of using the Python scripts to automate the installation of the 3rd-party libraries and solve dependency hell:

  • no manual repetitive work: once set up in the Python script, it runs automatically on different platforms;
  • updates are easily added;
  • all knowledge about the configuration system of each library is captured in the Python script and not lost;
  • compilation is launched in parallel automatically, utilising multi-core CPUs.

To quote Elsa from the Disney movie Frozen: “Recursivity never bothered me anyway” 🙂

How do you update 3rd-party libraries?

Let us know in the comment box below if you would like to have the code in GitHub.

