Log In Sign Up

Strategies for the vectorized Block Conjugate Gradients method

Block Krylov methods have recently gained a lot of attraction. Due to their increased arithmetic intensity they offer a promising way to improve performance on modern hardware. Recently Frommer et al. presented a block Krylov framework that combines the advantages of block Krylov methods and data parallel methods. We review this framework and apply it on the Block Conjugate Gradients method,to solve linear systems with multiple right hand sides. In this course we consider challenges that occur on modern hardware, like a limited memory bandwidth, the use of SIMD instructions and the communication overhead. We present a performance model to predict the efficiency of different Block CG variants and compare these with experimental numerical results.


page 1

page 2

page 3

page 4


Hardware-Oriented Krylov Methods for High-Performance Computing

Krylov subspace methods are an essential building block in numerical sim...

Performance Analysis and Optimal Node-Aware Communication for Enlarged Conjugate Gradient Methods

Krylov methods are a key way of solving large sparse linear systems of e...

Split block Bloom filters

This short note describes a Bloom filter variant that takes advantage of...

A Hardware-aware and Stable Orthogonalization Framework

The orthogonalization process is an essential building block in Krylov s...

A posteriori superlinear convergence bounds for block conjugate gradient

In this paper, we extend to the block case, the a posteriori bound showi...

GPMR: An Iterative Method for Unsymmetric Partitioned Linear Systems

We introduce an iterative method named GPMR for solving 2x2 block unsymm...

Accelerating Nuclear Configuration Interaction Calculations through a Preconditioned Block Iterative Eigensolver

We describe a number of recently developed techniques for improving the ...

1 Introduction

Developers of numerical software are facing multiple challenges on modern HPC-hardware. Firstly, multiple levels of concurrency must be exploited to achieve the full performance. Secondly, due to that parallelism, communication between nodes is needed, which must be cleverly organized to avoid an expensive overhead. And most importantly, modern CPUs have a low memory bandwidth, compared to the peak FLOP rate, such that for standard linear solvers the memory bandwidth is the bottleneck for the performance. Therefore only algorithms with a high arithmetic intensity will perform well.

Instruction level parallelism is now apparent on all modern CPU architectures. They provide dedicated vector (or SIMD) instructions, that allow to proceed multiple floating point operations with one instruction call, e.g. AVX-512 allows processing 8 double at once. The efficient use of these instructions is a further challenge.

The Conjugate Gradients method (CG) is a standard tool for solving large, sparse, symmetric, positive definite, linear systems. The Block Conjugate Gradient (BCG) method was introduced in the 80s to improve the convergence rate for systems with multiple right hand sides O’Leary (1980). Recently these methods have been rediscovered to reduce the communication overhead in parallel computations Grigori et al. (2016); Grigori and Tissot (2017); Al Daas et al. (2017).

In this paper we present a generalization of the BCG method, which makes it applicable to arbitrary many right-hand-sides. We consider a symmetric, positive definite matrix and want to solve the matrix equation


This paper is structured as follows. In Section 2 we briefly review the theoretical background of block Krylov methods, using the notation of Frommer et al. (2017). Then in Section 3 we apply this theory on the BCG method. The implementation of the method, a theoretical performance model and some numerical experiments are presented in Section 4.

2 Block Krylov subspaces

Considering functions of matrices, Frommer et al. presented in Frommer et al. (2017) a generic framework for block Krylov methods. Further work on this framework can be found in the PhD thesis of Lund Lund (2018). In the following we review the most important definitions, which we will need in section 3. Frommer et al. used as numeric field, for simplicity of the following numerics, we restrict our self to .

Definition 1 (Block Krylov subspace)

Let be a *-subalgebra of and . The -th block Krylov subspace with respect to and is defined by


From that definition we find the following lemma directly.

Lemma 1

If and are two *-subalgebras of , with . Then



In this paper we want to consider the following *-subalgebras and -products and the corresponding Krylov spaces.

Definition 2 (Relevant *-subalgebras)

Let be a divider of . We define the following *-subalgebras and corresponding products:

hybrid: (4)
global: (5)
where denotes the

dimensional identity matrix and

denotes the set of matrices where only the diagonal matrices have non-zero values. Furthermore we define the special cases
classical: and (6)
parallel: (7)

The names result from the behavior of the resulting Krylov method; yields in the classical block Krylov method as presented by O’Leary O’Leary (1980), whereas results in a CG method, which is carried out for all right hand sides simultaneously, in a instruction level parallel fashion.

From that definition we could conclude the following embedding lemma.

Lemma 2 (Embeddings of *-subalgebras)

For , where is a divisor of and is a divisor of , we have the following embedding:


3 Block Conjugate Gradient method

Algorithm 1 shows the preconditioned BCG method. We recompute in line 6 to improve the stability of the method. A more elaborate discussion of the stability of the BCG method can be found in the paper of Dubrulle Dubrulle (2001). This stabilization has only mild effect on the performance as the communication that is needed to compute the block product could be carried out together with the previous block product.

3:  for  to convergence do
6:      {recompute}
14:  end for
Algorithm 1 Preconditioned Block Conjugate Gradients method (stablized)

The algorithm is build-up from 4 kernels:

  • BDOT: Computing the block product,

  • BAXPY: Generic vector update

  • BOP: Applying the operator (or preconditioner) on a block vector

  • BSOLVE: Solve a block system in the *-subalgebra

O’Leary showed the following convergence result to estimate the error of the classical BCG method.

Theorem 3.1 (Convergence of Block Conjugate Gradients (O’Leary, 1980, Theorem 5))

For the energy-error of the -th column of the classical BCG method, the following estimation hold:


denotes the eigenvalues of the preconditioned matrix

. The constant depends on and the initial error but not on .

This theorem holds for the classical method. However as the hybrid method is only a data-parallel version of the classical block method the same convergence rate hold with for the method. The following lemma gives us a convergence rate for the global methods.

Lemma 3 (Theoretical convergence rate of global methods)

The theoretical convergence rate of a global method using is


A global method is equivalent to solve the -dimensional system


with the classical block Krylov method with right hand sides. The matrix of this system has the same eigenvalues as but with times the multiplicity. Thus the -smallest eigenvalue is . Therefore and by applying Theorem 3.1 we deduce the theoretical convergence rate.

This result makes the global methods irrelevant for practical use. In particular for the non-global hybrid method would perform better.

4 Implementation and numerical experiments

With the Dune 2.6 release an abstraction for SIMD data types was introduced. The aim of these abstraction is the possibility to use the data types as a replacement for the numeric data type, like double or float, to create data parallel methods. For a more detailed review see Bastian et al. Bastian et al. (2019). Usually these SIMD data types are provided by external libraries like Vc Kretz and Lindenstruth (2012) or vectorclass Fog (2013), which usually provide data types with the length of the hardware SIMD (e.g.  or ). For problems with more right hand sides we use the LoopSIMD type. This type is basically an array but implements the arithmetic operations by static sized loops.

1void baxpy(scalar_field_type alpha,
2           const BlockProduct<scalar_field_type>& gamma,
3           const X& x, X& y){
4  for(size_t i=0;i<x.size();++i){
5    field_type xi = x[i]
6    field_type yi = y[i];
7    for(size_t r=0;r<P;++r){
8      yi += lane(r, xi)*gamma[r];
9    }
10    y[i] = yi;
11  }
Listing 1: Implementation of BAXPY

Listing 1 shows the implementation for the BAXPY kernel for the case .

In a first test series we examine the runtime of the kernels BDOT, BAXPY and BOP. To check the efficiency of our implementation we, take the following performance model into account. This performance model is a simplified variant of the ECM model presented by Hofmann et al. Hofmann et al. (2019). We assume that the runtime of a kernel is bounded by the following three factors:

  • : The time the processor needs to perform the necessary number of floating point operations. Where is the number of floating point operations of the kernel.

  • : The time to transfer the data from the main memory to the L1 cache. Where is the amount of data that needs to be transferred in the kernel.

  • : The time to transfer data between L1 cache and registers. Where is the amount of data that needs to be transferred in the kernel.

Finally the expected runtime is given by


Table 1 shows an overview of the performance relevant characteristics of the kernels. We observe that for the BOP kernel the runtime per right hand side decreases rapidly for small , this is in accordance with our expectation. For larger the runtime per right hand side increases slightly. We suppose that this effect is due to the fact that for larger one row of a block vector occupies more space in the caches, hence fewer rows can be cached. This effects could be mitigated by using a different sparse matrix format, like the Sell-C- format Kreutzer et al. (2014).

Furthermore we see that the runtime of the BDOT and BAXPY kernels is constant up to an certain (). This is in accordance with our expectation, as it is memory bound in that regime and does not depend on . At almost all the runtimes for global and hybrid version coincide except for . The reason for that is, that a takes memory, which is exactly the L1 cache size. In the non-global setting two of these matrices are modified during the computation, which then exceeds the L1 cache. This explains as well why the runtime for the case is so much higher than expected.

Figure 1 shows the measured data compared with the expected runtimes. All tests are performed on an Intel Skylake-SP Xeon Gold 6148 on one core. The theoretical peakflops are , the memory bandwidth is and the register bandwidth is .

Table 1: Performance relevant characteristics for the BDOT and BAXPY kernels. Number of floating point operations (), amount of data loaded from main memory (), number of data transfers between registers and L1-Cache (). is the number of non-zeros in .
Figure 1: Microbenchmarks for kernels BDOT, BAXPY and BOP using

In a second experiment we compare the runtime of the whole algorithm with each other. For that we discretized a 2D heterogeneous Poisson problem with a -point Finite Difference stencil. The right hand sides are initialized with random numbers. We iterate until the defect norm of each column has been decreased by a factor of . An ILU preconditioner was used. Figure 2 shows the results. We see that the best block size is . In another test we compare the runtimes for different parameters, where the algorithm is executed times until all right hand sides are solved. In this case the case is the fastest but only slightly slower as the . The reason for that is the worse cache behavior of the BOP kernel, like we have seen before. Note that on a distributed machine the latter case would need x less communication.

Figure 2: Time to solution for different parameters. Numbers on top of the bars denote the number of iterations. Left: Right: Different configurations: is the number of repetitions to solve for all right hand sides.

5 Conclusion and outlook

In this paper we have presented strategies for the vectorized BCG method. For that we reviewed the block Krylov framework of Frommer et al. and apply it on the BCG method. This makes it possible to use the advantages of the BCG method as far as it is beneficial, while the number of right hand sides can be further increased. This helps to decrease the communication overhead and improve the arithmetic intensity of the kernels. We observed that the runtime of the individual kernels scale linearly with the number of right hand sides as long as they are memory bound ( on our machine). That means that it is always beneficial to use at least this block size , depending on the problem it could also be beneficial to choose even larger .

The found optimizations are also applicable to other block Krylov space methods like GMRes, MINRES or BiCG, and could be combined with pipelining techniques. These approaches are the objective of future work.