Developers of numerical software face multiple challenges on modern HPC hardware. Firstly, multiple levels of concurrency must be exploited to achieve full performance. Secondly, this parallelism requires communication between nodes, which must be organized cleverly to avoid expensive overhead. Most importantly, modern CPUs have a low memory bandwidth compared to their peak FLOP rate, so that for standard linear solvers the memory bandwidth is the performance bottleneck. Therefore only algorithms with a high arithmetic intensity perform well.
Instruction level parallelism is available on all modern CPU architectures. They provide dedicated vector (or SIMD) instructions that perform multiple floating point operations with a single instruction call, e.g. AVX-512 processes 8 doubles at once. The efficient use of these instructions is a further challenge.
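To illustrate, a loop of the following form (a generic sketch, not code from this paper) has no loop-carried dependency and is the typical target for such vector instructions:

```cpp
#include <cstddef>

// A scalar update loop without loop-carried dependencies. With AVX-512
// the compiler can process 8 doubles per vector instruction here.
void axpy(double a, const double* x, double* y, std::size_t n)
{
  for (std::size_t i = 0; i < n; ++i)
    y[i] += a * x[i]; // vectorizable elementwise update
}
```

Compilers can often auto-vectorize such simple loops; more complex kernels typically require explicit SIMD types, as discussed in Section 4.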
The Conjugate Gradients method (CG) is a standard tool for solving large, sparse, symmetric positive definite linear systems. The Block Conjugate Gradient (BCG) method was introduced in the 1980s to improve the convergence rate for systems with multiple right-hand sides O’Leary (1980). Recently these methods have been rediscovered to reduce the communication overhead in parallel computations Grigori et al. (2016); Grigori and Tissot (2017); Al Daas et al. (2017).
In this paper we present a generalization of the BCG method which makes it applicable to arbitrarily many right-hand sides. We consider a symmetric positive definite matrix $A \in \mathbb{R}^{N \times N}$ and want to solve the matrix equation $AX = B$, where $B \in \mathbb{R}^{N \times s}$ collects the $s$ right-hand sides.
This paper is structured as follows. In Section 2 we briefly review the theoretical background of block Krylov methods, using the notation of Frommer et al. (2017). Then in Section 3 we apply this theory to the BCG method. The implementation of the method, a theoretical performance model and some numerical experiments are presented in Section 4.
2 Block Krylov subspaces
Considering functions of matrices, Frommer et al. presented in Frommer et al. (2017) a generic framework for block Krylov methods. Further work on this framework can be found in the PhD thesis of Lund (2018). In the following we review the most important definitions, which we will need in Section 3. Frommer et al. used $\mathbb{C}$ as the numeric field; for simplicity of the following numerics, we restrict ourselves to $\mathbb{R}$.
Definition 1 (Block Krylov subspace)
Let $\mathbb{S}$ be a *-subalgebra of $\mathbb{R}^{s \times s}$ and $B \in \mathbb{R}^{N \times s}$. The $m$-th block Krylov subspace with respect to $A$, $B$ and $\mathbb{S}$ is defined by
$$\mathcal{K}_m^{\mathbb{S}}(A, B) := \left\{ \sum_{k=0}^{m-1} A^k B \, \gamma_k \;\middle|\; \gamma_k \in \mathbb{S} \right\}.$$
From this definition we directly obtain the following lemma.
Lemma 1
If $\mathbb{S}_1$ and $\mathbb{S}_2$ are two *-subalgebras of $\mathbb{R}^{s \times s}$ with $\mathbb{S}_1 \subseteq \mathbb{S}_2$, then
$$\mathcal{K}_m^{\mathbb{S}_1}(A, B) \subseteq \mathcal{K}_m^{\mathbb{S}_2}(A, B).$$
In this paper we consider the following *-subalgebras and *-products and the corresponding Krylov subspaces.
Definition 2 (Relevant *-subalgebras)
Let $k$ be a divisor of $s$. We define the following *-subalgebras and corresponding products:
$$\mathbb{G}_k := \left\{ \gamma \otimes I_k \;\middle|\; \gamma \in \mathbb{R}^{(s/k) \times (s/k)} \right\}
\quad\text{and}\quad
\mathbb{D}_k := \left\{ \operatorname{diag}(\sigma_1, \dots, \sigma_{s/k}) \;\middle|\; \sigma_i \in \mathbb{R}^{k \times k} \right\},$$
where $I_k$ denotes the $k$-dimensional identity matrix and $\mathbb{D}_k$ denotes the set of block-diagonal matrices where only the $k \times k$ diagonal blocks have non-zero values. Furthermore we define the special cases $\mathbb{D}_s = \mathbb{R}^{s \times s}$ and $\mathbb{D}_1$, the set of diagonal matrices.
The names result from the behavior of the resulting Krylov method: the full algebra $\mathbb{R}^{s \times s}$ yields the classical block Krylov method as presented by O’Leary (1980), whereas the diagonal matrices result in a CG method which is carried out for all right-hand sides simultaneously, in an instruction-level parallel fashion.
From this definition we conclude the following embedding lemma.
Lemma 2 (Embeddings of *-subalgebras)
For block sizes $k_1$ and $k_2$, where $k_1$ is a divisor of $k_2$ and $k_2$ is a divisor of $s$, we have the following embedding: the block Krylov subspace corresponding to $k_1$ is contained in the one corresponding to $k_2$.
3 Block Conjugate Gradient method
Algorithm 1 shows the preconditioned BCG method. In line 6 a block product is recomputed to improve the stability of the method. A more elaborate discussion of the stability of the BCG method can be found in the paper of Dubrulle (2001). This stabilization has only a mild effect on the performance, as the communication that is needed to compute this block product can be carried out together with the previous block product.
The algorithm is built from 4 kernels:
BDOT: compute the block product of two block vectors,
BAXPY: generic block vector update,
BOP: apply the operator (or preconditioner) to a block vector,
BSOLVE: solve a small block system in the *-subalgebra.
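As an illustration, a minimal sketch of the BDOT kernel could look as follows; the storage layout (block vectors stored row-major, so that the $s$ values of one row are contiguous) and all names are assumptions for this sketch, not the actual implementation:

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Sketch of the BDOT kernel: computes the s x s block product S = X^T Y
// for block vectors X, Y with N rows and s columns. The block vectors
// are assumed to be stored row-major, i.e. the s values belonging to
// one row are contiguous in memory.
template <std::size_t s>
std::array<double, s * s> bdot(const std::vector<double>& X,
                               const std::vector<double>& Y,
                               std::size_t N)
{
  std::array<double, s * s> S{}; // zero-initialized accumulator
  for (std::size_t row = 0; row < N; ++row)
    for (std::size_t i = 0; i < s; ++i)
      for (std::size_t j = 0; j < s; ++j)
        S[i * s + j] += X[row * s + i] * Y[row * s + j];
  return S;
}
```

With this layout the innermost loop runs over contiguous memory, which suits SIMD execution; each row contributes a rank-1 update to the accumulator.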
O’Leary showed the following convergence result to estimate the error of the classical BCG method.
Theorem 3.1 (Convergence of Block Conjugate Gradients (O’Leary, 1980, Theorem 5))
For the energy error of the $i$-th column of the classical BCG method, the following estimate holds:
$$\| x_i - x_i^{(m)} \|_A \le C \left( \frac{\sqrt{\kappa_s} - 1}{\sqrt{\kappa_s} + 1} \right)^m, \qquad \kappa_s = \frac{\lambda_N}{\lambda_s},$$
where $\lambda_1 \le \lambda_2 \le \dots \le \lambda_N$ denote the eigenvalues of the preconditioned matrix. The constant $C$ depends on $s$ and the initial error, but not on $m$.
This theorem holds for the classical method. However, as the hybrid method is only a data-parallel version of the classical block method, the same convergence rate holds with $s$ replaced by the block size $k$. The following lemma gives a convergence rate for the global methods.
Lemma 3 (Theoretical convergence rate of global methods)
The theoretical convergence rate of a global method with block size $k$ is
$$\frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}, \qquad \kappa = \frac{\lambda_N}{\lambda_{\lceil s/k^2 \rceil}}.$$
A global method with block size $k$ is equivalent to solving the $kN$-dimensional system
$$(I_k \otimes A)\, \hat{X} = \hat{B}$$
with the classical block Krylov method with $s/k$ right-hand sides. The matrix of this system has the same eigenvalues as $A$, but with $k$ times the multiplicity. Thus its $j$-th smallest eigenvalue is $\lambda_{\lceil j/k \rceil}$. Therefore $\kappa_{s/k} = \lambda_N / \lambda_{\lceil s/k^2 \rceil}$, and by applying Theorem 3.1 we deduce the theoretical convergence rate.
This result makes the global methods irrelevant for practical use. In particular, the non-global hybrid method with the same block size would perform better.
4 Implementation and numerical experiments
With the Dune 2.6 release an abstraction for SIMD data types was introduced. The aim of this abstraction is to make it possible to use such data types as a replacement for a numeric data type, like double or float, to create data-parallel methods. For a more detailed review see Bastian et al. (2019). Usually these SIMD data types are provided by external libraries like Vc Kretz and Lindenstruth (2012) or vectorclass Fog (2013), which provide data types with the length of the hardware SIMD registers. For problems with more right-hand sides we use the LoopSIMD type. This type is basically an array, but implements the arithmetic operations by static-sized loops.
Listing 1 shows the implementation of the BAXPY kernel for this setting.
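A simplified sketch of such a kernel, using a LoopSIMD-style type, might look as follows. The type and kernel shown are hypothetical stand-ins for this sketch, not the Dune implementation:

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Minimal stand-in for a LoopSIMD type: a fixed-size array whose
// arithmetic is implemented with static-sized loops, which the compiler
// can unroll and vectorize. The real Dune type provides a much richer
// SIMD interface.
template <std::size_t s>
struct LoopSimd {
  std::array<double, s> lanes{};

  LoopSimd& operator+=(const LoopSimd& o) {
    for (std::size_t l = 0; l < s; ++l) lanes[l] += o.lanes[l];
    return *this;
  }
  friend LoopSimd operator*(const LoopSimd& a, const LoopSimd& b) {
    LoopSimd r;
    for (std::size_t l = 0; l < s; ++l) r.lanes[l] = a.lanes[l] * b.lanes[l];
    return r;
  }
};

// BAXPY sketch for a coefficient with one scalar per lane (the diagonal
// case): y_i <- y_i + alpha * x_i for every row i, carried out for all
// right-hand sides simultaneously.
template <std::size_t s>
void baxpy(const LoopSimd<s>& alpha,
           const std::vector<LoopSimd<s>>& x,
           std::vector<LoopSimd<s>>& y)
{
  for (std::size_t i = 0; i < y.size(); ++i)
    y[i] += alpha * x[i];
}
```

Because the lane loops have a compile-time trip count, the compiler can map them directly onto hardware SIMD instructions.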
In a first test series we examine the runtime of the kernels BDOT, BAXPY and BOP. To check the efficiency of our implementation, we take the following performance model into account. This performance model is a simplified variant of the ECM model presented by Hofmann et al. (2019). We assume that the runtime of a kernel is bounded by the following three factors:
$T_{\text{flop}} = n_{\text{flop}} / r_{\text{peak}}$: The time the processor needs to perform the necessary number of floating point operations, where $n_{\text{flop}}$ is the number of floating point operations of the kernel and $r_{\text{peak}}$ is the peak FLOP rate.
$T_{\text{mem}} = n_{\text{mem}} / b_{\text{mem}}$: The time to transfer the data from the main memory to the L1 cache, where $n_{\text{mem}}$ is the amount of data that needs to be transferred in the kernel and $b_{\text{mem}}$ is the memory bandwidth.
$T_{\text{reg}} = n_{\text{reg}} / b_{\text{reg}}$: The time to transfer data between L1 cache and registers, where $n_{\text{reg}}$ is the amount of data that needs to be transferred in the kernel and $b_{\text{reg}}$ is the register bandwidth.
Finally the expected runtime is given by
$$T = \max\left( T_{\text{flop}},\; T_{\text{mem}},\; T_{\text{reg}} \right).$$
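The model can be expressed as a small helper function; the function name and parameters below are illustrative placeholders, not code from the paper:

```cpp
#include <algorithm>

// Simplified ECM-style runtime model: a kernel can run no faster than
// its compute time, its memory-transfer time and its register-transfer
// time allow. The slowest of the three bounds dominates.
double expected_runtime(double n_flop, double peak_flops,
                        double n_mem,  double mem_bandwidth,
                        double n_reg,  double reg_bandwidth)
{
  const double t_flop = n_flop / peak_flops;    // compute bound
  const double t_mem  = n_mem / mem_bandwidth;  // main-memory bound
  const double t_reg  = n_reg / reg_bandwidth;  // register-transfer bound
  return std::max({t_flop, t_mem, t_reg});
}
```

For a memory-bound kernel the middle term dominates, which is why the measured runtimes below stay constant per right-hand side in that regime.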
Table 1 shows an overview of the performance-relevant characteristics of the kernels. We observe that for the BOP kernel the runtime per right-hand side decreases rapidly for small $s$; this is in accordance with our expectations. For larger $s$ the runtime per right-hand side increases slightly. We suppose that this effect is due to the fact that for larger $s$ one row of a block vector occupies more space in the caches, hence fewer rows can be cached. This effect could be mitigated by using a different sparse matrix format, like the Sell-C-σ format Kreutzer et al. (2014).
Furthermore we see that the runtime of the BDOT and BAXPY kernels is constant up to a certain block size. This is in accordance with our expectations, as the kernels are memory bound in that regime, so the runtime does not depend on $s$. For almost all block sizes the runtimes of the global and hybrid versions coincide; the exception occurs where an $s \times s$ coefficient matrix takes exactly as much memory as the L1 cache. In the non-global setting two of these matrices are modified during the computation, which then exceeds the L1 cache. This also explains why the runtime in this case is so much higher than expected.
Figure 1 shows the measured data compared with the expected runtimes. All tests were performed on a single core of an Intel Skylake-SP Xeon Gold 6148. The model parameters are the theoretical peak FLOP rate, the memory bandwidth and the register bandwidth of this machine.
In a second experiment we compare the runtimes of the whole algorithm for the different variants. For that we discretized a 2D heterogeneous Poisson problem with a Finite Difference stencil. The right-hand sides are initialized with random numbers. We iterate until the defect norm of each column has decreased by a prescribed factor. An ILU preconditioner was used. Figure 2 shows the results, from which the best block size can be read off. In another test we compare the runtimes for different parameters, where the algorithm is executed repeatedly until all right-hand sides are solved. In this case the fastest variant is only slightly ahead of the runner-up; the reason is the worse cache behavior of the BOP kernel, as seen before. Note that on a distributed machine the latter case would need correspondingly less communication.
5 Conclusion and outlook
In this paper we have presented strategies for the vectorized BCG method. For that we reviewed the block Krylov framework of Frommer et al. and applied it to the BCG method. This makes it possible to use the advantages of the BCG method as far as they are beneficial, while the number of right-hand sides can be increased further. This helps to decrease the communication overhead and to improve the arithmetic intensity of the kernels. We observed that the runtime of the individual kernels scales linearly with the number of right-hand sides as long as they are memory bound. That means that it is always beneficial to use at least the block size up to which the kernels stay memory bound; depending on the problem it can also be beneficial to choose an even larger one.
The presented optimizations are also applicable to other block Krylov methods, like GMRes, MINRES or BiCG, and could be combined with pipelining techniques. These approaches are the subject of future work.
- Al Daas et al. (2017) Al Daas, H., Grigori, L., Hénon, P., Ricoux, P.: Enlarged GMRES for reducing communication (2017). Preprint hal.inria.fr/hal-01497943
- Bastian et al. (2019) Bastian, P., Blatt, M., Dedner, A., Dreier, N.A., Engwer, C., Fritze, R., Gräser, C., Kempf, D., Klöfkorn, R., Ohlberger, M., Sander, O.: The Dune framework: Basic concepts and recent developments (2019). Preprint arXiv:1909.13672
- Dubrulle (2001) Dubrulle, A.A.: Retooling the method of block conjugate gradients. Electron. Trans. Numer. Anal. 12, 216–233 (2001). URL http://etna.mcs.kent.edu/vol.12.2001/pp216-233.dir/pp216-233.pdf
- Fog (2013) Fog, A.: C++ vector class library (2013). URL http://www.agner.org/optimize/vectorclass.pdf
- Frommer et al. (2017) Frommer, A., Szyld, D.B., Lund, K.: Block Krylov subspace methods for functions of matrices. Electron. Trans. Numer. Anal. 47, 100–126 (2017). URL http://etna.math.kent.edu/vol.47.2017/pp100-126.dir/pp100-126.pdf
- Grigori et al. (2016) Grigori, L., Moufawad, S., Nataf, F.: Enlarged Krylov subspace conjugate gradient methods for reducing communication. SIAM Journal on Matrix Analysis and Applications 37(2), 744–773 (2016). DOI 10.1137/140989492
- Grigori and Tissot (2017) Grigori, L., Tissot, O.: Reducing the communication and computational costs of enlarged Krylov subspaces conjugate gradient (2017). Preprint hal.inria.fr/hal-01451199/
- Hofmann et al. (2019) Hofmann, J., Alappat, C.L., Hager, G., Fey, D., Wellein, G.: Bridging the architecture gap: Abstracting performance-relevant properties of modern server processors (2019). Preprint arXiv:1907.00048
- Kretz and Lindenstruth (2012) Kretz, M., Lindenstruth, V.: Vc: A C++ library for explicit vectorization. Software: Practice and Experience 42(11), 1409–1430 (2012). DOI 10.1002/spe.1149
- Kreutzer et al. (2014) Kreutzer, M., Hager, G., Wellein, G., Fehske, H., Bishop, A.R.: A unified sparse matrix data format for efficient general sparse matrix-vector multiplication on modern processors with wide SIMD units. SIAM Journal on Scientific Computing 36(5), C401–C423 (2014). DOI 10.1137/130930352
- Lund (2018) Lund, K.: A new block Krylov subspace framework with applications to functions of matrices acting on multiple vectors. Ph.D. thesis, Temple University (2018)
- O’Leary (1980) O’Leary, D.P.: The block conjugate gradient algorithm and related methods. Linear Algebra and its Applications 29, 293–322 (1980). DOI 10.1016/0024-3795(80)90247-5