1 Introduction
Developers of numerical software are facing multiple challenges on modern HPChardware. Firstly, multiple levels of concurrency must be exploited to achieve the full performance. Secondly, due to that parallelism, communication between nodes is needed, which must be cleverly organized to avoid an expensive overhead. And most importantly, modern CPUs have a low memory bandwidth, compared to the peak FLOP rate, such that for standard linear solvers the memory bandwidth is the bottleneck for the performance. Therefore only algorithms with a high arithmetic intensity will perform well.
Instruction level parallelism is now apparent on all modern CPU architectures. They provide dedicated vector (or SIMD) instructions, that allow to proceed multiple floating point operations with one instruction call, e.g. AVX512 allows processing 8 double at once. The efficient use of these instructions is a further challenge.
The Conjugate Gradients method (CG) is a standard tool for solving large, sparse, symmetric, positive definite, linear systems. The Block Conjugate Gradient (BCG) method was introduced in the 80s to improve the convergence rate for systems with multiple right hand sides O’Leary (1980). Recently these methods have been rediscovered to reduce the communication overhead in parallel computations Grigori et al. (2016); Grigori and Tissot (2017); Al Daas et al. (2017).
In this paper we present a generalization of the BCG method, which makes it applicable to arbitrary many righthandsides. We consider a symmetric, positive definite matrix and want to solve the matrix equation
(1) 
This paper is structured as follows. In Section 2 we briefly review the theoretical background of block Krylov methods, using the notation of Frommer et al. (2017). Then in Section 3 we apply this theory on the BCG method. The implementation of the method, a theoretical performance model and some numerical experiments are presented in Section 4.
2 Block Krylov subspaces
Considering functions of matrices, Frommer et al. presented in Frommer et al. (2017) a generic framework for block Krylov methods. Further work on this framework can be found in the PhD thesis of Lund Lund (2018). In the following we review the most important definitions, which we will need in section 3. Frommer et al. used as numeric field, for simplicity of the following numerics, we restrict our self to .
Definition 1 (Block Krylov subspace)
Let be a *subalgebra of and . The th block Krylov subspace with respect to and is defined by
(2) 
From that definition we find the following lemma directly.
Lemma 1
If and are two *subalgebras of , with . Then
(3) 
holds.
In this paper we want to consider the following *subalgebras and products and the corresponding Krylov spaces.
Definition 2 (Relevant *subalgebras)
Let be a divider of . We define the following *subalgebras and corresponding products:
hybrid:  (4)  
global:  (5)  
where denotes the dimensional identity matrix and denotes the set of matrices where only the diagonal matrices have nonzero values. Furthermore we define the special cases  
classical:  and  (6)  
parallel:  (7) 
The names result from the behavior of the resulting Krylov method; yields in the classical block Krylov method as presented by O’Leary O’Leary (1980), whereas results in a CG method, which is carried out for all right hand sides simultaneously, in a instruction level parallel fashion.
From that definition we could conclude the following embedding lemma.
Lemma 2 (Embeddings of *subalgebras)
For , where is a divisor of and is a divisor of , we have the following embedding:
(8) 
3 Block Conjugate Gradient method
Algorithm 1 shows the preconditioned BCG method. We recompute in line 6 to improve the stability of the method. A more elaborate discussion of the stability of the BCG method can be found in the paper of Dubrulle Dubrulle (2001). This stabilization has only mild effect on the performance as the communication that is needed to compute the block product could be carried out together with the previous block product.
The algorithm is buildup from 4 kernels:

BDOT: Computing the block product,

BAXPY: Generic vector update

BOP: Applying the operator (or preconditioner) on a block vector

BSOLVE: Solve a block system in the *subalgebra
O’Leary showed the following convergence result to estimate the error of the classical BCG method.
Theorem 3.1 (Convergence of Block Conjugate Gradients (O’Leary, 1980, Theorem 5))
For the energyerror of the th column of the classical BCG method, the following estimation hold:
where
denotes the eigenvalues of the preconditioned matrix
. The constant depends on and the initial error but not on .This theorem holds for the classical method. However as the hybrid method is only a dataparallel version of the classical block method the same convergence rate hold with for the method. The following lemma gives us a convergence rate for the global methods.
Lemma 3 (Theoretical convergence rate of global methods)
The theoretical convergence rate of a global method using is
(9) 
Proof
A global method is equivalent to solve the dimensional system
(10) 
with the classical block Krylov method with right hand sides. The matrix of this system has the same eigenvalues as but with times the multiplicity. Thus the smallest eigenvalue is . Therefore and by applying Theorem 3.1 we deduce the theoretical convergence rate.
This result makes the global methods irrelevant for practical use. In particular for the nonglobal hybrid method would perform better.
4 Implementation and numerical experiments
With the Dune 2.6 release an abstraction for SIMD data types was introduced. The aim of these abstraction is the possibility to use the data types as a replacement for the numeric data type, like double or float, to create data parallel methods. For a more detailed review see Bastian et al. Bastian et al. (2019). Usually these SIMD data types are provided by external libraries like Vc Kretz and Lindenstruth (2012) or vectorclass Fog (2013), which usually provide data types with the length of the hardware SIMD (e.g. or ). For problems with more right hand sides we use the LoopSIMD type. This type is basically an array but implements the arithmetic operations by static sized loops.
Listing 1 shows the implementation for the BAXPY kernel for the case .
In a first test series we examine the runtime of the kernels BDOT, BAXPY and BOP. To check the efficiency of our implementation we, take the following performance model into account. This performance model is a simplified variant of the ECM model presented by Hofmann et al. Hofmann et al. (2019). We assume that the runtime of a kernel is bounded by the following three factors:

: The time the processor needs to perform the necessary number of floating point operations. Where is the number of floating point operations of the kernel.

: The time to transfer the data from the main memory to the L1 cache. Where is the amount of data that needs to be transferred in the kernel.

: The time to transfer data between L1 cache and registers. Where is the amount of data that needs to be transferred in the kernel.
Finally the expected runtime is given by
(11) 
Table 1 shows an overview of the performance relevant characteristics of the kernels. We observe that for the BOP kernel the runtime per right hand side decreases rapidly for small , this is in accordance with our expectation. For larger the runtime per right hand side increases slightly. We suppose that this effect is due to the fact that for larger one row of a block vector occupies more space in the caches, hence fewer rows can be cached. This effects could be mitigated by using a different sparse matrix format, like the SellC format Kreutzer et al. (2014).
Furthermore we see that the runtime of the BDOT and BAXPY kernels is constant up to an certain (). This is in accordance with our expectation, as it is memory bound in that regime and does not depend on . At almost all the runtimes for global and hybrid version coincide except for . The reason for that is, that a takes memory, which is exactly the L1 cache size. In the nonglobal setting two of these matrices are modified during the computation, which then exceeds the L1 cache. This explains as well why the runtime for the case is so much higher than expected.
Figure 1 shows the measured data compared with the expected runtimes. All tests are performed on an Intel SkylakeSP Xeon Gold 6148 on one core. The theoretical peakflops are , the memory bandwidth is and the register bandwidth is .
BDOT  

BAXPY  
BOP 
In a second experiment we compare the runtime of the whole algorithm with each other. For that we discretized a 2D heterogeneous Poisson problem with a point Finite Difference stencil. The right hand sides are initialized with random numbers. We iterate until the defect norm of each column has been decreased by a factor of . An ILU preconditioner was used. Figure 2 shows the results. We see that the best block size is . In another test we compare the runtimes for different parameters, where the algorithm is executed times until all right hand sides are solved. In this case the case is the fastest but only slightly slower as the . The reason for that is the worse cache behavior of the BOP kernel, like we have seen before. Note that on a distributed machine the latter case would need x less communication.
5 Conclusion and outlook
In this paper we have presented strategies for the vectorized BCG method. For that we reviewed the block Krylov framework of Frommer et al. and apply it on the BCG method. This makes it possible to use the advantages of the BCG method as far as it is beneficial, while the number of right hand sides can be further increased. This helps to decrease the communication overhead and improve the arithmetic intensity of the kernels. We observed that the runtime of the individual kernels scale linearly with the number of right hand sides as long as they are memory bound ( on our machine). That means that it is always beneficial to use at least this block size , depending on the problem it could also be beneficial to choose even larger .
The found optimizations are also applicable to other block Krylov space methods like GMRes, MINRES or BiCG, and could be combined with pipelining techniques. These approaches are the objective of future work.
Bibliography
 Al Daas et al. (2017) Al Daas, H., Grigori, L., Hénon, P., Ricoux, P.: Enlarged gmres for reducing communication (2017). Preprint hal.inria.fr/hal01497943
 Bastian et al. (2019) Bastian, P., Blatt, M., Dedner, A., Dreier, N.A., Engwer, C., Fritze, R., Gräser, C., Kempf, D., Klöfkorn, R., Ohlberger, M., Sander, O.: The dune framework: Basic concepts and recent developments (2019). Preprint arXiv:1909.13672
 Dubrulle (2001) Dubrulle, A.A.: Retooling the method of block conjugate gradients. Electron. Trans. Numer. Anal. 12, 216–233 (2001). URL http://etna.mcs.kent.edu/vol.12.2001/pp216233.dir/pp216233.pdf
 Fog (2013) Fog, A.: C++ vector class library (2013). URL http://www.agner.org/optimize/vectorclass.pdf
 Frommer et al. (2017) Frommer, A., Szyld, D.B., Lund, K.: Block krylov subspace methods for functions of matrices. Electron. Trans. Numer. Anal. 47, 100–126 (2017). URL http://etna.math.kent.edu/vol.47.2017/pp100126.dir/pp100126.pdf
 Grigori et al. (2016) Grigori, L., Moufawad, S., Nataf, F.: Enlarged krylov subspace conjugate gradient methods for reducing communication. SIAM Journal on Matrix Analysis and Applications 37(2), 744–773 (2016). DOI 10.1137/140989492
 Grigori and Tissot (2017) Grigori, L., Tissot, O.: Reducing the communication and computational costs of enlarged krylov subspaces conjugate gradient (2017). Preprint hal.inria.fr/hal01451199/
 Hofmann et al. (2019) Hofmann, J., Alappat, C.L., Hager, G., Fey, D., Wellein, G.: Bridging the architecture gap: Abstracting performancerelevant properties of modern server processors (2019). Preprint arXiv:1907.00048
 Kretz and Lindenstruth (2012) Kretz, M., Lindenstruth, V.: Vc: A c++ library for explicit vectorization. Software: Practice and Experience 42(11), 1409–1430 (2012). DOI 10.1002/spe.11490
 Kreutzer et al. (2014) Kreutzer, M., Hager, G., Wellein, G., Fehske, H., Bishop, A.R.: A unified sparse matrix data format for efficient general sparse matrixvector multiplication on modern processors with wide simd units. SIAM Journal on Scientific Computing 36(5), C401–C423 (2014). DOI 10.1137/130930352
 Lund (2018) Lund, K.: A new block Krylov subspace framework with applications to functions of matrices acting on multiple vectors. Ph.D. thesis, Temple University (2018)
 O’Leary (1980) O’Leary, D.P.: The block conjugate gradient algorithm and related methods. Linear algebra and its applications 29, 293–322 (1980). DOI 10.1016/00243795(80)902475