As computational models become more complex, efficient analysis of such models will require advanced numerical algorithms that target high-performance computing. At the core of these analyses is often the solution of large, sparse linear systems . The most common algorithms for performing such solves are Krylov subspace methods, like the Generalized Minimum Residual Method (GMRES) . GMRES is still considered one of the best general purpose, iterative methods for non-Hermitian linear systems, even though it requires full orthogonalization of the Krylov subspace, . To accelerate the convergence of GMRES, and reduce costs like orthogonalization, it is common to solve a preconditioned linear system. The preconditioner, , is ideally an inexpensive approximation of and can be applied to the left, , or right, where , of the original linear system.
Modern architectures often gain computing power through the addition of more processors (cores), rather than through increases in processor clock speed. To efficiently utilize these architectures, it is necessary to leverage parallel algorithms. Due to the high communication and global synchronization requirements of Krylov subspace methods, like GMRES, much research has been devoted to communication-avoiding algorithms [2, 3, 4]. Among these algorithms are
-step methods, which compute several new Krylov subspace vectors between each communication-intensive orthogonalization.
Polynomial preconditioning is another approach that has potential to reduce communication costs in Krylov solvers. Instead of performing many iterations at once, like the
-step methods, this approach packs more work into each iteration. With polynomial preconditioning, more matrix-vector products are used to form each basis vector before orthogonalizing, giving GMRES more power and potential for convergence for roughly the same amount of communication. Chebyshev and least-squares polynomials are common choices for polynomial preconditioners since they can be created so that their norms are minimized over a given interval. Unfortunately, obtaining an effective Chebyshev or least squares polynomial requires estimates for the extreme eigenvalues of the associated matrix, which are often hard to obtain in practice. This is especially true for non-Hermitian linear systems where determining the polynomial may require an estimate of the convex hull of the spectrum[5, p. 403].
Trilinos  is one of the major software libraries that offers a collection of advanced numerical algorithms for parallel computing, including sparse linear solvers and preconditioners. While polynomial preconditioners have long been considered for parallel computing ([7, 8] and references therein), they are rarely included in high-performance numerical software collections. In Trilinos, for example, the only polynomial preconditioners are the Chebyshev and least-squares preconditioners in the IFPACK package and they only work for Hermitian matrices. The more recent package, IFPACK2, only provides Chebyshev as a smoother for multigrid. This lack of polynomial preconditioners in high-performance software may be due to the computational cost required to compute eigenvalues for obtaining an effective polynomial.
The GMRES minimum residual polynomial also has small norm over the spectrum of the given matrix, making it an effective preconditioner . Its construction does not require any explicit information about the spectrum, making it cheaper to obtain than other polynomials. This preconditioner has been used effectively in several contexts: In [10, 11], it is used to spectrally transform an operator to solve for eigenvalues. The work  uses the GMRES minimum residual polynomial to precondition serial implementations of GMRES and GMRES-DR, and  studies the preconditioner for methods IDR(s) and BiCGStab.
In this paper, we demonstrate an implementation of the GMRES minimum residual polynomial preconditioner using Trilinos. We discuss issues that arise when implementing the preconditioner into existing software, and we test its potential to accelerate the solution of large-scale linear systems. We illustrate that this preconditioner can be effective in a high-performance computing environment for the following reasons:
Simple and Effective: The GMRES minimum residual polynomial preconditioner is simple to implement, as discussed in Section II. Since no factorization of the matrix is required, this preconditioner is suitable for problems where the full matrix is unavailable. The effectiveness of the preconditioner is demonstrated in Sections III, VI.
Avoiding Communication: Applying the polynomial preconditioner requires only sparse matrix-vector products (SpMVs) and vector updates. This greatly reduces global communication and synchronization from inner products, which may become a bottleneck. We discuss in Section VII potential to combine the preconditioner with other communication-avoiding kernels.
Accelerates Existing Preconditioners: The polynomial preconditioner can be combined with existing preconditioners. We use it to accelerate ILU in Section VI.
Because of these advantages, we propose that polynomial preconditioning should be considered as an addition to high-performance solvers software libraries.
Ii Implementing the Minimum Residual Polynomial Preconditioner
Given a linear system , we obtain the polynomial preconditioner of degree . We first build a power basis , where is an arbitrary vector. Then we solve the normal equations
The elements of are the coefficients of , that is,
This method becomes unstable as the the columns of lose linear independence, but its results are generally sufficient for low-degree polynomials.
Notice that the coefficients of the polynomial depend on the choice of . Ref.  suggests using , the problem right-hand side. We discuss in Section V why a random vector may instead be preferable. Note also that the polynomial preconditioner can be easily combined with other preconditioners: Given a preconditioned system , we can use the operator to form the power basis and obtain a polynomial preconditioner for the already preconditioned system.
The spectrum of the preconditioned operator will typically be better for convergence of GMRES than that of the original matrix : the small eigenvalues of will be mapped to well-separated eigenvalues of , and other eigenvalues will be clustered near . Further details on the derivation and algebraic properties of the polynomial, as well as algorithms for more stable implementation, can be found in .
The results presented in this paper are obtained using an implementation of the minimum residual polynomial preconditioner written directly in Belos, the next-generation iterative linear solvers package of Trilinos . While most preconditioners are found in other packages within Trilinos, the Belos package was the most convenient location since it already has code for generating Krylov subspaces and performing orthogonalization. Furthermore, Belos is designed so that it can be used to precondition itself as an inner-outer solver. Finally, this implementation provides the advantage that it works for both Epetra and Tpetra-based linear algebra.
The GMRES minimum residual polynomial operator is generated in Belos using the vector (MultiVecTraits) and operator (OperatorTraits) abstractions. The GmresPolyPrec class takes a Belos linear problem and a polynomial degree as arguments to the constructor and computes the coefficients of the polynomial. It forms the matrices for (1) and then uses the LAPACK routines , POTRF/POTRS, to compute a Cholesky factorization and solve the system. The class also provides a function to apply to a given vector .
A new Belos SolverManager class, GmresPolyPrecSolMgr, allows current Belos linear solvers to interface to the polynomial preconditioner as they would another linear solver. The Belos SolverFactory was extended to include the new GmresPolyPrecSolMgr, allowing us to set the polynomial preconditioner as an inner solver and GMRES as an outer solver. For the results in this paper, we used the Epetra linear algebra interfaces.
Iii Serial Numerical Results
This section contains results from a serial build of the Belos polynomial preconditioner. The outer solver is GMRES with two steps of Classical Gram Schmidt (ICGS) orthogonalization. Each orthogonalization step requires two block inner products and one norm, which we count as three ”dot” products per iteration in the figures to follow. (This is more effective for avoiding communication than modified Gram-Schmidt orthogonalization.) The algorithm needs SpMVs and two block inner products to create the polynomial, which is used as a right preconditioner. We require a relative residual norm less than for convergence.
For all results in this section, we use the matrix e20r0100 from Matrix Market , a real non-Hermitian matrix of size . This matrix has high condition number, estimated by Matrix Market at , and proves to be extremely difficult for GMRES. In practice, e20r0100 may be best addressed using a direct solver, but it is representative of the difficulties that one may encounter in GMRES.
In this example, we choose subspace size and a right-hand side, , with random entries. We generate the polynomial using . Fig. 1 shows residual norm convergence for unpreconditioned GMRES (indicated by ) and polynomial preconditioned GMRES for . Convergence in relation to SpMVs is shown on the top and convergence relative to inner products is shown on the bottom.
The results, illustrated in Figure 1, show that without preconditioning the relative residual norm only improves by one order of magnitude before stalling out. For these experiments we see a distinct improvement when raising the degree of the polynomial. While the degree problem converges in SpMVs, the polynomial of degree gives the most improvement, converging in SpMVs. The decrease in inner (dot) products is more substantial. The degree problem converges in dot products, while the degree problem only requires . This is over three times fewer dot products, and a stark difference from the unpreconditioned problem which stagnated.
Considering the spectra of and helps to explain the improvement from polynomial preconditioning. Figure 2 shows the eigenvalues of (top) and the new spectrum after applying a degree preconditioner (bottom). The matrix is indefinite, having eigenvalues that lie in the left half of the complex plane. The largest eigenvalue with negative real part has magnitude , and the smallest has magnitude . While the polynomials do not create a significant change in the eigenvalues that have negative real part, they do help to cluster eigenvalues on the right-half plane away from the origin. We find that the ratio of smallest to largest magnitudes of eigenvalues on the right half plane is without preconditioning. With only a degree polynomial, this ratio improves almost ten times to . When the polynomial is degree , the ratio has improved to .
In our next example, we use the same problem and polynomials from Example III.1, but we increase the subspace size to . Though GMRES no longer stalls with the larger subspace, convergence is too slow to run to completion. We estimate that SpMVs and dot products are needed to converge. Figure 3 shows the improvement with preconditioning. The degree preconditioned problem converges faster, but still too slowly to run to completion. The polynomial of degree helps to attain convergence at a cost of SpMVs and dot products. With degree , the cost is SpMVs and dot products. Thus, if this computation was run in parallel, we would reduce global communication calls by about orders of magnitude over no preconditioning. With that comes approximately orders of magnitude improvement in matrix-vector products, reducing processor-to-processor communication as well.
Iv A Degree Selection Strategy
It may be difficult for the user to determine when it is best to stop raising the polynomial degree. Raising the degree often results in a better preconditioner, but it can reach a point of diminishing returns. The polynomial preconditioner can decrease both the number of inner products and SpMVs required to converge, but sometimes inner products are reduced at the expense of more SpMVs. Fortunately, many matrices are stored so that SpMVs only require communication with neighboring processors. Thus, for communication reduction, it may be more beneficial to perform extra SpMVs in order to avoid operations that require synchronous global communication, like inner products.
We ran serial tests on several different matrices to determine the effects of raising the polynomial degree. All tests were performed with a right-hand side where is a randomly generated solution vector. We let and choose a maximum subspace size of . Matrices bwm2000, orsirr1, s1rmq4m1, and e20r0100 can be obtained via Matrix Market. The matrix BiDiag1, with , has on the diagonal and on all elements of the superdiagonal. Matrix BiDiag2, with , has on the diagonal and on all elements of the superdiagonal.
Figure 4 shows the number of SpMVs required to reach a relative residual tolerance of for polynomials of degrees . Results for no preconditioning correspond to degree on the plot. Figure 5 shows the corresponding number of inner products required for convergence, where one inner product is counted for each of two passes of Gram-Schmidt orthogonalization and one more for the norm. For matrices bwm2000 and e20r0100, convergence stagnates with no preconditioning. The problem e20r0100 first converges with the preconditioner of degree , and bwm2000 first converges with degree .
The results suggest that preconditioning is more likely to reduce matrix-vector products for difficult problems than for simpler ones. For the easiest problem, BiDiag1, matrix-vector products increase with preconditioning, even for degree . For matrices e20r0100, bwm2000, and s1rmq4m1, which were most difficult, the expense of SpMVs decreases with preconditioning up until degree . For the other two problems, the number of SpMVs decreases slightly for very low-degree polynomials and begins to rise again, with no savings after degree .
Unlike with SpMVs, polynomial preconditioning is consistent in reducing the number of inner products for all problems in Figure 5, regardless of difficulty. By the time the polynomial degree is increased to , the number of inner products has decreased by approximately an order of magnitude or more for all problems. However, after degree , the number of inner products remains constant while the number of SpMVs is increasing.
We found that after the polynomial degree gets large enough, increasing it failed to result in new coefficients of significant magnitude. Example coefficients for the matrix s1rmq4m1 are shown in Table I. Notice that for degrees , and , the first eleven polynomial coefficients remain the same. The additional coefficients in the polynomials of degrees and are so near zero that they do not provide any additional information. This explains why the degree and degree preconditioners give no improvement in cost over the degree preconditioner. In fact, they are more expensive due to the extra SpMVs incurred with a near-zero coefficient. Polynomial coefficients for the other matrices tested followed a similar trend, becoming very small at high degrees.
|Deg 7||Deg 10||Deg 12||Deg 15|
This effect may also be due to the ill-conditioned problem of computing coefficients via the normal equations with a power basis. We observe that the appearance of near-zero coefficients corresponds with a positive return value info in the LAPACK function POTRF when forming the polynomial, which means that the matrix [, in our case] is not positive definite. Of the six matrices discussed in this section, three first give positive return values starting with degree , and the other three examples begin to give warnings at degree . Despite this warning, all coefficients are still computed. In other examples, NaNs were computed after the LAPACK error occurred and the polynomial degree was raised too high.
We also tried using a more stable LAPACK routine POSVX, which equilibrates the system before Cholesky factorization and/or improves the solution using iterative refinement. Neither of these options resulted in better polynomial coefficients. Another option is to find a QR factorization for solving the normal equations. In [9, p. 11], this resulted in less accurate polynomial coefficients. Thus, we do not consider it here.
We conjecture that the best polynomial preconditioner constructed with the power basis method will have the highest degree possible without a warning from LAPACK. Based on the examples above, this polynomial seems likely to minimize the number of inner products and norms while avoiding extra SpMVs. This strategy can be easily implemented for automatic degree selection.
It is worth noting that there are examples, such as Sherman5 from Matrix Market, where this degree selection strategy fails. This indefinite matrix is an extremely difficult problem for GMRES. The polynomial of degree was a very successful preconditioner because the spectrum of was entirely in the one side of the complex plane. With higher degree polynomials, the matrix was once again indefinite and GMRES did not converge, but LAPACK did not give positive return values until degree . In such instances, it may be best to take the auto-selection degree as an upper bound and try to obtain results with lower-degree polynomials.
V Choosing a Vector to Generate the Polynomial
All experiments thus far have successfully generated the polynomial preconditioner using , the problem right-hand side. This choice worked well in the previous sections because the problem right-hand side was generated using randomization. More structured right-hand sides may generate a poor polynomial preconditioner.
Consider the discretized Laplacian equation over a square domain, with constant source function and zero boundary conditions (Example ). The matrix size is . The eigenvalues of this matrix are all real-valued and lie in the interval , with several eigenvalues very close to . All values of the right-hand side vector are very close to or . Figure 6 shows the polynomial of degree generated with .
The -axis corresponds to the spectrum of , and the -axis shows the range of eigenvalues of . Recall that if is a good preconditioner, the large eigenvalues of will be mapped close to and the small eigenvalues of will be well-separated between and . This polynomial does nothing of the sort. The largest eigenvalues near are mapped to near , and the eigenvalues in the middle of the spectrum are mapped to values as small as up to larger than .
The preconditioned matrix is highly indefinite and is much harder for GMRES than the original problem. After iterations of GMRES(50), the relative residual almost stalls out at . The vector
appears to have very small components in the eigenvector directions ofthat correspond to large eigenvalues. Thus, the GMRES minimum residual polynomial effectively ignores those large eigenvalues.
We now generate a random vector
with uniformly distributed elements in. The new polynomial of degree is shown in Figure 7. The preconditioner works very well; GMRES(50) reaches a relative residual norm of in only iterations. The plot of the polynomial shows that the small eigenvalues of are well-separated and the rest of the spectrum is mapped between and . It appears that a random vector helps the polynomial to address all parts of the spectrum better than a structured right-hand side vector. For the remaining experiments in this paper, we let be a random vector.
Vi Parallel Numerical Results
Experiments in this section were performed using the Kodiak cluster at Baylor University. The cluster has Cray regular compute nodes, each with dual 18-core Intel E5-2695 V4 (Broadwell) processors and 256GB RAM. All tests used only one compute node.
Both examples that follow test finite element discretizations of the convection-diffusion equation
The matrices and right-hand sides are generated with Firedrake  software using a function space of continuous piecewise-linear polynomials. The domain is a 2D unit square mesh centered at the origin with , yielding a matrix of size . Similar to Example in , and
We use Dirichlet boundary conditions: on boundary , and on the remaining boundaries.
We employ GMRES(50), requesting a relative residual tolerance of and using two steps of classical Gram-Schmidt orthogonalization. We generate a random vector and hold it constant for generating all polynomials, regardless of degree or MPI processes.
For this example, . Without preconditioning, GMRES(50) does eventually converge. The bar graphs in Figure 8 show solve times over increasing numbers of MPI processes for no preconditioning and polynomial preconditioners of degrees and . On one processor, the autodegree selection algorithm chooses degree as optimal. The bars are split to show three different timings: Time spent in the orthogonalization kernel is indicated by the bottom and middle parts of the bar, for dot products (including norms) and vector updates, respectively. The top part of the bar indicates time spent applying to a vector. Timings for other operations, including polynomial construction, were negligible.
Notice first the differences in scaling on the -axes. The polynomial preconditioner of degree gives almost times improvement in solve time over no preconditioning, and degree gives over times improvement in solve time over no preconditioning. Observe also that strong scaling is roughly the same with the preconditioned and unpreconditioned problems. Solve time decreases by about half as we go from to MPI processes and by a little less than a half as we add more processes.
Although running on a single compute node means that communication consists only of reading shared memory, orthogonalization dominates solve time when no preconditioning is used. In particular, dot products and norms require almost half of the total solve time. With a degree preconditioner, less than one-fourth of the compute time is used for dot products and norms, while a much greater proportion of time is used applying the preconditioned matrix with SpMVs and vector updates. This shows potential to further reduce solve time by combining polynomial application with a communication-avoiding algorithm such as the Matrix Powers Kernel. See Section VII for further discussion.
Surprisingly, there were differences in the polynomial coefficients generated (for fixed degree) with increasing numbers of processors. Thus, the number of iterations required for convergence varied with the number of MPI processes. While it would be ideal to have consistent convergence behavior when increasing the number of MPI processes, all the polynomial preconditioners generated here greatly improve convergence.
We modify the convection-diffusion problem from the previous example by choosing . The increased contribution from the convection term makes this problem too difficult for GMRES(50) to converge without a preconditioner, and polynomial preconditioning alone is ineffective. Thus we combine polynomial preconditioning with an ILU preconditioner . We use ILU(0) with no overlap between processors, as implemented in the Trilinos package IFPACK. We apply ILU preconditioning on the left and polynomial preconditioning on the right. Thus, we are solving the system where
Since the ILU preconditioner changes as the number of MPI processes increases (ILU factorizations are computed locally on diagonal blocks of the matrix ), the polynomial preconditioners also vary with the number of MPI processes. Figure 9 shows convergence times for polynomial preconditioning of various degrees combined with ILU. Degree indicates ILU preconditioning only. Bar graphs for computations with and MPI processes are shown. As in the previous example, the bottom and middle sections of each bar indicate orthogonalization time spent on dot products and updates, respectively. The top section of each bar indicates time spent on all remaining operations, including polynomial and ILU preconditioning.
Some observations: First, polynomial preconditioning with ILU is significantly better than ILU alone. Over MPI process, we attain speedup of almost times. Second, this improvement is consistent with increased parallelism. Even over MPI processes, we still have almost times speedup: ILU by itself converges in seconds, while the degree polynomial preconditioned GMRES converges in seconds. Third, the proportion of time spent in dot products (orthogonalization) is greatly reduced with polynomial preconditioning. On cores with ILU only, dot products and norms consumed seconds, or about of compute time. With degree polynomial preconditioning, they consume only seconds, or about of compute time. This suggests that polynomial preconditioning can be a worthwhile addition to ILU and other existing preconditioners.
Vii Relation to Communication-Avoiding Methods
Communication-Avoiding Krylov methods, such as CA-GMRES , are variations of -step Krylov methods. They reduce the number of communication steps at the cost of more memory and flops. The savings in global communication and synchronization is important on extreme-scale parallel systems. There are actually two savings in communication: (a) Global communication (inner products, orthogonalization) happens only once every steps; and (b) with the Matrix Powers Kernel (MPK), even local communication can be reduced at a cost in memory. The Matrix Powers Kernel is designed to perform several matrix-vector products consecutively while minimizing reads from memory. In Communication-Avoiding (CA)-GMRES, the Matrix Powers Kernel is used to form several vectors of a Krylov subspace without orthogonalizing in between. After all of the SpMVs are performed, then the basis vectors are orthogonalized using the Tall-Skinny QR (TSQR) algorithm. Unfortunately, CA-GMRES is prone to numerical instability. It is more stable to form a new Krylov vector from a basis vector that has already been orthogonalized. The more matrix-vector products are computed before orthogonalization, the more likely the Krylov vectors will begin to lose linear independence. Polynomial preconditioned GMRES may provide an avenue for taking advantage of communication-avoiding SpMVs with the Matrix Powers Kernel while avoiding the numerical pitfalls of delayed orthogonalization. Polynomial preconditioning can either be used with standard GMRES or within CA methods:
Polynomial preconditioned standard GMRES. We could use the MPK to evaluate the polynomial. Communication occurs as usual in each iteration of the standard GMRES algorithm.
Polynomial preconditioning within CA-GMRES. Polynomials are “communication-avoiding” in the sense that the dependency pattern is sparse and it is simple to determine the required replication/ghosting of data .
Note that in the first case, we can choose how many powers we use in the MPK, that is, how long to wait between each communication. The simplest choice is to let be the degree of the polynomial in (2). However, this becomes sub-optimal (even impractical) for high degree polynomials due to the high memory cost. Thus, we are free to choose . This is analogous to the fact that in -step methods, the length of the MPK, , could be different (smaller) than . Observe that polynomial preconditioned GMRES and CA-GMRES will have essentially the same communication requirements when and . Still, we emphasize they are not equivalent methods. Our results show that convergence is improved and the number of inner products is reduced using polynomial preconditioning.
Also note, by combining polynomial preconditioning and CA-GMRES, orthogonalization is only needed once every SpMVs. Future work includes a more detailed analysis and comparison of polynomial preconditioned GMRES and CA-GMRES.
We have shown that polynomial preconditioning can be effective in improving the convergence of GMRES. Our experiments demonstrate reduction in dot products that helps avoid global communication. We showed parallel results on a moderate size cluster. Future work include experiments on larger problems on highly parallel supercomputers, where communication is more expensive. It may also be worthwhile to investigate more stable implementations for constructing the polynomial preconditioner.
We believe polynomial preconditioning is under-appreciated and is a good alternative (or complement) to recent communication-avoiding methods such as CA-GMRES. It should be made available in high-performance software libraries to help enable exascale computing.
The first author would like to thank Ron Morgan for several useful discussions and suggestions. We also thank Rob Kirby for help in generating test problems.
-  Y. Saad and M. H. Schultz, “GMRES: a generalized minimal residual algorithm for solving nonsymmetric linear systems,” SIAM J. Sci. Statist. Comput., vol. 7, no. 3, pp. 856–869, 1986.
-  M. F. Hoemmen, “Communication-avoiding Krylov subspace methods,” Department of Electrical Engineering and Computer Sciences, University of California at Berkeley, Tech. Rep. UCB/EECS-2010-37, April 2010.
-  M. Mohiyuddin, M. Hoemmen, J. Demmel, and K. Yelick, “Minimizing communication in sparse matrix solvers,” in Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, ser. SC ’09. New York, NY, USA: ACM, 2009, pp. 36:1–36:12.
-  J. Demmel, M. Hoemmen, M. Mohiyuddin, and K. Yelick, “Avoiding communication in sparse matrix computations,” in 2008 IEEE International Symposium on Parallel and Distributed Processing. IEEE, 2008.
-  Y. Saad, Iterative Methods for Sparse Linear Systems, 2nd ed. Philadelphia, PA, USA: Society for Industrial and Applied Mathematics, 2003.
-  M. A. Heroux et al., “An overview of the Trilinos project,” ACM Trans. Math. Softw., vol. 31, no. 3, pp. 397–423, Sep. 2005.
-  Y. Saad, “Practical use of polynomial preconditionings for the conjugate gradient method,” SIAM Journal on Scientific and Statistical Computing, vol. 6, no. 4, pp. 865–881, 1985.
-  Y. Liang, “The use of parallel polynomial preconditioners in the solution of systems of linear equations,” Ph.D. dissertation, University of Ulster, 2005.
-  Q. Liu, R. B. Morgan, and W. Wilcox, “Polynomial preconditioned GMRES and GMRES-DR,” SIAM J. Sci. Comput., vol. 37, no. 5, pp. S407–S428, 2015.
-  H. K. Thornquist, “Fixed-polynomial approximate spectral transformations for preconditioning the eigenvalue problem,” Ph.D. dissertation, Rice University, 2006.
-  M. Embree, J. A. Loe, and R. B. Morgan, “Polynomial preconditioned Arnoldi,” submitted. [Online]. Available: https://arxiv.org/abs/1806.08020
-  J. A. Loe and R. B. Morgan, “Polynomial preconditioned BICGStab and IDR,” submitted. [Online]. Available: https://sites.baylor.edu/ronald_morgan/files/2015/05/PPNSymmLanLinEqs-1828ts3.pdf
-  E. Bavier, M. Hoemmen, S. Rajamanickam, and H. Thornquist, “Amesos2 and Belos: Direct and iterative solvers for large sparse linear systems,” Scientific Programming, vol. 20, no. 3, pp. 241–255, 2012.
-  E. Anderson et al., LAPACK User’s Guide. Philadelphia, PA: SIAM, 1992.
-  Matrix Market. [Online]. Available: https://math.nist.gov/MatrixMarket/
-  H. C. Elman, D. J. Silvester, and A. J. Wathen, Finite elements and fast iterative solvers: with applications in incompressible fluid dynamics, 2nd ed., ser. Numerical Mathematics and Scientific Computation. Oxford University Press, Oxford, 2014.
-  F. Rathgeber et al., “Firedrake: automating the finite element method by composing abstractions,” ACM Trans. Math. Softw., vol. 43, no. 3, pp. 24:1–24:27, 2016.