I Introduction
Solving systems of linear equations is a fundamental computational task in numerous fields [9, 11], including signal processing and machine learning. The general form for a system of linear equations is expressed as:
$$A_{i1} x_1 + A_{i2} x_2 + \cdots + A_{im} x_m = b_i, \quad i = 1, \ldots, n \qquad (1)$$
which can be compactly represented in matrix form as:
$$A x = b \qquad (2)$$
where matrix $A$ with size $n \times m$ contains the coefficients of the system, vector $x$ with size $m \times 1$ represents the unknown variables to be found, and vector $b$ with size $n \times 1$ contains known constants. For the common case of square-sized $A$ (i.e. $n = m$), a naive approach to find $x$ is via the matrix inverse, i.e. $x = A^{-1} b$. However, in practice computing the inverse is actually not necessary and can result in numerical instability, leading to inaccurate solutions [12]. An effective and general approach is to solve the system with the help of lower-upper (LU) decomposition [9, 11]. This can be typically implemented through the xGESV set of functions¹ available in the ubiquitous and de facto industry standard LAPACK software [1]. Several optimised versions of LAPACK are available, such as MKL [13] for the x86-64 architecture used by the widely employed Intel and AMD processors.

¹ For each LAPACK function, we substitute the first letter of the function with 'x' to indicate a set of functions which differ only in the applicable element type (e.g. single- or double-precision element type). For example, the set {SGESV, DGESV, CGESV, ZGESV} is represented as xGESV.
The task of converting an arbitrary linear algebra expression into an efficient sequence of optimally matched LAPACK function calls, either manually or automatically, is known as the linear algebra mapping problem [18]. When the conversion is done manually, the process can be laborious and error-prone; it typically requires thorough understanding of several areas: high-performance computing, numerical linear algebra, and the intricacies of LAPACK. Furthermore, source code that directly uses LAPACK functions has several downsides: it (i) has little resemblance to the originating mathematical expression, (ii) requires manual memory management, (iii) requires keeping track of many extra variables. In turn, these downsides reduce the readability of the source code by non-expert users, while increasing the maintenance burden and risk of bugs [22, 15].
To address the above issues and hence to increase productivity, linear algebra expressions are often implemented with the aid of high-level frameworks such as Matlab, Octave [14] and Armadillo [19]. While these frameworks facilitate simplified user code that focuses on high-level algorithm logic, the automatic mapping of given mathematical expressions into LAPACK functions can be suboptimal [2].
Higher level frameworks default to storing dense matrices in straightforward column-major format (where all the elements in each column are stored consecutively in memory), without taking into account any special structure or properties of the matrix. This choice is made by framework maintainers to reduce the internal code complexity of the framework, and to avoid burdening users with the choice of storage types; users may have neither the inclination nor the expertise to understand the advantages and disadvantages of various storage formats.
Within the above context, the dense matrix LU-based solver is an effective and often-used one-size-fits-all approach for solving systems of linear equations. However, this straightforward approach does not take into account various matrix properties which can be exploited to reduce the computational effort and/or to increase numerical stability. For example, when matrix $A$ is symmetric positive definite, it is more computationally efficient to use Cholesky decomposition instead of LU decomposition [11].
In this paper we describe an adaptive solver that we have devised and implemented inside recent versions of the open-source Armadillo C++ linear algebra library. The library provides a high-level domain-specific language [16, 21] embedded within the host C++ language, allowing mathematical operations with matrices to be expressed in a concise and easy-to-read manner similar to Matlab/Octave. This facilitates prototyping directly in C++ and aids the conversion of research code into production environments.

The adaptive solver is able to automatically detect several common properties of matrices, while being robust to inherent limitations in the precision of floating point representations [10, 17], and then map them to a large set of suitable LAPACK functions. We show that this leads to considerable speedups, thereby freeing the user from worrying about storage formats and using cumbersome manual calls to LAPACK functions in order to obtain good computational efficiency.
We continue the paper as follows. Section II describes several common matrix properties which can be exploited, summarises the algorithms used for detecting such properties, and lists the suitable sets of LAPACK functions tailored for each property. Section III provides an empirical evaluation showing the speedups attained by the adaptive solver. Section IV provides a brief discussion on computational complexity and runtime considerations. The salient points and avenues for further exploration are summarised in Section V.
II Automatic Detection and Mapping
The adaptive solver detects special matrix structures in the following order: (i) banded, (ii) upper or lower triangular, (iii) symmetric positive definite (sympd). Examples of the structures are shown in Fig. 1. A specialised solver is employed as soon as one of the structures is detected. The detection algorithms and associated LAPACK functions for solving the systems are described in Sections II-A through II-C. If no special structure is detected, an extended form of an LU-based solver is used, described in Section II-D. A flowchart summarising the adaptive solver is given in Fig. 2.
Each of the solvers estimates the reciprocal condition number [3] of the given system, which is used to determine the quality of the solution. If any of the solvers fail to find a solution (including the solver for general matrices), or if the system is determined to be poorly conditioned, an approximate solution is attempted using a solver based on singular value decomposition (SVD), described in Section II-E.

The code for the detection and mapping is implemented as part of the solve() function in Armadillo. As of Armadillo version 9.900, the solve() function comprises about 3000 lines of code, not counting LAPACK code. Example usage of the solve() function is shown in Fig. 3.
II-A Banded Matrices
Banded matrices contain elements on the main diagonal and typically on several more diagonals, above and/or below the main diagonal. All other elements are zero. As such, computational effort can be considerably reduced by exploiting the sparseness of the matrix. Fig. 1(a) shows an example banded matrix.
To efficiently detect a banded matrix structure within a column-major dense matrix, each column is examined rather than each possible diagonal. To determine the number of superdiagonals (diagonals above the main diagonal), the elements in each column are traversed from the top of the column to the location of the main diagonal within that column, followed by noting the difference in the locations of the first non-zero element and the main diagonal. The maximum of all noted differences is taken as the number of superdiagonals. To determine the number of subdiagonals (diagonals below the main diagonal), the elements in each column are traversed from the location of the main diagonal within that column to the bottom of the column, followed by noting the difference in the locations of the main diagonal and the last non-zero element. The maximum of all noted differences is taken as the number of subdiagonals.
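The per-column traversal described above can be sketched as follows. This is an illustrative sketch, not Armadillo's internal implementation; the `count_bands` function, the `BandInfo` struct, and their interfaces are our own assumptions. Column-major storage is assumed, with element (i, j) of an n×n matrix at index i + j*n.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Count super- and subdiagonals of an n x n column-major matrix
// by scanning each column once, as described in the text.
struct BandInfo { std::size_t upper; std::size_t lower; };

BandInfo count_bands(const std::vector<double>& A, std::size_t n)
  {
  BandInfo info = {0, 0};
  for (std::size_t j = 0; j < n; ++j)
    {
    const double* col = &A[j * n];  // column j starts at offset j*n
    // superdiagonals: distance between the first non-zero element
    // above the main diagonal and the main diagonal (row j)
    for (std::size_t i = 0; i < j; ++i)
      { if (col[i] != 0.0) { info.upper = std::max(info.upper, j - i); break; } }
    // subdiagonals: distance between the main diagonal and the last
    // non-zero element below it; scan upwards from the column's bottom
    for (std::size_t i = n; i-- > j + 1; )
      { if (col[i] != 0.0) { info.lower = std::max(info.lower, i - j); break; } }
    }
  return info;
  }
```

For a tridiagonal matrix this yields upper = lower = 1; the 25% density cut-off described below would be an additional early-exit check on the running counts.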
As soon as the number of detected diagonals indicates that more than 25% of the elements within the matrix are non-zero, all further processing is stopped and the matrix is deemed to be non-banded. The threshold of 25% was determined empirically as a good trade-off between the lower computational requirements of solving a banded system, and the amount of extra processing required for both the detection and further wrangling of data in the banded structure.
If a banded matrix structure is detected, data from the diagonals must first be converted into a relatively compact storage format, where each diagonal is stored as a vector in a separate dense matrix [1]. Data in the compact storage format is then used by the LAPACK function xGBTRF to compute the LU decomposition of the banded matrix, followed by the xGBTRS function to solve the system based on the LU decomposition, and finally the xGBCON function to estimate the reciprocal condition number from the LU decomposition.
Diagonal matrices are treated as banded matrices, where the number of diagonals above and below the main diagonal is zero. While it is certainly possible to have specialised handling for diagonal matrices, in practice such matrices are seldom encountered when solving systems of linear equations.
II-B Triangular Matrices
For triangular matrices, the LU decomposition can be avoided and the solution can be obtained via either back-substitution (for upper triangular matrices) or forward-substitution (for lower triangular matrices) [11]. As such, considerable reductions in computational effort can be attained. An example of a lower triangular matrix is shown in Fig. 1(b).
The process of forward-substitution can be summarised as first finding the value for $x_1$ in Eqn. (1), where $A_{12}$ through to $A_{1m}$ are known to be zero, resulting in $x_1 = b_1 / A_{11}$. The value for $x_1$ is then used to find $x_2$, where $A_{23}$ through to $A_{2m}$ are zero. This iterative process can be compactly expressed as:
$$x_i = \bigg( b_i - \sum_{j=1}^{i-1} A_{ij} x_j \bigg) \Big/ A_{ii} \qquad (3)$$
The process of back-substitution follows a similar manner, starting from the bottom of the matrix instead of the top (i.e. $x_n$ is solved first instead of $x_1$).
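As a concrete illustration of Eqn. (3), a minimal forward-substitution routine for a lower triangular system $L x = b$ might look as follows; the function name is ours, not a library routine, and column-major storage with 0-based indices is assumed:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Solve L x = b for a lower triangular matrix L via forward-substitution,
// directly implementing Eqn. (3); element (i,j) is at index i + j*n.
std::vector<double> forward_substitute(const std::vector<double>& L,
                                       const std::vector<double>& b)
  {
  const std::size_t n = b.size();
  std::vector<double> x(n);
  for (std::size_t i = 0; i < n; ++i)
    {
    double sum = 0.0;
    for (std::size_t j = 0; j < i; ++j) { sum += L[i + j * n] * x[j]; }
    x[i] = (b[i] - sum) / L[i + i * n];  // assumes a non-zero diagonal
    }
  return x;
  }
```

Back-substitution is the mirror image: iterate i from n-1 down to 0 and sum over j > i.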
The detection of triangular matrices is done in a straightforward manner. If all elements above the main diagonal are zero, a lower triangular matrix is present. Conversely, if all elements below the main diagonal are zero, an upper triangular matrix is present. An attempt to solve the system is made by using the xTRTRS function, and the reciprocal condition number is computed via the xTRCON function.
II-C Symmetric Positive Definite Matrices
A matrix $A$ is considered to be symmetric positive definite (sympd) if $x^{T} A x > 0$ for every non-zero column vector $x$, and $A$ is symmetric (i.e. its lower triangular part is equivalent to a transposed version of its upper triangular part, excluding the main diagonal). Fig. 1(c) shows an example sympd matrix. For solving systems of linear equations, the sympd property can be exploited by using Cholesky decomposition [11] instead of LU decomposition in order to reduce computational effort.
To determine whether a given matrix is sympd, the traditional approach is to check whether all its eigenvalues are real and positive [5]. However, this raises two complications. First, the eigendecomposition of a matrix is computationally expensive, which would defeat the aim of saving computational effort. Second, since the matrix is stored in simple column-major format, all matrix elements must be checked to ensure that symmetry is actually present. This in turn raises a further complication, as a simple check for equality between elements does not take into account minor differences that may be present due to the accumulation of rounding errors stemming from limitations in the precision of floating point representations [10, 17].

To address the above issues, we have devised a fast two-pass algorithm which aims to ensure the presence of several necessary conditions for a sympd matrix [5, 11]: (i) all diagonal elements are greater than zero, (ii) the element with largest modulus is on the main diagonal, (iii) the matrix is diagonally dominant, and (iv) the matrix is symmetric while tolerating minor variations in a robust manner.
We note that while the conditions checked by the algorithm are necessary, they are not sufficient to absolutely guarantee the presence of a sympd matrix. However, for practical purposes, in our experience (mainly in the machine learning area) the conditions appear adequate for determining sympd presence.
The algorithm is shown in Fig. 4, which returns either true or false to indicate that a matrix is likely to be sympd. As soon as any condition is not satisfied, the algorithm aborts further processing and returns false. Lines 7 to 8 check whether all diagonal elements are greater than zero. Lines 12 to 14 check the presence of symmetry by determining whether the difference between an element and its corresponding transposed element is below a threshold; the difference is compared against the threshold in both absolute and relative terms, for robustness against precision variations between low-magnitude and high-magnitude floating point representations [10, 17]. We have empirically chosen the threshold in terms of the machine epsilon $\varepsilon$². Line 15 checks whether the element with largest modulus is on the main diagonal, while line 16 performs a rudimentary check for diagonal dominance.

² Machine epsilon ($\varepsilon$) is defined as the difference between 1 and the next representable floating point value [10, 17]. For double-precision floating point numbers on x86-64 machines, $\varepsilon \approx 2.22 \times 10^{-16}$.
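A condensed sketch of these checks is given below. This is not Armadillo's exact implementation: the tolerance value is an assumed placeholder (Armadillo derives its threshold from the machine epsilon), and the rudimentary diagonal-dominance check (condition iii) is omitted for brevity.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Necessary-condition checks for a likely sympd n x n column-major matrix:
// (i) positive diagonal, (ii) largest modulus on the main diagonal,
// (iv) symmetry within absolute and relative tolerances.
bool likely_sympd(const std::vector<double>& A, std::size_t n,
                  double tol = 1e-12)  // placeholder tolerance, an assumption
  {
  double max_diag = 0.0;
  for (std::size_t i = 0; i < n; ++i)      // first pass: diagonal only
    {
    const double d = A[i + i * n];
    if (d <= 0.0) { return false; }        // condition (i) violated
    max_diag = std::max(max_diag, d);
    }
  for (std::size_t j = 0; j < n; ++j)      // second pass: off-diagonal elements
    for (std::size_t i = j + 1; i < n; ++i)
      {
      const double a = A[i + j * n];       // element below the diagonal
      const double b = A[j + i * n];       // its transposed counterpart
      const double delta = std::abs(a - b);
      // condition (iv): accept if the difference passes either the
      // absolute or the relative tolerance test
      if ((delta > tol) && (delta > tol * std::max(std::abs(a), std::abs(b))))
        { return false; }
      if (std::abs(a) >= max_diag)
        { return false; }                  // condition (ii) violated
      }
  return true;
  }
```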
If the algorithm determines that a sympd matrix is present, an attempt to solve the system is made with a combination of the following LAPACK functions: xPOTRF, xPOTRS, and xPOCON. The xPOTRF function computes the Cholesky decomposition, xPOTRS solves the system based on the Cholesky decomposition, and finally xPOCON computes the reciprocal condition number from the Cholesky decomposition.
If the xPOTRF function fails (i.e. the Cholesky decomposition was not found), the failure is taken to indicate that the matrix is not actually sympd. In that case, the system is solved by the generic solver described in Section II-D.
II-D General Matrices
For general matrices, where no special structure has been detected, a standard LU-based solver is used. Here matrix $A$ from Eqn. (2) is expressed as $A = LU$, where $L$ is a unit lower triangular matrix, and $U$ is an upper triangular matrix. Rewriting Eqn. (2) yields:
$$L U x = b \qquad (4)$$
Solving for $x$ is then accomplished in two steps. In the first step, Eqn. (4) is rewritten as $L y = b$ with $y = U x$, and a solution for $y$ is found. Since $L$ is lower triangular and $b$ is known, $y$ is easily found via forward-substitution (as shown in Section II-B). In the second step, $U x = y$ is solved. Since $U$ is upper triangular and $y$ is known, $x$ is found via back-substitution, thereby providing the desired solution to Eqn. (2).
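The two-step procedure can be sketched as follows, assuming $L$ and $U$ have already been computed (e.g. by xGETRF; pivoting is ignored for simplicity). The `lu_solve` helper is illustrative only, not a library routine.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Given precomputed factors A = L U (lower triangular L, upper triangular U,
// both column-major), solve A x = b in two steps:
// step 1: L y = b via forward-substitution; step 2: U x = y via back-substitution.
std::vector<double> lu_solve(const std::vector<double>& L,
                             const std::vector<double>& U,
                             const std::vector<double>& b)
  {
  const std::size_t n = b.size();
  std::vector<double> y(n), x(n);
  for (std::size_t i = 0; i < n; ++i)       // step 1: forward-substitution
    {
    double s = 0.0;
    for (std::size_t j = 0; j < i; ++j) { s += L[i + j * n] * y[j]; }
    y[i] = (b[i] - s) / L[i + i * n];
    }
  for (std::size_t i = n; i-- > 0; )        // step 2: back-substitution
    {
    double s = 0.0;
    for (std::size_t j = i + 1; j < n; ++j) { s += U[i + j * n] * x[j]; }
    x[i] = (y[i] - s) / U[i + i * n];
    }
  return x;
  }
```

For example, with $A = LU$ where $L = [[1,0],[2,1]]$ and $U = [[2,1],[0,1]]$ (so $A = [[2,1],[4,3]]$) and $b = [3,7]$, the routine recovers $x = [1,1]$.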
Rather than simply using the xGESV family of functions from LAPACK, which does not provide an estimate of the reciprocal condition number, we use a combination of xGETRF, xGETRS and xGECON. The xGETRF function computes the LU decomposition, xGETRS solves a system of linear equations based on the LU decomposition, and finally xGECON computes the reciprocal condition number from the LU decomposition.
matrix size | standard (s) | adaptive (s) | reduction
100×100   |  |  | 72.78%
250×250   |  |  | 73.59%
500×500   |  |  | 79.61%
1000×1000 |  |  | 84.09%
matrix size | standard (s) | adaptive (s) | overhead
100×100   |  |  | 1.627%
250×250   |  |  | 0.243%
500×500   |  |  | 0.114%
1000×1000 |  |  | 0.187%
II-E Fallback for Poorly Conditioned Systems
If the above solvers fail or if the estimated reciprocal condition number indicates a poorly conditioned system, an approximate solution is attempted using a solver based on SVD. A poorly conditioned system is detected when the reciprocal condition number is below a threshold; following LAPACK's example, we have set this threshold in terms of the machine epsilon $\varepsilon$, as defined in Section II-C. The use of the SVD-based fallback solver can be disabled through an optional argument to the solve() function.
The SVD-based solver uses the xGELSD set of functions, which find a minimum-norm solution to a linear least squares problem [6], i.e. $x$ is found via minimisation of $\|Ax - b\|_2$. In brief, the solution to the system in Eqn. (2) is reformulated as $x = A^{+} b$, where $A^{+}$ is known as the Moore-Penrose pseudo-inverse, which can be obtained via the SVD of $A$ [11]. This reformulation is equivalent to the least squares solution.
III Empirical Evaluation
In this section we demonstrate the speedups resulting from automatically detecting the matrix properties described in Section II and selecting the most appropriate set of LAPACK functions for solving systems of linear equations. The evaluations were done on a machine with an Intel Core i5-5200U CPU running at 2.2 GHz. Compilation was done with the GCC 10.1 C++ compiler with the following configuration options: -O3 -march=native. We used the open-source OpenBLAS 0.3.9 [24] package which provides optimised implementations of LAPACK functions.
We performed evaluations on the following set of sizes for matrix $A$, ranging from small to large: {100×100, 250×250, 500×500, 1000×1000}. Three matrix types were used: (i) banded (with 5 diagonals), (ii) lower triangular, (iii) symmetric positive definite (sympd). For each matrix size and type, 1000 unique random systems were solved, with each random system solved using the standard LU-based solver (as per Section II-D) and the adaptive solver. For each solver, the average wall-clock time (in seconds) across the 1000 runs is reported. The results are presented in Tables I, II and III.
The results indicate that for all matrix types, the adaptive solver reduces the wall-clock time. On average, the reduction is most notable for banded systems, closely followed by triangular systems. While the reductions for sympd systems are less pronounced, they still show useful speedups. In general, the larger the matrix size, the larger the reduction in wall-clock time.
To gauge the overhead of all the detection algorithms, we compare the time taken to solve random dense systems (without any special structure) using the standard LUbased solver against the adaptive solver. The results shown in Table IV indicate that the overhead is negligible. This is due to each detection algorithm stopping as soon as it determines that a special structure is not present.
IV Runtime Considerations
In the previous section we showed the proposed algorithm to be empirically efficient on randomly generated linear systems. Indeed, this is partly due to the fact that the checks that we have proposed are asymptotically more efficient than the factorisations we employ. To see this, observe that the banded structure check described in Section II-A can be performed in a single pass over the matrix. For a square matrix of size $n \times n$, this is $O(n^2)$ time. Similarly, the triangular structure check described in Section II-B can also be performed in a single pass, yielding $O(n^2)$ time. Finally, the likely_sympd algorithm in Fig. 4, as written, can be performed as an $O(n)$ pass over the diagonal of $A$ followed by a single $O(n^2)$ pass. Overall, this yields a total quadratic runtime for all of the proposed checks.
In the situation where the given matrix is triangular, we can perform forward- or back-substitution in $O(n^2)$ time, yielding an asymptotic improvement over the standard LU-based solver, which takes $O(n^3)$ time [11]. Cholesky decomposition and singular value decomposition both also take $O(n^3)$ time to compute [11]. However, in practice the Cholesky decomposition tends to be faster (as only one triangular part of the symmetric matrix needs to be processed), explaining the rest of the empirical advantage that was observed in our experiments.
V Conclusion
We have described an adaptive solver for systems of linear equations that is able to automatically detect several common properties of a given system (banded, triangular, and symmetric positive definite), followed by solving the system via mapping to a set of suitable LAPACK functions best matched to each property. Furthermore, the solver can detect poorly conditioned systems and automatically obtain an approximate solution via singular value decomposition as a fallback. The automatic handling facilitates simplified user code that focuses on highlevel algorithm logic, freeing the user from worrying about various storage formats and using cumbersome manual calls to LAPACK functions in order to obtain good computational efficiency.
The solver is present inside recent versions of the high-level Armadillo C++ library for linear algebra [19, 20], which allows matrix operations to be expressed in an easy-to-read manner similar to Matlab/Octave. The solver comprises about 3000 lines of code, not counting LAPACK code; it is able to handle matrices with single- and double-precision floating point elements, in both real and complex forms. The source code for the adaptive solver is provided under the permissive Apache 2.0 license [23], allowing unencumbered use in commercial products; it can be obtained from http://arma.sourceforge.net.
The adaptive solver has been successfully used to increase efficiency in open-source toolkits such as the mlpack library for machine learning [8], and the ensmallen library for numerical optimisation [4]. Areas for further improvement include specialised handling of plain symmetric matrices (symmetric but not positive definite), and tridiagonal matrices, which are a common subtype of banded matrices [7].
Acknowledgements
We would like to thank our colleagues at Data61/CSIRO (Frank De Hoog, Mark Staples) and Griffith University (M.A. Hakim Newton, Majid Namazi) for discussions leading to the improvements of this paper.
References
 [1] E. Anderson, Z. Bai, C. Bischof, S. Blackford, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, and D. Sorensen. LAPACK Users’ Guide. SIAM, 1999.
 [2] H. Barthels, C. Psarras, and P. Bientinesi. The sad truth about linear algebra software. XXI Householder Symposium on Numerical Linear Algebra, 2020.
 [3] D. A. Belsley, E. Kuh, and R. E. Welsch. Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. Wiley-Interscience, 1980.
 [4] S. Bhardwaj, R. R. Curtin, M. Edel, Y. Mentekidis, and C. Sanderson. ensmallen: a flexible C++ library for efficient function optimization. Workshop on Systems for ML and Open Source Software at NeurIPS / NIPS, 2018.
 [5] R. Bhatia. Positive Definite Matrices. Princeton University Press, 2015.
 [6] S. Boyd and L. Vandenberghe. Introduction to Applied Linear Algebra: Vectors, Matrices, and Least Squares. Cambridge University Press, 2018.
 [7] E. W. Cheney and D. R. Kincaid. Numerical Mathematics and Computing. Cengage Learning, 7th edition, 2012.
 [8] R. R. Curtin, M. Edel, M. Lozhnikov, Y. Mentekidis, S. Ghaisas, and S. Zhang. mlpack 3: a fast, flexible machine learning library. Journal of Open Source Software, 3(26):726, 2018.
 [9] J. W. Demmel. Applied Numerical Linear Algebra. SIAM, 1997.

 [10] D. Goldberg. What every computer scientist should know about floating-point arithmetic. ACM Computing Surveys, 23(1):5–48, 1991.
 [11] G. H. Golub and C. F. Van Loan. Matrix Computations. Johns Hopkins University Press, 4th edition, 2013.
 [12] N. J. Higham. Accuracy and Stability of Numerical Algorithms. SIAM, 2nd edition, 2002.
 [13] Intel. Math Kernel Library (MKL), 2020. https://software.intel.com/mkl.
 [14] S. Linge and H. P. Langtangen. Programming for Computations – MATLAB/Octave. Springer, 2016.

 [15] R. Malhotra and A. Chug. Software maintainability: Systematic literature review and current trends. International Journal of Software Engineering and Knowledge Engineering, 26(8):1221–1253, 2016.
 [16] M. Mernik, J. Heering, and A. M. Sloane. When and how to develop domain-specific languages. ACM Computing Surveys, 37(4):316–344, 2005.
 [17] J.-M. Muller, N. Brunie, F. de Dinechin, C.-P. Jeannerod, M. Joldes, V. Lefèvre, G. Melquiond, N. Revol, and S. Torres. Handbook of Floating-Point Arithmetic. Birkhäuser, 2nd edition, 2018.
 [18] C. Psarras, H. Barthels, and P. Bientinesi. The linear algebra mapping problem. arXiv:1911.09421, 2019.
 [19] C. Sanderson and R. Curtin. Armadillo: a template-based C++ library for linear algebra. Journal of Open Source Software, 1:26, 2016.
 [20] C. Sanderson and R. Curtin. Practical sparse matrices in C++ with hybrid storage and templatebased expression optimisation. Mathematical and Computational Applications, 24(3), 2019.
 [21] M. Scherr and S. Chiba. Almost first-class language embedding: taming staged embedded DSLs. In ACM SIGPLAN International Conference on Generative Programming: Concepts and Experiences, pages 21–30, 2015.
 [22] H. M. Sneed. A cost model for software maintenance & evolution. In IEEE International Conference on Software Maintenance, 2004.
 [23] A. St. Laurent. Understanding Open Source and Free Software Licensing. O’Reilly Media, 2008.
 [24] Z. Xianyi, W. Qian, and W. Saar. OpenBLAS, 2020. http://www.openblas.net.