A Hybrid Direct-Iterative Method for Solving KKT Linear Systems

10/07/2021 ∙ by Shaked Regev, et al. ∙ Stanford University

We propose a solution strategy for linear systems arising in interior method optimization, which is suitable for implementation on hardware accelerators such as graphics processing units (GPUs). The current gold standard for solving these systems is the LDL^T factorization. However, LDL^T requires pivoting during factorization, which substantially increases communication cost and degrades performance on GPUs. Our novel approach solves a large indefinite system by solving multiple smaller positive definite systems, using an iterative solve for the Schur complement and an inner direct solve (via Cholesky factorization) within each iteration. Cholesky is stable without pivoting, thereby reducing communication and allowing reuse of the symbolic factorization. We demonstrate the practicality of our approach and show that on large systems it can efficiently utilize GPUs and outperform LDL^T factorization of the full system.




1 Introduction

Interior methods for nonlinear optimization [52, 7, 49, 17] are essential in many areas of science and engineering. They are commonly used in model predictive control with applications to robotics [43], autonomous cars [53], aerial vehicles [21], combustion engines [22], and heating, ventilation and air-conditioning systems [25], to name a few. Interior methods are also used in public health policy strategy development [3, 42], data fitting in physics [36], genome science [4, 50], and many other areas.

Most of the computational cost within interior methods is in solving linear systems of Karush-Kuhn-Tucker (KKT) type [29]. The linear systems are sparse, symmetric indefinite, and usually ill-conditioned and difficult to solve. Furthermore, implementations of interior methods for nonlinear optimization, such as the filter line-search approach in Ipopt [52] and HiOp [34], typically expect the linear solver to provide the matrix inertia (number of positive, negative, and zero eigenvalues) to determine whether the system should be regularized. (Otherwise, interior methods perform curvature tests to ensure descent in a certain merit function [10].) Relatively few linear solvers are equipped to solve KKT systems, and even fewer can run those computations on hardware accelerators such as graphics processing units (GPUs) [45].

At the time of writing, six out of the ten most powerful computers in the world have more than 90% of their compute power in hardware accelerators [47]. Hardware accelerator technologies are becoming ubiquitous in off-the-shelf products, as well. In order to take advantage of these emerging technologies, it is necessary to develop fine-grain parallelization techniques tailored for high-throughput devices such as GPUs.

For such sparse problems, pivoting becomes extremely expensive, as data management takes a large fraction of the total time compared to computation [14]. Unfortunately, LDL^T factorization is unstable without pivoting. This is why LDL^T approaches, which are typically used by interior methods for nonlinear problems on CPU-based platforms [46], have not performed as well on hardware accelerators [45]. Some iterative methods, such as MINRES [30] for general symmetric matrices, can make efficient (albeit memory-bandwidth-limited) use of GPUs because they only require matrix-vector multiplications at each iteration, which can be highly optimized; however, they have limited efficiency when the number of iterations becomes large. Another approach for better-conditioned KKT systems is a modified version of the preconditioned conjugate gradient method (PCG) with implicit-factorization preconditioning [12]. In our case, the ill-conditioned nature of our linear problems means that iterative methods alone are not practical [30, 35].

We propose a hybrid direct-iterative method for solving KKT systems that is suitable for execution on hardware accelerators. The method only requires direct solves using a Cholesky factorization, as opposed to LDL^T, which means it avoids pivoting. We provide preliminary test results that show the practicality of our approach. Our test cases are generated by optimal power flow analysis [8, 23, 27] applied to realistic power grid models that resemble actual grids but do not contain any proprietary data [26]. These systems are extracted from optimal power flow analysis using Ipopt [52] with MA57 as its linear solver. Solving such sequences of linear problems gives us insight into how our linear solver behaves within an interior method. Using these test cases allowed us to assess the practicality of our hybrid approach without interfacing the linear solver with an optimization solver. Power grids are representative of very sparse and irregular systems commonly found in engineering disciplines.

The paper is organized as follows. Table 1 defines our notation. Section 2 describes the optimization problem being solved. Section 3 defines the linear systems that arise when an interior method is applied to the optimization problem. In Section 4, we derive a novel hybrid direct-iterative algorithm to utilize the block structure of the linear systems, and prove convergence properties for the algorithm. Numerical tests in Section 5 show the accuracy of our algorithm on realistic systems, using a range of algorithmic parameter values. Section 6 compares our C and CUDA implementation to MA57 [13]. Section 7 explains our decision to use a direct solver in the inner loop of our algorithm. In Section 8, we summarize our main contributions and results. Appendix A provides supplemental figures for Section 5.

Variable Properties Functions Meaning
Symmetric matrix , largest, smallest (most negative) eigenvalues
SPSD matrix the smallest nonzero eigenvalue
SPD matrix condition number
rectangular matrix null() nullspace
vector, A diagonal matrix ,
vector a -vector of s
Table 1: Notation. SP(S)D stands for symmetric positive (semi)definite

2 Nonlinear optimization problem

We consider constrained nonlinear optimization problems of the form

s.t. (1b)

where is an -vector of optimization parameters, is a possibly nonconvex objective function, defines equality constraints, and defines inequality constraints. (Problems with more general inequalities can be treated in the same way.) Functions , and are assumed to be twice continuously differentiable. Interior methods enforce bound constraints (1d) by adding barrier functions to the objective (1a):

where the inequality constraints (1c) are treated as equality constraints with slack variables . The barrier parameter is reduced toward zero using a continuation method to obtain a solution that is close to the solution of (1) to within a solver tolerance.

Interior methods are most effective when exact first and second derivatives are available, as we assume for , , and . We define and as the sparse Jacobians for the constraints. The solution of a barrier subproblem satisfies the nonlinear equations equationparentequation


where and are primal variables, and are Lagrange multipliers (dual variables) for constraints (2c)–(2d), and and are Lagrange multipliers for the bounds and . The conditions , , , and are maintained throughout, and the matrices and are SPD.

Analogously to [52], at each continuation step in we solve nonlinear equations (2) using a variant of Newton’s method. Typically and are eliminated from the linearized version of (2) by substituting the linearized versions of (2e) and (2f) into the linearized versions of (2a) and (2b), respectively, to obtain a smaller symmetric problem. Newton’s method then calls the linear solver to solve a series of linearized systems , , of block form


where index denotes the optimization solver iteration (including continuation steps in and Newton iterations), each is a KKT-type matrix (with saddle-point structure), vector is a search direction (defined such that for some linesearch steplength ) for the primal and dual variables, and is derived from the residual vector for (2) evaluated at the current values of the primal and dual variables (with as the method converges):

With and , the sparse Hessian

and diagonal are matrices, is a diagonal matrix, is a sparse matrix, and is a sparse matrix. We define , , and .

Interior methods may take hundreds of iterations (but typically not thousands) before they converge to a solution. All matrices have the same sparsity pattern, and their nonzero entries change slowly with . An interior method can exploit this by reusing output from linear solver functions across multiple iterations :

  • Ordering and symbolic factorization are needed only once because the sparsity pattern is the same for all .

  • Numerical factorizations can be reused over several adjacent Newton iterations, e.g., when an inexact Newton solver is used within the optimization algorithm.

Operations such as triangular solves have to be executed at each iteration .

The workflow of the optimization solver with calls to different linear solver functions is shown in Fig. 1 (where denotes the linear system to be solved at each iteration). The main optimization solver loop is the top feedback loop in Fig. 1. It is repeated until the solution is optimal or a limit on optimization iterations is reached. At each iteration, the residual vector is updated. Advanced implementations have control features to ensure stability and convergence of the optimization solver. The lower feedback loop in Fig. 1 shows linear system regularization by adding a diagonal perturbation to the KKT matrix. One such perturbation removes singularity [52, Sec. 3.1], which happens if there are redundant constraints. The linear solver could take advantage of algorithm control like this and request matrix perturbation when beneficial.

Figure 1: Optimization solver workflow showing invocation of key linear solver functions. The top feedback loop represents the main optimization solver iteration loop. The bottom feedback loop is the optimization solver control mechanism to regularize the underlying problem when necessary.

3 Solving KKT linear systems

LDL^T factorization via MA57 [13] has been used effectively for extremely sparse problems on traditional CPU-based platforms, but is not suitable for the fine-grained parallelization required for GPU acceleration. Parallel and GPU-accelerated direct solver implementations such as SuperLU [24, 2], STRUMPACK [38, 44], and PaStiX [19, 31] exist for general symmetric indefinite systems (although the first two are designed for general systems), but these software packages are designed to take advantage of dense blocks in denser problems and do not perform well on our systems of interest, which do not yield such dense blocks [45, 18].

The fundamental issue with using GPUs for LDL^T is that this factorization is not stable without pivoting [16]. Pivoting requires considerable data movement and, as a result, a substantial part of the run time is devoted to memory access and communication. Any gains from the hardware acceleration of floating point operations are usually outweighed by the overhead associated with permuting the system matrix during the pivoting. This is especially burdensome because both rows and columns need to be permuted in order to preserve symmetry [14]. While either of the two permutations can be performed efficiently on its own with an appropriate data structure for the system's sparse matrix (i.e., compressed sparse row storage for row permutations and, analogously, compressed sparse column storage for column permutations), swapping both rows and columns simultaneously is necessarily costly.

Here we propose a method that uses sparse Cholesky factorization (in particular, a GPU implementation). Cholesky factorization is advantageous for a GPU implementation because it is stable without pivoting and can use GPUs efficiently compared to LDL [37]. Furthermore, the ordering of the unknowns (also known as the symbolic factorization) for the purpose of reducing fill-in in the factors can be established without considering the numerical values and only once at the beginning of the optimization process, and hence, its considerable computational cost is amortized over the optimization iterations.
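The stability contrast can be seen in a small sketch (our own Python illustration, not the paper's code): Cholesky factorizes an SPD matrix with no pivoting, while a naive LDL^T factorization without pivoting breaks down on a simple indefinite saddle-point matrix whose first pivot is zero.

```python
import numpy as np

# A small SPD matrix: Cholesky succeeds with no pivoting.
H = np.array([[4.0, 1.0], [1.0, 3.0]])
L = np.linalg.cholesky(H)            # no pivoting needed for SPD matrices
assert np.allclose(L @ L.T, H)

# A saddle-point (indefinite) matrix with a zero (1,1) entry:
# LDL^T without pivoting breaks down immediately, because the
# first pivot d_11 = K[0,0] = 0.
K = np.array([[0.0, 1.0], [1.0, 0.0]])

def ldl_no_pivot(A):
    """Naive LDL^T without pivoting; returns None on a zero pivot."""
    n = A.shape[0]
    L, d = np.eye(n), np.zeros(n)
    for j in range(n):
        d[j] = A[j, j] - L[j, :j] ** 2 @ d[:j]
        if d[j] == 0.0:
            return None              # breakdown: pivoting would be required
        for i in range(j + 1, n):
            L[i, j] = (A[i, j] - L[i, :j] * L[j, :j] @ d[:j]) / d[j]
    return L, d

assert ldl_no_pivot(K) is None       # indefinite system needs pivoting
assert ldl_no_pivot(H) is not None   # SPD system does not
```

This is exactly why avoiding LDL^T in favor of Cholesky removes the pivoting (and hence communication) bottleneck on GPUs.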

To make the problem smaller, we can eliminate and from (3) to obtain the system [32, Sec. 3.1]


where . This reduction requires block-wise Gaussian elimination with block pivot , which is ill-conditioned when has large elements, as it ultimately does. Thus, system (4) is smaller but more ill-conditioned. After solving (4), we compute and in turn to obtain the solution of (3).

4 A block system solution method

Let be any SPD matrix. Multiplying the second row of (4) by and adding it to the first row gives a system of the form


where . The simplest choice is with :


When is SPD, its Schur complement is well defined, and (5) is equivalent to


This is the approach of Golub and Greif [15] for saddle-point systems (which have the structure of the system in (4)). Golub and Greif found experimentally that made SPD and better conditioned than smaller or larger values. We show in Theorems 2 and 4 that for large , the condition number increases as , but converges to as . Corollary 1 shows there is a finite value of that minimizes the condition number; this value is probably close to the point where the two effects balance.
Our contribution is to combine the system reduction in [32] (from (3) to (4)) with the method of [15] for changing (4) to (5) to solve an optimization problem consisting of a series of block systems using a GPU implementation of sparse Cholesky factorization applied to . A new method for regularizing is added, and practical choices for parameters are given based on scaled systems. Also, important convergence properties of the method are proven in Theorems 1–4.

If (5) or are poorly conditioned, the only viable option may be to ignore the block structure in (5) and solve (3) with an LDL^T factorization such as MA57 (likely without help from GPUs). This is the fail-safe option. Otherwise, we require to be SPD (for large enough ) and use its Cholesky factorization to apply the conjugate gradient method (CG) [20] or MINRES [30] to (7), with or without a preconditioner. If the part of the QR factors of is not too dense, we could use as a multiplicative preconditioner. This gives the exact solution if the RHS is orthogonal to null() or is square, which is not possible in our case. In our experiments, solutions with the preconditioner lagged slightly behind those without it, and both take iterations. We proceed without a preconditioner.
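As a toy illustration of the Golub-Greif shift (with our own placeholder names H, J, and gamma standing in for the paper's symbols, which were lost in extraction): an indefinite (1,1) block that is positive definite on the nullspace of the Jacobian becomes SPD once a sufficiently large multiple of J^T J is added.

```python
import numpy as np

# Hypothetical small saddle-point data (names H, J, gamma are ours).
H = np.array([[1.0,  0.0],
              [0.0, -0.5]])          # indefinite (1,1) block
J = np.array([[0.0, 1.0]])           # full row rank; null(J) = span{e1}

def is_spd(A):
    """Attempting a Cholesky factorization doubles as an SPD test."""
    try:
        np.linalg.cholesky(A)
        return True
    except np.linalg.LinAlgError:
        return False

# H is positive definite on null(J) (e1' H e1 = 1 > 0) but not SPD overall.
assert not is_spd(H)

# The shifted block H + gamma * J'J becomes SPD for large enough gamma.
gamma = 10.0
H_gamma = H + gamma * J.T @ J
assert is_spd(H_gamma)
```

This mirrors the equivalence stated in Theorem 1 below: for large enough gamma, SPD-ness of the shifted block is equivalent to positive definiteness on the constraint nullspace.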

4.1 A hybrid solver with minimal regularization

Typically, (4) starts with an SPD and full-rank . As the optimization solver iterations progress, may become indefinite and 's rank may shrink (at least numerically). This means the system becomes singular and must be regularized. We seek a small regularization to avoid changing the solution of the equivalent system (5) too much.

An SPD guarantees that is SPD on null(), a requirement at the solution of the optimization problem.

Theorem 1.

For large enough () and full row rank , is SPD iff is positive definite on null().


Assume is SPD. For any nonzero vector in null(), we have (this direction of the proof does not require to have full row rank).

Conversely, assume is positive definite on null() and has full row rank. For any nonzero vector with in null() and orthogonal to null(),

We have by assumption. Further,

if . ∎

We work with . We use SPD as a proxy for being SPSD on null(), keeping in mind that if it does not hold even for a very large in (6), is singular and needs to be modified. However, cannot be made arbitrarily large without increasing when is rectangular, as in our case. There must be an ideal intermediate value of .

Theorem 2.

If has full row rank with more columns than rows, is symmetric and positive definite on null(), and , there exists such that for , increases linearly with .


Hence , meaning for large enough (defined as ). Similarly,

Thus . From Theorem 1, , so that for . ∎

Corollary 1.

Among s such that is SPD (), is minimized for some .

In practice, the optimizer may provide systems where is not SPD on null(). In this case we can regularize by using instead. Unlike , the introduction of changes the solution of the system, so it is essential to keep as small as possible. If is not SPD, we set to , a parameter specifying a minimum amount of regularization. If is still not SPD, we double until it is. This ensures we use the minimal amount of regularization (to within a factor of 2) needed to make SPD.
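The doubling loop just described can be sketched as follows (our own Python illustration with assumed names; a failed Cholesky factorization serves as the "not SPD" signal):

```python
import numpy as np

def factorize_with_min_regularization(A, delta_min=1e-12, delta_max=1e-2):
    """Try delta = 0, then double delta from delta_min until A + delta*I
    admits a Cholesky factorization. A sketch of the minimal-regularization
    strategy in the text; the names and defaults here are ours."""
    n = A.shape[0]
    delta = 0.0
    while True:
        try:
            L = np.linalg.cholesky(A + delta * np.eye(n))
            return L, delta
        except np.linalg.LinAlgError:
            delta = delta_min if delta == 0.0 else 2.0 * delta
            if delta > delta_max:
                # mirror of the fail-safe branch in Algorithm 1
                raise RuntimeError("fall back to LDL^T of the full system")

# A singular (hence not SPD) matrix needs a nonzero delta:
A = np.array([[1.0, 1.0], [1.0, 1.0]])
L, delta = factorize_with_min_regularization(A)
assert delta > 0.0
```

The returned delta is then minimal to within a factor of 2, as the text requires.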

If proves to be large, which can happen before we are close to a solution, it is essential for the optimizer to be informed to allow it to modify the next linear system. When the optimizer nears a solution, will not need to be large.

In our tests, starts at for the next matrix in the sequence, but is set back to its previous value if the factorization fails.

We also set , the maximum allowed before we resort to LDL factorization of or return to the optimization solver.

If has low row rank, in (7) is SPSD. In this case, CG will succeed if (7) is consistent. Otherwise, it will encounter a near-zero quadratic form. We then restart CG on the regularized system . In this way, regularizes the block and regularizes the block.

To ensure we can judge the size of parameters and relative to system (5), we first scale (5) with a symmetrized version of Ruiz scaling [40].
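A dense sketch of symmetric Ruiz equilibration (our own illustration; the paper applies a sparse, symmetrized variant of [40]): each sweep rescales the matrix symmetrically so that every row and column infinity-norm is driven toward 1.

```python
import numpy as np

def ruiz_scale(K, iters=30):
    """Symmetric Ruiz equilibration sketch: repeatedly apply K <- D K D with
    D = diag(1/sqrt(row inf-norms)). For symmetric K this preserves symmetry
    and drives all row/column inf-norms to 1."""
    n = K.shape[0]
    d = np.ones(n)
    for _ in range(iters):
        r = np.max(np.abs(K), axis=1)        # row inf-norms (= column norms)
        s = 1.0 / np.sqrt(r)
        K = (K * s[:, None]) * s[None, :]    # K <- D K D
        d *= s                               # accumulate the total scaling
    return K, d

# Badly scaled symmetric example: entries spanning ten orders of magnitude.
K = np.array([[1e6, 2.0], [2.0, 1e-4]])
Ks, d = ruiz_scale(K)
assert np.allclose(np.max(np.abs(Ks), axis=1), 1.0)
```

After scaling, parameters such as the gamma shift and the delta regularization can be judged on a uniform scale, which is the purpose stated above.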

Algorithm 1 is a generalization of the Uzawa iteration [48] for KKT systems with a (1,1) block that is not necessarily positive definite. It gives a method for solving a sequence of systems {(7)–(8)} with used in the calculation of . The workflow is similar to Fig. 1 except only is factorized. On lines 16–19, is applied by direct triangular solves with the Cholesky factors , of . Each iteration of the CG solve on line 17 requires multiplication by , applying , multiplying by , and adding a scalar multiple of the original vector. Section 7 shows why complete Cholesky factorization of was chosen.

The rest of the section discusses other important bounds that must obey. However, selecting an ideal in practice is difficult and requires problem heuristics (like in [16]) or trial and error.

1:   for each matrix in the sequence such as in (5) do
3:      ( used in the calculation of )
4:      Try Cholesky factorization of (Fail ← False if factorized, True otherwise)
5:      while Fail and  do
6:         if  then
8:         else
10:        end if
12:        Try Cholesky factorization of (Fail ← False if factorized, True otherwise)
13:     end while
14:     if Fail == False then
15:        Direct solve
16:        if CG on produces a small quadratic form then
17:           CG solve (perturbed (7))
18:        end if
19:        Direct solve                      (perturbed (8))
20:     else
21:        Use LDL^T to solve (3) or return the problem to the optimization solver
22:     end if
23:  end for
Algorithm 1: Using CG on the Schur complement to solve the block system (5) by solving (7)–(8). is a nonsingular perturbation of . Typical parameters: , , .
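A minimal dense sketch of the solve phase (Lines 15–19) follows, with our own placeholder names H_g, J, gamma; the paper's implementation is in C/CUDA, so this Python version only illustrates the structure: factor the shifted (1,1) block once with Cholesky, run CG on the Schur complement applying the factor at every iteration, then back-substitute.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def cg(matvec, b, tol=1e-12, maxiter=200):
    """Plain conjugate gradients on an SPD operator given as a matvec."""
    x = np.zeros_like(b)
    r = b.copy(); p = r.copy(); rs = r @ r
    for _ in range(maxiter):
        Ap = matvec(p)
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs, rs_old = r @ r, rs
        if np.sqrt(rs) <= tol * np.linalg.norm(b):
            break
        p = r + (rs / rs_old) * p
    return x

# Hypothetical block system  [H_g  J^T; J  0] [x; y] = [r1; r2].
rng = np.random.default_rng(0)
n, m, gamma = 20, 5, 1e4
J = rng.standard_normal((m, n))
H_g = np.diag(rng.uniform(0.5, 2.0, n)) + gamma * J.T @ J   # shifted SPD block
r1, r2 = rng.standard_normal(n), rng.standard_normal(m)

Hfac = cho_factor(H_g)                        # one Cholesky, reused every CG step
Smv = lambda v: J @ cho_solve(Hfac, J.T @ v)  # Schur complement S = J H_g^{-1} J^T
y = cg(Smv, J @ cho_solve(Hfac, r1) - r2)     # iterative solve for the dual part
x = cho_solve(Hfac, r1 - J.T @ y)             # direct back-substitution

# Verify against the full indefinite system.
K = np.block([[H_g, J.T], [J, np.zeros((m, m))]])
res = K @ np.concatenate([x, y]) - np.concatenate([r1, r2])
assert np.linalg.norm(res) / np.linalg.norm(np.concatenate([r1, r2])) < 1e-6
```

Each CG iteration costs one multiplication by J^T, two triangular solves with the Cholesky factors, and one multiplication by J, matching the per-iteration cost described in the text.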

4.2 Guaranteed descent direction

Optimization solvers typically use an LDL^T factorization to solve the system (3) at each step because (with minimal extra computation) it supplies the inertia of the matrix. A typical optimization approach treats each of the four blocks of (3) as one block, accounting for the possible regularization applied to . That is, the (1,1) block is a Hessian of dimension , and the (2,1) block represents the constraint Jacobian and has dimensions . The inertia being implies that (a) is uniformly SPD on null() for all , meaning for all vectors satisfying , iterations , and some positive constant ; and (b) the KKT matrix in (3) is nonsingular. Together these conditions are sufficient to ensure a descent direction and fast convergence in the optimization algorithm [52, 29]. We show that Algorithm 1 ensures properties (a) and (b) without computing the inertia of the KKT matrix.

Theorem 3.

If Algorithm 1 succeeds with , it provides a descent direction for the interior method for large enough .


By construction, is uniformly SPD (because the Cholesky factorization was successful for all iterations of the solver until this point) and the KKT system in (5) is nonsingular (because has full row rank, or was added to to shift it from SPSD to SPD). Therefore, Algorithm 1 provides a descent direction for (5) even if regularization is added. The rest of the proof assumes , and we return to the other case later. Since has full row rank, is nonsingular by assumption, all block rows in (3) have full rank internally, and Gaussian elimination cannot cause cancellation of an entire block row, we conclude that (3) is nonsingular. Let (for used in Algorithm 1). For any nonzero vector with and of dimensions and ,

where . If , then , which means and the proof is complete (with ). Otherwise, . So for large enough , . Applying Theorem 1 to (3) with and replacing and shows that is positive definite on null(). ∎

We note that corresponds to the so-called primal regularization of the filter line-search algorithm [52]. Whenever becomes too large, one can invoke the feasibility restoration phase of the filter line-search algorithm [52] as an alternative to performing the LDL^T factorization on the CPU. Feasibility restoration “restarts” the optimization at a point with more favorable numerical properties. We also note that when is sufficiently large, the curvature test used in [10] should be satisfied. Hence, inertia-free interior methods have the global convergence property without introducing additional regularization from the outer loop.

The regularization is a numerical remedy for low-rank , caused by redundant equality constraints. This regularization is similar to the so-called dual regularization used in Ipopt [52] and specifically addresses the issue of rank deficiency; however, there is no direct analogue from a regularization in (5) to Ipopt’s dual regularization in (3) and neither of the two heuristics guarantees a descent direction for (3). Given the similarity between the two heuristics, we believe that the regularization can be effective within the filter line-search algorithm.

When Algorithm 1 is integrated with a nonlinear optimization solver such as [51] and [33], the while loop (Line 5) in Algorithm 1 can be removed, and the computation of and can be decided by the optimization algorithm. The development of robust heuristics that allow Algorithm 1's and regularization within the filter line-search algorithm will be the subject of future work.

4.3 Convergence for large

In Section 4.1 we showed that is SPD for large enough , and that should not be so large that the low-rank term makes ill-conditioned. Here we show that in order to decrease the number of CG iterations, it is beneficial to increase beyond what is needed for to be SPD.

Theorem 4.

In exact arithmetic, for , the eigenvalues of converge to with an error term that decays as .


By definition,


Since is nonsingular and has full row rank by assumption, the Searle identity  [41] with and gives

For , . The identity , which applies when , gives


A different proof of an equivalent result is given by Benzi and Liu [5, Lemma 3.1].

Corollary 2.

The eigenvalues of are well clustered for , and the iterative solve in Algorithm 1 line 16 converges quickly.

In the next section, we show that our choices of are large enough for the arguments to hold. This explains the rapid convergence of CG. We also show that our transformation from (3) to (5) and our scaling choice are stable on our test cases, by measuring the error for the original system (3).
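The clustering claim of Theorem 4 is easy to check numerically (a sketch with our own placeholder names): as gamma grows, the eigenvalues of gamma times the Schur complement collapse onto 1, with the spread shrinking like O(1/gamma).

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 30, 8
J = rng.standard_normal((m, n))            # full row rank (with probability 1)
H = np.diag(rng.uniform(1.0, 3.0, n))      # SPD (1,1) block for this sketch

spreads = []
for gamma in (1e2, 1e4, 1e6):
    # Schur complement S = J (H + gamma J'J)^{-1} J'
    S = J @ np.linalg.solve(H + gamma * J.T @ J, J.T)
    # eigenvalues of gamma * S should cluster at 1 as gamma grows
    spreads.append(np.max(np.abs(np.linalg.eigvalsh(gamma * S) - 1.0)))

assert spreads[0] > spreads[1] > spreads[2]    # spread decays with gamma
assert spreads[2] < 1e-3                       # tight cluster at gamma = 1e6
```

A tight cluster of eigenvalues is precisely the situation in which CG converges in a handful of iterations, consistent with Corollary 2.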

5 Practicality demonstration

We demonstrate the practicality of Algorithm 1 using five series of linear problems [26] generated by the Ipopt solver performing optimal power flow analysis on the power grid models summarized in Table 2. We compare our results with a direct solve via MA57’s LDL factorization [13].

Name nnz()
South Carolina grid 56K 411K
Illinois grid 4.64K 21.6K
Texas grid 55.7K 268K
US Western Interconnection grid 238K 1.11M
US Eastern Interconnection grid 1.64M 7.67M
Table 2: Characteristics of the five tested optimization problems, each generating sequences of linear systems (3) of dimension . Numbers are rounded to 3 digits. K and M signify and .

5.1 selection

Figure 2: Illinois grid: (a) CG iterations on Eq. 7 with varying . gives good convergence. The mean number of iterations for is . (b) Sorted eigenvalues of in (9) matrix 22, for . The eigenvalues are clustered close to .
Figure 3: Illinois grid (3), with varying in (6): (a) Backward error (BE) and (b) relative residual (RR). (a) gives results close to machine precision. (b) has .

We use Algorithm 1 with CG as the iterative method on line 16. A larger in may improve CG convergence but make more ill-conditioned and increase the error in the solution of (3). We therefore run some preliminary tests to find a suitable before selecting and testing other problems, and we require CG to solve accurately (stopping tolerance for the relative residual norm). We start with one of the smaller problems, the Illinois grid, to eliminate some values of and see if the condition number of for each matrix in the sequence is useful in determining the needed iterations or regularization. We run these tests with .

Figure 2(a) shows that for the Illinois grid sequence, values of give CG convergence in approximately 10 iterations for every matrix in the sequence. For the last matrix we found that the eigenvalues of in (7) are very well clustered and the condition number of is , as shown in Fig. 2(b). This guarantees and explains the rapid convergence, because for CG applied to a general SPD system $Ax=b$,

$$\|e_k\|_A \le 2\left(\frac{\sqrt{\kappa(A)}-1}{\sqrt{\kappa(A)}+1}\right)^{k} \|e_0\|_A,$$

where $e_k$ is the error in an approximate solution at iteration $k$ [16].

For the last few matrices, is the only value requiring . The final value of was . No important information was gleaned from . For all other values of , for the whole sequence. For a system $Ax=b$ and an approximate solution $\hat{x}$, we define the backward error as $\mathrm{BE} = \|b-A\hat{x}\| \,/\, (\|A\|\,\|\hat{x}\| + \|b\|)$ and the relative residual as $\mathrm{RR} = \|b-A\hat{x}\| / \|b\|$. As is common practice, we use an easily computed norm to estimate $\|A\|$, which is too expensive to calculate directly. The estimate always provides an upper bound for $\|A\|$, but in practice is quite close to the actual value. Note that MA57 always has a BE of order machine precision. Figure 3 shows the (a) BE and (b) RR for system (3) for varying . Results for the BE and RR of system (4) are not qualitatively different and are given in Appendix A. One conclusion is that increasing to reduce CG iterations can be costly for the accuracy of the solution of the full system. Based on the results of this section, in the range gives reasonable CG iterations and final accuracy. For other matrices, we present a selected in this range that produced the best results.

5.2 Results for larger matrices

Figure 4: South Carolina grid: for . For other values of the graph was similar. Except for the first few and last few matrices, , meaning the required regularization would make the solution too inaccurate. The value of is omitted on the log-scale.
Figure 5: US Western Interconnection grid with in (6): (a) CG iterations on Eq. 7. The mean number of iterations is . (b) BE and RR for the sequence. The BE for (3) is less than , except for matrix 4.
Figure 6: US Eastern Interconnection grid with in (6): (a) CG iterations on Eq. 7. The mean number of iterations is . (b) BE and RR for (3) and (4). The BE for (3) is less than .

Solving larger (and perhaps more poorly conditioned) problems brings about new computational challenges and limits the amount of time any particular task can take. We wish to set small enough to avoid over-regularizing the problem, and large enough to eliminate wasteful iterations and numerical issues. We want small enough to recognize that we have over-regularized (and should try a different method), but large enough to allow for reasonable regularization. In our numerical tests, we use and large enough so that can increase until is SPD. This informs the parameter selection for the next system.

Figure 4 shows that the South Carolina grid matrices as currently constructed cannot benefit from this method. They need to make SPD, which on a scaled problem means as much weight is given to the regularization as to the actual problem. Algorithm 1 succeeds on the other matrix sequences, at least for certain values of , and needs no regularization ().

For the US Western Interconnection grid, Fig. 5(a) shows a CG convergence graph and (b) shows several types of error. For the US Eastern Interconnection grid, Fig. 6(a) shows a CG convergence graph and (b) shows several types of error. Figures for the Texas grid are given in Appendix A, as they do not provide more qualitative insight. Convergence occurs for all matrix sequences in less than iterations on average. The BE for (3) is consistently less than and, with two exceptions in the US Western Interconnection grid, is close to machine precision. Results for the US Eastern Interconnection grid show that the method does not deteriorate with problem size, but rather there are some irregular matrices in the US Western Interconnection grid.

The results in this section suggest that in the range down to is reasonable for any . There is no clear choice for , but a plausible value would be . This way we are guaranteed that the regularization does not take over the problem, and the number of failed factorizations is limited to , which should be negligible in the total solution time for a series of problems.

5.3 Reordering

Figure 7: Illinois grid matrix 22: (a) Approximate minimum degree ordering of chol() is sparser than (b) Nested dissection ordering of chol(). Both orderings are calculated in Matlab.

The efficiency of sparse Cholesky factorization depends greatly on the row/column ordering defined by permutation . Figure 7 compares the sparsity of , corresponding to of matrix 22 in the Illinois grid sequence, obtained from two choices of : approximate minimum degree (AMD) and nested dissection. (The data in this section are generated via Matlab.) We see that AMD produces a sparser ( nonzeros vs. ).

Reverse Cuthill-McKee and no ordering gave and nonzeros, respectively. Recall that the sparsity structure is identical for matrices from the same family and similar for other matrix families. As expected, AMD was the sparsest ordering tested for the other matrix families as well. Thus, AMD is our reordering of choice. It is a one-time cost performed during the optimization problem setup.
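The effect of ordering on fill-in can be seen in the classic arrowhead example (a self-contained sketch with a small dense stand-in, not the paper's grid data): a matrix whose dense row comes first factors with complete fill-in, while ordering that row last produces no fill at all.

```python
import numpy as np

def chol_nnz(A, tol=1e-12):
    """Count nonzeros in the Cholesky factor of a dense SPD matrix."""
    L = np.linalg.cholesky(A)
    return int(np.count_nonzero(np.abs(L) > tol))

# Arrowhead matrix: dense first row/column, otherwise diagonal.
n = 10
A = np.eye(n) * n
A[0, :] = 1.0
A[:, 0] = 1.0
A[0, 0] = n

p = np.arange(n)[::-1]                 # permutation: dense row/column goes last
nnz_natural = chol_nnz(A)              # complete fill: n(n+1)/2 nonzeros
nnz_perm = chol_nnz(A[np.ix_(p, p)])   # no fill: diagonal + last row only
assert nnz_perm == 2 * n - 1
assert nnz_perm < nnz_natural
```

Fill-reducing orderings such as AMD generalize this idea: they permute rows and columns so that elimination creates as few new nonzeros as possible, and the permutation is computed once, symbolically, for the whole matrix sequence.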

6 Comparison with LDL

We compare our method with a direct solve of (3) using MA57’s LDL factorization [13] with default settings. All testing in this section is done using a prototype C/CUDA code on a single GPU device. Reference MA57 solutions were computed on a CPU. Further details on computational platforms used are given in Table 3.

Machine name Host processor Accelerator device Host compiler Device compiler
Newell IBM Power9 NVIDIA Volta 100 GCC 7.4.0 CUDA 10.2
Deception AMD EPYC 7502 NVIDIA Ampere 100 GCC 7.5.0 CUDA 11.1
Table 3: Accelerator devices and compilers used.

The factorization density is . We define analogously for the Cholesky factors of in (5), with and dimension . Note that gives the average number of nonzeros in the factorization per row or column. Table 4 shows that the Cholesky factor of is usually less dense than the LDL factor for (3), even though (3) is sparser than (5). Table 5 shows the solve times on the Newell computing cluster [28]. The main trend is that as the problem size increases, GPUs using cuSolver [1] increasingly outperform equivalent computations on a CPU.

Name         LDL^T of (3)                 Cholesky of (6)
             n        nnz      density    n        nnz      density
Illinois     4.64K    94.7K    20.4       2.28K    34.9K    15.3
Texas        55.7K    2.95M    52.9       25.9K    645K     24.9
Western US   238K     10.7M    44.8       116K     2.23M    19.2
Eastern US   1.64M    85.4M    52.1       794K     17.7M    22.3
Table 4: Dimensions, numbers of nonzeros, and factorization densities (nonzeros in the factor per row) for solving (3) directly with LDL^T and for solving (6) with Cholesky. Numbers are rounded to 3 digits; K and M signify 10^3 and 10^6. For all cases, the Cholesky factor is both smaller and sparser.
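As a quick arithmetic check of the density definition, the Illinois row of Table 4 can be reproduced directly (values taken from the table; the definition is nonzeros in the factor divided by matrix dimension):

```python
# Density = nnz in the factor / matrix dimension, using Table 4's
# Illinois row: 94.7K nonzeros over 4.64K rows for LDL^T of the full
# system, and 34.9K nonzeros over 2.28K rows for Cholesky of the
# reduced system.
ldl_density = 94.7e3 / 4.64e3    # LDL^T factor of (3)
chol_density = 34.9e3 / 2.28e3   # Cholesky factor of (6)
print(round(ldl_density, 1), round(chol_density, 1))  # 20.4 15.3
```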
Western US
Eastern US
Table 5: Average times (in seconds) for solving (3) directly on a CPU with LDL^T (via MA57 [13]), or for solving one linear system with supernodal Cholesky via CHOLMOD (CM) in SuiteSparse [9] or Cholesky via cuSolver [1] (CS), each on a CPU and on a GPU. Cholesky on a GPU is faster than LDL^T on a CPU by an increasingly large ratio. CM GPU does not work for small problems. All runs are on Newell [28].
Name MA57 Hybrid Direct-Iterative Solver
Analysis Factorization Total solves
Western US
Eastern US
Table 6: Average times (in seconds) for solving sequences of systems (3) directly on a CPU with LDL^T (via MA57 [13]) or for solving sequences of systems (7)–(8) on a GPU. The latter is split into analysis and factorization phases and multiple solves. Symbolic analysis is needed only once for the whole sequence; factorization happens once per matrix. The solve phase is the total time for Lines 15-17 of Algorithm 1, with a fixed CG tolerance on Line 16. The results show that our method, without optimization of the code and kernels, outperforms LDL^T on the largest series (the US Eastern Interconnection grid), both on a single matrix and by a larger factor on the whole series, because the cost of symbolic analysis can be amortized over the series. All runs are on Deception [28].

Supernodal Cholesky via CHOLMOD in SuiteSparse [9] does not perform well on GPUs for these test cases, but performs better than cuSolver on CPUs. This matches the literature showing that multifrontal and supernodal approaches are not well suited to very sparse and irregular systems, where the dense blocks become too small, leading to an unfavorable ratio of communication to computation [11, 6, 18]. The issue is exacerbated when supernodal or multifrontal approaches are used for fine-grained parallelization on GPUs [45]. Our method becomes more attractive as the ratio of LDL^T to Cholesky factorization time grows, because factorization is the most costly part of a direct solve and our method performs more (but smaller and less costly) system solves.
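The cost pattern behind this trade-off, one expensive factorization amortized over many cheap triangular solves, can be sketched with SciPy's dense Cholesky routines (the matrix and sizes here are illustrative only, not from the paper's test cases):

```python
# Factor once, then reuse the factor for many right-hand sides.
# cho_factor is the expensive O(n^3) step; each cho_solve is a pair of
# O(n^2) triangular solves.  No pivoting is needed because H is SPD.
import numpy as np
from scipy.linalg import cho_factor, cho_solve

rng = np.random.default_rng(0)
n = 200
A = rng.standard_normal((n, n))
H = A @ A.T + n * np.eye(n)          # symmetric positive definite

fac = cho_factor(H)                  # factor once
for _ in range(5):                   # many cheap solves against it
    b = rng.standard_normal(n)
    x = cho_solve(fac, b)
    assert np.linalg.norm(H @ x - b) < 1e-8 * np.linalg.norm(b)
```

The same pattern holds in the sparse GPU setting: the more solves share one factorization, the better the hybrid method amortizes its factorization cost.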

Table 6 compares a direct solve of (3) using MA57’s LDL^T factorization [13] with the full CG solve and direct solves on (7)–(8), broken down into symbolic analysis, numerical factorization, and CG on (7), all run on Deception [28].

When Cholesky is used, symbolic analysis is needed only for the first matrix in the sequence because pivoting is not a concern. As the problems grow larger, the solve phase becomes a smaller part of the total run time, and our method increasingly outperforms MA57. The run time is reduced substantially on one matrix from the US Eastern Interconnection grid, and by an even larger factor on the whole series, because the analysis cost can be amortized over the entire optimization problem. This motivates using our method for even larger problems.

Another advantage of our method is that it solves systems less than half the size of the original, though it must solve more of them. Notably, LDL^T factorization may require pivoting during the factorization, whereas Cholesky does not. With MA57, all our test cases required substantial permutations even with a lax pivot tolerance of 0.01 (and a tolerance as large as 0.5 may be required to keep the factorization stable). For our systems, then, LDL^T factorization entails considerable communication, which is a major barrier to GPU programming. On the other hand, the direct solve is generally more accurate by 2–3 orders of magnitude.
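The overall structure of the hybrid scheme can be sketched in a few lines of dense linear algebra: conjugate gradients on the Schur complement S = J H^{-1} J^T, where every product with S is computed through a Cholesky factorization of H that is done once and then reused. This is a generic Schur-complement reduction for illustration only; the blocks, signs, and regularization in the paper's systems (7)–(8) may differ, and H, J, r1, r2 below are made up.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def cg(matvec, b, tol=1e-12, maxit=1000):
    """Plain conjugate gradients for an SPD operator given as a matvec."""
    x = np.zeros_like(b)
    r = b.copy()
    p = r.copy()
    rs = r @ r
    for _ in range(maxit):
        Ap = matvec(p)
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) <= tol * np.linalg.norm(b):
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

rng = np.random.default_rng(1)
n, m = 120, 40
A = rng.standard_normal((n, n))
H = A @ A.T + n * np.eye(n)              # SPD (1,1) block
J = rng.standard_normal((m, n))          # full-rank constraint Jacobian
r1, r2 = rng.standard_normal(n), rng.standard_normal(m)

fac = cho_factor(H)                      # inner direct solver: factor H once
S = lambda v: J @ cho_solve(fac, J.T @ v)    # apply S without forming it

y = cg(S, J @ cho_solve(fac, r1) - r2)   # outer iterative (Schur) solve
x = cho_solve(fac, r1 - J.T @ y)         # recover the remaining unknowns

# The pair (x, y) solves the full indefinite KKT system.
K = np.block([[H, J.T], [J, np.zeros((m, m))]])
res = np.linalg.norm(K @ np.r_[x, y] - np.r_[r1, r2])
assert res < 1e-6
```

Every step touches only SPD matrices, so no pivoting (and hence no pivoting-induced communication) is needed anywhere, which is the property that makes the scheme GPU-friendly.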

7 Iterative vs. direct inner solves in Algorithm 1

We may ask whether forming the reduced matrix and its Cholesky factorization is worthwhile when it could be avoided by iterative inner solves. Systems (7)–(8) require two direct solves with the positive definite block and an iterative solve with the Schur complement, which includes “inner” solves with the block at every CG iteration. As we saw in Section 5, between 6 and 92 such inner solves are needed in our cases (and possibly more in other cases). Further, inner iterative solves would degrade the accuracy compared to inner direct solves, or would require an excessive number of iterations. Therefore, for inner iterative solves to be viable, the direct solve would have to cause substantial densification of the problem (i.e., the Cholesky factor would have to be very dense). Consider the number of multiplications when the matrix is applied as an operator versus the number of multiplications for solving systems with its Cholesky factor. These counts (generated in Matlab) and their ratio are given in Table 7. The ratio is always small and does not grow with problem size, meaning the factor remains very sparse and the factorization is efficient. As factorization dominates the total time of a direct solve with multiple right-hand sides, performing multiple inner iterative solves is not worthwhile.

Name         Mults (operator)   Mults (direct solve)   Ratio
Illinois     20.5K              34.9K                  1.70
Texas        249K               646K                   2.59
Western US   1.05M              2.23M                  2.11
Eastern US   7.23M              17.7M                  2.45
Table 7: Densification of the problem for cases where the direct-iterative method is viable. Numbers are rounded to 3 digits; K and M signify 10^3 and 10^6. “Mults (operator)” is the number of multiplications when the matrix is applied as an operator, and “Mults (direct solve)” is the number of multiplications for solving systems with its Cholesky factor. The ratio is only about 2 in all cases.

8 Summary

Following the approach of Golub and Greif [15