Linear programming (LP) is one of the most useful tools available to theoreticians and practitioners throughout science and engineering. It has been extensively used to solve various problems in a wide range of areas, including operations research, engineering, economics, or even in more abstract mathematical areas such as combinatorics. Also in machine learning and numerical optimization, LP appears in numerous settings, including -regularized SVMs , basis pursuit (BP) 
, sparse inverse covariance matrix estimation (SICE), the nonnegative matrix factorization (NMF) , MAP inference , etc. Not surprisingly, designing and analyzing LP algorithms is a topic of paramount importance in computer science and applied mathematics.
One of the most successful paradigms for solving LPs is the family of Interior Point Methods (IPMs), pioneered by Karmarkar in the mid 1980s . Path-following IPMs and, in particular, long-step path following IPMs, are among the most practical approaches for solving linear programs. Consider the standard form of the primal LP problem:
where , , and are the inputs, and
is the vector of the primal variables. The associated dual problem is
where and are the vectors of the dual and slack variables respectively. Triplets that uphold both (1) and (2) are called primal-dual solutions. Path-following IPMs typically converge towards a primal-dual solution by operating as follows: given the current iterate , they compute the Newton search direction and update the current iterate by following a step towards the search direction. To compute the search direction, one standard approach  involves solving the normal equations555Another widely used approach is to solve the augmented system . This approach is less relevant for this paper.:
Here, is a diagonal matrix, are diagonal matrices whose -th diagonal entries are equal to and , respectively, and is a vector whose exact definition is given in eqn. (22)666The superscript in eqn. (22) simply indicates iteration count and is omitted here for notational simplicity.. Given , computing and only involves matrix-vector products.
The core computational bottleneck in IPMs is the need to solve the linear system of eqn. (3) at each iteration. This leads to two key challenges: first, for high-dimensional matrices , solving the linear system is computationally prohibitive. Most implementations of IPMs use a direct solver; see Chapter 6 of . However, if is large and dense, direct solvers are computationally impractical. If is sparse, specialized direct solvers have been developed, but these do not apply to many LP problems, especially those arising in machine learning applications, due to irregular sparsity patterns. Second, an alternative to direct solvers is the use of iterative solvers, but the situation is further complicated since is typically ill-conditioned. Indeed, as IPM algorithms approach the optimal primal-dual solution, the diagonal matrix becomes ill-conditioned, which also results in the matrix becoming ill-conditioned. Additionally, using approximate solutions for the linear system of eqn. (3) causes certain invariants, which are crucial for guaranteeing the convergence of IPMs, to be violated; see Section 1.1 for details.
In this paper, we address the aforementioned challenges, for the special case where , i.e., the number of constraints is much smaller than the number of variables; see Section 5 for a generalization. This is a common setting in many applications of LP solvers. For example, in machine learning, -SVMs and basis pursuit problems often exhibit such structure when the number of available features () is larger than the number of objects (). Indeed, this setting has been of interest in recent work on LPs [16, 3, 29]. For simplicity of exposition, we also assume that the constraint matrix has full rank, equal to . First, we propose and analyze a preconditioned Conjugate Gradient (CG) iterative solver for the normal equations of eqn. (3), using matrix sketching constructions from the Randomized Linear Algebra (RLA) literature. We develop a preconditioner for using matrix sketching which allows us to prove strong convergence guarantees for the residual of CG solvers. Second, building upon the work of , we propose and analyze a provably accurate long-step infeasible IPM algorithm. The proposed IPM solves the normal equations using iterative solvers. In this paper, for brevity and clarity, we primarily focus our description and analysis on the CG iterative solver. We note that a non-trivial concern is that the use of iterative solvers and matrix sketching tools implies that the normal equations at each iteration will be solved only approximately. In our proposed IPM, we develop a novel way to correct for the error induced by the approximate solution in order to guarantee convergence. Importantly, this correction step is relatively computationally light, unlike a similar step proposed in . Third, we empirically show that our algorithm performs well in practice. We consider solving LPs that arise from -regularized SVMs and test them on a variety of synthetic and real-world data sets. Several extensions of our work are discussed in Section 5.
1.1 Our contributions
Our point of departure in this work is the introduction of preconditioned, iterative solvers for solving eqn. (3). Preconditioning is used to address the ill-conditioning of the matrix . Iterative solvers allow the computation of approximate solutions using only matrix-vector products while avoiding matrix inversion, Cholesky or LU factorizations, etc. A preconditioned formulation of eqn. (3) is:
where is the preconditioning matrix; should be easily invertible (see [2, 20] for background). An alternative yet equivalent formulation of eqn. (4), which is more amenable to theoretical analysis, is
where is a vector such that . Note that the matrix in the left-hand side of the above equation is always symmetric, which is not necessarily the case for eqn. (4). We do emphasize that one can use eqn. (4) in the actual implementation of the preconditioned solver; eqn. (5) is much more useful in theoretical analyses.
Recall that we focus on the special case where has
, i.e., it is a short-and-fat matrix. Our first contribution starts with the design and analysis of a preconditioner for the Conjugate Gradient solver that satisfies, with high probability,
for some error parameter . In the above, and
correspond to the smallest and largest singular value of the matrix in parentheses. The above condition says that the preconditioner effectively reduces the condition number ofto a constant. We note that the particular form of the lower and upper bounds in eqn. (6) was chosen to simplify our derivations. RLA matrix-sketching techniques allow us to construct preconditioners for all short-and-fat matrices that satisfy the above inequality and can be inverted efficiently. Such constructions go back to the work of ; see Section 3 for details on the construction of and its inverse. Importantly, given such a preconditioner, we then prove that the resulting CG iterative solver satisfies
Here is the approximate solution returned by the CG iterative solver after iterations. In words, the above inequality states that the residual achieved after iterations of the CG iterative solver drops exponentially fast. To the best of our knowledge, this result is not known in the CG literature: indeed, it is actually well-known that the residual error of CG may oscillate, even in cases where the energy norm of the solution error decreases monotonically. However, we prove that if the preconditioner is sufficiently good, i.e., it satisfies the constraint of eqn. (6), then the residual error decreases as well.
Our second contribution is the analysis of a novel variant of a long-step infeasible IPM algorithm proposed by . Recall that such algorithms can, in general, start with an initial point that is not necessarily feasible, but does need to satisfy some, more relaxed, constraints. Following the lines of [53, 36], let be the set of feasible and optimal solutions of the form for the primal and dual problems of eqns. (1) and (2) and assume that is not empty. Then, long-step infeasible IPMs can start with any initial point that satisfies and , for some feasible and optimal solution . In words, the starting primal and slack variables must be strictly positive and larger (element-wise) when compared to some feasible, optimal primal-dual solution. See Chapter 6 of  for a discussion regarding why such choices of starting points are also relevant to computational practice.
The flexibility of infeasible IPMs comes at a cost: long-step feasible IPMs converge in iterations, while long-step infeasible IPMs need iterations to converge [53, 36]. Here is the accuracy of the approximate LP solution returned by the IPM; see Algorithm 2 for the exact definition. Let
where and are the primal and dual residuals, respectively, and characterize how far the initial point is from being feasible. As long-step infeasible IPM algorithms iterate and update the primal and dual solutions, the residuals are updated as well. Let be the primal and dual residual at the -th iteration: it is well-known that the convergence analysis of infeasible long-step IPMs critically depends on lying on the line segment between 0 and . Unfortunately, using approximate solvers (such as the CG solver proposed above) for the normal equations violates this invariant.Aa simple solution to fix this problem by adding a perturbation vector to the current primal-dual solution that guarantees that the invariant is satisfied is proposed in . Again, we use RLA matrix sketching principles to propose an efficient construction for that provably satisfies the invariant. Next, we combine the above two primitives to prove that Algorithm 2 in Section 4 satisfies the following theorem.
Let be an accuracy parameter. Consider the long-step infeasible IPM Algorithm 2 (Section 4) that solves eqn. (5) using the CG solver of Algorithm 1 (Section 3). Assume that the CG iterative solver runs with accuracy parameter and iteration count . Then, with probability at least 0.9, the long-step infeasible IPM converges after iterations.
We note that the 0.9 success probability above is for simplicity of exposition and can be easily amplified using standard techniques. Also, at each iteration of our infeasible long-step IPM algorithm, the running time is . See Section 4 for a detailed discussion of the overall running time.
Our empirical evaluation demonstrates that our algorithm requires an order of magnitude much fewer inner CG iterations than a standard IPM using CG, while producing a comparably accurate solution (see Section 6). In practice, our empirical evaluation also indicates that using a CG solver with our sketching-based preconditioner does not increase the number of (outer) iterations of the infeasible IPM, compared to unpreconditioned CG or a direct linear solver. In particular, there are instances where our solver performs much better than unpreconditioned CG in terms of (outer) iteration count.
1.2 Comparison with Related Work
There is a large body of literature on solving LPs using IPMs. We only review literature that is immediately relevant to our work. Recall that we solve the normal equations inexactly at each iteration, and develop a way to correct for the error incurred. We also focus on IPMs that can use an sufficiently positive, infeasible initial point (see Section 1.1). We discuss below two papers that present related ideas.
The use of an approximate iterative solver for eqn. (3), followed by a correction step to “fix” the approximate solution was proposed in  (see our discussion in Section 1.1). We propose efficient, RLA-based approaches to precondition and solve eqn. (3), as well as a novel approach to correct for the approximation error in order to guarantee the convergence of the IPM algorithm. Specifically,  propose to solve eqn. (3) using the so-called maximum weight basis preconditioner . However, computing such a preconditioner needs access to a maximal linearly independent set of columns of in each iteration, which is costly, taking time in the worst-case. More importantly, while  was able to provide a bound on the condition number of the preconditioned matrix, that depends only on properties of , and is independent of , this bound might, in general, be very large. In contrast, our bound is a constant and it does not depend on properties of or its dimension. In addition,  assumed a bound on the two-norm of the residual of the preconditioned system, but it is unclear how their preconditioner guarantees such a bound. Similar concerns exist for the construction of the correction vector proposed by , which our work alleviates.
The line of research in the Theoretical Computer Science literature that is closest to our work is , who presented an IPM that uses an approximate solver in each iteration. However, their accuracy guarantee is in terms of the final objective value which is different from ours. More importantly,  focuses on short-step, feasible IPMs, whereas ours is long-step and does not require a feasible starting point. Finally, the approximate solver proposed by  works only for the special case of input matrices that correspond to graph Laplacians, following the lines of [44, 45].
We also note that in the Theoretical Computer Science literature, [24, 25, 26, 27, 28, 10] proposed and analyzed theoretically ground-breaking algorithms for LPs based on novel tools such as the so-called inverse maintenance for accelerating the linear system solvers in IPMs. However, all these endeavors are primarily focused on the theoretically fast but practically inefficient short-step feasible IPMs. In contrast, our work is focused on infeasible long-step IPMs, known to work efficiently in practice. Very recently,  proposed another fast, short-step, feasible IPM for solving tall and dense LPs. The output of their algorithm does not satisfy the linear constraints exactly (similar to ) and the final convergence guarantee is somewhat different from our work.
Another relevant line of research is the work of , which proposed solving eqn. (3) using preconditioned Krylov subspace methods, including variants of generalized minimum residual (GMRES) or CG methods. Indeed,  conducted extensive numerical experiments on LP problems taken from standard benchmark libraries, but did not provide any theoretical guarantees.
From a matrix-sketching perspective, our work was partially motivated by 
, which presented an iterative, sketching-based algorithm to solve under-constrained ridge regression problems, but did not address how to make use of such approaches in an IPM-based framework, as we do here. Recent papers proposed the so-calledNewton sketch [40, 50] to construct an approximate Hessian matrix for more general convex objective functions of which LP is a special case. Nevertheless, these randomized second-order methods are significantly faster than the conventional approach only when the data matrix is over-constrained, i.e. . It is unclear whether the approach of [40, 50] is faster than IPMs when the optimization problem to be solved is linear. A probabilistic algorithm to solve LP approximately in a random projection-based reduced feature-space was proposed in . A possible drawback of this paper is that the approximate solution is infeasible with respect to the original region.
2 Notation and Background
denote matrices and denote vectors. For vector , denotes its Euclidean norm; for a matrix denotes its spectral norm and denotes its Frobenius norm. We use to denote a null vector or null matrix, dependent upon context, and to denote the all-ones vector. For any matrix with of rank
a thin Singular Value Decomposition (SVD) is a product, with (the matrix of the left singular vectors), the matrix of the top- right singular vectors), and a diagonal matrix whose entries are equal to the singular values of . We use to denote the -th singular value of the matrix in parentheses.
For any two symmetric positive semidefinite (resp. positive definite) matrices and of appropriate dimensions, () denotes that is positive semidefinite (resp. positive definite).
We now briefly discuss a result on matrix sketching [11, 12] that is particularly useful in our theoretical analyses. In our parlance,  proved that, for any matrix , there exists a sketching matrix such that
holds with probability at least for any . Here is a (constant) accuracy parameter. Ignoring constant terms, ; has non-zero entries per row; and the product can be computed in time .
3 Conjugate Gradient Solver
Notice that we only need to compute in order to use it to solve eqn. (5). Towards that end, we first compute the sketched matrix . Then, we compute the SVD of the matrix : let be the matrix of its left singular vectors and let be the matrix of its singular values. Notice that the left (and right) singular vectors of are equal to and its singular values are equal to . Therefore, .
In the above we used and . The running time needed to compute the sketch is equal to (ignoring constant factors) . Note that . The cost of computing the SVD of (and therefore ) is . Overall, computing can be done in time
If the sketching matrix satisfies eqn. (12), then, for all ,
Consider the condition of eqn. (12):
), note that all the eigenvalues oflie between and and thus . Therefore, , as is non-singular and we know that the rank of a matrix remains unaltered by pre- or post-multiplying it by a non-singular matrix. So, we have ; in words has full rank. Therefore, all the diagonal entries of are positive and .
The above lemma directly implies eqn. (6). We now proceed to show that the above construction for , when combined with the conjugate gradient solver to solve eqn. (5), indeed satisfies eqn. (7)777See Chapter 9 of  for a detailed overview of CG.. We do note that in prior work most of the convergence guarantees for CG focus on the error of the approximate solution. However, in our work, we are interested in the convergence of the residuals and it is known that even if the energy norm of the error of the approximate solution decreases monotonically, the norms of the CG residuals may oscillate. Interestingly, we can combine a result on the residuals of CG from  with Lemma 2 to prove that in our setting the norms of the CG residuals also decrease monotonically.
Let be the residual at the -th iteration of the CG algorithm:
Lemma 3 (Theorem 8 of ).
Let and be the residuals obtained by the CG solver at steps and . Then,
where is the condition number of .
From Lemma 2, we get
where the last inequality follows from . Applying eqn. (19) recursively, we get
which proves the condition of eqn. (7).
We remark that one can consider using MINRES  instead of CG. Our results hinges on bounding the two-norm of the residual. MINRES finds, at each iteration, the optimal vector with respect the two-norm of the residual inside the same Krylov subspace of CG for the corresponding iteration. Thus, the bound we prove for CG applies to MINRES as well.
4 The Infeasible IPM algorithm
In order to avoid spurious solutions, primal-dual path-following IPMs bias the search direction towards the central path and restrict the iterates to a neighborhood of the central path. This search is controlled by the centering parameter . At each iteration, given the current solution , a standard infeasible IPM obtains the search direction by solving the following system of linear equations:
Here and are computed given the current iterate and ; we skip the indices on and for notational simplicity. After solving the above system, the infeasible IPM Algorithm 2 proceeds by computing a step-size to return:
Recall that is a vector with and (the primal and dual residuals). We also use the duality measure and the vector
Given from eqn. (20a), and are easy to compute from eqns. (20b) and (20c), as they only involve matrix-vector products. However, since we use Algorithm 1 to solve eqn. (20a) approximately using the sketching-based preconditioned CG solver, the primal and dual residuals do not lie on the line segment between and . This invalidates known proofs of convergence for infeasible IPMs.
For notational simplicity, we now drop the dependency of vectors and scalars on the iteration counter . Let be the approximate solution to eqn. (20a). In order to account for the loss of accuracy due to the approximate solver, we compute as follows:
Here is a perturbation vector that needs to exactly satisfy the following invariant at each iteration of the infeasible IPM:
We note that the computation of is still done using, essentially, eqn. (20b), namely
Construction of . There are many choices for satisfying eqn. (24). To prove convergence, it is desirable for to have a small norm, hence a general choice is
which involves the computation of the pseudoinverse , which is expensive, taking time . Instead, we propose to construct using the sketching matrix of Section 2. More precisely, we construct the perturbation vector
The following lemma proves that the proposed satisfies eqn. (24).
Let be the thin SVD representation of . We use the exact same as discussed in Section 3. Therefore, eqn. (12) holds with probability and it directly follows from the proof of Lemma 2 that . Recall that has full row-rank and thus . Therefore, taking , we get
where the second equality follows from . ∎
We emphasize here that we use the same exact sketching matrix to form the preconditioner used in the CG algorithm of Section 3 as well as the vector in eqn.(26). This allows us to sketch only once, thus saving time in practice. Next, we present a bound for the two-norm of the perturbation vector of eqn. (26).
With probability at least , our perturbation vector in Lemma 4 satisfies
Recall that . Also, and are (respectively) the matrices of the left singular vectors and the singular values of . Now, let be the right singular vector of . Therefore, is the thin SVD representation of . Also, from Lemma 2, we know that has full rank. Therefore, . Next, we bound :
In the above we used . Using the SVD of and , we get . Now, note that
is an orthogonal matrix and. Therefore, combining with eqn. (28) yields
The first equality follows from the unitary invariance property of the spectral norm and the second inequality follows from the sub-multiplicativity of the spectral norm and . Our construction for implies that eqn. (10) holds for any matrix and, in particular, for . Eqn. (10) implies that
holds with probability at least . Applying Weyl’s inequality on the left hand side of the eqn. (30), we get
Using and , we get888The constant 3 in eqn. (32) could be slightly improved to ; we chose to keep the suboptimal constant to make our results compatible with the work in  and avoid reproving theorems from . The better constant does not result in any significant improvements in the number of iterations of the Algorithm 2.
The above result is particularly useful in proving the convergence of Algorithm 2. More precisely, combining a result from  with our preconditioner , we can prove that . This bound allows us to prove that if we run Algorithm 1 for iterations, then
We are now ready to present the infeasible IPM algorithm. We will need the following definition for the neighborhood
Here and we note that the duality measure steadily reduces at each iteration.
Running time. We start by discussing the running time to compute . As discussed in Section 3, can be computed in time. Now, as has non-zero entries per row, pre-multiplying by takes time (assuming ). Since and are diagonal matrices, computing takes time, which is asymptotically the same as computing (see eqn. (13)).
We now discuss the overall running time of Algorithm 2. At each iteration, with failure probability , the preconditioner and the vector can be computed in time. In addition, for iterations of Algorithm 1, all the matrix-vector products in the CG solver can be computed in time. Therefore, the computational time for steps (a)-(d) is given by . Finally, taking a union bound over all iterations with (ignoring constant factors), Algorithm 2 converges with probability at least 0.9. The running time at each iteration is given by .
We briefly discuss extensions of our work. First, there is nothing special about using a CG solver for solving eqn. (5). We analyze two more solvers that could replace the proposed CG solver without any loss in accuracy or any increase in the number of iterations for the long-step infeasible IPM Algorithm 2 of Section 4. In Appendix A, we analyze the performance of the preconditioned Richardson Iteration and in Appendix B, we analyze the performance of the preconditioned Steepest Descent. In both cases, if the respective preconditioned solver (with the preconditioner of Section 3) runs for steps, Theorem 1 still holds, with small differences in the constant terms. While preconditioned Richardson iteration and preconditioned Steepest Descent are interesting from a theoretical perspective, they are not particularly practical. In future work, we will also consider the preconditioned Chebyshev semi-iterative method, which offers practical advantages compared to PCG in parallel settings.
Second, recall that our approach focused on full rank input matrices with . Our overall approach still works if in any matrix that is low-rank, e.g., . In that case, using the thin SVD of , we can rewrite the linear constraints as follows , where and are the matrices of left and right singular vecors of respectively; is the diagonal matrix with the non-zero singular values of as its diagonal elements. The LP of eqn. (1) can be restated as