# Solving Linear Programs in the Current Matrix Multiplication Time

This paper shows how to solve linear programs of the form $\min_{Ax=b,\, x\geq 0} c^\top x$ with $n$ variables in time $O^*\big((n^{\omega} + n^{2.5-\alpha/2} + n^{2+1/6}) \log(n/\delta)\big)$, where $\omega$ is the exponent of matrix multiplication, $\alpha$ is the dual exponent of matrix multiplication, and $\delta$ is the relative accuracy. For the current values of $\omega \sim 2.37$ and $\alpha \sim 0.31$, our algorithm takes $O^*(n^{\omega} \log(n/\delta))$ time. When $\omega = 2$, our algorithm takes $O^*(n^{2+1/6} \log(n/\delta))$ time. Our algorithm utilizes several new concepts that we believe may be of independent interest:

∙ We define a stochastic central path method.

∙ We show how to maintain a projection matrix $\sqrt{W}A^\top(AWA^\top)^{-1}A\sqrt{W}$ in sub-quadratic time under $\ell_2$ multiplicative changes in the diagonal matrix $W$.


## 1 Introduction

Linear programming is one of the key problems in computer science. In both theory and practice, many problems can be reformulated as linear programs to take advantage of fast algorithms. For an arbitrary linear program with $n$ variables and $d$ constraints (throughout this paper, we assume there are no redundant constraints and hence $n \geq d$; note that papers in different communities use different symbols to denote the number of variables and constraints in a linear program), the fastest algorithm takes $O^*\big(\sqrt{d}\,(\mathrm{nnz}(A) + d^2)\big)$ time, where $\mathrm{nnz}(A)$ is the number of non-zeros in $A$ [LS14, LS15]. We use $O^*$ to hide all $n^{o(1)}$ and $\log(1/\delta)$ factors in the introduction.

For the generic case we focus on in this paper, the current fastest runtime is dominated by $O^*(n^{2.5})$. This runtime has not been improved since the result by Vaidya in 1989 [Vai87, Vai89b]. The $n^{2.5}$ bound originates from two factors: the cost per iteration $n^2$ and the number of iterations $\sqrt{n}$. The $n^2$ cost per iteration looks optimal because this is the cost to compute $Ax$ for a dense $A$. Therefore, many efforts [Kar84, Ren88, NN89, Vai89a, LS14] have focused on decreasing the number of iterations while maintaining the cost per iteration. For many important linear programs (and convex programs), the number of iterations has been decreased, including maximum flow [Mad13, Mad16], minimum cost flow [CMSV17], geometric median [CLM16], matrix scaling and balancing [CMTV17], and $\ell_p$ regression [BCLL18]. Unfortunately, beating $\sqrt{n}$ iterations (or $\sqrt{d}$ when $d \ll n$) for the general case remains one of the biggest open problems in optimization.

Avoiding this open problem, this paper develops a stochastic central path method that has a runtime of $O^*(n^{\omega} + n^{2.5-\alpha/2} + n^{2+1/6})$, where $\omega$ is the exponent of matrix multiplication and $\alpha$ is the dual exponent of matrix multiplication (the supremum among all $a \geq 0$ such that it takes $n^{2+o(1)}$ time to multiply an $n \times n$ matrix by an $n \times n^{a}$ matrix). For the current values of $\omega \sim 2.37$ and $\alpha \sim 0.31$, the runtime is simply $O^*(n^{\omega})$. This achieves the natural barrier for solving linear programs because solving a linear system is a special case of solving a linear program, and the currently fastest way to solve general linear systems involves matrix multiplication. Although the exact approach used in [CW87, Wil12, DS13, LG14] cannot give a bound on $\omega$ better than $2.3725$ [AFLG15], and all known approaches cannot achieve $\omega = 2$ [AW18], it remains possible that $\omega$ can be further improved. Therefore, we believe improving the additive $n^{2+1/6}$ term remains an interesting open problem.

Our method is a stochastic version of the short step central path method. This short step method takes $O(\sqrt{n}\log(1/\delta))$ steps, and each step decreases $x_i s_i$ by a $1 - 1/\sqrt{n}$ factor for all $i$, where $x$ is the primal variable and $s$ is the dual slack variable [Ren88]. This results in $n^{1.5}$ coordinate updates and $n^{2.5}$ total time. Our method takes the same number of steps but only updates $\widetilde{O}(\sqrt{n})$ coordinates each step. Therefore, we only update $\widetilde{O}(n)$ coordinates in total, which is nearly optimal.

Our framework is efficient enough to take a much smaller step while maintaining the same running time. For the current value of $\omega$, we show how to obtain the same runtime of $O^*(n^{\omega})$ by taking more steps with correspondingly fewer coordinate updates per step, because the complexity of each step decreases proportionally when the step size decreases. Beyond the cost per iteration, we remark that our algorithm is one of the very few central path algorithms [PRT02, Mad13, Mad16] that does not maintain $x_i s_i$ close to some ideal vector in $\ell_2$ norm. We are hopeful that our stochastic method and our proof will be useful for future research on interior point methods. In particular, it would be interesting to see how this can be combined with techniques in [Cla95, LS14] to get a faster algorithm for linear programs with $d \ll n$.

Besides the applications to linear programs, some of our techniques are probably useful for studying other important problems in convex optimization. In particular, our framework should be naturally extendable to a larger class of convex programs.

## 2 Results and Techniques

###### Theorem 2.1 (Main result).

Consider a linear program $\min_{Ax=b,\, x\geq 0} c^\top x$ with no redundant constraints. Assume that the polytope has diameter $R$ in $\ell_1$ norm, namely, for any $x \geq 0$ with $Ax = b$, we have $\|x\|_1 \leq R$.

Then, for any $0 < \delta \leq 1$, our algorithm outputs $x \geq 0$ such that

$$c^\top x \leq \min_{Ax=b,\, x\geq 0} c^\top x + \delta \cdot \|c\|_\infty R \quad \text{and} \quad \|Ax - b\|_1 \leq \delta \cdot \Big(R \sum_{i,j} |A_{i,j}| + \|b\|_1\Big)$$

in expected time

$$\left(n^{\omega+o(1)} + n^{2.5-\alpha/2+o(1)} + n^{2+1/6+o(1)}\right) \cdot \log\Big(\frac{n}{\delta}\Big)$$

where $\omega$ is the exponent of matrix multiplication and $\alpha$ is the dual exponent of matrix multiplication.

For the current values of $\omega \sim 2.37$ and $\alpha \sim 0.31$, the expected time is simply $n^{\omega+o(1)} \log(n/\delta)$.

###### Remark 2.2.

See [Ren88] and [LS13, Sec E, F] for a discussion of converting an approximate solution to an exact solution. For integral $A$, $b$, $c$, it suffices to pick $\delta$ exponentially small in the bit complexity $L$ to get an exact solution, where $L$ depends on the largest absolute value of the determinant of a square sub-matrix of $A$. For many combinatorial problems, $L = O(\log n)$.

If $n^{\omega}$ with $\omega \sim 2.37$ denotes the current cost of matrix multiplication and inversion, our runtime is simply $n^{\omega+o(1)}\log(n/\delta)$. The $\log(n/\delta)$ factor comes from the iteration count, and the $n^{o(1)}$ factor comes from the doubling trick in the projection maintenance section. We leave the problem of obtaining $n^{\omega}\log(n/\delta)$ time as an open problem.

### 2.1 Central Path Method

Our algorithm relies on two new ingredients: the stochastic central path and projection maintenance. The central path method considers the linear program

$$\min_{Ax=b,\, x\geq 0} c^\top x \quad \text{(primal)} \qquad \text{and} \qquad \max_{A^\top y \leq c} b^\top y \quad \text{(dual)}$$

with $A \in \mathbb{R}^{d \times n}$. Any solution of the linear program satisfies the following optimality conditions:

$$x_i s_i = 0 \text{ for all } i, \qquad Ax = b, \qquad A^\top y + s = c, \qquad x_i, s_i \geq 0 \text{ for all } i.$$

We call $(x, y, s)$ feasible if it satisfies the last three conditions above. For any feasible $(x, y, s)$, the duality gap is $\sum_i x_i s_i$. The central path method finds a solution of the linear program by following the central path, which uniformly decreases the duality gap. The central path $(x_t, y_t, s_t)$ is a path parameterized by $t > 0$ and defined by

$$x_{t,i}\, s_{t,i} = t \text{ for all } i, \qquad Ax_t = b, \qquad A^\top y_t + s_t = c, \qquad x_{t,i}, s_{t,i} \geq 0 \text{ for all } i.$$
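On the central path, the duality gap has a simple closed form; a one-line check using only the feasibility conditions above shows

```latex
c^\top x_t - b^\top y_t
  = c^\top x_t - (A x_t)^\top y_t
  = (c - A^\top y_t)^\top x_t
  = s_t^\top x_t
  = \sum_{i=1}^n x_{t,i}\, s_{t,i}
  = n t,
```

so driving $t \to 0$ drives the duality gap to zero uniformly.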

It is known [YTM94] how to transform linear programs by adding many variables and constraints so that:

• The optimal solution remains the same.

• The central path at $t = 1$ is near $(x, y, s) = (\mathbf{1}, \mathbf{0}, \mathbf{1})$, where $\mathbf{1}$ and $\mathbf{0}$ are the all-ones and all-zeros vectors with appropriate lengths.

• It is easy to convert an approximate solution of the transformed program to one of the original program.

For completeness, a theoretical version of such a result is included in Lemma A.6. This result shows that it suffices to move $(x_t, y_t, s_t)$ gradually from $t = 1$ down to a small enough $t$.

#### 2.1.1 Short Step Central Path Method

The short step central path method maintains $(x, y, s)$ with $\mu_i \overset{\mathrm{def}}{=} x_i s_i$ such that

$$\sum_i (\mu_i - t)^2 = O(t^2) \quad \text{for some scalar } t > 0.$$

To move from $\mu$ to $\mu + \delta_\mu$ approximately, we approximate the term $(x_i + \delta_{x,i})(s_i + \delta_{s,i})$ by $x_i s_i + x_i \delta_{s,i} + s_i \delta_{x,i}$ and obtain the following system:

$$X\delta_s + S\delta_x = \delta_\mu, \qquad A\delta_x = 0, \qquad A^\top \delta_y + \delta_s = 0, \qquad (1)$$

where $X = \mathrm{diag}(x)$ and $S = \mathrm{diag}(s)$. This equation is the linear approximation of the original goal (moving from $\mu$ to $\mu + \delta_\mu$), and the step is explicitly given by the formula

$$\delta_x = \frac{X}{\sqrt{XS}}(I - P)\frac{1}{\sqrt{XS}}\delta_\mu \quad \text{and} \quad \delta_s = \frac{S}{\sqrt{XS}}\, P\, \frac{1}{\sqrt{XS}}\delta_\mu \qquad (2)$$

where $P = \sqrt{\frac{X}{S}}\, A^\top \big(A \frac{X}{S} A^\top\big)^{-1} A \sqrt{\frac{X}{S}}$ is an orthogonal projection and $X$, $S$, $\sqrt{XS}$, etc. in the formulas are the diagonal matrices of the corresponding vectors.
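As a sanity check, the closed-form step (2) can be verified numerically. The following sketch (the sizes and random instance are illustrative, not from the paper) builds the projection with $W = XS^{-1}$ and checks that the step satisfies system (1):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 8
A = rng.standard_normal((d, n))
x = rng.uniform(1.0, 2.0, n)            # primal iterate, x > 0
s = rng.uniform(1.0, 2.0, n)            # dual slack, s > 0
delta_mu = rng.standard_normal(n) * 0.01

w = x / s                               # diagonal of W = X S^{-1}
sqrt_w = np.sqrt(w)
# P = sqrt(W) A^T (A W A^T)^{-1} A sqrt(W)
P = sqrt_w[:, None] * (A.T @ np.linalg.solve(A @ (w[:, None] * A.T), A)) * sqrt_w[None, :]

v = delta_mu / np.sqrt(x * s)
delta_x = x / np.sqrt(x * s) * ((np.eye(n) - P) @ v)
delta_s = s / np.sqrt(x * s) * (P @ v)

assert np.allclose(P @ P, P) and np.allclose(P, P.T)      # orthogonal projection
assert np.allclose(A @ delta_x, 0, atol=1e-10)            # A delta_x = 0
assert np.allclose(x * delta_s + s * delta_x, delta_mu)   # X ds + S dx = delta_mu
```

The remaining equation of (1) holds because $\delta_s$ lies in the range of $A^\top$ by construction.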

A standard choice of $\delta_\mu$ is $-\frac{t}{\sqrt{n}}\mathbf{1}$, and this requires $O(\sqrt{n}\log(1/\delta))$ iterations to converge. Combining this with the inverse maintenance technique [Vai87] gives a total runtime of $n^{2.5}$. We remark that $\sum_i(\mu_i - t)^2 = O(t^2)$ is an invariant of the algorithm and that progress is measured by $t$, because the duality gap is roughly $nt$.

#### 2.1.2 Stochastic Central Path Method

This part discusses how to modify the short step central path to decrease the cost per iteration to sub-quadratic in $n$. Since our goal is to implement a central path method in sub-quadratic time per iteration, we do not even have the budget to compute the step exactly every iteration. Therefore, instead of maintaining the matrix inverse as in previous papers, we study the problem of maintaining the projection matrix $P$, motivated by the formulas for $\delta_x$ and $\delta_s$ in (2).

However, even if the projection matrix is given explicitly for free, it is difficult to multiply the dense projection matrix with a dense vector in time $o(n^2)$. To avoid moving along a dense $\delta_\mu$, we move along a sparse direction $\widetilde{\delta}_\mu$ defined by

$$\widetilde{\delta}_{\mu,i} = \begin{cases} \delta_{\mu,i}/p_i, & \text{with probability } p_i \overset{\mathrm{def}}{=} k \cdot \left(\dfrac{\delta_{\mu,i}^2}{\sum_l \delta_{\mu,l}^2} + \dfrac{1}{n}\right); \\ 0, & \text{otherwise.} \end{cases} \qquad (3)$$

The sparse direction is defined so that we are moving in the same direction in expectation ($\mathbb{E}[\widetilde{\delta}_\mu] = \delta_\mu$) and so that the direction has as small a variance as possible. If the projection matrix is given explicitly, we can apply the projection matrix to the $k$-sparse vector $\widetilde{\delta}_\mu$ in time $O(nk)$. This paper picks $k \sim \sqrt{n}$, so the total cost of projection-vector multiplications over all iterations is about $n^2$.
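A minimal sketch of the sampling rule (3), with illustrative sizes $n$ and $k$, and with the probability capped at $1$ (an implementation detail not spelled out above). It checks unbiasedness and sparsity empirically:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 400
k = int(np.sqrt(n))                     # k ~ sqrt(n), as in the text
delta = rng.standard_normal(n)

# p_i = k * (delta_i^2 / ||delta||_2^2 + 1/n), capped at 1 so it is a probability
p = np.minimum(1.0, k * (delta**2 / np.sum(delta**2) + 1.0 / n))

def sparse_step(rng):
    # keep coordinate i with probability p_i, rescale by 1/p_i
    keep = rng.random(n) < p
    return np.where(keep, delta / p, 0.0)

# Empirical unbiasedness: the average over many draws approaches delta.
avg = np.mean([sparse_step(rng) for _ in range(20000)], axis=0)
assert np.max(np.abs(avg - delta)) < 0.5

# Sparsity: the expected number of nonzeros is at most 2k.
assert np.mean([np.count_nonzero(sparse_step(rng)) for _ in range(100)]) < 3 * k
```

Rescaling by $1/p_i$ is what makes the estimator unbiased; the $1/n$ term in $p_i$ keeps every probability bounded away from zero, which controls the variance of individual coordinates.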

During the whole algorithm, we maintain a projection matrix

 ¯¯¯¯P= ⎷¯¯¯¯¯X¯¯¯¯SA⊤(A¯¯¯¯¯X¯¯¯¯SA⊤)−1A ⎷¯¯¯¯¯X¯¯¯¯S

for vectors $\overline{x}$ and $\overline{s}$ such that $\overline{x}_i \approx x_i$ and $\overline{s}_i \approx s_i$ for all $i$. Since we maintain the projection at a nearby point $(\overline{x}, \overline{s})$, our stochastic steps $\widetilde{\delta}_x$, $\widetilde{\delta}_y$ and $\widetilde{\delta}_s$ are defined by

$$\overline{X}\widetilde{\delta}_s + \overline{S}\widetilde{\delta}_x = \widetilde{\delta}_\mu, \qquad A\widetilde{\delta}_x = 0, \qquad A^\top \widetilde{\delta}_y + \widetilde{\delta}_s = 0, \qquad (4)$$

which is different from (1) on both sides of the first equation. Similar to (2), Lemma 4.2 shows that

$$\widetilde{\delta}_x = \frac{\overline{X}}{\sqrt{\overline{X}\overline{S}}}(I - \overline{P})\frac{1}{\sqrt{\overline{X}\overline{S}}}\widetilde{\delta}_\mu \quad \text{and} \quad \widetilde{\delta}_s = \frac{\overline{S}}{\sqrt{\overline{X}\overline{S}}}\,\overline{P}\,\frac{1}{\sqrt{\overline{X}\overline{S}}}\widetilde{\delta}_\mu. \qquad (5)$$

The previously fastest algorithm involves maintaining the matrix inverse using subspace embedding techniques [Sar06, CW13, NN13] and leverage score sampling [SS11]. In this paper, we maintain the projection directly.

The key departure from the central path in our method is that we can only maintain

$$0.9\,t \leq \mu_i = x_i s_i \leq 1.1\,t \quad \text{for some } t > 0,$$

instead of keeping $\mu$ close to $t \cdot \mathbf{1}$ in $\ell_2$ norm. We will further explain the proof in Section 4.1.

### 2.2 Projection Maintenance

The projection matrix we maintain is of the form $\sqrt{W}A^\top(AWA^\top)^{-1}A\sqrt{W}$ where $W = \frac{X}{S}$. For intuition, we only explain how to maintain the matrix $M_w \overset{\mathrm{def}}{=} A^\top(AWA^\top)^{-1}A$ for the short step central path here. In this case, each step changes $w$ by a bounded amount in the $\ell_2$ multiplicative sense, namely $\sum_i (w_i^{\mathrm{new}}/w_i - 1)^2 = O(1)$.

If the change of $w$ is spread uniformly across all coordinates, then $w_i^{\mathrm{new}}/w_i = 1 \pm O(1/\sqrt{n})$ for all $i$. Since it takes $\sqrt{n}$ steps to change all coordinates by a constant factor, and we only need to maintain the matrix at some $\overline{w}$ with $\overline{w}_i \approx w_i$ for all $i$, we can update the matrix every $\sqrt{n}$ steps. Hence, the average cost of maintaining the projection matrix is $n^{\omega}/\sqrt{n} = n^{\omega - 1/2}$ per step, which is exactly what we desired.

For the other extreme case, where the “adversary” puts all of his budget on a few coordinates, only a few coordinates are changed by a constant factor per iteration. In this case, instead of updating the whole matrix every step, we can apply the updates online via the Woodbury matrix identity.

###### Fact 2.3.

The Woodbury matrix identity is

$$(M + UCV)^{-1} = M^{-1} - M^{-1}U\left(C^{-1} + VM^{-1}U\right)^{-1}VM^{-1}.$$

Let $S$ denote the set of coordinates that changed by more than a constant factor, and let $r = |S|$. Using the identity above, we have that

$$M_{w^{\mathrm{new}}} = M_w - (M_w)_S\left(\Delta_{S,S}^{-1} + (M_w)_{S,S}\right)^{-1}\left((M_w)_S\right)^\top \qquad (6)$$

where $\Delta = W^{\mathrm{new}} - W$, $(M_w)_S$ is the $n \times r$ matrix formed by the columns of $M_w$ indexed by $S$, and $\Delta_{S,S}$ and $(M_w)_{S,S}$ are the $r \times r$ matrices formed by the rows and columns indexed by $S$ of $\Delta$ and $M_w$.

As long as $\overline{w}_i \approx w_i$ for all but not too many coordinates, (6) can be applied online efficiently. In the other extreme case, we can use (6) to update the matrix, and the cost is dominated by multiplying an $n \times r$ matrix with an $r \times n$ matrix.
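The update (6) can be checked numerically. The sketch below (sizes and the set `S` are illustrative, and it assumes $M_w = A^\top(AWA^\top)^{-1}A$, consistent with the shape of (6)) compares the low-rank update against recomputing from scratch:

```python
import numpy as np

rng = np.random.default_rng(2)
d, n, S = 4, 12, [1, 5, 7]
A = rng.standard_normal((d, n))
w = rng.uniform(1.0, 2.0, n)

def M(w):
    # M_w = A^T (A W A^T)^{-1} A
    return A.T @ np.linalg.solve(A @ (w[:, None] * A.T), A)

w_new = w.copy()
w_new[S] = rng.uniform(2.0, 4.0, len(S))   # change only coordinates in S
delta_S = w_new[S] - w[S]

Mw = M(w)
# (6): M_new = M_w - (M_w)_S (Delta_{S,S}^{-1} + (M_w)_{S,S})^{-1} ((M_w)_S)^T
update = Mw[:, S] @ np.linalg.solve(np.diag(1.0 / delta_S) + Mw[np.ix_(S, S)], Mw[:, S].T)
assert np.allclose(M(w_new), Mw - update)
```

The update touches only an $n \times r$ slice of $M_w$ and an $r \times r$ solve, which is why its cost is dominated by the $n \times r$ times $r \times n$ multiplication mentioned above.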

###### Theorem 2.4 (Rectangular matrix multiplication, [LGU18]).

Let the dual exponent of matrix multiplication $\alpha$ be the supremum among all $a \geq 0$ such that it takes $n^{2+o(1)}$ time to multiply an $n \times n$ matrix by an $n \times n^{a}$ matrix.

Then, for any $r \leq n$, multiplying an $n \times r$ matrix with an $r \times n$ matrix, or an $n \times n$ matrix with an $n \times r$ matrix, takes time

$$n^{2+o(1)} + r^{\frac{\omega-2}{1-\alpha}} \cdot n^{2 - \frac{\alpha(\omega-2)}{1-\alpha} + o(1)}.$$

Furthermore, we have $\alpha > 0.31389$.

See Lemma A.5 for the origin of the formula. Since the cost of multiplying an $n \times n$ matrix by an $n \times r$ matrix is essentially the same for all $r \leq n^{\alpha}$, (6) should be used to update at least $n^{\alpha}$ coordinates at a time. In the extreme case we are discussing, we only need to update the matrix $O(\sqrt{n}/n^{\alpha})$ times, each taking $n^{2+o(1)}$ time, and hence the total cost is less than $n^{2.5-\alpha+o(1)}$.
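To see how the bound of Theorem 2.4 behaves at the two endpoints, one can evaluate its exponent (in $n$) as a function of $r = n^a$. The sketch below uses the currently known estimates of $\omega$ and $\alpha$ quoted in the text:

```python
# Exponent (in n) of the rectangular multiplication bound of Theorem 2.4,
# for multiplying an n x n matrix by an n x n^a matrix.
omega, alpha = 2.373, 0.31389   # currently known estimates

def mm_exponent(a):
    # n^{2+o(1)} + r^{(omega-2)/(1-alpha)} * n^{2 - alpha(omega-2)/(1-alpha)}
    c = (omega - 2) / (1 - alpha)
    return max(2.0, a * c + 2.0 - alpha * c)

# Endpoints: r = n^alpha is as cheap as n^2, and r = n recovers n^omega.
assert abs(mm_exponent(alpha) - 2.0) < 1e-9
assert abs(mm_exponent(1.0) - omega) < 1e-9
```

The first assertion is the fact used above: updating up to $n^{\alpha}$ coordinates costs no more than $n^{2+o(1)}$, so batching fewer coordinates wastes the rectangular multiplication.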

In previous papers [Kar84, Vai89b, NN91, NN94, LS14, LS15], the matrix is updated on a fixed schedule independent of the input sequence $w^{(1)}, w^{(2)}, \ldots$. This leads to sub-optimal bounds if used in this paper. We instead define a potential function to measure the distance between the approximate vector $\overline{w}$ and the target vector $w$. When more than $n^{\alpha}$ coordinates of $w$ are far from $\overline{w}$, we update $\overline{w}$ by a certain greedy step. As in the extreme cases above, the worst case for our algorithm is when the “adversary” spreads his budget uniformly across all coordinates, which gives a total runtime of $n^{2.5-\alpha/2}$. We will further explain the potential in Section 5.1.

## 3 Preliminaries

For notational convenience, we assume the number of variables $n \geq d$ and that there are no redundant constraints. In particular, this implies that the constraint matrix $A$ has full row rank.

For a positive integer $n$, let $[n]$ denote the set $\{1, 2, \ldots, n\}$.

For any function $f$, we define $\widetilde{O}(f)$ to be $f \cdot \log^{O(1)}(f)$. In addition to $O(\cdot)$ notation, for two functions $f, g$, we use the shorthand $f \lesssim g$ (resp. $\gtrsim$) to indicate that $f \leq C g$ (resp. $\geq$) for some absolute constant $C$.

We use $\mathbb{E}[\cdot]$ to denote expectation and $\Pr[\cdot]$ to denote probability.

For vectors $a, b \in \mathbb{R}^n$ and accuracy parameter $\varepsilon \in (0,1)$, we use $a \approx_\varepsilon b$ to denote that $(1-\varepsilon) b_i \leq a_i \leq (1+\varepsilon) b_i$ for all $i$. Similarly, for any scalar $t$, we use $a \approx_\varepsilon t$ to denote that $(1-\varepsilon) t \leq a_i \leq (1+\varepsilon) t$ for all $i$.

For a vector $x \in \mathbb{R}^n$, we use $\sqrt{x}$ to denote the length-$n$ vector whose $i$-th coordinate is $\sqrt{x_i}$. Similarly, we extend other scalar operations to vectors coordinate-wise.

Given vectors $x, s \in \mathbb{R}^n$, we use $X$ and $S$ to denote the diagonal matrices of those two vectors. We use $\sqrt{XS}$ to denote the diagonal matrix with entries $\sqrt{x_i s_i}$. Similarly, we extend other scalar operations to diagonal matrices diagonal-wise. Note that the matrix $\sqrt{\frac{X}{S}}A^\top(A\frac{X}{S}A^\top)^{-1}A\sqrt{\frac{X}{S}}$ is an orthogonal projection matrix.

## 4 Stochastic Central Path Method

### 4.1 Proof Outline

The short step central path method is defined using the approximation $(x_i + \delta_{x,i})(s_i + \delta_{s,i}) \approx x_i s_i + x_i \delta_{s,i} + s_i \delta_{x,i}$. This approximation is accurate only if $|\delta_{x,i}| \ll x_i$ and $|\delta_{s,i}| \ll s_i$. For the $x$ step, we have

$$X^{-1}\delta_x = \frac{1}{\sqrt{XS}}(I - P)\frac{1}{\sqrt{XS}}\delta_\mu \sim \frac{1}{t}(I - P)\delta_\mu \qquad (7)$$

where we used $x_i s_i \sim t$ for all $i$.

If we know that $\|\delta_\mu\|_2 \ll t$, then the $\ell_\infty$ norm can be bounded as follows:

$$\|X^{-1}\delta_x\|_\infty \leq \|X^{-1}\delta_x\|_2 \lesssim \frac{1}{t}\|(I - P)\delta_\mu\|_2 \leq \frac{1}{t}\|\delta_\mu\|_2 \ll 1$$

where we used that $P$ is an orthogonal projection matrix. This is the reason why a standard choice of $\delta_\mu$ is $-\frac{t}{\sqrt{n}}\mathbf{1}$.

For the stochastic step, $\widetilde{\delta}_{\mu,i}$ is nonzero for only roughly $k$ coordinates, each rescaled by $1/p_i$. Therefore, the $\ell_2$ norm of $\widetilde{\delta}_\mu$ is very large ($\gg t$). After the projection, we only have $\|X^{-1}\widetilde{\delta}_x\|_2 \gg 1$. Hence, we cannot bound $\|X^{-1}\widetilde{\delta}_x\|_\infty$ by $\|X^{-1}\widetilde{\delta}_x\|_2$. To improve the bound, we use Chernoff bounds to estimate $\|X^{-1}\widetilde{\delta}_x\|_\infty$ directly.

Besides the norm bound, the proof sketch in (7) also requires $x_i s_i \sim t$ for all $i$. The short step central path proof maintains the invariant that $xs$ is close to $t \cdot \mathbf{1}$ in $\ell_2$ norm. However, since our stochastic step has stochastic noise with large $\ell_2$ norm, one cannot hope to maintain $xs$ close to $t \cdot \mathbf{1}$ in $\ell_2$ norm. Instead, we follow an idea in [LS14, LSW15] and maintain the following potential:

$$\sum_{i=1}^n \cosh\left(\lambda\left(\frac{x_i s_i}{t} - 1\right)\right) = n^{O(1)}$$

with $\lambda = \Theta(\log n)$. Note that the potential being bounded by $n^{O(1)}$ implies that $xs$ is a multiplicative approximation of $t \cdot \mathbf{1}$. To bound the potential, let $r_i = \frac{x_i s_i}{t}$ and let $\Phi(r)$ be the potential above. Then, we have that

$$\mathbb{E}[\Phi(r^{\mathrm{new}})] \sim \Phi(r) + \langle \nabla\Phi(r), \mathbb{E}[r^{\mathrm{new}} - r]\rangle + \frac{1}{2}\|r^{\mathrm{new}} - r\|^2_{\nabla^2\Phi(r)}.$$

The first order term can be bounded efficiently because $\mathbb{E}[r^{\mathrm{new}} - r]$ is close to the short step central path step. The second term is a variance term, which is small because the coordinates are sampled independently. Therefore, the potential changes by only a small multiplicative factor each step, and hence we can keep it bounded for roughly $\sqrt{n}$ steps.
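To see why a polynomially bounded cosh potential gives a multiplicative guarantee, note that each term is at most the sum, so $|r_i - 1| \leq \operatorname{arccosh}(n^{O(1)})/\lambda = O(\log n)/\lambda = O(1)$ for $\lambda = \Theta(\log n)$. A tiny numerical illustration (the constants $n$, $C$, and the choice of $\lambda$ are arbitrary):

```python
import math

n, C = 1000, 3
lam = 10 * math.log(n)                  # lambda = Theta(log n)

# If sum_i cosh(lam*(r_i - 1)) <= n^C, each term is <= n^C, so:
bound = math.acosh(float(n) ** C) / lam
assert bound < 0.5                      # every r_i is within a constant of 1

# Conversely, one coordinate drifting to 1.5 already blows up the potential.
r = [1.0] * (n - 1) + [1.5]
phi = sum(math.cosh(lam * (ri - 1.0)) for ri in r)
assert phi > float(n) ** C
```

The exponential growth of $\cosh$ is the point: a single badly drifting coordinate dominates the potential long before it escapes the multiplicative window, so the algorithm can react to it.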

To make sure the potential stays bounded during the whole algorithm, our step $\delta_\mu$ is a mixture of two terms: the first term decreases $t$ and the second term decreases the potential $\Phi$.

Since the algorithm is randomized, there is a tiny probability that the potential $\Phi$ becomes large. In that case, we switch to a short step central path method. See Figure 1, Algorithm 1, and Algorithm 2. The first part of the proof involves bounding every quantity listed in Table 1. In the second part, we use these quantities to bound the expectation of $\Phi$.

To decouple the two parts of the proof, we make the following assumption in the first part. It will be verified in the second part.

###### Assumption 4.1.

Assume the following for the input of the procedure StochasticStep (see Algorithm 1):

• $\mu = xs$ with $0.9\,t \leq \mu_i \leq 1.1\,t$ for all $i$.

• The sampling in (3) outputs $\widetilde{\delta}_\mu$ such that $\mathbb{E}[\widetilde{\delta}_\mu] = \delta_\mu$, with sparsity parameter $k \gtrsim \sqrt{n}$.

• $\overline{x} \approx x$ and $\overline{s} \approx s$ with small multiplicative error.

• $\|\delta_\mu\|_2$ is sufficiently small relative to $t$.

### 4.2 Bounding each quantity of stochastic step

First, we give an explicit formula for our step, which will be used in all subsequent calculations.

###### Lemma 4.2.

The procedure StochasticStep (see Algorithm 1) finds the solution $(\widetilde{\delta}_x, \widetilde{\delta}_y, \widetilde{\delta}_s)$ to (4) by the formula

$$\widetilde{\delta}_x = \frac{\overline{X}}{\sqrt{\overline{X}\overline{S}}}(I - \overline{P})\frac{1}{\sqrt{\overline{X}\overline{S}}}\widetilde{\delta}_\mu \qquad (8)$$

$$\widetilde{\delta}_s = \frac{\overline{S}}{\sqrt{\overline{X}\overline{S}}}\,\overline{P}\,\frac{1}{\sqrt{\overline{X}\overline{S}}}\widetilde{\delta}_\mu \qquad (9)$$

with

 ¯¯¯¯P= ⎷¯¯¯¯¯X¯¯¯¯SA⊤(A¯¯¯¯¯X¯¯¯¯SA⊤)−1A ⎷¯¯¯¯¯X¯¯¯¯S. (10)
###### Proof.

For the first equation of (4), we multiply by $A\overline{S}^{-1}$ on both sides:

$$A\overline{S}^{-1}\overline{X}\widetilde{\delta}_s + A\widetilde{\delta}_x = A\overline{S}^{-1}\widetilde{\delta}_\mu.$$

Since the second equation gives $A\widetilde{\delta}_x = 0$, we know that $A\overline{S}^{-1}\overline{X}\widetilde{\delta}_s = A\overline{S}^{-1}\widetilde{\delta}_\mu$.

Multiplying by $A\overline{S}^{-1}\overline{X}$ on both sides of the third equation of (4), we have

$$-A\overline{S}^{-1}\overline{X}A^\top\widetilde{\delta}_y = A\overline{S}^{-1}\overline{X}\widetilde{\delta}_s = A\overline{S}^{-1}\widetilde{\delta}_\mu.$$

Thus,

$$\widetilde{\delta}_y = -\left(A\overline{S}^{-1}\overline{X}A^\top\right)^{-1}A\overline{S}^{-1}\widetilde{\delta}_\mu,$$
$$\widetilde{\delta}_s = A^\top\left(A\overline{S}^{-1}\overline{X}A^\top\right)^{-1}A\overline{S}^{-1}\widetilde{\delta}_\mu,$$
$$\widetilde{\delta}_x = \overline{S}^{-1}\widetilde{\delta}_\mu - \overline{S}^{-1}\overline{X}A^\top\left(A\overline{S}^{-1}\overline{X}A^\top\right)^{-1}A\overline{S}^{-1}\widetilde{\delta}_\mu.$$

Recalling that we defined $\overline{P}$ as in (10), we have

$$\widetilde{\delta}_s = \frac{\overline{S}}{\sqrt{\overline{X}\overline{S}}} \cdot \sqrt{\frac{\overline{X}}{\overline{S}}}A^\top\left(A\frac{\overline{X}}{\overline{S}}A^\top\right)^{-1}A\sqrt{\frac{\overline{X}}{\overline{S}}} \cdot \frac{1}{\sqrt{\overline{X}\overline{S}}}\widetilde{\delta}_\mu = \frac{\overline{S}}{\sqrt{\overline{X}\overline{S}}}\,\overline{P}\,\frac{1}{\sqrt{\overline{X}\overline{S}}}\widetilde{\delta}_\mu,$$

and

$$\widetilde{\delta}_x = \overline{S}^{-1}\widetilde{\delta}_\mu - \frac{\overline{X}}{\sqrt{\overline{X}\overline{S}}}\,\overline{P}\,\frac{1}{\sqrt{\overline{X}\overline{S}}}\widetilde{\delta}_\mu = \frac{\overline{X}}{\sqrt{\overline{X}\overline{S}}}(I - \overline{P})\frac{1}{\sqrt{\overline{X}\overline{S}}}\widetilde{\delta}_\mu.$$