 # Randomized Fast Subspace Descent Methods

Randomized Fast Subspace Descent (RFASD) Methods are developed and analyzed for smooth and non-constraint convex optimization problems. The efficiency of the method relies on a space decomposition which is stable in A-norm, and meanwhile, the condition number κ_A measured in A-norm is small. At each iteration, the subspace is chosen randomly either uniformly or by a probability proportional to the local Lipschitz constants. Then in each chosen subspace, a preconditioned gradient descent method is applied. RFASD converges sublinearly for convex functions and linearly for strongly convex functions. Comparing with the randomized block coordinate descent methods, the convergence of RFASD is faster provided κ_A is small and the subspace decomposition is A-stable. This improvement is supported by considering a multilevel space decomposition for Nesterov's `worst' problem.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

We consider the non-constraint convex minimization problem

 minx∈Vf(x), (1)

where is a smooth and convex function and its derivative is Lipschitz continuous with constant and is a Hilbert space. In practice, but might be assigned with an inner product other than the standard inner product of . Solving minimization problem (1

) is a central task with wide applications in fields of scientific computing, machine learning, and data science, etc.

Due to the eruption of data and the stochasticity of the real world, randomness is introduced to make algorithms more robust and computational affordable. In the following, we will restrict ourselves to randomized algorithms related to the coordinate descent (CD) method [18, 3, 14] and its block variant, i.e., the block CD (BCD) method [18, 3, 2, 27, 1] and propose a new algorithm generalizing the randomized CD (RCD) and randomized BCD (RBCD) methods.

In , Nesterov studied a RBCD method for huge-scale optimization problems. Assuming the gradient of the objective function is coordinate-wise Lipschitz continuous with constants , at each step, the block coordinates is chosen randomly with probability and an optimal block coordinate step with step size is employed. Note that, when , the probability

is uniformly distributed. When

, it is proportional to . It is shown that such an RBCD method converges linearly for a strongly convex function and sublinearly for the convex case. Later, in , the cyclic version of the BCD method was studied, namely, each iteration consists of performing a gradient projection step with respect to a certain block taken in a cyclic order. Global sublinear convergence rate was established for convex problems and, when the objective function is strongly convex, a linear convergence rate can be proved. More recently, Wright   studied a simple version of RCD that updates one coordinate at each time with uniformly chosen index. It was pointed out that when applying to linear system using least-sqaures formulation, such a RCD is exactly a randomized Kaczmarz method [22, 11]. Similarly, it was shown that RCD convergences sublinearly for convex functions and linearly for strongly convex functions. In , Lu developed a randomized block proximal damped newton (RBPDN) method. For solving smooth convex minimization problem, RBPDN uses Newton’s method in each block. Comparing with the Newton’s method, the computational complexity of RBPDN is reduced since the Newton’s step is performed locally on each block. There is a trade-off between convergence rate and computational complexity. If the dimension of the blocks is too small, i.e. , the Hessian on each block might lose lots of information, which might lead to slow convergence. While if the block’s dimension is large, for example, , then the computation of Hessian inverse on each block might still be expensive.

Those existing RCD and RBCD methods can achieve acceleration comparing with standard gradient descent (GD) methods, especially for large-scale problems. However, the convergence of RCD and RBCD becomes quite slow when the problem is ill-conditioned. It is well-known that, preconditioning techniques can be used to improve the conditioning of an optimization problem, see [19, 18]. While preconditioning techniques can be motivated in different ways, one approach is to look at the problem (1) in endowed by an inner product induced by the preconditioner. Roughly speaking, assuming the preconditioner is symmetric and positive-definite, we consider the Lipschitz continuity and convexity using the -inner product and -norm. A good preconditioner means that the condition number measured using -norm is relatively small and therefore, the convergence of preconditioned GD (PGD) can be accelerated. The price to pay is the cost of the action of , which might be prohibitive for large-size problem. Moreover, it is also difficult to use the preconditioner in the RCD and RBCD methods due to the fact that the coordinate-wise decomposition is essentially based on the -norm.

One main idea of the proposed algorithm is to generalize the coordinate-wise decomposition to subspace decomposition which is more suitable for the -norm. This idea itself is not new. For example, the well-known multigrid method , which is one of the most efficient methods for solving the elliptic problem, can be derived fom subspace correction methods based on a multilevel space decomposition . Its randomized version has been considered in . Recently, in , we have developed fast subspace descent (FASD) methods by borrowing the subspace decomposition idea of multigrid methods for solving the optimization problems. In this paper, we provide a randomized version of FASD and abbreviated as RFASD.

A key feature of FASD and RFASD is a subspace decomposition with . Note here we do not require the space decomposition to be a direct sum nor be orthogonal. Indeed, the overlapping/redundancy between the subspaces is crucial to speed up the convergence if the space decomposition is stable in the norm as follows,

• (SD) Stable decomposition: there exists a constant , such that

 ∀v∈V,v=J∑i=1vi, and J∑i=1∥vi∥2A≤CA∥v∥2A. (2)

With such a subspace decomposition, the proposed RFASD method is similar with the RBCD method. At each iteration, RFASD randomly chooses a subspace according to certain sampling distribution, computes a search direction in the subspace, and then update the iterator with an appropriate step size.

Based on standard assumptions, we first prove a sufficient decay inequality. Then coupled with the standard upper bound of the optimality gap, we are able to show RFASD converges sublinearly for convex functions and linearly if the objective function is strongly convex. More importantly, the convergence rate of RFASD depends on the condition number measured in -norm and there is no need of inverting directly, only local solves on each subspace is sufficient. Using strongly convex case as an example, we show that the convergence rate is

 1−1J1CA1κA,

where is the condition number of measured in the -norm, which could be much smaller than the condition number of measures in -norm. Here is the number of subspaces and measures the stability of the space decomposition in -norm; see (2). Therefore, if we choose a proper preconditioner such that and a stable subspace decomposition such that , after iterations, we have

 (1−1J1CA1κA)J≤exp(−1CAκA)=exp(−O(1)).

This indicates an exponential decay rate that is independent of the size of the optimization problem, which shows the potential of the proposed RFASD method for solving large-scale optimization problems. In summary, based on a stable subspace decomposition, we can achieve the preconditioning effect by only solving smaller size problems on each subspace, which reduces the computational complexity.

The paper is organized as follows. In Section 2, we set up the minimization problem with proper assumptions on and . Then we propose the RFASD algorithm. In Section 3, based on the stable decomposition assumption, convergence analysis for convex functions and strongly convex function are derived. In Section 4, we give some examples and comparisons between several methods within the framework of RFASD. Numerical experiments results are provided in Section 5 to confirm the theories. Finally, we provide some conclusions in Section 6.

## 2 Fast Subspace Descent Methods

In this section, we introduce the basic setting of the optimization problem we consider as well as basic definitions, notation, and assumptions. Then we propose the fast subspace descent method based on proper subspace decomposition.

### 2.1 Problem Setting

We consider the minimization problem (1). The Hilbert space

is a vector space equipped with an inner product

and the norm induced is denoted by . Although our discussion might be valid in general Hilbert spaces, we restrict ourself to the finite dimensional space and without of loss generality we take . In this case, the standard dot product in corresponds to and is denoted by

 (x,y)=x⋅y:=N∑i=1xiyi∀x,y∈V. (3)

The inner produce is induced by a given symmetric positive definite (SPD) matrix and defined as follows,

 (x,y)A:=(Ax,y),∀ x,y∈V. (4)

Let be the linear space of all linear and continuous mappings which is called the dual space of . The dual norm w.r.t the -norm is defined as: for

 ∥f∥V′=sup∥x∥A≤1⟨f,x⟩, (5)

where denotes the standard duality pair between and . By Riesz representation theorem, can be also treat as a vector and, it is straightforward to verify that

 ∥f∥V′=∥f∥A−1:=⟨A−1f,f⟩1/2.

Next, we introduce a decomposition of the space , i.e.,

 V=V1+V2+⋯+VJ,Vi⊂V,i=1,⋯,J. (6)

Again we emphasize that the space decomposition is not necessarily a direct sum nor be orthogonal. For each subspace , we assign an inner product induced by a symmetric and positive definite matrix . The product space

 ˜V=V1×V2×⋯×VJ

is assigned with the product topology: for

 ∥~v∥~A:=(J∑i=1∥vi∥2Ai)1/2.

In matrix form, is a block diagonal matrix defined on .

We shall make the following assumptions on the objective function:

• (LCi) The gradient of is Lipschitz continuous restricted in each subspace with Lipschitz constant , i.e.,

 ∥∇f(x+vi)−∇f(x)∥A−1≤LA,i∥vi∥Ai,∀vi∈Vi.
• (SC) is strongly convex with strong convexity constant , i.e.,

 ⟨∇f(x)−∇f(y),x−y⟩≥μA∥x−y∥2A,∀x,y∈V.

Let be the natural inclusion and let . In the terminology of multigrid methods, corresponds to the prolongation operator and is the restriction.

### 2.2 Randomized Fast Subspace Descent Methods

Now, we propose the randomized fast subspace descent (RFASD) algorithm.

The non-uniform sampling distribution and the step size requires a priori knowledge of . A conservative plan is to use one upper bound for all subspaces. For example, when the gradient of is Lipschitz continuous with Lipschitz constant , i.e.,

 ∥∇f(x)−∇f(y)∥A−1≤LA∥x−y∥A,∀x,y∈V.

We can set for all and consequently we pick the subspace uniformly and use a uniform step size .

There is a balance between the number of subspaces and the complexity of the subspace solvers. For example, we can choose and thus . But then we need to compute which may cost for which is not practical for large-size problems (e.g., using the standard Gauss elimination to compute leads to ). On the other extreme, we can chose a multilevel decomposition with and . Then the cost to compute is .

One important question is what is a good choice of an -norm? Any good preconditioner for the objective function is a candidate. For example, when exists, or its approximation is a good choice since this inherits advantages of the Newton’s method or quasi-Newton methods.

Another important question is how to get a stable decomposition based on a given SPD matrix ? There is no satisfactory and universal answer to this question. One can always start from a block coordinate decomposition. When , this leads to the block Newton method considered in . We can then merge small blocks to form a larger one in a multilevel fashion and algebraic multigrid methods  can be used in this process to provide a coarsening of the graph defined by the Hessian. In general, efficient and effective space decomposition will be problem dependent. We shall provide an example later on.

### 2.3 Randomized Full Approximation Storage Scheme

The full approximation storage (FAS) scheme, in the deterministic setting, was proposed in  and is a multigrid method for solving nonlinear equations. Several FAS-like algorithms for solving optimization problems have been considered in the literature [6, 12, 15, 7], including those that are line search-based recursive or trust region-based recursive algorithms

Based on RFASD, we shall propose a randomized FAS (RFAS) algorithm. We first recall FAS in the optimization setting as discussed in . Given a space decomposition Let be a projection operator and, ideally, should provide a good approximation of in the subspace . In addition, is a local objective function. can be the original . Then it coincides with the multilevel optimization methods established by Tai and Xu ; see Remark 4.2 in . Given the current approximation , in FAS, the search direction , , is computed by the following steps:

Replacing Step 4 in RFASD (Algorithm 1), we obtain the RFAS as shown in Algorithm 3.

Next we show that if we choose in certain way, RFAS becomes a special case of RFASD. Given the SPD matrices , which induce the inner products on subspaces , , we define the following quadratic local objective functions,

 fi(w)=12∥w∥2Ai,∀w∈Vi,i=1,2,⋯,J.

From Algortihm 2, it is easy to see that . Therefore, in this case, RFAS agrees with RFASD. Note that in this setting may not be the Galerkin projection of , i.e. , which is different from RPSD. Nevertheless, the convergence analysis (Theorem 1 and 2) can be still applied but the constants and depends on the choices of .

Consider a slighly more general case that the local objective function is in , then by mean value theorem, from Algorithm 2, we can write the equation of as

 ⟨∇2fi(ξi)si,vi⟩=−⟨Ri∇f(xk),vi⟩,∀vi∈Vi,

which implies and, again, RFAS is a special case of RFASD. Of course, this choice of is impractical because we do not know in general. One practical choice might be , which is the Galerkin projection of the Hessian matrix of original . In this case, RFAS becomes block Newton’s method. The constants and are thus changed as depends on and is different at each iteration.

## 3 Convergence Analysis

In this section, we shall present a convergence analysis of RFASD. We first discuss the stability of a space decomposition and then prove the crucial sufficient decay property. Then we obtain a linear or sublinear convergence rate by different upper bounds of the optimality gap.

### 3.1 Stable decomposition

We first introduce the mapping as follow,

 Π~v=J∑i=1vi,for ~v=(v1,v2,…,vJ)⊺∈˜V.

We can write and in terms of prolongation and restriction operators.

The space decomposition implies that is surjective. Since for finite dimensional spaces, all norms are equivalent, is a linear and continuous operator and, thus, by the open mapping theorem, there exists a continuous right inverse of . Namely there exists a constant , such that for any , there exists a decomposition with for , and

 J∑i=1∥vi∥2Ai≤CA∥v∥2A. (8)

The constant measures the stability of the space decomposition. When the decomposition is orthogonal, . As the adjoint, the operator is injective and bounded below. The following result is essentially from the fact that and

has the same minimum singular value

.

###### Lemma 1

For a given , let for . Then

 −⟨g,J∑i=1si⟩=J∑i=1∥gi∥A−1i=J∑i=1∥si∥2Ai (9)

and

 ∥g∥2A−1≤CAJ∑i=1∥si∥2Ai=J∑i=1∥si∥2Ai, (10)

where is the constant in (8).

###### Proof

The first identity is an easy consequence of definitions as: for

 −⟨g,si⟩=⟨g,IiA−1iRig⟩=∥gi∥A−1i=∥A−1igi∥2Ai=∥si∥2Ai.

We now prove (10). For a given , let . For any , we chose a stable decomposition . Then

 ⟨g,w⟩ =J∑i=1⟨g,wi⟩=J∑i=1⟨gi,wi⟩≤(J∑i=1∥gi∥2A−1i)1/2(J∑i=1∥wi∥2Ai)1/2 ≤C1/2A(J∑i=1∥gi∥2A−1i)1/2∥w∥A.

Thus,

 ∥g∥2A−1=(supw∈V⟨g,w⟩∥w∥A)2≤CAJ∑i=1∥gi∥2A−1i=J∑i=1∥si∥2Ai,

which completes the proof.

### 3.2 Sufficient decay

We shall prove a sufficient decay property for the function value. Note that we do not assume is convex but only Lipschitz continuous in each subspace.

###### Lemma 2

Suppose the objective function and space decomposition satisfy (LCi). Let be the sequence generated by Algorithm 1. Then for all , we have

 E[f(xk+1)]−f(xk)≤−12¯LACAJ∥∇f(xk)∥2A−1. (11)

###### Proof

By the Lipschitz continuity (LCi) and the choice of the step size, we have

 f(xk+1) ≤f(xk)+αk⟨∇f(xk),sik⟩+LA,ik2α2k∥sik∥2Ai =f(xk)+1LA,ik⟨∇f(xk),sik⟩+12LA,ik∥sik∥2Ai.

Take expectation of conditioned by with probability

 E[f(xk+1))−f(xk) ≤1J¯LA⟨∇f(xk),J∑i=1si⟩+12J¯LAJ∑i=1∥si∥2Ai ≤−12J¯LAJ∑i=1∥si∥2Ai by (???) ≤−12J¯LACA∥∇f(xk)∥2A−1 by (???).

As we will show in the next two subsections, based on the above sufficient decay property (11), together with a proper upper bound of the optimality gap, linear or sub-linear convergent rate can be obtained for the strongly convex and convex case, respectively.

### 3.3 Linear convergence for strongly convex functions

To derive the linear convergence for the strongly convex case, we shall use the following upper bound of the optimality gap.

###### Lemma 3 (Theorem 2.1.10 in )

Suppose that satisfies assumption (SC) with constant and is the minimizer of ; then for all ,

 f(x)−f(x∗)≤12μA∥∇f(x)∥2A−1. (12)

Now we are ready to show the linear convergence of RFASD when the objective function is strongly convex and the result is summarized in the following theorem.

###### Theorem 1

Suppose the objective function and space decomposition satisfy (LCi) and (SC) with . Let be the sequence generated by Algorithm 1. Then for all , we have the linear contraction

 E[f(xk+1)]−f(x∗)≤(1−1JμA¯LA1CA)(E[f(xk)]−f(x∗)). (13)

###### Proof

The left hand side of (11) can be rewritten as . Combining (11) and (12), rearranging the terms, and taking expectation with respect to , we get the desired result.

### 3.4 Sublinear convergence for convex functions

Next, we give the convergence result for convex but not strongly convex objective functions, i.e. in (SC), based on the following bounded level set assumption.

• (BL) Bounded level set: is convex and attains its minimum value on a set . There is a finite constant such that the level set for defined by is bounded, that is,

 maxx∗∈Smaxx{∥x−x∗∥A:f(x)≤f(x0)}≤R0. (14)
###### Lemma 4

Suppose the objective function satisfies (LC) and (BL). Then for all and ,

 f(x)−f(x∗)≤R0∥∇f(x)∥A−1. (15)

###### Proof

By convexity and (BL), for and ,

 f(x)−f∗≤⟨∇f(xk),x−x∗⟩≤∥∇f(x)∥A−1∥x−x∗∥A≤R0∥∇f(x)∥A−1.

We still use the same step size and show that RFASD converges sublinearly for convex objective function .

###### Theorem 2

Suppose the objective function and space decomposition satisfy (LCi), (BL) and (SC) with . Let be the sequence generated by Algorithm 1. Then for all , we have

 E[f(xk)]−f∗≤d01+d0ck≤2R20J¯LACAk, (16)

where and

###### Proof

Note that (11) can be written as

 E[f(xk+1)]−f(x∗)−(f(xk)−f(x∗)) ≤−12¯LAJ1CA∥∇f(xk)∥2A−1 ≤−c(f(xk)−f(x∗))2

where and the last inequality is derived based on (15) with . Now denoting and taking expectation with respect to , we have

 dk+1−dk≤−cd2k≤0.

Based on the above inequality, we obtain

 1dk+1−1dk=dk−dk+1dkdk+1≥dk−dk+1(dk)2≥c.

Recursively applying this inequality, we obtain

 1dk≥1d0+ck. (17)

which implies (16).

###### Remark

The parameters , and can be dynamically changing, i.e., as a function of . For example, we can use , which is smaller than . The space decomposition and local Lipschitz constants could also be improved during the iterations. In these cases, we use to denote the constant and the last inequality (17) holds with the second term on the right-hand-side being replaced by

### 3.5 Complexity

Based on the convergence results Theorem 1 and 2

, we can estimate the computational complexity of the proposed RFASD method and compare with the GD, PGD, RCD, and RBCD methods. As usual, for a prescribed error

, we first estimate how many iterations are needed to reach the tolerance and then estimate the overall computational complexity based on the cost of each iteration.

For gradient-based methods, the main cost per iteration is the evaluation of gradient . In general, it may take operations. In certain cases, the cost could be reduced. For example, when is sparse, e.g., computing one coordinate component of only needs operations, then the computing takes operations. Another example is to use advanced techniques, such as fast multipole method [21, 8], to compute , then the cost could be reduced to . In our discussion, we focus on the general case (referred as dense case) and sparse case. For subspace decomposition type algorithms, including RCD, BRCD, and RFASD, each iteration only needs to compute restricted on one subspace and, therefore, the cost of computing the gradient is (dense case) or (sparse case). Note for RCD. When the preconditioning technique is applied, the extra cost is introduced besides computing the gradient. We assume computing the inverse of an matrix is with , then the extra cost for PGD is since the need of . For the proposed RFASD, the extra cost is reduced to since we only need to compute on each subspace. Now, we summarize the complexity comparison in Table 1.

From Table 1, it is clear that RFASD can take advantage of the preconditioning effect, i.e., or . Meanwhile, there is no need to invert globally and but to compute on each subspace, which reduces the computational cost in the sense that (dense case) or (sparse case) if and . Of course, the key is a stable space decomposition in -norm such that the stability constant can be kept small. In next section, we use Nesterov’s “worst” problem  as an example to demonstrate how to achieve this in practice.

## 4 Examples

In this section, we give some examples of the RFASD method and use the example introduced by Nesterov  to discuss different methods.

We first recall the Nesterov’s “worst” problem 

###### example

For , consider the non-constrained minimization problem (1) with

 f(x):=fL,r(x)=L4(12(x21+r−1∑i=1(xi−xi−1)2+x2r)−x1), (18)

where represents the -th coordinate of and is a constant integer that defines the intrinsic dimension of the problem. The minimum value of the function is

 f∗=L16(−1+1r+1). (19)

### 4.1 Randomized block coordinate descent methods

We follow  to present the RBCD methods. Let endowed with standard -norm , i.e. . Define a partition of the unit matrix

 In=(U1,⋯,UJ)∈RN×N, Ui∈RN×ni, i=1,⋯,J.

Now we consider the space decomposition where and . Naturally, and in this setting. For each subspace, we also use the -norm , i.e.

is the identity matrix of size

. In this setting, RFASD (Algorithm 1) is given by

 xk+1=xk+1Liksik,sik=−UikU⊺ik∇f(xk). (20)

This is just the RBCD algorithm proposed in . Moreover, if the space decomposition is coordinate-wise, i.e., , , then it is reduced to the RCD method.

Regarding the convergence analysis, since the subspace decomposition is direct and orthogonal in inner product, in this case, we have

 J∑i=1si=−∇f(xk)andJ∑i=1∥si∥2=∥∇f(xk)∥2,

which implies that in (SD). Moreover, because the -norm is used here, Lipschitz constant and strong convexity constant are measured in -norm and, hence, we drop the subscript for those constants. Finally, we apply Theorem 1 and 2 and recovery the classical convergence results of RBCD  as follows,

• Convex case:

• Strongly convex case:

Consider Example 4 with , since -norm is used, it is easy to see that and . Therefore, the condition number is and, for strongly convex case, the convergence rate is and, according to Table 1, it requires operations (due to the fact that this problem is sparse) to achieve a given accuracy . This could be quite expensive, even impractical, for large , i.e., large-scale problems.

### 4.2 Randomized fast subspace descent methods

The RFASD method allows us to use a preconditioner without computing . We chose an appropriate -norm defined by an SPD matrix . Let

 V=V1+V2+⋯+VJ,Vi⊂V,i=1,⋯,J,

be a space decomposition. For each subspace, we still use the -norm. Namely we use for all . One can easily verify that which is the so-called Galerkin projection of to the subspace .

In this case, the averaged Lipschitz constant and the strong convexity constant are measured in -norm. Based on Theorem 1 and 2, we naturally have

• Convex case:

• Strongly convex case:

Comparing with the convergence results of RBCD, the key here is to design an appropriate preconditioner which induces the -norm and corresponding stable decomposition such that for convex case or for strongly convex case. Then we will achieve speedup comparing with RCD/RBCD.

Of course, the choices of the preconditioner and the stable decomposition are usually problem-dependent. Let us again consider Example 4 with . Note that the objective function can be written in the following matrix format

 f(x) =fL,N(x)=12(Ax,x)−(x,b), (21) A =L2tridiag(−1,2,−1)∈RN×N and b=L4e1, (22)

where . A good choice of the preconditioner is itself. It is easy to verify that, when measuring in -norm, the averaged Lipschitz constant is and the strong convexity constant is . Therefore, the condition number measured in -norm is