 # A Proximal Stochastic Quasi-Newton Algorithm

In this paper, we discuss the problem of minimizing the sum of two convex functions: a smooth function plus a non-smooth function. Further, the smooth part can be expressed by the average of a large number of smooth component functions, and the non-smooth part is equipped with a simple proximal mapping. We propose a proximal stochastic second-order method, which is efficient and scalable. It incorporates the Hessian in the smooth part of the function and exploits multistage scheme to reduce the variance of the stochastic gradient. We prove that our method can achieve linear rate of convergence.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

We consider the following convex optimization problem

 minx∈RdP(x)def=F(x)+R(x), (1)

where is the average of a set of smooth convex functions , namely

 F(x)=1nn∑i=1fi(x),

and is convex and can be non-smooth.

The formulation (1

) includes many applications in machine learning, such as regularized empirical risk minimization. For example, given a training set

, where is the feature of the th sample and is the response. If we take , and

, then we can obtain lasso regression. If we take

(),

, then the model becomes logistic regression with elastic net penalty.

One typical approach for solving the formulation (1) is first order methods that use proximal mappings to handle the non-smooth part, such as ISTA Daubechies et al. (2003), SpaRSA Wright et al. (2009) and TRIP Kim et al. (2010). The first order method can be improved by Nesterov’s acceleration strategy Nesterov (1983). One seminal work is the FISTA Beck & Teboulle (2009), and related package TFOCS Becker et al. (2011) has been widely used.

Another class of methods to handle Problem (1) is proximal Newton-type algorithms Fukushima & Mine (1981); Becker & Fadili (2012); Oztoprak et al. (2012); Lee et al. (2014). Proximal Newton-type methods approximate the smooth part with a local quadratic model and successively minimize the surrogate functions. Compared with the first-order methods, the Newton-type methods obtain rapid convergence rate because they incorporate additional curvature information.

Both conventional first order and Newton-type methods require the computation of full gradient in each iteration, which is very expensive when the number of the component

is very large. In this case, ones usually exploit the stochastic optimization algorithms, which only process single or mini-batch components of the objective at each step. The stochastic gradient descent (SGD)

Bottou (2010) has been widely used in many machine learning problems. However, SGD usually suffers from large variance of random sampling, leading to a slower convergence rate. There are some methods to improve SGD in the case that the objective is smooth (a special case of Problem (1) in which ). They include the first order methods such as SAG Roux et al. (2012) and SVRG Johnson & Zhang (2013), and the Newton-type methods such as stochastic quasi-newton method Byrd et al. (2014), unified quasi-Newton method Sohl-Dickstein et al. (2014) and linearly-convergent stochastic L-BFGS Moritz et al. (2015). There are also some extensions to solve the formulation (1) which includes the non-smooth case, e.g., the first order method Prox-SVRG Xiao & Zhang (2014), accelerated Prox-SVRG Nitanda (2014) and proximal stochastic Newton-type gradient descent Shi & Liu (2015).

In this paper, we introduce a stochastic proximal quasi-Newton algorithm to solve the general formulation (1). Our method incorporates the second order information by using a scaled proximal mapping to handle the non-smooth part in the objective. Compared with Shi & Liu (2015)’s stochastic Newton-type method which requires storing the whole data set, our method only needs to store mini-batch data in each iteration. Furthermore, we exploit the idea of multistage scheme Johnson & Zhang (2013); Xiao & Zhang (2014) to reduce the variance in our algorithm. We also prove our method is linearly convergent, which is the same as the special case of solving the smooth problem Moritz et al. (2015).

## 2 Notation and Preliminaries

In this section we give the notation and preliminaries which will be used in this paper. Let denote the identity matrix. For a vector , the Euclidean norm is denoted as and the weighted norm is denoted as , where is positive definite. For a subset , we define the function as

 fS(x)=∑i∈Sfi(x).

The proximal mapping of a convex function at is

 proxQ(x)=argminyQ(y)+12||y−x||2.

The scaled proximal mapping of the convex function at with respect to the positive definite matrix is

 proxHQ(x)=argminyQ(y)+12||y−x||2H.

We make the following assumptions.

###### Assumption 1.

The component function is -strongly convex and its gradient is Lipschitz continuous with constant ; that is, for any , we have

 μi2||x−y||2≤fi(y)−fi(x)−(x−y)T∇fi(y)≤Li2||x−y||2,

which is equivalent to

 μiId⪯∇2fi(x)⪯LiId.

Then is -strongly convex and its gradient is Lipschitz continuous with constant , where and . Furthermore, let .

###### Assumption 2.

For any nonempty size- subset , we have

 λId⪯∇2fS(x)⪯ΛId.

Based on Assumption 1 and the convexity of , we can derive that is -strongly convex even when is not strongly convex.

## 3 The Proximal Stochastic Quasi-Newton Algorithm

The traditional proximal Newton-type methods Fukushima & Mine (1981); Becker & Fadili (2012); Oztoprak et al. (2012); Lee et al. (2014) use the following update rule at th iteration

 xk+1=proxHkηkR(xk−ηkH−1k∇F(xk)), (2)

where is the step size and is the Hessian of at or its approximation. We can view such iteration as minimizing the composite of local quadratic approximation to and the non-smooth part , that is,

 proxHkηkR(xk−ηkH−1k∇F(xk)) = argminy∇F(xk)T(y−xk)+12ηk||y−xk||2Hk+R(y).

The update rule (2) requires the computation of the full gradient at each iteration. When the number of the component is very large, it is very expensive. In this case, we can use the stochastic variant of proximal Newton-type methods. We can sample a mini-batch at each stage and take the iteration as follow

 xk+1=proxHkηkR(xk−ηkH−1k∇fSk(xk)), (3)

where . To avoid the step size decaying to zero, we use the multi-stage scheme Johnson & Zhang (2013); Xiao & Zhang (2014) to reduce the variance in random sampling. Specifically, we replace by the variance reduced gradient :

 vk=1MbqSk(∇fSk(xk)−∇fSk(~x))+∇F(~x), (4)

where is the size of mini-batch, and

is the probability of sampling mini-batch

. The estimate

in (4) is the estimate of optimal solution , and we update the full gradient after every iterations. The probability is proportional to the Lipschitz constant of . We provide the detailed analysis in Lemma 4.

Thus we use the following modified update rule in our algorithm

 xk+1=proxHkηkR(xk−ηkH−1kvk). (5)

If has simple proximal mapping, the subproblem (5) can be solved by iterative methods such as FISTA Beck & Teboulle (2009). When the dimension is large, solving (5) by using the exactly Hessian matrix in each iteration is unacceptable. To make the iteration (5) efficient, we construct the approximation of Hessian by combining the idea of the stochastic LBFGS Byrd et al. (2014) and the proximal splitting method Becker & Fadili (2012). Suppose that the approximate Hessian has the form , where is a diagonal with positive diagonal elements and is obtained via the results of recently iterations. The detail of constructing the Hessian is given in Algorithm 2. We solve the subproblem (5) in terms of the following lemma Becker & Fadili (2012).

###### Lemma 1.

Let be positive definite. Then

 proxHQ(x)=D−1/2proxQ∘D−1/2(D1/2x−v),

where and is the root of

 uT(x−D−1/2proxQ∘D−1/2(D1/2(x−βD−1u)))+β=0.

Lemma 1 implies that we can solve the subproblem (5) efficiently when the proximal mapping of is simple. We summarize the whole procedure of our method in Algorithm 1.

## 4 Convergence Analysis

By the strongly convexity of

, we show that the eigenvalues of the approximate Hessian

obtained from Algorithm 2 is bounded.

###### Theorem 1.

By Assumption 2, there exist two constants such that the matrix constructed from Algorithm 2 satisfies , where

 Γ = dΛα, γ = α(α−2)λd+1+α(1−α)λdΛ+Λ2λd−1dd−1Λdλ2(1−α).
###### Proof.

By Assumption 2 and Algorithm 1, we have and , which implies

 λ≤sTryr||sr||2≤sTr∇2fTr(^xr)sr||sr||2≤Λ. (6)

Letting and using the definition of in Algorithm 2, we have

 1Λ≤τ=sTryr||yr||2=sTr∇2fTr(^xr)srsTr(∇2fTr(^xr))2sr=zTrzrzTr∇2fTr(^xr)zr≤1λ. (7)

Together with (6) and (7), we have

 1Λ2≤||sr||2||yr||2≤1λ2. (8)

Using the Woodbury formula and the procedure of Algorithm 2, we can write as

 Hr=(ατId+uruTr)−1=1ατId−uruTrατ(ατ+uTrur).

Then the largest eigenvalue of has the upper bound

 σmax(Hr) ≤ tr(Hr) = 1ατtr(Id)−tr(uruTrατ(ατ+uTrur)) ≤ 1ατtr(Id)=dατ≤dΛα.

Then we can bound the value of as follows

 uTrur = ||sr−ατyr||2(sr−ατyr)Tyr (9) = ||sr||2−2ατsTryr+α2τ2||yr||2sTryr−ατ||yr||2 = ||sr||2−2ατ2||yr||2+α2τ2||yr||2τ||yr||2−ατ||yr||2 = ||sr||2−α(2−α)τ2||yr||2τ(1−α)||yr||2 = ||sr||2τ(1−α)||yr||2−α(2−α)τ1−α ≤ Λλ2(1−α)−α(2−α)(1−α)Λ,

where the last inequality uses the result of (8). We can compute the determinant of as follows.

 det(Hr) = det(1ατId−uruTrατ(ατ+uTrur)) = 1(ατ)ddet(Id−uruTrατ+uTrur) = 1(ατ)d−1(ατ+uTrur) ≥ (λα)d−11αλ+Λλ2(1−α)−α(2−α)(1−α)Λ = (λα)d−1α(α−2)λ2+α(1−α)Λλ+Λ2λ2Λ(1−α) = α(α−2)λd+1+α(1−α)Λλd+Λ2λd−1αd−1Λλ2(1−α).

Combining with the result in (9), we have

 σmin(Hr) ≥ det(Hr)σmax(Hr)d−1 = α(α−2)λd+1+α(1−α)λdΛ+Λ2λd−1αd−1Λλ2(1−α)αd−1(dΛ)d−1 = α(α−2)λd+1+α(1−α)λdΛ+Λ2λd−1dd−1Λdλ2(1−α).

We generalize Lemma 3.6 in Xiao & Zhang (2014), by integrating the second-order information.

###### Lemma 2.

For any and positive definite , let , , and . Then we have

 P(y) ≥ P(x+)+gTH(y−x)+ΔT(x+−y)+(η||g||2H−Lη22||g||2).

Similar with the standard proximal mapping, the scaled proximal mapping also has the non-expansive property Lee et al. (2014).

###### Lemma 3.

Suppose is a convex function from to and satisfies . Let and . Then .

We can bound the variance of the stochastic gradient as following lemma.

###### Lemma 4.

Let be the definition of (4) and let and . Then we have

 E||vk−∇F(xk)||2≤4LQ[P(xk)−P(x∗)+P(~x)−P(x∗)].

Based on the above results, we can obtain the following convergence result of our method.

###### Theorem 2.

Let , and be the definition of Lemma 4. When is sufficiently large, we have

 E[P(~xs)−P(x∗)]≤ρs(P(~x0)−P(x∗)),

where

 ρ=Γγ2+4η2μΓLQ(m+1)(ηγ2−4η2ΓLQ)μm<1.
###### Proof.

Applying Lemma 2 with , , , , , and , we have

 P(x∗)≥P(xk+1)+gTkHr(x∗−xk)+ΔTk(xk+1−x∗)+(η||gk||2Hr−Lη22||gk||2) (10)

and . Then consider the difference of and iteration results with respect to

 ∥xk+1−x∗∥2Hr (11) = ||xk−x∗+xk+1−xk||2Hr = ||xk−x∗||2Hr+(xk−x∗)THr(xk+1−xk)+||xk+1−xk||2Hr = ||xk−x∗||2Hr−2ηgTkHr(xk−x∗)T+η2||gk||2Hr ≤ ||xk−x∗||2Hr+2η[P(x∗)−P(xk+1)]−2ηΔTk(xk+1−x∗)−(2η2||gk||2Hr−Lη3||gk||2)+η2||gk||2Hr ≤ ||xk−x∗||2Hr+2η[P(x∗)−P(xk+1)]−2ηΔTk(xk+1−x∗),

where the first inequality uses the results (10) and the second inequality is obtained by .

Then we bound . We define the result of proximal mapping of the full gradient as

 ¯xk+1=proxHrηR(xk−ηH−1r∇F(xk)). (12)

Recall that we obtain via

 xk+1=proxHrηR(xk−ηH−1rvk). (13)

Then we have

 −2ηΔTk(xk+1−x∗) (14) = −2ηΔTk(xk+1−¯xk+1+¯xk+1−x∗) = −2ηΔTk(xk+1−¯xk+1)−2ηΔTk(¯xk+1−x∗) ≤ 2η||Δk|| ||xk+1−¯xk+1||−2ηΔTk(¯xk+1−x∗) = 2η||Δk|| ||proxHrηR(xk−ηH−1rvk)−proxHrηR(xk−ηH−1r∇F(xk))||−2ηΔTk(¯xk+1−x∗) ≤ 2η2Γγ||Δk|| ||xk−ηH−1rvk−(xk−ηH−1r∇F(xk)||−2ηΔTk(¯xk+1−x∗) = 2η2Γγ||Δk|| ||H−1rΔk||−2ηΔTk(¯xk+1−x∗) ≤ 2η2Γγ2||Δk||2−2ηΔTk(¯xk+1−x∗),

where the first inequality is obtained by the Cauchy-Schwarz inequality and the second inequality is obtained by applying Lemma 3 on the fact (12) and (13). We note that and

are independent of the random variable

and by fixing . Then

 E[ΔTk(¯xk+1−x∗)]=(E[Δk])T(¯xk+1−x∗)=0. (15)

Taking the expectation on (11) and combine the results of (14) and (15), we have

 E∥xk+1−x∗∥2Hr ≤ ∥xk−x∗∥2Hr+2ηE[P(x∗)−P(xk+1)]−2ηΔTk(xk+1−x∗) ≤ ∥xk−x∗∥2Hr+2ηE[P(x∗)−P(xk+1)]+2η2Γγ2E||Δk||2−2ηE[ΔTk(¯xk+1−x∗)] ≤ ∥xk−x∗∥2Hr+2ηE[P(x∗)−P(xk+1)]+8η2ΓLQγ2[P(xk)−P(x∗)+P(~x)−P(x∗)].

Consider stages, . Summing over on the above inequality and taking the expectation with , we have

 m−1∑k=0E||xk+1−x∗||2Hr ≤ m−1∑k=0||xk−x∗||2Hr+m−1∑k=02ηE[P(x∗)−P(xk+1)]+8η2ΓLQγ2m−1∑k=0[P(xk)−P(x∗)+P(~x)−P(x∗)].

That is

 E∥xm−x∗∥2Hr ≤ ∥x0−x∗∥2Hr+2ηE[P(x∗)−P(xm)]−(2η−8η2ΓLQγ2)m−1∑k=1E[P(xk)−P(x∗)] +