1 Introduction
We consider the following convex optimization problem
(1) |
where is the average of a set of smooth convex functions , namely
and is convex and can be non-smooth.
The formulation (1
) includes many applications in machine learning, such as regularized empirical risk minimization. For example, given a training set
, where is the feature of the th sample and is the response. If we take , and, then we can obtain lasso regression. If we take
(),, then the model becomes logistic regression with elastic net penalty.
One typical approach for solving the formulation (1) is first order methods that use proximal mappings to handle the non-smooth part, such as ISTA Daubechies et al. (2003), SpaRSA Wright et al. (2009) and TRIP Kim et al. (2010). The first order method can be improved by Nesterov’s acceleration strategy Nesterov (1983). One seminal work is the FISTA Beck & Teboulle (2009), and related package TFOCS Becker et al. (2011) has been widely used.
Another class of methods to handle Problem (1) is proximal Newton-type algorithms Fukushima & Mine (1981); Becker & Fadili (2012); Oztoprak et al. (2012); Lee et al. (2014). Proximal Newton-type methods approximate the smooth part with a local quadratic model and successively minimize the surrogate functions. Compared with the first-order methods, the Newton-type methods obtain rapid convergence rate because they incorporate additional curvature information.
Both conventional first order and Newton-type methods require the computation of full gradient in each iteration, which is very expensive when the number of the component
is very large. In this case, ones usually exploit the stochastic optimization algorithms, which only process single or mini-batch components of the objective at each step. The stochastic gradient descent (SGD)
Bottou (2010) has been widely used in many machine learning problems. However, SGD usually suffers from large variance of random sampling, leading to a slower convergence rate. There are some methods to improve SGD in the case that the objective is smooth (a special case of Problem (1) in which ). They include the first order methods such as SAG Roux et al. (2012) and SVRG Johnson & Zhang (2013), and the Newton-type methods such as stochastic quasi-newton method Byrd et al. (2014), unified quasi-Newton method Sohl-Dickstein et al. (2014) and linearly-convergent stochastic L-BFGS Moritz et al. (2015). There are also some extensions to solve the formulation (1) which includes the non-smooth case, e.g., the first order method Prox-SVRG Xiao & Zhang (2014), accelerated Prox-SVRG Nitanda (2014) and proximal stochastic Newton-type gradient descent Shi & Liu (2015).In this paper, we introduce a stochastic proximal quasi-Newton algorithm to solve the general formulation (1). Our method incorporates the second order information by using a scaled proximal mapping to handle the non-smooth part in the objective. Compared with Shi & Liu (2015)’s stochastic Newton-type method which requires storing the whole data set, our method only needs to store mini-batch data in each iteration. Furthermore, we exploit the idea of multistage scheme Johnson & Zhang (2013); Xiao & Zhang (2014) to reduce the variance in our algorithm. We also prove our method is linearly convergent, which is the same as the special case of solving the smooth problem Moritz et al. (2015).
2 Notation and Preliminaries
In this section we give the notation and preliminaries which will be used in this paper. Let denote the identity matrix. For a vector , the Euclidean norm is denoted as and the weighted norm is denoted as , where is positive definite. For a subset , we define the function as
The proximal mapping of a convex function at is
The scaled proximal mapping of the convex function at with respect to the positive definite matrix is
We make the following assumptions.
Assumption 1.
The component function is -strongly convex and its gradient is Lipschitz continuous with constant ; that is, for any , we have
which is equivalent to
Then is -strongly convex and its gradient is Lipschitz continuous with constant , where and . Furthermore, let .
Assumption 2.
For any nonempty size- subset , we have
Based on Assumption 1 and the convexity of , we can derive that is -strongly convex even when is not strongly convex.
3 The Proximal Stochastic Quasi-Newton Algorithm
The traditional proximal Newton-type methods Fukushima & Mine (1981); Becker & Fadili (2012); Oztoprak et al. (2012); Lee et al. (2014) use the following update rule at th iteration
(2) |
where is the step size and is the Hessian of at or its approximation. We can view such iteration as minimizing the composite of local quadratic approximation to and the non-smooth part , that is,
The update rule (2) requires the computation of the full gradient at each iteration. When the number of the component is very large, it is very expensive. In this case, we can use the stochastic variant of proximal Newton-type methods. We can sample a mini-batch at each stage and take the iteration as follow
(3) |
where . To avoid the step size decaying to zero, we use the multi-stage scheme Johnson & Zhang (2013); Xiao & Zhang (2014) to reduce the variance in random sampling. Specifically, we replace by the variance reduced gradient :
(4) |
where is the size of mini-batch, and
is the probability of sampling mini-batch
. The estimate
in (4) is the estimate of optimal solution , and we update the full gradient after every iterations. The probability is proportional to the Lipschitz constant of . We provide the detailed analysis in Lemma 4.Thus we use the following modified update rule in our algorithm
(5) |
If has simple proximal mapping, the subproblem (5) can be solved by iterative methods such as FISTA Beck & Teboulle (2009). When the dimension is large, solving (5) by using the exactly Hessian matrix in each iteration is unacceptable. To make the iteration (5) efficient, we construct the approximation of Hessian by combining the idea of the stochastic LBFGS Byrd et al. (2014) and the proximal splitting method Becker & Fadili (2012). Suppose that the approximate Hessian has the form , where is a diagonal with positive diagonal elements and is obtained via the results of recently iterations. The detail of constructing the Hessian is given in Algorithm 2. We solve the subproblem (5) in terms of the following lemma Becker & Fadili (2012).
Lemma 1.
Let be positive definite. Then
where and is the root of
Lemma 1 implies that we can solve the subproblem (5) efficiently when the proximal mapping of is simple. We summarize the whole procedure of our method in Algorithm 1.
4 Convergence Analysis
By the strongly convexity of
, we show that the eigenvalues of the approximate Hessian
obtained from Algorithm 2 is bounded.Theorem 1.
By Assumption 2, there exist two constants such that the matrix constructed from Algorithm 2 satisfies , where
Proof.
By Assumption 2 and Algorithm 1, we have and , which implies
(6) |
Letting and using the definition of in Algorithm 2, we have
(7) |
Together with (6) and (7), we have
(8) |
Using the Woodbury formula and the procedure of Algorithm 2, we can write as
Then the largest eigenvalue of has the upper bound
Then we can bound the value of as follows
(9) | |||||
where the last inequality uses the result of (8). We can compute the determinant of as follows.
Combining with the result in (9), we have
∎
We generalize Lemma 3.6 in Xiao & Zhang (2014), by integrating the second-order information.
Lemma 2.
For any and positive definite , let , , and . Then we have
Similar with the standard proximal mapping, the scaled proximal mapping also has the non-expansive property Lee et al. (2014).
Lemma 3.
Suppose is a convex function from to and satisfies . Let and . Then .
We can bound the variance of the stochastic gradient as following lemma.
Lemma 4.
Let be the definition of (4) and let and . Then we have
Based on the above results, we can obtain the following convergence result of our method.
Theorem 2.
Proof.
Applying Lemma 2 with , , , , , and , we have
(10) |
and . Then consider the difference of and iteration results with respect to
(11) | |||||
where the first inequality uses the results (10) and the second inequality is obtained by .
Then we bound . We define the result of proximal mapping of the full gradient as
(12) |
Recall that we obtain via
(13) |
Then we have
(14) | |||||
where the first inequality is obtained by the Cauchy-Schwarz inequality and the second inequality is obtained by applying Lemma 3 on the fact (12) and (13). We note that and
are independent of the random variable
and by fixing . Then(15) |
Taking the expectation on (11) and combine the results of (14) and (15), we have
Consider stages, . Summing over on the above inequality and taking the expectation with , we have
That is