1 Introduction
Recent years have witnessed an increased interest in adopting stochastic (sub)gradient (SGD) methods [1, 3, 21]
for solving large-scale machine learning problems. In each algorithmic iteration, SGD reduces the computation cost by sampling one example (or a small number of examples) for computing a stochastic (sub)gradient. The computation cost of SGD is thus independent of the size of the data available for training; this property makes SGD appealing for large-scale optimization. However, when the optimization problem involves a complex domain (for example a positive definite constraint or a polyhedral one), the projection operation in each iteration of SGD, which is used to ensure the feasibility of the intermediate solutions, may become the computational bottleneck.
In this paper we consider solving the following constrained optimization problem
(1)  min_{x ∈ R^d} f(x)  subject to  c(x) ≤ 0,
where f(x) is strongly convex [23] and c(x) is convex. We assume a stochastic access model for f, in which the only access to f is via a stochastic gradient oracle; in other words, given an arbitrary x, this stochastic gradient oracle produces a random vector g(x; ξ), whose expectation is a subgradient of f at the point x, i.e., E[g(x; ξ)] ∈ ∂f(x), where ∂f(x) denotes the subdifferential set of f at x. On the other hand, we have full access to the (sub)gradient of c. The standard SGD method [5] solves Eq. (1) by iterating the update in Eq. (2) with an appropriate step size η_t (e.g., η_t = O(1/t)), as below
(2)  x_{t+1} = P_D(x_t − η_t g(x_t; ξ_t)),
and then returning the averaged solution x̄_T = (1/T) Σ_{t=1}^{T} x_t after a total number of T iterations. Note that P_D is a projection operator defined as
(3)  P_D(x̂) = argmin_{x: c(x) ≤ 0} (1/2) ‖x − x̂‖².
If the involved constraint function is complex (e.g., a polyhedral or a positive definite constraint), computing the associated projection may be computationally expensive; for example, a projection onto a positive definite cone over R^{d×d} requires a full singular value decomposition (SVD) with time complexity O(d³).
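As a concrete illustration of this cost, the following minimal NumPy sketch (ours, not from the paper) performs the projection onto the positive semidefinite cone; since the matrix being projected is symmetric, the SVD reduces to a full eigendecomposition, which still costs O(d³) per projection:

```python
import numpy as np

def project_psd(X):
    """Project a symmetric matrix onto the positive semidefinite cone.

    Requires a full eigendecomposition, costing O(d^3) time -- this is
    the per-iteration bottleneck that motivates reducing the number of
    projections.
    """
    S = (X + X.T) / 2              # symmetrize against numerical asymmetry
    w, V = np.linalg.eigh(S)       # full spectral decomposition
    # Clip negative eigenvalues to zero and reassemble the matrix.
    return (V * np.clip(w, 0.0, None)) @ V.T

X = np.array([[2.0, 0.0],
              [0.0, -1.0]])        # indefinite: eigenvalues 2 and -1
P = project_psd(X)                 # nearest PSD matrix in Frobenius norm
```

Here the negative eigenvalue is clipped, so `P` equals `diag(2, 0)`.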
In this paper, we propose an epoch-based SGD method, called Epro-SGD, which requires only a logarithmic number of projections (onto the feasible set), and meanwhile achieves an optimal convergence rate for stochastic strongly convex optimization. Specifically, the proposed Epro-SGD method consists of a sequence of epochs; within each epoch, standard SGD is applied to optimize a composite objective function augmented by the complex constraint function, hence avoiding the expensive projection steps; at the end of every epoch, a projection operation is performed to ensure the feasibility of the intermediate solution. Our analysis shows that for a strongly convex optimization problem and a total number of T iterations, Epro-SGD requires only O(log T) projections, and meanwhile achieves an optimal convergence rate of O(1/T), both in expectation and with high probability.
To exploit the structure (for example the sparsity) of the optimization problem, we propose a proximal variant of the Epro-SGD method, namely Epro-ORDA, which utilizes an existing optimal dual averaging method to solve the involved proximal mapping. Our analysis shows that Epro-ORDA similarly requires only a logarithmic number of projections while enjoying an optimal rate of convergence.
For illustration we apply the proposed Epro-SGD methods to two real-world applications, i.e., the constrained Lasso formulation and large margin nearest neighbor (LMNN) classification. Our experimental results demonstrate the efficiency of the proposed methods in comparison to the existing methods.
2 Related Work
The present work is inspired by the breakthrough work in [20], which proposed two novel one-projection-based stochastic gradient descent (OneProj) methods for stochastic convex optimization. Specifically the first OneProj method was developed for general convex optimization; it introduces a regularized Lagrangian function
L(x, λ) = f(x) + λ c(x) − (γ/2) λ²,  λ ≥ 0,
then applies SGD to the convex-concave problem min_{x ∈ B} max_{λ ≥ 0} L(x, λ), and finally performs only one projection at the end of all iterations, where B is a bounded ball subsuming the feasible domain D as a subset.
The second OneProj method was developed for strongly convex optimization. The proposed method introduces an augmented objective function
(4)
where one parameter depends on the total number of iterations T and another is problem specific [20]. OneProj applies SGD to the augmented objective function, specifically using a stochastic subgradient of f and a subgradient of c, and then performs a single projection step after all the iterations. For a total number of T iterations, the OneProj method achieves a convergence rate of O(log T/T), which is suboptimal for stochastic strongly convex optimization.
Several recent works [15, 26] propose methods with optimal convergence rates of O(1/T) for stochastic strongly convex optimization. In particular, the Epoch-SGD method [15] consists of a sequence of epochs, each of which has a geometrically decreasing step size and a geometrically increasing iteration number. This method however needs to project the intermediate solution onto the feasible set at every algorithmic iteration; when the involved constraint is complex, this projection is usually computationally expensive. This limitation restricts its practical application to large-scale data analysis. We are therefore motivated to develop an optimal stochastic algorithm for strongly convex optimization that requires only a small number of projections.
Another closely related work is logT-SGD [33] for stochastic strongly convex and smooth optimization. LogT-SGD achieves an optimal rate of convergence, while requiring O(κ log T) projections, where κ is the ratio of the smoothness parameter to the strong convexity parameter. There are several key differences between our proposed Epro-SGD method and logT-SGD: (i) logT-SGD and its analysis rely on both the smoothness and the strong convexity of the objective function; in contrast, Epro-SGD only assumes that the objective function is strongly convex; (ii) the number of projections required by logT-SGD is O(κ log T), where the condition number κ can be very large in real applications; in contrast, Epro-SGD requires at most O(log T) projections.
Besides reducing the number of projections in SGD, another line of research is based on conditional gradient algorithms [7, 14, 17, 18, 32]; these algorithms mostly build upon the Frank-Wolfe technique [11], which eschews the projection in favor of a linear optimization step; in general, however, they require a smoothness assumption on the objective function. On the other hand, [12, 16] extend Frank-Wolfe techniques to the stochastic or online setting for general and strongly convex optimization. Specifically [16] presents an online/stochastic Frank-Wolfe (OFW) algorithm with an O(1/T^{1/3}) convergence rate for general convex optimization problems, which is slower than the optimal rate O(1/√T). [12] presents an algorithm for online strongly convex optimization with an O(log T) regret bound, implying an O(log T/T) convergence rate for stochastic strongly convex optimization. This algorithm requires the problem domain to be a polytope, instead of the convex inequality constraint used in this paper; it also hinges on an efficient local linear optimization oracle that amounts to approximately solving a linear optimization problem over an intersection of a ball and the feasible domain; furthermore, its convergence result only holds in expectation and is suboptimal.
3 Epoch-Projection SGD Algorithm
In this section, we present an epoch-projection SGD method, called Epro-SGD, for solving Eq. (1) and discuss its convergence result. Based on a stochastic dual averaging algorithm, we then present a proximal variant of the proposed Epro-SGD method.
3.1 Setup and Background
Denote the optimal solution to Eq. (1) by x* and the feasible domain by D = {x : c(x) ≤ 0}. Since f(x) is strongly convex [23] and c(x) is convex, the optimization problem in Eq. (1) is strongly convex. Note that the strong convexity of f implies f(x) − f(x*) ≥ (β/2) ‖x − x*‖² for any x ∈ D, where β denotes the strong convexity parameter. Our analysis is based on the following assumptions:

A1. The stochastic subgradient g(x; ξ) is uniformly bounded by G1, i.e., ‖g(x; ξ)‖ ≤ G1.

A2. The subgradient of c is uniformly bounded by G2, i.e., ‖v‖ ≤ G2 for any v ∈ ∂c(x).

A3. There exists a positive value ρ > 0 such that
(5)  min_{c(x) = 0} min_{v ∈ ∂c(x)} ‖v‖ ≥ ρ.
Remarks Assumptions A1 and A2 respectively impose upper bounds on the stochastic subgradient of the objective function f and on the subgradient of the constraint function c. Assumption A3 ensures that the projection of a point onto the feasible domain does not deviate too much from this intermediate point. Note that Assumption A1 was previously used in [15]; a condition similar to Assumption A3 is used in [20], which however simply assumes a lower bound on ‖∇c(x)‖ over the boundary {x : c(x) = 0}, without considering possible non-differentiability of c.
A key consequence of Assumption A3 is presented in the following lemma.
Lemma 1.
For any point x, let x⁺ = P_D(x) denote its projection onto the feasible domain. If Assumption A3 holds, then
(6)  ‖x − x⁺‖ ≤ [c(x)]_+ / ρ,
where [s]_+ is a hinge operator defined as [s]_+ = s if s ≥ 0, and [s]_+ = 0 otherwise.
Proof.
If c(x) ≤ 0, we have x⁺ = x and the inequality in Eq. (6) trivially holds. If c(x) > 0, we can verify that c(x⁺) = 0, and by duality theory there exist μ > 0 and v ∈ ∂c(x⁺) such that x − x⁺ = μ v. It follows that ‖v‖ ≥ ρ, and thus x − x⁺ points in the same direction as v. It follows that
[c(x)]_+ = c(x) − c(x⁺) ≥ ⟨v, x − x⁺⟩ = μ ‖v‖² = ‖x − x⁺‖ ‖v‖ ≥ ρ ‖x − x⁺‖,
where the last inequality uses Assumption A3. This completes the proof of this lemma. ∎
The result in Lemma 1 is closely related to the polyhedral error bound condition [13, 31]; this condition states that the distance of a point to the optimal set of a convex optimization problem whose epigraph is a polyhedron is bounded by the distance of the objective value at this point to the optimal objective value, scaled by a constant. For illustration, we consider the optimization problem min_x [c(x)]_+
with optimal set D = {x : c(x) ≤ 0}. If c(x) > 0, x⁺ = P_D(x) is the closest point in the optimal set to x. Therefore, by the polyhedral error bound condition of a polyhedral convex optimization problem, if [c(x)]_+ is a polyhedral function, there exists a constant ρ > 0 such that ‖x − x⁺‖ ≤ [c(x)]_+ / ρ.
Below we present three examples in which Assumption A3 or Lemma 1 is satisfied. Example 1: an affine constraint c(x) = aᵀx − b with a ≠ 0. Example 2: a norm constraint c(x) = ‖x‖ − r with r > 0. Example 3: the maximum of a finite number of affine functions, which satisfies Lemma 1 as well as the polyhedral error bound condition [31].
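To make the affine example concrete, the following NumPy snippet (an illustration of ours, not from the paper) checks the bound of Lemma 1 for a halfspace constraint c(x) = aᵀx − b, for which Assumption A3 holds with ρ = ‖a‖ and the bound in fact holds with equality:

```python
import numpy as np

def hinge(s):
    # The hinge operator [s]_+ from Lemma 1.
    return max(s, 0.0)

# Affine constraint c(x) = a^T x - b; Assumption A3 holds with rho = ||a||.
a = np.array([3.0, 4.0])           # ||a|| = 5, so rho = 5
b = 1.0
x = np.array([2.0, 2.0])           # c(x) = 3*2 + 4*2 - 1 = 13 > 0: infeasible

# Exact Euclidean projection onto the halfspace {z : a^T z <= b}.
x_proj = x - hinge(a @ x - b) / (a @ a) * a

dist = np.linalg.norm(x - x_proj)              # distance to the feasible set
bound = hinge(a @ x - b) / np.linalg.norm(a)   # [c(x)]_+ / rho
assert abs(dist - bound) < 1e-12               # Lemma 1, tight in this case
```

For general convex constraints the projection is not available in closed form, which is exactly why Lemma 1's bound in terms of the cheap-to-evaluate quantity [c(x)]_+ is useful.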
3.2 Main Algorithm
To solve Eq. (1) using Epro-SGD, we introduce an augmented objective function by incorporating the constraint function as
(7)  F(x) = f(x) + λ [c(x)]_+.
It is worth noting that the augmented function in Eq. (7) does not have any iteration-dependent parameter, in contrast, for example, to the iteration-dependent parameter in Eq. (4). λ is a prescribed penalty parameter satisfying λρ > G1, as illustrated in Lemma 2.
The details of our proposed Epro-SGD algorithm are presented in Algorithm 1. Similar to Epoch-SGD [15], Epro-SGD consists of a sequence of epochs, each of which has a geometrically decreasing step size and a geometrically increasing iteration number. The intra-epoch updates are standard SGD steps applied to the augmented objective function F(x). Epro-SGD differs from Epoch-SGD in that the former computes a projection only at the end of each epoch, while the latter computes a projection at every iteration. Consequently, when the projection step is computationally expensive (e.g., projecting onto a positive definite constraint), Epro-SGD may require much less computation time than Epoch-SGD.
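The scheme can be sketched as follows. This is a simplified illustration of ours, assuming the augmented objective F(x) = f(x) + λ[c(x)]_+; all function names are hypothetical, and the halving/doubling schedule is the geometric one borrowed from Epoch-SGD:

```python
import numpy as np

def epro_sgd(stoch_grad_f, subgrad_c, c, project, x1, lam, eta1, T1, n_epochs):
    """Sketch of an epoch-projection SGD loop: inside an epoch, run plain
    SGD on F(x) = f(x) + lam * [c(x)]_+ with NO projection; project only
    once, at the end of each epoch.  Step size halves and epoch length
    doubles across epochs (geometric schedule)."""
    x, eta, T_k = x1, eta1, T1
    for _ in range(n_epochs):
        iterates = []
        for _ in range(T_k):
            g = stoch_grad_f(x)
            if c(x) > 0:                 # subgradient of lam * [c(x)]_+
                g = g + lam * subgrad_c(x)
            x = x - eta * g              # no projection inside the epoch
            iterates.append(x)
        x = project(np.mean(iterates, axis=0))  # one projection per epoch
        eta, T_k = eta / 2.0, 2 * T_k           # geometric schedule
    return x
```

A toy usage: minimize the strongly convex f(x) = (1/2)‖x − m‖² with noisy gradients under the affine constraint x₁ + x₂ ≤ 1, whose exact minimizer is (0.5, 0.5); the penalty λ is chosen larger than G1/ρ so the hinge term keeps iterates near the feasible set.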
In Lemma 2, we present an important convergence result for the intra-epoch steps of Algorithm 1, which serves as a key building block for deriving the main result in Theorem 1.
Lemma 2.
Under Assumptions A1-A3, if we apply the update for a number of T iterations, the following inequality holds
where .
Proof.
Let and denote by the expectation conditioned on the randomness until round . It is easy to verify that , and . For any , we have
where . Furthermore by the convexity of , we have
Noting that , taking expectation over the randomness and summing over , we have
Let . Since , we have
(8) 
It follows that
(9) 
If , we have . Following from and , we can verify that and also holds.
Next we show that holds when . From Lemma 1, we have
(10) 
Moreover it follows from and that the following inequality holds
(11)  
Substituting Eqs. (10) and (11) into Eq. (9), we have
By some rearrangement, we have . Furthermore we have
where the second inequality follows from , and for any . This completes the proof of the lemma. ∎
We present a main convergence result of the EproSGD algorithm in the following theorem.
Theorem 1.
Under Assumptions A1-A3 and given that f(x) is strongly convex, if we let , , and set , then the total number of epochs in Algorithm 1 is given by
(12) 
the solution enjoys a convergence rate of
(13) 
and .
Proof.
From the updating rule , we can easily verify Eq. (12). Since , the inequality trivially holds.
Let . It follows that and . Next we show the inequality
(14) 
holds by induction. Note that Eq. (14) implies , due to . Let . It then follows from Lemma 5, , and  that the inequality in Eq. (14) holds when . Assuming that Eq. (14) holds for , we next show that Eq. (14) holds for .
Consider a random variable measurable with respect to the randomness up to epoch , and let  denote the expectation conditioned on all the randomness up to epoch . Following Lemma 2, we have
Since, by the strong convexity of f, , we have
which completes the proof of this theorem. ∎
Remark We compare the main results in Theorem 1 with several existing works. Firstly, Eq. (13) implies that Epro-SGD achieves an optimal bound of O(1/T), matching the lower bound for strongly convex problems [15]. Secondly, in contrast to the OneProj method [20] with its O(log T/T) convergence rate, Epro-SGD uses no more than O(log T) projections to obtain an O(1/T) convergence rate; Epro-SGD thus has better control over how far the intermediate solutions deviate from the feasible domain. Thirdly, compared to Epoch-SGD, whose convergence rate is bounded by , the convergence rate bound of Epro-SGD is only worse by a constant factor. In particular, for a positive definite constraint with , , and , we have , and the bound of Epro-SGD is only worse by a factor of  than that of Epoch-SGD. Finally, compared to the logT-SGD algorithm [33], which requires O(κ log T) projections (κ being the condition number), the number of projections in Epro-SGD is independent of the condition number.
The main results in Lemma 2 and Theorem 1 are expected convergence bounds. In Theorem 2 (proof provided in Appendix) we show that Epro-SGD also enjoys a high-probability bound under a boundedness assumption, i.e.,  for all . Note that the existing Epoch-SGD method [15] uses two different techniques to derive its high-probability bounds. Specifically, the first technique relies on an efficient function evaluator to select the best solution among multiple trials of runs, while the second modifies the updating rule by projecting the solution onto the intersection of the domain and a center-shifted bounded ball with a decaying radius. These two techniques may however lead to additional computation steps if adopted for deriving high-probability bounds for Epro-SGD.
Theorem 2.
Under Assumptions A1-A3 and given that  for all , if we let , , and set , , then the total number of epochs in Algorithm 1 is given by
and the final solution enjoys a convergence rate of
with a probability at least , where .
Remark The boundedness assumption
can be satisfied in practice if we estimate a value
such that , and then project the intermediate solutions onto  at every iteration. Note that Epoch-SGD [15] requires a total number of  projections, and its high-probability bound is given by  with a probability at least , where .
3.3 A Proximal Variant
We propose a proximal extension of Epro-SGD, by exploiting the structure of the objective function. Let the objective function in Eq. (1) be a sum of two components
f(x) = f̂(x) + r(x),
where r(x) is a relatively simple function, for example a squared ℓ2-norm or an ℓ1-norm, such that the involved proximal mapping
is easy to compute. The optimization problem in Eq. (1) can be rewritten as
(15)  
Denote by the optimal solution to Eq. (15). We similarly introduce an augmented objective function as
(16) 
The update of the proximal SGD method for solving (15) [9, 10, 22] is given by
(17) 
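For instance, when the simple component is r(x) = λ‖x‖₁, its proximal mapping has the classical closed-form soft-thresholding solution; the sketch below (our illustration, with hypothetical names) shows one such proximal SGD step in the unconstrained case:

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal mapping of tau * ||.||_1: componentwise shrinkage.
    Cheap to evaluate and produces exactly sparse outputs."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def proximal_sgd_step(x, g, eta, lam):
    # One proximal SGD step with r(x) = lam * ||x||_1 and no constraint:
    #   x_{t+1} = prox_{eta * lam * ||.||_1}(x_t - eta * g_t)
    return soft_threshold(x - eta * g, eta * lam)

x = np.array([1.0, -0.05, 0.2])
x_next = proximal_sgd_step(x, np.zeros(3), eta=1.0, lam=0.1)
```

The small middle coordinate is shrunk exactly to zero, which is why proximal updates preserve sparsity while plain subgradient steps do not.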
If the regularizer is sparsity-inducing, proximal SGD can guarantee sparsity in the intermediate solutions and usually yields better convergence than the standard SGD. However, given a complex constraint, solving the proximal mapping may be computationally expensive. Therefore, we consider a proximal variant of Epro-SGD which involves only the proximal mapping of the simple component, without the constraint. A natural solution is to use the following update in place of step 6 in Algorithm 1:
(18)  
Based on this update and using techniques in Lemma 2, we obtain a similar convergence result (proof provided in Appendix), as presented in the following lemma [8].
Lemma 3.
Under Assumptions A1-A3 and setting , by applying the update in Eq. (18) for a number of T iterations, we have
(19)  
where , and denotes the projected solution of the averaged solution .
Different from the main result in Lemma 2, Eq. (19) has an additional term , which makes the convergence analysis of Epro-SGD difficult. To overcome this difficulty, we adopt the optimal regularized dual averaging (ORDA) algorithm [6] for solving Eq. (16). The details of ORDA are presented in Algorithm 2. The main convergence results of ORDA are summarized in the following lemma (proof provided in Appendix).
Lemma 4.
Under Assumptions A1-A3 and setting , by running ORDA for a number of T iterations to solve the augmented objective in Eq. (16), we have
and
where denotes the projected solution of the final solution .
We present a proximal variant of Epro-SGD, namely Epro-ORDA, in Algorithm 3, and summarize its convergence results in Theorem 3. Note that Algorithm 2 and the convergence analysis in Lemma 4 do not rely on the strong convexity of the objective; the strong convexity is however used for analyzing the convergence of Epro-ORDA in Theorem 3 (proof provided in Appendix).
Theorem 3.
Under Assumptions A1-A3 and given that the objective is strongly convex, if we let  and , and set , then the total number of epochs in Algorithm 3 is given by
and the final solution enjoys a convergence rate of
and .
4 An Example of Solving LMNN via Epro-SGD
In this section, we discuss an application of the proposed Epro-SGD method to high-dimensional distance metric learning (DML) with a large margin formulation, i.e., the large margin nearest neighbor (LMNN) classification method [30]. LMNN classification is one of the state-of-the-art methods for k-nearest neighbor classification. It learns a positive semidefinite distance metric under which the examples from the k nearest neighbors always belong to the same class, while examples from different classes are separated by a large margin.
To describe the LMNN method, we first introduce some notation. Let {(x_i, y_i)} be a set of data points, where x_i and y_i denote the feature representation and the class label, respectively. Let M be a positive definite matrix that defines a distance metric as dist_M(u, v) = (u − v)ᵀ M (u − v). To learn a distance metric that separates the examples from different classes by a large margin, one needs to extract a set of similar examples (from the same class) and dissimilar examples (from a different class), denoted by triplets (x_i, x_j, x_l), where x_j shares the same class label as x_i and x_l has a different class label from x_i. To this end, for each example x_i one can form the similar set by extracting the k nearest neighbors (defined by the Euclidean distance metric) that share the same class label as x_i, and form the dissimilar set by extracting a set of examples that have a different class label. An appropriate distance metric can then be obtained from the following constrained optimization problem
(20) 
where [·]_+ is a hinge loss and  is a trade-off parameter. In Eq. (20), the positive definite constraint is used to ensure that Assumption A3 holds. Minimizing the first term is equivalent to maximizing the margin between the dissimilar and similar pairs. The matrix  encodes certain prior knowledge about the distance metric; for example, the original LMNN work [30] defines it as a sum over all k-nearest neighbor pairs from the same class. Other works have used a weighted summation of distances between all data pairs [19] or the intra-class covariance matrix [25]. The last term is used as a regularization term and also makes the objective function strongly convex.
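Running Epro-SGD on this formulation only requires a stochastic subgradient of the hinge-loss term, which can be computed from a single sampled triplet; the following NumPy sketch (our illustration, with hypothetical helper names, unit margin, and the triplet hinge form described above) shows the computation:

```python
import numpy as np

def mahalanobis(M, u, v):
    # Squared Mahalanobis distance dist_M(u, v) = (u - v)^T M (u - v).
    d = u - v
    return d @ M @ d

def triplet_subgrad(M, xi, xj, xl, margin=1.0):
    """Subgradient (w.r.t. M) of the hinge term
    [margin + dist_M(xi, xj) - dist_M(xi, xl)]_+
    for one sampled triplet: (xi, xj) similar, (xi, xl) dissimilar."""
    if margin + mahalanobis(M, xi, xj) - mahalanobis(M, xi, xl) > 0:
        dj, dl = xi - xj, xi - xl
        # Active hinge: gradient is a difference of rank-one matrices.
        return np.outer(dj, dj) - np.outer(dl, dl)
    return np.zeros_like(M)     # inactive hinge: zero subgradient

M = np.eye(2)
xi = np.zeros(2)
g = triplet_subgrad(M, xi, np.array([1.0, 0.0]), np.array([1.2, 0.0]))
```

Each sampled triplet thus contributes a rank-two stochastic subgradient, so the per-iteration cost is O(d²) rather than the O(d³) of a positive definite projection, which is precisely where Epro-SGD's reduced projection count pays off.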