I Introduction
Many optimization problems that arise in machine learning, signal processing, and statistical estimation can be formulated as regularized stochastic optimization (also referred to as stochastic composite optimization) problems in which one jointly minimizes the expectation of a stochastic loss function plus a possibly nonsmooth regularization term. Examples include Tikhonov and elastic net regularization, Lasso, sparse logistic regression, and support vector machines [1, 2, 3, 4, 5].
Stochastic approximation methods such as stochastic gradient descent were among the first algorithms developed for solving stochastic optimization problems [6]. Recently, these methods have received significant attention due to their simplicity and effectiveness (see, e.g., [7, 8, 9, 10, 11, 12, 13]). In particular, Nemirovski et al. [7] demonstrated that for nonsmooth stochastic convex optimization problems, a modified stochastic approximation method, the mirror descent, exhibits an unimprovable convergence rate $\mathcal{O}(1/\sqrt{T})$, where $T$ is the number of iterations. Later, Lan [8] developed a mirror descent algorithm for stochastic composite convex problems which explicitly accounts for the smoothness of the loss function and achieves the optimal rate. A similar result for the dual averaging method was obtained by Xiao [9].
The methods for solving stochastic optimization problems cited above are inherently serial in the sense that the gradient computations take place on a single processor which has access to the whole dataset. However, it is increasingly common that a single computer cannot store and handle the amounts of data encountered in practical problems. This has caused a strong interest in developing parallel optimization algorithms which are able to split the data and distribute the computation across multiple processors or multiple computer clusters (see, e.g., [14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31] and references therein).
One simple and popular stochastic approximation method is minibatching, where iterates are updated based on the average gradient with respect to multiple data points rather than on gradients evaluated at a single data point at a time. Recently, Dekel et al. [32] proposed a parallel minibatch algorithm for regularized stochastic optimization problems, in which multiple processors compute gradients in parallel using their own local data, and then aggregate the gradients up a spanning tree to obtain the averaged gradient. While this algorithm can achieve linear speedup in the number of processors, it has the drawback that the processors need to synchronize at each round and, hence, if one of them fails or is slower than the rest, then the entire algorithm runs at the pace of the slowest processor.
In this paper, we propose an asynchronous minibatch algorithm for regularized stochastic optimization problems with smooth loss functions that eliminates the overhead associated with global synchronization. Our algorithm allows multiple processors to work at different rates, perform computations independently of each other, and update global decision variables using out-of-date gradients. A similar model of parallel asynchronous computation was applied to coordinate descent methods for deterministic optimization in [33, 34, 35] and to mirror descent and dual averaging methods for stochastic optimization in [36]. In particular, Agarwal and Duchi [36] analyzed the convergence of asynchronous minibatch algorithms for smooth stochastic convex problems, and showed that bounded delays do not degrade the asymptotic convergence. However, they only considered the case where the regularization term is the indicator function of a compact convex set.
We extend the results of [36] to general regularization functions (like the $\ell_1$ norm, often used to promote sparsity), and establish a sharper expected-value type of convergence rate than the one given in [36]. Specifically, we make the following contributions:

For general convex regularization functions, we show that when the constraint set is closed and convex (but not necessarily bounded), the running average of the iterates generated by our algorithm with constant stepsizes converges at rate $\mathcal{O}(1/T)$ to a ball around the optimum. We derive an explicit expression that quantifies how the convergence rate and the residual error depend on loss function properties and algorithm parameters such as the constant stepsize, the batch size, and the maximum delay bound.

For general convex regularization functions and compact constraint sets, we prove that the running average of the iterates produced by our algorithm with a time-varying stepsize converges to the true optimum (without residual error) at rate
This result improves upon the previously known rate
for delayed stochastic mirror descent methods with time-varying stepsizes given in [36]. In this case, our algorithm enjoys near-linear speedup as long as the number of processors is .

When the regularization function is strongly convex and the constraint set is closed and convex, we establish that the iterates converge at rate
If the number of processors is of the order of , this rate is asymptotically in , which is the best known rate for strongly convex stochastic optimization problems in a serial setting.
The remainder of the paper is organized as follows. In Section II, we introduce the notation and review some preliminaries that are essential for the development of the results in this paper. In Section III, we formulate the problem and discuss our assumptions. The proposed asynchronous minibatch algorithm and its main theoretical results are presented in Section IV. Computational experience is reported in Section V while Section VI concludes the paper.
II Notation and Preliminaries
II-A Notation
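We let $\mathbb{N}$ and $\mathbb{N}_0$ denote the set of natural numbers and the set of natural numbers including zero, respectively. The inner product of two vectors $x, y \in \mathbb{R}^n$ is denoted by $\langle x, y \rangle$. We assume that $\mathbb{R}^n$ is endowed with a norm $\|\cdot\|$, and use $\|\cdot\|_*$ to represent the corresponding dual norm, defined by
$$\|y\|_* = \sup_{\|x\| \leq 1} \langle x, y \rangle.$$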
II-B Preliminaries
Next, we review the key definitions and results necessary for developing the main results of this paper. We start with the definition of a Bregman distance function, also referred to as a prox-function.
Definition 1
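A function $\omega : X \to \mathbb{R}$ is called a distance generating function with modulus $\mu_\omega > 0$ with respect to norm $\|\cdot\|$, if $\omega$ is continuously differentiable and $\mu_\omega$-strongly convex with respect to $\|\cdot\|$ over the set $X \subseteq \mathbb{R}^n$. That is, for all $x, y \in X$,
$$\omega(y) \geq \omega(x) + \langle \nabla\omega(x), y - x \rangle + \frac{\mu_\omega}{2}\|y - x\|^2.$$
Every distance generating function introduces a corresponding Bregman distance function
$$D_\omega(x, y) := \omega(y) - \omega(x) - \langle \nabla\omega(x), y - x \rangle.$$
For example, choosing $\omega(x) = \frac{1}{2}\|x\|_2^2$, which is $1$-strongly convex with respect to the $\ell_2$-norm over any convex set $X$, would result in $D_\omega(x, y) = \frac{1}{2}\|x - y\|_2^2$. Another common example of distance generating functions is the entropy function
$$\omega(x) = \sum_{i=1}^{n} x_i \log x_i,$$
which is $1$-strongly convex with respect to the $\ell_1$-norm over the standard simplex
$$\Delta := \Big\{ x \in \mathbb{R}^n : \sum_{i=1}^{n} x_i = 1,\ x_i \geq 0 \Big\},$$
and its associated Bregman distance function is
$$D_\omega(x, y) = \sum_{i=1}^{n} y_i \log \frac{y_i}{x_i}.$$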
The main motivation to use a generalized distance generating function, instead of the usual Euclidean distance function, is to design optimization algorithms that can take advantage of the geometry of the feasible set (see, e.g., [37, 7, 38, 39]).
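As a minimal sketch of the two examples above, the following Python snippet evaluates a Bregman distance numerically. It assumes the convention $D_\omega(x, y) = \omega(y) - \omega(x) - \langle \nabla\omega(x), y - x \rangle$; the function names (`bregman`, `sq_norm`, `neg_entropy`) and the test points are illustrative choices, not part of the paper.

```python
import math

def bregman(omega, grad_omega, x, y):
    """Bregman distance D(x, y) = omega(y) - omega(x) - <grad omega(x), y - x>."""
    g = grad_omega(x)
    inner = sum(gi * (yi - xi) for gi, xi, yi in zip(g, x, y))
    return omega(y) - omega(x) - inner

# Euclidean distance generating function: omega(x) = 0.5 * ||x||_2^2
def sq_norm(x):
    return 0.5 * sum(v * v for v in x)

def sq_norm_grad(x):
    return list(x)

# Entropy distance generating function: omega(x) = sum_i x_i * log(x_i)
def neg_entropy(x):
    return sum(v * math.log(v) for v in x)

def neg_entropy_grad(x):
    return [math.log(v) + 1.0 for v in x]

# Two points in the interior of the standard simplex (illustrative values)
x = [0.2, 0.3, 0.5]
y = [0.5, 0.25, 0.25]

d_euclid = bregman(sq_norm, sq_norm_grad, x, y)
d_entropy = bregman(neg_entropy, neg_entropy_grad, x, y)

# Closed forms: half squared Euclidean distance and Kullback-Leibler divergence
half_sq_dist = 0.5 * sum((xi - yi) ** 2 for xi, yi in zip(x, y))
kl = sum(yi * math.log(yi / xi) for xi, yi in zip(x, y))
l1_dist = sum(abs(xi - yi) for xi, yi in zip(x, y))

print(d_euclid, half_sq_dist)
print(d_entropy, kl)
```

On the simplex, the entropy-based distance coincides with the Kullback-Leibler divergence and dominates half the squared $\ell_1$ distance, which is the strong convexity property used in the text.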
Remark 1
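The strong convexity of the distance generating function always ensures that
$$D_\omega(x, y) \geq \frac{\mu_\omega}{2}\|x - y\|^2 \quad \text{for all } x, y \in X,$$
and $D_\omega(x, y) = 0$ if and only if $x = y$.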
Remark 2
Throughout the paper, there is no loss of generality to assume that $\mu_\omega = 1$. Indeed, if $\mu_\omega \neq 1$, we can choose the scaled function $\bar{\omega}(x) = \omega(x)/\mu_\omega$, which has modulus $1$, to generate the Bregman distance function.
The following definition introduces subgradients of proper convex functions.
Definition 2
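For a convex function $h : \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$, a vector $s \in \mathbb{R}^n$ is called a subgradient of $h$ at $x \in \mathbb{R}^n$ if
$$h(y) \geq h(x) + \langle s, y - x \rangle \quad \text{for all } y \in \mathbb{R}^n.$$
The set of all subgradients of $h$ at $x$ is called the subdifferential of $h$ at $x$, and is denoted by $\partial h(x)$.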
III Problem Setup
We consider stochastic convex optimization problems of the form
$$\underset{x}{\text{minimize}} \quad \phi(x) := \mathbb{E}_{\xi}[F(x, \xi)] + \Psi(x). \tag{1}$$
Here, $x \in \mathbb{R}^n$ is the decision variable, $\xi$ is a random vector whose probability distribution $P$ is supported on a set $\Xi \subseteq \mathbb{R}^m$, $F(\cdot, \xi)$ is convex and differentiable for each $\xi \in \Xi$, and $\Psi$ is a proper convex function that may be nonsmooth and extended real-valued. Let us define
$$f(x) := \mathbb{E}_{\xi}[F(x, \xi)] = \int_{\Xi} F(x, \xi)\, dP(\xi). \tag{2}$$
Note that the expectation function $f$ is convex, differentiable, and $\nabla f(x) = \mathbb{E}_{\xi}[\nabla_x F(x, \xi)]$ [40]. We use $X^\star$ to denote the set of optimal solutions of Problem (1) and $\phi^\star$ to denote the corresponding optimal value.
A difficulty when solving optimization problem (1) is that the distribution $P$ is often unknown, so the expectation (2) cannot be computed. This situation occurs frequently in data-driven applications such as machine learning. To support these applications, we do not assume knowledge of $f$ (or of $P$), only access to a stochastic oracle. Each time the oracle is queried with an $x \in \mathbb{R}^n$, it generates an independent and identically distributed (i.i.d.) sample $\xi$ from $P$ and returns $\nabla_x F(x, \xi)$.
We also impose the following assumptions on Problem (1).
Assumption 1 (Existence of a minimum)
The optimal set is nonempty.
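Assumption 2 (Lipschitz continuity of $\nabla F$)
For each $\xi \in \Xi$, the function $F(\cdot, \xi)$ has Lipschitz continuous gradient with constant $L$. That is, for all $x, y \in \mathbb{R}^n$,
$$\|\nabla_x F(x, \xi) - \nabla_x F(y, \xi)\|_* \leq L \|x - y\|.$$
Assumption 3 (Bounded gradient variance)
There exists a constant $\sigma \geq 0$ such that
$$\mathbb{E}_{\xi}\big[\|\nabla_x F(x, \xi) - \nabla f(x)\|_*^2\big] \leq \sigma^2 \quad \text{for all } x \in \mathbb{R}^n.$$
Assumption 4 (Closed effective domain of $\Psi$)
The function $\Psi$ is simple and lower semi-continuous, and its effective domain, $\operatorname{dom} \Psi = \{x \in \mathbb{R}^n : \Psi(x) < +\infty\}$, is closed.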
Possible choices of include:

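Unconstrained smooth minimization: $\Psi(x) = 0$.

Constrained smooth minimization: $\Psi$ is the indicator function of a nonempty closed convex set $C \subseteq \mathbb{R}^n$, i.e., $\Psi(x) = I_C(x)$, where $I_C(x) = 0$ if $x \in C$ and $I_C(x) = +\infty$ otherwise.

$\ell_1$-regularized minimization: $\Psi(x) = \lambda \|x\|_1$ with $\lambda > 0$.

Constrained $\ell_1$-regularized minimization: In this case, $\Psi(x) = \lambda \|x\|_1 + I_C(x)$ with $\lambda > 0$.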
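To illustrate what "simple" means for the choices listed above, the sketch below solves the composite update in closed form under two assumptions that are not taken from the paper: the distance generating function is Euclidean, so the update reduces to a proximal step, and the constraint set is a box $[lo, hi]^n$. The helper names (`soft_threshold`, `clip`, `composite_step`) are hypothetical.

```python
import math

def soft_threshold(v, t):
    """Componentwise shrinkage: sign(v_i) * max(|v_i| - t, 0)."""
    return [math.copysign(max(abs(vi) - t, 0.0), vi) for vi in v]

def clip(v, lo, hi):
    """Euclidean projection onto the box [lo, hi]^n."""
    return [min(max(vi, lo), hi) for vi in v]

def composite_step(x, g, gamma, lam=0.0, box=None):
    """Solve argmin_z <g, z> + lam*||z||_1 + I_box(z) + ||z - x||_2^2 / (2*gamma).

    With a separable regularizer, the minimizer is obtained coordinate by
    coordinate: a gradient step, then shrinkage, then projection onto the box.
    """
    z = [xi - gamma * gi for xi, gi in zip(x, g)]
    if lam > 0.0:
        z = soft_threshold(z, gamma * lam)
    if box is not None:
        z = clip(z, box[0], box[1])
    return z

x = [1.0, -0.2, 0.05]
g = [0.0, 0.0, 0.0]

print(composite_step(x, g, gamma=0.5))                             # Psi = 0
print(composite_step(x, g, gamma=0.5, box=(-0.5, 0.5)))            # indicator of a box
print(composite_step(x, g, gamma=0.5, lam=0.2))                    # l1 regularization
print(composite_step(x, g, gamma=0.5, lam=0.2, box=(-0.5, 0.5)))   # l1 plus box
```

Applying shrinkage and then clipping is exact here because the problem separates into one-dimensional strongly convex problems, whose constrained minimizer is the projection of the unconstrained one onto the interval.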
Several practical problems in machine learning, statistical applications, and signal processing satisfy Assumptions 1–4 (see, e.g., [2, 3, 4]). One such example is $\ell_1$-regularized logistic regression for sparse binary classification. We are then given a large number of observations
$$\{\xi_j = (a_j, y_j) : a_j \in \mathbb{R}^n,\ y_j \in \{-1, +1\}\},$$
drawn i.i.d. from an unknown distribution $P$, and want to solve the minimization problem (1) with
$$F(x, \xi) = \log\big(1 + \exp(-y \langle a, x \rangle)\big)$$
and $\Psi(x) = \lambda \|x\|_1$. The role of $\ell_1$ regularization is to produce sparse solutions.
One approach for solving Problem (1) is the serial minibatch method based on the mirror descent scheme [32]. Given a point $x(k)$, a single processor updates the decision variable by sampling $b$ i.i.d. random variables $\xi_1, \ldots, \xi_b$ from $P$, computing the averaged stochastic gradient
$$g_{\mathrm{ave}}(k) = \frac{1}{b} \sum_{i=1}^{b} \nabla_x F(x(k), \xi_i),$$
and performing the composite mirror descent update
$$x(k+1) = \operatorname*{argmin}_{x} \Big\{ \langle g_{\mathrm{ave}}(k), x \rangle + \Psi(x) + \frac{1}{\gamma(k)} D_\omega(x(k), x) \Big\},$$
where $\gamma(k)$ is a positive stepsize parameter. Under Assumptions 1–4 and choosing an appropriate stepsize, this algorithm is guaranteed to converge to the optimum [32, Theorem 9]. However, in many emerging applications, such as large-scale machine learning and statistics, the dataset is so large that it cannot fit on one machine. Hence, we need optimization algorithms that can be conveniently and efficiently executed in parallel on multiple processors.
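The serial minibatch scheme described above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the implementation of [32]: the loss is the logistic loss with an $\ell_1$ regularizer, the distance generating function is Euclidean (so the update is a soft-thresholding step), the stepsize is constant, and the synthetic data, parameter values, and function names are all hypothetical.

```python
import math
import random

def logistic_grad(x, a, y):
    """Gradient in x of log(1 + exp(-y * <a, x>)), computed stably."""
    margin = y * sum(ai * xi for ai, xi in zip(a, x))
    if margin >= 0:
        e = math.exp(-margin)
        s = e / (1.0 + e)
    else:
        s = 1.0 / (1.0 + math.exp(margin))
    return [-y * s * ai for ai in a]

def objective(x, data, lam):
    """Empirical logistic loss plus lam * ||x||_1."""
    total = 0.0
    for a, y in data:
        m = y * sum(ai * xi for ai, xi in zip(a, x))
        total += math.log1p(math.exp(-m)) if m >= 0 else -m + math.log1p(math.exp(m))
    return total / len(data) + lam * sum(abs(xi) for xi in x)

def serial_minibatch(data, dim, lam, gamma, batch_size, num_iters, seed=0):
    """Serial minibatch composite update with a Euclidean prox-function."""
    rng = random.Random(seed)
    x = [0.0] * dim
    x_sum = [0.0] * dim
    for _ in range(num_iters):
        # Sample a minibatch and average the stochastic gradients at x
        batch = [rng.choice(data) for _ in range(batch_size)]
        g = [0.0] * dim
        for a, y in batch:
            for i, gi in enumerate(logistic_grad(x, a, y)):
                g[i] += gi / batch_size
        # Gradient step followed by soft-thresholding (prox of gamma*lam*||.||_1)
        v = [xi - gamma * gi for xi, gi in zip(x, g)]
        x = [math.copysign(max(abs(vi) - gamma * lam, 0.0), vi) for vi in v]
        x_sum = [s + xi for s, xi in zip(x_sum, x)]
    return x, [s / num_iters for s in x_sum]

# Synthetic sparse classification data (hypothetical, for illustration only)
gen = random.Random(1)
x_true = [2.0, -1.0, 0.0, 0.0, 0.0]
data = []
for _ in range(500):
    a = [gen.gauss(0.0, 1.0) for _ in range(5)]
    score = sum(ai * ti for ai, ti in zip(a, x_true)) + 0.1 * gen.gauss(0.0, 1.0)
    data.append((a, 1 if score >= 0 else -1))

x_last, x_avg = serial_minibatch(data, dim=5, lam=0.01, gamma=0.5,
                                 batch_size=20, num_iters=300)
print(objective([0.0] * 5, data, 0.01), objective(x_avg, data, 0.01))
```

The function returns both the last iterate and the running average, since the convergence results in this paper are stated for the running average of the iterates.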
IV An Asynchronous Mini-Batch Algorithm
In this section, we will present an asynchronous minibatch algorithm that exploits multiple processors to solve Problem (1). We characterize the iteration complexity and the convergence rate of the proposed algorithm, and show that these compare favourably with the state of the art.
IV-A Description of Algorithm
We assume $p$ processors have access to a shared memory for the decision variable $x$. The processors may have different capabilities (in terms of processing power and access to data) and are able to update $x$ without the need for coordination or synchronization. Conceptually, the algorithm lets each processor run its own stochastic composite mirror descent process, repeating the following steps:

1) Read $x$ from the shared memory and load it into the local storage location $\hat{x}$;

2) Sample $b$ i.i.d. random variables $\xi_1, \ldots, \xi_b$ from the distribution $P$;

3) Compute the averaged stochastic gradient vector
$$g_{\mathrm{ave}} = \frac{1}{b} \sum_{i=1}^{b} \nabla_x F(\hat{x}, \xi_i);$$

4) Update current $x$ in the shared memory via
$$x \leftarrow \operatorname*{argmin}_{z} \Big\{ \langle g_{\mathrm{ave}}, z \rangle + \Psi(z) + \frac{1}{\gamma} D_\omega(x, z) \Big\}.$$
The algorithm can be implemented in many ways as depicted in Figure 1. One way is to consider the processors as peers that each execute the four-step algorithm independently of each other and only share the global memory for storing $x$. In this case, each processor reads the decision vector twice in each round: once in the first step (before evaluating the averaged gradient), and once in the last step (before carrying out the minimization). To ensure correctness, Step 4 must be an atomic operation, where the executing processor puts a write lock on the global memory until it has written back the result of the minimization (cf. Figure 1, left). The algorithm can also be executed in a master-worker setting. In this case, each of the worker nodes retrieves $x$ from the master in Step 1 and returns the averaged gradient to the master in Step 3; the fourth step (carrying out the minimization) is executed by the master (cf. Figure 1, right).
Independently of how we choose to implement the algorithm, processors may work at different rates: while one processor updates the decision vector (in the shared memory setting) or sends its averaged gradient to the master (in the master-worker setting), the others are generally busy computing averaged gradient vectors. The processors that perform gradient evaluations do not need to be aware of updates to the decision vector, but can continue to operate on stale information about $x$. Therefore, unlike synchronous parallel minibatch algorithms [32], there is no need for processors to wait for each other to finish the gradient computations. Moreover, the value at which the average of gradients is evaluated by a processor may differ from the value of $x$ to which the update is applied.
Algorithm 1 describes the asynchronous processes that run in parallel. To describe the progress of the overall optimization process, we introduce a counter $k$ that is incremented each time $x$ is updated. We let $d(k)$ denote the time at which the copy of $x$ used to compute the averaged gradient involved in the update of $x(k)$ was read from the shared memory. It is clear that $d(k) \leq k$ for all $k \in \mathbb{N}_0$. The value
$$\tau(k) := k - d(k)$$
can be viewed as the delay between reading and updating for processors and captures the staleness of the information used to compute the average of gradients for the $k$th update. We assume that the delay is not too long, i.e., there is a nonnegative integer $\tau_{\max}$ such that
$$0 \leq \tau(k) \leq \tau_{\max}.$$
The value of is an indicator of the asynchronism in the algorithm and in the execution platform. In practice, will depend on the number of parallel processors used in the algorithm [33, 34, 35]. Note that the cyclic-delay minibatch algorithm [36], in which the processors are ordered and each updates the decision variable under a fixed schedule, is a special case of Algorithm 1 where , or, equivalently, for all .
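The effect of bounded delays can be explored with a serial simulation, sketched below. It is not the paper's implementation: asynchrony is emulated by evaluating each minibatch gradient at an iterate that is between 0 and `tau_max` updates old, the loss is a least-squares loss with an $\ell_1$ regularizer (a Lasso-type instance), the prox-function is Euclidean, the stepsize is constant, and the data and all names (`async_minibatch`, `tau_max`, `history`) are hypothetical.

```python
import math
import random

def minibatch_grad(x, batch):
    """Average gradient of 0.5 * (<a, x> - b)^2 over a minibatch."""
    g = [0.0] * len(x)
    for a, b in batch:
        r = sum(ai * xi for ai, xi in zip(a, x)) - b
        for i, ai in enumerate(a):
            g[i] += r * ai / len(batch)
    return g

def objective(x, data, lam):
    """Empirical least-squares loss plus lam * ||x||_1."""
    loss = sum(0.5 * (sum(ai * xi for ai, xi in zip(a, x)) - b) ** 2
               for a, b in data) / len(data)
    return loss + lam * sum(abs(xi) for xi in x)

def async_minibatch(data, dim, lam, gamma, batch_size, tau_max, num_updates, seed=0):
    """Simulate updates that use gradients evaluated at stale iterates."""
    rng = random.Random(seed)
    history = [[0.0] * dim]          # history[k] stores the iterate x(k)
    delays = []
    for k in range(num_updates):
        # The gradient for update k was computed at x(k - tau), tau <= tau_max
        tau = rng.randint(0, min(tau_max, k))
        stale_x = history[k - tau]
        batch = [rng.choice(data) for _ in range(batch_size)]
        g = minibatch_grad(stale_x, batch)
        # The update itself is applied to the current iterate x(k)
        v = [xi - gamma * gi for xi, gi in zip(history[k], g)]
        x_next = [math.copysign(max(abs(vi) - gamma * lam, 0.0), vi) for vi in v]
        history.append(x_next)
        delays.append(tau)
    x_avg = [sum(col) / num_updates for col in zip(*history[1:])]
    return history[-1], x_avg, delays

# Synthetic sparse regression data (hypothetical, for illustration only)
gen = random.Random(1)
x_true = [2.0, -1.0, 0.0, 0.0, 0.0]
data = []
for _ in range(500):
    a = [gen.gauss(0.0, 1.0) for _ in range(5)]
    b = sum(ai * ti for ai, ti in zip(a, x_true)) + 0.1 * gen.gauss(0.0, 1.0)
    data.append((a, b))

x_last, x_avg, delays = async_minibatch(data, dim=5, lam=0.01, gamma=0.05,
                                        batch_size=10, tau_max=5, num_updates=400)
print(max(delays), objective([0.0] * 5, data, 0.01), objective(x_last, data, 0.01))
```

Setting `tau_max=0` recovers the serial minibatch method, which makes it easy to compare runs with and without stale gradients on the same data.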
(3)  
IV-B Convergence Rate for General Convex Regularization
The following theorem establishes convergence properties of Algorithm 1 when a constant stepsize is used.
Theorem 1
See Appendix A.
Theorem 1 demonstrates that for any constant stepsize satisfying (4), the running average of iterates generated by Algorithm 1 will converge in expectation to a ball around the optimum at a rate of . The convergence rate and the residual error depend on the choice of : decreasing reduces the residual error, but it also results in a slower convergence. We now describe a possible strategy for selecting the constant stepsize. Let be the total number of iterations necessary to achieve optimal solution to Problem (1), that is, when . If we pick
(5) 
it follows from Theorem 1 that the corresponding satisfies
where . This inequality tells us that if the first term on the right-hand side is less than , i.e., if
then . Hence, the iteration complexity of Algorithm 1 with the stepsize choice (5) is given by
(6) 
As long as the maximum delay bound is of the order , the first term in (6) is asymptotically negligible, and hence the iteration complexity of Algorithm 1 is asymptotically , which is exactly the iteration complexity achieved by the minibatch algorithm for solving stochastic convex optimization problems in a serial setting [32]. As discussed before, is related to the number of processors used in the algorithm. Therefore, if the number of processors is of the order of , parallelization does not appreciably degrade asymptotic convergence of Algorithm 1. Furthermore, as processors are being run in parallel, updates occur roughly times as quickly and in time scaling as , the processors may compute averaged gradient vectors (instead of vectors). This means that a near-linear speedup in the number of processors can be expected.
Remark 3
Another strategy for the selection of the constant stepsize in Algorithm 1 is to use that depends on the prior knowledge of the number of iterations to be performed. More precisely, assume that the number of iterations is fixed in advance, say equal to . By choosing as
for some , it follows from Theorem 1 that the running average of the iterates after iterations satisfies
It is easy to verify that the optimal choice of , which minimizes the second term on the right-hand side of the above inequality, is
With this choice of , we then have
In the case that , the preceding guaranteed bound reduces to the one obtained in [8, Theorem 1] for the serial stochastic mirror descent algorithm with constant stepsizes. Note that in order to implement Algorithm 1 with the optimal constant stepsize policy, we need to estimate an upper bound on , since is usually unknown.
The following theorem characterizes the convergence of Algorithm 1 with a time-varying stepsize sequence when is bounded in addition to being closed and convex.
Theorem 2
See Appendix B.
The time-varying stepsize , which ensures the convergence of the algorithm, consists of two terms: the time-varying term should control the errors from stochastic gradient information while the role of the constant term () is to decrease the effects of asynchrony (bounded delays) on the convergence of the algorithm. According to Theorem 2, in the case that , the delay becomes increasingly harmless as the algorithm progresses and the expected function value evaluated at converges asymptotically at a rate , which is known to be the best achievable rate of the mirror descent method for nonsmooth stochastic convex optimization problems [7].
For the special case of the optimization problem (1) where is restricted to be the indicator function of a compact convex set, Agarwal and Duchi [36, Theorem 2] showed that the convergence rate of the delayed stochastic mirror descent method with time-varying stepsize is
where is the maximum bound on . Comparing with this result, instead of an asymptotic penalty of the form due to the delays, we have the penalty , which is much smaller for large . Therefore, not only do we extend the result of [36] to general regularization functions, but we also obtain a sharper guaranteed convergence rate than the one presented in [36].
IV-C Convergence Rate for Strongly Convex Regularization
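In this subsection, we restrict our attention to stochastic composite optimization problems with strongly convex regularization terms. Specifically, we assume that $\Psi$ is $\mu_\Psi$-strongly convex with respect to $\|\cdot\|$, that is, for any $x, y \in \operatorname{dom} \Psi$,
$$\Psi(y) \geq \Psi(x) + \langle s, y - x \rangle + \frac{\mu_\Psi}{2}\|y - x\|^2 \quad \text{for all } s \in \partial\Psi(x).$$
Examples of the strongly convex function $\Psi$ include:

$\ell_2$-regularization: $\Psi(x) = \frac{\lambda}{2}\|x\|_2^2$ with $\lambda > 0$.

Elastic net regularization: $\Psi(x) = \lambda_1 \|x\|_1 + \frac{\lambda_2}{2}\|x\|_2^2$ with $\lambda_1 > 0$ and $\lambda_2 > 0$.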
Remark 4
In order to derive the convergence rate of Algorithm 1 for solving (1) with a strongly convex regularization term, we need to assume that the Bregman distance function used in the algorithm satisfies the next assumption.
Assumption 5 (Quadratic growth condition)
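For all $x, y \in \operatorname{dom} \Psi$, we have
$$D_\omega(x, y) \leq \frac{Q}{2}\|x - y\|^2,$$
with $Q \geq 1$.
For example, if $\omega(x) = \frac{1}{2}\|x\|_2^2$, then $D_\omega(x, y) = \frac{1}{2}\|x - y\|_2^2$ and $Q = 1$. Note that Assumption 5 will automatically hold when the distance generating function $\omega$ has Lipschitz continuous gradient with a constant $Q$ [12].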
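The last claim follows from a standard one-line argument, sketched here under assumed notation: $\omega$ denotes the distance generating function, $Q$ the Lipschitz constant of $\nabla\omega$ (measured from $\|\cdot\|$ to the dual norm $\|\cdot\|_*$), and the Bregman distance is taken as $D_\omega(x, y) = \omega(y) - \omega(x) - \langle \nabla\omega(x), y - x \rangle$.

```latex
\begin{align*}
D_\omega(x, y)
  &= \omega(y) - \omega(x) - \langle \nabla\omega(x),\, y - x \rangle \\
  &= \int_0^1 \big\langle \nabla\omega\big(x + t(y - x)\big) - \nabla\omega(x),\; y - x \big\rangle \, dt \\
  &\le \int_0^1 \big\| \nabla\omega\big(x + t(y - x)\big) - \nabla\omega(x) \big\|_* \, \| y - x \| \, dt \\
  &\le \int_0^1 Q \, t \, \| y - x \|^2 \, dt
   \;=\; \frac{Q}{2} \, \| y - x \|^2 .
\end{align*}
```

The second line is the fundamental theorem of calculus applied along the segment from $x$ to $y$, the third uses the definition of the dual norm, and the fourth uses the Lipschitz continuity of $\nabla\omega$.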
The associated convergence result now reads as follows.
Theorem 3
See Appendix C.
An interesting point regarding Theorem 3 is that for solving stochastic composite optimization problems with strongly convex regularization functions, the maximum delay bound can be as large as without affecting the asymptotic convergence rate of Algorithm 1. In this case, our asynchronous minibatch algorithm converges asymptotically at a rate of , which matches the best known rate achievable in a serial setting.
V Experimental Results
We have developed a complete master-worker implementation of our algorithm in C/C++ using the Message Passing Interface libraries (OpenMPI). Although we argued in Section IV that Algorithm 1 can be implemented using atomic operations on shared-memory computing architectures, we have chosen the MPI implementation due to its flexibility in scaling the problem to distributed-memory environments.
We evaluated our algorithm on a document classification problem using the text categorization dataset rcv1 [42]. This dataset consists of documents, with
unique stemmed tokens spanning 103 topics. Out of these topics, we decided to classify sports-related documents. To this end, we trained a sparse (binary) classifier by solving the following
$\ell_1$-regularized logistic regression problem
Here, is the sparse vector of token weights assigned to each document, and indicates whether a selected document is sports-related, or not ( is if the document is about sport, otherwise). To evaluate scalability, we used both the training and test sets available when solving the optimization problem. We implemented Algorithm 1 with time-varying stepsizes, and used a batch size of 1000 documents. The regularization parameter was set to , and the algorithm was run until a fixed tolerance was met.
Figure 2 presents the achieved relative speedup of the algorithm with respect to the number of workers used. The relative speedup of the algorithm on processors is defined as , where and are the time it takes to run the corresponding algorithm (to accuracy) on 1 and processing units, respectively. We observe a near-linear relative speedup, consistent with our theoretical results. The timings are averaged over 10 Monte Carlo runs.
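As a small worked example of the speedup metric, the snippet below computes the ratio of the single-worker running time to the $p$-worker running time, together with the corresponding parallel efficiency. The timings are made-up placeholder values for illustration only and are not measurements from the paper; the function names are likewise hypothetical.

```python
def relative_speedup(t_one, t_p):
    """Ratio of the running time on 1 processing unit to that on p units."""
    return t_one / t_p

def parallel_efficiency(t_one, t_p, p):
    """Relative speedup divided by the number of processing units."""
    return relative_speedup(t_one, t_p) / p

# Hypothetical wall-clock times in seconds, keyed by number of workers
timings = {1: 120.0, 2: 62.0, 4: 32.0, 8: 17.0}

for p in sorted(timings):
    s = relative_speedup(timings[1], timings[p])
    e = parallel_efficiency(timings[1], timings[p], p)
    print(f"p={p}: speedup={s:.2f}, efficiency={e:.2f}")
```

An efficiency close to 1 for every $p$ corresponds to the near-linear speedup regime discussed in the text.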
VI Conclusions
We have proposed an asynchronous minibatch algorithm that exploits multiple processors to solve regularized stochastic optimization problems with smooth loss functions. We have established that for closed and convex constraint sets, the iteration complexity of the algorithm with constant stepsizes is asymptotically . For compact constraint sets, we have proved that the running average of the iterates generated by our algorithm with time-varying stepsize converges to the optimum at a rate . When the regularization function is strongly convex and the constraint set is closed and convex, the algorithm achieves the rate of the order . We have shown that the penalty in convergence rate of the algorithm due to asynchrony is asymptotically negligible and a near-linear speedup in the number of processors can be expected. Our computational experience confirmed the theory.
In this section, we prove the main results of the paper, namely, Theorems 1–3. We first state three key lemmas which are instrumental in our argument.
The following result establishes an important recursion for the iterates generated by Algorithm 1.
Lemma 1
We start with the first-order optimality condition for the point in the minimization problem (3): there exists a subgradient such that for all , we have
where denotes the partial derivative of the Bregman distance function with respect to the second variable. Plugging the following equality
into the previous inequality and rearranging terms gives
(8) 
where the last inequality used
by the (strong) convexity of . We now use the following well-known three-point identity of the Bregman distance function [43] to rewrite the left-hand side of (8):
From this relation, with , , and , we have
Substituting the preceding equality into (8) and rearranging terms results in
Since the distance generating function is strongly convex, we have the lower bound
which implies that
(9) 
The essential idea in the rest of the proof is to use convexity and smoothness of the expectation function to bound for each and each . According to Assumption 2, and, hence, are Lipschitz continuous with the constant . By using the Lipschitz continuity of and then the convexity of , we have
(10) 
for any . Combining inequalities (9) and (10), and recalling that , we obtain
We now rewrite the above inequality in terms of the error as follows:
(11) 
We will seek upper bounds on the quantities and . Let be a sequence of positive numbers. For , we have
(12) 
where the second inequality follows from the Fenchel-Young inequality applied to the conjugate pair and , i.e.,
We now turn to . It follows from definition that
Then, by the convexity of the norm , we conclude that
(13) 
where the last inequality comes from our assumption that for all . Substituting inequalities (12) and (13) into the bound (11) and simplifying yields
Setting , where