) problems in which one jointly minimizes the expectation of a stochastic loss function plus a possibly nonsmooth regularization term. Examples include Tikhonov and elastic net regularization, Lasso, sparse logistic regression, and support vector machines [1, 2, 3, 4, 5].
Stochastic approximation methods such as stochastic gradient descent were among the first algorithms developed for solving stochastic optimization problems. Recently, these methods have received significant attention due to their simplicity and effectiveness (see, e.g., [7, 8, 9, 10, 11, 12, 13]). In particular, Nemirovski et al. demonstrated that for nonsmooth stochastic convex optimization problems, a modified stochastic approximation method, mirror descent, exhibits an unimprovable convergence rate of $O(1/\sqrt{T})$, where $T$ is the number of iterations. Later, Lan developed a mirror descent algorithm for stochastic composite convex problems which explicitly accounts for the smoothness of the loss function and achieves the optimal rate. A similar result for the dual averaging method was obtained by Xiao.
The methods for solving stochastic optimization problems cited above are inherently serial in the sense that the gradient computations take place on a single processor which has access to the whole dataset. However, it is increasingly common that a single computer is unable to store and handle the amounts of data encountered in practical problems. This has caused a strong interest in developing parallel optimization algorithms which are able to split the data and distribute the computation across multiple processors or multiple computer clusters (see, e.g., [14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31] and references therein).
One simple and popular stochastic approximation method is mini-batching, where iterates are updated based on the average gradient with respect to multiple data points rather than on gradients evaluated at a single data point at a time. Recently, Dekel et al. proposed a parallel mini-batch algorithm for regularized stochastic optimization problems, in which multiple processors compute gradients in parallel using their own local data, and then aggregate the gradients up a spanning tree to obtain the averaged gradient. While this algorithm can achieve linear speedup in the number of processors, it has the drawback that the processors need to synchronize at each round and, hence, if one of them fails or is slower than the rest, then the entire algorithm runs at the pace of the slowest processor.
In this paper, we propose an asynchronous mini-batch algorithm for regularized stochastic optimization problems with smooth loss functions that eliminates the overhead associated with global synchronization. Our algorithm allows multiple processors to work at different rates, perform computations independently of each other, and update global decision variables using out-of-date gradients. A similar model of parallel asynchronous computation was applied to coordinate descent methods for deterministic optimization in [33, 34, 35] and mirror descent and dual averaging methods for stochastic optimization in . In particular, Agarwal and Duchi  have analyzed the convergence of asynchronous mini-batch algorithms for smooth stochastic convex problems, and interestingly shown that bounded delays do not degrade the asymptotic convergence. However, they only considered the case where the regularization term is the indicator function of a compact convex set.
We extend the results of [36] to general regularization functions (like the $\ell_1$ norm, often used to promote sparsity), and establish a sharper expected-value type of convergence rate than the one given in [36]. Specifically, we make the following contributions:
For general convex regularization functions, we show that when the constraint set is closed and convex (but not necessarily bounded), the running average of the iterates generated by our algorithm with constant step-sizes converges at rate $O(1/T)$ in the number of iterations $T$ to a ball around the optimum. We derive an explicit expression that quantifies how the convergence rate and the residual error depend on loss function properties and algorithm parameters such as the constant step-size, the batch size, and the maximum delay bound $\tau_{\max}$.
For general convex regularization functions and compact constraint sets, we prove that the running average of the iterates produced by our algorithm with a time-varying step-size converges to the true optimum (without residual error) at rate
This result improves upon the previously known rate
for delayed stochastic mirror descent methods with time-varying step-sizes given in . In this case, our algorithm enjoys near-linear speedup as long as the number of processors is .
When the regularization function is strongly convex and the constraint set is closed and convex, we establish that the iterates converge at rate
If the number of processors is of the order of , this rate is asymptotically in , which is the best known rate for strongly convex stochastic optimization problems in a serial setting.
The remainder of the paper is organized as follows. In Section II, we introduce the notation and review some preliminaries that are essential for the development of the results in this paper. In Section III, we formulate the problem and discuss our assumptions. The proposed asynchronous mini-batch algorithm and its main theoretical results are presented in Section IV. Computational experience is reported in Section V while Section VI concludes the paper.
II Notation and Preliminaries
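We let $\mathbb{N}$ and $\mathbb{N}_0$ denote the set of natural numbers and the set of natural numbers including zero, respectively. The inner product of two vectors $x, y \in \mathbb{R}^n$ is denoted by $\langle x, y \rangle$. We assume that $\mathbb{R}^n$ is endowed with a norm $\|\cdot\|$, and use $\|\cdot\|_*$ to represent the corresponding dual norm, defined by
$$\|y\|_* = \sup_{\|x\| \le 1} \langle x, y \rangle.$$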
Next, we review the key definitions and results necessary for developing the main results of this paper. We start with the definition of a Bregman distance function, also referred to as a prox-function.
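A function $\omega : X \to \mathbb{R}$ is called a distance generating function with modulus $\mu_\omega > 0$ with respect to norm $\|\cdot\|$, if $\omega$ is continuously differentiable and $\mu_\omega$-strongly convex with respect to $\|\cdot\|$ over the set $X \subseteq \mathbb{R}^n$. That is, for all $x, y \in X$,
$$\omega(y) \ge \omega(x) + \langle \nabla\omega(x), y - x \rangle + \frac{\mu_\omega}{2}\|y - x\|^2.$$
Every distance generating function introduces a corresponding Bregman distance function
$$D_\omega(x, y) := \omega(y) - \omega(x) - \langle \nabla\omega(x), y - x \rangle.$$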
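For example, choosing $\omega(x) = \frac{1}{2}\|x\|_2^2$, which is $1$-strongly convex with respect to the $\ell_2$-norm over any convex set $X$, would result in $D_\omega(x, y) = \frac{1}{2}\|x - y\|_2^2$. Another common example of distance generating functions is the entropy function
$$\omega(x) = \sum_{i=1}^{n} x_i \log x_i,$$
which is $1$-strongly convex with respect to the $\ell_1$-norm over the standard simplex
$$\Delta := \Big\{ x \in \mathbb{R}^n : \sum_{i=1}^{n} x_i = 1,\ x \ge 0 \Big\},$$
and its associated Bregman distance function is
$$D_\omega(x, y) = \sum_{i=1}^{n} y_i \log\frac{y_i}{x_i}.$$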
The main motivation to use a generalized distance generating function, instead of the usual Euclidean distance function, is to design optimization algorithms that can take advantage of the geometry of the feasible set (see, e.g., [37, 7, 38, 39]).
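The strong convexity of the distance generating function always ensures that
$$D_\omega(x, y) \ge \frac{\mu_\omega}{2}\|x - y\|^2 \quad \text{for all } x, y \in X,$$
and $D_\omega(x, y) = 0$ if and only if $x = y$.
Throughout the paper, there is no loss of generality to assume that $\mu_\omega = 1$. Indeed, if $\mu_\omega \neq 1$, we can choose the scaled function $\bar{\omega}(x) = \omega(x)/\mu_\omega$, which has modulus $1$, to generate the Bregman distance function.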
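To make the two examples of distance generating functions concrete, the following sketch evaluates both Bregman distances numerically. It is an illustration only: the function names are ours, and it assumes the convention $D_\omega(x, y) = \omega(y) - \omega(x) - \langle \nabla\omega(x), y - x \rangle$; if the two arguments are swapped in the definition, the entropy case yields the Kullback-Leibler divergence in the opposite direction.

```python
import math

def bregman(omega, grad_omega, x, y):
    """Bregman distance omega(y) - omega(x) - <grad omega(x), y - x> (assumed convention)."""
    inner = sum(g * (yi - xi) for g, xi, yi in zip(grad_omega(x), x, y))
    return omega(y) - omega(x) - inner

# Squared Euclidean norm: omega(x) = 0.5 * ||x||_2^2
def sq_norm(x):
    return 0.5 * sum(v * v for v in x)

def sq_norm_grad(x):
    return list(x)

# Entropy function: omega(x) = sum_i x_i * log(x_i), for strictly positive entries
def entropy(x):
    return sum(v * math.log(v) for v in x)

def entropy_grad(x):
    return [math.log(v) + 1.0 for v in x]

# Two points in the interior of the standard simplex
x = [0.2, 0.3, 0.5]
y = [0.4, 0.4, 0.2]

d_euclid = bregman(sq_norm, sq_norm_grad, x, y)      # equals 0.5 * ||x - y||_2^2
d_entropy = bregman(entropy, entropy_grad, x, y)     # equals sum_i y_i * log(y_i / x_i)
kl = sum(yi * math.log(yi / xi) for xi, yi in zip(x, y))
l1_dist = sum(abs(xi - yi) for xi, yi in zip(x, y))

print(d_euclid, d_entropy, kl, 0.5 * l1_dist ** 2)
```

The last printed value is the lower bound implied by strong convexity with modulus one with respect to the $\ell_1$-norm, which the entropy-induced distance should dominate on the simplex.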
The following definition introduces subgradients of proper convex functions.
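For a convex function $h : \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$, a vector $s \in \mathbb{R}^n$ is called a subgradient of $h$ at $x$ if
$$h(y) \ge h(x) + \langle s, y - x \rangle \quad \text{for all } y \in \mathbb{R}^n.$$
The set of all subgradients of $h$ at $x$ is called the subdifferential of $h$ at $x$, and is denoted by $\partial h(x)$.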
III Problem Setup
We consider stochastic convex optimization problems of the form
$$\underset{x}{\text{minimize}} \quad \phi(x) := \mathbb{E}_{\xi}[F(x, \xi)] + \Psi(x). \tag{1}$$
Here, $x \in \mathbb{R}^n$ is the decision variable, $\xi$ is a random vector whose probability distribution $P$ is supported on a set $\Xi$, $F(\cdot, \xi)$ is convex and differentiable for each $\xi \in \Xi$, and $\Psi$ is a proper convex function that may be nonsmooth and extended real-valued. Let us define
$$f(x) := \mathbb{E}_{\xi}[F(x, \xi)] = \int_{\Xi} F(x, \xi)\, dP(\xi). \tag{2}$$
A difficulty when solving optimization problem (1) is that the distribution $P$ is often unknown, so the expectation (2) cannot be computed. This situation occurs frequently in data-driven applications such as machine learning. To support these applications, we do not assume knowledge of $f$ (or of $P$), only access to a stochastic oracle. Each time the oracle is queried with an $x$, it generates an independent and identically distributed (i.i.d.) sample $\xi$ from $P$ and returns $\nabla_x F(x, \xi)$.
We also impose the following assumptions on Problem (1).
Assumption 1 (Existence of a minimum)
The optimal set is nonempty.
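Assumption 2 (Lipschitz continuity of $\nabla F$)
For each $\xi \in \Xi$, the function $F(\cdot, \xi)$ has Lipschitz continuous gradient with constant $L$. That is, for all $x, y$,
$$\|\nabla_x F(x, \xi) - \nabla_x F(y, \xi)\|_* \le L \|x - y\|.$$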
Assumption 3 (Bounded gradient variance)
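There exists a constant $\sigma \ge 0$ such that
$$\mathbb{E}_{\xi}\big[\|\nabla_x F(x, \xi) - \nabla f(x)\|_*^2\big] \le \sigma^2 \quad \text{for all } x.$$
Assumption 4 (Closed effective domain of $\Psi$)
The function $\Psi$ is simple and lower semi-continuous, and its effective domain, $\operatorname{dom} \Psi = \{x \in \mathbb{R}^n : \Psi(x) < +\infty\}$, is closed.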
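Possible choices of $\Psi$ include:
Unconstrained smooth minimization: $\Psi(x) = 0$.
Constrained smooth minimization: $\Psi$ is the indicator function of a non-empty closed convex set $C \subseteq \mathbb{R}^n$, i.e., $\Psi(x) = I_C(x)$, which equals $0$ if $x \in C$ and $+\infty$ otherwise.
$\ell_1$-regularized minimization: $\Psi(x) = \lambda \|x\|_1$ with $\lambda > 0$.
Constrained $\ell_1$-regularized minimization: In this case, $\Psi(x) = \lambda \|x\|_1 + I_C(x)$ with $\lambda > 0$.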
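One common reading of "simple" in Assumption 4 is that the proximal map of the regularizer can be evaluated cheaply. The sketch below illustrates this for the four choices listed above. It rests on assumptions that are ours rather than the paper's: the Euclidean distance generating function is used, the constraint set is taken to be a box $[lo, hi]^n$ so that the projection is a coordinate-wise clip, and all function names are hypothetical.

```python
def soft_threshold(v, t):
    """Scalar soft-thresholding: shrink v toward zero by t."""
    return max(v - t, 0.0) - max(-v - t, 0.0)

def prox_zero(v):
    """Regularizer identically zero: the proximal map is the identity."""
    return list(v)

def prox_box(v, lo, hi):
    """Indicator of the box [lo, hi]^n: Euclidean projection, i.e. clipping."""
    return [min(max(vi, lo), hi) for vi in v]

def prox_l1(v, gamma, lam):
    """Regularizer lam * ||x||_1 with step-size gamma: soft-thresholding at gamma * lam."""
    return [soft_threshold(vi, gamma * lam) for vi in v]

def prox_l1_box(v, gamma, lam, lo, hi):
    """lam * ||x||_1 plus box indicator: for a box the problem separates by
    coordinate, so one can soft-threshold and then clip."""
    return prox_box(prox_l1(v, gamma, lam), lo, hi)

v = [1.5, -0.2, 0.7]
print(prox_zero(v))
print(prox_box(v, -1.0, 1.0))
print(prox_l1(v, 0.5, 1.0))
print(prox_l1_box(v, 0.5, 1.0, -0.5, 0.5))
```

The threshold-then-clip shortcut is specific to box constraints; for a general closed convex set the constrained $\ell_1$ case does not decompose this way.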
Several practical problems in machine learning, statistical applications, and signal processing satisfy Assumptions 1–4 (see, e.g., [2, 3, 4]). One such example is $\ell_1$-regularized logistic regression for sparse binary classification. We are then given a large number of observations
drawn i.i.d. from an unknown distribution , and want to solve the minimization problem (1) with
and $\Psi(x) = \lambda \|x\|_1$. The role of $\ell_1$ regularization is to produce sparse solutions.
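A stochastic oracle for the logistic regression example can be sketched as follows. The concrete form of the loss is our assumption: we take each observation to be a pair of a feature vector `a` and a label in $\{-1, +1\}$, and use the standard logistic loss $\log(1 + \exp(-\text{label} \cdot \langle a, x \rangle))$. The names `logistic_loss`, `logistic_grad`, and `make_oracle` are hypothetical.

```python
import math
import random

def logistic_loss(x, a, label):
    """log(1 + exp(-label * <a, x>)), evaluated in a numerically stable way."""
    margin = label * sum(ai * xi for ai, xi in zip(a, x))
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

def logistic_grad(x, a, label):
    """Gradient with respect to x: -label * a / (1 + exp(label * <a, x>))."""
    margin = label * sum(ai * xi for ai, xi in zip(a, x))
    if margin > 0:
        e = math.exp(-margin)
        weight = e / (1.0 + e)
    else:
        weight = 1.0 / (1.0 + math.exp(margin))
    return [-label * weight * ai for ai in a]

def make_oracle(data, rng):
    """Return an oracle that, when queried at x, draws one observation
    uniformly at random and returns the gradient of its loss at x."""
    def oracle(x):
        a, label = rng.choice(data)
        return logistic_grad(x, a, label)
    return oracle

# Tiny synthetic dataset, for illustration only
data = [([1.0, -2.0], 1), ([0.5, 0.3], -1), ([-1.0, 0.8], 1)]
oracle = make_oracle(data, random.Random(0))
print(oracle([0.0, 0.0]))
```

Drawing uniformly from a finite dataset corresponds to taking the empirical distribution as the sampling distribution; with a streaming data source the oracle would instead consume fresh observations.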
i.i.d. random variables from $P$, computing the averaged stochastic gradient
and performing the composite mirror descent update
where $\gamma$ is a positive step-size parameter. Under Assumptions 1–4 and with an appropriate choice of step-size, this algorithm is guaranteed to converge to the optimum [32, Theorem 9]. However, in many emerging applications, such as large-scale machine learning and statistics, the dataset is too large to fit on one machine. Hence, we need optimization algorithms that can be conveniently and efficiently executed in parallel on multiple processors.
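The serial mini-batch scheme described above can be sketched on a small synthetic problem. This is an illustration under our own assumptions, not the paper's experiment: the loss is a squared error rather than the logistic loss, the regularizer is $\lambda \|x\|_1$, the Euclidean distance generating function is used (so the composite step becomes soft-thresholding), and the step-size, batch size, and data are arbitrary choices.

```python
import random

def soft_threshold(v, t):
    return max(v - t, 0.0) - max(-v - t, 0.0)

def minibatch_step(x, batch, gamma, lam):
    """Average the gradients of 0.5 * (<a, x> - y)^2 over the batch, then take
    one composite step with the l1 regularizer (Euclidean geometry)."""
    n = len(x)
    g = [0.0] * n
    for a, y in batch:
        residual = sum(ai * xi for ai, xi in zip(a, x)) - y
        for i in range(n):
            g[i] += residual * a[i] / len(batch)
    return [soft_threshold(xi - gamma * gi, gamma * lam) for xi, gi in zip(x, g)]

def objective(x, data, lam):
    loss = sum(0.5 * (sum(ai * xi for ai, xi in zip(a, x)) - y) ** 2 for a, y in data)
    return loss / len(data) + lam * sum(abs(xi) for xi in x)

rng = random.Random(0)
x_true = [1.0, 0.0, 0.0, -2.0, 0.0]          # sparse ground truth (synthetic)
data = []
for _ in range(200):
    a = [rng.gauss(0.0, 1.0) for _ in x_true]
    y = sum(ai * xi for ai, xi in zip(a, x_true)) + 0.01 * rng.gauss(0.0, 1.0)
    data.append((a, y))

lam, gamma, batch_size = 0.01, 0.05, 20
x = [0.0] * len(x_true)
initial = objective(x, data, lam)
for k in range(300):
    batch = [rng.choice(data) for _ in range(batch_size)]   # i.i.d. draws
    x = minibatch_step(x, batch, gamma, lam)
final = objective(x, data, lam)
print(initial, final, x)
```

Every gradient here is evaluated at the current iterate on a single processor, which is exactly the serial bottleneck that motivates the parallel algorithm of the next section.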
IV An Asynchronous Mini-Batch Algorithm
In this section, we will present an asynchronous mini-batch algorithm that exploits multiple processors to solve Problem (1). We characterize the iteration complexity and the convergence rate of the proposed algorithm, and show that these compare favourably with the state of the art.
IV-A Description of Algorithm
We assume $p$ processors have access to a shared memory for the decision variable $x$. The processors may have different capabilities (in terms of processing power and access to data) and are able to update $x$ without the need for coordination or synchronization. Conceptually, the algorithm lets each processor run its own stochastic composite mirror descent process, repeating the following steps:
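1) Read $x$ from the shared memory and load it into the local storage location $\hat{x}$;
2) Sample $b$ i.i.d. random variables $\xi_1, \ldots, \xi_b$ from the distribution $P$;
3) Compute the averaged stochastic gradient vector
$$g_{\mathrm{ave}} = \frac{1}{b} \sum_{i=1}^{b} \nabla_x F(\hat{x}, \xi_i);$$
4) Update current $x$ in the shared memory via
$$x \leftarrow \underset{z}{\operatorname{argmin}} \Big\{ \langle g_{\mathrm{ave}}, z \rangle + \Psi(z) + \frac{1}{\gamma} D_\omega(x, z) \Big\}.$$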
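The effect of applying out-of-date gradients in the four steps above can be explored with a sequential simulation. The sketch below is not a parallel implementation and not the paper's Algorithm 1 verbatim: it assumes a toy loss $F(x, \xi) = \frac{1}{2}\|x - \xi\|_2^2$ with Gaussian $\xi$, an $\ell_1$ regularizer, Euclidean geometry, and a delay drawn uniformly at random up to a hypothetical bound `tau_max`. For this toy problem the minimizer is available in closed form, which makes the behavior easy to inspect.

```python
import random

def soft_threshold(v, t):
    return max(v - t, 0.0) - max(-v - t, 0.0)

def simulate_async(mean, lam, gamma, batch, tau_max, num_updates, seed=0):
    """Sequentially simulate updates that use gradients evaluated at stale iterates.
    Returns the running average of the iterates x(1), ..., x(num_updates)."""
    rng = random.Random(seed)
    n = len(mean)
    history = [[0.0] * n]            # history[k] is the iterate x(k)
    avg = [0.0] * n
    for k in range(num_updates):
        # Steps 1-3: the gradient for update k was computed at a stale iterate.
        d = max(0, k - rng.randint(0, tau_max))
        x_stale = history[d]
        g = [0.0] * n
        for _ in range(batch):
            for i in range(n):
                # gradient of 0.5 * ||x - xi||^2 is x - xi, with xi ~ N(mean, I)
                g[i] += (x_stale[i] - rng.gauss(mean[i], 1.0)) / batch
        # Step 4: the composite step is applied to the current iterate x(k).
        x_new = [soft_threshold(xi - gamma * gi, gamma * lam)
                 for xi, gi in zip(history[k], g)]
        history.append(x_new)
        avg = [(a * k + xn) / (k + 1) for a, xn in zip(avg, x_new)]
    return avg

mean = [1.0, -0.5, 0.05]
lam = 0.1
x_avg = simulate_async(mean, lam, gamma=0.05, batch=10, tau_max=5, num_updates=2000)
# Minimizer of 0.5 * ||x - mean||^2 + lam * ||x||_1
x_star = [soft_threshold(m, lam) for m in mean]
print(x_avg, x_star)
```

With a constant step-size the running average is expected to settle in a neighborhood of the minimizer rather than at the minimizer itself, and increasing `tau_max` or `gamma` in this toy setting enlarges that neighborhood.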
The algorithm can be implemented in many ways as depicted in Figure 1. One way is to consider the processors as peers that each execute the four-step algorithm independently of each other and only share the global memory for storing $x$. In this case, each processor reads the decision vector twice in each round: once in the first step (before evaluating the averaged gradient), and once in the last step (before carrying out the minimization). To ensure correctness, Step 4 must be an atomic operation, where the executing processor puts a write lock on the global memory until it has written back the result of the minimization (cf. Figure 1, left). The algorithm can also be executed in a master-worker setting. In this case, each of the worker nodes retrieves $x$ from the master in Step 1 and returns the averaged gradient to the master in Step 3; the fourth step (carrying out the minimization) is executed by the master (cf. Figure 1, right).
Independently of how we choose to implement the algorithm, processors may work at different rates: while one processor updates the decision vector (in the shared memory setting) or sends its averaged gradient to the master (in the master-worker setting), the others are generally busy computing averaged gradient vectors. The processors that perform gradient evaluations do not need to be aware of updates to the decision vector, but can continue to operate on stale information about $x$. Therefore, unlike synchronous parallel mini-batch algorithms, there is no need for processors to wait for each other to finish the gradient computations. Moreover, the value $\hat{x}$ at which the average of gradients is evaluated by a processor may differ from the value of $x$ to which the update is applied.
Algorithm 1 describes the asynchronous processes that run in parallel. To describe the progress of the overall optimization process, we introduce a counter $k$ that is incremented each time $x$ is updated. We let $d(k)$ denote the time at which $\hat{x}$ used to compute the averaged gradient involved in the update of $x(k)$ was read from the shared memory. It is clear that $d(k) \le k$ for all $k \in \mathbb{N}_0$. The value
$$\tau(k) := k - d(k)$$
can be viewed as the delay between reading and updating for processors and captures the staleness of the information used to compute the average of gradients for the $k$-th update. We assume that the delay is not too long, i.e., there is a nonnegative integer $\tau_{\max}$ such that
$$0 \le \tau(k) \le \tau_{\max}.$$
The value of $\tau_{\max}$ is an indicator of the asynchronism in the algorithm and in the execution platform. In practice, $\tau_{\max}$ will depend on the number of parallel processors used in the algorithm [33, 34, 35]. Note that the cyclic-delay mini-batch algorithm , in which the processors are ordered and each updates the decision variable under a fixed schedule, is a special case of Algorithm 1 where , or, equivalently, for all .
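How the delay depends on the number of processors and on their relative speeds can be illustrated with a small event-driven simulation of the master-worker setting. This is a sketch under our own assumptions: each worker has a fixed, hypothetical compute time per averaged gradient, communication is instantaneous, ties are broken by worker index, and the delay of an update is measured as the counter value at the update minus the counter value when the worker read the decision vector. Indexing conventions may differ from those of Algorithm 1.

```python
import heapq

def simulate_delays(compute_times, num_updates):
    """Return the delay of each of the first num_updates updates, where
    compute_times[w] is the time worker w needs for one averaged gradient."""
    # Each event is (finish time, worker index); all workers start at time 0.
    events = [(t, w) for w, t in enumerate(compute_times)]
    heapq.heapify(events)
    read_at = [0] * len(compute_times)   # counter value when each worker last read x
    delays = []
    k = 0
    while k < num_updates:
        finish_time, w = heapq.heappop(events)
        delays.append(k - read_at[w])    # staleness of the gradient used in update k
        k += 1                           # the master applies the update
        read_at[w] = k                   # the worker reads the fresh iterate
        heapq.heappush(events, (finish_time + compute_times[w], w))
    return delays

# Four identical workers: the schedule becomes round-robin.
print(simulate_delays([1.0, 1.0, 1.0, 1.0], 12))
# One slow worker among fast ones: its gradients are the most stale.
print(simulate_delays([1.0, 1.0, 3.0], 20))
```

With identical workers the delay in this simulation settles at a constant equal to the number of workers minus one, which mirrors the fixed-schedule behavior of cyclic-delay methods; heterogeneous workers produce time-varying delays whose maximum is governed by the slowest worker.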
IV-B Convergence Rate for General Convex Regularization
The following theorem establishes convergence properties of Algorithm 1 when a constant step-size is used.
See Appendix -A.
Theorem 1 demonstrates that for any constant step-size satisfying (4), the running average of iterates generated by Algorithm 1 will converge in expectation to a ball around the optimum at a rate of . The convergence rate and the residual error depend on the choice of : decreasing reduces the residual error, but it also results in a slower convergence. We now describe a possible strategy for selecting the constant step-size. Let be the total number of iterations necessary to achieve -optimal solution to Problem (1), that is, when . If we pick
it follows from Theorem 1 that the corresponding satisfies
where . This inequality tells us that if the first term on the right-hand side is less than , i.e., if
As long as the maximum delay bound is of the order , the first term in (6) is asymptotically negligible, and hence the iteration complexity of Algorithm 1 is asymptotically , which is exactly the iteration complexity achieved by the mini-batch algorithm for solving stochastic convex optimization problems in a serial setting . As discussed before, is related to the number of processors used in the algorithm. Therefore, if the number of processors is of the order of , parallelization does not appreciably degrade asymptotic convergence of Algorithm 1. Furthermore, as processors are being run in parallel, updates occur roughly times as quickly and in time scaling as , the processors may compute averaged gradient vectors (instead of vectors). This means that the near-linear speedup in the number of processors can be expected.
Another strategy for the selection of the constant step-size in Algorithm 1 is to use that depends on the prior knowledge of the number of iterations to be performed. More precisely, assume that the number of iterations is fixed in advance, say equal to . By choosing as
for some , it follows from Theorem 1 that the running average of the iterates after iterations satisfies
It is easy to verify that the optimal choice of , which minimizes the second term on the right-hand-side of the above inequality, is
With this choice of , we then have
In the case that , the preceding guaranteed bound reduces to the one obtained in [8, Theorem 1] for the serial stochastic mirror descent algorithm with constant step-sizes. Note that in order to implement Algorithm 1 with the optimal constant step-size policy, we need to estimate an upper bound on , since is usually unknown.
The following theorem characterizes the convergence of Algorithm 1 with a time-varying step-size sequence when is bounded in addition to being closed and convex.
See Appendix -B.
The time-varying step-size , which ensures the convergence of the algorithm, consists of two terms: the time-varying term should control the errors from stochastic gradient information while the role of the constant term () is to decrease the effects of asynchrony (bounded delays) on the convergence of the algorithm. According to Theorem 2, in the case that , the delay becomes increasingly harmless as the algorithm progresses and the expected function value evaluated at converges asymptotically at a rate , which is known to be the best achievable rate of the mirror descent method for nonsmooth stochastic convex optimization problems .
For the special case of the optimization problem (1) where is restricted to be the indicator function of a compact convex set, Agarwal and Duchi [36, Theorem 2] showed that the convergence rate of the delayed stochastic mirror descent method with time-varying step-size is
where is the maximum bound on . Comparing with this result, instead of an asymptotic penalty of the form due to the delays, we have the penalty , which is much smaller for large . Therefore, not only do we extend the result of [36] to general regularization functions, but we also obtain a sharper guaranteed convergence rate than the one presented in [36].
IV-C Convergence Rate for Strongly Convex Regularization
In this subsection, we restrict our attention to stochastic composite optimization problems with strongly convex regularization terms. Specifically, we assume that is -strongly convex with respect to , that is, for any ,
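Examples of strongly convex regularization functions include:
$\ell_2$-regularization: $\Psi(x) = \frac{\lambda}{2}\|x\|_2^2$ with $\lambda > 0$.
Elastic net regularization: $\Psi(x) = \lambda_1 \|x\|_1 + \frac{\lambda_2}{2}\|x\|_2^2$ with $\lambda_1 > 0$ and $\lambda_2 > 0$.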
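For the elastic net, a composite step admits a closed form when the Euclidean distance generating function is used. The sketch below states that closed form and checks it numerically against small perturbations. The Euclidean setting, the parameter values, and the function names are our assumptions for illustration; with $\lambda_1 = 0$ the same formula covers pure $\ell_2$-regularization.

```python
def soft_threshold(v, t):
    return max(v - t, 0.0) - max(-v - t, 0.0)

def elastic_net_step(x, g, gamma, lam1, lam2):
    """Minimizer over z of
        <g, z> + lam1 * ||z||_1 + (lam2 / 2) * ||z||_2^2 + ||z - x||_2^2 / (2 * gamma).
    The problem separates by coordinate: soft-threshold, then shrink."""
    return [soft_threshold(xi - gamma * gi, gamma * lam1) / (1.0 + gamma * lam2)
            for xi, gi in zip(x, g)]

def step_objective(z, x, g, gamma, lam1, lam2):
    return sum(gi * zi + lam1 * abs(zi) + 0.5 * lam2 * zi * zi
               + (zi - xi) ** 2 / (2.0 * gamma)
               for zi, xi, gi in zip(z, x, g))

x = [1.0, -0.3, 0.05]
g = [0.5, -1.0, 0.2]
gamma, lam1, lam2 = 0.1, 0.5, 2.0
z = elastic_net_step(x, g, gamma, lam1, lam2)
best = step_objective(z, x, g, gamma, lam1, lam2)

# Perturbing any coordinate should not decrease the step objective.
for i in range(len(z)):
    for delta in (-1e-3, 1e-3):
        z_pert = list(z)
        z_pert[i] += delta
        assert step_objective(z_pert, x, g, gamma, lam1, lam2) >= best
print(z)
```

The extra division by $1 + \gamma \lambda_2$ is where strong convexity enters the update: it contracts every step toward the origin, in addition to the sparsity-inducing thresholding.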
In order to derive the convergence rate of Algorithm 1 for solving (1) with a strongly convex regularization term, we need to assume that the Bregman distance function used in the algorithm satisfies the next assumption.
Assumption 5 (Quadratic growth condition)
For all , we have
The associated convergence result now reads as follows.
See Appendix -C.
An interesting point regarding Theorem 3 is that for solving stochastic composite optimization problems with strongly convex regularization functions, the maximum delay bound can be as large as without affecting the asymptotic convergence rate of Algorithm 1. In this case, our asynchronous mini-batch algorithm converges asymptotically at a rate of , which matches the best known rate achievable in a serial setting.
V Experimental Results
We have developed a complete master-worker implementation of our algorithm in C/C++ using the Message Passing Interface libraries (OpenMPI). Although we argued in Section IV that Algorithm 1 can be implemented using atomic operations on shared-memory computing architectures, we have chosen the MPI implementation due to its flexibility in scaling the problem to distributed-memory environments.
We evaluated our algorithm on a document classification problem using the text categorization dataset rcv1 . This dataset consists of documents, with
unique stemmed tokens spanning 103 topics. Out of these topics, we decided to classify sports-related documents. To this end, we trained a sparse (binary) classifier by solving the following $\ell_1$-regularized logistic regression problem
Here, is the sparse vector of token weights assigned to each document, and indicates whether a selected document is sports-related, or not ( is if the document is about sport, otherwise). To evaluate scalability, we used both the training and test sets available when solving the optimization problem. We implemented Algorithm 1 with time-varying step-sizes, and used a batch size of 1000 documents. The regularization parameter was set to , and the algorithm was run until a fixed tolerance was met.
Figure 2 presents the achieved relative speedup of the algorithm with respect to the number of workers used. The relative speedup of the algorithm on $p$ processors is defined as $S(p) = t_1 / t_p$, where $t_1$ and $t_p$ are the times it takes to run the corresponding algorithm (to $\epsilon$-accuracy) on 1 and $p$ processing units, respectively. We observe a near-linear relative speedup, consistent with our theoretical results. The timings are averaged over 10 Monte Carlo runs.
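The relative speedup computation can be sketched as follows. The timings in the example are invented for illustration and are not the measurements behind Figure 2; taking the ratio of the averaged run times (rather than averaging per-run ratios) is also our assumption about how the Monte Carlo runs are combined.

```python
from statistics import mean

def relative_speedup(times_one_worker, times_p_workers):
    """Ratio of the average time on one processing unit to the average time on p units."""
    return mean(times_one_worker) / mean(times_p_workers)

# Hypothetical wall-clock times in seconds, keyed by the number of workers.
timings = {
    1: [120.0, 118.0, 122.0],
    2: [62.0, 60.0, 61.0],
    4: [31.0, 32.0, 33.0],
    8: [17.0, 16.0, 18.0],
}

speedups = {p: relative_speedup(timings[1], t) for p, t in timings.items()}
efficiency = {p: s / p for p, s in speedups.items()}   # 1.0 means perfectly linear speedup

for p in sorted(speedups):
    print(p, round(speedups[p], 2), round(efficiency[p], 2))
```

An efficiency close to one across the range of workers is what "near-linear speedup" refers to.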
We have proposed an asynchronous mini-batch algorithm that exploits multiple processors to solve regularized stochastic optimization problems with smooth loss functions. We have established that for closed and convex constraint sets, the iteration complexity of the algorithm with constant step-sizes is asymptotically . For compact constraint sets, we have proved that the running average of the iterates generated by our algorithm with time-varying step-size converges to the optimum at a rate . When the regularization function is strongly convex and the constraint set is closed and convex, the algorithm achieves the rate of the order . We have shown that the penalty in convergence rate of the algorithm due to asynchrony is asymptotically negligible and a near-linear speedup in the number of processors can be expected. Our computational experience confirmed the theory.
The following result establishes an important recursion for the iterates generated by Algorithm 1.
We start with the first-order optimality condition for the point in the minimization problem (3): there exists subgradient such that for all , we have
where denotes the partial derivative of the Bregman distance function with respect to the second variable. Plugging the following equality
into the previous inequality and re-arranging terms gives
where the last inequality used
From this relation, with , , and , we have
Substituting the preceding equality into (8) and re-arranging terms result in
Since the distance generating function is -strongly convex, we have the lower bound
which implies that
The essential idea in the rest of the proof is to use convexity and smoothness of the expectation function to bound for each and each . According to Assumption 2, and, hence, are Lipschitz continuous with the constant . By using the -Lipschitz continuity of and then the convexity of , we have
We now rewrite the above inequality in terms of the error as follows:
We will seek upper bounds on the quantities and . Let be a sequence of positive numbers. For , we have
where the second inequality follows from the Fenchel-Young inequality applied to the conjugate pair and , i.e.,
We now turn to . It follows from definition that
Then, by the convexity of the norm , we conclude that
Setting , where