An Asynchronous Mini-Batch Algorithm for Regularized Stochastic Optimization

by Hamid Reza Feyzmahdavian et al.

Mini-batch optimization has proven to be a powerful paradigm for large-scale learning. However, state-of-the-art parallel mini-batch algorithms assume synchronous operation or cyclic update orders. When worker nodes are heterogeneous (due to different computational capabilities or different communication delays), synchronous and cyclic operations are inefficient since they leave workers idle, waiting for the slowest nodes to complete their computations. In this paper, we propose an asynchronous mini-batch algorithm for regularized stochastic optimization problems with smooth loss functions that eliminates idle waiting and allows workers to run at their maximal update rates. We show that by suitably choosing the step-size values, the algorithm achieves a rate of the order O(1/√T) for general convex regularization functions, and the rate O(1/T) for strongly convex regularization functions, where T is the number of iterations. In both cases, the impact of asynchrony on the convergence rate of our algorithm is asymptotically negligible, and a near-linear speedup in the number of workers can be expected. Theoretical results are confirmed in real implementations on a distributed computing infrastructure.




I Introduction

Many optimization problems that arise in machine learning, signal processing, and statistical estimation can be formulated as regularized stochastic optimization (also referred to as stochastic composite optimization) problems, in which one jointly minimizes the expectation of a stochastic loss function plus a possibly nonsmooth regularization term. Examples include Tikhonov and elastic net regularization, Lasso, sparse logistic regression, and support vector machines [1, 2, 3, 4, 5].

Stochastic approximation methods such as stochastic gradient descent were among the first algorithms developed for solving stochastic optimization problems [6]. Recently, these methods have received significant attention due to their simplicity and effectiveness (see, e.g., [7, 8, 9, 10, 11, 12, 13]). In particular, Nemirovski et al. [7] demonstrated that for nonsmooth stochastic convex optimization problems, a modified stochastic approximation method, the mirror descent, exhibits an unimprovable convergence rate O(1/√T), where T is the number of iterations. Later, Lan [8] developed a mirror descent algorithm for stochastic composite convex problems which explicitly accounts for the smoothness of the loss function and achieves the optimal rate. A similar result for the dual averaging method was obtained by Xiao [9].

The methods for solving stochastic optimization problems cited above are inherently serial, in the sense that the gradient computations take place on a single processor which has access to the whole dataset. However, it is increasingly common that a single computer is unable to store and handle the amounts of data encountered in practical problems. This has spurred a strong interest in developing parallel optimization algorithms which are able to split the data and distribute the computation across multiple processors or multiple computer clusters (see, e.g., [14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31] and references therein).

One simple and popular stochastic approximation method is mini-batching, where iterates are updated based on the average gradient with respect to multiple data points, rather than on a gradient evaluated at a single data point at a time. Recently, Dekel et al. [32] proposed a parallel mini-batch algorithm for regularized stochastic optimization problems, in which multiple processors compute gradients in parallel using their own local data and then aggregate the gradients up a spanning tree to obtain the averaged gradient. While this algorithm can achieve linear speedup in the number of processors, it has the drawback that the processors need to synchronize at each round: if one of them fails or is slower than the rest, the entire algorithm runs at the pace of the slowest processor.

In this paper, we propose an asynchronous mini-batch algorithm for regularized stochastic optimization problems with smooth loss functions that eliminates the overhead associated with global synchronization. Our algorithm allows multiple processors to work at different rates, perform computations independently of each other, and update global decision variables using out-of-date gradients. A similar model of parallel asynchronous computation was applied to coordinate descent methods for deterministic optimization in [33, 34, 35] and mirror descent and dual averaging methods for stochastic optimization in [36]. In particular, Agarwal and Duchi [36] have analyzed the convergence of asynchronous mini-batch algorithms for smooth stochastic convex problems, and interestingly shown that bounded delays do not degrade the asymptotic convergence. However, they only considered the case where the regularization term is the indicator function of a compact convex set.

We extend the results of [36] to general regularization functions (like the ℓ1-norm, often used to promote sparsity), and establish a sharper expected-value type of convergence rate than the one given in [36]. Specifically, we make the following contributions:

  • For general convex regularization functions, we show that when the constraint set is closed and convex (but not necessarily bounded), the running average of the iterates generated by our algorithm with constant step-sizes converges at rate O(1/T) to a ball around the optimum. We derive an explicit expression that quantifies how the convergence rate and the residual error depend on loss function properties and algorithm parameters such as the constant step-size γ, the batch size b, and the maximum delay bound τ.

  • For general convex regularization functions and compact constraint sets, we prove that the running average of the iterates produced by our algorithm with a time-varying step-size converges to the true optimum (without residual error) at rate O(1/√T). This result improves upon the previously known rate for delayed stochastic mirror descent methods with time-varying step-sizes given in [36]. In this case, our algorithm enjoys near-linear speedup as long as the number of processors is O(T^(1/4)).

  • When the regularization function is strongly convex and the constraint set is closed and convex, we establish that the iterates converge at rate O(1/T). If the number of processors is of the order of O(√T), this rate is asymptotically O(1/T), which is the best known rate for strongly convex stochastic optimization problems in a serial setting.

The remainder of the paper is organized as follows. In Section II, we introduce the notation and review some preliminaries that are essential for the development of the results in this paper. In Section III, we formulate the problem and discuss our assumptions. The proposed asynchronous mini-batch algorithm and its main theoretical results are presented in Section IV. Computational experience is reported in Section V while Section VI concludes the paper.

II Notation and Preliminaries

II-A Notation

We let ℕ and ℕ₀ denote the set of natural numbers and the set of natural numbers including zero, respectively. The inner product of two vectors x, y ∈ ℝⁿ is denoted by ⟨x, y⟩. We assume that ℝⁿ is endowed with a norm ‖·‖, and use ‖·‖* to represent the corresponding dual norm, defined by

‖y‖* = sup { ⟨x, y⟩ : ‖x‖ ≤ 1 }.

II-B Preliminaries

Next, we review the key definitions and results necessary for developing the main results of this paper. We start with the definition of a Bregman distance function, also referred to as a prox-function.

Definition 1

A function ω : X → ℝ is called a distance generating function with modulus μ > 0 with respect to a norm ‖·‖ if ω is continuously differentiable and μ-strongly convex with respect to ‖·‖ over the set X. That is, for all x, y ∈ X,

ω(y) ≥ ω(x) + ⟨∇ω(x), y − x⟩ + (μ/2)‖y − x‖².

Every distance generating function introduces a corresponding Bregman distance function

D(x, y) := ω(x) − ω(y) − ⟨∇ω(y), x − y⟩.

For example, choosing ω(x) = ½‖x‖₂², which is 1-strongly convex with respect to the ℓ2-norm over any convex set X ⊆ ℝⁿ, results in D(x, y) = ½‖x − y‖₂². Another common example of a distance generating function is the entropy function

ω(x) = Σᵢ xᵢ log xᵢ,

which is 1-strongly convex with respect to the ℓ1-norm over the standard simplex

Δ = { x ∈ ℝⁿ : Σᵢ xᵢ = 1, x ≥ 0 },

and its associated Bregman distance function is the Kullback–Leibler divergence

D(x, y) = Σᵢ xᵢ log(xᵢ / yᵢ).

The main motivation to use a generalized distance generating function, instead of the usual Euclidean distance function, is to design optimization algorithms that can take advantage of the geometry of the feasible set (see, e.g.[37, 7, 38, 39]).
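To make these definitions concrete, the two Bregman distances above can be sketched in a few lines of Python; this is a minimal illustration and the function names are ours, not the paper's:

```python
import numpy as np

def bregman(omega, grad_omega, x, y):
    # D(x, y) = omega(x) - omega(y) - <grad omega(y), x - y>
    return omega(x) - omega(y) - np.dot(grad_omega(y), x - y)

# Euclidean case: omega(x) = 0.5*||x||_2^2 gives D(x, y) = 0.5*||x - y||_2^2
euclidean = (lambda x: 0.5 * np.dot(x, x), lambda y: y)

# Entropy case on the simplex: omega(x) = sum_i x_i*log(x_i) gives the
# Kullback-Leibler divergence D(x, y) = sum_i x_i*log(x_i / y_i)
entropy = (lambda x: np.sum(x * np.log(x)), lambda y: np.log(y) + 1.0)

x = np.array([0.2, 0.3, 0.5])   # points on the standard simplex
y = np.array([0.4, 0.4, 0.2])

d_euclidean = bregman(*euclidean, x, y)   # equals 0.5*||x - y||^2
d_kl = bregman(*entropy, x, y)            # equals KL(x || y)
```

Both distances are nonnegative and vanish only at x = y, in line with Remark 1 below.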

Remark 1

The strong convexity of the distance generating function always ensures that

D(x, y) ≥ (μ/2)‖x − y‖² ≥ 0,

and D(x, y) = 0 if and only if x = y.

Remark 2

Throughout the paper, there is no loss of generality in assuming that μ = 1. Indeed, if μ ≠ 1, we can choose the scaled function ω(x)/μ, which has modulus 1, to generate the Bregman distance function.

The following definition introduces subgradients of proper convex functions.

Definition 2

For a convex function h : ℝⁿ → ℝ ∪ {+∞}, a vector s ∈ ℝⁿ is called a subgradient of h at a point x ∈ dom h if

h(y) ≥ h(x) + ⟨s, y − x⟩   for all y ∈ ℝⁿ.

The set of all subgradients of h at x is called the subdifferential of h at x, and is denoted by ∂h(x).

III Problem Setup

We consider stochastic convex optimization problems of the form

    minimize over x ∈ ℝⁿ:  Φ(x) := F(x) + Ψ(x).    (1)

Here, x ∈ ℝⁿ is the decision variable, ξ is a random vector whose probability distribution P is supported on a set Ξ, f(·, ξ) is convex and differentiable for each ξ ∈ Ξ, and Ψ is a proper convex function that may be nonsmooth and extended real-valued. Let us define the expectation function

    F(x) := E[f(x, ξ)].    (2)

Note that F is convex, differentiable, and satisfies ∇F(x) = E[∇ₓ f(x, ξ)] [40]. We use X* to denote the set of optimal solutions of Problem (1) and Φ* to denote the corresponding optimal value.

A difficulty when solving optimization problem (1) is that the distribution P is often unknown, so the expectation (2) cannot be computed. This situation occurs frequently in data-driven applications such as machine learning. To support these applications, we do not assume knowledge of P (or of F), only access to a stochastic oracle. Each time the oracle is queried with a point x ∈ dom Ψ, it generates an independent and identically distributed (i.i.d.) sample ξ from P and returns ∇ₓ f(x, ξ).

We also impose the following assumptions on Problem (1).

Assumption 1 (Existence of a minimum)

The optimal set X* is nonempty.

Assumption 2 (Lipschitz continuity of the gradient)

For each ξ ∈ Ξ, the function f(·, ξ) has Lipschitz continuous gradient with constant L. That is, for all x, y ∈ ℝⁿ,

‖∇ₓ f(x, ξ) − ∇ₓ f(y, ξ)‖* ≤ L‖x − y‖.

Note that under Assumption 2, ∇F is also Lipschitz continuous with the same constant L [9].

Assumption 3 (Bounded gradient variance)

There exists a constant σ ≥ 0 such that

E[ ‖∇ₓ f(x, ξ) − ∇F(x)‖*² ] ≤ σ²   for all x ∈ dom Ψ.

Assumption 4 (Closed effective domain of Ψ)

The function Ψ is simple and lower semi-continuous, and its effective domain, dom Ψ = { x ∈ ℝⁿ : Ψ(x) < +∞ }, is closed.

Possible choices of Ψ include:

  • Unconstrained smooth minimization: Ψ(x) = 0.

  • Constrained smooth minimization: Ψ is the indicator function of a non-empty closed convex set X, i.e., I_X(x) := 0 if x ∈ X, and +∞ otherwise.

  • ℓ1-regularized minimization: Ψ(x) = λ‖x‖₁ with λ > 0.

  • Constrained ℓ1-regularized minimization: In this case, Ψ(x) = λ‖x‖₁ + I_X(x) with λ > 0.
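With a Euclidean distance generating function, the composite mirror descent update (introduced below) reduces to a proximal-gradient step, and each of the above choices of Ψ yields a simple closed-form proximal operator. A sketch under that assumption; the function names and default parameter values are ours:

```python
import numpy as np

def prox_zero(v, step):
    # Unconstrained smooth minimization (Psi = 0): prox is the identity
    return v

def prox_indicator_box(v, step, lo=-1.0, hi=1.0):
    # Psi = indicator of the box [lo, hi]^n: prox is Euclidean projection
    return np.clip(v, lo, hi)

def prox_l1(v, step, lam=0.1):
    # Psi = lam*||x||_1: prox is soft-thresholding at level step*lam
    return np.sign(v) * np.maximum(np.abs(v) - step * lam, 0.0)

def prox_box_l1(v, step, lam=0.1, lo=-1.0, hi=1.0):
    # Psi = lam*||x||_1 + indicator of a box containing the origin:
    # soft-threshold first, then project
    return np.clip(prox_l1(v, step, lam), lo, hi)

v = np.array([0.5, -0.02, 2.0])
step = 1.0
```

Each operator costs O(n), so the nonsmooth term adds essentially no overhead to a gradient step.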

Several practical problems in machine learning, statistical applications, and signal processing satisfy Assumptions 1–4 (see, e.g., [2, 3, 4]). One such example is ℓ1-regularized logistic regression for sparse binary classification. We are then given a large number of observations (aᵢ, bᵢ) ∈ ℝⁿ × {−1, +1} drawn i.i.d. from an unknown distribution P, and want to solve the minimization problem (1) with

f(x, (a, b)) = log(1 + exp(−b⟨a, x⟩))

and Ψ(x) = λ‖x‖₁. The role of the ℓ1-regularization is to produce sparse solutions.

One approach for solving Problem (1) is the serial mini-batch method based on the mirror descent scheme [32]. Given a point x ∈ dom Ψ, a single processor updates the decision variable by sampling b i.i.d. random variables ξ₁, …, ξ_b from P, computing the averaged stochastic gradient

g = (1/b) Σᵢ₌₁ᵇ ∇ₓ f(x, ξᵢ),

and performing the composite mirror descent update

x⁺ = argmin over y ∈ ℝⁿ of { ⟨g, y⟩ + Ψ(y) + (1/γ) D(y, x) },

where γ is a positive step-size parameter. Under Assumptions 1–4 and with an appropriate step-size, this algorithm is guaranteed to converge to the optimum [32, Theorem 9]. However, in many emerging applications, such as large-scale machine learning and statistics, datasets are so large that they cannot fit on one machine. Hence, we need optimization algorithms that can be conveniently and efficiently executed in parallel on multiple processors.
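For the Euclidean distance generating function, each pass of this serial mini-batch scheme is a proximal stochastic gradient step, which can be sketched as follows. The toy least-squares loss, the ℓ1 regularizer, and all parameter values are our own illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy instance: f(x, (a, c)) = 0.5*(<a, x> - c)^2 with Psi(x) = lam*||x||_1,
# where samples are generated from a sparse ground-truth vector.
n, lam, batch, gamma = 20, 0.05, 64, 0.05
x_true = np.zeros(n)
x_true[:3] = [1.0, -2.0, 0.5]

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

x = np.zeros(n)
for k in range(500):
    A = rng.normal(size=(batch, n))          # sample b i.i.d. inputs
    c = A @ x_true + 0.1 * rng.normal(size=batch)
    g = A.T @ (A @ x - c) / batch            # averaged stochastic gradient
    # composite mirror descent update with D(y, x) = 0.5*||y - x||_2^2
    x = soft_threshold(x - gamma * g, gamma * lam)
```

With a constant step-size, the iterates settle in a small ball around the sparse optimum, which is the qualitative behaviour the constant step-size analysis below describes.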

IV An Asynchronous Mini-Batch Algorithm

In this section, we will present an asynchronous mini-batch algorithm that exploits multiple processors to solve Problem (1). We characterize the iteration complexity and the convergence rate of the proposed algorithm, and show that these compare favourably with the state of the art.

IV-A Description of Algorithm

We assume that p processors have access to a shared memory for the decision variable x. The processors may have different capabilities (in terms of processing power and access to data) and are able to update x without the need for coordination or synchronization. Conceptually, the algorithm lets each processor run its own stochastic composite mirror descent process, repeating the following steps:

  1. Read x from the shared memory and load it into local storage;

  2. Sample b i.i.d. random variables ξ₁, …, ξ_b from the distribution P;

  3. Compute the averaged stochastic gradient vector g = (1/b) Σᵢ₌₁ᵇ ∇ₓ f(x, ξᵢ);

  4. Update the current x in the shared memory via the composite mirror descent step (3).

The algorithm can be implemented in many ways, as depicted in Figure 1. One way is to consider the processors as peers that each execute the four-step algorithm independently of each other and only share the global memory for storing x. In this case, each processor reads the decision vector twice in each round: once in the first step (before evaluating the averaged gradient), and once in the last step (before carrying out the minimization). To ensure correctness, Step 4 must be an atomic operation, where the executing processor puts a write lock on the global memory until it has written back the result of the minimization (cf. Figure 1, left). The algorithm can also be executed in a master-worker setting. In this case, each worker node retrieves x from the master in Step 1 and returns the averaged gradient to the master in Step 3; the fourth step (carrying out the minimization) is executed by the master (cf. Figure 1, right).
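The shared-memory variant can be mimicked with threads: each worker repeats the four steps, computing its gradient at a possibly stale copy of x and applying the update under a write lock. This is only an illustrative sketch (Euclidean distance, ℓ1 regularizer, toy least-squares data, and all parameters are our own choices), not the MPI implementation used in Section V:

```python
import threading
import numpy as np

n, batch, lam, gamma = 10, 32, 0.01, 0.05
x_true = np.arange(1.0, n + 1.0) / n     # ground truth for the toy loss
x = np.zeros(n)                          # shared decision variable
lock = threading.Lock()                  # write lock for Step 4

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def worker(seed, num_updates):
    global x
    rng = np.random.default_rng(seed)
    for _ in range(num_updates):
        x_stale = x.copy()                    # Step 1: read the shared x
        A = rng.normal(size=(batch, n))       # Step 2: sample a mini-batch
        c = A @ x_true + 0.01 * rng.normal(size=batch)
        g = A.T @ (A @ x_stale - c) / batch   # Step 3: averaged gradient at
                                              #         the (stale) local copy
        with lock:                            # Step 4: atomic prox update,
            x = soft_threshold(x - gamma * g, gamma * lam)

threads = [threading.Thread(target=worker, args=(s, 100)) for s in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because of Python's global interpreter lock, threads give no real speedup here; the point is only to show the read/compute/update pattern and that updates may be applied with out-of-date gradients.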

Figure 1: Illustration of two conceptually different realizations of Algorithm 1: a shared-memory implementation (left); a master-worker implementation (right). In the shared-memory setting shown to the left, a processor reads the decision vector from the shared memory and computes its averaged gradient vector. As the processors run without synchronization, new iterates are written to the shared memory by other processors while this gradient is being evaluated. The figure shows a snapshot of the algorithm at a time instance at which the shared memory is locked by one processor so that it can read the current iterate, update it using its out-of-date gradient, and write the result back to the memory. In the master-worker setting illustrated to the right, workers evaluate averaged gradient vectors in parallel and send their computations to buffers on the master processor, which is the sole entity with access to the global memory. The master performs an update using (possibly) out-of-date gradients and passes the updated decision vector back to the workers.

Independently of how we choose to implement the algorithm, processors may work at different rates: while one processor updates the decision vector (in the shared memory setting) or sends its averaged gradient to the master (in the master-worker setting), the others are generally busy computing averaged gradient vectors. The processors that perform gradient evaluations do not need to be aware of updates to the decision vector, but can continue to operate on stale information about x. Therefore, unlike synchronous parallel mini-batch algorithms [32], there is no need for processors to wait for each other to finish the gradient computations. Moreover, the value at which the average of gradients is evaluated by a processor may differ from the value of x to which the update is applied.

Algorithm 1 describes the asynchronous processes that run in parallel. To describe the progress of the overall optimization process, we introduce a counter k that is incremented each time x is updated. We let d(k) denote the time at which the iterate used to compute the averaged gradient involved in the update of x_k was read from the shared memory. It is clear that 0 ≤ d(k) ≤ k for all k. The value

k − d(k)

can be viewed as the delay between reading and updating for the processors, and captures the staleness of the information used to compute the average of gradients for the k-th update. We assume that the delay is not too long, i.e., there is a nonnegative integer τ such that

k − d(k) ≤ τ   for all k.

The value of τ is an indicator of the asynchronism in the algorithm and in the execution platform. In practice, τ will depend on the number of parallel processors used in the algorithm [33, 34, 35]. Note that the cyclic-delay mini-batch algorithm [36], in which the processors are ordered and each updates the decision variable under a fixed schedule, is a special case of Algorithm 1 in which the delay is constant and determined by the number of processors p, i.e., k − d(k) = τ for all k.

1:  Inputs: positive step-sizes {γ_k}; batch size b.
2:  Initialization: x₀ ∈ dom Ψ; k = 0.
3:  repeat
4:     read x from the shared memory;
5:     receive b inputs ξ₁, …, ξ_b sampled i.i.d. from distribution P;
6:     compute the averaged stochastic gradient vector g = (1/b) Σᵢ₌₁ᵇ ∇ₓ f(x, ξᵢ);
7:     update x in the shared memory via the composite mirror descent step (3);
8:  until termination test satisfied
Algorithm 1 Asynchronous Mini-batch Algorithm (running on each processor)

IV-B Convergence Rate for General Convex Regularization

The following theorem establishes convergence properties of Algorithm 1 when a constant step-size is used.

Theorem 1

Let Assumptions 1–4 hold, and assume that the constant step-size γ satisfies condition (4) for all k. Then, for every T ∈ ℕ and any optimizer x* of (1), the expected suboptimality E[Φ(x̄_T)] − Φ(x*) is bounded by the sum of a term of order 1/(γT) and a residual term proportional to γ, where x̄_T is the Cesàro average of the iterates, i.e.,

x̄_T = (1/T) Σₖ₌₁ᵀ x_k,

b is the batch size, and the expectation is taken with respect to all random variables ξ.

Proof: See Appendix A.

Theorem 1 demonstrates that for any constant step-size satisfying (4), the running average of the iterates generated by Algorithm 1 will converge in expectation to a ball around the optimum at a rate of O(1/T). The convergence rate and the residual error depend on the choice of γ: decreasing γ reduces the residual error, but it also results in slower convergence. We now describe a possible strategy for selecting the constant step-size. Let T be the total number of iterations necessary to achieve an ε-optimal solution to Problem (1), that is, E[Φ(x̄_T)] − Φ(x*) ≤ ε. If we pick γ according to the choice (5), it follows from Theorem 1 that the running average x̄_T satisfies a bound in which only the first term on the right-hand side depends on the delay bound τ. This inequality tells us that if that first term is less than ε, then E[Φ(x̄_T)] − Φ(x*) = O(ε). Hence, the iteration complexity of Algorithm 1 with the step-size choice (5) is given by (6).

As long as the maximum delay bound τ is of the order O(T^(1/4)), the first term in (6) is asymptotically negligible, and hence the iteration complexity of Algorithm 1 is asymptotically O(1/ε²), which is exactly the iteration complexity achieved by the mini-batch algorithm for solving stochastic convex optimization problems in a serial setting [32]. As discussed before, τ is related to the number of processors used in the algorithm. Therefore, if the number of processors is of the order of O(T^(1/4)), parallelization does not appreciably degrade the asymptotic convergence of Algorithm 1. Furthermore, as the p processors run in parallel, updates occur roughly p times as quickly, so in the same wall-clock time the processors may compute p times as many averaged gradient vectors as a single processor could. This means that a near-linear speedup in the number of processors can be expected.

Remark 3

Another strategy for the selection of the constant step-size in Algorithm 1 is to use a γ that depends on prior knowledge of the number of iterations to be performed. More precisely, assume that the number of iterations is fixed in advance, say equal to T. Choosing γ as a function of T and a scaling parameter β > 0, it follows from Theorem 1 that the running average of the iterates after T iterations satisfies a bound whose second term depends on β. It is easy to verify that there is an optimal choice of β, namely the one which minimizes the second term on the right-hand side of the above inequality. With this choice of β, in the case that τ = 0, the preceding guaranteed bound reduces to the one obtained in [8, Theorem 1] for the serial stochastic mirror descent algorithm with constant step-sizes. Note that in order to implement Algorithm 1 with the optimal constant step-size policy, we need to estimate an upper bound on D(x*, x₀), since x* is usually unknown.

The following theorem characterizes the convergence of Algorithm 1 with a time-varying step-size sequence when is bounded in addition to being closed and convex.

Theorem 2

Suppose that Assumptions 1–4 hold. In addition, suppose that dom Ψ is compact and that the Bregman distance D is bounded on dom Ψ. If the step-size γ_k is set to the inverse of the sum of a term proportional to √k and a constant term that accounts for the maximum delay bound τ, then the Cesàro average x̄_T of the iterates generated by Algorithm 1 satisfies

E[Φ(x̄_T)] − Φ(x*) = O(1/√T)

for all T ∈ ℕ.

Proof: See Appendix B.

The time-varying step-size γ_k, which ensures the convergence of the algorithm, consists of two terms: the time-varying term should control the errors from stochastic gradient information, while the role of the constant term (which depends on τ) is to decrease the effects of asynchrony (bounded delays) on the convergence of the algorithm. According to Theorem 2, the delay becomes increasingly harmless as the algorithm progresses, and the expected function value evaluated at x̄_T converges asymptotically at a rate O(1/√T), which is known to be the best achievable rate of the mirror descent method for nonsmooth stochastic convex optimization problems [7].

For the special case of the optimization problem (1) where Ψ is restricted to be the indicator function of a compact convex set, Agarwal and Duchi [36, Theorem 2] showed that the delayed stochastic mirror descent method with time-varying step-size converges at rate O(1/√T) with an additive asymptotic penalty due to the delays. Compared with this result, the delay-induced penalty term in our bound decays at a faster rate in T and is therefore much smaller for large T. Hence, not only do we extend the result of [36] to general regularization functions, but we also obtain a sharper guaranteed convergence rate than the one presented in [36].

IV-C Convergence Rate for Strongly Convex Regularization

In this subsection, we restrict our attention to stochastic composite optimization problems with strongly convex regularization terms. Specifically, we assume that Ψ is μ_Ψ-strongly convex with respect to ‖·‖, that is, for any x, y ∈ dom Ψ and any s ∈ ∂Ψ(x),

Ψ(y) ≥ Ψ(x) + ⟨s, y − x⟩ + (μ_Ψ/2)‖y − x‖².

Examples of strongly convex regularization functions include:

  • ℓ2-regularization: Ψ(x) = (θ/2)‖x‖₂² with θ > 0.

  • Elastic net regularization: Ψ(x) = λ‖x‖₁ + (θ/2)‖x‖₂² with λ > 0 and θ > 0.
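In the Euclidean case, the composite update with an elastic net regularizer also has a closed form: soft-thresholding followed by a scaling. A minimal sketch (function name and defaults are ours):

```python
import numpy as np

def prox_elastic_net(v, step, lam=0.1, theta=1.0):
    # prox of Psi(x) = lam*||x||_1 + (theta/2)*||x||_2^2 at point v, i.e.
    # argmin_x Psi(x) + (1/(2*step))*||x - v||_2^2:
    # soft-threshold at level step*lam, then shrink by 1/(1 + step*theta)
    shrunk = np.sign(v) * np.maximum(np.abs(v) - step * lam, 0.0)
    return shrunk / (1.0 + step * theta)

v = np.array([1.0, -0.05, 0.5])
```

Setting lam = 0 recovers the prox of pure ℓ2-regularization, v / (1 + step*theta).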

Remark 4

The strong convexity of Ψ implies that Problem (1) has a unique minimizer x* [41, Corollary 11.16].

In order to derive the convergence rate of Algorithm 1 for solving (1) with a strongly convex regularization term, we need to assume that the Bregman distance function used in the algorithm satisfies the next assumption.

Assumption 5 (Quadratic growth condition)

For all x, y ∈ dom Ψ, we have

D(x, y) ≤ (G/2)‖x − y‖²

with G ≥ 1.

For example, if ω(x) = ½‖x‖₂², then D(x, y) = ½‖x − y‖₂² and G = 1. Note that Assumption 5 will automatically hold when the distance generating function has Lipschitz continuous gradient with constant G [12].

The associated convergence result now reads as follows.

Theorem 3

Suppose that the regularization function Ψ is μ_Ψ-strongly convex and that Assumptions 2–5 hold. If the step-size γ_k is set to decay proportionally to 1/k, with the proportionality constant accounting for the maximum delay bound τ, then the iterates produced by Algorithm 1 satisfy

E[Φ(x_T)] − Φ(x*) = O(1/T)

for all T ∈ ℕ.

Proof: See Appendix C.

An interesting point regarding Theorem 3 is that for solving stochastic composite optimization problems with strongly convex regularization functions, the maximum delay bound τ can be as large as O(√T) without affecting the asymptotic convergence rate of Algorithm 1. In this case, our asynchronous mini-batch algorithm converges asymptotically at a rate of O(1/T), which matches the best known rate achievable in a serial setting.

V Experimental Results

We have developed a complete master-worker implementation of our algorithm in C++ using the Message Passing Interface libraries (OpenMPI). Although we argued in Section IV that Algorithm 1 can be implemented using atomic operations on shared-memory computing architectures, we have chosen the MPI implementation due to its flexibility in scaling the problem to distributed-memory environments.

We evaluated our algorithm on a document classification problem using the text categorization dataset rcv1 [42]. This dataset consists of 804,414 documents, with 47,236 unique stemmed tokens spanning 103 topics. Out of these topics, we decided to classify sports-related documents. To this end, we trained a sparse (binary) classifier by solving the ℓ1-regularized logistic regression problem of Section III. Here, x is the sparse vector of token weights, and b indicates whether a given document is sports-related or not (b is +1 if the document is about sport, and −1 otherwise). To evaluate scalability, we used both the training and test sets available when solving the optimization problem. We implemented Algorithm 1 with time-varying step-sizes, and used a batch size of 1000 documents. The regularization parameter λ was kept fixed, and the algorithm was run until a fixed tolerance was met.
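The per-batch loss and averaged gradient that each worker computes for this experiment can be sketched as follows; the data here are small random stand-ins for the rcv1 features, and the ℓ1 term is handled separately by the proximal update:

```python
import numpy as np

def logistic_loss(x, A, b):
    # average logistic loss over a batch:
    # (1/m) * sum_i log(1 + exp(-b_i * <a_i, x>))
    margins = b * (A @ x)
    return np.mean(np.logaddexp(0.0, -margins))

def logistic_grad(x, A, b):
    # gradient of the averaged logistic loss above
    margins = b * (A @ x)
    coeff = -b / (1.0 + np.exp(margins))
    return A.T @ coeff / A.shape[0]

rng = np.random.default_rng(1)
A = rng.normal(size=(5, 3))                  # 5 documents, 3 token weights
b = np.array([1.0, -1.0, 1.0, 1.0, -1.0])    # +1: sports, -1: other
x0 = np.zeros(3)
loss0 = logistic_loss(x0, A, b)              # equals log(2) at x = 0
```

Using `np.logaddexp` avoids overflow for large negative margins, which matters when token weights grow during training.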

Figure 2 presents the achieved relative speedup of the algorithm with respect to the number of workers used. The relative speedup of the algorithm on p processors is defined as S(p) = T(1)/T(p), where T(1) and T(p) are the times it takes to run the corresponding algorithm (to a given accuracy) on 1 and p processing units, respectively. We observe a near-linear relative speedup, consistent with our theoretical results. The timings are averaged over 10 Monte Carlo runs.

Figure 2: Speedup of Algorithm 1 with respect to the number of workers.

VI Conclusions

We have proposed an asynchronous mini-batch algorithm that exploits multiple processors to solve regularized stochastic optimization problems with smooth loss functions. We have established that for closed and convex constraint sets, the iteration complexity of the algorithm with constant step-sizes is asymptotically O(1/ε²). For compact constraint sets, we have proved that the running average of the iterates generated by our algorithm with time-varying step-sizes converges to the optimum at rate O(1/√T). When the regularization function is strongly convex and the constraint set is closed and convex, the algorithm achieves a rate of the order O(1/T). We have shown that the penalty in the convergence rate of the algorithm due to asynchrony is asymptotically negligible, and that a near-linear speedup in the number of processors can be expected. Our computational experience confirmed the theory.

In this section, we prove the main results of the paper, namely Theorems 1–3. We first state three key lemmas which are instrumental in our argument.

The following result establishes an important recursion for the iterates generated by Algorithm 1.

Lemma 1

Suppose Assumptions 1–4 hold. Then, the iterates generated by Algorithm 1 satisfy

where {β_k} is a sequence of strictly positive numbers, and e_k is the error in the gradient estimate.

We start with the first-order optimality condition for the point x_{k+1} in the minimization problem (3): there exists a subgradient s ∈ ∂Ψ(x_{k+1}) such that for all x ∈ dom Ψ, we have

where ∇₂D denotes the partial derivative of the Bregman distance function with respect to the second variable. Plugging the following equality

into the previous inequality and re-arranging terms gives


where the last inequality used

by the (strong) convexity of Ψ. We now use the following well-known three-point identity of the Bregman distance function [43] to rewrite the left-hand side of (8):

From this relation, with the three points chosen appropriately, we have

Substituting the preceding equality into (8) and re-arranging terms results in

Since the distance generating function is 1-strongly convex, we have the lower bound

which implies that


The essential idea in the rest of the proof is to use the convexity and smoothness of the expectation function F to bound the inner-product terms for each k. According to Assumption 2, ∇f(·, ξ) and, hence, ∇F are Lipschitz continuous with the constant L. By using the L-Lipschitz continuity of ∇F and then the convexity of F, we have

for any x ∈ dom Ψ. Combining inequalities (9) and (10), and recalling the definition of e_k, we obtain

We now rewrite the above inequality in terms of the error e_k as follows:

We will seek upper bounds on the two error-dependent quantities in this inequality. Let {β_k} be a sequence of positive numbers. For the first quantity, we have

where the second inequality follows from the Fenchel–Young inequality applied to a conjugate pair of norms, i.e.,

⟨a, b⟩ ≤ ½‖a‖² + ½‖b‖*².

We now turn to the second quantity. It follows from the definition of e_k that

Then, by the convexity of the norm, we conclude that

where the last inequality comes from our assumption that k − d(k) ≤ τ for all k. Substituting inequalities (12) and (13) into the bound (11) and simplifying yield

Setting β_k, where