# Swarming for Faster Convergence in Stochastic Optimization

We study a distributed framework for stochastic optimization that is inspired by models of collective motion found in nature (e.g., swarming) and has mild communication requirements. Specifically, we analyze a scheme in which each of N > 1 independent threads implements, in a distributed and unsynchronized fashion, a stochastic gradient-descent algorithm perturbed by a swarming potential. Assuming the overhead caused by synchronization is not negligible, we show that the swarming-based approach exhibits better (real-time) convergence speed than a centralized algorithm based upon the average of N observations. We also derive an error bound that is monotone decreasing in network size and connectivity. Finally, we characterize the scheme's finite-time performance for both convex and non-convex objective functions.


## 1 Introduction

Consider the optimization problem

$$\min_{x\in\mathbb{R}^{m}}f(x) \tag{1}$$

where $f:\mathbb{R}^{m}\to\mathbb{R}$ is a differentiable function. In many applications, the gradient $\nabla f$ is not available in closed form and only noisy estimates can be computed by time-consuming simulations (see [2, 22] for a survey of gradient estimation techniques). Noise can come from many sources, such as incomplete convergence, modeling and discretization error, and finite sample size for Monte-Carlo methods (see for instance [11]).

In this paper we consider the case in which independent and symmetric stochastic oracles (SOs) can be queried in order to obtain noisy gradient samples of the form $\nabla f(x)+\varepsilon$ satisfying the following condition: each noise vector $\varepsilon$ is independent, with $\mathbb{E}[\varepsilon]=0$ and $\mathbb{E}\|\varepsilon\|^{2}\leq\sigma^{2}$ for some $\sigma>0$. We assume the time needed to generate a gradient sample is random and not negligible (e.g., samples are obtained by executing time-consuming simulations). In this setting, a centralized implementation of the stochastic gradient-descent algorithm, in which a batch of $N$ samples is gathered at every iteration, incurs an overhead that is increasing in $N$. If sampling is undertaken in parallel by simultaneously querying $N$ SOs, the overhead burden is reduced but is still non-negligible due to the need for synchronization. In this paper we study a distributed optimization scheme that does not require synchronization. In the scheme, $N$ independent computing threads each implement a stochastic gradient-descent algorithm in which a noisy gradient sample is perturbed by an attractive term, a function of the relative distance between solutions identified by neighboring threads (where the notion of neighborhood is related to a given network topology). This coupling is similar to those found in mathematical models of swarming (see [4]). Complex forms of collective motion such as swarms can be found in nature in many organisms, ranging from simple bacteria to mammals (see [17, 16, 20] for references). Such forms of collective behavior rely on limited communication amongst individuals and are believed to be effective for avoiding predators and/or increasing the chances of finding food (foraging) (see [7, 19]).

We show the proposed scheme has an important noise reduction property: noise realizations that make an individual trajectory differ too much from the group average are likely to be discarded by the attractive term, which aims to maintain cohesion. In contrast to the centralized sample-average approach, the noise reduction obtained in a swarming-based framework with $N$ threads does not require synchronization, since each thread only needs the current solutions identified by its neighboring threads. When sampling times are not negligible and exhibit large variation, synchronization may result in significant overhead, so the real-time performance of stochastic gradient-descent algorithms based upon the average of $N$ samples obtained in parallel suffers under large sampling-time variability. In contrast, a swarming-based implementation with $N$ threads performs better because each thread can asynchronously update its solution based upon a small sample size and still reap the benefits of noise reduction stemming from the swarming discipline.

The main contribution of this paper is the formalization of the benefits of the swarming-based approach for stochastic optimization. Specifically, we show the approach exhibits better (real-time) convergence speed than a centralized algorithm based upon the average of $N$ observations. We derive an error bound that is monotone decreasing in network size and connectivity. Finally, we characterize the scheme's finite-time performance for both convex and non-convex objective functions.

The structure of this paper is as follows. We introduce two candidate algorithms for solving problem (1) in Section 2 along with the related literature. In Section 3 we perform preliminary analysis on the swarming-based approach. In Section 4 we present our main results. A numerical example is provided in Section 5. We conclude the paper in Section 6.

## 2 Preliminaries

We now present two algorithms for solving problem (1). First, we introduce a centralized stochastic gradient-descent algorithm. Then we propose the corresponding swarming-based approach.

### A Centralized Algorithm

A centralized gradient-descent algorithm is of the form (see for instance [12]):

$$x_{k+1}=x_k+\gamma_k u_k,\quad k\in\mathbb{N} \tag{2}$$

Suppose that in a single step, the $N$ SOs each generate a noisy gradient sample in parallel, and that $u_k$ is the negative of the averaged sample. Then (2) can be rewritten as

$$x_{k+1}=x_k+\gamma_k\left(-\nabla f(x_k)+\epsilon_k\right),\quad k\in\mathbb{N} \tag{3}$$

where $\epsilon_k$ denotes the average of the $N$ independent noise terms, so that $\mathbb{E}[\epsilon_k]=0$ and $\mathbb{E}\|\epsilon_k\|^{2}\leq\sigma^{2}/N$. Here $\gamma_k$ is the step size. In this paper we consider constant step-size policies, i.e., $\gamma_k\equiv\gamma$ for some $\gamma>0$.
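To make the update concrete, here is a minimal sketch of one step of the centralized scheme (3). A Gaussian oracle stands in for the simulation-based SOs, and all names are illustrative, not part of the paper's formal setup:

```python
import random

def centralized_sgd_step(x, grad_f, gamma, num_oracles, noise_std):
    """One step of scheme (3): query num_oracles noisy gradient samples
    in parallel, average them, then take a single step of size gamma.
    A Gaussian oracle stands in for the simulation-based SOs."""
    samples = [[g + random.gauss(0.0, noise_std) for g in grad_f(x)]
               for _ in range(num_oracles)]
    avg = [sum(s[d] for s in samples) / num_oracles for d in range(len(x))]
    return [xd - gamma * gd for xd, gd in zip(x, avg)]
```

Averaging the batch before stepping is exactly the source of the $\sigma^{2}/N$ variance reduction noted above, at the cost of waiting for the slowest oracle.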

### A Swarming-Based Approach

A swarming-based asynchronous implementation has $N$ computing threads. In contrast to the centralized approach, each thread queries one SO and independently implements a gradient-descent algorithm based on only one sample per step:

$$x_{i,k+1}=x_{i,k}+\gamma\Big[-\nabla f(x_{i,k})+\varepsilon_{i,k}-a\sum_{j=1,j\neq i}^{N}\alpha_{ij}\left(x_{i,k}-x^{i}_{j,k}\right)\Big],\quad k\in\mathbb{N} \tag{4}$$

where $a>0$, $\varepsilon_{i,k}$ is the sampling noise, and $x^{i}_{j,k}$ denotes the solution of thread $j$ at the time of thread $i$'s $k$-th update.

The last term on the right-hand side of equation (4) represents the mutual attraction between individual threads, in which $a$ measures the degree of attraction (see [3] for a reference). Let $\mathcal{G}=(\mathcal{V},\mathcal{E})$ denote the graph of all threads, where $\mathcal{V}$ stands for the set of vertices (threads), and $\mathcal{E}$ is the set of edges connecting vertices. Let $\mathcal{A}=[\alpha_{ij}]$ be the adjacency matrix of $\mathcal{G}$, where $\alpha_{ij}\in\{0,1\}$: $\alpha_{ij}=1$ indicates that thread $i$ is informed of the solution identified by thread $j$, and $\alpha_{ij}=0$ otherwise. In this paper we assume the following condition regarding the network structure amongst threads: the graph $\mathcal{G}$ corresponding to the network of threads is undirected ($\alpha_{ij}=\alpha_{ji}$) and connected, i.e., there is a path between every pair of vertices.
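A single update of thread $i$ under (4) can be sketched as follows. The `noisy_grad` callable is a hypothetical stand-in for the SO, and for simplicity the neighbors' solutions are read at their current values:

```python
def swarm_step(i, x, noisy_grad, gamma, a, alpha):
    """One update of thread i under scheme (4): a noisy gradient step
    plus an attraction term pulling x[i] toward the neighbors j for
    which alpha[i][j] = 1. x holds the current solutions of all threads."""
    N, m = len(x), len(x[i])
    g = noisy_grad(x[i])                     # noisy gradient sample
    attract = [a * sum(alpha[i][j] * (x[i][d] - x[j][d])
                       for j in range(N) if j != i)
               for d in range(m)]
    return [x[i][d] - gamma * (g[d] + attract[d]) for d in range(m)]
```

On a flat objective the attraction term alone contracts the threads toward one another, which is the cohesion mechanism analyzed in Section 3.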

### 2.1 Sampling Times and Time-Scales

In our comparison of real-time convergence speeds we will need to describe the time-scales in which the algorithms described above operate. These time-scales depend upon the following standing assumption regarding sampling times: the times needed for generating gradient samples by the oracles are independent and exponentially distributed with mean $\Delta t$. It follows that for the centralized implementation, the time in between updates is the maximum of $N$ such sampling times. To describe the time-scale for the swarming-based scheme, note that each thread updates as soon as its own sample is ready. In this time-scale, the time in between any two updates (by possibly different threads) is exponentially distributed with mean $\Delta t/N$ (the minimum of $N$ independent exponentials). Suppose there is a (virtual) global clock that ticks whenever a thread updates its solution, and let $\mathbf{1}_{i,k}$ be the indicator random variable for the event that thread $i$ provides the $(k+1)$-th update. We rewrite the asynchronous algorithm as follows:

$$x_{i,k+1}=x_{i,k}+\gamma u_{i,k}\mathbf{1}_{i,k},\quad k\in\mathbb{N} \tag{5}$$

where

$$u_{i,k}=-\nabla f(x_{i,k})+\varepsilon_{i,k}-a\sum_{j=1,j\neq i}^{N}\alpha_{ij}\left(x_{i,k}-x_{j,k}\right).$$
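The global-clock view (5) is convenient for simulation: with exponential sampling times, each tick of the clock is an update by a uniformly random thread. A minimal sketch under that assumption (illustrative names; the noisy gradient is again a stand-in oracle):

```python
import random

def run_async(x, noisy_grad, gamma, a, alpha, num_updates, rng):
    """Simulate the global clock of Section 2.1: at each tick a uniformly
    random thread i updates (the indicator 1_{i,k} in (5)); all other
    threads keep their current solutions."""
    N = len(x)
    for _ in range(num_updates):
        i = rng.randrange(N)                 # thread providing this update
        m = len(x[i])
        g = noisy_grad(x[i])
        attract = [a * sum(alpha[i][j] * (x[i][d] - x[j][d])
                           for j in range(N) if j != i)
                   for d in range(m)]
        x[i] = [x[i][d] - gamma * (g[d] + attract[d]) for d in range(m)]
    return x
```

No thread ever waits for another, which is precisely the synchronization-free property exploited in the real-time comparison of Section 4.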

### 2.2 Swarming for Faster Convergence: A Preview of the Main Results

To illustrate the benefits of the swarming-based approach, we will compare the performances of the centralized approach (3) and the swarming-based approach (4) in Section 4. In what follows we provide a brief introduction to the main results.

Denote by $x^{*}$ an optimal solution of problem (1). Let $G_k:=\|x_k-x^{*}\|^{2}$ (respectively, $U_k:=\|\bar x_k-x^{*}\|^{2}$, where $\bar x_k$ is the average of the $N$ thread solutions) measure the quality of solutions under the centralized scheme (respectively, the swarming-based approach).

Assuming $f$ is $\kappa$-strongly convex with $L$-Lipschitz continuous gradients, for sufficiently small step size $\gamma$, we will show that $\mathbb{E}[G_k]$ and $\mathbb{E}[U_k]$ both admit upper bounds of the same order.

Regarding real-time performance, we will show that $\mathbb{E}[U_{Nk}]$ and $\mathbb{E}[G_k]$ converge at a similar rate. However, the time needed to complete $Nk$ iterations in the swarming-based method is approximately

$$Nk\cdot\Delta t/N=k\Delta t,$$

while the time needed for the centralized algorithm to complete $k$ steps is approximately $k\Delta t_c$, where $\Delta t_c>\Delta t$ is the expected time for all $N$ oracles to deliver a sample. The speed-up achieved through the swarming-based framework is due to the fact that $\Delta t_c>\Delta t$. Since this relation holds true in general, the property is likely to be preserved under other distributions of sampling times as well. As we shall see in Section 4, the ratio of convergence speeds between the swarming-based approach and the centralized algorithm is approximately $\Delta t_c/\Delta t$. In other words, each scheme's real-time convergence speed is proportional to the rate at which it applies gradient updates.

### 2.3 Literature Review

Our work is linked with the extensive literature on stochastic approximation (SA) methods dating back to [21] and [10]. These works include the analysis of convergence (conditions and rates of convergence, proper step-size choices) in the context of diverse noise models (see [12]). Recently there has been considerable interest in parallel or distributed implementations of stochastic gradient-descent methods (see [1, 24, 13, 23, 25] for examples). They mainly focus on minimizing a finite sum of convex functions, $\min_x\frac{1}{N}\sum_{i=1}^{N}f_i(x)$. Notice that we may write $f(x)=\frac{1}{N}\sum_{i=1}^{N}f(x)$, so that problem (1) can be seen as a special case of the finite-sum formulation, and the swarming-based scheme (4) resembles some of the algorithms in the literature (see [23] for example). However, to the best of our knowledge, this literature does not address the noise reduction properties stemming from multi-agent coordination. Moreover, it does not consider random sampling times or the real-time performance of the algorithms.

Our work is also related to population-based algorithms for simulation-based optimization. In these approaches, at every iteration, the quality of each solution in the population is assessed, and a new population of solutions is randomly generated based upon a given rule devised to achieve an acceptable trade-off between "exploration" and "exploitation". Recent efforts have focused on model-based approaches (see [9]), which differ from population-based methods in that candidate solutions are generated at each round by sampling from a "model", i.e., a probability distribution over the space of solutions. The basic idea is to adjust the model based on the sampled solutions in order to bias the future search towards regions containing solutions of higher quality (see [8] for a recent survey). These methods are inherently centralized in that the updating of populations (or models) is performed after the quality of all candidate solutions is assessed.

In [18] we also considered a swarming-type stochastic optimization method. In that paper we used stochastic differential equations to approximate the real-time optimization process. That approach relies on the assumption that step sizes are arbitrarily close to zero. Moreover, finite-time performance was obtained only for strongly convex functions.

## 3 Preliminary Analysis

In this section we study the stochastic processes associated with each of the $N$ threads in the swarming-based approach. The average solution $\bar x_k$ will play an important role in characterizing the performance. This part of the analysis demonstrates the cohesiveness among solutions identified by different threads. To this end, we will analyze the process $\bar V_k$ defined as

$$\bar V_k:=\frac{1}{N}\sum_{i=1}^{N}\left\|x_{i,k}-\bar x_k\right\|^{2}.$$

Let $\bar x_k:=\frac{1}{N}\sum_{i=1}^{N}x_{i,k}$ and $e_{i,k}:=x_{i,k}-\bar x_k$. Then $\sum_{i=1}^{N}e_{i,k}=0$.
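For concreteness, the cohesion measure $\bar V_k$ can be computed directly from the current thread solutions; a plain-Python sketch (the function name is ours):

```python
def cohesion(xs):
    """V_k of Section 3: the mean squared distance of the N thread
    solutions from their average; small values mean a cohesive swarm."""
    N, m = len(xs), len(xs[0])
    avg = [sum(x[d] for x in xs) / N for d in range(m)]
    return sum(sum((x[d] - avg[d]) ** 2 for d in range(m)) for x in xs) / N
```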

###### Lemma 1

Under Algorithm (4), suppose Assumption 2.1 holds. Then

$$\begin{aligned}\bar V_{k+1}=\bar V_k&-\frac{2\gamma}{N}\sum_{i=1}^{N}\nabla^{T}f(x_{i,k})\,e_{i,k}\mathbf{1}_{i,k}-\frac{2a\gamma}{N}\sum_{i=1}^{N}\sum_{j=1,j\neq i}^{N}\alpha_{ij}\,e_{i,k}^{T}\left(e_{i,k}-e_{j,k}\right)\mathbf{1}_{i,k}\\&+\frac{2\gamma}{N}\sum_{i=1}^{N}e_{i,k}^{T}\varepsilon_{i,k}\mathbf{1}_{i,k}+\frac{\gamma^{2}}{N}\sum_{i=1}^{N}\left\|\delta_{i,k}\right\|^{2},\end{aligned} \tag{6}$$

where

$$\delta_{i,k}=\delta^{f}_{i,k}+\delta^{g}_{i,k}+\delta^{n}_{i,k}, \tag{7}$$

with

$$\begin{cases}\delta^{f}_{i,k}=-\nabla f(x_{i,k})\mathbf{1}_{i,k}+\dfrac{1}{N}\displaystyle\sum_{j=1}^{N}\nabla f(x_{j,k})\mathbf{1}_{j,k},\\[6pt]\delta^{g}_{i,k}=-a\displaystyle\sum_{j=1,j\neq i}^{N}\alpha_{ij}\left(x_{i,k}-x_{j,k}\right)\mathbf{1}_{i,k}+\dfrac{a}{N}\displaystyle\sum_{j=1}^{N}\sum_{l=1,l\neq j}^{N}\alpha_{jl}\left(x_{j,k}-x_{l,k}\right)\mathbf{1}_{j,k},\\[6pt]\delta^{n}_{i,k}=\varepsilon_{i,k}\mathbf{1}_{i,k}-\dfrac{1}{N}\displaystyle\sum_{j=1}^{N}\varepsilon_{j,k}\mathbf{1}_{j,k}.\end{cases} \tag{8}$$

###### Proof

See Appendix B.1.

Let $\mathcal{L}=[l_{ij}]$ be the Laplacian matrix associated with the adjacency matrix $\mathcal{A}$, where $l_{ii}=\sum_{j\neq i}\alpha_{ij}$ and $l_{ij}=-\alpha_{ij}$ when $j\neq i$. For an undirected graph, the Laplacian matrix is symmetric positive semi-definite [6]. Let $e_k:=(e_{1,k}^{T},\ldots,e_{N,k}^{T})^{T}$. Then

$$\begin{aligned}\sum_{i=1}^{N}\sum_{j=1,j\neq i}^{N}\alpha_{ij}e_{i,k}^{T}\left(e_{i,k}-e_{j,k}\right)&=-\sum_{i=1}^{N}\sum_{j=1}^{N}l_{ij}e_{i,k}^{T}\left(e_{i,k}-e_{j,k}\right)=\sum_{i=1}^{N}\sum_{j=1}^{N}l_{ij}e_{j,k}^{T}e_{i,k}\\&=e_{k}^{T}\left(\mathcal{L}\otimes I_{m}\right)e_{k}\geq\lambda_{2}e_{k}^{T}e_{k}=\lambda_{2}\sum_{i=1}^{N}\left\|e_{i,k}\right\|^{2},\end{aligned} \tag{9}$$

where $\lambda_{2}$ is the second-smallest eigenvalue of $\mathcal{L}$, also called the algebraic connectivity of $\mathcal{G}$ (see [6]).
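The algebraic connectivity $\lambda_2$ entering the bounds can be computed directly from the adjacency matrix; a sketch using NumPy (the function name is ours):

```python
import numpy as np

def algebraic_connectivity(adj):
    """Second-smallest eigenvalue lambda_2 of the graph Laplacian
    L = D - A for a symmetric 0/1 adjacency matrix A."""
    A = np.asarray(adj, dtype=float)
    lap = np.diag(A.sum(axis=1)) - A
    return float(np.sort(np.linalg.eigvalsh(lap))[1])
```

For the complete graph on $N$ vertices this returns $N$ (the strongest possible connectivity); for a sparse path graph it is much smaller, which weakens the error bound of Section 4.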

###### Lemma 2

Suppose Assumptions 1, 2 and 2.1 hold. Under Algorithm (4), let $x_k:=(x_{1,k}^{T},\ldots,x_{N,k}^{T})^{T}$, and let $\mathbb{E}[\,\cdot\,|\,x_k]$ denote the conditional expectation given $x_k$. Then,

$$\begin{aligned}\mathbb{E}\left[\bar V_{k+1}\,|\,x_k\right]\leq\bar V_k&-\frac{2\gamma}{N^{2}}\sum_{i=1}^{N}\nabla^{T}f(x_{i,k})\,e_{i,k}-\frac{2a\lambda_{2}\gamma}{N}\bar V_k\\&+\frac{\gamma^{2}}{N^{2}}\sum_{i=1}^{N}\Big\|\nabla f(x_{i,k})+a\sum_{j=1,j\neq i}^{N}\alpha_{ij}\left(x_{i,k}-x_{j,k}\right)\Big\|^{2}+\frac{\gamma^{2}\sigma^{2}}{N}. \end{aligned}\tag{10}$$

###### Proof

See Appendix B.2.

###### Remark 1

Lemma 2 sheds light on the cohesiveness of the solutions obtained by different threads. When $f$ satisfies proper regularity conditions, $\bar V_k$ is expected to decrease once it exceeds a certain value. As a result, $\bar V_k$ is bounded in expectation, so the solutions $x_{i,k}$ are not too different from each other. As we shall see in the next section, Lemma 2 helps us characterize the superior performance of the swarming-based approach.

## 4 Main Results

In this section we formalize the superior properties of the swarming-based framework. The following additional assumptions will be used. (Lipschitz) $\|\nabla f(x)-\nabla f(y)\|\leq L\|x-y\|$ for some $L>0$ and for all $x,y\in\mathbb{R}^{m}$. (Strong convexity) $f(y)\geq f(x)+\nabla^{T}f(x)(y-x)+\frac{\kappa}{2}\|y-x\|^{2}$ for some $\kappa>0$ and for all $x,y\in\mathbb{R}^{m}$. (Convexity) $f(y)\geq f(x)+\nabla^{T}f(x)(y-x)$ for all $x,y\in\mathbb{R}^{m}$.

Let us introduce the measure $U_k:=\|\bar x_k-x^{*}\|^{2}$ of the distance between the average solution identified by all threads at step $k$ and an optimal solution $x^{*}$. We present some additional lemmas below.

###### Lemma 3

Suppose Assumptions 1 and 2.1 hold. Under Algorithm (4),

$$\begin{aligned}\mathbb{E}\left[U_{k+1}\,|\,x_k\right]\leq U_k&-\frac{2\gamma}{N^{2}}\sum_{i=1}^{N}\nabla^{T}f(x_{i,k})\left(x_{i,k}-x^{*}\right)+\frac{2\gamma}{N^{2}}\sum_{i=1}^{N}\nabla^{T}f(x_{i,k})\left(x_{i,k}-\bar x_k\right)\\&+\frac{\gamma^{2}}{N^{3}}\sum_{i=1}^{N}\Big\|\nabla f(x_{i,k})+a\sum_{j=1,j\neq i}^{N}\alpha_{ij}\left(x_{i,k}-x_{j,k}\right)\Big\|^{2}+\frac{\gamma^{2}\sigma^{2}}{N^{2}}. \end{aligned}\tag{11}$$

###### Proof

See Appendix B.3.

###### Lemma 4

Let $d_i$ denote the degree of vertex $i$ in graph $\mathcal{G}$ and let $\bar d:=\max_{1\leq i\leq N}d_i$. Then

$$\sum_{i=1}^{N}\Big\|\nabla f(x_{i,k})+a\sum_{j=1,j\neq i}^{N}\alpha_{ij}\left(x_{i,k}-x_{j,k}\right)\Big\|^{2}\leq 2\sum_{i=1}^{N}\left\|\nabla f(x_{i,k})\right\|^{2}+8a^{2}N\bar d^{2}\bar V_k. \tag{12}$$

###### Proof

See Appendix B.4.

###### Lemma 5

Suppose Assumptions 1, 2, 2.1, 4, and 4 hold. Under Algorithm (4), let $\omega>0$ be arbitrary. Then,

$$\begin{aligned}\mathbb{E}\left[U_{k+1}+\omega\bar V_{k+1}\,|\,x_k\right]\leq\ &U_k+\omega\bar V_k-2\Big[\frac{\gamma}{N^{2}}-\left(1+\omega N\right)\frac{L\gamma^{2}}{N^{3}}\Big]\sum_{i=1}^{N}\nabla^{T}f(x_{i,k})\left(x_{i,k}-x^{*}\right)\\&+\frac{2(1-\omega)L\gamma}{N}\bar V_k-\frac{2\omega a\lambda_{2}\gamma}{N}\bar V_k+\frac{8(1+\omega N)a^{2}\bar d^{2}\gamma^{2}}{N^{2}}\bar V_k+(1+\omega N)\frac{\gamma^{2}\sigma^{2}}{N^{2}}. \end{aligned}\tag{13}$$

###### Proof

See Appendix B.5.

### 4.1 Strongly-Convex Objective Function

In this section we will characterize and compare the performances of the swarming-based approach (4) and the centralized method (3) when the objective function is strongly convex. Specifically, we will discuss the ultimate error bounds and convergence speeds achieved under the two approaches. We will show that the two approaches are comparable in their ultimate error bounds, while the swarming-based method enjoys faster convergence.

The following theorem characterizes the performance of the swarming-based approach when the objective function is Lipschitz continuous and strongly convex.

###### Theorem 4.1

Suppose Assumptions 1, 2, 2.1, 4 and 4 hold. Under Algorithm (4) with step size $\gamma$ satisfying the step-size condition (14), we have

$$\mathbb{E}\left[\left\|\bar x_k-x^{*}\right\|^{2}\right]\leq\phi_k=\frac{(1+\hat\omega N)\gamma\sigma^{2}}{2\kappa N-2\kappa(1+\hat\omega N)L\gamma}+\Big[U_0+\hat\omega\bar V_0-\frac{(1+\hat\omega N)\gamma\sigma^{2}}{2\kappa N-2\kappa(1+\hat\omega N)L\gamma}\Big](1-C)^{k}. \tag{15}$$

Here $\hat\omega$ is the solution to

$$\kappa L\gamma\hat\omega^{2}-\Big[\kappa+\frac{N-1}{N}\kappa L\gamma-L-a\lambda_{2}+4a^{2}\bar d^{2}\gamma\Big]\hat\omega=-\kappa+\frac{\kappa L\gamma}{N}+L+\frac{4a^{2}\bar d^{2}\gamma}{N}, \tag{16}$$

and

$$C=\frac{2\kappa\gamma}{N}-\frac{2\kappa(1+\hat\omega N)L\gamma^{2}}{N^{2}}. \tag{17}$$

###### Proof

See Appendix C.1.

###### Remark 2

There are two terms on the right-hand side of inequality (15). The first term is a positive constant, while the second term converges to zero exponentially fast; $C$ determines the convergence speed. In the long run ($k\to\infty$),

$$\limsup_{k}\mathbb{E}\left[\left\|\bar x_k-x^{*}\right\|^{2}\right]\leq\phi^{*}:=\frac{(1+\hat\omega N)\gamma\sigma^{2}}{2\kappa N-2\kappa(1+\hat\omega N)L\gamma}.$$

The following corollary characterizes how $\phi^{*}$ depends on several parameters, including the step size $\gamma$, the algebraic connectivity $\lambda_{2}$, the noise variance $\sigma^{2}$, and the convexity factor $\kappa$.

###### Corollary 1

Suppose all the conditions in Theorem 4.1 hold, and the step size satisfies

$$\kappa L\gamma+4a^{2}\bar d^{2}\gamma<\tfrac{1}{2}a\lambda_{2}+L-\kappa. \tag{18}$$

Then,

$$\phi^{*}=O\Big(\frac{\sigma^{2}}{\kappa\lambda_{2}}\Big).$$

If in addition $\lambda_{2}=\Theta(N)$, i.e., $\lambda_{2}$ is bounded above and below by constant multiples of $N$ asymptotically (e.g., when $\mathcal{G}$ is a complete graph),

$$\phi^{*}=O\Big(\frac{\sigma^{2}}{\kappa N}\Big).$$

###### Proof

By (18),

$$\kappa L\gamma+4a^{2}\bar d^{2}\gamma<\tfrac{1}{2}a\lambda_{2}+L-\kappa.$$

According to definition (16),

$$\tfrac{1}{2}a\lambda_{2}\hat\omega\leq-\kappa+\frac{\kappa L\gamma}{N}+L+\frac{4a^{2}\bar d^{2}\gamma}{N}\leq-\kappa+L+\frac{1}{N}\Big(\tfrac{1}{2}a\lambda_{2}+L-\kappa\Big).$$

Hence $\hat\omega=O(1/\lambda_{2})$, and

$$\phi^{*}=\frac{(1+\hat\omega N)\gamma\sigma^{2}}{2\kappa N-2\kappa(1+\hat\omega N)L\gamma}=O\Big(\frac{\sigma^{2}}{\kappa\lambda_{2}}\Big).$$

When $\lambda_{2}=\Theta(N)$, $\phi^{*}=O\big(\frac{\sigma^{2}}{\kappa N}\big)$ follows immediately.

###### Remark 3

From Corollary 1, $\phi^{*}$ is decreasing in the algebraic connectivity $\lambda_{2}$ of the swarming network. In particular, with strong connectivity ($\lambda_{2}=\Theta(N)$), $\phi^{*}$ decreases in the number of computing threads. This demonstrates the noise reduction property of the swarming-based scheme.

#### 4.1.1 Comparison with Centralized Implementation

In this section, we compare the performances of the swarming-based approach (4) and the centralized method (3) in terms of their ultimate error bounds and convergence speeds.

First we derive the convergence results for the centralized algorithm (3):

$$x_{k+1}=x_k+\gamma\left(-\nabla f(x_k)+\epsilon_k\right).$$

Suppose Assumptions 4, 4 hold and $\gamma<2/L$. Define $G_k:=\|x_k-x^{*}\|^{2}$ to measure the quality of solutions under this algorithm. Then,

$$\begin{aligned}G_{k+1}&=\left\|x_{k+1}-x^{*}\right\|^{2}=\left\|x_k+\gamma\left(-\nabla f(x_k)+\epsilon_k\right)-x^{*}\right\|^{2}\\&=G_k+\gamma^{2}\left\|\nabla f(x_k)\right\|^{2}+\gamma^{2}\left\|\epsilon_k\right\|^{2}-2\gamma\nabla^{T}f(x_k)\left(x_k-x^{*}\right)+2\gamma\left(x_k-x^{*}-\gamma\nabla f(x_k)\right)^{T}\epsilon_k\\&\leq G_k+\left(L\gamma^{2}-2\gamma\right)\nabla^{T}f(x_k)\left(x_k-x^{*}\right)+\gamma^{2}\left\|\epsilon_k\right\|^{2}+2\gamma\left(x_k-x^{*}-\gamma\nabla f(x_k)\right)^{T}\epsilon_k\\&\leq G_k+\kappa\left(L\gamma-2\right)\gamma G_k+\gamma^{2}\left\|\epsilon_k\right\|^{2}+2\gamma\left(x_k-x^{*}-\gamma\nabla f(x_k)\right)^{T}\epsilon_k,\end{aligned}$$

where the first inequality uses $\|\nabla f(x_k)\|^{2}\leq L\nabla^{T}f(x_k)(x_k-x^{*})$ and the second uses strong convexity together with $L\gamma-2<0$.

Taking expectations on both sides (the zero-mean noise terms vanish and $\mathbb{E}\|\epsilon_k\|^{2}\leq\sigma^{2}/N$),

$$\mathbb{E}\left[G_{k+1}\right]\leq\mathbb{E}\left[G_k\right]+\kappa\left(L\gamma-2\right)\gamma\,\mathbb{E}\left[G_k\right]+\frac{\gamma^{2}\sigma^{2}}{N}.$$

Iterating the recursion gives

$$\mathbb{E}\left[\left\|x_k-x^{*}\right\|^{2}\right]\leq\frac{\gamma\sigma^{2}}{\kappa N\left(2-L\gamma\right)}+\Big[G_0-\frac{\gamma\sigma^{2}}{\kappa N\left(2-L\gamma\right)}\Big]\left(1-2\kappa\gamma+\kappa L\gamma^{2}\right)^{k}. \tag{19}$$
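The step from the one-step recursion to (19) is the standard unrolling of $\mathbb{E}[G_{k+1}]\leq\rho\,\mathbb{E}[G_k]+c$ with $\rho=1-2\kappa\gamma+\kappa L\gamma^{2}$ and $c=\gamma^{2}\sigma^{2}/N$. A quick numerical sanity check of that identity (illustrative values, not the paper's constants):

```python
def unroll(G0, rho, c, k):
    """Apply the recursion G <- rho * G + c exactly k times."""
    G = G0
    for _ in range(k):
        G = rho * G + c
    return G

def closed_form(G0, rho, c, k):
    """Fixed point c/(1 - rho) plus geometric decay of the initial gap,
    the form used in (19)."""
    fp = c / (1.0 - rho)
    return fp + (G0 - fp) * rho ** k
```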

The long-run error bound of scheme (3) is

$$\phi^{**}=\frac{\gamma\sigma^{2}}{\kappa N\left(2-L\gamma\right)}=O\Big(\frac{\sigma^{2}}{\kappa N}\Big).$$

By Corollary 1, when $\lambda_{2}=\Theta(N)$ and $\gamma$ is sufficiently small, $\phi^{*}$ and $\phi^{**}$ are of the same order.

We now compare the real-time convergence speeds of the two approaches. In the swarming-based implementation, by (15),

$$\mathbb{E}\left[\left\|\bar x_{Nk}-x^{*}\right\|^{2}\right]\leq\phi^{*}+\left(U_0+\hat\omega\bar V_0-\phi^{*}\right)\Big[1-\frac{2\kappa\gamma}{N}+\frac{2\kappa(1+\hat\omega N)L\gamma^{2}}{N^{2}}\Big]^{Nk}.$$

Under a centralized algorithm, from (19),

$$\mathbb{E}\left[\left\|x_k-x^{*}\right\|^{2}\right]\leq\phi^{**}+\left(G_0-\phi^{**}\right)\left(1-2\kappa\gamma+\kappa L\gamma^{2}\right)^{k}.$$

For large $N$,

$$\Big[1-\frac{2\kappa\gamma}{N}+\frac{2\kappa(1+\hat\omega N)L\gamma^{2}}{N^{2}}\Big]^{N}=1-2\kappa\gamma+O\Big(\frac{1}{N^{2}}\Big)<1-2\kappa\gamma+\kappa L\gamma^{2}.$$

We can see that $\mathbb{E}[\|\bar x_{Nk}-x^{*}\|^{2}]$ converges at a similar rate to $\mathbb{E}[\|x_k-x^{*}\|^{2}]$. However, the time needed to complete $Nk$ iterations in the swarming-based method is approximately

$$Nk\cdot\Delta t/N=k\Delta t,$$

while the time needed for the centralized algorithm to complete $k$ steps is approximately $k\Delta t_c$, where (see [14])

$$\Delta t_c=\mathbb{E}\big[\widetilde{\Delta t}_c\big]=\mathbb{E}\Big[\max_{i\in\{1,\ldots,N\}}\widetilde{\Delta t}_i\Big]=\Big(1+\frac{1}{2}+\cdots+\frac{1}{N}\Big)\Delta t>\Delta t.$$

Clearly the swarming-based scheme converges faster than the centralized one in real time. The ratio of convergence speeds is approximately $\Delta t_c/\Delta t=1+\frac{1}{2}+\cdots+\frac{1}{N}$, or $O(\log N)$.
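The harmonic-number identity for the expected maximum of $N$ i.i.d. exponentials used above is easy to check numerically (mean-1 sampling times; function names are ours):

```python
import random

def harmonic(N):
    """E[max of N i.i.d. mean-1 exponentials] = 1 + 1/2 + ... + 1/N."""
    return sum(1.0 / i for i in range(1, N + 1))

def empirical_max_exponential(N, trials, rng):
    """Monte-Carlo estimate of the synchronization overhead
    Delta t_c / Delta t for a batch of N parallel oracle queries."""
    return sum(max(rng.expovariate(1.0) for _ in range(N))
               for _ in range(trials)) / trials
```

The growth of `harmonic(N)` like $\ln N$ is exactly the $O(\log N)$ real-time speed-up of the swarming-based scheme.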

###### Remark 4

Another potential benchmark for assessing relative performance is the average solution of $N$ threads independently implementing the stochastic gradient-descent method (without an attraction term). However, this approach is deficient: the expected distance between the average solution and the optimal solution is bounded away from zero whenever the function $f$ is not symmetric with respect to the optimal solution.

In what follows we show that the swarming-based framework can be applied to the optimization of general convex and nonconvex objective functions.

### 4.2 General Convex Optimization

The following theorem characterizes the performance of the swarming-based approach for general convex objective functions. We refer to [15] for the technique of averaging the history of iterates.

###### Theorem 4.2

Suppose Assumptions 1, 2, 2.1, 4 and 4 hold. Define

$$\tilde\omega:=\frac{NL+4a^{2}\bar d^{2}\gamma}{NL+aN\lambda_{2}-4a^{2}N\bar d^{2}\gamma}. \tag{20}$$

Let the step size $\gamma$ be such that $\tilde\omega>0$ and

$$\mu=\frac{\gamma}{N^{2}}-\left(1+\tilde\omega N\right)\frac{\gamma^{2}L}{N^{3}}>0. \tag{21}$$

Let $K\in\mathbb{N}$ and define

$$\tilde x_K:=\frac{1}{K}\sum_{k=0}^{K-1}\bar x_k. \tag{22}$$

Under Algorithm (4),

$$\mathbb{E}\left[f(\tilde x_K)-f(x^{*})\right]\leq\frac{1}{2NK\mu}\Big[U_0+\tilde\omega\bar V_0+(1+\tilde\omega N)\frac{K\gamma^{2}\sigma^{2}}{N^{2}}\Big]. \tag{23}$$

###### Proof

See Appendix C.2.

###### Remark 5

To compute $\tilde x_K$ efficiently, notice that

$$\tilde x_K=\frac{1}{N}\sum_{i=1}^{N}\Big(\frac{1}{K}\sum_{k=0}^{K-1}x_{i,k}\Big)=\frac{1}{N}\sum_{i=1}^{N}\tilde x_{i,K},$$

where $\tilde x_{i,K}:=\frac{1}{K}\sum_{k=0}^{K-1}x_{i,k}$. At each step $k$, each thread needs only conduct a local update to keep track of its own running average.
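The local update mentioned above is the standard incremental mean; a one-line sketch:

```python
def update_running_average(avg, k, x_new):
    """Fold the (k+1)-th iterate into a running average of the first k
    iterates, without storing the history (per-thread update of Remark 5)."""
    return [(k * a + xn) / (k + 1) for a, xn in zip(avg, x_new)]
```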

It is clear that the upper bound in (23) is decreasing in $K$ and $\lambda_{2}$. We now consider a possible strategy for choosing the step size $\gamma$.

###### Corollary 2

Suppose Assumptions 1, 2, 2.1, 4 and 4 hold. Let the step size $\gamma$ and the iteration count $K$ satisfy the conditions in Theorem 4.2 and

$$\gamma=\frac{D\sqrt{\lambda_{2}N}}{\sigma\sqrt{K}}\leq\min\Big(\frac{\lambda_{2}}{8a\bar d^{2}},\frac{(2L+a\lambda_{2})N}{4(NL+L+a\lambda_{2})L}\Big) \tag{24}$$

for some $D>0$. Let $\tilde\omega$, $\mu$, and $\tilde x_K$ be set according to (20)-(22). Under Algorithm (4),

$$\mathbb{E}\left[f(\tilde x_K)-f(x^{*})\right]\leq\phi^{*}_{K}=\frac{\sigma\sqrt{N}}{D\sqrt{\lambda_{2}K}}\Big[U_0+\tilde\omega\bar V_0+\Big(1+\frac{2NL+a\lambda_{2}}{2L+a\lambda_{2}}\Big)\frac{D^{2}\lambda_{2}}{N}\Big].$$

###### Proof

See Appendix C.3.

###### Remark 6

By Corollary 2, $\phi^{*}_{K}=O\big(\frac{\sigma\sqrt{N}}{\sqrt{\lambda_{2}K}}\big)$. If in addition $\lambda_{2}=\Theta(N)$, then $\phi^{*}_{K}=O\big(\frac{\sigma}{\sqrt{K}}\big)$. Similar to the strongly-convex case, the error bound obtained here when $\lambda_{2}=\Theta(N)$ is comparable with that achieved by a centralized method with $K$ iterations (see [15, 5] for reference). Nevertheless, the time needed by the swarming-based approach is considerably less than that needed by a centralized one (the time ratio is the same as in the strongly-convex case).

### 4.3 Nonconvex Optimization

We further show that the swarming-based approach can be used for optimizing nonconvex objective functions. The employed randomization technique was introduced in [5].

###### Theorem 4.3

Suppose Assumptions 1, 2, 2.1, 4 hold, and $a\lambda_{2}>L$. Define

$$\check\omega:=\frac{NL+2\left(2L^{2}+4a^{2}\bar d^{2}\right)\gamma}{4N\left(a\lambda_{2}-L\right)-4N\left(2L^{2}+4a^{2}\bar d^{2}\right)\gamma}. \tag{25}$$

Let the step size $\gamma$ be such that $\check\omega>0$ and

$$\check\mu=\frac{\gamma}{2N^{2}}-\left(2+4\check\omega N\right)\frac{L\gamma^{2}}{N^{3}}>0. \tag{26}$$

Let $K\in\mathbb{N}$. Define a new random variable $R$ as follows (see [5]):

$$\mathbb{P}\left[R=k\right]=\frac{1}{K},\quad\forall k=0,1,\ldots,K-1.$$

Then we have

$$\frac{1}{L}\mathbb{E}\left[\left\|\nabla f(\bar x_R)\right\|^{2}\right]\leq\frac{1}{NK\check\mu}\Big[\frac{f(\bar x_0)-f^{*}+\check\omega L\bar V_0}{L}+\Big(\frac{1}{2}+\check\omega N\Big)\frac{K\gamma^{2}\sigma^{2}}{N^{2}}\Big], \tag{27}$$

where $f^{*}$ denotes the optimal value of (1).

###### Proof

See Appendix C.4.

###### Remark 7

Note that the step-size conditions and the error bound obtained in the nonconvex case are similar to those for optimizing general convex objective functions. Hence the discussion in Section 4.2 applies here as well.

## 5 Numerical Example

In this section, we provide a numerical example to illustrate our theoretical findings. Consider the online Ridge regression problem, i.e.,

$$\min_{x\in\mathbb{R}^{d}}f(x)\quad\Big(=\mathbb{E}_{u,v}\left[\left(u^{T}x-v\right)^{2}+\rho\left\|x\right\|^{2}\right]\Big) \tag{28}$$

where $\rho>0$ is a penalty parameter. Samples in the form of $(u,v)$ are gathered continuously, with $u\in\mathbb{R}^{d}$ representing the features and $v\in\mathbb{R}$ being the observed outputs. We assume that each $u$ is uniformly distributed, and $v$ is drawn according to a linear model $v=u^{T}\tilde x+\varepsilon$, where $\tilde x$ is a predefined parameter and the $\varepsilon$ are independent zero-mean Gaussian noises. We further assume that the time needed to gather each sample is independent and exponentially distributed with mean $\Delta t$.

Given a pair $(u,v)$, we can calculate an estimated gradient of $f$:

$$g(x,u,v)=2\left(u^{T}x-v\right)u+2\rho x. \tag{29}$$

This is an unbiased estimator of $\nabla f(x)$ since

$$\nabla f(x)=\mathbb{E}_{u,v}\left[2uu^{T}x-2uv\right]+2\rho x=\mathbb{E}_{u,v}\left[g(x,u,v)\right].$$
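The sample gradient (29) translates directly into code (plain Python; names are ours):

```python
def ridge_sample_gradient(x, u, v, rho):
    """Sample gradient (29) of the online ridge objective (28):
    g(x, u, v) = 2 (u^T x - v) u + 2 rho x."""
    residual = sum(ud * xd for ud, xd in zip(u, x)) - v   # u^T x - v
    return [2.0 * residual * ud + 2.0 * rho * xd for ud, xd in zip(u, x)]
```

Each thread in the swarming scheme would feed one such sample gradient per update, in place of the abstract oracle of Section 2.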

Furthermore, the variance of the estimator satisfies $\mathbb{E}_{u,v}\|g(x,u,v)-\nabla f(x)\|^{2}\leq\sigma^{2}$ for some $\sigma>0$ whenever $x$ ranges over a bounded set. As long as $x$ is bounded (which can be verified in the experiments), the sampling noise is uniformly bounded, which satisfies Assumption 1. Notice that the Hessian matrix of