Dynamic Average Diffusion with Randomized Coordinate Updates

10/21/2018 ∙ by Bicheng Ying, et al. ∙ EPFL

This work derives and analyzes an online learning strategy for tracking the average of time-varying distributed signals by relying on randomized coordinate-descent updates. During each iteration, each agent selects or observes a random entry of the observation vector, and different agents may select different entries of their observations before engaging in a consultation step. Careful coordination of the interactions among agents is necessary to avoid bias and ensure convergence. We provide a convergence analysis for the proposed methods, and illustrate the results by means of simulations.


1 Introduction and Motivation

We consider the problem in which a collection of $N$ networked agents, indexed $k = 1, \ldots, N$, is interested in tracking the average of time-varying signals $\{x_{k,i}\}$ arriving at the agents, where $k$ is the agent index and $i$ is the time index. The objective is for the agents to attain tracking in a decentralized manner through local interactions with their neighbors. This type of problem is common in many applications. For example, consider the following distributed empirical risk minimization problem [1, 2, 3, 4, 5, 6, 7], which arises in many traditional machine learning formulations:

$\min_{w}\;\; \frac{1}{N}\sum_{k=1}^{N} Q(w;\, d_k) \qquad (1)$

where $Q(w; d_k)$ is some loss function that depends on the data $d_k$ at location or agent $k$. If we let $w_{k,i}$ denote an estimate for the minimizer of (1) at agent $k$ at time $i$, and let $d_{k,i}$ denote the data received at that agent at the same time instant, then some solution methods to (1) involve tracking the average gradient defined by [2, 6, 8]:

$\frac{1}{N}\sum_{k=1}^{N} \nabla_w\, Q(w_{k,i};\, d_{k,i}) \qquad (2)$

where each term inside the summation represents the signal . Likewise, in learning problem formulations involving feature vectors and parameter models that are distributed over space, or loss functions that are expressed in the form of sums [9, 10, 11, 12, 13, 14], we encounter optimization problems of the form

(3)

where is some linear or nonlinear function that depends on the th feature set, , available at agent . Some solution methods to (3) involve tracking the average quantity:

(4)

where again each term inside the summation represents an signal.

There are several useful distributed algorithms in the literature for computing the average of static signals (i.e., signals $x_k$ that do not vary with the time index $i$), and which are distributed across a network [15, 3, 1, 7, 16, 17]. One famous algorithm is the consensus strategy, which takes the form

$w_{k,i} \;=\; \sum_{\ell \in \mathcal{N}_k} a_{\ell k}\, w_{\ell, i-1}, \qquad w_{k,0} = x_k \qquad (5)$

where $a_{\ell k}$ is a nonnegative factor scaling the information from agent $\ell$ to agent $k$, and $A = [a_{\ell k}]$ is some doubly-stochastic matrix. Moreover, the notation $\mathcal{N}_k$ denotes the set of neighbors of agent $k$. In this implementation, each agent starts from its observation vector $x_k$ and continually averages the state values of its neighbors. After sufficient iterations, it is well-known that

$\lim_{i\to\infty} w_{k,i} \;=\; \frac{1}{N}\sum_{\ell=1}^{N} x_\ell \qquad (6)$

under some mild conditions on $A$ [18, 17, 19, 20, 7]. When the static signals $x_k$ become dynamic and are replaced by $x_{k,i}$, a useful variation is the dynamic average consensus algorithm from [21, 22, 23]. It replaces (5) by the recursion:

$w_{k,i} \;=\; \sum_{\ell \in \mathcal{N}_k} a_{\ell k}\, w_{\ell, i-1} \;+\; x_{k,i} - x_{k,i-1} \qquad (7)$

where the difference $x_{k,i} - x_{k,i-1}$ is added as a driving term. In this case, it can be shown that if the signals converge to static values, i.e., if $\lim_{i\to\infty} x_{k,i} = x_k$, then result (6) continues to hold [21, 22]. Recursion (7) is motivated in [21, 22] using useful but heuristic arguments.
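To make the recursion concrete, the following is a minimal simulation sketch of dynamic average consensus (7) in Python with NumPy. The ring topology, the uniform combination weights, and the sinusoidal test signals are illustrative assumptions, not taken from the paper:

```python
import numpy as np

# Minimal sketch of dynamic average consensus (7) on a ring of N agents.
# The ring topology, the uniform weights, and the sinusoidal test signals
# are illustrative assumptions, not taken from the paper.
N, M, T = 10, 4, 500

# Symmetric doubly-stochastic weights: each agent averages itself
# and its two ring neighbors with equal weights 1/3.
A = np.zeros((N, N))
for k in range(N):
    A[k, k] = A[k, (k - 1) % N] = A[k, (k + 1) % N] = 1.0 / 3

def signal(i):
    """Time-varying signals x_{k,i}; one row per agent."""
    phases = np.arange(N)[:, None]
    return np.sin(0.01 * i + phases) + 0.1 * phases * np.ones((N, M))

x_prev = signal(0)
w = x_prev.copy()                     # w_{k,0} = x_{k,0}
for i in range(1, T):
    x = signal(i)
    # (7): combine neighbors' iterates, then add the local increment
    w = A.T @ w + (x - x_prev)
    x_prev = x

print(np.max(np.abs(w - x.mean(axis=0))))   # tracking error w.r.t. the average
```

Because the driving term injects only local signal increments into the network, each agent tracks the global average with an error that shrinks as the signals slow down or converge.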

Motivated by these considerations, in this work we develop a dynamic average diffusion strategy for tracking the average of time-varying signals by formulating an optimization problem and showing how to solve it by applying the exact diffusion strategy from [31, 32]. One of the main contributions relative to earlier approaches is that we are specifically interested in the case in which the dimension $M$ of the observation vectors may be large, so that a solution like (7) necessitates the sharing of long vectors among the agents, resulting in an inefficient communication scheme. We are also interested in the case in which each agent can only observe one random entry of $x_{k,i}$ at each iteration (either by design or by choice). In this case, it would be wasteful to share the full vector since only one entry of $w_{k,i}$ is affected by the new information. To handle these situations, we will need to incorporate elements of randomized coordinate-descent [24, 25] into the operation of the algorithm, in line with approaches from [26, 27, 28]. Doing so, however, introduces one nontrivial complication: different agents may be selecting or observing different entries of their vectors $x_{k,i}$, which raises a question about how to coordinate or synchronize their interactions. In order to facilitate the presentation, we shall assume initially that all agents select the same entry of their observation vectors at each iteration. Subsequently, we will show how to employ push-sum ideas [29, 30] to allow each agent to select its own local entry independently of the other agents.

2 Algorithm Development

2.1 Review of Exact Diffusion Strategy

One effective decentralized method to solve problems of the form:

$\min_{w}\;\; \frac{1}{N}\sum_{k=1}^{N} J_k(w) \qquad (8)$

is the exact diffusion strategy [6, 31, 32]. In (8), each $J_k(w)$ refers to the risk function at agent $k$ and is generally convex or strongly convex. For simplicity, we shall assume in this work that each $J_k(w)$ is differentiable, although the analysis can be extended to non-smooth risk functions by employing subgradient constructions, along the lines of [7, 33], or proximal constructions similar to [34, 5]. To implement exact diffusion, we need to associate a combination matrix $A = [a_{\ell k}]$ with the network graph, where a positive weight $a_{\ell k}$ is used to scale data that flows from node $\ell$ to node $k$ if both nodes happen to be neighbors. In this paper we assume the following.

Assumption 1 (Topology)

The underlying topology is strongly connected, and the combination matrix $A$ is symmetric and doubly stochastic, i.e.,

$A^{\mathsf{T}} = A, \qquad A\mathbb{1}_N = \mathbb{1}_N \qquad (9)$

where $\mathbb{1}_N$ is a vector with all unit entries. We further assume that $a_{kk} > 0$ for at least one agent $k$.

We further introduce $\mu$ as the positive step-size parameter used by all nodes. The exact diffusion algorithm is listed in (10a)–(10c), where the coefficients $\bar{a}_{\ell k}$ denote the entries of $\bar{A} \triangleq (I_N + A)/2$. It is shown in [31] that the local variables $w_{k,i}$ converge to the exact minimizer of problem (8), $w^\star$, at a linear convergence rate under relatively mild conditions.

 
Algorithm 1 [Exact diffusion strategy for each node $k$] [6, 31]
 
Initialize $w_{k,0}$ arbitrarily, and let $\psi_{k,0} = w_{k,0}$.
Repeat for $i = 1, 2, \ldots$ until convergence:

$\psi_{k,i} = w_{k,i-1} - \mu\, \nabla J_k(w_{k,i-1})$   (10a)
$\phi_{k,i} = \psi_{k,i} + w_{k,i-1} - \psi_{k,i-1}$   (10b)
$w_{k,i} = \sum_{\ell \in \mathcal{N}_k} \bar{a}_{\ell k}\, \phi_{\ell,i}$   (10c)

 
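As an illustration of (10a)–(10c), here is a minimal sketch of exact diffusion applied to quadratic risks $J_k(w) = \frac{1}{2}\|w - b_k\|^2$, whose global minimizer is the mean of the local data $b_k$. The ring topology, step size, and data are illustrative assumptions:

```python
import numpy as np

# Sketch of the exact diffusion recursions (10a)-(10c) on quadratic risks
# J_k(w) = 0.5*||w - b_k||^2, whose global minimizer is the mean of the b_k.
# Topology, step size, and data are illustrative assumptions.
N, M, mu, T = 10, 3, 0.5, 300
rng = np.random.default_rng(1)
b = rng.normal(size=(N, M))              # local data, one row per agent

A = np.zeros((N, N))
for k in range(N):
    A[k, k] = A[k, (k - 1) % N] = A[k, (k + 1) % N] = 1.0 / 3
A_bar = (A + np.eye(N)) / 2              # \bar{A} = (I_N + A)/2

w = np.zeros((N, M))
psi_prev = w.copy()                      # psi_{k,0} = w_{k,0}
for i in range(T):
    psi = w - mu * (w - b)               # (10a) adaptation: grad J_k = w_k - b_k
    phi = psi + w - psi_prev             # (10b) correction
    w = A_bar.T @ phi                    # (10c) combination
    psi_prev = psi

print(np.max(np.abs(w - b.mean(axis=0))))   # all agents ≈ exact minimizer
```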

2.2 Dynamic Average Diffusion

Now, we consider a time-varying quadratic risk function of the form

$J_{k,i}(w) \;\triangleq\; \frac{1}{2}\,\|w - x_{k,i}\|^2 \qquad (11)$

and introduce the average cost

$\mathcal{J}_i(w) \;\triangleq\; \frac{1}{N}\sum_{k=1}^{N} J_{k,i}(w). \qquad (12)$

At every time instant $i$, if we optimize (12) over $w$, then it is clear that the minimizer, denoted by $w_i^\star$, will coincide with the average of the observed signals:

$w_i^\star \;=\; \frac{1}{N}\sum_{k=1}^{N} x_{k,i} \;\triangleq\; \bar{x}_i. \qquad (13)$

Therefore, one way to track the average of the signals is to track the minimizer of the aggregate cost defined by (12). Apart from the time index, this cost has a form similar to (8), especially when the observation signals approach steady-state values where they become static. This motivates us to apply the exact diffusion construction (10a)–(10c) to the risks defined by (12). Doing so leads to the recursions:

$\psi_{k,i} = w_{k,i-1} - \mu\,(w_{k,i-1} - x_{k,i})$   (14a)
$\phi_{k,i} = \psi_{k,i} + w_{k,i-1} - \psi_{k,i-1}$   (14b)
$w_{k,i} = \sum_{\ell\in\mathcal{N}_k} \bar{a}_{\ell k}\,\phi_{\ell,i}$   (14c)

Combining (14a)–(14c) into a single recursion, we obtain:

$w_{k,i} = \sum_{\ell\in\mathcal{N}_k} \bar{a}_{\ell k}\left[\,(2-\mu)\,w_{\ell,i-1} - (1-\mu)\,w_{\ell,i-2} + \mu\,(x_{\ell,i} - x_{\ell,i-1})\,\right] \qquad (15)$

so that by selecting $\mu = 1$, the algorithm reduces to what we shall refer to as the dynamic average diffusion algorithm:

 
Algorithm 2 [Dynamic average diffusion]
 
Initialize: $w_{k,0} = x_{k,0}$.
Repeat for $i = 1, 2, \ldots$:

$w_{k,i} = \sum_{\ell\in\mathcal{N}_k} \bar{a}_{\ell k}\left(w_{\ell,i-1} + x_{\ell,i} - x_{\ell,i-1}\right) \qquad (16)$

 

Other values for $\mu$ are of course possible by using (15) instead. Comparing (16) with the consensus version (5), we see that the scaling weights in (16) multiply the combined sum of the weight iterate and the difference of the current and past observation vectors, $w_{\ell,i-1} + x_{\ell,i} - x_{\ell,i-1}$. Moreover, and importantly, while in the consensus construction (5) each agent employs only its own observation vector, we see in (16) that all observation vectors from the neighborhood of agent $k$ contribute to the update of $w_{k,i}$. In this way, agents need to share their weight iterates along with the difference of their observation vectors. In a future section, we shall show how agents can share only single entries of their observation vectors chosen at random.

There are several interesting properties associated with the dynamic diffusion strategy (16). First, at any time $i$, the average of the $w_{k,i}$ coincides with the average of the $x_{k,i}$, i.e.,

$\frac{1}{N}\sum_{k=1}^{N} w_{k,i} \;=\; \frac{1}{N}\sum_{k=1}^{N} x_{k,i}. \qquad (17)$

This property can be easily shown using mathematical induction. Second, when the signals are static, i.e., $x_{k,i} \equiv x_k$, the algorithm reduces to the classical consensus construction (5). Third, when the signals converge to steady-state values $x_k$, or their time variations become uniform across the agents after some time $i_0$, i.e.,

$x_{k,i} - x_{k,i-1} = c_i, \quad \forall\, k \text{ and } i > i_0, \qquad (18)$

for some common increment $c_i$, then it can be shown that

$\lim_{i\to\infty} \left( w_{k,i} - \bar{x}_i \right) = 0. \qquad (19)$

This conclusion is a special case of later results in this paper, and therefore its proof will follow by specializing the arguments used later in Theorem 1.
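The following sketch implements the dynamic average diffusion recursion (16) and numerically checks the mean-preservation property (17) at every iteration. The ring topology and the signal model (whose increments die out, so that the tracking result applies) are illustrative assumptions:

```python
import numpy as np

# Sketch of the dynamic average diffusion recursion (16), with a numerical
# check of the mean-preservation property (17) at every iteration. The ring
# topology and the signal model are illustrative assumptions.
N, M, T = 10, 4, 400
A = np.zeros((N, N))
for k in range(N):
    A[k, k] = A[k, (k - 1) % N] = A[k, (k + 1) % N] = 1.0 / 3
A_bar = (A + np.eye(N)) / 2                  # \bar{A} = (I_N + A)/2

def signal(i):
    # agent-dependent offsets plus a common decaying term, so the
    # signal increments vanish and the tracking result (19) applies
    return np.outer(np.arange(1, N + 1), np.ones(M)) + 5.0 * np.exp(-0.01 * i)

x_prev = signal(0)
w = x_prev.copy()                            # w_{k,0} = x_{k,0}
for i in range(1, T):
    x = signal(i)
    # (16): w_{k,i} = sum_l abar_{lk} (w_{l,i-1} + x_{l,i} - x_{l,i-1})
    w = A_bar.T @ (w + x - x_prev)
    x_prev = x
    # (17): the averages of the w_{k,i} and the x_{k,i} coincide at every i
    assert np.allclose(w.mean(axis=0), x.mean(axis=0))

print(np.max(np.abs(w - x.mean(axis=0))))    # consensus on the average
```

The assertion holds exactly (up to floating-point rounding) at every iteration because the column sums of $\bar{A}$ equal one, which is precisely the induction step behind (17).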

3 Synchronized Random Updates

Let us consider next the case in which each agent can only access (either by design or by choice) one random entry within the vector $x_{k,i}$. We denote the index of that entry at iteration $i$ by $\boldsymbol{m}_i$; we use boldface notation because $\boldsymbol{m}_i$ is selected at random, and boldface symbols denote random quantities in our notation. We shall first assume that all agents select the same $\boldsymbol{m}_i$; later we consider the case in which the selection varies among agents and replace the notation $\boldsymbol{m}_i$ by $\boldsymbol{m}_i^k$ instead, with the superscript $k$ referring to the agent. This situation will then enable a fully distributed solution.

When all agents select the same random index $\boldsymbol{m}_i$, one naive solution for updating their weight iterates is to resort to coordinate-descent type constructions [25, 24]. Namely, at iteration $i$, the index $\boldsymbol{m}_i$ is selected uniformly, and then only the $\boldsymbol{m}_i$-th entry of $w_{k,i}$ is updated, say, as:

$[w_{k,i}]_{\boldsymbol{m}_i} = \sum_{\ell\in\mathcal{N}_k} \bar{a}_{\ell k}\left([w_{\ell,i-1}]_{\boldsymbol{m}_i} + [x_{\ell,i}]_{\boldsymbol{m}_i} - [x_{\ell,i-1}]_{\boldsymbol{m}_i}\right), \qquad [w_{k,i}]_{m} = [w_{k,i-1}]_{m}, \quad m \ne \boldsymbol{m}_i \qquad (20)$

where the notation $[v]_m$, for a vector $v$, refers to the $m$-th entry of that vector. This iteration applies (16) to the $\boldsymbol{m}_i$-th entry of $w_{k,i}$ and keeps all other entries of this vector unchanged relative to $w_{k,i-1}$. Although simple, this algorithm is not implementable for one subtle reason: at time $i$, agent $\ell$ can only observe $[x_{\ell,i}]_{\boldsymbol{m}_i}$ and not $[x_{\ell,i-1}]_{\boldsymbol{m}_i}$. In other words, the variable $[x_{\ell,i-1}]_{\boldsymbol{m}_i}$ is not available; it would be available if we allowed agent $\ell$ to save the entire vector $x_{\ell,i-1}$ from the previous iteration and then select its $\boldsymbol{m}_i$-th entry at time $i$. However, doing so defeats the purpose of a coordinate-descent solution, whose purpose is to avoid working with long observation vectors and to work instead with scalar entries. We can circumvent this difficulty as follows. We let $i'$ refer to the most recent past iteration at which the same index $\boldsymbol{m}_i$ was chosen; the value of $i'$ clearly depends on $\boldsymbol{m}_i$. Then, we can replace (20) by:

$[w_{k,i}]_{\boldsymbol{m}_i} = \sum_{\ell\in\mathcal{N}_k} \bar{a}_{\ell k}\left([w_{\ell,i-1}]_{\boldsymbol{m}_i} + [x_{\ell,i}]_{\boldsymbol{m}_i} - [x_{\ell,i'}]_{\boldsymbol{m}_i}\right), \qquad [w_{k,i}]_{m} = [w_{k,i-1}]_{m}, \quad m \ne \boldsymbol{m}_i \qquad (21)$

where the past index $i'$ now replaces $i-1$ inside the observation difference on the right-hand side. Note first that this implementation is now feasible because the scalar value $[x_{\ell,i'}]_{\boldsymbol{m}_i}$ from the past can be saved into a memory variable. Specifically, for every agent $k$ we introduce a memory vector $x^{\mathrm{mem}}_{k,i}$, which is updated over time. At every iteration $i$, an index $\boldsymbol{m}_i$ is selected, and the value of the observation entry $[x_{k,i}]_{\boldsymbol{m}_i}$ is saved into the $\boldsymbol{m}_i$-th location of $x^{\mathrm{mem}}_{k,i}$ for later access the next time the index $\boldsymbol{m}_i$ is selected. It is also important to use the past observation $[x_{\ell,i'}]_{\boldsymbol{m}_i}$, with the same entry index $\boldsymbol{m}_i$, along with the current observation $[x_{\ell,i}]_{\boldsymbol{m}_i}$ in (21) in order to maintain the mean property (17). Moreover, due to the definition of $i'$ and the second relation in (21), we know that $[x_{\ell,i'}]_{\boldsymbol{m}_i} = [x^{\mathrm{mem}}_{\ell,i-1}]_{\boldsymbol{m}_i}$. Hence, the resulting algorithm is:

$[w_{k,i}]_{\boldsymbol{m}_i} = \sum_{\ell\in\mathcal{N}_k} \bar{a}_{\ell k}\left([w_{\ell,i-1}]_{\boldsymbol{m}_i} + [x_{\ell,i}]_{\boldsymbol{m}_i} - [x^{\mathrm{mem}}_{\ell,i-1}]_{\boldsymbol{m}_i}\right), \qquad [w_{k,i}]_{m} = [w_{k,i-1}]_{m}, \quad m \ne \boldsymbol{m}_i \qquad (22)$

To simplify the notation, we introduce the indicator function:

$\mathbb{I}_m(i) \;\triangleq\; \begin{cases} 1, & m = \boldsymbol{m}_i \\ 0, & \text{otherwise} \end{cases} \qquad (23)$

and the selection matrix:

$P_i \;\triangleq\; \mathrm{diag}\left\{\, \mathbb{I}_1(i),\; \mathbb{I}_2(i),\; \ldots,\; \mathbb{I}_M(i) \,\right\}. \qquad (24)$

This matrix is diagonal with a single unit entry on the diagonal at the location of the active index $\boldsymbol{m}_i$; all other entries are zero. We also introduce the complement matrix:

$P_i^{\perp} \;\triangleq\; I_M - P_i. \qquad (25)$

Using these matrices, the resulting algorithm is listed in Algorithm 3. The proof of convergence is provided later in Sec. 5.1.

 
Algorithm 3 [Dynamic average diffusion with synchronous random updates]
 
Initialization: set $w_{k,0} = x^{\mathrm{mem}}_{k,0} = x_{k,0}$.
Repeat for $i = 1, 2, \ldots$ until convergence:

select the common index $\boldsymbol{m}_i$ uniformly at random from $\{1,\ldots,M\}$ and form $P_i$   (26a)
$w_{k,i} = P_i \sum_{\ell\in\mathcal{N}_k} \bar{a}_{\ell k}\left(w_{\ell,i-1} + x_{\ell,i} - x^{\mathrm{mem}}_{\ell,i-1}\right) + P_i^{\perp}\, w_{k,i-1}$   (26b)
$x^{\mathrm{mem}}_{k,i} = P_i\, x_{k,i} + P_i^{\perp}\, x^{\mathrm{mem}}_{k,i-1}$   (26c)

 
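A minimal sketch of Algorithm 3 follows, using the notation reconstructed above. The ring topology and the signal model are illustrative assumptions; note that each agent touches only the commonly selected entry per iteration, while x_mem stores the last observed value of every entry:

```python
import numpy as np

# Minimal sketch of Algorithm 3: all agents update the same randomly chosen
# entry m_i per iteration, and each agent keeps a memory vector x_mem holding
# the last observed value of every entry. Topology, signal model, and
# variable names are illustrative assumptions.
N, M, T = 10, 5, 3000
rng = np.random.default_rng(2)
A = np.zeros((N, N))
for k in range(N):
    A[k, k] = A[k, (k - 1) % N] = A[k, (k + 1) % N] = 1.0 / 3
A_bar = (A + np.eye(N)) / 2

def signal(i):
    return np.outer(np.arange(1, N + 1), np.ones(M)) + 5.0 * np.exp(-0.005 * i)

x0 = signal(0)
w = x0.copy()                     # w_{k,0} = x_{k,0}
x_mem = x0.copy()                 # x_mem_{k,0} = x_{k,0}
for i in range(1, T):
    m = rng.integers(M)           # (26a)-type: common random index m_i
    x_m = signal(i)[:, m]         # each agent observes only entry m of x_{k,i}
    # (26b)-type update of entry m; all other entries remain unchanged
    w[:, m] = A_bar.T @ (w[:, m] + x_m - x_mem[:, m])
    # (26c)-type memory update: store the newly observed entry values
    x_mem[:, m] = x_m

print(np.max(np.abs(w - signal(T).mean(axis=0))))   # ≈ 0 once signals settle
```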

4 Independent Random Updates

4.1 A first attempt at random indices

The previous algorithm requires all agents to observe the same "random" index $\boldsymbol{m}_i$ at iteration $i$. In this section, we allow the index to be selected locally by each agent. To refer to this generality, we replace the notation $\boldsymbol{m}_i$ by $\boldsymbol{m}_i^k$, where $\boldsymbol{m}_i^k$ is selected uniformly from $\{1, \ldots, M\}$ by agent $k$.

In this case, the agents generally no longer share the same entries of their observation vectors. However, there will generally exist smaller groups of agents that end up selecting the same index (since indices are chosen at random). We can represent this possibility by examining replicas of the network topology, as illustrated by Fig. 1. In each layer, we highlight in blue the agents that selected the same index. For example, all four blue agents in the top layer selected the same entry index, while only one agent in the second layer selected the index associated with that layer, and three agents in the bottom layer selected theirs.

Figure 1: The layers on the right highlight the agents in the network that selected the same index at iteration $i$.

Motivated by the discussion that led to Algorithm 3, we can similarly start from the following recursion:

$[w_{k,i}]_{\boldsymbol{m}_i^k} = \sum_{\ell\in\mathcal{N}_k:\; \boldsymbol{m}_i^\ell = \boldsymbol{m}_i^k} \bar{a}_{\ell k}\left([w_{\ell,i-1}]_{\boldsymbol{m}_i^k} + [x_{\ell,i}]_{\boldsymbol{m}_i^k} - [x^{\mathrm{mem}}_{\ell,i-1}]_{\boldsymbol{m}_i^k}\right) \qquad (27)$

where the summation is over the neighbor agents whose selected random index agrees with that of agent $k$. In this implementation, agents that select the same index within the neighborhood of agent $k$ are processed together in a manner similar to Algorithm 3. However, there is one important difficulty: this implementation does not work correctly. This is because

$\sum_{\ell\in\mathcal{N}_k:\; \boldsymbol{m}_i^\ell = \boldsymbol{m}_i^k} \bar{a}_{\ell k} \;\ne\; 1, \quad \text{in general.} \qquad (28)$

In other words, the "effective" combination matrix for any of the layers (on the right side of Fig. 1) is not necessarily doubly stochastic anymore. Even worse, the topology changes from one layer to another and from one iteration to another due to the random selections at each agent. These facts bias the operation of the algorithm and prevent the agents from reaching consensus. We need to account for these difficulties.

4.2 Push-sum correction

We shall exploit some properties of the push-sum construction. Recall that the original push-sum algorithm deals with the problem of seeking the mean of static signals $x_k$. One appealing property of the algorithm is that it can be applied to time-varying row-stochastic matrices, i.e., to graphs where the outgoing scaling factors add up to one, say,

$\sum_{k=1}^{N} a_{\ell k}^{(i)} = 1, \quad \forall\, \ell, \qquad (29)$

where the superscript $(i)$ is added to indicate time variation. This condition only requires the outgoing weights (from agent $\ell$ to the other agents) to sum up to one; it does not require the incoming weights into agent $k$ to add up to one. Moreover, it is common to assume that the topology satisfies the following condition.

Assumption 2 (Time-Varying Topology [30])

The sequence $\{A^{(i)}\}$ is a stationary and ergodic sequence of row-stochastic matrices with positive diagonal entries $a_{kk}^{(i)} > 0$, and $\mathbb{E}\, A^{(i)}$ is primitive.

If we apply the classical consensus iteration (5) under this condition:

$w_{k,i} = \sum_{\ell\in\mathcal{N}_k^{(i)}} a_{\ell k}^{(i)}\, w_{\ell,i-1} \qquad (30)$

then $w_{k,i}$ will not reach consensus on the desired average [8]. In order to reach consensus under this time-varying row-stochastic topology, the push-sum construction introduces an auxiliary variable $u_{k,i}$ to help correct for the bias. The algorithm starts from $w_{k,0} = x_k$ and $u_{k,0} = 1$ (so that the network vector $u_0 = \mathbb{1}$, the vector with all entries equal to one), and iterates:

$w_{k,i} = \sum_{\ell\in\mathcal{N}_k^{(i)}} a_{\ell k}^{(i)}\, w_{\ell,i-1}, \qquad u_{k,i} = \sum_{\ell\in\mathcal{N}_k^{(i)}} a_{\ell k}^{(i)}\, u_{\ell,i-1}, \qquad y_{k,i} = w_{k,i}/u_{k,i} \qquad (31)$

where the last equality means that the entries of $w_{k,i}$ are divided by the corresponding entries of $u_{k,i}$; it refers to an element-wise division. It can be shown under Assumption 2 that this algorithm leads to [30, 29]:

$\lim_{i\to\infty} y_{k,i} = \frac{1}{N}\sum_{\ell=1}^{N} x_\ell. \qquad (32)$

Later in Sec. 5.2, we provide additional explanations that further clarify why this construction works correctly — see the explanation leading to (69).
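The following sketch illustrates the push-sum recursion (31) for scalar static signals under randomly generated row-stochastic matrices in the spirit of Assumption 2; all numerical choices are illustrative assumptions:

```python
import numpy as np

# Sketch of the push-sum recursion (31): averaging under time-varying
# row-stochastic matrices (outgoing weights sum to one), with the ratio
# y = w/u correcting the bias. All numerical choices are illustrative.
N, T = 8, 400
rng = np.random.default_rng(3)
x = rng.normal(size=N)                       # static signals to average

w = x.copy()                                 # w_{k,0} = x_k
u = np.ones(N)                               # correction variable, u_{k,0} = 1
for i in range(T):
    # random row-stochastic matrix with positive diagonal (Assumption 2 style)
    A_i = rng.uniform(0.1, 1.0, size=(N, N))
    A_i = A_i / A_i.sum(axis=1, keepdims=True)   # outgoing weights sum to 1
    # push-sum: w and u undergo exactly the same linear update
    w = A_i.T @ w
    u = A_i.T @ u
y = w / u                                    # bias-corrected ratio (32)
print(np.max(np.abs(y - x.mean())))          # all agents ≈ true average
```

Note that $w_{k,i}$ alone would converge to a biased, sequence-dependent combination of the signals; dividing by $u_{k,i}$, which experiences exactly the same sequence of combination matrices, removes this bias.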

4.3 Dynamic diffusion with independent random updates

We can exploit the push-sum construction in the dynamic diffusion scenario where random indices are selected at each iteration. As mentioned before, the implementation (27) will not reach the desired consensus since the incoming weights do not add up to one. However, we assumed the underlying matrix $\bar{A}$ is doubly stochastic, which implies that the outgoing weights still add up to one. Hence, the push-sum construction can be utilized to correct the bias introduced by (27). One important property to enforce is that the entries in $w_{k,i}$ and $u_{k,i}$ should undergo identical updates. Doing so leads to Algorithm 4.

Comparing (26b)–(26c) with (33b)–(33d), there are two main modifications. One is that the updated index is allowed to vary across locations. Another is that the output is $y_{k,i}$ instead of $w_{k,i}$, i.e., the value after correction by $u_{k,i}$. On the other hand, if we force $\boldsymbol{m}_i^k = \boldsymbol{m}_i$ for all $k$ and $i$, Algorithm 4 reduces to Algorithm 3 by noting that in this case $u_{k,i} = \mathbb{1}$ for all $k$ and $i$. The proof of convergence of Algorithm 4 is provided later in Sec. 5.3.

 
Algorithm 4 [Dynamic average diffusion algorithm with independent random updates]
 
Initialization: set $w_{k,0} = x^{\mathrm{mem}}_{k,0} = x_{k,0}$ and $u_{k,0} = \mathbb{1}$.
Repeat for $i = 1, 2, \ldots$ until convergence:

each agent $k$ selects an index $\boldsymbol{m}_i^k$ uniformly at random from $\{1,\ldots,M\}$   (33a)
(33b)
(33c)
(33d)
(33e)

 
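Since the detailed listing (33a)–(33e) is not reproduced here, the following sketch reflects one possible reading of Algorithm 4 and should be treated as an assumption rather than the paper's exact listing: each agent pushes its selected entry to all of its neighbors (so that outgoing weights sum to one, as required by push-sum), retains the entries it neither pushed nor received, and maintains a per-entry correction variable u updated in the same manner; the output is the entry-wise ratio y = w/u:

```python
import numpy as np

# One possible reading of Algorithm 4 (an assumption, not the paper's exact
# listing): each agent k picks its own entry m_i^k, pushes the updated value
# of that entry to all of its neighbors (outgoing weights sum to one because
# A_bar is doubly stochastic), keeps every entry it did not push away, and
# runs identical updates on a correction variable u. Output: y = w / u.
N, M, T = 10, 4, 5000
rng = np.random.default_rng(4)
A = np.zeros((N, N))
for k in range(N):
    A[k, k] = A[k, (k - 1) % N] = A[k, (k + 1) % N] = 1.0 / 3
A_bar = (A + np.eye(N)) / 2

x = np.outer(np.arange(1, N + 1), np.ones(M))    # static signals for clarity
w = x.copy()                                     # w_{k,0} = x_{k,0}
x_mem = x.copy()
u = np.ones((N, M))                              # per-entry correction variable
agents = np.arange(N)

for i in range(1, T):
    m = rng.integers(M, size=N)                  # (33a)-type: independent indices
    cw = w[agents, m] + x[agents, m] - x_mem[agents, m]   # pushed w-values
    cu = u[agents, m]                            # pushed u-values
    w_new, u_new = w.copy(), u.copy()
    w_new[agents, m] = 0.0                       # selected entries are pushed out
    u_new[agents, m] = 0.0
    for l in range(N):                           # agent l pushes entry m[l]
        for k in ((l - 1) % N, l, (l + 1) % N):  # to every ring neighbor k
            w_new[k, m[l]] += A_bar[l, k] * cw[l]
            u_new[k, m[l]] += A_bar[l, k] * cu[l]
    w, u = w_new, u_new
    x_mem[agents, m] = x[agents, m]              # (33d)-type memory update

y = w / u                                        # (33e)-type corrected output
print(np.max(np.abs(y - x.mean(axis=0))))        # ≈ 0: consensus on the average
```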

4.4 Special case without push-sum correction

There is one special case where $w_{k,i}$ from (27) can still converge to the desired mean value without the push-sum correction. This case has been used in the recent push-pull type algorithms [8, 35], which use a pull-network for consensus and a push-network for aggregating the gradients over the agents. The special case is when

$\lim_{i\to\infty} \sum_{k=1}^{N} x_{k,i} = 0. \qquad (34)$

This scenario is quite common when tracking the sum of gradients in empirical risk minimization problems. It can be verified that in this case $w_{k,i}$ converges to the desired value with or without division by $u_{k,i}$. To shed some intuition on this statement, assume that we have shown that the output $y_{k,i}$ in Algorithm 4 converges to the desired consensus value $0$. We also know that $u_{k,i}$ is non-zero due to the non-zero initial value of $u_{k,0}$ and the fact that $\mathbb{E}\,A^{(i)}$ is primitive. Combining these two facts with $y_{k,i} = w_{k,i}/u_{k,i}$, we can infer that $w_{k,i}$ must converge to zero as well. The detailed proof of this statement is provided in the next section.

5 Convergence Analysis

In this section, we establish the convergence of Algorithms 3 and 4, for the cases of synchronous and independent random entry updates, respectively.

5.1 Convergence of Algorithm 3

First, we verify that recursions (26a)–(26c) reach consensus if the observation signals converge. Then we consider the case in which the signals experience small perturbations.

Theorem 1 (Mean-Square Convergence of Algorithm 3)

Suppose the underlying topology satisfies Assumption 1 and each signal $x_{k,i}$ converges to a limiting value $x_k$. It then holds that the algorithm converges in the mean-square-error sense, namely,

$\lim_{i\to\infty} \mathbb{E}\,\|w_{k,i} - \bar{x}\|^2 = 0, \qquad \text{where } \bar{x} \triangleq \frac{1}{N}\sum_{k=1}^{N} x_k. \qquad (35)$

Proof: For a generic $m$-th entry of the weight/observation vectors, we collect their values across the agents into aggregate vectors as follows:

(36)
(37)

It is sufficient to establish convergence for one entry. Using the vector notation, we can verify that Algorithm 3 leads to:

(38)
(39)

where the indicated coefficient is a scalar denoting the $m$-th diagonal element of $P_i$, which is either $0$ or $1$ as defined in (24). If we denote the average value vector by

(40)

we have

(41)

where we used the property $\mathbb{1}^{\mathsf{T}}\bar{A} = \mathbb{1}^{\mathsf{T}}$. Moreover, since the integer value of $\boldsymbol{m}_i$ is selected uniformly from $\{1, \ldots, M\}$ at iteration $i$, we have:

(42)
(43)
(44)

Because the following arguments focus on a single entry of index $m$, we shall drop the index $m$ for simplicity of notation. Subtracting (38) from (41) and computing the expectation of the squared norm gives

(45)

where the last inequality exploits the fact that

(46)

Notice that under Assumption 1 we can show that the matrix $\bar{A}$ is primitive [19, 1] and, therefore, has one and only one eigenvalue at one, with its corresponding eigenvector equal to $\mathbb{1}_N/\sqrt{N}$. Furthermore, the second largest eigenvalue magnitude of $\bar{A}$, denoted by $\lambda_2$, satisfies [20]:

$\lambda_2 < 1. \qquad (47)$

When $\lambda_2 = 0$, which implies full connectivity, we can end the proof quickly since (45) becomes:

(48)

Hence, in the following argument, we exclude the trivial case $\lambda_2 = 0$. We continue from (45) to get:

(49)

where the second inequality is due to Jensen’s inequality. Similarly, we execute the same procedure on (39):

(50)

where the inequality relies on Jensen's inequality, and a specific value of the splitting parameter is chosen in the last equality. If $x_{k,i}$ converges, it means that

(51)

so that due to (50), we conclude

(52)

Combining with (49), we get

(53)

Hence, we have proven that the algorithm reaches consensus if the observation signals converge. Lastly, we show that the consensus value is actually the desired average. Let $\bar{w}_i$ and $\bar{x}^{\mathrm{mem}}_i$ denote the network averages of the $w_{k,i}$ and $x^{\mathrm{mem}}_{k,i}$, respectively. It then follows from (38) and (39) that

(54)
(55)

Subtracting (55) from (54), we have

(56)

so that

(57)

and we conclude that $\bar{w}_i$ always coincides with $\bar{x}^{\mathrm{mem}}_i$. Recall that $x^{\mathrm{mem}}_{k,i}$ is a vector that stores the past states of $x_{k,i}$; it is easy to see that $x^{\mathrm{mem}}_{k,i}$ converges to the same limit $x_k$ whenever $x_{k,i}$ converges, which completes the proof.

Corollary 1 (Small Perturbations)

Suppose each entry of the signal $x_{k,i}$ satisfies, after sufficiently many iterations $i > i_0$:

(58)

This property implies that $\|x_{k,i} - x_{k,i-1}\| \le \epsilon$, where $\epsilon$ is a small positive value. It then holds that

(59)

Proof: Substituting (58) into (50), for sufficiently large $i$, we have:

(60)

We again omit the index $m$. Taking the limit over $i$, we get

(61)

Similarly, from (49), we have

(62)

5.2 Time-varying push-sum algorithm

Before we continue with the convergence proofs, we provide some useful intuition for the push-sum construction. First, we note that the push-sum algorithm can be written in the following vector form for the $m$-th entry of the weight vectors (where we continue to drop the index $m$):