Online Learning over Dynamic Graphs via Distributed Proximal Gradient Algorithm

05/16/2019 ∙ by Rishabh Dixit, et al. ∙ Indian Institute of Technology Kanpur 0

We consider the problem of tracking the minimum of a time-varying convex optimization problem over a dynamic graph. Motivated by target tracking and parameter estimation problems in intermittently connected robotic and sensor networks, the goal is to design a distributed algorithm capable of handling non-differentiable regularization penalties. The proposed proximal online gradient descent algorithm is built to run in a fully decentralized manner and utilizes consensus updates over possibly disconnected graphs. The performance of the proposed algorithm is analyzed by developing bounds on its dynamic regret in terms of the cumulative path length of the time-varying optimum. It is shown that as compared to the centralized case, the dynamic regret incurred by the proposed algorithm over T time slots is worse by a factor of (T) only, despite the disconnected and time-varying network topology. The empirical performance of the proposed algorithm is tested on the distributed dynamic sparse recovery problem, where it is shown to incur a dynamic regret that is close to that of the centralized algorithm.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 1

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

I Introduction

Recent advances in sensing, communication, and computation technologies have ushered an era of large-scale networked systems, now ubiquitous in smart grids, robotics, defense installations, industrial automation, and cyber-physical systems [1, 2, 3]. In order to accomplish sophisticated tasks, these multi-agent systems are expected to sense, learn, and adapt to sequentially arriving measurements while facing time-varying and non-stationary environments. Such high levels of network agility requires close coordination and cooperation among the agents, that are otherwise intermittently connected and resource-constrained. Towards performing control and resource allocation over such dynamic graphs, it is necessary to develop in-network optimization algorithms that can be implemented in a distributed fashion [4].

We consider the problem of tracking the minimum of a time-varying cost function that can be written as a sum of several node-specific smooth cost functions and a non-smooth regularizer. The framework subsumes a number of relevant problems such as model-predictive control [5, 6], tracking time-varying parameters [7, 8, 9], path planning [10, 11], real-time magnetic resonance imaging [12], adaptive matrix completion [13], demand scheduling in smart grids [14], and so on. Of particular interest is the distributed setting, where measurements are made locally at each node and the interaction between nodes may only occur over a time-varying directed communication graph. Due to the intermittent connectivity, it is not possible for the nodes to exchange updates at every time slot, and consequently, simple variants of centralized algorithms cannot be directly utilized.

Within the context of static optimization problems, distributed algorithms running over time-varying graphs have been widely studied, with most algorithms relying on the consensus approach [15]. Consensus-based algorithms can be further categorized into those utilizing weighted averaging in the primal domain [16, 17, 18], push-sum-based approaches [19], and the alternating directions method of multipliers (ADMM) method [20]

. The weighted-averaging-based approaches utilize a doubly stochastic matrix for averaging and are amenable to gradient-tracking variants that converge at geometric rates; see

[4] and references therein.

Fewer distributed algorithms exist for the more challenging time-varying setting where the minimizer drifts over time. Algorithms for tracking time-varying parameters have their roots in adaptive signal processing and control theory, where the steady-state tracking error is of interest [21, 22]. Recently however, online convex optimization has emerged as a useful framework for the analysis of tracking algorithms, especially in adversarial settings [23, 24, 25, 26, 27, 28, 29, 30]. The idea here is to characterize the dynamic regret of an algorithm in terms of the cumulative variations in the problem parameters such as the path length. However, existing online algorithms for tracking time-varying objectives cannot handle time-varying graph topologies where the nodes may get disconnected. A distributed dynamic ADMM algorithm for differentiable cost functions was first proposed in [31] and the steady-state optimality gap was derived for static graphs. On the other hand, an online ADMM algorithm for non-differentiable problems was proposed in [9] but required centralized implementation. Closer to the current work, the dynamic regret performance of a distributed mirror descent algorithm was obtained in [32] for possibly non-differentiable functions. However, the underlying graph was required to be static and connected. In contrast, the focus here is on distributed algorithms over dynamic and sporadically connected graphs.

This works puts forth a distributed proximal online gradient descent (OGD) algorithm designed to run over a time-varying network topology and capable of tracking the minimum of a time-varying composite loss function. As the objective may contain non-differentiable regularizers, we build upon the recently developed machinery of proximal OGD

[33]. Towards realizing a distributed implementation, we utilize the idea of weighted averaging using a doubly stochastic weight matrix, as is customary in distributed algorithms for static problems [4]. However, since the graph topology is time-varying, it is not possible for the nodes to perform a consensus update or even exchange information at each time slot. Instead a multi-step consensus algorithm inspired from [34] is developed, where the objective function information is used only intermittently and the remaining time slots are utilized for consensus. It is remarked that the time-varying problem is significantly more challenging than the static version considered in [34], since the objective function continues to evolve even when the consensus steps are being carried out, resulting in the accumulation of additional regret.

The proposed distributed proximal OGD algorithm is analyzed as an inexact variant of the proximal OGD algorithm [35]. Subsequently, the gradient and proximal errors arising due to the distributed operation are characterized, allowing us to obtain the required dynamic regret bounds. Interestingly, the distributed operation only results in an additional factor of as compared to the dynamic regret lower bound for the centralized case. The performance of the proposed algorithm is tested for the dynamic sparse recovery problem and compared with that of the centralized proximal OGD and ADMM algorithms.

A large number of LMS- and RLS-based algorithm have been proposed to solve the dynamic sparse recovery problem, starting from [36, 37]. An online ADMM approach for time-varying LASSO problem was developed in [38] while a online proximal ADMM algorithm for solving the group LASSO problem was proposed in [39]. While analytical results have been developed, they are largely limited to the case of static parameters. In particular, these works do not generally characterize the steady state tracking performance, though it may be possible to borrow similar bounds from the matrix completion literature; see [13]. A comprehensive survey of RLS-based dynamic sparse recovery methods is provided in [8].

When the sparse parameters follow a linear state-space model, they can be tracked within the sparse Bayesian learning (SBL) formalism [40, 41]. More recently, consensus has been employed to track the sparse parameters in a distributed fashion [42]. Likewise, a robust framework for dynamic sparse recovery has been developed in [43]. Different from the system model considered here, SBL-based methods cannot handle adversarial targets and are generally not amenable to regret analysis.

Adversarial parameters are naturally handled within the dynamic or online convex optimization framework. For instance, the dynamic sparse recovery problem was formulated and solved via a ’running’ ADMM approach in [9]. Likewise, the mirror descent algorithm for solving a similar problem was proposed in [32]. Generalizing these frameworks, we consider the dynamic sparse recovery problem in a distributed setting where the number of observations per sensor are not sufficient to estimate the full parameter. Further, the underlying graph topology is time-varying and collecting the observations at a central location is not straightforward.

In summary, our key contributions include: (a) a novel multi-step consensus-based proximal OGD algorithm for time-varying optimization problems over time-varying network topologies (b) performance analysis of the general inexact proximal OGD algorithm and its application to the analysis of the proposed distributed algorithm. The rest of the paper is organized as follows. We start with the problem formulation in Sec. II. The proposed distributed proximal OGD algorithm is presented in Sec. III. A bound on the dynamic regret incurred by the proposed algorithm is obtained in Sec. IV. The empirical performance of the proposed algorithm is tested on the dynamic sparse recovery problem in Sec. V and the conclusions are presented in Sec.VI.

Notation: All the scalars are denoted by regular font lower case letters, vectors by boldface lower case letters, and matrices by upper case boldface letters. Constants such as the total number of time slots and gradient bounds are denoted by regular font capital letters (such as

and ) while problem parameters such as step-size are denoted by Greek letters (such as ). Finally, sets are denoted by calligraphic font capital letters. The -th entry of the matrix is denoted either by or . The Euclidean norm of a vector is denoted by . By default, time and iteration indices (e.g. or ) will be in the subscript, while node or element indices (e.g. or ) will be in the superscript.

References Convex Distributed Dynamic graph topology Dynamic regret bound
[24] Convex No No
[32] Convex Yes No
This work Strongly convex Yes Yes
TABLE I: Dynamic regret rates for online learning with non-differentiable functions (cf. Sec.III)

Ii Problem Formulation

Consider a multi-agent network with the set of agents or nodes interacting with each other over intermittent communication links. The network topology at time is represented by a directed graph where denotes the set of directed edges or links present at time . Specifically, agents can transmit to agent only if at time instant . The dynamic graph is arbitrary and may not necessarily be connected for all . However, is still required to be uniformly strongly connected or -connected [17]; see Sec. III.

Restricted to exchanging information only over the edges in at time , the agents seek to cooperatively track the optimum of the following dynamic optimization problem

()

where is a smooth strongly convex function, is a convex but possibly non-smooth regularizer, and is a closed and convex set with a non-empty interior. For the sake of brevity, henceforth we denote

(1)

and so that () may also be written as . In the multi-agent setting considered here, the function is private to node while is known at all the nodes. It is remarked that the static version of () has found several applications in signal processing [44] and learning problems [45].

Algorithms to solve () are developed and analyzed within the rubric of online convex optimization, where the learning process is modeled as a sequential game between the agents and an adversary [23]. At each time , agent plays an action and in response, receives the functions from the adversary. Over a horizon of times, the goal of the agent is to minimize the cumulative dynamic regret given by [32]:

(2)

The dynamic regret measures the loss incurred by the agent against that of a time-varying clairvoyant. Observe that the dynamic regret is different from and more stringent than the static regret where the benchmark is not time-varying [23]. The definition in (2) also includes the modification due to the distributed nature of the problem, first suggested in [32]. For instance, the right-hand side of (2) includes terms of the form for , and therefore, regret minimization requires the agents to collaborate.

An algorithm is said to be no-regret if grows sublinearly with . It is well-known that the dynamic regret cannot be sublinear unless certain regularity conditions are imposed on the temporal variations of [25]. In the present case, the regret bounds will be developed in terms of the path length, defined as

(3)

and assumed to be sublinear. For general time-varying problems with gradient feedback, the dynamic regret of any algorithm is at least [28]. Dynamic regret bounds of related algorithms capable of handling non-smooth functions are summarized in Table I. Closely related to the proposed work, a distributed online mirror descent algorithm is discussed in [32] that is more general and can handle non-differentiable loss functions. It is remarked that the centralized algorithm proposed in [29] is not included in Table I since it requires the loss function to be self-concordant.

The present work develops an online and distributed algorithm for solving (), where the agents minimize the regret collaboratively and without a central controller or fusion center. While the static variant of the problem can be readily solved via consensus, such an algorithm is not directly applicable to the time-varying problem at hand. Specifically, while the spread of information is limited due to the time-varying communication graph , the problem parameters continue to change with , regardless. Towards this end, we adopt the multi-step consensus idea from [34]. Sublinear regret will be obtained by carefully balancing the additional accuracy obtained from running the consensus step for multiple times against the excess regret accumulated in the meanwhile.

Iii Proposed distributed algorithm

Iii-a Motivation

In order to motivate the proposed algorithm, observe that since is strongly convex, the optimality condition for () may be written as

(4)

where recall that and the proximal operator is defined as:

(5)
(6)

and is the step-size. In a centralized setting, the form of the optimality condition in (4) is suggestive of the proximal OGD algorithm [33] that takes the form

(7)

for all . Though the algorithm in (7) is provably no-regret, it is not usable in distributed settings where the timely evaluation of the average gradient is challenging. Indeed, due to the dynamic and arbitrary nature of the communication graphs , even collecting the gradients from each node is not straightforward.

The consensus algorithm, where the information transmission occurs only over the edges of a given graph , has been widely used for distributed averaging [17]. While the consensus algorithm converges only asymptotically, it is still possible for the nodes to obtain an approximate version of the average in a few iterations. The idea of running the consensus for a few iterations in order to approximately calculate the average gradient was first proposed in [34] for static problems, and will also be utilized here. A key complication that occurs in the dynamic setting is that while agents carry out the averaging via consensus routine, the problem parameters continue to change. Therefore it becomes important to carefully plan the number of consensus steps that will result in a sufficiently high update accuracy without losing track of .

Iii-B The Distributed Proximal OGD Algorithm

Building upon the classical distributed proximal-gradient algorithm [34], the proposed distributed proximal OGD (DP-OGD) algorithm takes up multiple time slots to complete each iteration. The time-varying functions are sampled at times where

(8)

so the -th iteration starting at time takes up time slots where denotes the number of consensus steps at iteration . The two additional time slots are reserved for the gradient update and for the application of the proximal operator. In order to establish the regret bounds, we will generally choose to be a non-decreasing function of . Associate time-varying weights with each edge , while let for all . Let the matrix collect all the edge weights . The full DP-OGD algorithm starts with an initial and consists of the following updates

(9a)
(9b)
(9c)

where we let . Note that only one of the three updates steps in (9a)-(9c) is carried out depending on the value of . These updates in time index have been effectively summarized in Algorithm 1.

1:  Initialize , , step-size , and where
2:  for do
3:       for do
4:  if  then
5:             Get exact gradient and function
6:             Update
7:  end if
8:  if  then
9:             Receive from all
10:             Update
11:  end if
12:  if  then
13:             Update
14:               where
15:  end if
16:       end for
17:  end for
Algorithm 1 DP-OGD Algorithm in time index
Fig. 1: DP-OGD updates at node .

In order to better understand the proposed algorithm, it may be instructive to write down the updates in (9) in terms of the iteration index :

(10a)
(10b)
(10c)

For the sake of brevity, we introduce new (capped) variables and functions whose subscript indicates the iteration index instead of the time index. Specifically, for all , let

(11a)
(11b)
(11c)

Likewise, the optimum at time is specified as . With the iteration-indexed notation in (11), the DP-OGD updates may simply be written as

(12a)
(12b)
(12c)

where denotes the -th entry of the matrix

(13)

Here, the -steps of consensus are all condensed into a single equation (12b). In order to better understand (12b), collect the variables , , and into super-vectors , , and respectively. Recursively applying the consensus averaging, it follows that

(14)

The full algorithm is summarized in Algorithm 2 and the updates at node are shown in Fig. 1. It is remarked that Algorithm 2 is still conceptional since we have left unspecified. An appropriate choice of is necessary to obtain a tight regret bound and a detailed discussion for the same will be provided in Sec. IV.

1:  Initialize and step-size
2:  for do
3:       for do
4:          Get exact gradient and function
5:          Update
6:       end for
7:       for do
8:          Update
9:          Update
10:       end for
11:  end for
Algorithm 2 Distributed Proximal Online Gradient Descent

Iv Performance Analysis

This section develops the regret rate for the proposed algorithm. As already discussed, each iteration involves the gradient update at time , consensus steps, and a proximal update at time . As compared to the existing proximal OGD algorithm, the analysis of DP-OGD algorithm is complicated due to the additional regret that is incurred from subsampling the functions at times . We begin with discussing some preliminaries before proceeding to the assumptions and the regret bounds.

Iv-a Preliminaries

The regret rate of Algorithm 2 will be analyzed using the network averages at each iteration , defined as

(15)

Towards establishing the regret rates, we begin with casting the proposed algorithm as an inexact proximal gradient algorithm that can be viewed as a generalized version of [33]. Using the network averages defined in (15), it is possible to write (12a) as an inexact gradient update step:

(16)
(17)

where recall that and

(18)

Along similar lines, we write the inexact proximal update as

(19)
(20)

where the -proximal operator is as defined in [35] which implies that

(21)

As compared to (12b)-(12c), the error incurred in using the inexact proximal operation can be expressed as

(22)

for all . At this stage, is left unspecified and appropriate bounds will be developed later. Having written the proposed DP-OGD algorithm as an inexact proximal OGD variant, the rest of the analysis proceeds as follows: (a) development of bounds for updates in (17)-(20) in terms of and ; (b) development of bounds on and ; and finally (c) substitution of these bounds to obtain the required regret bounds. It is remarked that such an approach is flexible and readily extendible to other variants where the sources of gradient and proximal errors may be different; e.g., quantization, asynchrony, or computational errors.

Iv-B Assumptions

The assumptions required for developing the regret bounds are discussed subsequently. The first three assumptions pertain to the properties of functions and .

Assumption 1.

The functions are -smooth, i.e., for all , it holds that

(23)

for all and .

Assumption 2.

The cost functions and are Lipschitz continuous, i.e., for all , we have that

(24)
(25)

where we have used the same Lipschitz constant for for the sake of brevity and without loss of generality. Note that (24)-(25) also imply that the corresponding (sub-)gradients are bounded by .

Assumption 3.

The functions are -convex, i.e., for all , it holds that

(26)

It is remarked that Assumptions 1-3

are standard and applicable to a wide range of problems arising in machine learning, signal processing, and communications. The present analysis depends critically on these assumptions and the required regret bounds only hold when

and the parameters and are bounded. The next two assumptions pertain to the network connectivity and the choice of weights.

Assumption 4.

The graph is -connected, i.e., there exists some such that the graph

(27)

is connected for all .

Assumption 5.

The weight matrices have positive entries and satisfy the following properties for each :

  1. Double stochasticity: it holds that ;

  2. Lower bounded non-zero entries: there exists such that for all ;

  3. Non-zero diagonal entries: it holds that for all .

Assumptions 4-5 imply that the entries of the weight matrix are close to in the following sense [16, Proposition 1(b)]:

(28)

where defining , we generally have that and . Inequality (28) holds the key to obtaining fast consensus over time-varying graphs and as we shall see later, the regret rate of the algorithm would depend critically on the value of .

For the next assumption, let the total number of time slots required to carry out updates be given by

(29)

where recall that is the number of consensus steps at the -th iteration. Inverting the relationship (29), for any , it holds that . The following assumption is required to ensure that the regret bounds are sublinear.

Assumption 6.

The number of consensus steps is a non-decreasing function of and the number of consensus steps at the last iteration is sublinear in .

An implication of Assumption 6 is that there cannot be too many consensus steps at any iteration . Having stated all the required assumptions, we are now ready to state the main results of the paper.

Iv-C Regret Bounds

As already discussed, we begin with developing some bounds for the generic inexact proximal gradient method (17)-(20). Bounds on the error incurred from using the proposed distributed algorithm will be developed next. The final regret bounds would follow from combining these two results. The section concludes with some discussion on the nature of the bounds for various choices of .

The first lemma bounds the per-iteration progress of the iterate in terms of its change in distance from the current optimum .

Lemma 1.

Under Assumptions 1-3, the updates in (17)-(20) satisfy

(30)

where, and .

The proof of Lemma 1 is provided in Appendix B and utilizes the strong convexity and smoothness properties of as well as the triangle inequality. The result in Lemma 1 states that the distance between and the current optimal is less than a -fraction of the distance between and , but for an error term that arises due to and in the updates steps. It is remarked that the result in (30) subsumes all existing results in [30, 33].

Taking summation over and rearranging, we obtain the following corollary, whose proof is deferred to Appendix B.

Corollary 1.

Under Assumptions 1-3 and for , the updates in (17)-(20) satisfy

(31)

The result in Corollary 1 provides an upper bound on the cumulative distance between the average current iterate and optimal over time instances. Next, Lemma 2 develops a simple bound on the iterate norm with the proof provided in Appendix C.

Lemma 2.

Under Assumption 2, 4, and 5, the iterates generated by the DP-OGD algorithm are bounded as

(32)
(33)

The bounds in Lemma 2 are not surprising given that the size of each update step is bounded implying that after steps, none of the iterates can be more than far from the starting point. Recall that the inexact proximal OGD algorithm updates in (17)-(20) are really the DP-OGD updates in disguise, with specific definitions of and . The next lemma provides a convenient bound on the error term that will subsequently be used to obtain the regret bounds.

Lemma 3.

Under Assumptions 1-5, the error is bounded as

(34)

The proof of Lemma 3 is provided in Appendix D. It provides a bound on the error sequence generated by the inexact algorithm in terms of a geometric-polynomial sequence. The first term of (34) results from the gradient error while the second term bounds the norm of the proximal error .

We are now ready to convert the results obtained in terms of the iteration index into bounds that depend on time index . Recall that since the -th iteration incurs time slots, the last iteration incurs time slots where , as stated in Assumption 6. Also let so that the initial bounds can be developed in terms of and . Specific examples of will subsequently be provided to yield bounds as explicit functions of . As a precursor, consider a simple example when , so that and . Next, the following lemma reconciles the two definitions of the path length.

Lemma 4.

It holds that

(35)

where is such that .

Proof:

The result follows from the use of triangle inequality:

Note that in order to calculate the dynamic regret in (2), it is necessary to define for all . Towards this end, let for such that . Finally, we provide the main result of the paper in the following Theorem.

Theorem 1.

Under Assumptions 1-6 and , the proposed DP-OGD algorithm incurs the following dynamic regret

(36)

The proof of Theorem 1 is provided in Appendix E. Here, the regret bound is worse than since the algorithm updates are sporadic resulting in additional factor of . The result in (36) depends on the choice of through and . We now discuss the explicit form of the regret bound for a few choices of .

Iv-C1 Logarithmically increasing

Consider the case

(37)

where is left unspecified at this stage. Given , the number of iterations are given by the largest that satisfies the inequality

(38)

For the sake of brevity, let and likewise so that only the dominant terms may be retained. Ignoring the floor function and using Stirling’s approximation [46], we have that

(39)

or equivalently,

(40)

where denotes the Lambert function. It follows that . Also note that

(41)
(42)
(43)

where the last step uses the approximation for large . The overall regret bound thus becomes

(44)

For the regret to be sublinear, it is necessary but not sufficient that is also sublinear. For instance, if with , it is also required to hold that or equivalently one must choose so that the term dominates the summation. In this case, is not allowed to be arbitrarily close to 1. Instead, it is required that or equivalently, we have . In other words, for a given , it is always possible to choose an appropriate value of the parameter , though the regret bound will only be for sufficiently large .

Iv-C2 Constant

Taking for all , it is required that

(45)
(46)

for and sufficiently large. In this case, and

(47)
(48)

yielding the final regret bound

(49)

In order to write the regret in explicit form, let for some . Then the regret in (49) is sublinear when or . However cannot be arbitrarily small or else the term would become too large. The minimum value of the regret is obtained for the choice of such that

(50)
(51)
(52)

where we have used the result from [47]. Since is large, we make use of the approximation , which yields

Therefore the regret bound can be approximately written as

(53)

which is sublinear as long as is not too close to 1 and . As in the previous case, the optimal choice of still requires a sufficiently large value of . In summary, the step-size may be chosen as while may be chosen as either or in Algorithm 2.

Recall that the centralized proximal OGD algorithm achieves a regret of which is also optimal for any online algorithm; see [28]. Remarkably, the dynamic regret of the DP-OGD is only worse by a factor, possibly arising out of the distributed operation over an intermittently connected graph.

V Numerical results

The performance of the proposed DP-OGD algorithm is tested for the dynamic sparse signal recovery problem where the goal is to estimate a time-varying sparse parameter. Such problems have been widely studied in literature and can be broadly classified into those advocating adaptive filtering-based algorithms, those formulating the problem within the sparse Bayesian learning framework, and finally those utilizing tools from dynamic or online convex optimization.

V-a The Dynamic Sparse Recovery Problem

Consider a WSN with sensors connected over a time-varying graph . The parameter of interest is a time-varying sparse signal . At time , sensor makes measurements according to the following model

(54)

where is the observation matrix and is the noise with unknown statistics. The online learning model detailed in Sec. II is considered and the quantities are revealed in a sequential manner. Observe that given no other information, tracking the original parameter is impossible. Instead, we settle for tracking the Elastic Net estimator of given by