A Tutorial on Distributed (Non-Bayesian) Learning: Problem, Algorithms and Results

09/23/2016 ∙ by Angelia Nedić, et al. ∙ Boston University ∙ Arizona State University ∙ University of Illinois at Urbana-Champaign

We overview some results on distributed learning with a focus on a family of recently proposed algorithms known as non-Bayesian social learning. We consider different approaches to the distributed learning problem and its algorithmic solutions for the case of finitely many hypotheses. The original centralized problem is discussed first, followed by a generalization to the distributed setting. Results on convergence and convergence rates are presented for both the asymptotic and finite-time regimes. Various extensions are discussed, such as those dealing with directed time-varying networks, Nesterov's acceleration technique, and a continuum set of hypotheses.


I Introduction

Achieving global behaviors by repeatedly aggregating local information without complete knowledge of the network has been a recent topic of interest [1, 2, 3, 4, 5]. For example, a distributed hypothesis testing method that uses belief propagation has been studied in [4]. Various extensions to finite-capacity channels, packet losses, delayed communications and tracking were developed in [6, 7]. In [2], the authors proved convergence in probability, asymptotic normality, and provided conditions under which the distributed estimation is as good as a centralized one. Later, in [1, 8], almost sure convergence of non-Bayesian rules based on consensus was shown for static graphs. Other methods to aggregate Bayes estimates in a network have been explored as well [9]. The work in [10] extends the results of [1] to time-varying undirected graphs. In [11], local exponential rates of convergence for undirected gossip-like graphs are studied. The authors in [12, 13] proposed a non-Bayesian learning algorithm where a local Bayes' update is followed by a consensus step. In [12], convergence results for fixed graphs are provided and large-deviation convergence rates are given, proving the existence of a random time after which the beliefs concentrate exponentially fast. In [13], similar probabilistic bounds for the rate of convergence are derived and comparisons with the centralized version of the learning rule are provided.

Following the seminal work of Jadbabaie et al. in [1, 14, 15], there have been many studies of non-Bayesian rules for distributed learning. Non-Bayesian algorithms involve an aggregation step, usually consisting of a belief aggregation, and a Bayesian update that is based on the locally available data. The belief aggregation typically consists of a weighted geometric or arithmetic average of beliefs, in which case results from the consensus literature [16, 17, 18, 19, 20] are exploited, while the Bayesian update step is based on the standard Bayesian learning approach [21, 22].

Several variants of the non-Bayesian approach have been proposed and shown to produce consistent estimates, with provable asymptotic and non-asymptotic convergence rates for a general class of distributed algorithms. The main body of work focuses on the case of finitely many hypotheses. The established results include asymptotic convergence rate analysis [11, 23, 24, 25, 26, 27, 28], non-asymptotic convergence rate bounds [13, 29, 12], time-varying directed graphs [29], a continuum set of hypotheses [30], weakly connected graphs [31], a bisection search algorithm [32], and transmission node failures [33, 34, 35].

In this paper, we overview a subset of recent studies on distributed (non-Bayesian) learning algorithms. To present a concise introduction to the topic, we start with ideas from centralized learning and then transition to the most recent developments in the distributed setting. This tutorial is by no means exhaustive, and the interested reader may consult the references for a more complete exposition of certain aspects.

This tutorial is organized as follows. Section II presents a general introduction to the distributed learning problem. We highlight the main assumptions and how they can be weakened for more general results. Section III provides an overview of the centralized non-Bayesian learning problem and describes some initial generalizations to the distributed setting (known as social learning). Moreover, convergence results as well as (non-)asymptotic convergence rate estimates are provided. Section IV discusses some generalizations aimed at improving the convergence rate estimates (in terms of their dependency on the number of agents), dealing with time-varying directed graphs, and learning with a continuum set of hypotheses. Finally, some conclusions are presented in Section V.

Notation:

The inner product of two vectors u and v is denoted by ⟨u, v⟩. We write [A]_ij or A_ij to denote the entry of a matrix A in the i-th row and j-th column. We write A' for the transpose of a matrix A and x' for the transpose of a vector x. A matrix is said to be stochastic if its entries are nonnegative and the entries in every row sum to 1. A stochastic matrix whose transpose is also stochastic is said to be doubly stochastic. We use I for the identity matrix, whose size is to be inferred from the context. We write e_i to denote the vector whose entries are all zero except for its i-th entry, which is equal to 1. In general, when referring to agents we will use superscripts with the letters i or j, while when referring to a time instant we will use subscripts and the letter k.

We write |A| to denote the cardinality of a set A, and we denote a probability measure over a set by a lower-case Greek letter (e.g., μ). Upper-case letters represent random variables (e.g., X), with the corresponding lower-case letters as their realizations (e.g., x). The notation E[·] is reserved for expectation with respect to a random variable. We denote the Kullback-Leibler (KL) divergence between two probability distributions P and Q with a common support set by D_KL(P‖Q). In particular, when the distributions P and Q have a countable (or finite) support set, their KL divergence is given by

D_KL(P‖Q) = Σ_x P(x) log( P(x) / Q(x) ).

The definition of the KL divergence for general measures P and Q on a given set is a bit more involved; it can be found, for example, in [36], page 111.
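As a concrete illustration, for distributions with finite support the KL divergence can be computed directly. The short Python sketch below uses our own variable names and the convention that terms with P(x) = 0 contribute zero.

    import numpy as np

    def kl_divergence(p, q):
        """Compute D_KL(P || Q) for finite distributions given as arrays.

        Assumes p and q are nonnegative, sum to 1, and q > 0 wherever p > 0.
        """
        p = np.asarray(p, dtype=float)
        q = np.asarray(q, dtype=float)
        mask = p > 0
        return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

    # Example: two distributions on a three-element outcome set.
    print(kl_divergence([0.5, 0.3, 0.2], [0.4, 0.4, 0.2]))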

II Problem Statement

Consider a group of n agents, indexed by i = 1, …, n, each having conditionally independent observations of a random process at discrete time steps k = 1, 2, …. Specifically, agent i observes the random variables X^i_1, X^i_2, …, which are i.i.d. in time and distributed according to an unknown probability distribution P^i. The set of possible outcomes of the random variables is a finite set. For convenience, we stack up all the X^i_k into a vector denoted X_k. Then, {X_k} is an i.i.d. sequence of vectors distributed as the joint distribution of the agents' observations. Furthermore, each agent i has a family of probability distributions ℓ^i(·|θ) parametrized by θ ∈ Θ, where Θ is a set of parameters. One can think of Θ as a set of hypotheses and ℓ^i(·|θ) as the probability distribution that agent i would see if hypothesis θ were true. In general, it is not required that there exists a θ ∈ Θ with ℓ^i(·|θ) = P^i for all i; in other words, there may not be a hypothesis which matches the observations made by the nodes. Rather, the objective of all agents is to agree on a subset of Θ that best fits all the observations in the network. Formally, this setup describes the scenario where the group of agents collectively tries to solve the following optimization problem:

min_{θ ∈ Θ}  Σ_{i=1}^{n} D_KL( P^i ‖ ℓ^i(·|θ) ),    (1)

where D_KL( P^i ‖ ℓ^i(·|θ) ) is the Kullback-Leibler divergence between the distribution P^i of X^i_k and the distribution that would have been seen by agent i if hypothesis θ were correct. The distributions P^i are not known; therefore, the agents want to “learn” the solution to this optimization problem based on their local observations and some local interactions. See Figure 1 for an illustration of the problem.
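To make the objective concrete for finite outcome sets, the sketch below evaluates the objective in (1) for a given hypothesis, reusing kl_divergence from the notation section. The array layout (true distributions stacked by agent, likelihood models indexed by agent and hypothesis) is our own assumption for illustration; in the actual problem the distributions P^i are unknown.

    def learning_objective(theta, P_true, likelihoods):
        """Evaluate sum_i D_KL(P^i || l^i(. | theta)) over all agents.

        P_true: array (n_agents, n_outcomes); row i is P^i (known here only for illustration).
        likelihoods: array (n_agents, n_hypotheses, n_outcomes); likelihoods[i][theta] is l^i(. | theta).
        theta: index of the hypothesis being evaluated.
        """
        return sum(kl_divergence(P_true[i], likelihoods[i][theta]) for i in range(len(P_true)))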


Fig. 1: Geometric interpretation of the learning objective. The triangle represents the simplex composed of all agents' probability distributions. The observations of the agents are generated according to a joint probability distribution. The joint distribution of the agents' observations is parametrized by θ. The agents' goal is to learn a hypothesis that best describes their observations, which corresponds to the distribution closest to the distribution generating the observations.

The agents interact over a sequence of directed communication graphs, where the node set is the set of agents (each agent is viewed as a node), and the edge set contains a directed edge whenever one agent can communicate with another at time instant k. Specifically, the agents communicate with each other by sharing their beliefs about the hypothesis set, denoted μ^i_k for agent i at time k, which is a probability distribution over the hypothesis set Θ. In the forthcoming discussion, we will consider cases where the graphs can be static and may be undirected. We will clearly specify the assumptions made on the graphs.

The hypothesis set Θ can be a finite set, a countable set, or a continuum; which case applies will be evident from the expressions used in the Bayes update relation.

III Algorithms

In this section, we describe some of the algorithms that have been proposed for the distributed non-Bayesian learning problem. Different algorithms and results exist due to the use of different communication networks and protocols for information exchange. Moreover, the variety in the algorithms is also due to the order in which the local information updates and the aggregation of neighbors' beliefs are performed.

We will start by considering the Bayes update for the case of a single agent, i.e., the centralized case. Furthermore, initially, for simplicity of exposition, we will assume that there exists a single hypothesis θ* that minimizes problem (1) in the single-agent case. In this case, updating the beliefs to account for new observations, which leads to a posterior belief, follows Bayes' rule. Specifically, having a belief μ_k and a new observation s_{k+1} at time k+1, the agent updates its belief as follows (see [37]):

μ_{k+1} = BU(μ_k; s_{k+1}),    (2)

where BU(μ_k; s_{k+1}) denotes the Bayesian update of the belief μ_k given a new observation s_{k+1}, i.e.,

BU(μ_k; s_{k+1})(θ) ∝ ℓ(s_{k+1}|θ) μ_k(θ),

where the symbol ∝ stands for positive proportionality (the proportionality constant here is the normalization factor needed to obtain a probability distribution).
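In code, the Bayesian update over a finite hypothesis set amounts to a pointwise product followed by normalization. A minimal sketch in Python is given below; the names (belief as a probability vector over Θ, likelihood[θ, s] standing for ℓ(s|θ)) are our own and not taken from the references.

    import numpy as np

    def bayes_update(belief, likelihood, s):
        """Bayes' rule: posterior(theta) is proportional to belief(theta) * likelihood(s | theta).

        belief: array of shape (n_hypotheses,), a probability vector over the hypothesis set.
        likelihood: array of shape (n_hypotheses, n_outcomes); row theta is l(. | theta).
        s: index of the newly observed outcome.
        """
        posterior = belief * likelihood[:, s]
        return posterior / posterior.sum()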

In [37, 38], the concepts of over-reaction and under-reaction to local observations are introduced, where the belief update rule is given by

(3)

where . When , algorithm (3) reduces to Bayesian learning in (2). When , relative importance is given to the prior, whereas for , the updates over-react to the observations. The authors in [37] showed that update rules of the form (3) converge to the correct parameter value in the almost sure sense whenever and is measurable. If or if is not measurable, then there is an incorrect parameter to which convergence can happen with positive probability. Thus, as long as there is a constant flow of new information and the agent takes its personal signals into account in a Bayesian manner, the resulting learning process asymptotically coincides with Bayesian learning.

The seminal work in [1] introduced a social learning approach to non-Bayesian learning, where different agents receive different observations and use a DeGroot-style update to incorporate the views of their neighbors:

(4)

where are the weights, taking positive values on the links of a static graph (i.e., for all ) and satisfying for all . In [1], it has been shown that when the underlying social network is strongly connected, every , and at least one agent has a positive prior belief on the true parameter (i.e., for some ), then the beliefs generated by algorithm (4) result in all agents' forecasts converging to the correct one with probability one.
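One way to read an aggregation of this kind, assuming the common form in which each agent mixes its own Bayesian update with its neighbors' current beliefs through a row-stochastic weight matrix, is sketched below (reusing bayes_update from above; the array layout is ours and only meant as an illustration).

    import numpy as np

    def social_learning_step(beliefs, A, likelihoods, observations):
        """One DeGroot-style step for all agents: own Bayesian update plus an
        arithmetic average of the neighbors' current beliefs.

        beliefs: array (n_agents, n_hypotheses); each row is a probability vector.
        A: row-stochastic weights; A[i, j] > 0 only if j is a neighbor of i or j == i.
        likelihoods: array (n_agents, n_hypotheses, n_outcomes).
        observations: observations[i] is the outcome index observed by agent i.
        """
        n = beliefs.shape[0]
        new_beliefs = np.zeros_like(beliefs)
        for i in range(n):
            own = bayes_update(beliefs[i], likelihoods[i], observations[i])
            neighbor_part = sum(A[i, j] * beliefs[j] for j in range(n) if j != i)
            new_beliefs[i] = A[i, i] * own + neighbor_part
        return new_beliefs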

A connection between non-Bayesian learning and optimization theory was pointed out in [11], where a distributed learning algorithm was proposed based on a maximum likelihood analysis of the estimation problem and Nesterov's dual averaging algorithm [39]. Finding the true state of the world was described as the following optimization problem

(5)

or equivalently

(6)

Applying a regularized dual averaging algorithm to the optimization problem (6), one obtains a sequence , where

(7a)
(7b)

with , , being a sequence of non-increasing step-sizes, and a proximal function.

Specifically, for the centralized case with the Kullback-Leibler divergence as the proximal function, the algorithm in (7) has an explicit closed-form solution which coincides with (2).

In the distributed setting in [11], for an undirected and static graph , randomized gossip interactions were considered, where an agent “wakes up” according to a Poisson clock and communicates with a randomly selected agent . Both agents average their accumulated observations and add their most recent stochastic gradient, resulting in an update of the form:

(8a)
(8b)
(8c)

while the other agents in the system do not update.

Letting for all agents and using the Kullback-Leibler divergence as a proximal function, the update rule of the form (8) has a closed-form solution given by

(9)

where

and with and being the agents involved in the random gossip communication at time (or alternatively, the link being randomly activated in the graph ).
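To illustrate the gossip interaction pattern described above (this is only a sketch of the mechanism, not the exact closed form (9), whose step-size-dependent exponents are omitted here), the snippet below activates one random edge per tick; the two endpoints take the normalized geometric mean of their beliefs and then apply their local Bayesian updates, while all other agents remain unchanged.

    import random
    import numpy as np

    def gossip_learning_step(beliefs, edges, likelihoods, observations):
        """One asynchronous gossip interaction over a static undirected graph.

        beliefs: array (n_agents, n_hypotheses) of strictly positive probability vectors.
        edges: list of undirected edges (i, j).
        """
        i, j = random.choice(edges)
        merged = np.sqrt(beliefs[i] * beliefs[j])   # pairwise geometric averaging
        merged /= merged.sum()
        beliefs = beliefs.copy()
        beliefs[i] = bayes_update(merged, likelihoods[i], observations[i])
        beliefs[j] = bayes_update(merged, likelihoods[j], observations[j])
        return beliefs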

The update rule in (8) involves a form of geometric averaging of beliefs instead of the linear aggregation of beliefs as in (4). Weak convergence is proven under a connectivity assumption on the interaction graph , i.e.,

Additionally, in [11], convergence rate results for the estimation process are provided. An asymptotic rate is derived which guarantees that, for sufficiently large time scales, the beliefs concentrate around the true hypothesis exponentially fast. The rate at which this happens is proportional to the distance (in the sense of the KL divergence) between the true hypothesis and the second-best option, i.e., with probability , for sufficiently large , it holds that

where is a constant and

Similar asymptotic rates, obtained using large deviation theory, were derived in [12] for a directed static graph but for a different algorithm. Specifically, in [12], an explicit belief update rule is considered in which local Bayesian updates are aggregated via geometric averages of the form:

(10)

Under the assumptions of strong connectivity, positive prior beliefs, and the existence of a unique correct model, exponential convergence of the beliefs to the correct hypothesis has been shown and an asymptotic convergence rate has been provided (see Theorem 1 of [12]).
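A sketch consistent with this description (each agent first performs a local Bayesian update and then takes a weighted geometric average of its neighbors' updated beliefs; the weight matrix is assumed row-stochastic and all beliefs strictly positive so that the logarithm is well defined) is given below, again reusing bayes_update.

    import numpy as np

    def geometric_social_learning_step(beliefs, A, likelihoods, observations):
        """Local Bayesian updates followed by a weighted geometric average of beliefs.

        beliefs: array (n_agents, n_hypotheses) of strictly positive probability vectors.
        A: row-stochastic weight matrix; A[i, j] > 0 only for neighbors j of agent i.
        """
        n = beliefs.shape[0]
        updated = np.array([bayes_update(beliefs[i], likelihoods[i], observations[i])
                            for i in range(n)])
        log_new = A @ np.log(updated)   # weighted geometric average, taken in log space
        new_beliefs = np.exp(log_new)
        return new_beliefs / new_beliefs.sum(axis=1, keepdims=True)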

In recent works [13, 40], non-asymptotic convergence rates for a variety of distributed non-Bayesian learning algorithms have been established. In [13], the algorithm in (8) has been considered for the case of (non-random) agent interactions over general static connected graphs. In particular, the following relations have been shown to hold

(11)
(12)

with probability , where is arbitrarily small. Here, denotes the total variation distance between vectors and , is a probability vector with an entry in the position corresponding to hypothesis , denotes the size of the hypothesis set, is a step-size, and is a lower bound on the probability mass in the likelihood models. The vector denotes the stationary distribution of the corresponding Markov chain whose transition probability matrix is the interaction matrix (in other words, the vector is the left eigenvector associated with the eigenvalue 1 of the matrix).

The non-asymptotic probabilistic bound in (11) shows the concentration of the beliefs around the true state of the world as an exponentially fast process with a transient time related to the matrix properties and the desired accuracy level. The bound holds for a connected graph and stochastic weight matrix , and the exponential concentration rate depends explicitly on the left-eigenvector associated with eigenvalue 1 of the matrix .

Independent simultaneous work [40, 29] has also developed non-asymptotic bounds for distributed non-Bayesian learning over time-varying graphs and for different algorithms. The belief update rules in [40, 29] are based on the mirror descent algorithm applied to the learning problem in a distributed setting. The resulting update rule has the following form:

(13)

Algorithm (13) is applicable to time-varying graphs, as indicated by the use of time-varying weight matrices that are compliant with the graphs' structure. In particular, the following assumption is imposed on the graph sequence and the matrix sequence .

Assumption 1

Assume that each graph is undirected and has no self-loops (i.e., for all and all ). Moreover, let the graph sequence and the matrix sequence satisfy the following conditions:

  1. is doubly stochastic for every , with if and for .

  2. Each has positive diagonal entries, i.e., for all and all .

  3. There exists a uniform lower bound on the positive entries of , i.e., if .

  4. The graph sequence is -connected, i.e., there is an integer such that the graph is connected for all .

We now consider the learning problem in (1), where the hypothesis set is finite. We let denote the set of optimal solutions and note that this set is nonempty. In this setting, the following assumption ensures that the learning process identifies a correct hypothesis. In particular, the assumption covers the general case when a unique true state of the underlying process does not exist (implying that is not a singleton).

Assumption 2

For all agents ,

  1. There is a nonempty set such that for all . Furthermore, the intersection set is nonempty.

  2. There exists an such that if then for all .

With the two assumptions above we can state the main result in [29].

Theorem 1

Let Assumptions 1 and 2 hold, and let . The update rule of Eq. (13) has the following property: there is an integer such that, with probability , for all and for all , we have

where

where is a local learning objective of agent given by

while is a constant from Assumption 2(b), from Assumption 1(c), and is given by . If each is the lazy Metropolis matrix associated with and , then

Theorem 1 states that, with high probability and after a sufficiently long time (as captured by ), the belief of each agent on any hypothesis outside the optimal set decays at a network-independent rate. This rate scales with the constant , which is the average Kullback-Leibler divergence to the next-best hypothesis. However, there is a transient due to the term (since the bound of Theorem 1 is not even below until ), and the size of this transient depends on the network and the number of nodes through the constant .

We note that the transient time for each agent is affected by the discrepancy in the initial beliefs on the correct hypotheses (those in the set ), as captured by the term in the expression for in Theorem 1. If agent uses a uniform initial belief, i.e., for all , then this term is 0 for all and, consequently, it does not contribute to the transient time . Thus, the transient time has a dependence on the initial beliefs that is intuitively plausible. Moreover, if agent were to start with a good initial belief , i.e., a belief such that , then the corresponding transient time would be smaller, which is also to be expected.

III-A Connection with Distributed Stochastic Mirror Descent

To make this connection simple, we will keep the assumption that the hypothesis set is a finite set. Then, we can observe that the optimization problem in Eq. (1) is equivalent to the following problem:

The order of the expectations in the preceding relation can be exchanged, so the problem in Eq. (1) is equivalent to the following one:

(14)

The difficulty in evaluating the objective function in Eq. (14) (even in the case of a single agent) lies in the fact that the distributions are unknown. A generic approach to solving such problems is the class of stochastic approximation methods, where the objective is minimized by constructing a sequence of gradient-based iterates where the true gradient of the objective (which is not available) is replaced with a gradient sample that is available at the given update time. A particular method that is relevant here is the stochastic mirror-descent method which would solve the problem in Eq. (14), in a centralized fashion, by constructing a sequence , as follows:

(15)

where is a noisy realization of the gradient of the objective function in Eq. (14), is a Bregman distance function associated with a distance-generating function , and is the step-size. If we take as the distance-generating function, then the corresponding Bregman distance is the Kullback-Leibler divergence . Let us note that this specific selection of the Bregman divergence was previously studied in [41], where the entropic mirror descent algorithm was proposed. Thus, in this case, the update rule in Eq. (13) corresponds to a distributed implementation of the stochastic mirror descent algorithm in (15), where , , and the step-size is fixed, i.e., for all .

The update rule in Eq. (13) defines a probability measure over the set which coincides with the iterate update of the distributed stochastic mirror descent algorithm applied to the optimization problem in Eq. (1), i.e.,

(16)
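For intuition, in the centralized single-agent case with the KL divergence as the Bregman distance, the stochastic mirror descent step (15) admits a closed form of the exponentiated-gradient type. A sketch in our notation, taking g_k(θ) = −log ℓ(s_{k+1}|θ) as the stochastic gradient sample and α_k as the step-size, reads

    \mu_{k+1}(\theta) \;\propto\; \mu_k(\theta)\,\exp\bigl(-\alpha_k\, g_k(\theta)\bigr)
                      \;=\; \mu_k(\theta)\,\ell(s_{k+1}\mid\theta)^{\alpha_k},

so that the choice α_k = 1 recovers the Bayes update (2).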

IV Extensions

IV-A Fast Rates with Nesterov's Acceleration

For static undirected graphs, the authors in [29] proposed an update rule with one-step memory as follows:

(17)

where is a constant to be set later. This update rule is based on an accelerated algorithm for computing network aggregate values given in [20], whose convergence rate is a factor of faster than the previous rate results (in terms of the factor that governs the exponential decay).

For the algorithm in (17) we impose the following assumption.

Assumption 3

The graph sequence is static (i.e., for all ) and undirected, and the weight matrix is a lazy Metropolis matrix, defined by

where is the Metropolis matrix, which is the unique stochastic matrix whose off-diagonal entries satisfy

with being the degree of the node (i.e., the number of neighbors of in the graph).
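A sketch of how such a weight matrix can be formed from an undirected graph is given below. It assumes the standard Metropolis weights 1/(1 + max(d_i, d_j)) on edges, a diagonal chosen so that every row sums to one, and the lazy version obtained by averaging with the identity; the input format is our own choice.

    import numpy as np

    def lazy_metropolis_matrix(adj):
        """Build a lazy Metropolis matrix for an undirected graph.

        adj: dict mapping node index -> iterable of neighbor indices (no self-loops).
        Returns the (n x n) matrix (I + M) / 2, where M has Metropolis weights
        1 / (1 + max(deg_i, deg_j)) on edges and a diagonal making each row sum to 1.
        """
        n = len(adj)
        deg = {i: len(adj[i]) for i in adj}
        M = np.zeros((n, n))
        for i in adj:
            for j in adj[i]:
                M[i, j] = 1.0 / (1.0 + max(deg[i], deg[j]))
        for i in range(n):
            M[i, i] = 1.0 - M[i].sum()
        return (np.eye(n) + M) / 2.0

    # Example: a path graph 0 - 1 - 2; the result is symmetric, hence doubly stochastic.
    W = lazy_metropolis_matrix({0: [1], 1: [0, 2], 2: [1]})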

The next theorem provides a convergence rate bound for the beliefs generated by algorithm (17). In particular, it shows the rate at which the beliefs dissipate the mass placed on wrong (non-optimal) hypotheses.

Theorem 2

Let Assumptions 3 and 2 hold, and let . Furthermore, let and let . Then the update rule of Eq. (17) with this , a uniform initial condition, and fixed to zero has the following property: there is an integer such that, with probability , for all and for all , there holds

where

with from Assumption 2(b) and .

The bound of Theorem 2 is an improvement by a factor of compared to the bounds of Theorem 1 (17) when the graphs are static. Indeed, the term is in Theorem 1 if ; the same term is in Theorem 2 and, assuming is within a constant factor of , this becomes . We note, however, that the requirements of this theorem are more stringent than those of Theorem 1. Not only does the graph have to be fixed, but all nodes need to know an upper bound on the total number of agents. Moreover, the bound has to be within a constant factor of the actual number of agents. More details on fast algorithms for distributed optimization and learning can be found in a tutorial paper [42].

IV-B Directed Time-Varying Graphs

In [43], the authors proposed a new algorithm, inspired by the push-sum protocol, that guarantees convergence over directed graphs; it is given as follows:

(18a)
(18b)
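For readers unfamiliar with push-sum, the sketch below illustrates the underlying push-sum averaging primitive on a fixed directed graph; this is the consensus protocol that inspires the rule in (18), not the learning rule itself. Each node splits its values evenly over its out-neighbors (including itself), and the ratio x_i / y_i converges to the average of the initial values under suitable connectivity assumptions.

    import numpy as np

    def push_sum_average(x0, out_neighbors, num_iters=50):
        """Push-sum averaging over a fixed directed graph.

        x0: initial values, array of shape (n,).
        out_neighbors: dict node -> list of out-neighbors (a self-loop is added automatically).
        Returns the vector of ratios x / y after num_iters rounds.
        """
        n = len(x0)
        x = np.array(x0, dtype=float)
        y = np.ones(n)
        for _ in range(num_iters):
            new_x, new_y = np.zeros(n), np.zeros(n)
            for i in range(n):
                dests = list(out_neighbors[i]) + [i]          # include a self-loop
                share_x, share_y = x[i] / len(dests), y[i] / len(dests)
                for j in dests:
                    new_x[j] += share_x
                    new_y[j] += share_y
            x, y = new_x, new_y
        return x / y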

For this algorithm, we have the following result about its convergence behavior.

Theorem 3

Assume that the graph sequence is -strongly connected and that Assumption 2 holds. Also, let be a given error percentile (or confidence value). Then the update rule in Eq. (18), with and uniform initial beliefs, has the following property: there is an integer such that, with probability , for all and for all , there holds

where with and as defined in Theorem 1. The constants , and satisfy the following relations:
(1) For general -strongly-connected graph sequences , we have

(2) If every graph is regular with , then we have

IV-C Infinite Sets of Hypotheses

All previously discussed results assume that the hypothesis set is finite. The exponential convergence rates discussed so far depend on some form of distance between the optimal hypothesis and the second-best one, and such results extend to the case of countably many hypotheses. However, in the case of a continuum of hypotheses, this approach encounters obstacles. In recent work [30], exponential rates have been established for a compact set of hypotheses. In this case, the update rule for a measurable set is defined as

(19)

where is a measure with respect to which every belief is absolutely continuous. The particular details of the rate results can be found in [30].

V Conclusions and Future Work

We presented highlights of recent developments on the problem of distributed (non-Bayesian) learning. We discussed the problem statement and how different assumptions on the learning model, the communication graphs, and the hypothesis set lead to different algorithmic implementations. We showed that the original Bayesian approach can be interpreted as a method for solving a related optimization problem.

Future work should focus on models where the observations are neither identically distributed nor independent. Recent results on concentration of measure without independence provide the theoretical foundations for obtaining non-asymptotic rates in more general cases [44, 45]. Such time dependence could model changes in the optimal hypotheses, in the likelihood models, or in the Bregman divergences used. Online optimization has been shown to be efficient for some forms of time dependence [46, 47].

References