The increasing amount of data generated by recent applications of distributed systems, such as social media, sensor networks, and cloud-based databases, has brought considerable attention to distributed data processing, in particular to the design of distributed algorithms that take communication constraints into account and make coordinated decisions in a decentralized manner Jadbabaie et al (2012); Rahnama Rad and Tahbaz-Salehi (2010); Alanyali et al (2004); Olfati-Saber et al (2006); Aumann (1976); Borkar and Varaiya (1982); Tsitsiklis and Athans (1984); Genest et al (1986); Cooke (1990); DeGroot (1974); Gilardoni and Clayton (1993). In a distributed system, the interactions between agents are restricted by constraints on the flow of information imposed by the network structure. Such constraints mean that each agent can use only locally available information. This contrasts with centralized approaches, where all information and computation resources are available at a single location Gubner (1993); Zhu et al (2005); Viswanathan and Varshney (1997); Sun and Deng (2004).
One traditional problem in decision-making is that of parameter estimation or statistical learning. Given a set of noisy observations coming from a joint distribution, one would like to estimate a parameter or distribution that minimizes a certain loss function. For example, Maximum a Posteriori (MAP) or Minimum Least Squared Error (MLSE) estimators fit a parameter to some model of the observations. Both MAP and MLSE estimators require some form of Bayesian posterior computation based on models that explain the observations for a given parameter. Computing such posterior distributions depends on having exact models of the likelihood of the corresponding observations. This is one of the main difficulties of using Bayesian approaches in a distributed setting: a fully Bayesian approach is not possible because full knowledge of the network structure, or of other agents’ likelihood models, may not be available Gale and Kariv (2003); Mossel and Tamuz (2010); Acemoglu et al (2011).
Following the seminal work of Jadbabaie et al. in Jadbabaie et al (2012, 2013); Shahrampour and Jadbabaie (2013), there have been many studies of distributed non-Bayesian update rules over networks. In this setting, agents are assumed to be boundedly rational (i.e., they fail to aggregate information in a fully Bayesian way Golub and Jackson (2010)). Proposed non-Bayesian algorithms involve an aggregation step, typically a weighted geometric or arithmetic average of the received beliefs Acemoglu et al (2008); Tsitsiklis and Athans (1984); Jadbabaie et al (2003); Nedić and Olshevsky (2015); Olshevsky (2014), followed by a Bayesian update with the locally available data Acemoglu et al (2011); Mossel et al (2014). Recent studies have proposed variations of the non-Bayesian approach and established consistency and geometric, non-asymptotic convergence rates for a general class of distributed algorithms: from asymptotic analysis Shahrampour and Jadbabaie (2013); Lalitha et al (2014); Qipeng et al (2011, 2015); Shahrampour et al (2015); Rahimian et al (2015) to non-asymptotic bounds Shahrampour et al (2016); Nedić et al (2015a); Lalitha et al (2015); Nedić et al (2015b), time-varying directed graphs Nedić et al (2016c), and transmission and node failures Su and Vaidya (2016); see Barbarossa et al (2013); Nedić et al (2016d) for an extended literature review.
We build upon the work in Birgé (2015) on the non-asymptotic behavior of Bayesian estimators to derive new non-asymptotic concentration results for distributed learning algorithms. In contrast to existing results, which assume a finite hypothesis set, in this paper we extend the framework to countably many hypotheses and to a continuum of hypotheses. Our results show that, in general, the network structure induces a transient time after which all agents learn at a network-independent rate, and this rate is geometric.
The contributions of this paper are as follows. We begin with a variational analysis of the Bayesian posterior and derive an optimization problem for which the posterior is a step of the Stochastic Mirror Descent method. We then use this interpretation to propose a distributed Stochastic Mirror Descent method for distributed learning. We show that this distributed learning algorithm concentrates the beliefs of all agents around the true parameter at an exponential rate, and we derive high-probability non-asymptotic bounds for the convergence rate. In contrast to the existing literature, we analyze the case where the parameter space is compact. Moreover, we specialize the proposed algorithm to parametric models of an exponential family, which results in especially simple updates.
The rest of this paper is organized as follows. Section 2 introduces the problem setup: it describes the networked observation model and the inference task. Section 3 presents a variational analysis of the Bayesian posterior, shows the implicit representation of the posterior as steps of a stochastic program, and extends this program to the distributed setup. Section 4 specializes the proposed distributed learning protocol to observation models that are members of the exponential family. Section 5 presents our main results on the exponential concentration of beliefs around the true parameter; it begins by gently introducing our techniques through a concentration result for the case of countably many hypotheses, before turning to our main focus: the case when the set of hypotheses is a compact subset of . Finally, conclusions, open problems, and potential future work are discussed.
Notation: Random variables are denoted with upper-case letters, e.g., , while the corresponding lower-case letters are used for their realizations, e.g., . Time indices are denoted by subscripts, and the letter or is generally used. Agent indices are denoted by superscripts, and the letters or are used. We write or to denote the entry of a matrix in its -th row and -th column. We use for the transpose of a matrix , and for the transpose of a vector. The complement of a set is denoted as .
2 Problem Setup
We begin by introducing the learning problem from a centralized perspective, where all information is available at a single location. Later, we will generalize the setup to the distributed setting where only partial and distributed information is available.
Consider a probability space , where is a sample space, is a -algebra, and is a probability measure. Assume that we observe a sequence of independent random variables , all taking values in some measurable space and identically distributed with a common unknown distribution . In addition, we have a parametrized family of distributions , where the map from parameter to distribution is one-to-one. Moreover, the models in are all dominated by a -finite measure , with corresponding densities . (A measure is dominated by, or absolutely continuous with respect to, a measure if implies for every measurable set .) Assuming that there exists a such that , the objective is to estimate based on the received observations .
Following a Bayesian approach, we begin with a prior on , represented as a distribution on the space ; then, given a sequence of observations, we incorporate this knowledge into a posterior distribution following Bayes’ rule. Specifically, we assume that is equipped with a -algebra and a measure , and that , which is our prior belief, is a probability measure on dominated by . Furthermore, the densities are measurable functions of for any , and are also dominated by . We then define the belief as the posterior distribution given the sequence of observations up to time , i.e.,
for any measurable set (note that we used the independence of the observations at each time step). Assuming that all observations are readily available at a centralized location, under appropriate conditions, the recursive Bayesian posterior in Eq. (1) will be consistent in the sense that the beliefs will concentrate around ; see Ghosal (1997); Schwartz (1965); Ghosal et al (2000) for a formal statement. Several authors have studied the rate at which this concentration occurs, in both asymptotic and non-asymptotic regimes Birgé (2015); Ghosal et al (2007); Rivoirard et al (2012).
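For concreteness, the recursive Bayesian posterior referenced in Eq. (1) can be sketched in generic notation; the symbols below (\(\mu_0\) for the prior, \(\ell(\cdot\mid\theta)\) for the likelihood densities, \(B\) for a measurable set) are placeholders of ours, not necessarily the paper's own:

```latex
\mu_k(B) \;=\; \frac{\int_B \prod_{t=1}^{k} \ell(s_t \mid \theta)\, d\mu_0(\theta)}
                    {\int_\Theta \prod_{t=1}^{k} \ell(s_t \mid \theta)\, d\mu_0(\theta)},
\qquad B \in \mathcal{B}(\Theta).
```

The product over \(t\) uses the independence of the observations noted above.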
Now consider the case where there is a network of agents observing the process , where is now a random vector belonging to the product space , and consists of the observations of the agents at time . Specifically, agent observes the sequence , where is now distributed according to an unknown distribution . Each agent has a private family of distributions it would like to fit to the observations. However, the goal is for all agents to agree on a single that best explains the complete set of observations. In other words, the agents collaboratively seek a that makes the distribution as close as possible to the unknown true distribution . Agents interact over a network defined by an undirected graph , where is the set of agents and is the set of undirected edges, i.e., if and only if agents and can communicate with each other.
We study a simple interaction model where, at each step, agents exchange their beliefs with their neighbors in the graph. Thus, at every time step , agent receives the sample from as well as the beliefs of its neighboring agents, i.e., for all such that . Applying a fully Bayesian approach runs into obstacles in this setting, as agents know neither the network topology nor the private families of distributions of the other agents. Our goal is to design a learning procedure that is both distributed and consistent. That is, we are interested in a belief update algorithm that aggregates information in a non-Bayesian manner and guarantees that the beliefs of all agents concentrate around .
As a motivating example, consider the problem of distributed source localization Rabbat and Nowak (2004); Rabbat et al (2005). In this scenario, a network of agents receives noisy measurements of the distance to a source. The sensing capabilities of each sensor might be limited to a certain region. The group objective is to jointly identify the location of the source. Figure 1 shows a group of agents (circles) seeking to localize a source (star). There is an underlying graph that indicates which nodes can exchange messages. Moreover, each node has a sensing region indicated by the dashed circle around it. Each agent observes signals proportional to the distance to the target. Since a target cannot be localized effectively from a single distance measurement, agents must cooperate to have any hope of achieving decent localization. For more details on the problem, as well as simulations of several discrete learning rules, we refer the reader to our earlier paper Nedić et al (2015a), which deals with the case when the set is finite.
3 A variational approach to distributed Bayesian filtering
In this section, we make the observation that the posterior in Eq. (1) corresponds to an iteration of a first-order optimization algorithm, namely Stochastic Mirror Descent Beck and Teboulle (2003); Nedić and Lee (2014); Dai et al (2015); Rabbat (2015). Closely related variational interpretations of Bayes’ rule are well known, and in particular have been given in Zellner (1988); Walker (2006); Hill and Dall’Aglio (2012). The specific connection to Stochastic Mirror Descent has not, as far as we are aware, been noted before. This connection will serve to motivate the distributed learning method that is the main focus of the paper.
3.1 Bayes’ rule as Stochastic Mirror Descent
Suppose we want to solve the following optimization problem
where is an unknown true distribution and is a parametrized family of distributions (see Section 2). Here, denotes the Kullback-Leibler (KL) divergence between distributions and . (The KL divergence between distributions and , with dominated by , is defined to be .)
First note that we can rewrite Eq. (2) as
where is the set of all possible densities on the parameter space . Since the distribution does not depend on the parameter , it follows that
The equality in Eq. (3.1), where we exchange the order of the expectations, follows from the Fubini-Tonelli theorem. Clearly, if minimizes Eq. (2), then a distribution that puts all of its mass on minimizes Eq. (3.1).
The difficulty in evaluating the objective function in Eq. (3.1) lies in the fact that the distribution is unknown. A generic approach to such problems is to use stochastic approximation methods, in which the objective is minimized by constructing a sequence of gradient-based iterates where the true gradient of the objective (which is not available) is replaced with a gradient sample that is available at the given time.
A particular method that is relevant for the solution of stochastic programs of the form
for some random variable with unknown distribution, is the stochastic mirror descent method Juditsky et al (2008); Nedić and Lee (2014); Beck and Teboulle (2003); Lan et al (2012). The stochastic mirror descent approach constructs a sequence as follows:
for a realization of . Here, is the step-size, , and is a Bregman distance function associated with a distance-generating function , i.e.,
where is the Fréchet derivative of at in the direction of .
For Eq. (3.1), Stochastic Mirror Descent generates a sequence of densities , as follows:
If we choose as the distance-generating function, then the corresponding Bregman distance is the Kullback-Leibler (KL) divergence . Additionally, by selecting , the solution to the optimization problem in Eq. (4) can be computed explicitly, where for each ,
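The closed form referred to above can be sketched as follows: with the KL divergence as the Bregman distance and a unit step-size, the mirror-descent step reduces to a multiplicative update (written in placeholder notation of ours, with \(\pi_k\) the current density and \(\ell(\cdot\mid\theta)\) the likelihood of the new sample \(s_{k+1}\)):

```latex
\pi_{k+1}(\theta) \;=\; \frac{\pi_k(\theta)\,\ell(s_{k+1}\mid\theta)}
                              {\int_\Theta \pi_k(\vartheta)\,\ell(s_{k+1}\mid\vartheta)\,d\vartheta},
```

which is exactly Bayes' rule applied with \(\pi_k\) as the prior and \(s_{k+1}\) as the observation.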
3.2 Distributed Stochastic Mirror Descent
Now, consider the distributed problem where the network of agents want to collectively solve the following optimization problem
Recall that the distribution is unknown (though, of course, agents gain information about it by observing samples from and interacting with other agents), and that the family of distributions is private and available only to agent .
We propose the following algorithm as a distributed version of the stochastic mirror descent for the solution of problem Eq. (5):
with denoting the weight that agent assigns to the beliefs coming from its neighbor . Specifically, if or , and if . The optimization problem defining this update has a closed-form solution. In particular, the posterior density at each is given by
or equivalently, the belief on a measurable set of an agent at time is
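In one common notation for this family of algorithms (with \(\pi^i_k\) denoting agent \(i\)'s belief density, \(\ell^i\) its private likelihood model, and \(A_{ij}\) the mixing weights; these symbols are placeholders of ours), the closed-form density update reads:

```latex
\pi^i_{k+1}(\theta) \;=\;
\frac{\ell^i\!\left(s^i_{k+1}\mid\theta\right)\prod_{j=1}^{n}\pi^j_k(\theta)^{A_{ij}}}
     {\int_\Theta \ell^i\!\left(s^i_{k+1}\mid\vartheta\right)\prod_{j=1}^{n}\pi^j_k(\vartheta)^{A_{ij}}\,d\vartheta}.
```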
We state the correctness of this claim in the following proposition.
Next, we add and subtract the KL divergence between and the density to obtain
Now, from Eq. (7) it follows that
where is the corresponding normalizing constant.
We remark that the update in Eq. (7) can be viewed as a two-step process: first, every agent constructs an aggregate belief using a weighted geometric average of its own belief and the beliefs of its neighbors; then, each agent performs a Bayesian update using the aggregated belief as a prior. We note that similar arguments in the context of distributed optimization have been proposed in Rabbat (2015); Li et al (2016) for general Bregman distances. In the case when the number of hypotheses is finite, variations of this update rule were previously analyzed in Shahrampour et al (2016); Nedić et al (2015a); Lalitha et al (2015).
3.3 An example
Consider a group of agents connected over the network shown in Figure 2. A set of Metropolis weights for this network is given by the following matrix:
Furthermore, assume that each agent observes a Bernoulli random variable such that , , and . In this case, the parameter space is . Thus, the objective is to collectively find a parameter that best explains the joint observations, in the sense of the problem in Eq. (5), i.e.,
where , , and . One can see that the optimal solution is , either by determining it explicitly via the first-order optimality conditions or by exploiting the symmetry of the objective function.
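As a numerical sanity check of this example, the sketch below minimizes the sum of Bernoulli KL divergences over a grid of candidate parameters. The agent parameters 0.2, 0.5, 0.8 are hypothetical stand-ins (the concrete values used in the example are not reproduced here); for Bernoulli models, the minimizer of the summed KL objective is the average of the agents' parameters.

```python
import numpy as np

def bernoulli_kl(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q), for p, q in (0, 1)."""
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

# Hypothetical agent parameters (placeholders for the example's values)
p_agents = np.array([0.2, 0.5, 0.8])

# Grid search over the open interval (0, 1)
theta_grid = np.linspace(0.01, 0.99, 981)
objective = np.array([bernoulli_kl(p_agents, t).sum() for t in theta_grid])
theta_star = theta_grid[objective.argmin()]

print(theta_star)  # coincides with p_agents.mean()
```

The first-order condition of the summed objective gives the same answer analytically, which is why the symmetry argument in the text works.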
To summarize, we have given an interpretation of Bayes’ rule as an instance of Stochastic Mirror Descent. We have shown how this interpretation motivates a distributed update rule. In the next section, we discuss explicit forms of this update rule for parametric models coming from exponential families.
4 Cooperative Inference for Exponential Families
We begin with the observation that, for a general class of models , it is not clear whether the computation of the posterior beliefs is tractable. Indeed, computing involves solving an integral of the form
An entire area of research, known as variational Bayes approximation, is dedicated to efficiently approximating the integrals that appear in such contexts Fox and Roberts (2012); Beal (2003); Dai et al (2016).
The exponential family, for a parameter , is the set of probability distributions whose density can be represented as
for specific functions , , and , with . The function is usually referred to as the natural parameter.
When is used as a parameter itself, it is said that the distribution is in its canonical form. In this case, we can write the density as
with being the parameter.
Among the members of the exponential family one finds distributions such as the Normal, Poisson, Exponential, Gamma, Bernoulli, and Beta distributions Gelman et al (2014). In our case, we will take advantage of the existence of conjugate priors for all members of the exponential family. The definition of a conjugate prior is given below.
Assume that the prior distribution on a parameter space belongs to the exponential family. Then, the distribution is referred to as the conjugate prior for a likelihood function if the posterior distribution is in the same family as the prior.
Thus, if the belief density at some time is a conjugate prior for our likelihood model, then our belief at time will be of the same class as our prior. For example, if a likelihood function follows a Gaussian form, then having a Gaussian prior will produce a Gaussian posterior. This property simplifies the structure of the belief update procedure, since we can express the evolution of the beliefs generated by the proposed algorithm in Eq. (7) by the evolution of the natural parameters of the member of the exponential family it belongs to.
We now proceed to provide more details. First, the conjugate prior for a member of the exponential family can be written as
which is a distribution over the natural parameters , where and are the parameters of the conjugate prior. Then, it can be shown that the posterior distribution, given some observation , has the same exponential form as the prior with updated parameters as follows:
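The standard conjugate update for an exponential-family likelihood with sufficient statistic \(\phi(x)\) can be sketched as follows (placeholder symbols \(\chi, \nu\) of ours for the two conjugate parameters, with \(A(\eta)\) the log-partition function):

```latex
p(\eta \mid \chi, \nu) \;\propto\; \exp\!\left(\eta^{\top}\chi - \nu\, A(\eta)\right),
\qquad
\chi \;\leftarrow\; \chi + \phi(x),
\qquad
\nu \;\leftarrow\; \nu + 1 .
```

That is, observing one sample shifts the first parameter by the sufficient statistic and increments the second by one.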
On the other hand, for a set of priors from the same exponential family, weighted geometric averages also have a closed form in terms of the conjugate parameters.
Let be a set of distributions, all in the same class in the exponential family, i.e., for . Then, for a set of weights with for all , the probability distribution defined as
belongs to the same class in the exponential family with parameters and .
We write out the explicit geometric product and discard the constant terms:
The last line provides explicit values for the parameters of the new distribution. ∎
Assume that the belief density at time has an exponential form with natural parameters and for all , and that these densities are conjugate priors of the likelihood models . Then, the belief density at time , as computed in the update rule in Eq. (7), has the same form as the beliefs at time with the natural parameters given by
Proposition 3 simplifies the algorithm in Eq. (7) and facilitates its use in traditional estimation problems where members of the exponential family are used. We next illustrate this by discussing a number of distributed estimation problems with likelihood models coming from exponential families.
4.1 Distributed Poisson Filter
Consider an observation model where the agent signals follow Poisson distributions, i.e., for all . In this case, the optimization problem to be solved is
The conjugate prior of a Poisson likelihood model is the Gamma distribution. Thus, if at time the beliefs are given by for all , then the beliefs at time are , where
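The resulting parameter recursion can be sketched in a short simulation. Everything concrete below is an assumption for illustration: a three-agent network with doubly stochastic weights `A`, hypothetical rates `lam`, and Gamma(1, 1) initial beliefs. Each step mixes the conjugate parameters with `A` (the geometric-averaging step) and then adds the sufficient statistics of the new Poisson sample (the Bayesian step).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical doubly stochastic weights for a 3-agent network
A = np.array([[0.50, 0.25, 0.25],
              [0.25, 0.50, 0.25],
              [0.25, 0.25, 0.50]])
lam = np.array([2.0, 4.0, 6.0])   # hypothetical true Poisson rates

# Gamma(alpha, beta) beliefs, one pair of parameters per agent
alpha = np.ones(3)
beta = np.ones(3)

for k in range(2000):
    s = rng.poisson(lam)      # each agent draws its private sample
    alpha = A @ alpha + s     # mix conjugate parameters, then alpha += s
    beta = A @ beta + 1.0     # one Poisson observation per step: beta += 1

print(alpha / beta)  # posterior means; all agents settle near lam.mean()
```

Since the weights are doubly stochastic and the initial `beta` is uniform, `beta` grows deterministically by one per step, while the posterior means `alpha / beta` of all agents concentrate around the rate minimizing the summed KL objective, namely the average of the rates.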
4.2 Distributed Gaussian Filter with known variance
Assume each agent observes a signal of the form , where is finite and unknown, while , with , is known to agent . The optimization problem to be solved is
In this case, the likelihood models, the prior and the posterior are Gaussian. Thus, if the beliefs of the agents at time are Gaussian, i.e., for all , then their beliefs at time are also Gaussian. In particular, they are given by for all , with
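Concretely, combining Gaussian conjugacy with the precision-weighted form of the geometric average (a weighted geometric average of Gaussians is again Gaussian, with precision equal to the weighted sum of precisions), the parameter recursion can be sketched as follows, in placeholder notation of ours, with \(\sigma_i^2\) agent \(i\)'s known noise variance and \(\mu^i_k, (\sigma^i_k)^2\) its belief parameters:

```latex
\frac{1}{(\sigma^i_{k+1})^2} \;=\; \sum_{j=1}^{n} \frac{A_{ij}}{(\sigma^j_k)^2} + \frac{1}{\sigma_i^2},
\qquad
\mu^i_{k+1} \;=\; (\sigma^i_{k+1})^2
\left(\sum_{j=1}^{n} \frac{A_{ij}\,\mu^j_k}{(\sigma^j_k)^2} + \frac{s^i_{k+1}}{\sigma_i^2}\right).
```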
4.3 Distributed Gaussian Filter with unknown variance
In this case, the agents want to cooperatively estimate the value of a variance. Specifically, based on observations of the form , with , where is known and is unknown to agent , they want to solve the following problem:
We choose the Scaled Inverse Chi-Squared distribution (whose density function is defined for as ) as our prior, so that for all ; the beliefs at time are then given by for all , with
4.4 Distributed Gaussian Filter with unknown mean and variance
In the preceding examples, we have considered the cases where either the mean or the variance is known. Here, we will assume that both the mean and the variance are unknown and need to be estimated. Explicitly, we still have noisy observations , with , and want to solve
The Normal-Inverse-Gamma distribution serves as conjugate prior for the likelihood model over the parameters . Specifically, we assume that the beliefs at time are given by
Then, the beliefs at time will have a Normal-Inverse-Gamma distribution with the following parameters
5 Belief Concentration Rates
We now turn to the presentation of our main results, which concern the rate at which beliefs generated by the update rule in Eq. (7) concentrate around the true parameter . We will break our analysis into two cases. Initially, we will focus on the case when is a countable set, and will prove a concentration result for a ball containing the optimal hypothesis with finitely many hypotheses outside it. We will use this case to gently introduce our techniques. We will then turn to our main scenario of interest, namely when is a compact subset of . Our proof techniques use concentration arguments for beliefs on Hellinger balls from the recent work Birgé (2015), which, in turn, builds on the classic paper LeCam (1973).
We begin with two subsections focusing on background information, definitions, and assumptions.
5.1 Background: Hellinger Distance and Coverings
We equip the set of all probability distributions over the parameter set with the Hellinger distance, which for two probability distributions and is given by
Define an -Hellinger ball of radius centered at as
Additionally, when no center is specified, the ball is assumed to be centered at , i.e., .
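For reference, the standard definitions can be written as follows (placeholder notation of ours: \(p, q\) are the densities of \(P, Q\) with respect to a dominating measure \(\lambda\)):

```latex
h(P, Q) \;=\; \left(\frac{1}{2}\int \left(\sqrt{p}-\sqrt{q}\right)^{2} d\lambda\right)^{1/2},
\qquad
B_r(\theta) \;=\; \left\{\theta' \in \Theta \;:\; h\!\left(P_{\theta}, P_{\theta'}\right) \le r \right\}.
```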
Given an -Hellinger ball of radius , we will use the following notation for a covering of its complement . Specifically, we will express as a union of finitely many disjoint, concentric annuli. Let and let be a finite, strictly decreasing sequence such that and . Now, express the set as the union of the annuli generated by the sequence as
5.2 Background: Assumptions on Network and Mixing Weights
Naturally, we need some assumptions on the matrix . For one thing, the matrix has to be “compatible” with the underlying graph, in that information from node should not affect node if there is no edge from to in . At the other extreme, we want to rule out the possibility that is the identity matrix, which in terms of Eq. (7) would mean that nodes do not talk to their neighbors. Formally, we make the following assumption.
The graph and matrix are such that:
is doubly-stochastic with for if and only if .
has positive diagonal entries, for all .
The graph is connected.
Choosing weights that satisfy this assumption can be done in a distributed way, for example, by choosing the so-called “lazy Metropolis” matrix, which is a stochastic matrix given by
where is the degree (the number of neighbors) of node . Note that although the above formula only gives the off-diagonal entries of , it uniquely defines the entire matrix: the diagonal elements are determined by the stochasticity of . To choose the weights corresponding to a lazy Metropolis matrix, agents need to spend one additional round at the beginning of the algorithm broadcasting their degrees to their neighbors.
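The construction above can be sketched as follows. The helper builds the off-diagonal entries from the degrees of the two endpoints of each edge and fills the diagonal so that every row sums to one; by symmetry of the resulting matrix, the column sums are one as well, so the matrix is doubly stochastic with positive diagonal. The path graph used in the example is an assumption for illustration.

```python
import numpy as np

def lazy_metropolis(adj):
    """Lazy Metropolis weights for an undirected graph.

    adj: symmetric 0/1 adjacency matrix with no self-loops.
    Off-diagonal: A[i, j] = 1 / (2 * max(deg_i, deg_j)) for each edge (i, j);
    the diagonal makes each row sum to one.
    """
    adj = np.asarray(adj, dtype=float)
    deg = adj.sum(axis=1)
    n = adj.shape[0]
    A = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j and adj[i, j] > 0:
                A[i, j] = 1.0 / (2.0 * max(deg[i], deg[j]))
    A[np.diag_indices(n)] = 1.0 - A.sum(axis=1)  # stochasticity fixes the diagonal
    return A

# Illustration: a path graph on 4 nodes
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]])
A = lazy_metropolis(adj)
print(A)  # symmetric, doubly stochastic, positive diagonal
```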
Assumption 1 can be seen to guarantee that , where is the vector of all ones. We will use the following result, which provides a convergence rate for the difference , based on the results from Shahrampour et al (2016) and Nedić et al (2015a):
Let Assumption 1 hold. Then the matrix satisfies the following relation:
where with being the smallest positive entry of the matrix . Furthermore, if is a lazy Metropolis matrix associated with the graph , then .
5.3 Concentration for the Case of Countable Hypotheses
We now turn to proving a concentration result when the set of hypotheses is countable. We will consider the case of a ball in the Hellinger distance containing a countable number of hypotheses, including the correct one, and having only finitely many hypotheses outside it; we will show exponential convergence of beliefs to that ball. The purpose is to gently introduce the techniques we will use later in the case of a compact set of hypotheses.
In the case when the number of hypotheses is countable, the density update in Eq. (7) can be restated in a simpler form for discrete beliefs over the parameter space as
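The discrete form of the update can be sketched in a few lines: geometric averaging of the neighbors' beliefs becomes a weighted sum in log space, followed by adding the log-likelihood of the new sample and renormalizing. The two-agent, two-hypothesis numbers below are purely illustrative assumptions, not values from the paper.

```python
import numpy as np

def distributed_update(beliefs, A, log_lik):
    """One step of the discrete non-Bayesian update (a sketch of Eq. (7)).

    beliefs : (n_agents, n_hyp) array, each row a probability vector
    A       : (n_agents, n_agents) doubly stochastic weight matrix
    log_lik : (n_agents, n_hyp) array of log-likelihoods of each agent's
              current sample under each hypothesis
    """
    log_b = A @ np.log(beliefs) + log_lik       # geometric average + Bayes step
    log_b -= log_b.max(axis=1, keepdims=True)   # numerical stabilization
    b = np.exp(log_b)
    return b / b.sum(axis=1, keepdims=True)     # renormalize each row

# Illustration: two agents, two hypotheses, hypothetical likelihood values
beliefs = np.array([[0.5, 0.5], [0.5, 0.5]])
A = np.array([[0.5, 0.5], [0.5, 0.5]])
log_lik = np.log(np.array([[0.8, 0.2], [0.7, 0.3]]))
beliefs = distributed_update(beliefs, A, log_lik)
print(beliefs)
```

With uniform starting beliefs the averaging step is a no-op, so each agent's new belief is just its normalized likelihood; the mixing takes effect once the rows of `beliefs` differ.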
We will fix the radius , and our goal will be to prove a concentration result for a Hellinger ball of radius around the optimal hypothesis . We partition the complement of this ball, as described above, into annuli . We write for the number of hypotheses within annulus . We refer the reader to Figure 3, which shows a set of probability distributions, represented as black dots, with the true distribution represented by a star.
We will assume that the number of hypotheses outside the desired ball is finite.
The number of hypotheses outside is finite.
Additionally, we impose a bound on the separation between hypotheses, which rules out some pathological cases. The separation between hypotheses is defined in terms of the Hellinger affinity between two distributions and , given by
There exists an such that for any and .
With these assumptions in place, our first step is a lemma that bounds the concentration of log-likelihood ratios.
By Markov's inequality and Jensen's inequality, we have