Distributed Learning with Infinitely Many Hypotheses

05/06/2016 ∙ by Angelia Nedić, et al.

We consider a distributed learning setup where a network of agents sequentially access realizations of a set of random variables with unknown distributions. The network objective is to find a parametrized distribution that best describes their joint observations in the sense of the Kullback-Leibler divergence. In contrast with recent efforts in the literature, we analyze the case of countably many hypotheses and the case of a continuum of hypotheses. We provide non-asymptotic bounds for the concentration rate of the agents' beliefs around the correct hypothesis in terms of the number of agents, the network parameters, and the learning abilities of the agents. Additionally, we provide a novel motivation for a general set of distributed Non-Bayesian update rules as instances of the distributed stochastic mirror descent algorithm.


I Introduction

Sensor networks have attracted massive attention in recent years due to their extended range of applications and their ability to handle distributed sensing and processing for systems with inherently distributed sources of information, e.g., power networks, social, ecological and economic systems, surveillance, disaster management, health monitoring, etc. [1, 2, 3]. For such distributed systems, assuming complete communication between every source of information (e.g., nodes or local processing units) and a centralized processor can be cumbersome. Therefore, one might consider cooperation strategies where nodes with limited sensing capabilities distributively aggregate information to perform certain global estimation or processing tasks.

Following the seminal work of Jadbabaie et al. in [4, 5], there have been many studies of Non-Bayesian rules for distributed algorithms. Non-Bayesian algorithms involve an aggregation step, usually consisting of a weighted geometric or arithmetic average of the received beliefs, and a Bayesian update that is based on the locally available data. Therefore, one can exploit results from the consensus literature [6, 7, 8, 9, 10] and the Bayesian learning literature [11, 12]. Recent studies have proposed several variations of the Non-Bayesian approach and have proved consistent, geometric and non-asymptotic convergence rates for a general class of distributed algorithms: from asymptotic analysis [13, 14, 15, 16, 17, 18] to non-asymptotic bounds [19, 20, 21], time-varying directed graphs [22], and transmission and node failures [23].

In contrast with the existing results that assume a finite hypothesis set, in this paper we extend the framework to the cases of countably many hypotheses and of a continuum of hypotheses. We build upon the work in [24] on non-asymptotic behaviors of Bayesian estimators to construct non-asymptotic concentration results for distributed learning. In the distributed case, the observations are scattered among a set of nodes or agents, and the learning algorithm should guarantee that every node in the network learns the correct parameter as if it had access to the complete data set. Our results show that, in general, the network structure induces a transient time after which all agents learn at a network-independent rate, and the rate is geometric.

The contributions of this paper are threefold. First, we provide an interpretation of a general class of distributed Non-Bayesian algorithms as specific instances of a distributed version of the stochastic mirror descent method. This motivates the proposed update rules and makes a connection between the Non-Bayesian learning literature in social networks and the stochastic approximations literature. Second, we establish a non-asymptotic concentration result for the proposed learning algorithm when the set of hypotheses is countably infinite. Finally, we provide a non-asymptotic bound for the algorithm when the hypothesis set is a bounded subset of $\mathbb{R}^d$. This is an initial approach to the analysis of distributed Non-Bayesian algorithms for a more general family of hypothesis sets.

This paper is organized as follows. Section II describes the studied problem. Section III presents the proposed algorithm, together with the motivation behind the proposed update rule and its connections with the distributed stochastic mirror descent algorithm. Section IV and Section V provide the non-asymptotic concentration rate results for the beliefs around the correct hypothesis set for the cases of countably many and a continuum of hypotheses, respectively. Finally, conclusions are presented in Section VI.

Notation: The set $B^c$ denotes the complement of a set $B$. The notation $\mathbb{P}_f$ and $\mathbb{E}_f$ denotes the probability measure and the expectation under a distribution $f$. The $(i,j)$-th entry of a matrix $A$ is denoted by $[A]_{ij}$ or $a_{ij}$. Random variables are denoted with upper-case letters, while the corresponding lower-case letters denote their realizations. Time indices are indicated by subscripts and the letter $k$. Superscripts represent the agent indices, which are usually $i$ or $j$.

II Problem Formulation

We consider the problem of distributed non-Bayesian learning, where a network of agents accesses sequences of realizations of a random variable with an unknown distribution. The random variable is assumed to be of finite dimension, with the constraint that each agent can access only a strict subset of the entries of the realizations (e.g., an $n$-dimensional vector with $n$ agents, each observing a single entry). Observations are assumed to be independent among the agents. We are interested in situations where no single agent has the ability to learn the underlying distribution from its own observations, while collectively the agents can do so if they collaborate. The learning objective is for the agents to jointly agree on a distribution (from a parametrized family of distributions or a hypothesis set) that best describes the observations in a specific sense (e.g., in Kullback-Leibler divergence). Therefore, the distributed learning objective requires collaboration among the agents, which can be ensured by using some protocols for information aggregation and coordination. Specifically, in our case, agent coordination consists of sharing their estimates (beliefs) of the best probability distribution over the hypothesis set.

Consider, for example, the distributed source localization problem with limited sensing capabilities [25, 26]. In this scenario, a network of agents receives noisy measurements of the distance to a source, where the sensing capabilities of each sensor might be limited to a certain region. The group objective is to jointly identify the location of the source so that every node knows the source location. Figure 1 shows an example, where a group of agents (circles) wants to localize a source (star). There is an underlying graph that indicates the communication abilities among the nodes. Moreover, each node has a sensing region indicated by the dashed line around it. Each agent $i$ obtains realizations of a random variable that depends on the source location $\theta^*$, the position $x^i$ of agent $i$, and a noise in the observations. If we consider $\Theta$ as the set of all possible locations of the source, then each $\theta \in \Theta$ induces a probability distribution on the observations of each agent. Therefore, agents need to cooperate and share information in order to guarantee that all of them correctly localize the target. A small simulation sketch of such an observation model is given after Figure 1.

Fig. 1: Distributed source localization example
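To make the observation model concrete, the following Python sketch simulates noisy, range-limited distance measurements. The Gaussian noise model, the sensing-radius cutoff, and all names (observe, sensing_radius, etc.) are illustrative assumptions, not the exact setup of [25, 26].

```python
import numpy as np

# Hypothetical sketch of the source-localization observation model described
# above: agent i at position x_i receives a noisy distance measurement to a
# source at theta_star, but only when the source lies inside its sensing
# radius (limited sensing capability).

rng = np.random.default_rng(0)

theta_star = np.array([2.0, 3.0])            # true (unknown) source location
agent_positions = rng.uniform(0, 5, (4, 2))  # 4 agents in a 5x5 region
sensing_radius = 3.0
noise_std = 0.5

def observe(agent_pos):
    """One noisy distance observation, or None if the source is out of range."""
    dist = np.linalg.norm(theta_star - agent_pos)
    if dist > sensing_radius:
        return None
    return dist + noise_std * rng.standard_normal()

observations = [observe(p) for p in agent_positions]
print(observations)
```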

We will consider a more general learning problem, where the agents' observations are drawn from an unknown joint distribution $f = \prod_{i=1}^{n} f^i$, where $f^i$ is the distribution governing the observations of agent $i$. We assume that $f$ is an element of $\mathcal{P}$, the space of all joint probability measures for a set of independent random variables $\{X^i\}_{i=1}^{n}$ (i.e., each $X^i$ is distributed according to an unknown distribution $f^i$). Also, we assume that each $X^i$ takes values in a finite set. When these random variables are considered at time $k$, we denote them by $X^i_k$.

Later on, for the case of countably many hypotheses, we will use the pre-metric space $(\mathcal{P}, D_{KL})$, i.e., the space $\mathcal{P}$ equipped with the Kullback-Leibler divergence. This generates a topology in which we can define an open ball of radius $\rho$ centered at a point $f$ by $B_\rho(f) = \{g \in \mathcal{P} : D_{KL}(f\|g) < \rho\}$. When the set of hypotheses is continuous, we instead equip $\mathcal{P}$ with the Hellinger distance $h$ to obtain the metric space $(\mathcal{P}, h)$, which we use to construct a special covering of subsets consisting of $\epsilon$-separated sets. Both quantities are easy to compute for discrete distributions, as sketched below.
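As a quick reference, this minimal Python sketch computes the two discrepancy measures for finite probability vectors and tests membership in a KL ball; the distributions and radius are arbitrary illustrative choices, and overlapping supports are assumed so the KL divergence is finite.

```python
import numpy as np

def kl_divergence(f, g):
    """D_KL(f || g) for discrete distributions f, g."""
    f, g = np.asarray(f, float), np.asarray(g, float)
    mask = f > 0
    return float(np.sum(f[mask] * np.log(f[mask] / g[mask])))

def hellinger(f, g):
    """Hellinger distance h(f, g) = (1/sqrt(2)) * ||sqrt(f) - sqrt(g)||_2."""
    f, g = np.asarray(f, float), np.asarray(g, float)
    return float(np.linalg.norm(np.sqrt(f) - np.sqrt(g)) / np.sqrt(2))

f = [0.5, 0.3, 0.2]
g = [0.4, 0.4, 0.2]
rho = 0.05
print(kl_divergence(f, g), hellinger(f, g))
print("g in KL ball of radius rho around f:", kl_divergence(f, g) < rho)
```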

Each agent $i$ constructs a set of hypotheses parametrized by $\theta \in \Theta$ about the distribution $f^i$. Let $\{\ell^i(\cdot|\theta)\}_{\theta \in \Theta}$ be a parametrized family of probability measures for $X^i$, with densities with respect to a dominating measure (a measure $\mu$ is dominated by $\lambda$ if $\lambda(B) = 0$ implies $\mu(B) = 0$ for every measurable set $B$). Therefore, the learning goal is to distributively solve the following problem:

$\min_{\theta \in \Theta} \sum_{i=1}^{n} D_{KL}\big(f^i \,\|\, \ell^i(\cdot|\theta)\big)$   (1)

where $D_{KL}\big(f^i \,\|\, \ell^i(\cdot|\theta)\big)$ is the Kullback-Leibler divergence between the true distribution of $X^i$ and the distribution $\ell^i(\cdot|\theta)$ that would have been seen by agent $i$ if hypothesis $\theta$ were correct. For simplicity, we will assume that there exists a single $\theta^* \in \Theta$ such that $\ell^i(\cdot|\theta^*) = f^i$ almost everywhere for all agents. The results readily extend to the case when this assumption does not hold (see, for example, [27, 20, 22], which disregard this assumption).

The problem in Eq. (1) consists of finding the parameter $\theta^*$ that minimizes the Kullback-Leibler divergence to $f$. However, $X^i$ is only available to agent $i$, and the distribution $f^i$ is unknown. Agent $i$ gains information on $f^i$ by observing realizations $x^i_k$ of $X^i$ at every time step $k$. The agent uses these observations to construct a sequence of probability distributions over the parameter space $\Theta$. We refer to these distributions as agent beliefs, where $\mu^i_k(B)$ denotes the belief, at time $k$, that agent $i$ has about the event $\theta^* \in B$ for a measurable set $B \subseteq \Theta$.

We make use of the following assumption.

Assumption 1

For all agents $i$ we have:

  1. There is a unique hypothesis $\theta^*$ such that $\ell^i(\cdot|\theta^*) = f^i$ almost everywhere.

  2. If $f^i(x^i) > 0$, then there exists an $\alpha > 0$ such that $\ell^i(x^i|\theta) > \alpha$ for all $\theta \in \Theta$.

Assumption 1(a) guarantees that we are working in the realizable case and that there are no conflicting models among the agents; see [27, 20, 22] for ways to remove this assumption. Moreover, the lower bound in Assumption 1(b) implies that the hypothesis set is dominated by $f^i$ (i.e., our hypothesis set is absolutely continuous with respect to the true distribution of the data) and provides a way to show bounded differences when applying the concentration inequality results.

Agents are connected in a network $\mathcal{G} = (V, E)$, where $V = \{1, \dots, n\}$ is the set of agents and $E$ is a set of undirected edges, with $(i,j) \in E$ if agents $i$ and $j$ can communicate with each other. If two agents are connected, they share their beliefs over the hypothesis set at every time instant $k$. We will propose a distributed protocol to define how the agents update their beliefs based on their local observations and the beliefs received from their neighbors. Additionally, each agent weights its own belief and the beliefs of its neighbors; we will use $a_{ij}$ to denote the weight that agent $i$ assigns to beliefs coming from its neighbor $j$, and $a_{ii}$ to denote the weight that the agent assigns to its own beliefs. The assumption of static undirected links in the network is made for simplicity of exposition. The extensions of the proposed protocol to the more general cases of time-varying undirected and directed graphs can be done similarly to the work in [27, 20, 22].

Next we present the set of assumptions on the network over which the agents are interacting.

Assumption 2

The graph $\mathcal{G}$ and the matrix $A = [a_{ij}]$ are such that:

  1. $A$ is doubly stochastic with $a_{ij} > 0$ if $(i,j) \in E$.

  2. If $a_{ij} > 0$ for some $i \neq j$, then $(i,j) \in E$.

  3. $A$ has positive diagonal entries, $a_{ii} > 0$ for all $i$.

  4. If $a_{ij} > 0$, then $a_{ij} \geq \eta$ for some positive constant $\eta$.

  5. The graph $\mathcal{G}$ is connected.

Assumption 2 is common in the distributed optimization and consensus literature. It guarantees convergence of the associated Markov chain and defines bounds on relevant eigenvalues in terms of the number $n$ of agents. To construct a set of weights satisfying Assumption 2, for example, one can consider a lazy Metropolis (stochastic) matrix of the form $A = \frac{1}{2}I + \frac{1}{2}\hat{A}$, where $I$ is the identity matrix and $\hat{A}$ is a stochastic matrix whose off-diagonal entries satisfy $\hat{a}_{ij} = 1/(1 + \max\{d_i, d_j\})$ if $(i,j) \in E$ and $\hat{a}_{ij} = 0$ otherwise, where $d_i$ is the degree (the number of neighbors) of node $i$; this construction is illustrated in the sketch below. Generalizations of Assumption 2 to time-varying undirected graphs are readily available for weighted averaging and push-sum approaches [28, 10, 9].
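The following Python sketch constructs such a lazy Metropolis matrix from an edge list and checks double stochasticity; the 4-node ring graph is an arbitrary example.

```python
import numpy as np

# Lazy Metropolis weights A = (1/2) I + (1/2) A_hat for an undirected graph,
# with off-diagonal entries a_hat_ij = 1 / (1 + max(d_i, d_j)) on edges.
# The resulting A is doubly stochastic with positive diagonal, matching
# Assumption 2 for a connected graph.

def lazy_metropolis(edges, n):
    deg = np.zeros(n, int)
    for i, j in edges:
        deg[i] += 1
        deg[j] += 1
    a_hat = np.zeros((n, n))
    for i, j in edges:
        w = 1.0 / (1.0 + max(deg[i], deg[j]))
        a_hat[i, j] = a_hat[j, i] = w
    np.fill_diagonal(a_hat, 1.0 - a_hat.sum(axis=1))  # make rows sum to 1
    return 0.5 * np.eye(n) + 0.5 * a_hat

A = lazy_metropolis([(0, 1), (1, 2), (2, 3), (3, 0)], n=4)
assert np.allclose(A.sum(axis=0), 1) and np.allclose(A.sum(axis=1), 1)
print(A)
```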

III Distributed Learning Algorithm

In this section, we present the proposed learning algorithm and a novel connection between the Bayesian update and the stochastic mirror descent method. We propose the following (theoretical) algorithm, where each node updates its beliefs on a measurable subset $B \subseteq \Theta$ according to the following update rule: for all agents $i$ and all $k \geq 0$,

$\mu^i_{k+1}(B) = \frac{1}{Z^i_{k+1}} \int_B \prod_{j=1}^{n} \left(\frac{d\mu^j_k}{d\mu}(\theta)\right)^{a_{ij}} \ell^i\big(x^i_{k+1}|\theta\big) \, d\mu(\theta)$   (2)

where $Z^i_{k+1}$ is a normalizing constant and $\mu$ is a probability distribution on $\Theta$ with respect to which every $\mu^j_k$ is absolutely continuous. The term $\frac{d\mu^j_k}{d\mu}$ is the Radon-Nikodym derivative of the probability distribution $\mu^j_k$. The above process starts with some initial beliefs $\mu^i_0$, $i = 1, \dots, n$. Note that, if $\Theta$ is a finite or a countable set, the update rule in Eq. (2) reduces to: for every $\theta \in \Theta$,

$\mu^i_{k+1}(\theta) = \frac{1}{Z^i_{k+1}} \prod_{j=1}^{n} \mu^j_k(\theta)^{a_{ij}} \, \ell^i\big(x^i_{k+1}|\theta\big)$   (3)

The updates in Eqs. (2) and (3) can be viewed as two-step processes. First, every agent constructs an aggregate belief using a weighted geometric average of its own belief and the beliefs of its neighbors. Then, each agent performs a Bayes' update using the aggregated belief as a prior. A simulation sketch of this two-step update is given below.
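The following Python sketch simulates the two-step update of Eq. (3) for a toy problem with two agents and three hypotheses; the Bernoulli likelihood models and the weight matrix are illustrative assumptions.

```python
import numpy as np

# Sketch of Eq. (3) for a finite hypothesis set:
# (i) weighted geometric averaging of neighbors' beliefs (in log space),
# (ii) a Bayesian update with the local likelihood of the new observation.

rng = np.random.default_rng(1)

# Agent i's local likelihoods l^i(x | theta) for x in {0, 1}:
likelihoods = np.array([
    [[0.7, 0.3], [0.5, 0.5], [0.2, 0.8]],   # agent 0, thetas 0..2
    [[0.6, 0.4], [0.5, 0.5], [0.3, 0.7]],   # agent 1
])
true_theta = 2
A = np.array([[0.5, 0.5], [0.5, 0.5]])      # doubly stochastic weights
beliefs = np.full((2, 3), 1 / 3)            # uniform initial beliefs

for k in range(200):
    # Each agent observes a Bernoulli sample drawn under the true hypothesis.
    x = (rng.random(2) < likelihoods[[0, 1], true_theta, 1]).astype(int)
    log_agg = A @ np.log(np.clip(beliefs, 1e-300, None))  # geometric average
    for i in range(2):
        log_post = log_agg[i] + np.log(likelihoods[i, :, x[i]])
        log_post -= log_post.max()                        # normalize stably
        beliefs[i] = np.exp(log_post) / np.exp(log_post).sum()

print(np.round(beliefs, 3))   # mass concentrates on theta = 2
```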

III-A Connection with the Stochastic Mirror Descent Method

To make this connection (which is particularly simple when $\Theta$ is a finite set), we observe that the optimization problem in Eq. (1) is equivalent to the following problem:

$\min_{\pi \in \mathcal{P}(\Theta)} \mathbb{E}_\pi \left[ \sum_{i=1}^{n} D_{KL}\big(f^i \,\|\, \ell^i(\cdot|\theta)\big) \right]$

where $\mathcal{P}(\Theta)$ is the set of all distributions on $\Theta$. Under some technical conditions the expectations can exchange the order, so the problem in Eq. (1) is equivalent to the following one:

$\min_{\pi \in \mathcal{P}(\Theta)} \sum_{i=1}^{n} \mathbb{E}_{f^i} \Big[ \mathbb{E}_\pi \big[ -\log \ell^i(X^i|\theta) \big] \Big]$   (4)

The difficulty in evaluating the objective function in Eq. (4) lies in the fact that the distributions $f^i$ are unknown. A generic approach to solving such problems is the class of stochastic approximation methods, where the objective is minimized by constructing a sequence of gradient-based iterates in which the true gradient of the objective (which is not available) is replaced with a gradient sample that is available at the given update time. A particular method that is relevant here is the stochastic mirror descent method, which would solve the problem in Eq. (4), in a centralized fashion, by constructing a sequence $\{\pi_k\}$ as follows:

$\pi_{k+1} = \arg\min_{\pi \in \mathcal{P}(\Theta)} \left\{ \langle g_k, \pi \rangle + \frac{1}{\alpha_k} D_w(\pi, \pi_k) \right\}$   (5)

where $g_k$ is a noisy realization of the gradient of the objective function in Eq. (4), $D_w$ is a Bregman distance function associated with a distance-generating function $w$, and $\alpha_k$ is the step-size. If we take the negative entropy $w(\pi) = \sum_\theta \pi(\theta) \log \pi(\theta)$ as the distance-generating function, then the corresponding Bregman distance is the Kullback-Leibler (KL) divergence $D_w(\pi, \pi_k) = D_{KL}(\pi \| \pi_k)$. Thus, in this case, the update rule in Eq. (2) corresponds to a distributed implementation of the stochastic mirror descent algorithm in Eq. (5), where agent $i$ uses the sample gradient $g_k = -\log \ell^i(x^i_{k+1}|\cdot)$ and the stepsize is fixed, i.e., $\alpha_k = 1$ for all $k$. A small numerical check of this equivalence is sketched below.
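The following Python sketch numerically verifies, on a three-point hypothesis set, that one entropic mirror descent step with unit stepsize coincides with the multiplicative (Bayes-type) update; the specific prior and likelihood values are arbitrary illustrative choices.

```python
import numpy as np
from scipy.optimize import minimize

# One stochastic mirror descent step with KL Bregman distance, unit stepsize:
#   pi_{k+1} = argmin_pi { <g_k, pi> + D_KL(pi || pi_k) }
# has the closed form pi_{k+1} ∝ pi_k * exp(-g_k), i.e., a Bayes-type
# multiplicative update when g_k = -log l(x | theta).

pi_k = np.array([0.2, 0.5, 0.3])
g_k = -np.log(np.array([0.7, 0.4, 0.9]))   # g_k = minus log-likelihoods

# Closed-form (multiplicative / Bayes) update:
closed = pi_k * np.exp(-g_k)
closed /= closed.sum()

# Direct numerical minimization over the simplex via softmax parametrization:
def objective(z):
    pi = np.exp(z - z.max())
    pi /= pi.sum()
    kl = np.sum(pi * np.log(pi / pi_k))
    return float(g_k @ pi + kl)

res = minimize(objective, np.zeros(3), method="Nelder-Mead",
               options={"xatol": 1e-10, "fatol": 1e-12})
pi_num = np.exp(res.x - res.x.max())
pi_num /= pi_num.sum()
print(np.round(closed, 6), np.round(pi_num, 6))   # should agree closely
```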

We summarize the preceding discussion in the following proposition.

Proposition 1

The update rule in Eq. (2) defines a probability measure over the set $\Theta$, generated by a probability density that coincides with the solution of the distributed stochastic mirror descent algorithm applied to the optimization problem in Eq. (1), i.e.,

$\frac{d\mu^i_{k+1}}{d\mu} = \arg\min_{\pi \in \mathcal{P}(\Theta)} \left\{ \mathbb{E}_\pi \big[ -\log \ell^i(x^i_{k+1}|\theta) \big] + \sum_{j=1}^{n} a_{ij} D_{KL}\big(\pi \,\|\, \mu^j_k\big) \right\}$   (6)
Proof:

We need to show that the density associated with the probability measure defined by Eq. (2) minimizes the problem in Eq. (6).

First, denote the objective in Eq. (6) by $J(\pi)$, and add and subtract the KL divergence between $\pi$ and the density $d\mu^i_{k+1}/d\mu$ to obtain a decomposition of $J(\pi)$.

Now, we use the relation for the density $d\mu^i_{k+1}/d\mu$ implied by the update rule in Eq. (2). The first term in the resulting decomposition does not depend on the distribution $\pi$, while the second term is the KL divergence between $\pi$ and $\mu^i_{k+1}$, which is minimized when the two distributions coincide. Thus, we conclude that the solution to the problem in Eq. (6) is the density $d\mu^i_{k+1}/d\mu$ (almost everywhere). ∎

IV Countable Hypothesis Set

In this section, we present a concentration result for the update rule in Eq. (3), specific to the case of a countable hypothesis set $\Theta$. Later, in Section V, we will analyze the case of a continuum of hypotheses.

First, we provide some definitions that will help us build the desired results. Specifically, we will study how the beliefs of all agents concentrate around the true hypothesis $\theta^*$.

Definition 1

Define a Kullback-Leibler (KL) ball of radius $\rho$ centered at $f$ as $B_\rho(f) = \{g \in \mathcal{P} : D_{KL}(f \,\|\, g) \leq \rho\}$.

Definition 2

Define the covering of the set $B_\rho(f)^c$ generated by a strictly increasing sequence $\{\rho_m\}_{m \geq 1}$ with $\rho_1 = \rho$ as the union of disjoint KL bands, as follows:

$B_\rho(f)^c = \bigcup_{m=1}^{\infty} \hat{B}_m, \qquad \hat{B}_m = B_{\rho_{m+1}}(f) \setminus B_{\rho_m}(f),$

where $B_{\rho_{m+1}}(f) \setminus B_{\rho_m}(f)$ denotes the set difference between the set $B_{\rho_{m+1}}(f)$ and the set $B_{\rho_m}(f)$. We denote the number of hypotheses in the $m$-th band by $N_m$, i.e., $N_m = |\{\theta \in \Theta : \ell(\cdot|\theta) \in \hat{B}_m\}|$.

We are interested in bounding the beliefs' concentration on a ball $B_\rho(f)$ for an arbitrary $\rho > 0$, which is based on a covering of the complement set $B_\rho(f)^c$. To this end, Definitions 1 and 2 provide the tools for constructing such a covering. The strategy is to analyze how the hypotheses are distributed in the space of probability distributions; see Figure 2. The next assumption provides conditions on the hypothesis set which guarantee the concentration results.


Fig. 2: Creating a covering for a ball $B_\rho(f)$. The star represents the correct hypothesis $\theta^*$, the dots indicate the locations of other hypotheses, and the dashed lines indicate the boundaries of the balls $B_{\rho_m}(f)$.
Assumption 3

The series $\sum_{m=1}^{\infty} N_m \exp\left(-\rho_m^2\right)$ converges, where the sequences $\{\rho_m\}$ and $\{N_m\}$ are as in Definition 2.

We are now ready to state the main result for a countable hypothesis set $\Theta$.

Theorem 1

Let Assumptions 1, 2 and 3 hold, and let $\delta \in (0,1)$ be a desired probability tolerance. Then, the belief sequences $\{\mu^i_k\}$, $i = 1, \dots, n$, generated by the update rule in Eq. (3), with initial beliefs such that $\mu^i_0(\theta^*) > 0$ for all $i$, have the following property: for any $k \geq N$ and any radius $\rho > 0$, with probability $1 - \delta$,

where $N = \max\{N_1, N_2\}$, with the index sets $\mathcal{N}_1$ and $\mathcal{N}_2$ (whose smallest elements are $N_1$ and $N_2$) given by

where $\alpha$ is as in Assumption 1(b), $\{\rho_m\}$ and $\{N_m\}$ are as in Definition 2, while $\lambda \in (0,1)$ is the network constant of Lemma 1. If $A$ is a lazy Metropolis matrix, then $\lambda = 1 - 1/O(n^2)$.

Observe that if $k \in \mathcal{N}_1$, then $k' \in \mathcal{N}_1$ for all $k' \geq k$, and the same is true for the set $\mathcal{N}_2$, so we can alternatively write $N = \min\{k : k \in \mathcal{N}_1 \cap \mathcal{N}_2\}$.

Further, note that $N$ depends on the radius $\rho$ of the KL ball, as the set $\mathcal{N}_1$ involves $\{\rho_m\}$ and $\{N_m\}$, which both depend on $\rho$, while the set $\mathcal{N}_2$ explicitly involves $\rho$. Finally, note that the smaller the radius $\rho$, the larger $N$ is. We see that $N$ also depends on the number of agents, the learning parameter $\alpha$, the learning capabilities of the network represented by $\lambda$, the initial beliefs $\mu^i_0(\theta^*)$, the number of hypotheses that are far away from $\theta^*$, and their probability distributions.

Theorem 1 states that the beliefs of all agents will concentrate within the KL ball of radius $\rho$ for a large enough index $k$, i.e., $k \geq N$. Note that the (large enough) index $N$ is determined as the smallest $k$ for which two relations are satisfied, namely, the relations defining the index sets $\mathcal{N}_1$ and $\mathcal{N}_2$. The set $\mathcal{N}_1$ contains all indices $k$ for which a weighted sum of the total mass of the far-away hypotheses is small enough (smaller than the desired probability tolerance $\delta$). Specifically, we require that the number of hypotheses in the $m$-th band does not grow too fast relative to the squared radius of the band, i.e., the wrong hypotheses should not accumulate too fast far away from the true hypothesis $\theta^*$. Moreover, the condition in $\mathcal{N}_1$ also prevents having an infinite number of hypotheses per band. The set $\mathcal{N}_2$ captures the iterations at which, for all agents, the current beliefs have recovered from the cumulative effect of "wrong" initial beliefs that had given probability masses to hypotheses far away from $\theta^*$.

In the proof of Theorem 1, we use the relation between the posterior beliefs and the initial beliefs on a measurable set $B$ such that $\theta^* \in B$. For such a set, unrolling the update rule in Eq. (3), we have

$\mu^i_k(B) = \frac{1}{Z^i_k} \sum_{\theta \in B} \prod_{j=1}^{n} \mu^j_0(\theta)^{[A^k]_{ij}} \prod_{t=1}^{k} \prod_{j=1}^{n} \ell^j\big(x^j_t|\theta\big)^{[A^{k-t}]_{ij}}$   (7)

where $Z^i_k$ is the appropriate normalization constant. Furthermore, after a few algebraic operations we obtain

$\mu^i_k(B^c) \leq \frac{\mu^i_k(B^c)}{\mu^i_k(\theta^*)} = \sum_{\theta \in B^c} \prod_{j=1}^{n} \left(\frac{\mu^j_0(\theta)}{\mu^j_0(\theta^*)}\right)^{[A^k]_{ij}} \exp\left(\sum_{t=1}^{k} \sum_{j=1}^{n} [A^{k-t}]_{ij} \log \frac{\ell^j(x^j_t|\theta)}{\ell^j(x^j_t|\theta^*)}\right)$   (8)

Moreover, since $\mu^j_0(\theta^*) > 0$ for all $j$, it follows that

$\mu^i_k(B^c) \leq \frac{1}{\min_j \mu^j_0(\theta^*)} \sum_{\theta \in B^c} \exp\left(\sum_{t=1}^{k} \sum_{j=1}^{n} [A^{k-t}]_{ij} \log \frac{\ell^j(x^j_t|\theta)}{\ell^j(x^j_t|\theta^*)}\right)$   (9)

Now we will state a useful result from [19], which will allow us to bound the right-hand term of Eq. (8).

Lemma 1

[Lemma 2 in [19]] Let Assumption 2 hold. Then the matrix $A$ satisfies, for all $k \geq 0$ and all $i, j$,

$\left| [A^k]_{ij} - \frac{1}{n} \right| \leq C \lambda^k,$

where $C > 0$ and $\lambda \in (0,1)$; moreover, if $A$ is a lazy Metropolis matrix associated with $\mathcal{G}$, then $\lambda = 1 - 1/O(n^2)$.

It follows from Eq. (9), Lemma 1 and Assumption 1 that

$\mu^i_k(B^c) \leq \frac{1}{\min_j \mu^j_0(\theta^*)} \sum_{\theta \in B^c} \exp\left(\frac{1}{n} \sum_{t=1}^{k} \sum_{j=1}^{n} \log \frac{\ell^j(x^j_t|\theta)}{\ell^j(x^j_t|\theta^*)} + \frac{nC}{1-\lambda} \log \frac{1}{\alpha}\right)$   (10)

for all $k \geq 0$, where $\lambda$ is as defined in Theorem 1. Next, we provide an auxiliary result about the concentration properties of the beliefs on a set $B$.

Lemma 2

For any $k \geq 1$ it holds that

where $C$ and $\lambda$ are as in Theorem 1, and $\{\rho_m\}$ and $\{N_m\}$ are as in Definition 2.

Proof:

First, define the following random variable:

Then, by using the union bound and McDiarmid's inequality, we have

and by setting the deviation parameter appropriately, it follows that

It can be seen that the bounded-differences condition holds (by Assumption 1(b)), thus yielding

Now, we let the set $B$ be the KL ball of radius $\rho$ centered at $f$ and follow Definition 2 to exploit the representation of $B^c$ as a union of KL bands, for which we obtain

thus completing the proof. ∎
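As a side illustration of the concentration tool used in the proof above, the following Python sketch compares the empirical tail of an average of bounded log-likelihood ratios with the bound given by McDiarmid's inequality; the Bernoulli setup and all numeric values are illustrative assumptions.

```python
import numpy as np

# McDiarmid's inequality: for a function of independent variables with
# bounded differences c_t,
#   P(|f - E f| > eps) <= 2 exp(-2 eps^2 / sum_t c_t^2).
# Here f is an average of k log-likelihood ratios bounded thanks to a
# lower bound alpha on the likelihoods (as in Assumption 1(b)).

rng = np.random.default_rng(4)

k, trials, eps = 200, 20000, 0.1
# Log-likelihood ratios take values in [log(0.25/0.75), log(0.75/0.25)]:
vals = np.log(np.array([0.25, 0.75]) / np.array([0.75, 0.25]))
c = (vals.max() - vals.min()) / k       # bounded difference per sample

samples = rng.integers(0, 2, (trials, k))
means = vals[samples].mean(axis=1)
deviation = np.abs(means - means.mean())

empirical = (deviation > eps).mean()
mcdiarmid = 2 * np.exp(-2 * eps**2 / (k * c**2))
print(f"empirical tail: {empirical:.4f}  McDiarmid bound: {mcdiarmid:.4f}")
```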

We are now ready to prove Theorem 1.

Proof:

From Lemma 2, where we take $k$ large enough to ensure the desired probability, it follows that with probability $1 - \delta$ we have, for all $i$,

where the last inequality follows from Eq. (10) when we further take $k$ sufficiently large. ∎

V Continuum of Hypotheses

In this section, we provide the concentration results for a continuous hypothesis set $\Theta$. First, we present some definitions that we use in constructing coverings, analogously to those in Section IV. In this case, however, we employ the Hellinger distance.

Definition 3

Define a Hellinger ball of radius $r$ centered at $f$ as $B_r(f) = \{g \in \mathcal{P} : h(f, g) \leq r\}$.

Definition 4

Let $(\mathcal{P}, h)$ be a metric space. A subset $S \subseteq \mathcal{P}$ is called $\epsilon$-separated, with $\epsilon > 0$, if $h(f, g) \geq \epsilon$ for any distinct $f, g \in S$. Moreover, for a set $K \subseteq \mathcal{P}$, let $N(\epsilon, K)$ be the smallest number of Hellinger balls with centers in $K$ and radius $\epsilon$ needed to cover the set $K$, i.e., the smallest number of such balls whose union contains $K$.

Definition 5

Let $\{\epsilon_m\}$ be a strictly decreasing sequence such that $\epsilon_m \to 0$. Define the covering of the set $\Theta$ generated by the sequence $\{\epsilon_m\}$ as follows:

where $m_0$ is the smallest $m$ such that the corresponding band is nonempty. Moreover, given a positive sequence $\{\epsilon_m\}$, we denote by $S_m$ the maximal $\epsilon_m$-separated subset of the $m$-th set in the covering and denote its cardinality by $N_m$, i.e., $N_m = |S_m|$. Therefore, we have the following covering of $\Theta$,

where each set is covered by Hellinger balls centered at the points of its maximal separated subset.
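Definitions 4 and 5 rely on maximal $\epsilon$-separated subsets, which can be built greedily: scan the family and keep a point only if it is at least $\epsilon$ away (in Hellinger distance) from every point kept so far. The resulting set is $\epsilon$-separated and, by maximality, its $\epsilon$-balls cover the family. A minimal Python sketch, using an arbitrary random family of discrete distributions:

```python
import numpy as np

def hellinger(f, g):
    """Hellinger distance between discrete distributions f and g."""
    return float(np.linalg.norm(np.sqrt(f) - np.sqrt(g)) / np.sqrt(2))

def maximal_separated_subset(family, eps):
    """Greedy maximal eps-separated subset of a finite family."""
    centers = []
    for f in family:
        if all(hellinger(f, c) >= eps for c in centers):
            centers.append(f)   # keep f only if eps-far from all kept points
    return centers

rng = np.random.default_rng(3)
family = rng.dirichlet(np.ones(4), size=50)  # 50 random distributions on 4 points
centers = maximal_separated_subset(family, eps=0.2)
print(len(centers), "centers cover the family with Hellinger balls of radius 0.2")
```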

Figure 3 depicts the elements of a covering for a set $\Theta$. The cluster of circles at the top right corner represents the covering balls, and on the left of the image we illustrate the maximal $\epsilon_m$-separated subset for a specific band.


Fig. 3: Creating a covering for a set $\Theta$. The star represents the correct hypothesis $\theta^*$.

We are now ready to state the main result regarding a continuous set of hypotheses $\Theta$.

Theorem 2

Let Assumptions 1, 2, and 3 hold, and let $\delta \in (0,1)$ be a given probability tolerance level. Then, the beliefs generated by the update rule in Eq. (2) with uniform initial beliefs are such that, for any $r > 0$ and any $k \geq N$, with probability $1 - \delta$,

where $N$ is defined by the relation

for a parameter $\beta$ such that $0 < \beta < 1$ and for all $k \geq N$. The constant $\alpha$ is as in Assumption 1, $d$ is the dimension of the space containing $\Theta$, and $\{\epsilon_m\}$ and $\{N_m\}$ are as in Definition 5, while $\lambda$ is the same as in Theorem 1.

Analogously to Theorem 1, Theorem 2 provides a probabilistic concentration result for the agents' beliefs around a Hellinger ball of radius $r$ centered at $\theta^*$ for sufficiently large $k$.

Similarly to the preceding section, we represent the beliefs in terms of the initial beliefs and the cumulative product of the weighted likelihoods received from the neighbors. In particular, analogously to Eq. (7), we have, for every $k \geq 1$ and every measurable set $B \subseteq \Theta$:

$\mu^i_k(B) = \frac{1}{Z^i_k} \int_B \prod_{t=1}^{k} \prod_{j=1}^{n} \ell^j\big(x^j_t|\theta\big)^{[A^{k-t}]_{ij}} \, d\mu(\theta)$   (11)

with the corresponding normalization constant $Z^i_k$, and assuming all agents have uniform beliefs at time $k = 0$.

It will be easier to work with the beliefs’ densities, so we define the density of a measurable set with respect to the observed data.

Definition 6

The density of a measurable set $B \subseteq \Theta$, where $\mu(B) > 0$, with respect to the product distribution of the observed data is given by

(12)

where the weighted likelihood products and the normalization are as in Eq. (11).

The next lemma relates the density which is defined per agent to a quantity that is common among all nodes in the network.

Lemma 3

Consider the densities as defined in Eq. (12). Then

(13)

where the quantities on the right-hand side are common to all agents in the network.

Proof:

By definition of the densities, we have

where the last line follows by adding and subtracting the network-average exponent. Hence, by Lemma 1, we further obtain