Achieving global behaviors by repeatedly aggregating local information without complete knowledge of the network has been a recent topic of interest [1, 2, 3, 4, 5]. For example, a distributed hypothesis testing method that uses belief propagation has been studied in [3]. Various extensions to finite-capacity channels, packet losses, delayed communications, and tracking were developed in [6, 7]. In [1, 8], almost sure convergence of non-Bayesian rules based on consensus was shown for static graphs. Other methods to aggregate Bayes estimates in a network have been explored as well [9]. The work in [10] extends the results of [1] to time-varying undirected graphs. In [11], local exponential rates of convergence for undirected gossip-like graphs are studied. The authors in [12, 13] proposed a non-Bayesian learning algorithm where a local Bayes' update is followed by a consensus step. In [12], a convergence result for fixed graphs is provided and large-deviation convergence rates are given, proving the existence of a random time after which the beliefs concentrate exponentially fast. In [13], similar probabilistic bounds for the rate of convergence are derived and comparisons with the centralized version of the learning rule are provided.
Following the seminal work of Jadbabaie et al. in [1, 14, 15], there have been many studies of non-Bayesian rules for distributed learning. Non-Bayesian algorithms involve an aggregation step, usually consisting of a belief aggregation, and a Bayesian update that is based on the locally available data. The belief aggregation typically consists of a weighted geometric or arithmetic average of beliefs, in which case results from the consensus literature [16, 17, 18, 19, 20] are exploited, while the Bayesian update step is based on the standard Bayesian learning approach [21, 22].
Several variants of the non-Bayesian approach have been proposed and shown to produce consistent estimates, with provable asymptotic and non-asymptotic convergence rates for a general class of distributed algorithms. The main body of work is focused on the case of finitely many hypotheses. The established results include asymptotic convergence rate analysis [11, 23, 24, 25, 26, 27, 28] and non-asymptotic convergence rate bounds [13, 29, 12], time-varying directed graphs [40], a continuum set of hypotheses [30], weakly connected graphs [31], a bisection search algorithm [32], and transmission node failures [33, 34, 35].
In this paper, we overview a subset of recent studies on distributed (non-Bayesian) learning algorithms. To present a concise introduction to the topic, we start with ideas from centralized learning and then transition to the most recent developments in the distributed setting. This tutorial is by no means exhaustive, and the interested reader may consult the references for a more complete exposition of certain aspects.
This tutorial is organized as follows. Section II presents a general introduction to the distributed learning problem. We highlight the main assumptions and how they can be weakened for more general results. Section III provides an overview of the centralized non-Bayesian learning problem and describes some initial generalizations to the distributed setting (known as social learning). Moreover, convergence results as well as (non-)asymptotic convergence rate estimates are provided. Section IV discusses some generalizations aimed at improving the convergence rate estimates (in terms of their dependency on the number of agents), dealing with time-varying directed graphs, and learning with a continuum of hypotheses. Finally, some conclusions are presented in Section V.
The inner product of two vectors $x$ and $y$ is denoted by $\langle x, y \rangle$. We write $[A]_{ij}$ or $a_{ij}$ to denote the entry of a matrix $A$ in the $i$-th row and $j$-th column. We write $A'$ for the transpose of a matrix $A$ and $x'$ for the transpose of a vector $x$. A matrix is said to be stochastic if its entries are nonnegative and the sum of the entries in every row is equal to 1. A stochastic matrix $A$ whose transpose $A'$ is also stochastic is said to be doubly stochastic. We use $I$ for the identity matrix, where its size is inferred from the context. We write $e_i$ to denote the vector with all zero entries except for its $i$-th entry, which is equal to 1. In general, when referring to agents we will use superscripts with the letters $i$ or $j$, while when referring to a time instant we will use subscripts and the letter $k$.
We write $|\Theta|$ to denote the cardinality of a set $\Theta$, and $\pi$ for a probability measure over the set $\Theta$. Upper case letters represent random variables (e.g., $S_k$), with the corresponding lower case letters as their realizations (e.g., $s_k$). The notation $\mathbb{E}[\cdot]$ is reserved for expectation with respect to a random variable.
We denote the Kullback-Leibler (KL) divergence between two probability distributions $P$ and $Q$ with a common support set by $D_{KL}(P \| Q)$. In particular, when the distributions $P$ and $Q$ have a countable (or finite) support set, their KL divergence is given by
$$D_{KL}(P \| Q) = \sum_{\omega} P(\omega) \log \frac{P(\omega)}{Q(\omega)}.$$
The definition of the KL divergence for general measures $P$ and $Q$ on a given set is a bit more involved; it can be found, for example, in [36].
II Problem Statement
Consider a group of $n$ agents, indexed by $i = 1, \ldots, n$, each having conditionally independent observations of a random process at discrete time steps $k = 1, 2, \ldots$. Specifically, agent $i$ observes the random variables $S_1^i, S_2^i, \ldots$, which are i.i.d. in time and distributed according to an unknown probability distribution $f^i$. The set of possible outcomes of the random variables $S_k^i$ is a finite set, which we denote by $\mathcal{S}^i$. For convenience, we stack up all the $S_k^i$ into a vector denoted by $S_k$. Then, $\{S_k\}$ is an i.i.d. sequence of vectors taking values in $\mathcal{S} = \mathcal{S}^1 \times \cdots \times \mathcal{S}^n$ and distributed as $f = f^1 \times \cdots \times f^n$. Furthermore, each agent $i$ has a family of probability distributions $\{\ell^i(\cdot \mid \theta)\}$ parametrized by $\theta \in \Theta$, where $\Theta$ is a set of parameters. One can think of $\Theta$ as a set of hypotheses and $\ell^i(\cdot \mid \theta)$ as the probability distribution that would be seen by agent $i$ if hypothesis $\theta$ were true. In general, it is not required that there exists a $\theta$ with $\ell^i(\cdot \mid \theta) = f^i$ for all $i$; in other words, there may not be a hypothesis which matches the observations made by the nodes. Rather, the objective of all agents is to agree on a subset of $\Theta$ that best fits all the observations in the network. Formally, this setup describes the scenario where the group of agents collectively tries to solve the following optimization problem:
$$\min_{\theta \in \Theta} \ \sum_{i=1}^n D_{KL}\big(f^i \,\|\, \ell^i(\cdot \mid \theta)\big), \qquad (1)$$
where $D_{KL}(f^i \| \ell^i(\cdot \mid \theta))$ is the Kullback-Leibler divergence between the distribution of $S_k^i$ and the distribution that would have been seen by agent $i$ if hypothesis $\theta$ were correct. The distributions $f^i$ are not known; therefore, the agents want to “learn” the solution to this optimization problem based on their local observations and some local interactions. See Figure 1 for an illustration of the problem.
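To make the objective in (1) concrete, the following minimal sketch evaluates the network objective for each hypothesis when the true distributions are known (in the learning problem itself the $f^i$ are unknown and only sampled). The function names `kl_divergence` and `network_objective` are illustrative, not from the original text:

```python
import numpy as np

def kl_divergence(p, q):
    """KL divergence D(p || q) for finite distributions, with q > 0 on p's support."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def network_objective(f, likelihoods):
    """Objective of (1): for each hypothesis theta, sum over agents i of
    D(f^i || l^i(. | theta)).
    f: list of true observation distributions, one per agent.
    likelihoods: likelihoods[i][t] is agent i's model under hypothesis t."""
    n_theta = len(likelihoods[0])
    return [sum(kl_divergence(f[i], likelihoods[i][t]) for i in range(len(f)))
            for t in range(n_theta)]
```

A hypothesis whose models match every agent's true distribution attains objective value zero, which is why the agents seek the minimizing subset of $\Theta$.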
The agents interact over a sequence of directed communication graphs $\mathcal{G}_k = (V, E_k)$, where $V$ is the set of agents (where each agent is viewed as a node), and $E_k$ is the set of edges, with $(j, i) \in E_k$ if agent $j$ can communicate with agent $i$ at time instant $k$. Specifically, the agents communicate with each other by sharing their beliefs about the hypothesis set, denoted by $\mu_k^i$, which is a probability distribution over the hypothesis set $\Theta$. In the forthcoming discussion, we will consider cases where the graphs can be static and may be undirected. We will clearly specify the assumptions made on the graphs.
The hypothesis set $\Theta$ can be finite, countable, or a continuum, which will be self-evident from the expressions used in the Bayes' update relation.
In this section, we describe some of the algorithms that have been proposed for the distributed non-Bayesian learning problem. Different algorithms and results exist due to the use of different communication networks and protocols for information exchange. Moreover, the variety in the algorithms is also due to the order in which the local information updates and neighbor beliefs aggregation updates are performed.
We will start by considering the Bayes' update for the case of a single agent, i.e., the centralized case. Furthermore, initially, for simplicity of exposition, we will assume there exists a single $\theta^* \in \Theta$ that minimizes problem (1) for the single-agent case. In this case, updating the beliefs to account for a set of observations that leads to a posterior belief follows Bayes' rule. Specifically, having a belief $\mu_k$ and a new observation $s_{k+1}$ at time $k+1$, the agent updates its belief as follows:
where $BU(\mu_k; s_{k+1})$ denotes the Bayesian update of the belief $\mu_k$ given a new observation $s_{k+1}$, i.e.,
where the symbol $\propto$ stands for positively proportional quantities (the proportionality constant here is the normalization factor needed to obtain a probability distribution).
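The Bayesian update over a finite hypothesis set is a one-line multiplicative step; the following minimal sketch (the name `bayes_update` is illustrative) makes the normalization explicit:

```python
import numpy as np

def bayes_update(belief, likelihood_s):
    """One Bayes step over a finite hypothesis set:
    posterior(theta) is proportional to likelihood(s | theta) * prior(theta)."""
    post = np.asarray(belief, float) * np.asarray(likelihood_s, float)
    return post / post.sum()  # normalize so the posterior sums to 1
```

For example, a uniform prior over two hypotheses combined with likelihoods $0.8$ and $0.2$ for the observed sample yields the posterior $(0.8, 0.2)$.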
where $\lambda$ is a scalar parameter. When $\lambda = 0$, algorithm (3) reduces to Bayesian learning in (2). When $\lambda > 0$, a relative importance is given to the prior, whereas for $\lambda < 0$ the updates over-react to observations. The authors in [37, 38] showed that update rules of the form (3) converge to the correct parameter value in the almost sure sense under suitable conditions on $\lambda$ and measurability of the parameter; otherwise, there is an incorrect parameter to which convergence can happen with a positive probability. Thus, as long as there is a constant flow of new information and the agent takes its personal signals into account in a Bayesian manner, the resulting learning process asymptotically coincides with Bayesian learning.
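One plausible realization of a rule of the form (3) is a convex combination of the prior and its Bayesian update; the exact form in the cited work may differ, so the sketch below (`tempered_update` is a hypothetical name) is only an illustration of the qualitative behavior described above:

```python
import numpy as np

def tempered_update(belief, lik, lam):
    """Sketch of a (3)-style rule: mix the prior belief with its Bayesian update.
    lam = 0 recovers the pure Bayes step; lam > 0 weights the prior;
    lam < 0 over-reacts to the observation. (Assumed form, for illustration.)"""
    b = np.asarray(belief, float)
    bu = b * np.asarray(lik, float)   # unnormalized Bayes update
    bu = bu / bu.sum()
    return lam * b + (1.0 - lam) * bu
```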
The seminal work of Jadbabaie et al. [1] introduced a social learning approach to non-Bayesian learning, where different agents receive different observations and use a DeGroot-style update to incorporate the views of their neighbors:
where the $a_{ij}$ are weights taking positive values on the links of a static graph (i.e., $a_{ij} > 0$ if $(j, i) \in E$) and satisfying $\sum_{j=1}^n a_{ij} = 1$ for all $i$. In [1], it has been shown that, when the underlying social network is strongly connected, every $a_{ii} > 0$, and at least one agent has a positive prior belief on the true parameter (i.e., $\mu_0^i(\theta^*) > 0$ for some $i$), the beliefs generated by algorithm (4) result in all agents' forecasts converging to the correct one with probability one.
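A DeGroot-style step of this kind can be sketched as follows: each agent mixes its own Bayesian update (weighted by its self-reliance) with an arithmetic average of its neighbors' current beliefs. This is one plausible form of update (4), under assumed notation; the exact rule in [1] may differ in detail:

```python
import numpy as np

def social_learning_step(beliefs, A, lik):
    """One DeGroot-style social learning step (sketch).
    beliefs: (n, m) array, row i = agent i's belief over m hypotheses.
    A: (n, n) row-stochastic weight matrix (a_ii = self-reliance).
    lik: (n, m) array, lik[i, t] = l^i(s^i | theta_t) for agent i's new sample."""
    n, _ = beliefs.shape
    new = np.empty_like(beliefs, dtype=float)
    for i in range(n):
        bu = beliefs[i] * lik[i]          # agent i's own Bayesian update
        bu = bu / bu.sum()
        nbrs = sum(A[i, j] * beliefs[j] for j in range(n) if j != i)
        new[i] = A[i, i] * bu + nbrs      # linear aggregation of neighbor beliefs
    return new
```

Each row of the output remains a probability distribution, since it is a convex combination of probability distributions.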
A connection between non-Bayesian learning and optimization theory was pointed out in [11], where a distributed learning algorithm has been proposed that is based on a maximum likelihood analysis of the estimation problem and Nesterov's dual averaging algorithm [39]. Finding the true state of the world was described as the following optimization problem:
Applying a regularized dual averaging algorithm to the optimization problem (6), one obtains a sequence of iterates generated with a sequence of non-increasing step-sizes and a proximal function.
In the distributed setting of [11], for an undirected and static graph, randomized gossip interactions were considered, where an agent $i$ “wakes up” according to a Poisson clock and communicates with a randomly selected agent $j$. Both agents average their accumulated observations and add their most recent stochastic gradient, resulting in an update of the form:
while the other agents in the system do not update.
Choosing the same parameters for all agents and using the Kullback-Leibler divergence as a proximal function, the update rule of the form (8) has a closed-form solution given by
with $i$ and $j$ being the agents involved in the random gossip communication at time $k$ (or alternatively, with the link $(i, j)$ being randomly activated in the graph $\mathcal{G}$).
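The closed form for a gossip step can be sketched as follows: the two awake agents replace their beliefs with the normalized geometric mean of their current beliefs, weighted by their own local likelihoods, while all other agents stay unchanged. The exact closed form of (8) is assumed here, so treat the sketch (`gossip_geometric_step` is an illustrative name) as a qualitative illustration:

```python
import numpy as np

def gossip_geometric_step(beliefs, i, j, lik_i, lik_j):
    """Sketch of a gossip-style geometric belief update between agents i and j.
    beliefs: (n, m) array of beliefs over m hypotheses; only rows i, j change."""
    new = beliefs.astype(float).copy()
    mix = np.sqrt(beliefs[i] * beliefs[j])   # geometric mean of the two beliefs
    for a, lik in ((i, lik_i), (j, lik_j)):
        b = mix * lik                        # incorporate the agent's own likelihood
        new[a] = b / b.sum()
    return new
```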
The update rule in (8) involves a form of geometric average of beliefs instead of the linear aggregation of beliefs as in (4). Weak convergence is proven under a connectivity assumption on the interaction graph, i.e.,
Additionally, in [11], convergence rate results for the estimation process are provided. An asymptotic rate is derived which guarantees that, for sufficiently large time scales, the beliefs concentrate around the true hypothesis exponentially fast. The rate at which this happens is proportional to the distance (in the sense of the KL divergence) between the true hypothesis and the second best option; i.e., with high probability, for sufficiently large $k$ it holds that
where $C$ is a constant and
Similar asymptotic rates, using large deviation theory, were derived in [12] for a directed static graph but for a different algorithm. Specifically, in [12], an explicit belief update rule is considered where local Bayesian updates are aggregated via geometric averages of the form:
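A common log-linear form of such geometric-averaging rules sets $b_{k+1}^i(\theta) \propto \ell^i(s^i_{k+1} \mid \theta) \prod_j b_k^j(\theta)^{a_{ij}}$. The sketch below implements this form under that assumption (`log_linear_update` is an illustrative name); note that with $A = I$ it reduces to each agent's local Bayesian update:

```python
import numpy as np

def log_linear_update(beliefs, A, lik):
    """Log-linear (geometric-average) belief update, vectorized over agents.
    beliefs: (n, m) array; A: (n, n) row-stochastic weights; lik: (n, m) local
    likelihoods of the new observations. Implements
    b_new^i(theta) proportional to lik[i, theta] * prod_j beliefs[j, theta]^A[i, j]."""
    mixed = np.exp(A @ np.log(beliefs))      # row-wise geometric average of beliefs
    new = lik * mixed
    return new / new.sum(axis=1, keepdims=True)
```

In logarithmic coordinates the geometric average is a plain consensus step, which is what lets the consensus literature be applied to rules of this type.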
Under the assumptions of strong connectivity, positive prior beliefs, and existence of a unique correct model, an exponential convergence rate of the beliefs to the correct hypothesis has been shown and an asymptotic convergence rate is provided (see Theorem 1 of [12]).
In recent works [13, 40], non-asymptotic convergence rates for a variety of distributed non-Bayesian learning algorithms have been established. In [13], the algorithm in (8) has been considered for the case of (non-random) agent interactions over a general static connected graph. In particular, the following relations have been shown to hold
with probability $1 - \delta$, where $\delta > 0$ is arbitrarily small. Here, $\|\cdot\|_{TV}$ denotes the total variation distance between vectors, $e_{\theta^*}$ is a probability vector with a unit entry in the position corresponding to hypothesis $\theta^*$, $|\Theta|$ denotes the size of the hypothesis set, $\alpha$ is a step-size, and $\eta$ is a lower bound on the probability mass in the likelihood models. The vector $\pi$
denotes the stationary distribution of the corresponding Markov chain whose transition matrix is the interaction matrix $A$ (in other words, the vector $\pi$ satisfies $\pi' A = \pi'$).
The non-asymptotic probabilistic bound in (11) shows that the concentration of the beliefs around the true state of the world is an exponentially fast process, with a transient time related to the matrix properties and the desired accuracy level. The bound holds for a connected graph and a stochastic weight matrix $A$, and the exponential concentration rate depends explicitly on the left eigenvector associated with the eigenvalue 1 of the matrix $A$.
Independent simultaneous work [40, 29] has also developed non-asymptotic bounds for distributed non-Bayesian learning, for time-varying graphs and for different algorithms. The belief update rules in [40, 29] are based on the mirror descent algorithm as applied to the learning problem in a distributed setting. The resulting update rule has the following form:
Algorithm (13) is applicable to time-varying graphs, as indicated by the use of time-varying weight matrices $A_k$ that are compliant with the graphs' structure. In particular, the following assumption is imposed on the graph sequence $\{\mathcal{G}_k\}$ and the matrix sequence $\{A_k\}$.
Assume that each graph $\mathcal{G}_k$ is undirected and has no self-loops. Moreover, let the graph sequence $\{\mathcal{G}_k\}$ and the matrix sequence $\{A_k\}$ satisfy the following conditions:
$A_k$ is doubly stochastic for every $k$, with $[A_k]_{ij} > 0$ if $\{i, j\} \in E_k$ and $[A_k]_{ij} = 0$ for $\{i, j\} \notin E_k$ with $i \neq j$.
Each $A_k$ has positive diagonal entries, i.e., $[A_k]_{ii} > 0$ for all $i$ and all $k$.
There exists a uniform lower bound $\eta > 0$ on the positive entries of $A_k$, i.e., $[A_k]_{ij} \geq \eta$ whenever $[A_k]_{ij} > 0$.
The graph sequence $\{\mathcal{G}_k\}$ is $B$-connected, i.e., there is an integer $B \geq 1$ such that the graph $\big(V, \bigcup_{t=kB}^{(k+1)B-1} E_t\big)$ is connected for all $k \geq 0$.
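The matrix conditions in the assumption above are mechanical to verify for a candidate weight matrix; a small checker might look like the following (the function name `satisfies_matrix_assumptions` is illustrative):

```python
import numpy as np

def satisfies_matrix_assumptions(A, tol=1e-12):
    """Check the per-matrix conditions of the assumption:
    nonnegative entries, doubly stochastic, strictly positive diagonal."""
    A = np.asarray(A, dtype=float)
    nonneg = bool((A >= -tol).all())
    doubly = np.allclose(A.sum(axis=0), 1.0) and np.allclose(A.sum(axis=1), 1.0)
    pos_diag = bool((np.diag(A) > 0).all())
    return nonneg and doubly and pos_diag
```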
We now consider the learning problem in (1), where the hypothesis set $\Theta$ is finite. We let $\Theta^*$ denote the set of optimal solutions and note that this set is nonempty. In this setting, the following assumption ensures that the learning process identifies the correct hypotheses. In particular, the assumption covers the general case when a unique true state of the underlying process does not exist (implying that $\Theta^*$ is not a singleton).
For all agents $i$:
There is a nonempty set such that for all . Furthermore, the intersection set is nonempty.
There exists an such that if then for all .
With the two assumptions above, we can state the main result.
where the local learning objective of agent $i$ is given by
Theorem 1 states that, with high probability and after a sufficiently long time, the belief of each agent on any hypothesis outside the optimal set decays at a network-independent rate. This rate scales with a constant equal to the average Kullback-Leibler divergence to the next best hypothesis. However, there is a transient due to the network-dependent term (the bound of Theorem 1 is not even below 1 until this transient has passed), and the size of this transient depends on the network and the number of nodes.
We note that the transient time for each agent is affected by the discrepancy in the initial beliefs on the correct hypotheses (those in the set ), as captured by the term
in the expression for the transient time in Theorem 1. We note that, if agent $i$ uses a uniform initial belief, i.e., $\mu_0^i(\theta) = 1/|\Theta|$ for all $\theta$, then this term is 0 and, consequently, it does not contribute to the transient time. Thus, the transient time has a dependence on the initial beliefs that is intuitively plausible. Moreover, if agent $i$ were to start with a good initial belief, i.e., a belief such that
then the corresponding transient time would be smaller, which is also to be expected.
III-A Connection with Distributed Stochastic Mirror Descent
To keep this connection simple, we retain the assumption that the hypothesis set $\Theta$ is finite. Then, we can observe that the optimization problem in Eq. (1) is equivalent to the following problem:
The order of the expectations in the preceding relation can be exchanged, so the problem in Eq. (1) is equivalent to the following one:
The difficulty in evaluating the objective function in Eq. (14) (even in the case of a single agent) lies in the fact that the distributions $f^i$ are unknown. A generic approach to such problems is the class of stochastic approximation methods, where the objective is minimized by constructing a sequence of gradient-based iterates in which the true gradient of the objective (which is not available) is replaced by a gradient sample that is available at the given update time. A particular method that is relevant here is the stochastic mirror descent method, which would solve the problem in Eq. (14) in a centralized fashion by constructing a sequence of iterates as follows:
where $g_k$ is a noisy realization of the gradient of the objective function in Eq. (14), $D_w(\cdot, \cdot)$ is a Bregman distance function associated with a distance-generating function $w$, and $\alpha_k$ is the step-size. If we take the negative entropy as the distance-generating function, then the corresponding Bregman distance is the Kullback-Leibler divergence. Let us note that this specific choice of Bregman divergence was previously studied in [41], where the entropic mirror descent algorithm was proposed. Thus, in this case, the update rule in Eq. (19) corresponds to a distributed implementation of the stochastic mirror descent algorithm in (15) with a fixed step-size, i.e., $\alpha_k = \alpha$ for all $k$.
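The entropic mirror descent step has a simple closed form on the probability simplex, which is what ties it to the Bayesian update. The sketch below (`entropic_md_step` is a hypothetical name) shows the closed form and, in a comment, the coincidence with Bayes' rule when the gradient sample is the negative log-likelihood:

```python
import numpy as np

def entropic_md_step(x, grad, alpha=1.0):
    """One stochastic mirror descent step with the KL Bregman distance:
    argmin_y { <grad, y> + (1/alpha) * D_KL(y || x) } over the simplex
    has the closed form y proportional to x * exp(-alpha * grad)."""
    y = np.asarray(x, float) * np.exp(-alpha * np.asarray(grad, float))
    return y / y.sum()

# With grad(theta) = -log l(s | theta) and alpha = 1, this step coincides with
# the Bayesian update b_new(theta) proportional to b(theta) * l(s | theta).
```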
The update rule in Eq. (13) defines a probability measure over the set $\Theta$, which coincides with the iterate update of the distributed stochastic mirror descent algorithm applied to the optimization problem in Eq. (1), i.e.,
IV-A Fast Rates with Nesterov's Acceleration
For static undirected graphs, the authors in [29] proposed an update rule with one-step memory, as follows:
where $\sigma$ is a constant to be set later. This update rule is based on an accelerated algorithm for computing network aggregate values given in [20], which improves the factor that governs the exponential decay relative to the previous rate results.
For the algorithm in (17) we impose the following assumption.
The graph sequence $\{\mathcal{G}_k\}$ is static (i.e., $\mathcal{G}_k = \mathcal{G}$ for all $k$) and undirected, and the weight matrix $\bar{A}$ is a lazy Metropolis matrix, defined by
$$\bar{A} = \frac{1}{2}\left(I + M\right),$$
where $M$ is the Metropolis matrix, i.e., the unique stochastic matrix whose off-diagonal entries satisfy
$$[M]_{ij} = \frac{1}{1 + \max(d_i, d_j)} \quad \text{if } \{i, j\} \in E, \qquad [M]_{ij} = 0 \quad \text{if } \{i, j\} \notin E, \ i \neq j,$$
with $d_i$ being the degree of node $i$ (i.e., the number of neighbors of $i$ in the graph $\mathcal{G}$).
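The lazy Metropolis construction is straightforward to write out; the sketch below builds $M$ from an adjacency matrix and returns $\bar{A} = (I + M)/2$ (the function name `lazy_metropolis` is illustrative):

```python
import numpy as np

def lazy_metropolis(adj):
    """Lazy Metropolis matrix (I + M) / 2 for an undirected graph given by a
    0/1 adjacency matrix. Off-diagonal M_ij = 1 / (1 + max(d_i, d_j)) on edges;
    diagonal entries make each row sum to 1."""
    adj = np.asarray(adj)
    n = adj.shape[0]
    deg = adj.sum(axis=1)
    M = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j and adj[i, j]:
                M[i, j] = 1.0 / (1 + max(deg[i], deg[j]))
        M[i, i] = 1.0 - M[i].sum()       # stochastic rows
    return 0.5 * (np.eye(n) + M)
```

By symmetry of the off-diagonal entries, the result is symmetric and hence doubly stochastic, with strictly positive diagonal by construction.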
The next theorem provides a convergence rate bound for the beliefs generated by algorithm (17). In particular, it shows the rate at which the beliefs dissipate the mass placed on wrong (non-optimal) hypotheses.
Let Assumptions 3 and 2 hold, and let the constant in (17) and the step-size be chosen appropriately. Then, the update rule of Eq. (17), with uniform initial conditions and the memory term initialized to zero, has the following property: there is an integer $N$ such that, with high probability, for all $k \geq N$ and for all $\theta \notin \Theta^*$, there holds
with from Assumption 2(b) and .
The bound of Theorem 2 improves the network-dependent term in the exponent relative to the bounds of Theorem 1 when the graphs are static. We note, however, that the requirements of this theorem are more stringent than those of Theorem 1. Not only does the graph have to be fixed, but all nodes need to know an upper bound on the total number of agents; moreover, this bound has to be within a constant factor of the actual number of agents. More details on fast algorithms for distributed optimization and learning can be found in the tutorial paper [42].
IV-B Directed Time-Varying Graphs
In [40], the authors proposed a new algorithm, inspired by the push-sum protocol, that guarantees convergence over directed graphs; it is given as follows:
For this algorithm, we have the following result about its convergence behavior.
Assume that the graph sequence $\{\mathcal{G}_k\}$ is $B$-strongly connected and that Assumption 2 holds. Also, let $\delta$ be a given error percentile (or confidence value). Then the update rule in Eq. (18), with uniform initial beliefs, has the following property: there is an integer $N$ such that, with probability $1 - \delta$, for all $k \geq N$ and for all $\theta \notin \Theta^*$, there holds
with the constants as defined in Theorem 1.
The constants in the bound satisfy the following relations:
(1) For general $B$-strongly-connected graph sequences $\{\mathcal{G}_k\}$, we have
(2) If every graph $\mathcal{G}_k$ is regular, then we have
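The push-sum idea underlying (18) can be illustrated in its basic scalar average-consensus form: each node splits a value and a correction weight among its out-neighbors, and the value-to-weight ratios converge to the network average even though the column-stochastic mixing alone would not. This is a sketch under assumed notation (`push_sum_average`, `out_neighbors` are illustrative names), for a static strongly connected directed graph with self-loops:

```python
import numpy as np

def push_sum_average(x0, out_neighbors, steps):
    """Push-sum protocol (sketch): each node splits its value x and weight y
    equally among its out-neighbors (self-loops included). The ratios x/y
    converge to the average of x0 for strongly connected directed graphs."""
    n = len(x0)
    x = np.asarray(x0, float).copy()
    y = np.ones(n)                        # correction weights, initialized to 1
    for _ in range(steps):
        xn, yn = np.zeros(n), np.zeros(n)
        for i in range(n):
            d = len(out_neighbors[i])
            for j in out_neighbors[i]:
                xn[j] += x[i] / d
                yn[j] += y[i] / d
        x, y = xn, yn
    return x / y                          # ratios estimate the network average
```

The same weight-correction mechanism is what allows the belief update in [40] to cope with column-stochastic (rather than doubly stochastic) mixing over directed graphs.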
IV-C Infinite Sets of Hypotheses
All previously discussed results assume that the hypothesis set is finite. The exponential convergence rates discussed so far depend on some form of distance between the optimal hypothesis and the second best one, and such results are extendable to the case of countably many hypotheses. However, in the case of a continuum of hypotheses, this approach encounters obstacles. In recent work [30], exponential rates have been established for a compact set of hypotheses. In this case, the update rule for a measurable set is defined as
where the reference measure is one with respect to which every belief is absolutely continuous. The particular details of the rate results can be found in [30].
V Conclusions and Future Work
We presented highlights of recent developments on the problem of distributed (non-Bayesian) learning. We discussed the problem statement and how different assumptions on the learning model, communication graphs, and hypothesis sets lead to different algorithmic implementations. We showed that the original Bayesian approach can be interpreted as a method for solving a related optimization problem.
Future work should focus on models where the observations are not necessarily identically distributed or independent. Recent results on concentration of measure without independence provide the theoretical foundations for obtaining non-asymptotic rates in more general cases [44, 45]. Such time dependence could model changes in the optimal hypotheses, in the likelihood models, or in the Bregman divergences used. Online optimization methods have been shown to be efficient for some forms of time dependence [46, 47].
[1] A. Jadbabaie, P. Molavi, A. Sandroni, and A. Tahbaz-Salehi, “Non-Bayesian social learning,” Games and Economic Behavior, vol. 76, no. 1, pp. 210–225, 2012.
[2] K. Rahnama Rad and A. Tahbaz-Salehi, “Distributed parameter estimation in networks,” in Proceedings of the IEEE Conference on Decision and Control, 2010, pp. 5050–5055.
[3] R. Olfati-Saber, E. Franco, E. Frazzoli, and J. S. Shamma, “Belief consensus and distributed hypothesis testing in sensor networks,” in Networked Embedded Sensing and Control. Springer, 2006, pp. 169–182.
[4] M. Alanyali, S. Venkatesh, O. Savas, and S. Aeron, “Distributed Bayesian hypothesis testing in sensor networks,” in Proceedings of the American Control Conference, 2004, pp. 5369–5374.
[5] Q. Zhou, D. Li, S. Kar, L. Huie, H. V. Poor, and S. Cui, “Learning-based distributed detection-estimation in sensor networks with unknown sensor defects,” arXiv preprint arXiv:1510.02371, 2015.
[6] V. Saligrama, M. Alanyali, and O. Savas, “Distributed detection in sensor networks with packet losses and finite capacity links,” IEEE Transactions on Signal Processing, vol. 54, no. 11, pp. 4118–4132, 2006.
[7] R. Rahman, M. Alanyali, and V. Saligrama, “Distributed tracking in multihop sensor networks with communication delays,” IEEE Transactions on Signal Processing, vol. 55, no. 9, pp. 4656–4668, 2007.
[8] A. Jadbabaie, P. Molavi, and A. Tahbaz-Salehi, “Information heterogeneity and the speed of learning in social networks,” Columbia Business School Research Paper, no. 13-28, 2013.
[9] S. Bandyopadhyay and S.-J. Chung, “Distributed estimation using Bayesian consensus filtering,” in Proceedings of the American Control Conference, 2014, pp. 634–641.
[10] Q. Liu, A. Fang, L. Wang, and X. Wang, “Social learning with time-varying weights,” Journal of Systems Science and Complexity, vol. 27, no. 3, pp. 581–593, 2014.
[11] S. Shahrampour and A. Jadbabaie, “Exponentially fast parameter estimation in networks using distributed dual averaging,” in Proceedings of the IEEE Conference on Decision and Control, 2013, pp. 6196–6201.
[12] A. Lalitha, T. Javidi, and A. Sarwate, “Social learning and distributed hypothesis testing,” arXiv preprint arXiv:1410.4307, 2015.
[13] S. Shahrampour, A. Rakhlin, and A. Jadbabaie, “Distributed detection: Finite-time analysis and impact of network topology,” arXiv preprint arXiv:1409.8606, 2014.
[14] P. Molavi, A. Tahbaz-Salehi, and A. Jadbabaie, “Foundations of non-Bayesian social learning,” Columbia Business School Research Paper, 2015.
[15] P. Molavi, K. R. Rad, A. Tahbaz-Salehi, and A. Jadbabaie, “On consensus and exponentially fast social learning,” in Proceedings of the American Control Conference, 2012, pp. 2165–2170.
[16] D. Acemoglu, A. Nedić, and A. Ozdaglar, “Convergence of rule-of-thumb learning rules in social networks,” in Proceedings of the IEEE Conference on Decision and Control, 2008, pp. 1714–1720.
[17] J. N. Tsitsiklis and M. Athans, “Convergence and asymptotic agreement in distributed decision problems,” IEEE Transactions on Automatic Control, vol. 29, no. 1, pp. 42–50, 1984.
[18] A. Jadbabaie, J. Lin, and A. S. Morse, “Coordination of groups of mobile autonomous agents using nearest neighbor rules,” IEEE Transactions on Automatic Control, vol. 48, no. 6, pp. 988–1001, 2003.
[19] A. Nedić and A. Olshevsky, “Distributed optimization over time-varying directed graphs,” IEEE Transactions on Automatic Control, vol. 60, no. 3, pp. 601–615, 2015.
[20] A. Olshevsky, “Linear time average consensus on fixed graphs and implications for decentralized optimization and multi-agent control,” arXiv preprint arXiv:1411.4186, 2014.
[21] D. Acemoglu, M. A. Dahleh, I. Lobel, and A. Ozdaglar, “Bayesian learning in social networks,” The Review of Economic Studies, vol. 78, no. 4, pp. 1201–1236, 2011.
[22] E. Mossel, A. Sly, and O. Tamuz, “Asymptotic learning on Bayesian social networks,” Probability Theory and Related Fields, vol. 158, no. 1-2, pp. 127–157, 2014.
[23] A. Lalitha, A. Sarwate, and T. Javidi, “Social learning and distributed hypothesis testing,” in IEEE International Symposium on Information Theory, 2014, pp. 551–555.
[24] L. Qipeng, F. Aili, W. Lin, and W. Xiaofan, “Non-Bayesian learning in social networks with time-varying weights,” in 30th Chinese Control Conference (CCC), 2011, pp. 4768–4771.
[25] L. Qipeng, Z. Jiuhua, and W. Xiaofan, “Distributed detection via Bayesian updates and consensus,” in 34th Chinese Control Conference (CCC). IEEE, 2015, pp. 6992–6997.
[26] S. Shahrampour, M. Rahimian, and A. Jadbabaie, “Switching to learn,” in Proceedings of the American Control Conference, 2015, pp. 2918–2923.
[27] M. A. Rahimian, S. Shahrampour, and A. Jadbabaie, “Learning without recall by random walks on directed graphs,” arXiv preprint arXiv:1509.04332, 2015.
[28] A. K. Sahu and S. Kar, “Recursive distributed detection for composite hypothesis testing: Algorithms and asymptotics,” arXiv preprint arXiv:1601.04779, 2016.
[29] A. Nedić, A. Olshevsky, and C. A. Uribe, “Fast convergence rates for distributed non-Bayesian learning,” arXiv preprint arXiv:1508.05161, Aug. 2015.
[30] ——, “Distributed learning with infinitely many hypotheses,” arXiv preprint arXiv:1605.02105, 2016.
[31] H. Salami, B. Ying, and A. H. Sayed, “Social learning over weakly-connected graphs,” arXiv preprint arXiv:1609.03703, 2016.
[32] A. Tsiligkaridis and T. Tsiligkaridis, “Distributed probabilistic bisection search using social learning,” arXiv preprint arXiv:1608.06007, 2016.
[33] L. Su and N. H. Vaidya, “Asynchronous distributed hypothesis testing in the presence of crash failures,” University of Illinois at Urbana-Champaign, Tech. Rep., 2016.
[34] ——, “Defending non-Bayesian learning against adversarial attacks,” arXiv preprint arXiv:1606.08883, 2016.
[35] ——, “Non-Bayesian learning in the presence of Byzantine agents,” in International Symposium on Distributed Computing. Springer, 2016, pp. 414–427.
[36] J. M. Bernardo and A. F. Smith, Bayesian Theory. IOP Publishing, 2001.
[37] L. G. Epstein, J. Noor, and A. Sandroni, “Non-Bayesian learning,” The B.E. Journal of Theoretical Economics, vol. 10, no. 1, 2010.
[38] ——, “Non-Bayesian updating: A theoretical framework,” Theoretical Economics, vol. 3, no. 2, pp. 193–229, 2008.
[39] Y. Nesterov, “Primal-dual subgradient methods for convex problems,” Mathematical Programming, vol. 120, no. 1, pp. 221–259, 2009.
[40] A. Nedić, A. Olshevsky, and C. A. Uribe, “Nonasymptotic convergence rates for cooperative learning over time-varying directed graphs,” in Proceedings of the American Control Conference, 2015, pp. 5884–5889.
[41] A. Beck and M. Teboulle, “Mirror descent and nonlinear projected subgradient methods for convex optimization,” Operations Research Letters, vol. 31, no. 3, pp. 167–175, 2003.
[42] A. Olshevsky, “Fast algorithms for distributed optimization and hypothesis testing: A tutorial,” in Proceedings of the 55th IEEE Conference on Decision and Control, 2016, also available at https://arxiv.org/pdf/1609.03961v1.pdf.
[43] A. Nedić, A. Olshevsky, and C. A. Uribe, “Network independent rates in distributed learning,” in Proceedings of the American Control Conference, July 2016, pp. 1072–1077.
[44] A. Kontorovich and M. Raginsky, “Concentration of measure without independence: A unified approach via the martingale method,” arXiv preprint arXiv:1602.00721, 2016.
[45] J. C. Duchi, A. Agarwal, M. Johansson, and M. I. Jordan, “Ergodic mirror descent,” SIAM Journal on Optimization, vol. 22, no. 4, pp. 1549–1578, 2012.
[46] A. Mokhtari, S. Shahrampour, A. Jadbabaie, and A. Ribeiro, “Online optimization in dynamic environments: Improved regret rates for strongly convex problems,” arXiv preprint arXiv:1603.04954, 2016.
[47] S. Shahrampour and A. Jadbabaie, “Distributed online optimization in dynamic environments using mirror descent,” arXiv preprint arXiv:1609.02845, 2016.