I Introduction
The explosive influx of data in modern-day technological systems has opened doors to revolutionary possibilities. One of the most vital uses of this data is in modeling, visualizing, and analyzing large data sets through probabilistic tools. Statistical machine learning is at the core of numerous such applications in what is becoming known as the Internet of Things (IoT). Such iterative learning mechanisms improve control performance for many cyber-physical systems in which parameter estimation and system identification are required. Probabilistic graphical modeling is one key research area that has aided data analysis in inference and prediction tasks [2]. These models visually express assumptions about data and its hidden structure. Posterior inference algorithms have been shown to exploit such models in explaining this hidden structure while being adaptive, robust, parallelizable, and scalable. Variational inference, dating back to the late 1990s, transforms complex inference problems into high-dimensional optimization problems. In contrast to Monte Carlo sampling methods (which approximate the exact inference problem numerically), the variational Bayesian approach solves, under constraints, an optimization problem that approximates the inference problem [3]. Along the same lines, stochastic variational inference (SVI) was recently developed, extending variational inference to be solved via stochastic optimization under certain assumptions [1]. SVI works iteratively in gradient-ascent fashion using noisy gradient estimates. It provides approximate model posteriors with only a few passes through a large data collection, making it highly scalable. We propose ADMM-based networked SVI: a distributed stochastic variational inference technique that builds upon standard SVI, retaining most of its benefits, based on the highly parallelizable alternating direction method of multipliers (ADMM), in which the agents are connected as nodes of a graph.
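Concretely, each SVI iteration of [1] blends the current global variational parameter with an intermediate estimate computed from a single sampled data point, using a Robbins-Monro step size. A minimal sketch of this update rule (function and argument names are illustrative, not from the paper):

```python
import numpy as np

def svi_step(lam, lam_hat, t, tau=1.0, kappa=0.7):
    """One noisy SVI update of the global variational parameter.

    lam     -- current global variational parameter
    lam_hat -- intermediate estimate computed from one sampled data
               point, treated as if it were replicated over the corpus
    t       -- iteration counter
    The step size rho_t = (t + tau)**(-kappa) satisfies the
    Robbins-Monro conditions for kappa in (0.5, 1].
    """
    rho_t = (t + tau) ** (-kappa)
    # Convex combination: equivalent to a natural-gradient ascent step.
    return (1.0 - rho_t) * lam + rho_t * lam_hat
```

With a step size decaying at this rate, the iterates converge to a local optimum of the variational objective under the usual stochastic-approximation assumptions.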
I-A Related Work
Numerous extensions of the SVI framework have been proposed: to more model classes ([4], [5]), to different underlying processes ([6], [7]), and to structural exploitations ([8]), making it faster and more widely deployable. A variety of works focus on making variational methods distributed to enhance parallelizability. The work on distributed Bayesian nonparametric models by [9] is commendable in making variational inference updates distributed, asynchronous, and ‘streaming’ (online), and it has been shown to outperform standard SVI. However, that work is specific to the Dirichlet process mixture and does not generalize to the class of probabilistic models that SVI can handle. Similar to [9], DMFVI by [10] uses ADMM for decentralization, as we do, but lacks extendability to online updates, a fast convergence rate, and other desirable properties of SVI. Distributed VBA [11] also uses ADMM; however, that approach does not scale to large data sets without demanding substantial computational resources.
None of these works use stochastic optimization methods to speed up inference, and hence they are fundamentally different from standard SVI itself. One recent work, Extreme SVI [12], retains the benefits of SVI while making it distributed and asynchronous. It employs a rather simple algorithmic change to SVI, but its scope remains limited: it focuses on Gaussian mixture models and does not show how the approach extends to the other probabilistic models for which SVI works in general.
In contrast to the related works highlighted above, our approach extends the general SVI framework to a networked stochastic-optimization consensus problem. We tackle the issue of generalizability by providing a solution that applies to all graphical models handled by SVI, and we retain the stochastic gradient updates that make it fast. We run ADMM updates alongside stochastic gradient ascent on the variational objective to reach consensus among a number of distributed learners. SVI itself being a nonconvex stochastic optimization problem makes the distributed problem trickier. Our work aims to show that independent learners that use SVI for similar applications can collaborate by exchanging their results (not the data itself) to benefit from each other, improving the overall accuracy of the results. This makes our work applicable to a wide range of distributed large-scale inference problems. Moreover, this approach poses a game problem with multiple agents, each interested in performing its own inference task while simultaneously benefiting from the others through reinforcement learning and cooperation.
II ADMM-based Distributed SVI
Building upon the recent work on SVI by Hoffman et al. [1], we consider the SVI problem for a network of learners. The observations are $x = x_{1:N}$; the vector of global hidden variables is $\beta$; the local hidden variables are $z = z_{1:N}$, each of which is a collection of $J$ variables $z_n = z_{n,1:J}$; the vector of fixed parameters is $\alpha$. (Note we can easily allow $\alpha$ to partly govern any of the random variables, such as fixed parts of the conditional distribution of observations. To keep notation simple, we assume that they only govern the global hidden variables.)
II-A Optimization problem
For $N$ learners indexed by $i$, the consensus problem is

$$\min_{\lambda_1,\dots,\lambda_N,\, z} \;\; \sum_{i=1}^{N} f_i(\lambda_i) \qquad \text{subject to} \;\; \lambda_i = z, \quad \lambda_i \in \mathcal{C}, \quad i = 1,\dots,N,$$

where each $\lambda_i$ is an $m$-sized vector of global variational parameters and $\mathcal{C}$ indicates the feasible set for the variables, and

$$f_i(\lambda_i) = -\mathcal{L}(\lambda_i) = -\Big(\mathbb{E}_q\big[\log p(x^{(i)}, z^{(i)}, \beta)\big] - \mathbb{E}_q\big[\log q(z^{(i)}, \beta)\big]\Big), \tag{1}$$

the negative of the evidence lower bound (ELBO), which is the standard SVI problem objective function for a single learner. The above optimization problem gives us a consensus solution for the $N$ learners. Using an augmented Lagrangian approach, as in ADMM, we solve this problem in a distributed iterative fashion for multiple learners.
II-B ADMM-based solution
An augmented Lagrangian with a quadratic penalty is used to arrive at the ADMM update iterations. The Lagrange multipliers are denoted by $y_i$. Minimization updates for each processor/agent $i$ are given as:

$$\lambda_i^{k+1} = \arg\min_{\lambda_i \in \mathcal{C}} \; f_i(\lambda_i) + (y_i^{k})^{\top}(\lambda_i - z^{k}) + \frac{\rho}{2}\,\big\|\lambda_i - z^{k}\big\|_2^2,$$
$$z^{k+1} = \frac{1}{N}\sum_{i=1}^{N}\Big(\lambda_i^{k+1} + \frac{1}{\rho}\,y_i^{k}\Big), \qquad y_i^{k+1} = y_i^{k} + \rho\,\big(\lambda_i^{k+1} - z^{k+1}\big),$$

where $z$ is called the central collector, and $\rho > 0$ is the quadratic penalty parameter in the augmented Lagrangian, which is given as:

$$L_\rho\big(\{\lambda_i\}, z, \{y_i\}\big) = \sum_{i=1}^{N}\Big(f_i(\lambda_i) + (y_i)^{\top}(\lambda_i - z) + \frac{\rho}{2}\,\big\|\lambda_i - z\big\|_2^2\Big).$$
Here, we note that the minimization update for $\lambda_i$ requires solving a constrained nonconvex optimization problem. We solve it iteratively in gradient-descent fashion, just as the standard SVI problem is solved, but the original solution requires inversion of a Hessian matrix. For that we take into account the one-step-earlier value of the central collector; for details about the derivation and the Hessian-inversion approximation used, see the Appendix. Thereby, our proposed iterative ADMM methodology runs alongside a gradient-descent iterative update of the variational variables, which is completely summarized in Algorithm 1.
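For reference, the consensus-ADMM iterations above can be exercised on a toy problem in which each agent's objective is a simple quadratic, so the local minimization has a closed form (in our actual setting it is the nonconvex variational objective, solved by gradient steps); all names below are illustrative:

```python
import numpy as np

def consensus_admm(a, rho=1.0, iters=100):
    """Consensus ADMM for min sum_i 0.5*(x_i - a_i)^2  s.t. x_i = z.

    z is the central collector and y_i are the Lagrange multipliers;
    the consensus solution is the average of the targets a_i.
    """
    x = np.zeros_like(a)
    y = np.zeros_like(a)
    z = 0.0
    for _ in range(iters):
        x = (a - y + rho * z) / (1.0 + rho)   # local closed-form minimizations
        z = np.mean(x + y / rho)              # central collector update
        y = y + rho * (x - z)                 # dual ascent on the multipliers
    return x, z
```

For example, `consensus_admm(np.array([1., 2., 3., 10.]))` drives every `x_i` and `z` to the average 4.0.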
II-C Experimental results
A new set of experiments for the distributed problem was performed with multiple learners. Here, we show the results with four learners. Figure 1 shows the convergence properties of our distributed learners for a metric of the estimated model’s fitness known as the held-out perplexity. The same metric was used by Hoffman et al. [13] to show convergence of their algorithm. A comparison of the centralized versus distributed two-player SVI algorithms is depicted in Figure 2. We conclude that all the learners not only converge to higher precision in their estimates (evident from Figure 1), but also achieve accurate estimates (evident from Table I), while simultaneously maintaining consensus.
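For reference, per-word held-out perplexity is the exponentiated negative average predictive log-likelihood per held-out token (lower is better); a minimal sketch, with illustrative array names:

```python
import numpy as np

def held_out_perplexity(doc_log_probs, doc_token_counts):
    """Perplexity = exp(- sum of predictive log-likelihoods
                        / total number of held-out tokens)."""
    return float(np.exp(-np.sum(doc_log_probs) / np.sum(doc_token_counts)))
```

A model that assigned every token probability 1/V over a vocabulary of size V would score a perplexity of exactly V.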
From Table I, we see that the highly probable words for a given topic learned by any of the four learners all point to a similar kind of subject. For example, in Topic#98 the learners understood it to represent the names of months; even though the distribution of word occurrences for this topic differs across the four players, it is evident that they all point to the same abstract class of words. Similarly, in the other topics we see similarity in the estimates. Sometimes the descending order of the words is exactly the same, e.g., in Topic#72 and Topic#38 all learners have the same ordering of words. Thus, the table shows that despite the fact that all learners use their own independently fetched datasets of Wikipedia articles, consensus between results is achieved among all the learners, due to the central collection constraint.

TABLE I
            Player 1     Player 2     Player 3     Player 4
Topic#98    june         september    september    september
            march        october      october      october
            november     november     november     november
Topic#72    elected      elected      elected      elected
            democratic   democratic   democratic   democratic
            republican   republican   republican   republican
Topic#59    functions    functions    actor        functions
            users        users        functions    actor
            file         file         user         user
Topic#56    university   university   university   university
            college      college      college      education
            education    education    education    college
Topic#38    music        music        music        music
            song         song         song         song
            single       single       single       single
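The topics above come from a latent Dirichlet allocation (LDA) model (cf. [13]). For orientation, a toy sketch of LDA's generative process, marking which hidden variables are global and which are local in the SVI sense (all dimensions and hyperparameters below are illustrative):

```python
import numpy as np

def generate_lda_corpus(D=100, K=5, V=50, doc_len=40,
                        alpha=0.5, eta=0.1, seed=0):
    """Toy LDA generative process.

    Global hidden variables: the K topic-word distributions beta.
    Local hidden variables:  per-document proportions theta and
                             per-word topic assignments z.
    Fixed parameters:        the Dirichlet hyperparameters alpha, eta.
    """
    rng = np.random.default_rng(seed)
    beta = rng.dirichlet(np.full(V, eta), size=K)       # global: topics
    docs = []
    for _ in range(D):
        theta = rng.dirichlet(np.full(K, alpha))        # local: proportions
        z = rng.choice(K, size=doc_len, p=theta)        # local: assignments
        words = np.array([rng.choice(V, p=beta[k]) for k in z])
        docs.append(words)
    return beta, docs
```

In SVI, the global variational parameters over beta are what the learners exchange through ADMM; the local variables are re-estimated per sampled document.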
III ADMM-based Networked SVI
Now, after conclusive results for the distributed fully-connected SVI algorithm, we move on to a network of nodes with an independent learner residing at each node. The only difference in the problem formulation, as we will see, is in the equality constraints. We use the network formulation given in [14].
The network is modeled by an undirected graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, with $\mathcal{V}$ representing the set of nodes and $\mathcal{E}$ representing the set of links between nodes. Node $i$ only communicates with its neighboring nodes $j \in \mathcal{N}_i$. Note that, without loss of generality, the graph is assumed to be connected. The network can contain cycles. An example of such a network is shown in Figure 4.
At every node $i$, a set of observations $x^{(i)} = x^{(i)}_{1:N_i}$ of size $N_i$ is available, where $x^{(i)}_n$ denotes the $n$th observation at the $i$th node. Though not explicitly expressed, each $x^{(i)}_n$ can be a collection of multiple random variables. The vector of global hidden variables for node $i$ is $\beta_i$; its local hidden variables are $z^{(i)} = z^{(i)}_{1:N_i}$, each of which is a collection of $J$ variables $z^{(i)}_n = z^{(i)}_{n,1:J}$; the vector of fixed parameters is $\alpha$.
With the graph formulation given above, we pose the distributed SVI problem for a network of $N$ learners as:

$$\min_{\{\lambda_i\}} \;\; \sum_{i=1}^{N} f_i(\lambda_i) \qquad \text{subject to} \;\; \lambda_i = \lambda_j, \quad j \in \mathcal{N}_i, \; i = 1,\dots,N, \tag{2}$$

with $\{\lambda_i\}_{i=1}^{N}$ as variables. Here, $f_i$ is a nonlinear function of $\lambda_i$, rewritten here:

$$f_i(\lambda_i) = -\Big(\mathbb{E}_q\big[\log p(x^{(i)}, z^{(i)}, \beta_i)\big] - \mathbb{E}_q\big[\log q(z^{(i)}, \beta_i)\big]\Big).$$
Optimization problem (2) is equivalent to the following:

$$\min_{\{\lambda_i\},\{\omega_{ij}\}} \;\; \sum_{i=1}^{N} f_i(\lambda_i) \qquad \text{subject to} \;\; \lambda_i = \omega_{ij}, \;\; \omega_{ij} = \lambda_j, \quad j \in \mathcal{N}_i, \; i = 1,\dots,N, \tag{3}$$

where $\omega_{ij}$ are redundant variables that will facilitate the decoupling of the variable $\lambda_i$ at node $i$ from those of its neighboring nodes $j \in \mathcal{N}_i$. This problem will be solved using its dual. We denote the Lagrange multipliers by $\alpha_{ij}$ ($\beta_{ij}$) for the constraints $\lambda_i = \omega_{ij}$ ($\omega_{ij} = \lambda_j$). We observe that for each $i$ we have $2|\mathcal{N}_i|$ equality constraints. The augmented Lagrangian with a quadratic penalty is:

$$L_\rho = \sum_{i=1}^{N} f_i(\lambda_i) + \sum_{i=1}^{N}\sum_{j \in \mathcal{N}_i}\Big[\alpha_{ij}^{\top}(\lambda_i - \omega_{ij}) + \beta_{ij}^{\top}(\omega_{ij} - \lambda_j)\Big] + \frac{\rho}{2}\sum_{i=1}^{N}\sum_{j \in \mathcal{N}_i}\Big[\big\|\lambda_i - \omega_{ij}\big\|_2^2 + \big\|\omega_{ij} - \lambda_j\big\|_2^2\Big]. \tag{4}$$
The augmented Lagrangian can be iteratively minimized with respect to each variable by keeping others constant, which gives us a set of minimization updates for each variable summarized in the following Proposition.
III-A Proposition 1
The distributed iterations solving (3) are as follows:

$$\lambda_i^{k+1} = \arg\min_{\lambda_i} \; f_i(\lambda_i) + \sum_{j \in \mathcal{N}_i}\big(\alpha_{ij}^{k} - \beta_{ji}^{k}\big)^{\top}\lambda_i + \frac{\rho}{2}\sum_{j \in \mathcal{N}_i}\Big[\big\|\lambda_i - \omega_{ij}^{k}\big\|_2^2 + \big\|\omega_{ji}^{k} - \lambda_i\big\|_2^2\Big], \tag{5}$$
$$\omega_{ij}^{k+1} = \arg\min_{\omega_{ij}} \; \big(\beta_{ij}^{k} - \alpha_{ij}^{k}\big)^{\top}\omega_{ij} + \frac{\rho}{2}\Big[\big\|\lambda_i^{k+1} - \omega_{ij}\big\|_2^2 + \big\|\omega_{ij} - \lambda_j^{k+1}\big\|_2^2\Big], \tag{6}$$
$$\alpha_{ij}^{k+1} = \alpha_{ij}^{k} + \rho\,\big(\lambda_i^{k+1} - \omega_{ij}^{k+1}\big), \tag{7}$$
$$\beta_{ij}^{k+1} = \beta_{ij}^{k} + \rho\,\big(\omega_{ij}^{k+1} - \lambda_j^{k+1}\big), \tag{8}$$

where (7) and (8) correspond to the dual updates of the standard ADMM solver discussed in [15].
Proof
The first task is to cast problem (3) into the standard ADMM problem form of [15]. The network description adopted here is similar to the one used in [16], and thus we use it to establish equivalence with standard ADMM [15]. Thereby, the remaining form of the minimization updates is derived directly from the augmented Lagrangian given in (4). The minimization update (5) is derived by eliminating the terms of the augmented Lagrangian that do not affect the minimization over $\lambda_i$:

$$\lambda_i^{k+1} = \arg\min_{\lambda_i} \; f_i(\lambda_i) + \sum_{j \in \mathcal{N}_i}\big(\alpha_{ij}^{k}\big)^{\top}\lambda_i - \sum_{j \in \mathcal{N}_i}\big(\beta_{ji}^{k}\big)^{\top}\lambda_i + \frac{\rho}{2}\sum_{j \in \mathcal{N}_i}\big\|\lambda_i - \omega_{ij}^{k}\big\|_2^2 + \frac{\rho}{2}\sum_{j \in \mathcal{N}_i}\big\|\omega_{ji}^{k} - \lambda_i\big\|_2^2,$$

which upon merging the two summations reduces to (5). Similarly, the minimization update (6) comes directly from

$$\omega_{ij}^{k+1} = \arg\min_{\omega_{ij}} \; -\big(\alpha_{ij}^{k}\big)^{\top}\omega_{ij} + \big(\beta_{ij}^{k}\big)^{\top}\omega_{ij} + \frac{\rho}{2}\big\|\lambda_i^{k+1} - \omega_{ij}\big\|_2^2 + \frac{\rho}{2}\big\|\omega_{ij} - \lambda_j^{k+1}\big\|_2^2.$$
Next we reduce the iteration equations to a simpler form. Here, we observe that the $\omega_{ij}$ update has the following unique solution (obtained by setting the derivative equal to zero and solving):

$$\omega_{ij}^{k+1} = \frac{\lambda_i^{k+1} + \lambda_j^{k+1}}{2} + \frac{\alpha_{ij}^{k} - \beta_{ij}^{k}}{2\rho}. \tag{9}$$
Putting (9) in (7)–(8) gives

$$\alpha_{ij}^{k+1} = \frac{\alpha_{ij}^{k} + \beta_{ij}^{k}}{2} + \frac{\rho}{2}\,\big(\lambda_i^{k+1} - \lambda_j^{k+1}\big), \tag{10}$$
$$\beta_{ij}^{k+1} = \frac{\alpha_{ij}^{k} + \beta_{ij}^{k}}{2} + \frac{\rho}{2}\,\big(\lambda_i^{k+1} - \lambda_j^{k+1}\big). \tag{11}$$
Now, we assume that both Lagrange multipliers are identically initialized at every node $i$ as zero, $\alpha_{ij}^{0} = \beta_{ij}^{0} = 0$. By (10)–(11), this ensures that $\alpha_{ij}^{1} = \beta_{ij}^{1}$, and $\alpha_{ij}^{2} = \beta_{ij}^{2}$, and so on. We see that only one of the two multipliers per node needs to be updated at each time step. Furthermore, (9) simplifies to

$$\omega_{ij}^{k+1} = \frac{\lambda_i^{k+1} + \lambda_j^{k+1}}{2}. \tag{12}$$
Finally, the ADMM iterations simplify, as summarized in the following Proposition.
III-B Proposition 2
With all Lagrange multipliers initialized to zero, the distributed iterations solving (3) reduce to:

$$\lambda_i^{k+1} = \arg\min_{\lambda_i} \; f_i(\lambda_i) + 2\big(\alpha_i^{k}\big)^{\top}\lambda_i + \rho\sum_{j \in \mathcal{N}_i}\Big\|\lambda_i - \frac{\lambda_i^{k} + \lambda_j^{k}}{2}\Big\|_2^2, \tag{13}$$
$$\alpha_i^{k+1} = \alpha_i^{k} + \frac{\rho}{2}\sum_{j \in \mathcal{N}_i}\big(\lambda_i^{k+1} - \lambda_j^{k+1}\big), \tag{14}$$

where $\alpha_i^{k} := \sum_{j \in \mathcal{N}_i} \alpha_{ij}^{k}$.
Proof
Substituting (12) into the objective (5), for all nodes jointly, gives the following:

$$L_\rho^{k} = \sum_{i=1}^{N} f_i(\lambda_i) + \sum_{i=1}^{N}\sum_{j \in \mathcal{N}_i}\big(\alpha_{ij}^{k} - \beta_{ji}^{k}\big)^{\top}\lambda_i + \frac{\rho}{2}\sum_{i=1}^{N}\sum_{j \in \mathcal{N}_i}\Big[\Big\|\lambda_i - \frac{\lambda_i^{k} + \lambda_j^{k}}{2}\Big\|_2^2 + \Big\|\frac{\lambda_i^{k} + \lambda_j^{k}}{2} - \lambda_i\Big\|_2^2\Big]. \tag{15}$$

Note that $\{\lambda_i\}$ is the set of all variables of optimization, and the superscripted quantities denote constants known from the previous iteration. All-zero initialization of the Lagrange multipliers implies that $\beta_{ji}^{k} = \alpha_{ji}^{k} = -\alpha_{ij}^{k}$ [cf. (10)–(11)], and so the first double sum in (15) can be rewritten as:

$$\sum_{i=1}^{N}\sum_{j \in \mathcal{N}_i}\big(\alpha_{ij}^{k} - \beta_{ji}^{k}\big)^{\top}\lambda_i = \sum_{i=1}^{N} 2\Big(\sum_{j \in \mathcal{N}_i}\alpha_{ij}^{k}\Big)^{\top}\lambda_i. \tag{16}$$

The other two double sums in (15) can be simplified to give

$$\frac{\rho}{2}\sum_{i=1}^{N}\sum_{j \in \mathcal{N}_i}\Big[\Big\|\lambda_i - \frac{\lambda_i^{k} + \lambda_j^{k}}{2}\Big\|_2^2 + \Big\|\frac{\lambda_i^{k} + \lambda_j^{k}}{2} - \lambda_i\Big\|_2^2\Big] = \rho\sum_{i=1}^{N}\sum_{j \in \mathcal{N}_i}\Big\|\lambda_i - \frac{\lambda_i^{k} + \lambda_j^{k}}{2}\Big\|_2^2. \tag{17}$$

By defining $\alpha_i^{k} := \sum_{j \in \mathcal{N}_i}\alpha_{ij}^{k}$, and substituting (16) and (17) into (15), we obtain the final form of the augmented Lagrangian, which decouples across nodes and completes the proof:

$$L_\rho^{k} = \sum_{i=1}^{N}\Big[f_i(\lambda_i) + 2\big(\alpha_i^{k}\big)^{\top}\lambda_i + \rho\sum_{j \in \mathcal{N}_i}\Big\|\lambda_i - \frac{\lambda_i^{k} + \lambda_j^{k}}{2}\Big\|_2^2\Big].$$
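The simplified per-node iterations (one multiplier per node, with the auxiliary edge variables reduced to neighbor midpoints) can be checked numerically on a toy problem: quadratic local objectives over a small graph, whose consensus solution is the average of the local targets. The sketch below assumes this toy setting; in the actual algorithm the closed-form local step is replaced by stochastic gradient steps on the variational objective:

```python
import numpy as np

def networked_admm(a, neighbors, rho=1.0, iters=2000):
    """Decentralized consensus ADMM over a graph (cf. [15], [16])
    for the quadratics f_i(l) = 0.5*(l - a[i])**2.

    With all multipliers initialized to zero, each edge's auxiliary
    variable is the midpoint of the two incident estimates, and a
    single aggregated multiplier per node is updated.
    """
    n = len(a)
    lam = np.zeros(n)      # per-node estimates
    alpha = np.zeros(n)    # one aggregated multiplier per node
    for _ in range(iters):
        lam_old = lam.copy()
        for i in range(n):
            mids = sum((lam_old[i] + lam_old[j]) / 2.0 for j in neighbors[i])
            deg = len(neighbors[i])
            # Closed-form minimizer of the local step:
            # (l - a[i]) + 2*alpha[i] + 2*rho*(deg*l - mids) = 0
            lam[i] = (a[i] - 2.0 * alpha[i] + 2.0 * rho * mids) / (1.0 + 2.0 * rho * deg)
        for i in range(n):
            alpha[i] += (rho / 2.0) * sum(lam[i] - lam[j] for j in neighbors[i])
    return lam
```

On a connected graph every node's estimate converges to the network-wide average of the a_i without any central collector; only neighbor-to-neighbor exchanges of the current estimates are needed.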
III-C Network solution
Now, we present a solution to the ADMM minimization update (13), which is a nonconvex optimization problem, similar to the corresponding update in the distributed SVI of Section II-B. We make use of stochastic gradient descent, like the standard SVI algorithm, for minimization of the augmented Lagrangian (cf. [1]). It is known from [1] that, for conditionally conjugate exponential-family models, the natural gradient of the ELBO with respect to $\lambda_i$ is given as

$$\hat{\nabla}_{\lambda_i}\mathcal{L} = \mathbb{E}_q\big[\eta_g\big(x^{(i)}, z^{(i)}, \alpha\big)\big] - \lambda_i.$$

The solution is presented in Algorithm 2.
III-D Experimental results
The issue of cross-matching the topics between two different players was solved using a correlation metric. As discussed earlier, SVI relies on random initialization of the global parameters. Each player therefore initializes with different global parameters and updates them as observations are encountered. Since, in our setting, each player has its own independent dataset, the trajectory of convergence to the ‘true’ topics is different for every player. For the same reason, even two completely independent learners fed the same data converge to similar estimates along different trajectories (i.e., a given topic index for player A may truly represent the contents of a different topic index for player B, and so on). Thus, in order to match the right topics, a correlation metric was needed; we used the Pearson correlation coefficient.
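A sketch of this matching step, assuming each player's learned topics are the rows of a (K, V) topic-word matrix and using a greedy assignment by Pearson correlation (the names and the greedy strategy are illustrative; the paper specifies only the correlation metric itself):

```python
import numpy as np

def match_topics(beta_a, beta_b):
    """Match player A's topics to player B's by Pearson correlation.

    beta_a, beta_b -- (K, V) arrays of per-topic word probabilities.
    Returns m with m[k] = index of B's topic matched to A's topic k.
    """
    K = beta_a.shape[0]
    # Rows of both matrices are stacked; take the cross-correlation block.
    corr = np.corrcoef(beta_a, beta_b)[:K, K:]
    matched, used = [], set()
    for k in range(K):
        # Greedily pick B's most correlated, not-yet-used topic.
        j = next(int(idx) for idx in np.argsort(-corr[k]) if idx not in used)
        matched.append(j)
        used.add(j)
    return matched
```

A true counterpart topic yields a correlation of exactly 1 when the word distributions coincide, so permuted copies of the same topics are recovered exactly.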
The result of an experiment that used a line-type graph is shown in Fig. (b); the network itself is shown in Fig. (a). In this experiment, two of the nodes were provided with exactly the same set of data (limited to a fixed set of documents available offline that were fed repeatedly), while the remaining three nodes were provided with online data, i.e., independent and new data points at each iteration. The purpose of this experiment was to see how the connected nodes corroborate to improve the estimation accuracy of one of the offline-data nodes, in contrast to the accuracy of the other, independently running, offline-data learner. The perplexity trajectory shown in Fig. (b) supports our claim that node interaction through the ADMM updates is certainly beneficial: we achieve better accuracy in the estimate of the connected node than in that of the independent one, because the learning at the neighboring nodes affects it through the consensus constraint.
IV Discussion and Conclusion
We have presented distributed ADMM-based SVI and an extension of it over a network of learners in a graph: an algorithm that solves separable stochastic optimization problems and merges their results to achieve an optimal consensus solution. Applications of distributed learning-agent systems are common in the IoT framework, especially when different learning systems do not want to share data with each other but still agree on partial collaboration and transfer learning. The well-trod example of latent Dirichlet allocation for probabilistic topic models is implemented to show comparative results for the centralized, distributed (fully-connected), and networked settings. One key observation in the results of our networked SVI algorithm is that strongly connected networks exhibit substantial transfer-learning benefits, as highlighted in the comparative perplexity experiments. Another observation is that the accuracy of estimation improves over time, as more and more data is analyzed. In a nutshell, the results show that, through collaboration without having to share private data, two or more independent model-posterior learners for SVI can improve their learning capabilities. Due to the use of stochastic optimization, this algorithm is considerably fast, scalable, and accurate. Moreover, its distributed learning methodology enhances the security and robustness aspects that underpin modern deep learning goals. In the future, we intend to apply this approach to cyber-physical dynamical systems, along with inspecting guarantees on the convergence properties of the algorithm.
References
 [1] M. D. Hoffman, D. M. Blei, C. Wang, and J. Paisley, “Stochastic variational inference,” The Journal of Machine Learning Research, vol. 14, no. 1, pp. 1303–1347, 2013.
 [2] D. Koller and N. Friedman, Probabilistic graphical models: principles and techniques. MIT press, 2009.
 [3] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul, “An introduction to variational methods for graphical models,” Machine learning, vol. 37, no. 2, pp. 183–233, 1999.

 [4] N. Foti, J. Xu, D. Laird, and E. Fox, “Stochastic variational inference for hidden Markov models,” in Advances in Neural Information Processing Systems, 2014, pp. 3599–3607.
 [5] M. Johnson and A. S. Willsky, “Stochastic variational inference for Bayesian time series models,” in ICML, 2014, pp. 1854–1862.
 [6] J. Hensman, N. Fusi, and N. D. Lawrence, “Gaussian processes for big data,” arXiv preprint arXiv:1309.6835, 2013.
 [7] Y. Gal, M. van der Wilk, and C. Rasmussen, “Distributed variational inference in sparse Gaussian process regression and latent variable models,” in Advances in Neural Information Processing Systems, 2014, pp. 3257–3265.
 [8] M. D. Hoffman and D. M. Blei, “Structured stochastic variational inference,” in Artificial Intelligence and Statistics, 2015.
 [9] T. Campbell, J. Straub, J. W. Fisher III, and J. P. How, “Streaming, distributed variational inference for Bayesian nonparametrics,” in Advances in Neural Information Processing Systems, 2015, pp. 280–288.
 [10] B. Babagholami-Mohamadabadi, S. Yoon, and V. Pavlovic, “DMFVI: Distributed mean field variational inference using Bregman ADMM,” arXiv preprint arXiv:1507.00824, 2015.
 [11] J. Hua and C. Li, “Distributed variational bayesian algorithms over sensor networks,” IEEE Transactions on Signal Processing, vol. 64, no. 3, pp. 783–798, 2016.
 [12] P. Raman, J. Zhang, H.F. Yu, S. Ji, and S. Vishwanathan, “Extreme stochastic variational inference: Distributed and asynchronous,” arXiv preprint arXiv:1605.09499, 2016.
 [13] M. Hoffman, F. R. Bach, and D. M. Blei, “Online learning for latent Dirichlet allocation,” in Advances in Neural Information Processing Systems, 2010, pp. 856–864.
 [14] R. Zhang and Q. Zhu, “Secure and resilient distributed machine learning under adversarial environments,” in Information Fusion (Fusion), 2015 18th International Conference on. IEEE, 2015, pp. 644–651.
 [15] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, “Distributed optimization and statistical learning via the alternating direction method of multipliers,” Foundations and Trends® in Machine Learning, vol. 3, no. 1, pp. 1–122, 2011.

 [16] P. A. Forero, A. Cano, and G. B. Giannakis, “Consensus-based distributed support vector machines,” Journal of Machine Learning Research, vol. 11, pp. 1663–1707, 2010.