ADMM-based Networked Stochastic Variational Inference

02/27/2018 ∙ by Hamza Anwar, et al. ∙ New York University

Owing to the recent advances in "Big Data" modeling and prediction tasks, variational Bayesian estimation has gained popularity due to its ability to provide exact solutions to approximate posteriors. One key technique for approximate inference is stochastic variational inference (SVI). SVI poses variational inference as a stochastic optimization problem and solves it iteratively using noisy gradient estimates. It aims to handle massive data for prediction and classification tasks by applying complex Bayesian models that have observed as well as latent variables. This paper aims to decentralize SVI, allowing for parallel computation, secure learning, and robustness benefits. We use the alternating direction method of multipliers (ADMM) in a top-down setting to develop a distributed SVI algorithm such that independent learners running inference algorithms only need to share their estimated model parameters instead of their private datasets. Our work extends the distributed SVI-ADMM algorithm that we first propose to an ADMM-based networked SVI algorithm, in which the learners not only work distributively but also share information according to the rules of a graph over which they form a network. This kind of work lies under the umbrella of 'deep learning over networks', and we verify our algorithm on a topic-modeling problem for a corpus of Wikipedia articles. We illustrate the results on a latent Dirichlet allocation (LDA) topic model for large-scale document classification, compare performance with the centralized algorithm, and use numerical experiments to corroborate the analytical results.

I Introduction

The explosive influx of data and information in modern technological systems has opened doors to revolutionary possibilities. One of the most vital uses of this data is in modeling, visualizing, and analyzing large data sets through probabilistic tools. Statistical machine learning is at the core of numerous such applications in what is becoming known as the Internet of Things (IoT). Such iterative learning mechanisms improve control performance in many cyber-physical systems that require parameter estimation and system identification. Probabilistic graphical modeling is one key research area that has aided data analysis in inference and prediction tasks [2]. These models visually express assumptions about data and its hidden structure. Posterior inference algorithms exploit such models to explain this hidden structure while being adaptive, robust, parallelizable, and scalable.

Variational inference, dating from the late 1990s, is a method that transforms complex inference problems into high-dimensional optimization problems. In contrast to Monte Carlo sampling methods, which provide approximate (sampled) answers to the exact inference problem, the variational Bayesian approach solves an approximating, constrained formulation of the inference problem exactly through optimization [3]. Along the same lines, stochastic variational inference (SVI) was developed recently; it extends variational inference so that it can be solved using stochastic optimization under certain assumptions [1]. SVI works iteratively in a gradient-ascent fashion using noisy gradient estimates. It provides approximate model posteriors with only a few passes through a large data collection, making it highly scalable. We propose ADMM-based networked SVI, a distributed stochastic variational inference technique that builds upon standard SVI and retains most of its benefits; it is based on the highly parallelizable alternating direction method of multipliers (ADMM), with agents connected as the nodes and edges of a graph.
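For reference, the core iteration of standard SVI [1] follows a noisy natural gradient of the variational objective computed from a single uniformly sampled data point. In generic form (the symbols below are ours, chosen to match common usage, not necessarily this paper's):

```latex
% Standard SVI step (notation assumed): \hat{\lambda}_t is the intermediate
% global parameter computed from one sampled observation, and \rho_t is a
% decaying step size satisfying the usual Robbins-Monro conditions.
\lambda^{(t)} = (1 - \rho_t)\,\lambda^{(t-1)} + \rho_t\,\hat{\lambda}_t,
\qquad
\rho_t = (t + \tau)^{-\kappa}, \quad \kappa \in (0.5, 1],\ \tau \ge 0 .
```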

I-A Related Work

Numerous extensions have been proposed for the SVI framework, covering more model classes ([4] and [5]), different underlying processes ([6] and [7]), and structural exploitations ([8]) that make it faster and more widely deployable. A variety of works focus on making variational methods distributed to enhance parallelizability. The work on distributed Bayesian nonparametric models by [9] is commendable in making variational inference updates distributed, asynchronous, and 'streaming' (online), and it has been shown to outperform standard SVI. However, that work is specific to the Dirichlet process mixture and lacks generalizability to the class of probabilistic models that SVI can handle. Similar to [9], D-MFVI by [10] uses ADMM for decentralization, as we do, but it is not extendable to online updates and lacks the fast convergence rate and other desirable properties of SVI. Distributed VBA [11] also uses ADMM; however, that approach does not scale to large data without demanding substantial computational resources.

None of these works use stochastic optimization methods to speed up inference, and hence they are fundamentally different from standard SVI itself. One recent work, extreme SVI [12], retains the benefits of SVI while making it distributed and asynchronous. It employs a rather simple algorithmic change to SVI; however, its scope remains limited, as it focuses specifically on Gaussian mixture models and does not show how the approach extends to the other probabilistic models for which SVI works in general.

In contrast to all the related works highlighted above, our approach extends the general SVI framework to a networked stochastic optimization consensus problem. We address generalizability across graphical models by providing a general solution, and we use stochastic gradient updates to keep inference fast. We run ADMM updates along with stochastic gradient ascent on the variational objective to reach consensus among a number of distributed learners. Since SVI is itself a non-convex stochastic optimization problem, the distributed problem is trickier. Our work aims to show that independent learners that use SVI for similar applications can collaborate by exchanging their results (not the data itself) and thereby benefit from each other, improving the overall accuracy of their estimates. This approach makes our work unique and applicable to a wider range of distributed large-scale inference problems. Moreover, this kind of approach poses a game problem with multiple agents interested in performing their own inference tasks while simultaneously benefiting from each other through reinforcement learning and cooperation.

II ADMM-based Distributed SVI

Building upon the recent work on SVI by Hoffman et al. [1], we consider the SVI problem for a network of learners.

The observations are x = x_{1:N}; the vector of global hidden variables is β; the local hidden variables are z = z_{1:N}, each of which is a collection of J variables z_n = z_{n,1:J}; the vector of fixed parameters is α. (Note that we can easily allow α to partly govern any of the random variables, such as fixed parts of the conditional distribution of observations. To keep notation simple, we assume that the fixed parameters only govern the global hidden variables.)

II-A Optimization problem

subject to

where each is an -sized vector and indicates the feasible set for the variables (typically ) and,

(1)

which is the standard SVI objective function for a single learner. The above optimization problem yields a solution for the learners when they form a consensus. Using an augmented Lagrangian approach, as in ADMM, we solve this problem in a distributed, iterative fashion for multiple learners.
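Since the displayed problem is only sketched above, the following is a plausible global-consensus form written in standard notation; all symbols here (λ_i for learner i's global variational parameter, L_i for its objective, z for the consensus variable, Λ for the feasible set) are our assumptions rather than the paper's own:

```latex
% Global-consensus formulation (sketch; symbols assumed).
\min_{\{\lambda_i\},\, z}\ \ \sum_{i=1}^{N} -\mathcal{L}_i(\lambda_i)
\qquad \text{subject to} \qquad
\lambda_i = z, \quad \lambda_i \in \Lambda, \quad i = 1, \dots, N,
```

where each L_i is the single-learner SVI objective in (1) evaluated on learner i's private dataset.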

II-B ADMM-based solution

An augmented Lagrangian with a quadratic penalty is used to arrive at the ADMM update iterations, with Lagrange multipliers introduced for the consensus constraints. The minimization updates for each processor/agent are given as:

where the shared variable is called the central collector and the penalty parameter weights the quadratic term in the augmented Lagrangian, which is given as:

Here, we note that the ADMM minimization update for the global variational parameters requires solving a constrained non-convex optimization problem. We solve it in a gradient-based fashion, just as the standard SVI problem is solved, but the exact solution would require inverting a Hessian matrix. To avoid this, we take into account the value of the parameters from one step earlier; for details of the derivation and the Hessian-inversion approximation used, see the Appendix. Our proposed iterative ADMM methodology thus runs alongside a gradient-based iterative update of the variational variables and is summarized in Algorithm 1.
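For orientation, the generic global-variable-consensus ADMM iterations of [15] take the form below; this is the standard template under our assumed notation, not a reproduction of the paper's exact updates:

```latex
% Generic global-variable-consensus ADMM (cf. Boyd et al. [15]); symbols assumed.
\lambda_i^{k+1} = \arg\min_{\lambda_i \in \Lambda}\
   -\mathcal{L}_i(\lambda_i) + (y_i^{k})^{\top}(\lambda_i - z^{k})
   + \tfrac{\rho}{2}\,\lVert \lambda_i - z^{k} \rVert_2^2,
\qquad
z^{k+1} = \frac{1}{N}\sum_{i=1}^{N}\Bigl(\lambda_i^{k+1} + \tfrac{1}{\rho}\, y_i^{k}\Bigr),
\qquad
y_i^{k+1} = y_i^{k} + \rho\,\bigl(\lambda_i^{k+1} - z^{k+1}\bigr).
```

The λ_i-update here is exactly the kind of constrained non-convex subproblem discussed above, which is approximated with a gradient step.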

1: Initialize the variational parameters and ADMM variables
2: Schedule the step-size routine
3: repeat
4:     for each learner do
5:         Sample separate data points for all learners
6:         Use the sampled data points to compute the local variational parameters
7:         Apply the ADMM minimization update for the global parameters by computing the intermediate global parameters and the natural gradient
8:         Update the global variational parameters using gradient ascent
9:     end for
10:     Update the central collector
11:     Update all the Lagrange multipliers
12: until forever
Algorithm 1: ADMM-based distributed SVI for multiple players
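To make the control flow of Algorithm 1 concrete, the following is a minimal, self-contained Python sketch of the same pattern (consensus ADMM driven by noisy gradients) on a toy quadratic objective standing in for each learner's negative variational objective. All names, the objective, and the noise model are illustrative assumptions, not the paper's implementation; the LDA-specific local/global variational updates are replaced by a single noisy gradient step.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for each learner's objective: f_i(lam) = 0.5*||lam - theta_i||^2,
# observed only through noisy gradients (mimicking SVI's stochastic gradients).
N, dim, rho, T = 4, 5, 1.0, 200
theta = rng.normal(size=(N, dim))          # each learner's "private data" summary

def noisy_grad(i, lam_i):
    """Noisy gradient of learner i's local objective (assumption: Gaussian noise)."""
    return (lam_i - theta[i]) + 0.1 * rng.normal(size=dim)

lam = rng.normal(size=(N, dim))            # per-learner global parameters
z = np.zeros(dim)                          # central collector
y = np.zeros((N, dim))                     # Lagrange multipliers (dual variables)

for t in range(1, T + 1):
    step = (t + 10.0) ** -0.7              # decaying step size, as in SVI schedules
    for i in range(N):
        # One gradient step on the augmented Lagrangian w.r.t. lam[i]
        # (stands in for the ADMM minimization update of the global parameters).
        g = noisy_grad(i, lam[i]) + y[i] + rho * (lam[i] - z)
        lam[i] -= step * g
    # Central collector update (average of learners plus scaled duals).
    z = (lam + y / rho).mean(axis=0)
    # Dual updates enforcing consensus lam[i] = z.
    y += rho * (lam - z)

print("consensus spread:", np.max(np.abs(lam - z)))
print("distance of z from average of private optima:",
      np.linalg.norm(z - theta.mean(axis=0)))
```

On this toy problem the central collector converges toward the average of the learners' private optima, mirroring how the consensus constraint pools information across learners without sharing their data.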

II-C Experimental results

A new set of experiments for the distributed problem was performed with multiple learners; here, we show the results with four learners. Figure 1 shows the convergence behavior of our distributed learners in terms of a model-fitness metric known as held-out perplexity. The same metric was used by Hoffman et al. [13] to show convergence of their algorithm. A comparison of the centralized versus distributed two-player SVI algorithms is depicted in Figure 2. We conclude that all the learners not only converge to higher precision in their estimates (evident from Figure 1) but also achieve accurate estimates (evident from Table I), while simultaneously maintaining consensus.
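Held-out perplexity is typically computed as the exponentiated negative per-word log-likelihood (or its variational bound) on held-out documents; a minimal sketch, with names of our own choosing:

```python
import numpy as np

def held_out_perplexity(log_likelihoods, token_counts):
    """Perplexity = exp(-total held-out log-likelihood / total held-out tokens).

    log_likelihoods: per-document log-likelihoods (or variational bounds).
    token_counts:    number of tokens in each held-out document.
    """
    return float(np.exp(-np.sum(log_likelihoods) / np.sum(token_counts)))

# Toy usage with made-up numbers: lower values indicate a better fit.
print(held_out_perplexity(np.array([-350.0, -410.0]), np.array([50, 60])))
```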

Fig. 1: Four players running independent learners and collaborating. This plot shows that the perplexity is decreasing over time.
Fig. 2: Performance of a two-player network versus the centralized algorithm.

From Table I, we see that the most probable words for a given topic learned by any of the four learners all point to a similar subject. For example, in Topic#98 the learners understood the topic to represent the names of months; even though the distribution of word occurrences for this topic differs across the four players, it is evident that they all point to the same abstract class of words. We see similar agreement in the other topics as well. Sometimes the descending order of the words is exactly the same, e.g., in Topic#72 and Topic#38 all learners have the same ordering of words. Thus, the table shows that although each learner uses its own independently fetched dataset of Wikipedia articles, consensus among the learners is achieved, owing to the central-collector constraint.

             Player 1     Player 2     Player 3     Player 4
Topic#98     june         september    september    september
             march        october      october      october
             november     november     november     november
Topic#72     elected      elected      elected      elected
             democratic   democratic   democratic   democratic
             republican   republican   republican   republican
Topic#59     functions    functions    actor        functions
             users        users        functions    actor
             file         file         user         user
Topic#56     university   university   university   university
             college      college      college      education
             education    education    education    college
Topic#38     music        music        music        music
             song         song         song         song
             single       single       single       single
TABLE I: Top three words for five topics learned by each of the four players after 35 iterations, i.e., independent documents analyzed by each player. Words are written in descending order of probability of occurrence. Penalty parameter for ADMM, and total topics, were .

III ADMM-based Networked SVI

Now, having obtained conclusive results for the distributed, fully connected SVI algorithm, we move on to a network of nodes with an independent learner residing at each node. The only difference in the problem formulation, as we will see, lies in the equality constraints. We use the network formulation given in [14].

The network is modeled by an undirected graph with a set of nodes and a set of links between them. A node communicates only with its neighboring nodes. Note that, without loss of generality, the graph is assumed to be connected; the network can contain cycles. An example of such a network is shown in Figure 4.
At every node, a set of observations is available, where each observation (though not explicitly expressed) can be a collection of multiple random variables. Each node has its own vector of global hidden variables; its local hidden variables, each of which is a collection of variables; and its vector of fixed parameters.

Fig. 3: A graphical probabilistic model for each node in the graph
Fig. 4: Example of an undirected graph with its nodes and edges. Here, one node has the highest number of neighbors.

With the graph formulation given above, we pose the distributed SVI problem for a network of learners given as:

(2)

with the per-node global variational parameters as the variables. Here, the objective is a non-linear function of these variables, rewritten here:

Optimization problem (2) is equivalent to the following,

(3)

where the auxiliary (redundant) variables facilitate decoupling the variable at each node from those of its neighboring nodes. This problem will be solved using its dual. We denote the Lagrange multipliers associated with the respective constraints accordingly, and observe that each link contributes a pair of equality constraints. The augmented Lagrangian with a quadratic penalty is:

(4)

The augmented Lagrangian can be minimized iteratively with respect to each variable while keeping the others constant, which gives a set of minimization updates, one per variable, summarized in the following proposition.
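For readers reconstructing the displays, an edge-based consensus formulation of this kind typically reads as follows (this follows the style of [16]; the symbols λ_i, z_{ij}, and N_i are our assumptions):

```latex
% Edge-based consensus reformulation (sketch; symbols assumed, cf. [16]).
\min_{\{\lambda_i\},\,\{z_{ij}\}}\ \ \sum_{i \in V} -\mathcal{L}_i(\lambda_i)
\qquad \text{s.t.} \qquad
\lambda_i = z_{ij}, \quad \lambda_j = z_{ij},
\qquad \forall\, i \in V,\ \ \forall\, j \in \mathcal{N}_i,
```

so that the auxiliary variables z_{ij} decouple a node's parameter from those of its neighbors, and one dual variable attaches to each of the two constraints on every link.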

III-A Proposition 1

The distributed iterations solving (3) are as follows:

(5)
(6)
(7)
(8)

These iterations correspond to the standard ADMM solver discussed in [15].

Proof

The first task is to cast problem (3) into the standard ADMM form of [15]. The network description adopted here is similar to the one used in [16], and thus we use it to establish equivalence with standard ADMM [15]. The minimization updates then follow directly from the augmented Lagrangian given in (4). The update (5) is derived by eliminating from the augmented Lagrangian the terms that do not affect the minimization:

which, upon merging the two summations, reduces to (5). Similarly, the update (6) comes directly from,

Equations (7)–(8) are the dual variable updates (cf. [16]).

Next, we reduce the iteration equations to a simpler form. We observe that the auxiliary-variable update has the following unique solution (obtained by setting the derivative equal to zero and solving),

(9)

Substituting (9) into (7)–(8) gives,

(10)
(11)

Now, we assume that both Lagrange multipliers are initialized identically to zero at every node. This ensures that the stated relations hold at the first iteration, at the second, and so on; consequently, only one of the two multipliers per node needs to be updated at each time step. Furthermore, (9) simplifies to,

(12)
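If the auxiliary variable on each link is denoted z_{ij}, as in our earlier sketch, the zero-sum property of the two link multipliers under all-zero initialization would reduce the auxiliary update to a plain neighbor average; a hedged rendering (our notation, not necessarily the paper's (12)) is:

```latex
% Simplified auxiliary update under all-zero multiplier initialization
% (sketch; notation assumed).
z_{ij} = \tfrac{1}{2}\bigl(\lambda_i + \lambda_j\bigr), \qquad \forall\, (i, j) \in E .
```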

Finally, the ADMM iterations simplify, as summarized in the following proposition.

III-B Proposition 2

With the all-zero initialization selected above, the iterations (5)–(8) reduce to the following:

(13)
(14)

Proof

Substituting (12) into the objective (5) gives the following,

(15)

Note that the optimization variables are distinguished from the quantities that are constants known from the previous iteration. The all-zero initialization of the Lagrange multipliers implies the simplification noted above [cf. (10)–(11)], and so the first double sum in (15) can be rewritten as:

(16)

The other two double sums in (15) can be simplified to give,

(17)

Defining a suitable shorthand and substituting (16) and (17) into (15) gives the final form of the augmented Lagrangian, which completes the proof:

III-C Network solution

Now, we present a solution to the ADMM minimization update (13), which is a non-convex optimization problem, similar to the corresponding update in the distributed SVI of Section II-B. We make use of stochastic gradient steps, as in the standard SVI algorithm, to minimize the augmented Lagrangian (cf. [1]); the natural gradient of the variational objective, known from [1], is used in the update. The solution is presented in Algorithm 2.

1: Initialize the variational parameters and ADMM variables
2: Schedule the step-size routine
3: repeat
4:     for each node do
5:         Sample separate data points for all learners
6:         Use the sampled data points to compute the local variational parameters
7:         Apply the ADMM minimization update for the global parameters by computing the intermediate global parameters and the natural gradient of the augmented Lagrangian
8:         Update the global variational parameters using gradient ascent
9:     end for
10:     Update all the Lagrange multipliers
11: until forever
Algorithm 2: ADMM-based networked SVI for multiple players
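As a companion to Algorithm 2, the sketch below runs edge-based consensus ADMM (the formulation in (3), as we have rendered it above) on a toy quadratic objective over a five-node line graph. The graph, objectives, noise model, and all names are illustrative assumptions, and the exact parameter update is replaced by a single noisy gradient step, mirroring the inexact update used in the algorithm.

```python
import numpy as np

rng = np.random.default_rng(1)

# Line graph 0-1-2-3-4: nodes only exchange information with their neighbors.
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
n_nodes, dim, rho, T = 5, 4, 1.0, 300
theta = rng.normal(size=(n_nodes, dim))     # each node's private "data" summary

def noisy_grad(i, lam_i):
    """Noisy gradient of node i's toy local objective 0.5*||lam_i - theta_i||^2."""
    return (lam_i - theta[i]) + 0.1 * rng.normal(size=dim)

lam = rng.normal(size=(n_nodes, dim))                    # per-node global parameters
z = {e: np.zeros(dim) for e in edges}                    # auxiliary edge variables z_ij
y = {(e, v): np.zeros(dim) for e in edges for v in e}    # duals for lam_v = z_e

for t in range(1, T + 1):
    step = (t + 10.0) ** -0.7
    for i in range(n_nodes):
        incident = [e for e in edges if i in e]
        # Inexact parameter update: one noisy gradient step on the local
        # augmented-Lagrangian terms (edge duals + quadratic penalties).
        g = noisy_grad(i, lam[i])
        for e in incident:
            g += y[(e, i)] + rho * (lam[i] - z[e])
        lam[i] -= step * g
    for (i, j) in edges:
        e = (i, j)
        # Exact z-update; with zero-initialized duals this stays the plain
        # neighbor average of the two endpoints.
        z[e] = 0.5 * (lam[i] + lam[j]) + (y[(e, i)] + y[(e, j)]) / (2.0 * rho)
        # Dual updates for the two constraints on this edge.
        y[(e, i)] += rho * (lam[i] - z[e])
        y[(e, j)] += rho * (lam[j] - z[e])

print("max disagreement across edges:",
      max(np.abs(lam[i] - lam[j]).max() for (i, j) in edges))
print("distance to network-wide optimum:",
      np.linalg.norm(lam.mean(axis=0) - theta.mean(axis=0)))
```

Each node is pulled toward its neighbors by the edge penalties while following its own noisy gradients, so information spreads across the graph without any central collector.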

III-D Experimental results

The issue of cross-matching topics between two different players was solved using a correlation metric. As discussed earlier, SVI relies on random initialization of the global parameters, so each player initializes with different global parameters and updates them as observations are encountered. Since each player in our setting has its own independent dataset, the trajectory of convergence to the 'true' topics is different for every player. For the same reason, even if two completely independent learners are fed the same data, they converge to similar estimates but along different trajectories (i.e., a given topic index for player A may truly correspond to a different topic index for player B, and so on). Thus, a correlation metric was needed in order to match the right topics; we used the Pearson correlation coefficient for this.
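A minimal sketch of such a matching step, assuming each player's topics are stored as rows of a topic-word matrix; the function name, the greedy assignment rule, and the toy data are our illustrative choices, not necessarily the paper's procedure:

```python
import numpy as np

def match_topics(lambda_a, lambda_b):
    """Greedily match player B's topics to player A's by Pearson correlation.

    lambda_a, lambda_b: arrays of shape (n_topics, vocab_size) holding each
    player's topic-word distributions (or variational parameters).
    Returns a list whose k-th entry is the index of B's topic matched to A's topic k.
    """
    n_topics = lambda_a.shape[0]
    # Pearson correlation between every pair of topics (rows of A vs. rows of B).
    corr = np.corrcoef(lambda_a, lambda_b)[:n_topics, n_topics:]
    matches, used = [], set()
    for k in range(n_topics):
        order = np.argsort(-corr[k])                 # best-correlated first
        pick = next(j for j in order if j not in used)
        used.add(pick)
        matches.append(int(pick))
    return matches

# Toy usage: B's topics are a shuffled, noisy copy of A's.
rng = np.random.default_rng(2)
A = rng.dirichlet(np.ones(50), size=10)
perm = rng.permutation(10)
B = A[perm] + 0.01 * rng.random((10, 50))
print(match_topics(A, B) == list(perm.argsort()))    # recovers the permutation
```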

Results for an experiment that used a line-type graph are shown in Fig. 5(b); the network itself is shown in Fig. 5(a). In this experiment, two of the nodes were provided with exactly the same set of data (limited to a fixed collection of documents available offline that was fed to them repeatedly), while the remaining nodes were provided with online data, i.e., independent and new data points at each iteration. The purpose of this experiment was to see how the connected nodes cooperate to improve estimation accuracy, in contrast to the accuracy of an independently running learner. The perplexity trajectory shown in Fig. 5(b) supports our claim that node interaction through the ADMM updates is certainly beneficial: the networked node achieves better accuracy than the isolated node because, due to the consensus constraint, the learning at the neighboring nodes influences its own.

Fig. 5: (a) Line-type graph network. A dotted line between two nodes indicates that they are supplied with the same dataset; a solid line indicates the possibility of transfer learning between nodes via ADMM. (b) Perplexity trajectory for the line-type graph.

Fig. 6: (a) Star-type, strongly connected network. (b) Perplexity trajectory for the strongly connected network (V1) versus the weakly connected line-type network (V2). Clearly, the strongly connected network starts performing better after some iterations.
Fig. 7: Independent SVI with complete data versus networked SVI with partitioned data. Node 1 is an independent learner. Nodes 2-5 are connected in a network having partitioned datasets.

IV Discussion and Conclusion

We have presented distributed ADMM-based SVI and its extension over a network of learners in a graph: an algorithm that solves separable stochastic optimization problems and merges their results to achieve an optimal consensus solution. Applications of distributed learning-agent systems are common in the IoT framework, especially when different learning systems do not want to share data with each other but still agree on partial collaboration and transfer learning. A well-trod example, latent Dirichlet allocation for probabilistic topic models, is implemented to show comparative results for the centralized, distributed (fully connected), and networked settings. One key observation in the results of our networked SVI algorithm is that strongly connected networks exhibit substantial transfer-learning benefits; this is highlighted in the comparative experiments of Figs. 5(a) and 6(a). Another observation is that the accuracy of estimation improves over time, as more and more data are analyzed. In a nutshell, the results show that, through collaboration and without having to share private data, two or more independent model-posterior learners using SVI can improve their learning capabilities. Due to the use of stochastic optimization, the algorithm is considerably fast, scalable, and accurate. Moreover, its distributed learning methodology enhances the security and robustness aspects that underpin modern deep learning goals. In the future, we intend to apply this framework to cyber-physical dynamical systems and to investigate guarantees on the convergence properties of the algorithm.

References

  • [1] M. D. Hoffman, D. M. Blei, C. Wang, and J. Paisley, “Stochastic variational inference,” The Journal of Machine Learning Research, vol. 14, no. 1, pp. 1303–1347, 2013.
  • [2] D. Koller and N. Friedman, Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.
  • [3] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul, “An introduction to variational methods for graphical models,” Machine Learning, vol. 37, no. 2, pp. 183–233, 1999.
  • [4] N. Foti, J. Xu, D. Laird, and E. Fox, “Stochastic variational inference for hidden Markov models,” in Advances in Neural Information Processing Systems, 2014, pp. 3599–3607.
  • [5] M. Johnson and A. S. Willsky, “Stochastic variational inference for Bayesian time series models,” in ICML, 2014, pp. 1854–1862.
  • [6] J. Hensman, N. Fusi, and N. D. Lawrence, “Gaussian processes for big data,” arXiv preprint arXiv:1309.6835, 2013.
  • [7] Y. Gal, M. van der Wilk, and C. Rasmussen, “Distributed variational inference in sparse Gaussian process regression and latent variable models,” in Advances in Neural Information Processing Systems, 2014, pp. 3257–3265.
  • [8] M. D. Hoffman and D. M. Blei, “Structured stochastic variational inference,” in Artificial Intelligence and Statistics, 2015.
  • [9] T. Campbell, J. Straub, J. W. Fisher III, and J. P. How, “Streaming, distributed variational inference for Bayesian nonparametrics,” in Advances in Neural Information Processing Systems, 2015, pp. 280–288.
  • [10] B. Babagholami-Mohamadabadi, S. Yoon, and V. Pavlovic, “D-MFVI: Distributed mean field variational inference using Bregman ADMM,” arXiv preprint arXiv:1507.00824, 2015.
  • [11] J. Hua and C. Li, “Distributed variational Bayesian algorithms over sensor networks,” IEEE Transactions on Signal Processing, vol. 64, no. 3, pp. 783–798, 2016.
  • [12] P. Raman, J. Zhang, H.-F. Yu, S. Ji, and S. Vishwanathan, “Extreme stochastic variational inference: Distributed and asynchronous,” arXiv preprint arXiv:1605.09499, 2016.
  • [13] M. Hoffman, F. R. Bach, and D. M. Blei, “Online learning for latent Dirichlet allocation,” in Advances in Neural Information Processing Systems, 2010, pp. 856–864.
  • [14] R. Zhang and Q. Zhu, “Secure and resilient distributed machine learning under adversarial environments,” in 2015 18th International Conference on Information Fusion (Fusion). IEEE, 2015, pp. 644–651.
  • [15] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, “Distributed optimization and statistical learning via the alternating direction method of multipliers,” Foundations and Trends in Machine Learning, vol. 3, no. 1, pp. 1–122, 2011.
  • [16] P. A. Forero, A. Cano, and G. B. Giannakis, “Consensus-based distributed support vector machines,” Journal of Machine Learning Research, vol. 11, pp. 1663–1707, 2010.