1 Introduction
Decentralized networks, i.e., networks with limited communication between nodes, provide an ideal framework for privacy-preserving optimization. They are particularly well suited to the case where agents seek to minimize a global cost function which is separable in the data collected locally by each agent. In such networks, it is typically impossible to efficiently centralize data or to globally aggregate intermediate results: agents can only communicate with their immediate neighbors, often in a completely asynchronous fashion. In a privacy setting, agents never share the raw information they have but rather communicate aggregated information or parameters. Recent work on decentralized optimization focuses on convex optimization: such methods are mostly based on gradient descent [1, 2] or dual averaging [3, 4], and theoretical convergence upper bounds have been established for both synchronous and fully asynchronous computations.
In this paper, we tackle the nonconvex problem of topic modelling, where agents have sensitive text data at their disposal that they cannot or do not want to share (e.g., text messages, emails, confidential reports). More precisely, we adapt the Latent Dirichlet Allocation (LDA) [5] model to decentralized networks. We combine the recent work of [6] on online inference for latent variable models, which adapts online EM [7] with local Gibbs sampling to the case of intractable latent variable models (such as LDA), with recent advances in decentralized optimization [3, 4]. The method presented in [6] is particularly well suited to the decentralized framework as it consists of iteratively updating sufficient statistics, which can be done locally. After presenting our DeLeDA (for Decentralized LDA) algorithm, we give a brief sketch of the convergence proof. Then, we apply our new method to synthetic datasets and show that it recovers the same parameters as the online method [6] and achieves similar performance after enough iterations.
2 Background
Latent Dirichlet allocation (LDA) [5].
Let $D$ be the number of documents of a corpus $\mathcal{C} = \{X_1, \dots, X_D\}$, $V$ the number of words in our vocabulary and $K$ the number of latent topics in the corpus. Each topic $\beta^k$ corresponds to a discrete distribution on the $V$ words (that is, an element of the simplex in $V$ dimensions). A hidden discrete distribution $\theta_d$ over the $K$ topics (that is, an element of the simplex in $K$ dimensions) is attached to each document $d$. LDA is a generative model applied to a corpus of text documents which assumes that each word of the $d$-th document $X_d$ is generated as follows:

1. Choose $\theta_d \sim \mathrm{Dirichlet}(\alpha)$.

2. For each word $x_n \in X_d$:

   (a) Choose a topic $z_n \sim \mathrm{Multinomial}(\theta_d)$,

   (b) Choose a word $x_n \sim \mathrm{Multinomial}(\beta^{z_n})$.
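The generative process above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the authors' code: the function names, the sizes `K` and `V`, the Dirichlet parameters and the fixed document length are all assumptions chosen for the example.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from Dirichlet(alpha) by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(beta, alpha, length, rng):
    """Generate one document (a list of word ids) with the LDA process."""
    K, V = len(beta), len(beta[0])
    theta = sample_dirichlet(alpha, rng)  # topic proportions of the document
    words = []
    for _ in range(length):
        z = rng.choices(range(K), weights=theta)[0]    # topic assignment
        w = rng.choices(range(V), weights=beta[z])[0]  # word drawn from topic z
        words.append(w)
    return theta, words

rng = random.Random(0)
K, V = 3, 8                       # hypothetical numbers of topics and words
beta = [sample_dirichlet([0.5] * V, rng) for _ in range(K)]  # topic matrix
alpha = [0.5] * K                 # hypothetical Dirichlet prior
theta, doc = generate_document(beta, alpha, length=10, rng=rng)
```

Each row of `beta` is one topic, so `beta[z]` plays the role of $\beta^{z_n}$ in step 2(b).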

In LDA, inference consists of learning the topic matrix $\beta$ (global parameter) and the topic proportions $\theta_d$ of each document (local variable) from a corpus of documents. We choose to adapt Gibbs Online EM for LDA (GOEM) [6] to the decentralized framework. The other inference methods for LDA ([5, 8, 9, 10, 11]) are very close to GOEM but are less robust in practice (see [6] for a complete analysis).
Gibbs Online EM for LDA [6].
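We consider an exponential family model on random variables $(X, h)$ with parameter $\eta$ and with density [12]:

$$p(X, h \mid \eta) = a(X, h) \exp\big[\langle \phi(\eta), S(X, h) \rangle - \psi(\eta)\big], \tag{1}$$

with $h$ hidden and $X$ observed. GOEM is an adaptation of the online EM algorithm [7] that maximizes the likelihood of observed data in the case where the conditional $p(h \mid X, \eta)$ is intractable to compute, such as for LDA. In particular, the GOEM algorithm iteratively updates the sufficient statistics $s$ of the model through the formula at step $t+1$:

$$s_{t+1} = (1 - \rho_{t+1})\, s_t + \rho_{t+1}\, \mathbb{E}_{p(h_{t+1} \mid X_{t+1}, \eta^*(s_t))}\big[S(X_{t+1}, h_{t+1})\big], \tag{2}$$

where $\rho_{t+1}$ is the step size and $\eta^*(s)$ is the maximum-likelihood parameter for the sufficient statistics $s$ (which is the usual M-step update in the EM algorithm [13]). For LDA, the expectation in (2) is intractable to compute and is approximated with Gibbs sampling; see [6] for more details.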
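The structure of update (2) is a stochastic-approximation step: the current statistics are mixed with a fresh estimate of the expected sufficient statistics. The sketch below shows only that structure. The callable `expected_stats` is a hypothetical stand-in for the Gibbs-sampling approximation of [6], and the step size $\rho_t = 1/t$ is an assumption made for the example.

```python
def online_em_step(s, stats_estimate, rho):
    """One update s <- (1 - rho) * s + rho * estimate, as in rule (2)."""
    return [(1.0 - rho) * si + rho * gi for si, gi in zip(s, stats_estimate)]

def run_online_em(observations, expected_stats, s0, step_size):
    """Stream over observations, refreshing the statistics after each one."""
    s = list(s0)
    for t, x in enumerate(observations, start=1):
        g = expected_stats(x, s)  # placeholder for the (approximate) E-step
        s = online_em_step(s, g, step_size(t))
    return s

# Toy check with no hidden variable: the "expected statistics" of an
# observation are the observation itself, so with rho_t = 1/t the iterate
# is exactly the running mean of the stream.
observations = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
s_final = run_online_em(observations, lambda x, s: x,
                        s0=[0.0, 0.0], step_size=lambda t: 1.0 / t)
```

In the toy run, `s_final` equals the mean of the four observations, `[0.5, 0.5]` up to floating-point rounding. In GOEM the estimate would additionally depend on the current statistics through $\eta^*(s_t)$.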
In this setting for LDA, an observation is a document $X$. Each observation $X$ is associated with the hidden variables $h$, with $h = (Z, \theta)$. The vector $\theta$ represents the topic proportions of document $X$ and $Z$ is the vector of topic assignments of each word of $X$. The hidden variable $h$ is local, attached to one observation $X$. The parameters of the model are global, represented by $\eta = (\beta, \alpha)$, where $\beta$ represents the topic matrix and $\alpha$ represents the Dirichlet prior on topic proportions.
Decentralized setting.
Let $n > 0$ and let $G = ([n], E)$ be an undirected graph of $n$ agents. We consider a setting where each agent aims to optimize a global objective without revealing its local observations. In addition, we are interested in the case where communication is possibly asynchronous, i.e., there is no global clock ensuring that iterations are synchronized between nodes. Gossip algorithms [14] are particularly well suited to this setting, as they rely on peer-to-peer interactions. Initially, each agent $i \in [n]$ stores an iterate $s_{i,0}$; at each iteration, a node awakes and selects one of its neighbors: the two nodes then average their iterates. Then, in the synchronous (resp. asynchronous) setting, all the nodes (resp. the two active nodes) update their iterate according to a local rule, say (2).
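The pairwise averaging step of a gossip algorithm can be sketched as follows. This is a minimal illustration with scalar iterates. It assumes that one edge is drawn uniformly at random at each iteration, which is a simplification of "a node awakes and selects one of its neighbors" (the two coincide on regular graphs). The graph size and iteration count are arbitrary.

```python
import random

def gossip_average(values, edges, n_iter, rng):
    """Randomized gossip: repeatedly average the iterates of a random edge."""
    x = list(values)
    for _ in range(n_iter):
        i, j = rng.choice(edges)   # assumed: an edge is drawn uniformly
        avg = 0.5 * (x[i] + x[j])  # the two awakened nodes average
        x[i] = x[j] = avg
    return x

n = 10
complete_edges = [(i, j) for i in range(n) for j in range(i + 1, n)]
init = [float(i) for i in range(n)]  # node i starts with value i
final = gossip_average(init, complete_edges, n_iter=500,
                       rng=random.Random(0))
```

Each averaging step preserves the sum of the iterates, so every entry of `final` approaches the network average (4.5 here).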
When performing decentralized estimation, e.g., estimating the sample mean, the averaging step allows for a quick uniformization of the network iterates. Indeed, node iterates converge to the network average at an exponential rate which depends on a communication property of the graph called the spectral gap: the larger the spectral gap, the faster the convergence. In the case of decentralized optimization, since the network does not conserve its mass during the process (typically, gradients are added), convergence is not guaranteed in general. However, provided that the mass added to the network at each iteration is bounded, one can formulate upper bounds on the convergence rate, usually polynomial in the number of iterations.
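As a worked illustration of how the spectral gap depends on the topology, assume (this is an assumption made for the example, not a statement from the text) that at each iteration one edge $(i,j) \in E$ is drawn uniformly at random and the pairwise averaging matrix is $W_{ij} = I_n - \frac{1}{2}(e_i - e_j)(e_i - e_j)^\top$. Writing $L$ for the graph Laplacian and $\mu_2$ for its second smallest eigenvalue, the spectral gap can be computed in closed form for two simple graphs:

```latex
\begin{align*}
\mathbb{E}[W] &= I_n - \frac{1}{2|E|} L,
  \qquad
  \lambda_2\big(\mathbb{E}[W]\big) = 1 - \frac{\mu_2}{2|E|}, \\
\text{complete graph } K_n:\quad
  & \mu_2 = n,\ |E| = \tfrac{n(n-1)}{2}
  \ \Longrightarrow\ 1 - \lambda_2 = \frac{1}{n-1}, \\
\text{cycle } C_n:\quad
  & \mu_2 = 2 - 2\cos\!\big(\tfrac{2\pi}{n}\big),\ |E| = n
  \ \Longrightarrow\ 1 - \lambda_2 = \frac{1 - \cos(2\pi/n)}{n}.
\end{align*}
```

For $n = 50$ this gives a spectral gap of about $2.0 \times 10^{-2}$ for the complete graph and about $1.6 \times 10^{-4}$ for the cycle, so gossip averaging is expected to mix roughly two orders of magnitude faster on the complete graph.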
3 DeLeDA algorithm
In this paper, we aim to combine GOEM for LDA with decentralized optimization. Since the GOEM updates are very similar to stochastic gradient updates, the reasoning of decentralized convex optimization can be extended to GOEM. In a synchronous setting, our algorithm, called DeLeDA, proceeds as follows: each node $i \in [n]$ initializes its sufficient statistics to $s_{i,0}$. At each iteration, the two awakened nodes average their statistics. Then, every node performs a local update of its statistics using update rule (2) and its own observations. The method is detailed in Algorithm 1.
For $t \geq 0$, let us define the average iterate $\bar{s}_t = \frac{1}{n} \sum_{i=1}^{n} s_{i,t}$. Using the description of Algorithm 1 and update rule (2), one can see that $\bar{s}_{t+1}$ is obtained from $\bar{s}_t$ using the GOEM update rule with all observations of the network. Therefore, convergence derives from the theoretical guarantees of GOEM [6]; in order to prove the convergence of each agent, one only has to show that for all $i \in [n]$, $s_{i,t}$ tends to $\bar{s}_t$ as the number of iterations $t$ grows.
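The synchronous procedure described above (pairwise averaging followed by a local update at every node) can be sketched as follows. This is not the authors' Algorithm 1, which is not reproduced in this text. It is a minimal sketch under stated assumptions: `local_expectation` is a hypothetical placeholder for the Gibbs-approximated expectation in (2), replaced in the toy run by the local data mean; one edge is drawn uniformly at each iteration; and the step size is $1/t$.

```python
import random

def deleda_sync(init_stats, local_data, edges, local_expectation,
                n_iter, step_size, rng):
    """Synchronous sketch: gossip averaging, then a local update everywhere."""
    stats = [list(s) for s in init_stats]
    for t in range(1, n_iter + 1):
        # Gossip step: the two awakened nodes average their statistics.
        i, j = rng.choice(edges)
        avg = [0.5 * (a + b) for a, b in zip(stats[i], stats[j])]
        stats[i], stats[j] = avg, list(avg)
        # Local step: every node applies rule (2) with its own observations.
        rho = step_size(t)
        for k, data in enumerate(local_data):
            g = local_expectation(data, stats[k])
            stats[k] = [(1.0 - rho) * s + rho * gk
                        for s, gk in zip(stats[k], g)]
    return stats

def toy_expectation(data, stats):
    """Toy stand-in for the E-step: the mean of the node's own data."""
    return [sum(data) / len(data)]

n = 5
edges = [(i, j) for i in range(n) for j in range(i + 1, n)]  # complete graph
local_data = [[k - 0.5, k + 0.5] for k in range(n)]  # local mean of node k is k
init_stats = [[0.0] for _ in range(n)]
final_stats = deleda_sync(init_stats, local_data, edges, toy_expectation,
                          n_iter=5000, step_size=lambda t: 1.0 / t,
                          rng=random.Random(0))
```

In this toy run every node ends close to the global mean (2.0) although it only ever sees its own data and its neighbors' statistics. An asynchronous variant would apply the local step to nodes `i` and `j` only.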
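Let us define the iterates matrix $\mathbf{S}_t = (s_{1,t}, \dots, s_{n,t})^\top$ and the averaging operator $W_t = I_n - \frac{1}{2}(e_i - e_j)(e_i - e_j)^\top$, where $i$ and $j$ are the nodes awakened at iteration $t$. An iteration of Algorithm 1 can be reformulated as follows:

$$\mathbf{S}_{t+1} = (1 - \rho_{t+1})\, W_{t+1} \mathbf{S}_t + \rho_{t+1}\, \mathbf{G}_{t+1},$$

where $\mathbf{G}_{t+1}$ is the matrix of expectations in (2). Recursively, one can obtain:

$$\mathbf{S}_t = \Big(\prod_{r=1}^{t} (1 - \rho_r)\Big) W_t \cdots W_1 \mathbf{S}_0 + \sum_{r=1}^{t} \rho_r \Big(\prod_{k=r+1}^{t} (1 - \rho_k)\Big) W_t \cdots W_{r+1} \mathbf{G}_r.$$

Using properties of averaging matrices (see [14, 4] for a more detailed analysis of averaging matrices), one can rearrange the above expression and obtain a bound of the form:

$$\mathbb{E}\big\|\mathbf{S}_t - \mathbf{1}_n \bar{s}_t^\top\big\| \leq \lambda_2^{t/2}\, \big\|\mathbf{S}_0 - \mathbf{1}_n \bar{s}_0^\top\big\| + \sum_{r=1}^{t} \rho_r\, \lambda_2^{(t-r)/2}\, \mathbb{E}\|\mathbf{G}_r\|, \tag{3}$$

where $\lambda_2$ is the second largest eigenvalue of $\mathbb{E}[W_t]$.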
If the graph is connected and non-bipartite, one can show that the second largest eigenvalue satisfies $0 \leq \lambda_2 < 1$. Therefore, when the step size $\rho_t$ tends towards 0 and provided that the matrix of expectations $\mathbf{G}_t$ is bounded, the expected gap between the node iterates and the average goes to 0. In practice, this last assumption holds for instance when the Dirichlet parameter $\alpha$ is either fixed or lower-bounded by a positive constant. Also, as evidenced by inequality (3), one should expect DeLeDA to converge faster on a graph with a high spectral gap, as it is proportional to $1 - \lambda_2$ [4, 14].
4 Numerical experiments
Dataset.
We perform experiments on a synthetic graph of $n = 50$ nodes. Given these nodes, we consider the complete graph ($n(n-1)/2 = 1225$ edges) and a graph whose edges are generated through a Watts-Strogatz model [15], with $m$ edges and rewiring probability $p$. We attach 20 documents to each node, generated through the LDA generative process, with a vocabulary of size $V$ and $K$ topics. The length of each document is drawn from a Poisson distribution with parameter 10. We compare our algorithm to (centralized) GOEM applied to the 1000 synthetic documents of the graph with a batch size of 20 documents. In particular, at each iteration of the GOEM algorithm we choose 20 documents uniformly at random among the 1000 of the total dataset, and update $\eta$ according to the GOEM process. In practice, we update $\beta$ at each iteration and keep $\alpha$ fixed, as often done in previous work [8, 11, 16, 17].
Evaluation.
We evaluate our method by computing the likelihood on held-out documents, that is $p(X \mid \eta)$ for any test document $X$. For LDA, the likelihood is intractable to compute. We approximate $p(X \mid \eta)$ with the "left-to-right" evaluation algorithm [18] applied to each test document. In the following, we present results in terms of log-perplexity, defined as the opposite of the log-likelihood, $\mathrm{LP} = -\log p(X \mid \eta)$. The lower the log-perplexity, the better the corresponding model. In our experiments, we compute the average test log-perplexity over the test documents. More precisely, we display the relative error $\mathrm{LP}/\mathrm{LP}^* - 1$, where $\mathrm{LP}^*$ is the average log-perplexity of the test documents computed with the optimal parameters $\eta^*$ that generated the data.
We also compare the performance in terms of distance to the optimal topic matrix $\beta^*$, which generated the documents. As the performance is invariant to topic permutation in LDA, we compute the following distance between the inferred parameter $\beta$ and the optimal parameter $\beta^*$:

$$D(\beta, \beta^*) = \min_{M \in \mathbb{R}^{K \times K}} \frac{\|M\beta - \beta^*\|_F}{\|\beta^*\|_F},$$

which is invariant by topic permutation (by permutation of rows of $\beta$).
Results for the asynchronous and synchronous methods are presented in Figure 1. We observe that our method converges to the same state as GOEM after enough iterations. In particular, each agent is able to recover the topic matrix that generated all the documents without any direct access to the documents of the other nodes.
As expected, DeLeDA converges faster on a perfectly connected network, even though the "well-connected" design of the Watts-Strogatz network allows it to stay close to the complete graph, perplexity-wise. Finally, the synchronous version of DeLeDA appears to converge more slowly than the asynchronous one. This is an overfitting issue: synchronous computations lead each node to perform roughly $n/2$ local updates between two averaging steps, whereas asynchronous computations force nodes to update only once between two averaging steps.
References
 [1] Angelia Nedić and Asuman Ozdaglar. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1):48–61, 2009.
 [2] Angelia Nedić. Asynchronous broadcast-based convex optimization over a network. IEEE Transactions on Automatic Control, 56(6):1337–1351, 2011.
 [3] Alekh Agarwal, Martin J Wainwright, and John C Duchi. Distributed dual averaging in networks. In Advances in Neural Information Processing Systems, pages 550–558, 2010.
 [4] I. Colin, A. Bellet, J. Salmon, and S. Clémençon. Gossip dual averaging for decentralized optimization of pairwise functions. Proc. ICML, 2016.
 [5] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. JMLR, 3:993–1022, 2003.
 [6] C. Dupuy and F. Bach. Online but accurate inference for latent variable models with local Gibbs sampling. arXiv preprint arXiv:1603.02644, 2016.
 [7] O. Cappé and E. Moulines. Online EM algorithm for latent data models. Journal of the Royal Statistical Society, 71(3):593–613, 2009.
 [8] M. D. Hoffman, D. M. Blei, and F. R. Bach. Online learning for latent Dirichlet allocation. Adv. NIPS, 2010.
 [9] T. L. Griffiths and M. Steyvers. Finding scientific topics. Proceedings of the National Academy of Sciences, 101(suppl 1):5228–5235, 2004.
 [10] I. Sato, K. Kurihara, and H. Nakagawa. Deterministic single-pass algorithm for LDA. Adv. NIPS, 2010.
 [11] D. Mimno, M. Hoffman, and D. Blei. Sparse stochastic inference for latent Dirichlet allocation. Proc. ICML, 2012.
 [12] E. L. Lehmann and G. Casella. Theory of point estimation, volume 31. Springer Science & Business Media, 1998.
 [13] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the royal statistical society. Series B (methodological), 39(1):1–38, 1977.
 [14] Stephen Boyd, Arpita Ghosh, Balaji Prabhakar, and Devavrat Shah. Randomized gossip algorithms. IEEE Trans. Inf. Theory, 52(6):2508–2530, 2006.
 [15] Duncan J Watts and Steven H Strogatz. Collective dynamics of 'small-world' networks. Nature, 393(6684):440–442, 1998.
 [16] S. Patterson and Y. W. Teh. Stochastic gradient Riemannian Langevin dynamics on the probability simplex. Adv. NIPS, 2013.
 [17] Y. Gao, J. Chen, and J. Zhu. Streaming Gibbs sampling for LDA model. arXiv preprint arXiv:1601.01142, 2016.
 [18] H. M. Wallach, I. Murray, R. Salakhutdinov, and D. Mimno. Evaluation methods for topic models. Proc. ICML, 2009.