Private Topic Modeling

09/14/2016 ∙ by Mijung Park, et al.

We develop a privatised stochastic variational inference method for Latent Dirichlet Allocation (LDA). The iterative nature of stochastic variational inference presents challenges: multiple iterations are required to obtain accurate posterior distributions, yet each iteration increases the amount of noise that must be added to achieve a reasonable degree of privacy. We propose a practical algorithm that overcomes this challenge by combining: (1) a relaxed notion of differential privacy, called concentrated differential privacy, which provides high-probability bounds for cumulative privacy loss and is well suited for iterative algorithms, rather than focusing on single-query loss; and (2) privacy amplification resulting from subsampling of large-scale data. Focusing on conjugate exponential family models, in our private variational inference all the posterior distributions are privatised by simply perturbing expected sufficient statistics. Using Wikipedia data, we illustrate the effectiveness of our algorithm for large-scale data.




1 Background

We start by providing background information on the definitions of algorithmic privacy that we use, as well as the general formulation of the variational inference algorithm.

Differential privacy

Differential privacy (DP) is a formal definition of the privacy properties of data analysis algorithms [1]. A randomized algorithm M is said to be ε-differentially private if Pr(M(D) ∈ S) ≤ exp(ε) Pr(M(D′) ∈ S) for all measurable subsets S of the range of M and for all datasets D, D′ differing by a single entry. If this guarantee is relaxed to Pr(M(D) ∈ S) ≤ exp(ε) Pr(M(D′) ∈ S) + δ, the algorithm is said to be (ε, δ)-differentially private. Intuitively, the definition states that the output probabilities must not change very much when a single individual’s data is modified, thereby limiting the amount of information that the algorithm reveals about any one individual.
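As a concrete illustration of the definition (not the mechanism used in this paper, which is Gaussian), the classic Laplace mechanism achieves ε-DP for any query with bounded sensitivity; a minimal sketch:

```python
import numpy as np

def laplace_mechanism(value, sensitivity, eps, rng):
    """Release `value` with eps-DP by adding Laplace(sensitivity / eps) noise."""
    return value + rng.laplace(scale=sensitivity / eps)

# A counting query has sensitivity 1: changing one individual's record
# changes the count by at most 1.
rng = np.random.default_rng(0)
noisy_count = laplace_mechanism(value=42.0, sensitivity=1.0, eps=0.5, rng=rng)
```

Smaller ε forces larger noise, making the two neighbouring output distributions harder to distinguish.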

Concentrated differential privacy (CDP) [2] is a recently proposed relaxation of DP which aims to make privacy-preserving iterative algorithms more practical than under DP while still providing strong privacy guarantees. The CDP framework treats the privacy loss of an outcome o,

L(o) = log( Pr(M(D) = o) / Pr(M(D′) = o) ),

as a random variable. An algorithm is (μ, τ)-CDP if this privacy loss has mean μ, and after subtracting μ the resulting random variable L − μ is subgaussian with standard deviation τ, i.e., E[exp(λ(L − μ))] ≤ exp(λ²τ²/2) for all λ ∈ ℝ. While ε-DP guarantees bounded privacy loss, and (ε, δ)-DP ensures bounded privacy loss with probability 1 − δ, (μ, τ)-CDP requires the privacy loss to be near μ with high probability.

The general VI algorithm.

Consider a generative model that produces a dataset D = {x_n}_{n=1}^N consisting of N independent, identically distributed items, where x_n is the nth observation, generated using a set of latent variables l = {l_n}_{n=1}^N. The generative model provides the likelihood p(D, l | m), where m denotes the model parameters. We also consider the prior distribution over the model parameters, p(m), and the prior distribution over the latent variables, p(l). Here, we focus on conjugate-exponential (CE) models¹, in which the variational updates are tractable. The CE family models satisfy two conditions [3]: (1) the complete-data likelihood is in the exponential family, p(x_n, l_n | m) = g(m) f(x_n, l_n) exp( n(m)ᵀ s(x_n, l_n) ); and (2) the prior over m is conjugate to the complete-data likelihood, p(m | τ, ν) = h(τ, ν) g(m)^τ exp( n(m)ᵀ ν ), where the natural parameters and the sufficient statistics of the complete-data likelihood are denoted by n(m) and s(x_n, l_n), respectively. The hyperparameters are denoted by τ (a scalar) and ν (a vector).

¹ A large class of models falls in the CE family, including linear dynamical systems and switching models; Gaussian mixtures; factor analysis and probabilistic PCA; hidden Markov models and factorial HMMs; discrete-variable belief networks; and latent Dirichlet allocation (LDA), which we will use in Sec 3.

Variational inference for a CE family model iterates the following two steps in order to optimise the lower bound to the log marginal likelihood: (1) given the expected natural parameters n̄ = E_{q(m)}[n(m)], compute q(l_n) ∝ f(x_n, l_n) exp( n̄ᵀ s(x_n, l_n) ) for each n; and (2) given the expected sufficient statistics s̄ = Σ_n E_{q(l_n)}[s(x_n, l_n)], compute q(m) ∝ g(m)^{τ+N} exp( n(m)ᵀ (ν + s̄) ).


2 Privacy Preserving VI algorithm for CE family

The only place where the algorithm looks at the data is when computing the expected sufficient statistics in the first step. The expected sufficient statistics then dictate the expected natural parameters in the second step. So, perturbing the sufficient statistics perturbs both posterior distributions q(l) and q(m). Perturbing sufficient statistics in exponential families is also used in [4]. Existing work focuses on privatising posterior distributions in the context of posterior sampling [5, 6, 7, 8], while our work focuses on privatising approximate posterior distributions for optimisation-based approximate Bayesian inference.

Suppose there are two neighbouring datasets D and D′, differing by a single datapoint. We also assume that the dataset is pre-processed such that the norm of any datapoint is bounded. The maximum difference in the expected sufficient statistics given the two datasets, i.e., the L1 sensitivity of the expected sufficient statistics, is Δs̄ = max_{D,D′} ‖s̄(D) − s̄(D′)‖₁ (viewing s̄ as a vector of length L). Under some models, like LDA below, the expected sufficient statistics have limited sensitivity, in which case we add noise to each coordinate of the expected sufficient statistics to compensate for the maximum change.
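A minimal sketch of this perturbation step, shown here with the classic (ε, δ) calibration of the Gaussian mechanism for concreteness (the paper's σ is set via the zCDP analysis below; names are illustrative):

```python
import numpy as np

def gaussian_mechanism(suff_stats, sensitivity, eps, delta, rng=None):
    """Perturb expected sufficient statistics with calibrated Gaussian noise.

    Uses the classic calibration sigma = sensitivity * sqrt(2 log(1.25/delta)) / eps,
    which gives (eps, delta)-DP for L2 sensitivity `sensitivity`.
    """
    rng = rng or np.random.default_rng()
    sigma = sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / eps
    return suff_stats + rng.normal(scale=sigma, size=suff_stats.shape)

s_bar = np.array([120.0, 80.0, 40.0])   # hypothetical expected suff. stats
s_priv = gaussian_mechanism(s_bar, sensitivity=1.0, eps=0.5, delta=1e-4)
```

Because the posteriors depend on the data only through s̄, releasing the perturbed s_priv privatises every downstream quantity for free (post-processing).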

3 Privacy preserving Latent Dirichlet Allocation (LDA)

The most successful topic modeling approach is based on LDA, whose generative process is given as follows [9]:

  • Draw topics β_k ∼ Dirichlet(η), for k = 1, …, K, where η is a scalar hyperparameter.

  • For each document d = 1, …, D:

    • Draw topic proportions θ_d ∼ Dirichlet(α), where α is a scalar hyperparameter.

    • For each word n = 1, …, N_d:

      • Draw topic assignment z_dn ∼ Discrete(θ_d).

      • Draw word w_dn ∼ Discrete(β_{z_dn}).

where each observed word w_dn (the nth word in the dth document) is represented by an indicator vector of length V, where V is the number of terms in a fixed vocabulary set. The topic assignment latent variable z_dn is also an indicator vector, of length K, where K is the number of topics.
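The generative process above can be sampled directly; a toy sketch with illustrative sizes, using integer word ids rather than indicator vectors:

```python
import numpy as np

def sample_lda_corpus(D=5, K=3, V=20, N=50, alpha=0.1, eta=0.1, seed=0):
    """Sample a small corpus from the LDA generative process."""
    rng = np.random.default_rng(seed)
    beta = rng.dirichlet(np.full(V, eta), size=K)      # topics beta_k
    docs = []
    for _ in range(D):
        theta = rng.dirichlet(np.full(K, alpha))       # topic proportions theta_d
        words = []
        for _ in range(N):
            z = rng.choice(K, p=theta)                 # topic assignment z_dn
            words.append(int(rng.choice(V, p=beta[z])))  # word w_dn
        docs.append(words)
    return beta, docs

beta, docs = sample_lda_corpus()
```

Small α and η make topic proportions and topics sparse, which is what gives LDA its interpretable word clusters.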

LDA falls into the CE family, where we think of (z, θ) as two types of latent variables, l = (z, θ), and of the topics as model parameters, m = β: (1) the complete-data likelihood per document is in the exponential family, with natural parameters given by the log topic parameters log β_kv and sufficient statistics given by the counts z_dn^k w_dn^v; (2) the prior over β is conjugate: β_k ∼ Dirichlet(η) for k = 1, …, K. For simplicity, we assume the hyperparameters α and η are set manually.

In VI, we assume the posteriors are: (1) Discrete for q(z_dn), with variational parameters φ_dn that capture the posterior probability of topic assignment; (2) Dirichlet for q(θ_d), with variational parameters γ_d; and (3) Dirichlet for q(β_k), with variational parameters λ_k. In this case, the expected sufficient statistics are n̄_kv = Σ_d Σ_n φ_dn^k w_dn^v, i.e., the expected number of times each vocabulary term v is assigned to each topic k.

Sensitivity analysis

To privatise the variational inference for LDA, we perturb the expected sufficient statistics. While each document has a different length, we limit the maximum length of any document to N by randomly selecting N words from a document whenever it contains more than N words.

We add Gaussian noise to each coordinate, then map the coordinate to 0 if the perturbed value becomes negative:

ñ_kv = max(0, n̄_kv + Y_kv),  Y_kv ∼ N(0, σ²Δ²),  with sensitivity Δ = 2N,   (3)

where n̄_kv is the (k, v)th coordinate of the expected sufficient statistics (a vector of length KV when flattened). The sensitivity bound holds since Σ_k φ_dn^k = 1, each word w_dn is an indicator vector with Σ_v w_dn^v = 1, each document contributes at most N words after truncation, and neighbouring datasets differ in a single document.
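The two data-dependent steps above — bounding each document's length by N and clamping perturbed coordinates at zero — can be sketched as follows (function names illustrative):

```python
import numpy as np

def truncate_doc(words, max_len, rng=None):
    """Randomly subsample a document down to max_len words.

    This bounds each document's contribution to the sufficient statistics,
    which is what bounds the sensitivity of the release.
    """
    rng = rng or np.random.default_rng()
    if len(words) <= max_len:
        return list(words)
    return list(rng.choice(words, size=max_len, replace=False))

def privatize_lda_suff_stats(nkv, sigma, rng=None):
    """Noise up the K x V expected word-topic counts, clamping negatives to 0."""
    rng = rng or np.random.default_rng()
    noisy = nkv + rng.normal(scale=sigma, size=nkv.shape)
    return np.maximum(noisy, 0.0)   # map negative coordinates to 0
```

Clamping at zero is post-processing of an already-private quantity, so it costs no extra privacy budget while keeping the Dirichlet updates well defined.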

Private stochastic variational learning

In a large-scale data setting, it is impossible to handle the entire dataset at once. In such cases, stochastic learning using noisy gradients computed on mini-batches of data, a.k.a. stochastic gradient descent (SGD), provides a scalable inference method. While there is some prior work on differentially private SGD (e.g., [10, 11]), privacy amplification due to subsampling combined with the CDP composition (described below) has not been used in the context of variational inference or topic modeling before.

The privacy amplification theorem states the following.

Theorem 1.

(Theorem 1 in [12]) Any ε-DP mechanism running on a uniformly sampled subset of the data with a sampling ratio ν guarantees ε′-differential privacy, where ε′ = log(1 + ν(exp(ε) − 1)).

The privacy gain from subsampling allows us to use a much more relaxed per-iteration privacy budget and error tolerance, achieving a reasonable level of (ε_tot, δ_tot)-DP with a small sampling rate.
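Theorem 1's bound, and its inverse (used later to allot a per-iteration budget for a target amplified budget), can be computed directly:

```python
import numpy as np

def amplified_eps(eps_iter, nu):
    """Effective epsilon after running an eps_iter-DP step on a nu-subsample,
    per the bound eps' = log(1 + nu * (exp(eps_iter) - 1))."""
    return np.log(1.0 + nu * np.expm1(eps_iter))

def required_eps_iter(eps_target, nu):
    """Invert the bound: per-iteration budget that amplifies to eps_target."""
    return np.log(1.0 + np.expm1(eps_target) / nu)
```

With ν = 0.01, an ε_iter-DP step amplifies to roughly ν·ε_iter for small budgets, so each iteration is allowed to spend far more than a naive split of the total budget would suggest.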

Furthermore, the zCDP (zero-concentrated differential privacy, a variant of CDP [13]) composition allows a sharper analysis of the per-iteration privacy budget. We first convert DP to zCDP, then use the zCDP composition, and finally convert zCDP back to DP (for comparison purposes), for which we use the following lemmas and proposition.

Lemma 1.

(Proposition 1.6 in [13]) The Gaussian mechanism with noise variance σ² and sensitivity Δ satisfies (Δ²/(2σ²))-zCDP.

Lemma 2.

(Lemma 1.7 in [13]) If two mechanisms satisfy ρ₁-zCDP and ρ₂-zCDP, respectively, then their composition satisfies (ρ₁ + ρ₂)-zCDP.

Proposition 1.

(Proposition 1.3 in [13]) If a mechanism M provides ρ-zCDP, then M is (ρ + 2√(ρ log(1/δ)), δ)-DP for any δ > 0.

So, using Lemmas 1 and 2, we obtain (Jρ)-zCDP after the J-fold composition of the Gaussian mechanism. Using Proposition 1, we convert (Jρ)-zCDP to (ε_tot, δ_tot)-DP, where ε_tot = Jρ + 2√(Jρ log(1/δ_tot)).

These seemingly complicated steps can be summarised in two simple steps. First, given a total privacy budget ε_tot and a total tolerance level δ_tot, our algorithm calculates an intermediate per-iteration budget ε′ using the zCDP composition, which maps (ε_tot, δ_tot) to ε′: by Lemmas 1 and 2 and Proposition 1, the per-iteration zCDP parameter ρ must satisfy ε_tot = Jρ + 2√(Jρ log(1/δ_tot)), i.e.,

ρ = ( √(log(1/δ_tot) + ε_tot) − √(log(1/δ_tot)) )² / J,   (5)

and ε′ is the DP budget of a single ρ-zCDP iteration of the Gaussian mechanism. Second, our algorithm calculates the per-iteration privacy budget ε_iter using the privacy amplification theorem, which maps ε′ to ε_iter by inverting the bound of Theorem 1 with sampling ratio ν:

ε_iter = log(1 + (exp(ε′) − 1)/ν).   (6)
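A sketch of this two-step budget calculation, derived directly from Lemmas 1–2, Proposition 1, and Theorem 1 (the closed forms here are our reconstruction from those statements, not quoted from the paper):

```python
import numpy as np

def per_iteration_zcdp(eps_tot, delta_tot, J):
    """Solve eps_tot = J*rho + 2*sqrt(J*rho*log(1/delta_tot)) for rho,
    the per-iteration zCDP parameter under J-fold composition."""
    L = np.log(1.0 / delta_tot)
    return (np.sqrt(L + eps_tot) - np.sqrt(L)) ** 2 / J

def zcdp_to_dp(rho, delta):
    """Proposition 1: rho-zCDP implies (rho + 2*sqrt(rho*log(1/delta)), delta)-DP."""
    return rho + 2.0 * np.sqrt(rho * np.log(1.0 / delta))

def per_iteration_eps(eps_prime, nu):
    """Invert Theorem 1: per-iteration DP budget allowed on a nu-subsample."""
    return np.log(1.0 + np.expm1(eps_prime) / nu)
```

Quadratic-formula check: composing J iterations of the resulting ρ and converting back with Proposition 1 recovers ε_tot exactly.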
Algorithm 1 summarizes our private topic modeling algorithm.

Input:  Data 𝒟. Define D (number of documents), V (vocabulary size), and K (number of topics). Define the number of iterations J, mini-batch size S, sampling ratio ν = S/D, total budget (ε_tot, δ_tot), and hyperparameters α, η.
Output:  Privatised expected natural parameters and sufficient statistics.
  Compute the per-iteration privacy budget (ε_iter, δ_iter) using eq (5) and eq (6).
  Compute the sensitivity of the expected sufficient statistics given in eq (3).
  for t = 1, …, J do
     (1) E-step: Given the expected natural parameters E_q[log β],
     for each document d in the mini-batch do
        Compute q(z_d) parameterised by φ_d.
        Compute q(θ_d) parameterised by γ_d.
     end for
     Output the noised-up expected sufficient statistics ñ, where the Gaussian noise is given in eq (3).
     (2) M-step: Given the noised-up expected sufficient statistics ñ,
     Compute q(β) parameterised by λ, where λ_kv = η + (D/S) ñ_kv.
     Set λ to a convex combination of its previous value and this update, with a decreasing step size.
     Output the expected natural parameters E_q[log β].
  end for
Algorithm 1 Private Topic Modeling

4 Experiments using Wikipedia data

We randomly downloaded documents from Wikipedia. We then tested our VIPS algorithm on this Wikipedia dataset with four different values of the total privacy budget, using a fixed mini-batch size S, until the algorithm had seen the full downloaded corpus. We assumed a fixed number of topics K, and we used a fixed vocabulary set.

We compare our method to two baseline methods. First, in linear (Lin) composition (Theorem 3.16 of [1]), privacy degrades linearly with the number of iterations J; this follows from the max divergence of the privacy loss random variable being bounded by the total budget, and yields (J ε_iter, J δ_iter)-DP. We use eq (6) to map the per-iteration budget to its amplified value. Second, advanced (Adv) composition (Theorem 3.20 of [1]), resulting from the max divergence of the privacy loss random variable being bounded by a total budget including a slack variable δ′, yields (ε_iter √(2J log(1/δ′)) + J ε_iter (exp(ε_iter) − 1), J δ_iter + δ′)-DP. Similarly, we use eq (6) to map the per-iteration budget to its amplified value.
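For comparison, the two baseline compositions can be computed as follows (a sketch of the standard formulas from [1]; variable names illustrative):

```python
import numpy as np

def linear_composition(eps_iter, delta_iter, J):
    """Theorem 3.16 of Dwork & Roth: per-iteration budgets add up linearly."""
    return J * eps_iter, J * delta_iter

def advanced_composition(eps_iter, delta_iter, J, delta_slack):
    """Theorem 3.20 of Dwork & Roth: sqrt(J) growth plus a slack delta'."""
    eps_tot = (eps_iter * np.sqrt(2.0 * J * np.log(1.0 / delta_slack))
               + J * eps_iter * np.expm1(eps_iter))
    return eps_tot, J * delta_iter + delta_slack

lin = linear_composition(0.01, 1e-6, 10000)
adv = advanced_composition(0.01, 1e-6, 10000, 1e-5)
```

For many iterations of a small per-iteration budget, advanced composition gives a far smaller total ε than linear composition, at the cost of the extra slack δ′; the zCDP composition of Sec 3 is tighter still for Gaussian noise.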

As an evaluation metric, we compute an upper bound on the perplexity of held-out documents²,

perplexity(D_test) ≤ exp( − (Σ_d ELBO(n_d)) / (Σ_d Σ_v n_dv) ),

where n_d is the vector of word counts for the dth document and ELBO(n_d) is the variational lower bound on log p(n_d). In the above, we use the λ calculated during training. We compute the posteriors over z and θ by performing the first step of our algorithm using the test data and the perturbed sufficient statistics obtained during training. The per-word perplexity is shown in Fig. 1. Due to privacy amplification, less noise needs to be added when the mini-batch size is small, so smaller mini-batches are more beneficial. The zCDP composition results in better accuracy than the advanced composition.

² We used the metric written in the python implementation by the authors of [14].
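The held-out metric can be sketched as follows (using the variational bound in place of the intractable log-likelihood; names illustrative):

```python
import numpy as np

def per_word_perplexity(elbo_per_doc, counts_per_doc):
    """Upper bound on held-out perplexity: exp(-sum_d ELBO_d / total words).

    Since ELBO_d lower-bounds log p(n_d), exponentiating its negative
    average per word upper-bounds the true perplexity.
    """
    return np.exp(-np.sum(elbo_per_doc) / np.sum(counts_per_doc))
```

As a sanity check, a bound of log(1/2) per word corresponds to a per-word perplexity of 2.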

Figure 1: Per-word-perplexity with different mini-batch sizes . In the private LDA (Top/Right and Bottom), smaller mini-batch size achieves lower perplexity, due to the privacy amplification lemma (See Sec 3). We set the total privacy budget and the total tolerance in all private methods. Regardless of the mini-batch size, the zCDP composition (Top/Right) achieves a lower perplexity than the Advanced (Bottom/Left) and Linear compositions (Bottom/Right).

In Table 1, we show the top words in terms of assigned probabilities under a chosen topic in each method. We show four topics as examples. Non-private LDA results in the most coherent words among all the methods. For the private LDA models with a total privacy budget fixed to ε_tot = 0.5, as we move from zCDP to advanced to linear composition, the amount of added noise grows larger, and therefore more topics have less coherent words.

Non-private zCDP (eps=0.5) Adv (eps=0.5) Lin (eps=0.5)
topic 3: topic 81: topic 72: topic 27:
david 0.0667 born 0.0882 fragment 0.0002 horn 0.0002
king 0.0318 american 0.0766 gentleness 0.0001 shone 0.0001
god 0.0304 name 0.0246 soit 0.0001 age 0.0001
son 0.0197 actor 0.0196 render 0.0001 tradition 0.0001
israel 0.0186 english 0.0179 nonproprietary 0.0001 protecting 0.0001
bible 0.0156 charles 0.0165 westminster 0.0001 fils 0.0001
hebrew 0.0123 british 0.0138 proceedings 0.0001 trip 0.0001
story 0.0102 richard 0.0130 clare 0.0001 article 0.0001
book 0.0095 german 0.0119 stronger 0.0001 interests 0.0001
adam 0.0092 character 0.0115 hesitate 0.0001 incidents 0.0001
topic 4: topic 82: topic 73: topic 28:
university 0.1811 wat 0.0002 mount 0.0034 american 0.0228
press 0.0546 armed 0.0001 display 0.0011 born 0.0154
oxford 0.0413 log 0.0001 animal 0.0011 john 0.0107
italy 0.0372 fierce 0.0001 equipment 0.0011 name 0.0094
jacques 0.0359 infantry 0.0001 cynthia 0.0009 english 0.0062
cambridge 0.0349 sehen 0.0001 position 0.0008 actor 0.0061
barbara 0.0280 selbst 0.0001 systems 0.0008 united 0.0058
research 0.0227 clearly 0.0001 support 0.0008 british 0.0051
murray 0.0184 bull 0.0001 software 0.0008 character 0.0051
scientific 0.0182 recall 0.0001 heavy 0.0008 people 0.0048
topic 5: topic 83: topic 74: topic 29:
association 0.0896 david 0.0410 david 0.0119 shelter 0.0001
security 0.0781 jonathan 0.0199 king 0.0091 rome 0.0001
money 0.0584 king 0.0188 god 0.0072 thick 0.0001
joint 0.0361 samuel 0.0186 church 0.0061 vous 0.0001
masters 0.0303 israel 0.0112 samuel 0.0054 leg 0.0001
banks 0.0299 saul 0.0075 son 0.0051 considering 0.0001
seal 0.0241 son 0.0068 israel 0.0039 king 0.0001
gilbert 0.0235 dan 0.0067 name 0.0038 object 0.0001
trade 0.0168 god 0.0053 century 0.0038 prayed 0.0001
heads 0.0166 story 0.0048 first 0.0036 pilot 0.0001
topic 6: topic 84: topic 75: topic 30:
law 0.0997 simon 0.0101 recognise 0.0001 despair 0.0001
court 0.0777 cat 0.0008 comparison 0.0001 ray 0.0001
police 0.0442 maison 0.0005 violates 0.0001 successfully 0.0001
legal 0.0396 breach 0.0005 offices 0.0001 respectable 0.0001
justice 0.0292 says 0.0005 value 0.0001 acute 0.0001
courts 0.0229 dirty 0.0005 neighbor 0.0001 accompany 0.0001
welcome 0.0204 rifle 0.0004 cetait 0.0001 assuming 0.0001
civil 0.0178 door 0.0004 composed 0.0001 florence 0.0001
signal 0.0170 property 0.0004 interests 0.0001 ambition 0.0001
pan 0.0163 genus 0.0004 argue 0.0001 unreasonable 0.0001
Table 1: Posterior topics from private and non-private LDA

Using the same data, we then tested how perplexity changes as we vary the mini-batch size. As shown in Fig. 1, the trends match the observations above: smaller mini-batches benefit more from privacy amplification, and the zCDP composition outperforms the advanced composition.

5 Conclusion

We have developed a practical privacy-preserving topic modeling algorithm which outputs accurate and privatised expected sufficient statistics and expected natural parameters. Our approach uses the zCDP composition analysis combined with the privacy amplification effect due to subsampling of data, which significantly decreases the amount of additive noise required for the same expected privacy guarantee compared to the standard analysis.