 # Private Topic Modeling

We develop a privatised stochastic variational inference method for Latent Dirichlet Allocation (LDA). The iterative nature of stochastic variational inference presents challenges: multiple iterations are required to obtain accurate posterior distributions, yet each iteration increases the amount of noise that must be added to achieve a reasonable degree of privacy. We propose a practical algorithm that overcomes this challenge by combining: (1) a relaxed notion of differential privacy, called concentrated differential privacy, which provides high-probability bounds for cumulative privacy loss rather than focusing on single-query loss, and is therefore well suited to iterative algorithms; and (2) privacy amplification resulting from subsampling of large-scale data. Focusing on conjugate exponential family models, in our private variational inference all the posterior distributions are privatised by simply perturbing the expected sufficient statistics. Using Wikipedia data, we illustrate the effectiveness of our algorithm for large-scale data.


## 1 Background

We start by providing background information on the definitions of algorithmic privacy that we use, as well as the general formulation of the variational inference algorithm.

### Differential privacy

Differential privacy (DP) is a formal definition of the privacy properties of data analysis algorithms. A randomized algorithm $\mathcal{M}(D)$ is said to be $(\epsilon, \delta)$-differentially private if

$$\Pr[\mathcal{M}(D) \in S] \leq \exp(\epsilon) \Pr[\mathcal{M}(\tilde{D}) \in S] + \delta$$

for all measurable subsets $S$ of the range of $\mathcal{M}$ and for all datasets $D$, $\tilde{D}$ differing by a single entry. If $\delta = 0$, the algorithm is said to be $\epsilon$-differentially private. Intuitively, the definition states that the output probabilities must not change very much when a single individual’s data is modified, thereby limiting the amount of information that the algorithm reveals about any one individual.

Concentrated differential privacy (CDP) is a recently proposed relaxation of DP which aims to make privacy-preserving iterative algorithms more practical than under DP while still providing strong privacy guarantees. The CDP framework treats the privacy loss of an outcome $y$,

$$L(y) = \log \frac{\Pr[\mathcal{M}(D) = y]}{\Pr[\mathcal{M}(\tilde{D}) = y]},$$

as a random variable. An algorithm is $(\mu, \tau)$-CDP if this privacy loss has mean $\mu$, and after subtracting $\mu$ the resulting random variable $L - \mu$ is subgaussian with standard deviation $\tau$, i.e. $\forall \lambda \in \mathbb{R}: \mathbb{E}\left[e^{\lambda (L - \mu)}\right] \leq e^{\lambda^2 \tau^2 / 2}$. While $\epsilon$-DP guarantees bounded privacy loss, and $(\epsilon, \delta)$-DP ensures bounded privacy loss with probability $1 - \delta$, $(\mu, \tau)$-CDP requires the privacy loss to be near $\mu$ with high probability.

### The general VI algorithm

Consider a generative model that produces a dataset $D = \{D_n\}_{n=1}^N$ consisting of $N$ independent identically distributed items, where $D_n$ is the $n$th observation, generated using a set of latent variables $l = \{l_n\}_{n=1}^N$. The generative model provides the likelihood $p(D, l \mid m)$, where $m$ denotes the model parameters. We also consider the prior distribution over the model parameters, $p(m)$, and the prior distribution over the latent variables, $p(l)$. Here, we focus on conjugate-exponential (CE) models, in which the variational updates are tractable.[^1] The CE family models satisfy two conditions: (1) the complete-data likelihood is in the exponential family, $p(D_n, l_n \mid m) = f(D_n, l_n)\, g(m) \exp\left(n(m)^\top s(D_n, l_n)\right)$; and (2) the prior over $m$ is conjugate to the complete-data likelihood, $p(m \mid \tau, \nu) = h(\tau, \nu)\, g(m)^{\tau} \exp\left(\nu^\top n(m)\right)$, where the natural parameters and sufficient statistics of the complete-data likelihood are denoted by $n(m)$ and $s(D_n, l_n)$, respectively. The hyperparameters are denoted by $\tau$ (a scalar) and $\nu$ (a vector).

[^1]: A large class of models falls in the CE family, including linear dynamical systems and switching models; Gaussian mixtures; factor analysis and probabilistic PCA; hidden Markov models and factorial HMMs; discrete-variable belief networks; and latent Dirichlet allocation (LDA), which we will use in Sec 3.

Variational inference for a CE family model iterates the following two steps in order to optimise the lower bound to the log marginal likelihood.

(a) Given the expected natural parameters $\bar{n}$, the first step computes the approximate posterior over the latent variables:
$$q(l) = \prod_{n=1}^N q(l_n) \propto \prod_{n=1}^N f(D_n, l_n) \exp\left(\bar{n}^\top s(D_n, l_n)\right) = \prod_{n=1}^N p(l_n \mid D_n, \bar{n}). \quad (1)$$
Using $q(l)$, the first step outputs the expected sufficient statistics $\bar{s}(D) = \frac{1}{N}\sum_{n=1}^N \langle s(D_n, l_n)\rangle_{q(l_n)}$.

(b) Given the expected sufficient statistics $\bar{s}(D)$, the second step computes the approximate posterior over the parameters:
$$q(m) = h(\tilde{\tau}, \tilde{\nu})\, g(m)^{\tilde{\tau}} \exp\left(\tilde{\nu}^\top n(m)\right), \quad \text{where } \tilde{\tau} = \tau + N, \; \tilde{\nu} = \nu + N \bar{s}(D). \quad (2)$$
Using $q(m)$, the second step outputs the expected natural parameters $\bar{n} = \langle n(m)\rangle_{q(m)}$.
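To make the two steps concrete, here is a minimal sketch for the simplest CE-family instance, a Beta-Bernoulli coin model with no per-item latent variables (a hypothetical toy example, not the LDA model used later): step (a) reduces to averaging the sufficient statistics $s(x) = [x, 1-x]$, and step (b) is the conjugate count update of eq (2).

```python
import numpy as np
from scipy.special import digamma

# Toy CE-family instance: Beta-Bernoulli coin flips (illustrative only).
rng = np.random.default_rng(0)
D = rng.binomial(1, 0.7, size=1000)   # observed data: N iid coin flips
N = len(D)
nu = np.array([1.0, 1.0])             # Beta(1, 1) prior in natural-parameter form

# Step (a): expected sufficient statistics s_bar(D) = (1/N) sum_n s(D_n),
# with s(x) = [x, 1 - x]; there are no latent variables to average over here.
s_bar = np.array([D.mean(), 1.0 - D.mean()])

# Step (b): conjugate update nu~ = nu + N * s_bar gives a Beta posterior;
# the expected natural parameters E[log m], E[log(1 - m)] follow via digamma.
nu_tilde = nu + N * s_bar
n_bar = digamma(nu_tilde) - digamma(nu_tilde.sum())
```

In a model with genuine latent variables (such as LDA), step (a) would additionally compute $q(l_n)$ and average $s(D_n, l_n)$ under it before the same conjugate update.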

## 2 Privacy Preserving VI algorithm for CE family

The only place where the algorithm looks at the data is when computing the expected sufficient statistics $\bar{s}(D)$ in the first step. The expected sufficient statistics then dictate the expected natural parameters $\bar{n}$ in the second step. So, perturbing the sufficient statistics perturbs both posterior distributions $q(l)$ and $q(m)$. Perturbing sufficient statistics in exponential families has also been used in earlier work. Existing work focuses on privatising posterior distributions in the context of posterior sampling [5, 6, 7, 8], while our work focuses on privatising approximate posterior distributions for optimisation-based approximate Bayesian inference.

Suppose there are two neighbouring datasets $D$ and $\tilde{D}$, differing by a single datapoint. We also assume that the dataset is pre-processed such that the norm of any datapoint is at most $1$. The maximum difference in the expected sufficient statistics given the two datasets, i.e., the L1 sensitivity of the expected sufficient statistics, is then (assuming $s$ is a vector of length $p$)
$$\Delta \bar{s} = \max_{|D - \tilde{D}| = 1} \sum_{j=1}^{p} \left| \bar{s}_j(D) - \bar{s}_j(\tilde{D}) \right|.$$
Under some models like LDA below, the expected sufficient statistics have a limited sensitivity, in which case we add noise to each coordinate of the expected sufficient statistics to compensate for the maximum change.
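A minimal sketch of this perturbation using the Gaussian mechanism (the function name and the $2/N$ sensitivity are illustrative assumptions: they hold when each item's sufficient statistic has norm at most 1, so replacing one item moves the average $\bar{s}(D)$ by at most $2/N$):

```python
import numpy as np

def perturb_suff_stats(s_bar, N, eps, delta, rng):
    """Release expected sufficient statistics via the Gaussian mechanism.

    Assumes each item's sufficient statistic has norm at most 1, so the
    sensitivity of the average s_bar over N items is at most 2 / N.
    """
    sensitivity = 2.0 / N
    sigma2 = 2.0 * np.log(1.25 / delta) * sensitivity**2 / eps**2
    return s_bar + rng.normal(0.0, np.sqrt(sigma2), size=s_bar.shape)

rng = np.random.default_rng(0)
s_bar = np.array([0.4, 0.6])
s_noisy = perturb_suff_stats(s_bar, N=10_000, eps=0.5, delta=1e-5, rng=rng)
```

Both $q(l)$ and $q(m)$ computed from the noisy statistics then inherit the privacy guarantee by post-processing.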

## 3 Privacy preserving Latent Dirichlet Allocation (LDA)

The most successful approach to topic modeling is based on LDA, whose generative process is as follows.

• Draw topics $\beta_k \sim$ Dirichlet$(\eta)$, for $k \in \{1, \ldots, K\}$, where $\eta$ is a scalar hyperparameter.

• For each document $d \in \{1, \ldots, D\}$:

  • Draw topic proportions $\theta_d \sim$ Dirichlet$(\alpha)$, where $\alpha$ is a scalar hyperparameter.

  • For each word $n \in \{1, \ldots, N\}$:

    • Draw topic assignment $z_{dn} \sim$ Discrete$(\theta_d)$.

    • Draw word $w_{dn} \sim$ Discrete$(\beta_{z_{dn}})$.

where each observed word $w_{dn}$ is represented by an indicator vector (the $n$th word in the $d$th document) of length $V$, where $V$ is the number of terms in a fixed vocabulary set. The topic assignment latent variable $z_{dn}$ is also an indicator vector, of length $K$, where $K$ is the number of topics.

LDA falls into the CE family, where we think of $z$ and $\theta$ as the two types of latent variables, $l_{dn} = \{z_{dn}, \theta_d\}$, and of the topics $\beta$ as the model parameters: (1) the complete-data likelihood per document is in the exponential family, $p(w_d, z_d, \theta_d \mid \beta) = f(w_d, z_d, \theta_d)\, g(\beta) \exp\left(n(\beta)^\top s(w_d, z_d)\right)$, where $n(\beta)_{vk} = \log \beta_{kv}$ and $s(w_d, z_d)_{vk} = \sum_n z_{dn}^k w_{dn}^v$; (2) the prior over $\beta$ is conjugate: $\beta_k \sim$ Dirichlet$(\eta)$ for $k \in \{1, \ldots, K\}$. For simplicity, we assume the hyperparameters $\alpha$ and $\eta$ are set manually.

In VI, we assume the posteriors are: (1) Discrete for $z_{dn}$, with variational parameters $\phi_{dn}^k$ that capture the posterior probability of topic assignment, $q(z_{dn}^k = 1) = \phi_{dn}^k$; (2) Dirichlet for $\theta_d$; and (3) Dirichlet for $\beta_k$. In this case, the expected sufficient statistics are $\bar{s}_{vk} = \frac{1}{D}\sum_d \sum_n \phi_{dn}^k w_{dn}^v$.

### Sensitivity analysis

To privatise the variational inference for LDA, we perturb the expected sufficient statistics. While each document has a different document length, we limit the maximum length of any document to $N$ by randomly selecting $N$ words in a document whenever the document contains more than $N$ words.
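This truncation step can be sketched as follows (`truncate_doc` is a hypothetical helper, not a name from the paper's algorithm):

```python
import numpy as np

def truncate_doc(words, max_len, rng):
    """Keep at most max_len words, sampled uniformly without replacement,
    so that every document contributes at most max_len words to s_bar."""
    if len(words) <= max_len:
        return list(words)
    idx = rng.choice(len(words), size=max_len, replace=False)
    return [words[i] for i in idx]

rng = np.random.default_rng(0)
doc = ["topic", "model", "privacy", "noise", "data", "words", "latent"]
short = truncate_doc(doc, max_len=5, rng=rng)
```

Bounding the document length is what caps the per-document contribution to the sufficient statistics, and hence the sensitivity derived below.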

We add Gaussian noise to each coordinate, then map to 0 if the perturbed coordinate becomes negative:

$$\tilde{\bar{s}}_{vk} = \bar{s}_{vk} + Y_{vk}, \quad \text{where } Y_{vk} \sim \mathcal{N}(0, \sigma^2), \text{ and } \sigma^2 \geq 2\log(1.25/\delta_{iter})\, (\Delta \bar{s})^2 / \epsilon_{iter}^2, \quad (3)$$

where $\bar{s}_{vk}$ is the $(v,k)$th coordinate of the vector $\bar{s}$ of length $VK$, and $\Delta \bar{s}$ is the sensitivity given by

$$\begin{aligned} \Delta \bar{s} &= \max_{|D - \tilde{D}| = 1} \sqrt{\sum_k \sum_v \left(\bar{s}_{vk}(D) - \bar{s}_{vk}(\tilde{D})\right)^2} \\ &= \max_{d, d'} \frac{1}{D} \sqrt{\sum_k \sum_v \left(\sum_n \left(\phi_{dn}^k w_{dn}^v - \phi_{d'n}^k w_{d'n}^v\right)\right)^2} \\ &\leq \max_d \frac{1}{D} \sum_k \sum_v \left|\sum_n \phi_{dn}^k w_{dn}^v\right|, \quad \text{since the L2 norm is at most the L1 norm,} \\ &\leq \max_d \frac{1}{D} \sum_n \left(\sum_k \phi_{dn}^k\right) \left(\sum_v w_{dn}^v\right) \leq \frac{N}{D}, \quad (4) \end{aligned}$$

since $\phi_{dn}^k \geq 0$, $\sum_k \phi_{dn}^k = 1$, $w_{dn}^v \in \{0, 1\}$, and $\sum_v w_{dn}^v = 1$.
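The $N/D$ bound in eq (4) can be sanity-checked numerically (this sketch uses random variational parameters and random one-hot words, and is not part of the training algorithm): any single document's contribution to $\bar{s}$ has total mass $N/D$, so its L2 norm cannot exceed $N/D$.

```python
import numpy as np

rng = np.random.default_rng(0)
K, V, D_docs, N = 5, 20, 50, 30    # topics, vocab size, documents, words/doc

words = rng.integers(0, V, size=(D_docs, N))        # word indices (one-hot implied)
phi = rng.dirichlet(np.ones(K), size=(D_docs, N))   # phi[d, n] sums to 1 over topics

# Expected sufficient statistics s_bar[v, k] = (1/D) sum_d sum_n phi^k_dn w^v_dn
s_bar = np.zeros((V, K))
for d in range(D_docs):
    for n in range(N):
        s_bar[words[d, n]] += phi[d, n]
s_bar /= D_docs

# A single document's contribution has total mass N/D, hence L2 norm <= N/D.
contrib = np.zeros((V, K))
for n in range(N):
    contrib[words[0, n]] += phi[0, n]
contrib /= D_docs
```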

### Private stochastic variational learning

In a large-scale data setting, it is impossible to handle the entire dataset at once. In such a case, stochastic learning using noisy gradients computed on mini-batches of data, a.k.a. stochastic gradient descent (SGD), provides a scalable inference method. While there is prior work on differentially private SGD (e.g., [10, 11]), privacy amplification due to subsampling combined with the CDP composition (described below) has not been used in the context of variational inference or topic modeling before.

The privacy amplification theorem states the following.

###### Theorem 1.

(Theorem 1 in ) Any $(\epsilon, \delta)$-DP mechanism running on a uniformly sampled subset of data with a sampling ratio $\nu$ guarantees $(\epsilon', \delta')$-differential privacy, where $\epsilon' = \log\left(1 + \nu(\exp(\epsilon) - 1)\right)$ and $\delta' = \nu \delta$.

The privacy gain from subsampling allows us to use a much more relaxed per-iteration privacy budget $\epsilon_{iter}$ and error tolerance $\delta_{iter}$, while still achieving a reasonable level of $(\epsilon_{tot}, \delta_{tot})$-DP with a small sampling rate $\nu$.
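Theorem 1 can be written as a one-line helper (an illustrative sketch; `amplify` is not a name from the paper):

```python
import numpy as np

def amplify(eps, delta, nu):
    """Privacy amplification by subsampling (Theorem 1): an (eps, delta)-DP
    mechanism run on a uniform subsample of ratio nu is (eps', delta')-DP."""
    eps_prime = np.log(1.0 + nu * (np.exp(eps) - 1.0))
    return eps_prime, nu * delta

eps_p, delta_p = amplify(eps=1.0, delta=1e-5, nu=0.01)
```

For small $\epsilon$ the amplified budget behaves like $\nu\epsilon$, which is why a small sampling rate permits a much larger per-iteration budget.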

Furthermore, the zCDP composition allows a sharper analysis of the per-iteration privacy budget. We first convert DP to zCDP, then use the zCDP composition and finally convert zCDP back to DP (for comparison purposes), for which we use the following lemmas and proposition.

###### Lemma 1.

(Proposition 1.6 in ) The Gaussian mechanism with noise variance $\sigma^2$ and sensitivity $\Delta$ satisfies $\left(\Delta^2/(2\sigma^2)\right)$-zCDP.

###### Lemma 2.

(Lemma 1.7 in ) If two mechanisms satisfy $\rho_1$-zCDP and $\rho_2$-zCDP, respectively, then their composition satisfies $(\rho_1 + \rho_2)$-zCDP.

###### Proposition 1.

(Proposition 1.3 in ) If $\mathcal{M}$ provides $\rho$-zCDP, then $\mathcal{M}$ is $\left(\rho + 2\sqrt{\rho \log(1/\delta)},\ \delta\right)$-DP for any $\delta > 0$.

So, using Lemmas 1 and 2, we obtain $\left(J\Delta^2/(2\tau)\right)$-zCDP after $J$-fold composition of the Gaussian mechanism with noise variance $\tau$. Using Proposition 1, we convert $\left(J\Delta^2/(2\tau)\right)$-zCDP to $(\epsilon_{tot}, \delta_{tot})$-DP, where $\epsilon_{tot} = J\Delta^2/(2\tau) + 2\sqrt{(J\Delta^2/(2\tau)) \log(1/\delta_{tot})}$.

These seemingly complicated steps can be summarised into two simple steps. First, given a total privacy budget $\epsilon_{tot}$ and total tolerance level $\delta_{tot}$, our algorithm calculates an intermediate privacy budget $(\epsilon', \delta')$ using the zCDP composition, which maps $(\epsilon_{tot}, \delta_{tot})$ to $(\epsilon', \delta')$,

$$\epsilon_{tot} = J\Delta^2/(2\tau) + 2\sqrt{(J\Delta^2/(2\tau)) \log(1/\delta_{tot})}, \quad \text{where } \tau \geq 2\log(1.25/\delta')\, \Delta^2/\epsilon'^2. \quad (5)$$

Second, our algorithm calculates the per-iteration privacy budget $(\epsilon_{iter}, \delta_{iter})$ using the privacy amplification theorem, which maps $(\epsilon', \delta')$ to $(\epsilon_{iter}, \delta_{iter})$,

$$\epsilon' = \log\left(1 + \nu(\exp(\epsilon_{iter}) - 1)\right), \qquad \delta' = \nu \delta_{iter}. \quad (6)$$
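The two steps can be sketched in the forward direction: amplify a per-iteration budget via eq (6), calibrate the Gaussian noise variance $\tau$ to the amplified budget, compose the resulting per-iteration zCDP over $J$ iterations (Lemmas 1 and 2), and convert back to DP via Proposition 1. The function name is illustrative, not taken from Algorithm 1.

```python
import numpy as np

def total_budget(eps_iter, delta_iter, nu, J, Delta, delta_tot):
    # Eq (6): amplification by subsampling with ratio nu
    eps_p = np.log(1.0 + nu * (np.exp(eps_iter) - 1.0))
    delta_p = nu * delta_iter
    # Gaussian mechanism calibrated to (eps_p, delta_p): noise variance tau
    tau = 2.0 * np.log(1.25 / delta_p) * Delta**2 / eps_p**2
    # Lemmas 1 and 2: each iteration is (Delta^2 / (2 tau))-zCDP;
    # J iterations compose to rho = J * Delta^2 / (2 tau)
    rho = J * Delta**2 / (2.0 * tau)
    # Proposition 1 / eq (5): convert rho-zCDP back to (eps_tot, delta_tot)-DP
    return rho + 2.0 * np.sqrt(rho * np.log(1.0 / delta_tot))

eps_tot = total_budget(eps_iter=1.0, delta_iter=1e-4, nu=0.01,
                       J=1000, Delta=1.0, delta_tot=1e-4)
```

With a sampling rate of $\nu = 0.01$, a full per-iteration budget of $\epsilon_{iter} = 1$ composes over a thousand iterations to a total budget well below 1, illustrating the combined gain of amplification and zCDP accounting.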

Algorithm 1 summarizes our private topic modeling algorithm.

## 4 Experiments using Wikipedia data

We randomly downloaded documents from Wikipedia. We then tested our VIPS algorithm on the Wikipedia dataset with four different values of the total privacy budget, using a fixed mini-batch size, until the algorithm had seen the downloaded documents. We fixed the number of topics and used a vocabulary set of fixed size.

We compare our method to two baseline methods. First, under linear (Lin) composition (Theorem 3.16 of ), privacy degrades linearly with the number of iterations $J$. This result follows from the max divergence of the privacy loss random variable being bounded by a total budget. Hence, the linear composition yields $(J\epsilon', J\delta')$-DP. We use eq (6) to map $(\epsilon', \delta')$ to $(\epsilon_{iter}, \delta_{iter})$. Second, advanced (Adv) composition (Theorem 3.20 of ), resulting from the max divergence of the privacy loss random variable being bounded by a total budget including a slack variable $\delta''$, yields $\left(\epsilon'\sqrt{2J\log(1/\delta'')} + J\epsilon'(e^{\epsilon'} - 1),\ J\delta' + \delta''\right)$-DP. Similarly, we use eq (6) to map $(\epsilon', \delta')$ to $(\epsilon_{iter}, \delta_{iter})$.
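The three accounting schemes can be compared numerically (a sketch under standard conversions: $\epsilon$-DP implies $(\epsilon^2/2)$-zCDP, and the linear and advanced bounds are the textbook forms; the numbers below are illustrative, not the paper's experimental settings):

```python
import numpy as np

def linear_total(eps, delta, J):
    # Linear composition: budgets add up over J iterations
    return J * eps, J * delta

def advanced_total(eps, delta, J, delta_slack):
    # Advanced composition with slack delta_slack
    eps_tot = (eps * np.sqrt(2.0 * J * np.log(1.0 / delta_slack))
               + J * eps * (np.exp(eps) - 1.0))
    return eps_tot, J * delta + delta_slack

def zcdp_total(eps, J, delta_tot):
    # eps-DP -> (eps^2 / 2)-zCDP, compose, convert back via Proposition 1
    rho = J * eps**2 / 2.0
    return rho + 2.0 * np.sqrt(rho * np.log(1.0 / delta_tot))

J, eps, delta = 1000, 0.1, 1e-6
lin, _ = linear_total(eps, delta, J)
adv, _ = advanced_total(eps, delta, J, delta_slack=1e-4)
zcdp = zcdp_total(eps, J, delta_tot=1e-4)
```

For many iterations the zCDP total sits below the advanced-composition total, which in turn is far below the linear total, matching the ordering observed in the experiments.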

As an evaluation metric, we compute the upper bound to the perplexity on held-out documents[^2] (the exponential of the negative variational lower bound on the log-likelihood, divided by the total number of words in the held-out set), where $n_d$ is a vector of word counts for the $d$th document. In this bound, we use the topic parameter posterior that was calculated during training. We compute the posteriors over $z$ and $\theta$ by performing the first step of our algorithm using the test data and the perturbed sufficient statistics we obtained during training. The per-word perplexity is shown in Fig. 1. Due to privacy amplification, it is more beneficial to decrease the amount of additive noise when the mini-batch size is small. The zCDP composition results in better accuracy than the advanced composition.

[^2]: We used the metric written in the python implementation by the authors of .

Figure 1: Per-word-perplexity with different mini-batch sizes $S \in \{10, 20, 50, 100, 200, 400\}$. In the private LDA (Top/Right and Bottom), a smaller mini-batch size achieves lower perplexity, due to the privacy amplification lemma (see Sec 3). We set the total privacy budget $\epsilon_{tot} = 1$ and the total tolerance $\delta_{tot} = 10^{-4}$ in all private methods. Regardless of the mini-batch size, the zCDP composition (Top/Right) achieves a lower perplexity than the Advanced (Bottom/Left) and Linear compositions (Bottom/Right).

In Table 1, we show the top words in terms of assigned probabilities under a chosen topic in each method, with a few example topics. Non-private LDA results in the most coherent words among all the methods. For the private LDA models with a fixed total privacy budget, as we move from zCDP, to advanced, and to linear composition, the amount of noise added gets larger, and therefore more topics have less coherent words.

Using the same data, we also tested how the perplexity changes with the mini-batch size, as shown in Fig. 1 and discussed above.

## 5 Conclusion

We have developed a practical privacy-preserving topic modeling algorithm which outputs accurate and privatised expected sufficient statistics and expected natural parameters. Our approach uses the zCDP composition analysis combined with the privacy amplification effect due to subsampling of data, which significantly decreases the amount of additive noise needed for the same expected privacy guarantee compared to the standard analysis.