1 Private Machine Learning
Our aim is to develop a strategy for machine learning driven by the requirement that private data should be shared as little as possible and that no one can be trusted with an individual’s data, neither a data collector/aggregator, nor the machine learner that tries to fit a model.
Randomised Response, see for example Warner65, is relevant in this context: a datapoint $x$ is replaced with a randomised ‘noisy’ version $y$. A classical example is voting in an election in which an individual voter votes for one of two candidates, $A$ or $B$, and is asked to lie (with probability $p$) about whom they voted for. This results in noisy data, and estimating the fraction of voters that voted for candidate $A$ directly from this noisy data,
$$\tilde{f}_A = \frac{\text{number of noisy votes for } A}{\text{total number of votes}} \tag{1}$$
can give a potentially significantly incorrect estimate. As Warner65 showed, since we know the probabilistic mechanism that generated the noisy data, a better estimate of the fraction of voters voting for candidate $A$ is given by
$$\hat{f}_A = \frac{\tilde{f}_A - p}{1 - 2p}, \qquad p \neq \tfrac{1}{2} \tag{2}$$
This follows because, if the true fraction is $f_A$, the expected noisy fraction is $(1-p) f_A + p (1 - f_A)$.
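As a small illustration of the gap between the naive and the corrected estimate, the following sketch simulates randomised response for a binary vote. The function names, the lie probability `p = 0.3` and the true fraction `0.7` are illustrative assumptions, not values from the text; candidate $A$ is coded as 1.

```python
import numpy as np

def randomised_response(votes, p, rng):
    """Flip each binary vote independently with lie probability p."""
    lie = rng.random(votes.shape) < p
    return np.where(lie, 1 - votes, votes)

def naive_estimate(noisy_votes):
    """Fraction of noisy votes equal to 1 (ignores the noise mechanism)."""
    return noisy_votes.mean()

def warner_estimate(noisy_votes, p):
    """Correct the noisy fraction using the known lie probability p != 0.5."""
    return (noisy_votes.mean() - p) / (1.0 - 2.0 * p)

rng = np.random.default_rng(0)
true_fraction, p = 0.7, 0.3          # assumed values for illustration
votes = (rng.random(100_000) < true_fraction).astype(int)
noisy = randomised_response(votes, p, rng)

# The naive estimate is pulled towards 0.5; the corrected one is close to 0.7.
print(naive_estimate(noisy), warner_estimate(noisy, p))
```

With these settings the naive estimate concentrates around $p + 0.7(1-2p) = 0.58$, whereas the corrected estimate concentrates around the true fraction, at the cost of a variance inflated by $1/(1-2p)^2$.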
In a machine learning context, the kind of scenario we envisage is that users may have labelled face images as “happy” or “sad” on their mobile phones and the company MugTome wishes to train a “happy/sad” face classifier; however, users do not wish to send the raw face images to MugTome and also wish to be able to plausibly deny which label they gave any training image. To preserve privacy, each user will send to MugTome only a single corrupted datapoint: a single corrupted image and a single corrupted label.
It is straightforward to extend our approach to deal with users sending multiple corrupted datapoints. However, since MugTome will potentially then know which corrupted datapoints belong to each user, they will have more information to help reveal the underlying clean datapoint. Since we assume we cannot trust MugTome, MugTome may attempt to recover the underlying true datapoint. For example, if a user sends three corrupted class labels, say $1, 1, 0$, then MugTome can make a good guess of the underlying true class label by simply taking the majority class, here $1$. Indeed, in general, if $M$ corrupted datapoints $y_1, \ldots, y_M$ are independently generated for a user, then MugTome’s ability to reveal the true class (or attribute) $x$ increases dramatically. For example, if MugTome know the corruption mechanism $p(y|x)$, the posterior of the class is given by
$$p(x \,|\, y_1, \ldots, y_M) \propto p(x) \prod_{m=1}^{M} p(y_m \,|\, x) \tag{3}$$
where $p(x)$ is the prior belief on the true class. This posterior distribution concentrates exponentially quickly (in $M$) around the true value of $x$. Similarly, if a pollster asks each voter three times what they voted, then the questioner would have a very good idea of the true vote of each voter; to protect the voter’s privacy, the voter would then have to trust that the pollster either does not pass on any information that states that the three votes came from the same person or that the pollster doesn’t attempt themselves to figure out what the voter voted for.
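The concentration of this posterior can be checked numerically. The sketch below assumes a binary class, a symmetric flip probability and a flat prior (all illustrative choices), and computes the posterior proportional to the prior times the product of the per-label corruption probabilities.

```python
import numpy as np

def class_posterior(noisy_labels, flip_prob, prior=(0.5, 0.5)):
    """Posterior over the true binary class given independent noisy copies.

    Each noisy label equals the true class with probability 1 - flip_prob.
    Returns an array [p(x=0 | labels), p(x=1 | labels)].
    """
    noisy_labels = np.asarray(noisy_labels)
    log_post = np.log(np.asarray(prior, dtype=float))
    for x in (0, 1):
        agree = np.sum(noisy_labels == x)
        disagree = noisy_labels.size - agree
        log_post[x] += agree * np.log(1 - flip_prob) + disagree * np.log(flip_prob)
    log_post -= log_post.max()           # stabilise before exponentiating
    post = np.exp(log_post)
    return post / post.sum()

# Three noisy labels already give a fairly confident guess;
# nine labels with the same 2:1 ratio give a much sharper one.
print(class_posterior([1, 1, 0], flip_prob=0.2))
print(class_posterior([1] * 6 + [0] * 3, flip_prob=0.2))
```

The posterior odds are $((1-\text{flip})/\text{flip})^{k}$ where $k$ is the excess of majority over minority labels, which is why releasing several corruptions of the same datapoint erodes deniability so quickly.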
Similarly, in a medical setting in which a patient privately owns a datapoint, releasing synthetic versions (corruptions) of that datapoint can compromise privacy if which synthetic datapoints belong to each person is also known. To guarantee that privacy is retained would require patients to trust people with their data, namely that any data aggregation process will remove their patient ID. However, this is something out of the control of the patient and as such we do not consider generating multiple synthetic datapoints (see for example DBLP:journals/corr/abs170807975) a ‘safe’ mechanism.
For these reasons, we wish to make a process in which an individual only reveals a single corrupted datapoint; from that point onwards in the machine learning training process, no other trust in that process is required. To motivate our general approach to private machine learning we discuss the voting example in more detail in section(3). Connections to other forms of privacy preserving machine learning are discussed in section(7). The justification for our approach hinges on the properties of the Spread Divergence, which we review in the following section.
2 Spread Divergence
Throughout we use the notation $p(X = x)$ for a random variable $X$ in state $x$. However, to reduce notational overhead, where unambiguous, we write simply $p(x)$.
A divergence $D(p \| q)$ (see, for example Dragomir2005) is a measure of the difference between two distributions $p$ and $q$ with the property
$$D(p \| q) \geq 0, \quad \text{with equality if and only if } p = q \tag{4}$$
An important class is the $f$-divergence, defined as
$$D_f(p \| q) = \mathbb{E}_{q(x)}\left[ f\left( \frac{p(x)}{q(x)} \right) \right] \tag{5}$$
where $f$ is a convex function with $f(1) = 0$. A special case of an $f$-divergence is the well-known Kullback-Leibler divergence $\mathrm{KL}(p \| q) = \mathbb{E}_{p(x)}\left[ \log \frac{p(x)}{q(x)} \right]$, which is widely used to train models using maximum likelihood.
For the Spread Divergence, from $p(x)$ and $q(x)$ we define new distributions $\tilde{p}(y)$ and $\tilde{q}(y)$ that have the same support. Using the notation $\int_x$ to denote integration $\int (\cdot)\, dx$ for continuous $x$, and $\sum_{x \in \mathcal{X}}$ for discrete $x$ with domain $\mathcal{X}$, we define a random variable $y$ with the same domain as $x$ and distributions
$$\tilde{p}(y) = \int_x p(y|x)\, p(x), \qquad \tilde{q}(y) = \int_x p(y|x)\, q(x) \tag{6}$$
where $p(y|x)$ ‘spreads’ the mass of $p$ and $q$ such that $\tilde{p}$ and $\tilde{q}$ have the same support. For example, if we use a Gaussian $p(y|x) = \mathcal{N}(y \,|\, x, \sigma^2)$, then $\tilde{p}$ and $\tilde{q}$ both have support $\mathbb{R}$. The spread divergence has a requirement on the noise $p(y|x)$, namely that $D(\tilde{p} \| \tilde{q}) = 0 \Leftrightarrow p = q$; that is, if the divergence of the spreaded distributions is zero, then the original non-spreaded distributions will match. As shown in zhang2018spread this is guaranteed for certain ‘spread noise’ distributions. In particular, for continuous $x$ and $y$ of the same dimension and injective function $h$, a sufficient condition for a valid spread noise $p(y|x) = K(y - h(x))$ is that the kernel $K$ has strictly positive Fourier Transform. For discrete variables, a sufficient condition is that the matrix $P$ with elements $P_{ij} = p(y = i \,|\, x = j)$ is square and invertible.
Spread divergences have a natural connection to privacy preservation and Randomised Response (Warner65). The spread divergence suggests a general strategy to perform private machine learning. We first express the machine learning problem as $\min_\theta \mathrm{KL}(p(x) \| p_\theta(x))$ for a specified model $p_\theta(x)$. Then, given only noisy data $y$, we fit the model by $\min_\theta \mathrm{KL}(\tilde{p}(y) \| \tilde{p}_\theta(y))$. To explain in more detail how this works, we first describe randomised response in a classical voting context and then justify how to generalise this to principled training of machine learning models based on corrupted data.
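For discrete variables the spread construction is just a matrix-vector product, so the validity condition and the effect of spreading can be checked directly. The following is a minimal sketch under assumed notation: `P[i, j]` stands for the probability of noisy state `i` given clean state `j`, and the two-state flip matrix is an illustrative choice.

```python
import numpy as np

def spread(P, p):
    """Spread a discrete distribution p with noise matrix P[i, j] = p(y=i | x=j)."""
    return P @ p

def kl(p, q):
    """KL(p || q) for discrete distributions; terms with p = 0 contribute 0."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def is_valid_spread_noise(P):
    """Sufficient condition for discrete spread noise: P square and invertible."""
    P = np.asarray(P, float)
    return bool(P.shape[0] == P.shape[1] and np.linalg.matrix_rank(P) == P.shape[0])

flip = 0.2
P = np.array([[1 - flip, flip],
              [flip, 1 - flip]])
p = np.array([1.0, 0.0])   # all mass on state 0
q = np.array([0.0, 1.0])   # all mass on state 1, so KL(p || q) itself is infinite

# After spreading, both distributions have full support and the KL is finite.
spread_kl = kl(spread(P, p), spread(P, q))
print(is_valid_spread_noise(P), spread_kl)
```

A flip probability of exactly 0.5 would make `P` singular: every clean distribution is then mapped to the same noisy distribution and nothing can be learned from the corrupted data.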
3 A classical voting example
There are two candidates in an election, candidate “one” and candidate “zero”, and Alice would like to know the fraction of voters that voted for candidate “one”. We write the dataset of voting as a collection of binary values $x_n \in \{0, 1\}$, $n = 1, \ldots, N$.
3.1 Learning using clean data
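If we assume that Alice has full knowledge of which candidate each voter voted for, then clearly Alice may simply count the fraction of people that voted for “one” and set
$$\theta = \frac{1}{N} \sum_{n=1}^{N} \mathbb{I}[x_n = 1] \tag{7}$$
It will be useful to first consider how to arrive at the same result from a modelling perspective. We can consider an independent Bernoulli model
$$p_\theta(x_1, \ldots, x_N) = \prod_{n=1}^{N} p_\theta(x_n) \tag{8}$$
where
$$p_\theta(x_n) = \theta^{\mathbb{I}[x_n = 1]} (1 - \theta)^{\mathbb{I}[x_n = 0]} \tag{9}$$
so that
$$p_\theta(x_n = 1) = \theta \tag{10}$$
We also construct an empirical data distribution that places mass only on the observed joint state, namely
$$\hat{p}(x'_1, \ldots, x'_N) = \prod_{n=1}^{N} \delta(x'_n, x_n) \tag{11}$$
where $\delta(a, b)$ is the Kronecker delta function. Then
$$\mathrm{KL}(\hat{p} \,\|\, p_\theta) = -N L(\theta) \tag{12}$$
where
$$L(\theta) = \frac{1}{N} \sum_{n=1}^{N} \log p_\theta(x_n) \tag{13}$$
$$\phantom{L(\theta)} = \frac{1}{N} \sum_{n=1}^{N} \left( \mathbb{I}[x_n = 1] \log \theta + \mathbb{I}[x_n = 0] \log (1 - \theta) \right) \tag{14}$$
and minimising $\mathrm{KL}(\hat{p} \,\|\, p_\theta)$ (or maximising $L(\theta)$) with respect to $\theta$ recovers the fraction of votes that are 1, equation(7). This shows how we can frame estimating the quantity $\theta$ from uncorrupted private data as a divergence minimisation problem.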
3.2 Learning using corrupted data
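Returning to the privacy setting, Bob would also like to know the fraction of votes that are 1. However, Alice does not want to send to Bob the raw data $x_1, \ldots, x_N$ since the votes of any individual should not be explicitly revealed. To preserve privacy, Alice sends noisy data $y_1, \ldots, y_N$ to Bob. In this case we draw a single joint sample $y_1, \ldots, y_N$ from the distribution
$$\tilde{p}(y_1, \ldots, y_N) = \prod_{n=1}^{N} p(y_n \,|\, x_n) \tag{15}$$
where the ‘spread noise’ model is $p(y|x)$. Hence, if $x_n = 1$, Alice draws a sample $y_n = 1$ with probability $p(y = 1 \,|\, x = 1)$ and $y_n = 0$ with probability $p(y = 0 \,|\, x = 1)$.
Given a sampled noisy dataset $y_1, \ldots, y_N$ we form an empirical spreaded data distribution
$$\hat{\tilde{p}}(y'_1, \ldots, y'_N) = \prod_{n=1}^{N} \delta(y'_n, y_n) \tag{16}$$
Similarly, the corrupted joint model is given by
$$\tilde{p}_\theta(y_1, \ldots, y_N) = \prod_{n=1}^{N} \tilde{p}_\theta(y_n) \tag{17}$$
where
$$\tilde{p}_\theta(y_n) = \sum_{x_n} p(y_n \,|\, x_n)\, p_\theta(x_n) \tag{18}$$
On receiving the noisy dataset $y_1, \ldots, y_N$, Bob can try to estimate $\theta$ by minimising
$$\mathrm{KL}(\hat{\tilde{p}} \,\|\, \tilde{p}_\theta) \tag{19}$$
with respect to $\theta$. Equivalently, he may maximise the scaled spread log likelihood
$$\tilde{L}(\theta) = \frac{1}{N} \sum_{n=1}^{N} \log \tilde{p}_\theta(y_n) \tag{20}$$
For this simple model, Bob can easily explicitly calculate
$$\tilde{p}_\theta(y = 1) = p(y = 1 \,|\, x = 1)\, \theta + p(y = 1 \,|\, x = 0)\, (1 - \theta) \tag{21}$$
Similarly, $\tilde{p}_\theta(y = 0) = p(y = 0 \,|\, x = 1)\, \theta + p(y = 0 \,|\, x = 0)\, (1 - \theta)$. In this case, equation(20) becomes
$$\tilde{L}(\theta) = \frac{N_1}{N} \log \tilde{p}_\theta(y = 1) + \frac{N_0}{N} \log \tilde{p}_\theta(y = 0) \tag{22}$$
where
$$N_1 = \sum_{n=1}^{N} \mathbb{I}[y_n = 1], \qquad N_0 = \sum_{n=1}^{N} \mathbb{I}[y_n = 0] \tag{23}$$
Using $f_1 = N_1 / N$, $p_{11} = p(y = 1 \,|\, x = 1)$, $p_{10} = p(y = 1 \,|\, x = 0)$, the maximum of the spread log likelihood is at
$$\theta = \frac{f_1 - p_{10}}{p_{11} - p_{10}} \tag{24}$$
(provided $p_{11} \neq p_{10}$ and this value lies in $[0, 1]$), which forms Bob’s estimate of the underlying fraction of voters that voted for candidate “one”.
For example, if there were no noise, $p_{11} = 1$ and $p_{10} = 0$, Bob would estimate $\theta = f_1$, simply recovering the fraction of votes that are 1 in the original data. In the limit of a large number of votes $N$ and true probability $\theta_0$ of a voter voting for candidate “one”, $f_1$ tends to $p_{11} \theta_0 + p_{10} (1 - \theta_0)$ and Bob’s estimate recovers the true underlying voting probability $\theta_0$. Hence, even though Bob only receives a corrupted set of votes, in the limit of a large number of votes, he can nevertheless estimate the true fraction of people that voted for candidate “one”.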
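Bob's estimate can be checked numerically by comparing a closed-form maximiser with a brute-force search over the spread log likelihood. In this sketch `p11` and `p10` denote the probabilities of observing a noisy 1 given a clean 1 or a clean 0 respectively, and `f1` is the fraction of noisy votes equal to 1; these names and the numerical values are assumptions for illustration.

```python
import numpy as np

def spread_log_lik(theta, f1, p11, p10):
    """Scaled spread log likelihood for the Bernoulli voting model."""
    q1 = p11 * theta + p10 * (1 - theta)      # model probability of a noisy 1
    return f1 * np.log(q1) + (1 - f1) * np.log(1 - q1)

def closed_form_estimate(f1, p11, p10):
    """Stationary point of the spread log likelihood, clipped to [0, 1].

    The objective is concave in theta, so clipping gives the constrained maximiser.
    """
    return float(np.clip((f1 - p10) / (p11 - p10), 0.0, 1.0))

p11, p10 = 0.8, 0.3                  # assumed (asymmetric) corruption probabilities
theta_true = 0.6
f1 = p11 * theta_true + p10 * (1 - theta_true)   # large-N limit of the noisy fraction

grid = np.linspace(0.001, 0.999, 999)
theta_grid = grid[np.argmax(spread_log_lik(grid, f1, p11, p10))]
theta_closed = closed_form_estimate(f1, p11, p10)
print(theta_grid, theta_closed)      # both should be close to theta_true
```

With finite data `f1` fluctuates, and for small samples the unclipped value can fall outside the unit interval, which is why the sketch clips it.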
4 Private Machine Learning using Randomised Response
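The above example suggests a general strategy to perform private machine learning:

Phrase problem as likelihood maximisation: We first assume that a machine learning task for private data $x_1, \ldots, x_N$ can be expressed as learning a data model $p_\theta(x)$ by optimising an objective
$$L(\theta) = \frac{1}{N} \sum_{n=1}^{N} \log p_\theta(x_n) \tag{25}$$

Form a corrupted dataset: Draw a single joint sample $y_1, \ldots, y_N$ from the distribution
$$\tilde{p}(y_1, \ldots, y_N) = \prod_{n=1}^{N} p(y_n \,|\, x_n) \tag{26}$$
where $p(y|x)$ is a defined spread noise distribution, known by both the owner of the private data and the receiver of the corrupted data. To do this, we go through each element $x_n$ of the dataset and replace it with a corruption $y_n$ sampled from $p(y_n \,|\, x_n)$.

Send data to learner: We then send to the learner the corrupted dataset $y_1, \ldots, y_N$, the model $p_\theta(x)$ to be learned and the corruption probability $p(y|x)$.

Estimate from corrupted data: Having received the corrupted data $y_1, \ldots, y_N$, the learner fits $\theta$ by maximising the objective
$$\tilde{L}(\theta) = \frac{1}{N} \sum_{n=1}^{N} \log \tilde{p}_\theta(y_n) \tag{27}$$
where
$$\tilde{p}_\theta(y_n) = \int_{x_n} p(y_n \,|\, x_n)\, p_\theta(x_n) \tag{28}$$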
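The four steps above can be sketched end to end for a small categorical model. This is only an illustration under stated assumptions: the data model is a free categorical distribution with probability vector `theta`, the spread noise is a known matrix `P[i, j]` (noisy state `i` given clean state `j`), and the learner maximises the spread log likelihood with a simple EM iteration, which is one of several possible optimisers.

```python
import numpy as np

def corrupt(x, P, rng):
    """Owner side: replace each clean state x_n by one sample y_n from P[:, x_n]."""
    cdf = np.cumsum(P[:, x], axis=0)             # column n is the CDF of p(y | x_n)
    u = rng.random(x.shape[0])
    return np.minimum((u >= cdf).sum(axis=0), P.shape[0] - 1)

def fit_spread_ml(y, P, n_iters=500):
    """Learner side: maximise the spread log likelihood of a categorical model by EM."""
    counts = np.bincount(y, minlength=P.shape[0])
    f = counts / counts.sum()                    # empirical noisy distribution
    theta = np.full(P.shape[1], 1.0 / P.shape[1])
    for _ in range(n_iters):
        noisy_model = P @ theta                  # spread model over noisy states
        theta = theta * (P.T @ (f / noisy_model))  # average posterior over clean states
    return theta

rng = np.random.default_rng(0)
theta_true = np.array([0.5, 0.3, 0.2])           # assumed clean distribution
P = 0.7 * np.eye(3) + 0.1 * np.ones((3, 3))      # keep state w.p. 0.8, else move
x = rng.choice(3, size=50_000, p=theta_true)     # private data, never sent
y = corrupt(x, P, rng)                           # the only data the learner sees
theta_hat = fit_spread_ml(y, P)
print(theta_hat)
```

The learner never touches `x`; it receives only `y`, the model family and `P`. Simply normalising the histogram of `y` would instead estimate `P @ theta_true`, which is biased towards the uniform distribution.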
4.1 Justification
If we assume that each element of the training data is identically and independently sampled from a model $p_{\theta_0}(x)$, then each corrupted observation $y_n$ is a sample from the same distribution given by
$$\tilde{p}_{\theta_0}(y) = \int_x p(y|x)\, p_{\theta_0}(x) \tag{29}$$
By the law of large numbers the objective equation(27) approaches its average over the data generating mechanism
$$\tilde{L}(\theta) \rightarrow \mathbb{E}_{\tilde{p}_{\theta_0}(y)}\left[ \log \tilde{p}_\theta(y) \right] \tag{30}$$
and maximising the spread likelihood objective becomes equivalent to minimising
$$\mathrm{KL}(\tilde{p}_{\theta_0}(y) \,\|\, \tilde{p}_\theta(y)) \tag{31}$$
Provided that the spread noise is valid (see section(2)), then
$$\mathrm{KL}(\tilde{p}_{\theta_0} \,\|\, \tilde{p}_\theta) = 0 \;\Leftrightarrow\; p_{\theta_0} = p_\theta \;\Leftrightarrow\; \theta = \theta_0 \tag{32}$$
for an identifiable model $p_\theta$. Thus
$$\hat{\theta} = \arg\max_\theta \tilde{L}(\theta) \tag{33}$$
is a consistent estimator.
This means that (in the large data limit and assuming the training data is generated from the model), even though we only train on corrupted data, we are optimising an objective $\mathrm{KL}(\tilde{p}_{\theta_0} \,\|\, \tilde{p}_\theta)$ which has a global minimum close to that of the objective on uncorrupted data, $\mathrm{KL}(p_{\theta_0} \,\|\, p_\theta)$. Indeed, the estimator is consistent in the sense that as the amount of training data increases, we will recover the true clean data generating mechanism. Hence, provided that the corruption process is based on spread noise, we can still learn the model parameters even by training on only corrupted data. In our motivating voting scenario in section(3), we saw explicitly that the estimate of the true underlying voting fraction is consistent and indeed, this is a general property of our approach.
4.2 Training on noise only
A common approach in private machine learning is to form synthetic (noisy, corrupted) data and then simply train the standard model on this noisy data, see for example li2019deepobfuscator. In our notation, this would be equivalent to maximising the likelihood
$$\frac{1}{N} \sum_{n=1}^{N} \log p_\theta(y_n) \tag{34}$$
As above, assuming that the training data is generated from an underlying model $p_{\theta_0}(x)$, by the law of large numbers,
$$\frac{1}{N} \sum_{n=1}^{N} \log p_\theta(y_n) \rightarrow \mathbb{E}_{\tilde{p}_{\theta_0}(y)}\left[ \log p_\theta(y) \right] \tag{35}$$
In general, the optimum of this objective does not occur when $\theta = \theta_0$ and therefore training on noisy data alone does not form a consistent estimator of the true underlying model. We discuss this more extensively in the context of logistic regression in section(B), in which we describe a situation in which training with corrupted class labels actually does form a consistent estimator. We also discuss more general settings in which training on the noise only may be justified.
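For a Bernoulli model the inconsistency of training on the noise alone is easy to see numerically. The sketch below assumes a symmetric flip probability and evaluates the large-data limit of the noisy-data log likelihood on a grid; the values `theta0 = 0.9` and `flip = 0.2` are illustrative.

```python
import numpy as np

def noise_only_objective(theta, theta0, flip):
    """Large-data limit of the noisy-data log likelihood for a Bernoulli model."""
    q1 = theta0 * (1 - flip) + (1 - theta0) * flip   # probability of a noisy 1
    return q1 * np.log(theta) + (1 - q1) * np.log(1 - theta)

theta0, flip = 0.9, 0.2
grid = np.linspace(0.001, 0.999, 999)
theta_noise_only = grid[np.argmax(noise_only_objective(grid, theta0, flip))]

# The maximiser is the probability of a noisy 1, not the clean parameter.
print(theta_noise_only)
```

Here the maximiser sits at the noisy probability $0.9 \times 0.8 + 0.1 \times 0.2 = 0.74$ rather than at the true value 0.9, and no amount of additional data removes this bias.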
Figure: The optimal $\theta$ for the model $p_\theta(x)$ with binary variable $x \in \{0, 1\}$, $p_\theta(x = 1) = \theta$. In each case we plot along the $x$-axis the true $\theta_0$ from 0 to 1 and on the $y$-axis the value of $\theta$ that maximises the objective. In each plot we use a different flip probability. For a consistent estimator we would require that each plot is a straight line, which only occurs in the case of no noise (zero flip probability).
4.3 Reconstruction Approach
A seemingly natural alternative to our method is to attempt to reconstruct the clean datapoint from the noisy datapoint and use that within a standard learning framework. This approach would give an objective
(36) 
Here we need to define a posterior distribution to reconstruct the clean datapoint. Since the learner only has knowledge of the prior it is natural to set
(37) 
By the law of large numbers converges to its expectation with respect to the true data generating mechanism , so that
(38) 
In general, the optimum of this objective is not at the true parameter $\theta_0$. To demonstrate this, we plot in figure(6) the optimal $\theta$ for a simple Bernoulli model for which we can calculate the objective exactly. As we see, for all but zero flip noise, the estimator does not correctly identify the underlying probability generating mechanism. For this reason, we do not pursue this approach further.
4.4 Other Divergences
An extension of the above is to learn $\theta$ by minimising other $f$-divergences
(39) 
However, this generalisation to any divergence is harder to justify since the expectation of this objective (by averaging over the noise realisations)
(40) 
will not in general give a divergence between spreaded distributions. This means that in the limit of a large number of datapoints, it is not guaranteed to recover the true data generating process, except for special choices of the divergence, such as the KL divergence. We leave a discussion of this for future work.
5 Private Logistic Regression
As an application of the above framework to a standard machine learning model, we now discuss how to form a private version of logistic regression.
Returning to our motivating example, users may have labelled face images as “happy” or “sad” on their mobile phones and the company MugTome wishes to train a “happy/sad” face classifier; however, users do not wish to send the raw face images to MugTome and also wish to be able to plausibly deny which label they gave any training image.
In this case we have a set of training data $x_n$, and corresponding binary class labels $c_n \in \{0, 1\}$, $n = 1, \ldots, N$. We wish to fit a logistic regression model
$$p_\theta(c = 1 \,|\, x) = \sigma(\theta^{\mathsf{T}} x) \tag{41}$$
where $\sigma(a) = 1 / (1 + e^{-a})$ is the logistic function. We follow the general approach outlined in section(4).

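The model: For observation $(c, x)$ and parameter $(\theta, \phi)$
$$p_{\theta, \phi}(c, x) = p_\theta(c \,|\, x)\, p_\phi(x) \tag{42}$$
where $p_\theta(c \,|\, x)$ is the standard logistic regression model above and $p_\phi(x)$ is a model of the input $x$. The training objective is
$$L(\theta, \phi) = \frac{1}{N} \sum_{n=1}^{N} \log p_{\theta, \phi}(c_n, x_n) \tag{43}$$
$$\phantom{L(\theta, \phi)} = \underbrace{\frac{1}{N} \sum_{n=1}^{N} \log p_\theta(c_n \,|\, x_n)}_{L(\theta)} + \underbrace{\frac{1}{N} \sum_{n=1}^{N} \log p_\phi(x_n)}_{L(\phi)} \tag{44}$$
We note that this is a separable objective for $(\theta, \phi)$, in which the logistic regression parameters $\theta$ are conditionally independent (conditioned on the training data) of the input parameters $\phi$.

Form the corrupted dataset: We wish to send noisy data $(d_n, y_n)$, $n = 1, \ldots, N$, to the learner, where $d_n$ is the corrupted label and $y_n$ the corrupted input. To do so we need to define a corruption model $p(d, y \,|\, c, x)$. For simplicity, we consider a corruption model of the form
$$p(d, y \,|\, c, x) = p(d \,|\, c)\, p(y \,|\, x) \tag{45}$$
The corruption processes $p(d \,|\, c)$ and $p(y \,|\, x)$ are problem specific; see the experiments section(6) for some examples.

Send to learner corrupted data and model: The corrupted labels $d_n$ and inputs $y_n$ are sent to the learner along with the model $p_\theta(c \,|\, x)$, $p_\phi(x)$ and corruption process $p(d \,|\, c)$, $p(y \,|\, x)$.

Learn the model parameters $\theta, \phi$: The spread log likelihood is
$$\tilde{L}(\theta, \phi) = \frac{1}{N} \sum_{n=1}^{N} \log \tilde{p}_{\theta, \phi}(d_n, y_n) \tag{46}$$
$$\tilde{p}_{\theta, \phi}(d_n, y_n) = \int_x \sum_c p(d_n \,|\, c)\, p(y_n \,|\, x)\, p_\theta(c \,|\, x)\, p_\phi(x) \tag{47}$$
Unfortunately, in all but special cases, the integral (for continuous $x$) or sum (for discrete $x$) required to evaluate $\tilde{p}_{\theta, \phi}(d_n, y_n)$ is not tractable and numerical approximation is required. For this stage, there are many options available and we present below the approach taken in the experiments.
Interestingly, we note that, unlike training on clean data, the objective $\tilde{L}(\theta, \phi)$ is not separable into a function of $\theta$ plus a function of $\phi$, meaning that learning the class prediction parameter $\theta$ is coupled with learning the input distribution parameter $\phi$.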
5.1 Implementation
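In general, the spread noise defines a distribution $p(d, y \,|\, c, x)$ on a pair of spread variables $(d, y)$ and the full joint distribution, including the original model, is
$$p(d, y, c, x) = p(d, y \,|\, c, x)\, p_\theta(c \,|\, x)\, p_\phi(x) \tag{48}$$
For continuous $x$, the spread likelihood is then obtained from
$$\tilde{p}(d, y) = \int \sum_c p(d, y, c, x)\, dx \tag{49}$$
In general, this sum/integral over $x$ is intractable due to the high dimensionality of $x$. We use a standard approach to lower bound the log likelihood (for a single datapoint) by
$$\log \tilde{p}(d, y) \geq -\mathbb{E}_{q(c, x)}\left[ \log q(c, x) \right] + \mathbb{E}_{q(c, x)}\left[ \log p(d, y, c, x) \right] \tag{50}$$
where $q(c, x)$ is a distribution chosen to make the bound tight, see for example Barber:2012:BRM:2207809. This allows us to use an EM-style procedure in which we iterate between the two steps: (M-step) fix $q$ and optimise $\theta, \phi$, and (E-step) fix $\theta, \phi$ and update $q$.

Iteration M-step: Update $\theta, \phi$ to increase the “energy”
$$E(\theta, \phi) = \sum_{n=1}^{N} E_n(\theta, \phi) \tag{51}$$
where (for multiple datapoints), up to terms independent of $\theta, \phi$,
$$E_n(\theta, \phi) = \mathbb{E}_{q_n(c, x)}\left[ \log p_\theta(c \,|\, x) \right] + \mathbb{E}_{q_n(c, x)}\left[ \log p_\phi(x) \right] \tag{52}$$
An advantage of this approach is that $E(\theta, \phi)$ is separable and we can update the class prediction parameter $\theta$ independently of the input distribution parameter $\phi$.
In practice we will typically only do a partial optimisation (gradient ascent step) over $\theta$ to guarantee an increase in the energy.

Iteration E-step: The bound is tightest when $q$ is set to the posterior (see for example Barber:2012:BRM:2207809),
$$q(c, x) = \frac{1}{Z}\, p(d \,|\, c)\, p(y \,|\, x)\, p_\theta(c \,|\, x)\, p_\phi(x) \tag{53}$$
where $\theta$, $\phi$ are held at their current values and the normaliser is given by
$$Z = \int \sum_c p(d \,|\, c)\, p(y \,|\, x)\, p_\theta(c \,|\, x)\, p_\phi(x)\, dx \tag{54}$$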
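To implement the M-step, Equation(52) requires expectations of the form
$$\mathbb{E}_{q(c, x)}\left[ g(c, x) \right] \tag{55}$$
for some function $g$. Assuming that the posterior will be reasonably peaked around the noisy data we use sampling with an importance distribution
$$\pi(c, x) = \pi(c)\, \pi(x) \tag{56}$$
The expectation is then motivated by
$$\mathbb{E}_{q(c, x)}\left[ g(c, x) \right] = \mathbb{E}_{\pi(c, x)}\left[ \frac{q(c, x)}{\pi(c, x)}\, g(c, x) \right] \tag{57}$$
Choosing
$$\pi(c) = \frac{p(d \,|\, c)}{Z_c(d)}, \qquad \pi(x) = \frac{p(y \,|\, x)\, p_\phi(x)}{Z_x(y)} \tag{58}$$
for normalising functions $Z_c$, $Z_x$, we then run a standard importance sampling approximation (see section(A)). For a given noisy datapoint $(d_n, y_n)$ we generate a set of $S$ samples $c_n^s$ from $\pi(c)$ and samples $x_n^s$ from $\pi(x)$ and compute the importance weights
$$w_n^s = \frac{p_\theta(c_n^s \,|\, x_n^s)}{\sum_{s'=1}^{S} p_\theta(c_n^{s'} \,|\, x_n^{s'})} \tag{59}$$
The energy equation(52) separates into two independent terms (see section(A))
$$E(\theta) = \sum_{n=1}^{N} \sum_{s=1}^{S} w_n^s \log p_\theta(c_n^s \,|\, x_n^s) \tag{60}$$
and
$$E(\phi) = \sum_{n=1}^{N} \sum_{s=1}^{S} w_n^s \log p_\phi(x_n^s) \tag{61}$$
Equation(60) is a weighted version of the standard logistic regression log likelihood $L(\theta)$ in equation(44); similarly equation(61) is a weighted version of $L(\phi)$. The advantage therefore is that, given the importance samples, the learning procedure for $\theta$ requires only a minor modification of the standard maximum likelihood training procedure on clean data.
The full procedure is that we randomly initialise $\theta$ and then, for each datapoint $n$, draw $S$ samples and accumulate the gradient across samples and datapoints. After doing a gradient ascent step in $\theta$, we update the importance distributions and repeat until convergence.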
The Importance Sampling approximation is a convenient approach, motivated by the assumption that corrupted datapoints will be close to their uncorrupted counterparts. Whilst we used a bound as part of the approximation, this is not equivalent to using a parametric distribution; by sampling we form a consistent estimator of the tightest possible lower bound. In other words, we are simply using Importance Sampling to estimate the expectations required within a standard Expectation Maximisation algorithm, see for example Barber:2012:BRM:2207809. We also tried learning a parametric , similar to standard variational approaches to log likelihood maximisation, but didn’t find any improvement on the Importance Sampling approach.
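To make the importance sampling step concrete, the following sketch works through a one-dimensional toy version in which the exact posterior expectation can be computed on a grid for comparison. Everything here is an illustrative assumption rather than the paper's experimental setup: a scalar input with standard normal prior, Gaussian input noise, a logistic class model with scalar parameter `theta`, symmetric label flipping, and a proposal that samples the input from its posterior given the noisy input and the class from its posterior given the noisy label under a flat class prior. With that proposal the importance weights are proportional to the class model probability of the sampled class given the sampled input.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def posterior_mean_x_importance(d, y, theta, flip, noise_var, n_samples, rng):
    """Self-normalised importance-sampling estimate of E[x | d, y]."""
    post_var = 1.0 / (1.0 + 1.0 / noise_var)       # posterior of x given y only
    post_mean = post_var * y / noise_var
    x = rng.normal(post_mean, np.sqrt(post_var), n_samples)
    c = np.where(rng.random(n_samples) < flip, 1 - d, d)   # class given noisy label
    w = np.where(c == 1, sigmoid(theta * x), 1.0 - sigmoid(theta * x))
    w /= w.sum()                                   # self-normalised weights
    return float(np.sum(w * x))

def posterior_mean_x_exact(d, y, theta, flip, noise_var):
    """Reference value: sum over the class and integrate x on a fine grid."""
    x = np.linspace(-8.0, 8.0, 20001)
    prior_times_noise = np.exp(-0.5 * x**2 - 0.5 * (y - x) ** 2 / noise_var)
    p_c1 = sigmoid(theta * x)
    p_d_c1 = 1 - flip if d == 1 else flip          # p(d | c = 1)
    p_d_c0 = 1 - flip if d == 0 else flip          # p(d | c = 0)
    unnorm = prior_times_noise * (p_d_c1 * p_c1 + p_d_c0 * (1 - p_c1))
    return float(np.sum(unnorm * x) / np.sum(unnorm))

rng = np.random.default_rng(0)
est = posterior_mean_x_importance(1, 0.5, 2.0, 0.2, 1.0, 200_000, rng)
ref = posterior_mean_x_exact(1, 0.5, 2.0, 0.2, 1.0)
print(est, ref)
```

Because the weights are bounded between 0 and 1, the estimator is well behaved in this toy setting; in high dimensions the quality of the approximation depends on how peaked the true posterior is relative to the proposal.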
5.2 Learning the prior
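If we have access to clean data, the optimal input model $p_\phi(x)$ can be learned from maximising the likelihood $L(\phi)$. However, our general assumption is that we will never have access to clean data. There may be situations in which the learner has a good model of $p(x)$, without compromising privacy (for example a publicly available dataset for a similar prediction problem might be available), in which case it makes sense to set the prior to this known model.
In the absence of a suitable prior we can attempt to learn $\phi$ from the corrupted data by maximising the spread log likelihood. For simplicity we assume a factorised model and for a $D$-dimensional input vector $x = (x_1, \ldots, x_D)$ write
$$p_\phi(x) = \prod_{i=1}^{D} p_{\phi_i}(x_i) \tag{62}$$
for a collection of learnable univariate distributions $p_{\phi_i}(x_i)$, $i = 1, \ldots, D$. Under this assumption, and using the Importance Sampling approach in equation(61), this means that $\phi$ can be learned by maximising
$$\sum_{i=1}^{D} \sum_{n=1}^{N} \sum_{s=1}^{S} w_n^s \log p_{\phi_i}(x_{n,i}^s) \tag{63}$$
where $x_{n,i}^s$ is the $i$-th component of the $s$-th sample for datapoint $n$ and $w_n^s$ is its importance weight. Since this is a separable objective, we can learn each $\phi_i$ independently.
For simplicity, we assume a discrete distribution for $x_i$ that contains $K$ states (or bins), with $\phi_{i,k} = p_{\phi_i}(x_i = k)$. Then the $i$-th term is
$$\sum_{n=1}^{N} \sum_{s=1}^{S} w_n^s \sum_{k=1}^{K} \mathbb{I}[x_{n,i}^s = k] \log \phi_{i,k} \tag{64}$$
where $\mathbb{I}[x_{n,i}^s = k]$ is 1 if the sample $x_{n,i}^s$ is in the state $k$ and 0 otherwise. Optimising with respect to $\phi_{i,k}$ we obtain
$$\phi_{i,k} = \frac{\sum_{n} \sum_{s} w_n^s\, \mathbb{I}[x_{n,i}^s = k]}{\sum_{n} \sum_{s} w_n^s} \tag{65}$$
For the M-step of the algorithm we then make a gradient update for $\theta$ and update the prior using equation(65).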
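For a discrete per-pixel prior, the update amounts to a weighted histogram of the sampled pixel states. The sketch below is an assumed implementation for a single pixel: `samples` holds the sampled states for every datapoint and importance sample, and `weights` holds the matching importance weights; both array layouts are hypothetical.

```python
import numpy as np

def update_pixel_prior(samples, weights, n_states):
    """Weighted-histogram update for one pixel's discrete prior.

    samples : int array of shape (N, S), sampled states of this pixel
    weights : float array of shape (N, S), importance weights
    Returns a length-n_states probability vector.
    """
    counts = np.bincount(samples.ravel(), weights=weights.ravel(),
                         minlength=n_states)
    return counts / counts.sum()

samples = np.array([[0, 1],
                    [1, 1]])
weights = np.array([[0.5, 0.5],
                    [0.25, 0.75]])
print(update_pixel_prior(samples, weights, n_states=3))
```

States that are never sampled receive zero probability under this update; in practice a small pseudo-count could be added to avoid zeros, though that is a design choice not discussed in the text.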
6 Experiments
We implemented our approach in section(5) to train a logistic regression classifier to distinguish between the MNIST digits 7 and 9 based on noisy data (250 train and 900 test examples from each class). We chose to train on a small dataset since this constitutes the most challenging scenario and helps highlight potential differences between rival approaches. The MNIST images have pixels with 256 states and we used a discrete distribution to model the input.
For our experiments we assume a corruption model that flips the class label with a given flip probability and leaves it unchanged otherwise. We also assume, for simplicity, a factorised input corruption model in which each pixel keeps its value with a given probability and is otherwise drawn uniformly from the other states of that pixel.
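A corruption process of this kind might be sketched as follows. The function names and the use of separate probabilities for labels and pixels are assumptions for illustration; the arithmetic trick of adding a random non-zero offset modulo the number of states draws uniformly from the other states of a pixel.

```python
import numpy as np

def corrupt_labels(c, flip_prob, rng):
    """Flip each binary label independently with probability flip_prob."""
    flip = rng.random(c.shape) < flip_prob
    return np.where(flip, 1 - c, c)

def corrupt_pixels(x, change_prob, n_states, rng):
    """With probability change_prob, replace a pixel by a uniformly chosen different state."""
    change = rng.random(x.shape) < change_prob
    offset = rng.integers(1, n_states, size=x.shape)   # values 1 .. n_states - 1
    return np.where(change, (x + offset) % n_states, x)

rng = np.random.default_rng(0)
x = rng.integers(0, 256, size=(10, 784))   # stand-in for 10 images with 256-state pixels
c = rng.integers(0, 2, size=10)            # stand-in for binary class labels
y = corrupt_pixels(x, 0.3, 256, rng)
d = corrupt_labels(c, 0.3, rng)
print((y != x).mean(), (d != c).mean())    # empirical corruption rates
```

Each user would run such a function locally, once, and send only the corrupted pair to the learner.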
In this case, computing the Importance Sampling distribution is straightforward since the posterior is factorised over the image pixels. We considered three settings for the prior (required to compute the Importance Sampling distribution) : (i) flat prior, (ii) learned prior using EM, (iii) true factorised prior based on computing the marginal distribution of each pixel on the training dataset. In the ‘true prior’ case we assumed that we know the true marginal distribution of each pixel – in general, this information would be private, but it is interesting to consider how much improvement is obtained by knowing this quantity.
We compare the following approaches:
 Log Reg Clean Data

We trained logistic regression on clean data. This sets an upper limit on the expected performance.
 Log Reg on Noisy Data

We trained a standard logistic regression model but using the corrupted data. This forms a simple baseline comparison.
 Spread Log Reg with Learned Prior

We used our Spread Likelihood approach to learn the prior.
 Spread Log Reg with ‘True Prior’

In general our assumption is that the true prior will not be known (since this requires users to release their private data to the prior learner). However, this forms an interesting comparison and expected upper bound on the performance of the spread approach.
 Spread Log Reg with Flat Prior

In this case we used an uninformative, flat prior on all pixel states.
We ran 10 experiments for each level of flip noise from and then tested the prediction accuracy of the learned logistic classifiers on clean hold out data, see figure(18).
For all but small noise levels, the results show the superiority of the spread learning approach over simply training on noisy data, consistent with our theory that training the standard model on noisy data does not in general form a consistent estimator. The best performing spread approach is that which uses the true prior – however, in general this true prior will not be available. For this experiment, there appears to be little difference between using a flat prior and a learned prior.
The performance of standard logistic regression training but using corrupted data is surprisingly effective, at least at low noise levels. However the performance degrades quickly for higher noise levels. A partial explanation for why logistic regression may give good results simply trained on noisy data is given in section(B).
6.1 Gaussian Input Prior
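We also demonstrate here training logistic regression treating the pixels as continuous. If we have an independent (per pixel) Gaussian prior
$$p(x_i) = \mathcal{N}(x_i \,|\, \mu_p, \sigma_p^2) \tag{66}$$
and independent Gaussian spread noise
$$p(y_i \,|\, x_i) = \mathcal{N}(y_i \,|\, x_i, \sigma_s^2) \tag{67}$$
then the Importance Sampling posterior is
$$p(x_i \,|\, y_i) = \mathcal{N}(x_i \,|\, \mu_i, \sigma^2) \tag{68}$$
where
$$\sigma^2 = \left( \frac{1}{\sigma_p^2} + \frac{1}{\sigma_s^2} \right)^{-1}, \qquad \mu_i = \sigma^2 \left( \frac{\mu_p}{\sigma_p^2} + \frac{y_i}{\sigma_s^2} \right) \tag{69}$$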
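The Gaussian posterior used for importance sampling is the standard conjugate result and can be computed per pixel in a couple of lines. The function and argument names below are illustrative.

```python
import numpy as np

def gaussian_posterior(y, prior_mean, prior_var, noise_var):
    """Posterior mean and variance of x given y = x + Gaussian noise, with a Gaussian prior on x."""
    post_var = 1.0 / (1.0 / prior_var + 1.0 / noise_var)
    post_mean = post_var * (prior_mean / prior_var + np.asarray(y) / noise_var)
    return post_mean, post_var

# With equal prior and noise variance the posterior mean is halfway
# between the prior mean and the noisy observation.
mean, var = gaussian_posterior(2.0, prior_mean=0.0, prior_var=1.0, noise_var=1.0)
print(mean, var)
```

A very broad prior (large `prior_var`) makes the posterior mean approach the noisy pixel value itself, so that sampling reduces to adding noise of variance close to `noise_var` around the observed pixel.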
We also used Gaussian spread noise to corrupt the images and train a binary classifier to distinguish between the MNIST digits 7 and 9 based on noisy data (4500 train and 900 test examples from each class). For simplicity, we assumed factorised distributions with prior , . We chose spread flip noise
for the class labels and uniform spread noise with variance
; the prior was set to be quite uninformative with mean and variance . This level of noise means that approximately 20% of the class labels are incorrect in the data passed to MugTome and the associated image is significantly blurred, see figure(25). For standard logistic regression we found that for a learning rate of 0.2, 400 iterations gave the best performance, with 95.5% train accuracy and 95.7% test accuracy. Using our Importance Sampling scheme with samples per noisy datapoint, the trained classifier, when tested on clean images, had 94.4% test accuracy. This shows that despite the high level of class label and image noise, MugTome are able to learn an effective classifier, preserving the privacy of the users. The loss in test and training accuracy, despite this high noise level, is a modest 1% or so. When using higher spread noise with variance , the classifier learned on the noisy data had a clean data test accuracy of 93%, which is also a modest decrease in accuracy for a significant increase in privacy.
For future work it would be interesting to consider other forms of noise, for example downsampling images. However, downsampling does not form an injective mapping and as such we cannot guarantee that we can find a consistent estimator for the underlying model.
7 Discussion
There are many forms of private machine learning. Some attempt to transform a datapoint to a form such that a protected attribute (such as gender) cannot be recovered from , yet the prediction of an output (for example using
) is retained. For example this could be achieved by using a loss function such as (see for example
li2019deepobfuscator)(70) 
where is the data index, is a function that takes input and outputs a prediction ; is a function that takes input and outputs an attribute prediction and gives a representation of the input; , are loss functions. In this protected attribute setting, typically some form of the clean dataset is required to learn the parameters .
Another common form of private machine learning is based on differential privacy (Dwork:2014), with the aim to make it difficult to discern whether a datapoint was used to train the predictor . That is, given a trained model, differential privacy attempts to restrict the ability to differentiate whether any individual’s datum was used to train the model.
A closely related concept to randomised response is that of plausible deniability, namely privately corrupting a datapoint such that no one (except the datapoint provider) can confidently state what the original (private) value is. Recently DBLP:journals/corr/abs170807975 used this to create synthetic datapoints, which were subsequently used with a standard machine learning training approach. The authors showed that generating synthetic data from a distribution that takes into account dependency amongst the elements of the vector results in better machine learning predictors than sampling from a factorised distribution. In synthetic data generation approaches the assumption is that the statistical characteristics are similar to the real data. However, care is required since if the generating mechanism is powerful, it may generate data which is very similar to the private data.
In general these synthetic data generating approaches do not take into consideration when learning the parameters of the machine learning model what that synthetic data generation mechanism is. This is analogous to simply using the corrupted votes to directly estimate the fraction of voters that voted for a candidate, equation(1), rather than using knowledge of the data generation approach, equation(2).
8 Summary
We discussed a general privacy preserving mechanism based on randomised response in which a datapoint is replaced by a corrupted version. We showed that, provided the corruption process is a valid spread noise, a maximum likelihood approach forms a consistent estimator. That is, even though the model is only trained on corrupted, synthetic data, it is possible to recover the true underlying data generating mechanism on the clean data. We applied this approach to a simple logistic regression model, showing that the approach can work well, even with high levels of noise. The approach is readily applicable to a large class of much more complex models and other divergences.
Acknowledgements
I would like to thank Xijie Hou for useful discussions.
References
Appendix A Privacy Preserving Logistic Regression
The posterior is given by