Deep Directed Information-Based Learning for Privacy-Preserving Smart Meter Data Release

11/20/2020
by Mohammadhadi Shateri, et al.
McGill University

The explosion of data collection has raised serious privacy concerns among users due to the possibility that sharing data may also reveal sensitive information. The main goal of a privacy-preserving mechanism is to prevent a malicious third party from inferring sensitive information while keeping the shared data useful. In this paper, we study this problem in the context of time series data, and of smart meters (SMs) power consumption measurements in particular. Although Mutual Information (MI) between private and released variables has been used as a common information-theoretic privacy measure, it fails to capture the causal time dependencies present in the power consumption time series data. To overcome this limitation, we introduce the Directed Information (DI) as a more meaningful measure of privacy in the considered setting and propose a novel loss function. The optimization is then performed using an adversarial framework where two Recurrent Neural Networks (RNNs), referred to as the releaser and the adversary, are trained with opposite goals. Our empirical studies on real-world data sets of SMs measurements, considering the worst-case scenario in which an attacker has access to all the training data used by the releaser, validate the proposed method and show the existing trade-offs between privacy and utility.


I Introduction

In recent decades, there has been explosive progress in data collection tools that can measure, analyze, and disseminate users' data. These advances have had a large impact on several different fields such as electrical power systems, health care, digital banking, etc. However, users are generally unwilling to share their data due to the possibility that a third party could infer their personal private information, e.g., living habits, economic status, or health history. Therefore, guaranteeing users' privacy while preserving the benefits of data collection is an important challenge in modern data science [nelson2016]. In this paper, we focus on this problem when the data has a time series structure and, in particular, we consider different privacy scenarios motivated by the deployment of Smart Meters (SMs) in electrical distribution networks [giaconi2018privacy]. SMs are devices that can register fine-grained electricity consumption of users and communicate this information to the utility provider in almost real time. The utility of SMs data is diverse [wang2018review, depuru2011]: it can be used for power quality monitoring, timely fault detection, demand response, energy theft prevention, etc. However, widespread usage of SMs data can lead to serious leakages of consumers' private information, e.g., a malicious third party could use the data to detect the presence of residents at home as well as their personal habits [giaconi2018privacy]. This problem can have a serious impact on the deployment pace of SMs and, more broadly, on the development of smart electrical grids. Thus, it is critical to ensure that SMs data are sanitized before being released.

A very simple strategy that has been proposed in the context of SMs is to use pseudonyms rather than the real identities of users for data publishing purposes [efthymiou2010smart]. However, this approach implicitly assumes that a trusted anonymizer is available. Another simple technique suggested in the literature is downsampling of the data, where the sampling rate is reduced to a level that does not pose any privacy threat [cardenas2012privacy, mashima2015authenticated]. Although this approach may be effective from the privacy point of view, it could also seriously limit the utility of the SMs data for applications requiring a timely response. More sophisticated and recent approaches exploit the presence of renewable energy sources and rechargeable batteries in homes to modify the actual energy consumption of users in order to hide the sensitive information [backes2013differentially, zhao2014achieving, li2018information, giaconi2018privacy, erdemir2019privacy]. Some of these works use ideas from the well-known principle of differential privacy. However, recent articles suggest that the utility loss of differential privacy may be significant in practice [mendes2017privacy]. It should be noted that our approach to the problem, which works only with the power measurement data, does not preclude the use of methods that change the energy consumption patterns by using physical resources. In fact, it should be viewed as a complementary approach that could even be used on top of the above-mentioned methods.

In an information-theoretic context, privacy is generally measured by the Mutual Information (MI) between the sensitive and released variables [li2018information, giaconi2018privacy, erdemir2019privacy, sankar2013smart]. Some of these studies aim to find a privacy-utility trade-off using ideas from rate-distortion theory [sankar2013smart, tripathy2017privacy]. More specifically, the theoretical framework of the privacy-utility problem was proposed in [sankar2013smart], where a hidden Markov model for the power measurements of SMs is considered in which the distribution is assumed to be controlled only by the state of the home appliances. The privacy-utility trade-off is then found for a stationary Gaussian model of the electricity load, with the MI between the released and private sequences of variables as a privacy measure. Besides the limitation of the Gaussian model, it should be noted that the MI is not well-suited to capture the causal structure of the time series data.

In this paper, an information-theoretic cost function for privacy-preserving data release of time series is proposed. In order to take into account the time series structure and causality of the data in the privacy measure, we use the Directed Information (DI) [massey1990causality] between the sensitive time series and an estimation of it. Then, a cost function is derived for the releaser mechanism based on an upper bound of the DI. To optimize and validate our cost function without imposing constraints on the data distribution, two recurrent neural networks, referred to as the releaser and adversary networks, are employed. This approach is based on the framework of Generative Adversarial Networks (GANs) [goodfellow2014, huang2017context], where two neural networks are trained simultaneously with opposite goals. We will show that, by controlling the relative weight between a distortion measure and the DI privacy measure, we can control the utility-privacy trade-off of SMs power measurements. A similar approach to the privacy problem, but for different applications, was considered in [tripathy2017privacy, 2018arXiv180209386F], which assume the standard MI and independent and identically distributed (i.i.d.) data, and where the authors use two deep feed-forward neural networks for the releaser and adversary. However, to the best of our knowledge, this is the first work to consider a DI privacy measure for time series data in the general privacy-preserving context and in SMs applications in particular.

This paper is organized as follows. In Section II, we present the theoretical formulation of the problem. Then, in Section III, a privacy-preserving data release method based on Long Short-Term Memory (LSTM) Recurrent Neural Networks (RNNs) is introduced along with the training algorithm. Results for two different applications based on SMs data are presented in Section IV. Finally, some concluding remarks and a discussion about future work are given in Section V.

Notation and conventions

A sequence of random variables $X_1, \ldots, X_T$ of length $T$ is denoted by $X^T$, while $x^T$ denotes a realization of $X^T$ and $x^T(i)$ denotes the $i$-th sample in a minibatch used for training. Mutual information [cover2006elements] between variables $X$ and $Y$ is represented as $I(X;Y)$ and the entropy of $X$ as $H(X)$. We use $X - Y - Z$ to indicate that $X$, $Y$, and $Z$ form a Markov chain. The expectation of a random variable $X$ is denoted as $\mathbb{E}[X]$.

II Problem Formulation and Training Objective

II-A Main definitions

Consider the private variables $X_t$ (such as occupancy label, household identity, or acorn family type), the useful variables $Y_t$ (such as the actual electricity consumption of the household), and the observed variables $W_t$ (which could be a combination of private and useful variables). We assume that $X_t$ takes values on a discrete alphabet $\mathcal{X}$, for $t \in \{1, \ldots, T\}$. A releaser $\mathcal{R}_\theta$ (this notation is used to denote that the releaser is controlled by its parameters $\theta$) produces the release variables $Z_t$ based on the observation $W^t$, for each time $t$, while an adversary attempts to infer $X_t$ based on $Z^t$ by finding an approximation of $X_t$, which we shall denote by $\hat{X}_t$. Thus, the Markov chain $X_t - W^t - Z^t - \hat{X}_t$ holds for all $t$. In addition, due to causality, the distribution $p_{Z^T \mid W^T}$ can be decomposed as follows:

$$p_{Z^T \mid W^T}(z^T \mid w^T) = \prod_{t=1}^{T} p_{Z_t \mid Z^{t-1}, W^t}(z_t \mid z^{t-1}, w^t). \qquad (1)$$

The goal of the releaser is to minimize the flow of information from the sensitive variables to their estimation while simultaneously keeping the distortion between the release variables and the useful variables below some given value. On the other hand, the goal of the adversary $\mathcal{A}_\phi$ (this notation is used to denote that the adversary is controlled by its parameters $\phi$) is to estimate $X_t$ as accurately as possible.

To take into account the causal relation between $X^T$ and $\hat{X}^T$, the flow of information is quantified by the DI [massey1990causality]:

$$I(X^T \to \hat{X}^T) \triangleq \sum_{t=1}^{T} I(X^t; \hat{X}_t \mid \hat{X}^{t-1}), \qquad (2)$$

where $I(X^t; \hat{X}_t \mid \hat{X}^{t-1})$ is the conditional mutual information between $X^t$ and $\hat{X}_t$ conditioned on $\hat{X}^{t-1}$ [cover2006elements].
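To make the definition in (2) concrete, the following toy sketch estimates a first-order approximation of the DI between two discrete label sequences with a simple plug-in estimator. The truncation of the histories $X^t$ and $\hat{X}^{t-1}$ to a single step, and all function names, are our simplifications for illustration; they are not part of the method described in this paper.

```python
import numpy as np
from collections import Counter

def cond_mutual_info(a, b, c):
    """Plug-in estimate of I(A; B | C) in bits from aligned 1-D integer arrays."""
    n = len(a)
    n_abc = Counter(zip(a, b, c))
    n_ac = Counter(zip(a, c))
    n_bc = Counter(zip(b, c))
    n_c = Counter(c)
    mi = 0.0
    for (ai, bi, ci), cnt in n_abc.items():
        mi += (cnt / n) * np.log2(cnt * n_c[ci] / (n_ac[(ai, ci)] * n_bc[(bi, ci)]))
    return mi

def directed_info_order1(x, xhat):
    """Order-1 approximation of eq. (2): sum_t I(X_t; Xhat_t | Xhat_{t-1}),
    with the full histories X^t and Xhat^{t-1} truncated to one step.
    x, xhat: integer label arrays of shape (num_sequences, T)."""
    n, T = x.shape
    di = cond_mutual_info(x[:, 0], xhat[:, 0], np.zeros(n, dtype=int))
    for t in range(1, T):
        di += cond_mutual_info(x[:, t], xhat[:, t], xhat[:, t - 1])
    return di
```

For longer histories the plug-in approach quickly becomes data-hungry, which is exactly the tractability issue that motivates the surrogate bound introduced in Section II-B.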

The expected distortion between $Z^T$ and $Y^T$ is defined as:

$$\mathcal{D}(Z^T, Y^T) \triangleq \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}\big[d(Z_t, Y_t)\big], \qquad (3)$$

where $d(\cdot,\cdot)$ is any distortion function (i.e., a metric on $\mathbb{R}$). In order to ensure the quality of the release, we shall impose the following constraint: $\mathcal{D}(Z^T, Y^T) \le \varepsilon$ for some given $\varepsilon \ge 0$. In this work, we will consider the normalized squared error as in [sankar2013smart], i.e.,

$$d(z_t, y_t) = \frac{(z_t - y_t)^2}{\mathbb{E}[Y_t^2]}. \qquad (4)$$

Nevertheless, it should be noted that other distortion measures can also be relevant for the SMs data. For instance, demand response programs usually require an accurate knowledge of peak power consumption, so a distortion function closer to the infinity norm would be more meaningful for this particular application. This brief discussion simply illustrates that the distortion function should be properly matched to the intended application of the release variables in order to preserve the characteristics of the useful variables that are considered essential. Since the goal of this paper is mainly to introduce a new privacy measure and privacy-preserving data release framework, we will not further investigate different fidelity measures.
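To make this distinction concrete, here is a small sketch of two candidate distortion functions: a normalized squared error in the spirit of (4) and a peak-aware alternative closer to the infinity norm. The exact normalizations and names are our assumptions for illustration.

```python
import numpy as np

def normalized_squared_error(z, y):
    """Per-sequence normalized squared error between release z and useful signal y.
    Normalizing by the mean squared value of y is an illustrative choice."""
    return np.mean((z - y) ** 2) / np.mean(y ** 2)

def peak_aware_distortion(z, y):
    """Infinity-norm style distortion that emphasizes errors at the consumption peak,
    e.g., for demand-response applications."""
    return np.max(np.abs(z - y)) / np.max(np.abs(y))
```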

Therefore, the problem of finding an optimal releaser subject to the aforementioned adversary and distortion constraint can be formally written as follows:

$$\min_{\theta} \; I(X^T \to \hat{X}^T) \quad \text{subject to} \quad \mathcal{D}(Z^T, Y^T) \le \varepsilon. \qquad (5)$$

Note that the solution of this optimization problem is a function of $p_{\hat{X}_t \mid Z^t}$, the conditional distributions that represent the adversary $\mathcal{A}_\phi$.

II-B Novel training objective

The optimization problem (5) can be directly used to define a loss function for $\mathcal{R}_\theta$. However, note that the cost of computing the DI term grows exponentially with the sequence length $T$ and the size of the alphabet $\mathcal{X}$, because of the conditioning on ever-growing histories. Thus, for the sake of tractability, the DI will be replaced with the following surrogate bound:

$$\begin{aligned} I(X^T \to \hat{X}^T) &= \sum_{t=1}^{T} \big[ H(\hat{X}_t \mid \hat{X}^{t-1}) - H(\hat{X}_t \mid \hat{X}^{t-1}, X^t) \big] \\ &\overset{\text{(i)}}{\le} \sum_{t=1}^{T} \big[ H(\hat{X}_t) - H(\hat{X}_t \mid \hat{X}^{t-1}, X^t, Z^t) \big] \\ &\overset{\text{(ii)}}{=} \sum_{t=1}^{T} \big[ H(\hat{X}_t) - H(\hat{X}_t \mid Z^t) \big] \\ &\overset{\text{(iii)}}{\le} \sum_{t=1}^{T} \big[ \log|\mathcal{X}| - H(\hat{X}_t \mid Z^t) \big], \end{aligned} \qquad (6)$$

where (i) is due to the fact that conditioning reduces entropy; equality (ii) is due to the Markov chains $\hat{X}_t - Z^t - X^t$ and $\hat{X}_t - Z^t - \hat{X}^{t-1}$; and (iii) is due to the trivial bound $H(\hat{X}_t) \le \log|\mathcal{X}|$. Therefore, the loss function for $\mathcal{R}_\theta$ can be written as

$$\mathcal{L}_{\mathcal{R}}(\theta, \phi) = \frac{\lambda}{T} \sum_{t=1}^{T} \big[ \log|\mathcal{X}| - H(\hat{X}_t \mid Z^t) \big] + \mathcal{D}(Z^T, Y^T), \qquad (7)$$

where $\lambda \ge 0$ controls the privacy-utility trade-off and the factor $1/T$ has been introduced for normalization purposes. It should be noted that the value of $\lambda$ in (7) indirectly controls the achievable $\varepsilon$ in (5), which means that we can control the privacy-utility trade-off by varying $\lambda$. For $\lambda = 0$, the loss function reduces to the expected distortion and is independent of the adversary $\mathcal{A}_\phi$. In such a scenario, $\mathcal{R}_\theta$ offers no privacy guarantees. Conversely, for very large values of $\lambda$, the loss function is dominated by the upper bound on the DI, so that privacy is the only goal of $\mathcal{R}_\theta$. In this regime, we expect the adversary to completely fail in the task of estimating $X_t$, i.e., to approach random-guessing performance.
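The following PyTorch-style sketch shows one way a loss of this form could be assembled in practice, using the adversary's cross-entropy on the true private labels as a tractable stand-in for the conditional entropy term in (7). The function names, the normalization of the distortion, and this substitution are our assumptions for illustration, not the paper's exact implementation.

```python
import math
import torch
import torch.nn.functional as F

def releaser_loss(z, y, adv_logits, x_true, lam, num_classes):
    """Sketch of a loss in the spirit of eq. (7): distortion plus a scaled
    surrogate of the DI upper bound in eq. (6).
    z, y:       (batch, T) released and useful sequences
    adv_logits: (batch, T, num_classes) adversary outputs computed on z
    x_true:     (batch, T) private labels
    """
    # Normalized squared-error distortion (the normalization is illustrative).
    distortion = ((z - y) ** 2).mean() / (y ** 2).mean().clamp_min(1e-8)
    # Adversary cross-entropy, used here as a proxy for H(X_t | Z^t).
    ce = F.cross_entropy(adv_logits.reshape(-1, num_classes),
                         x_true.reshape(-1).long(), reduction="mean")
    di_surrogate = math.log(num_classes) - ce
    return lam * di_surrogate + distortion
```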

On the other hand, the adversary $\mathcal{A}_\phi$ is a classifier which optimizes the following cross-entropy loss:

$$\mathcal{L}_{\mathcal{A}}(\phi, \theta) = \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}\Big[-\log \hat{P}_{\phi}\big(X_t \mid Z^t\big)\Big], \qquad (8)$$

where the expectation should be understood w.r.t. the joint distribution of $(X^T, Z^T)$. Notice that

$$\mathbb{E}\Big[-\log \hat{P}_{\phi}\big(X_t \mid Z^t\big)\Big] = H(X_t \mid Z^t) + \mathbb{E}_{Z^t}\Big[D_{\mathrm{KL}}\big(P_{X_t \mid Z^t}\,\big\|\,\hat{P}_{\phi}(\cdot \mid Z^t)\big)\Big] \ge H(X_t \mid Z^t). \qquad (9)$$

Therefore, if the adversary is ideal (i.e., the KL divergence term in (9) vanishes for all $t$), the releaser network, by maximizing $\mathcal{L}_{\mathcal{A}}$, prevents the adversary from inferring the private data.
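As a quick sanity check of the decomposition in (9), the snippet below verifies numerically, for a single value of $Z^t$ and a binary private variable, that the cross-entropy equals the conditional entropy plus the KL divergence; the particular probability values are arbitrary and chosen only for illustration.

```python
import numpy as np

# Toy check of eq. (9) at a single time step, for one realization z of Z^t.
p = np.array([0.8, 0.2])     # true conditional P(X_t | Z^t = z)
phat = np.array([0.6, 0.4])  # adversary's estimate of that conditional

cross_entropy = -(p * np.log(phat)).sum()
entropy = -(p * np.log(p)).sum()
kl = (p * np.log(p / phat)).sum()

assert np.isclose(cross_entropy, entropy + kl)  # cross-entropy = H + KL >= H
```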

III Privacy-Preserving Mechanism

Based on the previous theoretical formulation, an adversarial modeling framework consisting of two RNNs, a releaser $\mathcal{R}_\theta$ and an adversary $\mathcal{A}_\phi$, is considered (see Fig. 1). Note that independent noise is appended to the releaser input in order to randomize the released variables, which is a popular approach in privacy-preserving methods. In addition, the available theoretical results show that, for Gaussian distributions, the optimal release contains such a noise component [sankar2013smart, tripathy2017privacy]. For both networks, an LSTM architecture is selected (see Fig. 2), which has been shown to be successful in several problems dealing with sequences of data (see [goodfellow2016] and references therein for more details). The training of the suggested framework is performed using Algorithm 1, which applies $k$ gradient steps to train $\mathcal{A}_\phi$ followed by one gradient step to train $\mathcal{R}_\theta$. Note that $k$ should be large enough to ensure that $\mathcal{A}_\phi$ is a strong adversary during training. This is in fact the common practice for effectively training two networks in an adversarial framework [goodfellow2014] and, in particular, in privacy scenarios [tripathy2017privacy]. It should be recalled that, after the training of both networks is completed, an attacker network is trained in order to test the privacy achieved by the releaser network. It should be clarified that this attacker network is distinct from the adversary network used during training and illustrated in Fig. 1: the attacker used in testing mimics a real-world attacker that would try to infer the private data from the released data.

Fig. 1: Privacy-preserving framework. The seed noise is generated from i.i.d. samples according to a uniform distribution.
Fig. 2: LSTM recurrent network cell diagram. The cell includes four gating units to control the flow of information. All the gating units have a sigmoid activation function ($\sigma$), except for the input unit (which uses a hyperbolic tangent activation function ($\tanh$) by default). The parameters denote, respectively, the biases, input weights, and recurrent weights. In the LSTM architecture, the forget gate uses the output of the previous cell (called the hidden state) to control the cell state and remove irrelevant information. On the other hand, the input gate and input unit add new information to the cell state from the current input. Finally, the output gate generates the output of the cell from the current input and the cell state.
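As a concrete reference for the architecture sketched in Figs. 1 and 2, the following is a minimal PyTorch sketch of the two networks, assuming the seed noise is concatenated to the releaser input at every time step. Layer counts, hidden sizes, and class names are placeholders rather than the authors' implementation or the configurations reported in Section IV.

```python
import torch
import torch.nn as nn

class Releaser(nn.Module):
    """Sketch of the releaser R_theta: an LSTM mapping the observed sequence,
    concatenated with uniform seed noise, to the released sequence Z^T."""
    def __init__(self, in_dim=1, noise_dim=4, hidden=64, layers=2):
        super().__init__()
        self.noise_dim = noise_dim
        self.rnn = nn.LSTM(in_dim + noise_dim, hidden, layers, batch_first=True)
        self.out = nn.Linear(hidden, in_dim)

    def forward(self, w):                                     # w: (batch, T, in_dim)
        u = torch.rand(w.size(0), w.size(1), self.noise_dim,  # i.i.d. uniform seed noise
                       device=w.device)
        h, _ = self.rnn(torch.cat([w, u], dim=-1))
        return self.out(h)                                    # z: (batch, T, in_dim)

class Adversary(nn.Module):
    """Sketch of the adversary A_phi: an LSTM classifier predicting the private
    label X_t from the released prefix Z^t at every time step."""
    def __init__(self, in_dim=1, hidden=32, layers=2, num_classes=2):
        super().__init__()
        self.rnn = nn.LSTM(in_dim, hidden, layers, batch_first=True)
        self.out = nn.Linear(hidden, num_classes)

    def forward(self, z):                                     # z: (batch, T, in_dim)
        h, _ = self.rnn(z)
        return self.out(h)                                    # logits: (batch, T, num_classes)
```

Because the LSTM processes each sequence causally, the output at time $t$ depends only on the inputs up to $t$, which is consistent with the causal decomposition in (1).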

Input: Data set (which includes sample sequences of useful data, sensitive data, and observations); seed noise samples; seed noise dimension; batch size; number of steps $k$ to apply to the adversary; gradient clipping value; recurrent regularization parameter.
Output: Releaser network $\mathcal{R}_\theta$.

1:  for number of training iterations do
2:     for $k$ steps do
3:         Sample a minibatch of examples.
4:         Compute the gradient of $\mathcal{L}_{\mathcal{A}}$, approximated with the minibatch, w.r.t. $\phi$.
5:         Update the adversary by applying the RMSprop optimizer with the given clipping value.
6:     end for
7:     Sample a minibatch of examples.
8:     Compute the gradient of $\mathcal{L}_{\mathcal{R}}$, approximated with the minibatch, w.r.t. $\theta$.
9:     Apply recurrent regularization and update the releaser by applying the RMSprop optimizer with the given clipping value.
10:  end for
Algorithm 1 Algorithm for training the privacy-preserving data releaser neural network.
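A compact PyTorch-style rendering of Algorithm 1 might look as follows. It reuses the Releaser, Adversary, and releaser_loss sketches above, treats all hyperparameter values as placeholders, and omits the recurrent regularization of step 9 for brevity; it is an illustration under these assumptions, not the authors' code.

```python
from itertools import cycle
import torch
import torch.nn.functional as F

def train(releaser, adversary, loader, iterations=1000, k=5, lam=1.0,
          clip=1.0, lr=1e-3, num_classes=2):
    """Sketch of Algorithm 1: k adversary updates per releaser update,
    RMSprop optimizers with gradient value clipping."""
    opt_adv = torch.optim.RMSprop(adversary.parameters(), lr=lr)
    opt_rel = torch.optim.RMSprop(releaser.parameters(), lr=lr)
    batches = cycle(loader)                 # loader yields (w, y, x) minibatches

    for _ in range(iterations):
        for _ in range(k):                  # steps 2-6: update the adversary
            w, y, x = next(batches)
            z = releaser(w).detach()        # freeze the releaser for these steps
            logits = adversary(z)
            loss_a = F.cross_entropy(logits.reshape(-1, num_classes),
                                     x.reshape(-1).long())
            opt_adv.zero_grad(); loss_a.backward()
            torch.nn.utils.clip_grad_value_(adversary.parameters(), clip)
            opt_adv.step()

        w, y, x = next(batches)             # steps 7-9: update the releaser
        z = releaser(w)
        loss_r = releaser_loss(z.squeeze(-1), y, adversary(z), x, lam, num_classes)
        opt_rel.zero_grad(); loss_r.backward()
        torch.nn.utils.clip_grad_value_(releaser.parameters(), clip)
        opt_rel.step()
```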

IV Results and Discussion

IV-A Description of datasets

In this study, the Electricity Consumption & Occupancy (ECO) and Pecan Street data sets are used. The ECO data set, collected and published by [beckel2014eco], includes 1 Hz power consumption measurements and occupancy information of five houses in Switzerland over a period of several months. In this study, we re-sampled the data to obtain hourly samples. On the other hand, the Pecan Street data set contains hourly SMs data of houses in Austin, Texas and was collected by Pecan Street Inc. [street2019dataport]. The Pecan Street project is a smart grid demonstration research program which provides electricity, water, natural gas, and solar energy generation measurements for a large number of houses in Austin, Texas. In order to model the time dependency over each day (with a data rate of 1 sample per hour), the data was reshaped into sample sequences of length 24. The resulting sample sequences from the ECO and Pecan Street data sets are split into train and test sets with a ratio of roughly 85:15, while a portion of the training data is used as the validation set. It should be noted that in this study we assume that the attacker has access to all the training data used by the releaser, which can be considered a worst-case scenario.
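The preprocessing described above could be sketched as follows, assuming the raw readings are available as pandas Series indexed by timestamps. The resampling rule for the occupancy labels, the split helper, and all names are our assumptions for illustration, not the released preprocessing code of either data set.

```python
import numpy as np
import pandas as pd

def make_daily_sequences(power_1hz, labels_1hz):
    """Resample 1 Hz ECO power readings to hourly means and reshape them into
    length-24 daily sequences of useful data y and private labels x."""
    hourly = power_1hz.resample("1H").mean()
    lab = labels_1hz.resample("1H").max()      # occupied if present at any time in the hour
    n_days = len(hourly) // 24
    y = hourly.values[: n_days * 24].reshape(n_days, 24)
    x = lab.values[: n_days * 24].reshape(n_days, 24)
    return y, x

def split(y, x, train_frac=0.85, seed=0):
    """Roughly 85:15 train/test split of the daily sequences."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    n_tr = int(train_frac * len(y))
    tr, te = idx[:n_tr], idx[n_tr:]
    return (y[tr], x[tr]), (y[te], x[te])
```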

IV-B Inference of household occupancy

The first practical case study regarding privacy preservation in time series data concerns inferring the presence/absence of residents at home from the total power consumption collected by SMs [kleiminger2015household, jia2014human]. For this application, the electricity consumption measurements from the ECO data set are considered as the useful data, while the occupancy labels are considered as the private data. Therefore, our privacy-preserving data release method aims to minimize a trade-off between the distortion incurred on the total electricity consumption and the probability of inferring the presence of an individual at home from the released signal. The releaser and adversary networks used for the training consist of 4 LSTM layers and 2 LSTM layers, respectively. In addition, a recurrent regularizer was used in each layer of the releaser network, and the other hyperparameters of Algorithm 1 were kept fixed for this application. Finally, after training, a strong attacker consisting of 3 LSTM layers is used.

Fig. 3: Privacy-utility trade-off for the house occupancy inference application. Since in this application the attacker is a binary classifier, the random guessing (balanced) accuracy is 50%. The fitted curve is based on an exponential function and is included only for illustration purposes.

Based on the target data $Y^T$ and the release data $Z^T$, the normalized root mean-square error (NRMSE) is defined by

$$\mathrm{NRMSE} = \frac{\sqrt{\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\big[(Y_t - Z_t)^2\big]}}{\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\big[Y_t\big]}. \qquad (10)$$
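An empirical version of this metric, computed over a test set of daily sequences, could look as follows; the normalization by the mean of the useful signal matches the reconstruction of (10) above and should be treated as an assumption.

```python
import numpy as np

def nrmse(y, z):
    """Empirical NRMSE between useful sequences y and released sequences z,
    both arrays of shape (num_sequences, T)."""
    rmse = np.sqrt(np.mean((y - z) ** 2))
    return rmse / np.mean(y)
```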

Fig. 3 shows the empirically found privacy-utility trade-off for this application. It can be seen that by adding more distortion to the released data, the attacker is pushed toward a random-guessing classifier.

In order to provide more insight into the release mechanism, the Power Spectral Density (PSD) of the input signal and the PSD of the error signal (defined as the difference between the actual power consumption and the released signal) are estimated, for four different cases along the privacy-utility trade-off curve of Fig. 3, using Welch's method [stoica2005spectral]. For each case, we use 10 released signals and average the PSD estimates. Results are shown in Fig. 4. Looking at the PSD of the input signal (useful data), some harmonics are visible. The PSDs of the error signals show that the model controls the privacy-utility trade-off mainly by modifying the distortion at these harmonics.
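A sketch of this analysis using SciPy's Welch estimator is given below. It assumes the actual consumption and each released signal are 1-D arrays sampled hourly; the sampling frequency and segment length are illustrative choices.

```python
import numpy as np
from scipy.signal import welch

def avg_error_psd(y, z_list, fs=1.0 / 3600.0):
    """Welch PSD of the error signal (actual consumption minus release),
    averaged over several released versions z of the same useful signal y."""
    psds = []
    for z in z_list:
        err = y - z
        f, pxx = welch(err, fs=fs, nperseg=min(len(err), 24))
        psds.append(pxx)
    return f, np.mean(psds, axis=0)
```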

Fig. 4: PSD of the actual electricity consumption and error signals for the house occupancy inference application.

It should be mentioned that two stationarity tests, the Augmented Dickey-Fuller test [dickey1979distribution] and the Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test [kwiatkowski1992testing], applied to our data set indicate that there is enough evidence to suggest the data is stationary, which supports our PSD analysis.
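Both tests are available in statsmodels; a minimal sketch of how they could be applied to a consumption series is shown below (the significance level and options are illustrative). Note that the two tests have opposite null hypotheses: ADF tests the null of a unit root (non-stationarity), while KPSS tests the null of stationarity.

```python
from statsmodels.tsa.stattools import adfuller, kpss

def stationarity_tests(series, alpha=0.05):
    """Run ADF and KPSS on a 1-D consumption series and summarize the outcomes."""
    adf_p = adfuller(series)[1]
    kpss_p = kpss(series, regression="c", nlags="auto")[1]
    return {
        "adf_rejects_unit_root": adf_p < alpha,       # True supports stationarity
        "kpss_rejects_stationarity": kpss_p < alpha,  # False supports stationarity
    }
```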

IV-C Inference of house identity

The second practical case study regarding privacy preservation in SMs measurements is identity recognition from the total power consumption of households [efthymiou2010smart]. It is assumed that the attacker has access to the total power consumption of different households in a region (training data) and then attempts to determine the identities of the households using the newly released data (test data). Thus, our model aims at generating released data of the total power consumption of households in a way that prevents the adversary from performing the identity recognition while keeping the distortion of the total power minimal. For this task, the total power consumption of five houses is used. For this application, the releaser consists of 6 LSTM layers and the adversary has 4 LSTM layers. Similarly, a recurrent regularizer was used in each layer of the releaser network, and the other hyperparameters of Algorithm 1 were kept fixed. Finally, after training, an attacker consisting of 4 LSTM layers was used. The empirical privacy-utility trade-off curve obtained for this application is presented in Fig. 5. Comparing Fig. 5 with Fig. 3, we see that a high level of privacy is expensive: for instance, in order to obtain an attacker accuracy of 30%, the NRMSE should be approximately equal to 0.30. This is attributed to the fact that this task is harder from the learning point of view than the one considered in Section IV-B.
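For completeness, the privacy metric reported on the y-axis of Figs. 3 and 5 can be computed as the balanced accuracy of the test-time attacker, so that random guessing corresponds to 50% for the binary occupancy task and 20% for the five-house identity task. The sketch below uses scikit-learn and assumes per-step hard predictions; the helper name is ours.

```python
from sklearn.metrics import balanced_accuracy_score

def attacker_balanced_accuracy(x_true, attacker_preds):
    """Balanced accuracy of the attacker over all sequences and time steps.
    x_true, attacker_preds: integer label arrays of shape (num_sequences, T)."""
    return balanced_accuracy_score(x_true.ravel(), attacker_preds.ravel())
```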

Fig. 5: Privacy-utility trade-off for the house identity inference application. Since in this application the attacker is a five-class classifier, the random guessing (balanced) accuracy is 20%. The fitted curve is based on an exponential function and is included only for illustration purposes.

PSD analysis was also performed for this application, yielding the results of Fig. 6. Once again, we see that the release network provides the privacy-utility trade-off mainly by distorting the harmonics of the actual electricity consumption signal.

Fig. 6: PSD of the actual electricity consumption and error signals for the house identity inference application.

V Conclusion

We have presented a new method to train privacy-preserving mechanisms that control the privacy-utility trade-off in time series data. This led us to define the directed information between the sensitive variables and their estimation as a more suitable privacy measure than previous proposals in the literature. A tractable upper bound was then derived, and a deep learning adversarial framework between two recurrent neural networks was introduced to optimize the new loss function. Our method was validated on two well-known privacy problems in smart meters data using two different open data sets. For both privacy problems, we considered the worst case, in which an attacker has access to all the training data used by the releaser. In future work, we will consider alternative formulations of the problem, such as different distortion measures and a more general loss function, in order to attempt to provide universal privacy guarantees (i.e., independent of the attacker structure and computational power).

Acknowledgment

This work was supported by Hydro-Quebec, the Natural Sciences and Engineering Research Council of Canada, and McGill University in the framework of the NSERC/Hydro-Quebec Industrial Research Chair in Interactive Information Infrastructure for the Power Grid (IRCPJ406021-14). The work of Prof. Pablo Piantanida was supported by the European Commission’s Marie Sklodowska-Curie Actions (MSCA), through the Marie Sklodowska-Curie IF (H2020-MSCAIF-2017-EF-797805-STRUDEL).

References