## I Introduction

In recent decades, there has been explosive progress in data collection tools that can measure, analyze and disseminate users' data. These advances have had a large impact on several different fields such as electrical power systems, health care, digital banking, etc. However, users are generally unwilling to share their data due to the possibility that a third party could infer their personal private information, e.g., living habits, economic status, health history. Therefore, guaranteeing users' privacy while preserving the benefits of data collection is an important challenge in modern data science [nelson2016].

In this paper, we focus on this problem when the data has a time series structure and, in particular, we consider different privacy scenarios motivated by the deployment of Smart Meters (SMs) in electrical distribution networks [giaconi2018privacy]. SMs are devices that register fine-grained electricity consumption of users and communicate this information to the utility provider in almost real time. The utility of SM data is diverse [wang2018review, depuru2011]: it can be used for power quality monitoring, timely fault detection, demand response, energy theft prevention, etc. However, widespread usage of SM data can lead to serious leakages of consumers' private information, e.g., a malicious third party could use the data to detect the presence of residents at home as well as their personal habits [giaconi2018privacy]. This problem can have a serious impact on the deployment pace of SMs and, more broadly, on the development of smart electrical grids. Thus, it is critical to ensure that SM data are sanitized before being released.

A very simple strategy that has been proposed in the context of SMs is to use pseudonyms rather than the real identities of users for data publishing purposes [efthymiou2010smart]. However, this approach implicitly assumes that a trusted anonymizer is available. Another simple technique suggested in the literature is downsampling of the data, where the sampling rate is reduced to a level that does not pose any privacy threat [cardenas2012privacy, mashima2015authenticated]. Although this approach may be effective from the privacy point of view, it could also seriously limit the utility of the SM data for applications requiring a timely response.
More sophisticated and recent approaches exploit the presence of renewable energy sources and rechargeable batteries in homes to modify the actual energy consumption of users in order to hide the sensitive information [backes2013differentially, zhao2014achieving, li2018information, giaconi2018privacy, erdemir2019privacy]. Some of these works use ideas from the well-known principle of differential privacy. However, recent articles suggest that the utility loss of differential privacy may be significant in practice [mendes2017privacy]. It should be noted that our approach to the problem, which works only with the power measurement data, does not preclude the use of methods that change the energy consumption patterns by using physical resources. In fact, it should be viewed as a complementary approach that could even be used on top of the above-mentioned methods.

In an information-theoretic context, privacy is generally measured by the Mutual Information (MI) between the sensitive and released variables [li2018information, giaconi2018privacy, erdemir2019privacy, sankar2013smart]. Some of these studies aim to find a privacy-utility trade-off using ideas from rate-distortion theory [sankar2013smart, tripathy2017privacy]. More specifically, the theoretical framework of the privacy-utility problem was proposed in [sankar2013smart], where a hidden Markov model for the power measurements of SMs is considered in which the distribution is assumed to be controlled only by the state of the home appliances. The privacy-utility trade-off is then found for a stationary Gaussian model of the electricity load, with the MI between the released and private sequences of variables as the privacy measure. Besides the limitation of the Gaussian model, it should be noted that the MI is not well-suited to capture the causal structure of time series data.

In this paper, an information-theoretic cost function for privacy-preserving data release of time series is proposed. In order to take into account the time series structure and causality of the data in the privacy measure, we use the Directed Information (DI) [massey1990causality] between the sensitive time series and an estimation of it. Then, a cost function is derived for the releaser mechanism based on an upper bound of the DI. To optimize and validate our cost function without imposing constraints on the data distribution, two recurrent neural networks, named the releaser and adversary networks, are employed. This approach is based on the framework of Generative Adversarial Networks (GANs) [goodfellow2014, huang2017context], where two neural networks are trained simultaneously with opposite goals. We will show that, by controlling the relative weight between a distortion measure and the DI privacy measure, we can control the utility-privacy trade-off of SM power measurements. A similar approach to the privacy problem, but for different applications, was considered in [tripathy2017privacy, 2018arXiv180209386F], based on the standard MI and independent and identically distributed (i.i.d.) data, where the authors use two deep feed-forward neural networks for the releaser and adversary. However, to the best of our knowledge, this is the first work to consider a DI privacy measure for time series data in the general privacy-preserving context, and in SM applications in particular.

This paper is organized as follows. In Section II, we present the theoretical formulation of the problem. Then, in Section III, a privacy-preserving data release method based on Long Short-Term Memory (LSTM) Recurrent Neural Networks (RNNs) is introduced along with the training algorithm. Results for two different applications based on SM data are presented in Section IV. Finally, some concluding remarks and a discussion about future work are given in Section V.

### Notation and conventions

A sequence of random variables $(X_1, \dots, X_T)$ of length $T$ is denoted by $X^T$, while $x^T$ denotes a realization of $X^T$ and $x^{T(i)}$ denotes the $i$-th sample in a minibatch used for training. Mutual information [cover2006elements] between variables $U$ and $V$ is represented as $I(U; V)$ and the entropy of $U$ as $H(U)$. We use $U - V - W$ to indicate that $U$, $V$ and $W$ form a Markov chain. The expectation of a random variable $U$ is denoted as $\mathbb{E}[U]$.

## II Problem Formulation and Training Objective

### II-A Main definitions

Consider the private variables $X_t$ (such as occupancy label, household identity, or ACORN family type), useful variables $Y_t$ (such as the actual electricity consumption of a household), and observed variables $W_t$ (which could be a combination of private and useful variables). We assume that $X_t$ takes values on a discrete alphabet $\mathcal{X}$, for $t \in \{1, \dots, T\}$. A releaser $\mathcal{R}_\theta$ (this notation is used to denote that the releaser is controlled by its parameters $\theta$) produces the release variables $Z_t$ based on the observation $W^t$, for each time $t$, while an adversary attempts to infer $X_t$ based on $Z^t$ by finding an approximation of $X_t$, which we shall denote by $\hat{X}_t$. Thus, the Markov chain $X^t - W^t - Z^t - \hat{X}_t$ holds for all $t$. In addition, due to causality, the distribution $p_\theta(z^T \mid w^T)$ can be decomposed as follows:

$$p_\theta\big(z^T \mid w^T\big) = \prod_{t=1}^{T} p_\theta\big(z_t \mid z^{t-1}, w^t\big) \qquad (1)$$

The goal of the releaser is to minimize the flow of information from the sensitive variables $X_t$ to their estimations $\hat{X}_t$ while simultaneously keeping the distortion between the release variables and the useful variables below some given value. On the other hand, the goal of the adversary $\mathcal{A}_\varphi$ (this notation is used to denote that the adversary is controlled by its parameters $\varphi$) is to estimate $X_t$ as accurately as possible.

To take into account the causal relation between $X^T$ and $\hat{X}^T$, the flow of information is quantified by the DI [massey1990causality]:

$$I\big(X^T \to \hat{X}^T\big) := \sum_{t=1}^{T} I\big(X^t; \hat{X}_t \mid \hat{X}^{t-1}\big) \qquad (2)$$

where $I\big(X^t; \hat{X}_t \mid \hat{X}^{t-1}\big)$ is the conditional mutual information between $X^t$ and $\hat{X}_t$ conditioned on $\hat{X}^{t-1}$ [cover2006elements].
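As a concrete illustration of the sum in (2), the DI between two jointly observed discrete sequences can be estimated with a simple plug-in approach, computing each conditional MI term from empirical joint frequencies. The sketch below is our own illustration on short binary sequences; it is not the estimator used later for training, which precisely avoids this kind of exponential-cost computation:

```python
import numpy as np
from collections import Counter

def plug_in_entropy(samples, idx, n):
    """Empirical entropy (in bits) of the coordinates `idx` of the samples."""
    counts = Counter(tuple(s[i] for i in idx) for s in samples)
    p = np.array(list(counts.values())) / n
    return -np.sum(p * np.log2(p))

def conditional_mi(samples, a, b, c):
    """Plug-in estimate of I(A;B|C) = H(A,C) + H(B,C) - H(A,B,C) - H(C)."""
    n = len(samples)
    H = lambda idx: plug_in_entropy(samples, idx, n)
    return H(a + c) + H(b + c) - H(a + b + c) - H(c)

def directed_information(X, Xhat):
    """Estimate I(X^T -> Xhat^T) = sum_t I(X^t; Xhat_t | Xhat^{t-1}).

    X, Xhat: (N, T) integer arrays of N jointly sampled sequence pairs.
    """
    N, T = X.shape
    samples = [tuple(x) + tuple(xh) for x, xh in zip(X, Xhat)]
    di = 0.0
    for t in range(1, T + 1):
        di += conditional_mi(samples,
                             a=list(range(t)),             # X^t
                             b=[T + t - 1],                # Xhat_t
                             c=list(range(T, T + t - 1)))  # Xhat^{t-1}
    return di
```

For a perfect adversary ($\hat{X}_t = X_t$) on i.i.d. uniform binary data, the estimate approaches $T$ bits, while for an $\hat{X}^T$ independent of $X^T$ it approaches zero.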

The expected distortion between $Z^T$ and $Y^T$ is defined as:

$$\mathcal{D}\big(Z^T, Y^T\big) := \mathbb{E}\big[d\big(Z^T, Y^T\big)\big] \qquad (3)$$

where $d(\cdot, \cdot)$ is any distortion function (i.e., a metric on the space of release sequences). In order to ensure the quality of the release, we shall impose the constraint $\mathbb{E}\big[d\big(Z^T, Y^T\big)\big] \le \varepsilon$ for some given $\varepsilon \ge 0$. In this work, we will consider the normalized squared error as in [sankar2013smart], i.e.,

$$d\big(z^T, y^T\big) = \frac{\lVert z^T - y^T \rVert_2^2}{\lVert y^T \rVert_2^2} \qquad (4)$$

Nevertheless, it should be noted that other distortion measures can also be relevant for the SMs data. For instance, demand response programs usually require an accurate knowledge of peak power consumption, so a distortion function closer to the infinity norm would be more meaningful for this particular application. This brief discussion simply illustrates that the distortion function should be properly matched to the intended application of the release variables in order to preserve the characteristics of the useful variables that are considered essential. Since the goal of this paper is mainly to introduce a new privacy measure and privacy-preserving data release framework, we will not further investigate different fidelity measures.
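To make the contrast above concrete, the following sketch compares a normalized squared-error distortion with a peak-oriented (infinity-norm) distortion; the exact normalization assumed for the squared error, and the toy load profiles, are our own illustration:

```python
import numpy as np

def normalized_squared_error(z, y):
    # Assumed form of the normalized squared error:
    # squared error normalized by the energy of the useful signal.
    return np.sum((z - y) ** 2) / np.sum(y ** 2)

def peak_distortion(z, y):
    # Infinity-norm style distortion: penalizes the worst-case error,
    # relevant when peak power consumption must be preserved.
    return np.max(np.abs(z - y))

y = np.array([1.0, 2.0, 4.0, 2.0])        # hypothetical useful load profile
z = y + np.array([0.1, -0.1, 0.1, -0.1])  # a hypothetical released profile
```

A release with a small average distortion can still carry a noticeable peak error, which is why the distortion function should be matched to the intended application.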

Therefore, the problem of finding an optimal releaser subject to the aforementioned adversary and distortion constraint can be formally written as follows:

$$\min_{\theta} \; I\big(X^T \to \hat{X}^T\big) \quad \text{s.t.} \quad \mathbb{E}\big[d\big(Z^T, Y^T\big)\big] \le \varepsilon \qquad (5)$$

Note that the solution of this optimization problem is a function of the conditional distributions $p_\varphi\big(\hat{x}_t \mid z^t\big)$ that represent the adversary $\mathcal{A}_\varphi$.

### II-B Novel training objective

The optimization problem (5) can be directly used to define a loss function for $\mathcal{R}_\theta$. However, note that the cost of computing the DI term scales as $\mathcal{O}\big(\lvert\mathcal{X}\rvert^T\big)$, where $\lvert\mathcal{X}\rvert$ is the size of the alphabet $\mathcal{X}$. Thus, for the sake of tractability, the DI will be replaced with the following surrogate bound:

$$
\begin{aligned}
I\big(X^T \to \hat{X}^T\big) &= \sum_{t=1}^{T}\Big[H\big(\hat{X}_t \mid \hat{X}^{t-1}\big) - H\big(\hat{X}_t \mid \hat{X}^{t-1}, X^t\big)\Big] \\
&\overset{(i)}{\le} \sum_{t=1}^{T}\Big[H\big(\hat{X}_t\big) - H\big(\hat{X}_t \mid \hat{X}^{t-1}, X^t\big)\Big] \\
&\overset{(ii)}{=} \sum_{t=1}^{T}\Big[H\big(\hat{X}_t\big) - H\big(\hat{X}_t \mid Z^t\big)\Big] \\
&\overset{(iii)}{\le} T \log \lvert\mathcal{X}\rvert - \sum_{t=1}^{T} H\big(\hat{X}_t \mid Z^t\big) \qquad (6)
\end{aligned}
$$

where (i) is due to the fact that conditioning reduces entropy; equality (ii) is due to the Markov chains $X^t - W^t - Z^t$ and $\big(X^t, \hat{X}^{t-1}\big) - Z^t - \hat{X}_t$; and (iii) is due to the trivial bound $H\big(\hat{X}_t\big) \le \log \lvert\mathcal{X}\rvert$. Therefore, the loss function for $\mathcal{R}_\theta$ can be written as

$$\mathcal{L}_{\mathcal{R}}(\theta, \varphi) = \frac{\lambda}{T}\left[T \log \lvert\mathcal{X}\rvert - \sum_{t=1}^{T} H\big(\hat{X}_t \mid Z^t\big)\right] + \mathbb{E}\big[d\big(Z^T, Y^T\big)\big] \qquad (7)$$

where $\lambda \ge 0$ controls the privacy-utility trade-off and the factor $1/T$ has been introduced for normalization purposes. It should be noted that the value of $\lambda$ in (7) indirectly controls the achievable distortion $\varepsilon$ in (5), which means that we can control the privacy-utility trade-off by varying $\lambda$. For $\lambda = 0$, the loss function reduces to the expected distortion, being independent from the adversary $\mathcal{A}_\varphi$. In such a scenario, $\mathcal{R}_\theta$ offers no privacy guarantees. Conversely, for very large values of $\lambda$, the loss function is dominated by the upper bound on the DI, so that privacy is the only goal of $\mathcal{R}_\theta$. In this regime, we expect the adversary to completely fail in the task of estimating $X_t$, i.e., to approach random-guessing performance.
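A minimal numpy sketch of a loss of the form in (7), assuming the conditional entropy term is approximated by the adversary's cross-entropy on the true private labels (a standard surrogate; the array shapes, function name, and normalized squared-error distortion are our own illustration, not the paper's exact implementation):

```python
import numpy as np

def releaser_loss(adv_probs, x_private, z_release, y_useful, lam, alphabet_size):
    """Surrogate loss: lam * (DI upper-bound term) + expected distortion.

    adv_probs : (N, T, K) adversary posteriors over the K private symbols
    x_private : (N, T) integer private labels
    z_release, y_useful : (N, T) released and useful sequences
    """
    eps = 1e-12
    # Adversary cross-entropy on true labels, a proxy for the conditional entropy
    picked = np.take_along_axis(adv_probs, x_private[..., None], axis=2)
    cross_entropy = -np.mean(np.log(picked + eps))
    di_bound = np.log(alphabet_size) - cross_entropy  # per-step surrogate bound
    # Normalized squared-error distortion, averaged over the batch
    distortion = np.mean(np.sum((z_release - y_useful) ** 2, axis=1)
                         / np.sum(y_useful ** 2, axis=1))
    return lam * di_bound + distortion
```

Note how the two regimes discussed above appear here: for `lam = 0` the loss reduces to the distortion alone, while a maximally uncertain adversary (uniform posteriors) makes the surrogate DI term vanish.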

On the other hand, the adversary $\mathcal{A}_\varphi$ is a classifier which optimizes the following cross-entropy loss:

$$\mathcal{L}_{\mathcal{A}}(\varphi) = \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}\big[-\log p_\varphi\big(X_t \mid Z^t\big)\big] \qquad (8)$$

where the expectation should be understood w.r.t. the joint distribution of $\big(X_t, Z^t\big)$. Notice that

$$\mathbb{E}\big[-\log p_\varphi\big(X_t \mid Z^t\big)\big] = H\big(X_t \mid Z^t\big) + \mathbb{E}\Big[D_{\mathrm{KL}}\big(p\big(\cdot \mid Z^t\big) \,\big\|\, p_\varphi\big(\cdot \mid Z^t\big)\big)\Big] \ge H\big(X_t \mid Z^t\big) \qquad (9)$$

Therefore, if the adversary is ideal (i.e., $p_\varphi\big(\cdot \mid z^t\big) = p\big(\cdot \mid z^t\big)$ for all $t$ and $z^t$), the releaser network, by maximizing $\mathcal{L}_{\mathcal{A}}$, prevents the adversary from inferring the private data.
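The decomposition behind (9) — cross-entropy equals the true conditional entropy plus a KL divergence — can be checked numerically for a single posterior (the two distributions below are arbitrary examples, not values from the paper):

```python
import numpy as np

def cross_entropy(p, q):
    # E_p[-log q], the adversary's expected loss under truth p
    return -np.sum(p * np.log(q))

def entropy(p):
    return -np.sum(p * np.log(p))

def kl_divergence(p, q):
    return np.sum(p * np.log(p / q))

p = np.array([0.7, 0.3])  # "true" posterior p(x_t | z^t), assumed for illustration
q = np.array([0.5, 0.5])  # adversary's model p_phi(x_t | z^t)

# Cross-entropy = H(p) + KL(p || q) >= H(p), with equality iff q = p
assert np.isclose(cross_entropy(p, q), entropy(p) + kl_divergence(p, q))
assert cross_entropy(p, q) >= entropy(p)
assert np.isclose(cross_entropy(p, p), entropy(p))
```

This is exactly why an ideal adversary's loss equals the conditional entropy, so a releaser that maximizes this loss pushes the released data toward being uninformative about $X_t$.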

## III Privacy-Preserving Mechanism

Based on the previous theoretical formulation, an adversarial modeling framework consisting of two RNNs, a releaser $\mathcal{R}_\theta$ and an adversary $\mathcal{A}_\varphi$, is considered (see Fig. 1). Note that independent noise is appended to the releaser input in order to randomize the released variables, which is a popular approach in privacy-preserving methods. In addition, the available theoretical results show that, for Gaussian distributions, the optimal release contains such a noise component [sankar2013smart, tripathy2017privacy]. For both networks, an LSTM architecture is selected (see Fig. 2), which has been shown to be successful in several problems dealing with sequences of data (see [goodfellow2016] and references therein for more details). The training of the suggested framework is performed using Algorithm 1, which runs several gradient steps to train $\mathcal{A}_\varphi$ followed by one gradient step to train $\mathcal{R}_\theta$. Note that the number of adversary steps should be large enough to ensure that $\mathcal{A}_\varphi$ remains a strong adversary during training. This is in fact the common practice for effectively training two networks in an adversarial framework [goodfellow2014] and, in particular, in privacy scenarios [tripathy2017privacy]. It should be recalled that, after the training of both networks is completed, an attacker network is trained in order to test the privacy achieved by the releaser network. It should be clarified that this attacker network is distinct from the adversary network used during training and illustrated in Fig. 1: the attacker used in testing mimics a real-world attacker that would try to deduce the private data from the released data.

## IV Results and Discussion

### IV-A Description of datasets

In this study, the Electricity Consumption & Occupancy (ECO) and Pecan Street data sets are used. The ECO data set, collected and published by [beckel2014eco], includes 1 Hz power consumption measurements and occupancy information of five houses in Switzerland over a period of several months. In this study, we re-sampled the data to have hourly samples. On the other hand, the Pecan Street data set contains hourly SM data of houses in Austin, Texas, and was collected by Pecan Street Inc. [street2019dataport]. The Pecan Street project is a smart grid demonstration research program which provides electricity, water, natural gas, and solar energy generation measurements for a large number of houses in Austin, Texas. In order to model time dependency over each day (with a data rate of 1 sample per hour), the data was reshaped into sample sequences of length 24. The resulting daily sequences from each data set are used for training and evaluation. The data is split into train and test sets with a ratio of roughly 85:15, while a fraction of the training data is used as the validation set. It should be noted that, in this study, we assume that the attacker has access to all the training data used by the releaser, which can be considered as a worst-case scenario.
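The reshaping and splitting just described can be sketched as follows. The 85:15 train/test ratio comes from the text; the synthetic stand-in data, the one-year horizon, and the validation fraction are our own assumptions:

```python
import numpy as np

# Synthetic stand-in for one year of hourly smart meter readings
hourly = np.sin(np.linspace(0.0, 100.0, 365 * 24))

# Group into daily sequences: 1 sample/hour -> sequences of length 24
days = hourly.reshape(-1, 24)

# Roughly 85:15 train/test split
n_train = int(0.85 * len(days))
train, test = days[:n_train], days[n_train:]

# Hold out a fraction of the training data for validation (fraction assumed)
n_val = int(0.15 * len(train))
train, val = train[n_val:], train[:n_val]
```

Each row of `days` is one daily sequence $W^T$ with $T = 24$, the unit over which the recurrent networks are unrolled.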

### IV-B Inference of household occupancy

The first practical case study regarding privacy preservation in time series data concerns inferring the presence/absence of residents at home from the total power consumption collected by SMs [kleiminger2015household, jia2014human]. For this application, the electricity consumption measurements from the ECO data set are considered as the useful data, while the occupancy labels are considered as the private data. Therefore, our privacy-preserving data release method aims to minimize a trade-off between the distortion incurred on the total electricity consumption and the probability of inferring the presence of an individual at home from the released signal. The releaser and adversary networks used for training consist of 4 LSTM layers and 2 LSTM layers, respectively, with the same activation function used in both. In addition, a recurrent regularizer was used in each layer of the releaser network, and the remaining hyperparameters were fixed for this application. Finally, after training, a strong attacker is used, consisting of 3 LSTM layers. Based on the target data $Y^T$ and the release data $Z^T$, the normalized root mean-square error (NRMSE) is defined by

$$\mathrm{NRMSE} = \sqrt{\frac{\mathbb{E}\big[\lVert Y^T - Z^T \rVert_2^2\big]}{\mathbb{E}\big[\lVert Y^T \rVert_2^2\big]}} \qquad (10)$$
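A direct implementation of this metric, under our assumed normalization by the root-mean-square of the target sequences (empirical means replace the expectations):

```python
import numpy as np

def nrmse(y, z):
    # Root mean-square error normalized by the RMS of the target sequences;
    # y, z are (N, T) arrays of target and released sequences.
    return np.sqrt(np.sum((y - z) ** 2, axis=1).mean()
                   / np.sum(y ** 2, axis=1).mean())
```

By construction, releasing the target itself gives NRMSE 0, while releasing an all-zero signal gives NRMSE 1, so the trade-off curves below can be read against these two extremes.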

Fig. 3 shows the empirically found privacy-utility trade-off for this application. It can be seen that, by adding more distortion to the released data, the attacker is pushed toward a random-guessing classifier.

In order to provide more insight into the release mechanism, the Power Spectral Density (PSD) of the input signal and the PSD of the error signal (defined as the difference between the actual power consumption and the released signal) for four different cases along the privacy-utility trade-off curve of Fig. 3 are estimated using Welch's method [stoica2005spectral]. For each case, we use 10 release signals and average the PSD estimates. Results are shown in Fig. 4. Looking at the PSD of the input signal (useful data), some harmonics are visible. The PSDs of the error signals show that the model controls the privacy-utility trade-off mainly by modifying the distortion on these harmonics.

It should be mentioned that two stationarity tests, the Augmented Dickey-Fuller test [dickey1979distribution] and the Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test [kwiatkowski1992testing], applied to our data set indicate that there is enough evidence to suggest the data is stationary, supporting our PSD analysis.

### IV-C Inference of house identity

The second practical case study regarding privacy preservation in SM measurements is identity recognition from the total power consumption of households [efthymiou2010smart]. It is assumed that the attacker has access to the total power consumption of different households in a region (training data) and then attempts to determine the identities of the households using the newly released data (test data). Thus, our model aims at generating release data of the total power consumption of households in a way that prevents the adversary from performing identity recognition while keeping the distortion on the total power minimized. For this task, the total power consumption of five houses is used. For this application, the releaser consists of 6 LSTM layers and the adversary of 4 LSTM layers, with the same activation function and recurrent regularizer as before used in each layer of the releaser network. Finally, after training, an attacker consisting of 4 LSTM layers was used. The empirical privacy-utility trade-off curve obtained for this application is presented in Fig. 5. Comparing Fig. 5 with Fig. 3, we see that a high level of privacy is expensive: for instance, in order to reduce the attacker accuracy to 30%, the NRMSE should be approximately equal to 0.30. This is attributed to the fact that this task is harder from the learning point of view than the one considered in Section IV-B.

PSD analysis was also performed for this application, yielding the results in Fig. 6. Once again, we see that the release network provides the privacy-utility trade-off mainly by distorting the harmonics of the actual electricity consumption signal.

## V Conclusion

We have presented a new method to train privacy-preserving mechanisms controlling the privacy-utility trade-off in time series data. This led us to define the directed information between the sensitive variables and their estimations as a privacy measure better suited to time series than previous proposals in the literature. A tractable upper bound was then derived, and a deep learning adversarial framework between two recurrent neural networks was introduced to optimize the new loss function. Our method was validated on two well-known privacy problems in smart meter data using two different open data sets. For both privacy problems, we considered the worst case in which an attacker has access to all the training data used by the releaser. In future work, we will consider alternative formulations of the problem, such as different distortion measures and a more general loss function, in order to attempt to provide universal privacy guarantees (i.e., independent of the attacker's structure and computational power).

## Acknowledgment

This work was supported by Hydro-Quebec, the Natural Sciences and Engineering Research Council of Canada, and McGill University in the framework of the NSERC/Hydro-Quebec Industrial Research Chair in Interactive Information Infrastructure for the Power Grid (IRCPJ406021-14). The work of Prof. Pablo Piantanida was supported by the European Commission’s Marie Sklodowska-Curie Actions (MSCA), through the Marie Sklodowska-Curie IF (H2020-MSCAIF-2017-EF-797805-STRUDEL).
