Privacy-preserving Data Analysis through Representation Learning and Transformation

11/16/2020 ∙ by Omid Hajihassani, et al. ∙ York University, University of Alberta

The abundance of data from the sensors embedded in mobile and Internet of Things (IoT) devices and the remarkable success of deep neural networks in uncovering hidden patterns in time series data have led to mounting privacy concerns in recent years. In this paper, we aim to navigate the trade-off between data utility and privacy by learning low-dimensional representations that are useful for data anonymization. We propose probabilistic transformations in the latent space of a variational autoencoder to synthesize time series data such that intrusive inferences are prevented while desired inferences can still be made with a satisfactory level of accuracy. We compare our technique with state-of-the-art autoencoder-based anonymization techniques and additionally show that it can anonymize data in real time on resource-constrained edge devices.


1 Introduction

Internet of Things (IoT) systems are becoming increasingly ubiquitous in our daily lives. They utilize a variety of sensors, actuators, and processing units to monitor and control the environment, and generate valuable insights. The sensors embedded in these systems collect large amounts of time series data from the environment, including sound and video [noto2016contracting, noda2018google, jeon2018iot], temperature [hernandez2014smart], and inertial [hassan2018robust] data. Despite the useful services offered by IoT devices, their proliferation could jeopardize user privacy. This is because raw sensor data is often shared with third-party applications which could use it to make sensitive and unsolicited inferences. Recent advances in deep learning and edge computing have made it easier to perform these inferences on edge and IoT devices, increasing the risk of privacy violations.

Several approaches have been proposed in the literature to enable privacy-preserving data analysis through k-anonymity [bayardo2005data, jia2017pad], differential privacy [dwork2008differential, phan2016differential], or by applying various machine learning techniques [hajihass2020latent, malekzadeh19, wu2019privacy]. However, most approaches that prevent sensitive inferences greatly reduce the accuracy of desired inferences. In practice, a significant loss of data utility is not an acceptable trade-off for privacy. A promising approach should utilize the available data to its fullest potential without compromising user privacy.

In recent years, deep generative models, and in particular, autoencoders have been used to replace private attributes (e.g., race or gender) of time series data with synthetically generated data in such a way that public attributes remain intact. For example, in [malekzadeh19] autoencoders trained in an adversarial fashion are used to address privacy issues concerning the use of IoT and mobile data. We call such approaches model-based as they use specific inference models for adversarial training. Due to the reliance of model-based techniques on adversarial models, they can only fool these models and are susceptible to the re-identification attack presented in [hajihass2020latent]. In this attack, the adversary passes a dataset with known private attributes through the anonymization network and uses the output (anonymized data) along with the known attributes to train a model for user re-identification, thereby reversing the anonymization process.

We propose a novel approach to enable privacy-preserving data analysis by transforming the latent representation of time series data produced by a Variational Autoencoder (VAE). This approach is model-free and extends the variational autoencoder-based technique proposed in our previous work [hajihass2020latent] in two main ways. First, we change the loss function of the conventional VAE by introducing a term that accounts for the private attribute classification error. This helps learn latent representations that correlate with the private attributes. Second, we train a separate VAE for each public attribute class and select the appropriate VAE at run time depending on the inferred public attribute. This is necessary for training useful and compact autoencoders as we discuss later. Figure 1 shows different components of the proposed model-free anonymization technique including attribute-specific VAEs and the transformer adopted in the latent space. We evaluate our technique on two publicly available Human Activity Recognition (HAR) datasets, namely MotionSense [malekzadeh19] and MobiAct [vavoulas2016mobiact]. We use MotionSense for two-class gender anonymization and for comparison with Reference [hajihass2020latent]. The MobiAct dataset is considered for two-class gender anonymization and multi-class weight anonymization. The contribution of this paper is threefold:

  • We propose a VAE-based anonymization technique that learns and manipulates latent representations of the input sensor data to prevent sensitive inferences on the reconstructed data. We modify the loss function of the VAE by incorporating a private attribute classification loss term. This term helps the VAE learn more useful and anonymization-friendly representations. We further train a different VAE model for each public attribute and show that it can effectively reduce the model size.

  • We compare the performance of our technique with autoencoder-based anonymization techniques proposed in [malekzadeh19, malekzadeh2017replacement] by evaluating it on the MotionSense and MobiAct datasets. We show that our anonymization technique is less vulnerable to the re-identification attack and greatly reduces the accuracy of sensitive inferences.

  • We evaluate the feasibility of performing model-free anonymization in real time on edge devices by running experiments on a Raspberry Pi 3 Model B.

Figure 1: Data anonymization on an edge/IoT device using the proposed model-free anonymization technique. The mean latent representation for each pair of private and public class labels is assumed to be stored in a central (cloud) server.

The rest of this paper is organized as follows. In Section 2, we give an overview of the related work on privacy-preserving data analysis and anonymization. Section 3 provides the background on VAE, introduces the idea of model-free anonymization, and explains the distinction between model-based and model-free anonymization techniques. Section 4 presents our model-free anonymization technique, the modified VAE loss function, and specialized VAEs. The datasets and evaluation results of the proposed anonymization technique are presented in Section 5. Furthermore, we evaluate our technique with both deterministic and probabilistic modifications of the latent space representation and investigate whether they can successfully prevent a re-identification attack. In Section 6, we study real-time model-free anonymization of sensor data on an edge device. Section 7 concludes the paper and provides directions for future work.

2 Related Literature

Most IoT devices today are equipped with a myriad of sensors which collect data from the environment and people. The sheer amount of personally identifiable and sensitive information embedded in this time series data opens the door for unwanted and private attribute inferences. In [ren2019information], information exposure from 81 consumer IoT devices is analyzed with respect to their network traffic. It is found that 72 out of the 81 IoT devices send data to a third party over the Internet. This underscores the need for proper anonymization of data before it leaves the user’s device.

The literature on data anonymization and privacy-preserving analysis is extensive. Related work can be broadly classified into systemic and algorithmic solutions. The systemic solutions provide mechanisms for monitoring and managing access to private or sensitive data, and efficiently masking or downsampling this data [Gotz2012, Chakraborty14, singh2018tussleos]. The algorithmic solutions can be further divided into solutions based on differential privacy and k-anonymity [dwork2011differential, Comas2014, bayardo2005data], and solutions that rely on deep generative models [malekzadeh19, malekzadeh2017replacement, feutry2018learning, hajihass2020latent]. The former category hides an individual’s private data in a population, whereas the latter reconstructs data in such a way that it no longer contains any private information. These techniques have been applied in a variety of domains, including public health [Dankar2012, phan2016differential], smart homes and buildings [wu2019privacy, brkic2017know, jia2017privacy, jia2017pad], and mobile devices and sensor networks [hajihass2020latent, malekzadeh2017replacement, carbunar2010query, miao2019privacy].

Systemic solutions. There are various approaches to enhancing data privacy at the operating system and firmware level [fernandes2016flowfence, singh2018tussleos], at the application level [mo2020darknetz, osia2020hybrid], and via certain protocols [carbunar2010query]. The operating system solutions enable the user to navigate the trade-off between privacy and functionality of IoT devices [singh2018tussleos, fernandes2016flowfence]. Reference [fernandes2016flowfence] promotes user privacy through sandboxed execution of developers’ code in quarantined execution modules and taint-tracked data handlers. Taint-tracked opaque data handlers help prevent applications from accessing sensitive data and sharing such data via the network interface. In [singh2018tussleos], a privacy abstraction technique is proposed to manage control over sensor data tussles, addressing utility-privacy trade-offs. Other architectural solutions try to avoid data and model information leakage when data is processed at the edge and in the cloud. In [carbunar2010query], an efficient privacy-preserving querying protocol is proposed for sensor networks, assuming that client queries are processed by servers controlled by multiple mutually distrusting parties. These queries reveal both an array of specific sensors and the relationships between subsequent queries, which can dissuade organizations from sharing resources to build large-scale shared sensor networks. To address these risks, the authors propose the SPYC protocol, which guarantees privacy protection if servers do not cooperate in attacking the clients. They also discuss possible solutions when servers cooperate to infringe on the privacy of clients. These systemic solutions are designed for specific use cases and do not address the data anonymization problem in general.

Algorithmic solutions. Algorithmic solutions are based on microaggregation and various machine learning techniques. The techniques that are based on k-anonymity and differential privacy include [dwork2008differential, Comas2014, bayardo2005data, erlingsson2014rappor]. The authors in [jia2017privacy] propose a privacy-aware HVAC control system architecture that decreases the privacy risk subject to some control performance guarantee using the mutual information (MI) metric. PAD is a privacy-preserving sensor data publishing framework proposed in [jia2017pad]. It ensures privacy through data perturbation and works with customized user datasets with configurable privacy constraints for end users. The authors in [sangogboye2018framework] propose enhancements to PAD, allowing it to be used with non-linear features. In [he2011pda], the authors propose two techniques for private data aggregation, namely Cluster-based Private Data Aggregation (CPDA) and Slice-Mix-AggRegaTe (SMART), which are based on different properties. These techniques provide efficient data aggregation while protecting the user’s data privacy. Furthermore, there exist cryptographic privacy-preserving techniques, such as [miao2019privacy], which proposes weighted aggregation on users’ encrypted data using a homomorphic cryptosystem to promote privacy-preserving crowd sensing systems in the truth discovery domain. This ensures both high privacy protection and high data utility.

Other algorithmic solutions rely on machine learning techniques that use deep neural networks (DNNs) and generative models such as generative adversarial networks (GANs) [goodfellow2014generative] and VAEs [kingma2013auto]. DNNs help protect the privacy of patients by de-identifying patients’ personal notes through the removal of personal health identifiers [dernoncourt2017identification]. For security devices, such as home security cameras and baby monitors, research has focused on de-identifying personal attributes from the camera feed. These approaches are essential due to the lack of trust between users and cloud providers [wu2019privacy, brkic2017know]. Many security devices use cloud servers for data storage and processing. Contrary to the crude facial blurring or pixelization techniques that have been adopted in the past (i.e., adding noise and masking), GANs can be used to swap faces in a camera feed [wu2019privacy]. Thus, less information is lost as the facial expressions can be kept intact while the user identity is protected.

The closest lines of work to ours are machine learning-based techniques that promote user data privacy. References [malekzadeh2017replacement, malekzadeh19] use autoencoders trained in an adversarial fashion. The private information (either categorical or binary) is removed or replaced by non-sensitive data while data utility (public attributes) is preserved. The anonymization problem addressed by these model-based techniques is outlined in the background section. We argue that due to the reliance of model-based techniques on adversarial training [malekzadeh19], they merely fool the adversarial model, hence they remain vulnerable to a re-identification attack [hajihass2020latent]. In this attack, the adversary uses the same anonymization application (with the trained models) used by the IoT or mobile device [hajihass2020latent] to build a dataset comprised of anonymized data entries and known private attribute classes. This allows for training a re-identification model that learns to de-anonymize the previously anonymized data. In this work, we propose the use of a model-free technique that modifies the latent space representation of the original data to enable privacy-preserving data analysis. We extend our previous work [hajihass2020latent] by improving the anonymization capability of our model-free technique and studying the feasibility of performing anonymization on edge devices. Moreover, we investigate whether the proposed technique can successfully transform latent space representations and reconstruct data when the private attribute class is not binary (i.e., has multiple classes).

3 Background

3.1 Variational Autoencoders

A Variational Autoencoder (VAE) [kingma2013auto] is a generative model comprised of an encoder and a decoder network. It differs from a standard autoencoder as it models an underlying probability distribution over the latent variables. Figure 2 shows the probabilistic encoder and decoder of a VAE. The probabilistic encoder represents the approximate posterior in the form of $q_\phi(z|x)$ (e.g., a multivariate Gaussian with diagonal covariance). After training the network parameters denoted by $\phi$, $q_\phi(z|x)$ is used to sample a latent space representation, $z$, for a given data point, $x$. The data reconstruction is governed by the likelihood distribution $p_\theta(x|z)$, which is modeled by the probabilistic decoder.

Figure 2: Probabilistic encoder and decoder networks of a VAE.

We minimize a loss function to train a variational autoencoder. The objective here is to increase the quality of the reconstruction performed by the decoder (i.e., maximizing the log-likelihood of $x$), while having the encoder learn meaningful and concise representations of the input data (minimizing the Kullback-Leibler (KL)-divergence between the approximate and true posterior). Concretely, the objective is to find the best set of parameters (i.e., weights and biases) in the probabilistic decoder and encoder that minimize the reconstruction error of the input data given the latent variables while approximating a variational posterior distribution $q_\phi(z|x)$ that resembles the true posterior distribution $p_\theta(z|x)$.

Since the distance between the variational and true posterior cannot be calculated exactly, the Evidence Lower Bound (ELBO) of a variational autoencoder is maximized to minimize the KL-divergence term between the approximated variational posterior and the prior over latent variables, $p(z)$, which is assumed to have a Gaussian probability distribution. The ELBO is the negative of the loss function and can be written as:

$$\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x|z)\right] - D_{KL}\left(q_\phi(z|x) \,\|\, p(z)\right) \qquad (1)$$

This lower bound is calculated above for a single data point $x$, hence the loss function is obtained by summing this over all data points in the dataset. The KL-divergence can be viewed as a regularizer and a constraint on the approximate posterior. The ELBO with the Gaussian assumption (with mean $\mu$ and standard deviation $\sigma$) for the latent and the approximated posterior distribution is given below:

$$\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x|z)\right] + \frac{1}{2}\sum_{j}\left(1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2\right)$$

where $j$ is the index of a latent variable in the latent representation $z$.

Recent research on the disentanglement of the latent representations shows that penalizing the KL-divergence term can help to achieve better disentanglement of latent variables. Concretely, a higher weight should be assigned to the KL-divergence term for latent variables to represent distinct features of the data. The weight factor is denoted by $\beta$ in the following:

$$\mathcal{L}(\theta, \phi; x, \beta) = \mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x|z)\right] - \beta\, D_{KL}\left(q_\phi(z|x) \,\|\, p(z)\right)$$

The $\beta$ value in the original VAE is equal to $1$, which would be multiplied by $\frac{1}{2}$ given the Gaussian assumption. It is argued in [higgins2017beta] that by choosing a $\beta > 1$ in the ELBO, more disentangled latent representations can be learned. However, higher disentanglement degrades the reconstruction accuracy. This highlights the intricate trade-off in the training of VAEs.
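To make the weighted objective concrete, the following is a minimal NumPy sketch of the negative, KL-weighted ELBO under the Gaussian assumption. It is an illustration under stated assumptions rather than the training code used in this work: the function names, the use of log-variance as the encoder output, and the squared-error reconstruction term (which corresponds to a Gaussian decoder with fixed variance) are all choices made here for the example.

```python
import numpy as np

def gaussian_kl(mu, log_var):
    """Closed-form KL(q || p) for q = N(mu, diag(exp(log_var))) and p = N(0, I).

    Returns one value per data point, summed over the latent dimensions.
    """
    return -0.5 * np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var), axis=-1)

def beta_vae_loss(x, x_hat, mu, log_var, beta=1.0):
    """Negative KL-weighted ELBO, averaged over the batch.

    Assumption: squared error stands in for the negative log-likelihood.
    beta=1.0 recovers the conventional VAE objective.
    """
    recon = np.sum((x - x_hat) ** 2, axis=-1)
    return float(np.mean(recon + beta * gaussian_kl(mu, log_var)))
```

Raising `beta` makes the KL term dominate, which is the mechanism behind the trade-off between disentanglement and reconstruction accuracy described above.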

3.2 Model-based versus Model-free Anonymization

The data collected by the sensors embedded in mobile and IoT devices must be processed locally or in the cloud to deliver services to consumers. Thus, an admissible anonymization technique should transform the data in such a way that the accuracy of desired inferences is maintained (i.e., high data utility), while the accuracy of unwanted and private attribute inferences is reduced significantly. We refer to the models used for making desired and unwanted inferences as the public and the private attribute inference models, respectively. The output of the public attribute inference model is a single public attribute class or a distribution over all public attribute classes. Similarly, the output of the private attribute inference model is a single private attribute class or a distribution over all private attribute classes. For example, in a fitness-tracking application, the public attribute inference model outputs an activity label while the private attribute inference model could assign the user to a particular age group. We assume that this application is not supposed to learn the age of the user.

To protect private and sensitive data against a wide range of private inference models rather than a select model, we propose the use of model-free anonymization. Unlike model-based anonymization techniques which use a specific private inference model in adversarial training, model-free anonymization techniques utilize the definition of the unwanted inference to transform latent space representations. This transformation should have imperceptible impact on data utility and promote privacy-preserving inferences.

The model-free anonymization technique which we propose in this work utilizes a deep generative model (i.e., a VAE) to learn and subsequently manipulate the latent representation of input data. The absence of adversarial training in our work makes it less susceptible to the re-identification attack presented in [hajihass2020latent]. This is one advantage of our model-free anonymization technique over its model-based counterparts. Through latent representation transformation, we ensure that the reconstructed data will have different private attribute class labels from the original data. This transformation can be either deterministic or probabilistic. We explain this idea in detail in the next section.

4 Methodology

In this section, we describe how we learn a useful representation for an embedding of time series data generated by a sensor using a VAE with a modified loss function. The choice of which attribute-specific VAE to use depends on the predicted public attribute class (i.e., the output of the public attribute inference model). We explain how this representation can be transformed given the predicted public attribute class and the predicted private attribute. These steps are illustrated in Figure 1.

4.1 A Modified Loss Function for VAE

The loss function we use in this paper builds on the original VAE’s loss function proposed in [kingma2013auto]. We modify this loss function by adding an extra term that corresponds to the classification error of the private attribute classifier, i.e., the cross-entropy between the true and predicted private attribute classes. Specifically, the encoder network is supplemented with a classification layer, $C$, to encourage learning representations that are more representative of the private attribute associated with the input data. We first introduce this loss function and then discuss why minimizing this function can result in a more effective anonymization. The augmented loss function can be written as:

$$\mathcal{L}(\theta, \phi; x, y) = -\mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x|z)\right] + \beta\, D_{KL}\left(q_\phi(z|x) \,\|\, p(z)\right) - \alpha \sum_{i} y_i \log C(z)_i \qquad (2)$$

where $z$ denotes the latent representation of the input data embedding $x$, $y$ denotes the true private attribute class label of that embedding (note that $y_i$ is $1$ if and only if $x$ belongs to the private attribute class $i$, and is $0$ otherwise), $C$ is the classification layer, and $\beta$ and $\alpha$ are the weights of the KL-divergence and classification terms, respectively.

The learned distribution over latent representations given $x$ can be a multivariate Gaussian or a Bernoulli distribution. In our case, we choose a multivariate Gaussian since we are dealing with real-valued data. Note that the first two terms in this loss function are the two terms in Equation (1). The only difference is the introduction of the $\beta$ weight factor for the Kullback–Leibler divergence term as explained in [higgins2017beta].

The main limitation of the $\beta$-VAE’s loss function for data anonymization is the inherent trade-off between the quality of the reconstructed data and the disentanglement of the learned latent representations. In general, lower $\beta$ values would yield better accuracy in the data reconstruction task (higher data utility), and higher $\beta$ values would train the VAE to generate more disentangled latent representations (lower private attribute inference model accuracy).

The best anonymization performance by a VAE is achieved when the data utility is the highest and the accuracy of the private inference model is the lowest. Thus, we need to tweak the loss function to have the highest data utility in the anonymized data (determined by the reconstruction loss and KL-divergence), while having as much disentanglement as possible (determined by KL-divergence) for the lowest private inference accuracy. As discussed in [higgins2017beta], there is a limit to the learning capacity of a conventional VAE’s loss function. Hence, to increase the anonymization capability of the trained VAE, we add the private-attribute classification loss to the ELBO. Specifically, we use the latent representation of the original input data as input to a single-layer neural network which infers the private attribute class of each data embedding. This neural network, the classification layer, is trained alongside the VAE’s encoder. In essence, the classification layer and the VAE’s encoder together form a classification network. Consequently, the learned latent variables become more representative of the private attributes in the data.

We use the cross-entropy loss, which is the distance between the predicted private attribute class of each anonymized data embedding and its ground-truth value. We create a simple classification layer that maps the latent representations generated by the VAE to the private attribute class labels of the corresponding input data entries, as illustrated in Figure 3. Thus, the addition of the classification loss to the loss function encourages the VAE to learn more anonymization-friendly representations.

We argue that adding the classification layer will force the probabilistic encoder to learn latent representations that are separable along the private attribute class labels. Our results confirm that the added term to the objective function improves the performance of the VAE in the anonymization task by introducing some structure and enforcing a clear separation between different classes in the latent space. We instantiate the classification layer as a single layer of neurons with a softmax activation function. In particular, this layer contains $K$ neurons, where $K$ represents the number of private attribute classes in the original dataset. The trainable set of weights used by the classification layer is denoted by $W$. Suppose each latent representation is a vector of $d$ latent variables, $z = (z_1, \dots, z_d)$. Thus, each $w_{ij}$ is the weight connecting input $z_i$ to neuron $j$. The output of the $j$-th neuron in the classification layer can be written as $o_j = \sum_{i=1}^{d} w_{ij} z_i$. The outputs of all the neurons go through a softmax activation to produce a probability distribution over private attribute classes given the input data: $\hat{y}_j = e^{o_j} / \sum_{k=1}^{K} e^{o_k}$.

The two hyperparameters in Equation (2), namely the weights of the KL-divergence term and the classification loss term, must be tuned for each VAE as discussed later. The VAE and the classification layer are depicted in Figure 3.

Figure 3: Variational Autoencoder with an additional classification layer.
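The classification layer and the augmented loss can be sketched in a few lines of NumPy. This is a hedged illustration, not the implementation used in this work: the names `W`, `beta`, and `alpha` are placeholders chosen here for the layer weights and the two hyperparameters, the layer is assumed to have no bias term, and the reconstruction error and KL-divergence are passed in as precomputed per-sample values.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / np.sum(e, axis=-1, keepdims=True)

def classification_head(z, W):
    """Single softmax layer: z is (batch, d), W is (d, K) -> (batch, K)."""
    return softmax(z @ W)

def cross_entropy(y_onehot, probs, eps=1e-12):
    """Per-sample cross-entropy between one-hot labels and predictions."""
    return -np.sum(y_onehot * np.log(probs + eps), axis=-1)

def augmented_loss(recon_err, kl, y_onehot, probs, beta, alpha):
    """Reconstruction + beta * KL + alpha * classification loss (batch mean).

    beta and alpha are hypothetical names for the two tunable weights.
    """
    ce = cross_entropy(y_onehot, probs)
    return float(np.mean(recon_err + beta * kl + alpha * ce))
```

Because the classification term is computed from the latent vector `z`, its gradient flows into the encoder, which is what pushes representations of different private classes apart in the latent space.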

4.2 Representation Learning with a VAE Customized for Each Public Attribute Class

We train a VAE for each public attribute class in our dataset. By having attribute-specific VAEs instead of just one general VAE, which learns latent representations for all input data regardless of their public and private attribute classes [hajihass2020latent], we break down the model into multiple models that are smaller in size. Each of these models is trained to reconstruct data for a given public attribute class.

One key advantage of using public attribute-specific VAEs is the reduction in the size of the model. It also allows for applying a higher disentanglement constraint (i.e., the weight $\beta$) in the training process. We get a -fold reduction in the model size and the number of trainable weights, from roughly million weights in the case of a general VAE to a total of million weights for all attribute-specific VAEs. Moreover, using parsimonious models enhances the anonymization performance when compared to [hajihass2020latent].

Since we have multiple public attribute-specific VAE models, it is necessary to predict the public and private attribute classes of a given data embedding at anonymization time. This information is used to determine which VAE must be selected for anonymization. We do this using the pretrained classifiers shown in Figure 1.

4.3 Transforming Latent Representations

Algorithm 1 shows different steps of the proposed model-free anonymization technique assuming that a VAE is trained already for each class of the public attribute. This algorithm operates on fixed-size embeddings of the input time series data. These embeddings are created by considering a window that contains a number of consecutive data points in the time series. After the first embedding, a new embedding is created after a certain number of new data points are received (determined by the stride length).
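The windowing step can be illustrated with a short NumPy sketch. The function name `make_embeddings` and its argument names are assumptions made for this example; the sketch only shows how a window length and a stride turn a multivariate time series into overlapping fixed-size embeddings.

```python
import numpy as np

def make_embeddings(series, window, stride):
    """Slice a (T, F) time series into (n, window, F) overlapping embeddings.

    A new embedding starts every `stride` samples; trailing samples that do
    not fill a complete window are dropped.
    """
    series = np.asarray(series)
    if len(series) < window:
        return np.empty((0, window) + series.shape[1:], dtype=series.dtype)
    starts = range(0, len(series) - window + 1, stride)
    return np.stack([series[s:s + window] for s in starts])
```

A stride smaller than the window length yields overlapping embeddings, so consecutive embeddings share most of their samples.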

Suppose we have a set of data embeddings, each with corresponding public and private attributes. The proposed anonymization algorithm takes as input an embedding along with the encoder and decoder parameters of the different VAEs, and the average latent representation for each pair of public attribute class and private attribute class. These average representations are calculated for the training dataset in the cloud or at the edge provided that the IoT device retains a copy of the training dataset.

In the next step, the algorithm loads the pretrained public and private attribute classifiers (not to be confused with the classification layer added to the VAE), which are used to identify the public attribute class $a$ and the private attribute class $s$ for each data embedding $x$. After inferring the public and private attribute classes, we load the encoder and decoder models for the inferred public attribute class $a$. The encoder part of this attribute-specific VAE encodes $x$ in a probabilistic manner. The corresponding latent representation, $z$, can be sampled from the $q_\phi(z|x)$ distribution. Once the representation is sampled, we change the inferred private attribute class label of $x$ via a simple function which we refer to as Modify. This function converts the inferred private attribute class label of $x$ from $s$ to an arbitrary private attribute class label denoted by $s'$.

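Data: data embedding $x$, average latent representations $\bar{z}_{a,s}$, autoencoder parameters for each public attribute class $a$, pretrained classifiers for public and private attributes
Result: anonymized data embedding $x'$
$(a, s) \leftarrow$ Classify($x$);
load the encoder and decoder of the VAE trained for public attribute class $a$;
$z \sim q_\phi(z|x)$;
$s' \leftarrow$ Modify($s$);
$z'$ = $z - \bar{z}_{a,s} + \bar{z}_{a,s'}$;
$x' \sim p_\theta(x|z')$;
Algorithm 1 Model-free anonymization with representation learning and transformation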
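The steps of Algorithm 1 can be sketched as a single Python function. This is an illustrative outline under stated assumptions, not the implementation evaluated in this paper: `classify`, `vaes`, `means`, and `modify` are hypothetical stand-ins supplied by the caller for the pretrained classifiers, the attribute-specific encoder/decoder pairs, the stored mean latent representations, and the Modify function.

```python
import numpy as np

def anonymize(x, classify, vaes, means, modify):
    """Run the anonymization steps once for a single embedding x.

    classify(x)    -> (a, s): predicted public and private classes.
    vaes[a]        -> (encode, decode) callables for public class a.
    means[(a, s)]  -> mean latent vector for that pair of class labels.
    modify(s)      -> target private class label.
    """
    a, s = classify(x)                     # infer both attribute classes
    encode, decode = vaes[a]               # pick the attribute-specific VAE
    z = encode(x)                          # sample a latent representation
    s_new = modify(s)                      # choose the target private class
    z_new = z - means[(a, s)] + means[(a, s_new)]  # shift between class means
    return decode(z_new)                   # synthesize the anonymized data
```

Keeping the models behind plain callables makes the control flow easy to check with toy functions before plugging in trained networks.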

The transformation of a latent representation involves a sequence of simple arithmetic operations. Consider a representation $z$ with public attribute $a$ and private attribute $s$, and let us denote the average of all latent representations with public attribute $a$ and private attribute $s$ by $\bar{z}_{a,s}$. We obtain the transformed representation of $z$, denoted $z'$, by subtracting $\bar{z}_{a,s}$ from $z$ and adding $\bar{z}_{a,s'}$ to the result, where $s'$ is the target private attribute. The probabilistic decoder, $p_\theta(x|z)$, takes $z'$ instead of $z$ to synthesize data. We refer to $\bar{z}_{a,s'} - \bar{z}_{a,s}$, which is the difference between the average of all representations with private attribute $s'$ and public attribute $a$ and the average of all representations with private attribute $s$ and public attribute $a$, as the transfer vector. Figure 4 illustrates the transfer vector in a three-dimensional latent space. The markers show only the latent representations of embeddings with public attribute $a$. Circles and squares are representations of data embeddings with private attribute class $s$ and class $s'$, respectively. The mean latent representation is shown as a cross in each case. Once the transfer vector is found, it can be applied to modify the private attribute of a given data embedding as described in Algorithm 1.

Figure 4: Overview of the model-free anonymization technique assuming a 3-dimensional latent space. Latent representations whose public attribute differs from the one considered are not shown in this figure.
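The mean latent representations and the transfer vector can be computed with a few NumPy operations. The sketch below is an illustration only; the function names and the dictionary keyed by (public class, private class) pairs are assumptions made for this example.

```python
import numpy as np

def class_means(latents, public, private):
    """Mean latent vector for every observed (public, private) label pair.

    latents is (n, d); public and private are integer label arrays of length n.
    """
    means = {}
    for a, s in set(zip(public.tolist(), private.tolist())):
        mask = (public == a) & (private == s)
        means[(a, s)] = latents[mask].mean(axis=0)
    return means

def transfer_vector(means, a, s, s_new):
    """Vector pointing from the mean of (a, s) to the mean of (a, s_new)."""
    return means[(a, s_new)] - means[(a, s)]
```

Adding the transfer vector to a latent representation is equivalent to subtracting the source class mean and adding the target class mean, so the offset of the point from its class mean is preserved.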

We note that the Modify function can be either deterministic or probabilistic. When the private attribute is binary, the deterministic modification converts one class label to the other one at all times. When the private attribute class is not binary, an arbitrary bijective function can be used. In the case of probabilistic transformation, a probabilistic modification function is used. Specifically, for each data embedding we decide whether to perform the mean manipulation based on a cryptographically secure stream of pseudorandom numbers. We use Python’s secrets module (https://docs.python.org/3/library/secrets.html), a cryptographically secure pseudorandom number generator (CSPRNG), to generate the random numbers.
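A minimal sketch of the two variants of Modify is given below. It is illustrative only: the cyclic shift is just one arbitrary bijection, and the parameter `p_change` (the probability of applying the modification) and its default value are assumptions made here, since the probability is not specified at this point in the text. The random draw uses `secrets.SystemRandom`, which is provided by the Python secrets module.

```python
import secrets

def modify_deterministic(s, num_classes):
    """Map each private class to a different one (cyclic shift, a bijection)."""
    return (s + 1) % num_classes

def modify_probabilistic(s, num_classes, p_change=0.5):
    """Apply the deterministic mapping with probability p_change.

    p_change is a hypothetical parameter; the draw comes from the operating
    system's cryptographically secure random source.
    """
    rng = secrets.SystemRandom()
    if rng.random() < p_change:
        return modify_deterministic(s, num_classes)
    return s
```

With two classes the deterministic variant simply flips the label, which matches the binary case described above.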

5 Evaluation Results

5.1 Datasets

To evaluate the efficacy of the proposed anonymization technique, we use two HAR datasets that are publicly available, namely MotionSense and MobiAct.

5.1.1 MotionSense

The MotionSense HAR time series dataset contains measurements of accelerometer and gyroscope sensors [malekzadeh19]. This dataset is collected from an iPhone 6s using the SensingKit framework [katevas2016sensingkit]. It contains data from 24 subjects (14 males and 10 females), each performing 15 trials that include 6 different activities.

These activities are climbing stairs up and down, walking, jogging, sitting, and standing. The subjects' age, height, and weight cover a wide range of values. The dataset is collected at a 50 Hz sampling rate and each sample contains 12 features: attitude (roll, pitch, yaw), gravity, rotation rate, and user acceleration, the last three in three dimensions (x, y, and z).

We use windows of 128 samples, each corresponding to time series data sampled over 2.56 seconds (128 samples at 50 Hz). These windows are moved with strides of 10 samples. This is the same configuration used in [malekzadeh19], which allows a fair comparison between the anonymization results. Of the activities mentioned earlier, standing and sitting have quite similar features. Since the phone is placed in the pocket of the subject's pants, the only distinction between the standing and sitting activities is the orientation of the smartphone in the pocket, which changes from vertical to horizontal. Hence, we combine both of these activity labels into one activity label. Moreover, we omit these two activity labels as they do not contain enough data for private attribute inferences. We use trials , , , , , and from this dataset to form our test set. We treat each of these activity classes as a public attribute class, , and refer to each gender group and in the data as the private attribute class, .
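The windowing configuration can be sketched in a few lines. The helper name `sliding_windows` and the synthetic signal are illustrative assumptions; only the window length (128), the stride (10), and the sampling rate (50 Hz) come from the text.

```python
def sliding_windows(samples, window=128, stride=10):
    """Return every full window of `window` samples, advancing by `stride`."""
    return [samples[start:start + window]
            for start in range(0, len(samples) - window + 1, stride)]


SAMPLING_RATE_HZ = 50

# Stand-in for 10 seconds of a single sensor channel sampled at 50 Hz.
signal = list(range(500))
windows = sliding_windows(signal)

window_seconds = len(windows[0]) / SAMPLING_RATE_HZ
print(len(windows), window_seconds)
```

Consecutive windows overlap by 118 samples, so each new embedding adds only 10 fresh samples to the previous one.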

5.1.2 MobiAct

The MobiAct dataset comprises smartphone sensor readings. It includes different falls in addition to various activities [vavoulas2016mobiact]. This dataset is larger than the MotionSense dataset both in terms of the number of activities and the number of participants. There are participants in the MobiAct dataset performing activities of daily living such as running, jogging, going up and down the stairs, and many more. Besides these activities, the subjects perform different types of falls. From the participants, we use only to create a relatively balanced dataset in terms of the number of female and male participants. Specifically, there are females (gender ) and males (gender ) in the subsampled dataset. Furthermore, we bin the recorded weights into three classes to showcase the ability of our anonymization technique to deal with a multi-class private attribute inference model. We label subjects that weigh less than or equal to  kg as , those who weigh between and  kg as , and those who weigh more than  kg as .
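The three-way weight binning can be sketched as below. The threshold values and the class indices 0, 1, and 2 are placeholders chosen purely for illustration; they are not the values used in the paper. The helper `weight_class` is likewise hypothetical.

```python
from bisect import bisect_left

# Placeholder bin edges in kg; NOT the thresholds used in the paper.
HYPOTHETICAL_THRESHOLDS_KG = [70.0, 90.0]


def weight_class(weight_kg, thresholds=HYPOTHETICAL_THRESHOLDS_KG):
    """Index of the bin that holds weight_kg; upper bin edges are inclusive."""
    return bisect_left(thresholds, weight_kg)


for weight in (55.0, 70.0, 82.5, 90.0, 104.0):
    print(weight, weight_class(weight))
```

Using `bisect_left` makes a weight equal to a threshold fall into the lower class, matching the "less than or equal to" wording for the first bin.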

From the large list of MobiAct activities, we select four activities for which all participants had representative data and sensor readings. These activities are walking, standing, jogging, and climbing stairs up (referred to as WAL, STD, JOG, and STU according to the activity labels in the dataset). Since not all the activities are performed for more than one trial, we use subject-based partitioning of the training and test data. We use of the available data for training and the remaining for test.
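A subject-based split keeps all embeddings of a given participant on the same side of the partition. The sketch below is one way to do this; the function name, the seed, and the 0.8 fraction in the usage lines are assumptions, and the fraction is applied to subjects here, which only approximates a fraction of the data when subjects contribute different amounts.

```python
import random


def split_by_subject(subject_ids, train_fraction, seed=0):
    """Return (train_idx, test_idx) so that no subject appears in both."""
    subjects = sorted(set(subject_ids))
    random.Random(seed).shuffle(subjects)
    n_train = round(train_fraction * len(subjects))
    train_subjects = set(subjects[:n_train])
    train_idx = [i for i, s in enumerate(subject_ids) if s in train_subjects]
    test_idx = [i for i, s in enumerate(subject_ids) if s not in train_subjects]
    return train_idx, test_idx


# One entry per embedding: the ID of the subject who produced it.
subject_ids = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
train_idx, test_idx = split_by_subject(subject_ids, train_fraction=0.8)
print(len(train_idx), len(test_idx))
```

Splitting by subject rather than by trial avoids evaluating the inference models on people they have already seen during training.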

5.2 Hyperparameters Tuning

We perform grid search to tune hyperparameters and . We consider a range of values for and for each dataset and choose the values that result in better anonymization performance. Concretely, these parameters should result in a lower average loss across all attribute-specific VAEs.

For the MotionSense dataset, we assign pairs of values to and , setting to , , , and , and to , , , and . We find that the best performance is achieved when . For the MobiAct dataset, we select the most suitable weights for and in the same fashion. In this case, our empirical results suggest that we can further increase the weight of KL-divergence, , to . We consider , , , and for , and , , , and for . The best anonymization performance is attained when for gender anonymization and for weight group anonymization.
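The grid search over the two loss weights can be outlined as follows. Everything specific in this sketch is hypothetical: the candidate values are placeholders rather than the values examined in the paper, and `toy_average_loss` merely stands in for training all attribute-specific VAEs with a given pair of weights and averaging their loss.

```python
from itertools import product


def grid_search(candidates_a, candidates_b, evaluate):
    """Return the (a, b) pair with the lowest score, and that score."""
    best_pair, best_loss = None, float("inf")
    for a, b in product(candidates_a, candidates_b):
        loss = evaluate(a, b)
        if loss < best_loss:
            best_pair, best_loss = (a, b), loss
    return best_pair, best_loss


def toy_average_loss(a, b):
    # Stand-in for: train the attribute-specific VAEs with weights (a, b)
    # and return their average loss.
    return (a - 2) ** 2 + (b - 0.1) ** 2


best_pair, best_loss = grid_search([1, 2, 3], [0.01, 0.1, 1.0], toy_average_loss)
print(best_pair, best_loss)
```

The loop evaluates every combination of the two candidate lists, so its cost grows with the product of their lengths; with four candidates per weight this means sixteen training runs per dataset.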

5.3 Anonymization Results: MotionSense

In this section, we present the results of applying our proposed anonymization technique to the MotionSense dataset. The public attribute inference model estimates the daily activity of each subject while the private attribute inference model aims to infer the gender identity of the subject (in this case male or female). We report the anonymization results using deterministic and probabilistic modification of the private attribute class. We use our previous work [hajihass2020latent] as a baseline, and compare the obtained results with the results reported in that work.

5.3.1 Anonymization with Deterministic Modification

To evaluate our technique with deterministic modification, we use the same architecture of the human activity recognition and gender inference models as our previous work [hajihass2020latent]. These are Multilayer Perceptron (MLP) neural networks which are discussed in depth in Section 6. By keeping the model architecture the same, our results can be better compared with [hajihass2020latent].

We evaluate our attribute-specific VAE models on the test set and report the results. First, we obtain the mean values of latent representations for each private attribute (gender) class and each public attribute (activity) class. These mean values are referred to as in Algorithm 1 and are estimated from data in the training set. We use these mean values to anonymize time series data in the test set according to the algorithm described earlier. Figures 5(a) and 5(b) show the accuracy of desired and sensitive inferences on data (from the MotionSense test set) anonymized by the proposed technique using a deterministic modification. As can be seen in Figure 5(b), our work outperforms the Anonymization Autoencoder (AAE) proposed in [malekzadeh19]. Note that the anonymization technique does not know the true private and public attribute class labels and must predict these labels in the deployment phase using pretrained models as depicted in Figure 1.

Figure 5: Accuracy of inference models on the anonymized MotionSense test set for two-class gender anonymization. (a) Activity inference. (b) Gender inference.

| Activity Class | Act. Before | Act. After | Gen. Before | Gen. After | No. of Embeddings |
|---|---|---|---|---|---|
| Down stairs | 95.58% | 89.73% | 87.67% | 26.25% | 1.9k |
| Up stairs | 93.19% | 93.47% | 90.86% | 17.03% | 2.5k |
| Walking | 98.71% | 98.49% | 95.16% | 15.34% | 6.2k |
| Jogging | 97.28% | 97.28% | 95.50% | 18.39% | 2.7k |

Table 1: Accuracy of activity and gender recognition models on the anonymized MotionSense test dataset. The results are reported separately for each activity label (first column). The number of data embeddings is specified for each activity (last column).

Table 1 shows the accuracy of the activity and gender identification models before and after the anonymization of the test set data for each activity class. These models are also used as the inference models in Algorithm 1 for predicting public and private attribute classes. We see that the deterministic anonymization achieves up to reduction in gender identification accuracy (from to ). These results are the weighted average of the inference accuracy according to the number of samples representing each activity. Moreover, the public attribute inference model is improved by roughly across different activities. Compared with our previous work [hajihass2020latent], the gender identification (i.e., the private attribute inference) accuracy is decreased from to on average, while the activity recognition (i.e., the desired inference) accuracy is increased noticeably thanks to the use of attribute-specific VAEs, which allow us to learn better representations in a highly imbalanced dataset.
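The weighting by the number of samples per activity can be reproduced from the rows of Table 1. The sketch below is an illustration only: the embedding counts in the table are rounded to the nearest hundred, so the averages it prints are approximate and need not coincide exactly with the figures reported in the text.

```python
# Rows of Table 1: (activity, gender accuracy before %, gender accuracy
# after %, number of embeddings in thousands).
TABLE_1 = [
    ("Down stairs", 87.67, 26.25, 1.9),
    ("Up stairs", 90.86, 17.03, 2.5),
    ("Walking", 95.16, 15.34, 6.2),
    ("Jogging", 95.50, 18.39, 2.7),
]


def weighted_average(values, weights):
    """Average of values where each value counts in proportion to its weight."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


counts = [row[3] for row in TABLE_1]
gender_before = weighted_average([row[1] for row in TABLE_1], counts)
gender_after = weighted_average([row[2] for row in TABLE_1], counts)
print(f"gender accuracy: {gender_before:.2f}% -> {gender_after:.2f}%")
```

Walking accounts for almost half of the embeddings, so its per-class accuracy dominates both averages.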

Despite the success of the deterministic mean manipulation technique, it is possible to re-identify the private attribute should the adversary have access to time series data with a known private attribute. The adversary can anonymize this data using the proposed technique and use the anonymized data along with the known private attribute to train a model that re-identifies the private attribute [hajihass2020latent]. This model can be either a convolutional or an MLP neural network. We assume that it has the same architecture as the gender identification model.

Suppose the attacker has access to of the anonymized data, which we sample randomly from the training set, and the corresponding private attribute. We observe that the adversary can achieve roughly accuracy in the gender re-identification task. We attribute this to the deterministic nature of modifications in our proposed anonymization technique.

5.3.2 Anonymization with Probabilistic Modification

We now show that by replacing the deterministic modification with a probabilistic modification, the re-identification attack can be prevented to a great extent. To demonstrate this, we assume the adversary has access to of the anonymized data along with the true value of the private attribute associated with this data. This is the same assumption we made in the previous section. We anonymize this data using the proposed anonymization technique with a probabilistic modification and train a re-identification model on the of the anonymized data. To account for the randomness that may arise from training, anonymization, and sampling processes, we consider independent runs. The average and standard deviation of these results are shown in Figure 8. It can be readily seen that the accuracy of the re-identification model reduces to .

We compare our results with the results of the model-based anonymization techniques proposed in [malekzadeh19, malekzadeh2017replacement]. We use the implementation of [pmc_malek] which combines these two model-based techniques to anonymize data. Our evaluation on the test set shows that these two techniques can jointly reduce the gender inference accuracy from to . However, after training the re-identification model on the data anonymized by these model-based techniques, the gender re-identification model achieves accuracy. These results support our claim that the model-based techniques cannot scrub sensitive information from data and simply learn to fool the models used in adversarial training of the AAE model.

5.4 Anonymization Results: MobiAct

We use the MobiAct dataset for two-class gender and multi-class weight group anonymization. We evaluate our anonymization technique on this dataset in addition to MotionSense because it allows us to study the case where the private attribute is non-binary. The public attribute inference model estimates the daily activity of each subject, while the private attribute inference model aims to infer the gender identity (2 classes) of the subject in one case and the weight group (3 distinct classes) of the subject in the other case. We use Convolutional Neural Network (CNN) models as our activity, gender, and weight group inference models. More details about these inference models are given in Section 6.

5.4.1 Anonymization with Deterministic Modification

The anonymization is first performed by modifying the private attribute class label deterministically. We first discuss the two-class gender anonymization results. Figures 6(a) and 6(b) show respectively the accuracy of the activity and gender inference models on the test set of MobiAct. Compared to our anonymization technique, AAE [malekzadeh19] performs worse, yielding a higher gender recognition accuracy. Note that the activity indices are different from the MotionSense dataset. For weight group anonymization, we modify weight group classes as follows: from to , from to , and from to . This is an arbitrary order and can be changed. The accuracy of the activity and weight group inference models on the test set is depicted in Figures 7(a) and 7(b), respectively. (Since we did not have the original implementation of AAE [malekzadeh19] for the MobiAct dataset, we built our own AAE model and tested it on the MobiAct dataset.)

Figure 6: Accuracy of inference models on the anonymized MobiAct test set for two-class gender anonymization. (a) Activity inference. (b) Gender inference.

Figure 7: Accuracy of inference models on the anonymized MobiAct test set for the ternary class weight group anonymization. (a) Activity inference. (b) Weight group inference.

Tables 2 and 3 show the results of the deterministic anonymization for the gender and weight group private attribute classes in the MobiAct dataset. In the case of the two-class gender anonymization, the activity detection accuracy drops by around while the gender inference accuracy decreases by (from to ). Turning our attention to the three-class weight anonymization, our results indicate that the activity detection accuracy drops only slightly (by ), while the weight group inference accuracy decreases by (from to ).

| Activity Class | Act. Before | Act. After | Gen. Before | Gen. After | No. of Embeddings |
|---|---|---|---|---|---|
| Walking | 98.00% | 87.57% | 99.58% | 10.53% | 42.9k |
| Standing | 99.52% | 99.67% | 95.45% | 31.98% | 43.2k |
| Jogging | 99.78% | 96.75% | 98.72% | 16.51% | 4.2k |
| Stairs up | 95.03% | 94.63% | 93.22% | 28.99% | 1k |

Table 2: Accuracy of the activity and gender inference in MobiAct test dataset gender anonymization. The results are reported separately for each activity label (first column). The number of data embeddings is specified for each activity (last column).

| Activity Class | Act. Before | Act. After | Weight Before | Weight After | No. of Embeddings |
|---|---|---|---|---|---|
| Walking | 98.16% | 89.15% | 97.48% | 22.41% | 42.9k |
| Standing | 99.58% | 99.75% | 85.98% | 37.93% | 43.2k |
| Jogging | 99.80% | 94.43% | 93.30% | 31.25% | 4.2k |
| Stairs up | 94.22% | 95.36% | 78.88% | 41.69% | 1k |

Table 3: Accuracy of the activity and weight inference in MobiAct test dataset weight anonymization. The results are reported separately for each activity label (first column). The number of data embeddings is specified for each activity (last column).

5.4.2 Anonymization with Probabilistic Modification

We now investigate the efficacy of model-free anonymization with a probabilistic modification using the separate evaluations of the re-identification model which we explained earlier. We use CNNs as inference models for the weight group and gender; these models are described in Section 6.

We first focus on the inference accuracy of a gender re-identification model. The average and standard deviation of the accuracy of the gender re-identification model on probabilistic anonymization are depicted in Figure 8 for independent runs. We have used of the original anonymized input to train the re-identification model. Next, we turn our attention to the weight group re-identification attack. We conduct the same study with probabilistic anonymization. The results are also shown in Figure 8 where the weight group re-identification model is trained for independent runs on the anonymized output. Results from both two-class gender and multi-class weight anonymization show that the anonymization technique with a probabilistic modification prevents the re-identification attack to a great extent.

Figure 8: The re-identification accuracy when data is anonymized using the proposed technique with probabilistic modification.

6 Implementation Details and Practical Considerations

Performing model-free anonymization at the edge is essential for real-world applications since users may not trust third-party servers to operate on their raw sensor data. However, running the proposed model-free anonymization is a compute-intensive task, which raises the question of whether it can be done in real time at the edge. To understand if the proposed anonymization technique can run in real time on a resource-constrained edge device, we evaluate the response time of the model-free anonymization algorithm on a Raspberry Pi 3 Model B.

There are two factors that determine the time budget we have for real-time execution of the proposed algorithm. First, the rate at which new sensor data becomes available; this can be the same as the sampling rate of the respective sensors. Second, the size of the data embedding that we use for anonymization. Should the actual running time exceed the time budget, data anonymization cannot be carried out in real time. The actual running time of the proposed anonymization technique is the sum of running times of the pretrained public and private attribute inference models, the probabilistic encoder, the probabilistic or deterministic transformation, and the probabilistic decoder. It also depends on the computation power of the edge device.

The sampling rate of sensors is 20 Hz in the MobiAct dataset. Hence, a new sample from the Inertial Measurement Unit (IMU) becomes available every twentieth of a second. We build data embeddings with windows of samples and strides of 10 samples, which means that after the first embedding, a new embedding is generated every half a second (500 milliseconds). Similarly, the sampling rate of sensors is 50 Hz in the MotionSense dataset. Given that we generate data embeddings with windows of 128 samples and strides of 10 samples, a new embedding becomes available every 200 milliseconds. This allows us to calculate the total time budget we have for real-time execution of our anonymization technique.
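The time budget follows directly from the stride and the sampling rate: once the first window is full, a new embedding is ready every stride / rate seconds. The short sketch below works this out; the MotionSense values (stride 10, 50 Hz) are those of Section 5.1.1, while the MobiAct values (stride 10, 20 Hz) are inferred from the "twentieth of a second" and "half a second" statements above. The function name is an illustrative choice.

```python
def embedding_interval_ms(stride_samples, sampling_rate_hz):
    """Time between two successive embeddings, in milliseconds."""
    return 1000.0 * stride_samples / sampling_rate_hz


motionsense_budget_ms = embedding_interval_ms(stride_samples=10, sampling_rate_hz=50)
mobiact_budget_ms = embedding_interval_ms(stride_samples=10, sampling_rate_hz=20)
print(motionsense_budget_ms, mobiact_budget_ms)
```

The window length does not enter the budget; it only determines how long the device waits before the first embedding is available.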

There is a small difference in the way that embeddings are generated for each dataset. For the inference models to achieve high accuracy in the MobiAct dataset, the value of each coordinate should be utilized instead of the magnitude of the sensor readings. Thus, when considering the data produced by gyroscope and accelerometer sensors, coordinate readings must be stored in each embedding rather than magnitude readings. As a result, the model size and the size of the VAEs in the MobiAct dataset are bigger than those in the MotionSense dataset. We note that the use of smaller-size public attribute-specific VAEs promotes real-time execution of our anonymization technique. A general VAE has 12 times more trainable weights than the total number of trainable weights of all attribute-specific VAEs.

(a) An attribute-specific VAE for MotionSense
(b) MLPs for activity and gender identification
Figure 9: Neural network architecture and the number of neurons in each layer for the MotionSense dataset.

We measure the execution time of the steps of Algorithm 1 on a Raspberry Pi 3 Model B. This includes the time to load the models and the time to run the public and private attribute inference models and the VAE models. Our evaluation shows that the running time of the probabilistic or deterministic transformation is several orders of magnitude smaller than that of the neural networks. Thus, we do not consider it in our calculations. We assume that the time to load the models and VAEs is constant and that they are loaded in memory only once.

6.1 MotionSense dataset

For the MotionSense dataset, it takes and milliseconds respectively to perform the public and the private attribute inferences for one input data embedding. It also takes a maximum of and milliseconds for the probabilistic encoder and decoder of the VAEs to run, respectively. Thus, in the worst case, it takes about milliseconds in total to anonymize one embedding. This result is shown in Table 4. Note that the maximum execution time is written in bold. Considering the 200-millisecond time interval between two successive embeddings, we ascertain that it is feasible to perform real-time data anonymization at the edge.

It is important to note that we considered the probabilistic encoder and decoder with the longest running time to calculate the total running time in the worst case. Also note that for the sake of comparison with [hajihass2020latent], MLP models are used for activity and gender identification. The MLP models for activity and gender identification and the VAE architecture are shown in Figure 9.

| Model | Batch Size | Type | No. of Embeddings | Time (s) | Time/Embedding (s) |
|---|---|---|---|---|---|
| Activity Recognition | 256 | MLP | 10,293 | 6.5 | 0.0006315 |
| Gender Recognition | 256 | MLP | 10,293 | 5.8940 | 0.0006964 |
| Prob. Encoder 0 | 256 | MLP | 1,730 | 2.1039 | 0.0012161 |
| Prob. Decoder 0 | 256 | MLP | 1,730 | 2.0714 | 0.0011973 |
| Prob. Encoder 1 | 256 | MLP | 2,079 | 1.3763 | 0.000662 |
| Prob. Decoder 1 | 256 | MLP | 2,079 | 1.4363 | 0.0006909 |
| Prob. Encoder 2 | 256 | MLP | 4,547 | 2.9123 | 0.0006405 |
| Prob. Decoder 2 | 256 | MLP | 4,547 | 2.9304 | 0.0006446 |
| Prob. Encoder 3 | 256 | MLP | 1,937 | 1.1988 | 0.0006189 |
| Prob. Decoder 3 | 256 | MLP | 1,937 | 1.2753 | 0.0006583 |

Table 4: The execution time of the gender anonymization algorithm on the MotionSense dataset.
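The worst-case estimate can be recomputed from the per-embedding times in Table 4 by adding the two inference times to the slowest encoder and the slowest decoder. The sketch below does this under the assumption that the four stages run one after another and that the (negligible) latent transformation is ignored; the variable names are illustrative.

```python
# Per-embedding running times in seconds, copied from Table 4.
INFERENCE = {"activity": 0.0006315, "gender": 0.0006964}
ENCODERS = [0.0012161, 0.000662, 0.0006405, 0.0006189]
DECODERS = [0.0011973, 0.0006909, 0.0006446, 0.0006583]

# Interval between successive MotionSense embeddings (stride 10 at 50 Hz).
BUDGET_MS = 200.0

worst_case_ms = 1000.0 * (
    sum(INFERENCE.values()) + max(ENCODERS) + max(DECODERS)
)
print(f"worst case: {worst_case_ms:.2f} ms of a {BUDGET_MS:.0f} ms budget")
```

Taking the maximum over the attribute-specific encoders and decoders gives an upper bound, since only one encoder and one decoder run for any given embedding.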

6.2 MobiAct dataset

We present the running times for weight group and gender anonymization separately. We first focus on the binary gender anonymization. It takes and milliseconds to obtain the public and the private attribute inference results for one data embedding, respectively. In the worst case, the probabilistic encoder of the VAE takes about milliseconds and its probabilistic decoder takes about milliseconds to process one data embedding. These numbers add up to milliseconds which is the total time needed to anonymize one data embedding.

Similarly, for weight group anonymization, it takes milliseconds to make the public inference and milliseconds to make the weight group inference. The probabilistic encoder of the VAE model takes milliseconds to process each embedding and the probabilistic decoder processes one embedding in milliseconds. Hence, it takes about milliseconds in total to anonymize one data embedding in the worst-case scenario. These results are shown in Tables 5 and 6 where the maximum execution time is written in bold in each case. Given that a new data embedding is generated every 500 milliseconds, we ascertain that data anonymization can be performed in real time at the edge. The CNNs for activity, gender, and weight classification, and the VAE architecture are depicted in Figures 10(a) and 10(b).

Figure 10: Neural network architecture and the number of neurons in each layer for the MobiAct dataset. (a) An attribute-specific VAE in MobiAct. (b) CNNs for detecting activity, gender, and weight group.

| Model | Batch Size | Type | No. of Embeddings | Time (s) | Time/Embedding (s) |
|---|---|---|---|---|---|
| Activity Recognition | 256 | CNN | 30,474 | 313.1836 | 0.0102771 |
| Gender Recognition | 256 | CNN | 30,474 | 314.0471 | 0.0103054 |
| Prob. Encoder 0 | 256 | MLP | 3,530 | 6.9214 | 0.0019607 |
| Prob. Decoder 0 | 256 | MLP | 3,530 | 4.5735 | 0.0012956 |
| Prob. Encoder 1 | 256 | MLP | 3,625 | 5.9676 | 0.0016462 |
| Prob. Decoder 1 | 256 | MLP | 3,625 | 3.5663 | 0.0009838 |
| Prob. Encoder 2 | 256 | MLP | 372 | 3.583 | 0.0096317 |
| Prob. Decoder 2 | 256 | MLP | 372 | 0.3594 | 0.0009661 |
| Prob. Encoder 3 | 256 | MLP | 94 | 3.9088 | 0.0415829 |
| Prob. Decoder 3 | 256 | MLP | 94 | 0.103 | 0.0010957 |

Table 5: The running time of each step of the gender anonymization algorithm in MobiAct.

| Model | Batch Size | Type | No. of Embeddings | Time (s) | Time/Embedding (s) |
|---|---|---|---|---|---|
| Activity Recognition | 256 | CNN | 7,619 | 84.9811 | 0.0111538 |
| Weight Group Recognition | 256 | CNN | 7,619 | 90.5697 | 0.0118873 |
| Prob. Encoder 0 | 256 | MLP | 3,577 | 7.5548 | 0.0014276 |
| Prob. Decoder 0 | 256 | MLP | 3,577 | 5.1064 | 0.0014276 |
| Prob. Encoder 1 | 256 | MLP | 3,612 | 6.2348 | 0.0017261 |
| Prob. Decoder 1 | 256 | MLP | 3,612 | 3.968 | 0.0010985 |
| Prob. Encoder 2 | 256 | MLP | 333 | 3.7621 | 0.0112975 |
| Prob. Decoder 2 | 256 | MLP | 333 | 0.3817 | 0.0011462 |
| Prob. Encoder 3 | 256 | MLP | 97 | 4.2115 | 0.0434175 |
| Prob. Decoder 3 | 256 | MLP | 97 | 0.1067 | 0.0011 |

Table 6: The running time of each step of the weight group anonymization algorithm in MobiAct.

6.3 Comparison with Model-based Anonymization Techniques

For the sake of comparison with related work, we study the running time of the activity and gender identification models built in [malekzadeh19], and the running time of the Replacement Autoencoder [malekzadeh2017replacement] and the Anonymization Autoencoder [malekzadeh19], which are state-of-the-art model-based anonymization techniques. We use the implementation that is available in an online repository [pmc_malek] and run it on our Raspberry Pi. In this implementation, each data embedding is passed through the Replacement Autoencoder before it is sent to the Anonymization Autoencoder.

We only discuss the results for the MotionSense dataset as these techniques are not originally used in [malekzadeh19] to anonymize the MobiAct dataset. Table 7 shows the performance results. The total running time of the anonymization pipeline is milliseconds per embedding of data, suggesting that our model-free technique can perform anonymization without significant overhead and is not computationally expensive when it is compared to the model-based techniques proposed in the literature.

| Model | Batch Size | Type | No. of Embeddings | Time (s) | Time/Embedding (s) |
|---|---|---|---|---|---|
| Activity Recognition | 128 | CNN | 13,873 | 50.26 | 0.0036229 |
| Gender Recognition | 128 | CNN | 8,356 | 32.72 | 0.0039157 |
| Replacement Autoencoder [malekzadeh2017replacement] | 128 | CNN | 13,873 | 62.23 | 0.0044857 |
| Anonymization Autoencoder [malekzadeh19] | 128 | CNN | 13,873 | 199.79 | 0.0144013 |

Table 7: The running time of model-based gender anonymization techniques on the MotionSense dataset.

6.4 Anonymization on Edge Devices versus Anonymization on Cloud Servers

Partitioning neural network models and offloading parts of the computation to the cloud can decrease the response time and energy consumption at the edge [jeong2018ionn, neurosurgeon]. In this section, we explore the possibility of sending the raw sensor data, latent space representations, or intermediate data to the cloud to reduce the response time of the anonymization application. Clearly sending the raw data to a remote server which performs anonymization can expose user data to an adversary and contribute to existing privacy concerns. Hence, no partitioning can happen prior to the transformation of latent space representations.

We argue that the only part of the model that can run in the cloud is the probabilistic decoder. Since the running time of the probabilistic decoder is around millisecond on a Raspberry Pi, it does not currently make sense to run it in the cloud as the propagation delay is usually much greater than the running time of the decoder on the edge device. Nevertheless, if an energy constraint is imposed on the edge device, we may have to partition the probabilistic decoder and offload computation. We plan to investigate this in future work.
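The argument against offloading reduces to a simple comparison: running the decoder remotely only pays off in response time if the network round trip plus the remote compute time is shorter than running it locally. The sketch below encodes that rule. The function name and the round-trip and cloud-side figures are hypothetical; the local figure of 1.2 ms is close to the largest per-embedding decoder time listed in Table 4.

```python
def should_offload_decoder(local_decoder_ms, round_trip_ms, cloud_decoder_ms=0.0):
    """True if running the decoder in the cloud shortens the response time."""
    return round_trip_ms + cloud_decoder_ms < local_decoder_ms


# Local decoder time near the Table 4 maximum; the 20 ms round trip is a
# hypothetical network delay, not a measured value.
print(should_offload_decoder(local_decoder_ms=1.2, round_trip_ms=20.0))
```

This rule considers response time only. An energy budget on the edge device would add a second criterion, which is the case left for future work.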

7 Conclusion

The number of IoT devices is estimated to surpass 30 billion worldwide by 2020. Many of these devices are installed in our homes and work environments, collecting large amounts of data which can reveal private aspects of our lives if analyzed using advanced machine learning techniques. Nevertheless, most users weigh privacy risks against the perceived benefits of IoT devices and are reluctant to adopt privacy-preserving techniques that greatly reduce these benefits. Thus, it is critical to develop privacy-preserving techniques that enable the users to utilize the available data to its fullest potential without compromising their privacy.

In this paper, we extended and improved the model-free anonymization technique which we originally developed in [hajihass2020latent]. We proposed the addition of the cross-entropy loss of a simple private attribute inference model to the loss function of attribute-specific VAEs. Through this modification, we achieved higher anonymization performance. The incorporation of a classification layer, which is trained simultaneously with the VAE, is the direct opposite of what model-based anonymization techniques do when they use adversarial training. In our approach, incorporating the multi-class cross entropy loss in the VAE loss function encourages finding useful latent space representations that better represent private attribute classes that exist in the original data. As a result, latent representations become more representative of private attribute classes and transforming these representations ensures higher data utility and improved anonymization performance.

The proposed model-free technique utilizes attribute-specific VAE models rather than a single VAE. This reduces the overall model size and supports real-time anonymization at the edge. We tested our model-free anonymization technique (with deterministic and probabilistic modifications) on two HAR datasets and corroborated that it outperforms baseline autoencoder-based anonymization techniques. Furthermore, it is not as vulnerable to the re-identification attack as the baseline methods.

Due to the growing importance of real-time anonymization on edge devices, we studied the feasibility of model-free anonymization on a Raspberry Pi 3 Model B. We showed that the proposed model-free technique is capable of meeting the time budget and can therefore be used in real time. Moreover, we studied the possibility and implications of performing anonymization at the edge and in the cloud. We found that partitioning neural network models does not make sense in terms of the response time of the anonymization application. We plan to further explore this in future work.

References