Data collection and sharing are growing to unprecedented volumes. Some of the reasons for this phenomenon include the decrease in storage cost, the rise of social networks, the ubiquity of smartphones and legal regulations. For example, in many states in the US, medical institutions are obliged to make demographic data about their patients public (NAHDO, 1996; Sweeney, 2002; OSHPD, 2014).
Warner (1965) argues that the lack of privacy guarantees can cause subjects to be reluctant to share their data with data collectors (such as doctors, government agencies, researchers, etc.) or even result in subjects providing false information. Therefore, subjects need to be assured that their privacy will be preserved throughout the whole process of data collection and use.
One of the emerging areas with growing interest in collecting sensitive personal and private data is health tele-monitoring. In this setting, a technology is used to collect health-related data about patients, which are later submitted to medical staff for monitoring. The data are then used to assess the health status of patients and provide them with feedback and/or intervention. Research indicates that such technologies can improve readmission rates and lower overall costs (Clark et al., 2007; Chaudhry et al., 2010; Inglis, 2010; Giamouzis et al., 2012; Aranki et al., 2014). In such scenarios, the collected data are usually of a sensitive nature from a privacy point of view, and therefore privacy-preserving technologies are needed in order to protect patients’ privacy and increase compliance.
There are multiple stages in the life-cycle of data, including i) the disclosure (or submission) of the data by the subjects to the data collector; ii) the processing of the data; iii) the analysis; and/or iv) the publishing of (often a privatized version of) the data or some findings based on them. In this paper we focus on the phase of disclosure of privacy-sensitive data by the data owners. Our framework for Private Disclosure of Information (PDI) is thus aimed to prevent an adversary from inferring certain sensitive information about the subject using the data that were disclosed during communication with an intended recipient. This is analogous to the problem of attribute linkage in statistical database privacy.
In traditional encryption approaches to maintaining privacy, it is often implicitly assumed that the data themselves are the private information. However, in more general scenarios, the data can be used to infer some private information about the subjects for which the data apply. For example, respiration rate by itself might not be considered private information. However, if the data from the collected respiration rate are used to infer whether the individual is a smoker or not, they become sensitive information. One can argue that because the information about whether someone smokes is private, the respiration rate data become private by implication.
Under such circumstances, one should attempt to privatize the transmitted data in a way that reveals as little as possible about the private information to an adversary. In summary, our objective is to encode the transmitted data in order to hide another private piece of information. In the words of Sweeney (2002): “Computer security is not privacy protection.” The converse is also true: privacy does not replace security. Our approach is therefore to be viewed as complementary to classical security approaches. For example, data can be privatized and then encrypted.
The rest of this paper is organized as follows. In Section 2 we survey the literature for related work. In Section 3 we motivate the problem and formulate it, followed by further analysis in Section 4. We then discuss implementation details of the learning problem in Section 5, followed by experimental results in Section 6. Finally, we close by discussing our conclusions and future research directions in Section 7.
2 Related Work
The study of privacy-preserving techniques and technologies in the fields of statistics, computer security and databases, and their intersections, dates back to at least 1965, when Warner proposed a randomization technique for conducting surveys and collecting responses for the purpose of statistical and population analysis. Since then, extensive privacy research has been conducted in these fields. Therefore, in the interest of brevity, we provide a brief overview of the areas of study related to our work and refer the reader to more comprehensive surveys in each area.
Recently, attention to privacy has been rising in the health-care domain with the spread of electronic health-records usage and the growing data sharing between medical institutions. It has been reported that consumers are expressing increasing concerns regarding their health privacy (Bishop et al., 2005; Hsiao and Hing, 2012). Most of the research in privacy from the health community focuses on medical data publishing and is therefore database-centric. For a survey of results in this domain, we refer the reader to (Gkoulalas-Divanis et al., 2014).
In more general-purpose scenarios, the privacy of statistical databases and data publishing has been extensively studied. Denning and Schlorer (1983) presented some of the early threats related to inference in statistical databases and reviewed controls that are based on the lattice model (Denning, 1976). Duncan and Lambert (1986, 1989) studied methods for limiting disclosure and linkage risks in data publishing. Sandhu (1993) provided a tutorial on lattice-based access controls for information flow security and privacy. Later, Farkas and Jajodia (2002) provided a survey of more results on access controls for the inference problem in database security. For rigorous surveys in the fields of data publishing privacy and statistical database privacy, we refer the reader to Adam and Worthmann (1989) and Fung et al. (2010).
Two semantic models of database privacy of growing interest in the privacy literature are $k$-anonymity (Sweeney, 2002) and differential privacy (Dwork, 2006, 2008). In $k$-anonymity, given a set of quasi-identifiers that can be used to re-identify subjects, a table is called $k$-anonymous if every combination of quasi-identifiers in the table appears in at least $k$ records. If a table is $k$-anonymous, assuming each individual has a single record in the table, then the probability of linking a record to an individual is at most $1/k$. Other extensions and refinements of $k$-anonymity have been proposed, including $\ell$-diversity (Machanavajjhala et al., 2007), $t$-closeness (Li et al., 2007) and others.
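The $k$-anonymity condition can be checked mechanically by counting how often each quasi-identifier combination occurs. The following is a minimal sketch; the function name `is_k_anonymous`, the toy table and its column names are hypothetical and are not taken from the cited works.

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """Return True if every quasi-identifier combination occurs in >= k records."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(n >= k for n in counts.values())

# Hypothetical toy table: generalized ZIP code and age band act as quasi-identifiers.
table = [
    {"zip": "947**", "age": "20-29", "diagnosis": "asthma"},
    {"zip": "947**", "age": "20-29", "diagnosis": "flu"},
    {"zip": "946**", "age": "30-39", "diagnosis": "flu"},
    {"zip": "946**", "age": "30-39", "diagnosis": "diabetes"},
]

two_anonymous = is_k_anonymous(table, ["zip", "age"], 2)
three_anonymous = is_k_anonymous(table, ["zip", "age"], 3)
```

Each combination appears exactly twice in this toy table, so it is 2-anonymous but not 3-anonymous.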
In differential privacy, the requirement is that the output of a statistical query should not be too sensitive to any single record in the database. Formally, given a statistical query $f$, $f$ is $\epsilon$-differentially private if for any two realizations $D_1$ and $D_2$ of the database such that $|D_1 \triangle D_2| = 1$ and all $S \subseteq \mathrm{Range}(f)$, $\Pr[f(D_1) \in S] \le e^{\epsilon} \Pr[f(D_2) \in S]$, where $D_1 \triangle D_2$ is the symmetric difference between $D_1$ and $D_2$ (Dwork, 2006, 2008). Cormode (2011) showed that sensitive attribute inference can be done on databases that are differentially private and on databases that are $\ell$-diverse with similar accuracy.
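As an illustration of the differential privacy requirement, consider a binary randomized response in the spirit of Warner (1965), viewed as a mechanism on a one-record database. The sketch below assumes that the true bit is reported with probability `p_truth` and flipped otherwise; this simplified variant and both function names are assumptions made for illustration only.

```python
import math

def randomized_response_epsilon(p_truth):
    """Smallest epsilon for which the assumed binary randomized response is epsilon-DP."""
    if not 0.5 <= p_truth < 1.0:
        raise ValueError("p_truth must lie in [0.5, 1)")
    # The worst-case ratio of output probabilities is p_truth / (1 - p_truth).
    return math.log(p_truth / (1.0 - p_truth))

def satisfies_dp(p_truth, epsilon):
    """Check Pr[f(D1) in S] <= e^eps * Pr[f(D2) in S] for all output sets S."""
    # dist[true_bit][reported_bit] for the two neighbouring one-record databases.
    dist = {0: {0: p_truth, 1: 1 - p_truth}, 1: {0: 1 - p_truth, 1: p_truth}}
    subsets = [(), (0,), (1,), (0, 1)]
    bound = math.exp(epsilon)
    return all(
        sum(dist[a][o] for o in s) <= bound * sum(dist[b][o] for o in s) + 1e-12
        for a in (0, 1) for b in (0, 1) for s in subsets
    )

eps = randomized_response_epsilon(0.75)  # equals ln(3)
```

With `p_truth = 0.75` the mechanism is $\ln 3$-differentially private, and the check fails for any smaller $\epsilon$.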
As can be seen from the review above, most of the research in data privacy is focused on privacy-preserving data publishing and privacy-preserving statistical databases. In contrast, in this work we focus on preventing adversarial statistical inference of a piece of private information from the messages that an individual discloses during communication.
3 Problem Formulation
We use the following shorthand notation for probability density (mass) functions. We always use a capital and a lowercase symbol of the same letter for a random variable and a realization of it, respectively. For simplicity and conciseness of notation, given random variables $X$ and $Y$, instead of writing $p_X(x)$ for the marginal density (mass) function of $X$ we simply write $p(x)$, and instead of writing $p_{X|Y}(x|y)$ for the conditional density (mass) function of $X$ given $Y$, we simply write $p(x|y)$.
3.2 Motivation and Threat Model
We are primarily motivated by the tele-monitoring setting. In this setting, a doctor wishes to monitor her patients remotely using a technology that can collect and transmit health-related data. The shared data are of a sensitive nature because they can be used to infer private pieces of information such as a health condition or a disease. For example, updates about a patient’s weight can lead to disclosure of obesity, as will be demonstrated in Section 6.
More generally, an information provider Bob wants to disclose a piece of information to some recipient Alice. Furthermore, the information can be used to infer some private information about Bob. However, there is no guarantee that the transmitted information will not be intercepted and potentially used for inference of the private information about Bob by an untrusted but passive eavesdropper Eve. Finally, in this setting, we assume that Alice is more certain about the private information than Eve is. The problem at hand is delivering the information under these circumstances such that Alice can make full use of the information but Eve’s ability to infer the private information about Bob, using the transmitted message, is minimized.
As a concrete example, consider the following scenario in health tele-monitoring. A patient Bob is trying to update his physician Alice about his weight and body mass index (BMI). (BMI is a measure of relative weight based on an individual’s mass and height, defined as $\text{mass}/\text{height}^2$ with mass in kilograms and height in meters.) Since Alice is Bob’s physician, she already knows the weight status category of Bob, which he considers to be private information. (Weight status category indicates if an individual is underweight, overweight, obese or has a healthy weight; this notion is presented formally in Section 6.) Eve, however, does not know Bob’s weight status category a priori but would like to learn it from the messages he sends to Alice. If Eve succeeds in listening in on the communication between Bob and Alice, Eve can, with some accuracy, infer the weight status category of Bob. Alice, being a considerate physician, wants to ensure the privacy of her patients. Alice decides to create an encoding scheme (that can be made public) for the communication such that the encoding is different per weight status group. Her objective is to make this encoding scheme “as privacy-preserving as possible” in the sense of keeping her patients’ weight status category information as private as possible from someone who does not know it a priori.
It is important to compare this scenario with the classical security approach. In classical security, the objective is to protect the transmitted message itself without taking into consideration an adversarial effort to statistically infer private information using the cipher-text. It has been demonstrated that statistical inference can still be performed on encrypted data (for example, White et al., 2011; Miller et al., 2014). We complement this by capturing the notion of statistical inference of the private information from the transmitted data, and aim to find a way to minimize the ability of an adversary to infer that private information using the transmitted data.
3.3 Problem Definition
Towards a more formal representation of the problem, we consider scenarios where i) Bob’s identity is attached to any message that is sent by him; ii) there is no guarantee that the sent information will not be intercepted by an untrusted but passive eavesdropper Eve; iii) the information can be used to infer some private information about Bob; and iv) Alice knows the private information about Bob but Eve does not. Under these assumptions, Bob would like to exploit the fact that Alice knows the private information but Eve does not in order to send a message that is more useful to Alice than to Eve. The utility value of the message follows the decoding (DECODING) and “hiding class” (HC) premises:
(DECODING) Alice can make full use of the sent information, i.e. obtain the original message from the transmitted message; and
(HC) Eve’s ability to make inference about the private information given Bob’s identity, based on the sent information, is minimized.
Formally, we use for the set of identifiers of information providers, for the information space and for the set of private classes (the private information about the information providers). Similarly, we define the random variables for the identifier of the information provider, for the piece of information that the provider would like to disclose, for the class that the provider belongs to and for the encoded message that will be sent (called privatized information), which is a function of the original information and the class. We call this function a privacy mapping function and define it as where is the set of injective functions . A simple way to think about a privacy mapping function is as an encoding scheme. That is, for every class, it outputs an encoding function for the input information. Since each encoding function is injective, there exists a left inverse, which is used to decode the messages sent from subjects in that class. (We say that $f^{-1}$ is a left inverse of a function $f$ if $f^{-1}(f(x)) = x$ for all $x$ in the domain of $f$.) From that, the original information is simply the left inverse applied to the privatized information. The statistical model that relates these random variables is described in Figure 1.
For conciseness, in this paper we treat the case of continuous information spaces. In the case of a discrete information space, the reader can follow the discussion by substituting probability mass functions for probability density functions in the corresponding distributions. Our treatment also covers the case of information spaces of a mixed nature (discrete in some attributes and continuous in others) by using the appropriate probability distribution functions.
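For the model in Figure 1, one needs to supply the following probability distributions: i) the prior of subjects transmitting messages in the system; ii) the adversary’s prior of class membership for the different subjects (based on auxiliary knowledge); iii) the generative model of data given a class and a subject; and iv) the distribution of the privatized information given the original information and the class. The last one is simple: it equals one if and only if the privatized information is the output of the class’s encoding function applied to the original information, and zero otherwise.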
Recall that the identity of the information provider is attached to the transmitted message. Moreover, the intended recipient knows the class of the information provider. Therefore, because of the injectivity requirement of the privacy mapping function, the intended recipient can decode the sent information back to the original message. Hence the requirement (DECODING) is satisfied.
Finally, in order to satisfy the second requirement (HC), we would like to find a privacy mapping function that minimizes the amount of information that the privatized information carries, for an adversary, about the private class given the subject identifier. We adopt the measure of (conditional) mutual information to model this quantity. We present the definition of conditional mutual information for continuous random variables, and refer the reader to Cover and Thomas (2006, Definitions 2.61 and 8.54) for the corresponding definitions concerning discrete random variables and random variables that can be mixtures of discrete and continuous, respectively.
Definition (cf. Cover and Thomas, 2006, Definition 8.49).
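Let $X$, $Y$ and $Z$ be random variables with a joint probability density function $p_{X,Y,Z}(x,y,z)$ and marginal probability density functions $p_X(x)$, $p_Y(y)$ and $p_Z(z)$, respectively. The conditional mutual information of $X$ and $Y$ given $Z$, $I(X;Y|Z)$, is defined as
$$I(X;Y|Z) \triangleq E_{p_{X,Y,Z}(x,y,z)}\left[\log \frac{p_{X,Y|Z}(x,y|z)}{p_{X|Z}(x|z)\, p_{Y|Z}(y|z)}\right]$$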
Intuitively, the conditional mutual information measures, in bits, the expected amount of mutual information that the random variables $X$ and $Y$ have, given the information in $Z$. (The units are bits assuming the base of the logarithm in the definition is 2.) Mutual information also provides a necessary and sufficient condition for conditional independence, as follows.
Lemma (cf. Cover and Thomas, 2006, Corollary 2.92 and Theorem 8.6.1).
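The conditional mutual information of the privatized information and the private class, given the identity of the information provider, is non-negative for any privacy mapping function. Furthermore, it equals zero if and only if the privatized information and the private class are conditionally independent given the identity of the information provider under that privacy mapping function.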
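The definition and the lemma can be illustrated in the discrete case, where densities are replaced by probability mass functions. The following is a minimal sketch assuming a joint probability table indexed as `p_xyz[x, y, z]` for generic random variables $X$, $Y$ and $Z$; the function name and the example distributions are hypothetical.

```python
import numpy as np

def conditional_mutual_information(p_xyz):
    """I(X;Y|Z) in bits for a joint pmf array indexed as p_xyz[x, y, z]."""
    p = np.asarray(p_xyz, dtype=float)
    p = p / p.sum()
    p_z = p.sum(axis=(0, 1), keepdims=True)   # p(z)
    p_xz = p.sum(axis=1, keepdims=True)       # p(x, z)
    p_yz = p.sum(axis=0, keepdims=True)       # p(y, z)
    mask = p > 0
    # p(x,y|z) / (p(x|z) p(y|z)) = p(x,y,z) p(z) / (p(x,z) p(y,z))
    ratio = (p * p_z)[mask] / (p_xz * p_yz)[mask]
    return float(np.sum(p[mask] * np.log2(ratio)))

# Case 1: X and Y are conditionally independent given Z, so the value is 0.
p_z = np.array([0.4, 0.6])
p_x_given_z = np.array([[0.2, 0.7], [0.8, 0.3]])  # rows: x, columns: z
p_y_given_z = np.array([[0.5, 0.1], [0.5, 0.9]])  # rows: y, columns: z
independent = np.einsum("xz,yz,z->xyz", p_x_given_z, p_y_given_z, p_z)
cmi_independent = conditional_mutual_information(independent)

# Case 2: Y is a copy of a fair bit X and Z is an independent fair bit,
# so I(X;Y|Z) = H(X|Z) = 1 bit.
copy = np.zeros((2, 2, 2))
for x in range(2):
    copy[x, x, :] = 0.25
cmi_copy = conditional_mutual_information(copy)
```

The first case reaches the lower bound of zero stated in the lemma, while the second shows a strictly positive value when the two variables are dependent given the third.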
From the intuition above, and the lemma above, we set our objective to find a privacy mapping function that minimizes the conditional mutual information of the privatized information and the private class given the identity of the information provider such that the model in Figure 1 holds. In short, the problem is to minimize this conditional mutual information over the privacy mapping functions, subject to every encoding function being injective and to the model in Figure 1.
Once a privacy mapping function is chosen, the communication process can be carried out as follows.
The transaction of disclosing a piece of information by an information provider belonging to a given class is performed by applying the encoding function of that class to the information and sending the result (or some encrypted version of it).
The transaction of receiving a piece of information sent by an information provider belonging to a given class is performed by applying a left inverse of the encoding function of that class to the received message.
Because of the injectivity requirement for the privacy mapping function and the assumption that the intended recipient knows the class to which the sender belongs, the process defined above allows the intended recipient to decode the transmitted message successfully, satisfying our first requirement in the problem definition.
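The disclose and receive transactions can be sketched as a pair of functions keyed by class. The example below assumes per-class affine encoders of the form $y = A_c x + b_c$ with invertible $A_c$ (affine maps are one choice of injective encoding); the class names, matrices and offsets are hypothetical and are not learned parameters.

```python
import numpy as np

# Hypothetical per-class encoders y = A_c x + b_c. Each A_c must be invertible
# so that the encoding is injective and a left inverse exists.
ENCODERS = {
    "healthy": (np.eye(2), np.zeros(2)),
    "obese": (np.array([[0.5, 0.0], [0.0, 0.8]]), np.array([3.0, -1.0])),
}

def encode(x, sender_class):
    """Disclose: apply the encoding function of the sender's class."""
    A, b = ENCODERS[sender_class]
    return A @ np.asarray(x, dtype=float) + b

def decode(y, sender_class):
    """Receive: apply the left inverse of the same class's encoding function."""
    A, b = ENCODERS[sender_class]
    return np.linalg.solve(A, np.asarray(y, dtype=float) - b)

x = np.array([31.0, 95.0])        # hypothetical (BMI, weight) reading
y = encode(x, "obese")            # message that is actually transmitted
x_alice = decode(y, "obese")      # recipient knows the class: recovers x
x_wrong = decode(y, "healthy")    # decoding under another class gives another point
```

A recipient who knows the sender's class recovers the original reading exactly, whereas decoding under a different class yields a different, equally well-formed point.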
Note that the problem in Section 3.3 is not a convex problem. Furthermore, it is of interest to study how to learn the model in Figure 1 and find an optimal privacy mapping function from data. We will address this question in Section 5, but first we further study the properties of the formulated framework in the following section.
4 Further Analysis
First, we relate the value of the objective function in Section 3.3 to Bayesian inference in the following lemma.
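If a privacy mapping function yields zero conditional mutual information between the privatized information and the private class given the subject identifier, then Bayesian inference of the private class based on the privatized information is prevented for the adversary.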
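From the lemma in Section 3.3 we know that the privatized information is conditionally independent of the private class given the subject identifier, which means that the adversary’s posterior over the class after observing the privatized information equals the prior of the class membership that the adversary already possesses. Therefore, the disclosure of the privatized information does not change the adversary’s belief regarding the private information given the subject identifier. ∎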
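The step from conditional independence to an unchanged belief can be written out with Bayes' rule. The symbols are assumed here for illustration: $y$ denotes the privatized information, $c$ the private class and $s$ the subject identifier, and the second equality uses the conditional independence $p(y \mid c, s) = p(y \mid s)$ wherever $p(y \mid s) > 0$.

```latex
\begin{align*}
p(c \mid y, s) &= \frac{p(y \mid c, s)\, p(c \mid s)}{p(y \mid s)} \\
               &= \frac{p(y \mid s)\, p(c \mid s)}{p(y \mid s)} \\
               &= p(c \mid s).
\end{align*}
```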
The next question that we need to ask is whether a privacy mapping function that yields zero conditional mutual information is ever attainable. There are three reasons for this question. First, if such a privacy mapping function exists, then it means that, once the subject identifier is known (it is always attached to the message), the privatized information provides no extra information to an adversary for inferring the private class, which sounds surprising. Second, there is generally a trade-off between information utility and privacy where optimal privacy is usually only attained at the cost of no utility (Dwork, 2006). In our case, the utility of the information to the intended recipient is always fully preserved, regardless of the choice of privacy mapping function, since the encoding function is injective for every class. From this it would seem that perfect privacy is unattainable. (We consider “perfect privacy” to mean that the adversary’s belief about the private class given the subject identifier does not change after observing the privatized information.) Finally, if such a privacy mapping function exists, it would assure optimality in the problem of Section 3.3. Fortunately (and somewhat unintuitively), such a mapping function can be attained, as shown in the following sequence of results.
If there exists a function such that for all and then
Using the result above, we prove the following theorem, which gives a sufficient condition for optimality in the problem of Section 3.3.
If there exists a function such that for all and then for all and . (The Kullback-Leibler divergence is defined as in Cover and Thomas, 2006, Definition 8.46: $D(f \,\|\, g) \triangleq \int f \log \frac{f}{g}$.)
Since then using Section 4 we know that . Therefore, for any and such that we get . This implies . ∎
If a privacy mapping function achieves for some function , for all and then is the optimal solution to Section 3.3.
The result follows from Section 4 and the fact that . ∎
The result above is a valuable tool for proving optimality of privacy mapping functions. Note that it is independent of the model of the adversary’s prior belief about class membership (and of the prior over information providers). This is a very important observation since it means that in cases where a privacy mapping function satisfies the condition of the theorem, modeling the adversary’s prior knowledge about information providers’ class memberships is not needed. Furthermore, such a privacy mapping function achieves perfect privacy against any adversary, regardless of her auxiliary knowledge. In the following theorems we provide examples of using this result that also serve as cases where such privacy mapping functions are attainable.
The proofs of the following theorems are similar to the proofs above and are thus omitted for conciseness.
(Gamma distribution with shape and scale parameters) for every and , then is an optimal solution to the problem in Section 3.3.
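A Gamma example can be checked numerically with a change of variables. The sketch below rests on assumptions that are not confirmed by the text: the shape parameter is taken to be common to all classes, the scale parameter is taken to depend on the class, and the class's encoding function is taken to be division by that scale. All names and parameter values are hypothetical.

```python
import math

def gamma_pdf(x, shape, scale):
    """Density of a Gamma(shape, scale) random variable at x > 0."""
    return x ** (shape - 1) * math.exp(-x / scale) / (math.gamma(shape) * scale ** shape)

def privatized_pdf(y, shape, scale):
    """Density of Y = X / scale for X ~ Gamma(shape, scale), by change of variables."""
    return scale * gamma_pdf(scale * y, shape, scale)

# Two hypothetical classes that share shape 3.0 and differ only in scale.
densities = [
    (privatized_pdf(y, 3.0, 2.0), privatized_pdf(y, 3.0, 7.5), gamma_pdf(y, 3.0, 1.0))
    for y in (0.5, 1.0, 4.0)
]
```

Under these assumptions the privatized density is Gamma with unit scale for every class, so it carries no information about the class.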
Given two vectors , we define as the vector such that . That is, is the element-wise division of over . If
(Continuous Uniform distribution) for every and , then is an optimal solution to the problem in Section 3.3.
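The element-wise division defined above suggests a rescaling of each class's support onto a common box. The sketch below assumes, without confirmation from the text, that each class is uniform on an axis-aligned box with class-dependent lower and upper corners and that the encoding subtracts the lower corner and divides element-wise by the side lengths. The class labels and bounds are hypothetical.

```python
import numpy as np

# Hypothetical (low, high) corners of each class's support.
BOUNDS = {
    "a": (np.array([10.0, 40.0]), np.array([20.0, 90.0])),
    "b": (np.array([25.0, 60.0]), np.array([45.0, 160.0])),
}

def encode_uniform(x, c):
    """Map x from the box of class c onto the unit box (element-wise division)."""
    low, high = BOUNDS[c]
    return (np.asarray(x, dtype=float) - low) / (high - low)

def decode_uniform(y, c):
    """Left inverse of encode_uniform for class c."""
    low, high = BOUNDS[c]
    return low + np.asarray(y, dtype=float) * (high - low)

mid_a = encode_uniform([15.0, 65.0], "a")    # centre of class a's box
mid_b = encode_uniform([35.0, 110.0], "b")   # centre of class b's box
```

Both classes are mapped onto the same unit box, so under the stated assumptions the privatized points are uniform on that box whatever the class.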
In this section, we briefly describe an implementation of the learning problem that is publicly available in the form of a MATLAB (https://www.mathworks.com/products/matlab/) toolbox (Aranki and Bajcsy, 2015). In this implementation, we investigate the question of learning a privacy mapping function from a labeled data set. This implies a simplifying assumption of ignoring the modeling of the random variable corresponding to the identity of the information providers. This assumption has the following implications on the model in Figure 1. First, it implies that the adversary views information providers as uniformly distributed. Second, the assumption implies that the subject-class membership belief function of the adversary is equal for all subjects. As discussed in Section 4, in the cases where perfect privacy is achievable, the solutions are independent of these models and therefore these implications are not limiting. Further study is necessary to assess the level of privacy degradation incurred by this assumption in cases of imperfect privacy. Third, this assumption implies that the generative model of data per class is independent of the subjects. Finally, the conditional mutual information in the objective simplifies to the mutual information between the privatized information and the private class.
In order to make the problem in Section 3.3 computationally tractable, a parametrized space for the privacy mapping functions can be introduced, allowing for the optimization to be performed on the parameter space. For example, consider the following parameter space
Then a parametrized space for affine privacy mapping functions on the classes set and information space of dimension can be defined as
Provided a parameter search space , the optimization problem in Section 3.3 can be re-written as
The straightforward way to model the required distributions from data is non-parametric, using high-dimensional histograms. This approach, while simple to implement, suffers from the curse of dimensionality, as its complexity grows exponentially with the dimension of the information space. Once these models are constructed, the model for the privatized information can be computed for any choice of parameters, allowing the computation of the objective function in Equation 1. Since the problem is non-convex, we optimize the objective function with a genetic algorithm whose fitness function equals the objective function in Equation 1. The chosen selection policy is fitness-proportional, and the chosen transformation (evolution/genetic) operators are mutations and crossovers (Banzhaf et al., 1998).
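The histogram-based objective can be sketched in one dimension. The example below is an illustrative Python stand-in, not the MATLAB toolbox: it assumes integer class labels, a scalar affine encoder per class, and synthetic Gaussian data, and it only evaluates the objective for two candidate parameter settings (the genetic search itself is omitted). All names are hypothetical.

```python
import numpy as np

def mutual_information_hist(y, c, bins=20):
    """Plug-in histogram estimate of I(Y; C) in bits (1-D y, integer labels c)."""
    edges = np.histogram_bin_edges(y, bins=bins)
    rows = [np.histogram(y[c == k], bins=edges)[0] for k in np.unique(c)]
    joint = np.array(rows, dtype=float)
    joint /= joint.sum()                      # joint pmf over (class, bin)
    p_c = joint.sum(axis=1, keepdims=True)    # class marginal
    p_y = joint.sum(axis=0, keepdims=True)    # bin marginal
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log2(joint[mask] / (p_c * p_y)[mask])))

def objective(scale, shift, x, c):
    """Estimated I(Y; C) after encoding each sample with its class's affine map."""
    y = scale[c] * x + shift[c]
    return mutual_information_hist(y, c)

# Synthetic data: two classes whose values differ only by a shift of 5.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 1.0, 2000), rng.normal(5.0, 1.0, 2000)])
c = np.repeat([0, 1], 2000)

mi_identity = objective(np.array([1.0, 1.0]), np.array([0.0, 0.0]), x, c)
mi_aligned = objective(np.array([1.0, 1.0]), np.array([0.0, -5.0]), x, c)
```

Encoding with the identity leaves the classes almost perfectly separable, whereas shifting the second class onto the first drives the estimate close to zero. A search procedure would compare candidate parameters by this value.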
In this section we walk the reader through an example that aims to motivate and demonstrate PDI. In this example we use data that are published by the Centers for Disease Control and Prevention (CDC) as part of the National Health and Nutrition Examination Survey of 2012 (https://wwwn.cdc.gov/nchs/nhanes/search/nhanes11_12.aspx). Specifically, we use the Body Measures (BMX_G) portion of the data (https://wwwn.cdc.gov/nchs/nhanes/2011-2012/BMX_G.htm).
In our setting, we consider the disclosed information to be both Body Mass Index (BMI) and weight. Our information providers are assumed to be individuals of both genders that are years of age or less. We consider the private information to be the weight status category of the subject. The CDC considers the following four standard weight status categories for the aforementioned age group i) underweight; ii) healthy weight; iii) overweight; and iv) obese. There are data points in the data set with subjects of years of age or less.
According to the definitions of the CDC, the BMI category of a child or a teen is classified based on the individual’s BMI percentile among the same age and gender group, as described in Table 1. Since the age of the information provider is not part of the information space, the inference of the weight status category of the information provider based on BMI and weight is not perfect. The data for the different classes are depicted in Figure 2.
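| Weight Category | BMI Percentile Range |
| Underweight | Less than the 5th percentile |
| Healthy weight | 5th percentile to less than the 85th percentile |
| Overweight | 85th percentile to less than the 95th percentile |
| Obese | Equal to or greater than the 95th percentile |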
6.1.1 Inference Based on Original Data
Using the data, we trained three SVM classifiers with Gaussian kernels. The classifiers are aggregate in terms of the “positive” class in the following sense. The first classifier treats the “positive” class as the underweight category (and so the “negative” class is the rest of the categories). The second classifier treats the “positive” class as either the underweight or healthy weight categories. Finally, the third classifier treats the “positive” class as any category except the obese category. We used a split for training-testing. In numbers, we used data points for training and data points for testing.
The training for all SVMs was done using -fold cross-validation among the data in the training set to pick the best of the Gaussian kernels and the best box boundaries of the classifiers. The classification phase is done by taking a majority vote from the classifiers, and the output is the class on which most classifiers agree. The results of the classifier are described in Table 2 in terms of the confusion matrix of the different categories. The total accuracy of the classifier is . (The adopted total accuracy measure is $\mathrm{tr}(M)/|T|$, where $M$ is the confusion matrix and $|T|$ is the cardinality of the test set. This is the percentage of true classifications over the test set.)
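The total accuracy measure is the share of test points that fall on the diagonal of the confusion matrix. A minimal sketch follows, assuming every test point is counted exactly once in the matrix; the function name and the example matrix are hypothetical and do not reproduce the reported results.

```python
import numpy as np

def total_accuracy(confusion):
    """Fraction of test points on the diagonal of the confusion matrix."""
    confusion = np.asarray(confusion, dtype=float)
    # The sum of all entries equals the size of the test set.
    return float(np.trace(confusion) / confusion.sum())

# Hypothetical 2-class confusion matrix (rows: ground truth, columns: predicted).
example = [[5, 1],
           [2, 12]]
accuracy = total_accuracy(example)  # (5 + 12) / 20
```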
6.2 Privatizing Information
We would like to privatize the information at hand (BMI and weight) in order to keep the weight status category as private as possible (based on the training set only). This scenario simulates a tele-monitoring scenario and fits the assumptions and motivation introduced in Section 3. Therefore, we aim to utilize PDI in order to privatize the data as discussed earlier. In order to learn the privacy mapping function from the training data, we use the MATLAB toolbox mentioned in Section 5 (Aranki and Bajcsy, 2015). We used the affine privacy mapping functions for the parameterized search space, as shown in the example in Section 5. Note that there are extra degrees of freedom in the problem, since any two privacy mapping functions that differ only by the same injective affine transformation applied to every class’s encoding function yield the same objective value in the problem of Section 3.3. Therefore, in our problem we fix the encoding function of the “underweight” class to the identity function.
[Table 2: confusion matrix of the classifier on the original data, indexed by ground truth category.]
The resultant privatized information is depicted in Figure 3. Given the decreased distinguishability between classes, it should be much harder to infer the weight category from these privatized data. Note that calculating the privatized information is simple and efficient since the parameters of the privacy mapping functions for the different classes are now known.
6.2.1 Inference Based on Privatized Data
In order to evaluate the quality of the privatization, we now train new SVM classifiers with the same training procedure as in Section 6.1.1, but this time using the privatized data (and, of course, encoding the test set too for evaluation). As before, we then use a majority vote from the classifiers to predict the class of any data point. The resultant confusion matrix is described in Table 3.
[Table 3: confusion matrix of the classifier on the privatized data, indexed by ground truth category.]
It is clear that the classification results are degraded after privatizing the information. The total accuracy dropped to (from ). Given that the data from different classes are highly indistinguishable, the classifier now classifies most data points as “healthy weight”. This is to be expected since most of the data points are in the “healthy weight” category. Informally, if a classifier had to make a “bet”, it would bet on the class with the most data points. Formally, a lower bound on the total accuracy can be achieved by considering the trivial (deterministic) classifier that always predicts “healthy weight”, which has total accuracy of . This shows that our result of is not far from a lower-bound guaranteed accuracy.
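The lower bound given by the trivial classifier is simply the relative frequency of the most common class. A minimal sketch follows; the function name and the label counts are hypothetical and do not reproduce the data set's actual class sizes.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy of the classifier that always predicts the most common label."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)

# Hypothetical test labels: 6 healthy weight, 3 obese, 1 underweight.
labels = ["healthy"] * 6 + ["obese"] * 3 + ["underweight"]
baseline = majority_baseline_accuracy(labels)
```

A classifier trained on privatized data that scores close to this baseline has learned little beyond the class proportions.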
Note that the data set is biased in size against the “underweight” category. There are only data points with weight category “underweight” out of the total data points (). This makes privatizing that class particularly hard, especially because the modeling is based on two-dimensional histograms and is not parametric. For this reason the classification results before and after privatization for the “underweight” category are comparable.
To intuitively demonstrate how privacy is preserved, we take a piece of privatized information at random from our data set, , without looking at its ground truth weight category. If we decode this data point using the decoding function of “healthy weight”, we get , which is a legitimate “healthy weight” BMI and weight data point. If we use the decoding function of “overweight”, we get , which is also a legitimate “overweight” BMI and weight data point. Similarly, if we use the decoding function of “obese”, we get , which is also a legitimate “obese” BMI and weight data point.
7 Discussion and Future Work
In this paper, we presented a view on privacy in which the data themselves need not be the private object, but rather can be used to infer private information. From this point of view, we derived a framework that protects the private information from being inferred from the communicated messages. We provided theoretical analysis and properties of the devised framework. An important result (Section 4) provided conditions that ensure perfect privacy while preserving full data utility. We showed that such conditions are achievable by providing closed-form solutions to some cases of data generative models. Section 4 further showed that perfect privacy is not a function of the modeling of the adversary’s auxiliary knowledge about the private information per subject. This observation is important because modeling an adversary’s auxiliary knowledge is generally a hard problem, and because it showed that perfect privacy can be achieved regardless of the adversary’s auxiliary knowledge. That is, the same privatization protects information providers from all adversaries, regardless of their auxiliary knowledge.
Subsequently, we discussed an implementation of the learning problem resulting from the framework and demonstrated its use with a data set published by the Centers for Disease Control and Prevention using data about individuals’ Body Mass Indices, weights and their weight status categories. The experimentation shows that after privatizing the data set, the classification accuracy drops significantly, to near a lower bound of guaranteed classification accuracy, thus achieving our set goal.
We make two important remarks about the approach presented in this paper. First, the described approach is philosophically different from classical cryptography, as it provides a model where the objective is maintaining the secrecy of private information that is not the data themselves but the information that can be inferred from the data. Second, even though the proposed approach is privacy-centric, it is not meant to serve as an alternative to cryptography but as a complement to it. That is, any message can be “privatized” and then encrypted. If the encryption is then compromised by an adversary who gets access to the clear-text message, the privacy is still preserved.
The current implementation of the devised learning problem suffers from the curse of dimensionality: the cost of learning grows exponentially with the number of dimensions of the information space. This is a result of our choice to model the underlying distribution as a multi-dimensional histogram. To make this framework practical, there is a need to study other ways of estimating the mutual information measure between the disclosed information and the private class. One appealing option is to leverage parametric learning and model each distribution as a mixture model, which could result in a more computationally efficient estimation of the mutual information measure.
The presented framework has the potential to be extended to scenarios where the data recipient is not completely certain about the private class but is still more certain than the adversary. Such scenarios are clearly more general and may widen the applicability of the framework to scenarios other than those presented here. In such scenarios, communicated messages can only be interpreted in a statistical sense, and the implications of such assumptions must be studied as well.
Furthermore, the current implementation of the learning problem assumes that adversaries have equal belief about all the information providers, so that the adversary’s belief about the private class is independent of the information provider, and that the generative model of data per private class is likewise independent of the information provider. This is a simplifying assumption and its implications need to be further studied and remedied.
Given the non-convexity and the complexity of the problem at hand, areas for future research include studying heuristic techniques to learn the privacy mapping functions from sufficient and/or necessary conditions for local improvements in the mutual information as a function of local changes in the privacy mapping functions. This approach, as opposed to finding globally optimal solutions to the optimization problem in Section 3.3, is analogous to finding minimal anonymization as opposed to optimal anonymization in privacy preserving data publishing (Fung et al., 2010).
We would like to thank Katherine Driggs Campbell for the initial conversation that spurred this idea. We are also greatly indebted to Gregorij Kurillo, Yusuf Erol and Arash Nourian for their fruitful discussions and feedback that significantly improved the quality of this paper. This work was supported in part by TRUST, Team for Research in Ubiquitous Secure Technology, which receives funding support from the National Science Foundation (NSF award number CCF-0424422).
- Adam and Worthmann (1989) Nabil R Adam and John C Worthmann. Security-control methods for statistical databases: A comparative study. ACM Computing Surveys (CSUR), 21(4):515–556, 1989.
- Aranki and Bajcsy (2015) Daniel Aranki and Ruzena Bajcsy. Private disclosure of information matlab toolbox, 2015. URL https://www.eecs.berkeley.edu/~daranki/PDI/.
- Aranki et al. (2014) Daniel Aranki, Gregorij Kurillo, Posu Yan, David Liebovitz, and Ruzena Bajcsy. Continuous, real-time, tele-monitoring of patients with chronic heart-failure - lessons learned from a pilot study. ICST, 11 2014. doi: 10.4108/icst.bodynets.2014.257036.
- Banzhaf et al. (1998) Wolfgang Banzhaf, Peter Nordin, Robert E Keller, and Frank D Francone. Genetic programming: An introduction, volume 1. Morgan Kaufmann Publishers, Inc., 1998.
- Bishop et al. (2005) Lynne Bishop, Bradford J Holmes, and Christopher M Kelley. National consumer health privacy survey 2005. California HealthCare Foundation, Oakland, CA, 2005.
- Chaudhry et al. (2010) Sarwat I Chaudhry, Jennifer A Mattera, Jeptha P Curtis, John A Spertus, Jeph Herrin, Zhenqiu Lin, Christopher O Phillips, Beth V Hodshon, Lawton S Cooper, and Harlan M Krumholz. Telemonitoring in patients with heart failure. New England Journal of Medicine, 363(24):2301–2309, 2010.
- Clark et al. (2007) Robyn A Clark, Sally C Inglis, Finlay A McAlister, John GF Cleland, and Simon Stewart. Telemonitoring or structured telephone support programmes for patients with chronic heart failure: Systematic review and meta-analysis. BMJ, 334(7600):942, 2007.
- Cormode (2011) Graham Cormode. Personal privacy vs population privacy: Learning to attack anonymization. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 1253–1261. ACM, 2011.
- Cover and Thomas (2006) Thomas M Cover and Joy A Thomas. Elements of information theory. John Wiley & Sons, 2 edition, 2006.
- Denning (1976) Dorothy E. Denning. A lattice model of secure information flow. Commun. ACM, 19(5):236–243, May 1976. ISSN 0001-0782. doi: 10.1145/360051.360056. URL http://doi.acm.org/10.1145/360051.360056.
- Denning and Schlorer (1983) Dorothy E. Denning and Jan Schlorer. Inference controls for statistical databases. Computer, 16(7):69–82, 1983.
- Duncan and Lambert (1989) George Duncan and Diane Lambert. The risk of disclosure for microdata. Journal of Business & Economic Statistics, 7(2):207–217, 1989. doi: 10.1080/07350015.1989.10509729. URL http://www.tandfonline.com/doi/abs/10.1080/07350015.1989.10509729.
- Duncan and Lambert (1986) George T Duncan and Diane Lambert. Disclosure-limited data dissemination. Journal of the American statistical association, 81(393):10–18, 1986.
- Dwork (2006) Cynthia Dwork. Differential privacy. In Automata, languages and programming, pages 1–12. Springer, 2006.
- Dwork (2008) Cynthia Dwork. Differential privacy: A survey of results. In Theory and Applications of Models of Computation, pages 1–19. Springer, 2008.
- Farkas and Jajodia (2002) Csilla Farkas and Sushil Jajodia. The inference problem: A survey. SIGKDD Explor. Newsl., 4(2):6–11, December 2002. ISSN 1931-0145. doi: 10.1145/772862.772864. URL http://doi.acm.org/10.1145/772862.772864.
- Fung et al. (2010) Benjamin Fung, Ke Wang, Rui Chen, and Philip S Yu. Privacy-preserving data publishing: A survey of recent developments. ACM Computing Surveys (CSUR), 42(4):14, 2010.
- Giamouzis et al. (2012) Gregory Giamouzis, Dimos Mastrogiannis, Konstantinos Koutrakis, George Karayannis, Charalambos Parisis, Chris Rountas, Elias Adreanides, George E Dafoulas, Panagiotis C Stafylas, John Skoularigis, et al. Telemonitoring in chronic heart failure: A systematic review. Cardiology Research and Practice, 2012, 2012.
- Gkoulalas-Divanis et al. (2014) Aris Gkoulalas-Divanis, Grigorios Loukides, and Jimeng Sun. Publishing data from electronic health records while preserving privacy: A survey of algorithms. Journal of biomedical informatics, 50:4–19, 2014.
- Hsiao and Hing (2012) Chun-Ju Hsiao and Esther Hing. Use and characteristics of electronic health record systems among office-based physician practices, United States, 2001-2012. US Department of Health and Human Services, Centers for Disease Control and Prevention, National Center for Health Statistics, 2012.
- Inglis (2010) Sally Inglis. Structured telephone support or telemonitoring programmes for patients with chronic heart failure. Journal of Evidence-Based Medicine, 3(4):228–228, 2010.
- Li et al. (2007) Ninghui Li, Tiancheng Li, and Suresh Venkatasubramanian. t-closeness: Privacy beyond k-anonymity and l-diversity. In IEEE International Conference on Data Engineering, volume 7, pages 106–115, 2007.
- Machanavajjhala et al. (2007) Ashwin Machanavajjhala, Daniel Kifer, Johannes Gehrke, and Muthuramakrishnan Venkitasubramaniam. L-diversity: Privacy beyond k-anonymity. ACM Trans. Knowl. Discov. Data, 1(1), March 2007. ISSN 1556-4681. doi: 10.1145/1217299.1217302. URL http://doi.acm.org/10.1145/1217299.1217302.
- Miller et al. (2014) Brad Miller, Ling Huang, Anthony D Joseph, and J Doug Tygar. I know why you went to the clinic: Risks and realization of https traffic analysis. arXiv preprint arXiv:1403.0297, 2014.
- National Association of Health Data Organization (1996) National Association of Health Data Organization. A guide to state-level ambulatory care data collection activities, October 1996.
- Sandhu (1993) Ravi S Sandhu. Lattice-based access control models. Computer, 26(11):9–19, Nov 1993. ISSN 0018-9162. doi: 10.1109/2.241422.
- State of California Office of Statewide Health Planning and Development (2014) State of California Office of Statewide Health Planning and Development. California inpatient data reporting manual, medical information reporting for California, 7th edition, September 2014.
- Sweeney (2002) Latanya Sweeney. k-anonymity: A model for protecting privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 10(05):557–570, 2002.
- Warner (1965) Stanley L Warner. Randomized response: A survey technique for eliminating evasive answer bias. Journal of the American Statistical Association, 60(309):63–69, 1965.
- White et al. (2011) Andrew M White, Austin R Matthews, Kevin Z Snow, and Fabian Monrose. Phonotactic reconstruction of encrypted voip conversations: Hookt on fon-iks. In Security and Privacy (SP), 2011 IEEE Symposium on, pages 3–18. IEEE, 2011.
Appendix A Implementation and Toolbox Details
In this appendix, we describe in more detail our MATLAB implementation, as a toolbox, of the learning procedure for Equation 1. The toolbox design was inspired by the structure and convenience of coding in CVX (cvx; gb08), a MATLAB convex modeling and optimization framework (http://cvxr.com/cvx/). Therefore, the toolbox adopts a similar coding structure and provides several “keywords” that users can use to define a PDI problem in much the same way that one would define a CVX problem. We present these keywords and their behavior in the following subsections. Readers familiar with CVX should find the similarities helpful in understanding our toolbox.
a.2 Data structures
The PDI problem
The data structure PDI_problem encapsulates the definition of the PDI problem at hand. This data structure holds all the necessary information about the parametrized search space, the constraints (if any) over the parameters of the search space, the data that will be used for modeling, the class definitions and the definitions of the dimensions of the information space at hand.
Two keywords were implemented to delimit a PDI problem definition block, namely PDI_begin and PDI_end. Any code that is related to the definition of the PDI problem at hand, as will be described in the following subsections, should be inserted between these two keywords. It is worth noting that nested PDI blocks are not allowed.
A PDI variable is the atomic object that the user can use to describe the parameters of the search space. For example, consider the search space of affine functions, $x \mapsto Ax + b$. If the user has several classes, then representing a privacy mapping function requires one set of parameters, a matrix $A_c$ and a vector $b_c$, for every class $c$. Together these parameters describe the complete privacy mapping function, which maps $x$ to $A_c x + b_c$ for every class $c$. PDI variables are implemented in the class PDI_variable to represent objects like $A_c$ and $b_c$.
For instance, think of $A_c$ and $b_c$ as PDI variables for each class $c$. The toolbox provides the keyword PDI_var that allows users to define variables for the parameterization of the differential mapping functions. The syntax for using this keyword is as follows: PDI_var varName(n, m), which declares a PDI variable of size $n \times m$ with the name varName. For convenience, PDI_var varName is shorthand for PDI_var varName(1,1). Currently, a PDI variable can be either a vector or a matrix.
PDI expressions and constraints
In order to allow the representation of constraints over PDI variables (both convex and non-convex constraints), we implemented a data structure for PDI expressions. A PDI expression holds information about a mathematical expression that involves PDI variables. For example, the expression var1(1:2,[2 3])^2 - 3, where var1 is a PDI variable (say of size $2 \times 3$ or larger), is a PDI expression. PDI expressions are implemented in the class PDI_expression.
PDI expressions serve two main purposes. First, objects of the type PDI_expression can hold (potentially long) mathematical expressions involving PDI variables so that they can be used repeatedly. Second, and most importantly, PDI expressions are the building blocks for defining constraints over PDI variables. For example, if one of our PDI variables is a matrix A that we want to have a determinant equal to one, then by writing the line of code det(A) == 1 inside the PDI problem block, the PDI engine will create a PDI expression for det(A) - 1 and add a constraint on that expression (to be equal to 0) to the PDI_problem object representing the PDI problem. These constraints are later passed to the learning engine so that it finds a feasible solution according to the user-defined constraints.
Since our problem is non-convex, there is no requirement for the constraints to be linear or even convex. However, since treating linear constraints is in general more efficient than treating non-linear constraints, linear constraints are tagged in the PDI_problem object so that they are passed separately to the learning engine for more efficient computation.
For visual convenience and for ease of reading the code, the keyword subject_to is provided to mark the beginning of the constraints block. The keyword is a void keyword that does nothing other than holding a line of code that makes the code look nicer.
a.3 Using the toolbox
The toolbox provides more keywords that can be used in the PDI problem definition block. The toolbox allows users to declare dimensions of the information space using the keyword PDI_dimension. As will be seen later, the current implementation models data and classes in a non-parametric way by binning the data into multi-dimensional histograms, so the keyword also allows the user to define the bins to be used for modeling in that dimension. For example, PDI_dimension weight 0:5:20 declares a dimension with the name “weight” and with bins 0:5:20.
Another building block of a PDI problem is a class (the private information). In order to declare a class, the keyword PDI_class can be used. For example, PDI_class male and PDI_class female declare the two classes “male” and “female”. Class names are one-word strings (no spaces). For convenience, we allow the declaration of multiple classes in one call of PDI_class by delimiting different class names by spaces. For instance, the line PDI_class male female will declare both classes “male” and “female” in one line.
Sometimes, it is helpful to have a constant-like keyword that returns the total number of dimensions of the information space declared in the problem. This is useful, for example, when defining a variable that has the same dimension as the information space. For that, the keyword PDI_nrdimensions is provided. Similarly, it is useful to have a constant-like keyword that returns the total number of classes declared in the problem. For that, the keyword PDI_nrclasses is provided. For example, the line PDI_var b(PDI_nrdimensions, PDI_nrclasses) will declare a PDI variable with one row per dimension of the declared information space and one column per declared class in the PDI problem.
In order to provide the data for the learning procedure (data per class), the keyword PDI_datapoints is provided. For example, PDI_datapoints male male_data provides the data stored in the variable male_data as coming from the class “male” to the learning procedure. The convention we use is that a single data point is a column vector, so that in the example above male_data is expected to have one row per dimension of the information space and one column per data point provided.
| Keyword and syntax | Description |
|---|---|
| PDI_begin | Begin a PDI problem definition block |
| PDI_dimension <dimension name> <dimension bins> | Declare a new dimension of information |
| PDI_class <class1> <class2> … | Declare new classes of information providers |
| PDI_datapoints <class> <data expression> | Provide data points from a class for the learning procedure |
| PDI_var <var 1>[(n1, m1)] <var 2>[(n2, m2)] … | Declare new PDI variables |
| PDI_reference <R(fv, cN)> <expression> | Provide the (parametrized) differential mapping function |
| PDI_nrdimensions | Returns the number of dimensions defined in the PDI problem |
| PDI_nrclasses | Returns the number of classes defined in the PDI problem |
| PDI_end | End a PDI problem definition block |
Structure of a PDI program
We now describe the general rules that need to be satisfied in order to properly write a PDI program. A PDI program always starts with the keyword PDI_begin, followed by the PDI problem definition, and always ends with the keyword PDI_end. In order to maintain consistency of the data (in terms of the dimensions), it is assumed that all dimensions are declared before the first PDI_datapoints keyword is invoked. That is, once any data point is provided, the information space dimensions are locked and cannot be edited any further. The reason is that the toolbox checks that the dimensions of the provided data points are consistent with the declared dimensions of the information space, and it therefore assumes that all dimensions are declared beforehand. Failing to do so will result in an error thrown by the toolbox and the computation will be terminated.
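The consistency check described above can be illustrated with a minimal sketch. It is written in Python rather than in the toolbox's MATLAB, and the function name `check_datapoints` and its arguments are hypothetical; the sketch only assumes the stated convention that each data point is a column, so the data have one row per declared dimension.

```python
def check_datapoints(data, declared_dimensions):
    """Validate data given as a list of rows (one row per dimension, one column per data point).

    Hypothetical helper: raises ValueError when the number of rows does not
    match the number of declared dimensions, mirroring the kind of check the
    text describes. Returns the number of data points on success.
    """
    if len(data) != len(declared_dimensions):
        raise ValueError(
            "expected %d rows (one per declared dimension), got %d"
            % (len(declared_dimensions), len(data))
        )
    n_points = len(data[0]) if data else 0
    if any(len(row) != n_points for row in data):
        raise ValueError("every dimension must have one entry per data point")
    return n_points


# Two declared dimensions and three data points (illustrative values only).
dimensions = ["weight", "bmi"]
n_points = check_datapoints([[60.0, 72.0, 80.0], [20.0, 24.0, 31.0]], dimensions)
```

Locking the dimensions after the first data point, as the text requires, means such a check can be run once per call with a fixed list of declared dimensions.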
Another rule concerning PDI variables and constraints is as follows. It is assumed that all PDI variables are declared before providing the parametrized reference function, i.e. before calling PDI_reference (to be introduced). It is also assumed that all PDI variables are declared before adding any constraints to the problem definition. The reason for the latter requirement is that a linear constraint is represented by the coefficients used to create it. For instance, if x1, x2 and x3 are PDI variables of size $1 \times 1$ each (for simplicity), the constraint 5 * x1 + 7 * x2 <= 11 is represented as the vector [5 7 0 -11]. In general, in order to map from a linear inequality (<=) constraint vector v to the symbolic representation, the following translation is used: VARS * v(1:end-1) + v(end) <= 0. Hence, in order to keep the constraints consistent (in terms of dimensions), the number of variables should be known before any constraint is represented. Note that this last rule is a consequence of the current implementation and can later be relaxed by fixing any existing constraints every time a new variable is declared. This can be done by adding zeros at the entries corresponding to the new variables in all existing linear constraint vectors. We note, however, that this rule isn’t very limiting and therefore doesn’t degrade the functionality of the toolbox.
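The vector representation of linear constraints can be sketched as follows. This is an illustrative Python sketch, not the toolbox's MATLAB code; all function names are hypothetical, and the variables are assumed to be scalars as in the example above.

```python
def linear_constraint_vector(coeffs, bound, var_names):
    """Encode sum(coeffs[name] * name) <= bound as [c_1, ..., c_n, -bound]."""
    return [float(coeffs.get(name, 0.0)) for name in var_names] + [-float(bound)]


def is_satisfied(v, values):
    """Evaluate VARS * v(1:end-1) + v(end) <= 0 for concrete variable values."""
    return sum(c * x for c, x in zip(v[:-1], values)) + v[-1] <= 0


def pad_for_new_variable(v):
    """Possible relaxation from the text: add a zero coefficient for a new variable."""
    return v[:-1] + [0.0] + [v[-1]]


var_names = ["x1", "x2", "x3"]
# 5 * x1 + 7 * x2 <= 11; x3 does not appear, so its coefficient is zero.
v = linear_constraint_vector({"x1": 5, "x2": 7}, 11, var_names)
# After declaring a fourth variable, the existing constraint keeps its meaning.
v_padded = pad_for_new_variable(v)
```

Keeping the constant term as the last entry means that padding only touches the coefficient part of the vector, which is why the relaxation mentioned in the text is cheap to implement.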
The easiest way to remember these rules is by simply using the following order of things. 1) Start a PDI problem definition block by stating PDI_begin; 2) declare all dimensions; 3) declare all classes and provide data for each class; 4) declare all PDI variables; 5) declare the reference function; 6) add all needed constraints; and 7) close the PDI problem definition block by using PDI_end.
a.4 The engine
The PDI engine is the entity that performs the learning of the parameters in the parameter space so that the mutual information between the differential information and the class is minimized. In order to describe the engine, we will describe the way a user can declare the parametrized search space of differential mapping functions. For that, the keyword PDI_reference is used to declare the parametrized differential mapping function space. To best explain this keyword, we use the following example.
Consider the space of affine functions and a situation where we have several classes. First, we have to declare a matrix $A_i$ and a vector $b_i$ for each class $i$. The vectors $b_i$ can be stacked into a matrix b with $N$ rows and one column per class, where $N$ is the number of dimensions of the information space, such that the column b(:,i) is the vector corresponding to class $i$. The matrices $A_i$ are represented as a matrix A with $N^2$ rows and one column per class, such that the column A(:,i) is the flattened matrix corresponding to class $i$, so that reshape(A(:,i), N, N) is our $A_i$. This can be done by declaring PDI_var A(PDI_nrdimensions^2,PDI_nrclasses) and PDI_var b(PDI_nrdimensions,PDI_nrclasses). Having these, we can write
In the code above, @(xs, classN) is used to define the function parameters, where the first one, xs, represents the data points passed to the function (several data points are submitted in one call as a matrix with one row per dimension and one column per data point) and classN is the class number (a single value). Each call to the reference function will include data from only one class.
Note that i) the reference function is assumed to be vectorized with respect to the first parameter, i.e. a matrix of data points will be passed to it in each call; and ii) the reference function body can include a call to an external function, i.e., users can design their own reference functions as regular MATLAB functions and use them in the function expression of PDI_reference.
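To make the flattened-matrix layout and the vectorization requirement concrete, the following is a minimal Python sketch of an affine reference function. It is not the toolbox's MATLAB syntax; the names `reshape_column_major` and `affine_reference` are hypothetical, and class indices are zero-based here, unlike MATLAB's one-based indexing.

```python
def reshape_column_major(flat, n):
    """Mirror MATLAB's reshape(flat, n, n): fill the n-by-n matrix column by column."""
    return [[flat[col * n + row] for col in range(n)] for row in range(n)]


def affine_reference(xs, class_index, A_cols, b_cols, n):
    """Apply x -> A_i x + b_i to every data point (column) of xs for one class.

    xs     : list of n rows, one column per data point
    A_cols : A_cols[i] is the flattened n*n matrix of class i
    b_cols : b_cols[i] is the length-n offset vector of class i
    """
    A = reshape_column_major(A_cols[class_index], n)
    b = b_cols[class_index]
    n_points = len(xs[0])
    return [
        [sum(A[r][k] * xs[k][p] for k in range(n)) + b[r] for p in range(n_points)]
        for r in range(n)
    ]


# One class, two dimensions, two data points: (1, 0) and (0, 1).
A_cols = [[1.0, 2.0, 3.0, 4.0]]  # unflattens to [[1, 3], [2, 4]]
b_cols = [[10.0, 20.0]]
mapped = affine_reference([[1.0, 0.0], [0.0, 1.0]], 0, A_cols, b_cols, 2)
```

Because the two data points are the unit vectors, each output column is a column of the unflattened matrix plus the offset, which makes the column-major convention easy to verify by eye.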
The first step in the learning procedure is to prepare a prior for the class using the data provided by the user. This is modeled directly from the data as a histogram of data-class memberships. The next step in the engine is to represent an optimization objective function given the definition of a parameterized differential mapping function. Recall that the PDI_reference definition uses objects of type PDI_variable and note that it is provided to the toolbox as a string. For these two reasons, a compilation step is performed in order to translate the string representation of the reference function into MATLAB code and to resolve any PDI variables used in the function expression to their corresponding entries in the optimization vector of parameters that will later be passed to it by the optimizer. Once this step is done, we get a reference function that can be invoked with concrete values in place of the PDI variables used. This function is then used to calculate the distribution of the differential information per class, by translating the original information to the differential information using the compiled function and then modeling the resulting distribution as an $N$-dimensional histogram, where $N$ is the number of dimensions of the information space. From this distribution and the class prior, the mutual information can be calculated, and its value is returned as the objective value.
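The histogram-based objective can be sketched in a few lines. The following Python sketch is an illustration under simplifying assumptions, not the toolbox's implementation: the bins are given as sorted edges per dimension, points falling outside the edges are not treated specially, and the function names are hypothetical. It computes a plug-in estimate of the mutual information between the binned (already mapped) data and the class.

```python
import math
from bisect import bisect_right
from collections import Counter


def bin_index(point, edges_per_dim):
    """Map a data point to a tuple of bin indices, one per dimension."""
    return tuple(bisect_right(edges, x) - 1 for x, edges in zip(point, edges_per_dim))


def mutual_information(points_per_class, edges_per_dim):
    """Plug-in estimate (in nats) of the mutual information between bin and class."""
    joint = Counter()
    for label, points in points_per_class.items():
        for point in points:
            joint[(bin_index(point, edges_per_dim), label)] += 1
    total = sum(joint.values())
    bin_counts, class_counts = Counter(), Counter()
    for (b, label), count in joint.items():
        bin_counts[b] += count
        class_counts[label] += count
    # Sum of p(b, c) * log(p(b, c) / (p(b) * p(c))) over observed cells.
    return sum(
        (count / total) * math.log(count * total / (bin_counts[b] * class_counts[label]))
        for (b, label), count in joint.items()
    )


edges = [[0, 5, 10, 15, 20]]  # one dimension with bin edges as in 0:5:20
separated = {"a": [(1,), (2,)], "b": [(11,), (12,)]}
overlapping = {"a": [(1,), (11,)], "b": [(2,), (12,)]}
mi_separated = mutual_information(separated, edges)
mi_overlapping = mutual_information(overlapping, edges)
```

When the classes occupy disjoint bins, the bin reveals the class and the estimate equals the entropy of the class prior; when the classes are spread identically over the bins, the estimate is zero, which is the situation the learning procedure aims for.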
Using the objective function described above, the engine runs optimizers from the optimization toolboxes of MATLAB in order to find a set of parameters (PDI_variable’s) that yield an optimal (minimum) mutual information measure between the differential information and the class. By default, the engine first runs a genetic algorithm, ga(), and then uses its output to initialize a gradient-based optimizer, fmincon(). That is, the engine solves the optimization problem in subsection 3.3 with respect to the set of PDI variables declared and under the constraints provided on them by the user in the PDI problem definition. This is done by first compiling the reference function provided by the user (which uses PDI variables) into a usable function that assumes concrete values for these variables. The compiled function is then used to provide the optimizer with an objective function that maps the original information to the differential information (with the current parameters provided by the optimizer) and outputs the mutual information. In the end, the engine substitutes the PDI variables declared by the user with the values found by the optimizer in the user’s workspace (so that they become numeric matrices instead of objects of type PDI_variable).
Note again that in the current implementation, the distribution of the differential information is modeled non-parametrically as a high-dimensional histogram, using bins computed from the original bins provided by the user when invoking PDI_dimension and from the reference function provided by the user. This approach clearly suffers from the curse of dimensionality but serves as a simple first implementation for a proof of concept. We discuss approaches to amend this problem in our discussion of future work.
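The two-stage strategy (a global stochastic search whose result seeds a local optimizer) can be illustrated with a deliberately simplified stand-in. The Python sketch below does not reproduce ga() or fmincon(): it assumes plain uniform random sampling for the first stage and a greedy coordinate search for the second, ignores constraints, and uses a hypothetical toy objective in place of the mutual information.

```python
import random


def two_stage_minimize(objective, dim, bounds, n_samples=200, n_steps=200, step=0.1, seed=0):
    """Stage 1: uniform random search. Stage 2: greedy local refinement of the best sample."""
    rng = random.Random(seed)
    lo, hi = bounds
    best = min(
        ([rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_samples)),
        key=objective,
    )
    coarse_value = objective(best)
    value = coarse_value
    for _ in range(n_steps):
        improved = False
        for i in range(dim):
            for delta in (step, -step):
                candidate = list(best)
                candidate[i] += delta  # bounds are not enforced in this stage
                cand_value = objective(candidate)
                if cand_value < value:
                    best, value, improved = candidate, cand_value, True
        if not improved:
            step /= 2  # shrink the step once no coordinate move helps
    return best, value, coarse_value


def toy_objective(params):
    """Non-convex toy function: each coordinate has a local and a global minimum."""
    return sum((x * x - 1.0) ** 2 + 0.3 * x for x in params)


best_params, best_value, coarse_value = two_stage_minimize(
    toy_objective, dim=2, bounds=(-2.0, 2.0)
)
```

The design rationale is the same as in the text: the global stage reduces the risk of starting in a poor basin of a non-convex objective, and the local stage then refines the candidate cheaply.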
a.5 Information Theory Background
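We recall the definitions of the Shannon entropy (shannon1948mathematical), the differential entropy and the relative entropy.
The Shannon entropy of a discrete random variable $X$ with probability mass function $p$, denoted by $H(X)$, is defined by $H(X) = -\sum_{x} p(x) \log p(x)$.
Definition (Cover and Thomas, 2006, Definition 8.1).
The differential entropy of a continuous random variable $X$ with density $f$, denoted by $h(X)$, is defined by $h(X) = -\int_{S} f(x) \log f(x) \, dx$, where $S$ is the support set of $X$.
Definition (Cover and Thomas, 2006, cf. Definition 8.46).
Let $P$ and $Q$ be two probability measures such that $P$ is absolutely continuous with respect to $Q$. The relative entropy (or Kullback-Leibler distance) between $P$ and $Q$ is defined by $D(P \| Q) = \int \log \frac{dP}{dQ} \, dP$, where $\frac{dP}{dQ}$ is the Radon-Nikodym derivative of $P$ with respect to $Q$.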
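For discrete distributions, the Shannon entropy and the relative entropy reduce to finite sums, which the following Python sketch evaluates. The function names are hypothetical, natural logarithms are assumed (so values are in nats), and the absolute continuity requirement becomes the condition that the second distribution is non-zero wherever the first one is.

```python
import math


def shannon_entropy(pmf):
    """Entropy in nats of a discrete distribution given as a list of probabilities."""
    return -sum(p * math.log(p) for p in pmf if p > 0)


def relative_entropy(p, q):
    """Kullback-Leibler distance in nats between discrete distributions p and q."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi == 0:
                raise ValueError("p is not absolutely continuous with respect to q")
            total += pi * math.log(pi / qi)
    return total


fair_coin = [0.5, 0.5]
biased_coin = [0.25, 0.75]
h_fair = shannon_entropy(fair_coin)
d_fair_biased = relative_entropy(fair_coin, biased_coin)
```

The fair coin attains the maximum entropy of a binary variable, $\log 2$ nats, and the relative entropy is zero only when the two distributions coincide.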
a.6 Omitted Proofs
Proof of Section 4.
It is sufficient to show that
We abuse notation and denote by the expectation of w.r.t. the distribution and by the expectation of w.r.t. the distribution . We write
as required. ∎