De-Health: All Your Online Health Information Are Belong to Us

02/02/2019, by Shouling Ji, et al., Zhejiang University

In this paper, we study the privacy of online health data. We present a novel online health data De-Anonymization (DA) framework, named De-Health. De-Health consists of two phases: Top-K DA, which identifies a candidate set for each anonymized user, and refined DA, which de-anonymizes an anonymized user to a user in its candidate set. By employing both candidate selection and DA verification schemes, De-Health significantly reduces the DA space by several orders of magnitude while achieving promising DA accuracy. Leveraging two real world online health datasets, WebMD (89,393 users, 506K posts) and HealthBoards (388,398 users, 4.7M posts), we validate the efficacy of De-Health. Further, when the training data are insufficient, De-Health can still successfully de-anonymize a large portion of anonymized users. We develop the first analytical framework on the soundness and effectiveness of online health data DA. By analyzing the impact of various data features on anonymity, we derive the conditions and probabilities for successfully de-anonymizing one user or a group of users in exact DA and Top-K DA. Our analysis is meaningful to both researchers and policy makers in facilitating the development of more effective anonymization techniques and proper privacy policies. We present a linkage attack framework which can link online health/medical information to real world people. Through a proof-of-concept attack, we link 347 out of 2805 WebMD users to real world people, and find the full names, medical/health information, birthdates, phone numbers, and other sensitive information for most of the re-identified users. This clearly illustrates the fragility of the notion of privacy of those who use online health forums.


I Introduction

Status Quo. The advance of information technologies has greatly transformed the delivery of healthcare services: from traditional hospitals/clinics to various online healthcare services. This shift is well illustrated by the story of Charles Simmons, a software engineer in Los Angeles, California [8].

In 1997, after experiencing a variety of symptoms for which doctors had no explanation, Simmons turned to the Web for answers and support. When he did not find online support groups in the areas he needed, he realized that there was a need for a health support website covering a wide range of health topics.

Ever since their introduction, online health services have experienced rapid growth, attracting millions of users and accumulating billions of users' medical/health records [6][7].

According to several national surveys, a large share of US adults employed the Internet (online health information) as a diagnostic tool in 2012 [1], and in 2013, US consumers on average spent many hours annually searching for and perusing online health information while visiting doctors only three times per year [2]. Moreover, on an average day, many US Internet users perform online medical searches to better prepare for doctors' appointments and to better digest information obtained from doctors afterwards [3]. Therefore, online health services play an increasingly important role in people's daily lives.

When serving users (we use patients and users interchangeably in this paper), the online health services accumulate a huge amount of the users’ health data. For instance, as one of the leading American corporations that provide health news, advice, and expertise [4][5], WebMD reached an average of approximately 183 million monthly unique visitors and delivered approximately 14.25 billion page views in 2014 [6]. Another leading health service provider, HealthBoards (HB), has over 10 million monthly visitors, 850,000 registered members, and over 4.5 million health-related/medical messages posted [7]. Due to the high value of enabling low-cost, large-scale data mining and analytics tasks, e.g., disease transmission and control research [41], disease inference [17], and predicting future instances of domestic abuse [42], those user-generated health data are increasingly shared, disseminated, and published for research [23], business [5][8], government applications [9][10], and other scenarios [23][27].

Privacy Issues of Online Health Data. In addition to their high value for various applications, online health data carry numerous sensitive details about the users who generate them [23][27][28]. Therefore, before sharing, disseminating, and publishing the health data, proper privacy protection mechanisms should be applied and privacy policies should be followed. However, the reality is that protecting the privacy of online health data remains an open problem, from both the technical perspective and the policy perspective.

From the technical perspective, most existing health data anonymization techniques (called de-identification techniques in the medical and policy literature [23][28]), e.g., the techniques in [18]-[23], if not all, focus on protecting the privacy of structured medical/health data that are usually generated from hospitals, clinics, and/or other official medical agencies (e.g., labs, government agencies). Nevertheless, putting aside their performance and effectiveness, existing privacy protection techniques for structured health data can hardly be applied to online health data for the following reasons [13][14][17]. (i) Structure and heterogeneity: structured health data are well organized with structured fields, while online health data are usually heterogeneous and structurally complex. (ii) Scale: a structured health dataset usually consists of the records of tens to thousands of users [18]-[23], while an online health dataset can contain millions of users [6][7][17]. (iii) Threat: compared to online health data, the dissemination of structured health data is usually easier to control, and thus a privacy compromise is less likely. Due to its open-to-public nature, however, online health data dissemination is difficult to control, and adversaries may employ multiple kinds of means and auxiliary information to compromise the data's privacy (as we show later in this paper).

From the policy making perspective, taking the US Health Insurance Portability and Accountability Act (HIPAA) [11] as an example, although HIPAA sets forth methodologies for anonymizing health data (including online health data), once the data are anonymized, they are no longer subject to HIPAA regulations and can be used for any purpose. However, when it comes to anonymizing the data, HIPAA does not specify any concrete technique beyond high-level guidelines. Therefore, naive anonymization techniques may be applied in practice.

Our Work. Toward helping users, researchers, data owners, and policy makers comprehensively understand the privacy vulnerability of online health data, we study the privacy of such data, focusing on the health data generated on online health forums like WebMD [4] and HB [7]. These forums disseminate personalized health information and provide a community-based platform for connecting patients with doctors and other patients via interactive question answering, symptom analysis, medication advice, side-effect warnings, and other interactions [4][7][13].

As we mentioned earlier, a significant amount of medical records has accumulated in the repositories of these health websites. Their privacy policies [4][7] explicitly state that they collect personal information of users (patients), including contact information, payment information, geographic location, personal profiles, medical information, transaction information, Cookies, and other sensitive information. For instance, WebMD's privacy policy [4] states that

We may collect “Personal Information” about you – such as your name, address, telephone number, email address or health information … We may collect “Non-Personal Information” – information that cannot be used by us to identify you – via Cookies, Web Beacons, WebMD mobile device applications and from external sources, even if you have not registered with or provided any Personal Information to WebMD

and similarly, in HB's privacy policy [7],

We collect personal information for various business purposes when you interact with us … We collect information about you in two basic ways: First, we receive information directly from you. Second, through use of cookies and other technologies, we keep track of your interactions

To use those online health services, users have to accept their privacy policies. For instance, HB's privacy policy explicitly states that "if you do not agree to this privacy policy, please do not use our sites or services". Therefore, using the online health services requires allowing service providers like WebMD and HB to collect users' personal information.

As stated, the collected personal information will be used for research and various business purposes, e.g., data mining tasks and targeted advertisements from pharmaceutical companies. (The business model, i.e., revenue model, of most online health forums is advertisement based [5][7].) Although these medical records are only affiliated with user-chosen pseudonyms or anonymized IDs, some natural questions arise: when those data are shared with commercial partners (one of the most typical "various business purposes"), or published for research, or collected by adversaries, can they be de-anonymized even if the patients who generated them are anonymized? And can those medical records be connected to real world people? In this paper, we answer these two questions by (i) proposing a novel online health data De-Anonymization (DA) framework; (ii) providing a general theoretical analysis of the soundness and effectiveness of online health data DA; and (iii) discussing how to link the medical records to real world people. We also discuss the implications of our findings for online health data privacy researchers, users, and policy makers.

Our Contributions. Our key contributions are the following:

(1) We present a novel DA framework, named De-Health, for large-scale online health data. De-Health is a two-phase DA framework. In the first phase, De-Health performs Top-K DA. It first constructs a User-Data-Attribute (UDA) graph based on the data correlation among users, and then identifies structural features (i.e., graph features) from the UDA graph. Leveraging those structural features, a Top-K candidate set is constructed for each anonymized user. In the second phase, refined DA is performed. Leveraging both correlation and stylometric features of an anonymized user and the users in her corresponding Top-K candidate set, De-Health trains a classifier using benchmark machine learning techniques to de-anonymize the anonymized user to some user in its candidate set. De-Health has two distinguishing features: (i) by utilizing the UDA graph and Top-K candidate sets, it can be easily scaled to large-scale health data with high accuracy preservation; and (ii) De-Health can be applied to both closed-world DA (each anonymized user appears in the training/auxiliary data) and open-world DA (some anonymized users may not appear in the training/auxiliary data).

(2) We provide a general theoretical analysis framework for the soundness and effectiveness of online health data DA. In the framework, we analyze the impacts of structural features and stylometric features on the anonymity of health data. Specifically, we quantify the conditions and probabilities of successfully de-anonymizing (including exact DA and Top-K DA) one user or a group of users. The theoretical analysis has meaningful implications for health data privacy research and policy making: understanding the impact of features on the data's anonymity will help researchers and policy makers develop more effective anonymization techniques and proper privacy policies.

(3) Leveraging two real world online health datasets, WebMD (89,393 users, 506K posts) and HB (388,398 users, 4.7M posts), we conduct extensive evaluations to examine the performance of De-Health in closed-world and open-world DA settings. The results show that the Top-K DA of De-Health is very powerful on large-scale datasets. By seeking a Top-K candidate set for each anonymized user, the DA space is effectively decreased by several orders of magnitude with high accuracy preservation, which enables the development of an elegant machine learning based classifier for refined DA. Even when little data are available for training, De-Health can still achieve satisfying DA performance, significantly outperforming the traditional DA approach.

(4) We present a linkage attack framework, which can link online health service users to other Internet services as well as real world people. We validate the framework with proof-of-concept attacks. For instance, it successfully links 347 out of 2805 (i.e., 12.4%) target WebMD users to real world people, finding most of their full names, medical/health information, birthdates, phone numbers, addresses, and other sensitive information. Thus, those users' privacy can be compromised, and one can learn the sexual orientation and related infectious diseases, mental/psychological problems, and suicidal tendencies from some users' health/medical data.

II Data Collection & Feature Extraction

II-A Data Collection

We collect online medical postings from two leading US online health service providers, WebMD [4] and HB [7]. As the leading health portal in the US [5], WebMD provides valuable health information, tools for managing its users' health, and support to those who seek online health information. HB provides a one-stop support group community offering more than 200 message boards on various diseases, conditions, and health topics. It was rated as one of the top 20 health information websites by Consumer Reports Health WebWatch [7][8]. We collected the health data of WebMD and HB registered users for approximately four months, from May to August 2015. This collection process resulted in 540,183 webpages from WebMD and 2,596,433 webpages from HB. After careful analysis and processing, we extracted 506,370 disease/condition/medicine posts generated by 89,393 registered users from the WebMD dataset (5.66 posts/user on average) and 4,682,281 posts generated by 388,398 registered users from the HB dataset (12.06 posts/user on average). We show some example posts from WebMD and HB in Appendix A; from these examples, we can see that a lot of sensitive user information can be learned.

Fig. 1: CDF of users with respect to the number of posts.
Fig. 2: Post length distribution.

We show the Cumulative Distribution Function (CDF) of the number of users with respect to the number of posts in Fig. 1, from which we observe that most of the users only have a few posts; e.g., the majority of WebMD and HB users have fewer than 5 posts. We further show the length distribution of the posts in WebMD and HB, in terms of the number of words, in Fig. 2. Most of the posts in the two datasets are shorter than 300 words. On average, a WebMD post has 127.59 words and an HB post has 147.24 words.
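The per-user post distribution above can be computed with a short sketch like the following (the function name and toy data are illustrative, not from the paper's pipeline):

```python
from collections import Counter

def posts_per_user_cdf(post_authors):
    """Given one author ID per post, return a sorted list of
    (post_count, fraction of users with at most that many posts),
    i.e., the CDF of users over their number of posts."""
    per_user = Counter(post_authors)          # user -> number of posts
    n_users = len(per_user)
    by_count = Counter(per_user.values())     # post count -> number of users
    cdf, cum = [], 0
    for c in sorted(by_count):
        cum += by_count[c]
        cdf.append((c, cum / n_users))
    return cdf

# Toy example: user "a" wrote 3 posts, "b" wrote 1, "c" wrote 2.
cdf = posts_per_user_cdf(["a", "a", "a", "b", "c", "c"])
```

Reading off the fraction of users below a post-count threshold (e.g., 5 posts) is then a single lookup on the returned list.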

II-B User-Data-Attribute Graph & Feature Extraction

User Correlation Graph. Online health services provide a platform for connecting patients via interactive disease and symptom discussion, health question answering, medicine and possible side effect advice, etc. For instance, on WebMD and HB, when one disease topic is raised by some user, other users may join the discussion of this topic by providing suggestions, sharing experience, making comments, etc. Due to this fact, HB is also classified as a health-oriented social networking service [8].

Therefore, if we take such user interactivity into consideration, there is some correlation, i.e., the co-disease/health/medicine discussion relation, among users. To characterize such interactivity, we construct a user correlation graph based on the relationships among the data (posts) of different users. Particularly, we represent each user in the WebMD/HB dataset as a node in the correlation graph. For two users u and v, if they post under the same health/disease topic, i.e., they made posts on the same topic initialized by some user (which could be u, v, or some other user), we consider that there is an undirected edge, denoted by (u, v), between u and v. Furthermore, the number of interactive discussions between different pairs of users might differ. Therefore, we assign each edge a weight w(u, v) to characterize the interactivity strength, defined as the number of times the corresponding two users co-discussed under the same topic.

Now, we formally define the user correlation graph as G = (V, E, W), where V denotes the set of users, E denotes the set of edges among users, and w(u, v) ∈ W is the weight (interactivity strength) associated with edge (u, v) ∈ E. For u ∈ V, we define its neighborhood as N(u) = {v | (u, v) ∈ E}. Let d(u) = |N(u)| be the number of neighbor users of u, i.e., the degree of user u. When taking the weight information into consideration, we define wd(u) = Σ_{v ∈ N(u)} w(u, v) to be the weighted degree of u. For our following application, we also define a Neighborhood Correlation Strength (NCS) vector for each user. Specifically, for u ∈ V, its NCS vector is defined as NCS(u) = (w1, w2, ...), where (w1, w2, ...) is a decreasing-order sequence of the weights {w(u, v) | v ∈ N(u)}. Given u, v ∈ V, we define the distance (resp., weighted distance) between u and v as the length of the shortest path from u to v in G when the weight information is overlooked (resp., considered), denoted by h(u, v) (resp., hw(u, v)).
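The correlation-graph construction and the NCS vector described above can be sketched with a minimal dict-based implementation (function and variable names are ours, for illustration):

```python
from collections import defaultdict
from itertools import combinations

def build_correlation_graph(threads):
    """threads: iterable of per-topic participant lists. Returns a dict
    mapping frozenset({u, v}) -> w(u, v), the number of topics the two
    users co-discussed, i.e., the edge weight defined above."""
    w = defaultdict(int)
    for users in threads:
        for u, v in combinations(sorted(set(users)), 2):
            w[frozenset((u, v))] += 1
    return dict(w)

def ncs_vector(w, u):
    """Neighborhood Correlation Strength vector of user u: the weights
    of u's incident edges in decreasing order."""
    return sorted((wt for edge, wt in w.items() if u in edge), reverse=True)

# Two topics: {a, b, c} discuss the first, {a, b} the second.
w = build_correlation_graph([["a", "b", "c"], ["a", "b"]])
```

The degree d(u) and weighted degree wd(u) follow directly as the length and the sum of `ncs_vector(w, u)`.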

We analyze the degree distributions of the WebMD graph and the HB graph, as well as the community structure of the WebMD graph, in Appendix B. Basically, the graphs' connectivity is not strong (the average degree is low and the graphs are not connected).

Stylometric Features. Using writing style for author attribution can be traced back to the 19th century [43]. Recently, stylometric approaches have been applied to broad security and privacy issues, from author attribution [29][31][32] to fraud and deception detection [33], underground cybercriminal analysis [34], and programmer DA [35]. According to the findings in those applications, users have distinctive writing styles (especially in non-adversarial scenarios). Thus, given sufficient data (written materials, e.g., blogs, documents, passages), many users can be uniquely identified/de-anonymized from a (large) group of candidate users using benchmark machine learning models trained on their stylometric features [29][30][31][32]. Furthermore, as demonstrated in [33], it is difficult for users to intentionally obfuscate their writing style, or to imitate the writing styles of others, over the long term. Moreover, even if that happens, with high probability, specific linguistic features can still be extracted from the long-term written materials to identify the users. Therefore, for our purpose, we seek to employ the linguistic features of the health data (posts written by users) to de-anonymize the associated users.


Category | Description | Count
Length | # of characters and paragraphs, average # of characters per word | 3
Word length | freq. of words of different lengths | 20
Vocabulary richness | Yule's K, hapax/dis/tris/tetrakis legomena | 5
Letter freq. | freq. of 'a/A' to 'z/Z' | 26
Digit freq. | freq. of '0' to '9' | 10
Uppercase letter percentage | % of uppercase letters in a post | 1
Special characters | freq. of special characters | 21
Word shape | freq. of all-uppercase, all-lowercase, first-character-uppercase, and camel-case words | 21
Punctuation freq. | freq. of punctuation, e.g., !,;? | 10
Function words | freq. of function words | 337
POS tags | freq. of POS tags, e.g., NP, JJ | variable
POS tag bigrams | freq. of POS tag bigrams | variable
Misspelled words | freq. of misspellings | 248
TABLE I: Stylometric features.

We extract various stylometric features from the WebMD and HB datasets as shown in Table I. Generally, the features in Table I can be classified into three groups: lexical features, syntactic features, and idiosyncratic features. The lexical features include length, word length, vocabulary richness, letter frequency, digit frequency, uppercase letter percentage, special characters, and word shape. They measure the writing style of users with respect to characteristics of employed characters, words, and vocabularies. The syntactic features include punctuation frequency, function words, POS tags, and POS tag bigrams. They measure the writing style of users with respect to the arrangement of words and phrases to create well-formed sentences in posts. For idiosyncratic features, we consider misspelled words, which measure some peculiar writing style of users.

Since the number of POS tags and POS tag bigrams could be variable, the total number of features is denoted by a variable n for convenience. According to the feature descriptions, all the features are real and positive valued. Without loss of generality, we organize the features as a vector, denoted by F = (f1, f2, ..., fn). Then, given a post, we extract its features with respect to F and obtain a feature vector consisting of 0s and positive real values, where 0 implies that the post does not have the corresponding feature while a positive real value implies that it does.
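A minimal sketch of extracting a few of the Table I lexical features from a single post (only a small subset; the feature names and representation as a dict are our own choices):

```python
import re
from collections import Counter

def lexical_features(post):
    """A small subset of the Table I lexical features for a single post:
    character count, average word length, uppercase-letter percentage,
    and per-character letter/digit frequencies."""
    words = re.findall(r"[A-Za-z]+", post)
    n = len(post)
    feats = {
        "n_chars": n,
        "avg_word_len": sum(map(len, words)) / len(words) if words else 0.0,
        "pct_upper": sum(c.isupper() for c in post) / n if n else 0.0,
    }
    letters = Counter(c.lower() for c in post if c.isalpha())
    digits = Counter(c for c in post if c.isdigit())
    for ch in "abcdefghijklmnopqrstuvwxyz":
        feats["letter_" + ch] = letters[ch] / n if n else 0.0
    for d in "0123456789":
        feats["digit_" + d] = digits[d] / n if n else 0.0
    return feats

f = lexical_features("I took 10 mg twice a day.")
```

The syntactic features (function words, POS tags) would additionally require a function-word list and a POS tagger, which we omit here.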

Note that it is possible to extract more stylometric features from the WebMD/HB datasets, e.g., content features [29]. However, in this paper, we mainly focus on developing an effective online health data DA framework. For the feature extraction part, we mainly employ existing techniques such as those in [29]-[37], and thus we do not consider this part a technical contribution of this paper. Certainly, understanding which features are more effective in de-anonymizing online health data is an interesting topic, which we leave as future work.

User-Data-Attribute Graph and Structural Features. Previously, we constructed a correlation graph for the users in a health dataset. Now, we extend it to a User-Data-Attribute (UDA) graph. As the stylometric features demonstrate the writing characteristics of users, logically, they can also be considered as attributes of users, similar to social attributes such as career, gender, and citizenship. Therefore, at the user level, we define an attribute set/space, denoted by A, based on F, i.e., A = {a_i | f_i ∈ F}. Then, following this idea, for each feature f_i ∈ F, if a user u has a post that has feature f_i (i.e., the i-th dimension is not 0 in the feature vector of that post), we say u has attribute a_i, denoted by (u, a_i). Note that each attribute is binary to a user, i.e., a user either has an attribute or not, which is different from the feature, which could be either a continuous or a discrete real value. We define A(u) as the set of all the attributes that user u has. Since u may have multiple posts that have feature f_i, we assign a weight to the relation (u, a_i), denoted by w(u, a_i), which is defined as the number of posts authored by u that have feature f_i.

Based on the attribute definition, we extend the correlation graph to the UDA graph, denoted by G = (V, E, W, A, R, W_A), where V, E, and W are the same as defined before, A is the attribute set, R = {(u, a_i) | u ∈ V, a_i ∈ A(u)} denotes the set of all the user-attribute relationships, and W_A = {w(u, a_i) | (u, a_i) ∈ R} denotes the set of user-attribute relationship weights. Since the UDA graph is an extension of the correlation graph, we use the same notation G for both concepts. In practice, one may consider more attributes of a user, e.g., social attributes (the user's social information) and behavioral attributes (the user's activity pattern), when defining A.

From the definition of the UDA graph , we can see that it takes into account the data’s correlation as well as the data’s linguistic features (by introducing the concept of attribute in a different way compared to the traditional manner [29]-[37]). We will introduce how to use the UDA graph to conduct the user-level DA and analyze the benefits in the following section. Before that, we introduce more user-level features from the health data leveraging the UDA graph.

The features extracted from the UDA graph are classified as structural features, which can be partitioned into three categories: local correlation features, global correlation features, and attribute features. The local correlation features include the user degree (i.e., d(u) for u ∈ V), the weighted degree (i.e., wd(u)), and the NCS vector (i.e., NCS(u)). Basically, the local correlation features measure the direct interactivity of a user in a health forum.

Given u ∈ V and a subset L ⊆ V, the global correlation features of u are defined as the distances and weighted distances from u to the users in L, denoted by the vectors (h(u, l) for l ∈ L) and (hw(u, l) for l ∈ L), respectively. Basically, the global correlation features measure the indirect interactivity of a user in a health dataset.

Based on A(u) of u ∈ V, we introduce a new notation to take into account the weight of each attribute of u. We define Aw(u) = {(a_i, w(u, a_i)) | a_i ∈ A(u)}. Then, the attribute features of u are defined as A(u) and Aw(u). The attribute features measure the linguistic features of users in the form of binary attributes and weighted binary attributes.
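The user-level binary attributes and their per-post weights described above can be derived from per-post feature vectors with a sketch like this (assuming each post's feature vector is represented as a dict; names are ours):

```python
from collections import defaultdict

def user_attributes(post_feature_vectors):
    """post_feature_vectors: the feature vectors (dicts) of one user's posts.
    Returns (attrs, weights): attrs is the set of attributes the user has
    (features that are non-zero in at least one post), and weights[a] is the
    number of the user's posts exhibiting attribute a."""
    weights = defaultdict(int)
    for fv in post_feature_vectors:
        for feat, val in fv.items():
            if val > 0:
                weights[feat] += 1
    return set(weights), dict(weights)

attrs, w = user_attributes([
    {"digit_3": 0.1, "letter_a": 0.0},
    {"digit_3": 0.2, "letter_a": 0.3},
])
```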

The defined structural features are helpful in conducting user-level DA. We show this in detail in the De-Health framework as well as in the experimental evaluations.

III De-anonymization

III-A Preliminary

To the best of our knowledge, before this work, the privacy vulnerability of online health data, e.g., the health/medical data generated by users of WebMD and HB, was unclear. In this section, we present a novel two-phase DA attack on online health data. The considered anonymized data, denoted by D1, are data generated from current online health services, e.g., WebMD and HB. There are multiple applications of such anonymized online health data: (i) as indicated in the privacy policies of WebMD and HB, the health data of their users can be shared with researchers for multiple research and analytics tasks [4][7]; (ii) again, according to their privacy policies, the data could be shared with commercial partners (e.g., insurance companies and pharmaceutical companies) for multiple business purposes [4][7]; and (iii) the data might be publicly released for multiple government and societal applications [9][10]. Considering these applications, our question is: can those data be de-anonymized to the users of online health services, and can they be linked to the users' real identities? We answer the first part of this question in this section by presenting De-Health, and discuss the second part in Section VI.

To de-anonymize the anonymized data D1, we assume that the adversary can collect some auxiliary data, denoted by D2, from the same or another online health service. (Here, adversaries are defined as those who want to compromise the privacy of the users in the anonymized dataset. During the data sharing and publishing process, for research, business, and other purposes, every data recipient could be an adversary. In this paper, we focus on studying the potential privacy vulnerability of online health data.) According to our knowledge, such collection is possible in practice: from the adversary's perspective, for some online health services, e.g., HB, it is not difficult to collect data using intelligent crawling techniques; for other online health services with strict policies, e.g., PatientsLikeMe [12], an adversary can also collect data by combining intelligent crawling techniques with anonymous communication techniques (e.g., Tor). In this paper, we assume both D1 and D2 are generated from online health services like WebMD and HB.

After obtaining the anonymized data D1 and the auxiliary data D2, we extract the features of the data and transform them into an anonymized graph and an auxiliary graph, denoted by G1 and G2, respectively, using the techniques discussed in Section II. When necessary, we use the subscripts '1' and '2' to distinguish the anonymized data/graph from the auxiliary data/graph, e.g., user sets V1 and V2. Now, the DA of D1 leveraging D2 can be approximately defined as: for an anonymized (unknown) user v ∈ V1, seeking an auxiliary (known) user u ∈ V2 such that v can be identified to u (i.e., they correspond to the same real world person), denoted by v → u. However, in practice, it is unclear whether D1 and D2 are generated by the same group of users, i.e., it is unknown whether V1 ⊆ V2. Therefore, we define closed-world DA and open-world DA. When the users that generate D1 are a subset of the users that generate D2, i.e., V1 ⊆ V2, the DA problem is a closed-world DA problem; a successful DA is then defined as v → u such that v and u correspond to the same user. When V1 ⊄ V2, the DA problem is an open-world DA problem. Let V∩ = V1 ∩ V2 be the set of overlapping users between V1 and V2. Then, a successful DA is defined as v → u, where v and u are in V∩ and correspond to the same user; or v → ⊥ if v ∉ V∩, where ⊥ represents non-existence. For v ∈ V1 and u ∈ V2, if v and u correspond to the same real world user, we call u the true mapping of v in V2. The presented De-Health framework works for both the closed-world and the open-world situations.

Iii-B De-Health

1 construct G1 and G2 from D1 and D2, respectively;
2 for every v ∈ V1 do
3       for every u ∈ V2 do
4             compute the structural similarity between v and u, denoted by s(v, u);
5 compute the Top-K candidate set for each user v ∈ V1, denoted by C(v), based on the structural similarity scores;
6 filter C(v) using a threshold vector;
7 for every v ∈ V1 do
8       leveraging the stylometric and structural features of the users in C(v), build a classifier using benchmark machine learning techniques (e.g., SMO);
9       use the classifier to de-anonymize v;
Algorithm 1: De-Health

Overview. In this subsection, we present the De-Health framework. We show the high-level idea of De-Health in Algorithm 1 and give the details later. At a high level, De-Health conducts user DA in two phases: Top-K DA (lines 2-6) and refined DA (lines 7-9). In the Top-K DA phase, we mainly focus on de-anonymizing each anonymized user to a Top-K candidate set, denoted by C(v), that consists of the auxiliary users most structurally similar to the anonymized user (lines 2-5). Then, we optimize the Top-K candidate set using a threshold vector, eliminating some less likely candidates (line 6). In the refined DA phase, an anonymized user is de-anonymized to some user in the candidate set using a benchmark machine learning model trained on both stylometric and structural features. Note that we do not limit the DA scenario to closed-world or open-world; De-Health is designed to handle both.

Top-K DA. Now, we discuss how to implement Top-K DA and its optimization (filtering).

Structural Similarity. Before we compute the Top-K candidate set for each anonymized user, we compute the structural similarity between each anonymized user v ∈ V1 and each auxiliary user u ∈ V2, denoted by s(v, u), from the graph perspective (lines 2-4 in Algorithm 1). In De-Health, s(v, u) consists of three components: degree similarity s_d(v, u), distance similarity s_h(v, u), and attribute similarity s_a(v, u). Specifically, s_d(v, u) is defined as

s_d(v, u) = cos((d(v), wd(v), NCS(v)), (d(u), wd(u), NCS(u))),

where cos(·, ·) is the cosine similarity between two vectors. Note that it is possible that NCS(v) and NCS(u) have different lengths; in that case, we pad the shorter vector with zeros so that both have the same length. From the definition, s_d(v, u) measures the degree similarity of v and u in G1 and G2, i.e., their local direct interactivity similarity in D1 and D2, respectively.

To define s_h(v, u), we need to specify a set of landmark users from G1 and G2, respectively. Usually, the landmark users are some pre-de-anonymized users that serve as seeds for a DA [38][39][40]. There are many techniques for finding landmark users, e.g., the clique-based technique [38], the community-based technique [39], and the optimization-based technique [40]. In De-Health, we do not require accurate landmark users. In particular, we select the users with the largest degrees from G1 and G2 as the landmark users, denoted by L1 and L2, respectively, and sort the users in L1 and L2 in degree-decreasing order. Then, we define s_h(v, u) as

s_h(v, u) = cos((h(v, l) for l ∈ L1), (h(u, l) for l ∈ L2)).

Basically, s_h(v, u) measures the relative global structural similarity, i.e., the indirect interactivity similarity, of v and u.

To define the attribute similarity, we associate the anonymized user and the auxiliary user each with an attribute set, and let the cardinality of a set be extended to weighted sets by summing element weights. The attribute similarity is then defined as the ratio of the (weighted) cardinality of the intersection of the two users' attribute sets to that of their union, which measures the attribute similarity (i.e., linguistic similarity) between the two users.

After specifying the degree, distance, and attribute similarities, the structural similarity between an anonymized user and an auxiliary user is defined as their weighted sum, where the weights are positive constants adjusting the contribution of each similarity component.
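As an illustration, the padded cosine similarity and the weighted combination can be sketched as follows (the zero-padding follows the text; representing the attribute similarity as a cosine over attribute-count vectors and the example weight values are simplifying assumptions of this sketch):

```python
import math

def cosine(x, y):
    """Cosine similarity between two vectors; the shorter one is zero-padded
    so that both vectors have the same length, as described in the text."""
    n = max(len(x), len(y))
    x = list(x) + [0.0] * (n - len(x))
    y = list(y) + [0.0] * (n - len(y))
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny) if nx and ny else 0.0

def structural_similarity(deg_a, deg_u, dist_a, dist_u, attr_a, attr_u,
                          w_deg=0.1, w_dist=0.1, w_attr=0.8):
    """Weighted sum of degree, distance, and attribute similarity.
    The weight values here are hypothetical defaults."""
    return (w_deg * cosine(deg_a, deg_u)
            + w_dist * cosine(dist_a, dist_u)
            + w_attr * cosine(attr_a, attr_u))
```

With weights summing to one, identical feature vectors yield a similarity of 1.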

Top-K Candidate Set. After obtaining the structural similarity scores, we compute the Top-K candidate set for each anonymized user (line 5 in Algorithm 1); here, we assume that K is far less than the number of auxiliary users, since otherwise seeking Top-K candidate sets is meaningless. We propose two approaches: direct selection and graph matching based selection. In direct selection, we directly select the auxiliary users that have the Top-K similarity scores with the anonymized user. Graph matching based selection proceeds as follows. Step 1: we construct a weighted complete bipartite graph (anonymized users on one side, auxiliary users on the other), where the weight on each edge is the structural similarity score between the two corresponding users; Step 2: we find a maximum weighted bipartite matching on this graph; Step 3: for each matched pair in the matching, we add the auxiliary user to the Top-K candidate set of the matched anonymized user and remove the edge between them from the bipartite graph; Step 4: we repeat Steps 2 and 3 until a Top-K candidate set is found for each anonymized user.
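The direct selection approach amounts to a Top-K lookup per anonymized user; a minimal sketch (the dict-based layout of the similarity scores is an assumption of this sketch):

```python
import heapq

def top_k_candidates(scores, k):
    """Direct selection: the k auxiliary users with the highest structural
    similarity scores to one anonymized user (scores: aux user -> score)."""
    return [u for u, _ in heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])]

def top_k_all(sim, k):
    """Apply direct selection to every anonymized user.
    sim: anonymized user -> {auxiliary user: similarity score}."""
    return {a: top_k_candidates(s, k) for a, s in sim.items()}
```

The graph matching based variant would instead repeatedly compute a maximum weighted bipartite matching (e.g., via the Hungarian algorithm) and collect each anonymized user's matched partner across K rounds.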

1 s_max ← the maximum structural similarity score between the anonymized users and their candidates;
2 s_min ← the minimum structural similarity score between the anonymized users and their candidates;
3 construct a threshold vector (θ_1, θ_2, …, θ_n), where for i = 1, 2, …, n, θ_i = s_max − i · (s_max − s_min)/n;
4 for every anonymized user a do
5       for i = 1 to n do
6             C′ ← ∅;
7             for each candidate u of a do
8                   if s(a, u) > θ_i then
9                         C′ ← C′ ∪ {u};
10            if C′ ≠ ∅ then
11                  take C′ as the final candidate set of a, break;
12      if C′ = ∅ then
13            conclude that a does not appear in the auxiliary data, remove a from further consideration;
Algorithm 2 Filtering

Optimization/Filtering. After determining the Top-K candidate set for each anonymized user, we further optimize it using the filtering procedure shown in Algorithm 2 (to finish line 6 in Algorithm 1), where n is the length of the threshold vector and C′ is a temporary candidate set. The main idea of the filtering process is to pre-eliminate some less likely candidates, in terms of structural similarity, using a threshold vector. Below, we explain Algorithm 2 in detail. First, the threshold interval is specified based on the maximum and minimum similarity scores between the anonymized users and their candidates (lines 1-2). Then, the threshold interval is partitioned into n segments by the threshold values, which we organize as a threshold vector (line 3). Third, we use the threshold vector to filter each candidate set, starting from large thresholds and moving to small ones (lines 5-11). If some candidate users pass the filtering at some threshold level, we break the filtering process and take those candidate users as the final candidate set (lines 10-11). If no candidate users are left even under the smallest threshold, we conclude that the anonymized user does not appear in the auxiliary data and remove it from further consideration (lines 12-13).

Note that, the filtering process is mainly used for reducing the size of the candidate set for each anonymized user, and thus to help obtain a better refined DA result and accelerate the DA process in the following stage. In practice, there is no guarantee for the filtering to improve the DA performance. Therefore, we set the filtering process as an optional choice for De-Health.
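A minimal sketch of the filtering procedure described above (the strict comparison, the dict layout, and the default n are choices of this sketch; `score` maps an (anonymized user, candidate) pair to their structural similarity):

```python
def filter_candidates(candidates, score, n=10):
    """Threshold-vector filtering of Top-K candidate sets (optional step).

    candidates: anonymized user -> list of candidate auxiliary users
    score: (anonymized user, candidate) -> structural similarity
    Returns (filtered candidate sets, users judged absent from the auxiliary data).
    """
    all_scores = [score[(a, u)] for a, cs in candidates.items() for u in cs]
    s_max, s_min = max(all_scores), min(all_scores)
    step = (s_max - s_min) / n
    # threshold vector from large to small; the last entry is pinned to s_min
    thetas = [s_max - i * step for i in range(1, n)] + [s_min]
    filtered, absent = {}, []
    for a, cs in candidates.items():
        kept = []
        for theta in thetas:
            kept = [u for u in cs if score[(a, u)] > theta]
            if kept:
                break  # some candidates pass: take them as the final set
        if kept:
            filtered[a] = kept
        else:
            absent.append(a)  # conclude a does not appear in the auxiliary data
    return filtered, absent
```

Users whose candidates all fail even the smallest threshold are reported as absent from the auxiliary data.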

Refined DA. In the first phase of De-Health, we seek a Top-K candidate set for each anonymized user. In the second phase (lines 7-9 of Algorithm 1), De-Health conducts refined DA for each anonymized user and either de-anonymizes it to some auxiliary user in its candidate set or concludes that the user does not appear in the auxiliary data. To fulfill this task, the high-level idea is to leverage the stylometric and correlation features of the users in the candidate set to train a classifier employing benchmark machine learning techniques, e.g., Support Vector Machine (SVM), Nearest Neighbor (NN), or Regularized Least Squares Classification (RLSC), similar to existing stylometric approaches [29]-[35]. (In [29]-[35], multiple benchmark machine learning based stylometric approaches are proposed to address post/passage-level author attribution; although we focus on user-level DA, those approaches can be extended to our refined DA phase.) Therefore, we do not explain existing benchmark machine learning techniques in further detail.
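As a minimal stand-in for the benchmark classifiers named above, a Nearest Neighbor rule restricted to the Top-K candidate set can be sketched as follows (the Euclidean distance over combined stylometric/structural feature vectors is an illustrative choice of this sketch):

```python
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def nn_refined_da(anon_features, candidate_features):
    """Map the anonymized user to the candidate with the closest features.
    candidate_features: candidate auxiliary user -> feature vector."""
    return min(candidate_features,
               key=lambda u: euclidean(anon_features, candidate_features[u]))
```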

Nevertheless, there is still an open problem here: by default, existing benchmark machine learning techniques perform satisfactorily on the closed-world DA problem (e.g., [29][31]). However, their performance is far from expected in open-world DA [32]. To address this issue, we present two schemes, false addition and mean-verification, which are motivated by the open-world author attribution techniques proposed by Stolerman et al. in [32].

In the false addition scheme, when de-anonymizing an anonymized user, we randomly select several users from the auxiliary data that are outside the user's candidate set, and add them to the candidate set as false users. Then, if the anonymized user is de-anonymized to a false user, we conclude that it does not appear in the auxiliary data. Otherwise, it is de-anonymized to a non-false user.
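A sketch of the false addition scheme (the pool construction, the number of false users m, and the fixed seed are illustrative choices of this sketch):

```python
import random

def false_addition(candidates, aux_users, m=3, seed=0):
    """Augment a candidate set with m random non-candidate auxiliary users.
    If the classifier later picks one of the returned false users, we
    conclude the anonymized user does not appear in the auxiliary data."""
    rng = random.Random(seed)
    pool = [u for u in aux_users if u not in candidates]
    false_users = rng.sample(pool, m)
    return list(candidates) + false_users, set(false_users)
```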

In the mean-verification scheme, we first use the trained classifier to de-anonymize the anonymized user to some user in its candidate set, assuming a closed-world DA problem. Later, we verify this DA: let the mean similarity between the anonymized user and its candidate users be computed; if the matched user's similarity exceeds this mean by at least some predefined constant, the DA is accepted; otherwise, it is rejected, i.e., we conclude that the anonymized user does not appear in the auxiliary data. Note that the verification process can also be implemented using other techniques, e.g., distractorless verification [45] or Sigma verification [32].
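A sketch of the mean-verification decision, assuming the acceptance rule requires the matched user's similarity to exceed the candidate-set mean by a predefined margin (the exact rule and the margin value are assumptions of this sketch):

```python
def mean_verify(matched_sim, candidate_sims, margin=0.05):
    """Accept the DA only if the matched user's similarity exceeds the mean
    similarity over the candidate set by at least `margin`; otherwise reject,
    i.e., conclude the user does not appear in the auxiliary data."""
    mean = sum(candidate_sims) / len(candidate_sims)
    return matched_sim - mean >= margin
```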

Remark. To the best of our knowledge, De-Health is the first user-level DA attack on online health data. In De-Health, we propose a novel approach to construct a UDA graph from the health data by systematically characterizing the interactivity correlations among different users as well as users' writing characteristics. The UDA graph further enables us to develop effective graph-based DA techniques, which can easily scale to large-scale data. Moreover, the UDA graph enables us to extract various structural features, which can be fed to benchmark machine learning techniques together with traditional stylometric features to train more effective classifiers.

The Top-K DA phase of De-Health improves the DA performance from multiple perspectives. On one hand, it significantly reduces the possible mapping space for each anonymized user (from the whole auxiliary user set to a small candidate set), and thus a more accurate classifier can be trained to de-anonymize an anonymized user, leading to improved DA performance. From the description of De-Health (Algorithm 1), it may seem that the Top-K DA could degrade DA performance if many true mappings of the anonymized users are not included in their Top-K candidate sets. However, we seek the candidate set for each anonymized user based on its structural similarities to the auxiliary users, and the auxiliary users with high structural similarity are preferentially selected as candidates, e.g., in the direct selection approach. According to our theoretical analysis in the following section, this candidate selection approach will not degrade the DA performance in practice. Furthermore, as shown in our experiments (Section V), most anonymized users' true mappings can be selected into their candidate sets when a proper K is chosen. On the other hand, since the possible mapping space is significantly reduced by the Top-K DA, the computational cost of both constructing the machine learning based classifiers and performing refined DA is reduced.

Most real world DA tasks are open-world problems. By introducing the false addition scheme and the mean-verification scheme, De-Health can address both closed-world and open-world DA issues.

IV Theoretical Analysis

In this section, we present a general theoretical analysis framework for the soundness and effectiveness of online health data DA, which can also serve as the theoretical foundation of De-Health.

IV-A Preliminary

For the convenience of analysis, we introduce some formal notation. We consider an anonymized online health dataset generated by one set of users and an auxiliary dataset generated by another set of users. Note that the two user sets are not necessarily identical. However, since we employ the auxiliary data to de-anonymize the anonymized data, we assume there is some overlap between the two user sets; otherwise, the DA is meaningless. We call an anonymized user an overlapping user if it has a true mapping in the auxiliary user set, i.e., the two accounts correspond to the same real world person.

To de-anonymize the anonymized data leveraging the auxiliary data, many features of the data will be extracted to develop a DA algorithm/model. The feature here is a general concept, which may include stylometric features, structural features, social features, and other possible features in our theoretical analysis. Thus, we define a general feature space to characterize all the possible features, attributes, and other measurements of users, and denote the features of a given user by a feature vector. Based on the features of the anonymized and auxiliary users, we construct a DA model/algorithm. For instance, De-Health can be considered one implementation of such an algorithm: it de-anonymizes users by employing structural similarity (derived from the defined structural features), stylometric features, and correlation features. Ideally, the DA algorithm works in the following manner: if an anonymized user is an overlapping user, the algorithm successfully de-anonymizes it to its true mapping; otherwise, the algorithm concludes that the user does not appear in the auxiliary data.

To design the DA algorithm, we introduce a general distance function, defined on the features of two users, which measures the distance of the two users in terms of their features. Note that the distance concept here is very general: it can be defined in terms of different metrics, e.g., the distribution similarity or the Euclidean distance between the feature vectors of two users, depending on the particular DA algorithm. For instance, when the distance is defined using the feature distribution similarity, it can be defined as a decreasing function with respect to that similarity, i.e., a higher similarity implies a smaller distance value. Using this distance function, the DA algorithm itself can also be constructed as a function, e.g., one that maps each anonymized user to the auxiliary user minimizing the feature distance. (Here, an implicit assumption is that the minimizer is unique; since a distance function is used, this assumption is intuitively reasonable in practice.)
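For concreteness, writing F(·) for a user's feature vector, φ for the distance function, and 𝒜, 𝒰 for the anonymized and auxiliary user sets (symbols local to this sketch, since the paper's own notation is not fixed here), such a nearest-feature DA rule can be written as:

```latex
\Gamma(a) \;=\; \operatorname*{arg\,min}_{u \in \mathcal{U}} \; \varphi\big(F(a),\, F(u)\big), \qquad a \in \mathcal{A}.
```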

Consider the values of the distance function over correct DAs (an anonymized user paired with its true mapping) and over incorrect DAs (an anonymized user paired with any other auxiliary user), and let their mean (expectation) values and ranges be given. Below, we analyze the re-identifiability, defined as the probability of being successfully de-anonymized, of an overlapping user, and further specify the design of a corresponding DA algorithm to achieve that re-identifiability.

Due to space limitations, we place all the proofs for theorems and corollaries in Appendix -C.

IV-B Re-Identifiability Analysis

We start the re-identifiability analysis from a simple case: given an overlapping anonymized user, we derive its re-identifiability with respect to its true mapping alone. Consider the probability that a DA algorithm can successfully de-anonymize the user to its true mapping. We have the following theorem quantifying this probability; we also specify the design of the DA algorithm in the proof.

Theorem 1.

Under the condition specified in the proof, the probability of successfully de-anonymizing the user is as quantified there.

In Theorem 1, we derived the probability of successfully de-anonymizing the user and gave the design of the DA algorithm. Based on Theorem 1, we can obtain a stronger conclusion, shown in Corollary 1, using stochastic theory, which states the asymptotic property of the DA. The corollary involves a positive integer parameter, and the same DA algorithm as specified in Theorem 1 is employed.

Corollary 1.

When the conditions specified in the proof hold, it is asymptotically almost surely (a.a.s.) that the user can be successfully de-anonymized (a.a.s. means that an event happens with probability going to 1 in the limit).

In Theorem 1, we studied the re-identifiability of de-anonymizing the user with respect to its true mapping alone. In practice, as shown in De-Health, we need to de-anonymize the user from the set of all auxiliary users. We give the re-identifiability in this general case in Corollary 2. Now, suppose the anonymized user is an overlapping user, and consider the probability that the DA algorithm (specified in the proof) can successfully de-anonymize it from the whole set of auxiliary users.

Corollary 2.

When the conditions specified in the proof hold, it is a.a.s. that the anonymized user can be successfully de-anonymized from the whole set of auxiliary users.

Corollary 2 is an even stronger conclusion than that in Corollary 1. It specifies the conditions to successfully de-anonymize an anonymized user in general.

Now, we study the re-identifiability of any subset of the anonymized users. Roughly, we consider subsets that are sufficiently large and in which every user has a true mapping in the auxiliary data; the anonymized data are re-identifiable with respect to such a subset if every user in the subset can be successfully de-anonymized. We give the probability of this event in the following theorem.

Theorem 2.

Suppose that the anonymized user set has a subset as described above (each of whose users has a true mapping in the auxiliary data). Then, when the condition specified in the proof holds, the anonymized data are re-identifiable with respect to that subset.

In Theorem 2, we derived the probability that a subset of anonymized users is re-identifiable. Similar to Corollary 2, we now derive the conditions for this to hold stochastically. We show the result in Corollary 3; in the proof, we use the same DA algorithm design as in Theorem 2.

Corollary 3.

Suppose that the anonymized user set has a subset as described above. Then, when the asymptotic conditions specified in the proof hold, it is a.a.s. that the anonymized data are re-identifiable with respect to that subset.

In Theorem 2 and Corollary 3, we derived the probability for a subset of anonymized users to be re-identifiable as well as the conditions for this to hold asymptotically. Since such a subset can be taken to be the entire set of overlapping users, those results provide the theoretical analysis for general online health data DA.

IV-C Top-K Re-Identifiability Analysis

In the previous subsection, we analyzed the probability of de-anonymizing one user or a group of users, and derived the conditions for one or a group of users to be a.a.s. re-identifiable. In DA research, in addition to exact DA, we may also be interested in understanding the probability/conditions for conducting Top-K DA. Thus, we give the Top-K re-identifiability analysis in this subsection.

Formally, suppose an anonymized user has a true mapping in the auxiliary data. Then, a correct Top-K DA of the user seeks a candidate set of K auxiliary users (in this paper, a smaller candidate set is also acceptable) that contains the user's true mapping. Consider the probability that the DA algorithm can find a correct Top-K candidate set for the user. We show the Top-K re-identifiability of one user and the conditions to have it asymptotically Top-K re-identifiable in the following theorem.

Theorem 3.

Under the conditions specified in the proof, (i) the Top-K re-identifiability of the user is quantified as in the proof; and (ii) under an additional asymptotic condition, it is a.a.s. that the user is Top-K re-identifiable.

In Theorem 3, we derived the Top-K re-identifiability of a user and the conditions to asymptotically de-anonymize the user. Now, we extend our analysis to the general scenario of Top-K DA of a set of anonymized users: a subset of anonymized users (each having a true mapping in the auxiliary data) is Top-K re-identifiable if a correct Top-K candidate set can be found for every user in it. We show the Top-K re-identifiability of such a subset and the conditions to have it asymptotically Top-K re-identifiable in the following theorem.

Theorem 4.

Under the conditions specified in the proof, (i) the Top-K re-identifiability of the subset is quantified as in the proof; and (ii) under an additional asymptotic condition, it is a.a.s. that the subset is Top-K re-identifiable.

V Experiments

In this section, we experimentally evaluate De-Health leveraging the two collected online health datasets: WebMD and HB. First, we evaluate De-Health’s performance in the closed-world DA setting, i.e., for each anonymized user, its true mapping is in the auxiliary data (training data). Then, we extend our evaluation to the more practical open-world DA setting: for each anonymized user, its true mapping may or may not appear in the auxiliary data.

V-A Closed-world DA

V-A1 Top-K DA

First, we evaluate the Top-K DA performance of De-Health. In the Top-K DA phase, we seek a candidate set for each anonymized user. We define the Top-K DA of an anonymized user to be successful/correct if the user's true mapping is included in the candidate set returned by De-Health. Note that the Top-K DA is crucial to the success and overall performance of De-Health: given a relatively large auxiliary dataset and a small K, a high success rate in this phase significantly reduces the candidate space for finding the true mapping of an anonymized user (e.g., from millions or hundreds of thousands of candidates to several hundred candidates). Then, many benchmark machine learning techniques can be employed to conduct the second-phase refined (precise) DA, since, as shown in [29]-[35], benchmark machine learning techniques achieve much better performance on a relatively small training dataset than on a large one. (In the closed-world author attribution setting, state-of-the-art machine learning based stylometric approaches have been evaluated on user sets ranging from 100-level [29] to 10K-level [30] and 100K-level [31] users, with accuracy degrading as the user set grows.)

Methodology and Setting. We partition each user's data (posts) in WebMD and HB into two parts: auxiliary data and anonymized data. Specifically, we consider three scenarios, randomly taking three different fractions of each user's data as auxiliary data and the rest as anonymized data (replacing each username with a random ID), respectively. Then, we run De-Health to identify a Top-K candidate set for each anonymized user and examine the CDF of successful Top-K DA with respect to the increase of K. For the parameters in De-Health, we assign low weights to the degree and distance similarities when computing the structural similarity. This is because, as shown in Section II, even in the UDA graph constructed from the whole WebMD/HB dataset, (i) the degree of most users is low; and (ii) the size of most identified communities is small and the UDA graph is disconnected (consisting of tens of disconnected components). After partitioning the original dataset into auxiliary and anonymized data, the degree of most users gets lower and the connectivity of the UDA graph decreases further, especially in the scenario with the smallest anonymized-data fraction (where the anonymized UDA graph consists of hundreds of disconnected components in our experiments). Thus, intuitively, the degree and distance vectors do not provide much useful information for distinguishing different users in the two leveraged datasets, and we assign them low weights. Furthermore, we set the number of landmark users to 50 (the Top-50 users with respect to degree). For the structural similarity based Top-K candidate selection, we employ the direct selection approach. Since we conduct closed-world evaluation in this subsection, the filtering process is omitted. All the experiments are run 10 times, and the results are the average of those 10 runs.
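The partitioning step of this methodology can be sketched as follows (the random-ID format and the fixed seed are illustrative choices of this sketch):

```python
import random

def partition_posts(user_posts, aux_ratio=0.5, seed=42):
    """Split each user's posts into auxiliary and anonymized parts, replacing
    the username on the anonymized side with a random ID; also return the
    ground-truth mapping, which is used only for scoring the DA results."""
    rng = random.Random(seed)
    aux, anon, truth = {}, {}, {}
    for user, posts in user_posts.items():
        posts = list(posts)
        rng.shuffle(posts)
        cut = int(len(posts) * aux_ratio)
        aux[user] = posts[:cut]
        anon_id = 'anon_%06d' % rng.randrange(10 ** 6)
        anon[anon_id] = posts[cut:]
        truth[anon_id] = user
    return aux, anon, truth
```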

Fig. 3: CDF of correct Top-K DA.

Results. We show the CDF of successful Top-K DA for different ranges of K in Fig. 3. We have the following observations.

First, as K increases, the CDF of successful Top-K DA increases. The reason is evident: when K increases, the probability of including the true mapping of an anonymized user in its Top-K candidate set also increases.

Second, comparing the Top-K DA performance of De-Health on WebMD and HB, De-Health performs better on WebMD than on HB. For instance, under the same auxiliary-data scenario, De-Health finds the correct Top-500 candidate sets for a larger fraction of WebMD users than of HB users. This is due to the fact that the HB dataset (388,398 users) has many more users than the WebMD dataset (89,393 users), and thus the correct Top-K candidate set can be found for a WebMD user with a higher probability under the same experimental setting.

Third, the size of the available dataset (either the auxiliary or the anonymized data) is important for constructing the UDA graph and thus has an explicit impact on the Top-K DA performance. For instance, when de-anonymizing WebMD, De-Health's correct Top-500 rate differs clearly across the data-partition scenarios. In the scenario with the largest auxiliary-data fraction, only a small portion of the original dataset serves as anonymized data; then, only a very sparse anonymized UDA graph consisting of hundreds of disconnected components can be constructed, and the Top-K DA performance is clearly degraded.

Overall, De-Health is powerful in conducting Top-K DA on large-scale datasets, especially when sufficient data appear in the auxiliary/anonymized data. By seeking a Top-K candidate set for each anonymized user, it decreases the DA space for a user from 100K-level to 100-level with high accuracy. This is also very meaningful for the follow-up refined DA, as it enables the development of an effective machine learning based classifier.

V-A2 Refined DA

We have demonstrated the effectiveness of the Top-K DA of De-Health on large-scale datasets. Now, we evaluate the refined DA phase of De-Health. As indicated in Section III, the refined DA can be implemented by training a classifier employing existing benchmark machine learning techniques similar to those in [29]-[35]. However, a large fraction of WebMD users and HB users have fewer than 20 (resp., 40) posts, and the average length of those posts is short (127.59 words for WebMD posts and 147.24 words for HB posts, respectively). Therefore, to enable the application of machine learning techniques to train a meaningful classifier (as indicated in [29][31][33][35], there is a minimum requirement on the number of training words, e.g., 4500 or 7500 words, for obtaining a meaningful classifier), we conduct this group of evaluations on small-scale datasets extracted from the WebMD dataset, which is sufficient to show the performance of De-Health.

Methodology and Settings. We construct the auxiliary (training) and anonymized (testing) data for two evaluation settings. In the first setting, we randomly select 50 users each with 20 posts. Then, for the posts of each user, we take 10 for training (auxiliary data) and the other 10 (anonymized) for testing. In the second setting, we randomly select 50 users each with 40 posts. Then, we take 20 posts from each user for training (auxiliary data) and take the remaining data for testing (anonymized). For each setting, we conduct 10 groups of evaluations. The reported results are the average of those 10 evaluations.

For the parameters in De-Health, the default weight settings are the same as before, and the employed Top-K candidate set selection approach is direct selection. In the refined DA phase, the machine learning techniques employed for training the classifier are the K-Nearest Neighbors (KNN) algorithm [31] and the Sequential Minimal Optimization (SMO) Support Vector Machine [32]. Note that our settings and evaluations can be extended directly to other machine learning techniques. The features used to train the classifier are the stylometric and structural features extracted from the auxiliary data (as defined in Section II).

We also compare De-Health with a DA method similar to traditional stylometric approaches [29]-[37]: leveraging the same feature set as De-Health, we train a classifier using KNN and SMO without our Top-K DA phase and employ the classifier for DA. We denote this comparison method Stylometry (although it includes correlation features in addition to stylometric features). Effectively, Stylometry is equivalent to the second phase (refined DA) of De-Health.

Fig. 4: DA accuracy (closed-world).

Results. The accuracy of a DA algorithm is defined as the ratio of the number of anonymized users that have true mappings in the auxiliary data and are successfully de-anonymized by the algorithm to the total number of anonymized users that have true mappings in the auxiliary data.

We demonstrate the DA accuracy of De-Health and Stylometry in Fig. 4, where K indicates the setting of the Top-K DA phase in De-Health, and '-10' (e.g., SMO-10) and '-20' (e.g., SMO-20) represent the evaluation settings with 10 and 20 posts of each user for training/testing, respectively. From the results, SMO performs better than KNN with respect to de-anonymizing the employed WebMD datasets.

De-Health significantly outperforms Stylometry, e.g., in the SMO-20 setting, De-Health successfully de-anonymizes considerably more users than Stylometry. The reasons are twofold: (i) for Stylometry, given 20 (resp., 10) posts and an average WebMD post length of 127.59 words, the training data amount to 2551.8 (resp., 1275.9) words on average, which might be insufficient for training an effective classifier to de-anonymize an anonymized user; and (ii) as expected, this demonstrates that De-Health's Top-K DA phase is very effective: it clearly reduces the DA space (from 50 to 5) while keeping a satisfying successful Top-K DA rate (consistent with the results in the Top-K DA evaluation).

Interestingly, De-Health has better accuracy for a smaller K than for a larger K. Although a large K implies a high successful Top-K DA rate, it cannot guarantee better refined (precise) DA accuracy in the second phase, especially when the training data for the second phase (as for Stylometry) are insufficient. On the other hand, a smaller K is more likely to induce better DA performance since it reduces more of the possible DA space. Therefore, when less data are available for training, the Top-K DA phase is more likely to dominate the overall DA performance.

V-B Open-world DA

Now, we evaluate De-Health in a more challenging setting where the anonymized user may or may not have a true mapping in the auxiliary data, i.e., open-world DA.

V-B1 Top-K DA

We start the open-world evaluation by examining the effectiveness of the Top-K DA of De-Health.

Methodology and Settings. Leveraging the WebMD and HB datasets, we construct three open-world DA scenarios in which the anonymized data and the auxiliary data have the same number of users but different overlapping user ratios. (Given the number of users in WebMD/HB and a target overlapping user ratio, it is straightforward to determine the numbers of overlapping and non-overlapping users in the auxiliary/anonymized dataset.) Then, we employ De-Health to examine the Top-K DA performance in each scenario with the default settings: for each overlapping user, take half of its data (posts) for training and the other half for testing; assign low weights to the degree and distance similarities (for the same reason as explained before); keep the other parameters as in the closed-world evaluation; and employ direct selection for the Top-K candidate approach. All the evaluations are repeated 10 times, and the results are the average of those 10 runs.

Fig. 5: CDF of correct Top-K DA (open-world).

Results. We show the Top-K DA performance for different ranges of K in Fig. 5. First, similar to the closed-world setting, the CDF of successful Top-K DA increases with K, since the true mapping of an anonymized user (if it exists) is more likely to be included in its Top-K candidate set for a large K. Second, De-Health has better Top-K DA performance when more users are shared between the anonymized data (graph) and the auxiliary data (graph). For instance, when de-anonymizing WebMD, the successful Top-500 DA rate is higher under a larger overlapping user ratio. This is because a higher overlapping user ratio implies more common users between the anonymized and auxiliary data, followed by higher structural similarity between the anonymized and auxiliary UDA graphs; thus, De-Health can find the correct Top-K candidate sets for more users (which are determined by the users' structural similarities). Third, comparing closed-world (Fig. 3) and open-world (Fig. 5) Top-K DA, better performance is achieved in the closed-world setting; the reason is the same as for the second observation. Finally, under the open-world setting, De-Health can still achieve a satisfying Top-K DA performance (compared to the closed-world setting, a larger K might be necessary), and thus significantly reduces the possible DA space for an anonymized user.

V-B2 Refined DA

Following the Top-K DA, we evaluate the refined DA performance of De-Health in the open-world setting. For the same reason as analyzed before, we conduct this group of evaluations on small WebMD datasets, which is again sufficient to show the performance of De-Health.

Methodology and Settings. We construct an anonymized dataset and an auxiliary dataset such that (i) each dataset has 100 users and each user has 40 posts; (ii) the two datasets share a fixed overlapping user ratio; and (iii) for each overlapping user, half of its posts appear in the anonymized data while the others appear in the auxiliary data. Taking the same approach, we construct two other pairs of anonymized and auxiliary datasets that differ only in their overlapping user ratios.

For De-Health, the default settings are as follows: the weight and landmark settings are the same as before; the filtering step is enabled; the Top-K candidate selection approach is direct selection; the leveraged features are the stylometric and structural features defined in Section II; the employed machine learning techniques are KNN and SMO; and after classification, we apply the mean-verification scheme. We also compare De-Health with Stylometry (which can be considered equivalent to the second phase of De-Health). All the experiments are run 10 times, and the results are the average of those 10 runs.

(a) DA accuracy
(b) FP rate
Fig. 6: DA accuracy and FP rate (open-world).

Results. We report the DA accuracy and False Positive (FP) rate in Fig. 6, where the labels indicate the overlapping user ratios. First, in the open-world setting, De-Health again significantly outperforms Stylometry with respect to both DA accuracy and FP rate: for instance, with SMO, De-Health achieves clearly higher DA accuracy and a clearly lower FP rate than Stylometry. For Stylometry, insufficient training data is one reason for its poor performance; in addition, in the open-world DA setting, non-overlapping users, which can be considered noise, further degrade its performance. For De-Health, two factors are responsible for its better performance: (i) the Top-K DA reduces the possible DA space while preserving a relatively high success rate, and thus high DA accuracy is achieved; and (ii) the mean-verification scheme eliminates FP DAs and thus reduces the FP rate. Second, similar to the closed-world scenario, De-Health with a smaller K has better DA accuracy (though not necessarily a lower FP rate) than with a larger K. The reason is the same as discussed before: when less data are available for training in the second phase, the Top-K DA is more likely to dominate the overall DA performance. From the figure, we also observe that the SMO-trained classifier performs better than the KNN-trained classifier in most cases.

VI Real Identity Identification

Leveraging De-Health, an adversary can obtain the medical/health information of online health service users, e.g., users of WebMD and HB. In this section, on top of the DA results of De-Health, we present a linkage attack framework to link that medical/health information to real world people.

VI-A Linkage Attack Framework

In the designed linkage attack framework, we mainly conduct username-based linkage and avatar-based linkage.

Username-based Linkage. For most online health services, users’ usernames are publicly available. In addition, many other social attributes may be publicly available; e.g., the gender, join date, and location of users are available on HB. In [48], Perito et al. empirically demonstrated that Internet users tend to choose a small number of correlated usernames and use them across many online services. They also developed a model to characterize the entropy of a given Internet username and demonstrated that a username with high (resp., low) entropy is very unlikely (resp., likely) to be picked by multiple users. Motivated by this fact, we implement a tool, named NameLink, to semi-automatically connect usernames on one online health service with those on other Internet services, e.g., Twitter.

NameLink works in the following manner: (1) collect the usernames of the users of an online health service; (2) compute the entropy of the usernames using the technique in [48] and sort them in decreasing entropy order; (3) perform general and/or targeted online searches using the sorted usernames, leveraging Selenium, which automates browsers and imitates a user’s mouse clicks, drags, scrolls, and many other input events. For general searches, NameLink searches a username with/without other attributes (e.g., location), e.g., “jwolf6589 + California”; for targeted searches, NameLink additionally specifies a target Internet service, e.g., “jwolf6589 + Twitter”; and (4) after obtaining the search results, NameLink filters unrelated results based on predefined heuristics.

The main functionalities of NameLink include: (1) information aggregation. For instance, not much information is associated with WebMD users, whereas rich information is associated with HB users (e.g., location) and BoneSmart users (e.g., age) [52]; by linking the users across those three services, we may obtain richer information about WebMD users; (2) real people linkage. For instance, for WebMD usernames with high entropy, e.g., “jwolf6589”, we may try to link them to social network services, e.g., Twitter, and thus reveal their owners’ true identities; and (3) cross-validation. For each user, we may link her to a real world person using multiple techniques, e.g., the username-based linkage and the following avatar-based linkage. The linkage results from different techniques can therefore further enrich the obtained information as well as cross-validate the search results, and thus improve the linkage accuracy.
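The entropy-based ordering in step (2) can be sketched as follows. Perito et al. [48] fit a Markov-chain language model over a large username corpus; the block below is a deliberately simplified unigram version with a toy corpus, so the model, corpus, and numbers are all illustrative assumptions rather than the actual NameLink implementation.

```python
import math
from collections import Counter

def username_entropy(name, char_probs, floor=1e-6):
    """Rough information content of a username, in bits: the sum of
    -log2 p(c) over its characters (a simplification of the
    Markov-chain model of Perito et al. [48])."""
    return sum(-math.log2(char_probs.get(c, floor)) for c in name.lower())

# Toy character distribution estimated from a tiny username corpus;
# a real attack would fit an n-gram model on millions of usernames.
corpus = "johnsmith mary2020 jwolf6589 healthfan tomtom"
counts = Counter(corpus.replace(" ", ""))
total = sum(counts.values())
char_probs = {c: n / total for c, n in counts.items()}

# Rare characters (here, digits) push entropy up: high-entropy names
# like "jwolf6589" are likely unique to one person, so NameLink
# searches them first.
names = ["tom", "jwolf6589"]
for n in sorted(names, key=lambda n: username_entropy(n, char_probs),
                reverse=True):
    print(n, round(username_entropy(n, char_probs), 1))
```

Sorting by this score front-loads the usernames most likely to identify a single real person, which is exactly why step (3) processes them in decreasing entropy order.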

Avatar-based Linkage. Many online health services, e.g., WebMD, allow users to choose their own avatars, and many users take this option by uploading an avatar without being aware of the privacy implications. However, as shown in [49], those photos may cause serious privacy leakage. The reason is that a significant number of users upload the same photo/avatar across different Internet services (websites). Similar to NameLink, we develop another semi-automatic tool, named AvatarLink, to link the users of one online health service to other Internet services, e.g., Facebook, Twitter. AvatarLink generally follows the same working procedure as NameLink except for the search engine, which takes either an image URL or a user-uploaded image file as the search key. AvatarLink can also fulfill the same functionalities as NameLink, i.e., information aggregation, real people linkage, and cross-validation.
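AvatarLink itself delegates matching to a reverse image search engine. As a self-contained illustration of the underlying idea, that re-uploads of the same avatar can be matched across services, the sketch below uses a tiny average hash (aHash) over a grayscale pixel grid. The pixel matrices and the 4x4 size are fabricated for the example; this is not AvatarLink's actual pipeline.

```python
def average_hash(pixels):
    """Average hash (aHash) of a grayscale image given as a 2D list:
    each pixel becomes 1 if it is >= the mean intensity, else 0.
    Re-encoded copies of the same avatar yield near-identical hashes."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    return [1 if p >= mean else 0 for p in flat]

def hamming(h1, h2):
    """Number of differing bits between two hashes."""
    return sum(a != b for a, b in zip(h1, h2))

# Two copies of the "same" 4x4 avatar, one slightly brightened (as
# re-uploading to another site might do), plus one unrelated image.
avatar_webmd = [[10, 200, 15, 190], [220, 20, 210, 30],
                [12, 205, 18, 198], [215, 25, 208, 22]]
avatar_twitter = [[p + 5 for p in row] for row in avatar_webmd]
unrelated = [[128] * 4 for _ in range(4)]

same = hamming(average_hash(avatar_webmd), average_hash(avatar_twitter))
diff = hamming(average_hash(avatar_webmd), average_hash(unrelated))
print(same, diff)  # prints 0 8: identical hash vs. a distant one
```

A small Hamming distance flags two profiles as probably belonging to the same person, mirroring what the reverse image search engine does at scale.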

VI-B Evaluation

We validate the linkage attack framework using the collected WebMD dataset, since all its users have publicly available usernames and many of them have publicly available avatars. Note that the employed WebMD dataset is collected from a real world online health service (and thus generated by real people). Therefore, although we are able to do so, it might be illegal, or at least improper, to employ NameLink and AvatarLink to conduct a large-scale linkage attack. When linking the medical/health information to real world people, we therefore only show a proof-of-concept attack and its results.

Objectives and Settings. Considering that not much information is associated with WebMD users, we have two objectives for our evaluation: (1) information aggregation, i.e., enriching the information of WebMD users; and (2) linking WebMD users to real world people, revealing their identities, and thus compromising their medical/health privacy.

To achieve the first objective, we employ NameLink for targeted linkage, with HB, which has rich user information, as the target service. Since we have both a WebMD dataset and an HB dataset, we limit the linkage to users within the available datasets, which allows us to perform it offline. Note that this is a proof-of-concept attack and it can be directly extended to large scale.

To achieve the second objective, we employ AvatarLink to link WebMD users to some well known social network services, e.g., Facebook, Twitter, and LinkedIn. There are 89,393 users in the WebMD dataset, which is too many for a proof-of-concept linkage attack. Thus, we filter avatars (i.e., users) according to four conditions: (1) exclude default avatars; (2) exclude avatars depicting non-human objects, such as animals, natural scenes, and logos; (3) exclude avatars depicting fictitious persons; and (4) exclude avatars with only kids in the picture. Consequently, we have 2805 avatars left. When using AvatarLink to perform the linkage attack, the employed search engine is Google Reverse Image Search. To avoid violating Google’s privacy and security policies, we spread the search task over the 2805 avatars across five days (561 avatars/day, on average), and the time interval between two consecutive searches is at least 1 minute.
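The rate-limited schedule above (2805 avatars, five days, at least one minute between consecutive searches) can be sketched as a simple batching calculation; the numbers come from the text, while the scheduler code itself is an illustrative assumption.

```python
def schedule(items, days=5, min_interval_s=60):
    """Split a search workload into `days` batches and report the
    per-day load and the minimum time budget per day, mirroring the
    rate limiting used for the 2805-avatar proof-of-concept attack."""
    per_day = -(-len(items) // days)  # ceiling division
    batches = [items[i:i + per_day] for i in range(0, len(items), per_day)]
    budget_s = per_day * min_interval_s  # time needed at >= 1 search/min
    return batches, per_day, budget_s

avatars = [f"avatar_{i}.jpg" for i in range(2805)]
batches, per_day, budget_s = schedule(avatars)
print(len(batches), per_day, budget_s / 3600)  # 5 days, 561/day, ~9.35 h/day
```

In a live attack each batch item would trigger one reverse image search followed by a sleep of at least `min_interval_s` seconds.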

Results and Findings. A challenging task in understanding and analyzing the results returned by NameLink and AvatarLink is validating their accuracy. To ensure precision as far as possible, we manually validate all the results and only preserve those with high confidence. Specifically, for the results returned by NameLink, in addition to using the technique in [48] to filter out results with low-entropy usernames, we manually compare the users’ posts on the two websites with respect to writing style and semantics, as well as the users’ activity patterns, e.g., post times. Interestingly, many linked users post the same description of their medical conditions on both websites to seek suggestions. For the results returned by AvatarLink, we manually compare the person in the avatar with the person in the found picture, and only results in which we are confident are preserved.
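The manual comparison of posts by writing style can be partially automated. A minimal proxy, which we sketch here as an assumption rather than the paper's validation procedure, is the cosine similarity [44] between character 3-gram profiles of two users' posts; linked pairs who repost the same condition description score high.

```python
import math
from collections import Counter

def ngram_profile(text, n=3):
    """Character n-gram frequency profile of a user's concatenated posts."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine(p, q):
    """Cosine similarity between two n-gram profiles ([44])."""
    dot = sum(p[g] * q[g] for g in p if g in q)
    norm = (math.sqrt(sum(v * v for v in p.values()))
            * math.sqrt(sum(v * v for v in q.values())))
    return dot / norm if norm else 0.0

# Fabricated posts: a linked pair often posts the same condition
# description on both sites, versus an unrelated user.
webmd_post = "I have had chronic lower back pain for three years"
hb_post = "I have had chronic lower back pain for 3 years now"
other_post = "Looking for advice on toddler sleep schedules"

print(cosine(ngram_profile(webmd_post), ngram_profile(hb_post)))    # high
print(cosine(ngram_profile(webmd_post), ngram_profile(other_post)))  # low
```

Such a score could pre-rank candidate pairs so that only high-similarity ones reach the manual review step.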

Finally, using NameLink, we successfully link 1676 WebMD users to HB users; thus, those users’ medical records and other associated information can be combined to provide us (or adversaries) more complete knowledge about them. Using AvatarLink, we successfully link 347 of the 2805 target WebMD users to real world people through well known social network services (e.g., Facebook, Twitter, LinkedIn, and Google+). Among the 347 linked WebMD users, many can be linked to two or more social network services, and, leveraging the Whitepages service [50], detailed social profiles of most of them can be obtained. More interestingly, the WebMD users linked to HB and the WebMD users linked to real people have 137 users in common. This implies that information aggregation and linkage attacks are powerful in compromising online health service users’ privacy. Overall, we can acquire most of the 347 users’ full names, medical/health information, birthdates, phone numbers, addresses, jobs, relatives, friends, co-workers, etc. Thus, those users’ privacy suffers a serious threat. For example, from the medical/health records of some users, we can learn their sexual orientation, relationships, and related infectious diseases. More concerning, some of the users even have serious mental/psychological problems and show suicidal tendencies.

VII Discussion

De-Health: Novelty versus Limitation. As shown in the experiments (Section V), the Top-K DA of De-Health is effective in reducing the DA space (from a 100K-order space to a 100-order space) while preserving satisfactory precision (i.e., the true mapping of an anonymized user is included in the candidate set). Further, when the training data for constructing a powerful classifier are insufficient, such DA space reduction is all the more helpful for De-Health to achieve promising DA accuracy. Therefore, the Top-K DA is stable and robust. The refined DA phase, technically, can be implemented with existing benchmark machine learning techniques. Nevertheless, because the Top-K DA phase reduces the possible DA space by several orders of magnitude, we can build an effective classifier even with insufficient training data. Therefore, the Top-K DA together with the refined DA leads to the promising performance of De-Health in both closed-world and open-world scenarios.
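The two-phase structure described above can be summarized in a short sketch: phase one keeps only the K best-scoring candidates, and phase two classifies within that reduced set. The similarity scores and the placeholder classifier below are illustrative assumptions standing in for De-Health's actual feature-based models.

```python
def top_k_candidates(scores, k=3):
    """Phase 1 (Top-K DA): keep only the k highest-scoring candidate
    users, shrinking the DA space from the full user set to k names."""
    return sorted(scores, key=scores.get, reverse=True)[:k]

def refined_da(scores, k=3):
    """Phase 2 (refined DA): run the classifier, here a placeholder
    argmax, only on the candidate set produced by Top-K DA."""
    candidates = top_k_candidates(scores, k)
    return max(candidates, key=scores.get)

# Toy similarity scores of one anonymized user against 6 known users;
# a real deployment scores against hundreds of thousands.
scores = {"alice": 0.2, "bob": 0.7, "carol": 0.9,
          "dave": 0.1, "erin": 0.8, "frank": 0.3}
print(top_k_candidates(scores))  # prints ['carol', 'erin', 'bob']
print(refined_da(scores))        # prints carol
```

Because phase two sees only K candidates instead of the whole user population, even a classifier trained on scarce data can discriminate among them.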

It is important to note that we do not apply advanced anonymization techniques to the health data when evaluating the performance of De-Health. This is mainly because, to the best of our knowledge, no feasible or dedicated anonymization technique is available for large-scale online health data. Actually, developing proper anonymization techniques for large-scale online health data is a challenging open problem. The challenges are that (1) the data volume is very large, e.g., WebMD has millions of users that generate millions to billions of health/medical posts every month; (2) unlike well-structured traditional medical records, online health data are generated by millions of different users, and organizing such unstructured (complex) data is a challenging task; and (3) unlike other kinds of data, health/medical data carry sensitive and important information, so a proper health data anonymization scheme should appropriately preserve the data’s utility (e.g., preserve the accurate description of a disease). We take developing effective online health data anonymization techniques as future work.

Re-identifiability Analysis: Generic versus Loose. In our theoretical analysis of online health data DA, we quantify the impacts of different data features, including local interactivity features, global interactivity features, associated attributes, and stylometric features, on the anonymity of the data. We also derive the conditions and probabilities for successfully de-anonymizing one user or a group of users in both Top-K and exact DA. However, to guarantee maximum generality, we do not specify the exact distributions of the considered features. In reality, it is possible to obtain tighter conditions and probability bounds when the feature distributions are specified (evidently, at the cost of sacrificing the generality of the theoretical analysis), e.g., assuming the local interactivity features follow a Poisson distribution. Therefore, studying the characteristics of different features and deriving the analysis under specific distributions is another future work.
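As one concrete instance of such a specialization (our illustration under the hypothetical Poisson assumption, not a derivation from our generic analysis), suppose a scalar local interactivity feature of two independent users is drawn i.i.d. as $X, Y \sim \mathrm{Poisson}(\lambda)$. The probability that they collide on the same feature value has a closed form:

```latex
\Pr[X = Y] \;=\; \sum_{k=0}^{\infty} \left( \frac{e^{-\lambda}\lambda^{k}}{k!} \right)^{2}
\;=\; e^{-2\lambda}\, I_{0}(2\lambda)
\;\sim\; \frac{1}{2\sqrt{\pi\lambda}} \quad (\lambda \to \infty),
```

where $I_0$ is the modified Bessel function of the first kind. The collision probability thus decays as $\lambda$ grows, i.e., users with richer interaction histories are less likely to be confused with one another, which would tighten the generic DA success conditions under this distributional assumption.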

Online Health Data Privacy and Policies. Based on our analysis and experimental results (especially the results of the linkage attack), online health data privacy suffers from serious threats. Unfortunately, there is no effective solution for protecting the privacy of online health service users from either the technical or the policy perspective. Therefore, our results are expected to shed light in two areas: (1) our De-Health and linkage attack frameworks and evaluation results demonstrate to users, data owners, researchers, and policy makers the concrete attacks and the corresponding serious privacy leakage; and (2) our theoretical analysis provides researchers and policy makers a clear understanding of the impact that different features have on data anonymity, and thus helps them develop effective online health data anonymization techniques and proper privacy policies.

VIII Related Work

Hospital/Structured Data Anonymization and DA. To anonymize the claims data used in the Heritage Health Prize (HHP) competition and ensure that they meet the HIPAA Privacy Rule, Emam proposed several anonymization methods based on a risk threshold [18]. Fernandes et al. developed an anonymous psychiatric case register, the Clinical Record Interactive Search (CRIS), based on the EHRs generated from the South London and Maudsley NHS Trust (SLaM) [19]. For the scenario of statistical health information release, Gardner et al. developed SHARE, which can release the information in a differentially private manner [20]. To defend against the re-assembly attack, Sharma et al. proposed DAPriv, an encryption-based decentralized architecture for protecting the privacy of medical data [21]. In [22], Emam et al. systematically evaluated existing DA attacks to structured health data. A comprehensive survey on existing privacy-preserving structured health data publishing techniques (45+) was given by Gkoulalas-Divanis et al. in [23].

Online Health Data. In [13], Nie et al. sought to bridge the vocabulary gap between health seekers and online healthcare knowledge. Another similar effort is [14], where Luo and Tang developed iMed, an intelligent medical Web search engine. Along the line of analyzing users’ behavior in searching, Cartright et al. studied the intentions and attention in exploratory health search [15] and White and Horvitz studied the onset and persistence of medical concerns in search logs [16]. Nie et al. studied automatic disease inference in [17].

Health Data Policy. In [24], Barth-Jones re-examined the ‘re-identification’ attack of Governor William Weld’s medical information. In [25], Señor et al. conducted a review of free web-accessible Personal Health Record (PHR) privacy policies. In [26], McGraw summarized concerns with the anonymization standard and methodologies under the HIPAA regulations. In [27], Hripcsak et al. summarized the ongoing gaps and challenges of health data use, stewardship, and governance, along with policy suggestions. In [28], Emam et al. analyzed the key concepts and principles for anonymizing health data while ensuring it preserves utility for meaningful analysis.

Stylometric Approaches. Stylometric techniques have been widely used for compromising the anonymity of online users. In [29], Abbasi and Chen proposed the use of stylometric analysis techniques to identify authors based on writing style. In [30], Koppel et al. studied the authorship attribution problem in the wild. Later, in [31], Narayanan et al. studied the feasibility of Internet-scale author identification. Considering that the closed-world setting does not hold in many real world applications, in [32], Stolerman et al. presented a Classify-Verify framework for open-world author identification, which performs better in adversarial settings than traditional author classification. In [33], Afroz et al. studied the performance of stylometric techniques when faced with authors who intentionally obfuscate their writing style or attempt to imitate that of other authors. In [34], Afroz et al. adapted stylometry-based authorship attribution to underground forums and proposed a general multiple-author detection algorithm. In another application of stylometric techniques [35], Caliskan-Islam et al. de-anonymized programmers via code stylometry. To defend against stylometry-based author attribution, McDonald et al. presented Anonymouth [36]. In [37], Brennan et al. proposed a framework for creating adversarial passages, which includes obfuscation, imitation, and translation techniques.

IX Conclusion

In this paper, we study the privacy of online health data. Our main conclusions are four-fold. First, we present a novel two-phase online health data DA attack, named De-Health, which can be applied to both closed-world and open-world DA settings. Second, we conduct the first theoretical analysis on the soundness and effectiveness of online health data DA. Our analysis explicitly shows the conditions and probabilities of successfully de-anonymizing one user or a group of users in both exact DA and Top- DA. Third, leveraging two large real world online health datasets, we validate the performance of De-Health. De-Health can significantly reduce the DA space while preserving high accuracy. Even in the scenario where training data are insufficient, De-Health still achieves promising DA accuracy. Finally, we present a linkage attack framework that can link online health data to real world people and thus clearly demonstrate the vulnerability of existing online health data. Our findings have meaningful implications to researchers and policy makers in helping them understand the privacy vulnerability of online health data and develop effective anonymization techniques and proper privacy policies.

References

  • [1] S. Fox and M. Duggan, “Health Online 2013”, Pew Research Center, Survey, 2013.
  • [2] “Online Health Research Eclipsing Patient-Doctor Conversations”, Makovsky Health and Kelton, Survey, 2013.
  • [3] C. Sherman, “Curing Medical Information Disorder”, http://searchenginewatch.com/showPage.html?page=3556491, 2005.
  • [4] WebMD, http://www.webmd.com/.
  • [5] WebMD WiKi, https://en.wikipedia.org/wiki/WebMD.
  • [6] WebMD Annual Report, http://www.sec.gov/Archives/edgar/data /1326583/000119312515070081/d825668d10k.htm, 2014.
  • [7] HealthBoards, http://www.healthboards.com/.
  • [8] HealthBoards WiKi, https://en.wikipedia.org/wiki/HealthBoards.
  • [9] HealthData, http://www.healthdata.gov/.
  • [10] Interactive Health Data Application, http://www.ahw.gov.ab.ca/IHDA_Retrieval/.
  • [11] US HIPAA, http://www.hhs.gov/ocr/privacy/.
  • [12] PatientsLikeMe, https://www.patientslikeme.com/.
  • [13] L. Nie, Y.-L. Zhao, M. Akbari, J. Shen, and T.-S. Chua, “Bridging the Vocabulary Gap between Health Seekers and Healthcare Knowledge”, IEEE Transactions on Knowledge and Data Engineering (TKDE), Vol. 27, No. 2, pp. 396-409, 2014.
  • [14] G. Luo and C. Tang, “On Iterative Intelligent Medical Search”, ACM SIGIR, 2008.
  • [15] M.-A. Cartright, R. W. White, and E. Horvitz, “Intentions and Attention in Exploratory Health Search”, ACM SIGIR, 2011.
  • [16] R. W. White and E. Horvitz, “Studies of the Onset and Persistence of Medical Concerns in Search Logs”, ACM SIGIR, 2012.
  • [17] L. Nie, M. Wang, L. Zhang, S. Yan, B. Zhang, and T.-S. Chua, “Disease Inference from Health-Related Questions via Sparse Deep Learning”, IEEE Transactions on Knowledge and Data Engineering (TKDE), Vol. 27, No. 8, pp. 2107-2119, 2015.
  • [18] K. E. Emam, L. Arbuckle, G. Koru, B. Eze, L. Gaudette, E. Neri, S. Rose, and J. Howard, “De-identification Methods for Open Health Data: The Case of the Heritage Health Prize Claims Dataset”, J Med Internet Res, 2012.
  • [19] A. C. Fernandes, D. Cloete, M. TM Broadbent, R. D. Hayes, C.-K. Chang, R. G. Jackson, A. Roberts, J. Tsang, M. Soncul, J. Liebscher, R. Stewart, and F. Callard, “Development and Evaluation of a De-identification Procedure for a Case Register Sourced from Mental Health Electronic Records”, BMC Medical Informatics and Decision Making, 2013.
  • [20] J. Gardner, L. Xiong, Y. Xiao, J. Gao, A. R. Post, X. Jiang, and L. Ohno-Machado, “SHARE: System Design and Case Studies for Statistical Health Information Release”, J Am Med Inform Assoc, Vol. 20, pp. 109-116, 2013.
  • [21] R. Sharma, D. Subramanian, S. N. Srirama, “DAPriv: Decentralized Architecture for Preserving the Privacy of Medical Data”, arXiv:1410.5696, 2014.
  • [22] K. E. Emam, E. Jonker, L. Arbuckle, and B. Malin, “A Systematic Review of Re-Identification Attacks on Health Data”, PloS ONE, 2011.
  • [23] A. Gkoulalas-Divanis, G. Loukides, and J. Sun, “Publishing Data from Electronic Health Records while Preserving Privacy: A Survey of Algorithms”, Journal of Biomedical Informatics, No. 50, pp. 4-19, 2014.
  • [24] D. C. Barth-Jones, “The ‘Re-Identification’ of Governor William Weld’s Medical Information: A Critical Re-Examination of Health Data Identification Risks and Privacy Protections, Then and Now”, http://dx.doi.org/10.2139/ssrn.2076397, 2012.
  • [25] I. C. Señor, J. L. Fernández-Alemán, and A. Toval, “Are Personal Health Records Safe? A Review of Free Web-Accessible Personal Health Record Privacy Policies”, J Med Internet Res, 2012.
  • [26] D. McGraw, “Building Public Trust in Uses of Health Insurance Portability and Accountability Act De-identified Data”, J Am Med Inform Assoc, 2013.
  • [27] G. Hripcsak, M. Bloomrosen, P. Flatley Brennan, et al., “Health Data Use, Stewardship, and Governance: Ongoing Gaps and Challenges: A Report from AMIA’s 2012 Health Policy Meeting”, J Am Med Inform Assoc, 2014.
  • [28] K. E. Emam, S. Rodgers, and B. Malin, “Anonymising and Sharing Individual Patient Data”, BMJ, 2015.
  • [29] A. Abbasi and H. Chen, “Writeprints: A Stylometric Approach to Identity-Level Identification and Similarity Detection in Cyberspace”, ACM Transactions on Information Systems, Vol. 26, No. 2, Article 7, pp. 1-29, 2008.
  • [30] M. Koppel, J. Schler, and E. Bonchek-Dokow, “Authorship Attribution in the Wild”, Language Resources and Evaluation, Vol. 45, No. 1, pp. 83-94, 2011.
  • [31] A. Narayanan, H. Paskov, N. Gong, J. Bethencourt, E. Stefanov, E. C. R. Shin, and D. Song, “On the Feasibility of Internet-Scale Author Identification”, IEEE S&P, 2012.
  • [32] A. Stolerman, R. Overdorf, S. Afroz, and R. Greenstadt, “Classify, but Verify: Breaking the Closed-World Assumption in Stylometric Authorship Attribution”, The Tenth Annual IFIP WG 11.9 International Conference on Digital Forensics, 2014.
  • [33] S. Afroz, M Brennan, and R. Greenstadt, “Detecting Hoaxes, Frauds, and Deception in Writing Style Online”, IEEE S&P, 2012.
  • [34] S. Afroz, A. Caliskan-Islam, A. Stolerman, R. Greenstadt, and D. McCoy, “Doppelgänger Finder: Taking Stylometry To the Underground”, IEEE S&P, 2014.
  • [35] A. Caliskan-Islam, R. Harang, A. Liu, A. Narayanan, C. Voss, F. Yamaguchi, and R. Greenstadt, “De-anonymizing Programmers via Code Stylometry”, USENIX Security, 2015.
  • [36] A. W. E. McDonald, S. Afroz, A. Caliskan, A. Stolerman, and R. Greenstadt, “Use Fewer Instances of the Letter “i”: Toward Writing Style Anonymization”, Proceedings of the 12th International Conference on Privacy Enhancing Technologies (PETS), 2012.
  • [37] M. Brennan, S. Afroz, and R. Greenstadt, “Adversarial Stylometry: Circumventing Authorship Recognition to Preserve Privacy and Anonymity”, ACM Transactions on Information and System Security, Vol. 15, No. 3, pp. 1-21, 2012.
  • [38] A. Narayanan and V. Shmatikov, “De-anonymizing Social Networks”, IEEE S&P, 2009.
  • [39] S. Nilizadeh, A. Kapadia, and Y.-Y. Ahn, “Community-Enhanced De-anonymization of Online Social Networks”, ACM CCS, 2014.
  • [40] S. Ji, W. Li, M. Srivatsa, and R. Beyah, “Structural Data De-anonymization: Quantification, Practice, and Implications”, ACM CCS, 2014.
  • [41] M. J. Tildesley, T. A. House, M. C. Bruhn, R. J. Curry, M. O’Neil, J. L. E. Allpress, G. Smith, and M. J. Keeling, “Impact of Spatial Clustering on Disease Transmission and Optimal Control”, PNAS, Vol. 107, No. 3, pp. 1041-1046, 2010.
  • [42] B. Y. Reis, I. S. Kohane, and K. D. Mandl, “Longitudinal Histories as Predictors of Future Diagnosis of Domestic Abuse: Modelling Study”, BMJ, 2009.
  • [43] T. C. Mendenhall, “The Characteristic Curves of Composition”, Science, Vol. ns-9, No. 214S, pp. 237-246, 1887.
  • [44] Cosine similarity, https://en.wikipedia.org/wiki/Cosine_similarity.
  • [45] J. Noecker Jr and M. Ryan, “Distractorless Authorship Verification”, Proceedings of the Eight International Conference on Language Resources and Evaluation (LREC), 2012.
  • [46] M. Goemans, “Chernoff Bound, and some Applications”, http://math.mit.edu/goemans/18310S15/chernoff-notes.pdf.
  • [47] Borel-Cantelli Lemma, https://en.wikipedia.org/wiki/Borel-Cantelli_lemma.
  • [48] D. Perito, C. Castelluccia, M. A. Kaafar, and P. Manils, “How Unique and Traceable are Usernames?”, Proceedings of the 11th international conference on Privacy enhancing technologies (PETS), 2011.
  • [49] P. Ilia, I. Polakis, E. Athanasopoulos, F. Maggi, and S. Ioannidis, “Face/Off: Preventing Privacy Leakage From Photos in Social Networks”, ACM CCS, 2015.
  • [50] http://www.whitepages.com/.
  • [51] M. Bastian, S. Heymann, and M. Jacomy, “Gephi: An Open Source Software for Exploring and Manipulating Networks”, International AAAI Conference on Weblogs and Social Media, 2009.
  • [52] BoneSmart, http://bonesmart.org/.