Privacy in Social Media: Identification, Mitigation and Applications

by   Ghazaleh Beigi, et al.
Arizona State University

The increasing popularity of social media has attracted a huge number of people who participate in numerous activities on a daily basis, resulting in tremendous amounts of rich user-generated data. This data provides opportunities for researchers and service providers to study and better understand users' behaviors and further improve the quality of personalized services. Publishing user-generated data, however, risks exposing individuals' privacy. User privacy in social media is an emerging research area and has attracted increasing attention in recent years. Existing works study privacy issues in social media from two different points of view: identification of vulnerabilities, and mitigation of privacy risks. Recent research has shown the vulnerability of user-generated data to two general types of attacks, identity disclosure and attribute disclosure. These privacy issues mandate that social media data publishers protect users' privacy by sanitizing user-generated data before publishing it. Consequently, various protection techniques have been proposed to anonymize user-generated social media data. There is a vast literature on user privacy in social media from many perspectives. In this survey, we review the key achievements of user privacy in social media. In particular, we review and compare the state-of-the-art algorithms in terms of privacy leakage attacks and anonymization algorithms. We overview the privacy risks from different aspects of social media and categorize the relevant works into five groups: 1) graph data anonymization and de-anonymization, 2) author identification, 3) profile attribute disclosure, 4) user location and privacy, and 5) recommender systems and privacy issues. We also discuss open problems and future research directions for user privacy issues in social media.




1. Introduction

¹This paper is currently under review.

Explosive growth of the Web in the last decade has drastically changed the way billions of people all around the globe conduct numerous activities such as surfing the web, creating online profiles in social media platforms, interacting with other people, and sharing posts and various personal information in a rich environment. This results in tremendous amounts of user-generated data. The centralization of massive amounts of user information and the availability of up-to-date data, consistently tagged and formatted, make social media platforms an attractive target for organizations seeking to collect and aggregate this information, whether for legitimate purposes or malicious goals (Bonneau et al., 2009). For example, user-generated data provides opportunities for researchers and business partners to study and understand individuals at unprecedented scales (Backstrom et al., 2007; Beigi et al., 2018). This information is also crucial for online vendors to provide personalized services, and a lack of it would result in deteriorating quality of online personalization services.

On the other hand, this tremendous amount of user-generated data risks exposing individuals’ privacy as it is rich in content including a user’s relationships and other sensitive and private information (Ji et al., 2016c; Narayanan and Shmatikov, 2009; Beigi, 2018). This data also makes online users traceable and accordingly, users become severely vulnerable to potential risks ranging from persecution by governments to targeted frauds. For example, users may share their vacation plans publicly on Twitter without knowing that this information could be used by adversaries for break-ins and thefts in the future (Zhang et al., 2018; Mao et al., 2011). Moreover, sensitive information that users do not explicitly disclose such as location (Li et al., 2012b; Mahmud et al., 2014), age (Wang et al., 2016) and trust/distrust relationships (Beigi et al., 2016b, a), can be easily inferred from their activities on social media.

Privacy issues can be raised when the data is published by a data publisher or service provider. In general, two types of information disclosure have been identified in the literature: identity disclosure and attribute disclosure attacks (Duncan and Lambert, 1986; Lambert, 1993; Li et al., 2007). Identity disclosure occurs when an individual is mapped to an instance in a released dataset. Attribute disclosure happens when the adversary can infer some new information regarding an individual based on the released data. Attribute disclosure becomes more probable when there is accurate disclosure of people's identities. Similarly, privacy leakage attacks in social media can also be categorized into either identity disclosure or attribute disclosure. These user privacy issues mandate that social media data publishers protect users' privacy by sanitizing user-generated data before it is published publicly.

Data anonymization is a complex problem whose goal is to remove or perturb data to prevent adversaries from inferring sensitive information while ensuring the utility of the published data. One straightforward anonymization technique is to remove "Personally Identifiable Information" (a.k.a. PII) such as names, user IDs, age and location information. This solution has been shown to be far from sufficient in preserving privacy (Backstrom et al., 2007; Narayanan and Shmatikov, 2008). An example of this insufficient approach is the anonymized dataset published for the Netflix Prize challenge. As a part of the Netflix Prize contest, Netflix publicly released a dataset containing movie ratings of 500,000 subscribers. The data was supposed to be anonymized and all PII was removed from it. Narayanan et al. (Narayanan and Shmatikov, 2008) propose a de-anonymization attack which maps users' records in the anonymized dataset to corresponding profiles on IMDb. In particular, the results of this work show that the structure of the data carries enough information for a potential breach of privacy to re-identify anonymized users.
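The core of such a linkage attack can be illustrated with a minimal sketch (the names and the scoring rule here are simplified assumptions, not the authors' exact algorithm): each anonymized record is scored against an auxiliary public record, and a match is claimed only when the best candidate clearly stands out, echoing the eccentricity test of Narayanan and Shmatikov.

```python
def similarity(anon_record, aux_record):
    """Score how well an auxiliary record (e.g. a public IMDb profile)
    matches an anonymized record: count items rated identically.
    The real attack also weights rare items more heavily and tolerates
    small noise in ratings and dates."""
    return sum(1 for item, rating in aux_record.items()
               if anon_record.get(item) == rating)

def link(anon_records, aux_record, margin=1):
    """Map aux_record to the anonymized record whose score stands out.
    A match is claimed only if the best score beats the runner-up by
    at least `margin` (an 'eccentricity' test); otherwise return None."""
    scores = sorted(((similarity(rec, aux_record), rid)
                     for rid, rec in anon_records.items()), reverse=True)
    if len(scores) > 1 and scores[0][0] - scores[1][0] < margin:
        return None  # ambiguous: no record stands out
    return scores[0][1]
```

What makes this effective on sparse, noisy rating data is the weighting of rare items and the refusal to claim ambiguous matches, which keeps false positives low even when the auxiliary information is imprecise.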

Consequently, various protection techniques have been proposed to anonymize user-generated social media data. In general, the ultimate goal of an anonymization approach is to preserve social media users' privacy while ensuring the utility of the published data. As a counterpart to this research direction, another group of works investigates potential privacy breaches from social media user data by introducing new attacks. This group of works finds the gaps in anonymizing user-generated data and thereby drives further improvement of anonymization techniques.

There is a vast literature on user privacy in social media from many perspectives. The goal of this article is to provide a comprehensive review of existing works on user privacy issues and solutions in social media and to give guidance on future research directions. The contributions of this paper are summarized as follows:


  • We overview the traditional privacy models for structured data and discuss how these models are adopted for privacy issues in social media. We formally define two types of privacy leakage disclosures that cover most of the existing definitions in the literature.

  • We categorize privacy issues and solutions on social media into different groups including 1) graph data anonymization and de-anonymization, 2) author identification, 3) user profile attribute disclosure, 4) location and privacy and 5) recommendation systems and privacy. We then give an overview of existing works in each group, with a principled way to group representative methods into different categories.

  • We discuss several open issues and provide future directions for privacy in social media.

The remainder of this survey is organized as follows. In Section 2, we present an overview of traditional methods and formally define two types of privacy disclosures. In Section 3, we review the state-of-the-art methods for privacy of social media graphs. More specifically, Section 3.1 covers de-anonymization attacks on social media graphs and Section 3.2 covers anonymization techniques proposed for preserving privacy of graph data against de-anonymization attacks. We review author identification works in Section 4. In Sections 5 and 6, we overview state-of-the-art de-anonymization techniques for inferring users' profile attributes and location information. In Section 7, privacy issues and solutions in recommendation systems are reviewed. Finally, we conclude this article in Section 8 by discussing open issues and future directions.

2. Traditional Privacy Models

Privacy preserving techniques were first introduced for tabular and micro data. With the emergence of social media, the issue of online user privacy was raised, and researchers turned to studying privacy leakage issues as well as anonymization and privacy preserving techniques specialized for social media data. There are two types of information disclosure in the literature: identity disclosure and attribute disclosure attacks (Duncan and Lambert, 1986; Lambert, 1993; Li et al., 2007). We can formally define the identity disclosure attack as:

Definition 2.1.

Identity Disclosure Attack. Given a snapshot of a social media platform with a social graph G = (V, E), where V is the set of users and E demonstrates the social relations between them, user behavior information B and attribute information A, the identity disclosure attack is to map all users in the list of target users V_t ⊆ V to their known identities. For each u ∈ V_t, we have the information of her social friends and behavior.

Attribute disclosure attack for social media data could be also formally defined as:

Definition 2.2.

Attribute Disclosure Attack. Given a snapshot of a social media platform with a social graph G = (V, E), where V is the set of users and E demonstrates the social relations between them, user behavior information B and attribute information A, the attribute disclosure attack is to infer the attributes A_u for all users u ∈ V_t, where V_t ⊆ V is a list of targeted users. For each u ∈ V_t, we have the information of her social friends and behavior.

Network graph de-anonymization and author identification are examples of identity disclosure attacks that exist in social media. Examples of attribute disclosure attacks include the disclosure of users' profile attributes, location, and preference information in recommendation systems.

Before we discuss privacy leakage in social media, we first overview the traditional privacy models for structured data. Traditional privacy models such as k-anonymity (Sweeney, 2002), l-diversity (Machanavajjhala et al., 2006), t-closeness (Li et al., 2007) and differential privacy (Dwork, 2008) are defined over structured databases and cannot be applied directly to unstructured user-generated data in social media platforms. The reason is that quasi-identifiers and sensitive attributes are not clear in the context of social media data. These techniques have been further adopted for social media data, which we will discuss more in the next sections. Last but not least, we discuss the related work and highlight the differences between this work and other surveys in the existing literature.

2.1. k-anonymity, l-diversity and t-closeness

k-anonymity was one of the first techniques introduced for protecting data privacy (Sweeney, 2002). The aim of k-anonymity is to anonymize each instance in the dataset so that it is indistinguishable from at least k−1 other instances with respect to certain identifying attributes. k-anonymity can be achieved through suppression or generalization of the data instances. The goal here is to anonymize the data such that k-anonymity is preserved for all instances in the dataset with a minimum number of generalizations and suppressions while maximizing the utility of the resultant data. It has been shown that this problem is NP-hard (Aggarwal et al., 2005). k-anonymity was initially defined for tabular data, but researchers have since adopted it for solving privacy issues in social media data. In social media related problems, k-anonymity ensures that a user cannot be singled out: there are at least k−1 other users with the same set of features, which makes these users indistinguishable. These features may include users' attributes and structural properties.
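As an illustration, whether a released table satisfies k-anonymity over a chosen set of quasi-identifiers can be checked by simple grouping. The sketch below is illustrative (the record layout is hypothetical); note it only verifies the property, while finding a minimal generalization that achieves it is the NP-hard part:

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """Check k-anonymity: every combination of quasi-identifier values
    must be shared by at least k records, so no record is unique on
    those attributes. `records` is a list of dicts."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in groups.values())
```

For example, rows with ages generalized to decades ("30-39") and ZIP codes suppressed to prefixes ("850**") satisfy 2-anonymity only when each such combination occurs at least twice.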

Although k-anonymity is among the first techniques proposed for protecting the privacy of datasets, it is still vulnerable to specific types of privacy leakage. Machanavajjhala et al. (Machanavajjhala et al., 2006) introduce two simple attacks which defeat k-anonymity. The first is the homogeneity attack, in which the adversary can infer an instance's (in this case, a social media user's) sensitive attributes when the sensitive values in an equivalence class lack diversity. In the second attack, the adversary can infer an instance's sensitive attributes when he has access to background knowledge, even when the data is k-anonymized. This is known as the background knowledge attack. Variations of background knowledge attacks have been proposed and used for inferring social media users' attributes; the background knowledge can be users' friends' or behavioral information. We will discuss different types of attribute inference attacks in Sections 6 and 7.

To protect data against homogeneity and background knowledge attacks, Machanavajjhala et al. (Machanavajjhala et al., 2006) introduce the concept of l-diversity, which ensures that the sensitive attribute values in each equivalence class are diverse. More formally, a set of records in an equivalence class is l-diverse if the class contains at least l well-represented values for the sensitive attributes. The dataset is then l-diverse if every class is l-diverse. Two instantiations of the l-diversity concept are introduced: entropy l-diversity and recursive l-diversity. With entropy l-diversity, each equivalence class must not only have enough different sensitive values, but each sensitive value must also be distributed evenly enough; more formally, the entropy of the distribution of sensitive values in each equivalence class must be at least log(l). With recursive l-diversity, the most frequent sensitive value must not appear too frequently relative to the remaining values. Interested readers can refer to the work of (Machanavajjhala et al., 2006) for more details.
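Entropy l-diversity is straightforward to check once equivalence classes are formed. A minimal sketch, assuming each class is given as a list of its sensitive-attribute values:

```python
import math
from collections import Counter

def entropy_l_diverse(classes, l):
    """Entropy l-diversity check (Machanavajjhala et al., 2006): for
    each equivalence class, the entropy of its sensitive-value
    distribution must be at least log(l)."""
    for values in classes:
        counts = Counter(values)
        n = len(values)
        entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
        if entropy < math.log(l):
            return False
    return True
```

A class dominated by a single sensitive value fails the check even if it technically contains l distinct values, which is exactly the evenness requirement that distinguishes entropy l-diversity from plain distinct l-diversity.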

After l-diversity, Li et al. (Li et al., 2007) study the vulnerabilities of l-diversity and introduce a new privacy concept, t-closeness. They show that l-diversity cannot protect the privacy of data when the distribution of sensitive attributes in an equivalence class differs from the distribution in the whole dataset. If the distribution of sensitive attributes is skewed, then l-diversity presents a serious privacy risk; this is known as the skewness attack. l-diversity is also vulnerable to similarity attacks, which can happen when the sensitive attributes in an equivalence class are distinct but semantically similar (Li et al., 2007). Li et al. (Li et al., 2007) thus introduce t-closeness, which ensures that the distribution of a sensitive attribute in any equivalence class is close to its distribution in the overall table. More formally, an equivalence class satisfies t-closeness if the distance between the distribution of a sensitive attribute in this class and the distribution of the attribute in the whole dataset is no more than a threshold t. The whole dataset satisfies t-closeness if all equivalence classes do. It is worth mentioning that t-closeness protects the data against attribute disclosure but not identity disclosure.
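As a sketch, t-closeness can be checked per equivalence class once a distance between distributions is fixed. The snippet below uses total variation distance for simplicity; the original paper uses the Earth Mover's Distance, which coincides with total variation when all pairs of distinct values are considered equally far apart:

```python
from collections import Counter

def t_close(classes, t):
    """t-closeness check with total variation distance: each
    equivalence class's sensitive-value distribution must be within t
    of the distribution over the whole table. `classes` is a list of
    lists of sensitive values."""
    table = [v for cls in classes for v in cls]
    overall = Counter(table)
    n = len(table)
    for cls in classes:
        dist = Counter(cls)
        m = len(cls)
        domain = set(overall) | set(dist)
        tvd = 0.5 * sum(abs(dist[v] / m - overall[v] / n) for v in domain)
        if tvd > t:
            return False
    return True
```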

k-anonymity, l-diversity and t-closeness have been further adopted for unstructured social media data. Table 1 summarizes different approaches that leverage adopted versions of these techniques for privacy problems in social media. These works are discussed more in the following sections.

Technique Type of Information Paper
k-degree anonymity graph structure (Liu and Terzi, 2008)
k-neighborhood anonymity graph structure (Zhou and Pei, 2008)
k-automorphism graph structure (Zou et al., 2009)
k-isomorphism graph structure (Cheng et al., 2010b)
k-anonymity graph structure and attribute information (Yuan et al., 2010)
k-matching anonymity graph structure and attribute information (Andreou et al., 2017)
k-anonymity graph structure and attribute information (Backes et al., 2015, 2016)
l-diversity attribute information (Machanavajjhala et al., 2006)
t-closeness attribute information (Li et al., 2007)
Table 1. k-anonymity, l-diversity and t-closeness applications in user privacy in social media.

2.2. Differential Privacy

Differential privacy is a powerful technique which protects a user's privacy during statistical queries over a database by minimizing the chance of privacy leakage while maximizing the accuracy of queries. It was introduced by Dwork et al. (Dwork et al., 2006; Dwork, 2008) and provides a strong privacy guarantee. The intuition behind differential privacy is that the risk of a user's privacy leakage should not increase as a result of participating in a database (Dwork, 2008). In particular, it imposes a guarantee on the data release mechanism rather than on the dataset itself. The privacy risk is evaluated with respect to the presence or absence of an instance in the database. Differential privacy assumes that data instances are independent of each other and guarantees that the existence of an instance in the database does not pose a threat to its privacy, as the statistical information of the data would not change significantly compared to the case where the instance is absent (Dwork et al., 2006; Dwork, 2008). This way, the adversary cannot infer whether an instance is in the database, or which record is associated with it (Kifer and Machanavajjhala, 2011). Differential privacy can be more formally defined as:

Definition 2.3.

Differential Privacy. Given a query function f, a mechanism M with an output range R satisfies ε-differential privacy iff, for all datasets D1 and D2 differing in at most one element and all sets of outputs S ⊆ R:

Pr[M(D1) ∈ S] ≤ e^ε × Pr[M(D2) ∈ S]    (1)

Here, ε is called the privacy budget. Large values of ε (e.g., 10) result in a large e^ε and indicate that a large output difference can be tolerated, and hence that we have a large privacy loss: the adversary can infer a change in the database from a large change in the query function f. On the other hand, small values of ε (e.g., 0.1) indicate that only a small privacy loss can be tolerated. The query function f can be thought of as a request for the value of a random variable, and the mechanism M is a randomized function which can be considered as an algorithm that returns the result of the query function, possibly with some noise. To make this clearer, assume that we have a dataset containing patient records. An example query function could be the question: how many patients have a given disease? The mechanism M could be any algorithm that finds the answer to this question. The output range of M in this example is {0, 1, ..., N}, where N is the total number of patients in the dataset.

Differential privacy models can be either interactive or non-interactive. Assume that the data consumer executes a number of statistical queries f on the same dataset D. In interactive models, the data publisher responds to the consumer with M(f(D)), where the mechanism M perturbs the query results to achieve the privacy guarantees. In non-interactive models, the data publisher designs a mechanism M which transforms the original data D into a new anonymized dataset D'. The perturbed data D' is then returned to the consumer and is ready for arbitrary statistical queries.

A common way of achieving differential privacy is through adding random noise, e.g. Laplacian or Exponential, to the query answers (Dwork, 2008). The Laplacian mechanism is a popular technique for providing ε-differential privacy which adds noise drawn from the Laplace distribution. Since ε-differential privacy is defined over the query function and holds for all datasets according to Eq. 1, the amount of added noise only depends on the sensitivity of the query function, defined as:

Δf = max_{D1, D2} ||f(D1) − f(D2)||_1

for any D1 and D2 which differ in at most one element; ||·||_1 denotes the L1 norm.

The added Laplacian noise is then drawn from Lap(Δf/ε), and the output under the differential privacy constraint will be f(D) + Y, where Y ∼ Lap(Δf/ε). The mechanism works best when Δf is small, as it then introduces the least noise. The larger the sensitivity of a query, the fewer privacy risks can be tolerated, as removing any single instance from the dataset would change the output of the query more. Note that the sensitivity essentially captures how great a difference (between the values of f on two datasets differing in a single element) must be hidden by the additive noise generated by the data publisher.
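A minimal sketch of the Laplacian mechanism for a counting query (which has sensitivity 1) might look as follows; the function names are ours, and the noise is drawn with textbook inverse-CDF sampling:

```python
import math
import random

def laplace_mechanism(true_answer, sensitivity, epsilon):
    """Return a noisy answer satisfying epsilon-differential privacy by
    adding Laplace noise with scale sensitivity / epsilon."""
    scale = sensitivity / epsilon
    u = random.random() - 0.5
    # Inverse-CDF sampling of Laplace(0, scale).
    noise = -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))
    return true_answer + noise

# A counting query has sensitivity 1: adding or removing one record
# changes the count by at most 1.
def private_count(records, predicate, epsilon):
    return laplace_mechanism(sum(map(predicate, records)), 1.0, epsilon)
```

In production one would rely on a vetted library implementation, since naive floating-point Laplace sampling is known to leak information through the structure of the generated noise.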

Note that recent studies show that dependencies between instances in the dataset weaken the privacy guarantees provided by differential privacy (Kifer and Machanavajjhala, 2011; Liu et al., 2016a).

There also exists a relaxed version of ε-differential privacy, known as (ε, δ)-differential privacy, which was developed to deal with very unlikely outputs of M (Dwork et al., 2006; Dwork, 2008). It can be defined as:

Definition 2.4.

(ε, δ)-differential privacy. Given a query function f, a mechanism M with an output range R satisfies (ε, δ)-differential privacy iff, for all datasets D1 and D2 differing in at most one element and all sets of outputs S ⊆ R:

Pr[M(D1) ∈ S] ≤ e^ε × Pr[M(D2) ∈ S] + δ    (2)

where ε and δ are two model parameters related to the level of privacy guarantee and are considered to be very small numbers.

Table 2 summarizes different works that utilize differential privacy in social media data. All these works are discussed more later.

Type of Information Paper
graph structure (Sala et al., 2011; Proserpio et al., 2014; Xiao et al., 2014; Wang and Wu, 2013; Liu et al., 2016a)
recommender systems (McSherry and Mironov, 2009; Machanavajjhala et al., 2011; Zhu et al., 2013; Jorgensen and Yu, 2014; Shen and Jin, 2014; Hua et al., 2015; Guerraoui et al., 2015; Zhu and Sun, 2016; Meng et al., 2018)
textual data (Zhang et al., 2018)
Table 2. Differential privacy applications in user privacy in social media.

2.3. Related Work

There are multiple relevant surveys related to data privacy and privacy preserving approaches (Fung et al., 2010; Ji et al., 2015b, 2016d; Abawajy et al., 2016; Sharma et al., 2012; Zheleva et al., 2012; Verykios et al., 2004; Agrawal and Srikant, 2000; Rizvi and Haritsa, 2002; Evfimievski et al., 2004). Fung et al. (Fung et al., 2010) review privacy preserving data publishing methods for relational data such as k-anonymity, l-diversity, t-closeness and their variations. These methods are compared in terms of privacy models, anonymization algorithms and information metrics. Zheleva et al. (Zheleva et al., 2012) review the concepts of privacy issues in tabular data and introduce new privacy risks in graph data. Multiple surveys focus on reviewing graph data privacy risks (Ji et al., 2015b, 2016d; Abawajy et al., 2016; Sharma et al., 2012). Sharma et al. (Sharma et al., 2012) is among the first works which review k-anonymity and randomization based techniques for anonymizing graph data. Another overview by Abawajy et al. (Abawajy et al., 2016) presents the threat model for graph data and classifies the background knowledge that is used by adversaries to breach the privacy of users. They also review and classify state-of-the-art approaches for anonymizing graph data. Ji et al. (Ji et al., 2015b, 2016d) conduct a survey on graph data anonymization, de-anonymization attacks and de-anonymizability quantification.

Another way of sanitizing data is by providing algorithms which are provably privacy-preserving and ensure no sensitive information leaks from the data (Zheleva et al., 2012). There is a thorough survey (Verykios et al., 2004) on privacy preserving data mining which studies different privacy preserving data mining approaches. Another work from Agrawal et al. (Agrawal and Srikant, 2000) proposes algorithms that perturb data values by adding random noise to them in order to preserve the privacy of users while retaining the statistical properties of the original data. Another set of works focuses on developing privacy preserving association rule mining to minimize privacy loss (Rizvi and Haritsa, 2002; Evfimievski et al., 2004).

In this work, we go one step further and review all aspects of social media data which could lead to privacy leakage. Social media data is highly unstructured and noisy, and inherently different from relational and tabular data; therefore, other approaches are designed specifically to study privacy risks in the context of user-generated data in social media platforms. Different from previous works, we not only review state-of-the-art and recent approaches on social graph anonymization and de-anonymization, but also survey other attribute and identity disclosure attacks which can be performed on other aspects of user-generated social media data. In particular, we overview and summarize approaches that leverage users' activities on social media to infer their profile and location information. In addition to identity disclosure risks raised from social graphs, we survey author identification and user linkage across social media approaches that incorporate various pieces of user-generated information such as user profiles and textual posts. We introduce more risks and cover more recent works related to privacy leakage in social media which are not covered in the work of Zheleva et al. (Zheleva et al., 2012). Furthermore, we include many new techniques related to the privacy of social graphs which are not included in previous surveys (Ji et al., 2015b, 2016d; Abawajy et al., 2016; Sharma et al., 2012).

In summary, to the best of our knowledge, this is the first and most comprehensive work that systematically surveys and analyzes the advances of research on privacy issues in social media.

3. Social Graphs and Privacy

A large amount of data generated by users in social media platforms has graph structure. Friendship and following/followee relations, mobility traces (e.g. WiFi contacts, Instant Message contacts) and spatio-temporal data (latitude, longitude, timestamps) all could be modeled as graphs. This mandates paying attention to privacy issues of graph data. We will first overview graph de-anonymization works and then survey the proposed solutions for anonymizing graph data.

3.1. Graph De-anonymization

The work of Backstrom et al. (Backstrom et al., 2007) was among the first to study the privacy breach problem based on the social network's graph structure. De-anonymization attacks can be categorized as either seed-based or seed-free, according to whether pre-annotated seed users exist. Seed users are those whose identities are known to the attacker. Backstrom et al. (Backstrom et al., 2007) is among the first seed-based approaches. This work introduces both active and passive attacks on anonymized social networks. In active attacks, the adversary creates new user accounts (a.k.a. Sybils) and links them to a set of predefined target nodes before the anonymized graph is produced. It then links these new accounts together to create a subgraph H. After the anonymized graph is published, the attacker looks for the subgraph H and then locates and re-identifies the targeted nodes in the published graph. The main challenge in this approach is that the subgraph H must be unique enough to be found efficiently in a graph with several million users. In passive attacks, the attacker is an internal user of the system and no new account is created; the attacker then de-anonymizes the users connected to him after the anonymized graph data is released. The active attack is susceptible to Sybil defense approaches (Al-Qurishi et al., 2017) and wrongly assumes that attackers can always change the network before its release.

Another work from Narayanan et al. (Narayanan and Shmatikov, 2009) introduces an improved attack which does not need compromised accounts or Sybil users. This work assumes that the attacker has access to a different network whose membership overlaps with that of the original anonymized network. This auxiliary graph is also known as the background or auxiliary graph knowledge. It also assumes that the attacker has the information of a small set of users, i.e. seed users, who are present in both networks. Narayanan et al. (Narayanan and Shmatikov, 2009) discuss different ways of collecting background knowledge. For example, if the attacker is a friend of a portion of the targeted users, he knows all the details about them (Korolova et al., 2008; Stone et al., 2008). Another approach is paying a set of users to reveal information about themselves and their friends (Lewis et al., 2008). Crawling data via social media APIs or using compromised accounts, as discussed for the active attack, are other approaches for gathering background knowledge. The social graph de-anonymization attack in social media can then be formally defined as:

Definition 3.1.

Social Graph De-anonymization Attack (Narayanan and Shmatikov, 2009; Fu et al., 2015). Given an auxiliary/background graph G_a = (V_a, E_a) and a target anonymized graph G_t = (V_t, E_t), the goal of de-anonymization is to find identity disclosures in the form of mappings (v_a, v_t), as many and as accurately as possible. An identity disclosure (v_a, v_t) indicates that the two nodes v_a ∈ V_a and v_t ∈ V_t actually correspond to the same user.

3.1.1. Seed-based De-anonymization

Seed-based de-anonymization approaches have two main steps. In the first step, a set of seed users is mapped from the anonymized graph to the background/auxiliary graph and thus re-identified. In the second step, the mapping and de-anonymization is propagated from the seed users to the remaining unidentified users. Following this scheme, the work of Narayanan et al. (Narayanan and Shmatikov, 2009) starts by re-identifying seed users in the anonymized and auxiliary graphs. Then, other users are re-identified by propagating mappings based on the seed user pairs. Structural information such as a user's degree, eccentricity, and edge directionality is used to heuristically measure the strength of a match between users. A straightforward application of this de-anonymization attack with fewer heuristics is predicting links between users (Narayanan et al., 2011).

Yartseva et al. (Yartseva and Grossglauser, 2013) propose a percolation-based de-anonymization approach which maps every pair of users in the two graphs (the background knowledge graph and the anonymized graph) that has more than r neighboring mapped pairs. The only parameter of this approach is r, a predefined mapping threshold, and it does not require a minimum number of users in the seed set. A similar work from Korula et al. (Korula and Lattanzi, 2014) proposes a parallelizable percolation-based attack with provable guarantees. It again starts with a set of seed users who are previously mapped and then propagates the mapping to the remaining network; two users will be mapped if they have a specific number of mapped neighbors. Their approach is robust to malicious users and fake social relationships in the network.
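The percolation idea shared by these attacks can be sketched as follows. This is an illustrative simplification (the graph encoding and the threshold name r are our assumptions), omitting the tie-breaking heuristics and robustness refinements of the actual papers:

```python
def percolation_match(adj_a, adj_t, seeds, r=2):
    """Percolation-style de-anonymization sketch. `adj_a` (auxiliary)
    and `adj_t` (anonymized) are graphs as {node: set(neighbors)};
    `seeds` is an initial mapping from auxiliary to anonymized nodes.
    Repeatedly map any unmapped pair (a, t) that already has at least
    r mapped neighbor pairs, until no new pair percolates."""
    mapping = dict(seeds)
    changed = True
    while changed:
        changed = False
        mapped_a, mapped_t = set(mapping), set(mapping.values())
        for a in adj_a:
            if a in mapped_a:
                continue
            for t in adj_t:
                if t in mapped_t:
                    continue
                # Count neighbors of `a` whose images are neighbors of `t`.
                marks = sum(1 for n in adj_a[a]
                            if n in mapping and mapping[n] in adj_t[t])
                if marks >= r:
                    mapping[a] = t
                    mapped_a.add(a)
                    mapped_t.add(t)
                    changed = True
                    break
    return mapping
```

Starting from only two seeds, the mapping spreads outward as long as each new candidate pair is supported by at least r already-matched neighbors, which is exactly the percolation process whose critical seed-set size these papers analyze.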

In another work, Nilizadeh et al. (Nilizadeh et al., 2014) propose a community-based de-anonymization attack using the idea of divide-and-conquer. Community detection has been extensively studied in the social network analysis literature (Yang et al., 2013; Alvari et al., 2016; Yang and Leskovec, 2013) and has been used in a variety of tasks such as trust prediction (Beigi et al., 2014) and guild membership prediction (Alvari et al., 2014; Hajibagheri et al., 2018).

In this work, the attacker first leverages community detection techniques to partition both graphs (i.e., the anonymized and knowledge graphs) into multiple communities. It then maps communities by creating a network of communities in both graphs. Users within mapped communities are then re-identified and matched together, and the mappings are propagated to re-identify the remaining users. This attack uses heuristics similar to those of

(Narayanan and Shmatikov, 2009) to measure the mapping strength between users.

Ji et al. (Ji et al., 2015a, 2016a) study the de-anonymizability of social media graph data based on seed-based approaches under both the Erdos-Renyi model and a statistical model. Similar to (Ji et al., 2014a), they specify the structural conditions for both perfect and partial de-anonymization (partial de-anonymization can only re-identify a subset of users). Chiasserini et al. (Chiasserini et al., 2016; Fabiana et al., 2015) also study the problem of user de-anonymization according to structural information under a scale-free user relation model. This assumption is more realistic since the degree distribution of users in social media follows a power law, i.e., is scale-free. Their analysis shows that the information of a large portion of users in the seed set is useless for re-identifying users in the anonymized graph, because of the large inhomogeneities in user degrees. Their results suggest that a large number of seeds is needed to successfully de-anonymize all users when seeds are uniformly distributed among the vertices, whereas only n^ε seeds (for an arbitrarily small ε, where n is the number of users) are needed if the attacker has the option to select seeds according to their degree, exploiting the scale-free property of the social network. Chiasserini et al. (Chiasserini et al., 2016, 2018) also propose a two-phase percolation graph matching based de-anonymization attack similar to (Yartseva and Grossglauser, 2013).

Bringmann et al. (Bringmann et al., 2014) propose an approach which uses n^ε seed nodes (for an arbitrarily small ε) for a graph with n nodes, an improvement over state-of-the-art structure-based de-anonymization techniques that need substantially larger seed sets (Korula and Lattanzi, 2014). The approach finds a signature set for each node as the intersection of its neighbors and the previously re-identified nodes. It then defines a criterion to decide whether two signatures originate from the same node with high probability: if the similarity of two nodes' signatures exceeds a constant threshold, the two nodes are mapped together. Locality-sensitive hashing (Indyk and Motwani, 1998) is used to reduce the number of comparisons needed for the attack. Theoretical and empirical analysis shows that the attack runs in quasilinear time.

Peng et al. (Peng et al., 2014) propose another seed-based attack against anonymized social graphs which has two steps. In the first step, it identifies a seed sub-graph of users with known identities. As discussed earlier in (Backstrom et al., 2007), this sub-graph could be injected by an attacker, or it could be a small group of users that the attacker is able to re-identify. In the second step, it extends the seed set based on the users' social relations and re-identifies the remaining users. In each mapping iteration, the algorithm re-examines previous mapping decisions given new evidence regarding re-identified nodes. This attack does not place any limitation on the size of the initial seed set or the number of links between seeds. Another recent work by Chiasserini et al. (Chiasserini et al., 2018) incorporates clustering into de-anonymization attacks. Their attack uses various levels of clustering, and their theoretical results highlight that clustering can potentially reduce the number of seeds in percolation-based de-anonymization attacks due to its wave-like propagation effect. This attack is a modified version of (Yartseva and Grossglauser, 2013): it starts from a small set of seed users, expands the seed set to the closest neighbors of the users in the seed set, and repeats the re-identification procedure. In this version, two users are mapped if they have a sufficiently large number of neighbors among the mapped pairs.

3.1.2. Seed-free De-anonymization

The efficiency of most seed-based approaches depends on the size of the seed set. Seed-free approaches do not have this problem, since they do not need information about users in the form of a seed set to de-anonymize other users. Recently, some powerful seed-free de-anonymization attacks have been developed for social media graph data (Ji et al., 2014a, 2016b; Pedarsani and Grossglauser, 2011). Pedarsani et al. (Pedarsani and Grossglauser, 2011) present a Bayesian model which starts from the users with the highest degree and iteratively solves a maximum weighted bipartite graph matching problem, updating the fingerprints of all users in each iteration. A bipartite graph is a graph whose vertices can be divided into two disjoint sets U and V such that every edge connects a vertex in U to one in V. The goal of the maximum weighted bipartite matching problem is to find a matching of maximum total weight between the two parts, so that each vertex is the endpoint of at most one chosen edge.
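For tiny instances, the matching step can be illustrated by brute force over permutations. This is a sketch of the problem itself, not any paper's method; a practical attack would use the polynomial-time Hungarian algorithm, and the weight matrix here is hypothetical:

```python
from itertools import permutations

def max_weight_matching(weights):
    """Exhaustive maximum weighted bipartite matching for a small square
    weight matrix: weights[i][j] is the similarity between anonymized
    user i and known user j. Returns (best total weight, mapping)."""
    n = len(weights)
    best_score, best_map = float("-inf"), None
    for perm in permutations(range(n)):
        score = sum(weights[i][perm[i]] for i in range(n))
        if score > best_score:
            best_score, best_map = score, dict(enumerate(perm))
    return best_score, best_map
```

The factorial-time enumeration is only for clarity; the Hungarian algorithm solves the same problem in O(n^3).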

Moreover, Ji et al. (Ji et al., 2014a, 2016b) propose optimization-based methods that iteratively minimize an error function. More specifically, in each iteration of this attack, two candidate sets of users are selected from the anonymized and background graphs. Users in the set from the anonymized graph are then mapped (de-anonymized) to users in the background graph by minimizing an error function defined by the edge difference caused by a mapping scheme. In particular, Ji et al. (Ji et al., 2014a) quantify structure-based de-anonymization under the Configuration model (Newman, 2003) and derive structural conditions for perfect and partial de-anonymization. The Configuration model generates a random graph from a given degree sequence by randomly assigning edges to match that degree sequence (Newman, 2003).

Another recently developed group of techniques leverages additional sources of information besides network structure to re-identify social media users in anonymized data. This information includes user interactions (e.g., commenting, tweeting) or non-personally-identifiable information that is associated with users and shared publicly, such as gender, education, country, and interests (Gong and Liu, 2018). Combining structural and exogenous sources of information can increase the risk to user privacy. Zhang et al. (Zhang et al., 2014) study the privacy breach problem in anonymized heterogeneous networks. They first introduce a privacy risk measure based on the potential loss of the user and the number of users who share the same attribute value. They then propose a de-anonymization algorithm which incorporates the defined privacy risk measure. For each target user, this framework first finds a set of candidates based on entity attribute matches in the heterogeneous network and then narrows down this candidate set by comparing the neighbors (found via heterogeneous links) of the target user and each candidate.

Fu et al. (Fu et al., 2014, 2015) propose to use structural and descriptive information, where descriptive information is defined as attribute information such as name, gender, and birth year. This work first proposes a new definition of user similarity: two users are similar if their neighbors match each other as well. Since the similarity of neighbors in turn depends on the similarity of users, Fu et al. model similarity as a recursive problem and solve it iteratively. They then reduce the de-anonymization problem to a complete weighted bipartite graph matching problem, which is solved with the Hungarian algorithm (Kuhn, 2010); the weights are calculated from the user similarities.

In another work, the effect of user attribute information as an exogenous source of information on de-anonymizing social networks is studied (Qian et al., 2016). In particular, this work incorporates the semantic background knowledge of the adversary into the de-anonymization process and models it using knowledge graphs (James, 1992). This approach simultaneously de-anonymizes the data and infers user attributes (we discuss the user profile attribute inference attack later, in Section 5). The adversary first models both the anonymized dataset and the background knowledge as two knowledge graphs. Then, she builds a complete weighted bipartite graph, where each weight indicates the structural and attribute similarity between corresponding nodes in the anonymized and knowledge graphs. The de-anonymization problem is then reduced to a maximum weighted bipartite matching problem, which can be further reduced to a minimum cost maximum flow problem. The attacker's prior semantic knowledge can be obtained in different ways, such as common sense, statistical information, personal information, and network structural information.

Ji et al. (Ji et al., 2017) also study the same problem and show, theoretically and empirically, that using attribute information alongside structural information can result in a greater privacy loss, even in an anonymized dataset, compared to the case where the data consists only of structural information. They further propose the De-SAG de-anonymization framework, which incorporates both attribute and structural information by first augmenting both types of information into a structure-attribute graph. De-SAG has two variants, user-based and set-based. In user-based De-SAG, the approach first selects the top-k most similar candidates to the target user from the background/auxiliary knowledge graph based on attribute similarity, where k is a pre-defined parameter which controls the efficiency-accuracy trade-off. Next, the target user is mapped to one of the selected candidates based on structural similarity. In set-based De-SAG, in each iteration, two sets of users are selected from the anonymized graph and the knowledge graph, respectively. The de-anonymization problem then reduces to a maximum weighted bipartite graph matching problem, and users in these two sets are mapped to each other using the Hungarian algorithm (Kuhn, 2010). These steps are repeated until no user remains unidentified. Note that the similarity of users is again calculated according to their attribute and structural information. The results show that De-SAG re-identifies users up to 10 times more accurately than state-of-the-art structure-based de-anonymization techniques (Ji et al., 2014a; Korula and Lattanzi, 2014).

In another work by Lee et al. (Lee et al., 2017b), a blind de-anonymization technique is proposed in which the adversary does not need any background information. Inspired by the idea of dK-series for characterizing the structural characteristics of a graph, they propose an analogous series to describe the structural features of each user by exploiting information about his multi-hop neighbors; the series captures the degree histograms of the user's multi-hop neighborhoods. A structure score is then calculated for each user (in both the anonymized graph and the background knowledge graph) based on his diversity score (calculated from the series) and his relationships with all other non-re-identified users in the network. This information is then used to re-identify all users in the anonymized social graph by leveraging pseudo-relevance-feedback support vector machines. Backes et al. (Backes et al., 2017) develop an attack which infers relations between users (i.e., edges between nodes in graph data) based on the users' mobility profiles, without using any additional information about existing relations between users. Their approach first constructs a mobility profile for each user and then infers the social links between users based on the similarity of their mobility profiles. The intuition behind this attack is that friends have more similar profiles than strangers. To infer users' mobility profiles, it first obtains random walk traces from the user-location bipartite graph and then uses skip-gram (Mikolov et al., 2013) to obtain features in a continuous vector space.
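The mobility-similarity intuition can be sketched with raw visit frequencies and cosine similarity, instead of the random-walk/skip-gram embedding the paper actually uses; the location names and threshold below are invented for illustration:

```python
from math import sqrt

def mobility_profile(visits, locations):
    """Normalized visit-frequency vector over a fixed location vocabulary.
    (A simpler stand-in for the embedding-based profiles of Backes et al.)"""
    total = sum(visits.get(loc, 0) for loc in locations) or 1
    return [visits.get(loc, 0) / total for loc in locations]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def infer_link(visits_u, visits_v, locations, threshold=0.5):
    """Guess that two users are friends if their mobility profiles are
    more similar than a (hypothetical) threshold."""
    pu = mobility_profile(visits_u, locations)
    pv = mobility_profile(visits_v, locations)
    return cosine(pu, pv) >= threshold
```

Two users who frequent the same places score high and are flagged as likely friends; a user with a disjoint set of locations scores zero.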

Beigi et al. (Beigi et al., 2018) also introduce a new adversarial attack for social media data that does not need any background information before initiating the attack. This attack is designed for heterogeneous social media data consisting of different aspects (e.g., textual, structural, location, etc.) and shows that anonymizing all aspects of the data is not sufficient when it is done without considering the hidden relationships between different data aspects. The attack first extracts the most revealing information for each user in the anonymized dataset, then finds a set of candidate users based on the extracted information, and finally maps each user to the most probable candidate. Sharad et al. (Sharad and Danezis, 2014) propose to formulate the problem of graph de-anonymization in social networks as a learning task. They use 1-hop and 2-hop neighborhood degree distributions to represent users in a graph, the intuition being that two nodes refer to the same user if their neighborhoods also match. For each pair of users selected at random from the background-knowledge and anonymized graphs, their approach first extracts structural features from each user's 1-hop and 2-hop neighborhood. These features help a machine learning model learn the degree deviation for identical and non-identical user pairs. A classifier (decision tree or random forest) is then trained on these features to predict whether a pair of nodes in different ego-nets refers to the same user. In another work, Sharad (Sharad, 2016) goes further and proposes a new generation of de-anonymization attacks that are heuristic-free and seedless, treating de-anonymization purely as a learning problem. This attack uses the same set of structural features as (Sharad and Danezis, 2014) and de-anonymizes the sanitized graph by re-identifying high-degree users first and then using them to attack low-degree nodes. Nodes are divided into three categories based on their degrees, and an initial set of mappings is produced for the highest-degree nodes. These mappings are used to filter out some of the nodes; they are then frozen and propagated to the remaining nodes to discover new mappings.
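The kind of neighborhood-degree features used in these learning-based attacks can be sketched as follows; the bin edges are illustrative, not the papers' exact choice:

```python
from collections import defaultdict

def neighborhood_degree_features(edges, node, bins=(1, 2, 4, 8)):
    """Binned degree histograms of a node's 1-hop and 2-hop neighborhoods,
    in the spirit of Sharad and Danezis's learning-based matching. The
    concatenated histograms form the feature vector fed to a classifier."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v); adj[v].add(u)

    def binned(nodes):
        # Histogram of neighbor degrees, bucketed by the bin edges.
        hist = [0] * (len(bins) + 1)
        for n in nodes:
            d, i = len(adj[n]), 0
            while i < len(bins) and d > bins[i]:
                i += 1
            hist[i] += 1
        return hist

    one_hop = adj[node]
    two_hop = (set().union(*(adj[n] for n in one_hop)) - one_hop - {node}
               if one_hop else set())
    return binned(one_hop) + binned(two_hop)
```

Two nodes from different graphs can then be compared by the distance between their feature vectors, or the vectors of a candidate pair can be concatenated as classifier input.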

3.1.3. Theoretical Analysis and De-anonymization

Another set of works studies de-anonymization attacks from a theoretical perspective. For example, Liu et al. (Liu et al., 2016a) theoretically study the vulnerability of differential privacy mechanisms to de-anonymization attacks. Differential privacy provides protection against even the strongest attacks, in which the adversary knows the entire dataset except one entry. However, differential privacy assumes independence between dataset entries, which does not hold in most real-world applications. This work introduces a new attack in which the probabilistic dependence between dataset entries is calculated and then leveraged to infer users' sensitive information from differentially private queries. The attack is also tested on graph data in which users' degree distributions are published in a differentially private manner.

Lee et al. (Lee et al., 2017a) also theoretically quantify the vulnerability of anonymized graph data to de-anonymization attacks. In particular, they study the relation between application-specific anonymized data utility (i.e., quality of data) and the capability of de-anonymization attacks. They define a local neighborhood utility and a global structure utility, and show theoretically that, under certain conditions on each utility, the probability of successful de-anonymization approaches one as the number of users in the data grows. Their foundations can be used to evaluate the effectiveness of de-anonymization/anonymization techniques.

Recent research by Fu et al. (Fu et al., 2017) studies the conditions under which the adversary can perfectly de-anonymize user identities in social graphs. In particular, they theoretically study the cost of quantifying the quality of the mappings. Community structures are also parameterized and leveraged as side information for de-anonymization; they study two cases, in which community information is available either for both the background-knowledge and anonymized graphs or for only one of them. They show that perfect de-anonymization of graph data with community information is NP-hard, and propose two algorithms with approximation guarantees and lower time complexity by relaxing the original optimization problem. The main drawback of this study is the assumption of disjoint communities, which fails to reflect real-world situations. Wu et al. (Wu et al., 2018) extend Fu et al.'s study by considering overlapping communities. In contrast to Fu et al.'s work (Fu et al., 2017), which uses Maximum a Posteriori estimation to find the correct mappings, Wu et al. introduce a new cost function, Minimum Mean Squared Error, which minimizes the expected number of mismatched users by incorporating all possible true mappings.

There are surveys by Ji et al. (Ji et al., 2016d, 2015b), Lee et al. (Lee et al., 2017a) and Abawajy et al. (Abawajy et al., 2016) on the quantification and analysis of graph de-anonymization techniques, which study a portion of the works covered here in terms of scalability, robustness and practicability. Interested readers can refer to these surveys for further reading (Ji et al., 2016d, 2015b; Lee et al., 2017a; Abawajy et al., 2016).

3.2. Graph Anonymization

Another research direction in protecting the privacy of users in graph data is graph anonymization. Existing anonymization approaches use different techniques and mechanisms and can be categorized into five main groups: k-anonymity based approaches (Liu and Terzi, 2008; Zhou and Pei, 2008; Zou et al., 2009; Yuan et al., 2010; Cheng et al., 2010b), edge manipulation techniques (Ying and Wu, 2009), cluster based techniques (Hay et al., 2008; Bhagat et al., 2009; Khairnar and Bajpai, 2014; Liu and Mittal, 2016; Thompson and Yao, 2009; Mittal et al., 2012), random walk based techniques (Liu et al., 2016b; Mittal et al., 2012), and differential privacy based techniques (Sala et al., 2011; Proserpio et al., 2014; Xiao et al., 2014; Wang and Wu, 2013). We discuss each of these categories below.

3.2.1. K-anonymity Based Approaches

The aim of k-anonymity methods is to anonymize each user/node in the graph so that it is indistinguishable from at least k−1 other users (Sweeney, 2002). Liu et al. (Liu and Terzi, 2008) propose a framework for k-degree anonymization, in which for each user there are at least k−1 other users with the same degree. The goal of this approach is to add/delete the minimum number of edges needed to preserve k-degree anonymity. The algorithm has two steps: first, given the degree sequence of the original graph, a k-degree-anonymized version of the degree sequence is constructed; second, the anonymized graph is built from the anonymized degree sequence. In another work (Zhou and Pei, 2008), Zhou et al. aim to achieve k-neighborhood anonymity, under the assumption that the adversary knows the subgraph constructed by the immediate neighbors of a target node. In the first step of the anonymization, the 1-hop neighborhoods of all users are extracted and encoded in a way that lets isomorphic neighborhoods be easily identified. In the second step, users with similar/isomorphic neighborhoods are grouped together until the size of each group is at least k. Each group is then anonymized to satisfy k-neighborhood anonymity, so that each neighborhood has at least k−1 isomorphic neighborhoods in the same group. Eventually, this approach anonymizes the graph against neighborhood attacks.

Zou et al. (Zou et al., 2009) propose a k-automorphism based framework which protects the graph against multiple attacks, including the neighborhood attack (Zhou and Pei, 2008), degree based attack (Liu and Terzi, 2008), hub-fingerprint attack (Hay et al., 2008) and subgraph attack (Hay et al., 2008). A graph is k-automorphic if there exist k−1 non-trivial automorphic functions of the graph, so that for each user the attacker cannot distinguish her from her k−1 symmetric vertices. The proposed approach first partitions the graph into blocks and then clusters the blocks into groups (graph partitioning step). In the second step, alignments of blocks are obtained and original blocks are replaced with alignment blocks (block alignment step). In the last step, edge copy is performed to obtain the anonymized graph: for each edge between two users, edges between the corresponding pairs of automorphic images of those users are added. The authors also propose the use of generalized vertex IDs for handling dynamic data releases. A similar work by Cheng et al. (Cheng et al., 2010b) proposes a k-isomorphism anonymization approach. A graph is k-isomorphic if it consists of k disjoint subgraphs and all subgraph pairs are isomorphic. In the first step, the graph is partitioned into k subgraphs with the same number of vertices; then edges are added or deleted so that these subgraphs are isomorphic. This approach protects the published graph against neighborhood attacks (Zhou and Pei, 2008).

Yuan et al. (Yuan et al., 2010) combine semantic and graph information to achieve personalized privacy anonymization. In particular, they consider three levels of attacker knowledge regarding the target user: 1) only attribute information; 2) both attribute and degree information; and 3) a combination of attribute, node degree and neighborhood information. They accordingly propose three levels of protection to achieve k-anonymity: level 1 uses label generalization; level 2 additionally uses node/edge adding; and level 3 uses edge label generalization as well.

3.2.2. Edge Manipulation Based Approaches

Edge manipulation and randomization algorithms for social graphs usually utilize edge-based randomization strategies such as random edge addition/deletion and random edge switching (Ying and Wu, 2009). Ying et al. (Ying and Wu, 2009) propose spectrum-preserving edge editing, which either adds random edges to the graph while removing others at random, or alternatively switches edges. In the switching technique, two random edges (t, w) and (u, v) are selected from the original edge set such that the edges (t, v) and (u, w) do not already exist in the graph; the edges (t, w) and (u, v) are then removed and the new edges (t, v) and (u, w) are added instead, preserving every node's degree. Backes et al. (Backes et al., 2017) also propose a randomization based approach to preserve the privacy of social links between users in graph data and counteract link inference attacks. In this type of attack, the adversary exploits users' mobility traces to infer social links between them, with the intuition that friends have more similar mobility profiles than two strangers (Backes et al., 2017). They utilize three privacy preserving techniques: hiding, replacement, and generalization of user mobility information. Results show that data publishers need to hide 80% of the location points, or replace 50% of them, to prevent leakage of information about users' social links.
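A minimal sketch of degree-preserving edge switching, under the assumptions that the graph is simple and edges are stored as pairs (the swap count is arbitrary):

```python
import random

def switch_edges(edges, n_switches, seed=0):
    """Degree-preserving edge switching: repeatedly pick two edges
    (t, w) and (u, v) and rewire them to (t, v) and (u, w), skipping
    swaps that would create self-loops or duplicate edges."""
    rng = random.Random(seed)
    edge_set = set(edges)
    for _ in range(n_switches):
        (t, w), (u, v) = rng.sample(sorted(edge_set), 2)
        if len({t, w, u, v}) < 4:
            continue  # shared endpoint: swap would create a self-loop
        if any(e in edge_set for e in [(t, v), (v, t), (u, w), (w, u)]):
            continue  # swap would create a duplicate edge
        edge_set -= {(t, w), (u, v)}
        edge_set |= {(t, v), (u, w)}
    return edge_set
```

Because each swap only reconnects the same four endpoints, the degree of every node in the output is identical to its degree in the input, which is exactly why switching perturbs structure while keeping degree-based utility.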

3.2.3. Clustering Based Techniques

Clustering based approaches group users and edges and reveal only the density and size of each cluster, so that individual attributes are protected. Hay et al. (Hay et al., 2008) propose an aggregation based method for graph data anonymization which is robust against three types of attacks: neighborhood, subgraph, and hub fingerprint. Hay et al.'s approach models the aggregate network structure by partitioning the original graph and describing it at the level of partitions. Partitions are treated as nodes, and the edges between them form the edges of the generalized graph. This generalized graph can then be used to randomly sample a graph that is published as the anonymized data.
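The generalization step can be sketched as releasing only block sizes and block-level edge counts; the partition itself is assumed given, and the function is an illustration of the idea rather than Hay et al.'s implementation:

```python
from collections import Counter

def generalize_graph(edges, partition):
    """Cluster-based generalization: given a node -> block assignment,
    release only each block's size and the number of edges within and
    between blocks, instead of the raw edge list."""
    sizes = Counter(partition.values())
    block_edges = Counter()
    for u, v in edges:
        # Sort the block pair so (A, B) and (B, A) count as the same cell.
        block_edges[tuple(sorted((partition[u], partition[v])))] += 1
    return dict(sizes), dict(block_edges)
```

An anonymized graph is then published by sampling any graph consistent with these counts, so an attacker only learns block-level structure.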

Another cluster based work (Bhagat et al., 2009) proposes two approaches, label lists and partitioning, which consider user attributes (i.e., labels) in addition to structural information. In the label list approach, a list of labels is allocated to each user, which includes her true label. This approach first clusters nodes into classes and then builds a set of symmetric lists deterministically for each class from the set of nodes in the corresponding class. In the partitioning approach, nodes are divided into classes and, instead of releasing full edge information, only the number of edges between and within each class is released; this is similar to the generalization approach of Hay et al. (Hay et al., 2008). Bhagat et al. also use a set of safety conditions to ensure that the released data does not leak information. The partitioning approach is more robust than the label list technique against attacks with richer background knowledge, but it has lower utility, since less information is revealed about the graph structure.

Thompson et al.'s approach (Thompson and Yao, 2009) protects graph information against i-hop degree-based attacks. They present two clustering algorithms, bounded t-means clustering and union-split clustering, which group users with similar social roles into clusters under a minimum size constraint. They then apply their inter-cluster matching anonymization method, which anonymizes the social graph by removing/adding edges according to the users' inter-cluster connectivity. The numbers of nodes and of edges between and within clusters are then released, similar to Hay et al.'s approach (Hay et al., 2008). Another work (Khairnar and Bajpai, 2014) proposes an incremental approach that partitions graph data and releases cluster centroid information as the anonymized data. Liu and Mittal (Liu and Mittal, 2016) also propose a clustering based anonymization technique which considers the evolutionary dynamics of social graphs, such as node/edge addition/deletion, and consistently anonymizes the graph over time. It first dynamically clusters nodes and then perturbs the intra-cluster and inter-cluster links for changed clusters in a way that preserves the structural properties of the social media graph. They leverage the static perturbation method of (Mittal et al., 2012) to modify intra-cluster links, and randomly connect marginal nodes according to their degree to create fake inter-cluster links. The obfuscated graph has higher indistinguishability, defined from an information theoretic perspective.

3.2.4. Random Walk Based Approaches

Another group of works utilizes the idea of random walks to anonymize graph data. Random walks have previously been used in many security applications, such as Sybil defense (Al-Qurishi et al., 2017); recent works also use the idea for anonymizing social graphs. The work of Mittal et al. (Mittal et al., 2012) introduces a random-walk based edge perturbation algorithm. For each node u, a random walk of length t is performed starting from one of u's contacts, and an edge between u and the walk's destination node is added with an assigned probability, with the original edge removed accordingly. This probability decreases as more random walks are performed from u's contacts. Later, Liu et al. (Liu et al., 2016b) improve this approach: instead of a fixed random walk of length t, they use an adaptive random walk whose length is learned from local structural characteristics. Their method first predicts the local mixing time for each node, i.e., the minimum random walk length for a walk starting at that node to come within a given distance of the stationary distribution. This mixing time is predicted from the local structure and limited global knowledge of the graph, and is then used to adjust the length of the random walk for social graph anonymization.
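A simplified sketch of random-walk edge perturbation follows; the probabilistic keep/remove schedule of the original algorithm is omitted, so every edge is simply rewired to a walk endpoint:

```python
import random
from collections import defaultdict

def random_walk_perturb(edges, walk_len=2, seed=0):
    """Random-walk edge perturbation in the spirit of Mittal et al.:
    replace each edge (u, v) with an edge from u to the endpoint of a
    length-walk_len random walk started at v. Longer walks push the
    perturbed graph further from the original structure."""
    rng = random.Random(seed)
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v); adj[v].append(u)
    perturbed = set()
    for u, v in edges:
        cur = v
        for _ in range(walk_len):
            cur = rng.choice(adj[cur])
        if cur != u:                       # avoid self-loops
            perturbed.add(tuple(sorted((u, cur))))
    return perturbed
```

Because each replacement edge still starts at u, every node keeps roughly its original degree, which is the utility argument made for this family of perturbations.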

3.2.5. Differential Privacy Based Approaches

Differential privacy (Dwork, 2008) was first proposed to provide a strong privacy guarantee for statistical database queries; many recent works extend it to social graph data. Sala et al. (Sala et al., 2011) first use the dK-series to capture sufficient graph structure at multiple granularities: the dK-series is the set of degree distributions of connected components of size d within a target graph (Dimitropoulos et al., 2009; Mahadevan et al., 2006). They then partition this statistical representation of the graph into clusters and use an ε-differential privacy mechanism to add noise to the representation in each cluster. Proserpio et al. (Proserpio et al., 2014) propose another differentially private approach which scales down the magnitude of the added noise by reducing the contributions of challenging records.
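The Laplace mechanism underlying these approaches can be sketched on a degree histogram; the sensitivity value below is an illustrative assumption, not any paper's calibration:

```python
import math
import random

def dp_degree_histogram(degrees, epsilon, max_degree, seed=0):
    """Release a node-degree histogram under epsilon-differential privacy
    using the Laplace mechanism. The L1 sensitivity of 4 assumed below
    (one edge change moves two nodes, each between two histogram cells)
    is an illustrative choice, not a tight published bound."""
    rng = random.Random(seed)
    hist = [0] * (max_degree + 1)
    for d in degrees:
        hist[min(d, max_degree)] += 1

    def laplace(b):
        # Sample Laplace(0, b) by inverting the CDF of a uniform draw.
        u = rng.random() - 0.5
        sign = -1.0 if u < 0 else 1.0
        return -b * sign * math.log(1.0 - 2.0 * abs(u))

    scale = 4.0 / epsilon   # noise scale = sensitivity / epsilon
    return [h + laplace(scale) for h in hist]
```

Smaller ε means a larger noise scale and stronger privacy; with a very large ε the released histogram is essentially exact.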

In another work, Wang et al. (Wang and Wu, 2013) use dK-graph generation models to generate sanitized graphs. Their approach first extracts various statistics from the original social graph, such as degree correlations, then enforces differential privacy on the learned statistics, and finally uses the perturbed statistics to generate an anonymized graph with dK-graph models. Different from the approach of Sala et al. (Sala et al., 2011), noise for the dK-2 series is generated based on smooth sensitivity rather than global sensitivity, in order to reduce the magnitude of the added noise; smooth sensitivity is a smooth upper bound on the local sensitivity used when deciding the noise magnitude (Nissim et al., 2007). Another work (Xiao et al., 2014) proposes an anonymization approach satisfying edge ε-differential privacy to hide each user's connections to other users. They propose to transform edges into connection probabilities via statistical Hierarchical Random Graphs (HRG) under differential privacy. In particular, their approach infers the HRG by exploring the HRG model space, sampling an HRG by a Markov Chain Monte Carlo method, and generating the sanitized graph according to the sampled HRG while satisfying differential privacy. Their results show that using edge probabilities can significantly reduce the noise scale compared to using the edges directly.

In another work, Liu et al. (Liu et al., 2016a) show that differential privacy is not robust to de-anonymization attacks when there is dependence among dataset entries. They propose a stronger privacy notion, dependent differential privacy, which incorporates the probabilistic dependence between the tuples in a statistical database, along with an effective perturbation framework that provides privacy guarantees. Their results show that more noise should be added when there is dependence between tuples, with the amount of noise depending on the sensitivity of two tuples as well as the dependence relationship between them. They evaluate the proposed framework on graph data by sanitizing the degree distribution of a given graph.

Ji et al. (Ji et al., 2016d, 2015b) and Abawajy et al. (Abawajy et al., 2016) study the defense and attack performance of a portion of the existing social graph anonymization and de-anonymization techniques. Ji et al. (Ji et al., 2016d, 2015b) have also performed a thorough theoretical and empirical analysis of a portion of the existing related works. Results demonstrate that anonymized social graphs are vulnerable to de-anonymization attacks.

4. Authors in Social Media and Privacy

People have the right to anonymous free speech on different topics such as politics (Narayanan et al., 2012). An author’s identity can be unmasked by adversaries when her real name or IP address is revealed to a service provider. Authors can use tools such as Tor to protect their identity at the network level (Dingledine et al., 2004); however, manually generated content will always reflect some characteristics of the person who authored it. For example, an anonymous online author may be prone to several specific spelling errors or have other recognizable idiosyncrasies (Narayanan et al., 2012). These characteristics could be enough to figure out whether the authors of two pieces of content are the same. Therefore, given material authored under the author’s true identity, the adversary can discover the identity behind content the same author posted online anonymously.

Identifying the author of a text according to her writing style, a.k.a. stylometry, has been studied for well over a century (Mendenhall, 1887; Mosteller and Wallace, 1964; Stamatatos, 2009). With the advent of machine learning techniques, researchers started to extract textual features and discriminate between 100–300 authors (Abbasi and Chen, 2008). Applications of author identification include identifying authors of terroristic threats and harassing messages (Chaski, 2005), detecting fraud (Afroz et al., 2012), and extracting authors’ demographic information (Koppel et al., 2009).

Privacy implications of stylometry have been studied recently. For example, Rao et al. (Rao et al., 2000)

investigate whether people who are posting under different pseudonyms to USENET newsgroup can be linked based on their writing style. They use a dataset of 117 people having 185 different pseudonyms and exploit function words and Principal Component Analysis (PCA) to perform matching between newsgroups posting and email domains. Another work from Koppel et al. 

(Koppel et al., 2006, 2011), studies author identification at the scale of over 10,000 blog authors. They use 4-grams of characters which is a context specific feature. The problem with this work is that it is not clear whether their approach is solving author recognition or context recognition. In another work, Koppel et al. (Koppel et al., 2009) use both content-based and stylistic features to identify 10,000 authors in the blog corpus dataset. There are also several works on identifying authors of academic papers under blind review based on the citations of the paper (Bradley et al., [n. d.]; Hill and Provost, 2003) or other sources from unblind texts of potential authors (Nanavati et al., 2011).
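The character-n-gram matching underlying these studies can be sketched as a simple nearest-neighbor attributor. This is an illustrative stand-in (the surveyed works train proper classifiers over far richer feature sets), and the toy texts and author names are invented.

```python
from collections import Counter

def char_ngrams(text, n=4):
    """Count overlapping character n-grams, the feature used at scale in
    (Koppel et al., 2006, 2011)."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine(a, b):
    keys = set(a) | set(b)
    dot = sum(a.get(k, 0) * b.get(k, 0) for k in keys)
    na = sum(v * v for v in a.values()) ** 0.5
    nb = sum(v * v for v in b.values()) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def attribute(anonymous_text, candidate_texts):
    """Return the candidate author whose known writing is closest to the
    anonymous text in character-4-gram space (nearest-neighbor stand-in
    for the classifiers used in the surveyed works)."""
    probe = char_ngrams(anonymous_text)
    return max(candidate_texts,
               key=lambda who: cosine(probe, char_ngrams(candidate_texts[who])))

known = {
    "alice": "definately going tomorow, definately!",
    "bob": "I shall attend tomorrow, as planned.",
}
guess = attribute("definately see you tomorow", known)  # matches "alice"
```

The repeated misspellings ("definately", "tomorow") are exactly the kind of idiosyncrasy the section above describes: they produce shared 4-grams that dominate the similarity score.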

Narayanan et al. (Narayanan et al., 2012) propose another author identification attack which exploits 1,188 real-valued features of each post, such as frequency of characters, capitalization of words (e.g., lowercase and uppercase words), syntactic structure (extracted by the Stanford Parser (Klein and Manning, 2003), e.g., noun phrases containing a personal pronoun, noun phrases containing a singular proper noun), distribution of word lengths, etc. These features capture the writing style of the author regardless of the topic at hand. This approach works for re-identifying a large number of authors and has also been tested in a cross-context setting (i.e., two different blogs). However, it will not work when authors deliberately anonymize their writing style.

Almishari et al. (Almishari and Tsudik, 2012) propose a new linkage attack which investigates the linkability of prolific reviews that users post on social media platforms. More specifically, given a subset of information on reviews made by an anonymous user, this approach seeks to map it to a known identified record. The approach first extracts four types of tokens: unigrams, digrams, ratings, and the category of the reviewed entity. Then, it uses Naive Bayes and Kullback-Leibler divergence models to re-identify the anonymized information. This approach could even be used for identity disclosure attacks across multiple platforms using people’s posts and reviews.
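One of the two models in this linkage attack, Kullback-Leibler divergence over token distributions, can be sketched as follows. The tokenization is simplified to unigrams only (the attack also uses digrams, ratings, and categories), and the review data is invented.

```python
import math
from collections import Counter

def unigram_dist(text, vocab, alpha=1.0):
    """Laplace-smoothed unigram distribution over a shared vocabulary."""
    counts = Counter(text.lower().split())
    total = sum(counts[w] for w in vocab) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}

def kl(p, q):
    """Kullback-Leibler divergence D(p || q) over a shared support."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)

def link_reviews(anonymous_review, identified_records):
    """Map an anonymous review to the identified account whose token
    distribution is closest in KL divergence, mirroring one of the two
    re-identification models in the surveyed attack."""
    vocab = set(anonymous_review.lower().split())
    for text in identified_records.values():
        vocab |= set(text.lower().split())
    p = unigram_dist(anonymous_review, vocab)
    return min(identified_records,
               key=lambda who: kl(p, unigram_dist(identified_records[who], vocab)))

records = {
    "user_a": "great tacos amazing salsa great service",
    "user_b": "terrible parking rude staff terrible noise",
}
match = link_reviews("amazing tacos great salsa", records)  # "user_a"
```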

Bowers et al. (Bowers et al., 2015) propose an anonymization approach which uses iterative language translation (ILT) to conceal one’s writing style. This approach first translates English text into a foreign language (e.g., Spanish, Chinese, etc.) and then translates it back to English, for three iterations. Another work from Mack et al. (Mack et al., 2015) evaluates Bowers’s work by introducing a feature selection approach, namely Generative and Evolutionary Feature Selection (GEFES), which masks out non-salient features from a set of predefined, previously extracted features. Both (Bowers et al., 2015) and (Mack et al., 2015) are tested on a set of users’ blog posts, and the results show the efficiency of ILT-based anonymization.

A recent work by Zhang et al. (Zhang et al., 2018) anonymizes users’ textual information before publishing user-generated data. This approach first introduces a variant of differential privacy tailored to textual data, namely ε-Text Indistinguishability, to overcome the curse-of-dimensionality problem that arises when the original differential privacy is deployed on high-dimensional textual data. It then proposes a framework which perturbs the user-keyword matrix by adding Laplace noise so as to satisfy ε-Text Indistinguishability. Results confirm both the utility and privacy of the sanitized data.

5. Social Media Profile Attributes and Privacy

A user’s profile includes her self-disclosed demographic attributes such as age, gender, majors, cities she has lived in, etc. To address the privacy of users, social networks usually offer the option for users to limit access to their attributes, i.e., make them visible only to friends or friends of friends. A user could also create a profile without explicitly disclosing any attribute information. A social network thus is a mixture of both private and public user information. However, there exists a privacy attack which focuses on inferring users’ attributes. This attack is known as the attribute inference attack, and it leverages publicly available information of users in social networks to infer missing or incomplete attribute information (Gong and Liu, 2016).

The attacker could be any party who is interested in this information, such as social network service providers, cyber criminals, data brokers, and advertisers. Data brokers benefit from selling individuals’ information to other parties such as banks, advertisers, and insurance companies. Social network providers and advertisers leverage users’ attribute information to provide more targeted services and advertisements. Cyber criminals exploit attribute information to perform targeted social engineering, spear phishing attacks, and attacks on personal-information-based backup authentication (Gupta et al., 2013). This attribute information could also be used for linking users across multiple sites (Goga et al., 2013; Shu et al., 2017) and records (e.g., vote registration records) (Sweeney, 2002; Minkus et al., 2015). Existing attacks can be categorized into two groups: friend-based (He et al., 2006; Lindamood et al., 2009; Thomas et al., 2010; Mislove et al., 2010; Gong et al., 2014; Zheleva and Getoor, 2009; Dey et al., 2012; Backstrom et al., 2010; McGee et al., 2011; Jurgens, 2013; Rout et al., 2013; Compton et al., 2014; Kong et al., 2014; Jurgens et al., 2015) and behavior-based (Weinsberg et al., 2012; Kosinski et al., 2013; Bhagat et al., 2014; Chaabane et al., 2012; Luo et al., 2014). We discuss each of these categories next.

5.1. Friend-based Profile Attribute Inference

Friend-based approaches rely on homophily theory (McPherson et al., 2001), which states that two friends are more likely to share similar attributes than two strangers. Following this intuition, if most of a user’s friends study at Arizona State University, she most likely studies at the same university. He et al. (He et al., 2006) first construct a Bayesian network from a user’s social neighbors, then use it to model the causal relations among people in the network and thus obtain the probability that the user has a specific attribute. The main challenge for this approach is scalability, as Bayesian inference does not scale to the millions of users in social networks. Another work by Lindamood et al. (Lindamood et al., 2009) uses the Naive Bayes classification algorithm to infer a user’s attributes by exploiting features from her node traits (i.e., other available attribute information) and link structure (i.e., friends). However, this approach is not usable for a user who does not share any attributes. In another work (Thomas et al., 2010), the authors propose an approach which leverages friends’ activities and information to infer a user’s attributes; features from friends and wall posts are fed into a multi-label classifier. The authors then propose a multi-party privacy approach which defends against attribute inference attacks by enforcing mutual privacy requirements for all users to prevent disclosure of users’ attributes and sensitive information.

Zheleva and Getoor (Zheleva and Getoor, 2009) study how users’ sensitive attribute information can be leaked through their social relations and group memberships. This friend-based attribute inference attack exploits social links and group information to infer sensitive attributes for each user. The authors propose various algorithms, among which LINK was found to be the best of those that only use link information. This method models each user as a binary vector whose length is the size of the network (i.e., the number of users in the network), where the j-th element is one if the user is connected to user j. Different classifiers are trained on the users with public profiles and then used to infer attributes for users with private profiles. The GROUP algorithm was the best among the methods which incorporate group information. This method first selects the groups that are relevant to the attribute inference problem, either using a feature selection measure (e.g., entropy) or manually. Next, the relevant groups are treated as features for each node and a classifier is trained. In the last step, the attributes of targeted users are predicted using the classification model. Mislove et al. introduce a similar approach which leverages users’ social links and community information (Mislove et al., 2010). Their approach takes some seed users with known attributes as input and then finds the local communities around this seed set using the available link information. It then uses the fact that users in the same community share similar attributes and infers the remaining users’ attributes based on the communities they are members of. The limitation is that this approach cannot infer attributes for users who are not assigned to any local community.
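The LINK representation is simple enough to sketch directly. The snippet below uses a nearest-neighbor rule as a stand-in for the trained classifiers of the original work, and the toy network and attribute values are invented.

```python
def link_infer(adjacency, public_attrs):
    """LINK-style inference: each user is her row of the binary adjacency
    matrix; a private user's attribute is predicted from the most similar
    public profile (a 1-NN stand-in for LINK's trained classifiers)."""
    def overlap(u, v):
        # Number of shared neighbors between users u and v.
        return sum(a & b for a, b in zip(adjacency[u], adjacency[v]))
    predictions = {}
    for user in range(len(adjacency)):
        if user in public_attrs:
            continue
        nearest = max(public_attrs, key=lambda p: overlap(user, p))
        predictions[user] = public_attrs[nearest]
    return predictions

# Six users: 0 and 1 share hub 4; 2 and 3 share hub 5.
adj = [
    [0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 1],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],
]
public = {0: "ASU", 2: "MIT"}  # users 0 and 2 disclose their university
preds = link_infer(adj, public)  # preds[1] == "ASU", preds[3] == "MIT"
```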

Gayo-Avello (Gayo Avello, 2011) proposes a semi-supervised profiling approach named McC-Splat. It treats the attribute inference problem as multiclass classification and learns attribute weights according to the user’s friends’ attributes, where the weights indicate the user’s likelihood of belonging to a given attribute-value class. Finally, McC-Splat assigns the class with the highest percentile to the target user; the percentile is calculated according to the labeled individuals’ information. In another work, Dey et al. (Dey et al., 2012) focus on predicting Facebook users’ ages from their friendship network information. Although a user’s friend list is not fully available for all users, this work uses a reverse-lookup approach to obtain a partial friend list for each user. The authors then design an iterative algorithm which estimates users’ ages based on friends’ ages, friends-of-friends’ ages, and so on. They also incorporate other public information in each user’s profile, such as high school graduation year, to estimate the birth year. Another work (Humbert et al., 2013) seeks to find a targeted user based on her social network connections and the similarity of attributes between friends. It starts from a source user and continues crawling until it reaches the target user; the navigation is guided by the set of the target user’s known attributes, the friendship links between users, and their attributes. Similarly, Labitzke et al. (Labitzke et al., 2013) study whether profile information of Facebook users can still be leaked through their social relations.

Another set of works in this category focuses on simultaneously predicting network structure (i.e., links) and inferring missing user attribute information (Yin et al., 2010a, b; Gong et al., 2014). The reason for solving these two problems together is that users with similar attributes tend to link to one another, and individuals who are friends are likely to adopt similar attributes. The work of Yin et al. (Yin et al., 2010a, b) first creates a social-attribute network graph from the original social graph and user-attribute information, i.e., nodes in the graph are either users or attributes, and edges represent either the friendship between a pair of users or the relation between a user and an attribute. The authors then use the random walk with restart algorithm (Tong et al., 2006) to calculate link relevance and attribute relevance with respect to a given user. Similarly, Gong et al. transform the attribute inference problem into a link prediction problem on the social-attribute network graph. They generalize several supervised and unsupervised link prediction algorithms to predict both user-user and user-attribute links.
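The random walk with restart used on the social-attribute graph can be sketched with plain power iteration. This is a generic RWR implementation, not the authors' code, and the toy graph (users 0-2, attribute nodes 3-4) is invented.

```python
def rwr(adjacency, seed, restart=0.15, iters=100):
    """Random walk with restart on a social-attribute graph: the
    stationary vector scores every node (user or attribute) by its
    relevance to the seed user."""
    n = len(adjacency)
    degs = [sum(row) for row in adjacency]
    p = [1.0 / n] * n
    for _ in range(iters):
        # Restart mass returns to the seed; the rest flows along edges.
        nxt = [restart if i == seed else 0.0 for i in range(n)]
        for j in range(n):
            if degs[j] == 0:
                continue
            share = (1 - restart) * p[j] / degs[j]
            for i in range(n):
                if adjacency[j][i]:
                    nxt[i] += share
        p = nxt
    return p

# Nodes 0-2 are users, nodes 3-4 are attribute values.
# Edges: user0-attr3, user1-attr3, user1-attr4, user2-attr4, user0-user1.
adj = [
    [0, 1, 0, 1, 0],
    [1, 0, 0, 1, 1],
    [0, 0, 0, 0, 1],
    [1, 1, 0, 0, 0],
    [0, 1, 1, 0, 0],
]
p = rwr(adj, seed=2)  # attribute node 4 scores higher than node 3 for user 2
```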

5.2. Behavior-based Profile Attribute Inference

Unlike friend-based approaches, behavior-based inference attacks infer a user’s attributes based on publicly available information about her behaviors and the public attributes of other users similar to her. For example, if a user mostly likes and shares posts that are mainly posted and liked by female users, this user’s gender is female with high probability. Weinsberg et al. (Weinsberg et al., 2012) propose an approach which infers users’ attributes (e.g., gender) from their behavior toward movies. In particular, each user is modeled as a vector whose size is the number of items; a non-zero value for a vector element indicates that the user has rated the corresponding item, and a zero value means she has not. The authors then use different classifiers such as logistic regression, SVM, and Naive Bayes to infer users’ genders, and their results reveal that logistic regression performs best. Accordingly, the authors propose a gender obfuscation method which adds movies and corresponding ratings to a given user’s profile such that it becomes hard to infer the gender of the user while the quality of the recommendations the user receives is minimally impacted. They use three different movie-selection strategies: random, sampled, and greedy. The sampled strategy picks a movie based on the rating distribution associated with the movies of the opposite gender, while the greedy strategy selects the movie with the highest score in the list of movies for the opposite gender. Ratings are then added for each movie based on either the average movie rating or the rating predicted using recommendation approaches such as matrix factorization. The greedy movie selection with predicted ratings achieves the best obfuscation results. Kosinski et al. (Kosinski et al., 2013) follow an approach similar to (Weinsberg et al., 2012) and construct a feature vector for each user based on the user’s Facebook likes. The authors then train logistic regression classifiers to infer various attributes for each user.
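The binary has-rated representation used by these attacks can be sketched as follows. A nearest-centroid rule stands in for the logistic-regression classifier reported best in the surveyed study, and the toy rating sets and labels are invented.

```python
def rated_vector(user_ratings, n_items):
    """Binary vector: 1 if the user rated item i, else 0."""
    return [1 if i in user_ratings else 0 for i in range(n_items)]

def infer_attribute(target_ratings, labeled, n_items):
    """Assign the target the label of the nearest class centroid in
    has-rated space (a simple stand-in for the logistic-regression
    classifier of the surveyed attack)."""
    centroids = {}
    for label, users in labeled.items():
        vecs = [rated_vector(u, n_items) for u in users]
        centroids[label] = [sum(col) / len(vecs) for col in zip(*vecs)]
    tv = rated_vector(target_ratings, n_items)
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(tv, c))
    return min(centroids, key=lambda lbl: dist(centroids[lbl]))

# Toy data: users are sets of rated item ids; labels are known genders.
labeled = {
    "female": [{0, 1, 2}, {0, 2, 3}],
    "male": [{5, 6, 7}, {6, 7, 8}],
}
guess = infer_attribute({0, 2, 4}, labeled, n_items=9)  # "female"
```

The obfuscation defense described above works against exactly this geometry: adding ratings drawn from the opposite class pulls the target's vector toward the other centroid.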

Another work from Bhagat et al. (Bhagat et al., 2014) proposes an active-learning-based attack which infers users’ attributes via interactive questions. In particular, their approach involves finding a set of movies and asking users to rate them, where each selection maximizes the attacker’s confidence in inferring users’ attributes. The work of (Chaabane et al., 2012) seeks to infer users’ attributes based on the different types of music they like. This approach first extracts a user’s interests and finds the semantic similarity among them: it uses an ontologized version of Wikipedia related to each music interest, exploits topic modeling techniques (i.e., Latent Dirichlet Allocation (LDA) (Blei et al., 2003)), and learns semantic interest topics for each user. A user is then predicted to have attributes similar to those of users who like the same types of music.

In another work, Luo et al. (Luo et al., 2014) infer the household structures of Internet Protocol Television (IPTV) users based on their watching behavior (e.g., dynamics of watching time and TV programs). Their approach first extracts related features from IPTV log data, including TV program topics and viewing behavior, using LDA and a low-rank model, respectively. Then, it combines graph-based semi-supervised learning with non-parametric regression and uses it to learn a classifier over the extracted features for inferring the household structure. A recent work by Li et al. (Li et al., 2017) uses a convolutional neural network (CNN) to infer multi-valued attributes for a target user from her ego network. A user’s ego network is the subgraph of the original social network induced by the user’s friends and the social relations among them. The CNN captures the latent relationship between users’ attributes and social links.

5.3. Friend-based and Behavior-based Profile Attribute Inference

Another category of approaches exploits both social link and user behavior information for inferring users’ attributes. Gong et al. (Gong and Liu, 2016, 2018) first build a social-behavior-attribute network (SBA) in which social structures, user behaviors, and user attributes are integrated in a unified framework. Nodes of this graph are users, behaviors, or attributes, and edges represent the relationships between these nodes. They then infer a target user’s attributes through a vote distribution attack (VIAL) model. VIAL performs a customized random walk from the target user to all other users in the augmented SBA network and assigns probabilities to the users such that a user receives a higher probability if she is structurally more similar to the target node in the SBA network. The stationary probabilities of the attribute nodes are then used to infer the attributes of the target user, i.e., the attribute with the maximum probability is assigned to the target user. Unlike most existing approaches, which only use the information of users who have an attribute, a recent work from Jia et al. (Jia et al., 2017) also incorporates information from users who do not have the attribute into the training process, i.e., negative training samples. This work associates a binary random variable with each user, characterizing whether the user has the attribute or not. It learns the prior probability of each user having a specified attribute by incorporating the user’s behavior information. Next, it models the joint probability of users as a pairwise Markov Random Field according to their social relationships and uses this model to infer the posterior probability of the attribute for each target user. Posterior probabilities are calculated using an optimized version of Loopy Belief Propagation.

5.4. Exploiting Other Sources of Information for Profile Attribute Inference

These approaches leverage sources of information other than social structures and behaviors, such as writing style (Otterbacher, 2010), posted tweets (Al Zamal et al., 2012), liked pages (Gupta et al., 2013), purchasing behavior (Wang et al., 2016), and checked-in locations (Zhong et al., 2015). A recent work combines identity and attribute disclosure across multiple social network platforms (Andreou et al., 2017). It defines the concept of matching anonymity as a measure of identity disclosure risk: given a user and her identity in a source social network, the matching anonymity set is defined as the set of identities in the target social network whose matching probability exceeds a given threshold, and the user is considered anonymous if this set is sufficiently large. Another work by Backes et al. (Backes et al., 2016) introduces a relative linkability measure that ranks identities within a social media site. In particular, it adapts the idea of k-anonymity to define, for each user in social media, an anonymity measure that captures the largest subset of identities (including the user’s own) that are within a similarity (or dissimilarity) threshold of her, considering their attributes. A recent work from Liu et al. (Liu et al., 2016a) also studies the vulnerability of the differential privacy mechanism to the inference attack problem. As stated earlier, differential privacy provides protection against an adversary who knows the entire dataset except one entry; however, it assumes independence between dataset entries. Liu et al. introduce a new inference attack in which the probabilistic dependence between dataset entries is calculated and then leveraged to infer a user’s location information from differentially private queries.

Different from all the works focusing on profile attribute inference, a recent work from (Alufaisan et al., 2017) brings evasion and poisoning attacks into this problem. As mentioned earlier, attribute inference can be interpreted as a classification problem (each attribute value is considered a class), and the information leveraged for this task can be viewed as features. This work introduces five variants of evasion and poisoning attacks to interfere with the results of profile attribute inference, and uses Facebook likes data to show the effectiveness of these attacks against inferring a user’s sexual orientation and political view. The introduced attacks are as follows:


  • Good/Bad Feature Attack (Evasion): The adversary has knowledge of the useful (good) and useless (bad) features for the inference task. She then adds good features from one class to another while removing bad features from each class for all users, introducing false signals for the predictor.

  • Mimicry Attack (Evasion): The goal is to make one class look like the other. The adversary first samples a subset of users from one class and then finds the set of most similar users in the other class. Good (bad) features are added to (removed from) the users in the found subsets.

  • Class Altering Attack (Poisoning): In this attack, the adversary randomly chooses users from one class and flips their class labels. The number of contradictory profiles then increases, which results in a higher misclassification rate.

  • Feature Altering Attack (Poisoning): The goal is to increase the misclassification rate. The adversary poisons the training data by randomly adding good feature values of one class to the other class.

  • Fake Users Addition Attack (Poisoning): The attacker poisons the data by removing a set of real users and then injecting fake users into the training dataset. Feature values of fake users are selected randomly from the real users’ feature values.
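For concreteness, the Class Altering Attack above can be sketched as a label-flipping routine over (features, label) pairs. The function and toy data are illustrative, not the authors' implementation.

```python
import random

def class_altering_attack(training, target_label, fraction, seed=0):
    """Class Altering Attack sketch: flip the labels of a random
    `fraction` of profiles currently labeled `target_label`, so the
    poisoned training set contains contradictory profiles."""
    rng = random.Random(seed)
    poisoned = list(training)
    idx = [i for i, (_, lbl) in enumerate(poisoned) if lbl == target_label]
    for i in rng.sample(idx, int(len(idx) * fraction)):
        feats, _ = poisoned[i]
        # Assign the first other label found in the training data.
        other = next(lbl for _, lbl in training if lbl != target_label)
        poisoned[i] = (feats, other)
    return poisoned

# Toy training data: Facebook-like features mapped to a political view.
train = [({"likes_page_a"}, "liberal"),
         ({"likes_page_b"}, "conservative")] * 10
poisoned = class_altering_attack(train, "liberal", fraction=0.5)
flipped = sum(1 for (f, l), (pf, pl) in zip(train, poisoned) if l != pl)
```

A classifier trained on `poisoned` now sees identical feature sets with contradictory labels, which is exactly the mechanism by which the attack raises the misclassification rate.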

6. Social Media Users Location and Privacy

The location disclosure attack is a specific version of the attribute inference attack in which the adversary focuses on inferring geo-location information for a given user. It takes some geolocated data as input and produces additional knowledge about target users. More precisely, the objective of this attack may be to: 1) predict the movement patterns of an individual, 2) learn the semantics of the target user’s mobility behavior, 3) link records of the same individual, and 4) identify points of interest (Gambs et al., 2010). Existing works incorporate a given user’s friends’ known geo-location information (Backstrom et al., 2010; McGee et al., 2011; Jurgens, 2013; Rout et al., 2013; Compton et al., 2014; Kong et al., 2014; Jurgens et al., 2015; McGee et al., 2013). The work of (Backstrom et al., 2010) introduces a probabilistic model representing the likelihood of the target user’s location based on her friends’ locations and the geographic distances between them. (Kong et al., 2014) and (McGee et al., 2011) extend Backstrom et al.’s work (Backstrom et al., 2010) and identify which of the target user’s friends are strong predictors of her location.

In another work, McGee et al. (McGee et al., 2013) integrate social tie strength information to capture the uncertainty across multiple location granularities, the reason being that not all relationships in social media are the same and the locations of friends with strong ties are more revealing of a user’s location. Rout et al. (Rout et al., 2013) deploy an SVM classifier over a given set of features to predict the target user’s location. These features include the cities of the target user’s friends, the number of friends in the same city as the target user, and the number of reciprocal relationships the target user has per city. Jurgens (Jurgens, 2013) infers locations by proposing an iterative multi-pass label propagation approach. This approach calculates each target user’s location as the geometric median of her friends’ locations and seeks to overcome the problem of sparse ground-truth data. The work of (Compton et al., 2014) extends (Jurgens, 2013) and limits the propagation of noisy locations by weighting different locations using information such as the number of times the users have interacted.
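The geometric-median label propagation of (Jurgens, 2013) can be sketched with Weiszfeld's algorithm. This simplification treats latitude/longitude as planar coordinates (real implementations work on the sphere), and the toy users and coordinates are invented.

```python
def geometric_median(points, iters=50):
    """Weiszfeld iteration for the geometric median of 2-D points."""
    x = sum(p[0] for p in points) / len(points)
    y = sum(p[1] for p in points) / len(points)
    for _ in range(iters):
        num_x = num_y = denom = 0.0
        for px, py in points:
            d = ((px - x) ** 2 + (py - y) ** 2) ** 0.5 or 1e-9
            num_x += px / d
            num_y += py / d
            denom += 1 / d
        x, y = num_x / denom, num_y / denom
    return x, y

def propagate_locations(friends, known, passes=3):
    """Multi-pass label propagation sketch: each unlocated user takes the
    geometric median of her friends' current location estimates."""
    est = dict(known)
    for _ in range(passes):
        for user, flist in friends.items():
            if user in known:
                continue
            pts = [est[f] for f in flist if f in est]
            if pts:
                est[user] = geometric_median(pts)
    return est

friends = {"carol": ["alice", "bob", "dave"], "dave": ["alice", "bob"]}
known = {"alice": (33.4, -112.0), "bob": (33.5, -111.9)}
est = propagate_locations(friends, known)
```

The multi-pass structure is what lets estimates reach users with no located friends initially: dave gets a location in pass one, which carol can then use.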

Another work from Cheng et al. (Cheng et al., 2010a) proposes a probabilistic framework which infers Twitter users’ city-level locations based on the content of their tweets. The idea is that users’ tweets include either implicit or explicit location-specific content, e.g., place names, or words and phrases associated with certain locations (e.g., "howdy" for Texas). It uses a lattice-based neighborhood smoothing technique to even out the word probabilities and overcome the tweet sparsity challenge. Hecht et al. (Hecht et al., 2011) also found that a considerable fraction of Twitter users either do not provide their real location information or share fake locations or sarcastic comments to fool location inference approaches. They show that a user’s location can nevertheless be inferred using machine learning techniques from the implicit behavior reflected in her tweets. In another work, Ryoo et al. (Ryoo and Moon, 2014) refine Cheng et al.’s city-level location inference approach (Cheng et al., 2010a) to 500 m distance bins. Given GPS-tagged tweets for a set of users, their approach builds geographic distributions of words and computes a user’s location as a weighted center of mass of the user’s words. It then uses a probabilistic model and computes the foci and dispersions by binning the distances between GPS coordinates and each word’s center into 500 m bins for computational scalability.
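The content-based scoring in such approaches reduces to comparing a user's words against per-city word distributions. The sketch below uses a simple probability floor instead of the lattice-based neighborhood smoothing of (Cheng et al., 2010a), and the word probabilities are invented.

```python
import math

def infer_city(tweet_words, city_word_probs):
    """City-level inference sketch: score each city by the smoothed
    log-probability of the user's words under that city's word
    distribution and return the best-scoring city."""
    best, best_score = None, -math.inf
    for city, probs in city_word_probs.items():
        # Unseen words get a small floor probability (crude smoothing).
        score = sum(math.log(probs.get(w, 1e-6)) for w in tweet_words)
        if score > best_score:
            best, best_score = city, score
    return best

# Invented per-city unigram probabilities for location-indicative words.
city_word_probs = {
    "houston": {"howdy": 0.02, "rockets": 0.01, "taco": 0.005},
    "seattle": {"rain": 0.02, "coffee": 0.01, "ferry": 0.005},
}
guess = infer_city(["howdy", "taco", "rain"], city_word_probs)  # "houston"
```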

Li et al. (Li et al., 2012b) introduce a unified discriminative influence model which considers both users’ social network and user-centric data (e.g., tweets) in order to address the scarce and noisy data challenge in location inference. It first combines social network and user data in a probabilistic framework viewed as a heterogeneous graph, with users and tweets as nodes and social and tweeting relations as edges. Every node in this graph is then associated with a location, and the proposed probabilistic influence model measures how likely an edge is to be generated between two nodes given their locations, which further handles the noisy-data challenge in the location inference problem. It then predicts a user’s location either locally or globally. Another similar work from Li et al. (Li et al., 2012a) exploits a user’s tweets and social relations to build a complete location profile which infers a set of multiple long-term geographic location scopes related to her, including not only her home location but also other related ones, e.g., her workplace. Their approach also captures the locations implied by social relations (e.g., Bob and Alice are friends because they both live in Texas). In particular, their approach is a probabilistic generative model consisting of three components: 1) a location-based following model, 2) a location-based tweeting model, and 3) partial information from users’ known locations.

Srivatsa and Hicks (Srivatsa and Hicks, 2012) propose a de-anonymization attack which exploits users’ friendship information in social media to de-anonymize their mobility traces. The idea behind this approach is that people meet those who have a relationship with them, and thus they can be identified by their social relationships. This approach models mobility traces as contact graphs and identifies a set of seed users in both graphs, i.e., the contact graph and the social network friendship graph. In the second step, it propagates the mapping from the seed users to the remaining users in the graphs. The approach uses Distance Vector, Randomized Spanning Trees, and Recursive Subgraph Matching heuristics to measure the mapping strength and propagate the measured strength through the network.

Another work from Ji et al. (Ji et al., 2014b, 2016c) improves on the work of Srivatsa and Hicks (Srivatsa and Hicks, 2012) in terms of accuracy and computational complexity. This work focuses on mapping users’ anonymized mobility traces to social media accounts. In addition to users’ local features, their approach incorporates users’ global characteristics as well. Ji et al. define three similarity metrics, structural similarity, relative distance similarity, and inheritance similarity, which are then combined into a unified similarity. Structural similarity considers features such as degree centrality, closeness centrality, and betweenness centrality, while relative distance similarity captures the distance between users and seed users. Inheritance similarity considers the number of common neighbors which have already been mapped, as well as the degree similarity between the users in the mobility trace graph and the social media network graph. Next, Ji et al. (Ji et al., 2014b, 2016c) propose an adaptive de-anonymization framework which adaptively starts de-anonymizing from a core matching set consisting of a number of mapped users together with the mapping spanning set within a certain number of hops of them.
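Two of the three similarity metrics, and their combination into a unified score, can be sketched as follows. The exact definitions and weights in (Ji et al., 2014b, 2016c) differ, so this is an illustrative simplification with invented toy data.

```python
def structural_similarity(deg_a, deg_b):
    """Degree-based structural similarity in [0, 1] (the surveyed attack
    also folds in closeness and betweenness centrality)."""
    hi = max(deg_a, deg_b)
    return 1.0 if hi == 0 else 1 - abs(deg_a - deg_b) / hi

def inheritance_similarity(neigh_a, neigh_b, mapping):
    """Fraction of a's already-mapped neighbors whose images under the
    current mapping are neighbors of b in the other graph."""
    mapped = [mapping[n] for n in neigh_a if n in mapping]
    if not mapped:
        return 0.0
    return sum(1 for m in mapped if m in neigh_b) / len(mapped)

def unified_similarity(a, b, mapping, w=(0.5, 0.5)):
    """Weighted combination of the per-metric similarities, echoing the
    unified score of the surveyed work (weights here are illustrative)."""
    return (w[0] * structural_similarity(len(a["neigh"]), len(b["neigh"]))
            + w[1] * inheritance_similarity(a["neigh"], b["neigh"], mapping))

trace_user = {"neigh": {"t1", "t2"}}        # node in the mobility trace graph
account = {"neigh": {"s1", "s2", "s3"}}     # candidate social media account
mapping = {"t1": "s1", "t2": "s9"}          # trace node t1 already mapped to s1
score = unified_similarity(trace_user, account, mapping)
```

In the full attack this score is recomputed as the mapping grows, which is what makes the adaptive framework above possible.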

In another work (Mahmud et al., 2014), the locations of Twitter users are inferred at different granularities (e.g., city, state, time zone, geographical region) based on their tweeting behavior (frequency of tweets per time unit) and the content of their tweets. This approach exploits external location knowledge (e.g., a dictionary containing names of cities and states, and location-based services such as Foursquare) to find explicit references to locations in tweets. All features are then fed into a dynamically weighted ensemble of statistical and heuristic classifiers.

Another work from Wang et al. (Wang et al., 2018) links users’ identities across multiple services and social media platforms (even of different types) according to the spatial-temporal locality of their activities, i.e., their mobility traces. This work also assumes that individuals can have multiple IDs/accounts. The motivation behind their algorithm is that IDs corresponding to the same person are online at the same time in the same location, and users’ daily movement is predictable with repeated patterns. Wang et al. model users’ information as a contact graph where nodes are IDs (regardless of the service) and an edge connects IDs that have visited the same location, with the edge weight denoting the number of co-locations of the two nodes. Then, a Bayesian matching algorithm is proposed to find the most probable matching candidates for a given target ID, and a Bayesian inference method is used to generate confidence scores for ranking the candidates.

The work of (Jurgens et al., 2015) compares different location inference attacks on social networks. Other surveys discuss location inference techniques specifically for Twitter (Ajao et al., 2015; Zheng et al., 2018). Note that a large portion of research is dedicated to inference attacks on geolocated data, which is out of the scope of this survey (Shokri et al., 2011; Gambs et al., 2010; Liu et al., 2018). A thorough survey of geolocation data privacy is also available for interested readers (Liu et al., 2018); its scope differs from ours, as we cover the location privacy of users based on their activities in social media.

7. Recommendation Systems and Privacy

Recommendation systems help individuals find information that matches their interests by building user-interest profiles and recommending items to users based on those profiles. These profiles can be extracted from users' interactions as they express their preferences and interests, e.g., clicks, likes/dislikes, ratings and purchases (Beigi and Liu, 2018; Zafarani et al., 2014). While user profiles help recommender systems improve the quality of the services a user receives (a.k.a. utility), they also raise privacy concerns by reflecting users' preferences (Ramakrishnan et al., 2001). Many works have studied the relationship between privacy and utility and have proposed solutions to handle the trade-off. In general, these works focus on obfuscating users' interactions to hide their actual intentions and prevent accurate profiling (Puglisi et al., 2015). Under this strategy, users need not trust any third party or external entity to preserve their privacy. Existing approaches use different techniques and mechanisms and fall mainly into three categories: cryptographic techniques (Aimeur et al., 2008; Canny, 2002; Hoens et al., 2010; Tang and Wang, 2016; Badsha et al., 2017), differential privacy based approaches (McSherry and Mironov, 2009; Machanavajjhala et al., 2011; Zhu et al., 2013; Jorgensen and Yu, 2014; Shen and Jin, 2014; Hua et al., 2015; Guerraoui et al., 2015; Zhu and Sun, 2016; Meng et al., 2018) and perturbation based techniques (Parra-Arnau, 2017; Rebollo-Monedero et al., 2011; Parra-Arnau et al., 2014; Polat and Du, 2003; Luo and Chen, 2014; Xin and Jaakkola, 2014; Parameswaran and Blough, 2007; Puglisi et al., 2015; Howe and Nissenbaum, 2009).

A group of works provides cryptographic solutions to the problem of secure recommender systems. These approaches do not let a single trusted party have access to everyone's data (Aimeur et al., 2008; Canny, 2002; Hoens et al., 2010; Tang and Wang, 2016; Badsha et al., 2017). Instead, users' ratings are stored as encrypted vectors and only aggregates of the data are made public. These approaches do not prevent privacy leaks through the outputs of recommendation systems (i.e., the recommendations themselves), and such techniques are outside the scope of this survey.

7.1. Differential Privacy Based Solutions

Works in this group utilize a differential privacy strategy to either anonymize user data before sending it to the recommendation system or perturb the recommendation outputs. McSherry et al. (McSherry and Mironov, 2009) were the first to modify leading recommendation algorithms (i.e., SVD and k-nearest neighbor) so that drawing inferences about original ratings is difficult. They utilize differential privacy to construct private covariance matrices, making the collaborative filtering algorithms that use them private without a significant loss in accuracy.
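The private covariance idea can be sketched as adding Laplace noise to an item-item covariance matrix before handing it to a collaborative filtering algorithm. The sensitivity bound below is a deliberately simplified assumption, not the calibration used by McSherry and Mironov.

```python
import numpy as np

def private_covariance(ratings, epsilon, max_rating=5.0, rng=None):
    """Laplace-noised item-item covariance (a sketch, not the paper's
    exact mechanism). `ratings` is an n_users x n_items matrix."""
    rng = np.random.default_rng(rng)
    centered = ratings - ratings.mean(axis=0)
    cov = centered.T @ centered / ratings.shape[0]
    # Assumed bound: one user changes an entry by at most (2*max_rating)^2 / n.
    sensitivity = (2 * max_rating) ** 2 / ratings.shape[0]
    noise = rng.laplace(0.0, sensitivity / epsilon, size=cov.shape)
    return cov + (noise + noise.T) / 2  # keep the released matrix symmetric
```

Downstream algorithms then operate on the noisy matrix instead of the raw ratings, so accuracy degrades gracefully as the privacy budget epsilon shrinks.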

In another work, Calandrino et al. (Calandrino et al., 2011) propose a new passive attack on recommender systems to infer a target user's transactions (i.e., item ratings). Their attack first monitors changes in the public outputs of a recommender system over a period of time. Public outputs may include related-item lists or an item-item covariance matrix. The attack then combines this information with a moderate amount of auxiliary information about the target user's transactions to infer many of the target user's unknown transactions. Calandrino et al. further introduce an active inference attack on k-NN recommender systems. In this attack, sybil user accounts are created such that the nearest neighbors of each sybil consist of the other sybils and the target user. The attack can then infer the target user's transaction history from the items recommended to any of the sybils. The results of this work confirm the existence of privacy risks in the public outputs of recommender systems. The work of McSherry et al. (McSherry and Mironov, 2009) is not effective against this attack, as it does not consider updates to the covariance matrices and cannot provide a privacy guarantee in dynamic settings. Machanavajjhala et al. (Machanavajjhala et al., 2011) then quantify the accuracy-privacy trade-off. In particular, they prove lower bounds on the minimum loss in accuracy for recommendation systems that utilize differential privacy. Moreover, they adapt two differentially private algorithms, the Laplace (Dwork et al., 2006) and Exponential (McSherry and Talwar, 2007) mechanisms, for the problem of recommendation without disclosing any sensitive user attributes. This work assumes that all user attributes are sensitive.

Previous works (McSherry and Mironov, 2009; Machanavajjhala et al., 2011) are vulnerable to the k-nearest neighbor attack as they fail to hide similar neighbors (Calandrino et al., 2011). Zhu et al. (Zhu et al., 2013) therefore propose a private neighborhood-based collaborative filtering method which protects the information of both neighbors and user ratings. The proposed work assumes that the recommender system is trusted and introduces two operations: private neighbor selection and recommendation-aware sensitivity. The first operation protects neighbors' identities by privately selecting neighbors from a list of candidates, adopting the exponential mechanism (McSherry and Talwar, 2007) to assign a probability to each candidate. The second operation enhances the performance of the recommendation system by reducing the magnitude of the added noise. After selecting neighbors, the similarities of those neighbors are perturbed with Laplace noise to mask the ratings given by any particular neighbor. Finally, neighborhood-based collaborative filtering is performed on the private data. In another work, Jorgensen et al. (Jorgensen and Yu, 2014) assume that all users' item-rating attributes are sensitive. However, different from Machanavajjhala et al. (Machanavajjhala et al., 2011), they assume that users' social relations are non-sensitive. They propose a differentially private recommendation approach which incorporates social relations alongside user-item ratings. To address utility loss, this work first clusters users according to their social relations. Then, noisy averages of user-item preferences are computed for each cluster using a differential privacy mechanism. Results show that the clustering phase reduces sensitivity and the amount of added noise, which in turn reduces utility loss.
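Private neighbor selection via the exponential mechanism can be sketched as below: each candidate is sampled with probability proportional to exp(epsilon * score / (2 * sensitivity)). The naive budget split across the k draws is an assumption, not Zhu et al.'s recommendation-aware sensitivity analysis.

```python
import math
import random

def exponential_mechanism(candidates, scores, epsilon, sensitivity=1.0, rng=None):
    """Sample one candidate with probability proportional to
    exp(epsilon * score / (2 * sensitivity))."""
    rng = rng or random.Random()
    weights = [math.exp(epsilon * s / (2 * sensitivity)) for s in scores]
    total = sum(weights)
    r = rng.random() * total
    acc = 0.0
    for cand, w in zip(candidates, weights):
        acc += w
        if r <= acc:
            return cand
    return candidates[-1]

def private_neighbors(candidates, similarities, k, epsilon, rng=None):
    """Privately pick k neighbors, naively splitting the budget per draw."""
    rng = rng or random.Random()
    chosen, pool = [], dict(zip(candidates, similarities))
    for _ in range(min(k, len(candidates))):
        cands, scores = zip(*pool.items())
        pick = exponential_mechanism(list(cands), list(scores), epsilon / k, rng=rng)
        chosen.append(pick)
        del pool[pick]
    return chosen
```

With a large budget the mechanism almost always returns the truly most similar neighbors; as epsilon shrinks, the choice becomes increasingly randomized, hiding which users were actually similar.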

Shen et al. (Shen and Jin, 2014) assume that the recommender system is untrusted. They propose a user perturbation framework which anonymizes user data under a novel differential privacy mechanism, the relaxed admissible mechanism. The recommender system then performs recommendation on users' perturbed data. They provide mathematical bounds on the privacy and utility of the anonymized data. Hua et al. (Hua et al., 2015) also propose a differentially private matrix factorization based recommender system. In particular, they solve this problem for two scenarios: a trusted and an untrusted recommender. In the first scenario, user and item profile vectors are learned via regular and private versions of matrix factorization, respectively; the private version adds noise to the item vectors to make them differentially private. In the second scenario, the item profile vectors are first learned via the private matrix factorization. Then, since a user's profile depends only on her own ratings rather than on other users', her differentially private profile vector is derived locally from the private item profiles.
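A minimal sketch of the idea of perturbing only the item-side updates is given below. The alternating gradient scheme, learning rate and per-iteration noise scale are all simplifying assumptions; Hua et al.'s actual noise calibration and objective differ.

```python
import numpy as np

def private_mf(R, mask, d=2, epsilon=1.0, iters=50, lr=0.01, rng=None):
    """Matrix factorization sketch where only item-factor updates are
    perturbed with Laplace noise: item profiles are learned 'privately',
    while user profiles could be computed locally from them."""
    rng = np.random.default_rng(rng)
    n, m = R.shape
    U = rng.normal(0, 0.1, (n, d))   # user factors
    V = rng.normal(0, 0.1, (m, d))   # item factors
    scale = 1.0 / epsilon            # assumed per-iteration noise scale
    for _ in range(iters):
        E = mask * (R - U @ V.T)     # error on observed ratings only
        U += lr * E @ V              # user update: local, no noise
        grad_V = E.T @ U
        V += lr * (grad_V + rng.laplace(0.0, scale, grad_V.shape))
    return U, V
```

With a generous budget the factorization recovers the rating matrix; tightening epsilon injects more noise into the item profiles and degrades reconstruction.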

A novel and strong form of differential privacy, namely distance-based differential privacy, has been introduced by Guerraoui et al. (Guerraoui et al., 2015). Distance-based differential privacy ensures privacy for all items rated by a user as well as for the items within a given distance of them. The distance parameter controls the level of privacy and aids in tuning the recommendation privacy-utility trade-off. The proposed protocol first finds a group of similar items for each given item. Then, it creates a manipulated user profile that preserves distance-based differential privacy by selecting an item and replacing it with another one. The most similar users for an active user are also updated periodically using the altered profiles generated in the previous step.

Another differential privacy based recommendation by Zhu et al. (Zhu and Sun, 2016) applies a differential privacy mechanism within the recommendation procedure itself. In particular, it proposes two approaches: item-based and user-based recommendation algorithms. In the item-based approach, the exponential mechanism (McSherry and Talwar, 2007) is applied to the selection of related items in order to guarantee differential privacy, and the resulting differentially private item list is used to produce recommendations for a given user. Similarly, in the user-based approach, a list of related users is selected for each target user. This list is used to compute a relevance score for each item as the sum of the ratings provided by the related users; using the exponential mechanism in the selection process makes the recommendation differentially private. Another work differentiates sensitive and non-sensitive ratings to further improve the quality of recommendation systems in the long run (Meng et al., 2018). Meng et al. (Meng et al., 2018) propose a personalized privacy preserving recommender system. Given sets of sensitive and non-sensitive ratings for each user, their approach utilizes differential privacy (Dwork, 2008) to perturb users' ratings, with smaller privacy budgets (i.e., more noise) for sensitive ratings and larger budgets for non-sensitive ones. This protects users' privacy while retaining recommendation effectiveness. In order to protect sensitive ratings from untrusted friends, Meng et al. use only the non-sensitive ratings to compute the social relation regularization.
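The personalized-budget idea can be sketched as per-rating Laplace noise with a smaller budget for sensitive ratings. The specific budgets and the sensitivity value (assuming a 1-5 rating scale) are illustrative assumptions, not Meng et al.'s parameters.

```python
import numpy as np

def perturb_ratings(ratings, sensitive, eps_sensitive=0.5, eps_plain=5.0,
                    sensitivity=4.0, rng=None):
    """Add Laplace noise per rating, using a smaller privacy budget
    (hence more noise) for ratings the user marked as sensitive."""
    rng = np.random.default_rng(rng)
    out = []
    for r, is_sensitive in zip(ratings, sensitive):
        eps = eps_sensitive if is_sensitive else eps_plain
        out.append(r + rng.laplace(0.0, sensitivity / eps))
    return out
```

Sensitive ratings end up heavily distorted while non-sensitive ones stay close to their true values, which is what lets the recommender retain most of its accuracy.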

7.2. Perturbation Based Solutions

Perturbation based techniques usually obfuscate users' item ratings by adding random noise to the user data. Rebollo et al. (Rebollo-Monedero et al., 2011) propose an approach which first measures a user's privacy risk as the Kullback-Leibler (KL) divergence (a.k.a. relative entropy) (Cover and Thomas, 2012) between the user's apparent profile and the average population profile. The idea is that the more a user's profile diverges from the general population, the more information an attacker can learn about her. The approach then seeks the obfuscation rate for generating forged user profiles such that the privacy risk is minimized. The authors provide a closed-form solution for perturbing users' interactions with a recommender system so as to optimize the privacy risk function.
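The risk measure itself is straightforward to compute; a minimal sketch, treating profiles as categorical distributions over interest categories:

```python
import math

def kl_divergence(p, q):
    """KL divergence D(p || q) between two categorical distributions
    given as aligned lists of probabilities."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def privacy_risk(user_profile, population_profile):
    """Rebollo et al.-style risk: how far the user's apparent interest
    profile diverges from the population's average profile."""
    return kl_divergence(user_profile, population_profile)
```

A user whose apparent profile matches the population average has zero risk; the more skewed the profile, the higher the risk, which is exactly what forgery tries to drive down.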

Puglisi et al. (Puglisi et al., 2015) extend Rebollo et al.'s work (Rebollo-Monedero et al., 2011) to investigate the impact of this technique on content-based recommendation in terms of privacy and the potential degradation of recommendation utility. This work measures a user's privacy risk similarly to the approach proposed in (Rebollo-Monedero et al., 2011). The utility of the service is measured by the prediction accuracy of the recommender system. The paper evaluates three strategies: optimized tag forgery (Rebollo-Monedero et al., 2010), uniform tag forgery and TrackMeNot (TMN) (Howe and Nissenbaum, 2009). The uniform tag forgery method assigns forged tags according to a uniform distribution across all categories of the user profile. TMN constructs eleven categories from the Open Directory Project (ODP) classification scheme and selects tags uniformly from this set. According to this work, users' profiles tend to mimic the population distribution when larger obfuscation rates are used, which results in lower privacy risk but also lower utility. Moreover, the authors found that for a small forgery rate, it is possible to obtain an increase in privacy at the cost of only a small degradation of utility.

Polat et al. (Polat and Du, 2003) use a randomized perturbation technique (Agrawal and Srikant, 2000) to obfuscate user-generated data. Each user generates a disguised z-score for each item she has rated; the z-score for a user-item pair is based on the original item rating, the user's average rating and the total number of items she has rated. The perturbed private data is then passed to a collaborative filtering based recommender system to perform recommendation. This technique works because collaborative filtering operates on aggregated user data: although each individual user's information is scrambled, when the number of users is large the aggregate information can still be estimated with decent accuracy. The accuracy of the predictions depends on the amount of noise added.

Another work from Parameswaran et al. (Parameswaran and Blough, 2007) obfuscates user rating information and then passes the disguised information to a collaborative filtering system for recommendation. The proposed Nearest Neighbor Data Substitution (NeNDS) obfuscation method substitutes a user's data elements with those of one of her neighbors in the metric space (Parameswaran and Blough, 2005). One drawback of NeNDS is that a perturbed value may remain close enough to the original value to leave the data vulnerable. A hybrid version of NeNDS is therefore proposed which provides stronger privacy by combining geometric transformations with NeNDS: the data sets are first geometrically transformed, and then operated upon by NeNDS.
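Polat et al.'s randomized-perturbation idea, and why aggregates survive it, can be sketched as follows. The uniform noise model and the `noise_range` parameter are illustrative assumptions rather than the paper's exact disguise procedure.

```python
import random
import statistics

def disguised_z_scores(ratings, noise_range=1.0, rng=None):
    """Each user locally converts her own ratings to z-scores, then adds
    zero-mean uniform noise before sharing them with the recommender."""
    rng = rng or random.Random()
    mean = statistics.mean(ratings)
    std = statistics.pstdev(ratings) or 1.0
    z = [(r - mean) / std for r in ratings]
    return [zi + rng.uniform(-noise_range, noise_range) for zi in z]

def aggregate_estimate(all_disguised):
    """The server can still estimate per-item aggregates: zero-mean noise
    cancels out when averaged over many users."""
    n_items = len(all_disguised[0])
    return [sum(u[i] for u in all_disguised) / len(all_disguised)
            for i in range(n_items)]
```

Each individual vector is scrambled, but averaging over a few thousand users recovers the true per-item z-score pattern closely, which is exactly the property collaborative filtering needs.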

In contrast to McSherry et al. (McSherry and Mironov, 2009), Xin et al. (Xin and Jaakkola, 2014) assume that the recommender is not trusted and that the onus is on users to protect their privacy. Their approach separates the computations that users can perform locally and privately from those that must be done by the recommender system. In particular, item features are learned by the system while user features are obtained locally by the users and then used for recommendation. The approach divides users into two groups: users who publicly share their information, and those who keep their preferences private. It uses the information of the first group to estimate item features. Xin et al. show theoretically and empirically that the public information of a moderate number of users, each with a high number of ratings, is enough for an accurate estimation. Moreover, they propose a new privacy mechanism which privately releases the second-order information needed for estimating item features; this information is extracted from the users who keep their preferences private. The main assumption behind this work is not realistic, though, as in a real-world scenario it is not easy to collect ratings from a moderate number of people each with a high number of ratings.

Luo et al. (Luo and Chen, 2014) propose a perturbation-based group recommendation method which assumes that similar users are grouped together and are not willing to expose their preferences to anybody outside the group. The recommendation system then recommends items to users within the same group. Their algorithm has four steps. In the first step, users exchange their rating data with users in the same group using a secret key, which varies across users; the output of this step is a fake preference vector for each user. In the second step, the rating values are obfuscated with a chaos-based scrambling method. Similar to the traditional perturbation-based scheme of Polat et al. (Polat and Du, 2003), randomness is added to the output of the previous step to ensure that no sensitive information remains in the published data for an attacker to misuse. This information is then sent to the recommender system, which iteratively extracts information about the aggregated ratings of the users. The extracted information is used to estimate a group preference vector for collaborative filtering based recommendation.

Parra-Arnau et al. (Parra-Arnau et al., 2014) propose a privacy enhancing technology framework, PET, which perturbs users' preference information by combining two techniques: the forgery and the suppression of ratings. In this scenario, users may avoid rating items they like and instead rate items that do not reflect their actual preferences, so a user's apparent profile differs from her actual profile. Similar to (Rebollo-Monedero et al., 2011), the privacy risk of each user is measured as the KL divergence (Cover and Thomas, 2012) between the user's apparent profile and the average population distribution, while utility is controlled by the forgery and suppression rates. The authors then define a privacy-forgery-suppression optimization function which characterizes the optimal trade-off among privacy, forgery rate and suppression rate. In particular, the solution of the optimization problem specifies which ratings of each user should be forged and which should be suppressed to achieve the minimum privacy risk while keeping the utility of the data as high as possible. Similarly, Parra-Arnau (Parra-Arnau, 2017) proposes a system which generates a perturbed version of a user's rating profile according to her privacy preferences. The system has two components: 1) a profile-density model, which makes the user's profile more similar to the crowd's, and 2) a classification model, which prevents the user from being identified as a member of a given group of users. The framework considers the monetary loss to the advertising venue in exchange for privacy and optimizes the trade-off between privacy and economic compensation. The system utilizes different privacy metrics such as KL divergence and mutual information. The final output is a decision on whether each service provider (i.e., tracker) can have access to the user's profile, whether it should be blocked, or whether the user should be notified about the privacy risks.

Recently, the work of Biega et al. (Biega et al., 2017) proposes a framework which scrambles users' rating histories to preserve both their privacy and their utility. The main assumption of this paper is that service providers, i.e., recommender systems, do not need complete and accurate user profiles to provide personalized recommendations. The framework therefore splits users' profiles, which consist of user-item interaction pairs, across Mediator Accounts (MAs) in such a way that coherent pieces of different users' profiles are kept intact within the MAs. The service provider then deals with the MAs rather than the real user profiles. This preserves users' privacy by scrambling user-item interactions across various proxy accounts, while keeping utility as high as possible: each user-item interaction is assigned to the proxy account that minimizes the average coherence loss over all other objects in the account. The framework also quantifies the user's privacy-utility trade-off.

Another work from Guerraoui et al. (Guerraoui et al., 2017) introduces metrics for measuring the utility and privacy effects of a user's behavior, such as clicks and likes/dislikes, and shows that there is not always a trade-off between utility and privacy. The paper also proposes a click-advisor platform, an application of the utility and privacy metrics which warns users about the privacy and utility consequences of a click. It assumes that the recommender itself is trusted, while users' sensitive information could be learned by curious users who deduce profiles from what is recommended to them. In this work, the utility of a click by a user is the difference between the commonality of this user before and after the click, where commonality is defined as the closeness of the user's profile to other users' profiles in the system. The disclosure degree of a user is measured as the probability that the user likes certain items, and the disclosure risk of a click is accordingly defined as the difference between the user's disclosure degree before and after the click. The platform uses these privacy and utility metrics to guide users by telling them whether an intended action leads to privacy leakage or affects their utility.
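The commonality-based utility of a click can be sketched as below. Representing profiles as sets of liked items and using average Jaccard similarity as the closeness measure are simplifying assumptions, not Guerraoui et al.'s exact definitions.

```python
def commonality(profile, others):
    """Closeness of a user's profile (set of liked items) to the other
    users' profiles, here taken as the average Jaccard similarity."""
    if not others:
        return 0.0
    sims = [len(profile & o) / len(profile | o) if profile | o else 0.0
            for o in others]
    return sum(sims) / len(sims)

def click_utility(profile, item, others):
    """Utility of clicking `item`: commonality after minus before."""
    return commonality(profile | {item}, others) - commonality(profile, others)
```

A positive value means the click moves the user closer to the crowd (useful for recommendation); a click advisor would pair this with a disclosure-risk score before warning the user.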

8. Summary and Future Research Directions

Online users are increasingly sharing their personal information on social media platforms. These platforms publish and share user-generated data with third-party consumers. This data is rich in content and contains sensitive information about users, which risks exposing individuals' privacy. Recent research has shown the vulnerability of user-generated data against the two general types of attacks: identity disclosure and attribute disclosure. Sanitizing user-generated social media data is more challenging than sanitizing structured data, as it is heterogeneous, highly unstructured, noisy and inherently different from relational and tabular data. In this survey, we reviewed recent developments in the field of social media data privacy. We first reviewed traditional privacy models for structured data. Then, we reviewed, categorized and compared existing methods in terms of privacy models, privacy leakage attacks and anonymization algorithms. We also reviewed the privacy risks that exist in different aspects of social media, such as users' graph information (e.g., social relations, mobility traces, sociotemporal information), profile attributes, textual information (e.g., posts) and preferences. We categorized relevant works into five groups: 1) graph data anonymization and de-anonymization, 2) author identification, 3) profile attribute disclosure, 4) user location and privacy, and 5) recommender systems and privacy issues. For each category, we discussed existing attacks and solutions (where proposed) and classified them based on the type of data and the technique used. We outlined the privacy attacks/solutions in Figure 1. Figure 2 also depicts the relevant privacy issues with respect to the type of social media data.

Figure 1. An overview of privacy attacks and corresponding defenses in social media platforms. Tasks highlighted in red have not been extensively studied.
Figure 2. An overview of privacy issues with respect to the type of social media data. Tasks highlighted in red have not been extensively studied.

Detecting privacy issues and developing techniques to protect users' privacy in social media is challenging. Most existing works focus on introducing new attacks, so the gap between protection and detection keeps growing. Although a large body of work has emerged in recent years investigating privacy issues in social media data, the development of the tasks in each category is highly imbalanced: some are well studied, whereas others need further investigation. We highlight these tasks in red in Figure 1 and Figure 2, organized by privacy issue and by user-generated data type, respectively. Below, we identify potential research directions in this field:


  • Protecting the privacy of textual information: Textual information is noisy, high-dimensional and unstructured. It is rich in content and can reveal sensitive information that the user never explicitly exposed, such as demographic information and location. This makes textual data a very important source of information for adversaries, exploitable in many attacks. We thus need more research on anonymizing users' textual information to protect users against attacks such as author identification and profile attribute disclosure.

  • Protecting the privacy of profile attribute information: We also reviewed many state-of-the-art works that expose privacy risks with respect to profile attributes. In particular, these works introduce new attacks which infer target users' profile attributes from their behavior on social media platforms. To the best of our knowledge, there is no work introducing defense mechanisms against these attacks. One research direction is a privacy-preserving tool that warns users about their activities and the possibility of privacy leakage. Another is a privacy protection technique deployed before users' data is shared with third parties. Profile attributes are very similar to tabular datasets, yet they can be easily inferred from user-generated unstructured data.

  • Privacy of spatiotemporal social media data: Social media platforms support space-time indexed data, and users have created a large volume of time-stamped, geo-located data. Such spatiotemporal data has immense value for better understanding users' behavior. In this survey, we reviewed state-of-the-art re-identification attacks which incorporate this data to breach users' privacy. This information can be used to infer users' locations as well as their preferences and interests in the case of recommendation systems. One future research direction is investigating the role of temporal information in the privacy of online users. More research should also be done on anonymization frameworks that protect users' temporal information.

  • Privacy of heterogeneous social media data: User-generated social media data is heterogeneous and consists of different aspects. Most previous works illustrate the vulnerability of each aspect of social media data against identity and attribute disclosure attacks. Existing anonymization techniques also assume that it is enough to anonymize each aspect of heterogeneous social media data independently. Beigi et al. (Beigi et al., 2018) evaluated this assumption for two specific aspects of data, i.e., text and graph, and showed that it does not hold due to the hidden relations between different aspects of heterogeneous data. One potential research direction is to examine how different combinations of heterogeneous data (e.g., a combination of location and textual information) are vulnerable to de-anonymization attacks. Another is to improve anonymization techniques to preserve users' privacy by accounting for the hidden relations between different components of the data that arise from the innate heterogeneity of user-generated data.

  • Privacy protection against identity and attribute disclosure attacks: User-generated data in social media platforms, such as profile information, graph data, location and interests, plays an important role in helping online service providers offer better services to their users. We reviewed many works in this survey showing how this information makes users vulnerable to privacy breaches. However, very limited research has been done on effective anonymization techniques for preserving users' privacy against these attacks. More research needs to be done to develop data sanitization approaches specialized for social media data.

The authors would like to thank Alexander Nou for his help throughout the paper. This material is based upon the work supported in part by Army Research Office (ARO) under grant number W911NF-15-1-0328 and Office of Naval Research (ONR) under grant number N00014-17-1-2605.


  • Abawajy et al. (2016) Jemal H Abawajy, Mohd Izuan Hafez Ninggal, and Tutut Herawan. 2016. Privacy preserving social network data publication. IEEE communications surveys & tutorials 18, 3 (2016), 1974–1997.
  • Abbasi and Chen (2008) Ahmed Abbasi and Hsinchun Chen. 2008. Writeprints: A stylometric approach to identity-level identification and similarity detection in cyberspace. ACM Transactions on Information Systems (TOIS) 26, 2 (2008), 7.
  • Afroz et al. (2012) Sadia Afroz, Michael Brennan, and Rachel Greenstadt. 2012. Detecting hoaxes, frauds, and deception in writing style online. In Security and Privacy (SP), 2012 IEEE Symposium on. IEEE, 461–475.
  • Aggarwal et al. (2005) Gagan Aggarwal, Tomas Feder, Krishnaram Kenthapadi, Rajeev Motwani, Rina Panigrahy, Dilys Thomas, and An Zhu. 2005. Approximation algorithms for k-anonymity. Journal of Privacy Technology (JOPT) (2005).
  • Agrawal and Srikant (2000) Rakesh Agrawal and Ramakrishnan Srikant. 2000. Privacy-preserving data mining. In ACM Sigmod Record, Vol. 29.
  • Aimeur et al. (2008) Esma Aimeur, Gilles Brassard, Jose M Fernandez, Flavien Serge Mani Onana, and Zbigniew Rakowski. 2008. Experimental demonstration of a hybrid privacy-preserving recommender system. In Availability, Reliability and Security, 2008. ARES 08. Third International Conference on. IEEE, 161–170.
  • Ajao et al. (2015) Oluwaseun Ajao, Jun Hong, and Weiru Liu. 2015. A survey of location inference techniques on Twitter. Journal of Information Science 41, 6 (2015), 855–864.
  • Al-Qurishi et al. (2017) Muhammad Al-Qurishi, Mabrook Al-Rakhami, Atif Alamri, Majed Alrubaian, Sk Md Mizanur Rahman, and M Shamim Hossain. 2017. Sybil defense techniques in online social networks: a survey. IEEE Access 5 (2017), 1200–1219.
  • Al Zamal et al. (2012) Faiyaz Al Zamal, Wendy Liu, and Derek Ruths. 2012. Homophily and Latent Attribute Inference: Inferring Latent Attributes of Twitter Users from Neighbors. (2012).
  • Almishari and Tsudik (2012) Mishari Almishari and Gene Tsudik. 2012. Exploring linkability of user reviews. In European Symposium on Research in Computer Security. Springer, 307–324.
  • Alufaisan et al. (2017) Yasmeen Alufaisan, Yan Zhou, Murat Kantarcioglu, and Bhavani Thuraisingham. 2017. Hacking social network data mining. In Intelligence and Security Informatics (ISI), 2017 IEEE International Conference on. IEEE, 54–59.
  • Alvari et al. (2016) Hamidreza Alvari, Alireza Hajibagheri, Gita Sukthankar, and Kiran Lakkaraju. 2016. Identifying community structures in dynamic networks. Social Network Analysis and Mining 6, 1 (2016), 77.
  • Alvari et al. (2014) Hamidreza Alvari, Kiran Lakkaraju, Gita Sukthankar, and Jon Whetzel. 2014. Predicting guild membership in massively multiplayer online games. In International Conference on Social Computing, Behavioral-Cultural Modeling, and Prediction. Springer, 215–222.
  • Andreou et al. (2017) Athanasios Andreou, Oana Goga, and Patrick Loiseau. 2017. Identity vs. Attribute Disclosure Risks for Users with Multiple Social Profiles. In Proceedings of the 2017 IEEE/ACM ASONAM. ACM, 163–170.
  • Backes et al. (2016) Michael Backes, Pascal Berrang, Oana Goga, Krishna P Gummadi, and Praveen Manoharan. 2016. On profile linkability despite anonymity in social media systems. In Proceedings of the 2016 ACM on Workshop on Privacy in the Electronic Society. ACM, 25–35.
  • Backes et al. (2017) Michael Backes, Mathias Humbert, Jun Pang, and Yang Zhang. 2017. walk2friends: Inferring Social Links from Mobility Profiles. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security.
  • Backstrom et al. (2007) Lars Backstrom, Cynthia Dwork, and Jon Kleinberg. 2007. Wherefore art thou r3579x?: anonymized social networks, hidden patterns, and structural steganography. In Proceedings of the 16th international conference on WWW.
  • Backstrom et al. (2010) Lars Backstrom, Eric Sun, and Cameron Marlow. 2010. Find me if you can: improving geographical prediction with social and spatial proximity. In Proceedings of the 19th international conference on WWW.
  • Badsha et al. (2017) Shahriar Badsha, Xun Yi, Ibrahim Khalil, and Elisa Bertino. 2017. Privacy preserving user-based recommender system. In Distributed Computing Systems (ICDCS), 2017 IEEE 37th International Conference on. IEEE, 1074–1083.
  • Beigi (2018) Ghazaleh Beigi. 2018. Social Media and User Privacy. arXiv preprint arXiv:1806.09786 (2018).
  • Beigi et al. (2014) Ghazaleh Beigi, Mahdi Jalili, Hamidreza Alvari, and Gita Sukthankar. 2014. Leveraging Community Detection for Accurate Trust Prediction. In ASE International Conference on Social Computing, Palo Alto, CA, May 2014.
  • Beigi and Liu (2018) Ghazaleh Beigi and Huan Liu. 2018. Similar but Different: Exploiting Users’ Congruity for Recommendation Systems. In International Conference on Social Computing, Behavioral-Cultural Modeling, and Prediction. Springer.
  • Beigi et al. (2018) Ghazaleh Beigi, Kai Shu, Yanchao Zhang, and Huan Liu. 2018. Securing Social Media User Data: An Adversarial Approach. In Proceedings of the 29th on Hypertext and Social Media. ACM, 165–173.
  • Beigi et al. (2016a) Ghazaleh Beigi, Jiliang Tang, and Huan Liu. 2016a. Signed link analysis in social media networks. In 10th International Conference on Web and Social Media, ICWSM 2016. AAAI Press.
  • Beigi et al. (2016b) Ghazaleh Beigi, Jiliang Tang, Suhang Wang, and Huan Liu. 2016b. Exploiting emotional information for trust/distrust prediction. In Proceedings of the 2016 SIAM International Conference on Data Mining. SIAM, 81–89.
  • Bhagat et al. (2009) Smriti Bhagat, Graham Cormode, Balachander Krishnamurthy, and Divesh Srivastava. 2009. Class-based graph anonymization for social network data. Proceedings of the VLDB Endowment 2, 1 (2009), 766–777.
  • Bhagat et al. (2014) Smriti Bhagat, Udi Weinsberg, Stratis Ioannidis, and Nina Taft. 2014. Recommending with an agenda: Active learning of private attributes using matrix factorization. In Proceedings of RecSys. ACM.
  • Biega et al. (2017) Asia J Biega, Rishiraj Saha Roy, and Gerhard Weikum. 2017. Privacy through Solidarity: A User-Utility-Preserving Framework to Counter Profiling. In Proceedings of ACM SIGIR. ACM, 665–674.
  • Blei et al. (2003) David M Blei, Andrew Y Ng, and Michael I Jordan. 2003. Latent dirichlet allocation. Journal of Machine Learning Research 3, Jan (2003), 993–1022.
  • Bonneau et al. (2009) Joseph Bonneau, Jonathan Anderson, and George Danezis. 2009. Prying data out of a social network. In Social Network Analysis and Mining, 2009. ASONAM’09. International Conference on Advances in. IEEE, 249–254.
  • Bowers et al. (2015) Jasmine Bowers, Henry Williams, Gerry Dozier, and R Williams. 2015. Mitigating deanonymization attacks via language translation for anonymous social networks. In Proceedings of the 7th International Conference on Machine Learning and Computing.
  • Bradley et al. ([n. d.]) Joseph K Bradley, Patrick Gage Kelley, and Aaron Roth. [n. d.]. Author identification from citations. ([n. d.]).
  • Bringmann et al. (2014) Karl Bringmann, Tobias Friedrich, and Anton Krohmer. 2014. De-anonymization of heterogeneous random graphs in quasilinear time. In European Symposium on Algorithms. Springer, 197–208.
  • Calandrino et al. (2011) Joseph A Calandrino, Ann Kilzer, Arvind Narayanan, Edward W Felten, and Vitaly Shmatikov. 2011. “You Might Also Like:” Privacy Risks of Collaborative Filtering. In Security and Privacy (SP). IEEE.
  • Canny (2002) John Canny. 2002. Collaborative filtering with privacy via factor analysis. In SIGIR. ACM, 238–245.
  • Chaabane et al. (2012) Abdelberi Chaabane, Gergely Acs, Mohamed Ali Kaafar, et al. 2012. You are what you like! information leakage through users’ interests. In Proceedings of the 19th Annual Network & Distributed System Security Symposium (NDSS).
  • Chaski (2005) Carole E Chaski. 2005. Who is at the keyboard? Authorship attribution in digital evidence investigations. International journal of digital evidence 4, 1 (2005), 1–13.
  • Cheng et al. (2010b) James Cheng, Ada Wai-chee Fu, and Jia Liu. 2010b. K-isomorphism: privacy preserving network publication against structural attacks. In Proceedings of ACM SIGMOD International Conference on Management of data.
  • Cheng et al. (2010a) Zhiyuan Cheng, James Caverlee, and Kyumin Lee. 2010a. You are where you tweet: a content-based approach to geo-locating twitter users. In Proceedings of CIKM. ACM, 759–768.
  • Chiasserini et al. (2016) Carla-Fabiana Chiasserini, Michele Garetto, and Emilio Leonardi. 2016. Social network de-anonymization under scale-free user relations. IEEE/ACM Transactions on Networking 24, 6 (2016), 3756–3769.
  • Chiasserini et al. (2018) Carla-Fabiana Chiasserini, Michele Garetto, and Emilio Leonardi. 2018. De-anonymizing clustered social networks by percolation graph matching. ACM Transactions on Knowledge Discovery from Data (TKDD) 12, 2 (2018), 21.
  • Compton et al. (2014) Ryan Compton, David Jurgens, and David Allen. 2014. Geotagging one hundred million twitter accounts with total variation minimization. In Big Data (Big Data), 2014 IEEE International Conference on. IEEE, 393–401.
  • Cover and Thomas (2012) Thomas M Cover and Joy A Thomas. 2012. Elements of information theory. John Wiley & Sons.
  • Dey et al. (2012) Ratan Dey, Cong Tang, Keith Ross, and Nitesh Saxena. 2012. Estimating age privacy leakage in online social networks. In INFOCOM, 2012 Proceedings IEEE. IEEE, 2836–2840.
  • Dimitropoulos et al. (2009) Xenofontas Dimitropoulos, Dmitri Krioukov, Amin Vahdat, and George Riley. 2009. Graph annotations in modeling complex network topologies. ACM Transactions on Modeling and Computer Simulation (TOMACS) 19, 4 (2009), 17.
  • Dingledine et al. (2004) Roger Dingledine, Nick Mathewson, and Paul Syverson. 2004. Tor: The second-generation onion router. Technical Report. Naval Research Lab Washington DC.
  • Duncan and Lambert (1986) George T Duncan and Diane Lambert. 1986. Disclosure-limited data dissemination. Journal of the American Statistical Association 81, 393 (1986), 10–18.
  • Dwork (2008) Cynthia Dwork. 2008. Differential privacy: A survey of results. In International Conference on Theory and Applications of Models of Computation. Springer, 1–19.
  • Dwork et al. (2006) Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. 2006. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography Conference. Springer, 265–284.
  • Evfimievski et al. (2004) Alexandre Evfimievski, Ramakrishnan Srikant, Rakesh Agrawal, and Johannes Gehrke. 2004. Privacy preserving mining of association rules. Information Systems 29, 4 (2004), 343–364.
  • Fabiana et al. (2015) Carla Fabiana, Michele Garetto, and Emilio Leonardi. 2015. De-anonymizing scale-free social networks by percolation graph matching. In Computer Communications (INFOCOM), 2015 IEEE Conference on. IEEE, 1571–1579.
  • Fu et al. (2014) Hao Fu, Aston Zhang, and Xing Xie. 2014. De-anonymizing social graphs via node similarity. In Proceedings of the 23rd International Conference on World Wide Web. ACM, 263–264.
  • Fu et al. (2015) Hao Fu, Aston Zhang, and Xing Xie. 2015. Effective social graph deanonymization based on graph structure and descriptive information. ACM Transactions on Intelligent Systems and Technology (TIST) 6, 4 (2015), 49.
  • Fu et al. (2017) Xinzhe Fu, Zhongzhao Hu, Zhiying Xu, Luoyi Fu, and Xinbing Wang. 2017. De-anonymization of Networks with Communities: When Quantifications Meet Algorithms. In IEEE Global Communications Conference.
  • Fung et al. (2010) Benjamin CM Fung, Ke Wang, Rui Chen, and Philip S Yu. 2010. Privacy-Preserving Data Publishing: A Survey of Recent Developments. ACM Computing Surveys 42, 4 (2010), 1–53.
  • Gambs et al. (2010) Sébastien Gambs, Marc-Olivier Killijian, and Miguel Núñez del Prado Cortez. 2010. Show me how you move and I will tell you who you are. In Proceedings of SIGSPATIAL International Workshop on Security and Privacy in GIS and LBS.
  • Gayo Avello (2011) Daniel Gayo Avello. 2011. All liaisons are dangerous when all your friends are known to us. In Proceedings of the 22nd ACM conference on Hypertext and hypermedia. ACM, 171–180.
  • Goga et al. (2013) Oana Goga, Howard Lei, Sree Hari Krishnan Parthasarathi, Gerald Friedland, Robin Sommer, and Renata Teixeira. 2013. Exploiting innocuous activity for correlating users across sites. In Proceedings of WWW.
  • Gong and Liu (2016) Neil Zhenqiang Gong and Bin Liu. 2016. You Are Who You Know and How You Behave: Attribute Inference Attacks via Users’ Social Friends and Behaviors.. In USENIX Security Symposium. 979–995.
  • Gong and Liu (2018) Neil Zhenqiang Gong and Bin Liu. 2018. Attribute Inference Attacks in Online Social Networks. ACM Transactions on Privacy and Security (TOPS) 21, 1 (2018), 3.
  • Gong et al. (2014) Neil Zhenqiang Gong, Ameet Talwalkar, Lester Mackey, Ling Huang, Eui Chul Richard Shin, Emil Stefanov, Elaine Runting Shi, and Dawn Song. 2014. Joint link prediction and attribute inference using a social-attribute network. ACM Transactions on Intelligent Systems and Technology (TIST) 5, 2 (2014), 27.
  • Guerraoui et al. (2015) Rachid Guerraoui, Anne-Marie Kermarrec, Rhicheek Patra, and Mahsa Taziki. 2015. D2P: Distance-based differential privacy in recommenders. Proceedings of the VLDB Endowment 8, 8 (2015), 862–873.
  • Guerraoui et al. (2017) Rachid Guerraoui, Anne-Marie Kermarrec, and Mahsa Taziki. 2017. The Utility and Privacy Effects of a Click. In Proceedings of ACM SIGIR Conference on Research and Development in Information Retrieval. ACM.
  • Gupta et al. (2013) Payas Gupta, Swapna Gottipati, Jing Jiang, and Debin Gao. 2013. Your love is public now: Questioning the use of personal information in authentication. In Proceedings of ACM SIGSAC. ACM.
  • Hajibagheri et al. (2018) Alireza Hajibagheri, Gita Sukthankar, Kiran Lakkaraju, Hamidreza Alvari, Rolf T Wigand, and Nitin Agarwal. 2018. Using Massively Multiplayer Online Game Data to Analyze the Dynamics of Social Interactions. Social Interactions in Virtual Worlds: An Interdisciplinary Perspective (2018).
  • Hay et al. (2008) Michael Hay, Gerome Miklau, David Jensen, Don Towsley, and Philipp Weis. 2008. Resisting structural re-identification in anonymized social networks. Proceedings of the VLDB Endowment 1, 1 (2008), 102–114.
  • He et al. (2006) Jianming He, Wesley W Chu, and Zhenyu Victor Liu. 2006. Inferring privacy information from social networks. In International Conference on Intelligence and Security Informatics. Springer, 154–165.
  • Hecht et al. (2011) Brent Hecht, Lichan Hong, Bongwon Suh, and Ed H Chi. 2011. Tweets from Justin Bieber’s heart: the dynamics of the location field in user profiles. In Proceedings of the SIGCHI. ACM, 237–246.
  • Hill and Provost (2003) Shawndra Hill and Foster Provost. 2003. The myth of the double-blind review?: author identification using only citations. Acm Sigkdd Explorations Newsletter 5, 2 (2003), 179–184.
  • Hoens et al. (2010) T Ryan Hoens, Marina Blanton, and Nitesh V Chawla. 2010. A private and reliable recommendation system for social networks. In Social Computing (SocialCom), 2010 IEEE Second International Conference on. IEEE, 816–825.
  • Howe and Nissenbaum (2009) Daniel C Howe and Helen Nissenbaum. 2009. TrackMeNot: Resisting surveillance in web search. In Lessons from the Identity Trail: Anonymity, Privacy and Identity in a Networked Society. Oxford University Press, NY, 417–436.
  • Hua et al. (2015) Jingyu Hua, Chang Xia, and Sheng Zhong. 2015. Differentially Private Matrix Factorization.. In IJCAI.
  • Humbert et al. (2013) Mathias Humbert, Théophile Studer, Matthias Grossglauser, and Jean-Pierre Hubaux. 2013. Nowhere to hide: Navigating around privacy in online social networks. In European Symposium on Research in Computer Security.
  • Indyk and Motwani (1998) Piotr Indyk and Rajeev Motwani. 1998. Approximate nearest neighbors: towards removing the curse of dimensionality. In Proceedings of the thirtieth annual ACM symposium on Theory of computing. ACM, 604–613.
  • James (1992) P James. 1992. Knowledge Graphs. In Order 501.
  • Ji et al. (2015a) Shouling Ji, Weiqing Li, Neil Zhenqiang Gong, Prateek Mittal, and Raheem A Beyah. 2015a. On Your Social Network De-anonymizablity: Quantification and Large Scale Evaluation with Seed Knowledge. NDSS.
  • Ji et al. (2016a) Shouling Ji, Weiqing Li, Neil Zhenqiang Gong, Prateek Mittal, and Raheem A Beyah. 2016a. Seed-Based Deanonymizability Quantification of Social Networks. IEEE Transactions on Information Forensics and Security (TIFS) 11, 7 (2016), 1398–1411.
  • Ji et al. (2015b) Shouling Ji, Weiqing Li, Prateek Mittal, and Raheem Beyah. 2015b. SecGraph: A Uniform and Open-source Evaluation System for Graph Data Anonymization and De-anonymization. In USENIX Security Symposium. 303–318.
  • Ji et al. (2014a) Shouling Ji, Weiqing Li, Mudhakar Srivatsa, and Raheem Beyah. 2014a. Structural data de-anonymization: Quantification, practice, and implications. In Proceedings of the 2014 ACM SIGSAC. ACM, 1040–1053.
  • Ji et al. (2016b) Shouling Ji, Weiqing Li, Mudhakar Srivatsa, and Raheem Beyah. 2016b. Structural data de-anonymization: theory and practice. IEEE/ACM Transactions on Networking 24, 6 (2016), 3523–3536.
  • Ji et al. (2014b) Shouling Ji, Weiqing Li, Mudhakar Srivatsa, Jing Selena He, and Raheem Beyah. 2014b. Structure based data de-anonymization of social networks and mobility traces. In International Conference on Information Security. Springer.
  • Ji et al. (2016c) Shouling Ji, Weiqing Li, Mudhakar Srivatsa, Jing Selena He, and Raheem Beyah. 2016c. General graph data de-anonymization: From mobility traces to social networks. ACM Transactions on Information and System Security (TISSEC) 18, 4 (2016).
  • Ji et al. (2016d) Shouling Ji, Prateek Mittal, and Raheem Beyah. 2016d. Graph data anonymization, de-anonymization attacks, and de-anonymizability quantification: A survey. IEEE Communications Surveys & Tutorials 19, 2 (2016), 1305–1326.
  • Ji et al. (2017) Shouling Ji, Ting Wang, Jianhai Chen, Weiqing Li, Prateek Mittal, and Raheem Beyah. 2017. De-SAG: On the De-anonymization of Structure-Attribute Graph Data. IEEE Transactions on Dependable and Secure Computing (2017).
  • Jia et al. (2017) Jinyuan Jia, Binghui Wang, Le Zhang, and Neil Zhenqiang Gong. 2017. AttriInfer: Inferring user attributes in online social networks using markov random fields. In Proceedings of the WWW. 1561–1569.
  • Jorgensen and Yu (2014) Zach Jorgensen and Ting Yu. 2014. A Privacy-Preserving Framework for Personalized, Social Recommendations. EDBT 582.
  • Jurgens (2013) David Jurgens. 2013. That’s What Friends Are For: Inferring Location in Online Social Media Platforms Based on Social Relationships. In ICWSM.
  • Jurgens et al. (2015) David Jurgens, Tyler Finethy, James McCorriston, Yi Tian Xu, and Derek Ruths. 2015. Geolocation Prediction in Twitter Using Social Networks: A Critical Analysis and Review of Current Practice. In ICWSM.
  • Khairnar and Bajpai (2014) Sonali M Khairnar and Sanchika Bajpai. 2014. Anonymization of Centralized and Distributed Social Networks by Incremental Clustering. International Journal of Computer Science and Information Technologies 5, 5 (2014), 6724–6727.
  • Kifer and Machanavajjhala (2011) Daniel Kifer and Ashwin Machanavajjhala. 2011. No free lunch in data privacy. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of data. ACM, 193–204.
  • Klein and Manning (2003) Dan Klein and Christopher D Manning. 2003. Accurate unlexicalized parsing. In Proceedings of the 41st annual meeting of the association for computational linguistics.
  • Kong et al. (2014) Longbo Kong, Zhi Liu, and Yan Huang. 2014. Spot: Locating social media users based on social network context. Proceedings of the VLDB Endowment 7, 13 (2014), 1681–1684.
  • Koppel et al. (2009) Moshe Koppel, Jonathan Schler, and Shlomo Argamon. 2009. Computational methods in authorship attribution. Journal of the Association for Information Science and Technology 60, 1 (2009), 9–26.
  • Koppel et al. (2011) Moshe Koppel, Jonathan Schler, and Shlomo Argamon. 2011. Authorship attribution in the wild. Language Resources and Evaluation 45, 1 (2011), 83–94.
  • Koppel et al. (2006) Moshe Koppel, Jonathan Schler, Shlomo Argamon, and Eran Messeri. 2006. Authorship attribution with thousands of candidate authors. In Proceedings of ACM SIGIR. ACM, 659–660.
  • Korolova et al. (2008) Aleksandra Korolova, Rajeev Motwani, Shubha U Nabar, and Ying Xu. 2008. Link privacy in social networks. In Proceedings of the 17th ACM conference on Information and knowledge management. ACM, 289–298.
  • Korula and Lattanzi (2014) Nitish Korula and Silvio Lattanzi. 2014. An efficient reconciliation algorithm for social networks. Proceedings of the VLDB Endowment 7, 5 (2014), 377–388.
  • Kosinski et al. (2013) Michal Kosinski, David Stillwell, and Thore Graepel. 2013. Private traits and attributes are predictable from digital records of human behavior. Proceedings of the National Academy of Sciences 110, 15 (2013), 5802–5805.
  • Kuhn (2010) Harold W Kuhn. 2010. The hungarian method for the assignment problem. In 50 Years of Integer Programming 1958-2008. Springer, 29–47.
  • Labitzke et al. (2013) Sebastian Labitzke, Florian Werling, Jens Mittag, and Hannes Hartenstein. 2013. Do online social network friends still threaten my privacy?. In Proceedings of the ACM conference on Data and application security and privacy.
  • Lambert (1993) Diane Lambert. 1993. Measures of disclosure risk and harm. Journal of Official Statistics 9, 2 (1993), 313.
  • Lee et al. (2017a) Wei-Han Lee, Changchang Liu, Shouling Ji, Prateek Mittal, and Ruby Lee. 2017a. How to Quantify Graph De-anonymization Risks. (2017).
  • Lee et al. (2017b) Wei-Han Lee, Changchang Liu, Shouling Ji, Prateek Mittal, and Ruby B Lee. 2017b. Blind De-anonymization Attacks using Social Networks. In Proceedings of the 2017 on Workshop on Privacy in the Electronic Society. ACM, 1–4.
  • Lewis et al. (2008) Kevin Lewis, Jason Kaufman, Marco Gonzalez, Andreas Wimmer, and Nicholas Christakis. 2008. Tastes, ties, and time: A new social network dataset using Facebook.com. Social Networks 30, 4 (2008), 330–342.
  • Li et al. (2007) Ninghui Li, Tiancheng Li, and Suresh Venkatasubramanian. 2007. t-closeness: Privacy beyond k-anonymity and l-diversity. In Data Engineering, 2007. ICDE 2007. IEEE 23rd International Conference on. IEEE, 106–115.
  • Li et al. (2012a) Rui Li, Shengjie Wang, and Kevin Chen-Chuan Chang. 2012a. Multiple location profiling for users and relationships from social network and content. Proceedings of the VLDB Endowment 5, 11 (2012), 1603–1614.
  • Li et al. (2012b) Rui Li, Shengjie Wang, Hongbo Deng, Rui Wang, and Kevin Chen-Chuan Chang. 2012b. Towards social user profiling: unified and discriminative influence model for inferring home locations. In Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 1023–1031.
  • Li et al. (2017) Xiaoxue Li, Yanan Cao, Yanmin Shang, Yanbing Liu, Jianlong Tan, and Li Guo. 2017. Inferring User Profiles in Online Social Networks Based on Convolutional Neural Network. In CIKM. Springer, 274–286.
  • Lindamood et al. (2009) Jack Lindamood, Raymond Heatherly, Murat Kantarcioglu, and Bhavani Thuraisingham. 2009. Inferring private information using social network data. In Proceedings of WWW. ACM, 1145–1146.
  • Liu et al. (2018) Bo Liu, Wanlei Zhou, Tianqing Zhu, Longxiang Gao, and Yong Xiang. 2018. Location Privacy and Its Applications: A Systematic Study. IEEE Access 6 (2018), 17606–17624.
  • Liu et al. (2016a) Changchang Liu, Supriyo Chakraborty, and Prateek Mittal. 2016a. Dependence Makes You Vulnerable: Differential Privacy Under Dependent Tuples. In NDSS, Vol. 16. 21–24.
  • Liu and Mittal (2016) Changchang Liu and Prateek Mittal. 2016. LinkMirage: Enabling Privacy-preserving Analytics on Social Relationships.. In NDSS.
  • Liu and Terzi (2008) Kun Liu and Evimaria Terzi. 2008. Towards identity anonymization on graphs. In Proceedings of the 2008 ACM SIGMOD international conference on Management of data. ACM, 93–106.
  • Liu et al. (2016b) Yushan Liu, Shouling Ji, and Prateek Mittal. 2016b. SmartWalk: Enhancing social network security via adaptive random walks. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security. ACM, 492–503.
  • Luo et al. (2014) Dixin Luo, Hongteng Xu, Hongyuan Zha, Jun Du, Rong Xie, Xiaokang Yang, and Wenjun Zhang. 2014. You are what you watch and when you watch: Inferring household structures from iptv viewing data. IEEE Transactions on Broadcasting 60, 1 (2014), 61–72.
  • Luo and Chen (2014) Zhifeng Luo and Zhanli Chen. 2014. A privacy preserving group recommender based on cooperative perturbation. In International Conference on Cyber-Enabled Distributed Computing and Knowledge Discovery. IEEE.
  • Machanavajjhala et al. (2006) Ashwin Machanavajjhala, Johannes Gehrke, Daniel Kifer, and Muthuramakrishnan Venkitasubramaniam. 2006. l-diversity: Privacy beyond k-anonymity. In Proceedings of ICDE. IEEE, 24–24.
  • Machanavajjhala et al. (2011) Ashwin Machanavajjhala, Aleksandra Korolova, and Atish Das Sarma. 2011. Personalized social recommendations: accurate or private. Proceedings of the VLDB Endowment 4, 7 (2011), 440–450.
  • Mack et al. (2015) Nathan Mack, Jasmine Bowers, Henry Williams, Gerry Dozier, and Joseph Shelton. 2015. The Best Way to a Strong Defense is a Strong Offense: Mitigating Deanonymization Attacks via Iterative Language Translation. International Journal of Machine Learning and Computing 5, 5 (2015), 409.
  • Mahadevan et al. (2006) Priya Mahadevan, Dmitri Krioukov, Kevin Fall, and Amin Vahdat. 2006. Systematic topology analysis and generation using degree correlations. In ACM SIGCOMM Computer Communication Review, Vol. 36. ACM, 135–146.
  • Mahmud et al. (2014) Jalal Mahmud, Jeffrey Nichols, and Clemens Drews. 2014. Home location identification of twitter users. ACM Transactions on Intelligent Systems and Technology (TIST) 5, 3 (2014), 47.
  • Mao et al. (2011) Huina Mao, Xin Shuai, and Apu Kapadia. 2011. Loose tweets: an analysis of privacy leaks on twitter. In Proceedings of the 10th annual ACM workshop on Privacy in the electronic society. ACM, 1–12.
  • McGee et al. (2013) Jeffrey McGee, James Caverlee, and Zhiyuan Cheng. 2013. Location prediction in social media based on tie strength. In Proceedings of CIKM. ACM.
  • McGee et al. (2011) Jeffrey McGee, James A Caverlee, and Zhiyuan Cheng. 2011. A geographic study of tie strength in social media. In Proceedings of CIKM. ACM, 2333–2336.
  • McPherson et al. (2001) Miller McPherson, Lynn Smith-Lovin, and James M Cook. 2001. Birds of a feather: Homophily in social networks. Annual Review of Sociology 27, 1 (2001), 415–444.
  • McSherry and Mironov (2009) Frank McSherry and Ilya Mironov. 2009. Differentially private recommender systems: Building privacy into the netflix prize contenders. In Proceedings of the ACM SIGKDD. ACM.
  • McSherry and Talwar (2007) Frank McSherry and Kunal Talwar. 2007. Mechanism design via differential privacy. In Foundations of Computer Science, 2007. FOCS’07. 48th Annual IEEE Symposium on. IEEE, 94–103.
  • Mendenhall (1887) Thomas Corwin Mendenhall. 1887. The characteristic curves of composition. Science 9, 214 (1887), 237–249.
  • Meng et al. (2018) Xuying Meng, Suhang Wang, Kai Shu, Jundong Li, Bo Chen, Huan Liu, and Yujun Zhang. 2018. Personalized privacy-preserving social recommendation. In AAAI.
  • Mikolov et al. (2013) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems. 3111–3119.
  • Minkus et al. (2015) Tehila Minkus, Yuan Ding, Ratan Dey, and Keith W Ross. 2015. The city privacy attack: Combining social media and public records for detailed profiles of adults and children. In Proceedings of the 2015 ACM on Conference on Online Social Networks. ACM, 71–81.
  • Mislove et al. (2010) Alan Mislove, Bimal Viswanath, Krishna P Gummadi, and Peter Druschel. 2010. You are who you know: inferring user profiles in online social networks. In Proceedings of WSDM. ACM, 251–260.
  • Mittal et al. (2012) Prateek Mittal, Charalampos Papamanthou, and Dawn Song. 2012. Preserving link privacy in social network based systems. arXiv preprint arXiv:1208.6189 (2012).
  • Mosteller and Wallace (1964) Frederick Mosteller and David Wallace. 1964. Inference and disputed authorship: The Federalist. (1964).
  • Nanavati et al. (2011) Mihir Nanavati, Nathan Taylor, William Aiello, and Andrew Warfield. 2011. Herbert West-Deanonymizer. In HotSec.
  • Narayanan et al. (2012) Arvind Narayanan, Hristo Paskov, Neil Zhenqiang Gong, John Bethencourt, Emil Stefanov, Eui Chul Richard Shin, and Dawn Song. 2012. On the feasibility of internet-scale author identification. In Security and Privacy (SP). IEEE.
  • Narayanan et al. (2011) Arvind Narayanan, Elaine Shi, and Benjamin IP Rubinstein. 2011. Link prediction by de-anonymization: How we won the kaggle social network challenge. In International Joint Conference on Neural Networks. IEEE.
  • Narayanan and Shmatikov (2008) Arvind Narayanan and Vitaly Shmatikov. 2008. Robust de-anonymization of large sparse datasets. In Security and Privacy. IEEE.
  • Narayanan and Shmatikov (2009) Arvind Narayanan and Vitaly Shmatikov. 2009. De-anonymizing social networks. In Security and Privacy. IEEE.
  • Newman (2003) M. E. J. Newman. 2003. The structure and function of complex networks. SIAM Review 45 (2003), 167–256.
  • Nilizadeh et al. (2014) Shirin Nilizadeh, Apu Kapadia, and Yong-Yeol Ahn. 2014. Community-enhanced de-anonymization of online social networks. In Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security. ACM, 537–548.
  • Nissim et al. (2007) Kobbi Nissim, Sofya Raskhodnikova, and Adam Smith. 2007. Smooth sensitivity and sampling in private data analysis. In Proceedings of the thirty-ninth annual ACM symposium on Theory of computing. ACM, 75–84.
  • Otterbacher (2010) Jahna Otterbacher. 2010. Inferring gender of movie reviewers: exploiting writing style, content and metadata. In Proceedings of the 19th ACM international conference on Information and knowledge management. ACM, 369–378.
  • Parameswaran and Blough (2005) Rupa Parameswaran and D Blough. 2005. A robust data obfuscation approach for privacy preservation of clustered data. In In Workshop on Privacy and Security Aspects of Data Mining. 18–25.
  • Parameswaran and Blough (2007) Rupa Parameswaran and Douglas M Blough. 2007. Privacy preserving collaborative filtering using data obfuscation. In IEEE International Conference on Granular Computing.
  • Parra-Arnau (2017) Javier Parra-Arnau. 2017. Pay-per-tracking: A collaborative masking model for web browsing. Information Sciences 385 (2017), 96–124.
  • Parra-Arnau et al. (2014) Javier Parra-Arnau, David Rebollo-Monedero, and Jordi Forné. 2014. Optimal forgery and suppression of ratings for privacy enhancement in recommendation systems. Entropy 16, 3 (2014), 1586–1631.
  • Pedarsani and Grossglauser (2011) Pedram Pedarsani and Matthias Grossglauser. 2011. On the privacy of anonymized networks. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 1235–1243.
  • Peng et al. (2014) Wei Peng, Feng Li, Xukai Zou, and Jie Wu. 2014. A two-stage deanonymization attack against anonymized social networks. IEEE Trans. Comput. 63, 2 (2014), 290–303.
  • Polat and Du (2003) Huseyin Polat and Wenliang Du. 2003. Privacy-preserving collaborative filtering using randomized perturbation techniques. In Data Mining, 2003. ICDM 2003. Third IEEE International Conference on. IEEE, 625–628.
  • Proserpio et al. (2014) Davide Proserpio, Sharon Goldberg, and Frank McSherry. 2014. Calibrating data to sensitivity in private data analysis: a platform for differentially-private analysis of weighted datasets. Proceedings of the VLDB Endowment 7, 8 (2014).
  • Puglisi et al. (2015) Silvia Puglisi, Javier Parra-Arnau, Jordi Forné, and David Rebollo-Monedero. 2015. On content-based recommendation and user privacy in social-tagging systems. Computer Standards & Interfaces 41 (2015), 17–27.
  • Qian et al. (2016) Jianwei Qian, Xiang-Yang Li, Chunhong Zhang, and Linlin Chen. 2016. De-anonymizing social networks and inferring private attributes using knowledge graphs. In IEEE INFOCOM.
  • Ramakrishnan et al. (2001) Naren Ramakrishnan, Benjamin J Keller, Batul J Mirza, Ananth Y Grama, and George Karypis. 2001. Privacy risks in recommender systems. IEEE Internet Computing 5, 6 (2001), 54.
  • Rao et al. (2000) Josyula R Rao, Pankaj Rohatgi, et al. 2000. Can pseudonymity really guarantee privacy?. In USENIX Security.
  • Rebollo-Monedero et al. (2011) David Rebollo-Monedero, Javier Parra-Arnau, and Jordi Forné. 2011. An information-theoretic privacy criterion for query forgery in information retrieval. In International Conference on Security Technology. Springer, 146–154.
  • Rizvi and Haritsa (2002) Shariq J Rizvi and Jayant R Haritsa. 2002. Maintaining data privacy in association rule mining. In VLDB’02: Proceedings of the 28th International Conference on Very Large Databases. Elsevier, 682–693.
  • Rout et al. (2013) Dominic Rout, Kalina Bontcheva, Daniel Preoţiuc-Pietro, and Trevor Cohn. 2013. Where’s@ wally?: a classification approach to geolocating users based on their social ties. In Proceedings of Hypertext and Social Media. ACM.
  • Ryoo and Moon (2014) KyoungMin Ryoo and Sue Moon. 2014. Inferring twitter user locations with 10 km accuracy. In Proceedings of the 23rd International Conference on World Wide Web. ACM, 643–648.
  • Sala et al. (2011) Alessandra Sala, Xiaohan Zhao, Christo Wilson, Haitao Zheng, and Ben Y Zhao. 2011. Sharing graphs using differentially private graph models. In Proceedings of ACM SIGCOMM on Internet measurement conference.
  • Sharad (2016) Kumar Sharad. 2016. Change of Guard: The Next Generation of Social Graph De-anonymization Attacks. In Proceedings of the 2016 ACM Workshop on Artificial Intelligence and Security. ACM, 105–116.
  • Sharad and Danezis (2014) Kumar Sharad and George Danezis. 2014. An automated social graph de-anonymization technique. In Proceedings of the 13th Workshop on Privacy in the Electronic Society. ACM, 47–58.
  • Sharma et al. (2012) Sanur Sharma, Preeti Gupta, and Vishal Bhatnagar. 2012. Anonymisation in social network: A literature survey and classification. International Journal of Social Network Mining 1, 1 (2012), 51–66.
  • Shen and Jin (2014) Yilin Shen and Hongxia Jin. 2014. Privacy-preserving personalized recommendation: An instance-based approach via differential privacy. In Data Mining (ICDM), 2014 IEEE International Conference on. IEEE, 540–549.
  • Shokri et al. (2011) Reza Shokri, George Theodorakopoulos, Jean-Yves Le Boudec, and Jean-Pierre Hubaux. 2011. Quantifying location privacy. In Security and Privacy (SP), 2011 IEEE Symposium on. IEEE, 247–262.
  • Shu et al. (2017) Kai Shu, Suhang Wang, Jiliang Tang, Reza Zafarani, and Huan Liu. 2017. User identity linkage across online social networks: A review. ACM SIGKDD Explorations Newsletter 18, 2 (2017), 5–17.
  • Srivatsa and Hicks (2012) Mudhakar Srivatsa and Mike Hicks. 2012. Deanonymizing mobility traces: Using social network as a side-channel. In Proceedings of the 2012 ACM conference on Computer and communications security. ACM, 628–637.
  • Stamatatos (2009) Efstathios Stamatatos. 2009. A survey of modern authorship attribution methods. Journal of the Association for Information Science and Technology 60, 3 (2009), 538–556.
  • Stone et al. (2008) Zak Stone, Todd Zickler, and Trevor Darrell. 2008. Autotagging facebook: Social network context improves photo annotation. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops. IEEE.
  • Sweeney (2002) Latanya Sweeney. 2002. k-anonymity: A model for protecting privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 10, 05 (2002), 557–570.
  • Tang and Wang (2016) Qiang Tang and Jun Wang. 2016. Privacy-preserving friendship-based recommender systems. IEEE Transactions on Dependable and Secure Computing (2016).
  • Thomas et al. (2010) Kurt Thomas, Chris Grier, and David M Nicol. 2010. unFriendly: Multi-party privacy risks in social networks. In International Symposium on Privacy Enhancing Technologies Symposium. Springer, 236–252.
  • Thompson and Yao (2009) Brian Thompson and Danfeng Yao. 2009. The union-split algorithm and cluster-based anonymization of social networks. In Proceedings of Symposium on Information, Computer, and Communications Security.
  • Tong et al. (2006) Hanghang Tong, Christos Faloutsos, and Jia-Yu Pan. 2006. Fast random walk with restart and its applications. (2006).
  • Verykios et al. (2004) Vassilios S Verykios, Elisa Bertino, Igor Nai Fovino, Loredana Parasiliti Provenza, Yucel Saygin, and Yannis Theodoridis. 2004. State-of-the-art in privacy preserving data mining. ACM Sigmod Record 33, 1 (2004), 50–57.
  • Wang et al. (2018) Huandong Wang, Yong Li, Gang Wang, and Depeng Jin. 2018. You Are How You Move: Linking Multiple User Identities From Massive Mobility Traces. In Proceedings of SIAM SDM. Society for Industrial and Applied Mathematics.
  • Wang et al. (2016) Pengfei Wang, Jiafeng Guo, Yanyan Lan, Jun Xu, and Xueqi Cheng. 2016. Your cart tells you: Inferring demographic attributes from purchase data. In Proceedings of WSDM. ACM.
  • Wang and Wu (2013) Yue Wang and Xintao Wu. 2013. Preserving differential privacy in degree-correlation based graph generation. Transactions on Data Privacy 6, 2 (2013), 127.
  • Weinsberg et al. (2012) Udi Weinsberg, Smriti Bhagat, Stratis Ioannidis, and Nina Taft. 2012. BlurMe: Inferring and obfuscating user gender based on ratings. In Proceedings of the sixth ACM conference on Recommender systems. ACM, 195–202.
  • Wu et al. (2018) Xinyu Wu, Zhongzhao Hu, Xinzhe Fu, Luoyi Fu, Xinbing Wang, and Songwu Lu. 2018. Social network de-anonymization with overlapping communities: Analysis, algorithm and experiments. In Proceeding of INFOCOM.
  • Xiao et al. (2014) Qian Xiao, Rui Chen, and Kian-Lee Tan. 2014. Differentially private network data release via structural inference. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 911–920.
  • Xin and Jaakkola (2014) Yu Xin and Tommi Jaakkola. 2014. Controlling privacy in recommender systems. In Advances in Neural Information Processing Systems. 2618–2626.
  • Yang and Leskovec (2013) Jaewon Yang and Jure Leskovec. 2013. Overlapping community detection at scale: a nonnegative matrix factorization approach. In Proceedings of the sixth ACM international conference on Web search and data mining. ACM, 587–596.
  • Yang et al. (2013) Jaewon Yang, Julian McAuley, and Jure Leskovec. 2013. Community detection in networks with node attributes. In 2013 IEEE 13th International Conference on Data Mining (ICDM). IEEE, 1151–1156.
  • Yartseva and Grossglauser (2013) Lyudmila Yartseva and Matthias Grossglauser. 2013. On the performance of percolation graph matching. In Proceedings of the first ACM conference on Online social networks. ACM, 119–130.
  • Yin et al. (2010a) Zhijun Yin, Manish Gupta, Tim Weninger, and Jiawei Han. 2010a. Linkrec: a unified framework for link recommendation with user attributes and graph structure. In Proceedings of WWW. ACM, 1211–1212.
  • Yin et al. (2010b) Zhijun Yin, Manish Gupta, Tim Weninger, and Jiawei Han. 2010b. A unified framework for link recommendation using random walks. In Proceedings of ASONAM. IEEE, 152–159.
  • Ying and Wu (2009) Xiaowei Ying and Xintao Wu. 2009. Graph generation with prescribed feature constraints. In Proceedings of the 2009 SIAM International Conference on Data Mining. SIAM, 966–977.
  • Yuan et al. (2010) Mingxuan Yuan, Lei Chen, and Philip S Yu. 2010. Personalized privacy protection in social networks. Proceedings of the VLDB Endowment 4, 2 (2010), 141–150.
  • Zafarani et al. (2014) Reza Zafarani, Mohammad Ali Abbasi, and Huan Liu. 2014. Social media mining: an introduction. Cambridge University Press.
  • Zhang et al. (2014) Aston Zhang, Xing Xie, Carl A Gunter, Jiawei Han, and XiaoFeng Wang. 2014. Privacy Risk in Anonymized Heterogeneous Information Networks. EDBT (2014).
  • Zhang et al. (2018) Jinxue Zhang, Jingchao Sun, Rui Zhang, and Yanchao Zhang. 2018. Privacy-Preserving Social Media Data Outsourcing. In Proceedings of IEEE International Conference on Computer Communications (INFOCOM).
  • Zheleva and Getoor (2009) Elena Zheleva and Lise Getoor. 2009. To join or not to join: the illusion of privacy in social networks with mixed public and private user profiles. In Proceedings of the 18th international conference on World wide web. ACM, 531–540.
  • Zheleva et al. (2012) Elena Zheleva, Evimaria Terzi, and Lise Getoor. 2012. Privacy in social networks. Synthesis Lectures on Data Mining and Knowledge Discovery 3, 1 (2012), 1–85.
  • Zheng et al. (2018) X. Zheng, J. Han, and A. Sun. 2018. A Survey of Location Prediction on Twitter. IEEE Transactions on Knowledge and Data Engineering (2018).
  • Zhong et al. (2015) Yuan Zhong, Nicholas Jing Yuan, Wen Zhong, Fuzheng Zhang, and Xing Xie. 2015. You are where you go: Inferring demographic attributes from location check-ins. In Proceedings of WSDM. ACM, 295–304.
  • Zhou and Pei (2008) Bin Zhou and Jian Pei. 2008. Preserving privacy in social networks against neighborhood attacks. In 2008 IEEE 24th International Conference on Data Engineering (ICDE). IEEE, 506–515.
  • Zhu et al. (2013) Tianqing Zhu, Gang Li, Yongli Ren, Wanlei Zhou, and Ping Xiong. 2013. Differential privacy for neighborhood-based collaborative filtering. In Proceedings of ASONAM. ACM, 752–759.
  • Zhu and Sun (2016) Xue Zhu and Yuqing Sun. 2016. Differential privacy for collaborative filtering recommender algorithm. In Proceedings of the 2016 ACM on International Workshop on Security And Privacy Analytics. ACM, 9–16.
  • Zou et al. (2009) Lei Zou, Lei Chen, and M Tamer Özsu. 2009. K-automorphism: A general framework for privacy preserving network publication. Proceedings of the VLDB Endowment 2, 1 (2009), 946–957.