Many existing security and privacy applications/techniques can be characterized as feature-based inference systems, e.g., network traffic attribution in network forensic applications, private web search, and feature-based data de-anonymization. To conduct network traffic attribution, a network traffic attribution system is usually first learned from the features extracted from historical network traces. Later, when new network traffic arrives, its features are extracted first, and the system then automatically attributes the traffic to the users who generated it based on those features (as shown in Fig.1). In fact, the network traffic attribution system can be directly considered as a feature-based inference system, where the system is first learned from the historical/training data (specifically, the features of the historical/training data) and then used to infer the new data (in this scenario, the users who generate the new traffic) based on their features (as shown in Fig.2). Another example is the code stylometry-based de-anonymization attack on programmers proposed in . In this kind of attack, the code stylometry features of training programs are first extracted to train a de-anonymization model. Then, this model can be used to de-anonymize the programmers of the target programs based on their code stylometry features. In this example, the code stylometry-based de-anonymization model can also be considered as a feature-based inference system that infers (de-anonymizes) target data (the programmers of the target programs).
Now, some interesting questions arise: how can we quantify the performance of those feature-based inference systems for security and privacy applications, and what is the performance of existing feature-based inference techniques relative to the inherent theoretical performance bound? Answering these questions is important to accurately evaluate and understand the performance of existing feature-based inference systems/techniques and to further develop improved ones. Unfortunately, although we already have many feature-based inference systems/techniques for various security and privacy applications, the answers to these questions remain unclear. Therefore, to address these open problems, in this paper, we study the Feature-based Data Inferability (FDI) quantification for existing feature-based inference systems/techniques in various security and privacy applications. Particularly, we make the following contributions in this paper.
We first quantify the FDI under a naive data model, where each user-feature relationship is characterized by a binary function (a user either has a feature or does not have it). Under the naive model, we quantify the conditions for a target dataset to be -inferable, i.e., for target users to be Top- inferable, where is a parameter in , is the number of overlapped users between the training data of the inference model and the target data (thus, is the number of users that can be correctly Top- inferred), and is an integer specifying the desired inference accuracy.
Subsequently, we extend our FDI quantification to a general data model. Under the general data model, we quantify the FDI from both the feature distance perspective and the feature distribution perspective for a target dataset to be -inferable. Our quantification in the general scenarios answers the raised open problems and, to the best of our knowledge, provides for the first time a theoretical foundation for existing feature-based inference systems in various security and privacy applications.
Based on our FDI quantification, we conduct a large-scale evaluation leveraging real-world data. Specifically, we evaluate the user inferability in two cases: network traffic attribution in network forensics and feature-based data de-anonymization. We explicitly demonstrate the -inferability of users in these two cases and analyze the reasons.
In terms of our quantification and evaluation, we discuss the implications of this paper to practical feature-based inference systems/techniques. We also point out the future research directions.
The rest of this paper is organized as follows. In Section II, we describe the motivating applications and formalize the problem. In Section III, we quantify the FDI under both naive and general data models. In Section IV, we evaluate the FDI in two scenarios. We make further discussion in Section V. In Section VI, we summarize the related work, and we conclude the paper in Section VII.
II Problem Formalization
In this section, we formalize the studied problem. To make the problem easily understandable and to further motivate our research, we start by introducing motivating examples to which our analysis is applicable.
II-A Motivation Examples
In this paper, we study data’s feature-based inferability. Our study is motivated by several existing security and privacy applications, e.g., network traffic attribution in network security forensics , linkage attacks and private web search , and data de-anonymization .
Network traffic attribution is one of the fundamental issues in network security forensics, in which the users who are responsible for the observed activities and behaviors on network interfaces are inferred. Taking the network traffic attribution system Kaleido proposed in  and shown in Fig.1 as an example, a typical network traffic attribution system works as follows: ①, based on the historical network traces, a set of features (corresponding to each user) is extracted; ②, a learning model is designed to learn a discriminant model based on the features of historical network traces, which is used for network traffic attribution and/or new user (possibly an intruder) identification; ③, when new network traffic arrives, the features of the new network traffic are extracted; and ④, taking the features of the new network traffic as input, the discriminant model either attributes the traffic to a set of candidate users or concludes that the traffic is generated by a new user (or a set of new users).
Web searching is one of the most fundamental computer applications, by which users obtain desired knowledge and/or find websites of interest. Intuitively, users’ web search traces carry users’ interests and intents. Therefore, potential adversaries (e.g., eavesdroppers) may design linkage attacks and exploit users’ web search traces to infer users’ profiles and other sensitive information. The key idea of a linkage attack is that an adversary first learns a linkage function based on the features of target users’ historical web search data and then determines whether newly generated web search data/events belong to the target users. To defend against the linkage attack in web search applications, several obfuscation mechanisms have been proposed for private web search. The basic idea is to obfuscate users’ web search data by adding some noise, i.e., obfuscating the features of users’ web search data such that the linkage attack cannot effectively infer the generator of the data.
Our study in this paper is also motivated by existing feature-based de-anonymization attacks and techniques, e.g., programmer de-anonymization , authorship attribution in underground forums and multi-author detection , and movie rating data de-anonymization . In these de-anonymization attacks/techniques, a feature-based de-anonymization model is first learned based on a training dataset. Subsequently, the newly arriving data (generated by an existing user or a new user) are de-anonymized by the de-anonymization model based on the data’s features.
Mathematically, all the aforementioned security and privacy applications can be reduced to a simple yet general system as shown in Fig.2: ①, a model is learned based on the features of historical data; ②, the target data are input to the model; and ③, inferences, e.g., candidate users who generate the data and/or identified new users, are concluded based on the results of the learned model. Now, after observing the success of the aforementioned security and privacy applications, e.g., Kaleido is able to identify the responsible users with over accuracy, two interesting questions arise: why are these techniques/attacks successful, and, given the target data, how can we determine the performance of these techniques/attacks relative to the intrinsic inferability of the target data, e.g., how good is the accuracy of Kaleido, and is it possible to achieve better accuracy than ? To answer the two questions, we study the intrinsic inferability of the target data given the historical data (training data). Therefore, our research in this paper can serve as the theoretical foundation of the aforementioned security and privacy applications. Furthermore, our quantification enables the development of a tool to evaluate the relative performance of the aforementioned techniques/attacks and guides the development of future research (as discussed in Section V).
II-B Problem Formalization and Models
Now, we formalize the studied problem. During the formalization, the basic principle is to make the problem sufficiently general and meanwhile mathematically tractable.
We denote the training data (e.g., the historical data in the network traffic attribution scenario) as . Since we do not distinguish a user and the data generated by that user, we assume consists of users (or the data generated by users), and further assume , where is a user (or the data generated by a user). For , it represents a user or the data generated by a user depending on the context. To model the feature extraction process (as shown in Fig.1 and Fig.2), we assume there is a feature extraction mechanism (see Footnote 1), where denotes some particular feature function and is the dimension of the feature space. Applying to , we can get the features of , denoted by set . In this paper, we focus on the scenario that is a finite set, i.e., is some finite value (see Footnote 2). Specifically, for , its features with respect to are denoted by vector, where denotes the feature of with respect to the feature function .
Footnote 1: In practice, could be any specific feature extraction mechanism, e.g., the ones in -.
Footnote 2: With this assumption, the studied problem is still sufficiently general to be applied to many existing security and privacy applications. For instance, in network security forensics , linkage attacks and private web search , and data de-anonymization , the extracted features of the training data can be modeled by a finite set.
Similar to formalizing the training data and taking account of the security and privacy applications (-), we denote the target data by , where is a user (or the data generated by a target user) in the target data and is the number of users in the target data. As shown in Fig.1 and Fig.2 (-), before inferring the users in , we apply the same to extract the features of denoted by , which is again assumed to be a finite set. For , its features with respect to are denoted by vector , where denotes the feature of with respect to the feature function . After having , the task now is to infer the users in using an inference model (e.g., the network traffic discriminant model as shown in Fig.1).
Based on the aforementioned definitions, the studied problem in this paper can be formalized as follows:
Feature-based Data Inferability (FDI). Given , , and , we quantify the inferability of with respect to and .
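The formalization above (a shared feature extraction mechanism applied to both the training data and the target data) can be illustrated with a minimal Python sketch. All names here (`mean_size`, `max_size`, `build_feature_table`, the packet-size traces, and the user IDs) are hypothetical and chosen only for illustration; they are not taken from Kaleido or any other cited system.

```python
from typing import Callable, Dict, List, Sequence

Trace = Sequence[float]                     # hypothetical: packet sizes observed for one user
FeatureFunction = Callable[[Trace], float]  # one feature function of the extraction mechanism


def mean_size(trace: Trace) -> float:
    """Hypothetical feature: average packet size."""
    return sum(trace) / len(trace)


def max_size(trace: Trace) -> float:
    """Hypothetical feature: largest packet size."""
    return float(max(trace))


def extract_features(trace: Trace, feature_functions: List[FeatureFunction]) -> List[float]:
    """Apply every feature function to one user's data, giving a feature vector."""
    return [f(trace) for f in feature_functions]


def build_feature_table(users: Dict[str, Trace],
                        feature_functions: List[FeatureFunction]) -> Dict[str, List[float]]:
    """Feature vectors for a whole dataset (training or target)."""
    return {uid: extract_features(trace, feature_functions) for uid, trace in users.items()}


# The same feature functions are applied to the training data and the target data.
FEATURES = [mean_size, max_size]
training = {"alice": [100.0, 300.0], "bob": [1500.0, 1500.0]}
target = {"x": [120.0, 280.0]}

train_table = build_feature_table(training, FEATURES)
target_table = build_feature_table(target, FEATURES)
```

The inference model then operates only on `train_table` and `target_table`, which is why the quantification can be stated purely in terms of feature vectors.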
In this paper, we study the intrinsic FDI of the security and privacy applications as shown in Section II-A. Mathematically, the FDI study can serve as the theoretical foundation of the applications in Section II-A, e.g., the network traffic attribution system Kaleido proposed in . Practically, the FDI study can be employed to evaluate the relative performance of the existing techniques in the applications of Section II-A, and guide the development of new/improved techniques.
III FDI Quantification
In this section, we conduct the FDI quantification. We start the quantification from a naive scenario. Then, we generalize the FDI quantification to the more practical cases.
To make the following discussion easy to follow, we use the network traffic attribution application in network security forensics as the study context in the rest of this paper unless otherwise specified. Our discussion applies straightforwardly to the scenarios of the linkage attack and private web search  and data de-anonymization .
Following the security and privacy applications in -, an inferring model can be learned from as shown in Fig.1 and Fig.2, e.g., the discriminant model in the network security forensics application , the linkage attack model in private web searching , and the de-anonymization model in . We denote the inference (attack, de-anonymization) model by . Then, is employed to infer the new coming data, i.e., the target data.
When employing to infer users (data generated by users) in the target data, employs some inference function learned from . We here model the inference function of by . Then, , when inferring using , we denote the process by and denote the inference result by , where denotes a new user (the data generated by a new user) such that . We further explain the inference result definition as follows: when employing to infer the target user (data generated by the target user) , it may be inferred to some candidate users in the training data if the inference function is satisfied. Otherwise, is more confident to infer as a new user that never appeared in . For instance, in the network traffic attribution application, when using Kaleido ( in our definition) to monitor the on-line network traffic, the inference result could be that the traffic is generated by some existing user (used for training Kaleido) or that the traffic is generated by some new user that has not appeared before (possibly an intruder). Now, we are ready to start our quantification.
III-B Warmup: Naive Quantification
In this subsection, we conduct the FDI quantification for a naive scenario, where we assume that , is a binary feature function, i.e., or , or either has feature or not. Then, we have , , i.e., the feature vector of is a -dimensional 0-1 vector with respect to . Furthermore, for , we define . Given two 0-1 vectors and where , we define , where is the logical binary XOR operation.
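As a small illustration of the binary model, the sketch below computes an XOR-based difference between two 0-1 feature vectors. The exact definitions in the paragraph above were lost in this copy, so the reading used here (count the positions where the XOR is 1, i.e., the Hamming distance) is an assumption, and the function name `xor_distance` is hypothetical.

```python
from typing import Sequence


def xor_distance(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of features on which two 0-1 vectors disagree.

    Assumption: the XOR-based quantity is read as the count of positions
    where a[i] XOR b[i] == 1 (the Hamming distance).
    """
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same dimension")
    if any(x not in (0, 1) for x in list(a) + list(b)):
        raise ValueError("feature vectors must be 0-1 vectors")
    return sum(x ^ y for x, y in zip(a, b))


# Two users described by four binary features; they disagree on features 1 and 2.
d = xor_distance([1, 0, 1, 1], [1, 1, 0, 1])
```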
For and , we denote the scenario that and correspond to the same user (or the data generated by the same user) and otherwise, e.g., the network traffic generated by the same user in different time windows or not. To conduct the FDI quantification, the first step is to understand and quantify the correlation of the features of and . Toward this objective, for and , we assume that for , i.e., the probability that preserves the same property of with respect to a feature is . Now, for and , suppose while . Then, we have the following lemma, which quantifies the inferability of with respect to and (see Footnote 3).
Footnote 3: Note that all the quantifications in this paper are statistically meaningful, i.e., statistically, with probability 1, the FDI quantifications hold.
Lemma 1. If and , then such that , i.e., is inferable with respect to and .
Proof: To prove this lemma, we first analyze the difference between and . To facilitate our analysis, we partition the feature space into four disjoint subsets with respect to and , denoted by , , , and respectively as shown in Fig.3, where (the set of features that has while does not have), (the set of features that both and have), (the set of features that does not have while has), and (the set of features that neither nor has). Let for , where is the cardinality of a set. Furthermore, for and , let be the feature vector of with respect to the features in . Evidently, is a subvector of . Furthermore, let . Then, it is easy to show that , .
Let . Since and , we have . Now, we consider each separately: (1) since both and have the features in , we have ; (2) similar to , since neither nor has any feature in , we have ; (3) for , the set of features held by but not , statistically, we have and , where is a binomial variable with parameters and ; and (4) for , the set of features held by but not , statistically, we have and . Then, we have
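The four-way partition of the feature space used in the proof can be sketched as follows. The function and variable names (`partition_features`, `only_a`, `both`, `only_b`, `neither`) are hypothetical labels for the four subsets; the paper's own symbols are not recoverable from this copy.

```python
from typing import List, Sequence, Tuple


def partition_features(a: Sequence[int], b: Sequence[int]
                       ) -> Tuple[List[int], List[int], List[int], List[int]]:
    """Split feature indices into four disjoint groups for two 0-1 vectors.

    Returns (only_a, both, only_b, neither):
      only_a  - features the first user has and the second does not
      both    - features both users have
      only_b  - features the first user lacks and the second has
      neither - features neither user has
    """
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same dimension")
    only_a, both, only_b, neither = [], [], [], []
    for i, (x, y) in enumerate(zip(a, b)):
        if x == 1 and y == 0:
            only_a.append(i)
        elif x == 1 and y == 1:
            both.append(i)
        elif x == 0 and y == 1:
            only_b.append(i)
        else:
            neither.append(i)
    return only_a, both, only_b, neither


only_a, both, only_b, neither = partition_features([1, 1, 0, 0, 1], [0, 1, 1, 0, 1])
# The two vectors disagree exactly on the features in only_a and only_b.
disagreements = len(only_a) + len(only_b)
```

The four groups always cover every feature index exactly once, so their sizes sum to the dimension of the feature space.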
Now, we consider two cases. First, if , we have . Then, applying the Pedarsani-Grossglauser lemma , we have
Since , we have
Then, according to the Borel-Cantelli Lemma and statistically, we have , which implies that statistically, .
Second, we consider the case that . In this case, we have . Then, applying the Pedarsani-Grossglauser lemma , we have
Considering that , we have
According to the Borel-Cantelli Lemma and statistically, we have , i.e., .
Now, we need to show that such that . Based on our proof, it is trivial to show that (1) when , if is an increasing function with respect to , where ; and similarly, when , if is a decreasing function with respect to . Therefore, for our purpose it is easy to design using existing techniques -. To name a naive one, we can set as shown in Algorithm 1.
In Lemma 1, we quantified the condition to successfully infer user from with respect to . We further discuss Lemma 1 as follows. First, one condition is that . This is consistent with our intuition. If , the features of each user are uniformly and equiprobably distributed in . Then, theoretically, all the users are equivalent with respect to and thus it is difficult (if not impossible) to successfully infer based on the features in by any model. Second, when , we explicitly specify the condition under which is statistically guaranteed to be successfully inferable with respect to . In our proof, we also show how to design . Note that the specified condition is sufficient but not necessary for to be inferable with respect to . Even if the condition is not satisfied, it is still possible to successfully infer with respect to . Particularly, we show this fact in the following corollary.
Corollary 1. For and , suppose and . If , then such that .
Proof: This corollary can be proven using a technique similar to that in Lemma 1.
In Lemma 1, we quantify the FDI of with respect to . Now, we quantify the FDI of with respect to . In practice, we usually infer to a set of candidate users in . For instance, in the network traffic attribution system Kaleido , the user responsible for the newly arriving traffic might be inferred to a set of users. Therefore, given , we define the Top- candidate set of as follows.
Top- candidate set and Top- inferable. For , suppose that such that . Then, the Top- candidate set of , denoted by , is defined as such that and . is Top- inferable with respect to if such that , i.e., returns a subset of with size and is in that subset.
Now, we quantify the Top- FDI of a user . Let be a subset of such that and . We show the result in the following lemma.
Lemma 2. For , suppose that . Then, is Top- inferable if and such that , where .
Proof: We prove this lemma by considering two cases. First, we consider the case that . We define an event as such that . Then, we have according to Boole’s inequality. From Lemma 1, when , . Then, we have
According to the Borel-Cantelli Lemma, we have , i.e., .
Second, we consider the case that . In this case, we define as an event that such that . Then, similar to the case that , we have
Again, according to the Borel-Cantelli Lemma, we have , i.e., .
Now, we discuss how to design and how to find . Based on our proof, if and such that , then (1) when , , which implies that among , there are at least users having their values greater than ; and (2) when , , there are at least users having their values smaller than . According to this observation, we give a preliminary implementation of as shown in Algorithm 2. Basically, if , Algorithm 2 returns a set consisting of users from that have the top- minimum values; and if , Algorithm 2 returns a set consisting of users from that have the top- maximum values. By a contradiction-based technique, we can show that the shown in Algorithm 2 returns a Top- candidate set of , i.e., is Top- inferable.
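The selection step described above (return the training users with the top-K minimum values in one case and the top-K maximum values in the other) can be sketched as below. This is not a reproduction of Algorithm 2, which is not included in this copy: the score is assumed to be an XOR-based (Hamming) distance between binary feature vectors, the names `top_k_candidates` and `take_smallest` are hypothetical, and the condition that selects between the two modes is left to the caller.

```python
import heapq
from typing import Dict, List, Sequence


def xor_distance(a: Sequence[int], b: Sequence[int]) -> int:
    """Assumed score: number of binary features on which the two vectors disagree."""
    return sum(x ^ y for x, y in zip(a, b))


def top_k_candidates(target_vec: Sequence[int],
                     train_table: Dict[str, Sequence[int]],
                     k: int,
                     take_smallest: bool = True) -> List[str]:
    """Return k training users with the smallest (or largest) score to the target."""
    if not 1 <= k <= len(train_table):
        raise ValueError("k must be between 1 and the number of training users")
    pick = heapq.nsmallest if take_smallest else heapq.nlargest
    ranked = pick(k, train_table.items(),
                  key=lambda item: xor_distance(target_vec, item[1]))
    return [uid for uid, _ in ranked]


train_table = {"u1": [1, 0, 1, 0], "u2": [1, 1, 1, 0], "u3": [0, 1, 0, 1]}
target_vec = [1, 0, 1, 1]  # scores: u1 -> 1, u2 -> 2, u3 -> 3

closest_two = top_k_candidates(target_vec, train_table, k=2, take_smallest=True)
farthest_one = top_k_candidates(target_vec, train_table, k=1, take_smallest=False)
```

A target user is then Top-K inferable, in the sense of the definition above, when the training user that corresponds to it appears in the returned list.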
In Lemma 2, the conditions for a user to be Top- inferable are quantified. If the specified conditions are satisfied, we also provide an implementation of in the proof (Algorithm 2). In fact, there are also many other techniques to implement , e.g., the techniques proposed in -. Further, similar to Lemma 1, the conditions in Lemma 2 are sufficient while not necessary for to be Top- inferable. When the conditions are satisfied, it is statistically guaranteed that is Top- inferable. Otherwise, is still Top- inferable with some probability. Particularly, we show that probability in the following corollary.
Corollary 2. For , suppose that . Then, if , , where and .
Now, we consider an even more general scenario where we try to infer multiple users in . A practical application corresponding to this scenario is to attribute the monitored network traffic generated by multiple users in network forensics . Let , i.e., is a set of users that appeared in both and . Furthermore, let be a constant and . Then, we define the -inferability of (i.e., is -inferable) as follows.
-Inferable. is -inferable if at least users in are Top- inferable (see Footnote 4).
Footnote 4: Without loss of generality, we assume is an integer in . In the case that is not an integer, we can define as .
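A minimal check of this definition might look as follows, given the Top-K candidate sets already produced by some inference model. The parameter names `alpha` and `overlap_users` and the function name are hypothetical, target and training users are assumed to share ground-truth IDs, and rounding the required count up with `math.ceil` is an assumption, since the rounding rule of the footnote is not recoverable from this copy.

```python
import math
from typing import Dict, Iterable, Set


def is_alpha_k_inferable(candidate_sets: Dict[str, Set[str]],
                         overlap_users: Iterable[str],
                         alpha: float) -> bool:
    """Check whether at least a fraction alpha of the overlapped users are Top-K inferable.

    candidate_sets maps each target user ID to the Top-K candidate set returned
    for it; a user counts as Top-K inferable when its own ID is in that set.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    users = list(overlap_users)
    hits = sum(1 for u in users if u in candidate_sets.get(u, set()))
    required = math.ceil(alpha * len(users))  # assumption: round up
    return hits >= required


overlap = ["a", "b", "c", "d"]
candidate_sets = {
    "a": {"a", "x"},  # correct: "a" is in its own candidate set
    "b": {"c", "d"},  # incorrect
    "c": {"c"},       # correct
    "d": {"a"},       # incorrect
}
half = is_alpha_k_inferable(candidate_sets, overlap, alpha=0.5)
three_quarters = is_alpha_k_inferable(candidate_sets, overlap, alpha=0.75)
```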
Then, we quantify the -inferability of in the following theorem.
Theorem 1. Let be any subset of and . is -inferable if and , such that , and .
Proof: We first prove this theorem for the case that . For , suppose . Evidently, . Now, to prove this theorem, it is sufficient to show that , is Top- inferable. Let be the event that such that is not Top- inferable. Then, we have
Following the Borel-Cantelli Lemma, we have , i.e., is Top- inferable which implies that is -inferable.
For the case that , we have
Again, following the Borel-Cantelli Lemma, we have , which implies that is -inferable.
In Theorem 1, we quantify the -inferability of . When comparing Theorem 1 and Lemma 2, we can see that the conditions specified in Theorem 1 are stronger than those in Lemma 2 in two respects. First, in Theorem 1, it is required that for , there exists one desired . This is for the purpose of making Top- inferable. Second, the required is stronger in Theorem 1 than that in Lemma 2. This can be explained from the statistical perspective. In Lemma 2, the objective is to make one user statistically Top- inferable, while in Theorem 1, the objective is to make all the users in statistically Top- inferable (simultaneously).
If the specified conditions in Theorem 1 are satisfied, an interesting question is how to design a to make -inferable. A preliminary implementation of can be built using the procedure in Algorithm 2: for each user in , we use Algorithm 2 to find a for it. Then, by an argument similar to that in Lemma 2, we can conclude that is -inferable under .
In this subsection, we conduct the FDI quantification under the assumption that each feature function is binary. Apparently, this assumption may not hold in many real applications. Nevertheless, the quantification in this subsection can shed light on sophisticated FDI analysis. In the following subsections, we consider general FDI quantification by removing this assumption.
III-C General Quantification: From the Distance Perspective
In the previous FDI quantification, we assume that , is a binary function, i.e., . Although this assumption holds in many real applications (e.g., linkage attacks and data de-anonymization attacks), may not be a binary function in many other applications. Therefore, in the following FDI quantification, we assume that can be any function with a real-valued output. Furthermore, given , an inference model may assign different weights to each feature (usually, the weights are learned from the features of the training data, i.e., ). To characterize this situation, we model that each feature in corresponds to a weight value in , which can be obtained by a weight function . In addition, to make our FDI quantification sufficiently general and meanwhile mathematically tractable, we model the correlation between the feature function and the weight function by another function , i.e., is a function defined on and (see Footnote 5). Now, for a user (or ), we have its feature vector as , where is the function defined on the feature function and the weight function of .
Footnote 5: Here, to make our model sufficiently general, we do not specify the dedicated definition of . In a specific application, can be specified accordingly. For instance, we may have as in a linear regression model.
Given learned from , we quantify the FDI of using . For instance, could be the new monitored network traffic or the new collected web search data. For , to infer to some user in (or the data in generated by the same user) or to determine whether is a new user (or the data generated by a new user), two fundamental approaches are usually employed in : distance-based approach and distribution-based approach -. In the distance-based approach, computes the feature distance between and each in , i.e., the distance between and for . Then, infers to a subset of candidates in (either has the minimum or the maximum distance value). In the distribution-based approach, computes the feature distribution similarity between and each in , i.e., the distribution similarity between and for . Then, infers to a subset of candidates in (usually, the users in who have the most similar feature distributions with that of ). In this paper, we quantify the FDI for both approaches. Specifically, in this subsection, we focus on distance-based FDI quantification.
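The distance-based approach described above can be sketched as follows: compute a feature distance between the target user and every training user, then return the closest candidates. The p-norm distance used here is only one possible application-oriented choice, and the names `lp_distance` and `nearest_candidates` are hypothetical. The sketch covers the minimum-distance variant only.

```python
from typing import Dict, List, Sequence


def lp_distance(a: Sequence[float], b: Sequence[float], p: float = 2.0) -> float:
    """p-norm distance between two real-valued feature vectors (an assumed choice)."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same dimension")
    if p < 1:
        raise ValueError("p must be at least 1")
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def nearest_candidates(target_vec: Sequence[float],
                       train_table: Dict[str, Sequence[float]],
                       k: int,
                       p: float = 2.0) -> List[str]:
    """Return the k training users whose feature vectors are closest to the target."""
    ranked = sorted(train_table,
                    key=lambda uid: lp_distance(target_vec, train_table[uid], p))
    return ranked[:k]


train_table = {"u1": [0.0, 0.0], "u2": [3.0, 4.0], "u3": [10.0, 10.0]}
candidates = nearest_candidates([1.0, 0.0], train_table, k=2)
```

A distribution-based model would replace `lp_distance` with a similarity measure between feature distributions while keeping the same ranking step.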
To facilitate our quantification, we first make the following definitions and assumptions. For , we define their feature distance as . In practice, can be defined in an application-oriented manner. For instance, can be defined using the -norm distance as follows:
Let be the expectation/mean value of a random variable. Then, we define the expectation value of as . Furthermore, we assume that , i.e., the feature distance between and is lower bounded by 0 (which is an intuitive assumption) and upper bounded by some value . Now, for and , suppose that and . We quantify the inferability of with respect to and in the following lemma.
Lemma 3. (1) When , is inferable if ; (2) When , is inferable if .
Proof: We start from proving the first conclusion. Let , , and . When , we have