Preserving Both Privacy and Utility in Network Trace Anonymization

As network security monitoring grows more sophisticated, there is an increasing need for outsourcing such tasks to third-party analysts. However, organizations are usually reluctant to share their network traces due to privacy concerns over sensitive information, e.g., network and system configuration, which may potentially be exploited for attacks. In cases where data owners are convinced to share their network traces, the data are typically subjected to certain anonymization techniques, e.g., CryptoPAn, which replaces real IP addresses with prefix-preserving pseudonyms. However, most such techniques either are vulnerable to adversaries with prior knowledge about some network flows in the traces, or require heavy data sanitization or perturbation, both of which may result in a significant loss of data utility. In this paper, we aim to preserve both privacy and utility through shifting the trade-off from between privacy and utility to between privacy and computational cost. The key idea is for the analysts to generate and analyze multiple anonymized views of the original network traces; those views are designed to be sufficiently indistinguishable even to adversaries armed with prior knowledge, which preserves the privacy, whereas one of the views will yield true analysis results privately retrieved by the data owner, which preserves the utility. We present the general approach and instantiate it based on CryptoPAn. We formally analyze the privacy of our solution and experimentally evaluate it using real network traces provided by a major ISP. The results show that our approach can significantly reduce the level of information leakage (e.g., less than 1% of the information leaked by CryptoPAn) with comparable utility.

1. Introduction

As the owners of large-scale network data, today’s ISPs and enterprises usually face a dilemma. As security monitoring and analytics grow more sophisticated, there is an increasing need for those organizations to outsource such tasks together with necessary network data to third-party analysts, e.g., Managed Security Service Providers (MSSPs) (outsource, ). On the other hand, those organizations are typically reluctant to share their network trace data with third parties, and even less willing to publish them, mainly due to privacy concerns over sensitive information contained in such data. For example, important network configuration information, such as potential bottlenecks of the network, may be inferred from network traces and subsequently exploited by adversaries to increase the impact of a denial of service attack (riboni, ).

In cases where data owners are convinced to share their network traces, the traces are typically subjected to some anonymization techniques. The anonymization of network traces has attracted significant attention (a more detailed review of related work will be given in section 6). For instance, CryptoPAn replaces real IP addresses inside network flows with prefix-preserving pseudonyms, such that the hierarchical relationships among those addresses will be preserved to facilitate analyses (PP, ). Specifically, any two IP addresses sharing a prefix in the original trace will also do so in the anonymized trace. However, CryptoPAn is known to be vulnerable to the so-called fingerprinting attack and injection attack (brekene1, ; brekene2, ; a1, ). In those attacks, adversaries either already know some network flows in the original traces (by observing the network or from other relevant sources, e.g., DNS and WHOIS databases) (burkhart, ), or have deliberately injected some forged flows into such traces. By recognizing those known flows in the anonymized traces based on unchanged fields of the flows, namely, fingerprints (e.g., timestamps and protocols), the adversaries can extrapolate their knowledge to recognize other flows based on the shared prefixes (brekene1, ). We now demonstrate such an attack in detail.

Example 1.1.

In Figure 1, the upper table shows the original trace, and the lower shows the trace anonymized using CryptoPAn. In this example, without loss of generality, we only focus on source IPs. Inside each table, similar prefixes are highlighted through similar shading.

Figure 1. An example of injection attack
  1. Step 1: An adversary has injected three network flows, shown as the first three records in the original trace (upper table).

  2. Step 2: The adversary recognizes the three injected flows in the anonymized trace (lower table) through unique combinations of the unchanged attributes (Start Time and Src Port).

  3. Step 3: He/she can then extrapolate his/her knowledge from the injected flows to real flows. For example, since the second (injected), fifth (real), and sixth (real) flows share a prefix in the anonymized trace, he/she knows that all three must also share a prefix in the original trace. Such identified relationships between flows in the two traces will be called matches from now on.

  4. Step 4: Finally, since he/she knows the original IPs of the injected flows, he/she can infer the prefixes or entire IPs of the matched flows in the original trace, e.g., the original prefix of the fifth and sixth flows, and the full original IPs of the fourth and last flows.

More generally, a powerful adversary who can probe all the subnets of a network using injection or fingerprinting can potentially de-anonymize the entire CryptoPAn output via a more sophisticated frequency analysis attack (brekene1, ).
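The four steps of Example 1.1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`common_prefix_len`, `extrapolate`), the fingerprint tuples, and all addresses are hypothetical and are not taken from Figure 1. The sketch relies only on the property stated above, namely that two pseudonyms share a prefix exactly when the original addresses do.

```python
import ipaddress

def common_prefix_len(a: str, b: str) -> int:
    """Number of leading bits shared by two IPv4 addresses."""
    x = int(ipaddress.IPv4Address(a)) ^ int(ipaddress.IPv4Address(b))
    return 32 - x.bit_length()

def extrapolate(injected, anonymized):
    """injected: {fingerprint: original_ip} for flows the adversary created.
    anonymized: list of (fingerprint, pseudonym_ip) records of the shared trace.
    Returns {fingerprint: inferred original prefix} for the non-injected records."""
    # Step 2: recognise injected flows through their unchanged fingerprints.
    known = [(pseudo, injected[fp]) for fp, pseudo in anonymized if fp in injected]
    leaks = {}
    if not known:
        return leaks
    for fp, pseudo in anonymized:
        if fp in injected:
            continue
        # Steps 3-4: a prefix shared between pseudonyms implies a prefix of the
        # same length shared between the corresponding original addresses.
        best_len, best_ip = max((common_prefix_len(pseudo, kp), ko) for kp, ko in known)
        if best_len > 0:
            leaks[fp] = str(ipaddress.ip_network(f"{best_ip}/{best_len}", strict=False))
    return leaks

# Hypothetical data: one injected flow and one real flow in the same /24.
injected = {("09:00:01", 5001): "10.1.2.3"}
anonymized = [(("09:00:01", 5001), "77.8.9.10"), (("09:00:07", 443), "77.8.9.200")]
print(extrapolate(injected, anonymized))
```

The more subnets the adversary can probe this way, the more original prefixes are exposed, which is the intuition behind the frequency analysis attack mentioned above.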

Most subsequent solutions either require heavy data sanitization or can only support limited types of analysis. In particular, the $(k, j)$-obfuscation method first groups together $k$ or more flows with similar fingerprints and then bucketizes the flows inside each group (i.e., replaces their original IPs with identical IPs); all records whose fingerprints are not sufficiently similar to others are suppressed (riboni, ). Clearly, both the bucketization and the suppression may lead to a significant loss of data utility. The differentially private analysis method first adds noise to analysis results and then publishes the aggregated results (mcsherry, ; DP2, ; DP3, ). Although this method may provide a privacy guarantee regardless of adversarial knowledge, the perturbation and aggregation prevent its application to analyses that demand accurate or detailed records in the network traces.

In this paper, we aim to preserve both privacy and utility by shifting the trade-off from between privacy and utility, as seen in most existing works, to between privacy and computational cost (which has seen a significant decrease lately, especially with the increasing popularity of cloud technology). The key idea is for the data owner to send enough information to the third party analysts such that they can generate and analyze many different anonymized views of the original network trace; those anonymized views are designed to be sufficiently indistinguishable (which will be formally defined in subsection 2.4) even to adversaries armed with prior knowledge and performing the aforementioned attacks, which preserves the privacy; at the same time, one of the anonymized views will yield true analysis results, which will be privately retrieved by the data owner or other authorized parties, which preserves the utility. More specifically, our contributions are as follows.

  1. We propose a multi-view approach to the prefix-preserving anonymization of network traces. To the best of our knowledge, this is the first known solution that can achieve similar data utility as CryptoPAn does, while being robust against the so-called semantic attacks (e.g., fingerprinting and injection). In addition, we believe the idea of shifting the trade-off from between privacy and utility to between privacy and computational cost may potentially be adapted to improve other privacy solutions.

  2. In addition to the general multi-view approach, we detail a concrete solution based on iteratively applying CryptoPAn to each partition inside a network trace such that different partitions are anonymized differently in all the views except one (which yields valid analysis results that can be privately retrieved by the data owner). In addition to privacy and utility, we design the solution in such a way that only one seed view needs to be sent to the analysts, which avoids additional communication cost.

  3. We formally analyze the level of privacy guarantee achieved using our method, discuss potential attacks and solutions, and finally experimentally evaluate our solution using real network traces from a major ISP. The experimental results confirm that our solution is robust against semantic attacks with a reasonable computational cost.

The rest of the paper is organized as follows: Section 2 defines our models. Section 3 introduces building blocks for our schemes. Section 4 details two concrete multi-view schemes based on CryptoPAn. Section 5 presents the experimental results. Section 4.3 provides more discussion, and section 6 reviews the related work. Finally, section 7 concludes the paper.

2. Models

In this section, we describe models for the system and adversaries; we briefly review CryptoPAn; we provide a high level overview of our multi-view approach; finally, we define our privacy property. Essential definitions and notations are summarized in Table 1.

Symbol: Definition
$L$: Original network trace
$L^*$: Anonymized trace
$A^{IP}$: IP attributes: source and destination IP
fp-QI: Fingerprint quasi identifier
$r_i$: Record number $i$
$n$: Number of records in $L$
$\alpha$: Number of IP prefixes known by the attacker
$S_\alpha$: The set of addresses known by the attacker
$S^*_0$: The set of IP addresses in the seed view
$S^*_i$: The set of IP addresses in view $i$
$PP$: CryptoPAn function
$RPP$: Reverse of CryptoPAn
$P_i$: Partition $i$
$m$: Number of partitions in $L$
$r$: Index of the real view
$K_0$, $K$: Private key and outsourced key

Table 1. The Notation Table.

2.1. The System and Adversary Model

Denote by $L$ a network trace comprised of a set of flows (or records) $r_i$. Each flow includes a confidential multi-value attribute $A^{IP}$ (the source and destination IPs), and the set of other attributes is called the Fingerprint Quasi Identifier (fp-QI) (riboni, ). Suppose the data owner would like the analyst to perform an analysis on $L$ to produce a report. To ensure privacy, instead of sending $L$, an anonymization function is applied to obtain an anonymized version $L^*$. Thus, our main objective is to find an anonymization function that preserves both privacy, which means the analyst cannot obtain the original trace $L$ (or the means to invert the anonymization) from $L^*$, and utility, which means the anonymization function must be prefix-preserving.

In this context, we make the following assumptions (similar to those found in most existing works (PP, ; brekene1, ; brekene2, ; a1, )). i) The adversary is an honest-but-curious analyst (in the sense that he/she will exactly follow the approach) who can observe $L^*$. ii) The anonymization function is publicly known, but the corresponding anonymization key is not known by the adversary. iii) The goal of the adversary is to find all possible matches (as demonstrated in Example 1.1, an IP address may be matched to its anonymized version either through the fp-QI or through shared prefixes) between $L$ and $L^*$. iv) Suppose $L$ consists of $d$ groups, each of which contains IP addresses with similar prefixes (e.g., those in the same subnet), and among these the adversary can successfully inject or fingerprint $\alpha$ ($\alpha \le d$) groups (e.g., the demilitarized zone (DMZ) or other subnets to which the adversary has access). Accordingly, we say that the adversary has $S_\alpha$ knowledge. v) Finally, we assume the communication between the data owner and the analyst is over a secure channel, and we do not consider integrity or availability issues (e.g., a malicious adversary may potentially alter or delete the analysis report).

2.2. The CryptoPAn Model

To facilitate further discussions, we briefly review the CryptoPAn (PP, ) model, which gives a baseline for prefix-preserving anonymization.

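Definition 2.1.

Prefix-preserving Anonymization (PP, ): Given two IP addresses $a = a_1 a_2 \ldots a_n$ and $b = b_1 b_2 \ldots b_n$, and a one-to-one function $F(\cdot): \{0,1\}^n \to \{0,1\}^n$, we say that

  • $a$ and $b$ share a $k$-bit prefix ($0 \le k \le n$), if and only if $a_1 a_2 \ldots a_k = b_1 b_2 \ldots b_k$, and $a_{k+1} \neq b_{k+1}$.

  • $F$ is prefix-preserving, if, for any $a$ and $b$ that share a $k$-bit prefix, $F(a)$ and $F(b)$ also do so.

Given $a = a_1 a_2 \ldots a_n$ and $F(a) = a'_1 a'_2 \ldots a'_n$, the prefix-preserving anonymization function $F$ must necessarily satisfy the canonical form (PP, ), as follows.

$$a'_i = a_i \oplus f_{i-1}(a_1 a_2 \ldots a_{i-1}), \quad i = 1, 2, \ldots, n \qquad (1)$$

where $f_i$ is a cryptographic function which, based on a secret key $K$, takes as input a bit-string of length $i$ and returns a single bit. Intuitively, the $i$-th bit is anonymized based on $K$ and the preceding $i-1$ bits to satisfy the prefix-preserving property. The cryptographic function $f_i$ can be constructed as $f_i(a_1 a_2 \ldots a_i) = \mathcal{L}(\mathcal{R}(\mathcal{P}(a_1 a_2 \ldots a_i), K))$, where $\mathcal{L}$ returns the least significant bit, $\mathcal{R}$ can be a block cipher such as Rijndael (rindal, ), and $\mathcal{P}$ is a padding function that expands $a_1 a_2 \ldots a_i$ to match the block size of $\mathcal{R}$ (PP, ). In the following, $PP$ will stand for this CryptoPAn function and its output will be denoted by $a' = PP(a, K)$.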

The advantage of CryptoPAn is that it is deterministic and allows consistent prefix-preserving anonymization under the same key $K$. However, as mentioned earlier, CryptoPAn is vulnerable to semantic attacks, which will be addressed in the next section.
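The canonical form can be illustrated with a toy prefix-preserving function. This is a minimal sketch and not the CryptoPAn implementation: it assumes HMAC-SHA256 as a stand-in for the padded block cipher, and the names `_f`, `pp`, and `shared_prefix_len` as well as the key and addresses are hypothetical.

```python
import hashlib
import hmac

def _f(key: bytes, prefix_bits: str) -> int:
    """One pseudorandom bit derived from the preceding bits.
    Assumption: HMAC-SHA256 replaces the padded block cipher; the least
    significant bit of the digest plays the role of the returned bit."""
    return hmac.new(key, prefix_bits.encode(), hashlib.sha256).digest()[-1] & 1

def pp(addr: int, key: bytes, n: int = 32) -> int:
    """Toy canonical form: output bit i = input bit i XOR f(first i-1 input bits)."""
    bits = format(addr, f"0{n}b")
    out = 0
    for i, bit in enumerate(bits):
        out = (out << 1) | (int(bit) ^ _f(key, bits[:i]))
    return out

def shared_prefix_len(a: int, b: int, n: int = 32) -> int:
    """Length of the common leading bit-string of two n-bit addresses."""
    return n - (a ^ b).bit_length()

key = b"demo key (hypothetical)"
a, b = 0xC0A80105, 0xC0A801C8   # 192.168.1.5 and 192.168.1.200 share 24 bits
assert shared_prefix_len(a, b) == 24
assert shared_prefix_len(pp(a, key), pp(b, key)) == 24   # prefix length preserved
assert pp(a, key) == pp(a, key)                          # deterministic under one key
```

Because each output bit depends only on the key and on the preceding input bits, two addresses that agree on their first $k$ bits and differ on bit $k+1$ yield pseudonyms with exactly the same relationship.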

2.3. The Multi-View Approach

We propose a novel multi-view approach to the prefix-preserving anonymization of network traces. The objective is to preserve both the privacy and the data utility, while being robust against semantic attacks. The key idea is to hide a prefix-preserving anonymized view, namely, the real view, among other fake views, such that an adversary cannot distinguish between those views, either using his/her prior knowledge or through semantic attacks. Our approach is depicted in Figure 2 and detailed below.

Figure 2. An overview of the multi-view approach

2.3.1. Privacy Preservation at the Data Owner Side

Step 1:

The data owner generates two CryptoPAn keys $K_0$ and $K$, and then obtains an anonymized trace using the anonymization function $PP$ (which will be represented by the gear icon inside this figure) and $K_0$. This initial anonymization step is designed to prevent the analyst from simulating the process, as $K_0$ will never be given out. Note that this anonymized trace is still vulnerable to semantic attacks and must undergo the remaining steps. Besides, generating this anonymized trace will actually be slightly more complicated due to migration, as discussed later in Section 3.3.

Step 2:

The anonymized trace is then partitioned (the partitioning algorithms will be detailed in Sections 3.2 and 4).

Step 3:

Each partition is anonymized using $PP$ and key $K$, but the anonymization will be repeated a different number of times on different partitions. For example, as the figure shows, the first partition is anonymized only once, whereas the second is anonymized three times, etc. The result of this step is called the seed trace. The idea is that, as illustrated by the different graphic patterns inside the seed trace, different partitions have been anonymized differently, and hence the seed trace in its entirety is no longer prefix-preserving, even though each partition is still prefix-preserving (note that this is only a simplified demonstration of the seed trace generator scheme, which will be detailed in Section 4).

Step 4:

The seed trace, together with some supplementary parameters including $K$, is outsourced to the analyst.

2.3.2. Utility Realization at the Data Analyst Side

Step 5:

The analyst generates a total of $N$ views based on the received seed view and supplementary parameters. Our design will ensure that one of those generated views, namely, the real view, will have all its partitions anonymized in the same way, and thus be prefix-preserving (detailed in Section 4), though the analyst (adversary) cannot tell which view is the real one.

Step 6:

The analyst performs the analysis on all $N$ views and generates the corresponding reports.

Step 7:

The data owner retrieves the analysis report corresponding to the real view following an oblivious random access memory (ORAM) protocol (oram, ), such that the analyst cannot learn which view has been retrieved.

Next, we define the privacy property for the multi-view solution.

2.4. Privacy Property against Adversaries

Under our multi-view approach, an analyst (adversary) will receive $N$ different traces with identical fp-QI attribute values and different $A^{IP}$ attribute values. Therefore, his/her goal now is to identify the real view among all the views, e.g., he/she may attempt to observe his/her injected or fingerprinted flows, or he/she can launch the aforementioned semantic attacks on those views, hoping that the real view might respond differently to those attacks. Therefore, the main objective in designing an effective multi-view solution is to satisfy the indistinguishability property, which means the real view must be sufficiently indistinguishable from the fake views under semantic attacks. Motivated by the concept of Differential Privacy (dworks, ), we propose the $\epsilon$-indistinguishability property as follows.

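Definition 2.2.

$\epsilon$-Indistinguishable Views: A multi-view solution is said to satisfy $\epsilon$-indistinguishability against an $S_\alpha$ adversary if and only if (both probabilities below are from the adversary's point of view)

$$\exists\, \epsilon \ge 0 \ \text{s.t.}\ \forall i \in \{1, 2, \ldots, N\}: \quad e^{-\epsilon} \le \frac{\Pr(\text{view } i \text{ may be the real view})}{\Pr(\text{view } r \text{ may be the real view})} \le e^{\epsilon} \qquad (2)$$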

In Definition 2.2, a smaller $\epsilon$ value is more desirable as it means the views are more indistinguishable from the real view to the adversary. For example, an extreme case of $\epsilon = 0$ would mean all the views are equally likely to be the real view to the adversary (from now on, we call these views the real view candidates). In practice, the value of $\epsilon$ would depend on the specific design of a multi-view solution and also on the adversary's prior knowledge, as will be detailed in the following sections.

Finally, since the multi-view approach requires outsourcing some supplementary parameters, we will also need to analyze the security/privacy of the communication protocol (privacy leakage in the protocol, which complements the privacy analysis in output of the protocol) in semi-honest model under the theory of secure multiparty computation (SMC) (Yao86, )(goldrich, ) (see section 4.2.4).

3. The Building Blocks

In this section, we introduce the building blocks for our multi-view mechanisms, namely, the iterative and reverse CryptoPAn, partition-based prefix preserving, and CryptoPAn with IP-collision (migration).

3.1. Iterative and Reverse CryptoPAn

As mentioned in section 2.3, the multi-view approach relies on iteratively applying a prefix-preserving function $PP$ for generating the seed view. Also, the analyst will invert such an application of $PP$ in order to obtain the real view (among fake views). Therefore, we first need to show how $PP$ can be iteratively and reversely applied.

First, it is straightforward that $PP$ can be iteratively applied, and the result also yields a valid prefix-preserving function. Specifically, denote by $PP(a, j, K)$ ($j \ge 1$) the iterative application of $PP$ on IP address $a$ using key $K$, where $j$ is the number of iterations, called the index. For example, for an index of two, we have $PP(a, 2, K) = PP(PP(a, K), K)$. It can be easily verified that given any two IP addresses $a$ and $b$ sharing a $k$-bit prefix, $PP(a, j, K)$ and $PP(b, j, K)$ will always result in two IP addresses that also share a $k$-bit prefix (i.e., iterative $PP$ is prefix-preserving). More generally, the same also holds for applying $PP$ under a sequence of indices and keys (for both IPs), e.g., $PP(PP(a, i, K_0), j, K)$ and $PP(PP(b, i, K_0), j, K)$ will also share a $k$-bit prefix. Finally, for a set of IP addresses $S$, iterative $PP$ using a single key $K$ satisfies the following associative property:

$$PP(PP(S, i, K), j, K) = PP(S, i + j, K) \qquad (3)$$

On the other hand, when a negative number is used as the index, we have a reverse iterative CryptoPAn function ($RPP$ for short), as formally characterized in Theorem 3.1 (the proof is in Appendix A.1).

Theorem 3.1.

Given IP addresses $a = a_1 a_2 \ldots a_n$ and $b = b_1 b_2 \ldots b_n$, the function $RPP$ defined as

$$RPP(b) = a_1 a_2 \ldots a_n, \quad \text{where } a_i = b_i \oplus f_{i-1}(a_1 a_2 \ldots a_{i-1}), \quad i = 1, 2, \ldots, n \qquad (4)$$

is the inverse of the $PP$ function given in Equation 1, i.e., $RPP(PP(a)) = a$.
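The iterative and reverse applications can be checked on a toy model. The sketch below is an illustration under assumptions, not the authors' code: HMAC-SHA256 stands in for the block cipher, and `pp`, `rpp`, and `pp_iter` are hypothetical names. A negative index is interpreted as repeated application of the inverse, which recovers the original bits one at a time from the already recovered prefix.

```python
import hashlib
import hmac

def _f(key: bytes, prefix_bits: str) -> int:
    # Assumption: HMAC-SHA256 replaces the padded block cipher of CryptoPAn.
    return hmac.new(key, prefix_bits.encode(), hashlib.sha256).digest()[-1] & 1

def pp(addr: int, key: bytes, n: int = 32) -> int:
    """Toy prefix-preserving map in canonical form."""
    bits = format(addr, f"0{n}b")
    out = 0
    for i, bit in enumerate(bits):
        out = (out << 1) | (int(bit) ^ _f(key, bits[:i]))
    return out

def rpp(anon: int, key: bytes, n: int = 32) -> int:
    """Inverse map: bit i of the original is bit i of the pseudonym XOR
    f(original bits 1..i-1), which are known by the time bit i is processed."""
    anon_bits = format(anon, f"0{n}b")
    orig = ""
    for bit in anon_bits:
        orig += str(int(bit) ^ _f(key, orig))
    return int(orig, 2)

def pp_iter(addr: int, index: int, key: bytes, n: int = 32) -> int:
    """Apply pp |index| times for a positive index, rpp |index| times for a negative one."""
    step = pp if index >= 0 else rpp
    for _ in range(abs(index)):
        addr = step(addr, key, n)
    return addr

key = b"demo key (hypothetical)"
a = 0xC0A80105
assert rpp(pp(a, key), key) == a                                   # inverse property
assert pp_iter(pp_iter(a, 2, key), 3, key) == pp_iter(a, 5, key)   # indices add up
assert pp_iter(pp_iter(a, -4, key), 4, key) == a                   # negative index undoes positive
```

Since the toy map is a bijection on $n$-bit strings, its powers compose additively for any mix of positive and negative indices, which is the property the seed-view construction depends on.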

3.2. Partition-based Prefix Preserving

As mentioned in section 2.3, the central idea of the multi-view approach is to divide the trace into partitions (Step 2), and then anonymize those partitions iteratively, but for different numbers of iterations (Step 3). In this subsection, we discuss this concept.

Given $S$ as a set of IP addresses, we may divide $S$ into partitions in various ways, e.g., forming equal-sized partitions after sorting $S$ based on either the IP addresses or the corresponding timestamps. The partitioning scheme will have a major impact on the privacy, and we will discuss two such schemes in the next section.

Once the trace is divided into partitions, we can then apply $PP$ on each partition separately, denoted by $PP(P_i, v_i, K)$ for the $i$-th partition. Specifically, given $S$ divided as a set of $m$ partitions $\{P_1, P_2, \ldots, P_m\}$, we define a key vector $V = [v_1, v_2, \ldots, v_m]$ where each $v_i$ is a positive integer indicating the number of times $PP$ should be applied to $P_i$, namely, the key index of $P_i$. Given a cryptographic key $K$, we can then define the partition-based prefix-preserving anonymization of $S$ as $PP(S, V, K) = [PP(P_1, v_1, K), PP(P_2, v_2, K), \ldots, PP(P_m, v_m, K)]$.

We can easily extend the associative property in Equation 3 to this case as follows (which will play an important role in designing our multi-view mechanisms in the next section).

$$PP(PP(S, V_i, K), V_j, K) = PP(S, V_i + V_j, K) \qquad (5)$$

3.3. IP Migration: Introducing IP-Collision into CryptoPAn

As mentioned in section 2.3, once the analyst (adversary) receives the seed view, he/she would generate many indistinguishable views among which only one, the real view, will be prefix preserving across all the partitions, while the other (fake) views do not preserve prefixes across the partitions (Step 5). However, the design would have a potential flaw under a direct application of CryptoPAn. Specifically, since the original CryptoPAn design is collision resistant (PP, ), the fact that similar prefixes are only preserved in the real view across partitions would allow an adversary to easily distinguish the real view from others.

Figure 3. An example showing only the real view contains shared prefixes (which allows it to be identified by adversaries)
Example 3.1.

Figure 3 illustrates this flaw. The original trace includes three different addresses and has been divided into two partitions $P_1$ and $P_2$. As illustrated in the figure, the real view is easily distinguishable from the two fake views, as the prefixes shared between addresses in $P_1$ and $P_2$ only appear in the real view. This is because the partitions in fake views have different rounds of PP applied and the original CryptoPAn design is collision resistant (PP, ), so the shared prefixes will no longer appear. Hence, the adversary can easily distinguish the real view from others.

To address this issue, our idea is to create collisions between different prefixes in fake views, such that adversaries cannot tell whether the shared prefixes are due to prefix preserving in the real view, or due to collisions in the fake views. However, due to the collision resistance property of CryptoPAn (PP, ), there is only a negligible probability that different prefixes may become identical even after applying different iterations of PP, as shown in the above example. Therefore, our key idea of IP migration is to first replace the prefixes of all the IPs with common values (e.g., zeros), and then fabricate new prefixes for them by applying different iterations of PP. This IP migration process is designed to be prefix-preserving (i.e., any IPs sharing prefixes in the original trace will still share the new prefixes), and to create collisions in fake views, since the sums of key indices obtained during view generation can easily collide. Next, we demonstrate this IP migration technique in an example.

Figure 4. An example showing, by removing shared prefixes and fabricating them with the same rounds of PP, both fake view and real view may now contain fake or real shared prefixes (which makes them indistinguishable)
Example 3.2.

In Figure 4, the first stage shows the same original trace as in Example 3.1. In the second stage, we "remove" the prefixes of all IPs and replace them with all zeros (by XORing them with their own prefixes). Next, in the third stage, we fabricate new prefixes by applying different iterations of PP in a prefix-preserving manner, e.g., the first two IPs still share a common prefix that is different from that of the last IP. However, note that whether two IPs share the new prefixes now depends only on their key indices, i.e., one key index for the first two IPs and a different one for the last IP. This is how we can create collisions in the next stage (the fake view), where the first and last IPs coincidentally share the same prefix due to their common key indices (note, however, that these are sums of different key indices from the migration stage and the view generation stage, respectively). Now, the adversary will not be able to tell which of those views is real based on the existence of shared prefixes.

We now formally define the migration function in the following.

Definition 3.1.

Migration Function: Let be a set of IP addresses consists of groups of IPs with distinct prefixes respectively, and be a random CryptoPAn key. Migration function is defined as

(6)

where is the set of non-repeating random key indices generated between using a cryptographically secure pseudo random number generator.
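The migration idea can be sketched as follows, based only on the prose description above. Several points are assumptions made for illustration: groups are formed on a fixed prefix length (`prefix_len`), the key indices are drawn as a random permutation of $1, \ldots, d$ for $d$ groups, HMAC-SHA256 stands in for the block cipher, and the names `pp` and `migrate` are hypothetical.

```python
import hashlib
import hmac
import secrets

def _f(key: bytes, prefix_bits: str) -> int:
    # Assumption: HMAC-SHA256 replaces the padded block cipher of CryptoPAn.
    return hmac.new(key, prefix_bits.encode(), hashlib.sha256).digest()[-1] & 1

def pp(addr: int, key: bytes, n: int = 32) -> int:
    bits = format(addr, f"0{n}b")
    out = 0
    for i, bit in enumerate(bits):
        out = (out << 1) | (int(bit) ^ _f(key, bits[:i]))
    return out

def migrate(addrs, key: bytes, prefix_len: int = 16, n: int = 32):
    """Zero each address's prefix, then rebuild a prefix by applying pp a
    group-specific number of times. Returns (migrated addresses, index per group)."""
    shift = n - prefix_len
    groups = sorted({a >> shift for a in addrs})
    # Non-repeating random key indices from a cryptographically secure source
    # (the range 1..d is an assumption of this sketch).
    indices = secrets.SystemRandom().sample(range(1, len(groups) + 1), len(groups))
    index_of = dict(zip(groups, indices))
    migrated = []
    for a in addrs:
        x = a ^ ((a >> shift) << shift)      # XOR with own prefix: prefix bits become zero
        for _ in range(index_of[a >> shift]):
            x = pp(x, key, n)
        migrated.append(x)
    return migrated, index_of

key = b"demo key (hypothetical)"
addrs = [0x0A010005, 0x0A0100C8, 0x14020005]   # first two share the 0x0A01 prefix
out, index_of = migrate(addrs, key)
# Same group, same index: the two zeroed addresses share 24 bits and still do afterwards.
assert 32 - (out[0] ^ out[1]).bit_length() == 24
```

After migration, whether two addresses share a fabricated prefix is governed by their key indices rather than by their original prefixes, which is what later allows fake views to exhibit shared prefixes as well.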

4. $\epsilon$-Indistinguishable Multi-view Mechanisms

We first present a multi-view mechanism based on IP partitioning in Section 4.1. We then propose a more refined scheme based on distinct IP partitioning with key vector generator in Section 4.2.

4.1. Scheme I: IP-based Partitioning Approach

To realize the main ideas of multi-view anonymization, as introduced in Section 2.3, we need to design concrete schemes for each step in Figure 2. The key idea of our first scheme is the following. We divide the original trace in such a way that all the IPs sharing prefixes will always be placed in the same partition. This will prevent the attack described in Section 3.3, i.e., identifying the real view by observing shared prefixes across different partitions. As we will detail in Section 4.1.4, this scheme can achieve perfect indistinguishability without the need for IP migration (introduced in Section 3.3), although it has its limitations which will be addressed in our second scheme. Both schemes are depicted in Figure 5 and detailed below.

Figure 5. An example of a trace which undergoes multi-view schemes I, II

Specifically, our first scheme includes three main steps: privacy preservation (Section 4.1.1), utility realization (Section 4.1.2), and analysis report extraction (Section 4.1.3).

4.1.1. Privacy Preservation (Data Owner)

The data owner performs a set of actions to generate the seed trace together with some parameters to be sent to the analyst for generating different views. These actions are summarized in Algorithm 1, and detailed in the following.

  • Applying CryptoPAn using $K_0$: First, the data owner generates two independent keys, namely $K_0$ (the key used for the initial anonymization, which never leaves the data owner) and $K$ (the key used for later anonymization steps). The data owner then generates the initially anonymized trace $L_0 = PP(L, K_0)$. This step is designed to prevent the adversary from simulating the scheme, e.g., using a brute-force attack to revert the seed trace back to the original trace, in which he/she can recognize some original IPs. The leftmost block in Figure 5 shows an example of the initially anonymized trace.

  • Trace partitioning based on IP-value: The initially anonymized trace is partitioned based on IP values. Specifically, let $S^*$ be the set of IP addresses in $L_0$ consisting of $d$ groups of IPs with distinct prefixes; we divide $L_0$ into $d$ partitions, each of which is the collection of all records containing one of these groups. For example, the upper part of Figure 5 depicts how our first scheme works. The set of three IPs is divided into two partitions, where $P_1$ includes both IPs sharing the same prefix, whereas the last IP goes to $P_2$ since it does not share a prefix with the others.

  • Seed trace creation: The data owner in this step generates the seed trace using a $d$-size (recall that $d$ is the number of partitions) random key vector.

    • Generating a random key vector: The data owner generates a random vector $V$ of size $d$ using a cryptographically secure pseudo random number generator (which generates a set of non-repeating random numbers within a fixed range). This vector $V$ and the key $K$ will later be used by the analyst to generate different views from the seed trace. For example, in Figure 5, a key vector $V$ with one entry for each of the two partitions is generated. Finally, the data owner chooses the total number of views $N$ to be generated later by the analyst, based on his/her requirements on privacy and computational overhead: a larger $N$ means more computation by both the data owner and the analyst, but also more privacy, since more real view candidates will be generated (we study this further through experiments later).

    • Generating a seed trace key vector and a seed trace: The data owner picks a random number $r \in [1, N]$ and then computes $V_0 = -r \cdot V$ as the key vector of the seed trace. Next, the data owner generates the seed trace as $L^*_0 = PP(L_0, V_0, K)$. This ensures that, after the analyst applies exactly $r$ iterations of $V$ on the seed trace, he/she gets back $L_0$ (while not being aware of this fact, since he/she does not know $r$). For example, in Figure 5, $r = 3$: we can easily verify that, if the analyst applies the indices in $V$ on the seed trace three times, the outcome will be exactly $L_0$ (the real view). More formally, by Equation 5 the $r$-th view is the real view: $PP(L^*_0, r \cdot V, K) = PP(L_0, -r \cdot V + r \cdot V, K) = L_0$.

  • Outsourcing: Finally, the data owner outsources the seed trace $L^*_0$ and the supplementary parameters (including $V$, $N$, and $K$) to the analyst.

4.1.2. Network Trace Analysis (Analyst)

The analyst generates the $N$ views requested by the data owner, which is summarized in Algorithm 2 in Appendix C and formalized below.

$$L^*_i = PP(L^*_0, i \cdot V, K), \quad i \in \{1, 2, \ldots, N\} \qquad (7)$$

The boundaries of partitions must be recognizable by the analyst to allow him/her to generate the views. To this end, we mark the records on the boundaries of each partition by changing the most significant digit of their timestamps. This marking is easy to verify and does not affect the analysis, since the analyst can revert the timestamps to their original format. Next, the analyst performs the requested analysis on all $N$ views and generates $N$ analysis reports, one per view.
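The interplay between the data owner and the analyst in Scheme I can be sketched end to end on a toy trace of bare addresses. This is a hedged illustration, not the paper's Algorithm 1 or 2: HMAC-SHA256 stands in for the block cipher, partitions are formed on the first 8 bits of the initially anonymized addresses, the key vector is a random permutation of $1, \ldots, d$, and all names (`pp_iter`, `pp_vec`, `owner_seed`, `analyst_views`) are hypothetical.

```python
import hashlib
import hmac
import secrets

def _f(key, prefix_bits):
    # Assumption: HMAC-SHA256 replaces the padded block cipher of CryptoPAn.
    return hmac.new(key, prefix_bits.encode(), hashlib.sha256).digest()[-1] & 1

def pp(addr, key, n=32):
    bits = format(addr, f"0{n}b")
    out = 0
    for i, bit in enumerate(bits):
        out = (out << 1) | (int(bit) ^ _f(key, bits[:i]))
    return out

def rpp(anon, key, n=32):
    orig = ""
    for bit in format(anon, f"0{n}b"):
        orig += str(int(bit) ^ _f(key, orig))
    return int(orig, 2)

def pp_iter(addr, index, key):
    step = pp if index >= 0 else rpp
    for _ in range(abs(index)):
        addr = step(addr, key)
    return addr

def pp_vec(partitions, vec, key):
    """Partition-based PP: partition i is anonymized with key index vec[i]."""
    return [[pp_iter(a, v, key) for a in part] for part, v in zip(partitions, vec)]

def owner_seed(trace, k0, k, n_views, group_bits=8):
    """Data owner: initial anonymization, IP-based partitioning, seed creation."""
    rng = secrets.SystemRandom()
    l0 = [pp(a, k0) for a in trace]
    groups = {}
    for a in l0:                                  # same leading bits -> same partition
        groups.setdefault(a >> (32 - group_bits), []).append(a)
    parts = list(groups.values())
    vec = rng.sample(range(1, len(parts) + 1), len(parts))   # non-repeating indices
    r = rng.randint(1, n_views)                   # secret index of the real view
    seed = pp_vec(parts, [-r * v for v in vec], k)           # V_0 = -r * V
    return seed, vec, r, parts

def analyst_views(seed, vec, k, n_views):
    """Analyst: view i is the seed anonymized with key vector i * V."""
    return [pp_vec(seed, [i * v for v in vec], k) for i in range(1, n_views + 1)]

K0, K, N = b"owner-only key", b"outsourced key", 4
trace = [0xC0A80105, 0xC0A801C8, 0x0A000001]
seed, V, r, L0_parts = owner_seed(trace, K0, K, N)
views = analyst_views(seed, V, K, N)
assert views[r - 1] == L0_parts    # only the owner knows that view r equals L_0
```

The analyst sees the seed, $V$, $K$, and $N$, but never $K_0$ or $r$; from his/her side each of the $N$ views is produced by the same procedure.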

4.1.3. Analysis Report Extraction (Data Owner)

The data owner is only interested in the analysis report that is related to the real view, i.e., the report generated from the $r$-th view. To minimize communication overhead, instead of requesting all the analysis reports of the generated views, the data owner can fetch only the one that is related to the real view. He/she can employ oblivious random access memory (ORAM) (oram, ) to do so without revealing which report was retrieved to the analyst (we will discuss alternatives in Section 6).

4.1.4. Security Analysis

We now analyze the level of indistinguishability provided by the scheme. Recall the indistinguishability property defined in Section 2; a multi-view mechanism is -indistinguishable if and only if

The statement inside the probability is the adversary’s decision on a view, declaring it as fake or a real view candidate, using his/her knowledge. Moreover, we note that generated views differ only in their IP values (fp-QI attributes are similar for all the views). Hence, the adversary’s decision can only be based on the published set of IPs in each view, by comparing shared prefixes among those IP addresses which he/she already knows (). Accordingly, in the following, we define a function to represent all the prefix relations for a set of IPs.

Lemma 4.1.

For two IP addresses and , function returns the number of bits in the prefix shared between and

where denotes the floor function.
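As an illustration of such a function, the shared-prefix length of two IPv4 addresses can be computed from the position of the highest set bit of their XOR. The sketch below is one standard way to do this and is not claimed to reproduce the lemma's exact formula; the function name `shared_prefix_len` is hypothetical.

```python
import ipaddress

def shared_prefix_len(a: str, b: str) -> int:
    """Number of leading bits two IPv4 addresses have in common (0..32)."""
    x = int(ipaddress.IPv4Address(a)) ^ int(ipaddress.IPv4Address(b))
    # bit_length() is floor(log2(x)) + 1 for x > 0 and 0 for x == 0, so the
    # result equals 31 - floor(log2(x)) whenever the two addresses differ.
    return 32 - x.bit_length()

print(shared_prefix_len("150.10.10.1", "150.10.20.1"))  # 19
```

Using integer `bit_length()` rather than a floating-point logarithm avoids rounding issues and handles identical addresses (32 shared bits) without a special case.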

Definition 4.1.

For a multiset of IP addresses , the Prefixes Indicator Set (PIS) is defined as follows.

(8)

Note that PIS remains unchanged when CryptoPAn is applied on , i.e., . In addition, since the multi-view solution keeps all the other attributes intact, the adversary can identify his/her pre-knowledge in each view and construct prefixes indicator sets out of them. Accordingly, we denote by the PIS constructed by the adversary in view .
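The invariance of the PIS under prefix-preserving anonymization can be checked on a toy example. The sketch below assumes that the PIS collects the shared-prefix lengths of all address pairs, and it uses a keyed HMAC-SHA256 bit-flipping map as a stand-in for CryptoPAn (which is AES-based); it shares only the prefix-preserving structure, not the actual cipher. All function names and the key are hypothetical.

```python
import hashlib
import hmac
import ipaddress
from collections import Counter
from itertools import combinations

def shared_prefix_len(a: int, b: int, bits: int = 32) -> int:
    return bits - (a ^ b).bit_length()

def toy_prefix_preserving(addr: int, key: bytes, bits: int = 32) -> int:
    """Flip bit i according to a keyed function of the i bits that precede it."""
    out = 0
    for i in range(bits):
        prefix = addr >> (bits - i)                      # the top i bits (0 when i == 0)
        digest = hmac.new(key, f"{i}:{prefix}".encode(), hashlib.sha256).digest()
        bit = (addr >> (bits - 1 - i)) & 1
        out = (out << 1) | (bit ^ (digest[0] & 1))
    return out

def pis(addrs):
    """Multiset of pairwise shared-prefix lengths (assumed reading of the PIS)."""
    return Counter(shared_prefix_len(a, b) for a, b in combinations(addrs, 2))

addrs = [int(ipaddress.IPv4Address(s))
         for s in ("10.1.2.3", "10.1.9.9", "10.200.0.1", "192.168.0.1")]
anonymized = [toy_prefix_preserving(a, b"demo-key") for a in addrs]
assert pis(addrs) == pis(anonymized)
```

Two addresses that agree on their first k bits receive identical flips on those bits and on bit k+1, so they still agree on exactly k bits afterwards; this is why the multiset is unchanged.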

Definition 4.2.

Let be the PIS for the adversary’s knowledge, and , be the PIS constructed by the adversary in view . A multi-view solution then generates -indistinguishable views against an adversary if and only if

(9)

Lemma 4.2.

The indistinguishability property, defined in equation 9 can be simplified to

Proof.

as view is the prefix preserving output. Moreover, we have . ∎

From the above, we only need to show (each generated view is a real view candidate).

Theorem 4.3.

Scheme I satisfies equation 4.2 with .

Proof.

Scheme I divides the trace into (number of prefix groups) partitions containing all the records that have similar prefixes. Hence, for any partition (), any two IP addresses and inside , and for any , we have because and are always assigned with equal key indices. Moreover, for any two IP addresses and in any two different partitions and any , we have since they do not share any prefixes. ∎

The above discussions show that scheme I produces perfectly indistinguishable views (). In fact, it is robust against the attack explained in Section 3.3 and thus does not require IP migration, because the partitioning algorithm already prevents addresses with similar prefixes from going into different partitions (the case in Figure 3). However, although adversaries cannot identify the real view, they may choose to live with this fact and instead attack each partition inside any (fake or real) view, using the same semantic attack as shown in Figure 1. Note that our multi-view approach is only designed to prevent attacks across different partitions, and each partition itself is essentially still the output of CryptoPAn and thus still inherits its weakness.

Fortunately, the multi-view approach gives us more flexibility in designing specific schemes to further mitigate such a weakness of CryptoPAn. We next present scheme II, which sacrifices some indistinguishability (in the sense of slightly fewer real view candidates) to achieve better protected partitions.

4.2. Scheme II: Multi-view Using Key Vectors

To address the limitation of our first scheme, we propose the next scheme, which is different in terms of the initial anonymization step, IP partitioning, and key vectors for view generation. The data owner’s and the analyst’s actions are summarized in Algorithms 3 and 4.

4.2.1. Initial Anonymization with Migration

First, to mitigate the attack on each partition, we must relax the requirement that all shared prefixes go into the same partition. However, as soon as we do so, the attack of identifying the real view through prefixes shared across partitions, as demonstrated in Section 3.3, might become possible. Therefore, we modify the first step of the multi-view approach (initial anonymization) to enforce the IP migration technique. Figure 6 demonstrates this. The original trace is first anonymized with , and then the anonymized trace goes through the migration process, which replaces the two different prefixes ( and ) with different iterations of , as discussed in Section 3.3.

Figure 6. The updated initial anonymization (Step in Figure 2) for enforcing migration

4.2.2. Distinct IP Partitioning and Key Vectors Generation

For the scheme, we employ a special case of IP partitioning where each partition includes exactly one distinct IP (i.e., the collection of all records containing the same IP). For example, the trace shown in Figure 5 includes three distinct IP addresses ,, and . Therefore, the trace is divided into three partitions. Next, the data owner will generate the seed view as in the first scheme, although the key will be generated completely differently, as detailed below.

Let , be the set of IP addresses after the migration step. Suppose consists of distinct IP addresses. We denote by the multiset of totally migration keys for those distinct IPs (in contrast, the number of migration keys in is equal to the number of distinct prefixes, as discussed in Section 3.3). Also, let be the set of random numbers generated between using a cryptographically secure pseudorandom number generator at iteration . The data owner will generate key vector as follows.

(11)

and

(12)

Example 4.1.

In Figure 7, the migration and random vectors are , , , and , respectively. The corresponding key vectors will be , and where only and are outsourced.

In this scheme, the analyst at each iteration generates a new set of IP addresses by randomly grouping all the distinct IP addresses into a set of prefix groups. In doing so, each new vector essentially cancels out the effect of the previous vector , and thus introduces a new set of IP addresses consisting of prefix groups. Thus, it is straightforward to verify that the generated view will be prefix-preserving (the addresses are migrated back to their groups using ).
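The cancellation between consecutive key vectors can be illustrated with integer key indices. The sketch below assumes one possible telescoping construction and is not claimed to match Equations 11 and 12: every fake view is given a random net index per distinct IP, the real view is given a net index of zero, and each outsourced vector is the difference between consecutive targets. The function `key_vectors`, the parameter `max_index`, and the sample migration keys are all hypothetical.

```python
import secrets

def key_vectors(migration, n_views, r, max_index=16):
    """Return (seed_key, steps) under an assumed telescoping construction."""
    d = len(migration)
    targets = []                       # net key index per distinct IP, views 0..n_views
    for i in range(n_views + 1):
        if i == r:
            targets.append([0] * d)    # real view: migration and randomness removed
        else:
            rnd = [secrets.randbelow(max_index) + 1 for _ in range(d)]
            targets.append([m + x for m, x in zip(migration, rnd)])
    seed_key = [t - m for t, m in zip(targets[0], migration)]
    steps = [[b - a for a, b in zip(targets[i - 1], targets[i])]
             for i in range(1, n_views + 1)]
    return seed_key, steps

migration = [2, 5, 1]                  # hypothetical migration key per distinct IP
seed_key, steps = key_vectors(migration, n_views=6, r=4)

net = [m + s for m, s in zip(migration, seed_key)]   # net index in the seed view
zero_views = []
for i, step in enumerate(steps, start=1):
    net = [n + s for n, s in zip(net, step)]          # analyst applies vector i
    if all(n == 0 for n in net):
        zero_views.append(i)
assert zero_views == [4]
```

Each step vector subtracts the previous random offsets and adds fresh ones, so the analyst never handles the migration keys or the random vectors in isolation.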

Example 4.2.

Figure 7 shows that, in each iteration, a different set (but with an equal number of elements) of prefix groups will be generated. For example, in the seed view, IP addresses and are mapped to prefix group .

Figure 7. An example of three views generation under scheme II

4.2.3. Indistinguishability Analysis

By placing each distinct IP in a partition, our second scheme is not vulnerable to semantic attacks on each partition, since such a partition contains no information about the prefix relationship among different addresses. However, compared with scheme I, as we show in the following, this scheme achieves a weaker level of indistinguishability (higher ). Specifically, to verify the indistinguishability of the scheme, we calculate for scheme II in the following. First, the number of all possible outcomes of grouping IP addresses into groups with predefined cardinalities is:

(13)

where denotes the cardinality of group . Also the number of all possible outcomes of grouping IP addresses into groups while still having is:

(14)

for some . This equation gives the number of outcomes in which a specific set of IP addresses () are distributed into different groups, hence keeping (i.e., the adversary cannot identify a collision). Note that the term accounts for all the combinations of choosing these groups in the numerator. Finally, we have

(15)

Thus, to ensure the -indistinguishability, the data owner needs to satisfy the expression in equation 15 which is a relationship between the number of distinct IP addresses, the number of groups, the cardinality of the groups in the trace and the adversary’s knowledge.
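The counting argument can be checked numerically on a small instance. The sketch below assumes that groupings are drawn uniformly at random among all assignments with the prescribed cardinalities, and computes the probability that a fixed set of known addresses lands in pairwise different groups, both in closed form and by brute-force enumeration. How this probability enters the indistinguishability parameter is governed by Equation 15, which is not restated here; the function names and the sample cardinalities are hypothetical.

```python
from fractions import Fraction
from itertools import combinations, permutations
from math import comb, factorial, prod

def n_groupings(cards):
    """Ways to place sum(cards) distinct addresses into groups of the given sizes."""
    out = factorial(sum(cards))
    for c in cards:
        out //= factorial(c)
    return out

def p_distinct(cards, alpha):
    """Probability that alpha fixed addresses fall into alpha different groups."""
    e_alpha = sum(prod(sub) for sub in combinations(cards, alpha))
    return Fraction(e_alpha, comb(sum(cards), alpha))

def p_distinct_bruteforce(cards, alpha):
    labels = [g for g, c in enumerate(cards) for _ in range(c)]
    outcomes = set(permutations(labels))       # one tuple per distinct grouping
    good = sum(1 for o in outcomes if len(set(o[:alpha])) == alpha)
    return Fraction(good, len(outcomes))

cards = [2, 1, 2]                    # hypothetical group cardinalities
print(n_groupings(cards))            # 30
print(p_distinct(cards, 2))          # 4/5
assert p_distinct(cards, 2) == p_distinct_bruteforce(cards, 2)
```

The closed form is the elementary symmetric polynomial of degree alpha in the cardinalities divided by a binomial coefficient, which is the quantity that Maclaurin's inequality bounds.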

Theorem 4.4.

The indistinguishability parameter of the generated views in scheme II is lower-bounded by

(16)

Proof.

Let be positive real numbers, and for define the averages as follows:

(17)

By Maclaurin’s inequality (mclauren, ), which is the following chain of inequalities:

(18)

where , we have

and since , we have

Figure 8(a) shows how the lower bound in Equation 16 changes with respect to different values of fraction and also the adversary’s knowledge. As expected, stronger adversaries have more power to weaken the scheme, which results in increasing or increasing the chance of identifying the real view. Moreover, as illustrated in the figure, when fraction grows, tends to converge to very small values. Hence, to decrease , the data owner may increase by grouping addresses based on a larger number of bits in their prefixes, e.g., a certain combination of three octets would be considered as a prefix instead of one or two. Another solution could be aggregating the original trace with some other traces for which the cardinalities of each prefix group are small. We study this effect in our experiments in Section 5, in particular in Figures 10 and 11.

Finally, Figure 8(b) shows how the variance of the cardinalities affects the indistinguishability for a set of fixed parameters , , . In fact, when the cardinalities of the prefix groups are close (small ), grows to meet the lower-bound in Theorem 4.4. Hence, from the data owner perspective, a trace with a lower variance of cardinalities and a bigger fraction has a better chance of misleading adversaries who want to identify the real view.
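The role of the variance can be seen on a small numerical example. The sketch below assumes uniformly random groupings and compares the probability that a fixed set of known addresses falls into different groups against the cap implied by Maclaurin's inequality, for several hypothetical cardinality profiles with the same totals. The cap is attained only when all groups have equal size. Function names and sample values are assumptions made for illustration.

```python
from fractions import Fraction
from itertools import combinations
from math import comb, prod

def p_distinct(cards, alpha):
    """Probability that alpha fixed addresses fall into alpha different groups."""
    e_alpha = sum(prod(sub) for sub in combinations(cards, alpha))
    return Fraction(e_alpha, comb(sum(cards), alpha))

def maclaurin_cap(n_addr, n_groups, alpha):
    """Cap on p_distinct obtained from Maclaurin's inequality, S_alpha <= S_1 ** alpha."""
    return comb(n_groups, alpha) * Fraction(n_addr, n_groups) ** alpha / comb(n_addr, alpha)

cap = maclaurin_cap(16, 4, 2)
profiles = ([4, 4, 4, 4], [5, 4, 4, 3], [7, 5, 3, 1], [13, 1, 1, 1])
values = [p_distinct(c, 2) for c in profiles]   # same 16 addresses, 4 groups, rising variance

assert values[0] == cap                         # equal cardinalities meet the cap
assert all(v < cap for v in values[1:])         # any imbalance falls strictly below it
assert values == sorted(values, reverse=True)   # more variance, lower probability here
```

With 16 addresses in 4 groups and two known addresses, the balanced profile reaches 4/5, while the most skewed profile drops to 7/20.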

Figure 8. (a) The trend of bound 16 for when adversary’s knowledge varies. (b) The trend of exact value of in equation 15 for , and when variance of cardinalities varies

4.2.4. Security of the Communication Protocol

We now analyze the security/privacy of our communication protocol in the semi-honest model under the theory of secure multiparty computation (SMC) (Yao86, ), (goldrich, ).

Lemma 4.5.

Scheme II only reveals the CryptoPAn key and the seed trace in the semi-honest model.

Proof.

Recall that our communication protocol only involves one-round communication between two parties (data owner to data analyst). We then only need to examine the data analyst’s view (messages received from the protocol), which includes (1) : number of views to be generated, (2) : the outsourced key, (3) : the seed trace, and (4) : the key vectors. As we discuss in section 4.2.3, the probability of identifying the real view by the adversary using all provided information (key and vectors) depends on the adversary knowledge and the trace itself which clearly implies that such “leakage” is trivial.

Indeed, each of and can be simulated by generating a single random number from a uniform random distribution (which proves that they are not leakage in the protocol). Specifically, the number of generated views is an integer bounded by , where is the maximum number of views the data owner can afford, and all the entries in are in , where is the number of groups. First, given integer , the probability that is simulated in the domain would be . Then, can be simulated in polynomial time (based on the knowledge the data analyst already has, i.e., his/her input and/or output of the protocol). Similarly, all the random entries in can also be simulated in polynomial time using a similar simulator (only changing the bound). Thus, the protocol only reveals the outsourced key and the seed trace in the semi-honest model. ∎

Note that outsourcing the and the outsourced key are trivial leakage. The outsourced key can be considered as a public key and leakage of which is considered as the output of the protocol was studied earlier. Finally, we study the setup leakage and show that the adversary cannot exploit outsourced parameters to increase (i.e., decrease the number of real view candidates) by building his/her own key vector.

Lemma 4.6.

(proof in Appendix A.2) For an adversary who wants to obtain the least number of real view candidates, if condition holds, the best approach is to follow scheme II (scheme II returns the least number of real view candidates).

4.3. Discussion

In this section, we discuss various aspects and limitations of our approach.

  1. Application to EDB: We believe the multi-view solution may be applicable to other related areas. For instance, processing on encrypted databases (EDB) has a rich literature including searchable symmetric encryption (SSE) (sse1, ),  (sse2, ), fully-homomorphic encryption (FHE) (FHE, ), oblivious RAMs (ORAM) (goldrich, ), functional encryption (boneh, ), and property preserving encryption (PPE) (Bellare, )(Boldyreva, ). All these approaches achieve different trade-offs between protection (security), utility (query expressiveness), and computational efficiency (naveed, ). Extending and applying the multi-view approach in those areas could lead to interesting future directions.

  2. Comparing the Two Schemes: As discussed for the two schemes, scheme I achieves better indistinguishability but less protected partitions in each view. Figure 14 compares the relative effectiveness of the two schemes on a real trace under adversary knowledge. In particular, Figures 14(a) and (b) demonstrate that, despite the lower number of real view candidates in scheme II compared with scheme I ( vs out of ), the end result of the leakage in scheme II is much more appealing ( vs ). Therefore, our experimental section has mainly focused on scheme II.

  3. Choosing the Number of Views : The number of views is an important parameter of our approach that determines both the privacy and computational overhead. The data owner could choose this value based on the level of trust in the analysts and the amount of computational overhead that can be afforded. Specifically, as implied by Equation 4.2 and demonstrated by our experimental results in Section 5, the number of real view candidates is approximately . The data owner should first estimate the adversary’s background knowledge (number of prefixes known to the adversary) and then calculate either using Equation 15 or (approximately) using Equation 16. As demonstrated in Figures 8(a) and 9(b), a bigger results in weaker indistinguishability and demands a larger number of views to be generated. An alternative solution is to increase the number of prefix groups () by sacrificing some prefix relations among IPs, e.g., grouping them based on first octets.

  4. Utility: The main advantage of the multi-view approach is that it can preserve the data utility while protecting privacy. In particular, we have shown that the data owner can receive an analysis report based on the real view () which is prefix-preserving over the entire trace. This is more accurate than the obfuscated (through bucketization and suppression) or perturbed (through adding noise and aggregation) approaches. Specifically, in case of a security breach, the data owner can easily compute (migration output) to find the mapped IP addresses corresponding to each original address. Then the data owner applies necessary security policies to the IP addresses that are reported violating some policies in . A limitation of our work is that it only preserves the prefixes of IPs, and a potential future direction is to apply our approach to other property-preserving encryption methods such that other properties may be preserved similarly.

  5. Communicational/Computational Cost: One of our contributions in this paper is to minimize the communication overhead by only outsourcing one (seed) view and some supplementary parameters. This is especially critical for large scale network data like network traces from the major ISPs. On the other hand, one of the key challenges to the multi-view approach is that it requires times computation for both generating the views and analysis.

    Our experiments in Figure 11 show that generating views for a trace of packets takes approximately minutes, and we describe analytic complexity results in Tables 3 and 4. We note that the practicality of times computation will mainly depend on the type of analysis, and may become impractical for some analyses under large . How to enable analysts to more efficiently conduct analysis tasks based on multiple views through techniques like caching is an interesting future direction. Another direction is to devise more accurate measures for the data owner to more precisely determine the number of views required to reach a certain level of privacy requirement.

5. Experiments

This section evaluates our multi-view scheme through experiments with real data.

5.1. Setup

To validate our multi-view anonymization approach, we use a set of network traces collected by a real ISP. We focus on attributes , , and in our experiments, and the meta-data are summarized in the table in Figure 9(a).

Figure 9. (a) Metadata of the collected traces (b) for different number of prefix groups and different adversary knowledges

In order to measure the security of the proposed approach, we implement the frequency analysis attack (naveed, )(brekene1, ). This attack can compromise individual addresses protected by existing prefix-preserving anonymization in multi-linear time (brekene1, ). We stress that in the setting of EDBs (encrypted database systems), an attack is successful if it recovers even partial information about a single cell of the DB (naveed, ). Accordingly, we define the information leakage metric to evaluate the effectiveness of our solution against the adversary’s semantic attacks. Several measures have been proposed in the literature (PP, ; Ribeiro, ) to evaluate the impact of semantic attacks. Motivated by (PP, ), we model the information leakage (number of matches) as the number of records/packets whose original IP addresses are known by the adversary either fully or partially. More formally,

Information leakage metric (PP, ): We measure defined as the total number of addresses that have at least most significant bits known, where .
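A minimal sketch of how such a metric could be computed is given below. It assumes the adversary's knowledge about each address is represented as a guessed address, and that the number of known most significant bits is the length of the prefix on which the guess agrees with the original. This representation, the function names, and the default threshold of eight bits (the value used for the compromised-packet percentage in our experiments) are illustrative assumptions, not the exact procedure of the frequency analysis attack.

```python
import ipaddress

def known_msbs(original: str, guess: str) -> int:
    """Number of leading bits on which the adversary's guess matches the original."""
    x = int(ipaddress.IPv4Address(original)) ^ int(ipaddress.IPv4Address(guess))
    return 32 - x.bit_length()

def leakage(pairs, min_bits=8):
    """Count addresses with at least min_bits most significant bits known."""
    return sum(1 for original, guess in pairs
               if known_msbs(original, guess) >= min_bits)

pairs = [
    ("10.1.2.3", "10.9.9.9"),          # 12 leading bits match
    ("10.1.2.3", "11.1.2.3"),          # 7 leading bits match
    ("192.168.1.1", "192.168.1.1"),    # fully known
]
print(leakage(pairs))                  # 2
print(leakage(pairs, min_bits=16))     # 1
```

Dividing the count by the number of packets in the trace gives a percentage of compromised packets.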

To model adversarial knowledge, we define a set of prefixes to be known by the adversary ranging from up to of all the prefixes in the trace. This knowledge is stored in a two dimensional vector that includes different addresses and their key indexes.

Figure 10. Percentage of the compromised packets (out of 1M) and number of real view candidates when number of views and the adversary knowledge vary and for the three different cases (1) Figures (a),(d) (2) Figures (b),(e) (3) Figures (c),(f) where legends marked by CP denote the CryptoPAn result whereas those marked by MV denote the multi-view results

Figure 11. Computation time obtained by our anonymization approach for different prefix grouping cases

Next, using our multi-view scheme, we generate all the views. However, before we apply the frequency analysis attack, we simulate how an adversary may eliminate some fake views from further consideration as follows. For each view, we check if two addresses from the adversary’s knowledge set with different prefixes now share prefixes in that view. If we find such a match in the key indices, the corresponding view will be discarded from the set of the real view candidates and will not be considered in our experiments since the adversary would know it is a fake view.
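The elimination step can be sketched as a pairwise check. The snippet assumes the adversary can locate each known original address in a view (for example, through the unchanged attributes) and that a prefix group is identified by a fixed number of leading bits, eight by default. The function names, the dictionary representation of a view, and the sample addresses are hypothetical.

```python
import ipaddress
from itertools import combinations

def prefix(addr: str, bits: int) -> int:
    """Leading `bits` bits of an IPv4 address, as an integer."""
    return int(ipaddress.IPv4Address(addr)) >> (32 - bits)

def is_real_view_candidate(known, view_map, bits=8):
    """known: original addresses the adversary knows.
    view_map: original address -> address observed in the view."""
    for a, b in combinations(known, 2):
        differ_originally = prefix(a, bits) != prefix(b, bits)
        collide_in_view = prefix(view_map[a], bits) == prefix(view_map[b], bits)
        if differ_originally and collide_in_view:
            return False          # the adversary can tell this view is fake
    return True

known = ["10.0.0.1", "20.0.0.1"]
view_a = {"10.0.0.1": "77.1.1.1", "20.0.0.1": "88.2.2.2"}
view_b = {"10.0.0.1": "77.1.1.1", "20.0.0.1": "77.2.2.2"}
print(is_real_view_candidate(known, view_a))   # True
print(is_real_view_candidate(known, view_b))   # False
```

Only views that pass this check are kept as real view candidates before the frequency analysis attack is applied.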

We validate the effectiveness of our scheme by showing the number of real view candidates and the percentage of the packets in the trace that are compromised (i.e., the percentage of IP packets whose addresses have at least eight most significant bits known). Each experiment is repeated more than times and the end results are the average results of the frequency analysis algorithm applied to each of the real view candidates.

Moreover, the evaluation of utility preservation and the scalability of using ORAM in our scheme are discussed in Appendix B.2 and B.3, respectively.

We conduct all experiments on a machine running Windows with an Intel(R) Core(TM) i7-6700 3.40 GHz CPU, 4 GB Memory, and 500 GB storage.

5.2. Results

5.2.1. Information Leakage Analysis

First, the numerical results of the indistinguishability parameter under different levels of adversary knowledge are depicted in Figure 9(b). Those results correspond to three different cases, i.e., when addresses are grouped based on (1) only the first octet ( groups), (2) the first and the second octets ( groups), and (3) the first three octets ( groups). As we can see from the results, decreases (meaning more privacy) as the number of prefix groups increases, and it increases as the amount of adversarial knowledge increases.

We next validate those numerical results through experiments in Figure 10. Specifically, we first analyze the behavior of our second multi-view scheme (introduced in Section 4.2) before comparing the two schemes in Appendix B. Figure 10 presents different facets of information leakage when our approach is applied in various grouping cases. The results in Figure 10 are for adversaries who have knowledge of no more than most of the prefix groups (Figure 13 in Appendix B.1 presents the more extreme cases for the same experiments, i.e., up to knowledge). The analysis of these figures is detailed in the following.

Effect of the number of prefix groups: As discussed earlier, three different IP grouping cases are studied. Figures 10 (a) and (d) show, respectively, the results of packet leakage and number of real view candidates when . As the numerical results in Figure 8 anticipate, because the fraction is relatively low, the indistinguishability of generated views diminishes, especially for stronger adversary knowledge. Consequently, the adversary discards more views and the rate of leakage increases, compared with Figures 10 (b), (e) and Figures 10 (c), (f), for which the fractions are and , respectively. In particular, for the worst case of adversary knowledge and when the number of views is less than , we can verify that the number of real view candidates for case (1) remains , resulting in packet leakage comparable to that of CryptoPAn.

Effect of the number of views: As illustrated in the figure, increasing the number of views always improves both the number of real view candidates and the packet leakage. All the figures for the real view candidates evaluation show a near-linear improvement, where the slope of this improvement inversely depends on the adversary’s knowledge. For the packet leakage, the improvement converges to a small packet leakage rate under a large number of views. This is reasonable, as each packet leakage result is an average of leakages in all the real view candidates. However, since each of the fake views leaks a certain amount of information, increasing the number of views beyond a certain value will no longer affect the end result. In other words, the packet leakage converges to the average of leakages in the (fake) real view candidates. Finally, the results show that our proposed scheme can more efficiently improve privacy by (1) increasing the fraction (number of prefix groups/number of distinct addresses) or (2) increasing the number of views. The first option may affect utility (since inter-group prefix relations will be removed), while the second option is more aligned with our objective of trading off privacy with computation.

5.2.2. Computational Overhead Evaluation

We evaluate the computational overhead incurred by our approach. Figure 11 shows the time required by our scheme in each grouping case, when the number of views varies for a trace including one million packets. We observe that, when the number of views increases, the computational overhead increases near linearly. However, each case shows a different slope depending on the number of groups. This is reasonable as our second scheme generates key vectors with a larger number of elements for more groups, which leads to applying CryptoPAn for more iterations (see complexity analysis in Appendix D). Finally, linking this figure to the information leakage results shown in Figure 10 demonstrates the trade-off between privacy and computational overhead.

6. Related Work

In the context of anonymization of network traces, as surveyed in (Mivule, ), many solutions have been proposed (flaim, ; ref6, ; brekene1, ; paxon1, ; riboni, ). Generally, these may be classified into different categories, such as enumeration (Farah, ), partitioning (slagell, ), and prefix-preserving (Xu, ; Gattani, ). These methods include removing rows or attributes, suppression, and generalization of rows or attributes (challengeof, ). Some of the solutions (Ribeiro, ; paxon1, ) are designed to address specific attacks and are generally based on the permutation of some fields in the network trace to blur the adversary’s knowledge. Later studies either prove theoretically (brekene2, ) or validate empirically (burkhart, ) that those works may be defeated by semantic attacks.

As our proposed anonymization solution falls into the category of prefix-preserving solutions, which aim to improve the utility, we review in more detail some of the proposed solutions in this category. The first effort to find a prefix-preserving anonymization was made by Greg Minshall (greg, ), who developed TCPdpriv, a table-based approach that generates a function randomly. Fan et al. (PP, ) then developed CryptoPAn with a completely cryptographic approach. Several publications (brekene1, ), (Ribeiro, ; paxon1, ) have since raised the vulnerability of this scheme against semantic attacks, which motivated query-based (mcsherry, ) and bucketization-based (riboni, ) solutions. In the following, we review those works in more detail.

Among the works that address such semantic attacks, Riboni et al. (riboni, ) propose a (k,j)-obfuscation methodology applied to network traces. In this method, a flow is considered obfuscated if it cannot be linked, with greater assurance, to its (source and destination) IPs. First, network flows are divided into either confidential IP attributes or other fields that can be used to attack. Then, groups of flows having similar fingerprints are created and bucketed, based on their fingerprints, into groups of size . However, utility remains a challenge in this solution, as the network flows are heavily sanitized, i.e., each flow is blurred inside a bucket of flows having similar fingerprints. An alternative to the aforementioned solutions, called mediated trace analysis (mediated1, ; mediated2, ), consists of performing the data analysis on the data-owner side and outsourcing analysis reports to researchers requesting the analysis. In this case, data can only be analyzed where it is originally stored, which may not always be practical, and the outsourced report still needs to be sanitized prior to its outsourcing (mcsherry, ). In contrast to those existing solutions, our approach improves both the privacy and utility at the cost of a higher computational overhead. Table 2 summarizes the most important network trace anonymization schemes over the past twenty years (Mivule, ) and their main characteristics.

| Authors | Privacy against semantic attacks | Utility |
| Slagell et al. (slagell, ) | Violated | Prefix preserving |
| McSherry et al. (mcsherry, ) | Preserved | Noisy aggregated results |
| Pang et al. (paxon1, ) | Violated | Partial prefix preserving |
| Riboni et al. (riboni, ) | Preserved | Heavily sanitized |
| Ribeiro et al. (Ribeiro, ) | Violated | Partial prefix preserving |
| Mogul et al. (mediated1, ) | Violated | Aggregated results |

Table 2. Summary of proposed network traces anonymization in literature

The last step of our solution requires the data owner to privately retrieve an audit report of the real view, which can be based on existing private information retrieval (PIR) techniques. A PIR approach usually aims to conceal the objective of all queries independent of all previous queries (orpir, ; pir2, ). Since the sequence of accesses is not hidden by PIR while each individual access is hidden, the amortized cost is equal to the worst-case cost (orpir, ). Since the server computes over the entire database for each individual query, PIR is often impractical for large databases. On the other hand, ORAM (oram1, ) has verifiably low amortized communication complexity and does not require much computation on the server but rather periodically requires the client to download and reshuffle the data (orpir, ). For our multi-view scheme, we choose ORAM as it is relatively more efficient and secure, and also the client (data owner in our case) has sufficient computational power and storage needed to locally store a small number of blocks (audit reports in our case) in a local stash.

7. Conclusion

In this paper, we have proposed a multi-view anonymization approach mitigating the semantic attacks on CryptoPAn while preserving the utility of the trace. This novel approach shifted the trade-off from between privacy and utility to between privacy and computational cost; the latter has seen significant decrease with the advance of technology, making our approach a preferable solution for applications that demand both privacy and utility. Our experimental results showed that our proposed approach significantly reduced the information leakage compared to CryptoPAn. For example, for the extreme case of adversary pre-knowledge of 100%, the information leakage of CryptoPAn was 100% while under our approach it was still less than 10%. Besides addressing various limitations discussed in Section 4.3, our future work will adapt the idea to improve existing privacy-preserving solutions in other areas, e.g., we will extend our work to the multi-party problem where several data owners are willing to share their traces to mitigate coordinated network reconnaissance by means of distributed (or inter-domain) audit (sepia, ).

8. Acknowledgments

The authors thank the anonymous reviewers for their valuable comments. We appreciate Momen Oqaily’s support in the implementation. This work is partially supported by the Natural Sciences and Engineering Research Council of Canada and Ericsson Canada under CRD Grant N01823. The research of Yuan Hong is partially supported by the National Science Foundation under Grant No. CNS-1745894 and the WISER ISFG grant.

References

  • (1) Ding, Wen, William Yurcik, and Xiaoxin Yin. ”Outsourcing internet security: Economic analysis of incentives for managed security service providers.” In International Workshop on Internet and Network Economics, pp. 947-958. Springer Berlin Heidelberg, 2005.
  • (2) Riboni, Daniele, Antonio Villani, Domenico Vitali, Claudio Bettini, and Luigi V. Mancini. ”Obfuscation of sensitive data in network flows.” In INFOCOM, 2012 Proceedings IEEE, pp. 2372-2380. IEEE, 2012.
  • (3) J. Fan, J. Xu, M. Ammar, and S. Moon. Prefix-preserving IP Address Anonymization: Measurement-based Security Evaluation and a New Cryptography-based Scheme. Computer Networks, 46(2):263-272, October 2004.
  • (4) Brekne, T., Årnes, A., & Øslebø, A. (2005, May). Anonymization of ip traffic monitoring data: Attacks on two prefix-preserving anonymization schemes and some proposed remedies. In International Workshop on Privacy Enhancing Technologies (pp. 179-196). Springer Berlin Heidelberg.
  • (5) Brekne, Tønnes, and André Årnes. ”Circumventing IP-address pseudonymization.” In Communications and Computer Networks, pp. 43-48. 2005.
  • (6) T.-F. Yen, X. Huang, F. Monrose, and M. K. Reiter, ”Browser fingerprinting from coarse traffic summaries: Techniques and implications,” in Proc. of Detection of Intrusions and Malware and Vulnerability Assessment, vol. 5587. Springer, 2009, pp. 157-175.
  • (7) M. Burkhart, D. Brauckhoff, M. May, and E. Boschi, ”The risk-utility tradeoff for IP address truncation.” In Proceedings of the 1st ACM workshop on Network data anonymization, 2008, pp. 23-30.
  • (8) Pang, R., Allman, M., Paxson, V. and Lee, J., 2006. The devil and packet trace anonymization. ACM SIGCOMM Computer Communication Review, 36(1), pp.29-38.
  • (9) Coull, Scott E., Michael P. Collins, Charles V. Wright, Fabian Monrose, and Michael K. Reiter. ”On Web Browsing Privacy in Anonymized NetFlows.” In USENIX Security. 2007.
  • (10) Wong, Wai Kit, David W. Cheung, Edward Hung, Ben Kao, and Nikos Mamoulis. ”Security in outsourcing of association rule mining.” In Proceedings of the 33rd international conference on Very large data bases, pp. 111-122. VLDB Endowment, 2007.
  • (11) Tai, C. H., Yu, P. S., and Chen, M. S. (2010, July). k-Support anonymity based on pseudo taxonomy for outsourcing of frequent itemset mining. In Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 473-482). ACM.
  • (12) Slagell, Adam J., Kiran Lakkaraju, and Katherine Luo. ”FLAIM: A Multi-level Anonymization Framework for Computer and Network Logs.” In LISA, vol. 6, pp. 3-8. 2006.
  • (13) Foukarakis, Michael, Demetres Antoniades, and Michalis Polychronakis. ”Deep packet anonymization.” In Proceedings of the Second European Workshop on System Security, pp. 16-21. ACM, 2009.
  • (14) M. Burkhart, D. Schatzmann, B. Trammell, E. Boschi, and B. Plattner, The role of network trace anonymization under attack, Computer Communication Review, vol. 40, no. 1, pp. 5-11, 2010.
  • (15) Mogul, Jeffrey C., and Martin Arlitt. ”Sc2d: an alternative to trace anonymization.” In Proceedings of the 2006 SIGCOMM workshop on Mining network data, pp. 323-328. ACM, 2006.
  • (16) Mittal, Prateek, Vern Paxson, Robin Sommer, and Mark Winterrowd. ”Securing Mediated Trace Access Using Black-box Permutation Analysis.” In HotNets. 2009.
  • (17) McSherry, Frank, and Ratul Mahajan. ”Differentially-private network trace analysis.” In ACM SIGCOMM Computer Communication Review, vol. 40, no. 4, pp. 123-134. ACM, 2010.
  • (18) K. Mivule and B. Anderson, ”A study of usability-aware network trace anonymization,” 2015 Science and Information Conference (SAI), London, 2015, pp. 1293-1304.
  • (19) T. Farah, and L. Trajkovic, ”Anonym: A tool for anonymization of the Internet traffic.” In IEEE 2013 International Conference on Cybernetics (CYBCONF), 2013, pp. 261-266.
  • (20) Mayberry, Travis, Erik-Oliver Blass, and Agnes Hui Chan. ”Efficient Private File Retrieval by Combining ORAM and PIR.” In NDSS. 2014.
  • (21) A.J. Slagell, K. Lakkaraju, and K. Luo, ”FLAIM: A Multi-level Anonymization Framework for Computer and Network Logs.” In LISA, vol. 6, 2006, pp. 3-8.
  • (22) J. Xu, J. Fan, M.H. Ammar, and Sue B. Moon, ”Prefix-preserving ip address anonymization: Measurement-based security evaluation and a new cryptography-based scheme.”, In 10th IEEE International Conference on Network Protocols, 2002, pp. 280-289.
  • (23) Wang, Xiao Shaun, Yan Huang, TH Hubert Chan, Abhi Shelat, and Elaine Shi. ”SCORAM: oblivious RAM for secure computation.” In Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security, pp. 191-202. ACM, 2014.
  • (24) W. Yurcik, C. Woolam, G. Hellings, L. Khan, B. Thuraisingham, ”Measuring anonymization privacy/analysis tradeoffs inherent to sharing network data”, IEEE Network Operations and Management Symposium, 2008, pp.991-994.
  • (25) Gattani, Shantanu, and Thomas E. Daniels. ”Reference models for network data anonymization.” In Proceedings of the 1st ACM workshop on Network data anonymization, pp. 41-48. ACM, 2008.
  • (26) Zhang, Jianqing, Nikita Borisov, and William Yurcik. ”Outsourcing security analysis with anonymized logs.” In Securecomm and Workshops, 2006, pp. 1-9. IEEE, 2006.
  • (27) Saroiu, Stefan, P. Krishna Gummadi, and Steven D. Gribble. ”Measurement study of peer-to-peer file sharing systems.” In Electronic Imaging 2002, pp. 156-170. International Society for Optics and Photonics, 2001.
  • (28) Chor, Benny, et al. ”Private information retrieval.” Foundations of Computer Science, 1995. Proceedings., 36th Annual Symposium on. IEEE, 1995.
  • (29) Biler, Piotr, and Alfred Witkowski. ”Problems in mathematical analysis.” (1990).
  • (30) Zhang, Q., & Li, X. (2006, January). An IP address anonymization scheme with multiple access levels. In International Conference on Information Networking (pp. 793-802). Springer Berlin Heidelberg.
  • (31) B. Ribeiro, W. Chen, G. Miklau, and D. Towsley, Analyzing privacy in enterprise packet trace anonymization, in Proc. NDSS, San Diego, CA, Feb. 2008, pp. 87-100
  • (32) Coull, Scott E., Charles V. Wright, Fabian Monrose, Michael P. Collins, and Michael K. Reiter. ”Playing Devil’s Advocate: Inferring Sensitive Information from Anonymized Network Traces.” In NDSS, vol. 7, pp. 35-47. 2007.
  • (33) Yurcik, William, and Yifan Li. ”Internet security visualization case study: Instrumenting a network for NetFlow security visualization tools.” In 21st Annual Computer Security Applications Conference (ACSAC). 2005.
  • (34) Coull, S. E., Monrose, F., Reiter, M. K., & Bailey, M. (2009, March). The challenges of effectively anonymizing network data. In Conference For Homeland Security, 2009. CATCH’09. Cybersecurity Applications & Technology (pp. 230-236). IEEE.
  • (35) Del Piccolo, Valentin, et al. ”A Survey of network isolation solutions for multi-tenant data centers.” IEEE Communications Surveys & Tutorials 18.4 (2016): 2787-2821.
  • (36) C. Dwork, ”Differential privacy,” in Proceedings of the 33rd International Colloquium on Automata, Languages and Programming (ICALP), ser. Lecture Notes in Computer Science, vol. 4052. Springer-Verlag, 2006.
  • (37) Dwork, Cynthia. ”Differential privacy: A survey of results.” In International Conference on Theory and Applications of Models of Computation, pp. 1-19. Springer, Berlin, Heidelberg, 2008.
  • (38) Kushilevitz, Eyal, and Rafail Ostrovsky. ”Replication is not needed: Single database, computationally-private information retrieval.” In Foundations of Computer Science, 1997. Proceedings., 38th Annual Symposium on, pp. 364-373. IEEE, 1997.
  • (39) Oded Goldreich and Rafail Ostrovsky. 1996. Software protection and simulation on oblivious RAMs. J. ACM 43, 3 (May 1996), 431-473. DOI=http://dx.doi.org/10.1145/233551.233553
  • (40) C. Dwork, F. McSherry, K. Nissim, and A. Smith, “Calibrating noise to sensitivity in private data analysis,” in Proceedings of the Third Theory of Cryptography Conference, 2006, pp. 265–284.
  • (41) Le Ny, Jerome, and Meisam Mohammady. ”Differentially private MIMO filtering for event streams.” IEEE Transactions on Automatic Control 63, no. 1 (2018): 145-157.
  • (42) Le Ny, Jerome, and Meisam Mohammady. ”Differentially private MIMO filtering for event streams and spatio-temporal monitoring.” In Decision and Control (CDC), 2014 IEEE 53rd Annual Conference on, pp. 2148-2153. IEEE, 2014.
  • (43) Yao, Andrew Chi-Chih. ”How to generate and exchange secrets.” In Foundations of Computer Science, 1986., 27th Annual Symposium on, pp. 162-167. IEEE, 1986.
  • (44) Goldreich, Oded. ”Secure multi-party computation.” Manuscript. Preliminary version (1998): 86-97.
  • (45) Cormen, Thomas H., et al. ”Data structures for disjoint sets.” Introduction to Algorithms (2001): 498-524
  • (46) Sedgewick, Robert. ”Implementing quicksort programs.” Communications of the ACM 21.10 (1978): 847-857.
  • (47) Slagell, Adam, Jun Wang, and William Yurcik. ”Network log anonymization: Application of crypto-pan to cisco netflows.” Proceedings of the Workshop on Secure Knowledge Management 2004. 2004.
  • (48) Minshall, G. TCPdpriv command manual. http://ita.ee.lbl.gov/html/contrib/tcpdpriv.0.txt. 1996.
  • (49) Paul, Ruma R., Victor C. Valgenti, and Min Sik Kim. ”Real-time Netshuffle: Graph distortion for on-line anonymization.” In Network Protocols (ICNP), 2011 19th IEEE International Conference on, pp. 133-134. IEEE, 2011.
  • (50) Aggarwal, Gagan, Tomás Feder, Krishnaram Kenthapadi, Samir Khuller, Rina Panigrahy, Dilys Thomas, and An Zhu. ”Achieving anonymity via clustering.” In Proceedings of the twenty-fifth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, pp. 153-162. ACM, 2006.
  • (51) M. Burkhart, M. Strasser, D. Many, and X. Dimitropoulos. Sepia: privacy-preserving aggregation of multi-domain network events and statistics. In USENIX Security Symposium, pages 223–240, 2010.
  • (52) Boldyreva, Alexandra, Nathan Chenette, Younho Lee, and Adam O’Neill. ”Order-preserving symmetric encryption.” In Annual International Conference on the Theory and Applications of Cryptographic Techniques, pp. 224-241. Springer, Berlin, Heidelberg, 2009.
  • (53) Curtmola, Reza, Juan Garay, Seny Kamara, and Rafail Ostrovsky. ”Searchable symmetric encryption: improved definitions and efficient constructions.” Journal of Computer Security 19, no. 5 (2011): 895-934.
  • (54) Song, Dawn Xiaoding, David Wagner, and Adrian Perrig. ”Practical techniques for searches on encrypted data.” In Security and Privacy, 2000. S&P 2000. Proceedings. 2000 IEEE Symposium on, pp. 44-55. IEEE, 2000.
  • (55) Craig Gentry. 2009. Fully homomorphic encryption using ideal lattices. In Proceedings of the forty-first annual ACM symposium on Theory of computing (STOC ’09). ACM, New York, NY, USA, 169-178.
  • (56) Boneh, Dan, Amit Sahai, and Brent Waters. ”Functional encryption: Definitions and challenges.” In Theory of Cryptography Conference, pp. 253-273. Springer, Berlin, Heidelberg, 2011.
  • (57) Bellare, Mihir, Alexandra Boldyreva, and Adam O’Neill. ”Deterministic and efficiently searchable encryption.” In Annual International Cryptology Conference, pp. 535-552. Springer, Berlin, Heidelberg, 2007.
  • (58) Boldyreva, Alexandra, Nathan Chenette, Younho Lee, and Adam O’Neill. ”Order-preserving symmetric encryption.” In Annual International Conference on the Theory and Applications of Cryptographic Techniques, pp. 224-241. Springer, Berlin, Heidelberg, 2009.
  • (59) Islam, Mohammad Saiful, Mehmet Kuzu, and Murat Kantarcioglu. ”Access Pattern disclosure on Searchable Encryption: Ramification, Attack and Mitigation.” In NDSS, vol. 20, p. 12. 2012.
  • (60) Naveed, Muhammad, Seny Kamara, and Charles V. Wright. ”Inference attacks on property-preserving encrypted databases.” In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, pp. 644-655. ACM, 2015.
  • (61) Hautakorpi, Jani, and Gonzalo Camarillo Gonzalez. ”IP Address Distribution in Middleboxes.” U.S. Patent Application No. 12/518,452.
  • (62) Chang, Zhao, Dong Xie, and Feifei Li. ”Oblivious ram: a dissection and experimental evaluation.” Proceedings of the VLDB Endowment 9, no. 12 (2016): 1113-1124.
  • (63) Caswell, Brian, and Jay Beale. Snort 2.1 intrusion detection. Elsevier, 2004.
  • (64) Stefanov, E., Van Dijk, M., Shi, E., Fletcher, C., Ren, L., Yu, X. and Devadas, S., 2013, November. Path ORAM: an extremely simple oblivious RAM protocol. In Proceedings of the 2013 ACM SIGSAC conference on Computer & communications security (pp. 299-310). ACM.

Appendix A Proofs

a.1. Proof of Theorem 3.1

Proof.

3.1 We must show that . To do so, we use induction:

where is empty string and is a constant bit in that depends only on padding function and cryptographic key . Assume , thus:

a.2. Proof of Lemma 4.6

Proof.

Suppose there exists an algorithm that returns the smallest set of key vectors needed to reverse the seed trace and obtain the minimum number of real view candidates, given our setup. Also denote by the set of those key vectors if the adversary follows scheme II. We now show that if holds then we have . First, note that the key indices of different distinct addresses in is . Therefore, the adversary has to guess . However, note that elements of are in and there will be different combinations for . Thus, to minimize this number, the adversary has to use the outsourced parameters, which means we have . However, we showed earlier that all these inputs are trivial leakage. Therefore, if holds, we have . ∎

Appendix B Experiments

In this section, we experimentally measure the security of the proposed approach against very strong adversaries. We also evaluate the utility of the approach using two real network analyses. Finally, we justify the choice of ORAM in our setup by drawing on a comprehensive study of ORAM scalability in the literature.

b.1. Privacy Evaluation against Very Strong Adversaries

Figure 13 shows the leakage and the real view candidates results for stronger adversaries (). Note that in this figure we only show results for cases (2) and (3), because the results in case (1) do not show a significant improvement over the CryptoPAn results: the multi-view approach with fraction of cannot defeat the adversary’s knowledge ().

b.2. Utility Evaluation Using Real-life Network Analytics

Figure 12 shows the results of two different network analytics over the original trace (1M records), the real view, and one of the fake views generated by our multi-view solution. In the first experiment, we compute the IP distribution (61) of the trace, i.e., the number of distinct addresses within each subnet (IP group). We compare the distribution of distinct IP addresses in these three traces under both (i) the temporal distribution, where subnets are indexed by their time stamps, and (ii) the cardinality-based distribution, where subnets are indexed by their cardinalities. The results (for both distributions) generated from the original trace and the real view are identical (see Figure 12(a)). This is expected because the real view is a prefix-preserving mapping of IPs that keeps the fp-QI attributes intact, thereby preserving both distributions. Moreover, the cardinality-based distribution generated from the fake view is identical to that of the original trace and the real view (see Figure 12(c)). Note that the latter results from the indistinguishability of our multi-view solution.
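The invariance of the per-subnet distribution under a prefix-preserving mapping can be illustrated with a small sketch. The code below is not CryptoPAn: it assumes a toy prefix-preserving function built from HMAC-SHA256, a hypothetical key, a handful of made-up addresses, and /24 subnets as the IP groups. It only shows why the counts of distinct addresses per subnet survive such a mapping, both in order of first appearance and when sorted by cardinality.

```python
import hashlib
import hmac
import ipaddress


def prf_bit(key: bytes, prefix: str) -> int:
    """Toy PRF (assumption): one bit derived from HMAC-SHA256 of the bit prefix."""
    return hmac.new(key, prefix.encode(), hashlib.sha256).digest()[0] & 1


def toy_prefix_preserving(key: bytes, ip: str) -> str:
    """Flip bit i of the address according to a PRF of its first i bits."""
    bits = format(int(ipaddress.IPv4Address(ip)), "032b")
    out = "".join(str(int(b) ^ prf_bit(key, bits[:i])) for i, b in enumerate(bits))
    return str(ipaddress.IPv4Address(int(out, 2)))


def distinct_per_subnet(ips, prefix_len=24):
    """Distinct addresses per subnet, with subnets in order of first appearance."""
    groups = {}
    for ip in ips:
        subnet = int(ipaddress.IPv4Address(ip)) >> (32 - prefix_len)
        groups.setdefault(subnet, set()).add(ip)
    return [len(members) for members in groups.values()]


# Hypothetical trace: source addresses in the order they appear.
original = ["10.0.0.1", "10.0.0.2", "192.168.1.7", "10.0.0.1",
            "192.168.1.9", "192.168.1.10", "172.16.5.4"]
key = b"hypothetical-demo-key"
real_view = [toy_prefix_preserving(key, ip) for ip in original]

temporal_orig = distinct_per_subnet(original)
temporal_view = distinct_per_subnet(real_view)
print(temporal_orig == temporal_view)                  # temporal distribution
print(sorted(temporal_orig) == sorted(temporal_view))  # cardinality-based distribution
```

Because addresses sharing a 24-bit prefix keep sharing one after the mapping, and distinct prefixes stay distinct, each subnet keeps its members and its position in the trace.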

In the second experiment, we present a packet-level analytic (17). In particular, Figure 12(d,e) shows the empirical cumulative distribution function results for the three traces. Our results clearly show that the original trace and our scheme yield identical results, since the multi-view approach has no impact on the fingerprinting quasi-identifier attributes.
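As a minimal illustration of this second analytic, the sketch below computes an empirical CDF of packet lengths before and after replacing the addresses. The records, the pseudonym scheme, and the helper name `ecdf` are assumptions made for the example; the only point is that an anonymization touching IP addresses alone leaves the packet-length distribution unchanged.

```python
from bisect import bisect_right


def ecdf(samples):
    """Return F(x) = fraction of samples that are <= x."""
    ordered = sorted(samples)

    def cdf(x):
        return bisect_right(ordered, x) / len(ordered)

    return cdf


# Hypothetical records: (source address, packet length in bytes).
original = [("10.0.0.1", 40), ("10.0.0.2", 1500),
            ("192.168.1.7", 576), ("10.0.0.1", 40)]

# Stand-in anonymization: consistent pseudonyms for addresses, lengths untouched.
pseudonyms = {}
view = [(pseudonyms.setdefault(ip, f"host-{len(pseudonyms)}"), length)
        for ip, length in original]

cdf_orig = ecdf([length for _, length in original])
cdf_view = ecdf([length for _, length in view])

for x in (40, 576, 1500):
    print(x, cdf_orig(x), cdf_view(x))
```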

Figure 12. Distribution of distinct IP addresses in different subnets (IP groups) (out of 1M) (a) for the original trace, the real view and one of the fake views based on the order they appear in the trace (temporal distribution), (b) for the original trace and the real view and (c) for the fake view, based on the cardinalities of the subnets in an ascending order (cardinality-based distribution). Empirical CDF for the packet lengths in (e),(f) the original trace and the real view, and the fake view, respectively.
Figure 13. Percentage of the compromised packets (out of 1M) and number of real view candidates when number of views and the adversary knowledge vary and for case (1) Figures (b),(e) (2) Figures (c),(f) where legends marked by CP denote the CryptoPAn result whereas those marked by MV denote the multi-view results
Figure 14. Comparison between scheme I and scheme II with partitions (prefix groups based on first octet sharing). Figure (a): Percentage of the compromised packets (out of 1M) and Figure (b): Number of real view candidates for adversary knowledge

b.3. Multi-view and the Scalability of ORAM

In practice, we expect analysis reports to be significantly smaller than the views. Considering also the one-round communication with ORAM (-complexity), we believe the solution would have acceptable scalability. Experiments using our dataset and an existing ORAM implementation (a public implementation (62) of non-recursive Path-ORAM (64)) would further confirm this. We generated various sets of analysis reports using Snort (63) and found that, for our dataset, the audit reports are in the range of KB in size, which is well suited to fast ORAM protocols such as Path-ORAM. Specifically, for Path-ORAM, Figure 5(b) in (62) shows a communication overhead of less than 1MB for the worst-case cost of up to number of blocks of size 4KB.
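A back-of-the-envelope estimate gives a feel for the per-access cost of non-recursive Path-ORAM. The sketch below assumes the standard construction in which each access reads and writes back one root-to-leaf path of a binary tree with buckets of Z blocks. The bucket size Z = 4 and the block count of 2^20 are illustrative assumptions, not values taken from the cited evaluation, and the estimate ignores encryption and position-map overhead.

```python
import math


def path_oram_bytes_per_access(num_blocks: int, block_size: int, bucket_size: int = 4) -> int:
    """Bytes moved per access: one path read plus one path write-back."""
    levels = math.ceil(math.log2(num_blocks)) + 1  # buckets on a root-to-leaf path
    return 2 * bucket_size * levels * block_size


# Hypothetical setting: 2**20 blocks of 4 KB each, bucket size Z = 4.
cost = path_oram_bytes_per_access(num_blocks=2**20, block_size=4096)
print(cost, "bytes per access, i.e.", cost / 1024, "KiB")
```

Under these assumptions the cost grows only logarithmically in the number of stored blocks, which is why small analysis reports are a natural fit for this kind of retrieval.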

Appendix C Algorithms

Input:
       : Original network trace
       , : Cryptographic keys
       : Number of prefix groups
       : IP partitioning
       : Iteration number of the real view.
       : Random vectors, of size
Output:
       : Anonymized trace to be outsourced
Function: anonymize ()
begin
1        
2        
3        
4        
5        foreach       do:
6            GetFlows()
7           
8           
9         end
10     return ,
end
Algorithm 1 Data owner: Trace anonymization (scheme I)
Input:
       : Seed trace
       : Number of iterations requested by
            data owner
       : Number of prefix groups
       : IP partitioning
       : Outsourced key
       : Vector of size defined by data owner
       ): Compliance verification
Output:
       : Analysis report of view ,
            
Function: analysis ()
begin
1       
2       for   do:
3          
4          foreach       do:
6              GetFlows()
7             
8             
9          end
10         )
11         return
12       end
end
Algorithm 2 Analyst: Network trace analysis (scheme I)

The following algorithms are summarized versions of the data owner’s and the analyst’s roles in our multi-view scheme presented in Section 4.
Algorithm 1: The data owner’s actions (scheme I).
Algorithm 2: The analyst’s actions (scheme I).
Algorithm 3: The data owner’s actions (scheme II).
Algorithm 4: The analyst’s actions (scheme II).
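The division of labor between the two roles in scheme I can be sketched in a few lines. This is a toy illustration under stated assumptions, not a transcription of Algorithms 1 and 2: the prefix-preserving function is an HMAC-based stand-in rather than CryptoPAn, and the names `pp`, `owner_seed_trace`, `analyst_views`, the keys, the step vector `v`, and the index `r` are all hypothetical. The assumed mechanism is that the data owner anonymizes the trace under a private key and then rewinds each prefix group by a multiple of its step count under the outsourced key, so that the analyst, by repeatedly advancing each group, reaches the real view at iteration `r` without knowing `r`.

```python
import hashlib
import hmac


def prf_bit(key: bytes, prefix: str) -> int:
    """Toy PRF (assumption): one bit from HMAC-SHA256 over the bit prefix."""
    return hmac.new(key, prefix.encode(), hashlib.sha256).digest()[0] & 1


def pp(key: bytes, addr: int, inverse: bool = False) -> int:
    """Toy prefix-preserving permutation on 32-bit addresses, or its inverse."""
    plain, out = "", ""
    for b in format(addr, "032b"):
        flipped = str(int(b) ^ prf_bit(key, plain))
        out += flipped
        # The PRF is always evaluated on the prefix of the forward-direction input.
        plain += flipped if inverse else b
    return int(out, 2)


def owner_seed_trace(trace, groups, k0, k1, r, v):
    """Data owner: anonymize under k0, then rewind group g by r * v[g] steps under k1."""
    seed = [pp(k0, a) for a in trace]
    for g, record_ids in enumerate(groups):
        for i in record_ids:
            for _ in range(r * v[g]):
                seed[i] = pp(k1, seed[i], inverse=True)
    return seed


def analyst_views(seed, groups, k1, v, n):
    """Analyst: advance group g by v[g] steps per iteration and keep every view."""
    current, views = list(seed), []
    for _ in range(n):
        for g, record_ids in enumerate(groups):
            for i in record_ids:
                for _ in range(v[g]):
                    current[i] = pp(k1, current[i])
        views.append(list(current))
    return views


trace = [0x0A000001, 0x0A000002, 0xC0A80107, 0xC0A80109]  # hypothetical addresses
groups = [[0, 1], [2, 3]]                                  # record indices per prefix group
k0, k1 = b"owner-only-key", b"outsourced-key"              # hypothetical keys
r, v, n = 3, [2, 1], 5                                     # real view index, steps, view count

seed = owner_seed_trace(trace, groups, k0, k1, r, v)
views = analyst_views(seed, groups, k1, v, n)
real_view = [pp(k0, a) for a in trace]
print(views[r - 1] == real_view)
```

In this sketch the analyst would run the requested analysis on all `n` views, and the data owner would then privately retrieve only the report for view `r`, for example through ORAM.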

Appendix D Complexity Analysis

Here, we analyze the overhead on both the data owner’s and the data analyst’s side. In particular, Table 3 summarizes the overhead of all action items on the data owner side. Here, is the computation overhead of CryptoPAn and is the number of distinct IP addresses. Finally, Table 4 summarizes the overhead of all action items on the data analyst side, where is the cost of times verifying the compliances (auditing).

Blocks in Multi-view | Computation Overhead | Communication Overhead
Initial anonymization | |
Migration function | |
Prefix grouping | |
Index generator | |
Seed trace | |
Report retrieval (ORAM) | |

Table 3. Overhead on the data owner side

Blocks in Multi-view | Computation Overhead | Communication Overhead
Seed view | |
N views generation | |
Compliance verification (Analysis) | |

Table 4. Overhead on the data analyst side
Input:
       : Original network trace
       , : Cryptographic keys
       : Number of IPs
       : Number of prefix groups
       : Migration function
       : IP partitioning
       : Iteration number of the real view.
       : Vectors of size defined by data owner
Output:
       : Anonymized trace to be outsourced
Function: anonymize ()
begin
1-1     
1-2     
2        
3        
4        
5        foreach       do: