1. Introduction
As the owners of large-scale network data, today's ISPs and enterprises usually face a dilemma. On the one hand, as security monitoring and analytics grow more sophisticated, there is an increasing need for those organizations to outsource such tasks, together with the necessary network data, to third-party analysts, e.g., Managed Security Service Providers (MSSPs) (outsource, ). On the other hand, those organizations are typically reluctant to share their network trace data with third parties, and even less willing to publish them, mainly due to privacy concerns over sensitive information contained in such data. For example, important network configuration information, such as potential bottlenecks of the network, may be inferred from network traces and subsequently exploited by adversaries to increase the impact of a denial of service attack (riboni, ).
In cases where data owners are convinced to share their network traces, the traces are typically subjected to some anonymization technique. The anonymization of network traces has attracted significant attention (a more detailed review of related work will be given in Section 6). For instance, CryptoPAn replaces real IP addresses inside network flows with prefix-preserving pseudonyms, such that the hierarchical relationships among those addresses are preserved to facilitate analyses (PP, ). Specifically, any two IP addresses sharing a prefix in the original trace will also do so in the anonymized trace. However, CryptoPAn is known to be vulnerable to the so-called fingerprinting and injection attacks (brekene1, ; brekene2, ; a1, ). In those attacks, adversaries either already know some network flows in the original traces (by observing the network or from other relevant sources, e.g., DNS and WHOIS databases) (burkhart, ), or have deliberately injected some forged flows into such traces. By recognizing those known flows in the anonymized traces based on unchanged fields of the flows, namely, fingerprints (e.g., timestamps and protocols), the adversaries can extrapolate their knowledge to recognize other flows based on the shared prefixes (brekene1, ). We now demonstrate such an attack in detail.
Example 1.1 ().
In Figure 1, the upper table shows the original trace, and the lower shows the trace anonymized using CryptoPAn. In this example, without loss of generality, we only focus on source IPs. Inside each table, similar prefixes are highlighted through similar shading.

Step 1: An adversary has injected three network flows, shown as the first three records in the original trace (upper table).

Step 2: The adversary recognizes the three injected flows in the anonymized trace (lower table) through unique combinations of the unchanged attributes (Start Time and Src Port).

Step 3: He/she can then extrapolate his/her knowledge from the injected flows to real flows as follows: e.g., since an anonymized prefix is shared by the second (injected), fifth (real), and sixth (real) flows, he/she knows all three must also share the same prefix in the original trace. Such identified relationships between flows in the two traces will be called matches from now on.

Step 4: Finally, he/she can infer the prefixes or entire IPs of those anonymized flows in the original traces, as he/she knows the original IPs of his/her injected flows, e.g., he/she can recover the original prefix of the fifth and sixth flows, and the entire original IPs of the fourth and last flows.
More generally, a powerful adversary who can probe all the subnets of a network using injection or fingerprinting can potentially de-anonymize the entire CryptoPAn output via a more sophisticated frequency analysis attack (brekene1, ).
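To make Steps 2 and 3 concrete, the following Python sketch (with entirely hypothetical flow fingerprints and IP values) shows how an adversary could match injected flows by their unchanged fields and then extrapolate prefix knowledge to real flows:

```python
# Hypothetical sketch of the injection attack (Example 1.1): the adversary
# re-identifies his/her injected flows via unchanged fingerprint fields,
# then extrapolates prefix knowledge to real flows. All values are toy data.

def shared_prefix_len(ip_a: int, ip_b: int, bits: int = 32) -> int:
    """Number of leading bits shared by two IPv4 addresses (as ints)."""
    diff = ip_a ^ ip_b
    return bits if diff == 0 else bits - diff.bit_length()

# Fingerprints (start_time, src_port) the adversary chose when injecting,
# mapped to the original source IPs he/she used.
injected = {
    (1000, 40001): 0x0A000001,  # 10.0.0.1
    (1001, 40002): 0x0A000102,  # 10.0.1.2
}

# Published anonymized trace: fingerprints unchanged, IPs pseudonymized.
anonymized_trace = [
    ((1000, 40001), 0xC3A10001),  # recognized as injected via fingerprint
    ((1001, 40002), 0xC3A1F102),
    ((2000, 50001), 0xC3A1F2AA),  # a real flow
]
anon_by_fp = dict(anonymized_trace)

# Step 3: a real flow sharing a long anonymized prefix with an injected
# flow must share a same-length prefix with that flow's ORIGINAL IP.
matches = []
for fp, anon_ip in anonymized_trace:
    if fp in injected:
        continue
    for inj_fp, orig_ip in injected.items():
        k = shared_prefix_len(anon_ip, anon_by_fp[inj_fp])
        if k >= 16:
            matches.append((fp, k, orig_ip))

print(matches)  # each entry: (real flow, shared bits, known original IP)
```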
Most subsequent solutions either require heavy data sanitization or can only support limited types of analysis. In particular, the obfuscation method first groups together flows with similar fingerprints and then bucketizes flows inside each group (i.e., replaces their original IPs with identical IPs); all records whose fingerprints are not sufficiently similar to others are suppressed (riboni, ). Clearly, both the bucketization and the suppression may lead to significant loss of data utility. The differentially private analysis method first adds noise to analysis results and then publishes such aggregated results (mcsherry, ; DP2, ; DP3, ). Although this method may provide a privacy guarantee regardless of adversarial knowledge, the perturbation and aggregation prevent its application to analyses that demand accurate or detailed records in the network traces.
In this paper, we aim to preserve both privacy and utility by shifting the tradeoff from between privacy and utility, as seen in most existing works, to between privacy and computational cost (which has seen a significant decrease lately, especially with the increasing popularity of cloud technology). The key idea is for the data owner to send enough information to the third-party analysts so that they can generate and analyze many different anonymized views of the original network trace. Those anonymized views are designed to be sufficiently indistinguishable (formally defined in Section 2.4), even to adversaries armed with prior knowledge and performing the aforementioned attacks, which preserves privacy. At the same time, one of the anonymized views yields true analysis results, which will be privately retrieved by the data owner or other authorized parties, which preserves utility. More specifically, our contributions are as follows.

We propose a multi-view approach to the prefix-preserving anonymization of network traces. To the best of our knowledge, this is the first known solution that can achieve data utility similar to that of CryptoPAn, while being robust against the so-called semantic attacks (e.g., fingerprinting and injection). In addition, we believe the idea of shifting the tradeoff from between privacy and utility to between privacy and computational cost may potentially be adapted to improve other privacy solutions.

In addition to the general multi-view approach, we detail a concrete solution based on iteratively applying CryptoPAn to each partition of a network trace, such that different partitions are anonymized differently in all the views except one (which yields valid analysis results that can be privately retrieved by the data owner). Beyond privacy and utility, we design the solution in such a way that only one seed view needs to be sent to the analysts, which avoids additional communication cost.

We formally analyze the level of privacy guarantee achieved using our method, discuss potential attacks and solutions, and finally experimentally evaluate our solution using real network traces from a major ISP. The experimental results confirm that our solution is robust against semantic attacks with a reasonable computational cost.
The rest of the paper is organized as follows: Section 2 defines our models. Section 3 introduces the building blocks for our schemes. Section 4 details two concrete multi-view schemes based on CryptoPAn. Section 5 presents the experimental results. The appendix provides further discussion (cf. Section 4.3), and Section 6 reviews related work. Finally, Section 7 concludes the paper.
2. Models
In this section, we describe our system and adversary models; we briefly review CryptoPAn; we provide a high-level overview of our multi-view approach; finally, we define our privacy property. Essential definitions and notations are summarized in Table 1.
2.1. The System and Adversary Model
Denote by T a network trace comprised of a set of flows (or records). Each flow includes a confidential attribute (the IP addresses), and the set of other attributes is called the Fingerprint Quasi-Identifier (fpQI) (riboni, ). Suppose the data owner would like the analyst to perform an analysis on T to produce a report. To ensure privacy, instead of sending T, an anonymization function F is applied to obtain an anonymized version F(T). Thus, our main objective is to find an anonymization function F that preserves both privacy, meaning the analyst cannot recover T or the original IPs from F(T), and utility, meaning F must be prefix-preserving.
In this context, we make the following assumptions (similar to those found in most existing works (PP, ; brekene1, ; brekene2, ; a1, )). i) The adversary is an honest-but-curious analyst (in the sense that he/she will exactly follow the approach) who can observe F(T). ii) The anonymization function is publicly known, but the corresponding anonymization key is not known to the adversary. iii) The goal of the adversary is to find all possible matches (as demonstrated in Example 1.1, an IP address may be matched to its anonymized version either through the fpQI or through shared prefixes) between T and F(T). iv) Suppose T consists of m groups, each of which contains IP addresses with similar prefixes (e.g., those in the same subnet), and among these the adversary can successfully inject or fingerprint g (g ≤ m) groups (e.g., the demilitarized zone (DMZ) or other subnets to which the adversary has access). Accordingly, we say that the adversary has (g, m)-knowledge. v) Finally, we assume the communication between the data owner and the analyst is over a secure channel, and we do not consider integrity or availability issues (e.g., a malicious adversary may potentially alter or delete the analysis report).
2.2. The CryptoPAn Model
To facilitate further discussions, we briefly review the CryptoPAn (PP, ) model, which serves as a baseline for prefix-preserving anonymization.
Definition 2.1 ().
Prefix-preserving Anonymization (PP, ): Given two IP addresses a = a_1 a_2 … a_n and b = b_1 b_2 … b_n, and a one-to-one function F, we say that

a and b share a k-bit prefix (0 ≤ k ≤ n), if and only if a_1 a_2 … a_k = b_1 b_2 … b_k and a_{k+1} ≠ b_{k+1} (when k < n).

F is prefix-preserving, if, for any a and b that share a k-bit prefix, F(a) and F(b) also do so.
Given a = a_1 a_2 … a_n and a key K, the prefix-preserving anonymization function F must necessarily satisfy the canonical form (PP, ), as follows.
(1) F(a) = a'_1 a'_2 … a'_n, where a'_i = a_i ⊕ f_{i-1}(a_1 a_2 … a_{i-1}), 1 ≤ i ≤ n
where f_{i-1} is a cryptographic function which, based on the key K, takes as input a bit string of length i − 1 and returns a single bit. Intuitively, the i-th bit is anonymized based on K and the i − 1 preceding bits to satisfy the prefix-preserving property. The cryptographic function can be constructed as f_{i-1}(·) = L(R(P(·), K)), where L returns the least significant bit, R can be a block cipher such as Rijndael (rindal, ), and P is a padding function that expands the i − 1 input bits to match the block size of R (PP, ). In the following, F_K will stand for this CryptoPAn function, and its output on address a will be denoted by F_K(a). The advantage of CryptoPAn is that it is deterministic and allows consistent prefix-preserving anonymization under the same key K. However, as mentioned earlier, CryptoPAn is vulnerable to semantic attacks, which will be addressed in the next section.
2.3. The Multi-View Approach
We propose a novel multi-view approach to the prefix-preserving anonymization of network traces. The objective is to preserve both privacy and data utility while being robust against semantic attacks. The key idea is to hide a prefix-preserving anonymized view, namely, the real view, among other fake views, such that an adversary cannot distinguish between those views, either using his/her prior knowledge or through semantic attacks. Our approach is depicted in Figure 2 and detailed below.
2.3.1. Privacy Preservation at the Data Owner Side
 Step 1::

The data owner generates two CryptoPAn keys K1 and K2, and then obtains an anonymized trace using the anonymization function (which will be represented by the gear icon inside this figure) and K1. This initial anonymization step is designed to prevent the analyst from simulating the process, as K1 will never be given out. Note that this anonymized trace is still vulnerable to semantic attacks and must undergo the remaining steps. Besides, generating this anonymized trace will actually be slightly more complicated due to migration, as discussed later in Section 3.3.
 Step 2::

The initially anonymized trace is divided into partitions (the concrete partitioning schemes are detailed in Section 4).

 Step 3::

Each partition is anonymized using the anonymization function and key K2, but the anonymization is repeated a different number of times on different partitions. For example, as the figure shows, the first partition is anonymized only once, whereas the second is anonymized three times, etc. The result of this step is called the seed trace. The idea is that, as illustrated by the different graphic patterns inside the seed trace, different partitions have been anonymized differently, and hence the seed trace in its entirety is no longer prefix-preserving, even though each partition is still prefix-preserving (note that this is only a simplified demonstration of the seed trace generator scheme, which will be detailed in Section 4).
 Step 4::

The seed trace, together with some supplementary parameters, including K2, is outsourced to the analyst.
2.3.2. Utility Realization at the Data Analyst Side
 Step 5::

The analyst generates N views in total based on the received seed view and the supplementary parameters. Our design ensures that one of those generated views, namely, the real view, has all its partitions anonymized in the same way, and is thus prefix-preserving (detailed in Section 4), though the analyst (adversary) cannot tell which one is the real view.
 Step 6::

The analyst performs the analysis on all the views and generates corresponding reports.
 Step 7::

The data owner retrieves the analysis report corresponding to the real view following an oblivious random access memory (ORAM) protocol (oram, ), such that the analyst cannot learn which view has been retrieved.
Next, we define the privacy property for the multi-view solution.
2.4. Privacy Property against Adversaries
Under our multi-view approach, an analyst (adversary) will receive N different traces with identical fpQI attribute values and different IP values. Therefore, his/her goal now is to identify the real view among all the views; e.g., he/she may attempt to observe his/her injected or fingerprinted flows, or he/she can launch the aforementioned semantic attacks on those views, hoping that the real view might respond differently to those attacks. Therefore, the main objective in designing an effective multi-view solution is to satisfy the indistinguishability property, which means the real view must be sufficiently indistinguishable from the fake views under semantic attacks. Motivated by the concept of Differential Privacy (dworks, ), we propose the indistinguishability property as follows.
Definition 2.2 ().
Indistinguishable Views: A multi-view solution is said to satisfy ε-Indistinguishability against an
adversary if and only if (both probabilities below are from the adversary's point of view)
(2) ∀ i, j ∈ [1, N]: |Pr[V_i is the real view] − Pr[V_j is the real view]| ≤ ε
where V_1, …, V_N denote the generated views.
In Definition 2.2, a smaller ε is more desirable, as it means the views are more indistinguishable from the real view to the adversary. For example, the extreme case of ε = 0 would mean all the views are equally likely to be the real view to the adversary (from now on, we call such views the real view candidates). In practice, the value of ε depends on the specific design of a multi-view solution and also on the adversary's prior knowledge, as will be detailed in the following sections.
Finally, since the multi-view approach requires outsourcing some supplementary parameters, we will also need to analyze the security/privacy of the communication protocol (privacy leakage in the protocol, which complements the privacy analysis of the protocol's output) in the semi-honest model under the theory of secure multi-party computation (SMC) (Yao86, ; goldrich, ) (see Section 4.2.4).
3. The Building Blocks
In this section, we introduce the building blocks for our multi-view mechanisms, namely, iterative and reverse CryptoPAn, partition-based prefix preserving, and CryptoPAn with IP collision (migration).
3.1. Iterative and Reverse CryptoPAn
As mentioned in Section 2.3, the multi-view approach relies on iteratively applying a prefix-preserving function F to generate the seed view. Also, the analyst will invert such applications of F in order to obtain the real view (among fake views). Therefore, we first need to show how F can be applied iteratively and in reverse.
First, it is straightforward that F can be iteratively applied, and the result still yields a valid prefix-preserving function. Specifically, denote by F_K^n(a) (n ≥ 1) the iterative application of F on IP address a using key K, where n is the number of iterations, called the index. For example, for an index of two, we have F_K^2(a) = F_K(F_K(a)). It can be easily verified that, given any two IP addresses a and b sharing a k-bit prefix, F_K^n(a) and F_K^n(b) will always be two IP addresses that also share a k-bit prefix (i.e., F_K^n is prefix-preserving). More generally, the same also holds for applying F under a sequence of indices and keys (identical for both IPs), e.g., F_{K2}^{n2}(F_{K1}^{n1}(a)) and F_{K2}^{n2}(F_{K1}^{n1}(b)) will also share a k-bit prefix. Finally, for a set of IP addresses T, iterative application using a single key K satisfies the following associative property:
(3) F_K^{n_1}(F_K^{n_2}(T)) = F_K^{n_1 + n_2}(T)
On the other hand, when a negative number is used as the index, we obtain a reverse iterative CryptoPAn function (F^{-1} for short), as formally characterized in Theorem 3.1 (the proof is in Appendix A.1).
Theorem 3.1 ().
Given IP addresses a = a_1 a_2 … a_n and a' = F_K(a) = a'_1 a'_2 … a'_n, the function F_K^{-1} defined as
(4) F_K^{-1}(a'_1 a'_2 … a'_n) = a_1 a_2 … a_n, where a_i = a'_i ⊕ f_{i-1}(a_1 a_2 … a_{i-1}), computed sequentially for i = 1, …, n,
is the inverse of the function given in Equation 1, i.e., F_K^{-1}(F_K(a)) = a.
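A possible sketch of the iterative and reverse application, using a toy HMAC-based stand-in for CryptoPAn's one-bit function f (not the real cipher), is:

```python
import hashlib
import hmac

def f_bit(key: bytes, prefix: int, i: int) -> int:
    # Toy one-bit PRF standing in for CryptoPAn's Rijndael-based f.
    mac = hmac.new(key, prefix.to_bytes(4, "big") + bytes([i]),
                   hashlib.sha256)
    return mac.digest()[-1] & 1

def pp(ip: int, key: bytes, bits: int = 32) -> int:
    out = 0
    for i in range(bits):
        a_i = (ip >> (bits - 1 - i)) & 1
        out = (out << 1) | (a_i ^ f_bit(key, ip >> (bits - i), i))
    return out

def pp_inverse(ip: int, key: bytes, bits: int = 32) -> int:
    # Recover the original address bit by bit: once the first i-1 original
    # bits are known, f can be recomputed and XORed off (cf. Theorem 3.1).
    orig = 0
    for i in range(bits):
        b_i = (ip >> (bits - 1 - i)) & 1
        orig = (orig << 1) | (b_i ^ f_bit(key, orig, i))
    return orig

def pp_iter(ip: int, key: bytes, n: int) -> int:
    """Index n: apply pp n times if n > 0, pp_inverse |n| times if n < 0."""
    for _ in range(abs(n)):
        ip = pp(ip, key) if n > 0 else pp_inverse(ip, key)
    return ip

ip = 0x12345678
assert pp_inverse(pp(ip, b"K"), b"K") == ip  # F^{-1}(F(a)) = a
# Associative property (Equation 3):
assert pp_iter(pp_iter(ip, b"K", 2), b"K", 1) == pp_iter(ip, b"K", 3)
```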
3.2. Partitionbased Prefix Preserving
As mentioned in Section 2.3, the central idea of the multi-view approach is to divide the trace into partitions (Step 2), and then anonymize those partitions iteratively, but for different numbers of iterations (Step 3). In this subsection, we discuss this concept.
Given a set of IP addresses T, we may divide T into partitions in various ways, e.g., forming equal-sized partitions after sorting based on either the IP addresses or the corresponding timestamps. The partitioning scheme has a major impact on privacy, and we will discuss two such schemes in the next section.
Once the trace is divided into N partitions, we can then apply F on each partition separately, denoted by F_K^{v_j}(P_j) for the j-th partition. Specifically, given T divided as a set of partitions {P_1, P_2, …, P_N}, we define a key vector V = ⟨v_1, v_2, …, v_N⟩ where each v_j is a positive integer indicating the number of times F should be applied to P_j, namely, the key index of P_j. Given a cryptographic key K, we can then define the partition-based prefix-preserving anonymization of T as F_K^V(T) = {F_K^{v_1}(P_1), …, F_K^{v_N}(P_N)}. We can easily extend the associative property in Equation 3 to this case as follows (which will play an important role in designing our multi-view mechanisms in the next section).
(5) F_K^{V_1}(F_K^{V_2}(T)) = F_K^{V_1 + V_2}(T)
where V_1 + V_2 denotes element-wise addition.
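The partition-based anonymization and the associative property of Equation 5 can be sketched as follows (again with a toy HMAC-based stand-in for CryptoPAn's one-bit function):

```python
import hashlib
import hmac

def pp(ip: int, key: bytes, bits: int = 32) -> int:
    # Compact toy prefix-preserving function (HMAC stands in for f).
    out = 0
    for i in range(bits):
        a_i = (ip >> (bits - 1 - i)) & 1
        mac = hmac.new(key,
                       (ip >> (bits - i)).to_bytes(4, "big") + bytes([i]),
                       hashlib.sha256)
        out = (out << 1) | (a_i ^ (mac.digest()[-1] & 1))
    return out

def pp_n(ip: int, key: bytes, n: int) -> int:
    for _ in range(n):
        ip = pp(ip, key)
    return ip

def anonymize_partitions(partitions, key, key_vector):
    """Apply pp to partition j exactly key_vector[j] times (its key index)."""
    return [[pp_n(ip, key, n) for ip in part]
            for part, n in zip(partitions, key_vector)]

parts = [[0xC0A80001, 0xC0A800FE],  # these two share a 24-bit prefix
         [0x0A000001]]
seed = anonymize_partitions(parts, b"K", [1, 3])
assert seed[0][0] >> 8 == seed[0][1] >> 8  # each partition stays prefix-preserving

# Associative property (Equation 5): anonymizing the result with vector
# [2, 1] equals anonymizing the original with the element-wise sum [3, 4].
assert (anonymize_partitions(seed, b"K", [2, 1])
        == anonymize_partitions(parts, b"K", [3, 4]))
```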
3.3. IP Migration: Introducing IP Collision into CryptoPAn
As mentioned in Section 2.3, once the analyst (adversary) receives the seed view, he/she will generate many indistinguishable views, among which only one, the real view, is prefix-preserving across all the partitions, while the other (fake) views do not preserve prefixes across partitions (Step 5). However, this design would have a potential flaw under a direct application of CryptoPAn. Specifically, since the original CryptoPAn design is collision resistant (PP, ), the fact that similar prefixes are only preserved across partitions in the real view would allow an adversary to easily distinguish the real view from the others.
Example 3.1 ().
Figure 3 illustrates this flaw. The original trace includes three different addresses and has been divided into two partitions, P1 and P2. As illustrated in the figure, the real view is easily distinguishable from the two fake views, as the prefixes shared between addresses in P1 and P2 only appear in the real view: since the partitions in fake views have different numbers of PP iterations applied, and since the original CryptoPAn design is collision resistant (PP, ), the shared prefixes no longer appear there.
To address this issue, our idea is to create collisions between different prefixes in fake views, such that adversaries cannot tell whether shared prefixes are due to prefix preservation in the real view or due to collisions in the fake views. However, due to the collision resistance property of CryptoPAn (PP, ), there is only a negligible probability that different prefixes become identical, even after applying different iterations of PP, as shown in the above example. Therefore, our key idea of IP migration is to first replace the prefixes of all the IPs with common values (e.g., zeros), and then fabricate new prefixes for them by applying different iterations of PP. This IP migration process is designed to be prefix-preserving (i.e., any IPs sharing prefixes in the original trace will still share the new prefixes), and to create collisions in fake views, since the sums of key indices arising during view generation can easily collide. Next, we demonstrate this IP migration technique in an example.
Example 3.2 ().
In Figure 4, the first stage shows the same original trace as in Example 3.1. In the second stage, we “remove” the prefixes of all IPs and replace them with all zeros (by XORing them with their own prefixes). Next, in the third stage, we fabricate new prefixes by applying different iterations of F in a prefix-preserving manner; e.g., the first two IPs still share a common prefix, different from that of the last IP. However, note that whether two IPs share the new prefixes now depends only on their key indices, which are equal for the first two IPs and different for the last. This is how we can create collisions in the next stage (the fake view), where the first and last IPs coincidentally share the same prefix due to common key-index sums (note these are the sums of different key indices from the migration stage and the view generation stage, respectively). Now, the adversary will not be able to tell which of the views is real based on the existence of shared prefixes.
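A toy sketch of the migration idea on 8-bit addresses (hypothetical values; the HMAC-based function is a stand-in for CryptoPAn's) follows. Zeroing the prefix and then applying a group-specific number of PP iterations makes prefix sharing depend only on the key indices:

```python
import hashlib
import hmac

BITS = 8  # toy address width for readability

def pp(ip: int, key: bytes, rounds: int = 1) -> int:
    # Toy prefix-preserving function on 8-bit addresses (HMAC-based f).
    for _ in range(rounds):
        out = 0
        for i in range(BITS):
            a_i = (ip >> (BITS - 1 - i)) & 1
            mac = hmac.new(key, bytes([ip >> (BITS - i), i]), hashlib.sha256)
            out = (out << 1) | (a_i ^ (mac.digest()[-1] & 1))
        ip = out
    return ip

def migrate(ips, prefix_len, key_index, key):
    """Zero each address's prefix, then fabricate a new one by applying pp
    a group-specific number of times (key_index maps IP -> its group's
    random key index)."""
    out = []
    for ip in ips:
        zeroed = ip & ((1 << (BITS - prefix_len)) - 1)  # prefix -> zeros
        out.append(pp(zeroed, key, key_index[ip]))
    return out

# Two prefix groups: 0xA1 and 0xA2 share the 4-bit prefix 0xA; 0x51 does not.
migrated = migrate([0xA1, 0xA2, 0x51], 4, {0xA1: 2, 0xA2: 2, 0x51: 5}, b"Km")
# Equal key indices => shared fabricated prefix; sharing now depends only
# on the indices, which is what lets fake views collide later.
assert migrated[0] >> 4 == migrated[1] >> 4
```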
We now formally define the migration function in the following.
Definition 3.1 ().
Migration Function: Let A be a set of IP addresses consisting of m groups of IPs with distinct prefixes, and K be a random CryptoPAn key. The migration function M is defined as
(6) M(A) = { F_K^{r_g}(ā) : a ∈ A }, where ā denotes a with its prefix bits replaced by zeros and r_g is the key index of a's prefix group,
where {r_1, …, r_m} is a set of non-repeating random key indices generated using a cryptographically secure pseudo-random number generator.
4. Indistinguishable Multi-View Mechanisms
We first present a multi-view mechanism based on IP partitioning in Section 4.1. We then propose a more refined scheme based on distinct IP partitioning with a key vector generator in Section 4.2.
4.1. Scheme I: IP-based Partitioning Approach
To realize the main ideas of multi-view anonymization, as introduced in Section 2.3, we need to design concrete schemes for each step in Figure 2. The key idea of our first scheme is the following: we divide the original trace in such a way that all the IPs sharing prefixes are always placed in the same partition. This prevents the attack described in Section 3.3, i.e., identifying the real view by observing shared prefixes across different partitions. As we will detail in Section 4.1.4, this scheme can achieve perfect indistinguishability without the need for IP migration (introduced in Section 3.3), although it has limitations, which will be addressed in our second scheme. Both schemes are depicted in Figure 5 and detailed below.
Specifically, our first scheme includes three main steps: privacy preservation (Section 4.1.1), utility realization (Section 4.1.2), and analysis report extraction (Section 4.1.3).
4.1.1. Privacy Preservation (Data Owner)
The data owner performs a set of actions to generate the seed trace together with some parameters to be sent to the analyst for generating different views. These actions are summarized in Algorithm 1, and detailed in the following.

Applying CryptoPAn using K1: First, the data owner generates two independent keys, namely K1 (used for the initial anonymization, and never leaving the data owner) and K2 (used for later anonymization steps). The data owner then generates the initially anonymized trace T̂ = F_K1(T). This step is designed to prevent the adversary from simulating the scheme, e.g., using a brute-force attack to revert the seed trace back to the original trace, in which he/she could recognize some original IPs. The leftmost block in Figure 5 shows an example of the initially anonymized trace.

Trace partitioning based on IP value: The initially anonymized trace is partitioned based on IP values. Specifically, let A be the set of IP addresses in T̂, consisting of m groups of IPs with distinct prefixes; we divide T̂ into m partitions, each of which is the collection of all records containing one of these groups. For example, the upper part of Figure 5 depicts how our first scheme works: the set of three IPs is divided into two partitions, where P1 includes the two IPs sharing the same prefix, whereas the last IP goes into P2 since it does not share a prefix with the others.

Seed trace creation: The data owner in this step generates the seed trace using a random key vector whose size equals the number of partitions.

Generating a random key vector: The data owner generates a random key vector V whose size equals the number of partitions, using a cryptographically secure pseudo-random number generator (which produces a set of non-repeating random numbers). This vector and the key K2 will later be used by the analyst to generate different views from the seed trace. For example, in Figure 5, a vector V with one key index per partition is generated for the two partitions. Finally, the data owner chooses the total number of views N to be generated later by the analyst, based on his/her requirements for privacy and computational overhead: a larger N means more computation by both the data owner and the analyst, but also more privacy (more real view candidates), which we will further study through experiments later.

Generating a seed trace key vector and a seed trace: The data owner picks a random number r (1 ≤ r ≤ N) and then computes the key vector V_s of the seed trace. Next, the data owner generates the seed trace as T_s = F_K2^{V_s}(T̂). This ensures that, after the analyst applies exactly r iterations of F_K2^{V} to the seed trace, he/she gets back T̂ (while not being aware of this fact, since he/she does not know r). For example, in Figure 5, applying the indices in V to the seed trace three times yields exactly T̂ (the real view). This can be more formally stated as follows (the r-th view is actually the real view).


Outsourcing: Finally, the data owner outsources the seed trace T_s, the key vector V, the key K2, and the number of views N to the analyst.
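The key-index bookkeeping above can be sketched as follows; the vector arithmetic (a seed vector equal to −r times the random vector, so that r applications of V cancel it) is our reading of the description, and `random.sample` stands in for the required CSPRNG:

```python
import random

N = 5                  # total number of views the analyst will generate
NUM_PARTITIONS = 3
# Non-repeating random key indices; random.sample is for brevity only,
# the scheme calls for a cryptographically secure generator.
V = random.sample(range(1, 100), NUM_PARTITIONS)
r = random.randrange(1, N + 1)    # secret: which generated view is real
V_seed = [-r * v for v in V]      # hypothetical seed vector: -r * V

def net_index(i: int, j: int) -> int:
    # Net number of PP applications on partition j in view i, using the
    # associative property: seed index plus i applications of V.
    return V_seed[j] + i * V[j]

# View r is the real view: every partition ends up with net index 0,
# i.e., all partitions are anonymized identically (prefix-preserving).
assert all(net_index(r, j) == 0 for j in range(NUM_PARTITIONS))
# In every other view the partitions carry pairwise distinct nonzero
# indices, so prefixes are not preserved across partitions.
assert all(
    len({net_index(i, j) for j in range(NUM_PARTITIONS)}) == NUM_PARTITIONS
    for i in range(1, N + 1) if i != r)
```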
4.1.2. Network Trace Analysis (Analyst)
The analyst generates the views requested by the data owner, which is summarized in Algorithm 2 in Appendix C and formalized below.
(7) V_i = F_K2^{i·V}(T_s), 1 ≤ i ≤ N
Since the boundaries of partitions must be recognizable by the analyst to allow him/her to generate the views, we mark them by modifying the timestamps of the records on the boundaries of each partition, changing the most significant digit of the timestamp; this is easy to verify and does not affect the analysis, since the analyst can revert the timestamps to their original format. Next, the analyst performs the requested analysis on all N views and generates the corresponding analysis reports.
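A minimal sketch of this boundary-marking trick, under our assumption that timestamps are current-era 10-digit epoch seconds beginning with '1':

```python
# Assumed: 10-digit epoch-second timestamps beginning with '1' (true for
# the current era); a boundary record is flagged by rewriting the most
# significant digit, and the analyst reverts it before analysis.

def mark_boundary(ts: int) -> int:
    s = str(ts)
    assert len(s) == 10 and s[0] == "1", "sketch assumes current-era epochs"
    return int("9" + s[1:])

def is_boundary(ts: int) -> bool:
    return str(ts)[0] == "9"

def unmark(ts: int) -> int:
    return int("1" + str(ts)[1:]) if is_boundary(ts) else ts

ts = 1700000123
assert is_boundary(mark_boundary(ts)) and not is_boundary(ts)
assert unmark(mark_boundary(ts)) == ts  # analysis sees original timestamps
```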
4.1.3. Analysis Report Extraction (Data Owner)
The data owner is only interested in the analysis report related to the real view. To minimize communication overhead, instead of requesting all the analysis reports of the generated views, the data owner can fetch only that one report. He/she can employ oblivious random access memory (ORAM) (oram, ) to do so without revealing to the analyst which report was fetched (we will discuss alternatives in Section 6).
4.1.4. Security Analysis
We now analyze the level of indistinguishability provided by the scheme. Recall the indistinguishability property defined in Section 2: a multi-view mechanism is ε-indistinguishable if and only if Equation 2 holds, where the event inside each probability is the adversary's decision on a view, declaring it fake or a real view candidate, using his/her knowledge. Moreover, we note that the generated views differ only in their IP values (the fpQI attributes are identical across views). Hence, the adversary's decision can only be based on the published set of IPs in each view, by comparing the shared prefixes among those IP addresses that he/she already knows. Accordingly, in the following, we define a function to represent all the prefix relations for a set of IPs.
Lemma 4.1 ().
For two 32-bit IP addresses a and b, the function P(a, b) returns the number of bits in the prefix shared between a and b:
P(a, b) = 31 − ⌊log2(a ⊕ b)⌋ for a ≠ b, and P(a, a) = 32,
where ⌊·⌋ denotes the floor function.
Definition 4.1 ().
For a multiset of IP addresses A, the Prefixes Indicator Set (PIS) is defined as follows.
(8) PIS(A) = { P(a, b) : a, b ∈ A, a ≠ b } (as a multiset)
Note that the PIS remains unchanged when CryptoPAn is applied to A, i.e., PIS(F_K(A)) = PIS(A). In addition, since the multi-view solution keeps all the other attributes intact, the adversary can identify his/her prior knowledge in each view and construct prefixes indicator sets from it. Accordingly, we denote by PIS_i the PIS constructed by the adversary in view i.
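The prefix-relation function and the PIS can be sketched as follows (the closed form for P and the toy addresses are illustrative); a prefix-preserving renaming leaves the PIS unchanged, which is exactly the invariant the adversary tests across views:

```python
def shared_prefix_bits(a: int, b: int, bits: int = 32) -> int:
    """Number of leading bits shared by two addresses: equivalent to the
    floor/log2 closed form, computed via XOR and bit_length."""
    diff = a ^ b
    return bits if diff == 0 else bits - diff.bit_length()

def pis(ips):
    """Prefixes Indicator Set: multiset (sorted list) of pairwise
    shared-prefix lengths over a set of IP addresses."""
    return sorted(shared_prefix_bits(a, b)
                  for i, a in enumerate(ips) for b in ips[i + 1:])

original = [0xC0A80001, 0xC0A800FE, 0x0A000001]
renamed  = [0x7F000001, 0x7F0000FE, 0xE0000001]  # same pairwise relations
assert pis(original) == pis(renamed)  # PIS is invariant under renaming
assert pis(original) == [0, 0, 24]
```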
Definition 4.2 ().
Let PIS_0 be the PIS for the adversary's knowledge, and PIS_i (1 ≤ i ≤ N) be the PIS constructed by the adversary in view i. A multi-view solution then generates ε-indistinguishable views against such an adversary if and only if
(9) ∀ i, j ∈ [1, N]: |Pr[PIS_i ⊇ PIS_0] − Pr[PIS_j ⊇ PIS_0]| ≤ ε
Lemma 4.2 ().
The indistinguishability property defined in Equation 9 can be simplified to: Pr[PIS_i ⊇ PIS_0] = 1 for every view i.
Proof.
Pr[PIS_r ⊇ PIS_0] = 1, as view r is the prefix-preserving output. Moreover, we have PIS(F_K(A)) = PIS(A). ∎
From the above, we only need to show that Pr[PIS_i ⊇ PIS_0] = 1 for every view i (i.e., each generated view is a real view candidate).
Theorem 4.3 ().
Scheme I satisfies Definition 4.2 with ε = 0.
Proof.
Scheme I divides the trace into m (the number of prefix groups) partitions, each containing all the records with similar prefixes. Hence, for any partition P_j, any two IP addresses a and b inside P_j, and any view i, the value of P(a, b) is preserved, because a and b are always assigned equal key indices. Moreover, for any two IP addresses a and b in two different partitions and any view i, we have P(a, b) = 0, since they do not share any prefix. ∎
The above discussions show that Scheme I produces perfectly indistinguishable views (ε = 0). In fact, it is robust against the attack explained in Section 3.3 and thus does not require IP migration, because the partitioning algorithm already prevents addresses with similar prefixes from going into different partitions (the case in Figure 3). However, although adversaries cannot identify the real view, they may choose to live with this fact and instead attack each partition inside any (fake or real) view, using the same semantic attack as shown in Figure 1. Note that our multi-view approach is only designed to prevent attacks across different partitions; each partition itself is essentially still the output of CryptoPAn and thus inherits its weakness.
Fortunately, the multi-view approach gives us more flexibility in designing specific schemes to further mitigate this weakness of CryptoPAn. We next present Scheme II, which sacrifices some indistinguishability (in the sense of slightly fewer real view candidates) to achieve better-protected partitions.
4.2. Scheme II: Multi-View Using Key Vectors
To address the limitation of our first scheme, we propose a second scheme, which differs in the initial anonymization step, the IP partitioning, and the key vectors used for view generation. The data owner's and the analyst's actions are summarized in Algorithms 3 and 4.
4.2.1. Initial Anonymization with Migration
First, to mitigate the attack on each partition, we must relax the requirement that all shared prefixes go into the same partition. However, as soon as we do so, the attack of identifying the real view through prefixes shared across partitions, as demonstrated in Section 3.3, becomes possible. Therefore, we modify the first step of the multi-view approach (initial anonymization) to enforce the IP migration technique. Figure 6 demonstrates this: the original trace is first anonymized with K1, and the anonymized trace then goes through the migration process, which replaces the two distinct prefixes with prefixes fabricated through different iterations of F, as discussed in Section 3.3.
4.2.2. Distinct IP Partitioning and Key Vectors Generation
For this scheme, we employ a special case of IP partitioning where each partition includes exactly one distinct IP (i.e., the collection of all records containing the same IP). For example, the trace shown in Figure 5 includes three distinct IP addresses, so the trace is divided into three partitions. Next, the data owner generates the seed view as in the first scheme, although the key vector will be generated quite differently, as detailed below.
Let be the set of IP addresses after the migration step, and suppose consists of distinct IP addresses. We denote by the multiset of migration keys for those distinct IPs (in contrast, the number of migration keys in is equal to the number of distinct prefixes, as discussed in Section 3.3). Also, let be the set of random numbers generated between using a cryptographically secure pseudo-random number generator at iteration . The data owner generates key vector as follows.
(11)  
and
(12)  
Example 4.1 ().
In Figure 7, the migration and random vectors are , , , and , respectively. The corresponding key vectors will be , and where only and are outsourced.
In this scheme, at each iteration the analyst generates a new set of IP addresses by randomly grouping all the distinct IP addresses into a set of prefix groups. In doing so, each new vector essentially cancels out the effect of the previous vector , and thus introduces a new set of IP addresses consisting of prefix groups. It is therefore straightforward to verify that the generated view will be prefix-preserving (the addresses are migrated back to their groups using ).
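The per-iteration regrouping can be sketched as follows. This is an illustration under our own simplifying assumptions: groups are identified by a fresh random /24 prefix rather than by the actual key-vector arithmetic, and all names are ours.

```python
import random

def generate_view(distinct_ips, group_sizes, rng):
    """Sketch of the analyst's per-iteration step: randomly regroup the
    distinct (already-anonymized) IPs into prefix groups of predefined
    sizes; every member of a group receives the same fresh random /24
    prefix, so each view is prefix-preserving within its groups."""
    ips = list(distinct_ips)
    rng.shuffle(ips)           # a fresh random grouping per iteration
    view, i = {}, 0
    for size in group_sizes:
        prefix = ".".join(str(rng.randrange(256)) for _ in range(3))
        for ip in ips[i:i + size]:
            view[ip] = prefix + "." + ip.split(".")[-1]
        i += size
    return view
```

Each call with a fresh random state models one iteration: the set of prefix groups changes, but the number of groups (and their cardinalities) stays fixed, matching the equal-element-count behavior shown in Figure 7.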
Example 4.2 ().
Figure 7 shows that, in each iteration, a different set (but with an equal number of elements) of prefix groups will be generated. For example, in the seed view, IP addresses and are mapped to prefix group .
4.2.3. Indistinguishability Analysis
By placing each distinct IP in its own partition, our second scheme is not vulnerable to semantic attacks on individual partitions, since a partition contains no information about the prefix relationships among different addresses. However, as we show in the following, this scheme achieves a weaker level of indistinguishability (higher ) than scheme I. Specifically, to verify the indistinguishability of the scheme, we calculate for scheme II as follows. First, the number of all possible outcomes of grouping IP addresses into groups with predefined cardinalities is:
(13) 
where denotes the cardinality of group . Also the number of all possible outcomes of grouping IP addresses into groups while still having is:
(14) 
for some . This equation gives the number of outcomes in which a specific set of IP addresses () is distributed into different groups, hence keeping (i.e., the adversary cannot identify a collision). Note that the term in the numerator accounts for all combinations of choosing these groups. Finally, we have
(15) 
Thus, to ensure indistinguishability, the data owner needs to satisfy Equation 15, which relates the number of distinct IP addresses, the number of groups, the cardinalities of the groups in the trace, and the adversary's knowledge.
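The grouping count with predefined cardinalities is a multinomial coefficient, which can be computed directly. The sketch below (function name ours) evaluates it for small instances, as a sanity check on the counting argument above.

```python
from math import factorial, prod

def groupings(cardinalities):
    """Number of ways to distribute n = sum(cardinalities) distinct IPs
    into groups with these predefined sizes: the multinomial coefficient
    n! / (c_1! * c_2! * ... * c_g!)."""
    n = sum(cardinalities)
    return factorial(n) // prod(factorial(c) for c in cardinalities)
```

For example, three distinct IPs split into groups of sizes 2 and 1 admit 3!/(2!·1!) = 3 groupings.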
Theorem 4.4 ().
The indistinguishability parameter of the generated views in scheme II is lower-bounded by
(16) 
Proof.
Let be positive real numbers, and for define the averages as follows:
(17) 
By Maclaurin’s inequality (mclauren, ), which is the following chain of inequalities:
(18) 
where , we have
and since , we have
∎
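Since the proof steps above are abbreviated, Maclaurin's inequality (mclauren, ) used in the proof can be stated explicitly as follows (notation ours):

```latex
% For positive reals x_1, \dots, x_n, let S_k denote the normalized
% k-th elementary symmetric mean:
S_k \;=\; \frac{1}{\binom{n}{k}} \sum_{1 \le i_1 < \cdots < i_k \le n}
          x_{i_1} x_{i_2} \cdots x_{i_k} .
% Maclaurin's inequality is then the chain
S_1 \;\ge\; S_2^{1/2} \;\ge\; S_3^{1/3} \;\ge\; \cdots \;\ge\; S_n^{1/n},
% with equality throughout if and only if x_1 = \cdots = x_n.
```

Applying the chain to the group cardinalities bounds the ratio of symmetric means appearing in the proof, which yields the stated lower bound.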
Figure 8(a) shows how the lower bound in Equation 16 changes with respect to different values of the fraction and the adversary's knowledge. As expected, stronger adversaries have more power to weaken the scheme, which results in increasing , or equivalently increasing the chance of identifying the real view. Moreover, as illustrated in the figure, when the fraction grows, tends to converge to very small values. Hence, to decrease , the data owner may increase by grouping addresses based on a larger number of bits in their prefixes, e.g., treating a certain combination of three octets as a prefix instead of one or two. Another solution is to aggregate the original trace with other traces for which the cardinalities of each prefix group are small. We study this effect in our experiments in Section 5, where we illustrate the concept especially in Figures 10 and 11.
Finally, Figure 8(b) shows how the variance of the cardinalities affects the indistinguishability for a set of fixed parameters , , . In fact, when the cardinalities of the prefix groups are close (small ), grows to meet the lower bound in Theorem 4.4. Hence, from the data owner's perspective, a trace with a lower variance of cardinalities and a bigger fraction has a better chance of misleading adversaries who want to identify the real view.
4.2.4. Security of the Communication Protocol
We now analyze the security/privacy of our communication protocol in the semi-honest model under the theory of secure multi-party computation (SMC) (Yao86, ; goldrich, ).
Lemma 4.5 ().
Scheme II only reveals the CryptoPAn key and the seed trace in the semi-honest model.
Proof.
Recall that our communication protocol involves only one round of communication between the two parties (data owner to data analyst). We therefore only need to examine the data analyst's view (messages received from the protocol), which includes (1) : the number of views to be generated, (2) : the outsourced key, (3) : the seed trace, and (4) : the key vectors. As discussed in Section 4.2.3, the probability of the adversary identifying the real view using all the provided information (key and vectors) depends on the adversary's knowledge and the trace itself, which implies that such “leakage” is trivial.
Indeed, each of and can be simulated by generating a single random number from a uniform distribution (which proves that they are not leakages of the protocol). Specifically, the number of generated views is an integer bounded by , where is the maximum number of views the data owner can afford, and all the entries in are in , where is the number of groups. First, given an integer , the probability that is simulated in the domain would be . Then, can be simulated in polynomial time (based on the knowledge the data analyst already has, i.e., his/her input and/or output of the protocol). Similarly, all the random entries in can also be simulated in polynomial time using a similar simulator (only changing the bound). Thus, the protocol only reveals the outsourced key and the seed trace in the semi-honest model. ∎
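The simulator in the proof can be sketched as follows. All names and bounds are placeholders of ours: the point is only that every message the analyst sees (the number of views and the random key-vector entries) can be reproduced by uniform sampling within its publicly known bound, without access to the real trace.

```python
import secrets

def simulate_view(n_max: int, g: int, vector_len: int):
    """Simulator sketch (names ours): reproduces the analyst's protocol
    view -- the number of views N (an integer bounded by n_max) and the
    random key-vector entries (each uniform in [0, g), g = number of
    groups) -- using only uniform sampling."""
    n_sim = secrets.randbelow(n_max) + 1
    vector_sim = [secrets.randbelow(g) for _ in range(vector_len)]
    return n_sim, vector_sim
```

Because the simulated transcript is distributed identically to the real one (uniform within the same bounds), these messages reveal nothing beyond the bounds themselves.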
Note that outsourcing and the outsourced key constitutes trivial leakage. The outsourced key can be considered a public key, whose leakage, considered as the output of the protocol, was studied earlier. Finally, we study the setup leakage and show that the adversary cannot exploit the outsourced parameters to increase (i.e., decrease the number of real view candidates) by building his/her own key vector.
Lemma 4.6 ().
(proof in Appendix A.2) For an adversary who wants to obtain the least number of real view candidates, if condition holds, the best approach is to follow scheme II (i.e., scheme II returns the least number of real view candidates).
4.3. Discussion
In this section, we discuss various aspects and limitations of our approach.

Application to EDB: We believe the multi-view solution may be applicable to other related areas. For instance, processing on encrypted databases (EDB) has a rich literature, including searchable symmetric encryption (SSE) (sse1, ; sse2, ), fully homomorphic encryption (FHE) (FHE, ), oblivious RAM (ORAM) (goldrich, ), functional encryption (boneh, ), and property-preserving encryption (PPE) (Bellare, ; Boldyreva, ). All these approaches achieve different tradeoffs among protection (security), utility (query expressiveness), and computational efficiency (naveed, ). Extending and applying the multi-view approach in those areas could lead to interesting future directions.

Comparing the Two Schemes: As discussed above, scheme I achieves better indistinguishability but less protected partitions in each view. Figure 14 compares the relative effectiveness of the two schemes on a real trace under adversary knowledge. In particular, Figures 14(a) and (b) demonstrate that despite the lower number of real view candidates in scheme II compared with scheme I ( vs out of ), the end result of the leakage in scheme II is much more appealing ( vs ). Therefore, our experimental section mainly focuses on scheme II.

Choosing the Number of Views : The number of views is an important parameter of our approach that determines both the privacy and the computational overhead. The data owner can choose this value based on the level of trust in the analysts and the amount of computational overhead that can be afforded. Specifically, as implied by Equation 4.2 and demonstrated by our experimental results in Section 5, the number of real view candidates is approximately . The data owner should first estimate the adversary's background knowledge (number of prefixes known to the adversary) and then calculate either using Equation 15 or (approximately) using Equation 16. As demonstrated in Figures 8(a) and 9(b), a bigger results in weaker indistinguishability and demands a larger number of views to be generated. An alternative solution is to increase the number of prefix groups () by sacrificing some prefix relations among IPs, e.g., grouping them based on the first octets.
Utility: The main advantage of the multi-view approach is that it can preserve the data utility while protecting privacy. In particular, we have shown that the data owner can receive an analysis report based on the real view (), which is prefix-preserving over the entire trace. This is more accurate than obfuscated (through bucketization and suppression) or perturbed (through adding noise and aggregation) approaches. Specifically, in case of a security breach, the data owner can easily compute (the migration output) to find the mapped IP addresses corresponding to each original address. The data owner can then apply the necessary security policies to the IP addresses reported as violating some policies in . A limitation of our work is that it only preserves the prefixes of IPs; a potential future direction is to apply our approach to other property-preserving encryption methods such that other properties may be preserved similarly.

Communication/Computation Cost: One of our contributions in this paper is to minimize the communication overhead by outsourcing only one (seed) view and some supplementary parameters. This is especially critical for large-scale network data such as network traces from major ISPs. On the other hand, one of the key challenges of the multi-view approach is that it requires times the computation for both generating the views and the analysis.
Our experiments in Figure 11 show that generating views for a trace of packets takes approximately minutes, and we present analytic complexity results in Tables 3 and 4. We note that the practicality of times the computation depends mainly on the type of analysis, and it may certainly become impractical for some analyses under large . Enabling analysts to conduct analysis tasks over multiple views more efficiently, through techniques like caching, is an interesting future direction. Another direction is to devise more accurate measures for the data owner to more precisely determine the number of views required to reach a certain privacy requirement.
5. Experiments
This section evaluates our multiview scheme through experiments with real data.
5.1. Setup
To validate our multi-view anonymization approach, we use a set of network traces collected by a real ISP. We focus on the attributes , , and in our experiments; the metadata are summarized in the table in Figure 9(a).
In order to measure the security of the proposed approach, we implement the frequency analysis attack (naveed, ; brekene1, ). This attack can compromise individual addresses protected by existing prefix-preserving anonymization in multilinear time (brekene1, ). We stress that in the setting of EDBs (encrypted database systems), an attack is successful if it recovers even partial information about a single cell of the DB (naveed, ). Accordingly, we define an information leakage metric to evaluate the effectiveness of our solution against the adversary's semantic attacks. Several measures have been proposed in the literature (PP, ; Ribeiro, ) to evaluate the impact of semantic attacks. Motivated by (PP, ), we model the information leakage (number of matches) as the number of records/packets whose original IP addresses are known by the adversary, either fully or partially. More formally,
Information leakage metric (PP, ):
We measure , defined as the total number of addresses that have at least most significant bits known, where .
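A simplified version of this metric can be computed as follows. The helper and function names are ours, and we fix the illustration to counting trace addresses whose k most significant bits appear in the adversary's recovered knowledge.

```python
def msb(ip: str, bits: int) -> int:
    """The `bits` most significant bits of a dotted-quad IPv4 address,
    as an integer."""
    n = sum(int(octet) << (24 - 8 * i) for i, octet in enumerate(ip.split(".")))
    return n >> (32 - bits)

def leakage(trace_ips, known_ips, k=8):
    """Simplified information-leakage metric (our formulation): the number
    of addresses in the trace whose k most significant bits match a prefix
    the adversary has recovered."""
    known = {msb(ip, k) for ip in known_ips}
    return sum(1 for ip in trace_ips if msb(ip, k) in known)
```

For instance, with k=8, recovering any address in 10.0.0.0/8 exposes every trace address sharing that first octet.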
To model adversarial knowledge, we define a set of prefixes known by the adversary, ranging from up to of all the prefixes in the trace. This knowledge is stored in a two-dimensional vector that includes the different addresses and their key indexes.
Next, using our multi-view scheme, we generate all the views. However, before applying the frequency analysis attack, we simulate how an adversary may eliminate some fake views from further consideration, as follows. For each view, we check whether two addresses from the adversary's knowledge set with different prefixes now share a prefix in that view. If we find such a match in the key indices, the corresponding view is discarded from the set of real view candidates and is not considered in our experiments, since the adversary would know it is a fake view.
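The elimination check can be sketched as follows. This is our own simplified rendering: it compares first-octet prefixes only, and all names are hypothetical.

```python
def consistent_view(view_of: dict, known_prefix: dict) -> bool:
    """Fake-view elimination check (sketch). `view_of` maps each address
    the adversary recognized (e.g., via fingerprints) to its form in one
    view; `known_prefix` maps the same addresses to their known original
    prefix. If two addresses with different original prefixes share a
    prefix in the view, the view cannot be real."""
    def pfx(ip):  # first-octet prefix, for illustration
        return ip.split(".")[0]
    ips = list(view_of)
    for i in range(len(ips)):
        for j in range(i + 1, len(ips)):
            a, b = ips[i], ips[j]
            if known_prefix[a] != known_prefix[b] and pfx(view_of[a]) == pfx(view_of[b]):
                return False
    return True
```

Views failing this check are removed from the candidate set before the frequency analysis attack is applied to the survivors.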
We validate the effectiveness of our scheme by reporting the number of real view candidates and the percentage of packets in the trace that are compromised (i.e., the percentage of IP packets whose addresses have at least their eight most significant bits known). Each experiment is repeated more than times, and the end results are the averages of the frequency analysis algorithm applied to each of the real view candidates.
Evaluations of utility preservation and of the scalability of using ORAM in our scheme are discussed in Appendices B.2 and B.3, respectively.
We conduct all experiments on a machine running Windows with an Intel(R) Core(TM) i7-6700 3.40 GHz CPU, 4 GB of memory, and 500 GB of storage.
5.2. Results
5.2.1. Information Leakage Analysis
First, the numerical results for the indistinguishability parameter under different levels of adversary knowledge are depicted in Figure 9(b). These results correspond to three different cases, i.e., when addresses are grouped based on (1) only the first octet ( groups), (2) the first and second octets ( groups), and (3) the first three octets ( groups). As we can see from the results, decreases (meaning more privacy) as the number of prefix groups increases, and it increases as the amount of adversarial knowledge increases.
We next validate those numerical results through the experiments in Figure 10. Specifically, we first analyze the behavior of our second multi-view scheme (introduced in Section 4.2) before comparing the two schemes in Appendix B. Figure 10 presents different facets of information leakage when our approach is applied in various grouping cases. The results in Figure 10 are for adversaries whose knowledge covers no more than of the prefix groups (Figure 13 in Appendix B.1 presents more extreme cases for the same experiments, i.e., up to knowledge). These figures are analyzed in detail in the following.
Effect of the number of prefix groups: As discussed earlier, three different IP grouping cases are studied. Figures 10(a) and (d) show, respectively, the packet leakage and the number of real view candidates when . As the numerical results in Figure 8 anticipate, because the fraction is relatively low, the indistinguishability of the generated views diminishes, especially for stronger adversary knowledge. Consequently, the adversary discards more views and the leakage rate increases, compared with Figures 10(b), (e) and Figures 10(c), (f), for which the fractions are and , respectively. In particular, for the worst case of adversary knowledge, and when the number of views is less than , we can verify that the number of real view candidates for case (1) remains , resulting in packet leakage comparable to that of CryptoPAn.
Effect of the number of views: As illustrated in the figure, increasing the number of views always improves both the number of real view candidates and the packet leakage. All the figures for the real-view-candidate evaluation show a near-linear improvement, where the slope of the improvement depends inversely on the adversary's knowledge. For the packet leakage, we note that the improvement converges to a small leakage rate under a large number of views. This is reasonable, as each packet leakage result is an average of the leakages over all the real view candidates. However, since each fake view leaks a certain amount of information, increasing the number of views beyond a certain value no longer affects the end result. In other words, the packet leakage converges to the average of the leakages in the (fake) real view candidates. Finally, the results show that our proposed scheme can more efficiently improve privacy by (1) increasing the fraction (number of views/number of distinct addresses) or (2) increasing the number of views. The first option may affect utility (since inter-group prefix relations will be removed), while the second is more aligned with our objective of trading off privacy against computation.
5.2.2. Computational Overhead Evaluation
We evaluate the computational overhead incurred by our approach. Figure 11 shows the time required by our scheme in each grouping case as the number of views varies, for a trace including one million packets. We observe that, as the number of views increases, the computational overhead increases near-linearly. However, each case shows a different slope depending on the number of groups. This is reasonable, as our second scheme generates key vectors with a larger number of elements for more groups, which leads to applying CryptoPAn for more iterations (see the complexity analysis in Appendix D). Finally, linking this figure to the information leakage results shown in Figure 10 demonstrates the tradeoff between privacy and computational overhead.
6. Related Work
In the context of network trace anonymization, as surveyed in (Mivule, ), many solutions have been proposed (flaim, ; ref6, ; brekene1, ; paxon1, ; riboni, ). Generally, these may be classified into different categories, such as enumeration (Farah, ), partitioning (slagell, ), and prefix-preserving (Xu, ; Gattani, ). These methods include removing rows or attributes, suppression, and generalization of rows or attributes (challengeof, ). Some of the solutions (Ribeiro, ; paxon1, ) are designed to address specific attacks and are generally based on permuting some fields of the network trace to blur the adversary's knowledge. Later studies either prove theoretically (brekene2, ) or validate empirically (burkhart, ) that those works may be defeated by semantic attacks. As our proposed anonymization solution falls into the category of prefix-preserving solutions, which aim to improve utility, we review some of the proposed solutions in this category in more detail. The first effort toward prefix-preserving anonymization was made by Greg Minshall (greg, ), who developed TCPdpriv, a table-based approach that generates an anonymization function randomly. Fan et al. (PP, ) then developed CryptoPAn with a completely cryptographic approach. Several publications (brekene1, ; Ribeiro, ; paxon1, ) have since raised the vulnerability of this scheme to semantic attacks, which motivated query-based (mcsherry, ) and bucketization-based (riboni, ) solutions. In the following, we review those works in more detail.
Among the works that address such semantic attacks, Riboni et al. (riboni, ) propose a (k,j)-obfuscation methodology for network traces. In this method, a flow is considered obfuscated if it cannot be linked, with greater assurance, to its (source and destination) IPs. First, network flow fields are divided into confidential IP attributes and other fields that can be used to attack. Groups of flows having similar fingerprints are then created and bucketed, based on their fingerprints, into groups of size . However, utility remains a challenge in this solution, as the network flows are heavily sanitized, i.e., each flow is blurred inside a bucket of flows having similar fingerprints. An alternative to the aforementioned solutions, called mediated trace analysis (mediated1, ; mediated2, ), consists in performing the data analysis on the data owner's side and outsourcing analysis reports to the researchers requesting the analysis. In this case, data can only be analyzed where it is originally stored, which may not always be practical, and the outsourced report still needs to be sanitized prior to its release (mcsherry, ). In contrast to those existing solutions, our approach improves both privacy and utility at the cost of a higher computational overhead. Table 2 summarizes the most important network trace anonymization schemes over the past twenty years (Mivule, ) and their main characteristics.
The last step of our solution requires the data owner to privately retrieve the audit report of the real view, which can be based on existing private information retrieval (PIR) techniques. A PIR approach usually aims to conceal the objective of each query independently of all previous queries (orpir, ; pir2, ). Since the sequence of accesses is not hidden by PIR while each individual access is, the amortized cost is equal to the worst-case cost (orpir, ). Moreover, since the server computes over the entire database for each individual query, PIR is often impractical for large databases. On the other hand, ORAM (oram1, ) has verifiably low amortized communication complexity and does not require much computation on the server, but rather periodically requires the client to download and reshuffle the data (orpir, ). For our multi-view scheme, we choose ORAM as it is relatively more efficient and secure, and the client (the data owner in our case) has sufficient computational power and storage to locally store a small number of blocks (audit reports in our case) in a local stash.
7. Conclusion
In this paper, we have proposed a multi-view anonymization approach that mitigates the semantic attacks on CryptoPAn while preserving the utility of the trace. This novel approach shifts the tradeoff from between privacy and utility to between privacy and computational cost; the latter has seen significant decreases with the advance of technology, making our approach a preferable solution for applications that demand both privacy and utility. Our experimental results show that our proposed approach significantly reduces the information leakage compared to CryptoPAn. For example, in the extreme case of 100% adversary pre-knowledge, the information leakage of CryptoPAn was 100% while under our approach it was still less than 10%. Besides addressing the various limitations discussed in Section 4.3, our future work will adapt the idea to improve existing privacy-preserving solutions in other areas; e.g., we will extend our work to the multi-party problem in which several data owners are willing to share their traces to mitigate coordinated network reconnaissance by means of distributed (or inter-domain) audit (sepia, ).
8. Acknowledgments
The authors thank the anonymous reviewers for their valuable comments. We appreciate Momen Oqaily's support with the implementation. This work is partially supported by the Natural Sciences and Engineering Research Council of Canada and Ericsson Canada under CRD Grant N01823. The research of Yuan Hong is partially supported by the National Science Foundation under Grant No. CNS-1745894 and the WISER ISFG grant.
References
 (1) Ding, Wen, William Yurcik, and Xiaoxin Yin. ”Outsourcing internet security: Economic analysis of incentives for managed security service providers.” In International Workshop on Internet and Network Economics, pp. 947958. Springer Berlin Heidelberg, 2005.
 (2) Riboni, Daniele, Antonio Villani, Domenico Vitali, Claudio Bettini, and Luigi V. Mancini. ”Obfuscation of sensitive data in network flows.” In INFOCOM, 2012 Proceedings IEEE, pp. 23722380. IEEE, 2012.
 (3) J. Fan, J. Xu, M. Ammar, and S. Moon. Prefixpreserving IP Address Anonymization: Measurementbased Security Evaluation and a New Cryptographybased Scheme. Computer Networks, 46(2):263272, October 2004.
 (4) Brekne, T., Årnes, A., & Øslebø, A. (2005, May). Anonymization of ip traffic monitoring data: Attacks on two prefixpreserving anonymization schemes and some proposed remedies. In International Workshop on Privacy Enhancing Technologies (pp. 179196). Springer Berlin Heidelberg.
 (5) Brekne, Tønnes, and André Årnes. ”Circumventing IPaddress pseudonymization.” In Communications and Computer Networks, pp. 4348. 2005.
 (6) T.F. Yen, X. Huang, F. Monrose, and M. K. Reiter, ”Browser fingerprinting from coarse traffic summaries: Techniques and implications,” in Proc. of Detection of Intrusions and Malware and Vulnerability Assessment, vol. 5587. Springer, 2009, pp. 157175.
 (7) M. Burkhart, D. Brauckhoff, M. May, and E. Boschi, ”The riskutility tradeoff for IP address truncation.” In Proceedings of the 1st ACM workshop on Network data anonymization, 2008, pp. 2330.
 (8) Pang, R., Allman, M., Paxson, V. and Lee, J., 2006. The devil and packet trace anonymization. ACM SIGCOMM Computer Communication Review, 36(1), pp.2938.
 (9) Coull, Scott E., Michael P. Collins, Charles V. Wright, Fabian Monrose, and Michael K. Reiter. ”On Web Browsing Privacy in Anonymized NetFlows.” In USENIX Security. 2007.
 (10) Wong, Wai Kit, David W. Cheung, Edward Hung, Ben Kao, and Nikos Mamoulis. ”Security in outsourcing of association rule mining.” In Proceedings of the 33rd international conference on Very large data bases, pp. 111122. VLDB Endowment, 2007.
 (11) Tai, C. H., Yu, P. S., and Chen, M. S. (2010, July). kSupport anonymity based on pseudo taxonomy for outsourcing of frequent itemset mining. In Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 473482). ACM.
 (12) Slagell, Adam J., Kiran Lakkaraju, and Katherine Luo. ”FLAIM: A Multilevel Anonymization Framework for Computer and Network Logs.” In LISA, vol. 6, pp. 38. 2006.
 (13) Foukarakis, Michael, Demetres Antoniades, and Michalis Polychronakis. ”Deep packet anonymization.” In Proceedings of the Second European Workshop on System Security, pp. 1621. ACM, 2009.
 (14) M. Burkhart, D. Schatzmann, B. Trammell, E. Boschi, and B. Plattner, The role of network trace anonymization under attack, Computer Communication Review, vol. 40, no. 1, pp. 511, 2010.
 (15) Mogul, Jeffrey C., and Martin Arlitt. ”Sc2d: an alternative to trace anonymization.” In Proceedings of the 2006 SIGCOMM workshop on Mining network data, pp. 323328. ACM, 2006.
 (16) Mittal, Prateek, Vern Paxson, Robin Sommer, and Mark Winterrowd. ”Securing Mediated Trace Access Using Blackbox Permutation Analysis.” In HotNets. 2009.
 (17) McSherry, Frank, and Ratul Mahajan. ”Differentiallyprivate network trace analysis.” In ACM SIGCOMM Computer Communication Review, vol. 40, no. 4, pp. 123134. ACM, 2010.
 (18) K. Mivule and B. Anderson, ”A study of usabilityaware network trace anonymization,” 2015 Science and Information Conference (SAI), London, 2015, pp. 12931304.
 (19) T. Farah, and L. Trajkovic, ”Anonym: A tool for anonymization of the Internet traffic.” In IEEE 2013 International Conference on Cybernetics (CYBCONF), 2013, pp. 261266.
 (20) Mayberry, Travis, ErikOliver Blass, and Agnes Hui Chan. ”Efficient Private File Retrieval by Combining ORAM and PIR.” In NDSS. 2014.
 (21) A.J. Slagell, K. Lakkaraju, and K. Luo, ”FLAIM: A Multilevel Anonymization Framework for Computer and Network Logs.” In LISA, vol. 6, 2006, pp. 38.
 (22) J. Xu, J. Fan, M.H. Ammar, and Sue B. Moon, ”Prefixpreserving ip address anonymization: Measurementbased security evaluation and a new cryptographybased scheme.”, In 10th IEEE International Conference on Network Protocols, 2002, pp. 280289.
 (23) Wang, Xiao Shaun, Yan Huang, TH Hubert Chan, Abhi Shelat, and Elaine Shi. ”SCORAM: oblivious RAM for secure computation.” In Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security, pp. 191202. ACM, 2014.
 (24) W. Yurcik, C. Woolam, G. Hellings, L. Khan, B. Thuraisingham, ”Measuring anonymization privacy/analysis tradeoffs inherent to sharing network data”, IEEE Network Operations and Management Symposium, 2008, pp.991994.
 (25) Gattani, Shantanu, and Thomas E. Daniels. ”Reference models for network data anonymization.” In Proceedings of the 1st ACM workshop on Network data anonymization, pp. 4148. ACM, 2008.
 (26) Zhang, Jianqing, Nikita Borisov, and William Yurcik. ”Outsourcing security analysis with anonymized logs.” In Securecomm and Workshops, 2006, pp. 19. IEEE, 2006.
 (27) Saroiu, Stefan, P. Krishna Gummadi, and Steven D. Gribble. ”Measurement study of peertopeer file sharing systems.” In Electronic Imaging 2002, pp. 156170. International Society for Optics and Photonics, 2001.
 (28) Chor, Benny, et al. ”Private information retrieval.” Foundations of Computer Science, 1995. Proceedings., 36th Annual Symposium on. IEEE, 1995.
 (29) Biler, Piotr, and Alfred Witkowski. ”Problems in mathematical analysis.” (1990).
 (30) Zhang, Q., & Li, X. (2006, January). An IP address anonymization scheme with multiple access levels. In International Conference on Information Networking (pp. 793802). Springer Berlin Heidelberg.
 (31) B. Ribeiro, W. Chen, G. Miklau, and D. Towsley, Analyzing privacy in enterprise packet trace anonymization, in Proc. NDSS, San Diego, CA, Feb. 2008, pp. 87100
 (32) Coull, Scott E., Charles V. Wright, Fabian Monrose, Michael P. Collins, and Michael K. Reiter. ”Playing Devil’s Advocate: Inferring Sensitive Information from Anonymized Network Traces.” In NDSS, vol. 7, pp. 3547. 2007.
 (33) Yurcik, William, and Yifan Li. "Internet security visualization case study: Instrumenting a network for NetFlow security visualization tools." In 21st Annual Computer Security Applications Conference (ACSAC), 2005.
 (34) Coull, S. E., Monrose, F., Reiter, M. K., and Bailey, M. "The challenges of effectively anonymizing network data." In Conference For Homeland Security, CATCH '09. Cybersecurity Applications & Technology, pp. 230–236. IEEE, 2009.
 (35) Del Piccolo, Valentin, et al. "A survey of network isolation solutions for multi-tenant data centers." IEEE Communications Surveys & Tutorials 18.4 (2016): 2787–2821.
 (36) Dwork, C. "Differential privacy." In Proceedings of the 33rd International Colloquium on Automata, Languages and Programming (ICALP), ser. Lecture Notes in Computer Science, vol. 4052. Springer-Verlag, 2006.
 (37) Dwork, Cynthia. "Differential privacy: A survey of results." In International Conference on Theory and Applications of Models of Computation, pp. 1–19. Springer, Berlin, Heidelberg, 2008.
 (38) Kushilevitz, Eyal, and Rafail Ostrovsky. "Replication is not needed: Single database, computationally-private information retrieval." In 38th Annual Symposium on Foundations of Computer Science, pp. 364–373. IEEE, 1997.
 (39) Goldreich, Oded, and Rafail Ostrovsky. "Software protection and simulation on oblivious RAMs." Journal of the ACM 43, no. 3 (May 1996): 431–473.
 (40) Dwork, C., McSherry, F., Nissim, K., and Smith, A. "Calibrating noise to sensitivity in private data analysis." In Proceedings of the Third Theory of Cryptography Conference, pp. 265–284, 2006.
 (41) Le Ny, Jerome, and Meisam Mohammady. "Differentially private MIMO filtering for event streams." IEEE Transactions on Automatic Control 63, no. 1 (2018): 145–157.
 (42) Le Ny, Jerome, and Meisam Mohammady. "Differentially private MIMO filtering for event streams and spatio-temporal monitoring." In 53rd IEEE Conference on Decision and Control (CDC), pp. 2148–2153. IEEE, 2014.
 (43) Yao, Andrew Chi-Chih. "How to generate and exchange secrets." In 27th Annual Symposium on Foundations of Computer Science, pp. 162–167. IEEE, 1986.
 (44) Goldreich, Oded. "Secure multi-party computation." Manuscript, preliminary version (1998): 86–97.
 (45) Cormen, Thomas H., et al. "Data structures for disjoint sets." In Introduction to Algorithms (2001): 498–524.
 (46) Sedgewick, Robert. "Implementing quicksort programs." Communications of the ACM 21.10 (1978): 847–857.
 (47) Slagell, Adam, Jun Wang, and William Yurcik. "Network log anonymization: Application of CryptoPAn to Cisco NetFlows." In Proceedings of the Workshop on Secure Knowledge Management, 2004.
 (48) Minshall, G. TCPdpriv command manual, 1996. http://ita.ee.lbl.gov/html/contrib/tcpdpriv.0.txt
 (49) Paul, Ruma R., Victor C. Valgenti, and Min Sik Kim. "Real-time Netshuffle: Graph distortion for online anonymization." In 19th IEEE International Conference on Network Protocols (ICNP), pp. 133–134. IEEE, 2011.
 (50) Aggarwal, Gagan, Tomás Feder, Krishnaram Kenthapadi, Samir Khuller, Rina Panigrahy, Dilys Thomas, and An Zhu. "Achieving anonymity via clustering." In Proceedings of the Twenty-fifth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pp. 153–162. ACM, 2006.
 (51) Burkhart, M., Strasser, M., Many, D., and Dimitropoulos, X. "SEPIA: privacy-preserving aggregation of multi-domain network events and statistics." In USENIX Security Symposium, pp. 223–240, 2010.
 (52) Boldyreva, Alexandra, Nathan Chenette, Younho Lee, and Adam O'Neill. "Order-preserving symmetric encryption." In Annual International Conference on the Theory and Applications of Cryptographic Techniques, pp. 224–241. Springer, Berlin, Heidelberg, 2009.
 (53) Curtmola, Reza, Juan Garay, Seny Kamara, and Rafail Ostrovsky. "Searchable symmetric encryption: improved definitions and efficient constructions." Journal of Computer Security 19, no. 5 (2011): 895–934.
 (54) Song, Dawn Xiaodong, David Wagner, and Adrian Perrig. "Practical techniques for searches on encrypted data." In Proceedings of the 2000 IEEE Symposium on Security and Privacy, pp. 44–55. IEEE, 2000.
 (55) Gentry, Craig. "Fully homomorphic encryption using ideal lattices." In Proceedings of the Forty-first Annual ACM Symposium on Theory of Computing (STOC '09), pp. 169–178. ACM, 2009.
 (56) Boneh, Dan, Amit Sahai, and Brent Waters. "Functional encryption: Definitions and challenges." In Theory of Cryptography Conference, pp. 253–273. Springer, Berlin, Heidelberg, 2011.
 (57) Bellare, Mihir, Alexandra Boldyreva, and Adam O'Neill. "Deterministic and efficiently searchable encryption." In Annual International Cryptology Conference, pp. 535–552. Springer, Berlin, Heidelberg, 2007.
 (58) Boldyreva, Alexandra, Nathan Chenette, Younho Lee, and Adam O'Neill. "Order-preserving symmetric encryption." In Annual International Conference on the Theory and Applications of Cryptographic Techniques, pp. 224–241. Springer, Berlin, Heidelberg, 2009.
 (59) Islam, Mohammad Saiful, Mehmet Kuzu, and Murat Kantarcioglu. "Access pattern disclosure on searchable encryption: Ramification, attack and mitigation." In NDSS, vol. 20, p. 12, 2012.
 (60) Naveed, Muhammad, Seny Kamara, and Charles V. Wright. "Inference attacks on property-preserving encrypted databases." In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, pp. 644–655. ACM, 2015.
 (61) Hautakorpi, Jani, and Gonzalo Camarillo Gonzalez. "IP address distribution in middleboxes." U.S. Patent Application No. 12/518,452.
 (62) Chang, Zhao, Dong Xie, and Feifei Li. "Oblivious RAM: a dissection and experimental evaluation." Proceedings of the VLDB Endowment 9, no. 12 (2016): 1113–1124.
 (63) Caswell, Brian, and Jay Beale. Snort 2.1 Intrusion Detection. Elsevier, 2004.
 (64) Stefanov, E., Van Dijk, M., Shi, E., Fletcher, C., Ren, L., Yu, X., and Devadas, S. "Path ORAM: an extremely simple oblivious RAM protocol." In Proceedings of the 2013 ACM SIGSAC Conference on Computer & Communications Security, pp. 299–310. ACM, 2013.
Appendix A Proofs
A.1. Proof of Theorem 3.1
Proof.
We must show that  holds; we proceed by induction.
For the base case, where  is the empty string,  is a constant bit in  that depends only on the padding function  and the cryptographic key . For the inductive step, assume  holds; thus:
∎
A.2. Proof of Lemma 4.6
Proof.
Suppose there exists an algorithm  that, given our setup, returns the smallest set of key vectors needed to reverse the seed trace and obtain the minimum number of real-view candidates. Also, denote by  the set of those key vectors when the adversary follows Scheme II. We now show that if  holds, then . First, note that the number of key indices of the distinct addresses in  is . Therefore, the adversary has to guess . However, the elements of  lie in , so there are  different combinations for . Thus, to minimize this number, the adversary has to use the outsourced parameters, which means . However, we showed earlier that all these inputs are trivial leakage. Therefore, if  holds, then . ∎
Appendix B Experiments
In this section, we experimentally measure the security of the proposed approach against very strong adversaries. In addition, we evaluate the utility of the approach using two real network analyses. Finally, we justify the choice of ORAM in our setup, drawing on a comprehensive study of ORAM scalability in the literature.
B.1. Privacy Evaluation Against Very Strong Adversaries
Figure 13 shows the leakage and real-view candidate results for stronger adversaries (). Note that this figure only shows results for cases (2) and (3), as the results in case (1) do not show a significant improvement over the CryptoPAn results, because the multi-view approach with a fraction of  cannot defeat the adversary's knowledge ().
B.2. Utility Evaluation Using Real-life Network Analytics
Figure 12 shows the results of two different network analytics over the original trace (1M records), the real view, and one of the fake views generated by our multi-view solution. In the first experiment, we present the IP distribution (IPdist, ) in the trace, reporting the number of distinct addresses within each subnet (IP group). We compare the distribution of distinct IP addresses inside the aforementioned three traces, both for the temporal distribution (subnets indexed by their time stamps) and for the cardinality-based distribution (subnets indexed by their cardinalities). We found that the results (both distributions) generated from the original trace and the real view are identical (see Figure 12(a)). This is expected, because the real view is a prefix-preserving mapping of the IPs that keeps the fpQI attributes intact, preserving both distributions. Moreover, the cardinality-based distribution generated from the fake view is identical to those of the original trace and the real view (see Figure 12(c)). Note that the latter results from the indistinguishability of our multi-view solution.
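The subnet-cardinality analysis in the first experiment can be sketched as follows (a minimal illustration, not the authors' implementation; the /24 grouping and the function name are our assumptions):

```python
from collections import defaultdict
import ipaddress

def subnet_cardinality(addresses, prefix_len=24):
    """Count distinct IP addresses per subnet (IP group)."""
    groups = defaultdict(set)
    for ip in addresses:
        # Map each address to its enclosing subnet, e.g. 10.0.0.7 -> 10.0.0.0/24
        net = ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False)
        groups[net].add(ip)
    return {str(net): len(ips) for net, ips in groups.items()}

counts = subnet_cardinality(["10.0.0.1", "10.0.0.2", "10.0.1.5"])
# Two distinct addresses fall in 10.0.0.0/24, one in 10.0.1.0/24.
```

Because a prefix-preserving mapping sends all addresses of one subnet to a common anonymized prefix, running this analysis on the real view yields the same per-group cardinalities as on the original trace.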
In the second experiment, we present a packet-level analytic (mcsherry, ). In particular, Figure 12(d,e) shows the empirical cumulative distribution function (ECDF) results for the three traces. Our results clearly show that the original trace and our scheme produce identical results, as multi-view has no impact on the fingerprinting quasi-identifier attributes.
B.3. Multi-view and the Scalability of ORAM
In practice, we expect analysis reports to be significantly smaller than the views, and considering the one-round communication and the ORAM complexity, we believe the solution would have acceptable scalability. Experiments using our dataset and an existing ORAM implementation (an implementation (dong, ) of non-recursive Path ORAM (path, ) has been made public) would further confirm this. We generated various sets of analysis reports using Snort (snort, ), and found that, for our dataset, the sizes of the audit reports are in the KB range, which is well suited to fast ORAM protocols, e.g., Path ORAM. Specifically, for Path ORAM, Figure 5 (b) in (dong, ) shows less than 1MB of communication overhead for the worst-case cost with up to  blocks of size 4KB.
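As a rough sanity check of the communication cost quoted above, the per-access bandwidth of non-recursive Path ORAM can be estimated with a back-of-the-envelope calculation (a sketch under textbook assumptions: bucket capacity Z = 4, and one full path of buckets read and written back per access; these parameters are ours, not necessarily those used in (dong, )):

```python
import math

def path_oram_access_bytes(n_blocks, block_size=4096, z=4):
    """Approximate per-access communication for non-recursive Path ORAM.

    One access reads a root-to-leaf path of ceil(log2 N) + 1 buckets,
    each holding Z blocks, and writes the same path back.
    """
    path_len = math.ceil(math.log2(n_blocks)) + 1
    return 2 * path_len * z * block_size

# For about one million 4 KB blocks:
cost = path_oram_access_bytes(2**20)
```

With these assumptions the estimate for a million blocks stays under 1 MB per access, which is consistent with the order of magnitude reported in Figure 5 (b) of (dong, ).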
Appendix C Algorithms
The following algorithms are summarized versions of the data owner's and the analyst's roles in our multi-view scheme, as presented in Section 4.
Algorithm 1: The data owner’s actions (scheme I).
Algorithm 2: The analyst’s actions (scheme I).
Algorithm 3: The data owner’s actions (scheme II).
Algorithm 4: The analyst’s actions (scheme II).
Appendix D Complexity Analysis
Here, we discuss the overhead analysis from both the data owner's and the data analyst's sides. In particular, Table 3 summarizes the overhead of all the action items on the data owner's side, where  is the computation overhead of CryptoPAn and  is the number of distinct IP addresses. Finally, Table 4 summarizes the overhead of all the action items on the data analyst's side, where  is the cost of verifying the compliances (auditing)  times.