To manage massive amounts of data in the wild, modern storage systems employ deduplication to eliminate content duplicates and save storage space. The common deduplication approach is to store only data copies, called chunks, that have unique content among all already stored chunks. Field studies have demonstrated that deduplication achieves significant storage savings in production, for example, 50% in primary storage and up to 98% in backup storage. Deduplication is also adopted by commercial cloud storage services (e.g., Dropbox, Google Drive, and Bitcasa) for cost-efficient outsourced data management [26, 41].
In the security context, combining encryption and deduplication, referred to as encrypted deduplication, is essential for protecting against information leakage in deduplication storage. Conventional (symmetric) encryption requires that users encrypt data with their own distinct secret keys. As a result, duplicate plaintext chunks will be encrypted into distinct ciphertext chunks, thereby prohibiting deduplication across different users. To preserve deduplication effectiveness, encrypted deduplication requires that each chunk be encrypted with a secret key derived from the chunk content itself, so that identical plaintext chunks are always encrypted into identical ciphertext chunks for deduplication. Bellare et al. propose a cryptographic primitive called message-locked encryption (MLE) to formalize the key derivation requirement, and convergent encryption is one classical instantiation of MLE that derives the secret key from the hash of a chunk. On top of MLE, several storage systems address additional security issues, such as brute-force attacks, key management failures, side-channel attacks, and access control.
However, we argue that existing MLE schemes cannot fully protect against information leakage, mainly because their encryption approaches are deterministic. That is, each ciphertext chunk is encrypted by a key that is deterministically derived from the original plaintext chunk. Thus, an adversary, which can be malicious users or storage system administrators, can analyze the frequency distribution of ciphertext chunks and infer the original plaintext chunks based on frequency analysis. We observe that practical deduplication storage workloads often exhibit skewed frequency distributions in terms of the occurrences of chunks with the same content. Figure 1 justifies our observation, by depicting the skewed frequency distributions of chunks in the real-world FSL and VM datasets (see Section 5 for the dataset details). For example, the FSL dataset has 99.8% of chunks occurring less than 100 times, while around 30 chunks occur over 10,000 times; the VM dataset has 97% of chunks occurring less than 100 times, while around 15,000 chunks (or 0.04% of chunks) occur over 10,000 times. Such skewed frequency distributions allow the adversary to accurately differentiate chunks by their frequencies via frequency analysis. On the other hand, while frequency analysis is a historically well-known cryptanalysis attack, the practical implications of frequency analysis against encrypted deduplication remain unexplored.
In this paper, we conduct an in-depth study of how frequency analysis practically affects information leakage in encrypted deduplication. Our study spans both attack and defense perspectives, and is specifically driven by the characteristics of storage workloads in deduplication systems.
On the attack side, we propose a new inference attack called the locality-based attack, which extends classical frequency analysis to accurately infer ciphertext-plaintext chunk pairs in encrypted deduplication. The main novelty of the locality-based attack is to exploit chunk locality, a common property in practical backup workloads. Chunk locality states that chunks are likely to re-occur together with their neighboring chunks across different versions of backups, mainly because in practice, changes to backups often appear in a few clustered regions of chunks, while the remaining regions of chunks will appear in the same order as in previous backups. Previous studies have exploited chunk locality to improve deduplication performance and mitigate indexing overhead (e.g., [57, 37, 55]). Here, we adapt this idea from a security perspective into frequency analysis: if a plaintext chunk corresponds to a ciphertext chunk, then the neighbors of that plaintext chunk are likely to correspond to the neighbors of that ciphertext chunk.
Our trace-driven evaluation, using both real-world and synthetic datasets, shows that the locality-based attack can infer significantly more ciphertext-plaintext pairs than classical frequency analysis. For example, for the real-world FSL dataset, the locality-based attack can infer up to 23.2% of the latest backup data, while the basic attack that directly applies classical frequency analysis can only infer 0.0001% of data. If a limited fraction (e.g., 0.2%) of plaintext information of the latest backup is leaked, the inference rate of the locality-based attack can increase up to 27.1%.
We further combine the locality-based attack with the knowledge of chunk sizes, and propose an advanced locality-based attack against variable-size chunks. The advanced locality-based attack maps ciphertext chunks to some plaintext chunks with similar sizes, and further increases the inference rate.
Our inference attacks are harmful in practice, even though the underlying symmetric encryption remains secure. One security implication of our inference attacks is that they can identify critical chunks in an encrypted backup snapshot. Given the plaintext chunks of some critical files (e.g., password files) in an old backup, an adversary can infer the ciphertext chunks in the latest backup corresponding to those critical plaintext chunks. It can then launch specific attacks against such identified ciphertext chunks; for example, by deliberately corrupting such ciphertext chunks, the adversary can make the underlying critical plaintext information unrecoverable.
On the defense side, we present two defense approaches to combat the inference attacks. The first one is MinHash encryption, which derives a common encryption key based on a set of adjacent chunks, such that some identical plaintext chunks can be encrypted into multiple distinct ciphertext chunks. Note that MinHash encryption has been shown to effectively reduce the overhead of server-aided MLE; here we show how it can also be used to break the deterministic nature of encrypted deduplication and disturb the frequency ranking of ciphertext chunks. The second one is scrambling, which randomly shuffles the original chunk ordering during the deduplication process in order to break chunk locality. Our trace-driven evaluation shows that the combined MinHash encryption and scrambling scheme can suppress the inference rate to only 0.23% for the FSL dataset.
We also evaluate the storage efficiency and deduplication performance of the combined MinHash encryption and scrambling scheme. First, the combined scheme maintains the high storage saving achieved by deduplication, and its storage saving is only up to 3.6% less than that of the original MLE, which uses chunk-based deduplication. In addition, we build a realistic deduplication prototype based on DDFS and evaluate the on-disk metadata access overhead. We show that the combined scheme incurs up to 1.2% additional metadata access overhead compared to the original MLE, and it incurs even less metadata access overhead when there is sufficient memory for metadata caching. Our findings suggest that the combined scheme adds limited overhead to both storage efficiency and deduplication performance in practical deployment, while effectively defending against frequency analysis.
The source code of our attack and defense implementations as well as the deduplication prototype is available at: http://adslab.cse.cuhk.edu.hk/software/freqanalysis.
The remainder of the paper proceeds as follows. Section 2 reviews the basics of encrypted deduplication and frequency analysis. Section 3 defines the threat model. Section 4 presents our proposed inference attacks based on frequency analysis. Section 5 presents the evaluation results of our proposed inference attacks. Section 6 presents the defense schemes against the inference attacks. Section 7 presents the evaluation results of our defense schemes. Section 8 reviews the related work, and finally, Section 9 concludes the paper.
Deduplication can be viewed as a coarse-grained compression technique to save storage space. We focus on chunk-based deduplication that operates at the granularity of chunks. Specifically, a deduplication system partitions input data into variable-size chunks through content-defined chunking (e.g., Rabin fingerprinting), which identifies chunk boundaries that match specific content patterns so as to remain robust against content shifts. We can configure the minimum, average, and maximum chunk sizes in content-defined chunking for different granularities. After chunking, each chunk is identified by a fingerprint, which is computed from the cryptographic hash of the content of the chunk. Any two chunks are said to be identical if they have the same fingerprint, and the collision probability that two non-identical chunks have the same fingerprint is practically negligible. Deduplication requires that only one physical copy of identical chunks is kept in the storage system, while any identical chunk refers to the physical chunk via a small-size reference.
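The chunking and fingerprinting steps can be illustrated with a minimal sketch. It assumes a simple polynomial rolling hash over a sliding window as a stand-in for Rabin fingerprinting, and SHA-256 as the fingerprint function; the function names (`cdc_chunks`, `fingerprint`) and all parameter values are hypothetical choices for illustration, not those of any particular system.

```python
import hashlib

def cdc_chunks(data, min_size=64, avg_size=256, max_size=1024, window=16):
    """Split data at content-defined boundaries.

    Assumptions: avg_size is a power of two and min_size >= window, so a
    boundary decision depends only on the last `window` bytes of content.
    """
    mask = avg_size - 1
    base, mod = 257, (1 << 31) - 1
    top = pow(base, window - 1, mod)  # weight of the byte leaving the window
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        pos = i - start  # offset of this byte within the current chunk
        if pos >= window:
            # Slide the window: drop the oldest byte before adding the new one.
            h = (h - data[i - window] * top) % mod
        h = (h * base + byte) % mod
        size = pos + 1
        # Cut when the window hash matches the pattern, or at the maximum size.
        if (size >= min_size and (h & mask) == 0) or size >= max_size:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])  # trailing chunk may be shorter than min_size
    return chunks

def fingerprint(chunk):
    """Identify a chunk by the cryptographic hash of its content."""
    return hashlib.sha256(chunk).hexdigest()
```

Because a boundary depends on local content rather than on absolute offsets, inserting a byte near the start of the input typically changes only the chunks around the insertion point, while later chunks keep their fingerprints.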
To check if any identical chunk exists, the deduplication system maintains a fingerprint index, a key-value store that holds the mappings of all fingerprints to the addresses of physical chunks that are currently stored. For each file, the storage system also stores a file recipe that lists the references to all chunks of the file for future reconstruction.
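The roles of the fingerprint index and the file recipe can be sketched as follows. This is an in-memory toy under stated assumptions: the class name `DedupStore`, its methods, and the use of list positions as physical addresses are hypothetical, and a real system keeps these structures on disk.

```python
import hashlib

class DedupStore:
    """Toy chunk-based deduplication store (illustrative only)."""

    def __init__(self):
        self.fingerprint_index = {}  # fingerprint -> address of the physical chunk
        self.physical_chunks = []    # address -> chunk content
        self.file_recipes = {}       # file name -> ordered list of fingerprints

    def put_file(self, name, chunks):
        recipe = []
        for chunk in chunks:
            fp = hashlib.sha256(chunk).hexdigest()
            if fp not in self.fingerprint_index:
                # First occurrence: store one physical copy and index it.
                self.fingerprint_index[fp] = len(self.physical_chunks)
                self.physical_chunks.append(chunk)
            # Duplicates only add a small reference to the recipe.
            recipe.append(fp)
        self.file_recipes[name] = recipe

    def get_file(self, name):
        # Reconstruct the file by following the references in its recipe.
        return b"".join(
            self.physical_chunks[self.fingerprint_index[fp]]
            for fp in self.file_recipes[name]
        )
```

For example, storing the chunk lists `[b"x", b"y", b"x"]` and `[b"y", b"z"]` as two files keeps only three physical chunks, while both files remain reconstructible from their recipes.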
2.2 Encrypted Deduplication
Encrypted deduplication ensures that all physical chunks are encrypted for confidentiality (i.e., data remains secret from unauthorized users and even storage system administrators), while the ciphertext chunks that originate from identical plaintext chunks can still be deduplicated for storage savings. As stated in Section 1, message-locked encryption (MLE) is a formal cryptographic primitive for encrypted deduplication, in which each chunk is encrypted by a symmetric key that is derived from the chunk content itself. For example, convergent encryption is one popular MLE instantiation, and uses the cryptographic hash of a chunk as the corresponding symmetric key. This ensures that identical plaintext chunks are always encrypted into identical ciphertext chunks, thereby preserving deduplication effectiveness. Note that the encrypted deduplication system needs to maintain a key recipe for each user to track the per-chunk keys for future decryption. Each key recipe is encrypted by the user’s own secret key via conventional encryption for protection (see Section 3).
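The deterministic property of convergent encryption can be seen in a small sketch. To stay within the Python standard library, it assumes a toy stream cipher built from SHA-256 in counter mode as a stand-in for a real symmetric cipher such as AES; the function names are hypothetical, and the construction is for illustration only, not for protecting real data.

```python
import hashlib

def derive_key(chunk):
    """Convergent encryption: the key is the cryptographic hash of the chunk."""
    return hashlib.sha256(chunk).digest()

def _keystream(key, length):
    # Toy keystream (SHA-256 in counter mode); a stand-in for a real cipher.
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])

def encrypt(chunk):
    """Return (key, ciphertext); the key would be recorded in the key recipe."""
    key = derive_key(chunk)
    cipher = bytes(a ^ b for a, b in zip(chunk, _keystream(key, len(chunk))))
    return key, cipher

def decrypt(key, cipher):
    return bytes(a ^ b for a, b in zip(cipher, _keystream(key, len(cipher))))
```

Two users who encrypt the same plaintext chunk obtain the same key and the same ciphertext chunk, so the storage system can deduplicate the ciphertext without seeing the plaintext. The same property means that the frequency of a ciphertext chunk equals that of its plaintext chunk.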
MLE is inherently vulnerable to the offline brute-force attack, which allows an adversary to determine which plaintext chunk is encrypted into an input ciphertext chunk. The brute-force attack works as follows. Suppose that the adversary knows the set of chunks from which the underlying plaintext chunk is drawn. Then for each chunk from the set, the adversary computes the chunk-derived key (whose key derivation algorithm is supposed to be publicly available), encrypts the chunk with the chunk-derived key, and finally checks if the output ciphertext chunk is identical to the input ciphertext chunk. If so, the plaintext chunk is the answer. Thus, MLE can only achieve security for unpredictable chunks, meaning that the size of the set of chunks is sufficiently large, such that the brute-force attack becomes infeasible.
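The brute-force procedure can be sketched as below. The sketch assumes a toy deterministic encryption (`mle_encrypt`, built from SHA-256 and SHAKE-256) as a stand-in for a real MLE scheme, and a small candidate set of predictable chunks; all names are hypothetical.

```python
import hashlib

def mle_encrypt(chunk):
    """Toy MLE: key = hash(chunk); ciphertext = chunk XOR keystream(key)."""
    key = hashlib.sha256(chunk).digest()
    stream = hashlib.shake_256(key).digest(len(chunk))
    return bytes(a ^ b for a, b in zip(chunk, stream))

def brute_force(target_cipher, candidates):
    """Return the candidate that encrypts into target_cipher, or None."""
    for candidate in candidates:
        # The key derivation is public, so the adversary can re-encrypt
        # every candidate and compare it with the observed ciphertext.
        if mle_encrypt(candidate) == target_cipher:
            return candidate
    return None
```

With a predictable chunk such as a four-digit PIN record, the candidate set has only 10,000 entries and the search is trivial; the attack becomes infeasible only when the candidate set is too large to enumerate.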
To protect against the brute-force attack, DupLESS realizes server-aided MLE, which outsources MLE key management to a dedicated key manager that is only accessible by authenticated clients. Each authenticated client needs to first query the key manager for the chunk-derived key. Then the key manager computes and returns the key via a deterministic key derivation algorithm that takes the inputs of both the chunk fingerprint and a system-wide secret maintained by the key manager itself. This makes the resulting ciphertext chunks appear to be encrypted by random keys from the adversary’s point of view. In addition, the key manager limits the rate of key generation to slow down any online brute-force attack for querying the encryption keys. If the key manager is secure against adversaries, server-aided MLE ensures security even for predictable chunks; otherwise, it still maintains security for unpredictable chunks as in original MLE.
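A key manager of this kind can be sketched as follows, under loud assumptions: HMAC-SHA256 stands in for the deterministic key derivation, the rate limit is a simple per-client sliding window, and client authentication is omitted. The class and parameter names are hypothetical. DupLESS itself uses an oblivious protocol so that the key manager does not learn the fingerprint; that aspect is not modeled here.

```python
import hashlib
import hmac
import time

class KeyManager:
    """Toy key manager for server-aided MLE (illustrative only)."""

    def __init__(self, system_secret, max_requests, window_seconds=60.0):
        self._secret = system_secret   # system-wide secret, never leaves the manager
        self._max = max_requests       # allowed key requests per window per client
        self._window = window_seconds
        self._history = {}             # client id -> timestamps of recent requests

    def get_key(self, client_id, fingerprint, now=None):
        now = time.monotonic() if now is None else now
        recent = [t for t in self._history.get(client_id, [])
                  if now - t < self._window]
        if len(recent) >= self._max:
            self._history[client_id] = recent
            # Rate limiting slows down online brute-force key queries.
            raise RuntimeError("rate limit exceeded")
        recent.append(now)
        self._history[client_id] = recent
        # Deterministic: the same fingerprint always yields the same key,
        # but the key cannot be computed without the system-wide secret.
        return hmac.new(self._secret, fingerprint, hashlib.sha256).digest()
```

Since the derivation is still deterministic, identical chunks still map to identical ciphertext chunks, which is why server-aided MLE remains exposed to frequency analysis.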
Most existing MLE implementations, either based on convergent encryption or server-aided MLE, follow deterministic encryption to ensure that identical plaintext chunks always form identical ciphertext chunks to make deduplication possible. Thus, they are inherently vulnerable to frequency analysis as shown in this paper. Some encrypted deduplication designs are based on non-deterministic encryption [2, 11, 9, 38], yet they still keep deterministic components, incur high performance overhead, or require cryptographic primitives that are not readily implemented [2, 9]. We elaborate on the details in Section 8.
3 Threat Model
We focus on backup workloads, which have substantial content redundancy and are proven to be effective for deduplication in practice [57, 53]. Backups are copies of primary data (e.g., application states, file systems, and virtual disk images) over time. They are typically represented as weekly full backups (i.e., complete copies of data) followed by daily incremental backups (i.e., changes of data since the last full backup), while the recent trend shows that full backups are now more frequently performed (e.g., every few days) in production. Our threat model focuses on comparing different versions of full backups from the same primary data source at different times. In the following discussion, we simply refer to “full backups” as “backups”.
Figure 2 shows the encrypted deduplication architecture considered in the paper. Suppose that multiple clients connect to a shared deduplication storage for data backups. Given an input file, a client divides file data into plaintext chunks that will be encrypted via MLE. It then uploads the ciphertext chunks to deduplication storage. An adversary can eavesdrop on the ciphertext chunks before deduplication and launch frequency analysis. We assume that the adversary is honest-but-curious, meaning that it does not change the prescribed storage system protocols or modify any stored data.
To launch frequency analysis, the adversary should have access to auxiliary information that provides ground truths about the backups being stored. In this work, we model the auxiliary information as the plaintext chunks of a prior (non-latest) backup, which may be obtained through unintended data releases or data breaches. Clearly, the success of frequency analysis heavily depends on how accurately the available auxiliary information describes the backups. Our focus is not to address how to obtain accurate auxiliary information, which we pose as future work; instead, given the available auxiliary information, we study how an adversary can design severe attacks based on frequency analysis and how we can defend against the attacks. We also evaluate the attacks when the auxiliary information is publicly available (see Section 5).
Based on the available auxiliary information (which describes a prior backup), the primary goal of the adversary is to infer the content of the plaintext chunks that are mapped to the ciphertext chunks of the latest backup. The attack can be based on two modes:
Ciphertext-only mode: It models a typical case in which the adversary can access the ciphertext chunks of the latest backup (as well as the auxiliary information about a prior backup).
Known-plaintext mode: It models a more severe case in which a powerful adversary not only can access the ciphertext chunks of the latest backup and the auxiliary information about a prior backup as in ciphertext-only mode, but also knows a small fraction of the ciphertext-plaintext chunk pairs about the latest backup (e.g., from stolen devices).
In both attack modes, we assume that the adversary can monitor the processing sequence of the storage system and access the logical order of ciphertext chunks of the latest backup before deduplication. Our rationale is that existing deduplication storage systems [57, 55] often process chunks in logical order, so as to effectively cache metadata for efficient deduplication. On the other hand, we assume that the adversary cannot access any metadata information (e.g., the fingerprint index, file recipes, key recipes of all files). In practice, we do not apply deduplication to the metadata, which can be protected by conventional encryption. For example, the file recipes and key recipes can be encrypted by user-specific secret keys. We also assume that the adversary cannot identify the prior backup to which a stored ciphertext chunk belongs by analyzing the physical storage space, as the storage system can store ciphertext chunks in randomized physical addresses or in commercial public clouds (the latter is more difficult to access directly).
While this work focuses on frequency analysis, another inference attack based on combinatorial optimization, called $\ell_p$-optimization, has been proposed to attack deterministic encryption. Nevertheless, frequency analysis is shown to be as effective as the $\ell_p$-optimization attack, and later studies [33, 44] also state that both frequency analysis and $\ell_p$-optimization may have equivalent severity.
We do not consider other threats against encrypted deduplication, as they can be addressed independently by existing approaches. For example, the side-channel attack against encrypted deduplication [25, 26] can be addressed by server-side deduplication [26, 35] and proof of ownership; the leakage of access pattern can be addressed by oblivious RAM and blind storage.
|Defined in Section 4|
|sequence of ciphertext chunks in logical order for the latest backup|
|sequence of plaintext chunks in logical order for a prior backup|
|associative array that maps each ciphertext chunk in to its frequency|
|associative array that maps each plaintext chunk in to its frequency|
|set of inferred ciphertext-plaintext chunk pairs|
|set of left neighbors of ciphertext chunk|
|set of left neighbors of plaintext chunk|
|set of right neighbors of ciphertext chunk|
|set of right neighbors of plaintext chunk|
|set of currently inferred ciphertext-plaintext chunk pairs|
|number of ciphertext-plaintext chunk pairs returned from frequency analysis during the initialization of|
|number of ciphertext-plaintext chunk pairs returned from frequency analysis in each iteration of locality-based attack|
|maximum size of|
|associative array that maps each ciphertext chunk in to its left neighbor and co-occurrence frequency|
|associative array that maps each plaintext chunk in to its left neighbor and co-occurrence frequency|
|associative array that maps each ciphertext chunk in to its right neighbor and co-occurrence frequency|
|associative array that maps each plaintext chunk in to its right neighbor and co-occurrence frequency|
|Defined in Section 6|
|segment-based key of segment|
|minimum fingerprint of chunks in a segment|
We present inference attacks based on frequency analysis against encrypted deduplication. We first present a basic attack, which builds on classical frequency analysis to infer plaintext content in encrypted deduplication. We next propose a more severe locality-based attack, which enhances the basic attack by exploiting chunk locality. Furthermore, we combine the locality-based attack with the chunk size information, and propose an advanced locality-based attack against variable-size chunks.
Table 1 summarizes the major notation used in this paper. We first formalize the adversarial goal of our proposed attacks based on the threat model in Section 3. The adversary holds two sequences: the sequence of ciphertext chunks in logical order for the latest backup, and the sequence of plaintext chunks in logical order for a prior backup (i.e., the auxiliary information). Both sequences show the logical orders of chunks before deduplication as perceived by the adversary, so each of them can contain multiple identical chunks that have the same content. Note that the two sequences do not necessarily have the same number of chunks. Furthermore, the plaintext chunk at a given position of the prior backup is not necessarily mapped to the ciphertext chunk at the same position of the latest backup; in fact, it may not be mapped to any ciphertext chunk of the latest backup, for example, when it has been updated before the latest backup is generated. Given the two sequences, the goal of an adversary is to infer the content of the original plaintext chunks of the ciphertext chunks in the latest backup.
We quantify the severity of an attack using the inference rate, defined as the ratio of the number of unique ciphertext chunks whose plaintext chunks are successfully inferred over the total number of unique ciphertext chunks in the latest backup; a higher inference rate implies that the attack is more severe.
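The inference rate can be computed as in the following sketch. The function name, the representation of inferred pairs as `(ciphertext, plaintext)` tuples, and the ground-truth dictionary are assumptions made for illustration; in a real evaluation the ground truth comes from the dataset, not from the adversary.

```python
def inference_rate(inferred_pairs, ground_truth, unique_ciphertexts):
    """Fraction of unique ciphertext chunks whose plaintext is correctly inferred.

    inferred_pairs: iterable of (ciphertext, plaintext) pairs from an attack
    ground_truth: dict mapping each ciphertext chunk to its true plaintext chunk
    unique_ciphertexts: set of unique ciphertext chunks in the latest backup
    """
    if not unique_ciphertexts:
        return 0.0
    correct = {c for c, m in inferred_pairs
               if c in unique_ciphertexts and ground_truth.get(c) == m}
    return len(correct) / len(unique_ciphertexts)
```

For instance, if an attack outputs two pairs of which one is correct, and the latest backup has four unique ciphertext chunks, the inference rate is 1/4 = 25%.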
4.1 Basic Attack
We first demonstrate how we can apply frequency analysis to infer the original plaintext chunks of the latest backup in encrypted deduplication. We call this attack the basic attack.
In the basic attack, we identify each chunk by its fingerprint, and count the frequency of each chunk by the number of times its fingerprint appears in a backup. Thus, a chunk (or a fingerprint) has a high frequency if there exist many identical chunks with the same content. We sort the chunks of both the latest backup and the prior backup by their frequencies, and infer that the plaintext chunk at a given frequency rank in the prior backup is the original plaintext chunk of the ciphertext chunk at the same frequency rank in the latest backup. Our rationale is that the frequency of a plaintext chunk is correlated to that of the corresponding ciphertext chunk due to deterministic encryption.
Algorithm 1 shows the pseudo-code of the basic attack. It takes the ciphertext chunk sequence of the latest backup and the plaintext chunk sequence of the prior backup as input, and returns the result set of all inferred ciphertext-plaintext chunk pairs. It first calls the function Count to obtain the frequencies of all ciphertext and plaintext chunks, identified by fingerprints, in two associative arrays, one for ciphertext chunks and one for plaintext chunks (Lines 2-3). It then calls the function Freq-analysis to infer the set of ciphertext-plaintext chunk pairs (Line 4), and returns this set (Line 5).
The function Count constructs an associative array that holds the frequencies of all chunks of its input sequence (either the ciphertext or the plaintext chunk sequence). If a chunk does not exist in the array (i.e., its fingerprint is not found), then the function adds the chunk to the array and initializes its frequency as zero (Lines 10-12). The function then increments the frequency of the chunk by one (Line 13).
The function Freq-analysis performs frequency analysis based on the two associative arrays. It first sorts each array by frequency (Lines 18-19). Since the two arrays may not have the same number of elements, it finds the minimum of their numbers of elements (Line 20). Finally, it returns the ciphertext-plaintext chunk pairs, in which both the ciphertext and plaintext chunks of each pair have the same rank (Lines 21-26).
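The three functions of the basic attack can be sketched in a few lines. This is an illustration rather than the released implementation: chunks are represented by opaque fingerprint strings, the function names mirror the pseudo-code, and ties are broken by first occurrence, which is an assumption (the text only notes that tie-breaking affects the result).

```python
from collections import Counter

def count(chunks):
    """Map each chunk (identified by its fingerprint) to its frequency."""
    return Counter(chunks)

def freq_analysis(freq_c, freq_m):
    """Pair the ciphertext and plaintext chunks that have the same frequency rank."""
    ranked_c = [c for c, _ in freq_c.most_common()]
    ranked_m = [m for m, _ in freq_m.most_common()]
    # zip stops at the shorter ranking, i.e., the minimum number of elements.
    return list(zip(ranked_c, ranked_m))

def basic_attack(cipher_seq, plain_seq):
    """Return the set of inferred (ciphertext, plaintext) chunk pairs."""
    return set(freq_analysis(count(cipher_seq), count(plain_seq)))
```

For example, with the ciphertext sequence `["c1", "c2", "c1", "c3", "c1", "c2"]` and the plaintext sequence `["m1", "m2", "m1", "m1", "m2", "m4"]`, the attack pairs `c1` with `m1`, `c2` with `m2`, and `c3` with `m4` purely by rank, regardless of whether the last pair is actually correct.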
The basic attack demonstrates how frequency analysis can be applied to encrypted deduplication. However, it only achieves a small inference rate, as shown in our trace-driven evaluation (see Section 5). One reason is that the basic attack is sensitive to data updates that occur across different versions of backups over time. An update to a chunk can change the frequency ranks of multiple chunks, including the chunk itself and other chunks with similar frequencies. Another reason is that there exist many ties, in which chunks have the same frequency. How to break a tie during sorting also affects the frequency rank and hence the inference results of the tied chunks. In the following, we extend the basic attack to improve its inference rate.
4.2 Locality-based Attack
The locality-based attack exploits chunk locality to make frequency analysis more effective.
We first define the terms that capture the notion of chunk locality. Consider an ordered pair of neighboring ciphertext chunks in the latest backup. We say that the first chunk of the pair is the left neighbor of the second, while the second is the right neighbor of the first; similar definitions apply to an ordered pair of neighboring plaintext chunks in the prior backup. Note that a ciphertext chunk or a plaintext chunk may repeat many times (i.e., there are many duplicate copies), so if we identify each chunk by its fingerprint, it can be associated with more than one left or right neighbor. We therefore associate each ciphertext chunk with a set of left neighbors and a set of right neighbors, and likewise associate each plaintext chunk with its sets of left and right neighbors.
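The neighbor sets, together with how often each neighbor co-occurs with a chunk, can be collected in one pass over a chunk sequence. The sketch below is illustrative; the function name `neighbor_tables` and the nested-counter layout are assumptions.

```python
from collections import Counter, defaultdict

def neighbor_tables(seq):
    """Return (left, right) co-occurrence tables for a chunk sequence.

    left[x][y]  = number of times y appears immediately before x
    right[x][y] = number of times y appears immediately after x
    """
    left = defaultdict(Counter)
    right = defaultdict(Counter)
    for prev, cur in zip(seq, seq[1:]):
        left[cur][prev] += 1
        right[prev][cur] += 1
    return left, right
```

For the sequence `["a", "b", "a", "b", "c"]`, chunk `b` has the single left neighbor `a` (seen twice) and two right neighbors, `a` and `c` (each seen once).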
Our insight is that if a plaintext chunk of a prior backup has been identified as the original plaintext chunk of a ciphertext chunk of the latest backup, then the left and right neighbors of the plaintext chunk are also likely to be the original plaintext chunks of the left and right neighbors of the ciphertext chunk, mainly because chunk locality implies that the ordering of chunks is likely to be preserved across backups. In other words, for any inferred ciphertext-plaintext chunk pair, we further infer more ciphertext-plaintext chunk pairs through the left and right neighboring chunks of the pair, and repeat the same inference on those newly inferred chunk pairs. Thus, we can significantly increase the attack severity.
The locality-based attack operates on an inferred set, which stores the currently inferred ciphertext-plaintext chunk pairs. How to initialize the inferred set depends on the attack mode (see Section 3). In ciphertext-only mode, in which an adversary only knows the ciphertext chunks of the latest backup and the plaintext chunks of a prior backup, we apply frequency analysis to find the most frequent ciphertext-plaintext chunk pairs and add them to the inferred set. Here, we configure a parameter (e.g., 1 by default in our implementation) to indicate the number of most frequent chunk pairs to be returned. Our rationale is that the top-frequent chunks have significantly higher frequencies (see Figure 1) than the other chunks, and their frequency ranks are stable across different backups. This ensures the correctness of the ciphertext-plaintext chunk pairs in the inferred set with a high probability throughout the attack. On the other hand, in known-plaintext mode, in which the adversary knows some leaked ciphertext-plaintext chunk pairs about the latest backup, we initialize the inferred set with the leaked chunk pairs that also appear in the prior backup.
The locality-based attack proceeds as follows. In each iteration, it picks one ciphertext-plaintext chunk pair from the inferred set. It collects the corresponding sets of left and right neighboring chunks of both the ciphertext chunk and the plaintext chunk of the pair. We apply frequency analysis to find the most frequent ciphertext-plaintext chunk pairs from the two sets of left neighbors, and similarly from the two sets of right neighbors. In other words, we find the left and right neighboring chunks that have the most co-occurrences with the ciphertext chunk and the plaintext chunk of the pair, respectively. We configure a parameter (e.g., 15 by default in our implementation) to indicate the number of most frequent chunk pairs returned from frequency analysis in an iteration. A larger value of this parameter increases the number of inferred ciphertext-plaintext chunk pairs, but it also potentially compromises the inference accuracy. The attack adds all inferred chunk pairs into the inferred set, and iterates until all inferred chunk pairs in the inferred set have been processed.
Note that the inferred set may grow very large as the backup size increases, and a very large inferred set can exhaust memory space. We configure a parameter (e.g., 200,000 by default in our implementation) to bound the maximum size of the inferred set.
In our evaluation (see Section 5), we carefully examine the impact of these three configurable parameters.
Algorithm 2 shows the pseudo-code of the locality-based attack. It takes the ciphertext chunk sequence, the plaintext chunk sequence, and the three configurable parameters as input, and returns the result set of all inferred ciphertext-plaintext chunk pairs. It first calls the function Count to obtain the following associative arrays for the ciphertext chunks: one that stores the frequencies of all ciphertext chunks, and two that store the co-occurrence frequencies of the left and right neighbors of all ciphertext chunks, respectively (Line 2); similarly, it obtains the three corresponding associative arrays for the plaintext chunks (Line 3). It then initializes the inferred set, either by obtaining the most frequent ciphertext-plaintext chunk pairs from frequency analysis in ciphertext-only mode, or by adding the set of leaked ciphertext-plaintext chunk pairs that appear in both the latest and prior backups in known-plaintext mode (Lines 4-8). It also initializes the result set with the inferred set (Line 9).
In the main loop (Lines 10-22), the algorithm removes a pair from the inferred set (Line 11) and uses it to infer additional ciphertext-plaintext chunk pairs from the neighboring chunks of the pair. It first examines all left neighbors by running the function Freq-analysis on the left-neighbor arrays of the ciphertext chunk and the plaintext chunk of the pair, and stores the most frequent ciphertext-plaintext chunk pairs (Line 12). Similarly, it examines all right neighbors and stores the results (Line 13). For each resulting pair, if its ciphertext chunk is not in the result set (i.e., the original plaintext chunk of the ciphertext chunk has not been inferred yet), we add the pair to the result set, and also to the inferred set if the inferred set is not full (Lines 14-21). The main loop iterates until the inferred set becomes empty. Finally, the result set is returned.
Both the functions Count and Freq-analysis are similar to those in the basic attack (see Algorithm 1), with the following extensions. For Count, in addition to constructing the associative array that holds the frequencies of all chunks of its input sequence (either the ciphertext or the plaintext chunk sequence), it also constructs two associative arrays that hold the co-occurrence frequencies of the left and right neighbors of each chunk, respectively. For Freq-analysis, it now performs frequency analysis on a pair of associative arrays, one for ciphertext chunks and one for plaintext chunks; each array is either the one that holds the frequency counts of all chunks, or a left-neighbor or right-neighbor array that holds the frequency counts of all ordered pairs of chunks associated with a given ciphertext chunk (resp. plaintext chunk). Also, Freq-analysis only returns a bounded number of most frequent ciphertext-plaintext chunk pairs, where the bound is either the initialization parameter or the per-iteration parameter.
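The overall procedure can be sketched as follows. This is an illustrative reading of the description above, not the released implementation: the parameter names (`seed_count`, `per_neighbor_count`, `max_pending`) are hypothetical labels for the three configurable parameters, the inferred set is processed in first-in-first-out order, and ties in frequency are broken by first occurrence; the latter two are assumptions.

```python
from collections import Counter, defaultdict, deque

def count(seq):
    """Return the frequency, left-neighbor, and right-neighbor tables of a sequence."""
    freq = Counter(seq)
    left, right = defaultdict(Counter), defaultdict(Counter)
    for prev, cur in zip(seq, seq[1:]):
        left[cur][prev] += 1
        right[prev][cur] += 1
    return freq, left, right

def freq_analysis(table_c, table_m, top):
    """Pair the `top` most frequent ciphertext and plaintext chunks by rank."""
    ranked_c = [c for c, _ in table_c.most_common(top)]
    ranked_m = [m for m, _ in table_m.most_common(top)]
    return list(zip(ranked_c, ranked_m))

def locality_attack(cipher_seq, plain_seq, seed_count=1, per_neighbor_count=15,
                    max_pending=200_000, leaked=None):
    """Return a dict mapping each inferred ciphertext chunk to a plaintext chunk."""
    f_c, l_c, r_c = count(cipher_seq)
    f_m, l_m, r_m = count(plain_seq)
    if leaked is None:
        # Ciphertext-only mode: seed with the top-frequent pairs.
        seeds = freq_analysis(f_c, f_m, seed_count)
    else:
        # Known-plaintext mode: keep leaked pairs that appear in both backups.
        seeds = [(c, m) for c, m in leaked if c in f_c and m in f_m]
    result = dict(seeds)     # result set of all inferred pairs
    pending = deque(seeds)   # inferred set still to be processed
    while pending:
        c, m = pending.popleft()
        candidates = (freq_analysis(l_c[c], l_m[m], per_neighbor_count)
                      + freq_analysis(r_c[c], r_m[m], per_neighbor_count))
        for c2, m2 in candidates:
            if c2 not in result:
                result[c2] = m2
                if len(pending) < max_pending:
                    pending.append((c2, m2))
    return result
```

As a toy usage example, take the prior backup `m1 m2 m3 m4 m2 m3 m4 m1 m2` and the latest backup `c1 c2 c3 c4 c2 c3 c4 c1 c2 c5`, where `c5` encrypts new content. With `seed_count=1` and `per_neighbor_count=1`, the attack seeds with the most frequent pair and then walks outward through neighbors, recovering the four pairs for `c1` to `c4` while leaving `c5` uninferred.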
Figure 3 shows an example of how the locality-based attack works. Here, we consider ciphertext-only mode. Suppose that we have obtained the auxiliary information of some prior backup, and use it to infer the original plaintext chunks of of the latest backup. We set , and (i.e., the inferred set is unbounded). We assume that the ground truth is that the original plaintext chunk of the ciphertext chunk is for , while that of is some new plaintext chunk not in (note that in reality, an adversary does not know the ground truth).
We first apply frequency analysis and find that is the most frequent ciphertext-plaintext chunk pair, so we initialize and add it into . We then remove and operate on from , and find that , , , and . From and , we find that is the most frequent ciphertext-plaintext chunk pair, while from and , we find . Thus, we add both and into and . We repeat the processing on and , and we can infer another pair from the right neighbors of .
To summarize, the locality-based attack can successfully infer the original plaintext chunks of all four ciphertext chunks , , , and . It cannot infer the original plaintext chunk of , as it does not appear in .
4.3 Advanced Locality-based Attack
Based on the framework of the locality-based attack, we propose an advanced locality-based attack that specifically targets variable-size chunks generated from content-defined chunking (see Section 2.1). Specifically, if the generated chunks have varying sizes, the adversary can observe the size of each ciphertext chunk before deduplication and leverage the size information to increase the severity of the locality-based attack.
The advanced locality-based attack builds on the observation that if a ciphertext chunk corresponds to a plaintext chunk, then the actual size of the ciphertext chunk approximates that of the plaintext chunk. Suppose that the symmetric encryption algorithm used by the encrypted deduplication system is based on block ciphers (e.g., AES). Then both chunks should have the same number of blocks (i.e., the basic units of block ciphers). We exploit this additional information in frequency analysis. Specifically, we first classify the ciphertext chunks and the plaintext chunks in the associative arrays used for frequency analysis (i.e., the frequency array and the left-neighbor and right-neighbor arrays on each side) by their sizes, measured in terms of the number of blocks. For each available chunk size, we relate the top-frequent ciphertext chunks with the top-frequent plaintext chunks that have the same size. This improves the accuracy of each inferred ciphertext-plaintext pair, and hence the inferred neighbors in the iterated inference of the locality-based attack.
The advanced locality-based attack extends the original locality-based attack in Algorithm 2 and modifies the function Freq-analysis (called in Line 5 and Lines 12-13 in Algorithm 2) to augment frequency analysis with the knowledge of chunk sizes.
Algorithm 3 shows the pseudo-code of frequency analysis in the advanced locality-based attack. As in Algorithm 2, the function Freq-analysis takes a ciphertext-side and a plaintext-side associative array, as well as the number of pairs to return, as input. It calls the function Classify to group the ciphertext and plaintext chunks in the two arrays by size (Lines 2-3), so that each resulting structure maps the ciphertext (resp. plaintext) chunks that have the same size to their corresponding frequencies. It infers the top-frequent ciphertext-plaintext pairs for each available chunk size (Lines 4-12), and finally returns the inference results (Line 13).
The function Classify groups the chunks in its input associative array (which can hold either ciphertext or plaintext chunks) by their sizes. In this work, we assume that AES encryption is used and the block size is 16 bytes. Thus, Classify derives the number of blocks of each ciphertext or plaintext chunk (Line 18), and stores the frequency of the chunk under the corresponding size class (Line 22).
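The size-aware frequency analysis can be illustrated as follows. This is a hedged sketch rather than a transcription of Algorithm 3: the names `classify` and `freq_analysis_by_size`, the explicit size dictionaries, and the use of ceiling division to count 16-byte blocks are assumptions made for illustration, and padding details of the cipher mode are ignored.

```python
from collections import Counter, defaultdict

BLOCK_SIZE = 16  # AES block size in bytes, as assumed in the text


def classify(freq, sizes):
    """Group chunk frequencies by chunk size, measured in cipher blocks."""
    by_size = defaultdict(Counter)
    for chunk, occurrences in freq.items():
        blocks = -(-sizes[chunk] // BLOCK_SIZE)  # ceiling division (assumption)
        by_size[blocks][chunk] = occurrences
    return by_size


def freq_analysis_by_size(y_c, y_m, size_c, size_m, x):
    """For each size class, pair the top-x ciphertext chunks with the
    top-x plaintext chunks of the same size, ranked by frequency."""
    e_c = classify(y_c, size_c)
    e_m = classify(y_m, size_m)
    pairs = []
    for blocks, cipher_freq in e_c.items():
        if blocks not in e_m:
            continue  # no plaintext chunk of this size is known
        top_c = [c for c, _ in cipher_freq.most_common(x)]
        top_m = [m for m, _ in e_m[blocks].most_common(x)]
        pairs.extend(zip(top_c, top_m))
    return pairs
```

Restricting each comparison to chunks with the same number of blocks prevents a frequent ciphertext chunk from being matched with an equally frequent plaintext chunk of a different size.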
5 Attack Evaluation
We present trace-driven evaluation results on the severity of frequency analysis against encrypted deduplication.
We consider three datasets in our evaluation.
The FSL dataset is collected by the File systems and Storage Lab (FSL) at Stony Brook University [1, 51, 52] and describes real-world storage patterns. We focus on the Fslhomes dataset, which contains the daily snapshots of users’ home directories on a shared file system. Each snapshot is represented by a collection of 48-bit chunk fingerprints produced by variable-size chunking of different average sizes. We pick the snapshots from January 22 to May 21 in 2013, and fix the average size as 8KB for our evaluation. We select six users (User4, User7, User12, User13, User15, and User28) that have the complete daily snapshots over the whole duration. We aggregate each user’s snapshots on a monthly basis (on January 22, February 22, March 22, April 21, and May 21), and hence form five monthly full backups for all users. Our post-processed dataset covers a total of 2.7TB of logical data before deduplication.
The synthetic dataset contains a sequence of backup snapshots that are generated based on Lillibridge et al.’s approach. Specifically, we create an initial snapshot from an Ubuntu 14.04 virtual disk image (originally with 1.1GB of data) with a total of 4.3GB space. We create a sequence of snapshots starting from the initial snapshot, such that each snapshot is created from the previous one by randomly picking 2% of files and modifying 2.5% of their content, and also adding 10MB of new data. Finally, we generate a sequence of ten snapshots, each of which is treated as a backup. Based on our choices of parameters, the resulting storage saving of deduplication is around 90%; equivalently, the deduplication ratio is around 10:1, which is typical in real-life backup workloads. Note that the initial snapshot is publicly available. Later in our evaluation, we study the effectiveness of the attacks by using it as public auxiliary information.
The VM dataset is collected by us in a real-world scenario and is not considered in our prior conference paper. It comprises 156 virtual machine (VM) image snapshots for the students enrolled in a university programming course in Spring 2014. Each snapshot is represented by the SHA-1 fingerprints of 4KB fixed-size chunks. We treat each VM image snapshot as a weekly backup of a user, and extract 13 weeks of backups of all users. We remove all zero-filled chunks that dominate in VM images, and obtain a reduced dataset covering 9.11TB of data. Our prior studies [35, 45] have also used variants of this dataset for evaluation. Here, we include this dataset to cross-validate the results from the other datasets in our attack evaluation.
We implement all three inference attacks by processing and comparing the chunk fingerprints in our datasets. We benchmark our current implementation on an Ubuntu 16.04 Linux machine with an AMD Athlon II X4 640 quad-core 3.0GHz CPU and 16GB RAM, and find that the locality-based attack takes around 15 hours to process an FSL backup of size around 500GB. In the following, we highlight the implementation details of some data structures used by the attacks.
Recall that there are three types of associative arrays: (i) $F_{\mathbf{C}}$ and $F_{\mathbf{M}}$, (ii) $L_{\mathbf{C}}$ and $L_{\mathbf{M}}$, and (iii) $R_{\mathbf{C}}$ and $R_{\mathbf{M}}$ (the latter two are only used by the locality-based attack). We implement them as key-value stores using LevelDB. Each key-value store is keyed by the fingerprint of the ciphertext/plaintext chunk. For $F_{\mathbf{C}}$ and $F_{\mathbf{M}}$, each entry stores a frequency count; for $L_{\mathbf{C}}$, $L_{\mathbf{M}}$, $R_{\mathbf{C}}$, and $R_{\mathbf{M}}$, each entry stores a sequential list of the fingerprints of all the left/right neighbors of the keyed chunk and the co-occurrence frequency counts. For the latter, keeping neighboring chunks sequentially simplifies our implementation, but also increases the search time of a particular neighbor (which dominates the overall running time); we leave this optimization as future work.
We implement the inferred set in the locality-based attack as a first-in-first-out queue, whose maximum size is bounded by $w$ (see Section 4.2). Each time, we remove the first ciphertext-plaintext chunk pair from the queue for inferring more chunk pairs from the neighbors.
We now present the evaluation results and show the inference rate (defined in Section 4) of each attack under different settings.
5.3.1 Impact of Parameters
We first evaluate the impact of parameters on the locality-based attack, in order to justify our choices of parameters. Recall that the locality-based attack is configured with three parameters: $u$, $v$, and $w$. Here, we focus on the FSL and VM datasets, and evaluate the attack in ciphertext-only mode. For the FSL dataset, we consider a “medium” case and use the middle version of the backup on March 22 as auxiliary information in order to infer the plaintext chunks of the latest backup on May 21; for the VM dataset, we consider a “recent” case and use the 12th weekly backup to infer the plaintext chunks of the latest 13th weekly backup.
Figure 4(a) first shows the impact of $u$, in which we fix $v$ = 20 and $w$ = 100,000. The inference rate gradually decreases with $u$. For example, when $u$ increases from 1 to 20, the inference rate decreases from 13.3% to 7.4% and from 13.0% to 12.3% for the FSL and VM datasets, respectively. A larger $u$ implies that incorrect ciphertext-plaintext chunk pairs are more likely to be included into the inferred set during initialization, thereby compromising the inference accuracy. In addition, the decrease of the inference rate in the VM dataset is slower than that in the FSL dataset. The reason is that we use a more recent VM backup as auxiliary information and its frequency ranking is similar to that of the latest backup.
Figure 4(b) next shows the impact of $v$, in which we fix $u$ = 10 and $w$ = 100,000. Initially, the inference rate increases with $v$, as the underlying frequency analysis infers more ciphertext-plaintext chunk pairs in each iteration. It hits the maximum value at about 11.2% (for the FSL dataset) and 13.8% (for the VM dataset) when $v$ = 15. When $v$ increases to 40, the inference rate drops slightly to about 9.5% and 11.8% for the FSL and VM datasets, respectively. The reason is that some incorrectly inferred ciphertext-plaintext chunk pairs are also included into the inferred set, which compromises the inference rate.
Figure 4(c) finally shows the impact of $w$, in which we fix $u$ = 10 and $v$ = 20. A larger $w$ increases the inference rate, since the inferred set can hold more ciphertext-plaintext chunk pairs across iterations. We observe that when $w$ increases beyond 200,000, the inference rate becomes steady at about 10.2% and 13.8% for the FSL and VM datasets, respectively.
5.3.2 Inference Rate in Ciphertext-only Mode
We now evaluate the attacks in ciphertext-only mode. We select $u$ = 1, $v$ = 15, and $w$ = 200,000 as default parameters to achieve the highest possible inference rate (see Figure 4).
We first consider the FSL dataset. We choose each of the prior FSL backups on January 22, February 22, March 22, and April 21 as auxiliary information, and we launch attacks to infer the original plaintext chunks in the latest backup on May 21. Figure 5(a) shows the inference rate versus the prior backup. As expected, the inference rates of all three attacks increase as we use more recent prior backups as auxiliary information, since a more recent backup has higher content redundancy with the target latest backup. We see that the basic attack is ineffective in all cases, as the inference rate is no more than 0.0001%. The locality-based attack and the advanced locality-based attack achieve significantly higher inference rates. For example, if we use the most recent prior backup on April 21 as auxiliary information, the inference rates of the locality-based attack and the advanced locality-based attack reach as high as 23.2% and 33.6%, respectively.
We next consider the synthetic dataset. We use the initial snapshot (which is publicly available) as auxiliary information. We then infer the original plaintext chunks in each of the following synthetic backups. Figure 5(b) shows the inference rates of the three attacks over the sequence of backups. Both the locality-based attack and the advanced locality-based attack are again more severe than the basic attack, whose inference rate is less than 0.2%. For example, the locality-based attack and the advanced locality-based attack infer 13.1% and 14.4% of the original plaintext chunks in the first backup, respectively, while the basic attack infers only 0.19%. After ten backups, since more chunks have been updated since the initial snapshot, the inference rates of the locality-based attack, the advanced locality-based attack, and the basic attack drop to 6.0%, 9.2%, and 0.0007%, respectively. Nevertheless, we observe that the locality-based attack and the advanced locality-based attack always achieve higher inference rates than the basic attack.
Finally, we consider the VM dataset, and launch the attacks based on a sliding window approach. Specifically, we choose the $t$-th weekly backup as auxiliary information, and infer the original plaintext chunks in the $(t+s)$-th weekly backup, while we vary $t$ and $s$ in our evaluation. We mainly focus on the locality-based attack, since the basic attack has low severity and the advanced locality-based attack reduces to the locality-based attack for fixed-size chunks in the VM dataset. Figure 5(c) shows the inference rates for $s$ = 1, 2, and 3, where the x-axis represents different values of $t$. We see that the inference rates fluctuate significantly. For example, when we use the 3rd weekly backup as auxiliary information, the inference rates hit the highest at 23.5%, 14.3%, and 14.4% for $s$ = 1, 2, and 3, respectively. On the other hand, the inference rates drop to less than 0.6% when we use the 5th to 8th weekly backups as auxiliary information, mainly because there were some heavy user activities and many unique chunks were added into the VM images. Even with such unfavorable cases, the locality-based attack still achieves moderate severity in general, with an average inference rate of 12.5%, 7.3%, and 4.8% for $s$ = 1, 2, and 3, respectively.
5.3.3 Inference Rate in Known-Plaintext Mode
We further evaluate the severity of the locality-based attack and the advanced locality-based attack in known-plaintext mode. To quantify the amount of leakage about the latest backup (see Section 3), we define the leakage rate as the ratio of the number of ciphertext-plaintext chunk pairs known by the adversary to the total number of ciphertext chunks in a target backup. For the FSL dataset, we choose the backup on March 22 as auxiliary information to infer the latest backup on May 21; for the synthetic dataset, we use the initial snapshot as auxiliary information to infer the 5th backup snapshot; for the VM dataset, we use the 9th weekly backup to infer the 13th weekly backup. We configure $u$ = 1, $v$ = 15, and $w$ = 500,000. Note that we increase $w$ to 500,000 (as opposed to 200,000 in Section 5.3.2), as the attack in known-plaintext mode can infer many more ciphertext-plaintext chunk pairs across iterations. Thus, we choose a larger $w$ to include them into the inferred set.
Figure 6 shows the inference rate (which also includes the chunks that are already leaked in known-plaintext mode) versus the leakage rate (which we vary from 0 to 0.2%) about the target backup being inferred. A slight increase in the leakage rate can lead to a significant increase in the inference rate. For example, when the leakage rate increases to 0.2%, the inference rates of the locality-based attack and the advanced locality-based attack reach 27.1% and 37.8% for the FSL dataset, and 27.3% and 29.9% for the synthetic dataset, respectively. Since the VM dataset includes fixed-size chunks, both attacks incur the same inference rate of 12.2% in this case.
The deterministic nature of encrypted deduplication discloses the frequency distribution of the underlying plaintext chunks, thereby making frequency analysis feasible. To defend against frequency analysis, we consider two defense approaches, namely MinHash encryption and scrambling.
6.1 MinHash Encryption
MinHash encryption builds on Broder’s theorem, which states that if two sets share a large fraction of common elements (i.e., they are highly similar), then the probability that both sets share the same minimum hash element is also high. Since two backups from the same data source are expected to be highly similar and share a large number of identical chunks, MinHash encryption leverages this property to perform encrypted deduplication in a different way from the original MLE [11, 10]. We emphasize that previous deduplication approaches also leverage Broder’s theorem to minimize the memory usage of the fingerprint index in plaintext deduplication [12, 55] or key generation overhead in server-aided MLE. Also, security analysis shows that MinHash encryption preserves data confidentiality as in server-aided MLE. Thus, we do not claim the novelty of the design of MinHash encryption. Instead, our contribution is to study its effectiveness in defending against frequency analysis.
MinHash encryption works as follows. It partitions the sequence of plaintext chunks into segments, each of which is a non-overlapped sub-sequence of adjacent plaintext chunks. For each segment, MinHash encryption requests a secret key from the key manager as in DupLESS (see Section 2.2) using the minimum fingerprint of all chunks in the segment, and then encrypts each chunk in the segment using the same key derived from the minimum fingerprint. This implies that MinHash encryption only requests keys on a per-segment basis rather than on a per-chunk basis. As the number of segments is much smaller than that of chunks, the key generation overhead is greatly mitigated.
MinHash encryption is robust against the locality-based attack, by (slightly) breaking the deterministic nature of encrypted deduplication. Its rationale is that segments are highly similar as they share many identical plaintext chunks in backups [12, 55]. Thus, their minimum fingerprints, and hence the secret keys derived for segments, are likely to be the same as well due to Broder’s theorem . This implies that most identical plaintext chunks across segments are still encrypted by the same secret keys into identical ciphertext chunks, thereby preserving deduplication effectiveness. However, some identical plaintext chunks may still reside in different segments with different minimum fingerprints and hence different secret keys, so their resulting ciphertext chunks will be different and cannot be deduplicated, leading to a slight degradation of storage efficiency. Nevertheless, such “approximate” deduplication sufficiently alters the overall frequency ranking of ciphertext chunks by encrypting a small fraction of duplicate chunks using different keys, thereby making frequency analysis ineffective. We refer readers to our prior conference version  for the algorithmic details of MinHash encryption.
6.2 Scrambling
Scrambling augments MinHash encryption by disturbing the processing sequence of chunks, so as to prevent an adversary from correctly identifying the neighbors of each chunk in the locality-based attack. It is applied before the chunks are encrypted and stored, and its idea is to scramble the original plaintext chunk sequence into a new sequence. To be compatible with MinHash encryption, scrambling works on a per-segment basis by shuffling the ordering of chunks within each segment. Each plaintext chunk is still encrypted via MinHash encryption and stored as a ciphertext chunk, while the original file can still be reconstructed based on its recipe (see Section 2.1) that specifies the original order of plaintext chunks. Note that scrambling does not change the storage efficiency of MinHash encryption, since it only changes the order of plaintext chunks.
Algorithm 4 elaborates the pseudo-code of scrambling. It first partitions the original plaintext chunk sequence into segments as in MinHash encryption (Line 2). Then for each chunk of a segment, the algorithm randomly adds the chunk to either the front or the end of the scrambled version of that segment (Lines 6-13). Finally, it returns the scrambled sequence that includes all the scrambled segments (Line 16).
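The per-segment scrambling step can be sketched as follows. This is an illustrative sketch, assuming that segments are given as lists of chunks and that a fair coin decides whether each chunk goes to the front or the end; the function names and the optional `rng` argument are hypothetical and are not taken from Algorithm 4.

```python
import random
from collections import deque


def scramble_segment(segment, rng):
    """Scramble one segment by adding each chunk to the front or the end
    of the scrambled segment, chosen uniformly at random."""
    scrambled = deque()
    for chunk in segment:
        if rng.random() < 0.5:
            scrambled.appendleft(chunk)
        else:
            scrambled.append(chunk)
    return list(scrambled)


def scramble(segments, rng=None):
    """Scramble every segment independently; segment boundaries are kept."""
    rng = rng or random.Random()
    return [scramble_segment(segment, rng) for segment in segments]
```

Because chunks never cross segment boundaries, the set of chunks in each segment (and hence its minimum fingerprint) is unchanged, which is why scrambling does not affect the storage efficiency of MinHash encryption.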
7 Defense Evaluation
We conduct trace-driven evaluation on MinHash encryption and scrambling in three aspects: defense effectiveness, storage efficiency, and metadata access overhead.
Since both FSL and VM datasets do not contain actual contents, we simulate our defense approaches by directly operating on chunk fingerprints. First, we identify segment boundaries based on chunk fingerprints, by following a variable-size segmentation scheme. Specifically, the segmentation scheme is configured by the minimum, average, and maximum segment sizes. It places a segment boundary at the end of a chunk fingerprint if (i) the size of each segment is at least the minimum segment size, and (ii) the chunk fingerprint modulo a pre-defined divisor (which determines the average segment size) is equal to some constant, or the inclusion of the chunk makes the segment size larger than the maximum segment size. In our evaluation, we set the minimum, average, and maximum segment sizes as 512KB, 1MB, and 2MB, respectively.
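The boundary rule can be sketched as follows. This is a minimal sketch under stated assumptions: chunks are given as hypothetical `(fingerprint, size)` pairs with integer fingerprints, and the `divisor` and `target` defaults are placeholders chosen for illustration, since the exact constants are not specified here.

```python
def segment_chunks(chunks, min_size=512 * 1024, max_size=2 * 1024 * 1024,
                   divisor=128, target=0):
    """Split a sequence of (fingerprint, size) pairs into variable-size segments.

    A boundary is placed after a chunk if the segment has reached the minimum
    size and the fingerprint matches the pattern, or if including the chunk
    makes the segment larger than the maximum size.
    """
    segments, current, current_size = [], [], 0
    for fingerprint, size in chunks:
        current.append((fingerprint, size))
        current_size += size
        matches = current_size >= min_size and fingerprint % divisor == target
        if matches or current_size > max_size:
            segments.append(current)
            current, current_size = [], 0
    if current:
        segments.append(current)  # trailing chunks form the last segment
    return segments
```

Since the boundaries depend only on chunk fingerprints, two backups that share long runs of identical chunks tend to be cut at the same positions, which keeps their segments similar.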
After scrambling the orders of chunks (a.k.a. fingerprints) in each segment, we mimic MinHash encryption as follows. We first calculate the minimum chunk fingerprint of each segment. We then concatenate this minimum fingerprint with each chunk fingerprint in the segment and compute the SHA-256 hash of the concatenation. We also truncate the hash result to be consistent with the fingerprint sizes in the original FSL and VM datasets, respectively. The truncated hash result can be viewed as the fingerprint of the ciphertext chunk. We can easily check that identical plaintext chunks under the same minimum fingerprint will lead to identical ciphertext chunks that can be deduplicated.
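The fingerprint-level simulation can be sketched as follows. This is a hedged sketch: the function name is hypothetical, fingerprints are assumed to be `bytes` objects, the minimum fingerprint is assumed to be placed first in the concatenation, and the truncation length is passed in by the caller (e.g., 6 bytes for 48-bit fingerprints or 20 bytes for SHA-1 fingerprints).

```python
import hashlib


def minhash_ciphertext_fingerprints(segment, fingerprint_len):
    """Map the plaintext fingerprints of one segment to simulated
    ciphertext fingerprints under MinHash encryption."""
    min_fp = min(segment)  # minimum chunk fingerprint of the segment
    return [
        hashlib.sha256(min_fp + fp).digest()[:fingerprint_len]
        for fp in segment
    ]
```

Two segments with the same minimum fingerprint map a shared plaintext chunk to the same ciphertext fingerprint, whereas the same chunk in a segment with a different minimum fingerprint maps to a different one and therefore cannot be deduplicated.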
7.2 Defense Effectiveness
We evaluate our defense schemes, including (i) MinHash encryption only and (ii) the combined MinHash encryption and scrambling scheme, against the advanced locality-based attack in known-plaintext mode under the same parameter setting as in Section 5.3. Note that the advanced locality-based attack reduces to the locality-based attack in the VM dataset, which uses fixed-size chunking.
Figure 7 shows the inference rate versus the leakage rate. When the leakage rate is 0.2%, MinHash encryption suppresses the inference rate to 7.3%, 3.8%, and 3.4% for the FSL, synthetic, and VM datasets, respectively, under the advanced locality-based attack. In addition, the combined MinHash encryption and scrambling scheme further suppresses the inference rate to 0.2-0.24% only for all datasets. This shows that scrambling effectively enhances the protection of MinHash encryption.
7.3 Storage Efficiency
We evaluate the storage efficiency of the combined MinHash encryption and scrambling scheme. Specifically, we add the encrypted backups to storage in the order of their creation times, and measure the storage saving as the percentage of the total size of all ciphertext chunks reduced by deduplication. We compare the storage saving with that of the original MLE, which performs deduplication at the more fine-grained chunk level and eliminates all duplicate chunks. Here, we do not consider the metadata overhead.
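The storage saving measurement can be sketched as follows. This is an illustrative sketch that assumes each backup is a non-empty list of hypothetical `(ciphertext fingerprint, size)` pairs and that metadata is ignored, as in the text; the function name is not taken from the paper.

```python
def cumulative_storage_saving(backups):
    """Return the storage saving (as a fraction) after storing each backup.

    The saving is the share of the total logical size of all ciphertext
    chunks that deduplication avoids storing.
    """
    stored = set()
    logical_size = physical_size = 0
    savings = []
    for backup in backups:
        for fingerprint, size in backup:
            logical_size += size
            if fingerprint not in stored:  # only unique chunks consume space
                stored.add(fingerprint)
                physical_size += size
        savings.append(1 - physical_size / logical_size)
    return savings
```

Feeding the same backups with MLE fingerprints and with MinHash-derived fingerprints into this function gives the two curves being compared; the gap between them is the storage cost of the defense.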
Figure 8(a) shows the storage saving after storing each FSL backup. We observe that after storing all five backups, the combined scheme achieves a storage saving of 83.2%, which is 3.6% less than that of MLE.
Figure 8(b) shows the storage saving after storing each synthetic snapshot. After 11 backups, the combined scheme achieves a storage saving of 86.2%. The drop of the storage saving remains small (about 3%) compared to MLE, which achieves a storage saving of 89.2%.
Figure 8(c) shows the storage saving for the VM dataset. Overall, the storage saving for the first backup reaches 97.4%, mainly because the VM images are initially installed with the same operating system. The storage saving drops after the 7th backup, since the students make big changes and add unique chunks into the VM images. After 13 backups, the storage saving of the combined scheme reaches 97.9%, with a reduction of 0.7% compared to that of MLE.
Overall, the combined scheme maintains high storage efficiency achieved by deduplication for all datasets.
7.4 Metadata Access Overhead
We evaluate the performance of the combined MinHash encryption and scrambling scheme via a case study of its deployment. We implement a deduplication prototype based on the Data Domain File System (DDFS)  to simulate the processing of encrypted deduplication workload. Suppose that the chunks have been encrypted, by either the original MLE-based deterministic encryption or our combined MinHash encryption and scrambling scheme. We focus on the metadata access overhead under our DDFS-like prototype, since metadata access plays an important role in deduplication performance .
7.4.1 Prototype Design
We design and implement our deduplication prototype based on DDFS. Specifically, our prototype organizes the unique (ciphertext) chunks on disk in units of containers. Each container is typically several megabytes in size (e.g., 4MB) to mitigate the disk seek overhead, as opposed to a chunk, which is often several kilobytes in size (e.g., 4KB or 8KB). In addition, our prototype maintains a fingerprint index to hold the metadata (e.g., the mappings of fingerprints to chunk locations) and detect if any identical chunk has been stored. Since the size of the fingerprint index increases with the number of unique chunks being stored, the fingerprint index is stored on disk, while our prototype maintains two in-memory data structures, namely a fingerprint cache and a Bloom filter, to mitigate the disk I/O overhead during deduplication (see below).
Our prototype follows the deduplication workflow of DDFS. In particular, it stores unique chunks in logical order and further exploits chunk locality to accelerate deduplication. Given an incoming ciphertext chunk $C$, our prototype performs deduplication as follows.
Step S1: Our prototype checks by fingerprint if $C$ is in the fingerprint cache. If so, $C$ is a duplicate and does not need to be stored.
Step S2: If $C$ is not in the fingerprint cache, our prototype checks the Bloom filter. If $C$ is not in the Bloom filter, it must be unique. Then our prototype updates the Bloom filter, and also inserts $C$ and its fingerprint into an in-memory fixed-size buffer in logical order. If the in-memory buffer is full, our prototype flushes it to disk as a new container and updates the fingerprint index on disk.
Step S3: Even if $C$ is in the Bloom filter, it may be a false positive. Our prototype queries the fingerprint index to ensure that $C$ is a duplicate. If $C$ is not in the fingerprint index, our prototype follows Step S2 to store $C$ as a unique chunk.
Step S4: If $C$ is in the fingerprint index, our prototype identifies the container that keeps the physical copy of $C$, and loads the fingerprints of all chunks in the container into the fingerprint cache. The rationale is that the logically nearby chunks of $C$ are likely to be accessed together due to chunk locality. If the fingerprint cache is full, our prototype removes the least-recently-used fingerprints.
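Steps S1 to S4 can be sketched as a small simulator that counts the three kinds of on-disk metadata access. This is a simplified sketch and not the authors' prototype: the class and attribute names are hypothetical, a Python `set` stands in for the Bloom filter (so it has no false positives), capacities are measured in fingerprints rather than bytes, and a chunk that is still in the open in-memory buffer is treated as a duplicate.

```python
from collections import OrderedDict


class DedupSimulator:
    """Metadata-flow sketch of a DDFS-like deduplication workflow."""

    def __init__(self, cache_capacity, container_capacity):
        self.cache = OrderedDict()        # fingerprint cache with LRU eviction
        self.cache_capacity = cache_capacity
        self.container_capacity = container_capacity
        self.bloom = set()                # stand-in for the Bloom filter
        self.index = {}                   # on-disk index: fingerprint -> container id
        self.containers = []              # flushed containers (lists of fingerprints)
        self.buffer = []                  # open in-memory container buffer
        self.stats = {"update": 0, "index": 0, "loading": 0}

    def _cache_insert(self, fp):
        self.cache[fp] = True
        self.cache.move_to_end(fp)
        while len(self.cache) > self.cache_capacity:
            self.cache.popitem(last=False)  # evict least-recently-used entry

    def _flush(self):
        container_id = len(self.containers)
        self.containers.append(self.buffer)
        for fp in self.buffer:
            self.index[fp] = container_id
        self.stats["update"] += len(self.buffer)  # update access
        self.buffer = []

    def _store_unique(self, fp):
        self.bloom.add(fp)
        self.buffer.append(fp)
        if len(self.buffer) >= self.container_capacity:
            self._flush()
        return "unique"

    def process(self, fp):
        if fp in self.cache:                       # Step S1
            self.cache.move_to_end(fp)
            return "duplicate"
        if fp not in self.bloom:                   # Step S2
            return self._store_unique(fp)
        self.stats["index"] += 1                   # Step S3: index access
        container_id = self.index.get(fp)
        if container_id is None:
            if fp in self.buffer:                  # still in the open buffer
                return "duplicate"
            return self._store_unique(fp)
        for other in self.containers[container_id]:  # Step S4: loading access
            self._cache_insert(other)
        self.stats["loading"] += len(self.containers[container_id])
        return "duplicate"
```

Multiplying each counter in `stats` by the per-fingerprint metadata size gives the size of metadata accessed for each access type.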
Our prototype mainly implements the metadata flow during deduplication, as shown in Figure 9. We focus on the evaluation of the metadata access overhead. We do not evaluate the performance of writing or reading containers and that of encrypting or decrypting chunks.
7.4.2 Evaluation Results
Our evaluation uses the following configurations. Here, we only focus on the FSL dataset. We set the metadata size of each fingerprint as 32 bytes. We consider two sizes of the fingerprint cache: 512MB and 4GB. We set the Bloom filter with a false positive rate of 0.01, and the Bloom filter size depends on the number of fingerprints that are tracked. For example, our FSL dataset contains around 65 million fingerprints (i.e., the total size is around 2GB), so the corresponding Bloom filter size is around 74MB. We also set the container size as 4MB.
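The sizes quoted above can be checked with a short calculation. The sketch below assumes the standard sizing formula for an optimally configured Bloom filter, m = -n ln(p) / (ln 2)^2 bits, and binary units (MiB, GiB); whether the prototype sizes its filter in exactly this way is an assumption, and the function name is hypothetical.

```python
import math


def bloom_filter_bytes(num_items, false_positive_rate):
    """Size in bytes of an optimally configured Bloom filter."""
    bits = -num_items * math.log(false_positive_rate) / (math.log(2) ** 2)
    return math.ceil(bits / 8)


num_fingerprints = 65_000_000        # approximate count stated in the text
metadata_bytes = num_fingerprints * 32
bloom_bytes = bloom_filter_bytes(num_fingerprints, 0.01)

print(f"fingerprint metadata: {metadata_bytes / 2**30:.2f} GiB")
print(f"Bloom filter: {bloom_bytes / 2**20:.1f} MiB")
```

Under these assumptions, the calculation gives roughly 1.9 GiB of fingerprint metadata and roughly 74 MiB for the Bloom filter, in line with the figures quoted in the text.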
We categorize the on-disk metadata access into three types: (i) update access, which updates the metadata of unique chunks in the fingerprint index (in Steps S2 and S3); (ii) index access, which looks up the on-disk fingerprint index for the detection of duplicate chunks (in Step S3); and (iii) loading access, which loads the fingerprints of stored chunks into the cache (in Step S4). We measure the metadata access overhead in terms of the size of metadata being accessed.
In the following, we compare the metadata access overhead of our combined MinHash encryption and scrambling scheme with MLE, in which we encrypt the chunks by the original MLE-based deterministic encryption.
Figure 10 first presents the results when the fingerprint cache size is 512MB, in which case the size is insufficient to hold all fingerprints in the FSL dataset (whose total metadata size for all fingerprints is around 2GB). Figure 10(a) shows the overall metadata access overhead. In the first backup, the combined scheme even incurs less metadata access overhead than MLE, mainly because it generates more unique chunks at the beginning and reduces the frequency of loading fingerprints from disk into the fingerprint cache (in Step S4). In the subsequent backups, the combined scheme has slightly higher overhead than MLE (at most 1.2%), since it generates more unique chunks and needs to load fingerprints more often from disk to the fingerprint cache. Figures 10(b) and 10(c) show the breakdown of the metadata access overhead for MLE and the combined scheme, respectively. The update access size for both schemes is less than 0.3GB after the first backup (in which MLE and the combined scheme incur 1.0GB and 1.3GB of metadata access, respectively), as only a small portion of new or modified chunks are stored. The index access size is also small, with less than 0.1GB for both schemes in all backups, since a significant portion of duplicate and unique chunks can be detected by the fingerprint cache and the Bloom filter, respectively. Finally, we observe that the loading access size contributes the most overhead, with more than 74.2% of the total metadata access size for both schemes.
Figure 11 presents the results when the fingerprint cache size is increased to 4GB, in which case the fingerprint cache is sufficiently large to hold the fingerprints of all unique chunks. As shown in Figure 11(a), the combined scheme incurs 6.4-20.0% less metadata access overhead than MLE, as it generates more unique chunks while all fingerprints can be stored in the fingerprint cache. Figures 11(b) and 11(c) show the corresponding breakdown for MLE and the combined scheme, respectively. Both the update access size and the index access size are similar to those in Figure 10, while the loading access size is significantly reduced by around 22% and 29% for MLE and the combined scheme, respectively, mainly due to a high probability of cache hits.
8 Related Work
Existing deduplication studies exploit workload characteristics (e.g., chunk locality [57, 37, 31, 55] and file similarity [12, 55]) to mitigate indexing overhead. For example, DDFS prefetches the fingerprints of nearby chunks that are likely to be accessed together. Sparse Indexing and Extreme Binning exploit chunk locality and file similarity, respectively, to mitigate the memory storage for indexing, while SiLo combines both chunk locality and file similarity for general backup workloads. Bimodal builds on chunk locality and adaptively varies the expected chunk sizes to mitigate metadata overhead. None of the above works considers security.
Traditional encrypted deduplication systems (e.g., [19, 5, 50, 54, 17, 29]) mainly build on convergent encryption , in which the encryption key is directly derived from the cryptographic hash of the content to be encrypted. CDStore  integrates convergent encryption with secret sharing to support fault-tolerant storage. However, convergent encryption is vulnerable to brute-force attacks (see Section 2.2). Server-aided MLE protects against brute-force attacks by maintaining content-to-key mappings in a dedicated key manager, and has been implemented in various storage system prototypes [48, 6, 45, 10]. Given that the dedicated key manager is a single-point-of-failure, Duan  proposes to maintain a quorum of key managers via threshold signature for fault-tolerant key management. Note that all the above systems build on deterministic encryption to preserve the deduplication capability of ciphertext chunks, and hence are vulnerable to the inference attacks studied in this paper.
Instead of using deterministic encryption, Bellare et al.  propose an MLE variant called random convergent encryption (RCE), which uses random keys for chunk encryption. However, RCE needs to add deterministic tags into ciphertext chunks for checking any duplicates, so that the adversary can count the deterministic tags to obtain the frequency distribution. Liu et al.  propose to encrypt each plaintext chunk with a random key, while the key is shared among users via password-based key exchange. However, the proposed approach incurs significant key exchange overhead, especially when the number of chunks is huge.
From the theoretical perspective, several studies propose to enhance the security of encrypted deduplication and protect the frequency distribution of original chunks. Abadi et al. propose two encrypted deduplication schemes for the chunks that depend on public parameters, yet either of them builds on computationally expensive non-interactive zero knowledge (NIZK) proofs or produces deterministic ciphertext components. Interactive MLE addresses chunk correlation and parameter dependence, yet it is impractical due to its use of fully homomorphic encryption (FHE). This paper differs from the above works by using lightweight primitives for practical encrypted deduplication.
Frequency analysis  is the classical inference attack and has been historically used to recover plaintexts from substitution-based ciphertexts. It is also used as a building block in recently proposed attacks. Kumar et al.  use frequency-based analysis to de-anonymize query logs. Islam et al.  compromise keyword privacy based on the leakage of the access patterns in keyword search. Naveed et al.  propose to conduct frequency analysis via combinatorial optimization and present attacks against CryptDB. Kellaris et al.  propose reconstruction attacks against any system that leaks access pattern or communication volume. Pouliot et al.  present the graph matching attacks on searchable encryption. Grubbs et al.  build attacks on order-preserving encryption based on the frequency and ordering information.
In encrypted deduplication, Ritzdorf et al. exploit the size information of deduplicated content and build an inference attack that determines whether a file has been stored. Armknecht et al. present a formal analysis of the side-channel attack that works only in client-side deduplication. Our work is different as we focus on inferring the content of data chunks via frequency analysis. In particular, we exploit workload characteristics to construct attack and defense approaches.
Some inference attacks exploit active adversarial capabilities. Brekne et al. construct bogus packets to de-anonymize IP addresses. Cash et al. and Zhang et al. propose file-injection attacks against searchable encryption. Our proposed attacks do not rely on any active adversarial capability.
Encrypted deduplication has been deployed in commercial cloud environments and extensively studied in the literature as a way to achieve both data confidentiality and storage efficiency, yet we argue that its data confidentiality is still not fully guaranteed. We demonstrate how the deterministic nature of encrypted deduplication makes it susceptible to information leakage caused by frequency analysis. We propose the locality-based attack, which exploits the chunk locality property of backup workloads to infer the content of a large fraction of plaintext chunks from the ciphertext chunks of the latest backup. We also propose the advanced locality-based attack, which extends the locality-based attack with the knowledge of chunk sizes to launch frequency analysis specifically against variable-size chunks. We show how the inference attacks can be practically implemented, and demonstrate their severity through trace-driven evaluation on both real-world and synthetic datasets. To defend against information leakage, we consider MinHash encryption and scrambling to disturb the frequency ranking and break chunk locality. Our trace-driven evaluation shows that combining MinHash encryption and scrambling effectively defends against the locality-based attack, while maintaining high storage efficiency and incurring limited metadata access overhead.
-  FSL traces and snapshots public archive. http://tracer.filesystems.org/, 2014.
-  Martín Abadi, Dan Boneh, Ilya Mironov, Ananth Raghunathan, and Gil Segev. Message-locked encryption for lock-dependent messages. In Advances in Cryptology – CRYPTO 2013, pages 374–391, 2013.
-  Ibrahim A. Al-Kadi. Origins of Cryptology: The Arab Contributions. Cryptologia, 16(2):97–126, 1992.
-  George Amvrosiadis and Medha Bhadkamkar. Identifying trends in enterprise data protection systems. In Proceedings of USENIX Annual Technical Conference (USENIX ATC’15), 2015.
-  Paul Anderson and Le Zhang. Fast and secure laptop backups with encrypted de-duplication. In Proceedings of the 24th International Conference on Large Installation System Administration (LISA’10), pages 1–8, 2010.
-  Frederik Armknecht, Jens-Matthias Bohli, Ghassan O. Karame, and Franck Youssef. Transparent data deduplication in the cloud. In Proceedings of the 22nd ACM Conference on Computer and Communications Security (CCS’15), pages 886–900, 2015.
-  Frederik Armknecht, Colin Boyd, Gareth T. Davies, Kristian Gjøsteen, and Mohsen Toorani. Side channels in deduplication: Trade-offs between leakage and efficiency. In Proceedings of ACM Asia Conference on Computer and Communications Security (ASIACCS’17), pages 266–274, 2017.
-  Michael Arrington. AOL: “this was a screw up”. https://techcrunch.com/2006/08/07/aol-this-was-a-screw-up/, 2006.
-  Mihir Bellare and Sriram Keelveedhi. Interactive message-locked encryption and secure deduplication. In Public-Key Cryptography – PKC 2015, pages 516–538, 2015.
-  Mihir Bellare, Sriram Keelveedhi, and Thomas Ristenpart. DupLESS: Server-aided encryption for deduplicated storage. In Proceedings of the 22nd USENIX Security Symposium (USENIX Security’13), pages 179–194, 2013.
-  Mihir Bellare, Sriram Keelveedhi, and Thomas Ristenpart. Message-locked encryption and secure deduplication. In Advances in Cryptology – EUROCRYPT 2013, pages 296–312, 2013.
-  Deepavali Bhagwat, Kave Eshghi, Darrell D.E. Long, and Mark Lillibridge. Extreme binning: Scalable, parallel deduplication for chunk-based file backup. In Proceedings of IEEE International Symposium on Modeling, Analysis & Simulation of Computer and Telecommunication Systems (MASCOTS’09), pages 1–9, 2009.
-  John Black. Compare-by-hash: a reasoned analysis. In Proceedings of USENIX Annual Technical Conference (USENIX ATC’06), pages 85–90, 2006.
-  Tønnes Brekne, André Årnes, and Arne Øslebø. Anonymization of IP traffic monitoring data: Attacks on two prefix-preserving anonymization schemes and some proposed remedies. In Proceedings of International Workshop on Privacy Enhancing Technologies (PET’05), pages 179–196, 2005.
-  Andrei Z. Broder. On the resemblance and containment of documents. In Proceedings of the Compression and Complexity of Sequences (SEQUENCES’97), pages 21–29, 1997.
-  David Cash, Paul Grubbs, Jason Perry, and Thomas Ristenpart. Leakage-abuse attacks against searchable encryption. In Proceedings of the 22nd ACM Conference on Computer and Communications Security (CCS’15), pages 668–679, 2015.
-  Landon P. Cox, Christopher D. Murray, and Brian D. Noble. Pastiche: Making backup cheap and easy. In Proceedings of the 5th Symposium on Operating Systems Design and Implementation (OSDI’02), pages 285–298, 2002.
-  Barb Darrow. Harvard-affiliate McLean hospital loses patient data. http://fortune.com/2015/07/29/mclean-hospital-loses-patient-data/, 2015.
-  John R. Douceur, Atul Adya, William J. Bolosky, Dan Simon, and Marvin Theimer. Reclaiming space from duplicate files in a serverless distributed file system. In Proceedings of the 22nd International Conference on Distributed Computing Systems (ICDCS’02), pages 617–624, 2002.
-  Yitao Duan. Distributed key generation for encrypted deduplication: Achieving the strongest privacy. In Proceedings of the 6th edition of the ACM Workshop on Cloud Computing Security (CCSW’14), pages 57–68, 2014.
-  Kave Eshghi and Hsiu Khuern Tang. A framework for analyzing and improving content-based chunking algorithms. HPL-2005-30R1, 2005.
-  Sanjay Ghemawat and Jeff Dean. LevelDB: A fast key/value storage library by Google. https://github.com/google/leveldb, 2014.
-  Paul Grubbs, Kevin Sekniqi, Vincent Bindschaedler, Muhammad Naveed, and Thomas Ristenpart. Leakage-abuse attacks against order-revealing encryption. In Proceedings of IEEE Symposium on Security and Privacy (SP’17), pages 655–672, 2017.
-  Robert Hackett. Linkedin lost 167 million account credentials in data breach. http://fortune.com/2016/05/18/linkedin-data-breach-email-password/, 2016.
-  Shai Halevi, Danny Harnik, Benny Pinkas, and Alexandra Shulman-Peleg. Proofs of ownership in remote storage systems. In Proceedings of the 18th ACM conference on Computer and Communications Security (CCS’11), pages 491–500, 2011.
-  Danny Harnik, Benny Pinkas, and Alexandra Shulman-Peleg. Side channels in cloud services: Deduplication in cloud storage. IEEE Security & Privacy, 8(6):40–47, 2010.
-  Mohammad Saiful Islam, Mehmet Kuzu, and Murat Kantarcioglu. Access pattern disclosure on searchable encryption: Ramification, attack and mitigation. In Proceedings of Network and Distributed System Security Symposium (NDSS’12), pages 1–15, 2012.
-  Keren Jin and Ethan L. Miller. The effectiveness of deduplication on virtual machine disk images. In Proceedings of the Israeli Experimental Systems Conference (SYSTOR’09), pages 7:1–7:12, 2009.
-  Mahesh Kallahalla, Erik Riedel, Ram Swaminathan, Qian Wang, and Kevin Fu. Plutus: Scalable secure file sharing on untrusted storage. In Proceedings of USENIX Conference on File and Storage Technologies (FAST’03), pages 29–42, 2003.
-  Georgios Kellaris, George Kollios, Kobbi Nissim, and Adam O’Neill. Generic attacks on secure outsourced databases. In Proceedings of ACM Conference on Computer and Communications Security (CCS’16), pages 1329–1340, 2016.
-  Erik Kruus, Cristian Ungureanu, and Cezary Dubnicki. Bimodal content defined chunking for backup streams. In Proceedings of USENIX Conference on File and Storage Technologies (FAST’10), 2010.
-  Ravi Kumar, Jasmine Novak, Bo Pang, and Andrew Tomkins. On anonymizing query logs via token-based hashing. In Proceedings of the 16th international conference on World Wide Web (WWW’07), pages 629–638, 2007.
-  Marie-Sarah Lacharité and Kenneth G. Paterson. A note on the optimality of frequency analysis vs. ℓp-optimization. Cryptology ePrint Archive: Report 2015/1158. https://eprint.iacr.org/2015/1158, 2015.
-  Jingwei Li, Chuan Qin, Patrick P. C. Lee, and Xiaosong Zhang. Information leakage in encrypted deduplication via frequency analysis. In Proceedings of the 47th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN’17), pages 1–12, 2017.
-  Mingqiang Li, Chuan Qin, and Patrick P. C. Lee. CDStore: Toward reliable, secure, and cost-efficient cloud storage via convergent dispersal. In Proceedings of USENIX Annual Technical Conference (USENIX ATC’15), pages 111–124, 2015.
-  Mark Lillibridge, Kave Eshghi, and Deepavali Bhagwat. Improving restore speed for backup systems that use inline chunk-based deduplication. In Proceedings of the 11th USENIX Conference on File and Storage Technologies (FAST’13), pages 183–197, 2013.
-  Mark Lillibridge, Kave Eshghi, Deepavali Bhagwat, Vinay Deolalikar, Greg Trezise, and Peter Camble. Sparse indexing: Large scale, inline deduplication using sampling and locality. In Proceedings of USENIX Conference on File and Storage Technologies (FAST’09), pages 111–123, 2009.
-  Jian Liu, N. Asokan, and Benny Pinkas. Secure deduplication of encrypted data without additional independent servers. In Proceedings of the 22nd ACM Conference on Computer and Communications Security (CCS’15), pages 874–885, 2015.
-  Alfred J. Menezes, Paul C. van Oorschot, and Scott A. Vanstone. Handbook of Applied Cryptography. CRC Press, 2001.
-  Dutch T. Meyer and William J. Bolosky. A study of practical deduplication. In Proceedings of the 9th USENIX Conference on File and Storage Technologies (FAST’11), pages 1–1, 2011.
-  Martin Mulazzani, Sebastian Schrittwieser, Manuel Leithner, Markus Huber, and Edgar Weippl. Dark clouds on the horizon: Using cloud storage as attack vector and online slack space. In Proceedings of the 20th USENIX Security Symposium (USENIX Security’11), pages 1–11, 2011.
-  M. Naveed, M. Prabhakaran, and C.A. Gunter. Dynamic searchable encryption via blind storage. In Proceedings of IEEE Symposium on Security and Privacy (SP’14), pages 639–654, May 2014.
-  Muhammad Naveed, Seny Kamara, and Charles V. Wright. Inference attacks on property-preserving encrypted databases. In Proceedings of the 22nd ACM Conference on Computer and Communications Security (CCS’15), pages 644–655, 2015.
-  David Pouliot and Charles V. Wright. The shadow nemesis: Inference attacks on efficiently deployable, efficiently searchable encryption. In Proceedings of the 23rd ACM Conference on Computer and Communications Security (CCS’16), pages 1341–1352, 2016.
-  Chuan Qin, Jingwei Li, and Patrick P. C. Lee. The design and implementation of a rekeying-aware encrypted deduplication storage system. ACM Trans. on Storage, 13(1):9:1–9:30, Mar 2017.
-  Michael O. Rabin. Fingerprinting by random polynomials. Center for Research in Computing Technology, Harvard University. Tech. Report TR-CSE-03-01, 1981.
-  Hubert Ritzdorf, Ghassan Karame, Claudio Soriente, and Srdjan Čapkun. On information leakage in deduplicated storage systems. In Proceedings of ACM on Cloud Computing Security Workshop (CCSW’16), pages 61–72, 2016.
-  Peter Shah and Won So. Lamassu: Storage-efficient host-side encryption. In Proceedings of USENIX Annual Technical Conference (USENIX ATC’15), pages 333–345, 2015.
-  Elaine Shi, T.-H. Hubert Chan, Emil Stefanov, and Mingfei Li. Oblivious RAM with worst-case cost. In Advances in Cryptology – ASIACRYPT 2011, pages 197–214, 2011.
-  Mark W. Storer, Kevin Greenan, Darrell D.E. Long, and Ethan L. Miller. Secure data deduplication. In Proceedings of the 4th ACM International Workshop on Storage Security and Survivability (StorageSS’08), pages 1–10, 2008.
-  Zhu Sun, Geoff Kuenning, Sonam Mandal, Philip Shilane, Vasily Tarasov, Nong Xiao, and Erez Zadok. A long-term user-centric analysis of deduplication patterns. In Proceedings of the 32nd Symposium on Mass Storage Systems and Technologies (MSST’16), 2016.
-  Vasily Tarasov, Amar Mudrankit, Will Buik, Philip Shilane, Geoff Kuenning, and Erez Zadok. Generating realistic datasets for deduplication analysis. In Proceedings of USENIX Annual Technical Conference (USENIX ATC’12), pages 24–24, 2012.
-  Grant Wallace, Fred Douglis, Hangwei Qian, Philip Shilane, Stephen Smaldone, Mark Chamness, and Windsor Hsu. Characteristics of backup workloads in production systems. In Proceedings of the 10th USENIX conference on File and Storage Technologies (FAST’12), pages 33–48, 2012.
-  Zooko Wilcox-O’Hearn and Brian Warner. Tahoe: The least-authority filesystem. In Proceedings of the 4th ACM International Workshop on Storage Security and Survivability (StorageSS’08), pages 21–26, 2008.
-  Wen Xia, Hong Jiang, Dan Feng, and Yu Hua. SiLo: A similarity locality based near exact deduplication scheme with low RAM overhead and high throughput. In Proceedings of USENIX Annual Technical Conference (USENIX ATC’11), pages 285–298, 2011.
-  Yupeng Zhang, Jonathan Katz, and Charalampos Papamanthou. All your queries are belong to us: the power of file-injection attacks on searchable encryption. In Proceedings of the 25th USENIX Security Symposium (Security’16), pages 707–720, 2016.
-  Benjamin Zhu, Kai Li, and R Hugo Patterson. Avoiding the disk bottleneck in the data domain deduplication file system. In Proceedings of the 6th USENIX Conference on File and Storage Technologies (FAST’08), pages 269–282, 2008.