Deduplication is a common practical compression technique in filesystems and other storage systems. It has been found to achieve significant space savings in several empirical studies for different workloads [2, 3]. Despite its practical importance, it has received little attention in the information theory community, with only Niesen’s recent work analyzing its compression potential. As more data is generated every year, a thorough understanding of the fundamental limits of deduplication and similar techniques is of utmost importance.
A significant shortcoming of deduplication is that near-identical files are not identified, and are considered as completely different files. This can discourage the adoption of deduplication in some scenarios. An example is a network of Internet of Things (IoT) devices sensing an underlying process. Their measurements will be highly correlated, but may differ slightly due to spatial distance, measurement noise, and other factors. Deduplication for data of this type can, to some extent, be enabled with a generalized view on deduplication. This generalized deduplication allows near-identical chunks to be deduplicated, while still ensuring lossless reconstruction of the data. The first study of the properties of this technique is presented in this paper, and it is shown how the generalization compares to standard deduplication.
I-A Related work
To our knowledge, Niesen presents the only previous information-theoretical analysis of deduplication. Niesen’s work introduces a source model, formalizes deduplication approaches with chunks of both fixed-length and variable-length, and analyzes the performance of the approaches. Our paper uses a similar strategy to analyze generalized deduplication.
The manner in which deduplication is presented will make it clear that it is similar to classical universal source coding techniques such as the LZ algorithms [5, 6]. In practice, the main difference between the methods is the scale at which they operate. Deduplication attempts to identify large matching chunks (KB) on a global scale (GB to TB), whereas classical methods identify smaller amounts of redundancy (B) in a relatively small window (KB to MB).
The problem in deduplication is also similar to the problem of coding for sources with unknown alphabets or multiple alphabets. Such schemes attempt to identify the underlying source alphabet, and use this for universal compression, ideally approaching entropy regardless of the source’s distribution. Deduplication can be seen as one such approach, compressing the source output by building a dictionary (alphabet) and replacing elements with a pointer to the dictionary.
This paper provides a formal analysis of generalized deduplication and comparisons to standard deduplication, a special case. The main contributions are:
We present a simple model for generalized deduplication as a source coding technique. This model is used to derive upper and lower bounds on the expected length of encoded sequences. The potential gain of the generalization against the standard approach is bounded, quantifying the value of the generalization for data fitting the source structure.
We derive the asymptotic cost of generalized deduplication, showing that the method converges to as little as one bit more than the source entropy per chunk. We analyze how fast this convergence happens, and show that the generalization allows for faster convergence.
Concrete examples are used to show that the lower bounds are achievable. The generalization’s potential for faster convergence and compression gain is easily visualized.
Theorem proofs are deferred to the appendices.
II Problem Setting
II-A Generalized deduplication
Generalized deduplication is now presented as a technique for source coding. In this paper, the technique operates on a randomly sampled binary sequence , which consists of several chunks. The chunks are restricted to have equal length, bits. The chunks in the sequence are a combination of a base and a deviation. The base is responsible for most of the chunk’s information content, whereas the deviation is the (small) difference between the base and the chunk. This property of the data is important for the coding procedure. Formally, the possible bases form a set and the deviations form a set . These sets define the set of all potential chunks, . The method requires identification of a minimum distance mapping , which will be used to identify a chunk’s base. The deviation can be found by comparing the chunk to its base. The encoder and decoder must have prior knowledge of and , which are used to determine the coded representations. The coder does not know which bases are the active ones, which is some set , forming .
The presented algorithm encodes (decodes) a sequence in one pass, encoding (decoding) over a dictionary of previously encountered bases. In practical systems, data is structured in databases, since this enables independent and parallel access and higher speed. However, this paper follows the traditional source coding style of operating on a sequence, since this simplifies analysis.
The encoding procedure is initialized with an empty deduplication dictionary. To encode a sequence, it is processed sequentially, one chunk at a time. The mapping is applied to the chunk, identifying the base and the deviation. The base is deduplicated against the elements of the dictionary. If it does not yet exist in the dictionary, it is added to the dictionary, and this is indicated with a one-bit flag in the output sequence followed by the base itself. If it already exists, this is indicated by the opposite flag value in the coded sequence, followed by a pointer to the base’s location in the dictionary. The length of the pointer in bits is the logarithm of the current dictionary size, rounded up (all logarithms in this paper are to base 2). The deviation is added to the output sequence, following the base. It does not need to be represented in full, since knowing the set of deviations allows specification of an optimal representation whose length in bits is the logarithm of the number of deviations, rounded up.
The coded sequence is uniquely decodable. The decoding procedure is also initialized with an empty deduplication dictionary. Decoding happens one chunk at a time, parsing the sequence on the fly. If the first bit of a coded chunk flags a new base, the base follows directly and is added to the dictionary. If it instead flags a deduplicated base, the base must already exist in the dictionary and is looked up using the pointer that follows. The coded deviation is expanded to its full representation. Finally, the chunk is reconstructed by combining the base and the deviation, and the reconstruction is added to the output sequence. This is repeated until the coded sequence has been processed in its entirety.
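The one-pass encoder and decoder described above can be sketched as follows. This is a minimal illustration under assumptions that the text does not fix: the flag value "1" marks a new base and "0" a deduplicated one, pointers are zero-based dictionary indices, bases are stored uncompressed, and the helper names (`encode`, `decode`, `split_toy`, `join_toy`) are hypothetical. The toy mapping simply clears the last bit of a 4-bit chunk to obtain its base; it stands in for a real minimum-distance mapping.

```python
from typing import Callable, List, Tuple

def pointer_bits(dict_size: int) -> int:
    """ceil(log2(dict_size)): bits needed to address one of dict_size entries."""
    return (dict_size - 1).bit_length()

def to_bits(value: int, width: int) -> str:
    """Fixed-width binary string; empty when no bits are needed."""
    return format(value, "b").zfill(width) if width else ""

def encode(chunks: List[str], split: Callable[[str], Tuple[str, int]],
           dev_bits: int) -> str:
    dictionary: List[str] = []
    out: List[str] = []
    for chunk in chunks:
        base, dev = split(chunk)
        if base in dictionary:
            # Assumed flag "0": base is known, emit a pointer to it.
            idx = dictionary.index(base)
            out.append("0" + to_bits(idx, pointer_bits(len(dictionary))))
        else:
            # Assumed flag "1": base is new, emit it in full.
            dictionary.append(base)
            out.append("1" + base)
        out.append(to_bits(dev, dev_bits))  # compact deviation
    return "".join(out)

def decode(code: str, join: Callable[[str, int], str],
           base_bits: int, dev_bits: int) -> List[str]:
    dictionary: List[str] = []
    chunks: List[str] = []
    pos = 0
    while pos < len(code):
        flag, pos = code[pos], pos + 1
        if flag == "1":
            base = code[pos:pos + base_bits]
            pos += base_bits
            dictionary.append(base)
        else:
            width = pointer_bits(len(dictionary))
            idx = int(code[pos:pos + width], 2) if width else 0
            pos += width
            base = dictionary[idx]
        dev = int(code[pos:pos + dev_bits], 2) if dev_bits else 0
        pos += dev_bits
        chunks.append(join(base, dev))
    return chunks

# Toy mapping (an assumption for illustration): base = chunk with last bit cleared.
def split_toy(chunk: str) -> Tuple[str, int]:
    return chunk[:-1] + "0", int(chunk[-1])

def join_toy(base: str, dev: int) -> str:
    return base[:-1] + str(dev)

chunks = ["1010", "1011", "0110", "1010"]
code = encode(chunks, split_toy, dev_bits=1)
decoded = decode(code, join_toy, base_bits=4, dev_bits=1)
```

Passing the identity split `lambda c: (c, 0)` with `dev_bits=0` turns the same routines into standard deduplication, where each chunk is its own base.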
The standard deduplication approach arises as an important special case. It is obtained by considering each chunk as its own base, and thus there is no deviation. Formally, this means that the set of deviations contains only the all-zero chunk, so the set of potential bases coincides with the set of potential chunks, and the minimum-distance mapping is the identity function.
II-B Source model
A formal source model is now specified. All analysis in this paper uses this source structure. Chunks will have a length of symbols, and are generated by a combination of two sources. Our analysis is restricted to binary symbols, so chunks are in the binary extension field .
The first source generates the active bases, and is denoted by . is a packing of -dimensional spheres with radius in . The second source generates the deviations, and is denoted by . This source consists of elements with low Hamming weight, i.e., for the same as the packing. This allows definition of the chunk source, , which can be interpreted as all points inside some spheres in , where the spheres are centered at the bases from and have radii . The fact that a sphere packing is used for implies that spheres are non-overlapping and, thus, and .
Example 1 (Source construction).
Let the potential bases be the codewords of the (7,4) Hamming code, and let the deviations consist of all vectors of Hamming weight at most 1. Spheres of radius 1 around the codewords cover the entire field, so every 7-bit vector is a potential chunk. In this example, let the base source have two elements, e.g.,
and then becomes
with 16 elements, one for each combination of the two bases and the 8 deviations. An optimal coding of this source uses 4 bits per chunk. The minimum-distance mapping can be derived from the decoding procedure for the Hamming code.
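A minimum-distance mapping for the (7,4) Hamming code can be sketched with syndrome decoding. The sketch below assumes one particular parity-check convention, which the text does not specify: column j of the parity-check matrix is the binary expansion of j, so the syndrome, read as an integer, is the position of a single flipped bit. Chunks are held as 7-bit integers, and the function names are hypothetical.

```python
def syndrome(chunk: int) -> int:
    """Syndrome of a 7-bit chunk: XOR of the (1-based) positions of its set bits."""
    s = 0
    for j in range(1, 8):
        if (chunk >> (j - 1)) & 1:
            s ^= j
    return s

def split(chunk: int):
    """Map a chunk to (base, deviation). The base is the nearest codeword;
    the deviation is kept in its 3-bit syndrome representation."""
    s = syndrome(chunk)
    base = chunk ^ (1 << (s - 1)) if s else chunk
    return base, s

def join(base: int, s: int) -> int:
    """Inverse of split: expand the syndrome to a weight-1 vector and add it."""
    return base ^ (1 << (s - 1)) if s else base

# The codewords are exactly the chunks with zero syndrome.
codewords = [c for c in range(128) if syndrome(c) == 0]
```

Each of the 128 possible chunks lies within Hamming distance 1 of exactly one of the 16 codewords, so the pair (base, syndrome) identifies the chunk without loss.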
II-C Coding a source
Generalized deduplication has greater potential with large data sets and long chunks, yet a small example is useful to understand the method. An example is presented for the source of Example 1. A detailed explanation of the encoding and decoding procedures is found in Appendix A. We start with the simpler special case, standard deduplication.
Example 2 (Deduplication).
Consider the source from Example 1. Five chunks are chosen uniformly at random and concatenated. This forms a sequence of 35 bits (delimiters are inserted between chunks for ease of reading; the coding and decoding procedures do not require this):
Applying deduplication to this sequence results in:
where the final dictionary is and bits are used in total.
Let us now consider generalized deduplication. Full knowledge of and is available, and is used to determine the deviation representation and the minimum-distance mapping.
Example 3 (Generalized deduplication).
Consider again the sequence of Example 2. To apply generalized deduplication, a representation for the deviations is needed. As the deviations are drawn uniformly from a set of 8 elements, their entropy is 3 bits, so 3 bits is optimal for their representation. An optimal representation is
which in this special case is the syndrome representation of the (7,4) Hamming code. To compress the sequence, the minimum-distance mapping is applied to each chunk, identifying the closest base, which is a codeword of the Hamming code. The base is here represented in full, although it may also be compressed, since the set of potential bases is known. The result is:
where the final dictionary is and bits are used.
Although in this limited example deduplication outperforms the generalization, our results show that this is not the case in general. In fact, the results show that there are significant benefits in convergence speed of using the generalized form.
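The bit counts of the two schemes can be checked with a short calculation. The chunk values below are hypothetical (the all-zero and all-one Hamming codewords are assumed as the two active bases); only the repetition pattern follows the walkthrough in Appendix A, where the third and fifth chunks repeat the second and the fourth chunk introduces the second base. The cost rules are the ones described above: one flag bit, a 7-bit uncompressed base when it is new, a pointer of ceil(log2 |dictionary|) bits otherwise, and a 3-bit deviation in the generalized scheme.

```python
def coded_length(bases, base_bits, dev_bits):
    """Total coded bits, given the base of every chunk in order of appearance."""
    seen = []
    total = 0
    for b in bases:
        if b in seen:
            pointer = (len(seen) - 1).bit_length()  # ceil(log2(len(seen)))
            total += 1 + pointer + dev_bits
        else:
            seen.append(b)
            total += 1 + base_bits + dev_bits
    return total

# Hypothetical chunks with the same repetition pattern as the examples.
chunks = ["0100000", "0000100", "0000100", "1111011", "0000100"]

def nearest_base(chunk):
    # Assumed active bases: all-zeros and all-ones, each with a radius-1 sphere.
    return "0000000" if chunk.count("1") <= 1 else "1111111"

# Standard deduplication: every chunk is its own base, no deviation bits.
standard_bits = coded_length(chunks, base_bits=7, dev_bits=0)
# Generalized deduplication: deduplicate bases, always send a 3-bit deviation.
generalized_bits = coded_length([nearest_base(c) for c in chunks],
                                base_bits=7, dev_bits=3)
print(standard_bits, generalized_bits)
```

For such a short sequence the fixed 3-bit deviation cost dominates, which is why standard deduplication comes out ahead here; the advantage reverses once the dictionary of bases has converged.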
In this section, the coded length of sequences is studied. Let be a random binary sequence of chunks of bits each, so . The interesting metric is the expected coded length, given the length of the original sequence.
III-A Bounds for coded sequence length for the generalization
The expected length of the sequence after generalized deduplication is . This is decomposed as the sum of expected coded length of each chunk in :
where is the indicator function, is the dictionary after chunk , is the base of chunk , is the number of bits needed to point to the dictionary, and finally is the number of bits used for representing the deviation. The base itself might be compressed to bits with , since is known. Since chunks are drawn uniformly at random from , this is equivalent to picking a base and a deviation uniformly at random from and . Thus,
We now state Theorem 1 bounding the expected length after generalized deduplication in the presented source model.
The expected length of the generalized deduplication-encoded sequence from chunks of length is bounded as
III-B Bounds for coded sequence length for deduplication
Standard deduplication is a special case which allows for a slightly closer upper bound, and is therefore treated separately. The expected length of the sequence after deduplication is . With the previous notation,
where is chunk itself, since it is now the base. This base cannot be compressed as before, so it uses bits.
III-C Bounds for the gain of generalized deduplication
The generalization ratio is
IV-A Asymptotic storage cost
In this section, we provide theorems bounding the asymptotic coded length of a new chunk for generalized deduplication. Let be the expected length of chunk when generalized deduplication is used, i.e.,
Then the asymptotic cost of generalized deduplication is bounded by Theorem 3.
Generalized deduplication has asymptotic cost
where is the set of potential chunks.
Generalized deduplication is thus asymptotically between one and three bits above the entropy of the chunk source. In practice, the method will operate on larger chunks with high entropy, so this overhead will be negligible. Similarly, let be the expected length of chunk in standard deduplication:
For this special case, the closer upper bound in Theorem 2 translates to a closer upper bound in asymptotic cost.
Standard deduplication has asymptotic cost
where is the set of potential chunks.
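The origin of the one-to-three-bit and one-to-two-bit overheads can be illustrated numerically. The sketch assumes that the dictionary has converged, so every chunk costs one flag bit plus rounded-up logarithms, and that chunks are uniform, so the entropy is the logarithm of the number of active chunks. The function and parameter names are hypothetical.

```python
from math import log2

def ceil_log2(n: int) -> int:
    """Exact ceil(log2(n)) for n >= 1."""
    return (n - 1).bit_length()

def converged_cost_generalized(num_bases: int, num_devs: int) -> int:
    # flag + pointer to a base + compact deviation
    return 1 + ceil_log2(num_bases) + ceil_log2(num_devs)

def converged_cost_standard(num_bases: int, num_devs: int) -> int:
    # flag + pointer to one of num_bases * num_devs chunks
    return 1 + ceil_log2(num_bases * num_devs)

def entropy(num_bases: int, num_devs: int) -> float:
    # uniform chunk source
    return log2(num_bases * num_devs)

# Example 1: two active bases, eight deviations.
print(converged_cost_generalized(2, 8), converged_cost_standard(2, 8), entropy(2, 8))
```

When both set sizes are powers of two, neither rounding costs anything and both schemes reach entropy plus one bit. Otherwise the generalization pays for up to two roundings and standard deduplication for one.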
IV-B Rate of Convergence
Now that it is established that generalized deduplication schemes converge to slightly more than the entropy of the chunk source, it is also important to quantify the speed of convergence. Generalized deduplication should in general converge faster than deduplication, since the number of potential bases is smaller. The generalization only needs to identify the active bases for convergence, whereas the standard approach, where every chunk is its own base, must identify every active chunk. Convergence of the standard approach thus requires identifying more bases, by a factor equal to the number of deviations. To formally analyze this, the following definition is needed [9, pp. 12–13].
The rate of convergence of a sequence converging to is
with smaller values implying faster convergence.
For generalized deduplication, convergence happens according to the convergence of . When this sequence has converged, remains constant, making it sufficient to analyze this sequence. Thus,
For the case of standard deduplication,
Since . Thus, generalized deduplication will be able to converge faster. In fact, even in simple cases. Both approaches exhibit linear convergence .
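The difference in convergence speed can be made concrete with a small calculation. It assumes that convergence is governed by the probability that the dictionary entry of the i-th chunk is new, which for m equally likely entries is (1 - 1/m)^(i-1); the symbols m and i and the function names are introduced here for illustration only. The ratio of successive terms of this sequence is 1 - 1/m, a value strictly between 0 and 1, which corresponds to linear convergence.

```python
def new_entry_probability(m: int, i: int) -> float:
    """Probability that the i-th uniform draw from m entries has not been seen before."""
    return (1 - 1 / m) ** (i - 1)

def empirical_rate(m: int, i: int = 50) -> float:
    """Ratio of successive terms of the sequence, which converges to 0."""
    return new_entry_probability(m, i + 1) / new_entry_probability(m, i)

def chunks_until(m: int, eps: float) -> int:
    """Smallest i for which the probability of a new entry drops below eps."""
    i = 1
    while new_entry_probability(m, i) >= eps:
        i += 1
    return i

# Example 1: 2 active bases (generalized) versus 16 active chunks (standard).
print(empirical_rate(2), empirical_rate(16))
print(chunks_until(2, 0.01), chunks_until(16, 0.01))
```

With the sizes of Example 1, the generalization needs far fewer chunks than standard deduplication before new dictionary entries become rare.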
V Numerical Results
To visualize the results presented in the paper, a concrete example is considered.
Let be the codewords of the Hamming code. Let with . is the set of chunks with weight or less. The resulting has elements. Both generalized deduplication and standard deduplication are applied to this source for comparison.
The upper and lower bounds of are shown as dashed lines in Fig. 2. The solid lines are the simulated averages. It is seen that both standard deduplication and the generalization are converging to the same slope. The asymptotic slope comes from the asymptotic cost, . When both schemes have converged, a gap remains between the lines. The gap remains constant, but eventually becomes negligible as .
The upper and lower bounds of are shown as dashed lines in Fig. 2 as a function of the number of chunks, . The assessment of the convergence rate in the previous section is now visualized: The faster convergence of the generalization is easily seen. Further, the solid line shows the average, which is seen to approximate the lower bound. This is because both and are powers of two, and thus no overhead (compared to the lower bounds) is needed to represent bases, deviations, or entire chunks.
The generalization ratio is shown in Fig. 3. For the first few chunks deduplication performs best, but this is quickly outweighed by the faster convergence of the generalization. The gain grows sharply until convergence of , but slows down and then starts declining briefly thereafter. As the number of chunks goes to infinity, the ratio converges to .
A general observation is that the maximum gain is achieved in the range where the generalization has converged, and standard deduplication is still far from converging.
It is also seen that, for the first few samples, the generalization performs slightly worse. This is caused by the convention to put the uncompressed base in the output. In reality, since is known, it is sufficient to use bits for each base. This will increase the gain slightly.
By simulating sequences generated with longer chunks, it is clear that this increases the maximum generalization gain. The convergence of deduplication is affected by an increase in , which is unavoidable when changing the chunk size, unless the packing radius is also changed. The generalization is oblivious of this, so its convergence will not be affected, and thus the potential gain increases. In practice, where limited amounts of data are available, this enables the generalization to achieve a significant gain in storage costs.
Let the mapping for generalized deduplication be defined through the (1023, 1013) Hamming code. Chunks must be 1023 bits (approximately 128 B), and the potential bases are the codewords. The set of deviations consists of the vectors of length 1023 with weight 1 or less, so it has 1024 elements. Thus, every active base corresponds to 1024 potential chunks. The number of bases in standard deduplication is three orders of magnitude greater than in the generalization.
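The figures quoted for this code follow from the standard parameters of a binary Hamming code with r parity bits: length 2^r - 1 and dimension 2^r - 1 - r. The helper below is a hypothetical convenience for reproducing them, with deviations assumed to have weight at most 1.

```python
def hamming_parameters(r: int) -> dict:
    """Parameters of the binary Hamming code with r parity bits."""
    n = 2 ** r - 1            # chunk length in bits
    k = n - r                 # information bits per codeword
    num_deviations = n + 1    # the zero vector plus n weight-1 vectors
    return {
        "n": n,
        "k": k,
        "num_deviations": num_deviations,
        "deviation_bits": r,  # log2(n + 1) = r, the syndrome length
        "chunk_bytes": n / 8,
    }

params = hamming_parameters(10)
print(params)
```

The number of deviations, 1024 for r = 10, is also the factor by which the number of chunks that standard deduplication must identify exceeds the number of bases.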
The effect of this difference in convergence speed is significant. Our simulations show that if is fixed and the chunk length, , is increased, then the maximum ratio, , increases linearly as a function of the chunk length. The potential gain of using the generalization instead of standard deduplication increases linearly with the chunk length.
The preceding sections present an information-theoretical analysis of generalized deduplication, which allows deduplication of near-identical data, and standard deduplication as a special case. Generalized deduplication exhibits linear convergence with the number of data chunks. In the limit each data chunk can be represented by at most 3 bits more than the entropy of the source, but our numerical results show that generalized deduplication can converge to the lower bound of 1 bit more than the entropy. The advantage of generalizing deduplication manifests itself in the convergence. If the data has characteristics similar to our source model, then the generalization can converge to near-entropy costs with orders of magnitude less data than standard deduplication. With an -to- mapping , a factor of fewer bases must be identified, creating a potential for improving compression in practice, where the amount of data will be limited.
Our future work will address how the method can be realized in practice. Given concrete data, it is relatively simple to empirically model a chunk source, , but this must be carefully split into two underlying sources, the base source and the deviation source , in order to approximate the model and realize the potential of generalized deduplication. Identifying suitable sets of bases and deviations may not be a trivial task.
Appendix A A detailed example
Assume that , and let . Let . Draw 5 elements from i.i.d. uniformly. Assume that these elements are:
The elements are then concatenated to a sequence:
The encoding is initialized with an empty dictionary, . Since we know that chunks have length , the sequence is split into chunks of that length:
Now, the chunks are handled sequentially. The first is . This chunk is not in , so it is added to it. The new dictionary then is
and the encoded sequence after the first chunk is formed by adding a (since we added the chunk to the dictionary) and then the chunk itself (the dot is only for easier visualization):
We then move to the next chunk, , which is not in the . It is added, and a followed by the chunk is added to the encoded sequence:
The next element is . This element is already in the dictionary, so it is not added again. For this reason, a is placed in the output sequence, followed by a pointer to the element in the dictionary using bit. Since the element is the second in the dictionary, it is represented by :
The next element, , is new. It is added to the dictionary, and the encoded sequence following a :
The final element is , which already is in the dictionary. A pointer to the dictionary is therefore added to the encoding, following a . The pointer now needs bits. Since the element is the second in the dictionary, it is represented as .
All chunks are now encoded, and is output as .
The decoding is initialized with an empty dictionary, . The sequence is processed sequentially. We start from
The first bit is always a , since the dictionary is empty. It is also known that chunks have length . At first, the sequence can then be parsed as:
The first element can now be extracted and added to the dictionary. It is also added to the decoded sequence directly:
Since the inserted delimiter is followed by a , it is known that the next chunk is also new. Therefore, a delimiter can be inserted bits after the first delimiter:
The chunk is added to the dictionary and the decoded sequence:
The new delimiter is followed by a flag this time. Therefore, the flag is followed by a pointer. Since , the flag is followed by a pointer of bit. A new delimiter can then be inserted:
The delimiter is followed by a , which means that the second element in the dictionary should be added to the output sequence:
A follows the last delimiter, so a chunk follows directly. A new delimiter is inserted after the chunk:
and the chunk is inserted into the dictionary and the output, resulting in
Finally, a follows the delimiter. Since , the two bits after the flag (which happen to be the rest of the sequence) point to an element in the dictionary. The value is , so the second element in the dictionary should be added to the output sequence:
The decoding is now complete, and is output as . Luckily , as expected.
As deviations are drawn uniformly from a set of 8 elements, their entropy is 3 bits. Thus, 3 bits is optimal for their representation. An optimal representation is
The encoding is initialized with an empty dictionary, . Since we know that chunks have length , it is split into chunks of that length:
The chunks are handled sequentially. The first is . By applying the minimum distance mapping (decoding the chunk with the Hamming code and re-encoding the result, since the potential bases are the Hamming codewords), the base is found to be . This base is not in the dictionary, so it is added to it. In this example, we decide not to compress the base, but leave it in full size. The dictionary is then:
Since the base was not in the dictionary, a is added to the sequence, followed by the base. The deviation is the difference between the chunk and the base, which in this case is . The deviation is changed to the optimal representation. After the first chunk, the coded sequence is thus:
The next chunk is . It also maps to the base . A is added to the output sequence, followed by a pointer of 0 bits pointing to the base. Since the base is the only element in the dictionary, no bits are needed to specify which one it is. The deviation is , which is added in the optimal representation. The dictionary and coded sequence thus become:
The next chunk is also , and will get the same coded representation. Thus
This chunk, however, is followed by . The nearest neighbor in (and ) is . This will thus be the base. The base is not in , so it is added to it, and
The deviation is found by comparing the chunk to the base, and is . Changing this to the optimal representation, it is now possible to form the coded representation of the chunk. It is added to the encoding:
Finally, the last chunk is again. The base is of course still , and the deviation . Although this base has been seen before, the representation in the output will be slightly different, since the dictionary has grown. Now 1 bit is needed for the pointer. The base is the first element in the dictionary, so it will be represented by a 0:
This concludes the process, and is output as . It is worth noting that the dictionary already contains both active bases, and thus all subsequent chunks from the source will be represented with 5 bits, one more than the entropy. This shows how the generalization can converge faster than standard deduplication.
The decoding is initialized with an empty dictionary, . The sequence is processed sequentially. We start from
The sequence starts with a . This means that a base will follow the directly. The base is not compressed, so it has length . The base is followed by a deviation represented with bits. This allows us to parse for the first chunk:
The base is added to the dictionary, so
and the deviation is expanded to the full representation: . The chunk is then reconstructed by combining the base and the deviation, using bitwise exclusive-or:
This is the reconstructed chunk, which is added to the decoded sequence,
The next chunk has a flag, so the base is already in the dictionary. Since the dictionary has a single element only, bits are needed for the pointer. The deviation is as always bits. This allows the parsing of the second chunk to be made:
The base is then again . The deviation is expanded: . These two are added, forming the new chunk:
and this chunk is added to the output:
The third chunk starts with a too, so the base is indicated with bits, and is again the one already in the dictionary. The coded chunk is parsed as
and is the same as the previous. The reconstruction is the same, so
Now, the current last delimiter is followed by a , so a new base of bits and a -bit deviation follows. The parsing is
The base is , and needs to be added to the dictionary:
The deviation is then expanded, . The base and deviation reconstruct the chunk:
which is added to the output:
The delimiter is now followed by a , so the base is already in the dictionary. bit is used for the pointer, so the parsing is
The pointer is , so the base is the first element in the dictionary, i.e., . The deviation is , so the chunk can be combined to . This means
The coded sequence is now fully decoded, and is output. As expected, .
Appendix B Proof of Theorem 1
The structure of the source is such that drawing a chunk uniformly from is equivalent to drawing a base from and a deviation from . Since bases are drawn uniformly at random, the probability that the base of chunk is not already in the dictionary is
The expected coded length can be bounded from below as:
Equivalently, the value can be bounded from above:
where the inequality in (17) follows from since , (18) follows from the fact that . The inequality in (19) is due to the encoding of the deviations, , since . The final inequality in (20) follows from , and the fact that the maximum possible size of the dictionary is . Finally (11) is substituted to get (21). ∎
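The probability that the proof starts from can be checked by brute force for small cases. Under uniform, independent draws from m active bases, the base of the i-th chunk is new with probability (1 - 1/m)^(i-1). The symbols m and i, the function names, and the exhaustive enumeration are illustrative assumptions and are not taken from the proof itself.

```python
from fractions import Fraction
from itertools import product

def prob_new_base(m: int, i: int) -> Fraction:
    """Exact probability, by enumerating all m**i base sequences,
    that the i-th base does not occur among the first i - 1."""
    hits = sum(1 for seq in product(range(m), repeat=i)
               if seq[-1] not in seq[:-1])
    return Fraction(hits, m ** i)

def closed_form(m: int, i: int) -> Fraction:
    """(1 - 1/m)**(i - 1) as an exact fraction."""
    return Fraction(m - 1, m) ** (i - 1)

for m in range(2, 5):
    for i in range(1, 6):
        print(m, i, prob_new_base(m, i), closed_form(m, i))
```

The same expression with m set to the number of active chunks applies to standard deduplication, where each chunk is its own base.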
Appendix C Proof of Theorem 2
The proof of the special case of deduplication naturally follows the same steps, but considers and contains only the all-zero chunk. Because of this, deviations can be represented with exactly bits, so the step bounding their cost can be skipped. For completeness, the full proof is given. Since chunks are drawn from uniformly at random, the probability that chunk (=base) is not already in the dictionary is
The expected coded length can be bounded from below as:
The expected cost can also be bounded from above:
where the inequality in (27) follows from since , (28) follows from the fact that . The final inequality in (29) follows from , and the fact that the maximum possible size of the dictionary is . Finally, (22) is substituted to get (30). ∎