Optimal Document Exchange and New Codes for Small Number of Insertions and Deletions

04/10/2018 ∙ by Bernhard Haeupler, et al. ∙ Carnegie Mellon University

This paper gives a communication-optimal document exchange protocol and an efficient near-optimal derandomization. This also implies drastically improved error correcting codes for a small number of adversarial insertions and deletions. For any n and k < n our randomized hashing scheme takes any n-bit file F and computes an O(k log(n/k))-bit summary from which one can reconstruct F given a related file F' with edit distance ED(F,F') ≤ k. The size of our summary is information-theoretically order optimal for all values of k, positively answering a question of Orlitsky. It also is the first non-trivial solution when a small constant fraction of symbols have been edited, producing an optimal summary of size O(H(δ)n) for k = δn. This concludes a long series of better-and-better protocols which produce larger summaries for sub-linear values of k. In particular, the recent break-through of [Belazzougui, Zhang; STOC'16] assumes that k < n^ε and produces a summary of size O(k² log k + k log n). We also give an efficient derandomization with near-optimal summary size O(k log²(n/k)), improving, for every k, over the deterministic O(k² + k log² n) document exchange scheme by Belazzougui. This directly leads to near-optimal systematic error correcting codes which can efficiently recover from any k insertions and deletions while having Θ(k log²(n/k)) bits of redundancy. For the setting of k = εn this O(ε log²(1/ε)·n)-bit redundancy is near optimal and a quadratic improvement over the binary codes of Guruswami and Li and of Haeupler, Shahrasbi and Vitercik, which have redundancy Θ(√ε log^{O(1)}(1/ε)·n).


1 Introduction

This paper gives the first efficient hashing scheme for the document exchange problem which produces an order-optimal summary size. Our efficient randomized hashing scheme takes any n-bit file F and, for any k < n, computes an optimal-size O(k log(n/k))-bit summary from which one can reconstruct F given a related file F' with edit distance ED(F, F') ≤ k. (The edit distance ED(F, F') between two strings F and F' is the minimal number of insertions, deletions, or symbol changes that transform one string into the other.)

Document exchange, or remote data synchronization, is an important problem in practice that frequently occurs when synchronizing files across computer systems or maintaining replicated data collections over a bandwidth-limited network. In the simplest version it consists of two machines that each hold a copy of an n-bit file F', where on one machine this file may have been updated to F. When updating the data on the other machine one would ideally like to only send information about their differences instead of sending the whole file F. This is particularly important because network bandwidth and data transfer times are limiting factors in most applications, and often F differs little from F', e.g., only a small number k of changes have been applied or a small fraction of the content has been edited, i.e., k = δn for some small constant δ. This "scenario arises in a number of applications, such as synchronization of user files between different machines, distributed file systems, remote backups, mirroring of large web and ftp sites, content distribution networks, or web access [over a slow network]" [15].

One can imagine a multi-round protocol in which the two machines adaptively figure out which parts of the outdated file have not been changed and need not be transmitted. However, multi-round protocols are too costly and not possible in many settings. They incur long network round-trip times, and if multiple machines need updating then a separate synchronization protocol needs to be run for each individual such machine. Surprisingly, Orlitsky [19], who initiated the theoretical study of this problem in 1991, proved that the party knowing F can compute a short summary of O(k log n) bits which can then be used by any other party knowing a file F', which differs from F by at most k potentially very different edits, to recover F and the edits that have been applied to obtain F from F'. This is initially quite surprising, especially because the summary is, up to constants, of equal size as a description of the unknown changes themselves. Indeed, an exchange of Ω(k log(n/k)) bits is information-theoretically necessary to describe the difference between two n-bit strings of edit distance k. Unfortunately, however, Orlitsky's result is merely existential and requires exponential-time computations for recovery, which prevents the result from being of practical use. Pointing out several powerful potential applications, Orlitsky left the question of an efficient single-round document exchange protocol that matches the non-constructive summary size as an open question, which has since inspired many theoretical and practical results working towards this goal. This paper solely focuses on such single-round document exchange schemes. For simplicity, like all other prior works, we furthermore assume that a good upper bound on the edit distance k is known. (Alternatively, starting with k = 1 and doubling k until the recovery is successful leads to the same amount of communication up to a constant factor, since the summary size grows at least linearly in k.)

In practice rsynch [22] has become a widely used tool to achieve efficient two-round document exchange / file synchronization while minimizing the amount of information sent. Rsynch is also used as a routine in the rdiff tool to efficiently compute changes between files, e.g., in version control systems. Many similar protocols have been suggested and implemented. Unfortunately rsynch and almost all other tools do not have any guarantees on the size of the data communicated and it is easy to give examples on which these algorithms perform extremely poorly. A notable exception is a scheme of Irmak, Mihaylov and Suel [15].

On the theoretical side, the protocol of [15] was the first computationally efficient single-round document exchange scheme with a provable guarantee on the size of the summary in terms of the edit distance, achieving a size of O(k log(n/k) log n). Independently developed fuzzy extractors [9] can also be seen as providing a document exchange scheme for k polynomially small in n. A randomized scheme by Jowhari [16] independently achieved a size of O(k log² n log* n). In two recent break-throughs, Chakraborty, Goldenberg, and Koucký [6] designed a low-distortion embedding from edit distance to Hamming distance which can be used to get a summary of size O(k² log n), and Belazzougui and Zhang [2] further built on this randomized embedding and achieved a scheme with summary size O(k² log k + k log n), which is order optimal for small values of k (roughly k ≤ log n / log log n). All of these schemes are randomized. The first deterministic scheme, with summary size O(k² + k log² n), was given by Belazzougui [1]. All these document exchange schemes have some sub-linear restriction on the maximal magnitude of k. For example, the breakthrough result of [2] assumes that k < n^ε for some sufficiently small constant ε. In particular, there does not exist a scheme which works for the interesting case where the edit distance is a small constant fraction of the file length, e.g., if a small constant fraction of the content has been edited.

2 Our Results

We positively answer the 27-year-old open question of Orlitsky [19] and give an efficient randomized hashing scheme for the single-round document exchange problem which, for any k < n, produces a summary of order-optimal size O(k log(n/k)).

Theorem 2.1.

For any k < n there is a randomized algorithm which, given any n-bit string F, produces an O(k log(n/k))-bit summary S(F). There also is a deterministic recovery algorithm which, given S(F) and any string F' that is independent from the randomness used for computing S(F) and satisfies ED(F, F') ≤ k, recovers the string F with high probability.

This improves over the recent break-through of [2], which produces a summary of size O(k² log k + k log n) and works only as long as k < n^ε. We remark that the scheme in [2] only achieves a failure probability that is inverse polylogarithmic in n and inverse polynomial in k, whereas the scheme in Theorem 2.1 works with high probability. If one wants to boost the scheme in [2] to work with high probability one needs to send multiple independent summaries, increasing the overall summary size by a corresponding factor.

As a precursor to our main result we obtain a document exchange protocol with sub-optimal summary size O(k log²(n/k)) that has the advantage that it can be efficiently derandomized. This gives a deterministic document exchange protocol with summary size O(k log²(n/k)), improving over the deterministic scheme by Belazzougui [1] with summary size O(k² + k log² n).

Lemma 2.1.1.

There is a deterministic document exchange algorithm which, given any k < n and any n-bit string F, produces an O(k log²(n/k))-bit summary S(F), such that a deterministic recovery algorithm, which is given S(F) and any string F' with edit distance ED(F, F') ≤ k, recovers F.

The schemes from Theorem 2.1 and Lemma 2.1.1 are the first document exchange protocols which work for the interesting setting in which a constant fraction of edits needs to be communicated. In particular, if the edit distance between F and F' is k = δn for some small constant δ, then our optimal randomized scheme produces a summary of size O(H(δ)n), where H is the binary entropy function. Our deterministic scheme incurs another log(1/δ) factor, but its summary size of O(δ log²(1/δ) · n) is still much smaller than n for sufficiently small δ.
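For concreteness, the equivalence between the O(k log(n/k)) bound and the entropy bound in this regime is the following standard calculation (stated here for δ ≤ 1/2):

    \[
      k \log\tfrac{n}{k} \;=\; \delta n \log\tfrac{1}{\delta},
      \qquad
      H(\delta) \;=\; \delta\log\tfrac{1}{\delta} + (1-\delta)\log\tfrac{1}{1-\delta}
      \;=\; \Theta\!\big(\delta\log\tfrac{1}{\delta}\big),
    \]

so for k = δn the two bounds O(k log(n/k)) and O(H(δ)n) agree up to constant factors.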

Remarks:

  • The deterministic document exchange scheme from Lemma 2.1.1 was independently and simultaneously discovered by Cheng, Jin, Li and Wu [7]. (Both works were put on arXiv days apart [12, 8]. The author of this paper served on the program committee of FOCS'18, where [8] was published, and was as such not permitted to submit his work there.) It remains an interesting open question whether Lemma 2.1.1 can be improved and whether an efficient deterministic document exchange with optimal summary size, matching Theorem 2.1, is possible.

  • Deterministic document exchange protocols are known to be equivalent to systematic error correcting codes for insertions and deletions. Lemma 2.1.1 and the improvements for error correcting codes in [7] lead to improved error correcting codes for insertions and deletions. We summarize these implications in Appendix B.

3 Hash Functions and Summary Structure

In this section we describe and define the simple inner-product hash functions used in our schemes as well as the content and structure of the summary S(F) of F. We start by giving some intuition about the summary structure in Section 3.1, give our string notation in Section 3.2, formally define our hash function in Section 3.3, and define our summary structure in Section 3.4.

3.1 Intuition for the Summary Structure

Essentially all document exchange algorithms used in practice, including rsynch, use the very natural idea of cutting the file F into blocks and sending across hashes of these blocks in order to identify which of these blocks are contained in F' without any edits.

Once identical blocks have been identified, the remaining information containing all differences between F and F' is small and can be transmitted. The protocol of Irmak, Mihaylov and Suel [15] also follows this strategy. However, it does not use a fixed block length but uses O(log(n/k)) levels of exponentially decreasing block length. (All logarithms in this paper are with respect to base two unless stated otherwise.) This allows to progressively zoom into larger blocks containing some edits and to identify smaller blocks within them that do not contain an edit. Given that higher levels should already identify large parts of the string that are identical to F' and thus known to the party reconstructing F, many of the hashes of lower levels will not be of interest to the reconstructing party. To avoid having to send these hashes one could run an adaptive multi-level protocol in which the reconstructing party provides feedback at each level. A great and much simpler single-round alternative introduced by [15] is to use (the non-systematic part of) systematic error correcting codes, which allows the receiving party to efficiently reconstruct the hashes it is missing without the sending party needing to know which hashes these are. The summary now simply consists of these encodings of the hashes of all levels, and this summary can be sent to the reconstructing party in a single-round document exchange protocol. Given that at most k blocks can be corrupted in each level, the summary size is O(k log(n/k) · s), where s is the size of a single hash. Using randomized O(log n)-bit hashes with O(log n)-bit seeds, which are guaranteed to be collision-free with high probability, leads to the O(k log(n/k) log n)-bit summary size of [15].
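As a quick sanity check on this accounting, here is a minimal Python sketch of the size calculation. It assumes one level per halving of the block length and Θ(k) coded hashes per level; the constant 2 is an illustrative choice, not taken from the paper.

    import math

    def level_scheme_summary_bits(n: int, k: int, hash_bits: int) -> int:
        """Rough size accounting for the multi-level scheme sketched above
        (illustrative only, not the paper's exact constants)."""
        levels = max(1, math.ceil(math.log2(n / k)))  # block length shrinks from n/k to O(1)
        coded_hashes_per_level = 2 * k                # enough redundancy to fix O(k) bad blocks
        return levels * coded_hashes_per_level * hash_bits

    # With Theta(log n)-bit hashes as in [15], e.g. level_scheme_summary_bits(1 << 20, 16, 20),
    # this is on the order of k * log(n/k) * log(n) bits.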

The hashing schemes in this paper mostly follow the same practical framework. In fact, the summary structure we use for our simpler (sub-optimal, deterministic) document exchange protocol with summary size O(k log²(n/k)) is identical to [15], except that we use a smaller hash size of O(log(n/k)) bits and a compact way to describe the randomness used for hashing, which can then also be used for derandomization. Our main result further reduces the hash size to merely a fixed constant. This requires a much more robust recovery algorithm which can deal with high hash collision probabilities. In particular, since the failure probability of a hash is exponential in the hash size, choosing Θ(log n)-bit hashes as in [15] implies that no hash collision happens with high probability, while the much smaller hash sizes used here still keep the expected number of hash collisions at the same order of magnitude as the k errors one has to deal with anyway. The fact that recovery from constant-size hashes with a constant collision probability is even existentially possible requires a much more intricate probabilistic analysis. Furthermore, the key trick used in [15, 1, 8] of using error correcting codes to obliviously communicate the missing or incorrect hashes inherently requires Ω(k log(n/k)) bits to be sent in each of the Θ(log(n/k)) levels. This forms another serious barrier that needs to be overcome for our main result.

3.2 String Notation

Next we briefly give the string notation we use throughout.

Let S be a string. We denote by |S| the length of S, and for any i, j with 1 ≤ i ≤ j ≤ |S| we denote by S[i, j] the substring of S between the i-th and the j-th symbol, both included. A sub-string of S is always a set of consecutive symbols in S, i.e., any string of the form S[i, j]. We also use multi-dimensional arrays of symbols, in which every index typically begins with one. An array position can either contain a symbol over some alphabet or be empty. We denote by A[i, ·] the string of symbols containing all symbols of the form A[i, j].

3.3 Inner-Product Hash Function and Randomness Table

Next we describe our hash function h, which computes nothing more than some GF(2) inner products between its input string and some random bits.

To properly keep track of the randomness used, we will think of the randomness as being supplied by a three-dimensional table of bits we call R. We remark that our algorithms do not actually need to instantiate or compute R explicitly. Instead, the description of the bits contained in R will be so simple that they can be generated, stored, or computed on the spot when needed.

In addition to the string x to be hashed we supply four more arguments to h. A call like h(x; R, o, p, l) produces a hash of the string x using the randomness table R. The parameter o denotes the size of the hash, i.e., the number of bits that are produced as output. The parameters p and l denote to which starting position and level the hash belongs, respectively. These parameters are used to describe where in the randomness table R to pull the randomness for the inner product from. This ensures firstly that hashes for different levels and intervals use different or "fresh" random bits and secondly that summary creation and recovery consistently use the same parts of R when testing whether two strings stem from the same interval in the original string F. The inner product computed by h(x; R, o, p, l) is now simply the o-bit string whose j-th bit is the GF(2) inner product of x with the random bits R[l, j, p], R[l, j, p+1], ….

Note that if R is filled with independent and uniformly distributed bits, then any two non-identical strings of the same length have colliding hashes with probability exactly 2^{-o}, i.e., Pr[h(x; R, o, p, l) = h(x'; R, o, p, l)] = 2^{-o} for x ≠ x'. The reason for this is that each of the o output bits independently is an inner product between the string to be hashed and the same uniformly random bit string taken from R. Therefore the difference between the j-th output bit for x and the j-th output bit for x' is the inner product of a uniformly random string with the non-zero string x ⊕ x', and as such is a uniformly distributed bit. The probability that each of the o output bits of this difference is zero is now exactly 2^{-o}.

Lastly, we add one further simplification to our hash function: if the output length o is at least the length of the string x to be hashed, then h simply outputs x itself as a "hash", possibly padded with zeros. This gives a collision probability of zero for any two same-length strings of length at most o and allows one to read off the string directly from its hash.
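The following minimal Python sketch illustrates the hash just described. The concrete indexing of the randomness table R and the simple deterministic bit mixer are illustrative assumptions only; the actual scheme draws the bits of R from an ε-biased distribution described by a short seed (Section 3.4).

    def make_R(seed: int):
        """Stand-in for the three-dimensional randomness table R, indexed by
        (level, output bit, position).  Illustrative only: the real construction
        uses an epsilon-biased distribution shipped inside the summary."""
        def R(level: int, out_bit: int, pos: int) -> int:
            h = ((seed * 1000003 + level) * 1000003 + out_bit) * 1000003 + pos
            h &= 0xFFFFFFFFFFFFFFFF
            h ^= h >> 33
            h = (h * 0xFF51AFD7ED558CCD) & 0xFFFFFFFFFFFFFFFF
            h ^= h >> 33
            return h & 1
        return R

    def ip_hash(x: str, R, o: int, p: int, l: int) -> str:
        """GF(2) inner-product hash of Section 3.3 for a bit string x (e.g. '0110'),
        belonging to starting position p and level l."""
        if len(x) <= o:                       # short strings are their own hash,
            return x.ljust(o, '0')            # padded with zeros
        out = []
        for j in range(o):                    # j-th output bit: inner product of x with
            acc = 0                           # the bits R(l, j, p), R(l, j, p+1), ...
            for t, c in enumerate(x):
                acc ^= int(c) & R(l, j, p + t)
            out.append(str(acc))
        return ''.join(out)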

3.4 Summary Structure and Construction

In this section we formally describe and define the summary structure and construction which follows the informal description given in Section 3.1:

The summary algorithm takes the string F it wants to summarize, the parameter k, which essentially governs how many hashes are provided per level (in coded form), and the parameter o, which determines the size of the hashes used. For simplicity of the description we assume that the length of the string F equals n = k · 2^i for some integer i, i.e., n is a multiple of k and n/k is a power of two. This assumption is without loss of generality. One can send the length of F along with the summary and, for the hash computations, extend F to a string of length n' = k · 2^{⌈log(n/k)⌉} < 2n by adding zeros to the end. The recovery algorithm simply adds the same number of zeros to F' during the recovery and removes them again in the end.

The summary contains the following coded hashes organized into levels:

  • Level zero simply cuts F into k equal-size pieces and records the hash for each piece. I.e., we let T_0 be the concatenation of the k hashes of these pieces and include C_0 = T_0, which consists of k · o bits, in the summary.

  • For level l ≥ 1 we cut F into k · 2^l equal-size pieces and compute the hash for each piece to form the string of hashes T_l. The hashes themselves, however, are too large to be sent completely. Instead, our (warm-up) deterministic scheme encodes these hashes and includes as C_l simply the non-systematic part of a systematic linear error correcting code over GF(2^o) with Θ(k) symbols of redundancy applied to T_l. Such codes exist, and many explicit constructions, e.g., based on algebraic geometry, are known for sufficiently large alphabets. Our optimal document exchange protocol with constant hash size o = O(1) requires a more sophisticated scheme, which we describe in Section 4.2.2. The information included in C_l for this scheme consists essentially of one or multiple hashes of (subsets of) T_l of total size c · k bits for some sufficiently large constant c.

We remark that the hashes of the last levels are hashes of strings of length at most o. In this case the hash function simply outputs the strings themselves, making the hash string of the last level (or already an earlier one) essentially equal to F itself.

In addition to these coded hashes, the summary also contains the length n and a compact description of the randomness table R. Throughout this paper we will use the ε-biased probability spaces of Naor and Naor [18] for this compact description. In particular, we prove for all our schemes that the bits in R do not need to be independent uniform bits but that it suffices if they are sampled from a distribution with reasonably small bias. The often-exploited fact that a sample point from an ε-biased distribution over N bits can be described by only O(log N + log(1/ε)) bits allows us to give very compact descriptions of the randomness used and to send these along in the summary S(F). In particular, we do not need to assume that the summary construction and the summary recovery algorithm have any shared source of randomness.
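A minimal Python sketch of this construction follows, reusing the ip_hash and make_R sketches from Section 3.3. The helper `ecc_redundancy` is a hypothetical placeholder for the non-systematic part of a systematic code (or, in the optimal scheme, the small verification hashes); it is not specified by the text above.

    def build_summary(F: str, k: int, o: int, R, ecc_redundancy):
        """Illustrative sketch of the summary S(F) of Section 3.4.
        Assumes len(F) = k * 2**i for some integer i."""
        n = len(F)
        summary = {'n': n}
        block_len, level = n // k, 0
        while block_len >= 1:
            hashes = [ip_hash(F[s:s + block_len], R, o, s, level)
                      for s in range(0, n, block_len)]
            # level 0 is included verbatim; deeper levels only as coded redundancy
            summary[level] = hashes if level == 0 else ecc_redundancy(hashes, k)
            block_len, level = block_len // 2, level + 1
        return summary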

4 Recovery Algorithms

This section describes our recovery algorithms. We start in Section 4.1 by defining hash-induced substring matchings, which form the basis for our algorithms and their analysis. We then describe our recovery algorithms. In Section 4.2 we first describe a randomized algorithm which produces a summary of size O(k log²(n/k)). This is a good warm-up for our main result. It demonstrates the overall algorithmic structure common to both our recovery algorithms, introduces the basic probabilistic analysis used to analyze them, and makes it easier to understand the problems that need to be addressed when pushing both the algorithmic ideas and the analysis to the limit for our main result. We also show in the appendix how to derandomize this scheme to obtain Lemma 2.1.1. Lastly, Section 4.2.2 contains the order-optimal randomized hashing scheme, which uses constant-size hashes and thus achieves the optimal summary size of O(k log(n/k)).

4.1 Hash Induced Substring Matchings

For every level l we say that two index sequences (a_1, …, a_m) and (b_1, …, b_m) of length m are a level-l size-m (sub-string) matching between two strings F and F' induced by h if

  • a_1 < a_2 < ⋯ < a_m are indices into F and b_1, …, b_m are indices into F',

  • all a-indices are starting points of blocks that got hashed in F in level l, i.e., a_j − 1 is a multiple of the level-l block length for every j, and

  • hashes of the strings in blocks that are matched are identical (we also say the hashes are matching), i.e., for all j the hash of the level-l block of F starting at a_j equals the hash (computed with the same parameters) of the equal-length substring of F' starting at b_j. In the rare instances where we (temporarily) relax this requirement we speak of a non-proper matching.

Furthermore, we call such a matching

  • monotone if b_1 ≤ b_2 ≤ ⋯ ≤ b_m,

  • disjoint if the intervals that are matched in F' are not overlapping, i.e., for every j ≠ j' the level-l-length intervals of F' starting at b_j and b_{j'} do not intersect.

  • bad, or m-bad if it additionally has size m, if for every matched pair of blocks in F and F' the actual strings are non-identical (despite having identical hashes), i.e., if for all j the block of F starting at a_j differs from the substring of F' starting at b_j.

  • k-plausible if it is monotone and the offsets b_j − a_j have total variation at most k, i.e., |b_1 − a_1| + Σ_{j>1} |(b_j − a_j) − (b_{j−1} − a_{j−1})| ≤ k. Note that a monotone matching is k-plausible if it can be explained by at most k insertion and deletion operations.

We generally assume all our matchings to be proper, monotone, and disjoint and often omit these qualifiers. Whenever we talk about not-necessarily monotone, not-necessarily proper, or not-necessarily disjoint matchings we explicitly label these matchings as non-monotone, non-proper, and/or non-disjoint. Furthermore, if the context allows it we sometimes omit mentioning the strings F and F', the hash function h, or the level l with respect to which a matching satisfies the above conditions.

For any two strings one can compute a monotone matching of maximum size in time linear in the length of the strings and polynomial in k using a standard dynamic program. The same is true for a maximum-size disjoint, bad, or k-plausible monotone matching. It is furthermore likely that the same techniques which transform the standard quadratic-time dynamic program for edit distance into an O(n + k²)-time dynamic program [23] can be used to obtain faster algorithms for the above maximum monotone matching variants as well.
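The following Python sketch illustrates such a dynamic program for the largest disjoint monotone matching. It is a simplified quadratic-time variant written for clarity, not the faster k-sensitive algorithm alluded to above; `hash_at(s)` is an assumed helper returning the level hash of the length-`block_len` substring of F' starting at position s.

    def max_disjoint_monotone_matching(block_hashes, block_len, n_prime, hash_at):
        """Largest disjoint monotone matching of the level blocks of F into F'.
        block_hashes[j] is the (recovered) hash of the j-th level block of F;
        block j may be matched at start s of F' if hash_at(s) == block_hashes[j]."""
        m = len(block_hashes)
        # best[j][s]: largest matching using blocks 0..j-1 and only substrings
        #             of F' that end strictly before position s
        best = [[0] * (n_prime + 1) for _ in range(m + 1)]
        for j in range(1, m + 1):
            for s in range(1, n_prime + 1):
                best[j][s] = max(best[j - 1][s], best[j][s - 1])
                start = s - block_len          # try matching block j-1 ending right before s
                if start >= 0 and hash_at(start) == block_hashes[j - 1]:
                    best[j][s] = max(best[j][s], best[j - 1][start] + 1)
        return best[m][n_prime]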

4.2 Algorithm 1: Simple Level-wise Recovery

1: get n, C_0, C_1, …, C_{log(n/k)} and the description of R from S(F)
2: T_0 ← C_0
3: for l = 0 to log(n/k) − 1 do
4:
5:      M ← largest level-l disjoint monotone matching from the hashes in T_l into F'
6:      ▷ Recover level l + 1 hashes
7:      T'_{l+1} ← guesses for the level l + 1 hashes using M and F'
8:      T_{l+1} ← decode (T'_{l+1}, C_{l+1}) to the closest codeword
9:
10: output F as read off from T_{log(n/k)}
Algorithm 1 Simple Recovery with hash size o = Θ(log(n/k)) and Summary Size O(k log²(n/k))

Our first recovery algorithm, which we call Simple Level-wise Recovery, is now easily given (see also the pseudo-code description of this algorithm, which is given as Algorithm 1):

Assume that the recovery algorithm has correctly recovered all hashes T_l of F at level l. Initially, for l = 0, this assumption is trivially true because these hashes are included in the summary S(F). Equipped with these hashes the algorithm finds the largest monotone disjoint matching between the level-l blocks of F and blocks in F' of the same length. The recovery algorithm now guesses the level l + 1 hashes of F using F' and this matching as follows: Each block of F in level l splits into exactly two blocks in level l + 1. For any block of F that is matched to a sub-string of F' with an identical hash, the recovery algorithm guesses that the strings in these blocks are also identical and computes the hashes for the two sub-blocks in level l + 1 by applying h to the appropriate sub-strings of F'. If a block of F is not matched, one can fill in something arbitrary as a guess or mark it as an erasure. The hope is that the vector of hashes for level l + 1 guessed in this way is close in Hamming distance to the correct hashes T_{l+1}. If this is the case, concatenating this guess with the redundancy C_{l+1} for level l + 1 from S(F) and decoding to the closest codeword correctly recovers the level l + 1 hashes and allows the algorithm to proceed to the next iteration and level. In this way the recovery algorithm iteratively recovers the hashes for every level, one by one, until the last level. In the last level blocks are of constant size and h becomes the (padded) identity function, such that one can read off F from its hashes.
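To tie the pieces together, here is an illustrative Python skeleton of this level-wise recovery in the same style as the earlier sketches (ip_hash, build_summary). The helpers `find_matching` and `ecc_decode` are hypothetical placeholders: the first returns, for each level block of F, a matched start position in F' or None, and the second corrects the guessed hash vector using the redundancy stored for the next level.

    def simple_levelwise_recovery(summary, Fp, k, o, R, find_matching, ecc_decode):
        """Sketch of Algorithm 1: recover F from its summary and a related file Fp."""
        n = summary['n']
        hashes = summary[0]                   # level-0 hashes are stored verbatim
        block_len, level = n // k, 0
        while block_len > o:                  # once blocks fit into a hash, they ARE the hash
            match = find_matching(hashes, block_len, Fp, level)
            child_len = block_len // 2
            guess = []
            for j, pos in enumerate(match):   # each level-l block splits into two children
                for half in (0, 1):
                    if pos is None:
                        guess.append(None)    # erasure: the parent block was unmatched
                    else:
                        s = pos + half * child_len
                        guess.append(ip_hash(Fp[s:s + child_len], R, o,
                                             j * block_len + half * child_len, level + 1))
            hashes = ecc_decode(guess, summary[level + 1])
            block_len, level = child_len, level + 1
        return ''.join(h[:block_len] for h in hashes)   # strip zero padding to read off F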

4.2.1 Correctness of Algorithm 1 and k-Bad Matchings

In this subsection we give a sufficient condition for the correctness of Algorithm 1. In particular, we prove that if there is no k-bad self-matching in F, i.e., a size-k bad monotone disjoint matching between F and itself under h, then Algorithm 1 recovers F correctly. Later in this section we then show that hashes of size O(log(n/k)) and only little randomness in R are sufficient to make the existence of such a witness unlikely; in fact, so little randomness is needed that one can easily derandomize the algorithm.

To prove the correctness of Algorithm 1 we will argue that the matching computed in each level is sufficiently large and, in the absence of a k-bad self-matching, of sufficient quality to allow the recovery of the hashes for the next level using the redundancy in S(F). This then allows the recovery algorithm to proceed similarly with the next level.

It is easy to see that the matching computed is always large, assuming that F and F' are not too different:

Lemma 4.0.1.

Assuming that the hashes T_l for level l were correctly recovered, Algorithm 1 computes a matching of size at least k · 2^l − k in level l.

Proof.

Since F and F' differ by only k insertions, deletions, or symbol corruptions, and since each such edit can affect at most one block, we can look at the monotone matching which matches every block of F which did not suffer from such an edit to its identical sub-string in F'. Since the hashes were correctly recovered and both sides use the same parts of R to compute the inner-product hashes, this is a valid monotone matching of size at least k · 2^l − k. Since Algorithm 1 computes the largest valid matching it finds a matching of at least this size. ∎

We would like to say that if random enough hash functions with a small enough collision probability are used in the summary, usually stemming from a sufficiently unbiased R and a large enough hash output length o, then most of the matched pairs computed by Algorithm 1 are correct, i.e., correspond to sub-strings of F and F' that are identical. For any matching which contains too many pairs of substrings which are not identical but have the same hashes, we abstract out a witness which explains why the recovery failed. For this we focus on bad matched pairs that go between non-identical intervals in F and intervals of F' which do not contain any edits. This, however, is exactly a bad self-matching in F. The advantage of looking at such a witness is that its existence only depends on F (or T_l and R) but not on F'.

Lemma 4.0.2.

Assume that the hashes T_l for level l were correctly recovered and that for level l there is no k-bad matching of F to itself under h. Then the monotone matching computed by Algorithm 1 for this level matches at most 2k non-identical blocks in F and F'.

Proof.

F and F' differ by only k insertions, deletions, or symbol corruptions, and each such edit can affect at most one of the blocks of F' that are matched to a non-identical block of F. Therefore there are at most k such matches in the monotone matching computed by Algorithm 1. Furthermore, if we restrict ourselves to the matches between non-identical sub-strings in F and F' computed by Algorithm 1 which are not of this type, then each of these matches comes from matching a sub-string of F to a sub-string of F' which, due to having no edits in it, is identical to a sub-string of F. Since, by assumption, Algorithm 1 used the correctly recovered level-l hashes and computes a monotone disjoint matching, these matches form a bad self-matching in F. By assumption no k-bad self-matching exists, so this bad self-matching has size less than k, giving the desired bound of at most 2k non-identical blocks matched in F and F' by Algorithm 1. ∎

Lastly, because we use error correcting codes with sufficiently large distance we can easily correct for the missing hashes and incorrect hashes.

Lemma 4.0.3.

Assume that for all levels there is no k-bad matching of F to itself under the hash function h with the randomness R used to compute S(F), which is given to Algorithm 1 as an input. Furthermore assume that the input file F' satisfies ED(F, F') ≤ k. Then Algorithm 1 correctly outputs F.

Proof.

We will first show by induction on the level l that Algorithm 1 correctly recovers the level-l hashes T_l of F that were computed for S(F). For level 0 this is trivial because these hashes are a part of S(F) and therefore given to Algorithm 1 as an input. For the induction step we want to show that the hashes for level l + 1 will be correctly recovered assuming that this has successfully happened for the hashes of level l. Here Lemma 4.0.1 guarantees that a matching of size at least k · 2^l − k is computed, which results in at most k blocks in level l of F having no match and therefore at most 2k hashes in level l + 1 being assigned the erasure symbol "?" in the guessing step of Algorithm 1. Furthermore, the assumptions of Lemma 4.0.2 are satisfied, guaranteeing that at most 2k of the matches computed by Algorithm 1 in level l belong to non-identical strings in F and F'. This results in at most 4k of the hash values computed in the guessing step of Algorithm 1 being incorrect. The Hamming distance between the correct hashes T_{l+1} and the estimate produced by Algorithm 1 is therefore at most 6k. Given that the error correcting code used has distance Θ(k), chosen large enough to correct 4k errors and 2k erasures simultaneously, these errors will be corrected, leading to a correct recovery of the level l + 1 hashes in Step 8 of Algorithm 1. In its last iteration Algorithm 1 will correctly recover the hashes of the last level, which, as discussed at the end of Section 3.4, are equal to F (up to padding extra zeros to each hash). ∎

It remains to show that k-bad matchings are highly unlikely. Lemma 4.0.4 does exactly this. It shows that for sufficiently random R, and for a sufficiently large hash size o, with high probability no k-bad matching exists. We start by showing that a distribution whose bias is exponentially small in k log(n/k) suffices to avoid a k-bad matching with high probability. This is sufficient to guarantee the correctness of Algorithm 1. We furthermore show how to extend these arguments to much lower-quality distributions with a polynomially large bias. Since this second part is important for our derandomization but not needed to understand our main result, we defer it to the appendix.

Lemma 4.0.4.

For every sufficiently large constant c it holds that if o ≥ c · log(n/k) and R is sampled from an ε-biased distribution over its bits with ε ≤ 2^{-c·k·log(n/k)}, then for every level the probability that there exists a k-bad self-matching of F under h is at most 2^{-Θ(k log(n/k))}.

Proof.

Suppose for the sake of simplicity that R is sampled from iid uniformly random bits. In this case the probability for any individual pair of non-identical sub-strings of F to have colliding hashes is 2^{-o}. Furthermore, for a fixed k-bad matching the probability that all k matched pairs are bad under h is 2^{-o·k}. There furthermore exist at most (en/k)^{2k} = 2^{O(k log(n/k))} ways to choose the indices for a potential k-bad matching and therefore also at most this many such matchings. Taking a union bound over all these potentially bad matchings and choosing the constant c large enough guarantees that the probability that there exists a k-bad matching is at most 2^{-Θ(k log(n/k))}.

Next we argue that the same argument holds if R is only ε-biased. In particular, for every fixed size-k matching, determining whether it is a k-bad matching only depends on the outcomes of o · k linear tests on bits of R. For each of the possible outcomes of these tests the probability deviates by at most ε from the setting where R is sampled from iid uniformly random bits, so the probability that a fixed matching is bad is still at most 2^{-o·k} + ε ≤ 2 · 2^{-o·k}. The same union bound thus applies. ∎
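Spelled out, the union bound from the uniform-randomness case has the following shape; the counting of candidate matchings and the constants are only indicative of the reconstruction above.

    \[
      \Pr\bigl[\exists\,k\text{-bad self-matching}\bigr]
      \;\le\; \Bigl(\tfrac{en}{k}\Bigr)^{2k}\cdot 2^{-o k}
      \;=\; 2^{\,k\,(2\log\frac{en}{k}\,-\,o)}
      \;\le\; 2^{-\Theta(k\log\frac{n}{k})}
      \quad\text{if } o \ge c\log\tfrac{n}{k}\text{ for a large enough constant } c.
    \]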

4.2.2 Algorithm 2: Using Constant Size Hashes

In this section we give a more sophisticated and even more robust recovery algorithm. Surprisingly, this algorithm works even if the hash size used in the summary computation is merely a small constant, leading to our main result, the order-optimal document exchange hashing scheme.

The main difference between Algorithm 1 and Algorithm 2 is that we take the matchings of previous levels into account and restrict ourselves to k-plausible monotone matchings. This is sufficient to reduce the problem to a Hamming-type problem of how to communicate the next level of hashes when most of them are already known to the receiver. However, here we cannot use systematic error correcting codes anymore but need to develop more efficient techniques.

1: M_0 ← empty matching
2: get n, C_0, C_1, …, C_{log(n/k)} and the description of R from S(F)
3: T_0 ← C_0
4: for l = 0 to log(n/k) − 1 do
5:      ▷ Transform M_l into a consistent level-l matching
6:      remove from M_l all matches not consistent with T_l
7:      ▷ Compute a plausible disjoint monotone matching for the unmatched hashes
8:      M' ← largest k-plausible level-l matching of the hashes unmatched in M_l into F'
9:      M_{l+1} ← M_l ∪ M' (and split all edges into two, making it a level-(l+1) matching)
10:
11:     ▷ Recover level l + 1 hashes
12:     recover T_{l+1} by using the substrings matched in M_{l+1}, guessing which hashes are incorrect and their correct values, and verifying correct guesses with the hashes in C_{l+1}
13:
14: output F as read off from T_{log(n/k)}
Algorithm 2 Randomized Recovery with constant hash size o = O(1) and Summary Size O(k log(n/k))

We note that the matchings M_l produced by Algorithm 2 are not necessarily monotone and not necessarily disjoint, i.e., they can contain matches to overlapping intervals in F'. At the beginning or end of an iteration the matching might even be a non-proper level-l matching, in that there can be matches which stem from matching level l − 1 hashes but do not have matching level-l hashes. At the beginning of an iteration such matches are removed, making the matching proper. In order to analyze the progress of Algorithm 2 we introduce the following notion of an okay matching:

Definition 4.0.1.

We say a level-l non-disjoint non-monotone non-proper matching M_l at the beginning of an iteration of Algorithm 2 is okay if the number of unmatched hashes plus the number of bad matches, i.e., matches between intervals in F and F' that are not identical, is at most a fixed constant multiple of k.

We can now prove that each iteration of Algorithm 2 works correctly with exponentially high probability in , as long as it starts with an okay matching and as long as randomness in is independent between levels and sufficiently unbiased:

Lemma 4.0.5.

Suppose the randomness in R is at most 2^{-ck}-biased and independent between levels, where c is a sufficiently large constant. If, at the beginning of iteration l of Algorithm 2, the matching M_l is okay and T_l has been correctly recovered, then, with probability 1 − 2^{-Θ(k)}, the matching M_{l+1} is also okay.

Proof.

We want to bound the number of unmatched hashes and bad matches in the matching M_{l+1} produced by iteration l of Algorithm 2. By assumption M_l is okay and thus has at most O(k) unmatched hashes or bad matches.

The number of unmatched hashes in M_{l+1} is easily bounded. In particular, among the hashes unmatched in M_l there exists a k-plausible disjoint monotone matching which leaves at most k of them unmatched, namely the one which matches every block of F that does not contain an edit to its identical counterpart in F'. This is in particular true for the hashes that are unmatched after the matching has been cleaned up and transformed into a proper level-l matching in Step 6 of Algorithm 2. Since M' is a largest such matching, it leaves at most k hashes unmatched, which split into at most 2k unmatched hashes in M_{l+1} in Step 9 of Algorithm 2.

Next we bound the number of bad matches in M_{l+1}. There are two potential sources of bad matches: they can either stem from bad matches in M_l that are not identified as bad by the level-l hashes T_l, or they can be newly introduced by the matching M'.

The expected number of surviving bad matches of the first type is small, since each of the at most O(k) bad matches in the okay matching M_l has a non-matching hash in T_l, and is thus removed, with probability 1 − 2^{-o}, and gets split into two potentially bad matches in M_{l+1} only if it goes undetected. Since the constant hash size o is sufficiently large, the probability that more than a small constant fraction of these O(k) bad matches survive, leading to more than O(k) bad edges of this type in M_{l+1}, is at most 2^{-Θ(k)}, even if the bits in R are only 2^{-Θ(k)}-biased.

Next we want to argue that, with probability 1 − 2^{-Θ(k)}, the matching M' introduces at most O(k) new bad matches, which get doubled into at most O(k) bad edges in M_{l+1}. For this we first bound the number of possible matchings M', given a fixed okay matching M_l, by 2^{O(k)}, and then take a union bound. To count the number of possible matchings M' given a fixed okay M_l, we specify such a matching by indicating which of the at most O(k) unmatched hashes in M_l are matched and what the offsets of their starting positions are. Since the total variation of these offsets is at most k for every k-plausible matching, the values and signs of these offsets can take on at most 2^{O(k)} different combinations. Overall there are therefore at most 2^{O(k)} different possibilities for M' given M_l. Furthermore, the probability that a fixed such matching with more than a small constant fraction of k bad pairs has all of its hashes matching is at most 2^{-Ω(o·k)}, if the randomness in R for this level is independent of previous levels and at most 2^{-Θ(k)}-biased. A union bound over all 2^{O(k)} possibilities for M' thus shows that, with probability at least 1 − 2^{-Θ(k)}, no matching M' which introduces more than O(k) new bad matches is chosen.

Overall, with probability 1 − 2^{-Θ(k)}, this leads to at most O(k) unmatched hashes or bad matches in M_{l+1} at the end of iteration l of Algorithm 2, making it an okay matching as desired. ∎

Lemma 4.0.5 shows that our improved matching procedure in Algorithm 2 is robust enough to tolerate hashes of constant size. In fact, it guarantees that, given an okay matching and the correctly recovered hashes T_l for level l, a finer-grained matching for level l + 1 is computed which is okay, i.e., which allows all but O(k) hashes of level l + 1 to be guessed correctly. Algorithm 2 thus achieves the crucial feat of reducing the edit distance document exchange problem to its much simpler Hamming-type equivalent, in which two parties hold a long string differing in at most O(k) positions and one party wants to help the other learn its string.

Impossibility of Reconciling k (Worst-Case) Hamming Errors or Erasures with O(k) Bits

In Algorithm 1 the reduction to this Hamming problem was all that was needed. There, too, the recovery algorithm found a guess for T_{l+1} which differed from it in at most O(k) hashes. Both of these strings of hashes were over an alphabet of o = Θ(log(n/k)) bits, and one could then simply use the error correcting code idea of [15] to recover T_{l+1} from the guess using O(k log(n/k)) bits of additional information, which could be put into C_{l+1}. Concretely, we used the non-systematic part of a systematic linear error correcting code over GF(2^o) to send the equivalent of O(k) hashes and to recover the positions and correct values of the hashes differing between the matching-generated guess and the true hashes T_{l+1}.

Unfortunately, however, for constant-size hashes such error correcting codes cannot exist. In fact, it is easy to verify (thanks to Xin Li and his group for pointing out this error in the initial preliminary draft put on arXiv [12]) that it is impossible to reconcile two parties holding long strings of N symbols over some alphabet differing in any k positions without sending at least Ω(k log(N/k)) bits, because the positions of the differences can already encode this many bits. For the encoding used in S(F) this implies that either Ω(k log(n/k)) bits need to be put into C_{l+1} per level to allow the recovery of the bad or missing hashes, as we do in Algorithm 1, or one needs a better understanding of the distribution of the bad and missing hashes, together with a better coding scheme which exploits the lower entropy of this distribution to communicate efficiently, using only O(1) bits per bad hash to describe its position and correct value. This is what we do next.
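The counting behind this lower bound is the standard binomial-coefficient estimate: with N symbols and k differing positions the receiver learns which of the \binom{N}{k} position sets occurred, and

    \[
      \log_2\binom{N}{k} \;\ge\; \log_2\Bigl(\tfrac{N}{k}\Bigr)^{k} \;=\; k\log_2\tfrac{N}{k},
    \]

so Ω(k log(N/k)) bits of communication are unavoidable if the positions are worst case.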

Our strategy for reconciling the Hamming distance between the guesses for the level l + 1 hashes stemming from the okay level-(l + 1) matching and the correct values T_{l+1} is to use the statistical distribution of bad hashes induced by random hash collisions to first guess the positions of these bad hashes.

We remark that even if one were able to identify all positions with incorrect hashes, recovering the correct hash values would still remain a hard problem, because these positions are not known to the sender, i.e., the party preparing the summary S(F). Interestingly, and maybe somewhat counter-intuitively, solving this erasure problem deterministically using O(k) bits of communication is impossible, despite the fact that the positions of the differences are known to the receiver (but not the sender). The argument above, that Ω(k log(N/k)) bits are learned by the receiver and therefore need to be communicated, falls away in this setting because only the correct values, i.e., O(k) bits of information, are learned by the receiver in the end. Still, any solution which allows one to deterministically recover the correct values for any k erasures induces a coloring of all length-N symbol strings which must assign a different output, or color, to any two strings that have Hamming distance at most k. If this were not the case, the output of the sender would be the same for two strings of Hamming distance at most k, and the receiver would not be able to determine which is the correct recovery if all positions of their differences are erased. In short, even recovering from k erasures induces a Hamming code of distance larger than k, which in turn could be used to recover from any ⌊k/2⌋ corruptions as well, where by the argument above clearly Ω(k log(N/k)) bits of communication are necessary.

Understanding the Distribution of Bad Hashes

The above shows that indeed a better understanding of the distribution of bad hashes is required. In essence we will show that typically these bad hashes are concentrated enough in their occurrences to end up with a sufficiently small number of guesses, which furthermore can be efficiently generated and verified.

Suppose we have run Algorithm 2 for i iterations. As proved in Lemma 4.0.5, with high probability, in each level the matching M' adds at most O(k) new guessed substrings. These substrings get split into two in every level thereafter, or are eliminated if non-matching hashes reveal that a guess was wrong. Each such substring in level i can thus be classified by the level j in which its first ancestor was generated, by which of the at most O(k) substrings added in level j this ancestor was, and by which of the at most 2^{i−j} substrings stemming from this ancestor it is. In particular, for each i the guessed substrings are naturally organized into at most O(k) binary trees of depth i − j for each level j ≤ i.

If a guessed substring is discovered to be incorrect despite earlier hashes including it indicating otherwise, then all hashes that included this substring in earlier levels failed. Since each of these hashes fails independently with probability 2^{-o}, having a substring from level j be discovered to be wrong at level i has probability 2^{-o(i−j)}. Of course, there are also more of these substrings, namely up to O(k · 2^{i−j}) of them. Given that o is sufficiently large, a union bound shows that one still expects most failures to be among the substrings generated in the most recent iterations, which therefore have not been tested by hashes quite as often. Indeed, we will show that this effect and its concentration are strong enough to generate a sufficiently small number of guesses for the positions of the incorrect substrings. In particular, let us suppose for simplicity that the incorrect substrings stem solely from the last constant number of levels. This would mean that there are only O(k) candidates, and selecting any subset of them of size O(k) would lead to only 2^{O(k)} possibilities to test. For each such possibility one can guess the correct hashes and verify this guess using a hash of T_{l+1} included in C_{l+1}, recovering all hashes correctly with probability 1 − 2^{-Θ(k)}.

Of course, this is overly simplified because it is likely that at least some incorrectly matched substrings survive undetected for more than a constant number of levels. Trying to enumerate every possible set of such substrings, however, would lead to too many trials, which would not be efficient and would furthermore destroy the union bound for the subsequent testing. We thus have to enumerate the guesses more carefully and employ a more sophisticated analysis.

Enumerating Plausible Guesses Using τ-Witnesses

We next define τ-witnesses, which are a specially formatted way of specifying some guesses. To specify a τ-witness at level i, i.e., a guess of at most τ substrings stemming from level i and the levels above it, we first specify a non-negative number for each of the preceding levels, i.e., for each integer j ≤ i we specify an integer k_j ≥ 0, with a restriction that weights substrings from older levels more heavily and in particular ensures that Σ_j k_j ≤ τ. As we will describe later, these k_j essentially specify the number of extra substrings stemming from level j for which our guessed hash does not match the actual hash in the next iteration, because the substring is incorrect but has gone undetected so far. To specify which substrings among those of this level these are, we have for each integer j a subset S_j of positive integers of size |S_j| = k_j.

Lemma 4.0.6.

For any τ the number of τ-witnesses is at most 2^{O(τ)}.

Proof.

The condition on the k_j implies that the sum of all k_j is at most τ. The number of possibilities to split up τ tokens over the levels is at most 2^{O(τ)}. The number of possibilities to pick, for a given level j, a set S_j of k_j indices among the candidates of that level is, thanks to the weighting in the definition of a τ-witness, also at most 2^{O(τ)} when multiplied over all levels. The total number of possibilities for a given setting of the k_j values is thus at most 2^{O(τ)}. ∎

Next we explain how a τ-witness at level i encodes, for a given matching, a set of at most τ substrings. Process the sets S_j from the largest level index to the smallest, and process each S_j from its smallest element to its largest. In particular, we start with the largest j for which S_j is non-empty and select the smallest integer in S_j. This integer specifies a leaf, i.e., a substring, from one of the level-j trees by simply pointing to one of the trees of this level and to one of its leaves (counting leaves in cut-out subtrees as well). Any leaf can be specified this way. Before continuing to process the next element of S_j (or the next smaller j if there is no further integer in the current S_j), we cut all nodes from the chosen substring up to its root, creating at most i − j subtrees of smaller depth, which we add to the corresponding levels. Overall there are at most τ such cuts, and since each cut produces at most one extra tree per level, each level contains only a bounded number of trees and substrings/leaves at the time it is processed. The reason for the cutting is that each incorrect substring from a level-j tree corresponds to i − j failed hashes, with the caveat that for two such substrings these hashes might overlap. Cutting and reclassifying the cut-off trees and leaves/substrings as above makes sure that every substring specified by some S_j corresponds to its own disjoint set of failed hashes.

It remains to show that checking all τ-witnesses for a sufficiently large τ suffices to indeed cover all typical ways in which hashes fail.

Lemma 4.0.7.

For any τ, the probability that there is a set of substrings with non-matching hashes which requires a τ-witness to describe it is at most 2^{-Θ(τ)}.

Proof.

In order for a specific substring generated in level j to still be incorrect at level i, the i − j hashes including it must all have failed to detect it. The probability for this is 2^{-o(i−j)}, and due to the cutting of overlapping hashes these probabilities are independent given independent hashes. Given the at most O(k · 2^{i−j}) substrings stemming from level j, the expected number of level-j substrings discovered to be incorrect in this level is at most O(k · 2^{(1−o)(i−j)}), and the probability for k_j such substrings to exist decays exponentially in o · k_j · (i − j). The probability for a fixed τ-witness to correctly describe the substrings with non-matching hashes is thus exponentially small in o · τ. Because, due to Lemma 4.0.6, the number of τ-witnesses is at most 2^{O(τ)}, and since o is a sufficiently large constant, even the union bound over all witnesses of size equal to or larger than τ is a geometric sum converging to 2^{-Θ(τ)}. ∎

Therefore, checking all 2^{O(τ)} τ-witnesses suffices, with probability 1 − 2^{-Θ(τ)}, to find (a superset of) the positions of the substrings which have a non-matching hash in this level. Even trying all possibilities for what the correct hash values at these positions are, for each of these guesses, and verifying whether a guess leads to a matching hash for the whole string of hashes works correctly, because the probability that an incorrect guess has a matching verification hash is 2^{-Θ(k)}, which even after a union bound over all guesses is negligible.

Overall, for k in a middle range, this makes Algorithm 2 correct and efficient, since each of the levels succeeds with probability 1 − 2^{-Θ(k)} and the number of guesses one needs to try in each level is at most 2^{O(k)}. For very small k there are too many levels to simply take a union bound of the failure probabilities for correctness, and for larger k the number of guesses needed makes the algorithm inefficient. These two problems are handled relatively easily, as we show next. We first prove Theorem 2.1 for the case of small k, i.e., for k small enough that trying 2^{O(k)} guesses per level remains efficient:

Proof of Theorem 2.1 for small k.

We first note that with a constant hash size o the hashes contributed to the summary are indeed of total size O(k log(n/k)) as desired. Furthermore, given the constructions of ε-biased distributions from [18], one can specify the randomness for each level using only O(k + log n) bits. We thus overall have a summary size of O(k log(n/k)) as claimed.

It furthermore follows almost immediately from Lemma 4.0.5 that Algorithm 2 is a successful decoding algorithm with high probability, as long as the randomness R for each level is chosen independently from a 2^{-Θ(k)}-biased distribution.

In particular, by induction on l, each iteration starts with a correctly recovered T_l and an okay matching M_l. This is true for l = 0 because T_0 is part of the summary of F given as input and M_0 is the empty level-0 matching, which consists of k unmatched hashes. For subsequent levels we get from Lemma 4.0.5 that M_{l+1} is also okay. This then leads to a guess for the level l + 1 hashes which is correct up to O(k) hashes. With probability 1 − 2^{-Θ(k)} these can be described by a τ-witness with τ = O(k), by Lemma 4.0.7. Guessing the correct hash values for this witness then leads to the correct hashes T_{l+1}, which is recognized by a matching verification hash in C_{l+1}. The probability that among the other 2^{O(k)} τ-witnesses, each with 2^{O(k)} guesses for their hash values, there is an incorrect one which still produces a matching verification hash is 2^{-Θ(k)}. In each level we thus recover the correct hashes with probability 1 − 2^{-Θ(k)}. A union bound over all levels then leads to a failure probability of at most O(log(n/k)) · 2^{-Θ(k)}.

While this failure probability is small for moderately large k, it is not quite as strong as the failure probability claimed by Theorem 2.1, and it furthermore becomes meaningless for even smaller k. Therefore, for very small k we modify Algorithm 2 as follows: Enumerate over every subset of the levels of a suitably chosen size t and, instead of running the matching step of Algorithm 2 on these levels, try all k-plausible matchings as possibilities for M'. Given that there are not too many subsets of levels and at most 2^{O(k)} matchings to try for each of these levels, this requires only a small number of different modified runs of Algorithm 2, which is an essentially negligible overhead in the recovery time. Furthermore, the probability that none of these runs successfully recovers F is at most the probability of having more than t failures among the log(n/k) trials, in which each trial succeeds independently with probability 1 − 2^{-Θ(k)}. To see this, we apply Lemma 4.0.5 as before until the first iteration fails, which happens with independent probability 2^{-Θ(k)} for each iteration. We then look at the run in which the first failed iteration is the first iteration whose matching process is simply replaced by a "guess" for M', and in particular the run in which this guess is the correct matching. The following iterations of this run again fail independently with probability 2^{-Θ(k)}. If we continue to replace all failing iterations in this way, we end up with a correct run, unless more than t iterations fail independently.

Over all these runs we get a small number of potential guesses for F. For each such guess we can check whether indeed a file was recovered with