Given two strings over some finite alphabet, the edit distance between them is defined as the minimum number of edit operations (insertions, deletions and substitutions) needed to change one string into the other. Being one of the simplest metrics, edit distance has been extensively studied due to its wide applications in different areas. For example, in natural language processing, edit distance is used in automatic spelling correction to determine possible corrections of a misspelled word; and in bioinformatics it can be used to measure the similarity between DNA sequences. In this paper, we study the general question of recovering from errors caused by edit operations. Note that without loss of generality we can consider only insertions and deletions, since a substitution can be replaced by a deletion followed by an insertion, and this at most doubles the number of operations. Thus from now on we will only be interested in insertions and deletions, and we redefine the edit distance between two strings to be the minimum number of such operations required to change one into the other.
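To make the reduction from substitutions to insertions and deletions concrete, the following minimal Python sketch computes both distances for small strings. The function names are ours and purely illustrative. The insertion/deletion distance is obtained from the longest common subsequence, and it always lies between the three-operation edit distance and twice that value.

```python
def lcs_length(x: str, y: str) -> int:
    """Length of a longest common subsequence of x and y (row-by-row DP)."""
    prev = [0] * (len(y) + 1)
    for a in x:
        cur = [0]
        for j, b in enumerate(y, start=1):
            cur.append(prev[j - 1] + 1 if a == b else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def indel_distance(x: str, y: str) -> int:
    """Minimum number of insertions and deletions turning x into y."""
    return len(x) + len(y) - 2 * lcs_length(x, y)


def levenshtein(x: str, y: str) -> int:
    """Minimum number of insertions, deletions and substitutions."""
    prev = list(range(len(y) + 1))
    for i, a in enumerate(x, start=1):
        cur = [i]
        for j, b in enumerate(y, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (a != b)))
        prev = cur
    return prev[-1]


# One substitution can be simulated by one deletion plus one insertion.
print(levenshtein("0110", "0100"), indel_distance("0110", "0100"))  # 1 2
```

Both routines run in time quadratic in the string lengths, which is enough for experimenting with the definitions but is unrelated to the running times of the protocols discussed in this paper.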
Insertion and deletion errors happen frequently in practice. For example, they occur in the process of reading magnetic and optical media, in genetic mutation where DNA sequences may change, and in internet protocols where some packets may get lost during routing. Another typical situation where these errors can occur is in distributed file systems, e.g., when a file is stored in different machines and being edited by different people working on the same project. These files then may have different versions that need to be synchronized to remove the edit errors. In this context, we study the following two basic problems regarding insertion and deletion errors.
Document exchange. In this setting, two parties Alice and Bob each holds a string and , and we assume that their edit distance is bounded by some parameter . The goal is to have Alice send a sketch to Bob based on her string and the edit distance bound , such that Bob can recover Alice’s string based on his string and the sketch. Naturally, we would like to require both the message length and the computation time of Alice and Bob to be as small as possible.
Error correcting codes. In this setting, two parties Alice and Bob are linked by a channel where the number of worst case insertions and deletions is bounded by some parameter . Given any message, the goal is to have Alice send an encoding of the message to Bob through the channel, so that despite any insertion and deletion errors that may happen, Bob can recover the correct message after receiving the (possibly modified) codeword. Again, we would like to minimize both the codeword length (or equivalently, the number of redundant bits) and the encoding/decoding time. This is a generalization of the classical error correcting codes for Hamming errors.
It can be seen that these two problems are closely related. In particular, a solution to the document exchange problem can often be used to construct an error correcting code for insertion and deletion errors. In this paper, we focus on the setting where the strings have a binary alphabet, arguably the most popular and important setting in computer science. (Footnote: although our document exchange protocols can be easily extended to larger alphabets, we omit the details here.) In this case, if Alice’s string (or the message she wants to send) has length , then it is known that for small (e.g., ) both the optimal sketch size in document exchange and the optimal number of redundant bits in an error correcting code are , and this is true even for Hamming errors. In addition, both optima can be achieved using exponential time, with the first one using a greedy coloring algorithm and the second one using a greedy sphere packing algorithm (which is essentially what gives the Gilbert-Varshamov bound).
It turns out that in the case of Hamming errors, both optima (up to constants) can also be achieved efficiently in polynomial time. This is done by using sophisticated linear Algebraic Geometric codes . As a special case, one can use Reed-Solomon codes to achieve in both problems. However, the situation becomes much harder once we switch to edit errors, and our understanding of these two basic problems lags far behind the case of Hamming errors. We survey related previous work below.
Historically, Orlitsky  was the first one to study the document exchange problem. His work gave protocols for generally correlated strings using a greedy graph coloring algorithm, and in particular he obtained a deterministic protocol with sketch size for edit errors. However, the running time of the protocol is exponential in . The main question left there is whether one can design a document exchange protocol that is both communication efficient and time efficient.
There has been considerable progress afterwards , , . Specifically, Irmak et al.  gave a randomized protocol that achieves sketch size and running time . Independently, Jowhari  also obtained a randomized protocol with sketch size and running time . A recent work by Chakraborty et al.  introduced a clever randomized embedding from the edit distance metric to the Hamming distance metric, and thus obtained a protocol with sketch size and running time . Using the embedding in , Belazzougui and Zhang  gave an improved randomized protocol with sketch size and running time , where the sketch size is asymptotically optimal for .
All of the above protocols, except the exponential time protocol of Orlitsky , are however randomized. In practice, a deterministic protocol is certainly more useful than a randomized one. Thus one natural and important question is whether one can construct a deterministic protocol for document exchange with small sketch size (e.g., polynomial in ) and efficient computation. This question is also important for applications in error correcting codes, since a randomized document exchange protocol is not very useful in designing such codes. It turns out that this question is quite tricky, and no such deterministic protocol was known even for until the work of Belazzougui in 2015, which gave a deterministic protocol with sketch size and running time .
Error correcting codes.
Error correcting codes are fundamental objects in both theory and practice. Starting from the pioneering work of Shannon, Hamming and many others, error correcting codes have been intensively studied in the literature. This is true for both standard Hamming errors such as symbol corruptions and erasures, and edit errors such as insertions and deletions. While the study of codes against standard Hamming errors has been a great success, leading to a near complete knowledge and a powerful toolbox of techniques together with explicit constructions that match various bounds, our understanding of codes for insertion and deletion errors (insdel codes for short) is still rather poor. Indeed, insertion and deletion errors are strictly more general than Hamming errors, and the study of codes against such errors has resisted progress for quite some time, as demonstrated by previous work which we discuss below.
Since insertion and deletion errors are strictly more general than Hamming errors, all the upper bounds on the rate of standard codes also apply to insdel codes. Moreover, by using a similar sphere packing argument, similar lower bounds on the rate (such as the Gilbert-Varshamov bound) can also be shown. In particular, one can show (e.g., ) that for binary codes, to encode a message of length against insertion and deletion errors with , the optimal number of redundant bits is ; and to protect against fraction of insertion and deletion errors, the optimal rate of the code is . On the other hand, if the alphabet of the code is large enough, then one can potentially recover from an error fraction approaching or achieve the singleton bound: a rate code that can correct fraction of insertion and deletion errors.
However, achieving these goals has been quite challenging. In 1966, Levenshtein first showed that the Varshamov-Tenengolts code can correct one deletion with roughly redundant bits, which is optimal. Since then many constructions of insdel codes have been given, but all constructions are far from achieving the optimal bounds. In fact, even correcting two deletions requires redundant bits, and even the first explicit asymptotically good insdel code (a code that has constant rate and can also correct a constant fraction of insertion and deletion errors) over a constant alphabet did not appear until the work of Schulman and Zuckerman in 1999 , who gave such a code over the binary alphabet. We refer the reader to the survey by Mercier et al. for more details about the extensive research on this topic.
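As a concrete illustration of the single-deletion case, the following Python sketch implements the Varshamov-Tenengolts checksum together with the standard decoding rule for one deletion. The function names and the list-of-bits representation are our own choices for illustration. A word of length n belongs to the code with parameter a when the weighted sum of its bits (position i counted with weight i, starting from 1) is congruent to a modulo n + 1.

```python
def vt_syndrome(bits):
    """Weighted checksum sum(i * x_i) mod (n + 1), with positions starting at 1."""
    return sum(i * b for i, b in enumerate(bits, start=1)) % (len(bits) + 1)


def vt_decode_one_deletion(y, n, a=0):
    """Recover the length-n word with syndrome a from y, which lost exactly one bit."""
    assert len(y) == n - 1
    weight = sum(y)
    # Amount by which the checksum of y falls short of the target syndrome.
    deficit = (a - sum(i * b for i, b in enumerate(y, start=1))) % (n + 1)
    if deficit <= weight:
        # A 0 was deleted: reinsert it with exactly `deficit` ones to its right.
        pos, ones = len(y), 0
        while ones < deficit:
            pos -= 1
            ones += y[pos]
        return y[:pos] + [0] + y[pos:]
    # A 1 was deleted: reinsert it with exactly deficit - weight - 1 zeros to its left.
    zeros_needed = deficit - weight - 1
    pos, zeros = 0, 0
    while zeros < zeros_needed:
        zeros += 1 - y[pos]
        pos += 1
    return y[:pos] + [1] + y[pos:]


x = [1, 0, 1, 1, 0, 0, 1]
a = vt_syndrome(x)
received = x[:2] + x[3:]          # the third bit is deleted
print(vt_decode_one_deletion(received, len(x), a) == x)  # True
```

The decoder only needs the length and the syndrome of the original word, which is why fixing the syndrome in advance costs roughly logarithmically many redundant bits.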
In the past few years, there has been a series of work trying to improve the situation for both the binary alphabet and larger alphabets. Specifically, for larger alphabets, a line of work by Guruswami et al. , , constructed explicit insdel codes that can correct fraction of errors with rate and alphabet size ; and for a fixed alphabet size explicit insdel codes that can correct fraction of errors with rate . These works aim to tolerate an error fraction approaching by using a sufficiently large alphabet size. Another line of work by Haeupler et al. , , introduced and constructed a combinatorial object called synchronization string, which can be used to transform standard error correcting codes into insdel codes, at the price of increasing the alphabet size. Using explicit constructions of synchronization strings, this line of work achieved explicit insdel codes that can correct fraction of errors with rate (hence approaching the singleton bound), although the alphabet size is exponential in .
For the binary alphabet, which is the focus of this paper, it is well known that no code can tolerate an error fraction approaching or achieve the singleton bound. Instead, the major goal here is to construct explicit insdel codes for some small fraction () or some small number () of errors that can achieve the optimal rate of or the optimal redundancy of , which is analogous to achieving the Gilbert-Varshamov bound for standard error correcting codes. Slightly less ambitiously, one can ask to achieve redundancy , which is optimal when for any constant , and easy to achieve in the case of Hamming errors by using Reed-Solomon codes. In this context, Guruswami et al. , constructed explicit insdel codes that can correct fraction of errors with rate , which is the best possible by using code concatenation. For any fixed constant , another work by Brakensiek et al. constructed an explicit insdel code that can encode an -bit message against insertions/deletions with redundant bits, which is asymptotically optimal when is a fixed constant. We remark that the construction in only works for constant , and does not give anything when becomes larger (e.g., ). Finally, using his deterministic document exchange protocol, Belazzougui constructed an explicit insdel code that can encode an -bit message against insertions/deletions with redundant bits. In summary, there remains a huge gap between the known constructions and the optimal bounds in the case of a binary alphabet. In particular, even achieving redundancy has been far out of reach.
1.1 Our results
In this paper we significantly improve the situation for both document exchange and binary insdel codes. Our new constructions of explicit insdel codes are actually almost optimal for a wide range of error parameters . First, we have the following theorem which gives an improved deterministic document exchange protocol.
Theorem 1.1. There exists a single round deterministic protocol for document exchange with communication complexity (sketch length) , time complexity , where is the length of the string and is the edit distance upper bound.
Note that this theorem significantly improves the sketch size of the deterministic protocol in , which is . In particular, our protocol is interesting for up to while the protocol in  is interesting only for .
We can use this theorem to get improved binary insdel codes that can correct fraction of errors.
Theorem 1.2. There exists a constant such that for any there exists an explicit family of binary error correcting codes with codeword length and message length , that can correct up to edit errors with rate .
For the general case of errors, the document exchange protocol also gives an insdel code.
Theorem 1.3. For any with , there exists an explicit binary error correcting code with message length , codeword length that can correct up to edit errors.
When is small, e.g., for some constant , the above theorem gives redundant bits. Our next theorem shows that we can do better, and in fact we can achieve redundancy , which is asymptotically optimal for small .
Theorem 1.4. For any , there exists an explicit binary error correcting code with message length , codeword length that can correct up to edit errors.
Note that in this theorem, the number of redundant bits needed is , which is asymptotically optimal for , any constant . This significantly improves the construction in , which only works for constant and has redundancy , and the construction in , which has redundancy . In fact, this is the first explicit construction of binary insdel codes that have optimal redundancy for a wide range of error parameters , and this brings our understanding of binary insdel codes much closer to that of standard binary error correcting codes.
In all our insdel codes, both the encoding function and the decoding function run in time .
In a recent independent work , Haeupler also studied the document exchange problem and binary error correcting codes for edit errors. Specifically, he also obtained a deterministic document exchange protocol with sketch size , which leads to an error correcting code with the same redundancy, thus matching our Theorems 1.1, 1.2 and 1.3. He further gave a randomized document exchange protocol that has optimal sketch size . However, for small , our Theorem 1.4 gives a much better error correcting code. Our code is optimal for , any constant , and thus better than the code given in .
1.2 Overview of the techniques
In this section we provide a high level overview of the ideas and techniques used in our constructions. We start with the document exchange protocol.
Our starting point is the randomized protocol by Irmak et al. , which we refer to as the IMS protocol. The protocol is one round, but Alice’s algorithm to generate the message proceeds in levels. In each level Alice computes some sketch about her string , and her final message to Bob is the concatenation of the sketches. After receiving the message, Bob’s algorithm also proceeds in levels, where in each level he uses the corresponding sketch to recover part of .
More specifically, in the first level Alice divides her string into blocks where each block has size , and in each subsequent level every block is divided evenly into two blocks, until the final block size becomes . This takes levels. Using shared randomness, in each level Alice picks a set of random hash functions, one for each block which outputs bits, and computes the hash values. In the first level, Alice’s sketch is just the concatenation of the hash values. In all subsequent levels, Alice obtains the sketch in this level by computing the redundancy of a systematic error correcting code (e.g., the Reed-Solomon code) that can correct erasures and symbol corruptions, where each symbol has bits (the hash value). Note that this sketch has size and thus the total sketch size is .
On Bob’s side, he always maintains a string which is the partially corrected version of . Initially is the empty string, and in each level Bob tries to use his string to fill . This is done as follows. In each level Bob first tries to recover all the hash values of Alice in this level (notice that the hash values of the first level are directly sent to Bob). Once Bob has successfully recovered all the hash values, he tries to match every block of Alice’s string in this level with one substring of the same length in his string , by finding such a substring with the same hash value. We say such a match is bad if the substring Bob finds is not the same as Alice’s block (i.e., a hash collision). The key idea here is that if the hash functions output bits, and they are chosen independently at random, then with high probability a bad match only happens if the substring Bob finds contains at least one edit error. Moreover, Bob can find at least matches, where is the number of blocks in the current level. Bob then uses the matched substrings to fill the corresponding blocks in , and leaves the unmatched blocks blank. From the above discussion, one can see that there are at most unmatched blocks and at most mismatched blocks. Therefore in the next level when both parties divide every block evenly into two blocks, and have at most different blocks. This implies that there are also at most different hash values in the next level, and hence Bob can correctly recover all the hash values of Alice using the redundancy of the error correcting code.
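The matching step of a single level can be sketched in Python as follows. This is a simplified illustration and not the IMS protocol itself: the keyed BLAKE2b hash from the standard library stands in for the shared random hash functions, the 4-byte output length and all function names are assumptions of ours, and the string length is assumed to be a multiple of the block length. Bob receives one hash value per block and scans all substrings of the same length in his own string.

```python
import hashlib


def block_hash(block: str, index: int, seed: bytes, nbytes: int = 4) -> bytes:
    """Stand-in for the hash function of block `index`, derived from a shared seed."""
    return hashlib.blake2b(
        block.encode(), digest_size=nbytes, key=seed, salt=index.to_bytes(8, "big")
    ).digest()


def alice_hashes(x: str, block_len: int, seed: bytes):
    """One hash value per block of Alice's string (len(x) divisible by block_len)."""
    return [
        block_hash(x[i:i + block_len], i // block_len, seed)
        for i in range(0, len(x), block_len)
    ]


def bob_fill(y: str, hashes, block_len: int, seed: bytes):
    """For each block, take the first substring of y with the same hash value, else None."""
    filled = []
    for index, target in enumerate(hashes):
        match = None
        for start in range(len(y) - block_len + 1):
            candidate = y[start:start + block_len]
            if block_hash(candidate, index, seed) == target:
                match = candidate
                break
        filled.append(match)
    return filled


seed = b"shared-seed"
x = "01101001100101100001101111001010"
y = x[:3] + x[4:]                      # Bob's copy has one deletion
print(bob_fill(y, alice_hashes(x, 8, seed), 8, seed))
```

In this toy run only the block containing the deletion can fail to be matched correctly. The blocks left as None are the ones that the next level has to repair after halving the block size.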
Our deterministic protocol for document exchange is a derandomized version of the IMS protocol, with several modifications. First, we observe that the IMS protocol, as presented above, can already be derandomized. This is because, to ensure that a bad match only happens if the substring Bob finds contains at least one edit error, we in fact just need to ensure that under any hash function, no block of can have a collision with a substring of the same length in itself. We emphasize one subtle point here: Alice’s hash function is applied to a block of her string , while when trying to fill , Bob actually checks every substring of the string . Therefore we need to consider hash collisions between blocks of and substrings of . If the hash functions are chosen independently and uniformly, then such a collision happens with probability , and thus by a union bound with high probability no collision happens. However, notice that if we write out the outputs of all hash functions on all inputs, then any collision is only concerned with two outputs, which consist of bits. Thus it is enough to use -wise independence to generate these outputs. To further save the random bits used, we can instead use an almost -wise independent sample space with and error . Using for example the construction by Alon et al. , this results in a total of random bits (the seed), and thus Alice can exhaustively search for a fixed set of hash functions in polynomial time. Now in each level, Alice’s sketch will also include the specific seed that is used to generate the hash functions, which has bits. Note this only adds to the final sketch size. Bob’s algorithm is essentially the same, except now in each level he needs to use the seed to compute the hash functions.
The above construction gives a deterministic document exchange protocol with sketch size , but our goal is to further improve this to . The key idea here is to use a relaxed version of hash functions with nice “self matching” properties. To motivate our construction, first observe that in each level, when Bob tries to match every block of Alice’s string with one substring of the same length in his string , it is not only true that Bob can find a matching of size at least (where is the number of blocks in this level), but also true that Bob can find a monotone matching of at least this size. A monotone matching here means a matching that does not have edges crossing each other. In this monotone matching, there are at most bad matches caused by edit errors, and thus there exists a self matching between and itself with size at least . In the previous construction, we in fact ensure that all these matches are correct. To achieve better parameters, we instead relax this condition and only require that at most of these self matches are bad. Note if this is true then the total number of different blocks between and is still and we can again use an error correcting code to send the redundancy of hash values in the next level.
This relaxation motivates us to introduce -self matching hash functions, which is similar in spirit to -self matching strings introduced in . Formally, we have the following definitions.
(monotone matching) For every , any hash functions where , given two strings and , a monotone matching of size between under hash functions is a sequence of pairs of indices s.t. , and . When are clear from the context, we simply say that is a monotone matching between .
For , if , we say is a good match, otherwise we say it is a bad match. We say is a correct matching if all matches in are good. We say is a completely wrong matching if all matches in are bad.
If and are the same in terms of their binary expression, then is called a self-matching.
For simplicity, in the rest of the paper, when we say a matching we always mean a monotone matching.
(-self matching hash function) Let be such that . For any and , we say that a sequence of hash functions where is a sequence of -self matching hash functions with respect to , if any matching between and under , where is the binary expression of , has at most bad matches.
The advantage of using -self matching hash functions is that the output range of the hash functions can be reduced. Specifically, we can show that a sequence of -self matching hash functions exists with output range (i.e., bits) when the block size is at least bits for some constant . Furthermore, we can generate such a sequence of -self matching hash functions with high probability by again using an almost -wise independent sample space with , where is the current block length, and error . The idea is that in a monotone matching with bad matches, we can divide the matching gradually into small intervals such that at least one small interval will have the same fraction of bad matches. Thus in order to ensure the -self matching property we just need to make sure every small interval does not have more than fraction of bad matches, and this is enough by using the almost -wise independent sample space.
As discussed above, we need to ensure that there are at most bad matches in a self matching, thus we set . Consequently now the output of the hash functions only has bits instead of bits. Now in each level, in order to get optimal sketch size, instead of using the Reed-Solomon code we will be using an Algebraic Geometric code which has redundancy . The almost -wise independent sample space in this case again uses only random bits, so in each level Alice can exhaustively search for the correct hash functions in polynomial time and include the bits of description in the sketch. This gives Alice’s algorithm with total sketch size . On Bob’s side, we need another modification: in each level after Bob recovers all the hash values, instead of simply searching for a match for every block, Bob runs a dynamic program to find the longest monotone matching between his string and the sequence of hash values. He then fills the blocks of using the corresponding substrings of matched blocks.
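Bob's dynamic program can be sketched as follows. This is a minimal illustration under assumptions of ours: the hash function h is a BLAKE2b stand-in rather than a self matching hash function, all blocks have the same length b, and a monotone matching is taken to mean that block indices increase and the matched substrings of Bob's string appear in the same order without overlapping. The table entry f[i][j] holds the size of the largest such matching between the first i hash values and the first j characters of Bob's string.

```python
import hashlib


def h(i: int, s: str) -> bytes:
    """Illustrative stand-in for the i-th hash function."""
    return hashlib.blake2b(s.encode(), digest_size=4, salt=i.to_bytes(8, "big")).digest()


def longest_monotone_matching(y: str, hashes, b: int):
    """Return pairs (block index, start in y) forming a largest monotone matching."""
    m, n = len(hashes), len(y)
    f = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            best = max(f[i - 1][j], f[i][j - 1])
            # Option: match block i-1 with the substring of y ending at position j.
            if j >= b and h(i - 1, y[j - b:j]) == hashes[i - 1]:
                best = max(best, f[i - 1][j - b] + 1)
            f[i][j] = best
    # Trace back one optimal matching.
    pairs, i, j = [], m, n
    while i > 0 and j > 0:
        if f[i][j] == f[i - 1][j]:
            i -= 1
        elif f[i][j] == f[i][j - 1]:
            j -= 1
        else:
            pairs.append((i - 1, j - b))
            i, j = i - 1, j - b
    return pairs[::-1]


b = 8
x = "01101001100101100001101111001010"
hashes = [h(i, x[k:k + b]) for i, k in enumerate(range(0, len(x), b))]
y = x[:3] + x[4:]                      # one deletion inside the first block
print(longest_monotone_matching(y, hashes, b))
```

The table has one entry per pair of a block and a position in Bob's string, and each entry costs one hash evaluation, so the sketch runs in time proportional to the number of blocks times the length of Bob's string.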
Error correcting codes.
Our deterministic document exchange protocol can be used to directly give an insdel code for edit errors. The idea is that to encode an -bit message , we can first compute a sketch of with size , and then encode the small sketch using an insdel code against edit errors. Since the sketch size is larger than , we can use an asymptotically good code such as the one by Schulman and Zuckerman , which results in an encoding size of . The actual encoding of the message is then the original message concatenated with the encoding of the sketch.
To decode, we can first obtain the sketch by looking at the last bits of the received string. The edit distance between these bits and the encoding of the sketch is at most , and thus we can get the correct sketch from these bits. Now we look at the bits of the received string from the beginning to index . The edit distance between these bits and is at most , thus if is a sketch for edit errors then we will be able to recover by using . This gives our first insdel code with redundancy .
We now describe our second insdel code, which has redundancy and uses many more interesting ideas. The basic idea here is again to compute a sketch of the message and encode the sketch, as we described above. However, we are not able to improve the sketch size of in general. Instead, our first observation here is that if the -bit message is a uniform random string, then we can actually do better. Thus, we will first describe how to come up with a sketch of size for a uniform random string, and then use this to get an encoding for any given -bit string.
To warm up, we first explain a simple algorithm to compute a sketch of size in this case. A uniform random string has many nice properties. In particular, for some , one can show that with high probability, every length substring in a uniform random string is distinct. If this property holds (which we refer to as the -distinct property), then Alice can compute a sketch as follows. First create a vector of length , where in each entry indexed by the string , Alice looks at the substring in and records the bit to its left and the bit to its right. We need three special symbols, one to indicate the case of no left bit, one to indicate the case of no right bit, and one to indicate the case that there is no such string in . Thus it is enough to use an alphabet of size for each entry. Similarly Bob can create a vector from his string . Notice that the entries in have no collisions since we assume that Alice’s string is -distinct, while some entries in may have collisions due to edit errors, in which case Bob just treats it as there is no corresponding substring in . One can then show that and differ in at most entries. Now Alice can use the Reed-Solomon code to send a redundancy of size , and Bob can recover the correct . Bob can then recover the string by picking an entry in and growing the string on both ends gradually until obtaining the full string .
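The last step of the warm-up, rebuilding the string from the table of neighboring bits, can be sketched as follows. This is a simplification under assumptions of ours: a Python dictionary keyed by the substrings that actually occur replaces the full vector with its "no such string" symbol, None plays the role of the two "no left bit" and "no right bit" symbols, and the error correction step that lets Bob obtain Alice's table is omitted.

```python
def is_distinct(x: str, p: int) -> bool:
    """True if every length-p substring of x occurs exactly once."""
    subs = [x[i:i + p] for i in range(len(x) - p + 1)]
    return len(set(subs)) == len(subs)


def neighbor_table(x: str, p: int):
    """Map each length-p substring to the bits immediately to its left and right."""
    table = {}
    for i in range(len(x) - p + 1):
        left = x[i - 1] if i > 0 else None
        right = x[i + p] if i + p < len(x) else None
        table[x[i:i + p]] = (left, right)
    return table


def rebuild(table, p: int) -> str:
    """Start from an arbitrary entry and grow the string on both ends."""
    s = next(iter(table))
    while (left := table[s[:p]][0]) is not None:
        s = left + s
    while (right := table[s[-p:]][1]) is not None:
        s = s + right
    return s


x = "0001011100"
print(is_distinct(x, 3), rebuild(neighbor_table(x, 3), 3) == x)  # True True
```

Growing terminates because, under the distinctness assumption, each step moves to the unique occurrence one position further left or right.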
We now show how to reduce the sketch size. In the above approach, and can differ in entries since Alice is looking at every substring of length in . To improve this, instead we will have Alice first partition her string into several blocks, and then just look at the substrings corresponding to each block. Ideally, we want to make sure that each block has length at least so that again all blocks are distinct. Alice then creates the vector of length by using the -prefix of each block as the index in , and for each entry in Alice will record some information. Bob will do the same thing using his string to create another vector , and we will argue that and do not differ much so Alice can send some redundancy information to Bob, and Bob can recover the correct based on and the redundancy information.
However, the partitions need to be done carefully. For example, we cannot just partition both strings sequentially into blocks of size some , since if so then a single insertion/deletion at the beginning of could result in the case where all blocks of and are distinct. Instead, we will choose a specific string with length , for some parameter that we will choose appropriately. We call this string a pattern, and we will use this pattern to divide the blocks in and . More specifically, in our construction we will simply choose , the string with a followed by ’s. We use to divide the string as follows. Whenever appears as a substring in , we call the index corresponding to the bit of in a p-split point. The set of split points then naturally gives a partition of the string (and also into blocks). We note that the idea of using patterns and split points is also used in . However, there the construction uses patterns and takes a majority vote, which is why the sketch has a factor. Here instead we use a single pattern, and thus we can avoid the factor.
We now describe how to choose the pattern length . For the blocks of and obtained by the split points, we certainly do not want the block size to be too large. This is because if there are large blocks then an adversary can create errors in these blocks, and large blocks need longer sketches to recover from error. At the same time, we do not want the block size to be too small either. This is because if the block size is too small, then some of the blocks may actually be the same, while we would like to keep all blocks of to be distinct. In particular, we would like to keep the size of every block in to be at least . If is a uniform random string, then the pattern appears with probability and thus the expected distance between two consecutive appearances of is . Moreover, one can show that with high probability any interval of length some contains a -split point. We will call this property 1. If this property holds then we can argue that every block of has length at most .
To ensure that each block is not too short, we simply look at a -split point and the immediate next -split point after it. If the distance between these two split points is less than then we just ignore the first -split point. In other words, we change the process of dividing into blocks so that we will only use a -split point if the next -split point is at least away from it, and we call such a -split point a good -split point. We again show that if is uniform random, then with high probability every block of length some contains a good -split point. We call this property 2. Combined with the previous paragraph, we can now argue that with high probability every interval of length some will contain a -split point that we will choose. By setting , we can ensure that every block of has length at least and at most .
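The choice of split points described above can be sketched in Python. Several details are assumptions of ours and are not fixed by the text: the pattern is taken to be a 1 followed by zeros, the split point is taken to be the index where an occurrence of the pattern starts, the last split point is always kept because it has no successor, and the pattern length and minimum gap are toy values.

```python
def split_points(x: str, pattern: str):
    """Indices at which an occurrence of the pattern starts."""
    return [i for i in range(len(x) - len(pattern) + 1) if x.startswith(pattern, i)]


def good_split_points(points, min_gap: int):
    """Keep a split point only if the next one is at least min_gap away."""
    good = [cur for cur, nxt in zip(points, points[1:]) if nxt - cur >= min_gap]
    if points:
        good.append(points[-1])   # assumption: the last split point is kept
    return good


def blocks(x: str, cuts):
    """Cut x at the given indices."""
    bounds = [0] + [c for c in cuts if c > 0] + [len(x)]
    return [x[a:b] for a, b in zip(bounds, bounds[1:])]


pattern = "100"                    # a 1 followed by two 0's, toy length
x = "0010010001100100"
y = x[1:]                          # Bob's copy lost the first bit
bx = blocks(x, good_split_points(split_points(x, pattern), 4))
by = blocks(y, good_split_points(split_points(y, pattern), 4))
print(bx)  # ['00100', '10001100', '100']
print(by)  # ['0100', '10001100', '100']
```

Because the cut positions are determined by the content rather than by fixed offsets, the deletion changes only the block it falls in, whereas cutting both strings into fixed-size blocks would misalign every block.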
Now for each block of , Alice creates an entry in indexed by the -prefix of this block. The entry contains the length of this block and the -prefix of the next block, which has total size . Similarly Bob also creates a vector . We show that our approach of choosing -split points can ensure that and differ in at most entries, thus Alice can send a string with redundant bits to Bob (using the Reed-Solomon code) and Bob can recover the correct . The structure of guarantees that Bob can learn the correct order of the -prefixes of all of Alice’s blocks, and their lengths. At this point Bob will again try to fill a string , where for each of Alice’s blocks Bob searches for a substring with the same -prefix and the same length. We show that after this step, at most blocks in are incorrectly filled or missing. This completes stage 1 of the sketch.
We now move to stage 2, where Bob correctly recovers the at most blocks in that are incorrectly filled or missing. One way to do this is by noticing that every block has size at most , thus we can use a deterministic IMS protocol as we described before, which will last levels and thus have sketch size (since we start with block size ). However, our goal is to do better and achieve sketch size .
To achieve this, we modify the IMS protocol so that in stage 2, in each level we are not dividing every block into smaller blocks. Instead, we divide every block evenly into smaller blocks. We continue this process until the final block size is , and thus this only takes levels. In each level, we will do something similar to our deterministic document exchange protocol: Alice sends a description of a sequence of hash functions to Bob, together with some redundancy of the hash values. Bob recovers the correct hash values and uses dynamic programming to find the longest monotone matching between substrings of and the hash values. Bob then tries to fill the blocks of by using the matched substrings.
In order for the above approach to work, we need to ensure three things. First, Alice’s description of the hash functions should be short, ideally only bits. Second, the redundancy of the hash values only uses bits. Finally, after Bob recovers the hash values, the matching he finds contains at most unmatched blocks and mismatched blocks. For the second issue, we design the hash functions so that the outputs only have bits. One issue here is that if in some level there are at most unmatched blocks and mismatched blocks, then in the next level after dividing there may be unmatched blocks and mismatched blocks. However we observe that these blocks are actually concentrated (i.e., they are smaller blocks in a larger block of the previous level). Thus we can pack all the hash values of the smaller blocks together into a package, and the total size of the hash values is . Now we can show that again there are at most different packages between Alice’s version and Bob’s version, thus it is still enough to use bits of redundancy.
The third issue and the first issue are actually related, and require new ideas. Specifically, we cannot use the -self matching hash functions as we discussed earlier, since that would require the outputs of the hash functions to have bits, while we can only afford bits. To solve this problem, we strengthen the notion of -self matching hash functions and introduce -synchronization hash functions, similar in spirit to the notion of -synchronization strings introduced in . Specifically, we have the following definition.
Let be such that , and . Let be a string of length , and be a partition of into blocks of size . Let be a sequence of functions, where for any , is a function from to .
For a string of length , and some indices , , let denote the size of the maximum matching between and under . We say is a sequence of -synchronization hash functions with respect to , if it satisfies the following properties:
For any three integers where and , denote and .
For any three integers where and , denote and .
It can be seen that -synchronization hash functions are indeed stronger than -self matching hash functions, and we show that even a sequence of -synchronization hash functions with is enough to guarantee that there are at most unmatched blocks and mismatched blocks (in fact, the number of unmatched blocks and mismatched blocks is bounded by ). A similar property for -synchronization strings is also shown in .
Our construction of the -synchronization hash functions actually consists of two parts. In the first part, Alice generates a sequence of hash functions that ensures the -synchronization property holds for relatively large intervals (i.e., when is large). We show that this sequence of hash functions can again be generated by using an almost -wise independent sample space which uses random bits (the argument is similar to that of -self matching hash functions). Thus Alice can exhaustively search for a fixed set of hash functions in polynomial time, and send Bob the description using bits. The output of these hash functions has bits. To protect the small intervals, we compute a hash function for each block such that when the inputs are restricted to length substrings in a small interval, this function is injective. Since all substrings of length are distinct in , this ensures that the outputs of the same hash function within a small interval are also distinct, and thus the -synchronization property also holds for small intervals. The output of these hash functions has bits. We show that these functions can be constructed deterministically using an almost -wise independent sample space which uses random bits, and hence also has description size . Moreover the set of functions computed by Alice and the set of functions computed by Bob have at most differences (after packing each successive such functions together, which only needs bits), and thus Alice can again send redundant bits to Bob and Bob can recover the correct set of hash functions. The final sequence of -synchronization hash functions is then the combination of these two sets of functions, where the output has bits. This concludes our algorithm to generate a sketch of size for a uniform random string.
We now turn to solve the issue of assuming a uniform random string . Put simply, our idea is that given any arbitrary string , we first compute the XOR of and a pseudorandom string (a mask), to turn into a pseudorandom string as well. We will use random bits to generate and show that with high probability over the random bits used, the XOR of and satisfies the -distinct property, as well as property 1 and property 2. For this purpose, we construct three pseudorandom generators (PRGs), one for each property. The -distinct property can be ensured by again using an almost -wise independent sample space which uses random bits. For property 1, we divide the string of length into blocks of length and generate the same mask for every block. Within each block, we can view it as a sequence of sub-blocks each of length . For each sub-block, the test of whether it contains the pattern can be realized as a DNF with size , so we can use a PRG for DNF to fool this test with constant error, and this has length . We then use a random walk on a constant-degree expander graph of length to generate the full mask, which guarantees that the pattern occurs with probability . A union bound now shows that property 1 holds with high probability, and the PRG has seed length . For property 2, we can essentially do the same thing, i.e., divide the string of length into blocks of length and generate the same mask for every block. For each block, again we use a PRG for DNF together with a random walk on a constant-degree expander graph to get a PRG with seed length . Finally we take the XOR of the outputs of the three PRGs, and we show that with high probability all three properties hold for .
Since the PRG has seed length , again we can exhaustively search for a fixed that works for the string , and can be described by bits. The encoding of is then , together with an encoding of the concatenation of the description of and the sketch. For decoding, one can first recover and then recover by using the description of and the PRG.
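The masking step can be illustrated with a small sketch. It is only a toy under stated assumptions: Python's `random` module stands in for the PRG (it is not the construction described above), the window length `w` and the helper names `mask_bits`, `has_distinct_windows` and `find_seed` are hypothetical, and only the distinctness property is checked, not property 1 or property 2.

```python
import random

def mask_bits(seed, n):
    """Stand-in generator: expands a short seed into n mask bits.
    Python's random module is used purely for illustration."""
    rng = random.Random(seed)
    return [rng.getrandbits(1) for _ in range(n)]

def xor_bits(a, b):
    """Bitwise XOR of two equal-length bit lists."""
    return [u ^ v for u, v in zip(a, b)]

def has_distinct_windows(s, w):
    """True if all length-w substrings of s at different positions differ."""
    seen = set()
    for i in range(len(s) - w + 1):
        window = tuple(s[i:i + w])
        if window in seen:
            return False
        seen.add(window)
    return True

def find_seed(x, w, seed_bits):
    """Exhaustively try every seed of the given length and return the first
    one whose mask makes the masked string pass the distinctness test."""
    for seed in range(2 ** seed_bits):
        if has_distinct_windows(xor_bits(x, mask_bits(seed, len(x))), w):
            return seed
    return None
```

Since XOR is its own inverse, a receiver who knows the seed can regenerate the mask and recover the original string by applying `xor_bits` to the masked string once more.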
1.3 More on -synchronization hash functions
Our definition and construction of -synchronization hash functions turn out to have several tricky issues. First of all, there may be other possible definitions of such hash functions, but for our application the current definition is the most suitable one. Second, in our construction of the -synchronization hash functions, we crucially use the fact that the string has the -distinct property. This is because if does not have this property, then when we try to find a maximum monotone matching, it may be the case that every pair in the matching is wrongly matched (e.g., the strings corresponding to the pairs are in different positions but the strings themselves are the same) and this will cause a problem in our analysis. This problem may be solved by modifying the definition of -synchronization hash functions, but it will potentially result in a larger output range of the hash functions (e.g., bits), which we cannot afford.
Finally, our construction of the -synchronization hash functions consists of two separate sets of hash functions, one for large intervals and one for small intervals. In the construction of hash functions for small intervals, we again crucially use the fact that the string has the -distinct property, so that for each block both parties can compute a hash function and Alice can just send the redundancy. We note that here one cannot simply apply the (deterministic) Lovász Local Lemma as in , , , since this will result in a very long description of the hash functions, and Alice cannot just send it to Bob.
Organization of this paper
In Section 2 we introduce some notation and basic techniques used in this paper. In Section 3 we give the deterministic protocol for document exchange. In Section 4 we give a protocol for document exchange of a uniform random string. In Section 5 we construct error correcting codes for edit errors using results from previous sections. Finally we conclude with some open problems in Section 6.
Let be an alphabet (which can also be a set of strings). For a string ,
denotes the length of the string.
denotes the substring of from position to position (both ends included).
denotes the -th symbol of .
denotes the concatenation of and some other string .
-prefix denotes the first symbols of . (Usually used when .)
denotes the concatenation of copies of the string .
denotes the uniform distribution on .
2.2 Edit distance and longest common subsequence
Definition 2.1 (Edit distance).
For any two strings , the edit distance is the minimum number of edit operations (insertions and deletions) required to transform into .
Definition 2.2 (Longest Common Subsequence).
For any strings over , the longest common subsequence of and is the longest pair of subsequences of and that are equal as strings. denotes the length of the longest common subsequence between and .
Note that .
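As a sanity check on the two definitions, the sketch below computes both quantities by dynamic programming. It assumes the standard identity for insertion/deletion-only distance, namely that the distance equals the sum of the two lengths minus twice the LCS length, which is presumably the relation noted above. The function names `lcs_length` and `indel_distance` are ours.

```python
def lcs_length(x, y):
    """Length of the longest common subsequence of x and y."""
    prev = [0] * (len(y) + 1)
    for a in x:
        cur = [0]
        for j, b in enumerate(y, 1):
            if a == b:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def indel_distance(x, y):
    """Minimum number of insertions and deletions turning x into y
    (substitutions are not allowed as a single operation)."""
    prev = list(range(len(y) + 1))  # turning "" into y[:j] costs j insertions
    for i, a in enumerate(x, 1):
        cur = [i]                   # turning x[:i] into "" costs i deletions
        for j, b in enumerate(y, 1):
            if a == b:
                cur.append(prev[j - 1])
            else:
                cur.append(1 + min(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]
```

For example, `"kitten"` and `"sitting"` share the common subsequence `"ittn"` of length 4, so the insertion/deletion distance is 6 + 7 - 2·4 = 5, whereas the distance with substitutions allowed would be 3.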
2.3 Almost k-wise independence
Definition 2.3 (-almost -wise independence in max norm ).
Random variables are -almost -wise independent in max norm if , ,
A function is an -almost -wise independence generator in max norm if are -almost -wise independent in max norm.
In the following passage, unless specified, when we say -almost -wise independence, we mean in max norm.
Theorem 2.4 (-almost -wise independence generator ).
There exists an explicit construction s.t. for every , , it computes an -almost -wise independence generator , where .
The construction is highly explicit in the sense that, , the -th output bit can be computed in time given the seed and .
2.4 Pseudorandom generator
Definition 2.5 (Pseudorandom generator).
A generator is a pseudorandom generator (PRG) against a function with error if
where is called the seed length of .
We also say -fools function . Similarly, -fools a class of functions if fools all functions in .
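For very small parameters the error of a generator against a test can be computed exactly by enumeration. The sketch below assumes the standard form of the definition, i.e. the error is the absolute difference between the acceptance probability of the test on uniform inputs and on generator outputs. The toy generator and the two tests are hypothetical and unrelated to the PRG of Theorem 2.6.

```python
from itertools import product

def acceptance_probability(f, inputs):
    """Fraction of the given inputs on which the 0/1-valued test f outputs 1."""
    inputs = list(inputs)
    return sum(f(x) for x in inputs) / len(inputs)

def prg_error(g, f, seed_len, out_len):
    """Exact distinguishing advantage of test f against generator g,
    computed by enumerating all seeds and all output-length strings."""
    uniform = acceptance_probability(f, product((0, 1), repeat=out_len))
    pseudo = acceptance_probability(
        f, (g(s) for s in product((0, 1), repeat=seed_len)))
    return abs(uniform - pseudo)

def toy_generator(seed):
    """Stretches 2 bits to 3 bits by appending their XOR."""
    a, b = seed
    return (a, b, a ^ b)

def and_term(x):
    """A single conjunction on the first two bits."""
    return int(x[0] and x[1])

def parity(x):
    """Parity of all three bits."""
    return x[0] ^ x[1] ^ x[2]
```

Here `prg_error(toy_generator, and_term, 2, 3)` is 0, so the toy generator fools the conjunction perfectly, while `prg_error(toy_generator, parity, 2, 3)` is 0.5, since the parity of every output is 0.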
Theorem 2.6 (PRG for CNF/DNFs ).
There exists an explicit PRG s.t. for every , every CNF/DNF with variables, terms,
where , .
2.5 Random walk on expander graphs
Definition 2.7 (-expander graphs).
If is an -vertex -regular graph and for some number , then we say that is an -expander graph. Here is defined as the second largest eigenvalue (in absolute value) of the normalized adjacency matrix of .
A family of graphs is a -expander graph family if there are some constants and such that for every , is an -expander graph.
Let be vertex sets of densities in an -expander graph . Let be a random walk on . Then
It is well known that some expander families have strongly explicit constructions.
Theorem 2.9 (Explicit expander family ).
For every constant and some constant depending on , there exists a strongly explicit -expander family.
2.6 Error correcting codes (ECC)
An -code is an ECC (for Hamming errors) with codeword length , message length . The Hamming distance between every pair of codewords in is at least .
Next we recall the definition of ECC for edit errors.
An ECC for edit errors with message length and codeword length consists of an encoding mapping and a decoding mapping . The code can correct edit errors if for every , s.t. , we have . The rate of the code is defined as .
An ECC family is explicit (or has an explicit construction) if both encoding and decoding can be done in polynomial time.
We will utilize linear algebraic geometry codes to compute the redundancy in the Hamming error case.
Theorem 2.11 ().
There exists an explicit algebraic geometry ECC family with polynomial-time decoding when the number of errors is less than half of the distance.
Moreover, , for every message , the codeword is for some redundancy .
To construct ECC from document exchange protocol, we need to use a previous result about asymptotically good binary ECC for edit errors given by Schulman and Zuckerman .
Theorem 2.12 ().
There exists an explicit binary ECC family in which a code with codeword length , message length , can correct up to edit errors.
3 Deterministic protocol for document exchange
We derandomize the IMS protocol given by Irmak et al. , by first constructing -self-matching hash functions and then using them to give a deterministic protocol.
3.1 -self-matching hash functions
The following describes a matching property between strings under some given hash functions.
For every , any hash functions where for every , given two strings and , a (monotone) matching of size between under hash functions is a sequence of pairs of indices s.t. , and .
For , if , we say is a good pair, otherwise we say it is a bad pair. We say is a correct matching if all pairs in are good. We say is a completely wrong matching if all pairs in are bad.
If parsing to be binary we get which is equal to , then is called a self-matching of (or ).
For the match between under hash functions , we simply say a match if and are clear in the context.
The next lemma shows that the maximum matching between two strings under some hash functions can be computed in polynomial time, using dynamic programming.
There is an algorithm s.t. for every , any hash functions where for every , , every and , given , , , , , , , , it can compute the maximum matching between under hash functions in time , if for every , can be computed in time . (Note that need not be part of the input.)
We present a dynamic programming algorithm to compute the maximum matching.
For every , let be the size of the maximum matching between and under . We compute as follows,
Although only computes the size of the maximum matching, we can record the corresponding matching every time when we compute . So finally we can get the maximum matching after computing .
We need to compute one by one and . Every time we compute an , we need to compute a constant number of evaluations of the hash functions and append a pair of indices to some previous records to create the current record of the maximum matching. That takes . So the overall time complexity is . ∎
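The dynamic program can be sketched as follows. This is a minimal illustration under assumptions of ours: matched substrings all have a fixed length `p` and may not overlap, the j-th hash value may only be matched by the j-th hash function, and the recurrence used is the natural one for this setting rather than a transcription of the formula above. The name `max_monotone_matching` and the toy hash functions in the test are hypothetical.

```python
def max_monotone_matching(x, y, hashes, p):
    """Return a maximum monotone matching between length-p substrings of x
    and the hash values y, as a list of 0-based pairs (start, j) with
    hashes[j](x[start:start + p]) == y[j]."""
    n, m = len(x), len(y)
    # f[i][j] = size of a maximum matching between x[:i] and y[:j]
    f = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best = max(f[i - 1][j], f[i][j - 1])
            # try to match the window ending at position i with y[j-1]
            if i >= p and hashes[j - 1](x[i - p:i]) == y[j - 1]:
                best = max(best, f[i - p][j - 1] + 1)
            f[i][j] = best
    # walk back through the table to recover one optimal matching
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        if f[i][j] == f[i - 1][j]:
            i -= 1
        elif f[i][j] == f[i][j - 1]:
            j -= 1
        else:
            pairs.append((i - p, j - 1))
            i, j = i - p, j - 1
    pairs.reverse()
    return pairs
```

The table has one entry per pair of prefix lengths, and each entry needs a constant number of hash evaluations, which is consistent with the running-time argument above. Here the matching is recovered by backtracking through the table rather than by storing a record with each entry, which is a design choice of this sketch.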
Next we show that a matching between two strings with edit distance , induces a self-matching in one of the strings, where the number of bad pairs decreases by at most .
For every , any hash functions where , given two sequences and s.t. , parsing to be , if there exists a matching between and under having at least bad pairs, then there is a self-matching of with size , having at least bad pairs.
Assume w.l.o.g. the bad pairs in the matching are the pairs .
As the number of edit errors is upper bounded by , to edit back to , we only need to do insertions and deletions on at most of the substrings . The substrings left unmodified induce a self-matching of , which is a subsequence of . Note that only at most entries of are excluded. So the number of bad pairs in the matching is at least and the size of is at least .
The following property shows that a matching between two long intervals induces two shorter intervals having a matching s.t. the ratio between the matching size and the total interval length is maintained.
For every , any hash functions where , given two sequences and , if there is a matching between and under having size , then for every , there is a matching between having size , where is an interval of , is an interval of , and . Here is a subsequence of .
We consider the following recursive procedure.
At the beginning (the first round), let . We know that