# Edit Errors with Block Transpositions: Deterministic Document Exchange Protocols and Almost Optimal Binary Codes

Document exchange and error correcting codes are two fundamental problems regarding communications. In both problems, an upper bound is placed on the number of errors between the two strings or that the channel can add, and a major goal is to minimize the size of the sketch or the redundant information. In this paper we focus on deterministic document exchange protocols and binary error correcting codes. In a recent work CJLW18, the authors constructed explicit deterministic document exchange protocols and binary error correcting codes for edit errors with almost optimal parameters. Unfortunately, the constructions in CJLW18 do not work for other common errors such as block transpositions. In this paper, we generalize the constructions in CJLW18 to handle a much larger class of errors. Specifically, we consider document exchange and error correcting codes where the total number of block insertions, block deletions, and block transpositions is at most $k \le \alpha n/\log n$ for some constant $0<\alpha<1$. In addition, the total number of bits inserted and deleted by the first two kinds of operations is at most $t \le \beta n$ for some constant $0<\beta<1$, where $n$ is the length of Alice's string or message. We construct explicit, deterministic document exchange protocols with sketch size $O(k \log^2 n + t)$ and explicit binary error correcting code with $O(k \log n \log\log\log n + t)$ redundant bits. As a comparison, the information-theoretic optimum for both problems is $\Theta(k \log n + t)$. As far as we know, previously there are no known explicit deterministic document exchange protocols in this case, and the best known binary code needs $\Omega(n)$ redundant bits even to correct just one block transposition.

## 1 Introduction

In communications and more generally distributed computing environments, there often arise questions regarding the synchronization of files or messages. For example, a message sent from one party to another party through a channel may get modified by channel noise or adversarial errors, and files stored on distributed servers may become out of sync due to different edit operations by different users. In many situations, these questions can be formalized in the framework of the following two fundamental problems.

• Document exchange. In this problem, two parties Alice and Bob each hold a string, and the two strings are within distance $k$ in some metric space. The goal is for Alice to send a short sketch to Bob, so that Bob can recover Alice's string based on his own string and the sketch.

• Error correcting codes. In this problem, two parties Alice and Bob are linked by a channel, which can change any string sent into another string within distance $k$ in some metric space. Alice’s goal is to send a message to Bob. She does this by sending an encoding of the message through the channel, which contains some redundant information, so that Bob can recover the correct message despite any changes to the codeword.

These two problems are closely related. For example, in many cases a solution to the document exchange problem can also be used to construct an error correcting code, but the reverse direction is not necessarily true. In both problems, a major goal is to minimize the size of the sketch or the redundant information. For applications in computer science, we also require the computations of both parties to be efficient, i.e., to run in time polynomial in the input length. In this case we say that the solutions to these problems are explicit. Here we focus on deterministic document exchange protocols and error correcting codes with a binary alphabet, arguably the most important setting in computer science.

Both problems have been studied extensively, but the known solutions and our knowledge vary significantly depending on the distance metric in these problems. In the case of Hamming distance (or Hamming errors), we have a near complete understanding and explicit constructions with asymptotically optimal parameters. However, for other distance metrics/error types, our understanding is still rather limited.

An important generalization of Hamming errors is edit errors, which consist of bit insertions and deletions. These are strictly more general than Hamming errors since a bit substitution can be replaced by a deletion followed by an insertion. Edit errors can happen in many practical situations, such as reading magnetic and optical media, mutations in gene sequences, and routing packets in Internet protocols. However, these errors are considerably harder to handle, due to the fact that a single edit error can change the positions of all the bits in a string.

Non-explicitly, by using a greedy graph coloring algorithm or a sphere packing argument, one can show that the optimal size of the sketch in document exchange, or the redundant information in error correcting codes is roughly the same for both Hamming errors and edit errors. Specifically, suppose that Alice’s string or message has length $n$ and the distance bound $k$ is relatively small (e.g., $k \le n/4$), then for both Hamming errors and edit errors, the optimal size in both problems is $\Theta(k \log \frac{n}{k})$ [12]. For Hamming errors, this can be achieved by using sophisticated linear Algebraic Geometric codes [9], but for edit errors the situation is quite different. We now describe some of the previous works regarding both document exchange and error correcting codes for edit errors.

##### Document exchange.

Orlitsky [13] first studied the document exchange problem for generally correlated strings . Using the greedy graph coloring algorithm mentioned before, he obtained a deterministic protocol with sketch size for edit errors, but the running time is exponential in . Subsequent improvements appeared in [8], [10], and [11], achieving sketch size [10] and [11] with running time . A recent work by Chakraborty et al. [5] further obtained sketch size and running time , by using a clever randomized embedding from the edit distance metric to the Hamming distance metric. Based on this work, Belazzougui and Zhang [3] gave an improved protocol with sketch size , which is asymptotically optimal for . The running time in [3] is .

Unfortunately, all of the above protocols, except the one in [13] which runs in exponential time, are randomized. Randomized protocols are less useful than deterministic ones in practice, and are also not suitable for the applications in constructing error correcting codes. However, designing an efficient deterministic protocol appears quite tricky, and it was not until 2015 when Belazzougui [2] gave the first deterministic protocol even for . The protocol in [2] has sketch size and running time .

##### Error correcting codes.

As fundamental objects in both theory and practice, error correcting codes have been studied extensively from the pioneering work of Shannon and Hamming. While great success has been achieved in constructing codes for Hamming errors, the progress on codes for edit errors has been quite slow despite much research. A work by Levenshtein [12] in 1966 showed that the Varshamov-Tenengolts code [14] corrects one deletion with an optimal redundancy of roughly bits, but even correcting two deletions requires redundant bits. In 1999, Schulman and Zuckerman [15] gave an explicit asymptotically good code, that can correct up to edit errors with redundant bits. However the same amount of redundancy is needed even for smaller number of errors.

For any fixed constant $k$, a recent work by Brakensiek et al. [4] constructed an explicit code that can correct $k$ edit errors with redundant bits. This is asymptotically optimal when $k$ is a fixed constant, but the construction in [4] only works for constant $k$, and breaks down for larger $k$ (e.g., ). Based on his deterministic document exchange protocol, Belazzougui [2] also gave an explicit code that can correct up to edit errors with redundant bits.

In a very recent work by the authors [6], we significantly improved the situation. Specifically, we constructed an explicit document exchange protocol with sketch size , which is optimal except for another factor. We also constructed explicit codes for edit errors with redundant bits, which is optimal for , any constant . These results bring our understanding of document exchange and error correcting codes for edit errors much closer to that of standard Hamming errors.

However, the constructions in [6] do not work for other common types of errors, such as block transpositions. Given any string, a block transposition takes an arbitrary substring, removes it, and inserts it at a different position in the string. These errors happen frequently in distributed file systems and Internet protocols. For example, it is quite common that a user, when editing a file, copies a whole paragraph in the file and moves it to somewhere else; and in Internet routing protocols, packets can often get rearranged during the process. Block transpositions also arise naturally in computational biology, where a subsequence of genes can be moved in one operation. In the setting of document exchange or error correcting codes, it is easy to see that even a single transposition of a long block can result in a number of edit errors proportional to the length of the block, thus a naive application of document exchange protocols or codes for edit errors will result in very bad parameters.
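The cost of a single block transposition in terms of plain edit errors can be checked on a toy example. The sketch below is only an illustration: the helper names `block_transpose` and `indel_distance` are hypothetical, letters are used instead of bits for readability, and the distance is computed with a textbook longest-common-subsequence dynamic program.

```python
def block_transpose(s: str, start: int, length: int, dest: int) -> str:
    """Cut s[start:start+length] out and re-insert it at index dest of the remainder."""
    block = s[start:start + length]
    rest = s[:start] + s[start + length:]
    return rest[:dest] + block + rest[dest:]


def indel_distance(a: str, b: str) -> int:
    """Minimum number of single-symbol insertions and deletions turning a into b."""
    # Longest common subsequence, computed row by row.
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if ca == cb else max(prev[j], cur[j - 1]))
        prev = cur
    return len(a) + len(b) - 2 * prev[-1]


x = "abcdefgh"
y = block_transpose(x, 0, 3, 5)   # one operation: move "abc" to the end
cost = indel_distance(x, y)       # edit errors needed to explain the same change
```

In this example one transposition of a block of length 3 corresponds to 6 single-symbol insertions and deletions, so a protocol that only budgets for edit errors pays for the whole block.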

In this paper we consider document exchange protocols and error correcting codes for edit errors as well as block transpositions. In fact, even for edit errors we also consider a larger, more general class of errors. Specifically, we consider edit errors that happen in bursts. This kind of error is also quite common, as most errors that happen in practice, such as in wireless or mobile communications and magnetic disk readings, tend to be concentrated. We model such errors as block insertions and deletions, where in one operation the adversary can insert or delete a whole block of bits. It is again easy to see that this is indeed a generalization of standard edit errors. However, in addition to the bound on the number of such operations, we also need to put a bound on the total number of bits that the adversary can insert or delete, since otherwise the adversary can simply delete the whole string in one block deletion. Therefore, we model the adversary as follows.

An adversary considered in this paper is allowed to perform two kinds of operations: block insertion/deletion and block transposition. The adversary is allowed to perform a total of at most $k$ block insertions/deletions and block transpositions, for a given parameter $k$. In addition, the total number of bits inserted/deleted by block insertions/deletions is at most $t$ for a given parameter $t$. We note that by the result of Schulman and Zuckerman [15], to correct $\Omega(n/\log n)$ block transpositions one needs at least $\Omega(n)$ redundant bits. Thus we only consider $k \le \alpha n/\log n$ for some constant $0<\alpha<1$. Similarly, we only consider $t \le \beta n$ for some constant $0<\beta<1$, since otherwise the adversary can simply delete the whole string.
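The adversary model can be made concrete with a small simulator. This is a minimal sketch under stated assumptions: the operation encoding (`"ins"`, `"del"`, `"move"` tuples) and the names `Budget` and `apply_adversary` are hypothetical, and only insertions and deletions are charged against the bit budget $t$, as in the model above.

```python
from dataclasses import dataclass


@dataclass
class Budget:
    k: int  # maximum number of block operations of all kinds
    t: int  # maximum total number of bits inserted or deleted


def apply_adversary(x: str, ops, budget: Budget) -> str:
    """Apply block operations to x, raising ValueError once a budget is exceeded.

    ops is a list of tuples (hypothetical encoding):
      ("ins", pos, bits), ("del", pos, length), ("move", start, length, dest)
    """
    used_ops = 0
    used_bits = 0
    for op in ops:
        used_ops += 1
        if op[0] == "ins":
            _, pos, bits = op
            x = x[:pos] + bits + x[pos:]
            used_bits += len(bits)
        elif op[0] == "del":
            _, pos, length = op
            used_bits += len(x[pos:pos + length])
            x = x[:pos] + x[pos + length:]
        elif op[0] == "move":
            # A transposition costs one operation but no inserted/deleted bits.
            _, start, length, dest = op
            block = x[start:start + length]
            rest = x[:start] + x[start + length:]
            x = rest[:dest] + block + rest[dest:]
        else:
            raise ValueError("unknown operation: %r" % (op,))
        if used_ops > budget.k or used_bits > budget.t:
            raise ValueError("adversary exceeded its budget")
    return x
```

For instance, deleting the first two bits of `"0011010111"` and then moving the first three remaining bits to the end fits within `Budget(k=2, t=2)`, while the same sequence is rejected under `Budget(k=1, t=2)`.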

Edit errors with block transpositions have been studied before in several different contexts. For example, Shapira and Storer [16] showed that finding the distance between two given strings under this metric is NP-hard, and they gave an efficient algorithm that achieves $O(\log n)$ approximation. Interestingly, a work by Cormode and Muthukrishnan [7] showed that this metric can be embedded into the $L_1$ metric with distortion $O(\log n \log^* n)$; and they used it to give a near linear time algorithm that achieves $O(\log n \log^* n)$ approximation for this distance, something currently unknown for the standard edit distance. Coming back to document exchange and error correcting codes, in our model, we show in the appendix that non-explicitly, the information optimum for both the sketch size of document exchange, and the redundancy of error correcting codes, is $\Theta(k \log n + t)$. However, as far as we know, there are no known explicit deterministic document exchange protocols; and the only randomized protocol which can handle edit errors as well as block transpositions is the protocol of [10], that has sketch size . Similarly, the only previous explicit code that can handle edit errors as well as block transpositions is the work of Schulman and Zuckerman [15], which needs $\Omega(n)$ redundant bits even to correct one block transposition. We note, however, that none of the previous works mentioned studied edit errors that can allow block insertions/deletions.

### 1.1 Our results

In this paper we construct explicit deterministic document exchange protocols, and error correcting codes for adversaries discussed above. We have the following theorems.

###### Theorem 1.1.

There exist constants $\alpha, \beta \in (0,1)$ such that for every $n, k, t \in \mathbb{N}$ with $k \le \alpha n/\log n$ and $t \le \beta n$, there exists an explicit, deterministic document exchange protocol with sketch size $O(k \log^2 n + t)$, for an adversary who can perform at most $k$ block insertions/deletions and block transpositions, where the total number of bits inserted/deleted is at most $t$.

This is the first explicit, deterministic document exchange protocol for edit errors with block transpositions. The sketch size matches the randomized protocol of [10], and is optimal up to an extra $\log n$ factor. Using this protocol, we can construct the following error correcting code.
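The size of the gap to the optimum can be read off by comparing the sketch size $O(k \log^2 n + t)$ stated in the abstract with the information-theoretic optimum $\Theta(k \log n + t)$. The short calculation below only restates these two bounds, assuming logarithms in base 2 and $n \ge 2$ so that $\log n \ge 1$:

$$
k \log^2 n + t \;\le\; \log n \cdot (k \log n + t)
\qquad\Longrightarrow\qquad
\frac{k \log^2 n + t}{k \log n + t} \;\le\; \log n .
$$

When $t$ dominates $k \log^2 n$, the ratio is a constant, so the extra factor is only felt in the regime where the number of operations, rather than the number of inserted or deleted bits, drives the cost.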

###### Theorem 1.2.

There exist constants such that for every with , there exists an explicit binary error correcting code with message length and codeword length , that can correct up to block insertions/deletions and block transpositions, where the total number of bits inserted/deleted is at most .

For small we can actually achieve the following result, which gives better parameters.

###### Theorem 1.3.

There exist constants such that for every with , there exists an explicit binary code with message length and codeword length , that can correct up to block insertions/deletions and block transpositions, where the total number of bits inserted/deleted is at most .

In the case of small $k$, these results significantly improve the result of Schulman and Zuckerman [15], which needs $\Omega(n)$ redundant bits even to correct one block transposition. The redundancy here is also optimal up to an extra $\log n$ factor or $\log\log\log n$ factor.

### 1.2 Overview of our techniques

In this section we provide an informal, high-level overview of our techniques. We start by describing our document exchange protocol.

##### Document exchange.

Our construction is based on the deterministic document exchange protocol of [6], which is in turn based on the randomized protocol of Irmak et al. [10]. We first briefly describe the construction of [6]. The protocol has levels where $k$ is the number of edit errors. Throughout the protocol, Bob always maintains a string , which is his current version of Alice’s string . In the -th level, both Alice and Bob parse their strings and evenly into blocks, i.e., in each subsequent level they divide a block in the previous level evenly into two blocks. The following invariant is maintained: at the beginning of each level, at most blocks are different between and , where is a universal constant.

This property is satisfied at the beginning of the protocol, and maintained for subsequent levels as follows: in each level Alice constructs an -self-matching hash function (see [6] for the exact definition) using the parsed . This function has a short description. Alice then hashes every block of and sends some redundancy of the hash values together with the description of the hash function to Bob. The redundancy here is computed by a systematic error correcting code that can correct Hamming errors, whose alphabet corresponds to the output of the hash function. Bob, after receiving the redundancy of the hash values and the short description, first uses the hash function to hash every block of . Since and differ in at most blocks, Bob can use the redundancy to correctly recover all the hash values. He then uses dynamic programming to find a maximum monotone matching between and under the hash function and the hash values, and uses the matched blocks of to fill the corresponding blocks of his string . The analysis shows that Bob can correctly recover all blocks of except at most of them. Thus in the next level is the same as except for at most blocks. At the end of the protocol when the size of each block has become small enough (i.e., ), Alice can just send a sketch for Hamming errors of the blocks to let Bob finally recover .

Our starting point is to try to generalize the above protocol. However, one immediate difficulty is to handle block transposition. The protocol of [6] actually performs badly for such errors. To see this consider the following example: the adversary simply moves the first bits of to the end. Since the protocol in [6] tries to find the maximum monotone matching in each level, Bob can only recover the last bits of since this gives the maximum monotone matching. In this case, one single error has cost roughly half of the string; while as a comparison, for standard edit errors, the protocol in [6] lets Bob recover all except blocks if there is only one edit error.
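The weakness of a monotone matching under a block transposition can be reproduced on a toy instance. The sketch below is a simplified stand-in, not the protocol of [6]: the function name `max_monotone_matching` is hypothetical, blocks are compared by content rather than by hash value, and letters replace bits for readability.

```python
from functools import lru_cache


def max_monotone_matching(x: str, y: str, p: int) -> int:
    """Largest number of length-p blocks of x that can be matched, in order,
    to pairwise disjoint length-p substrings of y."""
    blocks = [x[i:i + p] for i in range(0, len(x) - p + 1, p)]

    @lru_cache(maxsize=None)
    def best(b: int, j: int) -> int:
        # b: next block of x to consider, j: first still-usable position of y.
        if b == len(blocks) or j + p > len(y):
            return 0
        res = max(best(b + 1, j), best(b, j + 1))  # skip the block, or skip a position
        if y[j:j + p] == blocks[b]:
            res = max(res, 1 + best(b + 1, j + p))  # match block b at position j
        return res

    return best(0, 0)


x = "aabbccddee"               # five blocks of length 2
y = "ccddeeaabb"               # the first two blocks moved to the end
recovered = max_monotone_matching(x, y, 2)
```

Here a single transposition leaves only three of the five blocks recoverable by a monotone matching, although every block of the first string still appears intact in the second.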

To solve this issue, we need to make several important changes to the protocol in [6]. The first major change is that, in each level, instead of having Bob find the maximum monotone matching between and using the hash values, we let Bob find the maximum non-monotone matching. However, the notion of -self-matching hash function in [6] is not suitable for this purpose, since -self-matching hash functions actually allow a small number of collisions in the hash values of blocks of , and the use of these hash functions in [6] relies crucially on the property of a monotone matching. Instead, here we strengthen the hash function to ensure that there is no collision, by using a slightly larger output size. We call such hash functions good hash functions.

###### Definition 1.4 (Good hash functions).

Given and a string , we say a function is good (for ), if for every , if and only if .

This definition guarantees that if the hash function we used is good, then substrings of cannot have the same hash value.

We show that a good hash function can be constructed by using a -almost -wise independence generator with seed length . This works since for each pair of distinct substrings, their hash values are the same with probability . Since there are at most pairs, a union bound shows the existence of good hash functions. To get a deterministic hash function, we check each possible seed to see if the corresponding hash function is good, which can be done by checking if every pair of different substrings of have different hash values. Note that there are at most pairs and the seed length of the generator is , so this can be done in polynomial time.
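The seed search described above can be sketched as follows. This is an illustration under explicit assumptions: a simple polynomial hash (`poly_hash`, a hypothetical helper) stands in for the almost-wise independence generator, whose parameters are not reproduced here, and all names are chosen for this example only.

```python
from itertools import combinations


def substrings(x: str, p: int):
    """All length-p substrings of x, including overlapping ones."""
    return [x[i:i + p] for i in range(len(x) - p + 1)]


def poly_hash(seed: int, out_bits: int):
    """Stand-in hash family (an assumption, not the generator used in the paper)."""
    mod = 1 << out_bits
    mult = 2 * seed + 1          # the seed selects an odd multiplier
    def h(block: str) -> int:
        v = 0
        for c in block:
            v = (v * mult + int(c) + 1) % mod
        return v
    return h


def is_good(h, x: str, p: int) -> bool:
    """h is good for x if two length-p substrings collide exactly when they are equal."""
    subs = substrings(x, p)
    return all((h(u) == h(v)) == (u == v) for u, v in combinations(subs, 2))


def find_good_seed(x: str, p: int, out_bits: int, num_seeds: int):
    """Deterministically try every seed and return the first good one, if any."""
    for seed in range(num_seeds):
        if is_good(poly_hash(seed, out_bits), x, p):
            return seed
    return None
```

The exhaustive check is quadratic in the number of substrings for each seed, which mirrors the argument that the search stays polynomial as long as the seed is short.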

However, even a non-monotone matching is not enough for our purpose. The reason is that in the matching, we are trying to match every well divided block of to every possible block of (not necessarily the blocks obtained by dividing evenly into disjoint blocks), because we have edit errors here. If we just do this in the naive way, then the matched blocks of can be overlapping. Using these overlapping blocks of to fill the blocks of is problematic, since even a single edit error or block transposition can create many new (overlapping) blocks in (which can be as large as the length of the block in each level). All of these new blocks can potentially be matched, and then we won’t be able to maintain an upper bound of on the number of different blocks between and .

To solve this issue, we need to insist on computing a non-overlapping, non-monotone matching.

###### Definition 1.5 (Non-overlapping (non-monotone) matching).

Given , a function and two strings , a (non-overlapping) matching between and under is a sequence of matches (pairs of indices) s.t.

• for every ,

• for some , i.e., each is the starting index of some block of , when is divided evenly into disjoint blocks of length ,

• ,

• ,

• are distinct.

• Intervals , are disjoint.
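One possible reading of the conditions in Definition 1.5 can be written as a checker. This is a hedged sketch: the function name `is_nonoverlapping_matching` is hypothetical, indices are 0-based here, and the interpretation of each condition is an assumption wherever the formulas are not legible in the text above.

```python
def is_nonoverlapping_matching(x: str, y: str, p: int, h, matches) -> bool:
    """Check a list of (i, j) pairs, read as: block x[i:i+p] is matched to y[j:j+p].

    Conditions checked (0-based indices):
      * i is the start of a block when x is cut evenly into blocks of length p,
      * j is a valid start of a length-p substring of y,
      * the hash values of the two matched substrings agree,
      * the i's are distinct,
      * the intervals [j, j + p) are pairwise disjoint.
    """
    starts_x = [i for i, _ in matches]
    if len(set(starts_x)) != len(starts_x):
        return False
    for i, j in matches:
        if i < 0 or i % p != 0 or i + p > len(x):
            return False
        if j < 0 or j + p > len(y):
            return False
        if h(x[i:i + p]) != h(y[j:j + p]):
            return False
    starts_y = sorted(j for _, j in matches)
    return all(b - a >= p for a, b in zip(starts_y, starts_y[1:]))
```

Note that nothing in the check requires the pairs to be increasing in both coordinates, which is exactly what allows a transposed block to be matched.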

Under this definition, we can indeed show a similar upper bound on the number of different blocks between and in each level. However, a tricky question here is how to compute a maximum non-overlapping, non-monotone matching efficiently, since the standard algorithm to compute a maximum matching only gives a possibly overlapping matching, while the dynamic programming approach in [6] only works for a monotone matching.

We show how to compute a maximum non-overlapping, non-monotone matching in polynomial time, by using an integer program based on semi-definite programming (SDP). Specifically, we model the objective as solving a max-flow problem with constraints. In each level, we first evenly divide into a sequence of blocks with length , and for every block we create a node. Next we create a node for every length substring of (these substrings can overlap). We create a source which is connected to all ’s blocks with unlimited capacity, and a sink which is connected from every ’s blocks with unlimited capacity. For every potential match in the sense that , we create an edge from to with capacity .

We create two binary vectors , and . Intuitively, denotes the flow (either $0$ or $1$) on edge , and denotes the total flow coming into node . We introduce the constraints as follows.

By our definition of the non-overlapping matching, we require that each is matched to at most one interval . So for if there are edges and for some , we require . Similarly, each should be matched to at most one interval . Thus for , if there are edges and , then we require .

To force the non-overlapping condition of the intervals , where these intervals correspond to ’s substrings in the matching, we require for every pair s.t. .

We now have all the major constraints. Another natural constraint is to require that for any , . The last constraint comes directly from the flow problem, i.e. for every . Note that under our previous constraints, for any , can be at most (since the block is matched to at most one block in ). Also note that for any , we have . So this constraint can also be written as for any . Finally, we define the objective function to be the overall flow, i.e., .

This gives us an integer program in the form of an SDP. To solve it, we first relax to be real vectors, and solve the SDP by a standard ellipsoid method in polynomial time. We then put all edges with non-zero flow in the matching. This works because of the following observation: if we round any non-zero to $1$ (and modify the ’s correspondingly), then all constraints are still met, but the total flow will increase if initially . This means that an optimal solution to the SDP must be integral itself. Therefore, any edge with non-zero flow is in the maximum matching.
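For very small inputs, the object being optimized can be computed by exhaustive search, which is useful as a reference when experimenting. The sketch below is explicitly not the SDP-based method described above: it runs in exponential time, the name `max_nonoverlapping_matching` is hypothetical, and the default hash is the identity.

```python
def max_nonoverlapping_matching(x: str, y: str, p: int, h=lambda s: s):
    """Return a largest list of (i, j) pairs matching blocks x[i:i+p] to
    pairwise disjoint substrings y[j:j+p] with equal hash values."""
    blocks = [(i, h(x[i:i + p])) for i in range(0, len(x) - p + 1, p)]
    # Candidate positions in y for each block of x (substrings of y may overlap).
    cand = {i: [j for j in range(len(y) - p + 1) if h(y[j:j + p]) == hv]
            for i, hv in blocks}
    best = []

    def extend(idx, chosen):
        nonlocal best
        # Prune branches that cannot beat the best matching found so far.
        if len(chosen) + (len(blocks) - idx) <= len(best):
            return
        if idx == len(blocks):
            best = list(chosen)
            return
        i, _ = blocks[idx]
        for j in cand[i]:
            # Keep the chosen intervals [j, j + p) pairwise disjoint.
            if all(abs(j - j2) >= p for _, j2 in chosen):
                chosen.append((i, j))
                extend(idx + 1, chosen)
                chosen.pop()
        extend(idx + 1, chosen)  # leave block i unmatched

    extend(0, [])
    return best
```

On the strings `"aabbccddee"` and `"ccddeeaabb"` with block length 2, this search matches all five blocks, since the order of the matched positions is unconstrained.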

It can be shown without too much difficulty that this protocol can also handle block insertions/deletions. The reason is that such errors are concentrated, and won’t cause too many errors in the maximum non-overlapping matching.

The sketch size of our protocol is . This is because the number of block differences in the -th level is , where is the block length in the -th level. This again comes from the fact that the insertions/deletions are concentrated in at most blocks, and each block transposition affects at most blocks. Thus the redundancy of the hash values in the -th level is . The total number of levels is . Note that in each level, decreases by a factor of , and in the last level the block length is , thus the summation of the redundancy in all levels gives . The total length of other transmitted strings that contribute to the overall redundancy, is smaller than this term.

##### Error Correcting Code.

We now describe how to construct an error correcting code from a document exchange protocol for block insertions/deletions and block transpositions. Similar to the construction in [6], our starting point is to first encode the sketch of the document exchange protocol using the asymptotically good code by Schulman and Zuckerman [15], which can resist edit errors and block transpositions. Then we concatenate the message with the encoding of the sketch. When decoding, we first decode the sketch, then apply the document exchange protocol on Bob’s side to recover the message using the sketch.

However, here we have an additional issue with this approach: a block transposition may move some part of the encoding of the sketch to somewhere in the middle of the message, or vice versa. In this case, we won’t be able to tell which part of the received string is the encoding of the sketch, and which part of the received string is the original message.

To solve this issue, we use a fixed string as a buffer to mark the encoding of the sketch, for some . More specifically, we evenly divide the encoding of the sketch into small blocks of length , and insert before every block. Note that this only increases the length of the encoding of the sketch by a constant factor. The reason we use such a small block length is that, even if the adversary can forge or destroy some buffers, the total number of bits inserted or deleted caused by this is still small. In fact, we can bound this by block insertions/deletions with at most bits inserted/deleted, which both the sketch and the encoding of the sketch can handle. When decoding, we first recognize all the ’s. Then we take the bits after each to form the encoding of the sketch, and take the remaining bits as the message.

Unfortunately, this approach introduces two additional problems here. The first problem is that the original message may contain as a substring. If this happens then again we will be taking part of the message to be in the encoding of the sketch. The second problem is that the small blocks of the encoding of the sketch may also contain . In this case we will be deleting information from the encoding of the sketch, which causes too many edit errors.

To address the first problem, we turn the original message into a pseudorandom string by computing the XOR of the message with the output of a pseudorandom generator. Using a -almost -wise independence generator with seed length , we can ensure that with high probability does not appear as a substring in the XOR. We can then exhaustively search for a fixed seed that satisfies this requirement, and append the seed to the sketch of the document exchange protocol.
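The exhaustive seed search for the masking step can be sketched as follows. This is an illustration under stated assumptions: Python's `random.Random` stands in for the almost-wise independence generator (it does not have the same guarantees), and the helper names `prg_bits`, `xor_bits`, and `find_masking_seed` are hypothetical.

```python
import random


def prg_bits(seed: int, n: int) -> str:
    """Stand-in generator (an assumption): n pseudorandom bits determined by seed."""
    rng = random.Random(seed)
    return "".join(str(rng.getrandbits(1)) for _ in range(n))


def xor_bits(a: str, b: str) -> str:
    """Bitwise XOR of two equal-length bit strings."""
    return "".join("1" if ca != cb else "0" for ca, cb in zip(a, b))


def find_masking_seed(message: str, buf: str, max_seed: int):
    """Try seeds in order; return (seed, masked message) once buf is not a substring."""
    for seed in range(max_seed):
        masked = xor_bits(message, prg_bits(seed, len(message)))
        if buf not in masked:
            return seed, masked
    return None
```

The receiver undoes the mask by XORing with the same generator output, which is why the chosen seed has to travel with the sketch.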

To address the second problem, we choose the length of the buffer to be longer than the length of each block in the encoding of the sketch, so that doesn’t appear as a substring in any block. This is exactly why we choose the length of the buffer to be while we choose the length of each block to be .
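The buffer marking and the length choice can be sketched for the error-free case. The code below is only an illustration: the concrete buffer string (all zeros here), the block length, and the names `add_buffers` and `strip_buffers` are assumptions, and the decoder does not attempt to handle forged or destroyed buffers.

```python
def add_buffers(encoded_sketch: str, block_len: int, buf: str) -> str:
    """Insert buf before every length-block_len block of the encoded sketch."""
    # A buffer longer than a block can never occur inside a single block.
    assert len(buf) > block_len
    blocks = [encoded_sketch[i:i + block_len]
              for i in range(0, len(encoded_sketch), block_len)]
    return "".join(buf + b for b in blocks)


def strip_buffers(marked: str, block_len: int, buf: str) -> str:
    """Error-free decoder: expect buf, then one block, repeatedly."""
    out = []
    pos = 0
    while pos < len(marked):
        if not marked.startswith(buf, pos):
            raise ValueError("buffer expected at position %d" % pos)
        pos += len(buf)
        out.append(marked[pos:pos + block_len])
        pos += block_len
    return "".join(out)


enc = "1011001110001111"
marked = add_buffers(enc, 4, "00000")
```

With blocks of length 4 and a buffer of length 5, the marked string is 36 bits long for a 16-bit encoding, so the overhead is the constant factor `(len(buf) + block_len) / block_len`.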

If we directly apply our document exchange protocol to the construction above, we would obtain an error correcting code with redundant bits. However, by combining the ideas in [6], we can achieve redundancy size , which is better for small and .

We first briefly describe the construction of the explicit binary insdel code for edit errors with redundancy in [6]. The construction in [6] starts by observing that a uniform random string satisfies some nice properties. For example, with high probability, any two substrings of length some are distinct. [6] calls this property -distinct. The construction in [6] goes by first transforming the message into a pseudorandom string, which is obtained by computing the XOR of the message with an appropriately designed pseudorandom generator. The construction then designs a document exchange protocol for a pseudorandom string with better parameters, and encodes the sketch of the document exchange protocol to give an error correcting code.

The document exchange protocol for a pseudorandom string in [6] actually consists of two stages: in stage I, Alice uses a fixed pattern of length to divide her string into blocks of size . [6] defines a -split point to be the index that appears as a substring, and defines its next -split point to be the -split point right after it. When dividing the string, Alice needs to ensure that the size of each block is larger than , so that the -distinct property holds for the blocks. Hence she only chooses a -split point if its next -split point is at least indices away. One can set the parameter so that . Furthermore, [6] shows that for a pseudorandom string, with high probability, the size of each block is not too large, i.e., at most .
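The pattern-based partition of stage I can be sketched as follows. This is a minimal sketch under assumptions: the pattern, the threshold `min_gap`, and the function names are hypothetical, the end of the string is treated as a final split point, and the first block (before the first chosen split point) is allowed to be short.

```python
def find_split_points(s: str, pattern: str):
    """All indices at which the pattern occurs as a substring of s."""
    return [i for i in range(len(s) - len(pattern) + 1) if s.startswith(pattern, i)]


def choose_split_points(s: str, pattern: str, min_gap: int):
    """Keep a split point only if its next split point is at least min_gap away."""
    points = find_split_points(s, pattern)
    nxt = points[1:] + [len(s)]   # assumption: the end of s acts as the last split point
    return [q for q, r in zip(points, nxt) if r - q >= min_gap]


def partition(s: str, pattern: str, min_gap: int):
    """Cut s at the chosen split points."""
    chosen = [c for c in choose_split_points(s, pattern, min_gap) if c != 0]
    cuts = [0] + chosen + [len(s)]
    return [s[a:b] for a, b in zip(cuts, cuts[1:])]
```

Because the cut positions depend only on local content, a party holding a slightly modified copy of the string recovers most of the same block boundaries, which is the property stage I relies on.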

Next, Alice sends a sketch of size to help Bob recover the partition of her string. To achieve this, Bob also divides his string into blocks in the same way that Alice does, by using the pattern . However, due to edit errors, there are different blocks between Alice’s partition and Bob’s partition. So Alice creates a vector of length , where each entry of is indexed by a binary string of length . Alice looks at each block in her partition, and stores the -prefix of its next block and the length of the current block in the entry of indexed by the -prefix of the current block. This ensures each entry of the vector has only bits. Similarly Bob will create a vector in the same way. [6] shows that and differ in at most entries, thus Alice can send a sketch of size using the Reed-Solomon Code to help Bob recover from . Once this is done, Bob can use to obtain a guess of Alice’s string, which we call , by using his blocks to fill the blocks of , if they have the same -prefix.

Stage II of the construction in [6] consists of a constant number of levels. In each level, both parties divide each of their blocks evenly into smaller blocks, and Alice generates a sequence of special hash functions called -synchronization hash functions (see [6] for the exact definition) with respect to her string. The nice properties of the -synchronization hash functions guarantee that in each level Alice can send bits to Bob, so that Bob can recover all but blocks of Alice’s string. This stage ends in levels when the final block size reduces to , at which point Alice can simply send a sketch of size for Bob to recover her string .

Checking the two stages in [6], it turns out that stage I can actually be modified to work for block insertions/deletions and block transpositions as well. Intuitively, this is because both kinds of errors won’t cause too many different blocks in and . On the other hand, stage II becomes quite tricky, since the use of -synchronization hash functions crucially relies on the monotone property of edit errors. Allowing block transpositions will ruin this property, and it is not clear how to apply -synchronization hash functions.

To solve the issue, in stage II, we apply the deterministic document exchange protocol we developed earlier, based on the small blocks of size produced by stage I. Thus our stage II has levels. Note that after stage I, the block size can be as small as . Hence, for block insertions/deletions of bits, the number of blocks that are affected by these errors can only be bounded by . Thus we add an extra level before applying the deterministic document exchange protocol, to ensure that after this extra level, the number of bad blocks is at most .

We show that in stage I, Alice sends a sketch with bits; and in stage II, Alice sends a sketch with bits. So the total sketch size is still . By using the asymptotically good encoding of Schulman and Zuckerman [15] and the buffer , the final redundancy of the error correcting code is also .

### 1.3 Discussions and open problems.

In this paper we study document exchange protocols and error correcting codes for block insertions/deletions and block transpositions. We give the first explicit, deterministic document exchange protocol in this case, and significantly improved error correcting codes. In particular, for both document exchange and error correcting codes, our sketch size or redundant information is close to optimal.

The obvious open problem is to try to achieve truly optimal constructions, where an interesting intermediate step is to try to adapt the -self matching hash functions and the -synchronization hash functions in [6] to handle block transpositions. More broadly, it would be interesting to study document exchange protocols and error correcting codes for other more general errors.

##### Organization of the paper.

The rest of the paper is organized as follows. In Section 3 we give a deterministic document exchange protocol for block insertions/deletions and block transpositions. In Section 4 we give a document exchange protocol for block insertions/deletions and block transpositions for uniformly random strings. Then in Section 5 we give constructions of codes correcting block insertions/deletions and block transpositions. Finally, we give tight bounds on the sketch size and redundancy in Appendix A.

## 2 Preliminaries

### 2.1 Notations

Let be an alphabet (which can also be a set of strings) and be a string. denotes the length of the string. Let denote the substring of from the -th symbol to the -th symbol (both ends included). Similarly denotes the substring of from the -th symbol to the -th symbol (not included). We use to denote the -th symbol of . The concatenation of and is . The -prefix of is the first symbols of . is the concatenation of copies of . For two sets and , let denote the symmetric difference of and .

Usually we use to denote the uniform distribution over .

### 2.2 Edit errors

Consider two strings .

###### Definition 2.1 (Edit distance).

The edit distance is the minimum number of edit operations (insertions and deletions) needed to transform into .

A subsequence of a vector is a vector s.t. , , , , , .

###### Definition 2.2 (Longest Common Subsequence).

The longest common subsequence between and is the longest sequence that is a subsequence of both and ; its length is denoted by .

We have .
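As an illustration of how these two quantities relate, the following minimal Python sketch computes the LCS length by the textbook dynamic program and derives the insertion/deletion distance from the standard identity that this distance equals the sum of the two lengths minus twice the LCS length. The function names `lcs_length` and `indel_distance` are hypothetical and are not taken from the paper.

```python
def lcs_length(x: str, y: str) -> int:
    """Length of a longest common subsequence via the O(|x||y|) DP."""
    prev = [0] * (len(y) + 1)  # DP row for the previous symbol of x
    for a in x:
        cur = [0]
        for j, b in enumerate(y, start=1):
            if a == b:
                cur.append(prev[j - 1] + 1)  # extend a common subsequence
            else:
                cur.append(max(prev[j], cur[j - 1]))  # drop a symbol of x or y
        prev = cur
    return prev[-1]


def indel_distance(x: str, y: str) -> int:
    """Edit distance when only insertions and deletions are allowed."""
    return len(x) + len(y) - 2 * lcs_length(x, y)


if __name__ == "__main__":
    print(lcs_length("0110", "1010"), indel_distance("0110", "1010"))
```

For example, `"0110"` becomes `"1010"` by deleting the leading `0` and inserting a `0` after the first `1`, which matches a distance of 2.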

###### Definition 2.3 (Block-Transposition).

Given a string , the -block-transposition operation for and is defined as an operation which removes and inserts it right after in the original string (if , then it is inserted at the beginning of ).
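To make the operation concrete, here is a small Python sketch of a block transposition. The parameter names `start`, `length` and `dest`, the 0-based indexing, and the convention that `dest = -1` means "insert at the beginning" are assumptions made for this illustration; they are not the paper's notation.

```python
def block_transposition(x: str, start: int, length: int, dest: int) -> str:
    """Move the block x[start:start+length] so that it sits right after the
    symbol that was at position `dest` in the ORIGINAL string.

    Assumed conventions (hypothetical): 0-based indices, and dest == -1
    means the block is moved to the beginning of the string.
    """
    if length < 1 or start < 0 or start + length > len(x):
        raise ValueError("block must lie inside x")
    if not -1 <= dest < len(x) or start <= dest < start + length:
        raise ValueError("dest must lie outside the moved block")
    block = x[start:start + length]
    rest = x[:start] + x[start + length:]
    # Position in `rest` immediately after the symbol that was x[dest].
    cut = dest + 1 if dest < start else dest + 1 - length
    return rest[:cut] + block + rest[cut:]


if __name__ == "__main__":
    print(block_transposition("abcdefgh", 2, 3, 6))
```

The string length never changes, which is why a single block transposition can cost many ordinary edit operations while moving no information in or out of the string.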

###### Definition 2.4 (Block-insertion/deletion).

A block-insertion/deletion (or burst-insertion/deletion) of symbols to a string is defined to be inserting/deleting a block of consecutive symbols to . When we do not need to specify the number of symbols inserted or deleted, we simply say a block-insertion/deletion.

We define -block-insertions/deletions (to ) to be a sequence of block-insertions/deletions, where the total number of symbols inserted/deleted is at most .

### 2.3 Almost k-wise independence

###### Definition 2.5 (ε-almost κ-wise independence in max norm [1]).

A series of random variables are -almost -wise independent in Maximum norm if , ,

A function is an -almost -wise independence generator in Maximum norm if are -almost -wise independent in Maximum norm.

For simplicity, we neglect the term in Maximum norm when saying -almost -wise independence, unless specified.

###### Theorem 2.6 (ε-almost κ-wise independence generator [1]).

For every , , there exists an explicit -almost -wise independence generator , where .

The construction is highly explicit in the sense that, , the -th output bit can be computed in time given the seed and .

### 2.4 Pseudorandom Generator (PRG)

###### Definition 2.7 (PRG).

A function is a pseudorandom generator (PRG) for a function with error if

$$\left|\Pr[f(U_n)=1]-\Pr[f(g(U_r))=1]\right|\le\varepsilon$$

where is the seed length of .

In this case we also say that -fools the function . Similarly, if fools every function in a class , then we say -fools .
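For very small parameters, the error in Definition 2.7 can be evaluated exactly by enumerating all inputs and all seeds. The Python sketch below does this; the helper `fooling_error`, the toy generator `parity_extend` (which stretches 2 bits to 3 by appending their parity) and the two test functions are hypothetical examples chosen for illustration, not objects from the paper.

```python
from fractions import Fraction
from itertools import product


def fooling_error(f, g, n: int, r: int) -> Fraction:
    """Exact value of |Pr[f(U_n)=1] - Pr[f(g(U_r))=1]| by brute force."""
    p_uniform = Fraction(sum(f(x) for x in product((0, 1), repeat=n)), 2 ** n)
    p_pseudo = Fraction(sum(f(g(s)) for s in product((0, 1), repeat=r)), 2 ** r)
    return abs(p_uniform - p_pseudo)


def parity_extend(seed):
    """Toy generator (assumption): append the parity of the seed bits."""
    return seed + (sum(seed) % 2,)


def first_bit(x):
    return x[0]


def even_parity(x):
    return int(sum(x) % 2 == 0)


if __name__ == "__main__":
    print(fooling_error(first_bit, parity_extend, 3, 2))
    print(fooling_error(even_parity, parity_extend, 3, 2))
```

The toy generator fools `first_bit` with error 0, but `even_parity` distinguishes it with error 1/2, since every output of `parity_extend` has even parity.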

### 2.5 Error correcting codes (ECC)

An ECC is called an -code (for Hamming errors) if it has block length , message length , and distance . The rate of the code is defined as .

We utilize the following algebraic geometry codes in our constructions.

###### Theorem 2.8 ([9]).

For every , there is an explicit -ECC over with polynomial-time unique decoding.

Moreover, , for every message , the codeword is with redundancy .

For an ECC for edit errors, with message length , we usually regard it as having an encoding mapping and a decoding mapping .

We say an ECC for edit errors is explicit (or has an explicit construction) if both encoding and decoding can be computed in polynomial time.

To construct ECCs in following sections, we use an asymptotically good binary ECC for edit errors by Schulman and Zuckerman [15].

###### Theorem 2.9 ([15]).

For every , there is an explicit ECC with codeword length , message length , which can correct up to edit errors and block-transpositions.

## 3 Deterministic document exchange protocol for block insertions/deletions and block transpositions

Given and a string , we say a hash function is good (for ), if for every , if and only if .

Given , a function and two strings , a (non-overlapping) matching between and under is a sequence of matches (pairs of indices) s.t.

• for every ,

• for some ,

• ,

• ,

• are distinct.

• Intervals , are disjoint.

In the rest of the paper, when we talk about matchings we always mean non-overlapping matchings.

A match in the matching is called a wrong match (or wrong pair) if . Otherwise it is called a correct match (or correct pair). A pair of indices is called a potential match between and if . When are clear from the context we simply say is a potential match.
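The set of potential matches can be listed directly from the definition. The Python sketch below assumes, as in the proof of Lemma 3.2, that one string is parsed into consecutive length-`p` blocks while every length-`p` window of the other string is a candidate; it also assumes that only the function values of the blocks are available on the receiving side. The names `potential_matches`, `x_hashes` and `h`, and the 0-based indexing, are hypothetical.

```python
def potential_matches(x_hashes, y: str, p: int, h):
    """Return the set of pairs (i, j) such that the value stored for the
    i-th length-p block of x equals h applied to the window y[j:j+p].

    x_hashes[i] is assumed to hold h(i-th block of x); indices are 0-based.
    """
    windows = {}
    for j in range(len(y) - p + 1):
        # Group window start positions by their function value.
        windows.setdefault(h(y[j:j + p]), []).append(j)
    return {(i, j) for i, hv in enumerate(x_hashes) for j in windows.get(hv, [])}


if __name__ == "__main__":
    identity = lambda s: s  # trivially collision-free stand-in for the function
    x_blocks = ["ab", "cd", "ef"]
    print(sorted(potential_matches([identity(b) for b in x_blocks], "cdxabab", 2, identity)))
```

In the example, the block `ab` has two candidate windows and `cd` has one; a matching must then pick at most one window per block while keeping the chosen windows disjoint.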

To find a monotone matching we can use the dynamic programming method in [6]. But our matching is not necessarily monotone, which raises the question of whether there exists a polynomial time algorithm that finds a maximum matching. Next we give a positive answer to this question by using semi-definite programming (SDP).

###### Construction 3.1.

Given , a polynomial time computable function and two strings , we have the following algorithm.

• Solve the following SDP.

Let be two real vectors. Let denote the set of potential matches between and .

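$$
\begin{aligned}
\max_{u,v}\quad & \sum_{j=1}^{n-p+1} v_j^2 \\
\text{s.t.}\quad & u_{i,j}\,u_{i,j'} = 0 && \text{for every } i \text{ and every pair of distinct matches } (i,j),(i,j')\in E,\\
& u_{i,j}\,u_{i',j} = 0 && \text{for every } j \text{ and every pair of distinct matches } (i,j),(i',j)\in E,\\
& v_j\,v_{j'} = 0 && \text{for every pair of overlapping intervals } y[j,j+p),\ y[j',j'+p),\\
& v_j^2 = \sum_{(i,j)\in E} u_{i,j}^2 && \text{for every } j,\\
& u_{i,j}^2 \le 1 && \text{for every } (i,j)\in E.
\end{aligned}
\tag{1}
$$
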
• For every pair , put into the matching .

• return .
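To see what the optimisation in Construction 3.1 is aiming for, the following Python sketch finds a maximum non-overlapping matching by exhaustive search over a given set of potential matches. This is deliberately not the SDP above: it takes exponential time and is only meant for tiny instances. The function name, the 0-based indexing and the input format (a set of pairs `(i, j)`) are assumptions for this illustration.

```python
def max_nonoverlapping_matching(num_blocks: int, p: int, potential):
    """Exhaustive search for a largest matching.

    potential: iterable of pairs (i, j), meaning block i of x may be matched
    to the window y[j:j+p]. Each block is used at most once and the chosen
    windows must be pairwise disjoint. The result need not be monotone.
    """
    by_block = {}
    for i, j in potential:
        by_block.setdefault(i, []).append(j)
    best = []

    def search(i, chosen):
        nonlocal best
        if i == num_blocks:
            if len(chosen) > len(best):
                best = list(chosen)
            return
        search(i + 1, chosen)  # option 1: leave block i unmatched
        for j in by_block.get(i, []):
            # Windows [j, j+p) and [k, k+p) intersect exactly when |j-k| < p.
            if all(abs(j - k) >= p for _, k in chosen):
                chosen.append((i, j))
                search(i + 1, chosen)
                chosen.pop()

    search(0, [])
    return best


if __name__ == "__main__":
    print(max_nonoverlapping_matching(2, 2, {(0, 4), (1, 0)}))
```

The example returns a matching in which block 0 maps to a later window than block 1, the kind of non-monotone pairing that a block transposition produces and that a monotone dynamic program would miss.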

###### Lemma 3.2.

Given , a polynomial time computable function and two strings , the maximum matching between and under can be computed in polynomial time.

###### Proof.

We model the algorithm as a max-flow with restrictions.

We evenly parse to be a sequence of length blocks. Every block corresponds to a node. Then we let every length interval of be a node (these intervals can overlap). Also, we create a tap which is connected to all ’s blocks with unlimited capacity, and a sink which is connected from every ’s blocks with unlimited capacity. For every potential match where , we create an edge from to with capacity .

For every in the model, let denote the flow from to . For every let denote the flow from the to the sink.

There are two issues to notice here.

First, by definition of the matching, we require that each is matched to one interval only. So for if there are edges and for some , we let . Also, for if there are edges and , then .

Second, we require that intervals should be non-overlapping, where these intervals are ’s intervals in the matching. So we let for every pair s.t. .

In summary, we have the following SDP.

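$$
\begin{aligned}
\max_{u,v}\quad & \sum_{j=1}^{n-p+1} v_j^2 \\
\text{s.t.}\quad & u_{i,j}\,u_{i,j'} = 0 && \text{for every } i \text{ and every pair of distinct edges } (i,j),(i,j'),\\
& u_{i,j}\,u_{i',j} = 0 && \text{for every } j \text{ and every pair of distinct edges } (i,j),(i',j),\\
& v_j\,v_{j'} = 0 && \text{for every pair of overlapping intervals } y[j,j+p),\ y[j',j'+p),\\
& v_j^2 = \sum_{(i,j)} u_{i,j}^2 && \text{for every } j,\\
& u_{i,j}^2 \le 1 && \text{for every edge } (i,j).
\end{aligned}
\tag{2}
$$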

This SDP is exactly the same as that in Construction 3.1. A standard ellipsoid method can solve it in polynomial time. (More specifically, it suffices to achieve an output within of the optimal SDP value.)

Note that the optimal solution is integral: if some is not an integer, then we can round it to . After rounding, we get another valid solution (all restrictions are still met), but it gives a larger flow, since it increases . This shows that the returned optimal solution must be integral.

For every , we put into the matching. This would give us a maximum matching. We show this by establishing the correspondence between matchings and valid integral solutions of our SDP.

The reason that every valid integral solution corresponds to a matching is that each integral flow in our flow model corresponds to a matching between and under , since we put into the matching if . By definition of the edge, the first bullet of our matching definition is satisfied. The second bullet is satisfied because of the first restriction in our SDP. The third bullet is satisfied because of the second restriction in our SDP.

On the other hand, every matching corresponds to an integral flow in our flow model. For each pair in the matching, we can set the corresponding . All other , where is a potential match, are set to . Thus the fifth restriction in our SDP is satisfied. Also, since it is a match, is definitely a potential match. Since is a matching of the bipartite graph, it meets the first and second restrictions in our SDP. Moreover, the matching definition requires that the intervals of in the matching are non-overlapping. Hence the third and fourth restrictions in the SDP are satisfied.

As a result, the optimal solution of our SDP gives a maximum matching.

###### Theorem 3.3.

There exists an algorithm which, on input for large enough constant , , outputs a description of a hash function that is good for , in time , where the description length is .

Also there is an algorithm which, given the same , the description of and any , can output in time .

###### Proof.

Let be small enough. Let be an -almost -wise independence generator with , from Theorem 2.6. Here outputs bits and we view the output as an array indexed by elements in , where each array entry is in .

To construct , we try every seed . Let (meaning for every is the -th entry of the array ). For any , we check that if and only if . If this is the case then the algorithm returns . The description of is the corresponding seed .

There exists a s.t. the corresponding is good. This is because, if we let be uniformly random, then by a union bound, the probability that there exists s.t. but is at most .

The exhaustive search is in polynomial time because the seed length is . The evaluation of is in polynomial time by Theorem 2.6. Thus the overall running time of our algorithm is a polynomial in . ∎
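The exhaustive seed search in this proof can be sketched in Python as follows. As a loudly stated assumption, the sketch replaces the almost -wise independence generator of Theorem 2.6 with a keyed BLAKE2b hash from the standard library, truncated to `out_bits` bits; this stand-in carries none of the theorem's guarantees and only illustrates the "try every seed, keep the first one that is good" structure. The names `keyed_hash` and `find_good_seed` are hypothetical.

```python
import hashlib


def keyed_hash(seed: int, block: str, out_bits: int) -> int:
    """Stand-in hash family (assumption): keyed BLAKE2b, truncated."""
    digest = hashlib.blake2b(block.encode(), digest_size=8,
                             key=seed.to_bytes(4, "big")).digest()
    return int.from_bytes(digest, "big") & ((1 << out_bits) - 1)


def find_good_seed(x: str, p: int, seed_bits: int, out_bits: int):
    """Return the first seed whose hash is injective on the distinct
    length-p windows of x (i.e. equal values imply equal windows),
    or None if no seed in the searched range works."""
    windows = {x[i:i + p] for i in range(len(x) - p + 1)}
    for seed in range(1 << seed_bits):
        values = {keyed_hash(seed, w, out_bits) for w in windows}
        if len(values) == len(windows):  # no two distinct windows collide
            return seed
    return None


if __name__ == "__main__":
    print(find_good_seed("0110100110010110", 4, seed_bits=8, out_bits=16))
```

The search space has size `2**seed_bits`, so the loop is polynomial exactly when the seed length is logarithmic in the input length, which is the point of the running-time argument above.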

We now have the following document exchange protocol.

###### Construction 3.4.

The protocol works for every input length , every block-insertions/deletions block-transpositions, , for some constant . (If or , or , we simply let Alice send her input string.) Let .

Both Alice’s and Bob’s algorithms have levels.

Alice: On input ;

1. We set up the following parameters;

   • For every , in the -th level,

   • The block size is , i.e., in each level we cut every block in the previous level evenly into two blocks. We choose properly s.t. ;

   • The number of blocks ;

2. For the -th level,

   1. Construct a hash function for by Theorem 3.3.

   2. Compute the sequence ;

   3. Compute the redundancy for by Theorem 2.8, where the code has distance ;

3. Compute the redundancy for the blocks of the -th level by Theorem 2.8, where the code has distance ;

4. Send , , , .

Bob: On input and received , , ;

1. Create (i.e. his current version of Alice’s ), initializing it to be ;

2. For the -th level where ,

   1. Apply the decoding of Theorem 2.8 on to get the sequence of hash values . Note that is received directly, thus Bob does not need to compute it;

   2. Compute which is the maximum matching (may not be monotone) between and under , using , by Lemma 3.2;

   3. Evaluate according to the matching, i.e. let ;

3. In the -th level, apply the decoding of Theorem 2.8 on the blocks of and to get ;

4. Return .
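Bob's "evaluate according to the matching" step can be pictured with the following Python sketch, which overwrites each matched block of Bob's current guess with the corresponding window of his own string. The reading that a match `(i, j)` copies the window `y[j:j+p]` into block `i` of the guess, together with the function name and the 0-based indexing, is an assumption made for this illustration.

```python
def apply_matching(guess: str, y: str, p: int, matching) -> str:
    """Fill block i of `guess` with y[j:j+p] for every match (i, j).

    Blocks that are not matched keep whatever `guess` already holds.
    """
    blocks = [guess[k:k + p] for k in range(0, len(guess), p)]
    for i, j in matching:
        blocks[i] = y[j:j + p]
    return "".join(blocks)


if __name__ == "__main__":
    # Two blocks of Alice's string appear transposed inside Bob's string.
    print(apply_matching("????????", "cdabxx", 2, [(0, 2), (1, 0)]))
```

Unmatched blocks stay unresolved after this step; in the protocol those are the blocks that the error correcting code of Theorem 2.8 is meant to repair at the next level.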

For every , .

###### Proof.

We show that there exists a matching s.t.