Directory Reconciliation

07/25/2018
by Michael Mitzenmacher, et al.
Harvard University

We initiate the theoretical study of directory reconciliation, a generalization of document exchange, in which Alice and Bob each have different versions of a set of documents that they wish to synchronize. This problem is designed to capture the setting of synchronizing different versions of file directories, while allowing for changes of file names and locations without significant expense. We present protocols for efficiently solving directory reconciliation based on a reduction to document exchange under edit distance with block moves, as well as protocols combining techniques for reconciling sets of sets with document exchange protocols. Along the way, we develop a new protocol for document exchange under edit distance with block moves inspired by noisy binary search in graphs, which uses only O(k n) bits of communication at the expense of O(k n) rounds of communication.


1 Introduction

Document exchange is a well studied two party communication problem, in which Alice and Bob have documents (strings) x and y respectively, and they wish to communicate efficiently so that Bob can recover x. They are given a small bound k on the edit distance (or edit distance with block moves) between x and y, and wish to use communication proportional to k, rather than to the lengths of their strings. This has immediate applications to version control software, in which a server and client wish to synchronize different versions of the same files. In such a setting, document exchange would match each file with the corresponding file with the same name on the other side; this can be done in parallel. However, if files are allowed to change names or locations in the file structure, this approach could introduce significant inefficiencies, as a large file whose name is changed may have to be transmitted in its entirety between the parties.

We introduce the problem of directory reconciliation to address this issue. In this problem Alice and Bob each have a directory, which we define to be a set of documents. We have a small bound d on the number of edits (character insertions, deletions, and substitutions) required to transform Alice’s directory into Bob’s. (Note that we use k as our edit bound for document exchange and d as our edit bound for directory reconciliation; this is for historical consistency, and it helps to keep clear which problem we are solving.) The goal of directory reconciliation is for Bob to recover Alice’s directory using as little communication as possible. In our version control application, a file’s name and location could be encoded as a prefix for the document, and now a one character change to a file’s name corresponds only to a single edit, rather than the deletion of a whole file and the creation of a new one. Directory reconciliation is also applicable to situations where we wish to synchronize file collections that are closely related but do not have file names linking them.

We present two approaches to solving directory reconciliation. The first approach is designed to minimize total communication; we accomplish this by using more rounds of communication. To achieve this, we show that directory reconciliation can be reduced to document exchange under edit distance with block moves. Recall that a block move operation selects a contiguous substring of any length, deletes it from its current location, and inserts it elsewhere in the string. The state of the art for this form of document exchange is the IMS sketch [11], a one round protocol using O(k log(n/k) log n) bits of communication. In section 3 we present our main technical result, a new protocol for document exchange under edit distance with block moves that achieves communication nearly linear in k at the expense of using more rounds of communication. This protocol draws inspiration from techniques for noisy binary search, but requires new techniques and analysis in order to meet our communication requirements.

Our second approach is designed to be more efficient in terms of the number of rounds of communication. This approach is based on combining document exchange protocols with techniques for reconciling sets of sets [14]. We provide one round directory reconciliation protocols that are generally superior to what we achieve from our reduction to the IMS sketch. In particular, they perform significantly better when the directory consists of a large number of small files. Additionally, we provide efficient directory reconciliation protocols that use only a constant number of rounds for the setting where the bound d is unknown. These results motivate studying directory reconciliation as a problem distinct from document exchange, as they improve upon what is possible by direct reduction to document exchange.

1.1 Related Work

The formal study of document exchange began with Orlitsky [17] and has received significant attention since then; see, for example, [2, 11, 4]. We summarize the current state-of-the-art document exchange protocols in subsection 2.1. All of the modern protocols use only a single round of communication. While there was a line of work on multi-round protocols [19, 5, 12, 21], these protocols are dominated by the one round IMS sketch [11], which incorporates the ideas behind them. The rsync algorithm [1, 22] is a well known practical tool for synchronizing files or directories. However, rsync and related tools have poor worst case performance with regards to the amount of data they communicate. Furthermore, rsync performs directory synchronization by individually synchronizing files with common names/locations and thus behaves poorly if a file’s name/location changes, an issue we seek to remedy with our model of directory reconciliation.

Set reconciliation is a related problem in which Alice and Bob each have a set, and they wish to communicate efficiently so that Bob can recover Alice’s set given knowledge of a small bound on the difference between their sets [13, 20]. One of the solutions to set reconciliation makes use of the Invertible Bloom Lookup Table (IBLT) [7, 8]. We make extensive use of IBLTs in several of our protocols, and describe them further in subsection 2.2.

Set of sets reconciliation extends set reconciliation to the scenario where Alice and Bob’s set elements are themselves sets, as proposed in [14]. Here the bound is on the number of element additions and deletions needed to make their sets of sets equal. Mitzenmacher and Morgan [14] develop several protocols for this problem that we adapt to the setting of document exchange, once again making heavy use of IBLTs.

Noisy binary search is the problem of searching via comparison for an item in a list or graph in which those comparisons have some chance of returning an incorrect response [3, 6]. Solutions for this include multiplicative weights based algorithms which incrementally reinforce the likelihood that each possible candidate is the target. Our multi-round document exchange protocol draws heavily from these ideas, by using noisy hash based comparisons to incrementally discover common substrings of Alice and Bob’s documents. More specifically, we repeatedly use a variation of noisy binary search to find the longest substring of Bob’s document that is a prefix of Alice’s document. While similar to Nisan and Safra’s use of noisy binary search to find the longest common prefix of two strings [16], the expansion to substrings greatly complicates realizing our desired communication bound.

2 Preliminaries

We focus on two problems, directory reconciliation and document exchange. In the problem of directory reconciliation, Alice and Bob each have a directory, represented as a set of documents, each of which is a binary string. (We use a binary alphabet here for convenience; our results easily extend to larger alphabets.) Recall that to interpret a file directory as a set, the file’s name and directory location are encoded as a prefix of the document. Hence moving a file or changing its name corresponds to a small number of character edits to the corresponding document in our set.
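For concreteness, the following is a minimal sketch (in Python) of how a file tree can be flattened into such a set of documents; the separator byte and helper name are illustrative choices, not part of the paper's formal model.

    import os

    SEP = b"\x00"  # illustrative separator between the encoded path and the file contents

    def directory_to_documents(root):
        """Flatten a file tree into a set of documents.

        Each document is the file's path (relative to root), a separator, and the
        file's contents, so renaming or moving a file only edits the short path
        prefix rather than deleting one document and creating another.
        """
        docs = set()
        for dirpath, _, filenames in os.walk(root):
            for name in filenames:
                full = os.path.join(dirpath, name)
                rel = os.path.relpath(full, root).encode()
                with open(full, "rb") as f:
                    docs.add(rel + SEP + f.read())
        return docs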

The sum of the sizes of each party’s documents is at most n. Bob’s directory is equal to Alice’s after a series of at most d edits (single character insertions, deletions, and substitutions) to Alice’s documents. We may also be given an upper bound on the number of documents that differ between Alice and Bob; in general we may not have such a bound, in which case d itself serves as a trivial bound, since each differing document requires at least one edit. We develop protocols designed to terminate with Bob fully recovering Alice’s directory.

In document exchange, Alice and Bob each have a binary string (x and y respectively) of length at most n. We have a bound k on either the edit distance between x and y, or the edit distance with block moves. The edit distance is the minimum number of character insertions, deletions, and substitutions needed to transform x into y; the edit distance with block moves is the minimum number of block moves, character insertions, deletions, and substitutions needed to transform x into y. The goal of a document exchange protocol is to allow Bob to recover x as efficiently as possible.

Throughout this paper, we work in the word RAM model, with words of size Θ(log n), for both problems. We refer to the number of rounds of communication in a protocol as the total number of messages sent. A one round protocol therefore consists of a single message from Alice to Bob.

All of our protocols use the public randomness model, meaning that any random bits used in the protocol are shared between Alice and Bob automatically, without additional communication. This simplifies our presentation, as our protocols make heavy use of various hash functions, and public randomness allows Alice and Bob to use the same hash functions without communicating anything about them. Our protocols can be converted to the private randomness model with minimal additional communication via standard techniques [15]. In practice, one would instantiate the public randomness model by sharing a small random seed to be used for generating all of the random bits used in the protocol.
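As a concrete illustration of the shared-seed instantiation, the following Python sketch derives a family of hash functions from a single seed; the function name and the use of keyed BLAKE2 are our own illustrative choices.

    import hashlib

    def shared_hash(seed: bytes, label: bytes, data: bytes, bits: int) -> int:
        """Hash `data` with a function determined entirely by (seed, label).

        Both parties call this with the same seed (at most 64 bytes) and the same
        label (e.g. a phase number), so they evaluate identical hash functions
        without communicating anything further about which functions to use.
        """
        digest = hashlib.blake2b(data, key=seed, person=label[:16]).digest()
        return int.from_bytes(digest, "big") % (1 << bits)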

2.1 Document Exchange Protocols

Our protocols often use existing document exchange protocols as subroutines. Here we review the current state-of-the-art document exchange protocols for the settings of edit distance and edit distance with block moves. (At the time of this writing, there is a newly released protocol for document exchange under edit distance without block moves [10]; as that paper is still in pre-print form, and in particular currently lacks a concrete running time for its document exchange protocol, we have opted not to discuss it here.) The following is the best known protocol for document exchange under edit distance with block moves.

Theorem 2.1 (Theorem 1 of [11]).

Document exchange under edit distance with block moves can be solved in one round using O(k log(n/k) log n) bits of communication and near-linear time, with high probability.

The protocol for this theorem is fairly simple. Alice transmits an encoding of her string at each of a logarithmic number of levels. At each level, she splits her string into blocks, computes a short hash of each one, and encodes the hashes using the systematic part of a systematic error correcting code capable of correcting O(k) errors. At the bottom level, where the blocks are small, she encodes the blocks themselves rather than hashes of them. After receiving this message, Bob iterates through the levels. Since the first level has few blocks, he can decode it immediately and recover Alice’s hashes. He applies a rolling hash to every contiguous substring of his own string of the corresponding block length, and if any of Alice’s hashes match any of his own, then by inverting the hash (assuming no hash collisions) he knows the contents of that block of Alice’s string. Bob then continues through the levels, each time using what he has recovered so far of Alice’s string to decode the next level’s code. Since there are at most k edits between their strings, it can be shown that each level will only have O(k) hashes that aren’t present in Bob’s string. By decoding the final level, Bob will have recovered Alice’s whole string. The protocol only fails if there are hash collisions, which, given the hash size and the number of strings hashed, occurs with small probability.
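To make the level structure concrete, here is a simplified Python sketch. To keep it self-contained it departs from the real protocol in two labeled ways: Alice sends every block hash at each level rather than the systematic error-correcting encoding, and the hash sizes and level counts are placeholder constants.

    import hashlib

    def block_hash(block: bytes, level: int) -> bytes:
        # placeholder for the short per-block hashes used by the protocol
        return hashlib.blake2b(block, digest_size=8, person=bytes([level])).digest()

    def alice_message(x: bytes, first_level_blocks: int, levels: int):
        """Alice's side, simplified: the block size halves at each level, and she
        records one hash per block; at the bottom level she records the blocks
        themselves.  (The real protocol sends each level through a systematic
        error-correcting code so only O(k) errors' worth of redundancy is sent.)"""
        msgs = []
        for level in range(levels):
            size = max(1, len(x) // (first_level_blocks << level))
            blocks = [x[i:i + size] for i in range(0, len(x), size)]
            if level == levels - 1:
                msgs.append(("blocks", size, blocks))
            else:
                msgs.append(("hashes", size, [block_hash(b, level) for b in blocks]))
        return msgs

    def bob_match_level(y: bytes, size: int, level: int, alice_hashes):
        """Bob's side at one level: hash every length-`size` substring of y (a
        rolling hash makes this linear time) and recover each of Alice's blocks
        whose hash he can match, assuming no collisions."""
        mine = {block_hash(y[i:i + size], level): y[i:i + size]
                for i in range(max(0, len(y) - size + 1))}
        return {h: mine[h] for h in alice_hashes if h in mine}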

Now we turn to the best known protocol for document exchange under edit distance without block moves.

Theorem 2.2 (Theorem 9 of [2]).

Assuming k ≤ εn for a sufficiently small constant ε, document exchange under edit distance can be solved in one round using bits of communication and time, with probability at least .

Note that when k is larger than this, Theorem 2.1 represents the best known document exchange protocol, even under edit distance without block moves. Theorem 2.2 is based on a careful application of the CGK encoding of [4], which embeds strings from edit distance space into Hamming space. Theorem 2.2 results from applying this encoding repeatedly, each time matching up common pieces of x and y using the encoding, until at the final level the unmatched strings are of small enough size that applying Theorem 2.1 yields the desired bound.

2.2 Invertible Bloom Lookup Tables

Several of our protocols make use of the Invertible Bloom Lookup Table (IBLT) [8], a data structure representing a set that was designed to solve the set reconciliation problem. We summarize the structure and its properties here; more details can be found in [7, 8]. An IBLT is a hash table with several hash functions and a number of cells, which stores sets of key-value pairs. (It can be used to store just keys as well.) We add a key-value pair (each drawn from a fixed universe) to the table by updating each of the cells that the key hashes to. (We assume these cells are distinct; for example, one can use a partitioned hash table, with each hash function having its own set of cells.) Each cell has a number of entries: a count of the number of keys hashed to it, an XOR of all of the keys hashed to it, an XOR of a checksum of all of the keys hashed to it, and, if we are using values, an XOR of all the values hashed to it. The checksum, produced by another hash function, is a short bit string and is meant to guarantee that, with high probability, no cells containing distinct keys will have a colliding checksum. We can also delete a key-value pair from an IBLT through the same operation as adding it, except that now the counts are decremented instead of incremented.

An IBLT is invertible because if the number of cells is large enough compared to the number of key-value pairs inserted into it, we can recover those pairs via a peeling process. Whenever a cell in the table has a count of 1, the key XOR will be equal to the unique key inserted there, and the value XOR will be equal to the unique value inserted there. We can then delete the pair from the table, potentially revealing new cells with counts of 1, allowing the process to continue until no key-value pairs remain in the table. This yields the following theorem.

Theorem 2.3 (Theorem 1 of [8]).

There exists a constant c such that an IBLT with c·t cells (O(t) space) holding at most t key-value pairs will successfully extract all of its pairs with high probability.

A useful property of IBLTs is that we can “delete” pairs that aren’t actually in the table, by allowing the cells’ counts to become negative. In this case, the IBLT represents two disjoint sets, one for the inserted or “positive” pairs and one for the deleted or “negative” pairs. This addition requires a minor modification to the peeling process, which allows us to extract both sets. Now, just as we peel cells with counts of 1 by deleting their pairs from the table, we also peel cells with counts of −1 by adding their pairs to the table. Unfortunately, a cell with a count of 1 or −1 might have multiple pairs (some from each set) hashed there, whose counts only add up to ±1. However, we remedy this issue by using our checksums. With high probability, a cell with a count of ±1 will actually represent only a single pair if and only if the checksum of the cell’s key XOR equals the cell’s checksum XOR.

This property of IBLTs allows us to insert each of the items of one set into the IBLT, then delete the items of another set from the IBLT. Inverting the IBLT then reveals the contents of the symmetric set difference of the original two sets, and by Theorem 2.3 this will succeed with high probability so long as the size of this set difference is at most t.

In most of our uses of the IBLT, we only have keys, and no associated values. As such, unless otherwise noted, assume that our IBLTs lack value fields, and when we insert or delete an item from the IBLT, we are treating it as a key. We sometimes refer to “encoding” a set in an IBLT as inserting all of its elements into it. We similarly “decode” a set difference from an IBLT by extracting its keys.

There is one final nice property of IBLTs that we exploit. Let T_A and T_B be two IBLTs with the same number of cells and the same hash functions, and suppose we insert the items of a set A into T_A and the items of a set B into T_B. If A and B are disjoint, then we can “add” T_A and T_B together to make a single IBLT encoding the union of A and B: we iterate through the cells, adding each cell of T_A to the corresponding cell of T_B by summing their counts and XORing their other fields. Similarly, we can “subtract” T_B from T_A to yield a single IBLT encoding the symmetric set difference of A and B.
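The following compact Python sketch illustrates these operations (insert, delete, cell-wise subtract, and peel); the class layout, checksum choice, and parameters are illustrative simplifications rather than the tuned construction of [8], and keys are assumed to be fixed-length byte strings.

    import hashlib

    def _h(seed: int, key: bytes, mod: int) -> int:
        digest = hashlib.blake2b(key, key=bytes([seed])).digest()
        return int.from_bytes(digest, "big") % mod

    class IBLT:
        def __init__(self, cells: int, hashes: int = 3):
            self.k, self.m = hashes, cells
            # each cell holds [count, key XOR, checksum XOR]
            self.cells = [[0, 0, 0] for _ in range(cells)]

        def _update(self, key: bytes, delta: int):
            kint = int.from_bytes(key, "big")
            chk = _h(255, key, 1 << 32)
            part = self.m // self.k            # partitioned table: one cell per part
            for i in range(self.k):
                j = i * part + _h(i, key, part)
                cell = self.cells[j]
                cell[0] += delta
                cell[1] ^= kint
                cell[2] ^= chk

        def insert(self, key): self._update(key, +1)
        def delete(self, key): self._update(key, -1)

        def subtract(self, other):
            """Cell-wise difference of two IBLTs built with the same hash functions;
            the result encodes the symmetric difference of the two encoded sets."""
            for c, o in zip(self.cells, other.cells):
                c[0] -= o[0]; c[1] ^= o[1]; c[2] ^= o[2]

        def peel(self, key_len: int):
            """Recover the (positive, negative) keys by repeatedly clearing cells
            whose count is +1 or -1 and whose checksum confirms a single key."""
            out = {+1: [], -1: []}
            progress = True
            while progress:
                progress = False
                for cell in self.cells:
                    if cell[0] in (1, -1):
                        key = cell[1].to_bytes(key_len, "big")
                        if _h(255, key, 1 << 32) == cell[2]:
                            out[cell[0]].append(key)
                            self._update(key, -cell[0])
                            progress = True
            return out[+1], out[-1]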

2.3 Notation

Given a (1-indexed) vector v of length n, we use v[i..j] to denote the subvector of v consisting of indices i through j inclusive. Similarly, the length-i prefix of v is the subvector consisting of its first i entries, and the length-i suffix of v is the subvector consisting of its last i entries.

We will frequently use O(log n)-bit hashes as identifiers for strings. We use the property that, for a sufficiently large constant in the order notation, there are no collisions among polynomially many such hashes with probability at least 1 − 1/poly(n).

3 Multi-Round Document Exchange Protocol

Before we provide our own protocol for document exchange, we show that directory reconciliation can be solved via a straightforward reduction to document exchange under edit distance with block moves. The current state-of-the-art protocol for this problem is Theorem 2.1, and this reduction provides a baseline against which we compare the rest of our protocols.

Theorem (proved in Appendix A). Directory reconciliation can be solved in one round using O(d log(n/d) log n) bits of communication and near-linear time, with high probability.

The protocol for this starts with each party computing a short hash of each of their documents. They concatenate their documents together, ordered by these hashes and with each adjacent pair separated by a random delineation string, then perform document exchange on their concatenated documents via Theorem 2.1, and finally Bob decomposes Alice’s concatenated document into Alice’s directory. As we argue in Appendix A, the edit distance with block moves between the concatenated documents is O(d), and thus Theorem 2.1 yields the desired bounds.
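A minimal Python sketch of the concatenation and splitting steps follows; the document-exchange subroutine itself is assumed, and the hash length and delimiter length are illustrative placeholders.

    import hashlib
    import secrets

    def doc_hash(doc: bytes) -> bytes:
        # stands in for the short per-document hashes used by the reduction
        return hashlib.blake2b(doc, digest_size=8).digest()

    def concatenate_directory(docs, delimiter: bytes) -> bytes:
        """Order the documents by their hashes and join them with the shared random
        delimiter; both parties build this string and then run document exchange
        under edit distance with block moves on the two concatenations."""
        return delimiter.join(sorted(docs, key=doc_hash))

    def split_directory(concatenated: bytes, delimiter: bytes):
        """Bob's final step: recover Alice's directory from her concatenation."""
        return set(concatenated.split(delimiter))

    # the delimiter is drawn from the shared randomness, for example:
    # delimiter = secrets.token_bytes(8)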

Now we develop a protocol for document exchange under edit distance with block moves that achieves a communication cost of . We do this with a multi-round protocol, inspired by noisy binary search algorithms [6], that identifies the common blocks between Alice and Bob’s strings via many rounds of back and forth communication.

Theorem 3.1.

Document exchange under edit distance with block moves can be solved in rounds using bits of communication and time, with probability at least .

This implies a protocol for directory reconciliation via the same reduction as in section 3.

Corollary 3.2.

Directory reconciliation can be solved in rounds using bits of communication and time, with probability at least .

The main technical work in our protocol comes from the following lemma.

Lemma 3.3.

Given , Alice can find the longest prefix of her document up to length that is a contiguous substring of Bob’s document in rounds using bits of communication and time, with probability at least .

We prove Lemma 3.3 later, building off of techniques from noisy binary search in graphs [6]. We use this lemma as a subroutine in our protocol for Theorem 3.1 by incrementally building up a larger and larger prefix of Alice’s document that is known to Bob. Along the way Lemma 3.3 may fail due to its own internal randomness, but we make no assumptions on what form that failure takes. Its protocol may abort and report failure, it may report a prefix of Alice’s document that is too long, and thus is not a substring of Bob’s document, or it may report a prefix that is a substring of Bob’s document but is not the longest possible one. We show that so long as most of our applications of Lemma 3.3 succeed, no matter what form the failures take, our resulting protocol for document exchange will succeed.

Proof of Theorem 3.1.

Let x be Alice’s document and y be Bob’s document.

Our protocol proceeds in phases. After each phase, Bob will have recovered a progressively larger prefix of x, with high probability. We refer to the remaining suffix of x, meaning the string left after removing the recovered prefix, and for a string s we write |s| for its length.

In each phase, Alice and Bob use Lemma 3.3 to find the largest prefix (up to a length cap) of the remaining suffix of x that is contained in y. If this prefix is shorter than a fixed threshold, Alice directly transmits that many bits of x. Otherwise, Alice transmits the prefix’s length, along with a short hash of the prefix, to Bob, who compares it to the hash of each substring of y of that length. With high probability the hash matches only one substring of y, namely the prefix itself, and thus Bob has recovered it. If at any point a failure occurs, either one detected within Lemma 3.3 or because the hash does not have a unique match, we move on to the next phase with nothing new recovered.

First we argue that, assuming no hashing failures or failures in Lemma 3.3, the protocol succeeds within the claimed number of phases. The argument follows that of Lemma 3.1 of [11]. We imagine that y is written on a long piece of paper, and we perform each edit operation transforming y into x by cutting the paper, rearranging pieces, and inserting individual characters for insert and substitution operations. Each operation requires at most 3 cuts, so x consists of the concatenation of O(k) substrings of y plus up to k newly inserted characters. In each phase of the algorithm, we either recover up to the end of one of these substrings, at least one of the inserted characters, or a full threshold’s worth of characters. Each of these cases can happen at most O(k) times, so O(k) phases are required to recover x in its entirety.

There are two ways that a phase can fail: either the hash of the reported prefix can match a substring of y that does not equal it, or Lemma 3.3 can fail. Union bounding over the phases, with high probability none of the hashes mismatch. Lemma 3.3 can fail in one of three ways: it can report a prefix which is too short, report one that is too long, or fail to find one at all. If it fails to find a prefix, nothing new is recovered in that phase, and we essentially try again with fresh randomness. If the reported prefix is too long, then assuming no hash mismatches, Bob will not find a substring of y that matches its hash, so once again nothing new is recovered. Finally, if the reported prefix is too short, Bob still recovers it, but we won’t have completed one of the O(k) substrings of y of which x is comprised.

So long as Lemma 3.3 succeeds sufficiently many times over the phases, the protocol succeeds. Since each application of Lemma 3.3 succeeds with the probability stated in the lemma, a Chernoff bound shows that enough of the attempts succeed with high probability, so our overall success probability is as desired.

Each phase uses rounds totaling bits of communication and takes time, thus the whole protocol takes rounds, uses bits of communication and takes time. ∎
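The phase structure of this proof can be summarized by the following Python sketch. The subroutine find_longest_prefix stands in for Lemma 3.3 (and may fail in any of the ways discussed above), the 8-byte hash stands in for the short hashes of the proof, and the fallback on a failed phase is a simplification so that the sketch always makes progress.

    import hashlib

    def phase_hash(s: bytes) -> bytes:
        return hashlib.blake2b(s, digest_size=8).digest()

    def multi_round_exchange(x: bytes, y: bytes, find_longest_prefix, threshold: int):
        """Bob recovers x phase by phase.  find_longest_prefix(suffix, y) plays the
        role of Lemma 3.3: it returns the length of the longest prefix of `suffix`
        believed to occur as a substring of y, or None on a detected failure."""
        recovered = b""
        while len(recovered) < len(x):
            suffix = x[len(recovered):]            # Alice's not-yet-recovered part
            length = find_longest_prefix(suffix, y)
            if length is None or length < threshold:
                recovered += suffix[:threshold]    # Alice sends these bits directly
                continue
            h = phase_hash(suffix[:length])        # Alice sends (length, hash)
            matches = {y[i:i + length] for i in range(len(y) - length + 1)
                       if phase_hash(y[i:i + length]) == h}
            if len(matches) == 1:                  # unique match: Bob extends his prefix
                recovered += next(iter(matches))
            else:                                  # failed phase; the real protocol retries
                recovered += suffix[:threshold]
        return recovered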

Now we sketch the protocol for Lemma 3.3, with the details presented in Algorithm 2. In our protocol, Bob will represent all contiguous substrings of his document up to the given length bound using a tree T. Each node in T represents a substring of y. The root of T corresponds to the empty string, and each node at depth i corresponds to a unique substring of length i. The parent of a node at depth i is the node corresponding to its length-(i−1) prefix. T is essentially an uncompressed suffix tree [9] formed from the reverses of the length-bounded substrings of y.

  • Alice and Bob: and .

  • Bob: for all . .

  • For to :

    • Bob: if such that :

      • Bob: , and .

      • Bob: send to Alice.

    • Bob: otherwise:

      • Bob: . Sets such that .

      • Bob: if , . Otherwise, .

      • Bob: (the midpoint of ) and .

      • Bob: if and is not balanced with respect to , , we set either or so that and straddle , meaning that .

      • Bob: send and for to Alice.

    • Alice and Bob: for : , , .

    • Alice: for , send to Bob.

    • For : for :

      • Bob: if , for all , and for all , .

      • Bob: if , for all , and for all , .

    • Alice and Bob: split around the queried points. Specifically, for in descending order of , if then

  • Bob: send to Alice a -bit vector indicating for each phase whether any nodes at depth or are in .

  • Alice and Bob: construct such that and is the ordered list of unique depths of nodes in .

  • Alice: return .

  • Bob: return and .

Algorithm 1 : Alice inputs . Bob inputs .
  • Bob: construct , the tree representing all contiguous substrings of up to length .

  • Alice and Bob: and .

  • Bob: construct , the tree representing the hierarchy of the nodes in .

  • Alice and Bob: .

  • For (in decreasing order):

    • Alice: send -bit hash to Bob

    • Bob: if such that , send to Alice and return

    • Alice: if received , return

  • Alice and Bob: return failure

Algorithm 2 Alice inputs , and . Bob inputs , and .

We use to refer to the set of nodes in at depth . Given a node , refers to the subtree rooted at in . We use to refer to the tree above depth . In other words .

Our protocol will consist of a number of phases. In each phase, we draw a fresh short hash function, and Alice sends the hash of a prefix of her document, of a depth chosen for that phase, to Bob. Bob then compares this hash to the hashes of the nodes of T at that depth (the substrings of y of that length). Bob then responds with information determining which depth to use in the next phase. After enough rounds of this, we hope to gain enough information from these hashes for Bob to confidently determine the length of the largest prefix of x which is contained in y.

The set of prefixes of x which are contained in y corresponds to a path from the root in T. We are searching for a single node in T, namely the last node on this path. Whenever Alice’s hash does not match the hash of the queried node, we know that the target is not in the subtree rooted at that node. When the hash does match, the target is more likely to be in that subtree, and less likely to be in the rest of T. This is essentially the kind of feedback used in noisy binary search in graphs [6]. If we were unconstrained in the amount of communication Bob sends to Alice to determine the next query, and Bob could simply send enough bits to exactly specify the optimum query, then we could reduce directly to the algorithm of [6]. However, in order to achieve the lemma we must restrict ourselves to very few bits of communication per phase. We emphasize that this is where most of our technical contribution lies, as it requires new techniques and analysis to handle this restriction.

Both Alice and Bob maintain a partitioning of the range of candidate depths into intervals. Before the first phase, they have just one interval containing the whole range. In each phase, Alice and Bob agree on which interval to query in; Alice then chooses the query depth to be the midpoint of that interval. After the hash is transmitted, the interval containing the queried depth is split in two around it. Bob then tells Alice which interval to choose the next query from, by transmitting a few bits specifying how much to add to or subtract from the current interval’s index.

We choose the size of the hashes so that the probability of two unequal strings having matching hash values is suitably small. For each node of T, Bob maintains a weight. All the weights are initialized equally, and Bob updates them in response to the hashes he receives from Alice. For each query, if the hash matches, we multiply the weights of the nodes in the queried subtree by a factor greater than one and the weights of all other nodes by a smaller factor; if the hash does not match, we do the opposite. Here we are using the one-sidedness of comparisons, in that if the hashes do not match then we know that the strings do not match with probability 1. As a result, we could instead zero the weights of the nodes in the queried subtree on a mismatch, but doing so would not improve our asymptotic result and would complicate the analysis.
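A small Python sketch of this weight update and of the promotion of heavy nodes follows; the reinforcement factors and the plain dictionary of weights are illustrative stand-ins for the parameters chosen in the analysis.

    def update_weights(weights, subtree, hash_matched, up=2.0, down=0.5):
        """weights: dict mapping node -> weight; subtree: the set of nodes in the
        subtree rooted at the queried node.  Nodes consistent with the observed
        evidence are multiplied by `up`, the rest by `down`."""
        for node, w in weights.items():
            consistent = (node in subtree) == hash_matched
            weights[node] = w * (up if consistent else down)
        return weights

    def promote_heavy_nodes(weights, candidates):
        """Whenever a node holds at least half of the total weight, zero its weight
        and add it to the candidate list, as in MultWeightsProtocol."""
        total = sum(weights.values())
        for node, w in list(weights.items()):
            if total > 0 and w >= total / 2:
                candidates.append(node)
                weights[node] = 0.0
                total -= w
        return weights, candidates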

Whenever a node’s weight is at least half of the total tree weight, we zero its weight and add the node to a candidate list. We show that after enough rounds, our target node is likely to be in this list. We then repeat the protocol with a new tree T′ consisting only of the nodes in the list. T′ is formed by taking the transitive closure of T, then removing all nodes not in the list, then taking the transitive reduction of the result; in other words, in T′, a node is the parent of another if it was that node’s most recent ancestor in T among the list’s nodes. Note that unlike T, T′ need not be a binary tree. When running the protocol on T′, the partition no longer consists of intervals of all depths, but instead of contiguous subsets of the node depths present in T′, and when we select the midpoint of a part, we choose its median depth. We show that by running the protocol on T′ for enough rounds, the target is likely to be in the new, much smaller candidate list, so at this point we directly determine which node of that list is the target by sending a short hash for each node in it.

Now we argue that Algorithm 2 satisfies Lemma 3.3. First we note that all of the steps in MultWeightsProtocol are achievable, with the only non-trivial part being provided by the following lemma.

Lemma 3.4.

In MultWeightsProtocol, Alice can construct using .

Proof.

Consider a node and its parent . If has not been queried in any phase, then . This is because any weight update performed by a query at a depth will update and identically. If then either and will be unchanged or they will both be multiplied by . If then either and will both be multiplied by or they will both be set to .

In order for to be added to , . This is impossible if , therefore must have been queried in at least one phase. Thus, every depth that could be in will be included in so Alice can construct . ∎

We say that a depth is queried in a phase of Algorithm 1 if or in that phase. A query of depth partitions the tree into components: along with for each . A query of depth is balanced if each component has at most of the total weight in the tree. In other words, . A query of depth is informative if the component containing the target node has at most of the total mass. Specifically, if then and if for , then . A balanced query is guaranteed to be informative, but an informative query need not be balanced. We call a phase of Algorithm 1 informative if either an informative query is performed or a node is added to .

Lemma 3.5.

In MultWeightsProtocol, is guaranteed to be a balanced query.

Proof.

If is not balanced, then either or there exists a such that . Additionally, by the definition of , if it is not balanced then no query is.

Suppose . Let be the smallest (least deep) query such that . By the definition of , there must exist a for which . consists of together with a subset of . Since , . Furthermore, every node in the tree has weight at most , or it would have been zeroed in a previous step. Therefore, , which is a contradiction so we cannot have that .

The argument showing that there cannot exist a such that is essentially identical. ∎

Lemma 3.6.

If , performs at least informative phases, where is the height of .

Proof.

Let be the depth of , and be the interval containing . We introduce the potential function , where . If is the final interval, then instead . We argue that in each phase, either a node is added to , an informative query is performed and increases by at most 5, or decreases by at least 2.

No phase can ever increase by more than 5 since and at most two intervals are split in two, and and are non-increasing.

If then we have two cases:

  1. is in between and . In this case is informative because is balanced (by 3.5) and is closer to than so the component containing formed by partitioning around must have at most of the total weight.

  2. is in between and or is in between and . Since , must be 3 steps closer to than , which in either ordering means it must also be 3 steps closer to . Thus must decrease by at least 2 (not 3 because splitting an interval can increase the distance by 1).

If , then we have a few cases:

  1. is balanced. In this case our query is guaranteed to be informative.

  2. . is in between and so it is guaranteed to be informative.

  3. . is in between and so it is guaranteed to be informative.

  4. and . Either or so will shrink by a factor of at least . afterwards will be at most , thus in total will decrease by at least .

  5. and . Either so will shrink by a factor of at least . afterwards will be at most , thus in total will decrease by at least .

  6. and . is guaranteed to be an informative query since so .

Initially, since there is only one interval. is also always non-negative. Thus, over phases, if is our number of informative phases then

and so . ∎

For each phase of MultWeightsProtocol, define a random variable tracking the target node’s share of the total weight at the beginning of that phase. We will use these random variables to bound the probability that, at the end of MultWeightsProtocol, the target’s weight never reaches half of the total weight, and thus the target is never added to the candidate list. First we provide some bounds on these variables, whose proofs appear in Appendix B.

Lemma (proved in Appendix B). If no node is added to in phase , then . If a node other than is added to in phase , then .

Lemma (proved in Appendix B). If , and is not in by the end of phase , then . Additionally, if phase is informative, then .

Using these bounds, together with Lemma 3.6, we can apply a Chernoff bound to get the following lemma (whose proof also appears in Appendix B).

Lemma (proved in Appendix B). If , , and , where is the height of the tree, then MultWeightsProtocol will output a candidate list which includes the target with probability at least .

Now we put all of these pieces together.

Proof of Lemma 3.3.

We prove that Algorithm 2 achieves the lemma. First we argue correctness.

The only step of MultWeightsProtocol where it is not immediate that Alice or Bob has the required information to perform the protocol is whether Bob can construct the required structure, which we proved in Lemma 3.4. By Lemma 3.6 and our choice of the number of phases, the first candidate list will contain the target with the required probability; together with Lemma 3.6, our choice of parameters, and a union bound, the second candidate list will contain the target with the required probability as well.

With our choice of hash size, we can make the probability of hash collisions suitably small. Thus, by a union bound, with the required probability there are no hash collisions, Alice returns her prefix, and Bob returns the matching node. Thus, by our choice of parameters, the protocol succeeds with the probability claimed in the lemma.

Each phase of MultWeightsProtocol uses a constant number of rounds of communication and transmits few bits, and at the end of MultWeightsProtocol there is a single additional message. Thus, between our two applications of MultWeightsProtocol and the hashes transmitted at the end, our rounds of communication and our bits of communication are both bounded as claimed in the lemma. The computation bottleneck is the first application of MultWeightsProtocol. Each phase takes time to compute all of the hash evaluations (using a rolling hash function) and time to update all the weights and select the next interval to query, so the total computation time is as claimed. ∎

4 Sets of Sets Based Protocols

Corollary 3.2 is aimed at minimizing the number of bits of communication, but it does so at the expense of both rounds of communication and computation time. In this section we develop a single round protocol that is much faster and still outperforms the reduction of section 3 in communication cost in most cases.

Theorem (proved in Appendix C). Directory reconciliation can be solved in one round using bits of communication and time with probability at least .

The protocol here is an adaptation of Theorem 3.7 of [14], using document exchange as a subroutine instead of set reconciliation. The basic idea is that we represent each document as a pair consisting of a hash of that document, together with the message that Theorem 2.2 (the best known one round document exchange protocol) would send to allow the other party to recover the document. For each of a logarithmic number of levels, we make a different version of this representation for each document, varying the bound on the edit distance used in the document exchange protocol; in the ith level the bound is 2^i, starting at 2 in the first level. Then, we encode all of the document representations in level i into an IBLT whose size shrinks geometrically with the level, since at most about d/2^(i-1) of the differing documents can require more than 2^(i-1) edits. These IBLTs are the message from Alice to Bob.

Bob constructs analogous representations and IBLTs. Bob is able to decode all of the level 1 representations corresponding to differing documents, since the level 1 IBLT is large enough to hold all of them. Bob then considers all combinations of his documents and Alice’s extracted representations. For each of these pairs, he attempts to perform his side of the document exchange protocol used in Theorem 2.2. Since the bound Alice used at this level was 2, the document exchange protocol will succeed (with high probability) whenever he pairs one of his documents with Alice’s representation of a document that differs by at most 2 edits. For those pairs differing in more than 2 edits, the protocol will generally fail, either by reporting failure or by yielding an incorrect document, which he can detect using Alice’s document’s hash. He then generates representations for those documents he recovered from Alice, and removes them from the level 2 IBLT so that he can recover most of Alice’s documents that have at most 4 edits, and so on, until he has recovered all of Alice’s documents. The full details of the protocol and the proof of the theorem appear in Appendix C.
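The shape of Alice's message can be sketched as follows in Python; the IBLT sizing, the doc_exchange_sketch subroutine (standing in for the sketch of Theorem 2.2), and the assumption that representations are padded to a fixed length per level are all illustrative.

    import hashlib
    import math

    def doc_id(doc: bytes) -> bytes:
        return hashlib.blake2b(doc, digest_size=8).digest()

    def alice_level_iblts(docs, d, doc_exchange_sketch, iblt_factory):
        """For level i = 1, 2, ..., build for every document the representation
        (document hash, document-exchange sketch with edit bound 2**i) and encode
        all of these representations into an IBLT whose size shrinks roughly as
        d / 2**i.  The list of IBLTs is Alice's one-round message."""
        levels = []
        num_levels = max(1, math.ceil(math.log2(d + 1)))
        for i in range(1, num_levels + 1):
            table = iblt_factory(max(8, (8 * d) >> i))   # shrinking IBLT per level
            for doc in docs:
                rep = doc_id(doc) + doc_exchange_sketch(doc, 2 ** i)
                table.insert(rep)
            levels.append(table)
        return levels

Here iblt_factory(cells) would return an empty IBLT, for example the sketch given in subsection 2.2, and doc_exchange_sketch(doc, bound) is whatever one-round document exchange message the chosen subroutine produces for that edit bound.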

Even if the only bounds we have are the trivial ones, the protocol of section 4 already generally outperforms our previous best one round protocol, the reduction of section 3, in bits of communication. The difference really shows when we have a tight bound on the number of differing documents. In such a case section 4 outperforms section 3 by a large factor. This regime is not an unnatural one, as it corresponds to the setting where our directory consists of a large number of small files.

4.1 Speeding Up

The main weakness of section 4 is its running time. The reason that section 4 takes more time than the original protocol for reconciling sets of sets on which it is based is that document exchange protocols take time linear in the length of the documents to determine whether the protocol will succeed, while the underlying set reconciliation protocol used in [14] only takes time linear in the specified difference bound. This is a bottleneck because each document exchange message is compared against many documents to see if they can decode it, and each attempt takes linear time. In other words, if we can make a document exchange protocol that determines failure quickly, we can reduce the computation time of section 4 accordingly. This holds even if Bob is allowed precomputation time before receiving Alice’s message, since that would be a one time cost for each document. We develop such a document exchange protocol here, but it is primarily a proof of concept: it has an increased communication cost, and so the resulting directory reconciliation protocol is very rarely superior to both section 3 and section 4 simultaneously. Further progress is necessary to yield an improved directory reconciliation protocol, but we believe that our approach here is a valuable starting point.

Our document exchange protocol is a modification of Theorem 2.1. We sketch our changes here, and present the full details and proof in Appendix D. First, we limit ourselves to edit distance, rather than edit distance with block moves, which means that rather than having to consider all possible blocks as matches, we only have to consider the shifts of each corresponding block. For each of his possible blocks, Bob precomputes the encoding of that block at that level and at all lower levels. So, for example, the encoding of a block from the first level at the second level would be the combination of the encodings of the first half and the second half of the block. For our error correcting code, we use IBLTs, which have the advantage that they can combine two of these precomputed pieces efficiently. This progresses only down to the level where the blocks reach a minimum size, at which point we transmit a direct encoding of those blocks, blowing up our communication somewhat. Each encoding takes little time to generate once we have the lower level encodings and all of the hashes, so the total precomputation is modest, and once we have these precomputations, Bob can check and decode a message quickly, since at each level he just has to combine precomputed encodings and decode. This yields the following theorem.

Theorem (proved in Appendix D). There is a one round protocol for document exchange under edit distance using bits of communication, which succeeds with probability at least if . Alice takes computation time, and Bob has a time precomputation step before receiving Alice’s message. After receiving Alice’s message, Bob has a time failure check step. If the message passes the failure check, then the protocol will succeed with probability at least (independent of ) after a final time step from Bob.

Using this document exchange protocol as our subroutine instead of Theorem 2.2 in section 4 yields the following directory reconciliation protocol.

Theorem (proved in Appendix D). Directory reconciliation with can be solved in one round using bits of communication and time with probability at least .

This degrades our communication bound in section 4 from to , but speeds it up from to . The proof of this theorem is also in Appendix D.

4.2 Unknown d

So far all of our protocols have assumed that we know d (or a good bound on it) in advance. Any of them can be extended to handle the case where d is unknown by the repeated doubling method. First we try the protocol with a bound of 1, and send a hash of Alice’s directory along with the message. If the protocol succeeds and the directory Bob recovers matches the hash, then we are done; otherwise we double the bound and try again, until we eventually reach a sufficient value of d and succeed with good probability. This strategy will not increase any of our asymptotic communication costs and will only increase the computation times by a modest factor; however, all of our one round protocols become O(log d)-round protocols. For some applications it is important to minimize the number of rounds of communication, so in this section we develop a protocol that allows for unknown or unbounded values of d while still using a constant number of rounds of communication.
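The doubling wrapper itself is straightforward; a minimal Python sketch follows, where run_protocol(d) is a hypothetical stand-in for any of the known-d protocols (returning Bob's candidate directory, or None on a detected failure) and the directory hash plays the role of the hash Alice sends along.

    import hashlib

    def directory_hash(docs) -> bytes:
        return hashlib.blake2b(b"\x00".join(sorted(docs)), digest_size=16).digest()

    def reconcile_unknown_d(run_protocol, alice_digest: bytes, max_d: int):
        """Try the known-d protocol with d = 1, 2, 4, ...; accept the first attempt
        whose recovered directory matches the hash of Alice's directory."""
        d = 1
        while d <= max_d:
            candidate = run_protocol(d)
            if candidate is not None and directory_hash(candidate) == alice_digest:
                return candidate
            d *= 2
        raise RuntimeError("reconciliation failed up to the maximum bound")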

Our key techniques here are the CGK embedding [4] and set difference estimators [14]. The CGK embedding is a randomized mapping from edit distance space to Hamming space, such that with constant probability the Hamming distance between the embedded strings is at least proportional to, and at most quadratic in, the original edit distance between the strings. Set difference estimators are small sketches that allow the computation of a constant factor approximation to the size of the difference between two sets, with constant probability. They can equivalently be used to estimate the Hamming distance between two strings. By combining these tools with section 3 we get the following result.
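For intuition, here is a simplified Python sketch of a CGK-style embedding: a pointer walks through the string, and each output position copies the current character and advances the pointer by a pseudorandom bit shared through the public randomness. This is a sketch of the idea only; the actual construction and its guarantees are those of [4].

    import hashlib

    def cgk_embed(x: bytes, seed: bytes, pad: int = 0) -> bytes:
        """Map x into a string of length 3*len(x).  Strings that are close in edit
        distance tend to stay 'in sync' during the walk, so their embeddings tend
        to be close in Hamming distance."""
        out = bytearray()
        i = 0
        for j in range(3 * len(x)):
            c = x[i] if i < len(x) else pad
            out.append(c)
            # shared pseudorandom bit r_j(c), derived from the public seed
            step = hashlib.blake2b(bytes([c]) + j.to_bytes(4, "big"),
                                   key=seed, digest_size=1).digest()[0] & 1
            i += step
        return bytes(out)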

Theorem (proved in Appendix E). Directory reconciliation with unknown d can be solved in 4 rounds using bits of communication and time with probability at least .

This theorem is inspired by the multi-round approach to reconciling sets of sets in [14]. In the first round, we estimate the number of differing documents via set difference estimators. In the second round we perform set reconciliation on sets of hashes of the documents, to determine which among them differ. In the third round we exchange set difference estimators for the CGK embeddings of the differing documents, to determine which documents have close edit distance to which others, and what that edit distance is. In the final round we use that information to reconcile the differing documents by using Hamming distance sketches applied to the CGK embeddings used in the previous round. The fact that we use the same CGK embedding for reconciliation as for the estimate is what allows us to end up with only a final term. The full details and proof are presented in Appendix E.

5 Conclusion

Directory reconciliation considers the reconciliation problem in a natural practical setting. While directory reconciliation is closely related to document exchange and its variation with block edits, it has its own distinct features and challenges; also, while the problem is closely related to practical tools such as rsync, rsync works on a file-by-file basis that may be inefficient in some circumstances. Theoretically, we have found a document exchange scheme whose communication is nearly linear in the number of edits with block moves, at the expense of a number of rounds of interaction. The natural question is whether this result can be achieved with a single round. On the more practical side, we believe using the “set of sets” paradigm based on IBLTs may provide mechanisms that, besides being of theoretical interest, may also be useful in some real-world settings where files may not be linked by file names but are otherwise closely related.

References

  • [1] rsync. https://rsync.samba.org.
  • [2] Djamal Belazzougui and Qin Zhang. Edit distance: Sketching, streaming, and document exchange. In Foundations of Computer Science (FOCS), 2016 IEEE 57th Annual Symposium on, pages 51–60. IEEE, 2016.
  • [3] Michael Ben-Or and Avinatan Hassidim. The bayesian learner is optimal for noisy binary search (and pretty good for quantum as well). In Foundations of Computer Science, 2008. FOCS’08. IEEE 49th Annual IEEE Symposium on, pages 221–230. IEEE, 2008.
  • [4] Diptarka Chakraborty, Elazar Goldenberg, and Michal Koucký. Streaming algorithms for embedding and computing edit distance in the low distance regime. In Proceedings of the 48th Annual ACM SIGACT Symposium on Theory of Computing, pages 712–725. ACM, 2016.
  • [5] Graham Cormode, Mike Paterson, Süleyman Cenk Sahinalp, Uzi Vishkin, et al. Communication complexity of document exchange. In Proceedings of the eleventh annual ACM-SIAM symposium on Discrete algorithms, pages 197–206, 2000.
  • [6] Ehsan Emamjomeh-Zadeh, David Kempe, and Vikrant Singhal. Deterministic and probabilistic binary search in graphs. In Proceedings of the forty-eighth annual ACM symposium on Theory of Computing, pages 519–532. ACM, 2016.
  • [7] David Eppstein and Michael Goodrich. Straggler identification in round-trip data streams via Newton’s identities and invertible Bloom filters. IEEE Transactions on Knowledge and Data Engineering, 23(2):297–306, 2011.
  • [8] Michael Goodrich and Michael Mitzenmacher. Invertible Bloom lookup tables. In 49th Annual Allerton Conference on Communication, Control, and Computing, pages 792–799, 2011.
  • [9] Dan Gusfield. Algorithms on strings, trees and sequences: computer science and computational biology. Cambridge University Press, 1997.
  • [10] Bernhard Haeupler. Optimal document exchange and new codes for small number of insertions and deletions. arXiv preprint arXiv:1804.03604, 2018.
  • [11] Utku Irmak, Svilen Mihaylov, and Torsten Suel. Improved single-round protocols for remote file synchronization. In INFOCOM 2005. 24th Annual Joint Conference of the IEEE Computer and Communications Societies. Proceedings IEEE, volume 3, pages 1665–1676. IEEE, 2005.
  • [12] John Langford. Multiround rsync. 2001.
  • [13] Yaron Minsky, Ari Trachtenberg, and Richard Zippel. Set reconciliation with nearly optimal communication complexity. IEEE Transactions on Information Theory, 49(9):2213–2218, 2003.
  • [14] Michael Mitzenmacher and Tom Morgan. Reconciling graphs and sets of sets. In Proceedings of the 35th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, pages 33–47. ACM, 2018.
  • [15] Ilan Newman. Private vs. common random bits in communication complexity. Information Processing Letters, 39(2):67–71, 1991.
  • [16] Noam Nisan. The communication complexity of threshold gates. Combinatorics, Paul Erdos is Eighty, 1:301–315, 1993.
  • [17] Alon Orlitsky. Interactive communication of balanced distributions and of correlated files. SIAM Journal on Discrete Mathematics, 6(4):548–564, 1993.
  • [18] Ely Porat and Ohad Lipsky. Improved sketching of Hamming distance with error correcting. In Annual Symposium on Combinatorial Pattern Matching, pages 173–182. Springer, 2007.
  • [19] Thomas Schwarz, Robert W Bowdidge, and Walter A Burkhard. Low cost comparisons of file copies. In Distributed Computing Systems, 1990. Proceedings., 10th International Conference on, pages 196–202. IEEE, 1990.
  • [20] David Starobinski, Ari Trachtenberg, and Sachin Agarwal. Efficient pda synchronization. IEEE Transactions on Mobile Computing, 2(1):40–51, 2003.
  • [21] Torsten Suel, Patrick Noel, and Dimitre Trendafilov. Improved file synchronization techniques for maintaining large replicated collections over slow networks. In Data Engineering, 2004. Proceedings. 20th International Conference on, pages 153–164. IEEE, 2004.
  • [22] Andrew Tridgell and Paul Mackerras. The rsync algorithm. https://rsync.samba.org/tech_report/.

Appendix A Reducing to Document Exchange with Block Edits

We show here that directory reconciliation can be solved via a straightforward reduction to document exchange under edit distance with block moves.

Theorem (restated from section 3). Directory reconciliation can be solved in one round using O(d log(n/d) log n) bits of communication and near-linear time, with high probability.

Proof.

First we describe the protocol, and then we argue its correctness. Alice and Bob compute a short hash of each of their documents, then sort their hashes using radix sort. They then concatenate all of their documents into a single document, ordering them by the sorted order of their hashes. They choose a random delineation string and insert it in between each pair of adjacent documents in the concatenation. They then engage in document exchange with their concatenated documents so that Bob recovers Alice’s concatenated document, using Theorem 2.1 with an edit bound of O(d). Bob then converts the concatenated document into a directory by splitting along the delineation strings, thus recovering Alice’s directory.

With probability at least , no two of Alice and Bob’s documents that are not equal will have equal hash values. Assuming this is the case, Theorem 2.1 will allow Bob to recover Alice’s concatenated document with probability at least . This is because without hash collisions, the documents will be ordered such that at most block moves are required to order Alice’s documents so that Alice and Bob’s document orders correspond to the minimum edit distance matching between their documents. After making these block moves, the edit distance between Alice’s and Bob’s concatenated documents would be at most . The optimal block edit distance between the original concatenated documents is therefore seen to be at most , and thus Bob can recover Alice’s concatenated document using the algorithm of Theorem 2.1 (for ) with probability at least . With probability at least the delineation string will not appear in any of the documents, so splitting Alice’s concatenated document using it will produce Alice’s directory.

The concatenated documents are of size O(n), since the directories cannot have duplicate documents (they are sets). Thus, by Theorem 2.1, the communication cost is as claimed. The running time is dominated by that of Theorem 2.1, since it only takes linear time to construct the concatenated strings and then extract the directory. ∎

Appendix B Missing Proofs for the Multi-Round Document Exchange Protocol

Here we provide the missing proofs of the lemmas used to prove that Algorithm 2 satisfies Lemma 3.3. Recall that is at the beginning of phase in MultWeightsProtocol, and