Compressed Communication Complexity of Longest Common Prefixes

06/08/2018 ∙ by Philip Bille, et al. ∙ 0

We consider the communication complexity of fundamental longest common prefix (Lcp) problems. In the simplest version, two parties, Alice and Bob, each hold a string, A and B, and we want to determine the length of their longest common prefix l=Lcp(A,B) using as few rounds and bits of communication as possible. We show that if the longest common prefix of A and B is compressible, then we can significantly reduce the number of rounds compared to the optimal uncompressed protocol, while achieving the same (or fewer) bits of communication. Namely, if the longest common prefix has an LZ77 parse of z phrases, only O( z) rounds and O(ℓ) total communication is necessary. We extend the result to the natural case when Bob holds a set of strings B_1, ..., B_k, and the goal is to find the length of the maximal longest prefix shared by A and any of B_1, ..., B_k. Here, we give a protocol with O( z) rounds and O( z k + ℓ) total communication. We present our result in the public-coin model of computation but by a standard technique our results generalize to the private-coin model. Furthermore, if we view the input strings as integers the problems are the greater-than problem and the predecessor problem.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Communication complexity is a basic, useful model, introduced by Yao [Yao:1979], which quantifies the total number of bits of communication and rounds of communication required between two or more players to compute a function, where each player holds only part the function’s input. A detailed description of the model can be found, for example, in the book by Kushilevitz and Nisam [KushilevitzNisan1997].

Communication complexity is widely studied and has found application in many areas, including problems such as equality, membership, greater-than, and predecessor (see the recent book by Rao and Yehudayoff [RaoYehudayoff18]). For the approximate string matching problem, the paper by Starikovskaya [Starikovskaya17] studies its deterministic one-way communication complexity, with application to streaming algorithms, and provides the first sublinear-space algorithm. Apart from these results, little work seems to have been done in general for the communication complexity of string problems [Starikovskaya18].

In this paper, we study the fundamental longest common prefix problem, denoted Lcp, where Alice and Bob each hold a string, and and want to determine the length of the longest common prefix of and , that is, the maximum , such that (where indicates the empty prefix). This problem is also called the greater than problem, since if we view both and as integers, the position immediately after their longest common prefix determines which is larger and smaller. The complexity is measured using the number of rounds required and the total amount of bits exchanged in the communication. An optimal randomized protocol for this problem uses communication and rounds [Smirnov1988, Nisan1993] where is the length of the strings. Other trade-offs between communication and rounds are also possible [senaVenkateshb2008]. Buhrman et al. [Buhrman] describe how to compute Lcp in rounds and communication.

We show that if and are compressible we can significantly reduce the number of needed rounds while simultaneously matching the bound on the number of bits of communication. With the classic and widely used Lempel-Ziv 77 (LZ77) compression scheme [Ziv1977] we obtain the following bound.

theoremlcp The Lcp problem has a randomized public-coin -round protocol with communication complexity, where is the length of the longest common prefix of and and is the number of phrases in the LZ77 parse of this prefix.

Compared to the optimal uncompressed bound we reduce the number of rounds from to (where typically is much smaller than ). At the same time we achieve communication complexity and thus match or improve the uncompressed bound. Note that the number of rounds is both compressed and output sensitive and the communication is output sensitive.

As far as we know, this is the first result studying the communication complexity problems in LZ77 compressed strings. A previous result by Bar-Yossef et al. [Bar-YossefJKK04] gives some impossibility results on compressing the text for (approximate) string matching in the sketching model, where a sketching algorithm can be seen as a public-coin one-way communication complexity protocol. Here we exploit the fact that the common prefixes have the same parsing into phrases up to a certain point, and that the “mismatching” phrase has a back pointer to the portion of the text represented by the previous phrases: Alice and Bob can thus identify the mismatching symbol inside that phrase without further communication (see the “techniques” paragraph).

We extend the result stated in Theorem 1 so as to compute longest common prefixes when Bob holds a set of strings , and the goal is to compute the maximal longest common prefix between and any of the strings . This problem, denoted , naturally captures the distributed scenario, where clients need to search for query strings in a text data base stored at a server. To efficiently handle many queries we want to reduce both communication and rounds for each search. If we again view the strings as integers this is the predecessor problem. We generalize Theorem 1 to this scenario.

theoremlcpk The Lcp problem has a randomized public-coin round communication protocol with communication complexity, where is the maximal common prefix between and any one of , and is the number of phrases in the LZ77 parse of this prefix. Compared to Theorem 1 we obtain the same number of rounds and only increase the total communication by an additive term. As the total communication increases by at most a factor .

The mentioned results hold only for LZ77 parses without self-references (see Sec. LABEL:sub:LZ). We also show how to handle self-referential LZ77 parses and obtain the following bounds, where we add either extra rounds or extra communication.

theoremlcpself The Lcp problem has an randomized public-coin protocol with

  1. rounds and communication complexity,

  2. rounds and communication complexity

where is the length of the longest common prefix of and , and is the number of phrases in the self-referential LZ77 parse of this prefix. theoremlcpkself The Lcp problem has a randomized public-coin protocol with

  1. rounds and communication complexity,

  2. rounds and communication complexity

where is the length of the maximal common prefix between and any one of , and is the number of phrases in the self-referential LZ77 parse of this prefix.

Turning again to LZ77 parses without self-references we also show the following trade-offs between rounds and communication.

theoremmultisearch For any constant the Lcp problem has a randomized public-coin protocol with

  1. rounds and total communication where is the number of phrases in the LZ77 parse of

  2. rounds and total communication where is the number of phrases in the LZ77 parse of the longest common prefix between and

Using the standard transformation technique by Newman [Newman1991] all of the above results can be converted into private-coin results for bounded length strings: If the sum of the lengths of the strings is , then, Newman’s construction adds an term in communication complexity, and only gives rise to additional round.

1.0.1 Techniques.

Our results rely on the following key idea. First, we want to perform a binary search over the LZ77-parses of the strings, to find the first phrase where Alice and Bob disagree. Then, the longest common prefix must end somewhere in the next phrase (see Figure LABEL:fig:common). So Alice needs only to send the offset and length of her next phrase, and Bob can determine the longest common prefix with his string or strings (as proven in Lemma LABEL:lem:lz).