A New Algorithm for Finding Closest Pair of Vectors

02/25/2018
by Shuai Xu, et al.
Florida International University

Given n vectors x_0, x_1, ..., x_{n-1} in {0,1}^m, how do we find the two vectors whose pairwise Hamming distance is minimum? This problem is known as the Closest Pair Problem. If these vectors are generated uniformly at random except that two of them are correlated with Pearson-correlation coefficient ρ, then the problem is called the Light Bulb Problem. In this work, we propose a novel coding-based scheme for the Closest Pair Problem. We design both randomized and deterministic algorithms, which achieve the best-known running time when the minimum distance is very small compared to the length of the input vectors. When applied to the Light Bulb Problem, our algorithms yield the state-of-the-art deterministic running time when the Pearson-correlation coefficient ρ is very large.



1 Introduction

We consider the following classic Closest Pair Problem: given n vectors x_0, x_1, ..., x_{n-1} in {0,1}^m, how do we find the two vectors with the minimum pairwise distance? Here the distance is the usual Hamming distance: dist(x, y) = |{i : x_i ≠ y_i}|, where x_i denotes the i-th component of vector x. Without loss of generality, we assume that d is the unique minimum distance and all other pairwise distances are strictly greater than d.

The Closest Pair Problem is one of the most fundamental and well-studied problems in many scientific disciplines, with a wide spectrum of applications in computational finance, DNA detection, weather prediction, etc. For instance, the Closest Pair Problem has recently found the following interesting application in bioinformatics. Scientists wish to find connections between Single Nucleotide Polymorphisms (SNPs) and phenotypic traits. SNPs are one of the most common types of genetic differences among people, with each SNP representing a variation in a single DNA building block called a nucleotide [22]. Screening for the most correlated pairs of SNPs has been applied to study such connections [10, 14, 17, 35]. As the number of SNPs in humans is estimated to be around 10 to 11 million, for problems of this size any improvement in the running time for solving the Closest Pair Problem would have a huge impact on genetics and computational biology [35].

In theoretical computer science, the Closest Pair Problem has a long history in computational geometry; see e.g. [39] for a survey of many classic algorithms for the problem. The naive algorithm for the Closest Pair Problem computes all pairwise distances and therefore takes O(n^2 m) time. When the dimension is a constant, either in the Euclidean space or under other standard metrics, the classic divide-and-conquer based algorithm runs in O(n log n) time [13]. Rabin [38] combined the floor function with randomization to devise a linear-time algorithm. In 1995, Khuller and Matias [30] simplified Rabin's algorithm to achieve the same linear running time and space complexity. Golin et al. [25] used dynamic perfect hashing to implement a dictionary and obtained the same linear time and space bounds.

When the dimension is not a constant, the first subquadratic-time algorithm for the Closest Pair Problem is due to Alman and Williams [4], and works for dimensions as large as c log n for some constant c. The algorithm is built on a recently developed framework called the polynomial method [46, 47, 2]. In particular, Alman and Williams first constructed a low-degree probabilistic polynomial which computes the MAJORITY function with small error, then applied the polynomial method to design a subquadratic-time algorithm which computes the minimum Hamming distance among all red-blue vector pairs through polynomial evaluations. In a more recent work, Alman et al. [3] unified Valiant's fast matrix multiplication approach [42] with that of Alman and Williams [4]. They constructed probabilistic polynomial threshold functions (PTFs) to obtain a simpler algorithm with improved randomized and deterministic running times.

The Light Bulb Problem.

A special case of the Closest Pair Problem, the so-called Light Bulb Problem, was first posed by Valiant in 1988 [43]. In this problem, we are given a set of n vectors chosen uniformly at random from the Boolean hypercube {0,1}^m, except that two of them are non-trivially correlated (specifically, they have Pearson-correlation coefficient ρ, which is equivalent to saying that the expected Hamming distance between the correlated pair is (1−ρ)m/2); the problem is then to find the correlated pair.

Paturi et al. [37] gave the first non-trivial algorithm for this problem. The well-known locality sensitive hashing scheme of Indyk and Motwani [27] performs slightly worse than Paturi et al.'s hash-based algorithm. More recently, Dubiner [19] proposed a Bucketing Coding algorithm. As ρ gets small, all three of these algorithms have running time of the form n^{2−O(ρ)}; comparing the constants hidden in the exponents, Dubiner's algorithm achieves the best constant in the limit ρ → 0. Asymptotically the same bound was also achieved by May and Ozerov [32], who used algorithms for finding Hamming closest pairs to improve the running time of decoding random binary linear codes.

In a recent breakthrough result, Valiant [42] presented a fast matrix multiplication based algorithm which finds the "planted" closest pair in time n^{(5−ω)/(4−ω)+o(1)} with high probability for any constant ρ > 0, where ω is the exponent of fast matrix multiplication. The most striking feature of Valiant's algorithm is that ρ does not appear in the exponent of n in the running time of the algorithm. Karppa et al. [29] further improved Valiant's algorithm, obtaining a better exponent. Both Valiant and Karppa et al. achieved runtime of the form n^{2−Ω(1)} for the Light Bulb Problem, which improved upon previous algorithms that rely on Locality Sensitive Hashing (LSH) schemes; the LSH-based algorithms only achieve runtime of the form n^{2−O(ρ)} for the Light Bulb Problem.

We remark that all the above-mentioned algorithms (except May and Ozerov's work) that achieve state-of-the-art running times are based on either involved probabilistic polynomial constructions or impractical fast matrix multiplications, or both. (Subcubic fast matrix multiplication algorithms are practical only for the Strassen-based ones [12, 26]. Even though the recent breakthrough results [40, 49, 23] are asymptotically faster than Strassen's algorithm [41], these algorithms are all based on Coppersmith-Winograd's algorithm [16], and to the best of our knowledge there are no practical implementations of these trilinear-based algorithms.) Moreover, these algorithms are all randomized in nature, while our approach yields simple and practical randomized as well as deterministic algorithms.

1.1 Our approach

input : A set of n vectors x_0, ..., x_{n-1} in {0,1}^m and the minimum distance d
output : Two vectors and their distance
1 generate a binary code C
2 pick a random shift vector s in {0,1}^m
3 for i = 0 to n-1 do
4       decode x_i + s in C, and denote the resulting vector by y_i
5 end for
6 sort y_0, ..., y_{n-1}
7 for each of the n-1 pairs of adjacent vectors in the sorted list do
8       compute the distance between the two original vectors
9 end for
output the pair of vectors with the minimum distance and their distance
Algorithm 1: General Idea of Main Algorithm

We propose a simple, error-correcting-code based scheme for the Closest Pair Problem. Apart from achieving the best running time for a certain range of parameters, we believe that our new approach has the merit of being simple, and hence more likely to be practical as well. In particular, neither a complicated data structure nor fast matrix multiplication is employed in our algorithms.

The basic idea of our algorithms is very simple. Suppose for concreteness that (x_0, x_1) is the unique pair of vectors that achieves the minimum distance. Our scheme is inspired by the extreme case in which x_0 and x_1 are identical vectors. In this case, a simple sort-and-check approach solves the problem: sort all vectors and then compute only the n−1 pairwise distances of adjacent vectors in the sorted list (instead of all pairwise distances). Since the two closest vectors are identical, they must be adjacent in the sorted list, and thus the algorithm would compute their distance and find them. This motivates us to view the input vectors as received messages that were encoded by an error-correcting code and have been transmitted through a noisy channel. As a result, the originally identical vectors are no longer the same, but they are still very close. Directly applying the sort-and-check approach would fail, but a natural remedy is to decode these received messages into codewords first. Indeed, if the distance between x_0 and x_1 is small and we are lucky enough that some codeword w is very close to both of them, then a unique decoding algorithm would decode both of these two vectors into w. Now if we "sort" the decoded vectors and then "check" the corresponding original vectors of each adjacent pair in the sorted list (actually, we only need to "check" when the two adjacent decoded vectors are identical), the algorithm would successfully find the closest pair. How do we turn this "good luck" into a working algorithm? Simply try different shift vectors s and view x_0 + s, ..., x_{n-1} + s as the input vectors, since Hamming distances are invariant under any shift. The basic idea of our approach is summarized in Algorithm 1.
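To make the shift-decode-sort-check idea concrete, here is a minimal Python sketch of Algorithm 1. It uses a toy majority-vote repetition code as a stand-in for the code C (the construction actually used in the paper concatenates Reed-Solomon codes with Gilbert's greedy codes); the block size and number of trials are illustrative parameters, not the tuned choices analyzed later.

```python
import random

def hamming(x, y):
    """Hamming distance between two equal-length 0/1 tuples."""
    return sum(a != b for a, b in zip(x, y))

def decode_repetition(v, block):
    """Toy unique decoder: each block of `block` bits is replaced by its
    majority bit.  A stand-in for the efficiently decodable code C."""
    out = []
    for i in range(0, len(v), block):
        chunk = v[i:i + block]
        bit = 1 if 2 * sum(chunk) > len(chunk) else 0
        out.extend([bit] * len(chunk))
    return tuple(out)

def closest_pair_sketch(vectors, trials=200, block=8, seed=0):
    """Shift-decode-sort-check sketch of Algorithm 1."""
    rng = random.Random(seed)
    m = len(vectors[0])
    best_dist, best_pair = m + 1, None
    for _ in range(trials):
        s = tuple(rng.randint(0, 1) for _ in range(m))             # random shift
        shifted = [tuple(a ^ b for a, b in zip(v, s)) for v in vectors]
        decoded = [decode_repetition(v, block) for v in shifted]
        order = sorted(range(len(vectors)), key=lambda i: decoded[i])
        for i, j in zip(order, order[1:]):                         # adjacent pairs only
            dist = hamming(vectors[i], vectors[j])
            if dist < best_dist:
                best_dist, best_pair = dist, (i, j)
    return best_pair, best_dist
```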

Figure 1 illustrates the effects of "bad" and "good" shift vectors on the decoding part of our algorithm.

Figure 2 illustrates what happens if we sort the vectors directly and why sorting decoded vectors works.

Figure 1: Decoding with good and bad shift vectors. (a) bad shift; (b) good shift.
Figure 2: (a) Sorting the original vectors directly; (b) sorting the decoded vectors.

Making the idea of decoding work for larger minimum pairwise distances involves balancing the parameters of the error-correcting code so that it is efficiently decodable and also has an appropriate decoding radius r. The decoding radius r should have the following properties. On one hand, r should be small enough to ensure that there is a codeword w such that only the shifted copies of x_0 and x_1 are decoded into w (so that they become adjacent in the sorted array and hence are compared with each other). On the other hand, we would like r to be large so as to maximize the number of "good" shift vectors that cause both x_0 and x_1 to decode to the same codeword. As a result, our algorithms generally perform best when the closest pair distance is very small.

1.2 Our results

Our simple error-correcting code based algorithm can be applied to solve the Closest Pair Problem and the Light Bulb Problem.

1.2.1 The Closest Pair Problem

Our main result is the following simple randomized algorithm for the Closest Pair Problem.

Theorem 1.1 (Main).

Let x_0, x_1, ..., x_{n-1} in {0,1}^m be binary vectors such that (x_0, x_1) is the unique pair achieving the minimum pairwise distance d (and the second smallest distance can be as small as d+1). Suppose we are given the value of d. Then there is a randomized algorithm which finds the closest pair x_0 and x_1 with high probability, with a running time (stated precisely in Section 3) governed by the Zyablov rate function. The running time can be improved, with the Zyablov rate replaced by the Gilbert-Varshamov rate, if we are given black-box decoding algorithms for an ensemble of binary error-correcting codes that meet the Gilbert-Varshamov bound.

Here R_GV(·) and R_Z(·) are the maximum-rate functions given by the Gilbert-Varshamov (GV) bound and the Zyablov bound, respectively (see Section 2.1.5 for details).

The running time of our algorithm depends on the dimension m and the minimum distance d, in addition to the number of vectors n. To illustrate its performance, we choose two typical vector lengths m, namely those corresponding to the Hamming bound (also known as the sphere-packing bound, which gives an upper bound on the number of codewords a code can have given its block length and minimum distance) and the Gilbert-Varshamov (GV) bound (which is attained by random codes), and list the exponents in the running time of the GV-code version of our algorithm as a function of the relative distance δ = d/m in Table 1. Here, we write the running time of the algorithm as Õ(n^λ), where Õ suppresses any polylogarithmic factor in n. One can see that our algorithm runs in subquadratic time when δ is small, or equivalently when the Hamming distance between the closest pair is small compared to the length of the vectors.

           Hamming bound                          GV bound
δ = d/m    length of vector    exponent (λ)       length of vector    exponent (λ)
           (m / log n)                            (m / log n)
0.01       1.0476              1.0742             1.0879              1.0770
0.025      1.1074              1.1591             1.2029              1.1728
0.05       1.2029              1.2844             1.4013              1.3313
0.075      1.2999              1.4021             1.6242              1.5024
0.1        1.4013              1.5171             1.8832              1.6949
0.125      1.5090              1.6316             2.1909              1.9170
Table 1: Running time exponents of our algorithm when the vector length meets the Hamming bound and the GV bound

In the setting of m = c log n for some not-too-large constant c, Alman et al. [3] gave a randomized algorithm for the Closest Pair Problem which runs in subquadratic time. As it is very hard to calculate the hidden constant in the exponent of their running time, it is impossible to compare our running time with theirs quantitatively. However, as the running time of Alman et al. is of the form n^{2−f(c)} for some function f of c alone, it is reasonable to believe that our algorithms run faster when the minimum distance d is small enough.

Deterministic algorithm.

By checking all shift vectors up to a certain Hamming weight, our randomized algorithm can be easily derandomized to yield the following.

Theorem 1.2.

Let x_0, x_1, ..., x_{n-1} in {0,1}^m be binary vectors such that (x_0, x_1) is the unique pair achieving the minimum pairwise distance d (and the second smallest distance can be as small as d+1). Suppose we are given the value of d. Then there is a deterministic algorithm that finds the closest pair x_0 and x_1, with a running time whose exponent is expressed in terms of the binary entropy function H(·). Moreover, if we are given as a black box the decoding algorithm of a random Varshamov linear code with block length m and minimum distance greater than d, then the running time improves further.

Searching for d.

If we remove the assumption that d is given, our algorithm can be modified to first search for d without too much slowdown; more details appear in Section 4.

Theorem 1.3.

Let x_0, x_1, ..., x_{n-1} in {0,1}^m be binary vectors such that (x_0, x_1) is the unique pair achieving the minimum pairwise distance d. Then there is a randomized algorithm which finds d as well as the pair x_0 and x_1 with high probability, at the cost of only a small overhead over the running time of Theorem 1.1. The running time can again be improved if we are given black-box decoding algorithms for an ensemble of binary error-correcting codes that meet the Gilbert-Varshamov bound.

Gapped version.

Intuitively, if there is a gap between d and the second minimum distance, the Closest Pair Problem should be easier. This is reminiscent of the relation between the Approximate NNS Problem and the exact NNS Problem. However, as we still need to find the exact solution to the Closest Pair Problem, the situation here is different.

Theorem 1.4 (Gapped version).

Let x_0, x_1, ..., x_{n-1} in {0,1}^m be binary vectors such that (x_0, x_1) is the unique pair achieving the minimum pairwise distance d. Suppose we are given the value of d as well as the second minimum distance d_2. Then there is a randomized algorithm which finds the closest pair x_0 and x_1 with high probability, with a running time that improves with the gap between d and d_2. Moreover, the running time can be further improved if we are given black-box access to the decoding algorithm of a code which meets the Gilbert-Varshamov bound.

Our gapped-version algorithm uses a larger decoding radius, determined by the second minimum distance d_2 rather than by d alone. This, however, does not always give an improved running time, as illustrated in Figure 3. In Figure 3, we write the running time as Õ(n^λ) and plot the exponent λ for both the gapped version (the blue line) and the non-gapped version (the green line). One can see that the gapped version performs better only when the minimum distance is small enough.

Figure 3: The range of parameters in which the gapped version outperforms the non-gapped version

1.2.2 The Light Bulb Problem

Applying our algorithms for the Closest Pair Problem to the Light Bulb Problem easily yields the following.

Theorem 1.5.

There is a randomized algorithm for the Light Bulb Problem which finds the correlated pair and succeeds with high probability. The running time can be further improved if we are allowed a one-time preprocessing step that generates the decoding lookup table of a random Gilbert greedy code. Similar results can also be obtained for deterministic algorithms.

Our deterministic algorithm for the Light Bulb Problem is, to the best of our knowledge, the only known deterministic algorithm for the problem. Moreover, we believe that our algorithms are very simple and are therefore likely to outperform more complicated ones, at least for moderate input sizes.

1.3 Related work

The Nearest Neighbor Search problem.

The Closest Pair Problem is a special case of the more general Nearest Neighbor Search (NNS) problem, defined as follows. Given a set P of n vectors in {0,1}^m and a query point q as input, the problem is to find a point in P which is closest to q. The performance of an NNS algorithm is usually measured by two parameters: the space (which is usually proportional to the preprocessing time) and the query time. It is easy to see that any algorithm for NNS can also be used to solve the Closest Pair Problem, as we can try each vector in P as the query against the remaining vectors in P and output the pair with minimum distance; a minimal sketch of this reduction appears below.
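The following Python sketch spells out this reduction. Here nns_query is an assumed black-box NNS routine (any of the data structures discussed in this section could play that role), not an API from the paper.

```python
def closest_pair_via_nns(points, nns_query):
    """Reduce Closest Pair to NNS: query each vector against the remaining ones.
    `nns_query(dataset, q)` is an assumed black box returning the point of
    `dataset` closest to `q` in Hamming distance."""
    best_dist, best_pair = float("inf"), None
    for i, q in enumerate(points):
        rest = points[:i] + points[i + 1:]          # all vectors except the query
        p = nns_query(rest, q)
        dist = sum(a != b for a, b in zip(p, q))
        if dist < best_dist:
            best_dist, best_pair = dist, (q, p)
    return best_pair, best_dist
```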

Most early work on this problem is for fixed dimension. Indeed, in one dimension the problem is easy, as we can just sort the input vectors (which in this case are numbers) and then perform a binary search to find the closest vector to the input query. For higher fixed dimensions, Clarkson [15] gave an algorithm whose query time is polynomial in log n but whose space complexity is exponential in the dimension. Meiser [33] designed an algorithm whose query time is polynomial in the dimension, again with space complexity exponential in the dimension. So far, all efficient data structures for NNS have the dimension appear in the exponent of the space complexity, a phenomenon commonly known as the curse of dimensionality.

This motivated the introduction, in the 1990s, of a relaxed version of Nearest Neighbor Search called the c-Approximate Nearest Neighbor Search (c-Approximate NNS) Problem. The problem now is, for an input query point q, to find a point p in P such that the Hamming distance satisfies

dist(p, q) ≤ c · min_{p' ∈ P} dist(p', q).

We call such a p a c-approximate nearest neighbor of the input query q.
The c-Approximate NNS Problem has been studied extensively in the last two decades. In 1998, Indyk and Motwani [27] used a set of hash functions to store the dataset such that if two points are close enough, they are hashed into the same buckets with high probability. As a pair of close points has a higher probability than a pair of far-apart points of falling into the same bucket, the scheme is called locality sensitive hashing (LSH). The query time of LSH is sublinear in n, and its space complexity is subquadratic. After Indyk and Motwani introduced locality sensitive hashing, there have been many improvements on the parameters under different metric spaces, such as the ℓ_p metrics [31, 18, 6, 36, 34]. Recently, Andoni et al. [8] gave tight upper and lower bounds on the time-space trade-offs for hashing-based algorithms for the c-Approximate NNS Problem; theirs is the first algorithm that achieves sublinear query time and near-linear space for any approximation factor c > 1. For many results on the Approximate NNS problem in high dimensions, see e.g. [7] for a survey. Some algorithms for the low-dimensional problem are surveyed in [9].
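As a concrete illustration of the LSH idea in Hamming space, here is a minimal bit-sampling sketch in Python (bit sampling is the classical LSH family for {0,1}^m used by Indyk and Motwani); the parameters k and num_tables are illustrative defaults, not the theoretically optimal choices.

```python
import random

def bit_sampling_lsh(points, query, k=8, num_tables=20, seed=0):
    """Each table hashes a point to the values of k randomly chosen coordinates;
    only points colliding with the query in some table are examined exactly."""
    rng = random.Random(seed)
    m = len(query)
    tables = []
    for _ in range(num_tables):
        coords = [rng.randrange(m) for _ in range(k)]
        buckets = {}
        for idx, p in enumerate(points):
            key = tuple(p[c] for c in coords)
            buckets.setdefault(key, []).append(idx)
        tables.append((coords, buckets))
    # Query phase: collect candidates from colliding buckets, then check exactly.
    candidates = set()
    for coords, buckets in tables:
        key = tuple(query[c] for c in coords)
        candidates.update(buckets.get(key, []))
    if not candidates:
        return None
    return min(candidates,
               key=lambda i: sum(a != b for a, b in zip(points[i], query)))
```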

Recently, Valiant [42] leveraged fast matrix multiplication to obtain a new algorithm for the Approximate NNS Problem that is not based on LSH. The general setting of Valiant's results is the following. Suppose there is a set of n points in d-dimensional Euclidean (or Hamming) space, and we are promised that all pairwise Pearson correlations are small except for a single pair whose correlation ρ (which corresponds to the closest pair) is noticeably larger. Valiant's algorithm finds the closest pair in n^{(5−ω)/(4−ω)+o(1)} time, where ω is the exponent of fast matrix multiplication. Notice that, if the Pearson-correlation coefficient ρ is some fixed constant, then as ω approaches 2 the running time tends to n^{1.5}, which is less than the roughly n^{1.62} obtained with the current bounds on ω. Valiant applied his algorithms to get improved bounds for the Learning Sparse Parities with Noise Problem, the Learning Juntas with Noise Problem, the Learning Juntas without Noise Problem, and so on. (All these results are due to the fact that Valiant's algorithms are much more robust to weak correlations than other algorithms; our algorithms therefore do not give improved bounds for these learning problems in the general setting.) More recently, Karppa et al. [29] improved upon Valiant's algorithm and obtained an algorithm with a better exponent.

Note that, in general, algorithms for the c-Approximate NNS Problem cannot be used to solve the Closest Pair Problem, as the latter requires finding the exact closest pair of vectors.

Decoding Random Binary Linear Codes.

In 2015, May and Ozerov [32] observed that algorithms for the high-dimensional Nearest Neighbor Search Problem can be used to speed up the approximate matching part of information set decoding algorithms. They designed a new algorithm for the Bichromatic Hamming Closest Pair problem when the two input lists of vectors are pairwise independent, and consequently obtained a decoding algorithm for random binary linear codes that improved upon the previously best result of Becker et al. [11].

The Bichromatic Hamming Closest Pair problem.

In fact, the problem studied in [4, 3, 32] is the following Bichromatic Hamming Closest Pair Problem: we are given a set of red vectors and a set of blue vectors from {0,1}^m, and the goal is to find a red-blue pair with minimum Hamming distance. It is easy to see that the Closest Pair Problem is reducible to the Bichromatic Hamming Closest Pair Problem via a random reduction. Conversely, our algorithm for the Closest Pair Problem can also be easily adapted to solve the Bichromatic Hamming Closest Pair Problem as follows. Run the decoding part of our algorithm on both the red set and the blue set, sort the two decoded lists separately (without comparing the original vectors of adjacent pairs within each sorted list), then merge the two sorted lists into one, and compute the distance between the original vectors for each red-blue pair that is compared during the merging process; a sketch of this adaptation appears after this paragraph. On the other hand, the Bichromatic Closest Pair Problem is unlikely to have truly subquadratic algorithms under some mild conditions: assuming the Strong Exponential Time Hypothesis (SETH), for any ε > 0 there exists a constant c such that when the dimension is m = c log n, there is no n^{2−ε}-time algorithm for the Bichromatic Closest Pair Problem [4, 1, 48].
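The following Python sketch illustrates the bichromatic adaptation for a single (already applied) shift; the full algorithm would repeat this with random shift vectors exactly as in Algorithm 2. The decode argument is any unique-decoding black box, e.g. the toy decode_repetition above.

```python
def bichromatic_closest_pair_sketch(red, blue, decode):
    """Decode both colour classes, sort them separately, then merge the two
    sorted lists and check the original distance of every red-blue pair that
    gets compared during the merge."""
    m = len(red[0])
    red_dec = sorted((decode(v), i) for i, v in enumerate(red))
    blue_dec = sorted((decode(v), j) for j, v in enumerate(blue))
    best_dist, best_pair = m + 1, None
    a = b = 0
    while a < len(red_dec) and b < len(blue_dec):
        i, j = red_dec[a][1], blue_dec[b][1]
        dist = sum(x != y for x, y in zip(red[i], blue[j]))   # compare the two heads
        if dist < best_dist:
            best_dist, best_pair = dist, (i, j)
        if red_dec[a][0] <= blue_dec[b][0]:                   # standard merge step
            a += 1
        else:
            b += 1
    return best_pair, best_dist
```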

1.4 Organization

The rest of the paper is organized as follows. Preliminaries and notations that we use throughout the paper are summarized in Section 2. In Section 3 we present our main decoding-based algorithms for the Closest Pair Problem, assuming the minimum pairwise distance is given. We then show how to get rid of this assumption in Section 4. In Section 5, we apply our new algorithms to study the Light Bulb Problem. Finally, we conclude with several open problems in Section 6.

2 Preliminaries

Let n be a natural number; we use [n] to denote the set {1, 2, ..., n}. All logarithms in this paper are base 2 unless specified otherwise.

The binary entropy function, denoted H(·), is defined as H(p) = −p log p − (1−p) log(1−p) for p ∈ [0, 1], with the convention H(0) = H(1) = 0.

Let F_q be a finite field with q elements (when q = 2, we use F_2 and {0,1} interchangeably throughout the paper) and let m be a natural number. If x is an m-dimensional vector over F_q and i ∈ [m], then we use x_i to denote the i-th coordinate of x. The Hamming distance between two vectors x, y ∈ F_q^m is the number of coordinates at which they differ: dist(x, y) = |{i : x_i ≠ y_i}|. For a vector x ∈ F_q^m and a real number r ≥ 0, the Hamming ball of radius r around x is B(x, r) = {y ∈ F_q^m : dist(x, y) ≤ r}. The weight of a vector x, denoted wt(x), is the number of coordinates at which x_i ≠ 0. The distance between two vectors x and y is easily seen to equal wt(x − y).
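For readers who prefer code over notation, here are small Python helpers matching these definitions over {0,1}^m (the general F_q case is analogous); they are illustrative only.

```python
from math import log2

def binary_entropy(p):
    """H(p) = -p*log2(p) - (1-p)*log2(1-p), with H(0) = H(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

def dist(x, y):
    """Hamming distance: number of coordinates at which x and y differ."""
    return sum(a != b for a, b in zip(x, y))

def wt(x):
    """Weight: number of non-zero coordinates; over F_2, dist(x, y) = wt(x XOR y)."""
    return sum(1 for a in x if a != 0)

def in_ball(center, radius, y):
    """Membership test for the Hamming ball B(center, radius)."""
    return dist(center, y) <= radius
```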

2.1 Error correcting codes

Definition 2.1 (Error correcting codes).

Let F_q be a finite field with q elements (in fact, error-correcting codes, as well as the construction of new codes from existing ones by concatenation to be discussed shortly, can be defined more generally over an arbitrary set of q distinct elements called the alphabet of the code; for the purpose of designing the algorithms in this paper, restricting to finite fields is simpler and sufficient) and let m be a natural number. A subset C of F_q^m is called an (m, K, D)_q-code if |C| = K and for any two distinct vectors x, y ∈ C we have dist(x, y) ≥ D. The vectors in C are called the codewords of C, m is the block length of C, and D is the minimum distance of C.

Normalized by the block length m, (log_q K)/m is known as the rate of C and D/m is known as the relative distance of C. If C is a linear subspace of F_q^m of dimension k, the code is called a linear code and is denoted by [m, k, D]_q. It is convenient to view such a linear code as the image of an encoding function E : F_q^k → F_q^m, and k is called the message length of C. This can be generalized to non-linear codes as well, where we view log_q K as the effective message length. We usually drop the subscript q when q = 2.

Definition 2.2 (Covering radius).

Let C ⊆ F_q^m be a code. For any x ∈ F_q^m, define the distance between x and C to be dist(x, C) = min_{c ∈ C} dist(x, c) (clearly, dist(x, C) = 0 if and only if x is a codeword of C). The covering radius of a code C, denoted by rcov(C), is defined to be the maximum distance of any vector in F_q^m from C, i.e., rcov(C) = max_{x ∈ F_q^m} dist(x, C).

2.1.1 Unique decoding

Given an (m, K, D)-code C, if a vector x (aka the received word) is at distance less than D/2 from some codeword c in C, then by the triangle inequality, x is closer to c than to any other codeword in C. Therefore x can be uniquely decoded to the codeword c. Such a decoding scheme (strictly speaking, the procedure described here is error correction rather than decoding, where the latter should return the preimage of the codeword under the encoding function) is called unique decoding (or minimum distance decoding) of the code C, and we shall call ⌊(D−1)/2⌋ the (unique) decoding radius of C.

2.1.2 Gilbert-Varshamov bound and Gilbert’s greedy code

The Gilbert-Varshamov bound asserts that there is an infinite family of binary codes with rate R and relative distance δ satisfying R ≥ 1 − H(δ); essentially, random codes, and even random linear codes, meet this bound almost surely. In particular, the following greedy algorithm of Gilbert [24] finds a (non-linear) binary code of block length m and minimum distance D whose size is at least 2^m / |B(0, D−1)|, which is 2^{(1−H(δ)−o(1))m} for any fixed relative distance δ = D/m < 1/2 and all sufficiently large m. Start with C = ∅ and S = {0,1}^m; while S ≠ ∅, pick any element x ∈ S, add it to C, and remove all the elements of B(x, D−1) from S. We denote such a code by C_Gil.

We will need the following simple facts about Gilbert's greedy code C_Gil.

Lemma 2.3.

The greedy algorithm of Gilbert can be implemented to run in 2^{O(m)} time, and it produces a decoding lookup table that supports constant-time unique decoding. That is, for any x ∈ {0,1}^m, if there is a codeword c with dist(x, c) < D/2, then the lookup entry of x is c; otherwise the entry is a special symbol, say ⊥. Moreover, the code constructed by Gilbert's greedy algorithm has covering radius at most D − 1.
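A minimal Python sketch of Gilbert's greedy construction and of a decoding lookup table in the spirit of Lemma 2.3 is given below; it enumerates all of {0,1}^m, so it is only meant for the logarithmically small block lengths used later in the paper.

```python
from itertools import combinations, product

def hamming(x, y):
    return sum(a != b for a, b in zip(x, y))

def gilbert_greedy_code(m, D):
    """Gilbert's greedy construction: repeatedly pick any remaining word as a
    codeword and delete every word within distance D-1 of it."""
    remaining = set(product((0, 1), repeat=m))
    code = []
    while remaining:
        c = remaining.pop()
        code.append(c)
        remaining = {w for w in remaining if hamming(w, c) >= D}
    return code

def decoding_table(code, radius):
    """Lookup table for unique decoding within `radius` (assumed < D/2): maps
    every word within `radius` of a codeword to that codeword."""
    table = {}
    for c in code:
        m = len(c)
        for r in range(radius + 1):
            for flips in combinations(range(m), r):   # all words at distance r from c
                w = list(c)
                for i in flips:
                    w[i] ^= 1
                table[tuple(w)] = c
    return table

# Example: a short code of minimum distance 3 and its radius-1 lookup table.
code = gilbert_greedy_code(m=7, D=3)
table = decoding_table(code, radius=1)
```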

2.1.3 Reed-Solomon codes

Definition 2.4 (Reed-Solomon codes).

Let F_q be a finite field, and let k ≤ n ≤ q be integers. The encoding function for the Reed-Solomon code from F_q^k to F_q^n is the following. First pick n distinct elements α_1, ..., α_n ∈ F_q; on input (m_0, m_1, ..., m_{k−1}) ∈ F_q^k, define the degree-(k−1) polynomial P(X) = m_0 + m_1 X + ... + m_{k−1} X^{k−1}; finally, output the evaluations of P at α_1, ..., α_n, i.e., the codeword is (P(α_1), ..., P(α_n)). We denote such a code by RS_q[n, k].
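A minimal encoding sketch over a prime field, written in Python for concreteness (the choice of the prime p and of the evaluation points is illustrative):

```python
def rs_encode(message, alphas, p):
    """Reed-Solomon encoding over the prime field GF(p): the message
    (m_0, ..., m_{k-1}) defines P(X) = m_0 + m_1*X + ... + m_{k-1}*X^(k-1),
    and the codeword is (P(alpha_1), ..., P(alpha_n))."""
    assert len(set(alphas)) == len(alphas), "evaluation points must be distinct"
    def poly_eval(coeffs, x):
        acc = 0
        for c in reversed(coeffs):     # Horner's rule, all arithmetic mod p
            acc = (acc * x + c) % p
        return acc
    return [poly_eval(message, a) for a in alphas]

# Example: RS_7[n=6, k=3].  Two distinct messages agree on at most k-1 = 2
# evaluation points, so the minimum distance is n - k + 1 = 4.
codeword = rs_encode([2, 0, 5], alphas=[1, 2, 3, 4, 5, 6], p=7)
```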

Theorem 2.5.

The Reed-Solomon code defined above is an [n, k, n − k + 1]_q linear code.

Theorem 2.6 ([45]).

There exists an efficient unique decoding algorithm for Reed-Solomon codes which runs in time polynomial in the block length.

Reed-Solomon codes are optimal in the sense that they meet the Singleton bound, which states that every linear [n, k, D]_q-code satisfies D ≤ n − k + 1.

2.1.4 Concatenated codes

The most commonly used way to transform a nice code, i.e., one with constant rate and constant relative distance, over a large alphabet into a similarly nice code over the binary alphabet is concatenation, which was first introduced by Forney [21].

Definition 2.7 (Concatenated codes).

Let C_out be an (N, K, D)_Q-code and let C_in be an (n, Q, d)_2-code, so that the number of codewords of C_in equals the alphabet size of C_out. Then the code obtained by concatenating C_out with C_in, denoted by C_out ∘ C_in, is a binary code defined as follows. Let φ be any one-to-one mapping from F_Q onto C_in. The codewords of C_out ∘ C_in are obtained by replacing each symbol σ_i of any codeword (σ_1, ..., σ_N) of C_out with the corresponding codeword φ(σ_i) of C_in; namely, (σ_1, ..., σ_N) is mapped to φ(σ_1) ∘ φ(σ_2) ∘ ... ∘ φ(σ_N), where each φ(σ_i) consists of n bits and ∘ denotes string concatenation. Note that each codeword of C_out ∘ C_in is an element of {0,1}^{Nn} and there are K such codewords; therefore the block length of C_out ∘ C_in is Nn and its size is K. Usually C_out is called the outer code and C_in is called the inner code.

It is well known that the minimum distance of C_out ∘ C_in is at least Dd, and the rate of C_out ∘ C_in is the product of the rates of C_out and C_in. Another useful fact is that C_out ∘ C_in can be efficiently decoded as long as both C_out and C_in can be efficiently decoded.

Fact 2.8.

Suppose C_out is an (N, K, D)_Q-code with a decoding algorithm running in time T_out, and C_in is an (n, Q, d)_2-code, with |C_in| = Q, whose decoding algorithm runs in time T_in. If C is the concatenated code C_out ∘ C_in, then there is a decoding algorithm for C which runs in time O(N · T_in + T_out): first decode each of the N blocks of n bits as a received word of C_in, and then decode the resulting word of N symbols as a received word of C_out.
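A minimal Python sketch of concatenated encoding and of the two-stage decoding described in Fact 2.8; inner_encode, inner_decode, and outer_decode stand for black-box routines of the inner and outer codes and are assumptions for illustration.

```python
def concat_encode(outer_codeword, inner_encode):
    """Replace every outer symbol by its inner codeword and concatenate the bits.
    `inner_encode` maps an outer-alphabet symbol to a tuple of bits."""
    bits = []
    for symbol in outer_codeword:
        bits.extend(inner_encode(symbol))
    return tuple(bits)

def concat_decode(received_bits, inner_block, inner_decode, outer_decode):
    """Two-stage decoding as in Fact 2.8: decode each length-`inner_block` chunk
    with the inner decoder to recover an outer symbol, then run the outer decoder."""
    symbols = [inner_decode(received_bits[i:i + inner_block])
               for i in range(0, len(received_bits), inner_block)]
    return outer_decode(symbols)
```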

2.1.5 Codes used in our algorithms

Some of the codes employed in our algorithms are a family of codes constructed by concatenating Reed-Solomon codes with certain binary non-linear Gilbert's greedy codes that meet the Gilbert-Varshamov bound. It is well known that concatenated codes so constructed can be made to meet the so-called Zyablov bound (in fact, a stronger bound called the Blokh-Zyablov bound can be achieved by applying multilevel concatenation, see e.g. [20] for a detailed discussion; however, as the improvement is minor, we only use single-level concatenation in our code constructions to keep the algorithms simple):

R_Z(δ) = max_{δ ≤ δ' ≤ 1/2} (1 − H(δ')) · (1 − δ/δ').     (1)

Suppose we want a binary (m, K, D)-code for our algorithms, where m and D are fixed and our goal is to maximize K, subject to the code being efficiently decodable. We pick a Reed-Solomon code C_out and a Gilbert's greedy code C_in whose parameters are chosen so that the block length of C_out ∘ C_in is (as close as possible to) m, its minimum distance is at least D, and the block length of the inner code is only logarithmic in m, so that the inner code can be constructed and decoded by brute force. It is easy to check that there are large ranges of admissible parameters, and optimizing how the relative distance is split between the inner and outer codes makes our concatenated code both meet the Zyablov bound in Eqn. (1) and be decodable in time polynomial in m.

We will denote the maximum rate as a function of the relative distance δ given by the Zyablov bound by R_Z(δ), and similarly denote the maximum rate given by the Gilbert-Varshamov bound by R_GV(δ) (i.e., R_GV(δ) = 1 − H(δ)). Note that R_Z(δ) < R_GV(δ) for all δ ∈ (0, 1/2), and the reason we use codes achieving only the Zyablov bound is that such codes can be generated and decoded in polynomial time.

2.2 The closest pair problem

Given n vectors in {0,1}^m, the Closest Pair Problem is to find two vectors whose pairwise Hamming distance is minimum. For ease of exposition and without loss of generality, we will assume throughout the paper that there is a unique pair, namely x_0 and x_1, that achieves the minimum pairwise distance d. We will use d_2 to denote the second minimum pairwise distance, so that d < d_2. In the most general case, we make no further assumption about d or d_2.

3 Main Algorithm for the Closest Pair Problem

We now present our Main Algorithm for the Closest Pair Problem. For ease of exposition, we make the somewhat unnatural assumption that the value of d is given. However, as we show in Section 4, the algorithm can be modified to get rid of this assumption, with only a slight slowdown in running time.

Theorem 3.1 (Non-gapped version).

Let x_0, x_1, ..., x_{n-1} in {0,1}^m be binary vectors such that (x_0, x_1) is the unique pair achieving the minimum pairwise distance d (and the second smallest distance can be as small as d+1). Suppose we are given the value of d. Then there is a randomized algorithm which finds the closest pair x_0 and x_1 with high probability; the number of random shifts it needs, and hence its running time, is bounded in the proof below.

Proof.

Our Main Algorithm for the Closest Pair problem is described in Algorithm 2, and the decoding subroutine is illustrated in Algorithm 3. Note that we choose the minimum distance of C to be larger than d, so that the unique decoding radius of C is at least d/2 (without loss of generality, assume that d is even).

For the correctness of the algorithm, first note that our algorithm outputs the correct minimum distance if and only if x_0 is at some point compared against x_1 when computing pairwise distances, and this happens if and only if the decoded vectors of x_0 and x_1 are adjacent in the sorted array. A sufficient condition for the latter is that the decoded vectors of x_0 + s and x_1 + s are identical and differ from all other decoded vectors.

How many shift vectors s in Algorithm 2 satisfy this condition? We will call such vectors good vectors. Denote the set of vectors lying at the "middle" between x_0 and x_1 by mid = {v ∈ {0,1}^m : dist(v, x_0) = dist(v, x_1) = d/2}.

Note that any shift vector s that shifts some vector v ∈ mid to a codeword is a good vector. To see this, first note that after such a shift, w = v + s is a codeword of C, and both x_0 + s and x_1 + s lie within the decoding radius d/2 of w, and therefore both will be decoded to w. Moreover, the shifted vector x_j + s of any other input vector x_j, j ∉ {0, 1}, lies outside the decoding radius of w. Indeed, if it did not, then by the triangle inequality and the fact that the decoding radius of C is d/2,

dist(x_j, x_0) ≤ dist(x_j + s, w) + dist(w, x_0 + s) ≤ d/2 + d/2 = d,

contradicting our assumption that (x_0, x_1) is the unique pair achieving the minimum distance d.

How many such good vectors are there? There are in total (d choose d/2) vectors in mid (a vector in mid agrees with x_0 and x_1 on the coordinates where they agree, and takes exactly d/2 of the d disagreeing coordinates from each), and all their pairwise distances are at most d. Let w_1 and w_2 be two distinct codewords in C; by our choice of the minimum distance of C, dist(w_1, w_2) > d. Consider any two distinct vectors v_1 and v_2 in mid. Clearly, applying these two vectors to the same codeword gives two distinct shift vectors, namely v_1 + w and v_2 + w. Moreover, applying vectors in mid to two distinct codewords also results in two distinct shift vectors, because

dist(v_1 + w_1, v_2 + w_2) ≥ dist(w_1, w_2) − dist(v_1, v_2) > 0,

since dist(w_1, w_2) > d but dist(v_1, v_2) ≤ d.

Recall that C is an (m, K, D)-code with D > d, and hence there are K codewords in C. It follows that there are in total K · (d choose d/2) good shift vectors of this kind. Therefore a uniformly random shift vector is good with probability at least K · (d choose d/2) / 2^m, and hence repeatedly selecting on the order of 2^m / (K · (d choose d/2)) independent shift vectors s succeeds with constant probability (and a further logarithmic number of repetitions boosts the success probability as high as desired).

Finally, note that each choice of shift vector requires decoding all n shifted vectors as well as sorting them and comparing adjacent pairs, so the total running time of the algorithm is the number of repetitions times the cost of one such decode-sort-check pass. ∎
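As a compact summary of the counting above (a sketch under the stated reconstruction, with K denoting the number of codewords of C and the decoding radius equal to d/2):

```latex
\Pr_{s}\bigl[\,s \text{ is good}\,\bigr]
  \;\ge\; \frac{|\mathcal{C}|\cdot|\mathrm{mid}|}{2^{m}}
  \;=\; \frac{K\binom{d}{d/2}}{2^{m}},
\qquad\text{so}\qquad
T \;=\; O\!\left(\frac{2^{m}}{K\binom{d}{d/2}}\right)
\ \text{independent shifts suffice for constant success probability.}
```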

input : A set of n vectors x_0, ..., x_{n-1} in {0,1}^m and the minimum distance d
output : Two vectors and their distance
1 generate a binary (m, K, D)-code C with D > d, and set the decoding radius r = d/2
2 for t = 1 to T do
3       pick a random s in {0,1}^m
4       for i = 0 to n-1 do
5             y_i := Decode(C, r, x_i + s)
6       end for
7       sort y_0, ..., y_{n-1}
8       (suppose the sorted sequence is y_{π(0)}, ..., y_{π(n-1)}, where π is a permutation of the indices)
9       for i = 0 to n-2 do
10             compute dist(x_{π(i)}, x_{π(i+1)})
11       end for
12 end for
output the pair of vectors with minimum distance ever found and their distance
Algorithm 2: Main Algorithm for the Closest Pair Problem
input : A binary (m, K, D)-code C, a decoding radius r, and a vector x in {0,1}^m
output : A vector y
1 run the (efficient) decoding algorithm for C on the input vector x, and let the output vector be w
2 if dist(x, w) ≤ r then
3       output y := w
4 else
5       output y := x
6 end if
Algorithm 3: The decoding subroutine Decode(C, r, x)

If we assume further that a decoding algorithm for some binary code of block length m which meets the Gilbert-Varshamov bound is given as a black box, then the running time in Theorem 3.1 can be improved accordingly. Note that this is not a totally unrealistic assumption: in most interesting settings, m = c log n for some small constant c. Therefore, greedily searching for a binary code of block length m that meets the Gilbert-Varshamov bound amounts to a one-time preprocessing of 2^{O(m)} = n^{O(c)} time, which can be reused for any problem instance with the same vector length and minimum closest pair distance.

If there is a gap between d and d_2 (this roughly corresponds to the approximate closest pair problem in [42]), then we can improve the running time of the Main Algorithm in Theorem 3.1 by exploiting an error-correcting code with a larger decoding radius.

Theorem 3.2 (Gapped version).

Let x_0, x_1, ..., x_{n-1} in {0,1}^m be binary vectors such that (x_0, x_1) is the unique pair achieving the minimum pairwise distance d. Suppose we are given the value of d as well as the second minimum distance d_2. Then there is a randomized algorithm which finds the closest pair x_0 and x_1 with high probability, with a running time that improves with the gap between d and d_2. Moreover, the running time can be further improved if we are given black-box access to the decoding algorithm of a code which meets the Gilbert-Varshamov bound.

Proof.

The proof follows the same structure as the proof of Theorem 3.1. The main difference is that we now pick a binary error-correcting code with a larger minimum distance, and hence a larger decoding radius, determined by the second minimum distance d_2 rather than by d alone (once again, for simplicity, we assume the relevant quantities are even).

Accordingly, the "middle point" set mid is now defined with respect to this larger decoding radius, as the set of vectors that lie within the decoding radius of both x_0 and x_1.

We now give a lower bound on the size of mid.

Without loss of generality, we assume and let . Clearly . Let and . Then