Contemporary datasets are most often represented as points in a high-dimensional space. Many algorithms are based on the distances induced between those points. Thus, distance computation has emerged as a fundamental scalability bottleneck in many large-scale applications, and spurred a large body of research on efficient approximate algorithms. In particular, a typical goal is to design efficient data structures that, after preprocessing a given set of points, can report approximate distances between those points.
An important complexity measure of these data structures is the space they occupy. Small space usage enables storing more points in the main memory for faster access [JDS11], exploiting fast memory-limited devices like GPUs [JDJ17], and facilitating distributed architectures where communication is limited [CBS20], among other benefits. Indeed, a long line of applied research (e.g., [SH09, WTF09, JDS11, JDJ17, SDSJ19], see also Section 1.3) has been able to perform tasks like image classification at unprecedented scales, by designing distance-preserving space-efficient bit encodings of high-dimensional points.
These methods, while empirically successful, are heuristic in nature and do not possess worst-case guarantees on their accuracy. From a theoretical point of view, the problem can be formalized as follows: What is the minimal amount of space required to represent all distances between the given data points, up to a given relative error? In the notable case of Euclidean distances, a fundamental compression result is the dimension reduction theorem of Johnson and Lindenstrauss [JL84], which has been refined to a space-efficient bit encoding (often called a sketch) in a sequence of well-known follow-up works [KOR00, Ach03, AMS99, CCFC02]. However, despite these prominent results, it was not known whether these bounds are tight for compression of Euclidean metrics. In this work, we close this gap and obtain improved and tight sketching bounds for Euclidean metrics, as well as for metrics and general metric spaces.
1.1 Problem definition
The metric sketching problem is defined as follows:
Definition (metric sketching).
Let and . In the -metric sketching problem, we are given a set of points with -distances in the range . We need to design a pair of algorithms:
Sketching algorithm: given , it computes a bitstring called a sketch.
Estimation algorithm: given the sketch, it can report for every a distance estimate such that
The goal is to minimize the bit length of the sketch.
Put simply, the goal is to represent all distances between , up to distortion , using as few bits as possible. The sketching algorithm can be randomized. In that case, we require that with probability it returns a sketch such that the requirement of the estimation algorithm is satisfied for all pairs simultaneously. The estimation algorithm is generally deterministic.
We remark that the assumption on the distances being in is essentially without loss of generality, by scaling. If , we can store in the sketch a -approximation of and scale all distances down by . This increases the total sketch size additively by bits. Then, in all bounds below, becomes the aspect ratio, which is the ratio of largest to smallest distance in the given point set.
The most notable case is Euclidean metrics, or . For this case, the celebrated Johnson-Lindenstrauss (JL) dimensionality reduction theorem [JL84] enables reducing the dimension of the input point set to . By the recent result of Larsen and Nelson [LN17], this bound is tight. The JL theorem leads to a sketch of size machine words per point. The bit size of the sketch generally depends on the numerical range of distances, encompassed by the parameter (a typical setting to consider below is ). [Footnote 1: We remark that naïvely rounding each coordinate of the dimension-reduced points to its nearest power of does not yield a valid sketch. For example, consider two coordinates with values and , where for some integer . The squared difference between them is , whereas after rounding it becomes , and the distortion is unbounded.]
For example, if the coordinates of the input points are integers in the range (note that in this case the diameter is ), then the discretized variant of [JL84] due to Achlioptas [Ach03], and related algorithms like AMS sketch [AMS99] and CountSketch [CCFC02, TZ12], yield a sketch size of bits per point. More generally, for any point set with diameter (regardless of coordinate representation), the distance sketches of Kushilevitz, Ostrovsky and Rabani [KOR00] yield a sketch size of bits per point. Perhaps surprisingly, prior to our work, it was not known whether this “discretized JL” upper bound is tight for metric sketching, or can be improved much further. We show it is indeed not tight, by proving an improved and optimal bound of amortized bits per point. Our result thus establishes that sketching techniques can go beyond dimension reduction in compressing Euclidean metric spaces.
The above formulation also captures sketching of general metric spaces — that is, the input is any metric space with distances between and — since they embed isometrically into with dimension . Specifically, for every , one defines . It is not hard to see that for every . General metric sketching has been studied extensively, under the name distance oracles [TZ05], in larger distortion regimes than , see Section 1.3. We provide tight bounds for distortion .
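The isometric embedding described above can be checked on a small example. The following is a minimal sketch, assuming the standard construction in which each point is mapped to the vector of its distances to all points of the metric; the 4-point metric (shortest paths on a 4-cycle) and the function names are illustrative and not taken from the paper.

```python
def frechet_embedding(dist):
    """Map point i to the vector (dist[i][0], ..., dist[i][n-1])."""
    n = len(dist)
    return [[dist[i][k] for k in range(n)] for i in range(n)]


def linf(u, v):
    """l_infinity distance between two vectors."""
    return max(abs(a - b) for a, b in zip(u, v))


# Hypothetical input: shortest-path metric of a cycle on 4 nodes.
cycle_metric = [
    [0, 1, 2, 1],
    [1, 0, 1, 2],
    [2, 1, 0, 1],
    [1, 2, 1, 0],
]
embedded = frechet_embedding(cycle_metric)

# Every pairwise l_infinity distance equals the original distance.
preserved = all(
    linf(embedded[i], embedded[j]) == cycle_metric[i][j]
    for i in range(4)
    for j in range(4)
)
```

The coordinate indexed by the point itself already attains the original distance, and the triangle inequality prevents any other coordinate from exceeding it, which is why the embedding is isometric.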
1.2 Our results
We resolve the optimal sketching size with distortion for several important classes of metrics: Euclidean metrics, (a.k.a. Manhattan) metrics, and general metrics. We start with our main results for Euclidean metrics.
Theorem 1.1 (Euclidean metric compression).
For -metric sketching with points (of arbitrary dimension) and distances in , bits are sufficient. If the input dimension is , then bits are also necessary.
The sketching algorithm is randomized and has running time , where is the ambient dimension of the input metric, and is an arbitrarily small constant. [Footnote 2: As , the sketch size increases as .] The estimation algorithm runs in time .
Theorem 1.1 improves over the best previous bound of , mentioned above. It also strengthens the upper bound of [AK17] for sketching with additive error (see Section 1.3), and resolves an open problem posed by them.
We note that the sketching algorithm in the above theorem is randomized. This means that with probability , it may output a sketch that distorts the distances by more than a factor. However, this does not affect the sketch size nor the running time.
We proceed to general metric spaces.
Theorem 1.2 (General metric compression).
For general metric sketching with points and distances in , bits are both sufficient and necessary.
Note that storing all distances exactly in a general metric takes at least bits. Naïvely, one could round each distance to its nearest power of , which yields a sketch of size bits. Theorem 1.2 improves the second term to . For example, for the goal of reporting a -approximation of each distance (i.e., for all ), where the input distances are polynomially bounded (), we get a tight bound of bits, compared to the naïve bound of bits.
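The naïve rounding baseline mentioned above can be illustrated as follows. This is a minimal sketch under stated assumptions: distances are at least 1, each distance is rounded up (rather than to the nearest power) so that the estimate never underestimates, and only the integer exponent is stored. The function names are hypothetical.

```python
import math


def encode_distance(dist, eps):
    """Exponent k of the smallest power (1 + eps)**k that is >= dist (dist >= 1)."""
    return math.ceil(math.log(dist) / math.log1p(eps))


def decode_distance(k, eps):
    """Distance estimate recovered from a stored exponent."""
    return (1.0 + eps) ** k


eps = 0.1
distances = [1.0, 1.5, 7.3, 1000.0]
exponents = [encode_distance(d, eps) for d in distances]
estimates = [decode_distance(k, eps) for k in exponents]
```

Each stored exponent is an integer between 0 and roughly the logarithm of the largest distance to base 1 + eps, so its bit length is the logarithm of that quantity. This is the per-distance cost that the naïve bound pays for every one of the pairs.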
Both of the theorems above are based on a more general upper bound, that holds for all -metrics.
Theorem 1.3 (-metric compression).
Let . For -metric sketching with points in dimension and distances in , bits are sufficient. The sketching algorithm is deterministic and runs in time . The estimation algorithm runs in time for , and for .
The upper bound of Theorem 1.2 follows immediately from Theorem 1.3, since as mentioned earlier, general metric spaces with points embed isometrically into with dimension . Similarly, for Euclidean metrics, one can apply the Johnson-Lindenstrauss transform as a preprocessing step in order to reduce the dimension of the input points to , and then apply Theorem 1.3. This gives an upper bound looser than that of Theorem 1.1 by . To obtain the tight bound, we will use additional properties special to Euclidean metrics.
1.3 Additional related work
|Reference||Bits per point||No. queries||Query type|
|Related work||[JL84, Ach03, KOR00, …]||distances|
|Our approach||Theorem 1.1||—||none|
Sketching with additive error.
In a work concurrent to our original paper [IW17], Alon and Klartag [AK17] studied a closely related problem of approximating squared Euclidean distances between points of norm at most , up to an additive error of (whereas distortion , as in Section 1.1, is equivalent to relative error ). For this problem, they proved a tight sketching bound of bits per point. [Footnote 3: Note that in this model, the parameter does not need to enter the sketch size, since the error is allowed to be arbitrarily larger than the minimal distance in the pointset.] Further work on additive error refined this result by parameterizing the reduced dimension by complexity measures of the embedded pointsets, designing faster or deterministic algorithms, or introducing sigma-delta-style quantization [DS20, Sto19, HS20, ZS20, PS20].
Sketching with additive error is generally less restrictive than relative error, in the following sense: on one hand, a relative error sketch implies an additive error sketch by setting , [Footnote 4: To this end, given a pointset with norms bounded by , let be an -separated -net of the unit ball (see Section 2 for the definitions). Let be the respective nearest neighbors of in . The separation property of implies that has aspect ratio at most . We may now sketch the distances in using a relative error sketch with , since given , reporting the distance between instead of increases the additive error by at most by the triangle inequality.] and on the other hand, lower bounds for additive error hold for relative error as well. In particular, as we discuss in Section 6, the lower bound from [AK17] (as well as the lower bound in another concurrent paper [LN17]) provides another way to show the lower bound in Theorem 1.1.
New query points and nearest neighbor search.
In the model considered in this paper, the query algorithm needs to report distances only between points that were fully known to the sketching algorithm (see Section 1.1). In a closely related but different setting, the query algorithm gets a set of new points , that were not known to the sketching algorithm, and needs to estimate distances between each and each . A notable example of this setting is the nearest neighbor search problem.
The classical dimension reduction approach, which yields a dimension bound of and a sketching bound of bits per point, can handle as many as query points. Very recently, a new line of work known as terminal dimension reduction [MMMR18, NN19] was able to obtain the same bounds for an unbounded number of query points . On the other hand, the papers [JW13, MWY13] proved a matching lower bound of bits for sketching or distances, if , settling the optimal sketch size in this regime.
In a companion work [IW18], we develop the techniques of the current paper and prove nearly tight sketching bounds in the complement regime , interpolating between Theorem 1.1 and the above tight bounds for . Furthermore, we show that for the easier task of reporting an approximate nearest neighbor in the dataset for each query point (rather than estimating all distances between dataset points and query points), a better sketching upper bound is possible. The picture is summarized in Table 1.
A prominent line of applied research (including [SH09, WTF09, JDS11, GLGP12, NF13, GHKS13, KA14, JDJ17, SDSJ19]; see also the surveys [WLKC16, WZS18]) has been dedicated to designing empirical solutions to the problem in Section 1.1, under the label learning to hash. This nomenclature reflects the fact that in the preprocessing stage, these methods employ machine learning techniques to adapt the sketches to the given dataset, in order to optimize performance. While empirically successful, these methods are fundamentally heuristic and do not provide formal solutions to the problem in Section 1.1. In a companion work [IRW17], we design a sketch that on one hand has close to optimal worst-case guarantees (in particular, its size is larger than that of Theorem 1.1 by ), while on the other hand it empirically matches or improves the performance of state-of-the-art heuristic methods.
The distance oracle problem [TZ05] is equivalent to sketching of general metrics, and has been studied in a different distortion regime. A long line of work (including [PS89, ADD93, Mat96, TZ05, WN12, Che15] and more) has shown that for every integer , it is possible to compute a sketch of size with distortion , which is tight up to logarithmic factors under the Erdős Girth Conjecture. Notably, for distortion and above, the sketch size is . (However, note that in order to achieve a near-linear sketch size, the distortion must be almost logarithmic.) On the other hand, for any distortion less than , it is not hard to show (by considering all shortest-path metrics induced by bipartite simple graphs) that a sketch size of is necessary. For distortion , to our knowledge, the best upper bound prior to our work had been bits, which follows from naïve rounding as mentioned earlier.
1.4 Technical overview
The basic strategy in the sketch is to store each point by its location relative to a nearby point which has already been (approximately) stored. Note that this is different from dimension reduction and its discretizations, which approximately store the location of each point in the space in an absolute sense.
More precisely, let be the point set we wish to sketch. For every point , we aim to define a surrogate , which is an approximation of that can be efficiently stored in the sketch. To this end, we choose an ingress point near , and define inductively by its location relative to , namely , where denotes rounding to a -net, with an appropriate precision . We then hope to use the distance between the surrogates, , as an estimate for the distance for all pairs . The challenge is to choose the ingresses and the precisions in a way that on one hand ensures a small relative error estimate for each pair, while on the other hand does not occupy too many storage bits.
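The inductive definition of surrogates can be illustrated with a deliberately simplified toy version. The sketch below makes several assumptions that are not part of the paper's construction: the points form a chain in which each point's ingress is simply the previous point, a single fixed precision (grid cell side) is used for all points, rounding is coordinate-wise rounding down, and the first point is stored exactly. The names `round_down` and `chain_surrogates` are hypothetical.

```python
import math


def round_down(vec, cell):
    """Round each coordinate down to an integer multiple of `cell`."""
    return [math.floor(c / cell) * cell for c in vec]


def chain_surrogates(points, cell):
    """Surrogate of each point = surrogate of its ingress + rounded displacement.

    Toy assumption: the ingress of points[i] is points[i - 1].
    """
    surrogates = [list(points[0])]  # first point stored exactly (assumption)
    for x in points[1:]:
        base = surrogates[-1]  # the ingress's surrogate, not the ingress itself
        displacement = [a - b for a, b in zip(x, base)]
        rounded = round_down(displacement, cell)
        surrogates.append([b + r for b, r in zip(base, rounded)])
    return surrogates


pts = [[0.0, 0.0], [1.3, 0.7], [2.9, 1.1], [3.05, 4.4], [7.77, 4.41]]
cell = 0.25
surr = chain_surrogates(pts, cell)
errors = [max(abs(a - b) for a, b in zip(x, s)) for x, s in zip(pts, surr)]
```

Because each displacement is measured from the already-stored surrogate rather than from the true ingress, the rounding error of a point stays below one grid cell and does not accumulate along the chain. Only the rounded displacements need to be stored, and their magnitude depends on the distance to the ingress rather than on the absolute position of the point.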
In order to ensure a relative error approximation of every distance, we need to consider all possible distance scales. To this end we construct a hierarchical clustering tree of the metric space, and define the ingresses and surrogates for clusters (or tree nodes) instead of individual points. Here, it may seem natural to use separating decomposition trees such as [Bar96, CCG98, FRT04], which provide both a separating property (far points are in different clusters) and a packing property (close points are often in the same cluster). However, such trees are bound to incur a super-constant gap between the two properties [Bar96, Nao17], which would lead to a suboptimal sketch size. Instead, our tree transitively merges any two clusters within a sufficiently small distance. This yields a perfect separation property, but no packing property: the diameter of each cluster may be unbounded. We replace it by a global bound on all cluster diameters in the tree (Section 3.1).
The tree size is first reduced to linear by compressing long non-branching paths. From a distance estimation point of view, this means that if a cluster is very well separated from the rest of the metric, then we can replace it entirely with one representative point (called center) for the purpose of estimating the distances between internal and external points. Then, the crucial step is a careful choice of the ingresses, that ensures that if we set the precisions so as to get correct estimates between all pairs, the total sketch occupies sufficiently few bits. This completes the description of the data structure, which we call relative location tree.
In order to estimate the distance for a given pair , we can identify in the tree two nodes , such that (i) the center of is a sufficiently good proxy for from the point of view of , and vice-versa, (ii) the error between the center of and its surrogate is proportional to , and the same holds for , and (iii) the surrogates of and can be recovered from the sketch (by following ingresses along the tree) up to a shift, which while unknown, is the same for both. Then we may return the distance between the shifted surrogates as the output distance estimate.
The above outline describes our upper bound for sketching -metrics. However, for Euclidean metrics, the resulting sketch size is suboptimal in the dependence on . To achieve the optimal bound we further develop the sketch.
To this end, we incorporate randomness into the sketching algorithm. To see why this might help, view a surrogate as an estimator for the point it represents. In the deterministic sketch described above, is necessarily a biased estimator, since it is a fixed point close to but different than . This bias bears on the sketch size: if, say, the surrogate is chosen by deterministically rounding to its nearest neighbor in a fixed net, then in order to get a desired level of accuracy , the net must have size , and hence the surrogate requires bits to store in the sketch. To improve this, we could hope to use an unbiased estimator for by designing a distribution over surrogates, which we call probabilistic surrogates.
Alon and Klartag [AK17] took this approach to achieve the optimal sketch size for additive error . By using randomized rounding on the net, they showed its size can be reduced to , while still holds by probabilistic concentration if the dimension is large enough (). To achieve relative error , we incorporate this into our techniques described above. We build a relative location tree with ; this does not exceed the optimal sketch size for Euclidean metrics, but does not provide the desired approximation of distances. We then augment it with randomized roundings of displacement vectors between nodes to their surrogates, and between centers to non-centers in well-separated clusters. To estimate the distance of a given pair , we sum an appropriate subset of those randomly rounded displacements along the tree, obtaining probabilistic surrogates . These are random variables with expected values up to an unknown but equal shift, and with variance appropriately related to . For technical reasons related to probabilistic independence, we return a proxy of the distance rather than the distance itself, and the result is tightly concentrated at the correct value .
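The difference between a biased deterministic rounding and an unbiased randomized one can be seen in one dimension. The following is a minimal sketch of generic randomized rounding to a grid, not the specific scheme of [AK17] or of this paper; the function name, the grid cell size, and the sample count are illustrative assumptions.

```python
import math
import random


def randomized_round(value, cell, rng):
    """Round to a multiple of `cell`, rounding up with probability equal to
    the fractional position of `value` inside its cell, so that the expected
    output equals `value`."""
    lower = math.floor(value / cell)
    frac = value / cell - lower
    step = 1 if rng.random() < frac else 0
    return (lower + step) * cell


rng = random.Random(0)  # fixed seed so the run is reproducible
samples = [randomized_round(0.3, 1.0, rng) for _ in range(20000)]
mean = sum(samples) / len(samples)

# Deterministic rounding down always returns the same grid point.
deterministic = math.floor(0.3 / 1.0) * 1.0
```

Every individual sample is a coarse grid point, yet the average of the samples approaches the true value, whereas deterministic rounding is always off by the same amount. Summing many independently rounded coordinates is what lets concentration take over in high dimension.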
1.5 Paper organization
Section 2 sets up preliminaries and notation. Section 3 contains the description of the main sketch, and proves the upper bound for metrics in Theorem 1.3. (The upper bound for general metrics in Theorem 1.2 follows as a corollary, as explained above.) Section 4 develops the sketch further and proves the upper bound for Euclidean metrics in Theorem 1.1. Section 5 points out that bounds for Euclidean metrics hold for all metrics with . Section 6 proves the lower bounds in Theorems 1.2 and 1.1.
2 Preliminaries
We start by stating the classical dimension reduction theorem of Johnson and Lindenstrauss [JL84].
Theorem 2.1 ([JL84]).
Let , , and for a sufficiently large constant . There is a distribution over matrices (for example, i.i.d. entries from ) such that with probability , for all ,
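For intuition, the theorem can be tried out numerically. The sketch below is an illustration under stated assumptions: the matrix has i.i.d. Gaussian entries with variance equal to the reciprocal of the target dimension, and the dimensions (1000 reduced to 300), the five random points, and the seed are arbitrary choices made for the example, not values prescribed by the theorem.

```python
import math
import random


def jl_matrix(target_dim, source_dim, rng):
    """Random matrix with i.i.d. N(0, 1/target_dim) entries."""
    scale = 1.0 / math.sqrt(target_dim)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(source_dim)]
            for _ in range(target_dim)]


def project(matrix, x):
    """Matrix-vector product."""
    return [sum(m * c for m, c in zip(row, x)) for row in matrix]


def l2(u, v):
    """Euclidean distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


rng = random.Random(1)
points = [[rng.uniform(-1.0, 1.0) for _ in range(1000)] for _ in range(5)]
matrix = jl_matrix(300, 1000, rng)
projected = [project(matrix, x) for x in points]

# Ratio of projected to original distance for every pair; values near 1
# mean the pairwise distance was approximately preserved.
ratios = [l2(projected[i], projected[j]) / l2(points[i], points[j])
          for i in range(5) for j in range(i + 1, 5)]
```

The projected coordinates are still real numbers, which is why the discussion in Section 1.1 distinguishes the reduced dimension (machine words per point) from the bit size of a sketch.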
2.1 Grid nets
Let . Let denote the -dimensional -unit ball. Let . A subset is called a -net of if for every there is such that . Further, is -separated if the distance between any pair of distinct points in is at least . It is a well-known fact that has a -net of size for a constant that is -separated, and that the size bound is tight up to the constant .
We will use a specific net, given by the intersection of the ball with an appropriately scaled grid. For , let denote the uniform -dimensional grid with cell side length . Namely, is defined as the set of points such that each of their coordinates is an integer multiple of .
The net we use is (where is the origin-centered ball of radius ). We drop the dependence on and from the notation for simplicity. Also, in the case , we use as a convention. Since each cell of is a hypercube of diameter , or part of one, is indeed a -net of (and is -separated). It is also well-known that for a constant (see, e.g., [HPIM12] or [AK17]), meaning that attains the optimal size for -nets up to the constant . Finally, given , we can find such that by dividing each coordinate by , rounding it down to the nearest integer, and multiplying it by . We call this operation rounding to . In summary,
For every , we can round it to in time , and store the resulting point of with bits.
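The rounding operation can be written out directly. This is a minimal sketch, assuming that the grid cell side is the net precision divided by the p-th root of the dimension (so that each cell has diameter equal to the precision in the p-norm, with the convention that the root equals 1 when p is infinite); the symbol names `gamma`, `cell_side`, and `round_to_grid` are hypothetical.

```python
import math


def cell_side(gamma, dim, p):
    """Side length of a grid cell whose l_p diameter is gamma (assumption)."""
    if p == math.inf:
        return gamma
    return gamma / dim ** (1.0 / p)


def round_to_grid(x, gamma, p):
    """Divide each coordinate by the cell side, round down, multiply back."""
    side = cell_side(gamma, len(x), p)
    return [math.floor(c / side) * side for c in x]


def lp_dist(u, v, p):
    """l_p distance, including p = infinity."""
    if p == math.inf:
        return max(abs(a - b) for a, b in zip(u, v))
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1.0 / p)


x = [0.31, -0.72, 0.05]
gamma = 0.1
rounding_errors = {p: lp_dist(x, round_to_grid(x, gamma, p), p)
                   for p in (1, 2, math.inf)}
```

Only the integer multipliers need to be stored, one per coordinate, and the rounding takes time linear in the dimension.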
We also record another variant of the above lemma.
Let and . The number of points in which are at distance at most from (in the -norm distance) is .
3 The relative location tree
In this section we prove Theorem 1.3, which implies the upper bound in Theorem 1.2, and will also serve as a stepping stone toward Theorem 1.1. The sketching scheme is based on a new data structure that we call relative location tree.
Let be a given point set endowed with the -metric for a fixed , with minimal distance and diameter . We assume w.l.o.g. that is an integer. To simplify notation, we drop the subscript from -norms (that is, we write for ).
3.1 Hierarchical tree construction
We start by building a hierarchical clustering tree over the points , by the following bottom-up process. In the bottom level, numbered , every point forms a singleton cluster . Level is generated from level by merging any clusters at distance less than , until no such pair remains. (The distance between two clusters is defined as .) By level , the pointset has been merged into one cluster, which forms the root of the tree.
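The bottom-up construction can be sketched with a union-find structure. The code below is an illustration under assumptions that should be checked against the formal definition: the merging threshold at level l is taken to be 2**l, the minimal distance is at least 1 (so level 0 consists of singletons), the points are distinct, and the quadratic scan over all pairs is chosen for clarity rather than efficiency. The function name `build_levels` and the example points are hypothetical.

```python
def build_levels(points, dist):
    """Return the partition of point indices at each level, bottom-up.

    Assumed rule: at level l, clusters at distance less than 2**l are
    merged transitively, where cluster distance is the minimum pairwise
    distance between their points.
    """
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    levels = [[[i] for i in range(n)]]  # level 0: singleton clusters
    level = 0
    while len(levels[-1]) > 1:
        level += 1
        for i in range(n):
            for j in range(i + 1, n):
                if dist(points[i], points[j]) < 2 ** level:
                    parent[find(i)] = find(j)
        groups = {}
        for i in range(n):
            groups.setdefault(find(i), []).append(i)
        levels.append(sorted(groups.values()))
    return levels


# Hypothetical one-dimensional point set with minimal distance 1.
line_points = [0, 1, 5, 6, 20]
levels = build_levels(line_points, lambda a, b: abs(a - b))
```

Since merges persist across levels, the clusters at each level are the connected components of the graph joining all pairs closer than the threshold. This directly gives the separation property: points in different clusters of a level are at least the threshold apart. It gives no bound on the diameter of an individual cluster, which is why the text bounds the cluster diameters globally instead.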
For every tree node in , we denote its level by , its associated cluster by , its cluster diameter by , and its degree (number of children) by . For every , let denote the tree leaf whose associated cluster is .
Note that the nodes at each level of form a partition of . On one hand, we have the following separation property.
If are at different clusters of the partition induced by level , then .
On the other hand, we have the following global bound on the cluster diameters.
We write to denote an edge from a parent to a child . We call it a -edge if , and a non--edge otherwise. Note that since has leaves, it has at most non--edges. We define edge weights and node weights in as follows. The weight of an edge is if is a -edge, and otherwise. The weight of a node , denoted , is the sum of all edge weights in the tree under (that is, where the sum is over all edges such that is reachable from by a downward path in ).
We argue that for every node . This is seen by bottom-up induction on . In the base case is a leaf, and then . For the induction step, fix a node and consider two cases. In the first case, has degree and a single outgoing -edge . Then by the tree construction, and since , thus the claim follows by induction. In the second case has multiple outgoing edges . Since is a partition of , the diameter is upper-bounded by . By induction, for every . By the tree construction, . Together, , as needed.
Consequently, it now suffices to prove the bound . To this end we count the contribution of each edge to the sum. A -edge has no contribution since its weight is . For a non--edge , let , and let be the parent of for all until the root is reached. Then contributes its weight to for every , and its total contribution is . Since , the latter sum equals . Since has at most non--edges, the desired bound follows. ∎
3.1.1 Path compression
Next, we compress long non-branching paths in . A -path in is a downward path such that are degree- nodes. It is called maximal if and are not degree- nodes ( may be a leaf). For every such path in , if
we replace the path from to with a long edge directly connecting to . We mark it as long and annotate it with the original path length, . The rest of the edges are called short edges. Note that the right-hand side in Equation 1 depends both on the level of in the tree, , and on the diameter of the cluster it represents in the metric space, .
The tree after path compression will be denoted by . We note that will continue to denote the original level of in (or equivalently, the level in if the long edges are counted according to their lengths).
Follows from Section 3.1 since every node in is also present in (with the same level and associated cluster diameter ), and since for all . ∎
has at most nodes.
We charge the degree- nodes on every maximal -path in to the bottom node of the path. The total number of nodes in can then be written as , where is the length of the maximal -path whose bottom node is . Due to path compression, we have . Since has leaves, it has at most nodes whose degree is not , so the total contribution of the second term is at most . For the total contribution of the first term, we need to show . This is given by Section 3.1.1. ∎
We partition into subtrees by removing the long edges. Let denote the set of resulting subtrees. Furthermore let denote the set of nodes of which are leaves of subtrees in . Note that a node in is either a leaf in or the top node of a long edge in . These nodes are special in that they represent clusters whose diameter can be bounded individually.
For every , .
If is a leaf in then contains a single point, thus , and the lemma holds. Otherwise, is the top node of a long edge in . Let be the bottom node of that edge. By path compression, the long edge represents a -path of length at least (see Equation 1), hence , and hence . Since no clusters are merged along a -path, we have , hence , and the lemma follows. ∎
3.2 Tree annotations: Centers, ingresses, and surrogates
We now augment with the following annotations, which efficiently encode information on the location of its clusters. Each cluster in the tree is represented by one of its points, chosen largely arbitrarily, called its center. The center location is stored using the approximate displacement from a nearby cluster center (already stored by induction), called its ingress. The approximate location of the center is called its surrogate.
For every node in we choose a center from the points in its cluster , in a bottom-up manner, as follows. For a leaf , let . For a non-leaf with children , let . The point is the center of .
Next, for every node in we assign an ingress node, denoted . Intuitively, the ingress is a node in such that is close to , and our eventual purpose is to store the latter by its location relative to the former.
Before turning to the formal definition of ingresses, let us give an intuitive overview. The distance between and can generally be as large as , since the centers are positioned arbitrarily inside their clusters. Since we plan to store the approximate displacement of from , we would pay the log of that distance in the sketch size. Since we plan to invoke Section 3.1.1 to bound the total size, we can afford to pay the log-diameter of each node only once. This could create a difficulty, since we may wish to use the same node as the ingress for multiple nodes. Our choice of ingresses is meant to avoid this difficulty, by ensuring that depends only on and not on (Section 3.2.2 below). To this end, once we have identified a cluster nearby at the same tree level, we intuitively want to choose the ingress of to be not the center of , but rather the nearest point to in . Call that point , and note that could be larger than by , which is the term we are trying to avoid. Since ingresses are nodes rather than points, we want to be a node whose center is , ideally , whose diameter is zero. This raises two technical points: One, due to the preceding path compression step, the node might not be reachable from anymore (by short edges), so we instead use the lowest ancestor of reachable from . Two, in order to use to approximately store the location of , we need to have already approximately stored , which means we need an ordering of the nodes such that each node appears after its ingress. We will argue that our somewhat involved choice of ingresses admits such an ordering.
We now formally define the ingresses. They are defined in each subtree separately. For the root of , we set for convenience (as we will not require ingresses for subtree roots). Now we assign ingresses to all children of every node in , and this would take care of the rest of the nodes in . Let be the children of , such that w.l.o.g. . Consider the simple graph whose nodes are , where are neighbors iff . The fact that have been merged into in the tree construction means that is a connected graph. Fix an arbitrary spanning tree of and root it at . For , the ingress is . For with , let be its parent node in . Let be the closest point to in (i.e., ). Let be the leaf of whose cluster contains . The ingress of is . See Figure 1 for illustration.
(Note that there is a downward path in from to , and is the bottom node on that path that belongs to . Equivalently, is the bottom node on the path that is reachable from without traversing a long edge.)
The following lemma bounds the distance from a node center to its ingress center.
For every node in , .
Fix a subtree . If is the root of , the claim is obvious since . Next, using the same notation as above, we prove the claim for all children of a given node in . For we have , and the claim holds. For with , recall that denotes its ancestor in , and that is a point in that realizes the distance , which is upper-bounded by . Therefore,
Noting that , we find
Recall that was chosen as the leaf in whose cluster contains . In particular, and are both contained in . By Section 3.1.2, . Since is a descendant of a sibling of , we have , hence . Combined with Equation 2, this implies the lemma by the triangle inequality. ∎
We also record the following fact.
For every node in , .
The ingress is either itself, the parent of in , or a descendant of the parent. ∎
The nodes in every subtree can be ordered such that every node appears after its ingress (except the root, which is its own ingress, and would be first in the ordering). Such ordering is given by a depth-first scan (DFS) on , in which additionally, the children of every node are traversed in a DFS order on . Since the ingress of every non-root node is either its parent in , or a descendant of the sibling in which is its predecessor in , this ordering places every non-root after its ingress as desired. This will be important since the rest of the proof utilizes induction on the ingresses.
Now we can define the surrogates, which are meant to serve as approximate locations for the center of each tree node. We start by defining a coarse surrogate for every node in . They are defined in every subtree separately, by induction on the ingress order in . For the root of , we let . For a non-root in , we denote
Let be the rounding of to the grid net (see Section 2). By this we mean that is obtained by rounding each coordinate of down to the nearest integer multiple of . We define , by induction on , as
The following lemma bounds the distance between a node center and its surrogate.
For every in , .
By induction on the ingress ordering in the subtree that contains . In the base case, is the root and the claim holds trivially since . For a non-root , we have , where the first inequality is by induction on the ingress and the second is by Section 3.2.2. By Section 3.2.2 we have , and together, by the triangle inequality, . By Equations 4 and 3, this implies . Now, since is a -net for the unit ball, we have . Finally,
3.2.4 Leaf surrogates
For every subtree leaf we also use a finer surrogate , called leaf surrogate. To this end, let be the rounding of to the grid net , where and are the same as before. The leaf surrogate is defined as
Note that is the surrogate of defined earlier (the definition of is not inductive).
For every , .
The proof of Section 3.2.3 showed that . Hence, as is a -net for the unit ball, we have . Thus,