A recent line of work extends the large-margin classification paradigm from Hilbert spaces to less structured ones, such as Banach or even metric spaces, see e.g. [23, 34, 13, 40]. In this metric approach, data is presented as points with distances but lacking the additional structure of inner-products. The potentially significant advantage is that the metric can be precisely suited to the type of data, e.g. earthmover distance for images, or edit distance for sequences.
However, much of the existing machinery of classification algorithms and generalization bounds (e.g. [11, 32]) depends strongly on the data residing in a Hilbert space. This structural requirement severely limits the machinery's applicability: many natural metric spaces cannot be represented faithfully in a Hilbert space. Formally, every embedding into a Hilbert space of metrics such as $\ell_1$, earthmover, and edit distance must distort distances by a large factor [14, 29, 2]. Ad-hoc solutions such as kernelization cannot circumvent this shortcoming, because imposing an inner product necessarily embeds the data in some Hilbert space.
To address this gap, von Luxburg and Bousquet developed a powerful framework of large-margin classification for a general metric space. They first show that the natural hypotheses (classifiers) to consider in this context are maximally-smooth Lipschitz functions; indeed, they reduce classification (of points in a metric space) with no training error to finding a Lipschitz function consistent with the data, which is a classic problem in Analysis, known as Lipschitz extension. Next, they establish error bounds in the form of expected surrogate loss. Finally, the computational problem of evaluating the classification function is reduced, assuming zero training error, to exact nearest neighbor search. This matches a popular classification heuristic, and in retrospect provides a rigorous explanation for this heuristic's empirical success in general metric spaces, extending the seminal analysis of Cover and Hart for the Euclidean case.
The work of von Luxburg and Bousquet has left open some algorithmic questions. In particular, allowing nonzero training error is apt to significantly reduce the Lipschitz constant, thereby producing classifiers that have lower complexity and are less likely to overfit. This introduces the algorithmic challenge of constructing a Lipschitz classifier that minimizes the 0-1 training error. In addition, exact nearest neighbor search in general metrics has time complexity proportional to the size of the dataset, rendering the technique impractical when the training sample is large. Finally, bounds on the expected surrogate loss may significantly overestimate the generalization error, which is the true quantity of interest.
We solve the problems delineated above by showing that data residing in a metric space of low doubling dimension admits accurate and computationally efficient classification. This is the first result that ties the doubling dimension of the data to either classification error or algorithmic runtime. (Footnote 1: Previously, the doubling dimension of the space of classifiers has been used, but this is less relevant to our discussion.) Specifically, we (i) prove generalization bounds for the classification (0-1) error as opposed to surrogate loss, (ii) construct and evaluate the classifier in a computationally-efficient manner, and (iii) perform efficient structural risk minimization by optimizing the tradeoff between the classifier's smoothness and its training error.
Our generalization bound for Lipschitz classifiers controls the expected classification error directly (rather than expected surrogate loss), and may be significantly sharper than the latter in many common scenarios. We provide this bound in Section 3, using an elementary analysis of the fat-shattering dimension. In hindsight, our approach offers a new perspective on the nearest neighbor classifier, with significantly tighter risk asymptotics than the classic analysis of Cover and Hart.
We further give efficient algorithms to implement the Lipschitz classifier, both for the training and the evaluation stages. In Section 4 we prove that once a Lipschitz classifier has been chosen, the hypothesis can be evaluated quickly on any new point using approximate nearest neighbor search, which is known to be fast when points have a low doubling dimension. In Section 5 we further show how to quickly compute a near-optimal classifier (in terms of classification error bound), even when the training error is nonzero. In particular, this necessitates the optimization of the number of incorrectly labeled examples — and moreover, their identity — as part of the structural risk minimization.
Finally, we give in Section 6 two exemplary setups. In the first, the data is represented using the earthmover metric over the plane. In the second, the data is a set of time series vectors equipped with a popular distance function. We provide basic theoretical and experimental analysis, which illustrates the potential power of our approach.
2 Definitions and notation
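We will use standard $O(\cdot)$, $\Omega(\cdot)$ notation for orders of magnitude. If $f = O(g)$ and $g = O(f)$, we will write $f = \Theta(g)$. Whenever $f = O(g \log^{O(1)} g)$, we will denote this by $f = \tilde{O}(g)$. If $n$ is a natural number, $[n]$ denotes the set $\{1, \ldots, n\}$.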
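A metric $\rho$ on a set $X$ is a positive symmetric function satisfying the triangle inequality $\rho(x, y) \le \rho(x, z) + \rho(z, y)$; together the two comprise the metric space $(X, \rho)$. The diameter of a set $A \subseteq X$ is defined by $\mathrm{diam}(A) = \sup_{x, y \in A} \rho(x, y)$, and the distance between two sets $A, B \subset X$ is defined by $\rho(A, B) = \inf_{x \in A, y \in B} \rho(x, y)$. The Lipschitz constant of a function $f : X \to \mathbb{R}$, denoted by $\|f\|_{\mathrm{Lip}}$, is defined to be the smallest $L > 0$ that satisfies $|f(x) - f(y)| \le L \rho(x, y)$ for all $x, y \in X$.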
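As a concrete illustration of these definitions, the following minimal Python sketch evaluates the diameter, the set-to-set distance, and the Lipschitz constant on a finite sample. The function names and the toy metric (absolute difference on the real line) are illustrative assumptions and do not come from the text; the sample points are assumed to be pairwise distinct.

```python
from itertools import combinations


def diameter(points, rho):
    """Largest pairwise distance within a finite set (0 for fewer than two points)."""
    return max((rho(x, y) for x, y in combinations(points, 2)), default=0.0)


def set_distance(a, b, rho):
    """Smallest distance between a point of `a` and a point of `b` (both non-empty)."""
    return min(rho(x, y) for x in a for y in b)


def lipschitz_constant(points, values, rho):
    """Smallest L with |f(x) - f(y)| <= L * rho(x, y) over the finite sample.

    Assumes the points are pairwise distinct, so that rho(x, y) > 0.
    """
    pairs = combinations(list(zip(points, values)), 2)
    return max((abs(fx - fy) / rho(x, y) for (x, fx), (y, fy) in pairs), default=0.0)


# Toy example on the real line (hypothetical data).
line_metric = lambda x, y: abs(x - y)
sample = [0.0, 1.0, 3.0]
labels = [-1.0, -1.0, 1.0]
sample_diameter = diameter(sample, line_metric)
class_gap = set_distance([0.0, 1.0], [3.0], line_metric)
label_lipschitz = lipschitz_constant(sample, labels, line_metric)
```

On this toy sample the Lipschitz constant of the labels equals 2 divided by the distance between the two classes, which is the relationship the later sections rely on.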
For a metric space $(X, \rho)$, let $\lambda$ be the smallest value such that every ball in $X$ can be covered by $\lambda$ balls of half the radius. Then $\lambda$ is the doubling constant of $X$, and the doubling dimension of $X$ is $\mathrm{ddim}(X) = \log_2 \lambda$. A metric is doubling when its doubling dimension is bounded. Note that while a low Euclidean dimension implies a low doubling dimension (Euclidean metrics of dimension $d$ have doubling dimension $O(d)$), low doubling dimension is strictly more general than low Euclidean dimension.
The following packing property can be demonstrated via repeated applications of the doubling property (see, for example ): Let be a metric space, and suppose that is finite and has a minimum interpoint distance at least . Then the cardinality of is
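Let $(X, \rho)$ be a metric space and suppose $S \subset X$. An $\varepsilon$-net of $S$ is a subset $T \subset S$ with the following properties: (i) Packing: all distinct $u, v \in T$ satisfy $\rho(u, v) \ge \varepsilon$, which means that $T$ is $\varepsilon$-separated; and (ii) Covering: every point $u \in S$ is strictly within distance $\varepsilon$ of some point $v \in T$, namely $\rho(u, v) < \varepsilon$.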
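The packing and covering properties can be made concrete with a simple greedy construction. The sketch below is a quadratic-time illustration only, not the hierarchy-based construction referred to later in the paper; the function name `greedy_net` and the toy data are assumptions made for this example.

```python
def greedy_net(points, rho, eps):
    """Return an eps-net of `points` by a single greedy pass.

    A point joins the net only if it is at distance >= eps from every net
    point chosen so far (packing). A rejected point is, by construction,
    at distance < eps from some net point (covering).
    """
    net = []
    for p in points:
        if all(rho(p, q) >= eps for q in net):
            net.append(p)
    return net


# Toy example on the real line (hypothetical data).
line_metric = lambda x, y: abs(x - y)
data = [0.0, 0.4, 1.0, 1.3, 2.5]
net = greedy_net(data, line_metric, eps=1.0)
```

Each net point covers itself at distance zero, so the strict inequality in the covering property holds for every point of the input set.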
Our setting in this paper is the agnostic PAC learning model. Examples are drawn independently from $X \times \{-1, 1\}$ according to some unknown probability distribution $P$, and the learner, having observed $n$ such pairs $(x, y)$, produces a hypothesis $h : X \to \{-1, 1\}$. The generalization error is the probability of misclassifying a new point drawn from $P$:
$$P\{(x, y) : h(x) \ne y\}.$$
The quantity above is random, since it depends on the observations, and we wish to upper-bound it in probability. Most bounds of this sort contain a training error term, which is the fraction of observed examples misclassified by the hypothesis and roughly corresponds to bias in Statistics, as well as a hypothesis complexity term, which measures the richness of the class of all admissible hypotheses and roughly corresponds to variance in Statistics. Optimizing the tradeoff between these two terms is known as Structural Risk Minimization (SRM). (Footnote 2: Robert Schapire pointed out to us that these terms from Statistics are not entirely accurate in the machine learning setting. In particular, the classifier complexity term does not correspond to the variance of the classifier in any quantitatively precise way. However, the intuition underlying SRM corresponds precisely to the one behind the bias-variance tradeoff in Statistics, and so we shall occasionally use the latter term as well.)
Keeping in line with the literature, we ignore the measure-theoretic technicalities associated with taking suprema over uncountable function classes.
3 Generalization bounds
In this section, we derive generalization bounds for Lipschitz classifiers over doubling spaces. As noted by von Luxburg and Bousquet, Lipschitz functions are the natural object to consider in an optimization/regularization framework. The basic intuition behind our proofs is that the Lipschitz constant plays the role of the inverse margin in the confidence of the classifier. As in their work, a small Lipschitz constant corresponds to a large margin, which in turn yields low hypothesis complexity and variance. However, in contrast to von Luxburg and Bousquet (whose generalization bounds rely on Rademacher averages), we use the doubling property of the metric space directly to control the fat-shattering dimension.
We apply tools from generalized Vapnik-Chervonenkis theory to the case of Lipschitz classifiers. Let $F$ be a collection of functions $f : X \to \mathbb{R}$ and recall the definition of the fat-shattering dimension [1, 4]: a set $S \subset X$ is said to be $\gamma$-shattered by $F$ if there exists some function $r : S \to \mathbb{R}$ such that for each label assignment $y \in \{-1, 1\}^S$ there is an $f \in F$ satisfying $y(x)(f(x) - r(x)) \ge \gamma$ for all $x \in S$. The $\gamma$-fat-shattering dimension of $F$, denoted by $\mathrm{fat}_\gamma(F)$, is the cardinality of the largest set $\gamma$-shattered by $F$.
For the case of Lipschitz functions, we will show that the notion of fat-shattering dimension may be somewhat simplified. We say that a set $S \subset X$ is $\gamma$-shattered at zero by a collection of functions $F$ if for each $y \in \{-1, 1\}^S$ there is an $f \in F$ satisfying $y(x) f(x) \ge \gamma$ for all $x \in S$. (This is the definition above with $r \equiv 0$.) We write $\mathrm{fat}^0_\gamma(F)$ to denote the cardinality of the largest set $\gamma$-shattered at zero by $F$ and show that for Lipschitz function classes the two notions are the same.
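Let $F$ be the collection of all $f : X \to \mathbb{R}$ with $\|f\|_{\mathrm{Lip}} \le L$. Then $\mathrm{fat}_\gamma(F) = \mathrm{fat}^0_\gamma(F)$.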
We begin by recalling the classic Lipschitz extension result, essentially due to McShane and Whitney. Any real-valued function $f$ defined on a subset $S$ of a metric space $X$ has an extension $f^*$ to all of $X$ satisfying $\|f^*\|_{\mathrm{Lip}} = \|f\|_{\mathrm{Lip}}$. Thus, in what follows we will assume that any function defined on $S \subset X$ is also defined on all of $X$ via some Lipschitz extension (in particular, to bound $\|f^*\|_{\mathrm{Lip}}$ it suffices to bound the Lipschitz constant of the restriction to $S$).
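Consider some finite $S \subset X$. If $S$ is $\gamma$-shattered at zero by $F$ then by definition it is also $\gamma$-shattered. Now assume that $S$ is $\gamma$-shattered by $F$. Thus, there is some function $r : S \to \mathbb{R}$ such that for each $y \in \{-1, 1\}^S$ there is an $f_y \in F$ such that $f_y(x) \ge r(x) + \gamma$ if $y(x) = +1$ and $f_y(x) \le r(x) - \gamma$ if $y(x) = -1$. Let us define the function $\tilde{f}_y$ on $S$ and, as per above, on all of $X$, by $\tilde{f}_y(x) = \gamma y(x)$. It is clear that the collection $\{\tilde{f}_y\}$ $\gamma$-shatters $S$ at zero; it only remains to verify that $\tilde{f}_y \in F$, i.e., that $\|\tilde{f}_y\|_{\mathrm{Lip}} \le L$. To see this, take $x, x' \in S$ with $y(x) = +1$ and $y(x') = -1$ (pairs with equal labels contribute nothing), and let $y' = -y$. Then $f_y(x) - f_y(x') \ge r(x) - r(x') + 2\gamma$ and $f_{y'}(x') - f_{y'}(x) \ge r(x') - r(x) + 2\gamma$; adding the two inequalities and using $\|f_y\|_{\mathrm{Lip}}, \|f_{y'}\|_{\mathrm{Lip}} \le L$ gives $4\gamma \le 2L\rho(x, x')$, hence $|\tilde{f}_y(x) - \tilde{f}_y(x')| = 2\gamma \le L\rho(x, x')$.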
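A consequence of Lemma 3 is that in considering the generalization properties of Lipschitz functions we need only bound the $\gamma$-fat-shattering dimension at zero. The latter is achieved by observing that the packing number of a metric space controls the fat-shattering dimension of Lipschitz functions defined over the metric space: Let $(X, \rho)$ be a metric space. Fix some $L > 0$, and let $F$ be the collection of all $f : X \to \mathbb{R}$ with $\|f\|_{\mathrm{Lip}} \le L$. Then for all $\gamma > 0$,
$$\mathrm{fat}_\gamma(F) = \mathrm{fat}^0_\gamma(F) \le M(X, \rho, 2\gamma/L),$$
where $M(X, \rho, \varepsilon)$ is the $\varepsilon$-packing number of $X$, defined as the cardinality of the largest $\varepsilon$-separated subset of $X$.
Suppose that $S \subset X$ is $\gamma$-shattered at zero. The case $|S| = 1$ is trivial, so we assume the existence of $x \ne x' \in S$ and $f \in F$ such that $f(x) \ge \gamma > -\gamma \ge f(x')$. The Lipschitz property then implies that $\rho(x, x') \ge 2\gamma/L$, and the claim follows. ∎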
Let metric space have doubling dimension , and let be the collection of real-valued functions over with Lipschitz constant at most . Then for all ,
Equipped with these estimates for the fat-shattering dimension of Lipschitz classifiers, we can invoke a standard generalization bound stated in terms of this quantity. For the remainder of this section, we takeand say that a function classifies an example correctly if
The following generalization bounds appear in .
Let be a collection of real-valued functions over some set , define and let and be some probability distribution on . Suppose that , are drawn from independently according to and that some classifies the examples correctly, in the sense of (1). Then with probability at least ,
Furthermore, if is correct on all but examples, we have with probability at least
Let metric space have doubling dimension , and let be the collection of real-valued functions over with Lipschitz constant at most . Then for any that classifies a sample of size correctly, we have with probability at least
Likewise, if is correct on all but examples, we have with probability at least
In both cases, .
3.1 Comparison with previous generalization bounds
Our generalization bounds are not directly comparable to those of von Luxburg and Bousquet. In general, two approaches exist to analyze binary classification by continuous-valued functions: thresholding by the sign function or bounding some expected surrogate loss function. They opt for the latter approach, defining the surrogate loss function
and bound the risk . We take the former approach, bounding the generalization error directly. Although for -valued labels the risk upper-bounds the generalization error, it could potentially be a crude overestimate.
von Luxburg and Bousquet  demonstrated that the Rademacher average of Lipschitz functions over the -dimensional unit cube () is of order , and since the proof uses only covering numbers, a similar bound holds for all metric spaces with bounded diameter and doubling dimension. In conjunction with Theorem 5(b) of , this observation yields the following bound. Let be a metric space with , and let be the collection of all with . If are drawn iid with respect to some probability distribution , then with probability at least every satisfies
where is the number of examples labeled incorrectly. Our results compare favorably to those of von Luxburg and Bousquet when we assume fixed diameter and Lipschitz constant and the number of observations goes to infinity. Indeed, Lemma 3.1 bounds the excess error decay by , whereas Corollary 3 gives a rate of .
3.2 Comparison with previous nearest-neighbor bounds
Corollary 3 also allows us to significantly sharpen the asymptotic analysis of Cover and Hart for the nearest-neighbor classifier. Following a standard presentation, with an appropriate generalization to general metric spaces, the analysis of Cover and Hart implies that the 1-nearest-neighbor classifier achieves
where is the conditional probability of the label, and
is the Bayes optimal classifier. The curse of dimensionality exhibited in the term is real: for each , there exists a distribution such that for sample size , we have . However, Corollary 3 shows that this analysis is overly pessimistic. Comparing (4) with (2) in the case where , we see that once the sample size passes a critical number on the order of , the expected generalization error begins to decay as , which is much faster than the rate suggested by (4).
4 Lipschitz extension classifier
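Given labeled points $(x_1, y_1), \ldots, (x_n, y_n)$, we construct our classifier in a similar manner to von Luxburg and Bousquet, via a Lipschitz extension of the label values $y_i$ to all of $X$. Let $S^+, S^-$ be the sets of positive and negative labeled points. Our starting point is the same extension function used by von Luxburg and Bousquet, namely, for $x \in X$ define $f(x)$ by
$$f(x) = \min_{i \in [n]} \left( y_i + \frac{2\rho(x, x_i)}{\rho(S^+, S^-)} \right). \qquad (5)$$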
It is easy to verify (see also [34, Lemmas 7 and 12]) that the extension agrees with the sample label at every sample point, and that its Lipschitz constant is identical to the one induced by the labeled points, which in turn is $2/\rho(S^+, S^-)$, where $S^+$ and $S^-$ denote the positively and negatively labeled sample points. However, computing the exact value of the extension at a point (or even its sign at this point) requires an exact nearest neighbor search, and in an arbitrary metric space nearest neighbor search requires time linear in the sample size.
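The exact extension discussed above can be sketched in a few lines. The code below assumes the extension takes the McShane form, i.e. the minimum over sample points of the label plus the Lipschitz constant times the distance, with the Lipschitz constant set to 2 divided by the distance between the two classes. That form, the function name, and the toy data are assumptions made for illustration; the evaluation scans the whole sample, which is exactly the linear-time cost the rest of the section sets out to avoid.

```python
def make_lipschitz_extension(points, labels, rho):
    """Build an exact Lipschitz extension of +/-1 labels (linear-time evaluation).

    Assumed form: f(x) = min_i ( y_i + L * rho(x, x_i) ),
    with L = 2 / rho(S+, S-). Requires at least one point of each label.
    """
    positives = [x for x, y in zip(points, labels) if y > 0]
    negatives = [x for x, y in zip(points, labels) if y < 0]
    class_gap = min(rho(p, q) for p in positives for q in negatives)
    lipschitz = 2.0 / class_gap

    def extension(x):
        # Exhaustive scan over the sample: one distance computation per point.
        return min(y + lipschitz * rho(x, xi) for xi, y in zip(points, labels))

    return extension, lipschitz


def classify(extension, x):
    """Threshold the real-valued extension by its sign (ties go to +1)."""
    return 1 if extension(x) >= 0 else -1


# Toy example on the real line (hypothetical data).
line_metric = lambda a, b: abs(a - b)
xs = [0.0, 1.0, 3.0, 4.0]
ys = [-1, -1, 1, 1]
f, L = make_lipschitz_extension(xs, ys, line_metric)
```

On this sample the extension reproduces every training label, and its sign at a new point agrees with the label of that point's nearest neighbor, which is the connection to nearest neighbor classification noted in the introduction.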
In this section, we design a classifier that is evaluated at a point using an approximate nearest neighbor search. (Footnote 3: If $x^*$ is the nearest neighbor for a test point $x$, then any point $\tilde{x}$ satisfying $\rho(x, \tilde{x}) \le (1+\varepsilon)\rho(x, x^*)$ is called a $(1+\varepsilon)$-approximate nearest neighbor of $x$.) It is known how to build a data structure for a set of points so as to support $(1+\varepsilon)$-approximate nearest neighbor searches efficiently when the doubling dimension is low [10, 22] (see also [25, 7]). Our classifier below relies only on a given subset of the given points, which may eventually lead to improved generalization bounds (i.e., it provides a tradeoff between the training error and the Lipschitz constant in Theorem 3).
Let be a metric space, and fix . Let be a sample consisting of labeled points . Fix a subset of cardinality , on which the constructed classifier must agree with the given labels, and partition it into according to the labels, letting . Then there is a binary classification function satisfying:
can be evaluated at each in time , after an initial computation of time.
With probability at least (over the sampling of )
We will use the following simple lemma. For any function class mapping to , define its -perturbation to be
where . Then for ,
Suppose that is able to -shatter the finite subset . Then there is an so that for all , there is an such that
Now by definition, for each there is some such that . We claim that the collection is able to -shatter . Indeed, replacing with in (6) perturbs the left-hand side by an additive term of at most . ∎
Proof of Theorem 4.
Without loss of generality, assume corresponds to points indexed by . We begin by observing that since all of the sample labels have values in , any Lipschitz extension may be truncated to the range . Formally, if is a Lipschitz extension of the labels from the sample to all of , then so is , where
is the truncation operator. In particular, take to be as in (5) with and write
where the second equality is by monotonicity of the truncation operator, we conclude that is a Lipschitz extension of the data, with the same Lipschitz constant .
Now precompute in time a data structure that supports $(1+\varepsilon)$-approximate nearest neighbor searches on the point set , and a similar one for the point set . (Footnote 4: The word precompute underscores the fact that this computation is done during the "offline" learning phase. Its result is then used to achieve fast "online" evaluation of the classifier on any point during the testing phase.) Now compute (still during the learning phase) an estimate for , by searching the second data structure for each of the points in , and taking the minimum of all the resulting distances. This estimate satisfies
and this entire precomputation process takes time.
Given a test point to be classified, search for in the two data structures (for and for ), and denote the indices of the points answered by them by , respectively. The -approximation guarantee means that
Define, as a computationally-efficient estimate of , the function
and let our classifier be . We remark that the case always attains the minimum in the definition of (because only produces values greater than or equal to ), and therefore one can avoid the computation of , and even the construction of a data structure for . In fact, the same argument shows that also in the definition of in (7) we can omit from the minimization points with label .
This classifier can be evaluated on a new point in time , and it thus remains to bound the generalization error of . To this end, we will show that
To prove (9), fix an . Now let be an index attaining the minimum in the definition of in (7), and similarly for . Using the remark above, we may assume that their labels are . Moreover, by inspecting the definition of we may further assume that attains the minimum of (over all points labeled ) and thus also of its numerator . And since index was chosen as an approximate nearest neighbor (among all points labeled ), we get . Together with (8), we have
We now need the following simple claim:
To verify the claim, assume first that ; then , and now use the fact that adding and truncating are both monotone operations, to get , and the right-hand side is clearly at most . Assume next that ; then obviously . The claim follows.
5 Structural Risk Minimization
In this section, we show how to efficiently construct a classifier that optimizes the “bias-variance tradeoff” implicit in Corollary 3, equation (3). Let be a metric space, and assume we are given a labeled sample . For any Lipschitz constant , let be the minimal training error of over all classifiers with Lipschitz constant . We rewrite the generalization bound as follows:
where . This bound contains a free parameter, , which may be tuned in the course of structural risk minimization. More precisely, decreasing drives the “bias” term (number of mistakes) up and the “variance” term (fat-shattering dimension) down. We thus seek an (optimal) value of where achieves a minimum value, as described in the following theorem, which is our SRM result.
Let be a metric space and . Given a labeled sample , , there exists a binary classification function satisfying the following properties:
can be evaluated at each in time , after an initial computation of time.
The generalization error of is bounded by
for some constant , and where
We proceed with a description of our algorithm. We first give an algorithm with runtime , and then improve the runtime, first to , then to , and finally to .
We start by giving a randomized algorithm that finds a value that is optimal, namely, for that was defined in (11). The runtime of this algorithm is with high probability. First note the behavior of as increases. decreases only when the value of crosses certain critical values, each determined by a pair (that is, ); for such , the classification function can correctly classify both these points. There are critical values of , and these can be determined by enumerating all interpoint distances between subsets .
Below, we will show that for any given , the value can be computed in randomized time . More precisely, we will show how to compute a partition of into sets (with Lipschitz constant ) and (of size ) in this time. Given sets , we can construct the classifier of Corollary 3. Since there are critical values of , we can calculate for all critical values in total time, and thereby determine . Then by Corollary 3, we may compute a classifier with a bias-variance tradeoff arbitrarily close to optimal.
To compute for a given in randomized time , consider the following algorithm: Construct a bipartite graph . The vertex sets correspond to the labeled sets , respectively. The length of edge connecting vertices and is equal to the distance between the points, and includes all edges of length less than . (This can be computed in time.) Now, for all edges , a classifier with Lipschitz constant necessarily misclassifies at least one endpoint of . Hence, finding a classifier with Lipschitz constant that misclassifies a minimum number of points in is exactly the problem of finding a minimum vertex cover for the bipartite graph . (This is an unweighted graph – the lengths are used only to determine .) By König’s theorem, the minimum vertex cover problem in bipartite graphs is equivalent to the maximum matching problem, and a maximum matching in bipartite graphs can be computed in randomized time [28, 37]. This maximum matching immediately identifies a minimum vertex cover, which in turn gives the subsets , allowing us to compute a classifier achieving nearly optimal SRM.
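The reduction to bipartite vertex cover can be sketched as follows. Two substitutions are made here and should be read as assumptions: the edge threshold is taken to be 2 divided by the Lipschitz constant (the separation that two oppositely labeled points need in order to be classified correctly with labels in {-1, +1}), and the fast randomized matching algorithm cited in the text is replaced by a simple augmenting-path matching, which is slower but easy to verify. The minimum vertex cover is then read off the matching using the standard constructive proof of König's theorem. All function names are hypothetical.

```python
def conflict_graph(positives, negatives, rho, lipschitz):
    """adj[i] lists indices j with rho(positives[i], negatives[j]) < 2 / L.

    The threshold 2 / L is an assumption of this sketch.
    """
    threshold = 2.0 / lipschitz
    return [[j for j, q in enumerate(negatives) if rho(p, q) < threshold]
            for p in positives]


def max_matching(adj, n_right):
    """Augmenting-path bipartite matching; returns the left partner of each right vertex."""
    match_right = [-1] * n_right

    def try_augment(u, seen):
        for v in adj[u]:
            if not seen[v]:
                seen[v] = True
                if match_right[v] == -1 or try_augment(match_right[v], seen):
                    match_right[v] = u
                    return True
        return False

    for u in range(len(adj)):
        try_augment(u, [False] * n_right)
    return match_right


def min_vertex_cover(adj, n_right):
    """Koenig's construction: cover = (left not reached) + (right reached)."""
    match_right = max_matching(adj, n_right)
    match_left = [-1] * len(adj)
    for v, u in enumerate(match_right):
        if u != -1:
            match_left[u] = v
    # Alternating search started from every unmatched left vertex.
    reached_left = [match_left[u] == -1 for u in range(len(adj))]
    reached_right = [False] * n_right
    stack = [u for u in range(len(adj)) if reached_left[u]]
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if not reached_right[v]:
                reached_right[v] = True
                w = match_right[v]
                if w != -1 and not reached_left[w]:
                    reached_left[w] = True
                    stack.append(w)
    cover_left = [u for u in range(len(adj)) if not reached_left[u]]
    cover_right = [v for v in range(n_right) if reached_right[v]]
    return cover_left, cover_right


# Toy example on the real line (hypothetical data): one positive point sits
# inside the negative cluster and is the cheapest point to give up.
line_metric = lambda a, b: abs(a - b)
adj = conflict_graph([3.0, 4.0, 1.2], [0.0, 1.0, 1.4], line_metric, lipschitz=2.0)
removed_positives, removed_negatives = min_vertex_cover(adj, 3)
```

The points in the returned cover play the role of the misclassified set, and the remaining points are pairwise separated enough to admit a classifier with the given Lipschitz constant.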
The runtime given above can be reduced from randomized to randomized , if we are willing to settle for a generalization bound within a factor of the optimal , for any . To achieve this improvement, we discretize the candidate values of , and evaluate only for values of , rather than all values as above. In the extreme case where the optimal hypothesis fails on all points of a single label, the classifier is a constant function and . In all other cases, must take values in the range ; indeed, every hypothesis correctly classifying a pair of opposite labelled points has Lipschitz constant at least , and if then the complexity term (and ) is greater than .
Our algorithm evaluates for values of for , and uses the candidate that minimizes . The number of candidate values for is , and one of these values, call it , satisfies . Observe that and that the complexity term for is greater than that for by at most a factor (where the final inequality holds since ). It follows that , implying that this algorithm achieves a -approximation to .
The runtime can be further reduced from randomized to deterministic , if we are willing to settle for a generalization bound within a constant factor of the optimal . The improvement comes from a faster vertex-cover computation. It is well known that a 2-approximation to vertex cover can be computed (in arbitrary graphs) by a greedy algorithm in time linear in the graph size. Hence, we can compute in time a function that satisfies . We replace the randomized algorithm with this greedy algorithm. Then , and because we can approximate the complexity term to a factor smaller than (as above, by choosing a constant ), our resulting algorithm finds a Lipschitz constant for which .
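For reference, the greedy step can be sketched as the textbook maximal-matching heuristic, which is assumed here to be the greedy algorithm the text has in mind: scan the edges once, and whenever an edge has neither endpoint in the cover, add both endpoints. The function name and the vertex labels are hypothetical.

```python
def greedy_vertex_cover(edges):
    """2-approximate vertex cover in time linear in the number of edges.

    The edges whose endpoints get added form a matching, and any vertex
    cover must contain at least one endpoint of each matching edge, so the
    result is at most twice the optimal size.
    """
    cover = set()
    for u, v in edges:
        if u not in cover and v not in cover:
            cover.add(u)
            cover.add(v)
    return cover


# Toy conflict graph between positive ("p") and negative ("n") points.
edges = [("p0", "n0"), ("p0", "n1"), ("p1", "n1")]
cover = greedy_vertex_cover(edges)
```

In this example the heuristic returns four vertices while the optimal cover {p0, n1} has two, so the factor of 2 is attained; this is the price paid for dropping the matching computation.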
We can further improve the runtime from to , at the cost of increasing the approximation factor to . The idea is to work with a sparser representation of the vertex cover problem. Recall that we discretized the values of to powers of . As was already observed by [25, 10] in the context of hierarchies for doubling metric, contains at most of these distinct rounded critical values. After constructing a standard hierarchy (in time ), these ordered values may be extracted with more work.
Let be a discretized value considered above. We extract from a subset that is a -net for . Map each point to its closest net point , and maintain for each net point two lists of points of that are mapped to it: one list for positively labeled points and one for negatively labeled points. We now create an instance of vertex cover for the points of : An edge for and is added to the edge set if the distance between the respective net points and is at most . Notice that , because the distance between such is at most . Moreover, the edge set can be stored implicitly by recording every pair of net points that are within distance ; every oppositely labeled point pair that maps (respectively) to this net-point pair is considered (implicitly) to have an edge in . By the packing property, the number of net-point pairs to be recorded is at most , and by employing a hierarchy, the entire (implicit) construction may be done in time .
Now, for a given , the run of the greedy algorithm for vertex cover can be implemented on this graph in time , as follows. The greedy algorithm considers a pair of net points within distance . If there exist and that map to these net points, then are deleted from and from the respective lists of the net points. (And similarly if map to the same net point.) The algorithm terminates when there are no more points to remove, and correctness follows.
We now turn to the analysis. Since , the guarantees of the earlier greedy algorithm still hold. The resulting point set may contain opposite labeled points within distance