Linear classifiers play a central role in supervised learning, with a rich and elegant theory. This setting assumes data is represented as points in a Hilbert space, either explicitly as feature vectors or implicitly via a kernel. A significant strength of the Hilbert-space model is its inner-product structure, which has been exploited statistically and algorithmically by sophisticated techniques from geometric and functional analysis, placing the celebrated hyperplane methods on a solid foundation. However, the success of the Hilbert-space model obscures its limitations, perhaps the most significant of which is that it cannot represent many norms and distance functions that arise naturally in applications. Formally, metrics such as $L_1$, earthmover, and edit distance cannot be embedded into a Hilbert space without distorting distances by a large factor [Enf69, NS07, AK10]. Indeed, the last decade has seen a growing interest and success in extending the theory of linear classifiers to Banach spaces and even to general metric spaces, see e.g. [MP04, vLB04, HBS05, DL07, ZXZ09].
A key factor in the performance of learning is the dimensionality of the data, which is known to control the learner’s efficiency, both statistically, i.e. sample complexity, and algorithmically, i.e. computational runtime. This dependence on dimension is true not only for Hilbertian spaces, but also for general metric spaces, where both the sample complexity and the algorithmic runtime can be bounded in terms of the covering number or the doubling dimension [vLB04, GKK10].
In this paper, we demonstrate that the learner’s statistical and algorithmic efficiency can be controlled by the data’s intrinsic dimensionality, rather than its ambient dimension (e.g., the representation dimension). This provides rigorous confirmation for the informal insight that real-life data (e.g., visual or acoustic signals) can often be learned efficiently because it tends to lie close to low-dimensional manifolds, even when represented in a high-dimensional feature space. Our simple and general framework quantifies what it means for data to be approximately low-dimensional, and shows how to leverage this for computational and statistical gain.
Previous work has mainly addressed statistical efficiency in Hilbertian spaces. Schölkopf, Shawe-Taylor, Smola, and Williamson [SSSW99] noted the folklore fact that the intrinsic dimensionality of data affects the generalization performance of SVM on that data, and they provided a rigorous explanation for this phenomenon by deriving generalization bounds expressed in terms of the singular values of the training set. These results are a first step towards establishing a connection between Principal Components Analysis (PCA) and linear classification (in fact SVM). However, their generalization bounds are somewhat involved, and hold only for the case of zero training error. Moreover, these results do not lead to any computational speedup, as the algorithm employed is SVM, without (say) a PCA-based dimensionality reduction.
Most generalization bounds depend on the intrinsic dimension, rather than the ambient one, when the training sample lies exactly on a low-dimensional subspace. This phenomenon is indeed immediate in generalization bounds obtained via the empirical Rademacher complexity [BM02, KP02], but we are not aware of rigorous analysis that extends such bounds to the case where the sample is “close” to a low-dimensional subspace.
Two geometric notions put forth by Sabato, Srebro and Tishby [SST10] for the purpose of providing tight bounds on the sample complexity effectively represent “low intrinsic dimensionality”. However, these results are statistical in nature, and do not address the issue of computational efficiency at all. Our notion of low dimension may seem similar to theirs, but it is in fact quite different: our definition depends only on the (observed) training sample, while theirs depends on the data’s entire (unknown) distribution.
We present classification algorithms that adapt to the intrinsic dimensionality of the data, and can exploit a training set that is close to being low-dimensional for improved accuracy and runtime complexity. We start with the scenario of a Hilbertian space, which is technically simpler. Let the observed sample be $(x_1,y_1),\ldots,(x_n,y_n) \in \mathbb{R}^N \times \{-1,1\}$, and suppose that $\{x_1,\ldots,x_n\}$ is close to a low-dimensional linear subspace $T \subset \mathbb{R}^N$, in the sense that its distortion $\eta = \frac{1}{n}\sum_i \|x_i - P_T(x_i)\|_2^2$ is small, where $P_T$ denotes orthogonal projection onto $T$. We prove in Section 3 that when $\dim(T)$ and the distortion $\eta$ are small, a linear classifier generalizes well regardless of the ambient dimension $N$ or the separation margin. Implicit in our result is a tradeoff between the reduced dimension and the distortion, which can be optimized efficiently by performing PCA. To the best of our knowledge, our analysis provides the first rigorous theory for selecting a cutoff value for the singular values, in any supervised learning setting. Algorithmically, our approach amounts to running PCA with a cutoff value implied by Corollary 3.2, constructing a linear classifier on the projected data $(P_T(x_1),y_1),\ldots,(P_T(x_n),y_n)$, and “lifting” this linear classifier to $\mathbb{R}^N$, with the low dimensionality of $T$ being exploited to speed up the classifier’s construction.
We then develop this approach significantly beyond the Euclidean case, to the much richer setting of general metric spaces. A completely new challenge that arises here is the algorithmic part, because no metric analogue to dimension reduction via PCA is known. Let the observed sample be $(x_1,y_1),\ldots,(x_n,y_n) \in \mathcal{X} \times \{-1,1\}$, where $(\mathcal{X},\rho)$ is some metric space. The statistical framework proposed by [vLB04], where classifiers are realized by Lipschitz functions, was extended by [GKK10] to obtain generalization bounds and algorithmic runtime that depend on the metric’s doubling dimension, denoted $\operatorname{ddim}(\mathcal{X})$ (see Section 2 for definitions). The present work makes a considerably less restrictive assumption: that the sample points lie close to some low-dimensional set. First, we establish in Section 4 new generalization bounds for the scenario where there is a multiset $\tilde{S} = \{\tilde{x}_1,\ldots,\tilde{x}_n\}$ of low doubling dimension, whose distortion $\eta$ relative to the sample is small. In this case, the Lipschitz extension classifier will generalize well, regardless of the ambient dimension $\operatorname{ddim}(\mathcal{X})$; see Theorem 4.4. Next, we address in Section 5 the computational problem of finding (in polynomial time) a near-optimal point set $\tilde{S}$. Formally, we devise an algorithm that achieves a bicriteria approximation, meaning that $\operatorname{ddim}(\tilde{S})$ and $\eta$ of the reported solution exceed the values of an optimal solution by at most a constant factor; see Theorem 5.1. The overall classification algorithm operates by computing $\tilde{S}$ and constructing a Lipschitz classifier on the modified training set $(\tilde{x}_1,y_1),\ldots,(\tilde{x}_n,y_n)$, exploiting its low doubling dimension to compute a classifier faster, using for example [GKK10].
An important feature of our method is that the generalization bounds depend only on the intrinsic dimension of the training set, and not on the dimension of (or potential points in) the ambient space. Similarly, the intrinsic low dimensionality of the observed data is exploited to design faster algorithms.
There is a plethora of literature on dimensionality reduction, see e.g. [LV07, Bur10], and thus we restrict the ensuing discussion to results addressing supervised learning. Previously, only Euclidean dimension reduction was considered, and chiefly for the purpose of improving runtime efficiency. This was realized by projecting the data onto a random low-dimensional subspace — a data-oblivious technique, see e.g. [BBV06, RR07, PBMD12]. On the other hand, data-dependent dimensionality reduction techniques have been observed empirically to improve or speed up classification performance. For instance, PCA may be applied as a preprocessing step before learning algorithms such as SVM, or the two can be put together into a combined algorithm, see e.g. [BBE03, FBJ04, HA07, VW11]. Remarkably, these techniques in some sense defy standard margin theory because orthogonal projection is liable to decrease the separation margin. Our analysis in Section 3 sheds new light on the matter.
There is little previous work on dimension reduction in general metric spaces. MDS (Multi-Dimensional Scaling) is a generalization of PCA, whose input is metric (the pairwise distances); however, its output is Euclidean and thus MDS is effective only for metrics that are “nearly” Euclidean. [GK10] considered another metric dimension reduction problem: removing from an input set as few points as possible, so as to obtain a large subset of low doubling dimension. While close in spirit, their objective is technically different from ours, and the two problems seem to require rather different techniques.
2 Definitions and notation
We use standard notation and definitions throughout, and assume a familiarity with the basic notions of Euclidean and normed spaces. We write for the indicator function of the relevant predicate and .
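A metric $\rho$ on a set $\mathcal{X}$ is a positive symmetric function satisfying the triangle inequality $\rho(x,y) \le \rho(x,z) + \rho(z,y)$; together the two comprise the metric space $(\mathcal{X},\rho)$. The Lipschitz constant of a function $f:\mathcal{X}\to\mathbb{R}$, denoted by $\|f\|_{\mathrm{Lip}}$, is defined to be the infimum of the values $L \ge 0$ that satisfy $|f(x)-f(y)| \le L\,\rho(x,y)$ for all $x,y\in\mathcal{X}$.
For a metric space $(\mathcal{X},\rho)$, let $\lambda$ be the smallest value such that every ball in $\mathcal{X}$ can be covered by $\lambda$ balls of half the radius. $\lambda$ is the doubling constant of $\mathcal{X}$, and the doubling dimension of $\mathcal{X}$ is defined as $\operatorname{ddim}(\mathcal{X}) = \log_2 \lambda$. It is well known that a $d$-dimensional Euclidean space, or any subset of it, has doubling dimension $O(d)$; however, low doubling dimension is strictly more general than low Euclidean dimension, see e.g. [GKL03].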
The $\epsilon$-covering number of a metric space $(\mathcal{X},\rho)$, denoted $N(\epsilon,\mathcal{X},\rho)$, is defined as the smallest number of balls of radius $\epsilon$ that suffices to cover $\mathcal{X}$. The covering numbers may be estimated as follows by repeatedly invoking the doubling property, see e.g. [KL04].
If is a metric space with and , then
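As a hedged illustration of how covering numbers track intrinsic rather than ambient dimension, the sketch below builds a greedy $\epsilon$-cover of a finite point set; its size upper-bounds the covering number when centers are restricted to the set. The function name `greedy_cover` and the toy data (points on a unit segment embedded in a 50-dimensional space) are illustrative assumptions, not part of the paper.

```python
import math
import random

def greedy_cover(points, eps, dist):
    """Greedily pick centers so that every point lies within eps of some center."""
    centers = []
    for p in points:
        # p becomes a new center only if no existing center already covers it.
        if all(dist(p, c) > eps for c in centers):
            centers.append(p)
    return centers

random.seed(0)
N = 50
direction = [1.0 / math.sqrt(N)] * N          # a unit vector in R^N
# Toy data (assumption): 200 points on a unit-length segment embedded in R^N.
points = [[t * d for d in direction] for t in (random.random() for _ in range(200))]

for eps in (0.5, 0.25, 0.125):
    centers = greedy_cover(points, eps, math.dist)
    print(f"eps={eps}: {len(centers)} centers")
```

Because the chosen centers are pairwise more than `eps` apart, the number of centers here grows roughly like `1/eps` (the segment is intrinsically one-dimensional) and does not depend on the ambient dimension `N`.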
Our setting in this paper is the agnostic PAC learning model, see e.g. [MRT12], where examples $(x,y)$ are drawn independently from $\mathcal{X}\times\{-1,1\}$ according to some unknown probability distribution. The learner, having observed $n$ such pairs, produces a hypothesis $h:\mathcal{X}\to\{-1,1\}$. The generalization error is the probability of misclassifying a new point. Most generalization bounds consist of a sample error term (approximately corresponding to bias in statistics), which is the fraction of observed examples misclassified by $h$, and a hypothesis complexity term (a rough analogue of variance in statistics), which measures the richness of the class of all admissible hypotheses [Was06]. (Footnote 1: The additional confidence term, typically $O(\sqrt{\log(1/\delta)/n})$, is standard and usually not optimized over.) A data-driven procedure for selecting the correct hypothesis complexity is known as model selection and is typically performed by some variant of Structural Risk Minimization [SBWA98], an analogue of the bias-variance tradeoff in statistics. Keeping in line with the literature, we ignore the measure-theoretic technicalities associated with taking suprema over uncountable function classes.
For any points in and any collection of functions mapping to a bounded range, we may define the Rademacher complexity of evaluated at the points:
where the expectation is over the iid random variables $\sigma_i$ that take on the values $\pm 1$ with probability $1/2$ each. The seminal work of [BM02] and [KP02] established the central role of Rademacher complexities in generalization bounds.
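The definition can be made concrete with a small Monte Carlo sketch. It assumes the common convention $\mathbb{E}_\sigma \sup_{f} \frac{1}{n}\sum_i \sigma_i f(x_i)$ (no absolute value) and the class of linear functionals with $\|w\|_2 \le 1$, for which the supremum has the closed form $\frac{1}{n}\|\sum_i \sigma_i x_i\|_2$. The function name `rademacher_linear` and the toy data are illustrative assumptions.

```python
import numpy as np

def rademacher_linear(X, n_draws=2000, seed=0):
    """Monte Carlo estimate of E_sigma sup_{||w||<=1} (1/n) sum_i sigma_i <w, x_i>.

    For the unit-ball linear class the supremum equals ||sum_i sigma_i x_i|| / n,
    so no optimization over w is needed.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    sigma = rng.choice([-1.0, 1.0], size=(n_draws, n))   # iid Rademacher signs
    return np.linalg.norm(sigma @ X, axis=1).mean() / n

rng = np.random.default_rng(1)
n, N = 100, 50
# Toy data (assumption): unit vectors lying on a 2-dimensional subspace of R^N.
basis = np.linalg.qr(rng.normal(size=(N, 2)))[0]          # N x 2, orthonormal columns
Z = rng.normal(size=(n, 2))
Z /= np.linalg.norm(Z, axis=1, keepdims=True)
X_low = Z @ basis.T

print(rademacher_linear(X_low), 1.0 / np.sqrt(n))
```

By Jensen's inequality the estimated quantity is at most $\sqrt{\sum_i \|x_i\|_2^2}/n$, which is the second number printed for unit-norm data.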
The Rademacher complexity of a binary function class may be controlled by the VC-dimension of through an application of Massart’s and Sauer’s lemmas:
Considerably more delicate bounds may be obtained by estimating the covering numbers and using Dudley’s chaining integral:
3 Adaptive Dimensionality Reduction: Euclidean case
Consider the problem of supervised classification in by linear hyperplanes, where . The training sample is , , with , and without loss of generality we take and the hypothesis class . Absent additional assumptions on the data, this is a high-dimensional learning problem with a costly sample complexity. Indeed, the VC-dimension of linear hyperplanes in dimensions is . If, however, it turns out that the data actually lies on a -dimensional subspace of , Eq. (1) implies that , and hence a much better generalization for . A more common distributional assumption is that of large-margin separability. In fact, the main insight articulated in [Blu05] is that data separable by margin effectively lies in an -dimensional space.
In this section, we consider the case where the data lies “close” to a low-dimensional subspace. Formally, we say that the data $\{x_i\}$ is $\eta$-close to a subspace $T \subset \mathbb{R}^N$ if $\frac{1}{n}\sum_i \|P_T(x_i) - x_i\|_2^2 \le \eta$ (where $P_T(\cdot)$ denotes the orthogonal projection onto the subspace $T$). Whenever this holds, the Rademacher complexity can be bounded in terms of $\dim(T)$ and $\eta$ alone (Theorem 3.1). As a consequence, we obtain a bound on the expected hinge-loss (Corollary 3.2). These results both motivate and guide the use of PCA for classification.
Let lie in with and define the function class . Suppose that the data is -close to some subspace and . Then
Remark. Notice that the Rademacher complexity is independent of the ambient dimension . Also note the tension between and in the bound above — as we seek a lower-dimensional approximation, we are liable to incur a larger distortion.
Denote by and the parallel and perpendicular components of the points with respect to . Note that each has the unique decomposition . We first decompose the Rademacher complexity into “parallel” and “perpendicular” terms:
We then proceed to bound the two terms in (3). To bound the first term, note that restricted to is a function class with linear-algebraic dimension , and furthermore our assumption that the data lies in the unit ball implies that the range of is bounded by in absolute value. Hence, the classic covering number estimate (see [MV03])
Let be an iid sample of size , where each satisfies . Then for all , with probability at least , for every with , and every -dimensional subspaces to which the sample is -close, we have
where is the hinge loss.
Implicit in Corollary 3.2 is a tradeoff between dimensionality reduction and distortion. Algorithmically, this tradeoff may be optimized using PCA. It suffices to compute the singular value decomposition once, with runtime complexity $O(\min\{Nn^2, N^2 n\})$ [GVL96]. Then for each $k$, we obtain the lowest-distortion $k$-dimensional subspace $T_k$, corresponding to the top $k$ singular values. We then choose the value of $k$ which minimizes the generalization bound of Corollary 3.2 and construct a low-dimensional linear classifier on the projected data $P_{T_k}(x_i)$, which is “lifted” to $\mathbb{R}^N$.
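The cutoff selection just described can be sketched as follows. This is a minimal illustration under stated assumptions: the distortion of the top-$k$ subspace is taken to be the mean squared residual, which equals the sum of the discarded squared singular values divided by $n$; the data is not centered, since the subspaces are linear; and the objective passed to `select_cutoff` is a placeholder supplied by the caller, not the bound of Corollary 3.2. All function names are hypothetical.

```python
import numpy as np

def distortion_profile(X):
    """Return (eta, Vt), where eta[k] is the mean squared distance from the rows
    of X to the span of the top-k right singular vectors (the first k rows of Vt)."""
    n = X.shape[0]
    _, s, Vt = np.linalg.svd(X, full_matrices=False)      # a single SVD, no centering
    sq = s ** 2
    # tail[k] = sum of squared singular values with (0-based) index >= k.
    tail = np.concatenate([np.cumsum(sq[::-1])[::-1], [0.0]])
    return tail / n, Vt

def select_cutoff(X, objective):
    """Pick the k minimizing objective(k, eta_k); the objective is caller-supplied."""
    eta, Vt = distortion_profile(X)
    k = min(range(len(eta)), key=lambda j: objective(j, eta[j]))
    return k, Vt[:k], eta[k]

rng = np.random.default_rng(0)
n, N = 200, 30
# Toy data (assumption): a 3-dimensional signal plus small isotropic noise.
X = rng.normal(size=(n, 3)) @ rng.normal(size=(3, N)) + 0.01 * rng.normal(size=(n, N))
X /= np.linalg.norm(X, axis=1).max()                      # scale into the unit ball

# Placeholder objective (NOT the paper's bound): trade dimension against distortion.
k, basis, eta_k = select_cutoff(X, lambda k, eta: np.sqrt(k / n) + 10.0 * np.sqrt(eta))
X_proj = X @ basis.T @ basis                              # projected sample, still in R^N
print(k, eta_k)
```

In an actual application the placeholder objective would be replaced by the full right-hand side of the generalization bound, including the empirical loss of the classifier trained on the projected data for each candidate $k$.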
4 Adaptive Dimensionality Reduction: Metric case
In this section we extend the statistical analysis of Section 3 from Euclidean spaces to the general metric case. Suppose is a metric space and we receive the training sample , , with and . Following [vLB04] and [GKK10], the classifier we construct will be a Lipschitz function (whose predictions are computed via Lipschitz extension that in turn uses approximate nearest neighbor search) — but with the added twist of a dimensionality reduction preprocessing step.
In Section 4.1, we formalize the notion of “nearly” low-dimensional data in a metric space and discuss its implication for Rademacher complexity. Given , we say that is an -perturbation of if and . If our data admits an -perturbation, we can prove that the Rademacher complexity it induces on Lipschitz functions can be bounded in terms of and alone (Theorem 4.3), independently of the ambient dimension . As in the Euclidean case (Theorem 3.1), Rademacher estimates imply data-dependent error bounds, stated in Theorem 4.4.
In Section 4.3, we describe how to convert our perturbation-based Rademacher bounds into an effective classification procedure. To this end, we develop a novel bicriteria approximation algorithm presented in Section 5. Informally, given a set and a target doubling dimension , our method efficiently computes a set with and approximately minimal distortion . As a preprocessing step, we iterate the bicriteria algorithm to find a near-optimal tradeoff between dimensionality and distortion. Having found a near-optimal -perturbation , we employ the machinery developed in [GKK10] to exploit its low dimensionality for fast approximate nearest-neighbor search.
4.1 Rademacher bounds
We begin by obtaining complexity estimates for Lipschitz functions in (nearly) doubling spaces. This was done in [GKK10] in terms of the fat-shattering dimension, but here we obtain data-dependent bounds by direct control over the covering numbers.
The following “covering numbers by covering numbers” lemma is a variant of the classic [KT61] estimate:
Let be the collection of -Lipschitz functions mapping the metric space to , and endow with the metric:
Then the covering numbers of may be estimated in terms of the covering numbers of :
Hence, for doubling spaces with diameter 1,
Equipped with the covering numbers estimate, we proceed to bound the Rademacher complexity of Lipschitz functions on doubling spaces. (Footnote 2: Analogous bounds were obtained by [vLB04] in less explicit form.)
Let be the collection of -Lipschitz -valued functions defined on a metric space with diameter and doubling dimension . Then
This bound essentially matches the rate for , as in [vLB04]. Finally, we quantify the savings earned by a low-distortion dimensionality reduction.
Let be a metric space with diameter , and consider the two -point sets , where is an -perturbation of . Let be the collection of all -Lipschitz, -valued functions on . Then
4.2 Generalization bounds
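For a real-valued function $f$ on $\mathcal{X}$, define the margin of $f$ on the labeled example $(x,y)$ by $yf(x)$. The $\gamma$-margin loss that $f$ incurs on $(x,y)$ is $\ell_\gamma(f(x),y) = \min(\max(0, 1 - yf(x)/\gamma), 1)$, which charges a value of $1$ for predicting the wrong sign, charges nothing for predicting correctly with confidence $yf(x) \ge \gamma$, and for $0 < yf(x) < \gamma$ linearly interpolates between $0$ and $1$. Since $\ell_\gamma(f(x),y) \le \mathbb{1}\{yf(x) < \gamma\}$, the sample margin loss lower-bounds the margin misclassification error.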
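The margin loss described in words above admits a one-line implementation. The sketch assumes the piecewise-linear (ramp) form implied by the description: loss 1 for a non-positive margin, 0 once the margin reaches the threshold, and linear in between. The name `margin_loss` and its argument names are illustrative.

```python
def margin_loss(fx, y, gamma):
    """Ramp loss of a real-valued prediction fx on a label y in {-1, +1}.

    Returns 1 when y*fx <= 0, 0 when y*fx >= gamma, and interpolates
    linearly for margins strictly between 0 and gamma.
    """
    margin = y * fx
    return min(max(0.0, 1.0 - margin / gamma), 1.0)

def sample_margin_loss(predictions, labels, gamma):
    """Average margin loss over a labeled sample."""
    losses = [margin_loss(fx, y, gamma) for fx, y in zip(predictions, labels)]
    return sum(losses) / len(losses)
```

Since the loss never exceeds the indicator that the margin is below the threshold, averaging it over the sample gives a quantity no larger than the fraction of examples with margin below the threshold.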
Let be the collection of -Lipschitz functions mapping the metric space of diameter 1 to . If the iid sample , , admits an -perturbation then for any , with probability at least , the following holds for all and all :
4.3 Classification procedure
Theorem 4.4 provides a statistical optimality criterion for the dimensionality-distortion tradeoff. (Footnote 3: Although the estimate in Theorem 4.3 was given as for readability, its proof yields explicit, easily computable bounds.) Unlike the Euclidean case, where a simple PCA optimized this tradeoff, the metric case requires a novel bicriteria approximation algorithm, described in Section 5. Informally, given a set and a target doubling dimension , our method efficiently computes a set with , which approximately minimizes the distortion . We may iterate this algorithm over all (since the doubling dimension of the metric space is at most ) to optimize the complexity term in Theorem 4.4. (Footnote 4: Since multiplies in the error bound, the optimization may be carried out oblivious to and .)
Once a nearly optimal -perturbation has been computed, we predict the value at a test point by a thresholded Lipschitz extension from , which algorithmically amounts to an approximate nearest-neighbor classifier. The efficient implementation of this method (as well as technicalities stemming from its approximate nature) is discussed in [GKK10]. Their algorithm computes an -approximate Lipschitz extension in preprocessing time and test-point evaluation time . The latter also allows one to efficiently decide on which sample points (if any) the classifier should be allowed to err, with corresponding savings in the Lipschitz constant (and hence lower complexity). (Footnote 5: Note that the complexity term in Theorem 4.4 scales as and hence the final classifier can always be normalized to have Lipschitz constant , so no further stratification over is necessary. We do, however, need to stratify over the doubling dimension (see [SBWA98]).)
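To illustrate why a thresholded Lipschitz extension behaves like a nearest-neighbor rule, the sketch below uses one classical construction in the spirit of [vLB04]: the function $f(x) = (d(x,S^-) - d(x,S^+))/d(S^+,S^-)$, whose sign agrees with the label of the nearest sample point. This is an exact brute-force version under stated assumptions; it is not the approximate algorithm of [GKK10], and the names `lipschitz_nn_classifier` and `edit_distance` and the toy strings are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance with unit costs (a metric on strings)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]

def lipschitz_nn_classifier(sample, dist):
    """Return f(x) = (d(x, S-) - d(x, S+)) / d(S+, S-) for a labeled sample.

    f is (2 / d(S+, S-))-Lipschitz, satisfies y * f(x) >= 1 on the sample,
    and sign(f) is the 1-nearest-neighbor prediction.
    """
    pos = [x for x, y in sample if y == +1]
    neg = [x for x, y in sample if y == -1]
    gap = min(dist(p, q) for p in pos for q in neg)    # assumed to be positive
    def f(x):
        d_pos = min(dist(x, p) for p in pos)
        d_neg = min(dist(x, q) for q in neg)
        return (d_neg - d_pos) / gap
    return f

# Toy labeled strings (assumption), compared under edit distance.
sample = [("aaaa", +1), ("aaab", +1), ("zzzz", -1), ("zzzy", -1)]
f = lipschitz_nn_classifier(sample, edit_distance)
print(f("aaba"), f("zzyz"))
```

Edit distance is used here only because it is a natural non-Hilbertian metric; any function satisfying the metric axioms can be passed as `dist`.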
5 Approximating Intrinsic Dimension and Perturbation
In this section we consider the computation of an -perturbation (of the observed data) as an optimization problem, and design for it a polynomial-time bicriteria approximation algorithm. As before, let be a finite metric space. For a point and a point set , define . Given two point sets , define the cost of mapping to to be .
Define the Low-Dimensional Mapping (LDM) problem as follows: Given a point set and a target dimension , find with such that the cost of mapping to is minimized. (Footnote 6: The LDM problem differs from $k$-median (or $k$-medoid) in that it imposes a bound on rather than on .) An -bicriteria approximate solution to the LDM problem is a subset , such that the cost of mapping to is at most times the cost of mapping to an optimal (of ), and also . We prove the following theorem.
The Low-Dimensional Mapping problem admits an -bicriteria approximation in runtime , where .
In presenting the algorithm, we first give in Section 5.2 an integer program (IP) that models this problem. We show that an optimal solution to the LDM problem implies a solution to the IP, and also that an optimal solution to the integer program gives a bicriteria approximation to the LDM problem (Lemma 5.3). However, finding an optimal solution to the IP seems difficult; we thus relax in Section 5.3 some of the IP constraints, and derive a linear program (LP) that can be solved in the runtime stated above (Lemma 5.5). Further, we give a rounding scheme that recovers from the LP solution an integral solution, and then show in Lemma 5.4 that the integral solution indeed provides the required bicriteria approximation, thereby completing the proof of Theorem 5.1.
Remark. The presented algorithm has very large (though constant) approximation factors. The introduced techniques can yield much tighter bounds, by creating many different point hierarchies instead of only a single one. We have chosen the current presentation for simplicity.
Let be a point set, and assume by scaling it has diameter and minimum interpoint distance . A hierarchy of a set is a sequence of nested sets ; here, and , while consists of a single point. Set must possess a packing property, which asserts that for all , and a -covering property for (with respect to ), which asserts that for each there exists with . Set is called a -net of the hierarchy. Every point set possesses one or more hierarchies for each value of . We will later need the following lemma, which extracts from an optimal solution a more structured sub-solution.
Let be a point set, and let be a hierarchy for with a -covering property. For every subset with doubling dimension , there exists a set satisfying , and an associated hierarchy with the following properties:
Every point is -covered by some point in , and -covered by some point of for all .
is a sub-hierarchy of , meaning that for all .
First take set and extract from it an arbitrary -covering hierarchy composed of nets . Note that each point is necessarily within distance of some point in : This is because exists in , and by the -covering property of , must be within distance of some point .
We initialize the hierarchy by setting . Construct for by first including in all points of . Then, for each , if is not within distance of a point already included in , then add to the point closest to . (Recall from above that .) Clearly, inherits the packing property of hierarchy . Further, since obeyed a -cover property, the scheme above ensures that any point must be within distance of some point in , and within distance of some point in any , .
Turning to the dimension, possessed dimension , and may be viewed as ‘moving’ each net point a distance strictly less than , which can increase the dimension by a multiplicative factor of 3. Further, the retention of points of each in can add 1 to the doubling constant, as an added point may be the center of a new ball of radius . ∎
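For intuition about point hierarchies, the following sketch greedily builds a nested sequence of nets for a finite point set. It rests on assumptions that may differ from the paper's exact conventions: the set has diameter at most 1, level $i$ uses scale $2^{-i}$, the coarsest level is a single point, and the finest level is the whole set. Each level is $2^{-i}$-separated (packing), and every point lies within $2^{-i}$ of the level-$i$ net (covering). The function name `net_hierarchy` is hypothetical.

```python
import math

def net_hierarchy(points, dist):
    """Greedy nested nets levels[0] <= levels[1] <= ... <= levels[t] = all points.

    Assumes distinct points, at least two of them, and diameter at most 1.
    levels[i] (i >= 1) is a maximal subset with pairwise distances >= 2**-i.
    """
    delta = min(dist(p, q) for i, p in enumerate(points) for q in points[i + 1:])
    t = max(1, math.ceil(math.log2(1.0 / delta)))      # finest scale is at most delta
    levels = [[points[0]]]                              # coarsest level: a single point
    for i in range(1, t + 1):
        r = 2.0 ** (-i)
        net = list(levels[-1])                          # nestedness: keep coarser net
        for p in points:
            if all(dist(p, q) >= r for q in net):       # far from every net point
                net.append(p)
        levels.append(net)
    return levels

# Toy example (assumption): five points on the real line.
pts = [0.0, 0.1, 0.5, 0.55, 1.0]
levels = net_hierarchy(pts, lambda a, b: abs(a - b))
print([len(net) for net in levels])
```

The greedy pass costs quadratic time per level, which is adequate for illustration; faster constructions exist for doubling metrics but are beyond the scope of this sketch.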
5.2 An integer program
The integer program below encapsulates a near-optimal solution to LDM, and will be relaxed to a linear program in Section 5.3. Denote the input by and , and let be a hierarchy for with a -covering property. We shall assume, following Section 5.1, that all interpoint distances are in the range , and the hierarchy possesses levels. We construct from an optimal IP solution a subset equipped with a hierarchy that is a sub-hierarchy of ; we will show in Lemma 5.3 that constructed in this way is indeed a bicriteria approximation to the LDM problem.
We introduce a set of 0-1 variables for the hierarchy ; variable corresponds to a point . Clearly . The IP imposes in Constraint (7) that , intended to be an indicator variable for whether appears in (level of the hierarchy of ). The IP requires in Constraint (8) that , which enforces the nested property in the hierarchy . When convenient, we may refer to distance between variables where we mean distance between their corresponding points.
Let us define the -level neighborhood of a point to be the net-points of that are relatively close to . Formally, when , let include all variables for which , for . If , then let be the nearest neighbor of in (notice that ), and define to include all variables for which . We similarly define three more neighbor sets: for , for , and for . The IP imposes on (or the corresponding points in ) the packing property for doubling spaces of dimension of the form , see Constraints (10)-(12). The IP imposes also covering property, as follows. Constraint (9) requires that , which implies that every is -covered by some point in for all .
Recall that is the optimal solution for the low-dimensional mapping problem on input , and let be the cost of mapping to . Let be the set given by Lemma 5.2, and note that the cost of mapping to cannot be greater than . The following lemma proves a bi-directional relationship between the IP and LDM, relating IP solution to LDM solutions . (Footnote 7: Constraints (12) and (15) are not necessary for the purposes of the following lemma, but will later play a central role in the proof of Lemma 5.4.)
Let be an input for the LDM problem.
Then yields (in the obvious manner) a feasible solution to the IP of cost at most .
A feasible solution to the IP with objective value yields that is a bicriteria approximate solution to LDM, with and cost of mapping to at most .
For part (a), we need to show that assigning the variables in and according to yields a feasible solution with the stated mapping cost. Note that is nested, so it satisfies Constraint (8). Further, the doubling dimension of implies that all points obey packing constraints (10)-(12). The covering properties of are tighter than those required by Constraint (9). Constraints (13)-(14) are valid, because if , then necessarily must be large enough to satisfy these constraints.
We then claim that Constraint (15) is actually extraneous for this IP, since it is trivially satisfied by any hierarchy possessing -covering (Constraint (9)): Since contains at most non-zero variables (Constraint (10)), Constraint (15) simply means that if contains at least one non-zero variable, then so does . But if contains a non-zero variable, then this variable is necessarily -covered by some non-zero variable in hierarchical level . Further, the non-zero covering variable must be in , since contains all variables within distance of .
Turning to the IP cost, a point