Adagio: Fast Data-Aware Near-Isometric Linear Embeddings
Many important applications, including signal reconstruction, parameter estimation, and signal processing in a compressed domain, rely on a low-dimensional representation of the dataset that preserves all pairwise distances between the data points and leverages the inherent geometric structure that is typically present. Recently Hedge, Sankaranarayanan, Yin and Baraniuk hedge2015 proposed the first data-aware near-isometric linear embedding which achieves the best of both worlds. However, their method NuMax does not scale to large-scale datasets. Our main contribution is a simple, data-aware, near-isometric linear dimensionality reduction method which significantly outperforms a state-of-the-art method hedge2015 with respect to scalability while achieving high quality near-isometries. Furthermore, our method comes with strong worst-case theoretical guarantees that allow us to guarantee the quality of the obtained near-isometry. We verify experimentally the efficiency of our method on numerous real-world datasets, where we find that our method (<10 secs) is more than 3 000× faster than the state-of-the-art method hedge2015 (>9 hours) on medium scale datasets with 60 000 data points in 784 dimensions. Finally, we use our method as a preprocessing step to increase the computational efficiency of a classification application and for speeding up approximate nearest neighbor queries.READ FULL TEXT VIEW PDF
Adagio: Fast Data-Aware Near-Isometric Linear Embeddings
Linear dimensionality reduction techniques lie at the core of high dimensional data analysis, due to their appealing geometric interpretations and computational properties. Among such techniques, principal component analysis (PCA) and its variants[9, 13, 32, 34] play a key role in analyzing high dimensional datasets. However, it is well-known that PCA can significantly distort pairwise distances between sample data points 
. This is an important drawback for machine learning and signal processing applications that rely on fairly accurate distances between points, such as reconstruction and parameter estimation.
By relaxing the notion of isometry into a near-isometry, the seminal Johnson-Lindenstrauss lemma  shows that a Euclidean space with much smaller number of dimensions preserves all pairwise distances up to relative error
. One simply needs to project the original space onto a random subspace and scale accordingly to obtain a low-distortion embedding with high probability. Formally, the JL-lemma states the following.
For any -point subset of Euclidean space and any , there exists a map with such that
To quantify the notion of near-isometry, we introduce the notion of distortion: if is a linear map, the distortion for a pair of points is defined as
The maximum distortion between any pair of points in the cloud is defined as the distortion of the embedding , and quantifies how close the embedding is to being a near-isometry.
The JL-lemma equivalently states that there exists a linear embedding to a low dimensional subspace whose distortion is at most . However, random projections are data oblivious and hence do not leverage any special geometric structure of the dataset, which frequently is present. For this reason, recent work has focused on developing linear dimensionality reduction techniques that are both data-aware and produce near-isometric embeddings, cf. [6, 17, 19, 25, 30]. It is worth mentioning that non-linear techniques have also been developed, see, e.g., .
In their recent work Hedge, Sankaranarayanan, Yin and Baraniuk  provide a non-convex formulation and a convex relaxation. On the one hand their method NuMax is data-aware and outputs high-quality near-isometries, but on the other hand does not scale well: on a dataset with 1000 data points in
, NuMax requires more than half an hour on a usual laptop to perform dimensionality reduction. For even higher dimensional datasets or datasets with more data points, NuMax fails frequently to produce any results in a reasonable amount of time. Furthermore, from a theory perspective it is an interesting open problem to prove any kind of theoretical guarantees for NuMax, as at the moment its performance is not well understood.
Contributions. Our contributions are summarized as follows.
We propose a novel randomized method for constructing efficiently linear, data-aware, near-isometric embeddings of a high-dimensional dataset. Our method combines the best of the PCA-world (data-awareness) and the JL-world (near-isometry) in an intuitive way: combine PCA and random projections appropriately. Also, our method comes with strong theoretical guarantees in contrast to prior work.
We verify the effectiveness of our proposed method on numerous real-world high-dimensional datasets. We observe that our method outperforms the state-of-the-art method due to Hedge et al.  significantly in terms of run times, while achieving high-quality near-isometries. Also, our method in constrast to NuMax comes is amenable to distributed implementations.
We use our method as a preprocessing step for an approximate nearest neighbor (ANN) application. Our findings indicate that our method can be used to improve the efficiency of ANN algorithms while achieving high accuracy. Our findings are similar for a classification application using k-nearest neighbor graphs. We observe that ADAgIO achieves high accuracy and significant speedups compared to NuMax , ranging from few hundred times up to hundred thousand times (specifically 424 780, ADAgIO runs in 0.01 seconds while NuMax requires 4 247.8 seconds).
Figure 2 offers a quick illustration of our contribution on the digit MNIST dataset. For the exact details of the experimental setup, confer Section 4. Figure 1 shows few data points: each data point is an image of a handwritten digit with 2828 pixels. Each image is converted to a data point in 784 dimensions. The dataset is publicly available . Figure 2(a),(b),and (c) show the distribution of pairwise distortions for PCA, Numax and our proposed method ADAgIO respectively. For the dataset specifics and the exact experimental setup see Section 4. Note that zero distortion corresponds to a perfect isometry. We observe that both NuMax and ADAgIO improve significantly PCA’s performance with respect to achieving a near-isometry. Also, both NuMax and ADAgIO leverage the geometric data structure. However, in terms of runtimes, ADAgIO runs in 5.6 seconds whereas Numax in roughly 18.4 minutes.
Linear dimensionality reduction methods are a cornerstone for analyzing high dimensional datasets. Given a cloud of -dimensional points (), where , and a target dimension , the goal is to produce a matrix that transforms linearly into and optimizes some objective function.
PCA. Principal Component Analysis (PCA) is a Swiss Army knife for data analysis. It minimizes the reconstruction error under the norm . It is well known, that once the data is zero-centered (i.e. ), PCA yields an orthogonal projection matrix that minimizes the following quantity
over all orthogonal projections of rank
. PCA is obtained by computing the singular value decomposition (SVD) of matrixand then by projecting the columns of on the subspace spanned by the leftmost columns of . As it can be seen by the Equation (4), PCA aims to minimize a quantity related to average distortion over all pairs of points rather than the maximum distortion. In practice PCA is not near-isometric: typically there exists a non-trivial fraction of pairs of points whose distance can be very distorted.
Random projections. Principal component analysis (PCA) is a classical unsupervised analysis method, and the predominant linear dimensionality reduction technique. While PCA is an important algorithmic tool for analyzing high-dimensional datasets, many important signal processing applications such as reconstruction and parameter estimation rely on preserving well all pairwise distances . Random projections provide a simple way to obtain near-isometric linear embeddings of clouds of points [3, 12, 15, 22, 26], see also [7, 8, 21] for some more practical aspects. Besides the existential result stated as Theorem 1, an efficient construction of the embedding is possible.
For any finite set of points , let be a matrix with uniformly and independently chosen random entries . Moreover let . Then with probability at least , it holds that
NuMax . Surprisingly, until recently little was known in terms of designing data-aware, near-isometric linear dimensionality reduction techniques. Such methods, like random projections preserve well pairwise distances, and at the same time leverage the geometric structure –typically inherent– of the dataset. In their work Hedge, Sankaranarayanan, Yin and Baraniuk  approached this question — they stated it as an non-convex optimization problem, and proposed a convex relaxation, which they called NuMAX. Despite the fine-tuned design of their algorithm, it still does not scale to large-scale datasets.
Our proposed method ADAgIO (fAst Data Aware IsOmetry) is shown in Algorithm 1. The algorithm takes as input a cloud of points and an orthonormal matrix , i.e., . We are particularly interested in the setting where . Note that is the projection matrix on the subspace spanned by the rows of . It is useful to think as as the PCA projection matrix,namely contains the top
left singular vectors of the data matrix. In general any other type of projection matrix can be used instead, for instance[2, 9, 34]. The algorithm generates a random projection JL-matrix and maps linear each to . This surprisingly simple algorithm provably achieves fast, data-aware, and near-isometric dimensionality reduction.
We prove a general theoretical result stated as Theorem 2. Then, we perform an extensive empirical evaluation of PCA with respect to how it distorts pairwise distances. Our findings show that PCA’s projection matrix can be used as input to Algorithm 1.
Padding Johnson-Lindenstrauss dimensions. Intuitively, the next theorem states that given a linear dimensionality reduction technique, we can append sufficiently many JL-dimensions to produce an embedding with smaller distortion.
Let be a finite set of points, and let be an orthonormal matrix with distortion . Then, it is possible to reduce the distortion by a multiplicative factor . Specifically, we can efficiently construct a matrix such that with probability , has distortion . Here, the target dimension is equal to .
Consider the normalized secant set . Observe, that since has distortion at most with respect to
Let be the projection matrix on the orthogonal subspace to the subspace spanned by the rows of . Moreover, let be a random projection matrix, with rows. By invoking the JL lemma 1 with we obtain that with probability the following holds
Let us condition on this event. Now, consider the map
We claim that this map has distortion with respect to . It is enough to prove that for all , we have . Fix any , and let . By Pythagoras Theorem, we have , by reordering . The assumption that has distortion at most with respect to , yields that , and therefore
On the other hand, we have
And finally , which implies the desired result. QED
A direct corollary of our result is that if the dataset is such that a low-rank linear dimensionality reduction method yields distortion , we can significantly reduce the distortion all the way to , by padding JL dimensions. For comparison, with Johnson-Lindenstrauss the necessary target dimension would be of order . Even more importantly ADAgIO leverages the geometric structure which is typically inherent to the dataset.
PCA and distance distortion. By the triangle inequality and standard SVD properties we obtain
Here are the singular values of the data matrix ordered in non-increasing order. Also, by Pythagoras’s theorem . Combining these facts we obtain that the distortion of the pairwise distance between satisfies
While PCA provides no other guarantees concerning the distribution of pairwise distances for an arbitrary cloud of points, empirically we find across a wide variety of datasets, including images, time series, gene data that the bulk of pairwise distances are preserved fairly well. Figure 3 shows the distortion that PCA causes for each pair of points for all pairs of points for nine datasets, see Table 1. Here, the target dimension of PCA is set to 20. We note that despite the relatively large number of principal components several pairwise distances become significantly distorted. We also note that PCA performs better on the dataset UWaveGesture, see Figure 3(i), while for the rest of the dataset it does not perform as well. We investigate why this is the case, and we deduce a rule of thumb when PCA has a “hard time” to preserve pairwise distances.
For every dataset in the collection, we compute its stable rank, defined as . Stable rank is frequently used in order to evaluate the intrinsic dimensionality of the data. Figure 4 shows the stable ranks of the 9 datasets. It turns out that the dataset UWaveGesture has the smallest stable rank compared to the other datasets. As a rule of thumb we expect that when the stable rank is small few principal components will capture well the dataset, resulting in a better quality near-isometry. On the contrary, when the stable rank is large, then we expect that a low-rank PCA approximation will distort significantly many pairwise distances. This is because the stable rank captures well how well the singular values are concentrated around their mean, and when it is large it implies that there is a lot of mass in the non-top principal components.
|Zürich Database||1 005||307 200|
|Glass (6 Classes)||214||10|
|Ionosphere (2 Classes)||351||34|
|Iris (3 Classes)||150||4|
|Pima diabetes (2 Classes)||768||8|
|Vehicle (4 Classes)||846||18|
|Wine (3 Classes)||178||13|
Datasets. Table 1 contains the description of the datasets we used. From the MNIST collection of hand-written letters , a dataset with 60 000 digit images, each of size ) , we selected 800 uniformly at random for our experiments. Unfortunately, NuMax takes an excessive amount of time ( 9 hours) without completing. We use 9 datasets from the UCR time series collection . Again, for this collection of datasets NuMax does not complete in a reasonable amount of time. For our approximate nearest neighbor application we use the ZuBuD image database . This dataset contains 1 005 images of 201 buildings in the city of Zürich. There exist 5 images of each building taken from different viewpoints, each of size 640 480 pixels. Finally we use 6 datasets from the UCI machine learning dataset collection  for our classification experiments.
Implementation. In our experiments, we use the following implementation for choosing the number of components for ADAgIO. For a given target dimension , we use principal components for our matrix and JL components for our random projection matrix . The resulting dimensionality reduction procedure maps each vector in the cloud to the vector . We have also experimented with using randomized PCA in place of the exact PCA using the approach proposed in . In our code we use the implementations provided by the Sklearn package. Our code is available at https://github.com/tsourolampis/Adagio. Experiments were conducted on laptop with processor Intel(R) Core(TM) i5-3317U CPU @ 1.70GHz, and 8GB of RAM.
Synopsis of our findings. Before we delve into the experimental findings we present a synopsis of our findings in Table 2. As we can see, our method ADAgIO combines the best of all worlds: data-awareness, near-isometry, and scalability.
|Data-aware||Runs fast||Near Isometry|
Recall that Figure 3 shows the distribution of pairwise distortions for PCA when the target dimension is equally to 20, or equivalently when the cloud of points is projected on a 20 dimensional linear subspace. The computational cost for running PCA is at most few seconds for all datasets. We apply ADAgIO on the same collection of datasets using the same target dimension. As we described in Section 4.1, ADAgIO uses 10 principal components and 10 JL components. Figure 5 shows the distribution of distortion for ADAgIO for all pairs of points . The run times of ADAgIO are essentially identical with PCA. Specifically, the computational overhead of ADAgIO is at most one second (UWaveGesture). As we will see later, there exist also instances for which ADAgIO is faster than PCA. At the same time ADAgIO–as even eyeballing shows – achieves a higher quality isometry than PCA. NuMax does not produce any output in a reasonable amount of time.
The contrast between Figures 3 and 5 illustrates the importance of ADAgIO. Our method is able to preserve pairwise distances significantly better than PCA, is data-aware as it leverages the geometry through the top 10 principal components, and runs extremely faster than NuMax. The following experiments provide a detailed analysis of our method and its competitors.
For a given distortion parameter , what is the dimension of the subspace required by each method in order to achieve distortion at most ? This is one important question that we study empirically on the MNIST dataset. Specifically, Figure 6 shows the experimental trade-off between the distortion of the embedding and the target dimension for various dimensionality reduction methods. ADAgIO yields a trade-off that consistently outperforms both Johnson-Lindenstrauss and PCA in the distortion regime up to . In particular, in order to preserve all pairwise distances up to error, one can reduce dimension to with ADAgIO, whereas with PCA one needs dimension at least .
NuMax algorithm finds an even better dimensionality reduction than Adagio: for distortion , one can use target dimension . Unfortunately, this comes at a significant computational cost. The cost of computing such a reduction with NuMax is often prohibitively large, usually by several orders of magnitude larger than Adagio, as we will see in the next Section where we study computational efficiency.
Table 3 presents our findings on a randomly chosen sample of 800 points from the MNIST dataset for random projection, PCA, NuMax, ADAgIO using exact PCA as its subroutine (Adagio), and ADAgIO using randomized PCA  (Adagio-Randomized-PCA). For each maximum distortion level we compute how many dimensions are needed per method to achieve this near-isometry quality. Note that for random projections there is no information for equal to 0.05 and 0.1. This is because for these low levels of distortion, the target dimension is equal to the ambient dimension 784, i.e., there is no dimensionality reduction.
As in the previous section, we observe that NuMax is able to detect lower dimensional subspaces compared to the rest of the methods for a given level of distortion. Nonetheless, the computational cost is excessive. Even for the subset of 800 points, it takes more than 18 minutes (1 105 seconds) to achieve distortion equal to 0.05. When NuMax runs on the full MNIST dataset with 60 000 digits, it takes more than 9 hours without completion. Also, note that as the distortion level decreases from 0.2 to 0.05 the run time increases rapidly. Our method ADAgIO provides a nice trade-off between computational cost and the quality of the near-isometry. It outperforms random projections and PCA by simply being their combination.
Random Projection (JL Lemma)
Additionally, ADAgIO using randomized PCA as its black-box rather than exact PCA performs comparably well with ADAgIO. The astute reader may note that for distortion ADAgIO with exact PCA versus ADAgIO with randomized PCA is faster. This happens because the constants hidden in the big-O notation asymptotics and the number of principal components required. Specifically, randomized PCA requires time to compute the top principal components for a matrix which is . Exact PCA requires time equal to . It turns out that despite the better asymptotics of randomized PCA, exact PCA wins in practice for this specific experiment. In general, the use of randomized PCA speeds up ADAgIO while maintaining the output quality. We explore randomized PCA as a subroutine for ADAgIO in the following.
Figure 7(a),(b),(c),(d) plot the target dimension versus the resulting distortion for ADAgIO when it uses an exact and a randomized PCA subroutine to find the top principal components for the MNIST subset, ScreenType, FordB, and UWaveGesture respectively. We observe that the quality of the output remains almost unchanged when we use randomized PCA. Figure 7(e),(f),(g),(h) plot the resulting speedup respectively. We observe that when the number of principal components is relatively small, the resulting speedups are important. As the target dimension increases, the speedups decrease. The run times without randomized PCA are 5.5, 1.2, 2.0, and 8.3 for the four datasets respectively. Finally, the results shown in Figure 7 are representative of our findings across all datasets.
We explore the usage of ADAgIO as a preprocessing step for nearest neighbor queries. Such queries play a major role in numerous applications and have attracted significant research interest [4, 5, 14, 20, 16, 21, 23, 35].
Our running application is image retrieval. We use the ZuBuD database, with pictures of 201 buildings in Zürich; for each building 5 different photos are provided, with different viewpoints and light conditions. Thus, in total the image database consists of 1 005 pictures. We use the well-established SIFT features  that provide significant invariance properties for image retrieval applications [8, 27].
We create a training and a query/test dataset as follows. For each one of 201 buildings, we pick the first picture — all these pictures will form a query/test set. The four remaining pictures for each building are included in the training dataset.
Using the SIFT algorithm we extract scale-invariant descriptor vectors for every image in the training dataset. Each descriptor vector is a point in . At the end of processing the data, we have a database with points in .
In order to decide which building a test photo depicts, we extract from this photo scale-invariant descriptor vectors in using again the SIFT algorithm. For each descriptor vector we perform a 10 nearest neighbor query to our database
. Thus, for any given query image, we create a multiset of descriptor vectors. We classify the picture as the building that appears most often in the resulting multiset. If a tie occurs, then we perform random assignment to one of the tied candidate buildings.
We perform the same experiment for different target dimensions, and we evaluate the performance via the accuracy defined as
Using the aforementioned experimental setup, we find that without reducing the dimensionality of the descriptor vectors, we can achieve classification accuracy . Spefically, without any dimensionality reduction method out of objects are correctly classified using this simple majority rule.
We use ADAgIO to reduce the dimensionality of the descriptor vectors to different target dimensions. Figure 8(a) shows the effect of reducing the dimension of the SIFT descriptor vectors on query times. We verify the fact that the smaller the dimension of the points, the faster we can answer a given image retrieval query. On the other hand, Figure 8(b) shows that as the dimensionality decreases, so does the accuracy. Our findings indicate that the critical dimension value is around 20. For any target dimension greater than 20, we do not observe any significant loss in the classification accuracy.
Figure 9(a) shows a sample query image, together with three closest candidates from the training dataset. It is worth noting that the building in Figure 9(d) has significantly less votes compared to buildings (b) and (c). When we apply ADAgIO with target dimension greater than 20, Figure 9(a) is correctly classified to represent the same building as in Figure 9(b). When the target dimension becomes less than 20, it is confused with Figure 9(c).
We conclude our applications by studying the effect of dimensionality reduction on classification accuracy. For timing aspects of nearest-neighbor queries, our results are consistent with the detailed results of the approximate nearest neighbor application. We perform binary classification experiments as follows. For a given training dataset of vectors indexed by the set , let be the label of
. We apply the semi-supervised learning algorithm due to Zhu, Ghahramani, and Lafferty for learning the labels of vertices in the graphs. Specifically, we solve for the vector that minimizes
subject to for all . This can be encoded as a symmetric diagonally dominant linear system which can be solved in theory more efficiently even than sorting . However in our experiments, we use usual matrix inversion. For each node we decide if , and otherwise we set . We perform 10-fold cross-validation and we report the mean classification error of the classifier, defined as , where tp,tn,fp,fn are the number of true positives, true negatives, false positives, and false negatives respectively. When we have more than two classes, we consider each class separately and we group the points from the rest of the classes as a single class. We then solve (9) for each class to obtain a vector . We decide that an unlabeled point belongs to the class that maximizes . We evaluate the classification output with the mean classification error, namely the fraction of misclassified points.
Table 4 shows the results we obtain using ADAgIO and NuMax respectively. For each dataset we record the time required to reduce the dimensionality, the time required to construct the -nearest-neighbor graph for for each dataset, and the resulting classification accuracy. We also run Zhu et al. classification algorithm on the original dataset to understand the effect of dimensionality reduction on classification accuracy. The resulting accuracies are 80.89%, 59.79%, 56.67%, 62.93%, 74.15%, and 63.59% for Glass, Ionosphere, Iris, Pima Diabetes, Vehicle, and Wine datasets respectively. By comparing these accuracies with the accuracies obtained by using NuMax and ADAgIO as preprocessing steps, we observe that typically the accuracy does not change significantly, and in some cases it may even increase. Also, we observe again that is larger for NuMax compared to our method.
Summary. In this paper we propose ADAgIO, a novel data-aware near-isometric linear dimensionality method. Our method is scalable, amenable to distributed implementation, and comes with strong theoretical guarantees. It leverages the underlying dataset geometry while maintaining well all pairwise distances between points. Our experimental results show that ADAgIO scales significantly better than NuMax, a state-of-the-art method . Furthermore, it can be used as a preprocessing step in nearest-neighbor related applications, such as classification using -nearest neighbor graphs and approximate nearest neighbor queries, in order to improve their computational efficiency while maintaining the output quality.
Open Problems. An interesting general direction is to design better data-aware near-isometric methods. We conjecture that the following problem is NP-hard. Specifically, given a dataset , target dimension and target distortion , does there exist a matrix with distortion at most with respect to ?
We would like to thank Chinmay Hedge for sharing the original NuMax code.
Proceedings of the 46th Annual ACM Symposium on Theory of Computing, pages 343–352. ACM, 2014.
Approximate nearest neighbors: towards removing the curse of dimensionality.In STOC’98, pages 604–613. ACM, 1998.
International Conference on Artificial Intelligence and Statistics, pages 235–242, 2007.
International Journal of Computer Vision, 60(2):91–110, 2004.
Artificial Neural Networks—ICANN’97, pages 583–588. Springer, 1997.