ADAGIO: Fast Data-aware Near-Isometric Linear Embeddings

09/17/2016
by   Jarosław Błasiok, et al.
Harvard University
0

Many important applications, including signal reconstruction, parameter estimation, and signal processing in a compressed domain, rely on a low-dimensional representation of the dataset that preserves all pairwise distances between the data points and leverages the inherent geometric structure that is typically present. Recently Hedge, Sankaranarayanan, Yin and Baraniuk hedge2015 proposed the first data-aware near-isometric linear embedding which achieves the best of both worlds. However, their method NuMax does not scale to large-scale datasets. Our main contribution is a simple, data-aware, near-isometric linear dimensionality reduction method which significantly outperforms a state-of-the-art method hedge2015 with respect to scalability while achieving high quality near-isometries. Furthermore, our method comes with strong worst-case theoretical guarantees that allow us to guarantee the quality of the obtained near-isometry. We verify experimentally the efficiency of our method on numerous real-world datasets, where we find that our method (<10 secs) is more than 3 000× faster than the state-of-the-art method hedge2015 (>9 hours) on medium scale datasets with 60 000 data points in 784 dimensions. Finally, we use our method as a preprocessing step to increase the computational efficiency of a classification application and for speeding up approximate nearest neighbor queries.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 5

page 8

page 14

02/25/2016

Auto-JacoBin: Auto-encoder Jacobian Binary Hashing

Binary codes can be used to speed up nearest neighbor search tasks in la...
06/30/2018

Approximate Nearest Neighbors in Limited Space

We consider the (1+ϵ)-approximate nearest neighbor search problem: given...
11/13/2021

Efficient Binary Embedding of Categorical Data using BinSketch

In this work, we present a dimensionality reduction algorithm, aka. sket...
02/23/2019

Near neighbor preserving dimension reduction for doubling subsets of ℓ_1

Randomized dimensionality reduction has been recognized as one of the fu...
03/05/2020

Fast Noise Removal for k-Means Clustering

This paper considers k-means clustering in the presence of noise. It is ...
12/15/2018

A Bandit Approach to Maximum Inner Product Search

There has been substantial research on sub-linear time approximate algor...
12/15/2019

Characterizing Generalized Rate-Distortion Performance of Video Coding: An Eigen Analysis Approach

Rate-distortion (RD) theory is at the heart of lossy data compression. H...

Code Repositories

Adagio

Adagio: Fast Data-Aware Near-Isometric Linear Embeddings


view repo
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Linear dimensionality reduction techniques lie at the core of high dimensional data analysis, due to their appealing geometric interpretations and computational properties. Among such techniques, principal component analysis (PCA) and its variants

[9, 13, 32, 34] play a key role in analyzing high dimensional datasets. However, it is well-known that PCA can significantly distort pairwise distances between sample data points [3]

. This is an important drawback for machine learning and signal processing applications that rely on fairly accurate distances between points, such as reconstruction and parameter estimation

[19].

By relaxing the notion of isometry into a near-isometry, the seminal Johnson-Lindenstrauss lemma [22] shows that a Euclidean space with much smaller number of dimensions preserves all pairwise distances up to relative error

. One simply needs to project the original space onto a random subspace and scale accordingly to obtain a low-distortion embedding with high probability. Formally, the JL-lemma states the following.

Theorem 1 (JL lemma [22, Lemma 1])

For any -point subset of Euclidean space and any , there exists a map with such that

(1)

To quantify the notion of near-isometry, we introduce the notion of distortion: if is a linear map, the distortion for a pair of points is defined as

(2)
Figure 1: Example images from MNIST dataset. Each image consists of pixels, and is considered as a point in a 784-dimensional space.
(a) (b) (c)
Figure 2: Distribution of pairwise distortions for (a) PCA, (b) Numax, and (c) ADAgIO. Respective run times in seconds are 5.2, 1 105, 5.6 seconds respectively.

The maximum distortion between any pair of points in the cloud is defined as the distortion of the embedding , and quantifies how close the embedding is to being a near-isometry.

(3)

The JL-lemma equivalently states that there exists a linear embedding to a low dimensional subspace whose distortion is at most . However, random projections are data oblivious and hence do not leverage any special geometric structure of the dataset, which frequently is present. For this reason, recent work has focused on developing linear dimensionality reduction techniques that are both data-aware and produce near-isometric embeddings, cf. [6, 17, 19, 25, 30]. It is worth mentioning that non-linear techniques have also been developed, see, e.g., [24].

In their recent work Hedge, Sankaranarayanan, Yin and Baraniuk [19] provide a non-convex formulation and a convex relaxation. On the one hand their method NuMax is data-aware and outputs high-quality near-isometries, but on the other hand does not scale well: on a dataset with 1000 data points in

, NuMax requires more than half an hour on a usual laptop to perform dimensionality reduction. For even higher dimensional datasets or datasets with more data points, NuMax fails frequently to produce any results in a reasonable amount of time. Furthermore, from a theory perspective it is an interesting open problem to prove any kind of theoretical guarantees for NuMax, as at the moment its performance is not well understood.

Contributions. Our contributions are summarized as follows.

  • We propose a novel randomized method for constructing efficiently linear, data-aware, near-isometric embeddings of a high-dimensional dataset. Our method combines the best of the PCA-world (data-awareness) and the JL-world (near-isometry) in an intuitive way: combine PCA and random projections appropriately. Also, our method comes with strong theoretical guarantees in contrast to prior work.

  • We verify the effectiveness of our proposed method on numerous real-world high-dimensional datasets. We observe that our method outperforms the state-of-the-art method due to Hedge et al. [19] significantly in terms of run times, while achieving high-quality near-isometries. Also, our method in constrast to NuMax comes is amenable to distributed implementations.

  • We use our method as a preprocessing step for an approximate nearest neighbor (ANN) application. Our findings indicate that our method can be used to improve the efficiency of ANN algorithms while achieving high accuracy. Our findings are similar for a classification application using k-nearest neighbor graphs. We observe that ADAgIO achieves high accuracy and significant speedups compared to NuMax [19], ranging from few hundred times up to hundred thousand times (specifically 424 780, ADAgIO runs in 0.01 seconds while NuMax requires 4 247.8 seconds).

Figure 2 offers a quick illustration of our contribution on the digit MNIST dataset. For the exact details of the experimental setup, confer Section 4. Figure 1 shows few data points: each data point is an image of a handwritten digit with 2828 pixels. Each image is converted to a data point in 784 dimensions. The dataset is publicly available [1]. Figure 2(a),(b),and (c) show the distribution of pairwise distortions for PCA, Numax and our proposed method ADAgIO respectively. For the dataset specifics and the exact experimental setup see Section 4. Note that zero distortion corresponds to a perfect isometry. We observe that both NuMax and ADAgIO improve significantly PCA’s performance with respect to achieving a near-isometry. Also, both NuMax and ADAgIO leverage the geometric data structure. However, in terms of runtimes, ADAgIO runs in 5.6 seconds whereas Numax in roughly 18.4 minutes.

Roadmap. Section 2 reviews briefly related work. Section 3 presents our mathematical and algorithmic contributions, and Section 4 evaluates experimentally our framework. Finally, Section 5 concludes the paper.

2 Related Work

Linear dimensionality reduction methods are a cornerstone for analyzing high dimensional datasets. Given a cloud of -dimensional points (), where , and a target dimension , the goal is to produce a matrix that transforms linearly into and optimizes some objective function.

PCA. Principal Component Analysis (PCA) is a Swiss Army knife for data analysis. It minimizes the reconstruction error under the norm . It is well known, that once the data is zero-centered (i.e. ), PCA yields an orthogonal projection matrix that minimizes the following quantity

(4)

over all orthogonal projections of rank

. PCA is obtained by computing the singular value decomposition (SVD) of matrix

and then by projecting the columns of on the subspace spanned by the leftmost columns of . As it can be seen by the Equation (4), PCA aims to minimize a quantity related to average distortion over all pairs of points rather than the maximum distortion. In practice PCA is not near-isometric: typically there exists a non-trivial fraction of pairs of points whose distance can be very distorted.

Random projections. Principal component analysis (PCA) is a classical unsupervised analysis method, and the predominant linear dimensionality reduction technique. While PCA is an important algorithmic tool for analyzing high-dimensional datasets, many important signal processing applications such as reconstruction and parameter estimation rely on preserving well all pairwise distances [19]. Random projections provide a simple way to obtain near-isometric linear embeddings of clouds of points [3, 12, 15, 22, 26], see also [7, 8, 21] for some more practical aspects. Besides the existential result stated as Theorem 1, an efficient construction of the embedding is possible.

Lemma 1 (Constructive Johnson-Lindenstrauss [3])

For any finite set of points , let be a matrix with uniformly and independently chosen random entries . Moreover let . Then with probability at least , it holds that

(5)

NuMax [19]. Surprisingly, until recently little was known in terms of designing data-aware, near-isometric linear dimensionality reduction techniques. Such methods, like random projections preserve well pairwise distances, and at the same time leverage the geometric structure –typically inherent– of the dataset. In their work Hedge, Sankaranarayanan, Yin and Baraniuk [19] approached this question — they stated it as an non-convex optimization problem, and proposed a convex relaxation, which they called NuMAX. Despite the fine-tuned design of their algorithm, it still does not scale to large-scale datasets.

3 Proposed Method

Our proposed method ADAgIO (fAst Data Aware IsOmetry) is shown in Algorithm 1. The algorithm takes as input a cloud of points and an orthonormal matrix , i.e., . We are particularly interested in the setting where . Note that is the projection matrix on the subspace spanned by the rows of . It is useful to think as as the PCA projection matrix,namely contains the top

left singular vectors of the data matrix. In general any other type of projection matrix can be used instead, for instance

[2, 9, 34]. The algorithm generates a random projection JL-matrix and maps linear each to . This surprisingly simple algorithm provably achieves fast, data-aware, and near-isometric dimensionality reduction.

0:  Cloud of points , orthonormal matrix with distortion , distortion reduction parameter
  
  Let

be a random matrix with uniformly and independently chosen random entries

  for each  do
     Output
  end for
Algorithm 1 ADAgIO
(a) (b) (c)
(d) (e) (f)
(g) (h) (i)
Figure 3: Distribution of distortion for PCA for all pairs of points . Here, the target dimension is set to 20 for (a) Computers, (b) Earthquakes, (c) FordB, (d) LargeKitchenAppliances (e) Phoneme (f) RefrigerationDevices, (g) ScreenType, (h) SmallKitchenAppliances, (i) UWaveGesture.

We prove a general theoretical result stated as Theorem 2. Then, we perform an extensive empirical evaluation of PCA with respect to how it distorts pairwise distances. Our findings show that PCA’s projection matrix can be used as input to Algorithm 1.

Padding Johnson-Lindenstrauss dimensions. Intuitively, the next theorem states that given a linear dimensionality reduction technique, we can append sufficiently many JL-dimensions to produce an embedding with smaller distortion.

Theorem 2

Let be a finite set of points, and let be an orthonormal matrix with distortion . Then, it is possible to reduce the distortion by a multiplicative factor . Specifically, we can efficiently construct a matrix such that with probability , has distortion . Here, the target dimension is equal to .

Proof  1

Consider the normalized secant set . Observe, that since has distortion at most with respect to

(6)

Let be the projection matrix on the orthogonal subspace to the subspace spanned by the rows of . Moreover, let be a random projection matrix, with rows. By invoking the JL lemma 1 with we obtain that with probability the following holds

(7)
Figure 4: Stable rank for (a) Computers , (b) Earthquakes, (c) FordB, (d) LargeKitchenAppliances (e) Phoneme (f) RefrigerationDevices, (g) ScreenType, (h) SmallKitchenAppliances, (i) UWaveGesture.

Let us condition on this event. Now, consider the map

(8)

We claim that this map has distortion with respect to . It is enough to prove that for all , we have . Fix any , and let . By Pythagoras Theorem, we have , by reordering . The assumption that has distortion at most with respect to , yields that , and therefore

On the other hand, we have

And similarly

And finally , which implies the desired result. QED

A direct corollary of our result is that if the dataset is such that a low-rank linear dimensionality reduction method yields distortion , we can significantly reduce the distortion all the way to , by padding JL dimensions. For comparison, with Johnson-Lindenstrauss the necessary target dimension would be of order . Even more importantly ADAgIO leverages the geometric structure which is typically inherent to the dataset.

PCA and distance distortion. By the triangle inequality and standard SVD properties we obtain

Here are the singular values of the data matrix ordered in non-increasing order. Also, by Pythagoras’s theorem . Combining these facts we obtain that the distortion of the pairwise distance between satisfies

(a) (b) (c)
(d) (e) (f)
(g) (h) (i)
Figure 5: Distribution of distortion for ADAgIO for all pairs of points . Here, the target dimension is set to 20. Datasets (a)Computers , (b) Earthquakes, (c) FordB, (d) LargeKitchenAppliances (e) Phoneme (f) RefrigerationDevices, (g) ScreenType, (h) SmallKitchenAppliances, (i) UWaveGesture. ADAgIO improves significantly PCA’s distortions with negligible computational cost.

While PCA provides no other guarantees concerning the distribution of pairwise distances for an arbitrary cloud of points, empirically we find across a wide variety of datasets, including images, time series, gene data that the bulk of pairwise distances are preserved fairly well. Figure 3 shows the distortion that PCA causes for each pair of points for all pairs of points for nine datasets, see Table 1. Here, the target dimension of PCA is set to 20. We note that despite the relatively large number of principal components several pairwise distances become significantly distorted. We also note that PCA performs better on the dataset UWaveGesture, see Figure 3(i), while for the rest of the dataset it does not perform as well. We investigate why this is the case, and we deduce a rule of thumb when PCA has a “hard time” to preserve pairwise distances.

For every dataset in the collection, we compute its stable rank, defined as . Stable rank is frequently used in order to evaluate the intrinsic dimensionality of the data. Figure 4 shows the stable ranks of the 9 datasets. It turns out that the dataset UWaveGesture has the smallest stable rank compared to the other datasets. As a rule of thumb we expect that when the stable rank is small few principal components will capture well the dataset, resulting in a better quality near-isometry. On the contrary, when the stable rank is large, then we expect that a low-rank PCA approximation will distort significantly many pairwise distances. This is because the stable rank captures well how well the singular values are concentrated around their mean, and when it is large it implies that there is a lot of mass in the non-top principal components.

4 Experimental results

4.1 Experimental setup

Name
Computers 250 720
Earthquakes 139 512
FordB 810 500
LargeKitchenAppliances 375 720
Phoneme 214 1024
RefrigerationDevices 375 720
ScreenType 375 720
UWaveGesture 896 945
MNIST subset 800 768
Zürich Database 1 005 307 200
Glass (6 Classes) 214 10
Ionosphere (2 Classes) 351 34
Iris (3 Classes) 150 4
Pima diabetes (2 Classes) 768 8
Vehicle (4 Classes) 846 18
Wine (3 Classes) 178 13
Table 1: Datasets used in our experiments.

Datasets. Table 1 contains the description of the datasets we used. From the MNIST collection of hand-written letters [1], a dataset with 60 000 digit images, each of size ) , we selected 800 uniformly at random for our experiments. Unfortunately, NuMax takes an excessive amount of time ( 9 hours) without completing. We use 9 datasets from the UCR time series collection [10]. Again, for this collection of datasets NuMax does not complete in a reasonable amount of time. For our approximate nearest neighbor application we use the ZuBuD image database [33]. This dataset contains 1 005 images of 201 buildings in the city of Zürich. There exist 5 images of each building taken from different viewpoints, each of size 640 480 pixels. Finally we use 6 datasets from the UCI machine learning dataset collection [28] for our classification experiments.

Implementation. In our experiments, we use the following implementation for choosing the number of components for ADAgIO. For a given target dimension , we use principal components for our matrix and JL components for our random projection matrix . The resulting dimensionality reduction procedure maps each vector in the cloud to the vector . We have also experimented with using randomized PCA in place of the exact PCA using the approach proposed in [31]. In our code we use the implementations provided by the Sklearn package. Our code is available at https://github.com/tsourolampis/Adagio. Experiments were conducted on laptop with processor Intel(R) Core(TM) i5-3317U CPU @ 1.70GHz, and 8GB of RAM.

Synopsis of our findings. Before we delve into the experimental findings we present a synopsis of our findings in Table 2. As we can see, our method ADAgIO combines the best of all worlds: data-awareness, near-isometry, and scalability.

Desired Properties
Data-aware Runs fast Near Isometry
JL
PCA
NuMax
ADAgIO
Table 2: Synopsis of desired properties for random projections (JL), PCA, NuMax, and our method ADAgIO.

4.2 Data-aware Near-Isometry

Recall that Figure 3 shows the distribution of pairwise distortions for PCA when the target dimension is equally to 20, or equivalently when the cloud of points is projected on a 20 dimensional linear subspace. The computational cost for running PCA is at most few seconds for all datasets. We apply ADAgIO on the same collection of datasets using the same target dimension. As we described in Section 4.1, ADAgIO uses 10 principal components and 10 JL components. Figure 5 shows the distribution of distortion for ADAgIO for all pairs of points . The run times of ADAgIO are essentially identical with PCA. Specifically, the computational overhead of ADAgIO is at most one second (UWaveGesture). As we will see later, there exist also instances for which ADAgIO is faster than PCA. At the same time ADAgIO–as even eyeballing shows – achieves a higher quality isometry than PCA. NuMax does not produce any output in a reasonable amount of time.

The contrast between Figures 3 and  5 illustrates the importance of ADAgIO. Our method is able to preserve pairwise distances significantly better than PCA, is data-aware as it leverages the geometry through the top 10 principal components, and runs extremely faster than NuMax. The following experiments provide a detailed analysis of our method and its competitors.

4.3 Dimension-distortion trade-off

Figure 6: Trade-off between target dimension and distortion for subset of pictures from MNIST dataset.

For a given distortion parameter , what is the dimension of the subspace required by each method in order to achieve distortion at most ? This is one important question that we study empirically on the MNIST dataset. Specifically, Figure 6 shows the experimental trade-off between the distortion of the embedding and the target dimension for various dimensionality reduction methods. ADAgIO yields a trade-off that consistently outperforms both Johnson-Lindenstrauss and PCA in the distortion regime up to . In particular, in order to preserve all pairwise distances up to error, one can reduce dimension to with ADAgIO, whereas with PCA one needs dimension at least .

NuMax algorithm finds an even better dimensionality reduction than Adagio: for distortion , one can use target dimension . Unfortunately, this comes at a significant computational cost. The cost of computing such a reduction with NuMax is often prohibitively large, usually by several orders of magnitude larger than Adagio, as we will see in the next Section where we study computational efficiency.

(a) (b) (c) (d)
(e) (f) (g) (h)
Figure 7: (I) Trade-off between target dimension and distortion on various datasets — comparison between Adagio running PCA and randomized PCA as a subroutine — usage of randomized PCA does not have a significant effect on quality of solution. Datasets (a) MNIST-subset, (b) ScreenType, (c) FordB, (d) UWaveGesture. (II) Corresponding speedups for (e) MNIST-subset, (f) ScreenType, (g) FordB, (h) UWaveGesture.

4.4 Computational efficiency

Table 3 presents our findings on a randomly chosen sample of 800 points from the MNIST dataset for random projection, PCA, NuMax, ADAgIO using exact PCA as its subroutine (Adagio), and ADAgIO using randomized PCA [18] (Adagio-Randomized-PCA). For each maximum distortion level we compute how many dimensions are needed per method to achieve this near-isometry quality. Note that for random projections there is no information for equal to 0.05 and 0.1. This is because for these low levels of distortion, the target dimension is equal to the ambient dimension 784, i.e., there is no dimensionality reduction.

As in the previous section, we observe that NuMax is able to detect lower dimensional subspaces compared to the rest of the methods for a given level of distortion. Nonetheless, the computational cost is excessive. Even for the subset of 800 points, it takes more than 18 minutes (1 105 seconds) to achieve distortion equal to 0.05. When NuMax runs on the full MNIST dataset with 60 000 digits, it takes more than 9 hours without completion. Also, note that as the distortion level decreases from 0.2 to 0.05 the run time increases rapidly. Our method ADAgIO provides a nice trade-off between computational cost and the quality of the near-isometry. It outperforms random projections and PCA by simply being their combination.

Method Distortion Dimension Time

Random Projection (JL Lemma)
0.05
0.1
0.2 260 0.45s

PCA
0.05 337 5.18s
0.1 280 5.1s
0.2 205 4.84s

NuMax
0.05 83 1105s
0.1 59 373s
0.2 42 220s

Adagio-Exact-PCA
0.05 298 5.6s
0.1 187 5.5s
0.2 95 4.8s

Adagio-Randomized-PCA
0.05 298 6.64s
0.1 190 4.01s
0.2 98 1.82s

Table 3: Run time for dimensionality reduction on 800 sample points from MNIST dataset. For details, see Section 4.4.

Additionally, ADAgIO using randomized PCA as its black-box rather than exact PCA performs comparably well with ADAgIO. The astute reader may note that for distortion ADAgIO with exact PCA versus ADAgIO with randomized PCA is faster. This happens because the constants hidden in the big-O notation asymptotics and the number of principal components required. Specifically, randomized PCA requires time to compute the top principal components for a matrix which is . Exact PCA requires time equal to . It turns out that despite the better asymptotics of randomized PCA, exact PCA wins in practice for this specific experiment. In general, the use of randomized PCA speeds up ADAgIO while maintaining the output quality. We explore randomized PCA as a subroutine for ADAgIO in the following.

4.5 ADAgIO with Randomized PCA

Figure 7(a),(b),(c),(d) plot the target dimension versus the resulting distortion for ADAgIO when it uses an exact and a randomized PCA subroutine to find the top principal components for the MNIST subset, ScreenType, FordB, and UWaveGesture respectively. We observe that the quality of the output remains almost unchanged when we use randomized PCA. Figure 7(e),(f),(g),(h) plot the resulting speedup respectively. We observe that when the number of principal components is relatively small, the resulting speedups are important. As the target dimension increases, the speedups decrease. The run times without randomized PCA are 5.5, 1.2, 2.0, and 8.3 for the four datasets respectively. Finally, the results shown in Figure 7 are representative of our findings across all datasets.

4.6 Approximate Nearest Neighbor Queries

(a) (b)
Figure 8: Query time and accuracy for classification of building photos from ZuBuD database using ADAgIO for different target dimensions. The ambient dimension is 128. For details, please see Section 4.6.

We explore the usage of ADAgIO as a preprocessing step for nearest neighbor queries. Such queries play a major role in numerous applications and have attracted significant research interest [4, 5, 14, 20, 16, 21, 23, 35].

Our running application is image retrieval. We use the ZuBuD database

[33], with pictures of 201 buildings in Zürich; for each building 5 different photos are provided, with different viewpoints and light conditions. Thus, in total the image database consists of 1 005 pictures. We use the well-established SIFT features [29] that provide significant invariance properties for image retrieval applications [8, 27].

We create a training and a query/test dataset as follows. For each one of 201 buildings, we pick the first picture — all these pictures will form a query/test set. The four remaining pictures for each building are included in the training dataset.

Using the SIFT algorithm we extract scale-invariant descriptor vectors for every image in the training dataset. Each descriptor vector is a point in . At the end of processing the data, we have a database with points in .

In order to decide which building a test photo depicts, we extract from this photo scale-invariant descriptor vectors in using again the SIFT algorithm. For each descriptor vector we perform a 10 nearest neighbor query to our database

. Thus, for any given query image, we create a multiset of descriptor vectors. We classify the picture as the building that appears most often in the resulting multiset. If a tie occurs, then we perform random assignment to one of the tied candidate buildings.

We perform the same experiment for different target dimensions, and we evaluate the performance via the accuracy defined as

Using the aforementioned experimental setup, we find that without reducing the dimensionality of the descriptor vectors, we can achieve classification accuracy . Spefically, without any dimensionality reduction method out of objects are correctly classified using this simple majority rule.

We use ADAgIO to reduce the dimensionality of the descriptor vectors to different target dimensions. Figure 8(a) shows the effect of reducing the dimension of the SIFT descriptor vectors on query times. We verify the fact that the smaller the dimension of the points, the faster we can answer a given image retrieval query. On the other hand, Figure 8(b) shows that as the dimensionality decreases, so does the accuracy. Our findings indicate that the critical dimension value is around 20. For any target dimension greater than 20, we do not observe any significant loss in the classification accuracy.

Figure 9(a) shows a sample query image, together with three closest candidates from the training dataset. It is worth noting that the building in Figure 9(d) has significantly less votes compared to buildings (b) and (c). When we apply ADAgIO with target dimension greater than 20, Figure 9(a) is correctly classified to represent the same building as in Figure 9(b). When the target dimension becomes less than 20, it is confused with Figure 9(c).

(a) (b) (c) (d)
Figure 9: Sample images from ZuBuD database. (a) Query image, (b), (c), (d) nearby buildings in the training dataset. When the target dimension of the dimensionality reduction is , (a) is correctly classified as the same object as (b). After reducing dimension to , it becomes incorrectly classified as (c) .

4.7 Classification

We conclude our applications by studying the effect of dimensionality reduction on classification accuracy. For timing aspects of nearest-neighbor queries, our results are consistent with the detailed results of the approximate nearest neighbor application. We perform binary classification experiments as follows. For a given training dataset of vectors indexed by the set , let be the label of

. We apply the semi-supervised learning algorithm due to Zhu, Ghahramani, and Lafferty

[36] for learning the labels of vertices in the graphs. Specifically, we solve for the vector that minimizes

(9)

subject to for all . This can be encoded as a symmetric diagonally dominant linear system which can be solved in theory more efficiently even than sorting [11]. However in our experiments, we use usual matrix inversion. For each node we decide if , and otherwise we set . We perform 10-fold cross-validation and we report the mean classification error of the classifier, defined as , where tp,tn,fp,fn are the number of true positives, true negatives, false positives, and false negatives respectively. When we have more than two classes, we consider each class separately and we group the points from the rest of the classes as a single class. We then solve (9) for each class to obtain a vector . We decide that an unlabeled point belongs to the class that maximizes . We evaluate the classification output with the mean classification error, namely the fraction of misclassified points.

Table 4 shows the results we obtain using ADAgIO and NuMax respectively. For each dataset we record the time required to reduce the dimensionality, the time required to construct the -nearest-neighbor graph for for each dataset, and the resulting classification accuracy. We also run Zhu et al. classification algorithm on the original dataset to understand the effect of dimensionality reduction on classification accuracy. The resulting accuracies are 80.89%, 59.79%, 56.67%, 62.93%, 74.15%, and 63.59% for Glass, Ionosphere, Iris, Pima Diabetes, Vehicle, and Wine datasets respectively. By comparing these accuracies with the accuracies obtained by using NuMax and ADAgIO as preprocessing steps, we observe that typically the accuracy does not change significantly, and in some cases it may even increase. Also, we observe again that is larger for NuMax compared to our method.

NuMax ADAgIO
acc.% acc.%
Glass 32.04 0.1 78.2 0.004 0.006 79.9
Ion. 24.0 0.35 33.5 0.01 0.02 64.13
Iris 32.7 0.029 56.4 0.004 0.006 65.6
Pima 4 247.8 4.72 61.43 0.01 0.05 61.76
Veh. 101.9 8.80 74.12 0.007 0.84 74.22
Wine 38.7 0.04 51.2 0.007 0.006 55.66
Table 4: NuMax and ADAgIO performance for the classification task using -NN graphs, . Run-times are in seconds. For details, please see Section 4.7 for details.

5 Conclusion

Summary. In this paper we propose ADAgIO, a novel data-aware near-isometric linear dimensionality method. Our method is scalable, amenable to distributed implementation, and comes with strong theoretical guarantees. It leverages the underlying dataset geometry while maintaining well all pairwise distances between points. Our experimental results show that ADAgIO scales significantly better than NuMax, a state-of-the-art method [19]. Furthermore, it can be used as a preprocessing step in nearest-neighbor related applications, such as classification using -nearest neighbor graphs and approximate nearest neighbor queries, in order to improve their computational efficiency while maintaining the output quality.

Open Problems. An interesting general direction is to design better data-aware near-isometric methods. We conjecture that the following problem is NP-hard. Specifically, given a dataset , target dimension and target distortion , does there exist a matrix with distortion at most with respect to ?

Acknowledgment

We would like to thank Chinmay Hedge for sharing the original NuMax code.

References

  • [1] MNIST handwritten digits. Available at http://www.cs.nyu.edu/~roweis/data.html.
  • [2] A. Abdullah, A. Andoni, R. Kannan, and R. Krauthgamer. Spectral approaches to nearest neighbor search. In FOCS, 2014.
  • [3] D. Achlioptas. Database-friendly random projections: Johnson-Lindenstrauss with binary coins. Journal of computer and System Sciences, 66(4):671–687, 2003.
  • [4] A. Andoni and P. Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. In 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS’06), pages 459–468. IEEE, 2006.
  • [5] S. Arya and D. M. Mount. Approximate nearest neighbor queries in fixed dimensions. In SODA, volume 93, pages 271–280, 1993.
  • [6] B. Bah, A. Sadeghian, and V. Cevher. Energy-aware adaptive bi-Lipschitz embeddings. arXiv preprint arXiv:1307.3457, 2013.
  • [7] E. Bingham and H. Mannila. Random projection in dimensionality reduction: applications to image and text data. In Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining, pages 245–250. ACM, 2001.
  • [8] P. T. Boufounos, S. Rane, and H. Mansour. Representation and coding of signal geometry. arXiv preprint arXiv:1512.07636, 2015.
  • [9] E. J. Candès, X. Li, Y. Ma, and J. Wright. Robust principal component analysis? Journal of the ACM (JACM), 58(3):11, 2011.
  • [10] Y. Chen, E. Keogh, B. Hu, N. Begum, A. Bagnall, A. Mueen, and G. Batista. The ucr time series classification archive, July 2015. www.cs.ucr.edu/~eamonn/time_series_data/.
  • [11] M. B. Cohen, R. Kyng, G. L. Miller, J. W. Pachocki, R. Peng, A. B. Rao, and S. C. Xu. Solving SDD linear systems in nearly time. In

    Proceedings of the 46th Annual ACM Symposium on Theory of Computing

    , pages 343–352. ACM, 2014.
  • [12] S. Dasgupta and A. Gupta. An elementary proof of a theorem of Johnson and Lindenstrauss. Random Structures & Algorithms, 22(1):60–65, 2003.
  • [13] A. d’Aspremont, L. El Ghaoui, M. I. Jordan, and G. R. Lanckriet. A direct formulation for sparse PCA using semidefinite programming. SIAM review, 49(3):434–448, 2007.
  • [14] C. Ding and X. He. K-nearest-neighbor consistency in data clustering: incorporating local information into global optimization. In ACM symposium on Applied computing, pages 584–589, 2004.
  • [15] P. Frankl and H. Maehara. The Johnson-Lindenstrauss lemma and the sphericity of some graphs. Journal of Combinatorial Theory, Series B, 44(3):355–362, 1988.
  • [16] A. Gionis, P. Indyk, R. Motwani, et al. Similarity search in high dimensions via hashing. In VLDB, volume 99, pages 518–529, 1999.
  • [17] E. Grant, C. Hegde, and P. Indyk. Nearly optimal linear embeddings into very low dimensions. In Global Conference on Signal and Information Processing (GlobalSIP), 2013 IEEE, pages 973–976. IEEE, 2013.
  • [18] N. Halko, P. G. Martinsson, and J. A. Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2):217–288, 2011.
  • [19] C. Hegde, A. Sankaranarayanan, W. Yin, and R. Baraniuk. Numax: A convex approach for learning near-isometric linear embeddings. Signal Processing, IEEE Transactions on, 63(22):6109–6121, Nov 2015.
  • [20] T. Idé, S. Papadimitriou, and M. Vlachos. Computing correlation anomaly scores using stochastic nearest neighbors. In ICDM 2007, pages 523–528. IEEE, 2007.
  • [21] P. Indyk and R. Motwani.

    Approximate nearest neighbors: towards removing the curse of dimensionality.

    In STOC’98, pages 604–613. ACM, 1998.
  • [22] W. B. Johnson and J. Lindenstrauss. Extensions of Lipschitz mappings into a hilbert space. Contemporary mathematics, 26(189-206):1, 1984.
  • [23] J. M. Kleinberg. Two algorithms for nearest-neighbor search in high dimensions. In Proceedings of the twenty-ninth annual ACM symposium on Theory of computing (STOC’97), pages 599–608. ACM, 1997.
  • [24] B. Kulis, A. C. Surendran, and J. C. Platt. Fast low-rank semidefinite programming for embedding and clustering. In

    International Conference on Artificial Intelligence and Statistics

    , pages 235–242, 2007.
  • [25] A. Kyrillidis, M. Vlachos, and A. Zouzias. Approximate matrix multiplication with application to linear embeddings. In Information Theory (ISIT), 2014 IEEE International Symposium on, pages 2182–2186. Ieee, 2014.
  • [26] K. G. Larsen and J. Nelson. The Johnson-Lindenstrauss lemma is optimal for linear dimensionality reduction. arXiv preprint arXiv:1411.2404, 2014.
  • [27] M. Li, S. Rane, and P. Boufounos. Quantized embeddings of scale-invariant image features for mobile augmented reality. In Multimedia Signal Processing (MMSP), 2012 IEEE 14th International Workshop on, pages 1–6, Sept 2012.
  • [28] M. Lichman. UCI machine learning repository, http://archive.ics.uci.edu/ml, 2013.
  • [29] D. G. Lowe. Distinctive image features from scale-invariant keypoints.

    International Journal of Computer Vision

    , 60(2):91–110, 2004.
  • [30] J. Luo, K. Shapiro, H.-J. M. Shi, Q. Yang, and K. Zhu. Practical algorithms for learning near-isometric linear embeddings. arXiv preprint arXiv:1601.00062, 2016.
  • [31] P.-G. Martinsson, V. Rokhlin, and M. Tygert. A randomized algorithm for the decomposition of matrices. Applied and Computational Harmonic Analysis, 30(1):47 – 68, 2011.
  • [32] B. Schölkopf, A. Smola, and K.-R. Müller. Kernel principal component analysis. In

    Artificial Neural Networks—ICANN’97

    , pages 583–588. Springer, 1997.
  • [33] H. Shao, T. Svoboda, and L. van Gool. ZuBuD – Zurich Buildings Database for Image Based Recognition. Technical Report Technical Report No. 260, Zurich, Switzerland, Apr. 2003.
  • [34] M. E. Tipping and C. M. Bishop. Probabilistic principal component analysis. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 61(3):611–622, 1999.
  • [35] K. Ueno, X. Xi, E. Keogh, and D.-J. Lee. Anytime classification using the nearest neighbor algorithm with applications to stream mining. In ICDM’06, pages 623–632. IEEE, 2006.
  • [36] X. Zhu, Z. Ghahramani, J. Lafferty, et al. Semi-supervised learning using gaussian fields and harmonic functions. In ICML, volume 3, pages 912–919, 2003.