1 Introduction
Progress in science and engineering depends more than ever on our ability to analyze huge amounts of sensor and simulation data [11]. The vast majority of this data, coming from, for example, high-performance high-fidelity numerical simulations, high-resolution scientific instruments (microscopes, DNA sequencers, etc.), or Internet of Things feeds, is a result of complex nonlinear processes. While these nonlinear processes can be characterized by low-dimensional submanifolds, the actual observable data they generate is high-dimensional.
High-dimensional data is inherently difficult to explore and analyze, owing to the "curse of dimensionality" and the "empty space phenomenon" that render many statistical and machine learning techniques (e.g., clustering, classification, model fitting) inadequate [3, 16]. In this context, nonlinear spectral dimensionality reduction (NLSDR) has proved to be an indispensable tool [10]. However, standard NLSDR methods, e.g., Isomap [17] or Locally Linear Embedding [13], have been designed for offline or batch processing, where the entire input data is available at the time of analysis. Consequently, they are computationally too expensive or impractical in cases where dimensionality reduction must be applied to a data stream.

In this paper, we address the above limitation of classic NLSDR methods and propose an alternative solution that builds on a systematic study of errors in manifold learning. We first describe a protocol to capture a "collective error" of the Isomap method. Then, we show how this error can be used to detect a transition point at which sufficient data has been accumulated to describe a high-quality manifold, and computationally lightweight techniques can be used to efficiently map the remaining data in a stream. Our specific contributions are as follows:

We formalize a notion of collective error in Isomap and describe different strategies to quantify it using Procrustes analysis. We perform a careful experimental study of the error behavior with respect to the amount of available data, using synthetic as well as real-life benchmark data. We identify properties of the error that can be used to detect when the manifold becomes stable and robust to incorporating new points.

We propose a new efficient algorithm to incorporate streamed data into a stable Isomap manifold. The complexity of the algorithm depends only on the size of the initial stable manifold and is independent of the stream size. Thus, the algorithm is suitable for high-volume and high-throughput stream processing.
In our approach, we focus on Isomap because of its broad adoption in scientific computing [15] and biomedical research, including fMRI analysis [18], clustering of oncology data [12], gene and protein expression profiling [9], modeling of spatiotemporal relationships in data [6], and many others.
Isomap provides a simple and elegant way to estimate the intrinsic geometry of the data manifold based on a rough estimate of each data point's neighbors on the manifold. Unfortunately, its high computational complexity makes it too expensive to use on anything but relatively small data sets, and it is not suited for stream processing. Consequently, there is a significant demand for an Isomap variant capable of learning a robust manifold from high-throughput data streams.
The remainder of this paper is organized as follows. In Section 2, we provide preliminary information and a high-level overview of NLSDR methods. In Section 3, we introduce our approach to quantifying error in Isomap, and in Section 4 we report experimental results showing the properties of our error measure. In Section 5, we introduce our new efficient algorithm for out-of-sample Isomap extension. We close the paper with a brief survey of related work in Section 6 and concluding remarks in Section 7.
2 Preliminaries
In the most general terms, the nonlinear dimensionality reduction problem can be posed as follows. Given a data matrix X = [x₁, x₂, …, xₙ]ᵀ, such that each xᵢ ∈ ℝᴰ, the task is to find a corresponding low-dimensional representation, yᵢ ∈ ℝᵈ, for each xᵢ, where d ≪ D.
Nonlinear spectral dimensionality reduction methods assume that the data lies along a nonlinear low-dimensional manifold embedded in a high-dimensional space, and exploit the global (Isomap [17], Minimum Volume Embedding [19]) or local (Locally Linear Embedding [13], Laplacian Eigenmaps [1]) properties of the manifold to map each xᵢ to its corresponding yᵢ.
Most NLSDR methods follow a similar series of data transformations, as depicted in Figure 1. In the first step, the matrix X is used to construct a neighborhood graph, where each node of the graph is connected to its k nearest neighbors (kNN) in some, typically Euclidean, metric. This process involves O(n²) pairwise distance computations. Next, the neighborhood graph is analyzed to construct a feature matrix. This matrix encodes properties of the input data that should be preserved during dimensionality reduction, i.e., it directly captures properties of the sought submanifold. For example, in the classic Isomap formulation [17], the feature matrix stores the shortest path between each pair of points (nodes) in the neighborhood graph, which approximates the actual geodesic distance between the points. The cost of the feature matrix construction varies between NLSDR methods, but generally falls between O(n² log n) and O(n³). To obtain the final d-dimensional representation of the input data, Y, the feature matrix is factorized, and the first d eigenvectors become the output Y. The cost of this step is usually O(n³).

If directly and naively applied to streaming data, NLSDR methods have to recompute the entire manifold each time a new point is extracted from the stream. This quickly becomes computationally prohibitive, although it guarantees the best possible quality of the learned manifold given the data. To alleviate the computational problem, landmark and general out-of-sample extension methods have been proposed (see Section 6). These techniques, however, either neglect the manifold approximation error or remain computationally too expensive for practical applications (for example, their complexity depends directly on the stream size). Here, we propose a different strategy. Initially, we aggregate incoming data and process it using the standard Isomap. At the same time, we track the quality of the resulting manifold with computationally acceptable overhead. When we detect that adding new points does not improve the quality of the discovered manifold, i.e., the manifold is stable, we drop the standard Isomap and proceed with a faster approximate method that is sufficient to maintain the quality of the manifold.
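As an illustration of the three pipeline steps above, a minimal (unoptimized) Isomap can be sketched as follows; the function name and default parameters are ours, not the paper's:

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import shortest_path

def isomap(X, n_neighbors=10, d=2):
    # Step 1: k-nearest-neighbor graph (O(n^2) pairwise distances).
    G = kneighbors_graph(X, n_neighbors, mode="distance")
    # Step 2: feature matrix = all-pairs shortest paths in the graph,
    # approximating geodesic distances on the manifold.
    D = shortest_path(G, method="D", directed=False)
    # Step 3: classical MDS on the geodesic matrix -- double-center the
    # squared distances, then keep the top-d eigenvectors.
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J
    w, V = np.linalg.eigh(B)
    idx = np.argsort(w)[::-1][:d]
    return V[:, idx] * np.sqrt(w[idx])
```

The eigendecomposition of the dense feature matrix is the step that dominates the batch cost, which is what motivates avoiding it for streamed points.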
Our approach requires two components, which we describe in the following sections: a method to assess the accuracy of the learned manifold, and an efficient algorithm to assimilate the remaining points from the stream.
3 Proposed Approach to Isomap Error
To quantitatively assess the performance of Isomap in large-scale streaming applications, we first need an error metric to determine the accuracy of the learned manifold.
Error in Isomap may arise for several reasons, including incorrect parameter selection and noisy input data. Parameter selection error may be introduced by a poor choice of the neighborhood size, k, or by selecting a suboptimal target dimensionality, d. Error due to the input data can be attributed to noise, missing or limited data entries, or a skewed, rather than uniform, sampling of the underlying manifold. Finally, there is the intrinsic error of Isomap itself, for instance due to simplifying assumptions about the manifold, limited numerical precision, etc.
While error in Isomap has been discussed in prior work (see Section 6), most of these studies have focused on one particular aspect of the error. We are interested in understanding the collective error associated with Isomap as a function of the size of the input data. Here we present metrics to measure this error.
3.1 Error Definition
We assume that there exists a function f that maps each low-dimensional sample yᵢ to its observed high-dimensional counterpart xᵢ. The goal of NLSDR is to learn the inverse mapping, f⁻¹, that can be used to map high-dimensional xᵢ to low-dimensional yᵢ, i.e., yᵢ = f⁻¹(xᵢ).
Let the approximate mapping learned by Isomap be denoted by the inverse function f̃⁻¹. Let Y denote the data matrix containing the true low-dimensional mapping for the samples in X, i.e., Y = [f⁻¹(x₁), …, f⁻¹(xₙ)]ᵀ, and let Ỹ denote the matrix with the approximate mapping, i.e., Ỹ = [f̃⁻¹(x₁), …, f̃⁻¹(xₙ)]ᵀ.
To measure the error between the true representation and the Isomap-induced approximate representation, we leverage Procrustes analysis, which is widely used for shape analysis [5]. The idea behind Procrustes analysis is to align two matrices, A and B, by finding the optimal translation t, rotation R, and scaling s that minimize the Frobenius norm between the two aligned matrices, that is:

d_P(A, B) = min_{s,R,t} ‖s·A·R + 1·tᵀ − B‖_F    (1)

The above optimization problem has a closed-form solution, obtained by performing the Singular Value Decomposition of AᵀB [5].

We use Procrustes analysis to measure the quality of the manifold approximated by Isomap in two ways: 1) a direct method for comparison when the low-dimensional ground truth, Y, is available; 2) a reference-sample method that can be used when ground truth is absent. The asymptotic behavior of the latter converges to that of the former.
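The closed-form solution can be sketched directly; this is a minimal implementation in the spirit of Procrustes alignment (the function name and the normalization choices are ours):

```python
import numpy as np

def procrustes_error(Y1, Y2):
    # Translation: remove the mean of each configuration.
    A = Y1 - Y1.mean(axis=0)
    B = Y2 - Y2.mean(axis=0)
    # Normalize overall scale so errors are comparable across sample sizes.
    A /= np.linalg.norm(A)
    B /= np.linalg.norm(B)
    # Optimal rotation from the SVD of the cross-covariance (closed form).
    U, s, Vt = np.linalg.svd(A.T @ B)
    R = U @ Vt
    # Optimal scaling is the sum of singular values; report the squared
    # residual Frobenius norm after alignment.
    return np.linalg.norm(A @ R * s.sum() - B) ** 2
```

With both inputs normalized to unit Frobenius norm, the returned value lies in [0, 1] and is zero exactly when the two configurations differ only by translation, rotation, and scaling.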
Direct Procrustes Error
The first approach requires a ground truth reference for the low-dimensional manifold. When both Y and Ỹ are known, the error is measured using (1), that is:

ε = d_P(Y, Ỹ)    (2)
The error ε is expected to decrease with the size n of the input sample. Our hypothesis is that ε and n have an inverse relationship, with ε converging to a constant value as n becomes large. We provide an informal analysis to support this hypothesis.
For manifolds with known properties (e.g., volume, radius of curvature, etc.), the original Isomap paper [17] shows that the ratio between the approximate geodesic distance between a pair of samples, computed from the nearest neighbor graph, and the true geodesic distance on the manifold can be bounded within 1 ± λ with probability at least 1 − μ, for arbitrarily small values of λ and μ, provided the density of the samples is greater than a threshold α, which is a function of λ, μ, the neighborhood size, and the manifold properties (λ is the bound on the ratio between the approximate and the true geodesic distances). Closer inspection reveals that:

(3)
We can relate λ to the Procrustes error ε. In particular, using results from eigenvalue perturbation theory [7], we note that the eigenvalues of the approximate geodesic matrix will be within a perturbation proportional to λ of the eigenvalues obtained from the true geodesic matrix, which implies that the Procrustes error is likewise bounded in terms of λ. Assuming that the data is sampled uniformly, and using the result from (3), the following holds:

(4)

where n₀ is the minimum number of points required to establish the error bound.
Reference-Sample Error
In the absence of low-dimensional ground truth we use a reference-sample method, which works as follows. Given X, we select a reference set R and two equal-sized, disjoint sample sets S₁ and S₂. Next, we create two data sets, X₁ and X₂, such that Xᵢ = R ∪ Sᵢ for i ∈ {1, 2}. An approximation is learned for each Xᵢ, and the error is computed as the Procrustes error between the two learned approximations of the reference set R:

ε_ref = d_P(Ỹ_R⁽¹⁾, Ỹ_R⁽²⁾)    (5)
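The reference-sample protocol can be sketched as follows. For brevity this version leans on scikit-learn's Isomap and SciPy's Procrustes disparity; the split sizes and the function name are illustrative assumptions, not the paper's code:

```python
import numpy as np
from sklearn.manifold import Isomap
from scipy.spatial import procrustes

def reference_sample_error(X, n_ref, n_sample, n_neighbors=10, d=2):
    # Partition the data into a reference set R and two disjoint samples.
    R, rest = X[:n_ref], X[n_ref:]
    S1, S2 = rest[:n_sample], rest[n_sample:2 * n_sample]
    emb = Isomap(n_neighbors=n_neighbors, n_components=d)
    # Two independent approximations that share the reference points.
    Y1 = emb.fit_transform(np.vstack([R, S1]))[:n_ref]
    Y2 = emb.fit_transform(np.vstack([R, S2]))[:n_ref]
    # Procrustes disparity between the two embeddings of R alone.
    _, _, disparity = procrustes(Y1, Y2)
    return disparity
```

Because only the embeddings of R are compared, the metric needs no ground truth: if both samples are large enough to pin down the manifold, the two embeddings of R should agree up to a similarity transform.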
4 Experimental Results
We present several experiments on a variety of data sets to illustrate the behavior of error using the metrics proposed in Section 3. This allows for a better understanding of the asymptotic behavior of error in Isomap, which is required by our strategy for data stream processing. In particular, our objective is to show that the error converges after a certain number of samples has been observed.
4.1 Test Data
The Swiss Roll data set is typically used for evaluating manifold learning algorithms. Swiss Roll data is generated from 2-dimensional samples yᵢ = (tᵢ, hᵢ), embedded into 3 dimensions using a nonlinear function:

(6)
The common approach is to generate the 3-dimensional Swiss Roll from (tᵢ, hᵢ) through the use of nonlinear functions of the form x = t·cos(t), y = h, and z = t·sin(t). This is problematic due to the fact that these functions are not isometric, i.e., distances between points in the 2-dimensional parameter space are not preserved along the surface generated in 3 dimensions. We propose a new data set to rectify this issue, the Euler Isometric Swiss Roll. The idea is to substitute the commonly used spiral (t·cos(t), t·sin(t)) with the equations for the Euler spiral:
x(t) = ∫₀ᵗ cos(πs²/2) ds    (7)

y(t) = ∫₀ᵗ sin(πs²/2) ds    (8)
The Euler spiral has the property that the curvature at any point is proportional to the arc length from the origin. This gives constant angular acceleration along the curve and, because the curve is parameterized by arc length, ensures that isometry is preserved.
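Assuming the Euler spiral is generated via the Fresnel integrals in SciPy's convention, S(t) = ∫₀ᵗ sin(πs²/2) ds and C(t) = ∫₀ᵗ cos(πs²/2) ds, the data set can be sketched as below; the parameter ranges and the function name are illustrative choices of ours:

```python
import numpy as np
from scipy.special import fresnel

def euler_swiss_roll(n, t_max=4.0, height=10.0, seed=0):
    rng = np.random.default_rng(seed)
    t = rng.uniform(0.0, t_max, n)    # arc-length coordinate along the spiral
    h = rng.uniform(0.0, height, n)   # flat coordinate
    S, C = fresnel(t)                 # Euler-spiral coordinates (unit speed)
    X = np.column_stack([C, h, S])    # 3-D observed data
    Y = np.column_stack([t, h])       # 2-D ground truth (isometric chart)
    return X, Y
```

The tangent of (C(t), S(t)) is (cos(πt²/2), sin(πt²/2)), a unit vector, so arc length along the spiral equals t exactly; this is what makes the 2-D chart (t, h) isometric to the embedded surface.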
In order to investigate our approach to error analysis in the absence of ground truth, we consider two benchmark data sets: the MNIST handwritten digit database and the Corel Image data set. The MNIST database is composed of 70,000 normalized, grayscale images of handwritten digits '0' to '9'. Each image is represented by a 784-dimensional vector obtained by flattening the 28×28 grayscale image. Each of the ten digits has roughly 7,000 samples. The Corel Image data set consists of 68,040 photo images from various categories. Each image is represented using 57 features.
4.2 Results and Analysis
First, we apply Isomap to samples of data from the Swiss Roll data set, to investigate the ability of Isomap to reconstruct the true low-dimensional structure under the conditions of limited and increasing data. The obtained results are presented in Figure 2. Each experiment is performed 10 times, and the mean error over all trials is reported.
Figure 2 shows that for small amounts of data the error as formulated in (2) is relatively high. As more data samples are used, the error first decreases rapidly and then begins to converge at approximately 1,500 samples. This suggests that the approximation can be learned within some error tolerance using 1,500 points, and that additional data points contribute no significant information. It is at this point that we wish to switch to more computationally lightweight, approximate methods to process the remaining samples.
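One simple way to operationalize this transition detection (our heuristic, not a rule prescribed in the paper) is to declare the manifold stable once the relative improvement of the error over a sliding window drops below a tolerance:

```python
def transition_point(errors, window=3, tol=0.01):
    """Return the first index where the error improved by less than
    `tol` (relative) over the last `window` measurements, or None."""
    for i in range(window, len(errors)):
        prev, cur = errors[i - window], errors[i]
        if prev > 0 and (prev - cur) / prev < tol:
            return i  # index at which to switch to the streaming method
    return None
```

The window guards against declaring convergence on a single flat step; `window` and `tol` trade off how early the switch happens against the risk of freezing the manifold too soon.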
Next, we consider the reference-sample method described in Section 3. In applying our sampling method to the Swiss Roll data we accomplish two goals. As in the first experiment, we are able to investigate the quality of the Isomap approximation as data availability increases. In addition, because we have ground truth available, we are able to validate our observation that both errors asymptotically converge in the limit of increasing data.
In Figure 3, we present results for both approaches to error analysis. In the reference-sample method, we fix a reference set R. We start with sample sets of 100 samples and increase their size by 100 samples at each step. The error between the two approximations is computed as prescribed by (5). At the same time, the error for the approximation compared with the ground truth coordinates (6) is computed and plotted alongside the reference-sample error.
Having gained an understanding of the behavior of the errors on synthetic data with ground truth available, we turn our attention to real-world data. As an example, we consider samples of single digits from the MNIST database and the Corel Image data set. For each data set, we determine its intrinsic dimensionality based on residual variance and then run the reference-sample method to obtain the error plots for increasing sample size, as shown in Figure 4.

For each data set, we observe behavior similar to that seen for both approaches applied to the Swiss Roll. The error decreases rapidly with additional samples. As the size of the sample sets approaches around 2,000-2,500 samples, the error begins to stabilize as both approximations become more accurate. The conclusion we draw is that from samples of this size, a reliable manifold can be learned from either sample set, which results in convergence of the error.
5 Proposed Streaming Isomap Algorithm
To form a stable, well-represented manifold, Isomap requires an adequate number of samples to learn the topology of the manifold. More samples capture the idiosyncrasies of the manifold surface better. However, our experimental results in Section 4 suggest that there exists a transition point beyond which the quality of the manifold does not change significantly, even as more samples are added. This means that once the point of transition is reached, instead of rebuilding the entire manifold in a batch fashion, we can employ a computationally less expensive procedure to map the new samples arriving in the data stream. Based on this observation, we propose an out-of-sample extension method, called Streaming Isomap (S-Isomap), that can accurately and efficiently predict the embedding of incoming samples. The algorithm avoids recomputing the entire geodesic distance matrix and its eigendecomposition, both of which are computationally expensive. While other out-of-sample techniques exist [2], we demonstrate in Section 5.2 that the proposed algorithm is significantly more efficient.
The algorithm assumes that a transition point has already been identified by monitoring the Isomap error using the methods discussed in Section 3. Let the matrix B denote the batch of samples encountered in the stream before the point of transition. From our previous experimental results, we can assume that B is relatively small compared to the remaining size of the stream. Let S represent the remaining part of the input data stream. Furthermore, let G represent the geodesic distance matrix between the samples in B, and let Y be the matrix containing the corresponding low-dimensional representations of the samples in B.
5.1 Algorithm
Our proposed method is shown in Algorithm 1. The key assumption is that because the manifold learned from B is stable, we do not have to recompute the entire geodesic distance matrix each time a new point x is added. Instead, it is sufficient to find the k nearest neighbors of x in B and use those to approximate the geodesic distances between x and the remaining points in B. This step is realized in lines 1-4: the algorithm stores the indexes of the k nearest neighbors of x together with the corresponding distances, and computes the approximate geodesic distance between x and each point in B as the shortest path passing through one of these neighbors. Given the updated geodesic distances, we can obtain the low-dimensional coordinates of x using transformations similar to [8]. Specifically, we match the inner products between x and the points in B to the computed geodesics: using the mean of the squared geodesics from x, the vector of row means of the squared geodesic matrix, the mean of all its entries, and a vector of ones, the desired inner-product vector is expressed as given in line 6. Then, by solving for the point y whose inner products with the batch embedding Y equal this vector, we obtain the low-dimensional representation of x (lines 7-9).
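The per-point update can be sketched as follows. This is our reconstruction of the idea behind Algorithm 1 rather than the algorithm verbatim; the names (`map_new_point`, `G2` for the *squared* geodesic matrix) are illustrative. `B` is the n×D batch and `Y` its n×d embedding:

```python
import numpy as np

def map_new_point(x, B, G2, Y, k=10):
    # k nearest neighbors of x in the batch (Euclidean).
    dist = np.linalg.norm(B - x, axis=1)
    nn = np.argsort(dist)[:k]
    # Approximate geodesic to every batch point: shortest path through a
    # neighbor (exact for the neighbors themselves, since G2[i, i] = 0).
    g = np.min(dist[nn][:, None] + np.sqrt(G2[nn]), axis=0)
    g2 = g ** 2
    # Double-centering of the squared geodesics yields the inner products b
    # of the new point with the (centered) batch configuration.
    row_mean = G2.mean(axis=1)
    b = 0.5 * (g2.mean() + row_mean - G2.mean() - g2)
    # Small least-squares solve: find y with Y @ y ~= b.
    y_new, *_ = np.linalg.lstsq(Y, b, rcond=None)
    return y_new
```

Each call touches only the n batch points, never the accumulated stream, which is the source of the per-point cost being independent of the stream size.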
5.2 Analysis
For a single point, the proposed S-Isomap algorithm takes O(sD) time to compute the k nearest neighbors, O(sk) to compute the approximate geodesics, and O(sd²) for the least-squares estimation, where s is the size of the batch B. Thus, its runtime complexity per streaming point is linear in the batch size. If we include the cost of learning the initial manifold, the proposed S-Isomap algorithm requires time linear in the stream size to process the stream. This is significant because the cost of mapping a new sample depends only on the batch size and the parameters k and d. Since these are very small compared to the entire stream size, the resulting computational savings are significant. Consequently, the algorithm is well suited for high-throughput streams. In contrast, the complexity of the incremental Isomap [8], the other method designed for streams, scales quadratically with the size of the stream, regardless of whether an initial batch is available.
The space complexity of our algorithm is dominated by the O(s²) geodesic distance matrix G and the O(sd) low-dimensional representation Y of the batch samples. Thus, the space requirement does not grow with the size of the stream, which makes the algorithm appealing for handling high-volume streams.
5.3 Experimental Results
To validate the proposed S-Isomap algorithm, we perform a set of experiments to assess how accurately and efficiently it maps samples in a stream onto a manifold learned from a fixed-size batch.
In the first set of experiments, we measure the error in the mapping obtained using Algorithm 1 and compare it with the error in the mapping obtained using the standard Isomap algorithm. Using a data set of size n, we learn a manifold from a batch of size s and then map the remaining samples using S-Isomap. The entire mapping is compared to the ground truth using the Procrustes error. Figure 5 shows the results for the Isometric Swiss Roll data with n = 10,000. Similar results are shown for the other data sets in Figure 6. For these data sets, we use the mapping obtained by applying Isomap to the full data as the ground truth. The results show that while for smaller batches the error is high, it stabilizes at a low value after the point of transition. The error closely tracks the error for only the batch portion of the data set. This indicates that beyond the point of transition, the mapping obtained using S-Isomap is as accurate as that obtained from the standard Isomap.
To assess the computational efficiency of the S-Isomap algorithm, we show the time taken by S-Isomap (including the time for running Isomap on the batch and S-Isomap on the remaining stream) in Figure 7. The timing result of the standard Isomap on all 10,000 samples is shown as a horizontal baseline for reference. We note that S-Isomap maps the samples in the stream much more efficiently than running Isomap on the entire data. Overall, Figures 5 and 7 show that the proposed method maps samples in a stream in a highly efficient manner without sacrificing the quality of the manifold.
To understand how S-Isomap would behave in a truly streaming mode, we conduct a second timing experiment. Here, we fix the size of the batch to the estimated switching point (2,000 for the Isometric Swiss Roll) and then progressively add samples to the stream, processing them with S-Isomap. We measure the cumulative time taken by S-Isomap to process the remainder of the stream, for different stream sizes. Figure 8 shows these times compared with the runtime of Isomap on the aggregated batch and stream data. As expected, the cumulative time of running S-Isomap scales linearly with the size of the stream. This further confirms the efficiency of the method.
6 Related Work
Given the high computational complexity of Isomap, variants such as Landmark Isomap [4] and out-of-sample extension techniques [2] have been proposed as computationally viable alternatives. Both of these methods use a smaller set of landmark points or approximations to avoid performing the costly eigendecomposition of the geodesic distance matrix over all n points in the data set. However, they still require computing the full geodesic distance matrix, whose cost grows with the number of points and the diameter of the embedded kNN graph. The Incremental Isomap algorithm [8] avoids both the eigendecomposition and a full recreation of the geodesic distance matrix. However, it requires updates to the geodesic distance matrix that incur a significant cost, as discussed in Section 5.2. Consequently, the method is unsuitable for the streaming setting.
Errors in Isomap have been discussed in prior work [10], but those studies have typically been concerned with the selection of parameters. For example, Samko et al. [14] proposed measuring a simple manifold embedding error for a range of neighborhood sizes to find the best choice of k. Similarly, an approach based on the edge-disjoint minimal spanning tree algorithm has been proposed to construct a neighborhood graph with connectivity guarantees [20]. In the same spirit, several strategies to assess intrinsic manifold dimension are available [10]. However, to the best of our knowledge, there is no prior work on defining and understanding the behavior of the error that persists even under optimal parameter selection. We address this error by taking an abstract view of Isomap, providing a protocol for measuring the collective error, and characterizing its behavior. In doing so, we identify the optimal point at which we may switch from exact to lightweight methods.
7 Conclusions
The error in Isomap approaches zero as the density of the sampled data tends to infinity. However, in practical settings, we can expect that after a certain level of sampling the error does not change significantly. In other words, the learned manifold becomes stable once the sample size reaches a certain threshold, under the assumption of uniform sampling. In this paper, we have presented error metrics that can be used to empirically observe when the manifold becomes stable. In particular, the reference-sample metric is appealing because it can assess the manifold quality even when ground truth data is unavailable.
Equipped with the knowledge of the point of transition, we have presented a streaming algorithm, S-Isomap, that can be used to efficiently map new data samples onto the stable manifold, instead of performing costly updates. The fact that the cost of mapping new samples in S-Isomap depends only on the size of the initial batch used to generate the stable Isomap manifold, and is independent of the size of the stream itself, makes it a very powerful tool for processing massive streams of data.
8 Acknowledgment
The authors wish to acknowledge support provided by the Center for Computational Research at the University at Buffalo. This work has been supported in part by the NSF grants CNS-1409551 and IIS-1651475.
References
 [1] M. Belkin and P. Niyogi. Laplacian eigenmaps and spectral techniques for embedding and clustering. In Neur. Inf. Proc. Sys., pages 585–591, 2001.

 [2] Y. Bengio, J. Paiement, P. Vincent, O. Delalleau, N. Le Roux, and M. Ouimet. Out-of-sample extensions for LLE, Isomap, MDS, eigenmaps, and spectral clustering. In Neur. Inf. Proc. Sys., pages 177–184, 2004.
 [3] R. Clarke, H.W. Ressom, A. Wang, J. Xuan, M.C. Liu, E.A. Gehan, and Y. Wang. The properties of high-dimensional data spaces: implications for exploring gene and protein expression data. Nat. Rev., 8(1):37–49, 2008.
 [4] V. de Silva and J.B. Tenenbaum. Global versus local methods in nonlinear dimensionality reduction. In Neur. Inf. Proc. Sys., pages 705–712, 2002.
 [5] I.L. Dryden and K.V. Mardia. Statistical Shape Analysis. John Wiley & Sons, 1998.
 [6] O.C. Jenkins and M.J. Mataric. A spatiotemporal extension to isomap nonlinear dimension reduction. In Int. Conf. Mach. Lear., page 56. ACM, 2004.
 [7] P. Lancaster and M. Tismenetsky. The Theory of Matrices: With Applications. Computer Science and Scientific Computing Series. Academic Press, 1985.
 [8] M.H.C. Law and A.K. Jain. Incremental nonlinear dimensionality reduction by manifold learning. IEEE Tran. Patt. Anal. Mach. Intell., 28(3):377–391, 2006.

 [9] G. Lee, C. Rodriguez, and A. Madabhushi. Investigating the efficacy of nonlinear dimensionality reduction schemes in classifying gene and protein expression studies. IEEE/ACM Trans. on Comp. Biology and Bioinformatics, 5(3):368–384, 2008.
 [10] J.A. Lee and M. Verleysen. Nonlinear dimensionality reduction. Springer Science & Business Media, 2007.
 [11] K. Morik, H. Durrant-Whyte, G. Hill, D. Muller, and T. Berger-Wolf. Data driven science: SIGKDD Panel. In KDD, pages 2329–2330, 2015.
 [12] C. Orsenigo and C. Vercellis. A comparative study of nonlinear manifold learning methods for cancer microarray data classification. Expert Systems with Applications, 40(6):2189–2197, 2013.
 [13] S.T. Roweis and L.L. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326, 2000.
 [14] O. Samko, A.D. Marshall, and P.L. Rosin. Selection of the optimal parameter value for the isomap algorithm. Pattern Recognition Letters, 27(9):968–979, 2006.
 [15] S.K. Samudrala, J. Zola, S. Aluru, and B. Ganapathysubramanian. Parallel framework for dimensionality reduction of large-scale datasets. Scientific Programming, 2015:1, 2015.
 [16] D.W. Scott and J.R. Thompson. Probability density estimation in higher dimensions. In Symposium on the Interface of Computing Science and Statistics, volume 528, pages 173–179, 1983.
 [17] J.B. Tenenbaum, V. de Silva, and J.C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319, 2000.
 [18] B. Thirion and O. Faugeras. Nonlinear dimension reduction of fMRI data: the Laplacian embedding approach. In IEEE Int. Symp. Biomed. Imaging Macro to Nano, pages 372–375, 2004.

 [19] K.Q. Weinberger, B. Packer, and L.K. Saul. Nonlinear dimensionality reduction by semidefinite programming and kernel matrix factorization. In International Conference on Artificial Intelligence and Statistics, 2005.
 [20] L. Yang. Building k edge-disjoint spanning trees of minimum total length for isometric data embedding. IEEE Tran. Patt. Anal. Mach. Intell., 27(10):1680–1683, 2005.