Scalable Manifold Learning for Big Data with Apache Spark

by   Frank Schoeneman, et al.
University at Buffalo

Non-linear spectral dimensionality reduction methods, such as Isomap, remain important technique for learning manifolds. However, due to computational complexity, exact manifold learning using Isomap is currently impossible from large-scale data. In this paper, we propose a distributed memory framework implementing end-to-end exact Isomap under Apache Spark model. We show how each critical step of the Isomap algorithm can be efficiently realized using basic Spark model, without the need to provision data in the secondary storage. We show how the entire method can be implemented using PySpark, offloading compute intensive linear algebra routines to BLAS. Through experimental results, we demonstrate excellent scalability of our method, and we show that it can process datasets orders of magnitude larger than what is currently possible, using a 25-node parallel cluster.



There are no comments yet.


page 8


S-Isomap++: Multi Manifold Learning from Streaming Data

Manifold learning based methods have been widely used for non-linear dim...

Bayesian Inference on Matrix Manifolds for Linear Dimensionality Reduction

We reframe linear dimensionality reduction as a problem of Bayesian infe...

Randomized ICA and LDA Dimensionality Reduction Methods for Hyperspectral Image Classification

Dimensionality reduction is an important step in processing the hyperspe...

Non-linear dimensionality reduction: Riemannian metric estimation and the problem of geometric discovery

In recent years, manifold learning has become increasingly popular as a ...

Angle constrained path to cluster multiple manifolds

In this paper, we propose a method to cluster multiple intersected manif...

Hyperscaling Internet Graph Analysis with D4M on the MIT SuperCloud

Detecting anomalous behavior in network traffic is a major challenge due...

Dual Optimization for Kolmogorov Model Learning Using Enhanced Gradient Descent

Data representation techniques have made a substantial contribution to a...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

I Introduction

The vast majority of the current big data, coming from, for example, high performance high fidelity numerical simulations, high resolution scientific instruments (microscopes, DNA sequencers, etc.) or Internet of Things streams and feeds, is a result of complex non-linear processes. While these non-linear processes can be characterized by low dimensional manifolds, the actual observable data they generate is high-dimensional. This high-dimensional data is inherently difficult to explore and analyze, owing to the

curse of dimensionality and empty space phenomenon

that render many statistical and machine learning techniques (e.g. clustering, classification, model fitting, etc.) inadequate. In this context, non-linear spectral dimensionality reduction has proved to be an indispensable tool 

[1]. The non-linear spectral dimensionality reduction methods rely on a spectral decomposition of a feature matrix that captures properties of the underlying manifold, and effectively bring the original data into a more human-intuitive low dimensional space that makes quantitative and qualitative analysis of non-linear processes possible (e.g., by enabling visualization). However, the leading manifold learning methods, such as considered here Isomap [2], remain infeasible when presented with the current large-scale data. This is because of both computational and memory complexity that are at least quadratic in the number of input points.

In this paper, we propose a distributed memory framework to address complexity of the end-to-end exact Isomap under Apache Spark model. We show how each critical step of the Isomap algorithm can be efficiently expressed under the basic Spark model, without the need to provision data in the secondary storage. To achieve efficient and portable software realization, we implement the entire method using PySpark, and we offload compute intensive linear algebra routines to high performance BLAS library. Through experimental results, we demonstrate that the resulting software exhibits excellent parallel scalability, and even on a small parallel cluster can be used to process datasets orders of magnitude larger than what is possible withe the current methods, without sacrificing quality. Our specific contributions include Apache Spark-based efficient direct NN solver based on 1D data decomposition, scalable all-pairs shortest-path (APSP) solver leveraging ideas from communication avoiding algorithms, and practical eigendecomposition solver based on power iteration. All components are combined into end-to-end workflow realizing in-memory Isomap.

The reminder of the paper is organized as follows. In Section II, we formally state the problem and briefly introduce the basic Isomap algorithm. In Section III, we provide detailed description of our proposed approach, focusing on each individual Isomap component and its realization in Apache Spark. In Section IV, we showcase the resulting software on several benchmarks, assessing accuracy and parallel scalability, and we demonstrate its practical value by analyzing over 50000 EMNIST images. Finally, in Section V, we survey related work, and conclude the paper in Section VI.

Ii Preliminaries

In manifold learning, we assume that a given dataset of points, where each , is sampled from some manifold of latent dimension , . Our goal is to find a set such that and , where measures the actual distance between pair of points on , and the relation captures some property between points and . In this work, we focus on the case where is , that is, we seek a -dimensional representation, , of  that preserves manifold distances.

The state-of-the-art method for geodesic distance preserving manifold learning is Isomap [2]. The method falls under the category of non-linear spectral dimensionality reduction techniques [1], and it requires a series of data transformations shown in Alg. 1. In the first step, the dataset is used to construct a neighborhood graph, , where each node of the graph is connected to its nearest neighbors (NN), with respect to the Euclidean metric. Here NN distances capture local manifold distances in the neighborhood of each data point. This step requires distance computations, thus having complexity . In the second step, the neighborhood graph is analyzed to construct a feature matrix, , that captures the information to be preserved in dimensionality reduction. Because shortest paths induced by neighbor relations have been shown to be a good approximation of the actual manifold distance between points [3], stores the shortest paths between all pairs of points in the neighborhood graph . Consequently the cost of this step is . The penultimate step is a spectral decomposition of . The resulting top eigenvectors,

, and eigenvalues,

, are required as their product yields the final output set . The spectral decomposition step has complexity . We note that before spectral decomposition, the feature matrix is usually double-centered to ensure that the origin is contained in the final output (instead of being an affine translation).

0:  , ,
1:   = kNN(, )
2:   = AllPairsShortestPaths()
3:   = DoubleCenter()
4:   = EigenDecomposition()
5:   =
6:  return  
Algorithm 1 Isomap

While Isomap is the method of choice in many practical applications [1], it is too computationally and memory intensive for even modest size datasets. For example, commonly available sequential implementations in Matlab and Python scale to datasets with points. In contrast, datasets emerging in scientific applications, e.g. [4, 5, 6, 7], routinely involve hundreds of thousands of points.

Iii Proposed Approach

The computational complexity of Isomap arises from the cost of both construction and spectral decomposition of the feature matrix. The memory cost comes from the () size of the feature matrix that has to be maintained throughout the process. To alleviate these complexities, several approximate methods have been proposed [8, 9]. However, these approaches do not provide exactness guarantees, or are tailored for large-scale HPC systems [10].

In our proposed approach, we focus on exact solution targeting Apache Spark clusters that are easy to deploy, and are accessible to many potential end-users. Our main idea is to cast the Isomap workflow into the Apache Spark model, such that the intermediate data is never explicitly provisioned to the persistent storage, and each step of the workflow is efficiently expressed via Spark transformations implemented in PySpark and Python, with computationally intensive algebraic routines offloaded to a dedicated BLAS engine (e.g., Intel MKL).

Iii-a NN Search

The first step of Isomap is nearest neighbors search. The direct approach to NN is for each point to compute the distance to each of the others, recording the minimum in the process. This requires () comparisons and () space. Theoretically, this approach can be improved by the use of spatial data structures such as - trees, quad-trees, R-trees, or Voronoi diagrams [11, 12]. However, due to the curse of dimensionality, the performance of these data structures quickly deteriorates to the direct method as dimensionality increases [13]. Therefore, in our approach we propose scalable Spark realization of the direct NN method.

Fig. 1: Left: 1D decomposition of the input data . Middle: Matrix and example row . Right: NN RDD and element of list .

We first 1D-decompose the input data into logical blocks (see Fig. 1). We achieve this by loading the entire into RDD, such that single point is stored as one 1D NumPy array, and then executing combineByKey transformation. The transformation assigns to a point its block identifier, and points are grouped into a 2D NumPy array corresponding to a block. Hence, at the end of this step, each point in is assigned to a unique block , for . By decomposing in this way, we can exploit symmetry of the distance matrix over , and efficiently compute only its upper-triangular portion. This is contrary to the standard and expensive approach based on Spark cartesian, which produces RDD elements and , requiring additional filter operation to obtain the final upper-triangular form. Using flatMap transformation, we enumerate all block pairs , where each is paired with all , for . The resulting RDD, stores each block pair as a tuple . Although this induces data replication, it also directly exposes parallelism in computing the distance matrix that dominates the computational work of NN search. We exploit this parallelism when materializing the upper-triangular distance matrix, . Specifically, we apply map transformation that for each pair carries all-pairs distance computations to yield sub-matrix , . Here, computing is delegated to SciPy, which offloads distance computations to the efficient BLAS library (e.g., Intel MKL). At the end, the entire block , stored as 2D NumPy array, becomes a single element of the resulting RDD with . The RDD is persisted in the main memory for further processing.

The next step is to identify nearest neighbors of each with respect to matrix . When is in the block form, each row of is scattered across multiple column-blocks  (see Fig. 1). With each such chunk of row we can associate the same pair , where identifies row-block containing , and is local identifier of row within row-block (since within a block, rows are indexed from 0, and we have that ). We exploit this property, and for each point we identify the minimum elements as follows. First, within each block for every we identify a list of minimal values, . We achieve this in parallel via flatMap transformation that employs standard heap-based algorithm to maintain minimal values, and yields tuples . Here, stores coordinates of the selected minimal values, and the values themselves. Recall that we do not explicitly represent for . Therefore, to obtain row minima for blocks under diagonal of , our flatMap routine considers also transpositions . Next, using combineByKey transformation, we reduce all local minima, contained in lists , into the global NN list for each  (see Fig. 1).

While at this point we could finalize NN search, we apply one more round of transformations, to efficiently convert NN into the neighborhood graph , such that it is ready for processing via all-pairs shortest-paths solver. The idea is to reuse data blocks maintained by RDD storing , to store . To this end, we apply map transformation to the NN and for each  we produce key-value pair , where tuple identifies nearest neighbor of . This enables us to associate each NN distance with the sub-block of to which it belongs. Thus, we use union to combine the resulting RDD with the original block matrices of , and we follow with combineByKey transformation. The transformation fills blocks of with , and then sets the actual neighbor distances according to . The resulting neighborhood graph, , is stored as RDD of blocks of size , in exactly the same way as , benefiting from the fact that is persistent and minimizing NumPy arrays reallocation.

Fig. 2: Example of assignment of logical blocks to RDD partitions. Number inside a block indicates RDD partition storing the block.

The 1D decomposition we use in our NN solver, and the induced 2D decomposition of matrix , are logical and separate from the physical RDD partitioning employed by the Spark runtime. Moreover, instead of relying on the default Spark partitioner, we use a custom partitioner tailored for managing upper-triangular matrices. This is motivated by the performance considerations. The partitioner assigns an integer to each block of the matrix in the row-major fashion. Hence, when the number of RDD partitions is equal to the number of logical blocks, the assignment is exactly the upper-triangular row-major index for block . When this number is less than , we store blocks per partition, where . Specifically, we assign the first blocks to partition 0, the next blocks to partition 1, and so on (see Fig. 2). In our experiments, we found that this strategy of assigning multiple blocks to single RDD partition provides better data locality than the default partitioner or GridPartitioner found in the MLlib library [14] (a popular machine learning library for Apache Spark). This is because neighboring blocks are stored near one another in RDD, which leads to improved data locality and reduced data shuffling. Therefore, we employ the same custom partitioner in the APSP solver (see below).

Iii-B All-pairs Shortest-paths Computation

Given the neighborhood graph , the next step in the Isomap workflow is to solve all-pairs shortest-paths (APSP) in . Recall that shortest paths approximate manifold distances between data points [3], and hence they will form the feature matrix .

Two commonly used algorithms for APSP are Dijkstra and Floyd-Warshall [15]. However, both methods are ill-suited for Spark model, owing to low computation rate compared to the data movement rate. An alternative approach involves a special case of the matrix power  [16]. Consider matrix multiplication over the semiring with additive and multiplicative operations, and . The additive operation is minimum and the multiplicative operation is scalar addition, that is and . In this way we reduce APSP to repeated matrix multiplication of by itself (we refer to this method as min-plus). This is still not ideal: regular data movement patterns like those in matrix multiplication are unsuitable for data shuffles that occur under the Spark model. In fact, matrix multiplication is yet to see efficient realization in Spark [17]. Therefore, we take a middle-ground approach.

We cast an iterative communication-avoiding APSP algorithm by Solomonik et al. [18], which is based on the block formulation of Floyd-Warshall [19], into Spark model, and we use min-plus to batch update blocks of shortest paths values. It turns out that communication-avoiding algorithms, technique well known in High Performance Computing, enable us to minimize data shuffles, leading to efficient Spark realization. We note that the correctness of this algorithm follows directly from that of computing the transitive closure, and has been discussed, e.g., in [18, 19].

In our approach, we iteratively update the entire APSP adjacency matrix, , in three phases (Fig. 3 illustrates in which phase which block of is updated). In the first phase, we solve sequential Floyd-Warshall restricted to a single sub-block on the diagonal. This is the first diagonal block when iteration . Once we have solved the diagonal block we share the solution with blocks in the same row (and column). This begins Phase 2. Upon receiving the diagonal block, we perform min-plus matrix multiplication to in-place update Phase 2 blocks. In the final phase, all remaining blocks receive two matrices from Phase 2. One matrix is sent from the block in the shared row and the other from the shared column. After computing the min-plus product of the received matrices, we update Phase 3 blocks in-place. The process repeats with successive diagonal blocks. At the conclusion of all 3 phases for the final diagonal block, all matrix entries contain the corresponding shortest path length.

Fig. 3: Phases of computation for iteration along critical path defined by the block diagonal. In Phase 1, sequential Floyd-Warshall is performed on diagonal block . In Phase 2, the solution of the diagonal block is passed to, and multiplied with, all off-diagonal blocks in the row and column. In Phase 3, all remaining blocks are updated using the product of blocks and from Phase 2.

We realize the above algorithm in Spark, starting from the persisted RDD representing graph . We proceed iteratively over blocks, , on the diagonal, which form the critical path of the algorithm. To extract diagonal block , we use simple filter transformation over the keys assigned to each block. We follow with flatMap that executes Floyd-Warshall over sub-matrix . Upon completion, the transformation yields multiple instances of the updated block with keys and , where . The keys refer to the blocks in the same row and column where the updated diagonal block is needed for the next phase of computation. Here, similar to NN search, we replicate the data to expose parallelism inherent to Phase 2.

To begin Phase 2, we again filter , this time to obtain sub-blocks with keys and . These blocks store values not yet updated for iteration . Thus, we perform union transformation to bring together these blocks and the diagonal blocks processed in the earlier Phase. When applying union we take special care to ensure that the number of partitions in the resulting RDD is non-increasing. This could negatively impact the subsequent reduction operations, since the exchange of blocks requires shuffle between partitions. Therefore, we use partitionBy to ensure all RDDs involved in union have same partitioning as . Next, we execute combineByKey in order to pair each block from row (column) with the diagonal block . To conclude Phase 2, we run flatMap that performs min-plus matrix multiplication and in parallel updates the current blocks. The transformation yields additional instances of the updated blocks such that they can be passed to Phase 3.

To conclude single iteration, we update Phase 3 blocks , and this involves computing the min-plus product . Hence, we use yet another filter on to obtain remaining blocks , which are in neither row nor column . These blocks, along with blocks we yield at the end of Phase 2, are brought to a single RDD via union, again ensuring appropriate partitioning with partitionBy. Next, we use combineByKey transformation to bring together three matrices needed to update blocks . This includes the current value of and Phase 2 blocks specified above, necessary to compute . Once is ready, is updated by computing element-wise minimum of itself with .

The three phases are repeated for the next diagonal block. At the conclusion of iteration , all pairwise distances represent the approximate manifold distances. To prepare the matrix for subsequent normalization, we square each element of the adjacency matrix to obtain the actual feature matrix .

In our implementation, the sequential APSP component uses the optimized and efficient SciPy implementation of Floyd-Warshall algorithm. It operates in-place on already allocated memory referenced by NumPy arrays. For min-plus matrix multiplication, we use our own Python implementation. However, our Python code is compiled and optimized to the machine code using the Numba just-in-time compiler, providing excellent performance. Finally, to further improve performance of our min-plus code, we enforce C or Fortran memory layout of matrices as appropriate to maximize cache usage.

The overall performance of our APSP hinges on the efficient use of combineByKey transformation. While the transformation involves expensive data shuffling, we found that in practice it is more efficient than the available alternatives, like exchanging blocks via collect/broadcast [20], or provisioning in shared secondary storage, e.g., write to HDFS or SparkFiles.addFile, and SparkFiles.get. This is due to the inefficiency of collect or file I/O required for alternative approaches. As the problem size and block size increase, these operations become a bottleneck. Spark collect requires that the diagonal block and an entire row of blocks be brought to the driver in Phases 1 and 2, respectively. Similarly, the use of secondary storage requires a sequence of file write and read operations to make blocks accessible by appropriate tasks. In our experiments, we found the cost of using heap memory combined with reduceByKey for duplication of blocks to be superior to collect to driver or use of a secondary storage.

The final remark is with respect to RDD lineages (i.e., RDD provenance and execution records). Observe that at each iteration over the diagonal, new RDD describing the entire distance matrix is created. The ancestors of this RDD are all RDDs prescribed by earlier transformations. Consequently, the resulting RDD lineage grows from iteration to iteration, potentially overwhelming the Spark driver. Since the driver is responsible also for scheduling, this hinders performance of the entire algorithm. To address this issue, we checkpoint and persist to prune the lineage of RDDs which describe its blocks. Frequency of checkpointing depends on the size of the input data and block size , but in all test cases we found that checkpointing every 10 iterations performs best.

Iii-C Matrix Normalization

The goal of matrix normalization is to transform

so that it is both row and column centered. In other words, the mean of each row and each column is zero. Centering can be expressed as a linear transformation of the form

, where is centering matrix, and

is identity matrix. Thus, the update of

requires two matrix-matrix multiplications.

As previously discussed, matrix-matrix multiplication is inefficient in Spark. For this reason, in our implementation we double center by the direct approach, exploiting the fact that is symmetric. Specifically, we compute the mean of each column of

, and we store all means in the row vector

, where is mean of column . As means for rows of are equivalent to , we do not compute them explicitly, and to center we subtract from each row, and  from each column of . To finalize the process, we add to each entry of , where is the global mean computed over all elements of .

To express the above algorithm in Spark, we first transform RDD storing by flatMap that for each block in parallel computes sums of its columns. The transformation yields tuple , and if additional tuple that accounts for processing blocks under the diagonal (recall that is upper-triangular). Here, represents vector of means for the block, and is stored as 1D NumPy array. Next we aggregate the final column sums via reduceByKey transformation that applies addition operator over vectors with shared key . The vector summation is efficiently implemented in NumPy. At this point, the resulting RDD contains complete information to compute the actual column means as well as the global mean . Hence, we first sum all elements of the sum vectors into a scalar using single map transformation, and then use reduce action to obtain the global sum at Spark driver. Additionally, we bring to the driver all column sums by executing collectAsMap action. After dividing all the sums by to obtain the desired means, we broadcast them back to Spark executors. Here we note that both reduce/collectAsMap and broadcast actions involve relatively small messages (even for large ), and hence this communication pattern is the most direct and efficient to way to exchange computed means between executors. Finally, we conclude the normalization step by transforming RDD that stores via map that in parallel applies previously broadcast means to the local blocks . The resulting matrix is ready for spectral decomposition.

Iii-D Spectral Decomposition

Computing eigenvectors and eigenvalues of feature matrix is the final step common to all spectral dimensionality reduction methods, as eigenvectors represent the desired low-dimensional representation of . However, larg-scale eigendecomposition is computationally challenging, and the existing parallel eigensolvers are primarily designed for shared memory and multi-/many-core architectures [21, 22]. Because to the best of our knowledge currently there are no eigensolvers available for Apache Spark, we develop a new approach derived from the Power Iteration method [23]. By using simultaneous power iteration over we are able efficiently extract only the top eigenpairs at the same time. This is computationally advantageous considering that other approaches require iteratively extracting each eigenpair, and in practical dimensionality reduction applications is relatively small, e.g., 

when data visualization is the goal.

0:  , , ,
2:   = QRdecompose()
3:  for  do
5:      = QRdecompose()
6:     if  then
7:        break;
10:  return  
Algorithm 2 Simultaneous Power Iteration

The standard power method (see Alg. 2) begins with a set of linearly independent vectors, e.g., the first standard bases

. It then considers all vectors simultaneously, and iteratively applies QR decomposition until the process converges to some acceptable level

, or the predefined number of iterations . In practice, and are selected to achieve accuracy required by given target application.

To achieve efficient Spark realization of power method we exploit two facts. First, for large and reasonable values of , matrices and have small memory footprint, and hence incur small sequential processing and communication overheads. Second, QR decomposition has extremely efficient BLAS realizations, available in NumPy. Therefore, it makes sense to assign managing and decomposition of matrix to the Spark driver (line 5), while offloading computationally expensive matrix product (line 4) for parallel processing by Spark executors. We note that while MLlib [14] provides Spark-based QR decomposition (using tall skinny QR algorithm [24]), it is limited only to MLlib RowMatrix RDD that would be inefficient during matrix product stage required by the power method. Consequently, in our implementation, the driver is responsible for keeping track of matrices and , and checking convergence criteria (line 6). After QR decomposition, the driver broadcasts the entire matrix to all executors. In this way, we can directly multiply each block of with the appropriate block of . With this approach we avoid costly replication and shuffle for pairing blocks from each matrix for multiplication. We use flatMap to compute block products instead. The transformation yields tuples for each block of . For non-diagonal blocks we also yield , to account for upper-triangular storage. Here is a NumPy 2D array representing one of the block products required for the product . To obtain the final form of we use reduceByKey transformation, where key is used, with element-wise addition of NumPy 2D arrays. In the final step, we bring back to the driver via collectAsMap.

To complete a single iteration, we perform QR factorization of on the driver, and test the Frobenius norm of the difference between successive for convergence. At the completion of iterations or upon convergence, we return the current instance of and , finalizing the spectral decomposition stage, and thus the entire dimensionality reduction process.

Iv Experimental Results

To understand performance characteristics of our approach, we performed a set of experiments on a standalone Apache Spark cluster with 25 nodes and GbE interconnect. Each node in the cluster is equipped with 20-core dualsocket Intel Xeon E5v3 2.3 GHz processor, 64 GB of RAM and a standard SATA hard drive. In all tests, Spark driver was run with 56GB of memory, to account for length of the lineage of RDDs in the APSP loop. For the same reason, we run the driver on a dedicated node separately from Spark executors. We allocated one executor per node using the default configuration for the number of cores, i.e., each executor was using all available cores in a node. All executors were configured to use 56GB out of the available 64GB, with the remaining memory available to the operating system and Python interpreter. We note that we tested different configurations of executor-to-core ratios, across different datasets, without noticeable difference in performance.

Our entire software is implemented in Apache Spark 2.2 and Python 2.7. Compute intensive linear algebra operations are offloaded via NumPy and SciPy to BLAS library, specifically Intel MKL 2017. Finally, we use Numba 0.35 for optimal min-plus matrix multiplication with just-in-time compilation.

In all experiments, in the NN stage we used for the neighborhood size. For our test data (see below), such is large enough to deliver single connected component in the neighborhood graph, and small enough to avoid unnecessary shortcut edges. In the eigendecomposition stage, we used and we allowed a maximum of iterations to achieve convergence. We note that in all experiments, many fewer iterations were required to see convergence (usually around 20-50 iterations).

Iv-a Test Data and Method Correctness

The Swiss Roll benchmark is a classic test for evaluating manifold learning algorithms. Swiss Roll data is generated from some input 2D data, which is embedded into 3D using a non-linear function. In our tests, we used the Euler Isometric Swiss Roll [25]. To evaluate the correctness and demonstrate the scalability of our implementation, we created three test datasets. These are samples from the Swiss Roll of size , , and , which we refer to as Swiss50, Swiss75, and Swiss100.

Figure 4 shows the initial data, high-dimensional embedded data, and the resulting dimensionality reduction delivered by our method for the Swiss50 dataset. A quick visual inspection reveals that Fig. 4(c) appears to be a faithful representation of the original data in Fig. 4(a). To actually quantify how well our method reconstructs the original data, we use Procrustes error, which captures the similarity of two datasets and by determining the optimal affine transformation of that best conforms it with  [26]. In our case, Procrustes error of our result compared to the original 2D coordinates is , confirming high accuracy of the reconstruction.

To test our platform on the actual high-dimensional data, we used benchmarks derived from the EMNIST [27], which is frequently used in machine learning research. This dataset provides images of over 200,000 handwritten digits, and hence each data point (image of a digit) is a 784-dimensional vector. For testing purposes, we randomly sampled two sets from EMNIST. They contain and points and we refer to them as EMNIST50 and EMNIST125, respectively.

Figure 5 shows 2D mapping of EMNIST50 obtained using our Isomap method. From the figure we can see that clusters of digits that look similar appear close together, e.g., ‘5’ and ‘6’. At the same time, clusters of distinct digits that blend together also have these similarities. In Fig. 5(b) sample images of digits are included to accentuate features captured by axes D1 and D2. The axis D2 describes the angle of slant for the handwritten digit. We observe the change in angle from top to bottom of the cluster. The primary axis, D1, accounts for curved or straight segments in the digit. For example, on the left we see ‘4’s, which are made up entirely of straight line segments. In contrast, zeros on the right contain no straight segments. To the best of our knowledge, this is the first time such result is reported, and it confirms practical value of large-scale Isomap in processing noise image data.

(a) 2D data prior to embedding in 3D.
(b) Data embedded in 3D.
(c) 2D mapping learned by our method, .
Fig. 4: To demonstrate the correctness of our Spark Isomap method we sample 50,000 points from the Euler Isometric Swiss Roll, and then perform dimensionality reduction. (Please view in color).
(a) Points cluster by digit which they represent.
(b) Sample of original images shown to highlight features captured by reduced dimensions D1 and D2.
Fig. 5: To demonstrate our method on high-dimensional data we sample 50,000 points from the EMNIST dataset , and then perform dimensionality reduction . Original image shown for selected points. (Please view in color).

Iv-B Scalability Tests

To analyze how our implementation scales with the problem size, expressed by , and the compute resources used, we processed all datasets on a varying number of nodes, recording the wall time. The results of this experiment are summarized in Tables I-III. Here we report relative speedup, computed as a fraction , where is time taken to process data instance on nodes, and is the time taken by the same instance to process on the smallest feasible number of nodes. We note that even the smallest datasets we consider in our tests are impossible to process using single node, and hence is not . Similarly, we compute relative efficiency as .

The experimental results show that our approach exhibits strong scaling within the entire range of data and cluster configuration. For smaller datasets (), the method scales linearly up to 12 nodes, and maintains very good efficiency up to 24 nodes. The scaling is very similar for both Swiss50 and EMNIST50, which is expected taking into account that only NN stage depends (linearly) on , and it constitutes only a small fraction of the entire runtime. At this point we should note that because we express all Isomap stages as linear equations or matrix multiplications, the performance of our method is not at all impacted by data distribution or density (e.g., connectivity of graph ). When the size of the input data increases, the method seems to exhibit super-linear performance, however, this result should be taken with care, considering that we compute relative and not actual speedup. The method maintains also weak scalability. Specifically, for the fixed ratio , the execution time scales roughly as , which reflects the dominating cost of the APSP stage.

Compute Nodes
Name 2 4 8 12 16 20 24
EMNIST50 294.92 117.90 62.41 48.47 41.44 35.82 32.92
Swiss50 261.43 98.88 53.58 39.41 33.26 29.47 26.49
Swiss75 958.85 345.47 105.28 84.05 76.66 63.25
Swiss100 1122.89 465.30 275.22 225.89 157.97
EMNIST125 985.94 662.92 599.28 448.42

– means that dataset was impossible to process on given resources.

TABLE I: Execution time in minutes
Compute Nodes
Name 2 4 8 12 16 20 24
EMNIST50 1 2.50 4.73 6.09 7.11 8.23 8.96
Swiss50 1 2.64 4.88 6.63 7.86 8.87 9.87
Swiss75 1 2.78 9.11 11.41 12.51 15.16
Swiss100 1 2.41 4.08 4.97 7.11
EMNIST125 1 1.49 1.65 2.20
TABLE II: Relative speedup
Compute Nodes
Name 2 4 8 12 16 20 24
EMNIST50 1 1.25 1.18 1.02 0.89 0.82 0.75
Swiss50 1 1.32 1.22 1.11 0.98 0.89 0.82
Swiss75 1 1.39 3.04 2.85 2.50 2.53
Swiss100 1 1.61 2.04 1.99 2.37
EMNIST125 1 1.12 0.99 1.10
TABLE III: Relative efficiency

The performance of our method directly depends on the logical block size (see Sec. III-A). Recall that in NN search, blocks of points are paired to compute one block of the pairwise distance matrix. The execution time of computing a single block scales as (). When computing APSP, we require sequential Floyd-Warshall for each block along the diagonal, and matrix-matrix min-plus multiplication for others. Both cases have complexity . Moreover, the APSP solver proceeds iteratively over the diagonal of . Since each block on the diagonal must be solved in sequence, the critical path has length . All these factors must be taken into consideration when selecting . In our experiments, we found that in the range gives ideal performance when . This is because the length of the critical path is not overwhelming for the Spark driver, and at the same time, the block size is small enough to leverage cache memory when executing BLAS routines for matrix products. As the problem size grows, it is advantageous to increase block size to keep control of the critical path. This however may increase the time taken to process individual block, and may lead to resource underutilization as there are fewer blocks to distribute for processing between executors. Figure 6 shows impact of block size when processing Swiss75 on 24 nodes (we observe similar pattern for other datasets). The sweet spot is for and both undersizing and oversizing degrades performance. Currently, we do not have model that would let us select automatically, however, the intuition we provide above is sufficient for the majority of practical applications.

Fig. 6: To identify the effect of block size on overall execution time, we run our method on Swiss75 using 24 compute nodes with varying .

V Related Work

As previously mentioned, existing implementations of Isomap and its variants do not scale to data as large as presented here. Talkwalkar et al. successfully learned manifold for 18 million points. However, their analysis includes approximation method for APSP, and Nystrom method for approximate spectral decomposition. To date, the largest study for exact solutions for Isomap includes points, and has been reported in [10]. The method is tailored for HPC clusters with MPI, and hence its applicability is somewhat constrained.

To partially mitigate the complexity of APSP, the authors of Isomap have proposed Landmark-Isomap (L-Isomap) [8]. The idea behind L-Isomap is that randomly selected landmark points may be selected to learn an underlying manifold. Remaining points are placed on the manifold via triangulation using their distance to landmarks. This approach greatly reduces complexity by targeting APSP, but introduces new sources of error in i) landmark selection method, ii) selecting the number of landmarks, and iii) adding the complete dataset by triangulation. We previously extended the works of de Silva and Tenenbaum [8, 2] and proposed Streaming-Isomap [25]. The streaming approach requires learning a faithful representation of the underlying manifold for some initial batch of points, and then quickly maps new points arriving on a high-volume, high-throughput stream. This approach is orthogonal to the one we present here, and in fact both methods could be combined in case when the initial batch is large.

Vi Conclusion

High-dimensional big data, arising in many practical applications, presents a challenge for non-linear spectral dimensionality reduction methods. This is due to the computationally intensive construction of the feature matrix and it’s spectral decomposition. In this paper, we proposed a scalable distributed memory approach for manifold learning using Isomap, addressing key computational bottlenecks of the method. Through experiments we demonstrated that our resulting Apache Spark-based implementation is scalable and efficient. The method can solve exact Isomap on high-dimensional datasets, which are an order of magnitude larger than what can be processed with the existing methods. We note that individual components, like NN, APSP and eigendecomposition solvers, can be used as a standalone routines. Finally, since other non-linear spectral decomposition methods, like e.g., LLE [28], share the same computational backbone, with a minimal effort our software could be extended to cover these methods as well. The source code of our platform is open and available from:


The authors would like to acknowledge support provided by the Center for Computational Research at the University at Buffalo. This work has been supported in part by the Department of Veterans Affairs under grant 7D0084.


  • [1] J. Lee and M. Verleysen, Nonlinear Dimensionality Reduction.   Springer Verlag, 2007.
  • [2] J. Tenenbaum, V. de Silva, and J. Langford, “A global geometric framework for nonlinear dimensionality reduction,” Science, vol. 290, no. 5500, p. 2319, 2000.
  • [3] M. Bernstein, V. de Silva, J. Langford, and J. Tenenbaum, “Graph approximations to geodesics on embedded manifolds,” 2000.
  • [4] P. Turnbaugh, R. Ley, M. Hamady, C. Fraser-Liggett, R. Knight, and J. Gordon, “The human microbiome project,” Nature, vol. 449, no. 7164, p. 804, 2007.
  • [5] R. Marchand, N. Beagley, and T. Ackerman, “Evaluation of hydrometeor occurrence profiles in the multiscale modeling framework climate model using atmospheric classification,” Journal of Climate, vol. 22, no. 17, pp. 4557–4573, 2009.
  • [6] B. Thirion and O. Faugeras, “Nonlinear dimension reduction of fMRI data: the Laplacian embedding approach,” in IEEE International Symposium on Biomedical Imaging: Nano to Macro, 2004, pp. 372–375.
  • [7] W. Li, S. Prasad, J. Fowler, and L. Bruce, “Locality-preserving dimensionality reduction and classification for hyperspectral image analysis,” IEEE Transactions on Geoscience and Remote Sensing, vol. 50, no. 4, pp. 1185–1198, 2012.
  • [8] V. de Silva and J. Tenenbaum, “Global versus local methods in nonlinear dimensionality reduction,” in Advances in Neural Information Processing Systems, 2003, pp. 721–728.
  • [9] A. Talwalkar, S. Kumar, and H. Rowley, “Large-scale manifold learning,” in

    IEEE Conference on Computer Vision and Pattern Recognition

    , 2008, pp. 1–8.
  • [10] S. Samudrala, J. Zola, S. Aluru, and B. Ganapathysubramanian, “Parallel framework for dimensionality reduction of large-scale datasets,” Scientific Programming, vol. 2015, p. 1, 2015.
  • [11] J. Bentley, “Multidimensional binary search trees used for associative searching,” Communications of the ACM, vol. 18, no. 9, pp. 509–517, 1975.
  • [12] W. Kim, Y. Kim, and K. Shim, “Parallel computation of k-nearest neighbor joins using MapReduce,” in IEEE International Conference on Big Data, 2016, pp. 696–705.
  • [13] R. Weber, H. Schek, and S. Blott, “A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces,” in International Conference on Very Large Data Bases, 1998, pp. 194–205.
  • [14] X. Meng, J. Bradley, B. Yavuz, E. Sparks, S. Venkataraman, D. Liu, J. Freeman, D. Tsai, M. Amde, S. Owen et al., “MLlib: Machine learning in apache spark,” The Journal of Machine Learning Research, vol. 17, no. 1, pp. 1235–1241, 2016.
  • [15] T. Cormen, C. Leiserson, R. Rivest, and C. Stein, Introduction to algorithms.   MIT press, 2009.
  • [16] J. Kepner and J. Gilbert, Graph Algorithms in the Language of Linear Algebra.   Society for Industrial and Applied Mathematics, 2011.
  • [17] R. Zadeh, X. Meng, A. Ulanov, B. Yavuz, L. Pu, S. Venkataraman, E. Sparks, A. Staple, and M. Zaharia, “Matrix computations and optimization in Apache Spark,” in ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 31–38.
  • [18] E. Solomonik, A. Buluc, and J. Demmel, “Minimizing communication in all-pairs shortest paths,” in IEEE International Symposium on Parallel and Distributed Processing, 2013, pp. 548–559.
  • [19] G. Venkataraman, S. Sahni, and S. Mukhopadhyaya, “A blocked all-pairs shortest-paths algorithm,” Journal of Experimental Algorithmics, vol. 8, 2003.
  • [20] T. Vacek, “Flyby improved dense matrix multiplication,” 2015.
  • [21] J. Choi, J. Demmel, I. Dhillon, J. Dongarra, S. Ostrouchov, A. Petitet, K. Stanley, D. Walker, and R. Whaley, “ScaLAPACK: A portable linear algebra library for distributed memory computers—design issues and performance,” Computer Physics Communications, vol. 97, no. 1-2, pp. 1–15, 1996.
  • [22] S. Balay, S. Abhyankar, M. Adams, J. Brown, P. Brune, K. Buschelman, L. Dalcin, V. Eijkhout, W. Gropp, D. Kaushik et al., “PETSc users manual revision 3.8,” Argonne National Laboratory (ANL), Tech. Rep., 2017.
  • [23] G. Golub and C. V. Loan, Matrix Computations.   JHU Press, 2012.
  • [24] P. Constantine and D. Gleich, “Tall and skinny QR factorizations in MapReduce architectures,” in International Workshop on MapReduce and Its Applications, 2011, pp. 43–50.
  • [25] F. Schoeneman, S. Mahapatra, V. Chandola, N. Napp, and J. Zola, “Error metrics for learning reliable manifolds from streaming data,” in SIAM International Conference on Data Mining, 2017, pp. 750–758.
  • [26] I. Dryden and K. Mardia, Statistical Shape Analysis.   John Wiley & Sons, 1998.
  • [27] G. Cohen, S. Afshar, J. Tapson, and A. van Schaik, “EMNIST: an extension of MNIST to handwritten letters,” ArXiv e-prints, 2017.
  • [28] S. Roweis and L. Saul, “Nonlinear dimensionality reduction by Locally Linear Embedding,” Science, vol. 290, no. 5500, pp. 2323–2326, 2000.