I. Introduction
The vast majority of current big data, coming from, for example, high-performance, high-fidelity numerical simulations, high-resolution scientific instruments (microscopes, DNA sequencers, etc.), or Internet of Things streams and feeds, is a result of complex nonlinear processes. While these nonlinear processes can be characterized by low-dimensional manifolds, the actual observable data they generate is high-dimensional. This high-dimensional data is inherently difficult to explore and analyze, owing to the
curse of dimensionality and the empty space phenomenon that render many statistical and machine learning techniques (e.g., clustering, classification, model fitting, etc.) inadequate. In this context, nonlinear spectral dimensionality reduction has proved to be an indispensable tool
[1]. Nonlinear spectral dimensionality reduction methods rely on a spectral decomposition of a feature matrix that captures properties of the underlying manifold, and effectively bring the original data into a more human-intuitive low-dimensional space that makes quantitative and qualitative analysis of nonlinear processes possible (e.g., by enabling visualization). However, the leading manifold learning methods, such as Isomap [2] considered here, remain infeasible when presented with current large-scale data. This is because both their computational and memory complexity are at least quadratic in the number of input points.

In this paper, we propose a distributed-memory framework to address the complexity of end-to-end exact Isomap under the Apache Spark model. We show how each critical step of the Isomap algorithm can be efficiently expressed under the basic Spark model, without the need to provision data in secondary storage. To achieve an efficient and portable software realization, we implement the entire method in PySpark, and we offload compute-intensive linear algebra routines to a high-performance BLAS library. Through experimental results, we demonstrate that the resulting software exhibits excellent parallel scalability, and even on a small parallel cluster can be used to process datasets orders of magnitude larger than what is possible with the current methods, without sacrificing quality. Our specific contributions include an efficient Apache Spark-based direct kNN solver based on 1D data decomposition, a scalable all-pairs shortest-paths (APSP) solver leveraging ideas from communication-avoiding algorithms, and a practical eigendecomposition solver based on power iteration. All components are combined into an end-to-end workflow realizing in-memory Isomap.
The remainder of the paper is organized as follows. In Section II, we formally state the problem and briefly introduce the basic Isomap algorithm. In Section III, we provide a detailed description of our proposed approach, focusing on each individual Isomap component and its realization in Apache Spark. In Section IV, we showcase the resulting software on several benchmarks, assessing accuracy and parallel scalability, and we demonstrate its practical value by analyzing over 50,000 EMNIST images. Finally, in Section V, we survey related work, and we conclude the paper in Section VI.
II. Preliminaries
In manifold learning, we assume that a given dataset of n points, X = {x_1, ..., x_n}, where each x_i ∈ R^D, is sampled from some manifold M of latent dimension d, d ≪ D. Our goal is to find a set Y = {y_1, ..., y_n} such that y_i ∈ R^d and δ_M(x_i, x_j) ≈ ρ(y_i, y_j), where δ_M measures the actual distance between a pair of points on M, and the relation ρ captures some property between points y_i and y_j. In this work, we focus on the case where ρ is the Euclidean distance, that is, we seek a d-dimensional representation, Y, of X that preserves manifold distances.
The state-of-the-art method for geodesic distance preserving manifold learning is Isomap [2]. The method falls under the category of nonlinear spectral dimensionality reduction techniques [1], and it requires a series of data transformations shown in Alg. 1. In the first step, the dataset X is used to construct a neighborhood graph, G, where each node of the graph is connected to its k nearest neighbors (kNN) with respect to the Euclidean metric. Here kNN distances capture local manifold distances in the neighborhood of each data point. This step requires O(n^2) distance computations, thus having complexity O(n^2 · D). In the second step, the neighborhood graph is analyzed to construct a feature matrix, M, that captures the information to be preserved in dimensionality reduction. Because shortest paths induced by neighbor relations have been shown to be a good approximation of the actual manifold distance between points [3], M stores the shortest paths between all pairs of points in the neighborhood graph G. Consequently, the cost of this step is O(n^3). The penultimate step is a spectral decomposition of M. The resulting top d eigenvectors, Q_d, and eigenvalues, Λ_d, are required as their product yields the final output set Y. The spectral decomposition step has complexity O(n^3). We note that before spectral decomposition, the feature matrix is usually double-centered to ensure that the origin is contained in the final output (instead of being an affine translation). While Isomap is the method of choice in many practical applications [1], it is too computationally and memory intensive for even modest-size datasets. For example, commonly available sequential implementations in Matlab and Python do not scale beyond relatively small datasets. In contrast, datasets emerging in scientific applications, e.g., [4, 5, 6, 7], routinely involve hundreds of thousands of points.
III. Proposed Approach
The computational complexity of Isomap arises from the cost of both construction and spectral decomposition of the feature matrix. The memory cost comes from the O(n^2) size of the feature matrix that has to be maintained throughout the process. To alleviate these complexities, several approximate methods have been proposed [8, 9]. However, these approaches either do not provide exactness guarantees, or are tailored for large-scale HPC systems [10].
In our proposed approach, we focus on an exact solution targeting Apache Spark clusters, which are easy to deploy and accessible to many potential end-users. Our main idea is to cast the Isomap workflow into the Apache Spark model, such that the intermediate data is never explicitly provisioned to persistent storage, and each step of the workflow is efficiently expressed via Spark transformations implemented in PySpark and Python, with computationally intensive algebraic routines offloaded to a dedicated BLAS engine (e.g., Intel MKL).
III-A. kNN Search
The first step of Isomap is the k nearest neighbors search. The direct approach to kNN is, for each point, to compute the distance to each of the others, recording the k minima in the process. This requires O(n^2) comparisons and O(n^2) space. Theoretically, this approach can be improved by the use of spatial data structures such as kd-trees, quadtrees, R-trees, or Voronoi diagrams [11, 12]. However, due to the curse of dimensionality, the performance of these data structures quickly deteriorates to that of the direct method as dimensionality increases [13]. Therefore, in our approach we propose a scalable Spark realization of the direct kNN method.
We first 1D-decompose the input dataset X into q logical blocks (see Fig. 1). We achieve this by loading the entire X into an RDD, such that a single point is stored as one 1D NumPy array, and then executing a combineByKey transformation. The transformation assigns to a point its block identifier, and points are grouped into a 2D NumPy array corresponding to a block. Hence, at the end of this step, each point in X is assigned to a unique block X_a, for 0 ≤ a < q. By decomposing X in this way, we can exploit symmetry of the distance matrix over X, and efficiently compute only its upper-triangular portion. This is contrary to the standard and expensive approach based on Spark cartesian, which produces RDD elements (X_a, X_b) and (X_b, X_a), requiring an additional filter operation to obtain the final upper-triangular form. Using a flatMap transformation, we enumerate all block pairs (X_a, X_b), where each X_a is paired with all X_b, for a ≤ b. The resulting RDD stores each block pair as a tuple ((a, b), (X_a, X_b)). Although this induces data replication, it also directly exposes parallelism in computing the distance matrix, which dominates the computational work of kNN search. We exploit this parallelism when materializing the upper-triangular distance matrix, D. Specifically, we apply a map transformation that for each pair (X_a, X_b) carries all-pairs distance computations to yield submatrix D_{a,b}, a ≤ b. Here, computing D_{a,b} is delegated to SciPy, which offloads distance computations to the efficient BLAS library (e.g., Intel MKL). At the end, the entire block D_{a,b}, stored as a 2D NumPy array, becomes a single element of the resulting RDD with key (a, b). The RDD is persisted in the main memory for further processing.
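The block layout can be mimicked sequentially as follows (a sketch of the data decomposition only; the helper names `block_pairs` and `blocked_distances` are ours, and in the actual PySpark code each `(a, b)` pair is an RDD element processed by a `map` transformation, with per-block distances computed by SciPy):

```python
import numpy as np

def block_pairs(q):
    """Enumerate upper-triangular block pairs (a, b), a <= b, as in the flatMap step."""
    return [(a, b) for a in range(q) for b in range(a, q)]

def blocked_distances(X, q):
    """Compute only the upper-triangular blocks D[(a, b)] of the pairwise distance matrix."""
    blocks = np.array_split(X, q)        # 1D decomposition into q row blocks
    D = {}
    for a, b in block_pairs(q):
        Xa, Xb = blocks[a], blocks[b]
        # Squared-norms expansion ||x - y||^2 = ||x||^2 - 2 x.y + ||y||^2 (BLAS-friendly).
        sq = (Xa ** 2).sum(1)[:, None] - 2 * Xa @ Xb.T + (Xb ** 2).sum(1)[None, :]
        D[(a, b)] = np.sqrt(np.maximum(sq, 0.0))
    return D
```

Note that only q(q+1)/2 blocks are ever materialized, which is the saving over the cartesian-based approach described above.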
The next step is to identify the k nearest neighbors of each point with respect to matrix D. When D is in the block form, each row of D is scattered across multiple column-blocks (see Fig. 1). With each such chunk of row i we can associate the same pair (a, r), where a identifies the row-block containing i, and r is the local identifier of row i within the row-block (since within a block, rows are indexed from 0, we have that i = a · (n/q) + r). We exploit this property, and for each point we identify the k minimum elements as follows. First, within each block D_{a,b}, for every row we identify a list of k minimal values, l. We achieve this in parallel via a flatMap transformation that employs the standard heap-based algorithm to maintain k minimal values, and yields tuples ((a, r), l). Here, l stores coordinates of the selected minimal values, and the values themselves. Recall that we do not explicitly represent D_{b,a} for a < b. Therefore, to obtain row minima for blocks under the diagonal of D, our flatMap routine considers also transpositions of D_{a,b}. Next, using a combineByKey transformation, we reduce all local minima, contained in lists l, into the global kNN list for each point (see Fig. 1).
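The two-level reduction of row minima can be sketched with Python's `heapq` (hypothetical helper names; in Spark the first step is a `flatMap` over blocks and the second a `combineByKey` keyed on the row identifier):

```python
import heapq

def local_minima(row_values, k):
    """Per-block step: the k smallest (distance, column) pairs from one row chunk."""
    return heapq.nsmallest(k, row_values)

def merge_minima(lists, k):
    """combineByKey step: merge per-block candidate lists into the global kNN list."""
    return heapq.nsmallest(k, heapq.merge(*[sorted(l) for l in lists]))
```

Each block contributes at most k candidates per row, so the merged lists stay small regardless of q.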
While at this point we could finalize the kNN search, we apply one more round of transformations to efficiently convert the kNN lists into the neighborhood graph G, such that it is ready for processing by the all-pairs shortest-paths solver. The idea is to reuse the data blocks maintained by the RDD storing D to store G. To this end, we apply a map transformation to the kNN lists and for each point i we produce key-value pair ((a, b), (i, j, d_{i,j})), where the tuple (i, j, d_{i,j}) identifies nearest neighbor j of point i together with their distance. This enables us to associate each kNN distance with the sub-block of G to which it belongs. Thus, we use union to combine the resulting RDD with the original block matrices of D, and we follow with a combineByKey transformation. The transformation fills blocks of G with +∞, and then sets the actual neighbor distances according to the kNN lists. The resulting neighborhood graph, G, is stored as an RDD of blocks of size (n/q) × (n/q), in exactly the same way as D, benefiting from the fact that D is persistent and minimizing NumPy array reallocations.
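The block-filling step can be written out sequentially as follows (our stand-in for the union/combineByKey pipeline; `fill_graph_blocks` and the dictionary-of-blocks layout are illustrative assumptions):

```python
import numpy as np

def fill_graph_blocks(knn, n, q):
    """Build upper-triangular blocks of G: +inf everywhere, actual distances for kNN edges (i, j, d)."""
    bsz = -(-n // q)                               # block size, ceil(n / q)
    G = {}
    for i, j, d in knn:
        a, b = sorted((i // bsz, j // bsz))        # upper-triangular block key
        blk = G.setdefault((a, b), np.full((bsz, bsz), np.inf))
        # Place the edge in the upper-triangular block, transposing if needed.
        r, c = (i, j) if i // bsz <= j // bsz else (j, i)
        blk[r - a * bsz, c - b * bsz] = d
    return G
```

Because edges below the diagonal are transposed into their upper-triangular blocks, the graph reuses exactly the storage pattern of D.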
The 1D decomposition we use in our kNN solver, and the induced 2D decomposition of matrix D, are logical and separate from the physical RDD partitioning employed by the Spark runtime. Moreover, instead of relying on the default Spark partitioner, we use a custom partitioner tailored for managing upper-triangular matrices. This is motivated by performance considerations. The partitioner assigns an integer to each block (a, b) of the matrix in row-major fashion. Hence, when the number of RDD partitions is equal to the number of logical blocks, the assignment is exactly the upper-triangular row-major index for block (a, b). When the number of partitions p is smaller, we store c blocks per partition, where c = ⌈q(q+1)/2 / p⌉. Specifically, we assign the first c blocks to partition 0, the next c blocks to partition 1, and so on (see Fig. 2). In our experiments, we found that this strategy of assigning multiple blocks to a single RDD partition provides better data locality than the default partitioner or the GridPartitioner found in the MLlib library [14] (a popular machine learning library for Apache Spark). This is because neighboring blocks are stored near one another in the RDD, which leads to improved data locality and reduced data shuffling. Therefore, we employ the same custom partitioner in the APSP solver (see below).
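The block-to-partition mapping can be written directly (our reconstruction of the scheme described above; q is the number of row blocks and p the number of partitions):

```python
def triu_index(a, b, q):
    """Row-major index of upper-triangular block (a, b), a <= b, in a q x q block grid."""
    return a * q - a * (a - 1) // 2 + (b - a)

def partition_of(a, b, q, p):
    """Assign block (a, b) to one of p partitions, c consecutive blocks per partition."""
    total = q * (q + 1) // 2
    c = -(-total // p)                  # ceil(total / p)
    return triu_index(a, b, q) // c
```

Consecutive indices land in the same partition, which is what keeps neighboring blocks physically close.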
III-B. All-Pairs Shortest-Paths Computation
Given the neighborhood graph G, the next step in the Isomap workflow is to solve all-pairs shortest-paths (APSP) in G. Recall that shortest paths approximate manifold distances between data points [3], and hence they will form the feature matrix M.
Two commonly used algorithms for APSP are Dijkstra's and Floyd-Warshall [15]. However, both methods are ill-suited for the Spark model, owing to their low computation rate compared to the data movement rate. An alternative approach involves a special case of the matrix power [16]. Consider matrix multiplication over the semiring with additive and multiplicative operations ⊕ and ⊗. The additive operation is minimum and the multiplicative operation is scalar addition, that is, a ⊕ b = min(a, b) and a ⊗ b = a + b. In this way we reduce APSP to repeated matrix multiplication of the adjacency matrix by itself (we refer to this method as min-plus). This is still not ideal: regular data movement patterns like those in matrix multiplication are unsuitable for the data shuffles that occur under the Spark model. In fact, matrix multiplication is yet to see an efficient realization in Spark [17]. Therefore, we take a middle-ground approach.
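Under these semiring operations, the matrix "product" takes the following form (a NumPy sketch; in our solver the same kernel is applied per block and JIT-compiled with Numba, as described later):

```python
import numpy as np

def minplus(A, B):
    """Min-plus product: C[i, j] = min over t of (A[i, t] + B[t, j])."""
    # Broadcast to shape (m, n, k) so entry [i, j, t] = A[i, t] + B[t, j], then reduce over t.
    return np.min(A[:, None, :] + B.T[None, :, :], axis=2)
```

Applying `minplus(A, A)` extends all shortest paths by one intermediate hop, which is why repeated squaring solves APSP.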
We cast an iterative communication-avoiding APSP algorithm by Solomonik et al. [18], which is based on the block formulation of Floyd-Warshall [19], into the Spark model, and we use min-plus to batch-update blocks of shortest-path values. It turns out that communication-avoiding algorithms, a technique well known in High Performance Computing, enable us to minimize data shuffles, leading to an efficient Spark realization. We note that the correctness of this algorithm follows directly from that of computing the transitive closure, and has been discussed, e.g., in [18, 19].
In our approach, we iteratively update the entire APSP adjacency matrix, A, in three phases (Fig. 3 illustrates which block of A is updated in which phase). In the first phase, we solve sequential Floyd-Warshall restricted to a single sub-block on the diagonal. This is the first diagonal block in iteration t = 0. Once we have solved the diagonal block, we share the solution with blocks in the same row (and column). This begins Phase 2. Upon receiving the diagonal block, we perform min-plus matrix multiplication to update Phase 2 blocks in place. In the final phase, all remaining blocks receive two matrices from Phase 2: one matrix is sent from the block in the shared row, and the other from the shared column. After computing the min-plus product of the received matrices, we update Phase 3 blocks in place. The process repeats with successive diagonal blocks. At the conclusion of all three phases for the final diagonal block, all matrix entries contain the corresponding shortest-path length.
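The three-phase iteration can be expressed compactly in sequential form (a sketch of the per-iteration logic under the assumption that the block count q divides n; in Spark each phase is realized with the filter/union/combineByKey/flatMap pipeline described next):

```python
import numpy as np

def blocked_apsp(A, q):
    """Three-phase blocked Floyd-Warshall over a q x q grid of blocks of A (in place)."""
    n = A.shape[0]
    s = n // q                                             # block size, assuming q | n
    B = lambda i, j: A[i * s:(i + 1) * s, j * s:(j + 1) * s]
    mp = lambda X, Y: np.min(X[:, None, :] + Y.T[None, :, :], axis=2)
    for t in range(q):
        # Phase 1: sequential Floyd-Warshall on the diagonal block.
        D = B(t, t)
        for k in range(s):
            np.minimum(D, D[:, k:k + 1] + D[k:k + 1, :], out=D)
        # Phase 2: update row t and column t using the solved diagonal block.
        for j in range(q):
            if j != t:
                np.minimum(B(t, j), mp(D, B(t, j)), out=B(t, j))
                np.minimum(B(j, t), mp(B(j, t), D), out=B(j, t))
        # Phase 3: update all remaining blocks with the min-plus product of Phase 2 blocks.
        for i in range(q):
            for j in range(q):
                if i != t and j != t:
                    np.minimum(B(i, j), mp(B(i, t), B(t, j)), out=B(i, j))
    return A
```

Only Phase 1 is sequential; all Phase 2 blocks, and then all Phase 3 blocks, are independent of one another, which is the parallelism the Spark realization exploits.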
We realize the above algorithm in Spark, starting from the persisted RDD representing graph G. We proceed iteratively over blocks A_{t,t} on the diagonal, which form the critical path of the algorithm. To extract diagonal block A_{t,t}, we use a simple filter transformation over the keys assigned to each block. We follow with a flatMap that executes Floyd-Warshall over submatrix A_{t,t}. Upon completion, the transformation yields multiple instances of the updated block with keys (t, j) and (i, t), where i, j ≠ t. The keys refer to the blocks in the same row and column, where the updated diagonal block is needed for the next phase of computation. Here, similar to the kNN search, we replicate the data to expose the parallelism inherent to Phase 2.
To begin Phase 2, we again filter A, this time to obtain sub-blocks with keys (t, j) and (i, t). These blocks store values not yet updated in iteration t. Thus, we perform a union transformation to bring together these blocks and the diagonal blocks processed in the earlier phase. When applying union we take special care to ensure that the number of partitions in the resulting RDD does not grow, as that would negatively impact the subsequent reduction operations, since the exchange of blocks requires a shuffle between partitions. Therefore, we use partitionBy to ensure all RDDs involved in union have the same partitioning as A. Next, we execute combineByKey in order to pair each block from row (column) t with the diagonal block. To conclude Phase 2, we run a flatMap that performs min-plus matrix multiplication and in parallel updates the current blocks. The transformation yields additional instances of the updated blocks such that they can be passed to Phase 3.
To conclude a single iteration, we update Phase 3 blocks A_{i,j}, which involves computing the min-plus product A_{i,t} ⊗ A_{t,j}. Hence, we use yet another filter on A to obtain the remaining blocks A_{i,j}, which are in neither row nor column t. These blocks, along with the blocks we yield at the end of Phase 2, are brought to a single RDD via union, again ensuring appropriate partitioning with partitionBy. Next, we use a combineByKey transformation to bring together the three matrices needed to update blocks A_{i,j}. This includes the current value of A_{i,j} and the Phase 2 blocks specified above, necessary to compute the product P = A_{i,t} ⊗ A_{t,j}. Once P is ready, A_{i,j} is updated by computing the element-wise minimum of itself with P.
The three phases are repeated for the next diagonal block. At the conclusion of iteration t = q − 1, all pairwise distances represent the approximate manifold distances. To prepare the matrix for subsequent normalization, we square each element of the adjacency matrix to obtain the actual feature matrix M.
In our implementation, the sequential APSP component uses the optimized and efficient SciPy implementation of the Floyd-Warshall algorithm. It operates in place on already allocated memory referenced by NumPy arrays. For min-plus matrix multiplication, we use our own Python implementation. However, our Python code is compiled and optimized to machine code using the Numba just-in-time compiler, providing excellent performance. Finally, to further improve performance of our min-plus code, we enforce C or Fortran memory layout of matrices as appropriate to maximize cache usage.
The overall performance of our APSP solver hinges on the efficient use of the combineByKey transformation. While the transformation involves expensive data shuffling, we found that in practice it is more efficient than the available alternatives, like exchanging blocks via collect/broadcast [20], or provisioning in shared secondary storage (e.g., writing to HDFS, or using SparkFiles.addFile and SparkFiles.get). This is due to the inefficiency of collect or the file I/O required by the alternative approaches. As the problem size and block size increase, these operations become a bottleneck. Spark collect requires that the diagonal block and an entire row of blocks be brought to the driver in Phases 1 and 2, respectively. Similarly, the use of secondary storage requires a sequence of file write and read operations to make blocks accessible to the appropriate tasks. In our experiments, we found the cost of using heap memory combined with reduceByKey for duplication of blocks to be superior to collecting to the driver or using secondary storage.
The final remark is with respect to RDD lineages (i.e., RDD provenance and execution records). Observe that at each iteration over the diagonal, a new RDD describing the entire distance matrix A is created. The ancestors of this RDD are all RDDs prescribed by the earlier transformations. Consequently, the resulting RDD lineage grows from iteration to iteration, potentially overwhelming the Spark driver. Since the driver is also responsible for scheduling, this hinders performance of the entire algorithm. To address this issue, we checkpoint and persist A to prune the lineage of the RDDs which describe its blocks. The frequency of checkpointing depends on the size of the input data and the block size, but in all test cases we found that checkpointing every 10 iterations performs best.
III-C. Matrix Normalization
The goal of matrix normalization is to transform M so that it is both row and column centered. In other words, the mean of each row and each column is zero. Centering can be expressed as a linear transformation of the form C·M·C, where C = I − (1/n)·11^T is the centering matrix, and I is the identity matrix. Thus, the update of M requires two matrix-matrix multiplications. As previously discussed, matrix-matrix multiplication is inefficient in Spark. For this reason, in our implementation we double-center M by the direct approach, exploiting the fact that M is symmetric. Specifically, we compute the mean of each column of M, and we store all means in the row vector m, where m_j is the mean of column j. As the means for the rows of M are equivalent to m^T, we do not compute them explicitly, and to center M we subtract m from each row, and m^T from each column of M. To finalize the process, we add to each entry of M the global mean μ computed over all elements of M.

To express the above algorithm in Spark, we first transform the RDD storing M by a flatMap that for each block in parallel computes the sums of its columns. The transformation yields a tuple (b, s), and, for off-diagonal blocks, an additional tuple that accounts for processing blocks under the diagonal (recall that M is upper-triangular). Here, s represents the vector of column sums for the block, and is stored as a 1D NumPy array. Next we aggregate the final column sums via a reduceByKey transformation that applies the addition operator over vectors with a shared key b. The vector summation is efficiently implemented in NumPy. At this point, the resulting RDD contains complete information to compute the actual column means as well as the global mean μ. Hence, we first sum all elements of the sum vectors into a scalar using a single map transformation, and then use the reduce action to obtain the global sum at the Spark driver. Additionally, we bring to the driver all column sums by executing the collectAsMap action. After dividing all the sums by n to obtain the desired means, we broadcast them back to the Spark executors. Here we note that both the reduce/collectAsMap and broadcast actions involve relatively small messages (even for large n), and hence this communication pattern is the most direct and efficient way to exchange the computed means between executors. Finally, we conclude the normalization step by transforming the RDD that stores M via a map that in parallel applies the previously broadcast means to the local blocks. The resulting matrix is ready for spectral decomposition.
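Sequentially, the direct double-centering scheme reduces to a few lines (our sketch; in Spark the column sums come from the flatMap/reduceByKey stages and the global mean from a reduce at the driver):

```python
import numpy as np

def double_center(M):
    """Row- and column-center a symmetric matrix M using column means only."""
    m = M.mean(axis=0)              # column means; row means equal m by symmetry
    mu = M.mean()                   # global mean over all elements
    return M - m[None, :] - m[:, None] + mu
```

Since M is symmetric, one pass over the columns supplies both the row and column corrections, which is exactly what lets the Spark version avoid the two matrix-matrix multiplications.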
III-D. Spectral Decomposition
Computing eigenvectors and eigenvalues of the feature matrix M is the final step common to all spectral dimensionality reduction methods, as the eigenvectors represent the desired low-dimensional representation of X. However, large-scale eigendecomposition is computationally challenging, and the existing parallel eigensolvers are primarily designed for shared memory and multi/many-core architectures [21, 22]. Because, to the best of our knowledge, there are currently no eigensolvers available for Apache Spark, we develop a new approach derived from the Power Iteration method [23]. By using simultaneous power iteration over M we are able to efficiently extract only the top d eigenpairs at the same time. This is computationally advantageous considering that other approaches require iteratively extracting each eigenpair, and in practical dimensionality reduction applications d is relatively small, e.g., d = 2 or d = 3 when data visualization is the goal.
The standard power method (see Alg. 2) begins with a set of d linearly independent vectors, e.g., the first d standard basis vectors. It then considers all d vectors simultaneously, and iteratively applies QR decomposition until the process converges to some acceptable level ε, or a predefined number of iterations I is reached. In practice, ε and I are selected to achieve the accuracy required by the given target application. To achieve an efficient Spark realization of the power method we exploit two facts. First, for large n and reasonable values of d, matrices Q and R have a small memory footprint, and hence incur small sequential processing and communication overheads. Second, QR decomposition has extremely efficient BLAS realizations, available in NumPy. Therefore, it makes sense to assign managing and decomposition of the matrix to the Spark driver (line 5), while offloading the computationally expensive matrix product (line 4) for parallel processing by the Spark executors. We note that while MLlib [14] provides a Spark-based QR decomposition (using the tall-and-skinny QR algorithm [24]), it is limited to the MLlib RowMatrix RDD, which would be inefficient during the matrix product stage required by the power method. Consequently, in our implementation, the driver is responsible for keeping track of matrices Q and R, and for checking the convergence criteria (line 6). After QR decomposition, the driver broadcasts the entire matrix Q to all executors. In this way, we can directly multiply each block of M with the appropriate block of Q. With this approach we avoid costly replication and shuffle for pairing blocks from each matrix for multiplication. We use a flatMap to compute block products instead. The transformation yields tuples (a, V_{a,b}) for each block of M. For non-diagonal blocks we also yield (b, V_{b,a}), to account for the upper-triangular storage. Here V_{a,b} is a NumPy 2D array representing one of the block products required for the product V = M·Q. To obtain the final form of V we use a reduceByKey transformation, keyed on the block row index, with element-wise addition of NumPy 2D arrays. In the final step, we bring V back to the driver via collectAsMap.
To complete a single iteration, we perform QR factorization of V on the driver, and test the Frobenius norm of the difference between successive Q matrices for convergence. At the completion of I iterations, or upon convergence, we return the current instances of Q and R, finalizing the spectral decomposition stage, and thus the entire dimensionality reduction process.
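The driver-side loop can be sketched locally as follows (illustrative only; in our Spark realization the product M @ Q is computed by the executors as described above, and the sign normalization is an assumption we add to make successive Q comparable under NumPy's QR sign convention):

```python
import numpy as np

def simultaneous_power_iteration(M, d, max_iter=200, eps=1e-9):
    """Top-d eigenpairs of a symmetric matrix M via simultaneous (orthogonal) power iteration."""
    n = M.shape[0]
    Q = np.eye(n, d)                       # first d standard basis vectors
    R = np.eye(d)
    for _ in range(max_iter):
        V = M @ Q                          # offloaded to the executors in the Spark version
        Q_new, R = np.linalg.qr(V)
        # Fix the QR sign ambiguity so successive Q can be compared in norm.
        s = np.sign(np.diag(R))
        s[s == 0] = 1
        Q_new, R = Q_new * s, R * s[:, None]
        if np.linalg.norm(Q_new - Q) < eps:
            return Q_new, np.diag(R)
        Q = Q_new
    return Q, np.diag(R)
```

At convergence the diagonal of R approximates the top d eigenvalues, and Q holds the corresponding eigenvectors whose scaled product yields the embedding Y.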
IV. Experimental Results
To understand the performance characteristics of our approach, we performed a set of experiments on a standalone Apache Spark cluster with 25 nodes and GbE interconnect. Each node in the cluster is equipped with a dual-socket 20-core Intel Xeon E5 v3 2.3 GHz processor, 64 GB of RAM, and a standard SATA hard drive. In all tests, the Spark driver was run with 56 GB of memory, to account for the length of the lineage of RDDs in the APSP loop. For the same reason, we ran the driver on a dedicated node, separately from the Spark executors. We allocated one executor per node using the default configuration for the number of cores, i.e., each executor was using all available cores in a node. All executors were configured to use 56 GB out of the available 64 GB, with the remaining memory available to the operating system and the Python interpreter. We note that we tested different configurations of executor-to-core ratios, across different datasets, without a noticeable difference in performance.
Our entire software is implemented in Apache Spark 2.2 and Python 2.7. Compute-intensive linear algebra operations are offloaded via NumPy and SciPy to a BLAS library, specifically Intel MKL 2017. Finally, we use Numba 0.35 for optimal min-plus matrix multiplication with just-in-time compilation.
In all experiments, in the kNN stage we used the same fixed neighborhood size k. For our test data (see below), this k is large enough to deliver a single connected component in the neighborhood graph, and small enough to avoid unnecessary shortcut edges. In the eigendecomposition stage, we used a small number of target dimensions, and we allowed a fixed maximum number of iterations to achieve convergence. We note that in all experiments, many fewer iterations were required to see convergence (usually around 20-50 iterations).
IV-A. Test Data and Method Correctness
The Swiss Roll benchmark is a classic test for evaluating manifold learning algorithms. Swiss Roll data is generated from some input 2D data, which is embedded into 3D using a nonlinear function. In our tests, we used the Euler Isometric Swiss Roll [25]. To evaluate the correctness and demonstrate the scalability of our implementation, we created three test datasets. These are samples from the Swiss Roll of size 50,000, 75,000, and 100,000 points, which we refer to as Swiss50, Swiss75, and Swiss100.
Figure 4 shows the initial data, the high-dimensional embedded data, and the resulting dimensionality reduction delivered by our method for the Swiss50 dataset. A quick visual inspection reveals that Fig. 4(c) appears to be a faithful representation of the original data in Fig. 4(a). To actually quantify how well our method reconstructs the original data, we use the Procrustes error, which captures the similarity of two datasets X and Y by determining the optimal affine transformation of Y that best conforms it with X [26]. In our case, the Procrustes error of our result compared to the original 2D coordinates is negligible, confirming the high accuracy of the reconstruction.
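The Procrustes error can be computed in a few lines of NumPy (a standard orthogonal-Procrustes sketch, not necessarily the exact variant of [26]; both datasets are centered and scaled before the optimal rotation is found via SVD):

```python
import numpy as np

def procrustes_error(X, Y):
    """Normalized residual after optimally translating, scaling, and rotating Y onto X."""
    Xc = X - X.mean(0)
    Yc = Y - Y.mean(0)
    Xc = Xc / np.linalg.norm(Xc)           # unit Frobenius norm removes scale
    Yc = Yc / np.linalg.norm(Yc)
    U, s, Vt = np.linalg.svd(Xc.T @ Yc)    # optimal rotation comes from this SVD
    return 1 - s.sum() ** 2                # 0 means a perfect match up to similarity
```

An embedding that differs from the reference only by rotation, translation, and uniform scale yields an error of essentially zero.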
To test our platform on actual high-dimensional data, we used benchmarks derived from EMNIST [27], which is frequently used in machine learning research. This dataset provides images of over 200,000 handwritten digits, and hence each data point (an image of a digit) is a 784-dimensional vector. For testing purposes, we randomly sampled two sets from EMNIST. They contain 50,000 and 125,000 points, and we refer to them as EMNIST50 and EMNIST125, respectively.
Figure 5 shows the 2D mapping of EMNIST50 obtained using our Isomap method. From the figure we can see that clusters of digits that look similar appear close together, e.g., '5' and '6'. At the same time, clusters of distinct digits that blend together also share such similarities. In Fig. 5(b), sample images of digits are included to accentuate the features captured by axes D1 and D2. The axis D2 describes the angle of slant of the handwritten digit; we observe the change in angle from top to bottom of each cluster. The primary axis, D1, accounts for curved or straight segments in the digit. For example, on the left we see '4's, which are made up entirely of straight line segments. In contrast, zeros on the right contain no straight segments. To the best of our knowledge, this is the first time such a result is reported, and it confirms the practical value of large-scale Isomap in processing noisy image data.
IV-B. Scalability Tests
To analyze how our implementation scales with the problem size, expressed by n, and the compute resources used, we processed all datasets on a varying number of nodes, recording the wall time. The results of this experiment are summarized in Tables I-III. Here we report relative speedup, computed as the fraction S_p = T_m / T_p, where T_p is the time taken to process a data instance on p nodes, and T_m is the time taken to process the same instance on the smallest feasible number of nodes, m. We note that even the smallest datasets we consider in our tests are impossible to process using a single node, and hence m is not 1. Similarly, we compute relative efficiency as E_p = (m · S_p) / p.
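These metrics are computed straightforwardly from the recorded wall times; for example (helper names are ours, with values taken from the EMNIST50 rows of the tables below):

```python
def relative_speedup(times, m):
    """Relative speedup S_p = T_m / T_p, with the baseline at the smallest feasible node count m."""
    return {p: times[m] / t for p, t in times.items()}

def relative_efficiency(times, m):
    """Relative efficiency E_p = (m * S_p) / p."""
    return {p: m * s / p for p, s in relative_speedup(times, m).items()}
```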
The experimental results show that our approach exhibits strong scaling within the entire range of data and cluster configurations. For the smaller datasets (Swiss50 and EMNIST50), the method scales linearly up to 12 nodes, and maintains very good efficiency up to 24 nodes. The scaling is very similar for both Swiss50 and EMNIST50, which is expected taking into account that only the kNN stage depends (linearly) on the dimensionality D, and it constitutes only a small fraction of the entire runtime. At this point we should note that because we express all Isomap stages as linear transformations or matrix multiplications, the performance of our method is not at all impacted by data distribution or density (e.g., connectivity of graph G). When the size of the input data increases, the method seems to exhibit superlinear performance; however, this result should be taken with care, considering that we compute relative and not actual speedup. The method also maintains weak scalability. Specifically, for a fixed ratio of the problem size to the number of nodes, the growth of the execution time reflects the dominating O(n^3) cost of the APSP stage.
Table I: Execution time for a varying number of compute nodes.

Name  2  4  8  12  16  20  24
EMNIST50  294.92  117.90  62.41  48.47  41.44  35.82  32.92 
Swiss50  261.43  98.88  53.58  39.41  33.26  29.47  26.49 
Swiss75  –  958.85  345.47  105.28  84.05  76.66  63.25 
Swiss100  –  –  1122.89  465.30  275.22  225.89  157.97 
EMNIST125  –  –  –  985.94  662.92  599.28  448.42 
'–' means that the dataset was impossible to process on the given resources.
Table II: Relative speedup for a varying number of compute nodes.

Name  2  4  8  12  16  20  24
EMNIST50  1  2.50  4.73  6.09  7.11  8.23  8.96 
Swiss50  1  2.64  4.88  6.63  7.86  8.87  9.87 
Swiss75  –  1  2.78  9.11  11.41  12.51  15.16 
Swiss100  –  –  1  2.41  4.08  4.97  7.11 
EMNIST125  –  –  –  1  1.49  1.65  2.20 
TABLE III: Relative efficiency as a function of the number of compute nodes.

            Compute Nodes
Name        2     4     8     12    16    20    24
EMNIST50    1     1.25  1.18  1.02  0.89  0.82  0.75
Swiss50     1     1.32  1.22  1.11  0.98  0.89  0.82
Swiss75     –     1     1.39  3.04  2.85  2.50  2.53
Swiss100    –     –     1     1.61  2.04  1.99  2.37
EMNIST125   –     –     –     1     1.12  0.99  1.10
The performance of our method directly depends on the logical block size b (see Sec. III-A). Recall that in the NN search, blocks of b points are paired to compute one block of the pairwise distance matrix. The execution time of computing a single block scales as O(b^2 * d). When computing APSP, we require a sequential Floyd-Warshall for each block along the diagonal, and a matrix-matrix min-plus multiplication for the others. Both cases have complexity O(b^3). Moreover, the APSP solver proceeds iteratively over the diagonal of the distance matrix. Since each block on the diagonal must be solved in sequence, the critical path has length n/b. All these factors must be taken into consideration when selecting b. In our experiments, we found that moderate block sizes give close to ideal performance for the smaller inputs. This is because the length of the critical path is not overwhelming for the Spark driver, and at the same time, the block size is small enough to leverage cache memory when executing BLAS routines for matrix products. As the problem size grows, it is advantageous to increase the block size to keep the critical path under control. This, however, may increase the time taken to process an individual block, and may lead to resource underutilization, as there are fewer blocks to distribute between executors. Figure 6 shows the impact of the block size when processing Swiss75 on 24 nodes (we observe a similar pattern for other datasets). There is a clear sweet spot, and both undersizing and oversizing b degrade performance. Currently, we do not have a model that would let us select b automatically; however, the intuition we provide above is sufficient for the majority of practical applications.
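For reference, the two per-block APSP kernels discussed above can be sketched in a few lines of NumPy. This is a sequential illustration only (the function names are ours); in our actual implementation these kernels run inside Spark tasks, with the dense products offloaded to BLAS:

```python
import numpy as np

def min_plus(A, B):
    # Min-plus ("tropical") product: C[i, j] = min_k (A[i, k] + B[k, j]).
    # This kernel replaces the ordinary GEMM in blocked APSP for the
    # off-diagonal blocks; entries are path lengths, inf = no path.
    n, m = A.shape[0], B.shape[1]
    C = np.full((n, m), np.inf)
    for k in range(A.shape[1]):
        C = np.minimum(C, A[:, k, None] + B[None, k, :])
    return C

def floyd_warshall(D):
    # Sequential Floyd-Warshall applied to one b-by-b diagonal block,
    # O(b^3) work; vectorized over the two inner loops.
    D = D.copy()
    for k in range(D.shape[0]):
        D = np.minimum(D, D[:, k, None] + D[None, k, :])
    return D
```

On a single block both kernels perform O(b^3) scalar min/plus operations, which is why the choice of b trades critical-path length (n/b sequential diagonal steps) against per-block cost.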
V Related Work
As previously mentioned, existing implementations of Isomap and its variants do not scale to data as large as presented here. Talwalkar et al. [9] successfully learned a manifold for 18 million points. However, their analysis relies on an approximation method for APSP, and on the Nyström method for approximate spectral decomposition. To date, the largest study of exact solutions for Isomap has been reported in [10]. That method is tailored for HPC clusters with MPI, and hence its applicability is somewhat constrained.
To partially mitigate the complexity of APSP, the authors of Isomap have proposed Landmark-Isomap (L-Isomap) [8]. The idea behind L-Isomap is that a small subset of randomly selected landmark points can be used to learn the underlying manifold. The remaining points are then placed on the manifold via triangulation using their distances to the landmarks. This approach greatly reduces complexity by targeting APSP, but introduces new sources of error in i) the landmark selection method, ii) selecting the number of landmarks, and iii) adding the complete dataset by triangulation. We previously extended the works of de Silva and Tenenbaum [8, 2] and proposed Streaming-Isomap [25]. The streaming approach first learns a faithful representation of the underlying manifold from some initial batch of points, and then quickly maps new points arriving on a high-volume, high-throughput stream. This approach is orthogonal to the one we present here, and in fact both methods could be combined in the case when the initial batch is large.
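The triangulation step of L-Isomap (placing a point from its squared distances to already-embedded landmarks, in the spirit of the landmark MDS of [8]) can be sketched as follows. This is a minimal NumPy illustration under our own naming, not the authors' code:

```python
import numpy as np

def triangulate(Y, d2_new, D2_land):
    # Y:       m x k centered landmark embedding (from classical MDS)
    # d2_new:  length-m vector of squared distances from the new point
    #          to the m landmarks
    # D2_land: m x m matrix of squared distances among the landmarks
    # The new point is placed via the pseudoinverse of the landmark
    # coordinates, exact when the distances are Euclidean.
    delta_mu = D2_land.mean(axis=0)  # mean squared distance per landmark
    return -0.5 * np.linalg.pinv(Y) @ (d2_new - delta_mu)
```

As a sanity check, with two landmarks at positions 0 and 2 on a line (centered embedding -1 and +1), a point at distance 1 from both is mapped to the centered coordinate 0, as expected.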
VI Conclusion
High-dimensional big data, arising in many practical applications, presents a challenge for nonlinear spectral dimensionality reduction methods. This is due to the computationally intensive construction of the feature matrix and its spectral decomposition. In this paper, we proposed a scalable distributed memory approach to manifold learning with Isomap, addressing the key computational bottlenecks of the method. Through experiments, we demonstrated that the resulting Apache Spark-based implementation is scalable and efficient. The method can solve exact Isomap on high-dimensional datasets that are an order of magnitude larger than what can be processed with the existing methods. We note that the individual components, such as the NN, APSP, and eigendecomposition solvers, can be used as standalone routines. Finally, since other nonlinear spectral decomposition methods, e.g., LLE [28], share the same computational backbone, our software could be extended to cover these methods with minimal effort. The source code of our platform is open and available from: https://gitlab.com/SCoReGroup/IsomapSpark.
Acknowledgment
The authors would like to acknowledge support provided by the Center for Computational Research at the University at Buffalo. This work has been supported in part by the Department of Veterans Affairs under grant 7D0084.
References
 [1] J. Lee and M. Verleysen, Nonlinear Dimensionality Reduction. Springer Verlag, 2007.
 [2] J. Tenenbaum, V. de Silva, and J. Langford, “A global geometric framework for nonlinear dimensionality reduction,” Science, vol. 290, no. 5500, p. 2319, 2000.
 [3] M. Bernstein, V. de Silva, J. Langford, and J. Tenenbaum, “Graph approximations to geodesics on embedded manifolds,” 2000.
 [4] P. Turnbaugh, R. Ley, M. Hamady, C. Fraser-Liggett, R. Knight, and J. Gordon, “The human microbiome project,” Nature, vol. 449, no. 7164, p. 804, 2007.
 [5] R. Marchand, N. Beagley, and T. Ackerman, “Evaluation of hydrometeor occurrence profiles in the multiscale modeling framework climate model using atmospheric classification,” Journal of Climate, vol. 22, no. 17, pp. 4557–4573, 2009.
 [6] B. Thirion and O. Faugeras, “Nonlinear dimension reduction of fMRI data: the Laplacian embedding approach,” in IEEE International Symposium on Biomedical Imaging: Nano to Macro, 2004, pp. 372–375.
 [7] W. Li, S. Prasad, J. Fowler, and L. Bruce, “Locality-preserving dimensionality reduction and classification for hyperspectral image analysis,” IEEE Transactions on Geoscience and Remote Sensing, vol. 50, no. 4, pp. 1185–1198, 2012.
 [8] V. de Silva and J. Tenenbaum, “Global versus local methods in nonlinear dimensionality reduction,” in Advances in Neural Information Processing Systems, 2003, pp. 721–728.

 [9] A. Talwalkar, S. Kumar, and H. Rowley, “Large-scale manifold learning,” in IEEE Conference on Computer Vision and Pattern Recognition, 2008, pp. 1–8.
 [10] S. Samudrala, J. Zola, S. Aluru, and B. Ganapathysubramanian, “Parallel framework for dimensionality reduction of large-scale datasets,” Scientific Programming, vol. 2015, p. 1, 2015.
 [11] J. Bentley, “Multidimensional binary search trees used for associative searching,” Communications of the ACM, vol. 18, no. 9, pp. 509–517, 1975.
 [12] W. Kim, Y. Kim, and K. Shim, “Parallel computation of k-nearest neighbor joins using MapReduce,” in IEEE International Conference on Big Data, 2016, pp. 696–705.
 [13] R. Weber, H. Schek, and S. Blott, “A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces,” in International Conference on Very Large Data Bases, 1998, pp. 194–205.
 [14] X. Meng, J. Bradley, B. Yavuz, E. Sparks, S. Venkataraman, D. Liu, J. Freeman, D. Tsai, M. Amde, S. Owen et al., “MLlib: Machine learning in apache spark,” The Journal of Machine Learning Research, vol. 17, no. 1, pp. 1235–1241, 2016.
 [15] T. Cormen, C. Leiserson, R. Rivest, and C. Stein, Introduction to algorithms. MIT press, 2009.
 [16] J. Kepner and J. Gilbert, Graph Algorithms in the Language of Linear Algebra. Society for Industrial and Applied Mathematics, 2011.
 [17] R. Zadeh, X. Meng, A. Ulanov, B. Yavuz, L. Pu, S. Venkataraman, E. Sparks, A. Staple, and M. Zaharia, “Matrix computations and optimization in Apache Spark,” in ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 31–38.
 [18] E. Solomonik, A. Buluc, and J. Demmel, “Minimizing communication in all-pairs shortest paths,” in IEEE International Symposium on Parallel and Distributed Processing, 2013, pp. 548–559.
 [19] G. Venkataraman, S. Sahni, and S. Mukhopadhyaya, “A blocked all-pairs shortest-paths algorithm,” Journal of Experimental Algorithmics, vol. 8, 2003.
 [20] T. Vacek, “Flyby improved dense matrix multiplication,” 2015.
 [21] J. Choi, J. Demmel, I. Dhillon, J. Dongarra, S. Ostrouchov, A. Petitet, K. Stanley, D. Walker, and R. Whaley, “ScaLAPACK: A portable linear algebra library for distributed memory computers—design issues and performance,” Computer Physics Communications, vol. 97, no. 12, pp. 1–15, 1996.
 [22] S. Balay, S. Abhyankar, M. Adams, J. Brown, P. Brune, K. Buschelman, L. Dalcin, V. Eijkhout, W. Gropp, D. Kaushik et al., “PETSc users manual revision 3.8,” Argonne National Laboratory (ANL), Tech. Rep., 2017.
 [23] G. Golub and C. V. Loan, Matrix Computations. JHU Press, 2012.
 [24] P. Constantine and D. Gleich, “Tall and skinny QR factorizations in MapReduce architectures,” in International Workshop on MapReduce and Its Applications, 2011, pp. 43–50.
 [25] F. Schoeneman, S. Mahapatra, V. Chandola, N. Napp, and J. Zola, “Error metrics for learning reliable manifolds from streaming data,” in SIAM International Conference on Data Mining, 2017, pp. 750–758.
 [26] I. Dryden and K. Mardia, Statistical Shape Analysis. John Wiley & Sons, 1998.
 [27] G. Cohen, S. Afshar, J. Tapson, and A. van Schaik, “EMNIST: an extension of MNIST to handwritten letters,” ArXiv e-prints, 2017.
 [28] S. Roweis and L. Saul, “Nonlinear dimensionality reduction by Locally Linear Embedding,” Science, vol. 290, no. 5500, pp. 2323–2326, 2000.