In the past decades, we have been witnessing a steady increase in the size of datasets generated and processed by computational systems. Such voluminous data comes from various sources, such as business sales records, the collected results of scientific experiments or real-time sensors used in the Internet of Things (IoT). The most popular way to represent such data is via a set of data points lying in a vector space. The construction of the vector space is often performed using a distance or similarity matrix that can be constructed manually using perceptual ratings or, more commonly, computed automatically using a set of features. In many of these applications high-dimensional data representations are assumed to lie in the vicinity of a low-dimensional, possibly non-linear manifold, embedded in the high-dimensional space. This is known as the manifold hypothesis[fefferman2016testing]. Intuitively human cognition also performs similar mappings when performing everyday tasks, i.e., high-dimensional sensory input get embedded into low dimensional cognitive subspaces [kanerva1988sparse]; [pothos2011quantum]; [pothos2013can] for rapid and robust decision making, since only a small number of features are salient for each task. Given this assumption manifold learning aims to discover such hidden low-dimensional structure and to output a representation with much fewer “intrinsic variables”.
In this paper, we study the problem of manifold learning in non-metric topological spaces. The input to this problem is a matrix of (similarities or) dissimilarities333It should be mentioned that in many real-world tasks the used dissimilarity measures may correspond in pseudo- or semimetric distance functions that violate the triangular inequality. of the dataset objects. “Objects” can be colors, faces, map coordinates, political persuasion scores, or any kind of real-world or synthetic stimuli. For each input dataset object, the output is a low-dimensional vector such that the pairwise Euclidean distances of the output vectors resemble the original dissimilarities. This problem is known as non-metric multidimensional scaling (MDS) or non-linear dimensionality reduction (NLDR) task. An abundance of embedding methods have been developed for dealing with this task as detailed in Section 2.
The majority of these algorithms reduce this problem to the optimization of a deterministic loss function
. Given this minimization objective, they usually employ gradient-based methods to find a global or a local optimum. In many situations, however, the loss function is non-differentiable or estimating its gradient may be computational expensive. Additionally, gradient-based algorithms usually yield a slow convergence; multiple iterations are needed in order to minimize the loss function.
Inspired by the recent progress in derivative-free optimization tools, we propose an iterative algorithm which treats the non-metric MDS task as a derivative-free optimization problem. The main contributions of the paper are as follows: 1) Using the General Pattern Search (GPS) formulation we are able to provide theoretical convergence guarantees for the proposed non-metric MDS algorithm. 2) A set of heuristics are proposed that significantly improve the performance of the proposed algorithm in terms of computational efficiency, convergence rate and solution accuracy. 3) The proposed algorithm is evaluated on a variety of tasks including manifold unfolding, word embeddings and optical digit recognition, showing consistent performance and good convergence properties. We also compare performance with state-of-the-art MDS algorithms for the aforementioned tasks for clean and noisy datasets. An optimized implementation of pattern search MDS and the experimental code is made available as open source to the research community444Open source code available: https://github.com/georgepar/pattern-search-mds.
The remainder of the paper is organized as follows: We begin with an overview of the related work in Section 2. Then in Section 3, we review several optimization problems that are related to the manifold learning task and we present the GPS framework. Then in Section 4, we present in detail the proposed derivative-free algorithm, a sketch of the reduction of the algorithm to the GPS formulation and the associated proof of fixed-point convergence guarantees. Finally in Section 5, the proposed algorithm is compared and contrasted with other dimensionality reduction methods in a variety of tasks with or without the presence of noise. We conclude and present future directions for research in Sections 6 and 7, respectively.
2 Related Work
Loosely speaking, a manifold is a topological space that locally resembles a Euclidean space. The purpose of Multidimensional Scaling (MDS) is to infer data representations on a low-dimensional manifold while simultaneously preserving the distances of the high-dimensional data points. When data lies on or close to a linear subspace, low-dimensional representations of data can be obtained using linear dimensionality reduction techniques like Principle Components Analysis (PCA) [Pearson] and classical MDS.
In real data applications, such a linearity assumption may be too strong and can lead to meaningless results. Thus a significant effort has been made by the machine learning community to apply manifold learning in non-linear domains. Representative manifold learning algorithms include Isometric Feature Mapping (ISOMAP)[tenenbaum_global_2000, bernstein_graph_2000, zha_isometric_2003, DonohoG05, pless_image_2003], Landmark ISOMAP [silva2004-landmark, silva_global_2003],Locally Linear Embedding (LLE) [BelkinN03, Cayton, saul_think_2003, sha_analysis_2005, belkin_laplacian_2001], Modified LLE [zhang2007mlle] Hessian LLE [zhang_principal_2004, donoho_hessian_2003], Semidefinite Embedding [weinberger2006unsupervised], [weinberger2005nonlinear], [vandenberghe_semidefinite_1996], [bertsekas_nonlinear_1999], Laplacian Eigenmaps (LE) [BelkinN01, BelkinN03, BelkinN06], Local Tangent Space Alignment (LTSA) [LTSA],etc. ISOMAP uses a geodesic distance to measure the geometric information within a manifold. LLE assumes that a manifold can be approximated in a Euclidean space and the reconstruction coefficients of neighbors can be preserved in the low-dimensional space. LE uses an undirected weighted graph to preserve local neighbor relationships. Hessian LLE obtains low-dimensional representations through applying eigenanalysis on a Hessian coefficient matrix. LTSA utilizes local tangent information to represent the manifold geometry and extends this to global coordinates. Finally, SDE attempts to maximize the distance between points that don’t belong in a local neighborhood. Also, a common nonlinear method for dimensionality reduction is the kernel extension of PCA [scholkopf1998nonlinear].
A wide class of derivative-free algorithms for nonlinear optimization has been studied and analyzed in [Rios2013] and [avriel2003nonlinear]. GPS methods consist a subset of the aforementioned algorithms which do not require the explicit computation of the gradient in each iteration-step. Some GPS algorithms are: the original Hooke and Jeeves pattern search algorithm [Hooke:1961:DSS:321062.321069], the evolutionary operation by utilizing factorial design [box1957evolutionary] and the multi-directional search algorithm [torczon1989multidirectional], [doi:10.1137/0801027]. In [torczon1997convergence], a unified theoretical formulation of GPS algorithms under a common notation model has been presented as well as an extensive analysis of their global convergence properties. Local convergence properties have been studied later by [dolan2003local]. Notably, the theoretical framework as well as the convergence properties of GPS methods have been extended in cases with linear constrains [doi:10.1137/S1052623497331373], boundary constrains [doi:10.1137/S1052623496300507] and general Lagrangian formulation [doi:10.1137/0728030].
We denote real, integer and natural numbers as , , , respectively. Scalars are represented by no-boldface letters, vectors appear in boldface lowercase letters and matrices are indicated by boldface uppercase letters. All vectors are assumed to be column vectors unless they are explicitly defined as row vectors. For a vector , is its norm and is its norm, where is the ith element of . By we denote a real-valued matrix with n rows and m columns. Additionally, the th column of the matrix and its entry at th row and th column are referenced as and , respectively. The trace of the matrix appears as and its Frobenius norm as
. The square identity matrix withrows is denoted as . For the matrices and we indicate their Hadamard product as . The -ary Cartesian product over sets is denoted by Finally, refers to the estimate of a variable at the th iteration of an algorithm.
3.2 Classical MDS
Classical MDS was first introduced by [ClassicalMDS] and can be formalized as follows. Given the matrix consisting of pairwise distances or dissimilarities between points in a high dimensional space, the solution to Classical MDS is given by a set of points which lie on the manifold and their pairwise distances are able to preserve the given dissimilarities as faithfully as possible. Each point corresponds to a column of the matrix . The embedding dimension is selected as small as possible in order to obtain the maximum dimensionality reduction but also to be able to approximate the given dissimilarities by the Euclidean distances in the embedded space .
The proposed algorithm uses a centering matrix in order to subtract the mean of the columns and the rows for each element. Where a vector of ones in space. By applying the double centering to the Hadamard product of the given dissimilarities, the Gram matrix is constructed as follows:
It can be shown (Ch. 12 [borg_groenen_2005]) that classical MDS minimizes the Strain algebraic criterion in Eq. 2 below:
The eigendecomposition of the symmetric matrix gives us and thus the new set of points consisting the embedding in are given by the first
positive eigenvalues of, namely
. This solution provides the same result as Principal Component Analysis (PCA) applied on the vector in the high dimensional space[doi:10.1093/biomet/53.3-4.325]. Classic MDS was originally proposed for dissimilarity matrices which can be embedded with good approximation accuracy in a low-dimensional Euclidean space. However, matrices which correspond to embeddings in Euclidean sub-spaces [10.1007/978-3-642-46900-8_44], Poincare disks [Poincare:MDS] and constant-curvature Riemannian spaces [lindman1978constant] have also been studied.
3.3 Metric MDS
Metric MDS describes a superset of optimization problems containing classical MDS. Shepard has introduced heuristic methods to enable transformations of the given dissimilarities [Shepard1962], [Shepard1962b] but did not provide any loss function in order to model them [groenen2014past]. Kruskal in [Kruskal:a] and [Kruskal:b] formalized the metric MDS as a least squares optimization problem of minimizing the non-convex Stress-1 function defined in Eq. 3 shown next:
where matrix with elements represents all the pairs of the transformed dissimilarities that are used to fit the embedded distance pairs .
In essence, where is usually an affine transformation555Monotone and polynomial regression transformations are employed for nonmetric-MDS, as well as, a wider family of transformations [france2011two]. for unknown and . Kruskal proposed an iterative gradient-based algorithm for the minimization of since the solution cannot be expressed in closed form. Assuming that the algorithm iteratively tries to find the coordinates of points which are lying in the low embedding space . Trivial solutions ( and ) are avoided by the denominator term in Eq. 3.
A weighted MDS raw Stress function is defined as:
where the weights are restricted to be non-negative; for missing data the weights are set equal to zero. By setting one can model an equal contribution to the Metric-MDS solution for all the elements.
SMACOF which stands for Scaling by Majorizing a Complex Function is a state-of-the-art algorithm for solving metric MDS and was introduced by [Leeuw77applicationsof]. By setting in raw stress function defined in Eq. 4, SMACOF minimizes the resulting stress function .
The algorithm proceeds iteratively and decreases stress monotonically up to a fixed point by optimizing a convex function which serves as an upper bound for the non-convex stress function in Eq. 5. An extensive description of SMACOF can be found in [borg_groenen_2005] while its convergence for a Euclidean embedded space has been proven by [deLeeuw1988].
Let matrices and be defined element-wise as follows:
The stress function in Eq. 5 is converted to the following quadratic form:
The quadratic can be minimized iteratively as follows:
where is the estimate of matrix at the th iteration and is Moore-Penrose pseudoinverse of . At iteration the convex majorizing convex function touches the surface of at the point . By minimizing this simple quadratic function in Eq. 9 we find the next update which serves as a starting point for the next iteration . The solution to the minimization problem is shown in Eq. 10. The algorithm stops when the new update yields a decrease that is smaller than a threshold value.
3.5 GPS formulation
The unconstrained problem of minimizing a continuously differentiable function is formally described as
Next we present a short description of iterative GPS minimization of Eq. 11 based on [torczon1997convergence, dolan2003local]. First we have to define the following components:
A basis matrix that could be any nonsingular matrix .
A matrix for generating all the possible moves for the th iteration of the minimization algorithm
where the columns of form a positive span of and contains at least the zero column of the search space .
A pattern matrix defined as
where the submatrix forms a basis of .
In each iteration , we define a set of steps generated by the pattern matrix as shown next:
where is the th column of and defines the direction of the new step, while configures the length towards this direction. If the pattern matrix contains columns, then in order to positively span the search space . Thus, a new trial point of GPS algorithm towards this step would be where we evaluate the value of the function to minimize. The success of a new trial point is decided based on the condition that it takes a step towards further minimizing the function , i.e., . The steps of a GPS method are presented in Alg. 1.
To initialize the algorithm we select a point and a positive step length parameter . In each iteration , we explore a set of moves defined by the subroutine at line 5 of the algorithm. Pattern search methods described using a GPS formalism mainly differ on the heuristics used for the selection of exploratory moves. If a new exploratory point lowers the value of the function , iteration is successful and the starting point of the next iteration is updated as shown in line 8, else there is no update. The step length parameter is modified by the subroutine in line 11. For successful iterations, i.e., , the step length is forced to increase in a determistic way as follows:
where and are predefined constants that are used for the th successive successful iteration. For unsuccessful iterations the step length parameter is decreased, i.e., as follows:
where and the negative integer determine the fixed ratio of step reduction. Note that the generating matrix could be also updated for unsuccessful/successful iterations in order to contain more/less search directions, respectively.
3.6 GPS Convergence
GPS methods under the aforementioned defined framework have some important convergence properties shown in [torczon1997convergence, dolan2003local, doi:10.1137/S1052623496300507, doi:10.1137/0728030, doi:10.1137/S1052623497331373] and summarized here. For any GPS method which satisfies the specifications of Hyp. 3.6 on the exploratory moves one may be able to show convergence for Alg. 1.
Hypothesis (Weak Hyp. on Exploratory Moves):
If among the exploratory moves at iteration selected from the columns of the matrix exist at least one move that leads to success, i.e., , then the subroutine will return a move such that .
Hyp. 3.6 enforces some mild constraints on the configuration of the exploratory moves produced by Alg. 1, line 5. Essentially, the suggested step is derived from the pattern matrix , while the algorithm needs to provide a simple decrease for the objective function . Specifically, the only way to accept an unsuccessful iteration would be if none of the steps from the columns of the matrix lead to a decrease of the objective function . Based on this hypothesis one can formulate Thm. 3.6 as follows:
Let be closed and bounded and continuously differentiable on a neighborhood of , namely on the union of the open balls where . If a GPS method is formulated as described in Section 3.5 and Hyp. 3.6 holds then for the sequence of iterations produced by Alg. 1
As shown in [Audet2004] one can construct a continuously differentiable objective function and a GPS method with infinite many limit points with non-zero gradients and thus even Thm. 3.6 holds, the convergence of is not assured. However, the convergence properties of GPS methods can be further strengthened if additional criteria are met. Specifically, a stronger hypothesis on exploratory moves Hyp. 3.6 regulates the measure of decrease of the objective function for each step produced by the GPS method, as follows:
Hypothesis (Strong Hyp. on Exploratory Moves):
If among the exploratory moves at iteration selected from the columns of the matrix exists at least one move that leads to success, i.e., , then the subroutine will return a move such that:
Hyp. 3.6 enforces the additional strong constraint on the configuration of the exploratory moves, namely that the subroutine will do no worse than produce the best exploratory move from the columns of the matrix . Based on this hypothesis and by adding requirements restricting the exploration step direction and length for the GPS method, one can formulate Thm. 3.6 which is also presented here without proof.
Let be closed and bounded and continuously differentiable on a neighborhood of , namely on the union of the open balls where . If a GPS method is formulated as described in Section 3.5, , the columns of the generating matrices are bounded by norm and Hyp. 3.6 holds then for the sequence of iterations produced by Alg. 1
The additional requirements specify that: 1) the generating matrix should be norm bounded in order to produce trial steps from Eq. 14 that are bounded by the step length parameter and 2) that can be easily met by selecting in Eq. 16; this also guarantees a non increasing sequence of steps [torczon1997convergence]. Although these criteria provide much stronger convergence properties, we are faced with a trade off between the theoretical proof of convergence and the efficiency of heuristics in finding a local optimum.
Both theorems 3.6 and 3.6 provide a first order optimality condition if their specifications hold. Although the latter theorem premises much stronger convergence results, step-length control parameter , provides a reliable asymptotic measure of first-order stationarity when it is reduced after unsuccessful iterations [dolan2003local].
4 Pattern Search MDS
4.1 Core algorithm
The key idea behind the proposed algorithm is to treat MDS as a derivative-free problem, using a variant of general pattern search optimization to minimize a loss function. The input to pattern search MDS is a target dissimilarity matrix and the target dimension of the embedding space. An overview of the algorithm shown in Alg. 2 is presented next.
The initialization process of the algorithm consists of: 1) random sampling of points in the embedded space and construction of the matrix , 2) computing the embedded space dissimilarity matrix , where the element is the Euclidean distance between vectors and of , and 3) computing the initial approximation error , where is the element-wise mean squared error (MSE) between the two matrices. The functional that we attempt to minimize is the normalized square of the Frobenius norm of the matrix , i.e., . Equivalently one may express element-wise as follows:
Following the initialization steps, in each epoch (iteration), we consider the surface of a hypersphere of radius around each point . The possible search directions lie on the surface of a hypersphere along the orthogonal basis of the space, e.g., in the case of -dimensional space along the directions on the sphere shown in Fig. 1. This creates the search directions matrix and is summarized in Alg. 3
Each point is moved greedily along the dimension that produces the minimum error. At this stage we only consider moves that yield a monotonic decrease in the error function. Alg. 4 finds the optimal move that minimizes for each new point and moves in that direction. Note that when writing , the matrix is considered to be a set of row vectors.
The resulting error is computed after performing the optimal move for each point in . If the error decrease hits a plateau, we halve the search radius and proceed to the next epoch. This is expressed as , where is a small positive constant, namely the error decrease becomes very small in relation to . The process stops when the search radius becomes very small, namely , where is a small constant, as shown in Alg. 2.
4.2 Optimizations and algorithm complexity
Next, a set of algorithmic optimizations are presented that can improve the execution time and the solution quality of Alg. 2. We also present ways to improve the execution time by searching for an approximate solution, as well as, discuss ways to utilize parallel computation for parts of the algorithm.
4.2.1 Allow for “bad” moves
In Section 4.1 we restrict the accepted moves so that the error decreases monotonically. This is a reasonable restriction that also provides us with theoretical guarantees of convergence. Nonetheless in our experimental setting, we observed that if we relax this restriction and allow each point to always make the optimal move, regardless if the error (temporarily) increases the algorithm converges faster to better solutions. The idea of allowing greedy algorithms to make some “bad” moves in hope to get over local minima can be found in other optimization algorithms, simulated annealing [kirkpatrick1983optimization] being the most popular. To implement this one can modify line 13 in Alg. 2 to:
4.2.2 Online computation of dissimilarity matrix
In line 6 of Alg. 4 we observe that we recompute the dissimilarity matrix for each move. This can be avoided because each move modifies only one point , therefore only the row and column of the dissimilarity matrix are affected. Furthermore only one dimension of the vector is modified by the move, i.e., only element of matrix . In detail, the element that stores the dissimilarity between points and should be updated as follows for the move from to for :
4.2.3 Step and move selection
It follows from the need to search for the optimal move across the embedding dimensions , that the complexity of the algorithm has a linear dependency on . A large value of might affect the execution time of the algorithm. An approximate technique to alleviate this is perform a random sampling over all possible directions in the dimensional space in order to select a “good” direction instead of the optimal, thus restricting the search space666One can potentially do better than random sampling of all possible directions in the dimensional space. As the geometry of the embedding space starts becoming apparent, after a few epochs of the algorithm, it makes sense to increasingly bias the search towards the principal component vectors of the neighborhood of the point that is being moved..
An important parameter for our algorithm is the starting radius . This parameter controls how broad the search will be initially and has an effect similar to the learning rate of gradient-based optimization algorithms. If we are too conservative and choose a small initial radius, the algorithm will converge slowly to a local optimum, whereas if we set it too high, the error will overshoot and convergence is not guaranteed. A simple technique to automatically find a good starting radius is to use binary search. In particular, we set the starting radius to an arbitrary value, perform a dry run of the algorithm for one epoch and observe the effect on error. If the error increases we halve the radius. Otherwise we double it and repeat the process. This process is allowed to run for a small number of epochs. The starting radius found using this technique is a not too pessimistic or too optimistic estimate of the best parameter value.
Another way to boost the execution time is to utilize parallel computation to speed up parts of the algorithm. In our case we can parallelize the search for the optimal moves across the embedding dimensions using the map-reduce parallelization pattern. Specifically, we can map the search for candidate moves to run in different threads and store the error for each candidate move in an array . After the search completes we can perform a reduction operation (min) to find the optimal move and the optimal error . For our implementation we used the OpenMP parallelization framework [dagum1998openmp] and it led to a times speedup in execution time.
For each epoch we search across dimensions for points. In each search we also need operations to update the distance matrix. Thus, the per epoch computational complexity of the algorithm is . The optimizations proposed above do not change the complexity of the algorithm per epoch with the notable exception of the move selection optimization: if instead of 2 moves per epoch one would consider only 2 moves. In this case, the overall complexity per epoch would be instead of . However, as we shall see in the experiments that follow the (rest of the) proposed optimization significantly improve convergence speed, resulting in fewer epochs and less computation complexity overall.
4.3 GPS formulation of our Algorithm
Pattern Search MDS belongs to the general class of GPS methods and can be expressed using the unified GPS formulation introduced in Section 3.5. Next, we express our proposed algorithm and associated objective function under this formalism.
First, we restate the problem of MDS in a vectorized form. We use matrix with elements that expresses the dissimilarities between points in the high dimensional space. The set of points lie on the low dimensional manifold and form the column set of matrix . The matrix will be now vectorized as an one column vector as shown next:
Now our new variable lies in the search space . The distance between any two points and of the manifold remains the same but is now expressed as a function of the vectorized variable . Namely, . To this end, our new objective function to minimize is the MSE between the given dissimilarities and the euclidean distances in the low dimensional manifold as defined in Eq. 20 shown next:
Consequently, the initial MDS is now expressed as an unconstrained non-convex optimization problem which is expressed by minimizing the function over the search space of (Eq. 21). Specifically, the coordinates for all points on the manifold
now serve as degrees of freedom for our solution.
Now that we have formulated the problem and the variable in the appropriate format we can match each epoch of our initial algorithm with an iteration of a GPS method. Therefore, the moves produced by our algorithm form a sequence of points . Moreover, we are going to define the matrices for our algorithm as in Eqs. 12, 13. The choice of our basis matrix is the identity matrix as shown in Eq. 23.
While the identity matrix is non singular and its columns span positively the search space , we also define as the identity matrix. In Eq. 24 matrix represents the movement alongside the unit coordinate vectors of . Nevertheless, our generating matrix also comprises of all the remaining possible directions which are generated by the set . In total, we have extra direction vectors inside the corresponding matrix as it is shown in Eq. 25.
According to Eqs. 24, 25, we construct the full pattern matrix in Eq. 26 in a similar way to Eq. 13. For our algorithm the pattern matrix is is equal to our generating matrix which is also fixed for all iterations. Conceptually, the generating matrix contains all the possible exploratory moves while a heuristic is utilized for evaluating the objective function only for a subset of them.
Finally, we configure the updates of the step length parameter for each class of both successful and unsuccessful iterations as they were previously described in Eqs. 15, 16, respectively. Recalling the notation of Section 3.5, is the step which is returned from our exploratory moves subroutine at th iteration. For the successful iterates we do not further increase the length of our moves by limiting as follows:
Similarly, for the unsuccessful iterations we halve the distance by a factor of by setting as it is shown next:
A short description of our algorithm as a GPS method for solving the problem stated in Eq. 21 follows: In each iteration, we fix the optimal coordinate direction for each one of the points lying on the low dimensional manifold . For each internal iteration of Alg. 4, if the optimal direction produces a lower value for our objective function we accumulate this direction and move alongside this coordinate of the . Otherwise, we remain at the same position. As a result, the exploration of coordinates for the new point begins from this temporary position. This greedy approach provides a potential one-hot vector as described in Eq. 22 if the iterate is successful or otherwise, the zero vector . The final direction vector for th iteration is computed by summing these one-hot or zero vectors. At the th iteration, the movement would be given by a scalar multiplication of the step length parameter with the final direction vector in a similar way as defined in Eq. 14. This provides a simple decrease for the objective function or in the worst case represent a zero movement in the search space . Regarding the movement across , it is trivial to show that this reduction of the objective function is an associative operation. In other words, accumulating all best coordinate steps for each point and performing the movement at the end of the th iteration (as GPS method formulation requires) produces the same result as taking each coordinate step individually. Finally, pattern search MDS terminates when the step length parameter becomes smaller than a predefined threshold.
4.4 Convergence of our Algorithm
Now that we have homogenized the notation framework as well as have expressed the proposed algorithm as a GPS method one can utilize the theorems stated in Section 3.6 to prove the convergence properties of the proposed algorithm.
First of all, the objective function is indeed continuously differentiable for all the values of the search space by its definition in Eq. 20. Moreover, the pattern matrix in Eq. 26 contains all the possible step vectors provided by our exploratory moves routine. Thus, all of our exploratory moves are defined by Eq. 14. In each iteration we evaluate the trial steps alongside all coordinates for all the points . In our restated problem definition (see Section 4.3), this is translated to searching all over the identity matrices and of the search space . But from our definition of the first columns of our generating matrix in Eq. 24 this corresponds to checking all the potential coordinate steps provided by . Consequently, if there exists a simple decrease when moving towards any of the directions provided by the columns of then our algorithm also provides a simple decrease. This result verifies that Hyp. 3.6 is true for the exploratory moves. By combining the differentiability of our objective function and Hyp. 3.6, Thm. 3.6 holds for pattern search MDS. Hence, is guaranteed.
Trying to further strengthen the convergence properties of the proposed algorithm, we note that most of the requirements of Thm. 3.6 are met but we fail to meet the specifications of Hyp. 3.6 for the minimum decrease provided by the the columns of . However, our generating matrix is indeed bounded by norm because . By halving the step length parameter for the unsuccessful iterations we also ensure that . In order to meet the specifications of Thm. 3.6 we would need a quadratic complexity of in order to ensure that each iteration provides the same decrease in function as the decrease provided by the “best” column of . This is formally stated at the second part of Hyp. 3.6. If we modify our algorithm in order to meet these requirements we would not be able to implement all the optimizations proposed in Section 4.2 and the overall runtime would be dramatically increased.
5.1 Tuning the hyperparameters
Next we present some guidelines on how to set the hyperparameters for the proposed algorithm and report the values used in the experiments that follow. Specifically:
The constant in line 9 of Alg. 2 determines when the move radius is decreased. By setting to a value very close to , e.g., , the search will take more epochs but the solution will be closer to the local optimum. If we relax to a value around , we can do a coarse exploration of the search space that will produce a rough solution in a small number of epochs. In our experiments we set that provides a good trade-off between solution quality and fast convergence for the datasets used.
We experimentally found that if is large, we may only search of the search dimensions and still get a good solution, while significantly reducing the execution time. For this to hold, it is important that we randomly sample a new search space for each epoch.
The proposed algorithm is relatively robust to the choice of the initial size of the move search radius. However, the choice of does affect convergence speed. We show the convergence for an example run of the classical swissroll (see Section 5.2) for best-case (), pessimistic () and optimistic () starting radii in Fig. 2.
5.2 Manifold Geometry
The key assumption in manifold learning is that input data lie on a low-dimensional, non-linear manifold, embedded in a high-dimensional space. Thus non-linear dimensionality reduction techniques aim to extract the low-dimensional manifold from the high dimensional space. To showcase this we generated a variety of geometric manifold shapes and compared the proposed MDS to other, well-established dimensionality reduction techniques. We make the code to generate the synthetic data openly available to the community777Open source code available: https://github.com/georgepar/gentlemandata.
One should note that MDS algorithms with Euclidean distance matrices as inputs cannot infer data geometry, thus we need to provide as input a geodesic distance matrix. This matrix is computed by running Djikstra’s shortest path algorithm on the Nearest Neighbors graph trained on the input data. For our experiments we sample points on shapes and reduce them to dimensions using pattern search MDS, SMACOF MDS [Leeuw77applicationsof], truncated SVD [golub1965calculating], Isomap [tenenbaum_global_2000, bernstein_graph_2000, zha_isometric_2003, DonohoG05, pless_image_2003], Local Linear Embedding (LLE) [BelkinN03, Cayton, saul_think_2003, sha_analysis_2005, belkin_laplacian_2001], Hessian LLE [zhang_principal_2004, donoho_hessian_2003], modified LLE [zhang2007mlle] and Local Tangent Space Alignment (LTSA) [LTSA].
The geodesic distance matrices provided to pattern search MDS and SMACOF MDS is computed using Djikstra’s algorithm on k-NN (nearest neighbor) graphs. We list the times it took each method to run. Note that pattern search MDS is faster than SMACOF MDS.
We present characteristic shapes selected from the ones we tested. The first shape we examine is the classical swissroll, where a plane is “rolled” in space and the target is to extract the original plane. Results are presented in Fig. 2(a). We observe that linear dimensionality reduction techniques like truncated SVD have trouble unrolling the swissroll. Also LLE introduces a lot of distortion to the constructed plane.
Next we examine how the algorithms handle sparse distance matrices. To this end, we generate a dataset of non-overlapping clusters with a line connecting the centroids, where sparsity of the distance matrix follows because the vast majority of the points are very closely sampled inside the clusters. A good mapping should preserve the cluster structure in lower dimensions. In Fig. 2(b) we see that the truncated SVD and the MDS family of algorithms (proposed, SMACOF, Isomap) produce good results, while the LLE variants can’t handle sparsity in distance matrices very well. In particular Hessian LLE and LTSA do not produce any output because of numerical instability888In Hessian LLE the matrices used for the null space computation become singular, while in LTSA the resulting point coordinates are infinite. in the eigenvalue decomposition stages of these algorithms. Pattern search MDS does not rely on eigenvalue computation or equation system solvers and therefore it is numerically stable.
Finally, we showcase how the algorithms perform with transitions from dense to sparse regions with a toroidal helix shape in Fig. 2(c). We can see that five methods, including pattern search MDS, unroll the shape into the expected circle, while truncated SVD provides a daisy-like shape. Hessian LLE and LTSA collapse the helix into multiple overlapping circles.
5.3 Dimensionality reduction for semantic similarity
Construction of semantic network models consists of representing concepts as vectors in a, possibly high-dimensional, space
. The relations between concepts are quantified as the distances, or inversely the cosine similarities, between semantic vectors. The semantic similarity task aims to evaluate the correlation of the similarities between concepts in a given semantic space against a set of ground truth similarity values provided by human annotators.
We evaluate the performance of the dimensionality techniques investigated also in Section 5.2 for the semantic similarity task. We use the MEN [Bruni:2014:MDS:2655713.2655714] and SimLex-999 [hill2015simlex] semantic datasets as ground truth. Both datasets are provided in the form of lists of word pairs, where each pair is associated with a similarity score. This score was computed by averaging the similarities provided by human annotators. As the high-dimensional semantic word vectors, we use the -dimensional GloVe vectors constructed by [baziotis-pelekis-doulkeridis:2017:SemEval2] using a large Twitter corpus. We reduce the dimensionality of the vectors to the target dimension and calculate the Spearman correlation coefficient between the human provided and the automatically computed similarity scores. Results are summarized in Table 1 for . We observe that LLE yields the best results for MEN, while pattern search MDS performs best for SimLex-999. In addition, we observe that non-linear dimensionality reduction techniques can significantly improve the performance of the semantic vectors in some cases.
|pattern search MDS||0.242|
5.4 Dimensionality reduction for k-NN classification
The next set of experiments aims to compare the proposed algorithm to other dimensionality reduction methods for k-NN classification on a real dataset. We choose to use MNIST as a benchmark dataset which contains handwritten digit images. We selected a random subset of images and reduced the dimensionality from to . Performance of the models is evaluated on 1-NN classification and using
-fold cross-validation. The evaluation metric is macro-averaged F1 score. Table2 summarizes the results. Observe that dimensionality reduction using pattern search MDS and Truncated SVD can improve classification performance over the original high-dimensional data. Pattern search MDS yields the best results overall. Hessian LLE, Modified LLE and LTSA did not run due to numerical instability.
|Method||Dimensions||MNIST 1-NN F1 score|
|pattern search MDS||0.878|
5.5 Convergence characteristics
Next we compare speed of convergence of pattern search MDS and MDS SMACOF, in terms of numbers of epochs. To this end we will consider the experiments of Sections 5.2 and 5.3 and present comparative convergence plots. We see the convergence plots for the cases of swissroll, clusters, toroid helix in Fig. 3(a), 3(b) and 3(c), respectively. The convergence plot for the word semantic similarity task is shown in Fig. 3(d). The plots are presented in y-axis logarithmic scale because the starting error is many orders of magnitude larger than the local minimum reached by the algorithms.
For all cases, we observe that pattern search MDS converges very quickly to a similar or better local optimum while MDS SMACOF hits regions where the convergence slows down and then recovers. These saw-like structure of the pattern search plots are due to the fact that we allow for “bad moves” as detailed in Section 4.2.1.
5.6 Robustness to noisy or missing data
The final set of experiments aims to demonstrate the robustness of pattern search MDS when the input data are corrupted or noisy. To this end two cases of data corruption are considered: additive noise and missing data.
5.6.1 Robustness to additive noise
For this set of experiments, we inject Gaussian noise of variable standard deviation () to the input data and use the dissimilarity matrix calculated on the noisy data as input to each one of the algorithms evaluated.
For the synthetic data of Section 5.2, we will follow a qualitative evaluation by showing the unrolled manifolds for high levels of noise. We perform dimensionality reduction for swissroll, toroid helix and clusters for increasing noise levels. We report results for the highest possible noise deviation where one or more techniques still produce meaningful manifolds. Beyond these values of the original manifolds become corrupted and the output of all methods is dominated by noise. Figs. 4(a), 4(b), 4(c) show the results for noisy swissroll with , clusters with and toroid helix with respectively. Overall, the pattern search MDS, followed by SMACOF MDS and Isomap are more robust to additive noise.
For the semantic similarity task we injected different levels of Gaussian noise in the original word vectors and evaluated the correlation on MEN and Simlex-999. Results are presented in Table 3. We observe that the relative performance of the algorithms is maintained under noise injection, except for LLE which cannot handle high amounts of noise. LLE is achieving the best correlation values on MEN at and , while pattern search MDS achieving the best performance on Simlex-999.
|pattern search MDS||0.249||0.315||0.204|