Given a dataset of points in some high-dimensional space with dimensionality , manifold learning algorithms try to find a low-dimensional embedding of every point from in some space with dimensionality
. These algorithms play an important role in high-dimensional data analysis, specifically for data visualization, whereor . The quality of the methods have come a long way in recent decades, from classic linear methods (e.g. PCA, MDS), to more nonlinear spectral methods, such as Laplacian Eigenmaps (Belkin and Niyogi, 2003), LLE (Saul and Roweis, 2003) and Isomap (de Silva and Tenenbaum, 2003), finally followed by even more general nonlinear embedding (NLE) methods, which include Stochastic Neighbor Embedding (SNE, Hinton and Roweis, 2003), -SNE (van der Maaten and Hinton, 2008), NeRV (Venna et al., 2010) and Elastic Embedding (EE, Carreira-Perpiñán, 2010
). This last group of methods is considered as state-of-the-art in manifold learning and became a go-to tool for high-dimensional data analysis in many domains (e.g. to compare the learning states in Deep Reinforcement Learning algorithms(Mnih et al., 2015)
or to visualize learned vectors of some embedding model(Kiros et al., 2015)).
While the results of NLE have improved in quality, their algorithmic complexity has increased as well. NLE methods are defined using a nonconvex objective that requires careful iterative minimization. A lot of effort has been spent on improving the convergence of NLE methods, including Spectral Direction (Vladymyrov and Carreira-Perpiñán, 2012) that uses partial-Hessian information in order to define a better search direction, or optimization using a Majorization-Minimization approach (Yang et al., 2015). However, even with these sophisticated custom algorithms, it is still often necessary to perform a few random restarts in order to achieve a decent solution. Sometimes it is not even clear whether the learned embedding represents the structure of the input data, noise, or the artifacts of an embedding algorithm (Wattenberg et al., 2016).
Consider the situation in fig. 1. There we run the EE times on the same dataset with the same parameters, varying only the initialization. The dataset, COIL-20, consists of photos of different objects as they are rotated on a platform with new photo taken every degrees ( images per object). Good embedding should separate objects one from another and also reflect the rotational sequence of each object (ideally via a circular embedding). We see in the left plot that for virtually every run the embedding gets stuck in a distinct local minima. The other two figures show the difference between the best and the worst embedding depending on how lucky we get with the initialization. The embedding in the center has much better quality comparing to the one on the right, since most of the objects are separated from each other and their embeddings more closely resemble a circle.
In this paper we focus on the analysis of the reasoning behind the occurrence of local minima in the NLE objective function and ways for the algorithms to avoid them. Specifically, we discuss the conditions under which some points get caught in high-energy states of the objective function. We call these points “pressured points” and show that specifically for the NLE class of algorithms there is a natural way to identify and characterize them during optimization.
Our contribution is twofold. First, we look at the objective function of the NLE methods and provide a mechanism to identify the pressured points for a given embedding. This can be used on its own as a diagnostic tool for assessing the quality of a given embedding at the level of individual points. Second, we propose an optimization algorithm able to utilize the insights from the pressured points analysis to achieve better objective function values even from a converged solution of an existing state-of-the-art optimizer. The proposed modification augments existing analysis of the NLE optimization and can be run on top of existing state-of-the-art methods: Spectral Direction and -body algorithms (Yang et al., 2013; van der Maaten, 2014; Vladymyrov and Carreira-Perpiñán, 2014).
Our analysis arises naturally from a given NLE objective function and does not depend on any other assumptions. Other papers have looked into the problem of assessing the quality of the embedding (Peltonen and Lin, 2015; Lee and Verleysen, 2009; Lespinats and Aupetit, 2011). However, their quality criteria are defined separately from the actual learned objective function, which introduces additional assumptions and does not connect to the original objective function. Moreover, we also propose a method for improving the embedding quality in addition to assessing it.
2 Nonlinear Manifold Learning Algorithms
The objective functions for SNE and, usually computed as , to capture a similarity of points in the original space . The algorithms differ in the kernels they use for the distributions in the low-dimensional space. SNE uses the Gaussian kernel111Instead of the classic SNE, in this paper we are going to use symmetric SNE (Cook et al., 2007), where each probability is normalized by the interaction between all pairs of points and not every point individually. , while -SNE is using Student’s kernel .
Carreira-Perpiñán (2010) showed that these algorithms could be defined as an interplay between two additive terms: . Attractive term , usually convex, pulls points close to each other with a force that is larger for points located nearby in the original space. Repulsive term , on the contrary, pushes points away from each other.
Elastic Embedding (EE) modifies the repulsive term of the SNE objective by dropping the , adding a weight to better capture non-local interactions (e.g. as
), and introducing a scaling hyperparameterto control the interplay between two terms.
Here are the objective functions of the described methods:
3 Identifying pressured points
Let us consider the optimization with respect to a given point from . For all the algorithms the attractive term increases as grows and thus has a high penalty for points placed far away in the embedding space (especially if they are located nearby in the original space). The repulsive term is mostly localized and concentrated around individual neighbors of . As navigates the landscape of it tries to get to the minimum of while avoiding the “hills” of
created around repulsive neighbors. However, the degrees of freedom ofis limited by which is typically much smaller than the intrinsic dimensionality of the data. It might happen that the point gets stuck surrounded by its non-local neighbors and is unable to find a path through.
We can illustrate this with a simple scenario involving three points in the original space, where and are near each other and is further away. We decrease the dimensionality to using EE algorithm and assume that due to e.g. poor initialization is located in between and . In the left plot of fig. 2 we show different parts of the objective function as a function of . The attractive term creates a high pressure for to move towards . However, the repulsion between and creates a counter pressure that pushes away from , thus creating two minima: one local near and another global near . Points like are trapped in high energy regions and are not able to move. We argue that these situations are the reason behind many of the local minima of NLE objective fucntions. Identifying and repositioning these points we can improve the objective function and overall the quality of the embedding.
We propose to evaluate the pressure of every point with a very simple and intuitive idea: increased pressure from the “false” neighbors would create a higher energy for the point to escape that location. However, for a true local minimum, there are no directions for that point to move. That is, given the existing number of dimensions. If we were to add a new dimension temporarily just for that point, it would be possible for the points to move along that new dimension (see fig. 2, right). The more that point is pressured by other points, the farther across this new dimension it would go.
More formally, we say that the point is pressured if the objective function has a nontrivial minimum when evaluated at that point along the new dimension . We define the minimum along the dimension as the pressure of that point.
It is important to notice the distinction between pressured points and points that just have higher objective function value when evaluated at those points (a criterion that is used e.g. in Lespinats and Aupetit (2011) to assess the embedding quality). Large objective function value alone does not necessary mean that the point is stuck in a local minimum. First, the point could still be on its way to the minimum. Second, even for an embedding that represents the global minimum, each point would converge to its own unique objective function value since the affinities for every point are distinct. Finally, not every NLE objective function can be easily evaluated for every point separately. SNE (2) and -SNE (3) objective functions contain term that does not allow for easy decoupling.
In what follows we are going to characterize the pressure of each point and look at how the objective function changes when we add an extra dimension to each of the algorithms described above.
For a given point we extend the objective function of EE (1) along the new dimension . Notice that we consider points individually one by one, therefore all for all . The objective function of EE along the new dimension becomes:
where , and is a constant independent from . The function is symmetric wrt and convex for . Its derivative is
The function has a stationary point at , which is a minimum when . Otherwise, is a maximum and the only non-trivial minimum is . The magnitude of the fraction under the log corresponds to the amount of pressure for . The numerator depends on and represents the pressure that points in exert on . The denominator is given by the diagonal element of the degree matrix and represents the attraction of the points in the original high-dimensional space. The fraction is smallest when points are ordered by for all , i.e. in ascending order from to its neighbors. As points change order and move closer to (especially those far in the original space, i.e. with high ) increases and eventually turns from a minimum to a maximum, thus creating a pressured point.
Stochastic Neighbor Embedding.
The objective along the dimension for a point is given by:
where, slightly abusing the notation between different methods, we define and . The derivative is equal to
Similarly to EE, the function is convex, has a stationary point at , which is a minimum when . It also can be rewritten as . The LHS represents the pressure of the points on normalized by an overall pressure for the rest of the points. If this pressure gets larger than the similar quantity in the original space (RHS), the point becomes pressured with the minimum at
-SNE uses Student’s distribution which does not decouple as nice as the Gaussian kernel for EE and SNE. The objective along and its derivative are given by
where . The function is convex, but in this case it is not easy to find the minimum in a closed form. Practically it can be done with just few iterations of the Newton’s method initialized at some positive value close to . In addition, we can quickly test whether the point is pressured or not from the sign of the second derivative at :
4 Pressured points for quality analysis
The analysis above can be directly applied to the existing algorithms as is, resulting in a qualitative statistic of the amount of pressure each point is experiencing during the optimization. A nice additional property is that computing pressure points can be done in constant time by reusing parts of the gradient. A practitioner can run the analysis for every iteration of the algorithm essentially for free to see how many points are pressured and whether the embedding results can be trusted.
In fig. 3 we show a couple of the examples of embeddings with pressured points computed. The embedding of the swissroll on the left had a poor initialization that SNE was not able to recover from. Pressured points are concentrated around the twist in the embedding and in the corners, precisely where the difference with the ground truth occurs. On the right, we can see the embedding of the subset of COIL-20 dataset midway through optimization with EE. The embeddings of some objects overlap with each others, which results in high pressure.
In fig. 4 we show an embedding of the subset from MNIST after iteration of -SNE. We highlight some of the digits that ended up in clusters different from their ground truth. We put them in a red frame if a digit has a high pressure and in a green frame if their pressure is . For the most part the digits in red squares do not belong to the clusters where they are currently located, while digits in green squares look very similar to the digits around them.
5 Improving convergence by pressured points optimization
The analysis above can be also used for improvements in optimization. Imagine for some iteration we have a set of points that are pressured according to the definition above. Effectively it means that given a new dimension the pressured points would utilize it in order to improve their current location. Let us create this new dimension with for all . Non-pressured points still exist in the old space and can only move along those dimensions. For example, here is the augmented objective function for EE:
The objective function splits into three parts. The first two parts represent the minimization of pressured and non-pressured points independently in and dimensional space. The last part represents the interaction between pressure and non-pressure points and also has three parts. The first term represents the attraction between pressured and non-pressured points in space. The second term essentially pulls each to with the weight proportional to the attraction between point and all the non-pressured points. Its form is identical to the norm applied to the extended dimension with the weight given by the attraction between point and all the non-pressured points. Finally, the last term captures the interactions between for pressured points and for non-pressured ones. On one hand, it pushes away from as pressured and non-pressured points move closer to each other in space. On the other hand, it re-weights the repulsion between pressured and non-pressured points proportional to reducing the repulsion for larger values of . In fact, since for all , the repulsion between pressured and non-pressured points would always be weaker than the repulsion of non-pressured points between each other.
Since our final objective is not to find the minimum of (6), but rather get a better embedding of , we are going to add few additional steps to facilitate this. First, after each iteration of minimizing (6) we are going to update by removing points that are not pressured anymore and adding points that just became pressured. Second, we want pressured points to explore the new dimension only to the extent that it could eventually help lowering the original objective function. We want to restrict the use of the new dimension so it would be harder for the points to use it comparing to the original dimensions. It could be achieved by adding penalty to dimension as . This is an organic extension since it has the same form as the second term in (6). For the penalty is given as the weight between pressured and non-pressured points and for larger the penalty prevents the points to use the new dimension at all. This property gives a distinct advantage to our algorithm comparing to the standard use of regularization, where a practitioner has little control over the effect of and has to resort to trial and error. In our case, the regularizer already exists in the objective and its weight sets a natural scale of values to try. Another advantage over traditional regularizer is that there is no upper bound of beyond which the algorithm stops working. For large the points along just collapse to and the algorithm falls back to the original one.
Practically, we propose to use a sequence of values starting at and increase proportionally to the magnitude of , . In the experiments below, we set , although a more aggressive schedule of or more conservative could be used as well. We increase up until for all the points. Typically, it occurs after – steps.
The resulting method is described in Algorithm 1. The algorithm can be embedded and run on top of the existing optimization methods for NLE: Spectral Direction and -body methods.
In Spectral Direction the Hessian is approximated using the second derivative of only. The search direction has the form , where is the gradient, is the graph Laplacian defined on positive affinities and is a small constant that makes the inverse possible. The modified objective that we propose has one more quadratic term and thus the Hessian for the pressured points along dimension is regularized by a positive term . This is good for two reasons: it improves the direction of the Spectral Direction by adding new bits of Hessian, and it makes the Hessian approximation pd, thus avoiding the need to add any constant to it.
Large-scale -Body approximations using Barnes-Hut (Yang et al., 2013; van der Maaten, 2014) or Fast Multipole Methods (FMM, Vladymyrov and Carreira-Perpiñán, 2014) decrease the cost of objective function and the gradient from to or by approximating the interaction between distant points. Pressured points computation uses the same quantities as the gradient, so whichever approximation is applied to the gradient could also be applied to compute pressured points. The only difference comes from slightly larger cost of minimizing (6) which includes a term that computes the interaction between the pressured points in dimension.
Here we are going to compare the original optimization algorithm, which we call simply spectral direction (SD) to the Pressure Point (PP) framework defined above222It would be more fair to call our method SD+PP, since we also apply spectral direction to minimize the extended objective function, but we are going to call it simply PP to avoid extra clutter..
We applied the algorithm to EE and SNE methods. While the proposed methodology could also be applied to -SNE, in practice we were not able to find it useful. -SNE is defined on Student’s kernel that has much longer tails than the Gaussian kernel used in EE and SNE. Because of that, the repulsion between points is much stronger and points are spread far away from each other. The extra-space given by new dimension is not utilized well and the objective function decrease is similar with and without the PP modification.
|EE (COIL-20)||SNE (COIL-20)||EE (MNIST)|
For the first experiment, we run the algorithm on objects from COIL- dataset. We run both SNE and EE different times with the original algorithm until the objective function does not change for more than per iteration. We then run PP optimization with two different initializations: same as the original algorithm and initialized from the convergence value of SD. Over runs for EE, SD got to an average objective function value of , whereas PP with random initialization got to . Initializing from the convergence of SD, out of times PP was able to find better local minima with the average objective function value of . We got similar results for SNE: average objective function value for SD is , which PP improved to for random initialization and to for initialization from local minima of SD. In fig. 5 we show the results for one of the runs for EE and SNE as well as the final embedding after SD and PP optimization. Notice that for initial small values the algorithm extensively uses and explores the extra dimension, which one can see from the increase in the original objective function values as well as from the large fraction of the pressured points. However, for larger the number of pressured points drops sharply, eventually going to . Once gets large enough so that extra dimension is not used, optimization for every new goes very fast, since essentially nothing is changing. On the right side we show the local minimum of SD and the better minimum that was found by PP. Notice that some of the individual digits (e.g. yellow, light blue and top green) improved their embedding. The number of pressured points and the sum of all the pressured values have also decreased.
|Spectral Direction embedding||Pressure Point embedding|
As another comparison point, we evaluate how much headroom we can get on top improvements demonstrated by PP algorithm. For that, we run EE with homotopy method (Carreira-Perpiñán, 2010) where we performed a series of optimizations from a very small , where the objective function has a single global minimum, to final , each time initializing from the previous solution. We got the final value of the objective function around (dashed red line on the EE objective function plot). While we could not get to a same value with PP, we got very close with (comparing to for the best SD optimization).
On the right plot of fig. 5 we show the minimization of MNIST using FMM approximation with accuracy (i.e. truncating the Hermite functions to terms). PP optimization improved the convergence both in case of random initialization and for initialization from the solution of SD. Thus, the benefits of PP algorithm can be increased by also applying SD to improve the optimization direction and FMM to speed up the objective function and gradient computation.
Finally, we run the EE for word embedding vectors pretrained using word2vec algorithm (Mikolov et al., 2013) on Google News dataset. The dataset consists of word-vectors that were downsampled to most popular English words. We first run SD times with different initialization until the embedding does not change by more than per iteration. We then run PP, initialized from the final value of SD. Fig. 6 shows the embedding of one of the worst results that we got from SD and the way the embedding got improved by running PP algorithm. We specifically highlight six different word categories for which the embedding improved significantly. Notice that the words from the same category got closer to each other and formed tighter clusters. Note that more feelings-oriented categories, such as emotion, sensation and nonverbalcommunication got grouped together and now occupy the right side of the embedding instead of being spread across. In fig. 7 we show the final objective function values for all runs together with the improvements achieved by continuing the optimization using PP. In the inset, we show the histogram of the final objective function values of SD and PP. While the very best results of SD have not improved a lot (suggesting that the near-global minimum has been achieved), most of the times SD gets stuck in the higher regions of the objective function that are improved by the PP algorithm.
We proposed a novel framework for assessing the quality of several manifold learning methods using intuitive, natural and computationally cheap way to measure the pressure that each point experiencing from its neighbors. We then outlined the method to make use of that extra dimension in order to find a better solution for the pressured points. The algorithm works well for EE and SNE methods when initialized at random or from the local minima of some other methods. An interesting future direction would be to extend this analysis to measure intrinsic dimensionality of the manifold. In other words, answer the question: how many dimensions do we need to remove all the pressured points?
I would like to acknowledge Nataliya Polyakovska for initial analysis and Makoto Yamaha for useful suggestions that helped improve this work significantly.
- Becker et al. (2003) S. Becker, S. Thrun, and K. Obermayer, editors. Advances in Neural Information Processing Systems (NIPS), volume 15, 2003. MIT Press, Cambridge, MA.
- Belkin and Niyogi (2003) M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6):1373–1396, June 2003.
M. Á. Carreira-Perpiñán.
The elastic embedding algorithm for dimensionality reduction.
Proc. of the 27th Int. Conf. Machine Learning (ICML 2010), Haifa, Israel, June 21–25 2010.
Cook et al. (2007)
J. Cook, I. Sutskever, A. Mnih, and G. Hinton.
Visualizing similarity data with a mixture of maps.
In M. Meilă and X. Shen, editors,
Proc. of the 11th Int. Workshop on Artificial Intelligence and Statistics (AISTATS 2007), San Juan, Puerto Rico, Mar. 21–24 2007.
- de Silva and Tenenbaum (2003) V. de Silva and J. B. Tenenbaum. Global versus local methods in nonlinear dimensionality reduction. In Becker et al. (2003), pages 721–728.
- Hinton and Roweis (2003) G. Hinton and S. T. Roweis. Stochastic neighbor embedding. In Becker et al. (2003), pages 857–864.
- Kiros et al. (2015) R. Kiros, Y. Zhu, R. R. Salakhutdinov, R. Zemel, R. Urtasun, A. Torralba, and S. Fidler. Skip-thought vectors. In Advances in neural information processing systems, pages 3294–3302, 2015.
- Lee and Verleysen (2009) J. A. Lee and M. Verleysen. Quality assessment of dimensionality reduction: Rank-based criteria. Neurocomputing, 72, 2009.
- Lespinats and Aupetit (2011) S. Lespinats and M. Aupetit. CheckViz: Sanity check and topological clues for linear and non-linear mappings. In Computer Graphics Forum, volume 30, pages 113–125, 2011.
- Mikolov et al. (2013) T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119, 2013.
- Mnih et al. (2015) V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
- Peltonen and Lin (2015) J. Peltonen and Z. Lin. Information retrieval approach to meta-visualization. Machine Learning, 99(2):189–229, 2015.
Saul and Roweis (2003)
L. K. Saul and S. T. Roweis.
Think globally, fit locally: Unsupervised learning of low dimensional manifolds.Journal of Machine Learning Research, 4:119–155, June 2003.
- van der Maaten (2014) L. van der Maaten. Accelerating t-sne using tree-based algorithms. Journal of Machine Learning Research, 15:1–21, 2014.
- van der Maaten and Hinton (2008) L. J. van der Maaten and G. E. Hinton. Visualizing data using -SNE. Journal of Machine Learning Research, 9:2579–2605, November 2008.
- Venna et al. (2010) J. Venna, J. Peltonen, K. Nybo, H. Aidos, and S. Kaski. Information retrieval perspective to nonlinear dimensionality reduction for data visualization. Journal of Machine Learning Research, 11:451–490, Feb. 2010.
- Vladymyrov and Carreira-Perpiñán (2012) M. Vladymyrov and M. Á. Carreira-Perpiñán. Partial-Hessian strategies for fast learning of nonlinear embeddings. In Proc. of the 29th Int. Conf. Machine Learning (ICML 2012), pages 345–352, Edinburgh, Scotland, June 26 – July 1 2012.
- Vladymyrov and Carreira-Perpiñán (2014) M. Vladymyrov and M. Á. Carreira-Perpiñán. Linear-time training of nonlinear low-dimensional embeddings. In Proc. of the 17th Int. Workshop on Artificial Intelligence and Statistics (AISTATS 2014), pages 968–977, Reykjavik, Iceland, Apr. 22–25 2014.
- Wattenberg et al. (2016) M. Wattenberg, F. Viégas, and I. Johnson. How to use -sne effectively. Article at https://distill.pub/2016/misread-tsne/, 2016.
- Yang et al. (2013) Z. Yang, J. Peltonen, and S. Kaski. Scalable optimization for neighbor embedding for visualization. In Proc. of the 230h Int. Conf. Machine Learning (ICML 2013), pages 127–135, Atlanta, GA, 2013.
- Yang et al. (2015) Z. Yang, J. Peltonen, and S. Kaski. Majorization-minimization for manifold embedding. In Proc. of the 18th Int. Workshop on Artificial Intelligence and Statistics (AISTATS 2015), pages 1088–1097, Reykjavik, Iceland, May 10–12 2015.