1 Introduction
Many highdimensional datasets comprise points derived from a smooth, lowerdimensional manifold embedded within the highdimensional space of measurements and possibly corrupted by noise. For instance, biological or medical imaging data might reflect the interplay of a small number of latent processes that all affect measurements nonlinearly. Linear multivariate analyses such as principal component analysis (PCA) or multidimensional scaling (MDS) have long been used to estimate such underlying processes, but cannot always reveal lowdimensional structure when the mapping is nonlinear (or, equivalently, the manifold is curved). Thus, there has been substantial recent interest in algorithms to identify nonlinear manifolds in data.
Many moreorless heuristic methods for nonlinear manifold discovery are based on the idea of preserving the geometric properties of local neighbourhoods within the data, while embedding, unfolding or otherwise transforming the data to occupy fewer dimensions. Thus, algorithms such as locallylinear embedding (LLE) and Laplacian eigenmap attempt to preserve local linear relationships or to minimise the distortion of local derivatives
[1, 2]. Others, like Isometric feature mapping (Isomap) or maximum variance unfolding (MVU) preserve local distances, estimating global manifold properties by continuation across neighbourhoods before embedding to lower dimensions by classical methods such as PCA or MDS
[3]. While generally hewing to this same intuitive path, the range of available algorithms has grown very substantially in recent years [4, 5].However, these approaches do not define distributions over the data or over the manifold properties. Thus, they provide no measures of uncertainty on manifold structure or on the lowdimensional locations of the embedded points; they cannot be combined with a structured probabilistic model within the manifold to define a full likelihood relative to the highdimensional observations; and they provide only heuristic methods to evaluate the manifold dimensionality. As others have pointed out, they also make it difficult to extend the manifold definition to outofsample points in a principled way [6].
An established alternative is to construct an explicit probabilistic model of the functional relationship between lowdimensional manifold coordinates and each measured dimension of the data, assuming that the functions instantiate draws from Gaussianprocess priors. The original Gaussian process latent variable model (GPLVM) required optimisation of the lowdimensional coordinates, and thus still did not provide uncertainties on these locations or allow evaluation of the likelihood of a model over them [7]; however a recent extension exploits an auxiliary variable approach to optimise a more general variational bound, thus retaining approximate probabilistic semantics within the latent space [8]. The stochastic process model for the mapping functions also makes it straightforward to estimate the function at previously unobserved points, thus generalising outofsample with ease. However, the GPLVM gives up on the intuitive preservation of local neighbourhood properties that underpin the nonprobabilistic methods reviewed above. Instead, the expected smoothness or other structure of the manifold must be defined by the Gaussian process covariance function, chosen a priori.
Here, we introduce a new probabilistic model over highdimensional observations, lowdimensional embedded locations and locallylinear mappings between high and lowdimensional linear maps within each neighbourhood, such that each group of variables is Gaussian distributed given the other two. This
locally linear latent variable model (LLLVM) thus respects the same intuitions as the common nonprobabilistic manifold discovery algorithms, while still defining a fullfledged probabilistic model. Indeed, variational inference in this model follows more directly and with fewer separate bounding operations than the sparse auxiliaryvariable approach used with the GPLVM. Thus, uncertainty in the lowdimensional coordinates and in the manifold shape (defined by the local maps) is captured naturally. A lower bound on the marginal likelihood of the model makes it possible to select between different latent dimensionalities and, perhaps most crucially, between different definitions of neighbourhood, thus addressing an important unsolved issue with neighbourhooddefined algorithms. Unlike existing probabilistic frameworks with locally linear models such as mixtures of factor analysers (MFA)based and local tangent space analysis (LTSA)based methods [9, 10, 11], LLLVM does not require an additional step to obtain the globally consistent alignment of lowdimensional local coordinates.^{1}^{1}1This is also true of one previous MFAbased method [12] which finds model parameters and global coordinates by variational methods similar to our own.This paper is organised as follows. In section 2, we introduce our generative model, LLLVM, for which we derive the variational inference method in section 3. We briefly describe outofsample extension for LLLVM and mathematically describe the dissimilarity between LLLVM and GPLVM at the end of section 3. In section 4, we demonstrate the approach on several real world problems.
Notation:
In the following, a diagonal matrix with entries taken from the vector
is written . The vector of ones is and the identity matrix is . The Euclidean norm of a vector is , the Frobenius norm of a matrix is . The Kronecker delta is denoted by ( if , and otherwise). The Kronecker product of matrices and is . For a random vector, we denote the normalisation constant in its probability density function by
. The expectation of a random vector with respect to a density is .2 The model: LLLVM
Suppose we have data points , and a graph on nodes with edge set
. We assume that there is a lowdimensional (latent) representation of the highdimensional data, with coordinates
, . It will be helpful to concatenate the vectors to form and .Our key assumption is that the mapping between highdimensional data and lowdimensional coordinates is locally linear (Fig. 1). The tangent spaces are approximated by and , the pairwise differences between the th point and neighbouring points . The matrix at the th point linearly maps those tangent spaces as
(1) 
Under this assumption, we aim to find the distribution over the linear maps and the latent variables that best describe the data likelihood given the graph :
(2) 
The joint distribution can be written in terms of priors on and the likelihood of as
(3) 
In the following, we highlight the essential components the Locally Linear Latent Variable Model (LLLVM). Detailed derivations are given in the Appendix.
Adjacency matrix and Laplacian matrix
The edge set of for data points specifies a symmetric adjacency matrix . We write for the th element of , which is 1 if and are neighbours and 0 if not (including on the diagonal). The graph Laplacian matrix is then .
Prior on
We assume that the latent variables are zerocentered with a bounded expected scale, and that latent variables corresponding to neighbouring highdimensional points are close (in Euclidean distance). Formally, the log prior on the coordinates is then
where the parameter controls the expected scale (
). This prior can be written as multivariate normal distribution on the concatenated
:Prior on
We assume that the linear maps corresponding to neighbouring points are similar in terms of Frobenius norm (thus favouring a smooth manifold of low curvature). This gives
(4) 
where . The second line corresponds to the matrix normal density, giving as the prior on . In our implementation, we fix to a small value^{2}^{2}2 sets the scale of the average linear map, ensuring the prior precision matrix is invertible., since the magnitude of the product is determined by optimising the hyperparameter above.
Likelihood
Under the locallinearity assumption, we penalise the approximation error of Eq. (1), which yields the log likelihood
(5) 
where and .^{3}^{3}3The term centers the data and ensures the distribution can be normalised. It applies in a subspace orthogonal to that modelled by and and so its value does not affect the resulting manifold model. Thus, is drawn from a multivariate normal distribution given by
with , , and ; . For computational simplicity, we assume . The graphical representation of the generative process underlying the LLLVM is given in Fig. 2.
3 Variational inference
Our goal is to infer the latent variables () as well as the parameters in LLLVM. We infer them by maximising the lower bound of the marginal likelihood of the observations
(6) 
Following the common treatment for computational tractability, we assume the posterior over () factorises as [13]. We maximise the lower bound w.r.t. and
by the variational expectation maximization algorithm
[14], which consists of (1) the variational expectation step for computing by(7)  
(8) 
then (2) the maximization step for estimating by .
VariationalE step
Computing from Eq. (7) requires rewriting the likelihood in Eq. (5) as a quadratic function in
where the normaliser has all the terms that do not depend on from Eq. (5). Let . The matrix is given by where the th block is and each th block of is given by The vector is defined as with the component dimensional vectors given by The likelihood combined with the prior on gives us the Gaussian posterior over (i.e., solving Eq. (7))
(9) 
Similarly, computing from Eq. (8) requires rewriting the likelihood in Eq. (5) as a quadratic function in
(10) 
where the normaliser has all the terms that do not depend on from Eq. (5), and . The matrix where the th subvector of the th column is . We define whose th block is .
The likelihood combined with the prior on gives us the Gaussian posterior over (i.e., solving Eq. (8))
(11) 
The expected values of and are given in the Appendix.
VariationalM step
We set the parameters by maximising w.r.t. which is split into two terms based on dependence on each parameter: (1) expected loglikelihood for updating by ; and (2) negative KL divergence between the prior and the posterior on for updating by
. The update rules for each hyperparameter are given in the Appendix.
The full EM algorithm^{4}^{4}4An implementation is available from http://www.gatsby.ucl.ac.uk/resources/lllvm. starts with an initial value of . In the Estep, given , compute as in Eq. (9). Likewise, given , compute as in Eq. (11). The parameters are updated in the Mstep by maximising Eq. (6). The two steps are repeated until the variational lower bound in Eq. (6) saturates. To give a sense of how the algorithm works, we visualise fitting results for a simulated example in Fig. 3. Using the graph constructed from D observations given different , we run our EM algorithm. The posterior means of and given the optimal chosen by the maximum lower bound resemble the true manifolds in 2D and 3D spaces, respectively.
Outofsample extension
In the LLLVM model one can formulate a computationally efficient outofsample extension technique as follows. Given data points denoted by , the variational EM algorithm derived in the previous section converts into the posterior : . Now, given a new highdimensional data point , one can first find the neighbourhood of without changing the current neighbourhood graph. Then, it is possible to compute the distributions over the corresponding locally linear map and latent variable via simply performing the Estep given (freezing all other quantities the same) as .
Comparison to GPLVM
A closely related probabilistic dimensionality reduction algorithm to LLLVM is GPLVM [7]. GPLVM defines the mapping from the latent space to data space using Gaussian processes. The likelihood of the observations ( is the vector formed by the th element of all high dimensional vectors) given latent variables is defined by , where the th element of the covariance matrix is of the exponentiated quadratic form: with smoothnessscale parameters [8]. In LLLVM, once we integrate out from Eq. (5), we also obtain the Gaussian likelihood given ,
In contrast to GPLVM, the precision matrix depends on the graph Laplacian matrix through and . Therefore, in LLLVM, the graph structure directly determines the functional form of the conditional precision.
4 Experiments
4.1 Mitigating the shortcircuit problem
Like other neighbourbased methods, LLLVM is sensitive to misspecified neighbourhoods; the prior, likelihood, and posterior all depend on the assumed graph. Unlike other methods, LLLVM provides a natural way to evaluate possible shortcircuits using the variational lower bound of Eq. (6). Fig. 4 shows samples drawn from a Swiss Roll in 3D space (Fig. 4A). Two points, labelled 28 and 29, happen to fall close to each other in 3D, but are actually far apart on the latent (2D) surface (Fig. 4B). A knearestneighbour graph might link these, distorting the recovered coordinates. However, evaluating the model without this edge (the correct graph) yields a higher variational bound (Fig. 4C). Although it is prohibitive to evaluate every possible graph in this way, the availability of a principled criterion to test specific hypotheses is of obvious value.
In the following, we demonstrate LLLVM on two real datasets: handwritten digits and climate data.
4.2 Modelling USPS handwritten digits
As a first realdata example, we test our method on a subset of samples each of the digits from the USPS digit dataset, where each digit is of size (i.e., , . We follow [7], and represent the lowdimensional latent variables in 2D.
Fig. 5A shows variational lower bounds for different values of , using different EM initialisations. The posterior mean of obtained from LLLVM using the best is illustrated in Fig. 5B. Fig. 5B also shows reconstructions of one randomlyselected example of each digit, using its 2D coordinates as well as the posterior mean coordinates , tangent spaces and actual images of its closest neighbours. The reconstruction is based on the assumed tangentspace structure of the generative model (Eq. (5)), that is: . A similar process could be used to reconstruct digits at outofsample locations. Finally, we quantify the relevance of the recovered subspace by computing the error incurred using a simple classifier to report digit identity using the 2D features obtained by LLLVM and various competing methods (Fig. 5CF). Classification with LLLVM coordinates performs similarly to GPLVM and ISOMAP (), and outperforms LLE ().
4.3 Mapping climate data
In this experiment, we attempted to recover 2D geographical relationships between weather stations from recorded monthly precipitation patterns. Data were obtained by averaging monthbymonth annual precipitation records from 2005–2014 at 400 weather stations scattered across the US (see Fig. 6) ^{5}^{5}5The dataset is made available by the National Climatic Data Center at http://www.ncdc.noaa.gov/oa/climate/research/ushcn/. We use version 2.5 monthly data [15].. Thus, the data set comprised 400 12dimensional vectors. The goal of the experiment is to recover the twodimensional topology of the weather stations (as given by their latitude and longitude) using only these 12dimensional climatic measurements. As before, we compare the projected points obtained by LLLVM with several widely used dimensionality reduction techniques. For the graphbased methods LLLVM, LTSA, ISOMAP, and LLE, we used 12NN with Euclidean distance to construct the neighbourhood graph.
The results are presented in Fig. 6. LLLVM identified a more geographicallyaccurate arrangement for the weather stations than the other algorithms. The fully probabilistic nature of LLLVM and GPLVM allowed these algorithms to handle the noise present in the measurements in a principled way. This contrasts with ISOMAP which can be topologically unstable [16] i.e. vulnerable to shortcircuit errors if the neighbourhood is too large. Perhaps coincidentally, LLLVM also seems to respect local geography more fully in places than does GPLVM.
5 Conclusion
We have demonstrated a new probabilistic approach to nonlinear manifold discovery that embodies the central notion that local geometries are mapped linearly between manifold coordinates and highdimensional observations. The approach offers a natural variational algorithm for learning, quantifies local uncertainty in the manifold, and permits evaluation of hypothetical neighbourhood relationships.
In the present study, we have described the LLLVM model conditioned on a neighbourhood graph. In principle, it is also possible to extend LLLVM so as to construct a distance matrix as in [17], by maximising the data likelihood. We leave this as a direction for future work.
Acknowledgments
The authors were funded by the Gatsby Charitable Foundation.
References
 [1] S. T. Roweis and L. K. Saul. Nonlinear Dimensionality Reduction by Locally Linear Embedding. Science, 290(5500):2323–2326, 2000.
 [2] M. Belkin and P. Niyogi. Laplacian eigenmaps and spectral techniques for embedding and clustering. In NIPS, pages 585–591, 2002.
 [3] J. B. Tenenbaum, V. Silva, and J. C. Langford. A Global Geometric Framework for Nonlinear Dimensionality Reduction. Science, 290(5500):2319–2323, 2000.
 [4] L.J.P. van der Maaten, E. O. Postma, and H. J. van den Herik. Dimensionality reduction: A comparative review, 2008. http://www.iai.unibonn.de/~jz/dimensionality_reduction_a_comparative_review.pdf.
 [5] L. Cayton. Algorithms for manifold learning. Univ. of California at San Diego Tech. Rep, pages 1–17, 2005. http://www.lcayton.com/resexam.pdf.

[6]
J. Platt.
Fastmap, metricmap, and landmark MDS are all Nyström
algorithms.
In
Proceedings of 10th International Workshop on Artificial Intelligence and Statistics
, pages 261–268, 2005.  [7] N. Lawrence. Gaussian process latent variable models for visualisation of high dimensional data. In NIPS, pages 329–336, 2003.
 [8] M. K. Titsias and N. D. Lawrence. Bayesian Gaussian process latent variable model. In AISTATS, pages 844–851, 2010.
 [9] S. Roweis, L. Saul, and G. Hinton. Global coordination of local linear models. In NIPS, pages 889–896, 2002.
 [10] M. Brand. Charting a manifold. In NIPS, pages 961–968, 2003.
 [11] Y. Zhan and J. Yin. Robust local tangent space alignment. In NIPS, pages 293–301. 2009.
 [12] J. Verbeek. Learning nonlinear image manifolds by global alignment of local linear models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(8):1236–1250, 2006.
 [13] C. Bishop. Pattern recognition and machine learning. Springer New York, 2006.

[14]
M. J. Beal.
Variational Algorithms for Approximate Bayesian Inference
. PhD thesis, Gatsby Unit, University College London, 2003.  [15] M. Menne, C. Williams, and R. Vose. The U.S. historical climatology network monthly temperature data, version 2.5. Bulletin of the American Meteorological Society, 90(7):993–1007, July 2009.
 [16] Mukund Balasubramanian and Eric L. Schwartz. The isomap algorithm and topological stability. Science, 295(5552):7–7, January 2002.
 [17] N. Lawrence. Spectral dimensionality reduction via maximum entropy. In AISTATS, pages 51–59, 2011.
 [18] K. B. Petersen and M. S. Pedersen. The matrix cookbook, nov 2012. Version 20121115.
LLLVM supplementary material
Notation
The vectorized version of a matrix is . We denote an identity matrix of size with . Other notations are the same as used in the main text.
Appendix A Matrix normal distribution
The matrix normal distribution generalises the standard multivariate normal distribution to matrixvalued variables. A matrix is said to follow a matrix normal distribution with parameters and if its density is given by
(12) 
If , then , a relationship we will use to simplify many expressions.
Appendix B Matrix normal expressions of priors and likelihood
Recall that .
Prior on low dimensional latent variables
(13)  
(14) 
where
is known as a graph Laplacian. It follows that . The prior covariance can be rewritten as
(15)  
(16)  
(17) 
By the relationship of a matrix normal and multivariate normal distributions described in section A, the equivalent prior for the matrix , constructed by reshaping , is given by
(18) 
Prior on locally linear maps
Recall that where each . We formulate the log prior on as
(19) 
In the first line, the first term imposes a constraint that the mean of should not be too large. The second term encourages the the locally linear maps of neighbouring points and to be similar in the sense of the Frobenius norm. Notice that the last line is in the form of a the log of a matrix normal density with mean 0 where is given by
(20) 
The expression is equivalent to
(21) 
In our implementation, we fix to a small value, since the magnitude of and can be controlled by the hyperparameter , which is optimized in the Mstep.
Likelihood
We penalise linear approximation error of the tangent spaces. Assume that the noise precision matrix is a scaled identify matrix ei.g., .
(22)  
(23) 
where
(24)  
(25)  
(26)  
(27)  
(28) 
By completing the quadratic form in , we want to write down the likelihood as a multivariate Gaussian ^{6}^{6}6The equivalent expression in term of matrix normal distribution for
(29)  
(30) 
By equating Eq. (22) with Eq. (29), we get the normalisation term
(31)  
(32)  
(33)  
(34) 
Therefore, the normalised loglikelihood can be written as
(35) 
Convenient form for EM
For the EM derivation in the next section, it is convenient to write the exponent term in terms of linear and quadratic functions in and , respectively. The linear terms appear in , which we write as a linear function in or
(36)  
(37) 
where
(38)  
(39) 
The quadratic terms appear in , which we write as a quadratic function of or a quadratic function of
(40)  
(41) 
where the th chunk of is given by
(42) 
The matrix and and the th column of this matrix is denoted by . The th chunk (of length ) of the th column is given by
(43) 
Appendix C Variational inference
In LLLVM, the goal is to infer the latent variables () as well as to learn the hyperparameters . We infer them by maximising the lower bound of the marginal likelihood of the observations .
For computational tractability, we assume that the posterior over () factorizes as
(44) 
where and are multivariate normal distributions.
We maximize the lower bound w.r.t.
Comments
There are no comments yet.