There is broad interest in learning and exploiting lower-dimensional structure in high-dimensional data. A canonical case is when the low dimensional structure corresponds to a-dimensional smooth Riemannian manifold embedded in the -dimensional ambient space of the observed data . Assuming that the observed data are close to , it becomes of substantial interest to learn along with the mapping from
. This allows better data visualization and for one to exploit the lower-dimensional structure to combat the curse of dimensionality in developing efficient machine learning algorithms for a variety of tasks.
The current literature on manifold learning
focuses on estimating the coordinatescorresponding to by optimization, finding ’s on the manifold that preserve distances between the corresponding ’s in . There are many such methods, including Isomap , locally-linear embedding  and Laplacian eigenmaps . Such methods have seen broad use, but have some clear limitations relative to probabilistic manifold learning approaches, which allow explicit learning of , the mapping and the distribution of .
There has been some considerable focus on probabilistic models, which would seem to allow learning of and . Two notable examples are mixtures of factor analyzers (MFA) [4, 5] and Gaussian process latent variable models (GP-LVM) . Such approaches are useful in exploiting lower-dimensional structure in estimating the distribution of , but unfortunately have critical problems in terms of reliable estimation of the manifold and mapping function. MFA is not smooth in approximating the manifold with a collage of lower dimensional hyper-planes, and hence we focus further discussion on GP-LVM. Similar problems occur for MFA and other probabilistic manifold learning methods.
In general form for the
th data vector, GP-LVM lets, with assigned a Gaussian process prior,
generated from a pre-specified Gaussian or uniform distribution over a-dimensional space, and the residual drawn from a -dimensional Gaussian centered on zero with diagonal or spherical covariance. While this model seems appropriate to manifold learning, identifiability problems lead to extremely poor performance in estimating and . To give an intuition for the root cause of the problem, consider the case in which are drawn independently from a uniform distribution over . The model is so flexible that we could fit the training data , for , just as well if we did not use the entire hypercube but just placed all the values in a small subset of . The uniform prior will not discourage this tendency to not spread out the latent coordinates, which unfortunately has disasterous consequences illustrated in our experiments. The structure of the model is just too flexible, and further constraints are needed. Replacing the uniform with a standard Gaussian does not solve the problem.
To make the problem more tractable, we focus on the case in which is a one-dimensional smooth compact manifold. Assume , with Gaussian noise, and a smooth mapping such that for , where . We focus on finding a good estimate of , and hence the manifold, via a probabilistic learning framework. We refer to this problem as probabilistic curve learning (PCL) motivated by the principal curve literature . PCL differs substantially from the principal curve learning problem, which seeks to estimate a non-linear curve through the data, which may be very different from the true manifold.
Our proposed approach builds on GP-LVM; in particular, our primary innovation is to generate the latent coordinates from a novel repulsive process. There is an interesting literature on repulsive point process modeling ranging from various Matern processes  to the determinantal point process (DPP) . In our very different context, these processes lead to unnecessary complexity — computationally and otherwise — and we propose a new Coulomb repulsive process (Corp) motivated by Coulomb’s law of electrostatic interaction between electrically charged particles. Using Corp for the latent positions has the effect of strongly favoring spread out locations on the manifold, effectively solving the identifiability problem mentioned above for the GP-LVM. We refer to the GP with Corp on the latent positions as an electrostatic GP (electroGP).
The remainder of the paper is organized as follows. The Coulomb repulsive process is proposed in § 2 and the electroGP is presented in § 3 with a comparison between electroGP and GP-LVM demonstrated via simulations. The performance is further evaluated via real world datasets in § 4. A discussion is reported in § 5.
2 Coulomb repulsive process
A univariate process is a Coulomb repulsive process (Corp) if and only if for every finite set of indices in the index set ,
where is the repulsive parameter. The process is denoted as .
The process is named by its analogy in electrostatic physics where by Coulomb law, two electrostatic positive charges will repel each other by a force proportional to the reciprocal of their square distance. Letting
, the above conditional probability ofgiven is proportional to , shrinking the probability exponentially fast as two states get closer to each other. Note that the periodicity of the sine function eliminates the edges of , making the electrostatic energy field homogeneous everywhere on .
Several observations related to Kolmogorov extension theorem can be made immediately, ensuring Corp to be well defined. Firstly, the conditional density defined in (1) is positive and integrable, since ’s are constrained in a compact interval, and is positive and bounded. Hence, the finite distributions are well defined.
Secondly, the joint finite p.d.f. for can be derived as
As can be easily seen, any permutation of will result in the same joint finite distribution, hence this finite distribution is exchangeable.
Thirdly, it can be easily checked that for any finite set of indices ,
by observing that
Assuming , is a realization from Corp, then the following lemmas hold.
For any , any and any , we have
For any , the p.d.f. (2) of (due to the exchangeability, we can assume without loss of generality) is maximized when and only when
According to Lemma 1 and Lemma 2, Corp will nudge the ’s to be spread out within , and penalizes the case when two ’s get too close. Figure 1 presents some simulations from Corp. This nudge becomes stronger as the sample size grows, or as the repulsive parameter grows. The properties of Corp makes it ideal for strongly favoring spread out latent positions across the manifold, avoiding the gaps and clustering in small regions that plague GP-LVM-type methods. The proofs for the lemmas and a simulation algorithm based on rejection sampling can be found in the supplement.
2.3 Multivariate Corp
A -dimensional multivariate process is a Coulomb repulsive process if and only if for every finite set of indices in the index set ,
where the -dimensional spherical coordinates ’s have been converted into the -dimensional Cartesian coordinates :
The multivariate Corp maps the hyper-cubic through a spherical coordinate system to a unit hyper-ball in . The repulsion is then defined as the reciprocal of the square Euclidean distances between these mapped points in . Based on this construction of multivariate Corp, a straightfoward generalization of the electroGP model to a -dimensional manifold could be made, where .
3 Electrostatic Gaussian Process
3.1 Formulation and Model Fitting
In this section, we propose the electrostatic Gaussian process (electroGP) model. Assuming -dimensional data vectors are observed, the model is given by
where for and denotes a Gaussian process prior with covariance function .
denote the model hyperparameters, model (3) could be fitted by maximizing the joint posterior distribution of and ,
where the repulsive parameter is fixed and can be tuned using cross validation. Based on our experience, setting always yields good results, and hence is used as a default across this paper. For the simplicity of notations, is excluded in the remainder. The above optimization problem can be rewritten as
where denotes the log likelihood function and denotes the finite dimensional pdf of Corp. Hence the Corp prior can also be viewed as a repulsive constraint in the optimization problem.
It can be easily checked that , for any and . Starting at initial values , the optimizer will converge to a local solution that maintains the same order as the initial ’s. We refer to this as the self-truncation property. We find that conditionally on the starting order, the optimization algorithm converges rapidly and yields stable results. Although the ’s are not identifiable, since the target function (4) is invariant under rotation, a unique solution does exist conditionally on the specified order.
Self-truncation raises the necessity of finding good initial values, or at least a good initial ordering for ’s. Fortunately, in our experience, simply applying any standard manifold learning algorithm to estimate in a manner that preserves distances in yields good performance. We find very similar results using LLE, Isomap and eigenmap, but focus on LLE in all our implementations. Our algorithm can be summarized as follows.
Learn the one dimensional coordinate by your favorite distance-preserving manifold learning algorithm and rescale into ;
Solve using scaled conjugate gradient descent (SCG);
Using SCG, setting and to be the initial values, solve and w.r.t. (4).
3.2 Posterior Mean Curve and Uncertainty Bands
In this subsection, we describe how to obtain a point estimate of the curve and how to characterize its uncertainty under electroGP. Such point and interval estimation is as of yet unsolved in the literature, and is of critical importance. In particular, it is difficult to interpret a single point estimate without some quantification of how uncertain that estimate is. We use the posterior mean curve as the Bayes optimal estimator under squared error loss. As a curve, has infinite dimensions. Hence, in order to store and visualize it, we discretize to obtain equally-spaced grid points for . Using basic multivariate Gaussian theory, the following expectation is easy to compute.
is approximated by linear interpolation using. For ease of notation, we use to denote this interpolated piecewise linear curve later on. Examples can be found in Figure 2 where all the mean curves (black solid) were obtained using the above method.
Estimating an uncertainty region including data points with
probability is much more challenging. We addressed this problem by the following heuristic algorithm.
Draw ’s from Unif(0,1) independently for ;
Sample the corresponding
from the posterior predictive distribution conditional on these latent coordinates;
Repeat steps 1-2 times, collecting all samples ’s;
Find the shortest distances from these ’s to the posterior mean curve , and find the
-quantile of these distances denoted by;
Moving a radius- ball through the entire curve , the envelope of the moving trace defines the % uncertainty band.
Note that step 4 can be easily solved since is a piecewise linear curve. Examples can be found in Figure 2, where the 95% uncertainty bands (dotted shading) were found using the above algorithm.
In this subsection, we compare the performance of electroGP with GP-LVM and principal curves (P-curve) in simple simulation experiments. 100 data points were sampled from each of the following three 2-dimensional distributions: a Gaussian distribution, a rotated parabola with Gaussian noises and a spiral with Gaussian noises. ElectroGP and GP-LVM were fitted using the same initial values obtained from LLE, and the P-Curve was fitted using theprincurve package in R.
The performance of the three methods is compared in Figure 2. The dotted shading represents a 95% posterior predictive uncertainty band for a new data point under the electroGP model. This illustrates that electroGP obtains an excellent fit to the data, provides a good characterization of uncertainty, and accurately captures the concentration near a 1d manifold embedded in two dimensions. The P-curve is plotted in red. The extremely poor representation of P-curve is as expected based on our experience in fitting principal curve in a wide variety of cases; the behavior is highly unstable. In the first two cases, the P-Curve corresponds to a smooth curve through the center of the data, but for the more complex manifold in the third case, the P-Curve is an extremely poor representation. This tendency to cut across large regions of near zero data density for highly curved manifolds is common for P-Curve.
For GP-LVM, we show three random realizations (dashed) from the posterior in each case. It is clear the results are completely unreliable, with the tendency being to place part of the curve through where the data have high density, while also erratically adding extra outside the range of the data. The GP-LVM model does not appropriately penalize such extra parts, and the very poor performance shown in the top right of Figure 2 is not unusual. We find that electroGP in general performs dramatically better than competitors. More simulation results can be found in the supplement. To better illustrate the results for the spiral case 3, we zoom in and present some further comparisons of GP-LVM and electroGP in Figure 3.
As can be seen the right panel, optimizing ’s without any constraint results in “holes” on . The trajectories of the Gaussian process over these holes will become arbitrary, as illustrated by the three realizations. This arbitrariness will be further projected into the input space , resulting in the erratic curve observed in the left panel. Failing to have well spread out ’s over not only causes trouble in learning the curve, but also makes the posterior predictive distribution of overly diffuse near these holes, e.g., the large gray shading area in the right panel. The middle panel shows that electroGP fills in these holes by softly constraining the latent coordinates ’s to spread out while still allowing the flexibility of moving them around to find a smooth curve snaking through them.
Broad prediction problems can be formulated as the following missing data problem. Assume new data , for , are partially observed and the missing entries are to be filled in. Letting denote the observed data vector and denote the missing part, the conditional distribution of the missing data is given by
where is the corresponding latent coordinate of , for . However, dealing with jointly is intractable due to the high non-linearity of the Gaussian process, which motivates the following approximation,
The approximation assumes to be conditionally independent. This assumption is more accurate if is well spread out on , as is favored by Corp.
The univariate distribution , though still intractable, is much easier to deal with. Depending on the purpose of the application, either a Metropolis Hasting algorithm could be adopted to sample from the predictive distribution, or a optimization method could be used to find the MAP of ’s. The details of both algorithms can be found in the supplement.
Video-inpainting 200 consecutive frames (of size 76 101 with RGB color)  were collected from a video of a teapot rotating 180
. Clearly these images roughly lie on a curve. 190 of the frames were assumed to be fully observed in the natural time order of the video, while the other 10 frames were given without any ordering information. Moreover, half of the pixels of these 10 frames were missing. The electroGP was fitted based on the other 190 images and was used to reconstruct the broken frames and impute the reconstructed frames into the whole frame series with the correct order. The reconstruction results are presented in Figure4. As can be seen, the reconstructed images are almost indistinguishable from the original ones. Note that these 10 frames were also correctly imputed into the video with respect to their latent position ’s. ElectroGP was compared with Bayesian GP-LVM  with the latent dimension set to 1. The reconstruction mean square error (MSE) using electroGP is 70.62, compared to 450.75 using GP-LVM. The comparison is also presented in Figure 4. It can be seen that electroGP outperforms Bayesian GP-LVM in high-resolution precision (e.g., how well they reconstructed the handle of the teapot) since it obtains a much tighter and more precise estimate of the manifold.
Super-resolution & Denoising 100 consecutive frames (of size
with gray color) were collected from a video of a shrinking shockwave. Frame 51 to 55 were assumed completely missing and the other 95 frames were observed with the original time order with strong white noises. The shockwave is homogeneous in all directions from the center; hence, the frames roughly lie on a curve. The electroGP was applied for two tasks: 1. Frame denoising; 2. Improving resolution by interpolating frames in between the existing frames. Note that the second task is hard since there are 5 consecutive frames missing and they can be interpolated only if the electroGP correctly learns the underlying manifold.
The denoising performance was compared with non-local mean filter (NLM)  and isotropic diffusion (IsD) . The interpolation performance was compared with linear interpolation (LI). The comparison is presented in Figure 5. As can be clearly seen, electroGP greatly outperforms other methods since it correctly learned this one-dimensional manifold. To be specific, the denoising MSE using electroGP is only , comparing to using NLM and using IsD. The MSE of reconstructing the entirely missing frame 53 using electroGP is compared to 13 using LI. An online video of the super-resolution result using electroGP can be found in this link111https://youtu.be/N1BG220J5Js This online video contains no information regarding the authors.. The frame per second (fps) of the generated video under electroGP was tripled compared to the original one. Though over two thirds of the frames are pure generations from electroGP, this new video flows quite smoothly. Another noticeable thing is that the 5 missing frames were perfectly regenerated by electroGP.
Manifold learning has dramatic importance in many applications where high-dimensional data are collected with unknown low dimensional manifold structure. While most of the methods focus on finding lower dimensional summaries or characterizing the joint distribution of the data, there is (to our knowledge) no reliable method for probabilistic learning of the manifold. This turns out to be a daunting problem due to major issues with identifiability leading to unstable and generally poor performance for current probabilistic non-linear dimensionality reduction methods. It is not obvious how to incorporate appropriate geometric constraints to ensure identifiability of the manifold without also enforcing overly-restrictive assumptions about its form.
We tackled this problem in the one-dimensional manifold (curve) case and built a novel electrostatic Gaussian process model based on the general framework of GP-LVM by introducing a novel Coulomb repulsive process. Both simulations and real world data experiments showed excellent performance of the proposed model in accurately estimating the manifold while characterizing uncertainty. Indeed, performance gains relative to competitors were dramatic. The proposed electroGP is shown to be applicable to many learning problems including video-inpainting, super-resolution and video-denoising. There are many interesting areas for future study including the development of efficient algorithms for applying the model for multidimensional manifolds, while learning the dimension.
-  J.B. Tenenbaum, V. De Silva, and J.C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, 2000.
-  S.T. Roweis and L.K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326, 2000.
-  M. Belkin and P. Niyogi. Laplacian eigenmaps and spectral techniques for embedding and clustering. In NIPS, volume 14, pages 585–591, 2001.
-  M. Chen, J. Silva, J. Paisley, C. Wang, D.B. Dunson, and L. Carin. Compressive sensing on manifolds using a nonparametric mixture of factor analyzers: Algorithm and performance bounds. Signal Processing, IEEE Transactions on, 58(12):6140–6155, 2010.
-  Y. Wang, A. Canale, and D.B. Dunson. Scalable multiscale density estimation. arXiv preprint arXiv:1410.7692, 2014.
Probabilistic non-linear principal component analysis with gaussian process latent variable models.The Journal of Machine Learning Research, 6:1783–1816, 2005.
-  T. Hastie and W. Stuetzle. Principal curves. Journal of the American Statistical Association, 84(406):502–516, 1989.
-  V. Rao, R.P. Adams, and D.B. Dunson. Bayesian inference for matérn repulsive processes. arXiv preprint arXiv:1308.1136, 2013.
-  J.B. Hough, M. Krishnapur, Y. Peres, et al. Zeros of Gaussian analytic functions and determinantal point processes, volume 51. American Mathematical Soc., 2009.
K.Q. Weinberger and L.K. Saul.
An introduction to nonlinear dimensionality reduction by maximum variance unfolding.In AAAI, volume 6, pages 1683–1686, 2006.
-  M. Titsias and N. Lawrence. Bayesian gaussian process latent variable model. The Journal of Machine Learning Research, 9:844–851, 2010.
-  A. Buades, B. Coll, and J.M. Morel. A non-local algorithm for image denoising. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, volume 2, pages 60–65. IEEE, 2005.
-  P. Perona and J. Malik. Scale-space and edge detection using anisotropic diffusion. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 12(7):629–639, 1990.
Proof of Lemma 1 and Lemma 2
By definition of the Corp, for small enough, we have
where is the normalizing constant. When , the following is true,
When , the following is true,
The proof for is the same as above and hence is neglected here. ∎
The log of the density is given up to a constant by
The first order derivatives are given by
For any satisfying condition , (5) can be rewritten as
Hence (2) satisfies the first order condition. The second order derivatives are given by
Hence the Hessian matrix is positive semi-definite, indicating that (2) is a global maxima. Note also that the Hessian matrix is rank-deficit, indicating that the solution to this maximization problem is not unique. ∎
Sampling from Corp
The sampling method can be easily summarized as,
Sample from Unif(0,1);
Repeatedly sample from until desired sample size reached.
The difficulty arises in step 2 since
is multi-modal and not analytically integrable. Fortunately, sampling from the above univariate distribution can be done by rejection sampling. The only trick here is to find a proper proposal distribution. Naïvely using a uniform would result in very high rejection rate as grows larger.
Assuming without loss of generosity that , it can be easily checked that there is one local mode within each interval of , for . We denote the mode by and the interval by . There is also one mode on . We denote this mode by , and this interval by . Sampling from this conditional distribution can be summarized as,
Sample from Multinomial) where for . These integration is done using numerical method;
Use Unif() as the proposal distribution and calculate using numerical maximization method. Use rejection sampling to sample from the truncated conditional distribution .
Assume new data , for , are partially observed and the missing entries are to be predicted. Letting denote the observed data vector and denote the missing part. We approximate the predictive distribution by assuming that these ’s are conditionally independent. For ease of notation, we focus on discussing the prediction algorithm for one partially observed new data vector .
Sample from the posterior predictive distribution Instead of sampling from , we sample from , which can be factorized into two parts and . The first part is simply a conditional Gaussian distribution and can be easily sampled. We use the Metropolis Hasting algorithm to sample from the intractable second part, using Unif(0,1) as the proposal distribution. Note that Unif(0,1) is a natural choice, since it is the prior distribution of . It can be easily generalized to a piecewise uniform distribution, as what we did in sampling Corp, to decrease the rejection rate.
Find the MAP MCMC can be infeasible in some applications due to its expensive computation. A straightforward solution is to use EM algorithm treating as an augmented variable, which will give us a point estimate of . We propose another heuristic algorithm that would give us instead of point estimate a distribution of . The algorithm is very simple and is summarized as follows,
Find by maximizing ;
Return , which is simply a multivariate Gaussian.