A Locally Adaptive Normal Distribution

by   Georgios Arvanitidis, et al.

The multivariate normal density is a monotonic function of the distance to the mean, and its ellipsoidal shape is due to the underlying Euclidean metric. We suggest to replace this metric with a locally adaptive, smoothly changing (Riemannian) metric that favors regions of high local density. The resulting locally adaptive normal distribution (LAND) is a generalization of the normal distribution to the "manifold" setting, where data is assumed to lie near a potentially low-dimensional manifold embedded in R^D. The LAND is parametric, depending only on a mean and a covariance, and is the maximum entropy distribution under the given metric. The underlying metric is, however, non-parametric. We develop a maximum likelihood algorithm to infer the distribution parameters that relies on a combination of gradient descent and Monte Carlo integration. We further extend the LAND to mixture models, and provide the corresponding EM algorithm. We demonstrate the efficiency of the LAND to fit non-trivial probability distributions over both synthetic data, and EEG measurements of human sleep.



page 17


Bridge Simulation and Metric Estimation on Landmark Manifolds

We present an inference algorithm and connected Monte Carlo based estima...

The McDonald Normal Distribution

A five-parameter distribution called the McDonald normal distribution is...

Solution manifold and Its Statistical Applications

A solution manifold is the collection of points in a d-dimensional space...

Lagrangian Manifold Monte Carlo on Monge Patches

The efficiency of Markov Chain Monte Carlo (MCMC) depends on how the und...

Modelling wind speed with a univariate probability distribution depending on two baseline functions

Characterizing the wind speed distribution properly is essential for the...

Posterior Integration on a Riemannian Manifold

The geodesic Markov chain Monte Carlo method and its variants enable com...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

The multivariate normal distribution is a fundamental building block in many machine learning algorithms, and its well-known density can compactly be written as


where denotes the Mahalanobis distance for covariance matrix . This distance measure corresponds to the length of the straight line connecting and , and consequently the normal distribution is often used to model linear phenomena. When data lies near a nonlinear manifold embedded in the normal distribution becomes inadequate due to its linear metric. We investigate if a useful distribution can be constructed by replacing the linear distance function with a nonlinear counterpart. This is similar in spirit to Isomap Tenenbaum et al. [2000] that famously replace the linear distance with a geodesic distance measured over a neighborhood graph spanned by the data, thereby allowing for a nonlinear model. This is, however, a discrete distance measure that is only well-defined over the training data. For a generative model, we need a continuously defined metric over the entire .

Following Hauberg et al. [2012] we learn a smoothly changing metric that favors regions of high density i.e., geodesics tend to move near the data. Under this metric, the data space is interpreted as a -dimensional Riemannian manifold. This “manifold learning” does not change dimensionality, but merely provides a local description of the data. The Riemannian view-point, however, gives a strong mathematical foundation upon which the proposed distribution can be developed. Our work, thus, bridges work on statistics on Riemannian manifolds Pennec [2006], Zhang and Fletcher [2013] with manifold learning Tenenbaum et al. [2000].

Figure 1: Illustration of the LAND using MNIST images of the digit 1 projected onto the first 2 principal components. Left: comparison of the geodesic and the linear distance. Center: the proposed locally adaptive normal distribution. Right: the Euclidean normal distribution.

We develop a locally adaptive normal distribution (LAND) as follows: First, we construct a metric that captures the nonlinear structure of the data and enables us to compute geodesics; from this, an

unnormalized density is trivially defined. Second, we propose a scalable Monte Carlo integration scheme for normalizing the density with respect to the measure induced by the metric. Third, we develop a gradient-based algorithm for maximum likelihood estimation on the learned manifold. We further consider a mixture of LANDs and provide the corresponding EM algorithm. The usefulness of the model is verified on both synthetic data and EEG measurements of human sleep stages.

Notation: all points

are considered as column vectors, and they are denoted with bold lowercase characters.

represents the set of symmetric positive definite matrices. The learned Riemannian manifold is denoted , and its tangent space at is denoted .

2 A Brief Summary of Riemannian Geometry

We start our exposition with a brief review of Riemannian manifolds do Carmo [1992]. These smooth manifolds are naturally equipped with a distance measure, and are commonly used to model physical phenomena such as dynamical or periodic systems, and many problems that have a smooth behavior.

Definition 1.

A smooth manifold together with a Riemannian metric is called a Riemannian manifold. The Riemannian metric encodes a smoothly changing inner product on the tangent space of each point .

Remark 1.

The Riemannian metric acts on tangent vectors, and may, thus, be interpreted as a standard Mahalanobis metric restricted to an infinitesimal region around .

The local inner product based on is a suitable model for capturing local behavior of data, i.e. manifold learning. From the inner product, we can define geodesics as length-minimizing curves connecting two points , i.e.



is the metric tensor at

, and the tangent vector denotes the derivative (velocity) of . The distance between and is defined as the length of the geodesic. A standard result from differential geometry is that the geodesic can be found as the solution to a system of

order ordinary differential equations (ODEs)

do Carmo [1992], Hauberg et al. [2012]:

Figure 2: An illustration of the exponential and logarithmic maps.

subject to . Here stacks the columns of a matrix into a vector and is the Kronecker product.

This differential equation allows us to define basic operations on the manifold. The exponential map at a point takes a tangent vector to such that the curve is a geodesic originating at with initial velocity and length . The inverse mapping, which takes to is known as the logarithm map and is denoted . By definition corresponds to the geodesic distance from to . These operations are illustrated in Fig. 2. The exponential and the logarithmic map can be computed by solving Eq. 3 numerically, as an initial value problem (IVP) or a boundary value problem (BVP) respectively. In practice the IVPs are substantially faster to compute than the BVPs.

The Mahalanobis distance is naturally extended to Riemannian manifolds as . From this, Pennec [2006] considered the Riemannian normal distribution


and showed that it is the manifold-valued distribution with maximum entropy subject to a known mean and covariance. This distribution is an instance of Eq. 1 and is the distribution we consider in this paper. Next, we consider standard “intrinsic least squares” estimates of and .

2.1 Intrinsic Least Squares Estimators

Let the data be generated from an unknown probability distribution on a manifold. Then it is common Pennec [2006] to define the intrinsic mean

of the distribution as the point that minimize the variance


where is the measure (or infinitesimal volume element) induced by the metric. Based on the mean, a covariance matrix can be defined


where is the domain over which is well-defined. For the manifolds we consider, the domain is . Practical estimators of rely on gradient-based optimization to find a local minimizer of Eq. 5, which is well-defined Karcher [1977]. For finite data , the descent direction is proportional to , and the updated mean is a point on the geodesic curve . After estimating the mean, the empirical covariance matrix is estimated as . It is worth noting that even though these estimators are natural, they are not maximum likelihood estimates for the Riemannian normal distribution (4).

In practice, the intrinsic mean often falls in regions of low data density Hauberg [2016]. For instance, consider data distributed uniformly on the equator of a sphere, then the optima of Eq. 5 is either of the poles. Consequently, the empirical covariance is often overestimated.

3 A Locally Adaptive Normal Distribution

We now have the tools to define a locally adaptive normal distribution (LAND): we replace the linear Euclidean distance with a locally adaptive Riemannian distance and study the corresponding Riemannian normal distribution (4). By learning a Riemannian manifold and using its structure to estimate distributions of the data, we provide a new and useful link between Riemannian statistics and manifold learning.

3.1 Constructing a Metric

In the context of manifold learning, Hauberg et al. [2012] suggest to model the local behavior of the data manifold via a locally-defined Riemannian metric. Here we propose to use a local covariance matrix to represent the local structure of the data. We only consider diagonal covariances for computational efficiency and to prevent the overfitting. The locality of the covariance is defined via an isotropic Gaussian kernel of size . Thus, the metric tensor at is defined as the inverse of a local diagonal covariance matrix with entries


Here is the dimension of the observation, and a regularization parameter to avoid singular covariances. This defines a smoothly changing (hence Riemannian) metric that captures the local structure of the data. It is easy to see that if

is outside of the support of the data, then the metric tensor is large. Thus, geodesics are “pulled” towards the data where the metric is small. Note that the proposed metric is not invariant to linear transformations.While we restrict our attention to this particular choice, other learned metrics are equally applicable, c.f.

Tosi et al. [2014], Hauberg et al. [2012].

3.2 Estimating the Normalization Constant

The normalization constant of Eq. 4 is by definition


where denotes the measure induced by the Riemannian metric. The constant depends not only on the covariance matrix, but also on the mean of the distribution, and the curvature of the manifold (captured by the logarithm map). For a general learned manifold, is inaccessible in closed-form and we resort to numerical techniques. We start by rewriting Eq. 8 as


In effect, we integrate the distribution over the tangent space instead of directly over the manifold. This transformation relies on the fact that the volume of an infinitely small area on the manifold can be computed in the tangent space if we take the deformation of the metric into account Pennec [2006]. This deformation is captured by the measure which, in the tangent space, is . For notational simplicity we define the function , which intuitively captures the cost for a point to be outside the data support ( is large in low density areas and small where the density is high).

Figure 3: Comparison of LAND and intrinsic least squares means.

We estimate the normalization constant (9) using Monte Carlo integration. We first multiply and divide the integral with the normalization constant of the Euclidean normal distribution . Then, the integral becomes an expectation estimation problem , which can be estimated numerically as


and is the number of samples on . The computationally expensive element is to evaluate , which in turn requires evaluating . This amounts to solving an IVP numerically, which is fairly fast. Had we performed the integration directly on the manifold (8) we would have had to evaluate the logarithm map, which is a much more expensive BVP. The tangent space integration, thus, scales better.

3.3 Inferring Parameters

Assuming an independent and identically distributed dataset

, we can write their joint distribution as

. We find parameters and by maximum likelihood, which we implement by minimizing the mean negative log-likelihood


The first term of the objective function is a data-fitting term, while the second can be seen as a force that both pulls the mean closer to the high density areas and shrinks the covariance. Specifically, when the mean is in low density areas, as well as when the covariance gives significant

0:  the data , stepsizes
0:  the estimated
1:  initialize and
2:  repeat
3:     estimate using Eq. 10
4:     compute using Eq. 12
6:     estimate using Eq. 9
7:     compute using Eq. 46
11:  until 
Algorithm 1 LAND maximum likelihood

probability to those areas, the value of will by construction be large. Consequently, will increase and these solutions will be penalized. In practice, we find that the maximum likelihood LAND mean generally avoids low density regions, which is in contrast to the standard intrinsic least squares mean (5), see Fig. 3.

In practice we optimize using block coordinate descent: we optimize the mean keeping the covariance fixed and vice versa. Unfortunately, both of the sub-problems are non-convex, and unlike the linear normal distribution, they lack a closed-form solution. Since the logarithm map is a differentiable function, we can use gradient-based techniques to infer and . Below we give the descent direction for and and the corresponding optimization scheme is given in Algorithm 1. Initialization is discussed in Appx. E.2.

Optimizing : the objective function is differentiable with respect to do Carmo [1992], and using that , we get the gradient


It is easy to see that this gradient is highly dependent on the condition number of . We find that this, at times, makes the gradient unstable, and choose to use the steepest descent direction instead of the gradient direction. This is equal to (see Appx. B).

Optimizing : since the covariance matrix by definition is constrained to be in the space , a common trick is to decompose the matrix as , and optimize the objective with respect to . The gradient of this factor is (see Appx. C for a derivation)


Here the first term fits the given data by increasing the size of the covariance matrix, while the second term regularizes the covariance towards a small matrix.

3.4 Mixture of LANDs

At this point we can find maximum likelihood estimates of the LAND model. We can easily extend this to mixtures of LANDs: Following the derivation of the standard Gaussian mixture model

Bishop [2006], our objective function for inferring the parameters of the LAND mixture model is formulated as follows


where , is the probability that is generated by the component, and . The corresponding EM algorithm is in Appx. E.

4 Experiments

In this section we present both synthetic and real experiments to demonstrate the advantages of the LAND. We compare our model with both the Gaussian mixture model (GMM), and a mixture of LANDs using least squares (LS) estimators (5, 6). Since the latter are not maximum likelihood estimates we use a Riemannian -means algorithm to find cluster centers. In all experiments we use samples in the Monte Carlo integration. This choice is investigated empirically in the supplements. Furthermore, we choose as small as possible, while ensuring that the manifold is smooth enough that geodesics can be computed numerically.

4.1 Synthetic Data Experiments

Figure 4: The mean negative log-likelihood experiment.

As a first experiment, we generate a nonlinear data-manifold by sampling from a mixture of 20 Gaussians positioned along a half-ellipsoidal curve (see left panel of Fig. 5). We generate 10 datasets with 300 points each, and fit for each dataset the three models with number of components. Then, we generate 10000 samples from each fitted model, and we compute the mean negative log-likelihood of the true generative distribution using these samples. Fig. 4 shows that the LAND learns faster the underlying true distribution, than the GMM. Moreover, the LAND perform better than the least squares estimators, which overestimates the covariance. In the supplements we show, using the standard AIC and BIC criteria, that the optimal LAND is achieved for , while for the least squares estimators and the GMM, the optimal is achieved for and respectively.

In addition, in Fig. 5 we show the contours for the LAND and the GMM for . There, we can observe that indeed, the LAND adapts locally to the data and reveals their underlying nonlinear structure. This is particularly evident near the “boundaries” of the data-manifold.

Figure 5: Synthetic data and the fitted models. Left: the given data, the intensity of the geodesics represent the responsibility of the point to the corresponding cluster. Center: the contours of the LAND mixture model. Right: the contours of the Gaussian mixture model.

We extend this experiment to a clustering task (see left panel of Fig. 11 for data). The center and right panels of Fig. 11 show the contours of the LAND and Gaussian mixtures, and it is evident that the LAND is substantially better at capturing non-ellipsoidal clusters. Due to space limitations, we move further illustrative experiments to Appx. F and continue with real data.

Figure 6: The clustering problem for two synthetic datasets. Left: the given data, the intensity of the geodesics represent the responsibility of the point to the corresponding cluster. Center: the LAND mixture model. Right: the Gaussian mixture model.

4.2 Modeling Sleep Stages

We consider electro-encephalography (EEG) measurements of human sleep from 10 subjects, part of the PhysioNet database Imtiaz and Rodriguez-Villegas [2015], Goldberger et al. [2000 (June 13], Delorme and Makeig [2004]. For each subject we get EEG measurements during sleep from two electrodes on the front and the back of the head, respectively. Measurements are sampled at Hz, and for each 30 second window a so-called sleep stage label is assigned from the set . Rapid eye movement (REM) sleep is particularly interesting, characterized by having EEG patterns similar to the awake state but with a complex physiological pattern, involving e.g., reduced muscle tone, rolling eye movements and erection Purves et al. . Recent evidence points to the importance of REM sleep for memory consolidation Boyce et al. [2016]. Periods in which the sleeper is awake are typically happening in or near REM intervals. Thus we here consider the characterization of sleep in terms of three categories REM, awake, and non-REM, the latter a merger of sleep stages .

We extract features from EEG measurements as follows: for each subject we subdivide the 30 second windows to 10 seconds, and apply a short-time-Fourier-transform to the EEG signal of the frontal electrode with

overlapping windows. From this we compute the log magnitude of the spectrum of each window. The resulting data matrix is decomposed using Non-Negative Matrix Factorization (10 random starts) into five factors, and we use the coefficients as 5 features. In Fig. 7 we illustrate the nonlinear manifold structure based on a three factor analysis.

Figure 7: The 3 leading factors for subject “s151”.

We perform clustering on the data and evaluate the alignment between cluster labels and sleep stages using the F-measure Marxer et al. [2008]. The LAND depends on the parameter to construct the metric tensor, and in this experiment it is less straightforward to select because of significant intersubject variability. First, we fixed for all the subjects. From the results in Table 1 we observe that for the LAND(1) generally outperforms the GMM and achieves much better alignment. To further illustrate the effect of we fitted a LAND for and present the best result achieved by the LAND. Selecting this way leads indeed to higher degrees of alignment further underlining that the conspicuous manifold structure and the rather compact sleep stage distributions in Fig. 7 are both captured better with the LAND representation than with a linear GMM.

s001 s011 s042 s062 s081 s141 s151 s161 s162 s191
Table 1: The F-measure result for 10 subjects (the closer to 1 the better).

5 Related Work

We are not the first to consider Riemannian normal distributions, e.g. Pennec [2006] gives a theoretical analysis of the distribution, and Zhang and Fletcher [2013] consider the Riemannian counterpart of probabilistic PCA. Both consider the scenario where the manifold is known a priori. We adapt the distribution to the “manifold learning” setting by constructing a Riemannian metric that adapts to the data. This is our overarching contribution.

Traditionally, manifold learning is seen as an embedding problem where a low-dimensional representation of the data is sought. This is useful for visualization [Tenenbaum et al., 2000, Roweis and Saul, 2000, Schölkopf et al., 1999, Belkin and Niyogi, 2003], clustering Luxburg [2007]

, semi-supervised learning

Belkin et al. [2006] and more. However, in embedding approaches, the relation between a new point and the embedded points are less well-defined, and consequently these approaches are less suited for building generative models. In contrast, the Riemannian approach gives the ability to measure continuous geodesics that follow the structure of the data. This makes the learned Riemannian manifold a suitable space for a generative model.

Simo-Serra et al. [2014] consider mixtures of Riemannian normal distributions on manifolds that are known a priori. Structurally, their EM algorithm is similar to ours, but they do not account for the normalization constants for different mixture components. Consequently, their approach is inconsistent with the probabilistic formulation. Straub et al. [2015] consider data on spherical manifolds, and further consider a Dirichlet process prior for determining the number of components. Such a prior could also be incorporated in our model. The key difference to our work is that we consider learned manifolds as well as the following complications.

6 Discussion

In this paper we have introduced a parametric locally adaptive normal distribution. The idea is to replace the Euclidean distance in the ordinary normal distribution with a locally adaptive nonlinear distance measure. In principle, we learn a non-parametric metric space, by constructing a smoothly changing metric that induces a Riemannian manifold, where we build our model. As such, we propose a parametric model over a non-parametric space.

The non-parametric space is constructed using a local metric that is the inverse of a local covariance matrix. Here locality is defined via a Gaussian kernel, such that the manifold learning can be seen as a form of kernel smoothing. This indicates that our scheme for learning a manifold might not scale to high-dimensional input spaces. In these cases it may be more practical to learn the manifold probabilistically Tosi et al. [2014] or as a mixture of metrics Hauberg et al. [2012]. This is feasible as the LAND estimation procedure is agnostic to the details of the learned manifold as long as exponential and logarithm maps can be evaluated.

Once a manifold is learned, the LAND is simply a Riemannian normal distribution. This is a natural model, but more intriguing, it is a theoretical interesting model since it is the maximum entropy distribution for a fixed mean and covariance Pennec [2006]. It is generally difficult to build locally adaptive distributions with maximum entropy properties, yet the LAND does this in a fairly straight-forward manner. This is, however, only a partial truth as the distribution depends on the non-parametric space. The natural question, to which we currently do not have an answer, is whether a suitable maximum entropy manifold exist?

Algorithmically, we have proposed a maximum likelihood estimation scheme for the LAND. This combines a gradient-based optimization with a scalable Monte Carlo integration method. Once exponential and logarithm maps are available, this procedure is surprisingly simple to implement. We have demonstrated the algorithm on both real and synthetic data and results are encouraging. We almost always improve upon a standard Gaussian mixture model as the LAND is better at capturing the local properties of the data.

We note that both the manifold learning aspect and the algorithmic aspect of our work can be improved. It would be of great value to learn the parameter used for smoothing the Riemannian metric, and in general, more adaptive learning schemes are of interest. Computationally, the bottleneck of our work is evaluating the logarithm maps. This may be improved by specialized solvers, e.g. probabilistic solvers Hennig and Hauberg [2014]

, or manifold-specific heuristics.

The ordinary normal distribution is a key element in many machine learning algorithms. We expect that many fundamental generative models can be extended to the “manifold” setting simply by replacing the normal distribution with a LAND. Examples of this idea include Naïve Bayes, Linear Discriminant Analysis, Principal Component Analysis and more. Finally we note that standard hypothesis tests also extend to Riemannian normal distributions

Pennec [2006] and hence also to the LAND.

Acknowledgements. LKH was funded in part by the Novo Nordisk Foundation Interdisciplinary Synergy Program 2014, ’Biophysically adjusted state-informed cortex stimulation (BASICS)’. SH was funded in part by the Danish Council for Independent Research, Natural Sciences.


  • Belkin and Niyogi [2003] M. Belkin and P. Niyogi. Laplacian Eigenmaps for Dimensionality Reduction and Data Representation. Neural Computation, 15(6):1373–1396, June 2003.
  • Belkin et al. [2006] M. Belkin, P. Niyogi, and V. Sindhwani. Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples. JMLR, 7:2399–2434, Dec. 2006.
  • Bishop [2006] C. M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006.
  • Boyce et al. [2016] R. Boyce, S. D. Glasgow, S. Williams, and A. Adamantidis. Causal evidence for the role of REM sleep theta rhythm in contextual memory consolidation. Science, 352(6287):812–816, 2016.
  • Delorme and Makeig [2004] A. Delorme and S. Makeig.

    EEGLAB: an open source toolbox for analysis of single-trial EEG dynamics including independent component analysis.

    J. Neurosci. Methods, page 21, 2004.
  • do Carmo [1992] M. do Carmo. Riemannian Geometry. Mathematics (Boston, Mass.). Birkhäuser, 1992.
  • Goldberger et al. [2000 (June 13] A. L. Goldberger, L. A. N. Amaral, L. Glass, J. M. Hausdorff, P. C. Ivanov, R. G. Mark, J. E. Mietus, G. B. Moody, C.-K. Peng, and H. E. Stanley. PhysioBank, PhysioToolkit, and PhysioNet: Components of a New Research Resource for Complex Physiologic Signals. Circulation, 101(23):e215–e220, 2000 (June 13).
  • Hauberg [2016] S. Hauberg. Principal Curves on Riemannian Manifolds. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2016.
  • Hauberg et al. [2012] S. Hauberg, O. Freifeld, and M. J. Black. A Geometric Take on Metric Learning. In Advances in Neural Information Processing Systems (NIPS) 25, pages 2033–2041, 2012.
  • Hennig and Hauberg [2014] P. Hennig and S. Hauberg. Probabilistic Solutions to Differential Equations and their Application to Riemannian Statistics. In

    Proceedings of the 17th international Conference on Artificial Intelligence and Statistics (AISTATS)

    , volume 33, 2014.
  • Imtiaz and Rodriguez-Villegas [2015] S. A. Imtiaz and E. Rodriguez-Villegas. An open-source toolbox for standardized use of PhysioNet Sleep EDF Expanded Database. In 2015 37th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), pages 6014–6017, Aug 2015.
  • Karcher [1977] H. Karcher. Riemannian center of mass and mollifier smoothing. Communications on Pure and Applied Mathematics, 30(5):509–541, 1977.
  • Luxburg [2007] U. Luxburg.

    A Tutorial on Spectral Clustering.

    Statistics and Computing, 17(4):395–416, Dec. 2007.
  • Marxer et al. [2008] R. Marxer, H. Purwins, and A. Hazan. An f-measure for evaluation of unsupervised clustering with non-determined number of clusters. Report of the EmCAP project (European Commission FP6-IST), pages 1–3, 2008.
  • Pennec [2006] X. Pennec. Intrinsic Statistics on Riemannian Manifolds: Basic Tools for Geometric Measurements. Journal of Mathematical Imaging and Vision, 25(1):127–154, July 2006.
  • [16] D. Purves, G. Augustine, D. Fitzpatrick, W. Hall, A. LaMantia, J. McNamara, and L. White. Neuroscience, 2008. De Boeck, Sinauer, Sunderland, Mass.
  • Roweis and Saul [2000] S. T. Roweis and L. K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290:2323–2326, 2000.
  • Schölkopf et al. [1999] B. Schölkopf, A. Smola, and K.-R. Müller. Kernel principal component analysis. In Advances in Kernel Methods - Support Vector Learning, pages 327–352, 1999.
  • Simo-Serra et al. [2014] E. Simo-Serra, C. Torras, and F. Moreno-Noguer. Geodesic Finite Mixture Models. In Proceedings of the British Machine Vision Conference. BMVA Press, 2014.
  • Straub et al. [2015] J. Straub, J. Chang, O. Freifeld, and J. W. Fisher III. A Dirichlet Process Mixture Model for Spherical Data. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2015.
  • Tenenbaum et al. [2000] J. B. Tenenbaum, V. de Silva, and J. C. Langford. A Global Geometric Framework for Nonlinear Dimensionality Reduction. Science, 290(5500):2319, 2000.
  • Tosi et al. [2014] A. Tosi, S. Hauberg, A. Vellido, and N. D. Lawrence. Metrics for Probabilistic Geometries. In The Conference on Uncertainty in Artificial Intelligence (UAI), July 2014.
  • Zhang and Fletcher [2013] M. Zhang and P. Fletcher. Probabilistic Principal Geodesic Analysis. In Advances in Neural Information Processing Systems (NIPS) 26, pages 1178–1186, 2013.


Notation: all points are considered as column vectors, and they are denoted with bold lowercase characters. represents the set of symmetric positive definite matrices. The learned Riemannian manifold is denoted , and its tangent space at point is denoted .

We present for convenience the domain and co-domain of the following often used terms. Note that is .

Appendix A Estimating the Normalization Constant

The locally adaptive normal distribution is defined as


Therefore, the normalization constant is equal to


To simplify notation we have define the and . The integral is then estimated with a Monte-Carlo technique.

Appendix B Steepest Descent Direction for the Mean

The objective function is differentiable with respect to with


Then the gradient of the objective function is equal to


This gradient is highly dependent on the condition number of the covariance matrix , which makes the gradient unstable. We therefore consider the steepest descent direction.

We start by showing the general steepest descent direction.


Using the Cauchy-Schwarz inequality for the optimization problem (31), we get that the minimizer is equal to


and thus, by plugging the result of (33) in to (31), we get that the steepest descent direction is


In our case, the , and thus, we get that the steepest descent direction of the objective function of the LAND model is


where we omit the denominator , since this is just a scaling factor, which will be captured by the stepsize. This avoid problems that appears due to large condition numbers of .

Appendix C Gradient Direction for the Covariance

We decompose the . In addition, we rewrite the inner product as follows


where is the trace operator. Then the gradient of the objective with respect the matrix is


Finally, treating the integral as an expectation problem and using Monte Carlo integration, we get that the gradient is


Appendix D Gradients for the LAND Mixture Model

Similarly the LAND mixture model are


where , and the responsibilities .

Appendix E Algorithms

In this section we present the algorithms for: 1) estimating the normalization constant , 2) maximum likelihood estimation of the LAND, and 3) fitting the LAND mixture model.

0:  the given data , the , the number of samples
0:  the estimated
1:  sample tangent vectors on
2:  map the on as
3:  compute the normalization constant
Algorithm 2 The estimation of the normalization constant
0:  the data , stepsize , tolerance
0:  the estimated
1:  initialize and
2:  repeat
3:        estimate using Eq. 16
4:        compute using Eq. 35
6:        estimate using Eq. 16
7:        compute using Eq. 46
11:  until 
Algorithm 3 LAND maximum likelihood
0:  the data , , tolerance
0:  the estimated
1:  initialize the and
2:  repeat
3:        Expectation step:
4:        compute the responsibilities
5:        Maximization step:
6:        for  do
7:              estimate using Eq. 16
8:              compute from Eq. 47 the
10:              estimate using Eq. 16
11:              compute from Eq. 48 the
15:        end for
17:  until 
Algorithm 4 LAND mixture model

e.1 Stepsize Selection

The LAND objective is expensive to evaluate due to the dependency on . This imply that a line-search is infeasible for selecting a stepsize. Thus, we use the following common trick. Each stepsize is given in the start of the algorithm. If the objective increased after an update, we reduce the corresponding stepsize as , and if the objective reduced, then .

e.2 Initialization Issues

The initialization of the LAND is important, as well as for the mixture model. We discuss two different initializations plus one specifically for the mixture model.

  1. Random: we initialize the LAND mean with a random point on the manifold. The initial covariance is the empirical covariance of the tangent vectors. This initialization can be used also for the mixture model, with random starting points. Then, we cluster the points and the covariances are initialized using empirical estimators.

  2. Least Squares: we initialize the LAND with the intrinsic least squares mean, and the covariance with the empirical estimator. This initialization can be used also for the mixture model, using the extension of the -means on Riemannian manifolds, and then, the points of each cluster for the empirical covariances.

  3. GMM: we initialize the LAND mixture model centres with the result of the GMM. For the empirical covariances, we use the points that belong to each cluster from the GMM solution.

e.3 Stopping criterion

Our objective function is non-convex, thus, as stopping criterion we use the change of the objective value. In particular, we stop the optimization when , for some given by the user. The same stopping criterion is used for the mixture model.

Appendix F Experiments

In this section we provide additional illustrative experiments.

f.1 Estimating the Normalization Constant

In order to show the consistency of the normalization constant estimation with respect to the number of samples, we conduct the following experiment. We used the data from the first synthetic experiment of the paper. Then for a grid on the we computed the corresponding values. Thus, we computed the numerical integral on the tangent space using trapezoidal numerical integration. Then we estimated the normalization constant using our approach for sample sizes and for 10 different runs. From the result in Fig. 8 we observe that the numerical scheme we provide, approximates well the normalization constant that we computed numerically.

Figure 8: The normalization constant estimation for different sample sizes . The black line denotes the trapezoidal numerical integral, and the dashed red line the mean value of the estimators using our proposed method.

f.2 MNIST digit 1 data

In this experiment we used the digit 1 from the MNIST dataset. We sample 200 points and using PCA we projected them onto the first 2 principal components. Then we fitted LAND, a least squares model, and a normal distribution. From the result Fig. 9 we observe that the LAND model approximates efficiently the underlying distribution of the data. Also, the least squares model has a similar performance, since it takes under consideration the underlying manifold. However, is obvious that it overfits the given data, and gives significant probability to low density areas. On the other hand, the linear model has poor performance, due to the linear distance measure.

Figure 9: The MNIST digit 1 projected onto the 2 first principal components experiment. Left: the LAND model approximates efficiently the data distribution. Center: the least squares model approximates the distribution, but it overfits the given data. Right: the normal distribution has poor performance due to the linear distance measure.

f.3 The Sleep Stages Experiment

Here we present the feature extraction result for 3 factors. From the Fig. 

10 we observe that actually, the derived data have a manifold structure. Moreover, we see that the characteristics of the data i.e., the EEG measurement, varies a lot between the 3 subjects.

Figure 10: Top row: The given data, after the feature extraction procedure for 3 factors for three subjects. Bottom row: the F-measure for different values of (subject “s151”).

f.4 The Clustering Problem for the Synthetic Data

Due to space limitations, we were not able to present the result of the least squares mixture model for the clustering problem of the synthetic data, thus, we present here the result. From the Fig. 11 we observe that indeed the LAND can approximates efficiently the underlying distributions of the clusters. Even thought the least squares mixture model takes under consideration the underlying structure of the data, it fails to reveal precisely the distributions of the clusters. Thus, we argue that our maximum likelihood estimates are better than the least squares estimates. On the other hand, the GMM fails even to find the correct means of the distributions.

Figure 11: The clustering problem for two synthetic datasets. Left: the LAND mixture model approximates efficiently the underlying distributions of the clusters. Center: the least squares fails to reveal precisely the distributions of the clusters. Right: the GMM due to the linear distance measure fails even to find the correct means of the distributions.

f.5 The Contour Plots for the Synthetic Data

Additionally to the results presented in the main paper, in Fig. 12 we present the contours of all the fitted models and for all the numbers of components, where the advantages of the LAND are obvious. Especially, when we observe that the LAND approximates well the underlying distribution, while even though the least squares estimator reveals the nonlinearity of the distribution, as we discussed in the paper the covariance overfits the given data.

Furthermore, when increases the LAND components locally become almost linear Gaussians, since the geodesics will almost be straight lines. However, even in this case the LAND mixture model is more flexible than the Gaussian mixture model, see the result for . Also, the LAND does not overfit the given data, as the least squares mixture model does, since the probability mass is more concentrated around the means, see the result for .

Figure 12: Synthetic data and the fitted models. From top to bottom we present the results for , respectively. Left: the contours of the LAND mixture model. Center: the contours of the least squares mixture model. Right: the contours of the Gaussian mixture model.

f.6 Motion Capture Data

We conducted an experiment using motion capture data from CMU Motion Capture Database111http://mocap.cs.cmu.edu/. Specifically, we picked two movements motion: 16 from subject 22 (jumping jag), and the subject 9 (run). Each data point corresponds to a human pose. We projected the data onto the first 2 and 3 principal and we fitted a LAND mixture model and a Gaussian mixture model for . From the results in Fig. 13 we see that the LAND means fall inside the data, while the GMM means are actually outside of the manifold.

Figure 13: Motion capture experiment. Left: the experiment for the run sequence. Right: the experiment for the jumping jag sequence.

f.7 Scalability of Geodesic Computations

A scalability concern is that the underlying ODEs are computationally more demanding in high dimensions, and more specifically, we are interested in the logarithm map. We conducted a supplementary experiment on the MNIST data, reporting the ODE solver running time as a function of input dimensionality. In particular, we fix a point and we compute the running time of the logarithm map between this point and 20 random chosen points, for a set of the dimensions of the feature space. From the result in Fig. 14 we observe that the current implementation scales to approximately dimensions, where it becomes impractical.

Figure 14: Scalability experiment.

f.8 Model Selection

We used the standard AIC and BIC criteria,