1 Introduction
The multivariate normal distribution is a fundamental building block in many machine learning algorithms, and its wellknown density can compactly be written as
(1) 
where denotes the Mahalanobis distance for covariance matrix . This distance measure corresponds to the length of the straight line connecting and , and consequently the normal distribution is often used to model linear phenomena. When data lies near a nonlinear manifold embedded in the normal distribution becomes inadequate due to its linear metric. We investigate if a useful distribution can be constructed by replacing the linear distance function with a nonlinear counterpart. This is similar in spirit to Isomap Tenenbaum et al. [2000] that famously replace the linear distance with a geodesic distance measured over a neighborhood graph spanned by the data, thereby allowing for a nonlinear model. This is, however, a discrete distance measure that is only welldefined over the training data. For a generative model, we need a continuously defined metric over the entire .
Following Hauberg et al. [2012] we learn a smoothly changing metric that favors regions of high density i.e., geodesics tend to move near the data. Under this metric, the data space is interpreted as a dimensional Riemannian manifold. This “manifold learning” does not change dimensionality, but merely provides a local description of the data. The Riemannian viewpoint, however, gives a strong mathematical foundation upon which the proposed distribution can be developed. Our work, thus, bridges work on statistics on Riemannian manifolds Pennec [2006], Zhang and Fletcher [2013] with manifold learning Tenenbaum et al. [2000].
We develop a locally adaptive normal distribution (LAND) as follows: First, we construct a metric that captures the nonlinear structure of the data and enables us to compute geodesics; from this, an
unnormalized density is trivially defined. Second, we propose a scalable Monte Carlo integration scheme for normalizing the density with respect to the measure induced by the metric. Third, we develop a gradientbased algorithm for maximum likelihood estimation on the learned manifold. We further consider a mixture of LANDs and provide the corresponding EM algorithm. The usefulness of the model is verified on both synthetic data and EEG measurements of human sleep stages.
Notation: all points
are considered as column vectors, and they are denoted with bold lowercase characters.
represents the set of symmetric positive definite matrices. The learned Riemannian manifold is denoted , and its tangent space at is denoted .2 A Brief Summary of Riemannian Geometry
We start our exposition with a brief review of Riemannian manifolds do Carmo [1992]. These smooth manifolds are naturally equipped with a distance measure, and are commonly used to model physical phenomena such as dynamical or periodic systems, and many problems that have a smooth behavior.
Definition 1.
A smooth manifold together with a Riemannian metric is called a Riemannian manifold. The Riemannian metric encodes a smoothly changing inner product on the tangent space of each point .
Remark 1.
The Riemannian metric acts on tangent vectors, and may, thus, be interpreted as a standard Mahalanobis metric restricted to an infinitesimal region around .
The local inner product based on is a suitable model for capturing local behavior of data, i.e. manifold learning. From the inner product, we can define geodesics as lengthminimizing curves connecting two points , i.e.
(2) 
Here
is the metric tensor at
, and the tangent vector denotes the derivative (velocity) of . The distance between and is defined as the length of the geodesic. A standard result from differential geometry is that the geodesic can be found as the solution to a system oforder ordinary differential equations (ODEs)
do Carmo [1992], Hauberg et al. [2012]:(3) 
subject to . Here stacks the columns of a matrix into a vector and is the Kronecker product.
This differential equation allows us to define basic operations on the manifold. The exponential map at a point takes a tangent vector to such that the curve is a geodesic originating at with initial velocity and length . The inverse mapping, which takes to is known as the logarithm map and is denoted . By definition corresponds to the geodesic distance from to . These operations are illustrated in Fig. 2. The exponential and the logarithmic map can be computed by solving Eq. 3 numerically, as an initial value problem (IVP) or a boundary value problem (BVP) respectively. In practice the IVPs are substantially faster to compute than the BVPs.
The Mahalanobis distance is naturally extended to Riemannian manifolds as . From this, Pennec [2006] considered the Riemannian normal distribution
(4) 
and showed that it is the manifoldvalued distribution with maximum entropy subject to a known mean and covariance. This distribution is an instance of Eq. 1 and is the distribution we consider in this paper. Next, we consider standard “intrinsic least squares” estimates of and .
2.1 Intrinsic Least Squares Estimators
Let the data be generated from an unknown probability distribution on a manifold. Then it is common Pennec [2006] to define the intrinsic mean
of the distribution as the point that minimize the variance
(5) 
where is the measure (or infinitesimal volume element) induced by the metric. Based on the mean, a covariance matrix can be defined
(6) 
where is the domain over which is welldefined. For the manifolds we consider, the domain is . Practical estimators of rely on gradientbased optimization to find a local minimizer of Eq. 5, which is welldefined Karcher [1977]. For finite data , the descent direction is proportional to , and the updated mean is a point on the geodesic curve . After estimating the mean, the empirical covariance matrix is estimated as . It is worth noting that even though these estimators are natural, they are not maximum likelihood estimates for the Riemannian normal distribution (4).
3 A Locally Adaptive Normal Distribution
We now have the tools to define a locally adaptive normal distribution (LAND): we replace the linear Euclidean distance with a locally adaptive Riemannian distance and study the corresponding Riemannian normal distribution (4). By learning a Riemannian manifold and using its structure to estimate distributions of the data, we provide a new and useful link between Riemannian statistics and manifold learning.
3.1 Constructing a Metric
In the context of manifold learning, Hauberg et al. [2012] suggest to model the local behavior of the data manifold via a locallydefined Riemannian metric. Here we propose to use a local covariance matrix to represent the local structure of the data. We only consider diagonal covariances for computational efficiency and to prevent the overfitting. The locality of the covariance is defined via an isotropic Gaussian kernel of size . Thus, the metric tensor at is defined as the inverse of a local diagonal covariance matrix with entries
(7) 
Here is the dimension of the observation, and a regularization parameter to avoid singular covariances. This defines a smoothly changing (hence Riemannian) metric that captures the local structure of the data. It is easy to see that if
is outside of the support of the data, then the metric tensor is large. Thus, geodesics are “pulled” towards the data where the metric is small. Note that the proposed metric is not invariant to linear transformations.While we restrict our attention to this particular choice, other learned metrics are equally applicable, c.f.
Tosi et al. [2014], Hauberg et al. [2012].3.2 Estimating the Normalization Constant
The normalization constant of Eq. 4 is by definition
(8) 
where denotes the measure induced by the Riemannian metric. The constant depends not only on the covariance matrix, but also on the mean of the distribution, and the curvature of the manifold (captured by the logarithm map). For a general learned manifold, is inaccessible in closedform and we resort to numerical techniques. We start by rewriting Eq. 8 as
(9) 
In effect, we integrate the distribution over the tangent space instead of directly over the manifold. This transformation relies on the fact that the volume of an infinitely small area on the manifold can be computed in the tangent space if we take the deformation of the metric into account Pennec [2006]. This deformation is captured by the measure which, in the tangent space, is . For notational simplicity we define the function , which intuitively captures the cost for a point to be outside the data support ( is large in low density areas and small where the density is high).
We estimate the normalization constant (9) using Monte Carlo integration. We first multiply and divide the integral with the normalization constant of the Euclidean normal distribution . Then, the integral becomes an expectation estimation problem , which can be estimated numerically as
(10) 
and is the number of samples on . The computationally expensive element is to evaluate , which in turn requires evaluating . This amounts to solving an IVP numerically, which is fairly fast. Had we performed the integration directly on the manifold (8) we would have had to evaluate the logarithm map, which is a much more expensive BVP. The tangent space integration, thus, scales better.
3.3 Inferring Parameters
Assuming an independent and identically distributed dataset
, we can write their joint distribution as
. We find parameters and by maximum likelihood, which we implement by minimizing the mean negative loglikelihood(11) 
The first term of the objective function is a datafitting term, while the second can be seen as a force that both pulls the mean closer to the high density areas and shrinks the covariance. Specifically, when the mean is in low density areas, as well as when the covariance gives significant
probability to those areas, the value of will by construction be large. Consequently, will increase and these solutions will be penalized. In practice, we find that the maximum likelihood LAND mean generally avoids low density regions, which is in contrast to the standard intrinsic least squares mean (5), see Fig. 3.
In practice we optimize using block coordinate descent: we optimize the mean keeping the covariance fixed and vice versa. Unfortunately, both of the subproblems are nonconvex, and unlike the linear normal distribution, they lack a closedform solution. Since the logarithm map is a differentiable function, we can use gradientbased techniques to infer and . Below we give the descent direction for and and the corresponding optimization scheme is given in Algorithm 1. Initialization is discussed in Appx. E.2.
Optimizing : the objective function is differentiable with respect to do Carmo [1992], and using that , we get the gradient
(12) 
It is easy to see that this gradient is highly dependent on the condition number of . We find that this, at times, makes the gradient unstable, and choose to use the steepest descent direction instead of the gradient direction. This is equal to (see Appx. B).
Optimizing : since the covariance matrix by definition is constrained to be in the space , a common trick is to decompose the matrix as , and optimize the objective with respect to . The gradient of this factor is (see Appx. C for a derivation)
(13) 
Here the first term fits the given data by increasing the size of the covariance matrix, while the second term regularizes the covariance towards a small matrix.
3.4 Mixture of LANDs
At this point we can find maximum likelihood estimates of the LAND model. We can easily extend this to mixtures of LANDs: Following the derivation of the standard Gaussian mixture model
Bishop [2006], our objective function for inferring the parameters of the LAND mixture model is formulated as follows(14) 
where , is the probability that is generated by the component, and . The corresponding EM algorithm is in Appx. E.
4 Experiments
In this section we present both synthetic and real experiments to demonstrate the advantages of the LAND. We compare our model with both the Gaussian mixture model (GMM), and a mixture of LANDs using least squares (LS) estimators (5, 6). Since the latter are not maximum likelihood estimates we use a Riemannian means algorithm to find cluster centers. In all experiments we use samples in the Monte Carlo integration. This choice is investigated empirically in the supplements. Furthermore, we choose as small as possible, while ensuring that the manifold is smooth enough that geodesics can be computed numerically.
4.1 Synthetic Data Experiments
As a first experiment, we generate a nonlinear datamanifold by sampling from a mixture of 20 Gaussians positioned along a halfellipsoidal curve (see left panel of Fig. 5). We generate 10 datasets with 300 points each, and fit for each dataset the three models with number of components. Then, we generate 10000 samples from each fitted model, and we compute the mean negative loglikelihood of the true generative distribution using these samples. Fig. 4 shows that the LAND learns faster the underlying true distribution, than the GMM. Moreover, the LAND perform better than the least squares estimators, which overestimates the covariance. In the supplements we show, using the standard AIC and BIC criteria, that the optimal LAND is achieved for , while for the least squares estimators and the GMM, the optimal is achieved for and respectively.
In addition, in Fig. 5 we show the contours for the LAND and the GMM for . There, we can observe that indeed, the LAND adapts locally to the data and reveals their underlying nonlinear structure. This is particularly evident near the “boundaries” of the datamanifold.



We extend this experiment to a clustering task (see left panel of Fig. 11 for data). The center and right panels of Fig. 11 show the contours of the LAND and Gaussian mixtures, and it is evident that the LAND is substantially better at capturing nonellipsoidal clusters. Due to space limitations, we move further illustrative experiments to Appx. F and continue with real data.






4.2 Modeling Sleep Stages
We consider electroencephalography (EEG) measurements of human sleep from 10 subjects, part of the PhysioNet database Imtiaz and RodriguezVillegas [2015], Goldberger et al. [2000 (June 13], Delorme and Makeig [2004]. For each subject we get EEG measurements during sleep from two electrodes on the front and the back of the head, respectively. Measurements are sampled at Hz, and for each 30 second window a socalled sleep stage label is assigned from the set . Rapid eye movement (REM) sleep is particularly interesting, characterized by having EEG patterns similar to the awake state but with a complex physiological pattern, involving e.g., reduced muscle tone, rolling eye movements and erection Purves et al. . Recent evidence points to the importance of REM sleep for memory consolidation Boyce et al. [2016]. Periods in which the sleeper is awake are typically happening in or near REM intervals. Thus we here consider the characterization of sleep in terms of three categories REM, awake, and nonREM, the latter a merger of sleep stages .
We extract features from EEG measurements as follows: for each subject we subdivide the 30 second windows to 10 seconds, and apply a shorttimeFouriertransform to the EEG signal of the frontal electrode with
overlapping windows. From this we compute the log magnitude of the spectrum of each window. The resulting data matrix is decomposed using NonNegative Matrix Factorization (10 random starts) into five factors, and we use the coefficients as 5 features. In Fig. 7 we illustrate the nonlinear manifold structure based on a three factor analysis.We perform clustering on the data and evaluate the alignment between cluster labels and sleep stages using the Fmeasure Marxer et al. [2008]. The LAND depends on the parameter to construct the metric tensor, and in this experiment it is less straightforward to select because of significant intersubject variability. First, we fixed for all the subjects. From the results in Table 1 we observe that for the LAND(1) generally outperforms the GMM and achieves much better alignment. To further illustrate the effect of we fitted a LAND for and present the best result achieved by the LAND. Selecting this way leads indeed to higher degrees of alignment further underlining that the conspicuous manifold structure and the rather compact sleep stage distributions in Fig. 7 are both captured better with the LAND representation than with a linear GMM.
s001  s011  s042  s062  s081  s141  s151  s161  s162  s191  

LAND(1)  
GMM  
LAND 
5 Related Work
We are not the first to consider Riemannian normal distributions, e.g. Pennec [2006] gives a theoretical analysis of the distribution, and Zhang and Fletcher [2013] consider the Riemannian counterpart of probabilistic PCA. Both consider the scenario where the manifold is known a priori. We adapt the distribution to the “manifold learning” setting by constructing a Riemannian metric that adapts to the data. This is our overarching contribution.
Traditionally, manifold learning is seen as an embedding problem where a lowdimensional representation of the data is sought. This is useful for visualization [Tenenbaum et al., 2000, Roweis and Saul, 2000, Schölkopf et al., 1999, Belkin and Niyogi, 2003], clustering Luxburg [2007]
Belkin et al. [2006] and more. However, in embedding approaches, the relation between a new point and the embedded points are less welldefined, and consequently these approaches are less suited for building generative models. In contrast, the Riemannian approach gives the ability to measure continuous geodesics that follow the structure of the data. This makes the learned Riemannian manifold a suitable space for a generative model.SimoSerra et al. [2014] consider mixtures of Riemannian normal distributions on manifolds that are known a priori. Structurally, their EM algorithm is similar to ours, but they do not account for the normalization constants for different mixture components. Consequently, their approach is inconsistent with the probabilistic formulation. Straub et al. [2015] consider data on spherical manifolds, and further consider a Dirichlet process prior for determining the number of components. Such a prior could also be incorporated in our model. The key difference to our work is that we consider learned manifolds as well as the following complications.
6 Discussion
In this paper we have introduced a parametric locally adaptive normal distribution. The idea is to replace the Euclidean distance in the ordinary normal distribution with a locally adaptive nonlinear distance measure. In principle, we learn a nonparametric metric space, by constructing a smoothly changing metric that induces a Riemannian manifold, where we build our model. As such, we propose a parametric model over a nonparametric space.
The nonparametric space is constructed using a local metric that is the inverse of a local covariance matrix. Here locality is defined via a Gaussian kernel, such that the manifold learning can be seen as a form of kernel smoothing. This indicates that our scheme for learning a manifold might not scale to highdimensional input spaces. In these cases it may be more practical to learn the manifold probabilistically Tosi et al. [2014] or as a mixture of metrics Hauberg et al. [2012]. This is feasible as the LAND estimation procedure is agnostic to the details of the learned manifold as long as exponential and logarithm maps can be evaluated.
Once a manifold is learned, the LAND is simply a Riemannian normal distribution. This is a natural model, but more intriguing, it is a theoretical interesting model since it is the maximum entropy distribution for a fixed mean and covariance Pennec [2006]. It is generally difficult to build locally adaptive distributions with maximum entropy properties, yet the LAND does this in a fairly straightforward manner. This is, however, only a partial truth as the distribution depends on the nonparametric space. The natural question, to which we currently do not have an answer, is whether a suitable maximum entropy manifold exist?
Algorithmically, we have proposed a maximum likelihood estimation scheme for the LAND. This combines a gradientbased optimization with a scalable Monte Carlo integration method. Once exponential and logarithm maps are available, this procedure is surprisingly simple to implement. We have demonstrated the algorithm on both real and synthetic data and results are encouraging. We almost always improve upon a standard Gaussian mixture model as the LAND is better at capturing the local properties of the data.
We note that both the manifold learning aspect and the algorithmic aspect of our work can be improved. It would be of great value to learn the parameter used for smoothing the Riemannian metric, and in general, more adaptive learning schemes are of interest. Computationally, the bottleneck of our work is evaluating the logarithm maps. This may be improved by specialized solvers, e.g. probabilistic solvers Hennig and Hauberg [2014]
, or manifoldspecific heuristics.
The ordinary normal distribution is a key element in many machine learning algorithms. We expect that many fundamental generative models can be extended to the “manifold” setting simply by replacing the normal distribution with a LAND. Examples of this idea include Naïve Bayes, Linear Discriminant Analysis, Principal Component Analysis and more. Finally we note that standard hypothesis tests also extend to Riemannian normal distributions
Pennec [2006] and hence also to the LAND.Acknowledgements. LKH was funded in part by the Novo Nordisk Foundation Interdisciplinary Synergy Program 2014, ’Biophysically adjusted stateinformed cortex stimulation (BASICS)’. SH was funded in part by the Danish Council for Independent Research, Natural Sciences.
References
 Belkin and Niyogi [2003] M. Belkin and P. Niyogi. Laplacian Eigenmaps for Dimensionality Reduction and Data Representation. Neural Computation, 15(6):1373–1396, June 2003.
 Belkin et al. [2006] M. Belkin, P. Niyogi, and V. Sindhwani. Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples. JMLR, 7:2399–2434, Dec. 2006.
 Bishop [2006] C. M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). SpringerVerlag New York, Inc., Secaucus, NJ, USA, 2006.
 Boyce et al. [2016] R. Boyce, S. D. Glasgow, S. Williams, and A. Adamantidis. Causal evidence for the role of REM sleep theta rhythm in contextual memory consolidation. Science, 352(6287):812–816, 2016.

Delorme and Makeig [2004]
A. Delorme and S. Makeig.
EEGLAB: an open source toolbox for analysis of singletrial EEG dynamics including independent component analysis.
J. Neurosci. Methods, page 21, 2004.  do Carmo [1992] M. do Carmo. Riemannian Geometry. Mathematics (Boston, Mass.). Birkhäuser, 1992.
 Goldberger et al. [2000 (June 13] A. L. Goldberger, L. A. N. Amaral, L. Glass, J. M. Hausdorff, P. C. Ivanov, R. G. Mark, J. E. Mietus, G. B. Moody, C.K. Peng, and H. E. Stanley. PhysioBank, PhysioToolkit, and PhysioNet: Components of a New Research Resource for Complex Physiologic Signals. Circulation, 101(23):e215–e220, 2000 (June 13).
 Hauberg [2016] S. Hauberg. Principal Curves on Riemannian Manifolds. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2016.
 Hauberg et al. [2012] S. Hauberg, O. Freifeld, and M. J. Black. A Geometric Take on Metric Learning. In Advances in Neural Information Processing Systems (NIPS) 25, pages 2033–2041, 2012.

Hennig and Hauberg [2014]
P. Hennig and S. Hauberg.
Probabilistic Solutions to Differential Equations and their
Application to Riemannian Statistics.
In
Proceedings of the 17th international Conference on Artificial Intelligence and Statistics (AISTATS)
, volume 33, 2014.  Imtiaz and RodriguezVillegas [2015] S. A. Imtiaz and E. RodriguezVillegas. An opensource toolbox for standardized use of PhysioNet Sleep EDF Expanded Database. In 2015 37th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), pages 6014–6017, Aug 2015.
 Karcher [1977] H. Karcher. Riemannian center of mass and mollifier smoothing. Communications on Pure and Applied Mathematics, 30(5):509–541, 1977.

Luxburg [2007]
U. Luxburg.
A Tutorial on Spectral Clustering.
Statistics and Computing, 17(4):395–416, Dec. 2007.  Marxer et al. [2008] R. Marxer, H. Purwins, and A. Hazan. An fmeasure for evaluation of unsupervised clustering with nondetermined number of clusters. Report of the EmCAP project (European Commission FP6IST), pages 1–3, 2008.
 Pennec [2006] X. Pennec. Intrinsic Statistics on Riemannian Manifolds: Basic Tools for Geometric Measurements. Journal of Mathematical Imaging and Vision, 25(1):127–154, July 2006.
 [16] D. Purves, G. Augustine, D. Fitzpatrick, W. Hall, A. LaMantia, J. McNamara, and L. White. Neuroscience, 2008. De Boeck, Sinauer, Sunderland, Mass.
 Roweis and Saul [2000] S. T. Roweis and L. K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290:2323–2326, 2000.
 Schölkopf et al. [1999] B. Schölkopf, A. Smola, and K.R. Müller. Kernel principal component analysis. In Advances in Kernel Methods  Support Vector Learning, pages 327–352, 1999.
 SimoSerra et al. [2014] E. SimoSerra, C. Torras, and F. MorenoNoguer. Geodesic Finite Mixture Models. In Proceedings of the British Machine Vision Conference. BMVA Press, 2014.
 Straub et al. [2015] J. Straub, J. Chang, O. Freifeld, and J. W. Fisher III. A Dirichlet Process Mixture Model for Spherical Data. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2015.
 Tenenbaum et al. [2000] J. B. Tenenbaum, V. de Silva, and J. C. Langford. A Global Geometric Framework for Nonlinear Dimensionality Reduction. Science, 290(5500):2319, 2000.
 Tosi et al. [2014] A. Tosi, S. Hauberg, A. Vellido, and N. D. Lawrence. Metrics for Probabilistic Geometries. In The Conference on Uncertainty in Artificial Intelligence (UAI), July 2014.
 Zhang and Fletcher [2013] M. Zhang and P. Fletcher. Probabilistic Principal Geodesic Analysis. In Advances in Neural Information Processing Systems (NIPS) 26, pages 1178–1186, 2013.
Appendix
Notation: all points are considered as column vectors, and they are denoted with bold lowercase characters. represents the set of symmetric positive definite matrices. The learned Riemannian manifold is denoted , and its tangent space at point is denoted .
We present for convenience the domain and codomain of the following often used terms. Note that is .
Appendix A Estimating the Normalization Constant
The locally adaptive normal distribution is defined as
(15) 
Therefore, the normalization constant is equal to
(16)  
(17)  
(18)  
(19)  
(20)  
(21)  
(22) 
To simplify notation we have define the and . The integral is then estimated with a MonteCarlo technique.
Appendix B Steepest Descent Direction for the Mean
The objective function is differentiable with respect to with
(23) 
Then the gradient of the objective function is equal to
(24)  
(25)  
(26)  
(27)  
(28) 
This gradient is highly dependent on the condition number of the covariance matrix , which makes the gradient unstable. We therefore consider the steepest descent direction.
We start by showing the general steepest descent direction.
(29)  
(30)  
(31) 
Using the CauchySchwarz inequality for the optimization problem (31), we get that the minimizer is equal to
(33) 
and thus, by plugging the result of (33) in to (31), we get that the steepest descent direction is
(34) 
In our case, the , and thus, we get that the steepest descent direction of the objective function of the LAND model is
(35) 
where we omit the denominator , since this is just a scaling factor, which will be captured by the stepsize. This avoid problems that appears due to large condition numbers of .
Appendix C Gradient Direction for the Covariance
We decompose the . In addition, we rewrite the inner product as follows
(36)  
(37)  
(38) 
where is the trace operator. Then the gradient of the objective with respect the matrix is
(39)  
(40)  
(41)  
(42)  
(43)  
(44)  
(45) 
Finally, treating the integral as an expectation problem and using Monte Carlo integration, we get that the gradient is
(46) 
Appendix D Gradients for the LAND Mixture Model
Similarly the LAND mixture model are
(47)  
(48) 
where , and the responsibilities .
Appendix E Algorithms
In this section we present the algorithms for: 1) estimating the normalization constant , 2) maximum likelihood estimation of the LAND, and 3) fitting the LAND mixture model.
e.1 Stepsize Selection
The LAND objective is expensive to evaluate due to the dependency on . This imply that a linesearch is infeasible for selecting a stepsize. Thus, we use the following common trick. Each stepsize is given in the start of the algorithm. If the objective increased after an update, we reduce the corresponding stepsize as , and if the objective reduced, then .
e.2 Initialization Issues
The initialization of the LAND is important, as well as for the mixture model. We discuss two different initializations plus one specifically for the mixture model.

Random: we initialize the LAND mean with a random point on the manifold. The initial covariance is the empirical covariance of the tangent vectors. This initialization can be used also for the mixture model, with random starting points. Then, we cluster the points and the covariances are initialized using empirical estimators.

Least Squares: we initialize the LAND with the intrinsic least squares mean, and the covariance with the empirical estimator. This initialization can be used also for the mixture model, using the extension of the means on Riemannian manifolds, and then, the points of each cluster for the empirical covariances.

GMM: we initialize the LAND mixture model centres with the result of the GMM. For the empirical covariances, we use the points that belong to each cluster from the GMM solution.
e.3 Stopping criterion
Our objective function is nonconvex, thus, as stopping criterion we use the change of the objective value. In particular, we stop the optimization when , for some given by the user. The same stopping criterion is used for the mixture model.
Appendix F Experiments
In this section we provide additional illustrative experiments.
f.1 Estimating the Normalization Constant
In order to show the consistency of the normalization constant estimation with respect to the number of samples, we conduct the following experiment. We used the data from the first synthetic experiment of the paper. Then for a grid on the we computed the corresponding values. Thus, we computed the numerical integral on the tangent space using trapezoidal numerical integration. Then we estimated the normalization constant using our approach for sample sizes and for 10 different runs. From the result in Fig. 8 we observe that the numerical scheme we provide, approximates well the normalization constant that we computed numerically.
f.2 MNIST digit 1 data
In this experiment we used the digit 1 from the MNIST dataset. We sample 200 points and using PCA we projected them onto the first 2 principal components. Then we fitted LAND, a least squares model, and a normal distribution. From the result Fig. 9 we observe that the LAND model approximates efficiently the underlying distribution of the data. Also, the least squares model has a similar performance, since it takes under consideration the underlying manifold. However, is obvious that it overfits the given data, and gives significant probability to low density areas. On the other hand, the linear model has poor performance, due to the linear distance measure.



f.3 The Sleep Stages Experiment
Here we present the feature extraction result for 3 factors. From the Fig.
10 we observe that actually, the derived data have a manifold structure. Moreover, we see that the characteristics of the data i.e., the EEG measurement, varies a lot between the 3 subjects.




f.4 The Clustering Problem for the Synthetic Data
Due to space limitations, we were not able to present the result of the least squares mixture model for the clustering problem of the synthetic data, thus, we present here the result. From the Fig. 11 we observe that indeed the LAND can approximates efficiently the underlying distributions of the clusters. Even thought the least squares mixture model takes under consideration the underlying structure of the data, it fails to reveal precisely the distributions of the clusters. Thus, we argue that our maximum likelihood estimates are better than the least squares estimates. On the other hand, the GMM fails even to find the correct means of the distributions.






f.5 The Contour Plots for the Synthetic Data
Additionally to the results presented in the main paper, in Fig. 12 we present the contours of all the fitted models and for all the numbers of components, where the advantages of the LAND are obvious. Especially, when we observe that the LAND approximates well the underlying distribution, while even though the least squares estimator reveals the nonlinearity of the distribution, as we discussed in the paper the covariance overfits the given data.
Furthermore, when increases the LAND components locally become almost linear Gaussians, since the geodesics will almost be straight lines. However, even in this case the LAND mixture model is more flexible than the Gaussian mixture model, see the result for . Also, the LAND does not overfit the given data, as the least squares mixture model does, since the probability mass is more concentrated around the means, see the result for .












f.6 Motion Capture Data
We conducted an experiment using motion capture data from CMU Motion Capture Database^{1}^{1}1http://mocap.cs.cmu.edu/. Specifically, we picked two movements motion: 16 from subject 22 (jumping jag), and the subject 9 (run). Each data point corresponds to a human pose. We projected the data onto the first 2 and 3 principal and we fitted a LAND mixture model and a Gaussian mixture model for . From the results in Fig. 13 we see that the LAND means fall inside the data, while the GMM means are actually outside of the manifold.


f.7 Scalability of Geodesic Computations
A scalability concern is that the underlying ODEs are computationally more demanding in high dimensions, and more specifically, we are interested in the logarithm map. We conducted a supplementary experiment on the MNIST data, reporting the ODE solver running time as a function of input dimensionality. In particular, we fix a point and we compute the running time of the logarithm map between this point and 20 random chosen points, for a set of the dimensions of the feature space. From the result in Fig. 14 we observe that the current implementation scales to approximately dimensions, where it becomes impractical.
f.8 Model Selection
We used the standard AIC and BIC criteria,
Comments
There are no comments yet.