The differential geometric approach to probability theory and statistics has met increasing interest in the past years, from the theoretical point of view as well as in applications. In this approach, probability distributions are seen as elements of a differentiable manifold, on which a metric structure is defined through the choice of a Riemannian metric. Two very important ones are the Wasserstein metric, central in optimal transport, and the Fisher information metric (also called Fisher-Rao metric), essential in information geometry. Unlike optimal transport, information geometry is foremost concerned with parametric families of probability distributions, and defines a Riemannian structure on the parameter space using the Fisher information matrix, see Fisher (1922). The Fisher information is the Hessian of the Kullback-Leibler divergence, a popular measure of dissimilarity between probability distributions that does not satisfy the properties of a distance. In parameter estimation, it can be interpreted as the quantity of information contained in the model about unknown parameters. Rao (1992) showed that it can be used to locally define a scalar product on the space of parameters, interpreted as a Riemannian metric.
An important feature of this metric is that it is invariant under any diffeomorphic change of parameterization. Indeed, it is induced by a more general Riemannian metric on the infinite-dimensional space of smooth, non parametric probability densities, and therefore it is invariant to coordinate change; see e.g. Friedrich (1991). It can be seen as a natural choice of metric as it is the only Riemannian metric that is invariant with respect to transformation by a sufficient statistic, or a diffeomorphic transformation of the support in the non-parametric case (Cencov, 2000; Bauer et al., 2016). Arguably the most famous example of Fisher information geometry of a statistical model is that of the univariate Gaussian model, which is hyperbolic. The geometries of other parametric families such as the multivariate Gaussian model (Atkinson and Mitchell, 1981; Skovgaard, 1984)
, the family of gamma distributions (Arwini and Dodson, 2008; Rebbah et al., 2019), or more generally location-scale models (Said et al., 2019), among others, have also received a lot of attention.
In this work, we focus on beta distributions, a family of probability measures on $[0,1]$ used, for example, as prior distributions in Bayesian inference (O'Neill and Roberts, 1999), or to model percentages and proportions in genomic studies (Yang and Fang, 2017). To the best of our knowledge, the information geometry of beta distributions has not yet received much attention. In Section 2, we give new results and properties for this geometry by deriving the metric matrix and the geodesic equations. We compute the sectional curvature and show that it is negative everywhere. This result is of particular interest to ensure the existence and uniqueness of Fréchet means. By numerically solving the geodesic equation, we may compute geodesic distances and means between probability distributions.
In Section 3, we exemplify the use of this geometry in the analysis of histograms extracted from segmented medical images. Traditional statistical methods are challenged in medical data analysis, as data-sets often have very few samples whereas observations are very high dimensional, and missing data are ubiquitous. Moreover, anatomical measurements are often obtained through complex processing pipelines that result in varying numbers of features (e.g. mesh representations of organs' surfaces may have different numbers of vertices), resulting in the lack of a common representation space for the data.
To bypass these issues, it is thus helpful to consider histograms across the whole region of interest and to represent them in the space of beta distributions. The obtained representation has only two dimensions, which is useful for visualization and interpretation of the results. We demonstrate that classification in this space achieves similar accuracy in discriminating patients from healthy controls as in the high-dimensional data space.
2 Fisher geometry of beta distributions
2.1 The Fisher information metric
Let $\Theta$ be an open subset of $\mathbb{R}^d$ and let $\mu$ be a measure defined on a measurable space $(\mathcal{X}, \mathcal{A})$. A parameterized family of distributions on the parameter space $\Theta$ is a set of probability measures absolutely continuous with respect to $\mu$, that is:
$$\mathcal{P}_\Theta = \{ P_\theta,\ \theta \in \Theta \}, \qquad dP_\theta(x) = p(x, \theta)\, d\mu(x).$$
When the mapping $\theta \mapsto p(x, \theta)$ is differentiable for $\mu$-almost all $x$ and the model satisfies certain regularity conditions, the Fisher information matrix at $\theta$ is defined to be:
$$I(\theta) = \mathbb{E}_\theta\left[ \partial_\theta \ell(x, \theta)\, \partial_\theta \ell(x, \theta)^\top \right],$$
where $\ell(x, \theta) = \ln p(x, \theta)$. As an open subset of $\mathbb{R}^d$, $\Theta$ is a differentiable manifold with tangent spaces $T_\theta \Theta \simeq \mathbb{R}^d$. The Fisher information defines a Riemannian metric on $\Theta$, called the Fisher information metric:
$$G_\theta(u, v) = u^\top I(\theta)\, v, \qquad u, v \in T_\theta \Theta,$$
where $u^\top$ denotes the transpose of the vector $u$. By a slight abuse of language, we will talk of the geometry of $\mathcal{P}_\Theta$, implicitly referring to the metric structure induced on $\mathcal{P}_\Theta$ by $G$.
Here we consider the parametric family of beta distributions, a family of probability measures on $[0,1]$ with density with respect to the Lebesgue measure parameterized by two positive scalars $\alpha, \beta > 0$:
$$p(x; \alpha, \beta) = \frac{x^{\alpha - 1}(1 - x)^{\beta - 1}}{B(\alpha, \beta)}, \qquad B(\alpha, \beta) = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha + \beta)}.$$
We consider the Riemannian manifold composed of the parameter space $\Theta = (0, +\infty)^2$ and the Fisher information metric $G$, and by extension denote by beta manifold the pair $(\mathcal{B}, G)$, where $\mathcal{B}$ is the family of beta distributions:
$$\mathcal{B} = \{ p(\cdot\,; \alpha, \beta)\, d\lambda,\ (\alpha, \beta) \in (0, +\infty)^2 \}.$$
Here $\lambda$ denotes the Lebesgue measure on $[0,1]$. The beta family is a natural exponential family with log-partition function
$$\varphi(\alpha, \beta) = \ln B(\alpha, \beta) = \ln\Gamma(\alpha) + \ln\Gamma(\beta) - \ln\Gamma(\alpha + \beta), \tag{1}$$
and so the Fisher information admits a simple expression as the Hessian of the log-partition function:
$$I(\alpha, \beta) = \nabla^2 \varphi(\alpha, \beta).$$
Straightforward computations then yield
$$I(\alpha, \beta) = \begin{pmatrix} \psi'(\alpha) - \psi'(\alpha + \beta) & -\psi'(\alpha + \beta) \\ -\psi'(\alpha + \beta) & \psi'(\beta) - \psi'(\alpha + \beta) \end{pmatrix}, \tag{2}$$
where $\psi$ denotes the digamma function, i.e.
$$\psi(x) = \frac{d}{dx} \ln \Gamma(x).$$
The matrix form of the Fisher metric is given by (2), and its infinitesimal length element is
$$ds^2 = \big(\psi'(\alpha) - \psi'(\alpha + \beta)\big)\, d\alpha^2 - 2\, \psi'(\alpha + \beta)\, d\alpha\, d\beta + \big(\psi'(\beta) - \psi'(\alpha + \beta)\big)\, d\beta^2.$$
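For concreteness, the metric matrix (2) can be evaluated numerically using the trigamma function $\psi'$; the following is a minimal sketch assuming numpy and scipy are available:

```python
import numpy as np
from scipy.special import polygamma

def beta_fisher_metric(alpha, beta):
    """Fisher information matrix of the beta family at (alpha, beta).

    It is the Hessian of the log-partition function
    phi(a, b) = ln Gamma(a) + ln Gamma(b) - ln Gamma(a + b),
    with entries expressed through the trigamma function psi'.
    """
    tri = polygamma(1, [alpha, beta, alpha + beta])  # psi'(a), psi'(b), psi'(a+b)
    return np.array([[tri[0] - tri[2], -tri[2]],
                     [-tri[2], tri[1] - tri[2]]])

# Metric at the uniform distribution (alpha = beta = 1)
G = beta_fisher_metric(1.0, 1.0)
```

At $(1, 1)$, the recurrence $\psi'(x + 1) = \psi'(x) - 1/x^2$ gives diagonal entries exactly equal to $1$, which provides a quick sanity check of the implementation.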
2.2 Geodesics and geodesic distance
The distance between two beta distributions is defined as the geodesic distance associated to the Fisher metric in the parameter space:
$$d\big((\alpha_1, \beta_1), (\alpha_2, \beta_2)\big) = \inf_{\gamma} \int_0^1 \sqrt{\dot\gamma(t)^\top I(\gamma(t))\, \dot\gamma(t)}\, dt,$$
where the infimum is taken over all paths $\gamma : [0, 1] \to \Theta$ such that $\gamma(0) = (\alpha_1, \beta_1)$ and $\gamma(1) = (\alpha_2, \beta_2)$. To compute this distance in practice, one needs to solve the geodesic equation.
Proposition 1: The geodesics of the beta parameter space are solutions of the system
$$\ddot\alpha = \frac{\big(\psi''(\alpha{+}\beta)\psi'(\beta) - \psi''(\alpha)(\psi'(\beta) - \psi'(\alpha{+}\beta))\big)\dot\alpha^2 + 2\psi''(\alpha{+}\beta)\psi'(\beta)\,\dot\alpha\dot\beta + \big(\psi''(\alpha{+}\beta)\psi'(\beta) - \psi''(\beta)\psi'(\alpha{+}\beta)\big)\dot\beta^2}{2\big(\psi'(\alpha)\psi'(\beta) - \psi'(\alpha{+}\beta)(\psi'(\alpha) + \psi'(\beta))\big)}, \tag{5}$$
together with the analogous equation for $\ddot\beta$ obtained by exchanging the roles of $\alpha$ and $\beta$.
No closed form for the geodesics is known, but they can be computed numerically by solving (5), see Section 2.4. Nonetheless we can notice that, due to the symmetry of the metric with respect to the parameters $\alpha$ and $\beta$, the line of equation $\{\alpha = \beta\}$ is a geodesic, where the parameterization is fixed by the geodesic equation (5).
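To illustrate the numerical integration, the initial value problem for the geodesic system can be handled by a generic ODE solver, computing the Christoffel symbols from the Hessian structure of the metric as detailed in Appendix A. This is an illustrative sketch, not the geomstats implementation:

```python
import numpy as np
from scipy.integrate import solve_ivp
from scipy.special import polygamma

def geodesic_rhs(t, state):
    """Right-hand side of the geodesic system for the beta Fisher metric."""
    a, b, da, db = state
    t1 = polygamma(1, [a, b, a + b])  # trigamma psi' at a, b, a+b
    t2 = polygamma(2, [a, b, a + b])  # psi'' at a, b, a+b
    g = np.array([[t1[0] - t1[2], -t1[2]],
                  [-t1[2], t1[1] - t1[2]]])
    # Third derivatives of the log-partition function phi; for a Hessian
    # metric the Christoffel symbols of the first kind are phi_{ijk} / 2.
    phi3 = np.full((2, 2, 2), -t2[2])   # all mixed derivatives equal -psi''(a+b)
    phi3[0, 0, 0] = t2[0] - t2[2]
    phi3[1, 1, 1] = t2[1] - t2[2]
    v = np.array([da, db])
    # acceleration^l = -(1/2) g^{lk} phi_{ijk} v^i v^j
    acc = -0.5 * np.linalg.solve(g, np.einsum('ijk,i,j->k', phi3, v, v))
    return [da, db, acc[0], acc[1]]

# Shoot a geodesic from (1, 1) with a velocity along the diagonal
sol = solve_ivp(geodesic_rhs, (0.0, 1.0), [1.0, 1.0, 0.5, 0.5],
                rtol=1e-9, atol=1e-9)
```

Starting on the diagonal with a diagonal velocity, the numerical solution stays on the line $\{\alpha = \beta\}$, consistently with the symmetry argument above.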
2.3 Negative curvature and Fréchet mean
Extension of basic statistical objects such as the mean, the median or the variance to the setting of Riemannian manifolds is now well known; see e.g. Pennec (2006). A popular choice to define the notion of mean in a Riemannian manifold is the Fréchet mean, also called intrinsic mean, which is defined as the minimizer of the sum of the squared distances to the points that we want to average. In our setting, the intrinsic mean of a set $x_1, \ldots, x_n$ of beta distributions is given by
$$\bar{x} = \underset{\mu \in \mathcal{B}}{\arg\min}\ \sum_{i=1}^{n} d(\mu, x_i)^2.$$
The Fréchet mean is in general not unique and refers to a set. Uniqueness holds however when the curvature of the Riemannian manifold is negative (Karcher, 1977), which is the case here.
Theorem 1: The beta manifold has negative sectional curvature. Therefore, any set of beta distributions admits a unique Fréchet mean for the Fisher information geometry.
The proof of this theorem, given in the appendix, relies on certain regularity properties of the polygamma functions, and in particular on the sub-additivity of a ratio of polygamma functions, which was proven recently by Yang (2017).
The uniqueness of the Fréchet mean for this geometry is an important argument in favor of its use for supervised and unsupervised classification of histograms. Indeed, the popular K-means algorithm, also called Lloyd’s algorithm (Lloyd, 1982), which seeks to minimize intra-cluster variance, relies on the computation of the mean of the clusters. It can be easily extended to non-linear manifolds by replacing the Euclidean distance and mean by the Riemannian distance and Fréchet mean.
Geodesics (see Fig. 1) are computed through the exponential and logarithm maps of the Riemannian manifold, which can be implemented by solving respectively the initial and boundary value problems associated to the geodesic equation (5). This allows us to approximate the distance, by averaging the norm of the velocity of the discretized geodesic between two points. The Fréchet mean can be found in practice using a gradient descent algorithm commonly referred to as the Karcher flow. It simply consists in applying the following update on the current estimate of the mean at each iteration:
$$\hat\mu \leftarrow \operatorname{Exp}_{\hat\mu}\!\left( \frac{\tau}{n} \sum_{i=1}^{n} \operatorname{Log}_{\hat\mu}(x_i) \right),$$
for some step size $\tau$. Here $\operatorname{Exp}$ and $\operatorname{Log}$ denote the Riemannian exponential and logarithm maps. Our implementation is available in the open-source Python package geomstats (http://geomstats.ai).
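To make the update rule concrete, here is a minimal sketch of the Karcher flow on the unit sphere, where the exponential and logarithm maps have closed forms; on the beta manifold these two maps are instead obtained by numerically solving the geodesic equation, as done in geomstats:

```python
import numpy as np

def exp_map(p, v):
    """Riemannian exponential on the unit sphere."""
    n = np.linalg.norm(v)
    if n < 1e-12:
        return p
    return np.cos(n) * p + np.sin(n) * v / n

def log_map(p, q):
    """Riemannian logarithm on the unit sphere."""
    w = q - np.dot(p, q) * p              # project q on the tangent space at p
    nw = np.linalg.norm(w)
    if nw < 1e-12:
        return np.zeros_like(p)
    return np.arccos(np.clip(np.dot(p, q), -1.0, 1.0)) * w / nw

def karcher_flow(points, step=1.0, n_iter=50):
    """Iterate the update: mean <- Exp_mean((step / n) * sum_i Log_mean(x_i))."""
    mean = points[0]
    for _ in range(n_iter):
        mean = exp_map(mean, step * np.mean(
            [log_map(mean, q) for q in points], axis=0))
    return mean

# Four points placed symmetrically around the north pole
points = np.array([[0.3, 0.0, 1.0], [-0.3, 0.0, 1.0],
                   [0.0, 0.3, 1.0], [0.0, -0.3, 1.0]])
points /= np.linalg.norm(points, axis=1, keepdims=True)
mean = karcher_flow(points)
```

By symmetry, the Fréchet mean of this configuration is the north pole, which the flow recovers.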
3 Applications to medical data analysis
In this section we exemplify the theoretical results of the previous section on histograms computed from medical data in two different contexts.
Cardiac Shape Deformations
Firstly, we use a data-base of 204 shapes of right ventricles (RV) of the heart extracted from 3d-echocardiographic sequences (Moceri et al., 2017). 3d-meshes were extracted from the sequences with semi-automatic software (TomTec 4D RV-Function 2.0) and post-processed to compute markers of deformation during the cardiac cycle. We focus on the systole, that is the period during which the RV contracts to eject blood through the pulmonary valve. As a marker of deformation, we use the Area Strain (AS), that represents the local stretching of the RV surface. This feature is of growing interest among clinicians to assess cardiac function. It was shown to be a predictor of survival in pulmonary hypertension (Moceri et al., 2017).
For each subject, the RV at each time $t$ of the cardiac cycle is represented by a 3d-mesh of 822 vertices and 1587 triangular cells. Note that all vertices and cells correspond to one another across subjects, but this may not be the case in most applications. This allows us to compare statistics computed in the original data space and in the space of beta distributions. We compute the relative area change of each cell $i$ between the end of diastole (ED) and the end of systole (ES):
$$r_i = \frac{a_i(\mathrm{ES}) - a_i(\mathrm{ED})}{a_i(\mathrm{ED})},$$
where $a_i(t)$ is the area of the triangle $i$ at time $t$ for a given mesh.
This results in a distribution of values greater than $-1$. Due to the tissue's nature, the values should in fact be bounded by constants $-1 < m < M$. However, acquisition noise causes some values to lie outside this range, although this is physically infeasible. We thus normalize the data to lie in the interval $[0, 1]$ by applying the transformation
$$x \mapsto \frac{x - x_{\min}}{x_{\max} - x_{\min}},$$
where $x_{\min}$ and $x_{\max}$ denote the extremal values observed over the population.
Finally, we map each patient's histogram to the space of beta distributions by estimating the parameters by maximum likelihood (ML). Fig. 2 represents the data for a given patient with the estimated probability density function, and Fig. 3 the obtained representation of the population's histograms in the manifold of beta distributions.
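As an illustration of the ML estimation step, the following sketch fits a beta distribution to synthetic data in $(0, 1)$; pinning the support via `floc`/`fscale` is an assumed implementation detail, not a description of the original processing pipeline:

```python
import numpy as np
from scipy.stats import beta as beta_dist

rng = np.random.default_rng(0)
samples = rng.beta(2.0, 5.0, size=5000)   # stand-in for a normalized histogram

# floc=0 and fscale=1 pin the support to [0, 1], so that only the two shape
# parameters (alpha, beta) are estimated by maximum likelihood.
a_hat, b_hat, _, _ = beta_dist.fit(samples, floc=0, fscale=1)
```

Each subject is thus mapped to a single point $(\hat\alpha, \hat\beta)$ in the two-dimensional parameter space.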
Cortical Thickness maps
The measure of brain atrophy is a crucial tool in the analysis of neurodegenerative diseases. In particular, Cortical Thickness (CTh) measured by structural Magnetic Resonance Imaging (MRI) has been proposed as a biomarker for the early diagnosis of Alzheimer's disease (Pini et al., 2016; Busovaca et al., 2016) and for the prediction of conversion from Mild Cognitive Impairment to Alzheimer's Disease (Wei et al., 2016; Sun et al., 2017). In many of these studies, central tendency measures such as the mean or the median are used to represent the biomarker, either in the whole cortex, in regions of interest or in the voxels, leading to an important loss of information. More recent studies used histogram analysis (Giulietti et al., 2018; Ruiz et al., 2018), and in (Rebbah et al., 2019), the authors use the Fisher geometry of generalized gamma distributions to perform this analysis. However, cortical thickness is a bounded quantity, and therefore beta distributions seem to be more natural candidates to model its distribution. Moreover, the negative curvature of the beta geometry makes it a suitable candidate in clustering procedures based on computation of cluster means, such as the K-means algorithm used here.
The data used in this application were extracted from MR scans selected from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database (http://adni.loni.usc.edu/about/). The initial subjects were not age- and sex-matched, and our procedure consisted in randomly selecting subjects so as to obtain matched groups. In addition, some of the subjects were excluded because of a low diagnosis reliability (according to ADNI criteria) and others because of unsuccessful CTh measurement due to poor image quality. The resulting population is composed of 143 subjects: 71 healthy control subjects and 72 Alzheimer's disease patients.
CTh was measured from 3D T1-weighted MR images using the Matlab toolbox CorThiZon (Querbes et al., 2009). There is no unique definition of CTh. In (MacDonald et al., 2000) it is obtained as the distance between uniquely associated pairs of points on bounding surfaces. Uncoupled methods do not rely on a given set of associated landmarks but create them using techniques coming from shape matching (Tagare, 1997)
. Finally, a quite different category of algorithms makes use of an elliptic partial differential equation to obtain a field of deformation between the boundaries (Yezzi and Prince, 2003). The CorThiZon toolbox, developed at Inserm, implements a Laplace-equation-based method that falls within this last category.
For each subject, the data consist of measurements of CTh along the whole cortical ribbon. The overall procedure is the one described in (Querbes et al., 2009), with a normalized voxel size of 1mm along all directions. Due to the variability of the size and shape of the brain among the population of study, we obtain samples of unequal length. This results in the lack of a common representation space for the data, a problem that is solved by the histogram-based approach that we propose here. Indeed, just as for the AS data, we compute for each patient the histogram of CTh measurements, from which we estimate the parameters of a beta distribution by ML. Histograms are normalized beforehand with respect to the maximal CTh value among the population.
In both cases, the subjects are divided into two classes: diseased and controls. In order to assess whether the proposed representation and geometry are relevant, we perform supervised and unsupervised classification in the Riemannian manifold of beta distributions. We compare the results both when using the Riemannian metric and the Euclidean metric on the parameter space. We chose to use the K-nearest-neighbor (KNN) and K-means algorithms as they only require computation of distances on non-linear manifolds. Both supervised (SKM) and unsupervised (UKM) versions of K-means are compared. Finally, in the case of the AS data, as we have cell-to-cell correspondence, we can run the algorithms in the original data space $\mathbb{R}^{1587}$ to assess the loss of information when considering only two-parameter distributions instead of the whole histograms. We also compare our method with a standard dimensionality reduction performed by principal component analysis (PCA). This cannot be done for the CTh data since the mapping between values and brain regions differs across patients. All models are trained and tested in a 5-fold cross-validation fashion, and we assess the classification accuracy in all cases. We used the scikit-learn package (Pedregosa et al., 2011) to perform these experiments.
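Since KNN only needs pairwise distances, it extends directly to the beta manifold. A minimal sketch of this step is given below; the toy points and the Euclidean placeholder distance are purely illustrative, and in practice each distance call would return the Fisher geodesic distance obtained by numerically solving the boundary value problem for the geodesic equation:

```python
import numpy as np

# Toy (alpha, beta) parameter points for two hypothetical classes of histograms
X = np.array([[1.0, 1.0], [1.2, 1.1], [1.1, 0.9],
              [4.0, 4.0], [4.2, 3.9], [3.8, 4.1]])
y = np.array([0, 0, 0, 1, 1, 1])

# Placeholder distance: Euclidean on the parameter plane.
def dist(a, b):
    return np.linalg.norm(a - b)

def knn_predict(x, X, y, dist, k=3):
    """Classify x by majority vote among its k nearest neighbors under dist."""
    d = np.array([dist(x, p) for p in X])
    neighbors = y[np.argsort(d)[:k]]
    return np.bincount(neighbors).argmax()

pred = [knn_predict(x, X, y, dist) for x in np.array([[1.05, 1.0], [4.1, 4.0]])]
```

Swapping `dist` for the Riemannian distance (and the cluster mean for the Fréchet mean in K-means) is all that is required to move the algorithms to the manifold setting.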
Tables 1 and 2 show the mean classification accuracy over 5-fold cross validation for each setting: (1) the space of parameters of the beta distributions fitted to the data-point histograms and, when applicable, (2) the Euclidean space of the original data, and (3) the space spanned by the first two principal components obtained by PCA. The number of neighbors for the KNN algorithm is fixed beforehand; this choice is compared to other values through 5-fold cross-validation, see Fig. 5 and 6.
|KNN||0.81 (0.05)||0.67 (0.32)||0.77 (0.09)||0.83 (0.05)|
|SKM||0.85 (0.03)||0.72 (0.28)||0.66 (0.06)||0.81 (0.06)|
Mean (and standard deviation) of the classification accuracy for the AS data on 5-fold cross-validation
|KNN||0.77 (0.05)||0.79 (0.04)|
|SKM||0.66 (0.10)||0.80 (0.05)|
Mean (and standard deviation) of the classification accuracy for the CTh data on 5-fold cross-validation
These results show that in the space of beta parameters, the Fisher information geometry performs significantly better than the Euclidean geometry. Moreover, this performance is comparable to that of classification in the original data space, despite the considerable reduction of dimension (from 1587 to 2). This is illustrated in Fig. 5, where the two methods alternately perform better as the number of neighbors changes.
In comparison, the two-dimensional representation space provided by PCA performs significantly worse on average. This can be explained by the fact that the performance of dimensionality reduction by PCA depends on the choice of training set, and certain train/test splittings lead to poor performance. This is illustrated in Fig. 7: in one of the cross-validation splittings, the test set for each class in the PCA representation is closer to the center of the other class computed from the training set. This has the effect of lowering the average accuracy over the whole cross-validation process. This problem does not occur in the two-dimensional beta representation, as the mapping from one data-point to a pair of parameters does not depend on the other points present in the data-set.
The clustering performance of the K-means algorithm is shown for the same settings in Tables 3 and 4. These results confirm that the Fisher information geometry in the space of beta parameters is more suited to compare histograms than Euclidean geometry. Unlike in the supervised setting, the two-dimensional representation given by PCA performs well, since in this case the whole data-set was used to perform the dimensionality reduction.
In this work, we have studied the geometry of the beta manifold and shown in particular that it is negatively curved. We have illustrated its use as a non-linear representation space for histograms of medical data, which presents the advantage of being low-dimensional and easy to visualize. This common representation is also particularly helpful when the number of measurements varies from one subject to another. The examples studied here tend to show that the Fisher information metric is a better adapted choice of metric on this space than the Euclidean one to compare and classify histograms. Finally, the dimensionality reduction performed in the process does not result in a significant loss in performance. Moreover, unlike a standard method like PCA, the quality of the dimensionality reduction does not rely on the choice and size of the data-set: each data-point is mapped to an element of the beta manifold independently of the others. This makes the beta representation particularly interesting for small data-sets. Comparison with other geometries, such as that of the gamma and generalized gamma manifolds, and exploration of other supervised and unsupervised classification algorithms will be the object of future work.
The authors would like to warmly thank Pamela Moceri (CHU Nice) and Nicolas Duchateau (Université Lyon 1) for acquiring and curating the heart data. N. Guigui has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation program (grant agreement G-Statistics No 786854). This research has been conducted within the FP2M federation (CNRS FR 2036)
- Arwini and Dodson (2008) Arwini, K. and Dodson, C.T. (2008). Information geometry: near randomness and near independence. Springer Science & Business Media.
- Atkinson and Mitchell (1981) Atkinson, C. and Mitchell, A.F. (1981). Rao’s distance measure. Sankhyā: The Indian Journal of Statistics, Series A, 345–365.
- Bauer et al. (2016) Bauer, M., Bruveris, M., and Michor, P.W. (2016). Uniqueness of the Fisher–Rao metric on the space of smooth densities. Bulletin of the London Mathematical Society, 48(3), 499–506.
- Busovaca et al. (2016) Busovaca, E., Zimmerman, M.E., Meier, I.B., Griffith, E.Y., Grieve, S.M., Korgaonkar, M.S., Williams, L.M., and Brickman, A.M. (2016). Is the Alzheimer’s disease cortical thickness signature a biological marker for memory? Brain imaging and behavior, 10(2), 517–523.
- Cencov (2000) Cencov, N.N. (2000). Statistical decision rules and optimal inference. 53. American Mathematical Soc.
- Fisher (1922) Fisher, R.A. (1922). On the mathematical foundations of theoretical statistics. Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character, 222(594-604), 309–368.
- Friedrich (1991) Friedrich, T. (1991). Die Fisher-information und symplektische strukturen. Mathematische Nachrichten, 153(1), 273–296.
- Giulietti et al. (2018) Giulietti, G., Torso, M., Serra, L., Spanò, B., Marra, C., Caltagirone, C., Cercignani, M., Bozzali, M., and the Alzheimer's Disease Neuroimaging Initiative (ADNI) (2018). Whole brain white matter histogram analysis of diffusion tensor imaging data detects microstructural damage in mild cognitive impairment and Alzheimer's disease patients. Journal of Magnetic Resonance Imaging, 48(3), 767–779.
- Karcher (1977) Karcher, H. (1977). Riemannian center of mass and mollifier smoothing. Communications on pure and applied mathematics, 30(5), 509–541.
- Lloyd (1982) Lloyd, S. (1982). Least squares quantization in PCM. IEEE transactions on information theory, 28(2), 129–137.
- MacDonald et al. (2000) MacDonald, D., Kabani, N., Avis, D., and Evans, A.C. (2000). Automated 3-D extraction of inner and outer surfaces of cerebral cortex from MRI. NeuroImage, 12(3), 340 – 356.
- Moceri et al. (2017) Moceri, P., Duchateau, N., Baudouy, D., Schouver, E.D., Leroy, S., Squara, F., Ferrari, E., and Sermesant, M. (2017). Three-dimensional right-ventricular regional deformation and survival in pulmonary hypertension. European Heart Journal - Cardiovascular Imaging, 19(4), 450–458. doi:10.1093/ehjci/jex163.
- O’Neill and Roberts (1999) O’Neill, P.D. and Roberts, G.O. (1999). Bayesian inference for partially observed stochastic epidemics. Journal of the Royal Statistical Society: Series A (Statistics in Society), 162(1), 121–129.
- Pedregosa et al. (2011) Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.
- Pennec (2006) Pennec, X. (2006). Intrinsic statistics on Riemannian manifolds: Basic tools for geometric measurements. Journal of Mathematical Imaging and Vision, 25(1), 127.
- Pini et al. (2016) Pini, L., Pievani, M., Bocchetta, M., Altomare, D., Bosco, P., Cavedo, E., Galluzzi, S., Marizzoni, M., and Frisoni, G.B. (2016). Brain atrophy in Alzheimer’s disease and aging. Ageing research reviews, 30, 25–48.
- Querbes et al. (2009) Querbes, O., Aubry, F., Pariente, J., Lotterie, J.A., Démonet, J.F., Duret, V., Puel, M., Berry, I., Fort, J.C., Celsis, P., and Initiative, T.A.D.N. (2009). Early diagnosis of Alzheimer’s disease using cortical thickness: impact of cognitive reserve. Brain, 132(8), 2036–2047. doi:10.1093/brain/awp105.
- Rao (1992) Rao, C.R. (1992). Information and the accuracy attainable in the estimation of statistical parameters. In Breakthroughs in statistics, 235–247. Springer.
- Rebbah et al. (2019) Rebbah, S., Nicol, F., and Puechmorel, S. (2019). The geometry of the generalized gamma manifold and an application to medical imaging. Mathematics, 7(8), 674.
- Ruiz et al. (2018) Ruiz, E., Ramírez, J., Górriz, J.M., Casillas, J., and the Alzheimer's Disease Neuroimaging Initiative (2018). Alzheimer's disease computer-aided diagnosis: Histogram-based analysis of regional MRI volumes for feature selection and classification. Journal of Alzheimer's Disease, 65(3), 819–842.
- Said et al. (2019) Said, S., Bombrun, L., and Berthoumieu, Y. (2019). Warped Riemannian metrics for location-scale models. In Geometric Structures of Information, 251–296. Springer.
- Skovgaard (1984) Skovgaard, L.T. (1984). A Riemannian geometry of the multivariate normal model. Scandinavian Journal of Statistics, 211–223.
- Sun et al. (2017) Sun, Z., van de Giessen, M., Lelieveldt, B.P.F., and Staring, M. (2017). Detection of conversion from Mild Cognitive Impairment to Alzheimer’s disease using longitudinal brain MRI. Frontiers in Neuroinformatics, 11. doi:10.3389/fninf.2017.00016. URL https://doi.org/10.3389/fninf.2017.00016.
- Tagare (1997) Tagare, H.D. (1997). Deformable 2-D template matching using orthogonal curves. IEEE Transactions on Medical Imaging, 16(1), 108–117.
- Totaro (2004) Totaro, B. (2004). The curvature of a Hessian metric. International Journal of Mathematics, 15(04), 369–391.
- Wei et al. (2016) Wei, R., Li, C., Fogelson, N., and Li, L. (2016). Prediction of conversion from Mild Cognitive Impairment to Alzheimer’s disease using MRI and structural network features. Frontiers in Aging Neuroscience, 8. doi:10.3389/fnagi.2016.00076. URL https://doi.org/10.3389/fnagi.2016.00076.
- Yang and Fang (2017) Yang, S. and Fang, Z. (2017). Beta approximation of ratio distribution and its application to next generation sequencing read counts. Journal of applied statistics, 44(1), 57–70.
- Yang (2017) Yang, Z.H. (2017). Some properties of the divided difference of psi and polygamma functions. Journal of Mathematical Analysis and Applications, 455(1), 761–777. doi:10.1016/j.jmaa.2017.05.081.
- Yezzi and Prince (2003) Yezzi, A. and Prince, J. (2003). An Eulerian PDE approach for computing tissue thickness. IEEE Transactions on Medical Imaging, 22(10), 1332–1339. doi:10.1109/TMI.2003.817775.
Appendix A Proof of Proposition 1
The geodesic equations are given by
$$\ddot\theta^k + \Gamma^k_{ij}\, \dot\theta^i \dot\theta^j = 0, \qquad k = 1, 2,$$
where the $\Gamma^k_{ij}$'s denote the Christoffel symbols of the second kind. These can be obtained from the Christoffel symbols of the first kind $\Gamma_{ij,k}$ and the coefficients $g^{kl}$ of the inverse of the metric matrix:
$$\Gamma^k_{ij} = g^{kl}\, \Gamma_{ij,l}.$$
Here we have used the Einstein summation convention. Since the Fisher metric is a Hessian metric (see Totaro (2004)), the Christoffel symbols of the first kind can be obtained as
$$\Gamma_{ij,k} = \frac{1}{2}\, \varphi_{ijk},$$
where $\varphi$ is the log-partition function (1) and lower indices denote partial derivatives. Straightforward computation yields the desired equations.
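As an illustrative sanity check (not part of the original derivation), the Hessian structure of the metric can be verified numerically by comparing a finite-difference Hessian of the log-partition function with the trigamma expression (2); the following sketch assumes scipy is available:

```python
import numpy as np
from scipy.special import gammaln, polygamma

def phi(a, b):
    """Log-partition function of the beta family, phi = ln B(a, b)."""
    return gammaln(a) + gammaln(b) - gammaln(a + b)

a0, b0, h = 2.0, 3.0, 1e-4

# Central finite differences for the Hessian of phi at (a0, b0)
H = np.empty((2, 2))
H[0, 0] = (phi(a0 + h, b0) - 2 * phi(a0, b0) + phi(a0 - h, b0)) / h ** 2
H[1, 1] = (phi(a0, b0 + h) - 2 * phi(a0, b0) + phi(a0, b0 - h)) / h ** 2
H[0, 1] = H[1, 0] = (phi(a0 + h, b0 + h) - phi(a0 + h, b0 - h)
                     - phi(a0 - h, b0 + h) + phi(a0 - h, b0 - h)) / (4 * h ** 2)

# Closed-form Fisher matrix from the trigamma function psi'
tri = polygamma(1, [a0, b0, a0 + b0])
G = np.array([[tri[0] - tri[2], -tri[2]], [-tri[2], tri[1] - tri[2]]])
```

The two matrices agree up to the discretization error of the finite differences.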
Appendix B Proof of Theorem 1
The sectional curvature of a Hessian metric can be expressed in terms of the second and third partial derivatives of the potential (Totaro, 2004), where lower indices denote partial derivatives with respect to the corresponding variables and $\varphi$ is the log-partition function (1). Computing these derivatives yields polygamma expressions, and the determinant of the metric is given by
$$\det I(\alpha, \beta) = \psi'(\alpha)\psi'(\beta) - \psi'(\alpha + \beta)\big(\psi'(\alpha) + \psi'(\beta)\big).$$
Factorizing the numerator yields a product of two factors. Since $\psi''$ is negative on $(0, +\infty)$, the first factor is negative. Moreover, it has been shown in (Yang, 2017, Corollary 4) that the relevant ratio of polygamma functions is sub-additive. This means that the second factor is positive, yielding the desired result.