In neuroimaging studies, or more generally in shape analysis, normalising a set of subjects consists in deforming them towards a common space that allows one-to-one correspondence. Finding this common space usually reduces to finding an optimal shape in terms of distance to all subjects in the space of deformations. However, the covariance structure of these deformations is not known a priori
and the deformation metric generally involves penalising roughness. Yet, in a Bayesian setting, a prior that is informative of population variance would make the registration process, which relies ona posterioriestimates, more robust. Shape models aim to learn this covariance structure from the data. As deformations are parameterised in a very high-dimensional space, learning their covariance is computationally intractable. Dimensionality reduction techniques are therefore commonly used, even though some have tackled this problem by parameterising deformations using location-adaptive control points .
For a long time, due to their computational complexity, shape modelling approaches had only been applied to simple models of deformations  or simple data sets [6, 9]. Recently, Zhang and Fletcher applied a probabilistic shape model, named principal geodesic analysis, to densely sampled diffeomorphisms and 3D MR images of the brain . Diffeomorphisms correspond to a particular family of deformations that are ensured to be invertible, allowing for very large deformations. Geodesic shooting of diffeomorphisms involves specifying a Riemannian metric on their tangent space and allows a diffeomorphism to to be entirely parameterised by its initial velocity [11, 3]. However, because Zhang and Fletcher’s optimisation scheme relies either on Gradient descent or on Monte Carlo sampling of the posterior, they have focused on effectively reducing the dimensionality of velocity fields. It results in an effective approach for studying the principle modes of variations, but may give less accurate alignment than with classical approaches. In particular, they do not explicitly model “anatomical noise”, i.e., deformations that are not captured by the principal modes.
Here, we propose a generative shape model, whose posterior is inferred using variational inference and Laplace approximations. A residual velocity field capturing anatomical noise is explicitly defined and its magnitude is inferred from the data. An efficient Gauss-Newton optimisation is used to obtain the maximum a posteriori latent subspace as well as individual coordinates, minimising the chances of falling into local minima, and making the registration more robust.
2.1 Generative shape model
First, let us define a generative model of brain shape. Here, the observed variables are supposed to be categorical images (i.e., segmentations) comprising classes — e.g. grey matter, white matter, background — stemming from a categorical distribution. This kind of data term has proved very effective for driving registration  and is compatible with unified models of registration and segmentation. If needed, it is straightforward to replace this term with a stationary Gaussian noise model. The template
encodes prior probabilities of finding each of theclasses in a given location, and is deformed towards the -th subject according to the spatial transform . In practice, its log-representation
is encoded by trilinear basis functions, and the deformed template is recovered by softmax interpolation:
Note that the discrete operation can be equivalently written as the matrix multiplication, , where is a large and sparse matrix that depends on and performs the combined “sample and interpolate” operation. We will name this operation pulling, while the multiplication by its transpose, , will be named pushing.
Let be the set of observed images111We assume that they all have the same lattice, but this condition can be waived by composing each diffeomorphic transform with a fixed “change of lattice” transform, which can even embed a rigid-body alignment.. For each subject, let be the diffeomorphic transformation, and let be the deformed template. The likelihood of observed voxel values at locations is:
In this work, diffeomorphisms are defined as geodesics, according to a Riemannian metric222In this work, it is a combination of membrane, bending and linear-elastic energies. defined by a positive definite operator named , and are thus entirely parameterised by their initial velocity . A complete transformation is recovered by integrating the velocity in time, knowing that the momentum , is conserved at any :
The velocity field can be retrieved from the momentum field by performing the inverse operation , where is ’s Green’s function Because we want to model inter-individual variability, we need them to be all defined in the same (template) space, which is achieved by using the initial velocity of their inverse333The initial velocity of is the opposite of the final velocity of , and vice versa., that we name . Following , we use the probabilistic principal component analysis (PPCA) framework to regularise initial velocity fields, which leads to writing them as a linear combination of principal modes plus a residual field. Let us assume that we want to model them with principal components. Then, let be a matrix (called the principal subspace), each column being one principal component, let be the latent representation of a given velocity field in the principal subspace and let be the corresponding residual field. This yields . In , latent coordinates stem from a standard Gaussian and is i.i.d. Gaussian noise, and a maximum-likelihood estimate of the principal subspace is retrieved. A Bayesian version can be designed by placing a Gaussian prior on each principal component . Smooth velocities can be enforced with a smooth prior over each principal component and over the residual field, and a Gaussian prior over the latent coordinates, yielding:
where is the discretisation of and is the anatomical noise precision. However, this approach is often not regularised enough. Zhang and Fletcher 
proposed a different prior, which can be seen as being set over the reconstructed velocities. In practice, it takes the form of a joint distribution over all latent coordinates, residual fields and the principal subspace:
The advantage of the first formulation (4-6) is that it explicates the covariance matrix of the latent coordinates and the noise precision, which can then be inferred from the data. Additionally, it could be extended to multimodal latent distributions such as Gaussian mixtures. However, the second formulation (7) is better at effectively regularising the principal subspace and, in general, the reconstructed velocities. In practice, we use a weighted combination of these two formulations, and call the weights and .
The noise precision, , can be inferred in a Bayesian fashion by introducing a conjugate Gamma prior444The Gamma prior is a parameterised such that . with and as shown in 
. Similarly, the latent covariance matrix is given a conjugate Wishart prior, which is made as non-informative as possible by setting its degrees of freedom to
, and whose expected value is the identity matrix,i.e., . This prior has the opposite effect of an automatic relevance determination prior, since it prevents principal components from collapsing during the first iterations by promoting non-null variances.
2.2 Inference555 is used for approximate posteriors and for posterior expected values. Superscript stars denote optimal approximations. means “equal up to an additive constant”.
A basic inference scheme would be to search for a mode of the model likelihood, by optimising in turn each parameter of the model. It is however more consistent to tackle this problem as one of missing data, which is dealt with by computing the posterior distribution over all latent variables. Unfortunately, the posterior does not possess a tractable form. A solution is to use variational inference to describe an approximate posterior that can be more easily computed, by restricting the search space to distributions that factorise over a subset of variables . This method allows the uncertainty about parameters estimates to be accounted for when inferring other parameters. Here, for computational reasons, we do not perform a fully Bayesian treatment of the problem and look for mode estimates of the principal subspace and template. We still marginalise over all subject-specific parameters (latent coordinates and residual field), as recommended by . We state that the set of marginalised latent variables is and make the (mean field) approximation that the posterior factorises over , , and .
Since we used conjugate priors for the latent precision matrix and the anatomical noise precision, their posterior have the same form as their prior and update equations are equivalent to computing a weighted average between their prior expected value and their maximum likelihood estimator. In contrast, posterior distributions ofand have no simple form. We thus make a Laplace approximation and estimate their mean (, ) and covariance (, ) with their mode and second derivatives about this mode. They are obtained by Gauss-Newton optimisation and the corresponding derivations are provided in Sec. 2.3. Because of the non-linearity induced by geodesic shooting and template interpolation, we first make the approximation that:
where means the posterior expected value and is the template deformed according to the above parameter estimates. Consequently, we find:
2.3 Gauss-Newton optimisation
Gauss-Newton (GN) optimisation of an objective function
with respect to a vector of parametersconsists of iteratively improving the objective function by making, locally, a second-order approximation. The gradient, , and Hessian matrix, , are computed about the current best estimate of the optimal parameters, , and the new optimum is found according to . In practice, this update scheme sometimes overshoots it is therefore common to perform a backtracking line search along the direction .
Differentiating the data term:
Let us write the (negative) data term for an arbitrary subject:
Following , differentiating with respect to can be approximated by differentiating with respect to
and applying the chain rule, which yields:
where is the gradient of the log-Categorical distribution and takes the form of an vector field, and contains spatial gradients of the log-template and takes the form of an vector field. Second derivatives can be approximated by:
where is the Hessian of the log-Categorical distribution and takes the form of an
symmetric tensor field. The gradient and Hessian ofwere derived in  and can be computed in each voxel according to:
Since , derivatives with respect to , and are obtained by applying the chain rule.
The PPCA formulation is invariant to rotations inside the latent space . Consequently, it allows finding an optimal subspace but does not enforce the individual bases to be the eigenmodes of the complete covariance matrix. It makes sense, however, to transform the subspace so that it corresponds to the first eigenmodes as it eases the interpretation. Also, in order to enforce a sparse Hessian matrix over , we require to be diagonal, where the columns of are the individual . This leads us to look for an transformation matrix that keeps the actual diffeomorphisms untouched (), while diagonalising both and
. This is done by a series of singular value decompositions that insure thatis diagonal and is the identity. However, the distribution of diagonal weights between and is not optimal and we optimise an additional diagonal scaling matrix by alternating between updating from the rotated , and updating the scaling weights by Gauss-Newton optimisation of the remaining terms of the lower bound that depend on them:
3 Experiments and results
We ran the algorithm on a training set consisting of the first 38 subjects of the OASIS cross-sectional database . We used the provided FSL segmentations, which we transformed into tissue probability maps by extracting the grey and white matter classes and smoothing them with a 1-voxel FWHM Gaussian kernel. We set the number of principal components to 32, the parameters of the membrane, bending and linear-elastic energies were respectively set to 0.001, 0.02 and , and we used . We set the prior parameters of the residual precision magnitude based on tests conducted on 2D axial slices (, ). Templates reconstructed with a varying number of principal modes, and with or without the residual field, are presented in Fig. 2, while Fig. 3 shows the template deformed along the first two principal modes, the first one being typical of brain ageing. This pattern is validated by plotting coordinates along the first dimension against actual ages. Finally, the learnt model was tested by registering the template towards 38 unseen images from the OASIS database. The distribution of categorical log-likelihood values for the training and testing sets are depicted in Fig. 4, along with two example fits that were randomly selected. Both sets have similar distributions (mean std . Train: ; Test: ), showing the model’s robustness.
We presented a generative model of brain shape that does not limit the space of diffeomorphisms, allowing learning regularisation while preserving enough flexibility for accurate normalisation. We showed how principal modes of variation correlate with known factors of brain shape variability, hinting towards the fact that this low-dimensional representation might allow to discriminate between physiological states. Future research will focus on applying this framework to very large databases, and on combining it with segmentation models in order to work directly with raw data. The latent distribution may be improved by using multimodal priors such as Gaussian mixtures. Our main limitation is the small number of principal basis that can be learned due to their large size. This could be overcome by explicitly modelling sparse and local covariance patterns.
-  Allassonnière, S., Amit, Y., Trouvé, A.: Towards a coherent statistical framework for dense deformable template estimation. J. R. Stat. Soc. Ser. B Stat. Methodol. 69(1), 3–29 (2007)
-  Ashburner, J., Friston, K.J.: Computing average shaped tissue probability templates. NeuroImage 45(2), 333–341 (2009)
-  Ashburner, J., Friston, K.J.: Diffeomorphic registration using geodesic shooting and Gauss–Newton optimisation. NeuroImage 55(3), 954–967 (2011)
-  Bishop, C.M.: Bayesian PCA. In: Kearns, M.J., Solla, S.A., Cohn, D.A. (eds.) NIPS 11. pp. 382–388. MIT Press (1999)
-  Cootes, T.F., Twining, C.J., Babalola, K.O., Taylor, C.J.: Diffeomorphic statistical shape models. Image Vis. Comput. 26(3), 326–332 (2008)
-  Cootes, T., Hill, A., Taylor, C., Haslam, J.: Use of active shape models for locating structures in medical images. Image Vis. Comput. 12(6), 355–365 (1994)
-  Durrleman, S., Allassonnière, S., Joshi, S.: Sparse Adaptive Parameterization of Variability in Image Ensembles. Int. J. Comput. Vis. 101(1), 161–183 (2013)
-  Fletcher, P.T., Lu, C., Pizer, S.M., Joshi, S.: Principal geodesic analysis for the study of nonlinear statistics of shape. IEEE Trans. Med. Imaging 23(8), 995–1005 (2004)
-  Marcus, D.S., Wang, T.H., Parker, J., Csernansky, J.G., Morris, J.C., Buckner, R.L.: Open Access Series of Imaging Studies (OASIS): Cross-sectional MRI Data in Young, Middle Aged, Nondemented, and Demented Older Adults. J. Cognit. Neurosci. 19(9), 1498–1507 (2007)
-  Miller, M.I., Trouvé, A., Younes, L.: Geodesic Shooting for Computational Anatomy. J. Math Imaging Vis. 24(2), 209–228 (2006)
-  Simpson, I.J.A., Schnabel, J.A., Groves, A.R., Andersson, J.L.R., Woolrich, M.W.: Probabilistic inference of regularisation in non-rigid registration. NeuroImage 59(3), 2438–2451 (2012)
-  Tipping, M.E., Bishop, C.M.: Probabilistic Principal Component Analysis. J. R. Stat. Soc. Ser. B Stat. Methodol. 61(3), 611–622 (1999)
-  Zhang, M., Fletcher, P.T.: Bayesian principal geodesic analysis for estimating intrinsic diffeomorphic image variability. Med. Image Anal. 25(1), 37–44 (2015)