Learning Conditional Deformable Templates with Convolutional Networks

08/07/2019 ∙ by Adrian V. Dalca, et al. ∙ MIT

We develop a learning framework for building deformable templates, which play a fundamental role in many image analysis and computational anatomy tasks. Conventional methods for template creation and image alignment to the template have undergone decades of rich technical development. In these frameworks, templates are constructed using an iterative process of template estimation and alignment, which is often computationally very expensive. Due in part to this shortcoming, most methods compute a single template for the entire population of images, or a few templates for specific sub-groups of the data. In this work, we present a probabilistic model and efficient learning strategy that yields either universal or conditional templates, jointly with a neural network that provides efficient alignment of the images to these templates. We demonstrate the usefulness of this method on a variety of domains, with a special focus on neuroimaging. This is particularly useful for clinical applications where a pre-existing template does not exist, or creating a new one with traditional methods can be prohibitively expensive. Our code and atlases are available online as part of the VoxelMorph library at http://voxelmorph.csail.mit.edu.




1 Introduction

A deformable template is an image that can be geometrically deformed to match images in a dataset, providing a common reference frame. Templates are a powerful tool that enables the analysis of geometric variability. They have been used in computer vision Felzenszwalb and Schwartz (2007); Jain et al. (1996); Kim et al. (2013), medical image analysis Allassonnière et al. (2007); Davis et al. (2004); Joshi et al. (2004); Ma et al. (2008), graphics Kokkinos et al. (2012); Stoll et al. (2006), and time series signals Abdulla et al. (2003); Yurtman and Barshan (2014). We are motivated by the study of anatomical variability in neuroimaging, where collections of scans are mapped to a common template with anatomical and/or functional landmarks. However, the methods developed here are applicable to other domains.

Analysis with a deformable template is often done by computing a smooth deformation field that aligns the template to another image. The deformation field can be used to derive a measure of the differences between the two images. Rapidly obtaining this field to a given template is by itself a challenging task and the focus of extensive research.

A template can be chosen as one of the images in a given dataset, but often these do not represent the structural variability and complexity in the image collection, and can lead to biased and misleading analyses Joshi et al. (2004). If the template does not adequately represent dataset variability, such as the possible anatomy, it becomes challenging to accurately deform the template to some images. A good template therefore minimizes the geometric distance to all images in a dataset, and there has been extensive methodological development for finding such a central template Allassonnière et al. (2007); Davis et al. (2004); Joshi et al. (2004); Ma et al. (2008). However, these templates are obtained through a costly global optimization procedure and domain-specific heuristics, requiring extensive runtimes. For complex 3D images such as MRI, this process can consume days to weeks. In practice, this leads to few templates being constructed, and researchers often use templates that are not optimal for their dataset. Our work makes it easy and computationally efficient to generate deformable templates.

While deformable templates are powerful, a single template may be inadequate at capturing the variability in a large dataset. Existing methods alleviate this problem by grouping subpopulations, usually along a single attribute, and computing separate templates for each group. This approach relies on arbitrary decisions about the attributes and thresholds used for subdividing the dataset. Furthermore, each template is only constructed based on a subset of the data, thus exploiting fewer images, leading to sub-optimal templates. Instead, we propose a learning-based approach that can compute on-demand conditional deformable templates by leveraging the entire collection. Our framework enables the use of multiple attributes, continuous (e.g., age) or discrete (e.g., sex), to condition the template on, without needing to apply arbitrary thresholding or subdividing a dataset.

We formulate template estimation as a learning problem and describe a novel method to tackle it.

  1. We describe a probabilistic spatial deformation model based on diffeomorphisms. We then develop a general, end-to-end framework using convolutional neural networks that jointly synthesizes templates and rapidly provides the deformation field to any new image.

  2. This framework also enables learning a conditional template function given instance attributes, such as the age and sex of the subject in an MRI. Once learned, this function enables rapid synthesis of on-demand conditional templates. For example, it could construct a 3D brain MRI template for 35-year-old women.

  3. We demonstrate the template construction method and its utility on a variety of datasets, including a large neuroimaging study. In addition, we show preliminary experiments indicating characteristics and interesting results of the model. For example, this formulation can be extended to learn image representations up to a deformation.

Conditional templates capture important trends related to attributes, and are useful for dealing with confounders. For example, in studying disease impact, for some tasks it may be helpful to register scans to age-specific templates rather than one covering a wide age range.

Figure 1: Conditional deformable templates generated by our method. Left: slices from 3D brain templates conditioned on age; Right: MNIST templates conditioned on class label.

2 Related Works

2.1 Spatial Alignment (Image Registration)

Spatial alignment, or registration, between two images is a building block for estimation of deformable templates. Alignment usually involves two steps: a global affine transformation, and a deformable transformation (as in many optical flow applications). In this work we focus on, and make use of, deformable transformations.

There is extensive work in deformable image registration methods Ashburner (2007); Avants et al. (2008); Bajcsy and Kovacic (1989); Beg and others (2005); Dalca and others (2016); Glocker and others (2008); Thirion (1998); Yeo et al. (2010); Zhang and others (2017). Conventional frameworks optimize a regularized dense deformation field that matches one image with the other Bajcsy and Kovacic (1989); Thirion (1998). Diffeomorphic transforms are topology-preserving and invertible, and have been widely used in computational neuroanatomy analysis Avants et al. (2008); Ashburner (2007); Beg and others (2005); Cao et al. (2005); Ceritoglu et al. (2009); Hernandez et al. (2009); Joshi and Miller (2000); Miller et al. (2005); Oishi et al. (2009); Vercauteren and others (2009); Zhang and others (2017). While extensively studied, conventional registration algorithms require an optimization for every pair of images, leading to long runtimes in practice.

Recently, learning-based registration methods have been proposed that offer a significant speed-up at test time Balakrishnan et al. (2018a, b); Cao et al. (2017); Dalca et al. (2018a, 2019); de Vos and others (2017); Krebs and others (2017); Krebs et al. (2019); Rohé and others (2017); Sokooti and others (2017); Yang and others (2017). These methods learn a network that computes the deformation field, either in a supervised (using ground truth deformations), unsupervised (using classical energy functions), or semi-supervised setting. These algorithms have been used for registering an image to an existing template. However, in many realistic scenarios, a template is not readily available, for example in a clinical study that uses a particular scan protocol. We build on these ideas in our learning strategy, but jointly estimate a registration network and a conditional deformable template in an unsupervised setting.

Optical flow methods are closely related to image registration, finding a dense displacement field for a pair of 2D images. Similar to registration, classical approaches solve an optimization problem, often using variational methods Brox and others (2004); Horn and Schunck (1980); Sun and others (2010). Learning-based optical flow methods use convolutional neural networks to learn the dense displacement fields Ahmadi and Patras (2016); Dosovitskiy and others (2015); Ilg et al. (2017); Jason et al. (2016); Ranjan and Black (2017); Tran et al. (2016).

2.2 Template Construction

Deformable templates, or atlases, are widely used in computational anatomy. Specifically, the deformation fields from this template to individual images are often carefully analyzed to understand population variability. The template is usually constructed through an iterative procedure based on a collection of images or volumes. First, an initial template is chosen, such as an example image or a pixel-wise average across all images. Next, all images are aligned (registered) to this template, a better template is estimated from aligned images through averaging, and the process is iterated until convergence Allassonnière et al. (2007); Davis et al. (2004); Joshi et al. (2004); Ma et al. (2008); Sabuncu et al. (2009). Since the above procedure requires many iterations involving many costly (3D) pairwise registrations, atlas construction runtimes are often prohibitive.
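The alternating structure of this conventional procedure can be illustrated with a deliberately simplified sketch. It is not the method of any cited work: deformable registration is replaced by a toy stand-in (the best circular integer shift of a 1D signal), and all function names are hypothetical. It only shows the initialise, align, average, repeat loop.

```python
def shift(signal, k):
    """Circularly shift a 1D signal to the right by k samples."""
    n = len(signal)
    return [signal[(i - k) % n] for i in range(n)]


def ssd(a, b):
    """Sum of squared differences between two equally sized signals."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def best_shift(signal, template):
    """Toy 'registration': the integer shift that best matches the template."""
    return min(range(len(signal)), key=lambda k: ssd(shift(signal, k), template))


def average(signals):
    """Pixel-wise average of a list of signals."""
    return [sum(s[i] for s in signals) / len(signals) for i in range(len(signals[0]))]


def build_template(signals, n_iters=10):
    template = average(signals)  # initial template: pixel-wise average
    for _ in range(n_iters):
        # Align every signal to the current template, then re-average.
        aligned = [shift(s, best_shift(s, template)) for s in signals]
        new_template = average(aligned)
        if ssd(new_template, template) < 1e-12:  # converged
            break
        template = new_template
    return template


bump = [0, 0, 1, 3, 1, 0, 0, 0, 0, 0]
signals = [shift(bump, k) for k in (0, 1, 2)]
template = build_template(signals)
```

In this toy case the initial average is a blurred bump, and after one round of alignment the template becomes a sharp copy of the bump at the central position. Every iteration requires one alignment per signal, which is the cost that becomes prohibitive when each alignment is a 3D deformable registration.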

A single population template can be insufficient at capturing complex variability. Current methods often subdivide the population to build multiple atlases. For example, in neuroimaging, some methods build different templates for different age groups, requiring rigid discretization of the population and prohibiting each template from using all information across the collection. Images can also be clustered and a template optimized for each cluster, requiring a pre-set number of clusters Sabuncu et al. (2009). Specialized methods have also been developed that tackle a particular variability of interest. For example, spatiotemporal brain templates have been developed using specialized registration pipelines and explicit modelling of brain degeneration with time Davis et al. (2010); Habas et al. (2009); Kuklisova-Murgasova et al. (2011), requiring significant domain knowledge, manual anatomical segmentations, and significant computational resources. We build on the intuitions of these methods, but propose a general framework that can learn conditional deformable templates for any given set of attributes. Specifically, our strategy learns a single network that leverages shared information across the entire dataset and can output different templates as a function of sets of attributes, such as age, sex, and disease state. The conditional function learned by our model generates unbiased population templates for a specific configuration of the attributes.

Our model can be used to study the population variation with respect to certain attributes it was trained on, such as age in neuroimaging. In recent literature on deep probabilistic models, several papers find and explore latent axes of important variability in the dataset Arjovsky et al. (2017); Chen et al. (2016); Goodfellow et al. (2016); Higgins et al. (2017); Kingma and Welling (2014); Makhzani et al. (2015). Our model can also be used to build conditional geometric templates based on such latent information, as we show in our experiments. In this case, our model can be seen as learning meaningful image representations up to a geometric deformation. However, in this paper we focus on observed (measured) attributes, with the goal of explicitly capturing variability that is often a source of confounding.

3 Methods

We first present a generative model that describes the formation of images through deformations from an unknown conditional template. We present a learning approach that uses neural networks and diffeomorphic transforms to jointly estimate the global template and a network that rapidly aligns it to each image.

3.1 Probabilistic model

Let $x_i$ be a data sample, such as a 2D image, a 3D volume like an MRI scan, or a time series. For the rest of this section, we use images and volumes as an example, but the development applies broadly to many data types. We assume we have a dataset $\mathcal{X} = \{x_i\}$, and model each image as a spatial deformation $\phi_{v_i}$ of a global template $t$. Each transform $\phi_{v_i}$ is parametrized by the random vector $v_i$.

We consider a model of a conditional template $t = f_{\theta_t}(a)$, a function of attribute vector $a$, parametrized by global parameters $\theta_t$. For example, $a$ can encode a class label or phenotypical information associated with medical scans, such as age and sex. In the case where no such conditioning information is available or of interest, this formulation reduces to a standard single template for the entire dataset: $t = \theta_t$, where $\theta_t$ can represent the pixel intensity values to be estimated.

We estimate the deformable template parameters $\theta_t$ and the deformation fields for every data point using maximum likelihood. Letting $\mathcal{V} = \{v_i\}$ and $\mathcal{A} = \{a_i\}$,

$$\hat{\theta}_t, \hat{\mathcal{V}} = \arg\max_{\theta_t, \mathcal{V}} \log p_{\theta_t}(\mathcal{V} \mid \mathcal{X}, \mathcal{A}) = \arg\max_{\theta_t, \mathcal{V}} \log p_{\theta_t}(\mathcal{X} \mid \mathcal{V}; \mathcal{A}) + \log p(\mathcal{V}), \qquad (1)$$

where the first term captures the likelihood of the data and deformations, and the second term controls a prior over the deformation fields.

Deformations. While the method described in this paper applies to a range of deformation parametrizations $v$, we focus on diffeomorphisms. Diffeomorphic deformations are invertible and differentiable, thus preserving topology. Specifically, we treat $v$ as a stationary velocity field Ashburner (2007); Dalca et al. (2018a); Hernandez et al. (2009); Krebs et al. (2018, 2019); Modat et al. (2012), although time-varying fields are also possible. In this setup, the deformation field $\phi_v$ is defined through the following ordinary differential equation:

$$\frac{\partial \phi_v^{(t)}}{\partial t} = v\left(\phi_v^{(t)}\right), \qquad (2)$$

where $\phi_v^{(0)} = \mathrm{Id}$ is the identity transformation and $t$ is time. We can obtain the final deformation field $\phi_v^{(1)}$ by integrating the stationary velocity field $v$ over $t \in [0, 1]$. We compute this integration through scaling and squaring, which has been shown to be efficiently implementable in automatic differentiation platforms Dalca et al. (2019); Krebs et al. (2018).
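Scaling and squaring can be sketched in a few lines for a 1D field. This is a minimal illustration under stated assumptions, not the library's integration layer: the grid is 1D with unit spacing, the field is sampled with clamped linear interpolation, and the names `interp` and `integrate_velocity` are hypothetical. The velocity is first scaled by $2^{-K}$ so that the initial deformation is close to the identity, and the deformation is then composed with itself $K$ times.

```python
import math


def interp(field, x):
    """Linearly interpolate a 1D field at position x, clamped to the grid."""
    x = min(max(x, 0.0), len(field) - 1.0)
    i = min(int(math.floor(x)), len(field) - 2)
    w = x - i
    return (1.0 - w) * field[i] + w * field[i + 1]


def integrate_velocity(v, n_steps=7):
    """Return the displacement u, with phi = Id + u, for a stationary velocity v."""
    # Scaling: phi at time 1 / 2**n_steps is approximately Id + v / 2**n_steps.
    u = [vi / (2 ** n_steps) for vi in v]
    # Squaring: phi(2t) = phi(t) o phi(t), i.e. u(x) <- u(x) + u(x + u(x)).
    for _ in range(n_steps):
        u = [u[i] + interp(u, i + u[i]) for i in range(len(u))]
    return u


# Linear velocity v(x) = -0.5 x, whose exact flow is phi(x) = x * exp(-0.5).
v = [-0.5 * x for x in range(11)]
u = integrate_velocity(v)
phi = [i + u[i] for i in range(len(u))]
```

For this linear velocity the computed displacement is close to the analytic value $x(e^{-0.5} - 1)$, and the resulting map is strictly increasing, which is the 1D analogue of being topology-preserving.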

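We model the velocity field prior $p(\mathcal{V})$ to encourage desirable deformation properties. Specifically, we first assume that deformations are smooth, for example to maintain anatomical consistency. Second, we assume that population templates are unbiased, restricting deformation statistics. Letting $u_v$ be the spatial displacement for $\phi_v = \mathrm{Id} + u_v$, and $\nabla u_v$ be its spatial gradient,

$$p(\mathcal{V}) \propto \exp\{-\gamma \|\bar{u}\|^2\} \prod_i \mathcal{N}(u_{v_i}; 0, \Sigma_u), \qquad (3)$$

where $\mathcal{N}(\cdot; \mu, \Sigma)$ is the multivariate normal distribution with mean $\mu$ and covariance $\Sigma$, and $\bar{u} = \frac{1}{n}\sum_i u_{v_i}$. We let $\Sigma_u^{-1} = L_u$, where $L_u$ is (a relaxation of) the Laplacian of a neighborhood graph defined on the pixel grid, with the graph degree matrix $D$ and the pixel neighbourhood adjacency matrix $C$ Dalca et al. (2018a). Using this formulation, we obtain

$$\log p(\mathcal{V}) = -\gamma \|\bar{u}\|^2 - \sum_i \left[ \frac{d}{2}\lambda_d \|u_{v_i}\|^2 + \frac{\lambda_a}{2} \|\nabla u_{v_i}\|^2 \right] + \text{const}, \qquad (4)$$

where $d$ is the neighbourhood degree. The first term encourages a small average deformation across the dataset, encouraging a central, unbiased template. The second and third terms encourage templates that require small and smooth deformations, respectively, and $\gamma$, $\lambda_d$, and $\lambda_a$ are hyperparameters.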
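The three penalty terms described above (average deformation, deformation size, and smoothness) can be computed directly from a set of displacement fields. The sketch below is illustrative only: it uses 1D fields and forward differences for the gradient, and the function name, argument names, the factor of one half, and the default `degree=2` (two neighbours per pixel on a 1D grid) are assumptions made for the example.

```python
def sq_norm(field):
    """Squared Euclidean norm of a 1D field."""
    return sum(x * x for x in field)


def gradient(field):
    """Forward finite differences of a 1D field."""
    return [field[i + 1] - field[i] for i in range(len(field) - 1)]


def prior_penalties(displacements, gamma, lam_d, lam_a, degree=2):
    """Return the (centrality, magnitude, smoothness) penalty terms."""
    n = len(displacements)
    size = len(displacements[0])
    # Mean displacement over the dataset; zero for a perfectly central template.
    mean_u = [sum(u[k] for u in displacements) / n for k in range(size)]
    centrality = gamma * sq_norm(mean_u)
    magnitude = sum(0.5 * degree * lam_d * sq_norm(u) for u in displacements)
    smoothness = sum(0.5 * lam_a * sq_norm(gradient(u)) for u in displacements)
    return centrality, magnitude, smoothness


# Two fields that cancel out on average, and two that share the same bias.
balanced = [[1.0, 1.0, 1.0], [-1.0, -1.0, -1.0]]
biased = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
balanced_terms = prior_penalties(balanced, gamma=1.0, lam_d=1.0, lam_a=1.0)
biased_terms = prior_penalties(biased, gamma=1.0, lam_d=1.0, lam_a=1.0)
```

Both sets have the same magnitude and smoothness penalties, but only the biased set pays the centrality penalty. This is the mechanism that pushes the template toward the centre of the population rather than toward any particular image.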

Data Likelihood. The data likelihood $p_{\theta_t}(x_i \mid v_i; a_i)$ can be adapted to the application domain. For images, we often adopt a simple additive Gaussian model coupled with a deformable template:

$$p_{\theta_t}(x_i \mid v_i; a_i) = \mathcal{N}\left(x_i;\ f_{\theta_t}(a_i) \circ \phi_{v_i},\ \sigma^2 \mathbb{I}\right), \qquad (5)$$

where $\circ$ represents a spatial warp, and $\sigma^2$ represents the variance of the additive image noise. However, in some datasets, different likelihoods are more appropriate. For example, due to the spatial variability of contrast and noise in MRIs, likelihood models that result in normalized cross correlation loss functions have been widely shown to lead to more robust results, and such models can be used with our framework Avants et al. (2008).

Figure 2: Overview. The network takes as input an image and an optional attribute vector. The upper network $g_{t,\theta_t}(\cdot)$ outputs a template, which is then registered with the input image by the second network $g_{v,\theta_v}(\cdot, \cdot)$. The loss function, derived from the negative log likelihood of the generative model, leverages the template warped into $x_i$.

3.2 Neural Network Model

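To solve the maximum likelihood formulation (1) given the model instantiations specified above, we design a network $g_{\theta_t, \theta_v}(x_i, a_i)$ that takes as input an image and an attribute vector to condition the template on (this could be empty for global templates). The network can be effectively seen as having two functional parts. The first, $g_{t,\theta_t}(a_i)$, produces the conditional template. The second, $g_{v,\theta_v}(t_i, x_i)$, takes in the template and a data point, and outputs the most likely velocity field (and hence deformation) between them. By learning the optimal parameters $\theta = \{\theta_t, \theta_v\}$, we estimate a global network that simultaneously provides a deformable (conditional) template and its deformation to a datapoint. Figure 2 provides an overview schematic of the proposed network.

We optimize the neural network parameters $\theta$ using stochastic gradient algorithms, and minimize the negative log-likelihood corresponding to (1) for image $x_i$:

$$\mathcal{L}(\theta_t, \theta_v; v_i, x_i, a_i) = -\log p_\theta(x_i \mid v_i; a_i) - \log p(v_i)$$
$$= \frac{1}{2\sigma^2} \left\| x_i - g_{t,\theta_t}(a_i) \circ \phi_{v_i} \right\|^2 + \gamma \|\bar{u}\|^2 + \frac{d}{2}\lambda_d \|u_{v_i}\|^2 + \frac{\lambda_a}{2} \|\nabla u_{v_i}\|^2 + \text{const}, \qquad (6)$$

where $g_{t,\theta_t}(\cdot)$ yields the template at iteration $i$, and $v_i = g_{v,\theta_v}(t_i, x_i)$.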

The use of stochastic gradients to update the networks enables us to learn templates faster than conventional methods by avoiding the need to compute final deformations at each iteration. Intuitively, with every iteration the network learns to output a template, optionally conditioned on the attribute data, that can be smoothly and invertibly warped to every image in the dataset.

We implement the template network $g_{t,\theta_t}(\cdot)$ in two versions, depending on whether we are estimating an unconditional or conditional template. The first, conditional version $g_{t,\theta_t}(a_i)$ consists of a decoder that takes as input the attribute data $a_i$, and outputs the template $t_i$. The decoder includes a fully connected layer, followed by several blocks of upsampling, convolutional, and ReLU activation layers. The second, unconditional version $g_{t,\theta_t}$ has no inputs and simply consists of a learnable parameter at each pixel. The registration network $g_{v,\theta_v}(t_i, x_i)$ takes as input two images $t_i$ and $x_i$ and outputs a stationary velocity field $v_i$, and is designed as a convolutional U-Net-like architecture Ronneberger and others (2015) with the design used in recent registration literature Balakrishnan et al. (2018b). To compute the loss (6), we compute the deformation field $\phi_{v_i}$ from $v_i$ using differentiable scaling and squaring integration layers Dalca et al. (2018a); Krebs et al. (2018), and the warped template $t_i \circ \phi_{v_i}$ using spatial transform layers. We approximate the average deformation $\bar{u}$ in the loss function using a weighted running average $\bar{u} \approx \frac{1}{c} \sum_{j=i-c+1}^{i} u_j$, where $u_j$ is the displacement at iteration $j$, $i$ is the current iteration, and $c$ is a fixed window length. Specific network design parameters depend on the application domain, and are included in the supplementary materials.
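A running estimate of the average displacement can be kept with a fixed-length buffer. The sketch below assumes a uniform average over the most recent `window` iterations and 1D displacement fields; the class name and interface are hypothetical and are not taken from the released code.

```python
from collections import deque


class RunningMeanDisplacement:
    """Uniform running average of the most recent displacement fields."""

    def __init__(self, window):
        # A deque with maxlen silently drops the oldest entry once it is full.
        self.history = deque(maxlen=window)

    def update(self, displacement):
        """Add the latest displacement and return the current mean field."""
        self.history.append(list(displacement))
        n = len(self.history)
        return [sum(u[k] for u in self.history) / n for k in range(len(displacement))]


tracker = RunningMeanDisplacement(window=2)
first = tracker.update([1.0, 1.0])
second = tracker.update([3.0, 3.0])
third = tracker.update([5.0, 5.0])  # the first field has left the window
```

Because the estimate only looks at recent iterations, it tracks the current state of the template and registration network rather than deformations produced early in training.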

3.3 Test-time Inference of Template and Deformations.

Given a trained network, we obtain a (potentially conditional) template $t$ directly from network $g_{t,\theta_t}(a)$ by a single forward pass given input $a$. For each test input image $x_i$, the deformation fields themselves are often of interest for analysis or prediction. The network also provides the deformation $\phi_{v_i}$, where $v_i = g_{v,\theta_v}(t, x_i)$.

Often, the inverse deformation, which takes the image to the template space, is also desired. Using a stationary velocity field representation, this inverse deformation $\phi_v^{-1}$ is easy to compute by integrating the negative velocity field using the same scaling and squaring layer: $\phi_v^{-1} = \phi_{-v}$ Ashburner (2007); Dalca et al. (2019); Modat et al. (2014).
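The inverse property can be checked numerically on a toy example. The sketch below makes the same simplifying assumptions as any 1D illustration (unit grid spacing, clamped linear interpolation, hypothetical helper names): it integrates a velocity field and its negation with scaling and squaring, then composes the two maps and measures how far the result is from the identity.

```python
import math


def interp(field, x):
    """Linearly interpolate a 1D field at position x, clamped to the grid."""
    x = min(max(x, 0.0), len(field) - 1.0)
    i = min(int(x), len(field) - 2)
    w = x - i
    return (1.0 - w) * field[i] + w * field[i + 1]


def integrate(v, n_steps=7):
    """Scaling and squaring: displacement u with phi = Id + u."""
    u = [vi / 2 ** n_steps for vi in v]
    for _ in range(n_steps):
        u = [u[i] + interp(u, i + u[i]) for i in range(len(u))]
    return u


n = 21
# Smooth velocity that vanishes at both ends, so the flow stays on the grid.
v = [2.0 * math.sin(math.pi * i / (n - 1)) for i in range(n)]
u_fwd = integrate(v)                    # phi_v
u_inv = integrate([-vi for vi in v])    # phi_{-v}

# Round trip x -> phi_v(x) -> phi_{-v}(phi_v(x)), which should return to x.
round_trip = []
for i in range(n):
    y = i + u_fwd[i]
    round_trip.append(y + interp(u_inv, y))
max_error = max(abs(round_trip[i] - i) for i in range(n))
```

The forward displacement reaches well over one pixel near the centre of the grid, while the round-trip error stays small and comes only from interpolation and the finite number of squaring steps.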

4 Experiments

We present two main sets of experiments. The first set uses image-based datasets MNIST and Google QuickDraw, with the goal of providing a picture of the capabilities of our method. While deformable templates in these data are not a real-world application, these are often-studied datasets that provide a platform to analyze aspects of deformable templates.

In contrast, the second set of experiments is designed to demonstrate the utility of our method on a task of practical importance, analysis of brain MRI. We demonstrate that our method can produce high quality deformable templates in the context of realistic data, and that conditional deformable templates capture important anatomical variability related to age.

4.1 Experiment on Benchmark Datasets

Data. We use the MNIST dataset, consisting of small 2D images of hand-written digits LeCun (1998), and 11 classes from the Google QuickDraw dataset Jongejan et al. (2016), a collection of categorized drawings contributed by players in an online drawing game. To evaluate our method's ability to construct conditional templates that accurately capture the impact of the variables on which the templates are conditioned, we generate examples in which the initial images are scaled and rotated (Figure 3). Specifically, we vary the image scaling factor over a fixed range and the image rotation from 0 to 360 degrees. We learn different models on the original dataset with its different classes (D-class), on the dataset with simulated scale effects (D-class-scale), and on the dataset with both scale and rotation effects (D-class-scale-rot). While simulated image changes are obvious to an observer, during training we assume we know the attributes that cause changes in the images, but do not a priori model their effect on the images. This simulates, for example, the correlation between age and changing size of anatomical structures. The goal is to understand whether the proposed method is able to learn the relationship between the attribute and the geometrical variability in the dataset, and therefore produce a function for generating on-demand templates conditioned on the attributes. The datasets are split into train, validation and test sets.

4.1.1 Validation

In the first experiment, we evaluate our ability to construct suitable conditional templates.

Figure 3: MNIST examples (1) MNIST digits from D-scale-rot; (2) templates conditioned on class (vertical axis) and scale (horizontal axis) on MNIST D-scale, learned with our model, and (3) with a decoder-only baseline model; (4) conditional templates learned with our model on the MNIST D-class-scale-rot dataset for the digit 3 and a variety of scaling and rotation values.

Figure 4: Example convergence. Convergence of two conditional template models. Left: model trained with the digit-only attribute on D-class, shown over training epochs. Right: model trained on D-class-rot, with all three attributes given as input to the model, shown over training epochs for randomly sampled digits, rotations, and scales.

Figure 5: Example deformations. Each row shows: class template, example class image, template warped to this instance, instance warped to match the template, and the deformation field.

Hyperparameters. Model hyperparameters have intuitive effects on the sharpness of the templates, the spatial smoothness of the registration fields, and the quality of the alignments. In practical settings, they should be chosen based on the desired goal of a given task. In these experiments, we tune hyperparameters by visually assessing deformations on validation data, starting from an initial setting chosen for training on the D-class data. We found that once a hyperparameter was chosen for one dataset, only minor tuning was required for other experiments.

Evaluation criteria. Template construction is an ill-posed problem, and the utility of resulting templates depends on the desired task. We report a series of measures to capture properties of the resulting templates and deformations. Our first two quantitative evaluation criteria relate to centrality, for which we computed the norm of the mean displacement field $\|\bar{u}\|$ and the average displacement size $\frac{1}{n}\sum_i \|u_i\|$. Next, we illustrate field regularity per image class, and average intensity image agreement (via MSE). These metrics capture aspects about the deformation fields, rather than solely intrinsic properties of the templates. They need to be evaluated together; otherwise, deformation fields can lead to perfectly matching the image and template while being very irregular and geometrically meaningless, or can be perfectly smooth (zero displacement) at the cost of poor image matching. To capture field regularity, we compute the Jacobian matrix $J_\phi(p) = \nabla\phi(p)$, which captures the local properties of $\phi$ around pixel $p$. Low determinant values indicate irregular deformation fields, and $|J_\phi(p)| \le 0$ indicates pixels where the deformation is not topology-preserving. Jacobian determinants near $1$ represent very smooth fields. We use held-out test subjects for these measures.
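The regularity measure can be made concrete for a 2D displacement field. The following sketch assumes the deformation is stored as two grids `ux` and `uy` (x and y components of the displacement, indexed as `[row][column]`) and approximates derivatives with forward differences; these conventions and the function names are choices made for this example.

```python
def jacobian_determinants(ux, uy):
    """det(J) of phi = Id + u at every pixel except the last row and column."""
    rows, cols = len(ux), len(ux[0])
    dets = []
    for r in range(rows - 1):
        row = []
        for c in range(cols - 1):
            dux_dx = ux[r][c + 1] - ux[r][c]
            dux_dy = ux[r + 1][c] - ux[r][c]
            duy_dx = uy[r][c + 1] - uy[r][c]
            duy_dy = uy[r + 1][c] - uy[r][c]
            # J = I + grad(u); in 2D the determinant is a 2x2 expansion.
            row.append((1.0 + dux_dx) * (1.0 + duy_dy) - dux_dy * duy_dx)
        dets.append(row)
    return dets


def count_folds(dets):
    """Number of pixels with a non-positive Jacobian determinant."""
    return sum(1 for row in dets for d in row if d <= 0.0)


size = 4
zeros = [[0.0] * size for _ in range(size)]
# Uniform scaling by 1.5: every unit square is mapped to area 2.25.
scale_x = [[0.5 * c for c in range(size)] for _ in range(size)]
scale_y = [[0.5 * r for _ in range(size)] for r in range(size)]
# Mirroring along x (phi_x = -x) reverses orientation and breaks topology.
mirror_x = [[-2.0 * c for c in range(size)] for _ in range(size)]

identity_dets = jacobian_determinants(zeros, zeros)
scaled_dets = jacobian_determinants(scale_x, scale_y)
mirrored_dets = jacobian_determinants(mirror_x, zeros)
```

The identity gives determinants of exactly 1, uniform scaling gives a constant positive determinant equal to the area change, and the mirrored field is flagged at every pixel by `count_folds`.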

Baselines. We compare our templates with templates built by choosing exemplar data as templates, and by training only a decoder of the given attributes using MSE loss and the same network architecture as the template network $g_{t,\theta_t}(\cdot)$. This latter baseline can be seen as differing from our method in that it minimizes a pixel-wise intensity difference as opposed to a geometric difference (deformation).

Figure 6: Quantitative measures. Top: Centrality and average deformation norm for templates generated by our model and the baselines on the D-class variant of MNIST. We find that our models yield more central templates. Bottom: Both MSE and Jacobian determinant measures indicate good deformations for all models.

Results. Figure 3 illustrates conditional templates using our model and the decoder, and results from our model on the full MNIST dataset using all attributes. Our method produces sharp, central templates that are plausible digits and are smoothly deformable to other digits. Example deformations are shown in Figure 5. Supplementary Figure 14 contains similar results for the QuickDraw dataset.

Figure 4 illustrates convergence behavior for two models, showing that the conditional attributes are able to capture complicated geometric differences. Templates early in the learning process share appearance features across attributes, indicating that the network leverages common information across the dataset. The final templates enable significantly smaller deformations than early ones, indicating better representation of the conditional variability. As one would expect, more epochs are necessary for convergence of the model with more attributes.

Figure 6 shows template measures indicating that our conditional templates are more central and require smaller deformations than the baselines when registered with the test set digits. In Figure 6 we also find that our method and exemplar-based templates can perform well for both deformation metrics, and comparably to each other. Specifically, all deformations are "smooth" (no negative Jacobian determinants are detected) and image differences are visually imperceptible. We underscore that changes in the hyperparameters will produce slightly different trade-offs for these measures. At the presented parameters, our method produces slightly smoother deformation fields at a slight cost in MSE for some digits, while the baselines can lead to slightly irregular fields to force images to match. The decoder baseline underperforms in all metrics. These results indicate that both our method and instance-based templates can lead to accurate and smooth deformation fields, while our method produces more central templates requiring smaller deformations.

4.1.2 Analysis

In this section, we explore further characteristics and utility provided by our model using the MNIST and QuickDraw datasets.

Variability and Synthesis.

Figure 7: Variability. Left: Images are synthesized by warping a learned template from the D-class dataset along the main two axes found by applying PCA to test deformation fields. Right: Images are synthesized by warping learned template using the D-class-scale along the main two axes found by applying PCA. The first model uses scale as an attribute, learning mostly other meaningful geometric deformations. The second model does not use scale as an attribute; consequently both principal components are dominated by scale.

Conditional deformable templates capture an image representation up to a spatial deformation. Deformation fields from templates to images are often studied to characterize and visualize variability in a population. To illustrate this point, we demonstrate the main within-class variability by finding the principal components of the velocity fields using PCA. Figure 7 illustrates synthesized digits by warping the template along these components capturing handwriting variability in natural digits.
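The first principal component of a set of velocity fields can be found with power iteration, which avoids any external dependency. The sketch below is a simplified illustration: velocity fields are flattened to 1D lists, only the leading component is computed, and the function names are hypothetical. A velocity field synthesized along the component would then be integrated and used to warp the template.

```python
import math


def first_principal_component(fields, n_iters=100):
    """Return (mean field, unit-norm leading principal direction)."""
    n, dim = len(fields), len(fields[0])
    mean = [sum(f[k] for f in fields) / n for k in range(dim)]
    centered = [[f[k] - mean[k] for k in range(dim)] for f in fields]
    w = [1.0] * dim  # starting vector for power iteration
    for _ in range(n_iters):
        # Multiply by the covariance matrix without forming it explicitly.
        proj = [sum(c[k] * w[k] for k in range(dim)) for c in centered]
        w_new = [sum(proj[i] * centered[i][k] for i in range(n)) / n for k in range(dim)]
        norm = math.sqrt(sum(x * x for x in w_new))
        if norm == 0.0:  # no variance along the current direction
            break
        w = [x / norm for x in w_new]
    return mean, w


def synthesize(mean, component, alpha):
    """Velocity field at offset alpha along the principal direction."""
    return [m + alpha * c for m, c in zip(mean, component)]


# Toy velocity fields that all vary along one shared pattern.
pattern = [0.0, 1.0, 2.0, 1.0, 0.0]
fields = [[coef * p for p in pattern] for coef in (-1.0, 0.0, 1.0)]
mean, component = first_principal_component(fields)
sample = synthesize(mean, component, alpha=2.0)
```

In this toy case the recovered direction is the shared pattern scaled to unit norm, so moving `alpha` up or down reproduces the dominant mode of variation in the fields.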

In a second variability experiment, we treat scale as a confounder and validate that our method reduces confounding effects. Figure 7 illustrates that a model learned with a scale attribute is able to learn principal geometric variability with reduced effects from scale compared to one not using this attribute.

Missing Attributes.

Figure 8: Missing attributes. Left: during training, digits 3-5 are not synthesized for a held-out range of scaling factors. Right: during training, only 5 examples of digit 5 are given. Red boxes highlight templates built with attributes where data was held out.

We test the ability of our conditional framework to learn templates that generalize to sparsely observed attributes in two regimes. First, for the D-class-scale dataset, we completely hold out a range of scaling factors for images of digits 3, 4 and 5. In the second regime, we hold out all but 5 instances of the digit 5. Figure 8 indicates that for each regime, our method produces reasonable templates even for the held-out attributes, indicating that it leverages the entire dataset in learning the conditioning function.

Figure 9: Latent attribute results. The top row shows sample input digits. The middle row shows our reconstruction for those input images, highlighting that the model learns a template for each digit type even when the digit attribute is not explicitly given. The bottom row shows templates built using an autoencoder with a single-neuron bottleneck, showing that the main variation captured in this manner encourages small pixel intensity error, rather than the geometric difference minimized by our method.

Figure 10: Slices from learned 3D brain MRI templates. Left: single unconditional template representing the entire population. Right: conditional age templates for brain MRI for ages 15 to 90, illustrating, for example, growth of the ventricles, also evident in a supplementary video.

Figure 11: Segmentations. Example segmentations overlaid with different brain views for our unconditional template (left) and conditional templates (right) varying by age.

Figure 12: Example 3D neuroimaging deformations. Frames include: coronal slices for age-conditional template, subject scan, warped template onto subject, warped subject onto template (using inverse field), and the first two directions of the 3D forward and inverse warps, and velocity field.

Latent attributes.

In this experiment, we compare our method to recent probabilistic models in the situation where attributes are not known a priori. To do this, we add an encoder from the input image $x_i$ to the latent attribute, and as a baseline train an autoencoder with the same encoder and decoder architectures as used in our model, and the MSE loss. We train on the D-class dataset with a bottleneck of a single neuron simulating the single unknown attribute. While more powerful autoencoders can lead to better reconstructions of the inputs, our goal is to explore the main mode of variability captured by each method. As Figure 9 shows, this autoencoder produces much fuzzier looking reconstructions, whereas our approach tends to reproduce the template for the given digit image. This is because the autoencoder learns representations to minimize pixel intensity differences, whereas our approach learns representations that minimize spatial deformations. In other words, our model learns image representations with respect to minimal geometric deformations.

Figure 13: Volume trends. Change in volume of ventricles and hippocampi of the age-conditional brain templates.

4.2 Experiment 2: Neuroimaging

In this section, we illustrate unconditional and conditional 3D brain MRI templates learned by our method, with the goal of showing its utility for the realistic task of neuroimaging analysis. We first show that our method efficiently synthesizes an unconditional population template, comparable to existing ones that require significantly more computation to construct. Second, we show that our learned conditional template function captures anatomical variability as a function of age.

Data. We use a large dataset of 7829 T1-weighted 3D brain MRI scans from publicly available datasets: ADNI Mueller et al. (2005), OASIS Marcus et al. (2007), ABIDE Di Martino et al. (2014), ADHD200 Milham et al. (2012), MCIC Gollub et al. (2013), PPMI Marek et al. (2011), HABS Dagley et al. (2015), and Harvard GSP Holmes et al. (2015). All scans are pre-processed with standard steps, including resampling to mm isotropic voxels, affine spatial normalization and anatomical segmentation using FreeSurfer Fischl (2012). Final images are cropped to . The segmentation maps are only used for analysis. The dataset is split into 7329 training volumes, 250 validation and 250 test. This dataset was first assembled and used in Dalca et al. (2018b).

Methods. All of the training data was used to build an unconditional template. We also learned a conditional template function using age and sex attributes, using only the ADNI and ABIDE datasets which provide this information. Following neuroimaging literature, we use a likelihood model resulting in normalized cross correlation data loss. Training the model requires approximately a day on a Titan XP GPU. However, obtaining a conditional template from a learned network requires less than a second.
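As a rough illustration of a normalized cross correlation data loss, the following NumPy sketch computes a single global NCC between a warped template and an image and negates it so that lower is better. The function names `ncc` and `ncc_loss` are hypothetical, and the global (whole-image) window is a simplifying assumption; registration pipelines often compute NCC over local windows instead.

```python
import numpy as np

def ncc(a, b, eps=1e-8):
    """Global normalized cross-correlation of two equally shaped images.

    Assumption: one window covering the whole image (not a local NCC).
    """
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    a0 = a - a.mean()  # remove mean intensity
    b0 = b - b.mean()
    denom = np.sqrt((a0 ** 2).sum() * (b0 ** 2).sum()) + eps
    return float((a0 * b0).sum() / denom)

def ncc_loss(warped_template, image):
    """Data loss: negate the similarity so that minimizing improves alignment."""
    return -ncc(warped_template, image)
```

Because NCC is invariant to affine intensity changes, an image and a brightened, rescaled copy of it yield a loss close to -1, which is one reason this family of losses is common for MRI with varying intensity profiles.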

Evaluation. For a given template, we obtain anatomical segmentations by warping 100 training images to the template and averaging their warped segmentations. For the conditional template, we do this for 7 ages equally spaced between 15 and 90 years old, for both males and females. We first analyze anatomical trends with respect to conditional attributes. We then measure the registration accuracy facilitated by each template on the test set via the widely used volume overlap measure Dice (higher is better). As a baseline for this comparison, we use the atlas and segmentation masks available online from recent literature Balakrishnan et al. (2018a). To test the volume overlap with anatomical segmentations of test data, we warp each template (unconditional, appropriate age and sex conditional template, and baseline) to each of 100 test subjects, and propagate the template segmentations. We compute the mean Dice score over all subjects and 30 FreeSurfer labels.
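The evaluation steps above can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions: the warped segmentations are taken as already computed integer label maps, the per-voxel argmax used to turn the averaged segmentations into a single label map is an assumption, and the function names (`fuse_labels`, `dice`, `mean_dice`) are hypothetical.

```python
import numpy as np

def fuse_labels(warped_segs, num_labels):
    """Average one-hot warped segmentations, then pick the top label per voxel.

    warped_segs: integer array of shape (n_subjects, *spatial).
    The argmax step is an assumption about how averages become labels.
    """
    warped_segs = np.asarray(warped_segs)
    one_hot = np.eye(num_labels)[warped_segs]  # (n_subjects, *spatial, num_labels)
    probs = one_hot.mean(axis=0)               # per-voxel label frequencies
    return probs.argmax(axis=-1)

def dice(seg_a, seg_b, label):
    """Dice overlap of one label between two label maps (NaN if absent in both)."""
    a = np.asarray(seg_a) == label
    b = np.asarray(seg_b) == label
    total = a.sum() + b.sum()
    if total == 0:
        return float("nan")
    return float(2.0 * np.logical_and(a, b).sum() / total)

def mean_dice(seg_a, seg_b, labels):
    """Mean Dice over a set of labels, ignoring labels absent from both maps."""
    return float(np.nanmean([dice(seg_a, seg_b, l) for l in labels]))
```

In use, `fuse_labels` would be applied once per template, and `mean_dice` once per test subject after propagating the template labels through the predicted deformation.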

Results. Figures 10 and 11, and a supplementary video (available at http://voxelmorph.mit.edu/atlas_creation/), illustrate example slices from the unconditional and conditional 3D templates. The ventricles and hippocampi are known to vary significantly with age, and this variation is visible in the images. Figure 13 shows their volumes, measured using our atlases, as a function of age: the ventricles grow and the hippocampi shrink. Figure 12 illustrates representative results.
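A volume trend of this kind can be computed by counting labeled voxels in each age-conditional template segmentation. The sketch below assumes the template segmentations are available as integer label maps; the function names, the label identifiers, and the voxel volume are all hypothetical parameters supplied by the caller.

```python
import numpy as np

def structure_volume(seg, labels, voxel_volume_mm3):
    """Volume (mm^3) of the union of the given labels in a label map."""
    mask = np.isin(np.asarray(seg), list(labels))
    return float(mask.sum() * voxel_volume_mm3)

def volume_trend(segs_by_age, labels, voxel_volume_mm3):
    """Map each age to the structure volume in that age's template segmentation."""
    return {
        age: structure_volume(seg, labels, voxel_volume_mm3)
        for age, seg in sorted(segs_by_age.items())
    }
```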

We find Dice scores of  for the unconditional template,  for the conditional model, and  for the baseline, with this difference roughly consistent for each anatomical structure. We emphasize that these numbers may not be directly compared, since the baseline atlas (and segmentations) were obtained using a different process involving an external dataset and manual labeling, while our template was built with our training images (and their FreeSurfer segmentations to obtain template labels). Nonetheless, these visualizations and analyses are encouraging, suggesting that our method provides anatomical templates for brain MRI that enable brain segmentation.

5 Discussion and Conclusion

Deformable templates play an important role in image analysis tasks. In this paper, we present a method for automatically learning such templates from data. Our method is both less labor intensive and computationally more efficient than traditional data-driven methods for learning templates. Moreover, our method can be used to learn a function that can quickly generate templates conditioned upon sets of attributes. For example, it can generate a template for the brains of 75-year-old women in under a second. To our knowledge, this is the only general method for producing templates conditioned on available attributes.

In a series of experiments on popular data sets of images, we demonstrate that our method produces high quality unconditional templates. We also show that it can be used to construct conditional templates that account for confounders such as scaling and rotation. In a second set of experiments, we demonstrate the practical utility of our method by applying it to a large data set of brain MRI images. We show that with about a day of training, we can produce unconditional atlases similar in quality and utility to a widely used atlas that took weeks to produce. We also show that the method can be used to rapidly produce conditional atlases that are consistent with known age-related changes in anatomy.

In the future, we plan to explore downstream consequences of being able to easily and quickly produce conditional templates for medical imaging studies. In addition, we believe that our model can be used for other tasks, such as estimating unknown attributes (e.g., age) for a given patient, which would be an interesting direction for further exploration.


  • W. H. Abdulla, D. Chow, and G. Sin (2003) Cross-words reference template for dtw-based speech recognition systems. In TENCON 2003. Conference on convergent technologies for Asia-Pacific region, Vol. 4, pp. 1576–1579. Cited by: §1.
  • A. Ahmadi and I. Patras (2016) Unsupervised convolutional neural networks for motion estimation. In Image Processing (ICIP), 2016 IEEE International Conference on, pp. 1629–1633. Cited by: §2.1.
  • S. Allassonnière, Y. Amit, and A. Trouvé (2007) Towards a coherent statistical framework for dense deformable template estimation. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 69 (1), pp. 3–29. Cited by: §1, §1, §2.2.
  • M. Arjovsky, S. Chintala, and L. Bottou (2017) Wasserstein gan. arXiv preprint arXiv:1701.07875. Cited by: §2.2.
  • J. Ashburner (2007) A fast diffeomorphic image registration algorithm. Neuroimage 38 (1), pp. 95–113. Cited by: §2.1, §3.1, §3.3.
  • B. B. Avants, C. L. Epstein, M. Grossman, and J. C. Gee (2008) Symmetric diffeomorphic image registration with cross-correlation: evaluating automated labeling of elderly and neurodegenerative brain. Medical image analysis 12 (1), pp. 26–41. Cited by: §2.1, §3.1.
  • R. Bajcsy and S. Kovacic (1989) Multiresolution elastic matching. Computer Vision, Graphics, and Image Processing 46, pp. 1–21. Cited by: §2.1.
  • G. Balakrishnan, A. Zhao, M.R. Sabuncu, J. Guttag, and A.V. Dalca (2018a) An unsupervised learning model for deformable medical image registration. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9252–9260. Cited by: §2.1, §4.2.
  • G. Balakrishnan, A. Zhao, M.R. Sabuncu, J. Guttag, and A.V. Dalca (2018b) VoxelMorph: a learning framework for deformable medical image registration. IEEE Transactions on Medical Imaging. Cited by: §1, §2.1, §3.2.
  • M.F. Beg et al. (2005) Computing large deformation metric mappings via geodesic flows of diffeomorphisms. Int. J. Comput. Vision 61, pp. 139–157. Cited by: §2.1.
  • T. Brox et al. (2004) High accuracy optical flow estimation based on a theory for warping. European Conference on Computer Vision (ECCV), pp. 25–36. Cited by: §2.1.
  • X. Cao, J. Yang, J. Zhang, D. Nie, M. Kim, Q. Wang, and D. Shen (2017) Deformable image registration based on similarity-steered cnn regression. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 300–308. Cited by: §2.1.
  • Y. Cao, M. I. Miller, R. L. Winslow, and L. Younes (2005) Large deformation diffeomorphic metric mapping of vector fields. IEEE transactions on medical imaging 24 (9), pp. 1216–1230. Cited by: §2.1.
  • C. Ceritoglu, K. Oishi, X. Li, M. Chou, L. Younes, M. Albert, C. Lyketsos, P. C. van Zijl, M. I. Miller, and S. Mori (2009) Multi-contrast large deformation diffeomorphic metric mapping for diffusion tensor imaging. Neuroimage 47 (2), pp. 618–627. Cited by: §2.1.
  • X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel (2016) Infogan: interpretable representation learning by information maximizing generative adversarial nets. In Advances in neural information processing systems, pp. 2172–2180. Cited by: §2.2.
  • A. Dagley et al. (2015) Harvard aging brain study: dataset and accessibility. NeuroImage. Cited by: §4.2.
  • A.V. Dalca, G. Balakrishnan, J. Guttag, and M.R. Sabuncu (2019) Unsupervised learning of probabilistic diffeomorphic registration for images and surfaces. Medical Image Analysis 57, pp. 226–236. Cited by: §2.1, §3.1, §3.3.
  • A.V. Dalca, G. Balakrishnan, J. Guttag, and M. Sabuncu (2018a) Unsupervised learning for fast probabilistic diffeomorphic registration. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 729–738. Cited by: §2.1, §3.1, §3.1, §3.2.
  • A.V. Dalca, J. Guttag, and M. Sabuncu (2018b) Anatomical priors in convolutional networks for unsupervised biomedical segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9290–9299. Cited by: §4.2.
  • A.V. Dalca et al. (2016) Patch-based discrete registration of clinical brain images. In International Workshop on Patch-based Techniques in Medical Imaging, pp. 60–67. Cited by: §2.1.
  • B. C. Davis, P. T. Fletcher, E. Bullitt, and S. Joshi (2010) Population shape regression from random design data. International journal of computer vision 90 (2), pp. 255–266. Cited by: §2.2.
  • B. Davis, P. Lorenzen, and S. C. Joshi (2004) Large deformation minimum mean squared error template estimation for computational anatomy.. In ISBI, Vol. 4, pp. 173–176. Cited by: §1, §1, §2.2.
  • B.D. de Vos et al. (2017) End-to-end unsupervised deformable image registration with a convolutional neural network. In DLMIA, pp. 204–212. Cited by: §2.1.
  • A. Di Martino et al. (2014) The autism brain imaging data exchange: towards a large-scale evaluation of the intrinsic brain architecture in autism. Molecular psychiatry 19 (6), pp. 659–667. Cited by: §4.2.
  • A. Dosovitskiy et al. (2015) FlowNet: learning optical flow with convolutional networks. Cited by: §2.1.
  • P. F. Felzenszwalb and J. D. Schwartz (2007) Hierarchical matching of deformable shapes. In 2007 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8. Cited by: §1.
  • B. Fischl (2012) FreeSurfer. Neuroimage 62 (2), pp. 774–781. Cited by: §4.2.
  • B. Glocker et al. (2008) Dense image registration through mrfs and efficient linear programming. Medical image analysis 12 (6), pp. 731–741. Cited by: §2.1.
  • R.L. Gollub et al. (2013) The mcic collection: a shared repository of multi-modal, multi-site brain image data from a clinical investigation of schizophrenia. Neuroinformatics 11 (3), pp. 367–388. Cited by: §4.2.
  • I. Goodfellow, Y. Bengio, and A. Courville (2016) Deep learning. MIT Press. Cited by: §2.2.
  • P. A. Habas, K. Kim, F. Rousseau, O. A. Glenn, A. J. Barkovich, and C. Studholme (2009) A spatio-temporal atlas of the human fetal brain with application to tissue segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 289–296. Cited by: §2.2.
  • M. Hernandez, M. N. Bossa, and S. Olmos (2009) Registration of anatomical images using paths of diffeomorphisms parameterized with stationary vector field flows. International Journal of Computer Vision 85 (3), pp. 291–306. Cited by: §2.1, §3.1.
  • I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner (2017) Beta-vae: learning basic visual concepts with a constrained variational framework.. ICLR 2 (5), pp. 6. Cited by: §2.2.
  • A. J. Holmes et al. (2015) Brain genomics superstruct project initial data release with structural, functional, and behavioral measures. Scientific data 2. Cited by: §4.2.
  • B. K.P. Horn and B. G. Schunck (1980) Determining optical flow. Cited by: §2.1.
  • E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox (2017) Flownet 2.0: evolution of optical flow estimation with deep networks. In IEEE conference on computer vision and pattern recognition (CVPR), Vol. 2, pp. 6. Cited by: §2.1.
  • A. K. Jain, Y. Zhong, and S. Lakshmanan (1996) Object matching using deformable templates. IEEE Transactions on pattern analysis and machine intelligence 18 (3), pp. 267–278. Cited by: §1.
  • J. Y. Jason, A. W. Harley, and K. G. Derpanis (2016) Back to basics: unsupervised learning of optical flow via brightness constancy and motion smoothness. In European Conference on Computer Vision, pp. 3–10. Cited by: §2.1.
  • J. Jongejan, H. Rowley, T. Kawashima, J. Kim, and N. Fox-Gieg (2016) The quick, draw!-ai experiment. Mount View, CA, accessed Feb 17, pp. 2018. Cited by: §4.1.
  • S. C. Joshi and M. I. Miller (2000) Landmark matching via large deformation diffeomorphisms. IEEE transactions on image processing 9 (8), pp. 1357–1370. Cited by: §2.1.
  • S. Joshi, B. Davis, M. Jomier, and G. Gerig (2004) Unbiased diffeomorphic atlas construction for computational anatomy. NeuroImage 23, pp. S151–S160. Cited by: §1, §1, §2.2.
  • J. Kim, C. Liu, F. Sha, and K. Grauman (2013) Deformable spatial pyramid matching for fast dense correspondences. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2307–2314. Cited by: §1.
  • D.P. Kingma and M. Welling (2014) Auto-encoding variational bayes. ICLR. Cited by: §2.2.
  • I. Kokkinos, M. M. Bronstein, R. Litman, and A. M. Bronstein (2012) Intrinsic shape context descriptors for deformable shapes. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 159–166. Cited by: §1.
  • J. Krebs, T. Mansi, B. Mailhé, N. Ayache, and H. Delingette (2018) Unsupervised probabilistic deformation modeling for robust diffeomorphic registration. Deep Learning in Medical Image Analysis. Cited by: §3.1, §3.2.
  • J. Krebs, H. Delingette, B. Mailhé, N. Ayache, and T. Mansi (2019) Learning a probabilistic model for diffeomorphic registration. IEEE transactions on medical imaging. Cited by: §2.1, §3.1.
  • J. Krebs et al. (2017) Robust non-rigid registration through agent-based action learning. In Medical Image Computing and Computer-Assisted Intervention (MICCAI), Cham, pp. 344–352. Cited by: §2.1.
  • M. Kuklisova-Murgasova, P. Aljabar, L. Srinivasan, S. J. Counsell, V. Doria, A. Serag, I. S. Gousias, J. P. Boardman, M. A. Rutherford, A. D. Edwards, et al. (2011) A dynamic 4d probabilistic atlas of the developing brain. NeuroImage 54 (4), pp. 2750–2763. Cited by: §2.2.
  • Y. LeCun (1998) The mnist database of handwritten digits. http://yann.lecun.com/exdb/mnist/. Cited by: §4.1.
  • J. Ma, M. I. Miller, A. Trouvé, and L. Younes (2008) Bayesian template estimation in computational anatomy. NeuroImage 42 (1), pp. 252–261. Cited by: §1, §1, §2.2.
  • A. Makhzani, J. Shlens, N. Jaitly, I. Goodfellow, and B. Frey (2015) Adversarial autoencoders. arXiv preprint arXiv:1511.05644. Cited by: §2.2.
  • D.S. Marcus et al. (2007) Open access series of imaging studies (oasis): cross-sectional mri data in young, middle aged, nondemented, and demented older adults. Journal of cognitive neuroscience 19 (9), pp. 1498–1507. Cited by: §4.2.
  • K. Marek et al. (2011) The parkinson progression marker initiative (ppmi). Progress in neurobiology 95 (4), pp. 629–635. Cited by: §4.2.
  • M.P. Milham et al. (2012) The adhd-200 consortium: a model to advance the translational potential of neuroimaging in clinical neuroscience. Front. Sys. Neurosci 6, pp. 62. Cited by: §4.2.
  • M. I. Miller, M. F. Beg, C. Ceritoglu, and C. Stark (2005) Increasing the power of functional maps of the medial temporal lobe by using large deformation diffeomorphic metric mapping. Proceedings of the National Academy of Sciences 102 (27), pp. 9685–9690. Cited by: §2.1.
  • M. Modat, I.J.A. Simpson, M.J. Cardoso, D.M. Cash, N. Toussaint, N.C. Fox, and S. Ourselin (2014) Simulating neurodegeneration through longitudinal population analysis of structural and diffusion weighted mri data. Medical Image Computing and Computer-Assisted Intervention LNCS 8675, pp. 57–64. Cited by: §3.3.
  • M. Modat, P. Daga, M. J. Cardoso, S. Ourselin, G. R. Ridgway, and J. Ashburner (2012) Parametric non-rigid registration using a stationary velocity field. In 2012 IEEE Workshop on Mathematical Methods in Biomedical Image Analysis, pp. 145–150. Cited by: §3.1.
  • S.G. Mueller et al. (2005) Ways toward an early diagnosis in Alzheimer’s disease: the Alzheimer’s Disease Neuroimaging Initiative (ADNI). Alzheimer’s & Dementia 1 (1), pp. 55–66. Cited by: §4.2.
  • K. Oishi, A. Faria, H. Jiang, X. Li, K. Akhter, J. Zhang, J. T. Hsu, M. I. Miller, P. C. van Zijl, M. Albert, et al. (2009) Atlas-based whole brain white matter analysis using large deformation diffeomorphic metric mapping: application to normal elderly and alzheimer’s disease participants. Neuroimage 46 (2), pp. 486–499. Cited by: §2.1.
  • A. Ranjan and M. J. Black (2017) Optical flow estimation using a spatial pyramid network. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vol. 2, pp. 2. Cited by: §2.1.
  • M.M. Rohé et al. (2017) SVF-Net: learning deformable image registration using shape matching. In MICCAI, pp. 266–274. Cited by: §2.1.
  • O. Ronneberger et al. (2015) U-net: convolutional networks for biomedical image segmentation. In MICCAI, pp. 234–241. Cited by: §3.2.
  • M. R. Sabuncu, S. K. Balci, M. E. Shenton, and P. Golland (2009) Image-driven population analysis through mixture modeling. IEEE transactions on medical imaging 28 (9), pp. 1473–1487. Cited by: §2.2, §2.2.
  • H. Sokooti et al. (2017) Nonrigid image registration using multi-scale 3d convolutional neural networks. In MICCAI, Cham, pp. 232–239. Cited by: §2.1.
  • C. Stoll, Z. Karni, C. Rössl, H. Yamauchi, and H. Seidel (2006) Template deformation for point cloud fitting.. In SPBG, pp. 27–35. Cited by: §1.
  • D. Sun et al. (2010) Secrets of optical flow estimation and their principles. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pp. 2432–2439. Cited by: §2.1.
  • J.P. Thirion (1998) Image matching as a diffusion process: an analogy with maxwell’s demons. Medical Image Analysis 2 (3), pp. 243–260. Cited by: §2.1.
  • D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri (2016) Deep end2end voxel2voxel prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 17–24. Cited by: §2.1.
  • T. Vercauteren et al. (2009) Diffeomorphic demons: efficient non-parametric image registration. NeuroImage 45 (1), pp. S61–S72. Cited by: §2.1.
  • X. Yang et al. (2017) Quicksilver: fast predictive image registration–a deep learning approach. NeuroImage 158, pp. 378–396. Cited by: §2.1.
  • B. T. Yeo, M. R. Sabuncu, T. Vercauteren, D. J. Holt, K. Amunts, K. Zilles, P. Golland, and B. Fischl (2010) Learning task-optimal registration cost functions for localizing cytoarchitecture and function in the cerebral cortex. IEEE transactions on medical imaging 29 (7), pp. 1424–1441. Cited by: §2.1.
  • A. Yurtman and B. Barshan (2014) Automated evaluation of physical therapy exercises using multi-template dynamic time warping on wearable sensor signals. Computer methods and programs in biomedicine 117 (2), pp. 189–207. Cited by: §1.
  • M. Zhang et al. (2017) Frequency diffeomorphisms for efficient image registration. In IPMI, pp. 559–570. Cited by: §2.1.

1 Architectures.

We model the conditional template network architecture as a decoder with a dense layer followed by several upsampling and convolutional levels. Class attributes are encoded as one-hot representations, and continuous attributes are encoded as scalars. For toy datasets, we use a dense layer from the input attributes to a image with features followed by three upsampling levels with two convolution layers each with  features each. The value for  is set to  in most situations, and in the latent variable experiment, we avoid over-fitting by using a bottleneck of 1 and  of , for both our method and the baseline. Unconditional templates involve a single layer with a learnable parameter at each pixel.
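The attribute encoding described above (one-hot for classes, scalars for continuous attributes) might look like the following sketch. Concatenating the two parts into a single input vector for the dense layer is an assumption, as is any rescaling of continuous attributes before encoding; the function name `encode_attributes` is hypothetical.

```python
import numpy as np

def encode_attributes(class_index, num_classes, continuous=()):
    """Build a decoder input vector: one-hot class followed by scalar attributes.

    Assumption: the parts are simply concatenated; continuous values are
    passed through unchanged (any normalization is left to the caller).
    """
    one_hot = np.zeros(num_classes, dtype=np.float32)
    one_hot[class_index] = 1.0
    scalars = np.asarray(continuous, dtype=np.float32).reshape(-1)
    return np.concatenate([one_hot, scalars])
```

For instance, a two-class attribute such as sex together with one continuous attribute such as a rescaled age would produce a three-element vector.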

For our conditional 3D neuroimaging template, we use a dense layer to a 3D image with  features, followed by a level of upsampling with three convolution layers and  features. All kernels are of size .

We base our design for the registration network on the architecture described in recent learning-based registration frameworks [9]. Specifically, we use a U-Net style architecture with four downsampling and upsampling layers, each involving a convolutional layer with 32 features and 3x3 kernel size. This is followed by two more convolution layers. For baseline templates – instances, and those produced by decoder-based models – we learn a registration network using the same architecture.

2 Quickdraw result examples

Figure 14: Quickdraw example templates. Left: example and learned atlases for the D-class QuickDraw dataset, and below variability examples similar to Figure 7-left. Right: templates for different scales and classes learned using D-class-scale simulations.

3 Supplementary Video

We include a supplementary video at http://voxelmorph.mit.edu/atlas_creation/, illustrating our method’s ability to synthesize templates on demand based on given attributes. Specifically, the video shows the brain template conditioned on age, between 15 and 90 years old, with age also serving as the video frame index.