Generative Adversarial Registration for Improved Conditional Deformable Templates

by Neel Dey, et al.
NYU college

Deformable templates are essential to large-scale medical image registration, segmentation, and population analysis. Current conventional and deep network-based methods for template construction use only regularized registration objectives and often yield templates with blurry and/or anatomically implausible appearance, confounding downstream biomedical interpretation. We reformulate deformable registration and conditional template estimation as an adversarial game wherein we encourage realism in the moved templates with a generative adversarial registration framework conditioned on flexible image covariates. The resulting templates exhibit significant gain in specificity to attributes such as age and disease, better fit underlying group-wise spatiotemporal trends, and achieve improved sharpness and centrality. These improvements enable more accurate population modeling with diverse covariates for standardized downstream analyses and easier anatomical delineation for structures of interest.








Code Repositories


[ICCV 2021] Generative Adversarial Registration for Improved Conditional Deformable Templates


1 Introduction

Deformable image registration enables the quantification of geometric dissimilarity via the pairwise warping of a source image to a target. In the context of population studies, pairwise registration of a subject onto a deformable template is a central step in standardized analyses, where an ideal template is an unbiased barycentric representation of the (sub-)population of interest [9, 7, 50]. Templates play a key role in diverse large-scale biomedical imaging tasks such as alignment to a common coordinate system [34, 91], brain extraction [40, 46], segmentation [17, 46], and image and shape regression models [32, 72], among others.

Templates can be obtained from a reference database, but are preferably constructed for specific populations by optimizing for an image which minimizes deformation to each individual subject. As the template strongly affects subsequent morphometric analysis [87, 94], template construction has received significant attention. Further, as a single template cannot capture the high levels of structural variability within a population (e.g., age and cohort), we consider conditional template estimation with continuous and/or categorical attributes. Conditional templates constructed on image sets with diverse covariates enable sub-population modeling accounting for information learned from the overall population and obviate the need for arbitrary thresholding of demographic information to perform independent analyses [21, 24, 104].

Implicit models for template estimation [1, 9, 7, 50, 62, 102] alternate between registration of each scan to the current template estimate and updating the template based on averages of the warped subject scans. Due to the averaging of aligned image intensities, the resulting templates may blur significantly in regions with high-frequency deformations even alongside shape corrections [7]. Recently, explicit template estimation via unsupervised deep networks was proposed in [24] where each stochastic update of a registration network yields a (potentially conditional) template without averaging aligned images and transformations.

However, both implicit and explicit models typically only minimize an image dissimilarity term between the moved template and fixed image (and/or vice-versa) subject to application-specific regularization ensuring a diffeomorphic (smooth, differentiable, and invertible) transformation. As inter-brain variability includes complex topological changes not captured by purely diffeomorphic models, estimated templates are often unrealistic and do not resemble the data that they represent. Sub-optimal appearance impacts downstream applications due to ambiguous and/or implausible anatomical boundaries. For example, in order to register one or more expert-annotated templates to target images for atlas-based segmentation [46, 58, 63], the template(s) must have clearly distinguishable anatomical boundaries to enable expert delineation. Unfortunately, structural anatomical boundaries are often obfuscated by current template estimation approaches.

In this paper, we present a learning framework to estimate sharp (optionally conditional) templates with realistic anatomical structure via generative adversarial learning. Our core insight is that in addition to possessing high registration accuracy, the distribution of moved template images should be indistinguishable from the real image distribution. We develop a generator comprising template generation and registration sub-networks and a discriminator which assesses the realism and condition-specificity of the synthesized and warped templates. As adversarial objectives encourage high-frequency detail, the templates gain crisp naturalistic boundaries without the need for ad hoc post-processing. To develop stable and accurate 3D GANs for large medical volumes with highly limited sample and batch sizes, we carefully design optimization and architectural schemes, augmentation strategies, and conditioning mechanisms.

Our contributions include: (1) a generative adversarial approach to deformable template generation and registration; (2) construction of conditional templates across diverse challenging datasets including neuroimages of pre-term and term-born newborns, adults with and without Huntington’s disease, and real-world images of faces; (3) improvements on current template construction methodologies in terms of centrality and interpretability alongside significantly increased condition-specificity. Code is available at

2 Related work

Generative adversarial networks [37] have led to remarkable progress in high-fidelity image generation [15, 52, 53, 76]. Consequently, GANs for image-to-image translation [43, 47, 76, 101, 108] and generic inverse problems in imaging [26, 57, 86] have demonstrated that in addition to reconstruction objectives, adversarial regularizers dramatically increase the visual fidelity of the reconstructions by compensating for high-frequency details typically lost by using reconstruction objectives alone. We apply analogous reasoning in our use of conditional adversarial regularization of registration objectives. For conditional generator networks, modulating every feature map with learned conditional scales and shifts has led to significantly improved image synthesis [15, 77, 82] over methods where the attribute vector is concatenated to the input.

Deformable image registration is the spatial deformation of a source image to a target such that a regularized image-matching cost is minimized. Optical flow registration commonly deployed in computer vision [16, 75] admits deformations which may create anatomically implausible transformations. Instead, a series of registration algorithms more suited to biomedical images have been developed [22, 29, 49, 83, 92], further leading to several topology-preserving diffeomorphic extensions [6, 11, 19, 97, 98, 105, 89]. More recently, deep networks trained under either supervised [18, 90, 103] or unsupervised [10, 25, 28, 56, 59, 73] registration objectives have emerged, simultaneously offering both greater modeling flexibility and accelerated inference performance.

Generative adversarial registration leveraging simulations has been used in works such as [42], where large-scale finite element simulations of plausible deformations serve as the real domain for a GAN loss alongside supervised registration. Simulated pairs of aligned and mis-aligned image patches have also been used to adversarially optimize a registration network [31, 35]. Our approach is distinct in several respects: we focus on templates and not just registration, we develop adversarial registration techniques that account for covariates, we process complete 3D volumes, and we do not use simulation, focusing only on moved template realism.

Template estimation enables standardized analyses of image sets by acting as barycentric representations of a given population. Unconditional template construction has a rich history in medical image analysis [1, 9, 7, 50, 62, 102]. Due to the blurring induced by image and shape averaging of aligned images, popular registration frameworks perform ad hoc template post-processing with Laplacian sharpening [7] which may inadvertently create implausible structure [2] and may still fail to resolve structures in highly variable populations. Further, given covariates of interest, ad hoc analyses may ignore shared information by dividing the dataset into sub-populations of interest and constructing templates for each independently. More principled approaches explicitly account for age and potentially other covariates by building spatiotemporal templates and have been extensively validated on pediatric [33, 36, 39, 54, 85, 88] and adult [14, 27, 45, 84] neuroimages.

In this work, we extensively build upon VoxelMorph Templates [24] (referred to as VXM in this work), a deep network for template estimation. Driven by a generative model, unconditional VXM considers a grid of free parameters as a template, which is used together with a training image as input to a registration network [25]. The network estimates a diffeomorphic displacement field between each image and the template. Both the registration network and the template parameters are trained end-to-end under a regularized image matching cost. For conditional VXM, an attribute vector is input to a convolutional decoder to generate a conditional template, which is then similarly processed by the registration network for end-to-end training. Subsequent sections detail our methodologies and improvements.

Figure 1: Overview of the proposed template construction framework. A template generation network (a) processes an array of learned parameters with a convolutional decoder whose feature-wise affine parameters are learned from input conditions to generate a conditional template. A registration network (b) warps the generated template to a randomly sampled fixed image. A discriminator (c) is trained to distinguish between moved synthesized templates and randomly sampled images such that realism and condition-specificity are encouraged.

3 Methodology

Figure 1 gives an overview of our approach. The generator network ((a) and (b)) synthesizes a conditional template and deforms it to a fixed image to be assessed by a discriminator (c). The framework is trained end-to-end under a regularized registration and adversarial cost to encourage both registration accuracy and template realism.

Template Generation Sub-network. We develop an architecture whose backbone is agnostic to conditional or unconditional training. For unconditional training, we use a randomly-initialized parameter array (similar to [52, 53]) at half the spatial resolution of the desired template, which is processed by a convolutional decoder. The decoder output is added to the linear average of training images to generate the unconditional template, such that the network primarily learns to generate high-frequency detail. Checkerboard patterns generated by unconditional VXM are ameliorated in this design by imposing spatial priors through convolutions. However, the central advantage of this backbone architecture is that it enables more parameter-efficient and powerful mechanisms for conditional training, as below.

For conditional training, given a condition vector $z$ and a feature map $h_{\ell,c}$ from the $\ell$th layer and $c$th channel, we feature-wise linearly modulate (FiLM [80]) all features in the backbone such that $h'_{\ell,c} = \gamma_{\ell,c}(z)\, h_{\ell,c} + \beta_{\ell,c}(z)$, where $\gamma_{\ell,c}$ and $\beta_{\ell,c}$ are scale and shift parameters learned from $z$. As shown in Figure 1(a), we use a four-layer MLP to generate a shared conditional embedding from $z$, which is then linearly projected with weight decay individually to every layer in the template network to generate feature-wise transformations. The primary benefit of this design is that with conditioning at every layer (as opposed to conditional VXM where the only source of conditioning is at its input), the template network has a higher capacity to fit datasets with high variability and synthesize more appropriate templates. A secondary benefit built upon the original VXM architecture is parameter-efficiency. The original VXM design uses a projection from $z$ to a high-dimensional vector at its input. In its neuroimaging experiments, a condition vector with 3 attributes is projected to this vector using a single large weight matrix. As the number of conditions increases (with one-hot encoded categorical attributes, e.g.), the rapidly increasing number of parameters in this weight matrix makes learning intractable. Conversely, our framework is relatively insensitive to the dimensionality of the condition vector $z$, which is processed by a shallow MLP (with 64 units) to generate channel-wise scalars.
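The feature-wise modulation described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names, the ReLU activations, the random initialization, and the tensor sizes (8 channels, a 4x4x4 feature map) are assumptions chosen for readability, while the four-layer MLP with 64 units follows the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_embedding(z, layers):
    """Shared conditional embedding: dense layers with ReLU between them (assumed)."""
    h = z
    for i, (W, b) in enumerate(layers):
        h = h @ W + b
        if i < len(layers) - 1:
            h = np.maximum(h, 0.0)  # ReLU on hidden layers only
    return h

def film(features, embedding, W_gamma, W_beta):
    """Per-channel modulation h' = gamma * h + beta.

    features:  (C, D, H, W) feature map of one decoder layer.
    embedding: (E,) shared conditional embedding.
    W_gamma, W_beta: (E, C) linear projections owned by this layer.
    """
    gamma = embedding @ W_gamma  # one scale per channel
    beta = embedding @ W_beta    # one shift per channel
    return gamma[:, None, None, None] * features + beta[:, None, None, None]

# Hypothetical sizes: 3 attributes, four dense layers of 64 units, 8 channels.
z = np.array([0.5, 1.0, 0.0])
sizes = [3, 64, 64, 64, 64]
layers = [(rng.normal(0, 0.1, (m, n)), np.zeros(n))
          for m, n in zip(sizes[:-1], sizes[1:])]
emb = mlp_embedding(z, layers)

feats = rng.normal(size=(8, 4, 4, 4))
W_gamma = rng.normal(0, 0.1, (64, 8))
W_beta = rng.normal(0, 0.1, (64, 8))
out = film(feats, emb, W_gamma, W_beta)
```

In this sketch each layer adds only 2 x E x C projection weights, independent of the spatial size of the feature map, which is the sense in which such conditioning is insensitive to both volume size and condition dimensionality.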

Figure 2: Unconditional templates learned from (a) Predict-HD and (b) dHCP. The templates generated by our framework yield sharper and more neuroanatomically-representative boundaries as highlighted in the red insets.

Registration Sub-network. We use an established U-Net registration architecture [25], which takes fixed and template images and outputs a time-stationary velocity field (SVF) $v$ [3, 4, 69]. When the SVF is integrated over time $t \in [0, 1]$, it yields a diffeomorphic displacement field $\phi$ such that $\partial \phi^{(t)} / \partial t = v(\phi^{(t)})$, where $\phi^{(0)} = \mathrm{Id}$ and $\phi^{(1)}$ represent the identity and final displacement fields, respectively. We then use $\phi^{(1)}$ with a spatial transformer [48] to deform the template to the fixed image space.
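To make the integration of a stationary velocity field concrete, the sketch below uses scaling and squaring, a common numerical scheme for this purpose. The text above does not state which integrator is used, so the scheme, the 1D setting, the number of steps, and the sinusoidal test field are all assumptions made for illustration.

```python
import numpy as np

def integrate_svf_1d(v, steps=7):
    """Integrate a 1D stationary velocity field over t in [0, 1].

    v: velocities sampled on the integer grid 0..N-1, in grid units.
    Returns the displacement u such that phi(x) = x + u(x).
    """
    x = np.arange(v.size, dtype=float)
    u = v / (2 ** steps)  # small initial step: phi ~ Id + v / 2^steps
    for _ in range(steps):
        # Self-composition phi o phi: u_new(x) = u(x) + u(x + u(x)).
        # np.interp resamples u linearly and clamps at the grid borders.
        u = u + np.interp(x + u, x, u)
    return u

N = 64
x = np.arange(N, dtype=float)
v = 3.0 * np.sin(2 * np.pi * x / N)  # hypothetical smooth velocity field
u = integrate_svf_1d(v)
phi = x + u  # warped grid positions
```

In 1D, invertibility of the resulting map corresponds to `phi` being strictly increasing, which gives a quick sanity check on the output.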

Discriminator Sub-network. We use a five-layer fully convolutional discriminator network (PatchGAN [47, 108]) to assess realism on local patches of input images. For example, given an input neuroimage volume of , the discriminator has a receptive field of . We find that discriminator regularization is essential for stable and artefact-free training, as outlined further below.

For conditional templates, the discriminator is trained to distinguish between real and synthesized images given their categorical and/or continuous covariates. For discriminator conditioning, we build on the projection method [68] commonly used in modern GANs [15, 51], defined as $f(x, y; \theta) = y^\top V \Phi(x; \theta_\Phi) + \psi(\Phi(x; \theta_\Phi); \theta_\Psi)$, where $x$ is the network input, $y$ is the condition, $f$ is the pre-activation discriminator output, $\theta = \{V, \theta_\Phi, \theta_\Psi\}$ are learned parameters such that $V$ is the embedding matrix of $y$, $\Phi(x; \theta_\Phi)$ is the network output given $x$ prior to conditioning, and $\psi(\cdot\,; \theta_\Psi)$ is a scalar function of $\Phi(x; \theta_\Phi)$. This formulation extends only to either categorical or continuous attributes and does not apply to both types of conditioning, rendering it inadmissible to neuroimaging settings where we are simultaneously interested in attributes such as age (continuous) and gender (categorical). Fortunately, under mild assumptions of conditional independence of the continuous and categorical attributes given the input, we find that similar analysis to [68] factorizes cleanly into: $f(x, y_{\mathrm{cat}}, y_{\mathrm{con}}; \theta) = y_{\mathrm{cat}}^\top V_{\mathrm{cat}} \Phi(x; \theta_\Phi) + y_{\mathrm{con}}^\top V_{\mathrm{con}} \Phi(x; \theta_\Phi) + \psi(\Phi(x; \theta_\Phi); \theta_\Psi)$, where the $\mathrm{cat}$ and $\mathrm{con}$ subscripts indicate categorical and continuous attributes, respectively.
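A projection-style discriminator output with both categorical and continuous conditions can be sketched as the sum of two inner-product terms and an unconditional term. The code below is an illustrative assumption, not the paper's code: the feature vector stands in for pooled discriminator features, the unconditional function is taken to be linear, and all names and sizes are hypothetical.

```python
import numpy as np

def projection_logit(feat, y_cat, y_con, V_cat, V_con, w_psi, b_psi=0.0):
    """Pre-activation output: y_cat^T V_cat feat + y_con^T V_con feat + psi(feat).

    feat:  (F,) discriminator features before conditioning.
    y_cat: (K,) one-hot categorical condition; V_cat: (K, F).
    y_con: (M,) continuous condition;          V_con: (M, F).
    psi is modeled as a linear map here (an assumption).
    """
    unconditional = feat @ w_psi + b_psi
    categorical = y_cat @ V_cat @ feat
    continuous = y_con @ V_con @ feat
    return categorical + continuous + unconditional

rng = np.random.default_rng(0)
F, K, M = 16, 2, 1  # hypothetical: 16 features, 2 classes, 1 continuous attribute
feat = rng.normal(size=F)
V_cat = rng.normal(size=(K, F))
V_con = rng.normal(size=(M, F))
w_psi = rng.normal(size=F)

y_cat = np.array([0.0, 1.0])  # second category active
y_con = np.array([0.35])      # e.g. a rescaled age
logit = projection_logit(feat, y_cat, y_con, V_cat, V_con, w_psi)
```

With a one-hot categorical vector, the first term simply selects one row of the embedding matrix, so each category contributes its own learned direction in feature space while the continuous term scales a shared direction.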

Figure 3: Age-conditional dHCP templates alongside template segmentations obtained by [63]. Representative real images and segmentations are visualized in the bottom row.
Figure 4: Top: volumetric trends of dHCP template segmentations for all methods overlaid upon the volumetric trends for the underlying train (purple) and test (brown) sets. Bottom: Mean deformation norms to held-out test data (lower is better) for all conditional methods.

Loss Functions. We define our objective function in the unconditional setting, with straightforward extensions to the conditional scenario. The generator uses a three-part objective including an image matching term, penalties encouraging deformation smoothness and centrality, and an adversarial term encouraging realism in the moved images. For matching, we use a squared localized normalized cross-correlation (LNCC) objective, following standard medical image analysis rationale of requiring intensity standardization in local windows [8]. For deformation regularization, we follow [24] and employ $\mathcal{R}(\phi) = \lambda_1 \|\bar{u}\|_2^2 + \lambda_2 \sum_{p \in \Omega} \|\nabla u(p)\|_2^2 + \lambda_3 \sum_{p \in \Omega} \|u(p)\|_2^2$ over voxels $p \in \Omega$, where $u$ indicates spatial displacement such that $\phi = \mathrm{Id} + u$ and $\bar{u}$ is a moving average of estimated spatial displacements. Briefly, the first term leads to small deformations across the entirety of the dataset, whereas the second and third encourage smooth and small individual deformations, respectively. The adversarial terms used to train generator and discriminator networks correspond to the least-squares GAN [65] objective, chosen for its relative stability. The overall generator loss can be summarized as a weighted sum of the matching, regularization, and adversarial terms, with weights set as in [24]. For the discriminator, we employ several forms of regularization, as detailed below.
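The deformation penalties and the least-squares adversarial terms can be written compactly. The following is a rough 2D sketch under stated assumptions: finite differences approximate the spatial gradient, the least-squares GAN targets are the common 0/1 labels, and the penalty weights in the example are placeholders rather than values from the paper.

```python
import numpy as np

def deformation_penalty(u, u_bar, lam1, lam2, lam3):
    """lam1 * ||u_bar||^2 + lam2 * sum ||grad u||^2 + lam3 * sum ||u||^2.

    u, u_bar: (2, H, W) displacement and its moving average, in voxels.
    """
    grads = np.gradient(u, axis=(1, 2))  # finite differences along rows, columns
    smoothness = sum(np.sum(g ** 2) for g in grads)
    return lam1 * np.sum(u_bar ** 2) + lam2 * smoothness + lam3 * np.sum(u ** 2)

def lsgan_discriminator_loss(d_real, d_fake):
    """Least-squares GAN: push real outputs toward 1 and fake outputs toward 0."""
    return 0.5 * np.mean((d_real - 1.0) ** 2) + 0.5 * np.mean(d_fake ** 2)

def lsgan_generator_loss(d_fake):
    """Least-squares GAN: the generator wants fake outputs to score 1."""
    return 0.5 * np.mean((d_fake - 1.0) ** 2)

# Placeholder example: a constant unit displacement on a 4x4 grid.
u = np.ones((2, 4, 4))
u_bar = np.zeros_like(u)
penalty = deformation_penalty(u, u_bar, lam1=1.0, lam2=1.0, lam3=0.5)

d_real = np.ones(10)   # a discriminator that is perfectly confident
d_fake = np.zeros(10)
d_loss = lsgan_discriminator_loss(d_real, d_fake)
g_loss = lsgan_generator_loss(d_fake)
```

A constant displacement has zero gradient, so only the magnitude term contributes in this example; a smooth but large deformation is therefore penalized by the third term even when the second term vanishes.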

GAN Stabilization. Generative adversarial training stabilizes and improves with lower image resolutions, higher batch sizes, bigger networks, and higher sample sizes [15]. However, the opposite conditions arise in neuroimaging: images are large volumes, GPU memory limits processing to low batch sizes and small networks, and sample sizes in medical imaging studies are often only a few hundred scans. Therefore, careful stabilization is required in our application. We enforce a 1-Lipschitz constraint on both generator and discriminator with spectral normalization [67] on every layer such that $W_{\mathrm{SN}} = W / \sigma(W)$, where $\sigma(W)$ is the spectral norm of weight matrix $W$. Spectral normalization has been shown to stabilize training and improve gradient feedback to the generator [23]. We further use the gradient penalty [66] on the discriminator, which additionally stabilizes GAN training, defined as $\frac{\gamma}{2}\, \mathbb{E}_{x \sim p_{\mathrm{real}}}\left[\|\nabla D(x)\|_2^2\right]$, where $\gamma$ is the penalty weight, $p_{\mathrm{real}}$ is the real distribution, and $D$ is the discriminator. As discriminator overfitting on limited data is a key cause of GAN instability, we further use differentiable augmentations [51, 95, 106, 107] on both real and synthesized images when training the discriminator. We sample random translations for all datasets and further sample from the dihedral and (a subset of) groups for 2D images and 3D volumes, respectively. Intensity-based discriminator augmentations (brightness/contrast/saturation) eventually lead to training collapse for neuroimaging datasets, but were found to improve training on a 2D RGB face dataset.
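Two of the stabilizers above are easy to illustrate in isolation: spectral normalization of a weight matrix and a random dihedral augmentation of a 2D image. The sketch below is an assumption-laden illustration. It runs power iteration to convergence on a dense matrix, whereas practical implementations typically keep a running estimate across training steps and reshape convolution kernels first; the function names and sizes are hypothetical.

```python
import numpy as np

def spectral_normalize(W, n_iter=200, seed=0):
    """Divide W by its largest singular value, estimated by power iteration."""
    rng = np.random.default_rng(seed)
    u = rng.normal(size=W.shape[0])
    for _ in range(n_iter):
        v = W.T @ u
        v /= np.linalg.norm(v)
        u = W @ v
        u /= np.linalg.norm(u)
    sigma = u @ W @ v  # estimate of the spectral norm
    return W / sigma, sigma

def random_dihedral(img, rng):
    """Apply a random rotation by a multiple of 90 degrees and an optional flip."""
    k = int(rng.integers(0, 4))  # number of quarter turns
    out = np.rot90(img, k)
    if rng.integers(0, 2):       # reflect with probability 1/2
        out = np.fliplr(out)
    return out

rng = np.random.default_rng(1)
W = rng.normal(size=(16, 8))
W_sn, sigma = spectral_normalize(W)

img = rng.normal(size=(5, 5))
aug = random_dihedral(img, rng)
```

After normalization the largest singular value of `W_sn` is approximately 1, so the corresponding linear layer cannot expand distances between inputs.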

4 Experiments

4.1 Datasets

Our experiments focus on challenging datasets for diffeomorphic template construction. All foreground/brain extraction is performed by thresholding provided segmentation labels. All neuroimages are cropped to a central field-of-view of resolution .

Figure 5: Left: Age and cohort-conditional templates for Predict-HD with representative real images visualized in the bottom row. Top-right: Inter-cohort volumetric trends from segmented templates for structures of interest for Huntington’s disease. Bottom-right: Mean deformation norms to held-out test data (lower is better) for all conditional methods.

dHCP. The developing human connectome project (dHCP) provides a dataset of newborns imaged at or near birth with gestational ages ranging from 29-45 weeks and displaying rapid week-to-week structural changes [44]. Spatiotemporal template estimation on dHCP is challenging as premature birth presents decreased cortical folding alongside increased incidences of hemorrhages, hyper-intensities, and lesions [70]. For age-conditioned template construction, we use all 558 T2w MR images from dHCP release 2 preprocessed and segmented by methodologies described in [64]. Images are affine-aligned to a constructed affine template and split at the subject-level (accounting for twins and repeat scans), resulting in 458, 15, and 85 scans for training, validation, and testing, respectively.

Predict-HD. We use a longitudinal multi-center and multi-scanner database of healthy controls and subjects with Huntington’s disease (HD) [13, 79]. HD is a (typically) adult-onset progressive neurodegenerative disease impairing motor control and cognition [99] substantially altering brain morphology. We build templates conditioned on age and the presence of the HD genetic mutation. We use 1117 T1w MR images from 388 individuals affine-aligned to MNI [34]. Image preprocessing is described in [78] and image segmentation was performed semi-automatically with labels corresponding to the Neuromorphometrics template [71]. We use 897, 30, 190 images for training, validation, and testing, split at the subject level.

FFHQ-Aging. Face images have been used as experimental vehicles to analyze various qualitative aspects of template construction [7]. We use FFHQ-Aging [74], a database of real-world face images providing labels corresponding to (binned) age, gender, and the presence of glasses. FFHQ [52] captures significantly higher variation in terms of age, head pose, and accessories (e.g., hats and costumes) as compared to datasets such as CelebA [61] and is thus a significant challenge. We resize the training images to , and use age, gender, and the presence of glasses as input conditions. FFHQ-Aging is a challenging dataset, as topological changes (e.g., mouths open or closed) render diffeomorphisms to be a severely limited class of transformations for such images.

Setting | Method | Avg. Dice (↑) | Mov. Def. (↓) | EFC (↓) | Avg. Jacobian det. | Avg. Def. (↓)
Table 1: Quantitative evaluations on neuroimaging data for all methods including average Dice scores to test data, norms of accumulated moving average deformations over the course of training, entropy focus criteria (EFC), average Jacobian determinant to test data, and average deformation norms to test data. All deep network methods result in comparable Dice and Jacobian determinants, with our method (Ours) demonstrating improvements in aspects essential to template generation such as sharpness and deformation centrality.

4.2 Baselines and Evaluation Strategies

Baselines & Ablations. We first compare with the widely-used unconditional template construction algorithm SyGN [7] implemented in the ANTs library [96]. We then perform comparisons with a deep network for conditional and unconditional template estimation (VXM [24]) trained under its original objective. To isolate our core differences from VXM, we use the same registration network for all settings. We use ablated variants to investigate whether adding a discriminator network to the original framework (Ablation/VXM+Adv) or training our architecture under only a regularized registration cost without a discriminator (Ablation/noAdv) yields similar improvements to our combined framework with both architectural changes and discriminator networks (Ours). As Ablation/noAdv is an ablation of Ours, it retains spectral normalization in the template generation branch, which may unnecessarily hamper its performance when trained without an adversarial cost. We do not compare with conventionally-optimized spatiotemporal template construction methods as, to our knowledge, there are none that generically apply across diverse image sets (i.e., neonatal T2 MRI, adult T1 MRI, and RGB faces) and account for arbitrary covariates, and existing methods typically require significant computational resources and domain knowledge.

Evaluation. Constructed templates are difficult to evaluate, as several properties are often desired. For example, weak deformation regularization enables exact matching of templates to target images at the cost of generating anatomically-impossible deformation magnitudes and irregularities, whereas strong regularization provides smaller and more central deformations but produces poor alignment [93]. We posit that preferable templates simultaneously present increased sharpness, accurate alignment, and small and smooth deformations to the target population.

We follow standard methods for quantifying template/MRI sharpness [30, 50, 60, 81] with the entropy focus criterion [5]. To assess registration quality and centrality, we follow the evaluation protocols defined in [24] on held-out test data, including: (1) average Dice coefficients for segmentation labels deformed from the template to the target, corresponding to registration accuracy; (2) mean determinant of the Jacobian matrix $J_\phi(p)$ of the deformation field over voxels $p$, with $|J_\phi(p)| \leq 0$ indicating local folding of the deformation field and values near 1 corresponding to smooth local deformation; (3) average deformation norms to the target images, with lower values indicating improved template centrality given equivalent registration accuracy. Finally, we estimate the norm of the moving average of deformations accumulated over training iterations, with lower values corresponding to increased centrality.
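The evaluation quantities above have simple reference forms. The sketch below gives 2D NumPy versions of the Dice coefficient, the Jacobian determinant of a displacement field, and one common formulation of the entropy focus criterion. The exact EFC variant and normalization used in the paper are not stated here, so that function in particular should be read as an assumption.

```python
import numpy as np

def dice(a, b):
    """Dice overlap 2|A and B| / (|A| + |B|) between two binary masks."""
    a = a.astype(bool)
    b = b.astype(bool)
    denom = a.sum() + b.sum()
    return 2.0 * np.logical_and(a, b).sum() / denom if denom else 1.0

def jacobian_determinant_2d(u):
    """Determinant of the Jacobian of phi = Id + u for u of shape (2, H, W).

    u[0] is the displacement along rows, u[1] along columns (in voxels).
    """
    d0_dr, d0_dc = np.gradient(u[0])
    d1_dr, d1_dc = np.gradient(u[1])
    return (1.0 + d0_dr) * (1.0 + d1_dc) - d0_dc * d1_dr

def entropy_focus_criterion(img):
    """Unnormalized EFC: entropy of intensities scaled by the image energy."""
    x = np.abs(img).ravel().astype(float)
    x_max = np.sqrt(np.sum(x ** 2))
    p = x[x > 0] / x_max
    return -np.sum(p * np.log(p))

mask = np.zeros((8, 8), dtype=bool)
mask[2:6, 2:6] = True

identity = np.zeros((2, 8, 8))
folded = np.zeros((2, 8, 8))
folded[1] = -2.0 * np.arange(8)[None, :]  # maps column c to -c: a fold

det_identity = jacobian_determinant_2d(identity)
det_folded = jacobian_determinant_2d(folded)

point = np.zeros((8, 8))
point[4, 4] = 1.0
flat = np.ones((8, 8))
```

Under this formulation an image whose energy is concentrated in few voxels scores lower than a flat image, which matches the reading of lower EFC as sharper.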

All considered datasets present significant gaps between training and test set age distributions and hence our quantitative registration evaluations are performed on a subset of the overall data range. For Dice evaluation, all templates for Predict-HD were segmented following procedures detailed in [24], whereas the templates for dHCP were segmented with [63]. See the appendix for further details.

Figure 6: Age and cohort-conditional FFHQ templates showing qualitatively improved perceptual fidelity with our framework.

4.3 Implementation Details

We briefly outline relevant details and describe all design choices and architectures in the appendix. We use a batch size of 1 for 3D neuroimaging volumes and a batch size of 32 for 2D planar images. As all datasets considered have highly imbalanced age distributions, we sample minority time-points at higher frequencies while training conditionally. Crucially, we do not make dataset-specific hyperparameter choices beyond regularization for GAN stability, whose strength is set separately for Predict-HD, dHCP, and FFHQ-Aging, loosely corresponding to their respective stabilization requirements. For fair comparison, the hyperparameters and architectures pertaining specifically to registration were matched to the suggested hyperparameters for VXM for all deep networks.
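One simple way to sample minority time-points more often is inverse-frequency weighting over age bins. The text above does not specify the sampling scheme, so the binning, the number of bins, and the synthetic age distribution below are purely illustrative assumptions.

```python
import numpy as np

def inverse_frequency_weights(ages, n_bins=8):
    """Sampling probabilities that give every non-empty age bin equal total mass."""
    ages = np.asarray(ages, dtype=float)
    edges = np.linspace(ages.min(), ages.max(), n_bins + 1)
    idx = np.digitize(ages, edges[1:-1])  # bin index in 0..n_bins-1
    counts = np.bincount(idx, minlength=n_bins)
    w = 1.0 / counts[idx]                 # rarer bins get larger weights
    return w / w.sum()

rng = np.random.default_rng(0)
# Hypothetical imbalanced cohort: 90 scans near term, 10 pre-term scans.
ages = np.concatenate([rng.uniform(38, 42, 90), rng.uniform(29, 31, 10)])
p = inverse_frequency_weights(ages)
batch = rng.choice(len(ages), size=16, p=p)  # indices for one training batch
```

In this example the ten pre-term scans make up 10% of the data but receive a larger share of the sampling probability, since each occupied bin is drawn equally often.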

4.4 Results and Analysis

Figure 2 shows unconditional templates for dHCP and Predict-HD, highlighting that the adversarial approach yields more anatomically-accurate templates. For example, in the sagittal view of the Predict-HD templates (bottom row), we observe anatomical structures more clearly within the red insets when using adversarial regularizers, whereas prior methods are unable to do so. SyGN is unable to resolve structures which display high-frequency deformations due to its reliance on averaging aligned intensities and shapes.

For unconditional templates, we observe subtle differences between the proposed method and its reduction Ablation/VXM+Adv, with either approach having distinct advantages depending on their intended use. Ablation/VXM+Adv trains faster as it does not use 3D convolutions in its template generation branch. Conversely, Ours removes subtle checkerboard patterns generated by previous methods. Interestingly, unconditional Ablation/noAdv produces stronger moving average deformation magnitudes than other deep network approaches (Table 1), suggesting that the adversarial objective is necessary for this setting to obtain optimal results. However, subsequent analysis of conditional template estimation reveals significant differences between these approaches.

Figure 3 provides sample age-conditional templates alongside anatomical segmentations for dHCP. Methods that use input-concatenation architectures for template generation (VXM and Ablation/VXM+Adv) underfit the rapid week-to-week developmental changes in this dataset, whereas conditionally-modulated architectures (Ablation/noAdv and Ours) can generate templates which closely follow the underlying trends in the data. Figure 4 further shows that the conditionally-modulated models better represent volumetric changes in the held-out test set (top row) while showing improved centrality (bottom row). In comparison to Ablation/noAdv, the complete framework (Ours) better fits spatiotemporal image contrast, is sharper, and shows mildly improved centrality (Table 1). We underscore that the training and test image volumes correspond to the affine pre-aligned data and thus reflect relative volumes in the affine template space.

We observe similar trends in Figure 5 (left) for Predict-HD. While age and cohort-conditional templates generated by VXM and Ablation/VXM+Adv do show subtle geometric variation, stronger spatiotemporal changes and dataset-similarity are observed with our complete framework (compare changes in ventricles within the dashed boxes). Analogous improvements in test-set deformation magnitude for Predict-HD are shown in the boxplots (bottom-right). Volume trends of segmented templates for regions pertinent to Huntington’s disease are shown in the top-right and follow expected trends such as larger ventricular volumes and smaller basal ganglia volumes in the group with the Huntington’s mutation as compared to the controls.

Figure 6 illustrates qualitative FFHQ-Aging templates. Row 1 shows the conditional linear averages of the training set. As above, methods trained without adversarial losses (rows 2 and 4) correctly learn spatiotemporal changes while methods trained with them (rows 3 and 5) present stronger appearance variation and yield improved template perception, e.g. by removing border artefacts in the male cohort. Ours further removes border artefacts from the female cohort and increases shape variation and perceptual fidelity (bottom-left row). Further analysis of FFHQ-Aging template construction is presented in the Appendix.

Table 1 summarizes quantitative results. All methods achieve comparable Dice coefficients and produce smooth deformations (). However, the proposed techniques generally yield improved sharpness (lower entropy focus criteria) and improved centrality (lower deformation norms) while showing equivalent registration performance, indicating that the constructed templates are more barycentric representations of the data. We stress that Dice coefficients between unconditional and conditional settings cannot be directly compared as the template segmentations were obtained via different approaches. EFC interpretation requires care. While the numerical differences are subtle, EFC range is more restricted. For example, Gaussian filtering () of the Predict-HD unconditional template from our method only increases EFC from to , indicating that smaller changes are meaningful. Conditional EFC indicate that using a discriminator significantly improves sharpness across scan age ( between VXM and Ours for both datasets), with trends best interpreted via the temporal EFC plots given in the appendix.

5 Conclusions

Explicit methods for the construction of conditional deformable templates using stochastic gradient descent and deep networks are powerful tools for the efficient and flexible modeling of image populations with arbitrary covariates. In this work, we take a generative synthesis approach towards explicit template estimation by constructing a framework which enables training on datasets challenging for generative adversarial networks. The resulting templates are sharp and easy to delineate for domain experts, more representative of the underlying demographics, and closely follow typical development in both MRI of developing pre-term neonates and adult MRI sampled across typical lifespans with and without neurodegeneration. While this work is motivated from the perspective of neuroimaging, it applies to generic imaging modalities.


This work was supported by NIH grants NIBIB R01EB021391, 1R01HD088125-01A1, 1R01DA038215-01A1, U01 NS082086, 1R01AG064027, and NS050568. Part of the HPC resources used for this research were provided by NSF grant MRI-1229185.

Original and processed Predict-HD image data and segmentations were provided by the University of Iowa as part of a collaborative NIH grant U01 NS082086. dHCP results were obtained using data made available from the Developing Human Connectome Project funded by the European Research Council under the European Union’s Seventh Framework Programme (FP/2007-2013) / ERC Grant Agreement no. [319456].


  • [1] S. Allassonnière, Y. Amit, and A. Trouvé (2007) Towards a coherent statistical framework for dense deformable template estimation. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 69 (1), pp. 3–29. Cited by: §1, §2.
  • [2] ANTsX. Fix sharpening artifacts in template construction · issue 956. External Links: Link Cited by: §2.
  • [3] V. Arsigny, O. Commowick, X. Pennec, and N. Ayache (2006) A log-euclidean framework for statistics on diffeomorphisms. In Medical Image Computing and Computer-Assisted Intervention – MICCAI 2006, R. Larsen, M. Nielsen, and J. Sporring (Eds.), Berlin, Heidelberg, pp. 924–931. External Links: ISBN 978-3-540-44708-5 Cited by: §C.2, §3.
  • [4] J. Ashburner (2007) A fast diffeomorphic image registration algorithm. NeuroImage 38 (1), pp. 95 – 113. External Links: ISSN 1053-8119, Document, Link Cited by: §C.2, §3.
  • [5] D. Atkinson, D. L. G. Hill, P. N. R. Stoyle, P. E. Summers, and S. F. Keevil (1997) Automatic correction of motion artifacts in magnetic resonance images using an entropy focus criterion. IEEE Transactions on Medical Imaging 16 (6), pp. 903–910. External Links: Document Cited by: §4.2.
  • [6] B. B. Avants, C. L. Epstein, M. Grossman, and J. C. Gee (2008) Symmetric diffeomorphic image registration with cross-correlation: evaluating automated labeling of elderly and neurodegenerative brain. Medical image analysis 12 (1), pp. 26–41. Cited by: §2.
  • [7] B. B. Avants, P. Yushkevich, J. Pluta, D. Minkoff, M. Korczykowski, J. Detre, and J. C. Gee (2010) The optimal template effect in hippocampus studies of diseased populations. Neuroimage 49 (3), pp. 2457–2466. Cited by: §C.1.3, §C.2, §1, §1, §2, §4.1, §4.2.
  • [8] B. B. Avants, N. J. Tustison, G. Song, P. A. Cook, A. Klein, and J. C. Gee (2011) A reproducible evaluation of ants similarity metric performance in brain image registration. NeuroImage 54 (3), pp. 2033 – 2044. External Links: ISSN 1053-8119, Document, Link Cited by: §3.
  • [9] B. Avants and J. C. Gee (2004) Geodesic estimation for large deformation anatomical shape averaging and interpolation. Neuroimage 23, pp. S139–S150. Cited by: §1, §1, §2.
  • [10] G. Balakrishnan, A. Zhao, M. R. Sabuncu, J. Guttag, and A. V. Dalca (2019) VoxelMorph: a learning framework for deformable medical image registration. IEEE Transactions on Medical Imaging 38 (8), pp. 1788–1800. External Links: Document Cited by: §2.
  • [11] M. F. Beg, M. I. Miller, A. Trouvé, and L. Younes (2005) Computing large deformation metric mappings via geodesic flows of diffeomorphisms. International journal of computer vision 61 (2), pp. 139–157. Cited by: §2.
  • [12] M. F. Beg, M. I. Miller, A. Trouvé, and L. Younes (2005-02-01) Computing large deformation metric mappings via geodesic flows of diffeomorphisms. International Journal of Computer Vision 61 (2), pp. 139–157. External Links: ISSN 1573-1405, Document, Link Cited by: §C.2.
  • [13] K. M. Biglan, Y. Zhang, J. Long, M. Geschwind, G. Kang, A. Killoran, W. Lu, E. McCusker, J. A. Mills, L. A. Raymond, et al. (2013) Refining the diagnosis of huntington disease: the predict-hd study. Frontiers in aging neuroscience 5, pp. 12. Cited by: §4.1.
  • [14] A. Bône, O. Colliot, and S. Durrleman (2020-07) Learning the spatiotemporal variability in longitudinal shape data sets. International Journal of Computer Vision 128 (12), pp. 2873–2896. External Links: Document, Link Cited by: §2.
  • [15] A. Brock, J. Donahue, and K. Simonyan (2018) Large scale gan training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096. Cited by: §C.2, §2, §3, §3.
  • [16] T. Brox and J. Malik (2011) Large displacement optical flow: descriptor matching in variational motion estimation.. IEEE Trans. Pattern Anal. Mach. Intell. 33 (3), pp. 500–513. External Links: Link Cited by: §2.
  • [17] M. Cabezas, A. Oliver, X. Lladó, J. Freixenet, and M. B. Cuadra (2011) A review of atlas-based segmentation for magnetic resonance brain images. Computer methods and programs in biomedicine 104 (3), pp. e158–e177. Cited by: §1.
  • [18] X. Cao, J. Yang, J. Zhang, D. Nie, M. Kim, Q. Wang, and D. Shen (2017) Deformable image registration based on similarity-steered cnn regression. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 300–308. Cited by: §2.
  • [19] Y. Cao, M. I. Miller, R. L. Winslow, and L. Younes (2005) Large deformation diffeomorphic metric mapping of vector fields. IEEE transactions on medical imaging 24 (9), pp. 1216–1230. Cited by: §2.
  • [20] L. Chen, G. Papandreou, F. Schroff, and H. Adam (2017) Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587. Cited by: §C.1.3.
  • [21] J. Cheng, A. V. Dalca, and L. Zöllei (2020) Unbiased atlas construction for neonatal cortical surfaces via unsupervised learning. In Medical Ultrasound, and Preterm, Perinatal and Paediatric Image Analysis, Y. Hu, R. Licandro, J. A. Noble, J. Hutter, S. Aylward, A. Melbourne, E. Abaci Turk, and J. Torrents Barrena (Eds.), Cham, pp. 334–342. Cited by: §1.
  • [22] G. E. Christensen, S. C. Joshi, and M. I. Miller (1997) Volumetric transformation of brain anatomy. IEEE Transactions on Medical Imaging 16 (6), pp. 864–877. External Links: Document Cited by: §2.
  • [23] C. Chu, K. Minami, and K. Fukumizu (2020) Smoothness and stability in gans. In International Conference on Learning Representations, External Links: Link Cited by: §3.
  • [24] A. Dalca, M. Rakic, J. Guttag, and M. Sabuncu (2019) Learning conditional deformable templates with convolutional networks. In Advances in neural information processing systems, pp. 806–818. Cited by: §C.1.1, §C.1.2, §C.2, §1, §1, §2, §3, §4.2, §4.2, §4.2.
  • [25] A. V. Dalca, G. Balakrishnan, J. Guttag, and M. R. Sabuncu (2019) Unsupervised learning of probabilistic diffeomorphic registration for images and surfaces. Medical Image Analysis 57, pp. 226 – 236. External Links: ISSN 1361-8415, Document, Link Cited by: §C.2, §2, §2, §3.
  • [26] G. Daras, J. Dean, A. Jalal, and A. G. Dimakis (2021) Intermediate layer optimization for inverse problems using deep generative models. External Links: 2102.07364 Cited by: §2.
  • [27] B. C. Davis, P. T. Fletcher, E. Bullitt, and S. Joshi (2010) Population shape regression from random design data. International journal of computer vision 90 (2), pp. 255–266. Cited by: §2.
  • [28] B. D. de Vos, F. F. Berendsen, M. A. Viergever, M. Staring, and I. Išgum (2017) End-to-end unsupervised deformable image registration with a convolutional neural network. In Deep learning in medical image analysis and multimodal learning for clinical decision support, pp. 204–212. Cited by: §2.
  • [29] Dinggang Shen and C. Davatzikos (2002) HAMMER: hierarchical attribute matching mechanism for elastic registration. IEEE Transactions on Medical Imaging 21 (11), pp. 1421–1439. External Links: Document Cited by: §2.
  • [30] O. Esteban, D. Birman, M. Schaer, O. O. Koyejo, R. A. Poldrack, and K. J. Gorgolewski (2017-09) MRIQC: advancing the automatic prediction of image quality in MRI from unseen sites. PLOS ONE 12 (9), pp. e0184661. External Links: Document Cited by: §4.2.
  • [31] J. Fan, X. Cao, Q. Wang, P. Yap, and D. Shen (2019) Adversarial learning for mono-or multi-modal registration. Medical image analysis 58, pp. 101545. Cited by: §2.
  • [32] J. Fishbaugh, M. Prastawa, G. Gerig, and S. Durrleman (2013) Geodesic shape regression in the framework of currents. In Information Processing in Medical Imaging, J. C. Gee, S. Joshi, K. M. Pohl, W. M. Wells, and L. Zöllei (Eds.), Berlin, Heidelberg, pp. 718–729. External Links: ISBN 978-3-642-38868-2 Cited by: §1.
  • [33] J. Fishbaugh, M. Styner, K. Grewen, J. Gilmore, and G. Gerig (2019) Spatiotemporal modeling for image time series with appearance change: application to early brain development. In Multimodal Brain Image Analysis and Mathematical Foundations of Computational Anatomy, D. Zhu, J. Yan, H. Huang, L. Shen, P. M. Thompson, C. Westin, X. Pennec, S. Joshi, M. Nielsen, T. Fletcher, S. Durrleman, and S. Sommer (Eds.), Cham, pp. 174–185. Cited by: §2.
  • [34] V. Fonov, A. C. Evans, K. Botteron, C. R. Almli, R. C. McKinstry, D. L. Collins, B. D. C. Group, et al. (2011) Unbiased average age-appropriate atlases for pediatric studies. Neuroimage 54 (1), pp. 313–327. Cited by: §C.1.1, §1, §4.1.
  • [35] Y. Fu, Y. Lei, T. Wang, K. Higgins, J. Bradley, W. Curran, T. Liu, and X. Yang (2020-02) LungRegNet: an unsupervised deformable image registration method for 4d-ct lung. Medical Physics 47, pp. . External Links: Document Cited by: §2.
  • [36] A. Gholipour, C. K. Rollins, C. Velasco-Annis, A. Ouaalam, A. Akhondi-Asl, O. Afacan, C. M. Ortinau, S. Clancy, C. Limperopoulos, E. Yang, J. A. Estroff, and S. K. Warfield (2017-03) A normative spatiotemporal MRI atlas of the fetal brain for automatic segmentation and analysis of early brain growth. Scientific Reports 7 (1). External Links: Document, Link Cited by: §2.
  • [37] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680. Cited by: §2.
  • [38] I. S. Gousias, A. D. Edwards, M. A. Rutherford, S. J. Counsell, J. V. Hajnal, D. Rueckert, and A. Hammers (2012) Magnetic resonance imaging of the newborn brain: manual segmentation of labelled atlases in term-born and preterm infants. Neuroimage 62 (3), pp. 1499–1509. Cited by: §C.1.2.
  • [39] P. A. Habas, K. Kim, J. M. Corbett-Detig, F. Rousseau, O. A. Glenn, A. J. Barkovich, and C. Studholme (2010) A spatiotemporal atlas of mr intensity, tissue probability and shape of the fetal brain with application to segmentation. NeuroImage 53 (2), pp. 460–470. External Links: ISSN 1053-8119, Document, Link Cited by: §2.
  • [40] X. Han, R. Kwitt, S. Aylward, S. Bakas, B. Menze, A. Asturias, P. Vespa, J. Van Horn, and M. Niethammer (2018) Brain extraction from normal and pathological images: a joint pca/image-reconstruction approach. NeuroImage 176, pp. 431–445. Cited by: §1.
  • [41] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter (2018) GANs trained by a two time-scale update rule converge to a local nash equilibrium. External Links: 1706.08500 Cited by: §C.2.
  • [42] Y. Hu, E. Gibson, N. Ghavami, E. Bonmati, C. M. Moore, M. Emberton, T. Vercauteren, J. A. Noble, and D. C. Barratt (2018) Adversarial deformation regularization for training image registration neural networks. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 774–782. Cited by: §2.
  • [43] X. Huang, M. Liu, S. Belongie, and J. Kautz (2018) Multimodal unsupervised image-to-image translation. In Proceedings of the European conference on computer vision (ECCV), pp. 172–189. Cited by: §2.
  • [44] E. J. Hughes, T. Winchman, F. Padormo, R. Teixeira, J. Wurie, M. Sharma, M. Fox, J. Hutter, L. Cordero-Grande, A. N. Price, et al. (2017) A dedicated neonatal brain imaging system. Magnetic resonance in medicine 78 (2), pp. 794–804. Cited by: §4.1.
  • [45] W. Huizinga, D.H.J. Poot, M.W. Vernooij, G.V. Roshchupkin, E.E. Bron, M.A. Ikram, D. Rueckert, W.J. Niessen, and S. Klein (2018) A spatio-temporal reference model of the aging brain. NeuroImage 169, pp. 11–22. External Links: ISSN 1053-8119, Document, Link Cited by: §2.
  • [46] J. E. Iglesias and M. R. Sabuncu (2015) Multi-atlas segmentation of biomedical images: a survey. Medical image analysis 24 (1), pp. 205–219. Cited by: §1, §1.
  • [47] P. Isola, J. Zhu, T. Zhou, and A. A. Efros (2017) Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1125–1134. Cited by: §2, §3.
  • [48] M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu (2015) Spatial transformer networks. arXiv preprint arXiv:1506.02025. Cited by: §3.
  • [49] H. J. Johnson and G. E. Christensen (2001) Landmark and intensity-based, consistent thin-plate spline image registration. In Information Processing in Medical Imaging, M. F. Insana and R. M. Leahy (Eds.), Berlin, Heidelberg, pp. 329–343. External Links: ISBN 978-3-540-45729-9 Cited by: §2.
  • [50] S. Joshi, B. Davis, M. Jomier, and G. Gerig (2004) Unbiased diffeomorphic atlas construction for computational anatomy. NeuroImage 23, pp. S151–S160. Cited by: §1, §1, §2, §4.2.
  • [51] T. Karras, M. Aittala, J. Hellsten, S. Laine, J. Lehtinen, and T. Aila (2020) Training generative adversarial networks with limited data. External Links: 2006.06676 Cited by: §3, §3.
  • [52] T. Karras, S. Laine, and T. Aila (2019) A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4401–4410. Cited by: §C.1.3, §2, §3, §4.1.
  • [53] T. Karras, S. Laine, M. Aittala, J. Hellsten, J. Lehtinen, and T. Aila (2020) Analyzing and improving the image quality of stylegan. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8110–8119. Cited by: §2, §3.
  • [54] H. Kim, M. Styner, J. Piven, and G. Gerig (2020) A framework to construct a longitudinal dw-mri infant atlas based on mixed effects modeling of dodf coefficients. In Computational Diffusion MRI, pp. 149–159. Cited by: §2.
  • [55] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §C.2.
  • [56] J. Krebs, H. Delingette, B. Mailhé, N. Ayache, and T. Mansi (2019) Learning a probabilistic model for diffeomorphic registration. IEEE transactions on medical imaging 38 (9), pp. 2165–2176. Cited by: §2.
  • [57] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, and W. Shi (2017) Photo-realistic single image super-resolution using a generative adversarial network. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 105–114. External Links: Document Cited by: §2.
  • [58] H. W. Lee, M. R. Sabuncu, and A. V. Dalca (2019) Few labeled atlases are necessary for deep-learning-based segmentation. arXiv preprint arXiv:1908.04466. Cited by: §1.
  • [59] H. Li and Y. Fan (2017) Non-rigid image registration using fully convolutional networks with deep self-supervision. External Links: 1709.00799 Cited by: §2.
  • [60] X. Liu, M. Niethammer, R. Kwitt, N. Singh, M. McCormick, and S. Aylward (2015) Low-rank atlas image analyses in the presence of pathologies. IEEE transactions on medical imaging 34 (12), pp. 2583–2591. Cited by: §4.2.
  • [61] Z. Liu, P. Luo, X. Wang, and X. Tang Large-scale celebfaces attributes (celeba) dataset. Cited by: §4.1.
  • [62] P. Lorenzen, M. Prastawa, B. Davis, G. Gerig, E. Bullitt, and S. Joshi (2006) Multi-modal image set registration and atlas formation. Medical Image Analysis 10 (3), pp. 440–451. Note: Special Issue on The Second International Workshop on Biomedical Image Registration (WBIR’03) External Links: ISSN 1361-8415, Document, Link Cited by: §1, §2.
  • [63] A. Makropoulos, I. S. Gousias, C. Ledig, P. Aljabar, A. Serag, J. V. Hajnal, A. D. Edwards, S. J. Counsell, and D. Rueckert (2014) Automatic whole brain mri segmentation of the developing neonatal brain. IEEE transactions on medical imaging 33 (9), pp. 1818–1831. Cited by: Figure 14, §C.1.2, §C.1.2, §1, Figure 3, §4.2.
  • [64] A. Makropoulos, E. C. Robinson, A. Schuh, R. Wright, S. Fitzgibbon, J. Bozek, S. J. Counsell, J. Steinweg, K. Vecchiato, J. Passerat-Palmbach, et al. (2018) The developing human connectome project: a minimal processing pipeline for neonatal cortical surface reconstruction. Neuroimage 173, pp. 88–112. Cited by: §C.1.2, §4.1.
  • [65] X. Mao, Q. Li, H. Xie, R. Y.K. Lau, Z. Wang, and S. Paul Smolley (2017-10) Least squares generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Cited by: §3.
  • [66] L. Mescheder, A. Geiger, and S. Nowozin (2018-07) Which training methods for GANs do actually converge? In Proceedings of Machine Learning Research, J. Dy and A. Krause (Eds.), Vol. 80, Stockholmsmässan, Stockholm Sweden, pp. 3481–3490. External Links: Link Cited by: §3.
  • [67] T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida (2018) Spectral normalization for generative adversarial networks. In International Conference on Learning Representations, External Links: Link Cited by: §3.
  • [68] T. Miyato and M. Koyama (2018) CGANs with projection discriminator. In International Conference on Learning Representations, External Links: Link Cited by: Appendix D, §3.
  • [69] M. Modat, P. Daga, M. J. Cardoso, S. Ourselin, G. R. Ridgway, and J. Ashburner (2012) Parametric non-rigid registration using a stationary velocity field. In 2012 IEEE Workshop on Mathematical Methods in Biomedical Image Analysis, pp. 145–150. External Links: Document Cited by: §3.
  • [70] Z. Molnár and M. Rutherford (2013) Brain maturation after preterm birth. Science translational medicine 5 (168), pp. 168ps2–168ps2. Cited by: §4.1.
  • [71] Neuromorphometrics, Inc. External Links: Link Cited by: §4.1.
  • [72] M. Niethammer, Y. Huang, and F. Vialard (2011) Geodesic regression for image time-series. In International conference on medical image computing and computer-assisted intervention, pp. 655–662. Cited by: §1.
  • [73] M. Niethammer, R. Kwitt, and F. Vialard (2019) Metric learning for image registration. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8463–8472. Cited by: §2.
  • [74] R. Or-El, S. Sengupta, O. Fried, E. Shechtman, and I. Kemelmacher-Shlizerman (2020) Lifespan age transformation synthesis. External Links: 2003.09764 Cited by: §C.1.3, §4.1.
  • [75] N. Papenberg, A. Bruhn, T. Brox, S. Didas, and J. Weickert (2006-01) Highly accurate optic flow computation with theoretically justified warping. International Journal of Computer Vision 67 (2), pp. 141–158. External Links: Document, Link Cited by: §2.
  • [76] T. Park, M. Liu, T. Wang, and J. Zhu (2019-06) Semantic image synthesis with spatially-adaptive normalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §C.2, §2.
  • [77] T. Park, M. Liu, T. Wang, and J. Zhu (2019) Semantic image synthesis with spatially-adaptive normalization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2337–2346. Cited by: §2.
  • [78] J. S. Paulsen, J. D. Long, C. A. Ross, D. L. Harrington, C. J. Erwin, J. K. Williams, H. J. Westervelt, H. J. Johnson, E. H. Aylward, Y. Zhang, et al. (2014) Prediction of manifest huntington’s disease with clinical and imaging measures: a prospective observational study. The Lancet Neurology 13 (12), pp. 1193–1201. Cited by: §C.1.1, §4.1.
  • [79] J. Paulsen, D. R. Langbehn, J. C. Stout, E. Aylward, C. A. Ross, M. Nance, M. Guttman, S. Johnson, M. MacDonald, L. J. Beglinger, et al. (2008) Detection of huntington’s disease decades before diagnosis: the predict-hd study. Journal of Neurology, Neurosurgery & Psychiatry 79 (8), pp. 874–880. Cited by: §4.1.
  • [80] E. Perez, F. Strub, H. de Vries, V. Dumoulin, and A. C. Courville (2018) FiLM: visual reasoning with a general conditioning layer. In AAAI, Cited by: Figure 7, §3.
  • [81] D. Rajashekar, M. Wilms, M. E. MacDonald, J. Ehrhardt, P. Mouches, R. Frayne, M. D. Hill, and N. D. Forkert (2020-02) High-resolution t2-FLAIR and non-contrast CT brain atlas of the elderly. Scientific Data 7 (1). External Links: Document, Link Cited by: §4.2.
  • [82] M. Ren, N. Dey, J. Fishbaugh, and G. Gerig (2021) Segmentation-renormalized deep feature modulation for unpaired image harmonization. IEEE Transactions on Medical Imaging, pp. 1–1. External Links: Document Cited by: §2.
  • [83] D. Rueckert, L. I. Sonoda, C. Hayes, D. L. Hill, M. O. Leach, and D. J. Hawkes (1999) Nonrigid registration using free-form deformations: application to breast mr images. IEEE transactions on medical imaging 18 (8), pp. 712–721. Cited by: §2.
  • [84] J. Schiratti, S. Allassonniere, O. Colliot, and S. Durrleman (2015) Learning spatiotemporal trajectories from manifold-valued longitudinal data. In Neural Information Processing Systems, Cited by: §2.
  • [85] A. Schuh, A. Makropoulos, E. C. Robinson, L. Cordero-Grande, E. Hughes, J. Hutter, A. N. Price, M. Murgasova, R. P. A. Teixeira, N. Tusor, et al. (2018) Unbiased construction of a temporally consistent morphological atlas of neonatal brain development. bioRxiv, pp. 251512. Cited by: §2.
  • [86] M. Seitzer, G. Yang, J. Schlemper, O. Oktay, T. Würfl, V. Christlein, T. Wong, R. Mohiaddin, D. Firmin, J. Keegan, D. Rueckert, and A. Maier (2018) Adversarial and perceptual refinement for compressed sensing mri reconstruction. In Medical Image Computing and Computer Assisted Intervention – MICCAI 2018, A. F. Frangi, J. A. Schnabel, C. Davatzikos, C. Alberola-López, and G. Fichtinger (Eds.), Cham, pp. 232–240. Cited by: §2.
  • [87] M. L. Senjem, J. L. Gunter, M. M. Shiung, R. C. Petersen, and C. R. Jack Jr (2005) Comparison of different methodological implementations of voxel-based morphometry in neurodegenerative disease. Neuroimage 26 (2), pp. 600–608. Cited by: §1.
  • [88] A. Serag, P. Aljabar, G. Ball, S. J. Counsell, J. P. Boardman, M. A. Rutherford, A. D. Edwards, J. V. Hajnal, and D. Rueckert (2012) Construction of a consistent high-definition spatio-temporal atlas of the developing brain using adaptive kernel regression. NeuroImage 59 (3), pp. 2255–2265. External Links: ISSN 1053-8119, Document, Link Cited by: §2.
  • [89] Z. Shen, F. Vialard, and M. Niethammer (2019) Region-specific diffeomorphic metric mapping. In Advances in Neural Information Processing Systems, pp. 1098–1108. Cited by: §2.
  • [90] H. Sokooti, B. De Vos, F. Berendsen, B. P. Lelieveldt, I. Išgum, and M. Staring (2017) Nonrigid image registration using multi-scale 3d convolutional neural networks. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 232–239. Cited by: §2.
  • [91] J. Talairach (1988) 3-dimensional proportional system; an approach to cerebral imaging. co-planar stereotaxic atlas of the human brain. Thieme, pp. 1–122. Cited by: §1.
  • [92] J. Thirion (1998) Image matching as a diffusion process: an analogy with maxwell’s demons. Medical image analysis 2 (3), pp. 243–260. Cited by: §2.
  • [93] B.T. Thomas Yeo, M. R. Sabuncu, R. Desikan, B. Fischl, and P. Golland (2008) Effects of registration regularization and atlas sharpness on segmentation accuracy. Medical Image Analysis 12 (5), pp. 603–615. Note: Special issue on the 10th international conference on medical imaging and computer assisted intervention - MICCAI 2007 External Links: ISSN 1361-8415, Document, Link Cited by: §4.2.
  • [94] P. M. Thompson, R. P. Woods, M. S. Mega, and A. W. Toga (2000) Mathematical/computational challenges in creating deformable and probabilistic atlases of the human brain. Human brain mapping 9 (2), pp. 81–92. Cited by: §1.
  • [95] N. Tran, V. Tran, N. Nguyen, T. Nguyen, and N. Cheung (2020) On data augmentation for gan training. External Links: 2006.05338 Cited by: §3.
  • [96] N. J. Tustison, P. A. Cook, A. J. Holbrook, H. J. Johnson, J. Muschelli, G. A. Devanyi, J. T. Duda, S. R. Das, N. C. Cullen, D. L. Gillen, et al. (2020) ANTsX: a dynamic ecosystem for quantitative biological and medical imaging. medRxiv. Cited by: §C.2, §4.2.
  • [97] N. J. Tustison and B. B. Avants (2012) Diffeomorphic directly manipulated free-form deformation image registration via vector field flows. In Biomedical Image Registration, B. M. Dawant, G. E. Christensen, J. M. Fitzpatrick, and D. Rueckert (Eds.), Berlin, Heidelberg, pp. 31–39. External Links: ISBN 978-3-642-31340-0 Cited by: §2.
  • [98] T. Vercauteren, X. Pennec, A. Perchant, and N. Ayache (2009) Diffeomorphic demons: efficient non-parametric image registration. NeuroImage 45 (1, Supplement 1), pp. S61–S72. Note: Mathematics in Brain Imaging External Links: ISSN 1053-8119, Document Cited by: §2.
  • [99] F. O. Walker (2007) Huntington’s disease. The Lancet 369 (9557), pp. 218–228. Cited by: §4.1.
  • [100] H. Wang, J. W. Suh, S. R. Das, J. B. Pluta, C. Craige, and P. A. Yushkevich (2012) Multi-atlas segmentation with joint label fusion. IEEE transactions on pattern analysis and machine intelligence 35 (3), pp. 611–623. Cited by: §C.1.1.
  • [101] T. Wang, M. Liu, J. Zhu, A. Tao, J. Kautz, and B. Catanzaro (2018) High-resolution image synthesis and semantic manipulation with conditional gans. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.
  • [102] G. Wu, H. Jia, Q. Wang, and D. Shen (2011) SharpMean: groupwise registration guided by sharp mean image and tree-based registration. NeuroImage 56 (4), pp. 1968–1981. Cited by: §1, §2.
  • [103] X. Yang, R. Kwitt, M. Styner, and M. Niethammer (2017) Quicksilver: fast predictive image registration–a deep learning approach. NeuroImage 158, pp. 378–396. Cited by: §2.
  • [104] E. M. Yu, A. V. Dalca, and M. R. Sabuncu (2020) Learning conditional deformable shape templates for brain anatomy. In Machine Learning in Medical Imaging, M. Liu, P. Yan, C. Lian, and X. Cao (Eds.), Cham, pp. 353–362. Cited by: §1.
  • [105] M. Zhang and P. T. Fletcher (2019) Fast diffeomorphic image registration via fourier-approximated lie algebras. International Journal of Computer Vision 127 (1), pp. 61–73. Cited by: §2.
  • [106] S. Zhao, Z. Liu, J. Lin, J. Zhu, and S. Han (2020) Differentiable augmentation for data-efficient gan training. External Links: 2006.10738 Cited by: §3.
  • [107] Z. Zhao, Z. Zhang, T. Chen, S. Singh, and H. Zhang (2020) Image augmentations for gan training. External Links: 2006.02595 Cited by: §3.
  • [108] J. Zhu, T. Park, P. Isola, and A. A. Efros (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer vision, pp. 2223–2232. Cited by: §2, §3.

Appendix A Supplementary Animations

Animations illustrating several conditional spatiotemporal experiments are available at

Appendix B Supplementary Results

Figure 7: FFHQ-Aging age and cohort conditional templates with normal glasses (top) and sunglasses (bottom). For ages 7 and older, all methods produce plausible conditional templates, with Ours removing border effects and increasing shape and appearance variability. Significant label noise and highly limited sample sizes are apparent for ages 0-2 within the “glasses” label and for ages 0-6 within the “sunglasses” label. For example, only two images exist within the training set for the male/with sunglasses/0-2 years old FFHQ-Aging class, and both display adults with sunglasses rather than infants (as can be seen from the corresponding linear average). As a result, methods using FiLM [80] (Ablation/noAdv and Ours) produce more adult-like templates in those age ranges. We speculate that this results from the increased data-fitting capacity of FiLM. Interestingly, methods which do not use FiLM (Ablation/VXM+Adv and VXM) produce more plausible age-conditioned templates when all of the data for a category is mislabeled. This phenomenon arising from significant label noise requires future investigation.
Figure 8: Additional 2D views of unconditional 3D template construction on Predict-HD from all four methods. Methods using a discriminator (Ablation/VXM+Adv and Ours) exhibit increased sharpness, cortical folding detail, and tissue contrast. Ours improves on Ablation/VXM+Adv by removing subtle checkerboard artefacts, particularly visible in the coronal view.
Figure 9: Example dHCP template-to-image registration results for all methods on training data (top subfigures) and held-out test data (bottom subfigures), with varying gestational ages. Deformation norms for the 3D displacement fields are annotated on the top-right. We visualize training set results in addition to test data because a large age range of interest (29-31 weeks) is not present in the test set (see Figure 15). As our templates show higher condition (age) specificity, the deformations are smaller and more anatomically plausible as compared to baselines and ablations.
Figure 10: Example Predict-HD template-to-image registration results for all methods on held-out test data, with varying ages and cohorts. Deformation norms for the 3D displacement fields are annotated on the top-right. All methods produce comparable moved templates. However, ours yields smaller deformations as seen from the displacement fields (especially visible in 72.2yrs/HD and 67.4yrs/CS).
Figure 11: Interpolations between control subjects (CS) and subjects with the Huntington’s disease (HD) mutation (), for fixed ages, obtained by linearly interpolating between one-hot attribute vectors. Both methods (VXM and Ours) achieve interpolations which match clinical expectations, e.g., with ventricles growing larger as the HD weight increases. Ours displays larger differences between CS and HD with correspondingly larger changes visible in the interpolations, as can be seen from the last row of the figure.
Figure 12: Example template segmentations for all methods generated by majority voting on inverse warped labels of training images. We emphasize that no segmentation labels are used in template construction or registration and that these segmentations are used only for Dice coefficient evaluation and temporal volume trends.
Figure 13: Temporal Entropy Focus Criteria (EFC, lower is better) for conditional templates on the dHCP (left), Predict-HD/Huntington’s Disease (center), and Predict-HD/Control Subjects (right). In all cases, methods using a discriminator (Ablation/VXM+Adv and Ours) achieve lower EFC than methods trained without one. These results should be interpreted in context as:
(1) While Ablation/VXM+Adv and Ours achieve equivalent EFC/sharpness, Ours displays increased condition-specificity and centrality as shown in the experiments in the primary text.

(2) Although commonly used to evaluate unconditional template sharpness, EFC is a heuristic surrogate for image sharpness and can fluctuate with varying structure. As Ablation/noAdv and Ours show strong structural changes temporally, their temporal trends show higher variability as compared to techniques which present smaller structural changes (Ablation/VXM+Adv and VXM). As a result, EFC should be compared across methods at individual timepoints.
Figure 14: Negative results for dHCP conditional template segmentation for the Deep Gray Matter (dGM) label. DrawEM [63] (the tool used for dHCP template segmentation) with its default parameters overestimates dGM volume on the templates sampled at younger timepoints by Ablation/noAdv and Ours. For example, on the right, we show the generated templates from all methods at 29 weeks gestational age, with their DrawEM segmentation results below. While Ablation/noAdv and Ours produce more anatomically plausible templates compared to VXM and Ablation/VXM+Adv, the segmentation algorithm overestimates dGM volume. All other labels better match the underlying volume trends on the real data as shown in Figure 4 of the main text. In future work, careful tuning of DrawEM parameters on validation data may resolve this dGM overestimation.

Appendix C Experimental Details

c.1 Data Preparation

Figure 15: Histograms of sample size vs. scan age for both dHCP (top row) and Predict-HD (bottom row), for the constructed training sets (left column) and test sets (right column). Both datasets present a significant gap between training and test sets in terms of scan age sampling.

We obtain all linear averages required for the neuroimaging experiments using voxel-wise averages of 100 randomly chosen training scans. All 60,000 training images are used to compute the linear average for FFHQ (while we visualize group-wise barycenters in row 1 of Figure 7 for comparison, our framework uses the overall barycenter for training).
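The voxel-wise averaging described above can be sketched in a few lines. This is a minimal illustration, assuming the scans are already affinely aligned and stacked into one array; the function name, the `seed` argument, and the toy data are hypothetical and not taken from the original implementation.

```python
import numpy as np

def linear_average(scans, n_samples=100, seed=0):
    """Voxel-wise mean of `n_samples` scans drawn without replacement.

    `scans` is assumed to have shape (N, D, H, W) and to hold affinely
    aligned volumes. The name and the seed are illustrative only.
    """
    rng = np.random.default_rng(seed)
    n = min(n_samples, len(scans))
    idx = rng.choice(len(scans), size=n, replace=False)
    return scans[idx].mean(axis=0)

# Toy stand-in for a training set: 120 small random "volumes".
toy_scans = np.random.default_rng(1).random((120, 4, 5, 6))
template_init = linear_average(toy_scans)
```

Setting `n_samples` to the dataset size reduces this to the overall mean, which corresponds to the FFHQ case where all training images are used.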

c.1.1 Predict-HD

Predict-HD provides longitudinal scans from 388 individuals with and without the Huntington’s Disease (HD) genetic condition. As imaging was performed across several distinct scanning sites, the images present highly heterogeneous appearance. All T1 images were bias corrected and segmented using procedures described in [78]. Prior to learning nonlinear deformable registration, we affinely register all T1 images to MNI [34], thus resampling them to . Out of 1121 images, 4 either failed affine alignment or had missing covariates and were discarded. We use 897, 30, and 190 images for training, validation, and testing, respectively, split at the subject level. In the context of this study, we do not currently consider longitudinal subject-specific effects in our conditional template estimation.

To compute template-to-image registration accuracy via Dice coefficient evaluation, we follow the template segmentation protocols outlined in [24]. Briefly, we select training scans within the ages of 25 and 65 years old wherein we have sufficient sample sizes for both cohorts and for both train and test sets. Accordingly, our Dice coefficient evaluation is only conducted on held-out test subjects between the ages of 25 and 65 (176 out of 190). The images are split into 5-year-wide age bins with a single template sampled at the center of the bin (e.g., a 52.5-year-old HD template for HD subjects with ages between 50 and 55). For each cohort, all training segmentations within a bin are inverse warped to the bin-specific template, followed by majority voting on the labels to obtain the template segmentation for that age-bin and group. Unconditional template segmentations are performed via the same procedure, without the need for binning time points. In the future, other label fusion methods accounting for local intensities can be incorporated [100].
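The binning and label-fusion steps can be sketched as follows. This is an illustrative NumPy version that assumes the label maps have already been inverse warped into template space; the function names are hypothetical, and ties are broken towards the smallest label index, a detail the text does not specify.

```python
import numpy as np

def bin_center(age, lo=25.0, width=5.0):
    """Center of the 5-year-wide bin containing `age` (52.3 maps to 52.5)."""
    return lo + width * (np.floor((age - lo) / width) + 0.5)

def majority_vote(warped_labels, n_labels):
    """Per-voxel most frequent label across a stack of integer label maps.

    `warped_labels` has shape (N, ...) and holds label maps already warped
    into template space. Ties resolve to the smallest label index.
    """
    counts = np.stack([(warped_labels == k).sum(axis=0) for k in range(n_labels)])
    return counts.argmax(axis=0)

# Three toy "warped" label maps over four voxels.
labels = np.array([[0, 1, 2, 2],
                   [0, 1, 1, 2],
                   [1, 1, 2, 0]])
fused = majority_vote(labels, n_labels=3)
```

In this toy case each voxel receives the label that at least two of the three maps agree on.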

c.1.2 dHCP

Release 2 of the Developing Human Connectome Project (dHCP) was pre-processed by a specialized pipeline for neonatal image analysis [64] including steps such as motion-correction, super-resolution (from to ), bias correction, brain extraction, and segmentation [63]. To fit within GPU memory, we crop to a central field-of-view and minimally resize images from voxel resolution to for a final image size of . For training, validation, and testing, we first assign all twins and repeat scans to the training set to prevent test set leakage and randomly hold out 100 scans from the remainder (15 for validation, 85 for testing), resulting in 458 training images. We construct an affine template for the training set with ANTs to which every scan is affinely aligned.
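A leakage-aware split of this kind can be sketched as below. The total of 558 scans is implied by 458 training plus 100 held-out images; the number of twin and repeat scans (60 here), the identifiers, and the seed are hypothetical placeholders.

```python
import random

def split_scans(scan_ids, train_only_ids, n_holdout=100, n_val=15, seed=0):
    """Return (train, val, test) lists of scan identifiers.

    `train_only_ids` marks twins and repeat scans, which never leave the
    training set. All names and the seed are illustrative.
    """
    train_only = set(train_only_ids)
    eligible = [s for s in scan_ids if s not in train_only]
    rng = random.Random(seed)
    held_out = rng.sample(eligible, n_holdout)
    val, test = held_out[:n_val], held_out[n_val:]
    held = set(held_out)
    train = [s for s in scan_ids if s not in held]
    return train, val, test

ids = [f"scan{i:03d}" for i in range(558)]
train, val, test = split_scans(ids, train_only_ids=ids[:60])
```

Sampling the held-out scans only from the eligible pool guarantees that no twin or repeat scan can end up in validation or test.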

To generate segmentations for conditional templates generated by all methods for use in computing registration accuracy via Dice coefficients and analyzing volumetric trends of anatomical regions-of-interest, we use DrawEM [63]. Briefly, DrawEM is a multi-atlas EM-segmentation pipeline based on the neonatal ALBERT templates [38] using normalized mutual information-based image registration. This is in contrast to the majority voting template segmentation procedure for Predict-HD above and [24]. DrawEM segmentation was performed instead of majority voting as several time points have very limited sample sizes not suitable for majority voting when using regularized registration, leading to qualitatively inaccurate template segmentations. We find that DrawEM produces sufficiently accurate segmentations on templates produced by all methods (as shown in Figure 3 of the main text, see Figure 14 in the supplementary material for a counter-example). Unconditional template segmentation was performed following [24].
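For reference, the Dice coefficient used for registration accuracy can be computed per label as in the sketch below. The variable names and the toy label maps are illustrative; returning NaN when a label is absent from both maps is a choice made for this example.

```python
import numpy as np

def dice(seg_a, seg_b, label):
    """Dice overlap of one label between two integer label maps."""
    a = seg_a == label
    b = seg_b == label
    denom = a.sum() + b.sum()
    if denom == 0:
        return float("nan")  # label absent from both maps
    return 2.0 * np.logical_and(a, b).sum() / denom

# Toy label maps standing in for a warped template segmentation
# and a target segmentation.
moved_template_seg = np.array([[1, 1, 0], [2, 2, 0]])
target_seg = np.array([[1, 0, 0], [2, 2, 2]])
dice_label_1 = dice(moved_template_seg, target_seg, label=1)
dice_label_2 = dice(moved_template_seg, target_seg, label=2)
```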

c.1.3 FFHQ-Aging

FFHQ-Aging [74] annotates images in the FFHQ [52] human face dataset. These annotations include genders, ages (in 10 age bins), head pose (pitch, roll, and yaw), type of glasses (no glasses, normal glasses, sunglasses), eye occlusion scores, and segmentation labels (obtained by a DeepLabV3 [20] model pre-trained on CelebAMask-HQ). For simplicity, we only use the categorical attributes, leaving head pose and eye-occlusion conditioning to future work. We train all models on the FFHQ training set of 60,000 images (out of 70,000). As is common [7], we restrict ourselves to qualitative template evaluations for face images. We note that categorical template conditions for human faces are relatively crude as these attributes are not strictly categorical. Further, we note that the data set is skewed towards lighter skin tones (as evidenced by the linear averages of training images visualized in Supplementary Figure 7), which is consequently reflected in the synthesized templates from all methods. In future work, more careful modeling and diverse data collection protocols may ameliorate these issues.

c.2 Additional Implementation Details

Design choices and hyperparameters.

Architectures are given in Table 2. All estimated templates for neuroimaging experiments are masked by a foreground mask during training for all methods to suppress commonly occurring background artefacts. The foreground mask was obtained by thresholding a linear average of training images. Reflection padding was used instead of zero-padding for all methods as it led to slightly fewer checkerboard artefacts. LeakyReLU slopes were set to 0.2. A window of 100 updates is used for the moving average deformation penalty for all datasets. For condition vectors , we encode age as a continuous attribute (for the neuroimaging datasets, where we have access to continuous age values) divided by the maximum age in the dataset, and categorical attributes as one-hot vectors. We find that not rescaling continuous attributes in can lead to discriminator instability. Weight decay was applied on the linear projections from the FiLM embedding to the individual layers with weight for neuroimages and for FFHQ-Aging.
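The condition encoding described above (age rescaled by the dataset maximum, categorical attributes as one-hot vectors) can be sketched as follows. The function name, the ordering of the entries, and the maximum age of 90 are assumptions made for illustration only.

```python
import numpy as np

def encode_condition(age, max_age, category, n_categories):
    """Concatenate rescaled age with a one-hot categorical attribute."""
    one_hot = np.zeros(n_categories, dtype=np.float32)
    one_hot[category] = 1.0
    scaled_age = np.float32(age / max_age)  # keeps the input in [0, 1]
    return np.concatenate([[scaled_age], one_hot]).astype(np.float32)

# Hypothetical Predict-HD style example: cohort 0 = CS, cohort 1 = HD,
# with an assumed maximum age of 90 years.
z = encode_condition(age=52.5, max_age=90.0, category=1, n_categories=2)
```

Keeping every entry of the vector on a comparable scale is consistent with the observation that unscaled continuous attributes can destabilize the discriminator.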

We choose the stationary velocity field (SVF [3, 4]) framework primarily for its speed and ease of implementation and note that other frameworks such as LDDMM [12] can also be used. The integration over time is in practice implemented for all methods with five scaling and squaring layers which have been shown to produce smooth diffeomorphic displacement fields [3, 25]. While all training is performed on full resolution 3D volumes, velocity and displacement fields are estimated at half-resolution and then linearly scaled up during training as in [24]. This resizing has an implicit smoothing effect. Implementations of spatial transformers and scaling and squaring layers are taken from the voxelmorph library at
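To make the scaling and squaring step concrete, the following is a one-dimensional NumPy sketch. It is not the voxelmorph implementation used for the experiments: it works on a 1D grid, uses `np.interp` for resampling, and all names are illustrative. The velocity field is first divided by 2 to the power of the number of steps, and the resulting small deformation is then composed with itself repeatedly.

```python
import numpy as np

def scaling_and_squaring_1d(velocity, grid, n_steps=5):
    """Integrate a stationary 1D velocity field into a displacement field.

    Starts from u = v / 2**n_steps and composes the map x + u(x) with
    itself n_steps times: u(x) <- u(x) + u(x + u(x)).
    """
    u = velocity / (2.0 ** n_steps)
    for _ in range(n_steps):
        # np.interp samples u at the displaced positions; values outside
        # the grid are held at the edge values.
        u = u + np.interp(grid + u, grid, u)
    return u

grid = np.linspace(0.0, 1.0, 101)
v = np.full_like(grid, 0.05)          # constant velocity = pure translation
u = scaling_and_squaring_1d(v, grid)  # displacement after unit time
```

For a constant velocity the exact flow is a translation by that velocity, so the recovered displacement equals 0.05 everywhere, which gives a simple sanity check on the composition rule.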

For FFHQ-Aging, we make a few changes from the neuroimaging datasets. FFHQ-Aging provides ages in categorical form, which are thus treated as one-hot representations. Linear averages for FFHQ-Aging were computed across the entire training dataset due to the high number of classes. As the dataset has a left-right head pose asymmetry (particularly pronounced in subclasses with few samples), we use horizontal reflection augmentation for all methods when training the template generation and registration sub-networks. We further use a penalty with unit weight (where indicates a left-right reflection of ) for all methods to encourage symmetric face templates.
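A reflection-based symmetry penalty can be sketched as below. The exact form of the penalty is not recoverable from the text above, so the mean absolute difference used here is purely an assumption for illustration, as are the function names and the channels-last image layout.

```python
import numpy as np

def hflip(image):
    """Left-right reflection of an (H, W, C) image."""
    return image[:, ::-1, :]

def symmetry_penalty(template):
    """Mean absolute difference between a template and its reflection.

    The L1 form is an assumption made for this sketch only.
    """
    return float(np.mean(np.abs(template - hflip(template))))

symmetric = np.ones((4, 6, 3))
lopsided = symmetric.copy()
lopsided[:, :3, :] = 0.0  # darken the left half only

penalty_symmetric = symmetry_penalty(symmetric)
penalty_lopsided = symmetry_penalty(lopsided)
```

The same `hflip` helper also covers the horizontal reflection augmentation, where a training image is replaced by its mirrored copy with some probability.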

Optimization Details.

As GAN training involves the joint optimization of two networks, the optimization parameters used in either network impact training stability. The Adam [55] optimizer is used in all networks. For conditional dHCP and Predict-HD training, we adopt a two-time-scale-update-rule (TTUR [41]), with step size , , using , in both networks as is common in recent GAN works [15, 76]. For conditional FFHQ-Aging, we reduce to as additional stability was needed for highly challenging face registration. Unconditional template optimization was found to be amenable to mild amounts of momentum and was performed with step-size , , and used in both generator and discriminator for faster convergence. We note that momentum is theoretically contraindicated for gradient penalty on the discriminator but we did not find this to be an issue in practice. The non-GAN baselines (VXM and Ablation/noAdv) were trained with the same strategies to enable valid comparisons.

ANTs SyGN parameters.

We use the script provided by the ANTsX ecosystem [96] which implements the SyGN algorithm from [7]. We use the default construction parameters, including the squared local normalized cross-correlation objective, four template updates, using a four-level registration pyramid with at downsampling for the first three resolutions, and iterations per resolution. We turn off the default bias field correction and linear registration steps as these are performed during data pre-processing. Registrations between the estimated template and held-out test images were performed with the same registration parameters. We leave the default Laplacian sharpening on for all comparisons.

Miscellaneous Experimental Details.

All networks are implemented in TensorFlow 2.2 and trained on a single NVIDIA V100 GPU. As the GAN frameworks (Ours and Ablation/VXM+Adv) require concurrent optimization of two 3D networks, we found 16 GB of VRAM necessary for training. All entropy focus criteria are calculated within a common brain mask for the dataset.
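A masked entropy focus criterion can be sketched as follows. The text does not state which EFC variant was used, so this assumes a common normalized definition in which a perfectly uniform image scores 1 and an image with all energy in one voxel scores 0; the function name and the toy images are illustrative.

```python
import numpy as np

def entropy_focus_criterion(image, mask):
    """Normalized EFC over voxels where `mask` is True (lower = sharper)."""
    x = np.abs(image[mask]).astype(np.float64)
    n = x.size
    energy = np.sqrt(np.sum(x ** 2))
    # Entropy of a perfectly uniform image, used as the normalizer.
    efc_max = n * (1.0 / np.sqrt(n)) * np.log(1.0 / np.sqrt(n))
    # A small epsilon keeps the logarithm finite at zero-valued voxels.
    efc = np.sum((x / energy) * np.log((x + 1e-16) / energy))
    return float(efc / efc_max)

mask = np.ones((8, 8), dtype=bool)
flat = np.ones((8, 8))      # maximally "blurry": every voxel equal
spike = np.zeros((8, 8))
spike[3, 3] = 1.0           # all energy concentrated in one voxel
```

Evaluating every method inside the same mask keeps the voxel count, and therefore the normalizer, identical across methods.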

Appendix D Projection Discriminator

We use the inner product-based framework presented in [68] who observe that the optimum for the standard adversarial loss can be written as (equation 2 of [68]):

where represents unconditional input, represents conditional information, and and are the real and synthesized data distributions, respectively. When we have conditioning such that is categorical and is continuous, assuming that they are conditionally independent given , we obtain through simple modification:

with the remaining analysis following [68] leading to the projection-discriminator expression given in the main text.
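For readers without [68] at hand, the two expressions referred to in this appendix can be sketched as follows. The notation is an assumption: $x$ denotes the unconditional input, $y$ the conditional information, $q$ and $p$ the real and synthesized data distributions, and $y = (y_1, y_2)$ with $y_1$ categorical and $y_2$ continuous. The first line follows the standard decomposition of the optimal discriminator logit, and the second follows from it once $y_1$ and $y_2$ are assumed conditionally independent given $x$ under both distributions.

```latex
\begin{align}
f^{*}(x, y) &= \log\frac{q(x \mid y)\, q(y)}{p(x \mid y)\, p(y)}
             = \log\frac{q(y \mid x)}{p(y \mid x)} + \log\frac{q(x)}{p(x)}, \\
f^{*}(x, y_1, y_2) &= \log\frac{q(y_1 \mid x)}{p(y_1 \mid x)}
                    + \log\frac{q(y_2 \mid x)}{p(y_2 \mid x)}
                    + \log\frac{q(x)}{p(x)}.
\end{align}
```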

Template Generator ()
Inputs: conditions
Embed into using
Learn Parameters
ConvSN, 8 → 32
ResBlockSN, 32 → 32
Upsample trilinearly
ConvSN, , FiLM(), LeakyReLU
ConvSN, , FiLM(), LeakyReLU
ConvSN, , FiLM()
ConvSN, , FiLM(), tanh
Add to average of training images for

Registration Network ()
Inputs: template ; target
Concatenate(, )
Conv, Stride 2, 2 → 32, LeakyReLU
Conv, Stride 2, 32 → 32, LeakyReLU
Conv, Stride 2, 32 → 32, LeakyReLU
Conv, Stride 2, 32 → 32, LeakyReLU
Conv, 32 → 32, LeakyReLU
Conv, 32 → 32, LeakyReLU, Up , concat
Conv, 32 → 32, LeakyReLU, Up , concat
Conv, 32 → 32, LeakyReLU, Up , concat
Conv, 32 → 32, LeakyReLU
Conv, 32 → 32, LeakyReLU
Conv, 32 → 16
ConvBlock, 16 → 3
5 × Scale and Square()

Discriminator ()
Inputs: image ; attributes
ConvSN, stride 2, , Leaky ReLU
ConvSN, stride 2, , Leaky ReLU
ConvSN, stride 2, , Leaky ReLU
ConvSN, stride 2, , Leaky ReLU
Conv, stride 1, to
Projection(, )

Embedding/FiLM Generator ()
Inputs: attributes
Dense(64), LeakyReLU
Dense(64), LeakyReLU
Dense(64), LeakyReLU
Dense(64), LeakyReLU for

Table 2: Architectures for Conditional Predict-HD and dHCP consisting of, in order, a template generator, a registration network, a discriminator network, and a FiLM embedding generator. Conv represents a convolutional layer (ConvSN indicates use of spectral normalization). A ResBlockSN consists of two blocks of sequential ConvSN and LeakyReLU layers with an additive skip connection. For unconditional template estimation, we do not use any FiLM layers. For FFHQ-Aging, we use the same architectures, only replacing the 32 per-layer filters with 64 in the template generator (due to the higher number of classes), using ConvSN instead of Conv in the generator and the penultimate layer of the discriminator, and reducing the channel multiplier in the discriminator from 64 to 32.
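The FiLM layers listed in the template generator apply a per-channel scale and shift predicted from the attribute embedding. The NumPy sketch below illustrates the operation only: the projection matrices, the omission of bias terms, and the use of the scale directly (rather than, for example, one plus the scale) are assumptions, and the shapes are toy values apart from the 64-dimensional embedding taken from the Dense(64) rows.

```python
import numpy as np

def film(features, embedding, w_gamma, w_beta):
    """Feature-wise linear modulation of channels-last feature maps.

    Per-channel scale and shift are linear projections of the embedding:
    out = gamma * features + beta, broadcast over spatial dimensions.
    """
    gamma = embedding @ w_gamma  # shape (C,)
    beta = embedding @ w_beta    # shape (C,)
    return gamma * features + beta

rng = np.random.default_rng(0)
features = rng.standard_normal((4, 4, 4, 8))   # (D, H, W, C) toy volume
embedding = rng.standard_normal(64)            # 64-d, matching Dense(64)
w_gamma = rng.standard_normal((64, 8)) * 0.01  # hypothetical projections
w_beta = rng.standard_normal((64, 8)) * 0.01
out = film(features, embedding, w_gamma, w_beta)
```

In this sketch, `w_gamma` and `w_beta` play the role of the linear projections from the FiLM embedding to the individual layers, which are the parameters that receive weight decay in the implementation details.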