A Cooperative Autoencoder for Population-Based Regularization of CNN Image Registration

08/16/2019 ∙ by Riddhish Bhalodia, et al. ∙ 1

Spatial transformations are enablers in a variety of medical image analysis applications that entail aligning images to a common coordinate systems. Population analysis of such transformations is expected to capture the underlying image and shape variations, and hence these transformations are required to produce anatomically feasible correspondences. This is usually enforced through some smoothness-based generic regularization on deformation field. Alternatively, population-based regularization has been shown to produce anatomically accurate correspondences in cases where anatomically unaware (i.e., data independent) fail. Recently, deep networks have been for unsupervised image registration, these methods are computationally faster and maintains the accuracy of state of the art methods. However, these networks use smoothness penalty on deformation fields and ignores population-level statistics of the transformations. We propose a novel neural network architecture that simultaneously learns and uses the population-level statistics of the spatial transformations to regularize the neural networks for unsupervised image registration. This regularization is in the form of a bottleneck autoencoder, which encodes the population level information of the deformation fields in a low-dimensional manifold. The proposed architecture produces deformation fields that describe the population-level features and associated correspondences in an anatomically relevant manner and are statistically compact relative to the state-of-the-art approaches while maintaining computational efficiency. We demonstrate the efficacy of the proposed architecture on synthetic data sets, as well as 2D and 3D medical data.



There are no comments yet.


page 4

page 5

page 7

page 10

page 11

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Spatial transformations between sets of images play an important role in medical image analysis and are usually used for bringing distinct subjects into anatomical correspondence. This has many uses, such as the alignment of a population into a common coordinate system to compare functional/structural properties of specific anatomy, alignment of a new subject to an atlas, and in the study of anatomical shapes, where the transformations among and between images describe the morphology. In all of these applications, there is an assumption, either explicit or implicit, that the ideal transformation should bring the images into an anatomical correspondence such that key parts of the anatomy are collocated in the transformed image(s). Some methods identify specific anatomical features and find transformations that ensure their alignment [1]. Others find transformations that align unidentified image intensities/features, but regularize the problem with a smoothness penalty on the class of transformations [2, 3]. This approach has the advantage of potential generality, but it ignores known anatomical variability and correspondence. Thus, the metric, regularizations, or representations used to find these transformations do not incorporate any knowledge of transformations or class of transformations that best align members of a given population.

Existing body of literature suggests that anatomical correspondences can be better learned (even in the absence of semantic/functional knowledge) in the context of populations of images or shapes [4, 5, 6]. There is evidence that correct correspondence produces a population of transformations that is relatively easy to encode. This paper complements and extends these works by integrating population statistics (using non-linear models) into a deep neural network architecture for image registration, which we show is important for accurate characterization of anatomical correspondence.

Very recently, convolutional neural networks (CNNs) are utilized to regress coordinate transformations over the space of input images

[7, 8], in an unsupervised manner, by penalizing a metric of alignment between the input image pairs. These works are justified on the basis of computational speed or efficiency, as the feed-forward computation avoids non-linear, iterative optimization required for conventional image registration methods. However, CNNs for image registration offer other advantages, which are so far unexploited. In particular, CNNs do not rely on analytical representations of the coordinate transformation, the space of allowable transformations, or the optimization. This raises the possibility of incorporating empirical knowledge of the transformations, derived from a population of images, into the registration problem.

In this paper, we propose using population-based learning of regularizations or metrics for controlling the class of transformations that CNN learns. To achieve this, we introduce a novel neural network architecture that includes two subnetworks, namely primary and secondary networks, that work cooperatively. The primary network learns the transformations between pairs of images. The secondary network is a bottleneck autoencoder, that learns a low-dimensional description of the population of transformations, and cooperates with the primary network to enforce that the transformations adhere to a latent low-dimensional manifold.

2 Related Work

Deformable image registration has been explored extensively, however, challenges in generality, robustness, and efficiency remain. For brevity, we only focus below on the most closely related research.

Deformable registration is generally an ill-posed problem, and hence regularization is required to achieve plausible transformations, avoid non-smooth transformations, and provide anatomically consistent results. Deformation fields are a classical way to represent transformations, typically regularized through smoothness penalty, usually in the form of Dirichlet/elastic penalty on the deformation [9]. For relatively low-dimensional representations, such as b-splines [10], the basis introduces a degree of smoothness, although some methods apply penalties on the b-spline coefficients. Diffeomorphic registration uses static or dynamic (with time-dependent velocity), smooth flow fields to represent the deformation while guaranteeing invertibility, and has been applied to image alignment and shape analysis [2]. The smoothness in the diffeomorphic setting is typically introduced as part of the metric on the flow field.

Recently, CNNs have been used for image registration to boost the computational efficiency by avoiding the non-linear, iterative optimization routines of conventional methods. Supervised methods for CNN training showed promising results [11], but this requires large amounts of labeled training data (i.e., registration examples solved with other techniques). More recent work performs CNN-based registration in an unsupervised fashion [7, 8]. The work of Balakrishnan et al. [8] shows promising results on learning 3D brain registration displacement fields, improving the computational cost (after training) over the state-of-the-art traditional registration methods, such as ANTs [12], while maintaining registration accuracy. Like most registration methods, this approach also uses smoothness on the deformation fields as a regularizer.

Early works by [4] considered anatomical landmarks on a set of anatomical shapes, and suggested that anatomical variability is relatively low-dimensional. Later work used information-theoretic criteria to parameterize correspondences on populations of shapes [5]. Deformable transformations between images have also been confined to a low-dimensional representation that captures population characteristics [13]. Statistical deformation models [13, 14]

learn the probability distribution (subspace or manifold) of the deformation fields for a given population to reduce the dimensionality of the solution space and constrain the registration process. Low-rank representations and spatially varying metrics have also been proposed for diffeomorphic registration

[15, 6]. All these methods use linear models (e.g. PCA or low-rank correlations) to feed population statistics back into the registration process. In this paper, we introduce nonlinear models of the population and integrate these into a network architecture for registration.

This paper proposes a neural network architecture where one network influences another. Few proposed systems of interacting neural networks include generative adversarial networks (GAN) [16] and its variants, and domain adaptation (DA) [17]. In these works, the primary network is competing with the secondary network as an adversary, and the steady states of these systems (in training) is a saddle point for the competing energies. In the proposed work, the primary network is minimizing both its loss as well as the reconstruction loss of the secondary network, in an unsupervised setting—and thus we call these architectures cooperative networks.

3 Methods

The proposed cooperative network architecture is depicted in Figure 1. It consists of two interacting subnetworks, the primary network aims at solving the primary registration task, and the secondary network regularizes the solution space of the primary task. The architecture of the primary network is based on U-Net architecture (Figure 2), in line with other registration approaches [8]. Given a source () and a target () image pair (2D/3D), the network produces a displacement field , corresponding to the warp that ideally should match to . This displacement field, with the source image, is passed through a spatial transform unit [18] to produce a registered image (). The primary network uses an image matching term between and

as the loss function (e.g.,

norm or normalized cross-correlation). To re-iterate, the displacement fields are not required for training, and hence, this is an unsupervised image registration architecture.

Figure 1: Cooperative network architecture, with the primary unsupervised registration network depicted in the blue box, and the secondary autoencoder based regularizer network in the red box.

The secondary network is a bottleneck autoencoder, which we call a cooperative autoencoder (CAE), that attempts to reconstruct the displacement field. The CAE’s output is denoted as . The CAE is a CNN (Figure 2) with an h

-degrees-of-freedom bottleneck layer (i.e. the latent space) represents the low dimensional nonlinear manifold on which the displacement fields should lie (approximately). We add the CAE’s reconstruction loss (

loss given as ) to the primary registration loss. CAE acts as a regularizer and pushes the network objective function so that it prefers, among many possible solutions, displacement fields that are accurately represented by the CAE.

The final objective function constitutes three terms (Eq. 1). The first term represents the registration loss, the second term (weighted by ) is smoothness term [8], and, the third term (weighted by ) is the CAE based regularization term.


CAE training requires an initial set of transformation for a preliminary representation, hence, we start training with (no CAE input), and a small smoothness with weight . We found that this length of initialization phase does not significantly affect the results of the system, and we always set it at 5% of total iterations. After the initialization phase, we turn on the CAE and set to a non-zero value and (no smoothness), and train the primary and secondary network jointly (cooperatively).

4 Results

Figure 2: Left: primary network architecture (input: pair of images, output: displacement field between the images), which is then fed into the Spatial Transform (Figure 1). Right: architecture of the cooperative autoencoder.

In this paper, we use the proposed method to register shapes, represented as binary images and/or distance transforms. The same method applies directly to medical images. For each dataset, we train each network on all pairs of images from the data, with random 25% of the pairs set aside for testing. To clarify, this testing set is of completely held out pairs of images and the remaining 75% of pairs is broken into training and validation set, Training on all pairs ensures that the CAE captures the inherent low-dimensional structure of the displacement fields while avoiding bias. However, the concept of cooperative networks is applicable to other training strategies (e.g. training with a given atlas image) or representations (e.g. momentum fields).

Linear and Rotating Box-Bump

Our first didactic dataset is a set of 2D box-bump (as in [19]) images, where a protrusion on the surface of a rectangular shape is parameterized by its position along the side. We also use another synthetic dataset representative of rotational (non-linear) shape variations. Specifically, a protrusion is set atop of a circular base (parameterized by its angular position, between [-50, +50] degrees from the center). These linear and rotating box-bump datasets respectively represent a single linear and rotating (non-linear) mode of variation. We apply the proposed method on these datasets with the secondary network as cooperative autoencoder (CAE) with the bottleneck of dimension 1 and compare the resulting displacement fields with unsupervised deformable registration (UnDR) proposed in [8], which uses a smoothness penalty on the displacement fields and encodes no population-level information. We use difference as primary loss, i.e. . The results are shown in Figure 3, along with displacement fields and corresponding Dice coefficients, for a test pair of images. We see that the registration accuracy measured using the Dice coefficient is comparable for UnDR and the proposed method (UnDR-CAE), but produces vastly different displacement fields. Cooperating networks capture a single transverse/rotating component for linear/rotating box bump, respectively, each derived from population statistics. In comparison, UnDR (for both datasets) compresses the protrusion for the source and expands it for the target, which correctly aligns the source and target shapes, but it does not discover the shape variation of the population. This is an important distinction: unlike UnDR, CAE leverages information about the population statistics of the data.

The core idea of cooperative networks is to restrict displacement fields to a low dimensional manifold. For comparison, we also study some alternative strategies

Figure 3: Linear & rotating box-bump results with different methods, left figure shows the source with the field as produced by the network, and the right shows the false color difference image between the target and the registration output (white: correct overlap, green and magenta: mismatched pixels).

exploiting the same principle. The first option is to reduce the latent space of the primary network architecture (UnDR) to a single dimension bottleneck, which we call “UnDR-BN”, this represents a conventional alternative to the CAE. The results for this approach are shown in Figure 3 (UnDR-BN). These results show that UnDR-BN is similar to UnDR, which can be explained, in part, by the skip-connections (Figure 2) in the U-Net architecture used in UnDR. An alternative to UnDR-BN architecture can be to introduce a penalty on this layer to encourage sparsity. In our experiments, this leads to similar results as UnDR-BN, and for brevity, we do not present those results in this paper. We also provide additional results (in Appendix 0.A ) with UnDR-BN, but with skip-connections of the U-Net architecture removed.

We hypothesize that cooperative networks can discover meaningful correspondences of shape, to validate we define landmarks (analytically) on the family of box-bump shapes (in correspondence with the bump movement) and we evaluate how well each method aligns these ground truth correspondences (Landmark error in Table 1), along with Dice coefficients measuring registration accuracy. The computational cost of discovering displacement fields for a given image pair (testing step), are similar for both UnDR and the proposed method, i.e. CAE does not lose any of its speed over UnDR (speed is the main advantage of UnDR [8]). UnDR-CAE registers with similar accuracy as UnDR (measured by Dice coefficient), but consistently achieves lower landmark errors due to the secondary network which learns population statistics. It is also interesting to see the latent space variations as discovered by the single dimension of CAE and the additional results for this is provided Appendix 0.B.

For the CAE, we report the reconstruction error () in Table 1. For comparison, we train a separate autoencoder on the displacement fields produced by UnDR (Table 1). These results are in agreement with the key idea that the CAE helps the primary network to produce results closer to a low-dimensional manifold, as represented by the ability of the bottle-neck AE to accurately reconstruct its output.

Dataset Method AE error Dice coeff. Landmark error Test runtime
Linear Box-Bump CAE (1, = 8) 6.8% 0.98 26% 0.0185s
Linear Box-Bump UnDR 66.4% 0.97 124% 0.0184s
Linear Box-Bump UnDR-BN 65.8% 0.96 122% 0.0190s
Rotate Box-Bump CAE (1, = 8) 12.3% 0.98 24% 0.0195s
Rotate Box-Bump UnDR 63.5% 0.99 101% 0.0193s
Rotate Box-Bump UnDR-BN 54.9% 0.99 102% 0.0196s
Corpus Callosum CAE (2, = 10) 33.2% 0.89 5.7 mm 0.0237s
Corpus Callosum CAE (4, = 10) 19.2% 0.93 5.1 mm 0.0237s
Corpus Callosum CAE (8, = 10) 18.5% 0.95 4.5 mm 0.0237s
Corpus Callosum CAE (16, = 10) 16.3% 0.96 5.2 mm 0.0237s
Corpus Callosum UnDR 33 - 63% 0.93 6.5 mm 0.0234s
Left Atrium (3D) CAE (5, = 0.2) 29.8% 0.76 9.9 mm 0.784s
Left Atrium (3D) UnDR 46.3% 0.75 10.1 mm 0.772s
Table 1: Results obtained with Cooperative AutoEncoder networks (CAE, bottleneck size, coefficient) compared with Unsupervised Deformable Registration (UnDR) by [8]. Landmark errors for box-bump datasets are reported as the percentage of bump width. The AE error for UnDR refers to a separate autoencoder with bottleneck size same as CAE bottleneck (trained after UnDR). The AE error is 63.3% for bottleneck size 1, 54.1% for 2, 49.4% for 4, 38.8% for 8, and 33.5% for 16. We also report the average test runtime to compute the displacement fields.
Figure 4: Two corpus callosum source-target pairs, again one image showing the fields and the other a falsecolor between target and the registered output; top-row: UnDR, bottom-row: CAE.

Corpus Callosum (CC)

In this example, we use a dataset of 324 mid-saggital 2D slices of Corpus Callosum (CC) from the OASIS Brains dataset [20]. Unlike synthetic experiments discussed above, we do not know, apriori, the intrinsic dimensionality of the CC shapes. Therefore, we train the proposed architecture across a range of CAE bottleneck dimensions (2, 4, 8 and 16) and compare resulting Dice coefficients, autoencoder reconstructions, and landmark errors, as in Table 1. Networks are again trained using difference as the primary loss. Landmarks were identified using features from the literature [21], and we had multiple raters identify the posterior and anterior points of the CC, the inferior tip of the splenium, the posterior tip of the genu, the posterior angle of the genu, and the interior notch of the splenium. Interrater RMS error is 1.4mm, and the pixel/voxel size is 1mm for these images. We see that the optimal bottleneck size for cooperative networks is 8 – increasing the bottleneck to 16 improves the Dice coefficient and AE error, but leads to worse landmark error, which suggests the CAE starts to overfit. The UnDR approach leads to comparable Dice scores, but worse autoencoder and landmark errors (Table 1). As in the synthetic experiments, to report the AE error for UnDR, we trained the autoencoder separately after UnDR training. CAE helps the primary network produce displacement fields that are close to a low-dimensional manifold—a result that is not achieved with the conventional smoothness penalty.

Left Atrium Appendage (LAA)

Figure 5: The results of the 3D LAA registration produced by cooperative networks and UnDR.

We apply the cooperative network on a 3D dataset of left atrium appendages (LAA). These images are represented as signed distance transforms, and hence we use the normalized cross-correlation loss as in [8], instead of a image loss. The Dice scores, AE reconstruction accuracy and compute times are reported in Table 1. We also show the registration of a pair of LAA images in Figure 5, and landmark (manually obtained clinically validated Ostia landmarks on LAA) reconstruction errors in Table 1.

5 Conclusions

This paper proposes a novel architecture proposed for CNN-based unsupervised image registration that uses a cooperative autoencoder (CAE) and enforces the displacement fields to lie in the vicinity of a low-dimensional manifold. CAE reconstruction loss acts as a regularizer term for unsupervised registration. Cooperative networks have comparable registration run times (Table 1) with UnDR, but much faster as compared to the conventional state-of-the-art registration methods (as analyzed in [8]). Cooperative networks produce meaningful correspondence representation between shapes as compared to other methods (evident by landmark reconstruction errors in Table 1), while maintaining the registration accuracy, making it a viable tool for obtaining fast alignment with anatomically feasible correspondence.

Acknowledgements : This work was supported by NIH [grant numbers R01-AR-076120-01, R01- HL135568-02, and P41-GM103545-19] and also supported by the National Institute of General Medical Sciences of the National Institutes of Health under grant number P41 GM103545-18.


  • [1] Joshi, S.H., Cabeen, R.P., Joshi, A.A., Sun, B., Dinov, I., Narr, K.L., Toga, A.W., Woods, R.P.: Diffeomorphic sulcal shape analysis on the cortex. IEEE transactions on medical imaging 31(6), 1195–1212 (2012)
  • [2]

    Beg, M.F., Miller, M.I., Trouvé, A., Younes, L.: Computing large deformation metric mappings via geodesic flows of diffeomorphisms. International journal of computer vision 61(2), 139–157 (2005)

  • [3] Joshi, S.C., Miller, M.I.: Landmark matching via large deformation diffeomorphisms. IEEE transactions on image processing 9(8), 1357–1370 (2000)
  • [4] Grenander, U., Chow, Y.s., Keenan, D.M.: Hands: A pattern theoretic study of biological shapes, vol. 2. Springer Science & Business Media (1991)
  • [5] Cates, J., Fletcher, P.T., Styner, M., Shenton, M., Whitaker, R.: Shape modeling and analysis with entropy-based particle systems. In: IPMI. pp. 333–345. Springer (2007)
  • [6] Vialard, F.X., Risser, L.: Spatially-varying metric learning for diffeomorphic image registration: A variational framework. In: MICCAI. pp. 227–234. Springer, Cham (2014)
  • [7]

    de Vos, B.D., Berendsen, F.F., Viergever, M.A., Staring, M., Isgum, I.: End-to-end unsupervised deformable image registration with a convolutional neural network. In: Deep Learning in Medical Imaging MICCAI Workshop (2017)

  • [8]

    Balakrishnan, G., Zhao, A., Sabuncu, M.R., Guttag, J., Dalca, A.V.: An unsupervised learning model for deformable medical image registration. In: CVPR. pp. 9252–9260 (2018)

  • [9] Bajcsy, R., Kovačič, S.: Multiresolution elastic matching. Computer Vision, Graphics, and Image Processing 46(1), 1 – 21 (1989)
  • [10] Rueckert, D., Sonoda, L.I., Hayes, C., Hill, D.L.G., Leach, M.O., Hawkes, D.J.: Nonrigid registration using free-form deformations: application to breast mr images. IEEE Transactions on Medical Imaging 18(8), 712–721 (1999)
  • [11] Krebs, J., Mansi, T., Delingette, H., Zhang, L., Ghesu, F.C., Miao, S., Maier, A.K., Ayache, N., Liao, R., Kamen, A.: Robust non-rigid registration through agent-based action learning. In: MICCAI. pp. 344–352. Springer International Publishing, Cham (2017)
  • [12] Avants, B.B., Tustison, N.J., Song, G., Cook, P.A., Klein, A., Gee, J.C.: A reproducible evaluation of ants similarity metric performance in brain image registration. NeuroImage 54(3), 2033 – 2044 (2011)
  • [13] Rueckert, D., Frangi, A.F., Schnabel, J.A.: Automatic construction of 3-d statistical deformation models of the brain using nonrigid registration. IEEE transactions on medical imaging 22(8), 1014–1025 (2003)
  • [14] Joshi, S.C., Miller, M.I., Grenander, U.: On the geometry and shape of brain sub-manifolds. IJPRAI 11(8), 1317–1343 (1997)
  • [15] Schmah, T., Risser, L., Vialard, F.X.: Left-invariant metrics for diffeomorphic image registration with spatially-varying regularisation. In: Medical Image Computing and Computer-Assisted Intervention – MICCAI 2013. pp. 203–210. Springer, Berlin, Heidelberg (2013)
  • [16] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. In: Advances in Neural Information Processing Systems 27, pp. 2672–2680. Curran Associates, Inc. (2014)
  • [17]

    Ganin, Y., Ustinova, E., Ajakan, H., Germain, P., Larochelle, H., Laviolette, F., Marchand, M., Lempitsky, V.: Domain-adversarial training of neural networks. The Journal of Machine Learning Research 17(1), 2096–2030 (2016)

  • [18]

    Jaderberg, M., Simonyan, K., Zisserman, A., kavukcuoglu, k.: Spatial transformer networks. In: Advances in Neural Information Processing Systems 28, pp. 2017–2025. Curran Associates, Inc. (2015)

  • [19] Thodberg, H.H.: Minimum description length shape and appearance models. In: Information Processing in Medical Imaging. pp. 51–62. Springer, Berlin, Heidelberg (2003)
  • [20] Marcus, D.S., Fotenos, A.F., Csernansky, J.G., Morris, J.C., Buckner, R.L.: Open access series of imaging studies: longitudinal mri data in nondemented and demented older adults. vol. 22, pp. 2677–2684 (2010)
  • [21] Sigirli, D., Ercan, I., Ozdemir, S.T., Taskapilioglu, O., Hakyemez, B., Turan, O.F.: Shape analysis of the corpus callosum and cerebellum in female ms patients with different clinical phenotypes. The Anatomical Record: Advances in Integrative Anatomy and Evolutionary Biology 295(7), 1202–1211 (2012)


Appendix 0.A UnDR-BN Without Skip-Connections

In the paper we consider comparison of CAE based UnDR with the variant of a constrained bottleneck, however, we saw similar results which we attributed to the still in place skip-connections of the U-Net architecture. Here, we redo the experiments of a single dimensional bottleneck in primary U-Net architecture for linear and rotating box-bump datasets, but we remove the skip-connections. This architecture is essentially a hard constraint on the dimensionality of displacement field as compared to the soft constraint as imposed by the CAE. In Figure 6 we show the results for these experiments. We can see that the registration is subpar, which aligns with our assumption that a CAE based soft penalty, to lie close to a low-dimensional manifold produces more accurate registrations and corresponding flow-fields as compared to hard penalty for the population of displacement fields to lie exactly on the low-dimensional manifold. The average dice coefficient for linear box-bump with this method is 0.926 and for rotating box-bump is 0.974.

Figure 6: Top-row: an example pair of liner box-bump images from testing set passed through UnDR-BN without skip-connections. left image shows the source image with the displacement fields. Right image shows difference image between the registered output and the true target image, the white pixels are correctly matched and the green and magenta shows the mismatched pixels of registered and target images respectively. Bottom-row: an example from the rotating box-bump dataset.

Appendix 0.B Latent Space Variations

We examine the nature of the latent space of the CAE for the linear and rotating box-bump datasets. For each dataset, we fix the target image to be the mean image (each in both case is the image with the bump in center) and change the source images. For linear box-bump we have 100 source images and in Figure 7(images) we show 10 equally spaced source images with their corresponding displacement field bringing each image to the center bump target image. Figure 7(plot) shows the value of the CAE bottleneck versus the source images. The variation in the latent space is monotonic and makes intuitive sense corresponding to displacement field it generates. This behavior is same for the rotating box bumps whose variations are seen in Figure 8.

Figure 7: Linear Box-Bump Data: The images going from top to bottom , left to right represent the 10 uniform samples of the source images and the corresponding displacement fields registering them to central bump image, the numbers in white denote the value of the latent space of the CAE. The plot on the right denotes the latent space value of the CAE for all 100 source images when registered to the central bump image.
Figure 8: Rotating Box-Bump Data: The images going from top to bottom , left to right represent the 10 uniform samples of the source images and the corresponding displacement fields registering them to central bump image, the numbers in white denote the value of the latent space of the CAE. The plot on the right denotes the latent space value of the CAE for all 100 source images when registered to the central bump image.