Spatial transformations between sets of images play an important role in medical image analysis and are usually used for bringing distinct subjects into anatomical correspondence. This has many uses, such as the alignment of a population into a common coordinate system to compare functional/structural properties of specific anatomy, alignment of a new subject to an atlas, and in the study of anatomical shapes, where the transformations among and between images describe the morphology. In all of these applications, there is an assumption, either explicit or implicit, that the ideal transformation should bring the images into an anatomical correspondence such that key parts of the anatomy are collocated in the transformed image(s). Some methods identify specific anatomical features and find transformations that ensure their alignment . Others find transformations that align unidentified image intensities/features, but regularize the problem with a smoothness penalty on the class of transformations [2, 3]. This approach has the advantage of potential generality, but it ignores known anatomical variability and correspondence. Thus, the metric, regularizations, or representations used to find these transformations do not incorporate any knowledge of transformations or class of transformations that best align members of a given population.
Existing body of literature suggests that anatomical correspondences can be better learned (even in the absence of semantic/functional knowledge) in the context of populations of images or shapes [4, 5, 6]. There is evidence that correct correspondence produces a population of transformations that is relatively easy to encode. This paper complements and extends these works by integrating population statistics (using non-linear models) into a deep neural network architecture for image registration, which we show is important for accurate characterization of anatomical correspondence.
Very recently, convolutional neural networks (CNNs) are utilized to regress coordinate transformations over the space of input images[7, 8], in an unsupervised manner, by penalizing a metric of alignment between the input image pairs. These works are justified on the basis of computational speed or efficiency, as the feed-forward computation avoids non-linear, iterative optimization required for conventional image registration methods. However, CNNs for image registration offer other advantages, which are so far unexploited. In particular, CNNs do not rely on analytical representations of the coordinate transformation, the space of allowable transformations, or the optimization. This raises the possibility of incorporating empirical knowledge of the transformations, derived from a population of images, into the registration problem.
In this paper, we propose using population-based learning of regularizations or metrics for controlling the class of transformations that CNN learns. To achieve this, we introduce a novel neural network architecture that includes two subnetworks, namely primary and secondary networks, that work cooperatively. The primary network learns the transformations between pairs of images. The secondary network is a bottleneck autoencoder, that learns a low-dimensional description of the population of transformations, and cooperates with the primary network to enforce that the transformations adhere to a latent low-dimensional manifold.
2 Related Work
Deformable image registration has been explored extensively, however, challenges in generality, robustness, and efficiency remain. For brevity, we only focus below on the most closely related research.
Deformable registration is generally an ill-posed problem, and hence regularization is required to achieve plausible transformations, avoid non-smooth transformations, and provide anatomically consistent results. Deformation fields are a classical way to represent transformations, typically regularized through smoothness penalty, usually in the form of Dirichlet/elastic penalty on the deformation . For relatively low-dimensional representations, such as b-splines , the basis introduces a degree of smoothness, although some methods apply penalties on the b-spline coefficients. Diffeomorphic registration uses static or dynamic (with time-dependent velocity), smooth flow fields to represent the deformation while guaranteeing invertibility, and has been applied to image alignment and shape analysis . The smoothness in the diffeomorphic setting is typically introduced as part of the metric on the flow field.
Recently, CNNs have been used for image registration to boost the computational efficiency by avoiding the non-linear, iterative optimization routines of conventional methods. Supervised methods for CNN training showed promising results , but this requires large amounts of labeled training data (i.e., registration examples solved with other techniques). More recent work performs CNN-based registration in an unsupervised fashion [7, 8]. The work of Balakrishnan et al.  shows promising results on learning 3D brain registration displacement fields, improving the computational cost (after training) over the state-of-the-art traditional registration methods, such as ANTs , while maintaining registration accuracy. Like most registration methods, this approach also uses smoothness on the deformation fields as a regularizer.
Early works by  considered anatomical landmarks on a set of anatomical shapes, and suggested that anatomical variability is relatively low-dimensional. Later work used information-theoretic criteria to parameterize correspondences on populations of shapes . Deformable transformations between images have also been confined to a low-dimensional representation that captures population characteristics . Statistical deformation models [13, 14]
learn the probability distribution (subspace or manifold) of the deformation fields for a given population to reduce the dimensionality of the solution space and constrain the registration process. Low-rank representations and spatially varying metrics have also been proposed for diffeomorphic registration[15, 6]. All these methods use linear models (e.g. PCA or low-rank correlations) to feed population statistics back into the registration process. In this paper, we introduce nonlinear models of the population and integrate these into a network architecture for registration.
This paper proposes a neural network architecture where one network influences another. Few proposed systems of interacting neural networks include generative adversarial networks (GAN)  and its variants, and domain adaptation (DA) . In these works, the primary network is competing with the secondary network as an adversary, and the steady states of these systems (in training) is a saddle point for the competing energies. In the proposed work, the primary network is minimizing both its loss as well as the reconstruction loss of the secondary network, in an unsupervised setting—and thus we call these architectures cooperative networks.
The proposed cooperative network architecture is depicted in Figure 1. It consists of two interacting subnetworks, the primary network aims at solving the primary registration task, and the secondary network regularizes the solution space of the primary task. The architecture of the primary network is based on U-Net architecture (Figure 2), in line with other registration approaches . Given a source () and a target () image pair (2D/3D), the network produces a displacement field , corresponding to the warp that ideally should match to . This displacement field, with the source image, is passed through a spatial transform unit  to produce a registered image (). The primary network uses an image matching term between and
as the loss function (e.g.,norm or normalized cross-correlation). To re-iterate, the displacement fields are not required for training, and hence, this is an unsupervised image registration architecture.
The secondary network is a bottleneck autoencoder, which we call a cooperative autoencoder (CAE), that attempts to reconstruct the displacement field. The CAE’s output is denoted as . The CAE is a CNN (Figure 2) with an h
-degrees-of-freedom bottleneck layer (i.e. the latent space) represents the low dimensional nonlinear manifold on which the displacement fields should lie (approximately). We add the CAE’s reconstruction loss (loss given as ) to the primary registration loss. CAE acts as a regularizer and pushes the network objective function so that it prefers, among many possible solutions, displacement fields that are accurately represented by the CAE.
The final objective function constitutes three terms (Eq. 1). The first term represents the registration loss, the second term (weighted by ) is smoothness term , and, the third term (weighted by ) is the CAE based regularization term.
CAE training requires an initial set of transformation for a preliminary representation, hence, we start training with (no CAE input), and a small smoothness with weight . We found that this length of initialization phase does not significantly affect the results of the system, and we always set it at 5% of total iterations. After the initialization phase, we turn on the CAE and set to a non-zero value and (no smoothness), and train the primary and secondary network jointly (cooperatively).
In this paper, we use the proposed method to register shapes, represented as binary images and/or distance transforms. The same method applies directly to medical images. For each dataset, we train each network on all pairs of images from the data, with random 25% of the pairs set aside for testing. To clarify, this testing set is of completely held out pairs of images and the remaining 75% of pairs is broken into training and validation set, Training on all pairs ensures that the CAE captures the inherent low-dimensional structure of the displacement fields while avoiding bias. However, the concept of cooperative networks is applicable to other training strategies (e.g. training with a given atlas image) or representations (e.g. momentum fields).
Linear and Rotating Box-Bump
Our first didactic dataset is a set of 2D box-bump (as in ) images, where a protrusion on the surface of a rectangular shape is parameterized by its position along the side. We also use another synthetic dataset representative of rotational (non-linear) shape variations. Specifically, a protrusion is set atop of a circular base (parameterized by its angular position, between [-50, +50] degrees from the center). These linear and rotating box-bump datasets respectively represent a single linear and rotating (non-linear) mode of variation. We apply the proposed method on these datasets with the secondary network as cooperative autoencoder (CAE) with the bottleneck of dimension 1 and compare the resulting displacement fields with unsupervised deformable registration (UnDR) proposed in , which uses a smoothness penalty on the displacement fields and encodes no population-level information. We use difference as primary loss, i.e. . The results are shown in Figure 3, along with displacement fields and corresponding Dice coefficients, for a test pair of images. We see that the registration accuracy measured using the Dice coefficient is comparable for UnDR and the proposed method (UnDR-CAE), but produces vastly different displacement fields. Cooperating networks capture a single transverse/rotating component for linear/rotating box bump, respectively, each derived from population statistics. In comparison, UnDR (for both datasets) compresses the protrusion for the source and expands it for the target, which correctly aligns the source and target shapes, but it does not discover the shape variation of the population. This is an important distinction: unlike UnDR, CAE leverages information about the population statistics of the data.
The core idea of cooperative networks is to restrict displacement fields to a low dimensional manifold. For comparison, we also study some alternative strategies
exploiting the same principle. The first option is to reduce the latent space of the primary network architecture (UnDR) to a single dimension bottleneck, which we call “UnDR-BN”, this represents a conventional alternative to the CAE. The results for this approach are shown in Figure 3 (UnDR-BN). These results show that UnDR-BN is similar to UnDR, which can be explained, in part, by the skip-connections (Figure 2) in the U-Net architecture used in UnDR. An alternative to UnDR-BN architecture can be to introduce a penalty on this layer to encourage sparsity. In our experiments, this leads to similar results as UnDR-BN, and for brevity, we do not present those results in this paper. We also provide additional results (in Appendix 0.A ) with UnDR-BN, but with skip-connections of the U-Net architecture removed.
We hypothesize that cooperative networks can discover meaningful correspondences of shape, to validate we define landmarks (analytically) on the family of box-bump shapes (in correspondence with the bump movement) and we evaluate how well each method aligns these ground truth correspondences (Landmark error in Table 1), along with Dice coefficients measuring registration accuracy. The computational cost of discovering displacement fields for a given image pair (testing step), are similar for both UnDR and the proposed method, i.e. CAE does not lose any of its speed over UnDR (speed is the main advantage of UnDR ). UnDR-CAE registers with similar accuracy as UnDR (measured by Dice coefficient), but consistently achieves lower landmark errors due to the secondary network which learns population statistics. It is also interesting to see the latent space variations as discovered by the single dimension of CAE and the additional results for this is provided Appendix 0.B.
For the CAE, we report the reconstruction error () in Table 1. For comparison, we train a separate autoencoder on the displacement fields produced by UnDR (Table 1). These results are in agreement with the key idea that the CAE helps the primary network to produce results closer to a low-dimensional manifold, as represented by the ability of the bottle-neck AE to accurately reconstruct its output.
|Dataset||Method||AE error||Dice coeff.||Landmark error||Test runtime|
|Linear Box-Bump||CAE (1, = 8)||6.8%||0.98||26%||0.0185s|
|Rotate Box-Bump||CAE (1, = 8)||12.3%||0.98||24%||0.0195s|
|Corpus Callosum||CAE (2, = 10)||33.2%||0.89||5.7 mm||0.0237s|
|Corpus Callosum||CAE (4, = 10)||19.2%||0.93||5.1 mm||0.0237s|
|Corpus Callosum||CAE (8, = 10)||18.5%||0.95||4.5 mm||0.0237s|
|Corpus Callosum||CAE (16, = 10)||16.3%||0.96||5.2 mm||0.0237s|
|Corpus Callosum||UnDR||33 - 63%||0.93||6.5 mm||0.0234s|
|Left Atrium (3D)||CAE (5, = 0.2)||29.8%||0.76||9.9 mm||0.784s|
|Left Atrium (3D)||UnDR||46.3%||0.75||10.1 mm||0.772s|
Corpus Callosum (CC)
In this example, we use a dataset of 324 mid-saggital 2D slices of Corpus Callosum (CC) from the OASIS Brains dataset . Unlike synthetic experiments discussed above, we do not know, apriori, the intrinsic dimensionality of the CC shapes. Therefore, we train the proposed architecture across a range of CAE bottleneck dimensions (2, 4, 8 and 16) and compare resulting Dice coefficients, autoencoder reconstructions, and landmark errors, as in Table 1. Networks are again trained using difference as the primary loss. Landmarks were identified using features from the literature , and we had multiple raters identify the posterior and anterior points of the CC, the inferior tip of the splenium, the posterior tip of the genu, the posterior angle of the genu, and the interior notch of the splenium. Interrater RMS error is 1.4mm, and the pixel/voxel size is 1mm for these images. We see that the optimal bottleneck size for cooperative networks is 8 – increasing the bottleneck to 16 improves the Dice coefficient and AE error, but leads to worse landmark error, which suggests the CAE starts to overfit. The UnDR approach leads to comparable Dice scores, but worse autoencoder and landmark errors (Table 1). As in the synthetic experiments, to report the AE error for UnDR, we trained the autoencoder separately after UnDR training. CAE helps the primary network produce displacement fields that are close to a low-dimensional manifold—a result that is not achieved with the conventional smoothness penalty.
Left Atrium Appendage (LAA)
We apply the cooperative network on a 3D dataset of left atrium appendages (LAA). These images are represented as signed distance transforms, and hence we use the normalized cross-correlation loss as in , instead of a image loss. The Dice scores, AE reconstruction accuracy and compute times are reported in Table 1. We also show the registration of a pair of LAA images in Figure 5, and landmark (manually obtained clinically validated Ostia landmarks on LAA) reconstruction errors in Table 1.
This paper proposes a novel architecture proposed for CNN-based unsupervised image registration that uses a cooperative autoencoder (CAE) and enforces the displacement fields to lie in the vicinity of a low-dimensional manifold. CAE reconstruction loss acts as a regularizer term for unsupervised registration. Cooperative networks have comparable registration run times (Table 1) with UnDR, but much faster as compared to the conventional state-of-the-art registration methods (as analyzed in ). Cooperative networks produce meaningful correspondence representation between shapes as compared to other methods (evident by landmark reconstruction errors in Table 1), while maintaining the registration accuracy, making it a viable tool for obtaining fast alignment with anatomically feasible correspondence.
Acknowledgements : This work was supported by NIH [grant numbers R01-AR-076120-01, R01- HL135568-02, and P41-GM103545-19] and also supported by the National Institute of General Medical Sciences of the National Institutes of Health under grant number P41 GM103545-18.
-  Joshi, S.H., Cabeen, R.P., Joshi, A.A., Sun, B., Dinov, I., Narr, K.L., Toga, A.W., Woods, R.P.: Diffeomorphic sulcal shape analysis on the cortex. IEEE transactions on medical imaging 31(6), 1195–1212 (2012)
Beg, M.F., Miller, M.I., Trouvé, A., Younes, L.: Computing large deformation metric mappings via geodesic flows of diffeomorphisms. International journal of computer vision 61(2), 139–157 (2005)
-  Joshi, S.C., Miller, M.I.: Landmark matching via large deformation diffeomorphisms. IEEE transactions on image processing 9(8), 1357–1370 (2000)
-  Grenander, U., Chow, Y.s., Keenan, D.M.: Hands: A pattern theoretic study of biological shapes, vol. 2. Springer Science & Business Media (1991)
-  Cates, J., Fletcher, P.T., Styner, M., Shenton, M., Whitaker, R.: Shape modeling and analysis with entropy-based particle systems. In: IPMI. pp. 333–345. Springer (2007)
-  Vialard, F.X., Risser, L.: Spatially-varying metric learning for diffeomorphic image registration: A variational framework. In: MICCAI. pp. 227–234. Springer, Cham (2014)
de Vos, B.D., Berendsen, F.F., Viergever, M.A., Staring, M., Isgum, I.: End-to-end unsupervised deformable image registration with a convolutional neural network. In: Deep Learning in Medical Imaging MICCAI Workshop (2017)
Balakrishnan, G., Zhao, A., Sabuncu, M.R., Guttag, J., Dalca, A.V.: An unsupervised learning model for deformable medical image registration. In: CVPR. pp. 9252–9260 (2018)
-  Bajcsy, R., Kovačič, S.: Multiresolution elastic matching. Computer Vision, Graphics, and Image Processing 46(1), 1 – 21 (1989)
-  Rueckert, D., Sonoda, L.I., Hayes, C., Hill, D.L.G., Leach, M.O., Hawkes, D.J.: Nonrigid registration using free-form deformations: application to breast mr images. IEEE Transactions on Medical Imaging 18(8), 712–721 (1999)
-  Krebs, J., Mansi, T., Delingette, H., Zhang, L., Ghesu, F.C., Miao, S., Maier, A.K., Ayache, N., Liao, R., Kamen, A.: Robust non-rigid registration through agent-based action learning. In: MICCAI. pp. 344–352. Springer International Publishing, Cham (2017)
-  Avants, B.B., Tustison, N.J., Song, G., Cook, P.A., Klein, A., Gee, J.C.: A reproducible evaluation of ants similarity metric performance in brain image registration. NeuroImage 54(3), 2033 – 2044 (2011)
-  Rueckert, D., Frangi, A.F., Schnabel, J.A.: Automatic construction of 3-d statistical deformation models of the brain using nonrigid registration. IEEE transactions on medical imaging 22(8), 1014–1025 (2003)
-  Joshi, S.C., Miller, M.I., Grenander, U.: On the geometry and shape of brain sub-manifolds. IJPRAI 11(8), 1317–1343 (1997)
-  Schmah, T., Risser, L., Vialard, F.X.: Left-invariant metrics for diffeomorphic image registration with spatially-varying regularisation. In: Medical Image Computing and Computer-Assisted Intervention – MICCAI 2013. pp. 203–210. Springer, Berlin, Heidelberg (2013)
-  Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. In: Advances in Neural Information Processing Systems 27, pp. 2672–2680. Curran Associates, Inc. (2014)
Ganin, Y., Ustinova, E., Ajakan, H., Germain, P., Larochelle, H., Laviolette, F., Marchand, M., Lempitsky, V.: Domain-adversarial training of neural networks. The Journal of Machine Learning Research 17(1), 2096–2030 (2016)
Jaderberg, M., Simonyan, K., Zisserman, A., kavukcuoglu, k.: Spatial transformer networks. In: Advances in Neural Information Processing Systems 28, pp. 2017–2025. Curran Associates, Inc. (2015)
-  Thodberg, H.H.: Minimum description length shape and appearance models. In: Information Processing in Medical Imaging. pp. 51–62. Springer, Berlin, Heidelberg (2003)
-  Marcus, D.S., Fotenos, A.F., Csernansky, J.G., Morris, J.C., Buckner, R.L.: Open access series of imaging studies: longitudinal mri data in nondemented and demented older adults. vol. 22, pp. 2677–2684 (2010)
-  Sigirli, D., Ercan, I., Ozdemir, S.T., Taskapilioglu, O., Hakyemez, B., Turan, O.F.: Shape analysis of the corpus callosum and cerebellum in female ms patients with different clinical phenotypes. The Anatomical Record: Advances in Integrative Anatomy and Evolutionary Biology 295(7), 1202–1211 (2012)
Appendix 0.A UnDR-BN Without Skip-Connections
In the paper we consider comparison of CAE based UnDR with the variant of a constrained bottleneck, however, we saw similar results which we attributed to the still in place skip-connections of the U-Net architecture. Here, we redo the experiments of a single dimensional bottleneck in primary U-Net architecture for linear and rotating box-bump datasets, but we remove the skip-connections. This architecture is essentially a hard constraint on the dimensionality of displacement field as compared to the soft constraint as imposed by the CAE. In Figure 6 we show the results for these experiments. We can see that the registration is subpar, which aligns with our assumption that a CAE based soft penalty, to lie close to a low-dimensional manifold produces more accurate registrations and corresponding flow-fields as compared to hard penalty for the population of displacement fields to lie exactly on the low-dimensional manifold. The average dice coefficient for linear box-bump with this method is 0.926 and for rotating box-bump is 0.974.
Appendix 0.B Latent Space Variations
We examine the nature of the latent space of the CAE for the linear and rotating box-bump datasets. For each dataset, we fix the target image to be the mean image (each in both case is the image with the bump in center) and change the source images. For linear box-bump we have 100 source images and in Figure 7(images) we show 10 equally spaced source images with their corresponding displacement field bringing each image to the center bump target image. Figure 7(plot) shows the value of the CAE bottleneck versus the source images. The variation in the latent space is monotonic and makes intuitive sense corresponding to displacement field it generates. This behavior is same for the rotating box bumps whose variations are seen in Figure 8.