1 Introduction
Image segmentation, i.e., assigning meaningful labels to the pixels or voxels of an image, is an important medical image analysis task [7]. While a large number of segmentation approaches exist, deep convolutional neural networks (CNNs) have shown remarkable segmentation performance and are considered state-of-the-art [4, 10]. This was not always the case. Prior to the dominance of CNNs, multi-atlas segmentation (MAS) was highly popular and successful for medical image segmentation [7]. MAS approaches rely on a set of previously labeled atlas images. These images are registered to an unlabeled target image, and their associated labels are used to infer the labeling of the target image via label fusion. Hence, MAS performance relies on high-quality registrations or advanced label fusion methods. MAS is slow, as it requires computationally costly registrations, but it provides good spatial consistency via the given atlas segmentations.
In contrast, CNN approaches use sophisticated network architectures, with parameters trained on large sets of labeled images. A popular architecture for medical image segmentation is the U-Net [4]. For deep learning (DL) approaches, the majority of the computational cost is spent during training. Hence, these approaches are fast at test time and have shown excellent segmentation performance for medical images [4, 10]. However, as image labels are not directly spatially transformed, spatial consistency is only indirectly encouraged during training, and DL approaches may miss structures or add undesired ones. Furthermore, large numbers of labeled images are desirable for training, but may not always be available.
Conceptually, MAS is attractive as it provides a direct and intuitive way to specify and obtain segmentations via a set of labeled atlases. Given atlases that are similar to the target image to be segmented, it is plausible that good registrations can be achieved, that atlas labels can be transferred well, and consequently that high-quality segmentations can be obtained. However, it is a priori unclear which atlases should be used to estimate the segmentation, as not all atlases will align well via registration. Label fusion strategies aim at addressing the resulting spatial inconsistencies between the atlas labels warped to the target image space. Approaches include majority and plurality voting [6, 5], global weighted voting [1], and local weighted voting strategies [13]. Statistical modeling approaches have also been proposed [14], and patch-based approaches directly aim to compensate for local registration errors [2]. Most recently, machine learning [12] and deep learning [15] label fusion methods have been proposed. These methods all assume that all atlases might contribute to the labeling decision. Instead, we propose making decisions based only on atlases considered trustworthy. In contrast to global [11] and patch-based atlas selection [9], our approach locally predicts the set of trustworthy atlases voxel by voxel. Experimentally, we show that this strategy, even combined with simple plurality voting, achieves excellent segmentation performance, on par with or slightly better than a U-Net and significantly better than other label fusion strategies. All our results make use of fast DL-based deformable image registration, thereby resulting in a MAS approach which is fast and accurate while providing more direct control over spatial label consistency.
Contributions. (1) New label fusion method: We propose a DL label fusion method (VoteNet), which locally identifies sets of trustworthy atlases. (2) Fast implementation: All our results are based on a fast DL registration approach (Quicksilver). (3) Performance upper bound: We experimentally assess the best possible performance achievable with our approach and illustrate that there is a large margin for improvement. (4) Comprehensive experimental comparison: We compare to a variety of other label fusion strategies and a U-Net for 3D brain segmentation. Our approach performs consistently best.
2 Methodology
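Before discussing our proposed label fusion method, we first describe multi-atlas segmentation. Let $T_I$ be the target image to be segmented and let $A_I^1, \ldots, A_I^n$ be $n$ atlas images with corresponding manual segmentations $A_S^1, \ldots, A_S^n$. Assume there is a reliable deformable image registration method that warps all atlases into the space of the target image $T_I$, i.e., that yields warped atlas images $\tilde{A}_I^i$ and warped atlas segmentations $\tilde{A}_S^i$ for $i = 1, \ldots, n$. Each $\tilde{A}_S^i$ is now a candidate segmentation for $T_I$. Finally, a label fusion method $\mathcal{F}$ combines all the candidate segmentations to produce the final segmentation $\hat{T}_S$ for $T_I$, i.e.,
$$\hat{T}_S = \mathcal{F}(\tilde{A}_S^1, \ldots, \tilde{A}_S^n). \tag{1}$$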
Our framework uses two deep convolutional neural networks (CNNs) for MAS: the Quicksilver registration network, which computes the spatial transformation to target image space, and our label fusion network (VoteNet). While other registration approaches could be used, a DL approach greatly speeds up the registrations for MAS, which are typically slow when based on numerical optimization. By using a DL approach we also demonstrate that it integrates well with MAS.
Fig. 1 illustrates our approach. Quicksilver and VoteNet are discussed below.
Quicksilver: Quicksilver [16] uses the target image and the atlas images to predict the deformation maps, which are used to generate the warped atlas images and their corresponding labels. Using a DL registration approach such as Quicksilver speeds up pairwise registrations by at least an order of magnitude [16] compared to numerical optimization. Since registrations are the computational bottleneck of MAS, similar speedups can be obtained for MAS as a whole. For example, MAS with our approach and 17 atlases requires only 15 mins on an NVIDIA GTX1080Ti. Experiments (Sec. 3) show that Quicksilver yields good results when combined with MAS (Tab. 1).
VoteNet: Given the warped atlas images and the target image, VoteNet independently predicts a binary mask for each warped atlas image, locally indicating whether that warped atlas should be considered for the final labeling decision of the target image. In effect, VoteNet predicts for each spatial location the set of trustworthy atlases; all atlases which should likely not be used are discarded. Hence, VoteNet implements a form of locally adaptive statistical trimming. Once the set of trustworthy atlases has been determined, their associated labels can be fused with any chosen label fusion strategy. For simplicity we use plurality voting. Our VoteNet strategy shifts the notion of a plurality to a plurality of trusted atlases. We define trusted plurality voting as
$$\hat{T}_S(x) = \arg\max_{l \in L} \sum_{i=1}^{n} W^i(x)\, \mathbb{1}\left[\tilde{A}_S^i(x) = l\right], \tag{2}$$
where $L = \{0, \ldots, N\}$ is the set of labels ($N$ structures; 0 indicating background), $\mathbb{1}[\cdot]$ is the indicator function, $x$ denotes a voxel position, $\tilde{A}_S^i$ is the $i$-th of the $n$ warped atlas segmentations, and $W^i$ is the binary mask of atlas $i$. We define $W^i(x) = 1$ if $\tilde{A}_S^i(x) = T_S(x)$ and $W^i(x) = 0$ if $\tilde{A}_S^i(x) \neq T_S(x)$, where $T_S$ is the manual segmentation of the target image.
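To make the voting rule concrete, the following is a minimal NumPy sketch of trusted plurality voting, assuming the warped atlas labels and the per-atlas binary masks are already available as integer arrays. The function names, the array layout (atlas index first), and the toy data are illustrative assumptions and are not taken from the actual implementation; breaking ties toward the smallest label index is likewise an assumption.

```python
import numpy as np


def agreement_masks(warped_labels, true_labels):
    """Binary masks: 1 where a warped atlas label matches the manual label, else 0."""
    return (warped_labels == true_labels[None]).astype(np.uint8)


def trusted_plurality_vote(warped_labels, masks, num_labels):
    """Fuse labels by plurality voting over trusted atlases only.

    warped_labels, masks: arrays of shape (n_atlases, *volume_shape).
    Returns the fused labels and a boolean map of voxels with >= 1 trusted atlas.
    """
    trusted = masks.astype(bool)
    # votes[l] counts, per voxel, the trusted atlases that propose label l.
    votes = np.stack(
        [((warped_labels == label) & trusted).sum(axis=0) for label in range(num_labels)]
    )
    fused = votes.argmax(axis=0)  # assumption: ties go to the smallest label
    any_trusted = trusted.any(axis=0)
    return fused, any_trusted


# Toy example (hypothetical data): 3 atlases, 4 voxels, labels {0, 1, 2}.
warped = np.array([[1, 2, 0, 1],
                   [1, 1, 0, 2],
                   [2, 1, 0, 2]])
manual = np.array([1, 2, 0, 1])

masks = agreement_masks(warped, manual)
fused, any_trusted = trusted_plurality_vote(warped, masks, num_labels=3)
plain, _ = trusted_plurality_vote(warped, np.ones_like(warped), num_labels=3)
print(fused)  # trusted plurality voting
print(plain)  # ordinary plurality voting (all atlases trusted)
```

In this toy case the masks are derived from the manual labels, as is done for the training targets; at test time they would instead be the thresholded network predictions. Where no atlas is trusted, all vote counts are zero and `argmax` falls back to label 0, which is why the sketch also returns `any_trusted` to flag such voxels.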
VoteNet training: Fig. 2 shows the VoteNet architecture, which is based on the 3D U-Net [4] (but uses the target image and a warped atlas image as inputs). VoteNet processes an image patch-wise, with each patch taken from the target image and from a warped atlas image at the same position, where the patch center is used to tile the volume. To train VoteNet, we use 20 images from the LONI Probabilistic Brain Atlas dataset (LPBA40). LPBA40 contains 40 3D brain MRIs with 56 manually segmented structures; preprocessing includes affine registration to the MNI152 atlas and histogram equalization. We randomly select 17 images and their labels as atlases and use Quicksilver to register all 17 atlases to the 20 images, excluding themselves. This results in 323 ($17 \times 19$) pairwise registrations. Given the manual segmentation labels of these 20 images, we determine at which locations the warped atlas labels agree with the manual labels (1) and where they disagree (0). These agreement/disagreement labels are our training labels for VoteNet, which is trained with a binary cross-entropy loss (Fig. 2) in PyTorch. As VoteNet produces continuous outputs (in $[0, 1]$ via a sigmoid), we threshold at 0.5 at test time, i.e., the local prediction is 1 where the output is above the threshold and 0 otherwise. We train using Adam over 300 epochs with a multi-step learning rate. The initial learning rate is 0.001; it is reduced to 0.0001 after 150 epochs and finally to 0.00001 after 250 epochs. Training image patches were randomly extracted so that 0 labels account for no less than 5% of the entire patch volume. Training requires 24h. The prediction of a single atlas mask takes 20s.
3 Experimental Results and Discussion
We use LPBA40 for evaluation. We use two-fold cross-validation, i.e., the dataset is randomly divided into two non-overlapping subsets of equal size. In each of the two cross-validation experiments, one set is used for training (Sec. 2) and the other for testing. The results below are averaged over the cross-validation folds.
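The fold construction, together with the training-time registration count from Sec. 2, can be sketched in a few lines of standard-library Python. The integer image IDs, the seeds, and the helper names are hypothetical and serve only to make the bookkeeping explicit.

```python
import random


def two_fold_split(image_ids, seed=0):
    """Randomly split the IDs into two non-overlapping halves of equal size."""
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)
    half = len(ids) // 2
    return ids[:half], ids[half:]


def registration_pairs(atlas_ids, target_ids):
    """All (atlas, target) pairs, skipping registration of an image to itself."""
    return [(a, t) for a in atlas_ids for t in target_ids if a != t]


# Hypothetical IDs 0..39 stand in for the 40 LPBA40 images.
train_fold, test_fold = two_fold_split(range(40))

# 17 of the 20 training images act as atlases (seed chosen arbitrarily).
atlases = random.Random(1).sample(train_fold, 17)

pairs = registration_pairs(atlases, train_fold)
print(len(train_fold), len(test_fold), len(pairs))
```

Each of the 17 atlases is registered to the 19 other images of its fold, which gives the $17 \times 19 = 323$ pairwise registrations used to build the training labels.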
Benchmark methods: We compare against plurality voting (PV) [6], majority voting (MV), simultaneous truth and performance level estimation (STAPLE) [14], multi-atlas based multi-image segmentation (MABMIS) [8], joint label fusion (JLF) [13], patch-based label fusion (PB) [2], and a U-Net [4]. MABMIS uses Diffeomorphic Demons for registration and hence its results are not directly comparable to ours; we use Quicksilver for all other label fusion methods. PV locally assigns the most frequent segmentation label among the atlases. MV assigns a label only if more than half of the atlases (i.e., at least 9 of our 17) agree. STAPLE uses a statistical model to estimate a true hidden segmentation based on an optimal weighting of the segmentations. MABMIS puts atlases in a tree structure to consider their correlations for concurrent alignment. JLF regards label fusion as an optimization problem and minimizes the total expectation of labeling errors. PB searches in a neighborhood to reduce registration errors and utilizes patch intensity and label information within a Bayesian label fusion framework. We include a U-Net for a direct comparison to a popular DL image segmentation approach. We also create an oracle label fusion strategy, which has access to the true label during local atlas selection. This allows establishing upper performance bounds.
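The difference between the two voting baselines is easy to miss, so here is a small NumPy sketch contrasting them. It assumes that MV falls back to the background label (0) when no label reaches a strict majority and that PV breaks ties toward the smallest label index; both choices, like the function names and toy data, are assumptions made for illustration.

```python
import numpy as np


def count_votes(warped_labels, num_labels):
    """votes[l] = number of atlases proposing label l at each voxel."""
    return np.stack([(warped_labels == l).sum(axis=0) for l in range(num_labels)])


def plurality_vote(warped_labels, num_labels):
    """Assign the most frequent label among all atlases."""
    return count_votes(warped_labels, num_labels).argmax(axis=0)


def majority_vote(warped_labels, num_labels, fallback=0):
    """Assign a label only if more than half of the atlases agree on it."""
    votes = count_votes(warped_labels, num_labels)
    n_atlases = warped_labels.shape[0]
    has_majority = votes.max(axis=0) > n_atlases / 2
    # assumption: voxels without a strict majority receive the fallback label
    return np.where(has_majority, votes.argmax(axis=0), fallback)


# Toy example (hypothetical data): 5 atlases, 3 voxels, labels {0, 1, 2, 3}.
warped = np.array([[1, 1, 2],
                   [1, 1, 2],
                   [1, 2, 2],
                   [2, 0, 2],
                   [0, 3, 1]])
print(plurality_vote(warped, num_labels=4))
print(majority_vote(warped, num_labels=4))
```

At the middle voxel, label 1 is the most frequent proposal but is supported by only 2 of 5 atlases, so PV assigns it while MV does not.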
Oracle label fusion: MAS depends on the interplay of image registration and label fusion. Conceivably, given a high-quality registration, one should be able to obtain a high-quality segmentation. To assess how well an ideal label fusion strategy could work, we investigate the behavior of an oracle label fusion method following our Quicksilver atlas-to-target image registrations. Specifically, Oracle($k$) assigns the correct label to a voxel if at least $k$ warped atlases (out of our 17) correctly label this voxel; otherwise the background label (0) is assigned.
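As a sketch of how such an oracle can be evaluated, the following NumPy snippet counts, per voxel, how many warped atlases carry the correct label and keeps the true label only where that count reaches the threshold. The function name, the parameter name `k`, and the toy arrays are hypothetical.

```python
import numpy as np


def oracle_fusion(warped_labels, true_labels, k, background=0):
    """Oracle(k): true label where at least k warped atlases are correct, else background."""
    n_correct = (warped_labels == true_labels[None]).sum(axis=0)
    return np.where(n_correct >= k, true_labels, background)


# Toy example (hypothetical data): 3 atlases, 4 voxels.
warped = np.array([[1, 2, 0, 1],
                   [1, 1, 0, 2],
                   [2, 1, 0, 2]])
manual = np.array([1, 2, 0, 1])

for k in (1, 2, 3):
    print(k, oracle_fusion(warped, manual, k))
```

Because the oracle needs the manual segmentation of the target, it is only usable for analysis: it bounds what any atlas selection followed by voting could achieve for a given set of registrations.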
Results: We use five measures to evaluate segmentations: average surface distance, average surface Dice score (i.e., a surface element is considered overlapping if it is within a certain distance (1mm) to the other surface), Hausdorff distance, 95% maximum surface distance, and average volume Dice score.
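For reference, the volume Dice score of a single structure is $2|P \cap G| / (|P| + |G|)$ for predicted and manual label sets $P$ and $G$. The sketch below computes it with NumPy; returning NaN when a label is absent from both volumes is an assumption of this sketch, and the helper name and toy arrays are hypothetical.

```python
import numpy as np


def volume_dice(pred, truth, label):
    """Dice overlap of one label between a predicted and a manual segmentation."""
    p = pred == label
    g = truth == label
    denom = p.sum() + g.sum()
    if denom == 0:
        return float("nan")  # assumption: score is undefined if the label is absent in both
    return 2.0 * np.logical_and(p, g).sum() / denom


# Toy example (hypothetical data): 4 voxels, structures 1 and 2.
pred = np.array([1, 1, 0, 2])
truth = np.array([1, 2, 0, 1])

per_label = [volume_dice(pred, truth, label) for label in (1, 2)]
print(per_label, float(np.mean(per_label)))
```

An average volume Dice score is then obtained by averaging over structures, here excluding the background label.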
Oracle results: Tab. 1 shows that Dice scores of Oracle(1) are close to 100%, indicating that at least one warped atlas label image locally agrees with the manual segmentation. Even Oracle(9) (where the correct label is only assigned if at least 9 of the 17 atlases agree on this labeling) results in Dice scores higher than state-of-the-art approaches. In contrast, MV is significantly worse than Oracle(9). Note that all labels (excluding background) of Oracle(9) are also contained in MV. Hence, MV contains incorrect labels for which the majority of atlases agree. Therefore, if VoteNet can identify good subsets of these atlases, a good segmentation should be achievable by majority or plurality voting.
Label fusion results: Tab. 1 shows that VoteNet greatly improves performance over MV/PV and significantly outperforms all other evaluated label fusion strategies on most measures. These results also illustrate that VoteNet successfully locally eliminates atlases that would otherwise have tipped the results to incorrect PV label assignments. Further, we observed that there are some voxels inside the brain that are not assigned any labels by VoteNet (i.e., locally all warped atlases are rejected). We therefore propose a combined VoteNet + U-Net strategy which fills in missing voxels via the U-Net segmentation. This strategy outperforms both VoteNet and U-Net. There is still a large gap between our VoteNet and the Oracle results. Hence, there is significant room for future improvement. Fig. 3 illustrates the performance of VoteNet. The predicted binary mask is close to the ground truth binary mask, indicating that VoteNet captures most areas of poor label alignment for a given atlas image. In fact, VoteNet achieves a volume Dice score of 0.86 on local atlas selection. Fig. 3 (right) also shows that by only retaining locally trustworthy atlases, the percentage of true positives (after VoteNet atlas selection) over all atlases grows significantly. Consequently, subsequent plurality voting better predicts the correct labels.
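The combined VoteNet + U-Net strategy described above amounts to a voxel-wise selection, which can be sketched as follows. The sketch assumes a boolean map marking voxels where at least one atlas was trusted; this map, the function name, and the toy arrays are hypothetical.

```python
import numpy as np


def fill_unlabeled(votenet_seg, any_trusted, unet_seg):
    """Keep the fused label where some atlas was trusted; otherwise use the U-Net label."""
    return np.where(any_trusted, votenet_seg, unet_seg)


# Toy example (hypothetical data): the second voxel had all atlases rejected.
votenet_seg = np.array([1, 0, 2, 0])
any_trusted = np.array([True, False, True, True])
unet_seg = np.array([1, 2, 1, 3])

print(fill_unlabeled(votenet_seg, any_trusted, unet_seg))
```

Note that a background label produced by trusted atlases (last voxel) is kept; only voxels without any trusted atlas are taken from the U-Net.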
U-Net results: U-Net results are generally good with respect to the volumetric Dice scores. However, as indicated by the surface measures (in particular, Hausdorff and 95% maximum surface distance), shapes of segmented structures may locally be distorted, as shape constraints are not straightforward to integrate into a CNN. This drawback is much less present in MAS, as a good deformable image registration method will preserve local structure and topology in target image space (based on transformation smoothness). Fig. 4 illustrates this effect. As highlighted by the red arrows, U-Net results often show inconsistent shapes, while VoteNet and VoteNet+U-Net produce shapes more consistent with the manual segmentations. Furthermore, our VoteNet and VoteNet+U-Net retain the cortical foldings, while PV tends to flatten them. This indicates that our proposed approach indeed complements a label fusion method such as PV well.
4 Conclusion and Future Work
We presented a new label fusion method (VoteNet) which helps locally select the most trustworthy atlases. With VoteNet, we achieve state-of-the-art segmentation performance, even surpassing a deep network (U-Net) while maintaining spatial shape consistency. We also provided an empirical analysis of the best achievable performance of our approach, indicating that there is still substantial room for further performance improvements. In particular, it would be interesting to combine VoteNet with more advanced label fusion strategies than plurality voting. As such strategies have shown improved performance for MAS, it is conceivable that they could also further improve our approach, for example, by leveraging local image information to assess atlas-to-target image similarity. It would also be valuable to explore more advanced network architectures as well as end-to-end formulations integrating the registration network.
References
 [1] Artaechevarria, X., Muñoz-Barrutia, A., Ortiz-de-Solórzano, C.: Efficient classifier generation and weighted voting for atlas-based segmentation: Two small steps faster and closer to the combination oracle. In: SPIE. vol. 6914 (2008)
 [2] Bai, W., Shi, W., O’Regan, D.P., Tong, T., Wang, H., Jamil-Copley, S., Peters, N.S., Rueckert, D.: A probabilistic patch-based label fusion model for multi-atlas segmentation with registration refinement: application to cardiac MR images. TMI 32(7), 1302–1315 (2013)
 [3] Benjamini, Y., Hochberg, Y.: Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal statistical society: series B (Methodological) 57(1), 289–300 (1995)
 [4] Çiçek, Ö., Abdulkadir, A., Lienkamp, S.S., Brox, T., Ronneberger, O.: 3D U-Net: learning dense volumetric segmentation from sparse annotation. In: MICCAI. pp. 424–432 (2016)
 [5] Hansen, L., Salamon, P.: Neural network ensembles. PAMI (10), 993– (1990)
 [6] Heckemann, R.A., Hajnal, J.V., Aljabar, P., Rueckert, D., Hammers, A.: Automatic anatomical brain MRI segmentation combining label propagation and decision fusion. NeuroImage 33(1), 115–126 (2006)
 [7] Iglesias, J.E., Sabuncu, M.R.: Multi-atlas segmentation of biomedical images: a survey. MEDIA 24(1), 205–219 (2015)
 [8] Jia, H., Yap, P.T., Shen, D.: Iterative multi-atlas-based multi-image segmentation with tree-based registration. NeuroImage 59(1), 422–430 (2012)
 [9] Konukoglu, E., Glocker, B., Zikic, D., Criminisi, A.: Neighbourhood approximation using randomized forests. MEDIA 17(7), 790–804 (2013)
 [10] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR. pp. 3431–3440 (2015)
 [11] Sanroma, G., Wu, G., Gao, Y., Shen, D.: Learning to rank atlases for multiple-atlas segmentation. TMI 33(10), 1939–1953 (2014)
 [12] Wang, H., Cao, Y., Syeda-Mahmood, T.: Multi-atlas segmentation with learning-based label fusion. In: MLMI. pp. 256–263 (2014)
 [13] Wang, H., Suh, J.W., Das, S.R., Pluta, J.B., Craige, C., Yushkevich, P.A.: Multi-atlas segmentation with joint label fusion. PAMI 35(3), 611–623 (2013)
 [14] Warfield, S.K., Zou, K.H., Wells, W.M.: Simultaneous truth and performance level estimation (STAPLE): an algorithm for the validation of image segmentation. TMI 23(7), 903 (2004)
 [15] Yang, H., Sun, J., Li, H., Wang, L., Xu, Z.: Neural multi-atlas label fusion: Application to cardiac MR images. MEDIA 49, 60–75 (2018)
 [16] Yang, X., Kwitt, R., Niethammer, M.: Quicksilver: Fast predictive image registration - a deep learning approach. NeuroImage 158, 378–396 (2017)