Brain tumours are among the most fatal types of cancer. Of the tumours that originate in the brain, gliomas are the most frequent. They arise from glial cells and, depending on their aggressiveness, are broadly categorized into high and low grade gliomas. High grade gliomas (HGG) develop rapidly and aggressively, forming abnormal vessels and often a necrotic core, accompanied by surrounding oedema and swelling. They are malignant, with high mortality and an average survival of less than two years even after treatment. Low grade gliomas (LGG) can be benign or malignant and grow more slowly, but they may recur and evolve into HGG, so their treatment is warranted. For treatment, patients undergo radiotherapy, chemotherapy and surgery.
Various neuro-imaging protocols are employed, first for diagnosis and monitoring of the tumour’s progression, then for treatment planning, and afterwards for assessing the effect of treatment. Magnetic resonance imaging (MRI) is widely used in both clinical routine and research studies. It facilitates tumour analysis by allowing estimation of the tumour's extent and location and investigation of its subcomponents. This, however, requires accurate delineation of the tumour, which proves challenging due to its complex structure and appearance, the 3D nature of the MR images and the multiple MR sequences that need to be consulted in parallel for informed judgement. These factors make manual delineation time-consuming and subject to inter- and intra-rater variability.
Automatic segmentation systems aim at providing an objective and scalable solution. Representative early works are the atlas-based outlier detection method [5] and the joint segmentation-registration framework, often guided by a tumour growth model [6, 7, 8], followed by decision forest approaches [9, 10]. More recently, convolutional neural networks (CNN) have gained popularity by exhibiting very promising results for segmentation of brain tumours [11, 12, 13].
A variety of CNN architectures have been proposed, each presenting different strengths and weaknesses. Additionally, networks have a vast number of meta-parameters. The multiple configuration choices for a system influence not only its performance but also its behaviour (Fig. 1). For instance, different models may perform better with different types of pre-processing. Consequently, when a single configuration is used to investigate behaviour on a given task, the findings can be biased by that configuration. Finally, a configuration highly optimized on a given database may over-fit it and not generalise to other data or tasks.
In this work we push towards constructing a more reliable and objective deep learning model. We bring together a variety of CNN architectures, configured and trained in diverse ways in order to introduce high variance between them. By combining them, we construct an Ensemble of Multiple Models and Architectures (EMMA), with the aim of averaging away the variance and with it model- and configuration-specific behaviours. Our approach (1) leads to a system robust to unpredictable failures of independent components, (2) enables objective analysis with a generic deep learning model of unbiased behaviour, and (3) introduces the new perspective of ensembling for objectiveness. This is in contrast to common ensembles, where a single model is trained with small variations such as initial seeds, which renders the ensemble biased by the main architectural choices. As a first milestone in this endeavour, we evaluated EMMA in the Brain Tumour Segmentation (BRATS) challenge 2017. Our method won the first position in the final testing stage among 50+ competing teams. This indicates the reliability of the approach and paves the way for its use in further analysis.
2 Background: Model bias, variance and ensembling
Feedforward neural networks have been shown capable of approximating any function [14]. They are thus models with zero bias, in principle free of systematic error. However, they are not a panacea. If left unregularized they can overfit noise in the training data, which leads to mistakes when they are called to generalise. Coupled with the stochasticity of the optimization process and the multiple local minima, this leads to unpredictable, inconsistent errors between different instances. Such models have high variance. Regularization reduces the variance but increases the bias, as expressed in the bias/variance dilemma [15]. Regularization can be explicit, such as weight decay, which prevents networks from learning rare noisy patterns, or implicit, such as the local connectivity of CNN kernels, which however does not allow the model to learn patterns larger than its receptive field. Architectural and configuration choices thus introduce bias, altering the behaviour of a network.
One route to address the bias/variance dilemma is ensembling. By combining multiple models, ensembling seeks to create a higher performing model with low variance. The most popular combination rule is averaging, which is not sensitive to inconsistent errors of the singletons [16]. Commonly, instances of a network trained with different initial weights or from multiple final local minima are ensembled, with the majority correcting irregular errors. Intuitively, only inconsistent errors can be averaged out. Lack of consistent failures can be interpreted as statistical independence. Thus methods for de-correlating the instances have been developed. The most popular is bagging [17], commonly used for random forests. It uses bootstrap sampling to learn less correlated instances from different subsets of the data.
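The variance-reduction argument can be illustrated with a toy simulation. The sketch below is not part of the described method: each hypothetical "model" predicts a true value plus independent zero-mean Gaussian noise, and the function name, noise level and number of trials are assumptions chosen for illustration.

```python
import numpy as np

def simulate_ensemble_variance(n_models, n_trials=20000, noise_std=1.0, seed=0):
    """Empirical variance of the average of n_models independent noisy predictions."""
    rng = np.random.default_rng(seed)
    true_value = 0.5  # arbitrary target that every toy model tries to predict
    # Each row is one trial; each column is the prediction of one model.
    predictions = true_value + rng.normal(0.0, noise_std, size=(n_trials, n_models))
    ensemble_prediction = predictions.mean(axis=1)  # averaging combination rule
    return ensemble_prediction.var()

single = simulate_ensemble_variance(1)
ensemble = simulate_ensemble_variance(10)
print(f"variance of a single model:     {single:.3f}")
print(f"variance of a 10-model average: {ensemble:.3f}")
```

With independent errors, the variance of the average shrinks roughly in proportion to the number of models. Correlated errors would reduce this benefit, which is why de-correlating the instances matters.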
The above works often discuss ensembling as a means of increasing performance. Sharkey and Sharkey [18] approached high variance from the perspective of unreliability. They discussed ensembling as a type of N-version programming, which advocates reliability through redundancy. When producing N versions of a program, versions may fail independently, but through majority voting they behave as a reliable system. They formalize intuitive requirements for reliability: a) the target function must be covered by the ensemble and b) the majority must be correct. This in turn advocates diversity, independence and overall quality of the components.
Biomedical applications are reliability-critical, and high variance would deter the use of neural networks. For this reason we set out to investigate the robustness of diverse ensembles. Departing from the above works, we introduce another perspective of ensembling: creating an objective, configuration-invariant model to facilitate objective analysis.
3 Ensembles of Multiple Models and Architectures
A variety of CNN architectures has shown promising results in recent literature. These architectures commonly differ in depth, number of filters and how they process multi-scale context, among other aspects. Such architectural choices bias the model behaviour. For instance, models with large receptive fields may show improved localisation capabilities but can be less sensitive to fine texture than models emphasizing local information. The strategy for handling class imbalance is another performance-relevant parameter. Common strategies are training with class-weighted sampling or class-weighted cross entropy. As analysed in earlier work, these methods strongly influence the sensitivity of the model to each class. Furthermore, the choice of the loss function impacts results. For example, we observed that networks trained to optimize Intersection over Union (IoU), Dice or similar losses tend to give worse confidence estimates than when trained with cross entropy (Fig. 1). Finally, the setting of hyper-parameters for the optimization can strongly affect performance. Practitioners often observe that the choice of the optimizer and its configuration, for instance the learning rate schedule, can make the difference between bad and good segmentation.
The sensitivity to such meta-parameters is a greater problem than merely the time-consuming manual optimization of configurations:
-  A configuration optimized on one set of training data may over-fit it and not perform well on unseen data or another task. This can be viewed as another source of high model variance (Sec. 2).
-  By biasing the behaviour of the model, the configuration also biases the findings of any analysis performed with the model.
We now formalize the problem and our perspective of ensembling as a solution. Given training data $X$ with labels $Y$, we need to learn the generating process $P(y|x)$. This is commonly approximated by a model $P(y|x;\theta_m,m)$, which has trainable parameters $\theta_m$ that are learnt via an optimization process that minimizes:

$$\theta_m = \arg\min_{\theta} \, d\big(P(Y|X),\, P(Y|X;\theta,m)\big) \tag{1}$$

where $d$ is a distance (defined by the type of loss) computed at the points given by the training data, while $m$ represents the choice of the meta-parameters. The latter is commonly neglected although it conditions (biases) the learnt estimator. To take it into account, we instead define $m$ as a stochastic variable over the space $M$ of meta-parameter configurations, with a corresponding prior $P(m)$. In order to learn a model of $P(y|x)$ unbiased by $m$, we marginalize out its effect:

$$P(y|x) = \sum_{m} P(y,m|x) = \sum_{m} P(y|x,m)\,P(m) \approx \frac{1}{|E|}\sum_{m\in E} P(y|x;\theta_m,m) \tag{2}$$

Here $E$ is the set of models within the ensemble. The prior $P(m)$ is considered uniform over a subspace of $M$ that is covered by the models in $E$ and zero elsewhere. Note that we have arrived at standard ensembling with averaging, by considering that each individual model approximates a conditional on $m$, $P(y|x,m)$, and that the true posterior is approximated by the ensemble, which marginalizes away the effects of $m$. Note also that the case of a single model configured by $\bar{m}$ can be derived from the above by setting a Dirac prior $P(m)=\delta(m-\bar{m})$. Thus the ensemble relaxes a pre-existing, neglected, strong prior.

The above formulation presents averaging ensembles from a new perspective: the marginalization over a subspace of the joint $P(y,m|x)$ offers generalisation, regularising the (manual) optimization process of $m$ from falling into minima where $P(y|x;\theta_m,m)$ overfits the given training data (Fig. 2). Moreover, the process leads to a more objective approximation of $P(y|x)$, in which the biasing effect of $m$ has been marginalized out. The exposed limitations agree with the requirements for ensembling mentioned in Sec. 2: we need to restrict the subspace of $M$ to an area of relatively high quality models and we need to cover it with a relatively small number of models, thus diversity is key.
In the remainder of this section we describe the main properties of the models used to construct the collection of EMMA, which cover various contemporary architectures, configured and trained under different settings. (Implementation and configuration details considered less important for this work were omitted to avoid cluttering the manuscript.)
Model description: The first architecture we employ is DeepMedic, originally presented in [20, 13]. It is a fully 3D, multi-scale CNN, designed with a focus on efficient processing of 3D images. For this, it employs parallel pathways that take down-sampled context as input, which avoids convolving large volumes at full resolution and keeps the model computationally cheap. Although originally developed for segmenting brain lesions, it was found promising on diverse tasks, such as segmentation of the placenta [21], making it a good component for a robust ensemble. We include two DeepMedic models in EMMA. The first is the residual version previously employed in BRATS 2016 [22], depicted in Fig. 3. The second is a wider variant, with double the number of feature maps at each layer.
Training details: The models are trained by extracting multi-scale image segments, centred with 50% probability on healthy tissue and 50% probability on tumour, as proposed in the original work. The wider variant is trained on larger inputs, of width 34 and 22 for the two scales respectively. Both are trained with cross-entropy loss, with all meta-parameters adopted from the original configuration.
Model description: We include three fully convolutional networks (FCN) [23] in EMMA. The second FCN is constructed larger than the first, replacing each convolutional layer with a residual block with two convolutions. The third is also residual-based, but with one less down-sampling step. All layers use batch normalisation, ReLUs and zero-padding.
Training details: We draw training patches of width 64 voxels for the first FCN and 80 voxels for the two residual-based FCNs, with an equal probability from each label. All three were trained using Adam. The first was trained to optimize the IoU loss [19], while the Dice loss was used for the other two.
Model description: We employ two 3D versions of the U-Net architecture [24] in our ensemble. The main elements of the first architecture are depicted in Fig. 5. In this version we follow the strategy suggested in [25] to reduce model complexity, where skip connections are implemented via summations of the signals in the up-sampling part of the network, instead of the concatenation originally used. The second architecture is similar but concatenates the skip connections and uses strided convolutions instead of max pooling. All layers use batch normalisation, ReLUs and zero-padding.
Training details: The U-Nets were trained with input patches of size 64×64×64. The patches were sampled only from within the brain, each centred with equal probability on a voxel from one of the four labels. The two networks were trained to minimize cross entropy via AdaDelta and Adam respectively, with different optimization, regularization and augmentation meta-parameters.
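The class-balanced patch sampling described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `sample_patch_centres`, the toy label volume and the brain mask are hypothetical, and patch extraction near the volume border is not handled.

```python
import numpy as np
from collections import Counter

def sample_patch_centres(label_map, brain_mask, n_patches, seed=0):
    """Draw patch-centre voxels so that every label present is equally likely."""
    rng = np.random.default_rng(seed)
    # Candidate centre coordinates per label, restricted to voxels inside the brain.
    candidates = {
        int(label): np.argwhere((label_map == label) & brain_mask)
        for label in np.unique(label_map[brain_mask])
    }
    labels = sorted(candidates)
    centres = []
    for _ in range(n_patches):
        label = labels[rng.integers(len(labels))]    # uniform over labels
        coords = candidates[label]
        centre = coords[rng.integers(len(coords))]   # uniform over voxels of that label
        centres.append((label, tuple(int(c) for c in centre)))
    return centres

# Toy volume with the four labels used in BRATS (0 = healthy tissue inside the brain).
label_map = np.zeros((8, 8, 8), dtype=np.int64)
label_map[2:6, 2:6, 2:6] = 2   # oedema
label_map[3:5, 3:5, 3:5] = 1   # necrotic core and non-enhancing tumour
label_map[3, 3, 3] = 4         # enhancing core (a single voxel)
brain_mask = np.zeros(label_map.shape, dtype=bool)
brain_mask[1:7, 1:7, 1:7] = True

centres = sample_patch_centres(label_map, brain_mask, n_patches=4000)
counts = Counter(label for label, _ in centres)
print(dict(counts))
```

Even though the enhancing core occupies a single voxel in this toy volume, it is chosen as a patch centre about as often as the far larger healthy region, which is the intended effect of sampling per label rather than per voxel.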
The above models are all trained completely separately. At testing time, each model individually segments an unseen image and outputs its class-confidence maps. The models are then ensembled into EMMA according to Eq. 2: for each class, the ensemble’s confidence map is computed by averaging, at every voxel, the confidence of the individual models that the voxel belongs to that class. The final segmentation of EMMA assigns to each voxel the class with the highest confidence.
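A minimal sketch of this averaging rule is given below. The array layout (classes on the first axis), the function name `ensemble_segmentation` and the toy confidences are assumptions for illustration. The output holds class indices, so a mapping from indices back to dataset label values would be a separate step.

```python
import numpy as np

def ensemble_segmentation(confidence_maps):
    """Average per-model class-confidence maps and take the voxel-wise argmax.

    confidence_maps: list of arrays shaped (n_classes, X, Y, Z), one per model.
    """
    stacked = np.stack(confidence_maps, axis=0)    # (n_models, n_classes, X, Y, Z)
    mean_confidence = stacked.mean(axis=0)         # equal weight for every model
    class_index = mean_confidence.argmax(axis=0)   # (X, Y, Z), most confident class
    return mean_confidence, class_index

# Three toy models, two classes, a volume of 1x1x2 voxels.
m1 = np.array([[[[0.9, 0.2]]], [[[0.1, 0.8]]]])
m2 = np.array([[[[0.8, 0.6]]], [[[0.2, 0.4]]]])
m3 = np.array([[[[0.4, 0.3]]], [[[0.6, 0.7]]]])

mean_confidence, class_index = ensemble_segmentation([m1, m2, m3])
print(mean_confidence[:, 0, 0, :])
print(class_index)
```

In the toy data, the third model disagrees with the other two on the first voxel and the second model disagrees on the second voxel. In both cases the average follows the majority, so an isolated failure of one component does not change the final label.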
3.5 Implementation details
The original implementation of DeepMedic was used for the corresponding two models, along with the default meta-parameters, publicly available at https://biomedia.doc.ic.ac.uk/software/deepmedic/. The FCNs were implemented using DLTK, a deep learning library with a focus on medical imaging applications that allowed quick implementation and experimentation (https://github.com/DLTK/DLTK). Finally, an adaptation of the U-Net will be released at https://gitlab.com/eferrante.
Our system was evaluated on the data from the Brain Tumour Segmentation Challenge 2017 (BRATS) [4, 26, 27, 28]. The training set consists of 210 cases with high grade glioma (HGG) and 75 cases with low grade glioma (LGG), for which manual segmentations are provided. The segmentations include the following tumour tissue labels: 1) necrotic core and non-enhancing tumour, 2) oedema, 4) enhancing core. Label 3 is not used. The validation set consists of 46 cases, both HGG and LGG, but the grade is not revealed. Reference segmentations for the validation set are hidden and evaluation is carried out via an online system that allows multiple submissions. In the testing phase of the competition, a test set of 146 cases is provided to the teams, who have a 48-hour window for a single submission to the system. For evaluation, the 3 predicted labels are merged into three regions: the whole tumour (all labels), the core (labels 1 and 4) and the enhancing tumour (label 4). For each subject, four MRI sequences are available: FLAIR, T1, T1 contrast enhanced (T1ce) and T2. The datasets are pre-processed by the organisers and provided skull-stripped, registered to a common space and resampled to isotropic resolution. The dimensions of each volume are 240×240×155.
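The merging of labels into the three evaluated regions can be written compactly. The sketch below assumes a segmentation stored as an integer array with the label values listed above. The function name `brats_regions` and the dictionary keys are hypothetical.

```python
import numpy as np

def brats_regions(segmentation):
    """Merge tumour labels (1, 2, 4) into the three evaluated binary regions."""
    seg = np.asarray(segmentation)
    return {
        "whole_tumour": np.isin(seg, [1, 2, 4]),   # all tumour labels
        "tumour_core": np.isin(seg, [1, 4]),       # necrotic/non-enhancing + enhancing
        "enhancing_tumour": seg == 4,              # enhancing core only
    }

# Toy 1D "segmentation": background, core, oedema, enhancing, oedema, background.
toy_segmentation = np.array([0, 1, 2, 4, 2, 0])
regions = brats_regions(toy_segmentation)
region_sizes = {name: int(mask.sum()) for name, mask in regions.items()}
print(region_sizes)
```

The regions are nested: the enhancing tumour is contained in the core, which is contained in the whole tumour, so an error on label 4 affects all three metrics.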
4.2 Preprocessing: Ensembling intensity normalisation methods
We experimented with three different versions of intensity normalisation as pre-processing: 1) Z-score normalisation of each modality of each case individually, using the mean and standard deviation of the brain intensities. 2) Bias field correction followed by (1). 3) Bias field correction, followed by piece-wise linear normalisation [29], followed by (1). Preliminary comparisons were inconclusive. We instead chose to average away the normalisation’s effect with EMMA. Three instances of each network were trained, each on data processed with a different normalisation. They were applied to correspondingly processed images for inference and all results were averaged in EMMA (Fig. 6).
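Normalisation variant (1) can be sketched as below for a single modality of a single case. This is an illustration under stated assumptions: the function name `zscore_normalise` is hypothetical, the brain mask is assumed to be given, and the transform is applied to every voxel, since the handling of background voxels is not specified here.

```python
import numpy as np

def zscore_normalise(image, brain_mask):
    """Z-score one modality using the mean and std of intensities inside the brain."""
    image = np.asarray(image, dtype=np.float64)
    brain_values = image[brain_mask]
    mean, std = brain_values.mean(), brain_values.std()
    if std == 0:
        raise ValueError("brain intensities are constant; cannot normalise")
    # Assumption: the same affine transform is applied to all voxels, brain or not.
    return (image - mean) / std

# Toy volume with a cubic "brain" region.
rng = np.random.default_rng(0)
image = rng.normal(100.0, 20.0, size=(6, 6, 6))
brain_mask = np.zeros(image.shape, dtype=bool)
brain_mask[1:5, 1:5, 1:5] = True

normalised = zscore_normalise(image, brain_mask)
print(normalised[brain_mask].mean(), normalised[brain_mask].std())
```

For a multi-sequence case, the function would be called once per sequence (FLAIR, T1, T1ce, T2), so that each sequence is normalised with its own statistics.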
We provide the results that EMMA achieved on the validation and testing sets of the BRATS’17 challenge in Table 1 (leaderboard: https://www.cbica.upenn.edu/BraTS17/lboardValidation.html). Our system won the competition by achieving the overall best performance in the testing phase, based on Dice score (DSC) and Hausdorff distance. We also show results achieved on the validation set by the teams that ranked in the next two positions at the testing stage. No testing-phase metrics are available to us for these methods. We note that EMMA achieves similar levels of performance on validation and test sets, even though the latter contains data from different sources, indicating the robustness of the method. In comparison, competing methods were very good fits for the validation set, but did not manage to retain the same levels on the testing set. This emphasizes the importance of research towards robust and reliable systems.
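For reference, the Dice score between a predicted and a reference binary region is twice the size of their overlap divided by the sum of their sizes. The sketch below is a generic implementation, not the challenge's evaluation code. The function name `dice_score` is hypothetical, and returning 1.0 when both regions are empty is a convention chosen here that may differ from the official evaluation.

```python
import numpy as np

def dice_score(prediction, reference):
    """Dice similarity coefficient between two binary masks."""
    prediction = np.asarray(prediction, dtype=bool)
    reference = np.asarray(reference, dtype=bool)
    denominator = prediction.sum() + reference.sum()
    if denominator == 0:
        # Assumed convention: two empty masks count as perfect agreement.
        return 1.0
    overlap = np.logical_and(prediction, reference).sum()
    return 2.0 * overlap / denominator

# Two masks of two voxels each that share exactly one voxel.
predicted_mask = np.array([1, 1, 0, 0])
reference_mask = np.array([1, 0, 1, 0])
score = dice_score(predicted_mask, reference_mask)
print(score)  # 2 * 1 / (2 + 2) = 0.5
```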
Neural networks have proven to be very potent, yet imperfect, estimators that often make unpredictable errors. Biomedical applications, however, are reliability-critical. For this reason we first concentrate on improving robustness. Towards this goal we introduced EMMA, an ensemble of widely varying CNNs. By combining a heterogeneous collection of networks we construct a model that is insensitive to independent failures of its CNN components and thus generalises well (Fig. 7). We also introduced the new perspective of ensembling for objectiveness. By marginalizing out, via ensembling, the biased behaviour introduced by configuration choices, EMMA becomes a model better suited to objective analysis. Even though the individual networks have straightforward architectures and were not optimized for the task, EMMA won the first position in the final testing stage of the BRATS 2017 competition among 50+ teams, indicating strong generalisation.
By being robust to suboptimal configurations of its components, EMMA may offer re-usability on different tasks, which we aim to explore in the future. EMMA could also be useful for unbiased investigation of factors such as the sensitivity of CNNs to different sources of domain shift, which strongly affects large-scale studies [30], or for estimating the amount of training data required for a task. Finally, EMMA’s uncertainty could serve as a more objective measure of what types of patients or tumours are most challenging to learn.
This work is supported by the EPSRC (EP/N023668/1, EP/N024494/1 and EP/P001009/1) and partially funded under the 7th Framework Programme by the European Commission (CENTER-TBI: https://www.center-tbi.eu/). KK is supported by the President’s PhD Scholarship of Imperial College London. EF is beneficiary of an AXA Research Fund postdoctoral grant. NP is supported by Microsoft Research through its PhD Scholarship Programme and the EPSRC Centre for Doctoral Training in High Performance Embedded and Distributed Systems (HiPEDS, Grant Reference EP/L016796/1). We gratefully acknowledge the support of NVIDIA with the donation of GPUs for our research.
-  DeAngelis, L.M.: Brain tumors. New Engl. Journ. of Med. 344(2) (2001) 114–123
-  Bauer, S., Wiest, R., Nolte, L.P., Reyes, M.: A survey of mri-based medical image analysis for brain tumor studies. Physics in med. and biol. 58(13) (2013) R97
-  Louis, D.N., et al.: The 2016 world health organization classification of tumors of the central nervous system: a summary. Acta neuropathologica 131(6) (2016) 803–820
-  Menze, B.H., Jakab, A., Bauer, S., Kalpathy-Cramer, J., Farahani, K., Kirby, J., Burren, Y., Porz, N., Slotboom, J., Wiest, R., et al.: The multimodal brain tumor image segmentation benchmark (brats). IEEE TMI 34(10) (2015) 1993–2024
-  Prastawa, M., Bullitt, E., Ho, S., Gerig, G.: A brain tumor segmentation framework based on outlier detection. Med. Image Anal. 8(3) (2004) 275–283
-  Gooya, A., Pohl, K.M., Bilello, M., Biros, G., Davatzikos, C.: Joint segmentation and deformable registration of brain scans guided by a tumor growth model. In: MICCAI, Springer (2011) 532–540
-  Parisot, S., Duffau, H., Chemouny, S., Paragios, N.: Joint tumor segmentation and dense deformable registration of brain mr images. In: MICCAI. (2012) 651–658
-  Bakas, S., Zeng, K., Sotiras, A., Rathore, S., Akbari, H., Gaonkar, B., Rozycki, M., Pati, S., Davatzikos, C.: Glistrboost: combining multimodal mri segmentation, registration, and biophysical tumor growth modeling with gradient boosting machines for glioma segmentation. In: Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries, Springer (2015) 144–155
-  Zikic, D., Glocker, B., Konukoglu, E., Criminisi, A., Demiralp, C., Shotton, J., Thomas, O.M., Das, T., Jena, R., Price, S.J.: Decision forests for tissue-specific segmentation of high-grade gliomas in multi-channel mr. In: MICCAI, Springer (2012) 369–376
-  Le Folgoc, L., Nori, A.V., Ancha, S., Criminisi, A.: Lifted auto-context forests for brain tumour segmentation. In: Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries, Springer (2016) 171–183
-  Urban, G., Bendszus, M., Hamprecht, F., Kleesiek, J.: Multi-modal brain tumor segmentation using deep convolutional neural networks. BRATS-MICCAI (2014)
-  Pereira, S., Pinto, A., Alves, V., Silva, C.A.: Brain tumor segmentation using convolutional neural networks in mri images. IEEE TMI 35(5) (2016) 1240–1251
-  Kamnitsas, K., Ledig, C., Newcombe, V.F., Simpson, J.P., Kane, A.D., Menon, D.K., Rueckert, D., Glocker, B.: Efficient multi-scale 3D CNN with fully connected CRF for accurate brain lesion segmentation. Med. Image Anal. 36 (2017) 61–78
-  Hornik, K., Stinchcombe, M., White, H.: Multilayer feedforward networks are universal approximators. Neural networks 2(5) (1989) 359–366
-  Geman, S., Bienenstock, E., Doursat, R.: Neural networks and the bias/variance dilemma. Neural Computation 4(1) (1992) 1–58
-  Kittler, J., Hatef, M., Duin, R.P., Matas, J.: On combining classifiers. IEEE transactions on pattern analysis and machine intelligence 20(3) (1998) 226–239
-  Breiman, L.: Bagging predictors. Machine learning 24(2) (1996) 123–140
-  Sharkey, A.J., Sharkey, N.E.: Combining diverse neural nets. The Knowledge Engineering Review 12(3) (1997) 231–247
-  Nowozin, S.: Optimal decisions from probabilistic models: the intersection-over-union case. In: CVPR. (2014) 548–555
-  Kamnitsas, K., Chen, L., Ledig, C., Rueckert, D., Glocker, B.: Multi-scale 3d convolutional neural networks for lesion segmentation in brain mri. In: Proc. of ISLES-MICCAI (2015)
-  Alansary, A., Kamnitsas, K., Davidson, A., Khlebnikov, R., Rajchl, M., Malamateniou, C., Rutherford, M., Hajnal, J.V., Glocker, B., Rueckert, D., et al.: Fast fully automatic segmentation of the human placenta from motion corrupted mri. In: MICCAI, Springer (2016) 589–597
-  Kamnitsas, K., Ferrante, E., Parisot, S., Ledig, C., Nori, A.V., Criminisi, A., Rueckert, D., Glocker, B.: Deepmedic for brain tumor segmentation. In: Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries. (2016) 138–149
-  Long, J., et al.: Fully convolutional networks for semantic segmentation. In: CVPR. (2015) 3431–3440
-  Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image segmentation. In: MICCAI, Springer (2015) 234–241
-  Guerrero, R., Qin, C., Oktay, O., Bowles, C., Chen, L., Joules, R., Wolz, R., Valdes-Hernandez, M., Dickie, D., Wardlaw, J., et al.: White matter hyperintensity and stroke lesion segmentation and differentiation using convolutional neural networks. arXiv:1706.00935 (2017)
-  Bakas, S., Akbari, H., Sotiras, A., Bilello, M., Rozycki, M., Kirby, J., Freymann, J., Farahani, K., Davatzikos, C.: Advancing the cancer genome atlas glioma mri collections with expert segmentation labels and radiomic features. Nature Scientific Data (2017)
-  Bakas, S., Akbari, H., Sotiras, A., Bilello, M., Rozycki, M., Kirby, J., Freymann, J., Farahani, K., Davatzikos, C.: Segmentation labels and radiomic features for the pre-operative scans of the tcga-gbm collection. The Cancer Imaging Archive (2017)
-  Bakas, S., Akbari, H., Sotiras, A., Bilello, M., Rozycki, M., Kirby, J., Freymann, J., Farahani, K., Davatzikos, C.: Segmentation labels and radiomic features for the pre-operative scans of the tcga-lgg collection. The Cancer Imaging Archive (2017)
-  Nyúl, L.G., Udupa, J.K., Zhang, X.: New variants of a method of mri scale standardization. IEEE TMI 19(2) (2000) 143–150
-  Kamnitsas, K., Baumgartner, C., Ledig, C., Newcombe, V., Simpson, J., Kane, A., Menon, D., Nori, A., Criminisi, A., Rueckert, D., et al.: Unsupervised domain adaptation in brain lesion segmentation with adversarial networks. In: Information Processing in Medical Imaging, Springer (2017) 597–609