Hippocampus Segmentation on Epilepsy and Alzheimer's Disease Studies with Multiple Convolutional Neural Networks

01/14/2020, by Diedre Carmo, et al.

Hippocampus segmentation on magnetic resonance imaging (MRI) is of key importance for the diagnosis, treatment decisions and investigation of neuropsychiatric disorders. Automatic segmentation is a very active research field, with many recent models involving Deep Learning for this task. However, Deep Learning requires a training phase, which can introduce bias from the specific domain of the training dataset. Current state-of-the-art methods train their models on healthy subjects or Alzheimer's disease patients from public datasets. This raises the question of whether these methods can recognize the hippocampus in a very different domain. In this paper we present a state-of-the-art, open source, ready-to-use hippocampus segmentation methodology using Deep Learning. We analyze this methodology alongside other recent Deep Learning methods in two domains: the public HarP benchmark and an in-house epilepsy patients dataset. Our internal dataset differs significantly from scans of Alzheimer's patients and healthy subjects: some scans are from patients who have undergone hippocampal resection as surgical treatment for epilepsy. We show that our method surpasses others from the literature on both the Alzheimer's and epilepsy test datasets.




1 Introduction

The hippocampus is a small, medial, subcortical brain structure related to long- and short-term memory Andersen (2007). Hippocampal segmentation from magnetic resonance imaging (MRI) is of great importance for research on neuropsychiatric disorders and can also be used in the preoperative investigation of pharmacoresistant temporal lobe epilepsy Ghizoni et al. (2015). The hippocampus can be affected in shape and volume by different pathologies, such as the neurodegeneration associated with Alzheimer’s disease Petersen et al. (2010), or surgical intervention to treat temporal lobe epilepsy Ghizoni et al. (2017). Medical research on these diseases usually involves manual segmentation of the hippocampus, requiring time and expertise in the field. The high cost associated with manual segmentation has stimulated the search for effective automatic segmentation methods. Some of those methods, such as FreeSurfer Fischl (2012), are already used as a starting point for a finer manual segmentation later McCarthy et al. (2015).

While conducting research on epilepsy and methods for hippocampus segmentation, two things drew our attention. Firstly, the use of Deep Learning and Convolutional Neural Networks (CNNs) is in the spotlight, with most of the recent hippocampus segmentation methods featuring them. Secondly, many of these methods rely on publicly available datasets for training and evaluation, and therefore have access only to scans of healthy subjects or patients with Alzheimer’s disease.

Considering these facts, we present an evaluation of some recent methods Roy et al. (2019); Thyreau et al. (2018); Isensee et al. (2017), including an improved version of our own Deep Learning based hippocampus segmentation method Carmo et al. (2019b), using a dataset from a different domain. This in-house dataset, named HCUnicamp, contains scans from patients with epilepsy (pre and post surgical removal of the hippocampus), with patterns of atrophy different from those observed in both the Alzheimer’s data and healthy subjects. Additionally, we use the public Alzheimer’s HarP dataset for training and further comparisons with other methods.

1.1 Contributions

In summary, the main contributions of this paper are as follows:

  • A readily available hippocampus segmentation methodology consisting of an ensemble of 2D CNNs coupled with traditional 3D post-processing, achieving state-of-the-art performance on public data and using recent advancements from the Deep Learning literature.

  • An evaluation of recent hippocampus segmentation methods on our epilepsy dataset, which includes post-operative images of patients missing one of the hippocampi. We show that our method is also superior in this domain, although no method was able to achieve good performance on this dataset, according to our manual annotations.

This paper is organized as follows: Section 2 presents a literature review of recent Deep Learning based hippocampus segmentation methods. Section 3 presents the two datasets involved in this research in more detail. A detailed description of our hippocampus segmentation methodology is in Section 4. Section 5 reports experimental results from our methodology development alongside qualitative and quantitative comparisons with other methods on HarP and HCUnicamp, while Sections 6 and 7 present, respectively, an extended discussion of those results and our conclusions.

2 Literature Review

Before the rise of Deep Learning in medical imaging segmentation, most hippocampus segmentation methods used some form of optimization of registration and deformation to atlas(es) Wang et al. (2013); Iglesias and Sabuncu (2015); Pipitone et al. (2014); Fischl (2012); Chincarini et al. (2016); Platero and Tobar (2017). Even today, medical research uses results from FreeSurfer Fischl (2012), a high-impact segmentation work for multiple brain structures, available as a software suite. Those atlas-based methods can produce high-quality segmentations, taking, however, around 8 hours per volume. Lately, a more time-efficient approach has appeared in the literature, namely the use of such atlases as training volumes for CNNs. Deep Learning methods can achieve similar overlap metrics while predicting results in a matter of seconds per volume Chen et al. (2017); Xie and Gillies (2018); Wachinger et al. (2018); Thyreau et al. (2018); Roy et al. (2019); Ataloglou et al. (2019); Dinsdale et al. (2019).

Recent literature on hippocampus segmentation with Deep Learning explores different architectures, loss functions and overall methodologies for the task. One approach common to most of the studies involves combining 2D or 3D CNNs with patches as inputs in the training phase. Note that some works focus on hippocampus segmentation, while others attempt segmentation of multiple neuroanatomical structures. A brief summary of each of those works follows.

Chen et al. Chen et al. (2017) report 0.9 Dice Sudre et al. (2017) with 10-fold cross validation on 110 ADNI Petersen et al. (2010) volumes, using a novel CNN input idea. Instead of using only the three orthogonal planes as patches, they also cut the volume in six additional diagonal orientations. This results in 9 planes, which are fed to 9 small modified U-Net Ronneberger et al. (2015) CNNs. The ensemble of these U-Nets constructs the final result.

Xie et al. Xie and Gillies (2018) train a voxel-wise classification method using triplanar patches crossing the target voxel. They merge features from all patches into a deep neural network with a fully connected classifier, alongside standard use of ReLU activations and softmax Krizhevsky et al. (2012). The training patches come only from the approximate central area where the hippocampus usually is, balancing labels for a 1:1 ratio of foreground and background target voxels. Voxel classification methods tend to be faster than multi-atlas methods, but still slower than fully convolutional neural networks.

DeepNat from Wachinger et al. Wachinger et al. (2018) achieves segmentation of 25 structures with a 3D CNN architecture. With a hierarchical approach, a 3D CNN separates foreground from background and another 3D CNN segments the 25 sub-cortical structures in the foreground. Alongside a novel parametrization method replacing coordinate augmentation, DeepNat uses 3D Conditional Random Fields as post-processing. The architecture performs voxel-wise classification, taking into account the classification of neighboring voxels. This work’s results mainly focus on the MICCAI Labeling Challenge, with around 0.86 Dice in hippocampus segmentation.

Thyreau et al. Thyreau et al. (2018)’s model, named Hippodeep, uses CNNs trained on a region of interest (ROI). However, where we apply one CNN for each plane of view, Thyreau et al. use a single CNN, starting with a planar analysis followed by layers of 3D convolutions and shortcut connections. Their study used more than 2000 patients, augmented to around 10000 volumes. The model is initially trained with FreeSurfer segmentations, and later fine-tuned using volumes for which the authors had access to manual segmentations, the gold standard. Thyreau’s method requires MNI152 registration of the input data, which adds around a minute of computation time, but the model is generally faster than multi-atlas or voxel-wise classification methods, achieving generalization across different datasets, as verified by Nogovitsyn et al. Nogovitsyn et al. (2019).

QuickNat from Roy et al. Roy et al. (2019) achieves faster segmentation than DeepNat by using a multiple-CNN approach instead of voxel-wise classification. Its methodology follows a consensus of multiple 2D U-Net-like architectures specialized in each slice orientation. The use of FreeSurfer Fischl (2012) masks over hundreds of public volumes to generate silver-standard annotations allows for much more data than is usually available in medical imaging. Later, after the network already knows how to localize the structures, it is fine-tuned on more precise gold-standard labels. Inputs for this method need to conform to the FreeSurfer format.

Ataloglou et al. Ataloglou et al. (2019) recently displayed another case of fusing multiple CNN outputs, specialized in the axial, coronal and sagittal orientations, into a final hippocampus segmentation. They used U-Net-like CNNs specialized in each orientation, followed by error-correction CNNs and a final average fusion of the results. They went against the common approach of training U-Nets on patches for data augmentation, instead using cropped slices. This raises concerns about overfitting to the dataset used, HarP Boccardi et al. (2015), supported by the need for fine-tuning to generalize to a different dataset.

Dinsdale et al. Dinsdale et al. (2019) mix knowledge from multi-atlas works with Deep Learning, using a 3D U-Net CNN to predict a deformation field from an initial binary sphere to the segmentation of the hippocampus, achieving around 0.86 Dice on HarP. Interestingly, adding an auxiliary classification task did not improve segmentation results.

It is known that Deep Learning approaches require a large amount of training data, something that is not commonly available, especially in medical imaging. Common ways of increasing the quantity of data in the literature include using 2D CNNs over regions (patches) of slices, with some form of patch selection strategy. The Fully Convolutional Neural Network (FCNN) U-Net Ronneberger et al. (2015) architecture has shown potential to learn from relatively small amounts of data with its encoding, decoding and concatenation scheme, even working with 3D convolutions applied directly to a 3D volume Isensee et al. (2017).

Looking at these recent works, one can confirm the segmentation potential of the U-Net architecture, including the idea of an ensemble of 2D U-Nets instead of a single 3D one, as presented by us Carmo et al. (2019a, b), by simultaneous recent work Roy et al. (2019); Ataloglou et al. (2019), and even by works on other segmentation problems Lucena et al. (2018). In this paper, some of those methods were reproduced for comparison purposes on our in-house dataset, namely Roy et al. (2019); Thyreau et al. (2018), including a 3D U-Net architecture test from Isensee et al. (2017).

3 Data

This study uses mainly two different datasets: one collected locally for an epilepsy study, named HCUnicamp, and one public dataset from the ADNI Alzheimer’s study, HarP. HarP is commonly used in the literature as a hippocampus segmentation benchmark. The main difference between the datasets is the lack of one of the hippocampi in 70% of the epilepsy scans in HCUnicamp, as these patients underwent surgical removal (Figure 1).

Although our method needs input data to be in the MNI152 Brett et al. (2001) orientation, data from those datasets are in native space and are not registered. We provide an orientation correction by rigid registration as an option when predicting in external volumes, to avoid orientation mismatch problems.

Figure 1: (a) 3D rendering of the manual annotation (in green) of one of the HarP dataset volumes. (b) A coronal center-crop slice of the average hippocampus mask over all volumes in HarP (green) and HCUnicamp (red); zero corresponds to the center. (c) Sagittal, (d) coronal and (e) axial HCUnicamp slices from a post-operative scan, with annotations in red.

3.1 HarP

This methodology was developed with training and validation on HarP Boccardi et al. (2015), a widely used benchmark dataset in the hippocampus segmentation literature. The full HarP release contains 135 T1-weighted MRI volumes. Alzheimer’s disease classes are balanced, with equal occurrence of cognitively normal (CN), mild cognitive impairment (MCI) and Alzheimer’s disease (AD) cases Petersen et al. (2010). Volumes were min-max intensity normalized between 0 and 1, and no volumes were removed. Training with hold-out was performed with 80% training, 10% validation and 10% testing; k-fold experiments, when used, consisted of 5 folds, with no overlap between the test sets.
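The 80/10/10 hold-out with balanced Alzheimer's classes can be sketched as a stratified split. This is an illustrative reimplementation, not the authors' code; the function name, seed and rounding behaviour are our assumptions:

```python
import random

def stratified_holdout(volumes, labels, fractions=(0.8, 0.1, 0.1), seed=42):
    """Split volume IDs into train/val/test, stratifying by diagnosis
    label (e.g. CN/MCI/AD) so each split keeps roughly the same
    class balance. Returns (train, val, test) lists of volume IDs."""
    rng = random.Random(seed)
    by_label = {}
    for vol, lab in zip(volumes, labels):
        by_label.setdefault(lab, []).append(vol)
    train, val, test = [], [], []
    for group in by_label.values():
        rng.shuffle(group)
        n_train = int(round(len(group) * fractions[0]))
        n_val = int(round(len(group) * fractions[1]))
        train.extend(group[:n_train])
        val.extend(group[n_train:n_train + n_val])
        test.extend(group[n_train + n_val:])
    return train, val, test
```

With 135 volumes and three balanced classes, this yields splits close to 108/13/14 volumes while keeping each diagnosis equally represented in every split.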

3.2 HCUnicamp

HCUnicamp was collected in-house by personnel from the Brazilian Institute of Neuroscience and Neurotechnology (BRAINN) at UNICAMP’s Hospital de Clínicas. The dataset contains 190 T1-weighted 3T MRI acquisitions, in native space: 58 are healthy controls and 132 are epilepsy patients. Of the patients, 70% had one of the hippocampi surgically removed, resulting in very different shape and texture from what is commonly seen in public datasets (Figure 1). More details about the surgical procedure can be found in Ghizoni et al. (2015, 2017). All volumes have manual annotations of the hippocampus, performed by one rater. Voxel intensity is min-max normalized between 0 and 1, per volume. This data acquisition was approved by an Ethics and Research Committee.

Comparisons between the datasets can be seen in Figure 1. The difference in mean mask position, due to the inclusion of the neck in HCUnicamp, is notable, alongside the lower presence of left hippocampus labels due to surgical intervention for epilepsy (Figure 1b).

To investigate how different methods deal with the absence of a hippocampus and unusual textures, we used the HCUnicamp dataset (considered a different domain) as a final test set and benchmark. Our methodology was only tested on this dataset at the end, alongside the other methods; results on HCUnicamp were not taken into account in our method’s methodological choices.

4 Segmentation Methodology

In this section, the general methodology (Figure 2) for our hippocampus segmentation method is detailed. In summary, the activations from three orientation specialized modified 2D U-Net CNNs are merged into an activation consensus. Each network’s activations for a given input volume are built slice by slice. The three activation volumes are averaged into a consensus volume, which is post-processed into the final segmentation mask. The following sections go into more detail for each part of the architecture and method overall.

Figure 2: The final segmentation volume is generated by taking into account activations from three FCNNs specialized on each 2D orientation. Neighboring slices are taken into account in a multi-channel approach. Full slices are used in prediction time, but training uses patches.

4.1 U-Net architecture

The basic structure of our networks is inspired by the U-Net FCNN architecture Ronneberger et al. (2015). However, some modifications based on other successful works were applied to the architecture (Figure 3). Instead of a single 2D patch as input, two neighbouring patches are concatenated, leaving the patch corresponding to the target mask in the center Pereira et al. (2019). Residual connections based on ResNet He et al. (2016) were added between the input and output of each double convolutional block, as 1x1 2D convolutions to account for the different number of channels. Batch normalization was added to each convolution inside the convolutional block, to accelerate convergence and facilitate learning Ioffe and Szegedy (2015). Also, all convolutions use padding to keep dimensions and have no bias.

Figure 3: Final architecture of each modified U-Net in Figure 2. Of note, in comparison to the original U-Net, are the use of BatchNorm, residual connections in each convolutional block, the 3-channel neighbour-patch input and the sigmoid-limited output. Padding is also used after convolutions.

4.2 Residual Connections

Residual or shortcut connections have been shown to improve convergence and performance of CNNs He et al. (2016), either in the form of direct connections propagating past results to the next convolution’s input by adding values, or in the form of 1x1 convolutions to deal with a different number of channels. An argument for their effectiveness is that residual connections offer a way to propagate values without any transformation, which is otherwise not trivial when the network consists of multiple non-linear transformations in the form of convolutions followed by max pooling.

In this work, residual connections were implemented in the form of a 1x1 convolution, adding the input of the first 3x3 convolution to the result of the batch normalization of the second 3x3 convolution in a convolutional block (Figure 3).

4.3 Weight Initialization, Bias and Batch-normalization

It has been shown that weight initialization is crucial for proper convergence of CNNs Kumar (2017). In computer vision tasks, pre-initialized weights that already recognize basic image pattern recognition features, such as border directions, frequencies and textures, can be helpful. This work uses VGG11 Simonyan and Zisserman (2014) weights in the encoder part of the U-Net architecture, as in Iglovikov and Shvets (2018).

4.4 Patches and Augmentation

During prediction time, slices for each network are extracted with a center crop. When building the consensus activation volume, the resulting activation is padded back to the original size.

For training, this method uses patches. One of the strengths of the U-Net architecture is its ability to learn on patches and extend that knowledge to the evaluation of a full image, effectively working as a form of data augmentation. In this work, batches of random patches are used when training each network. Patches are randomly selected at runtime, not as pre-processing. Patches can have many possible sizes, as long as they accommodate the number of spatial resolution reductions present in the network (e.g. halving by a max pool).
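The divisibility constraint on patch sizes reduces to a one-line check; `valid_patch_size` is a hypothetical helper name of ours:

```python
def valid_patch_size(size, n_poolings):
    """A patch side length is compatible with the network when it stays
    integral through every 2x max-pool, i.e. it is divisible by
    2**n_poolings (the number of spatial resolution reductions)."""
    return size % (2 ** n_poolings) == 0
```

For instance, a side of 64 survives four halvings (64 → 32 → 16 → 8 → 4), while 50 does not.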

A pre-defined percentage of the patches are selected from a random point of the brain, allowing the network to learn which structures are not the hippocampus and are not close to it, such as the scalp, neck, eyes and brain ridges. Those are called negative patches, although they do not necessarily have a completely empty target, since they are random. On the other hand, positive patches are always centered on a random point on the hippocampus border.
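The positive/negative sampling strategy might look like the sketch below. For simplicity it picks any foreground voxel of a 2D mask as a positive centre, rather than a border point of a 3D volume as in the paper, and all names are ours:

```python
import random

def sample_patch_center(mask, positive_ratio=0.8, rng=random):
    """Pick a patch centre: with probability `positive_ratio`, a random
    foreground point of the hippocampus mask (a stand-in for a border
    point); otherwise a random point anywhere ("negative" patch).
    `mask` is a 2D list of 0/1 values; returns (row, col, is_positive)."""
    rows, cols = len(mask), len(mask[0])
    if rng.random() < positive_ratio:
        foreground = [(r, c) for r in range(rows)
                      for c in range(cols) if mask[r][c]]
        r, c = rng.choice(foreground)
        return r, c, True
    return rng.randrange(rows), rng.randrange(cols), False
```

An 80/20 ratio (the paper's choice, see Section 5.2) means roughly four in five training patches are centred on hippocampus.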

In a similar approach to Pereira et al. Pereira et al. (2019)’s Extended 2D, adjacent patches (slices at evaluation time) are included in the network’s input as additional channels (Figure 2). The intention is for the 2D network to take into consideration volumetric information adjacent to the region of interest, hence the name of the method, Extended 2D Consensus Hippocampus Segmentation (E2DHipseg). This approach is inspired by how physicians compare neighboring slices in multiview visualization when deciding whether a voxel is part of the analyzed structure or not.
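Stacking neighbouring slices as input channels can be sketched as follows; the clamping of edge slices to the border is our assumption, not stated in the paper:

```python
def e2d_input(volume, i):
    """Build the 3-channel Extended-2D input for slice i of `volume`
    (a list of 2D slices): the slice itself flanked by its two
    neighbours. Edge slices are clamped to the border (assumption)."""
    lo = max(i - 1, 0)
    hi = min(i + 1, len(volume) - 1)
    return [volume[lo], volume[i], volume[hi]]
```

The same stacking applies to training patches, where the neighbouring patches come from the adjacent slices at the same spatial location.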

Aug. | Chance (%) | Description
0    | 100        | Random patch selection, 80% positive / 20% negative
1    | 100        | Intensity modification by a value from the uniform distribution
2    | 20         | Rotation and scale by random values, in degrees and %, respectively
3    | 20         | Gaussian noise with 0 mean and fixed variance

Table 1: Description of the augmentations used in the experiments in Table 2, with the chance of random application during patch selection and a description of their parameters.

Deep Learning algorithms usually require a large and varied dataset to achieve generalization Shin et al. (2016). Manual segmentation by experts is used as the gold standard, but is often not available in sufficient quantity for training deep networks. Data augmentation is used to improve our dataset variance and avoid overfitting, an excessive bias toward the training data. Without augmentation, this method could overfit to MRI machine parameters, magnetic field intensity, field of view and so on. All augmentations perform a small random modification to the data, according to pre-defined parameters, at runtime, not as pre-processing. Alongside the use of random patches at runtime, the use of other transformations was tested, as seen in Table 1.

4.5 Loss Function

The choice of loss function plays an important role in Deep Learning methods, defining what the training process will optimize. When using a sigmoid output activation, Binary Cross Entropy (BCE), Mean Squared Error (MSE) and Dice Loss are examples of commonly used functions in the literature.

Dice Sudre et al. (2017) is an overlap metric widely used in the evaluation of segmentation applications. Performance in this paper is mainly evaluated with Dice, by comparison with the manual gold standard. Dice can be defined as:

Dice = \frac{2 \sum_{i}^{N} p_i g_i}{\sum_{i}^{N} p_i + \sum_{i}^{N} g_i}

where the sums run over the N voxels of the predicted binary segmentation volume p and the ground-truth binary volume g. To convert this metric into a loss function, one can simply optimize 1 - Dice, therefore optimizing a segmentation overlap metric. This is referred to here as Dice Loss.
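A minimal pure-Python version of the metric and loss above; the smoothing term `eps` is a common implementation detail to avoid division by zero, not something stated in the paper:

```python
def dice(pred, target, eps=1e-7):
    """Dice overlap between two flat binary lists:
    2 * |P ∩ G| / (|P| + |G|), with a small eps for stability."""
    inter = sum(p * g for p, g in zip(pred, target))
    total = sum(pred) + sum(target)
    return (2.0 * inter + eps) / (total + eps)

def dice_loss(pred, target):
    """Loss to minimise: 1 - Dice."""
    return 1.0 - dice(pred, target)
```

In training, the same formula is applied to the network's soft (sigmoid) outputs rather than binarized masks, which keeps the loss differentiable.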

To take into account background information, a Softmax of two-channels representing background and foreground can be used as an output. In this case, Generalized Dice Loss (GDL) Sudre et al. (2017) and Boundary Loss, a recent proposal of augmentation to GDL from Kervadec et al. Kervadec et al. (2019) were considered as loss options.

Generalized Dice Loss weights the loss value by the presence of a given label in the target, giving more importance to less frequent labels. This solves the class imbalance problem that would emerge when using Dice Loss with background included as a class.

Boundary Loss takes into consideration, alongside the “regional” loss (e.g. GDL), the distance between the boundaries of the prediction and the target, which does not give any weight to the area of the segmentation. Kervadec’s work suggests that a loss function that takes boundary distance information into account can improve results, especially for unbalanced datasets. However, one needs to balance the contribution of both components with a weight \alpha, as in the following Boundary Loss (B) equation:

B = \alpha G + (1 - \alpha) S

where G is GDL, the regional component of the loss function, and S is the surface component, which operates on surface distances. The weight factor \alpha changes from epoch to epoch: the weight given to the regional loss is shifted to the surface loss, with \alpha varying from 1 in the first epoch to 0 in the last epoch. We followed the original implementation in Kervadec et al. (2019).
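The epoch-dependent weighting can be sketched as below; the linear schedule is our reading of "varying from 1 in the first epoch to 0 in the last epoch", and the function name is ours:

```python
def boundary_loss(gdl, surface, epoch, n_epochs):
    """Boundary loss B = alpha*G + (1 - alpha)*S, with alpha decreasing
    linearly from 1 (first epoch, pure regional loss) to 0 (last epoch,
    pure surface loss). `gdl` and `surface` are the two loss values."""
    alpha = 1.0 - epoch / (n_epochs - 1)
    return alpha * gdl + (1.0 - alpha) * surface
```

At the midpoint of training both components contribute equally; Kervadec et al. (2019) discuss alternative rebalancing strategies.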

4.6 Consensus and Post-processing

The consensus depicted in Figure 2 consists of taking the average of the activations of all three CNNs. A more advanced approach, using a fourth, 3D U-Net as the consensus generator, was also attempted.

After constructing the consensus of activations, a threshold is needed to binarize the segmentation. While developing this methodology, we noticed that using patches, although improving generalization, resulted in small structures of the brain being wrongly recognized as hippocampus. To remove those false positives, a 3D labeling implementation from Dougherty and Lotufo (2003) was used, with subsequent removal of small non-connected volumes, keeping the two largest volumes, or one if a second volume is not present (Figure 2). This post-processing is performed after the average consensus of all networks and threshold application.
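The keep-the-largest-components step can be illustrated in 2D with a flood fill. The paper uses a 3D labeling implementation (Dougherty and Lotufo, 2003), so this is only an analogous sketch on a 2D binary mask with 4-connectivity:

```python
from collections import deque

def keep_largest_components(mask, keep=2):
    """Label the connected components of a 2D binary mask
    (4-connectivity) and zero out everything but the `keep` largest,
    mimicking the false-positive removal step in 2D."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    comps = []
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] and not seen[r][c]:
                comp, queue = [], deque([(r, c)])
                seen[r][c] = True
                while queue:  # breadth-first flood fill
                    y, x = queue.popleft()
                    comp.append((y, x))
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and mask[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
                comps.append(comp)
    comps.sort(key=len, reverse=True)
    out = [[0] * cols for _ in range(rows)]
    for comp in comps[:keep]:
        for y, x in comp:
            out[y][x] = 1
    return out
```

In the real pipeline the same idea runs in 3D with the two largest components corresponding to the left and right hippocampi.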

5 Experiments and Results

In this section, experiments on the segmentation methodology are presented, displaying differences in Dice in the HarP test set, resulting from our methodological choices. Following that, quantitative and qualitative comparisons with other methods in HarP and HCUnicamp are presented.

Figure 4: Validation and training Dice for all models, using (a) ADAM and (b) RADAM, both with the same hyperparameters and no LR stepping. Early stopping is due to patience. RADAM displays more stability.

5.1 Training: Optimizers, Learning Rate and Scheduling

Training hyperparameters are the same for all networks. Regarding the optimizer and initial learning rate (LR), grid search found that an LR of 0.0001 with ADAM Kingma and Ba (2014) and 0.005 with SGD Bengio et al. (2015) deliver similar performance. The recent RADAM from Liu et al. Liu et al. (2019), with 0.001 initial LR, ended up being the optimizer of choice, due to improved training stability and results (see Figure 4). LR reduction scheduling is used, with multiplication by 0.1 after 250 epochs; its impact is showcased in Figure 5(a). While training on HarP with an 80% hold-out training set, an epoch consisted of going through around 5000 sagittal, 4000 coronal or 3000 axial random patches extracted from slices containing hippocampus, depending on which network is being trained, with a batch size of 200. The maximum number of epochs allowed is 1000, with early stopping after 200 epochs without validation improvement. Weights are only saved for the best validation Dice.
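The stopping and LR schedule above can be sketched by replaying a per-epoch validation curve; `train_schedule` is a hypothetical helper, and its return signature is ours:

```python
def train_schedule(val_dices, max_epochs=1000, patience=200,
                   step_epoch=250, lr0=0.001):
    """Replay a sequence of per-epoch validation Dice scores through the
    paper's schedule: LR multiplied by 0.1 at `step_epoch`, and training
    stopped after `patience` epochs without validation improvement.
    Returns (best_dice, best_epoch, stop_epoch, final_lr)."""
    best, best_epoch, lr = -1.0, -1, lr0
    for epoch, d in enumerate(val_dices[:max_epochs]):
        if epoch == step_epoch:
            lr *= 0.1
        if d > best:
            best, best_epoch = d, epoch  # checkpoint weights here
        elif epoch - best_epoch >= patience:
            return best, best_epoch, epoch, lr  # patience exhausted
    return best, best_epoch, min(len(val_dices), max_epochs) - 1, lr
```

A run whose validation Dice peaks early therefore stops exactly `patience` epochs after its best epoch, keeping the best-epoch weights.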

5.2 Hyperparameter Experiments

Some of the most important hyperparameter experiments are presented in Table 2. Results from each change in methodology or architecture were calculated using the full consensus outlined in Figure 2; in other words, all three networks are trained with the same parameters and Dice is calculated after consensus and post-processing. For these experiments, an 80/20% hold-out on HarP was used, keeping Alzheimer’s labels balanced. Reported Dice is the mean over the 20% test set.

Optimizer | LR     | Loss      | Aug.       | HarP (Dice)
SGD       | 0.005  | Dice Loss | 0, 1, 2, 3 | 0.8748
SGD       | 0.005  | Dice Loss | 0          | 0.8760
ADAM      | 0.0001 | Dice Loss | 0          | 0.8809
ADAM      | 0.0001 | Dice Loss | 0, 1       | 0.8820
ADAM      | 0.0001 | Dice Loss | 0, 2       | 0.8827
ADAM      | 0.0001 | Dice Loss | 0, 3       | 0.8832
ADAM      | 0.0001 | GDL       | 0          | 0.8830
ADAM      | 0.0001 | GDL       | 0, 1, 2, 3 | 0.8862
ADAM      | 0.0001 | Boundary  | 0          | 0.9068
RADAM     | 0.0001 | Boundary  | 0          | 0.9071
RADAM     | 0.001  | Boundary  | 0, 1, 2, 3 | 0.9117
RADAM     | 0.001  | Boundary  | 0          | 0.9133

Table 2: Some of the most relevant hyperparameter experiment test results, in a hold-out approach on HarP. Aug. refers to which data augmentation strategies from Table 1 were used. The bolded results represent the final models used in the next section. All tests in this table use the same patch size and the modified U-Net architecture.
Figure 5: (a) Training and validation Dice curve for the best model, with RADAM and LR step. (b) Boxplot for HarP test models, showing the improvement in variance and mean Dice from the consensus compared to using only one network. In the individual network studies, post-processing is also applied to remove false positives.

Regarding modifications to the basic U-Net architecture, the addition of residual connections, batch normalization and encoder weight initialization improved convergence stability and reduced overfitting. VGG11 weights worked better than ResNet34 weights or Kaiming uniform initialization He et al. (2015). The use of random patches (Aug. 0) with neighbouring slices (E2D), instead of center-crop slices, also reduced overfitting, while increasing the number of false-positive activations, which are handled by post-processing.

For the patch selection strategy mentioned in Section 4.4, an 80/20% balance between positive and negative patches, respectively, resulted in better convergence and fewer false positives than a 50/50% balance. Early experiments compared three patch sizes. For the smallest, one fewer U-Net layer was used, with 3 max-pool/transposed convolutions instead of 4. Smaller patches resulted in less stable training, although final results for the two larger patch sizes were not significantly different; one of the two larger sizes was chosen as the patch size from here forward.

Using one-channel sigmoid activations as output, early experiments established Dice Loss as giving the best convergence and results, beating MSE and BCE. A softmax output with GDL achieved results similar to Dice Loss. However, a recent improvement to GDL in the form of Boundary Loss resulted in slightly better test Dice.

We found that augmentation techniques besides random patches only slightly impacted overlap results on HarP, sometimes even making test results worse. Augmentation’s most relevant impact was avoiding overfitting and, in some cases, very early stopping due to a lack of validation improvement, which led to unstable networks.

We found that, as empirically expected, the consensus of the results of the three networks reduces the variance of the final Dice, as seen in Figure 5(b). Early studies confirmed that 0.5 is the best threshold after activation averaging. Attempts at using a fourth 3D U-Net as a consensus generator/error correction phase did not change results significantly.
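Per voxel, the averaging-plus-threshold consensus reduces to a few lines; this sketch takes the three activation maps as flat lists:

```python
def consensus(sagittal, coronal, axial, threshold=0.5):
    """Average the per-voxel activations of the three orientation-
    specialised networks (given as flat lists of floats in [0, 1])
    and binarise at `threshold` (0.5 in the paper)."""
    return [1 if (a + b + c) / 3.0 > threshold else 0
            for a, b, c in zip(sagittal, coronal, axial)]
```

In the full pipeline this binary volume then goes through the connected-component post-processing of Section 4.6.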

5.3 Quantitative Results

In this section, we report quantitative results of our method and others from the literature on both HarP and HCUnicamp. For comparison’s sake, we also trained an off-the-shelf 3D U-Net architecture from Isensee et al. Isensee et al. (2017), originally a brain tumor segmentation work, with ADAM and HarP center crops as input.

For the evaluation of the QuickNat Roy et al. (2019) method, volumes and targets needed to be conformed to its required format, causing interpolation. As far as we know, the method does not have a way to return its predictions in the volume’s original space, so Dice was calculated with the masks in the conformed space. Note that QuickNat performs segmentation of multiple brain structures.

5.3.1 HarP

Deep Learning methods                | HarP (Dice)
3D U-Net - Isensee et al. (2017)     | 0.86
Hippodeep - Thyreau et al. (2018)    | 0.85
QuickNat - Roy et al. (2019)         | 0.80
Ataloglou et al. (2019)              | 0.90*
E2DHipseg (this work)                | 0.90*
Atlas-based methods                  |
FreeSurfer v6.0 - Fischl (2012)      | 0.70
Chincarini et al. (2016)             | 0.85
Platero and Tobar (2017)             | 0.85
Table 3: Reported testing results for HarP. This work is named E2DHipseg. Results with * were calculated following a 5-fold cross validation.

The best hold-out mean Dice is . Regarding specific Alzheimer’s classes in the test set, our method achieves Dice for CN, for MCI and for AD cases. When using a hold-out approach on a relatively small dataset such as HarP, the model can overfit to that specific test set. With that in mind, we also report results with cross validation: 5-fold training and testing is used, where all three networks are trained and tested on each fold. With 5-fold cross validation our model achieved Dice. Results reported by other works are presented in Table 3. Our methodology has similar performance to Ataloglou et al.’s recent, simultaneous work Ataloglou et al. (2019). Interestingly, the initial methodology of both methods is similar, in the use of multiple 2D CNNs.

5.3.2 HCUnicamp

HCUnicamp (Controls)
Method                              Both (Dice)   Left (Dice)   Right (Dice)   Precision   Recall
3D U-Net - Isensee et al. (2017)
Hippodeep - Thyreau et al. (2018)
QuickNat - Roy et al. (2019)
E2DHipseg without Aug.
E2DHipseg with Aug.

HCUnicamp (Patients)
3D U-Net - Isensee et al. (2017)
Hippodeep - Thyreau et al. (2018)
QuickNat - Roy et al. (2019)
E2DHipseg without Aug.
E2DHipseg with Aug.

Table 4: Locally executed testing results for HCUnicamp. All 190 volumes from the dataset are included, and no model saw them in training. The 3D U-Net here uses the same weights as in Table 3. QuickNat performs whole-brain multi-task segmentation, not only hippocampus segmentation.

As described previously, the HCUnicamp dataset lacks one of the hippocampi in many of its scans (Figure 1), and it was used to examine the generalization capability of these methods. Table 4 reports mean and standard deviation of Dice over all HCUnicamp volumes, using both masks or only the left or right mask, for multiple methods. "with Aug." refers to the use of augmentations 1, 2 and 3 in training, in addition to 0. We also report per-voxel Precision and Recall, where positives are hippocampus voxels and negatives are non-hippocampus voxels. Precision is defined by TP/(TP + FP) and Recall by TP/(TP + FN), where TP are true positives, FP are false positives and FN are false negatives. All tests were run locally. Unfortunately, we were not able to reproduce Ataloglou et al.'s method for local testing.
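These per-voxel metrics follow directly from the binary prediction and target masks. The sketch below computes them with NumPy on a toy 1D example; the helper name is ours, not from the paper's code.

```python
import numpy as np

def voxel_metrics(pred, target):
    """Per-voxel precision, recall and Dice for binary masks,
    where positives are hippocampus voxels."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    tp = np.logical_and(pred, target).sum()   # true positives
    fp = np.logical_and(pred, ~target).sum()  # false positives
    fn = np.logical_and(~pred, target).sum()  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    dice = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 1.0
    return precision, recall, dice

# Toy example: 3 TP, 1 FP, 1 FN.
pred = np.array([1, 1, 1, 1, 0, 0])
target = np.array([1, 1, 1, 0, 1, 0])
p, r, d = voxel_metrics(pred, target)
# precision = 3/4, recall = 3/4, Dice = 6/8 = 0.75
```

Note that Dice can equivalently be written as the harmonic mean of precision and recall, which is why a mask full of false positives drags all three metrics down together.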

Our method performed better than other recent methods from the literature on the HCUnicamp dataset, even though HCUnicamp was not involved in our methodology's development. However, no method was able to achieve more than 0.8 mean Dice on Epilepsy patients. The high number of false positives due to hippocampus removal is evident in the low left and right Dice and the low precision. The impact of the additional augmentations was not statistically significant in the Epilepsy domain.

Figure 6: Multiview and 3D render of our (a) best and (b) worst cases in the HarP test set. Prediction in green, target in red and overlap in purple.

Our method takes around 15 seconds per volume on a mid-range GPU and 3 minutes on a consumer CPU. All the code used in its development is available at github.com/dscarmo/e2dhipseg, with instructions on how to run it on an input volume. A free executable version for medical research use, without environment setup requirements, is in development and will be available on the repository soon. To avoid problems with different head orientations, there is an option to use MNI152 registration when predicting on a given volume. Even when registration is performed, the output mask is returned in the input volume's space, using the inverse transform. Regarding pre-processing requirements, our method requires only that the volume be in the correct orientation; this can be achieved with rigid registration, provided as an option, in a similar way to Hippodeep. A GPU is recommended for faster prediction, but is not necessary.
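The inverse-transform step, mapping a mask predicted in registered (e.g. MNI) voxel space back to the input volume's grid, can be sketched as a nearest-neighbour resampling with the inverse affine. This is an illustrative sketch only: the 4x4 voxel-to-voxel affine and the toy 1-voxel translation below are assumptions, not the paper's actual registration.

```python
import numpy as np
from scipy.ndimage import affine_transform

def mask_to_native(mni_mask, native_to_mni, native_shape):
    """Resample a binary mask from registered voxel space back to the
    input volume's voxel space. `native_to_mni` is the 4x4 voxel-to-voxel
    affine of the rigid registration; nearest-neighbour interpolation
    (order=0) keeps the mask binary."""
    # affine_transform computes output[o] = input[matrix @ o + offset],
    # i.e. it maps each native voxel coordinate into registered space.
    matrix = native_to_mni[:3, :3]
    offset = native_to_mni[:3, 3]
    return affine_transform(mni_mask, matrix, offset=offset,
                            output_shape=native_shape, order=0)

# Toy check: a pure 1-voxel translation along the first axis.
mni = np.zeros((4, 4, 4), dtype=np.uint8)
mni[2, 1, 1] = 1
A = np.eye(4)
A[0, 3] = 1.0  # native voxel v maps to registered voxel v + (1, 0, 0)
native = mask_to_native(mni, A, (4, 4, 4))
# The labeled voxel lands at native index (1, 1, 1).
```

Using order=0 matters here: higher-order interpolation would produce fractional values that are no longer a valid binary mask.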

5.4 Qualitative Results

While visually inspecting HarP results, we found very low variance, with no heavy outliers. This is indicated by the low deviation in the consensus boxplot in Figure 5(b) and by the best and worst segmentations in Figure 6. Other methods present similarly stable results.

In HCUnicamp, however, many more errors are visible in the worst segmentations (Figure 7(b)), especially where the hippocampus was removed. Other methods show similar results, with false positives in voxels where the hippocampus would be in a healthy subject or Alzheimer's patient. As expected, the best segmentation, displayed in Figure 7(a), was in a healthy control subject.

Figure 7: Multiview and 3D render of our (a) best and (b) worst cases in the HCUnicamp dataset. Prediction in green, target in red and overlap in purple.

6 Discussion

Regarding the consensus approach of our method, most of the false positives that individual networks produce are eliminated by the averaging of activations followed by thresholding and post-processing. This allows the methodology to focus on good segmentation in the hippocampus area, without worrying about small false positives in other areas of the brain. We also observed that in some cases one of the networks fails and the other two "save" the result; this is visible in the outliers in Figure 5(b).
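A minimal sketch of this consensus step is shown below, assuming a 0.5 threshold and a hypothetical minimum-component size for the post-processing (both are our assumptions, not the paper's exact parameters):

```python
import numpy as np
from scipy import ndimage

def consensus(prob_maps, threshold=0.5, min_size=5):
    """Average the activation maps of the networks, threshold, and
    drop small connected components as post-processing."""
    mean_act = np.mean(prob_maps, axis=0)  # voxel-wise average of activations
    binary = mean_act > threshold
    labels, n = ndimage.label(binary)      # connected components
    if n == 0:
        return binary
    sizes = ndimage.sum(binary, labels, index=np.arange(1, n + 1))
    keep_labels = np.flatnonzero(sizes >= min_size) + 1
    return np.isin(labels, keep_labels)    # only sufficiently large components

# Toy example: a 3x3 blob agreed on by all maps, plus a lone spurious voxel.
maps = np.zeros((3, 8, 8))
maps[:, 1:4, 1:4] = 0.9   # consistent hippocampus-like region
maps[:, 6, 6] = 0.9       # tiny false positive
out = consensus(maps)
# out keeps the 9-voxel blob and removes the single-voxel component
```

Because a false positive usually appears in only one network's activation, averaging pushes it below the threshold; the component-size filter catches the rare cases where it survives.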

The fact that patches are randomly selected and augmented at runtime means they are mostly not repeated in different epochs. This differs from building a large pre-processed dataset of augmented patches. We believe this random variation during training is very important to ensure the network keeps seeing different data in different epochs, improving generalization. The idea is similar to the Dropout technique Srivastava et al. (2014), but applied to data instead of weights. Even with all this randomness, re-runs of the same experiment resulted mostly in the same final results, within 0.01 mean Dice of each other.
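The on-the-fly sampling can be sketched as below; the patch size and the flip augmentation are illustrative assumptions, not the paper's exact parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_patch(volume_slice, patch_size=32):
    """Draw a fresh random patch location and a random flip each call,
    so different epochs rarely see identical inputs."""
    h, w = volume_slice.shape
    y = rng.integers(0, h - patch_size + 1)
    x = rng.integers(0, w - patch_size + 1)
    patch = volume_slice[y:y + patch_size, x:x + patch_size]
    if rng.random() < 0.5:          # random horizontal flip
        patch = patch[:, ::-1]
    return patch

slice_ = rng.random((128, 128))
p1 = random_patch(slice_)
p2 = random_patch(slice_)
# Two successive draws are almost surely different views of the slice.
```

In a training loop this function would sit inside the data loader, so augmentation cost overlaps with GPU computation instead of inflating the stored dataset.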

Interestingly, our method achieved better Dice on test scans of Alzheimer's patients than of control subjects. This suggests that our method is learning the Alzheimer's-related atrophies present in the HarP dataset and is able to adapt to them.

As visible in the results of multiple methods, Dice on the HCUnicamp dataset is not on the same level as on the public benchmark. Most methods produce false positives in the removed-hippocampus area, in a similar fashion to Figure 7(b). The fact that QuickNat and Hippodeep have separate outputs for the left and right hippocampus does not seem to be enough to solve this problem. We believe the high false positive rate is due to textures similar to the hippocampus remaining in the hippocampal area after its removal. This could possibly be addressed with a preliminary hippocampus presence detection phase.

7 Conclusion

This paper presented a hippocampus segmentation method combining a consensus of multiple U-Net based CNNs with traditional post-processing, successfully using a recent optimizer and loss function from the literature. The method achieves state-of-the-art performance on the public HarP hippocampus segmentation benchmark. We raised the hypothesis that current automatic hippocampus segmentation methods, including our own, would not perform as well on our in-house Epilepsy dataset, which includes cases of hippocampal resection. Quantitative and qualitative results show that these methods fail to take hippocampus removal into account in unseen data, raising the concern that current automatic hippocampus segmentation methods are not robust to outliers such as those shown in this paper. In future work, our method can be improved to detect hippocampal removal as a pre-processing step, using part of HCUnicamp as training data.

Conflict of Interest Statement

We have no conflicts of interest to declare.


Acknowledgments

We thank FAPESP for funding this research under grant 2018/00186-0, our partners at BRAINN (FAPESP 2013/07559-3 and FAPESP 2015/10369-7) for letting us use their dataset in this research, and CNPq for research funding, process numbers 310828/2018-0 and 308311/2016-7.


References
  • P. Andersen (2007) The hippocampus book. Oxford University Press. Cited by: §1.
  • D. Ataloglou, A. Dimou, D. Zarpalas, and P. Daras (2019) Fast and precise hippocampus segmentation through deep convolutional neural network ensembles and transfer learning. Neuroinformatics, pp. 1–20. Cited by: §2, §2, §2, §5.3.1, Table 3.
  • Y. Bengio, I. J. Goodfellow, and A. Courville (2015) Deep learning. Nature 521, pp. 436–444. Cited by: §5.1.
  • M. Boccardi, M. Bocchetta, F. C. Morency, D. L. Collins, M. Nishikawa, R. Ganzola, M. J. Grothe, D. Wolf, A. Redolfi, M. Pievani, et al. (2015) Training labels for hippocampal segmentation based on the eadc-adni harmonized hippocampal protocol. Alzheimer’s & Dementia 11 (2), pp. 175–183. Cited by: §2, §3.1.
  • M. Brett, K. Christoff, R. Cusack, J. Lancaster, et al. (2001) Using the talairach atlas with the mni template. Neuroimage 13 (6), pp. 85–85. Cited by: §3.
  • D. Carmo, B. Silva, C. Yasuda, L. Rittner, and R. Lotufo (2019a) Extended 2d volumetric consensus hippocampus segmentation. arXiv preprint arXiv:1902.04487. Cited by: §2.
  • D. Carmo, B. Silva, C. Yasuda, L. Rittner, and R. Lotufo (2019b) Extended 2d volumetric consensus hippocampus segmentation. pp. na. Cited by: §1, §2.
  • Y. Chen, B. Shi, Z. Wang, P. Zhang, C. D. Smith, and J. Liu (2017) Hippocampus segmentation through multi-view ensemble convnets. In 2017 IEEE 14th International Symposium on Biomedical Imaging (ISBI 2017), pp. 192–196. Cited by: §2, §2.
  • A. Chincarini, F. Sensi, L. Rei, G. Gemme, S. Squarcia, R. Longo, F. Brun, S. Tangaro, R. Bellotti, N. Amoroso, et al. (2016) Integrating longitudinal information in hippocampal volume measurements for the early detection of alzheimer’s disease. NeuroImage 125, pp. 834–847. Cited by: §2, Table 3.
  • N. K. Dinsdale, M. Jenkinson, and A. I. Namburete (2019) Spatial warping network for 3d segmentation of the hippocampus in mr images. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 284–291. Cited by: §2, §2.
  • E. R. Dougherty and R. A. Lotufo (2003) Hands-on morphological image processing. Vol. 59, SPIE press. Cited by: §4.6.
  • B. Fischl (2012) FreeSurfer. Neuroimage 62 (2), pp. 774–781. Cited by: §1, §2, §2, Table 3.
  • E. Ghizoni, J. Almeida, A. F. Joaquim, C. L. Yasuda, B. M. de Campos, H. Tedeschi, and F. Cendes (2015) Modified anterior temporal lobectomy: anatomical landmarks and operative technique. Journal of Neurological Surgery Part A: Central European Neurosurgery 76 (05), pp. 407–414. Cited by: §1, §3.2.
  • E. Ghizoni, R. N. Matias, S. Lieber, B. M. de Campos, C. L. Yasuda, P. C. Pereira, A. C. S. Amato Filho, A. F. Joaquim, T. M. Lopes, H. Tedeschi, et al. (2017) Clinical and imaging evaluation of transuncus selective amygdalohippocampectomy. World neurosurgery 100, pp. 665–674. Cited by: §1, §3.2.
  • K. He, X. Zhang, S. Ren, and J. Sun (2015) Delving deep into rectifiers: surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, pp. 1026–1034. Cited by: §5.2.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §4.1, §4.2.
  • J. E. Iglesias and M. R. Sabuncu (2015) Multi-atlas segmentation of biomedical images: a survey. Medical image analysis 24 (1), pp. 205–219. Cited by: §2.
  • V. Iglovikov and A. Shvets (2018) TernausNet: u-net with vgg11 encoder pre-trained on imagenet for image segmentation. arXiv preprint arXiv:1801.05746. Cited by: §4.3.
  • S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167. Cited by: §4.1.
  • F. Isensee, P. Kickingereder, W. Wick, M. Bendszus, and K. H. Maier-Hein (2017) Brain tumor segmentation and radiomics survival prediction: contribution to the brats 2017 challenge. In International MICCAI Brainlesion Workshop, pp. 287–297. Cited by: §1, §2, §2, §5.3, Table 3, Table 4.
  • H. Kervadec, J. Bouchtiba, C. Desrosiers, E. Granger, J. Dolz, and I. Ben Ayed (2019) Boundary loss for highly unbalanced segmentation. In Proceedings of The 2nd International Conference on Medical Imaging with Deep Learning, M. J. Cardoso, A. Feragen, B. Glocker, E. Konukoglu, I. Oguz, G. Unal, and T. Vercauteren (Eds.), Proceedings of Machine Learning Research, Vol. 102, London, United Kingdom, pp. 285–296. External Links: Link Cited by: §4.5, §4.5.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §5.1.
  • A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105. Cited by: §2.
  • S. K. Kumar (2017) On weight initialization in deep neural networks. arXiv preprint arXiv:1704.08863. Cited by: §4.3.
  • L. Liu, H. Jiang, P. He, W. Chen, X. Liu, J. Gao, and J. Han (2019) On the variance of the adaptive learning rate and beyond. arXiv preprint arXiv:1908.03265. Cited by: §5.1.
  • O. Lucena, R. Souza, L. Rittner, R. Frayne, and R. Lotufo (2018) Silver standard masks for data augmentation applied to deep-learning-based skull-stripping. In 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018), pp. 1114–1117. Cited by: §2.
  • C. S. McCarthy, A. Ramprashad, C. Thompson, J. Botti, I. L. Coman, and W. R. Kates (2015) A comparison of freesurfer-generated data with and without manual intervention. Frontiers in neuroscience 9, pp. 379. Cited by: §1.
  • N. Nogovitsyn, R. Souza, M. Muller, A. Srajer, S. Hassel, S. R. Arnott, A. D. Davis, G. B. Hall, J. K. Harris, M. Zamyadi, et al. (2019) Testing a deep convolutional neural network for automated hippocampus segmentation in a longitudinal sample of healthy participants. NeuroImage 197, pp. 589–597. Cited by: §2.
  • M. Pereira, R. Lotufo, and L. Rittner (2019) An extended-2d cnn approach for diagnosis of alzheimer’s disease through structural mri. In Proceedings of the 27th Annual Meeting of ISMRM 2019., pp. na. Cited by: §4.1, §4.4.
  • R. C. Petersen, P. Aisen, L. A. Beckett, M. Donohue, A. Gamst, D. J. Harvey, C. Jack, W. Jagust, L. Shaw, A. Toga, et al. (2010) Alzheimer’s disease neuroimaging initiative (adni): clinical characterization. Neurology 74 (3), pp. 201–209. Cited by: §1, §2, §3.1.
  • J. Pipitone, M. T. M. Park, J. Winterburn, T. A. Lett, J. P. Lerch, J. C. Pruessner, M. Lepage, A. N. Voineskos, M. M. Chakravarty, A. D. N. Initiative, et al. (2014) Multi-atlas segmentation of the whole hippocampus and subfields using multiple automatically generated templates. Neuroimage 101, pp. 494–512. Cited by: §2.
  • C. Platero and M. C. Tobar (2017) Combining a patch-based approach with a non-rigid registration-based label fusion method for the hippocampal segmentation in alzheimer’s disease. Neuroinformatics 15 (2), pp. 165–183. Cited by: §2, Table 3.
  • O. Ronneberger, P. Fischer, and T. Brox (2015) U-net: convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 234–241. Cited by: §2, §2, §4.1.
  • A. G. Roy, S. Conjeti, N. Navab, C. Wachinger, A. D. N. Initiative, et al. (2019) QuickNAT: a fully convolutional network for quick and accurate segmentation of neuroanatomy. NeuroImage 186, pp. 713–727. Cited by: §1, §2, §2, §2, §5.3, Table 3, Table 4.
  • H. Shin, H. R. Roth, M. Gao, L. Lu, Z. Xu, I. Nogues, J. Yao, D. Mollura, and R. M. Summers (2016) Deep convolutional neural networks for computer-aided detection: cnn architectures, dataset characteristics and transfer learning. IEEE transactions on medical imaging 35 (5), pp. 1285–1298. Cited by: §4.4.
  • K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §4.3.
  • N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15 (1), pp. 1929–1958. Cited by: §6.
  • C. H. Sudre, W. Li, T. Vercauteren, S. Ourselin, and M. J. Cardoso (2017) Generalised dice overlap as a deep learning loss function for highly unbalanced segmentations. In Deep learning in medical image analysis and multimodal learning for clinical decision support, pp. 240–248. Cited by: §2, §4.5, §4.5.
  • B. Thyreau, K. Sato, H. Fukuda, and Y. Taki (2018) Segmentation of the hippocampus by transferring algorithmic knowledge for large cohort processing. Medical image analysis 43, pp. 214–228. Cited by: §1, §2, §2, §2, Table 3, Table 4.
  • C. Wachinger, M. Reuter, and T. Klein (2018) DeepNAT: deep convolutional neural network for segmenting neuroanatomy. NeuroImage 170, pp. 434–445. Cited by: §2, §2.
  • H. Wang, J. W. Suh, S. R. Das, J. B. Pluta, C. Craige, and P. A. Yushkevich (2013) Multi-atlas segmentation with joint label fusion. IEEE transactions on pattern analysis and machine intelligence 35 (3), pp. 611–623. Cited by: §2.
  • Z. Xie and D. Gillies (2018) Near real-time hippocampus segmentation using patch-based canonical neural network. arXiv preprint arXiv:1807.05482. Cited by: §2, §2.