Extended 2D Volumetric Consensus Hippocampus Segmentation

02/12/2019 ∙ by Diedre Carmo, et al. ∙ University of Campinas

Hippocampus segmentation plays a key role in diagnosing various brain disorders such as Alzheimer's disease, epilepsy, multiple sclerosis, cancer, depression and others. Nowadays, segmentation is still mainly performed manually by specialists. Segmentation done by experts is considered the gold standard when evaluating automated methods, but it is a time-consuming and arduous task that requires specialized personnel. In recent years, efforts have been made to achieve reliable automated segmentation. For years the best performing automatic methods were multi-atlas based, with around 90% Dice coefficient and very long execution times, but machine learning methods have recently emerged with promising time and accuracy performance. A method for volumetric hippocampus segmentation is presented, based on the consensus of tri-planar U-Net inspired fully convolutional networks (FCNNs), with some modifications, including residual connections, VGG weight transfers, batch normalization and a patch extraction technique employing data from neighbor patches. A study on the impact of our modifications to the classical U-Net architecture was performed. Our method achieves cutting-edge performance on our dataset, with around 96% volumetric Dice on our test data and GPU execution time in the order of seconds per volume. Also, masks are shown to be similar to those of other recent state-of-the-art hippocampus segmentation methods.





1 Introduction

Hippocampus segmentation is important in the diagnosis and treatment of many brain disorders, such as Alzheimer’s disease. The hippocampus plays an important role in long- and short-term memory, and it is often reduced in size and altered in shape when affected by disease [Andersen (2007)]. In some cases of epilepsy, surgical intervention is necessary [Yasuda et al. (2010)], and brain MRIs are often used to help in the planning phase. Standard procedure in these cases is to perform an MRI scan of the brain and have experts analyze the shape of the hippocampus. To this day, manual segmentation remains the gold standard, even though inter-rater variability is a concern [Souza et al. (2018a)]. Manual segmentation is also time consuming and must be performed by specialized personnel.

For some time, the state of the art in automated hippocampus segmentation consisted mainly of methods with execution times in the order of hours per volume and around 0.9 Dice [Duarte et al. (1999)]. Until recently, the most successful methods used the multi-atlas approach [Wang et al. (2013)], [Iglesias and Sabuncu (2015)], [Pipitone et al. (2014)]. In this approach, multiple expert-segmented example images, called atlases, are registered to a target image, and the deformed atlas segmentations are combined using label fusion [Sabuncu et al. (2010)]. The main drawback of these methods is the time it takes to perform a segmentation, in the order of hours per volume. One notable example is FreeSurfer [Fischl (2012)], a collection of methods for full brain segmentation that physicians currently use to aid segmentation, but which takes hours to segment a volume.

Our work is motivated by the need to reduce the computation time of automatic hippocampus segmentation, using a different approach, namely Convolutional Neural Networks (CNNs). Recently, some works have also attempted to use CNNs with promising performance [Wachinger et al. (2018), Thyreau et al. (2018), Xie and Gillies (2018)]. Our method consists of evaluating the consensus of volumes generated by three separate Extended 2D (Section 4.1) U-Net like [Ronneberger et al. (2015)] FCNNs, with encoders initialized with VGG11 weights [Simonyan and Zisserman (2014)] and residual connections [He et al. (2016)]. One network is trained for each brain orientation: sagittal, coronal and axial. Our main contribution is a lightweight method with a 200MB memory footprint and an execution time of 15 seconds per volume on a mid-range GPU. This method achieves state-of-the-art segmentation performance of 96% Dice in our test set and employs CNN design ideas and learned knowledge from various works in deep learning, with results visually comparable to other recent hippocampus segmentation methods.

2 Related Work

Xie and Gillies (2018) used a small Convolutional Neural Network (CNN) architecture focused on providing a fast method, speed being one of the main advantages of Deep Learning in comparison to previous works. The work focused not only on fast prediction, but also on low memory usage and fast training time, which can be difficult to accomplish with CNNs. The authors used 2D patches from all three MRI orientations, fed to a single model that predicts the classification of a single voxel.

The method of Thyreau et al. (2018), named Hippodeep, is more similar to this work in that it uses Fully Convolutional Neural Networks trained on a region of interest (ROI). However, whereas we apply one FCNN for each plane of view, Thyreau et al. use a single FCNN that starts with a planar analysis followed by layers of 3D convolutions and shortcut connections. 3D FCNNs are known to be computationally intensive to train because of their large number of parameters, and they require large amounts of data. Their study used more than 2000 patients, augmented to around 10000 volumes. The model is initially trained with FreeSurfer segmentations and later fine-tuned using volumes for which the authors had access to manual segmentations, the gold standard. In our experiments, Hippodeep was used for a qualitative analysis of our work.

The method of Wachinger et al. (2018), named DeepNat, is a whole-brain segmentation method that segments all structures of the brain, including the hippocampus, with around 90% Dice. It uses 3D-patch CNNs to classify voxels and their neighbours, with a multi-task learning strategy. Patches are augmented with coordinates, and a novel brain parametrization strategy is presented to avoid the initial registration problem. Two 3D CNNs are used: the first separates background from foreground, and the second segments the structures in the foreground.

Although not a hippocampus segmentation work, Lucena et al. (2018) inspired our consensus strategy, which involves multiple FCNNs performing segmentation over different MRI orientations, merged into a single final volume. While their method uses another network to produce the final consensus, in our post processing we simply add the activation heatmaps of the FCNNs, apply a pre-defined threshold, and perform 3D labeling to eliminate all but the two largest connected components, which is shown to improve performance significantly.


Figure 1: A coronal slice from a sample of our data. The manual segmentation is shown in blue and the corresponding slice of the mask produced by our method in green.

3 Data and Ethics

The main, currently private dataset used in this work was collected by medical personnel from the Brazilian Institute of Neuroscience and Neurotechnology (BRAINN) and Hospital HC-UNICAMP. Our dataset contains 214 MNI registered T1 weighted MRI acquisitions made at HC (Figure 1). Almost one third of the acquisitions (66) are from patients in whom one side of the hippocampus was modified or removed during surgical treatment of epilepsy. The dataset was originally collected to study volumetry of the hippocampus post surgery.

For this study, a hold-out split was employed, with 80% of the volumes for training, 10% for validation and 10% for testing. The only pre-processing applied to our dataset volumes was min-max normalization of int16 values to float16, between 0 and 1. All MRIs have manually segmented masks of the principal regions of the brain, including the hippocampus. Patients involved in the data acquisition were volunteers and signed consent terms. This research is done in partnership with BRAINN, and this dataset was already used in previous BRAINN research with approval of the UNICAMP Medical Sciences School Ethics and Research Committee (under CEP 1191/2011).
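The pre-processing and split described above can be sketched in a few lines. This is only an illustration in plain Python: the function names, the shuffling seed and the truncating split rule are assumptions, and a real pipeline would operate on array volumes instead of flat lists.

```python
import random


def minmax_normalize(values):
    """Rescale a flat sequence of intensities to floats in [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant input: avoid division by zero (assumed convention).
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def holdout_split(ids, train_frac=0.8, val_frac=0.1, seed=0):
    """Shuffle subject ids and split them into train/validation/test lists."""
    ids = list(ids)
    random.Random(seed).shuffle(ids)
    n_train = int(train_frac * len(ids))  # truncation is an assumption
    n_val = int(val_frac * len(ids))
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]


train, val, test = holdout_split(range(214))
print(len(train), len(val), len(test))
```

With 214 ids this particular rounding rule yields 171, 21 and 22 volumes, which agrees with the 22 test volumes reported later in the text.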

3.1 CC-359

As a visual validation and comparison to Hippodeep, we used CC-359, a public dataset with 359 volumes acquired on 1.5T and 3T Siemens and Philips MRI machines [Souza et al. (2018b)]. Contrary to the training data, the volumes are not registered, and they vary in magnetic field strength and in the position of the hippocampus, which is translated or slightly rotated relative to MNI registration. This is useful to show whether our model is overfitting to our MNI registered, better behaved training data.


Figure 2: An outline of our method. An input volume is analyzed in all three orientations by FCNNs, each trained on patches from its orientation. Analysis is done on 2D slices and the results are concatenated into a single volume. Following that, our consensus approach and post processing are applied, outputting the final volumetric segmentation.

4 Methodology

Our analysis consists of three FCNNs examining the brain from the three possible orientations, slice by slice, followed by a consensus that merges the three volumes generated by the networks (Figure 2). Neighbouring slices are also taken into account in the prediction. The inspiration for this methodology came from the way physicians analyze MRI, using neighbour slices around the point of interest, visualized in all three orientations. Each volume segmentation is constructed by running a network over every slice in the orientation it was trained in. In the following sections, the inner workings of our method are described in more detail, from architecture to final post processing and consensus strategy.
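The slice-by-slice construction of one orientation's volume can be sketched as follows. This is a minimal illustration under stated assumptions: the volume is a nested list indexed as volume[z][y][x], the mapping of axes 0, 1 and 2 to anatomical orientations is left open, and `segment_slice` is a hypothetical stand-in for a trained 2D network.

```python
def take_slice(volume, axis, index):
    """Return the 2D slice at `index` along `axis` of volume[z][y][x]."""
    if axis == 0:
        return [row[:] for row in volume[index]]
    if axis == 1:
        return [plane[index][:] for plane in volume]
    if axis == 2:
        return [[row[index] for row in plane] for plane in volume]
    raise ValueError("axis must be 0, 1 or 2")


def segment_volume(volume, axis, segment_slice):
    """Apply `segment_slice` to every slice along `axis` and restack the results."""
    depth, height, width = len(volume), len(volume[0]), len(volume[0][0])
    out = [[[0.0] * width for _ in range(height)] for _ in range(depth)]
    n_slices = (depth, height, width)[axis]
    for i in range(n_slices):
        result = segment_slice(take_slice(volume, axis, i))
        for a, row in enumerate(result):
            for b, value in enumerate(row):
                # Write the 2D result back at its original position.
                if axis == 0:
                    out[i][a][b] = value
                elif axis == 1:
                    out[a][i][b] = value
                else:
                    out[a][b][i] = value
    return out
```

Calling `segment_volume` once per axis, each time with the network trained for that orientation, gives the three volumes that the consensus step later merges.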


Figure 3: Diagram showing our architectural choices. Differences from the original U-Net architecture include the 3 channels of grayscale input of neighbour patches, residual connections in the convolution blocks and batch normalization of convolutions in convolution blocks.

4.1 Architecture

Most of our architectural ideas come from other successful works. The basic structure of our network (Figure 3) is inspired by U-Net [Ronneberger et al. (2015)], with some modifications. Firstly, instead of a single 2D patch as input, the two neighbouring patches are concatenated to it, leaving the patch corresponding to the target mask in the center. We named this approach Extended 2D (E2D) for ease of reference. Residual connections based on ResNet [He et al. (2016)] were added between the input and output of the double convolutional block, implemented as 1x1 2D convolutions to account for the different number of channels. Batch normalization was added to each convolution inside the convolutional block, to accelerate convergence and facilitate learning [Ioffe and Szegedy (2015)]. Also, all convolutions use padding to keep dimensions after the 3x3 convolution and have no bias.
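The Extended 2D input can be illustrated with a short sketch that stacks a patch with the patches at the same position in the previous and next slices. The function name, the volume[slice][row][col] indexing and the handling of the first and last slices are assumptions made for illustration; they are not taken from the authors' implementation.

```python
def extract_e2d_patch(volume, slice_idx, center_row, center_col, size=32):
    """Return [previous, center, next] patches cut at the same in-plane position.

    `volume` is indexed as volume[slice][row][col]. The patch must lie fully
    inside the slice, otherwise a ValueError is raised.
    """
    half = size // 2
    top, left = center_row - half, center_col - half
    n_rows, n_cols = len(volume[0]), len(volume[0][0])
    if top < 0 or left < 0 or top + size > n_rows or left + size > n_cols:
        raise ValueError("patch does not fit inside the slice")
    channels = []
    for idx in (slice_idx - 1, slice_idx, slice_idx + 1):
        # Assumption: at the first/last slice the missing neighbour is
        # replaced by the nearest existing slice.
        idx = min(max(idx, 0), len(volume) - 1)
        channels.append([row[left:left + size] for row in volume[idx][top:top + size]])
    return channels
```

The three returned patches play the role of the 3 grayscale input channels, while the target mask corresponds only to the middle one.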

All previously listed architectural choices improved validation performance on our data and performance on CC-359 (Table 1). A smaller version of the network, with only 3 max pools and 3 transposed convolutions, was also attempted, but convergence was not achieved. Other architectural changes were attempted without success and are outside the scope of this article.

4.2 Training

One of the most important steps in achieving good generalization when training CNNs is weight initialization. Poor initialization can have a negative impact on performance. To avoid that, weight transfer from VGG11 to the initial convolutions is performed, as in Iglovikov and Shvets (2018). Early studies were performed on the validation of the 2D segmentation of each FCNN to determine the best input, loss and learning rate. For the network input, a comparison was made between a 128x128 slice center patch and a 16x16 or 32x32 random slice patch centered on the hippocampus border. Better validation results were achieved with the 32x32 patch strategy, while the 128x128 center patch resulted in overfitting and the 16x16 patch with a smaller FCNN resulted in under-segmentation and more noise. Another early comparison was made between possible loss functions. Mean Square Error (MSE), Binary Cross Entropy (BCE) and Dice Loss were tested. Better and faster convergence in validation was observed using Dice Loss with a smooth factor of 1.0. Although Dice originally applies to sets, we consider each sigmoid value from 0 to 1 as a set element activation, comparing these values to binary target masks (0 or 1). This allows smooth convergence as the network converges to values close to 1 or 0. Dice Loss is defined as follows:


$$\mathrm{DiceLoss}(P, T) = 1 - \frac{2\,\mathrm{sum}(P * T) + s}{\mathrm{sum}(P) + \mathrm{sum}(T) + s}$$

where P is the prediction sigmoid vector, T is a binary segmentation target vector, sum() denotes the sum of all elements of a vector, * is element-wise multiplication and s is the smooth factor. In our experiments, the smooth factor allowed for more stable convergence, with fewer exploding or vanishing gradients. In training, Dice Loss is calculated per slice and the mean over the mini batch is used. However, when Dice is used as an evaluation metric, it is calculated once for the whole final volume, without a smooth factor.
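A small sketch may help make the two uses of Dice concrete. It assumes the common smoothed form, in which the smooth factor is added to both the numerator and the denominator; the function names and the convention for two empty masks are illustrative assumptions.

```python
def dice_loss(pred, target, smooth=1.0):
    """Smoothed Dice loss between sigmoid outputs and a binary target (flat lists)."""
    intersection = sum(p * t for p, t in zip(pred, target))
    return 1.0 - (2.0 * intersection + smooth) / (sum(pred) + sum(target) + smooth)


def batch_dice_loss(preds, targets, smooth=1.0):
    """Mean of the per-slice Dice losses over a mini batch."""
    losses = [dice_loss(p, t, smooth) for p, t in zip(preds, targets)]
    return sum(losses) / len(losses)


def volume_dice(pred_mask, target_mask):
    """Dice metric over a whole binarized volume, without a smooth factor."""
    intersection = sum(p * t for p, t in zip(pred_mask, target_mask))
    total = sum(pred_mask) + sum(target_mask)
    # Assumed convention: two empty masks count as a perfect match.
    return 2.0 * intersection / total if total else 1.0
```

Note that with the smooth factor an empty prediction against an empty target gives a loss of 0 instead of a division by zero, which is one reason slices without hippocampus do not destabilize training in this formulation.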

With those parameters fixed, a learning rate search for the SGD optimizer with momentum 0.9 was conducted, with the best convergence and training speed achieved with an initial learning rate of 0.001. The number of epochs was fixed at 500, with a mini batch size of 200 patches. After 200 epochs, the learning rate is decayed by a factor of 0.1. While experimenting with parameters and network architecture, the axial orientation was the hardest to learn, often diverging in the middle of training. This makes sense considering that, empirically, the axial orientation is the one in which the hippocampus is hardest to identify visually. Adam [Kingma and Ba (2014)] was attempted as an optimizer but in most cases resulted in divergence in the axial orientation. Only the states with the best validation results were saved.
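The training schedule can be written down compactly. The sketch below assumes that the decay is applied once, at epoch 200 (the text could also be read as a repeated decay), and uses one common formulation of SGD with momentum; both function names are hypothetical.

```python
def learning_rate(epoch, base_lr=0.001, decay_epoch=200, decay_factor=0.1):
    """Step schedule: base_lr before `decay_epoch`, base_lr * decay_factor after."""
    return base_lr * decay_factor if epoch >= decay_epoch else base_lr


def sgd_momentum_step(weight, grad, velocity, lr, momentum=0.9):
    """One scalar SGD update with momentum (one common formulation, assumed here)."""
    velocity = momentum * velocity + grad
    return weight - lr * velocity, velocity


# Learning rate seen at a few of the 500 training epochs.
for epoch in (0, 199, 200, 499):
    print(epoch, learning_rate(epoch))
```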

Every patch is generated at runtime. When using the E2D patch strategy, a patch refers to the center patch and its neighbours. As augmentation, every patch extracted from a slice has a 20% chance of being taken from a completely random position in the brain, while the other 80% are centered on a hippocampus border. There is a 20% chance of the patch being horizontally flipped. Every patch undergoes a variation in brightness between -10% and 10%, and there is a 20% chance of Gaussian noise with variance 0.0002 and mean 0 being added. It was empirically observed that vertical flips resulted in worse performance. We hypothesize that this is due to the hippocampus being more horizontally than vertically symmetric. Another observation was that noise augmentation helped generalization to the CC-359 dataset, even though training becomes harder.
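The augmentation probabilities above translate into a short sampling routine. The sketch below is an assumption-laden illustration: the brightness change is modelled as a multiplicative factor, patches are plain nested lists, and the function names are hypothetical. For E2D inputs the same flip would also have to be applied to all three channels and to the target mask.

```python
import math
import random


def sample_patch_center(border_points, slice_shape, rng, p_random=0.2):
    """Pick a patch center: a random position with prob. p_random, else a border point."""
    if rng.random() < p_random:
        return rng.randrange(slice_shape[0]), rng.randrange(slice_shape[1])
    return rng.choice(border_points)


def augment_patch(patch, rng, p_flip=0.2, brightness=0.1, p_noise=0.2, noise_var=0.0002):
    """Apply horizontal flip, brightness jitter and Gaussian noise to a 2D patch."""
    out = [list(row) for row in patch]
    if rng.random() < p_flip:
        out = [row[::-1] for row in out]  # horizontal flip only, no vertical flip
    # Assumption: brightness varies by a multiplicative factor in [1 - b, 1 + b].
    factor = 1.0 + rng.uniform(-brightness, brightness)
    out = [[v * factor for v in row] for row in out]
    if rng.random() < p_noise:
        sigma = math.sqrt(noise_var)  # standard deviation from the stated variance
        out = [[v + rng.gauss(0.0, sigma) for v in row] for row in out]
    return out


rng = random.Random(0)
augmented = augment_patch([[0.1, 0.2], [0.3, 0.4]], rng)
```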

Figure 4: a) Dice values calculated for binarization threshold (THS) values varying from 0.1 to 0.9 in all 22 test volumes. For all reported Dice results in this paper a THS of 0.5 was used. b) Results considering only one orientation versus the final consensus, using our best model. Consensus displays better performance.

Test Dice (%)   Augmentation   Residual Connections   Extended 2D   VGG11 Weights
92.78           x x x x
93.33           x x x
94.81           x x
95.53           x
96.30

Table 1: Selected final consensus results, showing the improvements in test set volume Dice after including our changes to the U-Net base architecture of each network. Models without E2D take a single 32x32 patch as input during training. Training parameters were fixed as discussed in Section 4.2.

4.3 Post-Processing and Evaluation

After all three networks are trained, in the test phase a volume is generated for each network by segmenting 160x160 center crop slices in its respective orientation. Concatenating the slices results in one segmentation volume for each network. To generate a final consensus heatmap, each volume is given an equal weight of 1/3 and the activation maps are summed. To avoid alignment errors in the final volume, the volumes are carefully padded from the 160x160 center crop back to their original size. Binarization of the consensus volume is performed with a threshold of 0.5 (Figure 4a), and finally, 3D labeling is performed using an implementation from Dougherty and Lotufo (2003). The two connected components with the largest volume are kept. This post processing raises the performance of our best model by around 10% in our test set, by removing noise from small false positives in the neck area, skull, and brain ridges and grooves. Also, the consensus has the effect of prioritizing a confident activation from a network (e.g. 0.995 or 0.001) over uncertain activations, increasing the robustness of the final result (Figure 4b).
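The consensus and post-processing steps can be sketched as below. This is not the implementation from Dougherty and Lotufo (2003) used in the paper: it is a plain-Python stand-in that labels components with a breadth-first search, and it assumes 6-connectivity and a strict "greater than" threshold, neither of which is stated in the text.

```python
from collections import deque

# Assumption: 6-connectivity (face neighbours only).
NEIGHBOURS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]


def consensus_mask(vol_a, vol_b, vol_c, threshold=0.5):
    """Average three activation volumes with equal 1/3 weights and binarize."""
    return [[[1 if (a + b + c) / 3.0 > threshold else 0
              for a, b, c in zip(row_a, row_b, row_c)]
             for row_a, row_b, row_c in zip(plane_a, plane_b, plane_c)]
            for plane_a, plane_b, plane_c in zip(vol_a, vol_b, vol_c)]


def keep_largest_components(mask, keep=2):
    """Keep only the `keep` largest connected components of a binary 3D mask."""
    depth, height, width = len(mask), len(mask[0]), len(mask[0][0])
    seen, components = set(), []
    for z in range(depth):
        for y in range(height):
            for x in range(width):
                if not mask[z][y][x] or (z, y, x) in seen:
                    continue
                # Breadth-first search collects one connected component.
                seen.add((z, y, x))
                queue, voxels = deque([(z, y, x)]), []
                while queue:
                    cz, cy, cx = queue.popleft()
                    voxels.append((cz, cy, cx))
                    for dz, dy, dx in NEIGHBOURS:
                        n = (cz + dz, cy + dy, cx + dx)
                        inside = 0 <= n[0] < depth and 0 <= n[1] < height and 0 <= n[2] < width
                        if inside and mask[n[0]][n[1]][n[2]] and n not in seen:
                            seen.add(n)
                            queue.append(n)
                components.append(voxels)
    components.sort(key=len, reverse=True)
    out = [[[0] * width for _ in range(height)] for _ in range(depth)]
    for voxels in components[:keep]:
        for z, y, x in voxels:
            out[z][y][x] = 1
    return out
```

Keeping two components reflects the left and right hippocampi; small isolated false positives form their own components and are discarded.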

5 Results and Discussion


Figure 5: Visual comparison of masks generated by Hippodeep (blue) and our method (green) in CC-359 data.

Our results show state-of-the-art performance in hippocampus segmentation on our test data. The changes to the original U-Net architecture listed in Table 1 improved our model's performance. Also, the consensus strategy resulted in better performance than evaluation following only one orientation (Figure 4b). Not using batch normalization resulted in much slower convergence or, in most cases, no convergence at all.

The first question to ask when faced with good results is whether the model is overfitting. We report that this method visually generalizes to another large dataset, CC-359. CC-359 includes different MRI machines and magnetic field strengths compared to our training data. Also, the data in CC-359 is not registered to a common space and includes more neck tissue. Before the inclusion of our modifications to the U-Nets, the method did not generalize well to CC-359. However, using Dice against Hippodeep [Thyreau et al. (2018)] masks in CC-359 data, we saw a 25% improvement with residual connections, and a further 12% improvement with VGG11 weight initialization and the Extended 2D approach. That shows the importance of those modifications to the U-Net architecture for the robustness and generalization of this method. Figure 5 shows comparisons of our masks and Hippodeep masks on CC-359 data. Finally, our method used fewer training volumes than Hippodeep, and runs in around 15 seconds per volume on a mid-range NVIDIA 1060 GPU. However, more validation on other datasets with gold standard manual annotations is necessary to confirm generalization.

6 Conclusion

This paper presents a hippocampus segmentation method based on the consensus of three Extended 2D FCNNs, with 96.3% Dice in our test set composed of 22 MRI samples. The method also visually displays generalization to another, fairly different dataset, using Hippodeep as a reference. We plan to make the code available on GitHub.

Future work will involve acquiring more gold-standard segmentations to confirm the generalization of the method. This method could also be applied to the segmentation of hippocampus subfields.

We thank FAPESP for funding this research under grant 2018/00186-0, our partners at BRAINN (FAPESP number 2013/07559-3 and FAPESP 2015/10369-7) for letting us use their dataset on this research and CNPq research funding, process number 311228/2014-3.


  • Andersen (2007) Per Andersen. The hippocampus book. Oxford University Press, 2007.
  • Dougherty and Lotufo (2003) Edward R Dougherty and Roberto A Lotufo. Hands-on morphological image processing, volume 59. SPIE press, 2003.
  • Duarte et al. (1999) Jair Moura Duarte, João Bosco dos Santos, and Leonardo Cunha Melo. Comparison of similarity coefficients based on rapd markers in the common bean. Genetics and Molecular Biology, 22(3):427–432, 1999.
  • Fischl (2012) Bruce Fischl. Freesurfer. Neuroimage, 62(2):774–781, 2012.
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • Iglesias and Sabuncu (2015) Juan Eugenio Iglesias and Mert R Sabuncu. Multi-atlas segmentation of biomedical images: a survey. Medical image analysis, 24(1):205–219, 2015.
  • Iglovikov and Shvets (2018) Vladimir Iglovikov and Alexey Shvets. Ternausnet: U-net with vgg11 encoder pre-trained on imagenet for image segmentation. arXiv preprint arXiv:1801.05746, 2018.
  • Ioffe and Szegedy (2015) Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
  • Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Lucena et al. (2018) Oeslle Lucena, Roberto Souza, Leticia Rittner, Richard Frayne, and Roberto Lotufo. Silver standard masks for data augmentation applied to deep-learning-based skull-stripping. In Biomedical Imaging (ISBI 2018), 2018 IEEE 15th International Symposium on, pages 1114–1117. IEEE, 2018.
  • Pipitone et al. (2014) Jon Pipitone, Min Tae M Park, Julie Winterburn, Tristram A Lett, Jason P Lerch, Jens C Pruessner, Martin Lepage, Aristotle N Voineskos, M Mallar Chakravarty, Alzheimer’s Disease Neuroimaging Initiative, et al. Multi-atlas segmentation of the whole hippocampus and subfields using multiple automatically generated templates. Neuroimage, 101:494–512, 2014.
  • Ronneberger et al. (2015) Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pages 234–241. Springer, 2015.
  • Sabuncu et al. (2010) Mert R Sabuncu, BT Thomas Yeo, Koen Van Leemput, Bruce Fischl, and Polina Golland. A generative model for image segmentation based on label fusion. IEEE transactions on medical imaging, 29(10):1714–1729, 2010.
  • Simonyan and Zisserman (2014) Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • Souza et al. (2018a) Roberto Souza, Oeslle Lucena, Mariana Bento, Julia Garrafa, Simone Appenzeller, Leticia Rittner, Roberto Lotufo, and Richard Frayne. Reliability of using single specialist annotation for designing and evaluating automatic segmentation methods: A skull stripping case study. In Biomedical Imaging (ISBI 2018), 2018 IEEE 15th International Symposium on, pages 1344–1347. IEEE, 2018a.
  • Souza et al. (2018b) Roberto Souza, Oeslle Lucena, Julia Garrafa, David Gobbi, Marina Saluzzi, Simone Appenzeller, Letícia Rittner, Richard Frayne, and Roberto Lotufo. An open, multi-vendor, multi-field-strength brain mr dataset and analysis of publicly available skull stripping methods agreement. NeuroImage, 170:482–494, 2018b.
  • Thyreau et al. (2018) Benjamin Thyreau, Kazunori Sato, Hiroshi Fukuda, and Yasuyuki Taki. Segmentation of the hippocampus by transferring algorithmic knowledge for large cohort processing. Medical image analysis, 43:214–228, 2018.
  • Wachinger et al. (2018) Christian Wachinger, Martin Reuter, and Tassilo Klein. Deepnat: Deep convolutional neural network for segmenting neuroanatomy. NeuroImage, 170:434–445, 2018.
  • Wang et al. (2013) Hongzhi Wang, Jung W Suh, Sandhitsu R Das, John B Pluta, Caryne Craige, and Paul A Yushkevich. Multi-atlas segmentation with joint label fusion. IEEE transactions on pattern analysis and machine intelligence, 35(3):611–623, 2013.
  • Xie and Gillies (2018) Zhongliu Xie and Duncan Gillies. Near real-time hippocampus segmentation using patch-based canonical neural network. arXiv preprint arXiv:1807.05482, 2018.
  • Yasuda et al. (2010) Clarissa Yasuda, Clarissa Valise, André Saúde, Amanda Pereira, Fabrício Pereira, André Costa, Márcia Morita, Luiz Betting, Gabriela Castellano, Carlos Guerreiro, et al. Dynamic changes in white and gray matter volume are associated with outcome of surgical treatment in temporal lobe epilepsy. Neuroimage, 49(1):71–79, 2010.