This contains official implementation for Extended 2D Volumetric Consensus Hippocampus Segmentation
Hippocampus segmentation plays a key role in diagnosing various brain disorders such as Alzheimer's disease, epilepsy, multiple sclerosis, cancer, depression and others. Nowadays, segmentation is still mainly performed manually by specialists. Segmentation done by experts is considered the gold standard when evaluating automated methods, but it is a time-consuming and arduous task that requires specialized personnel. In recent years, efforts have been made to achieve reliable automated segmentation. For years the best-performing automatic methods were multi-atlas based, with around 90% Dice coefficient and very long execution times, but machine learning methods have recently been rising with promising time and accuracy performance. A method for volumetric hippocampus segmentation is presented, based on the consensus of tri-planar U-Net inspired fully convolutional networks (FCNNs), with some modifications, including residual connections, VGG weight transfers, batch normalization and a patch extraction technique employing data from neighbor patches. A study on the impact of our modifications to the classical U-Net architecture was performed. Our method achieves cutting-edge performance in our dataset, with around 96% volumetric Dice accuracy in our test data, and GPU execution time in the order of seconds per volume. Also, masks are shown to be similar to those of other recent state-of-the-art hippocampus segmentation methods.
Hippocampus segmentation is very important in the diagnosis and treatment of many brain disorders, such as Alzheimer’s disease. The hippocampus has an important role in long- and short-term memory, and when affected by disease it is often reduced in size and altered in shape [Andersen (2007)]. In epilepsy treatment, surgical intervention is necessary in some cases [Yasuda et al. (2010)], and brain MRIs are often used to help in the planning phase. Standard procedure in these cases is to perform an MRI scan of the brain and have experts analyze the shape of the hippocampus. To this day, manual segmentation remains the gold standard, even though inter-rater variability is a concern [Souza et al. (2018a)]. It is also time-consuming and must be performed by specialized personnel.
For some time, the state of the art in automated hippocampus segmentation consisted mainly of methods with execution times in the order of hours per volume and around 0.9 Dice [Duarte et al. (1999)]. Until recently, the most successful methods used the multi-atlas approach [Wang et al. (2013)], [Iglesias and Sabuncu (2015)], [Pipitone et al. (2014)]. In this approach, multiple expert-segmented example images, called atlases, are registered to a target image, and the deformed atlas segmentations are combined using label fusion [Sabuncu et al. (2010)]. The main drawback of these methods is the time needed to perform a segmentation. One notable example is FreeSurfer [Fischl (2012)], a collection of methods for full brain segmentation that physicians use today to aid segmentation, but that takes hours to segment a volume.
Our work is inspired by the need to reduce the computation time of automatic hippocampus segmentation, using a different approach, namely Convolutional Neural Networks (CNNs). Recently, some works have also attempted to use CNNs, with promising performance [Wachinger et al. (2018), Thyreau et al. (2018), Xie and Gillies (2018)]. Our method consists of evaluating the consensus of volumes generated by three separate Extended 2D (Section 4.1) U-Net like [Ronneberger et al. (2015)] FCNNs, with encoders initialized from VGG11 [Simonyan and Zisserman (2014)] and residual connections [He et al. (2016)]. One network is trained for each brain orientation: sagittal, coronal and axial. Our main contribution is a lightweight method with a 200MB memory footprint and 15 seconds of execution time per volume on a mid-range GPU. This method achieves state-of-the-art segmentation performance of 96% Dice in our test set and employs CNN design ideas and learned knowledge from various works in deep learning, with results visually comparable to other recent hippocampus segmentation methods.
Xie and Gillies (2018) used a small Convolutional Neural Network (CNN) architecture focused on providing a fast method, which is one of the main advantages of deep learning in comparison to previous works. The work focused not only on fast prediction, but also on low memory usage and fast training time, which can be difficult to accomplish with CNNs. The authors used 2D patches from all three MRI orientations, fed to a single model that predicts the classification of a single voxel.
The method of Thyreau et al. (2018), named Hippodeep, is more similar to this work in that it uses fully convolutional neural networks trained on a region of interest (ROI). However, where we apply one FCNN for each plane of view, Thyreau et al. use a single FCNN that starts with a planar analysis swiftly followed by layers of 3D convolutions and shortcut connections. 3D FCNNs are known to be computationally intensive to train due to their large number of parameters, and they require large amounts of data. Their study used more than 2000 patients, expanded to around 10000 volumes through augmentation. The model is initially trained with FreeSurfer segmentations and later fine-tuned on volumes for which the authors had access to manual segmentations, the gold standard. In our experiments, Hippodeep was used for a qualitative analysis of our work.
The method of Wachinger et al. (2018), named DeepNat, is a whole-brain segmentation method that segments all structures of the brain, including the hippocampus, with around 90% Dice. It uses CNNs over 3D patches to classify a voxel and its neighbours, with a multi-task learning strategy. Patches are augmented with coordinates, and a novel brain parametrization strategy is presented to avoid the initial registration problem. Two 3D CNNs are used: the first separates background from foreground, and the second segments the structures in the foreground.
Although not a hippocampus segmentation work, Lucena et al. (2018) inspired our consensus strategy, which involves multiple FCNNs performing segmentation over different MRI orientations, merged into a single final volume. While their method uses another network to produce the final consensus, in our post-processing we simply add the activation heatmaps of the FCNNs, apply a pre-defined threshold, and perform 3D labeling to eliminate all but the two largest connected components, which is shown to improve performance significantly.
The main, currently private dataset used in this work was collected by medical personnel from the Brazilian Institute of Neuroscience and Neurotechnology (BRAINN) and Hospital HC-UNICAMP. Our dataset contains 214 MNI-registered T1-weighted MRI acquisitions made at HC (fig:intro). Almost one third of the acquisitions (66) are from patients who have had one side of the hippocampus modified or removed due to surgical treatment of epilepsy. The dataset was originally collected to study post-surgery volumetry of the hippocampus.
For this study, hold-out was employed with 80% for training, 10% for validation and 10% for testing. The only preprocessing applied to our dataset volumes was min-max normalization of the int16 values to float16 values between 0 and 1. All MRIs have manually segmented masks of the principal regions of the brain, including the hippocampus. Patients involved in the data acquisition were volunteers and signed consent terms. This research is done in partnership with BRAINN, and this dataset was already used in previous BRAINN research with approval of the UNICAMP Medical Sciences School Ethics and Research Committee (under CEP 1191/2011).
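The split and normalization described above can be sketched in a few lines of plain Python. This is only an illustration: the function names, the shuffling seed and the rounding rule (fractions rounded down, remainder assigned to the test set) are assumptions, and the float16 cast is omitted because the standard library has no half-precision type.

```python
import random

def hold_out_split(ids, train_frac=0.8, val_frac=0.1, seed=0):
    """Shuffle ids and split them into train / validation / test lists."""
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)
    # Assumption: sizes are rounded down and the remainder goes to the test set.
    n_train = int(train_frac * len(shuffled))
    n_val = int(val_frac * len(shuffled))
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

def minmax_normalize(values):
    """Rescale intensities linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant input: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

train_ids, val_ids, test_ids = hold_out_split(range(214))
print(len(train_ids), len(val_ids), len(test_ids))  # 171 21 22
```

With 214 volumes this rounding yields 171 training, 21 validation and 22 test volumes.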
As a visual validation and comparison to Hippodeep, we used CC-359, a public dataset with 359 volumes acquired at 1.5T and 3T on Siemens and Philips MRI machines [Souza et al. (2018b)]. Unlike the training data, these volumes are not registered, and they vary widely in magnetic field intensity and in the position of the hippocampus, which is translated or slightly rotated in relation to MNI registration. This is useful to show whether our model is overfitting to our MNI-registered, better-behaved training data.
Our analysis consists of three FCNNs examining the brain from the three possible orientations, slice by slice, followed by a consensus that merges the three volumes generated by the networks (fig:methodology). Neighbouring slices are also taken into account in the prediction. The inspiration for this methodology came from the way physicians analyze MRI, using neighbour slices around the point of interest, visualized in all three orientations. The volume segmentation is constructed by running each network over every slice in the orientation it was trained in. In the following sections, the inner workings of our method are described in more detail, from architecture to final post-processing and consensus strategy.
Most of our architectural ideas come from other successful works. The basic structure of our network (fig:arch) is inspired by U-Net [Ronneberger et al. (2015)]. However, there are some modifications. Firstly, instead of a single 2D patch as input, two neighbour patches are concatenated with it, leaving the patch corresponding to the target mask in the center. We named this approach Extended 2D (E2D) for ease of reference. Residual connections based on ResNet [He et al. (2016)] were added between the input and output of the double convolutional block, implemented as 1x1 2D convolutions to account for the different number of channels. Batch normalization was added to each convolution inside the convolutional block, to accelerate convergence and facilitate learning [Ioffe and Szegedy (2015)]. Also, all convolutions use padding to keep dimensions after the 3x3 convolution and have no bias.
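To make the Extended 2D input concrete, the following plain-Python sketch builds the three-channel input for one slice. The function name `e2d_patch`, the `volume[slice][row][col]` indexing and the handling of the first and last slices (reusing the nearest existing slice) are assumptions made for illustration only.

```python
def e2d_patch(volume, k, top, left, size=32):
    """Return [previous, center, next] patches for slice k of a volume.

    volume is indexed as volume[slice][row][col]. The three patches share the
    same in-plane window and act as the input channels; the target mask
    corresponds to the center patch only.
    """
    last = len(volume) - 1
    # Assumption: at the first/last slice the missing neighbour is replaced
    # by the nearest existing slice (index clamping).
    neighbours = [max(k - 1, 0), k, min(k + 1, last)]
    return [
        [row[left:left + size] for row in volume[i][top:top + size]]
        for i in neighbours
    ]
```

For example, `e2d_patch(volume, 10, 40, 52)` would return three 32x32 patches taken from slices 9, 10 and 11 at the same in-plane position.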
All previously listed architectural choices improved validation performance on our data and performance on CC-359 (Table 1). An attempt was made to use a smaller version of the network, with only 3 max pools and 3 transposed convolutions, but convergence was not achieved. Other architectural changes were attempted without success and are outside the scope of this article.
One of the most important steps in achieving good generalization when training CNNs is weight initialization. Poor initialization can have a negative impact on performance. To avoid that, weight transfer from VGG11 to the initial convolutions is performed, as in Iglovikov and Shvets (2018). Early studies were performed on the validation of the 2D segmentation of each FCNN to determine the best input, loss and learning rate. For the input to the network, a comparison was made between a 128x128 slice center patch and a 16x16 or a 32x32 random slice patch centered on the hippocampus border. Better validation results were achieved with the 32x32 patch strategy, while the 128x128 center patch resulted in overfitting and the 16x16 patch with a smaller FCNN resulted in under-segmentation and more noise. Another early comparison was made between possible loss functions: Mean Square Error (MSE), Binary Cross Entropy (BCE) and Dice Loss. Better and faster convergence in validation was observed using Dice Loss with a smooth factor of 1.0. Although Dice originally applies to sets, we consider each sigmoid value from 0 to 1 as a set element activation, comparing these values to binary target masks (0 or 1). This allows smooth convergence as the network outputs approach values close to 1 or 0. Dice Loss, with the smooth factor of 1 written explicitly, is defined as follows:

$$\mathrm{DiceLoss}(P, T) = 1 - \frac{2\,\mathrm{sum}(P * T) + 1}{\mathrm{sum}(P) + \mathrm{sum}(T) + 1}$$
Here P is the prediction sigmoid vector, T is a binary segmentation target vector, sum() denotes the sum of all elements of a vector and * is element-wise multiplication. In our experiments, the smooth factor allowed for more stable convergence, with fewer exploding or vanishing gradients. In training, Dice Loss is calculated per slice and the mean over the mini-batch is used. However, when Dice is used as an evaluation metric, it is calculated once for the whole final volume, without a smooth factor.
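A minimal Python sketch of the loss and the evaluation metric follows. It assumes the commonly used smoothed form, 1 - (2 sum(P * T) + s) / (sum(P) + sum(T) + s) with s = 1.0, which is consistent with the symbols defined above. The function names and the convention for two empty masks are illustrative assumptions, and vectors are flat Python lists rather than tensors.

```python
def dice_loss(pred, target, smooth=1.0):
    """Smoothed Dice loss between sigmoid outputs and a binary mask (flat lists)."""
    intersection = sum(p * t for p, t in zip(pred, target))
    return 1.0 - (2.0 * intersection + smooth) / (sum(pred) + sum(target) + smooth)

def batch_dice_loss(preds, targets, smooth=1.0):
    """Mean of the per-slice losses over a mini-batch."""
    losses = [dice_loss(p, t, smooth) for p, t in zip(preds, targets)]
    return sum(losses) / len(losses)

def dice_metric(pred_mask, target_mask):
    """Plain Dice over a whole binarized volume, without a smooth factor."""
    intersection = sum(p * t for p, t in zip(pred_mask, target_mask))
    denom = sum(pred_mask) + sum(target_mask)
    # Assumption: two empty masks are scored as a perfect match.
    return 2.0 * intersection / denom if denom else 1.0
```

The smooth term keeps the loss finite on slices with an empty target and an almost empty prediction, where the unsmoothed ratio would approach 0/0.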
With those parameters fixed, a learning rate search for the SGD optimizer with momentum 0.9 was conducted, with the best convergence and training speed achieved with an initial learning rate of 0.001. The number of epochs was fixed at 500, with a mini-batch size of 200 patches. After 200 epochs, the learning rate is decayed by a factor of 0.1. While experimenting with parameters and network architecture, the axial orientation was the hardest to learn, often diverging in the middle of training. This makes sense, since the axial orientation is empirically the one in which the hippocampus is hardest to identify visually. Adam [Kingma and Ba (2014)] was attempted as an optimizer but most of the time resulted in divergence in the axial orientation. Only the states with the best validation performance were saved.
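The schedule and the checkpoint rule above can be written down as two small helpers. This is a sketch under stated assumptions: the decay is treated as a single step at epoch 200 (the text does not say whether it repeats), "best validation" is taken to mean the highest validation Dice, and both function names are hypothetical.

```python
def learning_rate(epoch, base_lr=0.001, decay_epoch=200, factor=0.1):
    """Step schedule: base_lr for the first decay_epoch epochs, then base_lr * factor."""
    # Assumption: a single decay step, not a repeated one.
    return base_lr if epoch < decay_epoch else base_lr * factor

def best_epoch(val_scores):
    """Index of the epoch whose state would be kept (highest validation score)."""
    best, best_score = None, float("-inf")
    for epoch, score in enumerate(val_scores):
        if score > best_score:  # a checkpoint would be written at this point
            best, best_score = epoch, score
    return best
```

In a training loop, `learning_rate(epoch)` would be passed to the SGD optimizer (momentum 0.9) at the start of each of the 500 epochs.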
Every patch is generated at runtime. When using the E2D patch strategy, a patch refers to the center patch and its neighbours. As augmentation, every patch extracted from a slice has a 20% chance of coming from a completely random position in the brain, while the other 80% are centered on a hippocampus border. There is a 20% chance of the patch being horizontally flipped. Every patch undergoes a variation in brightness between -10% and 10%, and there is a 20% chance of Gaussian noise with mean 0 and variance 0.0002 being added. It was empirically observed that vertical flips resulted in worse performance. We conjecture that this is because the hippocampus is more symmetric horizontally than vertically. Another observation was that noise augmentation helped with generalizing performance to the CC-359 dataset, even though training becomes harder.
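The augmentation probabilities above can be sketched with the standard library as follows. Several details are assumptions for illustration: the brightness change is modeled as a multiplicative gain, no clipping to [0, 1] is applied, and the function names and the list-of-rows patch format are hypothetical.

```python
import math
import random

def choose_center(border_points, shape, rng):
    """Pick a patch center: 80% on a hippocampus border voxel, 20% anywhere in the slice."""
    if border_points and rng.random() >= 0.2:
        return rng.choice(border_points)
    return (rng.randrange(shape[0]), rng.randrange(shape[1]))

def augment(patch, rng):
    """Apply flip, brightness and noise augmentation to a 2D patch (list of rows)."""
    out = [row[:] for row in patch]  # work on a copy, leave the input untouched
    if rng.random() < 0.2:  # horizontal flip with 20% probability
        out = [row[::-1] for row in out]
    # Assumption: brightness is a multiplicative change of up to +/-10%.
    gain = 1.0 + rng.uniform(-0.1, 0.1)
    out = [[v * gain for v in row] for row in out]
    if rng.random() < 0.2:  # Gaussian noise, mean 0, variance 0.0002
        sigma = math.sqrt(0.0002)
        out = [[v + rng.gauss(0.0, sigma) for v in row] for row in out]
    return out
```

In an E2D setting the same flip decision would have to be applied to all three input patches and to the target mask, which this single-patch sketch does not show.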
After all three networks are trained, in the test phase a volume for each network is generated by segmenting 160x160 center crop slices in their respective orientations. Concatenating the slices results in one segmentation volume for each network. To generate a final consensus heatmap, each volume is given an equal weight of 1/3 and the activation maps are summed. Careful attention was given to aligning the three volumes in the final volume to avoid errors, padding them from the 160x160 center crop back to their original size. Binarization of the consensus volume is performed with a threshold of 0.5 (fig:ths), and finally, 3D labeling is performed using an implementation from Dougherty and Lotufo (2003). The two connected components with the largest volume are kept. This post-processing raises the performance of our best model by around 10% in our test set, by removing noise from small false positives in the neck area, skull, and brain ridges and grooves. Also, the consensus has the effect of prioritizing a confident activation from a network (e.g. 0.995 or 0.001) over uncertain activations, increasing the robustness of the final result (fig:consensus).
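The consensus and post-processing steps can be illustrated with a small self-contained sketch. It is not the Dougherty and Lotufo implementation cited above: the breadth-first labeling, the 6-connectivity, the strict `>` comparison with the threshold and the function names are all assumptions, and volumes are nested lists indexed as `volume[z][y][x]`.

```python
from collections import deque

def consensus_voxels(volumes, threshold=0.5):
    """Average the activation volumes and return the set of foreground (z, y, x) voxels."""
    depth, height, width = len(volumes[0]), len(volumes[0][0]), len(volumes[0][0][0])
    weight = 1.0 / len(volumes)  # equal weight, 1/3 for three networks
    return {
        (z, y, x)
        for z in range(depth) for y in range(height) for x in range(width)
        if weight * sum(v[z][y][x] for v in volumes) > threshold
    }

def keep_largest_components(voxels, keep=2):
    """Keep only the `keep` largest connected components of a voxel set."""
    # Assumption: 6-connectivity (face neighbours only).
    steps = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    unvisited, components = set(voxels), []
    while unvisited:
        seed = unvisited.pop()
        component, queue = {seed}, deque([seed])
        while queue:  # breadth-first flood fill from the seed voxel
            z, y, x = queue.popleft()
            for dz, dy, dx in steps:
                neighbour = (z + dz, y + dy, x + dx)
                if neighbour in unvisited:
                    unvisited.remove(neighbour)
                    component.add(neighbour)
                    queue.append(neighbour)
        components.append(component)
    components.sort(key=len, reverse=True)
    return set().union(*components[:keep])
```

With equal weights, a voxel scored 0.995 by one network and 0.3 by the other two averages to about 0.53 and is kept, while a voxel scored 0.6, 0.6 and 0.001 averages to about 0.40 and is discarded, which is the effect of confident activations described above.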
Our results show state-of-the-art performance in hippocampus segmentation on our test data. The changes to the original U-Net architecture listed in Table 1 improved the performance of our model. Also, the consensus strategy resulted in better performance than evaluation following only one orientation (fig:consensus). Not using batch normalization resulted in much slower convergence, or no convergence at all in most cases.
The first question one asks when faced with good results is whether the model is overfitting. We report that this method visually generalizes to another large dataset, CC-359. CC-359 includes MRI machines and magnetic field intensities different from those in our training data. Also, the data in CC-359 is not registered to a common space and includes more neck tissue. Before the inclusion of our modifications to the U-Nets, the method did not generalize well to CC-359. However, using Dice against Hippodeep [Thyreau et al. (2018)] masks in CC-359 data, we saw a 25% improvement with residual connections, and a 12% improvement over that with VGG11 weight initialization and the Extended 2D approach. This shows the importance of those modifications to the U-Net architecture for the robustness and generalization of this method. fig:vis shows comparisons of our masks and Hippodeep masks on CC-359 data. Finally, our method used fewer training volumes than Hippodeep, and runs in around 15 seconds per volume on a mid-range NVIDIA 1060 GPU. However, more validation on other datasets with gold-standard manual annotations is necessary to confirm generalization.
This paper presents a hippocampus segmentation method based on the consensus of three Extended 2D FCNNs, with 96.3% Dice in our test set composed of 22 MRI samples. The method also visually displays generalization to another, fairly different dataset, using Hippodeep as a reference. We plan to make the code available on GitHub.
Future work will involve acquiring more gold-standard segmentations to confirm the generalization of the method. Also, this method could be applied to hippocampus subfields segmentation.
We thank FAPESP for funding this research under grant 2018/00186-0, our partners at BRAINN (FAPESP number 2013/07559-3 and FAPESP 2015/10369-7) for letting us use their dataset on this research and CNPq research funding, process number 311228/2014-3.