This repository contains the code of HyperDenseNet, a hyper-densely connected CNN for segmenting medical images in multi-modal scenarios.
Recently, dense connections have attracted substantial attention in computer vision because they facilitate gradient flow and implicit deep supervision during training. In particular, DenseNet, which connects each layer to every other layer in a feed-forward fashion, has shown impressive performance in natural image classification tasks. We propose HyperDenseNet, a 3D fully convolutional neural network that extends the definition of dense connectivity to multi-modal segmentation problems. Each imaging modality has a path, and dense connections occur not only between the pairs of layers within the same path, but also between those across different paths. This contrasts with the existing multi-modal CNN approaches, in which modeling several modalities relies entirely on a single joint layer (or level of abstraction) for fusion, typically either at the input or at the output of the network. Therefore, the proposed network has total freedom to learn more complex combinations between the modalities, within and in-between all the levels of abstraction, which significantly enriches the learned representations. We report extensive evaluations over two different and highly competitive multi-modal brain tissue segmentation challenges, iSEG 2017 and MRBrainS 2013, with the former focusing on 6-month infant data and the latter on adult images. HyperDenseNet yielded significant improvements over many state-of-the-art segmentation networks, ranking at the top on both benchmarks. We further provide a comprehensive experimental analysis of feature re-use, which confirms the importance of hyper-dense connections in multi-modal representation learning. Our code is publicly available at https://www.github.com/josedolz/HyperDenseNet.
Multi-modal imaging is of primary importance for developing comprehensive models of pathologies and increasing the statistical power of current imaging biomarkers. In neuroimaging studies, different magnetic resonance imaging (MRI) modalities are often combined to overcome the limitations of independent imaging techniques. While T1-weighted images yield a good contrast between gray matter (GM) and white matter (WM) tissues, T2-weighted and proton density (PD) pulses help visualize tissue abnormalities like lesions. Likewise, fluid attenuated inversion recovery (FLAIR) images can enhance the image contrast of white matter lesions resulting from multiple sclerosis. In brain segmentation, considering multiple MRI modalities is essential to obtain accurate results. This is particularly true for the segmentation of infant brains, where tissue contrast is low (Fig. 1).
Advances in multi-modal imaging, however, come at the price of an inherently large amount of data, imposing a burden on disease assessments. Visual inspections of such an enormous amount of medical images are prohibitively time-consuming, prone to errors and unsuitable for large-scale studies. Therefore, automatic and reliable multi-modal segmentation algorithms are of high interest to the clinical community.
Multi-modal image segmentation in brain-related applications has received substantial research attention, for instance, for brain tumors [3, 4, 5, 6], brain tissues of both infants [7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17] and adults [18, 19], and subcortical structures, among other problems [21, 22, 23]. Atlas-propagation approaches are commonly used in multi-modal scenarios [24, 25]. These methods rely on registering one or multiple atlases to the target image, followed by a propagation of manual labels. When several atlases are considered, labels from individual atlases can be combined into a final segmentation via a label fusion strategy [8, 10, 13]. When relying solely on atlas fusion, the performance of such techniques might be limited and prone to registration errors. Parametric or deformable models have also been investigated [14]. For example, one study investigated a patch-driven method for neonatal brain tissue segmentation, integrating the probability maps of a subject-specific atlas into a level-set framework.
More recently, our community has witnessed a wide adoption of deep learning techniques, particularly convolutional neural networks (CNNs), as an effective alternative to traditional segmentation approaches. CNN architectures are supervised models, trained end-to-end, that learn a hierarchy of image features representing different levels of abstraction. In contrast to conventional classifiers based on hand-crafted features, CNNs can learn both the features and the classifier simultaneously, in a data-driven manner. They have achieved state-of-the-art performance in a broad range of medical image segmentation problems [26, 27], including multi-modal tasks [28, 15, 4, 29, 16, 19, 17, 6, 5, 22, 23].
Most of the existing multi-modal CNN segmentation techniques followed an early-fusion strategy, which integrates the multi-modality information from the original space of low-level features [15, 29, 28, 5, 19, 23]. For instance, MRI T1, T2 and fractional anisotropy (FA) images can simply be merged at the input of the network. However, as argued in prior work on multi-modal learning, it is difficult to discover highly non-linear relationships between the low-level features of different modalities, all the more so when such modalities have significantly different statistical properties. In fact, early-fusion methods implicitly assume that the relationship between different modalities is simple (e.g., linear). For instance, early fusion can learn complementary information from T1, T2 and FA images; however, the relationship between the original T1, T2 and FA image data may be much more complex than complementarity, due to significantly different image acquisition processes. Later work advocated late fusion of high-level features as a way that better accounts for the complex relationships between different modalities, using an independent convolutional network for each modality and fusing the outputs of the different networks in higher-level layers, showing better performance than early fusion in the context of infant brain segmentation. These results are in line with a recent study in the machine learning community, which investigated multi-modal learning with deep Boltzmann machines in the context of fusing data from color images and text.
| Reference | Modalities | Target | Method |
|---|---|---|---|
| Prastawa et al., 2005 | T1, T2 | Infant brain tissue | Multi-atlas |
| Weisenfeld et al., 2006 | T1, T2 | Infant brain tissue | Multi-atlas |
| Deoni et al., 2007 | T1, T2 | Thalamic nuclei | K-means clustering |
| Anbeek et al., 2008 | T2, IR | Infant brain tissue | KNN |
| Weisenfeld and Warfield, 2009 | T1, T2 | Infant brain tissue | Multi-atlas |
| Wang et al., 2011 | T1, T2, FA | Infant brain tissue | Multi-atlas + Level sets |
| Srhoj et al., 2012 | T1, T2 | Infant brain tissue | Multi-atlas + KNN |
| Wang et al., 2012 | T1, T2 | Infant brain tissue | Multi-atlas |
| Wang et al., 2014 | T1, T2, FA | Infant brain tissue | Multi-atlas + Level sets |
| Kamnitsas et al., 2015 | FLAIR, DWI, T1, T2 | Brain lesion | 3D FCNN + CRF |
| Zhang et al., 2015 | T1, T2, FA | Infant brain tissue | 2D CNN |
| Havaei et al., 2016 | T1, T1c, T2, FLAIR | Multiple sclerosis / Brain tumor | 2D CNN |
| Nie et al., 2016 | T1, T2, FA | Infant brain tissue | 2D FCNN |
| Chen et al., 2017 | T1, T1-IR, FLAIR | Brain tissue | 3D FCNN |
| Dolz et al., 2017 | T1, T2 | Infant brain tissue | 3D FCNN |
| Fidon et al., 2017 | T1, T1c, T2, FLAIR | Brain tumor | CNN |
| Kamnitsas et al., 2017 | | Brain tumour/lesions | 3D FCNN + CRF |
| Kamnitsas et al., 2017 | MPRAGE, FLAIR, T2, PD | Traumatic brain injuries | 3D FCNN (adversarial training) |
| Valverde et al., 2017 | T1, T2, FLAIR | Multiple sclerosis | 3D FCNN |
Since the recent introduction of residual learning, shortcut connections from early to late layers have become very popular in a breadth of computer vision problems [33, 34]. Unlike traditional networks, these connections back-propagate gradients directly, thereby mitigating the gradient-vanishing problem and allowing deeper networks. Furthermore, they transform a whole network into a large ensemble of shallower networks, yielding competitive performances in various applications [35, 19, 36, 37]. DenseNet extended the concept of shortcut connections, with the input of each layer corresponding to the outputs from all previous layers. Such a dense network facilitates the gradient flow and the learning of more complex patterns, which yielded significant improvements in accuracy and efficiency for natural image classification tasks. Inspired by this success, recent works have included dense connections in deep networks for medical image segmentation [39, 40, 41]. However, these works have either considered a single modality [39, 40] or have simply concatenated multiple modalities in a single stream. So far, the impact of dense connectivity across multiple network paths, and its application to multi-modal image segmentation, remains unexplored.
We propose HyperDenseNet, a 3D fully convolutional neural network that extends the definition of dense connectivity to multi-modal segmentation problems. Each imaging modality has a path, and dense connections occur not only between the pairs of layers within the same path, but also between those across different paths; see the illustration in Fig. 2. This contrasts with the existing multi-modal CNN approaches, in which modeling several modalities relies entirely on a single joint layer (or level of abstraction) for fusion, typically either at the input (early fusion) or at the output (late fusion) of the network. Therefore, the proposed network has total freedom to learn more complex combinations between the modalities, within and in-between all the levels of abstraction, which significantly enriches the learned representations in comparison to early/late fusion. Furthermore, hyper-dense connections facilitate learning as they improve gradient flow and impose implicit deep supervision. We report extensive evaluations over two different and highly competitive multi-modal brain tissue segmentation challenges, iSEG 2017 and MRBrainS 2013. As iSEG 2017 focuses on 6-month infant data, whereas MRBrainS 2013 uses adult data, there are significant differences between the two benchmarks in terms of image data characteristics, e.g., the voxel spacing and the number of available modalities. HyperDenseNet yielded significant improvements over many state-of-the-art segmentation networks, ranking at the top on both benchmarks. We further provide a comprehensive experimental analysis of feature re-use, which confirms the importance of hyper-dense connections in multi-modal representation learning. Our code is publicly available at https://www.github.com/josedolz/HyperDenseNet.
A preliminary conference version of this work appeared at ISBI 2018. This journal version is a substantial extension, including (1) a much broader, more informative/rigorous treatment of the subject in the general context of multi-modal segmentation; and (2) comprehensive experiments with additional baselines and publicly available benchmarks, as well as a thorough investigation of the practical usefulness and impact of hyper-dense connections.
Convolutional neural networks (CNNs) are deep models that can learn feature representations automatically from the training data. They consist of multiple layers, each processing the imaging data at a different level of abstraction, enabling segmentation algorithms to learn from large datasets and discover complex patterns that can be further employed for predicting unseen samples. The first attempts to use CNNs in segmentation problems followed a sliding-window strategy, where the regions defined by the window are processed independently, which limits both segmentation accuracy and computational efficiency. To overcome these limitations, the network can be viewed as a single non-linear convolution trained end-to-end, an approach known as a fully convolutional neural network (FCNN). FCNNs bring several advantages over standard CNNs: they can handle images of arbitrary sizes and avoid redundant convolution and pooling operations, enabling computationally efficient learning.
The concept of “the deeper the better” is considered a key principle in deep learning. Nevertheless, one obstacle when dealing with deep architectures is the problem of vanishing/exploding gradients, which hampers convergence during training. To address these limitations in very deep architectures, densely connected networks were investigated. DenseNets are built on the idea that adding direct connections from any layer to all subsequent layers in a feed-forward manner makes training easier and more accurate. This is motivated by three observations. First, there is implicit deep supervision thanks to the short paths to all feature maps in the architecture. Second, direct connections between all layers help improve the flow of information and gradients throughout the entire network. Third, dense connections have a regularizing effect, which reduces the risk of over-fitting on tasks with smaller training sets.
Inspired by the recent success of densely-connected networks in medical image segmentation works [39, 40, 41], we propose a hyper-dense architecture for multi-modal image segmentation that extends the concept of dense connectivity to the multi-modal setting: each imaging modality has a path, and dense connections occur not only between layers within the same path, but also between layers across different paths (see Fig. 2 for an illustration).
Let $x_l$ be the output of the $l^{th}$ layer. In CNNs, this vector is typically obtained from the output of the previous layer, $x_{l-1}$, by a mapping $H_l$ composed of a convolution followed by a non-linear activation function:

$$x_l = H_l(x_{l-1}). \qquad (1)$$
A densely-connected network concatenates all feature outputs in a feed-forward manner:

$$x_l = H_l([x_{l-1}, x_{l-2}, \ldots, x_0]), \qquad (2)$$

where $[\ldots]$ denotes a concatenation operation.
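To make this pattern concrete, here is a minimal PyTorch sketch of single-stream dense connectivity as in Eq. (2). The class and parameter names are illustrative, not the authors' released implementation:

```python
import torch
import torch.nn as nn

class DenseStream(nn.Module):
    """Single-stream dense connectivity (Eq. 2): layer l receives the
    concatenated outputs of all preceding layers."""
    def __init__(self, in_channels, growth, num_layers):
        super().__init__()
        self.layers = nn.ModuleList()
        channels = in_channels
        for _ in range(num_layers):
            # H_l: a 3D convolution followed by a non-linear activation
            self.layers.append(nn.Sequential(
                nn.Conv3d(channels, growth, kernel_size=3, padding=1),
                nn.PReLU()))
            channels += growth  # each layer contributes `growth` new maps

    def forward(self, x):
        outputs = [x]
        for layer in self.layers:
            # [x_{l-1}, x_{l-2}, ..., x_0]: concatenate along channels
            outputs.append(layer(torch.cat(outputs, dim=1)))
        return torch.cat(outputs, dim=1)
```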
Pushing this idea further, HyperDenseNet introduces a more general connectivity definition, in which we link the outputs from layers in different streams, each associated with a different image modality. In the multi-modal setting, our hyper-dense connectivity yields a much more powerful feature representation than early/late fusion, as the network learns the complex relationships between the modalities within and in-between all the levels of abstraction. For simplicity, let us consider the scenario of two image modalities, although the extension to $N$ modalities is straightforward. Let $x_l^1$ and $x_l^2$ denote the outputs of the $l^{th}$ layer in streams 1 and 2, respectively. In general, the output of the $l^{th}$ layer in a stream $s$ can then be defined as follows:

$$x_l^s = H_l^s([x_{l-1}^1, x_{l-1}^2, x_{l-2}^1, x_{l-2}^2, \ldots, x_0^1, x_0^2]). \qquad (3)$$
Shuffling and interleaving feature map elements in a CNN were recently found to enhance efficiency and performance, while serving as a strong regularizer [44, 45, 46]. This is motivated by the fact that intermediate CNN layers perform deterministic transformations to improve performance; however, relevant information might be lost during these operations. It is therefore beneficial for intermediate layers to offer a variety of information exchange while preserving the aforementioned deterministic functions. Motivated by this principle, we thus concatenate feature maps in a different order for each branch and layer:

$$x_l^s = H_l^s\big(\pi_s([x_{l-1}^1, x_{l-1}^2, x_{l-2}^1, x_{l-2}^2, \ldots, x_0^1, x_0^2])\big), \qquad (4)$$

with $\pi_s$ being a function that permutes the feature maps given as input. For instance, in the case of two image modalities, we could have:

$$x_l^1 = H_l^1([x_{l-1}^1, x_{l-1}^2, x_{l-2}^1, x_{l-2}^2, \ldots]), \qquad x_l^2 = H_l^2([x_{l-1}^2, x_{l-1}^1, x_{l-2}^2, x_{l-2}^1, \ldots]). \qquad (5)$$
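The following sketch combines the two-stream hyper-dense connectivity of Eq. (3) with a simplified per-stream permutation in the spirit of Eq. (4), where stream 2 merely swaps the interleaving order. All names are illustrative, and the exact permutation scheme used in the released implementation may differ:

```python
import torch
import torch.nn as nn

class HyperDenseBlock(nn.Module):
    """Two-stream hyper-dense connectivity: each layer of each stream sees
    the outputs of all preceding layers from *both* streams (Eq. 3),
    concatenated in a stream-specific order (Eq. 4)."""
    def __init__(self, in_channels, growth, num_layers):
        super().__init__()
        self.streams = nn.ModuleList([nn.ModuleList() for _ in range(2)])
        channels = 2 * in_channels  # layer 0 sees both modalities
        for _ in range(num_layers):
            for s in range(2):  # one mapping H_l^s per stream
                self.streams[s].append(nn.Sequential(
                    nn.Conv3d(channels, growth, kernel_size=3, padding=1),
                    nn.PReLU()))
            channels += 2 * growth  # both streams add `growth` maps each

    def forward(self, x1, x2):
        hist = [[x1], [x2]]  # per-stream output histories
        for l in range(len(self.streams[0])):
            # pi_1 interleaves (own, other); pi_2 swaps the two streams
            inp1 = [t for pair in zip(hist[0], hist[1]) for t in pair]
            inp2 = [t for pair in zip(hist[1], hist[0]) for t in pair]
            hist[0].append(self.streams[0][l](torch.cat(inp1, dim=1)))
            hist[1].append(self.streams[1][l](torch.cat(inp2, dim=1)))
        return torch.cat(hist[0] + hist[1], dim=1)
```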
Figure 2 shows a section of the proposed architecture, where each gray region represents a convolutional block. For simplicity, we assume that the red arrows indicate convolution operations only, whereas the black arrows represent the direct connections between feature maps from different layers, within and in-between the different streams. Thus, the input of each convolutional block (maps before the red arrow) is the concatenation of the outputs (maps after the red arrow) of all the preceding layers from both paths.
To investigate thoroughly the impact of hyper-dense connections between different streams in multi-modal image segmentation, several baselines were considered. First, we extended a previously proposed semi-dense architecture to a fully-dense one, by connecting the output of each convolutional layer to all subsequent layers. In this network, we follow an early-fusion strategy, in which MRI T1 and T2 are integrated at the input of the CNN and processed jointly along a single path (Fig. 3, left). The connectivity setting of this model corresponds to Eq. (2). Second, instead of merging both modalities at the input of the network, we considered a late-fusion strategy, where each modality is processed independently in different streams and the learned features are fused before the first fully connected layer (Fig. 3, middle). In this model, the dense connections are included within each path, following the connectivity definition of Eq. (2) for each stream.
As a last baseline, we used an early-fusion model which combines features from different streams after the first convolutional layer (Fig. 3, right). Since this non-linear combination of features is re-used in all subsequent layers, the resulting network is similar to our hyper-dense model of Eq. (3). However, there are two important differences. First, each stream in our model processes its input differently, as shown by the stream-indexed function in Eq. (3). Also, as described above, each stream performs a different shuffling of inputs, which can enhance the robustness of the model and mitigate the risk of overfitting. Our experiments in Section 3 demonstrate empirically the advantages of our model over this baseline.
To have a large receptive field, FCNNs typically use full images as input. The number of parameters is then limited via pooling/unpooling layers. A problem with this approach is the loss of resolution from repeated down-sampling operations. In the proposed method, we instead follow a strategy where fixed-size sub-volumes are used as input, avoiding pooling layers. While smaller sub-volumes are considered for training, we used non-overlapping sub-volumes during inference, as in [5, 26]. This strategy offers two considerable benefits. First, it reduces the memory requirements of our network, thereby removing the need for spatial pooling. More importantly, it substantially increases the number of training examples and, therefore, removes the need for data augmentation.
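As an illustration, here is a minimal numpy sketch of this sampling strategy. The sub-volume size, sample count and function name are placeholders, not values taken from the paper:

```python
import numpy as np

def sample_subvolumes(t1, t2, labels, patch=27, n=1000, rng=None):
    """Randomly crop aligned 3D sub-volumes from co-registered modalities.
    `patch` is a placeholder size; training uses fixed-size sub-volumes
    rather than whole images, with no pooling layers."""
    rng = rng or np.random.default_rng()
    D, H, W = t1.shape
    xs, ys = [], []
    for _ in range(n):
        z = rng.integers(0, D - patch + 1)
        y = rng.integers(0, H - patch + 1)
        x = rng.integers(0, W - patch + 1)
        sl = (slice(z, z + patch), slice(y, y + patch), slice(x, x + patch))
        # stack modalities along a leading channel axis
        xs.append(np.stack([t1[sl], t2[sl]], axis=0))
        ys.append(labels[sl])
    return np.stack(xs), np.stack(ys)
```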
Table II: Layer-wise parameters of the networks (convolution kernel size, number of kernels, output size and dropout rate). In the case of multi-modal images, the convolutional layers (conv_x) are present in every network path. All the convolutional layers have a stride of one pixel.
Table II summarizes the parameters of the baselines and the proposed HyperDenseNet. The network parameters are optimized via the RMSprop optimizer, using cross-entropy as the cost function. Let $\theta$ denote the network parameters (i.e., convolution weights, biases and the slopes of the parametric rectifier units), and let $y_v^s$ be the label of voxel $v$ in the $s$-th image segment. We optimize the following:

$$J(\theta) = -\frac{1}{S \cdot V} \sum_{s=1}^{S} \sum_{v=1}^{V} \log\, p_{y_v^s}(v), \qquad (6)$$

where $p_c(v)$ is the softmax output of the network for voxel $v$ and class $c$, when the input segment is $s$.
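Assuming a PyTorch-style implementation, this cost reduces to the standard voxel-wise cross-entropy; `segment_loss` below is a hypothetical helper, not code from the repository:

```python
import torch.nn.functional as F

def segment_loss(logits, labels):
    """Voxel-wise cross-entropy over a batch of segments.
    logits: (S, C, D, H, W) network outputs; labels: (S, D, H, W) ints.
    cross_entropy applies log-softmax and averages over the S*V voxels."""
    return F.cross_entropy(logits, labels)
```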
To initialize the weights of the network, we adopted the strategy proposed by He et al.: a zero-mean Gaussian distribution with standard deviation $\sqrt{2/n_l}$ is used to initialize the weights in layer $l$, where $n_l$ denotes the number of connections to the units in that layer. Momentum was set to 0.6 and the initial learning rate to 0.001, being reduced by a factor of 2 after every 5 epochs (starting from epoch 10). The network was trained for 30 epochs, each composed of 20 subepochs. At each subepoch, a total of 1000 samples were randomly selected from the training images and processed in batches of size 5.
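A sketch of these settings in PyTorch; the helper names are illustrative and the original implementation may differ in details such as the exact RMSprop formulation:

```python
import math
import torch
import torch.nn as nn

def he_init(module):
    # zero-mean Gaussian with std sqrt(2 / n_l), n_l = fan-in connections
    if isinstance(module, nn.Conv3d):
        fan_in = module.in_channels * math.prod(module.kernel_size)
        nn.init.normal_(module.weight, mean=0.0, std=math.sqrt(2.0 / fan_in))
        if module.bias is not None:
            nn.init.zeros_(module.bias)

def make_optimizer_and_scheduler(model):
    model.apply(he_init)
    opt = torch.optim.RMSprop(model.parameters(), lr=1e-3, momentum=0.6)
    # halve the learning rate every 5 epochs, starting at epoch 10
    sched = torch.optim.lr_scheduler.MultiStepLR(
        opt, milestones=list(range(10, 30, 5)), gamma=0.5)
    return opt, sched
```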
The proposed HyperDenseNet architecture is evaluated on challenging multi-modal image segmentation tasks, using publicly available data from two challenges: infant brain tissue segmentation (iSEG) and adult brain tissue segmentation (MRBrainS, http://mrbrains13.isi.uu.nl). Quantitative evaluations and comparisons with state-of-the-art methods are reported for each of these applications. First, to evaluate the impact of dense connectivity on performance, we compared the proposed HyperDenseNet to the baselines described in Section 2.2 on infant brain tissue segmentation. Then, our results, compiled by the iSEG challenge organizers on testing data, are compared to those from the other competing teams. Second, to juxtapose the performance of HyperDenseNet to other segmentation networks under the same conditions, we provide a quantitative analysis of the results of current state-of-the-art segmentation networks for adult brain tissue segmentation. This includes a comparison to the participants of the MRBrainS challenge. Finally, in Section 3.3, we report a comprehensive analysis of feature re-use.
The focus of this challenge was to compare (semi-)automatic state-of-the-art algorithms for the segmentation of 6-month infant brain tissues in T1- and T2-weighted brain MRI scans. This challenge was carried out in conjunction with MICCAI 2017, with a total of 21 international teams participating in the first round.
The iSEG-2017 organizers used three metrics to evaluate the accuracy of the competing methods: Dice Similarity Coefficient (DSC) , Modified Hausdorff distance (MHD), where the 95-th percentile of all Euclidean distances is employed, and Average Surface Distance (ASD). The first measures the degree of overlap between the segmentation region and ground truth, whereas the other two evaluate boundary distances.
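For reference, the DSC between a predicted mask and the ground truth can be computed as in the minimal numpy sketch below; MHD and ASD additionally require surface-distance computations, omitted here:

```python
import numpy as np

def dice(pred, target):
    """Dice Similarity Coefficient between two binary masks:
    DSC = 2|A n B| / (|A| + |B|)."""
    pred, target = pred.astype(bool), target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    denom = pred.sum() + target.sum()
    return 2.0 * inter / denom if denom > 0 else 1.0
```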
Table III reports the performance achieved by HyperDenseNet and the baselines introduced in Section 2.2, for CSF, GM and WM brain tissues. The results were generated by splitting the 10 available iSEG-2017 volumes into training, validation and testing sets containing 6, 1 and 3 volumes, respectively. To show that improvements do not come from the higher number of learned parameters in HyperDenseNet, we also investigated a widened version of all baselines, with a parameter size similar to HyperDenseNet's. The number of learned parameters of all the tested models is given in Table IV. A more detailed description of the tested architectures can be found in Table VIII of the Supplemental materials.
We observe that the late fusion of deeper-layer features in independent paths provides a clear improvement over the single-path version, with a performance increase of nearly 5%. Fusing the feature maps from independent paths after the first convolutional layer (i.e., Dual-Single) outperformed the other two baselines by 1-2%, particularly for WM and GM, which are the most challenging structures to segment. Also, the results indicate that processing multi-modal data in separate paths, while allowing dense connectivity between all the paths, increases performance over early and late fusion, as well as over disentangled modalities with fusion performed after the first convolutional block. Another interesting finding is that increasing the number of learned parameters does not bring an important boost in performance. Indeed, for some tissues (e.g., CSF for the Single-path and Dual-Single-path architectures), the performance slightly decreased when widening the architecture.
| Connectivity | Architecture | DSC (CSF) | DSC (GM) | DSC (WM) |
|---|---|---|---|---|
| No connectivity between paths | Single Path | 0.9014 | 0.8518 | 0.8370 |
| Connectivity between paths | Dual-Single Path | 0.9552 | 0.9142 | 0.9008 |
Figures 4 and 5 compare the training and validation accuracy of the baselines and HyperDenseNet. In these figures, the mean DSC for the three brain tissues is evaluated during training (top) and validation (bottom). One can see that HyperDenseNet outperforms the baselines in both cases, achieving better results than architectures with a similar number of parameters. The performance improvements seen in Table III, Fig. 4 and Fig. 5 might be due to two factors: the high number of direct connections between different layers, which facilitates back-propagation of the gradient to shallow layers, and the freedom of the network to explore more complex patterns thanks to the combination of several image modalities at any level of abstraction.
The computational efficiency of HyperDenseNet and baselines is compared in Table IV. As expected, inference times are proportional to the number of model parameters. While the lightest architecture needs around 45 seconds to segment a whole 3D brain, HyperDenseNet performs the same task in less than 2 minutes. This is acceptable from a clinical point of view.
| Architecture | Parameters (conv.) | Parameters (fully-conn.) | Parameters (total) | Time (sec), mean (std) |
|---|---|---|---|---|
| Single Path | 2,380,050 | 290,600 | 2,670,650 | 43.67 (8.37) |
| Single Path (widened) | 9,518,850 | 470,600 | 9,989,450 | 101.63 (12.65) |
| Dual Path | 4,760,100 | 470,600 | 5,230,700 | 64.57 (9.45) |
| Dual Path (widened) | 9,381,960 | 614,600 | 9,996,560 | 104.31 (11.65) |
| Dual-Single Path | 2,666,760 | 300,600 | 2,968,200 | 47.33 (8.74) |
| Dual-Single Path (widened) | 9,518,850 | 470,600 | 9,989,450 | 103.64 (13.61) |
Figure 6 depicts visual results for the subject used in validation. It can be seen that HyperDenseNet typically recovers thin regions better than the baselines, which may explain the improvements observed for distance-based metrics. As confirmed in Table III, this effect is most prominent at the boundaries between the gray and white matter. Furthermore, HyperDenseNet produces fewer false positives for WM than the baselines, which tend to over-estimate the segmentation in this region.
Table V compares the segmentation accuracy of HyperDenseNet to that of top-5 ranking methods in the first round of the iSEG Challenge, as well as to all the methods in the second round of submission. We observe that our network ranked among the top-3 methods in 6 out of 9 metrics, considering the results of the first and second rounds of submissions.
A noteworthy point is the general performance decrease of all the methods for the segmentation of GM and WM, with lower DSC and larger ASD values. This confirms that segmenting these tissues is more challenging due to the unclear boundaries between them.
The MRBrainS challenge was initially proposed in conjunction with MICCAI 2013. It focuses on adult brain tissue segmentation in the context of aging, based on three modalities: MRI T1, MRI T1 Inversion Recovery (IR) and MR-FLAIR. To this day, a total of 47 international teams have participated in this challenge.
The organizers used three types of evaluation measures: a spatial overlap measure (DSC), a boundary distance measure (MHD) and a volumetric measure (the percentage of absolute volume difference).
We compare HyperDenseNet to three state-of-the-art networks for medical image segmentation. The first architecture is a 3D fully convolutional neural network with residual connections, which we denote FCN_Res3D. The second one, referred to as UNet3D, is a U-Net model with residual connections in the encoder and 3D volumes as input. Finally, our comparison also includes DeepMedic, which showed an outstanding performance in brain lesion segmentation. The implementation details of these architectures are described in the Supplemental materials.
We performed a leave-one-out-cross-validation (LOOCV) on the 5 available MRBrainS datasets, using 4 subjects for training and one for validation. We trained and tested models three times, each time using a different subject for validation, and measured the average accuracy over these three folds. For this experiment, we used all three modalities (i.e., T1, T1 IR and FLAIR) for all competing methods. In a second set of experiments, we assessed the impact of integrating multiple imaging modalities on the performance of HyperDenseNet using all possible combinations of two modalities as input.
Mean DSC (standard deviation) per tissue:

| Method | CSF | GM | WM |
|---|---|---|---|
| FCN_Res3D (3 modalities) | 0.7685 (0.0161) | 0.8163 (0.0222) | 0.8607 (0.0178) |
| UNet3D (3 modalities) | 0.8218 (0.0159) | 0.8432 (0.0241) | 0.8841 (0.0123) |
| DeepMedic (3 modalities) | 0.8292 (0.0094) | 0.8522 (0.0193) | 0.8884 (0.0137) |
| HyperDenseNet (T1-FLAIR) | 0.8259 (0.0133) | 0.8620 (0.0260) | 0.8982 (0.0138) |
| HyperDenseNet (T1_IR-FLAIR) | 0.7991 (0.0181) | 0.8226 (0.0255) | 0.8654 (0.0087) |
| HyperDenseNet (T1-T1_IR) | 0.8191 (0.0297) | 0.8498 (0.0173) | 0.8913 (0.0082) |
| HyperDenseNet (3 modalities) | 0.8485 (0.0078) | 0.8663 (0.0247) | 0.9016 (0.0109) |
Table VI reports the mean DSC and standard-deviation values of the tested models, with FCN_Res3D exhibiting the lowest mean DSC. This performance might be explained by the transposed convolutions in FCN_Res3D, which may cause voxel misclassification within small regions. Furthermore, the downsampling and upsampling operations in FCN_Res3D could make the feature maps in hidden layers sparser than the original inputs, causing a loss of image details. A strategy to avoid this problem is having skip connections, as in UNet3D, which propagate information at different levels of abstraction between the encoding and decoding paths. This can be observed in the results, where UNet3D clearly outperforms FCN_Res3D in all the metrics.
Moreover, DeepMedic obtained better results than its competitors, yielding a performance close to the different two-modality configurations of HyperDenseNet. The dual multiscale path is an important feature of DeepMedic which gives the network a larger receptive field via two paths, one for the input image and the other processing a low-resolution version of the input. This, in addition to the removal of pooling operations in DeepMedic, could explain the increase in performance with respect to FCN_Res3D and UNet3D.
Comparing the different modality combinations, the two-modality versions of HyperDenseNet yielded competitive performances, although there is a significant variability between the three configurations. Using only MRI T1 and FLAIR places HyperDenseNet first for two DSC measures (GM and WM), and second for the remaining measure (CSF), even though competing methods used all three modalities. However, HyperDenseNet with three modalities yields significantly better segmentations, with the highest mean DSC values for all three tissues.
The MRBrainS challenge organizers compiled the results and a ranking of 47 international teams (http://mrbrains13.isi.uu.nl/results.php). In Table VII, we report the results of the top-10 methods. We see that HyperDenseNet ranks first among the competing methods, obtaining the best DSC and MHD values for GM and WM. Interestingly, the BCH_CRL_IMAGINE and MSL_SKKU teams participated in both the iSEG and MRBrainS 2013 challenges. While these two networks outperformed HyperDenseNet in the iSEG challenge, the performance of our model was noticeably superior in the MRBrainS challenge, with HyperDenseNet ranked 1st, MSL_SKKU ranked 4th and BCH_CRL_IMAGINE ranked 18th (ranking of February 2018). Considering the fact that three modalities are employed in MRBrainS, unlike the two modalities used in iSEG, these results suggest that HyperDenseNet has stronger representation-learning power as the number of modalities increases.
| Method | GM: DSC / MHD / AVD | WM: DSC / MHD / AVD | CSF: DSC / MHD / AVD | Overall |
|---|---|---|---|---|
| VoxResNet + Auto-context | 0.8615 / 1.44 / 6.60 | 0.8946 / 1.93 / 6.05 | 0.8425 / 2.19 / 7.69 | 54 |
A typical example of segmentation results is depicted in Fig. 7. In these images, red arrows indicate regions where the two-modality versions of HyperDenseNet fail in comparison to the three-modality version. As expected, most errors of these networks occur at the boundary between the GM and WM (see images in Fig. 1, for example). Moreover, we observe that HyperDenseNet using three modalities can handle thin regions better than its two-modality versions.
Dense connectivity enables each network layer to access feature maps from all its preceding layers, strengthening feature propagation and encouraging feature re-use. To investigate the degree to which features are re-used in the trained network, we computed, for each convolutional layer, the average L1-norm of its connection weights to previous layers in any stream. This serves as a surrogate for the dependency of a given layer on its preceding layers. We normalized the values between 0 and 1 to facilitate visualization.
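A sketch of this analysis is given below; `connection_strengths` is a hypothetical helper that assumes each convolution's input channels are the ordered concatenation of earlier layers' outputs:

```python
import numpy as np

def connection_strengths(weight, source_channels):
    """Average |weight| that a layer assigns to each preceding layer.

    weight: conv kernel of shape (out_ch, in_ch, kD, kH, kW), whose input
    channels are the concatenated outputs of earlier layers (both streams).
    source_channels: number of channels contributed by each source layer.
    Returns values normalized to [0, 1], as in the re-use plots."""
    strengths, start = [], 0
    for c in source_channels:
        block = weight[:, start:start + c]  # channels from one source layer
        strengths.append(np.abs(block).mean())
        start += c
    strengths = np.array(strengths)
    return strengths / strengths.max()
```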
Figure 8 depicts the weights of HyperDenseNet trained with two modalities, for both iSEG and MRBrainS challenges. As the MRBrainS dataset contains three modalities, we have three different two-modality configurations. The average weights for the case of three modalities are shown in Fig. 9. A dark square in these plots indicates that the target layer (on x-axis) makes a strong use of the features produced by the source layer (on y-axis). An important observation that one can make from both figures is that, in most cases, all layers spread the importance of the connections over many previous layers, not only within the same path, but also from the other streams. This indicates that shallow layer features are directly used by deeper layers from both paths, which confirms the usefulness of hyper-dense connections for information flow and learning complex relationships between modalities within different levels of abstractions.
Considering the challenge datasets separately, for HyperDenseNet trained on iSEG (top row of Fig. 8), immediately preceding layers typically have a higher impact on the connections from both paths. Furthermore, the connections having access to MRI T2 features typically have the strongest values, which may indicate that T2 is more discriminative than T1 in this particular situation. We can also observe some regions with high (>0.5) feature re-use patterns from shallow to deep layers. The same behaviour is seen for HyperDenseNet trained on two modalities from the MRBrainS challenge, where immediately preceding layers have a high impact on the connections within and in-between the paths. The re-use of low-level features by deeper layers is more evident than in the previous case. For example, in HyperDenseNet trained with T1-IR and FLAIR, deep layers in the T1-IR path make strong use of features extracted in shallower layers of the same path, as well as in the path corresponding to FLAIR. This strong re-use of early features from both paths occurred across all tested configurations. The same pattern is observed when using three modalities (Fig. 9), with a strong re-use of shallow features by the network's last layers. This reflects the importance of giving deep layers access to early-extracted features. Additionally, it suggests that learning how and where to fuse information from multiple sources is more effective than combining these sources at early or late stages.
This study investigated a hyper-densely connected 3D fully CNN, HyperDenseNet, with applications to brain tissue segmentation in multi-modal MRI. Our model leverages dense connectivity beyond recent works [39, 40, 41], exploiting the concept in multi-path architectures. Unlike these works, dense connections occur not only within the stream of individual modalities, but also across different streams. This gives the network total freedom to explore complex combinations between features of different modalities, within and in-between all levels of abstraction. We reported a comprehensive evaluation using the benchmarks of two highly competitive challenges, iSEG-2017 for 6-month infant brain segmentation and MRBrainS for adult data, and showed state-of-the-art performances of HyperDenseNet on both datasets. Our experiments provided new insights on the inclusion of short-cut connections in deep neural networks for segmenting medical images, particularly in multi-modal scenarios. In summary, this work demonstrated the potential of HyperDenseNet to tackle challenging medical image segmentation problems involving multi-modal volumetric data.
This work is supported by the Natural Sciences and Engineering Research Council of Canada (NSERC), discovery grant program, and by the ETS Research Chair on Artificial Intelligence in Medical Imaging. The authors would like to thank both the iSEG and MRBrainS organizers for providing data benchmarks and evaluations.
R. Ranjan, V. M. Patel, and R. Chellappa, “HyperFace: A deep multi-task learning framework for face detection, landmark localization, pose estimation, and gender recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4373–4382.
K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification,” in Proceedings of the IEEE ICCV, 2015, pp. 1026–1034.