Semantic segmentation is an active area of research in medical image analysis. With the introduction of Convolutional Neural Networks (CNNs), significant performance improvements have been achieved on many standard datasets. For example, on the EM ISBI 2012 dataset, BRATS, or MS lesions, the top entries are built on CNNs [16, 4, 7, 3].
All these methods are based on Fully Convolutional Networks (FCN). While CNNs are typically realized by a contracting path built from convolutional, pooling, and fully connected layers, an FCN adds an expanding path built with deconvolutional or unpooling layers. The expanding path recovers spatial information by merging features skipped from the various resolution levels of the contracting path.
Variants of these skip connections have been proposed in the literature. In , upsampled feature maps are summed with feature maps skipped from the contracting path, while in  they are concatenated, with convolutions and non-linearities added between each upsampling step. These skip connections have been shown to help recover the full spatial resolution at the network output, making fully convolutional methods suitable for semantic segmentation. We refer to these skip connections as long skip connections.
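As an illustration, the two long-skip merge styles (summation vs. concatenation) can be sketched on toy numpy feature maps; shapes and function names here are illustrative, not the paper's implementation:

```python
import numpy as np

def merge_sum(upsampled, skipped):
    # Summation-style long skip: feature maps must have identical shape.
    assert upsampled.shape == skipped.shape
    return upsampled + skipped

def merge_concat(upsampled, skipped):
    # Concatenation-style long skip: stack along the channel axis;
    # a subsequent convolution would mix the combined channels.
    return np.concatenate([upsampled, skipped], axis=0)

# Toy feature maps with shape (channels, height, width).
up = np.ones((4, 8, 8))
skip = np.full((4, 8, 8), 2.0)
assert merge_sum(up, skip).shape == (4, 8, 8)
assert merge_concat(up, skip).shape == (8, 8, 8)
```

Summation keeps the channel count fixed, while concatenation doubles it and defers the mixing to the next convolution.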
However, network depth is limited by the issue of vanishing gradients when backpropagating the signal across many layers. In , this problem is addressed with additional levels of supervision, while in [8, 9] skip connections are added around non-linearities, creating shortcuts through which the gradient can flow uninterrupted, allowing parameters deep in the network to be updated. Moreover,  have shown that these skip connections allow for faster convergence during training. We refer to these skip connections as short skip connections.
In this paper, we explore deep, fully convolutional networks for semantic segmentation. We expand FCN by adding short skip connections that allow us to build very deep FCNs. With this setup, we perform an analysis of short and long skip connections on a standard biomedical dataset (EM ISBI 2012 challenge data). We observe that short skip connections speed up the convergence of the learning process; moreover, we show that a very deep architecture with a relatively small number of parameters can reach near-state-of-the-art performance on this dataset. Thus, the contributions of the paper can be summarized as follows:
We extend Residual Networks to fully convolutional networks for semantic image segmentation (see Section 2).
We show that a very deep network without any post-processing achieves performance comparable to the state of the art on EM data (see Section 3.1).
We show that long and short skip connections are beneficial for the convergence of very deep networks (see Section 3.2).
2 Residual network for semantic image segmentation
Our approach extends Residual Networks  to segmentation tasks by adding an expanding (upsampling) path (Figure 1). We perform spatial reduction along the contracting path (left) and expansion along the expanding path (right). As in  and , spatial information lost along the contracting path is recovered in the expanding path by skipping equal resolution features from the former to the latter. Similarly to the short skip connections in Residual Networks, we choose to sum the features on the expanding path with those skipped over the long skip connections.
We consider three types of blocks, each containing at least one convolution and activation function: bottleneck, basic block, and simple block (Figure 1). Each block can perform batch normalization on its inputs, as well as spatial downsampling at the input (marked blue; used for the contracting path) and spatial upsampling at the output (marked yellow; used for the expanding path). The bottleneck and basic block are based on those introduced in , which include short skip connections that skip the block input to its output with minimal modification, encouraging the path through the non-linearities to learn a residual representation of the input data. To minimize modification of the input, we apply no transformations along the short skip connections, except when the number of filters or the spatial resolution needs to be adjusted to match the block output. We use convolutions to adjust the number of filters, but for spatial adjustment we rely on simple decimation or simple repetition of rows and columns of the input so as not to increase the number of parameters. We add an optional dropout layer to all blocks along the residual path.
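The parameter-free spatial adjustments described above (decimation for downsampling, row/column repetition for upsampling) can be sketched in numpy; function names are illustrative:

```python
import numpy as np

def decimate(x, factor=2):
    # Downsample a (channels, H, W) map by keeping every `factor`-th row/column.
    return x[:, ::factor, ::factor]

def repeat(x, factor=2):
    # Upsample by repeating rows and columns (nearest-neighbour);
    # like decimation, this adds no learnable parameters.
    return np.repeat(np.repeat(x, factor, axis=1), factor, axis=2)

x = np.arange(16, dtype=float).reshape(1, 4, 4)
assert decimate(x).shape == (1, 2, 2)
assert repeat(x).shape == (1, 8, 8)
assert repeat(decimate(x)).shape == x.shape
```

Both operations keep the shortcut as close to an identity mapping as possible, leaving the residual path to learn the transformation.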
We experimented with both binary cross-entropy and Dice loss functions. Let $p_i$ be the output of the last network layer at pixel $i$, passed through a sigmoid non-linearity, and let $y_i$ be the corresponding label. The binary cross-entropy is then defined as follows:

$$\mathcal{L}_{bce} = -\sum_i \left( y_i \log p_i + (1 - y_i) \log(1 - p_i) \right)$$

The Dice loss is defined as the negated soft Dice coefficient:

$$\mathcal{L}_{dice} = -\frac{2\sum_i p_i y_i}{\sum_i p_i + \sum_i y_i}$$
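A minimal numpy sketch of the two losses, assuming per-pixel sigmoid outputs `p` and binary labels `y`; this is an illustrative implementation, not the original training code:

```python
import numpy as np

def binary_cross_entropy(p, y, eps=1e-7):
    # p: sigmoid outputs in (0, 1); y: binary labels.
    p = np.clip(p, eps, 1 - eps)          # avoid log(0)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def dice_loss(p, y, eps=1e-7):
    # Negated soft Dice coefficient over the whole prediction.
    num = 2.0 * np.sum(p * y)
    den = np.sum(p) + np.sum(y) + eps
    return -num / den

p = np.array([0.9, 0.8, 0.1, 0.2])
y = np.array([1.0, 1.0, 0.0, 0.0])
# Both losses are lower for the correct prediction than for its inversion.
assert binary_cross_entropy(p, y) < binary_cross_entropy(1 - p, y)
assert dice_loss(p, y) < dice_loss(1 - p, y)
```

Note that the Dice loss is computed over whole images rather than per pixel, which is one reason its outputs tend toward near-binary predictions.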
3.1 Segmenting EM data
EM training data consist of images ( pixels) assembled from serial section transmission electron microscopy of the Drosophila first instar larva ventral nerve cord. The test set is another set of images for which labels are not provided. Throughout the experiments, we used images for training, leaving images for validation.
During training, we augmented the input data using random flipping, shearing, rotations, and spline warping. We used the same spline warping strategy as . We used full-resolution () images as input without applying random cropping for data augmentation. For each training run, the model version with the best validation loss was stored and evaluated. A detailed description of the highest performing architecture used in the experiments is shown in Table 1.
| Layer name | block type | output resolution | output width | repetition number |
|---|---|---|---|---|
| Down 2 | simple block |  | 32 | 1 |
| Up 4 | simple block |  | 32 | 1 |
| Up 5 | conv 3x3 |  | 32 | 1 |
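The flip/rotation part of the augmentation described above can be sketched in numpy; shearing and spline warping are omitted here for brevity, so this is only a partial, illustrative pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(image, label):
    # Apply the SAME random flip/rotation to image and label so they stay aligned.
    if rng.random() < 0.5:
        image, label = image[:, ::-1], label[:, ::-1]   # horizontal flip
    if rng.random() < 0.5:
        image, label = image[::-1, :], label[::-1, :]   # vertical flip
    k = int(rng.integers(0, 4))                         # rotate by k * 90 degrees
    return np.rot90(image, k), np.rot90(label, k)

img = np.arange(16.0).reshape(4, 4)
lbl = (img > 7).astype(float)
aug_img, aug_lbl = augment(img, lbl)
assert aug_img.shape == img.shape and aug_lbl.shape == lbl.shape
```

Since flips and 90-degree rotations only permute pixels, no interpolation is needed and labels remain exactly binary.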
Interestingly, we found that while the predictions from models trained with the cross-entropy loss were of high quality, those produced by models trained with the Dice loss appeared visually cleaner, being almost binary (similar observations were reported in a parallel work ); borders that would appear fuzzy in the former (see Figure 2) would be left as gaps in the latter (Figure 2). However, we found that border continuity can be improved for models trained with the Dice loss by implicit model averaging over output samples drawn at test time using dropout  (Figure 2). This yields better performance on the validation and test metrics than the output of models trained with binary cross-entropy (see Table 2).
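The test-time averaging can be sketched as follows, using a toy single-layer model with dropout kept active at inference; the model, its weights, and the sample count are illustrative assumptions, not the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def predict_with_dropout(x, w, drop_prob=0.5):
    # Toy stochastic forward pass: dropout stays active at test time.
    mask = rng.random(w.shape) >= drop_prob
    logits = x @ (w * mask) / (1 - drop_prob)     # inverted-dropout scaling
    return 1.0 / (1.0 + np.exp(-logits))          # sigmoid output

def mc_average(x, w, n_samples=50):
    # Implicit model averaging: mean over stochastic output samples.
    return np.mean([predict_with_dropout(x, w) for _ in range(n_samples)], axis=0)

x = rng.standard_normal((2, 8))
w = rng.standard_normal((8, 1))
avg = mc_average(x, w)
assert avg.shape == (2, 1)
assert np.all((avg > 0) & (avg < 1))
```

Averaging many stochastic samples smooths the near-binary Dice outputs, which is what fills in the border gaps mentioned above.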
The two metrics used for this dataset are the maximal foreground-restricted Rand score after thinning and the maximal foreground-restricted information theoretic score after thinning. For a detailed description of the metrics, please refer to .
| Method | FCN | post-processing | average over | parameters (M) |
|---|---|---|---|---|
Our results are comparable to other published results that establish the state of the art for the EM dataset (Table 2). Note that we did not apply any post-processing to the resulting segmentations. We match the performance of U-Net, for which predictions are averaged over seven rotations of the input images, while using fewer parameters and without sophisticated class weighting. Note that among the other FCNs on the leader board, CUMedVision uses post-processing to boost performance.
3.2 On the importance of skip connections
The focus of this paper is to evaluate the utility of long and short skip connections for training fully convolutional networks for image segmentation. In this section, we investigate the learning behavior of the model with short and with long skip connections, paying specific attention to parameter updates at each layer of the network. We first explored variants of our best performing deep architecture (from Table 1), using the binary cross-entropy loss. Maintaining the same hyperparameters, we trained (Model 1) with long and short skip connections, (Model 2) with only short skip connections, and (Model 3) with only long skip connections. Training curves are presented in Figure 3, and the final loss and accuracy values on the training and the validation data are presented in Table 3.
We note that for our deep architecture, the variant with both long and short skip connections is not only the one that performs best but also converges faster than without short skip connections. This increase in convergence speed is consistent with the literature . Not surprisingly, the combination of both long and short skip connections performed better than having only one type of skip connection, both in terms of performance and convergence speed. At this depth, a network could not be trained without any skip connections. Finally, short skip connections appear to stabilize updates (note the smoothness of the validation loss plots in Figures 3 and 3 as compared to Figure 3).
| Method | training loss | validation loss |
|---|---|---|
| Long and short skip connections | 0.163 | 0.162 |
| Only short skip connections | 0.188 | 0.202 |
| Only long skip connections | 0.205 | 0.188 |
We expect that layers closer to the center of the model cannot be effectively updated due to the vanishing gradient problem, which is alleviated by short skip connections. This identity shortcut effectively introduces shorter paths through fewer non-linearities to the deep layers of our models. We validate this empirically on a range of models of varying depth by visualizing the mean model parameter updates at each layer for each epoch (see sample results in Figure 4). To simplify the analysis and visualization, we used simple blocks instead of bottleneck blocks.
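The per-layer update statistic can be sketched as follows; the layer names and the simulated epoch (with an artificially stalled middle layer) are illustrative assumptions:

```python
import numpy as np

def mean_update_magnitudes(params_before, params_after):
    # For each layer, record the mean absolute parameter change over one epoch.
    return {name: float(np.mean(np.abs(params_after[name] - params_before[name])))
            for name in params_before}

rng = np.random.default_rng(0)
before = {"layer%d" % i: rng.standard_normal((4, 4)) for i in range(3)}
# Simulate an epoch in which the middle layer barely moves,
# mimicking a vanishing-gradient bottleneck at the network's center.
after = {name: p + (0.001 if name == "layer1" else 0.1) * rng.standard_normal((4, 4))
         for name, p in before.items()}
updates = mean_update_magnitudes(before, after)
assert updates["layer1"] < updates["layer0"]
assert updates["layer1"] < updates["layer2"]
```

Plotting these per-layer magnitudes over epochs reproduces the kind of visualization discussed here: healthy training shows comparable magnitudes across depth, while vanishing gradients show a dip at the center.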
Parameter updates appear to be well distributed when short skip connections are present (Figure 4). When the short skip connections are removed, we find that for deep models, the deep parts of the network (at the center, Figure 4) receive few updates, as expected. When long skip connections are retained, at least the shallow parts of the model can be updated (see both sides of Figure 4), as these connections provide shortcuts for gradient flow. Interestingly, we observed that model performance actually drops when short skip connections are used in models that are shallow enough for all layers to be well updated (e.g., Figure 4). Moreover, batch normalization was observed to increase the maximal updatable depth of the network. Networks without batch normalization had diminishing updates toward the center of the network and, with long skip connections, were less stable, requiring a lower learning rate (e.g., Figure 4).
It is also interesting to observe that the bulk of updates in all tested model variations (also visible in those shown in Figure 4) were always initially near or at the classification layer. This follows the findings of , where it is shown that even randomly initialized weights can confer a surprisingly large portion of a model’s performance after training only the classifier.
In this paper, we studied the influence of skip connections on FCN for biomedical image segmentation. We showed that a very deep network can achieve results near the state of the art on the EM dataset without any further post-processing. We confirm that although long skip connections provide a shortcut for gradient flow in shallow layers, they do not alleviate the vanishing gradient problem in deep networks. Consequently, we apply short skip connections to FCNs and confirm that this increases convergence speed and allows training of very deep networks.
We would like to thank all the developers of Theano and Keras for providing such powerful frameworks. We gratefully acknowledge NVIDIA for GPU donation to our lab at École Polytechnique. The authors would like to thank Lisa di Jorio, Adriana Romero and Nicolas Chapados for insightful discussions. This work was partially funded by Imagia Inc., MITACS (grant number IT05356) and MEDTEQ.
-  Al-Rfou, R., Alain, G., Almahairi, A., et al.: Theano: A python framework for fast computation of mathematical expressions. CoRR abs/1605.02688 (2016)
-  Arganda-Carreras, I., Turaga, S.C., Berger, D.R., et al.: Crowdsourcing the creation of image segmentation algorithms for connectomics. Frontiers in Neuroanatomy 9(142) (2015)
-  Brosch, T., Tang, L.Y.W., Yoo, Y., et al.: Deep 3d convolutional encoder networks with shortcuts for multiscale feature integration applied to multiple sclerosis lesion segmentation. IEEE TMI 35(5), 1229–1239 (May 2016)
-  Chen, H., Qi, X., Cheng, J., Heng, P.A.: Deep contextual networks for neuronal structure segmentation. In: Proceedings of the 13th AAAI Conference on Artificial Intelligence, February 12-17, 2016, Phoenix, Arizona, USA. pp. 1167–1173 (2016)
-  Chollet, F.: Keras. https://github.com/fchollet/keras (2015)
-  Ciresan, D., Giusti, A., Gambardella, L.M., Schmidhuber, J.: Deep neural networks segment neuronal membranes in electron microscopy images. In: NIPS 25, pp. 2843–2851. Curran Associates, Inc. (2012)
-  Havaei, M., Davy, A., Warde-Farley, D., et al.: Brain tumor segmentation with deep neural networks. CoRR abs/1505.03540 (2015)
-  He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. CoRR abs/1512.03385 (2015)
-  He, K., Zhang, X., Ren, S., Sun, J.: Identity mappings in deep residual networks. CoRR abs/1603.05027 (2016)
-  Kendall, A., Badrinarayanan, V., Cipolla, R.: Bayesian segnet: Model uncertainty in deep convolutional encoder-decoder architectures for scene understanding. CoRR abs/1511.02680 (2015)
-  Liu, T., Jones, C., Seyedhosseini, M., Tasdizen, T.: A modular hierarchical approach to 3d electron microscopy image segmentation. Journal of Neuroscience Methods 226, 88 – 102 (2014)
-  Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. CVPR (to appear) (Nov 2015)
-  Menze, B., Jakab, A., Bauer, S., et al.: The Multimodal Brain Tumor Image Segmentation Benchmark (BRATS). IEEE TMI p. 33 (2014)
-  Milletari, F., Navab, N., Ahmadi, S.: V-net: Fully convolutional neural networks for volumetric medical image segmentation. CoRR abs/1606.04797 (2016)
-  Romero, A., Ballas, N., Kahou, S.E., Chassang, A., Gatta, C., Bengio, Y.: Fitnets: Hints for thin deep nets. CoRR abs/1412.6550 (2014)
-  Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image segmentation. CoRR abs/1505.04597 (2015)
-  Saxe, A., Koh, P.W., Chen, Z., Bhand, M., Suresh, B., Ng, A.Y.: On random weights and unsupervised feature learning. In: Getoor, L., Scheffer, T. (eds.) Proceedings of the 28th International Conference on Machine Learning (ICML-11). pp. 1089–1096. ACM, New York, NY, USA (2011)
-  Stollenga, M.F., Byeon, W., Liwicki, M., Schmidhuber, J.: Parallel multi-dimensional lstm, with application to fast biomedical volumetric image segmentation. CoRR abs/1506.07452 (2015)
-  Styner, M., Lee, J., Chin, B., et al.: 3d segmentation in the clinic: A grand challenge ii: Ms lesion segmentation (11 2008)
-  Szegedy, C., Ioffe, S., Vanhoucke, V.: Inception-v4, inception-resnet and the impact of residual connections on learning. CoRR abs/1602.07261 (2016)
-  Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S.E., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A.: Going deeper with convolutions. CoRR abs/1409.4842 (2014)
-  Tieleman, T., Hinton, G.: Lecture 6.5—RmsProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning (2012)
-  Uzunbaş, M.G., Chen, C., Metaxsas, D.: Optree: A Learning-Based Adaptive Watershed Algorithm for Neuron Segmentation, pp. 97–105. Springer International Publishing, Cham (2014)
-  Wu, X.: An iterative convolutional neural network algorithm improves electron microscopy image segmentation. CoRR abs/1506.05849 (2015)