Code for "Global Deconvolutional Networks" paper (BMVC 2016)
Semantic image segmentation is a principal problem in computer vision, where the aim is to correctly classify each individual pixel of an image into a semantic label. Its widespread use in many areas, including medical imaging and autonomous driving, has fostered extensive research in recent years. Empirical improvements in tackling this task have primarily been motivated by successful exploitation of Convolutional Neural Networks (CNNs) pre-trained for image classification and object recognition. However, the pixel-wise labelling with CNNs has its own unique challenges: (1) an accurate deconvolution, or upsampling, of low-resolution output into a higher-resolution segmentation mask and (2) an inclusion of global information, or context, within locally extracted features. To address these issues, we propose a novel architecture to conduct the equivalent of the deconvolution operation globally and acquire dense predictions. We demonstrate that it leads to improved performance of state-of-the-art semantic segmentation models on the PASCAL VOC 2012 benchmark, reaching 74.0% mean IoU.
Convolutional Neural Networks [LeCun et al.(1989)LeCun, Boser, Denker, Henderson, Howard, Hubbard, and Jackel]
have become an essential part of deep learning models[LeCun et al.(2015)LeCun, Bengio, and Hinton] designed to tackle a wide range of computer vision tasks including image classification and recognition [Krizhevsky et al.(2012)Krizhevsky, Sutskever, and Hinton, Sermanet et al.(2013)Sermanet, Eigen, Zhang, Mathieu, Fergus, and LeCun, Simonyan and Zisserman(2014), Szegedy et al.(2014)Szegedy, Liu, Jia, Sermanet, Reed, Anguelov, Erhan, Vanhoucke, and Rabinovich, Zeiler and Fergus(2014)], image captioning [Karpathy and Li(2015), Xu et al.(2015)Xu, Ba, Kiros, Cho, Courville, Salakhutdinov, Zemel, and Bengio, Vinyals et al.(2015)Vinyals, Toshev, Bengio, and Erhan], object detection [Girshick et al.()Girshick, Donahue, Darrell, and Malik, Girshick(2015), Ren et al.(2015)Ren, He, Girshick, and Sun]. Recent advances in computing technologies with efficient utilisation of Graphical Processing Units (GPUs), as well as availability of large-scale datasets [Deng et al.(2009)Deng, Dong, Socher, Li, Li, and Li, Lin et al.(2014)Lin, Maire, Belongie, Hays, Perona, Ramanan, Dollár, and Zitnick] have been among primary factors in such a rapid rise in CNN popularity.
An adaptation of convolutional network models [Long et al.(2015)Long, Shelhamer, and Darrell]
, pre-trained for the image classification task, has fostered extensive research on the exploitation of CNNs in semantic image segmentation - a problem of marking (or classifying) each pixel of the image with one of the given semantic labels. Among important applications of this problem are road scene understanding[Álvarez et al.(2012)Álvarez, Gevers, LeCun, and López, Badrinarayanan et al.(2015)Badrinarayanan, Handa, and Cipolla, Sturgess et al.(2009)Sturgess, Alahari, Ladicky, and Torr], biomedical imaging [Ronneberger et al.(2015)Ronneberger, Fischer, and Brox, Ciresan et al.(2012)Ciresan, Giusti, Gambardella, and Schmidhuber], aerial imaging [Kluckner et al.(2009)Kluckner, Mauthner, Roth, and Bischof, Mnih and Hinton(2010)].
Recent breakthrough methods in the area have efficiently and effectively combined neural networks with probabilistic graphical models, such as Conditional Random Fields (CRFs) [Chen et al.(2014)Chen, Papandreou, Kokkinos, Murphy, and Yuille, Lin et al.(2015)Lin, Shen, Reid, and van den Hengel, Zheng et al.(2015)Zheng, Jayasumana, Romera-Paredes, Vineet, Su, Du, Huang, and Torr] and Markov Random Fields (MRFs) [Liu et al.(2015)Liu, Li, Luo, Loy, and Tang]
. These approaches usually refine per-pixel features extracted by CNNs (so-called ‘unary potentials’) with the help of pairwise similarities between the pixels based on location and colour features, followed by an approximate inference of the obtained fully connected graphical model [Krähenbühl and Koltun(2011)].
In this work, we address two main challenges of current CNN-based semantic segmentation methods [Chen et al.(2014)Chen, Papandreou, Kokkinos, Murphy, and Yuille, Long et al.(2015)Long, Shelhamer, and Darrell]: an effective deconvolution, or upsampling, of low-resolution output from a neural network; and an inclusion of global information, or context, into existing models without relying on graphical models. Our contribution is twofold: i) we propose a novel approach for performing the deconvolution operation of the encoded signal and ii) demonstrate that this new architecture, called ‘Global Deconvolutional Network’, achieves close to the state-of-the-art performance on semantic segmentation with a simpler architecture and significantly lower number of parameters in comparison to the existing models [Chen et al.(2014)Chen, Papandreou, Kokkinos, Murphy, and Yuille, Noh et al.(2015)Noh, Hong, and Han, Liu et al.(2015)Liu, Li, Luo, Loy, and Tang].
The rest of the paper is structured as follows. We briefly explore recent common practices of semantic segmentation models in Section 2. Section 3 presents our approach designed to overcome the issues outlined above. Section 4 describes the experimental part, including the evaluation results of the proposed method on the popular PASCAL VOC dataset. Finally, Section 5 contains conclusions.
Exploitation of fully convolutional neural networks has become ubiquitous in semantic image segmentation ever since the publication of the paper by Long et al. [Long et al.(2015)Long, Shelhamer, and Darrell]. Further research has been concerned with the combination of CNNs and probabilistic graphical models [Chen et al.(2014)Chen, Papandreou, Kokkinos, Murphy, and Yuille, Lin et al.(2015)Lin, Shen, Reid, and van den Hengel, Liu et al.(2015)Liu, Li, Luo, Loy, and Tang, Zheng et al.(2015)Zheng, Jayasumana, Romera-Paredes, Vineet, Su, Du, Huang, and Torr], training in the presence of weakly-labelled data [Hong et al.(2015)Hong, Noh, and Han, Papandreou et al.(2015)Papandreou, Chen, Murphy, and Yuille, Russakovsky et al.(2015a)Russakovsky, Bearman, Ferrari, and Li], and learning an additional (deconvolutional) network [Noh et al.(2015)Noh, Hong, and Han].
The problem of incorporating contextual information has been an active research topic in computer vision [Rabinovich et al.(2007)Rabinovich, Vedaldi, Galleguillos, Wiewiora, and Belongie, Heitz and Koller(2008), Divvala et al.(2009)Divvala, Hoiem, Hays, Efros, and Hebert, Doersch et al.(2014)Doersch, Gupta, and Efros, Mottaghi et al.(2014)Mottaghi, Chen, Liu, Cho, Lee, Fidler, Urtasun, and Yuille]. To some extent, probabilistic graphical models address this issue in semantic segmentation and can be either a) used as a separate post-processing step [Chen et al.(2014)Chen, Papandreou, Kokkinos, Murphy, and Yuille] or b) trained end-to-end with CNNs [Lin et al.(2015)Lin, Shen, Reid, and van den Hengel, Liu et al.(2015)Liu, Li, Luo, Loy, and Tang, Zheng et al.(2015)Zheng, Jayasumana, Romera-Paredes, Vineet, Su, Du, Huang, and Torr]. In setting a), graphical models are unable to refine the parameters of the CNN, and thus the errors of the CNN are carried over into the post-processing step. In b), on the other hand, one needs to carefully design the inference part in terms of standard neural network operations, and it still relies on computing high-dimensional Gaussian kernels [Adams et al.(2010)Adams, Baek, and Davis]. Besides that, Yu and Koltun [Yu and Koltun(2015)] have recently shown that dilated convolution filters are broadly applicable and also make it possible to increase the contextual capacity of the network.
In terms of improving the deconvolutional part of the network for dense predictions, there has been a prevalence of using information from lower layers: so-called ‘Skip Architecture’ [Long et al.(2015)Long, Shelhamer, and Darrell, Ronneberger et al.(2015)Ronneberger, Fischer, and Brox] and Multi-scale [Chen et al.(2014)Chen, Papandreou, Kokkinos, Murphy, and Yuille] are two notable examples. Noh et al. [Noh et al.(2015)Noh, Hong, and Han] proposed to train a separate deconvolutional network to effectively decode information from the original fully convolutional model. While these methods have given better results, all of them contain significantly more parameters than the corresponding baseline models.
In turn, we propose another approach, called ‘Global Deconvolutional Network’, which includes a global interpolation block with an additional recognition loss, and gives better results than multi-scale and ‘skip’ variants. The depiction of our architecture is presented in Figure 1.
In this section, we describe our approach intended to boost the performance of deep learning models on the semantic segmentation task.
As baseline models, we choose two publicly available deep CNN models: FCN-32s (https://github.com/BVLC/caffe/wiki/Model-Zoo) [Long et al.(2015)Long, Shelhamer, and Darrell] and DeepLab (https://bitbucket.org/deeplab/deeplab-public/) [Chen et al.(2014)Chen, Papandreou, Kokkinos, Murphy, and Yuille]. Both of them are based on the VGG 16-layer net [Simonyan and Zisserman(2014)] from the ILSVRC-2014 competition [Russakovsky et al.(2015b)Russakovsky, Deng, Su, Krause, Satheesh, Ma, Huang, Karpathy, Khosla, Bernstein, Berg, and Li]
. This network contains 16 weight layers, including two fully-connected ones, and can be represented as hierarchical stacks of convolutional layers with rectified linear unit non-linearities [Glorot et al.(2011)Glorot, Bordes, and Bengio] followed by pooling operations after each stack. The output of the fully-connected layers is fed into a softmax classifier.
For semantic segmentation, where one needs to acquire dense predictions, the fully-connected layers have been replaced by convolution filters followed by a learnable deconvolution or fixed bilinear interpolation to match the original spatial dimensions of the image. The pixel-wise softmax loss represents the objective function.
The output of multiple blocks of convolutional and pooling layers is an encoded image with severely reduced dimensions. To predict the segmentation mask of the same resolution as the original image, one needs to simultaneously decode and upsample this coarse output. A natural approach is to perform an interpolation. In this work, instead of applying conventional local methods, we devise a learnable global interpolation.
We denote the decoded information of the RGB image $I \in \mathbb{R}^{3 \times H \times W}$ as $F \in \mathbb{R}^{C \times h \times w}$, where $C$ represents the number of channels, and $h < H$ and $w < W$ define the reduced height and width, respectively. To acquire the upsampled signal $\hat{F} \in \mathbb{R}^{C \times H \times W}$, we apply the following formula:

$$\hat{F}_c = A F_c B, \qquad c = 1, \dots, C, \qquad (1)$$

where the matrices $A \in \mathbb{R}^{H \times h}$ and $B \in \mathbb{R}^{w \times W}$ interpolate each feature map of $F$ along the corresponding spatial dimension. In contrast to simple bilinear interpolation, which operates only on the closest four points, the equation above allows each output value to draw on the whole rectangular grid. An illustrative example can be seen in Figure 2.
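For concreteness, the per-channel matrix interpolation of Equation (1) can be sketched in a few lines of NumPy. This is an illustrative sketch only; the function name, shapes, and variables are assumptions, not taken from the released code.

```python
import numpy as np

def global_interpolate(F, A, B):
    """Global interpolation: for each channel c, F_hat[c] = A @ F[c] @ B.

    F: (C, h, w) coarse feature map; A: (H, h); B: (w, W).
    Returns F_hat of shape (C, H, W).
    """
    return np.einsum('Hh,Chw,wW->CHW', A, F, B)

# Toy example: upsample a 2x3 coarse map to 4x6.
C, h, w, H, W = 1, 2, 3, 4, 6
F = np.random.randn(C, h, w)
A = np.random.randn(H, h)
B = np.random.randn(w, W)
F_hat = global_interpolate(F, A, B)
assert F_hat.shape == (C, H, W)
```

Note that, unlike bilinear upsampling, every output entry here is a weighted combination of the entire h×w grid, with the weights given by the corresponding row of A and column of B.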
Note that this operation is differentiable, and during the backpropagation algorithm [Rumelhart et al.(1986)Rumelhart, Hinton, and Williams] the derivatives of the pixelwise cross-entropy loss function $L$ with respect to the input and the parameters can be found as follows:

$$\frac{\partial L}{\partial F_c} = A^{\top} \frac{\partial L}{\partial \hat{F}_c} B^{\top}, \qquad \frac{\partial L}{\partial A} = \sum_{c=1}^{C} \frac{\partial L}{\partial \hat{F}_c} B^{\top} F_c^{\top}, \qquad \frac{\partial L}{\partial B} = \sum_{c=1}^{C} F_c^{\top} A^{\top} \frac{\partial L}{\partial \hat{F}_c}.$$
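The backward pass of this block can be sketched directly from these derivatives. A hedged NumPy sketch, with illustrative names (not the authors' implementation), where `G[c]` stands for the upstream gradient dL/dF_hat[c]:

```python
import numpy as np

def global_interpolate_backward(F, A, B, G):
    """Gradients for F_hat[c] = A @ F[c] @ B.

    Shapes: F (C, h, w); A (H, h); B (w, W); G = dL/dF_hat (C, H, W).
    """
    dF = np.einsum('Hh,CHW,wW->Chw', A, G, B)   # dL/dF_c = A^T G_c B^T
    dA = np.einsum('CHW,wW,Chw->Hh', G, B, F)   # dL/dA = sum_c G_c B^T F_c^T
    dB = np.einsum('Chw,Hh,CHW->wW', F, A, G)   # dL/dB = sum_c F_c^T A^T G_c
    return dF, dA, dB
```

A finite-difference check on a linear loss confirms the expressions agree with the chain rule.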
We call the operation performed by Equation (1) ‘global deconvolution’. We only use this term to underline the fact that we mimic the behaviour of standard deconvolution using a global function; note that our method is not the inverse of the convolution operation and therefore does not represent deconvolution in the strictest sense as, for example, in [Zeiler et al.(2010)Zeiler, Krishnan, Taylor, and Fergus].
It is not uncommon to force intermediate layers of deep learning networks to preserve meaningful and discriminative representations. For example, Szegedy et al. [Szegedy et al.(2014)Szegedy, Liu, Jia, Sermanet, Reed, Anguelov, Erhan, Vanhoucke, and Rabinovich] appended several auxiliary classifiers to the middle blocks of their model.
As semantic image segmentation essentially comprises image classification as one of its sub-tasks, we append an additional objective function on top of the coarse output to further improve the model performance on this particular sub-task (Figure 1). This supplementary block consists of fully-connected layers, with the length of the last one equal to the pre-defined number of possible labels (excluding the background). As there are usually multiple instances of the same label present in the image, we do not explicitly encode the counts and only denote the presence or absence of a particular class. The scores from the last layer are transformed with the sigmoid function followed by the multinomial cross-entropy loss.
The loss functions are defined as follows:
$$L = L_{cls} + L_{segm}, \qquad L_{segm} = -\sum_{i \in \Omega} \sum_{k=1}^{K} y_{ik} \log p_{ik}, \qquad L_{cls} = -\sum_{k=1}^{K} \big[ t_k \log q_k + (1 - t_k) \log (1 - q_k) \big],$$

where $L_{cls}$ is the multi-label classification loss; $L_{segm}$ is the pixelwise cross-entropy loss; $\Omega$ is the set of pixels; $y$ is a ground truth map; $K$ is the number of possible labels; $t$ is a ground truth binary vector of length $K$; $p_{ik}$ is the softmax probability of pixel $i$ being assigned to class $k$; $t_k \in \{0, 1\}$ indicates the presence of class $k$ or its absence; and $q_k$ corresponds to the predicted probability of class $k$ being present in the image. Note that it is possible to use a weighted sum of the two losses depending on which task's performance we want to optimise.
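The two loss terms are standard and can be sketched as follows; the array names and shapes are illustrative assumptions:

```python
import numpy as np

def segmentation_loss(scores, labels):
    """Per-pixel softmax cross-entropy.

    scores: (K, H, W) class scores per pixel; labels: (H, W) integer class ids.
    """
    scores = scores - scores.max(axis=0, keepdims=True)  # numerical stability
    log_p = scores - np.log(np.exp(scores).sum(axis=0, keepdims=True))
    H, W = labels.shape
    # Pick the log-probability of the ground-truth class at every pixel.
    return -log_p[labels, np.arange(H)[:, None], np.arange(W)[None, :]].mean()

def multilabel_loss(logits, present):
    """Image-level sigmoid cross-entropy over class-presence indicators.

    logits: (K,) scores per class; present: (K,) binary presence vector t.
    """
    q = 1.0 / (1.0 + np.exp(-logits))  # sigmoid -> predicted presence prob.
    return -(present * np.log(q) + (1 - present) * np.log(1 - q)).sum()
```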
Overall, each component of the proposed approach aims to capture global information and incorporate it into the network, hence the name global deconvolutional network. Besides that, the proposed interpolation also effectively upsamples the coarse output, and a nonlinear upsampling can be achieved by adding an activation function on top of the block. The complete architecture of our approach is presented in Figure 1.
We have implemented the proposed methods using Caffe [Jia et al.(2014)Jia, Shelhamer, Donahue, Karayev, Long, Girshick, Guadarrama, and Darrell], the popular deep learning framework. Our training procedure follows the practice of the corresponding baseline models, DeepLab [Chen et al.(2014)Chen, Papandreou, Kokkinos, Murphy, and Yuille] and FCN [Long et al.(2015)Long, Shelhamer, and Darrell], both of which employ the VGG-16 net pre-trained on ImageNet [Deng et al.(2009)Deng, Dong, Socher, Li, Li, and Li].
We use Stochastic Gradient Descent (SGD) with momentum and weight decay, and train with a minibatch size of 20 images. We start the training process with a fixed learning rate and divide it by a constant factor whenever the validation accuracy stops improving. We initialise all additional layers randomly as in [Glorot and Bengio(2010)] and fine-tune them by backpropagation with a lower learning rate before finally training the whole network.
We evaluate performance of the proposed approach on the PASCAL VOC 2012 segmentation benchmark [Everingham et al.(2010)Everingham, Gool, Williams, Winn, and Zisserman], which consists of 20 semantic categories and one background category. Following [Hariharan et al.(2011)Hariharan, Arbelaez, Bourdev, Maji, and Malik], we augment the training data to 8498 images and to 10582 images for the FCN and DeepLab models, respectively.
The segmentation performance is evaluated by the mean pixel-wise intersection-over-union (mIoU) score [Everingham et al.(2010)Everingham, Gool, Williams, Winn, and Zisserman], defined as follows:
$$\mathrm{mIoU} = \frac{1}{K} \sum_{i=1}^{K} \frac{n_{ii}}{\sum_{j} n_{ij} + \sum_{j} n_{ji} - n_{ii}},$$

where $n_{ij}$ represents the number of pixels of class $i$ predicted to belong to class $j$, and $K$ is the number of classes.
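Given a confusion matrix accumulated over the whole dataset, the metric is a one-liner. A minimal sketch, with the matrix convention as above (rows are ground-truth classes, columns are predictions):

```python
import numpy as np

def mean_iou(n):
    """mIoU from a confusion matrix n, where n[i, j] counts pixels of
    ground-truth class i predicted as class j."""
    tp = np.diag(n).astype(float)                 # true positives per class
    denom = n.sum(axis=1) + n.sum(axis=0) - tp    # |gt_i| + |pred_i| - tp
    return (tp / denom).mean()

# Two-class toy example: class 0 IoU = 3/4, class 1 IoU = 2/3.
n = np.array([[3, 1],
              [0, 2]])
assert abs(mean_iou(n) - (0.75 + 2.0 / 3.0) / 2) < 1e-9
```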
First, we conduct all our experiments on the PASCAL VOC val set, and then compare the best performing models with their corresponding baseline models on the PASCAL VOC test set. As the annotations for the test data are not available, we send our predictions to the PASCAL VOC Evaluation Server (http://host.robots.ox.ac.uk/).
We conduct several experiments with FCN-32s as a baseline model. During the training stage the images are resized to 500×500, the maximum value for both the height and width in the PASCAL VOC dataset. We evaluate all the models on the holdout dataset of 736 images as in [Long et al.(2015)Long, Shelhamer, and Darrell], and send the test results of the best performing ones to the evaluation server.
The original FCN-32s model employs the standard deconvolution operation (also known as backwards convolution) to upsample the coarse output. We replace it with the proposed global deconvolution and randomly initialise the new parameters as in [Glorot and Bengio(2010)]. We fix the rest of the network to pre-train the added block, and after that train the whole network. Global interpolation already improves its baseline model on the validation dataset, as can be seen in Table 1.
The baseline model deals with inputs of different sizes by cropping the predicted mask to the same resolution as the corresponding input. Other popular options include either 1) padding or 2) resizing to fixed input dimensions. In the case of global deconvolution, we propose a more elegant solution. Recall that during the training stage the parameters of this block can be represented as matrices $A \in \mathbb{R}^{H \times h}$ and $B \in \mathbb{R}^{w \times W}$, where $h < H$ and $w < W$. Then, given a test image of smaller height $H'$ and width $W'$, we subset the learned matrices to acquire $A'$ and $B'$ of the matching dimensions and proceed with the same operation. To subset, we only keep the first rows and columns of the corresponding matrices and discard the rest. We have found that this does not affect the final performance of our model.
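The subsetting step amounts to slicing the learned matrices. A hedged sketch (function name, shapes, and example sizes are assumptions for illustration, not values from the paper):

```python
import numpy as np

def subset_matrices(A, B, H_new, h_new, w_new, W_new):
    """Keep the first rows/columns of the learned interpolation matrices so
    they match a smaller test image and its smaller coarse output.

    A: (H, h) -> (H_new, h_new); B: (w, W) -> (w_new, W_new).
    """
    return A[:H_new, :h_new], B[:w_new, :W_new]

# Hypothetical example: trained for 500x500 inputs with 16x16 coarse maps,
# tested on a 375x375 image whose coarse map is 12x12.
A = np.random.randn(500, 16)
B = np.random.randn(16, 500)
A_s, B_s = subset_matrices(A, B, 375, 12, 12, 375)
assert A_s.shape == (375, 12) and B_s.shape == (12, 375)
```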
Next, to increase the recognition accuracy we also append the multi-label classification loss. This slightly improves the validation score in comparison to the baseline model, while the combination with global interpolation gives a further boost in performance (FCN-32s+GDN).
Besides that, we have also conducted additional experiments with FCN-32s, where we insert a fully-connected layer directly after the coarse output (FCN-32s+FC). The idea behind this trick is to allow the network to refine the local predictions based on the information from all the pixels. One drawback of such an approach is that fully-connected layers require a fixed-size input, although the solutions discussed above are also applicable here. Nevertheless, neither of these methods gives satisfactory results during the empirical evaluations. Therefore, we proceed with a slightly different architecture: before appending the fully-connected layer, we first add a spatial pyramid pooling layer [He et al.(2015)He, Zhang, Ren, and Sun], which produces an output of the same length given an arbitrarily sized input. In particular, we use a multi-level pyramid with max-pooling. Though during evaluation this approach alone does not give any improvement in the validation score over the baseline model, its ensemble with the global deconvolution model (FCN-32s+GDN+FC) improves previous results, which indicates that these models may be complementary to each other.
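The fixed-length property of spatial pyramid pooling can be sketched as follows; the function name, grid partitioning, and level choices are illustrative assumptions, not the exact configuration used in the experiments:

```python
import numpy as np

def spatial_pyramid_pool(F, levels=(1, 2, 4)):
    """Spatial pyramid max-pooling: each level splits the (C, H, W) map into
    an n x n grid and max-pools every cell, so the concatenated output has a
    fixed length C * sum(n * n) regardless of H and W."""
    C, H, W = F.shape
    out = []
    for n in levels:
        hs = np.linspace(0, H, n + 1).astype(int)  # cell boundaries (rows)
        ws = np.linspace(0, W, n + 1).astype(int)  # cell boundaries (cols)
        for i in range(n):
            for j in range(n):
                cell = F[:, hs[i]:hs[i + 1], ws[j]:ws[j + 1]]
                out.append(cell.max(axis=(1, 2)))
    return np.concatenate(out)

# Two differently sized inputs yield vectors of identical length.
v1 = spatial_pyramid_pool(np.random.randn(3, 8, 8), levels=(1, 2))
v2 = spatial_pyramid_pool(np.random.randn(3, 12, 10), levels=(1, 2))
assert v1.shape == v2.shape == (3 * (1 + 4),)
```

This fixed-length output is what makes it possible to attach a fully-connected layer after arbitrarily sized coarse maps.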
Table 1: Mean IoU (%) on the PASCAL VOC 2012 validation data.

| Method | mIoU |
| --- | --- |
| FCN-32s [Long et al.(2015)Long, Shelhamer, and Darrell] | 59.4 |
| FCN-32s + Label Loss | 59.8 |
| FCN-32s + Global Interp. | 60.9 |
| FCN-32s + GDN | 61.2 |
| FCN-32s + GDN + FC | 62.5 |
| DL-LargeFOV [Chen et al.(2014)Chen, Papandreou, Kokkinos, Murphy, and Yuille] | 73.3 |
| DL-LargeFOV + Label Loss | 73.9 |
| DL-LargeFOV + Global Interp. | 74.2 |
| DL-LargeFOV + GDN | 75.1 |
We continue with the evaluation of the best performing models on the test set (Table 5). Both of them improve their baseline model, FCN-32s, and even outperform FCN-8s, another model by Long et al. [Long et al.(2015)Long, Shelhamer, and Darrell] with the skip-architecture, which combines information from lower layers with the final prediction layer.
Some examples of our approach can be seen in Figure 3.
As the next baseline model we consider DeepLab-LargeFOV [Chen et al.(2014)Chen, Papandreou, Kokkinos, Murphy, and Yuille]. With the help of the algorithme à trous [Holschneider et al.(1989)Holschneider, Kronland-Martinet, Morlet, and Tchamitchian, Shensa(1992)], the model has a larger Field-of-View (FOV), which results in finer predictions from the network. Besides that, this model is significantly faster and contains fewer parameters than the plain modification of the VGG-16 net, due to the reduced number of filters in the last two convolutional layers. The model employs simple bilinear interpolation to acquire an output of the same resolution as the input.
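The à trous (dilated) idea is easiest to see in one dimension: the filter taps are spaced `rate` samples apart, enlarging the receptive field without adding parameters. A minimal illustrative sketch (not DeepLab's implementation):

```python
import numpy as np

def dilated_conv1d(x, kernel, rate):
    """1-D dilated (a trous) correlation with integer dilation `rate`.

    The effective receptive field is (len(kernel) - 1) * rate + 1 samples,
    while the number of parameters stays len(kernel)."""
    k = len(kernel)
    span = (k - 1) * rate + 1
    out = np.empty(len(x) - span + 1)
    for i in range(len(out)):
        out[i] = sum(kernel[j] * x[i + j * rate] for j in range(k))
    return out

# A 3-tap filter with rate 2 covers 5 input samples per output.
y = dilated_conv1d(np.arange(10.0), [1.0, 1.0, 1.0], rate=2)
assert len(y) == 6
```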
We proceed with the same experiments as for the FCN-32s model, except for the ones involving the fully-connected layer. As DeepLab-LargeFOV has a higher-resolution coarse output, the inclusion of the fully-connected layer would result in a weight matrix with several billion parameters. Therefore, we omit these experiments.
We separately replace the bilinear interpolation with global deconvolution, append the label loss and estimate the joint GDN model. We carry out the same strategy outlined above during the testing stage to deal with variable-size inputs. All the experiments lead to improvements over the baseline model, with GDN showing a significantly higher score on the PASCAL VOC 2012 val set (Table 1).
Table 5: Per-class IoU (%) and mean IoU on the PASCAL VOC 2012 test set (classes in the standard VOC order).

| Method | aero | bike | bird | boat | bottle | bus | car | cat | chair | cow | table | dog | horse | mbike | person | plant | sheep | sofa | train | tv | mIoU |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FCN-8s [Long et al.(2015)Long, Shelhamer, and Darrell] | 76.8 | 34.2 | 68.9 | 49.4 | 60.3 | 75.3 | 74.7 | 77.6 | 21.4 | 62.5 | 46.8 | 71.8 | 63.9 | 76.5 | 73.9 | 45.2 | 72.4 | 37.4 | 70.9 | 55.1 | 62.20 |
| FCN-32s + GDN | 74.5 | 31.8 | 66.6 | 49.7 | 60.5 | 76.9 | 75.8 | 76.0 | 22.8 | 57.5 | 54.5 | 72.9 | 59.4 | 74.9 | 73.6 | 50.9 | 67.5 | 43.2 | 70.0 | 56.4 | 62.22 |
| FCN-32s + GDN + FC | 75.6 | 31.5 | 69.2 | 51.6 | 62.9 | 78.8 | 76.7 | 78.6 | 24.6 | 61.6 | 60.3 | 74.5 | 62.6 | 76.0 | 74.3 | 51.4 | 70.6 | 47.3 | 73.9 | 58.3 | 64.37 |
| DL-LargeFOV-CRF [Chen et al.(2014)Chen, Papandreou, Kokkinos, Murphy, and Yuille] | 83.4 | 36.5 | 82.5 | 62.2 | 66.4 | 85.3 | 78.4 | 83.7 | 30.4 | 72.9 | 60.4 | 78.4 | 75.4 | 82.1 | 79.6 | 58.2 | 81.9 | 48.8 | 73.6 | 63.2 | 70.34 |
| DeconvNet+CRF_VOC [Noh et al.(2015)Noh, Hong, and Han] | 87.8 | 41.9 | 80.6 | 63.9 | 67.3 | 88.1 | 78.4 | 81.3 | 25.9 | 73.7 | 61.2 | 72.0 | 77.0 | 79.9 | 78.7 | 59.5 | 78.3 | 55.0 | 75.2 | 61.5 | 70.50 |
| DL-MSC-LargeFOV-CRF [Chen et al.(2014)Chen, Papandreou, Kokkinos, Murphy, and Yuille] | 84.4 | 54.5 | 81.5 | 63.6 | 65.9 | 85.1 | 79.1 | 83.4 | 30.7 | 74.1 | 59.8 | 79.0 | 76.1 | 83.2 | 80.8 | 59.7 | 82.2 | 50.4 | 73.1 | 63.7 | 71.60 |
| EDeconvNet+CRF_VOC [Noh et al.(2015)Noh, Hong, and Han] | 89.9 | 39.3 | 79.7 | 63.9 | 68.2 | 87.4 | 81.2 | 86.1 | 28.5 | 77.0 | 62.0 | 79.0 | 80.3 | 83.6 | 80.2 | 58.8 | 83.4 | 54.3 | 80.7 | 65.0 | 72.50 |
| DL-LargeFOV-CRF + GDN | 87.9 | 37.8 | 88.8 | 64.5 | 70.7 | 87.7 | 81.3 | 87.1 | 32.5 | 76.7 | 66.7 | 80.4 | 76.6 | 82.2 | 82.3 | 57.9 | 84.5 | 55.9 | 78.5 | 64.2 | 73.21 |
| DL-L_FOV-CRF + GDN_ENS | 88.6 | 48.6 | 88.8 | 64.7 | 70.4 | 87.2 | 81.8 | 86.4 | 32.0 | 77.1 | 64.1 | 80.5 | 78.0 | 84.0 | 83.3 | 59.2 | 85.9 | 56.8 | 77.9 | 65.0 | 74.02 |
| Adelaide_Cont_CRF_VOC [Liu et al.(2015)Liu, Li, Luo, Loy, and Tang] | 90.6 | 37.6 | 80.0 | 67.8 | 74.4 | 92.0 | 85.2 | 86.2 | 39.1 | 81.2 | 58.9 | 83.8 | 83.9 | 84.3 | 84.8 | 62.1 | 83.2 | 58.2 | 80.8 | 72.3 | 75.30 |
The DeepLab-LargeFOV model also incorporates a fully-connected CRF [Lafferty et al.(2001)Lafferty, McCallum, and Pereira, Kohli et al.(2009)Kohli, Ladicky, and Torr, Krähenbühl and Koltun(2011)] as a post-processing step.
To set the parameters of the fully connected CRF, we employ the same cross-validation method as in [Chen et al.(2014)Chen, Papandreou, Kokkinos, Murphy, and Yuille] on a subset of the validation data. We then send our best performing model, enriched by the CRF, to the evaluation server.
On the PASCAL VOC 2012 test set our single model (DL-LargeFOV-CRF+GDN) achieves 73.2% mIoU, a significant improvement over the baseline model (around 2.9%), and even excels the multiscale DeepLab-MSc-LargeFOV model by 1.6% (Table 5); the predictions averaged across several of our models (DL-L_FOV-CRF+GDN_ENS) give a further improvement of 0.8%, reaching 74.0%, a competitive score among the models that do not exploit the Microsoft COCO dataset [Lin et al.(2014)Lin, Maire, Belongie, Hays, Perona, Ramanan, Dollár, and Zitnick].
As is the case with the FCN-32s model, we obtain performance on par with the multi-resolution variant using a much simpler architecture. Moreover, our single CRF-equipped global deconvolutional network (DL-LargeFOV-CRF+GDN) even surpasses the results of the competing approach (DeconvNet+CRF [Noh et al.(2015)Noh, Hong, and Han]) by 2.7%, even though the deconvolutional part of that network contains significantly more parameters: almost 126M compared to fewer than 70K for global deconvolution; in the case of ensembles, the improvement is over 1.5%.
In this paper we addressed two important problems of semantic image segmentation: the upsampling of the low-resolution output from the network, and the refinement of this coarse output by incorporating global information and an additional classification loss. We proposed a novel approach, global deconvolution, to acquire an output of the same size as the input for images of variable resolutions. We showed that global deconvolution effectively replaces standard approaches and can easily be trained in a straightforward manner.
On the benchmark competition, PASCAL VOC 2012, we showed that the proposed approach outperforms the results of the baseline models. Furthermore, our method even surpasses the performance of more powerful multi-resolution models, which combine information from several blocks of the deep neural network.
Acknowledgements The authors would like to thank the anonymous reviewers for their helpful and constructive comments, and Gaee Kim for making Fig. 1. This work is supported by the Ministry of Science, ICT & Future Planning (MSIP), Korea, under Basic Science Research Program through the National Research Foundation of Korea (NRF) grant (NRF-2014R1A1A1002662), under the ITRC (Information Technology Research Center) support program (IITP-2016-R2720-16-0007) supervised by the IITP (Institute for Information & communications Technology Promotion) and under NIPA (National IT Industry Promotion Agency) program (NIPA-S0415-15-1004).