
Global Deconvolutional Networks for Semantic Segmentation

by   Vladimir Nekrasov, et al.

Semantic image segmentation is a principal problem in computer vision, where the aim is to correctly classify each individual pixel of an image into a semantic label. Its widespread use in many areas, including medical imaging and autonomous driving, has fostered extensive research in recent years. Empirical improvements in tackling this task have primarily been motivated by successful exploitation of Convolutional Neural Networks (CNNs) pre-trained for image classification and object recognition. However, the pixel-wise labelling with CNNs has its own unique challenges: (1) an accurate deconvolution, or upsampling, of low-resolution output into a higher-resolution segmentation mask and (2) an inclusion of global information, or context, within locally extracted features. To address these issues, we propose a novel architecture to conduct the equivalent of the deconvolution operation globally and acquire dense predictions. We demonstrate that it leads to improved performance of state-of-the-art semantic segmentation models on the PASCAL VOC 2012 benchmark, reaching 74.0% mean IoU on the test set.





Code Repositories


Code for "Global Deconvolutional Networks" paper (BMVC 2016)


1 Introduction

An adaptation of convolutional network models [Long et al.(2015)Long, Shelhamer, and Darrell], pre-trained for the image classification task, has fostered extensive research on the exploitation of CNNs in semantic image segmentation, the problem of marking (or classifying) each pixel of an image with one of the given semantic labels. Among the important applications of this problem are road scene understanding [Álvarez et al.(2012)Álvarez, Gevers, LeCun, and López, Badrinarayanan et al.(2015)Badrinarayanan, Handa, and Cipolla, Sturgess et al.(2009)Sturgess, Alahari, Ladicky, and Torr], biomedical imaging [Ronneberger et al.(2015)Ronneberger, Fischer, and Brox, Ciresan et al.(2012)Ciresan, Giusti, Gambardella, and Schmidhuber], and aerial imaging [Kluckner et al.(2009)Kluckner, Mauthner, Roth, and Bischof, Mnih and Hinton(2010)].

Recent breakthrough methods in the area have efficiently and effectively combined neural networks with probabilistic graphical models, such as Conditional Random Fields (CRFs) [Chen et al.(2014)Chen, Papandreou, Kokkinos, Murphy, and Yuille, Lin et al.(2015)Lin, Shen, Reid, and van den Hengel, Zheng et al.(2015)Zheng, Jayasumana, Romera-Paredes, Vineet, Su, Du, Huang, and Torr] and Markov Random Fields (MRFs) [Liu et al.(2015)Liu, Li, Luo, Loy, and Tang]. These approaches usually refine per-pixel features extracted by CNNs (so-called ‘unary potentials’) with the help of pairwise similarities between the pixels based on location and colour features, followed by an approximate inference over the obtained fully connected graphical model [Krähenbühl and Koltun(2011)].

In this work, we address two main challenges of current CNN-based semantic segmentation methods [Chen et al.(2014)Chen, Papandreou, Kokkinos, Murphy, and Yuille, Long et al.(2015)Long, Shelhamer, and Darrell]: an effective deconvolution, or upsampling, of low-resolution output from a neural network; and an inclusion of global information, or context, into existing models without relying on graphical models. Our contribution is twofold: i) we propose a novel approach for performing the deconvolution operation of the encoded signal and ii) demonstrate that this new architecture, called ‘Global Deconvolutional Network’, achieves close to the state-of-the-art performance on semantic segmentation with a simpler architecture and significantly fewer parameters in comparison to the existing models [Chen et al.(2014)Chen, Papandreou, Kokkinos, Murphy, and Yuille, Noh et al.(2015)Noh, Hong, and Han, Liu et al.(2015)Liu, Li, Luo, Loy, and Tang].

Figure 1: Global Deconvolutional Network. Our adaptation of FCN-32s [Long et al.(2015)Long, Shelhamer, and Darrell]. After hierarchical blocks of convolutional-subsampling-nonlinearity layers, we upsample the reduced signal with the help of the global interpolation block. In addition to the pixel-wise softmax loss (not shown here), we also use the multi-label classification loss to increase the recognition accuracy.

The rest of the paper is structured as follows. We briefly explore recent common practices of semantic segmentation models in Section 2. Section 3 presents our approach designed to overcome the issues outlined above. Section 4 describes the experimental part, including the evaluation results of the proposed method on the popular PASCAL VOC dataset. Finally, Section 5 contains conclusions.

2 Related Work

Exploitation of fully convolutional neural networks has become ubiquitous in semantic image segmentation ever since the publication of the paper by Long et al. [Long et al.(2015)Long, Shelhamer, and Darrell]. Further research has been concerned with the combination of CNNs and probabilistic graphical models [Chen et al.(2014)Chen, Papandreou, Kokkinos, Murphy, and Yuille, Lin et al.(2015)Lin, Shen, Reid, and van den Hengel, Liu et al.(2015)Liu, Li, Luo, Loy, and Tang, Zheng et al.(2015)Zheng, Jayasumana, Romera-Paredes, Vineet, Su, Du, Huang, and Torr], training in the presence of weakly-labelled data [Hong et al.(2015)Hong, Noh, and Han, Papandreou et al.(2015)Papandreou, Chen, Murphy, and Yuille, Russakovsky et al.(2015a)Russakovsky, Bearman, Ferrari, and Li], and learning an additional (deconvolutional) network [Noh et al.(2015)Noh, Hong, and Han].

The problem of incorporating contextual information has been an active research topic in computer vision [Rabinovich et al.(2007)Rabinovich, Vedaldi, Galleguillos, Wiewiora, and Belongie, Heitz and Koller(2008), Divvala et al.(2009)Divvala, Hoiem, Hays, Efros, and Hebert, Doersch et al.(2014)Doersch, Gupta, and Efros, Mottaghi et al.(2014)Mottaghi, Chen, Liu, Cho, Lee, Fidler, Urtasun, and Yuille]. To some extent, probabilistic graphical models address this issue in semantic segmentation and can be either a) used as a separate post-processing step [Chen et al.(2014)Chen, Papandreou, Kokkinos, Murphy, and Yuille] or b) trained end-to-end with CNNs [Lin et al.(2015)Lin, Shen, Reid, and van den Hengel, Liu et al.(2015)Liu, Li, Luo, Loy, and Tang, Zheng et al.(2015)Zheng, Jayasumana, Romera-Paredes, Vineet, Su, Du, Huang, and Torr]. In setting a), the graphical model is unable to refine the parameters of the CNN, and thus the errors of the CNN essentially persist through post-processing. On the other hand, in b) one needs to carefully express the inference part in terms of standard neural network operations, and it still relies on computing high-dimensional Gaussian kernels [Adams et al.(2010)Adams, Baek, and Davis]. Besides that, Yu and Koltun [Yu and Koltun(2015)] have recently shown that dilated convolution filters are generally applicable and make it possible to increase the contextual capacity of the network as well.

In terms of improving the deconvolutional part of the network for dense predictions, there has been a prevalence of using information from lower layers: the so-called ‘Skip Architecture’ [Long et al.(2015)Long, Shelhamer, and Darrell, Ronneberger et al.(2015)Ronneberger, Fischer, and Brox] and Multi-scale [Chen et al.(2014)Chen, Papandreou, Kokkinos, Murphy, and Yuille] approaches are two notable examples. Noh et al. [Noh et al.(2015)Noh, Hong, and Han] proposed to train a separate deconvolutional network to effectively decode information from the original fully convolutional model. While these methods have given better results, all of them contain significantly more parameters than the corresponding baseline models.

In turn, we propose another approach, called ‘Global Deconvolutional Network’, which includes a global interpolation block with an additional recognition loss, and gives better results than multi-scale and ‘skip’ variants. The depiction of our architecture is presented in Figure 1.

3 Global Deconvolutional Network

In this section, we describe our approach intended to boost the performance of deep learning models on the semantic segmentation task.

3.1 Baseline Models

As baseline models, we choose two publicly available deep CNN models: FCN-32s [Long et al.(2015)Long, Shelhamer, and Darrell] and DeepLab [Chen et al.(2014)Chen, Papandreou, Kokkinos, Murphy, and Yuille]. Both of them are based on the VGG 16-layer net [Simonyan and Zisserman(2014)] from the ILSVRC-2014 competition [Russakovsky et al.(2015b)Russakovsky, Deng, Su, Krause, Satheesh, Ma, Huang, Karpathy, Khosla, Bernstein, Berg, and Li]. This network contains 16 weight layers, including two fully-connected ones, and can be represented as hierarchical stacks of convolutional layers with rectified linear unit non-linearity [Glorot et al.(2011)Glorot, Bordes, and Bengio], each stack followed by a pooling operation. The output of the fully-connected layers is fed into a softmax classifier.

For semantic segmentation, where one needs to acquire dense predictions, the fully-connected layers have been replaced by convolution filters followed by a learnable deconvolution or fixed bilinear interpolation to match the original spatial dimensions of the image. The pixel-wise softmax loss represents the objective function.

3.2 Global Interpolation

The output of multiple blocks of convolutional and pooling layers is an encoded image with severely reduced dimensions. To predict the segmentation mask of the same resolution as the original image, one needs to simultaneously decode and upsample this coarse output. A natural approach is to perform an interpolation. In this work, instead of applying conventional local methods, we devise a learnable global interpolation.

We denote the decoded information of the RGB-image $I \in \mathbb{R}^{3 \times H \times W}$ as $x \in \mathbb{R}^{C \times h \times w}$, where $C$ represents the number of channels, and $h$ and $w$ define the reduced height and width, respectively. To acquire $\hat{y} \in \mathbb{R}^{C \times H \times W}$, an upsampled signal, we apply the following formula:

$$\hat{y}_c = A \, x_c \, B^\top, \qquad c = 1, \dots, C, \tag{1}$$

where the matrices $A \in \mathbb{R}^{H \times h}$ and $B \in \mathbb{R}^{W \times w}$ interpolate each feature map of $x$ through the corresponding spatial dimensions. In contrast to simple bilinear interpolation, which operates only on the closest four points, the equation above allows us to include much more information from the whole rectangular grid. An illustrative example can be seen in Figure 2.
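As a concrete illustration, Equation (1) can be sketched in NumPy. The interpolation weights below are hand-made bilinear-style coefficients for a toy 2×2 → 4×4 example, not values learned by an actual model:

```python
import numpy as np

def global_interpolation(x, A, B):
    """Upsample a C x h x w feature map to C x H x W.

    Applies y_c = A @ x_c @ B.T per channel: A (H x h) interpolates rows,
    B (W x w) interpolates columns, so every output pixel is a weighted
    sum over the entire h x w grid of the coarse map.
    """
    C, _, _ = x.shape
    H, _ = A.shape
    W, _ = B.shape
    y = np.empty((C, H, W))
    for c in range(C):
        y[c] = A @ x[c] @ B.T
    return y

# Toy example: one 2x2 feature map upsampled to 4x4.
x = np.arange(4, dtype=float).reshape(1, 2, 2)
A = np.array([[1.0, 0.0], [2/3, 1/3], [1/3, 2/3], [0.0, 1.0]])  # 4 x 2 row weights
B = A.copy()                                                    # 4 x 2 column weights
y = global_interpolation(x, A, B)
print(y.shape)  # (1, 4, 4)
```

In a trained model, A and B are learned parameters rather than fixed bilinear coefficients, which is what distinguishes this block from a conventional interpolation layer.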

Figure 2: Depiction of Equation (1) for synthetic data.

Note that this operation is differentiable, and during the backpropagation algorithm [Rumelhart et al.(1986)Rumelhart, Hinton, and Williams] the derivatives of the pixelwise cross-entropy loss function $L$ with respect to the input and the parameters can be found as follows:

$$\frac{\partial L}{\partial x_c} = A^\top \frac{\partial L}{\partial \hat{y}_c} B, \qquad \frac{\partial L}{\partial A} = \sum_{c=1}^{C} \frac{\partial L}{\partial \hat{y}_c} B \, x_c^\top, \qquad \frac{\partial L}{\partial B} = \sum_{c=1}^{C} \left( \frac{\partial L}{\partial \hat{y}_c} \right)^{\!\top} A \, x_c. \tag{2}$$
We call the operation performed by Equation (1) ‘global deconvolution’. We only use this term to underline the fact that we mimic the behaviour of standard deconvolution using a global function; note that our method is not the inverse of the convolution operation and therefore does not represent deconvolution in the strictest sense as, for example, in [Zeiler et al.(2010)Zeiler, Krishnan, Taylor, and Fergus].
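These gradient formulas can be checked numerically. The sketch below uses a simple surrogate loss L = ½‖ŷ‖² (so that ∂L/∂ŷ = ŷ) instead of the pixelwise cross-entropy, a single channel, and arbitrary toy shapes:

```python
import numpy as np

rng = np.random.default_rng(0)
h, w, H, W = 3, 4, 6, 8
x = rng.standard_normal((h, w))
A = rng.standard_normal((H, h))
B = rng.standard_normal((W, w))

def loss(x, A, B):
    # Surrogate loss L = 0.5 * ||A x B^T||^2, so that dL/dy = y.
    y = A @ x @ B.T
    return 0.5 * np.sum(y ** 2)

# Analytic gradients following the formulas in the text, with G = dL/dy:
G = A @ x @ B.T
dx = A.T @ G @ B      # shape h x w
dA = G @ B @ x.T      # shape H x h
dB = G.T @ A @ x      # shape W x w

# Central-difference check of a single entry of each gradient.
eps = 1e-6

def num_grad(M, i, j, f):
    Mp, Mm = M.copy(), M.copy()
    Mp[i, j] += eps
    Mm[i, j] -= eps
    return (f(Mp) - f(Mm)) / (2 * eps)

print(abs(num_grad(x, 1, 2, lambda m: loss(m, A, B)) - dx[1, 2]) < 1e-4)  # True
print(abs(num_grad(A, 2, 1, lambda m: loss(x, m, B)) - dA[2, 1]) < 1e-4)  # True
print(abs(num_grad(B, 3, 2, lambda m: loss(x, A, m)) - dB[3, 2]) < 1e-4)  # True
```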

3.3 Multi-task loss

It is not uncommon to force intermediate layers of deep learning networks to preserve meaningful and discriminative representations. For example, Szegedy et al. [Szegedy et al.(2014)Szegedy, Liu, Jia, Sermanet, Reed, Anguelov, Erhan, Vanhoucke, and Rabinovich] appended several auxiliary classifiers to the middle blocks of their model.

As semantic image segmentation essentially comprises image classification as one of its sub-tasks, we append an additional objective function on top of the coarse output to further improve the model performance on the particular task of classification (Figure 1). This supplementary block consists of fully-connected layers, with the length of the last one being equal to the pre-defined number of possible labels (excluding the background). As there are usually multiple instances of the same label present in the image, we do not explicitly encode the quantitative components and only denote the presence of a particular class or its absence. The scores from the last layer are transformed with the sigmoid function followed by a binary cross-entropy loss.

The loss functions are defined as follows:

$$L_{segm} = -\frac{1}{|\mathcal{P}|} \sum_{p \in \mathcal{P}} \log \sigma(p, G_p), \qquad L_{class} = -\sum_{k=1}^{K} \big[ y_k \log \hat{p}_k + (1 - y_k) \log (1 - \hat{p}_k) \big], \qquad L = L_{segm} + L_{class}, \tag{3}$$

where $L_{class}$ is the multi-label classification loss; $L_{segm}$ is the pixelwise cross-entropy loss; $\mathcal{P}$ is the set of pixels; $G$ is a ground truth map; $K$ is the number of possible labels; $y$ is a ground truth binary vector of length $K$; $\sigma(p, k)$ is the softmax probability of pixel $p$ being assigned to class $k$; $y_k \in \{0, 1\}$ indicates the presence of class $k$ or its absence; and $\hat{p}_k$ corresponds to the predicted probability of class $k$ being present in the image. Note that it is possible to use a weighted sum of the two losses depending on which task's performance we want to optimise.
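A minimal NumPy sketch of the two losses, assuming an integer ground-truth label map and a binary presence vector as in the definitions above (the toy shapes and the unweighted sum are illustrative choices, not the exact training configuration):

```python
import numpy as np

def softmax(scores):
    """Channel-wise softmax over a K x H x W score map."""
    s = scores - scores.max(axis=0, keepdims=True)
    e = np.exp(s)
    return e / e.sum(axis=0, keepdims=True)

def segmentation_loss(scores, gt):
    """Pixelwise cross-entropy L_segm: scores is K x H x W, gt is H x W integer labels."""
    prob = softmax(scores)
    rows, cols = np.indices(gt.shape)
    return -np.mean(np.log(prob[gt, rows, cols]))

def classification_loss(logits, y):
    """Multi-label loss L_class: sigmoid scores + binary cross-entropy per class.
    logits: length-K score vector; y: length-K binary presence vector."""
    p = 1.0 / (1.0 + np.exp(-logits))
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Toy example: K = 3 classes on a 2 x 2 image, all classes present.
scores = np.random.randn(3, 2, 2)
gt = np.array([[0, 1], [2, 1]])
y = np.array([1, 1, 1])
logits = np.random.randn(3)
total = segmentation_loss(scores, gt) + classification_loss(logits, y)
print(total > 0)  # True: both terms are negative log-probabilities
```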

Overall, each component of the proposed approach aims to capture global information and incorporate it into the network, hence the name global deconvolutional network. Besides that, the proposed interpolation also effectively upsamples the coarse output, and a nonlinear upsampling can be achieved by adding an activation function on top of the block. The complete architecture of our approach is presented in Figure 1.


4 Experiments

4.1 Implementation details

We have implemented the proposed methods using Caffe [Jia et al.(2014)Jia, Shelhamer, Donahue, Karayev, Long, Girshick, Guadarrama, and Darrell], the popular deep learning framework. Our training procedure follows the practice of the corresponding baseline models: DeepLab [Chen et al.(2014)Chen, Papandreou, Kokkinos, Murphy, and Yuille] and FCN [Long et al.(2015)Long, Shelhamer, and Darrell]. Both of them employ the VGG-16 net pre-trained on ImageNet [Deng et al.(2009)Deng, Dong, Socher, Li, Li, and Li].

We use Stochastic Gradient Descent (SGD) with momentum and weight decay, and train with a minibatch size of 20 images. We start the training process from a fixed initial learning rate and reduce it when the validation accuracy stops improving. We initialise all additional layers randomly as in [Glorot and Bengio(2010)] and fine-tune them by backpropagation with a lower learning rate before finally training the whole network.

4.2 Dataset

We evaluate performance of the proposed approach on the PASCAL VOC 2012 segmentation benchmark [Everingham et al.(2010)Everingham, Gool, Williams, Winn, and Zisserman], which consists of 20 semantic categories and one background category. Following [Hariharan et al.(2011)Hariharan, Arbelaez, Bourdev, Maji, and Malik], we augment the training data to 8498 images and to 10582 images for the FCN and DeepLab models, respectively.

The segmentation performance is evaluated by the mean pixel-wise intersection-over-union (mIoU) score [Everingham et al.(2010)Everingham, Gool, Williams, Winn, and Zisserman], defined as follows:

$$\text{mIoU} = \frac{1}{K} \sum_{i} \frac{n_{ii}}{\sum_{j} n_{ij} + \sum_{j} n_{ji} - n_{ii}},$$

where $n_{ij}$ represents the number of pixels of class $i$ predicted to belong to class $j$.
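The metric can be computed from the confusion matrix of counts n_ij; the following sketch assumes every class occurs in the ground truth (otherwise the per-class union can be zero):

```python
import numpy as np

def mean_iou(gt, pred, K):
    """mIoU over K classes; n[i, j] counts pixels of class i predicted as class j."""
    n = np.zeros((K, K), dtype=np.int64)
    for t, p in zip(gt.ravel(), pred.ravel()):
        n[t, p] += 1
    tp = np.diag(n).astype(float)
    union = n.sum(axis=1) + n.sum(axis=0) - tp   # |gt_i| + |pred_i| - |gt_i ∩ pred_i|
    return np.mean(tp / union)

# Toy 2 x 2 mask with two classes: class 0 has IoU 1/2, class 1 has IoU 2/3.
gt   = np.array([[0, 0], [1, 1]])
pred = np.array([[0, 1], [1, 1]])
print(round(mean_iou(gt, pred, K=2), 4))  # 0.5833
```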

First, we conduct all our experiments on the PASCAL VOC val set, and then compare the best performing models with their corresponding baseline models on the PASCAL VOC test set. As the annotations for the test data are not available, we send our predictions to the PASCAL VOC Evaluation Server.

Figure 3: Qualitative results on the validation set. (a) Last column represents our approach, which includes the replacement of standard deconvolution with global deconvolution and the addition of the multi-label classification loss. (b) Fourth and sixth columns demonstrate our model, where bilinear interpolation is replaced with global deconvolution. Last two columns also incorporate a conditional random field (CRF) as a post-processing step. Best viewed in colour.

4.3 Experiments with FCN-32s

We conduct several experiments with FCN-32s as a baseline model. During the training stage the images are resized to a fixed resolution, equal to the maximum value of both the height and width in the PASCAL VOC dataset. We evaluate all the models on the holdout dataset of 736 images as in [Long et al.(2015)Long, Shelhamer, and Darrell], and send the test results of the best performing ones to the evaluation server.

The original FCN-32s model employs the standard deconvolution operation (also known as backwards convolution) to upsample the coarse output. We replace it with the proposed global deconvolution and randomly initialise the new parameters as in [Glorot and Bengio(2010)]. We fix the rest of the network to pre-train the added block, and after that train the whole network. Global interpolation already improves its baseline model on the validation dataset, as can be seen in Table 1.

The baseline model deals with inputs of different sizes by cropping the predicted mask to the same resolution as the corresponding input. Other popular options include either 1) padding or 2) resizing to fixed input dimensions. In the case of global deconvolution, we propose a more elegant solution. Recall that during the training stage the parameters of this block can be represented as matrices $A \in \mathbb{R}^{H \times h}$ and $B \in \mathbb{R}^{W \times w}$. Then, given a test image with height $H' \leq H$ and width $W' \leq W$, we subset the learned matrices to acquire $A' \in \mathbb{R}^{H' \times h'}$ and $B' \in \mathbb{R}^{W' \times w'}$, and proceed with the same operation. To subset, we keep only the first $H'$ rows and $h'$ columns of $A$ (and, analogously, the first $W'$ rows and $w'$ columns of $B$), and discard the rest. We have found that this does not affect the final performance of our model.
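A sketch of this subsetting trick, with made-up sizes (the 500-pixel bound, the coarse resolutions, and the 21-channel score map are illustrative assumptions, not the paper's exact values):

```python
import numpy as np

# Hypothetical training-time sizes: inputs are brought to H x W,
# so A and B are learned at their largest required shapes.
H, h, W, w = 500, 16, 500, 16
A = np.random.rand(H, h)
B = np.random.rand(W, w)

# A smaller test image of size H' x W' yields a coarse map of size h' x w';
# keep only the first H' rows / h' columns of A (and W' rows / w' columns of B).
Hp, hp, Wp, wp = 375, 12, 480, 15
A_sub, B_sub = A[:Hp, :hp], B[:Wp, :wp]

x = np.random.rand(21, hp, wp)                     # coarse per-class scores
y = np.einsum('ij,cjk,lk->cil', A_sub, x, B_sub)   # Equation (1) with subset weights
print(y.shape)  # (21, 375, 480)
```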

Next, to increase the recognition accuracy we also append the multi-label classification loss. This slightly improves the validation score in comparison to the baseline model, while the combination with global interpolation gives a further boost in performance (FCN-32s+GDN).

Besides that, we have also conducted additional experiments with FCN-32s, where we insert a fully-connected layer directly after the coarse output (FCN-32s+FC). The idea behind this trick is to allow the network to refine the local predictions based on the information from all the pixels. One drawback of such an approach is that fully-connected layers require a fixed-size input, although the solutions discussed above are also applicable here. Nevertheless, neither of these methods gives satisfactory results during the empirical evaluations. Therefore, we proceed with a slightly different architecture: before appending the fully-connected layer, we first add a spatial pyramid pooling layer [He et al.(2015)He, Zhang, Ren, and Sun], which produces an output of the same length given an arbitrarily sized input. In particular, we use a multi-level pyramid with max-pooling. Though during evaluation this approach alone does not give any improvement in the validation score over the baseline model, its ensemble with the global deconvolution model (FCN-32s+GDN+FC) improves previous results, which indicates that these models may be complementary to each other.
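For reference, a minimal spatial pyramid pooling layer might look as follows; the pyramid levels (1, 2, 4) are an illustrative assumption rather than the configuration used in the paper:

```python
import numpy as np

def spp(x, levels=(1, 2, 4)):
    """Spatial pyramid pooling: max-pool a C x h x w map into fixed-size bins,
    so the concatenated vector has the same length for any input size."""
    _, h, w = x.shape
    feats = []
    for n in levels:
        # Split rows/cols into n roughly equal bins.
        rs = np.linspace(0, h, n + 1).astype(int)
        cs = np.linspace(0, w, n + 1).astype(int)
        for i in range(n):
            for j in range(n):
                feats.append(x[:, rs[i]:rs[i+1], cs[j]:cs[j+1]].max(axis=(1, 2)))
    return np.concatenate(feats)   # length C * sum(n^2 for n in levels)

# Two inputs of different spatial sizes map to vectors of the same length.
v1 = spp(np.random.rand(8, 20, 30))
v2 = spp(np.random.rand(8, 13, 17))
print(v1.shape == v2.shape)  # True: both 8 * (1 + 4 + 16) = 168
```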

Method mean IoU
FCN-32s [Long et al.(2015)Long, Shelhamer, and Darrell] 59.4
FCN-32s + Label Loss 59.8
FCN-32s + Global Interp. 60.9
FCN-32s + GDN 61.2
FCN-32s + GDN + FC 62.5
DL-LargeFOV [Chen et al.(2014)Chen, Papandreou, Kokkinos, Murphy, and Yuille] 73.3
DL-LargeFOV + Label Loss 73.9
DL-LargeFOV + Global Interp. 74.2
DL-LargeFOV + GDN 75.1
Table 1: Mean intersection over union accuracy of our approach (GDN), which includes the addition of multi-label classification loss and global interpolation, compared with the baseline model on the reduced validation dataset (for FCN-32s) and on the PASCAL VOC 2012 validation dataset (for DL-LargeFOV).

We continue with the evaluation of the best performing models on the test set (Table 2). Both of them improve on their baseline model, FCN-32s, and even outperform FCN-8s, another model by Long et al. [Long et al.(2015)Long, Shelhamer, and Darrell] with the skip-architecture, which combines information from lower layers with the final prediction layer.

Some examples of our approach can be seen in Figure 3.

4.4 Experiments with DeepLab

As the next baseline model, we consider DeepLab-LargeFOV [Chen et al.(2014)Chen, Papandreou, Kokkinos, Murphy, and Yuille]. With the help of the algorithme à trous [Holschneider et al.(1989)Holschneider, Kronland-Martinet, Morlet, and Tchamitchian, Shensa(1992)], the model has a larger Field-of-View (FOV), which results in finer predictions from the network. Besides that, this model is significantly faster and contains fewer parameters than the plain modification of the VGG-16 net, due to the reduced number of filters in the last two convolutional layers. The model employs simple bilinear interpolation to acquire an output of the same resolution as the input.

We proceed with the same experiments as for the FCN-32s model, except for the ones involving the fully-connected layer. As DeepLab-LargeFOV has a higher-resolution coarse output, the inclusion of the fully-connected layer would result in a weight matrix of several billion parameters. Therefore, we omit these experiments.

We separately replace the bilinear interpolation with global deconvolution, append the label loss, and estimate the joint GDN model. We carry out the same strategy outlined above during the testing stage to deal with variable-size inputs. All the experiments lead to improvements over the baseline model, with GDN showing a significantly higher score on the PASCAL VOC 2012 val set (Table 1).


Figure 4: Failure cases of our approach, Global Deconvolutional Network (GDN), on the PASCAL VOC 2012 val set. Best viewed in colour.
Method aero bike bird boat bottle bus car cat chair cow table dog horse mbike person plant sheep sofa train tv mIoU
FCN-8s [Long et al.(2015)Long, Shelhamer, and Darrell] 76.8 34.2 68.9 49.4 60.3 75.3 74.7 77.6 21.4 62.5 46.8 71.8 63.9 76.5 73.9 45.2 72.4 37.4 70.9 55.1 62.20
FCN-32s + GDN 74.5 31.8 66.6 49.7 60.5 76.9 75.8 76.0 22.8 57.5 54.5 72.9 59.4 74.9 73.6 50.9 67.5 43.2 70.0 56.4 62.22
FCN-32s + GDN + FC 75.6 31.5 69.2 51.6 62.9 78.8 76.7 78.6 24.6 61.6 60.3 74.5 62.6 76.0 74.3 51.4 70.6 47.3 73.9 58.3 64.37
DL-LargeFOV-CRF [Chen et al.(2014)Chen, Papandreou, Kokkinos, Murphy, and Yuille] 83.4 36.5 82.5 62.2 66.4 85.3 78.4 83.7 30.4 72.9 60.4 78.4 75.4 82.1 79.6 58.2 81.9 48.8 73.6 63.2 70.34
DeconvNet+CRF_VOC[Noh et al.(2015)Noh, Hong, and Han] 87.8 41.9 80.6 63.9 67.3 88.1 78.4 81.3 25.9 73.7 61.2 72.0 77.0 79.9 78.7 59.5 78.3 55.0 75.2 61.5 70.50
DL-MSC-LargeFOV-CRF [Chen et al.(2014)Chen, Papandreou, Kokkinos, Murphy, and Yuille] 84.4 54.5 81.5 63.6 65.9 85.1 79.1 83.4 30.7 74.1 59.8 79.0 76.1 83.2 80.8 59.7 82.2 50.4 73.1 63.7 71.60
EDeconvNet+CRF_VOC[Noh et al.(2015)Noh, Hong, and Han] 89.9 39.3 79.7 63.9 68.2 87.4 81.2 86.1 28.5 77.0 62.0 79.0 80.3 83.6 80.2 58.8 83.4 54.3 80.7 65.0 72.50
DL-LargeFOV-CRF + GDN 87.9 37.8 88.8 64.5 70.7 87.7 81.3 87.1 32.5 76.7 66.7 80.4 76.6 82.2 82.3 57.9 84.5 55.9 78.5 64.2 73.21
DL-L_FOV-CRF + GDN_ENS 88.6 48.6 88.8 64.7 70.4 87.2 81.8 86.4 32.0 77.1 64.1 80.5 78.0 84.0 83.3 59.2 85.9 56.8 77.9 65.0 74.02
Adelaide_Cont_CRF_VOC [Liu et al.(2015)Liu, Li, Luo, Loy, and Tang] 90.6 37.6 80.0 67.8 74.4 92.0 85.2 86.2 39.1 81.2 58.9 83.8 83.9 84.3 84.8 62.1 83.2 58.2 80.8 72.3 75.30
Table 2: Mean intersection over union accuracy of our approach (GDN), compared with the competing models on the PASCAL VOC 2012 test set. (Lacking the results of FCN-32s on the test set, we compare directly with a more powerful model, FCN-8s [Long et al.(2015)Long, Shelhamer, and Darrell]. None of the methods presented here use the MS COCO dataset.)

The DeepLab-LargeFOV model also incorporates a fully-connected CRF [Lafferty et al.(2001)Lafferty, McCallum, and Pereira, Kohli et al.(2009)Kohli, Ladicky, and Torr, Krähenbühl and Koltun(2011)] as a post-processing step. To set the parameters of the fully connected CRF, we employ the same method of cross-validation as in [Chen et al.(2014)Chen, Papandreou, Kokkinos, Murphy, and Yuille] on a subset of the validation data. Then we send our best performing model enriched by CRF to the evaluation server. On the PASCAL VOC 2012 test set our single model (DL-LargeFOV-CRF+GDN) achieves 73.21% mIoU, a significant improvement over the baseline model (around 2.9%), and it even exceeds the multiscale DeepLab-MSc-LargeFOV model by 1.6% (Table 2); the predictions averaged across several of our models (DL-L_FOV-CRF+GDN_ENS) give a further improvement of 0.8%, showing a score competitive with the models that do not exploit the Microsoft COCO dataset [Lin et al.(2014)Lin, Maire, Belongie, Hays, Perona, Ramanan, Dollár, and Zitnick].
As is the case with the FCN-32s model, we obtain performance on par with the multi-resolution variant using a much simpler architecture. Moreover, our single CRF-equipped global deconvolutional network (DL-LargeFOV-CRF+GDN) even surpasses the results of the competing approach (DeconvNet+CRF [Noh et al.(2015)Noh, Hong, and Han]) by 2.7%, although the deconvolutional part of that network contains significantly more parameters: almost 126M compared to fewer than 70K for global deconvolution; in the case of ensembles, the improvement is over 1.5%.

The illustrative examples are presented in Figures 3 and 4.

5 Conclusion

In this paper we addressed two important problems of semantic image segmentation: an upsampling of the low-resolution output from the network and refinement of this coarse output, incorporating global information and the additional classification loss. We proposed a novel approach, global deconvolution, to acquire the output of the same size as the input for images of variable resolutions. We showed that global deconvolution effectively replaces standard approaches, and can easily be trained in a straightforward manner.

On the benchmark competition, PASCAL VOC 2012, we showed that the proposed approach outperforms the results of the baseline models. Furthermore, our method even surpasses the performance of more powerful multi-resolution models, which combine information from several blocks of the deep neural network.

Acknowledgements The authors would like to thank the anonymous reviewers for their helpful and constructive comments, and Gaee Kim for making Fig. 1. This work is supported by the Ministry of Science, ICT & Future Planning (MSIP), Korea, under Basic Science Research Program through the National Research Foundation of Korea (NRF) grant (NRF-2014R1A1A1002662), under the ITRC (Information Technology Research Center) support program (IITP-2016-R2720-16-0007) supervised by the IITP (Institute for Information & communications Technology Promotion) and under NIPA (National IT Industry Promotion Agency) program (NIPA-S0415-15-1004).