1 Introduction
Convolutional Neural Networks (CNNs) have been extensively studied in the computer vision literature to tackle a variety of tasks, such as image classification [14, 17, 16], object detection [12] and semantic segmentation [21, 6]. Major advances have been driven by novel, very deep architectural designs [14, 17], which introduce skip connections to facilitate the forward propagation of relevant information to the top of the network and provide shortcuts for gradient flow. Very deep architectures such as residual networks (ResNets) [14], densely connected networks (DenseNets) [17] and squeeze-and-excitation networks [16]
have exhibited outstanding performance on standard large-scale computer vision benchmarks such as ImageNet
[39] and MS-COCO [28].

Among top-performing classification networks, ResNets challenge the hierarchical representation learning view of CNNs [26, 42, 11]. The hierarchical representation view associates the layers of a network with different levels of abstraction. However, contrary to previous architectures such as [40], dropping or permuting almost any layer in a ResNet has been shown to only minimally affect its overall performance [42], suggesting that the operations applied by a single layer are only a small modification of the identity operation. Significant effort has been devoted to analyzing and understanding these findings. On one hand, it has been argued that ResNets behave as an ensemble of shallow networks, averaging exponentially many subnetworks that use different subsets of layers [42]. On the other hand, it has been suggested that ResNets engage in an unrolled iterative estimation of representations that refines upon their input [11]. These arguments were exploited in [8] to learn normalized inputs for iterative estimation, highlighting the importance of having transformations prior to the residual blocks.

Fully Convolutional Networks (FCNs) were presented in [30, 38] as an extension of CNNs to address per-pixel prediction problems, endowing standard CNNs with an upsampling path to recover the input spatial resolution at their output. In recent years, FCN counterparts and enhanced versions of top-performing classification networks have been successfully introduced in the semantic segmentation literature. Fully Convolutional ResNets (FC-ResNets) were presented and analyzed in [9] in the context of medical image segmentation. Moreover, Fully Convolutional DenseNets (FC-DenseNets) [21] were proposed to build low-capacity networks for semantic segmentation, taking advantage of iterative concatenation of feature maps.
In this paper, we further exploit the iterative refinement properties of ResNets to build densely connected residual networks for semantic segmentation, which we call Fully Convolutional DenseResNets (FC-DRNs). Contrary to FC-DenseNets [21], where the convolution layers are densely connected, FC-DRNs apply dense connectivity to ResNet models. Thus, our model performs iterative refinement at each representation level (within a single ResNet) and uses dense connectivity to obtain refined multi-scale feature representations (from multiple ResNets) in the pre-softmax layer. We demonstrate the potential of our architecture on the challenging CamVid [4] urban scene understanding benchmark and report state-of-the-art results. To compare and contrast with common pipelines based on top-performing classification CNNs, we perform an in-depth analysis of different downsampling operations used in the context of semantic segmentation: dilated convolution, strided convolution and pooling. Although dilated convolutions have been widely adopted in the semantic segmentation literature, we show that such operations seem to be beneficial only when used to fine-tune a pre-trained network that applies downsampling operations (e.g. pooling or strided convolution). When trained from scratch, dilation-based models are outperformed by their pooling- and strided-convolution-based counterparts, highlighting the generalization capabilities of downsampling operations.

The contributions of our paper can be summarized as follows:

We combine FC-DenseNets and FC-ResNets into a single model (FC-DRN) that fuses the benefits of both architectures: gradient flow and iterative refinement from FC-ResNets, as well as multi-scale feature representation and deep supervision from FC-DenseNets.

We show that the FC-DRN model achieves state-of-the-art performance on the CamVid dataset [4]. Moreover, FC-DRN outperforms FC-DenseNets while keeping the number of trainable parameters small.

We provide an analysis of different operations that enlarge the receptive field of a network, namely pooling, strided convolutions and dilated convolutions. We inspect FC-DRN by dropping ResNets from trained models, as well as by visualizing the norms of the weights of different layers. Our experiments suggest that the benefits of dilated convolutions only apply when combined with pre-trained networks that contain downsampling operations. Moreover, we show that ResNets are (by model construction) good regularizers, since they can reduce the model capacity at different representation levels when needed and adapt the refinement steps.
2 Related work
In recent years, FCNs have become the de facto standard for semantic segmentation. Top-performing classification networks have been successfully extended to perform semantic segmentation [43, 34, 9, 44, 21].
In order to overcome the spatial resolution loss induced by the successive downsampling operations of classification networks, several alternatives have been introduced in the literature, the most popular ones being long skip connections in encoder-decoder architectures [30, 2, 38, 19] and dilated convolutions [46, 7]. Long skip connections help recover spatial information by merging features skipped from different resolutions on the contracting path, whereas dilated convolutions enlarge the receptive field without downsizing the feature maps.
Another line of research seeks to endow segmentation pipelines with the ability to enforce structure consistency on their outputs. Contributions in this direction include Conditional Random Fields (CRFs) and their variants (which remain a popular choice) [24, 7, 49], CRFs as Recurrent Neural Networks [49], iterative inference with denoising autoencoders [37], convolutional pseudo-priors [45], as well as graph-cuts, watersheds and spatio-temporal regularization [3, 34, 25].

Alternative solutions to improve the performance of segmentation models are based on combining features at different levels of abstraction. Efforts in this direction include iterative concatenation of feature maps [17, 21]; fusing upsampled feature maps with different receptive fields prior to the softmax classifier [5], along the network [27, 1], or by means of two interacting processing streams operating at different resolutions [33]; gating skip connections between encoder and decoder to control the information to recover [19]; and using a pyramid pooling module with different spatial resolutions for context aggregation [48]. Moreover, incorporating global features has long been shown to improve semantic segmentation performance [10, 29].

3 Fully Convolutional DenseResNet
In this section, we briefly review both ResNets and DenseNets, and introduce the FC-DRN architecture.
3.1 Background
Let us denote the feature map representation of the $l$-th layer of the model as $x_l$. Traditionally, in CNNs, the feature map is obtained by applying a transformation $f_l$, composed of a convolution followed by a non-linearity, to the $(l-1)$-th feature map as $x_l = f_l(x_{l-1})$. CNNs are built by stacking together multiple such transformations. However, due to the non-linearity operation, optimization of such networks becomes harder with growing depth. Architectural solutions to this problem have been proposed in ResNets [14] and DenseNets [17].
In ResNets, the representation of the $l$-th feature map is obtained by learning a residual transformation $f_l$ of the input feature map and summing it with the input. Thus, the $l$-th feature map representation can be computed as $x_l = x_{l-1} + f_l(x_{l-1})$. This simple modification of the network's connectivity introduces a path that has no non-linearities, allowing to successfully train networks that have hundreds (or thousands) of layers. Moreover, lesion studies performed on ResNets have opened the door to research directions that try to better understand how these networks work. Following these lines, it has been suggested that ResNet layers learn small modifications of their input (close to the identity operation), engaging in an iterative refinement of their input.
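To make these update rules concrete, the following is a minimal PyTorch sketch of a plain layer versus a residual block; the layer composition and channel counts are illustrative and do not correspond to the exact blocks used in our model.

```python
import torch.nn as nn

class PlainLayer(nn.Module):
    """Standard CNN update: x_l = f_l(x_{l-1})."""
    def __init__(self, channels):
        super().__init__()
        self.f = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.f(x)

class ResidualBlock(nn.Module):
    """ResNet update: x_l = x_{l-1} + f_l(x_{l-1}).
    The shortcut path contains no non-linearity, so gradients flow
    through it unchanged; f_l only needs to learn a small residual."""
    def __init__(self, channels):
        super().__init__()
        self.f = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, x):
        return x + self.f(x)  # identity shortcut + learned residual
```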
In DenseNets, the $l$-th feature map is obtained by applying a transformation $f_l$ to the concatenation of all previously obtained feature maps, such that $x_l = f_l([x_{l-1}, x_{l-2}, \dots, x_0])$, where $[\cdot]$ denotes the concatenation operation. One can easily notice that, when following the dense connectivity pattern of DenseNets, the pre-softmax layer receives the concatenation of all previous feature maps. Thus, DenseNets introduce deep feature supervision by means of their model construction. It has been shown that, using this dense connectivity pattern, one can train very deep models that outperform ResNets [17].
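Analogously, a minimal sketch of a dense block follows; the growth rate and number of layers are arbitrary placeholders rather than the settings used in this paper.

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """DenseNet update: x_l = f_l([x_{l-1}, ..., x_0]), with [.] denoting
    channel-wise concatenation. The block output concatenates every
    intermediate feature map, which yields deep supervision."""
    def __init__(self, in_channels, growth_rate=12, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.Sequential(
                nn.BatchNorm2d(in_channels + i * growth_rate),
                nn.ReLU(inplace=True),
                nn.Conv2d(in_channels + i * growth_rate, growth_rate,
                          kernel_size=3, padding=1),
            )
            for i in range(num_layers)
        )

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            # each layer sees the concatenation of all previous feature maps
            features.append(layer(torch.cat(features, dim=1)))
        return torch.cat(features, dim=1)
```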
Moreover, it is worth mentioning that combining information at different representation levels has been shown to be beneficial in the context of semantic segmentation [10, 21].

3.2 FC-DRN model
FC-DRNs extend the FC-DenseNet architecture of [21] and incorporate a dense connectivity pattern over multiple ResNets (note that in [21] the dense connectivity pattern is over convolutional operations). Thus, FC-DRNs combine the benefits of both architectures: they perform iterative estimation at each abstraction level (by using ResNets) and combine different abstraction levels while obtaining deep supervision (by means of DenseNet connections).
The connectivity pattern of FC-DRN is visualized in Figure 1. First, the input is processed with an Initial Downsampling Block (IDB) composed of a single convolution followed by a pooling operation and two convolutions. Then, the output is fed to a dense block (the densely connected part of the model), which is composed of ResNets, transformations and concatenations, forming a downsampling path followed by an upsampling path.
In our model, there are 9 ResNets, motivated by the standard number of downsampling and upsampling operations in the FCN literature. Each ResNet is composed of 7 basic blocks, each computing the following sequence of operations twice: batch normalization, ReLU activation, dropout and 3x3 convolution. After each ResNet, we apply a
transformation with the goal of changing the representation level. This transformation is different in the downsampling and upsampling paths: in the downsampling path, it can be either a pooling, a strided convolution or a dilated convolution; whereas in the upsampling path, it can be either an upsampling, to compensate for pooling/strided convolutions, or a convolution in the case of dilated convolutions, to keep models with roughly the same capacity. Following [7, 6], transformations in the dilation-based model adopt a multi-grid pattern (for more details see Figure 1 in the supplementary material).

The outputs of the transformations are concatenated such that the input to the subsequent ResNet incorporates information from all the previous ResNets. Concatenations are performed over the channel dimension and, if needed, the resolution of the feature maps is adjusted using transformations that are applied independently to each concatenation input (note that, in order to maintain the number of transformations when comparing different models, e.g. pooling-based vs. dilation-based, we apply a convolution even when concatenating same-resolution feature maps). After each concatenation, there is a convolution to mix the features. Finally, the output of the dense block is fed to a Final Upsampling Block (FUB) that adapts the spatial resolution and the number of channels in the model output. A detailed description of the architecture is available in Table 1 of the supplementary material.
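The sketch below illustrates this dense-over-ResNets pattern under simplifying assumptions: a single spatial resolution (as in the dilated variant described later), a uniform channel count, and the per-input transformations collapsed into the mixing convolution; the dropout rate is likewise a placeholder. See Table 4 in the supplementary material for the actual channel counts.

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """FC-DRN basic block: (BN, ReLU, dropout, 3x3 conv) applied twice,
    wrapped in an identity shortcut."""
    def __init__(self, channels, p_drop=0.2):  # dropout rate is an assumption
        super().__init__()
        def unit():
            return nn.Sequential(
                nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
                nn.Dropout2d(p_drop),
                nn.Conv2d(channels, channels, kernel_size=3, padding=1))
        self.f = nn.Sequential(unit(), unit())

    def forward(self, x):
        return x + self.f(x)

class DenseOverResNets(nn.Module):
    """Dense block of FC-DRN: each ResNet receives the (mixed) concatenation
    of the IDB output and all previous ResNet outputs."""
    def __init__(self, channels=40, num_resnets=9, blocks_per_resnet=7):
        super().__init__()
        self.mixing = nn.ModuleList(
            nn.Conv2d(channels * (i + 1), channels, kernel_size=1)
            for i in range(num_resnets))
        self.resnets = nn.ModuleList(
            nn.Sequential(*[BasicBlock(channels) for _ in range(blocks_per_resnet)])
            for _ in range(num_resnets))

    def forward(self, x):
        outputs = [x]  # x: output of the Initial Downsampling Block
        for mix, resnet in zip(self.mixing, self.resnets):
            outputs.append(resnet(mix(torch.cat(outputs, dim=1))))
        return torch.cat(outputs, dim=1)  # fed to the Final Upsampling Block
```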
4 Analysis and Results
In this section, we assess the influence of applying different kinds of transformations between different ResNets and report our final results.
All experiments are conducted on the CamVid [4] dataset, which contains images of urban scenes. Each image has a resolution of 360x480 pixels and is densely labeled with 11 semantic classes. The dataset consists of 367, 101 and 233 frames for training, validation and test, respectively. In order to compare different architectures, we report results on the validation set with two metrics: mean intersection over union (mean IoU) and global accuracy.
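For reference, both metrics can be computed from a confusion matrix as in the following sketch (a standard formulation, not necessarily the exact evaluation code used for our numbers):

```python
import numpy as np

def segmentation_metrics(pred, target, num_classes=11):
    """pred, target: integer label maps of identical shape.
    Returns (mean IoU, global accuracy)."""
    cm = np.bincount(num_classes * target.ravel() + pred.ravel(),
                     minlength=num_classes ** 2).reshape(num_classes, num_classes)
    intersection = np.diag(cm)
    union = cm.sum(axis=0) + cm.sum(axis=1) - intersection
    iou = intersection / np.maximum(union, 1)  # guard against empty classes
    return iou.mean(), intersection.sum() / cm.sum()
```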
All networks were trained following the same procedure. The weights were initialized with He-uniform initialization [13], and the networks were trained with the RMSProp optimizer [41], using a learning rate that is exponentially decayed after each epoch, together with weight decay and dropout for regularization.
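A minimal sketch of this training setup follows; the numeric hyperparameter values below are illustrative placeholders rather than the exact values used for the reported results.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 11, kernel_size=3, padding=1)  # stand-in for an FC-DRN variant

# He-uniform initialization [13] for all convolutions.
for m in model.modules():
    if isinstance(m, nn.Conv2d):
        nn.init.kaiming_uniform_(m.weight)

# RMSProp [41] with exponential learning-rate decay after each epoch;
# lr, weight_decay and gamma are placeholder values.
optimizer = torch.optim.RMSprop(model.parameters(), lr=1e-3, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.995)

best_iou, bad_epochs, patience = 0.0, 0, 200
for epoch in range(10000):
    # ... one epoch over horizontally flipped, 324x324-cropped CamVid data ...
    scheduler.step()
    val_iou = 0.0  # ... mean IoU on the validation set ...
    if val_iou > best_iou:
        best_iou, bad_epochs = val_iou, 0
    else:
        bad_epochs += 1
    if bad_epochs >= patience:  # early stopping on validation mean IoU
        break
```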
The dataset was augmented with horizontal flipping and crops of 324x324. We used early stopping on the validation mean IoU metric, with a patience of 200 epochs.

4.1 FC-DRN transformation variants
State-of-the-art classification networks downsample their feature maps' resolution by successively applying pooling (or strided convolution) operations. In order to mitigate the spatial resolution loss induced by such subsampling layers, many segmentation models only allow for a limited number of subsampling operations and replace the remaining ones with dilated convolutions [7, 46, 47]. However, in other cases [21, 31, 38], the number of downsampling operations is preserved and fine-grained information is recovered via long skip connections. Therefore, we aim to analyze and compare the influence of pooling/upsampling operations versus dilated convolutions. To that aim, we build sister architectures, which have an initial downsampling block, followed by a dense block and a final upsampling block, as described in Section 3.2, and only differ in the transformation operations applied within their respective dense blocks.
Max-pooling architecture (FC-DRN-P): This architecture interleaves ResNets with four max-pooling operations (downsampling path) and four nearest-neighbor upsamplings followed by 3x3 convolutions to smooth the output (upsampling path).
Strided convolution architecture (FC-DRN-S): This architecture interleaves ResNets with four strided convolution operations (downsampling path) and four nearest-neighbor upsamplings followed by 3x3 convolutions to smooth the output (upsampling path).
Dilated architecture (FC-DRN-D): This architecture interleaves ResNets with four multi-grid dilated convolution operations of increasing dilation factor (2, 4, 16 and 32; we tested many different variants of dilation factors and found that this multi-grid structure gives the best results) and standard convolutions to emulate the upsampling operations. Note that the dense block of this architecture does not change the resolution of its feature maps.
FC-DRN-P fine-tuned with dilations (FC-DRN-PD): This architecture seeks to mimic state-of-the-art models based on top-performing classification networks, which replace the final subsampling operations with dilated convolutions [7, 46, 47]. More precisely, we substitute the last two pooling operations of FC-DRN-P with dilated convolutions of dilation rate 4 and 8, respectively. Following the spirit of FC-DRN-D, the first two upsampling operations become standard convolutions. We initialize our dilated convolutions to the identity, as suggested in [46]; a sketch of this substitution is given after the variant descriptions below.
FC-DRN-S fine-tuned with dilations (FC-DRN-SD): Following FC-DRN-PD, we substitute the last two strided convolution operations of FC-DRN-S with dilated convolutions (rates 4 and 8), whereas the first two upsampling operations become standard convolutions. In this case, we initialize the weights of the dilated convolutions with the weights of the corresponding pre-trained strided convolutions.
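The following sketch illustrates the substitution for the pooling case, with the dilated convolution initialized to the identity following [46]; the channel counts are hypothetical.

```python
import torch.nn as nn

def identity_dilated_conv(channels, dilation):
    """3x3 dilated convolution initialized to the identity mapping, so
    fine-tuning starts from the pre-trained network's behavior."""
    conv = nn.Conv2d(channels, channels, kernel_size=3,
                     padding=dilation, dilation=dilation, bias=False)
    nn.init.dirac_(conv.weight)  # Dirac kernel: output equals input at init
    return conv

# FC-DRN-PD: replace the last two poolings with dilation rates 4 and 8.
replacement_rate4 = identity_dilated_conv(channels=40, dilation=4)
replacement_rate8 = identity_dilated_conv(channels=40, dilation=8)

# FC-DRN-SD: instead of the Dirac init, copy the pre-trained strided
# convolution weights into the new dilated convolution (with stride 1).
```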
Table 1 reports the validation results for the described architectures. Among the networks trained from scratch, FC-DRN-P achieves the best performance in terms of mean IoU, by a margin of 0.8% and 3.7% w.r.t. FC-DRN-S and FC-DRN-D, respectively. When fine-tuning the pooling and strided convolution architectures with dilations, we further improve the results to 81.7% and 81.1%, respectively. It is worth noting that we also tried training FC-DRN-PD and FC-DRN-SD from scratch, which yielded worse results, highlighting the benefits of pre-training with poolings/strided convolutions, which capture larger contiguous contextual information.
Figure 2 presents qualitative results for all architectures. As illustrated in the figure, the FC-DRN-P prediction seems to better capture global information when compared to FC-DRN-D. This can be observed on the left part of the predictions, where dilated convolutions predict different classes for isolated pixels, whereas max-poolings better capture the scene and output a cleaner, more consistent prediction. Although FC-DRN-P has a more global view of the scene and is less prone to local mistakes, it lacks the resolution to make fine-grained predictions. In the FC-DRN-PD prediction, we can see that dilated convolutions help recover the column poles in the middle of the image that were not properly identified by the other architectures, and the predictions of the pedestrians on the left are also sharper. At the same time, the model preserves the global information of FC-DRN-P, reducing some errors that were present on the left part of the image in the FC-DRN-D prediction. Looking at the FC-DRN-S prediction, we can see that it successfully recovers the pedestrian on the right, but fails to correctly segment the sidewalk and the pedestrians on the left. Fine-tuning this architecture with dilations (FC-DRN-SD) helps capture a missed pedestrian on the left, but the model still lacks the ability to sharply sketch the right column pole. Furthermore, there are some artifacts in the top part of the image for both strided-convolution-based architectures.
Table 1: Validation results of the FC-DRN transformation variants on CamVid.

Architecture | mean IoU [%] | accuracy [%]
FC-DRN-P | 81.1 | 96.1
FC-DRN-S | 80.3 | 95.9
FC-DRN-D | 77.4 | 95.5
FC-DRN-PD | 81.7 | 96.0
FC-DRN-SD | 81.1 | 96.0
4.2 Results
Following the comparison in Table 1, we report our final results with the FC-DRN-PD architecture. Recall that this architecture is an FC-DRN pre-trained with a dense block of 4 max-pooling operations (downsampling path) and 4 repeat-and-convolve operations (upsampling path), then fine-tuned on the same data after substituting the last two max-poolings and the first two upsamplings with dilated convolutions.
Table 2 compares the performance of our model to state-of-the-art models. As shown in the table, our FC-DRN-PD exhibits state-of-the-art performance when compared to previous methods, especially when it comes to segmenting under-represented classes such as column poles, pedestrians and cyclists. It is worth noting that our architecture improves upon pre-trained models with 10 times more parameters. When compared to FC-DenseNets, FC-DRN outperforms both FC-DenseNet67 (with a comparable number of parameters) and FC-DenseNet103 (with only about 40% of its parameters) by 2.5% and 1.4% mean IoU, respectively. Moreover, it also exceeds the performance of more recent methods such as GFRNet, which uses gated skip connections between encoder and decoder in an FCN architecture, while only using 13% of its parameters.
In order to further boost the performance of our network, we fine-tuned it using soft targets [15, 36] (softened values in place of the hard 1 and 0 in the target representation), improving generalization and obtaining a final score of 69.4% mean IoU and 91.6% global accuracy. Using soft targets allows the network to become more accurate at predicting classes such as pedestrian, fence, column pole, sign, building and sidewalk when compared to the original version. FC-DRN recovers slim objects such as column poles and pedestrians much better than other architectures presented in the literature, while maintaining good performance on classes composed of larger objects.
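A sketch of the soft-target objective, closely related to label smoothing [15, 36]; the positive-class value below (0.9) is an assumed placeholder, not necessarily the value used in our experiments.

```python
import torch
import torch.nn.functional as F

def soft_target_loss(logits, target, num_classes=11, pos_value=0.9):
    """Cross-entropy against softened one-hot targets: the ground-truth class
    receives pos_value instead of 1, and the remaining probability mass is
    spread uniformly over the other classes instead of 0.
    logits: (N, C, H, W) scores; target: (N, H, W) integer labels."""
    neg_value = (1.0 - pos_value) / (num_classes - 1)
    one_hot = F.one_hot(target, num_classes).permute(0, 3, 1, 2).float()
    soft = one_hot * pos_value + (1.0 - one_hot) * neg_value
    return -(soft * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
```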
It is worth mentioning that, unlike most current state-of-the-art methods, FC-DRN has not been pre-trained on large datasets such as ImageNet [39]. Moreover, there are other methods in the literature that exploit virtual images to augment the training data [35] or that leverage temporal information to improve performance [20]. Note that these enhancements complement each other, and FC-DRN could most likely benefit from them to further boost its final performance as well. However, we leave this as future work.
Figure 3 shows some FC-DRN segmentation maps (right) compared to their respective ground truths (middle). We can observe that the segmentations are of good quality, in line with the quantitative results we obtained. Column poles and pedestrians are sharply segmented, but some difficulties arise when trying to distinguish between sidewalks and roads, or in the presence of small road signs.
Table 2: Comparison to state-of-the-art models on CamVid (per-class IoU, mean IoU and global accuracy; "n/a" indicates that per-class results were not reported, "–" that a value is not available).

Model | # params [M] | Ext. data | Building | Tree | Sky | Car | Sign | Road | Pedestrian | Fence | Column pole | Sidewalk | Cyclist | mean IoU [%] | Gl. acc. [%]
SegNet [2] | 29.5 | Yes | 68.7 | 52.0 | 87.0 | 58.5 | 13.4 | 86.2 | 25.3 | 17.9 | 16.0 | 60.5 | 24.8 | 46.4 | 62.5
DeconvNet [31] | 252 | Yes | n/a | 48.9 | 85.9
FCN8 [30] | 134.5 | Yes | 77.8 | 71.0 | 88.7 | 76.1 | 32.7 | 91.2 | 41.7 | 24.4 | 19.9 | 72.7 | 31.0 | 57.0 | 88.0
Visin et al. [43] | 32.3 | Yes | n/a | 58.8 | 88.7
DeepLab-LFOV [7] | – | Yes | 81.5 | 74.6 | 89.0 | 82.2 | 42.3 | 92.2 | 48.4 | 27.2 | 14.3 | 75.4 | 50.1 | 61.6 | –
Bayesian SegNet [22] | 29.5 | Yes | n/a | 63.1 | 86.9
Dilation8 [46] | 140.8 | Yes | 82.6 | 76.2 | 89.0 | 84.0 | 46.9 | 92.2 | 56.3 | 35.8 | 23.4 | 75.3 | 55.5 | 65.3 | 79.0
FC-DenseNet67 [21] | 3.5 | No | 80.2 | 75.4 | 93.0 | 78.2 | 40.9 | 94.7 | 58.4 | 30.7 | 38.4 | 81.9 | 52.1 | 65.8 | 90.8
Dilation8 + FSO [25] | 130 | Yes | 84.0 | 77.2 | 91.3 | 85.6 | 49.9 | 92.5 | 59.1 | 37.6 | 16.9 | 76.0 | 57.2 | 66.1 | 88.3
FC-DenseNet103 [21] | 9.4 | No | 83.0 | 77.3 | 93.0 | 77.3 | 43.9 | 94.5 | 59.6 | 37.1 | 37.8 | 82.2 | 50.5 | 66.9 | 91.5
GFRNet [19] | 30 | Yes | 82.5 | 76.8 | 92.1 | 81.8 | 43.0 | 94.5 | 54.6 | 47.1 | 33.4 | 82.3 | 59.4 | 68.0 | 90.8
FC-DRN-PD | 3.9 | No | 82.6 | 75.7 | 92.6 | 79.9 | 42.3 | 94.1 | 61.2 | 36.9 | 42.6 | 81.2 | 61.8 | 68.3 | 91.4
FC-DRN-PD + ST | 3.9 | No | 83.5 | 75.6 | 92.1 | 78.5 | 46.6 | 93.9 | 62.7 | 44.3 | 43.1 | 82.2 | 60.8 | 69.4 | 91.6
5 Delving deeper into FC-DRN transformations
In this section, we provide an in-depth analysis of the trained FC-DRN variants to compare the different transformation operations: pooling, dilation and strided convolution. We start by dropping ResNets from an FC-DRN, and then look into the weight norms of all ResNets in the models. We end the section with a discussion that builds on the observations from the network inspection.
We follow the layer-dropping strategy introduced in [42, 18] and drop all residual blocks of a ResNet (we only keep the first residual block, which adjusts the depth of the feature maps), with the goal of analyzing the implications of using different transformation operations. The results of this experiment are shown in Figure 4. On one hand, Figure 4(a) reports the performance drops, in percentage of mean IoU, for each ResNet in the networks trained from scratch (i.e. FC-DRN-P, FC-DRN-D and FC-DRN-S). Surprisingly, dropping ResNets 3 to 8 barely affects the performance of FC-DRN-D. However, both the pooling and strided convolution models suffer from the loss of almost any ResNet. On the other hand, Figure 4(b) presents the results of dropping ResNets in the fine-tuned models (i.e. FC-DRN-PD and FC-DRN-SD). Fine-tuning the pooling network with dilations makes ResNets 5 and 6 slightly less necessary, while ResNet 8 becomes the most relevant one. Fine-tuning the strided convolution network with dilations makes ResNet 1 extremely important, while ResNets 4 to 6 have a smaller influence. In general, it seems that fine-tuning with dilations reduces the importance of the bottleneck ResNets of the network. Moreover, the results might suggest that dilations do not change the representation level as much as poolings/strided convolutions.
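A sketch of this lesion procedure; the model.resnets access path is a hypothetical attribute name, and each ResNet is assumed to be stored as an nn.Sequential of residual blocks.

```python
import torch.nn as nn

def lesion_resnet(resnet: nn.Sequential) -> None:
    """Keep only the first residual block (the one adjusting feature depth)
    and replace every other block with the identity, in the spirit of the
    layer-dropping studies of [42, 18]. No retraining is performed."""
    for i in range(1, len(resnet)):
        resnet[i] = nn.Identity()  # only the shortcut path remains

# Example: lesion ResNet 5 of a trained model and re-evaluate mean IoU.
# lesion_resnet(model.resnets[4])
# val_iou = evaluate(model)  # hypothetical evaluation helper
```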
To gain further insight into what the FC-DRN variants are learning, we visualize the norm of the weights in all ResNets. More precisely, given a weight tensor $W$ that applies the transformation between two consecutive layers with $c_1$ and $c_2$ channels, respectively, we compute its $\ell_2$ norm $\|W\|_2$.
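Such per-layer norms can be collected as in the sketch below (the exact normalization used for Figure 5 may differ):

```python
import torch.nn as nn

def conv_weight_norms(model: nn.Module) -> dict:
    """Map each convolution's qualified name to the L2 norm of its weights;
    near-zero norms flag residual blocks whose transformation is close to 0."""
    return {
        name: module.weight.detach().norm(p=2).item()
        for name, module in model.named_modules()
        if isinstance(module, nn.Conv2d)
    }
```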
The results of this experiment for the different FC-DRN variants are shown in Figure 5. The weight norms shown in the figure are in line with the discussion of Figure 4. On one hand, the basic FC-DRN architectures are displayed in Figure 5(a). For FC-DRN-P, we can see that the weights of ResNet 5 have lower values, suggesting an explanation for the lower drop in performance when removing ResNet 5. Furthermore, we can also see that the weight norms of FC-DRN-D are almost zero for ResNets 3 to 8, suggesting that the network does not fully exploit its capacity. In the case of FC-DRN-S, the norms of the weights in ResNet 5 are smaller (similar to FC-DRN-P), whereas ResNet 2 exhibits some of the highest norms. On the other hand, Figure 5(b) shows the results of fine-tuning. FC-DRN-PD has lower weight norms in ResNets 4 to 6 than FC-DRN-P, whereas FC-DRN-SD follows a pattern similar to FC-DRN-S. However, from the observed weights, it would seem that ResNet 4 still benefits from some refinement steps. Overall, it seems that fine-tuning consistently reduces the weight norms of the bottleneck ResNets.

It is important to note that the structure of ResNets, owing to the residual block, allows the model to self-adjust its capacity when needed, forcing the residual transformation of a block to be close to $0$ and using the identity connection to forward the information. We hypothesize that this behavior occurs for some layers in our model (as shown in Figure 5). To test this hypothesis, we reduce the capacity of a trained FC-DRN by removing layers from the residuals of ResNets whose weight norms are small, and monitor the performance of the compressed FC-DRN model. If our hypothesis holds, removing the residuals in layers whose norm is close to $0$ should not strongly affect the model performance. We choose to drop the layers whose weight norms are close enough to zero, based on visual inspection of Figure 5, thus allowing each representation level to have a different number of refinement steps. The results of this trained-model compression experiment are reported in Table 3. We can see that, after removing part of the parameters from FC-DRN-P and FC-DRN-S, there is a drop in validation mean IoU for each model. Interestingly, we were able to remove an even larger fraction of the weights of the FC-DRN-D model while only experiencing a minor drop in mean IoU. Both fine-tuned models can be compressed with only slight performance drops. In general, it seems that fine-tuning the models with dilations not only improves the segmentation results but also makes the models more compressible.
Finally, we test whether the optimization of the high-capacity FC-DRN reaches a better local minimum, due to the self-adjustment of the ResNets' capacity, than training a low-capacity FC-DRN from scratch. To this end, we trained the reduced-capacity FC-DRN-D model from scratch and compared the results to the numbers reported in Table 3. The retrained model obtained a mean validation IoU below the result reported for the high-capacity FC-DRN-D. We hypothesize that the model capacity reduction during the optimization process helps in reaching a better local minimum. Since the self-adjustment of the ResNet capacity might be encouraged by weight decay, for the sake of comparison we also trained the reduced-capacity model without any weight decay; this model also obtained a mean validation IoU below that of the full-capacity model.
Table 3: Compression of trained FC-DRN models (validation mean IoU loss and compression rate).

Architecture | mean IoU loss [%] | compression rate
FC-DRN-P | – | –
FC-DRN-D | – | –
FC-DRN-S | – | –
FC-DRN-PD | – | –
FC-DRN-SD | – | –
6 Conclusions
In this paper, we combined two standard image segmentation architectures (FC-DenseNets and FC-ResNets) into a single network that we call Fully Convolutional DenseResNet. Our FC-DRN fuses the benefits of both models: gradient flow and iterative refinement from FC-ResNets, as well as multi-scale feature representation and deep supervision from FC-DenseNets. We demonstrated the potential of our model on the challenging CamVid urban scene understanding benchmark and reported state-of-the-art results, with several times fewer parameters than concurrent models.
Additionally, we analyzed different downsampling operations used in the context of semantic segmentation: dilated convolution, strided convolution and pooling. We inspected the FC-DRN by dropping ResNets from the trained models, as well as by visualizing the weight norms of different layers, and showed that ResNets are (by model construction) good regularizers, since they can reduce the model capacity when needed. In this direction, we observed that coarser representations seem to benefit from fewer refinement steps. Moreover, our results comparing different transformations suggest that pooling offers the best generalization capabilities, while the benefits of dilated convolutions only apply when combined with pre-trained networks that contain downsampling operations.
References
 [1] I. Ardiyanto and T. B. Adji. Deep residual coalesced convolutional network for efficient semantic road segmentation. MVA, 2017.
 [2] V. Badrinarayanan, A. Kendall, and R. Cipolla. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. CoRR, 2015.
 [3] T. Beier, B. Andres, U. Köthe, and F. A. Hamprecht. An efficient fusion move algorithm for the minimum cost lifted multicut problem. In Lecture Notes in Computer Science. 2016.
 [4] G. J. Brostow, J. Shotton, J. Fauqueur, and R. Cipolla. Segmentation and recognition using structure from motion point clouds. In ECCV, 2008.
 [5] H. Chen, X. Qi, J.-Z. Cheng, and P.-A. Heng. Deep contextual networks for neuronal structure segmentation. AAAI, 2016.
 [6] L. Chen, G. Papandreou, F. Schroff, and H. Adam. Rethinking atrous convolution for semantic image segmentation. CoRR, 2017.
 [7] L.C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. arXiv:1606.00915, 2016.
 [8] M. Drozdzal, G. Chartrand, E. Vorontsov, M. Shakeri, L. Di Jorio, A. Tang, A. Romero, Y. Bengio, C. Pal, and S. Kadoury. Learning normalized inputs for iterative estimation in medical image segmentation. Medical image analysis, 2018.
 [9] M. Drozdzal, E. Vorontsov, G. Chartrand, S. Kadoury, and C. Pal. The importance of skip connections in biomedical image segmentation. Deep Learning and Data Labeling for Medical Applications, 2016.
 [10] C. Gatta, A. Romero, and J. van de Weijer. Unrolling Loopy Top-Down Semantic Feedback in Convolutional Deep Networks. In CVPRW, 2014.
 [11] K. Greff, R. K. Srivastava, and J. Schmidhuber. Highway and residual networks learn unrolled iterative estimation. ICLR, 2017.
 [12] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. arXiv preprint arXiv:1703.06870, 2017.
 [13] K. He, X. Zhang, S. Ren, and J. Sun. Delving Deep into Rectifiers: Surpassing HumanLevel Performance on ImageNet Classification. ICCV, 2015.
 [14] K. He, X. Zhang, S. Ren, and J. Sun. Deep Residual Learning for Image Recognition. CVPR, 2016.
 [15] G. Hinton, O. Vinyals, and J. Dean. Distilling the Knowledge in a Neural Network. Deep Learning workshop at NIPS, 2014.
 [16] J. Hu, L. Shen, and G. Sun. Squeeze-and-excitation networks. arXiv preprint arXiv:1709.01507, 2017.
 [17] G. Huang, Z. Liu, and K. Q. Weinberger. Densely Connected Convolutional Networks. CVPR, 2017.
 [18] G. Huang, Y. Sun, Z. Liu, D. Sedra, and K. Q. Weinberger. Deep networks with stochastic depth. CoRR, 2016.
 [19] M. A. Islam, M. Rochan, N. D. Bruce, and Y. Wang. Gated feedback refinement network for dense image labeling. In CVPR, 2017.
 [20] V. Jampani, R. Gadde, and P. V. Gehler. Video Propagation Networks. CVPR, 2017.
 [21] S. Jégou, M. Drozdzal, D. Vazquez, A. Romero, and Y. Bengio. The one hundred layers tiramisu: Fully convolutional densenets for semantic segmentation. In CVVT, CVPRW, 2017.
 [22] A. Kendall, V. Badrinarayanan, and R. Cipolla. Bayesian SegNet: Model uncertainty in deep convolutional encoder-decoder architectures for scene understanding. CoRR, 2015.
 [23] A. Kendall and Y. Gal. What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision? NIPS, 2017.
 [24] P. Krähenbühl and V. Koltun. Efficient inference in fully connected crfs with gaussian edge potentials. In NIPS. 2011.
 [25] A. Kundu, V. Vineet, and V. Koltun. Feature space optimization for semantic video segmentation. In CVPR, 2016.
 [26] Q. Liao and T. A. Poggio. Bridging the gaps between residual learning, recurrent neural networks and visual cortex. CoRR, 2016.
 [27] G. Lin, A. Milan, C. Shen, and I. Reid. RefineNet: Multipath refinement networks for highresolution semantic segmentation. CVPR, 2017.
 [28] T.Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In ECCV, 2014.
 [29] W. Liu, A. Rabinovich, and A. C. Berg. ParseNet: Looking Wider to See Better. arXiv:1506.04579, jun 2015.
 [30] J. Long, E. Shelhamer, and T. Darrell. Fully Convolutional Networks for Semantic Segmentation. CVPR, 2015.
 [31] H. Noh, S. Hong, and B. Han. Learning deconvolution network for semantic segmentation. arXiv preprint arXiv:1505.04366, 2015.
 [32] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in pytorch. In NIPSW, 2017.
 [33] T. Pohlen, A. Hermans, M. Mathias, and B. Leibe. Fullresolution residual networks for semantic segmentation in street scenes. CVPR, 2017.
 [34] T. M. Quan, D. G. Hilderbrand, and W.K. Jeong. Fusionnet: A deep fully residual convolutional neural network for image segmentation in connectomics. arXiv preprint arXiv:1612.05360, 2016.
 [35] S. R. Richter, V. Vineet, S. Roth, and V. Koltun. Playing for data: Ground truth from computer games. ECCV, 2016.
 [36] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio. Fitnets: Hints for thin deep nets. In ICLR, 2015.
 [37] A. Romero, M. Drozdzal, A. Erraqabi, S. Jégou, and Y. Bengio. Image segmentation by iterative inference from conditional score estimation. arXiv preprint arXiv:1705.07450, 2017.
 [38] O. Ronneberger, P. Fischer, and T. Brox. Unet: Convolutional networks for biomedical image segmentation. CoRR, 2015.
 [39] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. FeiFei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
 [40] K. Simonyan and A. Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. ICLR, 2015.
 [41] T. Tieleman and G. Hinton. RMSProp. In COURSERA: Neural Networks for Machine Learning, 2012.
 [42] A. Veit, M. Wilber, and S. Belongie. Residual Networks Behave Like Ensembles of Relatively Shallow Networks. NIPS, 2016.
 [43] F. Visin, M. Ciccone, A. Romero, K. Kastner, K. Cho, Y. Bengio, M. Matteucci, and A. Courville. ReSeg: A Recurrent Neural Networkbased Model for Semantic Segmentation. CVPR workshop, 2016.
 [44] Z. Wu, C. Shen, and A. van den Hengel. Wider or Deeper: Revisiting the ResNet Model for Visual Recognition. arXiv:1611.10080, nov 2016.
 [45] S. Xie, X. Huang, and Z. Tu. Top-Down Learning for Structured Labeling with Convolutional Pseudoprior, pages 302–317. Springer International Publishing, Cham, 2016.
 [46] F. Yu and V. Koltun. MultiScale Context Aggregation by Dilated Convolutions. ICLR, 2016.
 [47] F. Yu, V. Koltun, and T. Funkhouser. Dilated residual networks. CVPR, 2017.
 [48] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia. Pyramid scene parsing network. CVPR, 2017.
 [49] S. Zheng, S. Jayasumana, B. RomeraParedes, V. Vineet, Z. Su, D. Du, C. Huang, and P. Torr. Conditional random fields as recurrent neural networks. ICCV, 2015.
Appendix A Supplementary Material
We present the architecture details in Table 4, agnostic to the type of transformation used between ResNets. The outputs of the transformation blocks are reused when needed. In the case of dilations, we still maintain all transformations to keep the number of parameters roughly constant. The right column of the table indicates the number of feature channels after applying each operation.
The detailed composition of the ResNet block and the multigrid dilation block used in our architecture is presented in Figure 6.
Additionally, we present some output segmentations for FC-DRN-PD, the max-pooling architecture fine-tuned with dilated convolutions and trained with soft targets. Predictions are shown in Figure 7.
Table 4: FC-DRN architecture details (the right column gives the number of feature channels after each operation).

Operation | Out
IDB: conv, max pool, 2 conv | 50
R1 | 30
[(IDB), (R1)] | 80
mixing block | 80
R2 | 40
[(IDB), (R1), (R2)] | 120
mixing block | 120
R3 | 40
[(IDB), (R1), (R2), (R3)] | 160
mixing block | 160
R4 | 40
[(IDB), (R1), (R2), (R3), (R4)] | 200
mixing block | 200
R5 | 50
[(IDB), (R1), (R2), (R3), R4, (R5)] | 200
mixing block | 200
R6 | 40
[(IDB), (R1), (R2), R3, (R4), (R5), (R6)] | 240
mixing block | 240
R7 | 40
[(IDB), (R1), R2, (R3), (R4), (R5), (R6), (R7)] | 280
mixing block | 280
R8 | 40
[IDB, R1, (R2), (R3), (R4), (R5), (R6), (R7), (R8)] | 320
mixing block | 320
R9 | 30
[IDB, R1, (R2), (R3), (R4), (R5), (R6), (R7), (R8), R9] | 350
mixing block | 350
FUB: 2x2 repeat upsampling, conv | 50
Linear classifier: conv | 11