Efficient Yet Deep Convolutional Neural Networks for Semantic Segmentation

07/26/2017 · by Sharif Amit Kamran, et al. · Independent University, Bangladesh

Semantic segmentation with deep convolutional neural networks poses a demanding challenge for any GPU-intensive workload, as millions of parameters must be computed, resulting in huge memory consumption. Moreover, extracting finer features and conducting supervised training further increase the complexity. Since the introduction of the Fully Convolutional Network, which uses finer strides and deconvolutional layers for upsampling, it has become the standard choice for image segmentation tasks. We propose two segmentation architectures that transfer weights from the popular classification networks VGG19 and VGG16, trained on the ImageNet classification dataset. We transform all fully connected layers into convolutional layers, use dilated convolution to decrease the number of parameters, add finer strides, and attach four skip architectures that are element-wise summed with the deconvolutional layers in steps. We train in two stages: first with the PASCAL VOC2012 training data and then with the SBD training and validation sets. Our model FCN-2s-Dilated-VGG19 yields a better score on the PASCAL VOC2012 test set, with a meanIOU of 69 percent, which is 1.8 percent better than FCN-8s, and FCN-2s-Dilated-VGG16 scores a meanIOU of 67.6 percent. At the same time, our models consume 10-20 percent less memory than FCN-8s when training on NVIDIA Pascal GPUs, making them more efficient and less memory-consuming architectures for pixel-wise segmentation.


I Introduction

With the introduction of convolutional neural networks, image recognition has accelerated at a great pace and produced state-of-the-art results for classification, detection and semantic segmentation alike. Classification assigns the whole image to a certain class, whereas in detection each object has to be identified accurately with a bounding box. For semantic segmentation, every pixel of each object in the image has to be classified into a corresponding class. Over the years many classification models have yielded better results [1, 2, 3] in their respective tasks. Due to this success, they have also been used as base models for extracting local features and producing finer output [4, 5] for semantic segmentation tasks.

The problem at hand is to keep the global structure in balance with the local context [4, 5]. Here, the global structure means the shape of the objects as a whole and how they are placed in the image with respect to other objects, while local features are the small geometric elements such as sharp edges, circles, etc. For example, if we consider a bipedal human as the object and its shape as the global structure, then the shape of the eyes and nose and the color of the lips can be considered local features. Most of the time the local features tend to get lost while training the neural network, and the global context dominates throughout the segmentation mask. So, to extract those fine local features, skip architectures were introduced [4, 5, 6, 7] on top of preexisting segmentation architectures. The fine output from these skip connections is element-wise summed with the coarse semantic information in the topmost layers of the network. By using skip architectures, the image representation becomes finer and less coarse.

The drawback of designing convolutional neural networks for high-level, pixel-wise classification is the huge amount of memory required for the task. Firstly, orthodox ConvNets have rather large receptive fields because of their convolutional filters and generate coarse, blob-like output maps when redefined to produce pixel-wise segmentation [4]. Secondly, sub-sampling with max-pooling in ConvNets diminishes the chance of obtaining finer output [8]. Furthermore, similar labeling in neighboring pixels tends to get lost in the deeper layers where upsampling [9] takes place. So visual consistency and retention of spatial features are essential for producing a sharp segmentation mask. Falling short of producing such fine output can result in poor object portrayal and patch-like false regions in the segmentation mask [8, 10, 11, 12].

Using finer strides [4] and replacing vanilla convolution with dilated convolution [5, 13] have shown better segmentation results while keeping memory usage in check, because with dilation the receptive field can be increased exponentially [5] while the convolution filter remains the same size as before. So, at the expense of reducing the filter size and adding dilation within it, we can free up more memory for computation in the sixth convolutional layer of the architecture, which is the most expensive layer.

Adding more skip architectures increases memory usage for the whole end-to-end network. But because additional memory has been freed up by using dilation [5, 13], extra skip connections can be added to upsample local features from other convolutional layers. In a feed-forward network like the Fully Convolutional Network [4] (denoted FCN), the size of the representation changes with each convolution. As the structure is similar to an encoder-decoder network, the feature hierarchies from earlier layers have to be element-wise summed with the upsampled [4, 9] layers in steps.

In this paper we propose an efficient yet deep feed-forward neural network for strongly supervised image segmentation. Our work integrates both dilated and vanilla convolution to recreate an FCN (Fully Convolutional Network) architecture that generates better output while consuming less memory. In addition, we introduce four skip architectures which recover local information lost in the bottom layers of the network. These features are then upsampled and element-wise summed with the global feature map in steps in the top layers, which in turn produces a better segmentation mask while keeping GPU memory consumption in check. Most importantly, with these changes the end-to-end deep network can be trained on any type of data using the usual back-propagation algorithm, with more efficient and finer results.

II Literature Review

The following section describes different procedures that have been proposed for the semantic segmentation task using deep learning. Out of many approaches, only a few have been adopted for high-compute pixel-wise segmentation.

Our proposed model was developed from a particular family of neural networks used for image classification [2, 1, 3], and the weights were transferred from it [14, 15]. Transfer learning was first performed for classification tasks, was afterwards applied to object detection, and has lately been adopted for instance-aware segmentation [16] and image segmentation models with a powerful classifier [17, 18, 6]. We redesign and redefine the architecture and perform fine-tuning to obtain sparser and more accurate predictions for semantic segmentation. Furthermore, we compare different models with ours and show how it is more efficient and effective for semantic segmentation tasks.

Multi-digit recognition with a neural network [19], an extension of LeNet [20], was one of the first works in which an erratic range of input values was handled. Though the task was designed for one-dimensional data, Viterbi decoding was sufficient for it. Three years later the convolutional neural network was extended to two-dimensional feature output for processing postal address data [21]. These historical breakthroughs were designed to conduct small yet powerful detection tasks. Additionally, LeCun et al. [22] developed a CNN using fully convolutional inference for sparse, multiple-class segmentation of embryos. FCNs have also been used in many recent deep-layered networks for high-level computation: sliding-window based integrated object detection and localization by Eigen et al. [23], recurrent neural networks for scene labeling by Pinheiro et al. [24], and restoring images covered with dirt using a ConvNet by Eigen et al. [25] are such remarkable examples. Training an FCN can be difficult, but it has been used for detecting human parts and estimating pose efficiently by Tompson et al. [26].

Different approaches can be taken to obtain a finer segmentation mask by exploiting convolutional neural networks. One strategy is to develop an individual system for extracting dense features and detecting zoomed-in edges from images for finer semantic segmentation [27, 28]. A single-step process can be to extract semantic features with a ConvNet and then use superpixels to figure out the inner layout of the image. Another procedure is to retrieve superpixels from the given image layout and then extract features from the images one by one [27, 29]. The drawback of this approach is that erroneous superpixels may result in fallacious predictions, irrespective of how powerful the feature extraction was. Zheng et al. [7, 30] designed an RNN model and used conditional random fields to obtain finer features by training an end-to-end network for semantic segmentation. They also proposed a disjointed version of the same model, with less accuracy and more memory consumption, to prove that an end-to-end network always has the upper hand over a two- or even three-stage segmentation retrieval procedure.

Another strategy is to develop a model, train it on supervised image data, and output a segmentation label map for each category. Retaining the spatial information, one can replace the fully connected layers with convolutional layers in a deep ConvNet, as shown by Eigen et al. [31]. The most groundbreaking work so far was by Shelhamer and Long et al. [4], where the idea was that an FCN can be designed to harness features from the topmost layers to classify pixels, whereas the bottom layers can be used for detecting shapes, contours and edges. By element-wise summing earlier layers with later layers, they introduced the idea of the skip architecture. On the other hand, conditional random fields have been used to refine semantic segmentation further [7, 30]. CRFs were also used by Snavely et al. [32] and Chen et al. [8, 33] to refine an existing segmentation mask: Snavely et al. conducted material recognition and segmentation, while Chen et al. developed better ways to obtain finer semantic image segmentation. Though the previous procedures used a disjointed CRF as post-processing on the segmented output, the method developed by Torr et al. [7, 30] employed the CRF as a recurrent neural network and also developed a higher-order model that extends CRF-as-RNN. Not only is the ConvNet end-to-end, it also converges faster than the previous CRF models and produces a finer segmentation mask.

The difference between dilated and vanilla convolution is the extra parameter, called holes or dilation, that affects the receptive field of the convolution's filter. The whole idea of the à trous algorithm, which is based on wavelet decomposition [34], relies on the dilated filter. In [5], Yu et al. used the term "dilated convolution" instead of "convolution with a dilated filter" to emphasize that no dilated filter is actually built or produced; the convolutional layer is modified instead to accommodate a new parameter, dilation, that alters the preexisting filter. In [35], Chen et al. made use of dilation to modify the architecture of Shelhamer et al. [4] to make it suitable for their task. In contrast, Yu et al. [5] developed a new family of feed-forward networks that exploits dilated convolutions and multi-scale context aggregation but gets rid of the preexisting skip architectures.

III Segmentation Architecture

III-A Transfer Learning from Classification Net

VGGnet is a famous neural network which won ILSVRC14 [1] for image classification. It works on the principle of using small 3×3 filters for feature extraction and stacking convolutions consecutively to make the receptive field bigger. We transferred weights from the VGG 19-layer network, removed the classifier from the network and turned all the fully connected layers into convolutions, as done by Shelhamer et al. [4].

In a convolutional neural network all tensors have three dimensions of size H × W × N, where H and W are height and width and N is the color channel or feature map depth. The first layer takes the image as input, where H × W is the pixel size and N is the three color channels of RGB. As described by Shelhamer et al. [4], locations in higher layers correspond to the locations in the image they are path-connected to, which are called their receptive fields.

III-B Spatial Information and Receptive Field

The output feature map of each convolution can be predefined, but the spatial dimension depends on the size of the filter, the stride and the padding. The FCN architecture keeps the spatial dimension of each convolution unchanged before max-pooling (the exceptions being the Fc6 and Fc7 layers). Let the output spatial dimension be $n_{out}$ and the input spatial dimension be $n_{in}$ for any convolutional layer. Equation (1) can be used to obtain the output spatial dimension.

$n_{out} = (n_{in} - K + 2P)/S + 1$   (1)

Here, $P$ stands for padding, $K$ is the kernel or filter size and $S$ is the stride. We choose a filter size of 3×3, single padding and stride 1 for the convolutional layers. This lets the convolutions retain the spatial dimension and leaves pooling alone to reduce it.

For pooling we use a stride of 2 and a 2×2 filter to lower the spatial size of the tensors. If we substitute these values for filter and stride into equation (1), we can see that the output becomes half the size of the input.
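As a quick illustration of equation (1), here is a minimal Python sketch (the function name is ours, not from the paper's code) showing that a 3×3 filter with padding 1 and stride 1 preserves the spatial size, while a 2×2 pooling with stride 2 halves it:

```python
def conv_output_size(n_in, k, p, s):
    """Spatial output size from equation (1): (n_in - k + 2p) / s + 1."""
    return (n_in - k + 2 * p) // s + 1

# 3x3 filter, padding 1, stride 1 keeps the spatial size unchanged.
print(conv_output_size(224, k=3, p=1, s=1))  # 224
# 2x2 max-pooling with stride 2 halves the spatial size.
print(conv_output_size(224, k=2, p=0, s=2))  # 112
```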

The tensors pass from the first convolutional layer to the second and onward as they are fed forward through the network. In the first two convolutional layers we have 3×3 filters, so the first one has a receptive field of 3×3 and the second a receptive field of 5×5. From the third to the fifth set of convolutional layers we have four convolutions in each set, so the receptive field is considerably larger than in the earlier layers. Notice that the receptive field increases linearly after each convolution. Let the receptive field be $R$, the filter size $K$ and the stride $S$. If the filter and stride values remain the same for consecutive convolutions, formulation (2) can be used to compute the receptive field.

$R_{out} = R_{in} + (K - 1)\,S$   (2)

where $R_{out}$ is the receptive field of the next convolution and $R_{in}$ is the receptive field of the previous convolution.
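The recursion in (2) can be sketched in a few lines of Python (an illustrative helper of ours, assuming stride 1 as in the VGG convolutional blocks); starting from a single pixel, stacked 3×3 convolutions give receptive fields of 3, 5, 7, and so on:

```python
def next_receptive_field(r_in, k, s):
    """Equation (2): r_out = r_in + (k - 1) * s."""
    return r_in + (k - 1) * s

r = 1  # a single input pixel
for layer in range(3):
    r = next_receptive_field(r, k=3, s=1)
    print(r)  # 3, 5, 7
```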

The sixth layer, which is the most expensive one, comes next. It is a fully connected layer whose input has a spatial dimension of 7×7 and which has 4096 outputs. The next one is also a fully connected layer, with 4096 outputs and an effective filter size of 1×1. We convert both of these layers into convolutional layers with filter sizes of 7×7 and 1×1 respectively, as done by Shelhamer et al. [4].
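A minimal NumPy sketch of this "net surgery" step, assuming the usual VGG weight shapes (the array names are illustrative, not taken from the released code): a fully connected layer over a 7×7×512 volume can be recast as a 7×7 convolution simply by reshaping its weight matrix, and Fc7 becomes a 1×1 convolution.

```python
import numpy as np

# Fc6: 4096 outputs fully connected to a 7x7x512 input volume.
fc6_weights = np.random.randn(4096, 512 * 7 * 7).astype(np.float32)
# Same weights viewed as a convolution: (out_ch, in_ch, kH, kW) = (4096, 512, 7, 7).
conv6_weights = fc6_weights.reshape(4096, 512, 7, 7)

# Fc7: 4096 -> 4096 becomes a 1x1 convolution.
fc7_weights = np.random.randn(4096, 4096).astype(np.float32)
conv7_weights = fc7_weights.reshape(4096, 4096, 1, 1)

print(conv6_weights.shape, conv7_weights.shape)  # (4096, 512, 7, 7) (4096, 4096, 1, 1)
```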

III-C Decreasing Parameters using Dilation

Yu et al. [5] combined dilation with vanilla convolution throughout the network and emphasized multi-scale context aggregation. We instead stick to the original structure of FCN [4] but include dilation only in the sixth convolution, which is the most expensive layer and computes the largest number of parameters. Similar work was done before in ParseNet by Liu et al. [13], but they trained on a reduced version of the VGG-16 net. We train on the original VGG-19 classification network and only change the filter size and add dilation as a parameter in the Fc6 convolution (see Table I), while keeping the parameters of all the other convolutions unchanged. The relation between filter size and dilation can be formulated using (3).

$K' = K + (K - 1)(d - 1)$   (3)

where K’ is the new filter size, K is the given filter size and d is the amount of dilation.

A convolution with dilation 1 is the same as having no dilation. Using equation (3), the filter size of the sixth convolutional layer can be changed from 7×7 to 3×3 with a dilation of 3, which covers the same 7×7 region. Yu et al. [5] also defined (4) for calculating the receptive field of a dilated convolution under the same conditions as before.

$R_{out} = R_{in} + (K - 1)\,d\,S$   (4)
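Equations (3) and (4) can be checked with a short sketch (the helper names are ours): a 3×3 filter with dilation 3 covers the same 7×7 window as the original Fc6 filter, so the receptive field is preserved while the number of weights shrinks.

```python
def effective_filter_size(k, d):
    """Equation (3): k' = k + (k - 1) * (d - 1)."""
    return k + (k - 1) * (d - 1)

def dilated_receptive_field(r_in, k, d, s):
    """Equation (4): r_out = r_in + (k - 1) * d * s."""
    return r_in + (k - 1) * d * s

print(effective_filter_size(3, 3))                # 7, same coverage as a 7x7 filter
print(dilated_receptive_field(1, k=3, d=3, s=1))  # 7
```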

III-D Deconvolution with Finer Strides

Upsampling by a factor f means using a convolution with a fractional input stride of 1/f [4]. If f is an integer, we can reverse the forward and backward passes to make upsampling work by replacing vanilla convolution with transposed convolution. So we use upsampling with finer strides, as done by Shelhamer et al. [4], for end-to-end calculation of the pixel-wise semantic loss. But where those authors used strides of 32, 16 and 8, we use a stride of 2 to upsample in steps. This is done so that local features from the bottom layers can be element-wise summed through skip architectures (discussed in the next section). Fig. 1 shows the procedure in detail. In [15], transposed convolution layers are called "deconvolution layers". Our deconvolutional layers perform "bilinear interpolation" as described in [9] rather than learning their weights. It was observed by Shelhamer et al. [4] and Torr et al. [7] that upsampling inside an end-to-end network is much faster and more effective for learning dense prediction.
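For reference, the bilinear filter commonly used to initialize FCN-style deconvolution layers [4, 9] can be generated as below; this is a generic construction we assume here, not code taken from the authors' repository.

```python
import numpy as np

def bilinear_kernel(factor):
    """2D bilinear interpolation filter for a deconvolution with stride `factor`."""
    size = 2 * factor - factor % 2
    center = factor - 1 if size % 2 == 1 else factor - 0.5
    og = np.ogrid[:size, :size]
    return (1 - abs(og[0] - center) / factor) * (1 - abs(og[1] - center) / factor)

# Stride-2 upsampling uses a 4x4 bilinear filter.
print(bilinear_kernel(2))
```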

Fig. 1: Comparison between VGG-16, FCN-8s and the Dilated Fully Convolutional Network with skip architectures. Dilated FCN-2s upsamples stride-2 predictions back to full-resolution pixels in five steps. Pool-4 to pool-1 are element-wise summed at strides 2, 4, 8 and 16, in steps, in reverse order. This provides finer segmentation and more accurate pixel classification.

III-E Multiple Skip Architectures

We adopt a training procedure similar to FCN-8s-all-at-once [4] rather than the staged FCN-8s version, because training in stages is more time-consuming while predicting with nearly the same accuracy as the all-at-once version. Moreover, in the all-at-once version each skip architecture is scaled by a fixed constant, and these constants are chosen so that they equal the average feature norms across all skip architectures [4]. This decreases inference time considerably: in our case the inference time is less than 200 ms for each forward and backward pass combined, whereas FCN-8s has an inference time of 500 ms. We use a total of four skip architectures. The skip architectures also tend to consume less memory than the convolutional layers, since their only operation is an element-wise summation with another layer; a toy sketch of the fusion follows below.
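The following NumPy sketch illustrates the fusion scheme on toy shapes (nearest-neighbour upsampling stands in for the real stride-2 deconvolutions, and the class-score maps from pool4 down to pool1 are assumed to come from 1×1 convolutions, as in FCN [4]): the coarse prediction is repeatedly upsampled by 2 and element-wise summed with the next skip, and one final ×2 upsampling would bring it back to the input resolution, for five steps in total as in Fig. 1.

```python
import numpy as np

def upsample2x(score):
    """Stand-in for a stride-2 deconvolution: nearest-neighbour x2 upsampling."""
    return np.kron(score, np.ones((1, 2, 2)))

n_classes = 21  # PASCAL VOC: 20 classes + background
coarse = np.zeros((n_classes, 7, 7))           # coarse scores from the top of the net
pool_scores = [np.zeros((n_classes, 14, 14)),  # score maps from pool4 ... pool1 (toy shapes)
               np.zeros((n_classes, 28, 28)),
               np.zeros((n_classes, 56, 56)),
               np.zeros((n_classes, 112, 112))]

fused = coarse
for skip in pool_scores:              # four skip architectures, finest last
    fused = upsample2x(fused) + skip  # upsample by 2, element-wise sum
print(fused.shape)                    # (21, 112, 112); one more x2 reaches 224x224
```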

                      FCN-8s          Dilated FCN-2s (ours)
Inference time        0.5 s           0.2 s
Fc6 weights           4096x512x7x7    4096x512x3x3
Dilation              1 (none)        3
Fc6 parameters        102,760,448     18,874,368
Total parameters      134,477,280     55,812,880
TABLE I: Parameters Comparison
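The Fc6 figures in Table I follow directly from the weight shapes (a quick arithmetic check):

```python
# Fc6 parameters = out_channels * in_channels * kH * kW
print(4096 * 512 * 7 * 7)  # 102,760,448 for the original 7x7 filter (FCN-8s)
print(4096 * 512 * 3 * 3)  # 18,874,368 for the 3x3 filter with dilation 3 (ours)
```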

IV Experiments

IV-A Metrics and Evaluation

We use four different metrics: pixel accuracy (5), mean accuracy (6), mean intersection-over-union, denoted mIOU (7), and frequency-weighted IU, denoted fw-IU (8). As background pixels are in the majority, pixel accuracy alone is not preferable; for semantic segmentation and scene labeling, mean intersection-over-union is the most suitable choice for benchmarking.

$\text{pixel accuracy} = \sum_i n_{ii} \,/\, \sum_i t_i$   (5)
$\text{mean accuracy} = (1/n_{cl}) \sum_i n_{ii}/t_i$   (6)
$\text{mIOU} = (1/n_{cl}) \sum_i n_{ii} \,/\, \big(t_i + \sum_j n_{ji} - n_{ii}\big)$   (7)
$\text{fw-IU} = \big(\sum_k t_k\big)^{-1} \sum_i t_i\, n_{ii} \,/\, \big(t_i + \sum_j n_{ji} - n_{ii}\big)$   (8)

where $n_{ij}$ is the number of pixels of class i predicted to belong to class j, $n_{cl}$ is the number of classes and $t_i = \sum_j n_{ij}$ is the total number of pixels of class i. The data was used exactly as provided by [36] and [18]; no pre- or post-processing or augmentation was applied to the training or validation images to enhance the accuracy of the segmentation output.
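A small NumPy sketch of metrics (5)-(8) computed from a confusion matrix (an illustrative helper of ours; classes absent from the ground truth would need masking to avoid division by zero):

```python
import numpy as np

def segmentation_metrics(n):
    """n[i, j] = number of pixels of class i predicted as class j; t_i = sum_j n[i, j]."""
    n = n.astype(np.float64)
    t = n.sum(axis=1)                    # ground-truth pixels per class
    correct = np.diag(n)                 # n_ii
    union = t + n.sum(axis=0) - correct  # t_i + sum_j n_ji - n_ii
    pixel_acc = correct.sum() / t.sum()  # (5)
    mean_acc = np.mean(correct / t)      # (6)
    iu = correct / union
    mean_iou = np.mean(iu)               # (7)
    fw_iou = (t * iu).sum() / t.sum()    # (8)
    return pixel_acc, mean_acc, mean_iou, fw_iou

print(segmentation_metrics(np.array([[50, 10], [5, 35]])))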

Table II compares pixel accuracy, mean accuracy, mean IOU and frequency-weighted accuracy between our models and other FCN architectures. Both of our models outperform the preexisting FCN structures on the reduced validation set.

Fig. 2: The third and fourth images show the output of our models, the fourth (Dilated FCN-2s VGG19) being the more accurate. The second image shows the output of the previous best method by Shelhamer et al. [4].
Neural Nets                         Pixel Acc.   Mean Acc.   Mean IOU   FW Acc.
FCN-8s-all-at-once                  90.8         77.4        63.8       84
FCN-8s                              90.9         76.6        63.9       84
Dilated FCN-2s using VGG16 (ours)   91           78.3        64.15      84.4
Dilated FCN-2s using VGG19 (ours)   91.2         77.6        64.86      84.7
TABLE II: Evaluation on the PASCAL VOC2012 reduced validation set

IV-B Data-set and Procedure

IV-B1 Pascal VOC

Transfer learning was performed by copying weights separately from the VGG-19 and VGG-16 classification nets into our two models, Dilated FCN-2s-VGG19 and Dilated FCN-2s-VGG16. We adopt the back-propagation [20] algorithm to train the networks end-to-end with forward and backward passes. We used dilation in our most expensive layer, Fc6, as seen in Table I, which reduced the number of parameters, resulting in less computation and faster inference. The total time needed was 12 hours for both networks to reach their best mIOU using a single GPU. We used the PASCAL VOC 2012 training data, amounting to 1464 images. Validation was done on the reduced VOC2012 validation set of 346 images [7], on which we obtained 58 percent meanIOU.

IV-B2 Semantic Boundaries Dataset

Extensive data was used to improve the pixel accuracy and mean intersection-over-union of both models, for which training was done on the Semantic Boundaries data-set [18]. The set consists of 8498 training and 2857 validation images. Training was done on both the training and validation data, summing up to 11355 images. The reduced validation set was obtained by removing the images the augmented VOC2012 training set has in common with the VOC2012 validation set [36], resulting in 346 images. Table II shows the comparative results of the different FCN models. Our model Dilated FCN-2s-VGG16 achieves a meanIOU of 64.1 percent and Dilated FCN-2s-VGG19 scores a meanIOU of 64.9 percent. Clearly, the deeper version of the model is more precise for pixel-wise segmentation. Training ran for 200,000 iterations.

Neural Nets                         MeanIOU
FCN-8s [4]                          62.2
FCN-8s-heavy [4]                    67.2
DeepLab-CRF [8]                     66.4
DeepLab-CRF-MSc [8]                 67.1
VGG19_FCN [37]                      68.1
Dilated FCN-2s using VGG16 (ours)   67.6
Dilated FCN-2s using VGG19 (ours)   69
TABLE III: PASCAL VOC 12 test results

IV-B3 VOC2012 Test

The test results shown in Table III indicate that our models score better than similar FCN architectures in the PASCAL VOC2012 Segmentation Challenge. We did not train on any additional data, nor did we add any graphical model such as a CRF or MRF [7, 30] to enhance the accuracy further, the reason being that they consume more GPU memory during training. Moreover, our model scores better than the FCN models on the NYUDv2 sets too (see Table V). Fig. 2 shows the segmentation masks compared to FCN-8s and the ground truth. Table VI also demonstrates how much less memory our nets consume for training and testing on a GPU; the reduction in memory usage reaches 20 percent when training FCN-2s Dilated VGG16.

Architectures                 pixel acc.   mean acc.   mean IU   f.w. IU
O2P [38]                      -            -           18.1      -
CFM [39]                      -            -           18.1      -
FCN-32s                       65.5         49.1        36.7      50.9
FCN-16s                       66.9         51.3        38.4      52.3
FCN-8s                        67.5         52.3        39.1      53.0
CRFasRNN [7]                  -            -           39.28     -
HO-CRF [30]                   -            -           41.3      -
DeepLab-LargeFOV-CRF [40]     -            -           39.6      -
Ours                          69.9         54.9        42.6      56.5
TABLE IV: Evaluation on the Pascal Context data-set

IV-B4 Pascal Context Data-set

We also train on a sparser data-set, Pascal Context, which has 60 classes and poses a more challenging pixel-wise prediction task [41]. The data-set consists of 10103 images. We split it into 5105 validation images and use the rest as the training set. Table IV shows comparative results for the Pascal Context data-set. Our model scores a better mean IOU of 42.6 percent than the other state-of-the-art models. Moreover, many deeper models with a higher-order CRF as a post-processing layer scored worse than our model, which clearly indicates that our model is better suited for pixel-wise prediction on sparse data-sets. Training ran for 300,000 iterations.

                            pixel acc.   mean acc.   mean IU   f.w. IU
Gupta et al. [6]            60.3         -           28.6      47
FCN-32s RGB                 61.8         44.7        31.6      46
FCN-32s HHA                 58.3         35.7        25.2      41.7
FCN-2s Dilated RGB (ours)   62.6         47.1        32.3      47.5
FCN-2s Dilated HHA (ours)   58.8         39.3        26.8      43.5
TABLE V: Evaluation on the NYUDv2 data-set

IV-B5 NYUDv2 Data-set

We also train on NYUD version 2, an RGB-D dataset collected with the Microsoft Kinect. It consists of 1,449 RGB-D images with pixel-wise semantic labels, divided into 40 semantic classes by Gupta et al. [42]. The data is split into 795 training images and 654 testing images. Table V gives the comparison between FCN and our models. We first train Dilated FCN-2s VGG19 on the three-channel RGB images. Then we add depth information and train a new model upgraded to take four-channel RGB-D input, though the performance does not increase. Long et al. [4] describe this phenomenon as a consequence of having a similar number of parameters or of the failure to propagate all the semantic gradients through the net. Following the footsteps of Gupta et al. [6], we next train on the three-dimensional HHA encoding of depth. These results prove more precise and yield a better score for our model. Training ran for 150,000 iterations. Table V shows the comparative results for the NYUDv2 data-set.

Models                          GPU Memory Usage   GPU Memory Usage          Number of Parameters   Inference   Number of
                                Training (MB)      Training + Testing (MB)   (millions)             Time (ms)   Classes
FCN-8s                          3759               4649                      134                    500         20
Dilated FCN-2s VGG16 (ours)     3093               4101                      50.5                   200         20
Dilated FCN-2s VGG19 (ours)     3367               4309                      55.8                   200         59
PascalContext FCN-8s            4173               5759                      136                    500         59
Dilated FCN-2s Context (ours)   3975               5333                      56.2                   200         59
TABLE VI: GPU memory usage comparison
Fig. 3: The difference in GPU usage while training and testing between our model and some FCN architectures

IV-C Memory Efficiency

In terms of memory efficiency, all three of our architectures have shown remarkable results across different GPUs. As shown in Fig. 3, GPU usage during training and testing has decreased for all of the architectures. Moreover, Dilated-FCN-2s-VGG16 achieves better performance while using about 700 MB less memory during training. Dilated-FCN-2s-Context shows a slight improvement for both training and testing time while still comparing favorably against more power-consuming models.

As Table VI shows, the GPU memory allocation for the 20-class and 59-class segmentation tasks has been reduced by 10-20 percent. Additionally, the counterpart of FCN-8s [4], Dilated FCN-2s, shows remarkable reductions in GPU memory usage for both training and testing. On the other hand, the inference time required by our three models is less than half that of the other, similar architectures, and it remains similar for both the 20- and 59-class segmentation tasks.

Fig. 4: Number of parameters vs Mean IOU scores among different architectures

Fig. 4 compares the different models in terms of mean intersection-over-union vs. parameters (in millions). The numbers of parameters of the FCN-8s [4] and VGG19-FCN [37] architectures for training on the VOC2012 [36] and Pascal Context [41] data are in the hundreds of millions, whereas all three of our models require fewer parameters, and hence less memory usage by both CPU and GPU. This brings to light another insight: a huge number of parameters is not needed to increase accuracy or obtain finer portrayal. Furthermore, wide receptive fields and training on sparse data can also effectively give better results.

All the models were trained and tested with Caffe [43] on an Nvidia GTX1060 and a GTX1070 separately. The code for this model can be found at: https://github.com/SharifAmit/DilatedFCNSegmentation

Fig. 5: Results after testing on the Pascal Context dataset [41] for 59 classes. The fifth column of images shows the output of our model, which tends to be more accurate. The second column shows the O2P model's [38, 39] output, which is wrongly predicted in many instances.

V Conclusion

Enhancing accuracy for pixel-wise segmentation requires a huge amount of memory and time. Our benchmark result for the PASCAL VOC2012 test set of 20 unique classes is a mean IOU of 69 percent for Dilated-FCN-2s-VGG19 and 67.6 percent for Dilated-FCN-2s-VGG16. For sparse data-sets, our models scored a pixel accuracy of 62.6 percent on NYUDv2 with 40 unique classes and a mean IOU of 42.6 percent on Pascal Context with 59 unique classes. Fully convolutional networks can be used to transfer weights from pre-trained nets, to element-wise sum different layers to improve accuracy, and to train end-to-end on entire images with extensive data. Dilation increases the receptive field while decreasing parameters and inference time. The objective was to create efficient yet deep architectures that generate accurate output while using fewer computational resources, and the proposed models have produced remarkably accurate pixel-wise segmentation. We hope these architectures can further be used for semantic segmentation tasks in self-driving cars, medical imaging and robotics.

Acknowledgment

We would like to thank Evan Shelhamer for providing the evaluation scripts and the Caffe users community for their advice and suggestions. We would also like to acknowledge the technical support provided to us by the "Center for Cognitive Skill Enhancement".

References

  • [1] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • [2] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
  • [3] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1–9, 2015.
  • [4] Evan Shelhamer, Jonathan Long, and Trevor Darrell. Fully convolutional networks for semantic segmentation. IEEE transactions on pattern analysis and machine intelligence, 39(4):640–651, 2017.
  • [5] Fisher Yu and Vladlen Koltun. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122, 2015.
  • [6] Saurabh Gupta, Ross Girshick, Pablo Arbeláez, and Jitendra Malik. Learning rich features from rgb-d images for object detection and segmentation. In European Conference on Computer Vision, pages 345–360. Springer, 2014.
  • [7] Shuai Zheng, Sadeep Jayasumana, Bernardino Romera-Paredes, Vibhav Vineet, Zhizhong Su, Dalong Du, Chang Huang, and Philip HS Torr. Conditional random fields as recurrent neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 1529–1537, 2015.
  • [8] Chen Liang-Chieh, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan Yuille. Semantic image segmentation with deep convolutional nets and fully connected crfs. In International Conference on Learning Representations, 2015.
  • [9] Hyeonwoo Noh, Seunghoon Hong, and Bohyung Han. Learning deconvolution network for semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision, pages 1520–1528, 2015.
  • [10] Zhuowen Tu. Auto-context and its application to high-level vision tasks. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pages 1–8. IEEE, 2008.
  • [11] Zhuowen Tu, Xiangrong Chen, Alan L Yuille, and Song-Chun Zhu. Image parsing: Unifying segmentation, detection, and recognition. International Journal of computer vision, 63(2):113–140, 2005.
  • [12] Chris Russell, Pushmeet Kohli, Philip HS Torr, et al. Associative hierarchical crfs for object class image segmentation. In Computer Vision, 2009 IEEE 12th International Conference on, pages 739–746. IEEE, 2009.
  • [13] Wei Liu, Andrew Rabinovich, and Alexander C Berg. Parsenet: Looking wider to see better. arXiv preprint arXiv:1506.04579, 2015.
  • [14] Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and Trevor Darrell. Decaf: A deep convolutional activation feature for generic visual recognition. In International conference on machine learning, pages 647–655, 2014.
  • [15] Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In European conference on computer vision, pages 818–833. Springer, 2014.
  • [16] Jifeng Dai, Kaiming He, and Jian Sun. Instance-aware semantic segmentation via multi-task network cascades. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3150–3158, 2016.
  • [17] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Region-based convolutional networks for accurate object detection and segmentation. IEEE transactions on pattern analysis and machine intelligence, 38(1):142–158, 2016.
  • [18] Bharath Hariharan, Pablo Arbeláez, Ross Girshick, and Jitendra Malik. Simultaneous detection and segmentation. In European Conference on Computer Vision, pages 297–312. Springer, 2014.
  • [19] Ofer Matan, Christopher JC Burges, Yann LeCun, and John S Denker. Multi-digit recognition using a space displacement neural network. In Advances in neural information processing systems, pages 488–495, 1992.
  • [20] Yann LeCun, Bernhard Boser, John S Denker, Donnie Henderson, Richard E Howard, Wayne Hubbard, and Lawrence D Jackel. Backpropagation applied to handwritten zip code recognition. Neural computation, 1(4):541–551, 1989.
  • [21] Ralph Wolf and John C Platt. Postal address block location using a convolutional locator network. In Advances in Neural Information Processing Systems, pages 745–752, 1994.
  • [22] Feng Ning, Damien Delhomme, Yann LeCun, Fabio Piano, Léon Bottou, and Paolo Emilio Barbano. Toward automatic phenotyping of developing embryos from videos. IEEE Transactions on Image Processing, 14(9):1360–1371, 2005.
  • [23] Pierre Sermanet, David Eigen, Xiang Zhang, Michaël Mathieu, Rob Fergus, and Yann LeCun. Overfeat: Integrated recognition, localization and detection using convolutional networks. arXiv preprint arXiv:1312.6229, 2013.
  • [24] Pedro Pinheiro and Ronan Collobert. Recurrent convolutional neural networks for scene labeling. In International Conference on Machine Learning, pages 82–90, 2014.
  • [25] David Eigen, Dilip Krishnan, and Rob Fergus. Restoring an image taken through a window covered with dirt or rain. In Proceedings of the IEEE International Conference on Computer Vision, pages 633–640, 2013.
  • [26] Jonathan J Tompson, Arjun Jain, Yann LeCun, and Christoph Bregler. Joint training of a convolutional network and a graphical model for human pose estimation. In Advances in neural information processing systems, pages 1799–1807, 2014.
  • [27] Mohammadreza Mostajabi, Payman Yadollahpour, and Gregory Shakhnarovich. Feedforward semantic segmentation with zoom-out features. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3376–3385, 2015.
  • [28] Pablo Arbeláez, Bharath Hariharan, Chunhui Gu, Saurabh Gupta, Lubomir Bourdev, and Jitendra Malik. Semantic segmentation using regions and parts. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 3378–3385. IEEE, 2012.
  • [29] Clement Farabet, Camille Couprie, Laurent Najman, and Yann LeCun. Learning hierarchical features for scene labeling. IEEE transactions on pattern analysis and machine intelligence, 35(8):1915–1929, 2013.
  • [30] Anurag Arnab, Sadeep Jayasumana, Shuai Zheng, and Philip HS Torr. Higher order conditional random fields in deep neural networks. In European Conference on Computer Vision, pages 524–540. Springer, 2016.
  • [31] David Eigen, Christian Puhrsch, and Rob Fergus. Depth map prediction from a single image using a multi-scale deep network. In Advances in neural information processing systems, pages 2366–2374, 2014.
  • [32] Sean Bell, Paul Upchurch, Noah Snavely, and Kavita Bala. Material recognition in the wild with the materials in context database. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3479–3487, 2015.
  • [33] George Papandreou, Liang-Chieh Chen, Kevin Murphy, and Alan L Yuille. Weakly-and semi-supervised learning of a dcnn for semantic image segmentation. arXiv preprint arXiv:1502.02734, 2015.
  • [34] Matthias Holschneider, Richard Kronland-Martinet, Jean Morlet, and Ph Tchamitchian. A real-time algorithm for signal analysis with the help of the wavelet transform. In Wavelets, pages 286–297. Springer, 1990.
  • [35] Liang-Chieh Chen, Yi Yang, Jiang Wang, Wei Xu, and Alan L Yuille. Attention to scale: Scale-aware semantic image segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3640–3649, 2016.
  • [36] Mark Everingham, Luc Van Gool, Chris Williams, J Winn, and A Zisserman. Pascal visual object classes challenge results. Available from www.pascal-network.org, 1(6):7, 2005.
  • [37] Sharif Amit Kamran, Md Asif Bin Khaled, and Sabit Bin Kabir. Exploring deep features: deeper fully convolutional neural network for image segmentation. Bachelor Thesis, BRAC University, 2017.
  • [38] Joao Carreira, Rui Caseiro, Jorge Batista, and Cristian Sminchisescu. Semantic segmentation with second-order pooling. Computer Vision–ECCV 2012, pages 430–443, 2012.
  • [39] Roozbeh Mottaghi, Xianjie Chen, Xiaobai Liu, Nam-Gyu Cho, Seong-Whan Lee, Sanja Fidler, Raquel Urtasun, and Alan Yuille. The role of context for object detection and semantic segmentation in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 891–898, 2014.
  • [40] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. arXiv preprint arXiv:1606.00915, 2016.
  • [41] Mark Everingham, SM Ali Eslami, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The pascal visual object classes challenge: A retrospective. International journal of computer vision, 111(1):98–136, 2015.
  • [42] Saurabh Gupta, Pablo Arbelaez, and Jitendra Malik. Perceptual organization and recognition of indoor scenes from rgb-d images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 564–571, 2013.
  • [43] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM international conference on Multimedia, pages 675–678. ACM, 2014.