Memory-Efficient Deep Salient Object Segmentation Networks on Gridized Superpixels

12/27/2017, by Caglar Aytekin et al. (Nokia)

Computer vision algorithms for pixel-wise labeling tasks, such as semantic segmentation and salient object detection, have seen a significant accuracy increase with the incorporation of deep learning. Deep segmentation methods slightly modify and fine-tune pre-trained networks that have hundreds of millions of parameters. In this work, we question the need for such memory-demanding networks for the specific task of salient object segmentation. To this end, we propose a way to learn a memory-efficient network from scratch by training it only on salient object detection datasets. Our method encodes images into gridized superpixels that preserve both the object boundaries and the connectivity rules of regular pixels. This representation allows us to use convolutional neural networks that operate on regular grids. Using these encoded images, we train a memory-efficient network with only 0.048% of the number of parameters of other deep salient object detection networks. Our method shows comparable accuracy with state-of-the-art deep salient object detection methods and provides a faster and much more memory-efficient alternative to them. Due to its easy deployment, such a network is preferable for applications on memory-limited devices such as mobile phones and IoT devices.

I Introduction

Convolutional Neural Networks (CNNs) are learning machines that are extensively used by the top-performing methods in image classification [1, 2, 3, 4]. With the introduction of Fully Convolutional Neural Networks (FCNNs) [5], these structures have also proven to constitute the state of the art in pixel-wise classification tasks such as semantic image segmentation and salient object detection. A typical FCNN relies on a pre-trained image classification CNN and fine-tunes its parameters for the segmentation task, often adding or replacing some layers. These pre-trained CNNs usually contain a very large number of parameters, e.g. 138 million for VGG-16 [2]. Such large networks require a lot of memory, which makes them challenging to deploy on memory-limited devices such as mobile phones. There have been efforts to reduce the memory requirement of a CNN by pruning [44] or quantizing [45] its weights; however, these approaches are post-processing operations on large networks trained on millions of images.

For some segmentation tasks such as salient object detection, one might question the need for such a high-capacity network in the first place. It can be argued that such a network is an overkill for salient object detection and that reasonable performance can be achieved with a much smaller network. Moreover, object recognition CNNs greatly reduce the resolution of their final-layer activations through pooling or strided convolution operations. To compensate for this resolution loss, segmentation networks either introduce additional connections that exploit the localization power of low and middle layers [5, 10], or add a deconvolutional network with unpooling layers on top of the CNN [11, 12]. Both approaches further increase the number of parameters of the segmentation network.

Fig. 1: Our method (GRIDS) compared to state-of-the-art deep salient object detection methods. (a) Network size comparison (plotted in log-scale); the exact number of parameters is written on the corresponding bar. (b) Performance comparison according to the $F_\beta$ measure.

In this paper, we propose a way to overcome the two problems of FCNNs mentioned above: the requirement of a big pre-trained network and the resolution loss caused by pooling layers. To this end, we utilize a memory-efficient deep segmentation network without any pooling layers. We achieve this by encoding input images via gridized superpixels [15]. This allows us to use low-resolution images that accurately encode object edges. Using these images, we show that it is possible to train a memory-efficient FCNN of reasonable depth, with no pooling layers, yet with a large receptive field and performance comparable to the state of the art, see Fig. 1. The contributions of our work are listed as follows:

  • We propose a way to use FCNNs without any pooling layers or strided convolutions by abstracting input images via gridized superpixels.

  • The predictions of our network do not suffer from inaccurate object edges.

  • Our proposed network has less than 67k parameters (about 0.048% of others).

  • Our proposed network does not require a pre-trained model and can be trained from scratch by existing pixel-wise classification datasets.

  • We show that the performance of our method is comparable with state of the art segmentation networks in salient object detection task.

The rest of the paper is organized as follows. Section II discusses the related work, Section III describes the proposed method, and Section IV analyzes the experimental results. Finally, Section V concludes the paper and suggests topics for future research.

II Related Work

II-A Superpixel Gridization

Superpixel gridization produces over-segmentations that form a regular, pixel-like lattice while best preserving the object edges in an image. There are only a few studies on this topic, which we review briefly next. The method in [17] relies on a boundary cost map, which is an inverted edge detection result. Horizontal and vertical stripes are added to the image incrementally, such that a horizontal and a vertical stripe intersect at most once and no two stripes of the same orientation intersect. In each step, the optimal stripe is found by minimizing the boundary cost it passes through, using a min-cut based optimization algorithm.

An extended version of [17] was proposed in [18], where the authors use an alternating optimization strategy. The method finds globally optimal solutions for the horizontal and vertical components of the lattice using a multi-label Markov Random Field, as opposed to the greedy optimization strategy adopted in [17]. In [16], a generic approach is proposed to optimally regularize superpixels extracted by any algorithm. The approach is based on placing dummy nodes between superpixels to satisfy the regularity criterion.

Finally, the method proposed in [15] starts with regular lattice seeds and relocates the seeds, within a search space defined by the initialization, to the locally maximal boundary response. The relocated seeds are treated as superpixel junctions. Next, for each junction pair, a path is found that maximizes the edge strength along it. These paths form a superpixel boundary map that yields a regular superpixel grid.

II-B Salient Object Detection

A salient object is generally defined as the object that visually stands out from the rest of the image, thus is more appealing to the human eye [19].

Unsupervised salient object detection methods rely mostly on the following saliency assumptions: 1) A salient object has high local or global contrast [20, 21], 2) The boundary of an image is less likely to contain a salient object [22, 23], 3) The salient object is more likely to be large [23], 4) Regions with similar features have similar saliency [24].

Prior to deep learning, supervised approaches to salient object detection focused on the following tracks: 1) Learning a dense labeling of each region as salient or not [25, 26], 2) Learning to rank salient object proposals [27], 3) Learning region affinities [28] for end-to-end salient object segmentation.

Deep learning based approaches to salient object detection either train a network to classify each region in an image separately [29, 30, 31], or employ FCNNs to learn a dense pixel-wise labeling for salient object detection [32, 33]. FCNN-based models utilize networks pre-trained on other tasks and employ special tricks to preserve accurate edges in the segmentation results. Next, we propose an FCNN that does not need a pre-trained network, automatically preserves accurate object edges with no pooling layers or strided convolutions, has far fewer parameters than other methods, and is comparable in performance to the state of the art.

Fig. 2: From left to right: Original, encoded with [15] (950 superpixels) and reconstructed images (first row) and corresponding ground truths (second row).

III Proposed Method

III-A Data Preparation

We use the superpixel extraction method in [15] to abstract an input image $I$ with a small number of homogeneous image regions (superpixels). Thanks to a special property of the method in [15], the extracted superpixels form a regular grid. We encode each superpixel with its mean color, and thus obtain a new, low-resolution image $\hat{I}$ as follows:

$$\hat{I}_c(i,j) = \frac{1}{|S_{i,j}|} \sum_{(x,y) \in S_{i,j}} I_c(x,y) \qquad (1)$$

In Eq. 1, $c$ indicates a channel of an image, $(i,j)$ and $(x,y)$ are indices of the $\hat{I}$ and $I$ images respectively, $S_{i,j}$ is the set of pixels of $I$ covered by superpixel $(i,j)$, and $|\cdot|$ is the cardinality operator. We will use this image as input to an FCNN. For training purposes, we also form a low-resolution version $\hat{G}$ of the ground truth label image $G$ by encoding the mean of the 0 (not salient) and 1 (salient) values in the regions indicated by the superpixels, similar to the process in Eq. 1. In order to have binary values in the low-dimensional ground truth $\hat{G}$, we simply threshold this image at 0.5; note that this is equivalent to selecting the most common value within a superpixel. It should be noted that one can reconstruct approximations $\tilde{I}$ and $\tilde{G}$ of $I$ and $G$ from $\hat{I}$ and $\hat{G}$ respectively, as follows:

$$\tilde{I}_c(x,y) = \hat{I}_c(i,j), \quad \forall (x,y) \in S_{i,j} \qquad (2)$$

In Fig. 2, we illustrate original ($I$), encoded ($\hat{I}$) and reconstructed ($\tilde{I}$) images with corresponding ground truths. The encoded image is the input that will be supplied to the FCNN, with the encoded ground truth as its label for training. As we observe from Fig. 2, even though the superpixel extraction is constrained by the grid structure, it preserves the object edges and allows the image to be reconstructed well.
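To make the encoding and reconstruction concrete, the sketch below applies Eq. 1 and Eq. 2 in NumPy, assuming the superpixel extraction of [15] has already produced a per-pixel label map that stores, for every pixel, the flat index of the gridized superpixel covering it; that label-map layout and the helper names are our illustrative assumptions, not part of the original implementation.

import numpy as np

def encode(image, labels, grid_h, grid_w):
    # Eq. 1: each grid node (i, j) takes the mean color of the pixels in S_{i,j}.
    # `labels` holds, per pixel, the flat superpixel index i * grid_w + j (assumed layout).
    h, w, c = image.shape
    counts = np.bincount(labels.ravel(), minlength=grid_h * grid_w)
    encoded = np.empty((grid_h, grid_w, c), dtype=np.float64)
    for ch in range(c):
        sums = np.bincount(labels.ravel(), weights=image[..., ch].ravel(),
                           minlength=grid_h * grid_w)
        encoded[..., ch] = (sums / np.maximum(counts, 1)).reshape(grid_h, grid_w)
    return encoded

def reconstruct(encoded, labels):
    # Eq. 2: replicate each grid node's value over all pixels of its superpixel.
    grid_h, grid_w = encoded.shape[:2]
    return encoded.reshape(grid_h * grid_w, -1)[labels]

# The ground truth is encoded the same way and binarized at 0.5, e.g.
# g_hat = (encode(gt[..., None].astype(float), labels, gh, gw)[..., 0] > 0.5).astype(np.float32)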

Fig. 3: Network Architecture.

III-B Network Architecture

We use a 28-layer deep convolutional network with residual blocks [4]. In particular, the network has a convolutional (conv) layer with rectified linear unit (relu) activation [41], followed by 13 residual blocks, a batch normalization (bnorm) layer [42], a relu, a conv layer and a sigmoid activation. Note that we do not apply a bnorm layer right before the sigmoid, in order to avoid restricting the convolutional output to a small interval. Each residual block consists of a bnorm-relu-conv-bnorm-relu-conv structure, and the input of a residual block is short-connected to its output. We use the same number of filters in each layer. The entire network is illustrated in Fig. 3.

The network's inputs and corresponding ground truths are obtained by the procedure described in the previous subsection. Note that the convolutions use zero padding and stride 1, so that the spatial shape is preserved at each convolutional layer. Moreover, there are no pooling layers in the network architecture, which avoids any resolution loss. This is possible because the input resolution is already low and we can use a constant number of convolutional filters throughout the network, whereas prior-art networks need to reduce the resolution in order to afford an increasing number of filters. The receptive field of our network is around 30x30, which is enough to cover the entire input for an abstraction of an image with 900 superpixels if the abstraction forms a square grid. Typically, the aspect ratio of the superpixel representation varies with the image aspect ratio; however, we find this receptive field sufficient to accurately detect the salient objects.
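For illustration, a minimal Keras sketch of this architecture is given below (conv+relu stem, 13 pre-activation residual blocks, then a bnorm-relu-conv head with a sigmoid, no pooling or strides). The 3x3 kernel size, the use of padding='same' and the single-channel output are our assumptions, since these details are not spelled out above.

from keras.models import Model
from keras.layers import Input, Conv2D, BatchNormalization, Activation, Add

def residual_block(x, filters):
    # Pre-activation residual block: bnorm-relu-conv-bnorm-relu-conv,
    # with the block input short-connected to its output.
    y = BatchNormalization()(x)
    y = Activation('relu')(y)
    y = Conv2D(filters, (3, 3), padding='same')(y)
    y = BatchNormalization()(y)
    y = Activation('relu')(y)
    y = Conv2D(filters, (3, 3), padding='same')(y)
    return Add()([x, y])

def build_grids(filters=16, num_blocks=13):
    # The grid resolution varies with the image, hence the (None, None, 3) input.
    inp = Input(shape=(None, None, 3))
    x = Conv2D(filters, (3, 3), padding='same', activation='relu')(inp)
    for _ in range(num_blocks):
        x = residual_block(x, filters)
    x = BatchNormalization()(x)
    x = Activation('relu')(x)
    # Head: bnorm-relu-conv-sigmoid; note there is no bnorm directly before the sigmoid.
    out = Conv2D(1, (3, 3), padding='same', activation='sigmoid')(x)
    return Model(inp, out)

With 16 filters and the assumed 3x3 kernels, model.count_params() lands in the same tens-of-thousands range as the roughly 67k parameters reported for GRIDS16.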

Input: image $I$
      Output: salient segment $\tilde{G}$

1: Encode $I$ to $\hat{I}$ by Eq. 1
2: Apply min-max normalization on $\hat{I}$
3: Predict $\hat{G}$ from $\hat{I}$ via the neural network
4: Reconstruct $\tilde{G}$ from $\hat{G}$ by Eq. 2
Algorithm 1 Test-time implementation

III-C Training and Testing

The parameters of the network are optimized to minimize the binary cross-entropy loss between the output of the network and the ground truth, treating the sigmoid outputs as probabilities that the corresponding inputs are salient. Separate datasets are used for training and validation, and the model with the best validation error is selected. For testing, we use entirely different datasets and run the model learned as described above. During testing, an image is encoded into the low-dimensional superpixel grid representation and fed to the network. It should be noted that we apply min-max normalization to each input, i.e. we linearly scale its values between 0 and 1. The output of the network lies on the same grid structure and therefore has to be converted back, i.e. reconstructed, to the original image size. The reconstruction simply replicates the value of each grid node over the image region that the node corresponds to, as formulated in Eq. 2. The test-time algorithm is given in Algorithm 1.
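Algorithm 1 can be transcribed roughly as follows, reusing the illustrative encode/reconstruct helpers sketched in Section III-A and treating the superpixel extraction of [15] as an external step that supplies the label map.

import numpy as np

def predict_saliency(image, labels, grid_h, grid_w, model):
    # Step 1: encode the image onto the superpixel grid (Eq. 1).
    x = encode(image.astype(np.float64), labels, grid_h, grid_w)
    # Step 2: min-max normalize the encoded input to [0, 1].
    x = (x - x.min()) / (x.max() - x.min() + 1e-8)
    # Step 3: predict grid-level saliency with the trained network.
    g_hat = model.predict(x[None, ...])[0, ..., 0]
    # Step 4: reconstruct a full-resolution saliency map (Eq. 2).
    return reconstruct(g_hat[..., None], labels)[..., 0]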

IV Experimental Results

IV-A Datasets and Evaluation Metrics

We conducted evaluations on widely used salient object detection datasets. MSRA10K [34] includes 10000 images that exhibit a simple case with one salient object and a clear background, HKU-IS [30] includes 4447 images with slightly more challenging cases, ECSSD [37] includes 1000 relatively complex images, PASCALS [38] contains 850 images adopted from the PASCAL VOC segmentation dataset, and SOD [39] contains 300 images from the BSD300 [40] segmentation dataset. We use the two most widely used evaluation metrics, mean absolute error (MAE) and F-measure. For a saliency map $S$ and a binary ground truth $G$, both of width $W$ and height $H$, MAE is defined as follows:

$$\mathrm{MAE} = \frac{1}{WH} \sum_{x=1}^{W} \sum_{y=1}^{H} \big| S(x,y) - G(x,y) \big| \qquad (3)$$

Precision-recall curves are extracted by thresholding the saliency map at several values $\tau$ and plotting the precision and recall values, which are calculated as follows:

$$\mathrm{Precision} = \frac{|M_\tau \cap G|}{|M_\tau|}, \qquad \mathrm{Recall} = \frac{|M_\tau \cap G|}{|G|} \qquad (4)$$

Here, $M_\tau$ denotes the binary map obtained by thresholding $S$ at $\tau$. The F-measure provides a single, global evaluation of the precision-recall curve and is obtained as follows:

$$F_\beta = \frac{(1+\beta^2)\,\mathrm{Precision} \cdot \mathrm{Recall}}{\beta^2\,\mathrm{Precision} + \mathrm{Recall}} \qquad (5)$$

It is widely adopted in the salient object detection literature to choose $\beta^2 = 0.3$ and to use an adaptive threshold that equals twice the mean saliency value of the saliency map [19].
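The two metrics can be transcribed directly from Eqs. 3-5; the small implementation details below (binarizing the ground truth at 0.5, guarding against empty predictions) are our own.

import numpy as np

def mae(saliency, gt):
    # Eq. 3: mean absolute difference between the saliency map and the binary GT.
    return np.mean(np.abs(saliency - gt))

def adaptive_f_beta(saliency, gt, beta2=0.3):
    # Adaptive threshold: twice the mean saliency value of the map.
    tau = 2.0 * saliency.mean()
    pred = saliency >= tau
    gt = gt > 0.5
    tp = float(np.logical_and(pred, gt).sum())
    precision = tp / max(pred.sum(), 1)   # Eq. 4
    recall = tp / max(gt.sum(), 1)
    if precision + recall == 0:
        return 0.0
    # Eq. 5: beta^2-weighted harmonic mean of precision and recall.
    return (1 + beta2) * precision * recall / (beta2 * precision + recall)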

IV-B Implementation

Our implementation is based on the publicly available Keras library with the Theano backend. Network parameters are initialized with Xavier's method [43]. We use the Nesterov Adam optimizer with an initial learning rate of . We utilize several superpixel granularities to augment the input data; in particular, we use 900, 925, 950, 975 and 1000 superpixels, so that we have 5 different encoded images for each original image. This is done only for the training and validation data, as data augmentation. During the test stage, we stick to 950 superpixels for evaluation, simply because it is the median of the above set. Unlike other methods, our network is trained from scratch; hence it may appear to need more training data, but in reality it is trained on less data once the pre-training data used in prior art is taken into account. Thus, we use the largest datasets, DUT-OMRON, HKU-IS and MSRA10K, for training, and SOD for validation. Further data augmentation is employed by randomly flipping the inputs and labels in the horizontal direction. We use a batch size of 20. The network was run for 5 million iterations and the model that gives the best validation accuracy was selected. We evaluate two variants of the network, one with 16 filters at each layer and one with 32, called GRIDS16 and GRIDS32 respectively. The networks used by GRIDS16 and GRIDS32 have 67k and 248k parameters respectively.
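A sketch of this training configuration in Keras (reusing the illustrative build_grids from Section III-B) is shown below. Only the parts stated in the text, i.e. binary cross-entropy, Nesterov Adam, Xavier initialization, batch size 20 and random horizontal flips, should be read as the authors'; the data pipeline, the learning rate and the handling of variable grid sizes within a batch are not specified above and are left illustrative.

import numpy as np
from keras.optimizers import Nadam

model = build_grids(filters=32)   # GRIDS32; pass filters=16 for GRIDS16
# Keras conv layers default to Glorot (Xavier) initialization [43].
model.compile(optimizer=Nadam(), loss='binary_crossentropy')

def flip_augment(batch_x, batch_y):
    # Randomly flip inputs and labels together in the horizontal direction.
    flip = np.random.rand(len(batch_x)) < 0.5
    batch_x[flip] = batch_x[flip, :, ::-1]
    batch_y[flip] = batch_y[flip, :, ::-1]
    return batch_x, batch_y

# model.fit(...) is then run with batch_size=20, keeping the model with the
# best validation score.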

Method PASCALS ECSSD Avg. Perf.
MAE $F_\beta$ MAE $F_\beta$ MAE $F_\beta$
CHM 0.222 0.631 0.195 0.722 0.209 0.677
RC 0.225 0.640 0.187 0.741 0.206 0.691
DSR 0.204 0.646 0.173 0.737 0.189 0.692
EQCUT 0.217 0.670 0.174 0.765 0.196 0.718
DRFI 0.221 0.679 0.166 0.787 0.194 0.733
MC 0.147 0.721 0.107 0.822 0.127 0.772
MDF 0.145 0.764 0.108 0.833 0.127 0.799
GRIDS16 0.171 0.781 0.085 0.823 0.128 0.802
GRIDS32 0.166 0.793 0.080 0.839 0.123 0.816
ELD 0.121 0.767 0.098 0.865 0.110 0.816
DCL 0.108 0.822 0.071 0.898 0.090 0.860
RFCN 0.118 0.827 0.097 0.898 0.108 0.863
DHS 0.091 0.820 0.061 0.905 0.076 0.863
DSS 0.080 0.830 0.052 0.915 0.066 0.873
TABLE I: Comparison with State-of-the-art

IV-C Comparison with State-of-the-art

We compare our approach with 4 unsupervised methods, RC [20], CHM [26], DSR [22] and EQCUT [23], and 8 supervised methods, DRFI [25], MC [29], ELD [31], MDF [30], RFCN [32], DHS [35], DCL [33] and DSS [36]. In Table I, we report results for the ECSSD and PASCALS datasets, as these are the only datasets used for testing by all methods. The methods are ordered by ascending average $F_\beta$ measure. As one can observe from Table I, both variants of our method, GRIDS, achieve a better $F_\beta$ measure and MAE than all unsupervised methods and three supervised methods: DRFI, MC and MDF. Of these, MC and MDF are deep learning based and use around 58 and 138 million parameters respectively. The other methods that outperform ours are all deep learning based and use more than 138 million parameters, being VGG-16 models with additional layers and connections. Yet, our models GRIDS16 and GRIDS32 use only around 67 and 248 thousand parameters, i.e. roughly 0.048% and 0.18% of the other methods' parameters respectively, and still achieve accuracy comparable to the state of the art. The number of parameters used by each deep learning based method and the run-times with the corresponding GPUs are given in Table II. Our method has the smallest memory requirement and the fastest run-time. One should note that the superpixel extraction time is not included in the table; with the method of [15], this takes an additional 0.5 seconds or so for an image of size 300x400 at superpixel granularity 950.

Method #Parameters Run-time (sec.) GPU
MC 58m 2.38 Titan Black
MDF 138m 8 Titan Black
ELD 138m 0.5 Titan Black
DCL 138m 1.5 Titan Black
RFCN 138m 4.6 Titan X
DHS 138m 0.04 Titan Black
DSS 138m 0.08 Titan X
GRIDS16 67k 0.02 GTX 1080
GRIDS32 248k 0.03 GTX 1080
TABLE II: Complexity of Deep Learning Methods

IV-D Analysis and Variants

In this section, we investigate the impact of several factors on our method's performance. First, we evaluate the robustness of the performance to different superpixel granularities. We report the test performance when employing 900, 950 and 1000 superpixels in Table III. The experiments are made with the GRIDS32 model. It can be observed that the resolution change in this interval has an insignificant impact on our method's performance and does not alter the ranking in Table I.

Superpixel No. PASCALS ECSSD
MAE $F_\beta$ MAE $F_\beta$
900 0.169 0.787 0.080 0.838
950 0.166 0.793 0.080 0.839
1000 0.167 0.792 0.080 0.840
TABLE III: Robustness to Resolution

Next, we attempt to improve the performance by combining the segmentation results from all resolutions via majority voting. Experiments are made with the GRIDS32 model. It can be observed from Table IV that the multi-resolution approach (GRIDSM) results in a notable performance improvement in both MAE and the $F_\beta$ measure. The rank of GRIDSM is the same as that of GRIDS for MAE, but it beats one more deep learning method (ELD) in the $F_\beta$ measure in Table I.
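One plausible implementation of the GRIDSM voting scheme is sketched below: the full-resolution predictions from the five granularities are binarized and combined by a pixel-wise majority vote. The 0.5 binarization threshold and the vote-after-reconstruction ordering are our assumptions, as the text does not specify them.

import numpy as np

def majority_vote(saliency_maps, threshold=0.5):
    # Binarize each full-resolution prediction and take a pixel-wise majority.
    votes = np.stack([m >= threshold for m in saliency_maps], axis=0)
    return (votes.mean(axis=0) >= 0.5).astype(np.float32)

# e.g. maps = [predict_saliency(img, labels[n], grids[n][0], grids[n][1], model)
#              for n in (900, 925, 950, 975, 1000)]
# fused = majority_vote(maps)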

Method PASCALS ECSSD
MAE $F_\beta$ MAE $F_\beta$
GRIDS (950) 0.166 0.793 0.080 0.839
GRIDSM 0.164 0.800 0.075 0.851
TABLE IV: Improvement via Multi-resolution Approach

One might argue that a natural baseline for our network is (a) plain downsampling of the dataset images, (b) training a network on the downsampled images and ground truths, and (c) upsampling the results to evaluate the performance. This comparison would highlight the benefit of dimension reduction and reconstruction with gridized superpixel encoding over plain downsampling and upsampling. To make this comparison, we train a network with exactly the same structure as described above, only this time with downsampled images and ground truths. The ground truths are again binarized by thresholding at 0.5. Augmentation over scale is performed similarly, by choosing downsampled dimensions that result in around 900, 925, 950, 975 and 1000 pixels while preserving the aspect ratio. During test time, we again use only the 950 granularity. Bicubic downsampling and upsampling is used, and the experiments are made with the GRIDS32 model. The comparison in Table V clearly indicates the improvement of the superpixel gridization encoding scheme over plain downsampling; especially the improvement in the $F_\beta$ measure is dramatic, up to a 13% relative improvement.
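The plain-resampling baseline can be sketched as follows; we use bicubic resizing (order 3) from scikit-image and choose the low resolution so that it contains roughly the target number of pixels while preserving the aspect ratio, which is our reading of the description above.

import numpy as np
from skimage.transform import resize

def downsample_encode(image, target_pixels=950):
    # Pick a low resolution with about `target_pixels` pixels, preserving aspect ratio.
    h, w = image.shape[:2]
    scale = np.sqrt(target_pixels / float(h * w))
    low_h = max(int(round(h * scale)), 1)
    low_w = max(int(round(w * scale)), 1)
    return resize(image, (low_h, low_w), order=3)    # bicubic downsampling

def upsample_decode(low_res_map, full_shape):
    return resize(low_res_map, full_shape, order=3)  # bicubic upsampling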

Encoding Type PASCALS ECSSD
MAE $F_\beta$ MAE $F_\beta$
Downsampling 0.172 0.732 0.096 0.741
GRIDS 0.166 0.793 0.080 0.839
TABLE V: Encoding Strategy Comparison

As we have previously mentioned, since our method is trained from scratch it obviously needs more data to train on. That is why we use the largest datasets, DUT-OMRON, HKU-IS and MSRA10K, for training. We would like to emphasize that the majority of other works use only MSRA10K for training and validation when fine-tuning their pre-trained networks. For transfer learning, fine-tuning with a small amount of data is known to give satisfactory results; since our network is trained from scratch, such a small dataset is not enough to reach satisfactory generalization. Moreover, we do not have the advantage of starting with a network pre-trained on millions of images for object recognition, as others do, which is clearly expected to contribute to salient object detection performance by acting as a top-down prior, i.e. using the semantically higher-level information of object recognition to detect the salient object. Therefore, we question the fairness of a comparison in which we use only MSRA10K for our method. Yet, in order to give a complete set of experiments, we also train our model with the other methods' training and validation sets (partitions of the MSRA10K dataset) and report the test results in Table VI. Experiments are made with the GRIDS32 model. Clearly, the model trained only on MSRA10K is inferior to the one trained on MSRA10K, HKU-IS and DUT-OMRON.

Training Data PASCALS ECSSD
MAE $F_\beta$ MAE $F_\beta$
MSRA10k 0.190 0.752 0.096 0.814
MSRA10k+HKUIS+DUTOMRON 0.166 0.793 0.080 0.839
TABLE VI: Impact of Training Data

V Conclusion and Future Work

We have presented a deep, fast and memory-efficient method for salient object segmentation that operates on images encoded with gridized superpixels. With the boundary-preserving gridized superpixel encoding, we do not suffer from blurry object boundaries. Moreover, the network does not employ any pooling layers, so further resolution loss is prevented; this also eliminates the need for tricks such as additional connections and layers to compensate for resolution loss. We have shown that our method can outperform some deep learning based methods and achieves comparable accuracy with others while having only 0.048% of their parameters. With only 430 KB of memory, the network is extremely easy to deploy to any device, which makes the method especially preferable for applications on mobile and small IoT devices. The presented framework can be applied to any pixel-wise labeling task, such as semantic segmentation; this will be the main topic of future improvements of this work.

References

  • [1] A. Krizhevsky, I. Sutskever and G. E. Hinton, "ImageNet Classification with Deep Convolutional Neural Networks", Advances in Neural Information Processing Systems (NIPS), pp. 1097-1105, 2012.
  • [2] K. Simonyan and A. Zisserman, Very Deep Convolutional Neural Networks for Large-scale Image Recognition, arXiv:1409.1556, 2014.
  • [3] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov and A. Rabinovich, "Going Deeper with Convolutions", IEEE Conference on Computer Vision and Pattern Recognition, pp. 1-9, 2015.
  • [4] K. He, X. Zhang, S. Ren and J. Sun, Deep Residual Learning for Image Recognition, IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778, 2016.
  • [5] J. Long, E. Shelhamer, and T. Darrell, "Fully Convolutional Networks for Semantic Segmentation", IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431-3440, 2015.
  • [6] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, The PASCAL Visual Object Classes (VOC) Challenge, International Journal of Computer Vision, vol. 88, no.2, pp. 303-338, 2010.
  • [7] T. Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollar, and C.L. Zitnick, Microsoft COCO: Common Objects in Context, European Conference on Computer Vision, pp. 740-755, 2014.
  • [8] R. Mottaghi, X. Chen, X. Liu, N. G. Cho, S. W. Lee, S. Fidler, and R. Urtasun, "The Role of Context for Object Detection and Semantic Segmentation in the Wild", IEEE Conference on Computer Vision and Pattern Recognition, pp. 891-898, 2014.
  • [9] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba, Semantic Understanding of Scenes Through the ADE20K dataset, arXiv:1608.05442, 2016.
  • [10] B. Hariharan, P. Arbelaez, R. Girshick, and J. Malik, ”Hypercolumns for Object Segmentation and Fine-grained Localization”, IEEE Conference on Computer Vision and Pattern Recognition, 2014.
  • [11] V. Badrinarayanan, A. Kendall, and R. Cipolla, "SegNet: A Deep Convolutional Encoder-Decoder Network Architecture for Image Segmentation", arXiv:1511.00561, 2015.
  • [12] H. Noh, S. Hong, and B. Han, ”Learning Deconvolutional Network for Semantic Segmentation”, International Conference on Computer Vision, pp. 1520-1528, 2015.
  • [13] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. ”DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution and Fully Connected CRFs”, arXiv:1606.00915, 2016.
  • [14] F. Yu, and V. Koltun. ”Multi-scale Context Aggregation by Dilated Convolutions”, arXiv:1511.07122, 2015.
  • [15] H. Fu, X. Cao, D. Tang, Y. Han, and D. Xu. ”Regularity Preserved Superpixels and Supervoxels”, IEEE Transactions on Multimedia vol. 16, no. 4, pp. 1165-1175, 2014.
  • [16] L. Li, W. Feng, L. Wan, and J. Zhang. ”Maximum Cohesive Grid of Superpixels for Fast Object Localization”. IEEE Conference on Computer Vision and Pattern Recognition pp. 3174-3181, 2013.
  • [17] A. P. Moore, S. J. Prince, J. Warrell, U. Mohammed, and G. Jones, ”Superpixel Lattices”, IEEE Conference on Computer Vision and Pattern Recognition, pp. 1-8, 2008.
  • [18] A. P. Moore, S. J. Prince, J. Warrell, ”Lattice Cut-Constructing Superpixels Using Layer Constraints”, IEEE Conference on Computer Vision and Pattern Recognition, pp. 2117-2124, 2010.
  • [19] A. Borji, M. M. Cheng, H. Jiang, and J. Li, ”Salient Object Detection: A Survey”, arXiv:1411.5878, 2014.
  • [20] M. M. Cheng, N. J. Mitra, X. Huang, P. H. Torr, and S. Hu, ”Global contrast based salient region detection”, IEEE Transaction on Pattern Analysis and Machine Intelligence, 2015.
  • [21] C. Aytekin, S. Kiranyaz, and M. Gabbouj. ”Automatic Object Segmentation by Quantum Cuts.” International Conference on Pattern Recognition, pp.112-117, 2014.
  • [22] X. Li, H. Lu, L. Zhang, X. Ruan, and M.-H. Yang. ”Saliency detection via dense and sparse reconstruction”, International Conference on Computer Vision, pp. 2976–2983, 2013
  • [23] C. Aytekin, E. C. Ozan, S. Kiranyaz, and M. Gabbouj. ”Extended Quantum Cuts for Unsupervised Salient Object Extraction” Multimedia Tools and Applications vol. 76, no. 8, pp. 10443-10463, 2017.
  • [24] C. Aytekin, A. Iosifidis, and M. Gabbouj, "Probabilistic Saliency Estimation", Pattern Recognition, https://doi.org/10.1016/j.patcog.2017.09.023, 2017.
  • [25] H. Jiang, J. Wang, Z. Yuan, Y. Wu, N. Zheng, and S. Li. ”Salient object detection: A discriminative regional feature integration approach”, IEEE Conference on Computer Vision and Pattern Recognition, pp. 2083-2090, 2013.
  • [26] X. Li, Y. Li, C. Shen, A. Dick, and A. Van Den Hengel. ”Contextual hypergraph modeling for salient object detection”, International Conference on Computer Vision, pp. 3328–3335, 2013.
  • [27] C. Aytekin, S. Kiranyaz, and M. Gabbouj. ”Learning to Rank Salient Segments Extracted by Multispectral Quantum Cuts”, Pattern Recognition Letters, vol. 72, pp. 91-99, 2016.
  • [28] C. Aytekin, A. Iosifidis, S. Kiranyaz, and M. Gabbouj. ”Learning Graph Affinities for Spectral Graph-based Salient Object Detection”, Pattern Recognition, vol. 64, pp. 159-167, 2017.
  • [29] R. Zhao, W. Ouyang, H. Li, and X. Wang. ”Saliency Detection by Multi-Context Deep Learning”. IEEE Conference on Computer Vision and Pattern Recognition, pp. 1265-1274, 2015.
  • [30] G. Li and Y. Yu, "Visual saliency based on multiscale deep features", IEEE Conference on Computer Vision and Pattern Recognition, pp. 5455-5463, 2015.
  • [31] L. Gayoung, T. Yu-Wing, and K. Junmo. ”Deep saliency with encoded low level distance map and high level features”. IEEE Conference on Computer Vision and Pattern Recognition, pp. 660-668, 2016.
  • [32] L. Wang, L. Wang, H. Lu, P. Zhang, and X. Ruan, ”Saliency detection with recurrent fully convolutional networks”, European Conference on Computer Vision, pp. 825-841, 2016.
  • [33] G. Li, and Y. Yu, ”Deep contrast learning for salient object detection”, IEEE Conference on Computer Vision and Pattern Recognition, pp. 478-487, 2016.
  • [34] T. Liu, Z. Yuan, J. Sun, J. Wang, N. Zheng, X. Tang and H. Y. Shum, "Learning to detect a salient object", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 2, pp. 353-367, 2011.
  • [35] N. Liu and J. Han. ”Dhsnet: Deep hierarchical saliency network for salient object detection”, IEEE Conference on Computer Vision and Pattern Recognition, pp. 678-686, 2016.
  • [36] Q. Hou, M. M. Cheng, X. Hu, A. Borji, Z. Tu and P. Torr, "Deeply supervised salient object detection with short connections", IEEE Conference on Computer Vision and Pattern Recognition, pp. 5300-5309, 2017.
  • [37] Q. Yan, L. Xu, J. Shi and J. Jia, ”Hierarchical saliency detection”. IEEE Conference on Computer Vision and Pattern Recognition, pp. 1155-1162, 2013.
  • [38] Y. Li, X. Hou, C. Koch, J. M. Rehg, and A. L. Yuille. ”The secrets of salient object segmentation”, IEEE Conference on Computer Vision and Pattern Recognition, pp. 280-287, 2014.
  • [39] V. Movahedi and J. H. Elder, ”Design and perceptual validation of performance measures for salient object segmentation”, IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 49-56, 2010.
  • [40] D. Martin, C. Fowlkes, D. Tal and J. Malik, "A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics", International Conference on Computer Vision, pp. 416-423, 2001.
  • [41] V. Nair and G. E. Hinton, "Rectified linear units improve restricted Boltzmann machines", Proceedings of the 27th International Conference on Machine Learning, pp. 807-814, 2010.
  • [42] S. Ioffe and C. Szegedy, ”Batch normalization: Accelerating deep network training by reducing internal covariate shift”, International Conference on Machine Learning, pp. 448-456, 2015.
  • [43] X. Glorot and Y. Bengio, "Understanding the difficulty of training deep feedforward neural networks", International Conference on Artificial Intelligence and Statistics, pp. 249-256, 2010.
  • [44] S. Han, J. Pool, J. Tran, W. Dally ”Learning both weights and connections for efficient neural network”, Advances in Neural Information Processing Systems, pp. 1135-1143, 2015.
  • [45] S. Gupta, A. Agrawal, K. Gopalakrishnan and P. Narayanan, ”Deep learning with limited numerical precision”, International Conference on Machine Learning, pp. 1737-1746, 2015.