Convolutional Neural Networks (CNNs) are learning machines that are extensively used by top performing methods in image classification [1, 2, 3, 4]. With the introduction of Fully Convolutional Neural Networks (FCNNs), these structures have also proven to constitute the state of the art in pixel-wise classification tasks such as semantic image segmentation and salient object detection. A typical FCNN relies on a CNN pre-trained for image classification and fine-tunes the CNN's parameters for the segmentation task, often adding or replacing some layers. These pre-trained CNNs usually contain a very large number of parameters, e.g. 138 million for VGG-16. Such large networks require a lot of memory, which makes them challenging to deploy on limited-memory devices such as mobile phones. There have been some efforts to reduce the memory requirement of a CNN via pruning or quantizing the weights of the network; however, these approaches are post-processing operations on large networks trained on millions of images. For some segmentation tasks such as salient object detection, one might question the need for such a high-capacity network in the first place. It can be argued that such a network might be an overkill for salient object detection and that one can achieve reasonable performance with a much smaller network. Moreover, object recognition CNNs have greatly reduced resolutions in their final layer activations due to pooling or strided convolution operations throughout the network. In order to atone for this resolution loss, segmentation networks either introduce additional connections to make use of the localization power of low and middle layers [5, 10], or add a deconvolutional network with unpooling layers on top of the CNN [11, 12]. Both approaches further increase the number of parameters used in the segmentation network.
In this paper, we propose a way to overcome the two problems of FCNNs mentioned above: the requirement of a large pre-trained network and the resolution loss caused by pooling layers. To this end, we utilize a memory-efficient deep segmentation network without any pooling layers. We achieve this by encoding input images via gridized superpixels. This allows us to use low resolution images that accurately encode object edges. Using these images, we show that it is possible to train a memory-efficient FCNN with a reasonable depth, no pooling layers, yet with a large receptive field and performance comparable with the state of the art, see Fig. 1. The contributions of our work are as follows:
We propose a way to use FCNNs without any pooling layers or strided convolutions by abstracting input images via gridized superpixels.
The predictions of our network do not suffer from inaccurate object edges.
Our proposed network has less than 67k parameters (about 0.048% of others).
Our proposed network does not require a pre-trained model and can be trained from scratch on existing pixel-wise classification datasets.
We show that the performance of our method is comparable with state-of-the-art segmentation networks in the salient object detection task.
II Related Work
II-A Superpixel Gridization
Superpixel gridization produces over-segmentations that form a regular pixel-like lattice while best preserving the object edges in an image. There is a small number of studies on this topic, which we review briefly next. One early method relies on a boundary cost map, which is an inverted edge detection result. Horizontal and vertical stripes are incrementally added to the image such that no horizontal and vertical stripes intersect more than once and no two horizontal (or vertical) stripes intersect with each other. In each step, the optimal stripe is found by minimizing the boundary cost that the stripe passes through, using a min-cut based optimization algorithm.
An extended version of this method uses an alternating optimization strategy: it finds globally optimal solutions to the horizontal and vertical components of the lattice using a multi-label Markov Random Field, as opposed to the greedy optimization strategy of the original. Another work proposes a generic approach to optimally regularize superpixels extracted by any algorithm, based on placing dummy nodes between superpixels to satisfy the regularity criterion.
Finally, another method starts with regular lattice seeds and relocates the seeds, within a search space defined by the initialization, to the locally maximal boundary response. The relocated seeds are considered superpixel junctions. Next, for each junction pair, a path is found that maximizes the edge strength along it. These paths form a superpixel boundary map, which results in a regular superpixel grid.
II-B Salient Object Detection
A salient object is generally defined as the object that visually stands out from the rest of the image and is thus more appealing to the human eye.
Unsupervised salient object detection methods rely mostly on the following saliency assumptions: 1) A salient object has high local or global contrast [20, 21], 2) The boundary of an image is less likely to contain a salient object [22, 23], 3) The salient object is more likely to be large, 4) Regions with similar feature maps have similar saliency.
Prior to deep learning, supervised approaches to salient object detection focused on the following tracks: 1) Learning a dense labeling of each region as salient or not [25, 26], 2) Learning to rank salient object proposals, 3) Learning region affinities for end-to-end salient object segmentation.
Deep learning based approaches to salient object detection either train a network to classify each region in an image separately [29, 30, 31], or employ FCNNs to learn a dense pixel-wise labeling for salient object detection [32, 33]. FCNN based models utilize networks pre-trained on other tasks and employ special tricks to preserve accurate edges in the segmentation results. Next, we propose an FCNN that does not need a pre-trained network, automatically preserves accurate object edges with no pooling layers or strided convolutions, has much fewer parameters than other methods, and is comparable in performance to the state of the art.
III Proposed Method
III-A Data Preparation
We use a superpixel extraction method to abstract an input image $I$ with a small number of homogeneous image regions (superpixels). Thanks to a special property of this method, the extracted superpixels form a grid. We encode each superpixel with its mean color, thus we end up with a new, low-dimensional image $\tilde{I}$ as follows:

$\tilde{I}^c(i,j) = \frac{1}{|S_{i,j}|} \sum_{p \in S_{i,j}} I^c(p)$ (1)

In Eq. 1, $c$ indicates a channel of an image, $p$ and $(i,j)$ are indices of the $I$ and $\tilde{I}$ images respectively, $S_{i,j}$ is the set of pixels of $I$ covered by superpixel $(i,j)$, and $|\cdot|$ is the cardinality operation. We will use this image as input to an FCNN. For training purposes, we also form a low resolution version $\tilde{G}$ of the ground truth label image $G$ by encoding the mean of the 0 (not salient) and 1 (salient) values in the regions indicated by the superpixels, similar to the process in Eq. 1. In order to have binary values in the low-dimensional ground truth $\tilde{G}$, we simply threshold the above image by 0.5. Note that this is equivalent to selecting the most common value within a superpixel. It should be noted that one can reconstruct approximations $\hat{I}$ and $\hat{G}$ of $I$ and $G$ from $\tilde{I}$ and $\tilde{G}$ respectively as follows:

$\hat{I}^c(p) = \tilde{I}^c(i,j), \quad \forall p \in S_{i,j}$ (2)
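The encoding of Eq. 1 and the ground-truth binarization can be sketched in NumPy as follows. This is a minimal sketch; the grid assignment maps `row_id`/`col_id`, which give each pixel the grid row and column of the superpixel covering it, are assumed to be provided by the gridized superpixel extraction:

```python
import numpy as np

def encode(image, row_id, col_id, grid_h, grid_w):
    """Eq. 1: mean value of each channel over every superpixel grid cell.

    image:  (H, W, C) float array
    row_id: (H, W) int map, grid row of the superpixel covering each pixel
    col_id: (H, W) int map, grid column of the superpixel covering each pixel
    """
    flat = (row_id * grid_w + col_id).ravel()      # linear grid index per pixel
    counts = np.bincount(flat, minlength=grid_h * grid_w)
    channels = [np.bincount(flat, weights=image[..., c].ravel(),
                            minlength=grid_h * grid_w) / counts
                for c in range(image.shape[-1])]
    return np.stack(channels, axis=-1).reshape(grid_h, grid_w, -1)

def encode_ground_truth(gt, row_id, col_id, grid_h, grid_w):
    # Mean of the 0/1 labels per cell, thresholded at 0.5,
    # i.e. the majority label within each superpixel.
    g = encode(gt[..., None].astype(float), row_id, col_id, grid_h, grid_w)
    return g[..., 0] >= 0.5
```

The same assignment maps are reused at test time to project the network output back to the full image resolution.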
In Fig. 2, we illustrate original ($I$), encoded ($\tilde{I}$) and reconstructed ($\hat{I}$) images with their corresponding ground truths. The encoded image is the input that will be supplied to the FCNN, with the encoded ground truth as its label for training. As we observe from Fig. 2, even though the superpixel extraction is constrained by the grid structure, it reconstructs the image well and preserves the object edges.
III-B Network Architecture
We use a 28-layer deep convolutional network with residual blocks. In particular, the network begins with a convolutional (conv) layer with rectified linear unit (relu) activation, continues with residual blocks, and ends with a batch normalization (bnorm), relu and conv layer followed by a sigmoid activation. Note that we do not apply a bnorm layer right before the sigmoid, in order to avoid restricting the convolutional output to a small interval. Each residual block consists of a bnorm-relu-conv-bnorm-relu-conv structure, and the input of a residual block is short-connected to its output. We use the same number of filters in each layer. The entire network is illustrated in Fig. 3. The network's inputs and corresponding ground truths are obtained by the procedure described in the previous subsection. Note that the convolutions use zero padding and stride 1, so the input shape is preserved at every convolutional layer. Moreover, there are no pooling layers in the network architecture, in order to avoid any resolution loss. This is possible because the input resolution is already low and we can use a constant number of convolutional filters throughout the network, whereas prior networks need to reduce the resolution in order to increase the number of filters. The receptive field of our network is around 30x30, which is enough to cover the entire input for an abstraction of an image with 900 superpixels if the abstraction forms a square grid. Typically the aspect ratio of the superpixel representation varies with the image aspect ratio, but we find this receptive field sufficient to accurately detect the salient objects.
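For intuition on the receptive field figure, the receptive field of a stack of convolutional layers follows a standard recurrence. The sketch below is illustrative; the exact kernel sizes and conv-layer count of the network are not restated here, so the layer list is an assumption:

```python
def receptive_field(kernel_sizes, strides=None):
    """Receptive field of stacked conv layers: rf += (k - 1) * jump, jump *= s."""
    strides = strides or [1] * len(kernel_sizes)
    rf, jump = 1, 1
    for k, s in zip(kernel_sizes, strides):
        rf += (k - 1) * jump   # each layer widens the field by (k - 1) * jump
        jump *= s              # stride compounds the growth of later layers
    return rf

# With stride 1 throughout (no pooling, no strided convolution), each 3x3 conv
# adds exactly 2 pixels to the receptive field; e.g. fourteen 3x3 convs give 29.
print(receptive_field([3] * 14))
```

Since every layer here has stride 1, the growth is linear in depth, which is why a moderately deep plain network already covers a ~30x30 superpixel grid.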
III-C Training and Testing
The parameters of the network are optimized to minimize the binary cross entropy loss between the output of the network and the ground truth, treating the sigmoid outputs as probabilities that the corresponding input pixels are salient. Separate datasets are used for training and validation, and the model with the best validation error is selected. For testing, we use entirely different datasets and run the model learned during training as described above. During testing, an image is encoded into the low-dimensional superpixel grid representation and fed to the network. It should be noted that we apply min-max normalization to each input, i.e. we linearly scale the values between 0 and 1. The output of the network lies in the same grid structure and thus should be converted back, i.e. reconstructed, to the original image size. The reconstruction is performed simply by replicating the value of each grid node over the image region that the node corresponds to, as formulated in Eq. 2. The test-time algorithm is given in Algorithm 1.
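The test-time steps around the network (min-max normalization of the encoded input and the Eq. 2 reconstruction by replication) can be sketched as follows. The helper names and the `row_id`/`col_id` grid assignment maps are assumptions for illustration:

```python
import numpy as np

def min_max_normalize(x):
    # Linearly scale the input values to the [0, 1] interval.
    return (x - x.min()) / (x.max() - x.min())

def reconstruct(grid_output, row_id, col_id):
    # Eq. 2: every pixel takes the value of the grid node of its superpixel.
    return grid_output[row_id, col_id]

# Test-time pipeline (cf. Algorithm 1), with `model` the trained FCNN:
#   x = min_max_normalize(encoded_input)
#   y = model(x)                # same grid shape as x, values in (0, 1)
#   mask = reconstruct(y, row_id, col_id)
```

The replication step is a pure gather, so reconstruction adds negligible cost at inference time.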
IV Experimental Results
IV-A Datasets and Evaluation Metrics
We conducted evaluations on widely used salient object detection datasets. MSRA10K includes 10000 images that exhibit a simple case with one salient object and a clear background, HKU-IS includes 4447 images with slightly more challenging cases, ECSSD includes 1000 relatively complex images, PASCALS contains 850 images adopted from the PASCAL VOC segmentation dataset, and SOD contains 300 images from the BSD300 segmentation dataset. We use the two most widely used evaluation metrics, mean absolute error (MAE) and F-measure. For a saliency map $S$ and a binary ground truth $G$, MAE is defined as follows:

$\mathrm{MAE} = \frac{1}{W H} \sum_{x=1}^{W} \sum_{y=1}^{H} |S(x,y) - G(x,y)|$
Precision-recall curves are extracted by thresholding the saliency map at several values and plotting the precision and recall, which are calculated as follows:

$\mathrm{Precision} = \frac{|B \cap G|}{|B|}, \qquad \mathrm{Recall} = \frac{|B \cap G|}{|G|}$

where $B$ is the binary map obtained by thresholding the saliency map $S$. The F-measure is used to obtain a global evaluation of the precision-recall curve and is obtained as follows:

$F_\beta = \frac{(1+\beta^2)\,\mathrm{Precision} \cdot \mathrm{Recall}}{\beta^2\,\mathrm{Precision} + \mathrm{Recall}}$

It is widely adopted in the salient object detection literature to choose $\beta^2$ to be $0.3$ and to use an adaptive threshold that equals twice the mean saliency in the saliency map.
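Under these definitions, both metrics can be computed in a few lines. This is a sketch: `S` and `G` are the saliency map and binary ground truth, and clipping the adaptive threshold (twice the mean saliency) to 1 is an assumption of this sketch:

```python
import numpy as np

def mae(S, G):
    # Mean absolute error between the saliency map and the binary ground truth.
    return np.abs(S - G.astype(float)).mean()

def f_measure(S, G, beta2=0.3, thresh=None):
    # beta^2 = 0.3 as is conventional in the salient object detection literature.
    if thresh is None:
        thresh = min(2.0 * S.mean(), 1.0)  # adaptive: twice the mean saliency
    B = S >= thresh                        # binarized saliency map
    tp = np.logical_and(B, G).sum()
    precision = tp / max(B.sum(), 1)
    recall = tp / max(G.sum(), 1)
    if precision + recall == 0:
        return 0.0
    return (1 + beta2) * precision * recall / (beta2 * precision + recall)
```

With beta^2 < 1, the F-measure weights precision more heavily than recall, matching the convention cited above.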
IV-C Comparison with State-of-the-art
We compare our approach with 4 unsupervised methods: RC, CHM, DSR, EQCUT, and 8 supervised methods: DRFI, MC, ELD, MDF, RFCN, DHS, DCL and DSS. In Table I, we share results for the ECSSD and PASCALS datasets, as these are the only datasets used for testing by all methods. The methods are ordered by ascending F-measure. As one can observe from Table I, both variants of our method GRIDS achieve a better F-measure and MAE than all unsupervised methods and three supervised methods: DRFI, MC and MDF. Of these, MC and MDF are deep learning based and use around 58 and 138 million parameters respectively. The other methods that outperform ours are all deep learning based algorithms and use more than 138 million parameters - VGG-16 models with additional layers/connections. Yet, our models GRIDS16 and GRIDS32 use only around 67 and 248 thousand parameters, which corresponds to 0.048% and 0.18% of the other methods' parameters respectively, and still achieve accuracy comparable with the state of the art. The number of parameters each deep learning based method uses, together with run times and the GPUs used, is given in Table II. Our method has the smallest memory requirement and the fastest run time. At this point, one should note that the superpixel extraction time is not included in the above table; it takes an additional ~0.5 seconds for an image of size 300x400 at superpixel granularity 950.
IV-D Analysis and Variants
In this section, we investigate the impact of several factors on our method's performance. First, we evaluate the robustness of the performance to different superpixel granularities. We report the test performance when employing 900, 950 and 1000 superpixels in Table III. The experiments are made with the GRIDS32 model. It can be observed that the resolution change in this interval has an insignificant impact on our method's performance and does not alter the ranking in Table I.
Next, we try to improve the performance by combining the segmentation results from all resolutions via majority voting. Experiments are made with the GRIDS32 model. It can be observed from Table III that the multi-resolution approach (GRIDSM) results in a notable performance improvement in both MAE and F-measure. The rank of GRIDSM is the same as that of GRIDS for MAE, but it beats one more deep learning method (ELD) in F-measure in Table I.
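The multi-resolution combination amounts to majority voting over the reconstructed binary predictions from each granularity. A minimal sketch (counting ties as salient is an assumption of this sketch):

```python
import numpy as np

def majority_vote(masks):
    """Combine full-resolution binary predictions from several superpixel
    granularities: a pixel is salient if at least half of the masks agree."""
    votes = np.mean(np.stack(masks).astype(float), axis=0)
    return votes >= 0.5   # ties (exactly half the votes) count as salient
```

Because the vote happens after reconstruction to the original image size, masks from different granularities align pixel-wise regardless of their grid dimensions.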
One might argue that a natural baseline for our network is (a) plain downsampling of the datasets, (b) training a network on the downsampled images and ground truths, and (c) upsampling the results to evaluate the performance. This would highlight the benefit of dimension reduction and reconstruction with gridized superpixel encoding over plain downsampling and upsampling. To make this comparison, we train a network with exactly the same structure as described above, only this time on downsampled images and ground truths. Ground truths are again binarized by thresholding at 0.5. Scale augmentation is similarly performed by choosing downsampled dimensions that result in around 900, 925, 950, 975 and 1000 pixels while preserving the aspect ratio. During test time we again use only the 950 granularity. Bicubic downsampling and upsampling are used. Experiments are made with the GRIDS32 model. The comparison in Table V clearly indicates the improvement of the superpixel gridization encoding scheme over plain downsampling. Especially the improvement in F-measure is dramatic: up to a 13% relative improvement.
As previously mentioned, since our method is trained from scratch, we need more data to train on. That is why we use the largest datasets, DUT-OMRON, HKU-IS and MSRA10K, for training. We would like to emphasize that the majority of other works use only MSRA10K for training and validation when fine-tuning the pre-trained networks they use. For transfer learning, fine-tuning with a small amount of data is known to give satisfactory results. Since our network is trained from scratch, such a small dataset is not enough to reach satisfactory generalization. Moreover, we do not have the advantage of starting from a network pre-trained on millions of images for object recognition - as others do - which is clearly expected to contribute to salient object detection performance by acting as a top-down prior, i.e. exploiting the semantically higher level information of object recognition for detecting the salient object. Therefore, we question the fairness of a comparison where our method uses only MSRA10K. Yet, in order to give a complete set of experiments, we have also trained our model with the other methods' training and validation sets (partitions of the MSRA10K dataset) and report the test results in Table VI. Experiments are made with the GRIDS32 model. Clearly, the model trained only on MSRA10K is inferior to the one trained on MSRA10K, HKU-IS and DUT-OMRON.
V Conclusion and Future Work
We have presented a deep, fast and memory-efficient method for salient object segmentation that operates on images encoded with gridized superpixels. Thanks to the boundary-preserving gridized superpixel encoding, we do not suffer from blurry object boundaries. Moreover, the network does not employ any pooling layers, thus further resolution loss is prevented. This also eliminates the need for tricks such as additional connections and layers to atone for the resolution loss. We have shown that our method can outperform some deep learning based methods and shows accuracy comparable with others while having only 0.048% of their parameters. With only 430 KB of memory, the network is extremely easy to deploy to any device. This makes the method especially preferable for applications on mobile and small IoT devices. The presented framework can be applied to any pixel-wise labeling task, such as semantic segmentation; this will be the main topic of future improvements of this work.
[1] A. Krizhevsky, I. Sutskever and G. E. Hinton, "ImageNet Classification with Deep Convolutional Neural Networks", Advances in Neural Information Processing Systems (NIPS), pp. 1097-1105, 2012.
[2] K. Simonyan and A. Zisserman, "Very Deep Convolutional Networks for Large-Scale Image Recognition", arXiv:1409.1556, 2014.
[3] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov and A. Rabinovich, "Going Deeper with Convolutions", IEEE Conference on Computer Vision and Pattern Recognition, pp. 1-9, 2015.
[4] K. He, X. Zhang, S. Ren and J. Sun, "Deep Residual Learning for Image Recognition", IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778, 2016.
[5] J. Long, E. Shelhamer, and T. Darrell, "Fully Convolutional Networks for Semantic Segmentation", IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431-3440, 2015.
[6] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, "The PASCAL Visual Object Classes (VOC) Challenge", International Journal of Computer Vision, vol. 88, no. 2, pp. 303-338, 2010.
[7] T. Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollar, and C. L. Zitnick, "Microsoft COCO: Common Objects in Context", European Conference on Computer Vision, pp. 740-755, 2014.
[8] R. Mottaghi, X. Chen, X. Liu, N. G. Cho, S. W. Lee, S. Fidler, and R. Urtasun, "The Role of Context for Object Detection and Semantic Segmentation in the Wild", IEEE Conference on Computer Vision and Pattern Recognition, pp. 891-898, 2014.
[9] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba, "Semantic Understanding of Scenes Through the ADE20K Dataset", arXiv:1608.05442, 2016.
[10] B. Hariharan, P. Arbelaez, R. Girshick, and J. Malik, "Hypercolumns for Object Segmentation and Fine-grained Localization", IEEE Conference on Computer Vision and Pattern Recognition, 2015.
[11] V. Badrinarayanan, A. Kendall, and R. Cipolla, "SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation", arXiv:1511.00561, 2015.
[12] H. Noh, S. Hong, and B. Han, "Learning Deconvolution Network for Semantic Segmentation", International Conference on Computer Vision, pp. 1520-1528, 2015.
[13] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, "DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution and Fully Connected CRFs", arXiv:1606.00915, 2016.
[14] F. Yu and V. Koltun, "Multi-scale Context Aggregation by Dilated Convolutions", arXiv:1511.07122, 2015.
[15] H. Fu, X. Cao, D. Tang, Y. Han, and D. Xu, "Regularity Preserved Superpixels and Supervoxels", IEEE Transactions on Multimedia, vol. 16, no. 4, pp. 1165-1175, 2014.
[16] L. Li, W. Feng, L. Wan, and J. Zhang, "Maximum Cohesive Grid of Superpixels for Fast Object Localization", IEEE Conference on Computer Vision and Pattern Recognition, pp. 3174-3181, 2013.
[17] A. P. Moore, S. J. Prince, J. Warrell, U. Mohammed, and G. Jones, "Superpixel Lattices", IEEE Conference on Computer Vision and Pattern Recognition, pp. 1-8, 2008.
[18] A. P. Moore, S. J. Prince, and J. Warrell, "Lattice Cut - Constructing Superpixels Using Layer Constraints", IEEE Conference on Computer Vision and Pattern Recognition, pp. 2117-2124, 2010.
[19] A. Borji, M. M. Cheng, H. Jiang, and J. Li, "Salient Object Detection: A Survey", arXiv:1411.5878, 2014.
[20] M. M. Cheng, N. J. Mitra, X. Huang, P. H. Torr, and S. Hu, "Global Contrast Based Salient Region Detection", IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015.
[21] C. Aytekin, S. Kiranyaz, and M. Gabbouj, "Automatic Object Segmentation by Quantum Cuts", International Conference on Pattern Recognition, pp. 112-117, 2014.
[22] X. Li, H. Lu, L. Zhang, X. Ruan, and M.-H. Yang, "Saliency Detection via Dense and Sparse Reconstruction", International Conference on Computer Vision, pp. 2976-2983, 2013.
[23] C. Aytekin, E. C. Ozan, S. Kiranyaz, and M. Gabbouj, "Extended Quantum Cuts for Unsupervised Salient Object Extraction", Multimedia Tools and Applications, vol. 76, no. 8, pp. 10443-10463, 2017.
[24] C. Aytekin, A. Iosifidis, and M. Gabbouj, "Probabilistic Saliency Estimation", Pattern Recognition, https://doi.org/10.1016/j.patcog.2017.09.023, 2017.
[25] H. Jiang, J. Wang, Z. Yuan, Y. Wu, N. Zheng, and S. Li, "Salient Object Detection: A Discriminative Regional Feature Integration Approach", IEEE Conference on Computer Vision and Pattern Recognition, pp. 2083-2090, 2013.
[26] X. Li, Y. Li, C. Shen, A. Dick, and A. Van Den Hengel, "Contextual Hypergraph Modeling for Salient Object Detection", International Conference on Computer Vision, pp. 3328-3335, 2013.
[27] C. Aytekin, S. Kiranyaz, and M. Gabbouj, "Learning to Rank Salient Segments Extracted by Multispectral Quantum Cuts", Pattern Recognition Letters, vol. 72, pp. 91-99, 2016.
[28] C. Aytekin, A. Iosifidis, S. Kiranyaz, and M. Gabbouj, "Learning Graph Affinities for Spectral Graph-based Salient Object Detection", Pattern Recognition, vol. 64, pp. 159-167, 2017.
[29] R. Zhao, W. Ouyang, H. Li, and X. Wang, "Saliency Detection by Multi-Context Deep Learning", IEEE Conference on Computer Vision and Pattern Recognition, pp. 1265-1274, 2015.
[30] G. Li and Y. Yu, "Visual Saliency Based on Multiscale Deep Features", IEEE Conference on Computer Vision and Pattern Recognition, pp. 5455-5463, 2015.
[31] G. Lee, Y.-W. Tai, and J. Kim, "Deep Saliency with Encoded Low Level Distance Map and High Level Features", IEEE Conference on Computer Vision and Pattern Recognition, pp. 660-668, 2016.
[32] L. Wang, L. Wang, H. Lu, P. Zhang, and X. Ruan, "Saliency Detection with Recurrent Fully Convolutional Networks", European Conference on Computer Vision, pp. 825-841, 2016.
[33] G. Li and Y. Yu, "Deep Contrast Learning for Salient Object Detection", IEEE Conference on Computer Vision and Pattern Recognition, pp. 478-487, 2016.
[34] T. Liu, Z. Yuan, J. Sun, J. Wang, N. Zheng, X. Tang, and H. Y. Shum, "Learning to Detect a Salient Object", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 2, pp. 353-367, 2011.
[35] N. Liu and J. Han, "DHSNet: Deep Hierarchical Saliency Network for Salient Object Detection", IEEE Conference on Computer Vision and Pattern Recognition, pp. 678-686, 2016.
[36] Q. Hou, M. M. Cheng, X. Hu, A. Borji, Z. Tu, and P. Torr, "Deeply Supervised Salient Object Detection with Short Connections", IEEE Conference on Computer Vision and Pattern Recognition, pp. 5300-5309, 2017.
[37] Q. Yan, L. Xu, J. Shi, and J. Jia, "Hierarchical Saliency Detection", IEEE Conference on Computer Vision and Pattern Recognition, pp. 1155-1162, 2013.
[38] Y. Li, X. Hou, C. Koch, J. M. Rehg, and A. L. Yuille, "The Secrets of Salient Object Segmentation", IEEE Conference on Computer Vision and Pattern Recognition, pp. 280-287, 2014.
[39] V. Movahedi and J. H. Elder, "Design and Perceptual Validation of Performance Measures for Salient Object Segmentation", IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 49-56, 2010.
[40] D. Martin, C. Fowlkes, D. Tal, and J. Malik, "A Database of Human Segmented Natural Images and its Application to Evaluating Segmentation Algorithms and Measuring Ecological Statistics", International Conference on Computer Vision, pp. 416-423, 2001.
[41] V. Nair and G. E. Hinton, "Rectified Linear Units Improve Restricted Boltzmann Machines", Proceedings of the 27th International Conference on Machine Learning, pp. 807-814, 2010.
[42] S. Ioffe and C. Szegedy, "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift", International Conference on Machine Learning, pp. 448-456, 2015.
[43] X. Glorot and Y. Bengio, "Understanding the Difficulty of Training Deep Feedforward Neural Networks", International Conference on Artificial Intelligence and Statistics, pp. 249-256, 2010.
[44] S. Han, J. Pool, J. Tran, and W. Dally, "Learning Both Weights and Connections for Efficient Neural Network", Advances in Neural Information Processing Systems, pp. 1135-1143, 2015.
[45] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan, "Deep Learning with Limited Numerical Precision", International Conference on Machine Learning, pp. 1737-1746, 2015.