1 Introduction
Visual saliency has long been a fundamental problem in neuroscience, psychology, and computer vision borji2013state ; borji2015salient . It refers to the identification of the portion of essential visual information contained in an image. Recently, studies of visual saliency have been extended from predicting eye fixations to identifying regions containing salient objects, known as salient object detection or saliency detection wang2017salient . Tremendous effort has been devoted to saliency detection over the past decades owing to its extensive real-world applications in computer vision and pattern recognition brown2015generalisable ; chen2015salient . For example, object detection and recognition become much more efficient and reliable by exploring only the salient locations and ignoring the large irrelevant background.

Existing approaches for saliency detection can be divided into two categories: bottom-up (or stimulus-driven) approaches and top-down (or task-driven) approaches borji2013state . Bottom-up approaches detect salient regions using only low-level visual information such as color, texture and localization, without requiring any specific knowledge of the objects and/or background. By contrast, top-down approaches, including recently proposed deep-learning-based methods (e.g., jetley2016end ; zhang2017learning ; wang2018detect ), utilize high-level human perceptual knowledge such as object labels or semantic information to guide the estimation of saliency maps. Compared with top-down methods, bottom-up ones require less computational power and exhibit better generality and scalability borji2013state ; borji2015salient .

A recent trend is to combine bottom-up cues with top-down priors to facilitate saliency detection using low-rank matrix recovery (LRMR) theory candes2011robust . Generally speaking, these methods (e.g., yan2010visual ; lang2012saliency ; zou2013segmentation ) assume that a natural scene image consists of visually consistent background regions (corresponding to a highly redundant information component with low-rank structure) and distinctive foreground regions (corresponding to a visually salient component with sparse structure). In yan2010visual , Yan et al. proposed an LRMR based model using a sparse representation of image features as input, where the sparse representation is obtained by learning a dictionary upon image patches. In lang2012saliency , Lang et al. introduced a multitask sparsity pursuit for saliency detection, where a single low-rank matrix decomposition is replaced by seeking consistently sparse elements from the joint decompositions of multiple-feature matrices into pairs of low-rank and sparse matrices. Despite promising results achieved by various LRMR-based methods, two challenging problems remain peng2017salient : 1) inter-correlations among elements within the sparse component are neglected, causing incompleteness or scattering of the detected object; 2) the low-rank matrix recovery model struggles to separate salient objects from the background when the background is cluttered or has a similar appearance to the salient objects. Therefore, a tree-structured sparsity constraint and Laplacian regularization are introduced in peng2017salient to address these two issues respectively.
In this paper, we first argue that the main reason for these two problems is that the spatial relationship among image regions (or superpixels) is not fully taken into consideration in the original LRMR model. Moreover, the structured-sparse constraint in peng2017salient cannot effectively preserve such a relationship. To this end, we propose a novel LRMR based saliency detection method under a coarse-to-fine framework to address the key issue while maintaining high efficiency. Our framework features two successive modules: a coarse-processing module, in which a Laplacian smoothing term is integrated into a baseline $\ell_1$-norm constrained LRMR model to roughly detect salient regions; and a refinement module, in which a projection is learned upon the coarse saliency map to enhance object boundaries.
To summarize, our main contributions are threefold:

An effective saliency detection model, integrating $\ell_1$-norm sparsity constrained LRMR and Laplacian regularization, is proposed to roughly detect salient regions. We set this as our baseline model and demonstrate that it performs well in diverse scenes.

A learning-based refinement module is developed to assign more accurate saliency values to obscure regions, i.e., regions located around object boundaries, thus promoting the entirety of detected salient objects.

Extensive experiments are conducted on three benchmark datasets to demonstrate the superiority of our method against other LRMR based methods and the efficacy of the proposed coarse-to-fine framework.
2 Related Work
An extensive review on saliency detection is beyond the scope of this paper. We refer interested readers to two recently published surveys borji2013state ; borji2015salient for more details about existing bottomup and topdown approaches for saliency detection. This section first briefly reviews the prevailing unsupervised bottomup saliency detection methods, and then introduces several popular LRMR based methods that are closely related to our work.
2.1 Popular Bottom-up Saliency Detection Methods
As a pioneering work, Itti et al. itti1998model suggested using “Center and Surround” filters to extract image features and to simulate the human vision system on multi-scale levels to generate saliency maps. Motivated by Itti’s framework, various contrast-based approaches have been developed in past decades, which include local-contrast-based ones (e.g., goferman2012context ; jiang2011automatic ), global-contrast-based ones (e.g., cheng2015global ; perazzi2012saliency ; margolin2013makes ), and those combining both local and global contrasts (e.g., borji2012exploiting ; lu2012saliency ; lu2014robust ). Local contrast is estimated by measuring the difference between a “center” pixel or small region and its neighbors, thus it is sensitive to high-frequency changes such as edges and noise. On the contrary, global contrast is much more robust to local textures and edges, but it can fail to distinguish salient objects from a background that shares high similarity with the objects borji2015salient ; kim2016salient ; li2016double .
On the other hand, the frequency domain also provides a reliable avenue for salient object detection. For example, Hou and Zhang hou2007saliency analyzed the spectral residual of an image in the spectral domain, where the high-frequency components are considered as background. A similar work was presented by Fang et al. fang2012bottom , where the standard Fast Fourier Transform (FFT) is substituted with the Quaternion Fourier Transform (QFT). Other representative examples include li2013visual ; imamoglu2013saliency .

Graph theory based methods (e.g., yang2013saliency ; wang2016grab ; zhu2014saliency ) have attracted increasing attention in recent years due to their superior robustness and adaptability. For instance, Yang et al. yang2013saliency adopted manifold ranking to rank the similarity of superpixels with foreground and background seeds. Based on this model, Wang et al. wang2016grab suggested detecting saliency by combining local graph structure and background priors together. This way, salient information among different nodes can be jointly exploited. However, a fully connected graph suffers from high computational cost.
2.2 LRMR-based Saliency Detection Methods
The usage of LRMR theory on saliency detection was initiated by Yan et al. yan2010visual and then extended in shen2012unified . Generally, LRMR based methods assume that an image consists of an information-redundant part and a visually salient part, which are characterized by a low-rank component and a sparse component respectively. Specifically, a given image is first divided into $N$ small regions or superpixels $\{P_i\}_{i=1}^{N}$ to reduce computational complexity, where $N$ is the number of regions. Features are extracted for each region, forming a feature matrix $\mathbf{F} \in \mathbb{R}^{D \times N}$. The LRMR theory is deployed to decompose $\mathbf{F}$ as follows:
$$\min_{\mathbf{L}, \mathbf{S}} \; \|\mathbf{L}\|_{*} + \lambda \|\mathbf{S}\|_{1} \quad \text{s.t.} \quad \mathbf{F} = \mathbf{L} + \mathbf{S}, \tag{1}$$
where $\|\cdot\|_{*}$ denotes the nuclear norm for the low-rank component $\mathbf{L}$ and $\|\cdot\|_{1}$ denotes the $\ell_1$ norm that is used to encourage sparseness. $\lambda$ is a tradeoff parameter balancing the low-rank term and the sparse term. After the decomposition, a saliency map can be generated from the obtained sparse matrix $\mathbf{S}$:
$$\mathrm{Sal}(P_i) = \|\mathbf{S}(:, i)\|_{1}, \tag{2}$$
where $\mathbf{S}(:, i)$ denotes the $i$-th column of matrix $\mathbf{S}$. Note that $\mathbf{S}(:, i)$ is a vector herein, thus its $\ell_1$ norm is the sum of the absolute values of its entries.

Early LRMR based methods are data-dependent, i.e., the learned dictionaries or transformations depend heavily on the selected training images or image patches, and thus suffer from limited adaptability and generalization capability. To this end, various approaches have been developed in an unsupervised manner by either adopting a multitask scheme (e.g., lang2012saliency ) or introducing extra priors (e.g., zou2013segmentation ; zhang2017salient ). For example, Lang et al. lang2012saliency proposed to jointly decompose multiple-feature matrices instead of directly combining individual saliency maps generated by decomposing each feature matrix. Zou et al. zou2013segmentation introduced segmentation priors to cooperate with sparse saliency in an advanced manner. To preserve the entirety of detected objects, saliency fusion models (e.g., li2016double ; li2014visual ; li2017saliency ; huang2015saliency ) were proposed thereafter. For instance, double low-rank matrix recovery (DLRMR) was suggested in li2016double to fuse saliency maps detected by multiple approaches.
Although the above extensions improved algorithm robustness to cluttered backgrounds, two open problems remain. First, extra priors zou2013segmentation or complicated operations (such as saliency fusion li2016double ; huang2015saliency ) may incur expensive computational cost. Second, these methods neglect the spatial relationship among image regions, and therefore cannot ensure the entirety of detected objects. The first work that attempts to address the above two limitations is the recently proposed structured matrix decomposition (SMD) by Peng et al. peng2017salient . Specifically, SMD introduces two new regularization terms to Eq. (1): a tree-structured sparse constraint that is used to preserve inter-correlations among sparse elements and a Laplacian regularization term that is adopted to enlarge the difference between foreground and background. This way, the spatial relationship among sparse elements and the coherence between the low-rank component and the sparse component are explicitly modeled and optimized in a unified model. The objective of SMD is formulated as:
(3) 
where the matrix represents highlevel priors shen2012unified , and denotes dotproduct of matrices. The term denotes the structuredsparse constraint, is the norm (), is the depth (or layer) of index tree and is the number of nodes at the th layer. Here denotes the th node at the th level of the index tree such that ( and ), , and , where is the indexing at the th level. ( denotes set cardinality) is the submatrix of corresponding to node . The third term is introduced to promote the performance under cluttered background, where is a parameter that balances this regularization and the other two terms. is unnormalized graph Laplacian matrix.
Our work is directly motivated by SMD peng2017salient . However, two observations prompt us to propose our method:

SMD uses Laplacian constraint to reduce the coherence between low rank component and sparse component under cluttered background. In fact, the Laplacian constraint is not novel in saliency detection literature. In our perspective, it performs more like a smooth term (just like it does in previous saliency detection literature) that can hardly increase the discrepancy between foreground and background.

The structured-sparse constraint in SMD cannot effectively preserve the spatial relationship among image regions. In fact, it may even disrupt such a relationship if we apply this constraint on deep layers (as recommended by the authors).
The effects or functionality of Laplacian constraint can trace back to early work on saliency detection (e.g., borji2013state ; borji2015salient ; zhu2014saliency ; lu2014learning ), which use it as a smooth regularization term to reduce the discrepancy of saliency values from regions that have similar appearance or feature representations. Therefore, in the scenario of cluttered background (i.e., the salient object may be interfered by the background), the Laplacian constraint can hardly increase the discrepancy between foreground and background.
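The smoothing role of the Laplacian term can be checked numerically. The following is a minimal sketch, assuming the regularizer takes the common form $\mathrm{tr}(\mathbf{S}\mathbf{M}\mathbf{S}^{\top})$ with $\mathbf{M}$ the unnormalized Laplacian of a symmetric affinity matrix $\mathbf{W}$ between superpixels; the function names and this exact form are illustrative assumptions, not taken from the SMD implementation. Under that assumption the term equals a weighted sum of squared differences between columns, so it only rewards similar values on strongly connected regions.

```python
import numpy as np

def unnormalized_laplacian(W):
    """M = D - W, where D is the diagonal degree matrix of the affinity W."""
    return np.diag(W.sum(axis=1)) - W

def laplacian_penalty(S, W):
    """Trace form of the (assumed) regularizer: tr(S M S^T)."""
    return float(np.trace(S @ unnormalized_laplacian(W) @ S.T))

def pairwise_penalty(S, W):
    """Equivalent pairwise form: 0.5 * sum_ij W_ij * ||S[:, i] - S[:, j]||^2."""
    diff = S[:, :, None] - S[:, None, :]      # shape (D, N, N)
    sq_dist = (diff ** 2).sum(axis=0)         # shape (N, N)
    return float(0.5 * (W * sq_dist).sum())

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    N, D = 6, 4
    A = rng.random((N, N))
    W = (A + A.T) / 2.0                       # symmetric affinity (hypothetical)
    np.fill_diagonal(W, 0.0)
    S = rng.normal(size=(D, N))
    print(laplacian_penalty(S, W), pairwise_penalty(S, W))
```

Because the penalty vanishes whenever all connected columns are equal, it pulls the saliency of similar-looking regions together, which is the behavior of a smoothness prior.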
Regarding the second argument, the spatial relationship among superpixels is taken into consideration in the construction of tree nodes . However, such a relationship is not preserved if we naively impose the norm sparse constraint on these nodes. It should be pointed out that in the deepest level of the tree, one node is composed of a single superpixel, whereas in the shallowest level, one node is composed of all the superpixels. According to scale theory, there exists an optimal scale for an object lindeberg1998feature . However, in the tree-structured sparsity constraint, nodes in different levels contribute equally to the final sparsity, which does not emphasize or highlight the spatial relationship among image regions. Moreover, one should note that the norm and the norm in a specific node lead to row-sparsity and column-sparsity respectively, which has little relationship to the spatial structure.
3 Our Method
This paper proposes a novel LRMR based saliency detection method under a coarse-to-fine framework that can effectively preserve object entirety, even in scenarios with multiple objects or cluttered backgrounds. To this end, we integrate the basic LRMR model in Eq. (1) and Laplacian regularization to generate a coarse saliency map. Then, we learn a projection on top of superpixels sampled from the coarse saliency map to obtain the final saliency. By exploiting the spatial relationship among superpixels in the refinement module, the proposed method is robust to cluttered backgrounds. The overall flowchart of our method is illustrated in Fig. 1.
3.1 The Limitation of Tree-Structured Sparsity in SMD
In Section 2.2, we pointed out that the tree-structured regularization in SMD is not suitable for salient object detection. In this section, we further propose two arguments to specify its limitations: (1) for images containing only a single object, the regularization imposed on shallow layers of the index tree is sufficient to render satisfactory performance, and (2) for images containing multiple objects or complex scenes, the regularization imposed on deeper layers will destroy the spatial structure of a group of objects, thus disrupting the entirety of detected salient regions.
To experimentally validate the effects of the structured-sparse regularization in Eq. (3) and of our coarse-to-fine architecture, we give two examples in Fig. 2 (more examples are shown in the supplementary material). Specifically, we construct a four-layer index tree for validation. It is worth noting that the bottom layer (the 4th layer) of the index tree is composed of graphs, each containing a superpixel, whereas the top layer (the 1st layer) of the index tree only contains one graph that incorporates all superpixels. The norm constraint is applied to each graph separately and then the results are summed.
The first image is presented to illustrate the case of a single object in a pure background. Comparing Fig. 2(c1) with Fig. 2(e1) and Fig. 2(g1) respectively, we can observe that adding the constraint to the 2nd layer eliminates irrelevant background, while a deeper constraint is unnecessary for preserving the spatial structure of the flower. Meanwhile, comparing Fig. 2(c1) with Fig. 2(i1), we can see that our coarse-to-fine architecture is also able to remove irrelevant background, e.g., regions below the flower.
The second image is presented to illustrate the case of multiple objects. Comparing Fig. 2(c2) with Fig. 2(e2) and Fig. 2(g2), we can observe that adding the constraint to the 2nd layer promotes the structural entirety of objects to some extent, while a deeper constraint destroys the spatial structure of the bodies. By contrast, comparing Fig. 2(c2) with Fig. 2(i2), we can see that our coarse-to-fine architecture produces more accurate saliency for superpixels around object boundaries, e.g., superpixels in leg areas adjacent to the image boundary, and thus improves the entirety of salient objects.
3.2 Coarse Saliency from Low-Rank Matrix Recovery
Due to the limitations of tree-structured sparsity, we revert to the original $\ell_1$-norm sparsity constraint, yielding sparsity by treating each element individually. Specifically, we roughly measure the saliency of image regions using
(4) 
where matrices , is unnormalized graph Laplacian matrix. Once the lowrank matrix and sparse matrix are determined, saliency value of the th superpixel can be calculated by Eq. (2).
Optimization: The optimization problem in Eq. (4) can be efficiently solved via the alternating direction method of multipliers (ADMM) lin2011linearized . For simplification, we denote the projected feature matrix as . An auxiliary variable is introduced and the problem in Eq. (4) becomes
(5) 
Lagrange multipliers and are introduced to remove the equality constraints, and the augmented Lagrangian function is constructed as
(6) 
where is the penalty parameter.
Iterative steps of minimizing the Lagrangian function are utilized to optimize Eq. (4), and the stopping criteria at step are given by Eq. (7) and Eq. (8)
(7)  
(8) 
The variables and can be alternately updated by minimizing the augmented Lagrangian function with other variables fixed. In this model, each variable can be updated with a closed form solution. With respect to and , they can be updated as follows
(9) 
(10) 
where the soft-thresholding operator is defined by
and , where SVD is the singular value decomposition.
Regarding and , we can update them as follows
(11)  
(12)  
(13)  
(14) 
where the parameter controls the convergence speed.
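The alternating updates described above can be sketched in NumPy. This is a hedged illustration rather than the authors' implementation: it assumes the coarse model has the form $\min \|\mathbf{L}\|_{*} + \alpha\|\mathbf{S}\|_{1} + \beta\,\mathrm{tr}(\mathbf{S}\mathbf{M}\mathbf{S}^{\top})$ subject to $\mathbf{F} = \mathbf{L} + \mathbf{S}$, with an auxiliary copy $\mathbf{H} = \mathbf{S}$ carrying the $\ell_1$ term. The names `coarse_lrmr`, `column_saliency`, the initial penalty value, and all default parameters are assumptions made for this example.

```python
import numpy as np

def soft_threshold(X, tau):
    """Element-wise shrinkage: sign(x) * max(|x| - tau, 0)."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def singular_value_threshold(X, tau):
    """Shrink the singular values of X by tau (proximal map of the nuclear norm)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def coarse_lrmr(F, M, alpha, beta, rho=1.2, mu_max=1e6, max_iter=300, tol=1e-6):
    """ADMM sketch for the assumed Laplacian-regularized LRMR model.

    F: (D, N) feature matrix, M: (N, N) symmetric graph Laplacian.
    Returns the low-rank part L and the sparse part S.
    """
    D, N = F.shape
    L, S, H = np.zeros((D, N)), np.zeros((D, N)), np.zeros((D, N))
    Y1, Y2 = np.zeros((D, N)), np.zeros((D, N))   # Lagrange multipliers
    mu = 1.25 / np.linalg.norm(F, 2)              # initial penalty (a common heuristic)
    eye = np.eye(N)
    for _ in range(max_iter):
        # Low-rank update: singular value thresholding.
        L = singular_value_threshold(F - S + Y1 / mu, 1.0 / mu)
        # S update: closed form, requires solving an N x N linear system.
        A = 2.0 * beta * M + 2.0 * mu * eye
        B = mu * (F - L + H) + Y1 - Y2
        S = np.linalg.solve(A, B.T).T             # A is symmetric
        # Auxiliary sparse update: soft-thresholding.
        H = soft_threshold(S + Y2 / mu, alpha / mu)
        # Multiplier and penalty updates; rho controls convergence speed.
        R1, R2 = F - L - S, S - H
        Y1, Y2 = Y1 + mu * R1, Y2 + mu * R2
        mu = min(rho * mu, mu_max)
        if max(np.linalg.norm(R1), np.linalg.norm(R2)) <= tol * np.linalg.norm(F):
            break
    return L, S

def column_saliency(S):
    """Saliency of each region as the l1 norm of its column in S."""
    return np.abs(S).sum(axis=0)
```

With `beta = 0` the sketch reduces to plain low-rank plus sparse decomposition; the linear solve in the `S` update is the step that introduces matrix inversion into the per-iteration cost.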
3.3 Learning-based Saliency Refinement
As we have discussed in Section 2.2, the coarse saliency map generated by LRMR based approaches ignores spatial relationship among adjacent superpixels. To further improve the detection results, we refine the coarse saliency by learning a projection from image features to saliency values.
Given the coarse saliency calculated using Eq. (2), we can roughly distinguish salient regions from background. In order to obtain common interior feature of foreground and background respectively, we choose confident superpixels based on their coarse saliency value. Specifically, we set two thresholds to select confident superpixel samples for background and for foreground respectively, i.e., superpixels with saliency value lower than are considered as negative samples, and superpixels with saliency value higher than are considered as positive ones. We denote as the sample matrix composed of both positive and negative samples, and as corresponding label matrix, where is the total number of confident samples. For the th positive sample, its label vector is , while for the th negative sample, its label vector is . See Fig. 3 for more intuitive examples.
In order to determine the saliency of those tough samples , we utilize their spatial relationship with these confident samples, as shown in Fig. 3. Based on the coarse saliency and adjacency, rough saliency for the th tough sample is generated by
(15) 
where is the number of superpixels adjacent to the th tough sample , and denotes the number of pixels contained in the th superpixel. Similarly, we formulate label vector of as , and the label matrix , where is the number of tough samples.
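The sample selection and the treatment of tough samples can be sketched as follows. This is a minimal illustration under stated assumptions: the thresholds `t_low` and `t_high`, the function names, and the adjacency dictionary are hypothetical, and the rough saliency of a tough sample is assumed to be a pixel-count-weighted average of the coarse saliency of its adjacent superpixels, which is one plausible reading of Eq. (15).

```python
import numpy as np

def split_samples(coarse_sal, t_low, t_high):
    """Split superpixels into confident positives, confident negatives and tough ones."""
    coarse_sal = np.asarray(coarse_sal, dtype=float)
    negatives = np.flatnonzero(coarse_sal < t_low)
    positives = np.flatnonzero(coarse_sal > t_high)
    tough = np.flatnonzero((coarse_sal >= t_low) & (coarse_sal <= t_high))
    return positives, negatives, tough

def rough_saliency(coarse_sal, n_pixels, neighbours, index):
    """Assumed form of Eq. (15): size-weighted mean saliency of adjacent superpixels."""
    nb = np.asarray(neighbours[index])
    weights = np.asarray(n_pixels, dtype=float)[nb]
    values = np.asarray(coarse_sal, dtype=float)[nb]
    return float((weights * values).sum() / weights.sum())

if __name__ == "__main__":
    sal = [0.1, 0.5, 0.9, 0.05]          # coarse saliency of four superpixels
    sizes = [100, 50, 300, 80]           # pixels per superpixel
    adjacency = {1: [0, 2]}              # superpixel 1 touches superpixels 0 and 2
    pos, neg, tough = split_samples(sal, t_low=0.2, t_high=0.8)
    print(pos, neg, tough, rough_saliency(sal, sizes, adjacency, 1))
```

In this toy case the tough superpixel inherits most of its value from the larger, highly salient neighbour, which is how adjacency injects spatial information into the refinement stage.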
[Figure 3: illustration of confident and tough samples, panels (a) to (d).]
Combining the coarse saliency for confident samples and tough samples, we build our saliency refining model as follows
(16) 
where and represent tough samples and corresponding labels, respectively. is the projection to be learned, and are regularization parameters. The first term imposes regularization on to avoid overfitting, whereas the second and third terms require respectively labeled confident and tough samples. Once the projection is learned, saliency of those tough superpixels are given by the first column of matrix .
Despite the simplicity of Eq. (16), one should note that the background region is typically much larger than the salient region. This leads to the issue of learning from imbalanced data. In order to overcome this limitation, we introduce sample-wise weights to balance the contributions of positive and negative samples in projection learning, which is formulated as follows
(17) 
where is the weight for the th confident sample. Now the second term distinguishes the importance of positive samples from that of negative ones. In fact, we can simplify Eq. (17) by combining the second term and the third term with generalized weights as follows
(18) 
where is the weight for the th sample, whether positive, negative or tough. Given that there are many more negative samples than positive ones, we adopt the weighting strategy that is widely used in imbalanced data problems sun2009strategies to balance the effect of positive and negative samples, i.e., , where and are the weights of the th positive sample and the th negative sample, respectively. and denote the number of negative and positive samples. Moreover, noting that labels of positive/negative samples are more reliable than those of tough ones, the weight of a tough sample is set to be half of that for a confident sample. To summarize, the weighting scheme is given by
where . Optimization problem in Eq. (18) can be efficiently solved by
(19) 
where is a diagonal matrix with , and
is an identity matrix.
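The closed-form solution of a sample-weighted, ridge-regularized projection can be sketched as below. The sketch assumes the objective $\lambda\|\mathbf{W}\|_F^2 + \sum_i d_i \|\mathbf{W}^{\top}\mathbf{x}_i - \mathbf{y}_i\|_2^2$, whose minimizer is $\mathbf{W} = (\mathbf{X}\mathbf{D}\mathbf{X}^{\top} + \lambda\mathbf{I})^{-1}\mathbf{X}\mathbf{D}\mathbf{Y}$; the inverse-frequency weights in `balanced_weights` are an illustrative choice for the confident samples only, and both function names are hypothetical.

```python
import numpy as np

def balanced_weights(labels):
    """Inverse-frequency weights: the rarer class gets the larger weight.

    labels: 1 for positive samples, 0 for negative ones (both must occur).
    """
    labels = np.asarray(labels)
    n_pos = int(labels.sum())
    n_neg = labels.size - n_pos
    return np.where(labels == 1, n_neg / labels.size, n_pos / labels.size)

def learn_projection(X, Y, w, lam):
    """Weighted ridge regression in closed form.

    X: (D, n) samples as columns, Y: (n, c) label matrix,
    w: (n,) non-negative sample weights, lam: ridge parameter.
    """
    XD = X * w                                   # scales column i of X by w[i]
    return np.linalg.solve(XD @ X.T + lam * np.eye(X.shape[0]), XD @ Y)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    labels = np.array([1, 1, 0, 0, 0, 0, 0, 0])
    X = rng.normal(size=(5, labels.size))
    Y = np.stack([labels, 1 - labels], axis=1).astype(float)  # [1, 0] / [0, 1]
    W = learn_projection(X, Y, balanced_weights(labels), lam=0.1)
    print((X.T @ W)[:, 0])                       # first column: saliency scores
```

Tough samples would be appended as extra columns of `X` with their own label rows and weights; solving a linear system instead of forming the inverse explicitly is a numerical convenience, not a change to the solution.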
3.4 Complexity analysis
Here we briefly discuss the computational complexity of optimization in Section 3.2 and Section 3.3 respectively, and we have , .
We set the th iteration for coarse saliency generation as an example. The time consumption mainly involves three kinds of operations, i.e., SVD, matrix inversion and matrix multiplication. Specifically, update for and is addressed by SVD, with the complexity of and , respectively. While major operations in updating include matrix inversion and matrix multiplication, with complexity of . Considering , the final computational complexity is . Compared with this, the optimization for the treestructured sparsity in peng2017salient requires no extra computational complexity. However, multiscale segmentation in constructing the index tree introduces computational cost thus slows down the speed, as listed in Table 3.
For saliency refinement, the solution in Eq. (19) involves matrix inversion and matrix multiplication, with the complexity of and , respectively. Considering , the final computational complexity is .
4 Experiments
In this section, extensive experiments are conducted to demonstrate the effectiveness and superiority of our method. We first introduce the quantitative metrics and the implementation details of our method in Section 4.1. Then in Section 4.2, we compare our method (including our baseline model) with other LRMR based methods to emphasize the effectiveness and advantage of our coarse-to-fine architecture. In Section 4.3, we present a systematic comparison with state-of-the-art methods to show the superiority of our method. Finally in Section 4.4, we analyze the effects of different parameters in our method.

Three benchmark datasets are selected: MSRA10K cheng2015global contains 10,000 images with a single object per image, iCoSeg batra2011interactively contains 643 images with multiple objects per image, and ECSSD perazzi2012saliency contains 1,000 images with cluttered backgrounds. We also select twelve state-of-the-art methods for comparison. Among them, three methods are LRMR based, i.e., SMD peng2017salient , SLR zou2013segmentation and ULR shen2012unified . Moreover, we select five state-of-the-art methods that use contrasts or incorporate priors, i.e., RBD zhu2014saliency , PCA margolin2013makes , HS yan2013hierarchical , HCT kim2016salient and DSR li2013saliency . The four remaining methods are MR yang2013saliency , SS hou2012image , FT achanta2009frequency , and DRFI wang2017salient . All the experiments in this paper were conducted with MATLAB 2016b on an Intel i5-6500 3.2GHz Dual Core PC with 16GB RAM.
4.1 Experimental Setup
We follow the same experimental setup as in SMD peng2017salient to compare the performance of different methods. The quantitative metrics include the precision-recall (PR) curve, the receiver operating characteristic (ROC) curve, the area under the ROC curve (AUC), the weighted F-measure (WF), the overlapping ratio (OR) and the mean absolute error (MAE). Supposing saliency values are normalized to the range of
, the generated saliency map can be binarized with a given threshold, i.e., salient or non-salient. The PR curve is obtained by setting a series of discrete thresholds ranging from
to on the grayscale saliency map. The ROC curve is obtained in a similar way; the only difference is that ROC measures hit rate (recall) and false alarm rate. WF is proposed in margolin2014evaluate to achieve a tradeoff between precision and recall
, with in previous work wang2017salient ; zhu2014saliency . OR measures the overlap between the predicted (binarized) saliency map (S) and the ground-truth saliency map (G), $\mathrm{OR} = |S \cap G| / |S \cup G|$. MAE gives the numerical difference between the continuous saliency map and the ground-truth saliency map.

For our method, we adopt the simple linear iterative clustering (SLIC) algorithm achanta2012slic () for over-segmentation and extract the widely used dimensional features (i.e., color, responses of steerable pyramid filters, responses of Gabor filters) as conducted in previous approaches zou2013segmentation ; peng2017salient ; shen2012unified . Initialization for variables and parameters in the coarse module are set as . Regularization parameters for coarse saliency generation are set as optimal ones, i.e., throughout the experiments except for the parametric analysis. For the refinement module, we set and the corresponding parametric sensitivity is provided in Section 4.4. As for homogenization, we consider location, contrast and background priors as done in peng2017salient .
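The threshold-based metrics described in this subsection can be sketched in a few lines. This is a minimal illustration, assuming saliency maps and ground truth are arrays scaled to [0, 1]; the function names, the binarization threshold passed to `overlap_ratio`, and the convention of returning precision 1.0 when nothing is predicted salient are choices made for this example only.

```python
import numpy as np

def mean_absolute_error(sal, gt):
    """MAE between a continuous saliency map and the ground truth."""
    return float(np.abs(np.asarray(sal, float) - np.asarray(gt, float)).mean())

def precision_recall(sal, gt, thr):
    """Precision and recall of the saliency map binarized at threshold thr."""
    pred = np.asarray(sal) >= thr
    mask = np.asarray(gt) > 0.5
    tp = np.logical_and(pred, mask).sum()
    precision = tp / pred.sum() if pred.sum() > 0 else 1.0   # convention (assumption)
    recall = tp / mask.sum() if mask.sum() > 0 else 0.0
    return float(precision), float(recall)

def pr_curve(sal, gt, thresholds):
    """List of (precision, recall) pairs, one per threshold."""
    return [precision_recall(sal, gt, t) for t in thresholds]

def overlap_ratio(sal, gt, thr):
    """Intersection over union between the binarized map and the ground truth."""
    pred = np.asarray(sal) >= thr
    mask = np.asarray(gt) > 0.5
    union = np.logical_or(pred, mask).sum()
    return float(np.logical_and(pred, mask).sum() / union) if union > 0 else 1.0

if __name__ == "__main__":
    sal = np.array([[0.9, 0.2], [0.6, 0.1]])
    gt = np.array([[1, 0], [0, 0]])
    print(mean_absolute_error(sal, gt), overlap_ratio(sal, gt, 0.5))
    print(pr_curve(sal, gt, [0.0, 0.5, 0.95]))
```

Sweeping `thresholds` over the full saliency range traces the PR curve; replacing precision with the false alarm rate in the same loop would give the ROC curve.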
Table 1: Results of LRMR based baselines (top four rows) and of the same models followed by the refinement module (bottom four rows). Ours (C) denotes our coarse baseline model.
Method  MSRA10K: WF  OR  AUC  MAE  iCoSeg: WF  OR  AUC  MAE  ECSSD: WF  OR  AUC  MAE
ULR shen2012unified  0.425  0.524  0.831  0.224  0.379  0.443  0.814  0.222  0.351  0.369  0.788  0.274
SLR zou2013segmentation  0.601  0.691  0.840  0.141  0.473  0.505  0.805  0.179  0.402  0.486  0.805  0.226
SMD peng2017salient  0.704  0.741  0.847  0.104  0.611  0.598  0.822  0.138  0.544  0.563  0.813  0.174
Ours (C)  0.688  0.734  0.844  0.108  0.614  0.599  0.823  0.137  0.535  0.557  0.810  0.175
ULR + refinement  0.532  0.597  0.846  0.195  0.439  0.459  0.814  0.219  0.421  0.418  0.801  0.262
SLR + refinement  0.681  0.726  0.847  0.122  0.602  0.587  0.816  0.161  0.519  0.542  0.814  0.199
SMD + refinement  0.706  0.753  0.854  0.103  0.630  0.618  0.838  0.132  0.546  0.571  0.820  0.175
Ours  0.705  0.751  0.854  0.104  0.634  0.624  0.838  0.131  0.545  0.571  0.820  0.176
[Figure 4, column labels: RGB, ULR shen2012unified , SLR zou2013segmentation , SMD peng2017salient , Ours (C), ULR, SLR, SMD, Ours, GT.]
4.2 Comparison with LRMR-based methods
4.2.1 The effectiveness of our baseline model
To evaluate the performance of our baseline model, i.e., the low-rank decomposition model with Laplacian constraint in Eq. (4), a thorough comparison with other LRMR based methods including ULR shen2012unified , SLR zou2013segmentation and SMD peng2017salient is provided in Table 1 and Fig. 4. From the qualitative comparison in Fig. 4, we can see that methods such as ULR and SLR fail to generate uniform detection results. By contrast, salient objects detected by SMD peng2017salient and our baseline model are much smoother. These results further validate our argument that the Laplacian regularization acts more like a smoothing term, rather than increasing the discrimination around object boundaries as claimed in peng2017salient . From the quantitative comparison in Table 1, we can see that our baseline model and SMD peng2017salient outperform ULR shen2012unified and SLR zou2013segmentation by a large margin. It is worth noting that our baseline model is only slightly outperformed by SMD peng2017salient on the MSRA10K and ECSSD datasets, while on the iCoSeg dataset it achieves even better results than SMD peng2017salient in terms of all four metrics. Two key conclusions can be drawn from the experimental results. First, the basic $\ell_1$-norm sparsity constraint performs almost equally to the structured-sparse regularization, which indicates that the latter can hardly preserve the spatial relationship among elements within the sparse component. Second, the tree-structured sparsity constraint is not suitable in the scenario of multiple objects.
4.2.2 The advantage of our coarse-to-fine framework
It can be observed in Fig. 4 that salient objects detected by these LRMR based approaches are incomplete, and even contain irrelevant background regions. This is because the basic LRMR model ignores the spatial relationship of object parts. Though SMD peng2017salient attempts to handle this issue by replacing the original $\ell_1$-norm sparsity constraint with a structured-sparse constraint, it can hardly achieve the goal, as discussed above. Instead, we address the issue by cascading a learned projection to produce finer saliency maps. We can see that our method generates more complete saliency results compared with our baseline model, e.g., the persons in the second image and the dog in the third image. Besides, the refinement module also helps eliminate irrelevant background, e.g., the blue water in the first image. With the quantitative comparison listed in Table 1, we can see an obvious boost in the performance of our model on all three benchmark datasets, compared with that of our baseline model.
To further verify the general effectiveness of our coarse-to-fine architecture, we conduct more experiments with different LRMR baseline models, i.e., ULR shen2012unified , SLR zou2013segmentation and SMD peng2017salient . Test results are also summarized in Table 1. Compared with the original baselines, models with refinement show an improvement on all three datasets. The best performance is achieved by our method and also by the SMD peng2017salient model with refinement. Similar visual improvement as discussed above can be observed in Fig. 4. It is especially obvious for the ULR shen2012unified baseline, where clearer and more complete saliency maps are generated after refinement.
4.3 Comparison with State-of-the-Arts
To evaluate the superiority of our coarse-to-fine model, we systematically compare it with the other twelve state-of-the-art methods. PR curves on the three datasets are shown in Fig. 5, ROC curves are shown in Fig. 6, and results of the four metrics mentioned above are listed in Table 2. Besides, qualitative comparisons are provided in Fig. 9. From the results we can see that, in most cases, our model ranks first or second on the three datasets under different criteria. It is worth noting that we report the result of DRFI wang2017salient as a reference, since it belongs to the top-down methods with supervised training.
(a) MSRA10K  Metric  Ours  SMD peng2017salient  DRFI wang2017salient  RBD zhu2014saliency  HCT kim2016salient  DSR li2013saliency  PCA margolin2013makes  MR yang2013saliency  SLR zou2013segmentation  SS hou2012image  ULR shen2012unified  HS yan2013hierarchical  FT achanta2009frequency

WF  0.705  0.704  0.666  0.685  0.582  0.656  0.473  0.642  0.601  0.137  0.425  0.604  0.277  
OR  0.751  0.741  0.723  0.716  0.674  0.654  0.576  0.693  0.691  0.148  0.524  0.656  0.379  
AUC  0.854  0.847  0.857  0.834  0.847  0.825  0.839  0.601  0.840  0.801  0.831  0.833  0.690  
MAE  0.104  0.104  0.114  0.108  0.143  0.121  0.185  0.125  0.141  0.255  0.224  0.149  0.231  
(b)  Metric  Ours  SMDpeng2017salient  DRFIwang2017salient  RBDzhu2014saliency  HCTkim2016salient  DSRli2013saliency  PCAmargolin2013makes  MRyang2013saliency  SLRzou2013segmentation  SShou2012image  ULRshen2012unified  HSyan2013hierarchical  FTachanta2009frequency 
WF  0.634  0.611  0.592  0.599  0.464  0.548  0.407  0.554  0.473  0.126  0.379  0.563  0.289  
OR  0.624  0.598  0.582  0.588  0.519  0.514  0.427  0.573  0.505  0.164  0.443  0.537  0.387  
AUC  0.838  0.822  0.839  0.827  0.833  0.801  0.798  0.795  0.805  0.630  0.814  0.812  0.717  
MAE  0.131  0.138  0.139  0.138  0.179  0.153  0.201  0.162  0.179  0.253  0.222  0.176  0.223  
(c)  Metric  Ours  SMDpeng2017salient  DRFIwang2017salient  RBDzhu2014saliency  HCTkim2016salient  DSRli2013saliency  PCAmargolin2013makes  MRyang2013saliency  SLRzou2013segmentation  SShou2012image  ULRshen2012unified  HSyan2013hierarchical  FTachanta2009frequency 
WF  0.545  0.544  0.547  0.513  0.446  0.514  0.364  0.496  0.402  0.128  0.351  0.454  0.195  
OR  0.571  0.563  0.568  0.526  0.486  0.514  0.395  0.523  0.486  0.103  0.369  0.458  0.216  
AUC  0.820  0.813  0.817  0.781  0.785  0.785  0.791  0.793  0.805  0.567  0.788  0.801  0.607  
MAE  0.176  0.174  0.160  0.171  0.198  0.171  0.247  0.186  0.226  0.278  0.274  0.227  0.270 
Fig. 5: PR curves on (a) MSRA10K, (b) iCoSeg, and (c) ECSSD.
Fig. 6: ROC curves on (a) MSRA10K, (b) iCoSeg, and (c) ECSSD.
4.3.1 Results on single-object images
The MSRA10K dataset contains images with diverse objects of varying size, with only one object in each image. From Fig. 5 (a), Fig. 6 (a) and Table 2 (a), we can see that our method achieves the best result, with the highest weighted F-measure and overlapping ratio and the lowest mean absolute error, while DRFI wang2017salient obtains the highest AUC score. It is worth noting that our method outperforms DRFI wang2017salient on these three metrics with only simple features and no supervision. Frequency-based methods like FT achanta2009frequency perform badly, as it is difficult to choose a proper scale to suppress the background without knowing the object size. Although SS hou2012image considers sparsity directly in the standard spatial space and the DCT space, it gives only a rough detection of objects. In the PR curves, our method shows a clear advantage over the other approaches, while in the ROC curves, DRFI wang2017salient and our method are the best two among the competing methods.
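For readers less familiar with the region-level metrics referred to above, the following sketch shows one common way to compute the mean absolute error (MAE) and an overlapping ratio (OR) for a single image. It assumes saliency values in [0, 1], a binary ground-truth mask, OR defined as intersection over union of the thresholded map and the mask, and a caller-supplied binarization threshold; the paper's exact definitions and threshold are not given in this section, so these choices are illustrative only.

```python
from typing import Sequence


def mean_absolute_error(saliency: Sequence[float], mask: Sequence[int]) -> float:
    """Average per-pixel absolute difference between the map and the 0/1 mask."""
    return sum(abs(s - g) for s, g in zip(saliency, mask)) / len(saliency)


def overlapping_ratio(
    saliency: Sequence[float], mask: Sequence[int], threshold: float
) -> float:
    """Intersection over union of the thresholded map and the 0/1 mask."""
    # Assumption: a pixel counts as detected when its saliency reaches the threshold.
    predicted = [s >= threshold for s in saliency]
    inter = sum(1 for p, g in zip(predicted, mask) if p and g)
    union = sum(1 for p, g in zip(predicted, mask) if p or g)
    # Two empty regions are treated as a perfect match.
    return inter / union if union else 1.0
```

A lower MAE and a higher OR both indicate a map closer to the ground truth, which is why the two are read in opposite directions in Table 2.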
4.3.2 Results on multiple-object images
The iCoSeg dataset contains images with multiple objects, either separate or adjacent. From Fig. 5 (b), Fig. 6 (b) and Table 2 (b), we can see that our method again achieves the highest weighted F-measure and overlapping ratio and the lowest mean absolute error, which shows that it is effective in the case of multiple objects. However, the performance of PCA margolin2013makes, SLR zou2013segmentation, DSR li2013saliency, and ULR shen2012unified decreases heavily. As PCA margolin2013makes considers the dissimilarity between image patches and SLR zou2013segmentation introduces a segmentation prior, they are more sensitive to the number of objects within a scene. As for DSR li2013saliency, its precision drops dramatically as recall increases because of its dependence on background templates: in the multiple-object scenario, salient objects are more likely to overlap with image boundary regions. ULR shen2012unified trains a feature transformation on the MSRA dataset and hence performs poorly when detecting multiple objects. In the PR curves, our method is more stable as recall increases, while in the ROC curves, our method and DRFI wang2017salient achieve the best performance with almost the same AUC score, outperforming the remaining approaches.
4.3.3 Results on complex-scene images
The ECSSD dataset contains images with complicated backgrounds and objects of varying size. From Fig. 5 (c), Fig. 6 (c) and Table 2 (c), we can see that our method achieves the highest overlapping ratio and AUC score, and is outperformed by DRFI wang2017salient in terms of weighted F-measure and mean absolute error. In the PR curves, our method performs similarly to SMD peng2017salient, while in the ROC curves, DRFI wang2017salient and our method are the best two among the state-of-the-art methods. The results demonstrate that our method is competitive in complex scenes. Approaches such as HS yan2013hierarchical, HCT kim2016salient, MR yang2013saliency, and RBD zhu2014saliency, which depend on cues like contrast bias and center bias, fail to maintain good performance.
4.3.4 Visual comparison
To give an intuitive impression of the performance, we provide a visual comparison of detection results on images selected from the three benchmark datasets, which are diverse in object size, background complexity, and number of objects, as shown in Fig. 9. Our method works well in most cases and is capable of providing a relatively complete detection. As analyzed above, the frequency-tuned method FT achanta2009frequency tends either to filter out part of the object or to preserve part of the background. Basic low-rank matrix recovery methods like SLR zou2013segmentation and ULR shen2012unified are not robust enough to the background and fail to provide a uniform saliency map. Approaches depending on prior cues, such as HS yan2013hierarchical, HCT kim2016salient, MR yang2013saliency, and RBD zhu2014saliency, are more likely to miss object parts that are adjacent to the image boundary. Finally, the time consumption of all methods is provided in Table 3, which demonstrates the efficiency of our method.
Table 3: Time consumption of all methods.
Methods  Ours  SMD  DRFI  RBD  HCT  DSR  PCA  MR  SLR  SS  ULR  HS  FT
Time (s)  0.83  1.59  9.06  0.20  4.12  10.2  4.43  1.84  22.80  0.05  15.62  0.53  0.07
Code  M+C  M+C  M+C  M+C  M  M+C  M+C  M+C  M+C  M  M+C  EXE  C
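Fig. 7: Sensitivity of the coarse module to its parameters on iCoSeg, panels (a), (b), and (c).
Fig. 8: (a) Sensitivity of the refining module to its regularization parameter; (b) PR curves and (c) ROC curves under different thresholding strategies.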
4.4 Analysis of Parameters
4.4.1 Parameters in coarse module
In our coarse module, the algorithm takes three parameters, i.e., the number of superpixels in over-segmentation and two regularization parameters. We examine the sensitivity of our model to changes in these parameters on the iCoSeg dataset as an example. The analysis is conducted by tuning one parameter while fixing the other two. The performance changes in terms of WF, OR, AUC, and MAE are shown in Fig. 7. For the number of superpixels, we observe that similar results are achieved across the tested values, and the setting we use is a good trade-off between efficiency and performance, as a larger number of superpixels requires more expensive computation. Besides, we observe that when is fixed (), the WF, OR and MAE performance decreases while the AUC performance initially increases, peaks within a range of from to , and then decreases. Thus, we choose the optimal . When is fixed (), the WF and OR performance initially increases and peaks within a range of from to . The AUC performance initially remains stable and then decreases, and the MAE performance initially remains stable, increases within a range of from to , and then decreases. Thus, we choose the optimal .
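The "tune one parameter while fixing the others" protocol described above can be sketched as follows. The helper `one_at_a_time` and the toy scoring function are hypothetical and introduced only for illustration; in the actual study, `evaluate` would run the saliency model on the dataset and return a metric such as WF or MAE.

```python
from typing import Callable, Dict, List, Sequence


def one_at_a_time(
    evaluate: Callable[..., float],
    defaults: Dict[str, float],
    grid: Dict[str, Sequence[float]],
) -> Dict[str, List[float]]:
    """Vary each parameter over its grid while the others stay at their defaults."""
    results: Dict[str, List[float]] = {}
    for name, values in grid.items():
        scores = []
        for value in values:
            params = dict(defaults)  # copy so the defaults are never mutated
            params[name] = value     # change only the parameter under study
            scores.append(evaluate(**params))
        results[name] = scores
    return results


if __name__ == "__main__":
    # Toy stand-in for a real evaluation run (hypothetical parameter names).
    def toy_score(n_superpixels: float, reg_a: float, reg_b: float) -> float:
        return n_superpixels + 10 * reg_a + 100 * reg_b

    curves = one_at_a_time(
        toy_score,
        defaults={"n_superpixels": 1, "reg_a": 2, "reg_b": 3},
        grid={"n_superpixels": [0, 1, 2], "reg_a": [0, 1]},
    )
    print(curves)
```

Each list in the returned dictionary corresponds to one sensitivity curve. This design does not capture interactions between parameters, which a full grid search would, at a much higher computational cost.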
4.4.2 Parameters in refining module
In our refining module, the main parameter is the regularization parameter . The sensitivity in terms of WF, OR, AUC, and MAE is shown in Fig. 8 (a). We observe that the WF and OR performance initially increases, peaks within a range of from to , and then decreases. The AUC performance initially increases, peaks within a range of from to , and then decreases. The MAE performance initially increases, peaks at , and then remains stable. The results illustrate that, when this parameter is small, the model performs worse owing to a lack of label information from the samples (including both confident and tough ones). When it is large, the performance drops noticeably, which may be caused by overfitting the confident samples. Therefore, we choose in our method.
Moreover, we examine the sensitivity of our model to different thresholding strategies in our refining module. We fix the lower threshold, i.e., we set as the average value of the coarse saliency, and test varying . PR curves and ROC curves of and are shown in Fig. 8 (b) and (c). We observe that our method performs similarly under the three strategies, which demonstrates its robustness.
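To make the role of the two thresholds concrete, here is a minimal sketch of one plausible reading of the strategy: the lower threshold is the mean of the coarse saliency values, an upper threshold is supplied by the caller, and regions are split into confident background, confident foreground, and "tough" samples in between. The three-way split, the label encoding, and the function name are assumptions made for this example; the section above does not spell out how the thresholds are applied.

```python
from typing import List, Sequence


def label_samples(coarse: Sequence[float], upper: float) -> List[int]:
    """Label each region as 1 (confident foreground), -1 (confident background),
    or 0 (tough sample lying between the two thresholds)."""
    # Lower threshold fixed to the average coarse saliency, as stated in the text.
    lower = sum(coarse) / len(coarse)
    labels = []
    for s in coarse:
        if s >= upper:
            labels.append(1)    # assumed: at or above the upper threshold
        elif s < lower:
            labels.append(-1)   # assumed: strictly below the lower threshold
        else:
            labels.append(0)    # assumed: everything in between is "tough"
    return labels
```

Raising `upper` shrinks the confident-foreground set and enlarges the tough set, which is the kind of variation a threshold sensitivity study would probe.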
5 Conclusion
In this paper, we present a coarse-to-fine saliency detection architecture that first estimates a coarse saliency map using a novel LRMR model and then refines it using a learning scheme. Compared with state-of-the-art approaches, our method efficiently detects salient objects with enhanced object boundaries, even in the scenario of multiple objects. We also show that our refinement scheme can be easily applied to previous LRMR-based methods to significantly improve their detection accuracy.
Fig. 9: Visual comparison of saliency maps. Columns, from left to right: Image, FT, ULR, SS, HS, SLR, MR, PCA, DSR, HCT, RBD, DRFI, SMD, Ours, GT.
Acknowledgment
This work was supported in part by the Key Program for International S&T Cooperation Projects of China (No. 2016YFE0121200) and in part by the National Natural Science Foundation of China (Nos. 61571205 and 61772220).
References

 (1) A. Borji, L. Itti, State-of-the-art in visual attention modeling, IEEE transactions on pattern analysis and machine intelligence 35 (1) (2013) 185–207.
 (2) A. Borji, M.M. Cheng, H. Jiang, J. Li, Salient object detection: A benchmark, IEEE transactions on image processing 24 (12) (2015) 5706–5722.
 (3) J. Wang, H. Jiang, Z. Yuan, M.M. Cheng, X. Hu, N. Zheng, Salient object detection: A discriminative regional feature integration approach, International journal of computer vision 123 (2) (2017) 251–268.
 (4) M. Brown, D. Windridge, J.Y. Guillemaut, A generalisable framework for saliencybased line segment detection, Pattern Recognition 48 (12) (2015) 3993–4011.
 (5) Y.C. Chen, V. M. Patel, R. Chellappa, P. J. Phillips, Salient views and viewdependent dictionaries for object recognition, Pattern Recognition 48 (10) (2015) 3053–3066.

 (6) S. Jetley, N. Murray, E. Vig, End-to-end saliency mapping via probability distribution prediction, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
 (7) P. Zhang, D. Wang, H. Lu, H. Wang, B. Yin, Learning uncertain convolutional features for accurate saliency detection, in: Computer Vision (ICCV), 2017 IEEE International Conference on, IEEE, 2017, pp. 212–221.
 (8) T. Wang, L. Zhang, S. Wang, H. Lu, G. Yang, X. Ruan, A. Borji, Detect globally, refine locally: A novel approach to saliency detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 3127–3135.

 (9) E. J. Candès, X. Li, Y. Ma, J. Wright, Robust principal component analysis?, Journal of the ACM (JACM) 58 (3) (2011) 11.
 (10) J. Yan, M. Zhu, H. Liu, Y. Liu, Visual saliency detection via sparsity pursuit, IEEE Signal Processing Letters 17 (8) (2010) 739–742.
 (11) C. Lang, G. Liu, J. Yu, S. Yan, Saliency detection by multitask sparsity pursuit, IEEE transactions on image processing 21 (3) (2012) 1327–1338.
 (12) W. Zou, K. Kpalma, Z. Liu, J. Ronsin, Segmentation driven lowrank matrix recovery for saliency detection, in: BMVC, 2013.
 (13) H. Peng, B. Li, H. Ling, W. Hu, W. Xiong, S. J. Maybank, Salient object detection via structured matrix decomposition, IEEE transactions on pattern analysis and machine intelligence 39 (4) (2017) 818–832.
 (14) L. Itti, C. Koch, E. Niebur, A model of saliencybased visual attention for rapid scene analysis, IEEE transactions on pattern analysis and machine intelligence 20 (11) (1998) 1254–1259.
 (15) S. Goferman, L. ZelnikManor, A. Tal, Contextaware saliency detection, IEEE transactions on pattern analysis and machine intelligence 34 (10) (2012) 1915–1926.
 (16) H. Jiang, J. Wang, Z. Yuan, T. Liu, N. Zheng, S. Li, Automatic salient object segmentation based on context and shape prior., in: BMVC, Vol. 6, 2011.
 (17) M.M. Cheng, N. J. Mitra, X. Huang, P. H. Torr, S.M. Hu, Global contrast based salient region detection, IEEE transactions on pattern analysis and machine intelligence 37 (3) (2015) 569–582.
 (18) F. Perazzi, P. Krähenbühl, Y. Pritch, A. Hornung, Saliency filters: Contrast based filtering for salient region detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2012.
 (19) R. Margolin, A. Tal, L. ZelnikManor, What makes a patch distinct?, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013.
 (20) A. Borji, L. Itti, Exploiting local and global patch rarities for saliency detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2012.
 (21) S. Lu, J.H. Lim, Saliency modeling from image histograms, in: European Conference on Computer Vision, 2012.
 (22) S. Lu, C. Tan, J.H. Lim, Robust and efficient saliency modeling from image cooccurrence histograms, IEEE transactions on pattern analysis and machine intelligence 36 (1) (2014) 195–201.
 (23) J. Kim, D. Han, Y.W. Tai, J. Kim, Salient region detection via highdimensional color transform and local spatial support, IEEE transactions on image processing 25 (1) (2016) 9–23.
 (24) J. Li, L. Luo, F. Zhang, J. Yang, D. Rajan, Double low rank matrix recovery for saliency fusion, IEEE transactions on image processing 25 (9) (2016) 4421–4432.
 (25) X. Hou, L. Zhang, Saliency detection: A spectral residual approach, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2007.
 (26) Y. Fang, W. Lin, B.S. Lee, C.T. Lau, Z. Chen, C.W. Lin, Bottomup saliency detection model based on human visual sensitivity and amplitude spectrum, IEEE transactions on multimedia 14 (1) (2012) 187–198.
 (27) J. Li, M. D. Levine, X. An, X. Xu, H. He, Visual saliency based on scalespace analysis in the frequency domain, IEEE transactions on pattern analysis and machine intelligence 35 (4) (2013) 996–1010.
 (28) N. Imamoglu, W. Lin, Y. Fang, A saliency detection model using lowlevel features based on wavelet transform, IEEE transactions on multimedia 15 (1) (2013) 96–105.
 (29) C. Yang, L. Zhang, H. Lu, X. Ruan, M.H. Yang, Saliency detection via graphbased manifold ranking, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013.
 (30) Q. Wang, W. Zheng, R. Piramuthu, Grab: Visual saliency via novel graph model and background priors, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
 (31) W. Zhu, S. Liang, Y. Wei, J. Sun, Saliency optimization from robust background detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014.
 (32) X. Shen, Y. Wu, A unified approach to salient object detection via low rank matrix recovery, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2012.
 (33) Q. Zhang, Y. Liu, S. Zhu, J. Han, Salient object detection based on superpixel clustering and unified lowrank representation, Computer Vision and Image Understanding 161 (2017) 51–64.
 (34) J. Li, J. Ding, J. Yang, Visual salience learning via low rank matrix recovery, in: Asian Conference on Computer Vision, 2014.
 (35) J. Li, J. Yang, C. Gong, Q. Liu, Saliency fusion via sparse and double low rank decomposition, Pattern Recognition Letters.
 (36) R. Huang, W. Feng, J. Sun, Saliency and cosaliency detection by lowrank multiscale fusion, in: IEEE International Conference on Multimedia and Expo (ICME), 2015.
 (37) S. Lu, V. Mahadevan, N. Vasconcelos, Learning optimal seeds for diffusionbased salient object detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014.
 (38) T. Lindeberg, Feature detection with automatic scale selection, International journal of computer vision 30 (2) (1998) 79–116.
 (39) Z. Lin, R. Liu, Z. Su, Linearized alternating direction method with adaptive penalty for lowrank representation, in: Advances in neural information processing systems, 2011.
 (40) A. Sun, E.P. Lim, Y. Liu, On strategies for imbalanced text classification using svm: A comparative study, Decision Support Systems 48 (1) (2009) 191–201.
 (41) D. Batra, A. Kowdle, D. Parikh, J. Luo, T. Chen, Interactively cosegmentating topically related images with intelligent scribble guidance, International journal of computer vision 93 (3) (2011) 273–292.
 (42) Q. Yan, L. Xu, J. Shi, J. Jia, Hierarchical saliency detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013.
 (43) X. Li, H. Lu, L. Zhang, X. Ruan, M.H. Yang, Saliency detection via dense and sparse reconstruction, in: Proceedings of the IEEE International Conference on Computer Vision, 2013.
 (44) X. Hou, J. Harel, C. Koch, Image signature: Highlighting sparse salient regions, IEEE transactions on pattern analysis and machine intelligence 34 (1) (2012) 194–201.
 (45) R. Achanta, S. Hemami, F. Estrada, S. Susstrunk, Frequencytuned salient region detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2009.
 (46) R. Margolin, L. ZelnikManor, A. Tal, How to evaluate foreground maps?, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014.
 (47) R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, S. Süsstrunk, Slic superpixels compared to stateoftheart superpixel methods, IEEE transactions on pattern analysis and machine intelligence 34 (11) (2012) 2274–2282.