Co-salient Object Detection Based on Deep Saliency Networks and Seed Propagation over an Integrated Graph

06/29/2017 ∙ by Dong-ju Jeong, et al. ∙ Seoul National University

This paper presents a co-salient object detection method to find common salient regions in a set of images. We utilize deep saliency networks to transfer co-saliency prior knowledge and better capture high-level semantic information, and the resulting initial co-saliency maps are enhanced by seed propagation steps over an integrated graph. The deep saliency networks are trained in a supervised manner to avoid online weakly supervised learning, and we exploit them not only to extract high-level features but also to produce both intra- and inter-image saliency maps. Through a refinement step, the initial co-saliency maps can uniformly highlight co-salient regions and locate accurate object boundaries. To handle input image groups inconsistent in size, we propose to pool multi-regional descriptors including both within-segment and within-group information. In addition, the integrated multilayer graph is constructed to find the regions that the previous steps may not detect, by seed propagation with low-level descriptors. In this work, we utilize the complementary components of high- and low-level information together with several learning-based steps. Our experiments demonstrate that the proposed approach outperforms comparable co-saliency detection methods on widely used public databases and can also be directly applied to co-segmentation tasks.







I Introduction

The objective of saliency detection is to find the most informative and attention-drawing regions in an image [1], and it has been one of the most popular computer vision tasks for the past few decades [2]. There are two categories of saliency detection: salient object detection and eye fixation prediction. The former aims at identifying precise salient object regions with relative saliency values [1, 3, 4, 5], while the latter estimates eye gaze fixation points, resulting in saliency maps in the form of heat maps [6, 7, 8, 9]. Recently, co-saliency detection has emerged as an important subtopic of salient object detection; its goal is to find visually distinct regions and/or objects that commonly appear in a set of images. In other words, co-saliency detection finds common salient objects while suppressing salient objects/regions that appear only in part of the image group. Thus, it is necessary to consider visual coherency among the images in addition to the cues used in saliency detection, such as contrast [10, 11, 12] and/or boundary priors [12, 13, 1]. Co-saliency detection can be applied to other computer vision tasks, such as co-segmentation [2], video foreground detection [14], image retrieval [15], and weakly supervised localization [16]. It can also be utilized to enhance single-image saliency detection [17].

Many researchers have recently proposed to utilize convolutional neural networks (CNNs), which will be called deep saliency networks in this paper, to produce pixel- or segment-level saliency maps that better capture high-level semantic information and are robust to complex backgrounds [18, 4, 5, 19]. These methods also detect salient regions more uniformly and outperform conventional algorithms in terms of accuracy. Meanwhile, until recently, the majority of co-saliency detection methods used low-level handcrafted features such as color cues, because color information usually plays an important role in distinguishing between co-salient and non-co-salient regions [15, 20, 21, 22]. However, recent advances in deep learning have also contributed to the state-of-the-art methods for co-saliency detection [2, 23, 24], which exploit high-level CNN features to represent image patches/segments or encode low-level features with deep autoencoders. One of the most challenging issues in co-saliency detection is its dependency on input image groups: whether low- or high-level features are the more important factor differs from case to case, and the same goes for contrast and consistency [16]. To handle this, those learning-based methods perform weakly supervised learning given an image group, where similar images from external groups are also exploited to identify consistent background. On the other hand, graph-based processing has proved effective for the spatial refinement of each image [23, 19], but it has rarely been applied to a whole image group and its consistency factor.

Fig. 1: A flowchart of the proposed co-saliency detection method with the blocks showing its steps and produced items.

In order to tackle the issues mentioned above, we propose a supervised learning-based method that is complemented by graph-based manifold ranking with an integrated graph including all the intra-image nodes of input images. The intra-image saliency (IrIS) maps of the images are produced by a fully convolutional network, part of which generates high-level semantic features. These features are combined with low-level features to cope with the various cases and improve the performance of our system [25], and fed into fully-connected layers to obtain the inter-image saliency (IeIS) value of each segment. We choose to train these deep saliency networks in a supervised manner to avoid using any learning models trained given an input image group (with similar external images) and thereby reduce computation time. As a result, initial co-saliency maps are generated by combining the IrIS and IeIS maps. In addition, we propose to construct the integrated graph where auxiliary co-saliency values are obtained by propagating seeds extracted from the initial co-saliency maps. The segments of the input images are treated as the intra-image nodes, and inter-image nodes connect them to form the integrated graph with a sparse affinity matrix computed with color similarities. While the deep saliency networks are expected to detect precise (co-)salient regions, the graph-based method helps to find parts of co-salient objects showing color consistency and/or located on image boundaries. These two types of co-saliency values are combined to produce final co-saliency maps with simple spatial refinement. The unified framework is illustrated in Fig. 1.


The rest of this paper is organized as follows. The second section introduces related works on co-saliency detection, the third section describes co-saliency detection using the deep saliency networks, and the fourth section describes the seed propagation over the integrated graph. The experimental results and the conclusions are presented in the last two sections.

II Related Work

The co-salient object detection began with analyzing multi-image information and finding common objects within image pairs [26, 27, 28, 29]. For example, Li et al. [26] performed the pyramid decomposition of images and then extracted color and texture features from each region to compute the maximum SimRank scores of region pairs, which are defined as multi-image saliency values. To obtain the final co-saliency maps, they linearly combined the single- and multi-image saliency maps. Tan et al. [27] proposed to calculate the affinities of superpixel pairs with color and position similarities, and then perform bipartite graph matching to discover the most relevant pairs for affinity propagation. The resulting superpixel affinities between two images are converted into foreground cohesiveness and locality compactness measures to obtain the final co-saliency maps.

Because such pairwise methods lack scalability, other co-saliency detection methods have aimed at treating larger groups with more than two images. Fu et al. [15] proposed a two-layer cluster-based approach, where pixel-level intra- and inter-image clustering steps are performed to calculate the contrast, spatial, and corresponding cues of each cluster. They employed multiplication fusion of the cluster-wise cues and converted them into the final pixel-wise co-saliency values. The algorithm by Li et al. [30] generates intra-saliency maps with multi-scale segmentation and pixel-wise voting, and inter-saliency maps by matching image regions with a minimum spanning tree. It also linearly combines the intra- and inter-saliency maps into the final co-saliency maps. Liu et al. [20] proposed to perform hierarchical segmentation and compute intra-saliency, object prior, and global similarity values of the fine/coarse segments to obtain co-saliency values. In [21], Li et al. adopted their previous work to obtain single-image saliency maps, and used a two-stage manifold ranking method to estimate co-salient regions. They let each image of a group take turns to produce queries for the manifold ranking of all the images, and fused multiple co-saliency values by averaging or multiplication. In addition, Cao et al. [22] proposed a fusion-based algorithm, which adopts several existing (co-)saliency detection schemes and combines their results with self-adaptive weights produced by low-rank analysis.

The above methods utilize handcrafted features to represent pixels, segments, or clusters, and some of them focus only on color cues to cope with the situation where co-salient objects are quite consistent in color; as a result, they cannot capture abstract semantic information or effectively detect co-salient objects that consist of multiple components. Thus, learning-based methods using high-level features [2, 23, 24] have recently been proposed to tackle this problem. Zhang et al. [2] proposed to find several similar neighbors from external groups for negative image patches and analyze intra-image contrast, intra-group consistency, and inter-group separability measures. They combined them through a Bayesian framework to obtain patch-wise co-saliency values and then converted them into pixel-wise ones. In [23], a self-paced multi-instance learning method is used to update positive and negative training samples and their weights, and thereby train an SVM model for co-saliency estimation, where similar neighbors give the negative samples as in [2]. The approach proposed in [24] exploits stacked denoising autoencoders (SDAEs) for two objectives: intra-saliency prior transfer and deep inter-saliency mining. First, several SDAEs for intra-saliency detection are trained in supervised and unsupervised manners with single-image saliency detection data; second, another SDAE is trained in an unsupervised manner with input images to exploit its reconstruction errors as co-saliency cues. These three approaches also utilize CNN models or SDAEs to represent patches/superpixels with high-level features or convert low-level descriptors into higher-level ones. In addition, they perform their learning steps on the given input image groups because the criteria for differentiating between co-salient and non-co-salient objects depend on the target image group; however, this may result in high computational complexity at test time.

III Co-saliency detection using deep saliency networks

According to [16], the co-saliency detection methods in the literature explicitly or implicitly use the contrast cue and corresponding cue, which are also called intra- and inter-image saliency respectively. This is because co-salient regions are salient in each image and have correspondence in a whole image group, and thus a co-saliency detection algorithm should not ignore either one. To be accurate, the definition of the inter-image saliency in the conventional methods is similar to that of co-saliency but places more emphasis on the correspondence factor. As the bottom-up methods for co-saliency detection generally design the explicit intra- and inter-image saliency maps, we compute them with deep saliency networks trained in a supervised manner. Then, they are refined and combined to produce initial co-saliency maps for the next step.

III-A Intra-Image Saliency Detection

Given an image group, each image is independently represented by its superpixels, which are over-segmented regions obtained by the SLIC algorithm [31]. The goal of this step is to produce pixel-wise intra-image saliency (IrIS) values and convert them into segment-wise ones for each image. To this end, we use the multi-scale fully convolutional network [19], which produces pixel-level saliency maps by combining several stacked feature maps extracted with different sizes of receptive fields. We call this single-image saliency detection network the IrISnet in this paper. It is based on the original structure that utilizes the pre-trained VGG16 network [32] and is implemented following the DeepLab system [33].

Specifically, it replaces the fully-connected layers of the original VGG16 network with convolutional layers to design its fully convolutional structure, and four branches of convolutional layers are attached to its pooling layers to obtain the multi-scale feature maps. The main stream and four branches compute five multi-scale single-channel feature maps, which are input to the last convolutional layer and a sigmoid activation function to obtain an output saliency map with values between 0 and 1. This network also exploits the hole (à trous) algorithm [34] for two purposes: first, it helps to compute denser feature maps while maintaining the original sizes of receptive fields, and second, it adjusts the sizes of the multi-scale feature maps to be identical. As a result, we obtain the output maps and use bicubic interpolation to resize them to the original input image sizes so that we can estimate pixel-level saliency maps. Lastly, we take the median of the saliency values within each superpixel as its segment-wise IrIS value, where we use medians instead of means to reduce halos around salient objects, as shown in Fig. 2.
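The pixel-to-segment conversion described above can be illustrated with a minimal NumPy sketch. The function name `segment_medians` and the toy arrays are hypothetical; the superpixel label map is assumed to come from any SLIC implementation, with labels numbered from 0.

```python
import numpy as np

def segment_medians(saliency, labels):
    """Return one value per superpixel: the median saliency of its pixels."""
    n_segments = int(labels.max()) + 1
    values = np.zeros(n_segments)
    for s in range(n_segments):
        mask = labels == s
        if mask.any():  # skip label ids that do not occur in the map
            values[s] = np.median(saliency[mask])
    return values

# Toy example: a 2x4 "image" split into two superpixels.
saliency = np.array([[0.9, 0.8, 0.1, 0.0],
                     [1.0, 0.2, 0.1, 0.3]])
labels = np.array([[0, 0, 1, 1],
                   [0, 0, 1, 1]])
seg_values = segment_medians(saliency, labels)
seg_map = seg_values[labels]  # paint every pixel with its segment's median
```

In the toy example the single low value (0.2) inside the first superpixel pulls the mean down to 0.725 but leaves the median at 0.85, which is the kind of robustness to boundary pixels that motivates the median here.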

Fig. 2: Examples of the intra-image saliency maps. From left to right: an input image, its pixel-level IrIS map, and two segment-level IrIS maps. The third and fourth images show the mean and median of the pixel-wise saliency values within each segment, respectively.

III-B Inter-Image Saliency Detection

For both single-image salient object detection and co-salient object detection, many of the existing methods (over-)segment input images and produce segment-wise saliency values. In graph-based models [1, 3], superpixels instead of raw pixels are treated as nodes in a graph; because their number is limited, manifold ranking over the graph becomes tractable. Meanwhile, deep convolutional networks make it possible to produce pixel-level saliency maps, and some CNN-based methods operate efficiently in that manner [4, 5]. However, others operate entirely at the segment level or use segment-wise saliency values to complement pixel-wise ones. In [18, 25], each segment (with its relevant regions), irrespective of the number of segments, can be fed into deep neural networks with a fixed number of parameters, and low-level features can also be exploited as additional inputs. The method of [19] utilizes both pixel- and segment-wise saliency values, where the latter better represent saliency discontinuities along object boundaries.

To treat multiple images, we take advantage of the segment-wise processing mentioned above. For single-image saliency detection, a whole image can be fed into a CNN model. For co-saliency detection, however, the size of an image group is not fixed and the information of the whole image group has to be exploited, so it is appropriate to predict segment-wise co-saliency values. In addition, there are cases where color cues are more important than other cues such as high-level semantic information, and other low-level features (e.g., position) might also be helpful and need to be added to the higher-level features extracted from convolutional layers. Considering these aspects, we compute a descriptor for each segment that includes the information of the whole image group, and produce segment-level inter-image saliency (IeIS) maps.

The CNN features extracted from the conv5_3 layer within the IrISnet are used as higher-level features for each segment. To adapt the superpixels made on the image domain to the domain of the feature maps of the IrISnet, we use convolutional feature masking (CFM) [35], and then perform spatial pooling [36] to obtain fixed-length descriptors as in [19]. In addition to the higher-level features, low-level ones such as Lab color vectors, color histograms, and positions are computed to complement them, because which of the two types is more important depends on the input image group [16]. With these components, we propose to compute a multi-regional descriptor for each segment, each element of which is pooled within four different regions: i) the target segment, ii) its immediate neighborhood, iii) foreground regions in the image that it belongs to, and iv) foreground regions in the whole image group. Each image can initially have one or multiple foreground regions, which are set by thresholding its IrIS map and finding connected components. Then we form the power set of them, where the empty set is excluded, and all its elements are treated as the foreground regions of the image. The Lab color vectors and positions are normalized and averaged to represent each region. Also, each foreground region has the variance of the positions within the region. We $\ell_1$-normalize and then take the square root of the 256-bin Lab color histograms, as in RootSIFT [37] and VLAD [38], to moderately suppress the few color components that are bursty in the image group. Given the descriptors of all the foreground regions, we perform sum-pooling to obtain fixed-length image-level and group-level foreground descriptors. In particular, the sum-pooling of regional max-pooled CNN features has been shown to be effective in [39, 40], but the difference from R-MAC [40] is that we perform the spatial pooling over a fixed grid in each region. We compute the covariance matrices of the high- and low-level descriptors within the foreground regions, and their traces are included in the descriptors. Finally, each of the four regional descriptors is normalized, and they are concatenated to form the multi-regional descriptor.
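The square-root normalization of the color histograms can be sketched as follows, assuming the $\ell_1$ normalization that RootSIFT [37] uses. The function name `root_normalize` and the four-bin toy histogram are hypothetical; the paper's histograms have 256 bins.

```python
import numpy as np

def root_normalize(hist, eps=1e-12):
    """L1-normalize a histogram, then take the element-wise square root."""
    hist = np.asarray(hist, dtype=float)
    l1 = hist / (hist.sum() + eps)  # bins now sum to 1
    return np.sqrt(l1)              # result has (approximately) unit L2 norm

# Toy 4-bin histogram dominated by one "bursty" color bin.
hist = np.array([90.0, 5.0, 5.0, 0.0])
desc = root_normalize(hist)
```

The ratio between the dominant bin and a minor bin drops from 18 in the raw counts to about 4.2 after the square root, which is the moderating effect on bursty components that the text refers to.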

The segment descriptors of each image are fed into three fully-connected layers, which output the IeIS values through a two-way softmax. We call this network model the IeISnet. To train the IeISnet, the ground-truth co-saliency maps of the training datasets are converted into labels by thresholding the averaged label in each superpixel at 0.5, and the cross-entropy loss is used with pointwise weights as below:


In this loss, the two last activation values are compared with the labels of non-co-salient and co-salient regions, given the ground-truth label of each training sample, and a class weight balances the numbers of the two labels in the training sets. The pointwise weights are designed to place more emphasis on the regions that have high IrIS and low IeIS values, and vice versa. As shown in Fig. 3, the former case shows what are called “single saliency residuals” in [41], and the latter one represents the situations where some regions may not seem salient in their image but are certainly co-salient in their image group.
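Since the loss equation itself is not reproduced in this copy, the following is only a generic sketch of a two-way softmax cross-entropy with a class-balancing weight and per-sample weights. The function name, the way the class weight is split between the two labels, and the example weights are all assumptions for illustration and are not the authors' exact formulation.

```python
import numpy as np

def weighted_cross_entropy(logits, labels, class_weight, point_weights):
    """logits: (n, 2) last activations; labels: (n,) integers in {0, 1}.

    class_weight scales co-salient samples and (1 - class_weight) scales the
    others (an assumed balancing scheme); point_weights are per-sample weights.
    """
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    picked = log_prob[np.arange(len(labels)), labels]     # log p(true label)
    balance = np.where(labels == 1, class_weight, 1.0 - class_weight)
    return -np.mean(point_weights * balance * picked)

logits = np.array([[2.0, -1.0], [0.5, 1.5], [1.0, 1.0]])
labels = np.array([0, 1, 1])
point_weights = np.array([1.0, 1.0, 2.0])  # hypothetical emphasis on sample 3
loss = weighted_cross_entropy(logits, labels, 0.6, point_weights)

# Sanity case: uninformative logits, balanced weights.
uniform = weighted_cross_entropy(np.zeros((2, 2)), np.array([0, 1]),
                                 0.5, np.ones(2))
```

With uninformative logits and balanced weights, the value reduces to half of ln 2, a convenient check that the softmax and weighting are wired correctly.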

Fig. 3: Examples of two different cases where IrIS and IeIS maps differ from each other. From left to right: image groups, input images from these groups, and their IrIS, IeIS, and initial co-saliency maps. The first row shows the regions that are salient in their image and not co-salient in the group, while the second row represents an opposite case.

Because the IeIS value of each segment is independently estimated through the above process, we refine each IeIS map so that neighboring regions have smooth IeIS values. For this, we perform seed propagation over a simple graph model. The segments are treated as intra-image nodes in each image, and an edge connects every two neighboring nodes that share a common segment boundary, with a weight that represents the affinity between them and is calculated using color similarity [42]. Even though the recently proposed saliency detection methods using graph-based manifold ranking [1, 3] utilize more sophisticated graphs for difficult cases, such as where parts of salient objects are located on image boundaries, we tackle this problem in Section IV and use the simple graph model for this step. Using the averaged Lab color vector of each segment in an image, the weight is computed as:


where is the set of all the edges in the image and is the size of . Then the affinity matrix for the intra-image graph of is constructed whose -th element is the weight between and :


where is an index set of neighbors of the -th node.
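The construction of such an intra-image affinity matrix can be sketched as below. The weight formula is not reproduced in this copy, so the Gaussian kernel on Lab color distance and the normalization by the mean squared distance over all edges are assumptions; `intra_image_affinity` and the toy data are hypothetical.

```python
import numpy as np

def intra_image_affinity(colors, edges):
    """colors: (n, 3) mean Lab vectors; edges: list of (i, j) neighbor pairs."""
    n = len(colors)
    d2 = np.array([np.sum((colors[i] - colors[j]) ** 2) for i, j in edges])
    scale = d2.mean() + 1e-12  # assumed normalizer: mean squared edge distance
    W = np.zeros((n, n))
    for (i, j), d in zip(edges, d2):
        W[i, j] = W[j, i] = np.exp(-d / scale)  # symmetric, neighbors only
    return W

# Four segments in a chain: 0-1 and 2-3 are similar in color, 1-2 are not.
colors = np.array([[50.0, 0.0, 0.0],
                   [52.0, 1.0, 0.0],
                   [80.0, 20.0, 10.0],
                   [81.0, 21.0, 10.0]])
edges = [(0, 1), (1, 2), (2, 3)]
W = intra_image_affinity(colors, edges)
```

Only segment pairs listed in `edges` receive a nonzero weight, so the matrix stays as sparse as the segment adjacency.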

To propagate seeds over these graphs, we need to extract foreground and background seeds. We set the segments whose initial IeIS values are larger than 0.5 and, at the same time, in the top 10 percent as the foreground seeds; they are assigned 1s and all the others 0s in the foreground seed vector. The segments on image boundaries are simply selected as the background seeds, and we obtain the background seed vector likewise. To obtain the refined IeIS maps, the graph-based learning method is adopted for effective propagation [43, 1]. Given the weight matrix and its degree matrix, the newly ranked values for either type of seeds are obtained by solving the following problem:


where is the controlling parameter that balances the smoothness constraint and the fitting constraint. The solution of (4) is given by:


where and, for implementation, the diagonal elements of are set to for each query to obtain the propagated values ranked by the other ones except the query itself. Using both the foreground and background seeds, and are obtained where each node receives newly propagated values from the seeds with the learned affinity matrix. The final refined IeIS values are computed as:


where is the element-wise division of two vectors and is a controlling parameter. The numerator represents the IeIS while the denominator maintains the balance among the nodes, and lastly is normalized to .
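The seed selection and propagation steps of this subsection can be sketched as follows, assuming the widely used closed form of manifold ranking from [43, 1], f* = (D − αW)⁻¹ y, with the diagonal of the learned matrix set to zero so that a seed does not rank itself. The value `alpha=0.9` is a placeholder rather than the authors' setting, the function names are hypothetical, and the final combination of the foreground and background rankings is not reproduced here.

```python
import numpy as np

def select_foreground_seeds(values, top_fraction=0.1, min_value=0.5):
    """Indicator vector: in the top fraction AND above min_value."""
    k = max(1, int(np.ceil(top_fraction * len(values))))
    cutoff = np.sort(values)[-k]
    return ((values >= cutoff) & (values > min_value)).astype(float)

def propagate(W, y, alpha=0.9):
    """Closed-form manifold ranking; every node is assumed to have an edge."""
    D = np.diag(W.sum(axis=1))
    A = np.linalg.inv(D - alpha * W)
    np.fill_diagonal(A, 0.0)  # a query is ranked only by the other nodes
    return A @ y

# Toy chain graph of five segments with unit weights.
W = np.zeros((5, 5))
for i in range(4):
    W[i, i + 1] = W[i + 1, i] = 1.0

ieis = np.array([0.9, 0.6, 0.3, 0.2, 0.1])
y_fg = select_foreground_seeds(ieis)  # only the first segment is a seed
f_fg = propagate(W, y_fg)
```

On the chain, the propagated value decays with graph distance from the seed, which is the smoothing behaviour the refinement relies on.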

III-C Initial Co-saliency Maps

As mentioned above, there are the occasions where should be sufficiently larger than , and vice versa. If , is considered to show the single saliency residual and thus the co-saliency value of should be as small as , which prohibits us from linear combination of the two values [41]. If , on the other hand, this shows the specific case where some regions may not seem salient in their image but are certainly co-salient in their image group. Both the types of cases encourage us to put more emphasis on for computing co-saliency maps. Considering this aspect, we obtain the initial co-saliency (IC) value for each segment as below:


where the threshold draws a boundary between the “single saliency residual” case and the other one. Fig. 3 shows several saliency maps resulting from the deep saliency networks.

IV Seed propagation over an integrated graph

The (co-)saliency detection methods usually perform their later stages to obtain final (co-)saliency maps by leveraging color and pixel position information. For example, the ranking with foreground queries in [1] is the second stage to locate accurate object boundaries and eliminate background noise, and a fully-connected conditional random field (CRF) [44] is used for post-processing in [19]. Many co-saliency detection algorithms also refine their resulting maps [2, 23] or combine several cues [20] using color features within an image group because co-salient objects probably share similar color features in (part of) the image group. However, the graph-based procedures among these later stages are performed separately within each image, so they only refine each (co-)saliency map so that it shows accurate boundaries and has smooth saliency values, without considering the correspondence within the image group.

In contrast to those methods, we propose to consider the whole image group and refine the input images all together. In addition, this step has another important role, which is to detect (parts of) co-salient objects located on image boundaries as shown in Fig. 4. The above procedures in our work might miss the homogeneous parts of co-salient objects on the image boundaries and strongly suppress the regions close to the boundaries. To this end, we construct an integrated graph so that it can connect all the intra-image nodes of the images in the group for sharing co-saliency information.

Fig. 4: Examples of initial and auxiliary co-saliency maps. From left to right: image groups, input images from these groups, and their initial and auxiliary co-saliency maps. These initial co-saliency maps do not fully detect the regions that are homogeneous and/or close to image boundaries, while the second row shows that the auxiliary co-saliency maps may miss part of the objects with multiple components. Thus, the two types of co-saliency maps can complement each other.

IV-A The Integrated Graph with a Cluster Layer

In [27], the bipartite graph matching method finds pairs of the most relevant superpixels between two images, each of which is connected with its matching score. Though ensuring good matched pairs for similar scenes such as sequential frames of a video that are not severely different from each other, in general, this approach easily fails to find good pairs of superpixels between the images that have various backgrounds and/or different sizes of objects. Hence, an indirect approach is introduced in this paper to overcome this problem. We basically ignore the connectivity between images, which means that there are no edges that directly connect the intra-image nodes of any two different images. Thus the intra-image graphs are represented in the form of a sparse block-wise diagonal matrix:


where and each is computed by (2,3). Instead, the proposed method introduces an additional cluster layer to consider the interactions between images and indirectly connect the intra-image nodes via the inter-image ones on it, as shown in Fig. 5.

Fig. 5: Visualization of the integrated graph and the interactions of intra-image nodes therein, focusing on image layer . Red arrows represent the paths where the intra-image nodes interact between images via inter-image nodes, and small navy arrows indicate the interactions between the intra-image nodes in the single image.

To define the inter-image nodes, we perform k-means clustering with the descriptor of every intra-image node, reusing its averaged Lab color vector. Through this procedure, clusters and their centroids are generated, where each centroid is the representative descriptor of its cluster and is also defined as an inter-image node. The goal of this step is to construct the affinity matrix of the unified graph including all the intra- and inter-image nodes, so we first connect each inter-image node to the members of its cluster and compute the weights of the edges using descriptor similarities as:


where a control parameter scales the descriptor similarity. In addition, the inter-image nodes are also connected to each other, specifically to their k-nearest neighbors (k-NN), which means that the graph of the cluster layer is as sparse as the intra-image graphs, and its affinity matrix is written as:


Finally, the affinity matrix of the unified graph is constructed from the intra-image, node-to-cluster, and cluster-layer affinity matrices, expressed in a block-wise matrix form:
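One plausible way to assemble such an integrated affinity matrix is sketched below. The block layout, the Gaussian similarity with parameter `sigma2`, the tiny k-means routine, and all function names are assumptions made for illustration; the displayed equations of the original are not available in this copy.

```python
import numpy as np

def kmeans(X, K, iters=20, seed=0):
    """Plain Lloyd iterations; returns centroids and cluster assignments."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), K, replace=False)].copy()
    for _ in range(iters):
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        assign = dist.argmin(1)
        for c in range(K):
            if np.any(assign == c):  # keep the old centroid if a cluster empties
                centers[c] = X[assign == c].mean(0)
    dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return centers, dist.argmin(1)

def integrated_affinity(W_intra_list, X, K=2, knn=1, sigma2=2000.0):
    n = len(X)
    W_intra = np.zeros((n, n))
    start = 0
    for W in W_intra_list:  # block-diagonal: no direct cross-image edges
        m = len(W)
        W_intra[start:start + m, start:start + m] = W
        start += m
    centers, assign = kmeans(X, K)
    W_nc = np.zeros((n, K))  # each node is linked to its own centroid only
    d2 = ((X - centers[assign]) ** 2).sum(1)
    W_nc[np.arange(n), assign] = np.exp(-d2 / sigma2)
    W_cc = np.zeros((K, K))  # cluster layer: k-NN links between centroids
    cd2 = ((centers[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    for c in range(K):
        nearest = [j for j in np.argsort(cd2[c]) if j != c][:knn]
        for j in nearest:
            W_cc[c, j] = W_cc[j, c] = np.exp(-cd2[c, j] / sigma2)
    return np.block([[W_intra, W_nc], [W_nc.T, W_cc]])

# Two images with two segments each; X holds the mean Lab color per segment.
W1 = np.array([[0.0, 1.0], [1.0, 0.0]])
W2 = np.array([[0.0, 1.0], [1.0, 0.0]])
X = np.array([[50.0, 60.0, 40.0], [80.0, -5.0, 5.0],
              [52.0, 58.0, 41.0], [78.0, -4.0, 6.0]])
W_all = integrated_affinity([W1, W2], X, K=2, knn=1)
```

In this toy group, segments 0 and 2 (from different images) have similar colors and end up attached to the same inter-image node, while the intra-image block keeps no direct edge between the two images.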


IV-B Seed Propagation

To assign newly propagated co-saliency values to all the segments, we need to extract foreground seeds (called co-saliency seeds in this section) and background ones, which are selected similarly to the process for the IeIS refinement. The top 10 percent of co-salient regions with respect to the IC values in each image are extracted as the co-saliency seeds, and the boundary nodes of each image are selected as the background seeds, based on the boundary prior. In addition, the ones selected as both the co-saliency and background seeds simultaneously are precluded from both seed sets because those seeds are not reliable. In summary, the co-saliency and background seeds are defined as:

  • Co-saliency seeds: high IC nodes that are not on any image boundaries.

  • Background seeds: low IC nodes on image boundaries.
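The two seed definitions above can be written as a short sketch. The function name `split_seeds` and the toy values are hypothetical, and the toy call uses a top fraction of 0.2 instead of the 10 percent in the text only so that a ten-node example yields more than one candidate.

```python
import numpy as np

def split_seeds(ic, on_boundary, top_fraction=0.1):
    """Return boolean masks (co-saliency seeds, background seeds) for one image."""
    k = max(1, int(np.ceil(top_fraction * len(ic))))
    cutoff = np.sort(ic)[-k]
    high = ic >= cutoff               # top fraction by IC value
    co_seeds = high & ~on_boundary    # high IC, not on an image boundary
    bg_seeds = on_boundary & ~high    # boundary nodes that are not high IC
    return co_seeds, bg_seeds

ic = np.array([0.9, 0.8, 0.1, 0.2, 0.05, 0.3, 0.4, 0.15, 0.1, 0.95])
on_boundary = np.array([False, False, True, False, True,
                        False, False, False, False, True])
co_seeds, bg_seeds = split_seeds(ic, on_boundary, top_fraction=0.2)
```

Node 9 has the highest IC value but lies on the image boundary, so it is dropped from both sets, matching the rule that conflicting seeds are treated as unreliable.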

From the co-saliency and background seeds, co-saliency values are computed by propagating them to all the (intra-image) nodes in the image group. For this, we use the graph-based learning scheme again with the integrated graph, which makes a full pairwise graph as:


which uses the degree matrix of the integrated affinity matrix. As mentioned above, the graph with the integrated affinity matrix has no direct inter-image connections between any two intra-image nodes, so the inter-image nodes indirectly connect such pairs instead. However, the learned graph has full pairwise relations among all the nodes. In other words, this graph has direct inter-image connections, so it ensures straightforward propagation between images.

To obtain auxiliary co-saliency maps to combine with the IC maps, the overall affinities to the co-saliency and background seeds are computed respectively, which is written as:


where and are the co-saliency and background seed vectors respectively each of which is concatenated with a zero vector for the inter-image nodes, and and represent the co-saliency and background seed sets respectively. and are decomposed into the vectors for each image and the cluster layer, i.e., and , and thus the auxiliary co-saliency map for is computed as:


Lastly, is also normalized to and combined with .
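The effect of learning a full pairwise graph can be checked on a toy integrated graph. The sketch below assumes the common manifold ranking form (D − αW)⁻¹ for the learned matrix and a placeholder `alpha`; the final combination of the co-saliency and background affinities is not reproduced because its equation is missing from this copy.

```python
import numpy as np

# Two images with two nodes each (indices 0-1 and 2-3) plus one shared
# inter-image (cluster) node at index 4. There are no direct cross-image edges.
W = np.zeros((5, 5))
W[0, 1] = W[1, 0] = 1.0  # intra-image edge in image 1
W[2, 3] = W[3, 2] = 1.0  # intra-image edge in image 2
W[1, 4] = W[4, 1] = 0.8  # node 1 is a member of the cluster
W[2, 4] = W[4, 2] = 0.8  # node 2 is a member of the same cluster

alpha = 0.9              # placeholder value, not the authors' setting
D = np.diag(W.sum(axis=1))
A = np.linalg.inv(D - alpha * W)  # learned graph: dense pairwise relations

# Seed vector: node 0 of image 1 is a co-saliency seed; the entry for the
# inter-image node is zero, as described in the text.
y_fg = np.array([1.0, 0.0, 0.0, 0.0, 0.0])
f_fg = A @ y_fg
```

Although `W` has zeros between the two images, every cross-image entry of `A` is positive, so a seed in the first image assigns a nonzero value to the nodes of the second image through the shared cluster node.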

IV-C Final Co-saliency Maps

Given the initial and auxiliary co-saliency maps, the former ones might not fully detect the regions that are homogeneous and/or close to image boundaries, while the latter ones might miss part of the objects with multiple components due to solely using the color and position cues. Therefore, these are complementary to each other and thus simply combined to produce the final co-saliency maps as


Because the auxiliary co-saliency maps are likely to be vulnerable to background noise, we perform a simple post-processing scheme in which the outputs never exceed the inputs [22]. This step needs spatial positional distance maps, which can be computed with shrunk input images to reduce processing time.

Fig. 6: Visual comparison on the Alaskan bear, Red Sox player, and Statue of liberty sets in iCoseg (from left to right: input images, CB, HS, MG, LDW, MIL, DIM, the proposed method, and ground truth images).
Fig. 7: Visual comparison on the Building and Face sets in MSRC (from left to right: input images, CB, HS, MG, LDW, MIL, the proposed method, and ground truth images).

V Experimental Results

V-A Experimental Settings

In our experiments, two widely used datasets, iCoseg [45] and MSRC [46], are used to evaluate the performance of our algorithm and compare it with others. The iCoseg dataset consists of 38 groups, each of which includes 4-42 images, for a total of 643 images along with pixel-wise ground truth annotations. It is the largest among widely used co-saliency detection datasets, and its image groups contain multiple objects and complex backgrounds. The MSRC dataset is composed of 8 groups of 30 images each, but the grass group is not used for the evaluation since it has no co-salient objects. This dataset can be used to evaluate the ability to treat co-salient objects that are not consistent in color, and it also contains complex co-salient objects and diverse, cluttered backgrounds.

For each of the evaluation datasets, the performance is measured with five widely used criteria: the precision-recall (PR) curve, the average precision (AP), the receiver operating characteristic (ROC) curve, the area under the ROC curve (AUC), and the F-measure. When there are more true negatives than true positives, the PR curve shows the differences between algorithms more clearly than the ROC curve does, and the same goes for the corresponding areas under the curves, AP and AUC. For the PR and ROC curves, the co-saliency maps are normalized to [0, 255] and binarized with thresholds varying from 0 to 255. The precision, recall, and false positive rates are calculated under each threshold and averaged over all samples, following the standard used in the literature [47]. Meanwhile, we used a self-adaptive threshold [48], based on the mean and standard deviation within each co-saliency map, to obtain the F-measure, and the precision and recall rates averaged over all samples are combined as defined below:


where as typically used in the literature.
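The evaluation protocol described in this subsection can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' evaluation code: the function names are hypothetical, the adaptive threshold is assumed to be the mean plus the standard deviation of each map, $\beta^2$ is assumed to be 0.3, and an empty prediction is assigned zero precision by convention.

```python
import numpy as np

def pr_roc_curves(sal_maps, gt_masks):
    """Average precision, recall, and false positive rate over all samples
    for each integer threshold in 0..255.

    sal_maps: list of 2-D float arrays (any value range).
    gt_masks: list of 2-D boolean arrays of matching shapes.
    """
    prec, rec, fpr = np.zeros(256), np.zeros(256), np.zeros(256)
    for sal, gt in zip(sal_maps, gt_masks):
        # Min-max normalize each map to [0, 255].
        sal = sal.astype(np.float64)
        span = sal.max() - sal.min()
        sal = 255.0 * (sal - sal.min()) / span if span > 0 else np.zeros_like(sal)
        n_pos = gt.sum()
        n_neg = gt.size - n_pos
        for t in range(256):
            pred = sal >= t
            tp = np.logical_and(pred, gt).sum()
            fp = pred.sum() - tp
            prec[t] += tp / max(pred.sum(), 1)  # 0 if nothing is predicted
            rec[t] += tp / max(n_pos, 1)
            fpr[t] += fp / max(n_neg, 1)
    n = len(sal_maps)
    return prec / n, rec / n, fpr / n

def adaptive_f_measure(sal_maps, gt_masks, beta2=0.3):
    """F-measure from precision and recall averaged over all samples, with
    each map binarized at (mean + standard deviation) of its own values."""
    precs, recs = [], []
    for sal, gt in zip(sal_maps, gt_masks):
        pred = sal >= sal.mean() + sal.std()
        tp = np.logical_and(pred, gt).sum()
        precs.append(tp / max(pred.sum(), 1))
        recs.append(tp / max(gt.sum(), 1))
    p, r = float(np.mean(precs)), float(np.mean(recs))
    return (1 + beta2) * p * r / (beta2 * p + r) if p + r > 0 else 0.0
```

Plotting the returned precision against recall gives the PR curve, and recall against the false positive rate gives the ROC curve.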

The deep saliency networks are implemented with the Caffe package [49] to follow the publicly available model of the DeepLab system. For the IrISnet, we used a large single saliency detection dataset, MSRA10K [50], and the size of input images (i.e., ) and hyperparameters for training are set as suggested in [19]. The IeISnet consists of sequential fully-connected, batch normalization [51], and rectified linear unit (ReLU) layers, which are trained with several co-saliency detection datasets, i.e., Cosal2015 [2] and CPD [26], together with whichever of iCoseg and MSRC is not used for testing, to exploit as much training data as possible. Even though parts of the Cosal2015 dataset (e.g., baseball) tend to put far more emphasis on the correspondence than on the intra-image saliency, this is acceptable for our IeISnet training because the definition of IeIS also focuses more on the correspondence. We set the learning rate and momentum parameter to 0.001 and 0.9 respectively, and the weight decay is 0.0005. As in [1, 2, 19], we set for each to 200, where 150 and 50 superpixels at different scales are additionally used for the IeIS detection, and the precise value of is determined by the SLIC algorithm. We consistently set and irrespective of the number of images . For the IeISnet learning, we set considering the number of true positives and negatives in the training co-saliency detection datasets, and is empirically set to 3. The parameter is usually set to 0.99 in the literature, but we use since the seed propagation steps are performed with more reliable foreground seeds, and we set for the same reason. Lastly, we use and , and also conduct grid search experiments for , , , and to ensure that we select the appropriate values of these parameters.
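To make the roles of the stated learning rate (0.001), momentum (0.9), and weight decay (0.0005) concrete, the following sketch shows one update step of stochastic gradient descent with momentum and L2 weight decay. It assumes the standard momentum-SGD rule; the text does not state which solver was used, and the function name and the list-based parameter representation are illustrative only.

```python
def sgd_momentum_step(weights, velocity, grads,
                      lr=0.001, momentum=0.9, weight_decay=0.0005):
    """One momentum-SGD step with L2 weight decay.

    weights, velocity, grads: lists of floats of equal length.
    Returns the updated (weights, velocity) as new lists.
    """
    new_weights, new_velocity = [], []
    for w, v, g in zip(weights, velocity, grads):
        g_reg = g + weight_decay * w      # add the L2 regularization gradient
        v = momentum * v - lr * g_reg     # accumulate the velocity
        new_velocity.append(v)
        new_weights.append(w + v)         # move along the velocity
    return new_weights, new_velocity
```

With a weight of 1.0, zero initial velocity, and a gradient of 2.0, one step moves the weight by -0.001 × (2.0 + 0.0005 × 1.0), which shows that the weight decay term acts as a small pull toward zero on top of the data gradient.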


Time (s) 1.02 103.36 1.12 6.52 12.25 19.6 1.79


* The running times of DIM and HS, cited from [24, 20], were measured only with iCoseg.

TABLE I: Average execution time per image.

V-B Run Time Comparison

We conduct the experiments with our unoptimized code run on a PC with an Intel i7-6700 CPU, 32GB RAM, and a GTX Titan X GPU. The code is implemented in MATLAB except for the SLIC algorithm, which is in C++, and GPU acceleration was applied only for the Caffe framework. Table I lists the average execution time per image for several different methods, where the execution times of LDW, MIL, DIM, and HS are cited from their papers. The first two values were measured using a PC with two 2.8GHz 6-core CPUs, 64GB RAM, and a GTX Titan Black GPU in [2, 23], the third one with an Intel i3-2130 CPU and 8GB RAM in [24], and the last one with an Intel i7-3770 and 4GB RAM in [20]. As can be seen, the proposed method has moderate computational complexity with state-of-the-art performance as evaluated below. In particular, by leveraging supervised learning schemes, our method runs faster than the ones based on online weakly supervised learning [23, 2, 24], and its execution time is similar to that of the efficient CB [15] and MG [21] methods.

Fig. 8: Quantitative comparison on the iCoseg and MSRC datasets with the PR and ROC curves.

V-C Comparison with the State-of-the-Art

With the evaluation criteria stated above, we compare the proposed co-saliency detection method with other major algorithms ranging from the bottom-up ones based on handcrafted features to the learning-based ones using high-level features: CB [15], HS [20], MG [21], LDW [2], MIL [23], and DIM [24] (only for the iCoseg dataset). First, Figs. 6 and 7 show several visual examples of the resulting co-saliency maps for qualitative comparison, where it can be seen that the proposed method detects co-salient objects more uniformly and suppresses background regions better than the others do. In particular, the Alaskan bear set contains background regions similar in color to the co-salient objects. Even though our auxiliary co-saliency maps focus on the color similarity, emphasizing the background seeds moderately suppresses the background noise that it may bring about. The MG method effectively finds the co-salient objects consistent in color, e.g., Red Sox player, but it has weaknesses in suppressing noisy backgrounds and detecting the co-salient regions inconsistent in color, as shown in Fig. 7. The Statue of liberty set includes many co-salient regions that are not salient in terms of single-image saliency. The most representative case is the first image, where only the torch probably looks salient, but every part of the statue is co-salient in the group. Because each image in the MSRC dataset probably contains a single co-salient object, it is effective to first find salient regions in terms of single-image saliency and then analyze the correspondence in each group. Thus, the results of the proposed method show well-suppressed common backgrounds.

For the quantitative comparison, Fig. 8 shows the PR and ROC curves, and Table II contains the AP, AUC, and F-measure values of our method and the compared methods. On the iCoseg dataset, the proposed method outperforms the other ones on all the evaluation criteria. Both the PR and ROC curves show that our co-saliency maps result in the highest precision/recall rates in the widest ranges of recall/false positive rates, especially at decent recall/false positive rates ( 0.8/0.1). Even though, on the MSRC dataset, our method results in slightly lower PR and ROC curves than those of MIL, it shows the best score on the F-measure criterion. As can be seen in Figs. 6 and 7, the proposed algorithm produces more assertive co-saliency maps than those of the other methods, which has its strengths and weaknesses. Because there is a certain amount of detected regions for each image even with a threshold close to 1, it guarantees a degree of recall, but there is also a limit to precision/false positive rates. However, it is noteworthy that this property makes it easy to determine a threshold to segment a given co-saliency map for other tasks, e.g., co-segmentation. The self-adaptive thresholds used for the F-measure evaluation also probably result in robust segmentation given assertive co-saliency maps. Table II shows that the standard deviation of F-measures is the smallest when our approach is applied, which means that our method is more robust to the variation of the binarization threshold. Meanwhile, when comparing the proposed method with the other learning-based ones, LDW, MIL, and DIM, it should be noted that they need similar neighbors from other external groups rather than a target image group and perform weakly supervised learning given the input images for testing. These procedures assume that the external groups do not include common co-salient objects with the target group so that they can give negative samples illustrating background features in the target group. Thus, as can be seen in Table I, they require relatively high computational complexity, and the fact that they need the external groups may also be a limitation in conditions where the similar neighbors are lacking or insufficient. Despite the differences in requirements, we can observe that the proposed method shows better or competitive performance compared to the other learning-based algorithms.
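The link between assertive (near-binary) co-saliency maps and a small standard deviation of F-measures can be illustrated with a short sketch. It is a simplification under stated assumptions: the F-measure is computed per map rather than from precision and recall averaged over a dataset, the threshold grid (0.05 to 0.95) is hypothetical, $\beta^2$ is assumed to be 0.3, and maps are assumed to lie in [0, 1].

```python
import numpy as np

def f_measure(pred, gt, beta2=0.3):
    """F-measure of a boolean prediction against a boolean ground truth."""
    tp = np.logical_and(pred, gt).sum()
    if tp == 0:
        return 0.0
    p = tp / pred.sum()
    r = tp / gt.sum()
    return float((1 + beta2) * p * r / (beta2 * p + r))

def f_measure_std(sal, gt, thresholds=None):
    """Standard deviation of the F-measure of one map in [0, 1] as the
    binarization threshold varies over `thresholds`."""
    if thresholds is None:
        thresholds = np.linspace(0.05, 0.95, 19)  # hypothetical grid
    scores = [f_measure(sal >= t, gt) for t in thresholds]
    return float(np.std(scores))

# A binary map yields the same segmentation at every threshold, so the
# deviation is zero; a smooth ramp changes with the threshold.
gt_demo = np.zeros((4, 4), dtype=bool)
gt_demo[2:, :] = True
assertive_map = gt_demo.astype(float)
soft_map = np.linspace(0.0, 1.0, 16).reshape(4, 4)
```

Calling `f_measure_std(assertive_map, gt_demo)` and `f_measure_std(soft_map, gt_demo)` contrasts the two cases: the more a map resembles the binary one, the less its segmentation depends on the chosen threshold.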


Dataset   Method   AP      AUC     F-measure   Std. dev.*
iCoseg    CB       0.806   0.937   0.741       0.145
iCoseg    HS       0.839   0.955   0.755       0.189
iCoseg    MG       0.854   0.957   0.794       0.114
iCoseg    LDW      0.875   0.957   0.799       0.168
iCoseg    MIL      0.866   0.965   0.814       0.141
iCoseg    DIM      0.877   0.969   0.792       0.212
iCoseg    Ours     0.896   0.979   0.823       0.077
MSRC      CB       0.689   0.798   0.577       0.170
MSRC      HS       0.785   0.882   0.709       0.197
MSRC      MG       0.688   0.827   0.635       0.133
MSRC      LDW      0.842   0.908   0.767       0.178
MSRC      MIL      0.894   0.940   0.796       0.138
MSRC      Ours     0.876   0.934   0.811       0.054

* Standard deviation of the F-measures with the variation of the binarization threshold.

TABLE II: Quantitative comparison on the iCoseg and MSRC datasets with AP, AUC, and F-measures.
Fig. 9: Grid search analysis on the parameters , , , and . The performance with the variations of and slightly depends on target datasets (i.e., iCoseg and MSRC), while the proposed system is not sensitive to and .

V-D Parameter Analysis

We conduct the grid search experiments to find the appropriate values of , , , and . Fig. 9 shows the AP, AUC, and F-measure scores over certain ranges of these parameters. As can be seen, the most effective value of is lower with MSRC than with iCoseg, which is related to the tendency that correspondence cues are of lower importance, so the fitting constraint is more emphasized for the seed propagation over the integrated graph in MSRC than in iCoseg. Likewise, the variation of the parameter also slightly influences the performance depending on the target datasets. Because the foreground and background seeds are basically of equal importance, one could set to 1, but a value of larger than 1 is more effective with reliable foreground seeds, especially with respect to the F-measures. On the other hand, the proposed method is not sensitive to the parameters and , so we can select decent values for these parameters and obtain stable results with them. Even though, in (7), two different operations are applied according to , both sides would emphasize the IeIS when there is a large difference between the IrIS and IeIS values; when they are similar to each other, they both would give almost equal contributions to the resulting initial co-saliency value. Thus, slight variations of the parameter do not bring about big differences in the performance of our method. The control parameter for the construction of the integrated graph behaves similarly to , where a larger further facilitates the seed propagation between the cluster centers and intra-image nodes with similar colors, and vice versa. This is because the various colors of co-salient objects could be reflected in the cluster layer, where different inter-image nodes represent diverse colors, with sufficiently large , and the regions within an image group similar in color could share their co-saliency information through the seed propagation.
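A grid search of this kind amounts to evaluating a score on every combination of candidate parameter values and keeping the best one. The sketch below is generic and uses only the standard library; `evaluate`, `toy_score`, and the parameter names `theta_a` and `theta_b` are placeholders, not the parameters or the scoring procedure of the proposed method.

```python
from itertools import product

def grid_search(evaluate, grid):
    """Exhaustively evaluate `evaluate(**params)` over the Cartesian product
    of the candidate values in `grid` (a dict: name -> list of values) and
    return (best_params, best_score), where higher scores are better."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

def toy_score(theta_a, theta_b):
    """Placeholder objective that peaks at theta_a = 0.9 and theta_b = 2."""
    return -(theta_a - 0.9) ** 2 - (theta_b - 2) ** 2

best, score = grid_search(toy_score, {"theta_a": [0.8, 0.9, 0.99],
                                      "theta_b": [1, 2, 3]})
```

In practice, `evaluate` would run the full pipeline on a dataset and return a criterion such as AP, AUC, or the F-measure. The cost grows multiplicatively with the number of candidates per parameter, so parameters to which the method is insensitive can be fixed to shrink the grid.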

V-E Co-segmentation Experiments

Co-segmentation is a direct higher-level application of co-salient object detection, which can replace user interaction and provide useful prior knowledge of target objects. For example, Quan et al. [52] proposed to construct a graph including input images and generate two types of probability maps using low- and high-level features through graph-based optimization. For each image, the resulting two probability maps are combined by multiplication, and then a graph cut approach produces the final co-segmentation results. These probability maps could be replaced with co-saliency maps and, in fact, Fu et al. [15] applied their co-saliency detection method to co-segmentation through Markov random field optimization. The approach of Chen et al. [53] groups input images into aligned homogeneous clusters and then merges them into visual subcategories, where a discriminative detector for each subcategory is trained to find target objects within a test dataset. For each cluster, a co-segmentation method is applied to segment out the aligned objects, and this step could also be performed with co-saliency detection.

Thus, we conduct co-segmentation experiments to compare our results with those of other approaches. Two datasets, Internet-100 [54] and iCoseg, are used for the evaluation with the Jaccard index (J, intersection-over-union for the foreground regions) and Precision (P, the proportion of correctly labeled pixels). We convert our co-saliency maps into co-segmentation results by simply thresholding with 0.5. Because, as for the Internet-100 dataset, there are several noisy images that do not contain target objects in each class, we normalize each auxiliary co-saliency map by the operation

instead of normalizing it to , which forces the maximum in it to be . Table III and Fig. 10 show the quantitative comparison and several visual examples of our results on the Internet-100 dataset, respectively. The proposed method with simple thresholding outperforms other state-of-the-art co-segmentation algorithms or produces competitive results compared to them. Fig. 10 shows several quality co-segmentation results, but the noisy objects have not been perfectly suppressed in the last images of the first and second rows.
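The two co-segmentation criteria can be computed directly from a thresholded co-saliency map. The sketch below follows the definitions given in the text (J as foreground intersection-over-union, P as the proportion of correctly labeled pixels). The function name is hypothetical, and two details are assumptions: pixels equal to the threshold are treated as foreground, and J is set to 1 when both the prediction and the ground truth are empty.

```python
import numpy as np

def cosegmentation_scores(sal, gt, threshold=0.5):
    """Return (precision, jaccard) for one image.

    sal: 2-D float co-saliency map in [0, 1].
    gt:  2-D ground-truth mask (boolean or 0/1).
    """
    pred = sal >= threshold                  # simple thresholding
    gt = gt.astype(bool)
    # P: fraction of pixels, foreground and background, labeled correctly.
    precision = float((pred == gt).mean())
    # J: intersection-over-union of the foreground regions.
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    jaccard = float(inter / union) if union > 0 else 1.0
    return precision, jaccard

sal_demo = np.array([[0.9, 0.6],
                     [0.4, 0.1]])
gt_demo = np.array([[1, 0],
                    [0, 0]])
p_demo, j_demo = cosegmentation_scores(sal_demo, gt_demo)
```

In this toy case three of the four pixels are labeled correctly while the predicted foreground is twice as large as the true one, which illustrates why P can be high even when J is modest. Dataset-level scores would be obtained by averaging the per-image values.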


Internet-100   Airplane         Car              Horse
Method         P (%)   J (%)    P (%)   J (%)    P (%)   J (%)
[55]           47.5    11.7     59.2    35.2     64.2    29.5
[54]           88.0    55.8     85.3    64.4     82.8    51.3
[53]           90.3    40.3     87.7    64.9     86.2    33.4
[52]           91.0    56.3     88.5    66.8     89.3    58.1
Ours           92.5    58.5     91.0    74.9     89.6    59.3



iCoseg [56] [57] [52] Ours


P (%) 91.4 92.8 93.3 93.9
J 0.73 0.76 0.75


TABLE III: Quantitative comparison of co-segmentation methods.
Fig. 10: Visual examples of our co-segmentation results on the Internet-100 dataset. Top to bottom: Airplane, Car, Horse classes.

VI Conclusions

We have proposed a co-saliency detection method that finds the regions of high initial co-saliency values with the deep saliency networks, and complementary regions through the seed propagation over the integrated graph. Given salient regions within each image in terms of single-image saliency, the features extracted from these foregrounds in a group are concatenated with the descriptor of each segment to be fed into the inter-image saliency network. The resulting IrIS and IeIS values are combined to produce initial co-saliency maps, which then provide foreground and background seeds for the seed propagation steps. The unified graph is constructed with the affinity matrices using color similarities, and the newly propagated co-saliency values become complementary components for the final co-saliency maps. The experimental results indicate that the proposed method achieves state-of-the-art performance with modest requirements on computational complexity and input images/groups.


  • [1] C. Yang, L. Zhang, H. Lu, X. Ruan, and M.-H. Yang, “Saliency detection via graph-based manifold ranking,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2013, pp. 3166–3173.
  • [2] D. Zhang, J. Han, C. Li, J. Wang, and X. Li, “Detection of co-salient objects by looking deep and wide,” Int. J. Comput. Vision, vol. 120, no. 2, pp. 215–232, Nov. 2016.
  • [3] Q. Wang, W. Zheng, and R. Piramuthu, “Grab: Visual saliency via novel graph model and background priors,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2016, pp. 535–543.
  • [4] N. Liu and J. Han, “DHSNet: Deep hierarchical saliency network for salient object detection,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2016, pp. 678–686.
  • [5] J. Kuen, Z. Wang, and G. Wang, “Recurrent attentional networks for saliency detection,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2016, pp. 3668–3677.
  • [6] J. Pan, E. Sayrol, X. Giro-I-Nieto, K. McGuinness, and N. E. O’Connor, “Shallow and deep convolutional networks for saliency prediction,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2016, pp. 598–606.
  • [7] S. Jetley, N. Murray, and E. Vig, “End-to-end saliency mapping via probability distribution prediction,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2016, pp. 5753–5761.
  • [8] H. Cholakkal, J. Johnson, and D. Rajan, “Backtracking scspm image classifier for weakly supervised top-down saliency,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2016, pp. 5278–5287.
  • [9] S. S. S. Kruthiventi, V. Gudisa, J. H. Dholakiya, and R. V. Babu, “Saliency unified: A deep architecture for simultaneous eye fixation prediction and salient object segmentation,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2016, pp. 5781–5790.
  • [10] M.-M. Cheng, G.-X. Zhang, N. J. Mitra, X. Huang, and S.-M. Hu, “Global contrast based salient region detection,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2011, pp. 409–416.
  • [11] M.-M. Cheng, N. J. Mitra, X. Huang, P. H. Torr, and S. Hu, “Global contrast based salient region detection,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 3, pp. 569–582, Mar. 2015.
  • [12] W. Zhu, S. Liang, Y. Wei, and J. Sun, “Saliency optimization from robust background detection,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2014, pp. 2814–2821.
  • [13] Y. Wei, F. Wen, W. Zhu, and J. Sun, “Geodesic saliency using background priors,” in Proc. European Conf. Computer Vision, 2012, pp. 29–42.
  • [14] H. Fu, D. Xu, B. Zhang, S. Lin, and R. K. Ward, “Object-based multiple foreground video co-segmentation via multi-state selection graph,” IEEE Trans. Image Process., vol. 24, no. 11, pp. 3415–3424, Jun. 2015.
  • [15] H. Fu, X. Cao, and Z. Tu, “Cluster-based co-saliency detection,” IEEE Trans. Image Process., vol. 22, no. 10, pp. 3766–3778, Oct. 2013.
  • [16] D. Zhang, H. Fu, J. Han, and F. Wu, “A review of co-saliency detection technique: Fundamentals, applications, and challenges,” arXiv preprint arXiv:1604.07090, Apr. 2016.
  • [17] M.-M. Cheng, N. J. Mitra, X. Huang, and S.-M. Hu, “Salientshape: Group saliency in image collections,” Vis. Comput., vol. 30, no. 4, pp. 443–453, Apr. 2014.
  • [18] G. Li and Y. Yu, “Visual saliency based on multiscale deep features,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2015, pp. 5455–5463.
  • [19] ——, “Deep contrast learning for salient object detection,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2016, pp. 478–487.
  • [20] Z. Liu, W. Zou, L. Li, L. Shen, and O. Le Meur, “Co-saliency detection based on hierarchical segmentation,” IEEE Signal Process. Lett., vol. 21, no. 1, pp. 88–92, Jan. 2014.
  • [21] Y. Li, K. Fu, Z. Liu, and J. Yang, “Efficient saliency-model-guided visual co-saliency detection,” IEEE Signal Process. Lett., vol. 22, no. 5, pp. 588–592, May 2015.
  • [22] X. Cao, Z. Tao, B. Zhang, H. Fu, and W. Feng, “Self-adaptively weighted co-saliency detection via rank constraint,” IEEE Trans. Image Process., vol. 23, no. 9, pp. 4175–4186, Sep. 2014.
  • [23] D. Zhang, D. Meng, and J. Han, “Co-saliency detection via a self-paced multiple-instance learning framework,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 5, pp. 865–878, May 2017.
  • [24] D. Zhang, J. Han, J. Han, and L. Shao, “Cosaliency detection based on intrasaliency prior transfer and deep intersaliency mining,” IEEE Trans. Neural Netw. Learn. Syst., vol. 27, no. 6, pp. 1163–1176, Jun. 2016.
  • [25] G. Lee, Y. W. Tai, and J. Kim, “Deep saliency with encoded low level distance map and high level features,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2016, pp. 660–668.
  • [26] H. Li and K. N. Ngan, “A co-saliency model of image pairs,” IEEE Trans. Image Process., vol. 20, no. 12, pp. 3365–3375, Dec. 2011.
  • [27] Z. Tan, L. Wan, W. Feng, and C.-M. Pun, “Image co-saliency detection by propagating superpixel affinities,” in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, 2013, pp. 2114–2118.
  • [28] D. E. Jacobs, D. B. Goldman, and E. Shechtman, “Cosaliency: Where people look when comparing images.” in Proc. ACM Symp. User Interface Software and Technology, 2010, pp. 219–228.
  • [29] H. T. Chen, “Preattentive co-saliency detection,” in Proc. IEEE Int. Conf. Image Processing, 2010, pp. 1117–1120.
  • [30] H. Li, F. Meng, and K. N. Ngan, “Co-salient object detection from multiple images,” IEEE Trans. Multimedia, vol. 15, no. 8, pp. 1896–1909, Dec. 2013.
  • [31] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Süsstrunk, “SLIC superpixels compared to state-of-the-art superpixel methods,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 11, pp. 2274–2282, Nov. 2012.
  • [32] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in Proc. Int. Conf. Learning Representations, 2015.
  • [33] L. C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs,” IEEE Trans. Pattern Anal. Mach. Intell., vol. PP, no. 99, pp. 1–1, 2017.
  • [34] M. Holschneider, R. Kronland-Martinet, J. Morlet, and P. Tchamitchian, “A real-time algorithm for signal analysis with the help of the wavelet transform,” in Wavelets: Time-Frequency Methods and Phase Space, 1990.
  • [35] J. Dai, K. He, and J. Sun, “Convolutional feature masking for joint object and stuff segmentation,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2015, pp. 3992–4000.
  • [36] K. He, X. Zhang, S. Ren, and J. Sun, “Spatial pyramid pooling in deep convolutional networks for visual recognition,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 9, pp. 1904–1916, Sept 2015.
  • [37] R. Arandjelovic and A. Zisserman, “Three things everyone should know to improve object retrieval,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2012, pp. 2911–2918.
  • [38] H. Jégou, F. Perronnin, M. Douze, J. Sánchez, P. Pérez, and C. Schmid, “Aggregating local image descriptors into compact codes,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 9, pp. 1704–1716, Sep. 2012.
  • [39] A. Gordo, J. Almazán, J. Revaud, and D. Larlus, “Deep image retrieval: Learning global representations for image search,” in Proc. European Conf. Computer Vision, 2016, pp. 241–257.
  • [40] G. Tolias, R. Sicre, and H. Jégou, “Particular object retrieval with integral max-pooling of CNN activations,” in Proc. Int. Conf. Learning Representations, 2016.
  • [41] R. Huang, W. Feng, and J. Sun, “Color feature reinforcement for cosaliency detection without single saliency residuals,” IEEE Signal Process. Lett., vol. 24, no. 5, pp. 569–573, May 2017.
  • [42] I. Hwang, S. H. Lee, J. S. Park, and N. I. Cho, “Saliency detection based on seed propagation in a multilayer graph,” Multimedia Tools and Applications, vol. 76, no. 2, pp. 2111–2129, 2017.
  • [43] D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Schölkopf, “Learning with local and global consistency,” in Proc. Advances in Neural Information Processing Systems, 2004, pp. 321–328.
  • [44] P. Krähenbühl and V. Koltun, “Efficient inference in fully connected crfs with gaussian edge potentials,” in Proc. Advances in Neural Information Processing Systems, 2011, pp. 109–117.
  • [45] D. Batra, A. Kowdle, D. Parikh, J. Luo, and T. Chen, “icoseg: Interactive co-segmentation with intelligent scribble guidance,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2010, pp. 3169–3176.
  • [46] J. Winn, A. Criminisi, and T. Minka, “Object categorization by learned universal visual dictionary,” in Proc. IEEE Int. Conf. Computer Vision, 2005, pp. 1800–1807.
  • [47] X. Li, Y. Li, C. Shen, A. Dick, and A. V. D. Hengel, “Contextual hypergraph modeling for salient object detection,” in Proc. IEEE Int. Conf. Computer Vision, 2013, pp. 3328–3335.
  • [48] Y. Jia and M. Han, “Category-independent object-level saliency detection,” in Proc. IEEE Int. Conf. Computer Vision, 2013, pp. 1761–1768.
  • [49] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding,” in Proc. ACM Int. Conf. on Multimedia, 2014, pp. 675–678.
  • [50] T. Liu, J. Sun, N. N. Zheng, X. Tang, and H. Y. Shum, “Learning to detect a salient object,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2007, pp. 1–8.
  • [51] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in Proc. Int. Conf. Machine Learning, 2015, pp. 448–456.
  • [52] R. Quan, J. Han, D. Zhang, and F. Nie, “Object co-segmentation via graph optimized-flexible manifold ranking,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2016, pp. 687–695.
  • [53] X. Chen, A. Shrivastava, and A. Gupta, “Enriching visual knowledge bases via object discovery and segmentation,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2014, pp. 2035–2042.
  • [54] M. Rubinstein, A. Joulin, J. Kopf, and C. Liu, “Unsupervised joint object discovery and segmentation in internet images,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2013, pp. 1939–1946.
  • [55] A. Joulin, F. Bach, and J. Ponce, “Multi-class cosegmentation,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2012, pp. 542–549.
  • [56] D. Kuettel, M. Guillaumin, and V. Ferrari, “Segmentation propagation in imagenet,” in Proc. European Conf. Computer Vision, 2012, pp. 459–473.
  • [57] A. Faktor and M. Irani, “Co-segmentation by composition,” in Proc. IEEE Int. Conf. Computer Vision, 2013, pp. 1297–1304.