Free Lunch for Co-Saliency Detection: Context Adjustment

We unveil a long-standing problem in the prevailing co-saliency detection systems: there is indeed inconsistency between training and testing. Constructing a high-quality co-saliency detection dataset involves time-consuming and labor-intensive pixel-level labeling, which has forced most recent works to rely instead on semantic segmentation or saliency detection datasets for training. However, the lack of proper co-saliency and the absence of multiple foreground objects in these datasets can lead to spurious variations and inherent biases learned by models. To tackle this, we introduce the idea of counterfactual training through context adjustment, and propose a "cost-free" group-cut-paste (GCP) procedure to leverage images from off-the-shelf saliency detection datasets and synthesize new samples. Following GCP, we collect a novel dataset called Context Adjustment Training. The two variants of our dataset, i.e., CAT and CAT+, consist of 16,750 and 33,500 images, respectively. All images are automatically annotated with high-quality masks. As a side-product, object categories, as well as edge information, are also provided to facilitate other related works. Extensive experiments with state-of-the-art models are conducted to demonstrate the superiority of our dataset. We hope that the scale, diversity, and quality of CAT/CAT+ can benefit researchers in this area and beyond. The dataset and benchmark toolkit will be accessible through our project page.




1 Introduction

Figure 1: Conceptual illustration of counterfactual training versus biased training. Top block: Evaluation for the banana group in CoSOD3k dataset [19]; Middle block: Biased training samples from current co-saliency detection datasets; Bottom block: Samples generated under the idea of counterfactual training. GT denotes the ground-truth. Answers are generated by GICD [66].

Saliency detection attempts to mimic the human visual system by automatically detecting and segmenting out object regions that attract the most attention in an image [3]. Today’s saliency detection systems, especially those equipped with deep neural networks [55], are good at identifying salient objects from images [3]. However, the setting of detecting a single object is idealized. For real-world applications like image/video retrieval [28], surveillance [22], and video analysis [27], there are always scenes in which multiple objects co-occur in or across frames. This motivates our community to explore co-saliency detection [26]. Sharing a similar rationale with saliency detection, co-saliency detection automatically detects and segments out the common object regions that attract the most attention within image groups [55]. Such an imitation can be achieved by the learning paradigm behind artificial neural networks, especially convolutional neural networks [33]. However, “there is no such thing as a free lunch.” The success of deep learning systems depends heavily on large-scale datasets, which take huge effort and time to collect. The conventional way [19, 66] of constructing a specialized co-saliency detection dataset involves time-consuming and labor-intensive data selection and pixel-level labeling. As shown in Figure 1, the uni-object distribution of the current training data (middle block) deviates heavily from that of the evaluation data (top block), which consists of complex contexts with multiple foreground objects. Models encoded with such spurious variations and biases fail to capture the co-salient signals and thus make incorrect predictions [51, 25].

In the context of co-saliency detection, we distinguish three distributions: the training distribution, the testing distribution, and the true (real-world) distribution. Recent works on evaluation [19, 66] move one step closer to the true distribution by identifying co-salient signals from multi-foreground clusters in a group-by-group manner. However, a dataset that properly defines the training distribution is still missing. As we will discuss in Section 2, most recent works borrow semantic segmentation datasets [38] and saliency detection datasets like DUTS [53] to train their models, which exacerbates the inconsistency among the three distributions. Not surprisingly, as models only see naive examples during training, they inevitably cater to the seen idiosyncrasies and are thus stuck in the unseen world with biased assumptions.

In this paper, we aim at finding a “cost-free” way to handle the distribution inconsistency in co-saliency detection. Intrigued by causal effect [44, 43] and its extensions in vision & language [54, 42, 60], we introduce counterfactual training, regarding the gap between the current training distribution and the true distribution as the direct cause [46, 50] of incorrect co-saliency predictions. As shown in Figure 2, the quality of the prediction made by a learning-based model depends on the quality of the input data under the training distribution. The goal of counterfactual training is to synthesize “imaginary” data samples whose distribution, while still originating from the existing training data, can mimic the true distribution. In this way, models can capture the true effect in terms of co-saliency from the synthesized samples and make proper predictions.

Figure 2: Causal graph for co-saliency detection. Nodes denote variables and arrows denote direct causal effects.

Under the instruction of counterfactual training, we propose using context adjustment [13] to augment off-the-shelf saliency detection datasets, and introduce a novel group-cut-paste (GCP) procedure to improve the distribution of the training dataset. Take a closer look at Figure 1: each sample comes with a corresponding pixel-level annotation and a category label. GCP turns the sample into a canvas to be completed and paints the remaining part through the following steps: (1) classifying candidate images into a semantic group (e.g., banana) by reliable pretrained models; (2) cutting out candidate objects (e.g., baseball, butterfly, etc.); and (3) pasting candidate objects into image samples. We will revisit this procedure formally in Section 3. Different from the plain “observation” (e.g., different color from the background) made by biased training, counterfactual training opens the door of “imagination” and allows models to think comprehensively [42]. A better prediction can be made possibly because features such as the “elongated and curved” shape (instead of the round shape of baseball and orange) or the “yellow-green” color (instead of the dark color of remote control and avocado) are captured. In this way, models can focus more on the true causal effects rather than spurious variations and biases caused by the distribution gap [5].

Following GCP, we collect a novel dataset called Context Adjustment Training, which has two variants: CAT and CAT+. Both of them consist of 280 subclasses affiliated to 15 superclasses, which cover common items in daily life, as well as animals and plants in nature. While our CAT/CAT+ is diverse in semantics, it also has a large scale. The two variants of our dataset contain 16,750 and 33,500 samples, respectively, making it the current largest in co-saliency detection. Every sample in CAT/CAT+ is equipped with sophisticated mask annotation, category, and edge information. It is worth noting that, unlike manual selection and pixel-by-pixel labeling, all the images and their corresponding masks in our dataset are automatically annotated, making the cost virtually “free.” Extensive experimental results shown in Section 4 verify the effectiveness of our dataset. Without bells and whistles, CAT/CAT+ helps both one-stage and two-stage models to achieve significant improvements of 5% to 30% in conventionally adopted metrics on the challenging evaluation datasets CoSOD3k [19] and CoCA [66].

2 Related Work

| # | Training Dataset | Year | #Img. | #Cat. | #Avg. | #Max. | #Min. | Type | Inputs |
|---|---|---|---|---|---|---|---|---|---|
| 1 | MSRCv1 [56] | 2005 | 240 | 8 | 30.0 | 30 | 30 | CO | Group images |
| 2 | MSRCv2 [56] | 2005 | 591 | 23 | 25.7 | 34 | 21 | CO | Group images |
| 3 | iCoseg [2] | 2010 | 643 | 38 | 16.9 | 41 | 4 | CO | Group images |
| 4 | MSRA-B [29] | 2013 | 5,000 | - | - | - | - | SD | Single image |
| 5 | Coseg-Rep [10] | 2013 | 572 | 23 | 24.8 | 116 | 9 | CO | Group images |
| 6 | DUT-OMRON [57] | 2013 | 5,172 | - | - | - | - | SD | Single image |
| 7 | THUR-15K [6] | 2014 | 15,531 | 5 | 3,000.0 | 3,457 | 2,892 | CO | Group images |
| 8 | MSRA10K [7] | 2015 | 10,000 | - | - | - | - | SD | Single image |
| 9 | CoSal2015 [62] | 2015 | 2,015 | 50 | 40.3 | 52 | 26 | CO | Group images |
| 10 | DUTS-TR [53] | 2017 | 10,553 | - | - | - | - | SD | Single image |
| 11 | COCO9213 [38] | 2017 | 9,213 | 65 | 141.7 | 468 | 18 | SS | Group images |
| 12 | COCO-GWD [34] | 2019 | 9,000 | 118 | 76.2 | - | - | SS | Group images |
| 13 | COCO-SEG [52] | 2019 | 200,932 | 78 | 2,576.1 | 49,355 | 201 | SS | Group images |
| 14 | WISD [59] | 2019 | 2,019 | - | - | - | - | CO | Single image |
| 15 | DUTS-Class [66] | 2020 | 8,250 | 291 | 28.3 | 252 | 5 | CO | Group images |
| 16 | Jigsaw [66] | 2020 | 33,000 | 291 | 113.4 | 1,008 | 20 | CO | Group images |
| 17 | CAT (Ours) | 2021 | 16,750 | 280 | 59.8 | 412 | 14 | CO | Group images |
| 18 | CAT+ (Ours) | 2021 | 33,500 | 280 | 119.6 | 824 | 28 | CO | Group images |

Table 1: Datasets used for training co-saliency detection models. #Img.: Number of images; #Cat.: Number of categories; #Avg.: Average number of images per category; #Max.: Maximum number of images per group; #Min.: Minimum number of images per group. CO: Co-saliency detection dataset; SD: Saliency detection dataset; SS: Semantic segmentation dataset. “-” denotes “not available”.

Context Adjustment. Modeling visual contexts liberates many computer vision tasks from the impediment caused by the need for sophisticated annotations [14], such as bounding boxes and pixel-level masks. [61] leverages training pixels and the regularization effects of regional dropout by cutting and pasting image patches. To avoid overfitting, [12] randomly masks out square regions of an image during training. The generic vicinal distribution introduced in [63] helps to synthesize the ground truth of new samples by the linear interpolation of one-hot labels. Unfortunately, the patches introduced by both [61] and [12] may cause severe occlusions of the original objects, which is not compatible with the idea of saliency detection [15]. For [63], the local statistics and semantics are destroyed. Since the foreground and background are blended, the saliency cannot be appropriately defined anymore. [24] fixes this problem by mixing several transformed versions of an image together, i.e., an augmentation chain. Although such a composition improves model robustness, it cannot introduce co-salient signals. [66] concatenates two samples together to form a multi-foreground image. However, for backbones that require fixed-size inputs [48], resizing such jigsawed images leads to severe shape distortion. In contrast, our GCP satisfies the multi-foreground requirement, preserves object shapes, and maintains saliency [15].

Co-Saliency Detection. The origin of this task can be dated back to the last decade [26], where co-saliency was defined as the regions of an object which occur in a pair of images [36]. A more formal definition given in [35] exploits both intra-image saliency and inter-image saliency in a group-by-group manner. Since then, researchers in this area have been working on identifying co-salient signals across image groups [6]. Representative works from early years [4, 20, 40] rely on certain heuristics like background prior [58] and color contrast [9]. Setting aside post-processing such as CRF [32], most modern co-saliency detection models can be divided into two-stage and one-stage models. Building upon saliency detection frameworks, two-stage models combine intra-saliency with inter-saliency generated by specially designed modules [30, 18]. There are also some one-stage models which do not require saliency preparation. [65] and [66] introduce consensus embedding procedures to explore the group representation and use it to guide soft attention [68]. Some recent models adopt different network structures to identify co-salient signals, such as graph neural networks [64, 37] and generative adversarial networks [47]. When trained with our dataset, both two-stage and one-stage models achieve better performance.

Training Dataset. Due to the lack of a suitable training dataset, most previous works use existing saliency detection datasets or semantic segmentation datasets for training. Table 1 summarizes the datasets adopted to train co-saliency detection models. Small datasets [56, 2, 10] are popular only among works from early years. Some large-scale saliency detection datasets [57, 7, 53] are adopted by two-stage models to extract intra-saliency cues. However, such datasets do not have class information, so they cannot be used to train end-to-end models. Some works even borrow datasets from semantic segmentation, e.g., COCO-SEG [52] and COCO9213 [38]. Unfortunately, although such datasets are large in scale, their annotations are coarse. Recently, DUTS-Class [66] and its jigsawed version [66] have been proposed as a “transition plan” towards co-saliency. However, the former is just a “grouped” saliency detection dataset, while the latter introduces issues like shape distortion and independent boundaries. Evidence shows that these inappropriate training paradigms have caused serious biases in co-saliency detection models [19, 66]. Our dataset addresses the aforementioned problems. CAT/CAT+ is currently the largest co-saliency detection dataset and offers diverse semantics and high-quality annotations.

Figure 3: Overview of our group-cut-paste (GCP) procedure. Left block: The classifier generates a category label for each raw image and “groups” the images accordingly; Middle block: The object is “cut” out from the original image based on contours and bounding boxes; Right block: A new sample is synthesized by “pasting” a randomly sampled object into the canvas.

Evaluation Dataset. Pioneering co-saliency detection works evaluate model performance on small datasets like MSRC [56], iCoseg [2], and ImagePair [36]. Consisting of 2,015 images from 50 groups, CoSal2015 [62] is the most popular choice among modern models. However, most samples in it are still uni-object. Two recent datasets [19, 66] remedy this deficiency in terms of both scale and appearance. CoSOD3k [19] is a diverse dataset consisting of 160 categories and 3,316 images. About 70% and 20% of its images have one and two objects, respectively, while 10% have three or more objects. CoCA [66] consists of 80 categories with 1,295 images, which are challenging due to occlusion and cluttered backgrounds. At least one foreground object is introduced for every image, and some images have more than two co-salient objects. We adopt both of these datasets to evaluate model performance in our benchmark experiments.

3 Proposed Method

In this section, we first introduce the idea of counterfactual training in co-saliency detection. Under this high-level instruction, we propose a group-cut-paste (GCP) procedure to adjust visual context and generate training samples. Finally, we construct a novel dataset by following GCP.

3.1 Counterfactual Training

Consider a training image and its mask annotation, which share the same width and height. The image is a member of an image group, and a category label carries the semantic information for both the image and the group. The counterfactual [45] of a sample follows the usual template: one quantity would take a new value, had another quantity been different, given an observed fact.

In this template, a new image group is introduced, and the new mask is restricted so as to maintain the saliency [15] of the co-occurring object. In other words, the label does not change during the whole process. The given “fact” means that the group training paradigm remains the same, while the “counter-fact” (what if) indicates that the generated sample clashes with the original one.

The conceptual meaning of counterfactual training can be substantiated by the following instructions [43]:

Abduction - “Given the fact that …” When constructing a new image group for a sample, the co-saliency should remain unchanged. That is to say, the “fact” that the object in the original image is salient still holds for it in the new image.

Action - “Had … been …” After the “imagination”, the mask annotation should not change significantly. Here we intervene by keeping the semantic label the same both before and after.

Prediction - “… would be …” Conditioning on the “fact” and the intervention objective, we can generate the counterfactual sample and its mask from a conditional probability distribution.


3.2 Group-Cut-Paste

Following the counterfactual training instructions, we design the group-cut-paste (GCP) procedure. Specifically, the object map of an image is generated from its mask, i.e., by the element-wise multiplication of the image and the mask. The goal of GCP is to automatically generate a new training sample by combining an image and its mask from a target group with an object from a source group, where the two groups are identified by their category indexes. The generated sample is then used to train candidate models with their original loss functions. Figure 3 gives an overview of our method.
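The object map defined above, the element-wise product of an image and its binary mask, can be sketched in a few lines. This is only an illustration under stated assumptions: images are represented as nested lists of shape H x W x C, masks as binary H x W grids, and the function name `object_map` is hypothetical rather than taken from the paper.

```python
def object_map(image, mask):
    """Element-wise product of an H x W x C image and an H x W binary mask.

    Pixels where the mask is 1 are kept; all other pixels become zero.
    """
    return [
        [[channel * m for channel in pixel] for pixel, m in zip(img_row, mask_row)]
        for img_row, mask_row in zip(image, mask)
    ]


# Toy 2 x 2 RGB image with a mask that keeps only the main diagonal.
image = [[[10, 20, 30], [40, 50, 60]],
         [[70, 80, 90], [15, 25, 35]]]
mask = [[1, 0],
        [0, 1]]
obj = object_map(image, mask)
```

In practice the same operation would be a single broadcasted array multiplication; nested lists are used here only to keep the sketch dependency-free.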

Group. The first step is to build image groups. To effectively classify candidate images, we adopt WSL-Images [41], one of the current state-of-the-art classifiers in terms of Top-1 accuracy on ImageNet [11]. The category label of each image is generated by this pretrained classifier. We manually pick out the misclassified examples after grouping. This results in sample pairs distributed across semantic groups.
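The grouping step amounts to bucketing (image, mask) pairs by the label a pretrained classifier predicts. The sketch below assumes the classifier is available as a plain callable; `group_by_label` and `toy_classifier` are hypothetical names, and the toy classifier merely stands in for a real model such as WSL-Images.

```python
from collections import defaultdict


def group_by_label(samples, classify):
    """Bucket (image, mask) pairs by the category predicted for each image."""
    groups = defaultdict(list)
    for image, mask in samples:
        groups[classify(image)].append((image, mask))
    return dict(groups)


def toy_classifier(image):
    """Stand-in for a pretrained classifier (assumption, not the real model)."""
    return "banana" if image["hue"] == "yellow" else "baseball"


samples = [
    ({"hue": "yellow"}, "mask_0"),
    ({"hue": "white"}, "mask_1"),
    ({"hue": "yellow"}, "mask_2"),
]
groups = group_by_label(samples, toy_classifier)
```

The manual removal of misclassified examples described above would happen after this step and is not modeled here.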

Cut. Both the object map and the mask are used to prepare the candidate object. We adopt the border following algorithm [49] to extract external contour points from the mask. Based on the contours, the bounding rectangle with minimum area can be drawn. The area to be cut is then defined by the four vertices of this rotated box, which gives us the final object.
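A simplified version of the cut step is sketched below. Note the substitution: instead of border following [49] and a minimum-area rotated rectangle, this sketch uses an axis-aligned bounding box computed directly from the mask, which is easier to show without an image-processing library. The names `bounding_box` and `cut` are hypothetical.

```python
def bounding_box(mask):
    """Return (top, left, bottom, right), inclusive, of the nonzero mask region.

    Simplification: axis-aligned box, not the minimum-area rotated rectangle.
    """
    rows = [r for r, row in enumerate(mask) if any(row)]
    if not rows:
        raise ValueError("mask contains no foreground pixels")
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    return rows[0], cols[0], rows[-1], cols[-1]


def cut(grid, box):
    """Crop a 2-D grid (image, object map, or mask) to the given box."""
    top, left, bottom, right = box
    return [row[left:right + 1] for row in grid[top:bottom + 1]]


mask = [[0, 0, 0, 0],
        [0, 1, 1, 0],
        [0, 0, 1, 0],
        [0, 0, 0, 0]]
box = bounding_box(mask)
patch_mask = cut(mask, box)
```

The same `cut` call applied to the object map yields the pixel patch that accompanies `patch_mask`.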

Paste. Under the counterfactual training instructions, the candidate object can be pasted properly into the canvas image to form a new sample. Here we maintain the “fact” by preserving group consistency, i.e., we conduct this procedure in a group-by-group manner. To avoid severe occlusion, the paste position is randomly sampled from the regions outside the mask of the canvas. The center of the candidate object is chosen as the anchor for pasting. We execute the “action” by restricting the size of the candidate object and maintaining the label before and after pasting. The mask annotations are automatically synthesized by dyadic operations between the mask of the canvas and that of the pasted object. Finally, under a uniform probability distribution, “prediction” gives us new sample pairs for further model training. Our GCP operation can be easily applied on CPUs in parallel with the main GPU training tasks, which hides the computation cost and boosts the performance virtually for free.
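The paste step can be sketched as follows. Several details here are assumptions made for illustration: the anchor is drawn uniformly from pixels outside the canvas mask, the pasted object is centered on that anchor, and the new mask is obtained by removing any canvas-mask pixels that the pasted object covers (one possible reading of the "dyadic operations" mentioned above). The size restriction is not modeled, and the function name `paste` is hypothetical.

```python
import random


def paste(canvas, canvas_mask, patch, patch_mask, rng):
    """Paste `patch` onto a copy of `canvas`, centered on a random free pixel."""
    height, width = len(canvas), len(canvas[0])
    # Candidate anchors: pixels that lie outside the canvas's salient object.
    free = [(r, c) for r in range(height) for c in range(width)
            if not canvas_mask[r][c]]
    center_r, center_c = rng.choice(free)
    top = center_r - len(patch) // 2
    left = center_c - len(patch[0]) // 2

    new_image = [row[:] for row in canvas]
    new_mask = [row[:] for row in canvas_mask]
    for dr, (patch_row, pmask_row) in enumerate(zip(patch, patch_mask)):
        for dc, (value, keep) in enumerate(zip(patch_row, pmask_row)):
            r, c = top + dr, left + dc
            if keep and 0 <= r < height and 0 <= c < width:
                new_image[r][c] = value
                # Assumption: pixels occluded by the pasted object leave the mask.
                new_mask[r][c] = 0
    return new_image, new_mask


canvas = [[0] * 4 for _ in range(4)]
canvas_mask = [[1, 1, 0, 0],
               [0, 0, 0, 0],
               [0, 0, 0, 0],
               [0, 0, 0, 0]]
new_image, new_mask = paste(canvas, canvas_mask, [[9]], [[1]], random.Random(0))
```

Because the pasted object belongs to a different category, it acts as a distractor: it appears in the image but never adds foreground pixels to the mask.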

Figure 4: The taxonomic illustration of our dataset, which consists of 15 superclasses and 280 subclasses. Zoom in for details.

3.3 Constructing the Dataset

For the purpose of stabilization and to ensure high quality, we collect a dataset called Context Adjustment Training. We pick 8,375 images with clear saliency and sophisticated annotations as our canvases from existing saliency detection datasets [53, 15]. Following GCP, 280 semantic groups affiliated to 15 superclasses are built after the “grouping,” i.e., aves, electronic product (elec.), food, fruit, insect, instrument (instru.), kitchenware (kitch.), mammal (mamm.), marine, other, reptile (rept.), sports, tool, transportation (trans.), and vegetable (vege.). See Figure 4 for the taxonomy. We then “cut” all the objects out and discard those with incomplete shapes. During the “paste” stage, the size of each pasted object is restricted. Candidate objects are randomly flipped horizontally to increase diversity. We sample twice for each candidate image to generate two samples and re-sample unsatisfactory cases. This gives us 16,750 augmented images. Their corresponding masks are automatically generated. As shown in Figure 5, samples synthesized by counterfactual training and GCP can model realistic visual context and offer proper co-salient signals.

Figure 5: Samples from our dataset. Zoom in for details.
Figure 6: Patterns for representative groups and the overall dataset. Best viewed in color and zoomed-in for details.

Our dataset has two variants: CAT and CAT+. CAT consists of all 16,750 augmented images, while CAT+ contains both the augmented images and twice the original images (i.e., 33,500 images, which is ten times larger than the current largest evaluation dataset). In short, CAT+ is larger in scale, and CAT is more challenging in terms of the proportion of complex contexts. See Table 1 for more dataset statistics. Some patterns (calculated by averaging overlapping masks) of our dataset are shown in Figure 6. The overall pattern of our dataset tends to be a “round” shape, while categories with unique shapes (e.g., cello and limousine) result in shape-biased patterns, which is consistent with the definition of saliency [15]. More details of our dataset can be found in the supplementary material.
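The dataset sizes quoted above follow from simple arithmetic on the numbers given in this section and in Table 1. The short check below uses only those figures; the variable names are illustrative.

```python
canvas_images = 8_375        # canvases picked from saliency detection datasets
samples_per_canvas = 2       # each canvas is sampled twice during "paste"
semantic_groups = 280

cat_size = canvas_images * samples_per_canvas    # augmented images only
cat_plus_size = cat_size + 2 * canvas_images     # plus twice the originals

avg_per_group_cat = round(cat_size / semantic_groups, 1)
avg_per_group_cat_plus = round(cat_plus_size / semantic_groups, 1)
```

The resulting values match the #Img. and #Avg. columns reported for CAT and CAT+ in Table 1.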

4 Benchmark Experiment

In this section, we conduct a comprehensive benchmark study for co-saliency detection to verify the effectiveness and superiority of our proposed methods and dataset.

4.1 Implementation Detail

Training. We compare our CAT/CAT+ with five popular datasets, i.e., COCO-SEG [52], DUTS-TR [53], COCO9213 [38], DUTS-Class [66], and Jigsaw [66]. Four current state-of-the-art models are selected for our benchmark experiments, i.e., PoolNet [39], EGNet [67], ICNet [30], and GICD [66]. During training, all input data, including images, masks, and edge maps, are resized to a common resolution before being fed into the networks. NVIDIA GeForce RTX 2080Ti graphics cards are used throughout our experiments.

Models. For PoolNet [39], we adopt its Res2Net [21] version and the Adam optimizer [31] with weight decay. For EGNet [67], the VGG [48] version is chosen. We follow the default settings by using the Adam optimizer [31] with weight decay and a weighted loss, and we divide the learning rate partway through training. Edge maps are prepared from the corresponding masks of each dataset. For ICNet [30], we adopt both the VGG [48] and ResNet [23] versions of EGNet [67] to prepare the single-image saliency maps (SISMs), and the Adam optimizer [31] is used. For GICD [66], we keep the original training paradigm of randomly selecting at most 20 samples from each image group for each training epoch. The Adam optimizer [31] is adopted, and the learning rate is reduced at regular epoch intervals.

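| Evaluation | Training dataset | MAE ↓ | max F ↑ | mean F ↑ | max E ↑ | mean E ↑ | S ↑ |
|---|---|---|---|---|---|---|---|
| CoSOD3k [19] | DUTS-TR [53] | 0.2089 | 0.5177 | 0.4710 | 0.7249 | 0.6463 | 0.6205 |
| CoSOD3k [19] | DUTS-Class [66] | 0.2049 | 0.5077 | 0.4703 | 0.7102 | 0.6495 | 0.5906 |
| CoSOD3k [19] | Jigsaw [66] | 0.1895 | 0.5437 | 0.5029 | 0.7415 | 0.6770 | 0.6195 |
| CoSOD3k [19] | COCO9213 [38] | 0.2255 | 0.4883 | 0.4525 | 0.7195 | 0.6390 | 0.6012 |
| CoSOD3k [19] | CAT (Ours) | 0.1760 | 0.5503 | 0.5089 | 0.7424 | 0.6783 | 0.6288 |
| CoSOD3k [19] | CAT+ (Ours) | 0.1713 | 0.5614 | 0.5281 | 0.7509 | 0.6886 | 0.6355 |
| CoCA [66] | DUTS-TR [53] | 0.2350 | 0.2993 | 0.2732 | 0.6764 | 0.5582 | 0.5254 |
| CoCA [66] | DUTS-Class [66] | 0.2102 | 0.3230 | 0.2987 | 0.6953 | 0.5945 | 0.5347 |
| CoCA [66] | Jigsaw [66] | 0.2224 | 0.3238 | 0.2941 | 0.7008 | 0.5705 | 0.5470 |
| CoCA [66] | COCO9213 [38] | 0.2558 | 0.2833 | 0.2613 | 0.6975 | 0.5413 | 0.5095 |
| CoCA [66] | CAT (Ours) | 0.1820 | 0.3348 | 0.3070 | 0.7075 | 0.6207 | 0.5555 |
| CoCA [66] | CAT+ (Ours) | 0.1809 | 0.3459 | 0.3230 | 0.7071 | 0.6253 | 0.5605 |

Table 2: Benchmarking results of PoolNet [39] trained on different datasets. Symbols ↑ and ↓ denote “the higher the better” and “the lower the better”, respectively. A superscript marks reported results or results reproduced from public checkpoints.
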
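| Evaluation | Training dataset | MAE ↓ | max F ↑ | mean F ↑ | max E ↑ | mean E ↑ | S ↑ |
|---|---|---|---|---|---|---|---|
| CoSOD3k [19] | DUTS-TR [53] | 0.1184 | 0.6961 | 0.6744 | 0.7934 | 0.7707 | 0.7604 |
| CoSOD3k [19] | DUTS-Class [66] | 0.1132 | 0.7042 | 0.6823 | 0.8011 | 0.7927 | 0.7695 |
| CoSOD3k [19] | Jigsaw [66] | 0.1028 | 0.7243 | 0.7096 | 0.8181 | 0.8059 | 0.7789 |
| CoSOD3k [19] | COCO9213 [38] | 0.1222 | 0.7024 | 0.6719 | 0.8098 | 0.7811 | 0.7621 |
| CoSOD3k [19] | CAT (Ours) | 0.0884 | 0.7440 | 0.7313 | 0.8409 | 0.8270 | 0.7914 |
| CoSOD3k [19] | CAT+ (Ours) | 0.0881 | 0.7448 | 0.7351 | 0.8398 | 0.8303 | 0.7911 |
| CoCA [66] | DUTS-TR [53] | 0.1822 | 0.4251 | 0.4098 | 0.6751 | 0.6101 | 0.6038 |
| CoCA [66] | DUTS-Class [66] | 0.1663 | 0.4398 | 0.4270 | 0.6873 | 0.6403 | 0.6200 |
| CoCA [66] | Jigsaw [66] | 0.1644 | 0.4311 | 0.4215 | 0.6684 | 0.6401 | 0.6158 |
| CoCA [66] | COCO9213 [38] | 0.1779 | 0.4301 | 0.4078 | 0.6920 | 0.6351 | 0.6127 |
| CoCA [66] | CAT (Ours) | 0.1448 | 0.4399 | 0.4306 | 0.6935 | 0.6767 | 0.6231 |
| CoCA [66] | CAT+ (Ours) | 0.1447 | 0.4404 | 0.4335 | 0.6909 | 0.6762 | 0.6247 |

Table 3: Benchmarking results of EGNet [67] trained on different datasets. Symbols ↑ and ↓ denote “the higher the better” and “the lower the better”, respectively. A superscript marks reported results or results reproduced from public checkpoints.
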
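| Evaluation | Training dataset | SISMs | MAE ↓ | max F ↑ | mean F ↑ | max E ↑ | mean E ↑ | S ↑ |
|---|---|---|---|---|---|---|---|---|
| CoSOD3k [19] | DUTS-Class [66] | VGG | 0.0882 | 0.7620 | 0.7537 | 0.8398 | 0.8354 | 0.7976 |
| CoSOD3k [19] | DUTS-Class [66] | ResNet | 0.0893 | 0.7576 | 0.7499 | 0.8383 | 0.8341 | 0.7933 |
| CoSOD3k [19] | Jigsaw [66] | VGG | 0.0887 | 0.7547 | 0.7467 | 0.8357 | 0.8313 | 0.7942 |
| CoSOD3k [19] | Jigsaw [66] | ResNet | 0.0967 | 0.7361 | 0.7294 | 0.8150 | 0.8101 | 0.7757 |
| CoSOD3k [19] | COCO9213 [38] | VGG | 0.0895 | 0.7619 | 0.7517 | 0.8446 | 0.8404 | 0.7936 |
| CoSOD3k [19] | COCO9213 [38] | ResNet | 0.0953 | 0.7665 | 0.7431 | 0.8485 | 0.8345 | 0.7966 |
| CoSOD3k [19] | CAT (Ours) | VGG | 0.0772 | 0.7826 | 0.7687 | 0.8666 | 0.8570 | 0.8104 |
| CoSOD3k [19] | CAT (Ours) | ResNet | 0.0714 | 0.7763 | 0.7681 | 0.8552 | 0.8508 | 0.8092 |
| CoSOD3k [19] | CAT+ (Ours) | VGG | 0.0680 | 0.7907 | 0.7817 | 0.8702 | 0.8656 | 0.8105 |
| CoSOD3k [19] | CAT+ (Ours) | ResNet | 0.0666 | 0.7919 | 0.7832 | 0.8718 | 0.8670 | 0.8174 |
| CoCA [66] | DUTS-Class [66] | VGG | 0.1481 | 0.4728 | 0.4652 | 0.7007 | 0.6673 | 0.6368 |
| CoCA [66] | DUTS-Class [66] | ResNet | 0.1462 | 0.4694 | 0.4635 | 0.7001 | 0.6766 | 0.6342 |
| CoCA [66] | Jigsaw [66] | VGG | 0.1426 | 0.4671 | 0.4609 | 0.6935 | 0.6695 | 0.6372 |
| CoCA [66] | Jigsaw [66] | ResNet | 0.1351 | 0.4759 | 0.4701 | 0.6928 | 0.6793 | 0.6412 |
| CoCA [66] | COCO9213 [38] | VGG | 0.1470 | 0.5133 | 0.5018 | 0.7042 | 0.6835 | 0.6541 |
| CoCA [66] | COCO9213 [38] | ResNet | 0.1546 | 0.5219 | 0.4944 | 0.7166 | 0.6701 | 0.6531 |
| CoCA [66] | CAT (Ours) | VGG | 0.1126 | 0.5308 | 0.5195 | 0.7497 | 0.7361 | 0.6723 |
| CoCA [66] | CAT (Ours) | ResNet | 0.1011 | 0.5245 | 0.5181 | 0.7396 | 0.7327 | 0.6728 |
| CoCA [66] | CAT+ (Ours) | VGG | 0.1081 | 0.5292 | 0.5218 | 0.7435 | 0.7336 | 0.6729 |
| CoCA [66] | CAT+ (Ours) | ResNet | 0.1014 | 0.5329 | 0.5256 | 0.7510 | 0.7430 | 0.6783 |

Table 4: Benchmarking results of ICNet [30] trained on different datasets. The SISMs column indicates whether the single-image saliency maps are generated by EGNet [67] with a VGG [48] backbone or with a ResNet [23] backbone. Symbols ↑ and ↓ denote “the higher the better” and “the lower the better”, respectively. A superscript marks reported results or results reproduced from public checkpoints.
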
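| Evaluation | Training dataset | MAE ↓ | max F ↑ | mean F ↑ | max E ↑ | mean E ↑ | S ↑ |
|---|---|---|---|---|---|---|---|
| CoSOD3k [19] | COCO-SEG [52] | 0.0799 | 0.7673 | 0.7553 | 0.8579 | 0.8526 | 0.7931 |
| CoSOD3k [19] | DUTS-Class [66] | 0.0815 | 0.7659 | 0.7567 | 0.8493 | 0.8448 | 0.8016 |
| CoSOD3k [19] | Jigsaw [66] | 0.0794 | 0.7695 | 0.7628 | 0.8478 | 0.8446 | 0.7969 |
| CoSOD3k [19] | COCO9213 [38] | 0.0837 | 0.7526 | 0.7427 | 0.8427 | 0.8390 | 0.7835 |
| CoSOD3k [19] | CAT (Ours) | 0.0727 | 0.7809 | 0.7758 | 0.8602 | 0.8570 | 0.8052 |
| CoSOD3k [19] | CAT+ (Ours) | 0.0701 | 0.7864 | 0.7785 | 0.8864 | 0.8631 | 0.8121 |
| CoCA [66] | COCO-SEG [52] | 0.1330 | 0.5082 | 0.5011 | 0.7075 | 0.6955 | 0.6536 |
| CoCA [66] | DUTS-Class [66] | 0.1502 | 0.4746 | 0.4653 | 0.6735 | 0.6561 | 0.6324 |
| CoCA [66] | Jigsaw [66] | 0.1260 | 0.5110 | 0.5043 | 0.7177 | 0.7051 | 0.6563 |
| CoCA [66] | COCO9213 [38] | 0.1431 | 0.4990 | 0.4869 | 0.6985 | 0.6794 | 0.6412 |
| CoCA [66] | CAT (Ours) | 0.1187 | 0.5104 | 0.5059 | 0.7230 | 0.7133 | 0.6563 |
| CoCA [66] | CAT+ (Ours) | 0.1157 | 0.5233 | 0.5159 | 0.7266 | 0.7163 | 0.6676 |

Table 5: Benchmarking results of GICD [66] trained on different datasets. Symbols ↑ and ↓ denote “the higher the better” and “the lower the better”, respectively. A superscript marks reported results or results reproduced from public checkpoints.
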
| # | MAE ↓ | max F ↑ | mean F ↑ | max E ↑ | mean E ↑ | S ↑ |
|---|---|---|---|---|---|---|
| A | .2033 | .4187 | .4132 | .6118 | .6015 | .5913 |
| B | .1499 | .4716 | .4661 | .6793 | .6627 | .6331 |
| C | .1783 | .4081 | .4020 | .6305 | .6220 | .5823 |
| D | .1729 | .4166 | .4108 | .6479 | .6348 | .5857 |
| E | .1392 | .4602 | .4559 | .6994 | .6913 | .6268 |
| F | .1302 | .4742 | .4691 | .6999 | .6919 | .6382 |
| Ours | .1157 | .5233 | .5159 | .7266 | .7163 | .6676 |

Table 6: Ablation study for our proposed approach with the GICD model on the CoCA dataset. Each row conducts, randomly conducts, or omits the Group, Cut, and Paste operations.

Evaluation. We choose the two recent evaluation datasets CoSOD3k [19] and CoCA [66] to evaluate model performance. Four conventional metrics are adopted: mean absolute error (MAE) [8], F-measure [1], E-measure [17], and structure measure (S-measure) [16]. For F-measure [1], we follow the conventional weight setting and report both the maximum score and the mean score. We also report both the maximum and the mean scores for E-measure [17].
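Two of these metrics are simple enough to sketch directly. The snippet below is an illustration under stated assumptions: predictions are H x W grids with values in [0, 1], ground truth is binary, the F-measure weight is set to the value 0.3 that is customary in the saliency literature (the exact value used in these experiments is not stated here), and the maximum and mean F-measure are taken over 256 evenly spaced thresholds. All function names are hypothetical.

```python
def mean_absolute_error(pred, gt):
    """Average absolute difference between prediction and ground truth."""
    total = sum(abs(p - g) for pr, gr in zip(pred, gt) for p, g in zip(pr, gr))
    return total / (len(gt) * len(gt[0]))


def f_measure(pred, gt, threshold=0.5, beta_sq=0.3):
    """Weighted F-measure of the prediction binarized at `threshold`.

    beta_sq = 0.3 is an assumed, commonly used weight.
    """
    tp = fp = fn = 0
    for pr, gr in zip(pred, gt):
        for p, g in zip(pr, gr):
            hit = p >= threshold
            if hit and g:
                tp += 1
            elif hit:
                fp += 1
            elif g:
                fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


pred = [[0.9, 0.2],
        [0.6, 0.1]]
gt = [[1, 0],
      [0, 1]]

mae = mean_absolute_error(pred, gt)
f_at_half = f_measure(pred, gt)

# Maximum and mean F-measure over a sweep of binarization thresholds.
scores = [f_measure(pred, gt, t / 255) for t in range(256)]
max_f = max(scores)
mean_f = sum(scores) / len(scores)
```

E-measure and S-measure involve alignment and structural terms and are best taken from the reference implementations of [17] and [16].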

4.2 Quantitative Analysis

Longitudinal Comparison. From Tables 2, 3, 4 & 5, we observe that models trained on CAT/CAT+ achieve better performance than those trained on other datasets in all metrics on both CoSOD3k [19] and CoCA [66]. This suggests that the benefit of CAT/CAT+ is agnostic to models and backbones. Specifically, the improvements are typically larger in MAE. Significant boosts are achieved on PoolNet [39], EGNet [67], and ICNet [30], as well as a gain on GICD [66]. Regarding the average improvements on other metrics, we observe that CAT/CAT+ improves more on F-measure, which shows that instead of radically regarding every foreground object as salient, models trained on CAT/CAT+ tend to make relatively conservative predictions and focus more on regions with co-salient objects only. This supports our analysis in Sections 1 & 3. It also provides improvements in E-measure and structure measure. Indeed, the structure measure is the hardest among current metrics, as it measures both accuracy and structural similarity. Although models are capable of making more confident predictions now, they cannot capture very detailed information like boundaries. Overall, we believe that the diversity and quality of CAT/CAT+ have laid the foundation for more in-depth work in the future.

Horizontal Comparison. Changing the direction of comparison, we notice that the evaluation scores on CoSOD3k [19] are much higher than those on CoCA [66]. This is in line with the characteristics of these two datasets: although the former is larger in scale, the latter has more complex contexts and more varied appearance, such as occlusion and object size. Besides, the performance of co-saliency detection models is much better than that of saliency detection models, which emphasizes again the importance of specialized modules/techniques designed for identifying co-salient signals and suppressing non-salient ones. In addition, the performance of CAT+ is slightly better than that of CAT. A possible reason for this is that CAT+ has a larger scale, which increases the robustness of the model. We will investigate this hypothesis in future work.

Figure 7: Qualitative results of different datasets trained by GICD [66]. Left: results on CoSOD3k [19]; Right: results on CoCA [66]. Groups from left to right: frying pan, banana, dolphin, pigeon.

Ablation Study. We conduct an ablation study to verify the effectiveness of our methods (see Table 6). The results for the original images are shown in rows A and B. We also provide other combinations for comparisons, e.g., group images w/ classifiers (vs. randomly allocate images to the same number of groups), cut candidate objects out before paste (vs. do not cut and paste the sampled images directly), and paste objects/images under counterfactual rules (vs. no paste or randomly paste w/o limitations). Specifically, “Grouping” offers proper inter-saliency signals, which is essential for models to find the common salient objects between image groups (A vs. B, C vs. E). “Cut” gives fine-grained category information and further improves performance (C vs. D, E vs. F). As the key part, “Paste” provides intra-saliency signals, which determines the quality of the collaborative learning paradigm. A proper introduction of such signals significantly boosts the performance (Ours vs. F), and vice versa (A vs. C, B vs. E).

4.3 Qualitative Analysis

Figure 7 illustrates some visualization results on CoSOD3k [19] and CoCA [66] generated by GICD [66] trained with different training datasets. It can be easily observed that the quality of a dataset plays an important role during training. Models trained on a high-quality dataset like CAT/CAT+ tend to focus more on the true co-salient signals while suppressing the non-salient ones. Take the frying pan group in columns 1 to 3 as an example. Both semantic segmentation and saliency detection datasets fail to teach the model “what” and “where” to focus, and thus let it mistakenly regard items other than frying pans as co-salient. In contrast, the model trained on CAT/CAT+ can well distinguish co-salient objects from non-salient clutter even when they share similar shapes, textures, colors, etc. Please refer to our supplementary material for more examples. Another point worthy of attention is that existing models cannot restore the details (e.g., the boundaries) of objects well, especially those of small objects. For example, the contours of the pigeons in the last two columns of Figure 7 are not well drawn. A possible reason is that most current co-saliency models adopt VGG-16 [48] as their backbone and rely on low-resolution inputs (e.g., 224 × 224), so some minor but important features are lost during down-sampling. With the emergence of deeper and more efficient backbones, we encourage future works to explore more possibilities.
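The effect of down-sampling on small objects can be illustrated with a toy example. The sketch below is an assumption-laden illustration rather than part of our pipeline: it uses block averaging as a stand-in for the resizing and pooling in a real backbone, and the helper name `downsample_mean` is hypothetical.

```python
import numpy as np


def downsample_mean(mask, factor):
    """Down-sample a 2D mask by averaging non-overlapping factor x factor blocks.

    Block averaging is used here only as a simple proxy for the
    resolution loss caused by resizing and pooling.
    """
    H, W = mask.shape
    if H % factor or W % factor:
        raise ValueError("mask size must be divisible by factor")
    blocks = mask.astype(float).reshape(H // factor, factor, W // factor, factor)
    return blocks.mean(axis=(1, 3))


# A one-pixel object and a 4 x 4 object on an 8 x 8 grid.
small = np.zeros((8, 8))
small[3, 3] = 1
large = np.zeros((8, 8))
large[0:4, 0:4] = 1

# After 4x down-sampling, the small object contributes only 1/16 of a cell
# and disappears once the map is thresholded at 0.5; the large one survives.
small_low = downsample_mean(small, 4)
large_low = downsample_mean(large, 4)
```

Under this proxy, any structure much thinner than the down-sampling factor, such as a fine contour, is averaged away, which is consistent with the blurred pigeon boundaries discussed above.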

5 Conclusion

In this work, we addressed the inconsistency problem in co-saliency detection. Borrowing the idea of cause and effect, we proposed a counterfactual training framework to formalize the co-salient signals during training. A group-cut-paste procedure is introduced to leverage existing data samples and make “cost-free” augmentation by context adjustment. Following this procedure, we collected a large-scale dataset called Context Adjustment Training. The abundant semantics and high-quality annotations of our dataset liberate models from the impediment caused by spurious variations and biases, and significantly improve model performance. Moving forward, we will take a deeper look at the co-salient signals both across image groups and within each sample, and improve the quality and scale of our dataset accordingly.


  • [1] R. Achanta, S. Hemami, F. Estrada, and S. Susstrunk (2009) Frequency-tuned salient region detection. CVPR. Cited by: §4.1.
  • [2] D. Batra, A. Kowdle, D. Parikh, J. Luo, and T. Chen (2010) ICoseg: interactive co-segmentation with intelligent scribble guidance. CVPR. Cited by: Table 1, §2, §2.
  • [3] A. Borji, M. Cheng, Q. Hou, H. Jiang, and J. Li (2019) Salient object detection: a survey. Comp. Visual Media 5 (2), pp. 117–150. Cited by: §1.
  • [4] H. Chen (2010) Preattentive co-saliency detection. ICIP. Cited by: §2.
  • [5] L. Chen, X. Yan, J. Xiao, H. Zhang, S. Pu, and Y. Zhuang (2020) Counterfactual samples synthesizing for robust visual question answering. CVPR. Cited by: §1.
  • [6] M. Cheng, N. J. Mitra, X. Huang, and S. Hu (2014) Salientshape: group saliency in image collections. Vis. Comput. 30 (4), pp. 443–453. Cited by: Table 1, §2.
  • [7] M. Cheng, N. J. Mitra, X. Huang, P. H. Torr, and S. Hu (2014) Global contrast based salient region detection. IEEE Trans. Pattern Anal. Mach. Intell. 37 (3), pp. 569–582. Cited by: Table 1, §2.
  • [8] M. Cheng, J. Warrell, W. Lin, S. Zheng, V. Vineet, and N. Crook (2013) Efficient salient region detection with soft image abstraction. ICCV. Cited by: §4.1.
  • [9] R. Cong, J. Lei, H. Fu, Q. Huang, X. Cao, and C. Hou (2017) Co-saliency detection for rgbd images based on multi-constraint feature matching and cross label propagation. IEEE Trans. Image Process. 27 (2), pp. 568–579. Cited by: §2.
  • [10] J. Dai, Y. N. Wu, J. Zhou, and S. Zhu (2013) Cosegmentation and cosketch by unsupervised learning. ICCV. Cited by: Table 1, §2.
  • [11] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) ImageNet: a large-scale hierarchical image database. CVPR. Cited by: §3.2.
  • [12] T. DeVries and G. W. Taylor (2017) Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552. Cited by: §2.
  • [13] S. K. Divvala, D. Hoiem, J. H. Hays, A. A. Efros, and M. Hebert (2009) An empirical study of context in object detection. CVPR. Cited by: §1.
  • [14] D. Dwibedi, I. Misra, and M. Hebert (2017) Cut, paste and learn: surprisingly easy synthesis for instance detection. ICCV. Cited by: §2.
  • [15] D. Fan, M. Cheng, J. Liu, S. Gao, Q. Hou, and A. Borji (2018) Salient objects in clutter: bringing salient object detection to the foreground. ECCV. Cited by: §2, §3.1, §3.3, §3.3.
  • [16] D. Fan, M. Cheng, Y. Liu, T. Li, and A. Borji (2017) Structure-measure: a new way to evaluate foreground maps. ICCV. Cited by: §4.1.
  • [17] D. Fan, C. Gong, Y. Cao, B. Ren, M. Cheng, and A. Borji (2018) Enhanced-alignment measure for binary foreground map evaluation. IJCAI. Cited by: §4.1.
  • [18] D. Fan, T. Li, Z. Lin, G. Ji, D. Zhang, M. Cheng, H. Fu, and J. Shen (2021) Re-thinking co-salient object detection. IEEE Trans. Pattern Anal. Mach. Intell.. Cited by: §2.
  • [19] D. Fan, Z. Lin, G. Ji, D. Zhang, H. Fu, and M. Cheng (2020) Taking a deeper look at co-salient object detection. CVPR. Cited by: Figure 1, §1, §1, §1, §2, §2, Figure 7, §4.1, §4.2, §4.2, §4.3, Table 2, Table 3, Table 4, Table 5.
  • [20] H. Fu, X. Cao, and Z. Tu (2013) Cluster-based co-saliency detection. IEEE Trans. Image Process. 22 (10), pp. 3766–3778. Cited by: §2.
  • [21] S. Gao, M. Cheng, K. Zhao, X. Zhang, M. Yang, and P. H. Torr (2021) Res2Net: a new multi-scale backbone architecture. IEEE Trans. Pattern Anal. Mach. Intell. 43 (2), pp. 652–662. Cited by: §4.1.
  • [22] Z. Gao, C. Xu, H. Zhang, S. Li, and V. H. C. de Albuquerque (2020) Trustful internet of surveillance things based on deeply represented visual co-saliency detection. IEEE Internet Things J. 7 (5), pp. 4092–4100. Cited by: §1.
  • [23] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. CVPR. Cited by: §4.1, Table 4.
  • [24] D. Hendrycks, N. Mu, E. D. Cubuk, B. Zoph, J. Gilmer, and B. Lakshminarayanan (2020) Augmix: a simple data processing method to improve robustness and uncertainty. ICLR. Cited by: §2.
  • [25] L. Herranz, S. Jiang, and X. Li (2016) Scene recognition with cnns: objects, scales and dataset bias. CVPR. Cited by: §1.
  • [26] D. E. Jacobs, D. B. Goldman, and E. Shechtman (2010) Cosaliency: where people look when comparing images. UIST. Cited by: §1, §2.
  • [27] K. R. Jerripothula, J. Cai, and J. Yuan (2016) CATS: co-saliency activated tracklet selection for video co-localization. ECCV. Cited by: §1.
  • [28] K. R. Jerripothula, J. Cai, and J. Yuan (2019) Efficient video object co-localization with co-saliency activated tracklets. IEEE Trans. Circuits Syst. Video Technol.. Cited by: §1.
  • [29] H. Jiang, J. Wang, Z. Yuan, Y. Wu, N. Zheng, and S. Li (2013) Salient object detection: a discriminative regional feature integration approach. CVPR. Cited by: Table 1.
  • [30] W. Jin, J. Xu, M. Cheng, Y. Zhang, and W. Guo (2020) ICNet: intra-saliency correlation network for co-saliency detection. NeurIPS. Cited by: §2, §4.1, §4.2, Table 4.
  • [31] D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. ICLR. Cited by: §4.1.
  • [32] P. Krähenbühl and V. Koltun (2011) Efficient inference in fully connected crfs with gaussian edge potentials. NeurIPS. Cited by: §2.
  • [33] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner (1998) Gradient-based learning applied to document recognition. Proc. IEEE. Cited by: §1.
  • [34] B. Li, Z. Sun, Q. Li, Y. Wu, and A. Hu (2019) Group-wise deep object co-segmentation with co-attention recurrent neural network. ICCV. Cited by: Table 1.
  • [35] H. Li, F. Meng, and K. N. Ngan (2013) Co-salient object detection from multiple images. IEEE Trans. Multimedia 15 (8), pp. 1896–1909. Cited by: §2.
  • [36] H. Li and K. N. Ngan (2011) A co-saliency model of image pairs. IEEE Trans. Image Process. 20 (12), pp. 3365–3375. Cited by: §2, §2.
  • [37] T. Li, K. Zhang, S. Shen, B. Liu, Q. Liu, and Z. Li (2021) Image co-saliency detection and instance co-segmentation using attention graph clustering based graph convolutional network. IEEE Trans. Multimedia. Cited by: §2.
  • [38] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft coco: common objects in context. ECCV. Cited by: §1, Table 1, §2, §4.1, Table 2, Table 3, Table 4, Table 5.
  • [39] J. Liu, Q. Hou, M. Cheng, J. Feng, and J. Jiang (2019) A simple pooling-based design for real-time salient object detection. CVPR. Cited by: §4.1, §4.2, Table 2.
  • [40] Z. Liu, W. Zou, L. Li, L. Shen, and O. L. Meur (2014) Co-saliency detection based on hierarchical segmentation. IEEE Signal Process. Lett. 21 (1), pp. 88–92. Cited by: §2.
  • [41] D. Mahajan, R. Girshick, V. Ramanathan, K. He, M. Paluri, Y. Li, A. Bharambe, and L. V. D. Maaten (2018) Exploring the limits of weakly supervised pretraining. ECCV. Cited by: §3.2.
  • [42] Y. Niu, K. Tang, H. Zhang, Z. Lu, X. Hua, and J. Wen (2020) Counterfactual vqa: a cause-effect look at language bias. arXiv preprint arXiv:2006.04315. Cited by: §1, §1.
  • [43] J. Pearl, M. Glymour, and N. P. Jewell (2016) Causal inference in statistics: a primer. John Wiley & Sons. Cited by: §1, §3.1.
  • [44] J. Pearl and D. Mackenzie (2018) The book of why: the new science of cause and effect. Basic Books. Cited by: §1.
  • [45] J. Pearl (2009) Causality. Cambridge University Press. Cited by: §3.1.
  • [46] J. Pearl (2013) Direct and indirect effects. arXiv preprint arXiv:1301.2300. Cited by: §1.
  • [47] X. Qian, X. Cheng, G. Cheng, X. Yao, and L. Jiang (2021) Two-stream encoder gan with progressive training for co-saliency detection. IEEE Signal Process. Lett. 28, pp. 180–184. Cited by: §2.
  • [48] K. Simonyan and A. Zisserman (2015) Very deep convolutional networks for large-scale image recognition. ICLR. Cited by: §2, §4.1, §4.3, Table 4.
  • [49] S. Suzuki (1985) Topological structural analysis of digitized binary images by border following. Comput. Vis. Graphics Image Process. 30 (1), pp. 32–46. Cited by: §3.2.
  • [50] K. Tang, J. Huang, and H. Zhang (2020) Long-tailed classification by keeping the good and removing the bad momentum causal effect. NeurIPS. Cited by: §1.
  • [51] A. Torralba and A. A. Efros (2011) Unbiased look at dataset bias. CVPR. Cited by: §1.
  • [52] C. Wang, Z. Zha, D. Liu, and H. Xie (2019) Robust deep co-saliency detection with group semantic. AAAI. Cited by: Table 1, §2, §4.1, Table 5.
  • [53] L. Wang, H. Lu, Y. Wang, M. Feng, D. Wang, B. Yin, and X. Ruan (2017) Learning to detect salient objects with image-level supervision. CVPR. Cited by: §1, Table 1, §2, §3.3, §4.1, Table 2, Table 3.
  • [54] T. Wang, J. Huang, H. Zhang, and Q. Sun (2020) Visual commonsense r-cnn. CVPR. Cited by: §1.
  • [55] W. Wang, Q. Lai, H. Fu, J. Shen, H. Ling, and R. Yang (2021) Salient object detection in the deep learning era: an in-depth survey. IEEE Trans. Pattern Anal. Mach. Intell.. Cited by: §1.
  • [56] J. Winn, A. Criminisi, and T. Minka (2005) Object categorization by learned universal visual dictionary. ICCV. Cited by: Table 1, §2, §2.
  • [57] C. Yang, L. Zhang, H. Lu, X. Ruan, and M. Yang (2013) Saliency detection via graph-based manifold ranking. CVPR. Cited by: Table 1, §2.
  • [58] X. Yao, J. Han, D. Zhang, and F. Nie (2017) Revisiting co-saliency detection: a novel approach based on two-stage multi-view spectral rotation co-clustering. IEEE Trans. Image Process. 26 (7), pp. 3196–3209. Cited by: §2.
  • [59] Y. Yue, Q. Zou, H. Yu, Q. Wang, and S. Wang (2019) An end-to-end network for co-saliency detection in one single image. arXiv preprint arXiv:1910.11819. Cited by: Table 1.
  • [60] Z. Yue, H. Zhang, Q. Sun, and X. Hua (2020) Interventional few-shot learning. NeurIPS. Cited by: §1.
  • [61] S. Yun, D. Han, S. J. Oh, S. Chun, J. Choe, and Y. Yoo (2019) CutMix: regularization strategy to train strong classifiers with localizable features. ICCV. Cited by: §2.
  • [62] D. Zhang, J. Han, C. Li, J. Wang, and X. Li (2016) Detection of co-salient objects by looking deep and wide. Int. J. Comput. Vis. 120 (2), pp. 215–232. Cited by: Table 1, §2.
  • [63] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz (2018) Mixup: beyond empirical risk minimization. ICLR. Cited by: §2.
  • [64] K. Zhang, T. Li, S. Shen, B. Liu, J. Chen, and Q. Liu (2020) Adaptive graph convolutional network with attention graph clustering for co-saliency detection. CVPR. Cited by: §2.
  • [65] Q. Zhang, R. Cong, J. Hou, C. Li, and Y. Zhao (2020) CoADNet: collaborative aggregation-and-distribution networks for co-salient object detection. NeurIPS. Cited by: §2.
  • [66] Z. Zhang, W. Jin, J. Xu, and M. Cheng (2020) Gradient-induced co-saliency detection. ECCV. Cited by: Figure 1, §1, §1, §1, Table 1, §2, §2, §2, §2, Figure 7, §4.1, §4.1, §4.2, §4.2, §4.3, Table 2, Table 3, Table 4, Table 5.
  • [67] J. Zhao, J. Liu, D. Fan, Y. Cao, J. Yang, and M. Cheng (2019) EGNet: edge guidance network for salient object detection. ICCV. Cited by: §4.1, §4.2, Table 3, Table 4.
  • [68] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba (2016) Learning deep features for discriminative localization. CVPR. Cited by: §2.