Saliency detection attempts to mimic the human visual system by automatically detecting and segmenting out the object regions that attract the most attention in an image. Today's saliency detection systems, especially those equipped with deep neural networks, are good at identifying salient objects in images. However, the single-object setting is idealized. In real-world applications like image/video retrieval, surveillance, and video analysis, scenes frequently contain multiple objects co-occurring in or across frames. This motivates our community to explore co-saliency detection. Sharing a similar rationale with saliency detection, co-saliency detection automatically detects and segments out the common object regions that attract the most attention within image groups.
Such an imitation can be achieved by the learning paradigm behind artificial neural networks, especially convolutional neural networks. However, "there is no such thing as a free lunch." The success of deep learning systems depends heavily on large-scale datasets, which take enormous effort and time to collect. The conventional way [19, 66] of constructing a specialized co-saliency detection dataset involves time-consuming and labor-intensive data selection and pixel-level labeling. As shown in Figure 1, the uni-object distribution of the current training data (middle block) deviates heavily from that of the evaluation data (top block), which consists of complex contexts with multiple foreground objects. Models that encode such spurious variations and biases fail to capture the co-salient signals and thus make incorrect predictions [51, 25].
In the context of co-saliency detection, we denote the training, testing, and true (real-world) distributions as D_train, D_test, and D_real, respectively. Recent works on evaluation [19, 66] move one step closer towards D_real by identifying co-salient signals from multi-foreground clusters in a group-by-group manner. However, a dataset that can well define D_train is still missing.
As we will discuss in Section 2, most recent works borrow semantic segmentation datasets like MS-COCO and saliency detection datasets like DUTS to train their models, which exacerbates the inconsistency among D_train, D_test, and D_real. Not surprisingly, as models only see naive examples during training, they inevitably cater to the seen idiosyncrasies and thus get stuck in the unseen world with biased assumptions.
In this paper, we aim at finding a "cost-free" way to handle the distribution inconsistency in co-saliency detection. Intrigued by causal effect [44, 43] and its extensions in vision & language [54, 42, 60], we introduce counterfactual training, which regards the gap between the current training distribution D_train and the true distribution D_real as the direct cause [46, 50] of incorrect co-saliency predictions. As shown in Figure 2, the quality of the prediction made by a learning-based model depends on the quality of the input data under distribution D_train. The goal of counterfactual training is to synthesize "imaginary" data samples X̃ whose distribution D̃ (which also originates from D_train) can mimic D_real. In this way, models can capture the true co-salient effect from D̃ and make predictions properly.
Under the instruction of counterfactual training, we propose using context adjustment to augment off-the-shelf saliency detection datasets, and introduce a novel group-cut-paste (GCP) procedure to improve the distribution of the training dataset. Taking a closer look at Figure 1, M and C are the corresponding pixel-level annotation and category label of sample X. GCP turns X into a canvas to be completed and paints the remaining part through the following steps: (1) classifying candidate images into a semantic group (e.g., banana) with reliable pretrained models; (2) cutting out candidate objects (e.g., baseball, butterfly, etc.); and (3) pasting candidate objects into image samples. We will revisit this procedure formally in Section 3. Different from the plain "observation" (e.g., a different color from the background) made by biased training, counterfactual training opens the door of "imagination" and allows models to think comprehensively. A better prediction becomes possible because features such as the "elongated and curved" shape (instead of the round shape of baseball and orange) or the "yellow-green" color (instead of the dark color of remote control and avocado) are captured. In this way, models can focus more on the true causal effects rather than the spurious variants and biases caused by the distribution gap.
Following GCP, we collect a novel dataset called Context Adjustment Training, which has two variants: CAT and CAT+. Both consist of 280 subclasses affiliated to 15 superclasses, covering common items in daily life as well as animals and plants in nature. While CAT/CAT+ is diverse in semantics, it is also large in scale: the two variants contain 16,750 and 33,500 samples, respectively, making it the current largest in co-saliency detection. Every sample in CAT/CAT+ is equipped with a sophisticated mask annotation, category label, and edge information. It is worth noting that, unlike manual selection and pixel-by-pixel labeling, all the images and their corresponding masks in our dataset are annotated automatically, making the cost virtually "free." Extensive experimental results in Section 4 verify the effectiveness of our dataset. Without bells and whistles, CAT/CAT+ helps both one-stage and two-stage models achieve significant improvements of 5% to 30% in conventionally-adopted metrics on the challenging evaluation datasets CoSOD3k and CoCA.
2 Related Work
| No. | Dataset | Year | #Img. | #Grp. | Img./Grp. | Max | Min | Attributes | Task | Input |
|----|----|----|----|----|----|----|----|----|----|----|
| 1 | MSRCv1 | 2005 | 240 | 8 | 30.0 | 30 | 30 | ✗ ✓ ✗ ✗ | CO | Group images |
| 2 | MSRCv2 | 2005 | 591 | 23 | 25.7 | 34 | 21 | ✗ ✓ ✗ ✗ | CO | Group images |
| 3 | iCoseg | 2010 | 643 | 38 | 16.9 | 41 | 4 | ✗ ✓ ✗ ✓ | CO | Group images |
| 4 | MSRA-B | 2013 | 5,000 | - | - | - | - | ✗ ✓ ✗ ✓ | SD | Single image |
| 5 | Coseg-Rep | 2013 | 572 | 23 | 24.8 | 116 | 9 | ✗ ✓ ✗ ✓ | CO | Group images |
| 6 | DUT-OMRON | 2013 | 5,172 | - | - | - | - | ✗ ✓ ✗ ✓ | SD | Single image |
| 7 | THUR-15K | 2014 | 15,531 | 5 | 3,000.0 | 3,457 | 2,892 | ✗ ✓ ✓ ✓ | CO | Group images |
| 8 | MSRA10K | 2015 | 10,000 | - | - | - | - | ✗ ✓ ✓ ✓ | SD | Single image |
| 9 | CoSal2015 | 2015 | 2,015 | 50 | 40.3 | 52 | 26 | ✓ ✓ ✗ ✓ | CO | Group images |
| 10 | DUTS-TR | 2017 | 10,553 | - | - | - | - | ✗ ✓ ✓ ✓ | SD | Single image |
| 11 | COCO9213 | 2017 | 9,213 | 65 | 141.7 | 468 | 18 | ✓ ✓ ✗ ✗ | SS | Group images |
| 12 | COCO-GWD | 2019 | 9,000 | 118 | 76.2 | - | - | ✓ ✓ ✗ ✗ | SS | Group images |
| 13 | COCO-SEG | 2019 | 200,932 | 78 | 2,576.1 | 49,355 | 201 | ✓ ✗ ✓ ✗ | SS | Group images |
| 14 | WISD | 2019 | 2,019 | - | - | - | - | ✓ - ✗ - | CO | Single image |
| 15 | DUTS-Class | 2020 | 8,250 | 291 | 28.3 | 252 | 5 | ✗ ✓ ✓ ✓ | CO | Group images |
| 16 | Jigsaw | 2020 | 33,000 | 291 | 113.4 | 1,008 | 20 | ✓ ✗ ✓ ✓ | CO | Group images |
| 17 | CAT (Ours) | 2021 | 16,750 | 280 | 59.8 | 412 | 14 | ✓ ✓ ✓ ✓ | CO | Group images |
| 18 | CAT+ (Ours) | 2021 | 33,500 | 280 | 119.6 | 824 | 28 | ✓ ✓ ✓ ✓ | CO | Group images |
Modeling visual contexts liberates many computer vision tasks from the impediment caused by the need for sophisticated annotations, such as bounding boxes and pixel-level masks. CutMix leverages training pixels and the regularization effect of regional dropout by cutting and pasting image patches. To avoid overfitting, Cutout randomly masks out square regions of an image during training. The generic vicinal distribution introduced in Mixup helps to synthesize the ground truth of new samples by the linear interpolation of one-hot labels. Unfortunately, the patches introduced by both CutMix and Cutout may cause severe occlusions of the original objects, which is not compatible with the idea of saliency detection. For Mixup, the local statistics and semantics are destroyed: since the foreground and background are blended, saliency cannot be appropriately defined anymore. AugMix fixes this problem by mixing several transformed versions of an image together, i.e., an augmentation chain. Although such a composition improves model robustness, it cannot introduce co-salient signals. Jigsaw training concatenates two samples together to form a multi-foreground image. However, for backbones that require fixed-size inputs, resizing such jigsawed images leads to severe shape distortion. In contrast, our GCP preserves both the multi-foreground requirement and object shapes, and maintains saliency.
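For concreteness, the vicinal interpolation discussed above can be sketched as follows. This is the generic mixup rule, not part of our method; function and variable names are illustrative. Because every pixel of the two images is blended, the foreground/background distinction, and hence saliency, is lost in the mixed sample.

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    """Vicinal-distribution augmentation: linearly interpolate two
    samples and their one-hot labels with a Beta-sampled coefficient."""
    rng = rng or np.random.default_rng(0)
    lam = float(rng.beta(alpha, alpha))
    x = lam * x1 + (1.0 - lam) * x2  # blended image: foreground and background mix
    y = lam * y1 + (1.0 - lam) * y2  # soft (interpolated) label
    return x, y, lam
```

Note that the output label is a convex combination of the two one-hot vectors, so it still sums to one, but no single pixel belongs unambiguously to either source image.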
Co-Saliency Detection. The origin of this task dates back to the last decade, when co-saliency was defined as the regions of an object that occur in a pair of images. A more formal definition given later exploits both intra-image saliency and inter-image saliency in a group-by-group manner. Since then, researchers in this area have been working on identifying co-salient signals across image groups. Representative works from the early years [4, 20, 40] rely on heuristics like background priors and color contrast. Setting aside post-processing like CRF, most modern co-saliency detection models can be divided into two-stage and one-stage models. Building upon saliency detection frameworks, two-stage models fuse intra-saliency with inter-saliency generated by specially designed modules [30, 18]. There are also one-stage models that do not require saliency preparation: some introduce consensus embedding procedures to explore a group representation and use it to guide soft attentions. Other recent models adopt different network structures to identify co-salient signals, such as graph neural networks [64, 37, 47]. When trained with our dataset, both two-stage and one-stage models achieve better performances.
Training Dataset. Due to the lack of a suitable training dataset, most previous works use existing saliency detection datasets or semantic segmentation datasets for training. Table 1 summarizes datasets adopted to train co-saliency detection models. Small datasets [56, 2, 10] were popular only among early works. Some large-scale saliency detection datasets [57, 7, 53] are adopted by two-stage models to extract intra-saliency cues. However, such datasets have no class information, so they cannot train end-to-end models. Some works even borrow datasets from semantic segmentation, e.g., COCO-SEG and COCO9213. Unfortunately, although such datasets are large in scale, their annotations are coarse. Recently, DUTS-Class and its jigsawed version have been proposed as a "transition plan" towards co-saliency. However, the former is just a "grouped" saliency detection dataset, while the latter introduces issues like shape distortion and independent boundaries. Evidence shows that these inappropriate training paradigms have caused serious biases in co-saliency detection models [19, 66]. Our dataset addresses the aforementioned problems: CAT/CAT+ is currently the largest co-saliency detection dataset and offers diverse semantics and high-quality annotations.
Evaluation Dataset. Pioneer co-saliency detection works evaluate model performances on small datasets like MSRC, iCoseg, and ImagePair. Consisting of 2,015 images from 50 groups, CoSal2015 is the most popular choice among modern models. However, most samples in it are still uni-object. Two recent datasets [19, 66] fill the defect in terms of both scale and appearance. CoSOD3k is a diverse dataset consisting of 160 categories and 3,316 images; 70% and 20% of its images have one and two objects, respectively, while 10% have three or more objects. CoCA consists of 80 categories with 1,295 images, which are challenging due to occlusion and cluttered backgrounds. At least one extraneous foreground object is introduced in every image, and some images have more than two co-salient objects. We adopt both of these datasets to evaluate model performances in our benchmark experiments.
3 Proposed Method
In this section, we first introduce the idea of counterfactual training in co-saliency detection. Under this high-level instruction, we propose a group-cut-paste (GCP) procedure to adjust visual context and generate training samples. Finally, we construct a novel dataset by following GCP.
3.1 Counterfactual Training
Let X and M denote a training image and its mask annotation with width W, height H, and channel 3. G is an image group whose members include X. C denotes the category label, which contains the semantic information for both X and G. The counterfactual X̃ of sample X is read as:
M̃ would be M, had X been X̃, given the fact that G̃.
where G̃ denotes the new image group; the new mask M̃ is restricted (kept consistent with M) to maintain the saliency of the co-occurring object. In other words, label C will not change during the whole process. The given "fact" means that the group training paradigm remains the same, while the "counter-fact" (what if) indicates that the generated X̃ clashes with the observed X.
The conceptual meaning of counterfactual training can be substantiated by the following instructions:
Abduction - "Given the fact that G̃." When constructing the new image group G̃ for X̃, the co-saliency should remain unchanged. That is to say, the "fact" that the object in X is salient still holds for it in X̃.
Action - "Had X been X̃." After the "imagination", the mask annotation should not change significantly. Here we intervene by keeping the semantic label C the same both before and afterwards.
Prediction - "M̃ would be M." Conditioning on the "fact" G̃ and the intervention objective X̃, we can generate the counterfactual sample X̃ and its mask M̃ from the probability distribution function P(X̃, M̃ | X, M, C, G̃).
3.2 Group-Cut-Paste
Following the counterfactual training instructions, we design the group-cut-paste (GCP) procedure. Specifically, the object map O for image X can be generated by its mask, i.e., O = X ⊙ M, where ⊙ denotes element-wise multiplication. The goal of GCP is to automatically generate a new training sample X̃ by combining X_i and O_j with i ≠ j, where i and j are the category indexes of the target and source groups, respectively. The generated sample X̃ is then used to train candidate models with their original loss functions. Figure 3 gives an overview of our method.
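The object-map step above is a single element-wise product; a minimal NumPy sketch (function name and array shapes are illustrative):

```python
import numpy as np

def object_map(image, mask):
    """O = X (*) M: element-wise product that keeps only the annotated
    object's pixels. `image` is (H, W, 3); the binary `mask` (H, W) is
    broadcast over the channel axis."""
    return image * mask[..., None]
```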
Group. The first step is to build image groups. To effectively classify candidate images, we adopt one of the current state-of-the-art classifiers, WSL-Images, which achieves high Top-1 accuracy on ImageNet. The category label can be generated by C = f(X), where f denotes the pretrained classifier. We manually pick out the misclassified examples after grouping. This results in 8,375 sample pairs distributed in 280 semantic groups.
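Once per-image labels are predicted, grouping reduces to bucketing images by label. A minimal sketch, where `classify` is a placeholder standing in for a forward pass of the pretrained classifier:

```python
from collections import defaultdict

def build_groups(images, classify):
    """Bucket candidate images into semantic groups keyed by the label
    predicted by a pretrained classifier (stubbed here as `classify`)."""
    groups = defaultdict(list)
    for img in images:
        groups[classify(img)].append(img)  # C = f(X)
    return dict(groups)
```

Misclassified examples would then be removed by manual inspection, as described above.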
Cut. Both the object map O and mask M are used to prepare the candidate object. We adopt the border-following algorithm to extract external contour points from mask M. Based on the contours, the bounding rectangle with minimum area can be drawn. The area to be cut is then defined by the four vertices of this rotated box, which gives us the final object O.
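A simplified sketch of the cut step, where an axis-aligned tight box stands in for the rotated minimum-area rectangle that the border-following contours would give (in practice one would use, e.g., OpenCV's contour utilities):

```python
import numpy as np

def cut_object(image, mask):
    """Crop a candidate object by the tight bounding box of its mask.

    Simplification: axis-aligned box instead of the rotated
    minimum-area rectangle derived from border-following contours."""
    ys, xs = np.nonzero(mask)                 # pixels belonging to the object
    y0, y1 = ys.min(), ys.max() + 1
    x0, x1 = xs.min(), xs.max() + 1
    crop_mask = mask[y0:y1, x0:x1]
    crop_obj = image[y0:y1, x0:x1] * crop_mask[..., None]  # zero out background
    return crop_obj, crop_mask
```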
Paste. Under the counterfactual training instructions, an object O_j can be pasted properly into X_i to form a new sample X̃. Here we maintain the "fact" by preserving group consistency, i.e., we conduct this procedure in a group-by-group manner. Denote the regions outside M as M̄. To avoid severe occlusion, a coordinate is randomly sampled in M̄ as the position to be pasted, and the center of O_j is chosen as the anchor for pasting. We execute the "action" by restricting the size of the candidate object and keeping label C unchanged before and after pasting. The mask annotations are automatically synthesized by dyadic operations between M and the pasted object's mask. Finally, under a uniform probability distribution, the "prediction" gives us (X̃, M̃) pairs for further model training. Our GCP operation can easily run on CPUs in parallel with the main GPU training tasks, which hides the computation cost and boosts performance for virtually free.
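The paste step can be sketched as follows. Names, the anchoring convention (top-left corner rather than object center), and the exact dyadic mask operation are illustrative assumptions: here the pasted object is treated as a distractor, so the co-salient mask keeps only the original object.

```python
import numpy as np

def paste_object(canvas, canvas_mask, obj, obj_mask, rng=None):
    """Paste a candidate object into a canvas image at a position sampled
    outside the existing salient region, then synthesize the new mask."""
    rng = rng or np.random.default_rng(0)
    H, W = canvas_mask.shape
    h, w = obj_mask.shape
    # candidate top-left anchors: object fits inside the canvas,
    # and the anchor lies outside the current salient mask
    ys, xs = np.nonzero(canvas_mask[:H - h + 1, :W - w + 1] == 0)
    i = rng.integers(len(ys))                 # uniform sampling over anchors
    y, x = int(ys[i]), int(xs[i])
    out = canvas.copy()
    region = out[y:y + h, x:x + w]
    region[obj_mask > 0] = obj[obj_mask > 0]  # overwrite pasted pixels
    new_mask = canvas_mask.copy()
    new_mask[y:y + h, x:x + w] *= (1 - obj_mask)  # distractor stays unlabeled
    return out, new_mask
```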
3.3 Constructing the Dataset
For stability and to ensure high quality, we collect a dataset called Context Adjustment Training. We pick 8,375 images with clear saliency and sophisticated annotations from existing saliency detection datasets [53, 15] as our canvases. Following GCP, 280 semantic groups affiliated to 15 superclasses are built after the "grouping": aves, electronic product (elec.), food, fruit, insect, instrument (instru.), kitchenware (kitch.), mammal (mamm.), marine, other, reptile (rept.), sports, tool, transportation (trans.), and vegetable (vege.). See Figure 4 for the taxonomy. We then "cut" all the objects out and discard those with incomplete shapes. During the "paste" stage, the size of each pasted object is restricted to a proper range. Candidate objects are randomly flipped horizontally to increase diversity. We sample twice for each candidate image to generate two samples and re-sample for unsatisfactory cases. This gives us 16,750 augmented images, whose corresponding masks are generated automatically. As shown in Figure 5, samples synthesized by counterfactual training and GCP can model realistic visual contexts and offer proper co-salient signals.
Our dataset has two variants: CAT and CAT+. CAT consists of all 16,750 augmented images, while CAT+ contains both the augmented images and two copies of each original image (i.e., 33,500 images, about ten times bigger than the current largest evaluation dataset). In short, CAT+ is larger in scale, and CAT is more challenging in terms of the proportion of complex contexts. See Table 1 for more dataset statistics. Some patterns of our dataset (calculated by averaging overlapping masks) are shown in Figure 6. The overall pattern of our dataset tends to be a "round" shape, while categories with unique shapes (e.g., cello and limousine) result in shape-biased patterns, which is consistent with the definition of saliency. More details of our dataset can be found in the supplementary material.
4 Benchmark Experiment
In this section, we conduct a comprehensive benchmark study for co-saliency detection to verify the effectiveness and superiority of our proposed methods and dataset.
4.1 Implementation Detail
Training. We compare our CAT/CAT+ with five popular datasets, i.e., COCO-SEG, DUTS-TR, COCO9213, DUTS-Class, and Jigsaw. Four current state-of-the-art models are selected for our benchmark experiments, i.e., PoolNet, EGNet, ICNet, and GICD. During training, all input data, including images, masks, and edge maps, are resized to a fixed resolution before being fed into the networks. NVIDIA GeForce RTX 2080Ti graphics cards are used throughout our experiments. Models. For PoolNet, we adopt its Res2Net version and train it with the Adam optimizer. The VGG version of EGNet is chosen; we follow its default settings with the Adam optimizer, prepare edge maps from the corresponding masks of each dataset, and decay the learning rate partway through training. For ICNet, we adopt both the VGG and ResNet versions of EGNet to prepare the single-image-saliency-maps (SISMs) and train with the Adam optimizer. For GICD, we keep the original training paradigm of randomly selecting at most 20 samples from each image group per training epoch; the Adam optimizer is adopted and the learning rate is reduced periodically.
Evaluation. We choose the two recent evaluation datasets CoSOD3k and CoCA to evaluate model performances. Four conventional metrics are adopted: mean absolute error (MAE), F-measure, E-measure, and structure measure (S_α). For F-measure, we follow the convention by setting the weight β² = 0.3 and report both the maximum score (F_max) and the mean score (F_mean). We also report both E_max and E_mean for E-measure.
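A minimal sketch of two of these metrics: MAE and a single-threshold F-measure with the conventional weight β² = 0.3. The actual benchmark toolboxes sweep binarization thresholds to obtain the max/mean scores; this simplified version evaluates one threshold only.

```python
import numpy as np

def mae(pred, gt):
    """Mean absolute error between a predicted saliency map and its
    ground-truth mask (both valued in [0, 1])."""
    return float(np.mean(np.abs(pred - gt)))

def f_measure(pred, gt, beta2=0.3, thresh=0.5):
    """F-measure at one binarization threshold with beta^2 = 0.3."""
    binary = pred >= thresh
    positive = gt > 0.5
    tp = np.logical_and(binary, positive).sum()      # true positives
    precision = tp / max(binary.sum(), 1)
    recall = tp / max(positive.sum(), 1)
    return (1 + beta2) * precision * recall / max(beta2 * precision + recall, 1e-8)
```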
4.2 Quantitative Analysis
Longitudinal Comparison. From Tables 2, 3, 4 & 5, we observe that models trained on CAT/CAT+ achieve better performances than those trained on other datasets in all metrics on both CoSOD3k and CoCA. This suggests that CAT/CAT+ is agnostic to models and backbones. Specifically, the improvements are typically larger in MAE: significant boosts are achieved on PoolNet, EGNet, and ICNet, with a smaller gain on GICD. Regarding the average improvements on the other metrics, we observe that CAT/CAT+ improves more on F-measure, which shows that instead of radically regarding every foreground object as salient, models trained on CAT/CAT+ tend to make relatively conservative predictions and focus on regions with co-salient objects only. This supports our analysis in Sections 1 & 3. CAT/CAT+ also provides improvements in E-measure and S-measure. Indeed, S-measure is the hardest among the current metrics, as it measures both accuracy and structural similarity: although models are now capable of making more confident predictions, they cannot capture very detailed information like boundaries. Overall, we believe that the diversity and quality of CAT/CAT+ lay the foundation for more in-depth work in the future.
Horizontal Comparison. Changing the direction of comparison, we notice that the evaluation scores on CoSOD3k are much higher than those on CoCA. This is in line with the characteristics of the two datasets: although the former is richer in scale, the latter has more complex contexts and more varied appearances, such as occlusion and object size. Besides, the performances of co-saliency detection models are much better than those of saliency detection models, which emphasizes again the importance of specialized modules/techniques designed for identifying co-salient signals and suppressing non-salient ones. In addition, the performance of CAT+ is slightly better than that of CAT. A possible reason is that CAT+ has a larger scale, which increases the robustness of the model. We will investigate this hypothesis in future work.
Ablation Study. We conduct an ablation study to verify the effectiveness of our methods (see Table 6). The results for the original images are shown in rows A and B. We also provide other combinations for comparison, e.g., grouping images with classifiers (vs. randomly allocating images to the same number of groups), cutting candidate objects out before pasting (vs. not cutting and pasting the sampled images directly), and pasting objects/images under counterfactual rules (vs. no pasting or random pasting without limitations). Specifically, "Grouping" offers proper inter-saliency signals, which is essential for models to find the common salient objects between image groups (A vs. B, C vs. E). "Cut" gives fine-grained category information and further improves performance (C vs. D, E vs. F). As the key part, "Paste" provides intra-saliency signals, which determines the quality of the collaborative learning paradigm. A proper introduction of such signals significantly boosts the performance (Ours vs. F), and vice versa (A vs. C, B vs. E).
4.3 Qualitative Analysis
Figure 7 illustrates some visualization results on CoSOD3k and CoCA generated by GICD trained with different training datasets. It can easily be observed that the quality of a dataset plays an important role during training. Models that encode features from a high-quality dataset like CAT/CAT+ tend to focus more on the true co-salient signals while suppressing the non-salient ones. Take the frying pan group at columns 1 to 3 as an example: both semantic segmentation and saliency detection datasets fail to teach the model "what" and "where" to focus and thus let it mistakenly regard items besides frying pans as co-salient. In contrast, the model trained on CAT/CAT+ can distinguish co-salient objects from non-salient clusters even when they share similar shapes, textures, colors, etc. Please refer to our supplementary material for more examples. Another point worthy of attention is that existing models cannot restore the details (e.g., the boundaries) of objects well, especially for small objects. For example, the contours of the pigeons in the last two columns of Figure 7 are not well drawn. A possible reason is that most current co-saliency models adopt VGG-16 as their backbone and rely on low-resolution inputs (e.g., 224×224); some minor but important features are lost during down-sampling. With the emergence of deeper and more efficient backbones, we encourage future works to explore more possibilities.
5 Conclusion
In this work, we addressed the distribution inconsistency problem in co-saliency detection. Borrowing the idea of cause and effect, we proposed a counterfactual training framework to formalize the co-salient signals during training. A group-cut-paste procedure is introduced to leverage existing data samples and make "cost-free" augmentation by context adjustment. Following this procedure, we collected a large-scale dataset called Context Adjustment Training. The abundant semantics and high-quality annotations of our dataset liberate models from the impediment caused by spurious variations and biases and significantly improve model performances. Moving forward, we plan to take a deeper look at the co-salient signals both across image groups and within each sample, and to improve the quality and scale of our dataset accordingly.
-  (2009) Frequency-tuned salient region detection. CVPR. Cited by: §4.1.
-  (2010) ICoseg: interactive co-segmentation with intelligent scribble guidance. CVPR. Cited by: Table 1, §2, §2.
-  (2019) Salient object detection: a survey. Comp. Visual Media 5 (2), pp. 117–150. Cited by: §1.
-  (2010) Preattentive co-saliency detection. ICIP. Cited by: §2.
-  (2020) Counterfactual samples synthesizing for robust visual question answering. CVPR. Cited by: §1.
-  (2014) Salientshape: group saliency in image collections. Vis. Comput. 30 (4), pp. 443–453. Cited by: Table 1, §2.
-  (2014) Global contrast based salient region detection. IEEE Trans. Pattern Anal. Mach. Intell. 37 (3), pp. 569–582. Cited by: Table 1, §2.
-  (2013) Efficient salient region detection with soft image abstraction. ICCV. Cited by: §4.1.
-  (2017) Co-saliency detection for rgbd images based on multi-constraint feature matching and cross label propagation. IEEE Trans. Image Process. 27 (2), pp. 568–579. Cited by: §2.
-  (2013) Cosegmentation and cosketch by unsupervised learning. ICCV. Cited by: Table 1, §2.
-  (2009) ImageNet: a large-scale hierarchical image database. CVPR. Cited by: §3.2.
-  (2017) Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552. Cited by: §2.
-  (2009) An empirical study of context in object detection. CVPR. Cited by: §1.
-  (2017) Cut, paste and learn: surprisingly easy synthesis for instance detection. ICCV. Cited by: §2.
-  (2018) Salient objects in clutter: bringing salient object detection to the foreground. ECCV. Cited by: §2, §3.1, §3.3, §3.3.
-  (2017) Structure-measure: a new way to evaluate foreground maps. ICCV. Cited by: §4.1.
-  (2018) Enhanced-alignment measure for binary foreground map evaluation. IJCAI. Cited by: §4.1.
-  (2021) Re-thinking co-salient object detection. IEEE Trans. Pattern Anal. Mach. Intell.. Cited by: §2.
-  (2020) Taking a deeper look at co-salient object detection. CVPR. Cited by: Figure 1, §1, §1, §1, §2, §2, Figure 7, §4.1, §4.2, §4.2, §4.3, Table 2, Table 3, Table 4, Table 5.
-  (2013) Cluster-based co-saliency detection. IEEE Trans. Image Process. 22 (10), pp. 3766–3778. Cited by: §2.
-  (2021) Res2Net: a new multi-scale backbone architecture. IEEE Trans. Pattern Anal. Mach. Intell. 43 (2), pp. 652–662. Cited by: §4.1.
-  (2020) Trustful internet of surveillance things based on deeply represented visual co-saliency detection. IEEE Internet Things J. 7 (5), pp. 4092–4100. Cited by: §1.
-  (2016) Deep residual learning for image recognition. CVPR. Cited by: §4.1, Table 4.
-  (2020) Augmix: a simple data processing method to improve robustness and uncertainty. ICLR. Cited by: §2.
-  (2016) Scene recognition with cnns: objects, scales and dataset bias. CVPR. Cited by: §1.
-  (2010) Cosaliency: where people look when comparing images. UIST. Cited by: §1, §2.
-  (2016) CATS: co-saliency activated tracklet selection for video co-localization. ECCV. Cited by: §1.
-  (2019) Efficient video object co-localization with co-saliency activated tracklets. IEEE Trans. Circuits Syst. Video Technol.. Cited by: §1.
-  (2013) Salient object detection: a discriminative regional feature integration approach. CVPR. Cited by: Table 1.
-  (2020) ICNet: intra-saliency correlation network for co-saliency detection. NeurIPS. Cited by: §2, §4.1, §4.2, Table 4.
-  (2015) Adam: a method for stochastic optimization. ICLR. Cited by: §4.1.
-  (2011) Efficient inference in fully connected crfs with gaussian edge potentials. NeurIPS. Cited by: §2.
-  (1998) Gradient-based learning applied to document recognition. Proc. IEEE. Cited by: §1.
-  (2019) Group-wise deep object co-segmentation with co-attention recurrent neural network. ICCV. Cited by: Table 1.
-  (2013) Co-salient object detection from multiple images. IEEE Trans. Multimedia 15 (8), pp. 1896–1909. Cited by: §2.
-  (2011) A co-saliency model of image pairs. IEEE Trans. Image Process. 20 (12), pp. 3365–3375. Cited by: §2, §2.
-  (2021) Image co-saliency detection and instance co-segmentation using attention graph clustering based graph convolutional network. IEEE Trans. Multimedia. Cited by: §2.
-  (2014) Microsoft coco: common objects in context. ECCV. Cited by: §1, Table 1, §2, §4.1, Table 2, Table 3, Table 4, Table 5.
-  (2019) A simple pooling-based design for real-time salient object detection. CVPR. Cited by: §4.1, §4.2, Table 2.
-  (2014) Co-saliency detection based on hierarchical segmentation. IEEE Signal Process. Lett. 21 (1), pp. 88–92. Cited by: §2.
-  (2018) Exploring the limits of weakly supervised pretraining. ECCV. Cited by: §3.2.
-  (2020) Counterfactual vqa: a cause-effect look at language bias. arXiv preprint arXiv:2006.04315. Cited by: §1, §1.
-  (2016) Causal inference in statistics: a primer. John Wiley & Sons. Cited by: §1, §3.1.
-  (2018) The book of why: the new science of cause and effect. Basic Books. Cited by: §1.
-  (2009) Causality. Cambridge University Press. Cited by: §3.1.
-  (2013) Direct and indirect effects. arXiv preprint arXiv:1301.2300. Cited by: §1.
-  (2021) Two-stream encoder gan with progressive training for co-saliency detection. IEEE Signal Process. Lett. 28, pp. 180–184. Cited by: §2.
-  (2015) Very deep convolutional networks for large-scale image recognition. ICLR. Cited by: §2, §4.1, §4.3, Table 4.
-  (1985) Topological structural analysis of digitized binary images by border following. Comput. Vis. Graphics Image Process. 30 (1), pp. 32–46. Cited by: §3.2.
-  (2020) Long-tailed classification by keeping the good and removing the bad momentum causal effect. NeurIPS. Cited by: §1.
-  (2011) Unbiased look at dataset bias. CVPR. Cited by: §1.
-  (2019) Robust deep co-saliency detection with group semantic. AAAI. Cited by: Table 1, §2, §4.1, Table 5.
-  (2017) Learning to detect salient objects with image-level supervision. CVPR. Cited by: §1, Table 1, §2, §3.3, §4.1, Table 2, Table 3.
-  (2020) Visual commonsense r-cnn. CVPR. Cited by: §1.
-  (2021) Salient object detection in the deep learning era: an in-depth survey. IEEE Trans. Pattern Anal. Mach. Intell.. Cited by: §1.
-  (2005) Object categorization by learned universal visual dictionary. ICCV. Cited by: Table 1, §2, §2.
-  (2013) Saliency detection via graph-based manifold ranking. CVPR. Cited by: Table 1, §2.
-  (2017) Revisiting co-saliency detection: a novel approach based on two-stage multi-view spectral rotation co-clustering. IEEE Trans. Image Process. 26 (7), pp. 3196–3209. Cited by: §2.
-  (2019) An end-to-end network for co-saliency detection in one single image. arXiv preprint arXiv:1910.11819. Cited by: Table 1.
-  (2020) Interventional few-shot learning. NeurIPS. Cited by: §1.
-  (2019) CutMix: regularization strategy to train strong classifiers with localizable features. ICCV. Cited by: §2.
-  (2016) Detection of co-salient objects by looking deep and wide. Int. J. Comput. Vis. 120 (2), pp. 215–232. Cited by: Table 1, §2.
-  (2018) Mixup: beyond empirical risk minimization. ICLR. Cited by: §2.
-  (2020) Adaptive graph convolutional network with attention graph clustering for co-saliency detection. CVPR. Cited by: §2.
-  (2020) CoADNet: collaborative aggregation-and-distribution networks for co-salient object detection. NeurIPS. Cited by: §2.
-  (2020) Gradient-induced co-saliency detection. ECCV. Cited by: Figure 1, §1, §1, §1, Table 1, §2, §2, §2, §2, Figure 7, §4.1, §4.1, §4.2, §4.2, §4.3, Table 2, Table 3, Table 4, Table 5.
-  (2019) EGNet: edge guidance network for salient object detection. ICCV. Cited by: §4.1, §4.2, Table 3, Table 4.
-  (2016) Learning deep features for discriminative localization. CVPR. Cited by: §2.