Concealed Object Detection

02/20/2021 ∙ by Deng-Ping Fan, et al. ∙ Nankai University

We present the first systematic study on concealed object detection (COD), which aims to identify objects that are "perfectly" embedded in their background. The high intrinsic similarities between the concealed objects and their background make COD far more challenging than traditional object detection/segmentation. To better understand this task, we collect a large-scale dataset, called COD10K, which consists of 10,000 images covering concealed objects in diverse real-world scenarios from 78 object categories. Further, we provide rich annotations including object categories, object boundaries, challenging attributes, object-level labels, and instance-level annotations. Our COD10K is the largest COD dataset to date, with the richest annotations, which enables comprehensive concealed object understanding and can even be used to help progress several other vision tasks, such as detection, segmentation, classification, etc. Motivated by how animals hunt in the wild, we also design a simple but strong baseline for COD, termed the Search Identification Network (SINet). Without any bells and whistles, SINet outperforms 12 cutting-edge baselines on all datasets tested, making it a robust, general architecture that could serve as a catalyst for future research in COD. Finally, we provide some interesting findings and highlight several potential applications and future directions. To spark research in this new field, our code, dataset, and online demo are available on our project page.




1 Introduction

Can you find the concealed object(s) in each image of Fig. 1 within 10 seconds? Biologists refer to this as background matching camouflage (BMC) [12], where one or more objects attempt to adapt their coloring to match “seamlessly” with the surroundings in order to avoid detection [63]. Sensory ecologists [73] have found that this BMC strategy works by deceiving the visual perceptual system of the observer. Naturally, addressing concealed object detection (COD¹) requires a significant amount of visual perception [79] knowledge. Understanding COD not only has scientific value in itself, but is also important for applications in many fundamental fields, such as computer vision (e.g., for search-and-rescue work, or rare species discovery), medicine (e.g., polyp segmentation [22], lung infection segmentation [27, 86]), agriculture (e.g., locust detection to prevent invasion), and art (e.g., recreational art [9]).

¹ We define COD as segmenting objects or stuff (amorphous regions [42]) that have a similar pattern, e.g., texture, color, direction, etc., to their natural or man-made environment. In the rest of the paper, for convenience, concealed object segmentation is considered identical to COD and the two terms are used interchangeably.

In Fig. 2, we present examples of generic, salient, and concealed object detection. The high intrinsic similarities between the targets and non-targets make COD far more challenging than traditional object segmentation/detection [101, 25, 23, 99, 98]. Although it has gained increased attention recently, studies on COD remain scarce, mainly due to the lack of a sufficiently large dataset and a standard benchmark like Pascal-VOC, ImageNet, ADE20K [103], and DAVIS [65].

Fig. 1: Examples of background matching camouflage (BMC). There are seven and six birds in the left and right images, respectively. Answers in color are shown in Fig. 27.

[Figure panels: (a) Image, (b) Generic object, (c) Salient object, (d) Concealed object]

Fig. 2: Task relationship. Given an input image (a), we present the ground-truth for (b) panoptic segmentation [42] (which detects generic objects [52, 58] including stuff and things), (c) instance level salient object detection [46, 25], and (d) the proposed concealed object detection task, where the goal is to detect objects that have a similar pattern to the natural habitat. In this example, the boundaries of the two butterflies are blended with the bananas, making them difficult to identify.

Fig. 3: Annotation diversity in the proposed COD10K dataset. Instead of only providing coarse-grained object-level annotations like in previous works, we offer six different annotations for each image, which include attributes and categories (1st row), bounding boxes (2nd row), object annotation (3rd row), instance annotation (4th row), and edge annotation (5th row).

In this paper, we present the first complete study for the concealed object detection task using deep learning, bringing a novel view to object detection from a concealed perspective.

1.1 Contributions

Our main contributions are as follows:

  • COD10K Dataset. With the goal mentioned above, we carefully assemble COD10K, a large-scale concealed object detection dataset. Our dataset contains 10,000 images covering 78 object categories, such as terrestrial, amphibians, flying, aquatic, etc. All the concealed images are hierarchically annotated with category, bounding-box, object-level, and instance-level labels (Fig. 3), benefiting many related tasks, such as object proposal, localization, semantic edge detection [56], transfer learning [92], domain adaptation [69], etc. Each concealed image is assigned challenging attributes (e.g., shape complexity-SC, indefinable boundaries-IB, occlusions-OC) found in the real world and matting-level [96] labeling (which takes 60 minutes per image). These high-quality labels could help provide deeper insight into the performance of models.

  • COD Framework. We propose a simple but efficient framework, named SINet (Search Identification Net). Remarkably, the overall training time of SINet is only about 4 hours, and it achieves the new state-of-the-art (SOTA) on all existing COD datasets, suggesting that it could offer a potential solution to concealed object detection. Our network also yields several interesting findings (e.g., the search and identification strategy is suitable for COD), making various potential applications more feasible.

  • COD Benchmark. Based on the collected COD10K and previous datasets [72, 45], we offer a rigorous evaluation of 12 SOTA baselines, making ours the largest COD study. We report baselines in two scenarios, i.e., super-class and sub-class. We also track the community’s progress via an online benchmark.

  • Downstream Applications. To further support research in the field, we develop an online demo to enable other researchers to test their scenes easily. In addition, we also demonstrate several potential applications such as medicine, manufacturing, agriculture, art, etc.

  • Future Directions. Based on the proposed COD10K, we also discuss ten promising directions for future research. We find that concealed object detection is still far from being solved, leaving large room for improvement.

This paper is based on and extends our conference version [21] in several aspects. First, we provide a more detailed analysis of our COD10K, including the taxonomy, statistics, annotations, and resolutions. Second, we improve the performance of our SINet model by introducing the neighbor connection decoder (NCD) and group-reversal attention (GRA). Third, we conduct extensive experiments to validate the effectiveness of our model, and provide several ablation studies for the different modules within our framework. Fourth, we provide an exhaustive super-class and sub-class benchmarking and a more insightful discussion regarding the novel COD task. Last but not least, based on our benchmark results, we draw several important conclusions and highlight several promising future directions, such as concealed object ranking, concealed object proposal, and concealed instance segmentation.

2 Related Work

In this section, we briefly review closely related works. Following [101], we roughly divide object detection into three categories: generic, salient, and concealed object detection.

Dataset          Year   #Img.    #Cls.   #Training   #Testing
CHAMELEON [72]   2018   76       N/A     0           76
CAMO-COCO [45]   2019   2,500    8       1,250       1,250
COD10K (OUR)     2020   10,000   78      6,000       4,000
TABLE I: Summary of COD datasets, showing that COD10K offers much richer annotations and benefits many tasks. Att.: Attribute label. BBox.: Bounding box label. Ml.: Alpha matting-level annotation [96]. Ins.: Instance-level label. Cate.: Category label. Obj.: Object-level label. Loc.: Location. Det.: Detection. Cls.: Classification. WS.: Weak Supervision. InSeg.: Instance Segmentation.

Generic Object Segmentation (GOS).   One of the most popular directions in computer vision is generic object segmentation [71, 50, 17, 42]. Note that generic objects can be either salient or concealed. Concealed objects can be seen as difficult cases of generic objects. Typical GOS tasks include semantic segmentation and panoptic segmentation (see Fig. 2 (b)).

Salient Object Detection (SOD).   This task aims to identify the most attention-grabbing objects in an image and then segment their pixel-level silhouettes [37, 8]. The flagship products that make use of SOD technology [35] are Huawei’s smartphones, which employ SOD [35] to create what they call “AI Selfies”. Recently, Qin et al. applied the SOD algorithm [66] to two (near) commercial applications: AR COPY & PASTE and OBJECT CUT. These applications have already drawn great attention (12K GitHub stars) and have important real-world impacts.

Although the term “salient” is essentially the opposite of “concealed” (standout vs. immersion), salient objects can nevertheless provide important information for COD, e.g., images containing salient objects can be used as negative samples. Giving a complete review on SOD is beyond the scope of this work. We refer readers to recent survey and benchmark papers [25, 3, 81, 2] for more details. Our online benchmark is publicly available.

Concealed Object Detection (COD).   Research into concealed/camouflaged object detection, which has had a tremendous impact on advancing our knowledge of visual perception, has a long and rich history in biology and art. Two remarkable studies on concealed animals from Abbott Thayer [76] and Hugh Cott [11] are still hugely influential. The reader can refer to the survey by Stevens et al.  [73] for more details on this history.

COD Datasets. CHAMELEON [72] is an unpublished dataset that has only 76 images with manually annotated object-level ground-truths (GTs). The images were collected from the Internet via the Google search engine using “concealed animal” as a keyword. Another contemporary dataset is CAMO [45], which has 2.5K images (2K for training, 0.5K for testing) covering eight categories. It has two sub-datasets, CAMO and MS-COCO, each of which contains 1.25K images. Unlike existing datasets, the goal of our COD10K is to provide a more challenging, higher quality, and more densely annotated dataset. COD10K is the largest concealed object detection dataset so far, containing 10K images (6K for training, 4K for testing). See Table I for details.

Types of Camouflage. Concealed images can be roughly split into two types: those containing natural camouflage and those with artificial camouflage. Natural camouflage is used by animals (e.g., insects, sea horses, and cephalopods) as a survival skill to avoid recognition by predators. In contrast, artificial camouflage is usually used in art design/gaming to hide information, occurs in products during the manufacturing process (so-called surface defects [75], defect detection [34, 16]), or appears in our daily life (e.g., transparent objects [40, 90, 88]).

COD Formulation. Unlike class-aware tasks such as semantic segmentation, concealed object detection is a class-agnostic task. Thus, the formulation of COD is simple and easy to define. Given an image, the task requires a concealed object detection algorithm to assign each pixel $i$ a label $l_i \in \{0, 1\}$, where $l_i$ denotes the binary value of pixel $i$. A label of 0 is given to pixels that do not belong to the concealed objects, while a label of 1 indicates that a pixel is fully assigned to the concealed objects. We focus on object-level concealed object detection, leaving concealed instance detection to our future work.
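As a minimal illustration of this formulation, the sketch below converts a per-pixel confidence map into binary labels. The function name `to_binary_labels` and the 0.5 threshold are assumptions made for illustration only; the text above does not prescribe how a model's continuous output is binarized.

```python
def to_binary_labels(confidence_map, threshold=0.5):
    """Assign each pixel a label in {0, 1}.

    confidence_map: 2D list of floats in [0, 1] (a hypothetical model output).
    threshold: illustrative cut-off, not specified in the paper.
    """
    return [[1 if p >= threshold else 0 for p in row] for row in confidence_map]


pred = [[0.1, 0.8],
        [0.6, 0.3]]
labels = to_binary_labels(pred)  # 1 marks pixels assigned to the concealed object
```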

[Figure panels: sea-horse, cat, dog, bird, sea-lion, butterfly]

Fig. 4: Examples of sub-classes. Please refer to supplementary materials for other sub-classes.
Fig. 5: Object and instance distributions of each concealed category in the COD10K. COD10K consists of 5,066 concealed images from 69 categories. Zoom in for best view.

3 COD10K Dataset

The emergence of new tasks and datasets [10, 103, 62] has led to rapid progress in various areas of computer vision. For instance, ImageNet [68] revolutionized the use of deep models for visual recognition. With this in mind, our goals for studying and developing a dataset for COD are: (1) to provide a new challenging object detection task from the concealed perspective, (2) to promote research in several new topics, and (3) to spark novel ideas. Examples from COD10K are shown in Fig. 1. We will provide the details on COD10K in terms of three key aspects including image collection, professional annotation, and dataset features and statistics.

Fig. 6: Taxonomic system. We illustrate the histogram distribution for the 69 concealed categories in our COD10K.

3.1 Image Collection

As discussed in [65, 82, 25], the quality of annotation and size of a dataset are determining factors for its lifespan as a benchmark. To this end, COD10K contains 10,000 images (5,066 concealed, 3,000 background, 1,934 non-concealed), divided into 10 super-classes (i.e., flying, aquatic, terrestrial, amphibians, other, sky, vegetation, indoor, ocean, and sand), and 78 sub-classes (69 concealed, 9 non-concealed) which were collected from multiple photography websites.

Most concealed images are from Flickr and have been applied for academic use with the following keywords: concealed animal, unnoticeable animal, concealed fish, concealed butterfly, hidden wolf spider, walking stick, dead-leaf mantis, bird, sea horse, cat, pygmy seahorses, etc. (see Fig. 4). The remaining concealed images (around 200 images) come from other websites, including Visual Hunt, Pixabay, Unsplash, Free-images, etc., which release public-domain stock photos, free from copyright and royalties. To avoid selection bias [25], we also collected 3,000 salient images from Flickr. To further enrich the negative samples, 1,934 non-concealed images, including forest, snow, grassland, sky, seawater and other categories of background scenes, were selected from the Internet. For more details on the image selection scheme, we refer to Zhou et al. [102].

3.2 Professional Annotation

Recently released datasets [82, 23, 13] have shown that establishing a taxonomic system is crucial when creating a large-scale dataset. Motivated by [59], our annotations (obtained via crowdsourcing) are hierarchical (category → bounding box → attribute → object/instance).

Fig. 7: Attribute distribution. Left: Co-attributes distribution over COD10K. The number in each grid indicates the total number of images. Right: Multi-dependencies among these attributes. A larger arc length indicates a higher probability of one attribute correlating to another.

Attr.   Description
MO      Multiple Objects. Image contains at least two objects.
BO      Big Object. Ratio between object area and image area ≥ 0.5.
SO      Small Object. Ratio between object area and image area ≤ 0.1.
OV      Out-of-View. Object is clipped by image boundaries.
OC      Occlusions. Object is partially occluded.
SC      Shape Complexity. Object contains thin parts (e.g., animal foot).
IB      Indefinable Boundaries. The foreground and background areas around the object have similar colors (distance between RGB histograms less than 0.9).
TABLE II: Attribute descriptions. See examples in the first row of Fig. 3.
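To make the size-related attributes concrete, the following sketch derives BO, SO, and OV from a binary object mask. It is an illustration under stated assumptions, not the annotation tooling used for COD10K: the helper names are hypothetical, the comparison directions (ratio at least 0.5 for BO, at most 0.1 for SO) are our reading of Table II, and "clipped by image boundaries" is approximated by checking whether any foreground pixel touches the border.

```python
def area_ratio(mask):
    """Fraction of pixels labeled 1 in a binary mask (2D list of 0/1)."""
    total = sum(len(row) for row in mask)
    return sum(sum(row) for row in mask) / total


def touches_border(mask):
    """True if any foreground pixel lies on the image boundary (proxy for OV)."""
    top, bottom = mask[0], mask[-1]
    left = [row[0] for row in mask]
    right = [row[-1] for row in mask]
    return any(top) or any(bottom) or any(left) or any(right)


def size_attributes(mask, big=0.5, small=0.1):
    """Return the subset of {BO, SO, OV} that applies to this mask."""
    ratio = area_ratio(mask)
    attrs = []
    if ratio >= big:      # assumed direction of the BO threshold
        attrs.append("BO")
    if ratio <= small:    # assumed direction of the SO threshold
        attrs.append("SO")
    if touches_border(mask):
        attrs.append("OV")
    return attrs


# A 4 x 5 image with a single interior foreground pixel: ratio 1/20.
tiny = [[0, 0, 0, 0, 0],
        [0, 0, 1, 0, 0],
        [0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0]]
tiny_attrs = size_attributes(tiny)
```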

Categories. As illustrated in Fig. 6, we first create five super-class categories. Then, we summarize the 69 most frequently appearing sub-class categories according to our collected data. Finally, we label the sub-class and super-class of each image. If the candidate image does not belong to any established category, we classify it as ‘other’.

Fig. 8: Image resolution distribution of COD datasets. From left to right: CHAMELEON [72], CAMO-COCO [45] and COD10K datasets.

Bounding boxes. To extend COD10K for the concealed object proposal task, we also carefully annotate the bounding boxes for each image.

Attributes. In line with the literature [25, 65], we label each concealed image with highly challenging attributes faced in natural scenes, e.g., occlusions, indefinable boundaries. Attribute descriptions are provided in Table II, and the co-attribute distribution is shown in Fig. 7.

Fig. 9: Comparison between the proposed COD10K and existing datasets. COD10K has smaller objects (top-left), contains more difficult concealed objects (top-right), and suffers from less center bias (bottom-left/right).

Fig. 10: High-quality annotation. The annotation quality is close to the existing matting-level [96] annotation.

Objects/Instances. We stress that existing COD datasets focus exclusively on object-level annotations (Table I). However, being able to parse an object into its instances is important for computer vision researchers to be able to edit and understand a scene. To this end, we further annotate objects at an instance-level, like COCO [48], resulting in 5,069 objects and 5,930 instances.

3.3 Dataset Features and Statistics

We now discuss the proposed dataset and provide some statistics.

Resolution distribution. As noted in [93], high-resolution data provides more object boundary details for model training and yields better performance when testing. Fig. 8 presents the resolution distribution of COD10K, which includes a large number of Full HD 1080p resolution images.

Object size. Following [25], we plot the normalized object size in Fig. 9 (top-left), i.e., the size distribution ranges from 0.01% to 80.74% (avg.: 8.94%), showing a broader range compared to CAMO-COCO and CHAMELEON.

Global/Local contrast. To evaluate whether an object is easy to detect, we describe it using the global/local contrast strategy [47]. Fig. 9 (top-right) shows that objects in COD10K are more challenging than those in other datasets.

Center bias. This commonly occurs when taking a photo, as humans are naturally inclined to focus on the center of a scene. We adopt the strategy described in [25] to analyze this bias. Fig. 9 (bottom-left/right) shows that our COD10K dataset suffers from less center bias than others.

Quality control. To ensure high-quality annotation, we invited three viewers to participate in the labeling process for 10-fold cross-validation. Fig. 10 shows examples that were passed/rejected. This matting-level annotation costs 60 minutes per image on average.

Super/Sub-class distribution. COD10K includes five super-classes (i.e., terrestrial, atmobios, aquatic, amphibian, other) and 69 sub-classes (e.g., bat-fish, lion, bat, frog, etc.). The object/instance numbers and the word cloud for various categories are shown in Fig. 5 and Fig. 11, respectively.

Fig. 11: Word cloud distribution. The size of a specific word is proportional to the ratio of that keyword.

Fig. 12: Diverse types of concealed objects in our COD10K. For instance, concealed human in art ( column), and concealed animals ( column) in our daily life.

Dataset splits. To provide a large amount of training data for deep learning algorithms, our COD10K is split into 6,000 images for training and 4,000 for testing, randomly selected from each sub-class.
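One way to realize a split that is "randomly selected from each sub-class" is sketched below. The function name, the fixed seed, and the use of the overall 6,000/4,000 ratio as a per-class fraction are all assumptions for illustration; the paper does not state how the per-class counts were chosen.

```python
import random
from collections import defaultdict


def split_per_subclass(samples, train_fraction=0.6, seed=0):
    """Split (image_id, sub_class) pairs into train/test id lists, class by class."""
    by_class = defaultdict(list)
    for image_id, sub_class in samples:
        by_class[sub_class].append(image_id)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = [], []
    for sub_class in sorted(by_class):
        ids = by_class[sub_class]
        rng.shuffle(ids)
        cut = round(len(ids) * train_fraction)
        train.extend(ids[:cut])
        test.extend(ids[cut:])
    return train, test


# Toy example with two hypothetical sub-classes.
samples = [(f"cat_{i}", "cat") for i in range(10)]
samples += [(f"bird_{i}", "bird") for i in range(5)]
train_ids, test_ids = split_per_subclass(samples)
```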

Diverse concealed objects. In addition to the general concealed patterns, such as those in Fig. 1, our dataset also includes various other types of concealed objects, such as concealed body paintings and concealed objects in daily life (see Fig. 12).

Fig. 13: Pipeline of our SINet framework. It consists of three main components: the texture enhanced module (TEM), neighbor connection decoder (NCD), and group-reversal attention (GRA). The TEM is introduced to mimic the textural structure of receptive fields in the human visual system. The NCD is responsible for locating the candidates with the assistance of the TEM. The GRA blocks reproduce the identification stages of animal predation. Note that = .

4 COD Framework

4.1 Network Overview

Fig. 13 illustrates the overall concealed object detection framework of the proposed SINet (Search Identification Network). Next, we explain our motivation and introduce the network overview.

Motivation.   Biological studies [30] have shown that, when hunting, a predator will first judge whether a potential prey exists, i.e., it will search for a prey. Then, the target animal can be identified; and, finally, it can be caught.

Introduction.   Several methods [67, 89] have shown that satisfactory performance is dependent on the re-optimization strategy (i.e., coarse-to-fine), which is regarded as the composition of multiple sub-steps. This also suggests that decoupling the complicated targets can break the performance bottleneck. Our SINet model consists of the first two stages of hunting, i.e., search and identification. Specifically, the former phase (Section 4.2) is responsible for searching for a concealed object, while the latter one (Section 4.3) is then used to precisely detect the concealed object in a cascaded manner.

Next, we elaborate on the details of the three main modules, including a) the texture enhanced module (TEM), which is used to capture fine-grained textures with the enlarged context cues; b) the neighbor connection decoder (NCD), which is able to provide the location information; and c) the cascaded group-reversal attention (GRA) blocks, which work collaboratively to refine the coarse prediction from the deeper layer.

4.2 Search Phase

Feature Extraction.   For an input image $I \in \mathbb{R}^{W \times H \times 3}$, a set of features $f_k$, $k \in \{1, 2, 3, 4, 5\}$, is extracted from Res2Net-50 [29] (removing the top fully connected layers). Thus, the resolution of each feature $f_k$ is $H/2^k \times W/2^k$, covering diversified feature pyramids from high-resolution, weakly semantic to low-resolution, strongly semantic.

Texture Enhanced Module (TEM).   Neuroscience experiments have verified that, in the human visual system, a set of various sized population receptive fields helps to highlight the area close to the retinal fovea, which is sensitive to small spatial shifts [55]. This motivates us to use the TEM [87] to incorporate more discriminative feature representations during the searching stage (usually in a small/local space). As shown in Fig. 13, each TEM component includes four parallel residual branches $\{b_i, i = 1, 2, 3, 4\}$ with different dilation rates and a shortcut branch (gray arrow). In each branch $b_i$, the first convolutional layer utilizes a 1×1 convolution operation (Conv1×1) to reduce the channel size to 32. This is followed by two other layers: a $(2i-1) \times (2i-1)$ convolutional layer and a 3×3 convolutional layer with a specific dilation rate $(2i-1)$ when $i > 1$. Then, the first four branches are concatenated and the channel size is reduced to $C$ via a 3×3 convolution operation. Note that we set $C = 32$ in the default implementation of our network for time-cost trade-off. Finally, the identity shortcut branch is added in, then the whole module is fed to a ReLU function to obtain the output feature $f'_k$. Besides, several works (e.g., Inception-V3 [74]) have suggested that the standard convolution operation of size $(2i-1) \times (2i-1)$ can be factorized as a sequence of two steps with $(2i-1) \times 1$ and $1 \times (2i-1)$ kernels, speeding up inference without decreasing the representation capabilities. All of these ideas are predicated on the fact that a 2D kernel with a rank of one is equal to a series of 1D convolutions [38, 15]. In brief, compared to the standard receptive fields block structure [55], the TEM adds one more branch with a larger dilation rate to enlarge the receptive field and further replaces the standard convolution with two asymmetric convolutional layers. For more details please refer to Fig. 13.
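The rank-one claim can be checked numerically. The sketch below is a single-channel toy example in plain Python, not part of the SINet code: it applies a 3×3 kernel built as the outer product of a column and a row vector, and compares the result with applying a 3×1 kernel followed by a 1×3 kernel. The helper names and the test image are hypothetical. In a trained network the two asymmetric layers are learned freely across many channels, so this identity only explains why the factorization can be as expressive as a rank-one kernel while costing fewer multiplications.

```python
def correlate2d_valid(image, kernel):
    """Plain 'valid' 2D cross-correlation on nested lists (no padding)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def outer(col, row):
    """Rank-one k x k kernel built from a column vector and a row vector."""
    return [[a * b for b in row] for a in col]


image = [[float((3 * r + 5 * c) % 7) for c in range(6)] for r in range(6)]
col = [1.0, 2.0, 1.0]    # weights of the 3 x 1 kernel
row = [-1.0, 0.0, 1.0]   # weights of the 1 x 3 kernel

# One pass with the full 3 x 3 rank-one kernel (9 multiplications per output).
full = correlate2d_valid(image, outer(col, row))

# Two passes: 3 x 1 followed by 1 x 3 (3 + 3 multiplications per output).
step1 = correlate2d_valid(image, [[a] for a in col])
two_step = correlate2d_valid(step1, [row])
```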

Neighbor Connection Decoder (NCD).   As observed by Wu et al.  [87], low-level features consume more computational resources due to their larger spatial resolutions, but contribute less to performance. Motivated by this observation, we decide to aggregate only the top-three highest-level features (i.e., ) to obtain a more efficient learning capability, rather than taking all the feature pyramids into consideration. To be specific, after obtaining the candidate features from the three previous TEMs, in the search phase, we need to locate the concealed object.

However, there are still two key issues when aggregating multiple feature pyramids; namely, how to maintain semantic consistency within a layer and how to bridge the context across layers. Here, we propose to address these with the neighbor connection decoder (NCD). More specifically, we modify the partial decoder component (PDC) [87] with a neighbor connection function and get three refined features , and , which are formulated as:


where denotes a 3×3 convolutional layer followed by a batch normalization operation. To ensure shape matching between candidate features, we utilize an upsampling (e.g., 2 times) operation before element-wise multiplication . Then, we feed into the neighbor connection decoder (NCD) and generate the coarse location map .

4.3 Identification Phase

Reverse Guidance.   As discussed in Section 4.2, our global location map is derived from the three highest layers, which can only capture a relatively rough location of the concealed object, ignoring structural and textural details (see Fig. 13). To address this issue, we introduce a principled strategy to mine discriminative concealed regions by erasing objects [84, 7, 22]. As shown in Fig. 14 (b), we obtain the output reverse guidance via sigmoid and reverse operation. More precisely, we obtain the output reverse attention guidance by a reverse operation, which can be formulated as:


where and denote a ×4 down-sampling and ×2 up-sampling operation, respectively. is the sigmoid function, which is applied to convert the mask into the interval [0, 1]. is a reverse operation subtracting the input from matrix , in which all the elements are 1.
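The reverse operation described above can be sketched in a few lines. This is a toy version on nested lists under stated assumptions: the function names are hypothetical, and nearest-neighbor interpolation is used for the ×2 up-sampling only for simplicity, since the interpolation mode is not given in this section.

```python
import math


def sigmoid(x):
    """Map a real-valued logit into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def upsample_nearest(mask, factor=2):
    """Nearest-neighbor up-sampling of a 2D list (interpolation mode is an assumption)."""
    out = []
    for row in mask:
        wide = [v for v in row for _ in range(factor)]
        out.extend([list(wide) for _ in range(factor)])
    return out


def reverse_guidance(logits):
    """Apply the sigmoid, then subtract the result from an all-ones matrix."""
    return [[1.0 - sigmoid(v) for v in row] for row in logits]


# Positive logits mark pixels the coarse map already assigns to the object.
coarse = [[4.0, -4.0],
          [0.0, -4.0]]
guidance = reverse_guidance(upsample_nearest(coarse, factor=2))
```

In the resulting guidance, regions that the coarse map is confident about receive values near 0, so the following stage is steered toward the regions that have not yet been explained.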

Fig. 14: Component details. Details on the group-reversal attention (b) block in the identification phase, where denotes the number of GRAs in the -th feature pyramids.

Group Guidance Operation (GGO).   As shown in [22], reverse attention is used for mining complementary regions and details by erasing the existing estimated target regions from side-output features.

Inspired by [6], we present a novel group-wise operation to utilize the reverse guidance prior more effectively. As can be seen in Fig. 14 (a), the group guidance operation contains two main steps. First, we split the candidate features into groups along the channel-wise dimension. Then, the guidance prior is periodically interpolated among the split features , where . Thus, this operation (i.e., ) can be decoupled as two steps:


where and indicate the channel-wise split and concatenation function for the candidates. Note that , where . In contrast, [22] puts more emphasis on ensuring that the candidate features are directly multiplied by the priors, which may incur two issues: a) feature confusion due to the limited discriminative ability of the network, and b) the simple multiplication introduces both true and false guidance priors and is thus prone to accumulating inaccuracies. Compared to [22], our GGO can explicitly isolate the guidance prior and candidate feature before the subsequent refinement process.
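The two steps (channel-wise split, then concatenation with the prior inserted periodically) can be illustrated as follows. This is a sketch under assumptions: the function name is hypothetical, channels are represented by plain list entries, and placing the prior after each group (rather than before it) is our choice for the example.

```python
def group_guidance(features, prior, groups):
    """Split `features` into `groups` chunks and insert `prior` after each chunk.

    features: list of C channel maps; C must be divisible by `groups`.
    prior: a single guidance map shared by all groups.
    Returns a list of C + groups channel maps.
    """
    if len(features) % groups != 0:
        raise ValueError("channel count must be divisible by the number of groups")
    size = len(features) // groups
    out = []
    for g in range(groups):
        out.extend(features[g * size:(g + 1) * size])  # step 1: channel-wise split
        out.append(prior)                              # step 2: interleave the prior
    return out


channels = ["f0", "f1", "f2", "f3"]  # stand-ins for four feature maps
mixed = group_guidance(channels, "r", groups=2)
```

Unlike a direct multiplication, the prior stays a separate channel here, which is the isolation between guidance prior and candidate feature that the comparison with [22] emphasizes.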

Group-Reversal Attention (GRA).   Finally, we introduce the residual learning process, termed the GRA block, with the assistance of both the reverse guidance and group guidance operation. According to previous studies [67, 89], multi-stage refinement can improve performance. We thus combine multiple GRA blocks (e.g., , ) to progressively refine the coarse prediction via different feature pyramids. Overall, each GRA block has three residual learning processes:

  1. We combine candidate features and via the group guidance operation and then use the residual stage to produce the refined features . This is formulated as:


    where denotes the convolutional layer with a 3×3 kernel followed by a batch normalization layer for reducing the channel number from to . Note that we only reverse the guidance prior in the first GRA block (i.e., when ) in the default implementation. Refer to Section 5.3 for a detailed discussion.

  2. Then, we get a single channel residual guidance:


    which is parameterized by learnable weights .

  3. Finally, we only output the refined guidance, which serves as the residual prediction. It is formulated as:


    where is when and when .

4.4 Implementation Details

4.4.1 Learning Strategy

Our loss function is defined as $L = L^{W}_{IoU} + L^{W}_{BCE}$, where $L^{W}_{IoU}$ and $L^{W}_{BCE}$ represent the weighted intersection-over-union (IoU) loss and binary cross entropy (BCE) loss for the global restriction and local (pixel-level) restriction. Different from the standard IoU loss, which has been widely adopted in segmentation tasks, the weighted IoU loss increases the weights of hard pixels to highlight their importance. In addition, compared with the standard BCE loss, $L^{W}_{BCE}$ pays more attention to hard pixels rather than assigning all pixels equal weights. The definitions of these losses are the same as in [67, 83] and their effectiveness has been validated in the field of salient object detection. Here, we adopt deep supervision for the three side-outputs (i.e., $C_3$, $C_4$, and $C_5$) and the global map $C_6$. Each map is up-sampled (e.g., $C_3^{up}$) to the same size as the ground-truth map $G$. Thus, the total loss for the proposed SINet can be formulated as: $L_{total} = L(C_6^{up}, G) + \sum_{i=3}^{5} L(C_i^{up}, G)$.
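The structure of this deep-supervision objective can be sketched as below. Note the simplification: the code uses the standard (unweighted) soft IoU and BCE terms, whereas the paper adopts the weighted variants defined in [67, 83]. All function names are hypothetical, and the maps are flattened lists of probabilities that are assumed to be already up-sampled to the ground-truth size.

```python
import math


def bce_loss(pred, target, eps=1e-7):
    """Mean binary cross entropy over flattened probability/label lists."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(pred)


def iou_loss(pred, target, eps=1e-7):
    """Soft IoU loss: 1 - intersection / union computed on probabilities."""
    inter = sum(p * t for p, t in zip(pred, target))
    union = sum(p + t - p * t for p, t in zip(pred, target))
    return 1.0 - (inter + eps) / (union + eps)


def map_loss(pred, target):
    """Per-map loss: global (IoU) term plus local, pixel-level (BCE) term."""
    return iou_loss(pred, target) + bce_loss(pred, target)


def total_loss(side_outputs, global_map, target):
    """Deep supervision: sum the per-map loss over the global map and all side-outputs."""
    return map_loss(global_map, target) + sum(map_loss(s, target) for s in side_outputs)


target = [1, 0, 1, 0]
perfect = [1.0, 0.0, 1.0, 0.0]
uncertain = [0.5, 0.5, 0.5, 0.5]
loss_good = total_loss([perfect] * 3, perfect, target)
loss_poor = total_loss([uncertain] * 3, uncertain, target)
```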

4.4.2 Hyperparameter Settings


SINet is implemented in PyTorch and trained with the Adam optimizer. During the training stage, the batch size is set to 36, and the learning rate starts at 1e-4 and is divided by 10 every 50 epochs. The whole training time is only about 4 hours for 100 epochs. The running time is measured on an Intel i9-9820X CPU @3.30GHz × 20 platform and a single NVIDIA TITAN RTX GPU. During inference, each image is resized to 352×352 and then fed into the proposed pipeline to obtain the final prediction without any post-processing techniques. The inference speed is 45 fps on a single GPU without I/O time.
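The step-decay schedule described above can be written as a one-line function. This is a framework-independent sketch; the function name and the zero-based epoch indexing are assumptions.

```python
def learning_rate(epoch, base_lr=1e-4, decay=0.1, step=50):
    """Learning rate at a zero-based epoch index: divided by 10 every `step` epochs."""
    return base_lr * decay ** (epoch // step)


# Over the 100 training epochs this yields two plateaus.
schedule = [learning_rate(e) for e in range(100)]
```

With these settings, epochs 0 to 49 use 1e-4 and epochs 50 to 99 use 1e-5.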

                CHAMELEON [72]                CAMO-Test [45]                COD10K-Test (OUR)
Baseline Models S_α↑   E_φ↑   F_β^w↑ M↓       S_α↑   E_φ↑   F_β^w↑ M↓       S_α↑   E_φ↑   F_β^w↑ M↓
FPN [49]        0.794  0.783  0.590  0.075    0.684  0.677  0.483  0.131    0.697  0.691  0.411  0.075
MaskRCNN [32]   0.643  0.778  0.518  0.099    0.574  0.715  0.430  0.151    0.613  0.748  0.402  0.080
PSPNet [97]     0.773  0.758  0.555  0.085    0.663  0.659  0.455  0.139    0.678  0.680  0.377  0.080
UNet++ [104]    0.695  0.762  0.501  0.094    0.599  0.653  0.392  0.149    0.623  0.672  0.350  0.086
PiCANet [54]    0.769  0.749  0.536  0.085    0.609  0.584  0.356  0.156    0.649  0.643  0.322  0.090
MSRCNN [36]     0.637  0.686  0.443  0.091    0.617  0.669  0.454  0.133    0.641  0.706  0.419  0.073
PFANet [100]    0.679  0.648  0.378  0.144    0.659  0.622  0.391  0.172    0.636  0.618  0.286  0.128
CPD [87]        0.853  0.866  0.706  0.052    0.726  0.729  0.550  0.115    0.747  0.770  0.508  0.059
HTC [5]         0.517  0.489  0.204  0.129    0.476  0.442  0.174  0.172    0.548  0.520  0.221  0.088
ANet-SRM [45]   -      -      -      -        0.682  0.685  0.484  0.126    -      -      -      -
EGNet [99]      0.848  0.870  0.702  0.050    0.732  0.768  0.583  0.104    0.737  0.779  0.509  0.056
PraNet [22]     0.860  0.907  0.763  0.044    0.769  0.824  0.663  0.094    0.789  0.861  0.629  0.045
SINet (OUR)     0.888  0.942  0.816  0.030    0.820  0.882  0.743  0.070    0.815  0.887  0.680  0.037
TABLE III: Quantitative results on three different datasets. The best scores are highlighted in bold. Note that the code of the ANet-SRM model (only trained on CAMO) is not publicly available, thus its other results are not available. ↑ indicates the higher the score the better, and ↓ the lower the better. E_φ denotes the mean E-measure [20].
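Sub-classes (left to right, four columns each): Amphibian (124 images), Aquatic (474 images), Flying (714 images), Terrestrial (699 images)
Metrics per sub-class: S-measure ↑, mean E-measure ↑, weighted F-measure ↑, MAE ↓
Baseline Models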
FPN [49] 0.745 0.776 0.497 0.065 0.684 0.732 0.432 0.103 0.726 0.766 0.440 0.061 0.601 0.656 0.353 0.109
MaskRCNN [32] 0.665 0.785 0.487 0.081 0.560 0.721 0.344 0.123 0.644 0.767 0.449 0.063 0.611 0.630 0.380 0.075
PSPNet [97] 0.736 0.774 0.463 0.072 0.659 0.712 0.396 0.111 0.700 0.743 0.394 0.067 0.669 0.718 0.332 0.071
UNet++ [104] 0.677 0.745 0.434 0.079 0.599 0.673 0.347 0.121 0.659 0.727 0.397 0.068 0.608 0.749 0.288 0.070
PiCANet [54] 0.686 0.702 0.405 0.079 0.616 0.631 0.335 0.115 0.663 0.676 0.347 0.069 0.658 0.708 0.273 0.074
MSRCNN [36] 0.722 0.786 0.555 0.055 0.614 0.686 0.398 0.107 0.675 0.744 0.466 0.058 0.594 0.661 0.361 0.081
PFANet [100] 0.693 0.677 0.358 0.110 0.629 0.626 0.319 0.155 0.658 0.648 0.299 0.102 0.611 0.603 0.237 0.111
CPD [87] 0.794 0.839 0.587 0.051 0.739 0.792 0.529 0.082 0.777 0.827 0.544 0.046 0.714 0.771 0.445 0.058
HTC [5] 0.606 0.598 0.331 0.088 0.507 0.495 0.183 0.129 0.582 0.559 0.274 0.070 0.530 0.485 0.170 0.078
EGNet [99] 0.785 0.854 0.606 0.047 0.725 0.793 0.528 0.080 0.766 0.826 0.543 0.044 0.700 0.775 0.445 0.053
PraNet [22] 0.842 0.905 0.717 0.035 0.781 0.883 0.696 0.065 0.819 0.888 0.669 0.033 0.756 0.835 0.565 0.046
SINet (OUR) 0.858 0.916 0.756 0.030 0.811 0.883 0.696 0.051 0.839 0.908 0.713 0.027 0.787 0.866 0.623 0.039

TABLE IV: Quantitative results on four sub-classes of the COD10K dataset in terms of four widely used evaluation metrics. All methods are trained using the same dataset as in [21]. ↑ indicates the higher the score the better, and ↓ the lower the better.

5 COD Benchmark

5.1 Experimental Settings

5.1.1 Evaluation Metrics

Mean absolute error (MAE) is widely used in SOD tasks. Following Perazzi et al. [64], we also adopt the MAE ($M$) metric to assess the pixel-level accuracy between a predicted map and the ground-truth. However, while useful for assessing the presence and amount of error, the MAE metric cannot determine where the error occurs. Recently, Fan et al. proposed a human visual perception based E-measure ($E_\phi$) [20], which simultaneously evaluates the pixel-level matching and image-level statistics. This metric is naturally suited for assessing the overall and localized accuracy of the concealed object detection results. Note that we report the mean $E_\phi$ in the experiments. Since concealed objects often contain complex shapes, COD also requires a metric that can judge structural similarity. We therefore utilize the S-measure ($S_\alpha$) [19] as our structural similarity evaluation metric. Finally, recent studies [20, 19] have suggested that the weighted F-measure ($F_\beta^w$) [57] can provide more reliable evaluation results than the traditional $F_\beta$. Thus, we further consider this as an alternative metric for COD. Our one-key evaluation code is also available at the project page.
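As a small illustration of the MAE limitation noted above, the following sketch computes MAE on a toy 4×4 mask for two predictions whose single error lies in different places. The helper name and the toy masks are made up for this example; this is not the released evaluation code.

```python
def mean_absolute_error(pred, gt):
    """MAE between a predicted map and a ground-truth mask.

    Both inputs are 2D lists of equal shape with values in [0, 1].
    """
    total = 0.0
    count = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            total += abs(p - g)
            count += 1
    return total / count


# Toy ground truth: a 2x2 object in the centre of a 4x4 image.
gt = [[0, 0, 0, 0],
      [0, 1, 1, 0],
      [0, 1, 1, 0],
      [0, 0, 0, 0]]

# Prediction A misses one pixel inside the object.
miss_inside = [[0, 0, 0, 0],
               [0, 0, 1, 0],
               [0, 1, 1, 0],
               [0, 0, 0, 0]]

# Prediction B is perfect on the object but fires on a far corner pixel.
false_alarm_corner = [[1, 0, 0, 0],
                      [0, 1, 1, 0],
                      [0, 1, 1, 0],
                      [0, 0, 0, 0]]

mae_a = mean_absolute_error(miss_inside, gt)
mae_b = mean_absolute_error(false_alarm_corner, gt)
print(mae_a, mae_b)  # both are 1/16 = 0.0625
```

Both predictions receive the same score even though their errors are of a different nature, which is why structure-aware and perception-based metrics are reported alongside MAE.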

5.1.2 Baseline Models

We select 12 deep learning baselines [49, 32, 97, 104, 54, 36, 100, 87, 5, 99, 45, 22] according to the following criteria: a) classical architectures, b) recently published, and c) achieve SOTA performance in a specific field.

5.1.3 Training/Testing Protocols

For fair comparison with our previous version [21], we adopt the same training settings [21] for the baselines. (Footnote 4: To verify the generalizability of SINet, we only use the combined training set of CAMO [45] and COD10K [21] without EXTRA (i.e., additional) data.) We evaluate the models on the whole CHAMELEON [72] dataset and the test sets of CAMO and COD10K.

Fig. 15: Per-subclass performance. Sub-classes are sorted by difficulty, determined by the mean S-measure ($S_\alpha$) [19] across 12 baselines. We also provide the minimum (bottom line) and maximum (top line) for each sub-class.

5.2 Results and Data Analysis

This section provides the quantitative evaluation results on the CHAMELEON, CAMO, and COD10K datasets.

Performance on CHAMELEON.   From Table III, compared with the 12 SOTA object detection baselines and ANet-SRM, our SINet achieves new SOTA performance across all metrics. Note that our model does not apply any auxiliary edge/boundary features (e.g., EGNet [99], PFANet [100]), pre-processing techniques [61], or post-processing strategies such as [43, 4].

Sub-class [5] [100] [32] [67] [104] [54] [36] [51] [97] [49] [99] [87] [22] OUR
Amphibian-Frog 0.600 0.678 0.664 0.692 0.656 0.687 0.692 0.732 0.697 0.731 0.745 0.752 0.823 0.837
Amphibian-Toad 0.609 0.697 0.666 0.717 0.689 0.714 0.739 0.786 0.757 0.752 0.812 0.817 0.853 0.870
Aquatic-BatFish 0.546 0.746 0.634 0.749 0.626 0.624 0.637 0.741 0.724 0.764 0.707 0.761 0.879 0.873
Aquatic-ClownFish 0.547 0.519 0.509 0.464 0.548 0.636 0.571 0.626 0.531 0.730 0.632 0.646 0.707 0.787
Aquatic-Crab 0.543 0.661 0.691 0.643 0.630 0.675 0.634 0.727 0.680 0.724 0.760 0.753 0.792 0.815
Aquatic-Crocodile 0.546 0.631 0.660 0.599 0.602 0.669 0.646 0.743 0.636 0.687 0.772 0.761 0.806 0.825
Aquatic-CrocodileFish 0.436 0.572 0.558 0.475 0.559 0.373 0.479 0.693 0.624 0.515 0.709 0.690 0.669 0.746
Aquatic-Fish 0.488 0.622 0.597 0.625 0.574 0.619 0.680 0.703 0.650 0.699 0.717 0.778 0.784 0.834
Aquatic-Flounder 0.403 0.663 0.539 0.633 0.569 0.704 0.570 0.782 0.646 0.695 0.798 0.774 0.835 0.889
Aquatic-FrogFish 0.595 0.768 0.650 0.736 0.671 0.670 0.653 0.719 0.695 0.807 0.806 0.730 0.781 0.894
Aquatic-GhostPipefish 0.522 0.690 0.556 0.679 0.651 0.675 0.636 0.717 0.709 0.744 0.759 0.763 0.784 0.817
Aquatic-LeafySeaDragon 0.460 0.576 0.442 0.481 0.523 0.499 0.500 0.547 0.563 0.507 0.522 0.534 0.587 0.670
Aquatic-Octopus 0.505 0.708 0.598 0.644 0.663 0.673 0.720 0.779 0.723 0.760 0.810 0.812 0.896 0.887
Aquatic-Pagurian 0.427 0.578 0.477 0.607 0.553 0.624 0.583 0.657 0.608 0.638 0.683 0.611 0.615 0.698
Aquatic-Pipefish 0.510 0.553 0.531 0.557 0.550 0.612 0.566 0.625 0.642 0.632 0.681 0.704 0.769 0.781
Aquatic-ScorpionFish 0.459 0.697 0.482 0.686 0.630 0.605 0.600 0.729 0.649 0.668 0.730 0.746 0.766 0.808
Aquatic-SeaHorse 0.566 0.656 0.581 0.664 0.663 0.623 0.657 0.698 0.687 0.715 0.750 0.765 0.810 0.823
Aquatic-Shrimp 0.500 0.574 0.520 0.631 0.586 0.574 0.546 0.605 0.591 0.667 0.647 0.669 0.727 0.735
Aquatic-Slug 0.493 0.581 0.368 0.492 0.533 0.460 0.661 0.732 0.547 0.664 0.777 0.774 0.701 0.729
Aquatic-StarFish 0.568 0.641 0.617 0.611 0.657 0.638 0.580 0.733 0.722 0.756 0.811 0.787 0.779 0.890
Aquatic-Stingaree 0.519 0.721 0.670 0.618 0.571 0.569 0.709 0.733 0.616 0.670 0.741 0.754 0.704 0.815
Aquatic-Turtle 0.364 0.686 0.594 0.658 0.565 0.734 0.762 0.757 0.664 0.745 0.752 0.786 0.773 0.760
Flying-Bat 0.589 0.652 0.611 0.623 0.557 0.638 0.679 0.725 0.657 0.714 0.765 0.784 0.817 0.847
Flying-Bee 0.578 0.579 0.628 0.547 0.588 0.616 0.679 0.670 0.655 0.665 0.737 0.709 0.763 0.777
Flying-Beetle 0.699 0.741 0.693 0.810 0.829 0.780 0.796 0.860 0.808 0.848 0.830 0.887 0.890 0.903
Flying-Bird 0.591 0.628 0.680 0.627 0.643 0.674 0.681 0.735 0.696 0.708 0.763 0.785 0.822 0.835
Flying-Bittern 0.639 0.621 0.703 0.650 0.673 0.741 0.704 0.785 0.701 0.751 0.802 0.838 0.827 0.849
Flying-Butterfly 0.653 0.692 0.697 0.700 0.725 0.714 0.762 0.777 0.736 0.758 0.816 0.818 0.871 0.883
Flying-Cicada 0.640 0.682 0.620 0.729 0.675 0.691 0.708 0.781 0.744 0.733 0.820 0.812 0.845 0.883
Flying-Dragonfly 0.472 0.679 0.624 0.712 0.670 0.694 0.682 0.695 0.681 0.707 0.761 0.779 0.779 0.837
Flying-Frogmouth 0.684 0.766 0.648 0.828 0.813 0.722 0.773 0.883 0.741 0.795 0.901 0.928 0.927 0.941
Flying-Grasshopper 0.563 0.671 0.651 0.689 0.656 0.692 0.666 0.734 0.710 0.740 0.773 0.779 0.821 0.833
Flying-Heron 0.563 0.579 0.629 0.598 0.670 0.647 0.699 0.718 0.654 0.743 0.783 0.786 0.810 0.823
Flying-Katydid 0.540 0.661 0.593 0.657 0.653 0.659 0.615 0.696 0.687 0.709 0.730 0.739 0.802 0.809
Flying-Mantis 0.527 0.622 0.569 0.618 0.614 0.629 0.603 0.661 0.658 0.670 0.696 0.690 0.749 0.775
Flying-Mockingbird 0.641 0.550 0.622 0.593 0.636 0.596 0.664 0.670 0.674 0.683 0.721 0.737 0.788 0.838
Flying-Moth 0.583 0.720 0.726 0.737 0.707 0.685 0.747 0.783 0.753 0.798 0.833 0.854 0.878 0.917
Flying-Owl 0.625 0.671 0.705 0.656 0.657 0.718 0.710 0.781 0.712 0.750 0.793 0.809 0.837 0.868
Flying-Owlfly 0.614 0.690 0.524 0.669 0.633 0.580 0.599 0.778 0.583 0.743 0.782 0.756 0.758 0.863
Other-Other 0.571 0.613 0.603 0.593 0.638 0.653 0.675 0.687 0.671 0.665 0.725 0.700 0.777 0.779
Terrestrial-Ant 0.506 0.516 0.508 0.519 0.523 0.585 0.538 0.552 0.572 0.564 0.627 0.605 0.676 0.669
Terrestrial-Bug 0.578 0.681 0.682 0.687 0.686 0.701 0.691 0.743 0.710 0.799 0.799 0.803 0.828 0.856
Terrestrial-Cat 0.505 0.585 0.591 0.557 0.562 0.608 0.613 0.669 0.624 0.634 0.682 0.678 0.745 0.772
Terrestrial-Caterpillar 0.517 0.643 0.569 0.691 0.636 0.581 0.575 0.638 0.640 0.685 0.684 0.704 0.729 0.776
Terrestrial-Centipede 0.432 0.573 0.476 0.485 0.496 0.554 0.629 0.703 0.561 0.536 0.727 0.643 0.704 0.762
Terrestrial-Chameleon 0.556 0.651 0.627 0.653 0.619 0.619 0.632 0.695 0.659 0.673 0.713 0.732 0.789 0.804
Terrestrial-Cheetah 0.536 0.649 0.699 0.624 0.603 0.662 0.598 0.717 0.720 0.667 0.732 0.769 0.800 0.826
Terrestrial-Deer 0.530 0.581 0.610 0.564 0.558 0.600 0.623 0.650 0.644 0.660 0.667 0.670 0.719 0.757
Terrestrial-Dog 0.572 0.560 0.596 0.536 0.559 0.574 0.614 0.608 0.588 0.613 0.607 0.648 0.666 0.707
Terrestrial-Duck 0.530 0.535 0.557 0.539 0.524 0.558 0.619 0.582 0.602 0.548 0.598 0.682 0.742 0.746
Terrestrial-Gecko 0.485 0.674 0.662 0.725 0.683 0.705 0.606 0.733 0.724 0.747 0.789 0.771 0.833 0.848
Terrestrial-Giraffe 0.469 0.628 0.697 0.620 0.611 0.701 0.635 0.681 0.718 0.722 0.747 0.776 0.809 0.784
Terrestrial-Grouse 0.704 0.760 0.726 0.721 0.774 0.805 0.780 0.879 0.803 0.806 0.904 0.919 0.888 0.921
Terrestrial-Human 0.530 0.629 0.608 0.613 0.549 0.577 0.658 0.697 0.636 0.665 0.708 0.700 0.765 0.817
Terrestrial-Kangaroo 0.482 0.586 0.599 0.467 0.548 0.588 0.571 0.644 0.630 0.623 0.650 0.620 0.798 0.816
Terrestrial-Leopard 0.617 0.647 0.742 0.616 0.640 0.652 0.673 0.736 0.720 0.704 0.744 0.791 0.791 0.823
Terrestrial-Lion 0.534 0.634 0.695 0.599 0.660 0.656 0.658 0.720 0.714 0.663 0.754 0.751 0.805 0.813
Terrestrial-Lizard 0.579 0.629 0.634 0.635 0.633 0.656 0.627 0.710 0.702 0.716 0.744 0.777 0.804 0.830
Terrestrial-Monkey 0.423 0.693 0.724 0.593 0.611 0.730 0.663 0.792 0.678 0.614 0.709 0.699 0.851 0.888
Terrestrial-Rabbit 0.504 0.657 0.685 0.634 0.635 0.721 0.731 0.794 0.722 0.758 0.789 0.806 0.829 0.843
Terrestrial-Reccoon 0.451 0.525 0.536 0.461 0.482 0.702 0.723 0.643 0.532 0.592 0.691 0.659 0.781 0.766
Terrestrial-Sciuridae 0.533 0.612 0.638 0.573 0.608 0.693 0.661 0.745 0.725 0.721 0.775 0.757 0.810 0.842
Terrestrial-Sheep 0.434 0.451 0.721 0.410 0.482 0.467 0.763 0.660 0.466 0.430 0.489 0.487 0.481 0.500
Terrestrial-Snake 0.544 0.590 0.586 0.603 0.567 0.614 0.597 0.714 0.695 0.652 0.738 0.788 0.771 0.831
Terrestrial-Spider 0.528 0.594 0.593 0.594 0.580 0.621 0.572 0.650 0.649 0.651 0.685 0.687 0.740 0.771
Terrestrial-StickInsect 0.473 0.548 0.486 0.526 0.535 0.600 0.491 0.578 0.607 0.629 0.616 0.647 0.660 0.696
Terrestrial-Tiger 0.489 0.583 0.576 0.555 0.573 0.563 0.565 0.638 0.602 0.599 0.647 0.621 0.690 0.703
Terrestrial-Wolf 0.472 0.574 0.602 0.535 0.534 0.568 0.621 0.650 0.656 0.651 0.704 0.662 0.737 0.749
Terrestrial-Worm 0.485 0.652 0.596 0.642 0.628 0.558 0.651 0.692 0.629 0.684 0.763 0.670 0.724 0.806
TABLE V: Results of the S-measure ($S_\alpha$) for each sub-class in our COD10K dataset. The best performing method of each category is highlighted in bold.

Performance on CAMO.   We also test our model on the CAMO [45] dataset, which includes various concealed objects. Based on the overall performances reported in Table III, we find that the CAMO dataset is more challenging than CHAMELEON. Again, SINet obtains the best performance, further demonstrating its robustness.

Performance on COD10K.   With the test set (2,026 images) of our COD10K dataset, we again observe that the proposed SINet is consistently better than other competitors. This is because its specially designed search and identification modules can automatically learn rich diversified features from coarse to fine, which are crucial for overcoming challenging ambiguities in object boundaries. The results are shown in Table III and Table IV.

Per-subclass Performance.   In addition to the overall quantitative comparisons on our COD10K dataset, we also report the quantitative per-subclass results in Table V to investigate the pros and cons of the models for future researchers. In Fig. 15, we additionally show the minimum, mean, and maximum S-measure performance of each sub-class over all baselines. The easiest sub-class is “Grouse”, while the most difficult is “LeafySeaDragon”, from the terrestrial and aquatic categories, respectively.
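The per-subclass statistics can be reproduced from Table V with a short sketch. It uses three rows copied from Table V (all 13 baseline columns, with the SINet column excluded), so the numbers may differ slightly from Fig. 15, which is stated to use 12 baselines. The function name is illustrative.

```python
from statistics import mean

# Baseline scores copied from Table V (the final SINet column is left out).
scores = {
    "Amphibian-Frog": [0.600, 0.678, 0.664, 0.692, 0.656, 0.687, 0.692,
                       0.732, 0.697, 0.731, 0.745, 0.752, 0.823],
    "Aquatic-LeafySeaDragon": [0.460, 0.576, 0.442, 0.481, 0.523, 0.499, 0.500,
                               0.547, 0.563, 0.507, 0.522, 0.534, 0.587],
    "Terrestrial-Grouse": [0.704, 0.760, 0.726, 0.721, 0.774, 0.805, 0.780,
                           0.879, 0.803, 0.806, 0.904, 0.919, 0.888],
}


def summarize(per_class_scores):
    """Return (name, min, mean, max) tuples sorted from hardest to easiest."""
    rows = [(name, min(vals), mean(vals), max(vals))
            for name, vals in per_class_scores.items()]
    # A lower mean score across baselines means a harder sub-class.
    return sorted(rows, key=lambda row: row[2])


ranked = summarize(scores)
for name, lo, avg, hi in ranked:
    print(f"{name}: min={lo:.3f} mean={avg:.3f} max={hi:.3f}")
```

Among these three rows, LeafySeaDragon ranks as the hardest and Grouse as the easiest, in line with the observation above.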

Qualitative Results.   We present more detection results of our conference version model (SINet_cvpr) for various challenging concealed objects, such as spider, moth, sea horse, and toad, in the supplementary materials. As shown in Fig. 16, SINet further improves the visual results compared to SINet_cvpr in terms of different lighting, appearance changes, and indefinable boundaries. PFANet [100] is able to locate the concealed objects, but the outputs are always inaccurate. By further using a reverse attention module, PraNet [22] achieves a relatively more accurate location than PFANet in the first case. Nevertheless, it still misses the fine details of objects, especially for the fish. For all these challenging cases, SINet is able to infer the real concealed object with fine details, demonstrating the robustness of our framework.

GOS vs. SOD Baselines.   One noteworthy finding is that, among the top-3 models, the GOS model (i.e., FPN [49]) performs worse than the SOD competitors, CPD [87] and EGNet [99], suggesting that the SOD framework may be better suited for extension to COD tasks. Compared with both the GOS [49, 32, 97, 104, 36, 5] and the SOD [54, 100, 87, 99] models, SINet significantly decreases the training time (e.g., SINet: 4 hours vs. EGNet: 48 hours) and achieves SOTA performance on all datasets, showing that it is a promising solution for the COD problem. Due to the limited space, fully comparing it with existing SOTA SOD models is beyond the scope of this paper. Note that our main goal is to provide more general observations for future work. More recent SOD models can be found on our project page.

Fig. 16: Comparison of our SINet and three top-performing baselines, including (d) SINet_cvpr [21], (e) PraNet [22], and (f) PFANet [100].
Trained on (rows) / Tested on (columns): CAMO [45], COD10K (OUR), Self, Mean others, Drop
CAMO [45] 0.803 0.702 0.803 0.678 15.6%
COD10K (OUR) 0.742 0.700 0.700 0.683 2.40%
Mean others 0.641 0.589
TABLE VI: Structure-measure ($S_\alpha$ [19]) scores for cross-dataset generalization. SINet_cvpr is trained on one dataset (rows) and tested on all datasets (columns). “Self”: training and testing on the same dataset (diagonal). “Mean others”: average score on all except self.

Generalization.   The generalizability and difficulty of datasets play a crucial role in both training and assessing different algorithms [81]. Hence, we study these aspects for existing COD datasets, using the cross-dataset analysis method [77], i.e., training a model on one dataset, and testing it on others. We select two datasets, namely CAMO [45], and our COD10K. Following [81], for each dataset, we randomly select 800 images as the training set and 200 images as the testing set. For fair comparison, we train SINet_cvpr on each dataset until the loss is stable.

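Datasets (left to right, four columns each): CHAMELEON [72], CAMO-Test [45], COD10K-Test (OUR)
Metrics per dataset: S-measure ↑, mean E-measure ↑, weighted F-measure ↑, MAE ↓
No. (variant settings: PD, NCD, Sy. Conv., Asy. Conv., Reverse, Group Size; see Section 5.3)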
#1 0.884 0.940 0.811 0.033 0.812 0.869 0.730 0.073 0.812 0.884 0.679 0.039
#2 0.881 0.934 0.799 0.034 0.820 0.877 0.740 0.071 0.813 0.884 0.673 0.038
#3 0.887 0.934 0.813 0.033 0.811 0.867 0.731 0.074 0.815 0.888 0.680 0.036
#4 0.888 0.944 0.818 0.030 0.810 0.866 0.730 0.073 0.814 0.883 0.678 0.037
#5 0.886 0.942 0.814 0.031 0.814 0.873 0.739 0.073 0.814 0.887 0.682 0.037
#6 0.879 0.928 0.794 0.035 0.820 0.877 0.738 0.071 0.807 0.878 0.661 0.040
#7 0.886 0.939 0.812 0.031 0.817 0.875 0.736 0.073 0.810 0.884 0.670 0.037
#8 0.888 0.940 0.812 0.031 0.819 0.877 0.741 0.072 0.814 0.887 0.681 0.037
#9 0.886 0.943 0.814 0.032 0.816 0.872 0.738 0.074 0.815 0.886 0.682 0.037
#10 0.884 0.944 0.810 0.033 0.819 0.876 0.738 0.071 0.813 0.884 0.675 0.037
#11 0.883 0.940 0.812 0.032 0.811 0.869 0.734 0.073 0.815 0.887 0.679 0.036
#OUR 0.888 0.942 0.816 0.030 0.820 0.882 0.743 0.070 0.815 0.887 0.680 0.037
TABLE VII: Ablation studies for each component on three test datasets. For details please refer to Section 5.3.

Table VI provides the S-measure results for the cross-dataset generalization. Each row lists a model that is trained on one dataset and tested on all others, indicating the generalizability of the dataset used for training. Each column shows the performance of one model tested on a specific dataset and trained on all others, indicating the difficulty of the testing dataset. Please note that the training/testing settings are different from those used in Table III, and thus the performances are not comparable. As expected, we find that our COD10K is the most difficult (e.g., the last row Mean others: 0.589). This is because our dataset contains a variety of challenging concealed objects (Section 3). We can thus see that our COD10K dataset is suitable for more challenging scenes.
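The percentage column of Table VI is numerically consistent with a relative drop from the “Self” score to the “Mean others” score. The formula below is inferred from the table values rather than stated in the text, and the function name is illustrative.

```python
def relative_drop(self_score, mean_others):
    """Relative performance drop when moving from the training dataset to the others."""
    return (self_score - mean_others) / self_score


# (Self, Mean others) pairs copied from the rows of Table VI.
table_vi = {
    "CAMO": (0.803, 0.678),
    "COD10K": (0.700, 0.683),
}

drops = {name: 100 * relative_drop(s, m) for name, (s, m) in table_vi.items()}
for name, drop in drops.items():
    print(f"{name}: {drop:.1f}%")
```

A smaller drop means that a model trained on that dataset transfers better to the other datasets.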

5.3 Ablation Studies

We now provide a detailed analysis of the proposed SINet on CHAMELEON, CAMO, and COD10K. We verify the effectiveness by decoupling various sub-components, including the NCD, TEM, and GRA, as summarized in Table VII. Note that we maintain the same hyperparameters mentioned in Section 4.4 during the re-training process for each ablation variant.

Effectiveness of NCD.   We explore the influence of the decoder in the search phase of our SINet. To verify its necessity, we retrain our network without the NCD (No.#1) and find that, compared with #OUR (last row in Table VII), the NCD boosts the performance on CAMO, increasing the mean E-measure from 0.869 to 0.882. Further, we replace the NCD with the partial decoder [87] (i.e., PD of No.#2) to validate the superiority of neighbor connections (No.#OUR). Comparing No.#2 with #OUR, our design enhances the performance slightly, increasing it by 1.7% in terms of the weighted F-measure on CHAMELEON.

Fig. 17: Comparison of various types of inter-layer feature aggregation strategies with a short connection. (a) DSS [35] introduces the densely connected short connection in a top-down manner. (b) CPD [87] constructs a partial decoder by discarding larger resolution features of shallower layers for memory and speed enhancement. (c) Our neighbor connection decoder only propagates between neighboring layers.

As shown in Fig. 17, we present a novel feature aggregation strategy before the modified UNet-like decoder (removing the bottom-two high-resolution layers), termed the NCD, with neighbor connections between adjacent layers. This design is motivated by the fact that the high-level features are superior in semantic strength and location accuracy, but introduce noise and blurred edges for the target object.

Instead of broadcasting features from densely connected layers with a short connection [35] or a partial decoder with a skip connection [87], our NCD exploits the semantic context through a neighbor connection, providing a simple but effective way to reduce inconsistency between different features. Aggregating all features by a short connection [35] increases the parameters. This is one of the major differences between DSS (Fig. 17 a) and NCD. Compared to CPD [87] (Fig. 17 b), which ignores feature transparency between and , NCD is more efficient at broadcasting the features step by step.

Effectiveness of TEM.   We provide two different variants: (a) without TEM (No.#3), and (b) with symmetric convolutional layers [74] (No.#4). Comparing with No.#3, we find that our TEM with asymmetric convolutional layers (No.#OUR) is necessary for increasing the performance on the CAMO dataset. Besides, replacing the standard symmetric convolutional layer (No.#4) with an asymmetric convolutional layer (No.#OUR) has little impact on the learning capability of the network, while further increasing the mean E-measure from 0.866 to 0.882 on the CAMO dataset.

Effectiveness of GRA.   Reverse Guidance. As shown in the ‘Reverse’ column of Table VII, {*,*,*} indicates whether the guidance is reversed (see Fig. 14 (b)) before each GRA block. For instance, {1,0,0} means that we only reverse the guidance in the first block, and the remaining two blocks do not have a reverse operation.

We investigate the contribution of the reverse guidance in the GRA, including three alternatives: (a) without any reverse, i.e., {0,0,0} of No.#5, (b) reversing the first two guidances, i.e., {1,1,0} of No.#6, and (c) reversing all the guidances, i.e., {1,1,1} of No.#7. Compared to the default implementation of SINet (i.e., {1,0,0} of No.#OUR), we find that only reversing the first guidance may help the network to mine diversified representations from two perspectives (i.e., attention and reverse attention regions), while introducing reverse guidance several times in the intermediate process may cause confusion during the learning procedure, especially for setting #6 on the CHAMELEON and COD10K datasets.

Group Size of GGO. As shown in the ‘Group Size’ column of Table VII, {*,*,*} indicates the number of feature slices (i.e., the group size) from the GGO of the first block to the last block. For example, {32,8,1} indicates that we split the candidate feature into 32, 8, and 1 groups at each GRA block, respectively. Here, we discuss two ways of selecting the group size, i.e., the uniform strategy (#8, #9, and #10) and the progressive strategy (#11 and #OUR). We observe that our design based on the progressive strategy can effectively maintain the generalizability of the network, providing more satisfactory performance compared with other variants.
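To make the notion of group size concrete, the sketch below splits a list of channel labels into a given number of groups. The text above only states that the candidate feature is split into groups; appending one guidance entry after every group is an assumption made here for illustration, and both function names are hypothetical.

```python
def split_into_groups(channels, num_groups):
    """Split a list of channels into `num_groups` equally sized, contiguous groups."""
    if num_groups <= 0 or len(channels) % num_groups != 0:
        raise ValueError("channel count must be divisible by the number of groups")
    size = len(channels) // num_groups
    return [channels[i:i + size] for i in range(0, len(channels), size)]


def interleave_guidance(channels, guidance, num_groups):
    """Append the guidance entry after every group (assumed layout, see lead-in)."""
    out = []
    for group in split_into_groups(channels, num_groups):
        out.extend(group)
        out.append(guidance)
    return out


channels = ["c0", "c1", "c2", "c3"]
print(interleave_guidance(channels, "g", 2))  # ['c0', 'c1', 'g', 'c2', 'c3', 'g']
print(interleave_guidance(channels, "g", 1))  # ['c0', 'c1', 'c2', 'c3', 'g']
```

Under this assumed layout, a larger number of groups injects the guidance more often, while a group count of 1 attaches it only once at the end.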

6 Downstream Applications

Concealed object detection systems have various downstream applications in fields such as medicine, art, and agriculture. Here, we envision some potential uses. More details are shown on our project page.

6.1 Application I: Medicine

6.1.1 Polyp Segmentation

As we all know, early diagnosis through medical imaging plays a key role in the treatment of diseases. However, the early disease area/lesions usually have a high degree of homogeneity with the surrounding tissues. As a result, it is difficult for doctors to identify the lesion area in the early stage from a medical image. One typical example is the early colonoscopy to segment polyps, which has contributed to a roughly 30% decline in the incidence of colorectal cancer [22]. Similar to concealed object detection, polyp segmentation (see Fig. 18) also faces several challenges, such as variation in appearance and blurred boundaries. The recent state-of-the-art polyp segmentation model, PraNet [22], has shown promising performance in both polyp segmentation (Top-1) and concealed object segmentation (Top-2).


Fig. 18: Polyp segmentation. (a) & (c) are input polyp images. (b) & (d) are corresponding ground-truths.


Fig. 19: Lung infection segmentation. The first row presents COVID-19 lung infection CT scans, while the second row shows their ground-truths labeled by doctors. From (a) to (d), COVID-19 patients from mild to severe.


Fig. 20: Surface defect detection. The defect types are textile (a), stone (b), magnetic tile (c), and wood (d), respectively. Source images are derived from [33].

6.1.2 Lung Infection Segmentation

Another concealed object detection example is the lung infection segmentation task in the medical field. Recently, COVID-19 has been of particular concern, and has resulted in a global pandemic. An AI system equipped with a COVID-19 lung infection segmentation model would be helpful in the early screening of COVID-19. More details on this application can be found in the recent diagnosis system [86], segmentation model [27], and survey paper [70].

6.2 Application II: Manufacturing

6.2.1 Surface Defect Detection

In industrial manufacturing, products (e.g., wood, textile, and magnetic tile) of poor quality will inevitably lead to adverse effects on the economy. As can be seen from Fig. 20, the surface defects are challenging to detect due to factors such as low contrast and ambiguous boundaries. Since traditional surface defect detection systems mainly rely on humans, identifying defects is highly subjective and time-consuming. Thus, designing an automatic recognition system based on AI is essential to increase productivity. Some related papers can be found at:


Fig. 21: Pest detection. For pest detection applications, the system can generate a bounding box (b) for each locally screened image (a) or provide statistics (pest counting) for locust plague density monitoring in the whole environment (c).


Fig. 22: Fruit maturity detection. Compared with the traditional manual inspection (c) of fruits, such as Persea Americana (a) and Myrica Rubra (b) for maturity, an AI-based maturity monitoring system will greatly improve production efficiency.

6.3 Application III: Agriculture

6.3.1 Pest Detection

Since early 2020, plagues of desert locusts have invaded the world, from Africa to South Asia. Large numbers of locusts gnaw on fields and completely destroy agricultural products, causing serious financial losses and famine due to food shortages. As shown in Fig. 21, introducing AI-based techniques to provide scientific monitoring is feasible for achieving sustainable regulation/containment by governments.

6.3.2 Fruit Maturity Detection

In the early stages of ripening, many fruits appear similar to green leaves, making it difficult for farmers to monitor production. We present two types of fruits, i.e., Persea Americana and Myrica Rubra, in Fig. 22. These fruits share similar characteristics to concealed objects, so it is possible to utilize a COD algorithm to identify them and improve the monitoring efficiency.

6.4 Application IV: Art

6.4.1 Recreational Art

Background warping to conceal salient objects is a fascinating technique in the SIGGRAPH community. Fig. 23 presents some examples generated by Chu et al. in [9]. We argue that this technique will provide more training data for existing data-hungry deep learning models, and thus it is of value to explore the underlying mechanism behind the feature search and conjunction search theory described by Treisman and Wolfe [78, 85].

Fig. 23: Recreational art. Some animals are embedded into the background by algorithms. Source images from Chu et al.  [9] and all rights reserved by 2010 John Van Straalen.


Fig. 24: Converting concealed objects to salient objects. Source images from [45]. One interesting application is to identify (b) a specific concealed object (a) and then convert it to a salient object (c).


Fig. 25: Transparent stuff/objects detection. In our daily lives, we humans see, touch, or interact with various transparent stuff such as windows (a), glass doors (b), and glass walls (c). It is essential to teach AI robots to identify transparent stuff/objects to avoid unseen obstacles.

6.4.2 From Concealed to Salient Objects

Concealed object detection and salient object detection are two opposite tasks, making it convenient for us to design a multi-task learning framework that can simultaneously increase the robustness of the network. As shown in Fig. 24, there exist two reverse objects (a) and (c). An interesting application is to provide a scroll bar to allow users to customize the degree of salient objects from the concealed objects.

6.5 Application V: Daily Life

6.5.1 Transparent Stuff/Objects Detection

Transparent objects, such as glass products, are commonplace in our daily life. These objects/things, including doors and walls, inherit the appearance of their background, making them unnoticeable, as illustrated in Fig. 25. As a sub-task of concealed object detection, transparent object detection [88] and transparent object tracking [28] have shown promise.


Fig. 26: Search engines. Internet search engine application equipped without (a)/with (b) a concealed detection system.

6.5.2 Search Engines

Fig. 26 shows an example of search results from Google. From the results (Fig. 26 a), we notice that the search engine cannot detect the concealed butterfly, and thus only provides images with similar backgrounds. Interestingly, when the search engine is equipped with a concealed detection system (here, we simply change the keyword), it can identify the concealed object and then return several butterfly images (Fig. 26 b).

7 Potential Research Directions

Despite the recent 10 years of progress in the field of concealed object detection, the leading algorithms in the deep learning era remain limited compared to those for generic object detection [53] and cannot yet effectively solve real-world challenges as shown in our COD10K benchmark (Top-1: ). We highlight some long-standing challenges, as follows:

  • Concealed object detection under limited conditions: few/zero-shot learning, weakly supervised learning, unsupervised learning, self-supervised learning, limited training data, unseen object class, etc.


  • Concealed object detection combined with other modalities: Text, Audio, Video, RGB-D, RGB-T, 3D, etc.

  • New directions based on the rich annotations provided in the COD10K, such as concealed instance segmentation, concealed edge detection, concealed object proposal, concealed object ranking, among others.

Based on the above-mentioned challenges, there are a number of foreseeable directions for future research:

(1) Weakly/Semi-Supervised Detection: Existing deep-based methods extract the features in a fully supervised manner from images annotated with object-level labels. However, the pixel-level annotations are usually manually marked by LabelMe or Adobe Photoshop tools with intensive professional interaction. Thus, it is essential to utilize weakly/semi (partially) annotated data for training in order to avoid heavy annotation costs.

(2) Self-Supervised Detection: Recent efforts to learn visual (e.g., image, audio, and video) representations using self-supervised learning [1, 31] have achieved remarkable results, attracting much attention. Thus, it is natural to set up a self-supervised learning benchmark for the concealed object detection task.

(3) Concealed Object Detection in Other Modalities: Existing concealed data is only based on static images or dynamic videos [44]. However, concealed object detection in other modalities can be closely related in domains such as pest monitoring in the dark night, robotics, and artist design. Similar to RGB-D SOD [24], RGB-T SOD [95], and VSOD [26], these modalities can be audio, thermal, or depth data, raising new challenges under specific scenes.

(4) Concealed Object Classification: Generic object classification is a fundamental task in computer vision. Thus, concealed object classification will also likely gain attention in the future. By utilizing the class and sub-class labels provided in COD10K, one could build a large-scale, fine-grained classification task.

(5) Concealed Object Proposal and Tracking: In this paper, the concealed object detection is actually a segmentation task. It is different from traditional object detection, which generates a proposal or bounding boxes as the prediction. As such, concealed object proposal and tracking is a new and interesting direction [60] for future work.

(6) Concealed Object Ranking: Currently, concealed object detection algorithms are built upon binary ground-truths to generate the masks of concealed objects, without analyzing the level of concealment. However, understanding the level of concealment could help to better explore the mechanism behind the models, providing deeper insights into them. We refer readers to [39] for some inspiring ideas.

(7) Concealed Instance Segmentation: As described in [46], instance segmentation is more crucial than object-level segmentation for practical applications. Recently, CamouFinder [80] has pushed the research on camouflaged object segmentation into camouflaged instance segmentation.

(8) Universal Network for Multiple Tasks: As studied by Zamir et al. in Taskonomy [92], different visual tasks have strong relationships. Thus, their supervision can be reused in one universal system without piling up complexity. It is natural to consider devising a universal network to simultaneously localize, segment and rank concealed objects.

(9) Neural Architecture Search: Both traditional algorithms and deep learning-based models for concealed object detection require human experts with strong prior knowledge or skilled expertise. Sometimes, the hand-crafted features and architectures designed by algorithm engineers may not be optimal. Therefore, neural architecture search techniques, such as the popular automated machine learning [91], offer a potential direction.

(10) Transferring Salient Objects to Concealed Objects: Due to space limitations, we only evaluated typical salient object detection models in our benchmark section. However, several valuable problems deserve further study, such as transferring salient objects to concealed objects to increase the training data, and introducing a generative adversarial mechanism between the SOD and COD tasks to increase the feature extraction ability of the network.

The ten new research directions listed above for concealed object detection remain far from being solved. However, there are many well-known works that can be referred to, providing a solid basis for studying the object detection task from a concealed perspective.

Fig. 27: Ground-truths of the images presented in Fig. 1

8 Conclusion

We have presented the first comprehensive study on object detection from a concealed vision perspective. Specifically, we have provided the new challenging and densely annotated COD10K dataset, conducted a large-scale benchmark, developed a simple but efficient end-to-end search and identification framework (i.e., SINet), and highlighted several potential applications. Compared with existing cutting-edge baselines, our SINet is competitive and generates more visually favorable results. The above contributions offer the community an opportunity to design new models for the COD task. In our future work, we plan to extend our COD10K dataset to provide inputs of various forms, such as multi-view images (like RGB-D SOD [98, 94]), textual descriptions, video (similar to video salient object detection [26]), among others.


This research was supported by the Major Project for New Generation of AI under Grant No. 2018AAA0100400, NSFC (61922046), and S&T innovation project from Chinese Ministry of Education.


  • [1] T. Afouras, A. Owens, J. S. Chung, and A. Zisserman (2020) Self-supervised learning of audio-visual objects from video. arXiv preprint arXiv:2008.04237. Cited by: §7.
  • [2] A. Borji, M. Cheng, Q. Hou, H. Jiang, and J. Li (2019) Salient object detection: a survey. Computational Visual Media 5 (2), pp. 117–150. External Links: ISSN 2096-0662, Document Cited by: §2.
  • [3] A. Borji, M. Cheng, H. Jiang, and J. Li (2015) Salient object detection: a benchmark. IEEE Trans. Image Process. 24 (12), pp. 5706–5722. Cited by: §2.
  • [4] Y. Boykov, O. Veksler, and R. Zabih (1999) Fast approximate energy minimization via graph cuts. In IEEE Conf. Comput. Vis. Pattern Recog., pp. 377–384. Cited by: §5.2.
  • [5] K. Chen, J. Pang, J. Wang, Y. Xiong, X. Li, S. Sun, W. Feng, Z. Liu, J. Shi, W. Ouyang, et al. (2019) Hybrid task cascade for instance segmentation. In IEEE Conf. Comput. Vis. Pattern Recog., pp. 4974–4983. Cited by: TABLE III, TABLE IV, §5.1.2, §5.2, TABLE V.
  • [6] S. Chen and Y. Fu (2020) Progressively guided alternate refinement network for rgb-d salient object detection. In Eur. Conf. Comput. Vis., pp. 520–538. Cited by: §4.3.
  • [7] S. Chen, X. Tan, B. Wang, and X. Hu (2018) Reverse attention for salient object detection. In Eur. Conf. Comput. Vis., pp. 234–250. Cited by: §4.3.
  • [8] M. Cheng, N. J. Mitra, X. Huang, P. H. S. Torr, and S. Hu (2015) Global contrast based salient region detection. IEEE TPAMI 37 (3), pp. 569–582. External Links: Document Cited by: §2.
  • [9] H. Chu, W. Hsu, N. J. Mitra, D. Cohen-Or, T. Wong, and T. Lee (2010) Camouflage images. ACM Trans. Graph. 29 (4), pp. 51–1. Cited by: §1, Fig. 23, §6.4.1.
  • [10] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele (2016) The cityscapes dataset for semantic urban scene understanding. In IEEE Conf. Comput. Vis. Pattern Recog., pp. 3213–3223. Cited by: §3.
  • [11] H. B. Cott (1940) Adaptive coloration in animals. Methuen & Co., Ltd.. Cited by: §2.
  • [12] I. C. Cuthill, M. Stevens, J. Sheppard, T. Maddocks, C. A. Párraga, and T. S. Troscianko (2005) Disruptive coloration and background pattern matching. Nature 434 (7029), pp. 72. Cited by: §1.
  • [13] D. Damen, H. Doughty, G. Maria Farinella, S. Fidler, A. Furnari, E. Kazakos, D. Moltisanti, J. Munro, T. Perrett, W. Price, et al. (2018) Scaling egocentric vision: the epic-kitchens dataset. In Eur. Conf. Comput. Vis., pp. 720–736. Cited by: §3.2.
  • [14] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In IEEE Conf. Comput. Vis. Pattern Recog., pp. 248–255. Cited by: §1.
  • [15] E. L. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus (2014) Exploiting linear structure within convolutional networks for efficient evaluation. Adv. Neural Inform. Process. Syst. 27, pp. 1269–1277. Cited by: §4.2.
  • [16] H. Dong, K. Song, Y. He, J. Xu, Y. Yan, and Q. Meng (2020) PGA-net: pyramid feature fusion and global context attention network for automated surface defect detection. IEEE Trans. Industr. Inform.. Cited by: §2.
  • [17] M. Everingham, S. A. Eslami, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman (2015) The PASCAL visual object classes challenge: a retrospective. Int. J. Comput. Vis. 111 (1), pp. 98–136. Cited by: §2.
  • [18] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman (2010) The pascal visual object classes (voc) challenge. Int. J. Comput. Vis. 88 (2), pp. 303–338. Cited by: §1.
  • [19] D. Fan, M. Cheng, Y. Liu, T. Li, and A. Borji (2017) Structure-measure: A New Way to Evaluate Foreground Maps. In Int. Conf. Comput. Vis., pp. 4548–4557. Cited by: Fig. 15, §5.1.1, TABLE VI.
  • [20] D. Fan, C. Gong, Y. Cao, B. Ren, M. Cheng, and A. Borji (2018) Enhanced-alignment Measure for Binary Foreground Map Evaluation. In Int. Joint Conf. Artif. Intell., pp. 698–704. Cited by: TABLE III, §5.1.1.
  • [21] D. Fan, G. Ji, G. Sun, M. Cheng, J. Shen, and L. Shao (2020) Camouflaged object detection. In IEEE Conf. Comput. Vis. Pattern Recog., pp. 2777–2787. Cited by: Concealed Object Detection, §1.1, TABLE IV, Fig. 16, §5.1.3, footnote 4.
  • [22] D. Fan, G. Ji, T. Zhou, G. Chen, H. Fu, J. Shen, and L. Shao (2020) Pranet: parallel reverse attention network for polyp segmentation. In Med. Image. Comput. Comput. Assist. Interv., Cited by: §1, §4.3, §4.3, TABLE III, TABLE IV, Fig. 16, §5.1.2, §5.2, TABLE V, §6.1.1.
  • [23] D. Fan, Z. Lin, Z. Zhang, M. Zhu, and M. Cheng (2020) Rethinking rgb-d salient object detection: models, data sets, and large-scale benchmarks. IEEE T. Neural Netw. Learn. Syst.. Cited by: §1, §3.2.
  • [24] D. Fan, Z. Lin, Z. Zhang, M. Zhu, and M. Cheng (2020) Rethinking rgb-d salient object detection: models, data sets, and large-scale benchmarks. IEEE T. Neural Netw. Learn. Syst.. Cited by: §7.
  • [25] D. Fan, J. Liu, S. Gao, Q. Hou, A. Borji, and M. Cheng (2018) Salient objects in clutter: bringing salient object detection to the foreground. In Eur. Conf. Comput. Vis., pp. 1597–1604. Cited by: Fig. 2, §1, §2, §3.1, §3.1, §3.2, §3.3, §3.3.
  • [26] D. Fan, W. Wang, M. Cheng, and J. Shen (2019) Shifting more attention to video salient object detection. In IEEE Conf. Comput. Vis. Pattern Recog., pp. 8554–8564. Cited by: §7, §8.
  • [27] D. Fan, T. Zhou, G. Ji, Y. Zhou, G. Chen, H. Fu, J. Shen, and L. Shao (2020) Inf-Net: Automatic COVID-19 Lung Infection Segmentation from CT Images. IEEE Trans. Med. Imaging. Cited by: §1, §6.1.2.
  • [28] H. Fan, H. A. Miththanthaya, S. R. Rajan, X. Liu, Z. Zou, Y. Lin, H. Ling, et al. (2020) Transparent object tracking benchmark. arXiv preprint arXiv:2011.10875. Cited by: §6.5.1.
  • [29] S. Gao, M. Cheng, K. Zhao, X. Zhang, M. Yang, and P. Torr (2021) Res2Net: a new multi-scale backbone architecture. IEEE Transactions on Pattern Analysis and Machine Intelligence 43 (2), pp. 652–662. External Links: Document Cited by: §4.2.
  • [30] J. R. Hall, I. C. Cuthill, R. Baddeley, A. J. Shohet, and N. E. Scott-Samuel (2013) Camouflage, detection and identification of moving targets. Proc. Royal Soc. B: Biological Sciences 280 (1758), pp. 20130064. Cited by: §4.1.
  • [31] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick (2020) Momentum contrast for unsupervised visual representation learning. In IEEE Conf. Comput. Vis. Pattern Recog., pp. 9729–9738. Cited by: §7.
  • [32] K. He, G. Gkioxari, P. Dollár, and R. Girshick (2017) Mask r-cnn. In Int. Conf. Comput. Vis., pp. 2961–2969. Cited by: TABLE III, TABLE IV, §5.1.2, §5.2, TABLE V.
  • [33] T. He, Y. Liu, C. Xu, X. Zhou, Z. Hu, and J. Fan (2019) A fully convolutional neural network for wood defect location and identification. IEEE Access 7, pp. 123453–123462. Cited by: Fig. 20.
  • [34] Y. He, K. Song, Q. Meng, and Y. Yan (2020) An end-to-end steel surface defect detection approach via fusing multiple hierarchical features. IEEE Trans. Instrum. Meas. 69 (4), pp. 1493–1504. Cited by: §2.
  • [35] Q. Hou, M. Cheng, X. Hu, A. Borji, Z. Tu, and P. Torr (2019) Deeply supervised salient object detection with short connections. IEEE Trans. Pattern Anal. Mach. Intell. 41 (4), pp. 815–828. Cited by: §2, Fig. 17, §5.3.
  • [36] Z. Huang, L. Huang, Y. Gong, C. Huang, and X. Wang (2019) Mask scoring r-cnn. In IEEE Conf. Comput. Vis. Pattern Recog., pp. 6409–6418. Cited by: TABLE III, TABLE IV, §5.1.2, §5.2, TABLE V.
  • [37] L. Itti, C. Koch, and E. Niebur (1998) A model of saliency-based visual attention for rapid scene analysis. IEEE Trans. Pattern Anal. Mach. Intell. 20 (11), pp. 1254–1259. Cited by: §2.
  • [38] M. Jaderberg, A. Vedaldi, and A. Zisserman (2014) Speeding up convolutional neural networks with low rank expansions. In Brit. Mach. Vis. Conf., Cited by: §4.2.
  • [39] M. Kalash, M. A. Islam, and N. Bruce (2019) Relative saliency and ranking: models, metrics, data and benchmarks. IEEE Trans. Pattern Anal. Mach. Intell.. Cited by: §7.
  • [40] A. Kalra, V. Taamazyan, S. K. Rao, K. Venkataraman, R. Raskar, and A. Kadambi (2020) Deep polarization cues for transparent object segmentation. In IEEE Conf. Comput. Vis. Pattern Recog., pp. 8602–8611. Cited by: §2.
  • [41] D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. In Int. Conf. Learn. Represent., Cited by: §4.4.2.
  • [42] A. Kirillov, K. He, R. Girshick, C. Rother, and P. Dollár (2019) Panoptic segmentation. In IEEE Conf. Comput. Vis. Pattern Recog., pp. 9404–9413. Cited by: Fig. 2, §2, footnote 1.
  • [43] P. Krahenbuhl and V. Koltun (2011) Efficient inference in fully connected crfs with gaussian edge potentials. In Adv. Neural Inform. Process. Syst., pp. 109–117. Cited by: §5.2.
  • [44] H. Lamdouar, C. Yang, W. Xie, and A. Zisserman (2020) Betrayed by motion: camouflaged object discovery via motion segmentation. In Asian Conf. Comput. Vis., Cited by: §7.
  • [45] T. Le, T. V. Nguyen, Z. Nie, M. Tran, and A. Sugimoto (2019) Anabranch network for camouflaged object segmentation. Comput. Vis. Image Underst. 184, pp. 45–56. Cited by: item 3), TABLE I, §2, Fig. 8, TABLE III, §5.1.2, §5.2, §5.2, TABLE VI, TABLE VII, Fig. 24, footnote 4.
  • [46] G. Li, Y. Xie, L. Lin, and Y. Yu (2017) Instance-level salient object segmentation. In IEEE Conf. Comput. Vis. Pattern Recog., pp. 247–256. Cited by: Fig. 2, §7.
  • [47] Y. Li, X. Hou, C. Koch, J. M. Rehg, and A. L. Yuille (2014) The secrets of salient object segmentation. In IEEE Conf. Comput. Vis. Pattern Recog., pp. 280–287. Cited by: §3.3.
  • [48] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft coco: common objects in context. In Eur. Conf. Comput. Vis., pp. 740–755. Cited by: §1, §3.2.
  • [49] T. Lin, P. Dollar, R. Girshick, K. He, B. Hariharan, and S. Belongie (2017) Feature pyramid networks for object detection. In IEEE Conf. Comput. Vis. Pattern Recog., pp. 936–944. Cited by: TABLE III, TABLE IV, §5.1.2, §5.2, TABLE V.
  • [50] C. Liu, J. Yuen, and A. Torralba (2010) Sift flow: dense correspondence across scenes and its applications. IEEE Trans. Pattern Anal. Mach. Intell. 33 (5), pp. 978–994. Cited by: §2.
  • [51] J. Liu, Q. Hou, M. Cheng, J. Feng, and J. Jiang (2019) A simple pooling-based design for real-time salient object detection. IEEE Conf. Comput. Vis. Pattern Recog.. Cited by: TABLE V.
  • [52] L. Liu, W. Ouyang, X. Wang, P. Fieguth, J. Chen, X. Liu, and M. Pietikäinen (2019) Deep learning for generic object detection: a survey. Int. J. Comput. Vis.. Cited by: Fig. 2.
  • [53] L. Liu, W. Ouyang, X. Wang, P. Fieguth, J. Chen, X. Liu, and M. Pietikäinen (2020) Deep learning for generic object detection: a survey. Int. J. Comput. Vis. 128 (2), pp. 261–318. Cited by: §7.
  • [54] N. Liu, J. Han, and M. Yang (2018) PiCANet: learning pixel-wise contextual attention for saliency detection. In IEEE Conf. Comput. Vis. Pattern Recog., pp. 3089–3098. Cited by: TABLE III, TABLE IV, §5.1.2, §5.2, TABLE V.
  • [55] S. Liu, D. Huang, et al. (2018) Receptive field block net for accurate and fast object detection. In Eur. Conf. Comput. Vis., pp. 385–400. Cited by: §4.2.
  • [56] Y. Liu, M. Cheng, D. Fan, L. Zhang, J. Bian, and D. Tao (2018) Semantic edge detection with diverse deep supervision. arXiv preprint arXiv:1804.02864. Cited by: item 1).
  • [57] R. Margolin, L. Zelnik-Manor, and A. Tal (2014) How to evaluate foreground maps?. In IEEE Conf. Comput. Vis. Pattern Recog., pp. 248–255. Cited by: §5.1.1.
  • [58] G. Medioni (2009) Generic object recognition by inference of 3-d volumetric. Object Categorization: Comput. Hum. Vis. Perspect. 87. Cited by: Fig. 2.
  • [59] K. Mo, S. Zhu, A. X. Chang, L. Yi, S. Tripathi, L. J. Guibas, and H. Su (2019) PartNet: a large-scale benchmark for fine-grained and hierarchical part-level 3d object understanding. In IEEE Conf. Comput. Vis. Pattern Recog., pp. 909–918. Cited by: §3.2.
  • [60] A. Mondal (2020) Camouflaged object detection and tracking: a survey. Int. J. Image Graph. 20 (04), pp. 2050028. Cited by: §7.
  • [61] G. Mori (2005) Guiding model search using segmentation. In Int. Conf. Comput. Vis., pp. 1417–1423. Cited by: §5.2.
  • [62] G. Neuhold, T. Ollmann, S. Rota Bulo, and P. Kontschieder (2017) The mapillary vistas dataset for semantic understanding of street scenes. In IEEE Conf. Comput. Vis. Pattern Recog., pp. 4990–4999. Cited by: §3.
  • [63] A. Owens, C. Barnes, A. Flint, H. Singh, and W. Freeman (2014) Camouflaging an object from many viewpoints. In IEEE Conf. Comput. Vis. Pattern Recog., pp. 2782–2789. Cited by: §1.
  • [64] F. Perazzi, P. Krähenbühl, Y. Pritch, and A. Hornung (2012) Saliency filters: contrast based filtering for salient region detection. In IEEE Conf. Comput. Vis. Pattern Recog., pp. 733–740. Cited by: §5.1.1.
  • [65] F. Perazzi, J. Pont-Tuset, B. McWilliams, L. Van Gool, M. Gross, and A. Sorkine-Hornung (2016) A benchmark dataset and evaluation methodology for video object segmentation. In IEEE Conf. Comput. Vis. Pattern Recog., pp. 724–732. Cited by: §1, §3.1, §3.2.
  • [66] X. Qin, D. Fan, C. Huang, C. Diagne, Z. Zhang, A. C. Sant’Anna, A. Suárez, M. Jagersand, and L. Shao (2021) Boundary-aware segmentation network for mobile and web applications. arXiv preprint arXiv:2101.04704. Cited by: §2.
  • [67] X. Qin, Z. Zhang, C. Huang, C. Gao, M. Dehghan, and M. Jagersand (2019) BASNet: boundary-aware salient object detection. In IEEE Conf. Comput. Vis. Pattern Recog., pp. 7479–7489. Cited by: §4.1, §4.3, §4.4.1, TABLE V.
  • [68] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. (2015) Imagenet large scale visual recognition challenge. Int. J. Comput. Vis. 115 (3), pp. 211–252. Cited by: §3.
  • [69] K. Saenko, B. Kulis, M. Fritz, and T. Darrell (2010) Adapting visual category models to new domains. In Eur. Conf. Comput. Vis., pp. 213–226. Cited by: item 1).
  • [70] F. Shi, J. Wang, J. Shi, Z. Wu, Q. Wang, Z. Tang, K. He, Y. Shi, and D. Shen (2020) Review of artificial intelligence techniques in imaging data acquisition, segmentation and diagnosis for covid-19. IEEE Rev. Biomed. Eng.. Cited by: §6.1.2.
  • [71] J. Shotton, J. Winn, C. Rother, and A. Criminisi (2006) Textonboost: joint appearance, shape and context modeling for multi-class object recognition and segmentation. In Eur. Conf. Comput. Vis., pp. 1–15. Cited by: §2.
  • [72] P. Skurowski, H. Abdulameer, J. Błaszczyk, T. Depta, A. Kornacki, and P. Kozieł (2018) Animal camouflage analysis: chameleon database. Note: Unpublished Manuscript Cited by: item 3), TABLE I, §2, Fig. 8, TABLE III, §5.1.3, TABLE VII.
  • [73] M. Stevens and S. Merilaita (2008) Animal camouflage: current issues and new perspectives. Phil. Trans. R. Soc. B: Biological Sciences 364 (1516), pp. 423–427. Cited by: §1, §2.
  • [74] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna (2016) Rethinking the inception architecture for computer vision. In IEEE Conf. Comput. Vis. Pattern Recog., pp. 2818–2826. Cited by: §4.2, §5.3.
  • [75] D. Tabernik, S. Šela, J. Skvarč, and D. Skočaj (2020) Segmentation-based deep-learning approach for surface-defect detection. J. Intell. Manuf. 31 (3), pp. 759–776. Cited by: §2.
  • [76] G. H. Thayer and A. H. Thayer (1909) Concealing-coloration in the animal kingdom: an exposition of the laws of disguise through color and pattern: being a summary of abbott h. thayer’s discoveries. Macmillan Company. Cited by: §2.
  • [77] A. Torralba, A. A. Efros, et al. (2011) Unbiased look at dataset bias. In IEEE Conf. Comput. Vis. Pattern Recog., pp. 1521–1528. Cited by: §5.2.
  • [78] A. Treisman (1988) Features and objects: the fourteenth bartlett memorial lecture. Q. J. Exp. Psychol. (Hove) 40 (2), pp. 201–237. Cited by: §6.4.1.
  • [79] T. Troscianko, C. P. Benton, P. G. Lovell, D. J. Tolhurst, and Z. Pizlo (2008) Camouflage and visual perception. Phil. Trans. R. Soc. B: Biological Sciences 364 (1516), pp. 449–461. Cited by: §1.
  • [80] L. Trung-Nghia, V. Nguyen, C. Le, T. Nguyen, M. Tran, and T. V. Nguyen (2021) CamouFinder: finding camouflaged instances in images. In AAAI Conf. Art. Intell., Cited by: §7.
  • [81] W. Wang, Q. Lai, H. Fu, J. Shen, and H. Ling (2021) Salient object detection in the deep learning era: an in-depth survey. IEEE Trans. Pattern Anal. Mach. Intell.. Cited by: §2, §5.2.
  • [82] W. Wang, J. Shen, F. Guo, M. Cheng, and A. Borji (2018) Revisiting video saliency: a large-scale benchmark and a new model. In IEEE Conf. Comput. Vis. Pattern Recog., pp. 4894–4903. Cited by: §3.1, §3.2.
  • [83] J. Wei, S. Wang, and Q. Huang (2020) F3Net: Fusion, Feedback and Focus for Salient Object Detection. In AAAI Conf. Art. Intell., Cited by: §4.4.1.
  • [84] Y. Wei, J. Feng, X. Liang, M. Cheng, Y. Zhao, and S. Yan (2017) Object region mining with adversarial erasing: a simple classification to semantic segmentation approach. In IEEE Conf. Comput. Vis. Pattern Recog., pp. 1568–1576. Cited by: §4.3.
  • [85] J. M. Wolfe (1994) Guided search 2.0 a revised model of visual search. Psychon. Bull. Rev. 1 (2), pp. 202–238. Cited by: §6.4.1.
  • [86] Y. Wu, S. Gao, J. Mei, J. Xu, D. Fan, C. Zhao, and M. Cheng (2020) JCS: an explainable covid-19 diagnosis system by joint classification and segmentation. IEEE Trans. Image Process.. Cited by: §1, §6.1.2.
  • [87] Z. Wu, L. Su, and Q. Huang (2019) Cascaded partial decoder for fast and accurate salient object detection. In IEEE Conf. Comput. Vis. Pattern Recog., pp. 3907–3916. Cited by: §4.2, §4.2, §4.2, TABLE III, TABLE IV, Fig. 17, §5.1.2, §5.2, §5.3, §5.3, TABLE V.
  • [88] E. Xie, W. Wang, W. Wang, M. Ding, C. Shen, and P. Luo (2020) Segmenting transparent objects in the wild. In Eur. Conf. Comput. Vis., Cited by: §2, §6.5.1.
  • [89] N. Xu, B. Price, S. Cohen, and T. Huang (2017) Deep image matting. In IEEE Conf. Comput. Vis. Pattern Recog., pp. 2970–2979. Cited by: §4.1, §4.3.
  • [90] Y. Xu, H. Nagahara, A. Shimada, and R. Taniguchi (2015) Transcut: transparent object segmentation from a light-field image. In Int. Conf. Comput. Vis., pp. 3442–3450. Cited by: §2.
  • [91] Q. Yao, M. Wang, Y. Chen, W. Dai, H. Yi-Qi, L. Yu-Feng, T. Wei-Wei, Y. Qiang, and Y. Yang (2018) Taking human out of learning applications: a survey on automated machine learning. arXiv preprint arXiv:1810.13306. Cited by: §7.
  • [92] A. R. Zamir, A. Sax, W. Shen, L. J. Guibas, J. Malik, and S. Savarese (2018) Taskonomy: disentangling task transfer learning. In IEEE Conf. Comput. Vis. Pattern Recog., pp. 3712–3722. Cited by: item 1), §7.
  • [93] Y. Zeng, P. Zhang, J. Zhang, Z. Lin, and H. Lu (2019) Towards high-resolution salient object detection. In Int. Conf. Comput. Vis., Cited by: §3.3.
  • [94] J. Zhang, D. Fan, Y. Dai, S. Anwar, F. Sadat Saleh, T. Zhang, and N. Barnes (2020) UC-Net: Uncertainty Inspired RGB-D Saliency Detection via Conditional Variational Autoencoders. In IEEE Conf. Comput. Vis. Pattern Recog., Cited by: §8.
  • [95] Q. Zhang, N. Huang, L. Yao, D. Zhang, C. Shan, and J. Han (2019) RGB-t salient object detection via fusing multi-level cnn features. IEEE Trans. Image Process. 29, pp. 3321–3335. Cited by: §7.
  • [96] Y. Zhang, L. Gong, L. Fan, P. Ren, Q. Huang, H. Bao, and W. Xu (2019) A late fusion cnn for digital matting. In IEEE Conf. Comput. Vis. Pattern Recog., pp. 7469–7478. Cited by: item 1), TABLE I, Fig. 10.
  • [97] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia (2017) Pyramid scene parsing network. In IEEE Conf. Comput. Vis. Pattern Recog., pp. 6230–6239. Cited by: TABLE III, TABLE IV, §5.1.2, §5.2, TABLE V.
  • [98] J. Zhao, Y. Cao, D. Fan, M. Cheng, X. Li, and L. Zhang (2019) Contrast prior and fluid pyramid integration for RGBD salient object detection. In IEEE Conf. Comput. Vis. Pattern Recog., pp. 3927–3936. Cited by: §1, §8.
  • [99] J. Zhao, J. Liu, D. Fan, Y. Cao, J. Yang, and M. Cheng (2019) EGNet:edge guidance network for salient object detection. In Int. Conf. Comput. Vis., Cited by: §1, TABLE III, TABLE IV, §5.1.2, §5.2, §5.2, TABLE V.
  • [100] T. Zhao and X. Wu (2019) Pyramid feature attention network for saliency detection. In IEEE Conf. Comput. Vis. Pattern Recog., pp. 3085–3094. Cited by: TABLE III, TABLE IV, Fig. 16, §5.1.2, §5.2, §5.2, §5.2, TABLE V.
  • [101] Z. Zhao, P. Zheng, S. Xu, and X. Wu (2019) Object detection with deep learning: a review. IEEE T. Neural Netw. Learn. Syst. 30 (11), pp. 3212–3232. Cited by: §1, §2.
  • [102] B. Zhou, A. Lapedriza, A. Khosla, A. Oliva, and A. Torralba (2017) Places: a 10 million image database for scene recognition. IEEE Trans. Pattern Anal. Mach. Intell. 40 (6), pp. 1452–1464. Cited by: §3.1.
  • [103] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba (2017) Scene parsing through ade20k dataset. In IEEE Conf. Comput. Vis. Pattern Recog., pp. 633–641. Cited by: §1, §3.
  • [104] Z. Zhou, M. M. Rahman Siddiquee, N. Tajbakhsh, and J. Liang (2018) UNet++: a nested u-net architecture for medical image segmentation. In D. Learn. Med. Image Anal., pp. 3–11. Cited by: TABLE III, TABLE IV, §5.1.2, §5.2, TABLE V.