
Expanding Low-Density Latent Regions for Open-Set Object Detection

Modern object detectors have achieved impressive progress under the close-set setup. However, open-set object detection (OSOD) remains challenging since objects of unknown categories are often misclassified to existing known classes. In this work, we propose to identify unknown objects by separating high/low-density regions in the latent space, based on the consensus that unknown objects are usually distributed in low-density latent regions. As traditional threshold-based methods only maintain limited low-density regions, which cannot cover all unknown objects, we present a novel Open-set Detector (OpenDet) with expanded low-density regions. To this aim, we equip OpenDet with two learners, Contrastive Feature Learner (CFL) and Unknown Probability Learner (UPL). CFL performs instance-level contrastive learning to encourage compact features of known classes, leaving more low-density regions for unknown classes; UPL optimizes unknown probability based on the uncertainty of predictions, which further divides more low-density regions around the cluster of known classes. Thus, unknown objects in low-density regions can be easily identified with the learned unknown probability. Extensive experiments demonstrate that our method can significantly improve the OSOD performance, e.g., OpenDet reduces the Absolute Open-Set Errors by 25%-35% on six open-set benchmarks. Code is available at: https://github.com/csuhan/opendet2.


1 Introduction

Although the past decade has witnessed significant progress in object detection [girshick2014rich, ren2017faster, redmon2016you, lin2017focal, tian2019fcos, carion2020end], modern object detectors are often developed with a close-set assumption, i.e., that the object categories appearing at test time are contained in the training set, and they quickly lose their efficacy in real-world scenarios where many object categories have never been seen during training. See Fig. 1 for an instance, where a representative object detector, e.g., Faster R-CNN [ren2017faster] trained on PASCAL VOC [everingham2010pascal], misclassifies zebra as horse with high confidence, since the new class zebra is not contained in PASCAL VOC. To alleviate this issue, Open-Set Object Detection (OSOD) has recently been investigated, where a detector trained on close-set datasets is asked to detect all known objects and identify unknown objects in open-set conditions.

Figure 1: Trained on close-set images, (a) threshold-based methods, e.g., Faster R-CNN, usually misclassify unknown objects (black triangles, e.g. zebra) into known classes (colored dots, e.g. horse) due to limited low-density regions (in gray color). (b) Our method identifies unknown objects by expanding low-density regions. We encourage compact proposal features and learn clear separation between known and unknown classes.

OSOD can be seen as an extension of Open-Set Recognition (OSR) [scheirer2012toward]. Although OSR has been extensively studied [scheirer2012toward, bendale2016towards, ge2017generative, yoshihashi2019classification, chen2020learning, zhou2021learning], few works have attempted to solve the challenging OSOD task. Dhamija et al. [dhamija2020overlooked] first benchmarked the open-set performance of several representative methods [ren2017faster, redmon2016you, lin2017focal], showing that most detectors are overestimated in open-set conditions. Miller et al. [miller2018dropout, miller2019evaluating] adopt dropout sampling [gal2016dropout] to improve the robustness of detectors in open-set conditions. Joseph et al. [joseph2021towards] proposed an energy-based unknown identifier by fitting the energy distributions of known and unknown classes. In summary, prior works usually leverage hidden evidence (e.g., the output logits) of pre-trained models as unknown indicators, at the cost of an additional training step and complex post-processing. Can we train an open-set detector with only close-set data, and directly apply it to open-set environments without complex post-processing?

We draw inspiration from the consensus that known objects are usually clustered to form high-density regions in the latent space, while unknown objects (or novel patterns) are distributed in low-density regions [grandvalet2004semi, chapelle2006semi, ren2018meta]. From this perspective, proper separation of high/low-density latent regions is crucial for unknown identification. However, traditional methods, e.g., hard-thresholding (Fig. 1 (a)), only maintain limited low-density regions, as higher thresholds will hinder the close-set accuracy. In this work, we propose to identify unknown objects by expanding low-density latent regions (Fig. 1 (b)). Firstly, we learn compact features of known classes, leaving more low-density regions for unknown classes. Then, we learn an unknown probability for each instance, which serves as a threshold to divide more low-density regions around the cluster of known classes. Finally, unknown objects distributed in these regions can be easily identified.

More specifically, we propose an Open-set Detector (OpenDet) with two learners, Contrastive Feature Learner (CFL) and Unknown Probability Learner (UPL), which expand low-density regions from two directions. Let us view the latent space as the union of a high-density sub-space and a low-density sub-space. CFL performs instance-level contrastive learning to encourage intra-class compactness and inter-class separation of known classes, which expands the low-density sub-space by narrowing the high-density one. UPL learns an unknown probability for each instance based on the uncertainty of predictions. As we carefully optimize UPL to maintain the close-set accuracy, the learned unknown probability can serve as a threshold to divide more low-density regions around the high-density clusters of known classes. In the testing phase, we directly classify an instance into the unknown class if its unknown probability is the largest among all classes.

To demonstrate the effectiveness of our method, we take PASCAL VOC [everingham2010pascal] for close-set training and construct several open-set settings considering both VOC and COCO [lin2014microsoft]. Compared with previous methods, OpenDet shows significant improvements on all open-set metrics without compromising the close-set accuracy. For example, OpenDet reduces the Absolute Open-Set Errors (introduced in Sec. 4.1) by 25%-35% on six open-set settings. We also visualize the latent features in Fig. 2, where OpenDet learns a clear separation between known and unknown classes. Besides, we conduct extensive ablation experiments to analyze the effect of our main components and core design choices. Furthermore, we show that OpenDet can be easily extended to one-stage detectors and achieves satisfactory results. We summarize our contributions as:


  • To our best knowledge, we are the first to solve the challenging OSOD by modeling low-density latent regions.

  • We present a novel Open-set Detector (OpenDet) with two well-designed learners, CFL and UPL, which can be trained in an end-to-end manner and directly applied to open-set environments.

  • We introduce a new OSOD benchmark. Compared with previous methods, OpenDet shows significant improvements on all open-set metrics, e.g., OpenDet reduces the Absolute Open-Set Errors by 25%-35%.

2 Related Work

Figure 2: t-SNE visualization of latent features. We take VOC classes as known classes (colored dots), and non-VOC classes in COCO as unknown classes (black triangles). Our method learns a clear separation between known and unknown classes.
Figure 3: Overview of our proposed method. Left: OpenDet is a two-stage detector with (a) Contrastive Feature Learner (CFL) and (b) Unknown Probability Learner (UPL). CFL first encodes proposal features into low-dimensional embeddings with the Contrastive Head (CH). Then we optimize these embeddings between the mini-batch and memory bank with an Instance Contrastive Loss $\mathcal{L}_{IC}$. UPL learns probabilities for both the known classes and the unknown class with the cross-entropy loss $\mathcal{L}_{CE}$ and the Unknown Probability Loss $\mathcal{L}_{UP}$. Right: A toy illustration of how different components work. Colored dots and triangles denote proposal features of different known and unknown classes, respectively. Our method identifies unknown objects by expanding low-density latent regions (in gray color).

Open-Set Recognition. Early attempts on OSR [scheirer2014probability, jain2014multi, zhang2016sparse, bendale2015towards, junior2017nearest] usually leverage traditional machine learning methods, e.g., SVM [scheirer2014probability, jain2014multi]. Bendale et al. [bendale2016towards] introduced OpenMax, the first deep learning-based OSR method, which redistributes the output probabilities of the softmax layer. Other approaches include generative adversarial network-based methods [ge2017generative, neal2018open], which generate potential open-set images to train an open-set classifier; reconstruction-based methods [yoshihashi2019classification, oza2019c2ae, sun2020conditional], which adopt auto-encoders to recover latent features and identify unknowns by reconstruction errors; and prototype-based methods [chen2020learning, chen2021adversarial], which identify open-set images by measuring the distance to learned prototypes. In addition, Zhou et al. [zhou2021learning] proposed to learn data placeholders to anticipate open-set data and classifier placeholders to distinguish known from unknown. Kong et al. [kong2021opengan] utilized an adversarially trained discriminator to detect unknown examples. Our method is most related to [zhou2021learning]. Differently, [zhou2021learning] requires close-set pre-training and calibration on validation sets, while our method is trained in an end-to-end manner, and the learned unknown probability is accurate and calibration-free.

Open-Set Object Detection is an extension of OSR to object detection. Dhamija et al. [dhamija2020overlooked] first formalized OSOD and benchmarked some representative detectors by their classifiers. Classifiers with a background class [ren2017faster] perform better than one-vs-rest [lin2017focal] and objectness-based [redmon2016you] classifiers in handling unknown objects. Dhamija et al. [dhamija2020overlooked] also show that the performance of most detectors is overestimated in open-set conditions. Miller et al. [miller2018dropout, miller2019evaluating] utilized dropout sampling [gal2016dropout] to estimate uncertainty in object detection and thus reduce open-set errors. Joseph et al. [joseph2021towards] proposed an energy-based unknown identifier by fitting the energy distributions of known and unknown classes. However, the approach in [joseph2021towards] requires extra open-set data of unknown classes, which violates the original definition of OSOD. In summary, previous methods leverage hidden evidence (e.g., the output logits) of pre-trained models as unknown indicators, but they need an additional training step and complex post-processing to estimate the unknown indicator. In contrast, OpenDet can be trained with only close-set data and directly identifies unknown objects with the learned unknown probability.

Contrastive Learning is a methodology that learns representations by pulling together positive sample pairs while pushing apart negative sample pairs, and it has recently been popularized for self-supervised representation learning [he2020momentum, chen2020simple, grill2020bootstrap, chen2021exploring, caron2020unsupervised, ding2021unsupervised]. Khosla et al. [khosla2020supervised] first extended self-supervised contrastive learning to the fully-supervised setting, which has received a lot of attention in other fields, e.g., long-tailed recognition [wang2021contrastive, cui2021parametric], semantic segmentation [van2021unsupervised, wang2021exploring] and few-shot object detection [sun2021fsce]. Our approach is also inspired by supervised contrastive learning [khosla2020supervised]. In this work, we explore instance-level contrastive learning to learn compact features of object proposals.

Uncertainty Estimation. Neural networks tend to produce over-confident predictions [lakshminarayanan2016simple]. Estimating the uncertainty of model predictions is important for real-world applications. Currently, uncertainty estimation methods can be categorized into sampling-based and sampling-free methods. Sampling-based methods ensemble the predictions of multiple runs [gal2016dropout] or multiple models [lakshminarayanan2016simple], which is not applicable to speed-critical object detection. Sampling-free methods learn an additional confidence value [devries2018learning, sensoy2018evidential] to estimate uncertainty. Our method belongs to the latter family: the learned unknown probability reflects the uncertainty of predictions.

3 Methodology

3.1 Preliminary

We formalize OSOD based on prior works [dhamija2020overlooked, joseph2021towards]. Let us denote an object detection dataset with $\mathcal{D} = \{(x, y)\}$, where $x$ is an input image and $y$ denotes a set of objects with corresponding class labels and bounding boxes. We train the detector on a training set containing only known classes $\mathcal{K}$, and test it on a testing set containing objects from both the known classes $\mathcal{K}$ and unknown classes $\mathcal{U}$. The goal is to detect all known objects (objects $\in \mathcal{K}$) and identify unknown objects (objects $\in \mathcal{U}$) so that they are not misclassified into $\mathcal{K}$. As it is impossible to enumerate the infinitely many unknown classes, we denote them all with a single unknown class.

Different from OSR, OSOD has its unique challenges. In OSR, an image belongs either to $\mathcal{K}$ or to $\mathcal{U}$; any example outside $\mathcal{K}$ is defined as unknown. In OSOD, an image may contain objects from both $\mathcal{K}$ and $\mathcal{U}$, which is referred to as mixed unknowns [dhamija2020overlooked]. That means unknown objects also appear in the training images but have not been labeled. Besides, detectors usually keep a background class, which is easily confused with $\mathcal{U}$.

3.2 Baseline Setup

We set up the baseline with Faster R-CNN [ren2017faster], which consists of a backbone, a Region Proposal Network (RPN) and R-CNN. The standard R-CNN includes a shared fully connected (FC) layer and two separate FC layers for classification and regression. We augment R-CNN in three ways. (a) We replace the shared FC layer with two parallel FC layers so that modules applied to the classification branch do not affect the regression task. (b) Inspired by [chen2020learning, wang2020frustratingly], we use a cosine similarity-based classifier to alleviate the over-confidence issue [bendale2016towards, padhy2020revisiting]. Specifically, we adopt scaled cosine similarity scores as output logits: $z_{i,j} = s \cdot \frac{f_i^{\top} w_j}{\lVert f_i \rVert \lVert w_j \rVert}$, where $z_{i,j}$ denotes the similarity score between the $i$-th proposal feature $f_i$ and the weight vector $w_j$ of class $j$, and $s$ is a scaling factor ($s$=20 by default). (c) The box regressor is set to be class-agnostic, i.e., the regression branch outputs a single length-4 vector rather than one length-4 vector per class. Note that our baseline does not by itself improve the open-set performance, but it is effective for the whole framework (Fig. 3).
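To make the baseline concrete, the following is a minimal PyTorch sketch of a scaled cosine-similarity classification layer as described above. The scaling factor default of 20 comes from the text; the class name, feature dimension, and the number of output classes are illustrative assumptions, not the official opendet2 implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CosineSimClassifier(nn.Module):
    """Cosine-similarity classifier: logits are scaled cosine similarities
    between L2-normalized proposal features and per-class weight vectors."""

    def __init__(self, feat_dim: int, num_classes: int, scale: float = 20.0):
        super().__init__()
        # One weight vector per class (known classes plus unknown/background slots).
        self.weight = nn.Parameter(torch.empty(num_classes, feat_dim))
        nn.init.normal_(self.weight, std=0.01)
        self.scale = scale  # s = 20 by default, as stated in the text

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (num_proposals, feat_dim)
        feats = F.normalize(feats, dim=1)
        weight = F.normalize(self.weight, dim=1)
        return self.scale * feats @ weight.t()  # (num_proposals, num_classes)

# Example (illustrative): 20 known VOC classes + 1 unknown + 1 background = 22 outputs.
logits = CosineSimClassifier(feat_dim=1024, num_classes=22)(torch.randn(8, 1024))
```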

3.3 Contrastive Feature Learner

This section presents the Contrastive Feature Learner (CFL), which encourages intra-class compactness and inter-class separation and thus expands low-density latent regions by narrowing the clusters of known classes. As shown in Fig. 3 (a), CFL contains a contrastive head (CH), a memory bank, and an instance contrastive loss $\mathcal{L}_{IC}$. For a proposal feature, we first encode it into a low-dimensional embedding with CH. Then, we optimize the embeddings from the mini-batch and the memory bank with $\mathcal{L}_{IC}$. We give more details in the following parts.

Contrastive Head. We build a contrastive head (CH) to map the high-dimensional proposal feature to a low-dimensional proposal embedding. In detail, CH is a multilayer perceptron with sequential FC, ReLU, FC, and L2-Norm layers, which is applied to the classification branch of R-CNN during training and abandoned during inference.
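A minimal sketch of such a contrastive head (FC, ReLU, FC, L2-Norm), assuming PyTorch; the input and output dimensions are placeholders rather than the official configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContrastiveHead(nn.Module):
    """MLP mapping a proposal feature to a low-dimensional, L2-normalized embedding.
    Used only during training; discarded at inference time."""

    def __init__(self, in_dim: int = 1024, out_dim: int = 128):  # dims are assumptions
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, in_dim),
            nn.ReLU(inplace=True),
            nn.Linear(in_dim, out_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.mlp(x), dim=1)  # final L2-Norm layer
```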

Class-Balanced Memory Bank. Popular contrastive representation learning usually adopts a large mini-batch [khosla2020supervised] or a memory bank [he2020momentum] to increase the diversity of exemplars. Here we build a novel class-balanced memory bank to increase the diversity of object proposals. Specifically, for each known class we initialize a memory bank of size $Q$. Then, we sample representative proposals from a mini-batch in two steps: (a) We sample proposals whose IoU (Intersection over Union) with the ground truth is larger than a threshold, to ensure the proposals contain relevant semantics. (b) For each mini-batch, we sample the $q$ proposals that are least similar (i.e., minimum cosine similarity) to the existing exemplars in the memory bank. This step makes our memory banks store more diverse exemplars and enables long-term memory. Finally, we repeat (a) and (b) every iteration, where the oldest proposals leave the memory and the newest enter the queue.
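The sketch below illustrates one possible implementation of the class-balanced memory bank described above (a per-class FIFO queue with IoU filtering and least-cosine-similarity sampling); all names, shapes, and defaults are assumptions for illustration.

```python
import torch

class ClassBalancedMemory:
    """One fixed-size FIFO queue of L2-normalized proposal embeddings per known class."""

    def __init__(self, num_classes: int, size: int = 256, emb_dim: int = 128):
        self.size = size
        self.banks = [torch.zeros(0, emb_dim) for _ in range(num_classes)]

    @torch.no_grad()
    def update(self, emb, labels, ious, num_sample=16, iou_thr=0.7):
        """emb: (N, D) embeddings; labels, ious: (N,) per-proposal class and IoU."""
        keep = ious >= iou_thr                      # step (a): keep high-IoU proposals
        emb, labels = emb[keep], labels[keep]
        for c in labels.unique().tolist():
            cand = emb[labels == c]
            bank = self.banks[c]
            if len(bank) > 0:
                # step (b): prefer candidates least similar to stored exemplars
                sim = (cand @ bank.t()).max(dim=1).values
                cand = cand[sim.argsort()[:num_sample]]
            else:
                cand = cand[:num_sample]
            # enqueue the newest exemplars, dequeue the oldest
            self.banks[c] = torch.cat([bank, cand])[-self.size:]
```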

Instance-Level Contrastive Learning. Inspired by the supervised contrastive loss [khosla2020supervised], we propose an Instance Contrastive (IC) Loss to learn more compact features of object proposals. Assume we have a mini-batch of $N$ proposal embeddings $\{z_i\}_{i=1}^{N}$; the IC Loss is formulated as:

$\mathcal{L}_{IC} = \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}_{z_i},$   (1)

$\mathcal{L}_{z_i} = \frac{-1}{|M_{c_i}|} \sum_{z_j \in M_{c_i}} \log \frac{\exp(z_i \cdot z_j / \tau)}{\sum_{z_k \in M} \exp(z_i \cdot z_k / \tau)},$   (2)

where $c_i$ is the class label of the $i$-th proposal, $\tau$ is a temperature hyper-parameter, $M_{c_i}$ denotes the memory bank of class $c_i$, and $M = \cup_{c \in \mathcal{K}} M_c$. Note that we only optimize proposals whose IoU with the ground truth is larger than a threshold, analogous to the IoU threshold used for the memory bank.

Although unknown objects are unavailable during training, the separation of known classes benefits unknown identification. Optimizing $\mathcal{L}_{IC}$ is equivalent to pushing the clusters of known classes away from low-density latent regions. As shown in Fig. 2 (b), our method learns a clear separation between known and unknown classes with only close-set training data.
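A minimal sketch of the instance contrastive loss computed between a mini-batch and the memory bank, following the supervised-contrastive form referenced above; the reduction and masking details of the official implementation may differ.

```python
import torch

def instance_contrastive_loss(batch_emb, batch_labels, mem_emb, mem_labels, tau=0.1):
    """batch_emb: (N, D) L2-normalized proposal embeddings from the mini-batch.
    mem_emb / mem_labels: all exemplars currently stored in the memory banks."""
    sim = batch_emb @ mem_emb.t() / tau                       # (N, M) scaled similarities
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)  # log softmax over memory
    pos_mask = (batch_labels[:, None] == mem_labels[None, :]).float()
    # average log-probability over same-class exemplars, then over the batch
    per_inst = -(pos_mask * log_prob).sum(1) / pos_mask.sum(1).clamp(min=1)
    return per_inst.mean()
```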

Method | VOC mAP | VOC-COCO-20: WI AOSE mAP AP | VOC-COCO-40: WI AOSE mAP AP | VOC-COCO-60: WI AOSE mAP AP
FR-CNN [ren2017faster] | 80.10 | 18.39 15118 58.45 0 | 22.74 23391 55.26 0 | 18.49 25472 55.83 0
FR-CNN* [ren2017faster] | 80.01 | 18.83 11941 57.91 0 | 23.24 18257 54.77 0 | 18.72 19566 55.34 0
PROSER [zhou2021learning] | 79.68 | 19.16 13035 57.66 10.92 | 24.15 19831 54.66 7.62 | 19.64 21322 55.20 3.25
ORE [joseph2021towards] | 79.80 | 18.18 12811 58.25 2.60 | 22.40 19752 55.30 1.70 | 18.35 21415 55.47 0.53
DS [miller2018dropout] | 80.04 | 16.98 12868 58.35 5.13 | 20.86 19775 55.31 3.39 | 17.22 21921 55.77 1.25
OpenDet | 80.02 | 14.95 11286 58.75 14.93 | 18.23 16800 55.83 10.58 | 14.24 18250 56.37 4.36
Table 1: Comparisons with other methods on VOC and VOC-COCO-T1. We report the close-set performance (mAP) on VOC, and both close-set (mAP) and open-set (WI, AOSE, AP of the unknown class) performance of different methods on VOC-COCO-{20, 40, 60}. * means a higher score threshold (i.e., 0.1) for testing.
Method | VOC-COCO-0.5n: WI AOSE mAP AP | VOC-COCO-n: WI AOSE mAP AP | VOC-COCO-4n: WI AOSE mAP AP
FR-CNN [ren2017faster] | 9.25 6015 77.97 0 | 16.14 12409 74.52 0 | 32.89 48618 63.92 0
FR-CNN* [ren2017faster] | 9.01 4599 77.66 0 | 16.00 9477 74.17 0 | 33.11 37012 63.80 0
PROSER [zhou2021learning] | 9.32 5105 77.35 7.48 | 16.65 10601 73.55 8.88 | 34.60 41569 63.09 11.15
ORE [joseph2021towards] | 8.39 4945 77.84 1.75 | 15.36 10568 74.34 1.81 | 32.40 40865 64.59 2.14
DS [miller2018dropout] | 8.30 4862 77.78 2.89 | 15.43 10136 73.67 4.11 | 31.79 39388 63.12 5.64
OpenDet | 6.44 3944 78.61 9.05 | 11.70 8282 75.56 12.30 | 26.69 32419 65.55 16.76
Table 2: Comparisons with other methods on VOC-COCO-T2. Note that we put VOC-COCO-2n in the appendix due to limited space.

3.4 Unknown Probability Learner

As introduced in Sec. 3.3, CFL expands low-density latent regions by narrowing the cluster of known classes (i.e., high-density regions). However, we still lack explicit boundaries to separate high/low-density regions. Traditional threshold-based methods with a small score threshold (e.g., 0.05) only maintain limited low-density regions, which cannot cover all unknown objects. Here we present Unknown Probability Learner (UPL) to divide more low-density latent regions around the cluster of known classes.

To this aim, we first augment the K-way classifier to a (K+1)-way classifier, where the (K+1)-th class denotes the unknown class. Then the problem becomes: how do we optimize the unknown class with only close-set training data? Consider a simple known vs. unknown classifier with available open-set data: we could directly train a good classifier by maximizing the margins between classes. Since we only have close-set data, we relax the maximum-margin principle and only ensure that all known objects are correctly classified, i.e., we maintain the close-set accuracy. With this premise, we introduce how to learn the unknown probability in the following.

Review Cross-Entropy (CE) Loss. We first review the softmax CE Loss, the default classification loss of Faster R-CNN. Let $z$ denote the classification logits of a proposal; the softmax probability of class $i$ is defined as:

$p_i = \frac{\exp(z_i)}{\sum_{j \in \mathcal{C}} \exp(z_j)},$   (3)

where $\mathcal{C}$ denotes all known classes $\mathcal{K}$, the unknown class and the background. Then, we formulate the softmax CE Loss as:

$\mathcal{L}_{CE} = -\sum_{i \in \mathcal{C}} y_i \log(p_i),$   (4)

where $y_i$ is the one-hot label of the ground-truth class $gt$. For simplicity, we re-write $\mathcal{L}_{CE}$ as:

$\mathcal{L}_{CE} = -\log(p_{gt}).$   (5)

Learning Unknown Probability. Since there is no supervision for the unknown probability $p_u$, we consider a conditional probability in which the ground-truth class is excluded. Formally, we define $\hat{p}_u$ as the softmax probability computed without the logit of the ground-truth class $gt$:

$\hat{p}_u = \frac{\exp(z_u)}{\sum_{j \in \mathcal{C},\, j \neq gt} \exp(z_j)},$   (6)

where $u$ is short for the unknown class. Then, similar to the CE Loss, we formulate an Unknown Probability (UP) Loss to optimize $\hat{p}_u$, which is defined as:

$\mathcal{L}_{UP} = -\log(\hat{p}_u).$   (7)

After that, we jointly optimize the CE Loss and the UP Loss (illustrated in Fig. 3 (b)), where $\mathcal{L}_{CE}$ aims to maintain the close-set accuracy and $\mathcal{L}_{UP}$ learns the unknown probability. Taking Fig. 3 (bottom-right) for an illustration, optimizing $\mathcal{L}_{UP}$ is equivalent to dividing more low-density latent regions (in gray color) from the known classes. Once training is finished, the learned unknown probability serves as an indicator to identify unknown objects in these low-density regions.

Uncertainty-weighted Optimization. Although we optimize the conditional probability $\hat{p}_u$ instead of $p_u$, $\mathcal{L}_{UP}$ still penalizes the convergence of $\mathcal{L}_{CE}$, leading to an accuracy drop on known classes. Inspired by uncertainty estimation [devries2018learning, sensoy2018evidential], we add a weighting factor $w(p_{gt})$ to $\mathcal{L}_{UP}$, defined as a function of the ground-truth probability $p_{gt}$:

(8)

where the function involves a hyper-parameter (set to 1 by default) that controls its shape. Despite many design choices of $w(\cdot)$ (shown in Tab. 6), we choose a simple yet effective one in Eq. 8. We are inspired by the popular uncertainty signal entropy, $-p\log p$. Since Eq. 8 has a curve shape similar to entropy (see our appendix), it can also reflect uncertainty, but our empirical findings suggest that Eq. 8 is easier to optimize than entropy. Finally, we formulate the uncertainty-weighted UP Loss as follows:

$\mathcal{L}_{UP} = -w(p_{gt}) \log(\hat{p}_u).$   (9)
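A sketch of the UP loss described above: the unknown probability is a softmax computed over all logits except the ground-truth one, and the loss is down-weighted by a function of the ground-truth probability. The weighting function below is only a placeholder with the qualitative shape described in the text (entropy-like, controlled by a hyper-parameter); it is not the paper's exact Eq. 8.

```python
import torch
import torch.nn.functional as F

def unknown_probability_loss(logits, gt_classes, unk_index, alpha=1.0):
    """logits: (N, C) over known classes + unknown + background.
    gt_classes: (N,) ground-truth indices; unk_index: column of the unknown class."""
    probs = F.softmax(logits, dim=1)
    p_gt = probs.gather(1, gt_classes[:, None]).squeeze(1)

    # conditional softmax: renormalize after removing the ground-truth logit (Eq. 6)
    masked = logits.clone()
    masked.scatter_(1, gt_classes[:, None], float('-inf'))
    p_unk_cond = F.softmax(masked, dim=1)[:, unk_index]

    # placeholder uncertainty weight with an entropy-like bump; NOT the paper's Eq. 8
    w = (p_gt * (1.0 - p_gt)).pow(alpha)

    return -(w * torch.log(p_unk_cond.clamp(min=1e-8))).mean()  # weighted UP loss (Eq. 9)
```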

Hard Example Mining. It is unreasonable to let all known objects learn the unknown probability, as they do not belong to the unknown class. Therefore, we present uncertainty-guided hard example mining to optimize $\mathcal{L}_{UP}$ only with high-uncertainty proposals, which may overlap with real unknown objects in the latent space. Here we consider two uncertainty-guided mining methods:


  • Max entropy. Entropy is a popular uncertainty measure [lakshminarayanan2016simple, malinin2018predictive] defined as $H(p) = -\sum_{i \in \mathcal{C}} p_i \log p_i$. For a mini-batch, we sort the proposals in descending order of entropy and select the top-k examples.

  • Min max-probability. Max-probability, i.e., the maximum probability over all classes, $p_{max} = \max_{i \in \mathcal{C}} p_i$, is another uncertainty signal. We select the top-k examples with the minimum max-probability.

Furthermore, since background proposals usually overwhelm the mini-batch, we sample the same number of foreground and background proposals, enabling our model to recall unknown objects from the background class.
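A sketch of uncertainty-guided hard example mining with balanced foreground/background sampling, using the min max-probability criterion; the helper names and tensor conventions are illustrative.

```python
import torch
import torch.nn.functional as F

def mine_hard_examples(logits, is_foreground, k=3):
    """Select the k most uncertain foreground and k most uncertain background
    proposals using the 'min max-probability' criterion."""
    max_prob = F.softmax(logits, dim=1).max(dim=1).values  # lower = more uncertain
    selected = []
    for mask in (is_foreground, ~is_foreground):            # balance fg / bg
        idx = torch.nonzero(mask, as_tuple=False).squeeze(1)
        if idx.numel() == 0:
            continue
        order = max_prob[idx].argsort()                      # ascending: most uncertain first
        selected.append(idx[order[:k]])
    return torch.cat(selected) if selected else torch.empty(0, dtype=torch.long)

# Usage (illustrative): hard = mine_hard_examples(cls_logits, fg_mask, k=3)
# and the UP loss is then computed only on logits[hard].
```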

3.5 Overall Optimization

Our method can be trained in an end-to-end manner with the following multi-task loss:

$\mathcal{L} = \mathcal{L}_{rpn} + \mathcal{L}_{reg} + \mathcal{L}_{CE} + \lambda_{1} \mathcal{L}_{IC} + \lambda_{2} \mathcal{L}_{UP},$   (10)

where $\mathcal{L}_{rpn}$ denotes the total loss of the RPN, $\mathcal{L}_{reg}$ is the smooth L1 loss for box regression, and $\lambda_{1}$, $\lambda_{2}$ are weighting coefficients. Note that $\lambda_{1}$ is decayed with the current iteration so that we gradually decrease the weight of $\mathcal{L}_{IC}$ for better convergence of $\mathcal{L}_{CE}$ and $\mathcal{L}_{UP}$.
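A sketch of how the overall objective could be assembled, with the IC-loss weight linearly decayed from its initial value (0.1, as stated in Sec. 4.1) to zero; the variable names and the exact decay schedule are assumptions.

```python
def total_loss(loss_rpn, loss_cls_ce, loss_box_reg, loss_ic, loss_up,
               cur_iter, max_iter, ic_weight_init=0.1, up_weight=1.0):
    """Multi-task loss of Eq. 10: RPN + CE + box regression + weighted IC + weighted UP.
    The IC weight decays linearly from ic_weight_init to 0 over training."""
    ic_weight = ic_weight_init * max(0.0, 1.0 - cur_iter / max_iter)
    return loss_rpn + loss_cls_ce + loss_box_reg + ic_weight * loss_ic + up_weight * loss_up
```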

4 Experiment

4.1 Experimental Setup

Datasets. We construct an OSOD benchmark using the popular PASCAL VOC [everingham2010pascal] and MS COCO [lin2014microsoft]. We take the trainval set of VOC for close-set training. Meanwhile, we take the 20 VOC classes and the 60 non-VOC classes in COCO to evaluate our method under different open-set conditions. Here we define two settings: VOC-COCO-T1 and VOC-COCO-T2. For setting T1, we gradually increase the number of open-set classes to build three joint datasets with n=5000 VOC testing images and COCO images containing {20, 40, 60} non-VOC classes, respectively. For setting T2, we gradually increase the Wilderness Ratio (WR), i.e., the ratio of the number of images with unknown objects to the number of images with known objects [dhamija2020overlooked], to construct four joint datasets with n=5000 VOC testing images and COCO images disjoint from the VOC classes. See our appendix for more details.

Evaluation Metrics. We use the Wilderness Impact (WI) [dhamija2020overlooked] to measure the degree to which unknown objects are misclassified into known classes: $WI = \frac{P_{\mathcal{K}}}{P_{\mathcal{K} \cup \mathcal{U}}} - 1$, where $P_{\mathcal{K}}$ and $P_{\mathcal{K} \cup \mathcal{U}}$ denote the precision of known classes evaluated under close-set and open-set conditions, respectively. Note that we scale the original WI by 100 for convenience. Following [joseph2021towards], we report WI at a recall level of 0.8. Besides, we use the Absolute Open-Set Error (AOSE) [miller2018dropout] to count the number of misclassified unknown objects. Furthermore, we report the mean Average Precision of known classes (mAP). Lastly, we measure the novelty-discovery ability by the average precision of the unknown class (AP). Note that WI, AOSE, and AP are open-set metrics, while mAP is a close-set metric.
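To make the metrics concrete, here is a minimal illustration of WI and AOSE based on the definitions above; it is not the official evaluation code, and the matching of detections to ground truth is assumed to be done beforehand.

```python
def wilderness_impact(precision_closed: float, precision_open: float) -> float:
    """WI = P_K / P_{K union U} - 1, scaled by 100 as in the paper."""
    return 100.0 * (precision_closed / precision_open - 1.0)

def absolute_open_set_error(pred_labels, matched_gt_is_unknown) -> int:
    """AOSE: number of detections matched to an unknown ground-truth object
    but classified as one of the known classes."""
    return sum(1 for lbl, is_unk in zip(pred_labels, matched_gt_is_unknown)
               if is_unk and lbl != "unknown")
```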

Comparison Methods. We compare OpenDet with the following methods: Faster R-CNN (FR-CNN) [ren2017faster], Dropout Sampling (DS) [miller2018dropout], ORE [joseph2021towards] and PROSER [zhou2021learning]. FR-CNN is the base detector of the other methods. We also report FR-CNN*, which adopts a higher score threshold for testing. We use the official code of ORE and reimplement DS and PROSER within the FR-CNN framework.

Implementation Details. We use ResNet-50 [he2016resnet] with a Feature Pyramid Network [lin2017feature] as the backbone of all methods. We adopt the same learning rate schedules as Detectron2 [wu2019detectron2]. The SGD optimizer is adopted with an initial learning rate of 0.02, momentum of 0.9, and weight decay of 0.0001. All models are trained on 8 GPUs with a batch size of 16. For CFL, we set the memory size Q=256 and the sampling size q=16. We sample proposals with an IoU threshold of 0.7 for the memory bank, and 0.5 for the mini-batch. For UPL, we sample k=3 hard examples for foreground and background proposals, respectively. Besides, we set the remaining hyper-parameters to 1.0 and 0.5. We set the initial value of the IC loss weight to 0.1 and linearly decrease it to zero.

CFL | UPL | WI | AOSE | mAP | AP
- | - | 19.26 | 16433 | 58.33 | 0   (baseline)
✓ | - | 17.92 | 15162 | 58.54 | 0
- | ✓ | 16.47 | 12018 | 57.91 | 14.27
✓ | ✓ | 14.95 | 11286 | 58.75 | 14.93
Table 3: Effect of different components on VOC-COCO-20.

4.2 Main Results

We compare OpenDet with other methods on VOC-COCO-T1 and VOC-COCO-T2. Tab. 1 shows results on VOC-COCO-T1 obtained by gradually increasing the number of unknown classes. Compared with FR-CNN, FR-CNN* with a higher score threshold (0.05→0.1) does not reduce WI, but results in a decrease in mAP, as known objects with low confidence are filtered out. PROSER improves AOSE and AP to some extent, but its WI and mAP are even worse. Although ORE and DS achieve comparable mAP, their improvements on the open-set metrics are limited. The proposed OpenDet outperforms other methods by a large margin. Taking VOC-COCO-20 as an example, OpenDet gains about 20%, 25%, and 14.93 points on WI, AOSE and AP, respectively, without compromising the mAP (58.75 vs. 58.45). Besides, we also report mAP on VOC, which indicates that OpenDet is competitive in the traditional close-set setting (80.02 vs. 80.10).

We also compare OpenDet with other methods by increasing the WR, where the results in Tab. 2 draw similar conclusions to Tab. 1. Our method performs better as the WR increases. For example, the mAP gains on VOC-COCO-{0.5n, n, 4n} are {0.64, 1.04, 1.63}, indicating that our method actually separates known and unknown classes.

4.3 Ablation Studies

In this section, we conduct ablation experiments on VOC-COCO-20 to analyze the effect of our main components and core design choices.

Overall Analysis. We first analyze the contribution of different components. As shown in Tab. 3, our two modules, CFL and UPL, show substantial improvement compared with the baseline. The combination of CFL and UPL further boosts the performance. We also visualize the latent features in Fig. 2, where our method learns clear separation between known and unknown classes.

| Memory size | WI | mAP
single-GPU mini-batch | ~50 | 16.19 | 58.29
cross-GPU mini-batch | ~50×8 | 15.88 | 58.07
class-agnostic memory bank | 5120 | 15.99 | 57.47
class-agnostic memory bank† | 65536 | 15.49 | 58.90
class-balanced memory bank | 256×20 | 14.95 | 58.75
Table 4: Class-balanced memory bank. We compare our class-balanced memory bank with other variants. We keep the class-agnostic memory bank the same size as ours (256×20=5120). 8 and 20 are the number of GPUs and the number of VOC classes, respectively. † means a larger memory size.
setting | (a) | (b) | (c) | (d) | (e) | (f)
mini-batch IoU threshold | 0.5 | 0.7 | 0.9 | 0.5 | 0.5 | 0.7
memory IoU threshold | 0.5 | 0.7 | 0.9 | 0.7 | 0.9 | 0.9
WI | 15.33 | 15.16 | 15.27 | 14.95 | 14.62 | 15.27
mAP | 58.29 | 58.55 | 58.32 | 58.75 | 58.66 | 58.33
(a) IoU thresholds
setting | (a) | (b) | (c) | (d) | (e) | (f)
sampling size q | 16 | 16 | 16 | 32 | 64 | 128
memory size Q | 128 | 256 | 512 | 256 | 256 | 256
WI | 15.36 | 14.95 | 14.47 | 15.24 | 15.43 | 14.77
mAP | 58.51 | 58.75 | 58.31 | 58.32 | 57.77 | 58.18
(b) Memory size and mini-batch sampling size
Table 5: Sampling strategies in CFL. We list different choices of (a) the mini-batch and memory IoU thresholds, and (b) the memory size Q and the mini-batch sampling size q.

Contrastive Feature Learner. We carefully study the design choices of the memory bank and the example sampling strategy in CFL. As $\mathcal{L}_{IC}$ is optimized between the current mini-batch and the memory bank, we investigate different designs of the memory in Tab. 4. Compared with the mini-batch (i.e., short-term memory), the settings with a memory bank perform better on WI. However, imbalanced training data makes the class-agnostic memory bank filled with high-frequency classes, leading to a drop in mAP (58.76→57.47). Enlarging the memory bank size (5120→65536) can alleviate this issue, but it requires more computation. The proposed class-balanced memory bank can store more diverse examples with a small memory size, outperforming the other variants.

We further study the design choices of the example sampling strategies. For the mini-batch, we consider its IoU threshold; for the memory bank, we consider its IoU threshold, the memory size Q, and the mini-batch sampling size q. As shown in Tab. 5 (a), settings (d) and (e) achieve the best results in mAP and WI, respectively, while (a)-(c) are worse than (d)-(e) in WI. This indicates that the mini-batch requires a loose constraint to gather more diverse examples, while the memory bank needs high-quality examples to represent the class centers. In Tab. 5 (b), settings (b) and (c) perform better than the others, which demonstrates that long-term memory (i.e., a larger Q) is a good choice for CFL.

WI AOSE mAP AP
baseline 19.26 16433 58.33 0
(a) identity 10.50 12185 56.42 11.33
(b) 14.70 11384 58.13 13.71
(c) 14.95 11286 58.75 14.93
(d) 14.86 11296 58.03 14.15
(e) 14.29 11690 57.75 14.65
Table 6: Different designs of the weighting function $w(\cdot)$ in $\mathcal{L}_{UP}$. $p_{max}$ denotes the maximum probability over all classes. (e) denotes normalized entropy, i.e., the entropy divided by $\log K$, where $K$ is the number of known classes.
Setting WI AOSE mAP AP
OpenDet (w/ HEM) 14.95 11286 58.75 14.93
(a) w/o HEM 18.33 13733 57.41 13.91
(b) w/o bg. 13.02 12230 56.53 13.49
(c) top-k:
1 14.46 12826 58.42 14.54
3 14.95 11286 58.75 14.93
5 14.66 10412 58.50 14.55
10 15.15 10358 58.25 14.86
all 18.40 11779 56.55 13.89
(d) metric:
random 17.01 13065 56.99 15.58
max entropy 14.29 11514 58.27 15.46
min max-probability 14.95 11286 58.75 14.93
Table 7: Hard example mining (HEM) in UPL. (a) Without HEM. (b) Without background: we only sample foreground proposals. (c) Varying top-k; the setting "all" means using all foreground proposals and an equal number of background proposals. (d) Mining methods.
Figure 4: Qualitative comparisons between the baseline (top) and OpenDet (bottom). We train both models on VOC and visualize the detection results on COCO. Note that we apply NMS between known classes and the unknown class for better visualization.

Unknown Probability Learner. We first explore different variants of the weighting function in $\mathcal{L}_{UP}$. Compared with the baseline, Tab. 6 (a) significantly reduces WI and AOSE, but leads to an mAP drop, which indicates that the learned unknown probability is overestimated. The formula of (b) is similar to entropy, and (c) is our default setting. As discussed in Sec. 3.4, both (b) and (c) achieve satisfactory results on WI and AOSE, but (c) outperforms (b) in mAP and AP. (d) and (e) are two further variants based on max-probability and entropy, respectively. They obtain comparable performance on the open-set metrics, but their mAP is lower than (c).

We also analyze the effect of hard example mining (HEM) in Tab. 7. Comparing Tab. 7 (a) (without HEM) with our default setting (with HEM) shows that HEM is crucial for UPL. Tab. 7 (b) indicates that background proposals are also necessary for unknown probability learning, e.g., OpenDet without background proposals suffers drops of 2.22 and 1.44 points in mAP and AP, respectively. Besides, we vary the hyper-parameter top-k in Tab. 7 (c), where HEM works well over a wide range of top-k values (up to 10), while optimizing all examples is not applicable. Tab. 7 (d) demonstrates the effectiveness of the two mining methods, i.e., max entropy and min max-probability.

Qualitative Comparisons. Fig. 4 compares the qualitative results of the baseline and OpenDet. OpenDet assigns the unknown label to unknown objects (bottom row), while the baseline method classifies them into known classes or the background (top row). See our appendix for more qualitative results.

4.4 Extend to One-Stage Detector

Although OpenDet is built on a two-stage detector, it can be easily extended to other architectures, e.g., the representative one-stage detector RetinaNet [lin2017focal]. RetinaNet has a backbone and two parallel sub-networks for classification and regression, respectively. Different from FR-CNN, RetinaNet adopts Focal Loss [lin2017focal] for dense classification. Here we show how to extend OpenDet to RetinaNet (denoted Open-RetinaNet). For CFL, we append the contrastive head to the second-last layer of the classification sub-network, adopt the same sampling strategies as in CFL, and optimize $\mathcal{L}_{IC}$ with pixel-wise features. For UPL, we only sample hard foreground examples since RetinaNet does not have a background class, and $\mathcal{L}_{UP}$ is jointly optimized with the Focal Loss. Tab. 8 reports the results on VOC-COCO-20, where Open-RetinaNet shows significant improvements on all open-set metrics and achieves comparable close-set mAP. For example, Open-RetinaNet gains 23.7%, 55.8%, and 11.02 points on WI, AOSE, and AP, respectively.

Method WI AOSE mAP AP
RetinaNet 14.58 38071 57.44 0
Open-RetinaNet 10.84 16815 57.25 11.02
Table 8: Performance of Open-RetinaNet on VOC-COCO-20.

5 Conclusions

This paper proposes a novel Open-set Detector (OpenDet) to solve the challenging OSOD task by expanding low-density latent regions. OpenDet consists of two well-designed learners, CFL and UPL, where CFL performs instance-level contrastive learning to learn more compact features and UPL learns the unknown probability that serves as a threshold to further separate known and unknown classes. We also build an OSOD benchmark and conduct extensive experiments to demonstrate the effectiveness of our method. Compared with other methods, OpenDet shows significant improvements on all metrics.

Limitations. We notice that some low-quality proposals belonging to known classes are given the unknown label during inference and cannot be filtered out by per-class non-maximum suppression. Although these proposals do not hurt the close-set mAP, this raises a new question about reducing false unknown predictions, which is also a direction for our future work.

Acknowledgement

This work was supported by the National Natural Science Foundation of China under grants 61922065, 41820104006 and 61871299. The numerical calculations in this paper were done on the supercomputing system in the Supercomputing Center of Wuhan University. Jian Ding is also supported by the China Scholarship Council.

Appendix A More Experimental Details

a.1 Datasets

In this section, we introduce more details about the dataset construction.

PASCAL VOC [everingham2010pascal]. We use VOC07 train and VOC12 trainval splits for the training, and VOC07 test split to evaluate the close-set performance. We take VOC07 val as the validation set.

VOC-COCO-T1. We divide the 80 COCO classes into four groups (20 classes per group) by their semantics [joseph2021towards]: (1) VOC classes; (2) Outdoor, Accessories, Appliance, Truck; (3) Sports, Food; (4) Electronic, Indoor, Kitchen, Furniture. We construct VOC-COCO-{20, 40, 60} with n=5000 VOC testing images and {n, 2n, 3n} COCO images containing {20, 40, 60} non-VOC classes with semantic shifts, respectively. Note that we only ensure each COCO image contains objects of the corresponding open-set classes, which means objects of VOC classes may also appear in these images. This setting is closer to real-world scenarios, where detectors need to carefully identify unknown objects without classifying known objects into the unknown class.

VOC-COCO-T2. We gradually increase the Wilderness Ratio to build four datasets with n=5000 VOC testing images and {0.5n, n, 2n, 4n} COCO images disjoint from the VOC classes. Compared with setting T1, T2 evaluates the model under a higher wilderness, where large numbers of testing instances are not seen during training.

Comparisons with existing benchmarks. [dhamija2020overlooked] proposed the first OSOD benchmark. They also use the data in VOC for close-set training, and both VOC and COCO for open-set testing. In the testing phase, they only vary the number of open-set images sampled from COCO, while ignoring the number of open-set categories. [joseph2021towards] proposed an open world object detection benchmark. They divide the open-set testing set into several groups by category. However, the wilderness ratio of each group is limited, and such data partitioning cannot reflect the real performance of detectors under extreme open-set conditions. In contrast, our proposed benchmark considers both the number of open-set classes (VOC-COCO-T1) and the number of open-set images (VOC-COCO-T2).

On the other hand, some works on open-set panoptic segmentation [hwang2021exemplar] divide a single dataset into a close-set and an open-set part. If an image contains both close-set and open-set instances, they simply remove the annotations of the open-set instances. Differently, we strictly follow the definition in OSR [scheirer2012toward] that unknown instances should not appear during training. To acquire enough open-set examples, we take both VOC and COCO for cross-dataset evaluation, which is a common practice in OSR [kong2021opengan, sun2020conditional, zhou2021learning].

a.2 Implementation Details

Training schedule. Inspired by [vaze2021open], which shows that a good close-set classifier benefits OSR, we train all models with the 3x schedule (i.e., 36 epochs). Besides, we enable UPL after several warm-up iterations (e.g., 100 iterations) to make sure the model produces valid probabilities.

Open-RetinaNet. We change some hyper-parameters for Open-RetinaNet. In OpenDet, we take object proposals as examples and apply CFL to proposal-wise embeddings, which correspond to the anchor boxes in RetinaNet. Therefore, we optimize the Instance Contrastive Loss with the pixel-wise features of each anchor box. Since the number of anchor boxes is much larger than the number of proposals in OpenDet, we enlarge the memory size to Q=1024 and the sampling size to q=64, and set the loss weight to 0.2 in CFL. Similarly, we sample 10 hard examples rather than 3 in UPL.

a.3 Evaluation Metrics

Firstly, we give a detailed formulation of the Wilderness Impact [dhamija2020overlooked], which is defined as:

$WI = \frac{P_{\mathcal{K}}}{P_{\mathcal{K} \cup \mathcal{U}}} - 1,$   (11)

where $P_{\mathcal{K}}$ is the precision of known classes measured on close-set data and $P_{\mathcal{K} \cup \mathcal{U}}$ is the precision measured under the open-set condition, in which any detection of an unknown-class object that is classified into one of the known classes $\mathcal{K}$ counts as a false positive. For the AP of unknown classes, we merge the annotations of all unknown classes into one class, and calculate the class-agnostic AP between the unknown-class predictions and this ground truth.

Appendix B Additional Main Results

Due to limited space in the main paper, we report the results on VOC-COCO-2n in Tab. A1, where OpenDet shows significant improvements over the other methods.

Method WI AOSE mAP AP
FR-CNN [ren2017faster] 24.18 24636 70.07 0
FR-CNN [ren2017faster] 24.05 18740 69.81 0
PROSER [zhou2021learning] 25.74 21107 69.32 10.31
ORE [joseph2021towards] 23.67 20839 70.01 2.13
DS [miller2018dropout] 23.21 20018 69.33 4.84
OpenDet 18.69 16329 71.44 14.96
Table A1: Comparisons with other methods on VOC-COCO-2n. This table is an extension of Tab. 2 in our main paper.

Appendix C Additional Ablation Studies

Figure A1: Visualization of the weighting function with different hyper-parameter values.
metric | baseline | +CFL | +UPL | Ours
intra-class variance | 3.79 | 2.83 | 3.05 | 2.47
inter-class distance | 62.74 | 65.17 | 64.69 | 66.31
Table A2: Quantitative analyses of the latent space. We calculate the intra-class variance and inter-class distance of the latent features.
loss weight | 0.01 | 0.1 | 0.5 | 1.0 | w/o decay
WI | 16.13 | 14.95 | 12.26 | 9.71 | 15.65
mAP | 58.90 | 58.75 | 57.47 | 53.36 | 58.43
Table A3: Loss weight of $\mathcal{L}_{IC}$. w/o decay: the weight is kept constant (i.e., 0.1) instead of being decayed.
temperature τ | 0.07 [he2020momentum] | 0.1 [khosla2020supervised] | 0.2
WI | 15.48 | 14.95 | 15.50
mAP | 57.80 | 58.75 | 58.87
Table A4: Temperature τ in $\mathcal{L}_{IC}$.

Visual analyses of the weighting function. In Fig. A1, we plot the weighting function under different hyper-parameter values. Compared with the entropy $-p\log p$, the proposed function can adjust its curve shape by changing the hyper-parameter; in other words, the model adjusts the weights of examples accordingly. The right part of Fig. A1 reports the model's open-set performance as the hyper-parameter varies, where smaller values reduce WI and AOSE.

Quantitative analyses of the latent space. In Fig. 2 of the main paper, we give a visual analysis of the latent space. Here we give a quantitative analysis in Tab. A2. Specifically, we calculate the intra-class variance and inter-class distance of the latent features. Tab. A2 shows that CFL and UPL, as well as their combination, reduce the intra-class variance and enlarge the inter-class distance. The results further confirm our conclusion in the main paper that our method can expand low-density latent regions.

More hyper-parameters in CFL. Loss weight: Tab. A3 shows that the loss weight is important for $\mathcal{L}_{IC}$: a small weight (e.g., 0.01) cannot learn compact features, while a large weight (e.g., 1.0) hinders the generalization ability. Besides, Tab. A3 (last column) also demonstrates the effectiveness of the loss-weight decay. Temperature: We try different temperatures used in previous works [he2020momentum, khosla2020supervised]. Tab. A4 indicates that τ=0.1 [khosla2020supervised] works better than the other settings.

Training strategy. Some works in OSR [zhou2021learning] adopted a pretrain-then-finetune paradigm to train the unknown identifier. We carefully design UPL so that OpenDet can be trained in an end-to-end manner. Tab. A5 shows that jointly optimizing UPL performs better than fine-tuning.

Open-RetinaNet. To further demonstrate the effectiveness of Open-RetinaNet, we report more results in Tab. A6, where Open-RetinaNet shows substantial improvements on WI, AOSE and AP, and achieves comparable performance on mAP.

Vision transformer as backbone. We find that a detector with a vision transformer backbone, e.g., Swin Transformer [liu2021swin], is a stronger baseline for OSOD. As shown in Tab. A7, models with a Swin-T backbone significantly surpass their ResNet counterparts.

Speed and computation. In the training stage, OpenDet only increases the training time by 14% (1.4h vs. 1.2h) and the memory usage by 1.2% (2424MB vs. 2395MB). In the testing phase, as we only add the unknown class to the classifier, OpenDet keeps a running speed and computation cost similar to FR-CNN.

setting backbone epoch WI mAP
end-to-end - - 14.95 58.75
fine-tune fixed 1 17.98 56.88
fixed 12 17.43 56.86
trainable 12 17.01 57.19
Table A5: End-to-end vs. fine-tune in UPL. End-to-end: we jointly optimize UPL and other modules in OpenDet. Fine-tune: we pretrain a model without UPL, and optimize UPL in the fine-tuning stage.
Method WI AOSE mAP AP
VOC:
RetinaNet - - 79.84 -
Open-RetinaNet - - 79.72 -
VOC-COCO-40:
RetinaNet 17.60 58383 53.81 0
Open-RetinaNet 13.65 25964 53.22 8.23
VOC-COCO-60:
RetinaNet 14.20 64327 54.68 0
Open-RetinaNet 11.28 30631 54.25 3.20
Table A6: Open-RetinaNet on more datasets.
Method backbone WI AOSE mAP AP
FR-CNN ResNet-50 18.39 15118 58.45 0
Swin-T 15.99 13204 63.09 0
OpenDet ResNet-50 14.95 11286 58.75 14.93
Swin-T 12.51 9875 63.17 15.77
Table A7: Comparisons of different backbones, i.e., ResNet-50 [he2016resnet] and Swin-T [liu2021swin].

Appendix D Comparison with ORE [joseph2021towards]

Implementation details. The original ORE adopted an R50-C4 FR-CNN framework and trained the model for 8 epochs. For fair comparisons, we replace the R50-C4 architecture with R50-FPN and train all models with the 3x schedule. Besides, as discussed in the issues of the official repository (https://github.com/JosephKJ/OWOD/issues?q=cannot+reproduce), we report our re-implemented results when comparing with ORE on the open world object detection task (see Tab. A9).

Analysis of ORE. To learn the energy-based unknown identifier (Sec. 4.3 in [joseph2021towards]), ORE requires an additional validation set with annotations of unknown classes. We notice that ORE continues to train on this validation set, so that the model can leverage information about the unknown classes. In Tab. A8, we find that ORE without training on the validation set (i.e., with frozen parameters) obtains a rather lower mAP (53.96 vs. 58.45), and large numbers of known examples are misclassified as unknown. In contrast, OpenDet outperforms ORE without using any information about unknown classes.

Method train model on valset WI AOSE mAP AP
FR-CNN 18.39 15118 58.45 0
ORE 8.46 2909 53.96 9.64
16.98 12868 58.35 5.13
OpenDet 14.95 11286 58.75 14.93
Table A8: Comparison with ORE [joseph2021towards]. The row with gray background is reported in our main paper.

Results on open world object detection. We also compare OpenDet with ORE on Task 1 of open world object detection. As shown in Tab. A9, without accessing open-set data in either the training set or the validation set, OpenDet outperforms FR-CNN and ORE by a large margin and achieves results comparable to the Oracle.

Method | unknown annotations in train set | unknown annotations in val set | WI | AOSE | mAP
FR-CNN (Oracle) | ✓ | - | 4.27 | 6862 | 60.43
FR-CNN | - | - | 6.03 | 8468 | 58.81
ORE | - | ✓ | 5.11 | 6833 | 58.93
OpenDet | - | - | 4.44 | 5781 | 59.01
Table A9: Results on open world object detection [joseph2021towards].

Appendix E Comparison with DS [miller2018dropout]

Implementation details. DS averages multiple runs of a dropout-enabled model to produce more confident predictions. As DS has no public implementation, we implement it based on the FR-CNN [ren2017faster] framework. Specifically, we insert a dropout layer at the second-last layer of the classification branch in R-CNN and set the dropout probability to 0.5. Previous works [dhamija2020overlooked, joseph2021towards] indicate that DS works even worse than the baseline method; we show that it is effective as long as we remove the dropout layer during training, i.e., we only use the dropout layer in the testing phase. Besides, the original DS can only tell what is known, but does not have a metric for the unknown (e.g., the unknown probability in OpenDet). We give DS the ability to identify unknowns by entropy thresholding [hendrycks2016baseline]: we label proposals whose entropy is larger than a threshold (i.e., 0.25) as unknown.
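A sketch of the DS re-implementation described above: dropout is enabled only at test time, class probabilities are averaged over multiple runs, and proposals whose predictive entropy exceeds 0.25 are labeled unknown. The model interface (returning per-proposal logits for a fixed set of proposals) is an assumption for illustration.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def dropout_sampling_predict(model, image, num_runs=30, entropy_thr=0.25):
    """model is assumed to contain a dropout layer in its classification branch
    and to return per-proposal classification logits for a fixed set of proposals."""
    model.eval()
    for m in model.modules():                 # enable dropout at test time only
        if isinstance(m, torch.nn.Dropout):
            m.train()
    probs = torch.stack(
        [F.softmax(model(image), dim=-1) for _ in range(num_runs)]).mean(0)

    entropy = -(probs * probs.clamp(min=1e-8).log()).sum(dim=-1)
    labels = probs.argmax(dim=-1)
    labels[entropy > entropy_thr] = -1        # -1 denotes the unknown class here
    return labels, probs
```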

DS with different numbers of runs. DS requires multiple runs for a given image. We report DS with different numbers of runs in Tab. A10. As the number of runs increases, DS shows substantial improvements on AOSE and mAP, while its performance on WI becomes worse. We report DS with 30 runs in our main paper, which is consistent with its original paper [miller2018dropout].

Method #runs WI AOSE mAP AP
FR-CNN 1 18.39 15118 58.45 0
DS 1 15.26 18227 56.60 5.67
3 16.41 14593 57.88 5.48
5 16.76 13862 57.98 5.31
10 16.91 13327 58.24 4.97
30 16.98 12868 58.35 5.13
50 17.01 12757 58.29 4.94
OpenDet 1 14.95 11286 58.75 14.93
Table A10: Comparison with DS [miller2018dropout]. #runs denotes the number of runs used for ensemble. The row with gray background is reported in our main paper.

Appendix F More Qualitative Results.

Fig. A3 gives more qualitative comparisons between the baseline method and OpenDet. OpenDet can recall unknown objects from known classes and the "background". Besides, we also show two failure cases in Fig. A2: (a) OpenDet performs poorly in some scenes with dense objects, e.g., images with many people; (b) OpenDet classifies "real" background as the unknown class.

Figure A2: Failure cases.
Figure A3: More qualitative comparisons between the baseline and OpenDet.

References