source code for CVPR'22 paper "Unknown-Aware Object Detection: Learning What You Don’t Know from Videos in the Wild"
Building reliable object detectors that can detect out-of-distribution (OOD) objects is critical yet underexplored. One of the key challenges is that models lack supervision signals from unknown data, producing overconfident predictions on OOD objects. We propose a new unknown-aware object detection framework through Spatial-Temporal Unknown Distillation (STUD), which distills unknown objects from videos in the wild and meaningfully regularizes the model's decision boundary. STUD first identifies the unknown candidate object proposals in the spatial dimension, and then aggregates the candidates across multiple video frames to form a diverse set of unknown objects near the decision boundary. Alongside, we employ an energy-based uncertainty regularization loss, which contrastively shapes the uncertainty space between the in-distribution and distilled unknown objects. STUD establishes the state-of-the-art performance on OOD detection tasks for object detection, reducing the FPR95 score by over 10%. Code is available at https://github.com/deeplearning-wisc/stud.
Object detection models have achieved remarkable success in known contexts for which they are trained. Yet, they often struggle with out-of-distribution (OOD) data—samples from unknown classes that the network has not been exposed to during training, and therefore should not be predicted by the model in testing. Teaching the object detectors to be aware of unknown objects is critical for building a reliable vision system, especially in safety-critical applications like autonomous driving [du2022vos] and medical analysis [DBLP:journals/corr/abs-2007-04250].
While much research progress is made in OOD detection for classification models [hendrycks2016baseline, lakshminarayanan2017simple, liang2018enhancing, lee2018simple, liu2020energy, tack2020csi, hsu2020generalized], the problem remains underexplored in the context of object detection. Unlike image-level OOD detection, detecting unknowns for object detection requires a finer-grained understanding of the complex scenes. In practice, an image can be OOD in specific regions while being in-distribution (ID) elsewhere. Taking autonomous driving as an example, we observe that an object detection model trained to recognize ID objects (e.g., cars, pedestrians) can produce a high-confidence prediction for an unseen object such as a deer; see Figure 1(a). This happens when our object detector minimizes its training error without explicitly accounting for the uncertainty that could appear outside the training categories. Unfortunately, the plethora of ways that unknown objects can emerge are innumerable in an open world. It is arguably expensive to annotate a large number of OOD objects in complex scenes—in addition to the already costly process of ID data collection.
In this paper, we propose a new unknown-aware object detection framework through Spatial-Temporal Unknown Distillation (STUD), which distills unknown objects from videos in the wild and meaningfully regularizes the model’s decision boundary. Video data naturally captures the open-world environment that the model operates in, and encapsulates a mixture of both known and unknown objects; see Figure 1(b). For example, buildings and trees (OOD) may appear in the driving video, though they are not labeled explicitly for training an object detector for cars and pedestrians (ID). Our approach draws an analogy to the concept of distillation in chemistry, which refers to the “process of separating the substances from a mixture” [doi:10.1021/ie50303a003]. While classic object detection models primarily use the labeled known objects for training, we attempt to capitalize on the unknown ones for model regularization by jointly optimizing object detection and OOD detection performance.
Concretely, our framework consists of two components, tackling challenges of (1) distilling diverse unknown objects from videos, and (2) regularizing object detector with the distilled unknown objects. To address the first problem, we introduce a new spatial-temporal unknown distillation approach, which automatically constructs diverse unknown objects (Section 3.1). In the spatial dimension, for each ID object in a frame, we identify the unknown object candidates in the reference frames based on an OOD measurement. We then distill the unknown object by linearly combining the selected objects in the feature space, weighted by the dissimilarity measurement. The distilled unknown object therefore captures a more diverse distribution over multiple objects than using single ones. In the temporal dimension, we propose aggregating unknown objects from multiple video frames, which captures additional diversity of unknowns in the temporal dimension.
Leveraging the distilled unknown objects, we further employ an unknown-aware training objective (Section 3.2). Unlike vanilla object detection, we train the object detector with an uncertainty regularization branch. Our regularization facilitates learning a more conservative decision boundary between ID and OOD objects, which helps flag unseen OOD objects during inference. To achieve this, the regularization contrastively shapes the uncertainty surface, which produces larger probabilistic scores for ID objects and vice versa, enabling effective OOD detection in testing. Our key contributions are summarized as follows:
We propose a new framework STUD, addressing a challenging yet underexplored problem of unknown-aware object detection. To the best of our knowledge, we are the first to exploit the rich information from videos to enable OOD identification for the object detection models.
STUD effectively regularizes object detectors by distilling diverse unknown objects in both spatial and temporal dimensions without costly human annotations of OOD objects. Moreover, we show that STUD is more advantageous than synthesizing unknowns in the high-dimensional pixel space (e.g., using GAN [lee2018training]) or using negative proposals as unknowns [DBLP:journals/corr/abs-2103-02603].
We extensively evaluate the proposed STUD on large-scale BDD100K [DBLP:conf/cvpr/YuCWXCLMD20] and Youtube-VIS datasets [DBLP:conf/iccv/YangFX19]. STUD obtains state-of-the-art results, outperforming the best baseline by a large margin (10.88% in FPR95 on BDD100K) while preserving the accuracy of object detection on ID data.
We start by formulating the OOD detection problem for the object detection task. Most previous formulations of OOD detection treat entire images as anomalies, which can lead to ambiguity shown in Figure 1(a). In particular, natural images are not monolithic entities but instead are composed of numerous objects and components. Knowing which regions of an image are anomalous allows for the safe handling of unfamiliar objects. Compared to image-level OOD detection, object-level OOD detection is more relevant in realistic perception systems, yet also more challenging as it requires reasoning OOD uncertainty at the fine-grained object level. We design reliable object detectors that are aware of unknown OOD objects in testing. That is, an object detector trained on the ID categories (e.g., cars, trucks) can identify test-time objects (e.g., deer) that do not belong to the training categories and refrain from making a confident prediction on them.
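We denote the input and label space by $\mathcal{X}$ and $\mathcal{Y} = \{1, 2, \ldots, K\}$, respectively. Let $\mathbf{x} \in \mathcal{X}$ be the input image, $\mathbf{b} \in \mathbb{R}^4$ be the bounding box coordinates associated with objects in the image, and $y \in \mathcal{Y}$ be the semantic label of the object. An object detection model is trained on ID data $\mathcal{D} = \{(\mathbf{x}_i, \mathbf{b}_i, y_i)\}_{i=1}^{N}$ drawn from an unknown joint distribution $\mathcal{P}$. We use neural networks with parameters $\theta$ to model the bounding box regression $p_\theta(\mathbf{b} \mid \mathbf{x})$ and the classification $p_\theta(y \mid \mathbf{x}, \mathbf{b})$.
The OOD detection can be formulated as a binary classification problem, distinguishing between the in- vs. out-of-distribution objects. Let $P_{\mathcal{X}}$ denote the marginal probability distribution on $\mathcal{X}$. Given a test input $\mathbf{x}^* \sim P_{\mathcal{X}}$, as well as an object $\mathbf{b}^*$ predicted by the object detector, the goal is to predict $p_\theta(g \mid \mathbf{x}^*, \mathbf{b}^*)$. We use $g = 1$ to indicate a detected object being ID, and $g = 0$ being OOD, with semantics outside the support of $\mathcal{Y}$.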
Our unknown-aware object detection framework trains an object detector in tandem with the OOD uncertainty regularization branch. Both share the feature extractor and the prediction head and are jointly trained from scratch (see Figure 2). Our framework encompasses two novel components, which address: (1) how to distill diverse unknown objects in the spatial and temporal dimensions (Section 3.1), and (2) how to leverage the unknown objects for effective model regularization (Section 3.2).
Our approach STUD distills unknown objects guided by the rich spatial-temporal information in videos, without explicit supervision signals of unknown objects. Video data naturally encapsulates a mixture of both known and unknown objects. While classic object detection models primarily use the labeled known objects for training, we attempt to capitalize on the unknown ones for model regularization. For this reason, we term our approach unknown distillation: extracting unknown objects w.r.t. the known objects. Notably, our distillation process for object detection is performed at the object level, in contrast to constructing image-level outliers [hendrycks2018deep]. That is, for every ID object in a given frame, we construct a corresponding OOD counterpart. The distilled unknowns will be used for model regularization (Section 3.2).
While the intuition is straightforward, challenges arise in constructing unknown objects in an unsupervised manner. Unknown objects can emerge in innumerable ways in a high-dimensional space. Taking the ID object car as an example (cf. Figure 3), objects such as billboards, trees, and buildings can all be considered unknowns w.r.t. the car. This undesirably increases the sample complexity and demands that a diverse collection of unknown objects be observed. We tackle the challenge by distilling diverse unknown objects, leveraging the rich information in the spatial and temporal dimensions of videos.
In the spatial dimension, for each ID object in a given frame, we create the unknown counterpart through a linear combination of the object features from the reference frames, weighted by the dissimilarity measurement. Utilizing multiple objects captures a more diverse distribution of unknowns than using single ones. STUD operates on the feature outputs from the proposal generator to calculate dissimilarity. Specifically, we consider a pair of frames at timestamps $t_0$ and $t_1$, designated key frame and reference frame, respectively. For an object $i$ in frame $t$, we denote its feature representation as $\mathbf{o}_i^{t} \in \mathbb{R}^m$, where $m$ is the feature dimension. We collect the sets of object features $\{\mathbf{o}_i^{t_0}\}_{i=1}^{N_0}$ and $\{\mathbf{o}_j^{t_1}\}_{j=1}^{N_1}$ with objectness scores above a threshold. We adopt a dissimilarity measurement using the $L_2$ distance between two features:
$$d_{i,j} = \left\lVert \phi(\mathbf{o}_i^{t_0}) - \phi(\mathbf{o}_j^{t_1}) \right\rVert_2, \tag{1}$$
where $\phi(\mathbf{o}_i^{t_0})$ and $\phi(\mathbf{o}_j^{t_1})$ are encoded feature vectors obtained by a small network $\phi$ using the object features as input. In our experiments, the encoder consists of two convolutional layers with a kernel size of 3 × 3 and an average pooling layer. The larger $d_{i,j}$ is, the more dissimilar the object features are. The dissimilarity measurement results are illustrated in Figure 3. The OOD objects in the reference frame, such as street lights and billboards, have a larger dissimilarity.
Lastly, we perform a weighted average of the object features from frame $t_1$. Using multiple objects captures a diverse distribution of unknowns. The weights are defined as the normalized exponential of the dissimilarity scores:
$$\hat{\mathbf{o}}_i^{t_0} = \sum_{j=1}^{N_1} a_{i,j}\, \mathbf{o}_j^{t_1}, \qquad a_{i,j} = \frac{\exp(d_{i,j})}{\sum_{k=1}^{N_1} \exp(d_{i,k})}, \tag{2}$$
where $\hat{\mathbf{o}}_i^{t_0}$ is the distilled unknown object (in the feature space), corresponding to the $i$-th object at frame $t_0$.
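The spatial distillation step described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the released implementation: features are plain lists, the encoder is replaced by an identity function (`encode` is a hypothetical stand-in for the small convolutional encoder), and all function names are ours.

```python
import math


def l2_distance(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def normalized_exponential(scores):
    """Softmax over a list of scores, shifted by the maximum for stability."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def distill_unknown(key_feature, reference_features, encode=lambda f: f):
    """Distill one unknown feature for a single key-frame object.

    `encode` is a hypothetical stand-in for the small encoder network
    (identity by default). Reference objects that are more dissimilar to the
    key object receive larger weights in the average.
    """
    key_code = encode(key_feature)
    distances = [l2_distance(key_code, encode(ref)) for ref in reference_features]
    weights = normalized_exponential(distances)
    dim = len(reference_features[0])
    unknown = [
        sum(w * ref[k] for w, ref in zip(weights, reference_features))
        for k in range(dim)
    ]
    return unknown, weights


# Toy example: one near-duplicate of the key object and one distant object.
key = [0.0, 0.0]
refs = [[0.1, 0.0], [3.0, 4.0]]
unknown, weights = distill_unknown(key, refs)
print(weights, unknown)
```

In this toy case almost all of the weight falls on the distant reference object, so the distilled feature lies close to it. With many reference objects at comparable distances, the result is a blend of several objects rather than a copy of one.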
Our spatial unknown distillation mechanism operates on a single reference frame, which can be extended to multiple video frames to capture additional diversity of unknowns in the temporal dimension. For example, in a video of a car driving on the highway, the more frames we observe, the more unknown objects we encounter, such as trees, buildings, and rocks.
Given a frame at timestamp $t_0$, we propose distilling the unknown objects from multiple frames $t_1, \ldots, t_T$. We randomly sample $T$ frames within a range $[t_0 - R, t_0 + R]$. As a special case, $T = 1$ reduces to the previous pair-frame setting. To distill spatial-temporal unknown objects, we concatenate the object feature vectors from the $T$ frames, and then measure their dissimilarity w.r.t. the objects in frame $t_0$ by Equation (1). For the $i$-th object in frame $t_0$, the unknown counterpart is defined as follows:
$$\hat{\mathbf{o}}_i^{t_0} = \sum_{j=1}^{M} a_{i,j}\, \mathbf{o}_j, \tag{3}$$
where $a_{i,j}$ denotes the normalized dissimilarity scores defined in Equation (2), and $M$ is the total number of objects across the $T$ reference frames. The temporal aggregation mechanism allows searching through multiple frames for meaningful and diverse unknown discovery.
Later in Section 4.3, we provide comprehensive ablation studies on the frame sampling range $R$ and the number of selected frames $T$, and show the benefits of temporal aggregation for improved OOD detection.
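A small sketch of the frame sampling and feature pooling may help make the temporal aggregation concrete. It rests on several assumptions that the text does not state: frames are sampled without replacement, the key frame itself is excluded, and the window is clipped to the video boundaries. The function names and the toy values (three frames, a window of five) are illustrative only.

```python
import random


def sample_reference_frames(t0, num_frames, sampling_range, video_length, rng):
    """Pick distinct reference timestamps within +/- sampling_range of t0.

    Assumptions of this sketch: the key frame t0 is excluded, and the window
    is clipped to [0, video_length - 1].
    """
    low = max(0, t0 - sampling_range)
    high = min(video_length - 1, t0 + sampling_range)
    candidates = [t for t in range(low, high + 1) if t != t0]
    return sorted(rng.sample(candidates, min(num_frames, len(candidates))))


def pool_reference_features(features_per_frame, frame_ids):
    """Concatenate the object features of the sampled frames into one list."""
    pooled = []
    for t in frame_ids:
        pooled.extend(features_per_frame[t])
    return pooled


rng = random.Random(0)
# Toy video: 30 frames with two 2-D object features each.
features_per_frame = {t: [[float(t), 0.0], [float(t), 1.0]] for t in range(30)}
frame_ids = sample_reference_frames(
    t0=10, num_frames=3, sampling_range=5, video_length=30, rng=rng
)
pooled = pool_reference_features(features_per_frame, frame_ids)
print(frame_ids, len(pooled))
```

The pooled list plays the role of the reference set in the spatial step: the dissimilarity-weighted average is then taken over all pooled objects instead of the objects of a single frame.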
A critical step in unknown distillation is to filter unknowns in the reference frame that may be ID objects or simple background. Without selection, the model may struggle to separate the distilled unknown objects from the ID objects, or may quickly memorize the simple OOD pattern during training. To prevent this, we pre-filter the proposals based on the energy score, and then use the selected ones for the spatial-temporal unknown distillation. The energy score has been shown to be an effective indicator of OOD data in image classification [liu2020energy]. To calculate the energy score for the object detection network, we feed the object features $\mathbf{o}$ to the prediction head and follow the definition:
$$E(\mathbf{o}) = -\log \sum_{k=1}^{K} \exp\big(f_k(\mathbf{o}; \theta)\big), \tag{4}$$
where $f_k(\mathbf{o}; \theta)$ is the logit output of the $K$-way classification branch for class $k$. A higher energy indicates more OOD-ness and vice versa. Then, we select objects with mild energy scores, i.e., those in a specific percentile among all objects. In the case of multiple frames ($T > 1$), the object selection is performed on each individual frame before temporal aggregation. Ablation studies on the effect of the energy filtering and the selection percentile are provided in Section 4.3.
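The energy-based pre-filtering can be illustrated as follows. This is a sketch under assumptions: the energy is the negative log-sum-exp of the classification logits, "mild" is taken to mean the 40%-60% band reported as best in the ablation on filtering ratios, and ranking objects by ascending energy before slicing is our own choice of convention. The helper names are hypothetical.

```python
import math


def energy_score(logits):
    """Negative log-sum-exp of the logits, computed stably."""
    top = max(logits)
    return -(top + math.log(sum(math.exp(z - top) for z in logits)))


def select_mild_energy(objects_logits, low_pct=40.0, high_pct=60.0):
    """Return indices of objects whose energy rank falls in [low_pct, high_pct).

    Objects are ranked by ascending energy within one frame; the percentile
    band is converted to rank positions by truncation.
    """
    energies = [energy_score(logits) for logits in objects_logits]
    order = sorted(range(len(energies)), key=lambda i: energies[i])
    start = int(len(order) * low_pct / 100.0)
    stop = int(len(order) * high_pct / 100.0)
    return sorted(order[start:stop]), energies


# Toy frame with ten proposals; a larger first logit means a more confident
# prediction and therefore a lower energy.
frame_logits = [[float(i), 0.0, 0.0] for i in range(10)]
kept, energies = select_mild_energy(frame_logits)
print(kept)
```

Proposals with the lowest energies look like ID objects, and those with the highest energies tend to be easy background, so the sketch keeps only the middle band. When several reference frames are used, this selection would be applied per frame before pooling.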
Leveraging the distilled unknown objects from Section 3.1, we now introduce our training objective for unknown-aware object detection. Our key idea is to perform the object detection task while regularizing the model to produce a low uncertainty score for ID objects, and a high uncertainty score for the unknown ones. The overall objective function is defined as:
$$\min_{\theta} \; \mathcal{L}_{\text{det}} + \beta \cdot \mathcal{L}_{\text{uncertainty}}, \tag{5}$$
where $\beta$ is the scaling weight when combining the detection loss $\mathcal{L}_{\text{det}}$ and the uncertainty regularization loss $\mathcal{L}_{\text{uncertainty}}$. Next we describe the details of $\mathcal{L}_{\text{uncertainty}}$.
Following Du et al. [du2022vos], we employ a loss function that contrastively shapes the uncertainty surface, amplifying the separability between known ID objects and unknown OOD objects. To measure the uncertainty, we use the energy score in Equation (4), which is derived from the output of the classification branch. Here we calculate the energy score for the ID objects $\mathbf{o}$ and the distilled unknown object features $\hat{\mathbf{o}}$. The uncertainty score is then passed into a logistic regression classifier with weight coefficient $\theta_u$, which predicts a high probability for ID objects $\mathbf{o}$ and a low probability for the unknown ones $\hat{\mathbf{o}}$. The regularization loss is calculated as:
$$\mathcal{L}_{\text{uncertainty}} = \mathbb{E}_{\mathbf{o} \sim \mathcal{O}} \left[ -\log \frac{\exp(-\theta_u \cdot E(\mathbf{o}))}{1 + \exp(-\theta_u \cdot E(\mathbf{o}))} \right] + \mathbb{E}_{\hat{\mathbf{o}} \sim \hat{\mathcal{O}}} \left[ -\log \frac{1}{1 + \exp(-\theta_u \cdot E(\hat{\mathbf{o}}))} \right], \tag{6}$$
where $\mathcal{O}$ contains the ID object features and $\hat{\mathcal{O}}$ contains all the unknown object features (cf. Section 3.1). In Figure 4(a), we show the uncertainty regularization loss over the course of training on the Youtube-VIS dataset [DBLP:conf/iccv/YangFX19]. Upon convergence, Figure 4(b) shows the energy score distribution for both the ID and distilled unknown objects. This demonstrates that STUD converges properly and is able to separate the distilled unknown objects from the ID objects.
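The regularization loss can be read as an ordinary binary logistic loss applied to a single scalar input, the energy. The sketch below assumes one particular sign convention, namely that the ID probability is the sigmoid of the negated, scaled energy, which is consistent with "higher energy means more OOD" but is our reading rather than a confirmed detail. Here `weight` stands for the learnable coefficient of the logistic classifier and is passed in as a fixed number.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(1 / (1 + exp(-z)))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def uncertainty_loss(id_energies, unknown_energies, weight):
    """Binary logistic loss on energy scores.

    ID objects carry label 1 and distilled unknowns label 0. The ID
    probability is sigmoid(-weight * energy), so a low energy reads as
    "likely ID" (assumed sign convention).
    """
    id_term = sum(-log_sigmoid(-weight * e) for e in id_energies) / len(id_energies)
    unknown_term = sum(
        -log_sigmoid(weight * e) for e in unknown_energies
    ) / len(unknown_energies)
    return id_term + unknown_term


# Well-separated energies give a small loss; swapping the two groups does not.
separated = uncertainty_loss([-5.0, -4.0], [4.0, 5.0], weight=1.0)
swapped = uncertainty_loss([4.0, 5.0], [-5.0, -4.0], weight=1.0)
print(separated, swapped)
```

Because the classifier has a single input, its weight only rescales the energy axis: a larger magnitude gives a steeper transition between the ID and unknown regions.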
Compared to $\mathcal{L}_{\text{det}}$ for the vanilla object detector, our loss is intended to facilitate learning a more conservative decision boundary between ID and OOD objects, which helps flag unseen OOD objects in testing. We proceed by describing the test-time OOD detection procedure.
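During inference, we use the output of the logistic regression uncertainty branch for OOD detection. In particular, given a test input $\mathbf{x}^*$, the object detector produces a box prediction $\mathbf{b}^*$. The uncertainty score for the predicted object $(\mathbf{x}^*, \mathbf{b}^*)$ is given by:
$$p_\theta(g \mid \mathbf{x}^*, \mathbf{b}^*) = \frac{\exp(-\theta_u \cdot E(\mathbf{x}^*, \mathbf{b}^*))}{1 + \exp(-\theta_u \cdot E(\mathbf{x}^*, \mathbf{b}^*))}. \tag{7}$$
For OOD detection, we use the common thresholding mechanism to distinguish between ID and OOD objects:
$$G(\mathbf{x}^*, \mathbf{b}^*) = \begin{cases} 1 & \text{if } p_\theta(g \mid \mathbf{x}^*, \mathbf{b}^*) \geq \gamma, \\ 0 & \text{otherwise.} \end{cases} \tag{8}$$
The threshold $\gamma$ is typically chosen so that a high fraction of ID data (e.g., 95%) is correctly classified. For objects that are classified as ID, one can obtain the bounding box and class prediction using the prediction head as usual. Our approach STUD is summarized in Algorithm 1.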
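The test-time procedure amounts to scoring each detected object and comparing the score with a threshold calibrated on ID data. The following sketch assumes that the ID probability is the sigmoid of the negated, scaled energy and that the threshold is read off a sorted list of held-out ID scores; the helper names and toy energies are illustrative.

```python
import math


def id_probability(energy, weight):
    """sigmoid(-weight * energy): assumed probability that an object is ID."""
    z = -weight * energy
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def threshold_at_tpr(id_scores, tpr=0.95):
    """Largest threshold that keeps at least `tpr` of the ID scores at or above it."""
    ranked = sorted(id_scores, reverse=True)
    keep = math.ceil(tpr * len(ranked))
    return ranked[keep - 1]


def is_in_distribution(score, threshold):
    """Thresholding rule: 1 (ID) if the score reaches the threshold, else 0 (OOD)."""
    return score >= threshold


weight = 1.0
# Toy held-out ID energies, from -6.0 up to -4.1.
held_out_id_energies = [-6.0 + 0.1 * i for i in range(20)]
id_scores = [id_probability(e, weight) for e in held_out_id_energies]
gamma = threshold_at_tpr(id_scores, tpr=0.95)

# One confident ID-like detection and one high-energy detection.
decisions = [
    is_in_distribution(id_probability(e, weight), gamma) for e in (-5.5, 2.0)
]
print(gamma, decisions)
```

Detections flagged as OOD would simply be withheld or marked as unknown, while the remaining ones keep their usual box and class predictions.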
The two key components of STUD, unknown distillation (Section 3.1) and contrastive regularization (Section 3.2), work collaboratively. First, a set of well-distilled unknown objects may improve the energy-based contrastive regularization and help learn a more accurate decision boundary between known and unknown objects. Second, as the contrastive uncertainty loss amplifies the energy gap between known and unknown objects, the unknown distillation module can benefit from more accurate unknown object selection (via energy-based filtering). The entire training process converges when the two components perform satisfactorily. Our experiments in Section 4 further justify the efficacy of our framework.
In this section, we provide empirical evidence to show the effectiveness of STUD on two large-scale video datasets (Section 4.1). We show that STUD outperforms other commonly used OOD detection baselines on detecting OOD data in Section 4.2. Ablation studies of STUD and qualitative analysis are presented in Sections 4.3 and 4.4.
Table 1. OOD: MS-COCO / nuImages
| In-distribution | Method | FPR95 | AUROC | mAP (ID) | Cost (h) |
|---|---|---|---|---|---|
| BDD100K | MSP [hendrycks2016baseline] | 90.11 / 93.98 | 66.32 / 59.21 | 31.0 | 9.1 |
| | ODIN [liang2018enhancing] | 80.32 / 87.75 | 68.49 / 66.51 | 31.0 | 9.1 |
| | Mahalanobis [lee2018simple] | 63.06 / 79.02 | 79.95 / 68.94 | 31.0 | 9.1 |
| | Gram matrices [DBLP:conf/icml/SastryO20] | 68.78 / 82.60 | 66.13 / 71.56 | 31.0 | 9.1 |
| | Energy score [liu2020energy] | 78.36 / 86.02 | 73.75 / 67.08 | 31.0 | 9.1 |
| | Generalized ODIN [hsu2020generalized] | 75.99 / 92.15 | 78.63 / 67.23 | 30.9 | 10.5 |
| | CSI [tack2020csi] | 69.38 / 80.06 | 80.85 / 72.59 | 29.8 | 15.3 |
| | GAN-synthesis [lee2018training] | 67.95 / 88.53 | 78.33 / 66.50 | 30.1 | 14.6 |
| | STUD (ours) | 52.18±2.2 / 77.57±3.0 | 85.67±0.6 / 75.67±0.7 | 30.5±0.2 | 10.1 |
| Youtube-VIS | MSP [hendrycks2016baseline] | 90.17 / 94.52 | 70.26 / 54.59 | 24.8 | 9.2 |
| | ODIN [liang2018enhancing] | 87.17 / 97.69 | 71.46 / 57.46 | 24.8 | 9.2 |
| | Mahalanobis [lee2018simple] | 85.60 / 95.65 | 72.16 / 62.02 | 24.8 | 9.2 |
| | Gram matrices [DBLP:conf/icml/SastryO20] | 88.68 / 93.20 | 61.96 / 60.04 | 24.8 | 9.2 |
| | Energy score [liu2020energy] | 91.77 / 91.78 | 70.58 / 59.05 | 24.8 | 9.2 |
| | Generalized ODIN [hsu2020generalized] | 83.90 / 93.18 | 71.33 / 62.16 | 24.3 | 10.5 |
| | CSI [tack2020csi] | 80.21 / 84.85 | 73.89 / 68.84 | 23.3 | 15.7 |
| | GAN-synthesis [lee2018training] | 84.57 / 94.59 | 71.59 / 64.43 | 24.4 | 15.0 |
| | STUD (ours) | 79.82±0.2 / 76.93±0.4 | 75.55±0.3 / 71.48±0.6 | 24.5±0.3 | 10.2 |
Datasets. We use two large-scale video datasets as ID data: BDD100K [DBLP:conf/cvpr/YuCWXCLMD20] and Youtube-Video Instance Segmentation (Youtube-VIS) 2021 [DBLP:conf/iccv/YangFX19]. For both tasks, we evaluate on two OOD datasets containing diverse visual categories: MS-COCO [lin2014microsoft] and nuImages [DBLP:conf/cvpr/CaesarBLVLXKPBB20]. We perform careful deduplication to ensure there is no semantic overlap between the ID and OOD data. Extensive details on the datasets are described in the appendix.
We adopt Faster R-CNN [ren2015faster] as the base object detector. We use the Detectron2 library [Detectron2018] and train with a ResNet-50 backbone [DBLP:conf/cvpr/HeZRS16] and the default hyperparameters. The weight $\beta$ for $\mathcal{L}_{\text{uncertainty}}$ is set separately for each dataset (0.05 for BDD100K; cf. Table 4). Both datasets use the same number of reference frames $T$ and the same sampling range $R$. The energy filtering percentile is 40%-60% among all proposals (cf. Table 3). Ablation studies on different hyperparameters are detailed in Section 4.3.
For evaluating the OOD detection performance, we report: (1) the false positive rate (FPR95) of OOD samples when the true positive rate of ID samples is at 95%; (2) the area under the receiver operating characteristic curve (AUROC). For evaluating the object detection performance on the ID task, we report the common metric of mAP.
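Both OOD metrics can be computed directly from two lists of scores, one for ID objects and one for OOD objects. The sketch below assumes that a higher score means "more likely ID" and uses a simple quadratic-time pairwise definition of AUROC, which is fine for illustration but not for large evaluation sets. Function names and toy scores are ours.

```python
import math


def fpr_at_tpr(id_scores, ood_scores, tpr=0.95):
    """Fraction of OOD samples accepted as ID when `tpr` of ID samples are accepted."""
    ranked = sorted(id_scores, reverse=True)
    threshold = ranked[math.ceil(tpr * len(ranked)) - 1]
    return sum(s >= threshold for s in ood_scores) / len(ood_scores)


def auroc(id_scores, ood_scores):
    """Probability that a random ID sample outscores a random OOD sample.

    Ties count as half a win. This pairwise form is equivalent to the area
    under the ROC curve.
    """
    wins = 0.0
    for i in id_scores:
        for o in ood_scores:
            if i > o:
                wins += 1.0
            elif i == o:
                wins += 0.5
    return wins / (len(id_scores) * len(ood_scores))


id_scores = [0.9, 0.8, 0.7, 0.6]
ood_scores = [0.65, 0.2, 0.1, 0.05]
print(fpr_at_tpr(id_scores, ood_scores))  # 0.25
print(auroc(id_scores, ood_scores))       # 0.9375
```

In the toy example, keeping 95% of four ID samples requires accepting all of them, so the threshold is 0.6 and one of the four OOD samples slips through; fifteen of the sixteen ID-OOD pairs are ranked correctly.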
In Table 1, we compare STUD with competitive OOD detection methods in the literature, where STUD significantly outperforms baselines on both datasets. For a fair comparison, all the methods use the same ID training data and are trained for the same number of epochs. Our comprehensive baselines include Maximum Softmax Probability [hendrycks2016baseline], ODIN [liang2018enhancing], Mahalanobis distance [lee2018simple], Generalized ODIN [hsu2020generalized], energy score [liu2020energy], Gram matrices [DBLP:conf/icml/SastryO20], and a recent method, CSI [tack2020csi]. These baselines rely on the classification output or backbone feature, and therefore can be seamlessly evaluated on the object detection model.
The results show that STUD can outperform these baselines by a considerable margin because the majority of baselines rely on object detection models trained on ID data only, without being regularized by unknown objects. Such a training scheme is prone to produce overconfident predictions on OOD data (Figure 1) while STUD incorporates unknown objects to regularize the model more effectively.
We also compare with a GAN-based approach for synthesizing outliers in the pixel space [lee2018training]; relative to it, STUD improves the OOD detection performance (FPR95) by 15.77% on BDD100K (COCO as OOD) and 17.66% on Youtube-VIS (nuImages as OOD). Moreover, Table 1 shows that STUD achieves stronger OOD detection performance while preserving a high object detection accuracy on ID data (measured by mAP). This is in contrast with CSI, which displays significant degradation, with mAP decreasing by 1.2% on BDD100K and 1.5% on Youtube-VIS. Details of reproducing baselines are in Appendix Section D.
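The FPR95 margins quoted in the text follow directly from the entries of Table 1. A short check, with the values copied from the table and variable names of our own choosing:

```python
# FPR95 values (%) copied from Table 1; lower is better.
gan_bdd_coco, stud_bdd_coco = 67.95, 52.18       # BDD100K, COCO as OOD
gan_vis_nuimages, stud_vis_nuimages = 94.59, 76.93  # Youtube-VIS, nuImages as OOD
mahalanobis_bdd_coco = 63.06                     # best baseline on BDD100K / COCO

margin_vs_gan_bdd = round(gan_bdd_coco - stud_bdd_coco, 2)
margin_vs_gan_vis = round(gan_vis_nuimages - stud_vis_nuimages, 2)
margin_vs_best_baseline = round(mahalanobis_bdd_coco - stud_bdd_coco, 2)

print(margin_vs_gan_bdd)        # 15.77
print(margin_vs_gan_vis)        # 17.66
print(margin_vs_best_baseline)  # 10.88
```

These are absolute differences in percentage points, not relative reductions.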
This section provides comprehensive ablation studies to understand the efficacy of STUD. For consistency, all ablations are conducted on the BDD100K dataset, using ResNet-50 as the backbone. We refer readers to Appendix Section E for more ablations on using a different backbone architecture.
Table 2. COCO / nuImages as OOD
| Method | AUROC | mAP |
|---|---|---|
| Farthest object | 83.04 / 71.38 | 30.2 |
| Random object | 79.61 / 70.42 | 30.3 |
| Object with mild energy | 83.60 / 71.24 | 30.3 |
| Negative proposal [DBLP:journals/corr/abs-2103-02603] | 80.94 / 72.92 | 30.0 |
| GAN [lee2018training] | 78.33 / 66.50 | 30.1 |
| Mixup [DBLP:conf/iclr/ZhangCDL18] | 81.76 / 70.17 | 27.6 |
| Gaussian noise | 83.64 / 71.50 | 30.3 |
| STUD (ours) | 85.67 / 75.67 | 30.5 |
We compare STUD with three types of unknown distillation approaches, i.e., (I) using independent objects without spatial-temporal aggregation, (II) synthesizing unknowns in the pixel space, and (III) using noise as unknowns.
For type I, we utilize objects from the reference frame without aggregating multiple objects across spatial and temporal dimensions, a key difference from STUD. The unknown objects can be constructed by (1) using the object in the reference frame that has the largest dissimilarity, (2) using random objects, (3) using negative proposals as in [DBLP:journals/corr/abs-2103-02603], or (4) using objects with mild energy scores in the reference frame.
For type II, we consider GAN-based [lee2018training] and mixup-based [DBLP:conf/iclr/ZhangCDL18] methods; the latter interpolates ID objects in the pixel space for the reference frames.
For type III, we add fixed Gaussian noise to the ID objects to create unknown object features.
The results are summarized in Table 2, where STUD outperforms alternative approaches. Exploiting objects without spatial-temporal distillation (type I) is less effective than STUD, because the generated unknowns either lack diversity (e.g., using the object with the largest dissimilarity or mild energy) or are too simple to effectively regularize the decision boundary between ID and OOD (e.g., using negative or random objects). Synthesizing unknowns in the pixel space (type II) is either unstable (GAN) or harmful to the object detection performance (mixup). Lastly, Gaussian noise as unknowns (type III) is relatively simple and does not outperform STUD.
Table 3 investigates the importance of filtering unknown objects based on the energy score. We contrast performance by either removing the filtering, or using different filtering percentile (c.f. Section 3.1). Using the objects with a mild energy score in the reference frames performs the best. This strategy distills unknown objects with a proper difficulty level, which is effective during contrastive uncertainty regularization.
Table 3. COCO / nuImages as OOD
| Filtering | FPR95 | AUROC | mAP |
|---|---|---|---|
| w/o unknown filtering | 62.23 / 83.54 | 82.87 / 72.29 | 30.6 |
| w/ ratio 0%-20% | 61.41 / 82.33 | 83.66 / 74.86 | 30.2 |
| w/ ratio 20%-40% | 57.73 / 82.13 | 85.43 / 74.09 | 30.3 |
| w/ ratio 40%-60% | 52.18 / 77.57 | 85.67 / 75.67 | 30.5 |
| w/ ratio 60%-80% | 62.29 / 85.12 | 83.47 / 73.44 | 30.2 |
| w/ ratio 80%-100% | 65.86 / 88.47 | 82.46 / 72.50 | 30.3 |
Recall that our spatial-temporal unknown distillation requires concatenating objects from $T$ reference frames. We ablate the effect of randomly selecting frames within different temporal horizons w.r.t. the key frame, modulated by the sampling range $R$. The results with varying $R$ are shown in Figure 5 (a)-(b). We observe that OOD detection benefits from using reference frames that are mildly close to the key frame. The trend is consistent for both COCO and nuImages OOD datasets. A larger sampling range translates into more dissimilar scenes, resulting in relatively easier unknowns being distilled. When $R$ becomes infinity, STUD randomly samples frames from the entire video, where the distilled unknowns are much less effective and AUROC degrades significantly (from 85.67% to 80.35% on COCO).
We contrast performance under different numbers of reference frames $T$ and report the OOD detection results in Figure 5 (c)-(d). This ablation shows that STUD indeed benefits from aggregating objects from multiple frames across the temporal dimension. For example, on BDD100K, aggregating multiple reference frames improves AUROC by 5.24% (COCO as OOD) over a single reference frame. This highlights the importance of temporal distillation with multiple frames. However, increasing $T$ further hurts the OOD detection performance. We hypothesize this is because many redundant object features are used during unknown distillation.
Table 4 reports the OOD detection results as we vary the weight $\beta$ for $\mathcal{L}_{\text{uncertainty}}$. The model is evaluated on both COCO and nuImages datasets as OOD. The results suggest that a mild weight is desirable. In most cases, STUD outperforms the baseline OOD detection methods in Table 1 in terms of AUROC.
Table 4. COCO / nuImages as OOD
| Weight $\beta$ | FPR95 | AUROC | mAP |
|---|---|---|---|
| 0.03 | 63.52 / 86.18 | 83.49 / 70.70 | 30.4 |
| 0.04 | 59.52 / 84.01 | 84.03 / 72.09 | 30.3 |
| 0.05 | 52.18 / 77.57 | 85.67 / 75.67 | 30.5 |
| 0.06 | 57.37 / 85.53 | 84.59 / 72.60 | 30.2 |
| 0.07 | 55.03 / 84.43 | 84.18 / 71.21 | 30.2 |
We perform ablation on three alternatives for $\mathcal{L}_{\text{uncertainty}}$: (1) using the squared hinge loss [liu2020energy], (2) classifying the unknowns as an additional class in the classification branch, and (3) removing the weight $\theta_u$ in $\mathcal{L}_{\text{uncertainty}}$. The comparison is summarized in Table 5. Compared to the hinge loss, our logistic loss improves the AUROC by 11.35% (COCO as OOD). In addition, classifying the distilled unknowns as an additional class increases the difficulty of object classification and does not outperform STUD either. Moreover, the learnable weight $\theta_u$ modulates the slope of the logistic function, which allows learning a sharper binary decision boundary for optimal ID-OOD separation. This ablation demonstrates the superiority of the uncertainty loss employed by STUD.
Table 5. COCO / nuImages as OOD
| Loss | FPR95 | AUROC | mAP |
|---|---|---|---|
| STUD w/o $\theta_u$ | 64.06 / 85.31 | 83.15 / 69.67 | 30.1 |
| Hinge loss [liu2020energy] | 74.73 / 90.70 | 74.32 / 62.59 | 30.2 |
| K+1 class | 84.34 / 93.63 | 59.40 / 56.25 | 30.8 |
| STUD (ours) | 52.18 / 77.57 | 85.67 / 75.67 | 30.5 |
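For readers unfamiliar with the squared hinge alternative, the sketch below shows its general shape as used for energy-based training in [liu2020energy]: ID energies are pushed below one margin and unknown energies above another. The margin values and names here are placeholders chosen for illustration, not settings from this paper.

```python
def squared_hinge_energy_loss(id_energies, unknown_energies, margin_in, margin_out):
    """Squared hinge loss on energy scores with two fixed margins.

    ID energies above `margin_in` and unknown energies below `margin_out`
    are penalized quadratically; everything else contributes zero.
    """
    id_term = sum(
        max(0.0, e - margin_in) ** 2 for e in id_energies
    ) / len(id_energies)
    unknown_term = sum(
        max(0.0, margin_out - e) ** 2 for e in unknown_energies
    ) / len(unknown_energies)
    return id_term + unknown_term


# Placeholder margins and toy energies.
loss = squared_hinge_energy_loss(
    id_energies=[-8.0, -3.0],
    unknown_energies=[-6.0, -1.0],
    margin_in=-5.0,
    margin_out=-2.0,
)
print(loss)  # 10.0
```

The two margins must be chosen by hand, whereas the logistic formulation learns the scale of the decision boundary through a single coefficient; this is one plausible reason for the gap reported in Table 5.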
Here we further present qualitative analysis on the instance-level OOD detection results. In Figure 6, we visualize the predictions on several OOD images, using object detection models trained without distilled unknown objects (top) and with STUD (bottom). The ID data is BDD100K. STUD performs better in identifying OOD objects (in green) than a vanilla object detector and reduces false positives among detected objects. Moreover, the confidence score of the false-positive objects of STUD is lower than that of the vanilla model (e.g., rocks in the 3rd column).
OOD detection for classification can be broadly categorized into post hoc, generative-based and outlier exposure (OE)-based approaches [yang2021oodsurvey]. For post hoc methods, the softmax confidence score is a common baseline [hendrycks2016baseline], which can be arbitrarily high for OOD inputs [hein2019relu]. Several improvements have been proposed, such as ODIN [liang2018enhancing, hsu2020generalized], Mahalanobis [lee2018simple], energy score [liu2020energy, wang2021canmulti], Gram matrices score [DBLP:conf/icml/SastryO20] and GradNorm score [huang2021importance]. Outlier exposure methods exploited regularization using natural images [mohseni2020self, DBLP:journals/corr/abs-2106-03917, hendrycks2018deep, DBLP:conf/nips/DhamijaGB18, DBLP:conf/cvpr/LiV20, DBLP:journals/corr/abs-2108-11941] or images synthesized by GANs [lee2018training]. However, real outlier data is difficult to obtain, especially for object detection. In contrast, STUD automatically distills unknowns from videos which allows greater flexibility. Generative models directly estimate the ID density [DBLP:conf/nips/SchirrmeisterZB20, DBLP:conf/nips/KingmaD18, van2016conditional], which makes them natural alternatives for OOD detection. However, they are in general less competitive compared to discriminative-based methods and typically harder to optimize [DBLP:conf/iclr/HinzHW19, kirichenko2020normalizing, DBLP:conf/iclr/NalisnickMTGL19, DBLP:conf/nips/RenLFSPDDL19, DBLP:conf/iclr/SerraAGSNL20, xiao2020likelihood]. Very recently, Sun et al. [sun2021react] showed that a simple activation rectification strategy termed ReAct can significantly improve test-time OOD detection. Theoretical understandings on different post-hoc OOD detection methods are provided in [morteza2022provable]. [tack2020csi, sehwag2021ssd]
applied self-supervised learning for OOD detection, which we compare against in Section 4.2.
OOD detection for object detection is currently underexplored. Du et al. [du2022vos] proposed to synthesize virtual outliers in the feature space for effective model regularization, and demonstrated promise on OOD detection for object detection. In this paper, STUD focuses on OOD detection with the help of videos and adopts an unknown-aware training loss. Moreover, [DBLP:journals/corr/abs-2103-02603] used the negative objects as the unknown samples, which is suboptimal as we show in Table 2. Harakeh et al. [DBLP:journals/corr/abs-2101-05036] focused on uncertainty estimation for the localization branch, rather than OOD detection for classification problem. Several works [DBLP:conf/wacv/DhamijaGVB20, DBLP:conf/icra/MillerDMS19, DBLP:conf/icra/MillerNDS18, DBLP:conf/wacv/0003DSZMCCAS20, DBLP:journals/corr/abs-2108-03614] used approximate Bayesian methods, such as MC-Dropout [gal2016dropout] for OOD detection. They require multiple inference passes to generate the uncertainty score, which are computationally expensive on larger datasets and models.
Open-world object detection includes OOD generalization [DBLP:journals/corr/abs-2108-06753], zero-shot object detection [DBLP:journals/corr/abs-2104-13921, DBLP:journals/ijcv/RahmanKP20] and incremental object detection [DBLP:journals/corr/abs-2002-05347, DBLP:conf/cvpr/Perez-RuaZHX20], etc. Generally they developed measures to mitigate catastrophic forgetting [DBLP:journals/corr/abs-2003-08798] or used auxiliary information [DBLP:journals/ijcv/RahmanKP20], such as class attributes, to perform object detection on unseen data–both differing from our focus of OOD detection. Wang et al. [DBLP:journals/corr/abs-2104-08381] adopted dissimilarity measurement in the cycle forward step, but their focus is OOD generalization (label space remains the same) rather than OOD detection. Additionally, it did not consider aggregating temporal information from multiple frames.
Video anomaly detection (VAD) aims to identify anomalous events on both the object level [DBLP:conf/cvpr/IonescuKG019, DBLP:conf/cvpr/DoshiY20b, DBLP:conf/mm/YuWCZXYK20] and frame level [DBLP:conf/cvpr/LiuLLG18, DBLP:conf/cvpr/MehranOS09, DBLP:conf/wacv/RavanbakhshNMSS18] by techniques such as skeleton trajectory modeling [DBLP:conf/cvpr/MoraisL0SMV19], weakly supervised learning [DBLP:conf/eccv/ZaheerMAL20], attention [DBLP:conf/cvpr/ParkNH20], temporal pose graph [DBLP:conf/cvpr/MarkovitzSFZA20], self-supervised learning [DBLP:conf/cvpr/GeorgescuBIKPS21] and autoencoders [DBLP:conf/eccv/ChangTXY20]. Compared with STUD, the anomalies in VAD do not necessarily have different semantics from the ID training data. Moreover, none of the approaches considered synthesizing unknowns with the help of videos or energy-based model regularization.
In this paper, we propose STUD, an unknown-aware object detection framework for OOD detection. STUD distills diverse unknown objects during training by exploiting the rich spatial-temporal information from videos. The distilled unknowns meaningfully improve the decision boundary between the ID and OOD data, resulting in state-of-the-art OOD detection performance while preserving the performance of the ID task. We hope our work will inspire future research towards unknown-aware deep learning in real-world settings.
Research is supported by Wisconsin Alumni Research Foundation (WARF), Facebook Research Award, and funding from Google Research.
We summarize the OOD detection evaluation tasks in Table 6. The OOD test datasets are selected from MS-COCO and nuImages, and contain labels disjoint from those of the respective ID dataset. For the Youtube-VIS dataset, we use the version released in 2021. Since no ground truth labels are available for the validation images, we select the last 597 videos in the training set as the in-distribution evaluation dataset. The remaining 2,388 videos are used for training. The BDD100K and Youtube-VIS models are both trained for a total of 52,500 iterations. See detailed ablations on the hyperparameters in Section 4.3 of the main paper.
| ||Task 1||Task 2|
|ID train dataset||BDD100K train||Youtube-VIS train|
|ID val dataset||BDD100K val||Youtube-VIS val|
|OOD dataset||COCO / nuImages||COCO / nuImages|
|ID train images||273,406||67,861|
|ID val images||39,973||21,889|
|OOD images from COCO||1,914||28,922|
|OOD images from nuImages||2,100||2,100|
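The Youtube-VIS hold-out described above can be sketched as follows. This is a minimal illustration, not the released preprocessing code: the helper name `split_videos`, the placeholder video identifiers, and the assumption that "last" refers to the order in which videos are listed in the training annotations are all hypothetical.

```python
def split_videos(video_ids, num_eval=597):
    """Hold out the last `num_eval` videos as the ID evaluation set.

    `video_ids` is assumed to be in the order of the training
    annotation file (an assumption; the ordering is not specified here).
    """
    if not 0 < num_eval < len(video_ids):
        raise ValueError("num_eval must be between 1 and len(video_ids) - 1")
    return video_ids[:-num_eval], video_ids[-num_eval:]


# Placeholder identifiers standing in for the 2,388 + 597 training videos.
video_ids = [f"video_{i:04d}" for i in range(2388 + 597)]
train_ids, eval_ids = split_videos(video_ids)
```

With the counts stated above, this yields 2,388 training videos and 597 evaluation videos with no overlap.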
We provide a detailed description of the in-distribution classes for the two video datasets as follows.
The BDD100K dataset contains 8 classes, which are pedestrian, rider, car, truck, bus, train, motorcycle, bicycle.
The Youtube-VIS dataset contains 40 classes, which are airplane, bear, bird, boat, car, cat, cow, deer, dog, duck, earless_seal, elephant, fish, flying_disc, fox, frog, giant_panda, giraffe, horse, leopard, lizard, monkey, motorbike, mouse, parrot, person, rabbit, shark, skateboard, snake, snowboard, squirrel, surfboard, tennis_racket, tiger, train, truck, turtle, whale, zebra.
We run all experiments with Python 3.8.5 and PyTorch 1.7.0, using NVIDIA GeForce RTX 2080Ti GPUs.
To evaluate the baselines, we follow the original methods in MSP [hendrycks2016baseline], ODIN [liang2018enhancing], Generalized ODIN [hsu2020generalized], Mahalanobis distance [lee2018simple], CSI [tack2020csi], energy score [liu2020energy] and Gram matrices [DBLP:conf/icml/SastryO20], and apply them on the classification branch of the object detectors. For ODIN [liang2018enhancing], the temperature is set following the original work. For both ODIN and Mahalanobis distance [lee2018simple], the noise magnitude is set to 0 because the region-based object detector is not end-to-end differentiable given the existence of region cropping and ROIAlign. For GAN [lee2018training], we follow the original paper and use a GAN to generate OOD images. The prediction on the OOD images/objects is regularized to be close to a uniform distribution through a KL divergence loss with a weight of 0.05. We set the shape of the generated images to 100×100 and resize them to the same shape as the real images. We optimize the generator and discriminator using the Adam optimizer [DBLP:journals/corr/KingmaB14] with a learning rate of 0.001. For CSI [tack2020csi], we use rotations (0°, 90°, 180°, 270°) as the self-supervision task. We set the temperature in the contrastive loss to 0.5. We use the features right before the classification branch (with dimension 1024) to perform contrastive learning. The weights of the losses used for classifying shifted instances and for instance discrimination are both set to 0.1 to prevent training collapse. For Generalized ODIN [hsu2020generalized], we replace the classification head of the object detector with the most effective head reported in the original paper, Deconf-C, and train it.
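To make the logit-based baselines concrete, the sketch below computes three scores from the classification logits of a single detected object, together with a KL-to-uniform regularizer of the kind used for the GAN baseline. It is an illustration in plain Python rather than the detector code: the function names, the example logits, the temperature of 1000, and the KL direction (uniform to prediction) are assumptions, and ODIN is shown only in its temperature-scaled form, i.e. without input perturbation.

```python
import math

KL_WEIGHT = 0.05  # weight of the KL-to-uniform loss stated in the text


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    log_norm = peak + math.log(sum(math.exp(s - peak) for s in scaled))
    return [s - log_norm for s in scaled]


def msp_score(logits):
    """Maximum softmax probability; higher means more ID-like."""
    return math.exp(max(log_softmax(logits)))


def odin_score_without_perturbation(logits, temperature):
    """Temperature-scaled maximum softmax probability (no input noise)."""
    return math.exp(max(log_softmax(logits, temperature)))


def negative_energy_score(logits, temperature=1.0):
    """Negative free energy, T * logsumexp(logits / T); higher is more ID-like."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    return temperature * (peak + math.log(sum(math.exp(s - peak) for s in scaled)))


def kl_uniform_to_prediction(logits):
    """KL(U || softmax(logits)); zero when the prediction is uniform."""
    log_probs = log_softmax(logits)
    num_classes = len(log_probs)
    log_uniform = -math.log(num_classes)
    return sum((log_uniform - lp) / num_classes for lp in log_probs)


confident = [10.0, 0.0, 0.0]  # hypothetical logits for a confident prediction
flat = [1.0, 1.0, 1.0]        # hypothetical logits for a uniform prediction

ood_regularizer = KL_WEIGHT * kl_uniform_to_prediction(confident)
odin = odin_score_without_perturbation(confident, temperature=1000.0)
```

All three scores are oriented so that larger values indicate in-distribution objects, which lets a single threshold-based evaluation be reused across methods.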
|OOD: MS-COCO / nuImages|
|ID dataset||Method||FPR95 ↓||AUROC ↑||mAP (ID) ↑|
|BDD100K||MSP [hendrycks2016baseline]||80.09 / 93.05||74.19 / 63.14||32.0|
| ||ODIN [liang2018enhancing]||64.74 / 82.08||77.65 / 67.09||32.0|
| ||Mahalanobis [lee2018simple]||54.02 / 79.85||82.38 / 75.48||32.0|
| ||Gram matrices [DBLP:conf/icml/SastryO20]||63.96 / 63.61||67.56 / 67.47||32.0|
| ||Energy score [liu2020energy]||64.79 / 81.62||78.78 / 69.43||32.0|
| ||Generalized ODIN [hsu2020generalized]||60.76 / 82.00||80.14 / 70.74||32.5|
| ||CSI [tack2020csi]||52.98 / 80.00||83.57 / 74.91||31.8|
| ||GAN-synthesis [lee2018training]||58.35 / 83.65||81.43 / 70.39||31.5|
| ||STUD (ours)||52.51 / 79.75||84.03 / 76.55||32.3|
|Youtube-VIS||MSP [hendrycks2016baseline]||89.86 / 97.42||67.04 / 54.02||26.7|
| ||ODIN [liang2018enhancing]||89.28 / 96.30||67.54 / 60.82||26.7|
| ||Mahalanobis [lee2018simple]||90.00 / 94.44||70.47 / 54.83||26.7|
| ||Gram matrices [DBLP:conf/icml/SastryO20]||87.64 / 91.25||69.76 / 61.43||26.7|
| ||Energy score [liu2020energy]||88.54 / 90.21||67.83 / 58.02||26.7|
| ||Generalized ODIN [hsu2020generalized]||85.15 / 98.00||71.57 / 64.23||27.3|
| ||CSI [tack2020csi]||82.43 / 88.61||71.81 / 54.00||24.2|
| ||GAN-synthesis [lee2018training]||85.75 / 93.75||72.95 / 56.94||25.5|
| ||STUD (ours)||81.14 / 80.77||74.82 / 69.52||27.2|
In this section, we evaluate the proposed STUD using a different backbone architecture for Faster-RCNN, namely RegNetX-4.0GF [DBLP:conf/cvpr/RadosavovicKGHD20]. We compare with the same set of OOD detection baselines as in the main paper. The results are shown in Table 7.
Table 7 demonstrates that STUD remains effective with an alternative neural network architecture. In particular, with RegNet [DBLP:conf/cvpr/RadosavovicKGHD20] as the backbone, STUD yields better OOD detection performance compared with the baselines. Moreover, STUD achieves stronger OOD detection performance while preserving or even slightly increasing the object detection accuracy on ID data (measured by mAP). This is in contrast with CSI, which displays significant degradation, with mAP dropping from 26.7 to 24.2 on Youtube-VIS.
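OOD detection performance of this kind is commonly summarized by FPR95 (the fraction of OOD objects still accepted when the threshold retains 95% of ID objects) and AUROC. The sketch below shows one way to compute both from per-object scores, assuming higher scores mean more ID-like. The function names and the toy scores are hypothetical, and the threshold convention (the lowest score among the retained ID objects) is one of several reasonable choices.

```python
import math


def fpr_at_tpr(id_scores, ood_scores, tpr=0.95):
    """False positive rate on OOD data when `tpr` of ID data is retained."""
    sorted_id = sorted(id_scores, reverse=True)
    num_kept = math.ceil(tpr * len(sorted_id))
    threshold = sorted_id[num_kept - 1]  # lowest score among kept ID objects
    false_positives = sum(1 for s in ood_scores if s >= threshold)
    return false_positives / len(ood_scores)


def auroc(id_scores, ood_scores):
    """Probability that a random ID score exceeds a random OOD score.

    Ties count as one half. The pairwise loop is quadratic, which is
    fine for a sketch but slow for large evaluation sets.
    """
    wins = 0.0
    for id_score in id_scores:
        for ood_score in ood_scores:
            if id_score > ood_score:
                wins += 1.0
            elif id_score == ood_score:
                wins += 0.5
    return wins / (len(id_scores) * len(ood_scores))


# Toy scores: one OOD object (0.65) outscores one ID object (0.6).
id_scores = [0.9, 0.8, 0.7, 0.6]
ood_scores = [0.65, 0.3, 0.2, 0.1]

fpr95 = fpr_at_tpr(id_scores, ood_scores)
area = auroc(id_scores, ood_scores)
```

Lower FPR95 and higher AUROC both indicate better separation between ID and OOD objects.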