A Python library to evaluate mean Average Precision (mAP) for object detection. Provides the same output as PASCAL VOC's MATLAB code.
Average precision (AP) is a widely used metric to evaluate detection accuracy of image and video object detectors. In this paper, we analyze object detection from videos and point out that AP alone is not sufficient to capture the temporal nature of video object detection. To tackle this problem, we propose a comprehensive metric, average delay (AD), to measure and compare detection delay. To facilitate delay evaluation, we carefully select a subset of ImageNet VID, which we name ImageNet VIDT, with an emphasis on complex trajectories. By extensively evaluating a wide range of detectors on VIDT, we show that most methods drastically increase the detection delay but still preserve AP well. In other words, AP is not sensitive enough to reflect the temporal characteristics of a video object detector. Our results suggest that video object detection methods should be additionally evaluated with a delay metric, particularly for latency-critical applications such as autonomous vehicle perception.
There is a growing interest in video object detection. Many real-world applications, such as surveillance analysis and autonomous driving, deal with video streams. Several single-image object detection algorithms have been proposed in the past few years [5, 20, 30], but they are compute-intensive to run on a full-resolution video stream. Exploiting temporal information is therefore an important direction to improve the accuracy-cost trade-off [12, 19, 33].
Prior research suffers from the lack of densely annotated video datasets. KITTI is a dataset targeting autonomous driving that provides frame-level bounding box annotations. However, it is relatively small compared with other large-scale datasets for training deep neural networks. Since the introduction of the object detection from video (VID) challenge, more research focus has been drawn to the study of video object detection algorithms.
There are two general goals of video object detection: improving detection accuracy [2, 8, 12, 37] and reducing computational cost [4, 24, 38]. Currently, the accuracy of proposed detection algorithms is mostly evaluated with average precision (AP) or mean average precision (mAP), the average of APs over all classes [6, 9, 18]. Video object detection benchmarks like VID also adopt mAP, where every frame is treated as an individual image for evaluation. However, such an evaluation metric ignores the temporal nature of videos and fails to capture the dynamics of detection results; e.g., a detector that detects only the latter half of an instance's occurrences holds the same mAP as a detector that detects every other frame. As indicated in later experiments, video detectors tend to demonstrate different temporal behaviors compared to their single-image counterparts.
We introduce average delay (AD), a new detection delay metric. Measuring video object detection delay seems trivial, as the delay can be simply defined as the number of frames from when an object appears to when it is detected. However, to avoid the case where an algorithm trivially detects every bounding box in an image, a false alarm rate constraint is necessary. AD also needs to be designed to be comprehensive like AP, so that the delays at different false alarm rates can be combined. We discuss our design rationale in Section 3.
Most video snippets in VID contain fixed numbers of instances (typically only one), which is not suitable for delay evaluation. We therefore select a portion of the validation set of VID and name it VID with multiple tracklets (VIDT). Details of the new VIDT dataset are described in Section 4. With VIDT we then evaluate the AD of a wide range of recently proposed video detection algorithms in Section 5. A general trend is shown in Figure 1: some computation-reducing methods [24, 38] preserve mAP well but increase AD, while alternative methods leverage temporal information to improve detection accuracy but worsen detection delay. Our results suggest that video object detection methods should be evaluated with a delay metric, particularly for latency-critical applications such as autonomous vehicle perception.
To our knowledge, this is the first work that raises and compares detection delay, a highly critical but usually ignored issue, for video object detection. We propose a comprehensive evaluation metric, AD, to measure and compare video object detection delay (code available at https://github.com/RalphMao/VMetrics). By evaluating a variety of video object detection algorithms, we analyze the key factors for detection delay and provide guidance for future algorithm design.
Video object detection performs a similar task as image object detection, except that the former is carried on a video stream. Densely annotated videos, which are costly to obtain, are typically required to train a video object detector. The ImageNet VID challenge greatly advances the research progress in the field of video object detection, and provides a large frame-by-frame annotated dataset that covers a wide range of scenarios.
Various methods have since been proposed and evaluated on the VID dataset. The goal of video object detection is to reduce the computational cost or to refine the detection results by exploiting the temporal dimension of videos. For instance, deep feature flow (DFF), detect or track (DorT), CaTDet, and spatiotemporal sampling networks fall into the first category, while T-CNN, detect to track (DtoT), and LSTM-aided SSD belong to the second category. These methods are typically variants of well-studied image object detection algorithms such as R-FCN, Faster R-CNN, SSD Multibox, and RetinaNet.
As required in the VID challenge, the performance of a video object detector is solely evaluated by mAP, the metric for still image object detection [6, 7, 18]. When evaluating mAP, every single frame of a video is treated as an individual image. In such a way, the quality of a detector over the whole video sequence is measured and compared.
Low latency is a common requirement for many video-related applications. For example, autonomous driving typically requires less than 100 ms latency. Detecting an object with minimum delay is desired, and a detection after a certain time is no longer useful.
In previous research, the term latency mostly refers to computational latency only [1, 26]. However, we argue that the overall latency equals computational latency plus algorithmic delay, where the latter is the time taken in a video stream for an algorithm to finally determine the existence of an object. Computational latency has been extensively studied in recent works [11, 25, 36], while algorithmic delay remains less explored in the object detection field. In other fields, such as activity detection, there have been efforts to study early detection.
Quickest change detection (QCD) is a well-studied problem in statistical signal processing. It refers to real-time detection of abrupt changes in the behavior of an observed signal or time series as quickly as possible. Generally, the delay is measured under a certain false-alarm constraint. Lao et al. targeted the problem of moving object detection at minimum delay; they formulated the task under the QCD framework and gave an optimal solution for the single-object case.
The Numenta Anomaly Benchmark (NAB) is a benchmark for real-time anomaly detection in time-series data. The authors pointed out that traditional scoring methods such as precision and recall do not suffice, as they cannot effectively test anomaly detection algorithms for real-time use. To reward early detection, they define anomaly windows: inside the window, true positive detections are scored by a sigmoid function, and outside the window all detections are ignored.
Early detection has also been studied for activity recognition. This task typically requires accumulating enough frames to make a decision; to alleviate this issue, a special loss function was proposed to encourage early detection of an activity.
All of the works above essentially deal with the single-object or single-signal case. CaTDet introduced a delay metric to measure detection delay for multiple objects; however, the delay is evaluated at one specific precision only to counter false alarms.
In this section, we present our definition of average delay (AD), the evaluation metric for video object detection delay. Our metric is designed to incorporate fairness and comprehensiveness. Fairness: AD considers the trade-off between false positives and false negatives to avoid the case of reducing delay by detecting many false positives. Comprehensiveness: AD covers a wide range of operating conditions, analogous to AP.
We first explain the terminology used throughout this paper before delving into the detailed derivation. An instance is a physical object that appears in consecutive frames as a trajectory (or a tracklet). An object refers to a single occurrence of an instance in a frame. The ground truth of an object includes its bounding box coordinates, class label, and track identity. A detection is the recognition of an object in one frame with bounding box coordinates, class label, and confidence.
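The terminology above maps naturally onto simple record types. The sketch below is hypothetical (field names are ours, not taken from the paper's code) and only illustrates what each term carries:

```python
from dataclasses import dataclass

# Illustrative data structures mirroring the paper's terminology.
# Field names are assumptions, not the paper's actual code.

@dataclass
class GroundTruthObject:
    """One occurrence of an instance in a single frame."""
    frame: int        # frame index of this occurrence
    box: tuple        # (x1, y1, x2, y2) bounding box coordinates
    label: str        # class label
    track_id: int     # identity of the instance (tracklet)

@dataclass
class Detection:
    """A detector's recognition of an object in one frame."""
    frame: int
    box: tuple
    label: str
    confidence: float  # detector confidence score
```

An instance is then the set of `GroundTruthObject`s sharing one `track_id` across consecutive frames.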
The most intuitive definition of delay is the number of frames taken to detect an instance from the frame it first appears. Before reasoning about a comprehensive delay metric, we make this simple assumption: a detector detects every object at every frame with the same probability $p$.
Under this assumption, the delay $D$ follows the discrete exponential (geometric) distribution: $P(D = k) = p(1-p)^k$. Figure 2 exemplifies a histogram of the detection delays of R-FCN on VIDT. The actual distribution generally resembles the exponential distribution, apart from an anomalous region in the tail: there are substantially more instances than expected with extremely large delays, due to the existence of "hard instances". A detailed discussion of the delay statistics is given in Section 6, and hard examples are shown in Figure 10.
For the discrete exponential distribution, the expected value follows $\mathbb{E}[D] = (1-p)/p$. Thus, we can measure the quality of a detector by inferring the latent parameter $p$, given multiple observed delays $d_1, \dots, d_N$. With maximum likelihood estimation, we find that the maximum likelihood is achieved when the expected value matches the mean of the samples, $(1-\hat{p})/\hat{p} = \frac{1}{N}\sum_i d_i$. So the detection probability on each frame can be obtained by
$$\hat{p} = \frac{1}{1 + \frac{1}{N}\sum_i d_i}. \quad (1)$$
As aforementioned, the existence of the "heavy tail" results in a potential problem when we try to estimate $p$: different detectors may not be effectively differentiated if the heavy tail dominates the mean value. We thus adopt a simple strategy of clipping the delay samples with a constant value $W$, which we name the detection window. This is also a practical consideration, as for most latency-critical tasks a detection no longer matters once it falls outside a time window.
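Putting the two steps together, the per-frame detection probability can be estimated from window-clipped delays; a minimal sketch under the geometric-delay assumption (the window value of 30 frames matches the setting used later in the paper):

```python
def estimate_detection_prob(delays, window=30):
    """Maximum likelihood estimate of the per-frame detection
    probability p under a discrete exponential (geometric) delay
    model, with delays clipped to a detection window to suppress
    the heavy tail."""
    clipped = [min(d, window) for d in delays]
    mean_delay = sum(clipped) / len(clipped)
    # E[D] = (1 - p) / p  =>  p = 1 / (1 + mean delay)
    return 1.0 / (1.0 + mean_delay)
```

For example, a single 100-frame outlier among delays `[0, 1, 2, 100]` is clipped to 30 before estimation, so it cannot dominate the estimate.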
It is important to set a threshold for false alarms to ensure fair comparisons. In previous work, precision, defined as the number of true positives divided by the total number of detections, was selected as a threshold to counter false alarms, since an increased number of false alarms reduces precision. However, there are undesired outcomes if we set the same precision to compare different detectors.
We use a toy example in Figure 3 to illustrate that setting a precision threshold may cause the measured delay to behave differently from our expectation. Suppose the precision threshold is set to 0.6. In case 1, we must set a confidence threshold of 0.75 to meet the precision requirement. In case 2, due to the increased confidence score of the last detection, a confidence threshold of 0.35 is adequate. The resulting detection delays in these two cases are 2 and 0, respectively. By refining the later detections, the detection delay can be magically improved. Such behavior counters our intuition that delay should be a matter of the early detections.
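The pathology can be reproduced in code. The sketch below does not use Figure 3's exact numbers; the detections are illustrative, but they show the same effect: adding a refined *late* true positive lowers the feasible confidence threshold, which retroactively admits an earlier low-confidence detection and "improves" the delay:

```python
def min_delay_at_precision(dets, prec_target):
    """dets: list of (frame, confidence, is_true_positive).
    Over every confidence threshold that meets the precision
    target, return the smallest achievable delay (frame index
    of the earliest true positive kept), or None if infeasible."""
    best = None
    for thr in sorted({c for _, c, _ in dets}):
        kept = [d for d in dets if d[1] >= thr]
        tps = [d for d in kept if d[2]]
        if tps and len(tps) / len(kept) >= prec_target:
            delay = min(f for f, _, tp in kept if tp)
            if best is None or delay < best:
                best = delay
    return best

# Instance appears at frame 0; one early low-confidence TP,
# one later high-confidence TP, plus two false positives.
base = [(0, 0.30, True), (2, 0.80, True),
        (1, 0.50, False), (1, 0.40, False)]
# "Tail boost": one more refined late true positive.
boosted = base + [(3, 0.90, True)]
```

With a precision target of 0.6, `base` only satisfies the constraint at threshold 0.80 (delay 2), while `boosted` also satisfies it at threshold 0.30, admitting the frame-0 detection (delay 0).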
As a result, we argue that precision may not be the ideal threshold to counter false alarms. Instead, we propose to use the false positive (FP) ratio, the ratio between false positives and ground-truth objects. The FP ratio as a threshold is determined only by false positive detections, and therefore is not affected by additional true positives.
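The contrast between the two constraints is easy to state in code. A minimal sketch: precision moves when true positives are added, while the FP ratio, by construction, depends only on false positives and the fixed ground-truth count:

```python
def fp_ratio(num_false_positives, num_gt_objects):
    """False positive ratio: false positives divided by the number
    of ground-truth objects. Note it takes no true-positive count,
    so refining true positives cannot change it."""
    return num_false_positives / num_gt_objects

def precision(num_true_positives, num_false_positives):
    """Precision: true positives over all detections."""
    return num_true_positives / (num_true_positives + num_false_positives)
```

With 5 false positives on 100 ground-truth objects, the FP ratio is 0.05 regardless of whether the detector finds 5 or 15 true positives, whereas precision rises from 0.5 to 0.75.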
The last question is how to comprehensively measure the detection delay of a detector under different false-alarm constraints, similar to what AP does. AP is the integral, or the arithmetic mean, of precision over different recalls. Analogously, is it a good practice to average detection delays over different false positive ratios?
Considering real-world scenarios, a detector with zero delay is substantially better than one with a 1-frame delay, while a detector with a 14-frame delay does not differ significantly from one with a 15-frame delay. However, the arithmetic mean cannot distinguish the two cases.
We argue that averaging the latent parameter $p$, which represents the probability of detecting an object at each frame, is a better choice. Since $\hat{p}$ is the reciprocal of one plus the mean delay, it weighs smaller delays more heavily; in addition, it is bounded between 0 and 1. As a result, we average the inferred $p$ values of a detector under different false positive ratios, and derive the corresponding AD from the averaged $\bar{p}$:
$$\mathrm{AD} = \frac{1}{\bar{p}} - 1, \qquad \bar{p} = \frac{1}{M}\sum_{j=1}^{M} \hat{p}_j.$$
Notice that this definition has a form very similar to the harmonic mean.
In our following experiments, we set the detection window to 30 frames and select 6 FP ratios including 0.1, 0.2, 0.4, 0.8, 1.6 and 3.2.
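With these settings, the full AD computation can be sketched end to end. This assumes the definition described above (average the inferred detection probabilities across operating points, then convert back to a delay as $1/\bar{p} - 1$); the input format is our own assumption:

```python
def average_delay(delays_per_fp_ratio, window=30):
    """Average delay (AD) sketch. `delays_per_fp_ratio` maps each
    false-positive-ratio operating point to the list of observed
    per-instance delays measured at that point."""
    probs = []
    for delays in delays_per_fp_ratio.values():
        clipped = [min(d, window) for d in delays]
        mean_delay = sum(clipped) / len(clipped)
        probs.append(1.0 / (1.0 + mean_delay))  # Equation 1
    p_bar = sum(probs) / len(probs)             # average over FP ratios
    return 1.0 / p_bar - 1.0                    # convert back to a delay
```

Because probabilities (reciprocals of `1 + delay`) are averaged rather than the delays themselves, an operating point with a small delay pulls AD down more strongly than the arithmetic mean would.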
There are multiple public datasets for object detection, such as KITTI, ImageNet VID, YouTube-BB, BDD100K, and VIRAT. However, they all have drawbacks for delay evaluation. KITTI is a relatively small dataset, making it hard to train deep neural networks. Most video snippets in ImageNet VID contain fixed numbers of objects from beginning to end, which leaks a strong prior and makes it unsuitable for delay evaluation. YouTube-BB and BDD100K are both large-scale datasets with rich objects and scenarios, but they are sparsely annotated. VIRAT is a surveillance analysis dataset and has a fixed background.
An ideal dataset for delay evaluation should (i) be densely annotated (frame by frame); (ii) have random entry time for each instance (exclude videos with the same objects throughout the time); (iii) have random entry location for each instance (exclude videos with a fixed background and limited entry locations for new objects). In Figure 4, we show examples of ideal and non-ideal snippets in the validation set of ImageNet VID. The ideal snippet has multiple different instances entering the frames randomly over space and time, while in the non-ideal case, the same instance (which is a boat in the example) exists from the very first frame to the last.
We introduce VIDT, a subset of the validation set of VID, to meet the aforementioned requirements. Video snippets in VIDT have at least one instance entering at a non-first frame, which guarantees the randomness of entry time. VIDT largely relies on the annotated track identities in VID. A subtle difference is that in VIDT, once an instance disappears for more than 10 consecutive frames, its reappearance is marked as a new instance. One reason is that we do not care about re-identification ability, only the capability to detect as early as possible. In this way, the number of instances increases from 555 to 666.
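The re-labeling rule above (a disappearance of more than 10 consecutive frames starts a new instance) can be sketched as a simple tracklet splitter; the input format (sorted frame indices of one annotated track identity) is our assumption:

```python
def split_tracklets(frames, max_gap=10):
    """Split a sorted list of frame indices belonging to one
    annotated track identity into separate instances whenever
    the object is absent for more than `max_gap` consecutive
    frames (i.e. the index jump exceeds max_gap + 1)."""
    instances, current = [], [frames[0]]
    for f in frames[1:]:
        if f - current[-1] > max_gap + 1:
            instances.append(current)   # gap too long: new instance
            current = [f]
        else:
            current.append(f)
    instances.append(current)
    return instances
```

For example, a track seen at frames 0-2 and again at frames 14-15 has 11 missing frames in between, so it is split into two instances; a 10-frame absence keeps a single instance.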
We report the statistics of VIDT and compare them with the original VID and KITTI in Table 1. The ample training data of VID makes it feasible to train deep neural networks. Even though VIDT is smaller than the original validation set of VID, it still has many more frames and objects than KITTI with training and validation sets combined. However, a severe class imbalance problem exists in both VID and VIDT, as shown in Figure 5. Therefore, AD is not measured on each class separately; instead, all instances are treated in a class-agnostic way.
In this section, we demonstrate a common but mostly ignored problem in recent research of video object detection. Many detectors suffer from worse detection delay, even though they are able to preserve or even improve mean Average Precision.
We design several special cases to show the advantages of our proposed average delay metric over mAP, the NAB score, and the CaTDet delay. The NAB metric, originally designed for anomaly detection, can be modified to fit the object detection task; the modification is described in the Appendix.
Our comparison of the different metrics is achieved by manipulating the detection output and quantifying the impact on each metric. Retardation measures the sensitivity by suppressing the first few detections of a tracklet. A desirable delay metric should be worsened after retardation. Tail boost measures the fairness by elevating the confidence scores of lately detected objects. A fair delay metric should not be affected by tail boost. Multiple observations can be drawn from Table 2.
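The two manipulations can be sketched as transformations on a tracklet's detections (represented here, as an assumption, as `(frame, confidence)` pairs sorted by frame):

```python
def retardation(track_dets, n_first=3):
    """Suppress the first n detections of a tracklet.
    A sensitive delay metric should get worse under this."""
    return sorted(track_dets)[n_first:]

def tail_boost(track_dets, boost=0.2):
    """Raise the confidence of late detections (here: the second
    half of the tracklet). A fair delay metric should be
    unaffected, since early detections are untouched."""
    dets = sorted(track_dets)
    half = len(dets) // 2
    return dets[:half] + [(f, min(1.0, c + boost)) for f, c in dets[half:]]
```

Running a metric on the original and manipulated detections then quantifies its sensitivity (retardation) and fairness (tail boost), as in Table 2.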
For mAP, retardation makes little impact while tail boost greatly improves the result, in accordance with the number of affected detections.
Retardation worsens all three delay metrics. However, if only low-confidence detections are suppressed, NAB and CaTDet do not reflect the change, as both operate at a single confidence threshold. In contrast, AD evaluates multiple thresholds and is therefore robust enough to reflect the effect of retardation.
Tail boost improves both NAB and CaTDet while affecting AD only negligibly, indicating that AD is better than the other two metrics in terms of fairness.
(Number of affected detections in each case of Table 2: 0, 3076, 3616, and 71781.)
A range of recent works on video object detection employ the concept of the key frame [4, 10, 21, 38]. Key frames are sparsely distributed over the whole video sequence and typically require more computational resources than non-key frames. Key frames can be used to improve the detection accuracy or reduce the cost of non-key frames, through exploiting the temporal locality in videos.
We choose deep feature flow (DFF) as a representative key frame based algorithm. The basic idea is to compute features on key frames and propagate the features with optical flow on non-key frames. We vary the interval of key frames and show the impact on mAP and AD in Figure 6. Two R-FCN models are also reported for comparison: the full model is a standard R-FCN with ResNet-101, and the half model has the same architecture but is trained with only half the number of iterations.
Figure 6 shows that DFF tends to worsen AD. For example, the DFF model that adopts a key frame every 10 frames achieves an mAP of 0.613, much higher than the 0.567 mAP of the inferior R-FCN model; however, in terms of AD, the DFF model is slightly worse (11.6 vs. 11.2). This indicates that setting sparse key frames leads to delayed detection of new objects.
Another category of methods improves detection accuracy by aggregating features over time, either explicitly with optical flow or implicitly via recurrent neural networks.
We select flow-guided feature aggregation (FGFA) as an example and demonstrate how it may affect detection delay while improving mAP. FGFA aggregates the features of previous frames and resolves spatial mismatches by propagating features with optical flow. The open-source version of FGFA is based on R-FCN, so we also compare its mAP and AD in Figure 6. FGFA alone improves mAP from 0.642 to 0.675 while deteriorating AD from 9.0 to 10.2. We also observe a trend: the more frames aggregated, the better the mAP but the worse the AD.
FGFA substantially improves mAP compared with the original R-FCN, but worsens the detection delay. To explain this phenomenon, we select one instance with increased delay and plot the process of being detected in Figure 7, which shows the confidence score of the detection closest to the ground-truth object. In the case where no detection has an IoU over 50%, the confidence score is 0. The steady, progressive increase in FGFA's confidence, as shown in the figure, incurs extra detection delay, suggesting that for latency-critical tasks it is probably not a good choice to build up confidence slowly.
A cascaded detector consists of multiple components and tries to shift the workload from complex components to simple ones, following specific heuristics. Bolukbasi et al. proposed a selective execution model for object recognition, which is essentially a cascaded system. Further works explored the efficacy of cascaded systems for video object detection, including the scale-time lattice and CaTDet.
We adopt CaTDet as an example. CaTDet adds a tracker to the cascaded model to enable temporal feedback, which saves workload and improves accuracy. As shown in Figure 8, CaTDet models preserve mAP well but substantially increase detection delay compared with other Faster R-CNN models. The CaTDet model with an internal confidence threshold of 0.01 achieves an mAP of 0.555, very close to that of the Faster R-CNN model (0.561); however, it increases AD from 8.2 to 9.2.
In this section, we analyze the characteristics of video object detection delay on the VIDT dataset and aim to provide our insights into the AD metric.
In Section 3, we make an assumption that video object detection delay follows the discrete exponential distribution but with a heavy tail. Here we provide more examples and analysis to examine the actual distribution of delay.
We select three object detection methods, R-FCN, Faster R-CNN, and DFF R-FCN, and plot their delay distributions in Figure 9. All three distributions resemble the exponential distribution. Note that at the same confidence threshold Faster R-CNN has the smallest delay, so its delay distribution is more skewed to the left compared with the other two approaches.
We also show statistics measuring the "heavy-tail" effect in Table 3. The difference between the mean and the clipped mean shows that the long tail has a large impact on the mean value. Here we define the "expected off-window percentage" as the probability of the delay $D$ falling outside a detection window, where $D$ is assumed to follow the ideal discrete exponential distribution, estimated by maximum likelihood. Such a probability can be computed as $P(D \geq W) = (1-\hat{p})^W$, where $\hat{p}$ is obtained as in Equation 1 and $W$ is the window size. The higher percentages outside the window for all three detectors confirm that the tails are indeed "heavier" than those of the ideal exponential distributions. We select 6 of the 10 examples with the largest delays and illustrate them in Figure 10: these video objects have very low resolution, or are heavily truncated or largely occluded.
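The expected off-window percentage can be computed directly from the observed delays; a minimal sketch using the geometric fit from Equation 1:

```python
def expected_off_window_pct(delays, window=30):
    """Probability that the delay falls outside the detection
    window under the fitted ideal geometric distribution:
    P(D >= W) = (1 - p)^W, with p estimated from the
    window-clipped delays as in Equation 1."""
    clipped = [min(d, window) for d in delays]
    p = 1.0 / (1.0 + sum(clipped) / len(clipped))
    return (1.0 - p) ** window
```

Comparing this expected percentage with the empirical fraction of delays exceeding the window is exactly the heavy-tail check: an empirical fraction well above the expected one indicates a heavier-than-geometric tail.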
Due to the class imbalance in VIDT, AD is measured on all 666 instances instead of on individual classes, to avoid high variance. To demonstrate how delay varies across classes, we select the 5 classes with over 40 instances and compare their AD results in Figure 11. All three models present large delays for the class "Bird", whose instances are typically small and fast-moving. The classes "Car" and "Dog" have relatively smaller delays. For the class "Bicycle", R-FCN and Faster R-CNN show distinct delays.
To study how instance size affects detection delay, we divide all 666 instances into 3 categories by the averaged shorter dimension of their first 30 frames; Small, Medium, and Large instances are separated by fixed size thresholds. This criterion results in 129 small, 257 medium, and 280 large instances.
The anchor scale is the size of the reference bounding box in all major object detection algorithms, where 3 or 4 scales are common choices for image object detection. As shown in Table 4, increasing the number of anchor scales from 3 to 4 does not improve mAP. However, adding a small scale helps AD, in particular for instances with lower resolutions. This is probably because an instance is typically smaller when it first appears. The results with 5 scales show that adding even finer-grained scales does not help much.
Given that VIDT contains only a few hundred instances, the AD of various video object detectors evaluated on this dataset might be prone to high variance. Here we analyze whether our comparisons are reliable, i.e., whether the difference between the AD of different methods is significant compared to the variance. To test the conclusion that DFF and FGFA add extra delay over the baseline R-FCN model, we perform a 3-fold validation to verify whether the results correlate well across folds. In addition, we select a subset from ImageNet VID-2017 (recently published but not yet widely used in the community) and validate whether the same conclusion extends to a different dataset. The results, shown in Table 5, demonstrate good consistency across all folds and datasets.
(Table 5 columns: Fold 1, Fold 2, Fold 3, Overall, VID-2017.)
We have presented the average delay (AD) metric to measure and compare the detection delay of various video object detectors. Extensive experiments show that many detectors with decent detection accuracy suffer from increased delay, yet the widely used accuracy metric mAP by itself cannot reveal this deficiency. We hope our findings and the new AD metric will help the design and evaluation of future video object detectors for latency-critical tasks. We also hope that larger and more diverse video datasets will emerge in the future to better target the delay issue.
The NAB metric is originally designed for anomaly detection. It follows the simple idea of a detection window: inside the window, early detections are rewarded with higher scores, and outside the window, true positives are treated as false positives. The final score is the normalized sum of rewards from true positives and penalties from false positives and false negatives.
For video object detection, we do not treat detections outside the window as false positives but as "don't care". Our NAB metric for object detection is defined as a normalized sum of per-instance rewards, where the reward for each true positive depends on its start-to-detect time relative to the ground truth. The window size, the scoring-function parameters, and the confidence threshold are fixed in our experiments.