probdet
Code for "Estimating and Evaluating Regression Predictive Uncertainty in Deep Object Detectors." (ICLR 2021)
One of the challenging aspects of incorporating deep neural networks into robotic systems is the lack of uncertainty measures associated with their output predictions. Recent work has identified aleatoric and epistemic uncertainty as two types of uncertainty in the output of deep neural networks, and provided methods for their estimation. However, these methods have had limited success when applied to the object detection task. This paper introduces BayesOD, a Bayesian approach for estimating the uncertainty in the output of deep object detectors, which reformulates the neural network inference and Non-Maximum Suppression components of standard object detectors from a Bayesian perspective. As a result, BayesOD provides uncertainty estimates associated with detected object instances, allowing the deep object detector to be treated as any other sensor in a robotic system. BayesOD is shown to be capable of reliably identifying erroneous detection output instances using their estimated uncertainty measure. The estimated uncertainty measures are also shown to be better correlated with the correctness of a detection than those of state-of-the-art methods in the literature.
Uncertainty estimation for anchor-based deep object detectors.
Deep neural networks have arisen as the dominant method for the object detection problem, demonstrating near human level performance on both the 2D [1, 2, 3, 4] and 3D [5, 6, 7] object detection tasks. Due to their high level of performance, deep object detectors have become standard components of perception stacks for safety critical tasks such as autonomous driving [5, 6, 7] and automated surveillance [8]. Therefore, the quantification of how trustworthy these detectors are for subsequent modules, especially in safety critical systems, is of utmost importance. To encode the level of confidence in an estimate, a meaningful and consistent measure of uncertainty should be provided for every detection instance.
A meaningful uncertainty measure is defined as one that is discriminant enough to allow a robotic system to achieve two important goals. First, the robotic system should be capable of using the uncertainty measure to fuse a deep object detector’s output with prior information from different sources, effectively treating it as any other sensor [9]. As such, the estimated uncertainty measure should be negatively correlated to the correctness of the output of a detector. Second, the robotic system should be able to use the provided uncertainty measure to reliably identify incorrect estimates, including those resulting from unknown unknowns, where object categories, scenarios, textures, or environmental conditions have not been seen during the training phase [9] (see Fig. 1).
Two sources of uncertainty can be identified in any machine-learned model.
Epistemic or model uncertainty is the uncertainty in the model's parameters, usually a result of confusion about which model generated the training data; it can be explained away given enough representative training data points [10]. On the other hand, aleatoric or observation uncertainty results from the stochastic nature of the observed input, and persists in the network output despite expanded training on additional data [11]. Methods to estimate both uncertainty types in deep neural network models have recently been proposed [11], with applications in one-to-one perception tasks such as semantic segmentation or monocular depth regression. Object detectors usually output a large number of redundant detections [1, 2, 3, 4, 5, 6, 7], and as such, extending the proposed framework to object detection is not trivial. Multiple approaches to solve this problem have been proposed in the literature, ranging from solely considering epistemic uncertainty [13, 14] to proposing independent methods that tackle both uncertainties individually [15, 16]. These approaches do not tackle the incorporation of prior information using the proposed uncertainty measures, and are shown in Section IV to be unable to reliably identify incorrect object detections using their estimated bounding box uncertainty. To this end, this paper offers the following contributions:
BayesOD, a Bayesian approach for estimating the uncertainty in the output of deep object detectors, is proposed to estimate uncertainty measures associated with both category classification and bounding box regression tasks in standard object detectors.
When applied to RetinaNet [2] on the 2D object detection problem, BayesOD is shown to provide uncertainty measures with large gains in discriminative power over state-of-the-art methods in the literature when used to identify erroneous detections.
BayesOD is shown to be capable of efficiently incorporating object priors at multiple stages of the neural network inference process with closed form solutions.
The majority of state-of-the-art object detectors in 2D [1, 2, 3, 4] or in 3D [5, 6, 7] follow a standard procedure, which maps a scene representation to object instances. The object detection problem requires an object detector to provide an estimate of two states for every object instance in the scene: the category to which an object belongs, and the spatial location and extent of the object, often expressed as the tightest fitting bounding box. The bounding box state y is modeled as a random variable drawn from a multivariate Gaussian distribution N(μ, Σ), where μ is the distribution's mean and Σ is the distribution's covariance matrix. On the other hand, the category state c is modeled as being drawn from a Categorical (Multinoulli) distribution Cat(s), where sₖ describes the probability of the state being class k. Given an input scene representation, the number of object instances in the scene is unknown a priori, and as such, the neural network is usually provided with a densely sampled grid of prior object bounding boxes, referred to as anchors [1, 2] or default boxes [3]. Every anchor i is spatially associated with a portion of the input scene denoted xᵢ. The object detector is trained to output the parameters μ̂ᵢ, Σ̂ᵢ, and ŝᵢ of the conditional distributions for each individual anchor:
p(yᵢ | xᵢ, D) = N(yᵢ; μ̂ᵢ, Σ̂ᵢ),   p(cᵢ | xᵢ, D) = Cat(ŝᵢ)   (1)

where D is the training dataset and the parameters μ̂ᵢ, Σ̂ᵢ, and ŝᵢ are functions of the object detector's learned parameters θ. Since the anchor grid is densely sampled, many anchors may be associated with each object instance in the scene. The subsequent problem is to derive a single set of object states for each set of associated anchors through the joint distribution:
p(y, c | x̄, D),   x̄ = {x₁, …, xₘ}   (2)

where x̄ is the set of input scene portions associated with a single object instance. To solve Eq. (2) using the set of outputs from Eq. (1), post-processing via Non-Maximum Suppression (NMS) is used to eliminate redundancies. Greedy NMS [17] in particular assumes that the anchor with the highest individual category score within x̄ is the anchor with the highest joint probability in Eq. (2).
The described procedure is considered one of the main meta-architectures for object detection, and has been extensively studied by researchers in the field [18, 19]. However, it suffers from the shortcomings described in Section I. In Section III, it is shown that reinterpreting the neural network as a measurement device allows for a full Bayesian treatment of the object detection task, resulting in reliable state uncertainty estimates, as well as the ability to incorporate object priors into the detection framework.
To fully describe a Gaussian distribution associated with a regression output of a deep neural network, both of its sufficient statistics must be estimated by the model. Motivated by heteroscedastic regression [20], a loss attenuation formulation has been proposed [11] that estimates the variance of the regression output of a deep neural network model by modifying the regression training loss as follows:

Lᵢ(θ) = ‖yᵢ − ŷᵢ‖² / (2σ̂ᵢ²) + ½ log σ̂ᵢ²   (3)

where xᵢ is the input to and ŷᵢ the output from the neural network, yᵢ is the ground truth regression target, ‖·‖ is a norm, and σ̂ᵢ² is the estimated output variance. The total loss is then defined as the sample mean of the losses of the individual regression instances. The first term of Lᵢ serves as an intelligent robust regression loss, where the model is allowed to attenuate the effect of outliers in training examples by increasing their estimated variance. The second term acts as a regularizer, preventing the model from rejecting all training examples by always setting the variance to infinity. We refer the reader to [11] for further explanation of this loss function.
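The attenuated loss of Eq. (3) can be sketched in a few lines. The following is an illustrative NumPy implementation, assuming an L2 norm and a network that predicts log-variance (a common parameterization for numerical stability); it is not the authors' code:

```python
import numpy as np

def attenuated_loss(y_pred, log_var, y_true):
    """Loss attenuation sketch in the style of Eq. (3): the network predicts
    a log-variance alongside each regression output. Assumes an L2 norm and
    per-dimension variances (illustrative, not the paper's exact formulation)."""
    # First term: squared residual weighted by inverse variance, so the model
    # can attenuate outliers by predicting a larger variance for them.
    # Second term: log-variance regularizer, preventing the model from
    # setting the variance to infinity for every example.
    per_dim = 0.5 * np.exp(-log_var) * (y_true - y_pred) ** 2 + 0.5 * log_var
    # Total loss: sample mean over regression instances.
    return per_dim.sum(axis=-1).mean()
```

With `log_var = 0` (unit variance) the first term reduces to half the squared error, recovering a standard regression loss.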
To capture the uncertainty of the states y and c, the entropy of their associated probability distributions in Eq. (2) is computed using the distributions' sufficient statistics. High state entropy is positively correlated with state uncertainty, and entropy is commonly used as an uncertainty measure in the state of the art [13, 14]. To estimate the uncertainty in object detection results, [13] treats the deep object detector as a black box, with parameters that can be stochastically sampled through Monte-Carlo (MC) Dropout [10]. The output detections of multiple stochastic runs are then clustered, and the sufficient statistics of the state distributions in Eq. (2) for every object instance are directly estimated from the cluster members. The main advantage of this formulation lies in treating the underlying structure of the deep object detector as a black box, allowing it to be applied to various architectures with little effort. Later work [14] studied the effect of various merging algorithms on the quality of the estimated uncertainty measures from the black box method in [13]. In Section IV, the uncertainty estimated for the bounding box state y is shown to be of little discriminative power when used to reject erroneous detection outputs, mainly because the black box method observes the output after NMS.
Another way to estimate the uncertainty in object detection results is to directly apply the formulation in Eq. (3) to provide estimates for the covariance matrix of the bounding box state y. Examples of these sampling-free methods include [15, 16]; they are usually faster than black box methods, since a single run of the deep object detector suffices to estimate uncertainty. Sampling-free methods usually provide slightly better uncertainty estimates from the bounding box state distributions when applied to the identification of erroneous detections. However, in Section IV, these methods are shown to provide a lower-quality uncertainty estimate for the category state than methods utilizing MC-Dropout.
Finally, [16] proposes another method to estimate the uncertainty in deep object detectors, which exploits the redundancy in the output of the deep object detector before NMS to form spatially affiliated clusters of detection outputs, from which the sufficient statistics of both object state distributions in Eq. (2) can be estimated. However, when compared to black box and sampling-free methods, this redundancy-based method is shown to perform the worst in terms of average precision and uncertainty quality.
Unlike all methods described in this section, BayesOD allows the incorporation of object priors at multiple stages of the object detection framework. By replacing NMS with Bayesian inference, BayesOD also outperforms all methods described in this section in terms of the discriminative power of its estimated uncertainty measures (Section IV). The main steps of BayesOD are shown in Fig. 2. This section aims to describe the intuition and formalize the mathematical derivations involved in each of these steps. Note that throughout this section, outputs from the neural network are denoted with a hat (ˆ) operator, and per-anchor variables are indexed with i. Variables not indexed with an i represent accumulation over several anchors.
To capture the epistemic uncertainty of a deep object detection model, a prior distribution is imposed over its parameters θ to compute a posterior distribution p(θ | D) over the set of all possible parameters given the training data. A marginal distribution is then computed for every object state according to:

p(zᵢ | xᵢ, D) = ∫ p(zᵢ | xᵢ, θ) p(θ | D) dθ   (4)

where zᵢ is the output of the neural network, which can be either yᵢ or cᵢ for every anchor i.
A simple and computationally efficient Monte-Carlo sampling method, Monte-Carlo Dropout [10], allows drawing i.i.d. samples from Eq. (4) by performing neural network inference with dropout enabled. Using the drawn samples, the sufficient statistics of the Gaussian marginal probability distribution describing the estimated bounding box state can be derived as:

μ̂ᵢ = (1/T) Σₜ ŷᵢ,ₜ   (5)

Σ̂ᵢ,ₑ = (1/T) Σₜ (ŷᵢ,ₜ − μ̂ᵢ)(ŷᵢ,ₜ − μ̂ᵢ)ᵀ   (6)

where T is the number of times MC-Dropout sampling is performed, and ŷᵢ,ₜ is the bounding box regression output of the neural network for the t-th MC-Dropout run. The covariance matrix Σ̂ᵢ,ₑ captures the epistemic uncertainty in the estimated bounding box state yᵢ.
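Eqs. (5)–(7) reduce to sample statistics over the T stochastic runs. A minimal NumPy sketch for a single anchor (function and array names are illustrative, not from the paper's code):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def mc_dropout_stats(box_samples, logit_samples):
    """Sufficient statistics from T MC-Dropout runs for one anchor.
    box_samples: (T, 4) regressed boxes; logit_samples: (T, K) class logits.
    Returns the sample mean (Eq. 5), the epistemic sample covariance (Eq. 6),
    and the averaged softmax scores (Eq. 7)."""
    mu = box_samples.mean(axis=0)                         # Eq. (5)
    diff = box_samples - mu
    cov_epistemic = diff.T @ diff / box_samples.shape[0]  # Eq. (6)
    s = softmax(logit_samples).mean(axis=0)               # Eq. (7)
    return mu, cov_epistemic, s
```

Note that identical outputs across runs yield a zero epistemic covariance, consistent with dropout-induced disagreement being the source of the epistemic term.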
Since the neural network outputs the parameters of a Categorical distribution rather than categorical samples, these parameters can be derived for the Categorical marginal conditional probability distribution as:

ŝᵢ = (1/T) Σₜ softmax(ẑᵢ,ₜ)   (7)

where softmax(·) is the softmax function, and ẑᵢ,ₜ is the vector of output category logits estimated at the t-th MC-Dropout run of the neural network.
To capture aleatoric uncertainty for the bounding box state yᵢ, the neural network is trained to estimate the elements of the diagonal of a per-anchor aleatoric covariance matrix Σ̂ᵢ,ₐ, using a modified version of Eq. (3). Specifically, the loss for every dimension of the bounding box representation is modified as:

L(θ) = (1/n⁺) Σ_{i∈A⁺} Lᵢ(θ) − (1/n⁻) Σ_{i∈A⁻} ½ log σ̂ᵢ²   (8)

where n⁺ is the number of positive anchors in the set A⁺, n⁻ is the number of negative anchors in the set A⁻, and Lᵢ is the loss in Eq. (3). The first term of the proposed loss is simply Eq. (3) applied to the positive anchor set, while the second term encourages the model to increase the total variance of the bounding box state of the negative anchors. The proposed modification is empirically found to provide better numeric stability while training with higher learning rates, and a slightly more discriminative uncertainty measure over the original, as shown in Section IV.
The aleatoric covariance matrix can then be constructed from the output regressed variances as:

Σ̂ᵢ,ₜ,ₐ = diag(σ̂ᵢ,ₜ,₁², …, σ̂ᵢ,ₜ,ₖ²)   (9)

Σ̂ᵢ,ₐ = (1/T) Σₜ Σ̂ᵢ,ₜ,ₐ   (10)

where σ̂ᵢ,ₜ,ₖ² is the estimated variance of the k-th element of the bounding box state yᵢ at the t-th MC-Dropout run of the neural network. Following [11], the final output covariance of the state yᵢ can then be approximated as:

Σ̂ᵢ ≈ Σ̂ᵢ,ₑ + Σ̂ᵢ,ₐ   (11)
No explicit treatment of the aleatoric classification uncertainty is needed, since it was found in [15] to be self-contained within the estimated parameters of the categorical distribution.
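Under these definitions, the total covariance of Eq. (11) is the epistemic sample covariance plus the averaged diagonal aleatoric variances; a sketch assuming per-run regressed variances stored as an array of shape (T, 4):

```python
import numpy as np

def total_covariance(cov_epistemic, aleatoric_vars):
    """Approximate final covariance in the style of Eq. (11), following [11]:
    epistemic sample covariance plus the average of the per-run regressed
    aleatoric (diagonal) variances. aleatoric_vars: (T, 4) predicted variances
    (array shape is an illustrative assumption)."""
    return cov_epistemic + np.diag(aleatoric_vars.mean(axis=0))
```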
One of the useful properties of BayesOD is that it enables incorporating per-anchor prior information in the final estimate of the states. This formulation interprets the output of the neural network as measurements of an anchor's states, which can be used to update a prior distribution over each state. Specifically, the per-anchor conditional posterior distribution describing the bounding box state can be written as:

p(yᵢ | xᵢ, D) ∝ p(ŷᵢ | yᵢ, xᵢ, D) p(yᵢ | xᵢ)   (12)

Here, p(ŷᵢ | yᵢ, xᵢ, D) is a Gaussian likelihood function described by the sufficient statistics in Eq. (5) and Eq. (6), while p(yᵢ | xᵢ) = N(μ₀, Σ₀) is a predefined per-anchor prior distribution conditioned on the input and assumed to be independent of the data D. The sufficient statistics of the posterior can be computed through the multivariate Gaussian conjugate update as:

Σᵢ = (Σ₀⁻¹ + Σ̂ᵢ⁻¹)⁻¹   (13)

μᵢ = Σᵢ (Σ₀⁻¹ μ₀ + Σ̂ᵢ⁻¹ μ̂ᵢ)   (14)
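The Gaussian conjugate update of Eqs. (13)–(14) has a direct closed-form implementation. A sketch assuming well-conditioned covariance matrices (a production implementation might prefer Cholesky solves over explicit inverses):

```python
import numpy as np

def gaussian_posterior(mu_prior, cov_prior, mu_meas, cov_meas):
    """Multivariate Gaussian conjugate update in the style of Eqs. (13)-(14):
    fuse a per-anchor prior N(mu_prior, cov_prior) with the network's
    Gaussian 'measurement' N(mu_meas, cov_meas)."""
    prec_prior = np.linalg.inv(cov_prior)
    prec_meas = np.linalg.inv(cov_meas)
    # Posterior precision is the sum of precisions; posterior mean is the
    # precision-weighted combination of the prior and measurement means.
    cov_post = np.linalg.inv(prec_prior + prec_meas)
    mu_post = cov_post @ (prec_prior @ mu_prior + prec_meas @ mu_meas)
    return mu_post, cov_post
```

With a very diffuse prior (large diagonal Σ₀, the weakly informative choice used later in the paper), the posterior collapses onto the measurement, so the update is harmless when no real prior information exists.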
Instead of incorporating a prior distribution directly over the categorical state cᵢ, a Dirichlet distribution is set as a prior over the sufficient statistics sᵢ. The posterior distribution of these sufficient statistics can be written as:

p(sᵢ | c⁽¹⁾, …, c⁽ᴺ⁾, xᵢ, D) ∝ p(c⁽¹⁾, …, c⁽ᴺ⁾ | sᵢ) p(sᵢ)   (15)

where c⁽¹⁾, …, c⁽ᴺ⁾ are N i.i.d. instances of the categorical random variable described by a categorical distribution with the sufficient statistics ŝᵢ defined in Eq. (7). Since the likelihood function is a multinoulli distribution, the prior distribution is chosen to be a Dirichlet distribution Dir(α₀), so that the posterior is itself a Dirichlet distribution that can be computed through conjugacy in closed form as:

αᵢ,ₖ = α₀,ₖ + Σⱼ 𝟙[c⁽ʲ⁾ = k]   (16)

where 𝟙[·] is the indicator function, c⁽ʲ⁾ is the j-th categorical instance, and αᵢ,ₖ are the inferred parameters of the Dirichlet posterior distribution. Finally, the categorical posterior distribution describing the category state cᵢ can be written as:

p(cᵢ = k | xᵢ, D) = sᵢ,ₖ   (17)

where sᵢ,ₖ is the mean of the posterior Dirichlet distribution [22] in Eq. (16), written as sᵢ,ₖ = αᵢ,ₖ / Σₖ′ αᵢ,ₖ′.
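The Dirichlet–categorical conjugate update of Eqs. (16)–(17) amounts to adding observed counts to the prior concentration parameters and normalizing. A sketch assuming one-hot encoded categorical draws (the encoding is an illustrative assumption):

```python
import numpy as np

def dirichlet_update(alpha_prior, categorical_samples):
    """Dirichlet-categorical conjugate update sketch (Eqs. 16-17).
    categorical_samples: (N, K) one-hot category draws. The posterior
    parameters add per-class counts to the prior; the predictive
    categorical is the Dirichlet posterior mean."""
    alpha_post = alpha_prior + categorical_samples.sum(axis=0)  # Eq. (16)
    p = alpha_post / alpha_post.sum()                           # Eq. (17), posterior mean
    return alpha_post, p
```

With the uniform prior α₀,ₖ = 1 used in the paper, the posterior mean is simply the Laplace-smoothed empirical class frequency.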
The choice of anchor priors depends on the application, and on whether object information is actually available a priori. For the rest of this paper, a weakly informative prior is chosen for the bounding box state y, by setting μ₀ to the initial anchor position and Σ₀ to a matrix with large diagonal entries. Similarly, by setting the parameters of the Dirichlet prior such that α₀,ₖ = 1 for every category k, the resultant distribution over the parameters of the categorical distribution describing c is non-informative, and in fact equivalent to a uniform distribution over the open standard probability simplex. Fig. 2 provides a visualization of how such non-informative priors (first column) are updated through neural network inference. It can be seen that multiple updated anchors are clustered around single object instances in the scene. Such redundancy is usually eliminated through post-processing via NMS. The elimination process employed by NMS results in a large amount of useful information being discarded, which greatly impacts the quality of the computed uncertainty metric, especially for the bounding box state y.
Table I:

| Test Dataset | Method | Car AP(%) | Car GMUE(%) | Car CMUE(%) | Pedestrian AP(%) | Pedestrian GMUE(%) | Pedestrian CMUE(%) |
|---|---|---|---|---|---|---|---|
| BDD [21] | Sampling Free [16, 15] | 55.16 | 38.99 | 21.96 | 37.64 | 47.49 | 30.55 |
| | Black Box [13, 14] | 57.34 | 49.75 | 21.71 | 41.54 | 49.86 | 29.43 |
| | Redundancy [16] | 56.43 | 49.71 | 24.80 | 40.43 | 49.96 | 38.56 |
| | Ours | 61.35 | 25.53 | 16.96 | 43.62 | 26.15 | 23.56 |
| KITTI [12] | Sampling Free [16, 15] | 73.27 | 46.40 | 20.82 | 44.98 | 49.22 | 29.21 |
| | Black Box [13, 14] | 74.49 | 48.67 | 18.81 | 48.20 | 49.71 | 25.46 |
| | Redundancy [16] | 68.83 | 47.86 | 22.98 | 45.87 | 49.69 | 34.97 |
| | Ours | 74.31 | 29.73 | 13.10 | 45.18 | 28.70 | 18.45 |
BayesOD uses Bayesian inference over clusters as a replacement for the elimination scheme employed by Greedy NMS. First, per-anchor outputs from the neural network are clustered using spatial affinity. Similar to NMS, greedy clustering is performed using the output category scores ŝᵢ, by choosing the anchor with the highest non-background score as the cluster center, adding any anchor with an intersection over union (IOU) greater than 0.5 to the cluster, and eliminating all members of the cluster from the original updated anchor set. The clustering process terminates when all updated anchors are assigned to a cluster, or when the number of clusters exceeds a predefined number. Different from NMS, BayesOD retains redundant anchors in clusters, rather than eliminating them, to prevent loss of information.
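The greedy clustering step can be sketched as follows, assuming [x1, y1, x2, y2] boxes and per-anchor scalar scores (names and data layout are illustrative). Unlike greedy NMS, overlapping anchors are kept as cluster members rather than discarded:

```python
import numpy as np

def iou(a, b):
    """IoU of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def greedy_cluster(boxes, scores, iou_thresh=0.5):
    """Greedy clustering sketch: like greedy NMS, but redundant anchors are
    retained as cluster members. Returns lists of anchor indices, with the
    highest-scoring anchor first in each cluster (the cluster center)."""
    order = [int(i) for i in np.argsort(scores)[::-1]]  # descending score
    clusters = []
    while order:
        center, rest = order[0], order[1:]
        members, remaining = [center], []
        for i in rest:
            if iou(boxes[center], boxes[i]) > iou_thresh:
                members.append(i)   # kept for later Bayesian fusion
            else:
                remaining.append(i)
        clusters.append(members)
        order = remaining
    return clusters
```

Standard greedy NMS would keep only `members[0]` from each cluster; here the full member list feeds the fusion step described next.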
The output of greedy anchor clustering is a set of anchor clusters, each containing an anchor set {x₁, …, xₘ}. The cluster size m is not constant and can vary between clusters in the same frame. The first anchor, x₁, has the highest score, and as such is considered the cluster center; it is described with its posterior state distributions in Eq. (12) and Eq. (15). The rest of the cluster members are assumed to be measurement outputs from the neural network described by the states ŷᵢ and ĉᵢ. These measurements are used to update the states of the cluster center to arrive at the final states of an object instance. Specifically, for the bounding box state y, the final posterior state distribution can be written as:
p(y | x̄, D) ∝ p(y | x₁, D) Πᵢ₌₂..ₘ p(ŷᵢ | y)   (18)

where x̄ is the set of inputs of the cluster members. The second term in the equation arises from assuming conditional independence of the states of the cluster members given the state y. The sufficient statistics of Eq. (18) can be estimated in closed form as:

Σ = (Σᵢ₌₁..ₘ Σᵢ⁻¹)⁻¹   (19)

μ = Σ (Σᵢ₌₁..ₘ Σᵢ⁻¹ μᵢ)   (20)

where μᵢ and Σᵢ are the sufficient statistics of the per-anchor posterior distribution derived in Eq. (12). Notice that every member of the cluster contributes to the estimation of both the mean and the covariance matrix of the final object instance state y.
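Eqs. (19)–(20) amount to a precision-weighted fusion of the cluster members' Gaussians. A minimal sketch under the conditional-independence assumption stated above:

```python
import numpy as np

def fuse_cluster(mus, covs):
    """Bayesian fusion of cluster members' Gaussians in the style of
    Eqs. (19)-(20): the fused precision is the sum of member precisions, and
    the fused mean is the precision-weighted combination of member means."""
    precs = [np.linalg.inv(c) for c in covs]
    cov = np.linalg.inv(sum(precs))
    mu = cov @ sum(p @ m for p, m in zip(precs, mus))
    return mu, cov
```

Members that the network is confident about (small covariance) dominate the fused mean, while agreement among members shrinks the fused covariance, which is what makes the resulting entropy discriminative.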
Similarly, to arrive at the final posterior distribution describing the category state c, an analogous analysis can be performed to update the sufficient statistics of the cluster center with the categorical measurements ĉᵢ of the rest of the cluster members. Specifically, the posterior probability of s can be derived as:

p(s | x̄, D) ∝ p(s | x₁, D) Πᵢ₌₂..ₘ p(ĉᵢ | s)   (21)

where the categorical measurements ĉᵢ are assumed to be i.i.d. In summary, the posterior over s is derived by updating the per-anchor Dirichlet posterior distribution in Eq. (15) of the cluster center with the categorical measurements from all cluster members. The final categorical distribution describing the state c can then be computed as:

p(c = k | x̄, D) = sₖ   (22)

where sₖ can be computed as the mean of the posterior Dirichlet distribution in Eq. (21) as:

sₖ = αₖ / Σₖ′ αₖ′   (23)
A major result of this subsection is that the two states of any object can be updated easily given an additional measurement from a different component of the robotic system. To perform this update, one can simply use Eq. (19) and Eq. (20) to update the bounding box state y with a multivariate Gaussian measurement, and Eq. (21) to update the Dirichlet parameters with a Categorical measurement. The category state c can then be inferred using Eq. (22).
To show the effectiveness of BayesOD in comparison to the state of the art, it is applied to the problem of 2D object detection in image space. For training, the Berkeley Deep Drive 100K (BDD) dataset [21] is used. For testing on data closely resembling the frames seen in training, the image frames of the validation set of the BDD dataset are used. For testing on data visually different from the frames seen in training, the training split of the KITTI 2D object detection dataset [12] is used. The KITTI frames have been collected using a different sensor and in scenes different in appearance from those seen in the training data from BDD.

Deep Object Detector: RetinaNet [2] is chosen as the baseline deep object detector, and all methods used in comparison are integrated into its inference process. RetinaNet is trained to detect the Car and Pedestrian categories on the training frames of the BDD dataset for 6 epochs using the ADAM optimizer with a fixed batch size and initial learning rate. The learning rate is reduced every 2 epochs with a fixed decay factor. The remaining hyperparameters are left as the defaults presented in [2].

The two chosen categories exist in both datasets, and as such, the proposed experimental setup mimics what usually occurs in practice when deploying robotic systems, where object detectors are required to detect the same categories at test time as the ones they have been trained on, but in previously unobserved environments. As seen in Fig. 1, the open-set problem is inherent to the object detection task, even when testing on the same categories in both datasets.
Evaluation Metrics: Two evaluation metrics are used to evaluate different performance criteria of the uncertainty estimation methods in comparison to BayesOD. The Average Precision (AP) is a standard metric used to evaluate the performance of object detectors [21, 12]. Throughout this section, AP is evaluated separately for the two categories at a fixed IOU threshold. The maximum average precision achievable by a detector is 100%. On the other hand, the Minimum Uncertainty Error (MUE) [14] is used to determine the ability of an uncertainty measure to discriminate true positives from false positives, where a detection is determined to be a true positive if its IOU with a same-category ground-truth bounding box exceeds the evaluation threshold. False positives in this case could include poorly localized detections, or false detections resulting from unknown unknowns. The uncertainty error (UE) can then be computed from the determined true positives (TP) and false positives (FP) as:

UE(δ) = ½ ( |{TP : u > δ}| / |TP| + |{FP : u ≤ δ}| / |FP| )   (24)

where δ is the uncertainty measure threshold and u is the uncertainty measure of a detection. MUE is the best uncertainty error achievable by a detector at the best possible value of the threshold δ. The lowest MUE achievable by a detector is 0%.
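Eq. (24) can be implemented by sweeping the threshold over the observed uncertainty values. A sketch assuming scalar uncertainty measures for the TP and FP sets (function name is illustrative):

```python
import numpy as np

def minimum_uncertainty_error(u_tp, u_fp):
    """Minimum Uncertainty Error sketch following Eq. (24) / [14]:
    UE(delta) averages the fraction of true positives with uncertainty above
    the threshold and false positives at or below it; MUE is the minimum
    over candidate thresholds."""
    u_tp, u_fp = np.asarray(u_tp, float), np.asarray(u_fp, float)
    best = 1.0
    # Only observed uncertainty values need to be tried as thresholds.
    for delta in np.unique(np.concatenate([u_tp, u_fp])):
        ue = 0.5 * ((u_tp > delta).mean() + (u_fp <= delta).mean())
        best = min(best, ue)
    return best
```

Perfectly separated TP/FP uncertainties give an MUE of 0, while fully overlapping distributions give 0.5, matching the intuition that MUE measures the separability of correct and incorrect detections.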
Table II:

| Experiment | Car AP(%) | Car GMUE(%) | Car CMUE(%) | Pedestrian AP(%) | Pedestrian GMUE(%) | Pedestrian CMUE(%) |
|---|---|---|---|---|---|---|
| 1: Full System | 61.35 | 25.53 | 16.96 | 43.62 | 26.15 | 23.56 |
| 2: Variance Penalty | 60.89 | 27.37 | 17.44 | 42.60 | 27.15 | 23.83 |
| 3: No Aleatoric Variance | 60.96 | 27.55 | 16.72 | 41.98 | 28.05 | 23.15 |
| 4: Standard NMS | 61.30 | 36.49 | 17.37 | 42.32 | 41.49 | 24.53 |
| 5: No Marginalization | 58.96 | 20.74 | 23.36 | 35.54 | 25.86 | 36.57 |
BayesOD is compared against three approaches representing the state-of-the-art uncertainty estimation methods used for object detection. The three approaches will be referred to as: Black Box [13, 14], Sampling Free [16, 15], and Redundancy [16]. Methods utilizing MC-Dropout use a fixed number of fixed-random-seed stochastic runs of RetinaNet; no improvement in performance was seen for any of the compared methods beyond that number of runs. Fixing the random seed guarantees the same per-run weights for every method, resulting in a fair evaluation. The affinity threshold used for clustering in all methods was set to an IOU of 0.5, similar to that used for NMS in RetinaNet. For this dataset/detector combination, this threshold was shown to provide the highest AP for all methods used in comparison. The number of categorical samples N in Eq. (15) is set empirically. In general, reasonable effort was made to ensure the implemented methods achieve their highest possible performance on all metrics, and that a controlled and fair evaluation was achieved.
The MUE is computed for all methods based on the two described uncertainty measures. The Gaussian MUE (GMUE) uses the entropy of the Gaussian distribution describing the bounding box state y as the uncertainty measure for discriminating true positives from false positives. Note that using the entropy is seen to provide a better GMUE for all methods used for comparison than using the Total Variance (trace of the covariance matrix) as in [14]. Similarly, the Categorical MUE (CMUE) uses the entropy of the Categorical distribution describing the category state c as its uncertainty measure.
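Both entropy measures have standard closed forms; a sketch (the Gaussian differential entropy assumes a non-degenerate covariance matrix):

```python
import numpy as np

def gaussian_entropy(cov):
    """Differential entropy of a multivariate Gaussian,
    0.5 * ln((2*pi*e)^d * det(cov)): the GMUE uncertainty measure."""
    d = cov.shape[0]
    return 0.5 * (d * np.log(2 * np.pi * np.e) + np.log(np.linalg.det(cov)))

def categorical_entropy(p, eps=1e-12):
    """Shannon entropy of a categorical distribution: the CMUE measure.
    eps guards against log(0) for zero-probability classes."""
    p = np.asarray(p, float)
    return -np.sum(p * np.log(p + eps))
```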
Table I shows the results of evaluating the three methods in comparison to BayesOD on both testing datasets. BayesOD is seen to outperform all three methods on all performance metrics when tested on the BDD dataset. The major improvement can be seen in GMUE, where BayesOD provides a substantial reduction in GMUE over the second best method, Sampling Free, for both the car and pedestrian categories. BayesOD also provides a reduction in CMUE over the second best method, Black Box, for both categories. This reduction in MUE is accompanied by an increase in AP for the car and pedestrian categories. A similar trend is seen for both MUE metrics on the KITTI dataset, where BayesOD outperforms all other methods used for comparison. However, Black Box is seen to score an increase in AP over BayesOD on KITTI.
To get better insight into why BayesOD provides a much lower GMUE than the three methods used for comparison, Fig. 3 provides plots of the Gaussian entropy used to determine the GMUE versus the Categorical entropy used to determine the CMUE, for the true positives (shown in blue) and false positives (shown in red), on both the BDD dataset (top) and the KITTI dataset (bottom). For a meaningful uncertainty measure, the entropy, and hence the uncertainty, in both states of a true positive should be lower than those of a false positive. The optimal plot would show a blue cluster in the lower left corner representing the true positives, and a red cluster in the top right corner representing the false positives. For the Categorical entropy, all methods are shown to follow this intuitive trend to a certain extent. For the Gaussian entropy, however, two of the three state-of-the-art methods, Redundancy and Black Box, exhibit exactly the opposite behaviour, where the mean of the Gaussian entropy of true positives is higher than that of the false positives. To hypothesise on why such behaviour occurs, one should observe the mechanism employed by these two methods to estimate the final covariance matrix of the state y. Both methods use the clustered output of stochastic runs to estimate a sample covariance matrix, with the only difference being that Black Box clusters the output of NMS, whereas Redundancy clusters the per-anchor output before NMS. Both methods lack adequate cluster merging and explicit variance estimation, which reduces the discriminative power of their estimated uncertainty measure for the bounding box state y. The first support for this hypothesis is that Sampling Free, a method that explicitly uses the per-anchor regressed covariance matrix, provides a decrease in GMUE for the car and pedestrian categories over the better of Black Box and Redundancy. This is also reflected in the plots of Fig. 3, where Sampling Free provides a (slightly) lower mean Gaussian entropy for the true positives than for the false positives. As a final note, Fig. 3 shows that the behaviour of the uncertainty measures on both datasets is very consistent, which supports extending conclusions drawn on one dataset to the other.
Table II shows the AP, GMUE, and CMUE results for the ablation studies performed on the validation set of BDD. The results of the full BayesOD framework can be seen in experiment 1. By analyzing the results of the ablation studies, the following claims are put forth:
Pushing the variance of negative anchors to increase during training provides a slightly more discriminative uncertainty in the bounding box state y. To support this claim, RetinaNet is trained using the original attenuated loss in Eq. (3) instead of the proposed modified loss in Eq. (8). The results of BayesOD using this original loss formulation are shown in experiment 2. When compared to the full system, an increase in GMUE is observed for both the car and pedestrian categories. Although the improvement is not substantial, the proposed loss formulation in Eq. (8) is seen to be much more numerically stable, allowing for higher learning rates to be used in training. Furthermore, slightly better performance in AP and CMUE is observed when using the proposed loss formulation.
Explicit aleatoric covariance matrix estimation provides a slightly more discriminative uncertainty estimate of the bounding box state y. To support this claim, BayesOD is implemented without the update step in Eq. (11), using only the per-anchor sample variance computed from multiple stochastic runs of MC-Dropout. The results, presented in experiment 3, show an increase in the GMUE of both categories.
Greedy Non-Maximum Suppression is detrimental to the discriminative power of the uncertainty in the bounding box state y. To support this claim, the elimination scheme of NMS is selected to retain only cluster centers, while discarding the remaining cluster members. The results presented in experiment 4 show a large increase in the GMUE of the car and pedestrian categories when compared to the full system. Albeit with more modest gains, BayesOD still outperforms all state-of-the-art methods on every performance measure even when using elimination instead of Bayesian inference over cluster members.
The gains in performance on CMUE can be explained through the per-anchor marginalization over neural network parameters. To support this claim, BayesOD is stripped of the per-anchor marginalization step over the neural network parameters in Eq. (4), effectively estimating only aleatoric uncertainty. The results are presented as experiment 5, and show an increase in CMUE for the car and pedestrian categories over the full system. This increase in CMUE is accompanied by a drop in AP for both categories. Surprisingly, however, the GMUE for the two categories is lower than that of the full system, implying that for the bounding box state y, incorporating the epistemic covariance matrix could hurt the discriminative power of the estimated uncertainty measure.
In summary, the above experiments provide additional evidence for the hypothesis presented in the previous section: replacing NMS with Bayesian inference and explicitly incorporating aleatoric covariance matrix estimation allows for a much more meaningful uncertainty measure that has a stronger negative correlation with the correctness of an output detection.
Qualitative results showing the progression of object state along BayesOD’s framework are presented in Fig. 4. The thresholds producing the minimum uncertainty error for both states are used to eliminate output detections with high entropy, shown in red.
This paper presents BayesOD, a Bayesian approach for estimating the uncertainty in the output of deep object detectors. BayesOD provides a measure of uncertainty associated with both the bounding box and the category states of every detection instance. BayesOD also allows the incorporation of prior information at different stages of the detection process, and provides state probability distributions that can be interpreted as measurements by subsequent processes in robotic systems. This work aims to pave the way for future research directions that would use BayesOD for active learning, exploration, and object tracking. Furthermore, the effect of incorporating object priors into the object detection framework remains to be thoroughly studied. Future work will study the effect of informative priors originating from multiple detectors, temporal information, and different sensors on the perception capabilities of a robotic system.
The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
The limits and potentials of deep learning for robotics. The International Journal of Robotics Research, 37(4-5):405–420, 2018.