Perception errors remain a challenging issue despite the performance improvements in perception systems over the last decade. In autonomous driving and navigation systems, incorrect detections threaten the safe and robust performance of fully autonomous systems. Most modern perception systems use a pipeline of algorithms that combines object detection and tracking, but they perform inference in a purely feedforward manner. Perceptual processing in the mammalian brain, by contrast, involves significant feedback, as evidenced by the dense connectivity from higher-order visual areas encoding object categories to early visual areas encoding features. This principle of feedback extends beyond perception into cognition, where the brain goes far beyond binding visual features and matching them to the most likely feature distribution to recognize an object; instead, it engages in cycles of attention, hypothesis revision, and decision making about conflicting features to infer the true identity of an object under uncertain conditions. The cognitive capability of the human brain to make inferences under uncertainty by foraging, adaptation, evaluation, and decision making is known as sense-making and has been studied using behavioral experiments and neurocognitive models [2, 3, 4]. There have also been approaches that attempt to embed this feedback into deep neural networks [5, 6, 7].
Evidence from neuroimaging illustrates the rich top-down feedback connections that provide the platform for attentive perceptual cognition in the brain. These connections extend from areas that perform decision-making in the prefrontal cortex back to modulate activity in the sensory cortical areas that provide the visual features themselves [8, 9]. Further, diverse cortical areas that process the semantic properties of objects feed activity to the heteromodal association areas that bind these semantic properties with their sensory properties (including visual, auditory, tactile features) to combine perception and semantics in cognition [10, 11].
Adaptation in the brain occurs in many ways and through a multitude of mechanisms, from local to global. For instance, when we receive feedback that demonstrates significant surprise, or unexpected uncertainty in our model of the world, the brain adapts its processing for a remarkable ecological endeavor: controlling attention to acquire new statistics of the world around it and make better inferences, adapting at multiple levels through a neuromodulatory cascade of norepinephrine. At the decision-making level, neurobiological circuits that maintain the inference model are directed to release the current model from working memory in the prefrontal cortex. At the sensory level, feedback reaches the sensor itself: the pupils of the eye dilate to acquire more stimulus. Experimental evidence also supports mid-level cognitive adaptation in the level of alertness and orienting behavior to reacquire new statistics of the world. Indeed, such models have been used in artificial neural networks for adapting behavior in robots [14, 15].
To this end, we are inspired by sense-making cognition as a model for intelligence in perception, using feedback, adaptation, and semantics. Enabling fully autonomous systems requires sense-making cognition for intelligent perception under uncertain conditions, including corner cases. In this paper, we demonstrate an approach to intelligent perception in a system that detects perception errors and uses the principles of feedback and adaptation to correct them.
2 Related Work
In the context of robotics, perception systems need to have strong confidence about their environment. Perception errors can lead to disastrous consequences for the robot or its surroundings, for example, autonomous car crashes. Today, Deep Neural Networks (DNNs) are commonly used for perception; however, they are notoriously easy to fool and do not output calibrated confidences. To address this vulnerability, researchers have investigated a more global measure called uncertainty. There are two kinds of uncertainty: aleatoric and epistemic. Aleatoric uncertainty is the uncertainty of our input, and epistemic uncertainty is the uncertainty of our model. Methods to estimate epistemic uncertainty include Bayesian Neural Networks, ensemble methods [19, 20], Monte Carlo dropout, sampling-free methods, and estimation directly from the input. Uncertainty can be used as an indicator of the likelihood of an error: if the uncertainty associated with an output is high, the network is not very confident in that output and an error is likely. However, most of these methods do not study how to reduce the uncertainty associated with an input to improve performance.
Other approaches to detect perception errors and fix them have focused on formally verifying the systems using temporal logic [24, 25, 26, 27]. This approach has the advantage of mathematically proven performance guarantees, an essential aspect of autonomous systems operating under uncertain conditions. However, most of these systems apply motion control to the autonomous platforms themselves rather than fix the perception systems. To the best of our knowledge, our work is the first to apply temporal logic to directly modulate image contrast.
More direct methods to predict and fix errors use sensor cues from LIDAR or additional cameras to detect errors [31, 32]. In contrast to these works, our method requires no additional sensors besides a single camera and can be applied on top of any off-the-shelf object detector.
Our approach builds upon contrast enhancement techniques. Early work in the perceptual domain explored improving upon conventional contrast enhancement techniques by using adaptive histogram equalization. Subsequent work applied contrast enhancement to object detection, where it helps accentuate features so that objects are detected more robustly [29, 30]. However, these conventional methods use the contrast information of the entire image. As a result, if some non-object areas produce high contrast, contrast adaptation cannot improve object detection.
In our proposed method, we draw upon temporal logic-based methods and contrast enhancement techniques. We improve the perception system using feedback control of a contrast parameter only from detected objects within a formally verified system.
The rest of the paper is organized as follows. In the next section, we describe the proposed perception error detection/correction system. Section 4 explains the probes converted from the perception data. Section 5 presents the detailed approach for error correction. Building on that mechanism, Section 6 details the contrast-based perception adaptation. Section 7 validates the proposed methods, and Section 8 concludes the paper.
The overall structure of the CogSense error detection and perception adaptation system is shown in Figure 1. Here, we review the construction of the CogSense system, from training data to characterizing the operational ranges of the perception system. First, from the perception data, we generate probes that describe the characteristics of detections, such as the size of objects, their type, tracking deviations from a motion tracker, and image statistics such as contrast and entropy. Using the probes, we set up Probabilistic Signal Temporal Logic (PSTL), which provides axioms, each constructed from one or more probes with the corresponding statistical analyses. As an intermediate process, those axioms provide the error analysis on detections. This is the cognitively inspired process described earlier, in which heterogeneous sources of information from multiple modalities, including physical, semantic, and image-based statistics, provide checks and balances on the plausibility of output detections from the perception system that are orthogonal to the measures of uncertainty described above. Finally, with these axiom-based constraints, we solve an optimization problem to synthesize controls for the perception modules to reduce perception errors and improve valid detection rates. This paper first mathematically describes the CogSense system for the heterogeneous probes for error detection using PSTL and then the PSTL-based optimization framework for controlling parameters to correct perception errors. We then describe experimental results that use heterogeneous probes in CogSense to detect errors and an image contrast optimization for perception adaptation to correct errors.
In more detail, our approach is as follows:
Through the perception module, we receive input images of the scene.
Objects in the image are detected and recognized.
This module converts the perception data into probes that we use for signal temporal logic.
Probes are converted into the axioms under the probabilistic signal temporal logic structure.
The axioms are evaluated to verify if the corresponding observations are valid or erroneous based on the constraints using the statistically analyzed probe bounds.
If the axioms are invalid within certain probabilities, the optimal contrast bound and entropy bound are estimated as perception module parameters by solving the image contrast/entropy-based optimization problem. Finally, these estimated parameters are delivered back to the perception module to adjust its contrast and entropy.
4 Perception probes generation and error evaluation
The first step in the process is to obtain the perception data along with characteristics of detections and recognitions. To get different types of characteristics efficiently, we used YOLOv3 as shown in Figure 2.
We can extract image-based bounding box information from detected object boxes in a single frame and the corresponding tracking information along with the image sequences. We call this information "probes." In our system, we have multiple probes such as detected object sizes, aspect ratios, recognition ID consistency, tracking deviations, and so on. The following are the sample probes we use in this paper.
Object size (in the image plane and in the world coordinate frame)
Aspect ratio of the detected objects
Localization and tracking performance
Contrast of the detected boxes
Entropy of the detected boxes
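The image-based probes listed above can be sketched in a few lines. The example below is a minimal illustration, assuming a grayscale `uint8` frame and a box given as `(x, y, w, h)`; the function name `box_probes`, the box layout, and the choice of peak-to-peak contrast and 256-bin Shannon entropy are assumptions made for this sketch, not details confirmed by the paper.

```python
import numpy as np

def box_probes(gray, box):
    """Compute simple probes for one detection (illustrative sketch).

    gray: 2-D uint8 array (grayscale frame).
    box:  (x, y, w, h) in pixels, top-left corner plus size (assumed layout).
    """
    x, y, w, h = box
    patch = gray[y:y + h, x:x + w].astype(np.float64)

    i_max, i_min = patch.max(), patch.min()
    # Peak-to-peak (Michelson) contrast; defined as 0 for an all-black patch.
    denom = i_max + i_min
    contrast = (i_max - i_min) / denom if denom > 0 else 0.0

    # Shannon entropy (in bits) of the 256-bin intensity histogram.
    hist = np.bincount(patch.astype(np.uint8).ravel(), minlength=256)
    p = hist[hist > 0] / hist.sum()
    entropy = float(-(p * np.log2(p)).sum())

    return {
        "size": w * h,                 # box area in the image plane
        "aspect_ratio": w / h,
        "contrast": float(contrast),
        "entropy": entropy,
    }

# Usage: a frame whose left half is black and right half is bright.
frame = np.zeros((10, 10), dtype=np.uint8)
frame[:, 5:] = 200
probes = box_probes(frame, (0, 0, 10, 5))
```

Tracking-based probes (ID consistency, localization deviation) need information across frames and are not covered by this single-frame sketch.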
Each probe is used to build constraints for perception adaptation using probabilistic signal temporal logic (PSTL). PSTL expresses probabilistic and temporal properties of real-valued signals. Let us assume that the signal at time satisfies a probabilistic atomic predicate . If we briefly describe its formula,
In this predicate, is a time-varying random variable and is the tolerance level in satisfying the probabilistic properties. With this formula, we convert our probes into axioms.
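For reference, the probabilistic atomic predicate in the PSTL literature cited here is commonly written in the form below. The symbols $\lambda$, $\alpha_t$, $x_t$, and $\epsilon_t$ are assumptions adopted for this sketch, since the original notation is not recoverable from the text above; it should be read as one plausible form rather than the exact formula used in this paper.

```latex
\lambda_{\alpha_t}^{\epsilon_t}(x_t) \;:=\; P\left(\alpha_t^{\top} x_t < 0\right) > 1 - \epsilon_t
```

Under this reading, $x_t$ is the signal at time $t$, $\alpha_t$ is the time-varying random variable, and $\epsilon_t$ is the tolerance level.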
All of our detections are divided into true positives and false positives. From the true positive and false positive detections, we can perform statistical analysis for each probe. Figure 3 shows a descriptive example of a probe. For a detected object, , we assume that we generate a probe, . By analyzing the values from true positives and also those from false positives, we can obtain probabilistic distributions of true positives and false positives as shown in the figure. We can define upper and lower bounds for true positives from the intersections between two different distribution graphs. And the shaded area presents the confidence probability, , of the probe.
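The bound-finding step described above can be sketched as follows. This is a minimal illustration under stated assumptions: each class of probe values is modeled as a Gaussian, the true-positive spread is narrower than the false-positive spread (so the two fitted densities cross twice), and the confidence probability is taken to be the true-positive mass between the two crossings. The function names are hypothetical, and the paper may use a different density model.

```python
import math
import numpy as np

def gaussian_cdf(x, mu, sigma):
    """CDF of a normal distribution, via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def probe_bounds(tp_values, fp_values):
    """Return (lower, upper, confidence) for the true-positive range of a probe."""
    mu_t, s_t = float(np.mean(tp_values)), float(np.std(tp_values))
    mu_f, s_f = float(np.mean(fp_values)), float(np.std(fp_values))
    if not 0.0 < s_t < s_f:
        raise ValueError("sketch assumes 0 < TP spread < FP spread")

    # Setting the two Gaussian log-densities equal gives a*x^2 + b*x + c = 0.
    a = 1.0 / (2 * s_f**2) - 1.0 / (2 * s_t**2)
    b = mu_t / s_t**2 - mu_f / s_f**2
    c = mu_f**2 / (2 * s_f**2) - mu_t**2 / (2 * s_t**2) + math.log(s_f / s_t)
    lower, upper = sorted(float(r) for r in np.roots([a, b, c]).real)

    # True-positive probability mass between the two intersections.
    confidence = gaussian_cdf(upper, mu_t, s_t) - gaussian_cdf(lower, mu_t, s_t)
    return lower, upper, confidence

# Usage with toy data: TP values have std 1, FP values have std 2, both centered at 0.
lower, upper, confidence = probe_bounds(np.array([-1.0, 1.0]), np.array([-2.0, 2.0]))
```

With real data, a histogram or kernel density estimate could replace the Gaussian fit; the intersections would then be found numerically.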
If we describe this relation in a mathematical form (axiom) with the probabilistic inequality from the probabilistic signal temporal logic, it becomes as follows:
where is the predicate and y is the true detection or recognition. And means the time sequence between and , so is the probe sequence in the time frame of .
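One way to write the axiom described above, assuming hypothetical symbols $a$ and $b$ for the lower and upper bounds, $f(x, t_s{:}t_e)$ for the probe sequence, and $P_{TP}$ for the confidence probability (none of these names is confirmed by the surviving text), is:

```latex
\forall x,\quad P\big(a \le f(x, t_s{:}t_e) \le b \;\rightarrow\; y\big) \ge P_{TP}
```

Read this as: whenever the probe stays within $[a, b]$ over the window from $t_s$ to $t_e$, the detection $y$ is a true one with probability at least $P_{TP}$.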
Depending on the probe dimensions, the probabilistic function can also be multi-dimensional. By integrating all the available axioms from , we obtain a multi-dimensional range for the corresponding detection or recognition. When a new probe violates the corresponding axioms with more than a certain probability threshold, the corresponding detection is considered erroneous. If it is categorized as erroneous, we apply the perception adaptation as shown in the next section.
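The error check described above can be sketched as a windowed test. The example assumes that the probability in an axiom is estimated by the fraction of frames in the window whose probe value lies inside the learned bounds, and that a detection is flagged as soon as any single axiom fails. Both choices, and all names, are assumptions made for illustration.

```python
import numpy as np

def axiom_holds(probe_window, lower, upper, prob_threshold):
    """Check one probe over a time window.

    Returns (satisfied, p_hat), where p_hat is the fraction of frames whose
    probe value lies in [lower, upper].
    """
    values = np.asarray(probe_window, dtype=float)
    inside = (values >= lower) & (values <= upper)
    p_hat = float(inside.mean())
    return p_hat >= prob_threshold, p_hat

def detection_is_erroneous(windows, bounds, thresholds):
    """Flag a detection if any of its probe axioms is violated.

    windows:    dict probe_name -> sequence of probe values over the window
    bounds:     dict probe_name -> (lower, upper)
    thresholds: dict probe_name -> minimum required probability
    """
    for name, window in windows.items():
        lower, upper = bounds[name]
        satisfied, _ = axiom_holds(window, lower, upper, thresholds[name])
        if not satisfied:
            return True
    return False

# Usage: the contrast probe leaves its range in one of four frames.
ok, p_hat = axiom_holds([0.5, 0.6, 0.55, 0.95], 0.4, 0.7, 0.7)
flagged = detection_is_erroneous(
    {"contrast": [0.5, 0.6, 0.55, 0.95]},
    {"contrast": (0.4, 0.7)},
    {"contrast": 0.8},
)
```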
5 Perception error correction using the PSTL-constraint-based optimization
Detecting perception errors alone is not sufficient to recover perception quality in subsequent image sequences. Therefore, we also want to use that knowledge to adjust the perception modules for more accurate and robust detections. In this paper, we propose a PSTL-constraint-based optimization of the following form:
where is the probe state at time and is the control input to the perception module. And is the cost function of estimating perception errors. Our goal is to achieve the optimal to reduce perception errors. Therefore, minimizing can achieve the optimal perception module control input. Eventually, the final optimization formula with the two or more PSTL-based constraints for probes, , , etc. becomes,
where and are the true positive labels for the probes, and , respectively. , , and are probabilistic error bounds acquired from the true positive / false positive distribution described in Section 4. And and are the probability thresholds for and , respectively.
6 Contrast-based perception adaptation
To achieve the contrast-based perception adaptation, we first set up the object detection constraints using five different types of constraints: (1) Detection ID consistency (tracking of the same object); (2) Localization consistency within the expected trajectory; (3) Bounding box size consistency in the image plane; (4) Contrast matching in the desired range; and (5) Entropy matching in the desired range. Details for each constraint are presented below. is the current time and is the time that the temporal logic window starts.
where is the probabilistic threshold for consistent ID detections. In this constraint, the detection ID is checked to be consistent. If the IDs keep changing, the detection process cannot be robust.
Bounding box size deviation over time
where is the bounding box size at time and is the desired bounding box size from its history. And is the probabilistic threshold for consistent bounding box size. Usually, highly varying bounding box sizes for the same object (unless it approaches or moves away abruptly) indicate unreliable bounding box estimation (e.g., excessive sensitivity to changes in lighting conditions).
Localization deviation from the desired tracking trajectory
where is the detected object’s location at time and is its expected path from the history, and is the probabilistic threshold for consistent localization. Localization deviation usually comes from unreliable bounding box estimation, which indicates perception errors.
where is the contrast of the bounding box at time , is the desired contrast from the training phase, and is the probabilistic threshold for contrast. Highly varying contrast also modifies the details of the same object information over time. Contrast consistency is one of the factors that we need to keep for more robust perception.
where is the entropy of the bounding box at time and is the desired entropy from the training phase. And is the probabilistic threshold for entropy. Overly blurred or overly sharpened images degrade even state-of-the-art deep-learning-based object detection methods. So, if the entropy deviates from the desired value, it is highly likely that we have perception errors.
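The five constraints share one pattern: over the temporal logic window, estimate how often a quantity stays consistent and compare that frequency with a probabilistic threshold. The sketch below illustrates that pattern under assumptions not stated in the paper: ID consistency is measured as the share of frames carrying the most common ID, the other constraints are measured as the share of frames whose deviation from the desired value is below a tolerance, and all names are hypothetical.

```python
from collections import Counter
import numpy as np

def id_consistency(ids):
    """Fraction of frames in the window that carry the most common detection ID."""
    return Counter(ids).most_common(1)[0][1] / len(ids)

def within_tolerance(values, desired, tol):
    """Fraction of frames whose deviation from the desired value is below tol.

    Works for scalar probes (size, contrast, entropy) and for 2-D locations,
    where the deviation is the Euclidean distance to the expected path.
    """
    diff = np.asarray(values, dtype=float) - np.asarray(desired, dtype=float)
    if diff.ndim > 1:
        diff = np.linalg.norm(diff, axis=-1)
    return float(np.mean(np.abs(diff) < tol))

# Usage over a short window of frames.
p_id = id_consistency([3, 3, 3, 7])
p_size = within_tolerance([100, 102, 98, 150], desired=100, tol=5)
p_loc = within_tolerance(
    [[0, 0], [1, 1], [2, 2]],        # detected locations
    [[0, 0], [1, 1], [5, 6]],        # expected path from the history
    tol=1.0,
)
# A constraint is taken as satisfied when its frequency meets its threshold.
satisfied = {"id": p_id >= 0.7, "size": p_size >= 0.7, "location": p_loc >= 0.7}
```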
Then the corresponding optimization formula to control contrast with the cost function is defined as,
where is the contrast value of the detected object at time , is the desired contrast value from the procedure of finding the probabilistic distributions of the probes, and is the system control input for contrast (which is the same as estimated deviation to apply to the perception module).
For contrast control, once the desired contrast deviation is acquired, we expand the histogram range to achieve that contrast change using the peak-to-peak contrast (Michelson contrast). The peak-to-peak contrast is defined in the following way:
$$C = \frac{I_{\max} - I_{\min}}{I_{\max} + I_{\min}}$$
where $I_{\max}$ is the maximum image intensity value and $I_{\min}$ is the minimum image intensity value. From this definition, the new contrast is expected to be:
where is the expanded histogram range to achieve the new contrast. Since , the required change in the histogram range will be
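The histogram expansion can be illustrated with a short worked sketch. It assumes the intensity range is widened symmetrically by an amount `delta` on each side, so that the sum of the maximum and minimum intensities is unchanged and the Michelson contrast changes by `2 * delta / (i_max + i_min)`. Solving for `delta` gives `delta = 0.5 * (target - current) * (i_max + i_min)`. This symmetric scheme and the function names are assumptions for illustration; the paper's exact expansion rule is not recoverable from the text.

```python
import numpy as np

def michelson(patch):
    """Peak-to-peak (Michelson) contrast of an intensity patch."""
    i_max, i_min = float(patch.max()), float(patch.min())
    return (i_max - i_min) / (i_max + i_min) if (i_max + i_min) > 0 else 0.0

def stretch_to_contrast(patch, target_contrast):
    """Linearly remap a patch so that its Michelson contrast moves to the target.

    Returns a float64 array clipped to [0, 255]; if clipping occurs, the
    target contrast is only approximated.
    """
    patch = patch.astype(np.float64)
    i_max, i_min = patch.max(), patch.min()
    if i_max == i_min:
        return patch  # flat patch: there is no range to stretch

    current = (i_max - i_min) / (i_max + i_min)
    # Widen the range by delta on each side (negative delta narrows it).
    delta = 0.5 * (target_contrast - current) * (i_max + i_min)
    new_min, new_max = i_min - delta, i_max + delta

    scale = (new_max - new_min) / (i_max - i_min)
    out = (patch - i_min) * scale + new_min
    return np.clip(out, 0.0, 255.0)

# Usage: contrast 0.5 (range 50..150) is raised to 0.8 (range 20..180).
patch = np.array([[50, 100], [150, 150]], dtype=np.uint8)
stretched = stretch_to_contrast(patch, 0.8)
```

In a detection pipeline the remapping would be applied only inside the detected bounding box, in line with the box-level adaptation the paper describes.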
In this section, we present test results for our contrast-based PSTL perception adaptation. First, we present a test result on a video from the Multiple Object Tracking (MOT) Benchmark dataset. With our proposed method, detection results improve as shown in Figure 4. The red box in the upper figure is an erroneous person detection from the original image, and the green boxes in the lower figure are correct person detections newly added by the CogSense system.
To support the proposed method with quantitative results, we also provide a precision-recall graph and a ROC curve in Figs. 5 and 6, respectively. As shown in the graphs, the proposed approach achieves better recall at the same precision, and lower false positive rates at the same true positive rates. In particular, in the ROC curve, with a 10% detection confidence threshold, the false positive rate is reduced by 41.48%.
In addition to the above, we also tested more challenging video clips collected from [36, 37, 38, 39, 40]. We compare the detection results from the original video sequences with those from the conventional Contrast Limited Adaptive Histogram Equalization (CLAHE) and those from our proposed CogSense method. Tables 1 and 2 show the true positive rates (the number of true positives over the sum of true positive and false positive detections) and false positive counts, respectively. As shown in Table 1, although the improvements are modest, the true positive rates improve overall compared to the original method and the CLAHE method. In Table 2, the false positive counts with the proposed method are reduced compared to the original video processing and to the CLAHE-applied video processing. The CLAHE-applied videos show high false positive counts because CLAHE-based image enhancement improves visual quality for human viewers rather than for object detection.
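Table 1: True positive rates

| Clip | Original | CLAHE | CogSense (proposed) |
|---|---|---|---|
| Clip 1 | 0.9804 | 0.9602 | 0.9817 |
| Clip 2 | 0.9931 | 0.9931 | 0.9904 |
| Clip 3 | 0.9829 | 0.9860 | 0.9835 |
| Clip 4 | 0.9964 | 0.9977 | 0.9993 |
| Clip 5 | 0.9822 | 0.9796 | 0.9834 |
| Clip 6 | 0.9495 | 0.9065 | 0.9527 |

Table 2: False positive counts

| Clip | Original | CLAHE | CogSense (proposed) |
|---|---|---|---|
| Clip 1 | 22 | 43 | 20 |
| Clip 2 | 14 | 8 | 12 |
| Clip 3 | 13 | 11 | 13 |
| Clip 4 | 5 | 3 | 1 |
| Clip 5 | 17 | 19 | 16 |
| Clip 6 | 17 | 34 | 15 |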
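As a small worked check on the false positive counts, the per-clip values can be summed per method and compared. The column order (original, CLAHE, CogSense) is an assumption based on the order in which the text lists the methods, and the variable names are illustrative.

```python
# Per-clip false positive counts from Table 2 (Clips 1 to 6).
# Column order is assumed: original, CLAHE, CogSense.
fp_counts = {
    "original": [22, 14, 13, 5, 17, 17],
    "clahe":    [43, 8, 11, 3, 19, 34],
    "cogsense": [20, 12, 13, 1, 16, 15],
}

totals = {name: sum(counts) for name, counts in fp_counts.items()}

def relative_reduction(baseline, improved):
    """Fractional drop in false positives when moving from baseline to improved."""
    return (baseline - improved) / baseline

vs_original = relative_reduction(totals["original"], totals["cogsense"])
vs_clahe = relative_reduction(totals["clahe"], totals["cogsense"])

print(totals)
print(f"vs original: {vs_original:.1%}, vs CLAHE: {vs_clahe:.1%}")
```

Under that column assumption, the totals are 88, 118, and 77 false positives, which corresponds to 12.5% fewer false positives than the original processing and roughly 34.7% fewer than the CLAHE-applied processing across the six clips.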
Figure 7 shows sample image triples from the three methods. In the figure, each row corresponds to the individual clip number in order. In Clip 1, the original video processing missed one person, and the CLAHE-applied video processing missed the same person and even produced multiple detections of the same person. In contrast, our proposed method detects all three objects. In Clip 2 and Clip 6, the proposed method also detects more true positives. Finally, in Clip 4, the original video processing and the CLAHE-applied video processing each produce a different detection, whereas our proposed method detects both.
As shown through the test results, our proposed CogSense method using PSTL-based constraints and detected-bounding-box-based optimization provides more robust detection outputs.
8 Conclusion and Discussion
This paper presents CogSense, a new perception error detection and perception parameter adaptation method using probabilistic signal temporal logic, along with a contrast-based perception adaptation approach as a specific case. The proposed method evaluates perception errors using heterogeneous probes of the detected objects and subsequently corrects them by solving a contrast-based optimization problem with other object-detection-based constraints generated from the probabilistic signal temporal logic system. Our contrast-based perception adaptation uses information only from the detected bounding boxes; specifically, it applies adaptation to these boxes to improve object detection rather than enhancing the entire image, which could have unintended consequences for other detections. In future work, we will increase the heterogeneity of probes to cover other domains such as semantics and context, and pursue multi-modal perception parameter adaptation by simultaneously optimizing parameters beyond the image-based contrast demonstrated in this paper, to enhance the CogSense system's perception correction capabilities.
-  S. Gabay, Y. Pertzov, and A. Henik, "Orienting of attention, pupil size, and the norepinephrine system," Attention, Perception, and Psychophysics 73, 123–129, 2011 https://doi.org/10.3758/s13414-010-0015-4.
-  G. A. Ascoli, M. M. Botvinick, R. J. Heuer, and R. Bhattacharyya, “Neurocognitive models of sense-making,” Biol. Inspired Cogn. Archit., vol. 8, pp. 82–89, 2014.
-  C. Lebiere et al., “A Functional Model of Sensemaking in a Neurocognitive Architecture,” Comput. Intell. Neurosci., vol. 2013, Article ID 921695, doi: 10.1155/2013/921695.
-  M. D. Howard et al., “The Neural Basis of Decision-Making During Sensemaking: Implications for Human-System Interaction,” Proc. IEEE Aerosp. Conf. March 10 2015.
-  C. J. Spoerer, P. McClure and N. Kriegeskorte, "Recurrent convolutional neural networks: a better model of biological object recognition," Frontiers in Psychology, 2017.
-  Y. Huang, S. Dai, T. Nguyen, P. Bao, D. Tsao, R. G. Baraniuk, and A. Anandkumar, "Brain-inspired Robust Vision using Convolutional Neural Networks with Feedback," 2019 Conference on Neural Information Processing Systems NeuroAI Workshop, 2019.
-  J. Kubilius, M. Schrimpf, K. Kar, R. Rajalingham, H. Hong, N. J. Majaj, E. B. Issa, P. Bashivan, J. Prescott-Roy, K. Schmidt, A. Nayebi, D. Bear, D. L. K. Yamins, and J. J. DiCarlo, "Brain-Like Object Recognition with High-Performing Shallow Recurrent ANNs," 2018 Conference on Neural Information Processing Systems, 2018.
-  T. P. Zanto, M. T. Rubens, A. Thangavel, and A. Gazzaley, “Causal role of the prefrontal cortex in top-down modulation of visual processing and working memory,” Nat. Neurosci., vol. 14, no. 5, pp. 656–661, 2011.
-  R. F. Helfrich, M. Huang, G. Wilson, and R. T. Knight, “Prefrontal cortex modulates posterior alpha oscillations during top-down guided visual perception,” Proc. Natl. Acad. Sci., vol. 114, no. 35, pp. 9457–9462, Aug. 2017, doi: 10.1073/pnas.1705965114.
-  M. F. Bonner, J. E. Peelle, P. A. Cook, and M. Grossman, “Heteromodal conceptual processing in the angular gyrus,” Neuroimage, vol. 71, pp. 175–186, 2013.
-  G. A. Calvert, R. Campbell, and M. J. Brammer, “Evidence from functional magnetic resonance imaging of crossmodal binding in the human heteromodal cortex,” Curr. Biol., vol. 10, no. 11, pp. 649–657, 2000.
-  A. J. Yu and P. Dayan, “Uncertainty, neuromodulation, and attention,” Neuron, vol. 46, pp. 681–692, 2005.
-  G. Aston-Jones and J. D. Cohen, “An integrative theory of locus coeruleus-norepinephrine function: adaptive gain and optimal performance,” Annu Rev Neurosci, vol. 28, pp. 403–450, 2005.
-  J. L. Krichmar, “The Neuromodulatory System - A Framework for Survival and Adaptive Behavior in a Challenging World,” Adapt. Behav., vol. 16, pp. 385–399, 2008.
-  X. Zou, S. Kolouri, P. K. Pilly, and J. L. Krichmar, “Neuromodulated attention and goal-driven perception in uncertain domains,” Neural Netw., 2020.
-  A. Nguyen, J. Yosinski, and J. Clune, "Deep Neural Networks are Easily Fooled: High Confidence Predictions for Unrecognizable Images," 2015 IEEE Proceedings on Computer Vision, 2015.
-  C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger, "On Calibration of Modern Neural Networks," 2017 International Conference on Machine Learning, 2017.
-  A. Kendall and Y. Gal, "What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision?," 2017 Conference on Neural Information Processing Systems, 2017.
-  H. Grimmett, R. Paul, R. Triebel, and I. Posner, "Knowing When We Don’t Know: Introspective Classification for Mission-Critical Decision Making," 2013 IEEE International Conference on Robotics and Automation, 2013.
-  C. G. Blair, J. Thompson, and N. M. Robertson, "Introspective classification for pedestrian detection," 2014 Sensor Signal Processing for Defense, 2014.
-  Y. Gal and Z. Ghahramani, "Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning," 2016 International Conference on Machine Learning, 2016.
-  J. Postels, F. Ferroni, H. Coskun, N. Navab, and F. Tombari, "Sampling-free Epistemic Uncertainty Estimation Using Approximated Variance Propagation," 2019 IEEE International Conference on Computer Vision, 2019.
-  S. Daftry, S. Zeng, J. A. Bagnell, and M. Hebert, "Introspective Perception: Learning to Predict Failures in Vision Systems," 2016 IEEE International Conference on Intelligent Robotics and Systems, 2016.
-  A. Dokhanchi, H.B. Amor, J.V. Deshmukh, and G. Fainekos, “Evaluating perception systems for autonomous vehicles using quality temporal logic,” International Conference on Runtime Verification, 2018.
-  R.R. da Silva, V. Kurtz, and M. Hebert, “Active Perception and Control from Temporal Logic Specifications,” arXiv:1905.03662, 2019.
-  S. Jha, V. Raman, D. Sadigh, and S.A. Seshia, “Safe Autonomy Under Perception Uncertainty Using Chance-Constrained Temporal Logic,” Journal of Automated Reasoning, 2018.
-  D. Sadigh and A. Kapoor, “Safe control under uncertainty with Probabilistic Signal Temporal Logic,” in Proc. Of Robotics: Science and Systems, 2016.
-  J. A. Stark, “Adaptive Image Contrast Enhancement Using Generalizations of Histogram Equalization,” IEEE Transactions on Image Processing, Vol. 9, No. 5, pp.889-896, 2000.
-  D. Holz and H. Yang, “Enhanced Contrast for Object Detection and Characterization by Optical Imaging,” US9626591B2.
-  V. Vonikakis, D. Chrysostomou, R. Kouskouridas and A. Gasteratos, “Improving the Robustness in Feature Detection by Local Contrast Enhancement,” 2012 IEEE International Conference on Image Systems and Techniques Proceedings, July 2012.
-  D. Barnes, W. Maddern, and I. Posner, "Find Your Own Way: Weakly-Supervised Segmentation of Path Proposals for Urban Autonomy," 2017 IEEE International Conference on Robotics and Automation, 2017.
-  M. S. Ramanagopal, C. Anderson, R. Vasudevan, and M. Johnson-Roberson, "Failing to Learn: Autonomously Identifying Perception Failures for Self-driving Cars," 2018 IEEE International Conference on Intelligent Robotics and Systems, 2018.
-  https://pjreddie.com/darknet/yolo/.
-  A. Michelson, “Studies in Optics,” University of Chicago Press, 1927.
-  Multiple Object Tracking Benchmark (https://motchallenge.net).
-  https://www.youtube.com/watch?v=0xSSnfRYBQY
-  https://www.youtube.com/watch?v=BOa0zQBRs_M
-  https://www.youtube.com/watch?v=NjRnTmUr_kg
-  https://www.youtube.com/watch?v=cGjWbGgDaCc
-  https://www.youtube.com/watch?v=99wezqewopU