In recent years, autonomous vehicles have grown considerably in popularity. With this growth has come increased attention to the fatalities caused by the autonomous components on board these vehicles, especially their perception systems [templeton_tesla_2020, lee_report_2018]. Perception modules use vision data from cameras to reason about the surrounding environment, including detecting objects and interpreting traffic signs, and their outputs are in turn used by controllers to make safety-critical control decisions, such as avoiding pedestrians. Because these systems are safety-critical, it is important that they be tested during design and monitored during deployment.
Signal Temporal Logic (STL) [maler_monitoring_2004] and Metric Temporal Logic (MTL) [fainekos_robustness_2009] have been used extensively in the verification, testing, and monitoring of safety-critical systems. In these scenarios, there is typically a model of the system that generates trajectories under various actions. These traces are then used to test whether the system satisfies some specification. This is referred to as offline monitoring, and it is the main setting for testing and falsification of safety-critical systems. STL and MTL have also been used for online monitoring, where some safety property is checked for compliance at runtime [nickovic_rtamt_2020, dokhanchi_online_2014]. In both settings, these logics are used to express rich specifications on low-level properties of the signals output by systems.
The output of a perception algorithm consists of a sequence of frames, where each frame contains a variable number of objects over a fixed set of categories, in addition to object attributes that can range over larger data domains (e.g., bounding box coordinates, distances, and confidence levels). STL and MTL can handle mixed-mode signals, and there have been attempts to extend them to incorporate spatial data [bortolussi_specifying_2014, nenzi_qualitative_2015, haghighi_spatel_2015]. However, these logics lack the ability to compare objects in different frames or to model complex spatial relations between objects.
Timed Quality Temporal Logic (TQTL) [dokhanchi_evaluating_2018] and Spatio-temporal Quality Logic (STQL) [hekmatnejad_formalizing_2021] are extensions to MTL that incorporate semantics for reasoning specifically about data from perception systems. STQL, which is itself an extension of TQTL, defines operators to reason about discrete IDs and classes of objects, along with set operations on the spatial artifacts, like bounding boxes, output by perception systems.
In this project, we contribute the following:
We show how TQTL [dokhanchi_evaluating_2018] and STQL [hekmatnejad_formalizing_2021] can be used to express correctness properties for perception algorithms.
We present an online monitoring tool, PerceMon (https://github.com/CPS-VIDA/PerceMon.git), that efficiently monitors such specifications. We integrate this tool with the CARLA simulation environment [dosovitskiy_carla_2017] and the Robot Operating System (ROS) [quigley_ros_2009].
S-TaLiRo [annpureddy_staliro_2011, fainekos_robustness_2019], VerifAI [dreossi_verifai_2019], and Breach [donze_automotive_2015] are some examples of tools used for offline monitoring of MTL and STL specifications. The presented tool, PerceMon, follows an architecture similar to that of the RTAMT [nickovic_rtamt_2020] online monitoring tool for STL specifications: the core tool is written in C++ with an interface for use in different, application-specific platforms.
2 Spatio-temporal Quality Logic
Spatio-temporal quality logic (STQL) [hekmatnejad_formalizing_2021] is an extension of Timed Quality Temporal Logic (TQTL) [dokhanchi_evaluating_2018] that incorporates reasoning about high-level topological structures present in perception data, like bounding boxes, and set operations over these structures.
STL has been used extensively in testing and monitoring of control systems, mainly because it can express rich specifications on the low-level, real-valued signals generated by these systems. To make the logic more high-level, spatial extensions have been proposed that can reason about spatial relations between signals [bortolussi_specifying_2014, nenzi_qualitative_2015, gabelaia_computational_2003, haghighi_spatel_2015]. A key feature of data streams generated by perception algorithms is that they contain frames of spatial objects consisting of both real-valued and discrete-valued quantities: the discrete-valued signals are the IDs of the objects and their associated categories, while the real-valued signals include the bounding boxes describing the objects and the confidence associated with their identities. While STL and MTL can be used to reason about properties of a fixed number of such objects in each frame by creating signal variables to encode each of these properties, it is not possible to design monitors that handle arbitrarily many objects per frame.
TQTL [dokhanchi_evaluating_2018] is a logic tailored specifically to spatial data from perception algorithms. Using Timed Propositional Temporal Logic [bouyer_expressiveness_2005] as a basis, TQTL allows one to pin or freeze the signal at a certain time point and use clock variables associated with the freeze operator to define time constraints. Moreover, TQTL introduces a quantifier over objects in a frame and the ability to refer to properties intrinsic to the object: tracking IDs, classes or categories, and detection confidence. STQL [hekmatnejad_formalizing_2021] further extends the logic to reason about the bounding boxes associated with these objects, along with predicate functions for these spatial sets, by incorporating topological semantics from the spatio-temporal logic [gabelaia_computational_2003].
Definition 1 (STQL Syntax [hekmatnejad_formalizing_2021])
Let be a set of time variables, be a set of frame variables, and be a set of object ID variables. Then the syntax for STQL is recursively defined by the following grammar:
Here, (for all indices ), , and . In the above grammar is a real-valued constant that allows for the comparison of ratios of object properties.
The above grammar includes the negation and disjunction operators from propositional logic, together with the temporal operators next, previous, until, and since. From these, one can derive the other propositional operators, such as conjunction, the temporal operators always and eventually, and their past-time counterparts holds and once. In addition, STQL introduces freeze quantifiers over clock variables and object variables. The freeze quantifier over clock variables pins the time and the frame at which the formula is evaluated and assigns them to a pinned time variable and a pinned frame variable, respectively. Two constants refer to the time and the frame number at which the current formula is being evaluated; subtracting a pinned variable from the corresponding constant measures the duration, or the number of frames, elapsed since that variable was pinned. The quantifier over object variables searches the objects in a frame of the incoming data stream, assigning each object to the object variable in turn, and holds if there exists an object that satisfies the quantified formula. Two functions refer to the class and the confidence of the detected object associated with an ID variable. In addition to these TQTL operations, bounding boxes around objects can be extracted, and set operations can be defined over them. The spatial exists operator checks whether a spatial expression results in a non-empty space. Quantitative operations measure the area of spatial sets, the Euclidean distance between reference points of bounding boxes, and the latitudinal and longitudinal offsets of bounding boxes. The reference points of a bounding box are its left, right, top, and bottom margins and its centroid.
Due to lack of space, we defer defining the formal semantics of STQL to Appendix 0.A and also refer the readers to [hekmatnejad_formalizing_2021] for more extensive details.
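To make the spatial operations described above more concrete, the following is a minimal Python sketch of three of them on axis-aligned bounding boxes: area, the spatial exists check on an intersection, and the Euclidean distance between centroids. The `Box` type and all function names are hypothetical and are not part of libPerceMon; only the centroid is used as a reference point, and image coordinates with the origin at the top-left corner are assumed.

```python
from dataclasses import dataclass
from math import hypot
from typing import Optional, Tuple


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box (hypothetical type, image coordinates)."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float

    def area(self) -> float:
        # Degenerate boxes have zero area rather than a negative one.
        return max(0.0, self.xmax - self.xmin) * max(0.0, self.ymax - self.ymin)

    def centroid(self) -> Tuple[float, float]:
        return ((self.xmin + self.xmax) / 2.0, (self.ymin + self.ymax) / 2.0)


def intersect(a: Box, b: Box) -> Optional[Box]:
    """Return the intersection of two boxes, or None if it is empty."""
    xmin, ymin = max(a.xmin, b.xmin), max(a.ymin, b.ymin)
    xmax, ymax = min(a.xmax, b.xmax), min(a.ymax, b.ymax)
    if xmin >= xmax or ymin >= ymax:
        return None
    return Box(xmin, ymin, xmax, ymax)


def spatial_exists(region: Optional[Box]) -> bool:
    """Spatial exists: true when the region is non-empty."""
    return region is not None and region.area() > 0.0


def centroid_distance(a: Box, b: Box) -> float:
    """Euclidean distance between the centroids of two boxes."""
    (ax, ay), (bx, by) = a.centroid(), b.centroid()
    return hypot(ax - bx, ay - by)


car = Box(0.0, 0.0, 4.0, 2.0)
pedestrian = Box(3.0, 1.0, 5.0, 3.0)
print(spatial_exists(intersect(car, pedestrian)))  # True: the boxes overlap
print(centroid_distance(car, pedestrian))
```

Restricting the sketch to rectangles keeps intersection closed under the operation; unions and complements of boxes are generally not rectangles and would need a richer region representation.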
3 PerceMon: An Online Monitoring Tool
PerceMon is an online monitoring tool for STQL specifications. It computes the quality of a formula at the current evaluation frame, provided that the formula can be evaluated using a finite number of frames from the past (the history) and delayed frames from the future (the horizon).
The core of the tool consists of a C++ library, libPerceMon, which provides an interface to define an STQL abstract syntax tree efficiently, along with a general online monitor interface. The PerceMon tool works by initializing a monitor with a given specification and can receive data in a frame-by-frame manner. It stores the frames in a first-in-first-out (FIFO) buffer with maximum size defined by the horizon and history requirement of the specification. This enables fast and efficient computation of the quality of the formula for the bounded horizon. An overview of the architecture can be seen in (a).
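The buffering scheme described above can be sketched in a few lines of Python. This is an illustration only, not the libPerceMon implementation: the class name `FrameBuffer`, its methods, and the choice of `history + horizon + 1` as the buffer capacity are assumptions made for this sketch.

```python
from collections import deque
from typing import Any, Deque, Optional


class FrameBuffer:
    """Bounded FIFO of frames for a formula with known history and horizon."""

    def __init__(self, history: int, horizon: int) -> None:
        self.history = history
        self.horizon = horizon
        # Assumed capacity: past frames + the evaluation frame + future frames.
        self._frames: Deque[Any] = deque(maxlen=history + horizon + 1)

    def __len__(self) -> int:
        return len(self._frames)

    def push(self, frame: Any) -> None:
        # Once the deque is full, appending silently drops the oldest frame.
        self._frames.append(frame)

    def ready(self) -> bool:
        # An evaluation frame exists once `horizon` newer frames have arrived.
        # Early in the stream the history may still be only partially filled.
        return len(self._frames) > self.horizon

    def evaluation_frame(self) -> Optional[Any]:
        """Frame being evaluated: `horizon` frames behind the newest one."""
        if not self.ready():
            return None
        return self._frames[-1 - self.horizon]


buf = FrameBuffer(history=2, horizon=1)
for frame_number in range(5):
    buf.push(frame_number)
print(len(buf), buf.evaluation_frame())  # 4 3
```

Because the capacity is fixed by the specification, memory use stays constant however long the stream runs, and each evaluation touches at most a bounded window of frames.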
The library, libPerceMon, is designed to be used with wrappers that convert application-specific data to data structures supported by the library (signified by the “Frontend” block in the architecture presented in (a)). In the subsequent section, we show an example of how such an integration can be performed by interfacing libPerceMon with the CARLA autonomous vehicle simulator [dosovitskiy_carla_2017] via the ROS middleware platform [quigley_ros_2009].
3.1 Integration with CARLA and ROS
In this section, we present an integration of the PerceMon tool with the CARLA autonomous vehicle simulator [dosovitskiy_carla_2017] using the ROS middleware platform [quigley_ros_2009]. This follows the example of [dreossi_verifai_2019] and [zapridou_runtime_2020] which interface with CARLA, and [nickovic_rtamt_2020], where the tool interfaces with the ROS middleware platform for use in online monitoring applications.
The CARLA simulator is an autonomous vehicle simulation environment that uses high-quality graphics engines to render photo-realistic scenes for testing such vehicles. Pairing this with ROS allows us to abstract the data generated by the simulator, the PerceMon monitor, and various perception modules as streams of data or topics in a publisher-subscriber network model. Here, a publisher broadcasts data in a known binary format at an endpoint (called a topic) without knowing who listens to the data. Meanwhile, a subscriber registers to a specific topic and listens to the data published on that endpoint.
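The publisher-subscriber pattern described above can be illustrated with a toy in-process broker. This sketch deliberately does not use the ROS API; the `Broker` class and the topic names are hypothetical and serve only to show that publishers and subscribers are coupled through topic names alone.

```python
from collections import defaultdict
from typing import Any, Callable, DefaultDict, List

Callback = Callable[[Any], None]


class Broker:
    """Toy in-process message broker; this is not the ROS API."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Callback]] = defaultdict(list)

    def subscribe(self, topic: str, callback: Callback) -> None:
        # A subscriber registers interest in one specific topic.
        self._subscribers[topic].append(callback)

    def publish(self, topic: str, message: Any) -> None:
        # The publisher does not know who, if anyone, is listening.
        for callback in list(self._subscribers[topic]):
            callback(message)


broker = Broker()
received: List[Any] = []

# A monitor-like subscriber listening on a hypothetical topic name.
broker.subscribe("/detections", received.append)

# A detector-like publisher broadcasting on the same topic.
broker.publish("/detections", {"frame": 0, "objects": ["pedestrian"]})
# Messages on other topics never reach the subscriber.
broker.publish("/unrelated", {"frame": 0})

print(received)  # [{'frame': 0, 'objects': ['pedestrian']}]
```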
In our framework, we use the ROS wrapper for CARLA (https://github.com/carla-simulator/ros-bridge/) to publish all the information from the simulator, including data from the cameras on the autonomous vehicle. The image data is used by perception modules, such as the YOLO object detector [redmon_you_2016] and the DeepSORT object tracker [wojke_simple_2017], to publish processed data. The information published by these perception modules can in turn be used by other perception modules (for example, a tracker that consumes detected objects), by controllers (which may try to avoid collisions), and by PerceMon online monitors. The architectural overview can be seen in (b).
The use of ROS allows us to reason about data streams independently of the programming languages in which the perception modules are implemented. For example, the main implementation of the YOLO object detector is written in C/C++ using a custom deep neural network framework called Darknet [darknet13], while the DeepSORT object tracker is implemented in Python. The custom detection formats from each of these algorithms can be converted into standard messages that are published on predefined topics, to which PerceMon then subscribes. Moreover, this also paves the way to migrate and apply PerceMon to other applications that use ROS for perception-based control, for example, the software stack deployed on real-world autonomous vehicles [kato_autoware_2018].
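As a rough illustration of converting module-specific outputs into one common message, the sketch below normalizes two box encodings into a single record. Everything here is an assumption made for illustration: the `Detection` type is not an actual ROS message, and the two input formats (a normalized center-based box and a pixel-based top-left box with a track ID) are stand-ins for whatever formats the real modules emit.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Detection:
    """Hypothetical common message; not an actual ROS message type."""
    label: str
    confidence: float
    box: Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax), pixels
    track_id: Optional[int] = None


def from_center_format(label: str, confidence: float,
                       cx: float, cy: float, w: float, h: float,
                       img_w: int, img_h: int) -> Detection:
    """Convert a normalized (cx, cy, w, h) box into pixel corner coordinates."""
    xmin = (cx - w / 2.0) * img_w
    ymin = (cy - h / 2.0) * img_h
    xmax = (cx + w / 2.0) * img_w
    ymax = (cy + h / 2.0) * img_h
    return Detection(label, confidence, (xmin, ymin, xmax, ymax))


def from_top_left_format(track_id: int, label: str, confidence: float,
                         x: float, y: float, w: float, h: float) -> Detection:
    """Convert a pixel (x, y, w, h) box with a track ID into corner coordinates."""
    return Detection(label, confidence, (x, y, x + w, y + h), track_id)


detected = from_center_format("car", 0.9, 0.5, 0.5, 0.25, 0.5, 640, 480)
tracked = from_top_left_format(7, "car", 0.9, 10.0, 20.0, 30.0, 40.0)
print(detected.box)  # (240.0, 120.0, 400.0, 360.0)
print(tracked.box)   # (10.0, 20.0, 40.0, 60.0)
```

Once every module publishes the same record, a monitor needs only one parser regardless of the language or framework behind each module.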
In this section, we present a set of experiments using the integration of PerceMon with the CARLA autonomous car simulator [dosovitskiy_carla_2017] presented in Section 3.1. We build on the ROS-based architecture described in the previous section, and monitor the following perception algorithms:
Object Detection: The YOLO object detector [redmon_you_2016, redmon_yolov3_2018] is a deep convolutional neural network (CNN) based model that takes as input raw images from the camera and outputs a list of bounding boxes, one per detected object in the image.
Object Tracking: The DeepSORT object tracker [wojke_simple_2017] takes the set of detections from the object detector and associates an ID with each of them. It then tries to track each annotated object across frames using Kalman filters and cosine association metrics.
We use the OpenSCENARIO specification format [openscenario_specification] to define scenarios in the CARLA simulation that mimic real-world, accident-prone situations in which deep neural network based perception algorithms have repeatedly failed to detect or track pedestrians, cyclists, and other vehicles. To detect some common failure cases, we initialize the PerceMon monitors with the following specifications:
: For all objects in the current frame that have high confidence, if the object is far away from the margins, then the object must have existed in the previous frame too with sufficiently high confidence.
Object detection algorithms are known to frequently miss detecting objects in consecutive frames or detect them with low confidence after detecting them with high confidence in previous frames. This can cause issues with algorithms that rely on consistent detections, e.g., for obstacle tracking and avoidance. The above formula checks this for objects that we consider “relevant” (using ), i.e., the object is not too far away from the edges of the image. This allows us to filter false alarms from objects that naturally leave the field of view of the camera.
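The property above can be sketched as a direct check over two consecutive frames. This is a hand-written Boolean check in Python, not an STQL formula evaluated by PerceMon. The thresholds `high`, `low`, and `margin` are placeholder values chosen for illustration, and the `Obj` type and function names are hypothetical.

```python
from typing import Dict, NamedTuple


class Obj(NamedTuple):
    confidence: float
    xmin: float
    ymin: float
    xmax: float
    ymax: float


Frame = Dict[int, Obj]  # object ID -> detection


def far_from_margins(o: Obj, width: float, height: float, margin: float) -> bool:
    """True when the box keeps at least `margin` pixels from every image edge."""
    return (o.xmin >= margin and o.ymin >= margin
            and o.xmax <= width - margin and o.ymax <= height - margin)


def consistent_detections(prev: Frame, curr: Frame,
                          width: float, height: float,
                          high: float = 0.8, low: float = 0.7,
                          margin: float = 10.0) -> bool:
    """Check the property for one pair of consecutive frames."""
    for obj_id, o in curr.items():
        # Only confident, "relevant" objects trigger the obligation.
        if o.confidence >= high and far_from_margins(o, width, height, margin):
            before = prev.get(obj_id)
            if before is None or before.confidence < low:
                return False
    return True


prev_frame = {1: Obj(0.90, 100, 100, 200, 200)}
# Object 2 touches the left edge, so its sudden appearance is not flagged.
ok_frame = {1: Obj(0.95, 105, 100, 205, 200), 2: Obj(0.90, 0, 50, 40, 120)}
# Object 3 appears mid-image with high confidence and no prior detection.
bad_frame = {3: Obj(0.90, 300, 300, 350, 350)}

print(consistent_detections(prev_frame, ok_frame, 640, 480))   # True
print(consistent_detections(prev_frame, bad_frame, 640, 480))  # False
```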
Smooth Object Trajectories
: For every object in the current frame, its bounding box and the corresponding bounding box in the previous frame must overlap more than 30%.
In consecutive frames, if detected bounding boxes are sufficiently far apart, it is possible for tracking algorithms that rely on the detections to produce incorrect object associations, leading to poor information for decision-making.
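A minimal sketch of the overlap check follows, again as plain Python rather than as a monitored STQL formula. It assumes that "overlap" means the intersection area divided by the area of the current bounding box, and that objects without a box in the previous frame are skipped; both choices, like the function names, are assumptions of this sketch.

```python
from typing import Dict, Tuple

Box = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


def area(box: Box) -> float:
    xmin, ymin, xmax, ymax = box
    # An empty or inverted box contributes zero area.
    return max(0.0, xmax - xmin) * max(0.0, ymax - ymin)


def overlap_ratio(prev_box: Box, curr_box: Box) -> float:
    """Intersection area divided by the area of the current box."""
    inter = (max(prev_box[0], curr_box[0]), max(prev_box[1], curr_box[1]),
             min(prev_box[2], curr_box[2]), min(prev_box[3], curr_box[3]))
    curr_area = area(curr_box)
    return area(inter) / curr_area if curr_area > 0.0 else 0.0


def smooth_trajectories(prev: Dict[int, Box], curr: Dict[int, Box],
                        threshold: float = 0.3) -> bool:
    """Every object seen in both frames must overlap more than `threshold`."""
    return all(overlap_ratio(prev[obj_id], box) > threshold
               for obj_id, box in curr.items() if obj_id in prev)


previous = {7: (0.0, 0.0, 10.0, 10.0)}
print(smooth_trajectories(previous, {7: (5.0, 0.0, 15.0, 10.0)}))  # True
print(smooth_trajectories(previous, {7: (8.0, 0.0, 18.0, 10.0)}))  # False
```

Other normalizations, such as intersection over union, would give a stricter or looser check for the same 30% threshold.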
We monitor the above properties for the scenarios described in Figure 2 and measure the time it takes to compute their satisfaction values. Each scenario includes some passive, non-adversarial vehicles, and adding more of them increases the number of objects detected by the object detector. Since the runtime of the STQL monitor is exponential in the number of object IDs referenced in the existential quantifiers, varying the number of objects lets us empirically measure how the time to compute the satisfaction value scales. The number of simulated non-adversarial objects is varied from 1 to 10, and the time taken to compute the satisfaction value for each new frame is recorded. We present the results in Table 1, and refer the readers to [hekmatnejad_formalizing_2021] for theoretical results on monitoring complexity for STQL specifications.
| Average Number of Objects | Average Compute Time (s) |
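The dependence of the runtime on the number of quantified object IDs can be illustrated with a brute-force search over assignments. This sketch is not the PerceMon algorithm: it only shows that a naive evaluation of an existential quantifier over k object variables in a frame with n objects may inspect up to n^k assignments. The function names are hypothetical, and the timings it prints are machine-dependent and unrelated to the measurements reported in Table 1.

```python
from itertools import product
from time import perf_counter
from typing import Callable, Iterable


def exists_assignment(object_ids: Iterable[int], num_vars: int,
                      predicate: Callable[..., bool]) -> bool:
    """Brute-force existential quantifier over `num_vars` object variables."""
    return any(predicate(*assignment)
               for assignment in product(list(object_ids), repeat=num_vars))


def count_assignments(num_objects: int, num_vars: int) -> int:
    """Number of assignments in the worst case (no witness exists)."""
    return num_objects ** num_vars


def time_worst_case(num_objects: int, num_vars: int) -> float:
    """Wall-clock time to exhaust all assignments when no witness exists."""
    start = perf_counter()
    exists_assignment(range(num_objects), num_vars, lambda *ids: False)
    return perf_counter() - start


for n in (1, 5, 10):
    print(n, count_assignments(n, 2), time_worst_case(n, 2))
```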
In this paper, we presented PerceMon, an online monitoring library and tool for generating monitors for specifications given in Spatio-temporal Quality Logic (STQL). We also presented a set of experiments that make use of PerceMon’s integration with the CARLA autonomous car simulator and the ROS middleware platform.
In future iterations of the tool, we hope to incorporate a more expressive version of the specification grammar that can reason about arbitrary spatial constructs, including oriented polygons and segmentation regions, and incorporate ways to formally reason about system-level properties (like system warnings and control inputs).
This work was partially supported by the National Science Foundation under grant no. NSF-CNS-2038666 and the tool was developed with support from Toyota Research Institute North America.
Appendix 0.A Semantics for STQL
Consider a data stream consisting of frames containing objects and annotated with a time stamp. Let be the current frame of evaluation, and let denote the frame. We let denote a mapping from a pinned time or frame variable to a frame index (if it exists), and let be a mapping from an object variable to an actual object ID that was assigned by a quantifier. Finally, we let denote the set of object IDs available in the frame , and let output the timestamp of the given frame.
Let be the quality of the formula, , at the current frame , which can be recursively defined as follows:
For the propositional and temporal operations, the semantics simply follows the Boolean semantics for LTL or MTL, i.e.,
For constraints on time and frame variables,
For operations on object variables,
For the area, latitudinal offset, and longitudinal offset,
where, , and
computes the lateral distance of the point of an object identified by from the Longitudinal axis;
computes the longitudinal distance of the point of an object identified by from the Lateral axis; and
is the compound spatial object created after set operations on bounding boxes (defined below).
And, finally, for the spatial existence operator,
Here, the compound spatial function, is defined as follows: