Formal Verification of Robustness and Resilience of Learning-Enabled State Estimation Systems for Robotics

10/16/2020 ∙ by Wei Huang, et al.

This paper presents a formal verification guided approach for a principled design and implementation of robust and resilient learning-enabled systems. We focus on learning-enabled state estimation systems (LE-SESs), which have been widely used in robotics applications to determine the current state (e.g., location, speed, direction, etc.) of a complex system. The LE-SESs are networked systems composed of a set of connected components including Bayes filters for localisation, and neural networks for processing sensory input. We study LE-SESs from the perspective of formal verification, which determines the satisfiability of a system model against the specified properties. Over LE-SESs, we investigate two key properties - robustness and resilience - and provide their formal definitions. To enable formal verification, we reduce the LE-SESs to a novel class of labelled transition systems, named PO2-LTS in the paper, and formally express the properties as constrained optimisation objectives. We prove that the robustness verification is NP-complete. Based on PO2-LTS and the optimisation objectives, practical verification algorithms are developed to check the satisfiability of the properties on the LE-SESs. As a major case study, we interrogate a real-world dynamic tracking system which uses a single Kalman Filter (KF) - a special case of Bayes filter - to localise and track a ground vehicle. Its perception system, based on convolutional neural networks, processes a high-resolution Wide Area Motion Imagery (WAMI) data stream. Experimental results show that our algorithms can not only verify the properties of the WAMI tracking system but also provide representative examples, the latter of which inspired us to take an enhanced LE-SESs design where runtime monitors or joint-KFs are required. Experimental results confirm the improvement of the robustness of the enhanced design.




I Introduction

An autonomous system is a complex, intelligent system that can make decisions according to its internal state and its understanding of the external environment. To meet their design requirements, autonomous systems can be designed and implemented by connecting a number of heterogeneous components [48], forming a networked system. In this paper, we consider a typical class of autonomous systems that have been widely used in robotics applications, i.e., state estimation systems (SESs). An SES is used to determine the current state (e.g., location, speed, direction, etc.) of a dynamic system such as a spacecraft or a ground vehicle. Typical applications of SESs in a robotics context include localisation [38], tracking [16], and control [65]. Moreover, as more and more robotics applications adopt deep neural network components to take advantage of their high prediction precision [18], we focus on those SESs with Deep Neural Network (DNN) components, and call them learning-enabled SESs, or LE-SESs.

Typically, in LE-SESs, neural networks are employed to process perceptional input received via sensors. For example, Convolutional Neural Networks (CNNs) are usually taken to process imagery inputs. For a real-world system – such as the WAMI tracking system which we will study in this paper – the perceptional unit may include multiple neural networks, which interact to implement a complex perceptional function. In addition to the perception unit, LE-SESs use other components – such as Bayes filters – to estimate, update, and predict the state.

However, neural networks have been found to be fragile: for example, they are vulnerable to adversarial attacks [55], i.e., an imperceptibly small but valid perturbation on the input may incorrectly alter the classification output. Several approaches have been developed to study the robustness of neural networks, including adversarial attacks [55, 39, 5], formal verification [23, 27, 62, 13, 30, 60, 43, 61, 44, 29], and coverage-guided testing [53, 40, 51, 32, 52, 21, 20, 36]. All of these contribute to understanding the trustworthiness (i.e. the confidence that a system will provide an appropriate output for a given input) of systems containing deep neural networks; a recent survey can be found in [22]. As will be explained in the paper, for neural networks, robustness is closely related to resilience. However, it is unclear whether this is the case for LE-SESs or, more broadly, for networked systems with learning components. We will show that, for LE-SESs, there are subtle, yet important, differences between robustness and resilience. Generally, robustness is the ability of a system to consistently deliver its expected functionality by accommodating disturbances to the input, while resilience is the ability to handle and recover from challenging conditions including internal failures and external ‘shocks’, maintaining and/or resuming part (if not all) of its designated functionality. Based on this general view, formal definitions of robustness and resilience on the WAMI tracking system are suggested. In our opinion, robustness quantifies the minimum external effort (of e.g., an attacker) to make a significant change to the system’s functionality, here dynamic tracking, and resilience quantifies the supremum (i.e., least upper bound) of the deviation from its normal behaviour from which the system cannot recover. While these two properties are related, we show that they have subtle, yet important, differences, both in their formal definitions and in the experiments. From the outset, we note that the use of the term robustness in this sense differs from that used in traditional safety engineering. Whilst we continue to apply the prevailing use of the term in this paper, we will later argue for alignment with safety engineering to foster the use of Machine Learning (ML) in those applications.

To study the properties of real-world LE-SESs in a principled way, we apply formal verification techniques. Given a formal model of the system and a specification (a formal representation of a property), formal verification either demonstrates that the system satisfies the specification over all possible behaviours, or returns counter-examples when it does not. We adopt this approach to support the necessary identification of risks prior to deployment of safety critical applications. This paper reports the first formal verification approach developed for state estimation systems.

Technically, we first formalise an LE-SES as a novel labelled transition system which has components for payoffs and partial order relations (i.e. relations that are reflexive, antisymmetric and transitive). The labelled transition system is named {PO}-LTS in the paper. Specifically, every transition is attached with a payoff, and for every state there is a partial order relation between its out-going transitions. Second, we show that the verification of the properties, both robustness and resilience, on such a system can be reduced to a constrained optimisation problem. Third, we prove that the verification problem is NP-complete for the robustness property on {PO}-LTS. Fourth, to enable practical verification, we develop an automated verification algorithm.

As a major case study, we work with a real-world dynamic tracking system [65], which detects and tracks ground vehicles over the high-resolution Wide Area Motion Imagery (WAMI) data stream, named WAMI tracking system in this paper. The system is composed of two major components: a state estimation unit and a perceptional unit. The perceptional unit includes multiple, networked CNNs, and the state estimation unit includes one or multiple Kalman filters, which are a special case of Bayes filter. We apply the developed algorithm to the WAMI tracking system to analyse both robustness and resilience, in order to understand whether the system can function well when subject to adversarial attacks on the perceptional neural network components.

The formal verification approach leads to a guided design of the LE-SESs. As the first design, we use a single Kalman filter to interact with the perceptional unit, and our experimental results show that the LE-SES performs very well in a tracking task, when there is no attack on the perceptional unit. However, it may perform less well in some cases when the perceptional unit is under adversarial attack. The returned counterexamples from our verification algorithms indicate that we may improve the safety performance of the system by adopting a better design. Therefore, a second, improved design – with joint-KFs to associate observations and/or a runtime monitor – is taken. Joint-KFs increase the capability of the system in dealing with internal and external uncertainties, and a runtime monitor can reduce some potential risks. We show that in the resulting LE-SES, the robustness is improved, without compromising the precision of the tracking.

The main contributions of this paper are as follows.

  • Robust and resilient LE-SES design: This paper proposes a principled and detailed design of robust and resilient learning-enabled state estimation systems (Section IV).

  • Formal guarantee: The robustness and resilience of the LE-SES are guaranteed by a novel formal verification technique (Sections V, VI, VII).

  • Robustness vs resilience: This paper pioneers the alignment of the definitions of robustness and resilience in LE-SESs with those applied in traditional high-integrity computing. Their similarities and differences are examined both in theory (Section VI) and in experimental evaluation (Section VIII).

In summary, the organisation of the paper is as follows. In the next section, we present preliminaries about neural networks and the Bayes (and Kalman) filters. In Section III, we introduce our first design of the WAMI-tracking system where a single Kalman filter is used. In Section IV, we present our enhanced design with a runtime monitor and/or joint-KFs. The reduction of LE-SES system to {PO}-LTS is presented in Section V. In Section VI, we present a methodological discussion on the difference between robustness and resilience, together with the formalisation of them as optimisation objectives. The automated verification algorithm is presented in Section VII, with the experimental results presented in Section VIII. We discuss some aspects on the definitions of robustness and resilience that are not covered in LE-SESs in Section IX. Finally, we discuss related work in Section X and conclude the paper in Section XI.

II Preliminaries

II-A Convolutional Neural Networks

Let $X$ be the input domain and $Y$ be the set of labels. A neural network $N$ can be seen as a function mapping from $X$ to probabilistic distributions over $Y$. That is, $N(x)$ is a probabilistic distribution, which assigns for each label $y \in Y$ a probability value $(N(x))_y$. We let $f_N: X \to Y$ be a function such that for any $x \in X$, $f_N(x) = \arg\max_{y \in Y} (N(x))_y$, i.e., $f_N(x)$ returns the classification.
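As a minimal sketch of the classification function just described, the snippet below takes a probability vector over labels and returns the label with the highest probability. The function name `classify` and the example probabilities are hypothetical and only serve as illustration.

```python
import numpy as np


def classify(probabilities: np.ndarray) -> int:
    """Return the index of the label with the highest probability."""
    return int(np.argmax(probabilities))


# Hypothetical network output over three labels for a single input.
example_output = np.array([0.1, 0.7, 0.2])
predicted_label = classify(example_output)
```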

II-B Neural Network Enabled State Estimation

We consider a time-series linear state estimation problem that is widely assumed in the context of object tracking. The process model is defined as follows.

$$ s_k = F s_{k-1} + \omega_k \tag{1} $$

where $s_k$ is the state at time $k$, $F$ is the transition matrix, $\omega_k$ is a zero-mean Gaussian noise such that $\omega_k \sim \mathcal{N}(0, Q)$, with $Q$ being the covariance of the process noise. Usually, the states are not observable and need to be determined indirectly by measurement and reasoning. The measurement model is defined as:

$$ z_k = H s_k + \nu_k \tag{2} $$

where $z_k$ is the observation, $H$ is the measurement matrix, $\nu_k$ is a zero-mean Gaussian noise such that $\nu_k \sim \mathcal{N}(0, R)$, and $R$ is the covariance of the measurement noise.

Bayes filters have been used for reasoning about the observations, $\{z_k\}$, with the goal of learning the underlying states $\{s_k\}$. A Bayes filter maintains a pair of variables, $(s_k, P_k)$, over time, denoting the Gaussian estimate and the Bayesian uncertainty, respectively. The basic procedure of a Bayes filter is to use a transition matrix, $F$, to predict the current state, $\hat{s}_k$, given the previous state, $s_{k-1}$. The predicted state $\hat{s}_k$ can be updated into $s_k$ if a new observation, $z_k$, is obtained. In the context of the aforementioned problem, this procedure is iterated for a number of time steps, and is always discrete-time and linear, but subject to noise.

We take the Kalman Filter (KF), one of the most widely used variants of Bayes filter, as an example to demonstrate the above procedure. Let $(s_0, P_0)$ be the initial state, such that $s_0$ and $P_0$ represent our knowledge about the initial estimate and the corresponding covariance matrix, respectively.

First, we perform the state prediction for $k \geq 1$:

$$ \hat{s}_k = F s_{k-1}, \qquad \hat{P}_k = F P_{k-1} F^\top + Q \tag{3} $$

Then, we can update the filter:

$$ s_k = \hat{s}_k + K_k y_k, \qquad P_k = (I - K_k H) \hat{P}_k \tag{4} $$

where

$$ y_k = z_k - H \hat{s}_k, \qquad S_k = H \hat{P}_k H^\top + R, \qquad K_k = \hat{P}_k H^\top S_k^{-1} \tag{5} $$

Intuitively, $y_k$ is usually called “innovation” in signal processing and represents the difference between the real observation and the predicted observation, $S_k$ is the covariance matrix of this innovation, and $K_k$ is the Kalman gain, representing the relative importance of the innovation $y_k$ with respect to the predicted estimate $\hat{s}_k$.
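The prediction and update steps of the Kalman filter can be sketched in a few lines of NumPy. This is a minimal illustration of the standard textbook filter, not the tracking system's actual code; the function names and the one-dimensional toy values are assumptions chosen so that the arithmetic is easy to follow.

```python
import numpy as np


def kf_predict(s, P, F, Q):
    """Prediction step: propagate the estimate and its covariance."""
    s_pred = F @ s
    P_pred = F @ P @ F.T + Q
    return s_pred, P_pred


def kf_update(s_pred, P_pred, z, H, R):
    """Update step: correct the prediction with a new observation z."""
    y = z - H @ s_pred                    # innovation
    S = H @ P_pred @ H.T + R              # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)   # Kalman gain
    s_new = s_pred + K @ y
    P_new = (np.eye(len(s_pred)) - K @ H) @ P_pred
    return s_new, P_new


# Hypothetical 1-D toy problem: static state, unit measurement noise.
F = np.array([[1.0]])
H = np.array([[1.0]])
Q = np.array([[0.0]])
R = np.array([[1.0]])
s, P = np.array([0.0]), np.array([[1.0]])

s_pred, P_pred = kf_predict(s, P, F, Q)
s_new, P_new = kf_update(s_pred, P_pred, np.array([2.0]), H, R)
```

With equal prior and measurement variance, the gain is 0.5, so the updated estimate lies halfway between the prediction and the observation and the variance is halved.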

In a neural network enabled state estimation, a perception system – which may include multiple CNNs – will provide a set of candidate observations $Z_k$, any of which can be chosen as the new observation $z_k$. From the perspective of robotics, $Z_k$ includes a set of possible states of the robot, measured by (possibly several different) sensors at time $k$. These measurements are imprecise, and are subject to noise from both the environment (epistemic uncertainty) and the imprecision of sensors (aleatory uncertainty).

III A Real-World WAMI Dynamic Tracking System

In this part, we present a brief introduction, followed by the technical details, to the real-world WAMI dynamic tracking system that will be used as our major case study. The tracking system requires continuous imagery input from e.g., airborne high-resolution cameras. In the case study, the input is a video, which consists of a finite sequence of WAMI images. Each image contains a number of vehicles. The essential processing chain of the WAMI tracking system is as follows.

  1. Align a set of previous frames with the incoming frame.

  2. Construct the background model of incoming frames using the median frame.

  3. Extract moving objects using background subtraction.

  4. Determine if the moving objects are vehicles by using a Binary CNN.

  5. For complex cases, predict the locations of moving objects/vehicles using a regression CNN.

  6. Track one of the vehicles using a Kalman filter.

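WAMI tracking uses gated nearest neighbour (GNN) to choose the new observation $z_k$: from the set $Z_k$, the one closest to the predicted measurement $H \hat{s}_k$ is chosen, i.e.,

$$ z_k = \arg\min_{z \in Z_k} \| z - H \hat{s}_k \|_p \quad \text{s.t.} \quad \| z - H \hat{s}_k \|_p \leq \tau_k \tag{6} $$

where $\| \cdot \|_p$ is the $L_p$-norm distance ($p = 2$, i.e., Euclidean distance is used in this paper), and $\tau_k$ is the gate value, representing the maximum uncertainty in which the system is able to work.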
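The gated nearest neighbour selection can be sketched as follows. The function name, argument names and example coordinates are hypothetical; the sketch assumes Euclidean distance, as stated in the text, and returns `None` when no candidate falls inside the gate.

```python
import numpy as np


def gated_nearest_neighbour(candidates, predicted_measurement, gate):
    """Pick the candidate closest to the predicted measurement, if within the gate."""
    if len(candidates) == 0:
        return None
    candidates = np.asarray(candidates, dtype=float)
    distances = np.linalg.norm(candidates - predicted_measurement, axis=1)
    best = int(np.argmin(distances))
    # Reject the nearest candidate when it lies outside the gate.
    return candidates[best] if distances[best] <= gate else None


# Hypothetical detections and predicted measurement.
detections = [[0.0, 0.0], [3.0, 4.0], [1.0, 1.0]]
prediction = np.array([0.9, 0.8])
chosen = gated_nearest_neighbour(detections, prediction, gate=1.0)
rejected = gated_nearest_neighbour(detections, prediction, gate=0.1)
```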

Specifically, the WAMI system has the following definitions of and :


where denotes the mean values of two Gaussian stochastic variables, representing the location which is measurable from the input videos, and , representing the velocity which cannot be measured directly, respectively.

In the measurement space, the elements in are not correlated, which makes it possible to simplify the Bayesian uncertainty metric, , that is the trace of the covariance matrix:


Therefore, is partially related to the search range in which observations can be accepted. Normally, will gradually shrink before being bounded – the convergence property of KF as explained below.

III-A Wide-Area Motion Imagery Input

The input to the tracking system is a video, which consists of a finite sequence of images. Each image contains a number of vehicles. Similar to [66], we use the WPAFB 2009 [1] dataset. The images were taken by a camera system with six optical sensors that had already been stitched to cover a wide area of around . The frame rate is 1.25Hz. This dataset includes 1025 frames, which is around 13 minutes of video and is divided into training video ( frames) and testing video ( frames), and where all the vehicles and their trajectories are manually annotated. There are multiple resolutions of videos in the dataset. For the experiment, we chose to use the images, in which the size of vehicles is smaller than pixels. We use to denote the -th frame and the pixel on the intersection of -th column and -th row of .

In the following, we explain how the tracking system works, where video is used as input. In general this is undertaken in two stages: detection and tracking. In Section III-B through to Section III-E, we explain the detection steps, i.e., how to detect a vehicle with CNN-based perception units; this is followed by the tracking step in Section III-F.

III-B Background Construction

Vehicle detection in WAMI video is a challenging task because vehicles show little appearance detail and pixel noise is frequent. It has been discussed in [50, 28] that an appearance-based object detector may cause a large number of false alarms. For this reason, in this paper, we only consider detecting moving objects for tracking.

Background construction is a fundamental step in extracting pixel changes from the input image. The background is built for the current environment from a number of previous frames captured by the moving camera system, through the following steps:

Image registration

Image registration is used to compensate for the camera motion by aligning all the previous frames to the current frame. The key is to estimate a transformation matrix, , which transforms frame to frame using a given transformation function. For the transformation function, we consider projective transformation (or homography), which has been widely applied in multi-perspective geometry, an area where WAMI camera systems are already utilised.

The estimation of is generated by applying feature-based approaches. First of all, feature points from images at frame and , respectively, are extracted by feature detectors (e.g., Harris corner or SIFT-like [31] approaches). Second, feature descriptors, such as SURF [2] and ORB [45], are computed for all detected feature points. Finally, pairs of corresponding feature points between two images can be identified and the matrix can be estimated by using RANSAC [10], which is robust against outliers.

Background Modeling

We generate the background, , for each time , by computing the median image of the previously-aligned frames, i.e.,


In our experiments, we take either or .

Note that, to align the previous frames to the newly received frame, only one image registration process is performed. After obtaining the matrices by processing previous frames, we perform image registration once to get , and then let

Extraction of Potential Moving Objects

By comparing the difference between and the current frame , we can extract a set of potential moving objects by first computing the following set of pixels


and then applying image morphology operation on , where is the set of pixels and is a threshold value to determine which pixels should be considered.
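A minimal sketch of the background modelling and pixel extraction steps is given below. It assumes that the frames have already been aligned, that the comparison is an absolute grey-level difference against a threshold, and it omits the morphology operation; the function names and the toy 2×2 frames are hypothetical.

```python
import numpy as np


def median_background(aligned_frames):
    """Pixel-wise median of previously aligned frames."""
    return np.median(np.stack(aligned_frames, axis=0), axis=0)


def changed_pixels(frame, background, threshold):
    """Boolean mask of pixels whose absolute difference exceeds the threshold.

    Using the absolute difference is an assumption made for this sketch.
    """
    return np.abs(frame.astype(float) - background) > threshold


# Hypothetical aligned history of three tiny grey-level frames.
history = [np.full((2, 2), v, dtype=float) for v in (10.0, 12.0, 50.0)]
background = median_background(history)

current = np.array([[12.0, 12.0], [12.0, 100.0]])
mask = changed_pixels(current, background, threshold=20.0)
```

The median is used rather than the mean so that a single outlying frame (the value 50 here) does not distort the background.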

III-C CNN for Detection Refinement

After obtaining , we develop a CNN, , to detect vehicles. We highlight a few design decisions. The major causes of false alarms generated by the background subtraction are: poor image registration, light changes and the parallax effect in high objects (e.g., buildings and trees). We emphasise that the objects of interest (e.g., vehicles) mostly, but not exclusively, appear on roads. Moreover, we perceive that a moving object generates a temporal pattern (e.g., a track) that can be exploited to discern whether or not a detection is an object of interest. Thus, in addition to the shape of the vehicle in the current frame, we assert that the historical context of the same place can help to distinguish the objects of interest and false alarms.

By the above observations, we create a binary classification CNN to predict whether a pixels window contains a moving object given aligned image patches generated from the previous frames. The pixels window is identified by considering the image patches from the set . We suggest in this paper, as it is the maximum time for a vehicle to cross the window. The input to the CNN is a matrix and the convolutional layers are identical to traditional 2D CNNs, except that the three colour channels are substituted with grey-level frames.

Essentially, acts as a filter to remove, from , objects that are unlikely to be vehicles. Let be the obtained set of moving objects. If the size of an image patch in is similar to a vehicle, we directly label it as a vehicle. On the other hand, if the size of the image patch in is larger than a vehicle, i.e., there may be multiple vehicles, we pass this image patch to the location prediction for further processing.

III-D CNN for Location Prediction

We use a regression CNN to process image patches passed over from the detection refinement phase. As in [28], a regression CNN can predict the locations of objects given spatial and temporal information. The input to is similar to the classification CNN described in Section III-C, except that the size of the window is enlarged to . The output of is a -dimensional vector, equivalent to a down-sampled image ( ) for reducing computational cost.

For each image, we apply a filter to obtain those pixels whose values are greater than not only a threshold value but also the values of its adjacent pixels. We then obtain another image with a few bright pixels, each of which is labelled as a vehicle. Let be the set of moving objects updated from after applying location prediction.
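The peak-picking filter described above can be sketched as a local-maximum search. The sketch assumes that "adjacent pixels" means the 8-neighbourhood and that border pixels are compared only against neighbours inside the image; the function name and the toy image are hypothetical.

```python
import numpy as np


def local_peaks(image, threshold):
    """Return (row, col) of pixels above the threshold and above all 8 neighbours."""
    # Pad with -inf so that border pixels only compete with real neighbours.
    padded = np.pad(image.astype(float), 1, mode="constant",
                    constant_values=-np.inf)
    peaks = []
    rows, cols = image.shape
    for r in range(rows):
        for c in range(cols):
            window = padded[r:r + 3, c:c + 3]
            centre = padded[r + 1, c + 1]
            neighbours = np.delete(window.flatten(), 4)  # drop the centre
            if centre > threshold and np.all(centre > neighbours):
                peaks.append((r, c))
    return peaks


# Hypothetical down-sampled regression output.
heatmap = np.zeros((4, 4))
heatmap[0, 0] = 0.8
heatmap[2, 2] = 0.9
heatmap[3, 3] = 0.4   # above threshold but next to a brighter pixel
vehicles = local_peaks(heatmap, threshold=0.3)
```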

III-E Detection Framework

Fig. 1: The architecture of the vehicle detector.

The processing chain of the detector is shown in Figure 1(a). At the beginning of the video, the detector takes the first frames to construct the background, thus the detections from frame can be generated. After the detection process finishes in each iteration, it is added to the template of previous frames. The updating process substitutes the oldest frame with the input frame. This is to ensure that the background always considers the latest scene, since the frame rate is usually low in WAMI videos such that parallax effects and light changes can be pronounced. As we wish to detect very small and unclear vehicles, we apply a small background subtraction threshold and a minimum blob size. This, therefore, leads to a huge number of potential blobs. The classification CNN is used to accept a number of blobs. As mentioned in Section III-C, the CNN only predicts if the window contains a moving object or not. According to our experiments, the cases where multiple blobs belong to one vehicle and one blob includes multiple vehicles, occur frequently. Thus, we design two corresponding scenarios: the blob is very close to another blob(s); the size of the blob is larger than . If any blob follows either of the two scenarios, we do not consider the blob for output. The regression CNN (Section III-D) is performed on these scenarios to predict the locations of the vehicles in the corresponding region, and a default blob will be given. If the blob does not follow any of the scenarios, this blob will be outputted directly as a detection. Finally, the detected vehicles include the output of both sets.

III-F Object Tracking

III-F1 Problem Statement

We consider a single target tracker (i.e. Kalman filter) to track a vehicle given all the detection points over time in the field of view. The track is initialised by manually giving a starting point and a zero initial velocity, such that the state vector is defined as where is the coordinate of the starting point. We define the initial covariance of the target, , which is the initial uncertainty of the target’s state. (Footnote 1: With this configuration it is not necessary for the starting point to be a precise position of a vehicle, and the tracker will find a proximate target to track. However, it is possible to define a specific velocity and reduce the uncertainty in , so that a particular target can be tracked.)

A near-constant velocity model is applied as the dynamic model in the Kalman filter, which is defined as follows, by concretising Equation (1).



is a identity matrix,

is a zero matrix, and

and are as defined in Section II, such that the covariance matrix of the process noise is defined as follows:


where is the time interval between two frames and is a configurable constant. is suggested for the aforementioned WAMI video.

Next, we define the measurement model by concretising Equation (2):


where is the measurement representing the position of the tracked vehicle, and and are as defined in Section II. The covariance matrix, , is defined as , where we suggest for the WAMI video.

Since the camera system is moving, the position should be compensated for such motion using the identical transformation function for image registration. However, we ignore the influence to the velocity as it is relatively small, but consider integrating this into the process noise.
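To make the near-constant velocity model concrete, the sketch below builds one common textbook form of the four matrices for a state made of a 2-D position followed by a 2-D velocity. The block structure of the process noise covariance (a continuous white-noise acceleration model) and the isotropic measurement noise are assumptions, as are the names `sigma_q` and `sigma_r` and the value 1.0 used for both; only the time interval follows from the 1.25Hz frame rate quoted earlier.

```python
import numpy as np


def near_constant_velocity_model(dt, sigma_q, sigma_r):
    """Build F, Q, H, R for a state [x, y, vx, vy] (assumed textbook form)."""
    I2, O2 = np.eye(2), np.zeros((2, 2))
    # Position advances by velocity * dt; velocity stays constant.
    F = np.block([[I2, dt * I2],
                  [O2, I2]])
    # Assumed white-noise acceleration form of the process noise covariance.
    Q = sigma_q ** 2 * np.block([[dt ** 3 / 3 * I2, dt ** 2 / 2 * I2],
                                 [dt ** 2 / 2 * I2, dt * I2]])
    # Only the position is measured.
    H = np.hstack([I2, O2])
    R = sigma_r ** 2 * I2
    return F, Q, H, R


# Frame rate of 1.25Hz gives 0.8 s between frames; sigma values are placeholders.
F, Q, H, R = near_constant_velocity_model(dt=1 / 1.25, sigma_q=1.0, sigma_r=1.0)
```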

III-F2 Measurement Association

During the update step of the Kalman filter, the residual measurement should be calculated by subtracting the predicted measurement ($H \hat{s}_k$) from the measurement ($z_k$). In the tracking system, a GNN is used to obtain the measurement from a set of detections. K-nearest neighbour is first applied to find the nearest detection, $\hat{z}_k$, of the predicted measurement, $H \hat{s}_k$. Then the Mahalanobis distance between $\hat{z}_k$ and $H \hat{s}_k$ is calculated as follows:

$$ D_k = \sqrt{ (\hat{z}_k - H \hat{s}_k)^\top S_k^{-1} (\hat{z}_k - H \hat{s}_k) } $$

where $S_k$ is the innovation covariance, which is defined within the Kalman filter.

A potential measurement is adopted if with in our experiment. If there is no available measurement, the update step will not be performed and the state uncertainty accumulates. It can be noticed that a large covariance leads to a large search window. Because the search window can be unreasonably large, we halt the tracking process when the trace of the covariance matrix exceeds a pre-determined value.
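The Mahalanobis gating and the halting rule can be sketched as below. The function names, the example covariance and the trace limit are hypothetical; the actual gate and halting values used in the experiments are not reproduced here.

```python
import numpy as np


def mahalanobis(z, z_pred, S):
    """Mahalanobis distance between a detection and the predicted measurement."""
    diff = np.asarray(z, dtype=float) - np.asarray(z_pred, dtype=float)
    return float(np.sqrt(diff @ np.linalg.inv(S) @ diff))


def should_halt(P, max_trace):
    """Stop tracking once the state uncertainty (trace of P) grows too large."""
    return float(np.trace(P)) > max_trace


# Hypothetical innovation covariance: more uncertainty along the first axis.
S = np.diag([4.0, 1.0])
d_along_uncertain_axis = mahalanobis([2.0, 0.0], [0.0, 0.0], S)
d_along_certain_axis = mahalanobis([0.0, 2.0], [0.0, 0.0], S)
halt = should_halt(10.0 * np.eye(4), max_trace=30.0)
```

The same Euclidean offset yields a smaller Mahalanobis distance along the more uncertain axis, which is why a large covariance translates into a large search window.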

IV Improvements to WAMI Tracking System

In this section, we introduce two techniques to improve the robustness and resilience of the LE-SESs. One of the techniques uses a runtime monitor to track a convergence property, expressed with the covariance matrix ; and the other considers components to track multiple objects around the primary target to enhance fault tolerance in the state estimation.

IV-A Runtime Monitor for Bayesian Uncertainty

Generally speaking, a KF system includes two phases: prediction (Equation (3)) and update (Equation (4)). Theoretically, a KF system can converge [26] with optimal parameters $F$, $H$, $Q$, and $R$ that describe the problem well. In this paper, we assume that the KF system has been well designed to ensure convergence. Empirically, this has been shown to be possible in many practical systems. We are interested in another characteristic of KF: the uncertainty, $\hat{P}_k$, increases relative to $P_{k-1}$ if no observation is available and thus the update phase is not performed. In such a case, the predicted covariance is treated as the updated covariance in this time-step.

In the WAMI tracking system, when the track does not have associated available observations (e.g., mis-detections) for a certain period of time, the magnitude of the uncertainty metric will be aggregated and finally ‘explode’, and thus the search range of observations is dramatically expanded. This case can be utilised to design a monitor to measure the attack, and therefore should be considered when analysing the robustness and resilience.

The monitor for the Bayesian uncertainty can be designed as follows: when increases, an alarm is set to alert the potential attack. From the perspective of an attacker, to avoid this alert, a successful attack should try to hide the increment of . To understand when the increment may appear for the WAMI tracking system, we recall the discussion in Section III, where a track associates the nearest observation within a pre-defined threshold, in Mahalanobis distance, in each time-step. By attacking all the observations in , we can create the scenario where no observations are within the search range, which mimics the aforementioned case: the Bayesian uncertainty metric increases due to the skipped update phase.

Formally, we define a parameter , defined in (17), which monitors the changes of the Kalman filter’s covariance over time and considers the convergence process.
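A very simple monitor in this spirit is sketched below. It is not the parameter defined in (17): as an assumption, it uses the trace of the covariance matrix as the uncertainty metric and raises an alarm whenever that metric grows from one time-step to the next. The class name and the example covariances are hypothetical.

```python
import numpy as np


class UncertaintyMonitor:
    """Raise an alarm when the uncertainty metric increases between steps."""

    def __init__(self):
        self.previous = None

    def step(self, P) -> bool:
        current = float(np.trace(P))  # assumed metric: trace of the covariance
        alarm = self.previous is not None and current > self.previous
        self.previous = current
        return alarm


# Hypothetical run: uncertainty shrinks, then grows after a skipped update.
monitor = UncertaintyMonitor()
alarms = [monitor.step(scale * np.eye(2)) for scale in (2.0, 1.0, 1.5)]
```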


IV-B Joining Collaborative Components for Tracking

The previous WAMI tracking system malfunctions for two reasons: false alarms and mis-detections. Note that, in this paper, we only track a single target. Using GNN and a Kalman filter is sufficient to deal with false alarms in most cases. Nevertheless, mis-detection still brings significant issues. As mentioned in Section IV-A, mis-detections may cause the Bayesian uncertainty range to expand. This, compounded with the fact that there are usually many detections encountered in the WAMI videos, leads to the possibility that the tracking is switched to another target. In order to cope with this problem, we join collaborative components, an approach we call joint Kalman filters (joint-KFs). More specifically, we track multiple targets near the primary target simultaneously with multiple Kalman filters; the detailed process is as follows:

  • Two kinds of Kalman filter tracks are maintained: one track for the primary target, , and multiple tracks for the refining association, .

  • At each time-step other than the initialisation step, we have predicted tracks , a set of detections from current time-step, and a set of unassociated detections from the previous time-step.

    1. Calculate the likelihoods of all the pairs of detections and tracks using where the parameters can be found in Kalman filter.

    2. Sort the likelihoods from the largest to the smallest, and do a gated one-to-one data association in this order.

    3. Perform standard Kalman filter updates for all the tracks.

    4. For each detection in that is not associated and is located close to the primary target, calculate the distance to each element in .

      • If the distance is smaller than a predefined value, initialize a track and treat the distance as velocity then add this track into .

      • Otherwise, store this detection in

    5. Maintain all the tracks for refining association, : if a track is now far away from the primary target, remove it from the set.

By applying this data association approach, if the primary target is mis-detected, the track will not be associated to a false detection, and even when this occurs for a few time-steps and the search range becomes reasonably large, this system can still remain resilient (i.e. can still function and recover quickly).
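The likelihood-ordered, gated one-to-one association in steps 1 and 2 of the list above can be sketched as follows. The Gaussian form of the likelihood and the use of a minimum likelihood as the gate are assumptions made for illustration; the function names and example values are hypothetical.

```python
import numpy as np


def gaussian_likelihood(z, z_pred, S):
    """Assumed likelihood: Gaussian density of the detection around the prediction."""
    diff = np.asarray(z, dtype=float) - np.asarray(z_pred, dtype=float)
    norm = np.sqrt((2 * np.pi) ** len(diff) * np.linalg.det(S))
    return float(np.exp(-0.5 * diff @ np.linalg.inv(S) @ diff) / norm)


def associate(detections, predictions, covariances, min_likelihood):
    """Greedy one-to-one association, most likely pairs first."""
    scored = [(gaussian_likelihood(z, z_pred, S), t, d)
              for t, (z_pred, S) in enumerate(zip(predictions, covariances))
              for d, z in enumerate(detections)]
    scored.sort(reverse=True)
    assignment, used = {}, set()
    for score, t, d in scored:
        if score < min_likelihood:   # gate: remaining pairs are even less likely
            break
        if t in assignment or d in used:
            continue
        assignment[t] = d
        used.add(d)
    return assignment                # maps track index to detection index


# Hypothetical scene: two tracks, three detections, one far from both tracks.
predictions = [np.array([0.0, 0.0]), np.array([10.0, 10.0])]
covariances = [np.eye(2), np.eye(2)]
detections = [[9.5, 10.0], [0.2, 0.0], [50.0, 50.0]]
assignment = associate(detections, predictions, covariances, min_likelihood=1e-6)
```

Detections left out of `assignment` correspond to the unassociated detections that step 4 either turns into new refining tracks or stores for the next time-step.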

V Reduction of LE-SESs to Labelled Transition Systems

Formal verification requires a formal model of the system, so that all possible behaviour of the system model can be explored to understand whether incorrect behaviour exists. In this section, we will reduce the LE-SESs to a novel class of labelled transition systems – a formal model – such that all safety-related behaviour is preserved. In the next few sections, we will discuss the formalisation of the properties, the automated verification algorithms, and the experimental results, respectively.

V-A Threat Model of Adversarial Attack on Perception System

In Section III, a neural network based perception system determines whether or not there is a vehicle at a location . Let be an image covering the location , a neural network function maps into a Boolean value, , representing whether or not a vehicle is present at location . There are two types of erroneous detection: (1) a wrong classification prediction of the image , and (2) a wrong positioning of a moving object within . We focus on the former since the WAMI tracking system has a comprehensive mechanism to prevent the occurrence of the latter.

Fig. 2: The Workflow of Attack on WAMI System.

The threat model of an adversary is summarised in Figure 2. Assume that . An adversary must compute another input which requires a payoff and has a different classification, i.e., . Without loss of generality, the payoff is measured with the norm-distance from to its original image , or formally


A large body of adversarial example generation algorithms and adversarial test case generation algorithms is available for deviating from an input image to its adversarial input . Given a neural network and an input , an adversarial algorithm produces an adversarial example such that . On the other hand, for test case generation, an algorithm produces a set of test cases , among which the optimal adversarial test case is such that . We remark that the work in this paper is independent of any particular adversarial algorithm. In our experiments we use two algorithms:

  • DeepFool [33], which finds an adversarial example by projecting onto the nearest decision boundary. The projection process is iterative because the decision boundary is non-linear.

  • DeepConcolic [53], which generates a test suite by applying combined symbolic execution and concrete execution, guided by adapted MC/DC metrics for neural networks [51].
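For intuition on the projection step in DeepFool, consider the special case of an affine binary classifier, where the decision boundary is a hyperplane and a single projection suffices. The sketch below is a minimal illustration under that assumption; the function names, the weights, and the overshoot value are illustrative and are not taken from the WAMI experiments.

```python
import math

def classify(x, w, b):
    """Affine binary classifier: True when w.x + b is positive."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b > 0

def deepfool_affine(x, w, b, overshoot=0.02):
    """Project x onto the hyperplane w.x + b = 0 and step slightly past it.

    For an affine classifier the minimal L2 perturbation is
    r = -(w.x + b) / ||w||^2 * w. A non-linear network would repeat this
    step on a local linearisation until the predicted label flips.
    """
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    norm_sq = sum(wi * wi for wi in w)
    step = [-(1.0 + overshoot) * (score / norm_sq) * wi for wi in w]
    x_adv = [xi + si for xi, si in zip(x, step)]
    l2_cost = math.sqrt(sum(si * si for si in step))
    return x_adv, l2_cost

w, b = [3.0, 4.0], -5.0
x = [3.0, 4.0]                       # score = 9 + 16 - 5 = 20, positive class
x_adv, l2_cost = deepfool_affine(x, w, b)
```

The small overshoot pushes the point just across the boundary so that the label actually changes; the returned L2 cost plays the role of the adversary's payoff for this input.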

We denote by , the payoff that an adversarial algorithm needs to compute for an adversarial example from and . Furthermore, we assume that the adversary can observe the parameters of the Bayes filter, for example, , of the Kalman filter.
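As a concrete reading of the norm-distance payoff, the following sketch computes the distance between an original image and its adversarial counterpart. It assumes images are given as flat lists of pixel values and that the norm order `p` is a free parameter; the function name `payoff` is hypothetical.

```python
def payoff(original, adversarial, p=2):
    """L_p distance between two equally sized images given as flat pixel lists."""
    if len(original) != len(adversarial):
        raise ValueError("images must have the same number of pixels")
    diffs = [abs(a - o) for o, a in zip(original, adversarial)]
    if p == float("inf"):
        return max(diffs, default=0.0)       # largest single-pixel change
    return sum(d ** p for d in diffs) ** (1.0 / p)

clean = [0.0, 0.0]
perturbed = [3.0, 4.0]
l2 = payoff(clean, perturbed)                    # Euclidean distance
l1 = payoff(clean, perturbed, p=1)               # sum of absolute changes
linf = payoff(clean, perturbed, p=float("inf"))  # maximum change
```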

V-B {PO}-Labelled Transition Systems

Let be a set of atomic propositions. A payoff and partially-ordered labelled transition system, or {PO}-LTS, is a tuple , where is a set of states, is an initial state, is a transition relation, is a labelling function, is a payoff function assigning every transition a non-negative real number, and is a partial order relation between out-going transitions from the same state.
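The tuple can be mirrored by a small data structure. The sketch below is one possible encoding, not the authors' implementation: the class name `PO2LTS`, the field names, and the choice to let the labelling function map each state to a set of atomic propositions are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, Hashable, Set, Tuple

State = Hashable
Transition = Tuple[State, State]

@dataclass
class PO2LTS:
    states: Set[State]
    initial: State
    transitions: Set[Transition]
    labels: Dict[State, FrozenSet[str]]          # assumed: state -> atomic propositions
    payoff: Dict[Transition, float]              # non-negative cost per transition
    # (t1, t2) in order means t1 is ordered before t2; both leave the same state
    order: Set[Tuple[Transition, Transition]] = field(default_factory=set)

    def validate(self) -> None:
        """Check the side conditions stated in the definition."""
        if self.initial not in self.states:
            raise ValueError("initial state must be a state")
        for (src, dst) in self.transitions:
            if src not in self.states or dst not in self.states:
                raise ValueError("transition endpoints must be states")
            if self.payoff[(src, dst)] < 0.0:
                raise ValueError("payoffs must be non-negative")
        for (t1, t2) in self.order:
            if t1 not in self.transitions or t2 not in self.transitions:
                raise ValueError("order relates existing transitions only")
            if t1[0] != t2[0]:
                raise ValueError("ordered transitions must share a source state")

    def successors(self, state: State) -> Set[State]:
        return {dst for (src, dst) in self.transitions if src == state}

lts = PO2LTS(
    states={"s0", "s1a", "s1b"},
    initial="s0",
    transitions={("s0", "s1a"), ("s0", "s1b")},
    labels={"s0": frozenset(), "s1a": frozenset({"on_track"}), "s1b": frozenset()},
    payoff={("s0", "s1a"): 0.0, ("s0", "s1b"): 1.5},
    order={(("s0", "s1a"), ("s0", "s1b"))},
)
lts.validate()
```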

V-C Reduction of WAMI Tracking to {PO}-LTS

We model a neural network enabled state estimation system as a {PO}-LTS. A brief summary of some key notations in this paper is provided in Table I. We let each pair be a state, and use the transition relation to model the transformation from a pair to another pair in a Bayes filter. We have the initial state by choosing a detected vehicle on the map. From a state and a set of candidate observations, we have one transition for each , where can be computed with Equations (3)-(5) by having in Equation (6) as the new observation. For a state , we write to denote the estimate , to denote the covariance matrix , and to denote the new observation that has been used to compute and from its parent state .

Notations Description
observed location by WAMI tracking
an image covering location
neural network function
payoff for algorithm computing an adversarial example from and
a state at step , consisting of estimate and covariance matrix
, and estimate of , covariance matrix of and observed location for transition
a path of consecutive states
alias: the state and the observed location on the path
TABLE I: A Summary of Notations Used

Subsequently, for each transition , its associated payoff is denoted by , i.e., the payoff the adversary incurs when using the algorithm to manipulate (the image covering the observation ) into another image on which the neural network believes there exists no vehicle.

For two transitions and from the same state , we say that they have a partial order relation, written as , if making the new observation requires the adversary to fool the network into misclassifying . For example, in WAMI tracking, according to Equation (6), the condition means that , where is the predicted location.
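In the WAMI setting the order can be read off from distances to the predicted location. The sketch below assumes 2-D locations, Euclidean distance, and a strict comparison; the function name `precedes` and the coordinates are hypothetical.

```python
import math

def precedes(z1, z2, predicted):
    """True if the transition observing z1 is ordered before the one observing z2.

    Assumption: z1 comes first when it is strictly closer to the predicted
    location, so selecting z2 requires the detection at z1 to be hidden.
    """
    return math.dist(z1, predicted) < math.dist(z2, predicted)

predicted = (10.0, 10.0)
candidates = [(10.5, 10.0), (12.0, 10.0), (10.0, 13.0)]
target = candidates[2]
# Detections the adversary must make the network miss before target is chosen.
must_hide = [z for z in candidates if precedes(z, target, predicted)]
```

Here the farthest candidate can only become the new observation once both closer detections have been suppressed.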

Example 1

Figure 3 depicts a tree diagram for the unfolding of a labelled transition system. The root node on top represents the initial state . Each layer comprises all possible states of at step of WAMI tracking, with being one possible estimate, and the covariance matrix. Each transition connects a state at step to at step . are the observed locations at each step by WAMI tracking.

Given a {PO}-LTS , we define a path as a sequence of consecutive states , and as a sequence of corresponding observed locations for , where and are the starting and ending time under consideration, respectively. We write for the state , and for the observed location on the path .

Fig. 3: Tree diagram of an unfolding {PO}-LTS

VI Property Specification: Robustness and Resilience

Formal verification determines whether a specification holds on a given LTS [6]. Usually, a logic language, such as CTL, LTL, or PCTL, is used to formally express the specification . In this paper, to suit our needs, we let the specification be a constrained optimisation objective; and so verification is undertaken in two steps:

  1. determine whether, given and , there is a solution to the constrained optimisation problem. If the answer is affirmative, an optimal solution is returned.

  2. compare with a pre-specified threshold . If then we say that the property holds on the model with respect to the threshold . Otherwise, it fails.

Note that we always take a minimisation objective in the first step. Since the optimisation finds the minimal answer, in the second step, if , no better solution (in terms of a smaller value) exists for the threshold . Intuitively, this guarantees that no attacker can succeed with less cost than , and the system is hence safe with respect to the property. The above procedure can be easily adapted to maximisation objectives.

Before proceeding to the formal definition of the robustness and resilience properties, we need several notations. First, we consider the measure for the loss of localisation precision. Let be an original path that has not suffered from an attack. The other path is obtained after an attack on . For the WAMI tracking system, we define their distance at time as


which is the -norm difference between two images and .

Moreover, let be a transition on an attacked path , and so we have


as the combined payoffs that are required to implement the transition . Intuitively, it requires that all the payoffs of the transitions , which are partially ordered by the envisaged transition , are counted. In the WAMI tracking system, this means that the attack results in misclassifications of all the images such that the observation is closer to the predicted location than .
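The combined payoff of a single transition can be sketched as a sum over all detections that must be suppressed. The code assumes 2-D locations, Euclidean distance to the predicted location, and a strict "closer than" test; `combined_payoff` and `hide_cost` are hypothetical names, with `hide_cost[z]` standing for the payoff of making the network miss the vehicle at `z`.

```python
import math

def combined_payoff(target, candidates, predicted, hide_cost):
    """Cost of forcing `target` to be chosen as the new observation.

    Every candidate strictly closer to the predicted location than the target
    has to be misclassified, so their individual payoffs are added up.
    """
    d_target = math.dist(target, predicted)
    return sum(hide_cost[z] for z in candidates
               if z != target and math.dist(z, predicted) < d_target)

predicted = (10.0, 10.0)
candidates = [(10.5, 10.0), (12.0, 10.0), (10.0, 13.0)]
hide_cost = {(10.5, 10.0): 1.5, (12.0, 10.0): 0.5, (10.0, 13.0): 2.0}

cost_closest = combined_payoff(candidates[0], candidates, predicted, hide_cost)
cost_farthest = combined_payoff(candidates[2], candidates, predicted, hide_cost)
```

The closest candidate costs nothing, which matches the reading that a zero payoff corresponds to the unattacked behaviour.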

VI-A Definition of Robustness

Robustness is a concept that has been studied in many fields such as psychology [46], biomedical analysis [58], and chemical analysis [14]. Here, we adopt the general definition of robustness as used in the field of artificial intelligence (we later discuss the difference between this and the definition applied in software engineering):

Robustness is an enforced measure to represent a system’s ability to consistently deliver its expected functionality by accommodating disturbances to the input.

In LE-SESs, we measure how well the system maintains its expected functionality under attack in a given scenario by the distance between two paths: its original path and the attacked path. Formally, given a track and an attacker, we consider the minimal perturbation to the input that can lead to malfunction. Intuitively, the larger the amount of perturbation a system can tolerate, the more robust it is. Let


be the accumulated distance, between the original track and the attacked track , from the start to the end .

Moreover, we measure the disturbances to an LE-SES as the perturbation to its imagery input. Formally, we let


be the accumulated combined payoff for the attacked track between time steps and , such that and . When , there is no perturbation and is the original track .

Finally, we have the following optimisation objective for robustness:


Basically, represents the amount of perturbation to the input, while the malfunctioning of system is formulated as ; that is, the deviation of the attacked track from the original track exceeds a given tolerance .
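Written out, the robustness objective has the following shape. The LaTeX below is a sketch assembled from the verbal definitions in this section; the symbol names ($\rho$ for the original path, $\tilde{\rho}$ for the attacked path, $\varphi$ for the payoff, $l$ and $u$ for the start and end steps, $\epsilon$ for the tolerance) and the exact index ranges are assumptions.

```latex
\begin{aligned}
\underset{\tilde{\rho}}{\text{minimise}} \quad
  & \varphi^{l,u}(\tilde{\rho}) = \sum_{k=l+1}^{u} \varphi(\tilde{\rho}_{k-1}, \tilde{\rho}_{k}) \\
\text{subject to} \quad
  & \mathit{dist}^{\,l,u}(\rho, \tilde{\rho}) = \sum_{k=l}^{u} \mathit{dist}(\rho_{k}, \tilde{\rho}_{k}) > \epsilon
\end{aligned}
```

The objective is the accumulated combined payoff of the attacked track, and the constraint states that the accumulated distance to the original track exceeds the tolerance.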

VI-B Definition of Resilience

For resilience we take an ecological view widely seen in longitudinal population [59], psychological [8], and biosystem [11] studies.

Generally speaking, resilience indicates an innate capability to maintain or recover sufficient functionality in the face of challenging conditions [3] against risk or uncertainty, while keeping a certain level of vitality and prosperity [8].

This definition of resilience does not consider the presence of risk as a parameter [59], whereas the risks usually present themselves as uncertainty with heterogeneity in unpredictable directions including violence [37]. The outcome of resilience is usually evidenced by either: a recovery of the partial functionality, albeit possibly with a deviation from its designated features [49]; or a synthetisation of other functionalities with its adaptivity in congenital structure or inbred nature. In the context of this paper, resilience can therefore be summarised as the system’s ability to continue operation (even with reduced functionality) and recover in the face of adversity. In this light, robustness, inter alia, is a feature of a resilient system. To avoid complicating the discussion, we treat them separately.

In our definition, we take as the signal that the system has recovered to its designated functionality. Intuitively, means that the tracking has already returned to normal (within acceptable deviation) at the time step .

Moreover, we take


to denote the deviation of a path from the normal path . Intuitively, it considers the maximum distance between two locations – one on the original track and the other on the attacked track – of some time step . The notation on denotes the time step corresponding to the maximal value.

Then, the general idea of defining resilience for LE-SESs is that we measure the maximum deviation at some step and want to know if the whole system can respond to the false information, gradually adjust itself, and eventually recover. Formally, taking , we have the following formal definition of resilience:


Intuitively, the general optimisation objective is to minimize the maximum deviation such that the system cannot recover at the end of the track. In other words, the time represents the end of the track, where the tracking functionality should have recovered to a certain level – subject to the loss .

We remark that, for resilience, the definition of “recovery” can be varied. While in Equation (25) we use to denote the success of a recovery, there can be other definitions, for example, asking for a return to some track that does not necessarily have to be the original one, so long as it is acceptable.
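For comparison with robustness, the resilience objective can be sketched in the same style. As before, this is a reading of the verbal definitions rather than a verbatim equation: $\rho$, $\tilde{\rho}$, $l$, $u$ and $\epsilon$ are assumed symbol names, and the final-step deviation is assumed to be the recovery signal.

```latex
\begin{aligned}
\underset{\tilde{\rho}}{\text{minimise}} \quad
  & \mathit{dist}^{\max}(\rho, \tilde{\rho}) = \max_{l \le k \le u} \mathit{dist}(\rho_{k}, \tilde{\rho}_{k}) \\
\text{subject to} \quad
  & \mathit{dist}(\rho_{u}, \tilde{\rho}_{u}) > \epsilon
\end{aligned}
```

The objective is the maximum deviation along the track, and the constraint states that the deviation at the end of the track still exceeds the acceptable loss, i.e., the system has not recovered.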

VI-C Computational Complexity

We study the complexity of the {PO}-LTS verification problem and show that the problem is NP-complete for robustness. Concretely, for membership in NP, a non-deterministic algorithm can choose the states for a linear number of steps and check in linear time whether the constraints are satisfied. Therefore, the problem is in NP. To show NP-hardness, we reduce the knapsack problem, a known NP-complete problem, to the same constrained optimisation problem as Equation (23) on a {PO}-LTS.

The general Knapsack problem is to determine, given a set of items, each with a weight and a value, the number of each item to include in a collection so that the total weight is less than or equal to a given limit and the total value is as large as possible. We consider the 0-1 Knapsack problem, which restricts the number of copies of each kind of item to zero or one. Given a set of $n$ items numbered from 1 up to $n$, each with a weight $w_i$ and a value $v_i$, along with a maximum weight $W$, the aim is to compute

$$\max \sum_{i=1}^{n} v_i x_i$$

subject to

$$\sum_{i=1}^{n} w_i x_i \le W \quad \text{and} \quad x_i \in \{0, 1\},$$

where $x_i$ represents the number of instances of item $i$ to be included in the knapsack. Informally, the problem is to maximize the sum of the values of the items in the knapsack so that the sum of the weights is less than or equal to the knapsack’s capacity.
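To make the source problem of the reduction concrete, the sketch below solves a small 0-1 Knapsack instance by dynamic programming. It assumes integer weights; the function name `knapsack_01` and the sample instance are illustrative. The running time is proportional to the number of items times the capacity, which is pseudo-polynomial and therefore does not conflict with NP-completeness.

```python
def knapsack_01(weights, values, capacity):
    """Return (best_value, chosen_indices) for integer weights."""
    n = len(weights)
    # best[i][c] = best total value using only the first i items within capacity c
    best = [[0] * (capacity + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        w, v = weights[i - 1], values[i - 1]
        for c in range(capacity + 1):
            best[i][c] = best[i - 1][c]                      # skip item i
            if w <= c:                                       # or take it if it fits
                best[i][c] = max(best[i][c], best[i - 1][c - w] + v)
    # Walk back through the table to recover which items were taken.
    chosen, c = [], capacity
    for i in range(n, 0, -1):
        if best[i][c] != best[i - 1][c]:
            chosen.append(i - 1)
            c -= weights[i - 1]
    return best[n][capacity], sorted(chosen)

best_value, chosen = knapsack_01(weights=[1, 3, 4, 5], values=[1, 4, 5, 7], capacity=7)
```

In the reduction, each take-or-skip decision corresponds to choosing one of two outgoing transitions for an item.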

We can construct a {PO}-LTS , where . Intuitively, for every item , we have two states representing whether or not the item is selected, respectively. For the transition relation , we have that , which connects each state of item to the states of the next item . The payoff function is defined as and , for all and , representing that it will take payoff to take the transition in order to add the item into the knapsack, and take payoff to take the other transition in order to not add the item . The partial order relation can be defined as having transition , for all and .

For the specification, we have the following robustness-related optimisation objective.


such that


Recall that, is defined in Equations (22) and (20), and is defined in Equation (21). As a result, the robustness of the model and the above robustness property is equivalent to the existence of a solution to the 0-1 Knapsack problem. This implies that the robustness problem on {PO}-LTSs is NP-complete.

VII Automated Verification Algorithm

An attack on the LE-SESs, as explained in Section V-A, adds perturbations to the input images in order to fool a neural network, which is part of the perception unit, into making wrong detections. On the one hand, these wrong detections will be passed on to the perception unit, which in turn affects the Bayes filter and leads to wrong state estimation; the LE-SES can be vulnerable to such an attack. On the other hand, the LE-SESs may have internal or external mechanisms to tolerate such an attack, and therefore perform well with respect to properties such as robustness or resilience. It is important to have a formal, principled analysis to understand how good an LE-SES is with respect to the properties and whether a designed mechanism is helpful in improving its performance.

We have introduced in Section V how to reduce an LE-SES into a {PO}-LTS and formally express a property – either robustness or resilience – with a constrained optimisation objective based on a path in Section VI. Thanks to this formalism, the verification of robustness and resilience can be outlined using the same algorithm. Now, given a model , an optimisation objective , and a pre-specified threshold , we aim to develop an automated verification algorithm to check whether the model is robust or resilient on the path ; or formally, , where denotes the optimal value obtained from the constrained optimisation problem over and .

The general idea of our verification algorithm is as follows. It first enumerates all possible paths of obtainable by attacking the given path (Algorithm 1), and then determines the optimal solution among the paths (Algorithm 2). Finally, the satisfiability of the property is determined by comparing and .

VII-A Exhaustive Search for All Possible Tracks

The first step of the algorithm proceeds by exhaustively enumerating all possible attacked paths on the {PO}-LTS with respect to . It is not hard to see that the paths form a tree, unfolded from the {PO}-LTS , as illustrated in Figure 3. Since a final deviation is not available until the end of a simulation, the tree has to be fully expanded from the root to the leaves and all the paths explored. Clearly, the time complexity of this procedure is exponential in the number of steps, which is consistent with our complexity result presented in Section VI-C. Specifically, breadth-first search (BFS) is used to enumerate the paths. The details are presented in Algorithm 1.

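Input:  LTS model , , ,
Output:  path set , payoff set
1:  run original path from to
2:  set as root node
3:  for  from to  do
4:     for each node in  do
5:        find potential observations
6:        for each observed location in  do
7:           calculate the next state from the current state and the observed location
8:           calculate the attack payoff
9:           add the next state as a child node of the current node
10:        end for
11:     end for
12:  end for
13:  collect all tree paths from the root node to the leaf nodes into set
14:  run path in set to
15:  calculate the combined payoff for each path
16:  return  ,
Algorithm 1 Exhaustive Search based on BFS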

We need several operation functions on the tree, including (which returns all leaf nodes of the root node), (which associates a node to its parent node), and (which returns all tree paths from the given root node to the leaf nodes).

Lines 2-12 in Algorithm 1 present the procedure of constructing the tree diagram. First, we set the root node (Line 2), that is, we will attack the system from the ()-th state of the original track and enumerate all possible adversarial tracks. At each step , function will list all observations near the predicted location (Line 5). Then, each observation is incorporated with current state for the calculation of the next state (Line 7). If no observation is available or , the KF can still run normally, skipping the update phase. To enable each transition , the partial order relation is followed when attacking the system and recording the payoff (Line 8). Then, the potential is accepted and added as the child node of . Once the tree is constructed, we continue simulating the tracks to the end of time, , (Lines 13-14). Finally, all the paths in set are output along with the attack payoff (Lines 15-16).
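The tree construction can be sketched on a toy model. The code below is not the WAMI tracker: it assumes a scalar Kalman filter with a static motion model, one list of candidate detections per attacked step, and a fixed cost per hidden detection. It also adds a branch in which every detection is hidden and the filter only predicts, which is an assumption based on the remark that the KF can skip the update phase. The continuation of each path to the end of the track is omitted, and all names are hypothetical.

```python
from collections import deque

def kf_step(est, var, z, q=0.1, r=0.5):
    """One scalar Kalman step with a static motion model; z=None skips the update."""
    pred_est, pred_var = est, var + q
    if z is None:
        return pred_est, pred_var
    gain = pred_var / (pred_var + r)
    return pred_est + gain * (z - pred_est), (1.0 - gain) * pred_var

def enumerate_paths(root, detections_per_step, hide_cost):
    """Breadth-first unfolding with one child per candidate observation.

    Returns (path, payoff) pairs, where path is a list of (estimate, variance)
    states and payoff accumulates the cost of hiding every closer detection.
    """
    frontier = deque([([root], 0.0)])
    for detections in detections_per_step:
        next_frontier = deque()
        while frontier:
            path, paid = frontier.popleft()
            est, var = path[-1]
            # With a static model the predicted location equals the estimate.
            ranked = sorted(detections, key=lambda z: abs(z - est))
            hidden_cost = 0.0
            for z in ranked:
                child = kf_step(est, var, z)
                next_frontier.append((path + [child], paid + hidden_cost))
                hidden_cost += hide_cost[z]   # farther choices also require hiding z
            # Assumed extra branch: all detections hidden, prediction only.
            child = kf_step(est, var, None)
            next_frontier.append((path + [child], paid + hidden_cost))
        frontier = next_frontier
    return list(frontier)

costs = {0.1: 1.0, 2.0: 1.0, 0.2: 1.0, 2.5: 1.0}
paths = enumerate_paths((0.0, 1.0), [[0.1, 2.0], [0.2, 2.5]], costs)
```

With two detections per step and two attacked steps, the tree has nine leaves, and exactly one of them (always taking the closest detection) has zero payoff, which illustrates the exponential growth noted above.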

VII-B Computing an Optimal Solution to the Constrained Optimisation Problem

After enumerating all possible paths in , we can compute optimal solutions to the constrained optimisation problems in Equations (23) and (25). We let be the objective function to minimize, and be the constraints to follow. For robustness, we have and , and for resilience, we have and .

Input:  path set , , ,
Output:  representative path , value of , and optimal value
1:  find the original path in set
2:  , , ,
3:  for  in set  do
4:     if  then
7:        if  or  then
9:        end if
10:     else
12:     end if
13:  end for
16:  return  , ,
Algorithm 2 Computation of Optimal Solution and A Representative Path

Note that, our definitions in Equations (23) and (25) are set to work with cases that do not satisfy the properties, i.e., paths that are not robust or resilient, and identify the optimal one from them. Therefore, a path satisfying the constraints suggests that it does not satisfy the property. We split the set of paths into two subsets, and . Intuitively, includes those paths satisfying the constraints, i.e., fail to perform well with respect to the property, and includes those paths that do not satisfy the constraints, i.e., perform well with respect to the property. For robustness, includes paths satisfying and satisfying . For resilience, includes paths satisfying and satisfying .

In addition to the optimal solutions that, according to the optimisation objectives, are some of the paths in , it is useful to identify certain paths in that are robust or resilient. Let be the optimal value, from the optimal solution. We can sort the paths in according to their value, and let representative path be the path whose value is the greatest among those smaller than . Intuitively, represents the path that is closest to the optimal solution of the optimisation problem but satisfies the corresponding robust/resilient property. This path is representative because it serves as the worst case scenario for us to exercise the system’s robust property and resilient property respectively. Moreover, we let be the value of robustness/resilience of the path , called representative value in the paper.

The algorithm for the computation of the optimal solution and a representative path can be found in Algorithm 2. Lines 1 to 9 give the process to solve Equation (23) or (25) for the optimal value . The remaining lines calculate the representative value and a representative path .
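The selection logic can be sketched independently of how the paths were produced. The code below is a simplified reading of the procedure, not a line-by-line transcription of Algorithm 2: paths are plain dictionaries, the objective and constraint are passed in as functions, the comparison with the threshold is assumed to be strict, and an empty set of constraint-satisfying paths is assumed to yield an infinite optimal value. All names are hypothetical.

```python
import math

def optimal_and_representative(paths, objective, constraint):
    """Pick the optimal failing path and the closest passing path below it.

    constraint(path) is True when the path satisfies the optimisation
    constraint, i.e. it does not perform well with respect to the property.
    """
    failing = [p for p in paths if constraint(p)]
    passing = [p for p in paths if not constraint(p)]
    optimal = min(failing, key=objective) if failing else None
    optimal_value = objective(optimal) if failing else math.inf
    # Representative: greatest objective value among passing paths below the optimum.
    below = [p for p in passing if objective(p) < optimal_value]
    representative = max(below, key=objective) if below else None
    return optimal, optimal_value, representative

def property_holds(optimal_value, threshold):
    """Assumed decision rule: no attack succeeds at a cost of threshold or less."""
    return optimal_value > threshold

# Robustness instance: minimise payoff subject to accumulated distance > tolerance.
tolerance = 5.0
paths = [
    {"payoff": 0.0, "dist": 0.0},   # original track
    {"payoff": 1.0, "dist": 2.0},
    {"payoff": 2.5, "dist": 6.0},
    {"payoff": 3.0, "dist": 9.0},
    {"payoff": 4.0, "dist": 1.0},
]
optimal, optimal_value, representative = optimal_and_representative(
    paths,
    objective=lambda p: p["payoff"],
    constraint=lambda p: p["dist"] > tolerance,
)
```

In this instance the cheapest successful attack costs 2.5, so the property holds for any threshold below that value, and the representative path is the passing path with payoff 1.0.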

For each adversarial path in , it is added into either (Line 6) or