As robots become increasingly available and capable, there has been growing interest in mobile robots that can robustly and autonomously navigate uncontrolled environments. Examples include delivery robots, warehouse robots, and home service robots. Deploying robots over extended periods of time and in the open world requires addressing the failures of autonomy that originate from real-world uncertainty and imperfect perception. Continuous operator monitoring, while effective, is cumbersome and thus does not scale to many robots or large environments. An ideal solution to this problem would be to develop competence-aware
agents capable of assessing the probability of successfully completing a given task. Such agents would learn from failures and leverage the acquired knowledge when planning to improve their robustness and reliability. Previous efforts towards competence-aware path planning and motion planning either rely solely on statistical analysis of logged instances of failures in the configuration space of the robot, and thus do not benefit from the sensing information collected by the robot, or are application-specific and designed to reduce the probability of failure for a specific perception module such as visual SLAM. While there has been progress on introspective perception to enable perception algorithms to learn to predict their sources of errors [3, 4], the outputs of such algorithms have not yet been exploited in robot planning.
We present competence-aware path planning via introspective perception (CPIP), a general framework that bridges the gap between path planning and introspective perception and allows the robot to iteratively learn and exploit task-level competence in novel deployment environments. CPIP models the path planning problem as a Stochastic Shortest Path (SSP) problem and builds a model that represents both the topological map of the environment and the competence of the robot in traversing each part of the map autonomously. CPIP leverages introspective perception to predict the task-level competence of the robot in novel deployment environments, and employs a Bayesian approach to update its estimate of the robot's competence online during deployment. CPIP then uses this information to plan paths that reduce the risk of failure.
Our experimental results demonstrate that CPIP converges to the optimal planning policy in novel deployment environments while reducing the frequency of navigation failures compared to state-of-the-art competence-aware path planning algorithms that do not leverage introspective perception.
II Related Work
The idea of integrating perception with planning and control was introduced by pioneering works on active perception, which suggested that the performance of perception can be improved by selecting control strategies that depend on the current state of perception data interpretation as well as the goal of the task [5, 6]. Researchers have applied this idea to various levels of control, ranging from active vergence control for a stereo pair of cameras to object manipulation given the next best view for surface reconstruction of unknown objects.
One line of work predicts and avoids degradation of perception performance given features extracted from the raw sensory data. Costante et al. propose a perception-aware path planner for MAVs that maximizes the information gain from image matching while solving for dense V-SLAM. Sadat et al. and Deng et al. follow a similar approach and use an RRT* planner where the cost of a path is defined as a linear combination of the length of the path and the predicted density of image features along the path, in order to reduce localization errors. In these works, the path planner's cost function is designed specifically to address the reliability of V-SLAM and is not generalizable to arbitrary perception tasks. Moreover, the competence estimates of perception are obtained either via hand-crafted metrics or by means of the Cramér-Rao lower uncertainty bound, which can be overconfident. Saxena et al. relax the need for hand-crafted measures of perception reliability by learning to predict failures of perception from the raw sensory data; however, the predicted failures are used to trigger an enumerated set of recovery actions rather than to proactively generate plans that reduce the probability of failures. Similarly, Gurău et al. leverage image data and location-specific features to do reactive planning, selecting between autonomous operation and supervised autonomy at any point in time.
A different line of work on competence-aware path planning takes a more holistic view of failures: it keeps track of all of the robot's failures regardless of which perception algorithm caused them, and then leverages this information to proactively generate plans with a reduced risk of failure. Lacerda et al. aggregate the failure instances of a service mobile robot navigating the environment to model the probability of success for traversing each edge of a topological map using an MDP, and generate navigation policies that prefer paths with high success probabilities. Krajník et al. use a spectral model to learn mid- to long-term environmental changes, assuming they have a periodic nature, and exploit it to improve robot navigation and localization by predicting such changes. Vintr et al. use a similar approach to learn a spatio-temporal model for predicting the presence of humans in the robot's deployment environment at different times of the day. Since these methods are based on statistical analysis of the frequency of navigation failures, they require ample experience and several samples from any location in the map in order to achieve an accurate estimate of the robot's competence in navigating that specific location. Moreover, because they use location-specific features of the environment to estimate the robot's competence, these estimates cannot be generalized to novel deployment environments. Basich et al. further expand the concept of competence to the optimal level of autonomy and define a stochastic model for solving the path planning problem, where the generated plans consist of a path and the optimal level of autonomy for each segment of the path. In order to learn to predict the probability of failure at each level of autonomy, this work requires a curated list of environmental features that are potentially correlated with robot failures.
In this work, we leverage machine-learned models capable of predicting errors of individual perception modules to reach an accurate estimate of the robot's competence in successfully navigating throughout an environment. CPIP uses this estimate of competence to plan reliable and short-duration paths. Our work is similar to [13, 16] in that it reasons about the competence of the robot in successfully performing navigation tasks at a topological map level; however, it removes the need for an enumerated list of perception-related features by automatically learning to extract such features from the raw sensory data. Furthermore, CPIP significantly reduces the frequency of experienced failures in new environments by exploiting the generalizable learned perception features, as opposed to merely relying on statistical analysis of the locations of previously experienced navigation failures.
III CPIP Definition
CPIP is a framework for integrating path planning with introspective perception in life-long learning settings. It is defined as a tuple consisting of a stochastic planning model, a set of introspective perception modules, and a task-level competence predictor. CPIP leverages introspective perception and the competence predictor model to predict the probability of task-level failures given the raw sensory data at every time step, and uses these estimates to update the planning model iteratively during robot deployments, hence learning policies that reduce the probability of failures. In section IV, we introduce the planning model and explain how it incorporates the probability of autonomous navigation failure in path planning. In section V, we then explain introspective perception and the competence predictor model, and how they are used to structure the problem of learning to predict instances of navigation failures.
IV Competence-Aware Planning
The CPIP planning model uses a representation of the environment that includes both the connectivity of a set of sparse locations on the map and the probability of successful traversal between each pair of connected neighboring locations. In this section, we explain this model and how it is actively updated during deployments.
IV-A Planning Model Description
The input to our problem is a topological map of the environment in the form of a directed graph G = (V, E), composed of a set of nodes V and a set of edges E. Each node represents a location, and each edge e ∈ E is defined by a tuple (u, v, t_e, p_e), where u is the starting vertex, v is the ending vertex, t_e is the expected traversal time for the edge, and p_e is the probability of successfully traversing it.
Given the topological map, we model the planning problem as a Stochastic Shortest Path (SSP) problem, a formal decision-making model for reasoning in stochastic environments where the objective is to find the least-cost path from a start state to a goal state. An SSP is a tuple ⟨S, A, T, C, s0, S_G⟩, where S is a finite set of states, A is a finite set of actions, T(s' | s, a) represents the probability of reaching state s' after performing action a in state s, C(s, a) represents the expected immediate cost of performing action a in state s, s0 ∈ S is an initial state, and S_G ⊆ S is a finite (possibly singleton) set of goal states, each of which is absorbing and cost-free, i.e., T(s_g | s_g, a) = 1 and C(s_g, a) = 0 for all s_g ∈ S_G and a ∈ A.
A solution to an SSP is a policy π : S → A that indicates that action π(s) should be taken in state s. A policy π induces the value function V^π(s), which represents the expected cumulative cost of reaching a goal state from state s when following π. An optimal policy π* minimizes the expected cumulative cost from the initial state s0.
In our problem, the state set S comprises the map nodes together with a finite set of failure states, and the action set A comprises the directed edges of the graph together with a finite set of recovery actions. The transition function T is determined by the probability p_e of successfully traversing edge e: taking the action corresponding to an edge e = (u, v) from node u reaches v with probability p_e and the associated failure state with probability 1 − p_e, while the transition probability is zero for actions that do not correspond to an edge leaving the current state. In a failure state, only the recovery actions are applicable, and executing one returns the robot to the node at which the failure occurred. The cost C(s, a) is set to the expected traversal time t_e if a corresponds to an edge e, and to the expected recovery cost otherwise. Figure 1 illustrates the planning MDP for an example urban environment. During robot deployments, this transition function is updated to reflect the latest belief over the probability of navigation failures in traversing each edge on the map, or equivalently the probability of successful traversals. Next, we explain the method for updating the transition function.
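To make the planning model concrete, the following is a minimal sketch, not the paper's implementation, that builds a small SSP from a topological map and solves it with value iteration. All node names, edge probabilities, traversal times, and the recovery cost are invented for illustration:

```python
def build_ssp(edges, recovery_cost=30.0):
    """edges: dict (u, v) -> (t_e, p_e). One failure state per edge."""
    T = {}  # (state, action) -> list of (next_state, probability)
    C = {}  # (state, action) -> expected immediate cost
    for (u, v), (t_e, p_e) in edges.items():
        fail = ("fail", u, v)
        # Traversing edge (u, v) succeeds with probability p_e.
        T[(u, (u, v))] = [(v, p_e), (fail, 1.0 - p_e)]
        C[(u, (u, v))] = t_e
        # A recovery action returns the robot to the edge's start node.
        T[(fail, "recover")] = [(u, 1.0)]
        C[(fail, "recover")] = recovery_cost
    return T, C

def value_iteration(T, C, goal, eps=1e-6):
    states = {s for s, _ in T} | {s2 for outs in T.values() for s2, _ in outs}
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            if s == goal:
                continue  # goal is absorbing and cost-free
            actions = [a for (s0, a) in T if s0 == s]
            if not actions:
                continue
            best = min(C[(s, a)] + sum(p * V[s2] for s2, p in T[(s, a)])
                       for a in actions)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < eps:
            return V

# Toy map: the A->C->D route is longer but far more reliable than A->B->D.
edges = {("A", "B"): (10.0, 0.95), ("A", "C"): (12.0, 0.99),
         ("B", "D"): (10.0, 0.60), ("C", "D"): (12.0, 0.99)}
T, C = build_ssp(edges)
V = value_iteration(T, C, goal="D")
```

With these toy numbers the optimal policy routes through C: the expected cost of recovering from failures on the unreliable B-D edge outweighs its shorter nominal traversal time, which is exactly the trade-off the CPIP planner is designed to make.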
IV-B Updating the Failure Belief during Deployment
CPIP builds an SSP model to represent the topological map of the environment as described in section IV-A, and updates it during deployments as it collects more observational data from the environment, altering the underlying transition function such that the resultant model represents not just the map but the competence of the robot in traversing it. To achieve this, the occurrence of a failure of type i at edge e, denoted F_e^i, is assumed to be a random variable drawn from a categorical distribution. The belief over this variable is defined as bel_n(F_e^i) = p(F_e^i | z_{1:n}), where the subscript n indicates the n-th traversal of the edge and z_n is the observation made by the robot during that traversal. Applying Bayes' rule, and defining the negation of F_e^i as ¬F_e^i, the belief can be implemented as the log-odds ratio

l_n(F_e^i) = l_{n−1}(F_e^i) + log [ p(F_e^i | z_n) / p(¬F_e^i | z_n) ] − l_0,   (2)

where l_0 = log [ p(F_e^i) / p(¬F_e^i) ] is the prior in log-odds form. Before the first deployment of the robot in a new environment, the prior is uninformative, i.e., l_0 = 0 for every edge e ∈ E and every failure type i. Upon each traversal of an edge, the above relation is used to update the transition function of the planning SSP model such that p_e reflects the current belief over the probability of a failure-free traversal of e. The main term that needs to be computed for updating the belief in Eq. 2 after each traversal is p(F_e^i | z_n), known as the inverse observation likelihood. In CPIP it is implemented by two different functions, each handling one of the two types of observations: 1) occurrences of failures of type i, which are indicated via intervention signals issued either by a human or a supervisory sensing unit, and are denoted by z_int; 2) sensory input that the robot continuously acquires, such as RGB images captured by cameras on the robot, which is denoted by z_img. For the former, the inverse observation likelihood is implemented as a constant, p(F_e^i | z_int) = α, where α is a constant coefficient. The inverse observation likelihood function for the latter type of observations, however, is machine-learned, and is one of the key components of this work: it allows CPIP to reach an accurate estimate of the failure probabilities without requiring the robot to experience costly failures. CPIP structures the learning problem such that it can be solved with a small number of failure examples in the training data. Introspective perception is leveraged to extract features associated with errors in perception from the high-dimensional raw sensory data, and these features are then used to learn to predict the probability of different types of navigation failures. By learning this likelihood function, the robot learns to better navigate its environment, proactively avoiding paths that are known to lead to failures, and reactively adjusting its policy upon encountering novel situations that may lead to failures. In the following section we describe the different parts of this learning problem.
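The belief update of Eq. 2 is a static-state binary Bayes filter in log-odds form. The following sketch illustrates it under assumed symbols and example probabilities (the observation values are invented, not taken from the paper):

```python
import math

def logit(p):
    """Probability -> log-odds."""
    return math.log(p / (1.0 - p))

def update_belief(l_prev, p_f_given_z, l0=0.0):
    """One step of Eq. 2: fold the inverse observation likelihood
    p(F | z) into the running log-odds belief."""
    return l_prev + logit(p_f_given_z) - l0

def belief_to_prob(l):
    """Log-odds -> probability of failure."""
    return 1.0 - 1.0 / (1.0 + math.exp(l))

# Uninformative prior before the first deployment: l_0 = 0, i.e. p = 0.5.
l = 0.0
# Two failure-indicating observations followed by one benign observation;
# each supplies p(F | z_n) from either an intervention signal (constant
# alpha) or the learned competence predictor.
for p_obs in [0.8, 0.8, 0.3]:
    l = update_belief(l, p_obs)
p_fail = belief_to_prob(l)
p_edge_success = 1.0 - p_fail  # used to refresh p_e in the planning SSP
```

Because the update is additive in log-odds, repeated weak evidence accumulates: here the final failure probability (48/55 ≈ 0.87) exceeds any single observation's 0.8.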
V Failure Prediction via Introspective Perception
In order to predict failures of navigation given the sensory data, we need to approximate the function p(F | z), where F denotes the occurrence of a navigation failure and z is the raw sensory input. End-to-end learning of this function is intractable because it requires a large amount of training data, yet catastrophic failures in robotics when executing tasks such as autonomous navigation do not happen frequently. The scarcity of these examples makes it challenging to learn a classifier that predicts the probability of task execution failure directly from the raw sensory data. Without enough training data, and without abstracting the high-dimensional sensory data, the learned classifier is bound to overfit to the training data. We instead propose to factorize this function as p(F | z) = ψ(φ(z)), where φ(z) are the features extracted from observations by introspective perception, a model-free approach to predicting arbitrary errors of perception, and ψ is a competence predictor that estimates failure probabilities from these features.
V-A Introspective Perception
Early works on introspective perception [17, 18] defined a perception algorithm to be introspective if it is capable of predicting the occurrence of errors in its output given the current state of the robot. Follow-up works [3, 4] extended this definition and required such perception algorithms to predict the probability of perception error conditioned on the region of the raw sensory data that the output depends upon, e.g., an image patch in the image captured by an RGB camera where the estimated depth of the scene is erroneous. This is usually achieved by means of an introspection function that is trained on empirical data.
In CPIP, the robot is assumed to be equipped with one or more introspective perception modules; each module j has a learned function φ_j, which extracts features from the raw sensory data that encode information about sources of perception errors. The outputs of all introspective perception modules are fed to a navigation competence predictor ψ, which learns to estimate the likelihood of each of the different classes of failure given a set of sources of perception errors. The inverse observation likelihood function in Eq. 2 is then estimated as the composition of these two stages, i.e., p(F | z) ≈ ψ(φ_1(z), …, φ_k(z)).
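The two-stage composition can be sketched with stub functions. Everything here (function names, feature contents, class names, and the toy scoring rule) is a hypothetical illustration of the structure, not the paper's learned models:

```python
def phi_stereo(image):
    """Stub introspection function for a stereo depth estimator: returns
    image regions predicted to cause perception errors, with a score.
    Hard-coded here purely for illustration."""
    return [{"patch": (40, 60, 32, 32), "error_score": 0.9}]

def psi(features):
    """Stub competence predictor: maps introspection features to a
    distribution over failure classes (toy rule, not a trained network)."""
    risk = max((f["error_score"] for f in features), default=0.0)
    return {"no_failure": 1.0 - risk,
            "catastrophic": 0.6 * risk,
            "non_catastrophic": 0.4 * risk}

def p_failure_given_z(image, modules=(phi_stereo,)):
    """Composition psi(phi_1(z), ..., phi_k(z)) over all modules."""
    feats = [f for phi in modules for f in phi(image)]
    return psi(feats)

probs = p_failure_given_z(image=None)
```

The design point is that only the small feature extractors touch raw sensor data; the competence predictor operates on a compact, error-focused representation, which is what makes learning it feasible with few failure examples.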
In this paper, we implement introspective perception for a block matching-based stereo depth estimator, using the same convolutional neural network architecture for the introspection function as in prior work. The training data is collected autonomously using a depth camera as a supervisory sensor, which is only occasionally available and provides oracular information about the true depth of the scene.
V-B Competence Predictor Model
We implement the navigation competence predictor model as an ensemble of two deep neural networks. The input to the model is a list of image patches extracted from the same input image, and the output is the probability of each class of failure. The architecture, shown in Figure 2, consists of two sub-models that are trained independently.
The global_info network is a convolutional neural network (CNN) that operates simultaneously on all input image patches, arranged on a blank image in their original pixel coordinates. The input to this network is equivalent to the input image masked at all regions except those predicted by introspective perception to lead to errors. The global_info CNN captures task-contextual and spatial information related to competence from the current frame. By masking out the parts of the full image deemed unrelated to perception failures, we ensure that the global_info CNN does not overfit to specific environments.
The local_info network is a CNN that is fed individual image patches as input. The output of this branch is the probability of each class of failure for each single image patch. This network learns correlations between navigation failures and image features that lead to perception errors. The goal of this branch is to locally pinpoint the potential source of navigation failures in the image space whenever a class of failure is predicted by the global_info network.
The last stage of the model is a temporal filtering of the output of each of the two networks. Failure-class probabilities produced by the global_info network are passed through a mean filter. Moreover, image patches that are predicted by the local_info network to lead to navigation failures are tracked in the full image over consecutive frames to form a set of active tracklets for each type of failure. The output of the model is obtained via strict consensus between the two networks: the filtered probability of each type of failure provided by the global_info network is accepted only if the local_info network supports it by maintaining at least one active tracklet, i.e., at least one detected potential cause for the same type of failure in the image space; otherwise the predicted probability for that failure class is suppressed. During deployment, if such consensus exists for any type of failure, the output of the competence predictor model is used to update the belief in Eq. 2.
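The temporal filtering and strict-consensus rule can be sketched as follows; the window size, class names, and data formats are assumptions for illustration, not the paper's exact configuration:

```python
from collections import deque

class ConsensusFilter:
    def __init__(self, window=5, classes=("catastrophic", "non_catastrophic")):
        self.classes = classes
        # Rolling history of global_info probabilities per failure class.
        self.history = {c: deque(maxlen=window) for c in classes}

    def step(self, global_probs, tracklets):
        """Mean-filter the global_info predictions, then accept a failure
        class only if local_info has at least one active tracklet for it."""
        out = {}
        for c in self.classes:
            self.history[c].append(global_probs[c])
            mean_p = sum(self.history[c]) / len(self.history[c])
            out[c] = mean_p if tracklets.get(c) else 0.0
        return out

f = ConsensusFilter()
# A frame where global_info predicts a likely catastrophic failure and
# local_info is tracking one supporting image patch:
probs = f.step({"catastrophic": 0.9, "non_catastrophic": 0.1},
               {"catastrophic": [("patch", 3)]})
```

The strict-consensus design trades recall for precision: a spurious spike from either branch alone cannot trigger a belief update.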
V-C Training of CPIP
CPIP has two learned components, introspective perception and the competence predictor model, which are trained sequentially. The training data is extracted from logs of robot deployments in the training environment. The logs include data collected by the primary sensors, such as RGB images captured by the stereo cameras, as well as data collected by supervisory sensing units that are only occasionally available, such as high-fidelity depth cameras. Furthermore, intervention signals issued by a human operator upon the occurrence of navigation failures are also recorded.
The deployment logs are processed offline. First, introspective perception is trained with data that is autonomously labeled using the supervisory sensing. Then, the training data for the competence predictor model is prepared by passing the raw sensory data through the introspective perception module and labeling the output features as associated with one of the types of navigation failures if they fall within a fixed time window preceding the occurrence of such failures. Each of the two sub-models of the competence predictor model explained in §V-B are then trained using a cross-entropy loss.
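The offline auto-labeling step can be sketched as follows; the window length and the log format are assumptions for illustration:

```python
def label_features(feature_times, failure_events, window=5.0):
    """Label each extracted feature set with a failure class if a failure
    of that class occurred within `window` seconds after it, and as
    no-failure otherwise.

    feature_times: timestamps of introspection feature sets.
    failure_events: list of (timestamp, failure_class) from interventions.
    """
    labels = []
    for t in feature_times:
        label = "no_failure"
        for t_f, cls in sorted(failure_events):
            if t < t_f <= t + window:
                label = cls
                break
        labels.append(label)
    return labels

labels = label_features([0.0, 3.0, 10.0],
                        [(6.0, "non_catastrophic")])
# Only the features at t = 3.0 precede the failure at t = 6.0 within the
# 5-second window, so only they receive the failure label.
```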
VI Experimental Results
In this section: 1) We evaluate CPIP on how well it predicts sources of robot failures. 2) We compare CPIP against baseline global path planners in terms of their task completion success rate and their task completion time. 3) We evaluate the importance of introspective perception in CPIP’s performance and generalizability via ablation studies.
VI-A Experimental Setup
In order to evaluate CPIP and compare it extensively against the state of the art, we use AirSim, a photo-realistic simulation environment, where robot failures are not expensive and the robot can easily be reset upon the occurrence of navigation failures. A simulated car is equipped with a stereo pair of RGB cameras as well as a depth camera that provides ground-truth depth readings for every pixel in the camera frame. We use two separate urban environments for training and testing. The environments are populated with obstacles of different shapes, textures, and colors.
We also evaluate CPIP on a real robot. We use a Clearpath Husky robot equipped with a stereo pair of RGB cameras, an Orbbec Astra depth camera, and a Velodyne VLP-16 3D Lidar. We use different indoor sites for training and testing of CPIP. Each environment has different types of terrain such as tiles and carpet, and is populated with obstacles of different shapes, textures, and surface materials. Figure 3 shows the training and test environments.
VI-B Failure Prediction Accuracy
In order to evaluate the accuracy of CPIP in predicting failures of navigation, we have the autonomous agent traverse each edge of the navigation graph in the test simulation environment 50 times, and run the images captured by the robot's camera through CPIP's introspective perception module and competence predictor model to predict instances of navigation failure. In this paper, we implement CPIP with two classes of failures: 1) catastrophic failures, where the robot ends up in a state that precludes completion of the task and is not recoverable even with human intervention; examples of this class include collisions and the robot getting stuck off-road; 2) non-catastrophic failures, where the robot will not be able to complete its task unless intervention is provided by a human operator or a supervisory sensor; the robot getting stuck due to false detection of obstacles or due to localization errors are examples of this type. Figure 4 illustrates the predicted and actual navigation failures in a confusion matrix. CPIP correctly predicts occurrences of both types of navigation failures in the majority of cases. Prediction errors mostly correspond to cases where the source of failure looks significantly different from the examples available in the training data.
VI-C Navigation Success Rate and Plan Optimality
Table I: Task completion rate (TCR) and count of avoided navigation failures.

| | | TCR (%) | Avoided Failures Count | |
|---|---|---|---|---|
| Real Robot | CPIP | 100 | 5 (100%) | 3 (100%) |
| | Frequentist | 73 | 3 (60%) | 1 (33%) |
| Simulation | CPIP | 97 | 14 (93%) | 61 (97%) |
| | Frequentist | 83 | 9 (60%) | 52 (83%) |
We test the end-to-end system, which predicts navigation failures and leverages this information to proactively plan paths that reduce the probability of failures, by deploying the robot in previously unseen test environments. The robot is commanded to complete randomly generated navigation tasks, each consisting of a starting pose and a target pose. We conduct this experiment both in simulation and on the real robot, with 100 and 10 navigation tasks, respectively. We compare CPIP with a baseline path planner that does not reason about the competence of the robot, as well as with a state-of-the-art approach for competence-aware path planning, called the Frequentist approach, that relies on keeping track of the frequency of past failures in traversing each edge of the navigation graph. Figure 5 compares the cumulative failure count for all three methods throughout the experiment. With the Frequentist approach, the robot learns to avoid regions of the environment where it cannot navigate reliably only as it experiences navigation failures. CPIP, in contrast, enables the robot to predict and avoid most of these failures, and hence leads to the smallest number of experienced failures.
We also evaluate the optimality of the planned paths by comparing the task completion time of all methods under test against that of an oracular path planner that is given the true probability of navigation failures for each edge of the navigation graph. The ground-truth failure probabilities are obtained by having the agent traverse each edge of the navigation graph numerous times and logging the frequency of each type of failure. Figure 6 compares the mean task completion duration over a sliding window of 5 consecutive tasks in the simulation experiment. The duration values are normalized by the task completion duration of the oracular path planner. The figure also illustrates instances of task completion failures for both CPIP and the Frequentist method. Incomplete tasks are excluded when calculating the mean task completion durations; such instances include occurrences of catastrophic failures, as well as occurrences of consecutive non-catastrophic failures from which the robot cannot recover by re-planning out of a stuck state. CPIP's task completion duration is similar to that of the oracular path planner, except for tasks where the robot visits a previously unseen part of the environment and has to re-plan upon prediction of a source of navigation failure. An example of such re-planning can be seen around task number 50 in Figure 6.
As a result, CPIP maintains a high task completion rate by preventing instances of navigation failure, and by updating its belief over the probability of failure throughout the environment, it converges to an optimal policy while experiencing far fewer failures than the state of the art in a new deployment environment. Table I summarizes the task completion rate (TCR) and the number of navigation failures avoided by CPIP for both the simulation and real-robot experiments. Figure 7 illustrates snapshots of the test environments and highlights the different sources of navigation failures encountered by the robot, which include different types of texture-less obstacles as well as reflective surfaces.
VI-D Ablation Study
In order to evaluate the importance of introspective perception in the CPIP pipeline, we conduct an ablation study. We train a classifier that, instead of leveraging the information extracted by introspective perception, directly receives the raw captured RGB images as input and outputs the probability of each class of failure occurring within a specified time window in the future. We use a convolutional neural network with the AlexNet architecture, similar to that used in prior work for predicting failures of perception.
We train the classifier on the same simulation dataset used for training CPIP, and we compare the performance of both methods in predicting navigation failures both in a previously unseen environment, using the same test dataset described in section VI-B, and in a new set of deployments of the agent in the training environment. Figure 8 shows the average precision, recall, and F1-score over all classes, i.e., the two classes of failures and a no-failure class, for both CPIP and the end-to-end classifier. While both methods perform similarly well in a previously seen environment, CPIP significantly outperforms the alternative classifier in the novel environment. Leveraging the features extracted by introspective perception simplifies the learning task and allows CPIP to achieve better generalizability given the same amount of training data. This is particularly beneficial for task-level failure prediction, where the volume of training data is limited due to the costly nature of acquiring examples of robot failures.
VII Conclusion
In this paper, we introduced CPIP, a framework for integrating introspective perception with path planning in order to learn to reduce robot navigation failures in the deployment environment with a limited amount of training data. We empirically demonstrated that by leveraging introspective perception, CPIP can learn a navigation competence predictor model that generalizes to novel environments and significantly reduces the frequency of navigation failures. CPIP currently addresses the problem of robot global path planning on a coarse navigation map of the environment. As future directions, the CPIP framework can be extended to support competence-aware local motion planning as well as high-level task planning for mobile robots.
This work is supported in part by NSF (CAREER-2046955, IIS-1954778) and DARPA (HR001120C0031). The views and conclusions contained in this document are those of the authors only.
-  N. Hawes, C. Burbridge, F. Jovan, L. Kunze, B. Lacerda, L. Mudrova, J. Young, J. Wyatt, D. Hebesberger, T. Kortner, et al., “The strands project: Long-term autonomy in everyday environments,” IEEE Robotics & Automation Magazine, vol. 24, no. 3, pp. 146–156, 2017.
-  G. Costante, C. Forster, J. Delmerico, P. Valigi, and D. Scaramuzza, “Perception-aware path planning,” arXiv preprint arXiv:1605.04151, 2016.
-  S. Rabiee and J. Biswas, “IVOA: Introspective vision for obstacle avoidance,” in 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2019, pp. 1230–1235.
-  S. Rabiee and J. Biswas, “IV-SLAM: Introspective vision for simultaneous localization and mapping,” in Conference on Robot Learning (CoRL), 2020.
-  R. Bajcsy, “Active perception,” Proceedings of the IEEE, vol. 76, no. 8, pp. 966–1005, 1988.
-  J. Aloimonos, I. Weiss, and A. Bandyopadhyay, "Active vision," International Journal of Computer Vision, vol. 1, no. 4, pp. 333–356, 1988.
-  E. Krotkov, “Focusing,” International Journal of Computer Vision, vol. 1, no. 3, pp. 223–237, 1988.
-  M. Krainin, B. Curless, and D. Fox, “Autonomous generation of complete 3d object models using next best view manipulation planning,” in 2011 IEEE International Conference on Robotics and Automation. IEEE, 2011, pp. 5031–5037.
-  S. A. Sadat, K. Chutskoff, D. Jungic, J. Wawerla, and R. Vaughan, “Feature-rich path planning for robust navigation of mavs with mono-slam,” in 2014 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2014, pp. 3870–3875.
-  X. Deng, Z. Zhang, A. Sintov, J. Huang, and T. Bretl, “Feature-constrained active visual slam for mobile robot navigation,” in 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2018, pp. 7233–7238.
-  D. M. Saxena, V. Kurtz, and M. Hebert, “Learning robust failure response for autonomous vision based flight,” in 2017 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2017, pp. 5824–5829.
-  C. Gurău, D. Rao, C. H. Tong, and I. Posner, “Learn from experience: probabilistic prediction of perception performance to avoid failure,” The International Journal of Robotics Research, vol. 37, no. 9, pp. 981–995, 2018.
-  B. Lacerda, D. Parker, and N. Hawes, "Optimal and dynamic planning for Markov decision processes with co-safe LTL specifications," in 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2014, pp. 1511–1516.
-  T. Krajník, J. P. Fentanes, J. M. Santos, and T. Duckett, “Fremen: Frequency map enhancement for long-term mobile robot autonomy in changing environments,” IEEE Transactions on Robotics, vol. 33, no. 4, pp. 964–977, 2017.
-  T. Vintr, Z. Yan, T. Duckett, and T. Krajník, “Spatio-temporal representation for long-term anticipation of human presence in service robotics,” in 2019 International Conference on Robotics and Automation (ICRA). IEEE, 2019, pp. 2620–2626.
-  C. Basich, J. Svegliato, K. H. Wray, S. Witwicki, J. Biswas, and S. Zilberstein, “Learning to optimize autonomy in competence-aware systems,” in Proceedings of the 19th International Conference on Autonomous Agents and MultiAgent Systems, 2020, pp. 123–131.
-  P. Zhang, J. Wang, A. Farhadi, M. Hebert, and D. Parikh, "Predicting failures of vision systems," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 3566–3573.
-  S. Daftry, S. Zeng, J. A. Bagnell, and M. Hebert, “Introspective perception: Learning to predict failures in vision systems,” in 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2016, pp. 1743–1750.
-  K. Pulli, A. Baksheev, K. Kornyakov, and V. Eruhimov, “Real-time computer vision with opencv,” Communications of the ACM, vol. 55, no. 6, pp. 61–69, 2012.
-  S. Shah, D. Dey, C. Lovett, and A. Kapoor, “Airsim: High-fidelity visual and physical simulation for autonomous vehicles,” in Field and service robotics, 2018, pp. 621–635.