
An Online Data-Driven Emergency-Response Method for Autonomous Agents in Unforeseen Situations

by   Glenn Maguire, et al.
HRL Laboratories, LLC

Reinforcement learning agents perform well when presented with inputs within the distribution of those encountered during training. However, they are unable to respond effectively when faced with novel, out-of-distribution events until they have undergone additional training. This paper presents an online, data-driven, emergency-response method that aims to give an autonomous agent the ability to react to unexpected situations that are very different from those it has been trained or designed to address. In such situations, learned policies cannot be expected to perform appropriately, since the observations obtained in these novel situations fall outside the distribution of inputs that the agent has been optimized to handle. The proposed approach devises a customized response to the unforeseen situation sequentially, by selecting actions that minimize the rate of increase of the reconstruction error from a variational auto-encoder. This optimization is achieved online in a data-efficient manner (on the order of 30 data-points) using a modified Bayesian optimization procedure. We demonstrate the potential of this approach in a simulated 3D car driving scenario, in which the agent devises a response in under 2 seconds to avoid collisions with objects it has not seen during training.


1 Introduction

Recent advances in Reinforcement Learning (RL) through deep neural networks have shown promising results in developing autonomous agents that learn to effectively interact with their environments in a number of different application domains (1; 2), including learning to play video games from image pixels (3; 4; 5), generating optimal control policies for robots (6; 8), speech recognition and natural language processing (9), as well as making optimal trading decisions given dynamic market conditions (7). Under the RL paradigm, the agent learns to perform a given task through numerous training episodes involving trial-and-error interactions with its environment. By discovering the consequences of its actions in terms of the rewards obtained through these interactions, the agent eventually learns the optimal policy for the given task.

These approaches work well in situations where it can be assumed that all the events encountered during deployment arise from the same distribution on which the agent has been trained. However, agents that must function within complex, real-world environments for an extended period of time can be subjected to unexpected circumstances outside of the distribution they have been designed for or trained on, due to environmental changes that arise. For example: an autonomous driving car may encounter significantly distorted lane-markings that it has never experienced before due to construction or wear, and must determine how to continue to drive safely; an unaware worker in a manufacturing facility may suddenly place a foreign object, such as their hand, within the workspace of a vision-guided robot-arm that must then react to avoid damage/injury; or the seam being tracked by a welding robot may suddenly shift due to a poorly secured work-piece, requiring the robot to react to minimize part-damage. In such unexpected, novel situations the agent’s policy would be inapplicable, causing the agent to perhaps take unsafe actions.

In this paper, we consider scenarios where a trained agent encounters an unforeseen situation during deployment that renders available system or state-transition models highly unreliable, so that any inferences based on such models, as well as any pre-defined safe state/action regions, are no longer valid for safe decision-making. An agent unable to respond effectively to a novel situation when first encountered is vulnerable to taking dangerous actions. This is of particular concern in safety-critical applications where sub-optimal actions can lead to damage or destruction of the agent or loss of human life.

While continual learning approaches exist (29; 30; 31; 32; 33) that retrain agents, or re-optimize models, to eventually improve performance on novel events, they require a significant amount of time and repeated encounters with those events before learning the optimal policy that covers those cases. If these novel events represent dangerous situations in safety-critical applications, this still leaves the question of how to respond to these events in the interim.

We address this problem by developing a data-driven danger-mitigation system that allows an agent to deal with novel situations without reliance on the accuracy of existing models, or the validity of safe states and recovery policies developed offline or from past experiences. The key insight to our approach is that uncertainty in observations from the environment can be used not only to detect dangerous scenarios, but also as a driver for the generation of effective, short-term responses online, when necessary, to circumvent dangers, so that the agent can continue to function within its environment.

Increased observation uncertainty (e.g., as measured by out-of-bounds auto-encoder reconstruction errors (16)) can be used to detect novelty, for which the existing policy is unprepared. It stands to reason, then, that decreasing this uncertainty would decrease novelty and return states to those that the current policy can handle effectively, thus correlating uncertainty-minimization with safety. For example, when an aerial vehicle descends too close to the ground and finds difficulty in maintaining control due to strong ground-effect, ascending back to milder conditions that it is better designed to handle returns it to safety. While use of uncertainty to detect potential danger is not new, using it to generate actions in an online manner, customized to the particular never-before-seen emergency as it unfolds, is novel.

In the absence of a reliable model or policy network to make proper action decisions, determination of an appropriate response to a novel situation must be data-driven and sequential. In other words, after an action is taken by the agent at a given time-step, the resulting measurements must be analyzed, along with all prior measurement data gathered so far during the course of the response, in order to determine the best action to take for the subsequent time-step. Moreover, in an emergency situation this response must be devised efficiently (i.e., in just a few time-steps), meaning that little data will typically be available for finding the optimal actions to take. This reactive approach, therefore, necessitates a fast, online, optimal decision-making method.

Bayesian Optimization (BO) provides an ideal theoretical framework for this type of problem scenario (24). BO is a data-efficient, global-optimization method for sequential decision-making where the objective function is expensive to evaluate or is taken to be a black-box. It uses a probabilistic approach that involves building and maintaining a distribution of functions to model the objective being optimized, and sequentially improving this probabilistic model through measurement data obtained online. This model is used to compute the next best action to take in a manner that balances exploration of the unknown regions of the objective and exploitation of regions found to be most likely to contain the optimal value.

Using this framework we devise an emergency response generation method that combines a modified BO procedure for efficient sequential optimization, with Gaussian Process (GP) regression for representing the probabilistic model of the objective. The objective function in our approach is a metric designed to capture the uncertainty in the observations obtained by the autonomous agent in a way that facilitates the generation of an effective emergency response. The responses generated by this method are intended to be action-sequences over a short time-span that are only initiated when deemed necessary to circumvent a dangerous situation that the agent is not yet prepared to handle. As such, our overall approach is referred to as the data-driven, emergency-response method.

2 Preliminaries

2.1 Related Work

While there are existing works related to safety for autonomous agents (10; 11; 12; 13; 14; 15; 16; 17; 21), they are unable to address the scenarios being considered in this paper, which are situations that are unexpected and outside of the distribution of events that the agent has been trained for. Typical safe decision-making strategies involve incorporating pre-designed penalties into the reward or cost function for actions deemed unsafe or dangerous when training a deep neural network to generate policies (10; 11), or restricting agent actions to “safe” regions in order to prevent it from reaching unsafe states (12; 13; 14; 15). Such approaches are still vulnerable to unforeseen events not accounted for through the penalties or pre-determined safe regions stipulated.

Other approaches use examples of dangers in offline training in representative environments to either help identify potentially dangerous situations online and conservative behaviors to use based on pre-specified rules (16), or to learn recovery policies for specific dangerous scenarios that can be applied during deployment (17). During long-term deployment in complex environments, however, significantly novel events very different from those experienced during offline training can arise. These can produce potentially dangerous scenarios not effectively captured by the above-mentioned mechanisms, particularly when such situations develop very rapidly, thereby requiring a customized response to handle.

An agent must therefore be able to continually learn and adapt to such novel situations. Continual learning approaches in the literature that address this need, though, do so through the initiation of a new learning phase, with primary concerns being in trying to learn new tasks efficiently, in an autonomously triggered manner, while maintaining good performance on tasks already learned (29; 30; 31; 32; 33). Adaptation to the novel situation, then, is not instantaneous, and must happen over an extended period of time dictated by the continual learning method used.

Much work also exists in the literature on detection of novel situations, where observations outside of the distribution on which a deep learning agent has been trained must be identified (16; 29; 33; 34). This is typically done through establishing some measure of uncertainty in the output of the model or deep neural networks involved. In (16), for example, the average squared pixel error between the input and reconstructed observation image from an auto-encoder is compared to a threshold to detect a novel situation. Similarly, in (29), novelty detection is done through testing for a statistically significant difference in mean auto-encoder reconstruction error between what is expected on an already learned environment and what is sampled online. However, how to best respond to the novel event, particularly when it represents an emergency situation, is an open problem.

Nevertheless, what existing approaches do show is that deep learning neural networks produce erratic and unreliable predictions when presented with inputs very different from their training scenarios (16; 18), but also that uncertainty in predictions from such out-of-distribution inputs can be an effective way to detect novelty (12; 16; 19). Moreover, trying to jointly optimize for task performance and safety-violation can lead to restrictive, sub-optimal policies (17; 20). In addition, despite their limitations, these prior works also make it clear (as shown through the experiments on standard simulation benchmarks in 14, for example) that including a safety mechanism to assist learning agents that either limits or completely avoids dangerous actions improves success rate, constraint satisfaction, and sample efficiency.

2.2 Summary of Bayesian Optimization

Bayesian Optimization (BO) (24) finds use as an efficient, global optimization technique for scenarios where the objective function, $f(x)$, is significantly expensive to evaluate and/or is treated as a black-box system. BO builds a surrogate model of $f$ in the form of a probability distribution over this function that includes a mean function, $\mu(x)$, representing the current best estimate of $f$ over the domain of $x$, and a variance function, $\sigma^2(x)$, representing the uncertainty in this estimate.

This model is built sequentially, by sampling the input at each step that is most likely to produce a better value for $f$ than the best found so far. With each data-point obtained, a revised posterior distribution over the function is computed. As such, the model tends to improve only in the region that has the highest likelihood of containing the optimal solution. This makes BO data-efficient, allowing it to find a near-optimal solution with only a few evaluations.

In the proposed method, this surrogate model is a Gaussian Process (GP) regression model, $\mathcal{GP}$ (21). GP regression operates in function-space by defining a distribution over functions, thereby providing a formalism for computing a mean function, $\mu(x)$, and a variance function, $\sigma^2(x)$, for a given set of data-points. This regression model can then be used to infer the function value, $f(x_*)$, for some unobserved input, $x_*$. Details on how to solve a GP regression model to make inferences can be found in the Technical Appendix, Section A.
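To make the GP inference step concrete, the following is a minimal sketch of zero-mean GP regression with a squared-exponential kernel. The kernel choice, its hyperparameters, and the function names `rbf_kernel` and `gp_posterior` are assumptions made for illustration; they are not taken from the paper or its Technical Appendix.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0, signal_var=1.0):
    """Squared-exponential kernel between two sets of row-vector inputs."""
    sq_dist = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return signal_var * np.exp(-0.5 * sq_dist / length_scale ** 2)

def gp_posterior(x_train, y_train, x_query, noise_var=1e-6):
    """Posterior mean and variance of a zero-mean GP at the query inputs."""
    k_tt = rbf_kernel(x_train, x_train) + noise_var * np.eye(len(x_train))
    k_tq = rbf_kernel(x_train, x_query)
    k_qq = rbf_kernel(x_query, x_query)
    chol = np.linalg.cholesky(k_tt)                      # k_tt = chol @ chol.T
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y_train))
    mean = k_tq.T @ alpha                                # mu(x_query)
    v = np.linalg.solve(chol, k_tq)
    var = np.diag(k_qq) - np.sum(v ** 2, axis=0)         # sigma^2(x_query)
    return mean, np.maximum(var, 0.0)

# Three observed data-points of a hypothetical objective.
x_train = np.array([[0.0], [1.0], [2.0]])
y_train = np.sin(x_train).ravel()

# Query one observed input and one input far from all data.
mean, var = gp_posterior(x_train, y_train, np.array([[1.0], [5.0]]))
```

At the observed input the posterior variance is close to zero and the mean reproduces the data, while far from the data the variance returns to the prior value. This is the behavior the acquisition function relies on to separate explored from unexplored regions.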

To select each subsequent data-point to sample, BO optimizes an acquisition function, $\alpha(x)$. This heuristic function quantifies the utility of any given input, $x$, in terms of its potential for optimizing $f$. It is designed to capture the trade-off between exploitation (i.e., sampling near the best solution found so far) and exploration (i.e., sampling in unexplored regions with greater uncertainty) (25).

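Thus, at any given step, $n$, of the sequential optimization process of BO, we will have available some data-points, $X$, $\mathbf{y}$, and the objective is to find the next best point, $x_{n+1}$, to sample in terms of its utility as expressed by the acquisition function. This produces an acquisition function optimization problem that can be expressed as:

$$x_{n+1} = \arg\max_{x \in \mathcal{X}} \; \alpha\left(x \mid \mathcal{GP}\right) \tag{1}$$

$$\mathcal{GP}: \; f(x) \mid X, \mathbf{y} \sim \mathcal{N}\left(\mu(x), \sigma^2(x)\right) \tag{2}$$

Here, $\mathcal{X}$ is the domain of $x$ (i.e., the input space), and $\mathcal{GP}$ is the GP regression model of $f$ based on given data $X$ and $\mathbf{y}$, which can be solved to obtain an estimate, $\mu(x)$, and uncertainty, $\sigma^2(x)$, of the observation at some unknown location $x$.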
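The acquisition step can be illustrated with a small sketch that picks the next sample from a finite candidate set, given the surrogate's mean and standard deviation at each candidate. The confidence-bound form of the acquisition, the candidate grid, and the name `select_next_point` are illustrative assumptions, and the numbers are made up to show the exploration/exploitation trade-off.

```python
import numpy as np

def select_next_point(candidates, mean, std, kappa):
    """Return the candidate that maximizes mean + kappa * std."""
    acquisition = np.asarray(mean) + kappa * np.asarray(std)
    return float(np.asarray(candidates)[int(np.argmax(acquisition))])

candidates = [0.0, 0.5, 1.0]
mean = [0.2, 0.5, 0.1]    # surrogate's best estimate at each candidate
std = [0.05, 0.1, 0.6]    # surrogate's uncertainty at each candidate

exploit_choice = select_next_point(candidates, mean, std, kappa=0.0)
explore_choice = select_next_point(candidates, mean, std, kappa=2.0)
```

With `kappa=0.0` the candidate with the best estimated value (0.5) is chosen, whereas with `kappa=2.0` the highly uncertain candidate (1.0) wins. A continuous optimizer could replace the grid search; the grid is used here only to keep the sketch short.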

3 Methodology

3.1 Problem Description

We consider the data-driven emergency-response method to serve as an independent module that monitors a trained and deployed agent as it performs a given task. Within this scenario there may be instances where the agent encounters a situation it has never seen before that presents a danger if not acted upon properly. The agent’s existing policy is unable to determine an appropriate response without further training, and any environment models become unreliable.

Whenever such an unforeseen event is encountered, the agent is considered to be in an emergency situation for which an emergency response is required to mitigate the danger. Two key sub-problems need to be addressed here: (1) Emergency Detection; and (2) Response Generation (Figure 1).

Figure 1: The data-driven emergency-response method.

During deployment the observations from the environment as well as the actions that the policy intends to take are received by the Emergency Detection algorithm, which must decide whether or not an emergency situation is imminent. If it is not, then no further action is taken by the emergency-response system. Otherwise, the Response Generation algorithm is engaged, which takes over the policy’s actions by replacing them with a customized action-sequence over the next $T$ time-steps. This action-sequence must be generated online as the encounter with the novel situation unfolds.

3.2 Emergency Detection

When a novel situation arises, the corresponding observations will be very different from what the agent has seen during training or from past experiences. This increases the uncertainty associated with these novel observations. In our approach to emergency detection, we use a Variational Auto-Encoder (VAE) (26) (see Technical Appendix, Section B for more details) to obtain a measure of observation uncertainty.

In particular, we represent observation uncertainty as the mean squared pixel-value error, $e_t$, between the observation, $o_t$, obtained from the environment at time $t$, and the reconstructed output, $\hat{o}_t$, from a VAE:

$$e_t = \frac{1}{N} \sum_{i=1}^{N} \left(o_{t,i} - \hat{o}_{t,i}\right)^2 \tag{3}$$

Here, $o_{t,i}$ and $\hat{o}_{t,i}$ are the pixel intensity values of images $o_t$ and $\hat{o}_t$, respectively, and $N$ is the total number of pixels in an image, including each of the R, G, and B color channels.
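As a minimal sketch of this error measure, the function below computes the mean squared pixel-value error between an observation and its reconstruction, averaging over every pixel in every color channel. The function name and the toy images are hypothetical, and the VAE itself is not modeled here.

```python
import numpy as np

def reconstruction_error(observation, reconstruction):
    """Mean squared pixel-value error over all pixels and color channels."""
    # Cast to float first so that unsigned 8-bit images do not wrap around
    # when the difference is negative.
    obs = np.asarray(observation, dtype=np.float64)
    rec = np.asarray(reconstruction, dtype=np.float64)
    if obs.shape != rec.shape:
        raise ValueError("observation and reconstruction must have the same shape")
    return float(np.mean((obs - rec) ** 2))

# Toy 4x4 RGB images: every pixel value differs by 0.5.
observation = np.zeros((4, 4, 3))
reconstruction = np.full((4, 4, 3), 0.5)
error = reconstruction_error(observation, reconstruction)
```

In this toy case every squared difference is 0.25, so the mean is 0.25 regardless of image size.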

This VAE is assumed to be trained on the agent’s past experiences or on a representative environment during pre-deployment under some nominal settings. This available experiential data can be used to empirically establish a threshold, ULe, on the reconstruction errors observed during deployment, above which there is strong reason to believe that the agent has encountered a significantly novel situation.

This threshold is user-specified, and can be set to a reasonable value by observing the typical range of variation of errors under the nominal conditions in which the VAE has been trained. Qualitatively, a lower threshold corresponds to a more conservative stance on when an unforeseen situation is imminent, while a higher value would be more liberal, contributing to fewer false positives. An alternative probabilistic approach based on confidence limits is also given in the Technical Appendix, Section C.

The condition for detection is based on estimating the values of reconstruction errors into the short-term future using real-time data. At each time-step during deployment, a regression model, $\hat{e}(t)$, is fitted to the last $n_f$ VAE reconstruction errors computed, which is then extrapolated $n_e$ time-steps into the future. A positive detection occurs if at least the last $n_d$ extrapolated errors exceed the established ULe threshold (see Technical Appendix, Section C).

The choices of the extrapolation horizon, $n_e$, and the required number of exceeding points, $n_d$, are related to how conservative or liberal one wishes to be when triggering the emergency response. The further into the future one extrapolates, the more uncertain the estimated values can be expected to be. To counteract this, a higher value for $n_d$ can be chosen to provide a more stringent condition before concluding that a true positive detection event has occurred.
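The detection rule can be sketched as follows, assuming a polynomial regression model fitted with NumPy. The default values (a second-order polynomial, a window of 15 errors, a horizon of 7 steps, and 3 exceeding points) are those reported in Section 4.1; the function and parameter names, the threshold, and the example error histories are hypothetical.

```python
import numpy as np

def detect_emergency(errors, threshold, fit_window=15, horizon=7,
                     min_exceed=3, degree=2):
    """Fit a polynomial to recent errors, extrapolate, and test the threshold."""
    recent = np.asarray(errors[-fit_window:], dtype=np.float64)
    if len(recent) < fit_window:
        return False  # not enough history to fit the regression model yet
    steps = np.arange(fit_window)
    coeffs = np.polyfit(steps, recent, degree)
    future_steps = np.arange(fit_window, fit_window + horizon)
    predicted = np.polyval(coeffs, future_steps)
    # Positive detection only if all of the last few extrapolated errors
    # lie above the threshold.
    return bool(np.all(predicted[-min_exceed:] > threshold))

flat_errors = [0.01] * 15
rising_errors = [0.01 + 0.0005 * t ** 2 for t in range(15)]

flat_detected = detect_emergency(flat_errors, threshold=0.05)
rising_detected = detect_emergency(rising_errors, threshold=0.15)
```

The rising history has not yet crossed the threshold of 0.15 (its last value is 0.108), but its extrapolation does, so the rule fires before the observed errors themselves exceed the limit. The flat history never triggers.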

3.3 Response Generation

The proposed method for generating a response to address an unforeseen situation is based on the idea that taking actions that reduce the uncertainty in the observations should correspond to an effective response that guides the agent to a more familiar, and therefore relatively safer, state. In our approach, the response devised is an action-sequence that spans some fixed number of time-steps, $T$. The generation of actions that reduce observation uncertainty is thus taken to be a sequential optimization process, where each action must ideally be the optimal decision to make given all the data gathered since the initiation of the response.

To perform this online optimization we use BO coupled with GP regression. As shown in Figure 1, the sequential optimization uses the rate of change of the VAE reconstruction errors to drive the BO loop at each time-step, $t$, of the response action-sequence. This is because there may be situations where it is not possible to find actions that reduce the reconstruction errors, and all that can be done is to minimize their increase. This would still be a valid response if that is the best that can be done given the circumstances. An imminent collision with an obstacle is again a good example: some situations may simply call for maximum braking, as there may not be any way to swerve around the obstacle. In such cases, errors would only rise as the agent approached the obstacle, with braking helping to slow down the rate of increase until the errors eventually plateau at a higher but stable value. Minimizing the error-rate would capture the need to slow down the rise of the errors in such situations, but would also be able to keep driving the errors down further (i.e., negative error-rates) if that is indeed possible.

The objective at each time-step, $t$, of the response is to find an action, $a_t$, that minimizes the error-rate, $\dot{e}_t$, that would result from that action, by conducting one cycle of the BO loop shown in Figure 1. This optimization will have available $t-1$ data-points containing all the actions, $A$, taken in the last $t-1$ time-steps of the action-sequence, as well as the corresponding true error-rates, $\dot{E}$, that resulted.

A fixed number of the most recent error data-points are always stored in a database. Once a response generation is triggered, every error data-point obtained from the start of the $T$-step response is also saved for the duration of the response. To compute the error-rate, $\dot{e}_t$, corresponding to the error data-point obtained, the available reconstruction errors, $e$, in the data-set are first passed through a smoothing filter to compute the smoothed errors, $\tilde{e}$, as the raw data would be noisy. The last two smoothed error values can then be used to compute the rate, $\dot{e}_t$, per time-step as:

$$\dot{e}_t = \tilde{e}_t - \tilde{e}_{t-1} \tag{4}$$
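The smoothing and differencing steps can be sketched as below. The paper's smoothing filter is not specified in this section, so an exponential moving average is assumed here purely for illustration, and the rate is taken as the difference of the last two smoothed errors per time-step. The names `smooth_errors` and `error_rate` and the weight value are hypothetical.

```python
def smooth_errors(errors, weight=0.5):
    """Exponential moving average (an assumed stand-in for the smoothing filter)."""
    smoothed = []
    for value in errors:
        if not smoothed:
            smoothed.append(float(value))
        else:
            smoothed.append(weight * value + (1.0 - weight) * smoothed[-1])
    return smoothed

def error_rate(errors, weight=0.5):
    """Rate of change per time-step from the last two smoothed errors."""
    if len(errors) < 2:
        raise ValueError("at least two errors are needed to compute a rate")
    smoothed = smooth_errors(errors, weight)
    return smoothed[-1] - smoothed[-2]

# Raw errors jump from 0.10 to 0.20; smoothing halves the apparent jump.
rate = error_rate([0.10, 0.10, 0.20])
```

Here the smoothed sequence is 0.10, 0.10, 0.15, so the computed rate is 0.05 rather than the raw jump of 0.10, which shows how the filter damps single-step noise at the cost of some lag.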


BO then proceeds to construct a model of the unknown relationship, $\dot{e} = f(a)$, between error-rates, $\dot{e}$, and actions, $a$, for the given emergency scenario using GP regression. This GP model is used to conduct the relatively simpler acquisition function optimization to find the next best action, $a_t$, to take, as described by equations (1) and (2).

The optimal action, $a_t$, is then applied to the environment. At the subsequent time-step, the resulting error will be obtained, from which the error-rate, $\dot{e}_t$, can be computed. Both $A$ and $\dot{E}$ are then updated accordingly and the above BO loop procedure is repeated, until the response length, $T$, is reached.
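The overall loop can be sketched end to end on a toy problem. Everything specific in this example is an assumption made for illustration: the one-dimensional action, the quadratic `toy_error_rate` standing in for the environment, the kernel and its hyperparameters, the fixed `kappa`, and the use of a lower confidence bound (mean minus `kappa` times standard deviation) as the optimistic criterion when minimizing. It is not the paper's implementation.

```python
import numpy as np

def gp_predict(actions, rates, candidates, length_scale=0.3, noise_var=1e-4):
    """Posterior mean and standard deviation of a zero-mean GP (RBF kernel)."""
    def kernel(a, b):
        return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length_scale ** 2)
    gram = kernel(actions, actions) + noise_var * np.eye(len(actions))
    cross = kernel(actions, candidates)
    solved = np.linalg.solve(gram, cross)        # gram^-1 @ cross
    mean = solved.T @ rates
    var = 1.0 - np.sum(cross * solved, axis=0)
    return mean, np.sqrt(np.maximum(var, 0.0))

def toy_error_rate(action):
    """Hypothetical environment: the error-rate is lowest at action = 0.6."""
    return (action - 0.6) ** 2 - 0.1

response_length = 30                        # number of time-steps in the response
candidates = np.linspace(-1.0, 1.0, 201)    # discretized 1-D action space
actions = np.array([0.0])                   # action in effect when the response starts
rates = np.array([toy_error_rate(0.0)])

for step in range(1, response_length):
    mean, std = gp_predict(actions, rates, candidates)
    kappa = 1.0                             # fixed here; the method varies it with time
    lower_bound = mean - kappa * std        # optimistic estimate when minimizing
    action = candidates[np.argmin(lower_bound)]
    actions = np.append(actions, action)    # update the action data-set
    rates = np.append(rates, toy_error_rate(action))  # and the error-rate data-set

best_action = actions[np.argmin(rates)]
```

Each pass through the loop refits the GP on all data gathered so far, selects one action, applies it, and records the resulting error-rate, mirroring the one-cycle-per-time-step structure described above.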

Acquisition Function:

Several options for the acquisition function can be found in the literature (e.g., Expected Improvement, Entropy Search, and Knowledge Gradient). We employ the Upper Confidence Bound (UCB), given by Eq. (5). Here, $\mu(a)$ and $\sigma^2(a)$ are the mean and variance of the regression model for the relationship, $\dot{e} = f(a)$.

$$\alpha_{\mathrm{UCB}}(a) = \mu(a) + \kappa \, \sigma(a) \tag{5}$$

UCB is chosen since it includes a parameter, $\kappa$, that allows direct control over the balance between exploration and exploitation, that is, how much the system should try actions that are far from those already sampled versus how much it should focus on the most promising actions found so far.

Since an emergency response is time-critical, it is important to ensure a transition from an initial exploratory behavior to an exploitative one in a timely manner so that the search converges on an effective solution fast enough to avoid the danger. To accomplish this, the explicit parameter, $\kappa$, is set to a decreasing function of time, $\kappa(t)$, $0 \le t \le T$. The initial value, $\kappa_0$, must be relatively high to encourage the BO to explore the action-space. As the action-sequence progresses, this parameter should decrease to a relatively lower value, $\kappa_f$, so that the optimization begins to exploit the best solution found so far. These requirements produce the following constraints on the form of the time-varying function chosen for $\kappa(t)$:

$$\kappa(0) = \kappa_0, \qquad \frac{d\kappa}{dt} < 0, \qquad \kappa(t) \to \kappa_f \ \text{as $t$ increases}, \quad \kappa_f < \kappa_0$$

In this way, the degree to which the BO explores initially can be controlled by the choice for $\kappa_0$, and the degree to which it exploits the best solution found so far can be controlled through the choice for $\kappa_f$. The speed with which the optimization transitions from this exploration to exploitation behavior can be controlled by the choice for the rate at which $\kappa(t)$ decays.
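One function that satisfies these requirements is an exponential decay from a high initial value toward a low final value. The sketch below uses that form as an assumption; it is not claimed to be the function used in the experiments, and the parameter names and values (`kappa_initial`, `kappa_final`, `decay_steps`) are hypothetical.

```python
import math

def kappa_schedule(step, kappa_initial=2.0, kappa_final=0.1, decay_steps=5.0):
    """Exponentially decaying exploration parameter (assumed form)."""
    if step < 0:
        raise ValueError("step must be non-negative")
    decay = math.exp(-step / decay_steps)
    return kappa_final + (kappa_initial - kappa_final) * decay

# Values over a 30-step response.
schedule = [kappa_schedule(t) for t in range(30)]
```

A smaller `decay_steps` makes the switch from exploration to exploitation happen earlier in the response, while `kappa_initial` and `kappa_final` set how exploratory the search is at the start and how exploitative it is at the end.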

A second point of concern in devising the acquisition function is incorporating the influence of time. The underlying relationship between error-rate and actions can, in general, be expected to change with time. Thus, recent observations will have greater relevance to, and influence on, the decision being made at any given time-step compared to older observations. To account for this temporal variation, we propose a penalty function that discounts the utility of any given observation, as reflected by its corresponding acquisition function value, based on that observation’s “age” within the time-span of the response action-sequence.

However, the utility of any given action as determined by the UCB acquisition function depends on the GP regression model used to obtain $\mu$ and $\sigma^2$ (see Eq. (5)). The GP regression model captures the influence of past observations on any other unseen one being estimated based on their relative distances in action-space. Thus, the discounting of action utility must be incorporated into the error-rate data used to compute the GP regression model. As such, we define a penalty function that operates directly on the set of error-rates available at any given time-step of the response. In particular, at the $t$-th time-step of a response action-sequence, we will have available past observations given by $A$ and $\dot{E}$. Each error-rate, $\dot{e}_i$, in $\dot{E}$ is transformed to a discounted measure, $\dot{e}'_i$, through a penalty function, $p$, before computing the GP regression, where:


Here, $\Delta_i$ represents the age of the error-rate $\dot{e}_i$ at time-step $t$, and Eq. (11) indicates that the penalties should increase with age. This user-specified penalty function can be devised under this constraint depending on how strongly and quickly one wishes past data to lose its significance. An example is provided in the experiments presented in Section 4.
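As an illustration of a penalty that grows with age, the sketch below adds a linear age-dependent term to each stored error-rate before it would be passed to the GP regression. Because lower error-rates are better, an added positive term makes older observations look less attractive. The additive form, the linear penalty, the age definition (current step minus the step at which the rate was observed), and all names are assumptions for illustration, not the paper's penalty function.

```python
def age_penalty(age, scale=0.01):
    """Hypothetical penalty that increases linearly with the observation's age."""
    if age < 0:
        raise ValueError("age must be non-negative")
    return scale * age

def discount_error_rates(error_rates, current_step, scale=0.01):
    """Apply the age penalty to each error-rate; entry i was observed at step i."""
    discounted = []
    for observed_step, rate in enumerate(error_rates):
        age = current_step - observed_step
        discounted.append(rate + age_penalty(age, scale))
    return discounted

# Three past error-rates observed at steps 0, 1, and 2; the current step is 3.
discounted = discount_error_rates([0.05, 0.02, -0.01], current_step=3)
```

The oldest rate receives the largest penalty (0.03) and the most recent the smallest (0.01), so recent observations dominate the decision at the current time-step.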

Optimization of the acquisition function at each time-step can be conducted through standard approaches used in the literature. Typically, the UCB function can be effectively optimized using quasi-Newton methods such as the popularly-used L-BFGS-B algorithm (23).

4 Simulation Experiments

To demonstrate and validate the proposed method, experiments were conducted using the open-source CARLA autonomous driving car simulator (27). In these experiments, the proposed data-driven emergency-response method was used to detect and avoid imminent collisions with obstacles that an agent has never encountered before.

4.1 Experimental Setup

Nine different collision scenarios were set up in CARLA within a simulated urban environment (see Figure 2), each involving a different, unforeseen, stationary obstacle placed in the path of the autonomous car driving along a section of road in one of 5 different parts of the urban environment. These scenarios simulate a situation where an autonomous driving agent, assumed to have been trained to drive according to the rules of the road in an obstacle-free urban environment, is suddenly presented with an unforeseen situation involving a stationary obstacle placed in its path.

Normally the agent would not know how to best respond to such a novel situation and would have to experience it many times before learning the optimal response. This may even have to be repeated for different obstacle types. During this learning process the agent may take dangerous actions, possibly resulting in collisions with the obstacles.

Figure 2: Simulated stationary-obstacle scenarios used in the experiments: (a) location A, green car; (b) location B, garbage container; (c) location B, motorcycle; (d) location C, red car; (e) location C, vending machine; (f) location D, blue car; (g) location D, ATM machine; (h) location E, orange car; (i) location E, street-sign.

The proposed emergency-response method is incorporated into the simulation as an independent module that monitors the actions of the agent and the observations received from its forward-facing RGB camera sensor. Upon detecting an imminent collision using the approach described in Section 3.2 as the agent approaches a stationary obstacle, the module takes over the agent’s actions for the pre-specified next $T$ = 30 time-steps with a customized action-sequence generated by the method described in Section 3.3 to attempt to prevent this collision.

At the start of each simulation frame, the agent receives an image corresponding to the current system state. The agent selects and applies an action, namely, the throttle, brake, and steering inputs, and the simulation updates the system state accordingly to the next time-step. This process repeats until the end of the simulation run. A worst-case scenario is simulated here, where, in the absence of the emergency-response system, the agent takes no action to avoid the obstacle and continues to follow the road. More details on the simulation setup can be found in the Technical Appendix, Section D.

A VAE is trained in a pre-deployment phase in this same environment without obstacles. The upper error threshold, ULe, is empirically determined for each scenario based on observing the nominal reconstruction error variations obtained during the pre-deployment phase and selecting a reasonable value from this data.

At each time-step during deployment, a second-order polynomial is fit to the last $n_f$ = 15 reconstruction error data-points, and is extrapolated $n_e$ = 7 time-steps into the future. A positive detection of an emergency situation occurs if all of the last $n_d$ = 3 or more extrapolated data-points exceed ULe.

For the purposes of demonstration, the time-varying parameter, $\kappa(t)$, and the temporal variation penalties for the modified UCB acquisition function are given by equations (12) and (13), respectively, with parameter values of 0.07, 0, and 5. These satisfy the conditions given by equations (6) to (11), and allow the BO loop to make a rapid transition from exploration to exploitation in a manner sufficient to narrow down to an effective response in all the scenarios tested.


It should be noted that while these functions and their parameters were chosen simply to demonstrate the overall method, they can always be further tuned and also learned online after initially starting with conservative values.

4.2 Results

In the first set of experiments, both the Emergency Detection and Response Generation components of the system were active and the average agent speed before encountering an obstacle was 20km/h. The proposed method (abbreviated as BO+GP) was compared to an alternate approach where the action-sequence was generated through a random selection of the action vector values at each time-step of the response. Multiple repetitions for each scenario were conducted (details in Technical Appendix, Section D) and the percentage of runs that resulted in successful collision avoidance (i.e., success-rate) was computed. Table 1 summarizes the results.

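| Scen. # | 20km/h Tests: BO+GP | 20km/h Tests: Random | 30km/h Tests: BO+GP | 30km/h Tests: Random |
|---|---|---|---|---|
| 1 | 80% | 10% | 82% | 5% |
| 2 | 65% | 20% | 90% | 25% |
| 3 | 75% | 35% | 75% | 10% |
| 4 | 75% | 50% | 80% | 25% |
| 5 | 75% | 25% | 85% | 70% |
| 6 | 45% | 5% | 55% | 40% |
| 7 | 75% | 30% | 70% | 25% |
| 8 | 70% | 25% | 25% | 7.5% |
| 9 | 80% | 15% | 57.5% | 17.5% |
| Avg. | 71% | 24% | 68.8% | 25% |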
Table 1: Summary of success-rates for simulated collision-avoidance scenarios (note: taking no action resulted in a 0% success-rate in all scenarios for both tests).
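The averages in Table 1 can be checked directly from the per-scenario success-rates. The short sketch below recomputes them; the dictionary keys are labels chosen here for readability, and the values are copied from the table.

```python
# Per-scenario success-rates (in percent) for scenarios 1 to 9, from Table 1.
success_rates = {
    "20km/h BO+GP": [80, 65, 75, 75, 75, 45, 75, 70, 80],
    "20km/h Random": [10, 20, 35, 50, 25, 5, 30, 25, 15],
    "30km/h BO+GP": [82, 90, 75, 80, 85, 55, 70, 25, 57.5],
    "30km/h Random": [5, 25, 10, 25, 70, 40, 25, 7.5, 17.5],
}

# Mean success-rate per column.
averages = {name: sum(values) / len(values)
            for name, values in success_rates.items()}

for name, value in averages.items():
    print(f"{name}: {value:.1f}%")
```

The recomputed means agree with the reported averages after rounding, with the proposed method succeeding roughly three times as often as random action selection at both speeds.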

To further elucidate the impact that the use of the proposed approach has on VAE reconstruction errors and error-rates, the experiments were repeated at a higher average agent speed of 30km/h and with only the Response Generation component of the proposed method active. For a fair comparison, both the proposed approach and the random-selection approach were manually triggered at the same time for all scenarios to ensure that the same distance and initial approach speed existed for both. Table 1 summarizes the success-rate results for these tests as well.

As a representative example for illustration, Figure 3 shows the plots of the variations in VAE reconstruction errors and error-rates over the span of the response action-sequences for scenario 1. The proposed approach is compared with both a random response and taking no action.

Figure 3: Comparison of the impact of the proposed approach (BO+GP), a random response, and no-action, on (a) VAE reconstruction error-rates (computed from filtered errors), and (b) VAE reconstruction errors, during the span of the response action-sequence initiated over the scenario 1 runs. Solid lines show the median value and the shaded region shows the percentile to percentile range of the plotted quantities among all the runs.
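For readers who wish to produce a plot of this kind from their own runs, the following Python sketch shows one way to obtain error-rates from filtered errors and to summarize them across runs as a median with a percentile band. It is a sketch under assumptions: the exponential-moving-average filter, the finite-difference rate, the time-step `dt`, and the default 25th and 75th percentile bounds are placeholders chosen here, since the filter and the percentile values used for Figure 3 are not specified in this text.

```python
import statistics

def ema_filter(errors, alpha=0.3):
    """Exponential moving average, a stand-in for the unspecified filter."""
    out, prev = [], errors[0]
    for e in errors:
        prev = alpha * e + (1 - alpha) * prev
        out.append(prev)
    return out

def error_rates(filtered, dt=1.0):
    """Finite-difference rate of change between consecutive filtered errors."""
    return [(b - a) / dt for a, b in zip(filtered, filtered[1:])]

def band(runs, lower_pct=25, upper_pct=75):
    """Per-time-step median and percentile band across runs of equal length.
    The percentile defaults are placeholders, not the values used in Figure 3."""
    med, lo, hi = [], [], []
    for values in zip(*runs):
        cuts = statistics.quantiles(values, n=100, method="inclusive")
        med.append(statistics.median(values))
        lo.append(cuts[lower_pct - 1])
        hi.append(cuts[upper_pct - 1])
    return med, lo, hi

# Placeholder data: five runs, two time-steps each.
runs = [[1, 10], [2, 20], [3, 30], [4, 40], [5, 50]]
med, lo, hi = band(runs)
rates = error_rates([0.0, 1.0, 3.0], dt=0.5)
print(med, lo, hi, rates)
```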

5 Conclusions and Discussion

This paper proposed a data-driven emergency-response method for safe decision-making that allows a trained autonomous agent to safely address unforeseen situations encountered during deployment for which the existing policy becomes unreliable. When triggered, the method generates a response by finding optimal actions sequentially through minimization of VAE reconstruction errors from the novel observations using a modified BO algorithm. Simulation experiments in an autonomous car-driving domain demonstrate how minimization of observation uncertainty can find safe actions by using the proposed method to avoid collisions with obstacles the agent has never seen before.

The substantially greater average success-rate in collision-avoidance achieved by the proposed emergency-response method compared to a randomized approach, at both average agent speeds tested (71% versus 24% at 20km/h, and 68.8% versus 25% at 30km/h), indicates that effective, intelligent actions are indeed being selected to avoid the novel dangerous situations, beyond what random chance would allow. This demonstrates how minimizing a measure of uncertainty in the observations can be correlated with good actions that help to effectively deal with unforeseen situations.

Figure 3(a) shows how the BO-based sequential optimization approach is able to quickly reduce and stabilize the error-rates over the course of the response. Through a combination of braking and turning, the agent not only reduces the speed with which it approaches the obstacle, thus significantly delaying the peak rate that results when the agent is at its closest proximity to the obstacle, but also maintains the error-rates at a much lower value throughout, compared to a random or no response. In fact, in some cases negative rates are also reached when the agent turns towards an adjacent lane, thereby side-stepping the obstacle and removing it from its view, causing the uncertainty measure to drop significantly. These effective, danger-avoiding behaviors are also reflected in the reconstruction errors themselves shown in Figure 3(b), where the errors rise at a relatively slower rate and plateau at a relatively lower final value due to the agent having transitioned to a more familiar (i.e., safer) state.

It is also evident that some scenarios presented more of a challenge than others. Detection of the imminent collisions was observed to happen when the agent was on average about 8m away from the obstacle. This left little distance and time to react, and in some cases it was not enough for the final response to avoid the collision, even though it may have been effective had the danger been detected sooner. Qualitative observations of some of the failures under the proposed method show that the agent still tries to make sensible maneuvers to avoid the collision and almost succeeds.
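To give a sense of how little time an 8m detection distance leaves, the Python snippet below computes the time to contact at a constant approach speed. It is a back-of-the-envelope sketch that assumes the agent neither brakes nor steers and that the obstacle is stationary; the function name is hypothetical.

```python
def time_to_contact(distance_m, speed_kmh):
    """Seconds until contact at constant speed, with no braking or steering."""
    speed_ms = speed_kmh / 3.6  # convert km/h to m/s
    return distance_m / speed_ms

t20 = time_to_contact(8.0, 20.0)
print(f"{t20:.2f} s")  # about 1.44 s at 20km/h
```

At an approach speed of 20km/h, detection at 8m therefore leaves roughly 1.4 seconds before contact if no action is taken.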

Curved roads are especially challenging, since a much stronger steering response may be needed to completely avoid the collision, depending on which way the road turns and on which side the optimization decides to swerve (e.g., turning off on the side in the direction of the tangent to the road’s curvature would require less sharp steering to avoid a collision than choosing to turn into the curve). This accounts for the lower success-rates observed for the proposed method under some of the curved-road scenarios (scenarios 6-9). However, when the obstacle itself is smaller, the success-rate improves, since the scenario is more forgiving of actions that are weaker than one would ideally want them to be (e.g., compare the higher success-rates for the smaller ATM (scenario 7) and street-sign (scenario 9) obstacles to their wider and larger car-obstacle counterparts (scenarios 6 and 8, respectively) for the 30km/h tests).

It should be noted that the intention here was not to create the best obstacle-avoidance system for an autonomous car, but rather to demonstrate how minimization of observation uncertainty can be an effective driver to safely address novel situations that a learning agent would otherwise be unprepared to respond to. Moreover, the proposed method does not rely on having an accurate model of the environment or prior experience with those dangerous scenarios.

The method is also generic as it does not use context-specific information to generate the response (i.e., it does not need to know what the obstacle is, or what its presence means in the context of an autonomous driving car). As such, the performance of the method can always be improved by incorporating context-specific mechanisms on top of the basic emergency-response system for the particular application being addressed, if so desired.