Neural Simplex Architecture

by   Dung Phan, et al.

We present the Neural Simplex Architecture (NSA), a new approach to runtime assurance that provides safety guarantees for neural controllers (obtained e.g. using reinforcement learning) of complex autonomous and other cyber-physical systems without unduly sacrificing performance. NSA is inspired by the Simplex control architecture of Sha et al., but with some significant differences. In the traditional Simplex approach, the advanced controller (AC) is treated as a black box; there are no techniques for correcting the AC after it generates a potentially unsafe control input that causes a failover to the BC. Our NSA addresses this limitation. NSA not only provides safety assurances for CPSs in the presence of a possibly faulty neural controller, but can also improve the safety of such a controller in an online setting via retraining, without degrading its performance. NSA also offers reverse switching strategies, which allow the AC to resume control of the system under reasonable conditions, allowing the mission to continue unabated. Our experimental results on several significant case studies, including a target-seeking ground rover navigating an obstacle field and a neural controller for an artificial pancreas system, demonstrate NSA's benefits.




1. Introduction

Deep neural networks (DNNs) in combination with reinforcement learning (RL) are increasingly being used to train powerful AI agents. These agents have achieved unprecedented success in strategy games, including beating the world champion in Go (Silver et al., 2017b), surpassing state-of-the-art chess and shogi engines (Silver et al., 2017a), and achieving human-level skill in Atari video games (Mnih et al., 2015). For these agents, safety is neither a concern nor a requirement: when a game-playing agent makes a mistake, the worst-case scenario is losing a game. The same cannot be said for AI agents that control cyber-physical systems (CPSs). A mistake by an AI controller may cause physical damage to the CPS it controls and its environment, including humans.

In this paper, we present the Neural Simplex Architecture (NSA), a new approach to runtime assurance that provides safety guarantees for AI controllers, including neural controllers obtained using reinforcement learning, of complex autonomous and other CPSs without unduly sacrificing performance. NSA is inspired by Sha et al.’s Simplex control architecture (Sha, 2001; Seto et al., 1998), but with some significant differences. In this architecture, a decision module (DM) switches control from a high-performance but unverified (hence potentially unsafe) advanced controller (AC) to a verified-safe baseline controller (BC) if a safety violation is imminent. In the traditional Simplex approach, the AC is treated as a black box, and there are no techniques for correcting the AC after it generates a potentially unsafe control input that causes a failover to the BC.

Figure 1. The Neural Simplex Architecture.

NSA, illustrated in Fig. 1, addresses this limitation. The high-performance Neural Controller (NC) is a DNN that, given a plant state (or raw sensor readings), produces a control input for the plant. For complex plants and environments, manually designing a high-performance controller can be challenging, even if a white-box model of the plant is available. Learning a neural controller using RL is an attractive alternative, as it only requires a black-box model and an appropriately defined reward function.

NSA’s use of an NC, as opposed to the black-box AC found in traditional Simplex, allows online retraining of the NC’s DNN to occur. Such retraining is performed by NSA’s Adaptation Module (AM) using RL techniques. For systems with large state spaces, it may be difficult to achieve thorough coverage during initial training of the NC. Online retraining has the advantage of focusing the learning on areas of the state space that are relevant to the actual system behavior; i.e., regions of the state space the system actually visits.

The AM seeks to eliminate unsafe behavior exhibited by the NC, without degrading its performance. While the BC is in control of the plant, the NC runs in shadow mode and is actively retrained by the AM. The DM can subsequently switch control back to the NC with high confidence that it will not repeat the same mistakes. These online retraining and reverse switching capabilities are some of NSA’s distinguishing features. They allow NSA to improve system performance while continuing to ensure safety. Reverse switching permits the mission to continue under the auspices of the high-performance NC, which due to retraining becomes significantly more likely to deliver safe control inputs to the plant.

We also address the problem of safe reinforcement learning (SRL) (García and Fernández, 2015; Xiang et al., 2018) during the initial training of the NC. We demonstrate that recent approaches to SRL (e.g. justified speculative control (Fulton and Platzer, 2018) and preemptive shielding (Alshiekh et al., 2017)) are highly ineffective when used with policy-gradient RL algorithms. We use a simple yet effective approach to SRL that achieves superior results. In this approach, when the learning agent produces an unsafe action, we: (i) use that action as a training sample (but do not execute it), with a large negative reward because it is unsafe, and (ii) use safe actions to safely terminate the current trajectory (but not to train the agent). In contrast, other recent approaches, such as those cited above, do not use the unsafe action as a training sample.

We illustrate NSA on several example CPSs, including a target-seeking rover navigating through an obstacle field, and a neural controller for an artificial pancreas. Our results on these case studies conclusively demonstrate NSA’s benefits.

In summary, the main contributions of this paper are:

  • We introduce the Neural Simplex Architecture, a new approach to runtime assurance that provides safety guarantees for neural controllers of CPSs. We view NSA as providing a platform for runtime-assured autonomy.

  • We address a limitation of the traditional Simplex approach, namely lack of techniques for correcting the AC’s behavior after failover to the BC has occurred, so that reverse switching makes sense in the first place.

  • We provide a thorough evaluation of the NSA approach on two significant case studies.

Structure of the rest of the paper. Section 2 provides background on the Simplex control architecture and reinforcement learning. Section 3 presents our evaluation of SRL-PUA, an approach to safe reinforcement learning with penalized unrecoverable actions. Section 4 discusses our new NSA architecture. Sections 5-7 contain our experimental results. Section 8 considers related work, while Section 9 offers our concluding remarks and directions for future work.

2. Background

2.1. Simplex Architecture

The Simplex architecture was introduced by Sha et al. (Sha, 2001; Seto et al., 1998) as a mechanism for ensuring high confidence in the control of safety-critical systems. The architecture is similar to NSA (see Fig. 1), but without the Adaptation Module (AM) and where the NC is called the Advanced Controller (AC). The AC is in control of the plant under nominal operating conditions, and is designed to achieve high performance according to certain metrics (e.g., maneuverability, fuel economy, mission duration). The BC is certified to keep the plant within a prescribed safety region. A certified Decision Module (DM) continually monitors the state of the plant and switches control to the BC should the plant be in imminent danger (i.e., within the next time step) of exiting the safety region. As such, Simplex assures that the plant, e.g., an autonomous vehicle, is correctly controlled even in the presence of a faulty AC.

The BC is certified to guarantee the safety of the plant only if it takes over control while the plant’s state is within a recoverable region $\mathcal{R}$. As a simple example, consider the BC for a ground rover that simply applies maximum deceleration $a_{\max}$ to stop the rover. The braking distance to stop the rover from a velocity $v$ is therefore $d_{br}(v) = v^2 / (2 a_{\max})$. The BC can be certified to prevent the rover from colliding with obstacles if it takes over control in states such that $d_{br}(v)$ is less than the minimum distance to any obstacle. The set of such states is the recoverable region of this BC.
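The rover example can be made concrete with a short sketch. The names `braking_distance`, `in_recoverable_region`, and `max_decel` are illustrative and do not come from the paper; the distance uses the standard constant-deceleration formula, speed squared over twice the deceleration.

```python
def braking_distance(speed: float, max_decel: float) -> float:
    """Distance needed to stop from `speed` under constant deceleration `max_decel`."""
    if max_decel <= 0:
        raise ValueError("max_decel must be positive")
    return speed * speed / (2.0 * max_decel)


def in_recoverable_region(speed: float, min_obstacle_dist: float, max_decel: float) -> bool:
    """True if a braking BC that takes over now stops before the nearest obstacle."""
    return braking_distance(speed, max_decel) < min_obstacle_dist
```

For instance, at 4 m/s with a maximum deceleration of 2 m/s², the rover needs 4 m to stop, so a state with the nearest obstacle 5 m away is recoverable while one with the obstacle exactly 4 m away is not.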

A control input is called recoverable if it keeps the plant inside the recoverable region $\mathcal{R}$ within the next time step. Otherwise, the control input is called unrecoverable. The DM switches control to the BC when the AC produces an unrecoverable control input. The DM’s switching condition determines whether a control input is unrecoverable. We also refer to it as the forward switching condition (FSC) to distinguish it from the condition for reverse switching, a new feature of NSA we discuss in Section 4.

Techniques to determine the forward switching condition include: (i) shrink the recoverable region $\mathcal{R}$ by an amount equal to a time step times the maximum gradient of the state with respect to the control input; then classify any control input as unrecoverable if the current state is outside this smaller region; (ii) if the model of the plant is deterministic, simulate it for one time step and check whether the plant strays from $\mathcal{R}$; (iii) compute a set of states reachable within one time step and determine whether the reachable set contains states outside $\mathcal{R}$.
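As an illustration of technique (ii), the sketch below evaluates a forward switching condition by simulating a deterministic plant model for one time step. The toy one-dimensional rover model, the constants `DT` and `MAX_DECEL`, and all function names are assumptions made for this example, not part of the paper.

```python
from typing import Callable, Tuple

State = Tuple[float, float]  # (distance to obstacle, speed)

DT = 0.1         # hypothetical time step, in seconds
MAX_DECEL = 2.0  # hypothetical maximum deceleration of the braking BC


def rover_step(state: State, accel: float) -> State:
    """Deterministic one-step model of a rover driving toward an obstacle."""
    dist, speed = state
    new_speed = max(0.0, speed + accel * DT)
    return (dist - new_speed * DT, new_speed)


def rover_recoverable(state: State) -> bool:
    """A state is recoverable if the braking distance is shorter than the gap."""
    dist, speed = state
    return speed * speed / (2.0 * MAX_DECEL) < dist


def forward_switching_condition(
    state: State,
    action: float,
    step: Callable[[State, float], State],
    is_recoverable: Callable[[State], bool],
) -> bool:
    """True if `action` is unrecoverable, i.e. the DM should switch to the BC."""
    return not is_recoverable(step(state, action))
```

With the rover 1 m from the obstacle at 1.9 m/s, accelerating at 1 m/s² leads to a state that is no longer recoverable, so the condition fires; decelerating at 1 m/s² keeps the next state recoverable.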

With the traditional Simplex approach, the AC is treated as a black box. When the DM switches control to the BC, the BC remains in control forever. There are no guidelines for switching control back to the AC, nor are there methods for correcting the AC after it produces an unsafe control input that causes the DM to failover to the BC. Our NSA approach addresses these limitations.

2.2. Reinforcement Learning

This section provides an overview of Reinforcement Learning (RL) algorithms for policies involving continuous, real-valued actions, such as those applicable to the control of CPSs.

Figure 2. Agent-environment interaction in reinforcement learning, from (Sutton and Barto, 1998). The dotted line represents the boundary between the current and the next time step.

Reinforcement learning (Sutton and Barto, 1998; Szepesvári, 2009) is concerned with the problem of how an agent learns which sequence of actions to take in a given environment such that a cumulative reward is maximized.

As shown in Fig. 2, at each time step $t$, the agent receives observation $s_t$ and reward $r_t$ from the environment and takes action $a_t$. The environment receives action $a_t$ and emits observation $s_{t+1}$ and reward $r_{t+1}$ in response. Let $\gamma \in [0, 1]$ be a discount factor. The goal of reinforcement learning is to learn a policy $\pi$, i.e., a way of choosing an action at each time step such that the expected discounted sum of rewards is maximized. The discounted sum of future rewards is called the return and is defined as $R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$.

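The action-value function $Q^{\pi}(s, a) = \mathbb{E}[R_t \mid s_t = s, a_t = a]$ is the expected return for selecting action $a$ in state $s$ and then always following policy $\pi$. The optimal action-value function $Q^{*}(s, a) = \max_{\pi} Q^{\pi}(s, a)$ gives the maximum action-value for state $s$ and action $a$ achievable by any policy. The state-value function $V^{\pi}(s) = \mathbb{E}[R_t \mid s_t = s]$ is the expected return starting from state $s$, and then always following policy $\pi$. The optimal state-value function is $V^{*}(s) = \max_{\pi} V^{\pi}(s)$.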
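The return defined above can be computed for a finite reward sequence with a single backward pass. This is a minimal sketch; the function name `discounted_return` is illustrative, and the infinite sum is truncated at the end of the given reward list.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Discounted sum of rewards: rewards[0] + gamma * rewards[1] + gamma^2 * rewards[2] + ..."""
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    total = 0.0
    # Work backwards so that each step applies one more factor of gamma.
    for r in reversed(rewards):
        total = r + gamma * total
    return total
```

Averaging this quantity over many trajectories that start in the same state and follow the same policy gives a Monte Carlo estimate of the state-value function for that state.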

Early RL algorithms were designed for discrete state and action spaces. These algorithms usually use look-up tables to store policies and value functions. As such, they are not applicable to large-scale or continuous problems. Recent advances in RL are driven by deep learning, as deep neural networks (DNNs) have been shown to be very effective at approximating functions. In particular, deep reinforcement learning algorithms use DNNs extensively for representing policies and value functions, extracting features from large state spaces such as pixels in video games. Crucially, DNNs allow deep reinforcement learning algorithms to operate effectively over continuous state and action spaces.

Algorithms such as TRPO (Schulman et al., 2015), DDPG (Lillicrap et al., 2015), A3C (Mnih et al., 2016), ACER (Wang et al., 2016), PPO (Schulman et al., 2017), and ACKTR (Wu et al., 2017) have emerged as promising solutions for RL-based control problems in continuous domains. We use the DDPG algorithm in our experiments.

3. Safe Reinforcement Learning with Penalized Unrecoverable Actions

This section presents our evaluation of safe reinforcement learning with penalized unrecoverable actions (SRL-PUA), an approach for safe reinforcement learning of CPS controllers. Although SRL-PUA is our learning algorithm of choice for NSA, it is not specific to NSA, and represents a general SRL technique.

A common approach to SRL is to filter the learning agent’s unrecoverable actions before they reach the plant. For example, when the learning agent produces an unrecoverable action, a runtime monitor (Fulton and Platzer, 2018) or a preemptive shield (Alshiekh et al., 2017) replaces it with a recoverable one to continue the trajectory. The recoverable action is also passed to the RL algorithm to update the agent. Unrecoverable actions are discarded. We use the terms “recoverable” and “unrecoverable” in the context of our broader discussion of NSA, but these terms can be replaced by “safe” and “unsafe”, respectively, when talking about other runtime-assurance methods.

While filtering-based approaches have been shown to work for discrete-action problems that use Q-learning algorithms, we found that they are not suitable when policy-gradient RL algorithms are used. There are two reasons for this. First, the replacement actions are inconsistent with the learning agent’s probability distribution of actions. This negatively affects algorithms such as REINFORCE (Williams, 1992; Sutton et al., 1999), Natural Policy Gradient (Kakade, 2002), TRPO (Schulman et al., 2015), and PPO (Schulman et al., 2017), which optimize stochastic policies, as they assume that the actions used for training are sampled from the learning agent’s distribution. This is less relevant to DDPG (Lillicrap et al., 2015), which prefers uncorrelated samples to train deterministic policies.

Second, without penalties for unrecoverable actions, the training samples always have positive rewards. This negatively impacts a policy-gradient algorithm’s ability to fit a good model, e.g., a DNN to estimate the state-value function or action-value function. A policy-gradient algorithm uses this model to estimate the advantage of an action, i.e., how good or bad the action is compared to the average action, to update the learning agent. If there is a lack of samples with penalties for unrecoverable actions, the model is likely to be incorrect, leading to ineffective updates to the learning agent.

SRL-PUA represents a different approach to safe reinforcement learning that works well with policy gradient methods. This is important because policy gradient methods underlie much of the recent success in RL. This approach still needs a way to determine if an action is unrecoverable. We can use Simplex’s switching logic, a runtime monitor (Fulton and Platzer, 2018), or a shield (Alshiekh et al., 2017) for this purpose. When the learning agent produces an unrecoverable action while exploring a trajectory, SRL-PUA assigns a penalty (negative reward) to that action, uses it as a training sample, and then uses recoverable actions to safely terminate the trajectory.

The safety of the plant is guaranteed by the recoverable actions, which may be obtained from a BC or another technique. These recoverable actions are not used to train the agent. The training then continues by exploring a new trajectory from a random initial state. This approach addresses both aforementioned issues, because actions used for training are sampled from the learning agent, and penalty samples are collected for a better estimate of the state-value function and/or the action-value function.
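The SRL-PUA exploration step described above can be sketched as follows. This is a simplified illustration on scalar states and actions, not the authors' implementation: the function name, the sample layout `(state, action, reward, next_state, done)`, and the choice to store `None` as the next state of a penalized sample are all assumptions. The BC's safe termination of the trajectory is only indicated by a comment.

```python
from typing import Callable, List, Optional, Tuple

# (state, action, reward, next_state, done); next_state is None for a penalized sample
Sample = Tuple[float, float, float, Optional[float], bool]


def collect_trajectory_pua(
    policy: Callable[[float], float],
    step: Callable[[float, float], float],
    reward_fn: Callable[[float, float, float], float],
    is_unrecoverable: Callable[[float, float], bool],
    init_state: float,
    penalty: float,
    max_len: int,
) -> Tuple[List[Sample], bool]:
    """Explore one trajectory; return (training samples, whether it ended unsafely)."""
    samples: List[Sample] = []
    s = init_state
    for _ in range(max_len):
        a = policy(s)
        if is_unrecoverable(s, a):
            # Keep the unrecoverable action as a penalized training sample,
            # but never send it to the plant.
            samples.append((s, a, penalty, None, True))
            # A BC (not modeled here) would now bring the plant to a safe stop;
            # its recoverable actions are not added to `samples`.
            return samples, True
        s_next = step(s, a)
        samples.append((s, a, reward_fn(s, a, s_next), s_next, False))
        s = s_next
    return samples, False
```

The key contrast with filtering-based training is in the `if` branch: the agent's own action is kept with a negative reward instead of being replaced by a recoverable one.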

We used the DDPG and TRPO algorithms to train neural controllers for an inverted pendulum (IP) control system to demonstrate that learning without penalties for unrecoverable actions is highly ineffective. Details about the IP system, including the reward function and the BC used to generate recoverable actions, are presented in Section 5.

We used the implementations of DDPG and TRPO in rllab (Duan et al., 2016). For TRPO, we trained two DNNs, one for the mean and the other for the standard deviation of a Gaussian policy. Both DNNs have two fully connected hidden layers of 32 neurons each and one output layer. The hidden layers all use the tanh activation function, and the output layer is linear. For DDPG, we trained a DNN that computes the action directly from the state. The DNN has two fully connected hidden layers of 32 neurons each and one output layer. The hidden layers use the ReLU activation function, and the output layer uses tanh. We followed the choice of activation functions in the examples accompanying rllab.
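To make the DDPG policy architecture concrete, the following dependency-free sketch builds a network with two fully connected hidden layers of 32 ReLU units and a single tanh output, and runs a forward pass. It is not the rllab implementation; the weight initialization, the function names, and the four-dimensional input (matching the inverted pendulum state) are assumptions for illustration.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple

Layer = Tuple[List[List[float]], List[float]]  # (weights[out][in], biases[out])


def make_layer(n_in: int, n_out: int, rng: random.Random) -> Layer:
    """Randomly initialized dense layer (hypothetical uniform initialization)."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(layer: Layer, x: Sequence[float], activation: Callable[[float], float]) -> List[float]:
    """Apply one fully connected layer followed by an element-wise activation."""
    weights, biases = layer
    return [activation(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, biases)]


def relu(z: float) -> float:
    return max(0.0, z)


def make_actor(state_dim: int, rng: random.Random, hidden: int = 32) -> List[Layer]:
    """Two hidden layers of `hidden` units and one scalar output layer."""
    return [make_layer(state_dim, hidden, rng),
            make_layer(hidden, hidden, rng),
            make_layer(hidden, 1, rng)]


def actor_forward(actor: List[Layer], state: Sequence[float]) -> float:
    """Deterministic policy: state -> action in [-1, 1] (ReLU hidden layers, tanh output)."""
    h = dense(actor[0], state, relu)
    h = dense(actor[1], h, relu)
    return dense(actor[2], h, math.tanh)[0]
```

Because the output layer uses tanh, the raw action lies in [-1, 1] and would need to be scaled to the plant's actual control range.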

For each algorithm, we ran two training experiments. In the first one, we reproduce the filtering approach, i.e., we replace an unrecoverable action produced by the learning agent with the BC’s recoverable action, use the latter as the training sample, and continue the trajectory. We call this training method SRL-BC. In the second experiment, we evaluate the SRL-PUA approach: whenever the learning agent produces an unrecoverable action, we use that action with an associated penalty as a training sample and terminate the trajectory. Note that both algorithms explore different trajectories by resetting the system to a random initial state whenever the current trajectory is terminated. We set the maximum trajectory length to 500 time steps; this means that a trajectory is terminated when it exceeds 500 time steps.

We trained the DDPG and TRPO agents on a total of one million time steps. After training, we evaluated all trained policies on the same set of 1,000 random initial states. During evaluation, if an agent produces an unrecoverable action, the trajectory is terminated. The results are shown in Table 1. For both algorithms, the policies trained with recoverable actions (SRL-BC approach) produce unrecoverable actions in all test trajectories, while the SRL-PUA approach, where the policies are trained with penalties for unrecoverable actions, does not produce any such actions. As a result, the latter policies achieve superior returns and trajectory lengths (they are able to safely control the system the entire time).

In the above experiments, we replaced unrecoverable actions with actions generated by a deterministic BC, whereas the monitoring (Fulton and Platzer, 2018) and preemptive shielding  (Alshiekh et al., 2017) approaches replace unrecoverable actions with random recoverable actions. To show that our conclusions are independent of this difference, we ran one more experiment with each learning algorithm, in which we replaced each unrecoverable action with an action selected by randomly generating actions until a recoverable one is found. The results, shown in Table 2, once again demonstrate that training with only recoverable actions is ineffective. Compared to filtering-based approaches (SRL-BC in Table 1 and SRL-RND in Table 2), the SRL-PUA approach yields a 25- to 775-fold improvement in the average return.

              SRL-BC    SRL-PUA     SRL-BC    SRL-PUA
Unrec Trajs   1,000     0           1,000     0
Comp Trajs    0         1,000       0         1,000
Avg. Return   112.53    4,603.97    61.52     4,596.04
Avg. Length   15.15     500         14.56     500
Table 1. Policy performance comparison, with one SRL-BC/SRL-PUA column pair per learning algorithm. SRL-BC: policy trained with BC’s actions replacing unrecoverable ones. SRL-PUA: policy trained with penalized unsafe actions. Unrec Trajs: # trajectories terminated during evaluation due to an unrecoverable action. Comp Trajs: # trajectories that reach limit of 500 time steps. Avg. Return and Avg. Length: average return and trajectory length over 1,000 evaluated trajectories.

              SRL-RND   SRL-PUA     SRL-RND   SRL-PUA
Unrec Trajs   1,000     0           1,000     0
Comp Trajs    0         1,000       0         1,000
Avg. Return   183.36    4,603.97    5.93      4,596.04
Avg. Length   1.93      500         14        500
Table 2. Policy performance comparison, with one SRL-RND/SRL-PUA column pair per learning algorithm. SRL-RND: policy trained with random recoverable actions replacing unrecoverable ones.
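The fold improvement quoted in the text can be checked directly from the average returns in Tables 1 and 2. The short calculation below pairs each filtering-based return with the SRL-PUA return listed next to it; the labels are only descriptive.

```python
# Average returns as listed in Tables 1 and 2: (label, filtering-based, SRL-PUA).
avg_returns = [
    ("Table 1, first pair (SRL-BC)", 112.53, 4603.97),
    ("Table 1, second pair (SRL-BC)", 61.52, 4596.04),
    ("Table 2, first pair (SRL-RND)", 183.36, 4603.97),
    ("Table 2, second pair (SRL-RND)", 5.93, 4596.04),
]

# Ratio of the SRL-PUA return to the filtering-based return for each pair.
fold_improvement = {label: pua / filtered for label, filtered, pua in avg_returns}

for label, ratio in fold_improvement.items():
    print(f"{label}: {ratio:.1f}x")
```

The smallest and largest ratios, roughly 25 and 775, match the range stated above.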

4. Main Components of NSA

In this section, we discuss the main components of NSA, namely the neural controller (NC), the adaptation module (AM), and the reverse switching logic. These components in particular are not found in the Simplex control architecture, the underlying inspiration for NSA.

4.1. The Neural Controller

The NC is a DNN that can represent a deterministic or stochastic policy. For a deterministic policy, the DNN maps system states (or raw sensor readings) to control inputs. For a stochastic policy, the DNN maps system states (or raw sensor readings) to parameters of a probability distribution. For example, a DNN can represent a Gaussian policy by mapping a system state to the mean and standard deviation parameters of a Gaussian distribution; then, a control input is drawn from that distribution. It is also possible to train a separate DNN for each parameter of a probability distribution. The NC can be obtained using any RL algorithm. We used the DDPG algorithm with the safe learning strategy of penalizing unrecoverable actions, as discussed in Section 3. DDPG is an attractive choice because it works with deterministic policies, and allows uncorrelated samples to be added to the pool of samples for training or retraining. The last property is important because it allows us to collect disconnected samples of what the NC would do while the plant is under the BC’s control, and use these samples for online retraining of the NC.

4.2. The Adaptation Module

The AM retrains the NC in an online manner when the NC produces an unrecoverable action that causes the DM to failover to the BC. Since the traditional Simplex architecture already assures safety, the main reason to retrain the NC is to improve performance. Recall that the NC is trained to exhibit high performance, especially compared to the BC. Without retraining, the NC may behave in the same, or in a similar, manner that led to an earlier failover. With retraining, the NC will be less likely to repeat the same or similar mistakes, allowing it to remain in control of the system more often.

Candidate techniques that we consider for online retraining of the NC include supervised learning and reinforcement learning. In supervised learning, state-action pairs of the form $(s, a)$ are required for training purposes. The training algorithm uses these examples to teach the NC safe behavior. The control inputs produced by the BC can be used as training examples. However, this will train the NC to imitate the BC’s behavior, which may lead to a loss in performance, especially if the BC’s focus is primarily on safety.

Therefore, we prefer reinforcement learning for online retraining, with a reward function that penalizes unsafe control inputs, and rewards safe, high-performance ones. This approach improves the safety of the NC without unduly sacrificing performance. In general, the reward function for retraining can be designed as follows.

$$r(s, a, s') = \begin{cases} r_{unrecov} & \text{if } FSC(s, a) \\ r_{perf}(s, a, s') & \text{otherwise} \end{cases}$$

where FSC is the forward switching condition (the condition the DM evaluates to decide whether to transfer control from the NC to the BC), $r_{unrecov}$ is a negative number used to penalize unrecoverable actions, and $r_{perf}(s, a, s')$ is a performance-related reward function.
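A minimal sketch of such a retraining reward, assuming the forward switching condition and the performance reward are supplied as callables. The name `make_retraining_reward` and its parameters are hypothetical, and the penalty value used in the test below is arbitrary.

```python
from typing import Callable, Sequence

State = Sequence[float]


def make_retraining_reward(
    fsc: Callable[[State, float], bool],
    perf_reward: Callable[[State, float, State], float],
    unrecov_penalty: float,
) -> Callable[[State, float, State], float]:
    """Build a reward that penalizes unrecoverable actions and otherwise rewards performance."""
    if unrecov_penalty >= 0:
        raise ValueError("the penalty for unrecoverable actions must be negative")

    def reward(s: State, a: float, s_next: State) -> float:
        # Unrecoverable action: fixed penalty. Otherwise: performance-related reward.
        return unrecov_penalty if fsc(s, a) else perf_reward(s, a, s_next)

    return reward
```

Keeping the penalty separate from the performance term makes it easy to tune how strongly retraining steers the NC away from actions that trigger the FSC.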

4.3. Retraining

Our basic procedure for online retraining is as follows. When the NC outputs an unrecoverable action, the DM switches control to the BC, and the AM computes the reward for the NC’s unsafe action and adds this sample to a pool of training samples. At every time step while the BC is active, the AM takes a sample by running the NC in shadow mode to compute its proposed action, and then computing a reward for the NC’s proposed action. Samples are of the form $(s, a, s', r)$, where $s$ is the current state, $a$ is the action proposed by the NC, $s'$ is the state obtained by applying $a$ to state $s$, and $r$ is the reward for taking $a$ in state $s$. To obtain the next state $s'$, the AM runs a simulation of the system for one time step. The AM retrains the NC at each time step the BC is in control, using the collected retraining samples and the same algorithm as in the initial training.

We evaluated several variants of this procedure, by making different choices along the following dimensions.

  1. Start retraining with an empty pool of samples or with the pool created during the initial training of the NC.

  2. Add exploration noise to NC’s action when collecting a sample, or do not add noise. Adding noise means that the action included in each training sample is the sum of the action produced by NC and a random noise term. Note that if NC is in control when the sample is collected, then the action sent to the plant is NC’s action without noise; using the noisy action for plant control would degrade performance.

  3. Collect retraining samples only while BC is in control or at every time step. In both cases, the action in each training sample is the action output by NC (or a noisy version of it); we never use BC’s action in a training sample. Also, in both cases, the retraining algorithm for updating the NC (using the accumulated sample pool) is run only while the BC is in control.

We found that reusing the pool of training samples (DDPG’s so-called experience replay buffer) from initial training of the NC helps evolve the policy in a more stable way, as retraining samples gradually replace initial training samples in the sample pool. Another benefit of reusing the initial training pool is that the NC can be immediately retrained without having to wait for enough samples to be collected online. We found that adding exploration noise to NC’s actions in retraining samples and collecting retraining samples at every time step increase the benefit of retraining. This is because these two strategies provide more diverse samples and thereby help achieve more thorough exploration of the state-action space.
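The sample-collection choices discussed above can be sketched with a bounded sample pool and a shadow-mode sampling step. This is an illustrative simplification on scalar states, not DDPG itself: the class `SamplePool`, the function `shadow_sample`, and the Gaussian noise parameter `noise_std` are assumptions.

```python
import random
from collections import deque
from typing import Callable, Iterable, List, Tuple

Sample = Tuple[float, float, float, float]  # (state, action, next_state, reward)


class SamplePool:
    """Bounded pool in which new retraining samples gradually replace the oldest ones."""

    def __init__(self, capacity: int, initial: Iterable[Sample] = ()) -> None:
        # Seeding with the initial-training pool lets retraining start immediately.
        self.buffer = deque(initial, maxlen=capacity)

    def add(self, sample: Sample) -> None:
        self.buffer.append(sample)

    def sample_batch(self, batch_size: int, rng: random.Random) -> List[Sample]:
        return rng.sample(list(self.buffer), min(batch_size, len(self.buffer)))


def shadow_sample(
    nc: Callable[[float], float],
    simulate_step: Callable[[float, float], float],
    reward_fn: Callable[[float, float, float], float],
    state: float,
    noise_std: float,
    rng: random.Random,
) -> Sample:
    """Build one retraining sample from the NC's (noisy) proposed action.

    The action is only simulated for one step; it is never sent to the plant.
    """
    action = nc(state) + rng.gauss(0.0, noise_std)
    next_state = simulate_step(state, action)
    return (state, action, next_state, reward_fn(state, action, next_state))
```

With a pool seeded from initial training, each call to `add` evicts the oldest sample once the pool is full, which mirrors the gradual replacement of initial samples by retraining samples.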

4.4. Reverse Switching

Figure 3. Switching boundaries

The traditional Simplex architecture provides no guidelines for reverse switching, i.e., switching from the BC to the AC. Consequently, when the DM switches to the BC, the BC remains in control forever. This sacrifices performance. In contrast, NSA includes reverse switching to improve performance. An additional benefit of well-designed reverse switching is that it lessens the burden on the BC to achieve performance objectives, leading to a simpler BC design that focuses mainly on safety.

The reverse switching condition (RSC) is the condition that triggers a switch back to the retrained NC. Control of the plant is returned to the NC when the RSC is true and the FSC is false in the current state. The latter condition ensures that reverse switching is safe. We seek to develop reverse switching conditions that return control to NC when it is safe to do so, and that avoid frequent switching between the BC and NC. Retraining of the NC makes it more likely to deliver recoverable control inputs going forward, thereby reducing the frequency of switching. Nevertheless, an overly “eager” reverse switching condition, which always immediately returns control to NC when it is safe to do so, might cause an excessive amount of switching.

We propose two approaches to reverse switching condition design. One approach is to reverse-switch if a forward switch will not occur in the near future. For deterministic systems, this can be checked by simulation; specifically, simulate the composition of the NC and plant for $T$ time steps, and reverse-switch if the forward-switching condition does not hold within this time horizon. For nondeterministic systems, a similar technique can be employed, except using a model checker instead of a simulator. This approach, used in our inverted pendulum case study, directly prevents frequent switching but may be computationally expensive for complex systems. A simpler approach is to reverse-switch if the current plant state is sufficiently far from the NC-to-BC switching boundary (see Fig. 3). This approach is used in our rover navigation case study.
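The simulation-based reverse switching check for a deterministic system can be sketched as below. The function name, the scalar state, and the callable arguments are assumptions for illustration; a real DM would use the plant model and FSC of the application at hand.

```python
from typing import Callable


def reverse_switching_condition(
    state: float,
    nc: Callable[[float], float],
    simulate_step: Callable[[float, float], float],
    fsc: Callable[[float, float], bool],
    horizon: int,
) -> bool:
    """True if the NC, simulated from `state`, never triggers the FSC within `horizon` steps."""
    s = state
    for _ in range(horizon):
        a = nc(s)
        if fsc(s, a):
            # A forward switch would occur within the horizon: keep the BC in control.
            return False
        s = simulate_step(s, a)
    return True
```

Because the first iteration evaluates the FSC in the current state, a positive result also implies that handing control back to the NC is safe right now. A longer horizon makes the check more conservative and more expensive.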

We emphasize that the choice of the reverse switching condition does not affect safety and is application-dependent. In experiments with the inverted pendulum and rover case studies, we also found that varying the simulation horizon $T$ or the distance to the switching boundary has little impact on the number of time steps the system spent under the BC’s control.

5. Inverted Pendulum Case Study

This section describes the problem setup and experimental results for the inverted pendulum case study. The inverted pendulum is a well studied problem in both the Simplex and reinforcement learning literature. The simplicity and small state-action space make it an ideal starting point to showcase a proof of concept.

5.1. The Inverted Pendulum Problem

We consider the classic control problem of keeping an inverted pendulum upright on a movable cart. We describe the problem briefly here; a detailed exposition is available in (Seto et al., 1999). The linearized dynamics is given by

$$\dot{x} = A x + B u$$

where $x = [p, \dot{p}, \theta, \dot{\theta}]^T$ is the state vector consisting of the cart position $p$, cart velocity $\dot{p}$, pendulum angle $\theta$, and pendulum angular velocity $\dot{\theta}$, and control input $u$ is the armature voltage applied to the cart’s motor. The constant matrix $A$ and constant vector $B$ are given in (Johnson et al., 2016).

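The safety constraints for this system are bounds on the cart position $p$ (in m), the cart velocity $\dot{p}$ (in m/s), and the pendulum angle $\theta$. The control input $u$ is constrained to a bounded voltage range. Although $\dot{\theta}$ is unconstrained, its physical limits are implicitly imposed by the constraints on $\theta$. The control objective is to keep the pendulum in the upright position, i.e., $\theta = 0$.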

5.2. Baseline Controller and Switching Conditions

The BC is a linear state feedback controller of the form $u = Kx$, with the objective of stabilizing the system to the setpoint $x = 0$. This controller can be obtained using the linear matrix inequality (LMI) approach described in (Seto et al., 1999). The LMI approach computes a vector $K$ and a matrix $P$ such that:

  • When the system state is inside the ellipsoid $x^T P x < 1$, all safety constraints are satisfied.

  • When the system starts in a state inside the ellipsoid $x^T P x < 1$ and uses BC’s control law $u = Kx$, it will remain in this ellipsoid forever.

The gain vector and matrix produced by the LMI approach for the described inverted pendulum system are


The matrix $P$ defines a recoverable region $\mathcal{R} = \{x : x^T P x < 1\}$. The forward switching condition is that the control input will drive the system outside $\mathcal{R}$ in the next time step. For the reverse switching logic, the DM simulates the NC for 10 time steps starting from the current state, and switches to the NC if there are no safety violations within this time horizon.
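An ellipsoidal recoverable region and the corresponding one-step forward switching check can be sketched as follows. The strict inequality, the function names, and the toy two-dimensional matrix used in the usage note are assumptions; the actual matrix for the inverted pendulum comes from the LMI computation and is not reproduced here.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]
Matrix = Sequence[Sequence[float]]


def quad_form(P: Matrix, x: Vector) -> float:
    """Quadratic form x^T P x."""
    n = len(x)
    return sum(x[i] * P[i][j] * x[j] for i in range(n) for j in range(n))


def in_recoverable_region(P: Matrix, x: Vector) -> bool:
    """Membership test for the ellipsoid x^T P x < 1 (strictness is an assumption)."""
    return quad_form(P, x) < 1.0


def forward_switch(
    P: Matrix,
    x: Vector,
    u: float,
    step: Callable[[Vector, float], List[float]],
) -> bool:
    """True if applying `u` for one time step drives the state out of the ellipsoid."""
    return not in_recoverable_region(P, step(x, u))
```

For example, with the hypothetical matrix `[[1, 0], [0, 4]]`, the state `[0.5, 0.2]` lies inside the region, and a one-step model decides whether a given control input keeps it there.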

5.3. The Neural Controller

The inverted pendulum problem can be considered “solved” by many reinforcement learning algorithms, due to its small state-action space. To demonstrate the online retraining capability of NSA’s adaptation module, we intentionally under-train a neural controller, so that it produces unrecoverable actions. We used the DDPG algorithm with the following reward function, which depends on the cart velocity and the pendulum angle:


This reward function encourages the controller to (i) keep the pendulum upright via the penalty term , and (ii) minimize the movement of the cart via the penalty term . The total distance travelled by the cart is one performance metric where the NC is expected to do better than the BC. Whenever the forward switching condition becomes true, the execution terminates. Therefore, the neural controller should also learn to respect (not trigger) the FSC, in order to maximize the discounted cumulative reward. Each execution is limited to 500 time steps.

5.4. Experimental Results

We under-trained an NC by training it for only 500,000 time steps. The DNN for the NC has the same architecture as the DDPG DNN described in Section 3. For our retraining experiments, we created an NSA instance consisting of this NC and the BC described above. With regard to the choices described in Section 4.3, we reused the initial training pool that has 500,000 samples, added Gaussian noise to NC’s actions in retraining samples, and collected retraining samples at every time step.

We ran the NSA instance starting from 2,000 random initial states. Out of the 2,000 trajectories, forward switching occurred in 28 of them. During the 28 trajectories with forward switches, the BC was in control for a total of 4,477 time steps. This means there were 4,477 retraining updates to the NC. Notably, there was only one forward switch in the last 1,000 trajectories; this shows that the retraining during the first 1,000 trajectories significantly improved the NC’s safety.

To evaluate the overall benefits of retraining, we ran the initially trained NC and the retrained NC starting from the same set of 1,000 random initial states. The results, given in Table 3, show that after just 4,477 retraining updates, the retrained NC completely stops producing unrecoverable actions. As a result, retraining also significantly improves the average return, increasing it by a factor of 2.7.

                  Initially Trained    Retrained
Unrecov Trajs     976                  0
Complete Trajs    24                   1,000
Avg. Return       1,711.17             4,547.11
Avg. Length       203.26               500
Table 3. Benefits of retraining for the inverted pendulum, based on 1,000 trajectories used for evaluation. Unrecov Trajs: # trajectories terminated because of an unrecoverable action. Complete Trajs: # trajectories that reach limit of 500 time steps. Avg. Return and Avg. Length: average return and average trajectory length over all 1,000 trajectories.

6. Rover Navigation Case Study

This section describes the problem setup and experimental results for the ground rover navigation case study.

6.1. The Rover Navigation Problem

We consider the problem of navigating a rover to a predetermined target while avoiding collisions with static obstacles. The rover is a circular disk of radius . It has a maximum speed and a maximum acceleration . The maximum braking time is therefore , and the maximum braking distance is . The control inputs are the accelerations and in the and directions, respectively. The system uses discrete-time control with a time step of .
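
The braking quantities mentioned above follow from constant-deceleration kinematics. The sketch below assumes the standard continuous-time formulas (time = v / a, distance = v^2 / (2a)); the paper's exact expressions are not reproduced here, and the function names and the numeric values in the usage lines are hypothetical.

```python
def braking_time(speed: float, max_accel: float) -> float:
    """Time to stop from `speed` when decelerating at `max_accel`."""
    if max_accel <= 0:
        raise ValueError("max_accel must be positive")
    return speed / max_accel


def braking_distance(speed: float, max_accel: float) -> float:
    """Distance covered while stopping from `speed` at `max_accel`."""
    if max_accel <= 0:
        raise ValueError("max_accel must be positive")
    return speed ** 2 / (2.0 * max_accel)


# Hypothetical values, for illustration only.
t_stop = braking_time(0.8, 1.6)
d_stop = braking_distance(0.8, 1.6)
```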

The rover is equipped with distance sensors whose detection range is . The sensors are placed evenly around the perimeter of the rover, i.e., the center lines of sight of two adjacent sensors form an angle of . The rover can move only forwards, so its orientation is the same as its heading angle. The state vector for the rover is , where is the position, is the heading angle, is the velocity, and the ’s are the sensor readings.

Figure 4. Schematic illustration of our assumption about obstacle shapes.

We assume the sensors have a small angular field-of-view so that each sensor reading reflects the distance from the rover to an obstacle along the sensor’s center line of sight. If a sensor does not detect an obstacle, its reading is . We assume that when the sensor readings of two adjacent sensors and are and , respectively, then the (conservative) minimum distance to any obstacle point located in the cone formed by the center lines of sight of and is . Here, is a constant that limits how much an obstacle can protrude into the blind spot between and ’s lines of sight; see Fig. 4.

6.2. Forward and Reverse Switching Conditions

A state of the rover is recoverable if, starting from , the baseline controller (BC) can brake to a stop and the stopped rover will still be at least distance from any obstacle. This implies that is recoverable if the minimum sensor reading in state is at least , where the braking distance in state is , where is the rover’s speed in state .

The forward switching condition is that the control input proposed by the NC will put the rover in an unrecoverable state in the next time step. We check this condition by simulating the rover for one time step with as the control input, and then check if .

The reverse switching condition is . This ensures that the forward switching condition does not hold for the next time steps, i.e., the current state is sufficiently far away from the forward switching boundary. The constant can be empirically chosen to reduce excessive back-and-forth switching between NC and BC.
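
A minimal sketch of these three checks follows. It assumes that recoverability reduces to comparing the minimum sensor reading with the braking distance plus a safety margin, and that the reverse switching condition is the same test with an extra distance margin. The names `RoverState`, `simulate`, `safety_margin`, and `hysteresis` are hypothetical, and the exact thresholds used in the paper are not reproduced.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class RoverState:
    speed: float                  # current speed of the rover
    readings: Tuple[float, ...]   # distance sensor readings


def braking_distance(speed: float, max_accel: float) -> float:
    """Constant-deceleration stopping distance (assumed model)."""
    return speed ** 2 / (2.0 * max_accel)


def is_recoverable(state: RoverState, max_accel: float,
                   safety_margin: float, extra: float = 0.0) -> bool:
    """The rover can brake to a stop and still keep the required clearance."""
    needed = braking_distance(state.speed, max_accel) + safety_margin + extra
    return min(state.readings) >= needed


def forward_switch(state: RoverState, action: float,
                   simulate: Callable[[RoverState, float], RoverState],
                   max_accel: float, safety_margin: float) -> bool:
    """FSC: one simulated step under `action` lands in an unrecoverable state."""
    return not is_recoverable(simulate(state, action), max_accel, safety_margin)


def reverse_switch(state: RoverState, max_accel: float,
                   safety_margin: float, hysteresis: float) -> bool:
    """RSC (assumed form): recoverable with an additional distance margin,
    which keeps the state away from the forward switching boundary."""
    return is_recoverable(state, max_accel, safety_margin, extra=hysteresis)
```

A larger `hysteresis` makes the decision module hand control back to the NC less readily, which is one way to reduce back-and-forth switching.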

6.3. Baseline Controller

The BC performs the following steps:

  1. Apply the maximum braking power until the rover stops.

  2. Randomly pick a safe heading angle based on the current position and sensor readings.

  3. Rotate the rover until its heading angle is .

  4. Move with heading angle until either the forward switching condition becomes true (this is checked after each time step by the BC itself), in which case the BC is re-started at Step 1, or the reverse switching condition becomes true (this is checked by the DM), in which case NC takes over.
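
The four steps above can be read as a small state machine. The sketch below is one possible encoding, not the authors' code: the phase names and the boolean inputs `stopped`, `aligned`, `fsc`, and `rsc` are hypothetical, and giving the forward switching condition priority over the reverse switching condition in the last step is an assumption.

```python
from enum import Enum, auto


class BCPhase(Enum):
    BRAKE = auto()         # Step 1: brake until the rover stops
    PICK_HEADING = auto()  # Step 2: choose a safe heading angle
    ROTATE = auto()        # Step 3: rotate to the chosen heading
    MOVE = auto()          # Step 4: move along the chosen heading
    HANDOVER = auto()      # control returns to the NC


def next_phase(phase: BCPhase, *, stopped: bool, aligned: bool,
               fsc: bool, rsc: bool) -> BCPhase:
    """Return the phase the BC enters after the current time step."""
    if phase is BCPhase.BRAKE:
        return BCPhase.PICK_HEADING if stopped else BCPhase.BRAKE
    if phase is BCPhase.PICK_HEADING:
        return BCPhase.ROTATE
    if phase is BCPhase.ROTATE:
        return BCPhase.MOVE if aligned else BCPhase.ROTATE
    if phase is BCPhase.MOVE:
        if fsc:                      # checked by the BC itself
            return BCPhase.BRAKE     # restart at Step 1
        if rsc:                      # checked by the DM
            return BCPhase.HANDOVER  # NC takes over
        return BCPhase.MOVE
    return BCPhase.HANDOVER
```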

6.4. Experimental Results

The parameters used in our experiments are m, m/s, m/, m, , m, m, , and s. The target is fixed at location . The field of circular obstacles is also fixed during training and testing, as shown in Fig. 5. The initial position of the rover is randomized in the area during training and testing. The NC is a DNN with two ReLU hidden layers, each of size 64, and a tanh output layer. We used the DDPG algorithm for both initial training and online retraining of the NC. For initial training, we ran DDPG for 5 million time steps.

The reward function for initial training and online retraining is


where is the forward switching condition, and is the center-to-center distance from the rover to the target in state . The rover is considered to have reached the target if because the target is a disk with radius of 0.1 m. If the action triggers the forward switching logic, it is penalized by assigning a negative reward of -20,000. If causes the rover to reach the target, it receives a reward of 10,000. All other actions are penalized by an amount proportional to the distance to the target; this encourages the agent to reach the target in the fewest number of time steps.
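
The three cases of this reward can be sketched as a short function. The constants -20,000 and 10,000 come from the text; the parameters `reach_threshold` (the distance at which the target counts as reached) and `distance_weight` (the proportionality constant of the distance penalty) are hypothetical placeholders, and testing the forward switching condition first is an assumption.

```python
def rover_reward(fsc_triggered: bool, dist_to_target: float,
                 reach_threshold: float, distance_weight: float = 1.0) -> float:
    """Reward for one action of the rover (illustrative sketch)."""
    if fsc_triggered:
        return -20000.0            # action triggers forward switching
    if dist_to_target <= reach_threshold:
        return 10000.0             # action brings the rover to the target
    # Otherwise penalize in proportion to the remaining distance.
    return -distance_weight * dist_to_target
```

Because every non-terminal step costs something, shorter paths to the target accumulate less penalty, which is the incentive the text describes.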

A video showing how the initially trained NC navigates the rover through the same obstacle field used in training is available at The video shows that the NC is able to reach the target most of the time. However, it occasionally drives the rover into unrecoverable states. If we pair this NC with the BC in an NSA instance, the rover never enters unrecoverable states. A video showing this NSA instance in action with reverse switching enabled and online retraining disabled is available at Note that in this video, we curated only interesting trajectories where switches occurred. In the video, the rover is black when the NC is in control; it turns green when the BC is in control.

Figure 5. Training and testing setup for the rover case study. The red disks are obstacles, the black dot with an inscribed white triangle is the rover, and the blue dot is the target. The spokes coming out of the rover represent the distance sensors. The rover’s heading angle is shown by the orientation of the inscribed triangle. The length of the green line, which is a prefix of the spoke for the sensor pointing directly forward, is proportional to the rover’s speed.

The initially trained NC also performs reasonably well on random obstacle fields not seen during training. A video of this is available at The rover under NC control is able to reach the target most of the time. However, it sometimes overshoots the target, suggesting that we may need to vary the target position during training. We plan to investigate this as future work.

Our experiments with online retraining start with the same NSA instance as above, except with online retraining enabled. All settings for DDPG are the same as in initial training, except that we initialize the AM’s pool of retraining samples with the pool created by initial training, instead of an empty pool. The pool created by initial training contains one million samples; this is the maximum pool size, which is a parameter of the algorithm. When creating retraining samples, the AM adds Gaussian noise to the NC’s actions. The NC’s actions are collected at every time step, regardless of which controller is in control; thus, the AM also collects samples of what the NC would do while the BC is in control.
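
The sample pool and the action noise described above can be sketched as follows. This is an illustration only: the class name `RetrainingPool`, the first-in-first-out eviction once the pool is full, and the clipping of noisy actions to [-1, 1] (suggested by the tanh output layer) are assumptions, not details taken from the paper.

```python
import random
from collections import deque
from typing import Iterable, List, Sequence, Tuple


class RetrainingPool:
    """Bounded pool of samples, optionally seeded with an existing pool."""

    def __init__(self, max_size: int, initial: Iterable = ()):
        # A deque with maxlen silently drops the oldest entries when full.
        self._samples = deque(initial, maxlen=max_size)

    def add(self, sample) -> None:
        self._samples.append(sample)

    def sample(self, k: int, rng: random.Random) -> List:
        """Draw up to k samples uniformly without replacement."""
        items = list(self._samples)
        return rng.sample(items, min(k, len(items)))

    def contents(self) -> List:
        return list(self._samples)

    def __len__(self) -> int:
        return len(self._samples)


def noisy_action(action: Sequence[float], sigma: float, rng: random.Random,
                 low: float = -1.0, high: float = 1.0) -> Tuple[float, ...]:
    """Add zero-mean Gaussian noise to each action component, then clip."""
    return tuple(min(high, max(low, a + rng.gauss(0.0, sigma))) for a in action)
```

Seeding the pool with the samples from initial training corresponds to passing them as `initial`; an empty pool is the default.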

We ran the NSA instance starting from 10,000 random initial states. Out of 10,000 trajectories, forward switching occurred in 456 of them. Of these 456 trajectories, the BC was in control for a total of 70,974 time steps. This means there were 70,974 (71K) retraining updates to the NC. To evaluate the benefits of online retraining, we compared the performance of the NC after initial training and after 20K, 50K, and 71K updates. We evaluated each of these controllers (by itself, without NSA) by running it from the same set of 1,000 random initial states and collecting multiple performance metrics.

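The results are given in Table 4. After 71K retraining updates, the NC outperforms the initially trained version on every safety and performance metric: fewer FSC triggers, fewer timeouts, more targets reached, and a higher average return. Table 4 also shows that the number of FSC triggers falls and the number of targets reached rises monotonically with the number of retraining updates, although the average return dips at 20K updates before improving. This demonstrates that NSA improves not only the safety of the NC, but also its performance.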

               IT          20K RT      50K RT      71K RT
FSCs           100         79          43          8
Timeouts       35          49          50          22
Targets        865         872         907         970
Avg. Return    -9,137.3    -9,968.82   -5,314.57   -684.01
Avg. Length    138.67      142.29      156.13      146.56
Table 4. Benefits of retraining for ground rover navigation. There were a total of 71K updates to the NC. IT: results for initially trained NC. 20K RT, 50K RT, and 71K RT: results for NC after 20K, 50K and 71K retraining updates during one retraining experiment. All of the controllers are evaluated on the same set of 1,000 random initial states. FSCs: # trajectories in which FSC becomes true. Timeouts: # trajectories that reach the limit of 500 time steps without reaching the target or having FSC become true. Targets: # trajectories that reach the target. Avg. Return and Avg. Length: average return and average trajectory length over all 1,000 trajectories.

We resumed initial training to see if this would produce similar improvements. We continued the initial training for an additional 71K, 1M, and 3M updates. The results appear in Table 5. Extending the initial training slowly improves both the safety and the performance of the NC but requires substantially more updates. Comparing Tables 4 and 5 shows that 71K retraining updates in NSA provide significantly more benefits than even 3M additional updates of initial training. NSA’s retraining is much more effective because it samples more unrecoverable actions while the plant is under the BC’s control, and because it tends to focus retraining on regions of the state-action space of greatest interest, especially regions near the forward switching boundary and regions near the current state. In contrast, trajectories in initial training start from random initial states.

               IT          71K EIT     1M EIT      3M EIT
FSCs           100         108         108         78
Timeouts       35          224         78          43
Targets        865         668         814         879
Avg. Return    -9,137.3    -12,448.3   -9,484.83   -3,320.4
Avg. Length    138.67      215.7       137.75      124.26
Table 5. Extended initial training performance. 71K EIT, 1M EIT, and 3M EIT: results for NC after 71K, 1M, and 3M updates during extended initial training. All of the controllers are evaluated using the same set of 1,000 random initial states used for the evaluation results in Table 4.


We also experimented with other combinations of choices along the three dimensions listed in Section 4.3. We expected the combination described above to provide the best results, for the reasons presented in Section 4.3. Indeed, we found that none of the other combinations produced consistent safety and performance improvements over time as did the combination described above.

7. Artificial Pancreas Case Study

This section describes the problem setup and experimental results for the artificial pancreas case study.

7.1. The Artificial Pancreas Problem

The artificial pancreas (AP) is a system for controlling blood glucose (BG) levels in Type 1 Diabetes patients through the automated delivery of insulin. Here we consider the problem of controlling the basal insulin, i.e., the insulin required in between meals. We consider a deterministic linear model (adapted from (Chen et al., 2015)) to describe the physiological state of the patient. The dynamics are given by:


where is the difference between the reference BG, mmol/L, and the patient’s BG; (mU/min) is the insulin input (i.e., the control input); (mU) is the insulin mass in the subcutaneous compartment; and is the plasma insulin concentration (mU/L). Parameters are patient-specific.

The AP should keep BG levels within safe ranges, typically 4 to 11 mmol/L, and in particular it should avoid hypoglycemia (i.e., BG levels below the safe range), a condition that leads to severe health consequences. Hypoglycemia happens when the controller overshoots the insulin dose. What makes insulin control uniquely challenging is the fact that the controller cannot take a corrective action to counteract an excessive dose; its most drastic safety measure is to shut off the insulin pump. For this reason, the baseline controller for the AP sets .

For this case study, we assume that the controller can observe the full state of the system, and thus, the corresponding policy is a map of the form . We perform discrete-time simulations of the ODE system with a time step of .

7.2. Neural Controller

Similarly to the inverted pendulum problem, we intentionally under-train the NC so that it produces unrecoverable actions. This results in an AP controller with poor performance. Controllers with poor performance may arise in practice for a variety of reasons, including the (common) situation where the physiological parameters used during training poorly reflect the patient’s physiology. The DNN for the NC has the same architecture as the DDPG DNN described in Section 3.

The reward function is designed to penalize deviations from the reference BG level. Such a deviation is given directly by the state variable . We give a positive reward when is close to zero (within ), and we penalize larger deviations with a factor of 5 for mild hyperglycemia (), 7 for mild hypoglycemia (), 9 for strong hyperglycemia (), and 20 for strong hypoglycemia (). The other constants are chosen to avoid jump discontinuities in the reward function.


where is the value of in state . This reward function is inspired by the asymmetric objective functions used in previous work on model predictive control for the AP (Gondhalekar et al., 2016; Paoletti et al., 2017).

7.3. Forward and Reverse Switching Conditions

A state is recoverable if under the control of the BC (), the system does not undergo hypoglycemia () in any future state starting from . This condition is checked by simulating the system from with until starts to increase: as one can see from the system dynamics (68), this is the point at which reaches its minimum value under the BC.
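
The recoverability check described above can be sketched without committing to a particular patient model. In the sketch below, `step_bc` (one simulation step with the insulin input set to zero), `deviation` (extracts the BG state variable, with lower values meaning closer to hypoglycemia), `hypo_threshold`, and `max_steps` are hypothetical names; treating a trajectory that never turns upward within `max_steps` as unrecoverable is a conservative assumption.

```python
from typing import Callable, TypeVar

S = TypeVar("S")


def is_recoverable(state: S,
                   step_bc: Callable[[S], S],
                   deviation: Callable[[S], float],
                   hypo_threshold: float,
                   max_steps: int = 10_000) -> bool:
    """Simulate under the BC until the BG variable starts to increase.

    Returns True if the variable never drops below `hypo_threshold`
    before reaching its minimum, and False otherwise.
    """
    prev = deviation(state)
    if prev < hypo_threshold:
        return False
    for _ in range(max_steps):
        state = step_bc(state)
        cur = deviation(state)
        if cur > prev:
            # The variable has started to increase, so `prev` was the
            # minimum under the BC and it stayed above the threshold.
            return True
        if cur < hypo_threshold:
            return False
        prev = cur
    # No turning point found within the budget: be conservative.
    return False
```

The forward switching condition then amounts to applying the NC's proposed input for one step and calling `is_recoverable` on the resulting state.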

The FSC holds when the control input proposed by the NC leads to an unrecoverable state in the next time step. For reverse switching, we use the default strategy of returning control to the NC if applying the NC for a bounded time horizon from the current state does not produce a state satisfying the FSC.

7.4. Experimental Results

To produce an under-trained NC, we used 107,000 time steps of initial training. For retraining, we used the same settings as in the inverted pendulum case study.

                  Initially Trained    Retrained
Unrecov Trajs     1,000                0
Complete Trajs    0                    1,000
Avg. Return       824                  2,402
Avg. Length       217                  500
Table 6. Benefits of retraining for the AP case study. There were 61 updates to the NC. Row and column labels are as per Table 3.

We ran the NSA instance for 10,000 trajectories. Among the first 400 trajectories, 250 led to forward switches and hence retraining. The retraining that occurred in those 250 trajectories was very effective, because forward switching never occurred after the first 400 trajectories. As we did for the other case studies, we then evaluated the benefits of retraining by comparing the performance of the initially trained NC and the retrained NC (by themselves, without NSA) on trajectories starting from the same set of 1,000 random initial states. The results are in Table 6. We observe that retraining greatly improves the safety of the NC: the initially trained controller reaches an unrecoverable state in all 1,000 of these trajectories, while the retrained controller never reaches an unrecoverable state. The retrained controller’s performance is also significantly enhanced, with an average return 2.9 times higher than that of the initial controller.

8. Related Work

The traditional Simplex architecture does not consider automatic reverse switching. In (Seto et al., 1999, 1998), when the AC produces an unrecoverable action, it is disabled until it is manually re-enabled. In (Johnson et al., 2016), the authors briefly mention that reverse switching should be performed only when the FSC is false, and that a stricter RSC might be needed to prevent frequent switching; but the paper does not pursue this idea further. In contrast to our work, it does not suggest a general approach to designing a stricter RSC or suggest a specific RSC for any case study.

In terms of our approach to safe reinforcement learning, we refer the reader to two recent comprehensive literature reviews on the subject (García and Fernández, 2015; Xiang et al., 2018).

In (Alshiekh et al., 2018), a shield (a.k.a. post-posed shield in (Alshiekh et al., 2017)) is synthesized from a temporal-logic safety specification. The shield monitors the actions from the agent and corrects an action if it causes a safety violation. This shielding approach is limited to finite-state systems and finite action spaces. It can be applied to an infinite-state system if a good finite-state abstraction of the system is available. NSA uses an approach based on policy-gradient reinforcement learning for realistic applications with infinite state spaces and continuous action spaces.

In (Fulton and Platzer, 2018), verified runtime monitors generated by ModelPlex are used in the training phase of reinforcement learning to constrain the actions the agent can choose to a set of safe actions at every training step. However, the learned policy is not guaranteed to be safe. The paper mentions the idea of using the learned policy together with a known-safe fallback policy, but does not elaborate on this approach. In contrast, we discuss in detail how to guarantee safety and retrain the controller during deployment. Also, while it is easy to check if an action is safe, it is unclear from their paper how to efficiently obtain a set of safe actions at every training step.

In (Pathak et al., 2018), probabilistic model checking is used to verify policies, i.e., to bound the risk that a trained agent damages either the environment or itself. The authors also present different approaches for repairing learned policies such that the probability of the system reaching an unsafe state under control of the repaired policy is bounded. Instead of providing a probability bound, NSA guarantees safety of the system during both training and deployment of the controller. Additionally, their repairs are not performed online, whereas NSA retrains the controller online.

In (Chow et al., 2018), the authors present a method for constructing and using Lyapunov functions under the framework of constrained Markov decision problems (CMDPs) to guarantee the safety of a policy during training. The paper demonstrates the effectiveness of the Lyapunov approach when used with policy-iteration and Q-learning methods for discrete state and action problems. Their approach is currently not applicable to policy gradient algorithms, such as the DDPG algorithm used in our experiments, or to problems with continuous state and action spaces.

In (Tessler et al., 2018), the authors propose the Reward Constrained Policy Optimization (RCPO) approach, in which a per-state penalty with an associated weight is added to the reward function. The weight is dynamically changed during training. RCPO is shown to almost surely converge to a constraint-satisfying solution. However, RCPO does not address the problem of guaranteeing safety during training. Our approach differs in that we penalize an unrecoverable action and terminate the current trajectory to ensure the safety of the plant. Additionally, the paper does not consider online retraining, a distinguishing feature of NSA.

In (Achiam et al., 2017), the authors propose the Constrained Policy Optimization (CPO) algorithm for constrained MDPs that guarantees safe exploration during training. Although the theory behind CPO is sound, the practical algorithm presented in the paper is an approximation. As a result, the algorithm only ensures approximate satisfaction of constraints and guarantees an upper bound on a cost associated with constraint violations. This work also neglects online retraining.

In (Ohnishi et al., 2018), control barrier functions are used to provide safety for an RL algorithm. In this approach, the unknown system dynamics are learned using Gaussian processes. As a result, the safety guarantee is probabilistic. During training and deployment, the agent is restricted to exploring within a safe region of the state space defined by a barrier function. Whenever the agent produces an unsafe action, the action is minimally perturbed so that the resulting action will not drive the system out of the safe region. In contrast, in NSA, when the NC produces an unsafe action, the BC takes over and the NC is retrained by the AM.

9. Conclusions

We have presented the Neural Simplex Architecture for assuring the runtime safety of cyber-physical systems with neural controllers. NSA features an adaptation module that retrains the NC in an online fashion, seeking to eliminate any faulty behavior exhibited by the NC. NSA’s reverse switching capability allows control of the plant to be returned to the NC after a failover to the BC has occurred, thereby allowing the NC’s performance benefits to come back into play. We have demonstrated the utility of NSA on three case studies: the inverted pendulum, a target-seeking ground rover navigating an obstacle field, and an artificial pancreas system. As future work, we plan to investigate methods for establishing statistical bounds on the degree of improvement that online retraining yields in terms of safety and performance of the NC.


  • J. Achiam, D. Held, A. Tamar, and P. Abbeel (2017) Constrained policy optimization. In International Conference on Machine Learning, pp. 22–31. Cited by: §8.
  • M. Alshiekh, R. Bloem, R. Ehlers, B. Könighofer, S. Niekum, and U. Topcu (2017) Safe reinforcement learning via shielding. arXiv preprint arXiv:1708.08611. Cited by: §1, §3, §3, §3, §8.
  • M. Alshiekh, R. Bloem, R. Ehlers, B. Könighofer, S. Niekum, and U. Topcu (2018) Safe reinforcement learning via shielding. In AAAI Conference on Artificial Intelligence. Cited by: §8.
  • S. Chen, J. Weimer, M. Rickels, A. Peleckis, and I. Lee (2015) Towards a model-based meal detector for type I diabetics. Medical Cyber-physical System Workshop. Cited by: §7.1.
  • Y. Chow, O. Nachum, E. Duenez-Guzman, and M. Ghavamzadeh (2018) A lyapunov-based approach to safe reinforcement learning. In Advances in Neural Information Processing Systems, pp. 8103–8112. Cited by: §8.
  • Y. Duan, X. Chen, R. Houthooft, J. Schulman, and P. Abbeel (2016) Benchmarking deep reinforcement learning for continuous control. In Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48, ICML’16, pp. 1329–1338. External Links: Link Cited by: §3.
  • N. Fulton and A. Platzer (2018) Safe reinforcement learning via formal methods. In AAAI’18, Cited by: §1, §3, §3, §3, §8.
  • J. García and F. Fernández (2015) A comprehensive survey on safe reinforcement learning. J. Mach. Learn. Res. 16 (1), pp. 1437–1480. External Links: ISSN 1532-4435, Link Cited by: §1, §8.
  • R. Gondhalekar, E. Dassau, and F. J. Doyle III (2016) Periodic zone-mpc with asymmetric costs for outpatient-ready safety of an artificial pancreas to treat type 1 diabetes. Automatica 71, pp. 237–246. Cited by: §7.2.
  • T. Johnson, S. Bak, M. Caccamo, and L. Sha (2016) Real-time reachability for verified simplex design. ACM Trans. Embed. Comput. Syst. 15 (2), pp. 26:1–26:27. External Links: ISSN 1539-9087, Link, Document Cited by: §5.1, §8.
  • S.M. Kakade (2002) A natural policy gradient. In NIPS, pp. 1531–1538. Cited by: §3.
  • T.P. Lillicrap, J.J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971. Cited by: §2.2, §3.
  • V. Mnih, A. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu (2016) Asynchronous methods for deep reinforcement learning. In ICML, pp. 1928–1937. Cited by: §2.2.
  • V. Mnih, K. Kavukcuoglu, D. Silver, A. Rusu, J. Veness, M. Bellemare, A. Graves, M. Riedmiller, A. Fidjeland, G. Ostrovski, et al. (2015) Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529. Cited by: §1.
  • M. Ohnishi, L. Wang, G. Notomista, and M. Egerstedt (2018) Safety-aware Adaptive Reinforcement Learning with Applications to Brushbot Navigation. ArXiv e-prints. External Links: 1801.09627 Cited by: §8.
  • N. Paoletti, K.S. Liu, S.A. Smolka, and S. Lin (2017) Data-Driven Robust Control for Type 1 Diabetes Under Meal and Exercise Uncertainties. In Computational Methods in Systems Biology, pp. 214–232. Cited by: §7.2.
  • S. Pathak, L. Pulina, and A. Tacchella (2018) Verification and repair of control policies for safe reinforcement learning. Applied Intelligence 48 (4), pp. 886–908. External Links: ISSN 1573-7497, Document, Link Cited by: §8.
  • J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz (2015) Trust region policy optimization. In ICML, pp. 1889–1897. Cited by: §2.2, §3.
  • J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. Cited by: §2.2, §3.
  • D. Seto, B. Krogh, L. Sha, and A. Chutinan (1998) The Simplex Architecture for Safe Online Control System Upgrades. In Proc. 1998 American Control Conference, Vol. 6, pp. 3504–3508. External Links: Document Cited by: §1, §2.1, §8.
  • D. Seto, L. Sha, and N.L. Compton (1999) A case study on analytical analysis of the inverted pendulum real-time control system. Cited by: §5.1, §5.2, §8.
  • L. Sha (2001) Using Simplicity to Control Complexity. IEEE Software 18 (4), pp. 20–28. External Links: Document, ISSN 0740-7459 Cited by: §1, §2.1.
  • D. Silver, T. Hubert, J. Schrittwieser, et al. (2017a) Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv preprint arXiv:1712.01815. Cited by: §1.
  • D. Silver, J. Schrittwieser, K. Simonyan, et al. (2017b) Mastering the game of go without human knowledge. Nature 550 (7676), pp. 354. Cited by: §1.
  • R. Sutton and A. Barto (1998) Reinforcement learning: an introduction. MIT Press, Cambridge. Cited by: Figure 2, §2.2.
  • R.S. Sutton, D. McAllester, S. Singh, and Y. Mansour (1999) Policy gradient methods for reinforcement learning with function approximation. In Proceedings of the 12th International Conference on Neural Information Processing Systems, NIPS’99, Cambridge, MA, USA, pp. 1057–1063. External Links: Link Cited by: §3.
  • C. Szepesvári (2009) Algorithms for reinforcement learning. Citeseer. Cited by: §2.2.
  • C. Tessler, D. J. Mankowitz, and S. Mannor (2018) Reward Constrained Policy Optimization. ArXiv e-prints. External Links: 1805.11074 Cited by: §8.
  • Z. Wang, V. Bapst, N. Heess, V. Mnih, R. Munos, K. Kavukcuoglu, and N. de Freitas (2016) Sample efficient actor-critic with experience replay. arXiv preprint arXiv:1611.01224. Cited by: §2.2.
  • R.J. Williams (1992) Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8 (3), pp. 229–256. External Links: ISSN 1573-0565, Document, Link Cited by: §3.
  • Y. Wu, E. Mansimov, R. Grosse, S. Liao, and J. Ba (2017) Scalable trust-region method for deep reinforcement learning using kronecker-factored approximation. In NIPS, pp. 5279–5288. Cited by: §2.2.
  • W. Xiang, P. Musau, A. A. Wild, D. Manzanas Lopez, N. Hamilton, X. Yang, J. Rosenfeld, and T. T. Johnson (2018) Verification for Machine Learning, Autonomy, and Neural Networks Survey. ArXiv e-prints. External Links: 1810.01989 Cited by: §1, §8.