1. Introduction
Many critical systems exhibit global system dynamics that are highly sensitive to the local performance of individual components. This holds, for example, for (air) traffic and transport networks, communication networks, security systems, and (smart) power grids (Cristian et al., 1996; Shooman, 2003; Knight, 2002; Liu et al., 2012). In each case, the failure of, or a malicious attack on, a small set of nodes may lead to knock-on effects that can potentially destabilise the whole system. Innovations in critical systems may introduce additional vulnerabilities to such attacks: e.g., in smart grids, communication channels are needed for distributed intelligent energy management strategies, while simultaneously forming a potential target that could compromise safety (Yan et al., 2013). Our research is motivated precisely by the need for safety in these critical systems, which can be achieved by building in robustness against rare but significant deviations caused by one or more system components failing or being compromised in an attack.
In this article we present a new approach for learning policies in such systems that are robust against a chosen scenario of potential attacks or failures. We accomplish this by introducing a new Q-function operator, which we call the $\kappa$ operator, that encodes robustness into the bootstrapping update of traditional temporal difference (TD) learning methods. In particular, we design the operator to encode the possibility of significant rare events (SREs) without requiring the learning agent to observe such events in training. Although the $\kappa$ operator is model-based with respect to these SREs, it can be combined with any TD method and can thus still be model-free with respect to the environment dynamics.
We prove convergence of our methods to the optimal robust Q-function with respect to the model using the theory of Generalized Markov Decision Processes. In addition, we prove convergence to the optimal Q-function of the original MDP given that the probability of SREs vanishes. Empirical evaluations demonstrate the superior performance of $\kappa$-based TD methods both in the early learning phase and in the final converged stage. In addition, we show robustness of the proposed method to small model errors, as well as its applicability to multi-agent joint-action learning.
2. Related Work
The aim to find robust policies is relevant to multiple research areas, including security games, robust control/learning, safe reinforcement learning and multi-agent reinforcement learning.
The domain of security games has expanded in recent years with many real-world applications in critical domains (Pita et al., 2008; Shieh et al., 2012), where the main approach has been computing exact solutions and deriving strong theoretical guarantees, mostly using equilibrium concepts such as Nash and Stackelberg equilibria (Korzhyk et al., 2011; Lou et al., 2017). In contrast, we base our approach on reinforcement learning from interactions with the environment, and thus do not need to know the system model. Such an approach to security games has been studied less, exceptions being for example Ruan et al. (2005) and Klima et al. (2016), who use reinforcement learning in the context of patrolling and illegal rhino poaching problems, respectively. Security games often assume frequent adversarial attacks, whereas our work focuses on occasional loss of control over the system, which can represent e.g. failures or adversarial attacks. Moreover, our work adopts the information asymmetry assumption often used in Stackelberg security games (Korzhyk et al., 2011), providing the model of attack types for the leader, and allowing leader-strategy-informed best-response strategies by attackers. Similar to security games, control theory starts with a model of the system to be controlled (the plant), and for the purpose of robust control assumes a set of possible plants as an explicit model of uncertainty, seeking to design a policy that stabilises all these plants (Zhou and Doyle, 1998). A slightly weaker assumption is made in related work that assumes control over the number of observations for significant rare events (SREs), performing updates by sampling from the model (Ciosek and Whiteson, 2017). In contrast, our work assumes that the model of this system is not known a priori, and a policy needs to be learned by interacting with it.
While early work on robust reinforcement learning focused on learning within parameterised acceptable policies (Singh et al., 1994), later work transferred the objective of maximising tolerable disturbances from control theory to reinforcement learning (Morimoto and Doya, 2005). Our work is similar to the Actor-disturber-critic defined therein, but we replace its model of minimax simultaneous actions with stochastic transitions between multiple controllers (one being in control at any time) with arbitrary objectives for each controller. In relation to the taxonomy of safe reinforcement learning of García and Fernández (2015), our method falls in between Worst-Case Criterion under Parameter Uncertainty and Risk-Sensitive Reinforcement Learning Based on the Weighted Sum of Return and Risk, depending on the chosen alternate controller objectives. Our Q($\kappa$) method is comparable to the pessimistic Q-learning method of Gaskett (2003); however, we propose a more general operator $\kappa$, of which Q($\kappa$) is only an example. Finally, our approach has commonalities with the multi-agent reinforcement learning algorithm Minimax-Q (Littman, 1994) for zero-sum games, which assumes minimisation over the opponent action space. In contrast, we define an attack to minimise over our own action space, and thus learn (but not enact) simultaneously our optimal policy and the (rare) attacks it is susceptible to. We further cover not only minimising adversaries but also random failures or any other policy encoding other adversaries' agendas (see Section 4.1).
3. Background
This work belongs to the field of reinforcement learning (RL) (Sutton and Barto, 2018), and makes use of the core concept of a Markov decision process (MDP). An MDP is formally defined by a tuple $(S, A, R, P)$, where $S$ is a finite set of states, $A$ is a finite set of actions, $R: S \times A \to \mathbb{R}$ is the reward function for a given state and action, and $P: S \times A \times S \to [0,1]$ is the transition function giving the probability of reaching state $s'$ after taking action $a$ in state $s$. In this work we also consider a multi-agent setting, which uses the formalism of the stochastic game. A stochastic game generalizes the MDP to multiple agents and is defined by a tuple $(n, S, A_1, \dots, A_n, R_1, \dots, R_n, P)$, where $n$ is the number of agents and $A_i$ is the action space of agent $i$. The joint action space is $A = A_1 \times \dots \times A_n$, and a joint action is $a = (a_1, \dots, a_n)$. (We use the common shorthand $-i$ to denote the joint action of all agents except agent $i$, i.e., $a_{-i} = (a_1, \dots, a_{i-1}, a_{i+1}, \dots, a_n)$.) $R_i: S \times A \to \mathbb{R}$ is the reward function of agent $i$ for a given state $s$ and joint action $a$, and $P: S \times A \times S \to [0,1]$ is the state transition function.
The main goal of RL is finding an optimal policy for a given MDP. A common method is temporal difference (TD) learning, which estimates the value of a state by bootstrapping from the value estimates of successor states using Bellman-style equations. TD methods work by updating their state-value estimates in order to reduce the TD error, which describes the difference between the current estimate of the (state) value and a new sample obtained from interacting with the environment. In this work we focus on modifying the update target of this TD error, which has the standard form $r_{t+1} + \gamma V(s_{t+1})$, where $\gamma$ is the discount factor and $V(s_{t+1})$ is the current estimate of the next state's value. In on-policy methods such as SARSA the target is induced by the actual (behaviour) policy being followed, while off-policy methods use an alternative operator (e.g., greedy maximization as in Q-learning). We refer the reader to Sutton and Barto (2018) for an overview of RL.
4. The Robust TD Operator
Before we formally define our robust TD operator $\kappa$, we give an intuitive example. Suppose a Q-learning agent needs to learn a robust policy against a potential malicious adversary who could, with some probability $\varkappa$, take over control in the next state. (We use the symbol $\kappa$ to denote the proposed TD operator and the symbol $\varkappa$ for the parameter denoting the probability of attack.) The value of the next state thus depends on who is in control: if the agent is in control, she can choose an optimal action that maximizes expected return; if the adversary is in control he might, in the worst case, aim to minimize the expected return. This can be captured by the following modified TD error
$$\delta_t = r_{t+1} + \gamma\Big[(1-\varkappa)\max_a Q(s_{t+1},a) + \varkappa\min_a Q(s_{t+1},a)\Big] - Q(s_t,a_t),$$
where we assume that the agent has knowledge of (or can estimate) the probability $\varkappa$. (Note that while a token of control could be included in the state, doubling its size, our approach instead directly applies model-based bootstrap updates. This makes it explicit that the robustness target is a chosen parameter of the operator, and makes it possible to learn robust strategies before observing SREs or when learning during the SREs is not possible. This also highlights the difference between state transition probabilities, which are part of the environment and thus external to the agent, and the expected probability of SREs given by $\varkappa$, which is part of the agent's internal model.)
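The intuitive example above can be sketched as a single tabular update. This is a minimal illustration rather than the authors' implementation: the function name `q_kappa_update`, the list-of-lists Q-table, and the argument `attack_prob` (standing for the assumed probability that the adversary is in control in the next state) are all hypothetical choices made for this sketch.

```python
def q_kappa_update(Q, s, a, r, s_next, alpha, gamma, attack_prob, terminal=False):
    """Apply one robust TD update to the tabular Q-function and return the TD error.

    Q is a list of lists with Q[state][action]; attack_prob is the assumed
    probability that the adversary controls the next action (an illustrative name).
    """
    if terminal:
        next_value = 0.0  # no bootstrapping from a terminal state
    else:
        # Mix the agent's best case and the adversary's worst case in the next state.
        next_value = ((1.0 - attack_prob) * max(Q[s_next])
                      + attack_prob * min(Q[s_next]))
    td_error = r + gamma * next_value - Q[s][a]
    Q[s][a] += alpha * td_error
    return td_error


# Toy usage: state 1 has one good action (+10) and one catastrophic action (-100).
Q = [[0.0, 0.0], [10.0, -100.0]]
delta = q_kappa_update(Q, s=0, a=0, r=-1.0, s_next=1,
                       alpha=0.5, gamma=1.0, attack_prob=0.1)
```

With `attack_prob=0.0` the update reduces to the ordinary Q-learning update; even a small attack probability pulls the bootstrapped value of a state with a catastrophic action down sharply.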
In the following we first present a formal, general model of the operator $\kappa$, obtained by modifying the target in the standard Bellman-style value function. We then present practical implementations of TD($\kappa$) methods that use this operator, for both single- and multi-agent settings, based on the classical on- and off-policy TD learning algorithms (Expected) SARSA and Q-learning.
4.1. Formal Model
We consider a set of possible control policies $\Sigma$. At each time step, one of these policies $\sigma \in \Sigma$ is in control (and thus decides on the next action) with some probability $p(\sigma|s)$ that may depend on the state $s$. The set $\Sigma$ and the probability function $p$ are assumed to be (approximately) known by the agent. In our new TD methods, the value of the next state $s'$ then becomes a function of both the state and the function $p$, which we capture in our proposed operator $\kappa$ as $V^\kappa(s')$. Note that the set $\Sigma$ includes the focal policy $\pi$ that we seek to optimise in the face of (possibly adversarial) alternative controllers. Such external control policies can represent, for example, a malicious attacker aiming to minimize the expected return, or any arbitrary dynamics, such as random failures (e.g. represented by a uniformly random policy). Based on a prior assumption about the nature of $\Sigma$ we want to optimise the focal policy $\pi$ without necessarily observing actual attacks or failures. This means learning our robust policy right from the start.
We define the alternative policies $\sigma \in \Sigma$ in terms of our own Q-value function, for example an attacker that is minimising our expected return. Thus we need to learn only one Q-value function $Q$. This is similar to the standard assumption in Stackelberg games that the attacker is able to fully observe our past actions and thus can enact the informed best response. We define the Q-value function update for our policy $\pi$ based on the standard Bellman equation and given the operator $\kappa$ as
$$Q^\pi(s,a) = R(s,a) + \gamma\sum_{s'}P(s'|s,a)\,V^\kappa(s'). \tag{1}$$
Note that where in the standard Bellman equation we would have $V^\pi(s') = \sum_{a'}\pi(s',a')\,Q^\pi(s',a')$, in our case we have
$$V^\kappa(s') = \sum_{\sigma\in\Sigma}p(\sigma|s')\sum_{a'}\sigma(s',a')\,Q^\pi(s',a'), \tag{2}$$
computed as a weighted sum over all possible control policies $\sigma \in \Sigma$. Note that we can learn $Q^\pi$ without actually experiencing any attack or malfunction, based only on prior assumptions about the possible control policies as captured by the operator $\kappa$. We refer to this target modification as the operator $\kappa$ because it closely resembles the Bellman optimality operator $T^*$, which is defined as $(T^*V)(s) = \max_a\big[R(s,a) + \gamma\sum_{s'}P(s'|s,a)V(s')\big]$. Thus, we can formally define the optimality operator $T^\kappa$ by substituting the value function $V$ with $V^\kappa$.
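The weighted sum over control policies can be illustrated with a few lines of Python. This is a sketch under stated assumptions: the helper names (`kappa_value`, `greedy`, `adversary`, `uniform_failure`) are hypothetical, each control policy is represented as a function that maps the next state's Q-values to an action distribution, and the control probabilities are taken to be state-independent.

```python
def kappa_value(q_next, controllers, control_probs):
    """Robust value of the next state: a weighted sum, over control policies,
    of each policy's expected Q-value (all names here are illustrative)."""
    if abs(sum(control_probs) - 1.0) > 1e-9:
        raise ValueError("control probabilities must sum to one")
    value = 0.0
    for policy, prob in zip(controllers, control_probs):
        action_dist = policy(q_next)  # distribution over actions in the next state
        value += prob * sum(p * q for p, q in zip(action_dist, q_next))
    return value


def greedy(q_values):
    """Focal policy: all probability mass on a maximizing action."""
    best = max(range(len(q_values)), key=lambda a: q_values[a])
    return [1.0 if a == best else 0.0 for a in range(len(q_values))]


def adversary(q_values):
    """Attacker: all probability mass on a minimizing action."""
    worst = min(range(len(q_values)), key=lambda a: q_values[a])
    return [1.0 if a == worst else 0.0 for a in range(len(q_values))]


def uniform_failure(q_values):
    """Random failure: a uniformly random action."""
    n = len(q_values)
    return [1.0 / n] * n


q_next = [4.0, 0.0, -8.0]
v = kappa_value(q_next, [greedy, adversary, uniform_failure], [0.8, 0.1, 0.1])
```

Because every controller is expressed through the same Q-values, only one Q-function has to be stored; changing the assumed threat model amounts to changing the list of controllers and their probabilities.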
In the following we present several $\kappa$ versions of classical TD methods. For simplicity we assume a scenario in which we have only a single adversarial external policy $\sigma$ that aims to minimize our value, and thus $\Sigma = \{\pi, \sigma\}$. Note however that our model is general, and would work for any $\Sigma$ and $p$.
4.2. Examples of TD($\kappa$) Methods
We first present single-agent $\kappa$-based learning methods by building on the standard TD methods Q-learning and Expected SARSA. Then we present two-agent joint-action learning approaches. Although a generalization to $n$ agents is relatively straightforward, we focus solely on the single- and two-agent case in this paper for clarity of exposition. In each case, we consider the setting in which either the focal agent, with policy $\pi$, is in control, or the external adversary, with policy $\sigma$, aiming to minimize return. We further simplify the model by making the control policy probability function $p(\sigma|s)$ state-independent, reducing it to a probability vector.
4.2.1. Single-Agent Methods
Before we present the algorithms, it is important to note that we need to distinguish the target and behaviour policies. The operator $\kappa$ is defined on the target (see Eq. (1)), while the behaviour policy is used only for selecting actions. We assume an $\epsilon$-greedy behaviour policy throughout.
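In off-policy Q($\kappa$), the target policy $\pi$ is the greedy policy that maximizes expected return. The adversarial policy $\sigma$ on the other hand aims to minimize the return, i.e., $\sigma(s) = \arg\min_a Q(s,a)$. Assuming a probability of attack of $\varkappa$ as before, we have $p(\sigma) = \varkappa$ and $p(\pi) = 1-\varkappa$. Thus, Eq. (2) becomes
$$V^\kappa(s') = (1-\varkappa)\max_a Q(s',a) + \varkappa\min_a Q(s',a).$$
For on-policy Expected SARSA($\kappa$) the target is the (expectation over the) focal policy $\pi$, while the adversarial policy remains the same as before. Thus, we have
$$V^\kappa(s') = (1-\varkappa)\sum_a \pi(s',a)\,Q(s',a) + \varkappa\min_a Q(s',a).$$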
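As a rough illustration of the on-policy variant, the sketch below computes an Expected SARSA style target in which the expectation over the focal policy is mixed with the adversary's minimum. It assumes an epsilon-greedy focal policy and a state-independent attack probability; the function names and the `attack_prob` argument are hypothetical and not taken from the source.

```python
def epsilon_greedy_probs(q_values, epsilon):
    """Action probabilities of an epsilon-greedy policy over the given Q-values."""
    n = len(q_values)
    best = max(range(n), key=lambda a: q_values[a])
    probs = [epsilon / n] * n
    probs[best] += 1.0 - epsilon
    return probs


def expected_sarsa_kappa_target(reward, q_next, gamma, epsilon, attack_prob):
    """Bootstrapped target: expectation under the focal policy when the agent
    keeps control, minimum over actions when the adversary takes over."""
    pi = epsilon_greedy_probs(q_next, epsilon)
    expected_value = sum(p * q for p, q in zip(pi, q_next))
    robust_value = (1.0 - attack_prob) * expected_value + attack_prob * min(q_next)
    return reward + gamma * robust_value


target = expected_sarsa_kappa_target(reward=-1.0, q_next=[2.0, -6.0],
                                     gamma=1.0, epsilon=0.5, attack_prob=0.1)
```

Setting both `epsilon` and `attack_prob` to zero recovers the ordinary Q-learning target, which mirrors the limiting relationships between the methods.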
4.2.2. Multi-Agent Methods
We move from a single-agent setting to a scenario in which multiple agents interact. For the sake of exposition we only present a two-agent case with different action spaces, $A_1$ and $A_2$, but an identical reward function and thus a shared joint-action Q-value function $Q(s,a_1,a_2)$. Moreover, we assume full communication during the learning phase, allowing the agents to take each other's policies into account when selecting the next action (a common practice in cooperative multi-agent learning settings, see e.g. Foerster et al., 2018; Sunehag et al., 2017). Our algorithms are therefore based on the joint-action learning (JAL) paradigm (Claus and Boutilier, 1998). We further assume that only one agent can be attacked at each time step (although relaxing this assumption is straightforward, we opt to keep it for clarity). For multi-agent Q($\kappa$) we can write Eq. (2) for each individual agent as
$$V^\kappa(s') = p(\pi)\max_{a_1}\max_{a_2}Q(s',a_1,a_2) + p(\sigma_1)\max_{a_2}\min_{a_1}Q(s',a_1,a_2) + p(\sigma_2)\max_{a_1}\min_{a_2}Q(s',a_1,a_2),$$
with $p(\pi) = 1-\varkappa$ and $p(\sigma_1) = p(\sigma_2) = \varkappa/2$, representing the scenario in which no attack happens with probability $1-\varkappa$, and each agent is attacked individually with probability $\varkappa/2$. (Note the order of the $\max$ and $\min$, which follows the Stackelberg assumption of an all-knowing attacker who moves last.) Analogously, we can define Eq. (2) for multi-agent Expected SARSA($\kappa$) as
$$V^\kappa(s') = p(\pi)\sum_{a_1}\sum_{a_2}\pi_1(s',a_1)\,\pi_2(s',a_2)\,Q(s',a_1,a_2) + p(\sigma_1)\sum_{a_2}\pi_2(s',a_2)\min_{a_1}Q(s',a_1,a_2) + p(\sigma_2)\sum_{a_1}\pi_1(s',a_1)\min_{a_2}Q(s',a_1,a_2),$$
where we now compute an expectation over the actual policy of the agents that are not attacked, while the attacker is still minimizing.
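The two-agent off-policy case can be sketched as follows. The code assumes, for illustration only, that the total attack probability is split evenly between the two agents and that the attacker observes the other agent's action before choosing (so its minimization is the inner operation); `joint_q_kappa_value` and `attack_prob` are hypothetical names.

```python
def joint_q_kappa_value(q_next, attack_prob):
    """Robust value of the next state for two joint-action learners.

    q_next[a1][a2] holds the shared joint-action Q-values of the next state.
    Assumption of this sketch: each agent is attacked with probability
    attack_prob / 2 and at most one agent is attacked per step.
    """
    n1, n2 = len(q_next), len(q_next[0])
    # Nobody is attacked: both agents jointly maximize.
    no_attack = max(max(row) for row in q_next)
    # Agent 1 is attacked: agent 2 picks a2, then the attacker picks the worst a1.
    agent1_attacked = max(min(q_next[a1][a2] for a1 in range(n1))
                          for a2 in range(n2))
    # Agent 2 is attacked: agent 1 picks a1, then the attacker picks the worst a2.
    agent2_attacked = max(min(q_next[a1][a2] for a2 in range(n2))
                          for a1 in range(n1))
    return ((1.0 - attack_prob) * no_attack
            + 0.5 * attack_prob * (agent1_attacked + agent2_attacked))


# Agent 1 has three actions (rows), agent 2 has two actions (columns).
q_joint = [[5.0, -1.0],
           [0.0, 1.0],
           [-10.0, 2.0]]
v_joint = joint_q_kappa_value(q_joint, attack_prob=0.2)
```

In this toy table the joint optimum (value 5) sits next to a very bad entry, so the attacked terms are much lower than the unattacked one; this is the mechanism that makes riskier joint actions less attractive as the attack probability grows.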
5. Theoretical Analysis
In this section we analyze theoretical properties of the proposed $\kappa$ methods. We start by relating the different algorithms to each other in the limit of their respective parameters. Then we proceed to show convergence of both Q($\kappa$) and Expected SARSA($\kappa$) to two different fixed points: (i) to the optimal value function $Q^*$ of the original MDP in the limit where $\varkappa \to 0$; and (ii) to the optimal robust value function $Q^*_\kappa$ of the MDP that is generalized w.r.t. the operator $\kappa$ for constant parameter $\varkappa$. Note that optimality in this sense is purely induced by the relevant operator. In (i) this is the standard Bellman optimality, which maximizes the expected discounted return of the MDP. In (ii), however, we derive optimality in the context of Generalized MDPs (Szepesvari and Littman, 1997), where optimal simply means the fixed point of a given operator, which can take many forms.
Before proceeding with the convergence proofs, Figure 1 summarizes some relationships between the algorithms in terms of their targets, in the limit of their respective parameters. As is known, Expected SARSA, SARSA, and Q-learning become identical in the limit of a greedy policy (Sutton and Barto, 2018; van Seijen et al., 2009). Furthermore, the update targets of our $\kappa$ methods approach the update targets of the standard TD methods on which they are based as $\varkappa \to 0$. Finally, Expected SARSA($\kappa$) and Q($\kappa$) share the same relationship as their original versions, and thus Expected SARSA($\kappa$) approaches Q($\kappa$) as $\epsilon \to 0$. Note that the algorithms' equivalence in the limit does not hold in the transient phase of the learning process, and hence in practice they may converge on different paths and to different policies that share the same value function. For a comprehensive understanding of the algorithms introduced in Section 4.2, the following sections provide proofs for both the convergence of the $\kappa$ methods for $\varkappa \to 0$ and their convergence when $\varkappa$ stays constant. (While we focus on the adversarial targets considered in Section 4.2, a previous proof of convergence under persistent exploration (Szepesvari and Littman, 1997) can be interpreted as a model of random failures with fixed $\varkappa$.)
5.1. Convergence to the Optimal $Q^*$
There exist several proofs of convergence for the temporal difference algorithms Q-learning (Jaakkola et al., 1994; Tsitsiklis, 1994), SARSA (Singh et al., 2000), and Expected SARSA (van Seijen et al., 2009). Each of these proofs hinges on linking the studied algorithm to a stochastic process, and then using convergence results from stochastic approximation theory (Dvoretzky, 1956; Robbins and Monro, 1951). These proofs are based on the following lemma, presented as Theorem 1 in Jaakkola et al. (1994) and as Lemma 1 in Singh et al. (2000). The two versions differ in the third condition, which describes the contraction mapping of the operator. The contraction property used for the Q-learning proof (Jaakkola et al., 1994) has the form $\|E\{F_t(\cdot)\,|\,P_t\}\| \le \gamma\|\Delta_t\|$, where $0 \le \gamma < 1$. We show the lemma as it was used for the SARSA proof provided by Singh et al. (2000), who show that the contraction property does not need to be strict; strict contraction is required to hold only asymptotically.
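Lemma 5.1.
Consider a stochastic process $(\alpha_t, \Delta_t, F_t)$, $t \ge 0$, where $\alpha_t, \Delta_t, F_t: X \to \mathbb{R}$ satisfy the equations
$$\Delta_{t+1}(x) = (1-\alpha_t(x))\,\Delta_t(x) + \alpha_t(x)\,F_t(x), \quad x \in X, \; t = 0, 1, 2, \dots$$
Let $P_t$ be a sequence of increasing $\sigma$-fields such that $\alpha_0$ and $\Delta_0$ are $P_0$-measurable and $\alpha_t$, $\Delta_t$ and $F_{t-1}$ are $P_t$-measurable, $t = 1, 2, \dots$. Then, $\Delta_t$ converges to zero with probability one (w.p.1) under the following assumptions:
(1) the set $X$ is finite,
(2) $0 \le \alpha_t(x) \le 1$, $\sum_t \alpha_t(x) = \infty$, $\sum_t \alpha_t^2(x) < \infty$ w.p.1,
(3) $\|E\{F_t(\cdot)\,|\,P_t\}\| \le \gamma\|\Delta_t\| + c_t$, where $\gamma \in [0,1)$ and $c_t$ converges to zero w.p.1,
(4) $\mathrm{Var}\{F_t(x)\,|\,P_t\} \le K(1+\|\Delta_t\|)^2$, where $K$ is some constant,
where $\|\cdot\|$ denotes a maximum norm.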
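The proof continues by relating Lemma 5.1 to the temporal difference algorithm, following the same reasoning as van Seijen et al. (2009) in their convergence proof for Expected SARSA. We define $X = S \times A$ and let $P_t$ collect the initial Q-values together with all states, actions, rewards and learning rates observed up to step $t$, so that $P_t$ represents the past at step $t$; $\alpha_t(x) = \alpha_t(s,a)$ is the learning rate for state $s$ and action $a$. To show the convergence of $Q_t$ to the optimal fixed point $Q^*$ we set $\Delta_t(x) = Q_t(s,a) - Q^*(s,a)$; therefore, when $\Delta_t$ converges to zero, the $Q_t$ values converge to $Q^*$. The maximum norm can be expressed as maximizing over states and actions as $\|\Delta_t\| = \max_s\max_a |Q_t(s,a) - Q^*(s,a)|$.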
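We follow the reasoning of Theorem 1 from van Seijen et al. (2009), where we repeat the conditions (1), (2) and (4) and modify the condition (3) for the $\kappa$ methods as follows.
Theorem 5.1.
Q($\kappa$) and Expected SARSA($\kappa$) as defined in Section 4.2.1, using the respective value function $V_t(s')$ defined by
$$V_t(s') = (1-\varkappa_t)\max_a Q_t(s',a) + \varkappa_t\min_a Q_t(s',a) \quad \text{for Q}(\kappa),$$
$$V_t(s') = (1-\varkappa_t)\sum_a \pi_t(s',a)\,Q_t(s',a) + \varkappa_t\min_a Q_t(s',a) \quad \text{for Expected SARSA}(\kappa),$$
converge to the optimal Q function $Q^*$ if:
(1) the state space $S$ and action space $A$ are finite,
(2) $\alpha_t(s,a) \in [0,1]$, $\sum_t \alpha_t(s,a) = \infty$ and $\sum_t \alpha_t^2(s,a) < \infty$ w.p.1,
(3) $\varkappa_t$ converges to zero w.p.1,
(3a) for Expected SARSA($\kappa$) the policy is greedy in the limit with infinite exploration (GLIE assumption),
(4) the reward function is bounded.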
Proof.
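Convergence of Q($\kappa$): To prove convergence of Q($\kappa$) we have to show that the conditions from Lemma 5.1 hold. Conditions (1), (2) and (4) of Theorem 5.1 correspond to conditions (1), (2) and (4) of Lemma 5.1 (van Seijen et al., 2009). We now need to show that the contraction property holds as well, using condition (3) of Theorem 5.1. Adapting the proof of van Seijen et al. (2009), we set $F_t(x) = r_t + \gamma V_t(s_{t+1}) - Q^*(s_t,a_t)$ to show that $F_t$ is a contraction mapping, i.e., satisfies condition (3) in Lemma 5.1. For Q($\kappa$) we write:
$$F_t = r_t + \gamma\Big[(1-\varkappa_t)\max_a Q_t(s_{t+1},a) + \varkappa_t\min_a Q_t(s_{t+1},a)\Big] - Q^*(s_t,a_t).$$
We want to show that $\|E\{F_t\}\| \le \gamma\|\Delta_t\| + c_t$ to prove the convergence of Q($\kappa$) to the optimal value $Q^*$. Using $Q^*(s_t,a_t) = E\{r_t + \gamma\max_a Q^*(s_{t+1},a)\}$, we obtain
$$\begin{aligned}
\|E\{F_t\}\| &= \gamma\max_s\max_a\Big|E\Big\{(1-\varkappa_t)\max_b Q_t(s_{t+1},b) + \varkappa_t\min_b Q_t(s_{t+1},b) - \max_b Q^*(s_{t+1},b)\Big\}\Big| \\
&\le \gamma\max_s\max_a\Big|E\Big\{\max_b Q_t(s_{t+1},b) - \max_b Q^*(s_{t+1},b)\Big\}\Big| + \gamma\varkappa_t\max_s\max_a\Big|E\Big\{\min_b Q_t(s_{t+1},b) - \max_b Q_t(s_{t+1},b)\Big\}\Big| \\
&\le \gamma\max_s\Big|\max_b Q_t(s,b) - \max_b Q^*(s,b)\Big| + \gamma\varkappa_t\max_s\Big|\min_b Q_t(s,b) - \max_b Q_t(s,b)\Big| \\
&\le \gamma\|\Delta_t\| + \gamma\varkappa_t\max_s\Big|\min_b Q_t(s,b) - \max_b Q_t(s,b)\Big|,
\end{aligned}$$
where the first inequality follows from standard algebra and the fact that splitting the maximum norm yields at least as large a number, the second inequality follows from the definition of the expectation and the maximal difference in values over all states being at least as large as a difference between values given in state $s_{t+1}$, and the third inequality follows from the definition of $\Delta_t$ above. (Recall that we set out in this section to show convergence to the same optimal Q-value $Q^*$ as classical Q-learning, even if we do so by our new operator.) We can see that if we set $c_t = \gamma\varkappa_t\max_s|\min_b Q_t(s,b) - \max_b Q_t(s,b)|$, then for $\varkappa_t \to 0$ we get $c_t$ converging to zero w.p.1, thus proving convergence of Q($\kappa$). ∎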
Proof.
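Convergence of Expected SARSA($\kappa$): As in the proof for Q($\kappa$), we need to show that the contraction property holds, this time using conditions (3) and (3a) of Theorem 5.1. We first define:
$$F_t = r_t + \gamma\Big[(1-\varkappa_t)\sum_a \pi_t(s_{t+1},a)\,Q_t(s_{t+1},a) + \varkappa_t\min_a Q_t(s_{t+1},a)\Big] - Q^*(s_t,a_t),$$
and then show the following:
$$\begin{aligned}
\|E\{F_t\}\| &= \gamma\max_s\max_a\Big|E\Big\{(1-\varkappa_t)\sum_b \pi_t(s_{t+1},b)\,Q_t(s_{t+1},b) + \varkappa_t\min_b Q_t(s_{t+1},b) - \max_b Q^*(s_{t+1},b)\Big\}\Big| \\
&\le \gamma\max_s\Big|\max_b Q_t(s,b) - \max_b Q^*(s,b)\Big| + \gamma\max_s\Big|(1-\varkappa_t)\sum_b \pi_t(s,b)\,Q_t(s,b) + \varkappa_t\min_b Q_t(s,b) - \max_b Q_t(s,b)\Big| \\
&\le \gamma\|\Delta_t\| + c_t,
\end{aligned}$$
where the inequalities use the same operations as above in the proof for Q($\kappa$). If we set $c_t = \gamma\max_s|(1-\varkappa_t)\sum_b \pi_t(s,b)\,Q_t(s,b) + \varkappa_t\min_b Q_t(s,b) - \max_b Q_t(s,b)|$ and assume that the policy is greedy in the limit with infinite exploration (GLIE assumption) and that the parameter $\varkappa_t \to 0$ w.p.1 (conditions (3) and (3a)), it follows that $c_t$ converges to zero w.p.1, thereby proving that Expected SARSA($\kappa$) converges to the optimal fixed point $Q^*$. ∎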
5.2. Convergence to the Robust $Q^*_\kappa$
In this section we show convergence to the robust value function $Q^*_\kappa$, which is optimal w.r.t. the operator $\kappa$. The main difference with the proof of Theorem 5.1 is that here we do not require $\varkappa \to 0$ but instead assume it remains constant over time. We base our reasoning on the theory of Generalized MDPs (Szepesvari and Littman, 1997). A Generalized MDP is defined using operator-based notation in terms of two operators, which we write as $\bigotimes$ and $\bigoplus$: the operator $\bigotimes$ defines how an optimal agent chooses her actions (in the classic Bellman equation this denotes maximization) and the operator $\bigoplus$ defines how the value of the current state is updated by the value of the next state (in the classic Bellman equation this denotes a probability-weighted average over the transition function). These operators can be chosen to model various different scenarios. The generalized Bellman equation can now be written as $V^* = \bigotimes\bigoplus(R + \gamma V^*)$. The main result of Szepesvari and Littman (1997) is that if $\bigotimes$ and $\bigoplus$ are non-expansions, then there is a unique optimal solution to which the generalized Bellman equation converges, given certain assumptions. For $0 \le \gamma < 1$ and non-expansion properties of $\bigotimes$ and $\bigoplus$ we get a contraction mapping of the Bellman operator $T$ defined as $TV = \bigotimes\bigoplus(R + \gamma V)$. Then, the operator $T$ has a unique fixed point by the Banach fixed-point theorem (Smart, 1974).
Building on the stochastic approximation theory results (which we also used in Section 5.1), Szepesvari and Littman (1997) show the following:
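Lemma 5.2.
Generalized Q-learning with operator $\bigotimes$ using the Bellman update
$$Q_{t+1}(s_t,a_t) = (1-\alpha_t(s_t,a_t))\,Q_t(s_t,a_t) + \alpha_t(s_t,a_t)\Big[r_t + \gamma\,\big(\textstyle\bigotimes Q_t\big)(s_{t+1})\Big]$$
converges to the optimal Q function w.p.1, if
(1) $0 \le \alpha_t(s,a) \le 1$, $\sum_t \alpha_t(s,a) = \infty$ and $\sum_t \alpha_t^2(s,a) < \infty$ w.p.1,
(2) $\bigotimes$ is a non-expansion,
(3) the reward function is bounded.
We base our convergence proofs for Q($\kappa$) and Expected SARSA($\kappa$) on the insights of Szepesvari and Littman (1997) given in Lemma 5.2.
Theorem 5.2.
Q($\kappa$) and Expected SARSA($\kappa$) as defined in Section 4.2.1 converge to the robust Q function $Q^*_\kappa$ for any fixed $\varkappa$.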
Proof.
Convergence of Q($\kappa$) to $Q^*_\kappa$: To prove convergence of Q($\kappa$) we follow the proof of Generalized Q-learning in Lemma 5.2. The only condition we need to guarantee is the non-expansion property of the operator in the value function update, which for Q($\kappa$) is a weighted average of the operators min and max. We write the operator for Q($\kappa$) as $\bigotimes^\kappa$ and define it as
$$\big(\textstyle\bigotimes^\kappa Q\big)(s) = (1-\varkappa)\max_a Q(s,a) + \varkappa\min_a Q(s,a).$$
In Appendix B of Szepesvari and Littman (1997), Theorem 9 states that any linear combination of non-expansion operators (here a convex combination with weights $1-\varkappa$ and $\varkappa$) is also a non-expansion operator. Moreover, Theorem 8 states that the summary operators $\max$ and $\min$ are also non-expansions. Therefore, $\bigotimes^\kappa$ is a non-expansion as well, thus proving the convergence of Q($\kappa$) to the robust fixed point $Q^*_\kappa$ induced by the operator $\kappa$. ∎
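For readers who prefer a direct argument over the cited theorems, the non-expansion property of a weighted average of max and min can also be checked by hand. The short derivation below is a sketch using our own notation: $Q_1$ and $Q_2$ are two arbitrary Q-functions, $w \in [0,1]$ stands for the fixed attack probability, and $M_w$ is a name introduced here for the weighted max/min operator. It relies only on the elementary facts $|\max_a f(a) - \max_a g(a)| \le \max_a |f(a) - g(a)|$ and the analogous inequality for $\min$.

```latex
\begin{aligned}
\big| (M_w Q_1)(s) - (M_w Q_2)(s) \big|
  &= \Big| (1-w)\big[\max_a Q_1(s,a) - \max_a Q_2(s,a)\big]
        + w\big[\min_a Q_1(s,a) - \min_a Q_2(s,a)\big] \Big| \\
  &\le (1-w)\,\Big|\max_a Q_1(s,a) - \max_a Q_2(s,a)\Big|
        + w\,\Big|\min_a Q_1(s,a) - \min_a Q_2(s,a)\Big| \\
  &\le (1-w)\,\max_a \big|Q_1(s,a) - Q_2(s,a)\big|
        + w\,\max_a \big|Q_1(s,a) - Q_2(s,a)\big| \\
  &= \max_a \big|Q_1(s,a) - Q_2(s,a)\big|
   \;\le\; \|Q_1 - Q_2\|_\infty ,
\end{aligned}
\qquad\text{where } (M_w Q)(s) = (1-w)\max_a Q(s,a) + w\min_a Q(s,a).
```

Taking the maximum over states on the left-hand side gives $\|M_w Q_1 - M_w Q_2\|_\infty \le \|Q_1 - Q_2\|_\infty$. The argument needs the weights to be non-negative and to sum to one, which is why the attack probability must be held fixed in $[0,1]$.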
Proof.
Convergence of Expected SARSA($\kappa$) to $Q^*_\kappa$: We base our convergence proof of Expected SARSA($\kappa$) again on the work of Szepesvari and Littman (1997), this time on their insights regarding persistent exploration (Section 4.5 in their paper). They show that Generalized Q-learning with $\epsilon$-greedy action selection converges, for a fixed $\epsilon$, in the Generalized MDP. Following similar reasoning, we define the operator $\bigotimes^{\kappa,\epsilon}$ for Expected SARSA($\kappa$) with fixed $\epsilon$ as
$$\big(\textstyle\bigotimes^{\kappa,\epsilon} Q\big)(s) = (1-\varkappa)\Big[(1-\epsilon)\max_a Q(s,a) + \frac{\epsilon}{|A|}\sum_a Q(s,a)\Big] + \varkappa\min_a Q(s,a).$$
Again, from repeated application of Theorems 8 and 9 in Appendix B of Szepesvari and Littman (1997) it follows that $\bigotimes^{\kappa,\epsilon}$ is a non-expansion as well. Therefore, by Lemma 5.2, Expected SARSA($\kappa$) converges to $Q^*_\kappa$ for fixed exploration $\epsilon$. ∎
It remains an open question whether Expected SARSA($\kappa$) also converges for decreasing $\epsilon$, e.g., under the GLIE assumption, although we conjecture that it might.
5.3. Convergence in the Multi-Agent Case
We now prove convergence of the cooperative multi-agent variant of the $\kappa$ methods presented in Section 4.2.2. This proof builds on the theory of Generalized MDPs, similar to the proofs presented in Section 5.2. Therefore this proof also assumes a fixed probability of attack $\varkappa$. In addition, we make use of the assumption that agents can communicate freely in the learning phase, and thus receive identical information and can build a common joint-action Q-table.
Theorem 5.3.
Multi-agent Q($\kappa$) and Expected SARSA($\kappa$) as defined in Section 4.2.2 converge to the robust Q function $Q^*_\kappa$ for any fixed $\varkappa$.
Proof.
The operator for our multi-agent versions of Q($\kappa$) and Expected SARSA($\kappa$) consists of a nested combination of different components, in particular $\max_a$, $\min_a$, and $\sum_a \pi(s,a)$, where $\pi$ is the $\epsilon$-greedy policy. By Theorem 8 of Szepesvari and Littman (1997), $\max$ and $\min$ are non-expansions. By Theorem 9 of Szepesvari and Littman (1997), linear combinations of non-expansion operators are also non-expansion operators. Finally, by Theorem 10 of Szepesvari and Littman (1997), products of non-expansion operators are also non-expansion operators. Therefore, the nested compounds that appear in the multi-agent targets (such as $\max_{a_1}\max_{a_2}$ and $\max_{a_2}\min_{a_1}$) are also non-expansion operators, as are linear combinations of those compounds. Similarly, $\sum_a \pi(s,a)$ for fixed $\epsilon$ can be written as a linear combination of summary operators, which by Theorems 8 and 9 of Szepesvari and Littman (1997) is a non-expansion. Therefore, the operator used in both multi-agent Q($\kappa$) and Expected SARSA($\kappa$) is a non-expansion. Thus, by Lemma 5.2, Q($\kappa$) and Expected SARSA($\kappa$) converge to $Q^*_\kappa$ for fixed $\varkappa$, and in the case of Expected SARSA($\kappa$), for fixed $\epsilon$. ∎
6. Experiments and Results
In this section we evaluate temporal difference methods with the proposed operator $\kappa$: the off-policy method Q($\kappa$) and the on-policy method Expected SARSA($\kappa$). We experiment with a classic cliff walking scenario for the single-agent case and a multi-agent puddle world scenario. Both these domains contain some critical states, a cliff and a puddle respectively, which yield a very high negative reward for the agent(s) when stepped into. These critical states represent the significant rare events (SREs). We compare our methods to classic temporal difference methods like SARSA, Q-learning and Expected SARSA. In all the experiments we consider an undiscounted ($\gamma = 1$), episodic scenario.
Cliff Walking: single-agent
The Cliff Walking experiment as shown in Figure 2 is a classic scenario proposed in Sutton and Barto (2018) and used frequently ever since (e.g., van Seijen et al. (2009)). The agent needs to get from the start state [S] to the goal state [G], while avoiding stepping into the cliff, which yields a reward of $-100$ (as in the classic formulation) and sends the agent back to the start. For every move which does not lead into the cliff the agent receives a reward of $-1$.
Puddle World: multi-agent
The Puddle World environment is a grid world with puddles which need to be avoided by the joint-action learning agents. The two agents jointly control the movement of a single robot in the Puddle World, each controlling one direction (vertical or horizontal). Agent 1 can take the actions {stay, move down, move up} and agent 2 can choose {stay, move left, move right, move right by 2}; thus their action spaces are different, further complicating the learning process compared to the single-agent scenario. The joint action is the combination of the two selected actions. We assume a reward of $-1$ for every move and $-100$ for stepping into a puddle (returning to the start node). The agents have to move together from the start node at the top left corner to the goal at the bottom right corner. Figure 3 shows the policy learned by our proposed algorithm Q($\kappa$) for the two joint-learning agents. Note how a safer path (longer, avoiding the puddles) is learned with increasing parameter $\varkappa$ (i.e., higher probability of SREs). For $\varkappa = 0$ our algorithm degenerates to Q-learning (left panel).
6.1. Performance
We replicate the experiment of van Seijen et al. (2009) on the Cliff Walking domain, in which we compare our methods with Q-learning, SARSA and Expected SARSA, and perform a similar experiment on the Puddle World domain. In line with van Seijen et al. (2009) we show (i) early performance, which is the average return over the first training episodes, and (ii) converged performance, which is the average return over episodes.
Figure 4 shows the results for three different settings of both scenarios: (i) a deterministic environment, where each action chosen by the policy is executed with certainty; (ii) an environment with stochasticity, in which a random action is taken with of the time; and (iii) an environment with probability of attack, in which an adversarial action is taken of the time. As before, we define an attack as an action that minimizes the Q-value in the given state. The stochastic environment can be seen as modelling random failures.
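To make the attack setting concrete, the following sketch shows one way the executed action could be chosen in a simulated environment with attacks: with some probability the agent's intended action is replaced by the action that minimizes the current Q-values. The helper name `executed_action`, the `attack_prob` argument and the use of a seeded `random.Random` instance are assumptions of this illustration, not details given in the text.

```python
import random


def executed_action(q_values, intended_action, attack_prob, rng):
    """Return the action that is actually executed in the environment.

    With probability attack_prob an adversary overrides the agent and plays
    the action with the lowest Q-value in the current state; otherwise the
    agent's intended action goes through unchanged.
    """
    if rng.random() < attack_prob:
        return min(range(len(q_values)), key=lambda a: q_values[a])
    return intended_action


rng = random.Random(0)           # seeded for reproducibility
q_state = [1.0, -50.0, 0.5]      # hypothetical Q-values of the current state
always_attacked = executed_action(q_state, 0, attack_prob=1.0, rng=rng)
never_attacked = executed_action(q_state, 0, attack_prob=0.0, rng=rng)
```

A random-failure environment can be sketched in the same way by replacing the minimizing action with `rng.randrange(len(q_values))`.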
The early performance experiments are averaged over 300 trials and the converged performance experiments are averaged over 10 trials. We also show confidence intervals on all results. We fix the exploration rate $\epsilon$; for the $\kappa$ methods we also fix the parameter $\varkappa$ (later in this section we experiment with different settings of $\varkappa$). Note that the y-axis, showing the average return, is the same in each row for easy comparison. The x-axis shows different learning rates $\alpha$. We can see how the average return decreases with more complex scenarios, from the deterministic one, through the stochastic one, to the one with attacks. The $\kappa$ methods are superior to the other baselines in the early performance experiments, especially in the attack case, which is the scenario the $\kappa$ methods are designed for. In the converged performance experiments the $\kappa$ methods beat Q-learning and SARSA and perform at least as well as Expected SARSA.
6.2. Different Levels of Probability of Attack
In this section we investigate how the $\kappa$ methods behave under different levels of attack, defined by the probability of attack per state. We consider an attack on trained (converged) methods: we first train each method in the deterministic environment and then test it over a number of trials with a given probability of attack per state. We average the results over these trials and provide confidence intervals. Note that this is a different methodology of testing the methods against an adversarial attack compared to the experiments in Figure 4, where we considered attacks during training. This experiment shows the strength of the $\kappa$ methods for different levels of attacks. We assume the probability of attack to be known here and thus set the parameter $\varkappa$ equal to that probability, which is the meaning of the parameter as described before. In other words, the parameter $\varkappa$ prescribes how safely we want to act. We consider a range from very rare attacks to more frequent attacks, as shown in Figure 5. For better visualisation we use logarithmic axes. We train all methods with a fixed exploration rate $\epsilon$ and learning rate $\alpha$; note that the methods (except SARSA) converge to the same result for different learning rates, as shown in the left panel of Figure 4. SARSA is very unstable for different learning rates (demonstrated by wide confidence intervals), learns different paths for different $\alpha$, and converges too slowly or not at all, which can be partly explained by its higher variance (van Seijen et al., 2009).
We test the different levels of probability of attack on the Cliff Walking experiment in the left panel of Figure 5, where we can see that the $\kappa$ methods compare favourably to the other baselines, although in some parts they give similar performance to Expected SARSA or SARSA. The Cliff Walking experiment has limited expressiveness for testing the $\kappa$ methods due to the limited number of possible safe paths with low costs (see Figure 2), which is why the $\kappa$ methods show only similar performance compared to the baselines, not reaching their full potential. The Puddle World, however, is more expressive, because there are several possible paths differing in level of safety and cost. The bigger solution space of the Puddle World is also induced by the two cooperating agents, each having their own action space. Therefore, in the right panel of Figure 5 we show the Puddle World experiment for different levels of probability of attack. Here, we can clearly see the $\kappa$ methods outperform the baselines; in particular, Q($\kappa$) is superior over the whole range of considered probabilities of attack. Note that Q($\kappa$) learns a safer path even for very rare attacks, which is also shown in Figure 3, where Q($\kappa$) learns a path with the same cost (distance) as Q-learning, but further from the puddles.
6.3. Robustness Analysis
We now test the robustness of the proposed algorithms to an incorrect attack model, meaning that the value of in Q() and Expected SARSA() no longer matches the actual probability of attack (in our previous experiments matched the actual probability of attack precisely). Figure 6 shows the performance of our algorithms for a range of actual attack probabilities (y-axis) while learning using a fixed parameter .
To better highlight the robustness of our methods we choose a range of relatively high actual probabilities of attack around the fixed value of (note that we no longer use a logarithmic scale). One can see that even when is not equal to the actual probability of attack, the proposed algorithms still outperform the baselines in most cases. In the Cliff Walking experiment (Figure 6 left) the proposed methods perform similarly to SARSA; however, SARSA is quite unstable, as discussed before and as one can see from the width of its confidence interval. The Puddle World experiment (Figure 6 right) demonstrates the superior performance of the proposed methods, which beat all the baselines even for the fixed parameter . These results show that even when we do not know the probability of attack accurately, we can learn a more robust strategy using the proposed methods.
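The test protocol used in these experiments (a fixed, already trained policy evaluated under a given per-state probability of attack, with results averaged over trials) can be sketched roughly as follows. This is only an illustrative sketch and not the code used for the reported results: the toy corridor environment, the names `step`, `agent_policy`, `attack_policy`, `run_trial` and `evaluate`, the step limit, and the normal-approximation 95% confidence interval are all assumptions introduced here.

```python
import math
import random
from statistics import mean, stdev

# Hypothetical toy corridor: positions 0..5, where 0 is a "cliff" and 5 is the goal.
def step(state, action):
    """Deterministic transition: returns (next_state, reward, done)."""
    nxt = state + action
    if nxt <= 0:
        return 0, -100.0, True   # fell off the cliff
    if nxt >= 5:
        return 5, -1.0, True     # reached the goal
    return nxt, -1.0, False

def agent_policy(state):
    return +1                    # stand-in for a trained policy: walk to the goal

def attack_policy(state):
    return -1                    # stand-in for the adversary: walk to the cliff

def run_trial(policy, attacker, start, p_attack, rng, max_steps=200):
    """One test episode; in every state the attacker takes control
    with probability p_attack, otherwise the trained policy acts."""
    state, total = start, 0.0
    for _ in range(max_steps):
        controller = attacker if rng.random() < p_attack else policy
        state, reward, done = step(state, controller(state))
        total += reward
        if done:
            break
    return total

def evaluate(policy, attacker, start, p_attack, n_trials, seed=0):
    """Mean return over n_trials and the half-width of an approximate
    95% confidence interval (normal approximation, an assumption here)."""
    rng = random.Random(seed)
    returns = [run_trial(policy, attacker, start, p_attack, rng)
               for _ in range(n_trials)]
    half_width = 1.96 * stdev(returns) / math.sqrt(n_trials)
    return mean(returns), half_width

if __name__ == "__main__":
    # Sweep the actual attack probability while the evaluated policy stays fixed.
    for p in (0.0, 0.01, 0.1, 0.5):
        m, h = evaluate(agent_policy, attack_policy, start=2,
                        p_attack=p, n_trials=500)
        print(f"p_attack={p:<5} mean return={m:8.2f} +/- {h:.2f}")
```

In the robustness analysis the swept probability would be the actual attack probability, while the policy under test was learned with a different, fixed parameter; the sketch leaves the training step out entirely.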
7. Discussion and Conclusion
We presented a new operator for temporal difference learning, which improves robustness of the learning process against potential attacks or perturbations in control. We proved convergence of Q() and Expected SARSA() to (i) the optimal value function of the original MDP in the limit where ; and (ii) the optimal robust value function of the MDP that is generalized w.r.t. for constant parameter , in both the single-agent and multi-agent versions of the methods. Our complementary empirical results demonstrated that the proposed methods indeed provide robustness against a chosen scenario of potential attacks and failures. Although our method assumes that a model of such attacks and failures is known to the agent, we further demonstrated that our methods are robust against small model errors. Moreover, we have shown that even in the absence of attacks or failures, our method learns a policy that is robust in general against environment stochasticity, in particular in the early stages of learning.
There are several interesting directions for future work. The control space can be extended, allowing for more agents being attacked or malfunctioning with different intensity, or for control transitions that depend on variables other than the state. Furthermore, the target of adversarial policies could be learned from experience using ideas from opponent modelling (e.g. DPIQN; Hong et al., 2018). Our proposed operator can potentially be combined with recent state-of-the-art reinforcement learning methods. For example, the operator could be combined with the multi-step Retrace(λ) algorithm (Munos et al., 2016), potentially speeding up convergence. Mixed multi-step updates could be introduced by combination with Q(σ) (Asis et al., 2018), where the parameter σ can also be state-dependent, similarly to the control transitions in our model, making it possible to learn policies that are robust against e.g. multi-step attacks. Another interesting extension along this line would be to model the control transition similarly to the options framework (Sutton et al., 1999; Bacon et al., 2017), in which case the alternate control policies could be seen as “malicious” options over which the agent has no control, with potentially complex initiation sets and termination conditions. Such extensions would further increase the flexibility of our proposed operator and narrow the reality gap, making it applicable to a wide range of real-world scenarios.
This project has received funding in the framework of the joint programming initiative ERA-Net Smart Energy Systems’ focus initiative Smart Grids Plus, with support from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 646039. We are indebted to the anonymous reviewers of AAMAS 2019 for their valuable feedback.
References
 Asis et al. (2018) Kristopher De Asis, J. Hernandez-Garcia, G. Holland, and Richard S. Sutton. 2018. Multi-Step Reinforcement Learning: A Unifying Algorithm. In AAAI Conference on Artificial Intelligence.
 Bacon et al. (2017) Pierre-Luc Bacon, Jean Harb, and Doina Precup. 2017. The Option-Critic Architecture. In Proceedings of Association for the Advancement of Artificial Intelligence Conference (AAAI). 1726–1734.
 Ciosek and Whiteson (2017) Kamil Andrzej Ciosek and Shimon Whiteson. 2017. OFFER: Off-Environment Reinforcement Learning. In Proceedings of Association for the Advancement of Artificial Intelligence Conference (AAAI).
 Claus and Boutilier (1998) Caroline Claus and Craig Boutilier. 1998. The dynamics of reinforcement learning in cooperative multiagent systems. AAAI/IAAI (1998), 746–752.
 Cristian et al. (1996) Flaviu Cristian, Bob Dancey, and Jon Dehn. 1996. Fault-tolerance in air traffic control systems. ACM Transactions on Computer Systems (TOCS) 14, 3 (1996), 265–286.
 Dvoretzky (1956) Aryeh Dvoretzky. 1956. On Stochastic Approximation. In Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics. University of California Press, 39–55.
 Foerster et al. (2018) Jakob N Foerster, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, and Shimon Whiteson. 2018. Counterfactual multi-agent policy gradients. In Thirty-Second AAAI Conference on Artificial Intelligence.

 García and Fernández (2015) Javier García and Fernando Fernández. 2015. A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research 16, 1 (2015), 1437–1480.
 Gaskett (2003) Chris Gaskett. 2003. Reinforcement learning under circumstances beyond its control. In Proceedings of the International Conference on Computational Intelligence for Modelling Control and Automation.
 Hong et al. (2018) Zhang-Wei Hong, Shih-Yang Su, Tzu-Yun Shann, Yi-Hsiang Chang, and Chun-Yi Lee. 2018. A deep policy inference q-network for multi-agent systems. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems. 1388–1396.
 Jaakkola et al. (1994) Tommi Jaakkola, Michael I Jordan, and Satinder P Singh. 1994. Convergence of stochastic iterative dynamic programming algorithms. In Advances in Neural Information Processing Systems. 703–710.
 Klima et al. (2016) Richard Klima, Karl Tuyls, and Frans Oliehoek. 2016. Markov Security Games: Learning in Spatial Security Problems. NIPS Workshop on Learning, Inference and Control of Multi-Agent Systems (2016), 1–8.
 Knight (2002) John C Knight. 2002. Safety critical systems: challenges and directions. In Proceedings of the 24th International Conference on Software Engineering. ACM, 547–550.
 Korzhyk et al. (2011) Dmytro Korzhyk, Zhengyu Yin, Christopher Kiekintveld, Vincent Conitzer, and Milind Tambe. 2011. Stackelberg vs. Nash in Security Games: An Extended Investigation of Interchangeability, Equivalence, and Uniqueness. Journal of Artificial Intelligence Research 41 (2011), 297–327.
 Littman (1994) Michael L. Littman. 1994. Markov games as a framework for multiagent reinforcement learning. Technical Report. Brown University. 157–163 pages.
 Liu et al. (2012) Jing Liu, Yang Xiao, Shuhui Li, Wei Liang, and CL Philip Chen. 2012. Cyber security and privacy issues in smart grids. IEEE Communications Surveys & Tutorials 14, 4 (2012), 981–997.
 Lou et al. (2017) Jian Lou, Andrew M Smith, and Yevgeniy Vorobeychik. 2017. Multidefender security games. IEEE Intelligent Systems 32, 1 (2017), 50–60.
 Morimoto and Doya (2005) Jun Morimoto and Kenji Doya. 2005. Robust reinforcement learning. Neural computation 17, 2 (2005), 335–359.
 Munos et al. (2016) Rémi Munos, Tom Stepleton, Anna Harutyunyan, and Marc Bellemare. 2016. Safe and efficient off-policy reinforcement learning. In Advances in Neural Information Processing Systems (NIPS). 1054–1062.
 Pita et al. (2008) James Pita, Manish Jain, Janusz Marecki, Fernando Ordonez, Christopher Portway, Milind Tambe, Craig Western, Praveen Paruchuri, and Sarit Kraus. 2008. Deployed ARMOR Protection: The Application of a Game Theoretic Model for Security at the Los Angeles International Airport. In International Joint Conference on Autonomous Agents and Multiagent Systems (AAMAS), Vol. 3. 1805–1812.
 Robbins and Monro (1951) Herbert Robbins and Sutton Monro. 1951. A Stochastic Approximation Method. The Annals of Mathematical Statistics 22, 3 (1951), 400–407.
 Ruan et al. (2005) Sui Ruan, Candra Meirina, Feili Yu, Krishna R Pattipati, and Robert L Popp. 2005. Patrolling in a stochastic environment. Technical Report. Electrical and Computer Engineering Department, University of Connecticut, Storrs.
 Shieh et al. (2012) Eric Shieh, Bo An, Rong Yang, Milind Tambe, Craig Baldwin, Joseph DiRenzo, Ben Maule, and Garrett Meyer. 2012. PROTECT: A Deployed Game Theoretic System to Protect the Ports of the United States. International Conference on Autonomous Agents and Multiagent Systems (AAMAS) 1 (2012), 13–20.
 Shooman (2003) Martin L Shooman. 2003. Reliability of computer systems and networks: fault tolerance, analysis, and design. John Wiley & Sons.
 Singh et al. (2000) Satinder Singh, Tommi Jaakkola, Michael L Littman, and Csaba Szepesvári. 2000. Convergence results for single-step on-policy reinforcement-learning algorithms. Machine learning 38, 3 (2000), 287–308.
 Singh et al. (1994) Satinder P Singh, Andrew G Barto, Roderic Grupen, and Christopher Connolly. 1994. Robust reinforcement learning in motion planning. In Advances in Neural Information Processing Systems (NIPS). 655–662.
 Smart (1974) D. R. Smart. 1974. Fixed point theorems. Cambridge University Press, Cambridge.
 Sunehag et al. (2017) Peter Sunehag, Guy Lever, Audrunas Gruslys, Wojciech Marian Czarnecki, Vinicius Zambaldi, Max Jaderberg, Marc Lanctot, Nicolas Sonnerat, Joel Z Leibo, Karl Tuyls, and Thore Graepel. 2017. Value-decomposition networks for cooperative multi-agent learning. arXiv preprint arXiv:1706.05296 (2017).
 Sutton and Barto (2018) Richard S. Sutton and Andrew G. Barto. 2018. Reinforcement Learning: An Introduction (second ed.). The MIT Press, Cambridge, MA.
 Sutton et al. (1999) Richard S Sutton, Doina Precup, and Satinder Singh. 1999. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence 112, 1-2 (1999), 181–211.
 Szepesvari and Littman (1997) Csaba Szepesvari and Michael L Littman. 1997. Generalized Markov Decision Processes: Dynamic-programming and Reinforcement-learning Algorithms. Technical Report. Brown University.
 Tsitsiklis (1994) John N Tsitsiklis. 1994. Asynchronous stochastic approximation and Q-learning. Machine Learning 16, 3 (1994), 185–202.
 van Seijen et al. (2009) Harm van Seijen, Hado van Hasselt, Shimon Whiteson, and Marco Wiering. 2009. A theoretical and empirical analysis of Expected Sarsa. In IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning, ADPRL 2009. 177–184.
 Yan et al. (2013) Ye Yan, Yi Qian, Hamid Sharif, and David Tipper. 2013. A survey on smart grid communication infrastructures: Motivations, requirements and challenges. IEEE Communications Surveys & Tutorials 15, 1 (2013), 5–20.
 Zhou and Doyle (1998) Kemin Zhou and John Comstock Doyle. 1998. Essentials of robust control. Vol. 104. Prentice Hall, Upper Saddle River, NJ.