Revisiting State Augmentation methods for Reinforcement Learning with Stochastic Delays

08/17/2021 ∙ by Somjit Nath, et al. ∙ Tata Consultancy Services 9

Several real-world scenarios, such as remote control and sensing, are comprised of action and observation delays. The presence of delays degrades the performance of reinforcement learning (RL) algorithms, often to such an extent that algorithms fail to learn anything substantial. This paper formally describes the notion of Markov Decision Processes (MDPs) with stochastic delays and shows that delayed MDPs can be transformed into equivalent standard MDPs (without delays) with significantly simplified cost structure. We employ this equivalence to derive a model-free Delay-Resolved RL framework and show that even a simple RL algorithm built upon this framework achieves near-optimal rewards in environments with stochastic delays in actions and observations. The delay-resolved deep Q-network (DRDQN) algorithm is bench-marked on a variety of environments comprising of multi-step and stochastic delays and results in better performance, both in terms of achieving near-optimal rewards and minimizing the computational overhead thereof, with respect to the currently established algorithms.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1. Introduction

Despite their enormous success, reinforcement learning (RL) in the basic Markov Decision Process (MDP) framework makes various restrictive assumptions that limit its applicability to real-world problems. Two of the most common simplifying assumptions that pose challenges in practical scenarios are absence of (a) observation delay, i.e., a system’s current state is always assumed to be available to the agent, and (b) action delay, i.e., an agent’s action has an immediate effect on the system’s trajectory. It is well-known that the presence of delays in dynamical systems degrades the performance of the agents, resulting in undesirable behaviors of the underlying closed-loop systems and leading to instability (Logemann, 1998; Patanarapeelert et al., 2006).

Many real-world problems, including but not limited to congestion control (Altman et al., 1999), control of robotic systems (Imaida et al., 2004; Jin et al., 2008), distributed computing (Hannah and Yin, 2018), management of transportation networks (Dollevoet et al., 2018), and financial reporting (Whittred and Zimmer, 1984)

exhibit the undesirable effect of action and observation delays towards performance of the corresponding agents. Medical domain is another significant and probably the most pertinent domain in today’s world where decision making is based on observing delayed state information. For instance, the decision to self-isolate and quarantine in case of suspected COVID-19 infection is based on the outcome of swab-based test, the result of which is significantly delayed (often by a day or two), since the labs tend to run large batches of swab tests together, not to mention the time taken by the labs to collect samples 

(Larremore et al., 2020; Rong et al., 2020). Similarly in biological systems, visual and proprioceptive information for motor control is significantly (500ms) delayed during transmission through neural pathways (Zelevinsky, 1998).

In recent years, deep reinforcement learning (DRL) has demonstrated enormous capability in learning complex dynamics and interactions in highly constrained environments (Mnih et al., 2015; Gibney, 2016; Silver et al., 2016; Henderson et al., 2018)

. Almost all the problems that DRL addresses are modeled using MDPs. By ignoring delay of agents, one can equivalently model the underlying problem as partially observable MDP (POMDP). POMPDs are generalizations of MDPs, however solving POMDPs without estimating hidden action states leads to arbitrary sub-optimal policies 

(Singh et al., 1994). Therefore, it becomes imperative to augment the delayed observation with the last few actions in order to ensure Markov property (Altman and Nain, 1992; Katsikopoulos and Engelbrecht, 2003; Walsh et al., 2009; Bouteiller et al., 2021) in delayed settings.

The reformulation allows an MDP with delays to be viewed as an equivalent MDP without delays comprising of the same state-transition structure as that of the original MDP, but at the expense of increased (augmented) state-space and relatively complex cost structure involving a conditional expectation. While the state-augmentation is essential to ensure Markov property, the resulting Bellman equation with conditional expectation in the equivalent cost structure requires vast computational resources for estimation. In particular, if the effect of an action or state observation is delayed by units at any stage, then the time-complexity of computation of the expected cost scales as  (Altman and Nain, 1992; Katsikopoulos and Engelbrecht, 2003), where denotes the finite set of states. Moreover, the size of the “augmented” state grows exponentially with , i.e., , and thus tabular learning approaches become computationally prohibitive even for moderately large delays. Here and represent the finite set of information states and actions, respectively. In view of these challenges, most of the existing work on designing reinforcement learning algorithms has largely worked with assumptions of constant delays and trying to handle delays in the environment either by adopting model-based methods (which are again computationally expensive and susceptible to sub-optimality) or by simulating future observations.

In this paper, we further extend the applicability of reinforcement learning algorithms to delayed environments on the following fronts: (a) We formally define and introduce the constant delay MDP (CDMDP) and stochastic delay MDP (SDMDP), and show that MDP with deterministic or stochastic delays can be converted into equivalent MDPs without delays. This is in contrast with semi-Markov Decision Processes (SMDPs), where the agent can wait for the current state to become observable or wait for the most recent action to get applied before taking subsequent actions; (b) The equivalence presents itself as a general framework for modeling MDP with delays as MDPs without delays within which the augmented states (information states) carry the necessary information for optimal action selection, the conditional expectation of the cost structure is significantly simplified; (c) A Delay-Resolved Deep Q-Network (DRDQN) reinforcement learning algorithm is developed which is shown to achieve near-optimal performance with minimal computational overhead. In particular, by using deep neural networks as nonlinear function approximators and disregarding a tabular framework, the computational complexity scales as

, i.e., it scales linearly with the delay , and thus becomes amenable to work with.

A word on notation: A no-delay Markov Decision Process (MDP) is denoted by the tuple , where and represent the finite sets of states and actions, respectively. For each triple , the probability that the system state transitions from to under the effect of the action is denoted by . The immediate reward associated with each state-action pair is depicted by . The cumulative reward is discounted by a factor and is given by the infinite sum . Let represent the instantaneous delay in an agent’s action or observation. We use to denote the set of augmented states. Note that in case of stochastic delay, is not constant and thus the definition of changes accordingly; however, with a slight abuse of notation, is used to represent the set of augmented states independent of the total delay with the understanding that its meaning is clear from the context. Finally, we use and to differentiate between the delays in observation and action, respectively, though, it is shown later that the two delays have similar effects on optimal decision making in delayed environments.

2. Decision Making With Delays

Figure 1.

Graphical illustration of the problem with observation and actions delays. An action is computed in every time step, using the latest known state and a vector of all actions taken since that state was observed.

Fig. 1 is an illustration of an RL agent involved in sequential decision making in a delayed environment. At the beginning of each stage , agent has access to the most recent observed state with , and the most recent sequence of actions with , where is the most recent delayed action that got applied in an otherwise undelayed environment. It is easy to notice that in order to ensure Markov property in a stochastic delay setting, it is not only important to keep track of the most recent observed state, but also the time-instant at which it was observed first. Instead of naively waiting for the current state to become observable and the current action to get applied, agent must make informed decision at each step in order to minimize the total time taken to complete a given task.

One of the motivating examples of real-world delayed setting concerns with medical decision making for a suspected COVID-19 infection, where the mean time between symptom onset and a case being confirmed is estimated to be 5.6 days (standard deviation 4.2), see Fig. 

2. Therefore, based on a mean incubation period of 5.3 days (Li et al., 2020), it may take up to 10.9 days on average for a case to be confirmed post-infection. Assuming that an agent starts developing some early flu-like symptoms, basing the decision making to self-isolate only after the availability of complete information (after 10.9 days) may result in further increase in infections around the subject. To overcome this concern, agent can instead resort to decision making in an MDP setting that takes into account the most recent observation (early symptoms, subject’s proximity with potential confirmed cases, etc.) and action history (if the subject avoided close contacts with others during early symptomatic phase, agent’s adherence to putting on a face-mask, etc.).

Figure 2. Histogram of the time between onset of COVID-19 symptom(s) and case confirmation for 139 patients from Basel-Landschaft, adopted from (Sciré et al., 2020).

Thus, the ability to perform decision-making in the presence of delays without having to wait for the effect of the current action to become observable is critical, particularly when the environment responds poorly to such ‘waiting’ actions. Hence, it is extremely important to design algorithms with ability to perform real-time decision-making.

3. Related Work

The main objective of this paper is to learn to perform optimal decision-making in delayed environments without having to wait for the observations to arrive or actions to get applied. It must be emphasized that the equivalent MDP formulation proposed in this paper with simplified reward structure and appropriately chosen information state need not have the same optimal reward as the original MDP, however, the two MDPs share the same optimal policy and we leverage this equivalence for optimal decision-making both in deterministic, as well as stochastic delayed environments. MDPs in delayed environments have been applied extensively in a variety of domains, particularly in optimal control of systems with delayed feedback (Altman and Nain, 1992) and design of communication network in the presence of transmission delays (Altman et al., 1999). Much of the prior work on delayed-MDPs (Brooks and Leondes, 1972; Kim, 1985; Altman and Nain, 1992; Kim and Jeong, 1987; White, 1988; Bander and White III, 1999) and POMDPs (Sondik, 1978) considered only constant observation delays, which was later extended to analyze constant delays in action and cost collection (Bertsekas et al., 1995; Altman et al., 1999). The notion of “Constantly Delayed Markov Decision Process” was formally introduced in  (Walsh et al., 2009). MDPs with constant delays have largely been addressed using augmented states in the literature (Bertsekas et al., 1995; Katsikopoulos and Engelbrecht, 2003). However, as the size of the augmented state grows exponentially with the extent of delay, the augmented approaches find little applications due to intractability. A naive approach is to include memory-less methods that exploit prior information on the extent of constant delay (Schuitema et al., 2010) without the need for state augmentation. Consequently, model-based approaches were introduced to predict the current state in a delayed environment. These include involving estimation of transition probabilities in a multi-step delayed framework (Walsh et al., 2007; Chen et al., 2020). However, model-based approaches are computationally expensive when compared with model-free approaches (Otto et al., 2013)

, let alone in the presence of delayed observations and actions. Fortunately, the recent advances in deep learning have facilitated practical implementation of augmented approaches as, unlike tabular methods, the complexity is only polynomial in the size of

(Ramstedt and Pal, 2019) recently formalized Real-Time Reinforcement Learning (RTRL), a deep-learning based framework that incorporates the effect of single-step action delay. For small delays, (Xiao et al., 2019) analyzes the influence of incorporating the action-selection time towards correcting the effect of constant delay in an MDP.  (Bouteiller et al., 2021) introduced partial trajectory sampling built on top of the Soft Actor Critic with augmented states in order to account for random delays in environments with significantly better performance. Most recently, (Dalal et al., 2021) also introduces use of forward models to improve performance for constant delays.

The stochastic delayed environment in this work is fundamentally different from the stochastic setting described in (Katsikopoulos and Engelbrecht, 2003). (Katsikopoulos and Engelbrecht, 2003) makes impractical assumptions on asynchronous cost collection where observations arrive only in order, and cost is calculated only after the observation. Moreover, the amount of delay between any consecutive steps is non-negative, and the overall episodic delay can grow in no time. A more recent work on decision-making in stochastic delayed environments (Bouteiller et al., 2021) proposes a different solution for solving random delay problems by using augmented states along with the delay values for both action and observation. However, the authors in (Bouteiller et al., 2021) leverage the values of action and observation delays at each step to convert off-policy sampled trajectories into on-policy ones. Consequently, their approach is not suited for optimal decision-making in delay-agnostic environments discussed in this work. Detailed analysis on the relevant related work along with their merits and demerits are discussed further in Section 5.

4. Markov Decision Processes with Delays

We now formalize the general setting of an MDP with constant and stochastic delays in terms of the set of information states , action space , probability of state-transition , discount-factor , one-step reward , delay in observation , and delay in application of action . It must be noted that from the point of view of the agent, action delay is functionally equivalent to observation delay (as shown later in Lemma 3). The effect of action (or observation) delay can better be understood through the following simple two-state MDP. The example below is adopted from (Dalal et al., 2021), however, (Dalal et al., 2021)

considers a simple Markov chain where state-transitions are independent of the agent’s actions. We suitably modify it to consider

controlled Markov chains or MDPs.

Figure 3. An illustrative example of how delay in action affects the optimal reward. Different colored arrows indicate state-transitions under different actions (Blue:0 and Brown:1).
Proposition 0 ().

For any , the optimal reward for the undelayed-MDP shown in Fig. 3 is greater than the optimal reward for the corresponding delayed version. Furthermore, if , i.e., in the limiting case of a deterministic MDP, the effect of delay becomes negligible. Thus, the example shown in Fig. 3 highlights the fundamental trade-off between stochasticity and delays.

Proof.

The proof uses the fact that the problem is symmetric in its states, and thus the optimal rewards from each state, as well as the corresponding optimal policies are also symmetric in the system states. We begin by making two key observations:
1. The problem is symmetric in the system states, i.e., the optimal reward . Without loss of generality, we only consider computation of in our analysis.
2. From symmetry, it can be further concluded that the optimal stationary policy and . It is later shown that the optimal policy necessitates .

Let us first consider the case with no delay, then by definition:

(1)

since, and . Furthermore, the transition probabilities are independent of time , and thus for all .

Similarly, let us consider the optimal reward, , under one-step delay, given by:

(2)

Note that from the problem description, and , thus, (4) reduces to

(3)

Again, since the transition probabilities are independent of and the optimal policies are stationary, for all .

In order to evaluate the optimal reward in (4), the corresponding two-step transition probabilities need to be computed. For brevity, the analysis below shows the computation of only the first term on the right-hand-side of (4), (two-step transition probability under action 0). Fig. 4 shows the Markov chain for the aforementioned two-step transition along with corresponding one-step transition probabilities. Under action 0, the probability that the system state transitions from to is (and with probability , the system remains in ). Likewise, the one-step transition probabilities are evaluated to reach the goal-state .

Figure 4. Illustration of 2-step transition probability from to under action 0.

From Fig. 4, it follows that:

(4)

The transition probability from to itself under action 1 can similarly be obtained, and it turns out to be the same as in (4) for the given problem parameters. Thus, from (4) and (4), the optimal return is obtained as:

The return is maximized for a suitable choice of optimal policy , leading to

(5)

Thus, from (4), it can be concluded that for any , the delayed reward is always smaller than or equal to the undelayed reward . If , the undelayed reward approaches the delayed reward with equality at , indicating the fundamental trade-off between stochasticity and delays. The above analysis can be easily extended to consider any amount of delay (and not necessarily unit step delay).

4.1. Constant Delay Markov Decision Process

Definition 0 ().

A Constant Delay Markov Decision Process (CDMDP), denoted by the tuple , augments an MDP with state-space . Note that a policy is defined as a mapping .

A CDMDP can inherit delays due to delay in both observation, as well as action. Intuitively, the two delays are functionally equivalent from the point of view of the agent. For instance, consider a scenario with no action delay, however, the observations are delayed by units of time. For an agent trying to select an optimal action at any time , the corresponding reward is available only at time , i.e., the action at time gets applied effectively at time , as also happens in the action delay case. We formalize this intuitive argument in the form of Lemma 3.

It is shown in (Altman and Nain, 1992) that CDMDP can be reduced to an equivalent MDP with and suitably modified reward structure at any time , given by for . Thus, from (Altman and Nain, 1992), the information necessary for optimal action selection at any instant is contained in . The augmented state will henceforth be referred to as information state.

If an action is selected at time , the information state transitions to with the probability . According to (Altman and Nain, 1992), the action results in a reward . Note that the reward structure depends on and not

, and thus evaluation of expected reward entails computation of the conditional probability distribution

. The complexity of this computation grows as , which is why model-based methods (Chen et al., 2020) are computationally prohibitive to work with if the dimension of state-space is large. Alternatively, one could try to reformulate CDMDP into another equivalent MDP with simplified reward structure such that while the two MDPs differ in their achievable optimal rewards, the underlying optimal policies are still the same (Katsikopoulos and Engelbrecht, 2003). Thus, one can simultaneously achieve both the accuracy of model-free methods, as well as sample-efficiency (characteristic of model-based methods). It must be remarked that despite the choice of reward-structure, augmentation is required to impart Markov property. The flavor of the simplified reward structure can be seen in the proof of Lemma 3, however, a detailed analysis is presented for the generalized setting of stochastic delay in Theorem 5.

Lemma 0 (Equivalence of action and observation delays).

The CDMDPs and are functionally equivalent to the MDP with appropriately chosen information state .

Proof.

The proof is based on the fact that conditioned on the appropriately chosen information state, the optimal decision making for the two CDMDPs have a similar structure. Let us first consider the case of constant observation delay. We denote this observation delayed MDP by the CDMDP . The information necessary for optimal action selection at any stage is given by . According to (Altman and Nain, 1992), for an arbitrary but fixed policy , the total expected reward is given by:

(6)

We consider another MDP with a modified cost-structure with the total expected reward given by:

(7)

which can be further expanded as:

(8)

The first term on the right-hand side is independent of the policy when conditioned on the information state . Therefore, it can be concluded that

i.e., the optimal policies for the two MDPs (with and without modified reward structure) are equivalent.

Next, we consider the CDMDP with action-delay. According to (Altman and Nain, 1992), this CDMDP can be reduced into an equivalent MDP with information state and the total expected reward given by:

(9)

Following our analysis for the observation-delayed MDP with modified cost structure, we can obtain a similar relationship between the expected reward for the modified and unmodified rewards, i.e.,

(10)

As before, the policy that maximizes , also maximizes the . From the point of view of an agent taking an action, the agent just needs to work with the suitably modified reward structure and the corresponding information state, independent of the nature of delay. Finally, if we consider the CDMDP , then for a decision maker, it suffices to consider an equivalent MDP with the information state . ∎

Remark 1 ().

Lemma 3 suggests that it suffices to consider only one form of delay. Consequently, the rest of the paper takes into account scenarios only with delayed observation, since action delays can be similarly analyzed. Nonetheless, we later present some experiments with action delays for the sake of completeness.

4.2. Stochastic Delay Markov Decision Process

We now discuss the MDP with stochastic delays. Unlike in a CDMDP, in a stochastic delay MDP (SDMDP), the number of time steps between two successive observations is random and not equal to unity. Consequently, it is possible to observe a later state before observing . This is in contrast to the SDMDP setting described in (Katsikopoulos and Engelbrecht, 2003), where it is assumed that can only be observed after the state has been observed.

For brevity, we only consider the scenario with random observation delays. The analysis for SDMDPs with action delays follows easily from Lemma 3. Below we formally define an SDMDP.

Definition 0 ().

A Stochastic Delay Markov Decision Process (SDMDP), denoted by the tuple , augments an MDP with state-space such that . A policy in SDMDP is defined as a mapping . Here represents no action and corresponds to the scenario when the MDP freezes for the agent.

Unlike with CDMDPs, in an SDMDP the information state must also include the time-instant at which the last observed system state was first observed. This is required in order to ensure that the reward corresponding to is collected only once at stage . We also assume that if the delay for a later state is smaller than the corresponding delay for by more than unity, the agent gets to observe the most recent state and the reward associated with it, along with the rewards associated with and other previously unobserved states.

Another subtlety with SDMDPs is the size of information state . Let be the most recent observation first observed at instant . Then for an agent trying to make an optimal decision, the information necessary at time is . If the state becomes observable at the next instant, the system transitions to with . However, if the agent does not receive any new observation at time , the system transitions to with . Thus, the size of the information state may vary during the evolution of an SDMDP. We account for this inconsistency by adding a ‘no action’ to the action space and keeping the length of the information state at each stage to with the understanding that the maximum allowable delay is . In scenarios where the delay amount exceeds , it is assumed that the MDP freezes from the perspective of the agent, i.e., the agent does not take any new actions till the most recent state becomes observable. Thus at each instant, the information state is given by  - tuple , where

denotes the ‘no action’. In practice, ‘no action’ can be regarded as equivalent of zero-padding when implementing deep-RL algorithms. Following the similar arguments as in

(Altman and Nain, 1992; Bertsekas et al., 1995), it can be shown that SDMDP can be reduced to an undelayed MDP with . The total expected reward for any arbitrary but fixed policy is thus given by

(11)

As with CDMDPs, the reward structure in MDP in (11) necessitates computations which can be computationally prohibitive. The theorem below simplifies this cost structure by considering another MDP, whose optimal policy coincides with the optimal policy of the MDP in (11).

Theorem 5 (Equivalence of SDMDP and undelayed MDP).

The SDMDP can be reduced into an equivalent undelayed MDP with simplified cost structure.

Proof.

Note that the set of instants at which the states are first observed do not explicitly enter into the accompanying technical arguments, however, it is needed to ensure that the reward collections occur only once per state. The proof of Theorem 5 is similar to the derivation in Lemma 3. We consider an equivalent MDP where reward collection occurs as prescribed in Section 4.2, i.e., if the delay at instant is , then the corresponding reward gets discounted by . Let

be the observation delay random variable, i.e,

takes values in the set . Finally, let be the last observable state that was first observed at stage . The information state is given by . The total expected reward for the MDP with modified reward structure is given by:

(12)

The second term on the right hand side of (4.2) conditioned on can be re-written as:

Also, the first term on the right hand side of (4.2) is independent of conditioned on . Also, since the policy fixed and independent of the delay process, it can be concluded that

which completes the proof. ∎

5. Delay-Resolved Q-learning

(a)
(b)
Figure 5. Using Neural Networks to tackle the problem exploding state space: (a) shows the performance with Tabular Q-learning, whereas (b) uses DQN.

The basic essence of the Delay Resolved algorithms is the conversion of non-Markov problems with both constant and stochastic delays (in actions and observations) into Markov problems, CDMDPs and SDMDPs respectively. Once MDPs have been formulated, we can apply any suitable reinforcement learning algorithm for both policy evaluation and improvement.

The primary difference between Delay Resolved Q-learning and vanilla DQN ((Mnih et al., 2015)) is the state-augmentation, i.e., instead of the state just being the current observation, there is an added history of pending actions which are stored in the action buffer to make up an augmented state called information state. For constant delays, the dimension of the information state, at time t, is constant. However, for stochastic delays, depends on the value of delay. As explained in Section 4.2, we assign a maximum allowable dimension for and fill up the action buffer with ‘no-action’. Intuitively, it makes sense to choose the length of the action buffer as the maximum possible delay-value if known a priori, however, compute requirements may impose additional restrictions on its length.

Comparison with Related Approaches

Though work on delayed RL environments is not very extensive, some approaches have been proposed to address the problem of decision making in presence of delays. Model-based RL approaches (Walsh et al., 2007; Chen et al., 2020), which involve learning the underlying undelayed transition probabilities can potentially be very useful if the dynamics are learned accurately. However, the learned models are often inaccurate, particularly in the presence of large or stochastic delays. Additionally, learning the state-transition probability is computationally expensive , which is precisely why model-based methods are not very popular in practice. Another recent approach (Agarwal and Aggarwal, 2021)

used an expectation maximization algorithm where the agent takes decisions based on the expected next state, however it also requires estimating transition probabilities. Consequently,

(Schuitema et al., 2010) proposes a model-free approach that does not use augmented information states, instead, proposes to build an action history. Their scheme uses the effective action from the action history, typically the last action of the action buffer, during the Q-learning update. Here, the effective action is the action that gets applied on the current observed state and makes the update more accurate than just incorporating the current pending action. However, their approach works only with action delays. Moreover, their framework is not delay-agnostic, i.e., the delay value must be known a priori.

Most recently, (Dalal et al., 2021) addresses the CDMDP by using forward models for predicting the current undelayed state and selecting actions based on that predicted state. Though their algorithm shows promising results, it still is computationally more extensive than the normal model-free methods, because additional compute is required for learning the forward models. Learning good forward models is an essential aspect of this algorithm, which is generally not an easy task for all environments. Particularly with large delays, the forward model needs to be applied multiple times, which can lead to compounding errors. With stochastic delays, the task becomes even more difficult. Additionally, for best results, ideally the delay value will be necessary both during inference as well as training. This method also requires additional storage in the form of storing a sample buffer which is basically the trajectory of the pending actions along with their states and rewards. While, the delay resolved algorithm proposed in this paper requires additional memory for the action buffer, it is much easier to store lower-dimensional actions than the potentially higher-dimensional trajectories.

(Bouteiller et al., 2021) proposes a different solution for solving random delay problems by using augmented states along with the delay values for both action and observation. Their algorithm leverages the values of action and observation delays at each step to convert off-policy sampled trajectories into on-policy ones. Their algorithm, known as Delay Correcting Actor Critic (DCAC), performs reasonably well, however, at the expense of requiring complete information of the delay values at each step needed for re-sampling. On the contrary, our DRDQN algorithm is designed to work in environments that do not leverage the additional information on delay values at each step. Consequently, DCAC is not suited for optimal decision-making in delay-agnostic environments discussed in this work.

6. Experimental Results

(a)
(b)
(c)
Figure 7. Comparison with Baselines on Gym environments for constant and stochastic observation delays. DRDQN learns the optimal policy both for constant and stochastic delays.
(a)
(b)
(c)
Figure 8. Comparison with Baselines on Gym environments for constant action delays. In (a), both Delayed((Dalal et al., 2021)) and DRDQN have similar performance, whereas in (b) and (c) due to inaccurate forward models, Delayed DQN does not perform as well.
(a)
(b)
(c)
Figure 9. (a) and (b) show runtimes for DRDQN and Delayed DQN((Dalal et al., 2021)); (c) is asymptotic performance (average over last 10000 episodes)

We tested our algorithm on the delayed version of four standard environments, W-Maze, Acrobot, MountainCar and CartPole (details in 6.1) with both action and observation delays. For all our experiments, we used Deep Q-Networks ((Mnih et al., 2015)) as the base RL algorithm111Code available at: https://github.com/baranwa2/DelayResolvedRL. Since, the primary modification is in how the state is structured and not on how the samples are collected and trained with, Delay Resolved algorithms can be used with any RL algorithm.

6.1. Environments

We first describe the various environment dynamics along with the corresponding reward structures below.

W-Maze: W-Maze is a 7x11 grid world as shown in Fig. 6. The agent starts randomly at the bottom row and the objective is to reach the GOAL states through available actions (UP, DOWN, LEFT, RIGHT). The agent receives a reward of +10 if it reaches the GOAL state and -1 for every other step not concluding at the GOAL state.

Figure 6. W-Maze ((Schuitema et al., 2010))

CartPole-v0: In CartPole-v0, the agent aims to balance a pole on a cart by moving the cart either left or right. The agent receives a reward of +1 for every step and 0 when the episode ends, which happens when the agent either falls down or stays balanced for 200 time-steps. Thus the optimal reward for this environment is ~200. Note that in the reshaped version of CartPole-v0, originally used in  (Dalal et al., 2021), the agent gets a reward based on the current value of its position and velocity and this reward grows as the pole becomes upright.

Acrobot-v1: Acrobot-v1 is a complex version of CartPole and comprises of two links and two independent actuators. In essence, Acrobot represents a double pendulum with an aim to swing the end of the lower-link up to a desired height. The actions comprise of torques generated by the actuator. The agent receives a step-reward of -1 and 0 once the episode ends. The optimal reward is ~-100.

MountainCar-v0: In MountainCar-v0, the agent has to drive a vehicle uphill, however, the steepness of the slope is just enough to prevent it from driving up. Thus, he has to learn to drive backwards and build up momentum to reach the hilltop. The agent receives a step-reward of -1 and 0 on episode termination, which corresponds to him reaching the hilltop or 200 time-steps been elapsed.

6.2. Exploding State Space and its Solution

One of the biggest disadvantages of the algorithm as mentioned in  (Dalal et al., 2021) is that the information state space scales exponentially with delay value, . In tabular settings this becomes extremely problematic and infeasible to work with, specifically with large delays. However, if we use neural networks as function approximators, it is possible to include the action buffer as an input concatenated with the state space, and thus the size of the input space becomes , which is easily manageable.

Fig 5

represents the average reward value of the last 10000 episodes along with their standard errors for 10 runs on the W-Maze environment 

(Schuitema et al., 2010) across various action delay values. From Fig. 5a, it is evident that for smaller delays, delay resolved tabular Q-learning is able to achieve optimal reward (~0) for smaller delays. However, for large delays, they are not only computationally more expensive but also suffer from poor exploration. This is primarily due to exponential increase in the state-space resulting in poorer performance. The performance of the delay agent (Schuitema et al., 2010) is also depicted for the sake of completeness. This agent performs relatively well for smaller delays.

However, as shown in Fig. 5b, neural networks as function approximator can handle the action buffer as merely an input, which makes the information state space much smaller and thus leads to optimal performance across larger delays as well. Interestingly, the delay-DQN algorithm proposed in (Schuitema et al., 2010), performs reasonably well since we have a much powerful function approximation tool for predicting the value function. Although this is a useful trick, there are no theoretical guarantees on convergence to the optimal policy.

6.3. Results with Constant and Stochastic Observation Delays

A big advantage of Delay Resolved algorithms is that they can be used both for stochastic and constant delays, without requiring the delay input at each step as is needed for (Bouteiller et al., 2021). Fig. 7 highlights the performance of the proposed Delay Resolved DQN across three Gym environments. Each point is an average over the entire 1 Million steps of training across 10 runs (along with standard error bars), for constant as well as stochastic delays. The last column represents the stochastic delay uniformly selected between 0 and 10, with the maximum allowable dimension of the information state being 10. Across all the environments, for both stochastic and constant delays, the DRDQN algorithm is able to converge to optimal policy as reflected in its near optimal rewards.

6.4. Results with Constant Action Delays

Fig. 8 portrays the average reward across the entire training regime (1 Million steps across 10 runs with the standard error bars), for various action delays in three different Open AI Gym (Brockman et al., 2016) environments. For comparison, the average values for vanilla DQN (Mnih et al., 2015) and Delayed DQN (Dalal et al., 2021) are exhibited alongside. As mentioned earlier, the delayed RL algorithm (Dalal et al., 2021) is heavily influenced by the architecture of the forward model. For consistency, we used the same architecture as was used in the original paper (Dalal et al., 2021) for the primary DQN agent. However, as seen in Figs. 8b and 8

c, it does not always work out well, and failing to learn an accurate forward model leads to worse performance. So, it is extremely critical to choose the correct forward model architecture and this will vary based on the environment, which can be very difficult to gauge ahead of training. Delay Resolved algorithms, on the other hand, do not involve learning any models and hence do not require any additional hyperparameter search. It does relatively well across all the domains for all the delay values.

Additionally, in Fig. 8a, all the above algorithms perform relatively well getting to optimal rewards for large delays, however, the time taken is much longer for Delayed DQN (Dalal et al., 2021) because of the additional training of the forward model. Fig. 9a depicts the time-taken for all the algorithms across all 10 runs. It can be clearly seen that Delay Resolved DQN is computationally very light and almost takes the same time as the vanilla DQN, whereas the Delayed DQN takes substantially longer per run for convergence.

Note: The points on each of the plots are an average over one million frames across 10 runs. Thus, the drop in performance for larger delays in Figs. 7 and 8 are due to the fact that the agent takes a longer time to reach the optimal performance which is reflected in the lower average score.

6.5. Equivalence of Action and Observation Delays

Lemma 3 establishes the equivalence of action and observation delays. For the sake of completeness, we plot the rewards obtained by DRDQN for the last 1000 episodes on Acrobot for both action and observation delays in Fig. 10. It is evident that the DRDQN algorithm converges to similar values for action and observation delays across different values of delays.

Figure 10. Equivalence of Action and Observation Delays

6.6. Compute Comparison

The only modification of Delay Resolved algorithms is in the addition of action buffer in the state space. Although, this addition can lead to exponential increase of compute in tabular settings, when added as an input to a neural network value function approximator, the additional computational overhead is very minimal as can be seen in Figs. 9a and b. With the increase in delay, the run-time increases dramatically for Delayed method  (Dalal et al., 2021) as the agent has to first compute the forward model and then apply the forward model multiple times in order to predict the current state. The environment used for Figs. 9b and c is the same as CartPole-v0 except with reshaped rewards as used in (Dalal et al., 2021) (details in Section 6.1). We see that if appropriate forward models also reach optimality if appropriate architecture is chosen, but the time required is larger (Fig. 9b).

7. Conclusion and Future Work

In this work, we have addressed a very crucial assumption in Reinforcement Learning, which is the synchronization between the environment and the agent. In real world scenarios, we cannot expect this assumption to hold true and thus we would need to modify our RL algorithms accordingly. This paper revisits the idea of using augmented states as a solution for delayed RL problems. We formally define both constant and stochastic delays and provide theoretical analysis on why the reformulation still converges to the same optimal policy. Additionally, we adapt state augmentation methods to constant and stochastic delay problems, to formulate a class of Delay Resolved RL algorithms, which can perform well on both constant and random delay tasks.

For future work, we would like to extend this algorithm to real-world problems including accounting for actuator and motor delays in Robotics, and shipping delay in Supply Chains. Delay Resolved algorithms could prove to be a founding block for practical real-time RL algorithms. For example, without relaxing the assumption of knowing the delay value at every time-step, Delay Resolved DQN can still be improved upon by using better sampling strategies from the experience replay buffer.

References

  • M. Agarwal and V. Aggarwal (2021) Blind decision making: reinforcement learning with delayed observations. Proceedings of the International Conference on Automated Planning and Scheduling 31 (1), pp. 2–6. External Links: Link Cited by: §5.
  • E. Altman, T. Başar, and R. Srikant (1999) Congestion control as a stochastic control problem with action delays. Automatica 35 (12), pp. 1937–1950. Cited by: §1, §3.
  • E. Altman and P. Nain (1992) Closed-loop control with delayed information. ACM sigmetrics performance evaluation review 20 (1), pp. 193–204. Cited by: §1, §1, §3, §4.1, §4.1, §4.1, §4.1, §4.2.
  • J. Bander and C. White III (1999) Markov decision processes with noise-corrupted and delayed state observations. Journal of the Operational Research Society 50 (6), pp. 660–668. Cited by: §3.
  • D. P. Bertsekas, D. P. Bertsekas, D. P. Bertsekas, and D. P. Bertsekas (1995) Dynamic programming and optimal control. Vol. 1, Athena scientific Belmont, MA. Cited by: §3, §4.2.
  • Y. Bouteiller, S. Ramstedt, G. Beltrame, C. Pal, and J. Binas (2021) Reinforcement learning with random delays. In International Conference on Learning Representations, External Links: Link Cited by: §1, §3, §3, §5, §6.3.
  • G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba (2016) OpenAI gym. External Links: arXiv:1606.01540 Cited by: §6.4.
  • D. Brooks and C. T. Leondes (1972) Markov decision processes with state-information lag. Operations Research 20 (4), pp. 904–907. Cited by: §3.
  • B. Chen, M. Xu, L. Li, and D. Zhao (2020) Delay-aware model-based reinforcement learning for continuous control. arXiv preprint arXiv:2005.05440. Cited by: §3, §4.1, §5.
  • G. Dalal, E. Derman, and S. Mannor (2021) Acting in delayed environments with non-stationary markov policies. In International Conference on Learning Representations, External Links: Link Cited by: §3, §4, §5, Figure 8, Figure 9, §6.1, §6.2, §6.4, §6.4, §6.6.
  • T. Dollevoet, D. Huisman, M. Schmidt, and A. Schöbel (2018) Delay propagation and delay management in transportation networks. In Handbook of Optimization in the Railway Industry, pp. 285–317. Cited by: §1.
  • E. Gibney (2016) Google ai algorithm masters ancient game of go. Nature News 529 (7587), pp. 445. Cited by: §1.
  • R. Hannah and W. Yin (2018) On unbounded delays in asynchronous parallel fixed-point algorithms. Journal of Scientific Computing 76 (1), pp. 299–326. Cited by: §1.
  • P. Henderson, R. Islam, P. Bachman, J. Pineau, D. Precup, and D. Meger (2018) Deep reinforcement learning that matters. In

    Proceedings of the AAAI Conference on Artificial Intelligence

    ,
    Vol. 32. Cited by: §1.
  • T. Imaida, Y. Yokokohji, T. Doi, M. Oda, and T. Yoshikawa (2004) Ground-space bilateral teleoperation of ets-vii robot arm by direct bilateral coupling under 7-s time delay condition. IEEE Transactions on Robotics and Automation 20 (3), pp. 499–511. Cited by: §1.
  • M. Jin, S. H. Kang, and P. H. Chang (2008) Robust compliant motion control of robot with nonlinear friction using time-delay estimation. IEEE Transactions on Industrial Electronics 55 (1), pp. 258–269. Cited by: §1.
  • K. V. Katsikopoulos and S. E. Engelbrecht (2003) Markov decision processes with delays and asynchronous cost collection. IEEE transactions on automatic control 48 (4), pp. 568–574. Cited by: §1, §1, §3, §3, §4.1, §4.2.
  • S. H. Kim (1985) State information lag markov decision process with control limit rule. Naval research logistics quarterly 32 (3), pp. 491–496. Cited by: §3.
  • S. H. Kim and B. H. Jeong (1987) A partially observable markov decision process with lagged information. Journal of the Operational Research Society 38 (5), pp. 439–446. Cited by: §3.
  • D. B. Larremore, B. Wilder, E. Lester, S. Shehata, J. M. Burke, J. A. Hay, M. Tambe, M. J. Mina, and R. Parker (2020) Test sensitivity is secondary to frequency and turnaround time for covid-19 surveillance. MedRxiv. Cited by: §1.
  • Q. Li, X. Guan, P. Wu, X. Wang, L. Zhou, Y. Tong, R. Ren, K. S. Leung, E. H. Lau, J. Y. Wong, et al. (2020) Early transmission dynamics in wuhan, china, of novel coronavirus–infected pneumonia. New England Journal of Medicine. Cited by: §2.
  • H. Logemann (1998) Destabilizing effects of small time delays on feedback-controlled descriptor systems. Linear Algebra and its Applications 272 (1-3), pp. 131–153. Cited by: §1.
  • V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. (2015) Human-level control through deep reinforcement learning. nature 518 (7540), pp. 529–533. Cited by: §1, §5, §6.4, §6.
  • A. R. Otto, S. J. Gershman, A. B. Markman, and N. D. Daw (2013) The curse of planning: dissecting multiple reinforcement-learning systems by taxing the central executive. Psychological science 24 (5), pp. 751–761. Cited by: §3.
  • K. Patanarapeelert, T. Frank, R. Friedrich, P. Beek, and I. Tang (2006) Theoretical analysis of destabilization resonances in time-delayed stochastic second-order dynamical systems and some implications for human motor control. Physical Review E 73 (2), pp. 021901. Cited by: §1.
  • S. Ramstedt and C. Pal (2019) Real-time reinforcement learning. In Advances in neural information processing systems, pp. 3073–3082. Cited by: §3.
  • X. Rong, L. Yang, H. Chu, and M. Fan (2020) Effect of delay in diagnosis on transmission of covid-19. Mathematical Biosciences and Engineering 17 (3), pp. 2725–2740. Cited by: §1.
  • E. Schuitema, L. Buşoniu, R. Babuška, and P. Jonker (2010) Control delay in reinforcement learning for real-time dynamic systems: a memoryless approach. In 2010 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 3226–3231. Cited by: §3, §5, Figure 6, §6.2, §6.2.
  • J. Sciré, S. A. Nadeau, T. G. Vaughan, B. Gavin, S. Fuchs, J. Sommer, K. N. Koch, R. Misteli, L. Mundorff, T. Götz, et al. (2020) Reproductive number of the covid-19 epidemic in switzerland with a focus on the cantons of basel-stadt and basel-landschaft. Swiss Medical Weekly 150 (19-20), pp. w20271. Cited by: Figure 2.
  • D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al. (2016) Mastering the game of go with deep neural networks and tree search. nature 529 (7587), pp. 484–489. Cited by: §1.
  • S. P. Singh, T. Jaakkola, and M. I. Jordan (1994) Learning without state-estimation in partially observable markovian decision processes. In Machine Learning Proceedings 1994, pp. 284–292. Cited by: §1.
  • E. J. Sondik (1978) The optimal control of partially observable markov processes over the infinite horizon: discounted costs. Operations research 26 (2), pp. 282–304. Cited by: §3.
  • T. J. Walsh, A. Nouri, L. Li, and M. L. Littman (2009) Learning and planning in environments with delayed feedback. Autonomous Agents and Multi-Agent Systems 18 (1), pp. 83. Cited by: §1, §3.
  • T. J. Walsh, A. Nouri, L. Li, and M. L. Littman (2007) Planning and learning in environments with delayed feedback. pp. 442–453. External Links: ISBN 978-3-540-74958-5 Cited by: §3, §5.
  • C. C. White (1988) Note on “a partially observable markov decision process with lagged information”. Journal of the Operational Research Society 39 (2), pp. 217–217. Cited by: §3.
  • G. Whittred and I. Zimmer (1984) Timeliness of financial reporting and financial distress. Accounting Review, pp. 287–295. Cited by: §1.
  • T. Xiao, E. Jang, D. Kalashnikov, S. Levine, J. Ibarz, K. Hausman, and A. Herzog (2019) Thinking while moving: deep reinforcement learning with concurrent control. In International Conference on Learning Representations, Cited by: §3.
  • L. Zelevinsky (1998) Does time-optimal control of a stochastic system with sensory delay produce movement units. Master’s thesis, University of Massachusetts, Amherst. Cited by: §1.