Deep Q-learning: a robust control approach

Balázs Varga et al.
Chalmers University of Technology

In this paper, we place deep Q-learning into a control-oriented perspective and study its learning dynamics with well-established techniques from robust control. We formulate an uncertain linear time-invariant model by means of the neural tangent kernel to describe learning. We show the instability of learning and analyze the agent's behavior in the frequency domain. Then, we ensure convergence via robust controllers acting as dynamical rewards in the loss function. We synthesize three controllers: state-feedback gain scheduling ℋ_2, dynamic ℋ_∞, and constant gain ℋ_∞ controllers. Setting up the learning agent with a control-oriented tuning methodology is more transparent and has well-established literature compared to the heuristics in reinforcement learning. In addition, our approach does not use a target network and randomized replay memory. The role of the target network is overtaken by the control input, which also exploits the temporal dependency of samples (as opposed to a randomized memory buffer). Numerical simulations in different OpenAI Gym environments suggest that the ℋ_∞ controlled learning performs slightly better than Double deep Q-learning.








1 Introduction

In the past decade, the success of neural networks (NNs) has led to significant uptake of machine learning (ML) methods in various areas of science and real-world applications. On the other hand, working with large data sets, the black-box nature of some problems, and the complex structure of function approximators often hamper in-depth human understanding of such methods. Consequently, efforts have been made to improve the transparency of machine learning both in terms of training an ML model and the results produced by the trained model adadi2018peeking, roscher2020explainable.

Although machine learning-based controllers are gaining popularity and often outperform classical control, especially in highly nonlinear environments, their stability and performance are seldom guaranteed analytically hoel2018automated, zhou2021model. Similarly, making such heuristic learning algorithms converge requires tweaking and experimenting. Control theory has a well-established and mathematically sound toolkit to analyze dynamical systems and synthesize stabilizing, robust controllers skogestad2007multivariable. bradtke1994adaptive shows that dynamic programming based reinforcement learning (RL) (Q-learning) converges to an optimal linear quadratic (LQ) regulator if the environment is a linear system. On the other hand, RL shines in complex environments where formulating a closed-form solution is impossible. Several works deal with synergized model-based and data driven controllers to improve the performance of the controlled process kretchmar2001robust, hegedus2020handling or analyze learned controllers with tools from control donti2020enforcing, perrusquia2020robust, liu2020h. Meanwhile, control theory is seldom utilized to enhance the agent’s training performance.

One branch of ML is reinforcement learning, where, in the absence of labeled data, the agent learns in a trial-and-error way, interacting with its environment. The learning agent faces a sequential decision problem and receives feedback as a performance measure sutton2018reinforcement. This interaction is commonly depicted as in Figure 1.

Figure 1: Agent–environment interaction in reinforcement learning

This sequential decision problem can be described with a (discrete) Markov Decision Process (MDP) characterized by the 5-tuple (S, A, P, R, γ), where S ⊆ ℝⁿ is the continuous state space with n dimensions, A is the finite, discrete action space, P is the transition probability matrix, R is the reward accumulated by the agent, and γ ∈ (0, 1) is the discount factor. The agent traverses the MDP following policy π with discrete time step t. Reinforcement learning methods compute a mapping from the set of states of the environment to the set of possible actions in order to maximize the expected discounted cumulative reward.

One common way to tackle an RL problem is Q-learning. Here, the aim is learning the state-action-value (or Q) function, the measure of the overall expected reward for taking action a at state s:

Q(s, a) = E[ Σₜ γᵗ rₜ | s₀ = s, a₀ = a ],

with rₜ being the immediate reward. In Q-learning the states and actions are discretized and can have huge cardinality; thus, it suffers from the curse of dimensionality. Deep Q-learning alleviates this problem by approximating the Q-function with a neural network (deep Q-network, DQN) parametrized by θ. The Q-function takes an n-dimensional environment state s and evaluates the Q-value Q(s, a; θ) for every action a. Then, the policy selects the action corresponding to the largest Q-value in an ε-greedy way. Deep Q-learning learns by minimizing the temporal difference (1-step estimation error) at time t, following the quadratic loss function (mean squared Bellman residual baird1995residual):

L(θ) = ½ ( r + γ maxₐ′ Q(s′, a′; θ) − Q(s, a; θ) )²,

with s′ being the next state. Then, with learning rate α, the weights of the neural network are updated via gradient descent as

θ ← θ − α ∇_θ L(θ).

The gradient (assuming the Q-function is differentiable) determines the "direction" in which this update is performed. Observe that the target value r + γ maxₐ′ Q(s′, a′; θ) also depends on θ. Thus, the correct gradient would include the derivative of the target as well. On the other hand, mainstream Q-learning algorithms perform the TD update with the target treated as a constant (the semi-gradient), resulting in faster and more stable algorithms baird1995residual. In the sequel, we will adhere to this more common approach.
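As a concrete illustration, the semi-gradient update described above can be sketched with a linear-in-features Q-function standing in for the deep Q-network (all names and sizes here are our illustrative choices, not the paper's):

```python
import numpy as np

# Minimal sketch of the semi-gradient TD update, using a linear
# Q-function Q(s, a) = w[a] @ s in place of a deep network.
rng = np.random.default_rng(0)
n_features, n_actions = 4, 2
w = rng.normal(size=(n_actions, n_features))
alpha, gamma = 0.1, 0.99

def q_values(s):
    return w @ s  # one Q-value per action

def td_update(s, a, r, s_next):
    """One semi-gradient step: the target r + gamma * max_a' Q(s', a')
    is treated as a constant, so only Q(s, a) is differentiated."""
    target = r + gamma * np.max(q_values(s_next))
    delta = target - q_values(s)[a]   # temporal difference
    w[a] += alpha * delta * s         # grad of Q(s, a) w.r.t. w[a] is s
    return delta

s, s_next = rng.normal(size=n_features), rng.normal(size=n_features)
before = q_values(s)[0]
delta = td_update(s, 0, r=1.0, s_next=s_next)
after = q_values(s)[0]
```

The update moves Q(s, a) toward the frozen target by exactly α·δ·‖s‖² for this linear parametrization.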

Deep Q-learning in its pure form often shows divergent behavior under function approximation sutton2018reinforcement, van2018deep. It has no known convergence guarantees, except for some related algorithms where convergence results have been obtained fan2020theoretical. Two major ideas have been developed to improve (but not guarantee) its convergence: using a target network (Double deep Q-learning) and employing experience replay mnih2015human. In Double deep Q-learning, the target network is an exact copy of the actual network, but it is updated less frequently. Freezing the target network prevents the target value from changing faster than the actual Q-value during learning. Intuitively, learning can become unstable and lose convergence if the target changes faster than the actual value. With experience replay, a memory buffer is introduced. Samples are drawn randomly from this buffer, thus minimizing the correlation between samples observed in trajectory-based learning and enabling the use of supervised learning techniques that assume sample independence.
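The two heuristics above can be sketched in a few lines (a hedged, minimal mock-up with our own variable names; a real DQN would store network weights and gradient steps instead of the scalar stand-ins used here):

```python
import random
from collections import deque
import numpy as np

# Sketch of the two stabilizing heuristics: a replay buffer that breaks
# sample correlation, and a target network synchronized only every
# `sync_every` steps so the TD target moves slowly.
buffer = deque(maxlen=10_000)
w_online = np.zeros(4)        # stand-in for the online network weights
w_target = w_online.copy()    # frozen copy used to compute targets
sync_every = 100

def store(transition):
    buffer.append(transition)  # (s, a, r, s_next)

def sample_batch(k):
    return random.sample(buffer, k)  # uniform draw decorrelates samples

for step in range(250):
    store((step, 0, 1.0, step + 1))
    w_online += 0.01           # pretend learning changes the online net
    if step % sync_every == 0:
        w_target = w_online.copy()  # infrequent target synchronization

batch = sample_batch(8)
```

Between synchronizations the target weights lag behind the online weights, which is exactly the slow-target effect described above.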


Some recent advances in DQN modify the temporal difference target in order to achieve better convergence results, e.g. pohlen2018observe, durugkar2018td. achiam2019towards aims at characterizing divergence in deep Q-learning with the help of the recently introduced neural tangent kernel (NTK, jacot2018neural). In addition, they propose an algorithm that scales the learning rate to ensure convergence. In ohnishi2019constrained an additive regularization term is used to constrain the loss and enhance convergence.

This work aims at constructing a bridge between robust control theory and reinforcement learning. To this end, we borrow techniques from robust control to compensate for the non-convergent behaviour of deep Q-learning via cascade control. First, we embed learning into a state-space framework as an uncertain, linear, time-invariant (LTI) system through the NTK. Based on the dynamical system description, convergence (or stability) can be concluded in a straightforward way. As opposed to achiam2019towards, stability is ensured by modifying the temporal difference term via robust stabilizing controllers. We synthesize and benchmark three controllers: state-feedback gain scheduling ℋ_2, dynamic ℋ_∞, and constant gain ℋ_∞ controllers. The primary motivation for robust control is that it is capable of taking into account the uncertain nature of a reinforcement learning problem. In addition, we do not have to recompute the NTK in every step; we can include its variation as a parametric uncertainty in the controller design. This yields a computationally more efficient methodology than the one proposed in achiam2019towards. Our control-oriented approach makes parameter tuning more straightforward and transparent (i.e., involving fewer heuristics). The two aforementioned common heuristics of deep Q-learning (target network and random experience replay, carvalho2020new) are not needed. Instead, the temporal dependency of samples is exploited through the dynamical system formulation. Robust control can support the learning process, making it more explainable. Results suggest that robust controlled learning performs on par with DDQN in the benchmark environments.

The paper is organized as follows. First, Section 2 formulates the dynamics of Q-learning as an uncertain LTI system using the NTK. Then, based on the formulated model, three controllers are designed: Section 3.1 formulates an ℋ_2 state feedback control; in Section 3.2 a dynamical ℋ_∞ controller is synthesized in the frequency domain; then, in Section 3.3 the controller design is adjusted to yield an ℋ_∞ controller with constant gains. The proposed controlled learning approaches are thoroughly analyzed and compared in three challenging OpenAI Gym environments: Cartpole, Acrobot, and Mountain Car (Section 4). Finally, Section 5 concludes the findings of this paper.

2 Control-oriented modeling of deep Q-learning

In this section, we present how to translate deep Q-learning into a dynamical system. In order to formulate the model, first, we introduce the NTK alongside some of its relevant properties.

Neural Tangent Kernel jacot2018neural. Given data x, x′, the NTK of an input–output artificial neural network f(x; θ), parametrized with θ, is

Θ(x, x′) = ∇_θ f(x; θ)ᵀ ∇_θ f(x′; θ).
Multiple outputs. From the NTK perspective, a neural network with multiple outputs behaves asymptotically like an ensemble of networks with scalar outputs, trained independently.

Constant kernel. Although Θ(x, x′) changes during training, in the infinite-width limit the NTK converges to an explicit constant kernel. It depends only on the depth, activation function, and parameter initialization variance of an NN. In other words, during training, Θ(x, x′) is independent of time t.

Linear dynamics. In the infinite-width limit, an NN can be well described throughout training by its first-order Taylor expansion (i.e., linear dynamics) around its parameters at initialization (θ₀), assuming a low learning rate Lee2020Wide:

f(x; θ) ≈ f(x; θ₀) + ∇_θ f(x; θ₀)ᵀ (θ − θ₀).

Gradient flow. The NTK describes the evolution of neural networks under gradient descent in function space. Under gradient flow (continuous learning with an infinitely low learning rate via gradient descent), the weight update is given as

θ̇ = −α ∇_θ L(θ),

with L an at least once continuously differentiable (w.r.t. θ) arbitrary loss function. Then, with the help of the chain rule and gradient flow, the learning dynamics in Eq. (5) become

ḟ(x; θ) = ∇_θ f(x; θ)ᵀ θ̇ = −α Θ(x, x′) ∇_f L.

In light of the above properties of the NTK, we restrict ourselves to shallow and wide neural networks for approximating the Q-function.
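The kernel defined above can be computed empirically for a finite network. A minimal sketch for a one-hidden-layer tanh network (width, scaling, and all names are our illustrative choices):

```python
import numpy as np

# Empirical NTK of f(x) = v @ tanh(W x) / sqrt(m):
# Theta(x, x') = grad_theta f(x) . grad_theta f(x').
rng = np.random.default_rng(1)
d, m = 3, 4096                     # input dim, hidden width
W = rng.normal(size=(m, d))
v = rng.normal(size=m)

def grad_f(x):
    """Flattened gradient of f w.r.t. all parameters (W, v)."""
    h = np.tanh(W @ x)
    dv = h / np.sqrt(m)
    dW = ((v * (1 - h**2))[:, None] * x[None, :]) / np.sqrt(m)
    return np.concatenate([dW.ravel(), dv])

def ntk(x1, x2):
    return grad_f(x1) @ grad_f(x2)

x = rng.normal(size=d)
x2 = rng.normal(size=d)
k_xx = ntk(x, x)     # diagonal entry: sum of squared partials, >= 0
k12, k21 = ntk(x, x2), ntk(x2, x)  # the kernel is symmetric
```

The non-negativity of the diagonal entry is the property invoked later in the stability analysis.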

Assuming q = Q(s, a; θ) is the actual Q-value and q′ = Q(s′, a′; θ) is the next Q-value, with x = [q, q′]ᵀ as the system states, Q-learning can be modeled with uncertain, continuous, linear time-invariant dynamics. Let the NTK of the deep Q-network be evaluated at the current state of the environment for the output corresponding to action a, and denote it as k = Θ((s, a), (s, a)). Similarly, denote the kernel evaluated at the next state–action pair as k′ = Θ((s′, a′), (s, a)). The role of k and k′ is to characterize how q and q′ will evolve during learning according to Remark 4. In addition, denote the bounded uncertainty block encompassing unmodeled learning behaviour by Δ, ‖Δ‖_∞ ≤ 1, where ‖·‖_∞ denotes the infinity norm. Dynamics of deep Q-learning. Deep Q-learning can be modeled as a continuous-time, linear time-invariant system with output multiplicative uncertainty with the help of the NTK as


The proof consists of three parts. First, learning dynamics are formulated for fixed state-action values, and the appearance of the NTK is shown. Then, results are cast into a state-space form for selected Q-values. Finally, the necessity of the uncertainty block and its components are discussed.

Part 1: Learning dynamics. In order to describe the learning as a dynamical system, first, we transform the weight update with quadratic loss (Eq. (3)) into continuous time (gradient flow) with learning rate α:

θ̇ = α ( r + γ q′ − q ) ∇_θ Q(s, a; θ).

We can write the Q-value evolution at state s for action a (q = Q(s, a; θ)) with the help of the chain rule as

q̇ = ∇_θ Q(s, a; θ)ᵀ θ̇ = α ( r + γ q′ − q ) Θ((s, a), (s, a)) = α ( r + γ q′ − q ) k.

Here the term Θ((s, a), (s, a)) is the NTK evaluated at s for action a. Note that, in this setting, the scalar product is always non-negative, as it is the sum of squared partial derivatives.

Similarly, we can compute the Q-value change due to the temporal difference update with the tuple (s, a, r, s′) at arbitrary state–action values. E.g., at (s̄, ā), it is

Q̇(s̄, ā; θ) = ∇_θ Q(s̄, ā; θ)ᵀ θ̇ = α ( r + γ q′ − q ) Θ((s̄, ā), (s, a)).

Note that the discrete-time update treats the target as a constant, but in continuous time q′ also evolves under the gradient flow. As per the above equation, we can conclude that the change of the NN is only influenced by the NTK. Next, let this arbitrary pair (s̄, ā) be the next state and the best action at that state, (s′, a′), yielding

q̇′ = α ( r + γ q′ − q ) Θ((s′, a′), (s, a)) = α ( r + γ q′ − q ) k′.
Part 2: State-space. Using the simplified notations q, q′, k, and k′ introduced above, we can organize the two first-order, inhomogeneous, linear ODEs (the two Q-value evolution equations above) into state-space form with the system states x = [q, q′]ᵀ, assuming the reward r is an exogenous signal. The nominal plant becomes

ẋ = A x + B r,   A = α [ −k  γk ; −k′  γk′ ],   B = α [ k ; k′ ].

Learning dynamics are characterized by the NTKs k and k′ in the coefficient matrices.

Part 3: Uncertainties. Despite its simple form, this system is inherently uncertain. This uncertainty stems from a single source but manifests in three forms that are specific for reinforcement learning. In contrast to a supervised learning setting, where data is static, in reinforcement learning, data is obtained sequentially as the agent explores the environment.

  • Changing environment states. The system states q and q′ have unmodeled underlying dynamics, as they always correspond to different environment states and actions. On the other hand, if a slow learning rate is assumed and the Q-function is smooth, the deviation from the modeled Q-values is bounded. This deviation can be included in the modeling framework as an output multiplicative uncertainty, overbounding the temporal variation of the states. We assume this uncertainty is proportional to the magnitude of the Q-values.

  • Parametric uncertainty in the NTK. Dynamically changing environment states cause parametric uncertainty through the NTK. Although the NTK seldom changes during training for wide neural networks (Remark 2), this is only true if the data (where the NTK is evaluated) is static. This is not the case in reinforcement learning: the kernel has to be evaluated at a different state–action pair in every step. Since both the state and the action are known, we could compute the NTK in every step; however, that would lead to a parameter-varying system. Instead, since the actual NTK values are only influenced by data, upon initialization of the neural network we can evaluate it at several environment-state pairs and estimate its bounds offline. Parametric uncertainties can form nonconvex regions, which can only be handled via robust control by overbounding these regions. To this end, the parametric uncertainty is pulled out from the plant and overbounded by a convex, unstructured uncertainty. In particular, it is captured with an output multiplicative uncertainty, see Figure 2. This technique is discussed comprehensively in zhou1998essentials. Finally, we enforce the output multiplicative uncertainty structure described above.

    Figure 2: The parametric uncertainty makes the frequency response of the system vary within nonconvex bounds, depicted with blue regions in this Nyquist diagram. The output multiplicative uncertainty overbounds this variation.
  • Exploration. Exploration in deep Q-learning means taking an action that does not correspond to the highest Q-value at the current state. Thus, q′ may not be the maximal Q-value at the next state but rather a randomly selected one. This effect can be lumped into the previously introduced output multiplicative uncertainty terms.

Note that none of the uncertainty blocks is time-dependent; they are only bounded. That is because this model is proposed to overbound all possible uncertainties in a robust way. We then combine all uncertainty components into a single uncertainty block


Finally, with the output of the single-input, multiple-output system subject to the combined output multiplicative uncertainty, the uncertain LTI model of deep Q-learning is


Next, through a series of remarks, some properties of this system are outlined.

Uncertainty structure. It would be possible to select different error structures for the unmodeled dynamics. For example, an input multiplicative uncertainty would make more sense for the exploration uncertainty. However, for simplicity, it is treated as an output multiplicative uncertainty. Alternatively, we could handle parametric uncertainties directly via μ-synthesis stein1991beyond.

Conjecture 1.

Nominal stability. The stability of the linearized deep Q-learning dynamics is easy to check: the nominal linear system is stable if the real parts of the system matrix's eigenvalues are negative. The state matrix above has one zero eigenvalue, while the other eigenvalue is −α(k − γk′). Thus, the system describing Q-learning is locally asymptotically stable if k > γk′. The magnitude of the NTK is related to the rate of change of the function approximator during learning. Intuitively, if q′ (the target) is changing faster (dictated by k′) than the actual value (dictated by k), learning will not converge. This result supports the divergence claim of standard deep Q-learning van2018deep.
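The eigenvalue argument can be checked numerically. A minimal sketch, with the state matrix implied by the derivation above and illustrative values for the learning rate, discount, and NTK entries k and k′:

```python
import numpy as np

# A = alpha * [[-k, gamma*k], [-k', gamma*k']] is rank one: one
# eigenvalue is zero, the other is -alpha * (k - gamma * k').
def state_matrix(alpha, gamma, k, k_next):
    return alpha * np.array([[-k, gamma * k],
                             [-k_next, gamma * k_next]])

alpha, gamma = 0.01, 0.99
eig_stable = np.linalg.eigvals(state_matrix(alpha, gamma, k=2.0, k_next=1.0))
eig_unstable = np.linalg.eigvals(state_matrix(alpha, gamma, k=1.0, k_next=2.0))
```

With k > γk′ the nonzero eigenvalue is negative (convergent learning); with k < γk′ a positive eigenvalue appears (divergent learning).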

Relation to Double deep Q-learning. A common remedy for the divergent behavior of Q-learning is the target network mnih2015human: q′ is computed from an independent but identical neural network which is updated less frequently. In our modeling framework, this would mean a piecewise static q′, with k′ = 0 between target updates. Since k ≥ 0, the state-space representation of Double deep Q-learning would be asymptotically stable for all k > 0. This remark highlights the efficiency of DDQN from an alternative perspective.

Boundedness of the parametric uncertainty. In reinforcement learning, the NTK changes due to the dynamically changing data. Therefore, we can evaluate the bounds of the NTK by computing k and k′ for a set of environment states in a grid-based fashion, offline, assuming the environment states are bounded too. Figure 3 depicts a slightly different approach: actual state transitions are taken from one of the simulation case studies (Section 4.1). This significantly reduces the domain where the NTK is evaluated. In addition, it highlights another important property: k and k′ are correlated, since both values are computed with the same kernel. Exploiting this correlation can greatly reduce the range of the parametric uncertainty.

Figure 3: Evaluations of the NTK during uncontrolled learning in the Cartpole environment. The NTKs are bounded by the red lines. Note: in the controlled cases, these evaluations will be less frequent and only happen when exploring.

Frequency of the learning. Using the Fast Fourier Transform (FFT), the frequency content of the agent's input and output signals can be analyzed. Our results suggest that these signals change slowly. We hypothesize that this is due to the smoothness of the Q-function (plus the low learning rate) and the rewarding scheme; thus, learning takes place in the low-frequency domain, regardless of the environment and control strategy. Figure 4 depicts the FFT of the agent's input and output signals for a controlled Cartpole scenario.

Figure 4: FFT of the agent's input and output signals in a controlled Cartpole environment.
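The frequency analysis above amounts to taking the FFT of a logged signal and inspecting where its energy lies. A hedged sketch on a synthetic, slowly varying trace (the sampling rate and signal are illustrative stand-ins for the agent's Q-values):

```python
import numpy as np

# FFT of a slowly varying signal: spectral energy should sit in the
# low-frequency bins, mirroring the observation in the text.
fs = 100.0                                   # sampling rate (illustrative)
t = np.arange(0, 10, 1 / fs)
q_trace = 1.0 + 0.5 * np.sin(2 * np.pi * 0.2 * t)  # slow 0.2 Hz variation

spectrum = np.abs(np.fft.rfft(q_trace))
freqs = np.fft.rfftfreq(len(q_trace), d=1 / fs)

low = spectrum[freqs <= 1.0].sum()           # energy below 1 Hz
total = spectrum.sum()
```

For this trace essentially all spectral energy lies below 1 Hz, with the peak at DC.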

In the next section, stabilizing controllers are formulated based on the linearized learning dynamics (Eq. (8)).

3 Explicitly controlled deep Q-learning

We support learning with a cascade control layout that prevents divergent learning behavior for any state–action combination. To this end, we augment the common agent–environment interaction (Figure 1) with an additional feedback controller, as depicted in Figure 5. In the sequel, we dissect the effect of this block on learning and propose controlled loss functions. In particular, we synthesize and compare three different controllers: ℋ_2 gain scheduling state feedback control, dynamic ℋ_∞ control, and fixed-structure robust ℋ_∞ control.

Figure 5: Deep Q-learning cascade feedback control

Random experience replay. As opposed to several Q-learning variants, we do not use random experience replay. Instead, we exploit the sequential nature of the data when computing the tracking error. However, it is possible to log episodic trajectories and replay them to the agent to help learning. This method is advantageous in sparse-reward environments.

3.1 State feedback controller design

The system (Eq. (8)) can be stabilized via a state feedback controller. First, we introduce the control input u, an additional stabilizing reward entering the dynamics alongside the reward r. As long as the NTK values are nonzero, the system is controllable skogestad2007multivariable. In addition, we append a tracking error state to the state space; minimizing it forces the controlled Q-values to asymptotically converge to the target (if the target is frozen).
Then, the augmented state-space model for controller design becomes

ẋ_a = A_a x_a + B_a u,

where x_a collects the Q-values and the tracking error state, and A_a and B_a are the augmented coefficient matrices.

Exogenous reward. If r were included directly as a state, the system would be rank deficient (rows 1 and 3 would be multiples of each other), yielding a zero eigenvalue. Thus, the system would not be controllable; consequently, stabilizability is not met either. If r is an external signal, it can be chosen freely without affecting the dynamical properties of the closed-loop system.
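The controllability claim can be verified with a Kalman rank test, ctrb(A, B) = [B, AB, …, Aⁿ⁻¹B]. In the sketch below, the input matrix assumes the stabilizing reward enters the Q(s, a) channel only; this is our reading of the text, and the paper's exact B may differ:

```python
import numpy as np

# Kalman rank test: (A, B) is controllable iff the controllability
# matrix [B, AB, ..., A^{n-1}B] has full row rank.
def controllable(A, B):
    n = A.shape[0]
    blocks = [np.linalg.matrix_power(A, i) @ B for i in range(n)]
    return np.linalg.matrix_rank(np.hstack(blocks)) == n

alpha, gamma, k, k_next = 0.01, 0.99, 2.0, 1.0   # illustrative NTK values
A = alpha * np.array([[-k, gamma * k], [-k_next, gamma * k_next]])
B = np.array([[alpha * k], [0.0]])               # assumed input channel
```

For nonzero k and k′ the rank condition holds, while a zero input matrix is trivially uncontrollable.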

An optimal, gain scheduling ℋ_2 state-feedback controller can be realized assuming some properties of the uncertainty block and the external signals. The ℋ_2 controller cannot handle the uncertainty block explicitly. Thus, the uncertainty is handled in two parts: the parametric uncertainty is computed explicitly in every step, while the rest of the uncertainties are neglected. The scheduling parameter captures the variation of the NTK from a nominal one in an affine way; its bounds stem from Remark 7. Additionally, we assume that there exists a stabilizing controller for every value of the scheduling parameter. Next, we encompass the coefficient matrices of the augmented model in Eq. (20) in the scheduled plant and write


Instead of handling the parametric uncertainty in a linear parameter-varying (LPV) way, we compute a locally optimal controller in every step by solving the ℋ_2 controller design problem repeatedly. The goal is finding a stabilizing optimal controller that minimizes the lower linear fractional transformation (LFT) for every value of the scheduling parameter in a gain scheduled manner as


which turns into the following quadratic optimization problem skogestad2007multivariable:


The solution to the above optimization can be given in closed form via the Control Algebraic Riccati Equation (CARE) kwakernaak1972linear.

Q and R are positive (semi-)definite diagonal weighting matrices, serving as tuning parameters for the controller. Q penalizes the performance, including the tracking error, and R penalizes the control input. Assigning high diagonal elements to Q emphasizes tracking: in our case the tracking error shall be minimized, i.e., its weight shall be high compared to the other diagonal elements. That is, we do not want to minimize the Q-values themselves. We can also keep the weight R on the control input low (cheap control hespanha2018linear), because the extra reward in the form of u has no physical meaning and does not result in excess energy consumption. On the other hand, u acts as an arbitrary reward that will distort the learning dynamics.
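The Riccati-based synthesis step can be sketched numerically. The paper designs the controller for the augmented three-state plant; the example below uses the two-state learning model with our illustrative numbers and assumed input channel, purely to show the mechanics:

```python
import numpy as np
from scipy.linalg import solve_continuous_are

# LQ synthesis: solve the control algebraic Riccati equation and form
# the optimal state-feedback gain K = R^{-1} B^T P.
alpha, gamma, k, k_next = 0.01, 0.99, 1.0, 2.0   # an unstable combination
A = alpha * np.array([[-k, gamma * k], [-k_next, gamma * k_next]])
B = np.array([[alpha * k], [0.0]])               # assumed input channel

Q = np.diag([10.0, 1.0])   # heavy weight on the tracked channel
R = np.array([[0.1]])      # cheap control

P = solve_continuous_are(A, B, Q, R)
K = np.linalg.solve(R, B.T @ P)
closed_loop = np.linalg.eigvals(A - B @ K)
```

Here the open loop is divergent (k < γk′), while the closed loop obtained from the stabilizing Riccati solution has eigenvalues strictly in the left half-plane.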

Denoting the gains of the controller as K₁, K₂, and K₃ (acting on q, q′, and the tracking error state, respectively), the closed-loop system can be written accordingly. The control input only affects q directly. Computing the actual control input corresponding to the measured states, we can write the controlled Q-value change at (s, a). The controller gains can be placed inside the parenthesis, since the control input acts through the temporal difference term: K₁ and K₂ affect the stability of Q-learning, while K₃ influences the tracking error.

Next, we calculate the controlled loss function based on Eq. (25). To this end we recall the chain-rule and use the definition of the NTK (with detailed notations) to achieve


Then, the controlled loss is obtained by integrating Eq. (26) with respect to the Q-value. In order to obtain a similar form to Eq. (2), we assume only Q(s, a; θ) is θ-dependent when performing the integration. We make this assumption based on the following points.

  • When we first introduced the evolution of the weights (Eq. (2)), we assumed the more common direct method over a residual gradient method baird1995residual; thus, the θ-dependency of the temporal difference target is not considered.

  • The controller gains K₁, K₂, and K₃ are constants.

  • The terms in the integral (Eq. (19)) depend only on past values (implicitly). Therefore, they can be considered constant when integrating.

Integrating Eq. (26) with respect to the Q-value, considering the above assumptions, we get the controlled loss in Eq. (27), the loss function for the controlled agent. The terms in the loss are weighted by the controller's parameters, helping convergence at the cost of biasing the true Q-values. Note that the controller is designed for the nominal plant, without considering the uncertainties in Δ; hence, dynamic Q-value stabilization is only guaranteed in a local sense. On the other hand, the ℋ_2 controller is inherently robust up to a certain level of multiplicative uncertainty bokor2012robust. The parametric uncertainty is handled in a gain scheduling way: the optimal controller can be recomputed every episode by evaluating the NTK repeatedly. This is computationally intensive and bears the risk that a parameter combination occurs that cannot be stabilized, rendering the learning divergent. In the sequel, the convergence of deep Q-learning is aided by robust control: instead of considering fixed parameter combinations, the variations in the parameters and the states are explicitly included in the controller design.

3.2 ℋ_∞ controller design

In this section we propose two types of robust ℋ_∞ controllers. First, we outline the controller design procedure via a generic robust dynamical controller, where a linear time-invariant system computes the control input. Second, we fix the structure of the controller and utilize constant gains, akin to the state feedback controller.

Although the parametric uncertainty could be explicitly computed, doing so is computationally inefficient and would lead to a parameter-varying controller, as demonstrated for the gain scheduling case. This inefficiency motivates the formulation of a robust controller: it can be synthesized before learning, and it remains stabilizing during learning for all combinations of states and parameters (see Figure 3). The ℋ_∞ design framework is capable of handling every uncertainty in Δ in a robust way. Furthermore, in the controller design procedure the system's response is shaped by dynamically weighting the inputs and outputs of the system. Therefore, the low-frequency nature of the controlled learning agent can be exploited too.

The aim is controlling the nominal system, encompassing Eq. (17), disturbed by noise through the Δ block. The controller has two inputs: the noisy states fed back and the reference signal, identical to the tracking error of the ℋ_2-controlled case. In the ℋ_∞ design, performance is enforced through tuneable weights that give the desired shape to the singular values of the open-loop response.

  • One weight penalizes the control input, and another penalizes the tracking error. As discussed before, learning happens in the low-frequency range. Therefore, we want good tracking performance (a large error weight) at low frequencies. Although the control input has no physical interpretation, it should be dynamically weighted too. At higher frequencies, tracking shall be penalized more in order to reduce the singular values.


Bode magnitude diagrams of the two performance weights are shown in Figure 6. Note that these weights turn out to be universal, regardless of the RL environment.

    Figure 6: Bode magnitude diagrams of the frequency-dependent tuning weights.
  • A third weight shapes the uncertainty. It is considered constant (with magnitude varying from environment to environment); in general, it is constant at low frequencies, where learning is meaningful. Its magnitude has a peak at an extremely high frequency, which is unimportant for the learning.

  • The purpose of the remaining weights is to normalize the reference signals and inject reference-related dynamics. Here, they are considered frequency-independent with environment-specific magnitude.

The closed-loop system interconnection in the so-called generalized-plant structure, which is the general form of the ℋ_∞ design, is depicted in Figure 7. The two input channels of the nominal plant carry the disturbance and the control input, respectively. By applying the weighting and the compensator, the augmented plant can be formalized as

Figure 7: Generalized plant structure

The closed-loop transfer function from the exogenous signals to the performance outputs can be expressed, provided that the relevant inverse exists, via a lower LFT as

F_l(P, K) = P₁₁ + P₁₂ K (I − P₂₂ K)⁻¹ P₂₁.

In ℋ_∞ control, the aim is finding a controller that minimizes the impact of the disturbance on the performance output, i.e., the induced (worst-case) norm:

‖F_l(P, K)‖_∞ < γ_∞,

where γ_∞ is a prescribed disturbance attenuation level, progressively lowered by iteration zhou1998essentials, skogestad2007multivariable. (Here γ_∞ denotes the ℋ_∞ attenuation level, not to be confused with the discount factor.)
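The quantity bounded in the γ-iteration is the peak of the largest singular value of the closed loop over frequency. A hedged sketch that estimates this norm on a frequency grid for a simple SISO example (not the paper's plant; the system and grid are our illustrative choices):

```python
import numpy as np
from scipy import signal

# The H-infinity norm of a stable LTI system is the peak gain over
# frequency. For G(s) = 2 / (s + 1) the peak is 2, attained at w -> 0.
sys = signal.TransferFunction([2.0], [1.0, 1.0])
w = np.logspace(-3, 3, 2000)
_, mag, _ = signal.bode(sys, w)      # magnitude in dB on the grid

hinf_norm = 10 ** (mag.max() / 20)   # convert back to absolute gain
```

A synthesis routine then searches for a controller whose closed loop keeps this peak below the prescribed γ_∞.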

The resulting ℋ_∞ controller is an LTI system with 3 inputs and one output u. The synthesized controller has several internal states, which are reset after each episode. The controller is given in the state-space form

ẋ_K = A_K x_K + B_K y,   u = C_K x_K + D_K y,

where A_K, B_K, C_K, and D_K characterize the controller. Next, we compute the controlled learning loss with the same assumptions as for the ℋ_2 case. The controlled evolution of q is


If the target is independent of θ and the control signal only indirectly depends on the states of the learning agent, the controlled quadratic loss can be written as