1 Introduction
In the past decade, the success of neural networks (NNs) has led to significant uptake of machine learning (ML) methods in various areas of science and real-world applications. At the same time, large data sets, the black-box nature of some problems, and the complex structure of function approximators often hamper in-depth human understanding of such methods. Consequently, efforts have been made to improve the transparency of machine learning, both in terms of training an ML model and of the results produced by the trained model ^{adadi2018peeking, roscher2020explainable}.
Although machine learning-based controllers are gaining popularity and often outperform classical control, especially in highly nonlinear environments, their stability and performance are seldom guaranteed analytically ^{hoel2018automated, zhou2021model}. Similarly, making such heuristic learning algorithms converge requires tweaking and experimenting. Control theory has a well-established and mathematically sound toolkit to analyze dynamical systems and synthesize stabilizing, robust controllers ^{skogestad2007multivariable}. ^{bradtke1994adaptive} shows that dynamic programming-based reinforcement learning (RL) (Q-learning) converges to an optimal linear quadratic (LQ) regulator if the environment is a linear system. On the other hand, RL shines in complex environments where formulating a closed-form solution is impossible. Several works deal with synergized model-based and data-driven controllers to improve the performance of the controlled process ^{kretchmar2001robust, hegedus2020handling} or analyze learned controllers with tools from control ^{donti2020enforcing, perrusquia2020robust, liu2020h}. Meanwhile, control theory is seldom utilized to enhance the agent's training performance.
One branch of ML is reinforcement learning, where, in the absence of labeled data, the agent learns in a trial-and-error way, interacting with its environment. The learning agent faces a sequential decision problem and receives feedback as a performance measure ^{sutton2018reinforcement}. This interaction is commonly depicted as in Figure 1.
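This interaction loop can be sketched in a few lines of Python. The toy `Env` class below is a hypothetical stand-in for a Gym-style environment, and the random policy is a placeholder for the policy to be learned:

```python
import random

class Env:
    """Toy stand-in environment: reach state 3 on an integer line."""
    def reset(self):
        self.s = 0
        return self.s

    def step(self, action):          # action in {-1, +1}
        self.s += action
        reward = 1.0 if self.s == 3 else 0.0
        done = self.s == 3
        return self.s, reward, done

env = Env()
state, total_reward = env.reset(), 0.0
for t in range(100):                 # sequential decision problem
    action = random.choice([-1, 1])  # placeholder policy (to be learned)
    state, reward, done = env.step(action)
    total_reward += reward           # performance feedback to the agent
    if done:
        break
```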
This sequential decision problem can be described with a (discrete) Markov Decision Process (MDP) characterized by the following 5-tuple: $(\mathcal{S}, \mathcal{A}, \mathcal{P}, R, \gamma)$, where $\mathcal{S} \subseteq \mathbb{R}^n$ is the continuous state space with $n$ dimensions, $\mathcal{A}$ is the finite, discrete action space, $\mathcal{P}$ is the transition probability matrix, $R$ is the reward accumulated by the agent, and $\gamma \in [0, 1)$ is the discount factor. The agent traverses the MDP following policy $\pi$ with discrete time step $t$. Reinforcement learning methods compute a mapping from the set of states of the environment to the set of possible actions in order to maximize the expected discounted cumulative reward. One common way to tackle an RL problem is Q-learning. Here, the aim is learning the state-action-value (or Q) function, the measure of the overall expected reward for taking action $a_t$ at state $s_t$:
$Q^{\pi}(s_t, a_t) = \mathbb{E}_{\pi} \left[ \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \,\middle|\, s_t, a_t \right], \quad (1)$
with $r_t$ being the immediate reward. In tabular Q-learning, the states and actions are discretized and can have huge cardinality; thus, it suffers from the curse of dimensionality. Deep Q-learning alleviates this problem via approximating the Q-function with a neural network (deep Q-network, DQN) parametrized by weights $\theta$. Thus, the Q-function takes an $n$-dimensional environment state $s_t$ and evaluates the Q-value for action $a_t$, $Q(s_t, a_t; \theta)$. Then, the policy selects the action corresponding to the largest Q-value in an $\varepsilon$-greedy way. Deep Q-learning learns by minimizing the temporal difference (1-step estimation error) at time $t$,
following the quadratic loss function (mean squared Bellman residual ^{baird1995residual}):

$L(\theta) = \frac{1}{2} \big( r_t + \gamma \max_{a'} Q(s_{t+1}, a'; \theta) - Q(s_t, a_t; \theta) \big)^2, \quad (2)$

with $s_{t+1}$ being the next state. Then, with learning rate $\alpha$, the update of the weights $\theta$ of the neural network via gradient descent is

$\theta_{k+1} = \theta_k - \alpha \nabla_{\theta} L(\theta_k). \quad (3)$
The gradient (assuming the Q-function is differentiable) determines the "direction" in which this update is performed. Observe that the target value $r_t + \gamma \max_{a'} Q(s_{t+1}, a'; \theta)$ also depends on $\theta$. Thus, the correct gradient would be $\nabla_{\theta} \big( r_t + \gamma \max_{a'} Q(s_{t+1}, a'; \theta) - Q(s_t, a_t; \theta) \big)$, as in the residual-gradient method. On the other hand, mainstream Q-learning algorithms perform the TD update with $\nabla_{\theta} Q(s_t, a_t; \theta)$ only, resulting in faster and more stable algorithms ^{baird1995residual}. In the sequel, we will adhere to this more common approach.
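A minimal sketch of this semi-gradient TD update, using a linear Q-function approximator as a stand-in for the DQN (the toy features and all names are hypothetical):

```python
import numpy as np

def semi_gradient_td_update(theta, phi_s, a, r, phi_s_next, alpha=0.1, gamma=0.99):
    """One semi-gradient TD update for Q(s, a) = theta[a] . phi(s).

    The target r + gamma * max_a' Q(s', a') is treated as a constant,
    so the gradient is taken only through Q(s, a)."""
    q_sa = theta[a] @ phi_s
    target = r + gamma * np.max(theta @ phi_s_next)   # no gradient through this
    td_error = target - q_sa
    theta = theta.copy()
    theta[a] += alpha * td_error * phi_s              # grad of Q(s, a) wrt theta[a] is phi(s)
    return theta, td_error

# Toy usage: 2 actions, 3 state features
rng = np.random.default_rng(0)
theta = np.zeros((2, 3))
phi_s, phi_s_next = rng.standard_normal(3), rng.standard_normal(3)
theta, delta = semi_gradient_td_update(theta, phi_s, a=1, r=1.0, phi_s_next=phi_s_next)
```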
Deep Q-learning in its pure form often shows divergent behavior under function approximation ^{sutton2018reinforcement, van2018deep}. It has no known convergence guarantees, although convergence results have been obtained for some related algorithms ^{fan2020theoretical}. Two major ideas have been developed to improve (but not guarantee) its convergence: using a target network (Double deep Q-learning) and employing experience replay ^{mnih2015human}. In Double deep Q-learning, the target network is an exact copy of the actual network, but it is updated less frequently. Freezing the target network prevents the target value from changing faster than the actual Q-value during learning. Intuitively, learning can become unstable and lose convergence if the target changes faster than the actual value. With experience replay, a memory buffer is introduced. Samples are drawn randomly from this buffer, thus minimizing the correlation between samples observed in trajectory-based learning and enabling the use of supervised learning techniques that assume sample independence ^{carvalho2020new}. Some recent advances in DQN modify the temporal difference target in order to achieve better convergence results, e.g. ^{pohlen2018observe, durugkar2018td}. ^{achiam2019towards} aims at characterizing divergence in deep Q-learning with the help of the recently introduced neural tangent kernel (NTK, ^{jacot2018neural}); in addition, it proposes an algorithm that scales the learning rate to ensure convergence. In ^{ohnishi2019constrained}, an additive regularization term is used to constrain the loss and enhance convergence.
This work aims at constructing a bridge between robust control theory and reinforcement learning. To this end, we borrow techniques from robust control theory to compensate for the non-convergent behaviour of deep Q-learning via cascade control. First, we embed learning into a state-space framework as an uncertain, linear, time-invariant (LTI) system through the NTK. Based on the dynamical system description, convergence (or stability) can be concluded in a straightforward way. As opposed to ^{achiam2019towards}, stability is ensured via modifying the temporal difference term through robust stabilizing controllers. We synthesize and benchmark three controllers: a gain-scheduled state-feedback controller, a dynamic $\mathcal{H}_{\infty}$ controller, and a constant-gain $\mathcal{H}_{\infty}$ controller. The primary motivation for robust control is that it is capable of taking into account the uncertain nature of a reinforcement learning problem. In addition, we do not have to recompute the NTK in every step; we can include its variation as a parametric uncertainty in our controller design. This yields a computationally more efficient methodology than the one proposed in ^{achiam2019towards}. Our control-oriented approach makes parameter tuning more straightforward and transparent (i.e., involving fewer heuristics). The two aforementioned common heuristics of deep Q-learning (target network and random experience replay, ^{carvalho2020new}) are not needed. Instead, the temporal dependency of samples is exploited through the dynamical system formulation. Robust control can support the learning process, making it more explainable. Results suggest that robust-controlled learning performs on par with DDQN in the benchmark environments.
The paper is organized as follows. First, in Section 2, the dynamics of Q-learning is formulated as an uncertain LTI system using the NTK. Then, based on the formulated model, three controllers are synthesized: Section 3.1 formulates an optimal state-feedback controller, in Section 3.2 a dynamical $\mathcal{H}_{\infty}$ controller is synthesized in the frequency domain, and in Section 3.3 the controller design is adjusted to result in a controller with constant gains. The proposed controlled learning approaches are thoroughly analyzed and compared in three challenging OpenAI Gym environments: Cartpole, Acrobot, and Mountain Car (Section 4). Finally, Section 5 concludes the findings of this paper.
2 Control-oriented modeling of deep Q-learning
In this section, we present how to translate deep Q-learning into a dynamical system. In order to formulate the model, we first introduce the NTK alongside some of its relevant properties.

Neural Tangent Kernel ^{jacot2018neural}. Given data $x, x'$, the NTK of an artificial neural network $f(x; \theta)$, parametrized with $\theta$, is

$\Theta(x, x') = \nabla_{\theta} f(x; \theta)\, \nabla_{\theta} f(x'; \theta)^{\top}. \quad (4)$
Multiple outputs. From the NTK perspective, a neural network with multiple outputs behaves asymptotically like a collection of networks with scalar outputs trained independently.

Constant kernel. Although $\Theta$ changes during training, in the infinite-width limit the NTK converges to an explicit constant kernel. It only depends on the depth, activation function, and parameter initialization variance of an NN. In other words, during training, $\Theta$ is independent of time $t$.

Linear dynamics. In the infinite-width limit, an NN can be well described throughout training by its first-order Taylor expansion (i.e., linear dynamics) around its parameters at initialization ($\theta_0$), assuming a low learning rate ^{Lee2020Wide}:

$f(x; \theta_t) \approx f(x; \theta_0) + \nabla_{\theta} f(x; \theta_0)^{\top} \big( \theta_t - \theta_0 \big). \quad (5)$

Gradient flow. The NTK describes the evolution of neural networks under gradient descent in function space. Under gradient flow (continuous learning with an infinitely low learning rate via gradient descent), the weight update is given as

$\dot{\theta}_t = -\eta\, \nabla_{\theta} L(\theta_t), \quad (6)$
with $L$ an at least once continuously differentiable (w.r.t. $\theta$) arbitrary loss function and $\eta$ the learning rate. Then, with the help of the chain rule and gradient flow, the learning dynamics in Eq. (5) becomes

$\dot{f}(x; \theta_t) = -\eta\, \Theta(x, x)\, \nabla_{f} L. \quad (7)$
In light of the above properties of the NTK, we restrict ourselves to shallow and wide neural networks for approximating the Q-function.
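The empirical NTK of Eq. (4) can be computed directly from per-sample parameter gradients. A minimal NumPy sketch for a one-hidden-layer tanh network with scalar output (all sizes and names are illustrative):

```python
import numpy as np

def ntk(x1, x2, W1, b1, w2):
    """Empirical NTK Theta(x1, x2) = grad_theta f(x1) . grad_theta f(x2)
    for the scalar network f(x) = w2 . tanh(W1 @ x + b1)."""
    def grads(x):
        h = np.tanh(W1 @ x + b1)          # hidden activations
        dh = 1.0 - h ** 2                 # tanh'(pre-activation)
        g_w2 = h                          # df/dw2
        g_b1 = w2 * dh                    # df/db1
        g_W1 = np.outer(w2 * dh, x)       # df/dW1
        return np.concatenate([g_W1.ravel(), g_b1, g_w2])
    return grads(x1) @ grads(x2)

rng = np.random.default_rng(0)
W1 = rng.standard_normal((16, 2))
b1, w2 = rng.standard_normal(16), rng.standard_normal(16)
x = rng.standard_normal(2)
k_xx = ntk(x, x, W1, b1, w2)   # Theta(x, x) >= 0: a sum of squared partials
```

Note that the diagonal value `k_xx` is always nonnegative, while cross-kernel values between two different inputs may take either sign.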
Assuming $q = Q(s_t, a_t; \theta)$ is the actual Q-value and $q' = Q(s_{t+1}, a^*; \theta)$ is the next Q-value, with $(q, q')$ as the system states, Q-learning can be modeled with uncertain continuous, linear time-invariant dynamics. Let the NTK of the deep Q-network be evaluated at the current state of the environment for the output corresponding to $a_t$; denote it as $K = \Theta\big((s_t, a_t), (s_t, a_t)\big)$. Similarly, for the next state, $K' = \Theta\big((s_{t+1}, a^*), (s_t, a_t)\big)$. The role of $K$ and $K'$ is to characterize how $q$ and $q'$ will evolve during learning according to Remark 4. In addition, denote the bounded uncertainty block encompassing unmodeled learning behaviour by $\Delta$, $\|\Delta\|_{\infty} \le 1$, where $\|\cdot\|_{\infty}$ denotes the infinity norm. Dynamics of deep Q-learning. Deep Q-learning can be modeled as a continuous-time, linear time-invariant system with output multiplicative uncertainty with the help of the NTK as
$\dot{x} = A x + B r, \qquad y = (I + \Delta) x, \quad (8)$

with $x = [q,\ q']^{\top}$, where the coefficient matrices $A$ and $B$ are built from the NTK values $K$ and $K'$ (given explicitly in Eq. (13)).
Proof.
The proof consists of three parts. First, learning dynamics are formulated for fixed state-action values, and the appearance of the NTK is shown. Then, the results are cast into a state-space form for selected Q-values. Finally, the necessity of the uncertainty block and its components are discussed.
Part 1: Learning dynamics. In order to describe the learning as a dynamical system, first, we transform the weight update with quadratic loss (Eq. (3)) into continuous time (gradient flow) with the learning rate $\eta$:

$\dot{\theta}_t = \eta \big( r_t + \gamma \max_{a'} Q(s_{t+1}, a'; \theta) - Q(s_t, a_t; \theta) \big) \nabla_{\theta} Q(s_t, a_t; \theta). \quad (9)$
We can write the Q-value evolution at state $s_t$ for action $a_t$ with the help of the chain rule as

$\frac{d}{dt} Q(s_t, a_t; \theta) = \nabla_{\theta} Q(s_t, a_t; \theta)^{\top} \dot{\theta}_t = \eta\, \Theta\big((s_t, a_t), (s_t, a_t)\big) \big( r_t + \gamma \max_{a'} Q(s_{t+1}, a'; \theta) - Q(s_t, a_t; \theta) \big). \quad (10)$

In Eq. (10), the term $\Theta\big((s_t, a_t), (s_t, a_t)\big)$ is the NTK evaluated at $s_t$ for action $a_t$. Note that, in this setting, the scalar product is always nonnegative, as it is the sum of the squared partial derivatives.
Similarly, we can compute the Q-value change due to the temporal difference update with the tuple $(s_t, a_t, r_t, s_{t+1})$ at arbitrary state-action values, e.g., at $(\bar{s}, \bar{a})$:

$\frac{d}{dt} Q(\bar{s}, \bar{a}; \theta) = \eta\, \Theta\big((\bar{s}, \bar{a}), (s_t, a_t)\big) \big( r_t + \gamma \max_{a'} Q(s_{t+1}, a'; \theta) - Q(s_t, a_t; \theta) \big). \quad (11)$

Note that the semi-gradient assumption stated before concerns only the gradient computation; in continuous time, every Q-value (including the target) still evolves through the shared weights $\theta$. As per the above equation, we can conclude that the change of the NN is only influenced by the NTK. Next, let this arbitrary pair $(\bar{s}, \bar{a})$ be the next state and the best action at that state, $(s_{t+1}, a^*)$:

$\frac{d}{dt} Q(s_{t+1}, a^*; \theta) = \eta\, \Theta\big((s_{t+1}, a^*), (s_t, a_t)\big) \big( r_t + \gamma \max_{a'} Q(s_{t+1}, a'; \theta) - Q(s_t, a_t; \theta) \big). \quad (12)$
Part 2: State-space. Using the simplified notations $q = Q(s_t, a_t; \theta)$, $q' = Q(s_{t+1}, a^*; \theta)$, $K = \Theta\big((s_t, a_t), (s_t, a_t)\big)$, and $K' = \Theta\big((s_{t+1}, a^*), (s_t, a_t)\big)$, we can organize the two first-order inhomogeneous linear ODEs (Eq. (10) and Eq. (12)) into state-space form, with the system states being $q$ and $q'$ and assuming the reward is an exogenous signal. The nominal plant becomes

$\begin{bmatrix} \dot{q} \\ \dot{q}' \end{bmatrix} = \eta \begin{bmatrix} -K & \gamma K \\ -K' & \gamma K' \end{bmatrix} \begin{bmatrix} q \\ q' \end{bmatrix} + \eta \begin{bmatrix} K \\ K' \end{bmatrix} r. \quad (13)$
Learning dynamics are characterized by the NTK values $K$ and $K'$ in the coefficient matrices.
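The nominal plant of Eq. (13) can be integrated numerically to inspect convergence; a short NumPy sketch with illustrative kernel values (all numbers are hypothetical, not taken from the paper's experiments):

```python
import numpy as np

def simulate_q_dynamics(K, K_next, r=1.0, eta=0.05, gamma=0.99,
                        dt=0.01, steps=40_000):
    """Forward-Euler integration of the nominal plant (Eq. (13))."""
    A = eta * np.array([[-K,      gamma * K],
                        [-K_next, gamma * K_next]])
    B = eta * np.array([K, K_next])
    x = np.zeros(2)                       # states: (q, q')
    for _ in range(steps):
        x = x + dt * (A @ x + B * r)
    return x

q, q_next = simulate_q_dynamics(K=1.0, K_next=0.5)   # convergent configuration
td_residual = 1.0 + 0.99 * q_next - q                # r + gamma*q' - q -> 0
```

With these values the temporal difference residual decays to zero, i.e., the learned Q-values settle on the Bellman target.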
Part 3: Uncertainties. Despite its simple form, this system is inherently uncertain. This uncertainty stems from a single source but manifests in three forms that are specific to reinforcement learning. In contrast to a supervised learning setting, where the data is static, in reinforcement learning, data is obtained sequentially as the agent explores the environment.

Changing environment states. The system states $q$ and $q'$ have unmodeled underlying dynamics, as they always correspond to different environment states and actions (recall: $q = Q(s_t, a_t; \theta)$). On the other hand, if a slow learning rate is assumed and the Q-function is smooth, the deviation from the modeled Q-values is bounded. This deviation can be included in the modeling framework as an output multiplicative uncertainty $\Delta_s$, overbounding the temporal variation of the states. We assume this uncertainty is proportional to the magnitude of the Q-values.

Parametric uncertainty in the NTK. Dynamically changing environment states cause parametric uncertainty through the NTK. Although the NTK seldom changes during training for wide neural networks (Remark 2), this is only true if the data (where the NTK is evaluated) is static. This is not the case in reinforcement learning: it has to be evaluated for different state-action pairs in every step. Since both $s_t$ and $s_{t+1}$ are known, we can compute the NTK in every step; however, that would lead to a parameter-varying system. Since the actual NTK values are only influenced by the data, upon initialization of the neural network we can evaluate it at several environment state pairs and estimate its bounds offline. Parametric uncertainties can form nonconvex regions, which can only be handled via robust control by overbounding these regions. To this end, the parametric uncertainty is pulled out from the plant and overbounded by a convex, unstructured uncertainty structure. In particular, it is captured with an output multiplicative uncertainty, see Figure 2. This technique is discussed comprehensively in ^{zhou1998essentials}. Finally, we enforce an output multiplicative uncertainty structure, $\Delta_K$.

Exploration. Exploration in deep Q-learning means taking an action that does not correspond to the highest Q-value at the given state. Thus, $q'$ may not be the maximal next Q-value, but rather one belonging to a randomly selected action. This effect can be lumped into the previously introduced output multiplicative uncertainty terms as $\Delta_e$.
Note that none of the uncertainty blocks is time-dependent, but all of them are bounded. That is because the proposed model overbounds all possible uncertainties in a robust way. Then, we suggest combining all uncertainty components into a single uncertainty block

$\Delta = \Delta_s + \Delta_K + \Delta_e, \qquad \|\Delta\|_{\infty} \le 1. \quad (14)$
Finally, assuming the output of the single-input, multiple-output system is the state vector itself, $y = [q,\ q']^{\top}$, the uncertain LTI model of deep Q-learning is

$\dot{x} = A x + B r, \qquad y = (I + \Delta) x. \quad (15)$
∎
Next, through a series of remarks, some properties of this system are outlined.
Uncertainty structure. It would be possible to select different error structures for the unmodeled dynamics. For example, an input multiplicative uncertainty would make more sense for the exploration uncertainty. However, for simplicity, it is treated as an output multiplicative uncertainty. Alternatively, we could handle parametric uncertainties directly via $\mu$-synthesis ^{stein1991beyond}.
Conjecture 1.
Nominal stability.
The stability of the linearized deep Q-learning dynamics is easy to check. The nominal linear system is stable if the real parts of the system matrix's eigenvalues are negative, i.e.,

$\operatorname{Re} \lambda_i \left( \eta \begin{bmatrix} -K & \gamma K \\ -K' & \gamma K' \end{bmatrix} \right) < 0, \quad i = 1, 2. \quad (16)$

The state matrix above has one zero eigenvalue, while the other eigenvalue is $\eta(\gamma K' - K)$. Thus, the system describing Q-learning is locally asymptotically stable if $K > \gamma K'$. The magnitude of the NTK is related to the rate of change of the function approximator during learning. Intuitively, if $q'$ (the target) is changing faster (dictated by $K'$) than the actual value $q$ (dictated by $K$), learning will not converge. This result supports the divergence claim of standard deep Q-learning ^{van2018deep}.
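The eigenvalue structure of the state matrix in Eq. (13) can be checked numerically; the kernel values below are illustrative:

```python
import numpy as np

def learning_eigenvalues(K, K_next, eta=0.05, gamma=0.99):
    """Eigenvalues of the state matrix of Eq. (13).

    Expected spectrum: {0, eta * (gamma * K' - K)} -- one zero eigenvalue
    from the rank-1 structure, stability decided by the sign of the other."""
    A = eta * np.array([[-K,      gamma * K],
                        [-K_next, gamma * K_next]])
    return np.linalg.eigvals(A)

eigs_stable = learning_eigenvalues(K=1.0, K_next=0.5)     # K > gamma * K'
eigs_divergent = learning_eigenvalues(K=0.4, K_next=0.5)  # K < gamma * K'
```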
Relation to Double deep Q-learning. A common remedy for the divergent behavior of Q-learning is the target network ^{mnih2015human}, i.e., $q'$ is computed from an independent but identical neural network which is updated less frequently. In our modeling framework, this would mean a piecewise static $q'$, with $\dot{q}' = 0$. Since $K \ge 0$, the state-space representation of Double deep Q-learning would be asymptotically stable for all $K > 0$. This remark highlights the efficiency of DDQN from an alternative perspective.

Boundedness of the parametric uncertainty. In reinforcement learning, the NTK changes due to the dynamically changing data. Therefore, we can evaluate the bounds of the NTK by computing $K$ and $K'$ for a set of environment states in a grid-based fashion, offline, assuming the environment states are bounded too. Figure 3 depicts a slightly different approach: actual state transitions are taken from one of the simulation case studies (Section 4.1). This significantly reduces the domain where the NTK is evaluated. In addition, it highlights another important property: $K$ and $K'$ are correlated, since both values are computed with the same kernel. Exploiting this correlation can greatly reduce the range of the parametric uncertainty.
Frequency content of the learning. Using the Fast Fourier Transform (FFT), the frequency content of the agent's input and output signals can be analyzed. Our results suggest that these signals are slowly varying. We hypothesize that this is due to the smoothness of the Q-function (plus the low learning rate) and the rewarding scheme; thus, learning takes place in the low-frequency domain (regardless of the environment and control strategy). Figure 4 depicts the FFT of the agent's signals for a controlled Cartpole scenario. In the next section, stabilizing controllers are formulated based on the linearized learning dynamics (Eq. (8)).
3 Explicitly controlled deep Q-learning
We support learning with a cascade control layout that prevents divergent learning behavior for any state-action combination. To this end, we augment the common agent-environment interaction (Figure 1) with an additional feedback controller, as depicted in Figure 5. In the sequel, we dissect the effect of this block on learning and propose controlled loss functions. In particular, we synthesize and compare three different controllers: gain-scheduled state-feedback control, dynamic $\mathcal{H}_{\infty}$ control, and fixed-structure robust $\mathcal{H}_{\infty}$ control.
Random experience replay. As opposed to several Q-learning variants, we do not use random experience replay. Instead, we exploit the sequential nature of the data when computing the tracking error. However, it is possible to log episodic trajectories and replay them to the agent to help learning; this method is advantageous in sparse-reward environments.
3.1 State-feedback controller design
The system (Eq. (8)) can be stabilized via a state-feedback controller. First, we introduce the control input $u$ (an additional stabilizing reward):

$\begin{bmatrix} \dot{q} \\ \dot{q}' \end{bmatrix} = \eta \begin{bmatrix} -K & \gamma K \\ -K' & \gamma K' \end{bmatrix} \begin{bmatrix} q \\ q' \end{bmatrix} + \eta \begin{bmatrix} K \\ K' \end{bmatrix} r + \eta \begin{bmatrix} K \\ 0 \end{bmatrix} u. \quad (17)$
As long as $K \neq 0$ and $K' \neq 0$, the system is controllable ^{skogestad2007multivariable}. In addition, we append a tracking error state $\varepsilon$ to the state space. Minimizing $\varepsilon$ forces the controlled Q-values to asymptotically converge to the target (if it is frozen), as

$\varepsilon(t) = \int_{0}^{t} \big( \gamma q'(\tau) + r(\tau) - q(\tau) \big)\, d\tau \quad (18)$

and

$\dot{\varepsilon} = \gamma q' + r - q. \quad (19)$
Then, the augmented state-space model for controller design becomes

$\dot{x} = A_{\mathrm{aug}} x + B_{r} r + B_{u} u, \quad (20)$

where $x = [q,\ q',\ \varepsilon]^{\top}$ collects the augmented states, and $A_{\mathrm{aug}}$, $B_r$, and $B_u$ extend the matrices of Eq. (17) with the error dynamics of Eq. (19).
Exogenous $r$. If $r$ were included directly as a state, the system would be rank deficient (rows 1 and 3 would be multiples of each other, with factor $\eta K$), yielding a zero eigenvalue. Thus, the system would not be controllable; consequently, stabilizability is not met either. If $r$ is an external signal, it can be chosen freely without affecting the dynamical properties of the closed-loop system.
An optimal, gain-scheduled state-feedback (LQ) controller can be realized assuming some properties of the uncertainty block and the external signals. The LQ controller cannot handle the uncertainty block explicitly. Thus, the uncertainty is handled in two parts: the parametric uncertainty is computed explicitly in every step, while the rest of the uncertainties are neglected. The scheduling parameter $\rho$ captures the variation of the NTK values from nominal ones in an affine way. The bounds of $\rho$ stem from Remark 7. Additionally, we assume that there exists a stabilizing controller for every $\rho$. Next, we encompass the coefficient matrices of the augmented model in Eq. (20) in the plant $\Sigma(\rho)$ and write

$\dot{x} = A(\rho)\, x + B_{r}(\rho)\, r + B_{u}(\rho)\, u, \quad (21)$

with $A(\rho)$, $B_r(\rho)$, and $B_u(\rho)$ affine in $\rho$. Instead of handling the parametric uncertainty in a linear parameter-varying (LPV) way, we compute a locally optimal controller in every step via solving the controller design problem repeatedly. The goal is finding a stabilizing optimal controller that minimizes the lower linear fractional transformation (LFT) for every $\rho$ in a gain-scheduled manner as

$\min_{k}\ \big\| \mathcal{F}_{l}\big( \Sigma(\rho), k \big) \big\|, \quad (22)$

which turns into the following quadratic optimization problem ^{skogestad2007multivariable}:

$\min_{u}\ \int_{0}^{\infty} \big( x^{\top} Q x + u^{\top} R u \big)\, dt. \quad (23)$
The solution to the above optimization can be given in closed form via the control algebraic Riccati equation ^{kwakernaak1972linear}.
$Q$ and $R$ are positive (semi)definite diagonal weighting matrices, serving as tuning parameters for the controller. $Q$ penalizes the performance, including the tracking error, and $R$ penalizes the control input. Assigning high diagonal elements to $Q$ emphasizes tracking: in our case, the tracking error $\varepsilon$ shall be minimized, i.e., its weight shall be high compared to the other diagonal elements belonging to $q$ and $q'$. That is, we do not want to minimize the Q-values themselves. We can also keep the weight for the control input low (cheap control ^{hespanha2018linear}), because the extra reward in the form of $u$ does not have a physical meaning and does not result in excess energy consumption. On the other hand, $u$ acts as an arbitrary reward that will distort the learning dynamics.
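A locally optimal gain for fixed kernel values can be computed from the control algebraic Riccati equation with SciPy. The sketch below uses the two-state plant of Eq. (17) (the input matrix structure and all weights are illustrative assumptions, not the paper's tuned values); the augmented design follows the same recipe:

```python
import numpy as np
from scipy.linalg import solve_continuous_are

def lq_state_feedback_gain(K, K_next, eta=0.05, gamma=0.99,
                           Q_w=(1.0, 1.0), R_w=0.01):
    """Stabilizing LQ gain for the two-state plant of Eq. (17).

    Assumption: u enters only the q-channel (B = eta * [K, 0]); the pair
    (A, B) is then controllable whenever K != 0 and K' != 0."""
    A = eta * np.array([[-K,      gamma * K],
                        [-K_next, gamma * K_next]])
    B = eta * np.array([[K], [0.0]])
    Q = np.diag(Q_w)
    R = np.array([[R_w]])
    P = solve_continuous_are(A, B, Q, R)   # control algebraic Riccati equation
    return np.linalg.solve(R, B.T @ P)     # k = R^{-1} B^T P

k = lq_state_feedback_gain(K=1.0, K_next=0.5)
A_cl = (0.05 * np.array([[-1.0, 0.99], [-0.5, 0.495]])
        - 0.05 * np.array([[1.0], [0.0]]) @ k)   # closed-loop state matrix
```

By Riccati theory, the closed-loop matrix `A_cl` is Hurwitz for any positive definite weights, which is exactly the stabilization the scheduled design relies on.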
Denoting the elements of the state-feedback controller as $k = [k_1,\ k_2,\ k_3]$ with $u = -k x$, the closed-loop system can be written as

$\dot{x} = \big( A_{\mathrm{aug}} - B_{u} k \big) x + B_{r} r. \quad (24)$

The control input only affects $\dot{q}$ directly. Computing the actual $u$ corresponding to the states, we can write the Q-value change at $(s_t, a_t)$ as

$\dot{q} = \eta K \big( r + \gamma q' - q - k_1 q - k_2 q' - k_3 \varepsilon \big). \quad (25)$

The controller gains can be placed inside the parenthesis since the control input acts through $\eta K$. $k_1$ and $k_2$ affect the stability of Q-learning, while $k_3$ influences the tracking error.
Next, we calculate the controlled loss function based on Eq. (25). To this end, we recall the chain rule and use the definition of the NTK (with detailed notations) to achieve

$\frac{d}{dt} Q(s_t, a_t; \theta) = \eta\, \nabla_{\theta} Q(s_t, a_t; \theta)^{\top} \nabla_{\theta} Q(s_t, a_t; \theta) \big( r + \gamma q' - q - k_1 q - k_2 q' - k_3 \varepsilon \big). \quad (26)$

Then, the controlled loss is obtained by integrating Eq. (26) with respect to $\theta$. In order to obtain a similar form to Eq. (2), we assume only $Q(s_t, a_t; \theta)$ is $\theta$-dependent when performing the integration. We make this assumption based on the following points.

When we first introduced the evolution of the weights (Eq. (3)), we assumed the more common direct method over the residual-gradient method ^{baird1995residual}; thus, the $\theta$-dependency of the temporal difference target is not considered.
The controller gains $k_1$, $k_2$, and $k_3$ are constants.

The terms in the integral (Eq. (19)) depend only on past values of $\theta$ (implicitly). Therefore, this term can be considered constant when integrating with respect to $\theta$.
Integrating Eq. (26) with respect to $\theta$, considering the above assumptions, we get

(27)

Eq. (27) is the loss function for the controlled agent. The terms in the loss are weighted by the controller's parameters, helping convergence at the cost of biasing the true Q-values. Note that the controller is designed for the nominal plant, without considering all the uncertainties in $\Delta$. Although the controller is conservative ^{bokor2012robust}, dynamic Q-value stabilization is only guaranteed in a local sense. With this approach, we cannot handle the uncertainties in $\Delta$ explicitly. On the other hand, the LQ controller is inherently robust up to a certain level of multiplicative uncertainty ^{bokor2012robust}. The parametric uncertainty is handled in a gain-scheduling way: the optimal controller can be recomputed in every episode via evaluating the NTK repeatedly. This is computationally intensive and bears the risk that a parameter combination occurs that cannot be stabilized, rendering the learning divergent. In the sequel, the convergence of deep Q-learning is aided by robust control: instead of considering fixed parameter combinations, the variations in the parameters and the states are explicitly included in the controller design.
3.2 $\mathcal{H}_{\infty}$ controller design
In this section, we propose two types of robust controllers. First, we outline the controller design procedure via a generic robust dynamical controller, where a linear time-invariant system computes the control input. Second, we fix the structure of the controller and utilize constant gains, akin to the state-feedback controller.
Although the parametric uncertainty could be explicitly computed, doing so is computationally inefficient and would lead to a parameter-varying controller, as demonstrated for the gain-scheduling case. This inefficiency motivates the formulation of a robust controller: it can be synthesized before learning, and it remains stabilizing during learning for all combinations of states and parameters (see Figure 3). The $\mathcal{H}_{\infty}$ design framework is capable of handling every uncertainty in $\Delta$ in a robust way. Furthermore, in the controller design procedure, the system's response is shaped via dynamically weighting the inputs and outputs of the system. Therefore, the low-frequency nature of the controlled learning agent can be exploited too.
The aim is controlling the nominal system $G$, encompassing Eq. (17), disturbed by noise through the $\Delta$ block. The controller has two inputs: the noisy states fed back and the reference signal, identical to the tracking error of the state-feedback case: $\gamma q' + r - q$. In the $\mathcal{H}_{\infty}$ design, performance is enforced through tunable weights that give the desired shape to the singular values of the open-loop response.

$W_u$ penalizes the control input and $W_p$ penalizes the error $\gamma q' + r - q$. As discussed before, learning is done in the low-frequency range. Therefore, we want good tracking performance (a large $W_p$) at low frequencies. Although the control input has no physical interpretation, it should be dynamically weighted too. At higher frequencies, tracking shall be penalized more in order to reduce the singular values.
(28)

(29)

Bode diagrams of $W_p$ and $W_u$ are shown in Figure 6. Note that these weights turn out to be universal, regardless of the RL environment.

$W_{\Delta}$ shapes the uncertainty. It is considered constant (with magnitude varying from environment to environment); generally, we can say that it is constant at low frequencies, where learning is meaningful. Its magnitude has a peak at an extremely high frequency, which is unimportant for the learning.

The purpose of the remaining weights is to normalize the reference signals and inject reference-related dynamics into them. Here, they are considered frequency-independent with environment-specific magnitudes.
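The qualitative shape of such performance and control weights can be sketched with SciPy; the corner frequencies and gains below are illustrative placeholders, not the values used in the paper:

```python
import numpy as np
from scipy import signal

# W_p: high gain at low frequency (tight tracking), rolling off at high frequency.
W_p = signal.TransferFunction([1.0], [10.0, 0.01])      # ~40 dB at DC
# W_u: cheap control at low frequency, penalized at high frequency.
W_u = signal.TransferFunction([10.0, 0.1], [1.0, 100.0])  # ~-60 dB at DC

w = np.logspace(-4, 4, 200)            # frequency grid in rad/s
_, mag_p, _ = signal.bode(W_p, w)      # magnitudes in dB
_, mag_u, _ = signal.bode(W_u, w)
```

Plotting `mag_p` and `mag_u` against `w` reproduces the qualitative picture of Figure 6: the performance weight dominates at low frequencies, where learning happens, and the control weight takes over at high frequencies.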
The closed-loop system interconnection in the so-called generalized plant structure, which is the general form of the $\mathcal{H}_{\infty}$ design, is depicted in Figure 7. The controller block is responsible for selecting $u$. The two input channels of the nominal plant $G$ correspond to the reference and control signals, respectively. By applying the weighting and the compensator, the augmented plant $P$ can be formalized as

(30)
The closed-loop transfer function from the exogenous signals $w$ to the performance outputs $z$ can be expressed, provided that the inverse exists, via a lower LFT as

$T_{zw} = \mathcal{F}_{l}(P, \mathcal{K}) = P_{11} + P_{12} \mathcal{K} \big( I - P_{22} \mathcal{K} \big)^{-1} P_{21}, \quad (31)$

where

$P = \begin{bmatrix} P_{11} & P_{12} \\ P_{21} & P_{22} \end{bmatrix}. \quad (32)$
In $\mathcal{H}_{\infty}$ control, the aim is finding a controller $\mathcal{K}$ that minimizes the impact of the disturbance on the performance output, measured by the induced (worst-case) norm:

$\big\| T_{zw} \big\|_{\infty} = \sup_{\omega} \bar{\sigma} \big( T_{zw}(j\omega) \big) < \gamma_{\mathrm{att}}, \quad (33)$

where $\gamma_{\mathrm{att}}$ is a prescribed disturbance attenuation level, progressively lowered by iteration ^{zhou1998essentials, skogestad2007multivariable}. (Here $\gamma_{\mathrm{att}}$ denotes a norm bound, not to be confused with the discount factor $\gamma$.)
The resulting controller is an LTI system with three inputs and one output, $u$. The synthesized controller has several internal states, which are reset after each episode. The controller is given in the form

$\dot{x}_{K} = A_{K} x_{K} + B_{K} y_{K}, \qquad u = C_{K} x_{K} + D_{K} y_{K}, \quad (34)$
where $A_K$, $B_K$, $C_K$, and $D_K$ characterize the controller. Next, we compute the controlled learning loss with the same assumptions as for the state-feedback case. The controlled evolution of $q$ is

(35)

If the target is independent of $\theta$ and the control signal only indirectly depends on the states of the learning agent, the controlled quadratic loss can be written as