I. Introduction
Mechatronic drivetrain systems face increasingly demanding performance and efficiency requirements in industrial and manufacturing applications. They furthermore need to become more autonomous while interacting with a varying environment. Adequate motion control needs to address these challenges. When designing, implementing and tuning controllers, knowledge of the dynamics of the mechatronic system is key. Accurate position and speed control of servo drive systems, for instance, requires detailed knowledge of the inertia and friction [kim2018moment]. Based on ab initio physical modelling principles it is possible to approximate the real system behavior. However, capturing the full mechatronic system dynamics is often cumbersome and challenging, as mechatronic systems are plagued by nonlinear and complex dynamic behavior due to interacting components [kim2018moment, wang2019adaptive, papageorgiou2018robust]. Parameter identification procedures can subsequently be engaged to align the model more closely with the real mechatronic system [schon2011system]. Nonetheless, despite tremendous engineering modeling efforts, uncertainties may still be present.
Unfortunately, the optimality of control system design is strongly affected by the modelling fidelity [mayne2000constrained]. When designed offline, controllers are approximate due to the inherent uncertainties in the real world that are not incorporated in the modelling. They furthermore require tremendous tuning efforts, e.g. finding gains in PID controllers to track setpoints. A wide range of adaptive control strategies have been designed to alleviate this issue online: adapting PID gains [kuc2000adaptive], online gravity compensation [yang2021new] or, starting from an approximate model, adapting the parameters in linearly parameterized model predictive control (MPC) [adetola2011robust] or of a fuzzy approximation of the remaining unknown system influences in sliding-mode control [yang2021adaptive]. In the stochastic optimal control framework, stochastic uncertainties are introduced into the deterministic optimal control problem, resulting in linear quadratic Gaussian [athans1971role] and stochastic model predictive control formalisms [mesbah2016stochastic], which are variants of the LQR and deterministic MPC, respectively. In the framework of MPC, offline design strategies include open-loop and feedback minimax formulations [lofberg2003minimax] and tube-based formulations. Initially, robust tube-based MPC was used for linear systems in process control [limon2010robust] and has more recently been elaborated further for mechatronic systems [yan2016tube]. Tube-based MPC approaches the control problem by first solving the MPC problem for the nominal system. In a second stage an ancillary state feedback control law is designed to confine the error between nominal and actual states within an invariant tube.
Strategies such as tube-based MPC are able to cope with uncertainties but do not aim to reduce them by learning intelligently from the input-output behavior of the controlled system. Learning-based techniques such as iterative learning control [wang2009survey] have been devised that adapt with respect to the tracking error. These iterative learning control algorithms can work under a model-free assumption [chi2017improved], but can only compensate for periodic disturbances. Also targeting repetitive tasks, a recent MPC strategy was proposed relying solely on a data-driven model [rosolia2018data]. Reinforcement learning (RL) methods, on the other hand, adapt to the actual behavior of the mechatronic system interacting with a variable environment, and directly learn an optimal feedback policy (referred to as the agent). As opposed to the aforementioned techniques, the learned policy is state-dependent, where state can be interpreted broadly as any measurement information, e.g. camera images, and is independent of any time-, state- or trajectory-dependent periodicity or underlying dynamics.
Hence RL opens up interesting new perspectives on adaptivity. Assuming the parametrized policy is sufficiently expressive, the training procedure of RL is capable of generalizing to variable operating conditions and changing environmental settings [lewis2009reinforcement]. Moreover, RL can handle control problems that are difficult to approach with conventional controllers, because the control goal can be specified indirectly as a term in a reward function with no explicit requirements on its form or dependencies. The main disadvantage of RL is that it can only gain new insights by interacting with the system and the environment in real time, leveraging the actions it takes to probe and explore the optimality landscape. Since this process relies on the stochasticity inherent to the system or on deliberate perturbation, it may lead to unsafe situations, limiting the usage of RL to non-safety-critical applications [dulac2019challenges]. Recently, related work on the adaptive and robust control of nonlinear systems using actor-critic RL has been developed. These methods, however, still require an approximate model or system identification, are designed for reference tracking only [fu2020mrac, na2020adaptive], or rely on the convergence of the critic function approximators, and as such on the quality of the collected dataset, for robustness guarantees [radac2020robust].
Next to the aforementioned safety and stability issues, the amount of explorative trials needed to learn and find optimal control actions is significant. To face the issue of data-efficient training, the literature proposes the use of off-policy algorithms [wang2016sample]. As opposed to on-policy algorithms [schulman2017proximal], off-policy methods can train the agent on data generated with another, canonical controller. As a consequence, off-policy algorithms strongly improve the data-efficiency; however, they still require a significant amount of trials, which may not always be feasible for real-world applications [dulac2019challenges]. Nevertheless, the simple observation that off-policy algorithms can be merged with conventional controllers is of particular interest and sheds a new light on the adaptivity issue raised earlier.
In this paper we explore the possibilities of a residual architecture to address the limitations hindering the straightforward application of traditional RL in an industrial setting. On the one hand we rely on a traditional, suboptimal but stabilising control law. On the other hand we superpose a residual RL agent that may adapt the control output in an attempt to optimize an auxiliary objective. This architecture, coined residual reinforcement learning (RRL), has been explored in earlier research and results in an efficient, safe and optimal control design. RRL was introduced recently to alleviate the exploration needs and increase tractability in terms of data-efficiency for data-driven robot control [johannink2019residual, silver2018residual]. By applying the reinforcement learning algorithm residually on a base controller that roughly approaches the control objective, the base controller ‘guides’ the reinforcement learning algorithm to an approximate solution, accelerating training. The constraints imposed on the residual agent were absolute, determined by the limits of the controlled system’s inputs.
Originally used to accelerate learning in robot control, RRL can be engaged to increase optimality in mechatronic motion control. The situation differs, however, in the setting of industrial motion control of mechatronic systems, where the challenge shifts to maintaining safe operation while realising adaptive motion control. By leveraging a base controller, residual RL improves the data-efficiency whose lack prohibits traditional RL from being used in such real-world mechatronic systems. However, the safety concerns, being the main reason impeding its adoption in industrial settings, remain unaddressed.
This paper explores the possibilities of adding RL next to an existing control law in a safe and robust manner. We do this by following a residual approach for which we introduce relative constraints on the residual agent. The ensuing objective of this paper is to realize adaptive motion control for mechatronic systems using RL, applicable in a real-world setting. This is approached in a twofold manner. First, we detail the design of a stable constrained residual learning methodology. We introduce an algorithmic adaptation to residual RL, employing relative constraints, and prove the stability of both methods using the Lyapunov method. Second, we validate the method’s ability to achieve adaptive control, improving the performance of motion control of a mechatronic system compared to the traditional controller. We provide implementation details on the presented methodology and demonstrate the results by applying it to a slider-crank setup and evaluating the Mean Absolute Error (MAE) of the objective. The contributions of this paper are as follows:

Extension of the Residual Reinforcement Learning framework to Constrained RRL. This allows the use of RL algorithms in industrial, safety-critical settings and as such enables online adaptive control without any assumptions or prerequisites about the system dynamics, control objective or form of the controller inputs.

Theoretical analysis of the developed method using Lyapunov stability theory, proving stability for a broad class of mechatronic systems even under worst-case conditions.

Experimental validation on a slider-crank, a nonlinear system with applications in many industrial systems.
In conclusion, we show that, where pure RL fails, this framework enables the use of these algorithms in real-world mechatronic settings.
II. Methodology
II-A Reinforcement Learning
Reinforcement learning operates within a standard (Partially Observable) Markov Decision Process (MDP) framework. An MDP is a tuple $(\mathcal{S}, \mathcal{A}, r, p)$
where $\mathcal{S}$ is the set of states, $\mathcal{A}$ the set of actions, $r(s_t, a_t)$ the reward for taking action $a_t$ in state $s_t$ and $p(s_{t+1} \mid s_t, a_t)$ the probability of transitioning to state $s_{t+1}$
following state $s_t$ and action $a_t$. We define a trajectory $\tau$ as a sequence of states and actions, $\tau = (s_0, a_0, s_1, a_1, \dots)$. The return is defined as the infinite discounted sum of rewards $\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)$ with $\gamma \in [0, 1)$ the temporal discount factor. Given an initial state $s_0$, the objective of any RL method is to solve the following stochastic optimal control problem by finding an optimal policy $\pi^*$ that maximizes the expected return
$$\pi^* = \arg\max_{\pi} \, \mathbb{E}_{\tau \sim \pi} \left[ \sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \right] \qquad (1)$$
with $a_t \sim \pi(\cdot \mid s_t)$ and $s_{t+1} \sim p(\cdot \mid s_t, a_t)$. In this paper, the Soft Actor-Critic (SAC) algorithm is employed as the RL method for all experiments. SAC is a state-of-the-art actor-critic RL method. This implies that both the estimate of the state-action value function and the policy are approximated using a neural network. Based on temporal differencing, these estimates are iterated until they satisfy the Bellman equation. SAC is unique with regard to other actor-critic methods in that it maintains a stochastic actor. Actions are realized by sampling from a Gaussian distribution whose mean and variance are output by the network. This has the advantage of encouraging exploration during training and achieving a higher stability after convergence. For further details we refer to
[haarnoja2018soft, haarnoja2018softalgandapp].
II-B Constrained Residual Reinforcement Learning
Recently introduced for robot control, residual reinforcement learning trains an RL controller residually on top of an imperfect, traditional controller [johannink2019residual, silver2018residual]. The RL algorithm leverages the traditional controller as an initialization to enable data-efficient reinforcement learning for tasks where traditional RL is intractable, such as robotic insertion tasks where rewards are sparse [schoettler2019deep]. Starting from a suboptimal, but adequate and robust controller, as often present in the motion control of industrial applications, we introduce the Constrained Residual Reinforcement Learning (CRRL) architecture.
Two advantages are central to the concept of CRRL. Firstly, the architecture leverages the traditional controller to guarantee robust exploration of the RL agent during operation. The robust controller can be tuned so that the exploration of the residual policy remains within the principal region of attraction. Secondly, as with basic RRL, the traditional controller provides a good initialization for the reinforcement learning algorithm, which may further improve the steepness of the learning curve.
In this contribution we study two variants of the basic RRL architecture: absolute and relative CRRL.
II-B1 Absolute CRRL
For absolute CRRL, we simply superpose the residual policy, parametrized by the parameters $\theta$, on the traditional controller. This produces a control input
$$u = \pi_c(s) + c\,\pi_\theta(\phi(s)) \qquad (2)$$
where $\pi_c$ is the traditional control algorithm, $\pi_\theta$ is the reinforcement learning policy,
$\phi$ is a preprocessing feature extraction map and
$c$ is a parameter determining the scale of the residual actions. Besides the state $s$, the map $\phi$ can contain any other feature that seems relevant¹ (¹Generally speaking, the feature map may even contain additional measurement information such as camera images.). Note that (2) corresponds mathematically with basic RRL employing an RL algorithm which uses a tanh to confine its actions, such as SAC [haarnoja2018soft], with special care taken to tune $c$. In robot control, e.g. connector insertion tasks [schoettler2019deep], choosing $c$ as an absolute constraint so as to confine the total actions within the feasible action space, such as the torque limits of the controlled actuators, suffices to learn a tractable policy that succeeds in its task. For industrial motion control, however, safety during all phases of the learning process is required. Determining bounds for the residual agent that guarantee safety is nontrivial. In Section II-C, we establish safety conditions that guarantee that, regardless of the residual agent, the system stays within the principal region of attraction of the traditional controller. However, the detailed modelling requirements may make this infeasible for some real-world applications in industry. Alternatively, for cyclic processes, one can take the output of the base controller during one cycle as a reference. This can, however, not be done in a straightforward manner when facing non-cyclic processes. Furthermore, it requires the scale of the base actions to remain consistent within one cycle to provide a safe constraint tube.
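As a minimal sketch of the absolute variant described above (function and parameter names are illustrative, not from the paper; the residual is assumed to be a tanh-squashed action in [-1, 1], and the clip represents actuator limits):

```python
def absolute_crrl_action(u_base, residual, c, u_min, u_max):
    """Absolute CRRL sketch: superpose a residual action, scaled by a fixed
    constant c, on the base controller output, then clip to the actuator
    limits. `residual` is assumed to lie in [-1, 1] (e.g. a tanh-squashed
    SAC action)."""
    u = u_base + c * residual
    return max(u_min, min(u_max, u))
```

With `c` chosen as an absolute bound on the actuator range, the total action never leaves the feasible input space, but the residual bound is unrelated to the momentary scale of the base action, which is exactly the limitation the relative variant below addresses.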
II-B2 Relative CRRL
To realize a residual agent maintaining safe operation irrespective of such considerations regarding the underlying process, we propose to constrain its actions relative to the base controller’s outputs. By constraining the reinforcement learning algorithm to a percentage of the classical control action and adding it to said action, we effectively create a tube around the classical control actions where the reinforcement learning algorithm is allowed to explore and learn corrective adaptations to the conventional controller’s output that improve its performance. Both during the learning phase, when exploration is the dominant behavior, and when encountering inputs that deviate from the training distribution after convergence, the classical control algorithm determines the bulk of the control input and thereby ensures a safe and robust operation. The resulting relative residual policy is:
$$u = \pi_c(s) + \epsilon\,\pi_c(s)\,\pi_\theta(\phi(s)) \qquad (3)$$
with $\epsilon$ a parameter constraining the actions of the neural network relative to the actions of the base control algorithm. Fig. 1 shows an overview of the relative CRRL structure.
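The relative tube can be sketched in one line (names are illustrative; the residual is again assumed in [-1, 1]):

```python
def relative_crrl_action(u_base, residual, eps):
    """Relative CRRL sketch: the residual is constrained to a fraction eps
    of the base controller's output, creating a tube of half-width
    eps*|u_base| around the base action. `residual` is assumed in [-1, 1]."""
    return u_base + eps * u_base * residual
```

Note the consequence, discussed with the experimental results: when the base controller outputs zero, the residual action is also constrained to zero.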
II-B3 Algorithm
The implementation of the absolute CRRL architecture is straightforward. That of the relative CRRL architecture is more subtle. As the actor network is trained through gradient descent to minimize its loss [haarnoja2018soft], multiplying the actor output by $\epsilon\,\pi_c(s)$ directly scales the gradients of all actor network parameters by this fraction of the base controller’s output. This gives more weight, during the training of the actor network, to situations where the base controller’s output is large. To alleviate this imbalance, one can opt to use the unscaled output for training and only scale the action when applying it to the system itself. The input state for the networks can then be extended with the base controller’s output to again have full state information. However, in practice, we have found the residual controller’s performance to benefit from the scaling during training as well. Therefore the former option is used throughout the experiments.
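The two training options above can be contrasted in a small sketch (function and argument names are ours, not the paper's):

```python
def actor_training_signal(raw_output, u_base, eps, scale_during_training):
    """Sketch of the two options for what the actor trains on: the residual
    scaled by eps*u_base (the option used in the experiments), or the raw,
    unscaled actor output, in which case u_base must be appended to the
    network's input state instead."""
    if scale_during_training:
        # gradients through this signal are scaled by eps*u_base, giving
        # more weight to samples where the base action is large
        return eps * u_base * raw_output
    # unscaled: gradient magnitude is independent of the base action
    return raw_output
```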
II-B4 Convergence
For the convergence condition, we rely on the proof given in [haarnoja2018soft] for SAC in the theoretical tabular case, which is approximated for practical use in continuous domains by using the neural networks as function approximators. By considering the base controller as a part of the system on which the residual SAC agent acts and assuming its robustness, the same conditions for convergence hold. The adapted algorithm of [haarnoja2018softalgandapp], which automatically balances the stochasticity of the actor as a function of the reward, further promotes convergence in practice by lowering the variance of the policy in states where the policy achieves high rewards.
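The stochastic actor behavior described above can be illustrated with a minimal sketch, assuming the standard tanh-squashed Gaussian policy used by SAC (names are ours):

```python
import math
import random

def sample_sac_action(mu, sigma, rng):
    """Sketch of SAC's stochastic actor: the network outputs the mean and
    standard deviation of a Gaussian; a sample is drawn and squashed with
    tanh so the action lies in (-1, 1)."""
    z = rng.gauss(mu, sigma)
    return math.tanh(z)

# As the automatic temperature adjustment lowers sigma in high-reward
# states, sampled actions concentrate around tanh(mu).
rng = random.Random(0)
wide = [sample_sac_action(0.0, 2.0, rng) for _ in range(1000)]
narrow = [sample_sac_action(0.0, 0.01, rng) for _ in range(1000)]
assert all(-1.0 < a < 1.0 for a in wide + narrow)
assert max(abs(a) for a in narrow) < max(abs(a) for a in wide)
```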
II-C Closed-loop tracking stability guarantees for mechanical systems using a PI baseline controller
Here we analyse the closed-loop stability of the proposed learning approach. We consider a CRRL controller both with absolute (2) and relative constraints (3). The experimental performance of both controllers is investigated further in Section III-B3. Our synthesis focuses on closed-loop tracking stability guarantees for mechanical systems using a PI baseline controller $\pi_c$. This provides a generic setting that meets the requirements of many practical examples from industry. We aim to establish safety guarantees when worst-case conditions are met during exploration, not to make claims about near-optimal behaviour after convergence. Therefore, we treat the residual agent as a disturbance whose actions may destabilize the system. Our analysis allows us to determine robust gain values for the baseline controller that guarantee stability regardless of the actions taken by the residual agent.
Our analysis is based on classic Lyapunov stability theory and can be summarized in the following two theorems. We note that the proofs of both theorems rely on a particular choice of Lyapunov function. Therefore, these theorems are illustrative of the fact that it is possible to obtain conditions on the baseline control settings corresponding to specific safety guarantees in the context of CRRL. On the other hand, our conditions might be overly conservative and possibly weaker conditions exist based on other Lyapunov functions.
Theorem 1.
Consider a mechanical system with generalised coordinates $q$, input $u$ and reference trajectory $q_d$, so that the tracking error is $e = q - q_d$, and whose dynamics are governed by the equation
$$M(q)\ddot{q} + C(q,\dot{q})\dot{q} + g(q) = u \qquad (4)$$
Let the absolute CRRL policy be defined as
(5) 
Then the closed-loop error trajectories are bounded by
(6) 
with , , and defined as in theorem 4 and where
(7) 
where and matrices and defined as in theorem 4 if also , and .
For the proof and the definitions of the norms we refer to Appendix B.
Theorem 2.
Consider the same controlled system as in Theorem 1. Let the relative CRRL policy be defined as
(8) 
Then the closed-loop error trajectories are bounded by
(9) 
where
(10) 
if it also holds that
,
and
.
For the proof we refer to Appendix B.
These two theorems suggest that if the closed-loop system is initiated in a given state, the error will never grow beyond the corresponding magnitude. Provided a desired value for this bound, and depending on the CRRL architecture, it is possible to choose values for the baseline gains, and therefore for $c$ or $\epsilon$, so that the conditions in either Theorem 1 or 2 are satisfied. For practical calculation, note that all variables in Theorems 1 and 2 depend only on the norms of the matrices defined in Appendix B. These norms can be calculated with knowledge of the initial state bounds as per the proofs' assumptions. For the slider-crank system defined in Table I and a controller with the given parameters, the conditions evaluate to 0.36, 35.28 and 0.999 respectively to ensure stable convergence.
III. Results and Discussion
In this contribution we study the CRRL methodology on a physical slider-crank setup. A PI controller is chosen to obtain a stable system. Combining RL with more complex controllers, such as MPC controllers for online parameter tuning [zanon2020safe] or state-feedback controllers, is possible but out of the scope of this paper, which is to improve upon, and be directly modular with, a controller commonly used in an industrial setting. Note that the CRRL framework nonetheless allows for the use of any base controller.
III-A Experimental setup
III-A1 Slider-crank linkage
Present in many industrial applications, a slider-crank provides reciprocating linear motion through a rotary motor in combination with a bar linkage system. Fig. 2 shows a schematic overview as well as a picture of the experimental setup. The dimensions of the setup are detailed in Table I. This system exhibits highly nonlinear behavior [de2019neural] and is often plagued by unidentified load disturbances and unknown interactions with the environment. To achieve adequate control under these conditions, typically some form of PID controller is used. This strategy is adequate in applications with low requirements on precision, but suffers from suboptimality for systems with varying loads or environment conditions. Coping with these uncertainties requires either retuning the controller or having knowledge of the disturbances interacting with the system, which is often infeasible in practice. This application is of direct relevance to various industrial systems, e.g. compressors [gao2010filter], hydraulic pumps [li2019design], weaving looms [eren2005comparison] and presses [zheng2014modeling].
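The nonlinearity of the mechanism is already visible in its ideal kinematics. The following sketch uses the textbook slider-crank relation, not the exact geometry of the experimental setup in Table I:

```python
import math

def slider_position(theta, r, l):
    """Ideal slider-crank kinematics: slider displacement for crank angle
    theta, crank length r and connecting-rod length l (textbook relation,
    assumed geometry)."""
    return r * math.cos(theta) + math.sqrt(l**2 - (r * math.sin(theta))**2)

# The angle-to-displacement map is nonlinear: equal crank-angle steps give
# unequal slider steps, one source of the position-dependent dynamics that
# a fixed-gain PI controller cannot fully compensate over a revolution.
```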
Parameter  Value 

Length of link 1 ()  
Length of link 2 ()  
Distance to link 1 center of mass ()  
Distance to link 2 center of mass ()  
Inertia of link 1  
Inertia of link 2  
Mass of link 1  
Mass of link 2  
Mass of sliding block  
Motor friction coefficient 
III-B Experiments
The control problem that we consider is the tracking of an angular speed reference through torque control of the system. Due to the nonlinear behavior of the system and the large influence of friction on its dynamics, this requires a nonlinear control signal that is sensitive to external influences, making it a challenging task for a PID controller to achieve high performance. All results shown in this section are averaged over at least five runs, with mean and min-max shown. The SAC policy, implemented from scratch to allow for the algorithmic changes, is trained using the mean squared error (MSE) of the instantaneously measured deviation from the required rotational velocity, $r_t = -(\omega_{\mathrm{ref}} - \omega)^2$. In the discussion of the results, the Mean Absolute Error (MAE) is used to allow the algorithm's properties to be interpreted intuitively, as this metric focuses less on outliers. In the figures, however, both metrics are shown.
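The training signal and the reporting metric can be sketched as follows (the negative-squared-error form of the reward is our reading of the text; names are illustrative):

```python
def step_reward(omega_ref, omega):
    """Training reward sketch: negative squared deviation of the measured
    angular velocity from the reference (an MSE-style penalty)."""
    return -(omega_ref - omega) ** 2

def mae(errors):
    """Mean Absolute Error, the reporting metric: less sensitive to
    outliers than the squared error used for training."""
    return sum(abs(e) for e in errors) / len(errors)
```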
The sampling of the action is kept on for the entire experiment here, since the stochasticity is limited after convergence, as described in Section II-B4. The state used is composed of the crank's rotational velocity and the sine and cosine of its angle, requiring no extra sensors besides the encoder already used by the base controller. As with any RL algorithm, tuning of the hyperparameters is a necessary step to obtain the desired performance for CRRL as well. Nonetheless, we have found the robustness to hyperparameter settings of SAC with automatic temperature adjustment [haarnoja2018softalgandapp] to hold for CRRL employing SAC as well, with only the batch size and learning rate having a notable effect on the outcome, provided no unconventional values for the other parameters are set. To illustrate this robustness and to allow easy comparison, the same SAC hyperparameters, listed in Table II, are used throughout all CRRL experiments in this paper. The computation time for one epoch on a computer with a 6-core Intel i7-8700 CPU with 8 GB RAM is 0.87 seconds. Note that this is only necessary for training the network. During deployment, only a forward pass of the actor network is needed at each timestep, which consists of only 2436 FLOPs. Section III-B1 discusses the general behavior of a CRRL controller, followed by a more in-depth examination of its different features in the subsequent subsections.

Parameter  CRRL  RL
Optimizer (all networks)  Adam  Adam
Learning rate (all networks)  3e-4  1e-5
Discount  0.97  0.9
Batch size (randomly sampled from replay buffer)  256  256
Replay buffer size  1e6  1e6
Number of hidden layers  Actor network  2  2
Number of hidden layers  Critic network  3  2
Number of neurons per hidden layer  Actor network  32  32
Number of neurons per hidden layer  Critic network  128  32
Nonlinearity  ReLU  ReLU
Target smoothing coefficient  0.005  0.005
III-B1 Learning process
Fig. 3 shows the performance of a relative CRRL controller and an averagely well-tuned PI controller as base policy tracking a constant angular reference of 60 rpm. This PI controller, as well as the optimally and poorly tuned ones employed later, has been tuned through a grid search on the system for each reference signal. The grid ranges from 0.1 to 1.2 for the proportional gain and from 0.1 to 2.6 for the integral gain, in steps of 0.1. These bounds were determined empirically. $\omega_{\mathrm{ref}}$ denotes the desired and $\omega$ the actual crank angular velocity. The general form of the learning process displayed by this configuration is illustrative of all other configurations mentioned hereafter.
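The grid search over the PI gains can be sketched as follows; `evaluate` is a placeholder for running the closed loop at the given gains and returning a cost such as the MAE, and the bounds and step are the ones reported above:

```python
def grid_search_pi(evaluate, kp_range=(0.1, 1.2), ki_range=(0.1, 2.6), step=0.1):
    """Exhaustive grid search over PI gains. `evaluate(kp, ki)` runs the
    closed loop and returns a cost (lower is better); the default bounds
    and step match the grid described in the text."""
    best_gains, best_cost = None, float("inf")
    n_kp = round((kp_range[1] - kp_range[0]) / step) + 1
    n_ki = round((ki_range[1] - ki_range[0]) / step) + 1
    for i in range(n_kp):
        kp = round(kp_range[0] + i * step, 10)  # avoid float drift
        for j in range(n_ki):
            ki = round(ki_range[0] + j * step, 10)
            cost = evaluate(kp, ki)
            if cost < best_cost:
                best_gains, best_cost = (kp, ki), cost
    return best_gains
```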
The performance of a PI controller on the slider-crank setup varies slightly from run to run despite lab conditions with limited external disturbances. Therefore each experiment starts with an initial run-in phase in which only the PI controller acts on the system to benchmark the results (the blue shaded region in Fig. 3). This variability of the PI controller's performance over different runs is illustrative of the difficulties in optimally tuning a controller for all conditions. Note that the reward shown is not the instantaneous reward of one timestep, but the average reward over one revolution. As such, the error offset of the PI controller does not illustrate a steady-state error, but the inability of the PI controller to compensate for the nonlinearities throughout one revolution, which amounts to the mean error shown. After this phase, the residual controller is activated at epoch 65. For our experiments, an epoch is defined as 500 measurement points sent to the PC. In the beginning of training, the Q-values are small and the entropy term dominates the objective [haarnoja2018soft]. This leads to actions that are sampled nearly uniformly from the distribution output by the SAC policy [haarnoja2018soft], resulting in random and therefore possibly unsafe controller outputs. In Fig. 3 one can see that the drop in performance caused by the exploration phase in CRRL, although unavoidably still present, is strongly limited. This intuitively corresponds to assuming a robustness of the base controller to limited disturbances, i.e. the residual actions constrained by the parameter $\epsilon$, as can often be assumed of an industrially employed controller, and confirms the theoretical findings of Section II-C. The effect of constraining the residual actions to the base policy is examined further in Section III-B3.
After the exploration phase, the performance of the RL algorithm improves until it converges to a residual policy that gives a stable improvement of approximately 13% compared to the base PI controller.
III-B2 RL benchmark
A standalone SAC controller was pretrained to mimic an optimally tuned base controller and subsequently trained to further optimize the learning objective. A suitable hyperparameter combination, listed in Table II, was obtained after a two-week period of manual tuning. The resulting loss curve is shown in Fig. 3, with the red shaded region indicating the pretraining period. For this best-performing set of parameters, the algorithm converges to an MAE of approximately 0.55 rad/s after 200 epochs. During exploration, the error unavoidably reaches up to 6.28 rad/s occasionally, i.e. standstill, due to the full freedom of the SAC algorithm. This indicates that the residual controller benefits from the base controller both in limiting unsafe behaviour during exploration and in converging to a high-performing policy. Having reached the limit of the performance attainable by tuning the base algorithm, it also demonstrates the modularity of CRRL with respect to existing controllers, without needing system-specific adaptations such as policy constraints or specific reward shaping, which would be required to further decrease the standalone RL algorithm's error.
III-B3 The importance of constraining residual actions
In relative CRRL (3), the actions are constrained to a tube whose width is a percentage of the base controller's output. Fig. 4 compares CRRL with relative constraints to a residual policy with absolute constraints (2), tracking a constant angular velocity reference of 60 rpm. To ease the comparison, the bounds of the absolute tube are expressed as a percentage of the largest base controller output during a cycle without residual controller. The base controller is an averagely tuned PI controller. Fig. 4 on the left shows the average performance improvement after convergence of the CRRL controller relative to the average PI controller performance during the first 65 epochs. The right side shows a boxplot of the relative decrease in performance of all epochs after activating the residual controller where the reward was lower than the average PI reward. The dotted red line indicates the largest negative deviation by the PI controller itself from its average reward during the first 65 epochs. These conventions are maintained for the remainder of the results.
For a residual controller within an absolute tube of 20%, both the decrease in performance during exploration and the improvement after convergence are substantially larger than for a relative tube of 20%, as is to be expected due to the increased freedom given to the residual controller. In the next paragraph, this trade-off is discussed in more detail. To compare with the improvement obtained by a relative bound, we experimentally found a residual controller with an absolute tube of 7.5% to have a similar decrease in performance as a relatively constrained controller of 20%. The final increase in performance, however, reaches only approximately 54% of the increase reached by the relatively constrained controller. This indicates that the relative constraint method of (3) is advantageous for achieving higher optimality while maintaining safe operation. This method is employed for all experiments throughout the remainder of this paper.
The relative constraints are regulated by the parameter $\epsilon$. Fig. 5 shows the influence of $\epsilon$ by comparing both the convergence and the exploration performance of a CRRL controller relative to the allowed deviation. As a larger allowed deviation gives the residual controller more freedom, determining $\epsilon$ is a trade-off between the eventual increase in performance attainable and the decrease possible during exploration.
III-B4 CRRL performance compared to base policy performance
A key desirable feature of CRRL is that it does not reduce the performance of an already optimal controller. The best PI parameters for the current setup and a constant angular velocity reference of 60 rpm were determined through the grid search. Fig. 6 compares both the performance after convergence and during exploration of a CRRL controller relative to the base policy for this PI controller as well as for an averagely and a poorly tuned one.
The CRRL controller attains a similar increase in performance for both non-optimally tuned base controllers, and it succeeds in improving the best base policy by 5% on average. Note that the reinforcement learning algorithm sometimes does not succeed in finding an improvement for this base controller, in which case it learns to give nearly zero output so as not to decrease the performance, as the min-max interval line indicates. This result allows the CRRL methodology to be deployed on controllers that are likely to be optimal as well, as the residual controller learns to refrain from adapting if no improvement is found, instead of diminishing performance. We also want to emphasize that no additional hyperparameter tuning was carried out to optimize the results for each experiment. For all configurations, the median of the decrease in performance caused by the exploration is less than the maximum decrease that was observed for the base controller itself. Notably, for both non-optimally tuned base controllers, the largest observed PI performance decrease is larger than or close to 75% of the decreases observed after activating the CRRL controller. Although the relative decrease in performance grows as the performance of the base policy increases, larger deviations are seldom and are statistical outliers.
Fig. 7 on the left shows the residual actions taken by a CRRL controller after convergence with an optimally tuned base PI controller, as well as the bounds between which it is allowed to operate. The residual policy has learned to either add a specific signal, maximally reinforce, or maximally counteract the base policy to increase optimality. As the constraints are relative, the residual actions are constrained to 0 whenever the base controller outputs zero as well. The regions where the residual signal reaches the limits of its allowed deviation suggest that a looser bound might result in a more optimal policy after convergence. This is a trade-off with the possible performance loss during the exploration phase, as investigated in Subsection III-B3. Fig. 7 on the right illustrates the base policy's control signal and the total control signal with the residual added. Note the limited changes to the base signal that result in a 5% performance improvement.
III-B5 CRRL adaptivity to different references
In Fig. 8, the performance is shown when tracking either a constant reference of 60 or 90 rpm or a sinusoidal reference as a function of the crank angle, [rpm], with the crank angle. The base controllers are those that were found to have the best performance for each reference, respectively (, ), (, ) and (, ). Note that for the higher reference speed, excessive shaking of the setup caused by a high gain limits the practically feasible values. The figure shows how the same residual controller succeeds in learning improvements for different base controllers tracking different references, demonstrating adaptivity to the operating conditions. Note that for the sinusoidal reference, which is more challenging for the base PI controller to track, only some outlier cycles over all runs show a larger temporary decrease in performance than the largest decrease observed when only the base controller acts. In line with previous results, the difference in base controller for the constant references causes a large difference in the relative performance decrease observed during the exploration phase.
IV Conclusions and Future Work
In this paper, we proposed CRRL, a method to improve the optimality of conventional controllers that are robust but suffer from suboptimality when faced with uncertain operating conditions. In CRRL, a reinforcement learning algorithm learns corrective adaptations to the conventional controller's output, directly from the controlled system's operating data. By adding the adaptation residually on top of the base controller's output and constraining it by that output, the robustness of the base controller is leveraged to limit the possible performance decrease during the learning process of the residual agent. Lyapunov stability theory was used to establish safety guarantees of the proposed method, even under worst-case conditions, for a broad class of mechatronic systems. The performance of CRRL was validated experimentally on a slider-crank setup tracking a speed reference with a PI base controller. The method is shown to substantially improve the performance after convergence for different configurations of the base controller tracking different references, while maintaining safe operation at all times. The structure of the constraints applied to the residual agent's actions was investigated, and it was shown experimentally that constraints relative to the base controller's output are beneficial to limit the possible performance decrease during training while achieving a substantial improvement after convergence. In future work we will focus on expanding the method with adaptive, state-dependent constraints for the residual agent. One consideration is that, even though the CRRL architecture greatly limits the performance decrease during exploration, the nearly uniformly sampled actions inherent to SAC exploration would nonetheless be too brisk for, e.g., a position-controlled system. As a next step, we will explore how to design the exploration process for different situations to increase the general applicability of CRRL.
Appendix A Tracking stability of mechanical systems
We start by recalling some ingredients from Lyapunov stability theory.
Definition 1.
A strict Lyapunov function for at is quadratic if it is analytic and there exist three positive constants , and so that and .
Theorem 3.
Consider a disturbed dynamical system . Let satisfy Definition 1 at for the undisturbed system (i.e. ) on and assume there exists a positive constant such that . Then the response of the disturbed system from any initial condition is bounded by .
Proof.
See reference [koditschek1987quadratic], Theorem 2. ∎
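The constants in the statement of Theorem 3 were lost in extraction; as a reading aid, the following sketch states the standard ultimate-boundedness argument behind such a result in generic notation of our own choosing (the symbols $\alpha_i$, $c$, and $\bar d$ below are our labels, not necessarily the paper's):

```latex
% Generic sketch (assumed notation). V quadratic at the origin means there
% exist positive constants \alpha_1, \alpha_2, \alpha_3 with
%   \alpha_1 \|x\|^2 \le V(x) \le \alpha_2 \|x\|^2,
% and \dot V \le -\alpha_3 \|x\|^2 along the undisturbed flow \dot x = f(x).
% For the disturbed system \dot x = f(x) + d(t) with \|d(t)\| \le \bar d and
% \|\nabla V(x)\| \le c\,\|x\| (which holds for quadratic V), we get
\[
  \dot V \le -\alpha_3\|x\|^2 + c\,\bar d\,\|x\| < 0
  \quad \text{whenever} \quad \|x\| > \frac{c\,\bar d}{\alpha_3},
\]
% so trajectories are ultimately bounded:
\[
  \limsup_{t\to\infty} \|x(t)\|
  \;\le\; \sqrt{\frac{\alpha_2}{\alpha_1}}\;\frac{c\,\bar d}{\alpha_3}.
\]
```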
Further, we are interested in the stability of the closed-loop system dynamics of a mechanical system governed by , . We use the feedback linearisation control policy, , where is defined as the tracking error w.r.t. a reference . Further assume that the reference satisfies , , . Finally, we assume the system is initialised so that and . The closed-loop system dynamics are given in (11). Note that these dynamics correspond with those in Theorem 3.
(11) 
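The symbols in the preceding paragraph and in (11) were lost in extraction; as a reading aid, the standard feedback-linearisation derivation for a mechanical system reads as follows in one common (assumed) notation:

```latex
% Assumed generic notation, not necessarily the paper's symbols.
% Mechanical system:   M(q)\ddot q + C(q,\dot q)\dot q + g(q) = u
% Tracking error:      e = q - q_r
% Feedback linearisation policy:
%   u = M(q)\bigl(\ddot q_r - K_d \dot e - K_p e\bigr) + C(q,\dot q)\dot q + g(q)
% Substituting and using the invertibility of M(q) gives linear error dynamics:
\[
  \ddot e + K_d\,\dot e + K_p\,e = 0
  \qquad\Longleftrightarrow\qquad
  \frac{\mathrm{d}}{\mathrm{d}t}
  \begin{pmatrix} e \\ \dot e \end{pmatrix}
  =
  \begin{pmatrix} 0 & I \\ -K_p & -K_d \end{pmatrix}
  \begin{pmatrix} e \\ \dot e \end{pmatrix}.
\]
```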
The following theorem allows us to make claims about the stability of the system in (11).
Theorem 4.
Let be a smooth curve, with and . Further, let be defined as
(12) 
with and where , then is quadratic for system (11) with where if and . The norms are defined as , and . Note that from the definition of the norms it follows that and .
Proof.
See reference [koditschek1987quadratic], Proposition 10, Corollary 11 and Proposition 12. ∎
Appendix B Proof of Theorems 1 and 2
Proof.
We can analyse the tracking stability of an absolute CRRL agent defined as by analysing the disturbance term
(13) 
and adopting the Lyapunov function from Theorem 4
(14) 
where . Further we rely on the results from Theorem 3, Definition 1 and Theorem 4 to show that with , and as in Theorem 4, as defined above and
(15) 
where . Note that it follows that . ∎
Analogously, we can analyse the tracking stability of a relative CRRL agent defined as . Here we could perform an analysis similar to the one in the previous proof, identifying the associated disturbance and determining the corresponding value for . However, this disturbance would depend on the error and may therefore be overly conservative. Alternatively, we can analyse the undisturbed closed-loop dynamics corresponding with the control policy . Then the closed-loop system dynamics are as in (11) but with and substituted for and respectively, where and . If we can derive bounds for , and in this context so that , as defined in Theorem 4, satisfies Definition 1 for the resulting closed-loop system, we can again rely on Theorem 3 to establish a bound on the error .
Proof.
The derivative of the Lyapunov function along the motion of the system is given by
(16) 
For the last term it holds that where and with . After factoring out, , for the middle term
(17) 
since and .
Substituting these inequalities into the equation for yields
(18) 
The second term is if . Finally we can rewrite the matrix difference as
(19) 
The first matrix is positive semidefinite if and the second if . ∎