This paper is a study of reinforcement learning (RL) as an optimal-control strategy. RL, a machine learning (ML) technique, mimics the learning abilities of humans and animals. OpenAI has used RL to program robot hands to manipulate physical objects with unprecedented human-like dexterity [b_OPENAI]; Stanford's CARMA program has applied it to autonomous driving [b_CARMA]; and it has been studied for faster de-novo molecule design [b_DENOVO].
Valves were selected as the control plant as they are ubiquitous in process control and employed in almost every conceivable manufacturing and production industry. The controller, called an “agent” in RL terminology, is trained using the DDPG (Deep Deterministic Policy-Gradient) algorithm.
Industrial process loops involve thousands of valves and can be impossible to model accurately. Applying traditional control strategies such as PID (proportional-integral-derivative) to such processes can degrade quality and efficiency and substantially increase costs. PIDs are the de-facto industry standard; according to an indicative survey they cover more than 95% of process-industry controllers [b_DESBOROUGH].
RL promises better control strategies by learning optimal control through direct interaction with the plant (such as valves), eliminating the need to model the plant accurately.
Connecting a computer to a real physical plant and having the RL agent learn through direct interaction may not always be feasible. A practical alternative is to simulate the plant as closely as possible and train the agent against the simulation; this is the approach employed in this paper.
The literature on valve nonlinearity was surveyed to create a benchmark plant model for training the RL agent.
MATLAB Simulink™ is used to simulate a nonlinear valve, an industrial process, the agent training circuit and finally a unified validation circuit to evaluate RL and PID strategies side-by-side. The agent is trained using MATLAB’s recently launched (R2019a) Reinforcement Learning Toolbox™ [b_MATLAB].
Graded Learning, a technique discovered accidentally during this research, is a simple procedural method to efficiently train an RL agent on complex tasks; in effect it is the most simplified form of the more formal method known as "Curriculum Learning" [b_NARVEKAR], [b_WENG_CURRICULUM].
Summary research contributions of this work:
A basic understanding of RL as an optimal-control strategy.
Methodology targeted to assist practising plant engineers in applying Reinforcement Learning for optimal control in industry.
Design and simulation techniques using MATLAB and Simulink™, instead of the more demanding open-source Python stack.
Graded Learning: a semi-novel "coaching" method, based on the naive form of Curriculum Learning. It is suitable for practising engineers and is an application-oriented adaptation of the more formal and algorithmic "Curriculum for Reinforcement Learning".
A short literature review of three published studies of RL used for control of valves.
Experimental comparison of PID and RL strategies in a unified framework.
Stability analysis of the RL controller in time and frequency domains.
Experiential learning corroborated with published literature.
Finally, while the valve is the focus of the paper, the methods are adaptable to any industrial system.
2 Reinforcement Learning Primer
In this section we take a brief look at conventional optimal-control solving methods, followed by an overview of RL, its connection with optimal-control and finally the DDPG algorithm selected for implementation.
Sutton and Barto’s book [b_BARTO] is the most comprehensive introduction to reinforcement learning and the source for theoretical foundations below.
2.1 Optimal Control and RL
Feedback controllers are traditionally designed using two different philosophies namely “adaptive-control” and “optimal-control”. Adaptive controllers learn to control unknown systems by measuring real-time data and therefore employ online learning. Adaptive controllers are not optimized since the design process does not involve minimizing any performance metrics suggested by users of the plant [b_FRANK].
Conventional optimal-control design, on the other hand, is performed off-line by solving Hamilton–Jacobi–Bellman (HJB) equations. According to [b_FRANK] solving HJB equations require complete knowledge of the dynamics of the plant and according to [b_TEDRAKE] this in turn requires an engineered guess as a start.
Richard Bellman's extension of the century-old theory laid by Hamilton and Jacobi, and Ronald Howard's 1960 work on solving Markovian Decision Processes (MDPs), formed the foundations of modern RL. Bellman's approach used the concept of a dynamic system's state and of a "value-function". Dynamic programming, which uses the Bellman equation and is a "backward-in-time" method, along with temporal-difference (TD) methods, enabled building optimal adaptive-controllers for discrete-time systems (i.e. time progressing as t = 0, 1, 2, …) [b_BARTO].
2.2 Optimal Control
The Hamilton-Jacobi-Bellman (HJB) equation (1) provides a sufficient condition for optimality [b_TEDRAKE].
Controller policies (i.e. behaviors) are denoted by π and optimal policies by π*. If a policy π* and a related cost-function J* are defined such that π* minimizes the right-hand side of the HJB equation (1) and drives it to zero, then π* is an optimal policy.
Equation (1) assumes that the cost-function is continuously differentiable in the state and in time, and since this is not always the case it does not cover all optimal-control problems. In [b_TEDRAKE], Tedrake shows that solving the HJB equation depends on an engineered guess; for example, a first-order regulator is designed starting from a guessed form of the cost-to-go, and a Linear Quadratic Regulator is designed similarly. For complex, dynamic mechanical systems such initial solutions are hard to guess unless severely approximated; in situations like these RL shows the relative ease with which real-world optimal-controllers can be learned.
2.3 The RL framework
The core elements of RL are shown in Fig.1.
The learner and decision-maker is called the agent. The agent interacts with its environment continually, selecting actions to which the environment responds by presenting a new situation to the agent. The environment provides feedback on performance via rewards (or penalties). Rewards are scalar values. Over time, the agent attempts to maximize (or minimize) the rewards and this reinforces good actions over bad actions and thus learns an optimal behavior, formally termed a policy.
In control-system terminology, the agent is the controller being designed. The environment is the system outside the controller, i.e. the valve, the industrial process, the reference signal, other sensors, etc. The policy is the optimal-control behavior the designer seeks. RL allows this behavior to be learned without explicit programming and without modeling the plant in excruciating detail.
Policy: The decision-making capability of the agent is based on a probability mapping from states to actions. This mapping is called a policy π, and π(a|s) is the probability that the agent selects action a when in state s.
Returns: Returns represent long-term rewards, gathered over time.
Discounting: Discounting provides a mechanism to control the impact of selecting an action whose reward is immediate versus one whose rewards are received far into the future. The discounted return is
G_t = R_{t+1} + γR_{t+2} + γ²R_{t+3} + … = Σ_{k=0}^{∞} γ^k R_{t+k+1}
where γ, the discount rate, is a parameter with 0 ≤ γ ≤ 1.
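The discounted return above can be computed with a simple backward recursion, G ← r + γG. A minimal sketch, in Python for illustration:

```python
def discounted_return(rewards, gamma):
    """Discounted return G = r1 + gamma*r2 + gamma^2*r3 + ...,
    computed backwards so each reward is discounted exactly once."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```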
Value-functions: These are functions of states (or of state-action pairs) that estimate how good it is to be in a given state (or to perform a given action in a given state). A reward signal provides feedback on how "good" the current action is in an immediate, short-term sense. In contrast, a value-function provides a measure of "goodness" in the long term and is defined in terms of the future expected return.
The value of a state s under policy π, denoted v_π(s), is the expected return starting in that state and following the policy thereafter.
Equation (5) is referred to as the Bellman equation; it forms the basis for approximately computing and learning v_π and is therefore central to all RL algorithms.
Q-function: By including the action, q_π(s, a) is defined as the expected return starting from state s, taking an action a, and thereafter following policy π.
Q-learning: Q-learning is an off-policy TD control algorithm that iteratively learns the Q-value. A value Q(s, a) is tracked for each state-action pair. When an action a is performed in some state s, the two elements of feedback from the environment, the reward r and the next state s′, are used in the update shown in (6):
Q(s, a) ← Q(s, a) + α [r + γ max_{a′} Q(s′, a′) − Q(s, a)]
where α is the learning rate and Q(s, a) is the learned estimate of the optimal action-value function q*(s, a).
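The tabular Q-learning update can be sketched as below, in Python for illustration; the dictionary-of-dictionaries Q-table is an illustrative choice, not the paper's implementation.

```python
def q_update(q_table, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step: move Q(s,a) toward the TD target
    r + gamma * max_a' Q(s',a'). q_table maps state -> {action: value}."""
    best_next = max(q_table[s_next].values()) if q_table[s_next] else 0.0
    td_target = r + gamma * best_next
    q_table[s][a] += alpha * (td_target - q_table[s][a])
    return q_table[s][a]
```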
Optimal value-function: There always exists at least one optimal policy π* that guarantees the highest expected return, with the corresponding optimal state-value function v*(s) and optimal action-value function q*(s, a).
Model-based and model-free RL methods: Accurate models of the environment allow “planning” the next action as well as the reward. By a model we mean having access to a “table” of probabilities of being in a state given an action and associated rewards.
RL methods that use environment models are called model-based methods, as opposed to simpler model-free methods. Model-free agents can only learn by trial-and-error [b_BARTO].
Actor-Critic methods: The actor-critic structure allows a forward-in-time class of RL algorithms that can be implemented in real-time. The actor component, under a policy, applies an action to the environment and receives feedback that is evaluated by the critic component. Learning proceeds in two steps: policy-evaluation, performed by the critic, followed by policy-improvement, performed by the actor.
2.4 The DDPG algorithm
MATLAB’s R2019a release provides six RL algorithms. DDPG is the only algorithm suitable for continuous action control [b_MATLAB].
In [b_LCRAP] Lillicrap et al. introduced DDPG to overcome the shortcomings of the DQN (Deep Q-Network) algorithm which in turn was an extension of the fundamental Q-learning algorithm.
DDPG is a model-free, policy-gradient-based, off-policy method; it uses a memory replay-buffer to store previous experiences. As an actor-critic algorithm it uses two neural networks. The actor network accepts the current state as input and outputs a single real value (the valve control signal) representing the action chosen from a continuous action space. The critic network evaluates the actor's output (the action) by estimating the Q-value of the current state given this action. Actor network weights are updated by a deterministic policy-gradient algorithm, while the critic weights are updated by gradients obtained from the TD error signal. DDPG therefore simultaneously learns both a Q-function and a policy, interleaving the two.
Exploration vs. exploitation: For RL agents, as for humans, performance improves through exploitation of the actions that provided the highest reward in the past. However, to discover the best actions in the first place, the agent must explore the action space. Balancing the discovery of new actions with continuous improvement of the best-known action is a common challenge in RL, and various exploration-exploitation strategies have been developed.
DDPG uses the Ornstein-Uhlenbeck process (OUP) to enable exploration [b_OUP]. Interestingly, OUP was developed for modeling the velocities of Brownian particles with friction which results in values that are temporally correlated. The simpler additive Gaussian noise model causes abrupt changes from one time-step to the next (i.e. uncorrelated) whereas the OUP noise model more closely mimics real life actuators that exhibit inertia [b_LCRAP].
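An Euler-discretized OU step can be sketched as follows, in Python for illustration. The default θ and σ values are common illustrative choices, not the paper's tuned parameters.

```python
import random

def ou_step(x, mu=0.0, theta=0.15, sigma=0.2, dt=1.0, rng=random):
    """One Euler-discretized step of the Ornstein-Uhlenbeck process:
    dx = theta*(mu - x)*dt + sigma*sqrt(dt)*N(0,1).
    Successive samples are temporally correlated, unlike i.i.d. Gaussian noise,
    which loosely mimics an actuator with inertia."""
    return x + theta * (mu - x) * dt + sigma * (dt ** 0.5) * rng.gauss(0.0, 1.0)
```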
The exploration policy is constructed by adding noise sampled from the OU process to the selected action (i.e. the actor policy) at each training time-step.
3 Control Valves and RL
Control-valves modify fluid flow rates using an actuator mechanism that responds to a signal from the control system. Processing plants consist of large networks of such control-valves designed to keep a process-variable (such as pressure, temperature, or flow) under control. These variables must be held within a specified operating range to ensure the quality of the end-product [b_ISA].
3.1 Nonlinearity in valves
Control-valves, like most other physical systems, possess nonlinear flow characteristics such as friction and backlash. Friction in turn has two components: stiction, the static friction, is the inertial force that must be overcome before there is any relative motion between the two surfaces and is the prime cause of dead-band in valves, while dynamic friction is the friction in motion [b_CHOUDHURY_2004_Quantification], [b_CHOUDHURY_2004_Data_Driven].
Nonlinearity can cause oscillatory valve outputs that in turn cause oscillations of the process output resulting in defective end-products, inefficient energy consumption and excessive wear of manufacturing systems [b_CHOUDHURY_2004_Quantification], [b_CHOUDHURY_2005_Modelling]. According to [b_CHOUDHURY_2004_Quantification], 30% of process-loop oscillation issues are due to control-valves, while [b_DESBOROUGH] reports that valves are the primary cause of 32% of surveyed inefficient controllers. Stiction in control-valves has been reported as the prime source of sustained oscillations in industrial control-loops [b_CAPACI].
3.2 A mathematical valve model
RL requires experiences for training, and simulated environments often provide a quick and low-cost way to generate them. Since the objective is a controller usable in the real world, one must strive to create as accurate an environment as possible. This appears to contradict the earlier claim that RL does not require an accurate system model; however, it is assumed here that the real physical environment is inaccessible. Were it accessible, for instance as a lab setup, the RL agent (controller) could learn directly from real experiences.
In this paper we use first-principles to model the valve as outlined in [b_CAPACI].
He and Wang [b_HE_2007], [b_HE_2010] describe the nonlinear memory dynamics of the valve at a time-step k by relation (8). While the controller outputs a demanded position u(k), the actual position the valve attains is x(k), the difference being the valve position error. The parameters f_S and f_D are the static (stiction) and dynamic friction parameters, dependent on the valve type, size and application. The "Experimental Setup" section will later describe the Simulink modeling of the valve.
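Under the two-parameter friction model described above, a single time-step of the valve can be sketched as follows; this is an assumed reading of relation (8) (the exact form should be taken from [b_HE_2007]), written in Python for illustration.

```python
def valve_position(u, x_prev, f_s, f_d):
    """One time-step of a two-parameter stiction model: the stem stays stuck
    at its previous position x_prev until the controller output u overcomes
    the static friction band f_s; it then slips, losing f_d to dynamic
    friction. f_s and f_d are the static and dynamic friction parameters."""
    if abs(u - x_prev) <= f_s:
        return x_prev                    # stuck: dead-band not overcome
    direction = 1.0 if u > x_prev else -1.0
    return u - direction * f_d           # slip: move, minus dynamic friction
```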
3.3 RL for valve control: A literature research
The field of RL is relatively new and not many studies of its application for control of valves were found. Scopus brought up only 18 results for “reinforcement learning AND valves AND control”, Fig.5.
A study of three publications is presented below with emphasis on areas that can be compared with our research.
3.3.1 Throttle valve control
Throttle valves find application in both industrial and automotive industries.
Control of a throttle valve is challenging due to the highly dynamic behavior of the valve system's spring-damper design and its complex nonlinearities [b_BISCHOFF], [b_SCHOKNECHT]. The authors of [b_HOWELL] indicate that the challenge arises from the multiple-input-multiple-output nature of the throttle-valve optimization problem.
Bischoff et al. [b_BISCHOFF] use PILCO (probabilistic inference for learning control), a practical, data-efficient model-based policy-search method. PILCO reduces model bias, a key problem of model-based RL, by learning a probabilistic model of the dynamics and explicitly incorporating model uncertainty into long-term planning. PILCO works with very little data and facilitates learning from scratch in only a few trials, alleviating the need for the millions of episodes normally required by trial-and-error model-free methods [b_PILCO].
The throttle-valve dynamics are modeled using the flap angle, the angular velocity and the actuator input. The valve must be controlled at an extremely high rate of 200 Hz without any overshoot, since overshoot results in engine torque jerks. The controller learns by minimizing the expected sum of cost over time.
To apply the constraint of zero overshoot, a novel asymmetric saturating cost-function is applied as seen in Fig.6. A trajectory approaching the goal (red) incurs a rapidly decreasing cost as it nears the goal while overshooting the goal incurs a disproportionately high cost almost immediately [b_BISCHOFF].
The effectiveness of the asymmetric cost-function is evident in their results (blue) in Fig.7, with no overshoot and only a low-noise behavior of controlled profile.
3.3.2 Heating, ventilation and air-conditioning (HVAC) control
Wang et al. [b_WANG] use a model-free, proximal actor-critic based RL algorithm to control the nonlinear dynamics of HVAC systems, where the hot-water flow is governed by a power equation (10).
RL is compared to Proportional-Integral (PI) and Linear Quadratic Regulator (LQR) control strategies. 150 time-steps are used to allow the RL controller sufficient time to learn to track the set-point. Disturbances are simulated using random-walk algorithms. The actor network has two hidden layers of 50 units each, and the critic a single layer of 50 units. One interesting aspect of their network architecture is the use of GRUs (Gated Recurrent Units) to overcome the problem of vanishing/exploding gradients.
Their results show that the RL controller responds much faster than the LQR and PI controllers and tracks the reference signal better, achieving lower Integral Absolute Error (IAE) and Integral Square Error (ISE) than both competing strategies. However, the RL response is noisy, with very high variance compared to the smooth trajectories of the PI and LQR controllers, and shows significant overshoots.
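The IAE and ISE metrics used in such comparisons are simple discrete integrals of the tracking error; a minimal sketch, in Python for illustration:

```python
def iae(errors, dt):
    """Integral Absolute Error: discrete integral of |e(t)| over the run."""
    return sum(abs(e) for e in errors) * dt

def ise(errors, dt):
    """Integral Square Error: discrete integral of e(t)^2 over the run.
    Squaring weights large excursions (e.g. overshoots) more heavily."""
    return sum(e * e for e in errors) * dt
```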
3.3.3 Sterilization of canned food
Thermal processing used for sterilization of canned food results in deterioration of the organoleptic properties of the food. Controlling the thermal process is therefore important. In [b_SYAFIIE] Syafiie et al. apply Q-learning to learn the temperature profile that can be applied for the minimal time during the two stages of the thermal process — manipulation of the saturated-steam valve to cause heating and then cooling by opening the water valve.
A simple scalar reward set [+1.0, 0.0, -2.0] is used, penalizing an action that deviates from the desired state twice as heavily as a conforming action is rewarded. The paper does not evaluate continuous rewards. Fig.9 shows the controlled temperature profile.
Overall observations on the three researched papers:
Disturbances in the RL controlled signal are evident in all three implementations: ([b_BISCHOFF], [b_WANG] and [b_SYAFIIE]).
Use of stochasticity mechanisms other than OUP to enable exploration of action space: ([b_BISCHOFF] and [b_WANG]).
Use of a novel objective function in [b_BISCHOFF].
None of these evaluated the stability of the RL controller design — an important consideration for an emerging breed of controllers.
None used MATLAB as the design platform, which is unsurprising considering its Reinforcement Learning Toolbox was launched only in 2019.
Only [b_WANG] compared the RL against the traditional PID.
4 Experimental Setup
This section describes the creation of the experimental setup, using MATLAB and Simulink, for design and evaluation of the RL and PID controllers. Fig.10 shows the core components.
Our setup used elements from the excellent 2018 paper, ”An augmented PID control structure to compensate valve stiction” by Bacci di Capaci and Scali.
Traditional PID controllers tuned solely on process dynamics cause sustained oscillations, attributed to the integral component causing excessive variation of the control action to overcome static friction [b_CAPACI]. As a solution, [b_CAPACI] presented a novel PID-based controller, Fig.11(a), where stiction is overcome by employing a two-move control sequence (11) as the valve input.
where the first two quantities are estimates of the stiction and dynamic friction, and the third is the estimate of the steady-state position of the valve. This also shows the technique's reliance on correct estimation of these parameters.
The setup components:
A PID (with filter) controller tuned using MATLAB’s auto-tuning feature.
A training setup for the RL agent using the DDPG algorithm.
A unified framework for experimentation and evaluation of controllers.
Items below were based on [b_CAPACI]:
Nonlinear valve model (11), including the valve friction values f_S and f_D.
A “benchmark waveform” profile with noise parameters (Fig.11(d)).
4.1 Modeling the valve
Simscape Fluids™ (formerly SimHydraulics™) provides simulations for several valve types and is the simplest and quickest option; [b_POPINCHALK] is a MathWorks article on enhancing these into more realistic models using an understanding of system dynamics.
We, however, use first-principles and model the nonlinear valve mathematically. Algebraically rearranging the equations in (11) produces (12); these equations are then implemented in Simulink using a "user-defined-function" and a "memory" block, as shown in Fig.12, with the friction parameters f_S and f_D set as in [b_CAPACI].
4.2 Modeling the “industrial” process
4.3 PID controller setup
A PID controller's output is a function of the feedback error, represented in the time domain as
u(t) = K_p e(t) + K_i ∫₀ᵗ e(τ) dτ + K_d de(t)/dt (14)
where u(t) is the control signal and e(t) = r(t) − y(t) is the tracking error between the desired output r(t) and the actual output y(t). This error signal is fed to the PID controller, which computes both the derivative and the integral of the error with respect to time to provide set-point tracking; this continues in closed loop for as long as the controller is active.
The ideal theoretical PID form exhibits a drawback for high-frequency signals: the derivative action results in very high gain, so high-frequency measurement noise generates large variations in the control signal. Practical implementations reduce this effect by replacing the derivative term K_d s (in Laplace form) with the first-order filtered term K_d sN/(s + N), giving (15) [b_MURRAY].
The filter coefficient N determines the pole location of the filter, which attenuates the high gain on high-frequency noise. An N between 2 and 20 is recommended; a high value (N → ∞) makes (15) approach the ideal form (14) [b_MURRAY].
The PID was tuned using MATLAB's auto-tuning feature to obtain the coefficients K_p, K_i, K_d and the filter coefficient N. The low N obtained acts to suppress the derivative term.
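A discrete-time version of the filtered PID of (15) can be sketched as below; the backward-Euler discretization is an illustrative choice and the gains in the example are hypothetical, not the tuned values. Python is used for illustration.

```python
class FilteredPID:
    """Discrete-time PID in which the ideal derivative Kd*s is replaced by
    the filtered form Kd*s*N/(s + N), discretized with backward Euler."""
    def __init__(self, kp, ki, kd, n, dt):
        self.kp, self.ki, self.kd, self.n, self.dt = kp, ki, kd, n, dt
        self.integral = 0.0
        self.d_state = 0.0      # state of the first-order derivative filter
        self.prev_err = 0.0

    def update(self, err):
        self.integral += err * self.dt
        # low-pass filter the raw derivative; as N grows large the filter
        # pole recedes and the ideal (unfiltered) derivative is recovered
        alpha = self.n * self.dt / (1.0 + self.n * self.dt)
        raw_d = (err - self.prev_err) / self.dt
        self.d_state += alpha * (raw_d - self.d_state)
        self.prev_err = err
        return self.kp * err + self.ki * self.integral + self.kd * self.d_state
```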
4.4 RL controller setup
This section describes the Simulink design for training the RL controller using the DDPG algorithm.
Fig.15 shows the training setup. A switch allows testing a trained model on various signals built via a "signal-builder" block. Training an RL agent involves significant hyperparameter tuning, and this setup allows quick experiments and evaluations by activating a "software" switch.
4.4.1 RL controller design
4.4.2 Environment design
Several design factors need consideration when building the environment for efficiently training the agent to follow the trajectories of a control signal. They can broadly be classified as agent-related and environment-related. Agent-related factors are the composition of the observation vector and the reward strategy. Environment-related factors cover the training strategy, training signals, initial conditions of the environment, and criteria to terminate an episode (for episodic tasks).
4.4.3 Training strategy
One could train the RL agent to follow the exact benchmark trajectory (Fig.11(d)); however, this is a very constrained strategy. Instead, the agent was trained to follow straight-line signals of random levels, and was additionally challenged to start at a randomly initialized flow value. Together this forms an effective, generalized training strategy that teaches the agent to follow any control-signal trajectory composed of straight lines. The RL Toolbox allows overriding the default "reset function", which assists in implementing this strategy:
env.ResetFcn = @(in)localResetFcn(in, VALVE_SIMULATION_MODEL);
4.4.4 Observation vector
The observation vector used was [f(t), e(t), ∫e dt], where f(t) is the actual flow achieved, e(t) the error with respect to the reference, and ∫e dt the integral of the error.
Integral of error: The instantaneous error has no memory. The integral of error, which is the area under the curve as time progresses, provides a mechanism to compute the total error gathered over time and drive the agent to lower this (Fig.17).
This is an important observation input often used in training of RL controllers.
The observation vector is modeled as shown in Fig.18.
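Assembling this observation vector at each time-step can be sketched as follows; the function and argument names are hypothetical, and Python is used for illustration.

```python
def observe(flow, ref, err_integral, dt):
    """Build the observation [flow, error, integral-of-error] and return it
    together with the updated integral. The running integral gives the
    otherwise memoryless instantaneous error a memory."""
    err = ref - flow
    err_integral = err_integral + err * dt
    return [flow, err, err_integral], err_integral
```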
4.4.5 Rewards strategy
Rewards can be assigned via discrete, continuous or hybrid functions. Equation (16) is a simple discrete form.
where ε is some allowable error margin.
Equation (17) shows a reward that varies continuously as a function of the error e, with a small constant in the denominator to avoid a division-by-zero error.
Well designed continuous-reward functions help agents learn to be as close as possible to the reference signal during the early learning stages. Fig.19 shows the final implementation as a hybrid form. The reciprocal of the absolute error allows the controller to learn to drive the error lower and lower. The discrete part of the reward is the “penalty” block that assigns a set penalty for exceeding the flow limits.
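The hybrid reward just described can be sketched as below; the flow limits, margin and penalty values here are illustrative assumptions, not the paper's tuned values. Python is used for illustration.

```python
def hybrid_reward(err, flow, flow_lo=0.0, flow_hi=100.0,
                  eps=1e-3, penalty=-10.0):
    """Hybrid reward: a continuous 1/|error| term drives the agent to push
    the error ever lower, while a discrete penalty fires when the flow
    leaves its allowed limits."""
    reward = 1.0 / (abs(err) + eps)      # eps avoids division by zero
    if flow < flow_lo or flow > flow_hi:
        reward += penalty                # discrete out-of-range penalty
    return reward
```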
4.4.6 Actor and Critic networks
The actor-critic DDPG components were implemented as shown in Fig.20. The networks have fully-connected layers, initialized with small random weights before beginning the training.
The actor network output is normalized to be between [-1, 1] using a tanh layer. This allows better learning and convergence for continuous action spaces.
4.4.7 Ornstein-Uhlenbeck (OU) action noise parameters
Guidelines for computing the DDPG exploration parameters i.e. the noise model variance and the decay rate of the variance are provided by MATLAB [b_MATLAB_DDPG].
where T_s is the sampling time.
The half-life of the variance, in time-steps, is first decided; the decay rate of the variance is then computed using:
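Assuming the variance is multiplied by (1 − decay rate) at every training time-step, the decay rate for a chosen half-life follows directly; a minimal sketch in Python:

```python
def variance_decay_rate(half_life_steps):
    """Decay rate such that the noise variance halves every half_life_steps
    samples, assuming variance *= (1 - rate) at each training time-step."""
    return 1.0 - 0.5 ** (1.0 / half_life_steps)
```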
4.4.8 Final DDPG hyperparameters
Summarized below in Table 1 are the final set of DDPG hyperparameters.
| Hyperparameter | Value |
| --- | --- |
| Critic learning rate | 1 |
| Actor learning rate | 1 |
| Critic hidden layer-1 | 50 fully-connected |
| Critic hidden layer-2 | 25 fully-connected |
| Action-path bound | tanh layer |
| OUP variance decay rate | 1 |
4.5 Setup for comparative study
An environment that combines the PID and RL strategies for comparative evaluation is shown in Fig.21. It allows experimenting with various reference signals and studying the effects of noise added at three disturbance points: the input of the controller, the output of the controller (i.e. the input of the plant), and the output of the plant.
It also provides a convenient platform for additional experiments using elements such as set-point filters, output-smoothing filters, etc.
5 Graded Learning
Before presenting the results of the experiments we elaborate on a coaching method termed "Graded Learning". This simple, intuition-based approach was discovered accidentally during the hundreds of experiments and trials (163, to be exact) conducted in the attempt to train a stable RL agent. It is equivalent to the naive, domain-expert-dependent form of the more formal method known as "Curriculum Learning" [b_WENG_CURRICULUM], [b_NARVEKAR].
Applying automatic Curriculum Learning requires algorithmic design and the implementation of complex frameworks [b_PORTELAS], for example the ALP-GMM (absolute learning progress Gaussian mixture model) "teacher-student" framework, in which a "teacher" neural-network samples task parameters from a continuous space to generate a learning curriculum. Automated Curriculum Learning is currently not possible in MATLAB and will therefore be difficult for many practising engineers. Graded Learning, on the other hand, requires no programming and can be implemented directly by a control engineer.
Fig.22 shows examples of the numerous challenges faced during training: sometimes experiments ran for thousands of episodes without producing a stable learning curve, and sometimes they produced inexplicable controller actions. Some training trials lasted 20,000 episodes, running for over 20 hours; it is therefore important to streamline these efforts.
Graded Learning helped avoid some of these challenges. The intuition for Graded Learning was based on observing how human instructors structure coaching of a new skill for apprentices.
While new skills such as chess or tennis are taught with the final goal in mind, one never starts with the hardest lessons. Foundation level skills are taught first and once some level of proficiency is gained, the student graduates to the next level with marginally more complex problems than the previous level. Skills and experiences gained in the previous level are retained and progressively built upon as one moves from one level to the next.
Graded Learning extends this iterative, staged approach to RL. The RL task is first reduced to its most fundamental level, and the agent is trained for a set number of episodes or until a convergence criterion is met. The next level of complexity is then added to the task, and transfer-learning is used to ensure previous experience is retained and built upon. Once this level is learned, further complexity is added, with transfer-learning each time allowing the agent to build on the experience gained at the previous levels.
Transfer-learning is a machine learning technique that is used to “transfer” the learning i.e. stabilized weights of a neural-network from one task (or domain in general) to another without having to train the neural-network from scratch [b_KARL].
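The staged procedure above can be sketched as a simple loop; `train_stage` is a hypothetical stand-in for a toolbox training call, and the six-stage schedule below is illustrative, not the paper's exact parameter values. Python is used for illustration.

```python
def graded_learning(train_stage, stages, episodes_per_stage=1000):
    """Graded Learning: train on progressively harder environment settings,
    carrying the learned weights forward (transfer learning) at each stage.
    train_stage(params, weights, episodes) returns updated weights."""
    weights = None                       # stage 1 starts from scratch
    for params in stages:                # stages ordered easiest-first
        weights = train_stage(params, weights, episodes_per_stage)
    return weights

# Illustrative six-stage schedule ramping up time-delay and friction.
schedule = [
    {"delay": 0.0, "f_s": 0.0, "f_d": 0.0},
    {"delay": 0.5, "f_s": 2.0, "f_d": 1.0},
    {"delay": 1.0, "f_s": 4.0, "f_d": 2.0},
    {"delay": 1.5, "f_s": 6.0, "f_d": 2.5},
    {"delay": 2.0, "f_s": 7.0, "f_d": 3.0},
    {"delay": 2.5, "f_s": 8.0, "f_d": 3.5},
]
```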
The Graded Learning approach was discovered when the time-delay in (13) was reduced to zero and the agent quickly stabilized, in contrast to the hundreds of earlier attempts; building on this observation led to a satisfactorily trained, stable controller.
Fig.23 demonstrates the method in action, with the agent evolving over 6 stages of increasing difficulty. The parameters progressively increased are the time-delay and the static and dynamic frictions f_S and f_D.
Both the stability analysis and the experimental results presented next demonstrate that Graded Learning applied to valve control (and possibly to other complex industrial systems) is an effective way to coach an RL agent.
6 Experiments, Results and Discussion
In this section we present the results of experiments conducted on a unified framework and evaluate the RL controller’s performance and compare it with the PID (with filter) controller.
Before conducting the experiments a stability analysis of the RL controller must be carried out.
6.1 Stability Analysis of RL Control
A basic stability analysis of the RL control is attempted in this section.
The open-loop transfer-function of the system is the product of the controller and plant transfer-functions. The plant transfer-function is itself the product of the transfer-function of the FOPTD process (13) and that of the nonlinear valve, which is unknown and must be estimated.
Simulink's Control Design Linearization Analysis™ tool provides a GUI-based interface to generate a linear approximation of a nonlinear system, computed across specified input and output points. However, it does not allow any control over the estimation, in contrast to MATLAB's tfest function.
The programmatic method allows user-controlled estimation of the transfer-function by specifying the number of poles (np) and zeros (nz). Additionally, the iodelay parameter allows experimenting with the effect of time-delays in physical systems. This MATLAB function is based on [b_GARNIER].
sys = tfest(data, np, nz, iodelay)
The block-diagram in Fig.24 shows the points at which the controller's input and output signals are tapped to estimate the controller transfer-function, and the plant's input and output signals to estimate the complete plant transfer-function. Fig.25 is the Simulink setup to assist the estimation.
Estimated controller transfer-function: Equation (21) is the estimated continuous-time transfer-function for the controller.
We plot (Fig.27) the plant’s response using the estimated transfer-functions against the original RL signal, to ensure that it is reasonably close and will serve the purpose of gauging the stability. It must be noted that the estimation is approximate; this method is provided as a means of understanding the methodology of a very basic stability analysis.
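A basic stability check of this kind can also be carried out programmatically: from the estimated controller and plant transfer-functions one forms the closed-loop characteristic polynomial, and the Routh-Hurwitz criterion tests whether all its roots lie in the left half-plane. The sketch below uses illustrative polynomials, not the paper's estimates (21).

```python
# Routh-Hurwitz stability test on a characteristic polynomial,
# coefficients given highest power first. Illustrative sketch only.

def routh_stable(coeffs):
    """Return True if all roots lie in the open left half-plane,
    judged by sign changes in the first column of the Routh array."""
    n = len(coeffs)
    rows = [coeffs[0::2], coeffs[1::2]]
    while len(rows[1]) < len(rows[0]):      # pad second row with zeros
        rows[1] = rows[1] + [0.0]
    for i in range(2, n):
        prev, prev2 = rows[i - 1], rows[i - 2]
        if prev[0] == 0:
            return False                    # degenerate case: not proven stable
        row = [(prev[0] * prev2[j + 1] - prev2[0] * prev[j + 1]) / prev[0]
               for j in range(len(prev) - 1)]
        row.append(0.0)
        rows.append(row)
    first_col = [r[0] for r in rows[:n]]
    return all(c > 0 for c in first_col)

# s^3 + 6s^2 + 11s + 6 = (s+1)(s+2)(s+3): all roots in the LHP
stable = routh_stable([1.0, 6.0, 11.0, 6.0])
# s^3 + s^2 - s - 1 has a root at s = +1: unstable
unstable = routh_stable([1.0, 1.0, -1.0, -1.0])
```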
6.2 Experiments and Results
In this section we present the results of experiments conducted on a unified framework that tests two valve-control strategies — PID (with filter) and DDPG RL. Experiments with varying control signals, noise strengths and disturbance points were conducted, and a plant with process-loop perturbations was also tested. A critical time-domain analysis of the experimental results is presented, followed by a frequency-domain stability analysis.
Arbitrarily assumed constant reference level
Benchmark waveform (with noise)
Benchmark waveform subject to disturbances at:
Controller input (i.e. reference signal)
Plant input (i.e. controlled signal fed to plant)
Plant output (i.e. system output)
Practical example of a “water-supply” valve, subject to ground-borne vibrations of passing trains
Plant experiencing process loop-perturbations
Arbitrary control waveform
6.2.1 Experiment-1: Constant reference signal
Experiment: A basic analysis is best done on a simple constant reference flow-rate, arbitrarily set at 100 and run over 2,000. The reference signal is superimposed with the benchmark Gaussian noise.
Observations: Fig.30 shows the PID and RL trajectories. We observe that the PID has a large overshoot and settles in about 700, while the RL strategy demonstrates close-to-ideal damping and a quicker settling time of about 220. The RL trajectory shows tiny ripples against the PID’s smoother profile. These oscillations can reduce the remaining-useful-life (RUL) of a mechanical system, and we study this by conducting a (simplified) two-factor DOE (design of experiments).
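The overshoot and settling-time readings quoted above can be computed from a trajectory as sketched below. The step response here is synthetic, only to exercise the functions, and a ±2% settling band is an assumption.

```python
import math

# Time-domain metrics of the kind read off Fig.30: percent overshoot
# and settling time (last instant the trajectory leaves a +/-2% band).

def percent_overshoot(y, ref):
    return max(0.0, (max(y) - ref) / ref * 100.0)

def settling_time(t, y, ref, band=0.02):
    """The last time the trajectory is outside the band marks settling."""
    settled_at = t[0]
    for ti, yi in zip(t, y):
        if abs(yi - ref) > band * ref:
            settled_at = ti
    return settled_at

# Synthetic underdamped response around a reference of 100.
t = list(range(300))
y = [100.0 * (1 - math.exp(-i / 40.0) * math.cos(i / 15.0)) for i in t]

os_pct = percent_overshoot(y, 100.0)
ts = settling_time(t, y, 100.0)
```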
We vary two factors, time-delay and valve friction (combined static and dynamic), as shown in Table 3. The default values of the time-delay, static-friction and dynamic-friction are treated as the high levels, and each is lowered by a factor of 100 to obtain the low levels shown in Table 4.
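The resulting 2×2 full-factorial design can be generated as sketched below. The high-level values are illustrative placeholders; each low level is obtained by dividing the high level by 100, as described above.

```python
from itertools import product

# Sketch of the simplified two-factor DOE: each factor at a low and a
# high level gives a 2x2 full factorial (4 runs). Values are illustrative.

def make_levels(high, factor=100.0):
    return {"low": high / factor, "high": high}

def full_factorial(factors):
    names = sorted(factors)
    runs = []
    for combo in product(*[("low", "high") for _ in names]):
        runs.append({n: factors[n][lvl] for n, lvl in zip(names, combo)})
    return runs

factors = {
    "time_delay": make_levels(2.5),   # hypothetical high level
    "friction": make_levels(8.0),     # hypothetical high level
}
runs = full_factorial(factors)        # LL, LH, HL, HH combinations
```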
Fig.31(a) highlights the RL controller’s capability to produce a very smooth profile when both factors are low, implying that the oscillations are not introduced by the RL technique itself. Fig.31(c) shows that the oscillatory behavior is caused mainly by the time-delay factor.
While the PID strategy (15) is implemented with a filter that suppresses noise, no filters were added to the RL setup, in order to better understand the natural response of the RL control strategy.
6.2.2 Experiment-2: The benchmark signal
Observations: It is observed that the PID shows higher over- and under-shoots, while the RL shows better tracking of the reference-signal levels. If such a valve controls fluid flow, the higher and lower fluid quantities could be detrimental to product quality. In Fig.32, the shifted PID waveform after 800 could be detrimental to the process if it depends on the timing of the fluid flow.
6.2.3 Experiment-3.a: Noise at controller input
Experiment: Increased noise at the controller input.
Observations: Fig.34(a) and 34(b) show almost no impact on the PID when compared with Experiment-2 (lower noise at input) but increased impact on the RL trajectory, demonstrating the PID strategy’s superior noise attenuation capabilities. The RL continues to closely track the reference signal (along with the noise).
6.2.4 Experiment-3.b: Noise at plant input
Experiment: Shift the source of noise to the plant input.
Observations: Fig.35 shows that the PID trajectory is now impacted and loses the relatively smooth output seen in Experiment-1 and Experiment-2, while the RL strategy remains unaffected compared to Experiment-1. The PID strategy adjusts itself based on the error signal and hence shows a change in behavior; the RL strategy does not.
6.2.5 Experiment-3.c: Noise at plant output
Experiment: The effect of noise experienced at the plant output is studied here.
Observations: Fig.36(a) shows that the RL and PID strategies are affected equally by the noise.
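The three injection points of Experiments 3.a–3.c can be illustrated with a toy closed loop — a proportional controller and a discrete first-order plant, not the paper's FOPTD/valve model — where the same Gaussian noise sequence is added at the reference, at the plant input, or on the measured output.

```python
import random

# Sketch of the three disturbance-injection points: controller input
# (reference), plant input (controlled signal), or plant output
# (measurement). All loop parameters are illustrative.

def run_loop(where, steps=200, ref=1.0, kp=0.5, sigma=0.05, seed=0):
    rng = random.Random(seed)
    y = 0.0
    ys = []
    for _ in range(steps):
        noise = rng.gauss(0.0, sigma)
        r = ref + (noise if where == "controller_input" else 0.0)
        u = kp * (r - y)                       # proportional controller
        u += noise if where == "plant_input" else 0.0
        y = 0.9 * y + 0.1 * u                  # first-order plant update
        y_meas = y + (noise if where == "plant_output" else 0.0)
        ys.append(y_meas)
    return ys

traces = {w: run_loop(w) for w in
          ("controller_input", "plant_input", "plant_output")}
```

With a fixed seed the noise sequence is identical in all three runs, so any difference between the traces comes purely from where the noise enters the loop.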
6.2.6 Experiment-4: Water-supply valve, subject to ground-borne vibrations
Experiment: Valve applications can be exposed to extremely harsh conditions. A water-supply system, for example, may face ground-borne vibrations, such as from passing trains, in the range of about 30–200 Hz with varying amplitudes [b_TRAIN]. Since the control-valve assembly will often be placed in a shielded environment, frequencies between 30–100 Hz were assumed for the simulation.
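Such a disturbance can be approximated as a sum of sinusoids with random frequencies in the assumed band. The sketch below is illustrative: the component count, amplitude range and sampling rate are assumptions, not the paper's settings.

```python
import math
import random

# Sketch of a ground-borne vibration disturbance: a sum of sinusoids
# with random frequencies in the assumed 30-100 Hz band, random
# amplitudes and random phases.

def vibration(t, components, seed=1):
    rng = random.Random(seed)
    waves = [(rng.uniform(30.0, 100.0),           # frequency (Hz)
              rng.uniform(0.1, 1.0),              # amplitude (arbitrary units)
              rng.uniform(0.0, 2.0 * math.pi))    # phase (rad)
             for _ in range(components)]
    return [sum(a * math.sin(2.0 * math.pi * f * ti + p)
                for f, a, p in waves) for ti in t]

fs = 1000.0                                       # sample rate (Hz), assumed
t = [i / fs for i in range(1000)]                 # 1 second of signal
d = vibration(t, components=5)
```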
6.2.7 Experiment-5: Arbitrary control waveform with benchmark noise signal
Experiment: This experiment tests the generalization capability of the RL controller’s training strategy vis-à-vis the generalization of PID tuning. The “training” signal for both strategies was the benchmark waveform; this experiment subjects them to a completely different waveform.
Observations: Fig.38 shows that the RL controller considerably out-performs the PID strategy in this experiment. The RL controller tracks the arbitrary reference much more closely, demonstrating the importance of the training strategy in effective generalization. The PID trajectory, on the other hand, shows a significant lag while tracking the reference; if such a valve controls fluid flow, the untimely higher or lower fluid quantities could be detrimental to product quality.
Small ripples are evident in sections of the RL-controlled trajectory.
6.2.8 Experiment-6: Benchmark plant with process loop-perturbations
Experiment: This experiment tests resistance to severe process-loop perturbations, modeled by the transfer-function (22) [b_CAPACI].
Observations: A severe limitation of the RL controller is evident in this experiment. Fig.39 shows a significantly stunted output, clamped smoothly at around 35.0. The setup was then tested on a lower-magnitude reference (Fig.39(b)); the RL output remains clamped at the same level of 35.0, with increased oscillatory behavior at the lower flow magnitude. The PID appears to scale to different levels under the influence of the perturbations, albeit with significant error.
6.3 Discussion: Experiential Learning Validated against Published Research
“Experiential: relating to, derived from, or providing experience”
A total of 163 experiments were conducted during this research. When experiments did not respond to seemingly logical steps, the frustration prompted deeper investigation, and it was during this learning process that Graded Learning was discovered. In a quest to explain some of the strange observations, previously published studies were reviewed; these highlighted several known challenges, a reminder that RL is still an emerging field.
Early adopters of RL for control are encouraged to try both the Graded Learning method and study the literature referenced in this section — which is a collection of studies conducted at Google, MIT and Berkeley ([b_SONG], [b_HENDERSON], [b_HARDT] and [b_ZHANG]).
In [b_HENDERSON], the effects of hyperparameters and their tuning are analyzed with respect to network architecture, reward scaling and reproducibility for model-free, policy-gradient algorithms for continuous control; the study is therefore directly applicable to the subject of this paper.
6.3.1 Over-fitting and saturation
For physical systems there is always an upper limit of reward that the agent cannot cross. However, this limit is not known beforehand, and one often pushes the agent to continue training for hours. Significant neural-network saturation was observed in several of the training attempts.
Over-fitting in RL has been studied only recently. [b_SONG] examined over-fitting in model-free RL and observed that the agent often mistakenly correlates reward with spurious observation-space features, a phenomenon they term “observational overfitting”. In particular, they studied over-fitting with linear quadratic regulators (LQR) using neural networks and showed that, under Gaussian initialization of the policy trained using gradient descent, a generalization gap “must necessarily exist” [b_SONG].
Fig.40 shows multiple examples of over-training and its effect on learning curves.
[b_HARDT] provides a theoretical proof that stochastic gradient methods employing parametric models, when trained using fewer iterations, have vanishing generalization errors. They argue this through experiments and by using the stability criteria for learning algorithms devised by Bousquet and Elisseeff [b_BOUSQUET]. They conclude that shortened training time, by itself, suffices to prevent over-fitting. This work is important for extending the stability criteria developed for supervised learning to iterative algorithms such as RL.
6.3.2 Sensitivity to network architecture
Four policy-gradient methods, including DDPG, are analyzed in [b_HENDERSON]. While ReLU activations were stated to perform best, the effects were not consistent across algorithms or hyperparameter settings.
6.3.3 Sensitivity to reward-scaling
A large and sparse reward scale causes network saturation, resulting in inefficient learning, as was observed in Fig.41(b). Reward rescaling is a technique recommended to improve results for DDPG (Fig.41(a)); it is achieved by multiplying rewards by a scalar such as 0.1 or clipping them to [0, 1] [b_DUAN].
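The two rescaling options mentioned — multiplying by a small scalar such as 0.1, or clipping to [0, 1] — amount to:

```python
# Reward rescaling, as recommended for DDPG: either scale rewards by a
# small constant or clip them to a fixed interval.

def rescale(reward, scale=0.1):
    return reward * scale

def clip(reward, lo=0.0, hi=1.0):
    return max(lo, min(hi, reward))

rewards = [-5.0, 0.3, 12.0]          # example raw rewards
scaled = [rescale(r) for r in rewards]
clipped = [clip(r) for r in rewards]  # becomes [0.0, 0.3, 1.0]
```

Scaling preserves the relative ordering and spacing of rewards, while clipping discards magnitude information outside the interval; which is preferable depends on the reward design.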
6.3.4 Sensitivity to noise parameter
DDPG uses the Ornstein-Uhlenbeck process to aid exploration. The effect of the noise hyperparameter was not easily ascertained.
Based on (19), for the chosen decay rate and number of time-steps per full episode, the half-life of the exploration decay is about 150 episodes, as seen in Fig.42(a). However, there is an exploration explosion after about 700 episodes (Fig.42(b)). As an experiment, a severely reduced decay setting was used, implying a half-life of just about 15 episodes; however, Fig.42(c) shows no decay in exploration for over 1,000 episodes.
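The half-life arithmetic can be checked as follows. With a per-step decay rate d, the noise standard deviation after k steps is σ0(1−d)^k; the decay rate and episode length below are hypothetical values, chosen only so that the half-life comes out near 150 episodes as in Fig.42(a).

```python
import math

# Half-life of an exponentially decaying exploration-noise schedule:
# sigma_k = sigma0 * (1 - d)**k. Solve (1 - d)**k = 0.5 for k, then
# convert steps to episodes. Parameter values are hypothetical.

def half_life_episodes(decay_per_step, steps_per_episode):
    steps = math.log(2.0) / -math.log(1.0 - decay_per_step)
    return steps / steps_per_episode

hl = half_life_episodes(decay_per_step=4.62e-5, steps_per_episode=100)
# hl comes out close to 150 episodes for these assumed values
```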
It is possible that these mixed results agree with [b_PLAPPERT] in that explicit noise settings are not necessary for exploration in a continuous action-space. It must be noted that such results could also arise from inexplicable interaction effects among multiple hyperparameters.
6.3.5 Sensitivity to random seeds
Intuitively, different random seeds should not affect the results of a stable process. According to [b_HENDERSON], however, environment stochasticity coupled with stochasticity in the learning process has produced misleading inferences even when results were carefully averaged across multiple trials.
In conclusion, as stated by Henderson et al. [b_HENDERSON] one of the possible reasons for the difficulties encountered could be the “intricate interplay” of hyperparameters of policy gradient methods (such as DDPG).
On the design front, the process of training a model-free reinforcement learning agent was outlined.
Hyperparameter tuning requires significant effort and patience when building a stable controller. We proposed Graded Learning, a naive form of the Curriculum Learning method: an engineer starts at the lowest complexity level, finds appropriate hyperparameter settings, reward strategy and reward scales, and then gradually increases the control-task complexity. This avoids several of the problems mentioned earlier, for example network saturation. For most industrial control systems, Table 1 should be a good starting point.
On the application front, experiments were conducted to evaluate it against the conventional PID control strategy.
The experiments showed that the RL strategy’s trajectory tracking is superior to the PID’s. The PID demonstrates better disturbance rejection, with disturbances appearing more prominently on the RL-controlled signal. While this appears to be the prime limitation of the RL controller, it must be noted that the same was evident in the published implementations studied ([b_BISCHOFF], [b_WANG] and [b_SYAFIIE]).
When challenged to track a control profile it was not trained on, the RL controller performed better, while the PID lagged the reference control signal. The RL controller should therefore prove versatile when applied to different control tasks within the same environment, without having to be retrained.
Overall, the RL-controlled process promises better process quality, while the PID-controlled process places significantly lower stress on the valve operation, resulting in reduced wear-and-tear.
Enhancements and future work: The RL controller designed here needs a mechanism to reduce its oscillatory behavior in the presence of high-frequency disturbances with strong amplitudes. For noise at the input and output of the controller, a low-pass filter may help reduce the high variance.
Further work is necessary to understand ways of defining objective and reward functions that prevent the noisy RL trajectory behavior. If this succeeds, it would be a better solution than applying a filter, which would slow down the response.
The MATLAB R2019b release includes the Proximal Policy Optimization (PPO) algorithm for continuous control, which should be evaluated. PPO is a recent development and is considered more stable than DDPG [b_HENDERSON].
The fields of reinforcement learning, optimal control and control systems are extremely exciting. It is hoped that this research will motivate further work to help better understand, and hence popularize, the use of reinforcement learning for control systems.
This paper is a result of the work that began with the dissertation [b_RS] submitted to Coventry University, UK. I am immensely grateful for the encouragement and guidance I received during the dissertation work from my supervisors — Dr Olivier Haas, Associate Professor and Reader in Applied Control Systems at Coventry University, and Dr Prithvi Sekhar Pagala, Research Specialist at KPIT Technologies. Prof. Dr Acharya K.N.S must be thanked for instilling an interest in Control Systems through his teaching.
Rajesh Siraskar received the B.E. degree in Electronics and Telecommunications from Pune University, Pune, India, in 1990, and the M.Tech. degree in Automotive Electronics from Coventry University, UK, in 2020. He works as a Data Scientist and develops solutions for industries ranging from automotive to energy and pharmaceutical to cement. He was previously a Six Sigma Master Black Belt. He is a member of IEEE.