1 Introduction
This paper studies reinforcement learning (RL) as an optimal-control strategy. RL, a machine learning (ML) technique, mimics the learning abilities of humans and animals. RL has been used by OpenAI to program robot hands to manipulate physical objects with unprecedented human-like dexterity [b_OPENAI], by Stanford's CARMA program for autonomous driving [b_CARMA], and has been studied for faster de novo molecule design [b_DENOVO].
Valves were selected as the control plant because they are ubiquitous in process control and employed in almost every conceivable manufacturing and production industry. The controller, called an "agent" in RL terminology, is trained using the DDPG (Deep Deterministic Policy Gradient) algorithm.
Industrial process loops involve thousands of valves and can be impossible to model accurately. Applying traditional control strategies, such as PID (proportional-integral-derivative) control, can degrade the quality and efficiency of such processes and incur substantial costs. PIDs are the de facto industry standard and, according to an indicative survey, account for more than 95% of process-industry controllers [b_DESBOROUGH].
RL promises better control strategies by learning optimal control through direct interaction with the plant (such as a valve), thereby eliminating the need to model the plant accurately.
Connecting a computer to a real physical plant and having the RL agent learn through direct interaction may not always be feasible. A practical alternative is to simulate the plant as closely as possible to the real one and train the agent on the simulation; this is the approach employed in this paper.
The literature is surveyed for studies of valve nonlinearity in order to create a benchmark plant model for training the RL agent.
MATLAB Simulink™ is used to simulate a nonlinear valve, an industrial process, the agent training circuit and, finally, a unified validation circuit to evaluate the RL and PID strategies side by side. The agent is trained using MATLAB's recently launched (R2019a) Reinforcement Learning Toolbox™ [b_MATLAB].
Graded Learning, a technique discovered accidentally during this research, is a simple procedural method to efficiently train an RL agent on complex tasks; in effect, it is the most simplified form of the more formal method known as "Curriculum Learning" [b_NARVEKAR], [b_WENG_CURRICULUM].
Summary of the research contributions of this work:

A basic understanding of RL as an optimal-control strategy.

A methodology to assist practising plant engineers in applying Reinforcement Learning for optimal control in industry.

Design and simulation techniques using MATLAB and Simulink™, instead of the more demanding open-source Python ecosystem.

Graded Learning: a semi-novel "coaching" method based on the naive form of Curriculum Learning. It is suitable for practising engineers and is an application-oriented adaptation of the more formal, algorithmic "Curriculum for Reinforcement Learning".

A short literature review of three published studies of RL used for the control of valves.

Experimental comparison of PID and RL strategies in a unified framework.

Stability analysis of the RL controller in time and frequency domains.

Experiential learning corroborated with published literature.
Finally, while the valve is the focus of the paper, the methods are adaptable to any industrial system.
2 Reinforcement Learning Primer
In this section we take a brief look at conventional optimal-control solution methods, followed by an overview of RL, its connection with optimal control and, finally, the DDPG algorithm selected for implementation.
Sutton and Barto’s book [b_BARTO] is the most comprehensive introduction to reinforcement learning and the source for theoretical foundations below.
2.1 Optimal Control and RL
Feedback controllers are traditionally designed using two different philosophies, namely "adaptive control" and "optimal control". Adaptive controllers learn to control unknown systems by measuring real-time data and therefore employ online learning. Adaptive controllers are not optimal, since the design process does not involve minimizing any performance metric specified by the users of the plant [b_FRANK].
Conventional optimal-control design, on the other hand, is performed offline by solving Hamilton–Jacobi–Bellman (HJB) equations. According to [b_FRANK], solving HJB equations requires complete knowledge of the dynamics of the plant, and according to [b_TEDRAKE] this in turn requires an engineered guess as a starting point.
Richard Bellman's extension of the nineteenth-century theory laid by Hamilton and Jacobi, together with Ronald Howard's 1960 work on solving Markovian Decision Processes (MDPs), formed the foundations of modern RL. Bellman's approach used the concept of a dynamic system's state and of a "value function". Dynamic programming, which uses the Bellman equation and is a "backward-in-time" method, along with temporal-difference (TD) methods, enabled the building of optimal adaptive controllers for discrete-time systems (i.e. with time progressing in steps $k = 0, 1, 2, \ldots$) [b_BARTO].
2.2 Optimal Control
The Hamilton–Jacobi–Bellman (HJB) equation (1) provides a sufficient condition for optimality [b_TEDRAKE].
(1) $0 = \min_{u} \left[ \ell(x, u) + \frac{\partial J^*}{\partial x} f(x, u) \right]$
Controller policies (i.e. behaviors) are denoted by $\pi$ and optimum policies by $\pi^*$. If a policy $\pi$ and a related cost-function $J^{\pi}$ are defined such that the action $u^*$ minimizes the right-hand side of the HJB equation (1) and drives it to zero, then:
(2) $\pi^*(x) = u^* = \arg\min_{u} \left[ \ell(x, u) + \frac{\partial J^*}{\partial x} f(x, u) \right]$
Equation (1) assumes that the cost-function $J^*$ is continuously differentiable in $x$ and $t$, and since this is not always the case it does not cover all optimal-control problems. In [b_TEDRAKE], Tedrake shows that solving the HJB equation depends on an engineered guess; for example, a first-order regulator is designed with a guessed quadratic solution, and a Linear Quadratic Regulator is designed similarly. For complex, dynamic mechanical systems such initial solutions are hard to guess unless severely approximated, and in such situations RL shows the relative ease with which real-world optimal controllers can be learned.
2.3 The RL framework
The core elements of RL are shown in Fig.1.
The learner and decision-maker is called the agent. The agent interacts with its environment continually, selecting actions to which the environment responds by presenting a new situation to the agent. The environment provides feedback on performance via rewards (or penalties), which are scalar values. Over time the agent attempts to maximize rewards (or minimize penalties); this reinforces good actions over bad ones, and the agent thus learns an optimal behavior, formally termed a policy.
In control-system terminology, the agent is the controller being designed. The environment consists of everything outside the controller, i.e. the valve, the industrial process, the reference signal, other sensors, etc. The policy is the optimal-control behavior the designer seeks. RL allows this behavior to be learned without explicit programming or modeling the plant in excruciating detail.
Policy: The decision-making capability of the agent is based on a probabilistic mapping from the state it is in to the best action to take. This mapping is called a policy $\pi(a \mid s)$, the probability that the action $A_t = a$ is taken if the state is $S_t = s$.

Returns: Returns represent long-term rewards, gathered over time.
(3) $G_t = R_{t+1} + R_{t+2} + \cdots + R_T$
Discounting: Discounting provides a mechanism to control the impact of selecting an action that is immediate versus one where rewards are received far into the future.
(4) $G_t = \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1}$
where $\gamma$, the discount rate, is a parameter $0 \le \gamma \le 1$.
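As a concrete illustration of the discounted return in (4), it can be accumulated backwards through a finite reward sequence. This is a minimal Python sketch (the paper's implementation is in MATLAB; the rewards and discount value below are arbitrary examples):

```python
def discounted_return(rewards, gamma):
    """Compute G_t = sum_k gamma^k * R_{t+k+1} for a finite reward sequence,
    accumulating backwards so each reward is discounted the correct number of steps."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# A reward of 1 at every step with gamma = 0.5 gives 1 + 0.5 + 0.25 + 0.125
print(discounted_return([1, 1, 1, 1], 0.5))  # → 1.875
```

A discount of $\gamma < 1$ makes distant rewards contribute geometrically less, which is exactly the trade-off the text describes.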
Value functions: These are functions of states (or state–action pairs) that estimate how good it is to be in a given state (or to perform a given action in a given state). A reward signal provides feedback on how "good" the current action is in an immediate, short-term sense. In contrast, a value function provides a measure of "goodness" in the long term and is defined in terms of the expected future return.
The value, denoted $v_\pi(s)$, is the expected return for a state $s$, measured starting in that state and following the policy $\pi$ thereafter.
(5) $v_\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma\, v_\pi(s') \right]$
Equation (5) is referred to as the Bellman equation; it forms the basis for approximately computing and learning $v_\pi$ and is therefore central to all RL algorithms.
Q-function: By including the action, $q_\pi(s, a)$ is defined as the expected return starting from state $s$, taking an action $a$ and thereafter following policy $\pi$.
Q-learning: Q-learning is an off-policy TD control algorithm that iteratively learns the Q-value. For each state–action pair the value $Q(s, a)$ is tracked. When an action $a$ is performed in some state $s$, the two elements of feedback from the environment, the reward $r$ and the next state $s'$, are used in the update shown in (6); $\alpha$ is the learning rate.
(6) $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a) - Q(S_t, A_t) \right]$

where $Q$ is the learned estimate of the optimal action-value function $q_*$.
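The tabular Q-learning update is simple to state in code. Below is a generic Python sketch of one update step (the states, actions, learning rate and discount are illustrative values, not from this paper's setup):

```python
from collections import defaultdict

def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One Q-learning step:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    td_target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])
    return Q

Q = defaultdict(float)                      # all Q-values start at 0
q_update(Q, s=0, a=1, r=1.0, s_next=1, actions=[0, 1])
print(Q[(0, 1)])                            # → 0.1 (alpha * reward, since Q(s') is still 0)
```

The `max` over next-state actions is what makes the method off-policy: the update bootstraps from the greedy action regardless of the action actually taken next.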
Optimal value function: There always exists at least one optimal policy that guarantees the highest expected return, with optimal value function denoted by $v_*$ and optimal action-value function $q_*$.
Model-based and model-free RL methods: Accurate models of the environment allow "planning" of the next action as well as the reward. By a model we mean having access to a "table" of the probabilities of reaching a state, and of the associated rewards, given an action.
RL methods that use environment models are called model-based methods, as opposed to the simpler model-free methods. Model-free agents can learn only by trial and error [b_BARTO].
Actor–critic methods: The actor–critic structure enables a forward-in-time class of RL algorithms that can be implemented in real time. The actor component, under a policy, applies an action to the environment and receives feedback that is evaluated by the critic component. Learning proceeds in two steps: policy evaluation, performed by the critic, followed by policy improvement, performed by the actor.
2.4 The DDPG algorithm
MATLAB’s R2019a release provides six RL algorithms. DDPG is the only algorithm suitable for continuous action control [b_MATLAB].
In [b_LCRAP], Lillicrap et al. introduced DDPG to overcome the shortcomings of the DQN (Deep Q-Network) algorithm, which in turn was an extension of the fundamental Q-learning algorithm.
DDPG is a model-free, policy-gradient-based, off-policy method: it uses a memory replay-buffer to store previous experiences. Being an actor–critic algorithm, it uses two neural networks. The actor network accepts the current state as input and outputs a single real value (i.e. the valve control signal) representing the action chosen from a continuous action space. The critic network evaluates the actor's output (i.e. the action) by estimating the Q-value of the current state given that action. Actor network weights are updated by a deterministic policy-gradient algorithm, while the critic weights are updated by gradients obtained from the TD error signal. The DDPG algorithm therefore simultaneously learns both a Q-function and a policy by interleaving the two.

Exploration vs. exploitation: For RL, as for humans, performance improvement is achieved by exploiting the actions that provided the highest rewards in the past. However, to discover the best actions in the first place, the agent must explore the action space. Balancing the discovery of new actions against continuously improving the best known action is a common challenge in RL, and various exploration–exploitation strategies have been developed.
DDPG uses the Ornstein–Uhlenbeck process (OUP) to enable exploration [b_OUP]. Interestingly, the OUP was developed for modeling the velocities of Brownian particles with friction, which results in values that are temporally correlated. The simpler additive Gaussian noise model causes abrupt (i.e. uncorrelated) changes from one time-step to the next, whereas the OUP noise model more closely mimics real-life actuators, which exhibit inertia [b_LCRAP].
The exploration policy $\mu'$ is constructed by adding noise sampled from the OUP noise process $\mathcal{N}$ to the selected action (i.e. the actor policy) at each training time-step.
(7) $\mu'(s_t) = \mu(s_t \mid \theta^{\mu}_t) + \mathcal{N}$
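The temporal correlation that motivates the choice of the OUP over white noise is easy to see by simulating the process with a simple Euler discretization. This Python sketch uses illustrative parameters ($\theta$, $\sigma$ and the step count are not the paper's settings):

```python
import numpy as np

def ou_path(n_steps, theta=0.15, sigma=0.3, dt=1.0, seed=0):
    """Euler-discretized Ornstein-Uhlenbeck process with zero mean:
    x[k] = x[k-1] + theta * (0 - x[k-1]) * dt + sigma * sqrt(dt) * N(0, 1)."""
    rng = np.random.default_rng(seed)
    x = np.zeros(n_steps)
    for k in range(1, n_steps):
        x[k] = x[k-1] - theta * x[k-1] * dt + sigma * np.sqrt(dt) * rng.standard_normal()
    return x

x = ou_path(2000)
# Successive samples are strongly correlated, unlike white Gaussian noise,
# whose lag-1 autocorrelation would be near zero.
print(np.corrcoef(x[:-1], x[1:])[0, 1])   # close to 1 - theta*dt = 0.85
```

Smaller $\theta$ (weaker mean reversion) yields smoother, more inertia-like exploration signals.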
3 Control Valves and RL
Control valves modify fluid flow rates using an actuator mechanism that responds to a signal from the control system. Processing plants consist of large networks of such control valves, designed to keep a process variable (such as pressure, temperature or flow) under control. These variables must be kept within a specified operating range to ensure the quality of the end-product [b_ISA].
3.1 Nonlinearity in valves
Control valves, like most other physical systems, possess nonlinear characteristics such as friction and backlash. Friction in turn has two components: stiction (static friction) is the inertial force that must be overcome before there is any relative motion between two surfaces and is the prime cause of deadband in valves, while dynamic friction is the friction in motion [b_CHOUDHURY_2004_Quantification], [b_CHOUDHURY_2004_Data_Driven].
Nonlinearity can cause oscillatory valve outputs, which in turn cause oscillations of the process output, resulting in defective end-products, inefficient energy consumption and excessive wear of manufacturing systems [b_CHOUDHURY_2004_Quantification], [b_CHOUDHURY_2005_Modelling]. According to [b_CHOUDHURY_2004_Quantification], 30% of process-loop oscillation issues are due to control valves, while [b_DESBOROUGH] reports that valves are the primary cause of 32% of surveyed inefficient controllers. Stiction in control valves has been reported as the prime source of sustained oscillations in industrial control loops [b_CAPACI].
3.2 A mathematical valve model
RL requires experiences for training, and simulated environments often provide a quick and low-cost setting for training an agent. Since the objective of building a controller is for it to be used in the real world, one must strive to create as accurate an environment as possible. This appears to contradict the earlier claim that RL does not require an accurate system model; however, it is assumed here that the real physical environment is inaccessible. If it were accessible, or available as a lab setup, the RL agent (controller) could well learn directly from real experiences.
In this paper we use first principles to model the valve, as outlined in [b_CAPACI].
He and Wang [b_HE_2007], [b_HE_2010] describe the nonlinear memory dynamics of the valve by its position $x(k)$ at a time-step $k$, where $x(k)$ is expressed by relation (8). While the controller outputs $u(k)$, the actual position the valve attains is $x(k)$, and $e(k)$ represents the valve position error. $f_S$ and $f_D$ are the static (stiction) and dynamic friction parameters, dependent on the valve type, size and application. The "Experimental Setup" section will later describe the Simulink modeling of the valve.
(8) $x(k) = \begin{cases} x(k-1), & |e(k)| \le f_S \\ u(k) - \operatorname{sign}(e(k))\, f_D, & |e(k)| > f_S \end{cases}$

where $e(k) = u(k) - x(k-1)$.
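The stick-slip behavior governed by the two friction parameters $f_S$ and $f_D$ can be sketched in a few lines. Python is used here for illustration (the paper's implementation is in Simulink, and the friction values in the example are arbitrary, not the benchmark's):

```python
def valve_position(u, x_prev, fS, fD):
    """Two-parameter stiction sketch: the valve holds its position while the
    controller demand stays within the static-friction band fS, and otherwise
    slips toward the demanded position, lagging by a dynamic-friction offset fD."""
    e = u - x_prev                        # valve position error
    if abs(e) <= fS:
        return x_prev                     # stiction: no movement
    return u - (fD if e > 0 else -fD)     # slip, lagging by dynamic friction

x = 50.0
x = valve_position(51.0, x, fS=8.0, fD=3.5)   # demand within the stiction band
print(x)  # → 50.0 (valve sticks)
x = valve_position(60.0, x, fS=8.0, fD=3.5)   # band exceeded
print(x)  # → 56.5 (valve slips, short of the demand)
```

Iterating this map against a ramping input reproduces the characteristic deadband-and-jump response attributed to stiction in the text.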
3.3 RL for valve control: A literature research
The field of RL is relatively new, and not many studies of its application to the control of valves were found: Scopus returned only 18 results for "reinforcement learning AND valves AND control" (Fig.5).
A study of three publications is presented below with emphasis on areas that can be compared with our research.
3.3.1 Throttle valve control
Throttle valves find application in both the industrial and automotive sectors.
Control of a throttle valve is challenging due to the highly dynamic behavior of the spring-damper design of the valve system and its complex nonlinearities [b_BISCHOFF], [b_SCHOKNECHT]. The authors of [b_HOWELL] indicate that the challenge arises from the multiple-input-multiple-output nature of the throttle-valve optimization problem.
Bischoff et al. [b_BISCHOFF] use PILCO (probabilistic inference for learning control), a practical, data-efficient, model-based policy-search method. PILCO reduces model bias, a key problem of model-based RL, by learning the probabilistic dynamics of the model and then explicitly incorporating model uncertainty into long-term planning. PILCO works with very little data and facilitates learning from scratch in only a few trials, thereby alleviating the need for the millions of episodes normally required by trial-and-error-based model-free methods [b_PILCO].
Throttle-valve dynamics are modeled using the flap angle, the angular velocity and the actuator input. The valve must be controlled at an extremely high rate of 200 Hz without any overshoot, since overshoot results in engine-torque jerks. The controller learns by minimizing the expected sum of cost over time.
(9) $J^{\pi} = \sum_{t=0}^{T} \mathbb{E}\left[ c(x_t) \right]$
To apply the constraint of zero overshoot, a novel asymmetric saturating cost-function is applied, as seen in Fig.6. A trajectory approaching the goal (red) incurs a rapidly decreasing cost as it nears the goal, while overshooting the goal incurs a disproportionately high cost almost immediately [b_BISCHOFF].
The effectiveness of the asymmetric cost-function is evident in their results (blue) in Fig.7, which show no overshoot and only low-noise behavior of the controlled profile.
3.3.2 Heating, ventilation and airconditioning (HVAC) control
Wang et al. [b_WANG] use a model-free, proximal actor-critic based RL algorithm to control the nonlinear dynamics of HVAC systems, where the hot-water flow is governed by a power equation (10).
(10) 
RL is compared to Proportional-Integral (PI) and Linear Quadratic Regulator (LQR) control strategies. 150 time-steps are used to allow sufficient time for the RL controller to learn to track the set-point. Disturbances are simulated using random-walk algorithms. The actor network configuration is [50, 50] and the critic is a single layer of 50 units. One interesting aspect of their network architecture is the use of GRUs (Gated Recurrent Units) to overcome the problem of vanishing/exploding gradients.
Fig.8 shows that the RL controller responds much faster than the LQR and PI controllers and tracks the reference signal better, thereby achieving lower Integral Absolute Error (IAE) and Integral Square Error (ISE) than both competing strategies. However, the RL response is very noisy, with high variance compared to the smooth trajectories of the PI and LQR controllers, and significant overshoots are also seen.
3.3.3 Sterilization of canned food
Thermal processing used for the sterilization of canned food degrades the organoleptic properties of the food, so controlling the thermal process is important. In [b_SYAFIIE], Syafiie et al. apply Q-learning to learn the temperature profile that can be applied for the minimal time during the two stages of the thermal process: manipulation of the saturated-steam valve to cause heating, followed by cooling by opening the water valve.
A simple scalar reward is used [+1.0, 0.0, −2.0], thereby penalizing an action deviating from the desired state twice as much as rewarding one approaching it. The paper does not evaluate continuous rewards. Fig.9 shows the controlled temperature profile.
Overall observations on the three researched papers:

Disturbances in the RL controlled signal are evident in all three implementations: ([b_BISCHOFF], [b_WANG] and [b_SYAFIIE]).

Use of stochasticity mechanisms other than the OUP to enable exploration of the action space: ([b_BISCHOFF] and [b_WANG]).

Use of a novel objective function in [b_BISCHOFF].

None of these papers evaluated the stability of the RL controller design, an important consideration for an emerging breed of controllers.

MATLAB was not used as the design platform in any of them, which is understandable considering the Reinforcement Learning Toolbox was launched only in 2019.

Only [b_WANG] compared RL against the traditional PID.
4 Experimental Setup
This section describes the creation of the experimental setup, using MATLAB and Simulink, for design and evaluation of the RL and PID controllers. Fig.10 shows the core components.
Our setup used elements from the excellent 2018 paper "An augmented PID control structure to compensate valve stiction" by Bacci di Capaci and Scali [b_CAPACI].
Traditional PID controllers, tuned solely on process dynamics, cause sustained oscillations attributable to the integral component, which produces excessive variation of the control action in overcoming static friction [b_CAPACI]. As a solution, [b_CAPACI] presented a novel PID-based controller, Fig.11(a), where stiction is overcome by employing a two-move control sequence (11) as the valve input.
(11) 
where $\hat{f}_S$ and $\hat{f}_D$ are estimates of the stiction and dynamic friction and $\hat{u}_{ss}$ is the estimate of the steady-state position of the valve. This also shows the reliance of the technique on correct estimation of these parameters.
The setup components:

A PID (with filter) controller tuned using MATLAB’s autotuning feature.

A training setup for the RL agent using the DDPG algorithm.

A unified framework for experimentation and evaluation of controllers
Items below were based on [b_CAPACI]:

The nonlinear valve model (11), including the valve friction values $f_S$ and $f_D$.

A “benchmark waveform” profile with noise parameters (Fig.11(d)).
4.1 Modeling the valve
Simscape Fluids™ (formerly SimHydraulics™) provides simulations for several valve types and is the simplest and quickest option. [b_POPINCHALK] is a MathWorks article to enhance these into more realistic models using an understanding of system dynamics.
We, however, use first principles and model the nonlinear valve mathematically. Algebraically rearranging the equations shown in (11) produces (12); these equations are then implemented in Simulink using a "user-defined-function" block and a "memory" block, as shown in Fig.12.
(12) 
4.2 Modeling the “industrial” process
4.3 PID controller setup
A PID-controlled output is a function of the feedback error, represented in the time domain as:
(14) $u(t) = K_p\, e(t) + K_i \int_0^{t} e(\tau)\, d\tau + K_d\, \frac{de(t)}{dt}$
where $u(t)$ is the desired control signal and $e(t)$ is the tracking error between the desired output $r(t)$ and the actual output $y(t)$. This error signal is fed to the PID controller, which computes both the derivative and the integral of the error with respect to time, providing a set-point tracking effect. This operates continuously in a closed loop for as long as the controller is in effect.
The ideal theoretical PID form exhibits a drawback for high-frequency signals: the derivative action results in very high gain, so high-frequency measurement noise generates large variations in the control signal. Practical implementations reduce this effect by replacing the derivative term $K_d s$ (where $s$ represents differentiation in the Laplace domain) by a first-order filter, as in (15) [b_MURRAY].
(15) $C(s) = K_p + \frac{K_i}{s} + \frac{K_d\, s}{1 + s/N}$
The filter coefficient $N$ determines the pole location of the filter, which helps attenuate the high gain on high-frequency noise. An $N$ between 2 and 20 is recommended; a high value ($N \to \infty$) makes (15) approach the ideal form (14) [b_MURRAY].
The PID was tuned using MATLAB's autotuning feature to obtain the coefficients $K_p$, $K_i$, $K_d$ and $N$; the low $K_d$ obtained acts to suppress the derivative term.
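The effect of the derivative filter can be illustrated with a simple discrete-time simulation. This Python sketch uses illustrative gains, not the coefficients tuned for this setup:

```python
def pid_filtered(errors, Kp, Ki, Kd, N, dt):
    """Discrete PID whose derivative term is passed through a first-order
    filter: d' = N * (Kd * de/dt - d), i.e. Kd*s / (1 + s/N) in Laplace form."""
    integral, d, e_prev = 0.0, 0.0, 0.0
    out = []
    for e in errors:
        integral += e * dt
        e_dot = (e - e_prev) / dt
        d += N * (Kd * e_dot - d) * dt     # forward-Euler update of the filter state
        out.append(Kp * e + Ki * integral + d)
        e_prev = e
    return out

# On a unit-step error the ideal derivative would be an impulse; the filtered
# derivative instead contributes a finite kick that then decays.
u = pid_filtered([1.0] * 5, Kp=2.0, Ki=0.1, Kd=0.5, N=10, dt=0.1)
```

With these settings the first output is a bounded kick of about 7.0, after which the derivative contribution dies away and only the P and I terms remain; with no filter the same step would produce an unbounded derivative spike.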
4.4 RL controller setup
This section describes the Simulink design for training the RL controller using the DDPG algorithm.
Fig.15 shows the training setup. A switch allows testing a trained model on various signals built via a "signal-builder" block. Training an RL agent involves significant hyperparameter tuning, and this setup allows quick experiments and evaluations by toggling a "software" switch.
4.4.1 RL controller design
4.4.2 Environment design
Several design factors need consideration when building the environment for efficiently training the agent to follow the trajectories of a control signal. They can broadly be classified as agent-related and environment-related. Agent-related factors are the composition of the observation vector and the reward strategy. Environment-related factors cover the training strategy, the training signals, the initial conditions of the environment and the criteria for terminating an episode (for episodic tasks).
4.4.3 Training strategy
One could train the RL agent to follow the exact benchmark trajectory (Fig.11(d)); however, this is a very constrained strategy. Instead, the agent was trained to follow straight-line signals of random levels and was additionally challenged to start at a randomly initialized flow value. Together this forms an effective, generalized training strategy that teaches the agent to follow any control-signal trajectory composed of straight lines. The RL Toolbox allows overriding the default "reset function", which assists in implementing this strategy:
env.ResetFcn = @(in)localResetFcn(in, VALVE_SIMULATION_MODEL);
4.4.4 Observation vector
The observation vector used was $[\,y(t),\ e(t),\ \int e(t)\,dt\,]$, where $y(t)$ is the actual flow achieved, $e(t)$ the error with respect to the reference, and the final element the integral of the error.
Integral of error: The instantaneous error has no memory. The integral of the error, the area under the error curve as time progresses, provides a mechanism to compute the total error accumulated over time and drives the agent to lower it (Fig.17).
This is an important observation input often used in training of RL controllers.
The observation vector is modeled as shown in Fig.18.
4.4.5 Rewards strategy
Rewards can be assigned via discrete, continuous or hybrid functions. Equation (16) is a simple discrete form.
(16) $r_t = \begin{cases} 1, & |e(t)| \le \epsilon \\ 0, & \text{otherwise} \end{cases}$
where $\epsilon$ is some allowable error margin.
Equation (17) shows a reward that varies continuously as a function of the error $e(t)$; $c$ is a small constant that avoids a division-by-zero error.
(17) $r_t = \frac{1}{|e(t)| + c}$
Well-designed continuous-reward functions help agents learn, during the early stages, to stay as close as possible to the reference signal. Fig.19 shows the final implementation as a hybrid form: the reciprocal of the absolute error allows the controller to learn to drive the error ever lower, while the discrete part of the reward is the "penalty" block, which assigns a set penalty for exceeding the flow limits.
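A hybrid scheme of this kind can be expressed compactly. In this Python sketch the flow limits, the constant `c` and the penalty magnitude are illustrative assumptions, not the values used in the Simulink model:

```python
def hybrid_reward(error, flow, flow_min=0.0, flow_max=110.0, c=0.001, penalty=-10.0):
    """Continuous reciprocal-of-error term plus a discrete penalty applied
    whenever the flow exceeds its allowed limits."""
    r = 1.0 / (abs(error) + c)            # grows as the tracking error shrinks
    if flow < flow_min or flow > flow_max:
        r += penalty                      # discrete "penalty" block
    return r

print(hybrid_reward(error=10.0, flow=50.0))   # small reward far from the reference
print(hybrid_reward(error=0.0, flow=50.0))    # large reward on the reference
print(hybrid_reward(error=0.0, flow=150.0))   # same error, but limit exceeded
```

The continuous term provides a learning gradient at every error level, while the discrete penalty carves hard boundaries into the otherwise smooth reward landscape.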
4.4.6 Actor and Critic networks
The actorcritic DDPG components were implemented as shown in Fig.20. The networks have fullyconnected layers, initialized with small random weights before beginning the training.
The actor network output is normalized to the range [−1, 1] using a tanh layer, which allows better learning and convergence for continuous action spaces.
4.4.7 OrnsteinUhlenbeck (OU) action noise parameters
Guidelines for computing the DDPG exploration parameters, i.e. the noise-model variance and the decay rate of the variance, are provided by MATLAB [b_MATLAB_DDPG].
(18) $0.01\,(a_{\max} - a_{\min}) \le \sigma \sqrt{T_s} \le 0.1\,(a_{\max} - a_{\min})$
where $T_s$ is the sampling time and $[a_{\min}, a_{\max}]$ the action range.
The half-life of the variance, in time-steps, is decided first; the decay rate of the variance $\lambda$ is then computed using:
(19) $\lambda = 1 - 2^{-1/h}$
where $h$ is the chosen half-life.
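A decay rate that halves the exploration variance after a chosen number of time-steps follows directly from the definition of half-life. This Python sketch uses an arbitrary 10,000-step half-life as an example:

```python
def decay_rate_for_half_life(h):
    """Per-step decay rate d such that a variance scaled by (1 - d) each
    time-step halves after h steps, i.e. (1 - d)**h == 0.5."""
    return 1.0 - 0.5 ** (1.0 / h)

d = decay_rate_for_half_life(10000)
print(d)                      # a small per-step decay, ~6.93e-05
print((1.0 - d) ** 10000)     # → 0.5 (up to floating-point rounding)
```

Slower decay (a longer half-life) keeps the agent exploring further into training, which matters for plants whose hardest dynamics only appear late in an episode.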
4.4.8 Final DDPG hyperparameters
Summarized below in Table 1 are the final set of DDPG hyperparameters.
Hyperparameter  Setting
Critic learning rate  1
Actor learning rate  1
Critic hidden layer-1  50 fully-connected
Critic hidden layer-2  25 fully-connected
Action-path neurons  25 fully-connected
Action-path bound  tanh layer
Gamma  0.9
Batch size  64
OUP variance  1.5
OUP variance decay rate  1
4.5 Setup for comparative study
An environment combining the PID and RL strategies for a comparative evaluation is shown in Fig.21. It allows experimenting with various reference signals and studying the effects of noise added at three disturbance points: the input of the controller, the output of the controller (i.e. the input of the plant) and the output of the plant.
It also provides a convenient platform for additional experiments using elements such as set-point filters, output-smoothing filters, etc.
5 Graded Learning
Before presenting the results of the experiments, we elaborate on a coaching method we term "Graded Learning". This simple, intuition-based approach was discovered accidentally during the hundreds of experiments and trials (163, to be exact) conducted in the attempt to train a stable RL agent. It must be noted that this method is equivalent to the naive, domain-expert-dependent form of the more formal method known as "Curriculum Learning" [b_WENG_CURRICULUM], [b_NARVEKAR].
Applying automatic Curriculum Learning requires algorithmic design and the implementation of complex frameworks [b_PORTELAS], for example the ALP-GMM (absolute learning progress Gaussian mixture model) "teacher-student" framework, in which a "teacher" neural network samples parameters from a continuous space to generate a learning curriculum. Automated Curriculum Learning is currently not possible in MATLAB and would therefore be difficult for many practising engineers. Graded Learning, on the other hand, requires no additional programming and allows a control engineer to implement it directly.
Fig.22 shows examples of the numerous challenges faced during training: experiments sometimes ran for thousands of episodes without producing a stable learning curve, and sometimes produced inexplicable controller actions. Some training trials lasted 20,000 episodes, running for over 20 hours, so it is important to streamline these efforts.
Graded Learning helped avoid some of these challenges. The intuition for Graded Learning was based on observing how human instructors structure coaching of a new skill for apprentices.
While new skills such as chess or tennis are taught with the final goal in mind, one never starts with the hardest lessons. Foundation level skills are taught first and once some level of proficiency is gained, the student graduates to the next level with marginally more complex problems than the previous level. Skills and experiences gained in the previous level are retained and progressively built upon as one moves from one level to the next.
Graded Learning extends this iterative, staged approach to RL. The RL task is first reduced to its most fundamental level, and an agent is trained for a set number of episodes or until a convergence criterion is met. The next level of complexity is then added to the previous task. Transfer-learning is used to ensure that previous experience is retained and built upon. Once this level of the task is learned, the process of adding further complexity continues, and each time transfer-learning builds upon the experience gained in the previous levels.
Transfer-learning is a machine-learning technique used to "transfer" the learning, i.e. the stabilized weights of a neural network, from one task (or domain in general) to another, without having to train the neural network from scratch [b_KARL].
The Graded Learning approach was discovered when the time-delay in (13) was reduced to zero and the agent quickly stabilized, in contrast to the hundreds of earlier attempts; this assisted in satisfactorily training a stable controller.
Fig.23 demonstrates the method in action, with the agent evolving over six stages of increasing difficulty. The parameters that are progressively increased are the time-delay $L$, the static friction $f_S$ and the dynamic friction $f_D$.
Both the stability analysis and the experimental results presented next demonstrate that Graded Learning, applied to valve control (and possibly other complex industrial systems), is an effective way to coach an RL agent.
Grade  Parameter(s)  Episodes  Time (h)
Grade-I.1  0.1  930  1.67
Grade-I.2  0.1  2000  12.35
Grade-II  0.5  1000  5.31
Grade-III  1.5  1000  5.21
Grade-IV  1.5  1000  4.65
Grade-V  2.0  500  2.27
Grade-VI  2.5, 8.4, 3.524  2000  7.59
Total  —  8430  39.05
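The staged procedure can be summarized in pseudocode-like Python. The grade parameters below are illustrative placeholders, and `train` stands in for the actual MATLAB DDPG training loop:

```python
# Each grade makes the simulated valve harder: time-delay L, static friction
# fS and dynamic friction fD are increased step by step (illustrative values).
grades = [
    {"L": 0.1, "fS": 0.0, "fD": 0.0, "episodes": 1000},
    {"L": 0.5, "fS": 2.0, "fD": 1.0, "episodes": 1000},
    {"L": 2.5, "fS": 8.4, "fD": 3.5, "episodes": 2000},
]

def train(agent, env_params, episodes):
    """Stand-in for one DDPG training run on the given environment."""
    return agent + [env_params["L"]]      # pretend experience accumulates

agent = []                                # freshly initialized networks
for g in grades:
    # Transfer learning: the same agent (same network weights) carries over;
    # only the difficulty of the environment changes between grades.
    agent = train(agent, g, g["episodes"])

print(agent)  # → [0.1, 0.5, 2.5]
```

The essential point is that the agent object is never re-initialized between grades; only the environment parameters change, so each stage starts from the competence reached in the previous one.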
6 Experiments, Results and Discussion
In this section we present the results of experiments conducted on a unified framework and evaluate the RL controller’s performance and compare it with the PID (with filter) controller.
Before conducting the experiments a stability analysis of the RL controller must be carried out.
6.1 Stability Analysis of RL Control
A basic stability analysis of the RL control is attempted in this section.
The open-loop transfer function of the system is $G_{OL}(s) = G_{RL}(s)\,G_{plant}(s)$. The plant transfer function is $G_{plant}(s) = G_{p}(s)\,G_{v}(s)$, where $G_{p}(s)$ is the transfer function of the FOPTD process (13) and $G_{v}(s)$ is the transfer function of the nonlinear valve, which is unknown and must be estimated.
Simulink's Control Design Linearization Analysis™ tool provides a GUI-based interface to generate a linear approximation of a nonlinear system, computed across specified input and output points. However, it does not allow any control over the estimation, in contrast to MATLAB's tfest function.
The programmatic method lets the user control the transfer-function estimate by specifying the number of poles (np) and zeros (nz). Additionally, the iodelay parameter allows experimenting with the effect of time-delays in physical systems. This MATLAB function is based on [b_GARNIER].
sys = tfest(data, np, nz, iodelay)
The block diagram in Fig.24 shows the points at which data will be tapped to estimate the controller transfer function, and the points tapped to estimate the complete plant transfer function. Fig.25 shows the Simulink setup used to assist the estimation.
Estimated plant transfer function: The continuous-time transfer function (20) for the plant was estimated by MATLAB as shown in Fig.26, along with its goodness-of-fit and MSE.
(20) 
Estimated controller transfer function: Equation (21) is the estimated continuous-time transfer function for the controller.
(21) 
We plot (Fig.27) the plant's response using the estimated transfer functions against the original RL signal to ensure that it is reasonably close and will serve the purpose of gauging stability. It must be noted that the estimation is approximate; this method is provided as a means of understanding how to conduct a very basic stability analysis.
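As a concrete illustration of such a basic check: once the controller and plant estimates are available, the closed-loop poles of the unity-feedback loop follow from the characteristic polynomial, and stability requires every pole to have a negative real part. The polynomial coefficients below are placeholders, not the estimated (20) and (21).

```python
import numpy as np

def closed_loop_stable(num_c, den_c, num_p, den_p):
    """Check stability of a unity-feedback loop: the closed-loop poles are
    the roots of den_c*den_p + num_c*num_p; stable iff all have Re < 0."""
    char_poly = np.polyadd(np.polymul(den_c, den_p),
                           np.polymul(num_c, num_p))
    poles = np.roots(char_poly)
    return bool(np.all(poles.real < 0)), poles

# Placeholder first-order controller and plant (NOT the estimated (20)/(21)):
# controller 2/s (pure integral action), plant 3.8/(156s + 1).
stable, poles = closed_loop_stable(num_c=[2.0], den_c=[1.0, 0.0],
                                   num_p=[3.8], den_p=[156.0, 1.0])
```

A frequency-domain margin analysis (gain and phase margins) would be the natural next step beyond this pole check.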
6.2 Experiments and Results
In this section we present the results of experiments conducted on a unified framework that tests two valve-control strategies: PID (with filter) and DDPG RL. Experiments with varying control signals, noise strengths and disturbance points were conducted, as was an experiment on a plant with process-loop perturbations. A critical time-domain analysis of the experimental results is presented, followed by a frequency-domain stability analysis.
Experiments conducted:
1. Arbitrarily assumed constant reference level
2. Benchmark waveform (with noise)
3. Benchmark waveform subject to disturbances at:
   (a) Controller input (i.e. reference signal)
   (b) Plant input (i.e. controlled signal fed to plant)
   (c) Plant output (i.e. system output)
4. Practical example of a "water-supply" valve, subject to ground-borne vibrations of passing trains
5. Plant experiencing process-loop perturbations
6. Arbitrary control waveform
6.2.1 Experiment 1: Constant reference signal
Experiment: A basic analysis is best done on a simple constant reference flow rate, arbitrarily set at 100, with a run length of 2,000. The reference signal is superimposed with the benchmark Gaussian noise.
Observations: Fig.30 shows the PID and RL trajectories. We observe that the PID has a large overshoot and settles in about 700. The RL strategy demonstrates close-to-ideal damping and a quicker settling time of about 220. The RL trajectory shows tiny ripples against the PID's smoother profile. Since such oscillations can reduce the remaining useful life (RUL) of a mechanical system, we study them by conducting a (simplified) two-factor DOE (design of experiments).
We vary the two factors, time-delay and valve friction (static and dynamic combined), as shown in Table 3. The default values (time-delay 2.5, static friction 8.40 and dynamic friction 3.524) are treated as the high levels, and each is lowered by a factor of 100 to obtain the low levels shown in Table 4.
Table 3. Two-factor DOE levels:
Time-delay  Friction values (fS, fD)
Low  Low
Low  High
High  Low
High  High
Table 4. Corresponding factor values:
Time-delay  fS  fD
0.025  0.084  0.0352
0.025  8.400  3.524
2.500  0.084  0.0352
2.500  8.400  3.524
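The 2x2 factorial design of Tables 3 and 4 can also be generated programmatically. A minimal sketch follows; the factor names are ours, while the low/high values are those of Table 4 (fS and fD move together as a single "friction" factor).

```python
from itertools import product

# Low and high levels from Table 4 (low = high / 100).
LEVELS = {
    "time_delay": {"Low": 0.025,  "High": 2.500},
    "fS":         {"Low": 0.084,  "High": 8.400},
    "fD":         {"Low": 0.0352, "High": 3.524},
}

def doe_runs():
    """Two-factor full-factorial design: time-delay vs. friction,
    where static (fS) and dynamic (fD) friction vary together."""
    runs = []
    for delay_lvl, fric_lvl in product(("Low", "High"), repeat=2):
        runs.append({
            "time_delay": LEVELS["time_delay"][delay_lvl],
            "fS": LEVELS["fS"][fric_lvl],
            "fD": LEVELS["fD"][fric_lvl],
        })
    return runs
```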
Fig.31(a) highlights the RL strategy's capability to produce a very smooth profile when both factors are low, implying that the oscillations are not introduced by the RL technique itself. Fig.31(c) shows that the oscillatory behavior is caused mainly by the time-delay factor.
While the PID strategy (15) is implemented with a filter that suppresses noise, no filter was added to the RL setup, so as to better understand the natural response of the RL control strategy.
6.2.2 Experiment 2: The benchmark signal
Experiment: The waveform profile used in [b_CAPACI], with Gaussian noise, is subjected to both strategies. We also zoom into sections of the time-domain plot in Fig.32 and observe them more closely in Fig.33.
It is observed that the PID shows higher overshoots and undershoots, while the RL shows better tracking of the reference signal levels. If such a valve controls fluid flow, the excess and deficit fluid quantities could be detrimental to product quality. In Fig.32, the shifted PID waveform after 800 could be detrimental to a process that depends on the timing of the fluid flow.
6.2.3 Experiment 3.a: Noise at controller input
Experiment: Increased noise at the controller input.
Observations: Fig.34(a) and 34(b) show almost no impact on the PID when compared with Experiment 2 (lower noise at input), but an increased impact on the RL trajectory, demonstrating the PID strategy's superior noise-attenuation capability. The RL continues to closely track the reference signal (along with the noise).
6.2.4 Experiment 3.b: Noise at plant input
Experiment: Shift the source of noise to the plant input.
Observations: Fig.35 shows that the PID trajectory is now impacted and loses the relatively smooth output seen in Experiments 1 and 2. The RL strategy, on the other hand, remains unaffected when compared to Experiment 1. The PID strategy adjusts itself based on the error signal and hence shows a change in behavior, while the RL strategy does not.
6.2.5 Experiment 3.c: Noise at plant output
Experiment: The effect of noise experienced at the plant output is studied here.
Observations: Fig.36(a) shows that the RL and PID strategies are affected equally by the noise.
6.2.6 Experiment 4: Water-supply valve, subject to ground-borne vibrations
Experiment: Valve applications can be exposed to extremely harsh conditions. A water-supply system, for example, may face ground-borne vibrations, such as from passing railways, in the range of about 30–200 Hz with varying amplitudes [b_TRAIN]. Since the control-valve assembly will often be placed in a shielded environment, frequencies between 30–100 Hz were assumed for simulation.
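A disturbance of this kind can be approximated as a sum of sinusoids within the assumed 30–100 Hz band. The component frequencies and amplitudes below are illustrative choices, not values taken from [b_TRAIN].

```python
import numpy as np

def vibration_disturbance(t, components=((30.0, 1.0), (55.0, 0.6), (100.0, 0.3))):
    """Sum-of-sinusoids disturbance in the 30-100 Hz band; each component is
    a (frequency_hz, amplitude) pair. All values are illustrative only."""
    return sum(a * np.sin(2.0 * np.pi * f * t) for f, a in components)

t = np.arange(0.0, 1.0, 1e-3)      # 1 s of disturbance at 1 kHz sampling
d = vibration_disturbance(t)        # injected additively at the plant output
```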
6.2.7 Experiment 5: Arbitrary control waveform with benchmark noise signal
Experiment: This experiment tests the generalization capability of the RL controller's training strategy vis-à-vis the generalization of PID tuning. The "training" signal for both strategies was the benchmark waveform; this experiment subjected them to a completely different waveform.
Observations: Fig.38 shows that the RL controller outperforms the PID strategy considerably in this experiment. The RL controller tracks the arbitrary reference much more closely, demonstrating the importance of the training strategy for effective generalization. The PID trajectory, on the other hand, shows a significant lag while tracking the reference; if such a valve controls fluid flow, the untimely excess or deficit fluid quantities could be detrimental to product quality.
Small ripples are evident in sections of the RL-controlled trajectory.
6.2.8 Experiment 6: Benchmark plant with process-loop perturbations
Experiment: This experiment tests resistance to severe process-loop perturbations, modeled by the transfer function (22) [b_CAPACI].
(22) 
Observations: A severe limitation of the RL controller is evident in this experiment. Fig.39 shows a significantly stunted output, clamped smoothly at around 35.0. The setup was then tested with a lower-magnitude reference (Fig.39(b)) and the RL output remains clamped at the same level, 35.0. The PID appears to scale to different levels under the influence of the perturbations, albeit with significant error. The RL controller shows increased oscillatory behavior at the lower flow magnitude.
6.3 Discussion: Experiential Learning Validated against Published Research
Merriam-Webster: "Experiential: relating to, derived from, or providing experience"
A total of 163 experiments were conducted during this research. Experiments that did not respond to seemingly logical steps caused considerable frustration, and it was during this learning process that Graded Learning was discovered. In a quest to explain some of the strange observations, the literature was searched to relate them to previously published studies; this highlighted several known challenges, a reminder that RL is still an emerging field.
Early adopters of RL for control are encouraged to try both the Graded Learning method and study the literature referenced in this section — which is a collection of studies conducted at Google, MIT and Berkeley ([b_SONG], [b_HENDERSON], [b_HARDT] and [b_ZHANG]).
In [b_HENDERSON], the effects of hyperparameters and their tuning are analyzed with respect to network architecture, reward scaling and reproducibility for model-free, policy-gradient-based algorithms for continuous control, and the study is therefore directly applicable to the subject of this paper.
6.3.1 Overfitting and saturation
For physical systems there is always an upper limit of rewards that the agent cannot cross. However, this is not known beforehand, and one often pushes the agent to continue training for hours. Significant neural-network saturation was observed in several of the training attempts.
Overfitting in RL has been studied only recently. [b_SONG] studied overfitting in model-free RL and observed that the agent often mistakenly correlates reward with spurious observation-space features, which they term "observational overfitting". In particular, they studied overfitting with linear quadratic regulators (LQR) using neural networks and showed that, under Gaussian initialization of a policy trained using gradient descent, a generalization gap "must necessarily exist" [b_SONG].
Fig.40 shows multiple examples of overtraining and its effect on learning curves.
[b_HARDT] provides a theoretical proof that stochastic gradient methods employing parametric models, when trained using fewer iterations, have vanishing generalization errors. They argue this through experiments and by using the stability criteria for learning algorithms established by Bousquet and Elisseeff [b_BOUSQUET]. They conclude that shortened training time, by itself, is sufficient to prevent overfitting. This paper is important for extending the stability criteria developed for supervised learning to iterative algorithms such as RL.
6.3.2 Sensitivity to network architecture
Four policy-gradient methods, including DDPG, are analyzed in [b_HENDERSON]. While ReLU activations were stated to perform best, the effects were not consistent across algorithms or hyperparameter settings.
6.3.3 Sensitivity to rewardscaling
A large and sparse reward scale causes network saturation, resulting in inefficient learning, as was observed in Fig.41(b). Reward rescaling is a technique recommended to improve results for DDPG (Fig.41(a)); it is achieved by multiplying rewards by a scalar such as 0.1, or by clipping them to [0, 1] [b_DUAN].
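The rescaling and clipping described above amount to a simple transform applied to each raw environment reward before it reaches the learner. A minimal sketch (the function name and signature are ours):

```python
def rescale_reward(r, scale=0.1, clip=None):
    """Rescale a raw environment reward and optionally clip it to [lo, hi].
    scale=0.1 and clip=(0.0, 1.0) are the settings cited from [b_DUAN]."""
    r = r * scale
    if clip is not None:
        lo, hi = clip
        r = max(lo, min(hi, r))
    return r
```

Keeping rewards in a small, bounded range helps prevent the critic-network saturation noted above.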
6.3.4 Sensitivity to noise parameter
DDPG uses the Ornstein-Uhlenbeck process to aid exploration. The effect of the noise hyperparameter was not easily ascertainable.
Based on (19), for the chosen decay rate and number of time-steps per full episode, the half-life of the exploration decay is about 150 episodes, as seen in Fig.42(a). However, there is an exploration explosion after about 700 episodes (Fig.42(b)). As an experiment, a severely reduced decay setting was used, implying a half-life of just about 15 episodes; however, Fig.42(c) shows no decay in exploration for over 1,000 episodes.
It is possible that these mixed results agree with [b_PLAPPERT] in that explicit noise settings are not necessary to assist exploration in a continuous space. It must be noted that such results could also arise from inexplicable interaction effects of multiple hyperparameters.
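To make the half-life arithmetic concrete, the sketch below implements a discrete OU process with geometric sigma decay and computes the episode half-life from the per-step decay rate. The decay rate of 0.99999 and the 462 steps per episode are illustrative values that happen to yield a half-life of about 150 episodes; they are not the settings of (19).

```python
import math
import random

class OUNoise:
    """Discrete Ornstein-Uhlenbeck exploration noise with geometric sigma
    decay, as used by DDPG; all parameter values here are illustrative."""
    def __init__(self, theta=0.15, sigma=0.3, decay=0.99999, dt=1.0):
        self.theta, self.sigma, self.decay, self.dt = theta, sigma, decay, dt
        self.x = 0.0
    def sample(self):
        self.x += (-self.theta * self.x * self.dt
                   + self.sigma * math.sqrt(self.dt) * random.gauss(0.0, 1.0))
        self.sigma *= self.decay       # geometric decay per time-step
        return self.x

def half_life_episodes(decay, steps_per_episode):
    """Episodes until sigma halves, given a per-step geometric decay rate."""
    return math.log(0.5) / (math.log(decay) * steps_per_episode)
```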
6.3.5 Sensitivity to random seeds
Intuitively, different random seeds should not affect the results of a stable process. According to [b_HENDERSON], however, environment stochasticity coupled with stochasticity in the learning process has produced misleading inferences even when results were averaged across multiple trials.
In conclusion, as stated by Henderson et al. [b_HENDERSON] one of the possible reasons for the difficulties encountered could be the “intricate interplay” of hyperparameters of policy gradient methods (such as DDPG).
7 Conclusion
On the design front, the process of training a model-free reinforcement learning agent was outlined.
Hyperparameter tuning requires significant effort and patience to build a stable controller. We proposed Graded Learning, a naive form of the Curriculum Learning method: an engineer starts at the lowest complexity level, finds appropriate hyperparameter settings to understand the best reward strategy and reward scales to use, and then gradually increases the control-task complexity. This avoids several problems mentioned earlier, for example network saturation. For most industrial control systems, Table 1 should be a good starting point.
On the application front, experiments were conducted to evaluate the RL controller against the conventional PID control strategy.
The experiments showed that the RL strategy's trajectory tracking appears to be superior to the PID's. The PID demonstrates better disturbance rejection, in contrast to the disturbances that appear on the RL-controlled signal. While this appears to be the prime limitation of the RL controller, it must be noted that it was also evident in the published implementations studied ([b_BISCHOFF], [b_WANG] and [b_SYAFIIE]).
The PID appeared to lag the reference control signal, while the RL controller performed better when challenged to track a control profile it was not trained on; the RL controller should therefore prove versatile when applied to different control tasks within the same environment, without having to be retrained.
Overall, the RL-controlled process appears to promise better process quality, while the PID-controlled process will cause significantly lower stress on the valve operation, resulting in reduced wear and tear.
Enhancements and future work: The RL controller that was designed needs a mechanism to reduce its oscillatory behavior in the presence of high-frequency disturbances with strong amplitudes. For noise at the input and output of the controller, a low-pass filter may help reduce the high variance.
Further work is necessary to understand ways of defining objective and reward functions that prevent the noisy RL trajectory behavior. If this succeeds, it will be a better solution than applying a filter, which would otherwise slow down the response.
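Such a low-pass filter could be as simple as a first-order exponential smoother on the controller's input or output. A minimal sketch follows, where the smoothing factor alpha trades noise rejection against exactly the added lag noted above.

```python
def low_pass(signal, alpha=0.1):
    """First-order IIR low-pass: y[k] = y[k-1] + alpha * (x[k] - y[k-1]).
    Smaller alpha gives stronger smoothing but a slower (laggier) response."""
    y, out = signal[0], []
    for x in signal:
        y += alpha * (x - y)
        out.append(y)
    return out
```

For example, a step input illustrates the lag: with alpha=0.5 the output reaches only 87.5% of a step after three samples.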
The MATLAB R2019b release includes the Proximal Policy Optimization (PPO) algorithm for continuous control, which should be evaluated. PPO is a recent development and is considered more stable than DDPG [b_HENDERSON].
The fields of reinforcement learning, optimal control and control systems are extremely exciting. It is hoped that this work will motivate further research to help better understand, and hence popularize, the use of reinforcement learning for control systems.
This paper is a result of the work that began with the dissertation [b_RS] submitted to Coventry University, UK. I am immensely grateful for the encouragement and guidance I received during the dissertation work from my supervisors: Dr Olivier Haas, Associate Professor and Reader in Applied Control Systems at Coventry University, and Dr Prithvi Sekhar Pagala, Research Specialist at KPIT Technologies. Prof. Dr Acharya K.N.S must be thanked for instilling an interest in Control Systems through his teaching.
Rajesh Siraskar received the B.E. degree in Electronics and Telecommunications from Pune University, Pune, India, in 1990 and an M.Tech. degree in Automotive Electronics from Coventry University, UK in 2020. He works as a Data Scientist and develops solutions for industries ranging from automotive to energy and pharmaceutical to cement. He was previously a Six Sigma Master Black Belt. He is member of IEEE.