Reinforcement Learning for Control of Valves

12/29/2020 · by Rajesh Siraskar, et al.

This paper compares reinforcement learning (RL) with the PID (proportional-integral-derivative) strategy for control of nonlinear valves using a unified framework. RL is an autonomous learning mechanism that learns by interacting with its environment. It is gaining increasing attention in the world of control systems as a means of building optimal controllers for challenging dynamic and nonlinear processes. Published RL research often uses open-source tools (Python and OpenAI Gym environments), which can be difficult for practicing industrial engineers to adapt and apply; we therefore used MathWorks tools. MATLAB's recently launched (R2019a) Reinforcement Learning Toolbox was used to develop the valve controller, trained with the DDPG (Deep Deterministic Policy Gradient) algorithm, while Simulink was used to simulate the nonlinear valve and to set up the experimental test-bench for evaluating the RL and PID controllers. Results indicate that the RL controller tracks the reference signals quickly and with lower error; the PID, however, is better at disturbance rejection and hence provides a longer life for the valves. Experiential learnings gained from this research are corroborated against published research. Successful machine learning is known to involve tuning many hyperparameters and a significant investment of time and effort. We introduce "Graded Learning" as a simplified, application-oriented adaptation of the more formal and algorithmic "Curriculum for Reinforcement Learning", and show via experiments that it helps the learning task converge for complex nonlinear real-world systems.




1 Introduction

This paper is a study of reinforcement learning (RL) as an optimal-control strategy. RL, a machine learning (ML) technique, mimics the learning abilities of humans and animals. RL has been used by OpenAI to program robot hands to manipulate physical objects with unprecedented human-like dexterity [b_OPENAI], by Stanford's CARMA program for autonomous driving [b_CARMA], and has been studied for faster de-novo molecule design [b_DENOVO].

Valves were selected as the control plant as they are ubiquitous in process control and employed in almost every conceivable manufacturing and production industry. The controller, called an “agent” in RL terminology, is trained using the DDPG (Deep Deterministic Policy-Gradient) algorithm.

Industrial process loops involve thousands of valves, which can be impossible to model accurately. Applying traditional control strategies, such as PID (proportional-integral-derivative), can therefore degrade the quality and efficiency of such processes and incur substantial costs. PIDs are nevertheless the de-facto industry standard and, according to an indicative survey, cover more than 95% of process-industry controllers [b_DESBOROUGH].

RL promises better control strategies by learning optimal control directly through interaction with the plant (such as a valve), thereby eliminating the need to model the plant accurately.

Connecting a computer to a real physical plant and having the RL agent learn through direct interaction may not always be feasible. A practical alternative, and the approach employed in this paper, is to simulate the plant as closely as possible to the real plant and train the agent against the simulation.

The literature was surveyed to study valve nonlinearity and to create a benchmark plant model for training the RL agent.

MATLAB Simulink™ is used to simulate a nonlinear valve, an industrial process, the agent training circuit and finally a unified validation circuit to evaluate RL and PID strategies side-by-side. The agent is trained using MATLAB’s recently launched (R2019a) Reinforcement Learning Toolbox™ [b_MATLAB].

Graded Learning, a technique discovered accidentally during this research, is a simple procedural method to efficiently train an RL agent on complex tasks; in effect it is the most simplified form of the more formal method known as "Curriculum Learning" [b_NARVEKAR], [b_WENG_CURRICULUM].

A summary of the research contributions of this work:

  1. A basic understanding of RL as an optimal-control strategy.

  2. A methodology targeted at assisting practicing plant engineers in applying Reinforcement Learning for optimal control in industry.

  3. Design and simulation techniques using MATLAB and Simulink™, instead of the more demanding open-source Python ecosystem.

  4. Graded Learning: A semi-novel "coaching" method, based on the naive form of Curriculum Learning. This is suitable for practicing engineers and is an application-oriented adaptation of the more formal and algorithmic "Curriculum for Reinforcement Learning".

  5. A short literature review of three published studies of RL used for control of valves.

  6. Experimental comparison of PID and RL strategies in a unified framework.

  7. Stability analysis of the RL controller in time and frequency domains.

  8. Experiential learning corroborated with published literature.

Finally, while the valve is the focus of the paper, the methods are adaptable to any industrial system.

2 Reinforcement Learning Primer

In this section we take a brief look at conventional optimal-control solving methods, followed by an overview of RL, its connection with optimal-control and finally the DDPG algorithm selected for implementation.

Sutton and Barto’s book [b_BARTO] is the most comprehensive introduction to reinforcement learning and the source for theoretical foundations below.

2.1 Optimal Control and RL

Feedback controllers are traditionally designed using two different philosophies namely “adaptive-control” and “optimal-control”. Adaptive controllers learn to control unknown systems by measuring real-time data and therefore employ online learning. Adaptive controllers are not optimized since the design process does not involve minimizing any performance metrics suggested by users of the plant [b_FRANK].

Conventional optimal-control design, on the other hand, is performed off-line by solving Hamilton-Jacobi-Bellman (HJB) equations. According to [b_FRANK], solving HJB equations requires complete knowledge of the plant dynamics, and according to [b_TEDRAKE] this in turn requires an engineered guess as a starting point.

Richard Bellman's extension of the 19th-century theory laid down by Hamilton and Jacobi, and Ronald Howard's work in 1960 on solving Markov Decision Processes (MDPs), formed the foundations of modern RL. Bellman's approach used the concept of a dynamic system's state and of a "value-function". Dynamic programming, which uses the Bellman equation and is a "backward-in-time" method, along with temporal-difference (TD) methods, enabled the building of optimal adaptive controllers for discrete-time systems (i.e. time progression defined as t = 0, 1, 2, ...) [b_BARTO].

2.2 Optimal Control

The Hamilton-Jacobi-Bellman (HJB) equation (1) provides a sufficient condition for optimality [b_TEDRAKE]:

0 = min_u [ l(x, u) + (dJ*/dx) f(x, u) ]    (1)

Controller policies (i.e. behaviors) are denoted by pi and optimal policies by pi*. If a policy pi*, and a related cost-function J*, are defined such that pi* minimizes the right-hand side of the HJB (1), driving it to zero for all states x, then J* is the optimal cost-to-go and


Equation (1) assumes that the cost-function is continuously differentiable in x and u, and since this is not always the case it does not cover all optimal-control problems. In [b_TEDRAKE], Tedrake shows that solving the HJB depends on an engineered guess of the cost-to-go: a first-order regulator, for example, is designed starting from a guessed solution, and a Linear Quadratic Regulator is designed similarly. For complex, dynamic mechanical systems such initial solutions are hard to guess unless severely approximated, and it is in such situations that RL shows the relative ease with which real-world optimal controllers can be learned.
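As a toy illustration of this backward-in-time dynamic programming, the following sketch (our own, not from the paper) runs value iteration on a five-state chain; the states, actions and rewards are invented purely for illustration:

```python
# Value iteration on a toy discrete "valve position" MDP, illustrating the
# Bellman backup behind the HJB idea. States are positions 0..4, actions
# move the position left/right, and reaching state 4 earns a reward.
n_states, gamma = 5, 0.9
V = [0.0] * n_states
actions = [-1, +1]

def step(s, a):
    s2 = min(max(s + a, 0), n_states - 1)   # clip to valid positions
    r = 1.0 if s2 == n_states - 1 else 0.0  # reward for reaching the goal
    return s2, r

for _ in range(100):  # repeated Bellman backups until (near) convergence
    for s in range(n_states):
        V[s] = max(r + gamma * V[s2]
                   for s2, r in (step(s, a) for a in actions))
```

After convergence, states closer to the goal have higher value, exactly as the cost-to-go interpretation suggests.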

2.3 The RL framework

The core elements of RL are shown in Fig.1.

Figure 1: The basic RL flow [b_BARTO]

The learner and decision-maker is called the agent. The agent interacts with its environment continually, selecting actions to which the environment responds by presenting a new situation to the agent. The environment provides feedback on performance via rewards (or penalties). Rewards are scalar values. Over time, the agent attempts to maximize the rewards (or minimize the penalties); this reinforces good actions over bad ones, and the agent thus learns an optimal behavior, formally termed a policy.

In control system terminology — the agent is the controller being designed. The environment consists of the system outside the controller i.e. the valve, the industrial process, the reference signal, other sensors, etc. The policy is the optimal-control behavior the designer seeks. RL allows learning this behavior without having to be explicitly programmed or modeling the plant in excruciating detail.


Policy: The decision-making capability of the agent is based on a probability mapping of the best action to take vis-à-vis the state it is in. This mapping is called a policy, pi(a|s), the probability that the action A_t = a if the state S_t = s.

Returns: Returns represent long-term rewards, gathered over time. For the sequence of rewards received after time-step t, up to a final time-step T, the return is

G_t = R_{t+1} + R_{t+2} + ... + R_T    (3)
Discounting: Discounting provides a mechanism to control the impact of selecting an action whose rewards are immediate versus one whose rewards are received far into the future:

G_t = R_{t+1} + gamma R_{t+2} + gamma^2 R_{t+3} + ... = sum_{k=0}^{inf} gamma^k R_{t+k+1}    (4)

where gamma, the discount rate, is a parameter 0 <= gamma <= 1.
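As a quick illustration, the discounted return can be computed by folding the reward sequence backwards (an illustrative snippet, not from the paper):

```python
# Discounted return G_t = sum_k gamma^k * R_{t+k+1} for a finite reward
# sequence, computed by the backward recursion G_t = r + gamma * G_{t+1}.
def discounted_return(rewards, gamma=0.9):
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```

For example, discounted_return([1, 1, 1], gamma=0.5) gives 1 + 0.5*(1 + 0.5*1) = 1.75.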


Value-functions: These are functions of states (or state-action pairs) that provide an estimate of how good it is to be in a given state (or to perform a given action in a given state). A reward signal provides feedback on how "good" the current action is, in an immediate short-term sense. In contrast, a value-function provides a measure of "goodness" in the long term and is defined in terms of future expected return.

The value, denoted v_pi(s), is the expected return for a state s, measured starting in that state and following the policy pi thereafter:

v_pi(s) = sum_a pi(a|s) sum_{s',r} p(s', r | s, a) [ r + gamma v_pi(s') ]    (5)

Equation (5) is referred to as the Bellman equation; it forms the basis to approximately compute and learn v_pi and is therefore central to all RL algorithms.

Q-function: By including the action, q_pi(s, a) is defined as the expected return starting from state s, taking an action a and thereafter following policy pi.

Q-learning: Q-learning is an off-policy TD control algorithm that iteratively learns the Q-value. The value of each state-action pair is tracked. When an action A_t is performed in some state S_t, the two elements of feedback from the environment, the reward R_{t+1} and the next state S_{t+1}, are used in the update shown in (6), where alpha is the learning rate:

Q(S_t, A_t) <- Q(S_t, A_t) + alpha [ R_{t+1} + gamma max_a Q(S_{t+1}, a) - Q(S_t, A_t) ]    (6)

where Q(S_t, A_t) is the estimate of q*(S_t, A_t) and max_a Q(S_{t+1}, a) the estimate of the optimal value of the next state.
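Update (6) can be sketched in a few lines of illustrative Python on a toy five-state chain (the environment is invented for illustration, and all state-action pairs are swept exhaustively for determinism; the update rule itself is unchanged):

```python
# Tabular Q-learning sketch of update (6) on a 5-state chain where action 1
# moves right toward a rewarding goal state and action 0 moves left.
n_states, n_actions = 5, 2
alpha, gamma = 0.5, 0.9
Q = [[0.0] * n_actions for _ in range(n_states)]

def env_step(s, a):
    s2 = min(max(s + (1 if a == 1 else -1), 0), n_states - 1)
    return s2, (1.0 if s2 == n_states - 1 else 0.0)

for _ in range(500):                      # repeated sweeps
    for s in range(n_states):
        for a in range(n_actions):
            s2, r = env_step(s, a)
            # update (6): move Q(s,a) toward r + gamma * max_a' Q(s',a')
            Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
```

After the sweeps, the action leading toward the goal has the higher Q-value in every state, as expected.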

Optimal value-function: There always exists at least one optimal policy, denoted pi*, that guarantees the highest expected return; its action-value-function is the optimal q*.

Model-based and model-free RL methods: Accurate models of the environment allow “planning” the next action as well as the reward. By a model we mean having access to a “table” of probabilities of being in a state given an action and associated rewards.

RL methods that use environment models are called model-based methods, as opposed to simpler model-free methods. Model-free agents can only learn by trial-and-error [b_BARTO].

Actor-Critic methods: The actor-critic structure enables a forward-in-time class of RL algorithms that can be implemented in real time. The actor component, under a policy, applies an action to the environment and receives feedback that is evaluated by the critic component. Learning proceeds in two steps: policy-evaluation performed by the critic, followed by policy-improvement performed by the actor.

Figure 2: Actor-Critic architecture

2.4 The DDPG algorithm

MATLAB’s R2019a release provides six RL algorithms. DDPG is the only algorithm suitable for continuous action control [b_MATLAB].

In [b_LCRAP] Lillicrap et al. introduced DDPG to overcome the shortcomings of the DQN (Deep Q-Network) algorithm which in turn was an extension of the fundamental Q-learning algorithm.

DDPG is a model-free, policy-gradient-based, off-policy method; it uses a memory replay-buffer to store previous experiences. Being an actor-critic algorithm, it uses two neural networks. The actor network accepts the current state as input and outputs a single real value (i.e. the valve control signal) representing the action chosen from a continuous action space. The critic network evaluates the actor's output (i.e. the action) by estimating the Q-value of the current state given this action. Actor network weights are updated by a deterministic policy-gradient algorithm, while the critic weights are updated by gradients obtained from the TD error signal. The DDPG algorithm therefore simultaneously learns both a Q-function and a policy by interleaving them.

Exploration vs. exploitation: For RL, as is in humans, performance improvement is achieved by exploitation of actions that provided the highest reward in the past. However, to discover the best actions in the first place, the agent must explore the action space. Balancing the discovery of new actions while continuously improving the best action is a common challenge in RL. Various exploration-exploitation strategies have been developed.

DDPG uses the Ornstein-Uhlenbeck process (OUP) to enable exploration [b_OUP]. Interestingly, OUP was developed for modeling the velocities of Brownian particles with friction which results in values that are temporally correlated. The simpler additive Gaussian noise model causes abrupt changes from one time-step to the next (i.e. uncorrelated) whereas the OUP noise model more closely mimics real life actuators that exhibit inertia [b_LCRAP].

The exploration policy is constructed by adding noise sampled from the OUP noise process N to the selected action (i.e. the actor policy mu) at each training time-step:

mu'(s_t) = mu(s_t | theta^mu_t) + N_t    (7)
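An OUP noise source can be sketched as below; the parameter values (theta, mu, sigma, dt) are illustrative defaults, not the settings used in this work:

```python
import random, math

# Ornstein-Uhlenbeck noise sketch:
#   x_{k+1} = x_k + theta*(mu - x_k)*dt + sigma*sqrt(dt)*N(0, 1)
# Successive samples are temporally correlated, unlike iid Gaussian noise.
def ou_noise(n_steps, theta=0.15, mu=0.0, sigma=0.3, dt=1.0, seed=0):
    rng = random.Random(seed)
    x, out = 0.0, []
    for _ in range(n_steps):
        x += theta * (mu - x) * dt + sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        out.append(x)
    return out
```

Plotting such a sequence against iid Gaussian samples makes the "inertia" of the OUP visible: consecutive values drift rather than jump.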


3 Control Valves and RL

Control-valves modify fluid flow rates using an actuator mechanism that responds to a signal from the control system. Processing plants consist of large networks of such control-valves designed to keep a process-variable (such as pressure, temperature or flow) under control. These variables must be controlled within a specified operating range to ensure the quality of the end-product [b_ISA].

3.1 Nonlinearity in valves

Control-valves, like most other physical systems, possess nonlinear flow characteristics such as friction and backlash. Friction in turn has two components: stiction (static friction), the inertial force that must be overcome before there is any relative motion between the two surfaces and the prime cause of dead-band in valves, and dynamic friction, the friction in motion [b_CHOUDHURY_2004_Quantification], [b_CHOUDHURY_2004_Data_Driven].

Figure 3: Actual valve movement trajectory [b_CHOUDHURY_2004_Quantification]
Figure 4: Nonlinear valve operating characteristics, with stiction [b_CHOUDHURY_2004_Quantification]

Nonlinearity can cause oscillatory valve outputs that in turn cause oscillations of the process output resulting in defective end-products, inefficient energy consumption and excessive wear of manufacturing systems [b_CHOUDHURY_2004_Quantification], [b_CHOUDHURY_2005_Modelling]. According to [b_CHOUDHURY_2004_Quantification], 30% of process-loop oscillation issues are due to control-valves, while [b_DESBOROUGH] reports that valves are the primary cause of 32% of surveyed inefficient controllers. Stiction in control-valves has been reported as the prime source of sustained oscillations in industrial control-loops [b_CAPACI].

3.2 A mathematical valve model

RL requires experiences for training. Simulated environments often provide a quick and low-cost environment for training an agent. Since the objective of building a controller is for it to be used in the real world, one must strive to create as accurate an environment as possible. This appears to contradict the claim made earlier that RL does not require an accurate system model; however, it is assumed here that the real physical environment is inaccessible. If it were accessible, or available as a lab setup, the RL agent (controller) could well learn directly from real experiences.

In this paper we use first-principles to model the valve as outlined in [b_CAPACI].

He and Wang [b_HE_2007], [b_HE_2010] describe the nonlinear memory dynamics of the valve by its stem position x(t) at a time-step t, where x(t) is expressed by relation (8). While the controller outputs u(t), the actual position the valve attains is x(t), with e(t) = u(t) - x(t-1) representing the valve position error. f_S and f_D are the static (stiction) and dynamic friction parameters, dependent on the valve type, size and application. The "Experimental Setup" section will later describe the Simulink modeling of the valve.

x(t) = x(t-1),                   if |e(t)| <= f_S
x(t) = u(t) - sign(e(t)) f_D,    if |e(t)| > f_S    (8)


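The stiction model of relation (8) translates directly into code; the friction values below are arbitrary placeholders, not the benchmark parameters used in our experiments:

```python
import math

# Sketch of the He-Wang stiction model (8): the valve stays stuck while the
# position error is inside the static-friction band f_s; once it breaks
# free, dynamic friction f_d makes the attained position lag the command.
def valve_position(u, x_prev, f_s=2.0, f_d=1.0):
    e = u - x_prev                        # position error e(t)
    if abs(e) > f_s:                      # enough force to break stiction
        return u - math.copysign(f_d, e)  # moves, minus dynamic-friction lag
    return x_prev                         # stuck inside the stiction band
```

For example, with these placeholder values a command of 10.0 from a resting position of 0.0 attains 9.0, while a command of 1.5 leaves the valve stuck at 0.0.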

3.3 RL for valve control: A literature research

The field of RL is relatively new, and few studies of its application to the control of valves were found: Scopus returned only 18 results for "reinforcement learning AND valves AND control" (Fig.5).

Figure 5: Scopus: Publications on RL for valve control

A study of three publications is presented below with emphasis on areas that can be compared with our research.

3.3.1 Throttle valve control

Throttle valves find application in both industrial and automotive settings.

Control of a throttle valve is challenging due to the highly dynamic behavior of the spring-damper design of the valve system and complex nonlinearities [b_BISCHOFF], [b_SCHOKNECHT]. The authors of [b_HOWELL] indicate that the challenge arises from the multiple-input-multiple-output nature of the throttle valve optimization problem.

Bischoff et al. [b_BISCHOFF] use PILCO (probabilistic inference for learning control), a practical, data-efficient model-based policy search method. PILCO reduces model bias, a key problem of model-based RL, by learning the probabilistic dynamics of the model and then explicitly incorporating model uncertainty into long-term planning. PILCO works with very little data and facilitates learning from scratch in only a few trials, thereby alleviating the need for the millions of episodes normally required by trial-and-error model-free methods [b_PILCO].

Throttle valve dynamics are modeled using the flap angle, angular velocity and the actuator input. The valve must be controlled at an extremely high rate of 200 Hz and without any overshoot, as overshoots result in engine torque jerks. The controller learns by minimizing the expected sum of cost over time:

J^pi = sum_{t=0}^{T} E[c(x_t)]    (9)
To apply the constraint of zero overshoot, a novel asymmetric saturating cost-function is applied as seen in Fig.6. A trajectory approaching the goal (red) incurs a rapidly decreasing cost as it nears the goal while overshooting the goal incurs a disproportionately high cost almost immediately [b_BISCHOFF].

Figure 6: Asymmetric cost-function to avoid overshoots [b_BISCHOFF]

The effectiveness of the asymmetric cost-function is evident in their results (blue) in Fig.7, which show no overshoot and only low-noise behavior of the controlled profile.

(a) Control profile
(b) Zoomed section shows minor aberrations
Figure 7: Throttle valve control using PILCO [b_BISCHOFF]

3.3.2 Heating, ventilation and air-conditioning (HVAC) control

Wang et al. [b_WANG] use a model-free, proximal actor-critic based RL algorithm to control the nonlinear dynamics of HVAC systems, where the hot-water flow is governed by a power-law relation. RL is compared to Proportional-Integral (PI) and Linear Quadratic Regulator (LQR) control strategies. 150 time-steps are used to allow the RL controller sufficient time to learn to track the set-point. Disturbances are simulated using random-walk algorithms. The actor network configuration is [50, 50] and the critic is a single layer of 50 units. One interesting aspect of their network architecture is the use of GRUs (Gated Recurrent Units) to overcome the problem of vanishing/exploding gradients.


Fig.8 shows that the RL controller responds much faster than the LQR and PI controllers and tracks the reference signal better, thereby achieving lower Integral Absolute Error (IAE) and Integral Square Error (ISE) than both competing strategies. However, the RL response is noisy, with high variance, against the smooth trajectories of the PI and LQR controllers. Significant overshoots are also seen in the RL response.

(a) Control profile
(b) Zoomed section shows a noisy control profile
Figure 8: HVAC control [b_WANG]

3.3.3 Sterilization of canned food

Thermal processing used for sterilization of canned food results in deterioration of the organoleptic properties of the food. Controlling the thermal process is therefore important. In [b_SYAFIIE] Syafiie et al. apply Q-learning to learn the temperature profile that can be applied for the minimal time during the two stages of the thermal process — manipulation of the saturated-steam valve to cause heating and then cooling by opening the water valve.

A simple scalar reward set [+1.0, 0.0, -2.0] is used, penalizing an action that deviates from the desired state twice as heavily as a correct action is rewarded. The paper does not evaluate continuous rewards. Fig.9 shows the controlled temperature profile.

(a) Control profile
(b) Zoomed section shows aberrations
Figure 9: Thermal process control using Q-learning [b_SYAFIIE]

Overall observations on the three researched papers:

  1. Disturbances in the RL controlled signal are evident in all three implementations: ([b_BISCHOFF], [b_WANG] and [b_SYAFIIE]).

  2. Use of stochasticity mechanisms other than OUP to enable exploration of action space: ([b_BISCHOFF] and [b_WANG]).

  3. Use of a novel objective function in [b_BISCHOFF].

  4. None of these evaluated the stability of the RL controller design — an important consideration for an emerging breed of controllers.

  5. MATLAB was not used as the design platform in any of them, which is understandable given that MATLAB's Reinforcement Learning Toolbox was launched only in 2019.

  6. Only [b_WANG] compared the RL controller against traditional PI control.

4 Experimental Setup

This section describes the creation of the experimental setup, using MATLAB and Simulink, for design and evaluation of the RL and PID controllers. Fig.10 shows the core components.

Figure 10: Basic block components

Our setup uses elements from the excellent 2018 paper, "An augmented PID control structure to compensate valve stiction", by Bacci di Capaci and Scali [b_CAPACI].

Traditional PID controllers, tuned solely on process dynamics, cause sustained oscillations attributed to the integral component, which produces excessive variation of the control action in overcoming static friction [b_CAPACI]. As a solution, [b_CAPACI] presented a novel PID-based controller, Fig.11(a), where stiction is overcome by employing a two-move control sequence (11) as the valve input.

(a) Two-move compensator
(b) Compensator results on a constant reference signal
(c) Compensator results on a process with loop perturbations
(d) Recreated “benchmark waveform”
Figure 11: Bacci di Capaci and Scali’s “PID compensator” [b_CAPACI]

where estimates of the stiction and dynamic friction parameters, and of the steady-state position of the valve, are required. This also shows the reliance of the technique on correct estimation of these parameters.

The setup components:

  1. A PID (with filter) controller tuned using MATLAB’s auto-tuning feature.

  2. A training setup for the RL agent using the DDPG algorithm.

  3. A unified framework for experimentation and evaluation of controllers.

    Items below were based on [b_CAPACI]:

  4. Nonlinear valve model (11), including the valve friction parameters f_S and f_D.

  5. Two industrial processes controlled by the valve:

    1. Normal process (13)

    2. Process with loop perturbations (22)

  6. A “benchmark waveform” profile with noise parameters (Fig.11(d)).

4.1 Modeling the valve

Simscape Fluids™ (formerly SimHydraulics™) provides simulations for several valve types and is the simplest and quickest option; [b_POPINCHALK] is a MathWorks article that describes how to enhance these into more realistic models using an understanding of system dynamics.

We, however, use first principles to model the nonlinear valve mathematically. Algebraically rearranging the equations shown in (11) produces (12); these equations are then implemented in Simulink using a "user-defined-function" and a "memory" block, shown in Fig.12, with the friction parameters f_S and f_D.

(a) Valve stiction modeling
(b) MATLAB valve model script
Figure 12: Simulink valve model

4.2 Modeling the “industrial” process

The benchmark "industrial process" is modeled as a first-order plus time-delay (FOPTD) process (13), using transfer-function and time-delay blocks as shown in Fig.13:

G(s) = K e^{-theta s} / (tau s + 1)    (13)

where K is the process gain, tau the time-constant and theta the time-delay.

Figure 13: FOPTD process model
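A FOPTD response can be sketched with a forward-Euler simulation; the gain, time-constant and delay below are illustrative values, not the benchmark settings:

```python
from collections import deque

# Forward-Euler simulation of a FOPTD process:
#   tau * dy/dt + y = K * u(t - theta)
# The time delay is realized as a FIFO buffer of theta/dt samples.
def simulate_foptd(u_seq, K=1.0, tau=2.0, theta=1.0, dt=0.1):
    delay_steps = int(round(theta / dt))
    buf = deque([0.0] * delay_steps, maxlen=delay_steps or 1)
    y, out = 0.0, []
    for u in u_seq:
        u_delayed = buf[0] if delay_steps else u  # oldest buffered input
        if delay_steps:
            buf.append(u)                         # push, dropping the oldest
        y += dt * (K * u_delayed - y) / tau       # first-order lag update
        out.append(y)
    return y and out or out
```

A unit-step input shows the two signatures of a FOPTD process: no response at all during the dead time, then an exponential rise toward the gain K.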

4.3 PID controller setup

Figure 14: PID control setup

A PID-controlled output is a function of the feedback error, represented in the time-domain as:

u(t) = K_p e(t) + K_i integral_0^t e(tau) dtau + K_d de(t)/dt    (14)

where u(t) is the desired control signal and e(t) = r(t) - y(t) is the tracking error between the desired output r(t) and the actual output y(t). This error signal is fed to the PID controller, which computes both the derivative and the integral of the error with respect to time, providing a set-point tracking effect; this continues in closed loop for as long as the controller is in effect.

The ideal theoretical PID form exhibits a drawback for high-frequency signals: the derivative action results in very high gain, so high-frequency measurement noise generates large variations in the control signal. Practical implementations reduce this effect by replacing the derivative term K_d s (where differentiation is represented by s in the Laplace form) with a first-order filtered term K_d s / (1 + s/N), as in (15) [b_MURRAY]:

C(s) = K_p + K_i/s + K_d s / (1 + s/N)    (15)

The filter coefficient N determines the pole location of the filter, which helps attenuate the high gain on high-frequency noise. An N between 2 and 20 is recommended. A high value (N -> infinity) results in (15) approaching the ideal form (14) [b_MURRAY].

The PID was tuned using MATLAB's auto-tuning feature to obtain the coefficients K_p, K_i, K_d and N. The low value of N obtained acts to suppress the derivative term.
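A discrete-time sketch of the filtered form (15) is shown below; the gains are placeholders, not MATLAB's auto-tuned values, and the derivative filter is discretized with backward Euler:

```python
# Discrete PID with a first-order filter on the derivative term, as in (15).
class FilteredPID:
    def __init__(self, kp, ki, kd, n, dt):
        self.kp, self.ki, self.kd, self.n, self.dt = kp, ki, kd, n, dt
        self.integral = 0.0
        self.d_state = 0.0   # filtered-derivative state
        self.e_prev = 0.0

    def update(self, e):
        self.integral += e * self.dt
        # Filtered derivative d satisfies d + (1/N)*dd/dt = de/dt.
        # Backward Euler gives: d += a * (raw_derivative - d).
        raw = (e - self.e_prev) / self.dt
        a = self.n * self.dt / (1.0 + self.n * self.dt)
        self.d_state += a * (raw - self.d_state)
        self.e_prev = e
        return self.kp * e + self.ki * self.integral + self.kd * self.d_state
```

A small N makes the coefficient `a` small, heavily smoothing the derivative, which mirrors the suppression of the derivative term noted above.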

4.4 RL controller setup

This section describes the Simulink design for training the RL controller using the DDPG algorithm.

Fig.15 shows the training setup. A switch allows testing a trained model on various signals built via a "signal-builder" block. Training an RL agent involves significant hyperparameter tuning, and this setup allows quick experiments and evaluations by activating a "software" switch.

Figure 15: RL DDPG agent training setup

4.4.1 RL controller design


Fig.16 shows the DDPG Agent Simulink block and how feedback from the environment is channelized via the Observations vector. It also shows the block that computes rewards and the stop-simulation block that controls the termination of an episode.

Figure 16: RL DDPG agent details

4.4.2 Environment design

Several design factors need consideration when building the environment for efficiently training the agent to learn to follow the trajectories of a control signal. They can broadly be classified into agent related and environment related. Agent related factors are composition of the observations vector and the reward strategy. Environment related factors must cover the training strategy, training signals, initial conditions of the environment and criteria to terminate an episode (for episodic tasks).

4.4.3 Training strategy

One could train the RL agent to learn to follow the exact benchmark trajectory (Fig.11(d)); however, this is a very constrained strategy. Instead, the agent was trained to follow straight-line signals at random levels, and was additionally challenged to learn to start at a randomly initialized flow value. Together this forms an effective and generalized training strategy that teaches the agent to follow any control-signal trajectory composed of straight lines. The RL Toolbox allows overriding the default "reset function", which assists in implementing this strategy.

 env.ResetFcn = @(in)localResetFcn(in,

4.4.4 Observation vector

The observation vector used was [y(t), e(t), integral of e(t)], where y(t) is the actual flow achieved, e(t) the error with respect to the reference, and the final element the integral of the error.

Integral of error: The instantaneous error has no memory. The integral of error, which is the area under the curve as time progresses, provides a mechanism to compute the total error gathered over time and drive the agent to lower this (Fig.17).

This is an important observation input often used in training of RL controllers.

Figure 17: Error integral

The observation vector is modeled as shown in Fig.18.

Figure 18: RL observations vector

4.4.5 Rewards strategy

Rewards can be assigned via discrete, continuous or hybrid functions. Equation (16) is a simple discrete form, for example:

r(t) = 1 if |e(t)| <= epsilon, 0 otherwise    (16)

where epsilon is some allowable error margin.

Equation (17) shows a reward that varies continuously as a function of the error e(t), where epsilon_0 is a small constant that avoids a division-by-zero error:

r(t) = 1 / (|e(t)| + epsilon_0)    (17)
Well-designed continuous-reward functions help agents learn to stay as close as possible to the reference signal during the early learning stages. Fig.19 shows the final implementation, a hybrid form: the reciprocal of the absolute error drives the controller to lower the error further and further, while the discrete part of the reward is the "penalty" block, which assigns a set penalty for exceeding the flow limits.

Figure 19: RL rewards computation block
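The hybrid scheme can be sketched as follows; the constant epsilon_0, the flow limits and the penalty value are illustrative assumptions, not the exact settings of the implemented block:

```python
# Hybrid reward sketch: continuous reciprocal-of-|error| shaping term (17)
# plus a discrete penalty when the flow leaves its allowed range.
def reward(error, flow, flow_min=0.0, flow_max=100.0, eps=1e-2, penalty=-10.0):
    r = 1.0 / (abs(error) + eps)           # continuous part: grows as error -> 0
    if not (flow_min <= flow <= flow_max):
        r += penalty                        # discrete out-of-bounds penalty
    return r
```

The continuous part rewards ever-smaller tracking errors, while the penalty term discourages the agent from ever leaving the valid flow range.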

4.4.6 Actor and Critic networks

The actor-critic DDPG components were implemented as shown in Fig.20. The networks have fully-connected layers, initialized with small random weights before beginning the training.

The actor network output is normalized to be between [-1, 1] using a tanh layer. This allows better learning and convergence for continuous action spaces.

(a) Policy (actor) network
(b) Critic (action-value) network
Figure 20: DDPG network architectures

4.4.7 Ornstein-Uhlenbeck (OU) action noise parameters

Guidelines for computing the DDPG exploration parameters, i.e. the noise-model variance and the decay rate of the variance, are provided by MATLAB [b_MATLAB_DDPG]: the variance is chosen such that

Variance * sqrt(T_s) lies between 1% and 10% of the action range    (18)

where T_s is the sampling time.

The half-life of the variance, in time-steps, is first decided; the decay rate of the variance is then computed using:

VarianceDecayRate = 1 - 0.5^{1/h}    (19)

where h is the chosen half-life in time-steps.
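In code, the half-life relation (19) amounts to one line (illustrative):

```python
# Decay rate d for geometric decay variance_{k+1} = variance_k * (1 - d):
# after h steps the variance halves when (1 - d)^h = 0.5, so d = 1 - 0.5^(1/h).
def decay_rate_from_half_life(h):
    return 1.0 - 0.5 ** (1.0 / h)
```

Applying the resulting rate for exactly h steps halves the variance, which is a quick sanity check of the formula.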
4.4.8 Final DDPG hyperparameters

Summarized below in Table 1 are the final set of DDPG hyperparameters.

Hyperparameter              Setting
Critic learning rate        1
Actor learning rate         1
Critic hidden layer-1       50 fully-connected
Critic hidden layer-2       25 fully-connected
Action-path neurons         25 fully-connected
Action-path bound           tanh layer
Gamma                       0.9
Batch size                  64
OUP variance                1.5
OUP variance decay rate     1

Table 1: DDPG hyperparameter settings

4.5 Setup for comparative study

An environment that combines the PID and RL strategies for a comparative evaluation is shown in Fig.21. It allows experimenting with various reference signals and studying the effects of noise added at three disturbance points: the input of the controller, the output of the controller (i.e. the input of the plant), and the output of the plant.

It provides a convenient platform to perform additional experiments using elements such as set-point filters, output smoothening filters, etc.

Figure 21: Unified setup for a comparative evaluation of RL and PID control strategies

5 Graded Learning

Before presenting the results of the experiments we elaborate on a coaching method termed "Graded Learning". This simple, intuition-based approach was discovered accidentally during the hundreds of experiments and trials (163, to be exact) that were conducted in attempts to train a stable RL agent. It must be noted that this method is equivalent to the naive, domain-expert-dependent form of the more formal method known as "Curriculum Learning" [b_WENG_CURRICULUM], [b_NARVEKAR].

Applying automatic Curriculum Learning requires algorithmic design and the implementation of complex frameworks [b_PORTELAS], for example the ALP-GMM (absolute learning progress Gaussian mixture model) "teacher-student" framework, in which a "teacher" neural network samples task parameters from a continuous space to generate a learning curriculum. Automated Curriculum Learning is currently not possible in MATLAB and will therefore be difficult for many practising engineers to apply. Graded Learning, on the other hand, requires no programming and can be implemented directly by a control engineer.

Fig.22 shows examples of the numerous challenges faced during training: sometimes experiments with thousands of episodes did not produce a stable learning curve, and sometimes they produced inexplicable controller actions. Some training trials lasted 20,000 episodes running for over 20 hours; it is therefore important to streamline these efforts.

(a) Inexplicable learning curves
(b) Inexplicable controller actions
Figure 22: RL agent training challenges

Graded Learning helped avoid some of these challenges. The intuition for Graded Learning was based on observing how human instructors structure coaching of a new skill for apprentices.

While new skills such as chess or tennis are taught with the final goal in mind, one never starts with the hardest lessons. Foundation level skills are taught first and once some level of proficiency is gained, the student graduates to the next level with marginally more complex problems than the previous level. Skills and experiences gained in the previous level are retained and progressively built upon as one moves from one level to the next.

Graded Learning extends this iterative, staged approach to RL. The RL task is first broken down to its fundamental level and an agent is trained for a fixed number of episodes or until a convergence criterion is met. The next level of complexity is then added to the task. Transfer-learning is used to ensure previous experience is retained and built upon. Once this level of the task is learned, the process of adding further complexity continues, and each time transfer-learning allows the agent to build upon the experience gained at the previous levels.

Transfer-learning is a machine learning technique used to “transfer” the learning, i.e. the stabilized weights of a neural-network, from one task (or domain in general) to another without having to train the neural-network from scratch [b_KARL].
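Operationally, transfer-learning means the next stage's network is initialized with the previous stage's stabilized weights rather than from scratch. The Python snippet below is a toy illustration of this idea (the paper itself uses MATLAB's Reinforcement Learning Toolbox); `make_policy` and `train` are hypothetical stand-ins.

```python
import copy

def make_policy(weights=None):
    """Toy policy network represented as a dict of layer weights (hypothetical)."""
    if weights is None:
        weights = {"hidden": [[0.0] * 4 for _ in range(4)], "out": [0.0] * 4}
    return {"weights": weights}

def train(policy, difficulty):
    """Stand-in for a training run: nudge weights as if learning at this grade."""
    policy["weights"]["out"] = [w + 0.1 * difficulty for w in policy["weights"]["out"]]
    return policy

# Grade-I: train from scratch on the easiest task
grade1 = train(make_policy(), difficulty=0.1)

# Grade-II: transfer-learning -- initialize from Grade-I's stabilized weights
grade2 = make_policy(weights=copy.deepcopy(grade1["weights"]))
grade2 = train(grade2, difficulty=0.5)
```

The key point is the second `make_policy` call: Grade-II starts from Grade-I's weights, so experience is carried forward rather than relearned.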

The Graded Learning approach was discovered when the time-delay in (13) was reduced to zero: the agent then quickly stabilized, in contrast to the hundreds of earlier attempts, and this assisted in satisfactorily training a stable controller.

Fig.23 demonstrates the method in action, with the agent evolving over six stages of increasing difficulty. The parameters that are progressively increased are the time-delay and the static and dynamic friction.

Both the stability analysis and the experimental results presented next demonstrate that Graded Learning applied to valve control (and possibly other complex industrial systems) appears to be an effective way to coach an RL agent.
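In outline, Graded Learning is a loop over grades of increasing plant difficulty, carrying the weights forward at each grade. The Python sketch below is illustrative only: `train_agent` is a hypothetical stand-in for a DDPG training run, and only the Grade-VI plant values (time-delay 2.5, friction 8.4/3.524) are taken from the paper; the intermediate schedule entries are invented for illustration.

```python
def train_agent(weights, time_delay, f_static, f_dynamic, episodes):
    """Hypothetical stand-in for one DDPG training run on the valve model;
    a real run would interact with the Simulink environment."""
    return weights + [(time_delay, f_static, f_dynamic, episodes)]

# Illustrative difficulty schedule (time-delay, static friction, dynamic
# friction, episodes); only the Grade-VI plant values (2.5, 8.4, 3.524)
# are the paper's final parameters.
schedule = [
    (0.1, 0.084, 0.0352, 1000),
    (0.5, 0.840, 0.3520, 1000),
    (2.5, 8.400, 3.5240, 2000),
]

weights = []  # trained from scratch only once, at the easiest grade
for time_delay, f_static, f_dynamic, episodes in schedule:
    weights = train_agent(weights, time_delay, f_static, f_dynamic, episodes)
```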

Grade      Time-delay  Episodes  Time (h)
Grade-I.1  0.1         930       1.67
Grade-I.2  0.1         2000      12.35
Grade-II   0.5         1000      5.31
Grade-III  1.5         1000      5.21
Grade-IV   1.5         1000      4.65
Grade-V    2.0         500       2.27
Grade-VI   2.5         2000      7.59
Total                  8430      39.05
Table 2: Graded Learning: staged learning parameters, training episodes and times (Grade-VI also raises the static and dynamic friction to 8.4 and 3.524)
(a) Grade-I: time-delay 0.1
(b) Grade-I.2: Grade-I trained for a further 1000 episodes
(c) Grade-II: time-delay 0.5
(d) Grade-III: time-delay 1.5
(e) Grade-IV: time-delay 1.5
(f) Grade-V: time-delay 2.0
(g) Final learned model, Grade-VI: time-delay 2.5, static friction 8.4, dynamic friction 3.524
Figure 23: Graded Learning

6 Experiments, Results and Discussion

In this section we present the results of experiments conducted on a unified framework and evaluate the RL controller’s performance and compare it with the PID (with filter) controller.

Before conducting the experiments a stability analysis of the RL controller must be carried out.

6.1 Stability Analysis of RL Control

A basic stability analysis of the RL control is attempted in this section.

Figure 24: Block diagram of a single-loop control system

The open-loop transfer-function of the system is L(s) = C(s)G(s), where C(s) is the controller. The transfer-function of the plant is G(s) = G_P(s)G_V(s), where G_P(s) is the transfer-function of the FOPTD process (13) and G_V(s) is the transfer-function of the nonlinear valve, which is unknown and must be estimated.

Simulink’s Control Design Linearization Analysis™ tool provides a GUI-based interface to generate a linear approximation of a nonlinear system, computed across specified input and output points. However, it does not allow any control over the estimation, in contrast to MATLAB’s tfest function.

The programmatic method lets the user control the estimation of the transfer-function by specifying the number of poles (np) and zeros (nz). Additionally, the iodelay parameter allows experimenting with the effect of time-delays in physical systems. This MATLAB function is based on [b_GARNIER].

sys = tfest(data, np, nz, iodelay)

The block-diagram in Fig.24 shows the points at which input and output data will be tapped to estimate the controller transfer-function, and the points used to estimate the complete plant transfer-function. Fig.25 is the Simulink setup used for the estimation.

Figure 25: Setup for transfer-function estimation

Estimated plant transfer-function: The continuous-time transfer-function (20) for the plant was estimated by MATLAB as shown in Fig.26; the fit percentage and MSE are reported in the figure.

Figure 26: MATLAB’s transfer-function estimation for the plant

Estimated controller transfer-function: Equation (21) is the estimated continuous-time transfer-function for the controller.


We plot (Fig.27) the plant’s response using the estimated transfer-functions against the original RL signal to ensure that it is reasonably close and will serve the purpose of gauging stability. It must be noted that the estimation is approximate, and this method is provided as a means of illustrating a very basic stability analysis.

Figure 27: RL controller: Waveform of the estimated transfer-function

Stability analysis: The step-response in Fig.28 shows a stable closed-loop system. The open-loop Bode plot, Fig.29, shows a gain-margin of 10.9 dB and a phase-margin of 68.0 degrees, indicating a fairly stable system.
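For readers who wish to reproduce such a check outside MATLAB, gain and phase margins can be computed numerically from the open-loop frequency response by locating the phase- and gain-crossover frequencies. The Python sketch below uses an illustrative open-loop transfer-function, not the paper's tfest estimate.

```python
import cmath
import math

def L(s):
    """Illustrative open-loop transfer-function L(s) = 4/(s+1)^3;
    the paper's own L(s) comes from the tfest estimates."""
    return 4.0 / (s + 1.0) ** 3

def bisect(f, lo, hi, iters=200):
    """Locate a sign change of f between lo and hi."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if f(lo) * f(mid) <= 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

# Phase-crossover frequency (Im L = 0, Re L < 0) -> gain margin
w_pc = bisect(lambda w: L(1j * w).imag, 0.5, 10.0)
gm_db = 20.0 * math.log10(1.0 / abs(L(1j * w_pc)))

# Gain-crossover frequency (|L| = 1) -> phase margin
w_gc = bisect(lambda w: abs(L(1j * w)) - 1.0, 0.1, 10.0)
pm_deg = 180.0 + math.degrees(cmath.phase(L(1j * w_gc)))

print(f"GM = {gm_db:.1f} dB, PM = {pm_deg:.1f} deg")  # both positive -> stable
```

Positive margins, as reported above for the estimated RL loop (10.9 dB and 68.0 degrees), indicate a stable closed loop.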

Figure 28: RL controller: Step response
Figure 29: RL controller: Open-loop Bode plot

6.2 Experiments and Results

In this section we present the results of experiments conducted on a unified framework that tests two valve control strategies — PID (with filter) and DDPG RL. Experiments with varying control signals, noise strengths and disturbance points were conducted, and a plant with process-loop perturbations was also tested. A critical time-domain analysis of the experimental results is presented, followed by a frequency-domain stability analysis.
Experiments conducted:

  1. Arbitrarily assumed constant reference level

  2. Benchmark waveform (with noise)

  3. Benchmark waveform subject to disturbances at:

    • Controller input (i.e. reference signal)

    • Plant input (i.e. controlled signal fed to plant)

    • Plant output (i.e. system output)

  4. Practical example of a “water-supply” valve, subject to ground-borne vibrations of passing trains

  5. Arbitrary control waveform

  6. Plant experiencing process loop-perturbations
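The three disturbance injection points used in Experiment-3 can be illustrated with a minimal discrete-time closed loop. The PI gains and first-order plant below are illustrative stand-ins, not the paper's FOPTD valve model.

```python
import random

def simulate(noise_at="none", steps=500, sigma=0.5, seed=0):
    """Closed-loop simulation of a PI controller on a first-order plant,
    with Gaussian noise injected at one of the three disturbance points
    of the comparative setup. Gains and plant are illustrative."""
    rng = random.Random(seed)
    y, integ = 0.0, 0.0
    setpoint, dt = 100.0, 0.1
    kp, ki, tau = 2.0, 0.5, 5.0
    for _ in range(steps):
        noise = rng.gauss(0.0, sigma)
        # Disturbance point 1: controller input (reference signal)
        ref = setpoint + (noise if noise_at == "controller_input" else 0.0)
        err = ref - y
        integ += err * dt
        u = kp * err + ki * integ
        # Disturbance point 2: plant input (controlled signal fed to plant)
        u += noise if noise_at == "plant_input" else 0.0
        y += dt / tau * (u - y)  # first-order plant: tau*dy/dt = u - y
        # Disturbance point 3: plant output (system output)
        y += noise if noise_at == "plant_output" else 0.0
    return y

for point in ("none", "controller_input", "plant_input", "plant_output"):
    print(point, round(simulate(noise_at=point), 2))
```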

6.2.1 Experiment-1: Constant reference signal

Experiment: A basic analysis is best done on a simple constant reference flow-rate, arbitrarily set at 100 and run over 2,000 time-steps. The reference signal is superimposed with the benchmark Gaussian noise.

Figure 30: Expt.-1: Constant reference signal

Observations: Fig.30 shows the PID and RL trajectories. The PID has a large overshoot and settles in about 700 time-steps. The RL strategy demonstrates close-to-ideal damping and a quicker settling time of about 220 time-steps. The RL trajectory, however, shows tiny ripples against the PID’s smoother profile. These oscillations can reduce the remaining useful life (RUL) of a mechanical system, and we study this by conducting a (simplified) two-factor DOE (design of experiments).
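Overshoot and settling time of the kind read off Fig.30 can be computed directly from a recorded trajectory. Below is a minimal Python sketch (the paper's tooling is MATLAB/Simulink) applied to a synthetic under-damped response; the 2% settling band and the test signal are illustrative assumptions.

```python
import math

def step_metrics(y, ref, band=0.02):
    """Percent overshoot, and settling index: the first sample after which
    the response stays within +/- band*ref of the reference."""
    overshoot = max(0.0, (max(y) - ref) / ref * 100.0)
    settle = 0
    for k, v in enumerate(y):
        if abs(v - ref) > band * ref:
            settle = k + 1
    return overshoot, settle

# Synthetic under-damped step response toward ref = 100 (illustrative data,
# not the paper's recorded PID/RL trajectories)
y = [100.0 * (1.0 - math.exp(-0.01 * k) * math.cos(0.05 * k)) for k in range(2000)]
ov, ts = step_metrics(y, 100.0)
print(f"overshoot = {ov:.1f}%, settled after sample {ts}")
```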

We vary two factors, the time-delay and the valve friction (combined static and dynamic), as shown in Table 3. The default values (time-delay 2.5, static friction 8.4, dynamic friction 3.524) are treated as the high levels, and each is lowered by a factor of 100 to obtain the low levels shown in Table 4.

Time-delay  Friction (static, dynamic)
Low         Low
Low         High
High        Low
High        High
Table 3: DoE table

Time-delay  Static friction  Dynamic friction
0.025       0.084            0.0352
0.025       8.400            3.5240
2.500       0.084            0.0352
2.500       8.400            3.5240
Table 4: DoE table with actual values
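The four runs of Tables 3 and 4 form a two-level full factorial design, which can be enumerated programmatically. A small Python sketch using the values from Table 4:

```python
from itertools import product

# Two-level factors from Table 4: time-delay and (static, dynamic) friction
time_delay = {"Low": 0.025, "High": 2.500}
friction = {"Low": (0.084, 0.0352), "High": (8.400, 3.5240)}

# Full factorial: every combination of factor levels, in Table 3's order
runs = [
    (td_level, fr_level, time_delay[td_level], *friction[fr_level])
    for td_level, fr_level in product(("Low", "High"), repeat=2)
]
for run in runs:
    print(run)
```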
(a) Time-delay Low; friction Low
(b) Time-delay Low; friction High
(c) Time-delay High; friction Low
(d) Time-delay High; friction High
Figure 31: Expt.-1: DoE with time-delay and friction parameters

Fig.31(a) highlights the RL’s capability to produce a very smooth profile when both factors are low, implying that the oscillations are not introduced by the RL technique itself. Fig.31(c) shows that the oscillatory behavior is caused mainly by the time-delay factor.

While the PID strategy (15) is implemented with a filter that suppresses noise, no filters were added to the RL setup, in order to better understand the natural response of the RL control strategy.

6.2.2 Experiment-2: The benchmark signal

Experiment: The waveform profile used in [b_CAPACI], with Gaussian noise, is applied to both strategies. We also zoom into sections of the time-domain plot of Fig.32 and observe them more closely in Fig.33.

Figure 32: Expt.-2: Benchmark waveform
(a) Zoomed section 1
(b) Zoomed section 2
(c) Zoomed section 3
(d) Zoomed section 4
Figure 33: Expt.-2: Zoomed sections of the benchmark signal

It is observed that the PID shows higher overshoots and undershoots, while the RL shows better tracking of the reference signal levels. If such a valve controls fluid flow, the excess and deficit fluid quantities could be detrimental to product quality. In Fig.32, the shifted PID waveform after the 800 mark could be detrimental to a process that depends on the timing of the fluid flow.

6.2.3 Experiment-3.a: Noise at controller input

Experiment: Increased noise is applied at the controller input.

(a) Entire trajectory plot
(b) Zoomed section
Figure 34: Expt.-3.a: Noise at controller input

Observations: Fig.34(a) and 34(b) show almost no impact on the PID when compared with Experiment-2 (lower noise at input) but increased impact on the RL trajectory, demonstrating the PID strategy’s superior noise attenuation capabilities. The RL continues to closely track the reference signal (along with the noise).

6.2.4 Experiment-3.b: Noise at plant input

Experiment: The source of noise is shifted to the plant input.

Figure 35: Expt.-3.b: Noise at plant input

Observations: Fig.35 shows that the PID trajectory is now impacted and it loses the relatively smooth output seen in Experiment-1 and Experiment-2. The RL strategy, on the other hand, remains unaffected compared to Experiment-1. The PID strategy adjusts itself based on the error signal and hence shows a change in behavior, while the RL strategy does not.

6.2.5 Experiment-3.c: Noise at plant output

Experiment: The effect of noise experienced at the plant output is studied here.

(a) Entire trajectory plot
(b) Zoomed section
Figure 36: Expt.-3.c: Noise at plant output

Observations: Fig.36(a) shows that both RL and PID strategies are affected equally by the noise.

6.2.6 Experiment-4: Water-supply valve, subject to ground-borne vibrations

Experiment: Valve applications can be exposed to extremely harsh conditions. A water-supply system, for example, may face ground-borne vibrations, such as from passing railways, in the range of about 30–200 Hz with varying amplitudes [b_TRAIN]. Since the control-valve assembly will often be placed in a shielded environment, frequencies between 30–100 Hz were assumed for the simulation.

(a) Entire trajectory plot
(b) Zoomed section
Figure 37: Expt.-4: Ground-borne vibrations of passing metros or trains

Observations: Figures 37(a) and 37(b) show that, as in Experiment-3.c, the impact of noise is similar on both strategies, and RL continues to track the reference signal better than PID.

6.2.7 Experiment-5: Arbitrary control waveform with benchmark noise signal

Experiment: This experiment tests the generalization capability of the RL controller’s training strategy vis-à-vis the generalization of PID tuning. The “training” signal for both strategies was the benchmark waveform; this experiment subjects them to a completely different waveform.

(a) Arbitrary control waveform
(b) Zoomed section 1
(c) Zoomed section 2
Figure 38: Expt.-5: Response to an arbitrary control waveform

Observations: Fig.38 shows that the RL controller considerably out-performs the PID strategy in this experiment. The RL controller tracks the arbitrary reference much more closely, demonstrating the importance of the training strategy for effective generalization. The PID trajectory, on the other hand, shows a significant lag while tracking the reference; if such a valve controls fluid flow, the untimely higher or lower fluid quantities could be detrimental to product quality.

Small ripples are evident in sections of the RL-controlled trajectory.

6.2.8 Experiment-6: Benchmark plant with process loop-perturbations

Experiment: This experiment tests resistance to severe process-loop perturbations, modeled by the transfer-function (22) [b_CAPACI].

(a) Response to benchmark signal
(b) Response to benchmark signal with lower strength
Figure 39: Expt.-6: Response to plant with perturbations

Observations: A severe limitation of the RL controller is evident in this experiment. Fig.39 shows a significantly stunted output, clamped smoothly at around 35.0. The setup was then tested with a lower-magnitude reference (Fig.39(b)), and the RL output remains clamped at the same level of 35.0. The PID seems to scale to different levels under the influence of perturbations, albeit with significant error. The RL controller also shows increased oscillatory behavior at the lower flow magnitude.

6.3 Discussion: Experiential Learning Validated against Published Research

“Experiential: relating to, derived from, or providing experience.”

A total of 163 experiments were conducted during this research. When experiments did not respond to seemingly logical steps, it led to considerable frustration; it was during this learning process that Graded Learning was discovered. In a quest to explain some of the strange observations, the results were related to previously published studies, which highlighted the several known challenges that exist and served as a reminder that RL is still an emerging field.

Early adopters of RL for control are encouraged to try both the Graded Learning method and study the literature referenced in this section — which is a collection of studies conducted at Google, MIT and Berkeley ([b_SONG], [b_HENDERSON], [b_HARDT] and [b_ZHANG]).

In [b_HENDERSON], the effects of hyperparameters and their tuning are analyzed with respect to network architecture, reward scaling and reproducibility for model-free, policy-gradient-based algorithms for continuous control, and the study is therefore directly applicable to the subject of this paper.

6.3.1 Over-fitting and saturation

For physical systems there is always an upper limit of rewards that the agent cannot cross; however, this is not known beforehand, and one often pushes the agent to continue training for hours. Significant neural-network saturation was observed in several of the training attempts.

Over-fitting in RL has begun to be studied only recently. [b_SONG] studied over-fitting in model-free RL and observed that the agent often mistakenly correlates reward with spurious observation-space features, terming this “observational overfitting”. In particular, they studied over-fitting with linear quadratic regulators (LQR) using neural-networks and showed that, under Gaussian initialization of the policy and gradient-descent training, a generalization gap “must necessarily exist” [b_SONG].

Fig.40 shows multiple examples of over-training and its effect on learning curves.

Figure 40: Effects of over-training and network saturation


[b_HARDT] provides a theoretical proof that stochastic gradient methods employing parametric models have vanishing generalization error when trained with a limited number of iterations. They argue this via experiments and by using the stability criteria for learning algorithms devised by Bousquet and Elisseeff. They conclude that shortened training time, by itself, sufficiently prevents over-fitting. This paper is important for extending stability criteria developed for supervised learning to iterative algorithms such as RL.


6.3.2 Sensitivity to network architecture

Four policy-gradient methods, including DDPG, are analyzed in [b_HENDERSON]. While ReLU activations were stated to perform best, the effects were not consistent across algorithms or hyperparameter settings.

6.3.3 Sensitivity to reward-scaling

A large and sparse reward scale causes network saturation, resulting in inefficient learning, as was observed in Fig.41(b). Reward rescaling is a technique recommended to improve results for DDPG (Fig.41(a)); it is achieved by multiplying rewards by a scalar such as 0.1, or by clipping them to [0, 1] [b_DUAN].
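A hedged sketch of the two options described above (multiplying by a small scalar, or clipping to [0, 1]); the scale and clip bounds are the values quoted from [b_DUAN], applied inside whatever reward function the environment defines:

```python
def rescale_reward(r, scale=0.1):
    """Reward rescaling: multiply the raw reward by a small scalar (e.g. 0.1)."""
    return scale * r

def clip_reward(r, lo=0.0, hi=1.0):
    """Reward clipping: clamp the raw reward to [lo, hi]."""
    return max(lo, min(hi, r))
```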

(a) Published results [b_HENDERSON]
(b) Inefficient learning at large reward scales
Figure 41: Effect of large reward spaces

6.3.4 Sensitivity to noise parameter

DDPG uses the Ornstein-Uhlenbeck process (OUP) to aid exploration. The effect of the OUP noise hyperparameters was not easily ascertainable.

Based on (19), for the chosen variance decay-rate and number of time-steps per full episode, the half-life of exploration decay is about 150 episodes, as seen in Fig.42(a). However, there is an exploration explosion after about 700 episodes (Fig.42(b)). As an experiment, the decay-rate was changed so as to imply a half-life of just about 15 episodes; however, Fig.42(c) shows no decay in exploration for over 1000 episodes.

It is possible that the mixed results agree with [b_PLAPPERT] in that explicit noise settings are not necessary for a continuous space to assist in exploration. It must be noted that such results can also be possible due to inexplicable interaction effects of multiple hyperparameters.
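The half-life arithmetic behind these observations can be checked numerically: if the exploration variance is multiplied by (1 − decay-rate) at every time-step, the half-life in episodes follows directly. The decay-rate and episode length below are illustrative values chosen to reproduce a roughly 150-episode half-life; the paper's actual settings are those of (19) and Table 1.

```python
import math

def half_life_episodes(decay_rate, steps_per_episode):
    """Episodes until the OUP exploration variance halves, assuming the
    variance is multiplied by (1 - decay_rate) at every time-step."""
    steps = math.log(0.5) / math.log(1.0 - decay_rate)
    return steps / steps_per_episode

# Illustrative values (decay rate 2e-5, 230 steps per episode)
print(round(half_life_episodes(2e-5, 230), 1))
```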

(a) Variance decay-rate, after 350 episodes
(b) Variance decay-rate, after 1300 episodes
(c) Reduced variance decay-rate
Figure 42: Effect of the OUP parameters

6.3.5 Sensitivity to random seeds

Intuitively, different random seeds should not affect the results of a stable process. According to [b_HENDERSON], however, environment stochasticity coupled with stochasticity in the learning process has produced misleading inferences even when results were averaged across multiple trials.

In conclusion, as stated by Henderson et al. [b_HENDERSON], one of the possible reasons for the difficulties encountered could be the “intricate interplay” of the hyperparameters of policy-gradient methods such as DDPG.

7 Conclusion

On the design front, the process of training a model-free reinforcement learning agent was outlined.

Hyperparameter tuning requires significant effort and patience when building a stable controller. We proposed Graded Learning, a naive form of the Curriculum Learning method: an engineer starts at the lowest complexity level, finds appropriate hyperparameter settings to understand the best reward strategy and reward scales to use, and then gradually increases the control-task complexity. This avoids several of the problems mentioned earlier, for example network saturation. For most industrial control systems, Table 1 should be a good starting point.

On the application front, experiments were conducted to evaluate the RL controller against the conventional PID control strategy.

The experiments showed that the RL strategy’s trajectory tracking is superior to the PID’s, while the PID demonstrates better disturbance rejection, in contrast to the disturbances that appear on the RL-controlled signal. While this appears to be the prime limitation of the RL controller, it must be noted that similar disturbances were evident in the published implementations studied as well ([b_BISCHOFF], [b_WANG] and [b_SYAFIIE]).

When challenged to track a control profile it was not trained on, the RL controller performed better while the PID lagged the reference signal; this suggests the RL controller will show versatility when applied to different control tasks within the same environment, without having to be retrained.

Overall the RL controlled process appears to promise better process quality, while the PID controlled process will cause a significantly lower stress on the valve operation and result in reduced wear-and-tear.

Enhancements and Future work: The RL controller that was designed needs a mechanism to reduce the oscillatory behavior in the presence of high frequency disturbance with strong amplitudes. For noise at the input and output of the controller a low-pass filter may help reduce the high variance.

Further work is necessary to understand ways of defining objective and reward functions that prevent the noisy RL trajectory behaviour. If this succeeds, it will be a better solution than applying a filter, which would slow down the response.

The MATLAB R2019b release includes the Proximal Policy Optimization (PPO) algorithm for continuous control, which should be evaluated. PPO is a more recent development and is considered more stable than DDPG [b_HENDERSON].

The fields of reinforcement learning, optimal control and control systems are extremely exciting. It is hoped that this work will motivate further research and help better understand, and hence popularize, the use of reinforcement learning for control systems.

This paper is a result of the work that began with the dissertation [b_RS] submitted to Coventry University, UK. I am immensely grateful for the encouragement and guidance I received during the dissertation work from my supervisors: Dr Olivier Haas, Associate Professor and Reader in Applied Control Systems at Coventry University, and Dr Prithvi Sekhar Pagala, Research Specialist at KPIT Technologies. Prof. Dr Acharya K.N.S must be thanked for instilling an interest in Control Systems through his teaching.

Rajesh Siraskar received the B.E. degree in Electronics and Telecommunications from Pune University, Pune, India, in 1990 and an M.Tech. degree in Automotive Electronics from Coventry University, UK in 2020. He works as a Data Scientist and develops solutions for industries ranging from automotive to energy and pharmaceutical to cement. He was previously a Six Sigma Master Black Belt. He is member of IEEE.