The process of controlling industrial plants, or parts of them, involves a variety of challenging aspects that reinforcement learning (RL) algorithms need to tackle. To cope adequately with the complexity of real-world systems, the following challenges need to be considered: continuous state and action spaces, high-dimensional and only partially observable state spaces, stochasticity induced by heteroscedastic sensor noise and latent variables, delayed effects, multi-criterial reward components, and non-stationarity in the optimal steerings, i.e. the optimal policy will not approach a fixed operation point.
Here, we consider applications where on-line learning, like the classical temporal-difference learning approach [1, 2], is prohibited for safety reasons, since it requires exploration on the plant’s dynamics. In contrast, batch RL algorithms generate a policy based on existing data, and this policy is deployed to the plant after training. In this setting, either the value function or the system dynamics are trained on historic operational plant data, which is a set of four-tuples (observation, action, reward, next observation), called batch in the following. Research from the past two decades suggests that the family of batch RL algorithms [3, 4, 5, 6] meets the requirements of real-world systems, especially when neural networks model either the state/action value function [7, 8, 9, 10, 11] or the system dynamics [12, 13, 14]. Moreover, batch RL algorithms are data efficient [7, 15], because the batch data is utilized repeatedly during the training phase.
In the following, we investigate different RL approaches on the Industrial Benchmark (IB) [16], which comes with the challenges that are vital in industrial settings, as described above. We report results for the Particle Swarm Optimization Policy (PSO-P) [17], an algorithm for RL with continuous actions that achieves good results out of the box. The actions to perform are derived from rollouts on a system model, which simulates the IB’s transition dynamics. This model, a recurrent neural network, was trained on a batch of transitions sampled by applying random actions to the IB. In real-world applications, however, transitions are usually available in the form of historic operational data that might have been produced by a constant default controller.
We compare the performance of PSO-P on the IB with two other RL approaches that utilize the batch in different ways. The first is the Recurrent Control Neural Network (RCNN) [15], a model-based RL algorithm for continuous actions that uses the system model during training of the controller. The second is Neural Fitted Q-Iteration (NFQ) [7], a model-free RL algorithm for discrete actions, where the controller is learned by iteratively applying Watkins’ Q-learning algorithm [18] to the batch data. Here, the system model is used to select the best policy after the NFQ training has finished. Policy selection is necessary because the performance of NFQ policies can decrease significantly over iterations, a well-known phenomenon in the context of neuro-dynamic programming [19].
II Industrial Benchmark
The IB aims at being realistic in the sense that it includes a variety of aspects that we found to be vital in industrial applications. It is not designed to be an approximation of any specific real-world system, but to pose hardness and complexity comparable to those found in many industrial applications. State and action space are continuous; the state space is high-dimensional and only partially observable. The actions consist of three continuous components and affect three steerings. Moreover, the IB includes stochastic and delayed effects. The optimization task is multi-criterial in the sense that there are two reward components that show opposite dependencies on the actions. The dynamical behavior is heteroscedastic, with state-dependent observation noise and state-dependent probability distributions based on latent variables. Furthermore, it depends on an external driver that cannot be influenced by the actions. The IB is designed such that the optimal policy will not approach a fixed operation point in the three steerings. Every specific design choice is driven by our experience with industrial challenges.
At any time step the RL agent can influence the environment (IB) via actions
that are three dimensional vectors in. Each action can be interpreted as three proposed changes to three observable state variables called current steerings. Those current steerings are: velocity , gain , and shift . Each of those is limited to , yielding
with scaling factors , , and .
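The steering update described above (a proposed change per steering, multiplied by a scaling factor and limited to a fixed range) can be sketched as follows. This is a minimal illustration, not the IB implementation: the function name `apply_action` is hypothetical, and the bounds and scaling factors in the usage lines are placeholder numbers, not the IB's actual constants.

```python
def apply_action(steerings, action, scales, lower, upper):
    """Return the next steerings after applying scaled, clipped changes.

    steerings, action, scales: sequences of equal length, here ordered as
    (velocity, gain, shift). lower/upper: shared bounds (assumed inputs).
    """
    return tuple(
        min(upper, max(lower, s + d * a))
        for s, a, d in zip(steerings, action, scales)
    )


# Placeholder numbers for illustration only; not the IB's actual constants.
current = (50.0, 50.0, 50.0)
delta = (1.0, -0.5, 0.0)
next_steerings = apply_action(current, delta, scales=(1.0, 2.0, 3.0),
                              lower=0.0, upper=100.0)
```

Clipping after scaling means that an action pushing a steering beyond its limit simply leaves the steering at that limit.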
After applying the action , the environment transitions to the next time step , yielding the internal state . State and successor state are the Markovian states of the environment. In many industrial applications, an operator-defined load is applied to the system. Depending on load and the control values , the system shows fatigue and consumes resources such as power, fuel, etc., represented by consumption . Both, and , are external drivers for the IB. In response, the IB generates values for and , which are part of the internal state . The reward is solely determined by :
In the real-world tasks that motivated the IB, the reward function has always been known explicitly. We therefore assume that the reward function is also known here and that consumption and fatigue are observable. However, except for the values of the steerings, the remaining part of the Markov state remains unobservable. This yields an observation vector consisting of:
- the current steerings: velocity, gain, and shift,
- the external driver: set point,
- and the reward-relevant variables: consumption and fatigue.
One of the IB’s features is the possibility to freeze its stochasticity. On the one hand, for data generation, online RL experiments, and policy evaluation, stochasticity makes the benchmark realistic and challenging. On the other hand, there are some settings where freezing the randomness in the stochastic effects might be useful. This is realized by remembering the applied seed of the IB-internal pseudo-random number generator (RNG). For instance, in the experiments presented in Section IV, we searched for the true maximum reward given a certain Markov state and its current RNG seed, as the upper bound for the RL technique performance evaluation. This has been done by applying PSO-P directly on the IB system dynamics, provided with full knowledge about the future, encoded in the RNG seed.
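The idea of freezing stochasticity can be illustrated with a toy environment whose only source of randomness is a single seeded generator. This is a sketch under assumptions: `NoisyEnv` is a hypothetical stand-in, not the IB, and it copies the full generator state via Python's `random.Random.getstate`/`setstate` rather than re-applying a stored seed.

```python
import random


class NoisyEnv:
    """Toy environment whose only stochasticity comes from one seeded RNG."""

    def __init__(self, seed):
        self.seed = seed
        self.rng = random.Random(seed)
        self.state = 0.0

    def step(self, action):
        # The noise sample depends only on the RNG state, not on the action.
        self.state += action + self.rng.gauss(0.0, 1.0)
        return self.state

    def clone(self):
        """Copy the state and the exact RNG state, so the future is replayable."""
        other = NoisyEnv(self.seed)
        other.state = self.state
        other.rng.setstate(self.rng.getstate())
        return other


env = NoisyEnv(seed=42)
env.step(0.5)
a, b = env.clone(), env.clone()
same_future = [a.step(0.1) for _ in range(5)] == [b.step(0.1) for _ in range(5)]
```

Because both clones see the same noise sequence, an optimizer can compare candidate action sequences against one fixed future, which is what makes an upper-bound search of this kind possible.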
In the Particle Swarm Optimization Policy (PSO-P) framework [17], solving an RL problem is reformulated as an optimization problem. RL is an area of machine learning in which a Markov decision problem has to be solved by learning from observed state transitions, with $s_t$ and $s_{t+1}$ representing the Markovian states, $a_t$ the applied action, and a real-valued reward, at discrete time steps $t$ and $t+1$. The goal is to find a policy maximizing the expected cumulative discounted reward, called return, where $\gamma \in [0, 1]$ is the so-called discount factor.
Since the true underlying Markovian state is not observable in the IB, it is approximated by a sufficient amount of historic observations, i.e. the information contained in the state at a given time step is approximated by the sequence of the most recent observations up to a fixed horizon.
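A fixed-length window over the most recent observations is one simple way to build such an approximate state. The sketch below is illustrative only: the class name `ObservationHistory` and the flat concatenation are assumptions, and in the experiments the history is processed by recurrent networks rather than concatenated.

```python
from collections import deque


class ObservationHistory:
    """Keep the most recent `horizon` observations as an approximate state."""

    def __init__(self, horizon):
        self.buffer = deque(maxlen=horizon)

    def push(self, observation):
        # The deque silently drops the oldest entry once it is full.
        self.buffer.append(tuple(observation))

    def ready(self):
        return len(self.buffer) == self.buffer.maxlen

    def state(self):
        """Concatenate the buffered observations, oldest first."""
        return tuple(x for obs in self.buffer for x in obs)


history = ObservationHistory(horizon=3)
for step in range(5):
    history.push((step, 10 * step))
approx_state = history.state()
```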
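Given a system model (see Section IV), trained by supervised learning methods on previous observations, finding the best action for a given observation history with respect to the return is described as an optimization problem: maximize the return that the model predicts for a sequence of future actions over a finite time horizon.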
The discount factor $\gamma$ is chosen such that at the end of the time horizon $T$ the last reward accounted for is weighted by $q$, which gives $\gamma = q^{1/(T-1)}$.
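As a small worked example of this choice, assume the rewards within the horizon are weighted by $\gamma^0, \dots, \gamma^{T-1}$, so that the last weight equals $q$. The function names and the numbers in the usage lines are illustrative assumptions, not values taken from the experiments.

```python
def discount_factor(q, horizon):
    """Return gamma such that gamma ** (horizon - 1) == q."""
    if horizon < 2 or not 0.0 < q <= 1.0:
        raise ValueError("need horizon >= 2 and 0 < q <= 1")
    return q ** (1.0 / (horizon - 1))


def discounted_return(rewards, gamma):
    """Sum of rewards weighted by gamma ** k for k = 0, 1, 2, ..."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))


# Illustrative numbers: the final step of a 100-step horizon gets weight 0.05.
gamma = discount_factor(q=0.05, horizon=100)
last_weight = gamma ** 99
```

Tying the discount factor to the horizon in this way keeps the influence of the final rollout step small but nonzero, regardless of how long the horizon is.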
Particle swarm optimization (PSO) [20] then searches for the action sequence that maximizes the model-predicted return.
Analogous to receding horizon control (RHC), only the first element of the optimized action sequence is returned. This yields an RL policy that conducts an optimization for every new observation. This might be computationally expensive, but it does not rely on a predefined closed-form policy structure, whose suitability is often hard to assess a priori for common RL algorithms on novel problems.
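The planning loop can be sketched with a small global-best PSO over action sequences. This is a minimal illustration under stated assumptions, not the PSO-P implementation: the one-dimensional `toy_model`, the swarm constants, the particle and iteration counts, and all function names are hypothetical choices made for the example.

```python
import random


def rollout_return(model, state, actions, gamma):
    """Discounted return of an action sequence, simulated on the model."""
    total = 0.0
    for k, action in enumerate(actions):
        state, reward = model(state, action)
        total += gamma ** k * reward
    return total


def pso_plan(model, state, horizon, gamma, n_particles=20, n_iters=30,
             low=-1.0, high=1.0, seed=0):
    """Search an action sequence with global-best PSO; return the best found."""
    rng = random.Random(seed)
    w, c1, c2 = 0.72984, 1.496172, 1.496172  # commonly used PSO constants
    pos = [[rng.uniform(low, high) for _ in range(horizon)]
           for _ in range(n_particles)]
    vel = [[0.0] * horizon for _ in range(n_particles)]
    best_pos = [p[:] for p in pos]
    best_fit = [rollout_return(model, state, p, gamma) for p in pos]
    g = max(range(n_particles), key=best_fit.__getitem__)
    g_pos, g_fit = best_pos[g][:], best_fit[g]
    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(horizon):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (best_pos[i][d] - pos[i][d])
                             + c2 * r2 * (g_pos[d] - pos[i][d]))
                # Keep every action inside its allowed range.
                pos[i][d] = min(high, max(low, pos[i][d] + vel[i][d]))
            fit = rollout_return(model, state, pos[i], gamma)
            if fit > best_fit[i]:
                best_fit[i], best_pos[i] = fit, pos[i][:]
                if fit > g_fit:
                    g_fit, g_pos = fit, pos[i][:]
    return g_pos, g_fit


def pso_policy(model, state, horizon=5, gamma=0.9):
    """Receding horizon: plan a whole sequence, return only its first action."""
    plan, _ = pso_plan(model, state, horizon, gamma)
    return plan[0]


def toy_model(state, action):
    """Hypothetical 1-D dynamics: move by the action, reward closeness to 0."""
    next_state = state + action
    return next_state, -next_state ** 2


first_action = pso_policy(toy_model, state=3.0)
```

Every policy query reruns the full search, which is why the cost per action grows with the number of particles, iterations, and rollout steps.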
We compare the performance of PSO-P on the IB with the well-established RCNN and NFQ methods. All of the applied RL techniques require an adequate system model that simulates IB trajectory rollouts:
- RCNN: The RCNN is trained on the system model, which is used to calculate the gradients for the policy’s weight update step.
- NFQ: Although NFQ is considered model-free RL, it is still very useful to evaluate the policy’s performance on the system model after each Q-iteration step, since performance drops during training are very likely to occur when applying NFQ to off-policy batch data. Therefore, in our experiments the policy with the best performance according to the system model is saved and returned as the final NFQ training result.
- PSO-P: The policy represented by PSO-P utilizes the system model at runtime. PSO-generated trajectories are rated by performing rollouts on the system model in every policy query.
In the following experiments the system model predicts consumption and fatigue for each step of the rollout, and the reward is computed according to Eq. (5).
To generate an adequate training data set , we initialized the IB 10 times for each set point and produced random trajectories of lengths , resulting in .
The system model was chosen to consist of two recurrent neural networks (RNN) and , to predict consumption and fatigue , respectively. Both models are unfolded a sufficient number of steps into the past ( time steps for , time steps for ) and 50 time steps into the future. In each time step they take the observable variables of the past and present as input. Whereas, in the future branch of the RNNs only the steerings (velocity, gain, and shift) are used as input. The topology of these RNNs is described in 
[21] as “Markov Decision Process Extraction Network”.
Both models have been trained with the RPROP learning algorithm [22] on the data set, using 70% of the data for training and 30% as validation data for early stopping. The training process was repeated 8 times, and the networks with the lowest validation error were chosen. In our experiments, we could validate that the training process of these RNNs is stable and that the results depend little on the chosen learning algorithm.
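The selection procedure (a 70/30 split, repeated training runs, and keeping the run with the lowest validation error) can be sketched generically. The sketch makes several assumptions: the split is a random shuffle at sample level, `train_fn` and `error_fn` are hypothetical callables standing in for RNN training and validation, and the toy "model" is just a noisy mean.

```python
import random


def split_batch(batch, train_fraction=0.7, seed=0):
    """Shuffle a copy of the batch and split it into training/validation."""
    items = list(batch)
    random.Random(seed).shuffle(items)
    cut = int(round(train_fraction * len(items)))
    return items[:cut], items[cut:]


def select_best_model(train_fn, error_fn, batch, n_repeats=8):
    """Train n_repeats models and keep the one with lowest validation error."""
    train, valid = split_batch(batch)
    best_model, best_error = None, float("inf")
    for run in range(n_repeats):
        model = train_fn(train, seed=run)
        error = error_fn(model, valid)
        if error < best_error:
            best_model, best_error = model, error
    return best_model, best_error


def toy_train(train, seed):
    """Hypothetical 'training': the sample mean plus seed-dependent noise."""
    return sum(train) / len(train) + random.Random(seed).uniform(-1.0, 1.0)


def toy_error(model, valid):
    """Mean squared error of a constant prediction."""
    return sum((y - model) ** 2 for y in valid) / len(valid)


data = [float(i % 10) for i in range(100)]
model, error = select_best_model(toy_train, toy_error, data)
```

For trajectory data, a split that keeps whole trajectories together may be preferable to the sample-level shuffle used in this sketch.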
Fig. 1 depicts the squared error of the two selected RNNs with respect to the true IB values and describes the ’no self input’ and ’self input’ design of both RNNs.
The Recurrent Control Neural Network (RCNN)  consists of two parts. One is a system model which is trained to predict the return by a rollout of length . The second part is a policy network, that computes for each step of the rollout an action to be fed in the system model. This policy network takes as input the internal state of the system model , which has the Markov property approximately .
The policy network has been trained with the same data set , as the system model . It uses the neurons of the internal states of the consumption and the fatigue networks, and
, as input, followed by two hidden layers with 12 and 6 neurons, respectively, and three output neurons to encode the changes in the three steering variables, velocity, gain, and shift. The hidden layers use hyperbolic tangent as activation function, the output layer uses the sine function. All these configuration parameters have been chosen with almost no tuning. Some tuning has been necessary to configure the training process of the policy network, though. Neither for RPROP, nor Vario-Eta, nor momentum-backprop stable training behaviors have been observed. The best results have been observed with online-backprop with small mini-batches and small learning rates . We used random learning rates between and , uniformly chosen in the logarithm of , and a batch size of one.
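Drawing learning rates uniformly in the logarithm means that every order of magnitude within the range is equally likely. A minimal sketch follows; the function name `sample_log_uniform` is hypothetical, and the bounds in the usage lines are placeholders rather than the range used in the experiments.

```python
import math
import random


def sample_log_uniform(low, high, rng):
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    if not 0.0 < low <= high:
        raise ValueError("need 0 < low <= high")
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)
# Placeholder bounds for illustration only.
rates = [sample_log_uniform(1e-5, 1e-2, rng) for _ in range(1000)]
```

With these placeholder bounds, roughly one third of the samples fall into each of the three decades between 1e-5 and 1e-2.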
A note on the possibility of assessing the quality of a trained policy without executing it on the “real system” [24], here the IB: if the validation error of a system model is lower on average for a rollout of sufficient length, it is the better system model (Fig. 1). If the selected system model estimates a higher return over a rollout of sufficient length for a given policy, that policy is likely to perform better when executed on the IB (Fig. 2).
The policies of Neural Fitted Q-iteration (NFQ)  were trained using a [9-20-1]-layered feed-forward MLP, with 9 neurons on the input layer for the observation and action , and 20 neurons on the hidden layer. The output layer comprises one neuron for the associated -value . All neurons of the neural network utilize a logistic activation function. Since NFQ is an algorithm for discrete actions, we discretized the three delta steerings towards a setting of either , , or to each steering, which in total yields different actions.
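Discretizing each of the three delta steerings into three levels gives $3^3 = 27$ joint actions, and a greedy policy picks the joint action with the highest Q-value. The sketch below assumes the levels -1, 0, and +1 purely for illustration, and `toy_q` is a hypothetical stand-in for the trained Q-network.

```python
from itertools import product


def build_action_set(levels, n_steerings=3):
    """Enumerate every combination of per-steering levels."""
    return list(product(levels, repeat=n_steerings))


def greedy_action(q_function, observation, actions):
    """Return the discrete action with the highest Q-value."""
    return max(actions, key=lambda a: q_function(observation, a))


# Assumed levels for illustration; three levels per steering give 27 actions.
actions = build_action_set((-1.0, 0.0, 1.0))


def toy_q(observation, action):
    """Hypothetical Q-function that prefers actions close to a target."""
    target = (1.0, 0.0, -1.0)
    return -sum((a - t) ** 2 for a, t in zip(action, target))


best = greedy_action(toy_q, observation=None, actions=actions)
```

The number of joint actions grows exponentially with the number of levels per steering, which limits how finely the action space can be discretized.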
The weights of the networks were trained using non-batch stochastic gradient descent with a manually tuned constant learning rate of. This setting produced better results than using RPROP, as suggested by Riedmiller , because weights trained with RPROP tended to be unstable during learning on our dataset. Before starting the training, the training samples from were permuted. Furthermore, we divided into a set of training data () and validation data (
) for early stopping. The input data is presented to the neural network after a Z-score transformation. The Q-values of the output layer were scaled into the interval of the activation function, as proposed by Gabel et al. [25].
The overall training and evaluation procedure is as follows. After creating an MLP with random initial weights from the interval , NFQ is performed over 200 iterations. Each row of training data is presented to the neural network for a maximum of 300 training epochs, unless the error on the validation set starts to rise within 10 epochs, in which case training stops early. During the experiments we observed that the performance of NFQ policies, once successfully learned, can degrade over time. This is consistent with findings in [3, 26, 19]. Therefore, we utilized a policy selection process in each experiment: we evaluated the Q-function after each NFQ iteration on the system model and saved the policy with the best performance according to the model. Subsequently, only this policy is retained, and its performance is evaluated on the IB over 10 initial states with different set points.
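The policy selection step amounts to keeping the best-scoring snapshot across iterations instead of the last one. A minimal sketch, in which `train_iteration` and `evaluate_on_model` are hypothetical callables standing in for one NFQ iteration and a rollout-based evaluation on the system model:

```python
def select_policy(train_iteration, evaluate_on_model, n_iterations):
    """Run n_iterations of training and keep the best-scoring snapshot.

    train_iteration(i) returns the policy after iteration i;
    evaluate_on_model(policy) returns its estimated average reward.
    """
    best_policy, best_score, best_iter = None, float("-inf"), -1
    for i in range(n_iterations):
        policy = train_iteration(i)
        score = evaluate_on_model(policy)
        if score > best_score:
            best_policy, best_score, best_iter = policy, score, i
    return best_policy, best_score, best_iter


# Toy stand-ins: performance rises, peaks at iteration 3, then degrades.
scores = [-5.0, -3.0, -2.0, -1.0, -4.0, -6.0]
policy, score, iteration = select_policy(
    lambda i: f"policy_{i}",
    lambda p: scores[int(p.split("_")[1])],
    len(scores),
)
```

In the toy run, returning the last snapshot would yield the worst policy, whereas selection returns the one from iteration 3.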
In the PSO-P setup we applied a PSO search on with 100 particles searching for 100 iterations until the best trajectory found so far was returned. The planning time horizon was set to , which yielded as discount factor. The particles were arranged in the so called star topology, i.e. each particle connected with every single other particle in the swarm .
The fitness values of the particles could be computed in parallel on 96 CPUs (Intel Xeon CPU E7-4850 v2 @ 2.30 GHz), resulting in an overall computation time of less than 8 seconds per action. While this computation time might still be too long for several real-world industrial applications today, future increases in CPU speed and parallelization, as well as computation on GPU clusters, might make PSO-P computationally tractable for more and more applications.
All three applied RL techniques were able to produce decent results on the IB. The average rewards per step are given in the tables in Appendix A.
Fig. 3 condenses the results of 30 RL policies and highlights the superior performance of PSO-P. This RL technique produces significantly more robust results than NFQ and RCNN. Note that all of the techniques have been trained/evaluated on the same system model.
The NFQ results are of the lowest performance in our experiments. This can be partially explained by the fact that NFQ applied only discrete actions, which is a limitation given that the IB is designed for continuous action spaces. A second, NFQ-inherent problem revealed by the experiments is its highly unstable training behavior in off-policy, batch-based training settings. The training process itself does not indicate when to stop. One might expect that training for as long as computationally feasible and using the result from the last iteration would be a good plan. In our experiments, this procedure would have produced only bad policies (see Fig. 4). Significantly better results were achieved by evaluating the resulting NFQ policy of each iteration with the system model. The policy that produced the highest approximated average reward was then declared the best NFQ policy of each experiment. Fig. 5 illustrates some properties of the best performing NFQ policy.
The RCNN experiments produced better results than NFQ. RCNN policies operate in the continuous IB action space and yield compact closed-form policies. Even though all of the trained policies yielded similar training errors, their real performance evaluated on the IB differs considerably. This suggests that RCNN is rather sensitive to prediction errors of the system model. Fig. 6 illustrates some properties of the best performing RCNN policy.
In our experiments, PSO-P has demonstrated high reward performance and outstanding robustness, properties that have been observed before [17]. For 8 out of 10 set point values, PSO-P yielded the best RL policy on average. Moreover, the performances of all experiments were very close to each other, which implies a high robustness against different initial PSO conditions, like particle positions and velocities. Fig. 7 illustrates some properties of the best performing PSO-P result.
In this paper, we have compared the new RL approach PSO-P with two standard RL techniques, RCNN and NFQ, on a recently introduced industrial benchmark. This benchmark has been designed to imitate realistic behavior and aspects which can be found in real-world industrial applications.
The experiments show important steps of the off-policy, batch-based method stack necessary for applying RL in industrial facilities. Starting with limited exploration data, an RNN system model is trained and tested. Such a model is crucial because applying random policies to the real system is usually prohibited in real-world applications. Although NFQ is classified as a model-free RL technique, our experiments show that it still requires a precise system model for policy selection. The same model has been used for training a closed-form neural network policy (RCNN), for policy selection (NFQ), and for finding optimal actions by exploiting the model directly (PSO-P).
NFQ, with its inherent limitation to discrete actions and its tendency toward instability during the training process, produced the worst performing policies. Higher performance might be achieved, for example, by enlarging the discrete action set and by approximating the Markov state through concatenated historic observations.
RCNN, with its ability to apply continuous actions and an inherent policy performance measure, computed closed-form policies of good performance. Possible improvements are, for example, different network topologies, bigger neural networks, and more advanced neural learning algorithms.
PSO-P demonstrated the best performance and the highest robustness of the three techniques. Out of the box, with only a few easy-to-determine parameters to set, it produced the best results for almost every set point. The biggest disadvantage of this technique lies in the computational effort required for the determination of the next action. In our experiments the next action has been computed in less than 8 seconds, which is still too long for many industrial applications. Increases in computational power and speed might lower this value until it becomes feasible for more and more applications.
In summary, we have gained first experiences with the IB, which indeed contains many realistic objectives, issues, and features. The experiments have shown that the benchmark can be solved by three completely different RL techniques in a realistic off-policy, batch-based setting.
The project this report is based on was supported with funds from the German Federal Ministry of Education and Research under project number 01IB15001. The sole responsibility for the report’s contents lies with the authors.
Appendix A Result tables
Tables I, II, and III contain the average per-step rewards for each of the experiments. The maximum achievable average reward is given in brackets in the first column. These values have been computed by applying PSO-P on the real IB system dynamics while preserving the initial seed of the pseudo-random number generator, i.e. the optimizer searched for the best actions given a fixed and infinitely replicable future. As a result, a very accurate estimate of the maximum achievable average reward has been obtained for each initial Markov state.
- [1] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press, 1998.
- [2] R. S. Sutton, “Learning to predict by the methods of temporal differences,” Machine Learning, vol. 3, no. 1, pp. 9–44, 1988.
- [3] G. Gordon, “Stable function approximation in dynamic programming,” in Proceedings of the Twelfth International Conference on Machine Learning. Morgan Kaufmann, 1995, pp. 261–268.
- [4] D. Ormoneit and S. Sen, “Kernel-based reinforcement learning,” Machine Learning, vol. 49, no. 2, pp. 161–178, 2002.
- [5] M. Lagoudakis and R. Parr, “Least-squares policy iteration,” Journal of Machine Learning Research, pp. 1107–1149, 2003.
- [6] D. Ernst, P. Geurts, and L. Wehenkel, “Tree-based batch mode reinforcement learning,” Journal of Machine Learning Research, vol. 6, pp. 503–556, 2005.
- [7] M. Riedmiller, “Neural fitted Q iteration - first experiences with a data efficient neural reinforcement learning method,” in Proceedings of the European Conference on Machine Learning, vol. 3720. Springer, 2005, pp. 317–328.
- [8] M. Riedmiller, “Neural reinforcement learning to swing-up and balance a real pole,” in Proceedings of the IEEE International Conference on Systems, Man and Cybernetics, vol. 4, 2005, pp. 3191–3196.
- [9] D. Schneegass, S. Udluft, and T. Martinetz, “Neural rewards regression for near-optimal policy identification in markovian and partial observable environments,” in Proceedings of the European Symposium on Artificial Neural Networks, 2007, pp. 301–306.
- [10] D. Schneegass, S. Udluft, and T. Martinetz, “Improving optimality of neural rewards regression for data-efficient batch near-optimal policy identification,” in Proceedings of the International Conference on Artificial Neural Networks, 2007, pp. 109–118.
- [11] M. Riedmiller, T. Gabel, R. Hafner, and S. Lange, “Reinforcement learning for robot soccer,” Autonomous Robots, vol. 27, no. 1, pp. 55–73, 2009.
- [12] B. Bakker, “The state of mind: Reinforcement learning with recurrent neural networks,” Ph.D. dissertation, Leiden University, Netherlands, 2004.
- [13] A. M. Schäfer, “Reinforcement learning with recurrent neural networks,” Ph.D. dissertation, University of Osnabrück, Germany, 2008.
- [14] S. Depeweg, J. M. Hernández-Lobato, F. Doshi-Velez, and S. Udluft. (2016) Learning and policy search in stochastic dynamical systems with bayesian neural networks. [Online]. Available: https://arxiv.org/abs/1605.07127v2
- [15] A. M. Schäfer, S. Udluft, and H.-G. Zimmermann, “A recurrent control neural network for data efficient reinforcement learning,” in Proceedings of the IEEE International Symposium on Approximate Dynamic Programming and Reinforcement Learning. IEEE, 2007, pp. 151–157.
- [16] D. Hein, A. Hentschel, V. Sterzing, M. Tokic, and S. Udluft. (2016) Introduction to the “Industrial Benchmark”. [Online]. Available: https://arxiv.org/abs/1610.03793
- [17] D. Hein, A. Hentschel, T. Runkler, and S. Udluft, “Reinforcement learning with particle swarm optimization policy (PSO-P) in continuous state and action spaces,” International Journal of Swarm Intelligence Research (IJSIR), vol. 7, no. 3, pp. 23–42, 2016.
- [18] C. J. C. H. Watkins, “Learning from delayed rewards,” Ph.D. dissertation, King’s College, Cambridge, UK, May 1989.
- [19] T. Gabel and M. Riedmiller, “Reducing policy degradation in neuro-dynamic programming,” in Proceedings of the 14th European Symposium on Artificial Neural Networks, 2006, pp. 653–658.
- [20] J. Kennedy and R. C. Eberhart, “Particle swarm optimization,” in Proceedings of the IEEE International Joint Conference on Neural Networks, 1995, pp. 1942–1948.
- [21] S. Duell, S. Udluft, and V. Sterzing, “Solving partially observable reinforcement learning problems with recurrent neural networks,” in Neural Networks: Tricks of the Trade, ser. Lecture Notes in Computer Science, G. Montavon, G. Orr, and K.-R. Müller, Eds. Springer Berlin / Heidelberg, 2012, vol. 7700, pp. 709–733.
- [22] M. Riedmiller and H. Braun, “Rprop - a fast adaptive learning algorithm,” in Proceedings of the International Symposium on Computer and Information Science VII, 1992, pp. 279–286.
- [23] R. Neuneier and H.-G. Zimmermann, “How to train neural networks,” in Neural Networks: Tricks of the Trade, ser. Lecture Notes in Computer Science, G. Montavon, G. Orr, and K.-R. Müller, Eds. Springer Berlin / Heidelberg, 2012, vol. 7700, pp. 369–418.
- [24] A. Hans, S. Duell, and S. Udluft, “Agent self-assessment: Determining policy quality without execution,” in Proceedings of the IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning, 2011, pp. 84–90.
- [25] T. Gabel, C. Lutz, and M. Riedmiller, “Improved neural fitted Q iteration applied to a novel computer gaming and learning benchmark,” in Proceedings of the IEEE Symposium on Approximate Dynamic Programming and Reinforcement Learning. Paris, France: IEEE Press, April 2011.
- [26] S. Thrun and A. Schwartz, “Issues in using function approximation for reinforcement learning,” in Proceedings of the Fourth Connectionist Models Summer School. Erlbaum, 1993.
- [27] R. Eberhart and J. Kennedy, “A new optimizer using particle swarm theory,” in Proceedings of the Sixth International Symposium on Micro Machine and Human Science, 1995, pp. 39–43.