A Benchmark Environment Motivated by Industrial Control Problems

09/27/2017 · by Daniel Hein, et al.

In the research area of reinforcement learning (RL), frequently novel and promising methods are developed and introduced to the RL community. However, although many researchers are keen to apply their methods on real-world problems, implementing such methods in real industry environments often is a frustrating and tedious process. Generally, academic research groups have only limited access to real industrial data and applications. For this reason, new methods are usually developed, evaluated and compared by using artificial software benchmarks. On one hand, these benchmarks are designed to provide interpretable RL training scenarios and detailed insight into the learning process of the method on hand. On the other hand, they usually do not share much similarity with industrial real-world applications. For this reason we used our industry experience to design a benchmark which bridges the gap between freely available, documented, and motivated artificial benchmarks and properties of real industrial problems. The resulting industrial benchmark (IB) has been made publicly available to the RL community by publishing its Java and Python code, including an OpenAI Gym wrapper, on Github. In this paper we motivate and describe in detail the IB's dynamics and identify prototypic experimental settings that capture common situations in real-world industry control problems.

I Introduction

Applying reinforcement learning (RL) methods to industrial systems, such as in process industries like steel processing [1], pulp and paper processing [2], and car manufacturing [3], or in power generation with gas or wind turbines [4, 5], is an exciting area of research. The hope is that an intelligent agent will provide greater energy efficiency and, desirably, lower polluting emissions. However, the learning process also entails a significant amount of risk: we do not know beforehand how a particular learning algorithm will behave, and with complex and expensive systems like these, experiments can be costly. There is therefore high demand for simulations that share some of the properties observed in these industrial systems.

Existing simulation benchmarks have led to great advancements in the field of RL. Traditionally, simple dynamical systems such as pendulums were studied, whereas nowadays the focus has shifted towards more complex simulators, such as video game environments [6]. In the field of robotics, too, very sophisticated simulation environments exist on which new learning algorithms can be tested [7, 8]. The existence of such benchmarks has played a vital role in pushing the frontier in this domain of science.

For industrial control, however, such a test bed is lacking. In these systems we observe a combination of properties that usually are not present in existing benchmarks, such as high dimensionality combined with complex heteroscedastic stochastic behavior. Furthermore, in industrial control different experimental settings are of relevance; for instance, the focus is usually less on exploration and more on batch RL settings [9].

To this end, we recently developed the industrial benchmark (IB), an open-source software benchmark with both Java and Python implementations, including an OpenAI Gym wrapper (source code: http://github.com/siemens/industrialbenchmark). This benchmark has already been used to demonstrate the performance of a particle-swarm-based RL policy approach [10]. The contribution of this paper lies in presenting the complete benchmark framework as well as its mathematical details, accompanied by illustrations and motivations for several design decisions. The IB aims at being realistic in the sense that it includes a variety of aspects that we found to be vital in industrial applications such as the optimization and control of gas and wind turbines. It is not designed to approximate any specific real system, but to pose the same hardness and complexity. Nevertheless, the process of searching for an optimal action policy on the IB is supposed to resemble the task of finding optimal valve settings for gas turbines or optimal pitch angles and rotor speeds for wind turbines.

The state and action spaces of the IB are continuous and high-dimensional, with a large part of the state being latent to the observer. The dynamical behavior includes heteroscedastic noise and a delayed reward signal that is composed of multiple objectives. The IB is designed such that the optimal policy will not approach a fixed operation point in the three steerings. All of these design choices were driven by our experience with industrial challenges.

This paper has three key contributions: in Section II, we will embed the IB in the landscape of existing benchmarks and show that it possesses a combination of properties other benchmarks do not provide, which makes it a useful addition as a test bed for RL. In Section III we will give a detailed description of the dynamics of the benchmark. Our third contribution, described in Section IV, is to define prototype experimental setups that we find relevant for industrial control. Our goal is to encourage other researchers to study scenarios common in real-world situations.

II Placement of the Industrial Benchmark in the RL Benchmark Domain

In the RL community numerous benchmark suites exist on which novel algorithms can be evaluated. For research in industrial control we are interested in a particular set of properties, such as stochastic dynamics with high-dimensional continuous state and action spaces. We argue that only a few freely available benchmarks fulfill these properties, making our contribution, the IB, a useful addition. To that end, we briefly review existing RL benchmarks.

Classic control problems in the RL literature [11], such as cart-pole balancing and the mountain car problem, usually have low-dimensional state and action spaces and deterministic dynamics. In the field of robotics, more complex and high-dimensional environments exist with a focus on robot locomotion, such as the MuJoCo environments [7, 8]. Other examples are helicopter flight (https://sites.google.com/site/rlcompetition2014/domains/helicopter) [12] and learning to ride a bicycle [13]. These systems, while complex, usually have deterministic dynamics or only limited observation noise.

Utilizing games as RL benchmarks recently brought promising results of deep RL into the focus of a broad audience. Famous examples include learning to play Atari games (https://github.com/mgbellemare/Arcade-Learning-Environment) [6, 14] from raw pixels, achieving above-human performance in Ms. Pac-Man [15], and beating human experts in the game of Go [16]. In these examples, however, the action space is discrete, and insights from learning to play a game may not translate to learning to control an industrial system like a gas or wind turbine.

In Figure 1 we give a qualitative overview of the placement of the proposed IB with respect to other RL benchmarks for continuous control. Here we focus on the stochasticity and dimensionality of the benchmarks at hand. Note that by stochasticity we refer not only to the signal-to-noise ratio, but also to the structural complexity of the noise, such as heteroscedasticity or multimodality.

[Figure 1: qualitative placement of Wet Chicken, Mountain Car, Cart-pole, Bicycle, MuJoCo, Helicopter, and the IB along two axes, state dimensions and stochasticity.]
Fig. 1: Qualitative comparison of different RL benchmarks with continuous actions. The state space of the wet chicken 2D benchmark [17] is rather low-dimensional, but the benchmark is highly stochastic, which makes it a challenging RL problem. Cart-pole and mountain car are deterministic benchmarks with few state dimensions and only a single action variable. The bicycle benchmark introduces some noise to simulate imperfect balance. The helicopter software simulation has a 12-dimensional state space and a 4-dimensional continuous action space; stochasticity is introduced to simulate wind effects on the helicopter. The state space of the IB is high-dimensional, since multiple past observations have to be taken into account to approximate the true underlying Markov state. Stochasticity is introduced not only by adding noise to different observations, but also by stochastic state transitions on hidden variables.

We conclude that the IB is a useful addition to the set of existing RL benchmarks. In particular, Figure 1 illustrates that the combination of high dimensionality and complex stochasticity appears to be novel compared to existing environments. In the following section, a detailed description and motivation for the applied IB dynamics is presented.

III Detailed description

At any time step the RL agent can influence the environment, i.e., the IB, via actions

that are three dimensional vectors in

. Each action can be interpreted as three proposed changes to the three observable state variables called current steerings. Those current steerings are named velocity , gain , and shift . Each of those is limited to as follows:

(2)
(3)
(4)

with scaling factors , , and . The step size for changing shift is calculated as .
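A minimal Python sketch of the clipped steering updates described above (illustrative, not taken from the IB source); the step sizes (1, 10, 5.75) used as defaults here are assumed placeholder values, not necessarily the benchmark's official scaling factors:

```python
import numpy as np


def apply_action(steerings, action, step_sizes=(1.0, 10.0, 5.75)):
    """Apply a proposed change in [-1, 1]^3 to the three steerings
    (velocity, gain, shift).

    `step_sizes` are illustrative placeholders for the scaling factors.
    Each steering is clipped to its admissible range [0, 100].
    """
    steerings = np.asarray(steerings, dtype=float)
    # Proposed changes are bounded to [-1, 1] per dimension.
    action = np.clip(np.asarray(action, dtype=float), -1.0, 1.0)
    return np.clip(steerings + np.asarray(step_sizes) * action, 0.0, 100.0)
```

For example, starting from the benchmark's initial steerings of 50 each, the maximal action moves velocity by one step size while the clipping keeps all values inside [0, 100].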

After applying action a_t, the environment transitions to the next time step, in which it enters internal state s_{t+1}. State s_t and successor state s_{t+1} are the Markovian states of the environment and are only partially observable to the agent.

An observable variable of the IB, the setpoint p, influences the dynamical behavior of the environment but can never be changed by actions. An analogy to such a setpoint is, for example, the demanded load in a power plant or the wind speed actuating a wind turbine. As we will see in the upcoming description, different values of the setpoint induce significant changes to the dynamics and stochasticity of the benchmark. The IB has two modes of operation: a) fixing the setpoint to a constant value, in which case it acts as a hyperparameter, or b) letting it vary over time as an external driver, making the dynamics highly non-stationary. We give a detailed description of setting b) in Subsection III-D.

The set of observable state variables is completed by two reward-relevant variables, consumption c and fatigue f. In the general RL setting, a reward for each transition from state s_t via action a_t to the successor state s_{t+1} is drawn from a probability distribution depending on s_t, a_t, and s_{t+1}. In the IB, the reward is given by a deterministic function of the successor state, i.e.,

r_{t+1} = r(s_{t+1}).  (5)

In the real-world tasks that motivated the IB, the reward function has always been known explicitly. In some cases it was itself subject to optimization and had to be adjusted to properly express the optimization goal. For the IB we therefore assume that the reward function is known and that all variables influencing it are observable.

Thus the observation vector at time t comprises the current values of the following observable state variables, a subset of the variables of the Markovian state s_t:

  1. the current steerings, velocity v, gain g, and shift h,

  2. the external driver, setpoint p,

  3. and the reward-relevant variables, consumption c and fatigue f.

Appendix A gives a complete overview of the IB's state space.

The data base for learning comprises tuples of observation, action, and successor observation. The agent is allowed to use all previous observation vectors and actions to estimate the Markovian state s_t.
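Since only part of the Markovian state is observable, a common (if simplistic) remedy is to stack a window of past observation/action pairs into one feature vector. The sketch below is a hypothetical illustration of this idea; the window length and padding scheme are our own choices, not part of the benchmark:

```python
from collections import deque


class HistoryState:
    """Approximate the latent Markovian state by stacking the last
    `window` observation/action pairs. The window length is an
    assumption, not a value prescribed by the IB."""

    def __init__(self, window=10):
        self.window = window
        self.buffer = deque(maxlen=window)

    def update(self, observation, action):
        # Record the newest observation/action pair.
        self.buffer.append((tuple(observation), tuple(action)))

    def features(self):
        # Flatten the history into one feature vector; zero-pad at the
        # front until enough steps have been seen.
        flat = [x for obs, act in self.buffer for x in obs + act]
        if self.buffer:
            width = len(flat) // len(self.buffer)
            flat = [0.0] * (self.window * width - len(flat)) + flat
        return flat
```

In practice, recurrent models or learned state filters can replace this fixed-window construction.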

The dynamics can be decomposed into three different sub-dynamics named operational cost, mis-calibration, and fatigue.

III-A Dynamics of operational cost

The sub-dynamics of operational cost are influenced by the external driver setpoint p and two of the three steerings, velocity v and gain g. The current operational cost o_t is calculated as

(6)

The observation of o_t is delayed and blurred by the following convolution:

(7)

The convoluted operational cost cannot be observed directly; instead it is modified by the second sub-dynamic, called mis-calibration, and finally subject to observation noise. The motivation for this dynamical behavior is that it is non-linear, depends on more than one influence, and is delayed and blurred. All of these effects have been observed in industrial applications, e.g., in the heating processes observed during combustion. Figure 4(b) gives an example trajectory of the convolution process over a rollout of 200 time steps; the delayed and blurred relation between the operational cost and the convoluted cost is clearly visible.
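The delay-and-blur effect can be illustrated with a finite convolution over past costs. The kernel weights below are invented for illustration (zero weight on the newest steps produces the delay; spreading mass over older steps produces the blur) and are not the benchmark's actual coefficients:

```python
import numpy as np


def convolve_costs(history,
                   kernel=(0.0, 0.0, 0.0, 0.1, 0.2, 0.3, 0.2, 0.1, 0.05, 0.05)):
    """Blur and delay a cost signal with a finite convolution kernel.

    `history` holds past operational costs, most recent last.
    kernel[i] weights the cost observed i steps in the past; the
    weights here are illustrative placeholders that sum to one.
    """
    k = np.asarray(kernel, dtype=float)
    past = np.zeros(len(k))
    # Reverse so that index i corresponds to "i steps in the past";
    # shorter histories are implicitly zero-padded.
    recent = np.asarray(history[-len(k):], dtype=float)[::-1]
    past[:len(recent)] = recent
    return float(np.dot(k, past))
```

With these weights a sudden spike in the newest cost has no immediate effect on the convoluted value, mimicking the delayed observation.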

III-B Dynamics of mis-calibration

The sub-dynamics of mis-calibration are influenced by the external driver setpoint p and the steering shift h. The goal is to reward an agent for oscillating h at a pre-defined frequency around a specific operation point determined by setpoint p. The reward topology is inspired by an example from quantum physics, namely Goldstone's "Mexican hat" potential.

In the first step, setpoint p and shift h are combined into an effective shift, calculated by:

(8)

The effective shift influences three latent variables: domain, response, and direction. Domain can enter two discrete states, negative and positive, represented by the integer values -1 and +1, respectively. Response can enter two discrete states, disadvantageous and advantageous, likewise represented by -1 and +1. Direction is a discrete index variable yielding the position of the current optimum in the mis-calibration penalty space.

Fig. 2: Visual description of the mis-calibration dynamics. Blue represents areas of low penalty (-1.00), while yellow represents areas of high penalty (1.23). If the policy keeps the effective shift in the so-called safe zone, the direction index φ is driven stepwise towards 0. When φ = 0 is reached, the mis-calibration dynamics are reset, i.e., domain becomes positive and response advantageous. The policy is allowed to start the rotation cycle at any time by leaving the safe zone and entering the positive or the negative domain. Consider the positive domain: after initially leaving the safe zone, response is in state advantageous, i.e., φ is increased stepwise. The upper right area is a reversal point for φ. As soon as the maximum index is reached, response switches from advantageous to disadvantageous. In the subsequent time steps φ is decreased until either the policy brings the effective shift back to the safe zone or φ reaches the left boundary at -6. If the latter occurs, φ is kept constant at -6, i.e., the policy incurs a high penalty in each time step. Since the mis-calibration dynamics are symmetric around 0, the opposite dynamics apply in the negative domain in the lower part of the plot.

Figure 2 visualizes the mis-calibration dynamics introduced in equation form in the following paragraphs. In each time step the mis-calibration dynamics are transitioned, starting with domain and response, as follows:

(9)
(10)

where the safe zone corresponds to the area in the center of Figure 2. Note that the policy itself is allowed to decide when to leave the safe zone.

In the next step, the direction index φ is updated accordingly:

(11)
(12)

The first option realizes the return of φ towards 0 if the policy returns into the safe zone. The second option stops the rotation if the effective shift reaches the opposite domain bound (upper left and lower right areas in Figure 2). The third option implements the cyclic movement of φ, depending on response and the direction of the effective shift.

If, after this update, the absolute value of the direction index φ reaches or exceeds the predefined maximum index of 6 (upper right and lower left areas in Figure 2), response enters state disadvantageous and the index is turned back towards 0.

(13)
(14)

In the final step of the mis-calibration state transition, it is checked whether the effective shift has returned to the safe zone while, at the same time, the direction index φ has completed a full cycle (reset area in the center of Figure 2). If this is the case, domain and response are reset to their initial states, positive and advantageous, respectively:

(15)
(16)

Note that in this reset state the policy can again decide to start a new cycle (in positive or negative direction) or to remain in the safe zone.

The penalty landscape of mis-calibration is computed as follows. Based on the current values of the effective shift and the direction index, the penalty function measures how well shift is maintained in the beneficial area. The penalty function is defined as a linearly biased Goldstone potential, computed by

(17)

The definition of the radius can be found in Appendix B. From the direction index φ, the sine of the direction angle is calculated as follows:

(18)

Note that this sine function represents the optimal policy for the mis-calibration dynamics. Exemplary policy trajectories through the penalty landscape of mis-calibration are depicted and described in Figure 3.

(a) Optimal policy
(b) Suboptimal policy
(c) Bad policy
Fig. 3: Comparison of three mis-calibration policies. Depicted is a visual representation of the Goldstone-potential-based penalty function. Areas yielding high penalty are colored yellow, areas yielding low penalty blue. The highlighted area in the center depicts the safe zone. (a) A policy which maintains the effective shift such that a sine-shaped trajectory is generated yields the lowest penalty. Note that the policy itself starts the rotation cycle at any time by leaving the safe zone. After returning to the safe zone while at the same time φ = 0, the dynamics are reset and a new cycle can be initiated at any following time step in positive or negative direction. (b) The depicted policy initiates the rotation cycle by leaving the safe zone but returns after six steps. After this return, φ is decreased in four steps back to 0. Subsequently, the dynamics are reset. This policy yields a lower penalty than a constant policy that remains in the safe zone the whole time. (c) The depicted policy approaches one of the global optima of the penalty function by directly leaving the safe zone and constantly increasing the effective shift; subsequently, it remains at this point. However, the rotation dynamic yields a steady decrease in φ after reaching the right boundary. This decrease "pushes" the agent to the left, i.e., the penalties received increase from step to step. After reaching the left boundary at φ = -6, the dynamics remain in this area of high penalty. Note that the policy could bring the dynamics back to the initial state by returning to the safe zone. This benchmark property ensures that the best constant policies are the ones that remain in the safe zone.

The resulting mis-calibration penalty is added to the convoluted operational cost, giving the modified operational cost,

(19)

Before being observable as consumption , the modified operational cost is subject to heteroscedastic observation noise

(20)

i.e., Gaussian noise with zero mean and a state-dependent standard deviation. Figure 4(c) shows, in an example rollout of 200 steps, how both the convoluted operational cost and the mis-calibration penalty affect consumption c.
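Heteroscedastic observation noise means the noise level itself depends on the current state. The toy sketch below illustrates this; the particular dependence of sigma on setpoint and cost is made up for the example and is not the IB's actual noise model:

```python
import numpy as np


def observe_consumption(modified_cost, setpoint, rng=None, scale=0.02):
    """Add zero-mean Gaussian observation noise whose standard deviation
    depends on the operating condition (heteroscedastic noise).

    The dependence sigma = scale * (1 + setpoint/100) * |cost| is an
    illustrative placeholder; the benchmark defines its own sigma.
    """
    if rng is None:
        rng = np.random.default_rng()
    sigma = scale * (1.0 + setpoint / 100.0) * abs(modified_cost)
    return modified_cost + rng.normal(0.0, sigma)
```

Under this placeholder model, higher setpoints and higher costs both inflate the observation noise, which is the qualitative behavior the text describes.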

III-C Dynamics of fatigue

The sub-dynamics of fatigue are influenced by the same variables as the sub-dynamics of operational cost, i.e., setpoint p, velocity v, and gain g. The IB is designed in such a way that changing the steerings velocity and gain so as to reduce the operational cost increases fatigue, yielding the desired multi-criteria task with two reward components that show opposite dependencies on the actions. The basic fatigue is computed as

(21)

From the basic fatigue, fatigue f is calculated by

(22)

where the amplification term depends on two latent variables, the effective velocity, and the effective gain. Furthermore, it is affected by noise,

(23)

In Eq. (23) we see that the amplification can undergo a bifurcation if one of the latent variables exceeds a critical value. In that case, the amplification increases and leads to higher fatigue, affecting the reward negatively.

The noise components, as well as the latent variables, depend on the effective velocity and the effective gain. These are calculated by setpoint-dependent transformation functions

(24)
(25)

Based on these transformation functions, effective velocity and effective gain are computed as follows:

(26)
(27)

To compute the noise components, six random numbers are drawn from different distributions: two are obtained by first sampling from an exponential distribution with mean 0.05 and then applying the logistic function to these samples; two are drawn from binomial distributions; and two are drawn from a uniform distribution. The noise components are then computed as follows:

(28)
(29)
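The sampling scheme for the six random numbers can be sketched as follows. The binomial success probabilities and the uniform range [0, 1] are assumptions, since the text elides the exact (velocity- and gain-dependent) parameters:

```python
import numpy as np


def logistic(x):
    """Standard logistic function, mapping reals to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


def sample_noise_inputs(rng, p_v=0.5, p_g=0.5):
    """Draw the six random numbers feeding the fatigue noise components.

    `p_v` and `p_g` are placeholder binomial probabilities; the
    benchmark derives its own parameters from the effective velocity
    and effective gain.
    """
    # Two exponential samples (mean 0.05) squashed by the logistic.
    e_v, e_g = logistic(rng.exponential(scale=0.05, size=2))
    # Two Bernoulli (binomial n=1) draws.
    b_v, b_g = rng.binomial(1, [p_v, p_g])
    # Two uniform draws; the [0, 1] range is an assumption.
    u_v, u_g = rng.uniform(0.0, 1.0, size=2)
    return e_v, e_g, b_v, b_g, u_v, u_g
```

The combination of rare binomial events with heavy-tailed exponential components is what produces the spike-like fatigue noise discussed below.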

The latent variables are calculated as

(30)
(31)

The sub-dynamic of fatigue results in a value for fatigue f, which enters the reward function (Eq. (5)).

(a) Fatigue dynamics
(b) Operational cost
(c) Consumption dynamics
(d) Reward dynamics
Fig. 4: Visualization of relevant variables of the IB in a rollout using random actions over 200 time steps. (a): Shown are the latent variable and fatigue f. As seen in Eq. (23), the latent variable can lead to a bifurcation of the dynamics; in the scenario shown, we observe the beginning of a runaway effect that originates from the second case in Eq. (31). (b): Shown are the operational cost and the convoluted cost given by Eq. (7). The delayed effect of the convolution is clearly visible: the operational cost decreases sharply while the convoluted cost is still ascending. (c): Shown is the composition of the visible consumption (purple) from its two components, the convoluted operational cost and the mis-calibration penalty. (d): Shown is the composition of the final negative reward from its two components, fatigue (blue) and consumption (red). In this case, the runaway effect from panel (a) has the most prominent effect on the reward signal.

An example interplay of the components of the fatigue dynamics is visualized in Figure 4(a). Early in the rollout we see the effect of the noise components described in Eq. (28): the combination of binomial and exponential noise yields heterogeneous, spike-like behavior. Later we observe a self-amplifying process in the latent variable, originating from the second case of Eq. (31). Eventually, the fatigue dynamics change rapidly towards higher, less noisy regions; this change originates from the bifurcation in Eq. (23), which we pointed out earlier.

III-D Setpoint dynamics

The setpoint can either be kept constant or vary over time. In the variable setting, it changes at a constant rate over a period of time steps; afterwards, a new sequence length and change rate are determined.

We sample the sequence length uniformly at random and draw the rate from a mixture of a uniform distribution and a delta distribution, with weighting probabilities 0.9 and 0.1. For each time step we update the setpoint according to:

(32)
(33)

where the change rate is flipped with a probability of 50% whenever the setpoint reaches one of its two bounds. Note that the equations above produce piecewise linear functions of constant slope. Four example trajectories are visualized in Figure 5.
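The described setpoint process can be sketched as follows. The sequence-length range, the rate range, and the bounds [0, 100] are assumptions where the text omits exact values:

```python
import numpy as np


def setpoint_trajectory(steps, rng, low=0.0, high=100.0):
    """Generate a piecewise-linear setpoint trajectory.

    The rate is drawn from a 0.9/0.1 mixture of a uniform distribution
    and a point mass (here placed at zero); segment lengths, the rate
    range, and the bounds are illustrative assumptions.
    """
    p, traj = 50.0, []
    length, rate = 0, 0.0
    for _ in range(steps):
        if length == 0:  # start a new linear segment
            length = rng.integers(1, 100)
            rate = rng.uniform(-1.0, 1.0) if rng.random() < 0.9 else 0.0
        # Flip the change rate with probability 0.5 at a bound.
        if not (low <= p + rate <= high) and rng.random() < 0.5:
            rate = -rate
        p = float(np.clip(p + rate, low, high))
        traj.append(p)
        length -= 1
    return traj
```

Plotting a few seeds of this generator reproduces the qualitative look of Figure 5: straight segments of varying slope, reflected or clipped at the bounds.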

Fig. 5: Four example trajectories of the setpoint of the IB in a variable setpoint setting.

IV Experimental Prototypes

The IB aims at being realistic in the sense that it includes a variety of aspects that we found to be vital in industrial applications. In this section we outline prototypes of experimental settings that include key aspects present in industrial applications.

IV-A Batch Reinforcement Learning

In this setting, we are given an initial batch of data from an already-running system and are asked to find a better (ideally near-optimal) policy.

The learner's task is therefore to return a policy that can be deployed on the system at hand, based solely on the information provided by the batch [9]. Such scenarios are common in real-world industry settings, where exploration is usually restricted to avoid possible damage to the system.

Two scenarios using the IB for batch RL experiments are described subsequently.

Random exploration

In this setting, we generate a batch of state transitions using a random behavior policy, for instance by sampling action proposals from a uniform distribution. Example instances of these settings can be found in [18] and [10].

In the cited examples, the benchmark is initialized for ten different setpoints, with the latent variables at their default values and the three steering variables at 50 each. Then, for each setpoint, the behavior policy is applied to the benchmark for 1,000 time steps, resulting in a total of 10,000 recorded state transitions. This process can be repeated to study performance for different batch sizes.
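Generating such a batch with the IB's OpenAI Gym wrapper might look like the following sketch. The classic Gym `reset`/`step` signatures are assumed here and may differ between wrapper versions; the environment object itself is passed in rather than constructed, since the wrapper's registration id is not given in the text:

```python
import numpy as np


def collect_random_batch(env, episodes=10, horizon=1000, seed=0):
    """Record (obs, action, next_obs, reward) transitions under a
    uniform random behavior policy.

    `env` is any Gym-style environment; the classic 4-tuple step
    return is an assumption about the wrapper's API version.
    """
    rng = np.random.default_rng(seed)
    batch = []
    for _ in range(episodes):
        obs = env.reset()
        for _ in range(horizon):
            # Uniform random action proposals in [-1, 1]^3.
            action = rng.uniform(-1.0, 1.0, size=3)
            next_obs, reward, done, info = env.step(action)
            batch.append((obs, action, next_obs, reward))
            obs = next_obs
            if done:
                break
    return batch
```

With ten episodes of 1,000 steps each, this yields the 10,000 transitions described above.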

For evaluation, the system is either initialized to its start settings [10] or to a random place in the state space [18], from which point the policy drives the system autonomously.

Safe behavior policy

In real industrial settings, we seldom run a fully random policy on the system at hand. A more realistic setting is that we have a batch of data generated by a safe but suboptimal behavior policy with limited randomness. The task is then to improve on this behavior policy. Unlike in the random exploration setting, the difficulty here is that large parts of the state space will be missing from the batch: the data will contain much information about specific areas of the state space and little information about everything else. An example experiment can be found in [19].

IV-B Transfer Learning

A common situation in industrial control is that we have data from different industrial systems, or data from one industrial system operating in different contexts. We expect each instance to behave similarly at a global level, while significant deviations can be expected at a low level.

In the IB, this is realized by the setpoint, a hyperparameter of the dynamics. Each value of the setpoint defines a different stochastic system, where the dissimilarity of two systems grows with the distance between their setpoints.

In transfer learning, we want to transfer our knowledge from system A to system B. For example, suppose we have a large batch of state transitions from the IB recorded at one setpoint value and a small batch of state transitions recorded at a different setpoint value. If our goal is to learn a good model for the system at the second setpoint, the challenge of transfer learning is how to efficiently incorporate the large batch to improve learning. An example instance of this setup, albeit using pendulum dynamics, can be found in [20].
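One simple baseline for this setup is to pool both batches and give the model the setpoint as an explicit context feature, so a single model can condition on the operating context. This is a hypothetical sketch of that baseline, not the method of [20]:

```python
import numpy as np


def pooled_training_set(batch_a, batch_b, setpoint_a, setpoint_b):
    """Pool transitions from two systems, tagging each sample with its
    setpoint as a context feature.

    Each batch is a list of (state, action, next_state) tuples; the
    returned (X, y) pair is suitable for any supervised model of the
    transition dynamics.
    """
    X, y = [], []
    for batch, sp in ((batch_a, setpoint_a), (batch_b, setpoint_b)):
        for state, action, next_state in batch:
            # Append the setpoint so the model can distinguish systems.
            X.append(np.concatenate([np.atleast_1d(state),
                                     np.atleast_1d(action), [sp]]))
            y.append(np.atleast_1d(next_state))
    return np.array(X), np.array(y)
```

More elaborate approaches, such as the factored recurrent networks of [20], share parameters across systems instead of relying on a single context feature.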

V Conclusion

This paper introduced the IB, a novel benchmark for RL inspired by industrial control. We have shown that it is a useful addition to the set of existing RL benchmarks due to its unique combination of properties. Furthermore, we outlined prototype experimental setups relevant for industrial control. Our contributions are a step towards enabling other researchers to study RL in realistic industrial settings and to expand the economic and societal impact of machine learning.

Acknowledgment

The project this report is based on was supported with funds from the German Federal Ministry of Education and Research under project number 01IB15001. The sole responsibility for the report’s contents lies with the authors. The authors would like to thank Ludwig Winkler from TU Berlin for implementing the OpenAI Gym wrapper and sharing it with the community.

References

  • [1] M. Schlang, B. Feldkeller, B. Lang, P. T., and R. T. A., “Neural computation in steel industry,” in 1999 European Control Conference (ECC), 1999, pp. 2922–2927.
  • [2] T. A. Runkler, E. Gerstorfer, M. Schlang, E. Jünnemann, and J. Hollatz, “Modelling and optimisation of a refining process for fibre board production,” Control engineering practice, vol. 11, no. 11, pp. 1229–1241, 2003.
  • [3] S. A. Hartmann and T. A. Runkler, “Online optimization of a color sorting assembly buffer using ant colony optimization,” Operations Research Proceedings 2007, pp. 415–420, 2008.
  • [4] A. M. Schaefer, D. Schneegass, V. Sterzing, and S. Udluft, “A neural reinforcement learning approach to gas turbine control,” in 2007 International Joint Conference on Neural Networks, 2007, pp. 1691–1696.
  • [5] A. Hans, D. Schneegass, A. M. Schaefer, and S. Udluft, “Safe exploration for reinforcement learning,” in 2008 European Symposium on Artificial Neural Networks (ESANN), 2008, pp. 143–148.
  • [6] M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling, “The arcade learning environment: An evaluation platform for general agents,” Journal of Artificial Intelligence Research, vol. 47, pp. 253–279, 2013.
  • [7] E. Todorov, T. Erez, and Y. Tassa, “Mujoco: A physics engine for model-based control,” in 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).   IEEE, 2012, pp. 5026–5033.
  • [8] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” arXiv preprint arXiv:1509.02971, 2015.
  • [9] S. Lange, T. Gabel, and M. Riedmiller, “Batch reinforcement learning,” in Reinforcement Learning.   Springer, 2012, pp. 45–73.
  • [10] D. Hein, S. Udluft, M. Tokic, A. Hentschel, T. A. Runkler, and V. Sterzing, “Batch reinforcement learning on the industrial benchmark: First experiences,” in 2017 International Joint Conference on Neural Networks (IJCNN), 2017, pp. 4214–4221.
  • [11] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction.   Cambridge, MA: MIT Press, 1998.
  • [12] P. Abbeel, A. Coates, and A. Y. Ng, “Autonomous helicopter aerobatics through apprenticeship learning,” The International Journal of Robotics Research, vol. 29, no. 13, pp. 1608–1639, 2010.
  • [13] J. Randløv and P. Alstrøm, “Learning to drive a bicycle using reinforcement learning and shaping,” in Proceedings of the Fifteenth International Conference on Machine Learning (ICML 1998), J. W. Shavlik, Ed.   San Francisco, CA, USA: Morgan Kauffman, 1998, pp. 463–471.
  • [14] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis, “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, 2015.
  • [15] H. van Seijen, M. Fatemi, J. Romoff, R. Laroche, T. Barnes, and J. Tsang, “Hybrid reward architecture for reinforcement learning,” arXiv preprint arXiv:1706.04208, 2017.
  • [16] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. P. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis, “Mastering the game of Go with deep neural networks and tree search,” Nature, vol. 529, no. 7587, pp. 484–489, 2016.
  • [17] A. Hans and S. Udluft, “Efficient uncertainty propagation for reinforcement learning with limited data,” in 2009 Proceedings of the International Conference on Artificial Neural Networks (ICANN 2009), 2009, pp. 70–79.
  • [18] S. Depeweg, J. M. Hernández-Lobato, F. Doshi-Velez, and S. Udluft, “Learning and policy search in stochastic dynamical systems with Bayesian neural networks,” arXiv preprint arXiv:1605.07127, 2016.
  • [19] ——, “Uncertainty decomposition in Bayesian neural networks with latent variables,” arXiv preprint arXiv:1706.08495, 2017.
  • [20] S. Spieckermann, S. Düll, S. Udluft, A. Hentschel, and T. A. Runkler, “Exploiting similarity in system identification tasks with recurrent neural networks,” Neurocomputing, vol. 169, pp. 343–349, 2015.

Appendix A State Description

Only a part of the state variables is observable. This observation vector is also called the observable state, but one has to keep in mind that it does not fulfill the Markov property. The observation vector at time t comprises the current values of velocity, gain, shift, setpoint, consumption, and fatigue.

The preferred minimal Markovian state fulfills the Markov property with the minimum number of variables. It comprises 20 values: the observation vector (velocity, gain, shift, setpoint, consumption, and fatigue) plus some latent variables of the sub-dynamics. The sub-dynamics of operational cost add a list of the nine previous operational costs. Note that the current operational cost is not part of this state definition; it would be redundant, as it can be calculated from velocity, gain, and setpoint. The sub-dynamics of mis-calibration need three additional latent variables (Section III-B), and the sub-dynamics of fatigue add two additional latent variables (Eq. (30) and (31)).

———————— Markovian state ————————

– Observables –

setpoint
velocity
gain
shift
consumption
fatigue

– Latent variables –

operational cost at t−1
operational cost at t−2
operational cost at t−3
operational cost at t−4
operational cost at t−5
operational cost at t−6
operational cost at t−7
operational cost at t−8
operational cost at t−9
latent variable of mis-calibration (domain)
latent variable of mis-calibration (response)
latent variable of mis-calibration (direction)
latent variable of fatigue (Eq. (30))
latent variable of fatigue (Eq. (31))

TABLE I: IB Markovian state.

Appendix B Goldstone Potential Based Equations

The resulting penalty of the mis-calibration reward component is computed by adopting a so-called linearly biased Goldstone potential. The following constants are pre-defined to subsequently compute the respective penalty:

(34)
(35)
(36)
(37)
(38)
(39)

Given the effective shift and the sine of the direction angle (Eq. (18)), the penalty function is computed using the following set of equations:

(40)
(41)
(42)
(43)
(44)
(45)
(46)
(47)