ModelicaGym: Applying Reinforcement Learning to Modelica Models

09/18/2019 ∙ by Oleh Lukianykhin et al. ∙ Ukrainian Catholic University

This paper presents the ModelicaGym toolbox that was developed to employ Reinforcement Learning (RL) for solving optimization and control tasks in Modelica models. The developed tool allows connecting models using the Functional Mock-up Interface (FMI) to the OpenAI Gym toolkit in order to exploit Modelica equation-based modelling and co-simulation together with RL algorithms as a functionality of the corresponding tools. Thus, ModelicaGym facilitates fast and convenient development of RL algorithms and their comparison when solving the optimal control problem for Modelica dynamic models. The inheritance structure of ModelicaGym toolbox's classes and the implemented methods are discussed in detail. The toolbox functionality is validated on the Cart-Pole balancing problem. This includes a description of the physical system model and its integration using the toolbox, as well as experiments on the selection and influence of the model parameters (i.e. force magnitude, cart-pole mass ratio, reward ratio, and simulation time step) on the learning process of the Q-learning algorithm, supported with a discussion of the simulation results.




1. Introduction

1.1. Motivation

In the era of big data and cheap computational resources, the advancement of machine learning algorithms is a natural development. These algorithms are developed to solve complex issues, such as predictive data analysis, data mining, mathematical optimization, and control by computers.

Control design is arguably the most common engineering application (bogodorova2015bayesian), (turitsyn2011options), (AchControlElLoads). This type of problem can be solved by learning from interaction between a controller (agent) and a system (environment). This type of learning is known as reinforcement learning (sutton2018reinforcement). Reinforcement learning algorithms are well suited to solving complex optimal control problems (moriyama2018reinforcement), (smottahedi-rl), (hallen2018comminution).

Moriyama et al. (moriyama2018reinforcement) achieved 22% improvement compared to a model-based control of the data centre cooling model. The model was created with EnergyPlus and simulated with FMUs (FMUref).

Mottahedi (smottahedi-rl) applied Deep Reinforcement Learning to learn optimal energy control for a building equipped with battery storage and photovoltaics. The detailed building model was simulated using an FMU.

Proximal Policy Optimization was successfully applied to optimize the grinding comminution process under certain conditions in (hallen2018comminution). The calibrated plant simulation used an FMU.

However, while these works emphasize the stages of successful RL application in the research and development process, they focus on the integration of a single model. The authors of (moriyama2018reinforcement), (smottahedi-rl), (hallen2018comminution) did not aim to develop a generalized tool that offers convenient options for model integration using an FMU. Perhaps the reason is that the corresponding implementation is not straightforward: it requires writing a significant amount of code describing the generalized logic that is common to all environments. However, the benefit of such an implementation is clearly in avoiding boilerplate code and instead creating a modular and scalable open-source tool, which is the focus of this paper.

OpenAI Gym (brockman2016openai) is a toolkit for implementation and testing of reinforcement learning algorithms on a set of environments. It introduces common Application Programming Interface (API) for interaction with the RL environment. Consequently, a custom RL agent can be easily connected to interact with any suitable environment, thus setting up a testbed for the reinforcement learning experiments. In this way, testing of RL applications is done according to the plug and play concept. This approach allows consistent, comparable and reproducible results while developing and testing of the RL applications. The toolkit is distributed as a Python package (Gym).
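The plug-and-play contract that the Gym API defines can be illustrated with a toy environment written in plain Python. The class and names below are illustrative sketches, not part of Gym or ModelicaGym; they only mirror the `reset()`/`step()` shape of the API:

```python
import random

class ToyEnv:
    """A minimal environment following the OpenAI Gym API shape:
    reset() returns an initial state; step(action) returns
    (state, reward, done, info). Illustrative only."""

    def __init__(self, episode_limit=10):
        self.episode_limit = episode_limit
        self.t = 0

    def reset(self):
        self.t = 0
        return 0.0  # initial observation

    def step(self, action):
        self.t += 1
        state = float(self.t)
        reward = 1.0                      # reward each surviving step
        done = self.t >= self.episode_limit
        return state, reward, done, {}

# Any agent that speaks this API can be plugged in unchanged:
env = ToyEnv()
state, done, total_reward = env.reset(), False, 0.0
while not done:
    action = random.choice([0, 1])        # a trivial random agent
    state, reward, done, _ = env.step(action)
    total_reward += reward
print(total_reward)  # 10.0
```

Because the agent only touches `reset()` and `step()`, swapping in any other environment with the same interface requires no changes to the agent code.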

For engineers, a challenge is to successfully apply computer science research and development algorithms (e.g. coded in Python) when tackling issues using their models in an engineering-specific environment or modeling language (e.g. Modelica) (MyIEPS2014), (vanfretti2013unambiguous).

To ensure a Modelica model’s independence of a simulation tool, the Functional Mock-up Interface (FMI) is used. FMI is a tool-independent standard for the exchange and co-simulation of dynamic system models. Objects that are created according to the FMI standard to exchange Modelica models are called Functional Mock-up Units (FMUs). An FMU allows simulation of the environment’s internal logic modelled in Modelica by connecting it to Python using the PyFMI library (andersson2016pyfmi), which supports loading and execution of models compliant with the FMI standard.

In (dymrl), the author declared the aim of developing a universal connector of Modelica models to OpenAI Gym and started an implementation. Unfortunately, the attempt at model integration did not extend beyond a single model simulated in Dymola (Dymola), which is proprietary software. The connector also had other limitations, e.g. only a single input to a model was possible in the proposed implementation, and the reward policy could not be configured. However, the need for such a tool is well motivated by the interest of the engineering community in (dymrl). Another attempt to extend this project by Richter (fmirl) did not overcome the aforementioned limitations. In particular, the aim of universal model integration was not achieved, and a connection between the Gym toolkit and the PyFMI library was still missing in the pipeline presented in Figure 1.

Thus, this paper presents ModelicaGym toolbox that serves as a connector between OpenAI Gym toolkit and Modelica model through FMI standard (FMI_standard).

1.2. Paper Objective

Considering its potential to be widely used by both RL algorithm developers and engineers who exploit Modelica models, the objective of this paper is to present the ModelicaGym toolbox, implemented in Python to facilitate fast and convenient development of RL algorithms that connect to Modelica models, filling the gap in the pipeline (Figure 1).

The toolbox provides the following advantages:

  • Modularity and extensibility - new models can be integrated easily, with minimal coding to support the integration. This functionality, common to all FMU-based environments, is available out of the box.

  • Possibility to integrate FMUs compiled in both proprietary (Dymola) and open-source (Jmod) tools.

  • Possibility to develop RL applications for solutions of real-world problems by users who are unfamiliar with Modelica or FMUs.

  • Possibility to use models with both single and multiple inputs and outputs.

  • Easy integration of a custom reward policy into the implementation of a new environment. Simple positive/negative rewarding is available out of the box.

2. Software description

This section aims to describe the presented toolbox. In the following subsections, the toolbox structure and the inheritance structure of its classes are discussed.

2.1. Toolbox Structure

The ModelicaGym toolbox, which was developed and released on GitHub (modelicagym), is organized according to the following hierarchy of folders (see Figure 2):

  • - a folder with environment setup instructions and an FMU integration tutorial.

  • - a package for integration of FMU as an environment to OpenAI Gym.

  • - a folder with FMU model description file (.mo) and compiled FMU for testing and reproducing purposes.

  • - a package with examples of:

    • custom environment creation for the given use case (see the next section);

    • Q-learning agent training in this environment;

    • scripts for running various experiments in a custom environment.

  • - a package with Reinforcement Learning algorithms that are compatible with OpenAI Gym environments.

  • - a package with a test for working environment setup. It allows testing environment prerequisites before working with the toolbox.

To create a custom environment for a considered FMU simulating a particular model, one has to create an environment class. This class should be inherited from the appropriate toolbox class, depending on which tool was used to export the model. More details are given in the next subsection.

Figure 2. ModelicaGym toolbox structure

Class hierarchy of

2.2. Inheritance Structure

This section aims to introduce the hierarchy of modelicagym/environments that a user needs to be familiar with to begin exploiting the ModelicaGym toolbox for their purpose. The inheritance structure of the main classes of the toolbox is shown in Figure 3.

The folder contains the implementation of the logic that is common to all environments based on an FMU simulation. The main class is inherited from the Gym environment base class (see Figure 3) to declare the OpenAI Gym API. It also determines the internal logic required to ensure the proper functioning of the logic common to all FMU-based environments.

The class is inherited by ModelicaCSEnv and ModelicaMEEnv. These abstract wrapper classes structure the logic that is specific to the FMU export mode: co-simulation or model-exchange, respectively. Note that model-exchange mode is currently not supported.

Two classes that inherit the ModelicaCSEnv class are created to support FMUs compiled using Dymola and JModelica, respectively (refer to Figure 3). Any specific implementation of an environment integrating an FMU should be inherited from one of these classes. Further in this section, details of both the OpenAI Gym and internal API implementations are discussed.

declares the following API:

  • - restarts the environment and sets it ready for a new experiment. In the context of an FMU, this means setting the initial conditions and model parameter values and initializing the FMU for a new simulation run.

  • - performs an action that is passed as a parameter in the environment. This function returns a new state of the environment, a reward for the agent, and a boolean flag indicating whether the experiment is finished. In the context of an FMU, it sets the model inputs equal to the given action and runs a simulation for the considered time interval. An internal method is used for reward computation; another internal method is used to determine whether the experiment has ended.

  • - an attribute that defines space of the actions for the environment. It is initialized by an abstract method , that is model-specific and thus should be implemented in a subclass.

  • - an attribute that defines state space of the environment. It is initialized by an abstract method , that is model specific and thus should be implemented in a subclass.

  • - a dictionary with metadata used by package.

  • - an abstract method, should be implemented in a subclass. It defines a procedure of visualization of the environment’s current state.

  • - an abstract method, should be implemented in a subclass. It determines the procedure of a proper environment shut down.

To implement the aforementioned methods, a configuration attribute with model-specific information is utilized by the class. This configuration should be passed from a child-class constructor to create a correctly functioning instance. This way, using the model-specific configuration, model-independent general functionality is executed in the primary class. The following model-specific configuration is required:

  • - one or several variables that represent an action performed in the environment.

  • - one or several variables that represent an environment’s state.

    Note: Any variable in the model (i.e. a variable that is not defined as a parameter or a constant in Modelica) can be used as the state variable of the environment. On the contrary, for proper functionality, only model inputs can be used as environment action variables.

  • - a dictionary that stores model parameters with the corresponding values, and model initial conditions.

  • - defines time difference between simulation steps.

  • (optional) - a positive reward for a default reward policy. It is returned when an experiment episode goes on.

  • (optional) - a negative reward for a default reward policy. It is returned when an experiment episode is ended.

However, the class is defined as abstract, because some internal model-specific methods have to be implemented in a subclass (see Figure 3). The internal logic of the toolbox requires an implementation of the following model-specific methods:

  • _get_action_space(), _get_observation_space() - describe variable spaces of model inputs (environment action space) and outputs (environment state space), using one or several classes from package of OpenAI Gym.

  • - returns a boolean flag indicating whether the current state of the environment means that the episode of an experiment has ended. It is used to determine when a new episode of an experiment should be started.

  • (optional) - the default reward policy is ready to be used out of the box. The available method rewards the reinforcement learning agent for each step of an experiment and penalizes it when the experiment is done. In this way, the agent is encouraged to make the experiment last as long as possible. To use a more sophisticated rewarding strategy, however, this method has to be overridden.
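The subclassing pattern described above can be sketched as follows. `ModelicaBaseEnv` here is a stand-in stub for the toolbox's primary class (the real one wraps an FMU); the method names `_get_action_space`, `_get_observation_space`, `_is_done` and `_reward_policy` follow the text, while everything else (class names, threshold value, list-based spaces) is assumed for illustration:

```python
class ModelicaBaseEnv:
    """Stub of the toolbox base class: keeps the generic, model-independent
    logic, including the default positive/negative reward policy."""

    def __init__(self, config):
        self.config = config
        self.positive_reward = config.get('positive_reward', 1)
        self.negative_reward = config.get('negative_reward', -100)

    def _reward_policy(self):
        # default policy: reward each ongoing step, penalize episode end
        return self.negative_reward if self._is_done() else self.positive_reward

class MyFmuEnv(ModelicaBaseEnv):
    """Hypothetical subclass supplying the model-specific pieces."""

    def __init__(self, config, x_threshold=2.4):
        super().__init__(config)
        self.x_threshold = x_threshold
        self.state = (0.0,)

    def _get_action_space(self):
        return [0, 1]  # stand-in for gym.spaces.Discrete(2)

    def _get_observation_space(self):
        return [(-self.x_threshold, self.x_threshold)]  # stand-in for gym.spaces.Box

    def _is_done(self):
        return abs(self.state[0]) > self.x_threshold

env = MyFmuEnv({'positive_reward': 1, 'negative_reward': -100})
print(env._reward_policy())   # 1 while the episode goes on
env.state = (3.0,)
print(env._reward_policy())   # -100 once the episode has ended
```

The base class executes the general functionality; the subclass only answers the model-specific questions (spaces, termination, optionally reward).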

Figure 3. Class hierarchy of the

Class hierarchy of the

Examples and experiments will be discussed in the next section.

3. Use Case: Cart-Pole Problem

In this section, a use case of the toolbox set up and exploitation is presented. For this purpose, a classic Cart-Pole problem was chosen.

3.1. Problem Formulation

The two-dimensional Cart-Pole system includes a cart moving on a 1-d frictionless track with a pole standing on it (see Figure 4). The pole’s end is connected to the cart with a pivot so that the pole can rotate around this pivot.

The goal of the control is to keep the pole standing while moving the cart. At each time step, a certain force is applied to move the cart (refer to Figure 4). In this context, the pole is considered to be standing when its deflection is not more than a chosen threshold. Specifically, the pole is considered standing at a given step if two conditions, on the cart position and on the pole deflection, are fulfilled. Therefore, a control strategy for keeping the pole upright in an unstable equilibrium point should be developed. It should be noted that the episode length serves as the agent’s target metric, defining how many steps the RL agent can balance the pole.

In this particular case, a simplified version of the problem was considered, meaning that at each time step the force magnitude is constant and only its direction is variable. In this case, the constraints on the system are: a) the moving cart is not further than a threshold distance (in meters) from the starting point; b) the pole’s deflection from the vertical is not more than a threshold angle (in degrees).
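The two episode constraints can be expressed as a small check. The default threshold values below (2.4 m and 12 degrees) are assumptions for illustration, and `theta` is measured here as the deflection from the vertical:

```python
import math

def pole_is_standing(x, theta, x_threshold=2.4, theta_threshold_deg=12):
    """True while both constraints hold: the cart stays within
    x_threshold meters of the start and the pole's deflection from the
    vertical stays within theta_threshold_deg degrees.
    Threshold defaults are illustrative assumptions."""
    return (abs(x) <= x_threshold
            and abs(theta) <= math.radians(theta_threshold_deg))

print(pole_is_standing(0.5, math.radians(5)))    # True: within both bounds
print(pole_is_standing(3.0, math.radians(5)))    # False: cart too far
print(pole_is_standing(0.5, math.radians(20)))   # False: pole deflected too much
```

An episode ends as soon as this check fails, so the episode length directly measures how long the agent keeps both constraints satisfied.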

Figure 4. Cart-Pole system

Cart-Pole system

3.2. Modelica Model

A convenient way to model the Cart-Pole system is to model its parts in the form of differential and algebraic equations and to connect the parts together (refer to Figure 5). In addition, the elements of the Cart-Pole system can be instantiated from the Modelica standard library. This facilitates the modeling process. However, several changes to the instances are required.

Thus, to use the standard library efficiently, the modeling problem was reformulated. The pole can be modeled as an inverted pendulum standing on a moving cart, with the center of the pole’s mass located at the inverted pendulum’s bob. To model the pole using the standard model of a pendulum, the following properties have to be considered: a) the length of the pendulum is equal to half of the pole’s length; b) the mass of the bob is equal to the mass of the pole. The pendulum’s pivot is placed in the centre of the cart, and it can be assumed that the mass of the cart is concentrated in this point as well. Also, a force of the given magnitude is applied to the cart to move the pivot along the 1-d track.

As a result, using the standard pendulum model, the example in (dymrl), and elements from the Modelica standard library, the model was composed. In contrast to the model in (dymrl), the developed model is structurally simpler, and its parameters are intuitive. To simulate the developed model, an FMU was generated (see Figure 5).

Figure 5. Cart-Pole model. Modelica model structure in OpenModelica (fritzson2006openmodelica).

Cart-Pole model. Modelica model structure in OpenModelica (fritzson2006openmodelica)

3.3. Cart-Pole FMU Integration

To integrate the required FMU using the ModelicaGym toolbox, one should create an environment class inherited according to the inheritance structure presented in Section 2.2. To this end, the model’s configuration should be passed to the parent class constructor. Furthermore, some methods that introduce model-specific logic should be implemented. In this section, these steps to integrate a custom FMU are discussed. A detailed tutorial is available in the toolbox documentation (modelicagym); it describes the implementation in a step-wise manner with detailed explanations and allows toolbox users to get started quickly.

To start, the exact FMU specification, which is determined by the model, should be passed to the primary parent class constructor as a configuration. This way, the logic that is common to all FMU-based environments functions correctly.

Therefore, the Modelica model’s configuration for the considered use case is given below with explanations.

Initial conditions and model parameters’ values are set automatically when the environment is created. For the considered model these are:

  • - initial angle value of the pole (in rad). This angle is measured between the pole and the positive direction of X-axis (see Figure 4).

  • - initial angular velocity of a pole (in rad/s);

  • - a mass of a cart (in kg);

  • - a mass of a pole (in kg).

Environment state is represented by the Modelica model outputs that are generated at each simulation step. For the considered model the state variables are:

  • - a cart position (in m);

  • - a cart velocity (in m/s);

  • - the pole’s angle (in rad), initialized with the initial angle value;

  • - the angular velocity of the pole (in rad/s), initialized with the initial angular velocity.

The action is presented by the magnitude of the force applied to the cart at each time step. According to the problem statement (see Section 3.1), the magnitude of the force is constant and chosen when the environment is created, while the direction of the force is variable and chosen by the RL agent.

Listing 1 gives an example of the configuration for the Cart-Pole environment that has to be passed to the parent class constructor (ModelicaBaseEnv in Figure 3).

config = {
    'model_input_names': 'f',
    'model_output_names': ['x', 'x_dot', 'theta', 'theta_dot'],
    'model_parameters': {'m_cart': 10,
                         'm_pole': 1,
                         'theta_dot_0': 0},
    'time_step': 0.05,
    'positive_reward': 1,
    'negative_reward': -100
}
Listing 1: Environment configuration for the Cart-Pole example

For the Cart-Pole example, a specific class that relates the settings of the Cart-Pole environment to the general functionality of the toolbox was created. This class is inherited according to the toolbox inheritance structure requirements (see Figure 3). The JModelicaCSCartPoleEnv class was written such that all Modelica model parameters were made class attributes, so that one can configure the environment during an experiment setup. This eases running experiments in the created environment.

To finish the integration, several model-specific methods were implemented and are briefly discussed below:

  • checks whether the cart position and the pole’s angle are inside the required bounds defined by thresholds. This method returns True if the cart is not further than the threshold distance (in meters) from the starting point and the pole’s deflection from the vertical position is less than 12 degrees. Otherwise, it returns False, as the pole is considered fallen and the experiment episode is ended.

  • returns a gym.spaces.Discrete action space of size 2, because only two actions, push left and push right, are available to the agent.

  • returns a gym.spaces.Box state space with specified lower and upper bounds for continuous state variables.

  • visualizes Cart-Pole environment in the current state, using built-in tools.

  • method was overridden to implement the expected action execution, i.e. fixed force magnitude but variable direction at each experiment step. The sign of the force determines the direction: positive pushes the cart to the right, negative pushes it to the left. Another method was also overridden to allow a proper shut down of the custom rendering procedure.
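The sign convention used in the overridden step can be sketched in one line; the 0/1 encoding of the two discrete actions is an assumption for illustration:

```python
def force_from_action(action, force_magnitude):
    """Map a discrete action to a signed force: 1 pushes the cart to the
    right (positive force), 0 pushes it to the left (negative force).
    The 0/1 encoding is an illustrative assumption."""
    return force_magnitude if action == 1 else -force_magnitude

print(force_from_action(1, 11))   # 11  -> push right
print(force_from_action(0, 11))   # -11 -> push left
```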

This way, a custom FMU that is exported in co-simulation mode and simulates the Cart-Pole environment was integrated into Gym in a couple of straightforward steps. The configured environment allows running experiments in a plug-and-play manner thanks to the utilization of the ModelicaGym toolbox.

4. The Cart Pole Simulation Set Up

This section aims to explain the components of the Cart Pole experiment set up that could serve as an example for the ModelicaGym user when setting up another experiment.

4.1. Experiment Procedure

To verify the correctness of the FMU integration, several experiments were performed in the implemented Cart-Pole environment according to the pipeline in Algorithm 1. For each experiment, a set of parameters that defines the Cart-Pole environment was created and set as an input of the procedure. The number of episodes of the Q-learning agent’s training is an input parameter of Algorithm 1 as well. The value of the n_episodes parameter was the same for all experiments in order to maintain equal conditions for further comparison of the simulation results. To obtain a statistically representative sample, the training was repeated several times in the restarted environment. The output, which includes episode lengths and total execution time, was saved to ease further analysis and visualization of the results.

Parameters :  - number of experiment repeats to perform, - number of episodes to perform in one experiment, - parameters required to configure the environment
Result: lists of length with experiment execution times, matrix of shape with episodes’ lengths
create with ;
for  to  do
       train Q-learning agent in ;
       append episodes’ lengths and execution time to result;
       reset ;
end for
Algorithm 1 Experiment procedure

Following the established procedure (see Algorithm 1), four experiments varying the input parameters that influence the outcome of reinforcement learning were conducted. These experiments are: 1) variation of the force magnitude; 2) variation of the cart-pole mass ratio; 3) variation of the positive-negative reward ratio; 4) variation of the simulation time step. The values of the changed parameters are given for each experiment in Table 1.

4.2. Q-learning Agent

In Section 4.1, a Q-learning agent was mentioned in the context of the input parameter settings of Algorithm 1. This section aims to explain in detail the role and the setup of the Q-learning agent for the Cart-Pole experiment using the ModelicaGym toolbox. In the ModelicaGym toolbox, the Q-learning agent is implemented by a class from the gymalgs/rl package of the toolbox.

In general, the Q-learning algorithm assumes discrete state and action spaces of an environment. Therefore, the continuous state space of the Cart-Pole environment was discretized by splitting the interval of possible values for each state variable into 10 bins. To take into account possible values outside the interval, the leftmost bin is unbounded from the left (by −∞), while the rightmost bin is unbounded from the right (by +∞). These bins are utilized for encoding the current environment state and getting the index of the state in a Q-table: the index is obtained by concatenating the indexes of the bins into which each of the four state variables falls. Moreover, the Q-table that represents the agent’s belief about an optimal control policy is initialized randomly with uniformly distributed values.
According to the problem formulation (see Section 3.1), the intervals for the variables were chosen as follows:

  • Cart’s position - .

  • Cart’s velocity - .

  • Pole’s angle - .

  • Pole’s angle velocity - .
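The discretization and Q-table indexing described above can be sketched as follows. The interval bounds used below are illustrative placeholders (the exact values are not reproduced here), and the outermost bins absorb values outside the interval, as if bounded by −∞ and +∞:

```python
def to_bin_index(value, low, high, n_bins=10):
    """Index of the bin that `value` falls into when [low, high] is split
    into n_bins equal bins; values outside the interval land in the
    outermost bins (conceptually bounded by -inf and +inf)."""
    if value <= low:
        return 0
    if value >= high:
        return n_bins - 1
    return int((value - low) / (high - low) * n_bins)

def state_index(state, bounds, n_bins=10):
    """Q-table row index obtained by concatenating the bin indexes of the
    state variables, interpreted as digits in base n_bins."""
    digits = [to_bin_index(v, lo, hi, n_bins) for v, (lo, hi) in zip(state, bounds)]
    index = 0
    for d in digits:
        index = index * n_bins + d
    return index

# Assumed (illustrative) intervals for position, velocity, angle, angular velocity:
bounds = [(-2.4, 2.4), (-1.0, 1.0), (1.3, 1.8), (-2.0, 2.0)]
print(state_index((0.0, 0.0, 1.57, 0.0), bounds))  # 5555
```

With 10 bins per variable and four variables, the table has at most 10^4 distinct state rows.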

The Q-learning algorithm is parametrized not only by external parameters that are defined by the environment, but also by its intrinsic parameters. Thus, the following intrinsic parameters of the algorithm and their values were utilized:

  • learning_rate - the part of the Q-value that is updated with each new observation. The chosen value of 0.2 makes the agent replace only 20% of the previous knowledge.

  • discount_factor - defines the importance of future reward. The chosen value encourages infinitely long runs.

  • exploration_rate - determines the exploration (random action choice) probability at each step. In this case, the agent chooses a random action with probability 0.5 at the first step.

  • exploration_decay_rate - the exploration probability decay applied at each step. A slow decay was chosen to let the agent explore more in the early learning phase, while exploiting the learned policy after significant training time.

The Q-learning training procedure that is utilized in the experiment procedure in Algorithm 1 was carried out according to Algorithm 2.

The Q-learning algorithm uses the concept of a Q-value - a proxy that determines the utility of an action in a state. The set of Q-values assigned to all state-action pairs forms a Q-table, which represents the agent’s belief about the optimal control policy. In the training process, the Q-learning agent uses information about the current environment state, the action to be performed, and the resulting state after the action is applied. This information is used to update the Q-table. The update of the Q-values is mathematically formulated as follows:


Q(s, a) ← (1 − α) · Q(s, a) + α · (r + γ · max_{a′} Q(s′, a′)),

where α is the learning rate, γ is the discount factor, s - a starting state, s′ - a resulting state, a - the action that led from s to s′, r - the reward received by the agent in the state s′ after performing the action a, and Q(s, a), Q(s′, a′) - the Q-values defined by the starting state and the action, or the resulting state and an action, correspondingly.
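A minimal sketch of this update rule in Python; the dictionary-based Q-table, the two-action space, and the parameter values are illustrative simplifications, not the toolbox's implementation:

```python
def q_update(Q, s, a, r, s_next, actions=(0, 1),
             learning_rate=0.2, discount_factor=0.99):
    """One Q-learning update:
    Q(s,a) <- (1 - alpha)*Q(s,a) + alpha*(r + gamma*max_a' Q(s',a')).
    Q is a dict mapping (state, action) -> value; unseen pairs default to 0.
    Action set and parameter values are illustrative assumptions."""
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    old = Q.get((s, a), 0.0)
    Q[(s, a)] = ((1 - learning_rate) * old
                 + learning_rate * (r + discount_factor * best_next))
    return Q[(s, a)]

Q = {}
print(q_update(Q, s=0, a=1, r=1.0, s_next=1))   # 0.2 * (1.0 + 0.99 * 0.0) = 0.2
```

Each observed transition (s, a, r, s′) nudges one cell of the table toward the reward plus the discounted best continuation.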

To solve the exploration-exploitation trade-off in the training procedure, the Q-learning agent utilizes an ε-greedy policy. The policy name originates from the parameter ε, which is referenced as exploration_rate in Algorithm 2. According to the policy, the probability to choose the optimal action is set to 1 − ε. This choice defines exploitation of the best action among the already explored actions. Therefore, to support exploration of the other actions, the next action is chosen randomly with a probability of ε. Adaptive change of the parameter ε, introduced by the utilization of exploration_decay_rate, allows the agent to be more flexible in the early stages of exploration and more conservative in the mature phase.
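A sketch of the ε-greedy choice with exploration decay; the decay value 0.9 below is an illustrative placeholder (the paper only states that a slow decay was chosen):

```python
import random

def eps_greedy_action(q_values, exploration_rate, rng=random):
    """epsilon-greedy choice: with probability exploration_rate pick a
    random action, otherwise the action with the highest Q-value."""
    if rng.random() < exploration_rate:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

# Decay schedule: epsilon starts at 0.5 and shrinks multiplicatively
# each step (0.9 is an assumed decay rate for illustration).
exploration_rate, exploration_decay_rate = 0.5, 0.9
for step in range(3):
    exploration_rate *= exploration_decay_rate
print(round(exploration_rate, 4))   # 0.3645

rng = random.Random(0)
print(eps_greedy_action([0.1, 0.9], exploration_rate=0.0, rng=rng))  # 1 (greedy)
```

With exploration_rate = 0 the policy is purely greedy; with exploration_rate = 1 it is purely random, and the decay moves the agent gradually from the second regime toward the first.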

Parameters :  - environment, - maximum number of steps allowed in an episode, - number of episodes for agent training, - a boolean flag if should be rendered at each step; Q-learning parameters: , , ,
Result: A trained Q-learner, a list of size with episodes’ lengths, training execution time in seconds
start ; ; ; initialize Q-learner with Q-table, given parameters; for  to  do
       encode as a sequence of discretization bin’s index;
       choose initial randomly;
       for  to  do
             if  then
                   render ;
             end if
             update Q-table using and ;
             choose using Q-table and -greedy policy;
             if  OR step ==  then
             end if
       end for
end for
end ; return Q-learner, , execution time from ;
Algorithm 2 Training a Q-learning agent in an FMU-based environment

5. Results & Discussion

In this section, the results of experiments that differ in the input Cart-Pole parameter values are presented and summarized in Table 1.

5.1. Selection of force magnitude

According to the formulation of the Cart-Pole problem in Section 3.1, one of the variables that influence the experiment is the force applied to the cart. Therefore, in this subsection the dependence of the Q-learning algorithm’s learning process on three different force magnitudes is studied. These values are chosen with respect to the reference force magnitude, which is the force required to give the cart a given reference acceleration.

Thus, to investigate the Cart Pole system’s behaviour, three values of force magnitude were considered:

  1. Force magnitude that is significantly smaller than the reference;

  2. Force magnitude that is slightly bigger than the reference;

  3. Force magnitude that is significantly bigger than the reference.

The exact values can be found in Table 1.

Five experiments were run for each selected force magnitude value, with the number of episodes for agent training equal to 100. Episode lengths were smoothed with a moving average window of size 20 for visualization purposes. In this way, the average smoothed episode length represents the trend. Results are shown in Figure 6.
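The smoothing used for the plots is a plain moving average; a minimal sketch (with toy data, not the experiment's episode lengths):

```python
def moving_average(values, window=20):
    """Smooth a sequence of episode lengths with a simple moving average
    of the given window size, as used for the plots described above."""
    if len(values) < window:
        return []
    return [sum(values[i - window:i]) / window
            for i in range(window, len(values) + 1)]

episode_lengths = list(range(1, 41))          # toy data: 1..40
smoothed = moving_average(episode_lengths, window=20)
print(smoothed[0], smoothed[-1])              # 10.5 30.5
```

A rising smoothed curve indicates that episodes are getting longer on average, i.e. that the agent is learning.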

| Force magnitude | Seconds per step | Cart-pole masses | Seconds per step | Positive-negative reward | Seconds per step | Time step | Seconds per step |
|---|---|---|---|---|---|---|---|
| 5 | 0.118 | 1; 10 | 0.111 | 1; -200 | 0.114 | 0.01 | 0.11 |
| 11 | 0.113 | 5; 10 | 0.112 | 1; -100 | 0.113 | 0.05 | 0.113 |
| 17 | 0.112 | 10; 10 | 0.118 | 1; -50 | 0.113 | 0.1 | 0.117 |
| - | - | 10; 5 | 0.112 | - | - | 0.5 | 0.145 |
| - | - | 10; 1 | 0.113 | - | - | 1 | 0.218 |

Average execution time per simulation step: 0.122 s
Average execution time per simulation step, excluding the 0.5 s and 1 s time-step experiments: 0.114 s

Table 1. Experiments summary: changed parameter values and average execution time per simulation step

In Figure 6, as expected, growth of the episode length was observed for the moderate and big magnitudes of the applied force. In this problem, the episode length is the reinforcement learning agent’s target metric; therefore, it can be stated that the agent is learning in these cases. Moreover, with a bigger force magnitude a higher average episode length is reached, meaning the agent learns faster.

However, for the smallest force magnitude, the agent fails to learn. In Figure 6 we observe a plateau close to the initial level in this case. The reason is that with such a small magnitude of the force applied to the cart, it is not possible to exert enough influence on the system to balance the pole within the given constraints.

Figure 6. Average smoothed episode length for the force variation experiment.

For a force bigger than the cart’s mass the agent is learning, while for the small force magnitude it fails to learn.

5.2. Selection of Cart-pole mass ratio

In the Cart Pole problem another physical parameter that influences the control and, therefore, the reinforcement learning process, is cart-pole mass ratio.

Thus, to observe this influence, the system’s behaviour for five pairs of cart and pole masses with different ratios is studied in this section.

These five pairs of the mass ratio are selected as follows:

  1. The pole’s mass is significantly bigger than the mass of a cart.

  2. The pole’s mass is two times bigger than the mass of a cart.

  3. The pole’s mass is equal to the mass of a cart.

  4. The pole’s mass is two times smaller than the mass of a cart.

  5. The pole’s mass is significantly smaller than the mass of a cart.

Exact values of the selected mass ratios are shown in Table 1. For each experiment with a selected mass ratio, the number of episodes for agent training was equal to 200.

The observed system’s behaviour in most scenarios of the selected cart-pole mass ratio indicates that the agent is able to learn, showing good performance regardless of what is heavier: the cart or the pole. This was observed in 4 out of 5 cases, where the RL agent’s ability to perform the required task increased with training time. (One can find the results and visualizations in the toolbox documentation (modelicagym).) In the mentioned four cases, the agent reached the same level of performance, measured by a smoothed average episode length of around 40 steps. Therefore, it can be concluded that in these cases the chosen values of the cart and pole masses do not influence the training speed.

However, in case (3), when the masses of the cart and the pole are equal, the observed system's behaviour is markedly different (see Figure 7): the episode length does not increase with the number of episodes.

For visualization purposes, the episode lengths were smoothed with a moving average of window size 20. The average smoothed episode length (in red in Figure 7) represents the trend of the experiment. There is a plateau in episode length at the level of 9-10 steps. This value is almost equal to the initial level observed at the beginning of the experiment, indicating that the agent fails to train.

In one of the runs, the episode length, which measures the agent's performance, drops considerably, to 8 steps. The reason for this phenomenon may be that a cart-pole system with equal masses is very unstable; therefore, it may not be possible to balance the pole within the given constraints.
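The smoothing applied throughout these experiments can be sketched as follows. This is a minimal illustration of a moving-average smoother, not the toolbox's exact plotting code; a growing window is assumed at the start of the series so the output has the same length as the input:

```python
import numpy as np

def smooth(episode_lengths, window=20):
    """Smooth a sequence of episode lengths with a simple moving average.

    For the first `window - 1` entries, a growing window is used so the
    smoothed series has the same length as the input.
    """
    lengths = np.asarray(episode_lengths, dtype=float)
    smoothed = np.empty_like(lengths)
    for i in range(len(lengths)):
        start = max(0, i - window + 1)
        smoothed[i] = lengths[start:i + 1].mean()
    return smoothed
```

Plotting the smoothed series next to the raw episode lengths reveals the trend (the red curve in the figures) without the episode-to-episode noise.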

Figure 7. Smoothed episode length for experiment (3)

With equal cart and pole masses, the agent fails to learn.

5.3. Selection of Reward ratio

The aim of this subsection is to observe the dependency of the system's behaviour and its learning performance on the reward value. Two types of reward, positive and negative, are assigned among other input parameters of the experiment. While the positive reward is given when the agent succeeds in holding the pole in a standing position and the episode goes on, the negative reward is assigned when the pole falls and the episode ends. Three different pairs of positive and negative reward values were considered in the experiment (refer to the exact values in Table 1):

  1. The negative reward is large enough that the agent has to balance the pole for 200 steps to obtain a non-negative cumulative reward.

  2. The negative reward is large enough that the agent has to balance the pole for 100 steps to obtain a non-negative cumulative reward.

  3. The negative reward is large enough that the agent has to balance the pole for 50 steps to obtain a non-negative cumulative reward.
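The relationship between the terminal penalty and the break-even episode length is simple arithmetic: with a per-step positive reward, the agent breaks even once the accumulated per-step rewards outweigh the one-time penalty. The values below are illustrative (a per-step reward of 1 is assumed; the paper's exact parameters are listed in Table 1):

```python
import math

def break_even_steps(positive_reward, negative_reward):
    """Minimum number of balancing steps needed before the cumulative
    episode reward becomes non-negative, given a per-step positive
    reward and a one-time negative reward at episode termination."""
    return math.ceil(abs(negative_reward) / positive_reward)

# With a per-step reward of 1, terminal penalties of -200, -100 and -50
# correspond to the three cases above: 200, 100 and 50 break-even steps.
```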

The length of each experiment is defined by the number of episodes used for training the agent (Figure 8). In order to visualize the trend, the episode lengths were smoothed with a moving average window of size equal to .

Figure 8. Average smoothed episode length for the reward variation experiment.

For a bigger negative reward, the agent learns faster.

As expected, the observed episode length increases with training time, corresponding to longer balancing of the pole on the cart (refer to Figure 8). Moreover, the biggest negative reward magnitude leads to the best result: the episode length increased fastest in this case. The final average smoothed episode length, an indicator of the agent's ability to solve the balancing problem, also grows as the negative reward's magnitude increases.

On the contrary, when the negative reward's magnitude is smaller, slower training and even a significant decrease in performance are observed. These drops could be resolved with a longer experiment, but this is not an optimal training result. The reason may be that a smaller negative reward magnitude does not penalize bad agent decisions strongly enough.

5.4. Time step variation and Execution time

The time step defines the interval between two consecutive control actions applied to the system. At each iteration, the system's behaviour is simulated over this interval. To study the influence of the time step on the training result, five different values were considered; they are presented in Table 1. In particular, the smallest simulation time step () appeared to be too small for a real experiment controlling a Cart-Pole system, while the biggest simulation time step () is too large for keeping the system balanced, i.e. the pole can fall within a single time step.

It was observed that training is too slow with a very small time step, while it is inefficient with a very big one (see Table 1). When the time step equals , the agent cannot overcome the threshold of a simulation length of four steps. The reason for this is most likely the same as for the simulation time step of . Such learning behaviour is due to the fact that large time steps limit the ability to control the system towards the required goal: the control action changes too seldom to reach it.
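Schematically, the time step enters the training loop as the simulation interval between consecutive control actions. The sketch below is illustrative only; `env`, `agent`, and their methods are generic Gym-style names, not the toolbox's exact API:

```python
def control_actions_per_second(time_step):
    """Number of control decisions the agent makes per simulated second.

    A larger time step means the action changes less often, which can
    make the system uncontrollable; a smaller one means more FMU calls
    and hence a longer execution time.
    """
    return 1.0 / time_step

def run_episode(env, agent, max_steps=200):
    """One training episode: the environment simulates the model for
    one time step between consecutive actions (illustrative loop)."""
    state = env.reset()
    for step in range(max_steps):
        action = agent.act(state)                    # choose control action
        state, reward, done, _ = env.step(action)    # simulate one time step
        if done:
            return step + 1                          # episode length in steps
    return max_steps
```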

On the other hand, a simulation of the learning process with a big time step takes less execution time than the simulation of the same learning interval with a smaller time step. This is caused by the fact that additional time is spent on each FMU call at each time step, while the FMU simulation time itself increases only slightly.

Thus, to guarantee effective and efficient training, a trade-off between the time step length and the execution time of learning has to be found. For the considered system, a simulation time step of is a good choice. This is reasonable from the point of view of both training and application.

Execution time was measured for all the experiments. The average time per simulation step is summarized in Table 1. It was observed that, for a fixed time step, the time per simulation step is almost the same for any set of parameters.

6. Conclusions

In this project, ModelicaGym, an open-source universal toolbox for integrating Modelica models into OpenAI Gym as environments, was developed to provide more opportunities for fast and convenient development of RL applications.

Thus, both RL solution developers and engineers using Modelica can benefit from ModelicaGym, which allows an FMU to simulate the environment logic and, therefore, makes it possible to apply reinforcement learning to real-world problems modelled in Modelica.

Using a classic Cart-Pole control problem example, the ModelicaGym functionality, modularity, extensibility and validity have been presented and discussed. The results of the experiments that were performed on the Cart-Pole environment indicated that integration was successful.

The toolbox can be easily used by both developers of RL algorithms and engineers who use Modelica models. It provides extensive options for model integration and allows employing both open-source and proprietary Modelica tools. Thus, it is expected to have great value for a broad user audience.

7. Future Work

Even though the ModelicaGym toolbox is complete and ready to use, several extensions can be made.

First of all, a more sophisticated use case is in the focus of our research plans. Power system models, which have higher modelling complexity, are a suitable research object, especially in the context of applying more advanced reinforcement learning algorithms.

Second, adding reinforcement learning methods that work out of the box with little customization would enhance the toolbox functionality.

In addition, currently only FMUs exported in co-simulation mode are supported. Thus, another extension would be testing and providing functionality for FMUs exported in model-exchange mode. The toolbox architecture allows for this possibility; however, feedback from the community should be gathered first to assess the real demand for such functionality.

We would like to thank Dr. Luigi Vanfretti from Rensselaer Polytechnic Institute for insightful discussions and for igniting our inspiration to use Modelica. The authors would also like to thank ELEKS for funding the Machine Learning Lab at Ukrainian Catholic University and this research.


Appendix A Setup

The toolbox was tested on the following environment setup:

  • Ubuntu 18, 64-bit version

  • Python 3.6.8

  • Java 8

  • Assimulo 2.9

  • PyFMI 2.3.1

  • Sundials 2.4.0

  • Ipopt 3.12.12

The listed libraries are required for proper usage. Modelica tools are required only for FMU compilation and are therefore optional for toolbox usage. If one uses a co-simulation FMU exported from Dymola, the licence file should be available as well. The authors used JModelica 2.4 and Dymola 2017.

Machine parameters: Intel Core i7-7500U 2.7GHz (3 cores available), 12GB RAM.