Deep Q-Learning with Q-Matrix Transfer Learning for Novel Fire Evacuation Environment

05/23/2019 ∙ by Jivitesh Sharma, et al. ∙ Universitetet Agder

We focus on the important problem of emergency evacuation, which has been largely unaddressed by reinforcement learning even though it could clearly benefit from it. Emergency evacuation is a complex task which is difficult to solve with reinforcement learning, since an emergency situation is highly dynamic, with many changing variables and complex constraints that make it difficult to train on. In this paper, we propose the first fire evacuation environment to train reinforcement learning agents for evacuation planning. The environment is modelled as a graph capturing the building structure. It consists of realistic features like fire spread, uncertainty and bottlenecks. We have implemented the environment in the OpenAI gym format, to facilitate future research. We also propose a new reinforcement learning approach that entails pretraining the network weights of DQN-based agents to incorporate information on the shortest path to the exit. We achieve this by using tabular Q-learning to learn the shortest path on the building model's graph. This information is transferred to the network by deliberately overfitting it on the Q-matrix. Then, the pretrained DQN model is trained on the fire evacuation environment to generate the optimal evacuation path under time-varying conditions. We compare the proposed approach with state-of-the-art reinforcement learning algorithms like PPO, VPG, SARSA, A2C and ACKTR. The results show that our method outperforms state-of-the-art models by a huge margin, including the original DQN-based models. Finally, we test our model on a large and complex real building consisting of 91 rooms, with the possibility to move to any other room, hence giving 8281 actions. We use an attention-based mechanism to deal with large action spaces. Our model achieves near-optimal performance on the real-world emergency environment.




I Introduction

Reinforcement Learning (RL) has been a subject of extensive research and applications in various real world domains such as Robotics, Games, Industrial Automation and Control, System Optimization, Quality Control and Maintenance. However, some extremely important areas where Reinforcement Learning could be highly valuable have not received adequate attention from researchers. We turn our attention to the major problem of evacuation in case of fire emergencies.
Fire related disasters are the most common type of emergency situation. They require thorough analysis of the situation for a quick and precise response. Even though this critical application hasn’t received adequate attention from AI researchers, there have been some noteworthy contributions. One such paper, focusing on assisting decision making for fire brigades, is described in [1]. Here, the RoboCup Rescue simulation is used as a fire simulation environment [2]. A SARSA agent [3] is used with a new learning strategy called Lesson-by-Lesson learning, similar to curriculum learning. Results show that the RL agent is able to perform admirably in the simulator. However, the simulator lacks realistic features like bottlenecks and fire spread, and has a grid structure which is too simplistic to model realistic environments. Also, the approach seems unstable and needs information about the state which isn’t readily available in real life scenarios.
In [4], multiple coordinated agents are used for forest fire fighting. The paper uses a software platform called Pyrosim which is used to create dynamic forest fire situations. The simulator is mostly used for terrain modeling and a coordinated multiple agent system is used to extinguish fire and not for evacuation.
The evacuation approach described in [5] is similar to the problem we try to solve in this paper. In [5], a fading memory mechanism is proposed with the intuition that in dynamic environments less trust should be put on older knowledge for decision making. But arguably, this could be achieved more efficiently by the ’’ parameter in Q-learning along with prioritized experience replay. Also, the graph based environment used in [5] lacks many key features like fire spread, people in rooms, bottlenecks etc.
The most significant work done on building evacuation using RL is reported in [6]. The evacuation environment is grid based with multiple rooms and fire. The fire spread is modelled accurately and uncertainty taken into account. The multi-agent Q-learning model is shown to work in large spaces as well. Further, the paper demonstrates a simple environment and strategy for evacuation. However, the approach proposed in [6] lacks key features like bottlenecks and actual people in rooms. The grid based environment isn’t able to capture details of the building model like room locations and paths connecting rooms.
Some interesting research on evacuation planning takes a completely different approach by simulating and modelling human and crowd behaviour under evacuation [7, 8, 9, 10]. Our work on evacuation planning is not based on human behaviour modelling or the BDI (Belief-Desire-Intention) framework for emergency scenarios. These methods are beyond the scope of this paper and are not discussed here.

Proposed Environment

There are many reinforcement learning libraries that contain simulations and game environments to train reinforcement learning based agents [11, 12, 13, 14, 15]. However, currently no realistic learning environment for emergency evacuation has been reported.
In our paper, we build the first realistic fire evacuation environment specifically designed to train reinforcement learning agents for evacuating people in the safest manner in the least number of time-steps possible. The environment has the same structure as OpenAI gym environments, so it can be used easily in the same manner.
The proposed fire evacuation environment is graph based, which requires complex decision making such as routing, scheduling and dealing with bottlenecks, crowd behaviour uncertainty and fire spread. This problem falls in the domain of discrete control. The evacuation is performed inside a building model, which is represented as a graph. The agent needs to evacuate all persons in all rooms through any available exits using the shortest path in the least number of time-steps, while avoiding any perilous situations like the fire, bottlenecks and other hazardous situations.
Some previous research papers focus on modelling fire spread and prediction, mostly using cellular automata [16] and other novel AI techniques [17, 18, 19]. An effective and innovative way of modelling fire spread is to use spatial reinforcement learning, as proposed in [20]. However, our way of simulating fire spread is far less complex and leverages the reward system of the RL framework. In our proposed environment, we simply use an exponential decay reward function to model the fire spread and direction. To stay in tune with the RL framework, the feedback sent from the environment back to the agent should convey enough information. So, we design the reward function in such a manner that the agent can learn about the fire spread and take measures accordingly.

Proposed Method

Since this environment poses a high level of difficulty, we argue that incorporating the shortest path information (shortest path from each room to the nearest exit) in the DQN model(s) by transfer learning and pretraining the DQN neural network function approximator is necessary.

Transfer learning has been used extensively in computer vision tasks for many years, and was recently vastly expanded for many computer vision problems in [21]. Lately, it has been utilized in Natural Language models [22, 23]. In reinforcement learning, pretrained models have started to appear as well [24, 25]. In fact, we use the convergence analysis of [25], which provides a general theoretical perspective of task transfer learning, to prove that our method guarantees convergence.
In this paper, we present a new class of pretrained DQN models called Q-matrix Pretrained Deep Q-Networks (QMP-DQN). We employ Q-learning to learn a Q-matrix representing the shortest paths from each room to the exit. We perform multiple random episodic starts and $\epsilon$-greedy exploration of the building model graph environment. Q-learning is applied on a pretraining instance of the environment that consists of only the building model graph. Then, we transfer the Q-matrix to a DQN model, by pretraining the DQN to reproduce the Q-matrix. Finally, we train the pretrained DQN agent on the complete fire evacuation task. We compare our proposed pretrained DQN models (QMP-DQN) against regular DQN models and show that pretraining for our fire evacuation environment is necessary. We also compare our QMP-DQN models with state-of-the-art Reinforcement Learning algorithms and show that off-policy Q-learning techniques perform better than other policy based methods as well as actor-critic models.
Finally, in Section 5, we show that our method can perform optimal evacuation planning on a large and complex real world building model by dealing with the large discrete action space in a new and simple way by using an attention based mechanism.


This paper contributes to the field of reinforcement learning, emergency evacuation and management in the following manner:

  1. We propose the first reinforcement learning based fire evacuation environment with OpenAI Gym structure.

  2. We build a graph based environment to accurately model the building structure, which is more efficient than a maze structure.

  3. The environment can consist of a large discrete action space with $n^2$ number of actions (for all possibilities), where $n$ is the number of rooms in the building. That is, the action space size increases quadratically with respect to the number of rooms.

  4. Our proposed environment contains realistic features such as multiple fires and dynamic fire spread which is modelled by the exponential decay reward function.

  5. We further improve the realism of our environment by restricting the number of people allowed in each room to model over-crowded hazardous situations.

  6. We also include uncertainty about action performed in the environment to model uncertain crowd behaviour, which also acts as a method of regularization.

  7. We use the Q-matrix to transfer learned knowledge of the shortest path by pretraining a DQN agent to reproduce the Q-matrix.

  8. We also introduce a small amount of noise in the Q-matrix, to avoid stagnation of the DQN agent in a local optimum.

  9. We perform exhaustive comparisons with other state-of-the-art reinforcement learning algorithms like DQN, DDQN, Dueling DQN, VPG, PPO, SARSA, A2C and ACKTR.

  10. We test our model on a large and complex real world scenario, the University of Agder building, which consists of 91 nodes and 8281 actions.

  11. We propose a new and simple way to deal with large discrete action spaces in our proposed environment, by employing an attention mechanism based technique.

The rest of the paper is organized as follows: Section 2 summarizes the RL concepts used in this paper. Section 3 gives a detailed explanation of the proposed Fire Emergency Evacuation System with each module of the system described in subsequent sub-sections. Section 4 reports our exhaustive experimental results. Section 5 presents the real world application of our model in a large and complex environment, and finally Section 6 concludes the paper.

II Preliminaries

Reinforcement Learning is a sub-field of Machine Learning which deals with learning to make appropriate decisions and take actions to achieve a goal. A Reinforcement Learning agent learns from direct interactions with an environment without requiring explicit supervision or a complete model of the environment. The agent interacts with the environment by performing actions. It receives feedback for its actions in terms of reward (or penalty) from the environment and observes changes in the environment as a result of the actions it performs. These observations are called states of the environment. The agent interacts with the environment at discrete time steps $t$: by performing an action $a_t$ in a state $s_t$ of the environment, it transitions to a new state $s_{t+1}$ (change in the environment) while receiving a reward $r_{t+1}$, with probability $p(s_{t+1}, r_{t+1} \mid s_t, a_t)$. The main aim of the agent is to maximize the cumulative reward over time through its choice of actions. A pictorial representation of the RL framework is shown in Fig. 1.
The subsequent subsections briefly present the concepts and methods used in this paper.

Fig. 1: Reinforcement Learning Framework (the figure is taken from [26])

II-A Markov Decision Process

The Reinforcement Learning framework is formalised by Markov Decision Processes (MDP), which are used to define the interaction between a learning agent and its environment in terms of states, actions, and rewards [27]. An MDP consists of a tuple $(\mathcal{S}, \mathcal{A}, P, R)$ [26], where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $P$ is the transition probability from one state to the next, $P: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0, 1]$, and $R$ is the reward function, $R: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$.
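When state space $\mathcal{S}$, action space $\mathcal{A}$ and rewards $\mathcal{R}$ consist of a finite number of elements, $s_{t+1}$ and $r_{t+1}$ have well-defined discrete probability distributions which depend only on the present state and action (Markov Property). This is represented as $p: \mathcal{S} \times \mathcal{R} \times \mathcal{S} \times \mathcal{A} \to [0, 1]$, where $p$ determines the dynamics of the Markov Decision Process and where:

$$p(s', r \mid s, a) = \Pr\{s_{t+1} = s', r_{t+1} = r \mid s_t = s, a_t = a\}, \qquad \sum_{s' \in \mathcal{S}} \sum_{r \in \mathcal{R}} p(s', r \mid s, a) = 1 \quad \forall s \in \mathcal{S}, a \in \mathcal{A} \tag{1}$$
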
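$p$ contains all the information about the MDP, so we can compute important aspects about the environment from $p$, like the state transition probability and the expected rewards for state-action pairs [26]:

$$p(s' \mid s, a) = \sum_{r \in \mathcal{R}} p(s', r \mid s, a) \tag{2}$$

$$r(s, a) = \mathbb{E}\left[ r_{t+1} \mid s_t = s, a_t = a \right] = \sum_{r \in \mathcal{R}} r \sum_{s' \in \mathcal{S}} p(s', r \mid s, a) \tag{3}$$
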
Equation 3 gives the immediate reward we expect to get when performing action $a$ from state $s$. The agent tries to select actions that maximize the sum of rewards it expects to achieve, as time goes to infinity. But, in a dynamic and/or continuous Markov Decision Process, the notion of discounted rewards is used [26]:

$$G_t = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \tag{4}$$

where $\gamma$ is the discount factor and is in the range $[0, 1]$. If $\gamma$ is near $0$, then the agent puts emphasis on rewards received in the near future and if $\gamma$ is near $1$, then the agent also cares about rewards in the distant future.
In order to maximize $G_t$, the agent picks an action $a_t$ when in a state $s_t$ according to a policy function $\pi$. A policy function is a probabilistic mapping from the state space to the action space, $\pi: \mathcal{S} \to \mathcal{A}$. The policy function outputs probabilities for taking each action in a given state, so it can also be denoted as $\pi(a \mid s)$.
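As a small illustration of the discounted return, the sketch below evaluates the discounted sum of rewards for a finite reward sequence. The function name `discounted_return` and the reward values are hypothetical and chosen only for this example; they do not come from our experiments.

```python
def discounted_return(rewards, gamma):
    """Return sum_k gamma**k * rewards[k] for a finite reward sequence."""
    g = 0.0
    # Accumulate backwards so that each step is g = r + gamma * g.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Hypothetical episode: three step penalties followed by a final bonus.
rewards = [-1.0, -1.0, -1.0, 10.0]
near_sighted = discounted_return(rewards, gamma=0.0)  # only the first reward counts
far_sighted = discounted_return(rewards, gamma=1.0)   # plain sum of all rewards
```

With a discount factor of 0 only the immediate reward survives, while a discount factor of 1 weighs the distant bonus as much as the first penalty.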

II-B Q-Learning

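Most of the Reinforcement Learning algorithms (value based) try to estimate the value function, which gives an estimate of how good a state is for the agent to reside in. This is estimated according to the expected reward of a state under a policy and is denoted as

$$V^{\pi}(s) = \mathbb{E}_{\pi}\left[ G_t \mid s_t = s \right] \tag{5}$$

where $G_t$ is the discounted sum of rewards from time step $t$ onwards.
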
Q-learning is a value based Reinforcement Learning algorithm that tries to maximize the $Q$ function [28]. The $Q$ function is a state-action value function and is denoted by $Q(s, a)$. It tries to maximize the expected reward given a state and the action performed in that state:

$$Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[ G_t \mid s_t = s, a_t = a \right] \tag{6}$$

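According to the Bellman Optimality equation [26], the optimal $Q$ function can be obtained by:

$$Q^{*}(s, a) = \mathbb{E}\left[ r_{t+1} + \gamma \max_{a'} Q^{*}(s_{t+1}, a') \mid s_t = s, a_t = a \right] \tag{7}$$

where $Q^{*}(s, a) = \max_{\pi} Q^{\pi}(s, a)$. Since $a^{*} = \arg\max_{a} Q^{*}(s, a)$ is the optimal action which results in maximum reward, the optimal policy is formed as $\pi^{*}(s) = \arg\max_{a} Q^{*}(s, a)$. This method, tabular style Q-learning, was proposed in [28]. The update rule for each time step of Q-learning is as follows:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right] \tag{8}$$

where $\alpha$ is the learning rate.
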
Q-learning is an incremental dynamic programming algorithm that determines the optimal policy in a step-by-step manner. At each step $t$, the agent performs the following operations:

  • Observes current state $s_t$.

  • Selects and performs an action $a_t$.

  • Observes the next state $s_{t+1}$.

  • Receives the reward $r_{t+1}$.

  • Updates the q-values $Q(s_t, a_t)$ using equation 8.

The value function $Q_t(s, a)$ converges to the optimal value $Q^{*}(s, a)$ as $t \to \infty$. Detailed convergence proof and analysis can be found in [28].
This tabular Q-learning method is used in our proposed approach to generate a Q-matrix for the shortest path to the exit based on the building model. In order to incorporate the shortest path information, this Q-matrix is used to pretrain the DQN models.
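The update step of tabular Q-learning can be sketched in a few lines. This is a minimal illustration, assuming the Q-matrix is stored as a list of lists indexed as `q[state][action]`; the helper name `q_update` and all numeric values are hypothetical.

```python
def q_update(q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """Apply one tabular Q-learning update in place and return the new value."""
    best_next = max(q[next_state])              # max over actions in the next state
    td_target = reward + gamma * best_next      # bootstrapped target
    q[state][action] += alpha * (td_target - q[state][action])
    return q[state][action]


# Toy Q-matrix with 2 states and 2 actions (hypothetical values).
q = [[0.0, 0.0], [0.0, 0.0]]
q[1][0] = 2.0
new_value = q_update(q, state=0, action=1, reward=1.0, next_state=1,
                     alpha=0.5, gamma=0.9)
```

Here the target is 1.0 + 0.9 * 2.0 = 2.8, and with a learning rate of 0.5 the entry moves halfway from 0 towards that target.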

II-C Deep Q Network

The tabular Q-learning approach works well for small environments, but becomes infeasible for complex environments with large multidimensional discrete or continuous state-action spaces. To deal with this problem, a parameterized version of the $Q$ function is used for approximation, $Q(s, a; \theta) \approx Q^{*}(s, a)$. This way of function approximation was first proposed in [29].
Deep Neural Networks (DNNs) have become the predominant method for approximating complex intractable functions. They have become the de facto method for various applications such as image processing and classification [30, 31, 32, 33, 34, 35, 36], speech recognition [37, 38, 39, 40, 41, 42, 43], and natural language processing [44, 45, 46, 47, 48]. DNNs have also been applied successfully to reinforcement learning problems, achieving noteworthy performance [49, 50].
The most noteworthy research in integrating deep neural networks and Q-learning in an end-to-end reinforcement learning fashion is the Deep Q-Network (DQN) [51, 52]. To deal with the curse of dimensionality, a neural network is used to approximate the parameterised Q-function $Q(s, a; \theta)$. The neural network takes a state as input and approximates Q-values for each action based on the input state. The parameters $\theta$ are updated and the Q-function is refined in every iteration through an appropriate optimizer like Stochastic Gradient Descent [53], RMSProp [54], Adagrad [55], Adam [56] etc. The neural network outputs q-values for each action for the input state and the action with the highest q-value is selected. (There is another, less frequently used DQN architecture that takes the state and action as input and returns its q-value as output.)

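The DQN can be trained by optimizing the following loss function:

$$L_i(\theta_i) = \mathbb{E}_{(s, a, r, s')}\left[ \left( r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}) - Q(s, a; \theta_i) \right)^2 \right] \tag{9}$$
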
where $\gamma$ is the discount factor, and $\theta_i$ and $\theta_{i-1}$ are the Q-network parameters at iteration $i$ and $i-1$ respectively. In order to train the Q-network, we require a target to calculate the loss and optimize the parameters. The target q-values are obtained by holding the parameters $\theta_{i-1}$ fixed from the previous iteration:

$$y = r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}) \tag{10}$$

Here $y$ is the target for the next iteration to refine the Q-network. Unlike supervised learning where the optimal target values are known and fixed prior to learning, in DQN the approximate target values $y$, which depend on network parameters, are used to train the Q-network. The loss function can be rewritten as:

$$L_i(\theta_i) = \mathbb{E}_{(s, a, r, s')}\left[ \left( y - Q(s, a; \theta_i) \right)^2 \right] \tag{11}$$

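The process of optimizing the loss function $L_i(\theta_i)$ at the $i^{th}$ iteration by holding the parameters from the previous iteration $\theta_{i-1}$ fixed, to get target values, results in a sequence of well-defined optimization steps. By differentiating the loss function in equation 11 with the target $y$ of equation 10 held fixed, we get the following gradient:

$$\nabla_{\theta_i} L_i(\theta_i) = -2\, \mathbb{E}_{(s, a, r, s')}\left[ \left( y - Q(s, a; \theta_i) \right) \nabla_{\theta_i} Q(s, a; \theta_i) \right] \tag{12}$$
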
Instead of computing the full expectation of the above gradient, we optimize the loss function using an appropriate optimizer (in this paper we use the Adam optimizer [56]). The DQN is a model-free algorithm since it directly solves tasks without explicitly estimating the environment dynamics. Also, DQN is an off-policy method as it learns a greedy policy $a = \arg\max_{a'} Q(s, a'; \theta)$, while following an $\epsilon$-greedy policy for sufficient exploration of the state space. One of the drawbacks of using a nonlinear function approximator like a neural network is that it tends to diverge and is quite unstable for reinforcement learning. The instability arises mostly from three sources: correlations between subsequent observations, the fact that small changes in q-values can significantly change the policy, and correlations between q-values and target values.
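The behaviour policy mentioned above can be sketched as follows. This is a minimal illustration of $\epsilon$-greedy action selection over a list of q-values; the function name `epsilon_greedy` and the `rng` argument are hypothetical choices made for this example.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                       # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])   # exploit


# Hypothetical q-values for three actions in some state.
greedy_action = epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0)
```

Setting `epsilon` to 0 recovers the greedy policy that DQN learns, while larger values trade exploitation for exploration.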
The most well-known and simple technique to alleviate the problem of instability is experience replay [57]. At each time-step, a tuple consisting of the agent’s experience $e_t = (s_t, a_t, r_{t+1}, s_{t+1})$ is stored in a replay memory over many episodes. A minibatch of these tuples is randomly drawn from the replay memory to update the DQN parameters. This ensures that the network isn’t trained on a sequence of consecutive observations (avoiding strong correlations between samples and reducing variance between updates) and it increases sample efficiency. This technique greatly increases the stability of DQN.
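A replay memory of this kind can be sketched with the Python standard library. The class name `ReplayMemory`, the fixed capacity and the extra `done` flag are assumptions made for this illustration and are not taken from [57] or from our implementation.

```python
import random
from collections import deque


class ReplayMemory:
    """Fixed-size store of experience tuples with uniform random sampling."""

    def __init__(self, capacity, seed=None):
        self.buffer = deque(maxlen=capacity)  # oldest experiences are dropped first
        self.rng = random.Random(seed)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks temporal correlations.
        return self.rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


memory = ReplayMemory(capacity=3, seed=0)
for step in range(5):
    memory.push(step, 0, -1.0, step + 1, False)
batch = memory.sample(2)
```

Because the capacity is 3, only the three most recent experiences remain available for sampling after five pushes.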

II-D Double DQN

Q-learning and DQN are capable of achieving performance beyond the human level on many occasions. However, in some cases Q-learning performs poorly and so does its deep neural network counterpart DQN. The main reason behind such poor performance is that Q-learning tends to overestimate action values. These overestimations are caused by a positive bias that results from the $\max$ function in Q-learning and DQN updates, which outputs the maximum action value as an approximation of the maximum expected action value.
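The Double Q-learning method was proposed in [58] to alleviate this problem and later extended to DQN [59] to produce the Double DQN (DDQN) method. Since Q-learning uses the same estimator to select and evaluate an action, which results in overoptimistic action values, we can interpret it as a single estimator. In Double Q-learning, the tasks of evaluation and selection are decoupled by using a double estimator approach consisting of two functions: $Q^{A}$ and $Q^{B}$. The $Q^{A}$ function is updated with a value from the $Q^{B}$ function for the next state and the $Q^{B}$ function is updated with a value from the $Q^{A}$ function for the next state.

$$Q^{A}(s, a) \leftarrow Q^{A}(s, a) + \alpha \left[ r + \gamma\, Q^{B}(s', a^{*}) - Q^{A}(s, a) \right] \tag{13}$$

$$Q^{B}(s, a) \leftarrow Q^{B}(s, a) + \alpha \left[ r + \gamma\, Q^{A}(s', b^{*}) - Q^{B}(s, a) \right] \tag{14}$$

$$a^{*} = \arg\max_{a} Q^{A}(s', a) \tag{15}$$

$$b^{*} = \arg\max_{a} Q^{B}(s', a) \tag{16}$$

where $a^{*}$ is the action with the maximum q-value in state $s'$ according to the $Q^{A}$ function and $b^{*}$ is the action with the maximum q-value in state $s'$ according to the $Q^{B}$ function.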
The double estimator technique is unbiased, which results in no overestimation of action values, since action evaluation and action selection are decoupled into two functions that use separate function estimates of action values. In fact, thorough analysis of Double Q-learning in [58] shows that it sometimes might underestimate action values.
The Double Q-learning algorithm was adapted for large state-action spaces in [59] by forming the Double DQN method in a similar way as DQN. The two Q-functions ($Q^{A}$ and $Q^{B}$) can be parameterised by two sets of weights $\theta$ and $\theta'$. At each step, one set of weights is used to update the greedy policy and the other to calculate its value. For Double DQN, equation 10 can be written as:

$$y = r + \gamma\, Q\left( s', \arg\max_{a'} Q(s', a'; \theta_i); \theta'_i \right) \tag{17}$$

The first set of weights $\theta$ is used to determine the greedy policy just like in DQN. But, in Double DQN, the second set of weights $\theta'$ is used for an unbiased value estimation of the policy. Both sets of weights can be updated symmetrically by switching between $\theta$ and $\theta'$.
The target value network in DQN can be used as the second Q-function instead of introducing an additional network. So, the weights $\theta_i$ at the $i^{th}$ iteration are used to evaluate the greedy policy and the weights $\theta_{i-1}$ at the previous iteration to estimate its value. The update rule remains the same as DQN, while changing the target as:

$$y = r + \gamma\, Q\left( s', \arg\max_{a'} Q(s', a'; \theta_i); \theta_{i-1} \right) \tag{18}$$

Note that in both DQN and DDQN, the target network uses the parameters of the previous iteration $i-1$. However, to generalise, the target network can use parameters from any previous iteration. Then, the target network parameters are updated periodically with copies of the parameters of the online network.
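The difference between the DQN target and the Double DQN target can be made concrete with plain lists of q-values. In this sketch, `q_online` stands for the online network's outputs for the next state and `q_target` for the target network's outputs; both the function names and the numbers are hypothetical.

```python
def dqn_target(reward, gamma, q_next_target):
    """DQN: the target network both selects and evaluates the next action."""
    return reward + gamma * max(q_next_target)


def double_dqn_target(reward, gamma, q_next_online, q_next_target):
    """Double DQN: the online network selects, the target network evaluates."""
    best = max(range(len(q_next_online)), key=lambda a: q_next_online[a])
    return reward + gamma * q_next_target[best]


q_online = [1.0, 3.0, 2.0]  # hypothetical online-network q-values for the next state
q_target = [4.0, 0.5, 1.0]  # hypothetical target-network q-values for the next state
y_dqn = dqn_target(1.0, 0.5, q_target)
y_ddqn = double_dqn_target(1.0, 0.5, q_online, q_target)
```

In this example the online network prefers action 1, so the Double DQN target uses the target network's value for action 1 rather than the largest target value, which yields a smaller and less optimistic target.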

II-E Dueling DQN

In quite a few RL applications, it is unnecessary to estimate the value of each action. In many states, the choice of action has no consequence on the outcome. A new architecture for model-free Reinforcement Learning, called the dueling architecture, is proposed in [60]. The dueling architecture explicitly separates state values and action advantage values into two streams which share a common feature extraction backbone neural network. The architecture is similar to that of the DQN and DDQN architectures; the difference being that instead of a single stream of fully connected layers, there are two streams providing estimates of the value and state-dependent advantage functions. The two streams are combined at the end producing a single Q-function.

One stream outputs a scalar state value, while the other outputs an advantage vector having dimensionality equal to the number of actions. Both the streams are combined at the end to produce the Q-function estimate. The combining module at the end can simply aggregate the value and advantage estimates as:

$$Q(s, a; \theta, \alpha, \beta) = V(s; \theta, \beta) + A(s, a; \theta, \alpha) \tag{19}$$

where $\theta$ are the parameters of the lower layers of the neural network (before the streams are split), and $\alpha$ and $\beta$ are the parameters of the advantage and value function streams. However, such an aggregation of streams would require $V(s; \theta, \beta)$ to be replicated as many times as the dimensionality of $A(s, a; \theta, \alpha)$. Also, value and advantage estimates cannot be uniquely recovered given the estimated Q-function.
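One way of addressing these issues, proposed in [60], is to force the advantage function estimator to have zero value at the selected action. This aggregation is implemented in the combining module as:

$$Q(s, a; \theta, \alpha, \beta) = V(s; \theta, \beta) + \left( A(s, a; \theta, \alpha) - \max_{a'} A(s, a'; \theta, \alpha) \right) \tag{20}$$
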
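Now, for a chosen action (action with max Q-function), $a^{*} = \arg\max_{a'} Q(s, a'; \theta, \alpha, \beta) = \arg\max_{a'} A(s, a'; \theta, \alpha)$, putting $a = a^{*}$ in equation 20, we get $Q(s, a^{*}; \theta, \alpha, \beta) = V(s; \theta, \beta)$. Hence, the two streams can be uniquely recovered.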
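In [60], another way of aggregation is proposed which eliminates the $\max$ operator.

$$Q(s, a; \theta, \alpha, \beta) = V(s; \theta, \beta) + \left( A(s, a; \theta, \alpha) - \frac{1}{|\mathcal{A}|} \sum_{a'} A(s, a'; \theta, \alpha) \right) \tag{21}$$
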
where $|\mathcal{A}|$ is the number of actions. Even though value and advantage estimates are now off-target by a constant, this way of aggregation improves stability by capping the changes in the advantage estimates by their mean and enhances overall performance.
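The two aggregation schemes can be compared numerically. The sketch below takes a scalar state value and a list of advantages and combines them with either the max-based or the mean-based rule; the function names and the sample numbers are hypothetical.

```python
def dueling_q_max(value, advantages):
    """Max aggregation: Q = V + (A - max A), so the best action has Q = V."""
    top = max(advantages)
    return [value + (a - top) for a in advantages]


def dueling_q_mean(value, advantages):
    """Mean aggregation: Q = V + (A - mean A), so the q-values average to V."""
    mean = sum(advantages) / len(advantages)
    return [value + (a - mean) for a in advantages]


v = 2.0                 # hypothetical output of the value stream
adv = [1.0, 4.0, -2.0]  # hypothetical output of the advantage stream
q_max = dueling_q_max(v, adv)
q_mean = dueling_q_mean(v, adv)
```

Both rules preserve the ranking of actions, so the greedy action is the same in either case; they differ only in the constant offset applied to all q-values.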
In this paper, we use the above-mentioned off-policy, model-free algorithms on our novel fire evacuation environment and significantly improve performance for each of them by transferring tabular Q-learning knowledge of the building structure into these methods.

III Fire Emergency Evacuation System

In this paper, we propose the first fire evacuation environment to train reinforcement learning agents and a new transfer learning based tabular Q-learning+DQN method that outperforms state-of-the-art RL agents on the proposed environment. The fire evacuation environment consists of realistic dynamics that simulate real-world fire scenarios. For such a complex environment, an out-of-the-box RL agent doesn’t suffice. We incorporate crucial information in the agent before training it, like the shortest path to the exit from each room. The rest of the section explains the entire system in detail.

III-A The Fire Evacuation Environment

We propose the first benchmark environment for fire evacuation to train reinforcement learning agents. To the best of our knowledge, this is the first environment of its kind. The environment has been specifically designed to simulate realistic fire dynamics and scenarios that frequently arise in real world fire emergencies. We have implemented the environment in the OpenAI gym format [11], to facilitate further research.
The environment has a graph based structure to represent a building model. Let $G = (V, E)$ be an undirected graph, such that $V$ is a set of vertices that represents rooms and hallways and $E$ is a set of edges that represents paths connecting the rooms and hallways. A simple fire evacuation environment consisting of 5 rooms and paths connecting these rooms is shown in Fig. 2.

Fig. 2: A Simple Fire Evacuation Environment

The red vertex indicates fire in that room and the green vertex is the exit. The orange arrows show the fire spread direction (more towards 1 compared to 3).

To represent the graph consisting of rooms, hallways and connecting paths, we use the adjacency matrix. It is a square matrix consisting of elements that indicate whether a pair of vertices is connected by an edge or not. The adjacency matrix is used to represent the structure of the graph and check the validity of actions performed by the agent. The adjacency matrix for the building model in Fig. 2 is given by:

The environment dynamics are defined as follows:


State

Each vertex of the graph represents a room and each room is associated with an integer, which is the number of people in that room. The state of the environment is given by a vector consisting of the number of people in each room. To force the RL agent to learn the environment dynamics by itself, the environment doesn’t provide any other feedback to the agent apart from the state (number of people left in each room) and the reward.


Action

An agent performs an action by moving a person from one room to the other and the state is updated after every valid action. Therefore, the action space is discrete. To keep things simple, we restrict the agent to move one person from one room at a time step. The agent can move a person from any room to any other room at any time step, even if the rooms aren’t connected to each other by a path. So, the number of possible actions at each step is $n^2$, where $n$ is the number of rooms.
This action space is necessary so that the agent can easily generalize to any graph structure. Also, this enables the agent to directly select which room to take people from and which room to send people to, instead of going through each room in a serial manner or assigning priorities.
When the agent selects an action where there is no path between the rooms, the agent is heavily penalized. Due to this unrestricted action space and penalization, the agent is able to learn the graph structure (building model) with sufficient training and only performs valid actions at the end. The adjacency matrix is used to check the validity of actions.
Note that our graph based fire evacuation environment has $n^2$ possible actions (even though many of them are illegal moves and incur huge penalties), where $n$ is the number of rooms. Even for a small toy example of $n = 5$ rooms, the total number of possible actions is $25$, which is more than in almost all of the OpenAI gym environments and Atari game environments [11].
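To make the action space concrete, the sketch below enumerates all (from room, to room) pairs of a small graph and uses the adjacency matrix to separate valid moves from illegal ones. The 5-room graph here is hypothetical and is not the building of Fig. 2; the flat index encoding `src * n + dst` is likewise an assumption made for this illustration.

```python
# Hypothetical 5-room building (NOT the graph of Fig. 2).
# Undirected edges: 0-1, 1-2, 1-3, 2-4, 3-4.
ADJACENCY = [
    [0, 1, 0, 0, 0],
    [1, 0, 1, 1, 0],
    [0, 1, 0, 0, 1],
    [0, 1, 0, 0, 1],
    [0, 0, 1, 1, 0],
]
N = len(ADJACENCY)


def encode_action(src, dst, n=N):
    """Map a (from_room, to_room) pair to a flat action index."""
    return src * n + dst


def decode_action(index, n=N):
    """Map a flat action index in [0, n*n) back to (from_room, to_room)."""
    return divmod(index, n)


def is_valid(action_index):
    """An action is valid only if a path connects the two rooms."""
    src, dst = decode_action(action_index)
    return ADJACENCY[src][dst] == 1


num_actions = N * N
valid_actions = [a for a in range(num_actions) if is_valid(a)]
```

For this graph only 10 of the 25 actions are valid, so most of the action space consists of illegal moves that the agent has to learn to avoid.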


Reward

We design a reward function specifically suited for our environment. We use an exponential decay function to reward/penalize the agent depending on the action it takes and to simulate fire spread as well. The reward function looks like this:


where $t$ is the time step, $j$ is the room where a person is moved to and $d(\cdot)$ is the degree of fire spread for a room. $d(\cdot)$ returns a positive number, and if a room has a higher degree of fire spread, that means that fire is spreading more rapidly towards that room. We explicitly assign degrees to each room using a degree vector $D$, where the maximum value belongs to the room where the fire is located.
Using such a reward function ensures the following: Firstly, the reward values drop exponentially every time step as the fire increases and spreads. Secondly, the reward of an action depends on the room where a person is moved to. The reward function will penalize an action more heavily if a person is moved to a more dangerous room (higher degree of fire spread towards that room). This is because the function yields more rapidly decaying negative rewards. Lastly, the function yields a negative reward for every action, which forces the agent to seek the least number of time-steps. The reward for reaching the exit is a positive constant.
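The three properties listed above can be checked on a simple candidate function. The form used below, the negative of the degree raised to the power of the time step, is only an assumption that is consistent with the description; it should not be read as the exact reward function of the environment. The function name `fire_reward` and the degree values (both greater than 1) are hypothetical.

```python
def fire_reward(degree, t):
    """Assumed form (not confirmed by the text): r = -(degree ** t)."""
    return -(degree ** t)


# Hypothetical degrees of fire spread for a safer and a more dangerous room.
safe = [fire_reward(1.1, t) for t in range(1, 6)]
risky = [fire_reward(1.5, t) for t in range(1, 6)]
```

Under this assumed form every reward is negative, the penalty grows with each time step, and the room with the higher degree is penalized more heavily at every step.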

Fire Location(s) and Exit(s)

The room where the fire occurs is given the highest degree, hence the maximum penalty for entering. The direction of fire spread is randomly decided and the degrees are assigned accordingly. The degrees are updated gradually to simulate fire spread.


where $\delta_j$ is a small number associated with $d(j)$. $\delta_j$ is assigned to each room according to the fire spread direction. So, $\delta_j$ can be used to determine the fire spread direction, since a higher value of $\delta_j$ for a room means that fire is spreading towards that room more rapidly.
As shown in Fig. 2, the fire spread is randomly and independently decided for all rooms. The exit is also treated like a room. The only difference is that the agent gets a positive reward for moving people to the exit. The number of people at the exit is reset to zero after every action. The rooms which are exits are stored in a vector.


Bottleneck

One of the most important features of our proposed fire evacuation environment that enhances realism is the bottlenecks in rooms. We put an upper limit on the number of people that can be in a room at a time step. This restriction ensures congestion control during evacuation, which has been a huge problem in emergency situations. The bottleneck information is not explicitly provided to the agent; instead, the agent learns about this restriction during training, since a negative reward is received by the agent if the number of people in a room exceeds the bottleneck value. The bottleneck is set to 10 in our experiments.


To take into account uncertain behaviour of the crowd and introduce stochasticity in the environment, a person moves from one room to the other with probability . This means that an action , selected by the agent at time-step , is performed with probability or ignored with probability . If the action is ignored, then there is no change in the state, but the reward received by the agent is as if the action was performed. This acts like a regularizing parameter and due to this, the agent is never able to converge to the actual global minimum. In our experiments, the uncertainty probability is kept at .

Terminal Condition

The terminal/goal is reached once there are no people in any of the rooms [].
The pseudocode for the proposed environment is given in Algorithm 1.

Environment variables: , , , ,
while not Terminal do
       if  then
       end if
             if  then
             end if
            else if  and in  then
             end if
            else if  then
             end if
            else if  then
             end if
             end if
       end if
      Update according to
end while
Algorithm 1 Fire Evacuation Environment Pseudocode

From Algorithm 1, we can see that a heavier penalty is received by the agent for an illegal move compared to bottleneck restriction violation and moving towards fire. In a way, rewards are used to assign priorities to scenarios. These priorities can easily be changed if needed.

Pretraining Environment

We create two instances of our environment: one for fire evacuation and the other for shortest path pretraining. For the pretraining instance, we consider only the graph structure and the aim is to get to the exit from every room in the minimum number of time-steps.
The pretraining environment consists of the graph structure only, i.e. the adjacency matrix . The pretraining environment doesn’t contain fire, the number of people in each room or bottlenecks. The rewards are static integers: -1 for every path to force the agent to take minimum time-steps, -10 for illegal actions (where there is no path) and +1 for reaching the exit. The agent is thus trained to incorporate shortest path information of the building model.
The pseudocode for the pretraining environment is given in Algorithm 2.

Environment variables: ,
while not Exit do
       if  and in  then
       end if
      else if  then
       end if
       end if
end while
Algorithm 2 Shortest Path Pretraining Environment Pseudocode

The procedure is repeated until the agent converges to the shortest path from any room to the exit.
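The pretraining instance described above can be sketched in a few lines of Python. This is a minimal sketch, not the authors' implementation: the class name `ShortestPathEnv`, the dictionary-based graph, the example layout `EXAMPLE_GRAPH` and the choice to represent the state as the agent's current room are assumptions made for illustration. Only the reward values (-1 per move, -10 for an illegal action, +1 for reaching the exit) come from the text.

```python
import random

# Hypothetical 5-room corridor; room 4 is the exit (not the layout of Fig. 2).
EXAMPLE_GRAPH = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}


class ShortestPathEnv:
    """Sketch of a shortest-path pretraining instance (graph structure only)."""

    def __init__(self, graph, exits, seed=None):
        self.graph = graph            # room -> list of directly connected rooms
        self.exits = set(exits)       # rooms that are exits
        self.rooms = sorted(graph)
        self.rng = random.Random(seed)
        self.state = None

    def reset(self):
        # Start every episode in a random non-exit room.
        starts = [room for room in self.rooms if room not in self.exits]
        self.state = self.rng.choice(starts)
        return self.state

    def step(self, target):
        # Illegal action: there is no path from the current room to the target.
        if target not in self.graph[self.state]:
            return self.state, -10, False
        self.state = target
        if target in self.exits:
            return self.state, 1, True    # reached the exit, episode ends
        return self.state, -1, False      # legal move costs one time-step
```

Because every legal move costs -1, the return of an episode is highest when the exit is reached in the fewest steps, which is what pushes the tabular agent towards the shortest path.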

III-B Similarities and Differences with Other Environments

The fire evacuation environment is implemented in the OpenAI gym format [11], to enable future research on the topic. OpenAI gym environments consist of four basic methods: init, step, render and reset. Our environment consists of the same four methods.
The init method consists of the initialization conditions of the environment. In our case, it contains the action space size, , the state space size, , the starting state which is an array consisting of the number of people in each room (vertex), the adjacency matrix of the graph based building model, and the fire location(s), . The reset method simply sets the environment back to the initial conditions.
The step method follows Algorithm 1 without the while loop. It takes the action performed at the current time-step as its argument and returns the next state, a boolean terminal flag (indicating whether the terminal state was reached with the action performed) and the reward for performing the action. The next state, reward and terminal flag depend on the conditions of the environment as shown in Algorithm 1. The render method simply returns the current state.
The pretraining environment instance has the same structure of methods. The only difference is in the step method, shown in Algorithm 2 excluding the while loop, where the reward system is changed with fewer conditions and the state is represented as the set of empty vertices (rooms with no people) of the graph.
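The four-method structure described above can be illustrated with the following skeleton. It is a rough sketch under several assumptions, not the published environment: the class name `FireEvacuationSketch`, the reward constants, the default action probability `p_move=0.9`, the decoding of an action index into a (from, to) room pair, moving one person per action and blocking a move into a full room are all hypothetical. The fire-dependent reward term is omitted entirely because its exact form is not recoverable from the text.

```python
import random


class FireEvacuationSketch:
    """Gym-style skeleton with init, reset, step and render."""

    # Placeholder rewards (assumptions); illegal moves are penalized most heavily.
    R_ILLEGAL, R_BOTTLENECK, R_MOVE, R_EXIT = -10.0, -5.0, -1.0, 1.0

    def __init__(self, graph, people, exits, bottleneck=10, p_move=0.9, seed=None):
        self.graph = graph                    # room -> list of connected rooms
        self.exits = set(exits)
        self.initial_people = list(people)    # starting state: people per room
        self.n = len(people)                  # n rooms give n * n actions
        self.bottleneck = bottleneck
        self.p_move = p_move                  # probability that an action is executed
        self.rng = random.Random(seed)
        self.reset()

    def reset(self):
        self.people = list(self.initial_people)
        return self.render()

    def render(self):
        return list(self.people)

    def step(self, action):
        src, dst = divmod(action, self.n)     # one action per (from, to) pair
        if dst not in self.graph[src] or self.people[src] == 0:
            reward = self.R_ILLEGAL
        elif dst not in self.exits and self.people[dst] >= self.bottleneck:
            reward = self.R_BOTTLENECK
        else:
            reward = self.R_EXIT if dst in self.exits else self.R_MOVE
            # With probability 1 - p_move the action is ignored: the state is
            # unchanged but the reward is given as if the move had happened.
            if self.rng.random() < self.p_move:
                self.people[src] -= 1
                if dst not in self.exits:     # the exit count stays at zero
                    self.people[dst] += 1
        done = sum(self.people) == 0          # terminal: every room is empty
        return self.render(), reward, done
```

Keeping illegal actions in the action space, rather than masking them, mirrors the design choice stated in the text: the agent has to learn the building structure from the penalties.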
Even though our environment might have the same structure as any OpenAI gym environment, it differs a lot in functionality from other environments or any game-based environments. In some ways, it might look like the mouse maze game in which the player (mouse) needs to reach the goal (cheese) in the least possible steps through a maze. But, it is drastically different in many ways:

  • Our environment is a graph based environment with much less connectivity than the maze environment, which makes finding the optimal route difficult.

  • The optimal path(s) might change dynamically from one episode to the next or within a few time-steps due to fire spread and uncertainty in the fire evacuation environment, while the optimal path(s) for the mouse maze game remains the same.

  • All the people in all the rooms must be evacuated to the nearest exit in the minimum number of time-steps under dynamic and uncertain conditions with bottlenecks, whereas in the mouse maze environment an optimal path only from the starting point to the goal needs to be found.

  • The fire evacuation problem is a problem in which multiple optimal paths for all people in all rooms must be found while avoiding penalizing conditions like fire, bottlenecks and fire spread, whereas the mouse maze problem is a simple point-to-point problem.

  • The mouse maze environment is static and lacks any variations, uncertainty or dynamics. On the other hand, the fire evacuation environment is dynamic, variable and uncertain.

  • In the maze environment, the shortest path to the goal state is always the best. But, in the fire evacuation environment, even though the DQN agent is pretrained on the shortest path information, the shortest path to the exit might not be the best due to fire, fire spread and bottlenecks.

  • The fire evacuation environment has a much larger action space than the maze environment (four actions: up, down, left, right) because all actions can be performed even if they are illegal (which will yield high penalties) to make the RL agent learn the building structure (graph model).

  • Finally, a graph is a much better way to model a building’s structure than a maze, since connectivity can be better described with a graph rather than a maze. It’s what graphs were made for, to depict the relationships (connections) between entities (vertices).

Hence, the fire evacuation problem is a much more complex and dramatically different problem than the mouse maze problem or any other game based problem. Even the Go game has 19 × 19, i.e., 361 possible actions, whereas the fire evacuation environment has n² possible actions for n rooms, i.e., as the number of rooms increases, the number of possible actions grows quadratically (although the Go game rules are quite complex to interpret by an RL agent).

III-C Q-matrix Pretrained Deep Q-Networks

For the proposed graph based fire evacuation environment, we also present a new reinforcement learning technique based on the combination of Q-learning and DQN (and its variants). We apply tabular Q-learning to the simpler pretraining environment, with a small state space, to learn the shortest paths from each room to the nearest exit. The output of this stage is a Q-matrix which contains q-values for state-action pairs according to the shortest path.
This Q-matrix is used to transfer the shortest path information to the DQN agent(s). This is done by pretraining the agent’s neural network by deliberately overfitting it to the Q-matrix. After pretraining, the neural network weights have the shortest path information incorporated in them. Now, the agent is trained on the complete fire evacuation environment to learn to produce the optimal evacuation plan.
The main purpose of using such a strategy of training an agent by pretraining it first is to provide the agent with vital information about the environment beforehand, so that it doesn’t have to learn all the complexities of the environment altogether. Since, after pretraining, the agent knows the shortest paths to the nearest exits in the building, dealing with other aspects of the environment like fire, fire spread, number of people and bottlenecks is made easier.
We provide two instances of our environment: a simpler shortest path pretraining instance and a complex fire evacuation instance. First, the agent is pretrained on the simpler instance of the environment (for shortest path pretraining) and then trained on the more complex instance (for optimal evacuation). This approach of training the agent on a simpler version of the problem before training it on the actual complex problem is somewhat similar to curriculum learning [61].
We also add a small amount of noise or offset to the Q-matrix produced by training on the pretraining environment instance. This is done by adding or subtracting (depending on the q-value) a small to each element of the Q-matrix.

where, can be thought of as a regularization parameter, which is set to in our experiments. Adding noise to the Q-matrix is necessary because we don’t want the DQN agent to just memorize all the paths and get stuck at a local minimum. The actual fire evacuation instance is complex, dynamic and has uncertainty, which means that an optimal path at one time-step might not be the optimal path at the next time-step. The hyperparameter acts as a regularizer.
Note that we add if the element of the Q-matrix is negative or zero and subtract if the element is positive. This is done to offset the imbalance between good and bad actions. If we just add or subtract then the relative difference between q-values would remain the same. Conditional addition or subtraction keeps the DQN agent from being biased towards a particular set of actions leading to an exit.
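The conditional offset can be written as a single vectorized operation. The snippet below is a small sketch: the function name `offset_q_matrix`, the symbol name `sigma` and the value 0.1 used in the example are assumptions, since the value used in the experiments is not given here.

```python
import numpy as np


def offset_q_matrix(q_matrix, sigma):
    """Add sigma to entries that are <= 0 and subtract it from positive entries."""
    q = np.asarray(q_matrix, dtype=float)
    return np.where(q <= 0, q + sigma, q - sigma)


# Example with an illustrative offset of 0.1 (hypothetical value).
q_example = np.array([[-10.0, 1.0],
                      [-1.0, 0.0]])
q_noisy = offset_q_matrix(q_example, sigma=0.1)
```

Every positive entry moves down and every non-positive entry moves up, so the gap between a positive and a non-positive q-value shrinks by twice the offset, whereas a uniform shift would leave all gaps unchanged.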
Even though pretraining adds some overhead to the system, there are several advantages including:

Better Conditioning

Pretraining provides the neural network with a better starting position of weights for training compared to random initializations.

Faster Convergence

Since the neural network weights are better conditioned due to pretraining, training starts closer to the optimum and hence rate of convergence is faster.

Crucial Information

Especially in the case of fire evacuation, pretraining with shortest path information provides the agent with crucial information about the environment before training begins.

Increased Stability

As pretraining restricts the weights to a better basin of attraction in the parameter space, the probability of divergence is reduced, which makes the model more stable.

Fewer Updates

As the weights start near the optimum on the error surface, the number of updates required to reach the optimum is lower, which results in fewer memory updates and fewer training epochs.

The pseudocode for the proposed Q-matrix pretrained DQN algorithm is given in Algorithm 3.

Environment instances: ,
Environment variables: , , , , ,
       for  to  do
             while not terminal do
                   if  then
                   end if
                   end if
                   Update using eq 8;
             end while
       end for
End Function
       for  to  do
       end for
End Function
       for  to  do
             while not terminal do
                   Train the DQNAgent by:
                     Calculate using eq. 11;
                     Calculate using eq. 12;
                     Update weights: ;
             end while
       end for
End Function
Algorithm 3 Q-Matrix Pretrained DQN

Algorithm 3 consists of three functions: one for tabular Q-learning on the pretraining environment instance, which finds the optimal q-values for the shortest path from each room to the nearest exit; one for overfitting the network to the shortest path Q-matrix, which incorporates this information in the DQN Agent’s network; and one for training the pretrained DQN Agent on the fire evacuation environment to learn the optimal evacuation plan.
Modifying the final training part to include Double DQN and Dueling DQN Agents is straightforward.
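The second stage, overfitting a network to a fixed Q-matrix, can be illustrated with a small NumPy example. This is a sketch under stated assumptions rather than the authors' code: it uses a single hidden layer instead of the network in Table 1, one-hot state encodings as inputs, plain gradient descent on a mean-squared error instead of Adam, and hypothetical names and hyperparameters (`overfit_to_q_matrix`, `hidden`, `epochs`, `lr`).

```python
import numpy as np


def overfit_to_q_matrix(q_matrix, hidden=32, epochs=3000, lr=0.02, seed=0):
    """Fit a small ReLU network so that its outputs reproduce the Q-matrix rows."""
    q_matrix = np.asarray(q_matrix, dtype=float)
    rng = np.random.default_rng(seed)
    n_states, n_actions = q_matrix.shape
    x = np.eye(n_states)                      # one-hot state inputs (assumption)
    w1 = rng.normal(0.0, 0.1, (n_states, hidden))
    b1 = np.zeros(hidden)
    w2 = rng.normal(0.0, 0.1, (hidden, n_actions))
    b2 = np.zeros(n_actions)
    losses = []
    for _ in range(epochs):
        h_pre = x @ w1 + b1
        h = np.maximum(h_pre, 0.0)            # ReLU hidden layer
        pred = h @ w2 + b2                    # linear output: one q-value per action
        err = pred - q_matrix
        losses.append(float(np.mean(err ** 2)))
        # Backpropagate the mean-squared error through both layers.
        g_pred = 2.0 * err / err.size
        g_w2 = h.T @ g_pred
        g_b2 = g_pred.sum(axis=0)
        g_h = g_pred @ w2.T
        g_h[h_pre <= 0] = 0.0
        g_w1 = x.T @ g_h
        g_b1 = g_h.sum(axis=0)
        w1 -= lr * g_w1
        b1 -= lr * g_b1
        w2 -= lr * g_w2
        b2 -= lr * g_b2

    def predict(states):
        hidden_out = np.maximum(np.asarray(states, dtype=float) @ w1 + b1, 0.0)
        return hidden_out @ w2 + b2

    return predict, losses
```

No validation split or early stopping is used: the training set is the whole Q-matrix and the goal is to reproduce it as closely as possible, which is what "deliberately overfitting" means here. The resulting weights would then serve as the starting point for training on the fire evacuation instance.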

III-D Pretraining Convergence

The paper [25] thoroughly analyses and proves conditions where task transfer Q-learning will work. We use the proved propositions and theorems from [25] to show that pretraining works in our case.
Let the pretraining instance and the fire evacuation instance be represented by two MDPs: for pretraining instance and for fire evacuation instance. So, according to proposition 1 in [25]:


where is the distance between MDPs and and are the corresponding optimal Q-functions. In our case, and since our environments are deterministic MDPs, i.e., taking an action at state will always lead to a specific next state and no other state, with probability . This makes the second and third term of equation 24 to zero. So, the distance between the two instances of our environment is reduced to the first term only.
So, according to proposition 1, if the distance between two MDPs is small, then the learned Q-function from the pretraining task is closer to the optimal of the fire evacuation task compared to random initializations and hence helps in convergence to an optimum and improves the speed of convergence.
Also, convergence is guaranteed according to theorem 4 in [25], if the safe condition is met:


where, is the Bellman error. In our case, is small and that multiplied by , which is less than since , is less than . For our case, this seems obvious since the two MDPs are instances of the same MDP. This means that convergence is guaranteed, as long as the shortest path Q-matrix obtained from the pretraining environment converges.
Now, to prove that our method has guaranteed convergence, we need to prove that the Q-matrix is able to capture the shortest path information accurately.

III-E Convergence Analysis of Q-learning for finding shortest path

The guarantee of convergence for Q-learning has been discussed and proved in many different ways and for general as well as unique settings [28, 62]. The convergence of Q-learning is guaranteed, while using the update rule given in equation 8, if the learning rate $\alpha_t$ is bounded, $0 \le \alpha_t < 1$, and the following conditions hold:

$$\sum_{t=1}^{\infty} \alpha_t = \infty, \qquad \sum_{t=1}^{\infty} \alpha_t^2 < \infty$$

Then, as $t \to \infty$, $Q_t(s, a) \to Q^*(s, a)$ with probability 1. This means that for the learning rate conditions to hold with the constraint $0 \le \alpha_t < 1$, all state-action pairs must be visited an infinite number of times. Here, the only complication is that some state-action pairs might never be visited.
In our pretraining environment, which is an episodic task, we can make sure that all state-action pairs are visited by starting episodes at random start states, as shown in Algorithm 2. Apart from this, we use an ε-greedy exploration policy to explore all state-action pairs. The initial value of ε and the decay rate are set according to the size of the graph based environment.
We run Q-learning on the pretraining environment for episodes to ensure that the Q-matrix converges to . Since we have an action space of 25 actions for 5 rooms, running for more episodes is convenient. But, for large building models (8281 actions for the large real world building scenario, in Section 5), running for many episodes could become computationally too expensive. So, we use a type of early stopping criterion, in which we stop training the Q-matrix if there is a very small change in its elements from one episode to the next.
However, as we shall see in Section 5, we do not require early stopping at all. We are able to reduce the action space drastically and hence the Q-matrix can be trained in the same way as it was trained for smaller action spaces.
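The ingredients discussed in this subsection (random start states, ε-greedy exploration with decay, and an optional early stopping check on the change in the Q-matrix) can be combined as in the sketch below. The corridor graph, all hyperparameter values and the `patience` rule (stop after a run of consecutive episodes with negligible change) are assumptions chosen for illustration; the update inside the loop is the standard tabular Q-learning rule.

```python
import numpy as np

# Hypothetical 5-room corridor; room 4 is the exit (not the layout of Fig. 2).
GRAPH = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
EXIT = 4


def step(state, action):
    """Pretraining rewards: -10 illegal move, +1 reaching the exit, -1 otherwise."""
    if action not in GRAPH[state]:
        return state, -10.0, False
    if action == EXIT:
        return action, 1.0, True
    return action, -1.0, False


def train_q_matrix(episodes=2000, alpha=0.5, gamma=0.9, eps=1.0, eps_decay=0.999,
                   tol=1e-6, patience=50, max_steps=100, seed=0):
    rng = np.random.default_rng(seed)
    n = len(GRAPH)
    q = np.zeros((n, n))
    calm = 0                                   # consecutive near-unchanged episodes
    for episode in range(episodes):
        q_old = q.copy()
        state = int(rng.integers(n - 1))       # random non-exit start room (0..3)
        for _ in range(max_steps):
            if rng.random() < eps:             # explore
                action = int(rng.integers(n))
            else:                              # exploit
                action = int(np.argmax(q[state]))
            nxt, reward, done = step(state, action)
            target = reward if done else reward + gamma * np.max(q[nxt])
            q[state, action] += alpha * (target - q[state, action])
            state = nxt
            if done:
                break
        eps *= eps_decay
        # Early stopping: the Q-matrix barely changed for `patience` episodes in a row.
        calm = calm + 1 if np.max(np.abs(q - q_old)) < tol else 0
        if calm >= patience:
            break
    return q, episode + 1


def greedy_path(q, start, max_len=10):
    """Follow the greedy policy from `start` until the exit is reached."""
    path, state = [start], start
    while state != EXIT and len(path) < max_len:
        state, _, _ = step(state, int(np.argmax(q[state])))
        path.append(state)
    return path
```

Calling `train_q_matrix()` and then `greedy_path(q, 0)` is a quick way to check that the learned q-values encode the shortest route from the far end of the corridor to the exit.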
In [63], the proof of convergence of Q-learning is given for stochastic processes, but in our case, the environment is deterministic. Also, in [64], a more general convergence proof for Q-learning is provided using convergence properties of stochastic approximation algorithms and their asynchronous versions. The asymptotic bounds for the error has been shown to be bound by the number of visits to state-action pairs and :


where, and is the sampling probability of . So, it is necessary to run the Q-learning algorithm for as many episodes as possible. Hence, we devise a strategy to reduce the action space for large discrete action spaces, which result from real world building models, so that it becomes feasible to train the Q-matrix for a large number of episodes.

III-F Discussion on alternative Transfer Learning techniques

There are a few ways of pretraining an agent, some of which have been discussed and evaluated in [65]. A naive approach would be to preload the experience replay memory with demonstration data beforehand. This method, however, isn’t actually pretraining. The agent trains normally with the benefit of being able to learn good transitions immediately.
Our method of pretraining raises the question of why the agent’s network is not pretrained directly. Pretraining a DQN network’s weights on the pretraining environment would require more time compared to tabular Q-learning, since the DQN would take longer to converge. Also, the next step, in which the Q-matrix is used as a fixed target to overfit the network’s weights on the q-values, requires much less time. Also, for a smaller state space (like the pretraining environment) tabular Q-learning is much more efficient than DQN. The total time taken for pretraining using the proposed method is ( for tabular Q-learning and for overfitting the agent’s weights on the Q-matrix) compared to for pretraining DQN directly. This is because the direct pretraining method would effectively require the DQN to be trained twice overall (once on the pretraining environment and then on the fire evacuation environment), which is inefficient and computationally expensive.
Also, this cost will grow considerably when we train on a large real world building model, which is shown in Section 5. For an environment with 91 rooms and 8281 actions, training a DQN agent twice would be extremely inefficient and computationally infeasible, due to the size of the neural network, the computations required and the expense of backpropagation. In contrast, training the Q-matrix would only require computing equation 8.

One of the most successful algorithms in pretraining deep reinforcement learning is the Deep Q-learning from Demonstrations (DQfD) [66, 67]. It pretrains the agent using a combination of Temporal Difference (TD) and supervised losses on demonstration data in the replay memory. During training, the agent trains its network using prioritized replay mechanism between demonstration data and interactions with the environment to optimize a complex combination of four loss functions (Q-loss, n-step return, large margin classification loss and L2 regularization loss).
The DQfD uses a complex loss function and the drawback of using demonstration data is that it isn’t able to capture the complete dynamics of the environment as it covers a very small part of the state space. Also, prioritized replay adds more overhead. Our approach is far simpler and because we create a separate pretraining instance to incorporate essential information about the environment instead of the full environment dynamics, it is more efficient than demonstration data.

IV Experiments and Results

We perform unbiased experiments on the fire evacuation environment and compare our proposed approach with state-of-the-art reinforcement learning algorithms. We test different configurations of hyperparameters and show the results with the best performing hyperparameters for these algorithms on our environment. The main intuition behind using the Q-learning pretrained DQN model was to provide it with important information beforehand, to increase stability and convergence. The results confirm our intuition empirically.

The Agent’s Network

Unlike the convolutional neural networks [31] used in DQN [51, 52], DDQN [58, 59] and Dueling DQN [60], we implement a fully connected feedforward neural network. The network configuration is given in Table 1. The network consists of 5 layers. The ReLU function [68] is used for all layers, except the output layer, where a linear activation is used to produce the output.


The environment given in Fig. 2 is used for all unbiased comparisons. The state of the environment is given as : with bottleneck . All rooms contain 10 people (the exit is empty), which is the maximum possible number of people. We do this to test the agents under maximum stress. The fire starts in room 2 and the fire spread is more towards room 1 than room 3 (as shown in Fig. 2 with orange arrows). Room 4 is the exit. The total number of actions possible for this environment is 25. So, the agent has to pick one out of 25 actions at each step.


The Adam optimizer [56] with default parameters and a learning rate of is used for training for all the agents. Each agent is trained for 500 episodes. Training was performed on a 4GB NVIDIA GTX 1050Ti GPU. The models were developed in Python with the help of Tensorflow and Keras.


Initially, the graph connections were represented as 2D arrays of the adjacency matrix . But, when the building model’s graphs get bigger, the adjacency matrices become more and more sparse, which makes the 2D array representation inefficient. So, the most efficient and easiest way to implement a graph is as a dictionary, where the keys represent rooms and their values are an array that lists all the rooms that are connected to it.
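The switch from an adjacency matrix to a dictionary of neighbour lists can be illustrated as follows. The helper name `adjacency_to_dict` and the 5-room example matrix are assumptions for illustration and do not reproduce the building in Fig. 2.

```python
import numpy as np


def adjacency_to_dict(adjacency):
    """Convert a square 0/1 adjacency matrix into {room: [connected rooms]}."""
    adjacency = np.asarray(adjacency)
    return {room: np.flatnonzero(adjacency[room]).tolist()
            for room in range(adjacency.shape[0])}


# Hypothetical 5-room corridor: each room is connected to its neighbours only.
ADJACENCY = [
    [0, 1, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [0, 0, 0, 1, 0],
]
graph = adjacency_to_dict(ADJACENCY)
```

The matrix stores an entry for every pair of rooms, whereas the dictionary stores one entry per connection (in each direction), which is why it scales better for large, sparsely connected buildings.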

Comparison Graphs

The comparison graphs shown from Fig. 3 to Fig. 10 have the total number of time-steps required for complete evacuation for an episode on the y-axis and the number of episodes on the x-axis. The comparisons shown in the graphs are different runs of our proposed agents with exactly the same environment settings used for all the other agents as well.

Type Size Activation
Dense 128 ReLU
Dense 256 ReLU
Dense 256 ReLU
Dense 256 ReLU
Dense No. of actions Linear
TABLE I: Network Configuration
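To make the configuration in Table 1 concrete, the forward pass of such a network can be sketched in NumPy. The layer sizes and activations follow the table; the function names, the He-style weight initialization and the use of the per-room people counts as the input vector are assumptions for illustration.

```python
import numpy as np

HIDDEN_SIZES = [128, 256, 256, 256]   # the four ReLU layers listed in Table 1


def init_network(n_inputs, n_actions, seed=0):
    """Create (weights, bias) pairs for the four hidden layers and the output layer."""
    rng = np.random.default_rng(seed)
    sizes = [n_inputs, *HIDDEN_SIZES, n_actions]
    return [(rng.normal(0.0, np.sqrt(2.0 / fan_in), (fan_in, fan_out)),
             np.zeros(fan_out))
            for fan_in, fan_out in zip(sizes[:-1], sizes[1:])]


def q_values(params, state):
    """Forward pass: ReLU on every hidden layer, linear activation on the output."""
    x = np.asarray(state, dtype=float)
    for weights, bias in params[:-1]:
        x = np.maximum(x @ weights + bias, 0.0)
    weights, bias = params[-1]
    return x @ weights + bias


# 5 rooms give 25 actions; the state lists the number of people in each room.
params = init_network(n_inputs=5, n_actions=25)
q = q_values(params, [10, 10, 10, 10, 0])
greedy_action = int(np.argmax(q))
```

The output layer has one unit per action, so greedy action selection is simply the index of the largest output.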

We first compare the Q-matrix pretrained versions of the DQN and its variants with the original models. The comparison graphs plot the number of time-steps for evacuating all people on the y-axis against the episode number on the x-axis. We put an upper limit of 1000 time-steps on an episode for computational reasons. The training loop breaks and a new episode begins once this limit is reached.

Fig. 3: Q-matrix pretrained DQN vs DQN

The graph comparing DQN with our proposed Q-matrix pretrained DQN (QMP-DQN) in Fig. 3 shows the difference in their performance on the fire evacuation environment. Although the DQN reaches the optimal number of time-steps quickly, it isn’t able to stay there. The DQN drastically diverges from the solution and is highly unstable.

Fig. 4: Q-matrix pretrained DDQN vs DDQN

It’s the same case with DDQN (Fig. 4) and Dueling DQN (Fig. 5), which perform better than DQN, with fewer fluctuations and more time spent near the optimal solution, but still clearly lag behind their pretrained versions. These results suggest that pretraining improves convergence and stability. We show that having some important information about the environment prior to training reduces the complexity of the learning task for an agent.

Fig. 5: Q-matrix pretrained Dueling DQN vs Dueling DQN

The original Q-learning based models aren’t able to cope with the dynamic and stochastic behaviour of the environment. And since they don’t possess pretrained information, their learning process is made even more difficult. Table 2 displays a few numerical results, comparing DQN, DDQN and Dueling DQN, with and without the Q-matrix pretraining, on the basis of the average number of time-steps over all 500 episodes, the minimum number of time-steps reached during training and the training time per episode.

| Model | Average Time-Steps | Average Time-Steps with Pretraining | Minimum Time-Steps | Minimum Time-Steps with Pretraining | Training Time (per episode) | Training Time with Pretraining (per episode) |
|---|---|---|---|---|---|---|
| DQN | 228.2 | 76.356 | 63 | 61 | 10.117 | 6.87 |
| DDQN | 134.62 | 71.118 | 61 | 60 | 12.437 | 8.11 |
| Dueling DQN | 127.572 | 68.754 | 61 | 60 | 12.956 | 9.02 |

TABLE II: Performance

As was also clear from Figs. 3, 4 and 5, the average number of time-steps is greatly reduced with pretraining, as it makes the models more stable by reducing variance. Based on the environment given in Fig. 2, the minimum possible number of time-steps is 60. All the DQN based models are able to come close to this, but pretraining pushes these models further: the pretrained DDQN and Dueling DQN reach the minimum possible number of time-steps, and the pretrained DQN comes within one step of it (Table 2). Even though the difference seems small, in emergency situations even the smallest differences could mean a lot in the end. The training time is also reduced with pretraining, as the number of time-steps taken during training is reduced and pretrained models get a better starting position nearer to the optimum.
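The size of the improvement can be read directly off Table 2. The short calculation below uses only the average time-step values from that table; the helper name `relative_reduction` is introduced here for illustration.

```python
# Average time-steps from Table 2: (without pretraining, with pretraining).
TABLE_II_AVERAGE = {
    "DQN": (228.2, 76.356),
    "DDQN": (134.62, 71.118),
    "Dueling DQN": (127.572, 68.754),
}


def relative_reduction(before, after):
    """Fraction by which pretraining reduces the average number of time-steps."""
    return (before - after) / before


for model, (plain, pretrained) in TABLE_II_AVERAGE.items():
    share = 100 * relative_reduction(plain, pretrained)
    print(f"{model}: {share:.1f}% fewer time-steps on average")
```

By this measure, pretraining cuts the average episode length by roughly two thirds for DQN and by a little under half for DDQN and Dueling DQN.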

Fig. 6: Proposed method vs Random Agent

Next, we make comparisons between our proposed approach and state-of-the-art reinforcement learning algorithms. For these comparisons, we use the Q-matrix pretrained Dueling DQN model, abbreviated QMP-DQN. We also compare it with a random agent, shown in Fig. 6. The random agent performs random actions at each step, without any exploration. The random agent’s poor performance of 956.33 average time-steps shows that finding the optimal or even evacuating all the people isn’t a simple task.

Fig. 7: Proposed method vs State-Action-Reward-State-Action method

The State-Action-Reward-State-Action (SARSA) algorithm is an on-policy reinforcement learning algorithm introduced in [3]. While Q-learning bootstraps from the greedy action in the next state, SARSA takes the behaviour policy into account and incorporates it into its updates: it updates values using the action that the current policy actually selects in the next state. On-policy methods like SARSA have a downside of getting trapped in local minima if a sub-optimal policy is judged as the best. On the other hand, off-policy methods like Q-learning are flexible and simple as they follow a greedy approach. As is clear from Fig. 7, SARSA behaves in a highly unstable manner, shows high variance and isn’t able to reach the optimal solution.

Fig. 8: Proposed method vs Policy based methods (PPO and VPG)

Policy gradient methods are highly preferred in many applications, however they aren’t able to perform optimally on our fire evacuation environment. Since the optimal policy could change in a few time-steps in our dynamic environment, greedy action selection is probably the best approach. An evacuation path that seems best at a particular time step could be extremely dangerous after the next few time-steps and a strict policy of routing cannot be followed continuously due to fire spread and/or bottleneck. These facts are evident from Fig. 8, where we compare our approach to policy gradient methods like Proximal Policy Optimization (PPO) [71] and Vanilla Policy Gradient (VPG) [72]. Even though PPO shows promising movement, it isn’t able to reach the optimum.

Fig. 9: Proposed method vs Synchronous Advantage Actor Critic method (A2C)
Fig. 10: Proposed method vs Actor Critic using Kronecker-Factored Trust Region (ACKTR)
| Model | Average Time-Steps | Minimum Time-Steps | Training Time (per episode) |
|---|---|---|---|
| SARSA | 642.21 | 65 | 19.709 |
| PPO | 343.75 | 112 | 16.821 |
| VPG | 723.47 | 434 | 21.359 |
| A2C | 585.92 | 64 | 25.174 |
| ACKTR | 476.56 | 79 | 29.359 |
| Random Agent | 956.33 | 741 | - |
| QMP-DQN (Dueling DQN Backbone) | 68.754 | 60 | 9.02 |

TABLE III: Comparison with State-of-the-art RL Algorithms

Another major class of reinforcement learning algorithms is the actor-critic methods. These are hybrid approaches consisting of two neural networks: an actor which controls the policy (policy based) and a critic which estimates action values (value based). To further stabilize the model, an advantage function is introduced which gives the improvement of an action compared to an average action used in a particular state. Apart from the previously mentioned shortcomings of using policy based methods on the fire evacuation environment, the advantage function would have high variance since the best action at a particular state could change rapidly, leading to unstable performance. This is apparent from Fig. 9, where we compare the synchronous advantage actor critic method (A2C) [73] with our proposed method. The A2C gives near optimal performance in the beginning but diverges and rapidly fluctuates.
We do not compare our proposed method with the asynchronous advantage actor critic method (A3C) [74], because A3C is just an asynchronous version of A2C, which is more complex as it creates many parallel versions of the environment and gives relatively the same performance, but is not as sample efficient as claimed in [75]. The only advantage of A3C is that it exploits parallel and distributed CPU and GPU architectures which boosts learning speed as it can update asynchronously. However, the main focus of this paper is not learning speed. Hence, we think that the comparison with A2C is sufficient for actor-critic models.
Probably the best performing Actor Critic based model is the ACKTR (Actor Critic with Kronecker-factored Trust Region) [76]. The algorithm is based on applying trust region optimization using Kronecker-factored approximation, and is the first scalable trust region natural gradient method for actor critic models that can be applied to both continuous and discrete action spaces. The Kronecker-factored Approximate Curvature (K-FAC) [77] is used to approximate the Fisher Matrix to perform approximate natural gradient updates. We compare our method to the ACKTR algorithm, shown in Fig. 10. The results suggest that the ACKTR is not able to converge (within 500 episodes, due to slow convergence rate) and is susceptible to the dynamic changes in the environment, as evident from the fluctuations. ACKTR is far too complex compared to our proposed method, which converges much faster and deals with the dynamic behaviour of the fire evacuation environment efficiently.
We summarize our results in Table 3. All the RL agents use the same network configuration mentioned in Table 1 for unbiased comparison. The training time for the QMP-DQN is much lower compared to other algorithms because pretraining provides it with a better starting point, so it requires fewer time-steps and memory updates to reach the terminal state. Also, SARSA and A2C come really close to the minimum number of time-steps, but as the average number of time-steps suggests, they aren’t able to converge and exhibit highly unstable performance. Our proposed method, the Q-matrix pretrained Dueling Deep Q-network, gives the best performance on the fire evacuation environment by a huge margin.
Note that, in all the comparison graphs, our proposed method comes close to the global optimum, but isn’t able to completely converge to it. This is because of the uncertainty probability , which decides whether an action is performed or not and is set to . This uncertainty probability is used to map the uncertain crowd behaviour. Even though, , does not allow complete convergence, it also prevents the model from memorizing an optimal path which might change as the fire spreads.

Multiple Fires Scenario

Now that we have shown that the proposed method is able to outperform state-of-the-art reinforcement learning algorithms, we test our model on a more complex and difficult environment setup. The environment configuration consists of multiple fires in different rooms and a more complex graph structure consisting of 8 rooms. The environment is shown in Fig. 11. The green node is the exit, the red nodes are the rooms where the fire is located and the orange arrows depict the direction of fire spread.
As we can see from Fig. 11, the fire spreads in different directions from different fire locations. This makes things especially difficult because as the fire spreads, the paths to the exit could be blocked. We do not change the configuration of our approach, except the output layer of the network, since the number of possible actions is 64 now. The State of the environment and Bottleneck given as input is: and .

Fig. 11: A Multiple Fire Evacuation Environment

We employ the Q-matrix pretrained Dueling DQN model. Fig. 12 shows the results on the multiple fires scenario. The initial fluctuations are due to ε-greedy exploration. Since this configuration of the environment is bigger and more complex, the agent explores the environment a little longer.
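The source of the initial fluctuations can be sketched with a generic ε-greedy rule. This is an illustrative sketch, not the exact schedule used in our experiments: the function names and the decay parameters `start`, `end` and `decay` are assumptions.

```python
import random


def epsilon_greedy(q_values, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)


def decayed_epsilon(episode, start=1.0, end=0.05, decay=0.99):
    """Exponentially decayed exploration rate (hypothetical schedule)."""
    return max(end, start * decay ** episode)


rng = random.Random(0)
q_values = [0.1, 0.9, 0.3]

greedy_choice = epsilon_greedy(q_values, epsilon=0.0, rng=rng)
early_epsilon = decayed_epsilon(0)
late_epsilon = decayed_epsilon(10_000)
```

Early episodes use a large ε and therefore take many random actions, which produces the fluctuations; as ε decays, the agent follows its learned Q-values more closely.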

Fig. 12: Q-matrix pretrained Dueling DQN in Multiple Fire Scenario

As the results in Fig. 12 suggest, the proposed model is able to converge very quickly. A few metrics for the proposed method on the multiple fires environment are given below:

Average number of time-steps:


Minimum number of time-steps:


Training time (per episode):


Note that there is a gap between the minimum number of time-steps and the average number of time-steps. This is because the average is taken over all 500 episodes, which includes the initial fluctuations due to exploration and the effect of the uncertainty probability.

V Scalability: Large and Complex Real World Scenario - University of Agder Building

To prove that our method is capable of performing on large and complex building models, we simulate a real world building, i.e., the University of Agder A, B, C and D blocks, and perform evacuation in case of fire in any random room(s).
This task is especially difficult because of the resulting complex graph structure of the building and the large discrete action space. We consider the A, B, C and D blocks, which are in the same building. The total number of rooms in this case is 91, which means that the number of all possible actions is 8281. This discrete action space is many times larger than that of any other OpenAI gym environment or Atari game environment [11]. Even the game of Go has only 19 × 19, i.e., 361 actions.
Dealing with such a large action space would require a huge agent model, or moving to a multi-agent approach and dividing the environment into subsets, with one sub-environment for each agent to deal with. These techniques for dealing with the large discrete action space would be computationally complex and difficult to implement for the fire evacuation environment.
Another way could be to use a policy gradient method, since such methods are much more effective in dealing with large action spaces than value based methods. However, dealing with such large action spaces would require an ensemble of neural networks and tree search algorithms as in [78], or extensive training from human interactions as in [79]. In a fire emergency environment we cannot rely on human interactions, and we would like to solve the issue of the large action space without resorting to very large models. We also saw in the previous section that, even though PPO performs much better than the other algorithms, it was not able to outperform our QMP-DQN methods.
In [80], a new method to deal with extremely large discrete action spaces (1 million actions) was proposed. The method, called the Wolpertinger policy algorithm, uses a type of actor-critic architecture in which the actor proposes a proto-action in an action embedding space, from which the most similar actions are selected using the k-nearest neighbour algorithm. These actions are passed to the critic, which makes a greedy selection based on the learned Q-values. This technique shows promising results; however, it is highly complex.
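The selection step described above can be sketched as follows. This is only a toy illustration of the idea, not the implementation from [80]: the embeddings, the critic (a fixed lookup table here) and the function name `wolpertinger_select` are all hypothetical.

```python
import heapq


def wolpertinger_select(proto_action, action_embeddings, critic, state, k):
    """Pick the critic's favourite among the k actions nearest the proto-action."""

    def sq_dist(idx):
        # Squared Euclidean distance between an embedding and the proto-action.
        return sum(
            (a - b) ** 2 for a, b in zip(action_embeddings[idx], proto_action)
        )

    # k-nearest neighbours of the proto-action in the embedding space.
    candidates = heapq.nsmallest(k, range(len(action_embeddings)), key=sq_dist)
    # Greedy selection by the critic over the candidates only.
    return max(candidates, key=lambda idx: critic(state, idx))


# Toy data: four discrete actions embedded in 2-D.
embeddings = [(0.0, 0.0), (1.0, 0.0), (0.9, 0.2), (5.0, 5.0)]
toy_q = [0.1, 0.4, 0.7, 9.0]


def toy_critic(state, action_idx):
    return toy_q[action_idx]


proto = (1.0, 0.1)  # stand-in for the actor's output
chosen_k2 = wolpertinger_select(proto, embeddings, toy_critic, state=None, k=2)
chosen_k4 = wolpertinger_select(proto, embeddings, toy_critic, state=None, k=4)
```

With `k=2` the critic only sees the two actions closest to the proto-action, so it never evaluates the distant action even though that action has the highest Q-value. The method therefore needs an actor, an embedding space and a nearest-neighbour search on top of the critic.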
We propose a much simpler approach to deal with a large number of actions. Our method consists of two stages: a One-Step Simulation (OSS) of all actions, resulting in an action importance vector, followed by element-wise addition of this vector to the DQN output for training. We explain our method in the following subsections.

V-A One-Step Simulation and Action Importance Vector

We make use of the pretraining environment instance shown in Algorithm 2 to calculate the action importance vector, as shown in Algorithm 4. The one-step simulation function is implemented in the environment itself, so that the environment object can call it as a method and the function can access the environment variables.
The function simulates all possible actions for each state/room for one time-step in the pretraining environment. It stores all rewards received for the actions taken from a room and returns the best actions for that room, i.e. those which yield the highest rewards.
The function is run for every room in the environment. An index conversion is then used to combine the best actions returned by the function for all rooms into a single vector of actions. This is necessary because the DQN agent can take any appropriate action from any room at a particular time-step. So, it outputs a single vector consisting of Q-values for all actions at each time-step.

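Environment instances: pretraining environment
Environment variables: Number of rooms
Function: one-step simulation from a given room
       for each action in all possible actions do
              simulate the action from the room for one time-step and store the reward
       end for
       return the best actions, i.e. those with the highest rewards;
End Function
for each room in the environment do
       run the one-step simulation and convert the returned best actions to unique action indices
end for
for each action index in all possible actions do
       if the index is in the vector of best actions then
              set the action importance at that index to 0
       else
              set the action importance at that index to a large negative number
       end if
end for
Algorithm 4 One-Step Simulation and Action Importance Vector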

After we have a unique index for all selected actions in the environment, we form the action importance vector by placing 0 at the index of an action if the action is present in the vector of best actions for each room, and a large negative number at that index otherwise.
The action importance vector can be thought of as a fixed weight vector which contains a weight of 0 for good actions and a large negative weight for the others. It is then added element-wise to the output of the DQN to produce the final output on which the DQN is trained.
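The construction described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not our environment code: the flat index `room * n_rooms + target`, the toy reward function, the four-room ring graph and the constant `LARGE_NEGATIVE` are all hypothetical choices made for the example.

```python
LARGE_NEGATIVE = -1e9  # hypothetical stand-in for the large negative weight


def one_step_simulation(room, n_rooms, reward_fn, k):
    """Try every move from `room` for one step and return the k best targets."""
    scored = [(reward_fn(room, target), target) for target in range(n_rooms)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [target for _, target in scored[:k]]


def action_importance_vector(n_rooms, reward_fn, k):
    """Weight 0 for the best actions of every room, LARGE_NEGATIVE elsewhere."""
    best = set()
    for room in range(n_rooms):
        for target in one_step_simulation(room, n_rooms, reward_fn, k):
            best.add(room * n_rooms + target)  # assumed flat action index
    return [
        0.0 if idx in best else LARGE_NEGATIVE
        for idx in range(n_rooms * n_rooms)
    ]


# Toy building: four rooms in a ring, room 3 is the exit.
EXIT = 3
neighbours = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}


def toy_reward(room, target):
    if target not in neighbours[room]:
        return -100.0  # invalid move between unconnected rooms
    return 10.0 if target == EXIT else -1.0


k = max(len(adjacent) for adjacent in neighbours.values())  # maximum degree
importance = action_importance_vector(len(neighbours), toy_reward, k)
```

Since the graph structure is fixed, this vector only needs to be computed once and can then be added to the network output at every time-step.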


This leaves the Q-values of the good actions unchanged and reduces the Q-values of the other actions to very large negative numbers. The method effectively reduces the action space to the best actions of each room. In our experiments, we set the number of best actions kept per room, a hyperparameter, to the maximum degree of the vertices in the building model's graph. So, in our model, the action space of 8281 actions is effectively reduced to a much smaller set.
Hence, our complete method consists of shortest path pretraining using Q-matrix transfer learning and action space reduction by one-step simulation and action importance and finally DQN based model training and execution. The shortest path pretraining provides the model with global graph connectivity information and the one-step simulation and action importance delivers local action selection information.
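The shortest path pretraining stage relies on tabular Q-learning over the building graph. The following sketch shows the idea on a five-room toy graph; the reward values, the learning rate, the discount factor and the purely random exploration are assumptions made for illustration and are not the settings of our experiments.

```python
import random


def q_learning_shortest_path(adjacency, exit_room, episodes=500,
                             alpha=0.5, gamma=0.9, seed=0):
    """Learn a Q-matrix whose greedy policy follows the shortest path to the exit."""
    rng = random.Random(seed)
    n = len(adjacency)
    q = [[0.0] * n for _ in range(n)]
    for _ in range(episodes):
        room = rng.randrange(n)
        while room != exit_room:
            nxt = rng.choice(adjacency[room])  # random exploration (off-policy)
            reward = 100.0 if nxt == exit_room else -1.0  # assumed rewards
            best_next = (
                0.0 if nxt == exit_room
                else max(q[nxt][a] for a in adjacency[nxt])
            )
            q[room][nxt] += alpha * (reward + gamma * best_next - q[room][nxt])
            room = nxt
    return q


def greedy_path(q, adjacency, start, exit_room):
    """Follow the highest Q-value from `start` until the exit is reached."""
    path = [start]
    while path[-1] != exit_room and len(path) <= len(adjacency):
        room = path[-1]
        path.append(max(adjacency[room], key=lambda a: q[room][a]))
    return path


# Toy graph: room 4 is the exit; 0-2-4 is shorter than 0-1-3-4.
adjacency = [[1, 2], [0, 3], [0, 3, 4], [1, 2, 4], [2, 3]]
q_matrix = q_learning_shortest_path(adjacency, exit_room=4)
path_from_0 = greedy_path(q_matrix, adjacency, start=0, exit_room=4)
path_from_1 = greedy_path(q_matrix, adjacency, start=1, exit_room=4)
```

The resulting Q-matrix encodes global connectivity: from every room, the largest entry points along a shortest path to the exit. This is the information that is transferred to the network before training on the fire evacuation environment.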
The action importance vector can also be thought of as an attention mechanism [37, 81, 82, 83]. Most attention mechanisms employ a neural network or some other technique to output an attention vector, which is then combined with the input or an intermediate output to convey attention information to a model. Unlike these methods, our proposed model combines the action importance vector with the output of the DQN. This means that the current action selection is based on a combination of the Q-values produced by the DQN and the action importance vector, while the training of the DQN is affected by the attention vector in the next iteration of training, since the final output of one iteration is used as the label for training the model at the next iteration.
One major advantage of the attention mechanism used in our method is that, since the graph based environment has a fixed structure, the attention vector needs to be calculated just once at the beginning. We test our method on the University of Agder (UiA), Campus Grimstad building with blocks A, B, C and D, consisting of 91 rooms.
Fig. 13: University of Agder Graph

The red vertices indicate fire in that room and the green vertices are exits. The yellow vertices show the fire spread towards that room.

Note that, unlike the usual attention based models, we do not perform element-wise multiplication of the attention vector with the output of a layer. Instead, we add the attention vector, because initially the DQN model will explore the environment and will have negative Q-values for almost all actions (if not all). This means that if we use a vector of ones and zeros for good and bad actions respectively and multiply it element-wise with the output of a layer, then the Q-values of good actions will be copied as they are and the Q-values of other actions will become zero. If the Q-values of good actions are negative in the beginning due to exploration (and lack of learning, since it is the beginning of training), then the max function in the Q-value selection equation will select bad actions, since they are zeros and good actions are negative. This will lead to catastrophic behaviour of the system and it will never converge. So, instead we use addition, with zeros for good actions so that they remain the same and with large negative numbers for other actions so that their Q-values become so low that they are never selected.
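The argument above can be checked on a small numerical example. The Q-values, the set of good actions and the constant `LARGE_NEGATIVE` below are hypothetical and chosen only to mimic the start of training, when all Q-values are negative.

```python
LARGE_NEGATIVE = -1e9  # hypothetical stand-in for the large negative weight


def greedy(values):
    """Index of the largest value (the max in the Q-value selection)."""
    return max(range(len(values)), key=values.__getitem__)


q_values = [-3.0, -0.5, -2.0, -4.0]  # early-training Q-values, all negative
good_actions = {0, 2}                # actions kept by the one-step simulation

# Multiplicative mask: ones for good actions, zeros for the rest.
mult_mask = [1.0 if a in good_actions else 0.0 for a in range(len(q_values))]
# Additive mask: zeros for good actions, large negative numbers for the rest.
add_mask = [
    0.0 if a in good_actions else LARGE_NEGATIVE for a in range(len(q_values))
]

multiplied = [q * m for q, m in zip(q_values, mult_mask)]
added = [q + m for q, m in zip(q_values, add_mask)]

picked_by_multiplication = greedy(multiplied)
picked_by_addition = greedy(added)
```

With multiplication, the bad actions are mapped to zero, which is larger than every negative Q-value, so a bad action is selected. With addition, the best of the good actions is selected.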

V-B Fire Evacuation in the UiA building

The graph for UiA's building model is based on the actual structure of the 2nd floor of blocks A, B, C and D. The graph for the building model is shown in Fig. 13. It consists of 91 rooms, some of which are exits. We simulate the fire evacuation environment with multiple distributed fires in different rooms. The fire spread for each fire is individually simulated in a random direction, as shown by the yellow nodes in the graph.
As shown in Fig. 13, the building connectivity can be quite complex and there has been limited research work that deals with this aspect. The graph structure shows that these connections between rooms cannot possibly be captured by a grid based or maze environment.
Also note that the double sided arrows in the graph enable transitions back and forth between rooms. This makes the environment more complicated for the agent, since the agent could just go back and forth between 'safe' rooms, get stuck in a loop and never converge. This makes pretraining even more indispensable.
Since the proposed method is able to reduce the action space substantially, the neural network does not need to be made too large. The network configuration is given in Table 4. Note that the addition layer does not require any trainable parameters.

Type       Size   Activation
Dense      512    ReLU
Dense      1024   ReLU
Dense      1024   ReLU
Dense      1024   ReLU
Dense      8281   Linear
Addition   -      -
TABLE IV: Network Configuration

The neural network is trained using the Adam optimizer [56] with default hyperparameter settings and a learning rate for episodes. The training was performed on the NVIDIA DGX-2. The optimal number of steps for evacuation in the UiA building graph is around .

V-C Results

Fig. 14: Proposed method applied on the UiA Building

The results of our proposed method, which combines shortest path Q-matrix transfer learning to a Dueling DQN model with one-step simulation and the action importance vector acting as an attention mechanism, are shown in Fig. 14. The method is applied to the University of Agder's A, B, C and D blocks, consisting of 91 rooms and 8281 actions (the graph is shown in Fig. 13). The performance numbers are given below:

Average number of time-steps:


Minimum number of time-steps:


Training time (per episode):

32.18 s

The graph in Fig. 14 shows the convergence of our method, with evacuation time-steps on the y-axis and the episode number on the x-axis. It takes slightly longer to converge than in the previous small example environments. This is due to the size of the environment and its complex connectivity. Overall, however, the performance of our model is excellent.
Over the course of training, the algorithm first comes close to convergence, with a few spikes indicating deviations from the optimal behaviour due to the dynamic nature of the environment and the uncertainty in actions, and then converges completely to a stable range of time-steps for total evacuation. The method cannot converge to the minimum possible number of time-steps because of the fire spread dynamics, bottleneck conditions and action uncertainty.
The results clearly suggest that even though the proposed fire evacuation environment is dynamic, uncertain and full of constraints, our proposed method using novel action reduction technique with attention based mechanism and transfer learning of shortest path information is able to achieve excellent performance on a large and complex real world building model. This further confirms that, with a minute added overhead of one-step simulation and action importance vector, our method is scalable to much larger and complex building models.

VI Conclusion

In this paper, we propose the first realistic fire evacuation environment to train reinforcement learning agents. The environment is implemented in OpenAI gym format. The environment has been developed to simulate realistic fire scenarios. It includes features like fire spread with the help of exponential decay reward functions and degree functions, bottlenecks, uncertainty in performing an action and a graph based environment for accurately mapping a building model.
We also propose a new reinforcement learning method for training on our environment. We use tabular Q-learning to generate Q-values for the shortest path to the exit using the adjacency matrix of the graph based environment. Then, the result of Q-learning (after an offset is applied) is used to pretrain the DQN network weights to incorporate shortest path information in the agent. Finally, the pretrained DQN based agents are trained on the fire evacuation environment.
We prove the faster convergence of our method using Task Transfer Q-learning theorems and the convergence of Q-learning for the shortest path task. The Q-matrix pretrained DQN agents (QMP-DQN) are compared with state-of-the-art reinforcement learning algorithms like DQN, DDQN, Dueling-DQN, PPO, VPG, A2C, ACKTR and SARSA on the fire evacuation environment. The proposed method is able to outperform all these models on our environment on the basis of convergence, training time and stability. Also, the comparisons of QMP-DQN with original DQN based models show clear improvements over the latter.
Finally, we show the scalability of our method by testing it on a large and complex real world building model. In order to reduce the large action space (8281 actions), we use the one-step simulation technique on the pretraining environment instance to calculate the action importance vector, which can be thought of as an attention based mechanism. The action importance vector gives the best actions a weight of 0, and the rest are assigned a large negative weight (to render their Q-values too low to be selected by the Q-function). This reduces the action space substantially, and our proposed QMP-DQN model is applied on this reduced action space. We test this method on the UiA, Campus Grimstad building, with the environment consisting of 91 rooms. The results show that this combination of methods works very well in a large real world fire evacuation emergency environment.


The authors would like to thank Tore Olsen, Chief of the Grimstad Fire Department, and the Grimstad Fire Department for supporting us with their expertise regarding fire emergencies and evacuation procedures as well as giving feedback to improve our proposed environment and evacuation system. We would also like to thank Dr. Jaziar Radianti, Center for Integrated Emergency Management (CIEM), University of Agder, for her input to this research work.