I Introduction
Reinforcement Learning (RL) has been a subject of extensive research and application in various real-world domains such as robotics, games, industrial automation and control, system optimization, and quality control and maintenance. However, some extremely important areas where reinforcement learning could be immensely valuable have not received adequate attention from researchers. We turn our attention to the major problem of evacuation during fire emergencies.
Fire-related disasters are the most common type of emergency situation. They require thorough analysis of the situation for a quick and precise response. Even though this critical application hasn't received adequate attention from AI researchers, there have been some noteworthy contributions. One such paper, focusing on assisting decision making for fire brigades, is described in [1]. Here, the RoboCup Rescue simulation is used as a fire simulation environment [2]. A SARSA agent [3] is used with a new learning strategy called Lesson-by-Lesson learning, similar to curriculum learning. Results show that the RL agent performs admirably in the simulator. However, the simulator lacks realistic features like bottlenecks and fire spread, and its grid structure is too simplistic to model realistic environments. Also, the approach seems unstable and requires state information that isn't readily available in real-life scenarios.
In [4], multiple coordinated agents are used for forest fire fighting. The paper uses a software platform called Pyrosim to create dynamic forest fire situations. The simulator is mostly used for terrain modeling, and the coordinated multi-agent system is used to extinguish fire, not for evacuation.
The evacuation approach described in [5] is similar to the problem we address in this paper. In [5], a fading-memory mechanism is proposed, with the intuition that in dynamic environments less trust should be put in older knowledge for decision making. Arguably, though, this could be achieved more efficiently through a suitable parameter in Q-learning along with prioritized experience replay. Also, the graph-based environment used in [5] lacks many key features like fire spread, people in rooms, and bottlenecks.
The most significant work on building evacuation using RL is reported in [6]. The evacuation environment is grid based with multiple rooms and fire. The fire spread is modelled accurately and uncertainty is taken into account. The multi-agent Q-learning model is shown to work in large spaces as well. Further, the paper demonstrates a simple environment and strategy for evacuation. However, the approach proposed in [6] lacks key features like bottlenecks and actual people in rooms, and the grid-based environment cannot capture details of the building model such as room locations and the paths connecting rooms.
Some interesting research on evacuation planning takes a completely different approach by simulating and modelling human and crowd behaviour under evacuation [7, 8, 9, 10]. Our work on evacuation planning is not based on human behaviour modelling or on the BDI (Belief-Desire-Intention) framework for emergency scenarios. These methods are beyond the scope of this paper and are not discussed here.
Proposed Environment
There are many reinforcement learning libraries that contain simulations and game environments to train reinforcement learning based agents [11, 12, 13, 14, 15]. However, currently no realistic learning environment for emergency evacuation has been reported.
In our paper, we build the first realistic fire evacuation environment specifically designed to train reinforcement learning agents to evacuate people in the safest manner in the least number of timesteps possible. The environment follows the structure of OpenAI Gym environments, so it can easily be used in the same manner.
The proposed fire evacuation environment is graph based, and requires complex decision making such as routing, scheduling and dealing with bottlenecks, crowd behaviour uncertainty and fire spread. This problem falls in the domain of discrete control. The evacuation is performed inside a building model, which is represented as a graph. The agent needs to evacuate all persons from all rooms through any available exits using the shortest path in the least number of timesteps, while avoiding hazards such as the fire and bottlenecks.
Some previous research papers focus on modelling fire spread and prediction, mostly using cellular automata [16] and other novel AI techniques [17, 18, 19]. An effective and innovative way of modelling fire spread is to use spatial reinforcement learning, as proposed in [20]. However, our way of simulating fire spread is far less complex and leverages the reward system of the RL framework. In our proposed environment, we simply use an exponential decay reward function to model the fire spread and direction. To keep in tune with the RL framework, the feedback sent back to the agent from the environment should convey enough information. So, we design the reward function in such a manner that the agent can learn about the fire spread and take measures accordingly.
Proposed Method
Since this environment poses a high level of difficulty, we argue that incorporating the shortest path information (shortest path from each room to the nearest exit) in the DQN model(s) by transfer learning and pretraining the DQN neural network function approximator is necessary.
Transfer learning has been used extensively in computer vision tasks for many years, and has recently been vastly expanded for many computer vision problems in [21]. Lately, it has been utilized in natural language models [22, 23]. In reinforcement learning, pretrained models have started to appear as well [24, 25]. In fact, we use the convergence analysis of [25], which provides a general theoretical perspective on task transfer learning, to prove that our method guarantees convergence.

In this paper, we present a new class of pretrained DQN models called Q-matrix Pretrained Deep Q-Networks (QMP-DQN). We employ Q-learning to learn a Q-matrix representing the shortest paths from each room to the exit. We perform multiple random episodic starts and greedy exploration of the building model graph environment. Q-learning is applied on a pretraining instance of the environment that consists of only the building model graph. Then, we transfer the Q-matrix to a DQN model by pretraining the DQN to reproduce the Q-matrix. Finally, we train the pretrained DQN agent on the complete fire evacuation task. We compare our proposed pretrained DQN models (QMP-DQN) against regular DQN models and show that pretraining is necessary for our fire evacuation environment. We also compare our QMP-DQN models with state-of-the-art reinforcement learning algorithms and show that off-policy Q-learning techniques perform better than policy-based methods as well as actor-critic models.
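A minimal sketch of this two-stage pipeline may help make it concrete. The graph, hyperparameters and the linear "network" below are all illustrative stand-ins (the paper uses a real DQN and a larger building model); only the overall structure — tabular Q-learning on the bare graph, then regression of the learned Q-matrix — mirrors the described method.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 4-room building graph; room 3 is the exit.
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 1],
                [0, 1, 0, 1],
                [0, 1, 1, 0]])
n, exit_room = adj.shape[0], 3

# Stage 1: tabular Q-learning on the graph alone (shortest-path pretraining),
# with random episodic starts and (epsilon-)greedy exploration.
Q = np.zeros((n, n))
alpha, gamma = 0.5, 0.9
for _ in range(2000):
    s = int(rng.integers(n))                 # random episodic start
    while s != exit_room:
        a = int(rng.integers(n)) if rng.random() < 0.2 else int(Q[s].argmax())
        if adj[s, a] == 0:                   # illegal move: penalty, stay put
            r, s2 = -10.0, s
        else:
            r, s2 = (1.0, a) if a == exit_room else (-1.0, a)
        Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])
        s = s2

# The greedy policy on the learned Q-matrix yields the shortest path to the exit.
path = [0]
while path[-1] != exit_room:
    path.append(int(Q[path[-1]].argmax()))
print(path)

# Stage 2 (transfer): pretrain a function approximator to reproduce the
# Q-matrix (a single linear layer here, as a stand-in for the DQN).
W = rng.normal(size=(n, n)) * 0.01
X = np.eye(n)                                # one-hot room encoding
for _ in range(5000):
    pred = X @ W
    W -= 0.1 * X.T @ (pred - Q) / n          # gradient step on the MSE
print(np.abs(X @ W - Q).max() < 1e-3)        # the "network" now reproduces Q
```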
Finally, in Section 5, we show that our method can perform optimal evacuation planning on a large and complex real-world building model by handling the large discrete action space in a new and simple way using an attention-based mechanism.
Contributions
This paper contributes to the fields of reinforcement learning and emergency evacuation and management in the following ways:

We propose the first reinforcement learning based fire evacuation environment with OpenAI Gym structure.

We build a graph-based environment to accurately model the building structure, which is more efficient than a maze structure.

The environment can have a large discrete action space, with $N^2$ actions (one for every ordered pair of rooms), where $N$ is the number of rooms in the building. That is, the action space size grows quadratically with the number of rooms.

Our proposed environment contains realistic features such as multiple fires and dynamic fire spread, which is modelled by the exponential decay reward function.

We further improve the realism of our environment by restricting the number of people allowed in each room to model overcrowded hazardous situations.

We also include uncertainty about whether an action is performed in the environment, to model uncertain crowd behaviour; this also acts as a method of regularization.

We use the Q-matrix to transfer learned knowledge of the shortest paths by pretraining a DQN agent to reproduce the Q-matrix.

We also introduce a small amount of noise into the Q-matrix, to prevent the DQN agent from stagnating in a local optimum.

We perform exhaustive comparisons with state-of-the-art reinforcement learning algorithms: DQN, DDQN, Dueling DQN, VPG, PPO, SARSA, A2C and ACKTR.

We test our model on a large and complex real-world scenario, the University of Agder building, whose graph contains a large number of nodes and a correspondingly large action space.

We propose a new and simple way to deal with large discrete action spaces in our proposed environment, by employing an attention-mechanism-based technique.
The rest of the paper is organized as follows: Section 2 summarizes the RL concepts used in this paper. Section 3 gives a detailed explanation of the proposed fire emergency evacuation system, with each module of the system described in subsequent subsections. Section 4 reports our exhaustive experimental results. Section 5 presents the real-world application of our model in a large and complex environment, and finally Section 6 concludes the paper.
II Preliminaries
Reinforcement Learning is a subfield of machine learning which deals with learning to make appropriate decisions and take actions to achieve a goal. A reinforcement learning agent learns from direct interactions with an environment, without requiring explicit supervision or a complete model of the environment. The agent interacts with the environment by performing actions. It receives feedback for its actions in terms of a reward (or penalty) from the environment and observes changes in the environment as a result of the actions it performs. These observations are called states of the environment, and the agent interacts with the environment at discrete time intervals: by performing an action $a_t$ in a state $s_t$ of the environment, it transitions to a new state $s_{t+1}$ (a change in the environment) while receiving a reward $r_{t+1}$, with probability $p(s_{t+1}, r_{t+1} \mid s_t, a_t)$. The main aim of the agent is to maximize the cumulative reward over time through its choice of actions. A pictorial representation of the RL framework is shown in Fig. 1.

In the subsequent subsections, the concepts and methods used in this paper are briefly explained.
II-A Markov Decision Process
The reinforcement learning framework is formalised by Markov Decision Processes (MDPs), which are used to define the interaction between a learning agent and its environment in terms of states, actions, and rewards [27]. An MDP consists of a tuple $(S, A, P, R)$ [26], where $S$ is the state space, $A$ is the action space, $P$ is the transition probability from one state to the next, and $R$ is the reward function, $R : S \times A \to \mathbb{R}$.

When the state space $S$, action space $A$ and rewards $R$ consist of a finite number of elements, $s_{t+1}$ and $r_{t+1}$ have well-defined discrete probability distributions which depend only on the present state and action (the Markov property). This is represented as $p(s', r \mid s, a)$, which determines the dynamics of the Markov Decision Process:

$$p(s', r \mid s, a) = \Pr\{S_t = s', R_t = r \mid S_{t-1} = s, A_{t-1} = a\} \tag{1}$$
$p(s', r \mid s, a)$ contains all the information about the MDP, so we can compute important aspects of the environment from it, like the state transition probability and the expected rewards for state-action pairs [26]:

$$p(s' \mid s, a) = \Pr\{S_t = s' \mid S_{t-1} = s, A_{t-1} = a\} = \sum_{r \in R} p(s', r \mid s, a) \tag{2}$$

$$r(s, a) = \mathbb{E}\left[R_t \mid S_{t-1} = s, A_{t-1} = a\right] = \sum_{r \in R} r \sum_{s' \in S} p(s', r \mid s, a) \tag{3}$$
Equation 3 gives the immediate reward we expect to receive when performing action $a$ in state $s$. The agent tries to select actions that maximize the sum of rewards it expects to achieve as time goes to infinity. In a dynamic and/or continuous Markov Decision Process, the notion of discounted rewards is used [26]:

$$G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \tag{4}$$
where $\gamma$ is the discount factor, in the range $[0, 1]$. If $\gamma$ is near $0$, then the agent puts emphasis on rewards received in the near future, and if $\gamma$ is near $1$, then the agent also cares about rewards in the distant future.
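As a concrete numeric illustration of equation 4 and the role of $\gamma$ (the reward values are made up):

```python
# Discounted return G_t = sum_k gamma^k * R_{t+k+1} for a toy reward sequence.
rewards = [1.0, 0.0, 2.0, 3.0]

def discounted_return(rewards, gamma):
    return sum(gamma**k * r for k, r in enumerate(rewards))

print(discounted_return(rewards, 0.9))  # 1 + 0 + 0.81*2 + 0.729*3 ≈ 4.807
print(discounted_return(rewards, 0.1))  # near-sighted agent: ≈ 1.023
```

With $\gamma = 0.1$ the distant reward of $3$ contributes almost nothing, while with $\gamma = 0.9$ it dominates the return, matching the interpretation above.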
In order to maximize $G_t$, the agent picks an action $a$ when in a state $s$ according to a policy function $\pi$. A policy function is a probabilistic mapping from the state space to the action space, $\pi : S \to A$. The policy function outputs a probability for taking each action in a given state, so it can also be denoted as $\pi(a \mid s)$.
II-B Q-Learning
Most value-based reinforcement learning algorithms try to estimate the value function, which gives an estimate of how good a state is for the agent to reside in. It is estimated as the expected return from a state under a policy $\pi$ and is denoted as $V_\pi(s)$:

$$V_\pi(s) = \mathbb{E}_\pi\left[G_t \mid S_t = s\right] \tag{5}$$
Q-learning is a value-based reinforcement learning algorithm that tries to maximize the $Q$ function [28]. The $Q$ function is a state-action value function, denoted by $Q_\pi(s, a)$. It estimates the expected return given a state and an action performed in that state:

$$Q_\pi(s, a) = \mathbb{E}_\pi\left[G_t \mid S_t = s, A_t = a\right] \tag{6}$$
According to the Bellman optimality equation [26], the optimal $Q$ function can be obtained by:

$$Q^*(s, a) = \mathbb{E}\left[R_{t+1} + \gamma \max_{a'} Q^*(s', a') \mid S_t = s, A_t = a\right] \tag{7}$$
where $s'$ is the state following $s$ after taking action $a$. With $a^* = \arg\max_{a'} Q^*(s, a')$ being the optimal action resulting in the maximum reward, the optimal policy is formed as $\pi^*(s) = a^*$. This tabular-style Q-learning method was proposed in [28]. The update rule for each time step of Q-learning is as follows:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t)\right] \tag{8}$$
Q-learning is an incremental dynamic programming algorithm that determines the optimal policy in a step-by-step manner. At each step $t$, the agent performs the following operations:

Observes the current state $s_t$.

Selects and performs an action $a_t$.

Observes the next state $s_{t+1}$.

Receives the reward $r_{t+1}$.

Updates the Q-values using equation 8.
The value function converges to the optimal value as $t \to \infty$. A detailed convergence proof and analysis can be found in [28].
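The per-step update (equation 8) can be sketched as a single function; the table, rewards and hyperparameters below are toy values for illustration only:

```python
def q_update(Q, s, a, r, s2, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step (equation 8) on a nested-list Q-table."""
    Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])

# A single update moves Q(s, a) toward r + gamma * max_a' Q(s', a').
Q = [[0.0, 0.0], [1.0, 2.0]]
q_update(Q, s=0, a=0, r=-1.0, s2=1)
print(round(Q[0][0], 3))  # 0 + 0.1 * (-1 + 0.9*2 - 0) = 0.08
```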
This tabular Q-learning method is used in our proposed approach to generate a Q-matrix for the shortest path to the exit based on the building model. In order to incorporate the shortest-path information, this Q-matrix is used to pretrain the DQN models.
II-C Deep Q-Network
The tabular Q-learning approach works well for small environments, but becomes infeasible for complex environments with large multi-dimensional discrete or continuous state-action spaces. To deal with this problem, a parameterized version of the $Q$ function, $Q(s, a; \theta)$, is used for approximation. This way of function approximation was first proposed in [29].
Deep Neural Networks (DNNs) have become the predominant method for approximating complex intractable functions. They have become the de facto method for applications such as image processing and classification [30, 31, 32, 33, 34, 35, 36], speech recognition [37, 38, 39, 40, 41, 42, 43], and natural language processing [44, 45, 46, 47, 48]. DNNs have also been applied successfully to reinforcement learning problems, achieving noteworthy performance [49, 50].

The most noteworthy research integrating deep neural networks and Q-learning in an end-to-end reinforcement learning fashion is Deep Q-Networks (DQNs) [51, 52]. To deal with the curse of dimensionality, a neural network is used to approximate the parameterised Q-function $Q(s, a; \theta)$. The network takes a state as input and approximates Q-values for each action based on the input state. The parameters $\theta$ are updated, and the Q-function refined, in every iteration through an appropriate optimizer like Stochastic Gradient Descent [53], RMSProp [54], Adagrad [55], Adam [56], etc. The network outputs Q-values for each action for the input state, and the action with the highest Q-value is selected. (There is another, less frequently used, DQN architecture that takes the state and action as input and returns the corresponding Q-value as output.) The DQN can be trained by optimizing the following loss function:
$$L_i(\theta_i) = \mathbb{E}_{s, a \sim \rho(\cdot)}\left[\left(y_i - Q(s, a; \theta_i)\right)^2\right] \tag{9}$$

where $\gamma$ is the discount factor, and $\theta_i$ and $\theta_{i-1}$ are the Q-network parameters at iterations $i$ and $i-1$ respectively. In order to train the Q-network, we require a target to calculate the loss and optimize the parameters. The target Q-values are obtained by holding the parameters fixed from the previous iteration:

$$y_i = \mathbb{E}\left[r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}) \mid s, a\right] \tag{10}$$

where $y_i$ is the target for the $i$-th iteration used to refine the Q-network. Unlike supervised learning, where the optimal target values are known and fixed prior to learning, in DQN the approximate target values $y_i$, which depend on the network parameters, are used to train the Q-network. The loss function can be rewritten as:

$$L_i(\theta_i) = \mathbb{E}_{s, a \sim \rho(\cdot)}\left[\left(r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}) - Q(s, a; \theta_i)\right)^2\right] \tag{11}$$
The process of optimizing the loss function at the $i$-th iteration, holding the parameters from the previous iteration fixed to obtain the target values, results in a sequence of well-defined optimization steps. By differentiating the loss function in equation 11, we get the following gradient:

$$\nabla_{\theta_i} L_i(\theta_i) = \mathbb{E}_{s, a \sim \rho(\cdot)}\left[\left(r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}) - Q(s, a; \theta_i)\right) \nabla_{\theta_i} Q(s, a; \theta_i)\right] \tag{12}$$
Instead of computing the full expectation of the above gradient, we optimize the loss function using an appropriate optimizer (in this paper, the Adam optimizer [56]). The DQN is a model-free algorithm, since it directly solves tasks without explicitly estimating the environment dynamics. Also, DQN is an off-policy method: it learns the greedy policy $a = \arg\max_{a'} Q(s, a'; \theta)$, while following an $\epsilon$-greedy policy for sufficient exploration of the state space.
One of the drawbacks of using a nonlinear function approximator like a neural network is that it tends to diverge and is quite unstable for reinforcement learning. The instability arises mostly from: correlations between subsequent observations; the fact that small changes in Q-values can significantly change the policy; and the correlations between Q-values and target values.
The most well-known and simple technique to alleviate the problem of instability is experience replay [57]. At each timestep, a tuple consisting of the agent's experience, $e_t = (s_t, a_t, r_t, s_{t+1})$, is stored in a replay memory over many episodes. A minibatch of these tuples is randomly drawn from the replay memory to update the DQN parameters. This ensures that the network isn't trained on a sequence of consecutive observations (avoiding strong correlations between samples and reducing variance between updates), and it increases sample efficiency. This technique greatly increases the stability of DQN.
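A minimal sketch of such a replay memory (the capacity, batch size and dummy transitions below are arbitrary illustrative values):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of (s, a, r, s', done) experience tuples."""

    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)  # old experiences fall off the end

    def push(self, s, a, r, s2, done):
        self.buffer.append((s, a, r, s2, done))

    def sample(self, batch_size):
        # Uniform random draws break the correlation between consecutive
        # observations that plain on-line updates would suffer from.
        return random.sample(list(self.buffer), batch_size)

buf = ReplayBuffer()
for t in range(100):
    buf.push(t, t % 4, -1.0, t + 1, False)    # dummy transitions
batch = buf.sample(32)
print(len(batch))  # 32
```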
II-D Double DQN
Q-learning and DQN are capable of achieving performance beyond the human level on many occasions. However, in some cases Q-learning performs poorly, and so does its deep neural network counterpart, DQN. The main reason behind such poor performance is that Q-learning tends to overestimate action values. These overestimations are caused by a positive bias resulting from the $\max$ operator in the Q-learning and DQN updates, which outputs the maximum action value as an approximation of the maximum expected action value.
The Double Q-learning method was proposed in [58] to alleviate this problem and was later extended to DQN [59] to produce the Double DQN (DDQN) method. Since Q-learning uses the same estimator to select and evaluate an action, which results in over-optimistic action values, it can be interpreted as a single-estimator approach. In Double Q-learning, evaluation and selection are decoupled using a double-estimator approach consisting of two functions, $Q^A$ and $Q^B$. The $Q^A$ function is updated with a value from the $Q^B$ function for the next state, and the $Q^B$ function is updated with a value from the $Q^A$ function for the next state.
Let

$$a^* = \arg\max_{a} Q^A(s', a) \tag{13}$$

$$b^* = \arg\max_{a} Q^B(s', a) \tag{14}$$

Then,

$$Q^A(s, a) \leftarrow Q^A(s, a) + \alpha \left[r + \gamma Q^B(s', a^*) - Q^A(s, a)\right] \tag{15}$$

$$Q^B(s, a) \leftarrow Q^B(s, a) + \alpha \left[r + \gamma Q^A(s', b^*) - Q^B(s, a)\right] \tag{16}$$

where $a^*$ is the action with the maximum Q-value in state $s'$ according to the $Q^A$ function, and $b^*$ is the action with the maximum Q-value in state $s'$ according to the $Q^B$ function.
The double-estimator technique is unbiased, resulting in no overestimation of action values, since action evaluation and action selection are decoupled into two functions that use separate estimates of the action values. In fact, a thorough analysis of Double Q-learning in [58] shows that it may sometimes underestimate action values.
The Double Q-learning algorithm was adapted to large state-action spaces in [59], forming the Double DQN method in a similar way as DQN. The two Q-functions ($Q^A$ and $Q^B$) can be parameterised by two sets of weights, $\theta$ and $\theta'$. At each step, one set of weights is used to determine the greedy policy and the other to calculate its value. For Double DQN, equation 10 can be written as:

$$y_t = r + \gamma\, Q\!\left(s', \arg\max_{a} Q(s', a; \theta_t); \theta'_t\right) \tag{17}$$
The first set of weights, $\theta_t$, is used to determine the greedy policy, just like in DQN. But in Double DQN, the second set of weights, $\theta'_t$, is used for an unbiased value estimation of that policy. Both sets of weights can be updated symmetrically by switching the roles of $\theta$ and $\theta'$.
The target value network in DQN can be used as the second Q-function instead of introducing an additional network. So, the weights $\theta_t$ of the current iteration are used to determine the greedy policy, and the weights $\theta_{t-1}$ of the previous iteration to estimate its value. The update rule remains the same as in DQN, while the target changes to:

$$y_t = r + \gamma\, Q\!\left(s', \arg\max_{a} Q(s', a; \theta_t); \theta_{t-1}\right) \tag{18}$$
Note that in both DQN and DDQN, the target network uses the parameters $\theta_{t-1}$ of the previous iteration. More generally, however, the target network can use parameters from any earlier iteration; the target network parameters are then updated periodically with copies of the parameters of the online network.
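The difference between the two targets can be illustrated with a small numeric sketch (the Q-value vectors below are made up to exhibit an overestimated action):

```python
import numpy as np

def dqn_target(r, q_next_target, gamma=0.99):
    # Standard DQN: the target network both selects and evaluates the action.
    return r + gamma * q_next_target.max()

def ddqn_target(r, q_next_online, q_next_target, gamma=0.99):
    # Double DQN: the online network selects the action,
    # the target network evaluates it (equation 18).
    a_star = int(q_next_online.argmax())
    return r + gamma * q_next_target[a_star]

# If the target net overestimates an action the online net would not pick,
# the Double DQN target is lower, i.e. less over-optimistic.
q_online = np.array([1.0, 0.5])
q_target = np.array([0.8, 5.0])   # action 1 is overestimated by the target net
print(dqn_target(0.0, q_target))               # ≈ 4.95
print(ddqn_target(0.0, q_online, q_target))    # ≈ 0.792
```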
II-E Dueling DQN
In quite a few RL applications, it is sometimes unnecessary to estimate the value of each action; in many states, the choice of action has no consequence on the outcome. A new architecture for model-free reinforcement learning, called the dueling architecture, is proposed in [60]. The dueling architecture explicitly separates state values and action advantage values into two streams which share a common feature-extraction backbone network. The architecture is similar to the DQN and DDQN architectures; the difference is that instead of a single stream of fully connected layers, there are two streams providing estimates of the value and state-dependent advantage functions. The two streams are combined at the end, producing a single Q-function.
One stream outputs a scalar state value, while the other outputs an advantage vector with dimensionality equal to the number of actions. Both streams are combined at the end to produce the Q-function estimate. The combining module at the end can simply aggregate the value and advantage estimates as:

$$Q(s, a; \theta, \alpha, \beta) = V(s; \theta, \beta) + A(s, a; \theta, \alpha) \tag{19}$$
where $\theta$ are the parameters of the lower layers of the network (before the streams split), and $\alpha$ and $\beta$ are the parameters of the advantage and value streams respectively. However, such an aggregation would require $V(s; \theta, \beta)$ to be replicated as many times as the dimensionality of $A(s, a; \theta, \alpha)$. Also, the value and advantage estimates cannot be uniquely recovered given the estimated Q-function.
One way of addressing these issues, proposed in [60], is to force the advantage function estimator to have zero value at the selected action. This aggregation is implemented in the combining module as:

$$Q(s, a; \theta, \alpha, \beta) = V(s; \theta, \beta) + \left(A(s, a; \theta, \alpha) - \max_{a'} A(s, a'; \theta, \alpha)\right) \tag{20}$$
Now, for the chosen action $a^* = \arg\max_{a'} Q(s, a'; \theta, \alpha, \beta) = \arg\max_{a'} A(s, a'; \theta, \alpha)$, substituting into equation 20 gives $Q(s, a^*; \theta, \alpha, \beta) = V(s; \theta, \beta)$. Hence, the two streams can be uniquely recovered.
In [60], another way of aggregation is proposed which eliminates the $\max$ operator:

$$Q(s, a; \theta, \alpha, \beta) = V(s; \theta, \beta) + \left(A(s, a; \theta, \alpha) - \frac{1}{|A|} \sum_{a'} A(s, a'; \theta, \alpha)\right) \tag{21}$$
where $|A|$ is the number of actions. Even though the value and advantage estimates are now off-target by a constant, this way of aggregation improves stability by capping the changes in the advantage estimates by their mean, and it enhances overall performance.
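The two aggregation schemes (equations 20 and 21) can be sketched with made-up value and advantage estimates:

```python
import numpy as np

def dueling_q(value, advantages, mode="mean"):
    """Combine the value and advantage streams into Q-values
    (equation 20 for mode='max', equation 21 for mode='mean')."""
    adv = np.asarray(advantages, dtype=float)
    if mode == "max":
        return value + (adv - adv.max())   # zero advantage at the best action
    return value + (adv - adv.mean())      # mean-subtracted variant

v, a = 2.0, [1.0, 3.0, 2.0]
print(dueling_q(v, a, "max"))   # [0. 2. 1.]  -> Q at the argmax equals V
print(dueling_q(v, a, "mean"))  # [1. 3. 2.]  -> shifted by the mean advantage
```

Note that with the `max` variant the Q-value of the best action is exactly the state value $V$, which is what makes the two streams uniquely recoverable.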
In this paper, we use the above-mentioned off-policy, model-free algorithms on our novel fire evacuation environment and significantly improve the performance of each of them by transferring tabular Q-learning knowledge of the building structure into these methods.
III Fire Emergency Evacuation System
In this paper, we propose the first fire evacuation environment for training reinforcement learning agents, and a new transfer-learning-based tabular Q-learning + DQN method that outperforms state-of-the-art RL agents on the proposed environment. The fire evacuation environment implements realistic dynamics that simulate real-world fire scenarios. For such a complex environment, an out-of-the-box RL agent doesn't suffice. We therefore incorporate crucial information, like the shortest path to the exit from each room, into the agent before training it. The rest of this section explains the entire system in detail.
III-A The Fire Evacuation Environment
We propose the first benchmark environment for fire evacuation to train reinforcement learning agents. To the best of our knowledge, this is the first environment of its kind. The environment has been specifically designed to simulate realistic fire dynamics and scenarios that frequently arise in real-world fire emergencies. We have implemented the environment in the OpenAI Gym format [11] to facilitate further research.
The environment has a graph-based structure to represent a building model. Let $G = (V, E)$ be an undirected graph, where $V$ is the set of vertices representing rooms and hallways, and $E$ is the set of edges representing the paths connecting them. A simple fire evacuation environment consisting of 5 rooms and the paths connecting them is shown in Fig. 2.
To represent the graph consisting of rooms, hallways and connecting paths, we use the adjacency matrix. It is a square matrix whose elements indicate whether a pair of vertices is connected by an edge or not. The adjacency matrix is used to represent the structure of the graph and to check the validity of actions performed by the agent. The adjacency matrix for the building model follows directly from the graph shown in Fig. 2.
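As an illustration of this validity check, consider the small hypothetical 5-room graph below (an assumed example, not necessarily the exact graph of Fig. 2):

```python
import numpy as np

# Hypothetical symmetric 5-room adjacency matrix: A[i, j] = 1 iff rooms i and
# j are connected by a path (the graph of Fig. 2 may differ).
A = np.array([[0, 1, 0, 0, 1],
              [1, 0, 1, 0, 0],
              [0, 1, 0, 1, 0],
              [0, 0, 1, 0, 1],
              [1, 0, 0, 1, 0]])

def is_valid_action(A, src, dst):
    """Moving a person from src to dst is legal iff an edge exists."""
    return bool(A[src, dst])

print(is_valid_action(A, 0, 1))  # True: rooms 0 and 1 share a path
print(is_valid_action(A, 0, 3))  # False: no direct path, so heavy penalty
```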
The environment dynamics are defined as follows:
State
Each vertex $v \in V$ of the graph represents a room, and each room is associated with an integer, the number of people in that room. The state of the environment is given by a vector consisting of the number of people in each room. To force the RL agent to learn the environment dynamics by itself, the environment doesn't provide any feedback to the agent apart from the state (the number of people left in each room) and the reward.
Action
An agent performs an action by moving a person from one room to another, and the state is updated after every valid action. The action space is therefore discrete. To keep things simple, we restrict the agent to moving one person from one room per time step. The agent can attempt to move a person from any room to any other room at any time step, even if the rooms aren't connected by a path. So, the number of possible actions at each step is $N^2$, where $N$ is the number of rooms.
This action space is necessary so that the agent can easily generalize to any graph structure. It also enables the agent to directly select which room to take people from and which room to send people to, instead of going through rooms in a serial manner or assigning priorities.
When the agent selects an action for which there is no path between the rooms, it is heavily penalized. Due to this unrestricted action space and penalization, the agent learns the graph structure (building model) with sufficient training, and in the end performs only valid actions. The adjacency matrix is used to check the validity of actions.
Note that our graph-based fire evacuation environment has $N^2$ possible actions (even though many of them are illegal moves that incur huge penalties), where $N$ is the number of rooms. Even for the small example of Fig. 2 with $N = 5$ rooms, there are already $25$ possible actions, which is more than in almost all of the OpenAI Gym environments and Atari game environments [11].
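Assuming the $N^2$ actions are enumerated as ordered (source, destination) room pairs — an illustrative encoding, not necessarily the paper's exact one — the index arithmetic is straightforward:

```python
def encode_action(src, dst, n_rooms):
    """Flatten an ordered (source, destination) pair into a single index."""
    return src * n_rooms + dst

def decode_action(a, n_rooms):
    """Recover the (source, destination) pair from a flat action index."""
    return divmod(a, n_rooms)

N = 5
print(encode_action(2, 4, N))   # 14
print(decode_action(14, N))     # (2, 4)
print(N * N)                    # 25 possible actions for a 5-room building
```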
Reward
We design a reward function specifically suited to our environment. We use an exponential decay function to reward/penalize the agent depending on the action it takes, and to simulate the fire spread as well. The reward function is:

$$r_t = -e^{\,t \cdot d_v} \tag{22}$$

where $t$ is the time step, $v$ is the room a person is moved to, and $d_v$ is the degree of fire spread for that room. $d_v$ is a positive number, and a higher degree of fire spread for a room means that the fire is spreading more rapidly towards that room. We explicitly assign degrees to each room using a degree vector, where the maximum value belongs to the room where the fire is located.

Using such a reward function ensures the following. Firstly, the reward values drop exponentially at every time step as the fire grows and spreads. Secondly, the reward of an action depends on the room a person is moved to: the reward function penalizes an action more heavily if a person is moved to a more dangerous room (one with a higher degree of fire spread towards it), because the function then yields more rapidly decaying negative rewards. Lastly, the function yields a negative reward for every action, which forces the agent to seek the least number of timesteps. The reward for reaching the exit is a positive constant.
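A concrete sketch of a reward of this shape is given below; the degree values and the exit reward are hypothetical, and the paper's exact constants may differ:

```python
import math

def step_reward(t, dst, degree, exits, exit_reward=10.0):
    """Hypothetical exponential-decay reward consistent with the description.
    A higher degree[dst] means the fire is spreading toward dst faster, so the
    negative reward decays more rapidly with the timestep t."""
    if dst in exits:
        return exit_reward          # positive constant for reaching an exit
    return -math.exp(t * degree[dst])

degree = [0.1, 0.5, 0.05]           # room 1 is closest to the fire
exits = {2}
print(step_reward(3, 0, degree, exits))  # -e^0.3  ≈ -1.35
print(step_reward(3, 1, degree, exits))  # -e^1.5  ≈ -4.48: more dangerous room
print(step_reward(3, 2, degree, exits))  # +10.0: person reached the exit
```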
Fire Location(s) and Exit(s)
The room where the fire occurs is given the highest degree, and hence the maximum penalty for entering it. The direction of fire spread is decided randomly and the degrees are assigned accordingly. The degrees are updated gradually to simulate fire spread:

$$d_{v, t+1} = d_{v, t} + \epsilon_v \tag{23}$$

where $\epsilon_v$ is a small number associated with room $v$, assigned to each room according to the fire spread direction. So, $\epsilon_v$ can be used to determine the fire spread direction, since a higher value of $\epsilon_v$ for a room means that the fire is spreading towards that room more rapidly.
As shown in Fig. 2, the fire spread is decided randomly and independently for all rooms.
The exit is also treated like a room, the only difference being that the agent gets a positive reward for moving people to the exit. The number of people at the exit is reset to zero after every action. The rooms that are exits are stored in a separate vector.
Bottleneck
Probably the most important feature of our proposed fire evacuation environment, in terms of realism, is the bottleneck in rooms. We put an upper limit on the number of people that can be in a room at any time step. This restriction models congestion during evacuation, which has been a huge problem in emergency situations. The bottleneck information is not explicitly provided to the agent; instead, the agent learns about this restriction during training, since it receives a negative reward if the number of people in a room exceeds the bottleneck value. The bottleneck is set to 10 in our experiments.
Uncertainty
To take into account the uncertain behaviour of the crowd and introduce stochasticity into the environment, a person moves from one room to the other with probability $p$. This means that an action selected by the agent at a given timestep is performed with probability $p$ or ignored with probability $1 - p$. If the action is ignored, there is no change in the state, but the agent receives the reward as if the action had been performed. This acts like a regularizing parameter, and because of it the agent can never converge to the actual global minimum. In our experiments, the uncertainty probability is kept at a fixed value.
Terminal Condition
The terminal/goal state is reached once there are no people left in any of the rooms, i.e., the state vector is all zeros.
The pseudocode for the proposed environment is given in Algorithm 1.
From Algorithm 1, we can see that the agent receives a heavier penalty for an illegal move than for a bottleneck violation or for moving towards the fire. In a way, the rewards are used to assign priorities to scenarios, and they can easily be changed if needed.
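A condensed sketch of one environment step in the spirit of Algorithm 1 is shown below; the penalty values, the probability `p`, and the 3-room graph are illustrative assumptions, not the paper's exact constants:

```python
import math
import random

def env_step(state, action, t, adj, degree, exits, bottleneck=10, p=0.9):
    """One step of the evacuation dynamics: illegal moves get the heaviest
    penalty, then bottleneck violations; valid moves are rewarded by the
    exponential-decay function, and are ignored with probability 1 - p."""
    src, dst = action
    if adj[src][dst] == 0 or state[src] == 0:
        return list(state), -100.0, False      # illegal move: heaviest penalty
    if dst not in exits and state[dst] + 1 > bottleneck:
        return list(state), -50.0, False       # bottleneck violation
    new_state = list(state)
    if random.random() < p:                    # crowd-behaviour uncertainty:
        new_state[src] -= 1                    # with prob 1 - p the move is
        if dst not in exits:                   # ignored but still rewarded
            new_state[dst] += 1                # (people at an exit reset to 0)
    reward = 10.0 if dst in exits else -math.exp(t * degree[dst])
    done = all(n == 0 for n in new_state)
    return new_state, reward, done

adj = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]        # 3 rooms, room 2 is the exit
degree = [0.1, 0.3, 0.0]
random.seed(0)
s, r, done = env_step([1, 0, 0], (0, 1), t=0, adj=adj, degree=degree, exits={2})
print(r)    # -1.0: -exp(0 * 0.3) at t = 0
_, r2, _ = env_step(s, (0, 0), t=1, adj=adj, degree=degree, exits={2})
print(r2)   # -100.0: no self-loop edge, illegal move
```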
Pretraining Environment
We create two instances of our environment: one for fire evacuation and the other for shortest-path pretraining. For the pretraining instance, we consider only the graph structure, and the aim is to reach the exit from every room in the minimum number of timesteps.
The pretraining environment consists of the graph structure only, i.e., the adjacency matrix. It doesn't contain fire, the number of people in each room, or bottlenecks. The rewards are static integers: $-1$ for every valid path, to force the agent to take the minimum number of timesteps; $-10$ for illegal actions (where there is no path); and $+1$ for reaching the exit. The agent is thus trained to incorporate the shortest-path information of the building model.
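This static reward scheme can be sketched as a small function (assuming the $-1$/$-10$/$+1$ values above; the tiny 2-room graph is a made-up example):

```python
def pretrain_reward(adj, src, dst, exits):
    """Static rewards of the pretraining instance: -10 for illegal moves,
    +1 for reaching an exit, -1 for any other valid path."""
    if adj[src][dst] == 0:
        return -10
    return 1 if dst in exits else -1

adj = [[0, 1], [1, 0]]                         # two rooms, room 1 is the exit
print(pretrain_reward(adj, 0, 1, exits={1}))   # 1: moved to the exit
print(pretrain_reward(adj, 0, 0, exits={1}))   # -10: no self-loop edge
```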
The pseudocode for the pretraining environment is given in Algorithm 2.
The procedure is repeated until the agent converges to the shortest path from any room to the exit.
III-B Similarities and Differences with Other Environments
The fire evacuation environment is implemented in the OpenAI Gym format [11] to enable future research on the topic. OpenAI Gym environments consist of four basic methods: init, step, render and reset. Our environment implements the same four methods.
The init method performs the initialization of the environment. In our case, it sets the action space size, the state space size, the starting state (an array containing the number of people in each room/vertex), the adjacency matrix of the graph based building model, and the fire location(s). The reset method simply sets the environment back to these initial conditions.
The step method follows Algorithm 1, without the while loop. It takes the action performed at the current timestep as its argument and returns the next state, a boolean terminal variable (indicating whether the terminal state was reached by the action performed) and the reward for performing the action. The next state, reward and terminal flag depend on the conditions of the environment as shown in Algorithm 1. The render method simply returns the current state.
The pretraining environment instance has the same method structure. The only difference is in the step method (Algorithm 2, excluding the while loop), where the reward system is simplified with fewer conditions and the state is represented as the set of empty vertices (rooms with no people) of the graph.
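The four-method gym interface described above can be sketched as a minimal class skeleton. This is a structural sketch only: the attribute names are assumptions, and the dynamics of step() (fire spread, bottlenecks, rewards) are elided since they live in Algorithm 1.

```python
import numpy as np

class EvacuationEnvSketch:
    """Skeleton of the four-method OpenAI gym interface (init, step,
    render, reset). Only the structure is shown; step() is a placeholder."""
    def __init__(self, adjacency, start_state, fire_rooms):
        self.adjacency = adjacency                 # building graph
        self.start_state = np.array(start_state)   # people per room
        self.fire_rooms = list(fire_rooms)
        self.n_rooms = len(start_state)
        self.n_actions = self.n_rooms ** 2         # any room -> any room
        self.state = self.start_state.copy()

    def reset(self):
        self.state = self.start_state.copy()
        return self.state

    def render(self):
        return self.state

    def step(self, action):
        # Placeholder: the reward/terminal logic follows Algorithm 1.
        reward = 0.0
        done = bool((self.state == 0).all())       # no people left anywhere
        return self.state, reward, done
```

The $n^2$ action count matches the paper's 25 actions for 5 rooms.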
Even though our environment has the same structure as any OpenAI gym environment, it differs considerably in functionality from other gym or game-based environments. In some ways, it may resemble the mouse maze game, in which the player (mouse) must reach the goal (cheese) through a maze in the fewest possible steps. But it is drastically different in many ways:

Our environment is a graph based environment with much less connectivity than the maze environment, which makes finding the optimal route difficult.

The optimal path(s) might change dynamically from one episode to the next, or even within a few timesteps, due to fire spread and uncertainty in the fire evacuation environment, while the optimal path(s) in the mouse maze game remain the same.

All the people in all the rooms must be evacuated to the nearest exit in the minimum number of timesteps under dynamic and uncertain conditions with bottlenecks, whereas in the mouse maze environment only an optimal path from the starting point to the goal needs to be found.

The fire evacuation problem requires finding multiple optimal paths for all people in all rooms, while avoiding penalizing conditions like fire, fire spread and bottlenecks, whereas the mouse maze problem is a simple point-to-point problem.

The mouse maze environment is static and lacks any variations, uncertainty or dynamics. On the other hand, the fire evacuation environment is dynamic, variable and uncertain.

In the maze environment, the shortest path to the goal state is always the best. But, in the fire evacuation environment, even though the DQN agent is pretrained on the shortest path information, the shortest path to the exit might not be the best due to fire, fire spread and bottlenecks.

The fire evacuation environment has a much larger action space than the maze environment (four actions: up, down, left, right), because all actions can be performed even if they are illegal (yielding high penalties), so that the RL agent learns the building structure (graph model).

Finally, a graph is a much better way to model a building's structure than a maze, since connectivity is naturally described by a graph. It is what graphs were made for: depicting the relationships (connections) between entities (vertices).
Hence, the fire evacuation problem is a much more complex and dramatically different problem than the mouse maze problem or any other game based problem. Even the game of Go has $19^2 = 361$ possible actions, whereas the fire evacuation environment has $N^2$ possible actions for $N$ rooms, i.e., as the number of rooms increases, the number of possible actions grows quadratically (although the rules of Go are far more complex for an RL agent to interpret).
III-C Q-matrix Pretrained Deep Q-Networks
For the proposed graph based fire evacuation environment, we also present a new reinforcement learning technique that combines Q-learning with DQN (and its variants). We apply tabular Q-learning to the simpler pretraining environment, with its small state space, to learn the shortest path from each room to the nearest exit. The output of this stage is a Q-matrix containing the Q-values of state-action pairs according to the shortest paths.
This Q-matrix is used to transfer the shortest-path information to the DQN agent(s). This is done by pretraining the agent's neural network, deliberately overfitting it to the Q-matrix. After pretraining, the network weights have the shortest-path information incorporated in them. The agent is then trained on the complete fire evacuation environment to learn to produce the optimal evacuation plan.
The main purpose of this pretraining strategy is to provide the agent with vital information about the environment beforehand, so that it does not have to learn all the complexities of the environment at once. Since the agent already knows the shortest paths to the nearest exits after pretraining, dealing with the other aspects of the environment, like fire, fire spread, number of people and bottlenecks, becomes easier.
We provide two instances of our environment: a simpler shortest-path pretraining instance and the complex fire evacuation instance. First, the agent is pretrained on the simpler instance (for shortest-path pretraining) and then trained on the more complex instance (for optimal evacuation). This approach of training the agent on a simpler version of the problem before the actual complex problem is somewhat similar to curriculum learning [61].
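The "overfit the network to the Q-matrix" step can be sketched as a small supervised regression. NumPy is used here for self-containment; the paper's agents use Keras/TensorFlow networks, and the architecture, learning rate and epoch count below are illustrative assumptions, not the paper's values.

```python
import numpy as np

def pretrain_on_qmatrix(q_matrix, hidden=32, epochs=5000, lr=0.1, seed=0):
    """Deliberately overfit a tiny two-layer ReLU network to a tabular
    Q-matrix (rows: states, columns: actions), as a stand-in for
    pretraining the DQN's weights. Returns a predict function."""
    rng = np.random.default_rng(seed)
    n_states, n_actions = q_matrix.shape
    X = np.eye(n_states)                      # one-hot state inputs
    W1 = rng.normal(0, 0.1, (n_states, hidden))
    W2 = rng.normal(0, 0.1, (hidden, n_actions))
    for _ in range(epochs):
        H = np.maximum(0.0, X @ W1)           # ReLU hidden layer
        pred = H @ W2                         # linear output (Q-values)
        err = pred - q_matrix                 # MSE gradient terms
        gW2 = H.T @ err / n_states
        gH = err @ W2.T
        gH[H <= 0] = 0.0                      # ReLU backprop mask
        gW1 = X.T @ gH / n_states
        W1 -= lr * gW1
        W2 -= lr * gW2
    return lambda s: np.maximum(0.0, np.eye(n_states)[s] @ W1) @ W2
```

After this phase, the network's greedy action in each state matches the tabular shortest-path policy, which is exactly the information being transferred.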
We also add a small amount of noise, or offset, to the Q-matrix produced by training on the pretraining environment instance. This is done by adding or subtracting (depending on the sign of the Q-value) a small constant $\rho$ to each element of the Q-matrix:

$Q(s,a) \leftarrow Q(s,a) + \rho$ if $Q(s,a) \le 0$, and $Q(s,a) \leftarrow Q(s,a) - \rho$ if $Q(s,a) > 0$

where $\rho$ can be thought of as a regularization parameter, set to a small value in our experiments. Adding noise to the Q-matrix is necessary because we do not want the DQN agent to simply memorize all the paths and get stuck in a local optimum. The actual fire evacuation instance is complex, dynamic and uncertain, which means that an optimal path at one timestep might not be optimal at the next; the hyperparameter $\rho$ acts as a regularizer. Note that we add $\rho$ if an element of the Q-matrix is negative or zero and subtract $\rho$ if it is positive. This is done to offset the imbalance between good and bad actions: if we simply added or subtracted $\rho$ everywhere, the relative differences between Q-values would remain the same. Conditional addition or subtraction prevents the DQN agent from being biased towards a particular set of actions leading to an exit.
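The conditional offset is a one-liner with a vectorized branch. This is a sketch; the symbol $\rho$ and the value 0.1 are illustrative placeholders, not the paper's setting.

```python
import numpy as np

def offset_qmatrix(q, rho=0.1):
    """Shrink positive Q-values by rho and raise non-positive ones by rho,
    per the conditional scheme described above (rho is an assumed value)."""
    return np.where(q > 0, q - rho, q + rho)
```

Because good (positive) entries shrink while bad (non-positive) entries rise, the gap between them narrows, which is what keeps the DQN from locking onto one memorized set of actions.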
Even though pretraining adds some overhead to the system, there are several advantages including:
 Better Conditioning

Pretraining provides the neural network with a better starting position of weights for training compared to random initializations.
 Faster Convergence

Since the neural network weights are better conditioned due to pretraining, training starts closer to the optimum and convergence is faster.
 Crucial Information

Especially in the case of fire evacuation, pretraining with shortest path information provides the agent with crucial information about the environment before training begins.
 Increased Stability

As pretraining places the weights in a better basin of attraction in parameter space, the probability of divergence is reduced, which makes the model stable.
 Fewer number of updates

As the weights start near the optimum on the error surface, fewer updates are required to reach it, which results in fewer memory updates and fewer training epochs.
The pseudocode for the proposed Qmatrix pretrained DQN algorithm is given in Algorithm 3.
Algorithm 3 consists of three functions: one for tabular Q-learning on the pretraining environment instance, finding the optimal Q-values for the shortest path from each room to the nearest exit; one for overfitting the agent's network to the shortest-path Q-matrix, incorporating that information into the DQN agent; and one for training the pretrained DQN agent on the fire evacuation environment to learn the optimal evacuation plan.
Modifying the final training part to include Double DQN and Dueling DQN Agents is straightforward.
III-D Pretraining Convergence
The paper [25] thoroughly analyses and proves conditions under which task transfer Q-learning works. We use the propositions and theorems proved in [25] to show that pretraining works in our case.
Let the pretraining instance and the fire evacuation instance be represented by two MDPs, $M_p$ for the pretraining instance and $M_f$ for the fire evacuation instance. Then, according to Proposition 1 in [25]:
(24) 
where $d(M_p, M_f)$ is the distance between the MDPs $M_p$ and $M_f$, and $Q^*_p$ and $Q^*_f$ are the corresponding optimal Q-functions. In our case, the two instances share the same transition structure, and our environments are deterministic MDPs, i.e., taking an action at a state always leads to one specific next state, with probability 1. This makes the second and third terms of equation (24) zero, so the distance between the two instances of our environment reduces to the first term only.
So, according to Proposition 1, if the distance between the two MDPs is small, then the Q-function learned on the pretraining task is closer to the optimum of the fire evacuation task than a random initialization, which helps convergence to an optimum and improves the speed of convergence.
Also, convergence is guaranteed according to Theorem 4 in [25] if the safe condition is met:
(25) 
where the condition bounds the distance between the MDPs in terms of the Bellman error. In our case, the distance between the two instances is small, and multiplying it by a factor that is less than one (since the discount factor $\gamma < 1$) keeps it below the Bellman error bound. This seems natural in our case, since the two MDPs are instances of the same environment. Hence convergence is guaranteed, as long as the shortest-path Q-matrix obtained from the pretraining environment itself converges.
Now, to prove that our method has guaranteed convergence, we need to prove that the Qmatrix is able to capture the shortest path information accurately.
III-E Convergence Analysis of Q-learning for Finding the Shortest Path
The convergence of Q-learning has been discussed and proved in many different ways, for general as well as specific settings [28, 62]. Convergence is guaranteed, when using the update rule given in equation 8, if the learning rate $\alpha_t$ is bounded, $0 \le \alpha_t(s,a) < 1$, and the following conditions hold:

$\sum_{t} \alpha_t(s,a) = \infty \quad \text{and} \quad \sum_{t} \alpha_t^2(s,a) < \infty$ (26)

Then, as $t \to \infty$, $Q_t(s,a) \to Q^*(s,a)$ with probability 1. For these learning rate conditions to hold for every state-action pair, all state-action pairs must be visited an infinite number of times. The only complication here is that some state-action pairs might never be visited.
In our pretraining environment, which is an episodic task, we can make sure that all state-action pairs are visited by starting episodes at random start states, as shown in Algorithm 2. In addition, we use an $\epsilon$-greedy exploration policy to explore all state-action pairs. The initial value of $\epsilon$ and its decay rate are set according to the size of the graph based environment.
We run Q-learning on the pretraining environment for a large number of episodes to ensure that the Q-matrix converges to $Q^*$. Since we have an action space of only 25 actions for 5 rooms, running for many episodes is affordable. But for large building models (8281 actions for the large real-world building scenario in Section 5), running for many episodes could become computationally too expensive. So, we use a form of early stopping, in which we stop training the Q-matrix when there is only a very small change in its elements from one episode to the next.
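The early-stopping criterion can be sketched as a comparison of consecutive Q-matrices. The function name and the tolerance value are our assumptions; the paper does not give a threshold.

```python
import numpy as np

def qmatrix_converged(q_prev, q_curr, tol=1e-4):
    """Stop tabular Q-learning when the largest element-wise change
    between consecutive episodes falls below `tol` (assumed threshold)."""
    return bool(np.max(np.abs(q_curr - q_prev)) < tol)
```

The max-norm is a natural choice here because the convergence guarantees for Q-learning are stated in terms of the supremum over state-action pairs.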
However, as we shall see in Section 5, we do not require early stopping at all: we are able to reduce the action space drastically, so the Q-matrix can be trained in the same way as for smaller action spaces.
In [63], the proof of convergence of Q-learning is given for stochastic processes, whereas our environment is deterministic. A more general convergence proof is provided in [64], using convergence properties of stochastic approximation algorithms and their asynchronous versions. The asymptotic error bound has been shown to depend on the number of visits to each state-action pair:
(27) 
where the bound depends on the sampling probability of each state-action pair. It is therefore necessary to run the Q-learning algorithm for as many episodes as possible. Hence, we devise a strategy to reduce the large discrete action spaces that result from real-world building models, so that it becomes feasible to train the Q-matrix for a large number of episodes.
III-F Discussion on Alternative Transfer Learning Techniques
There are a few ways of pretraining an agent, some of which are discussed and evaluated in [65]. A naive approach would be to preload the experience replay memory with demonstration data beforehand. This method, however, is not actually pretraining: the agent trains normally, with the added benefit of being able to learn from good transitions immediately.
Our method of pretraining raises the question of why we do not pretrain the agent's network directly. Pretraining a DQN's weights on the pretraining environment would take longer than tabular Q-learning, since the DQN needs more time to converge, whereas the subsequent step of using the Q-matrix as a fixed target and overfitting the network to its Q-values requires much less time. Moreover, for a small state space like that of the pretraining environment, tabular Q-learning is much more efficient than a DQN. Pretraining the DQN directly would effectively require it to be trained twice overall (once on the pretraining environment and then on the fire evacuation environment), which is inefficient and computationally expensive.
This overhead would also grow rapidly when training on a large real-world building model, as shown in Section 5. For an environment with that many rooms and actions, training a DQN agent twice would be extremely inefficient and computationally infeasible, due to the size of the neural network and the expense of backpropagation, whereas training the Q-matrix only requires computing equation 8.
One of the most successful algorithms for pretraining in deep reinforcement learning is Deep Q-learning from Demonstrations (DQfD) [66, 67]. It pretrains the agent using a combination of temporal difference (TD) and supervised losses on demonstration data held in the replay memory. During training, the agent samples, via a prioritized replay mechanism, from both the demonstration data and its own interactions with the environment, and optimizes a combination of four loss functions (Q-loss, n-step return, large-margin classification loss and L2 regularization loss).
DQfD uses a complex loss function, and the drawback of demonstration data is that it cannot capture the complete dynamics of the environment, since it covers only a small part of the state space; prioritized replay also adds overhead. Our approach is far simpler, and because we create a separate pretraining instance that incorporates essential information about the environment rather than the full environment dynamics, it is more efficient than using demonstration data.
IV Experiments and Results
We perform unbiased experiments on the fire evacuation environment and compare our proposed approach with state-of-the-art reinforcement learning algorithms. We test different hyperparameter configurations and report results with the best-performing hyperparameters for each algorithm on our environment. The main intuition behind the Q-learning pretrained DQN model is to provide the agent with important information beforehand, to increase stability and speed up convergence. The results confirm this intuition empirically.
The Agent’s Network
Unlike the convolutional neural networks [31] used in DQN [51, 52], DDQN [58, 59] and Dueling DQN [60], we implement a fully connected feedforward neural network. The network configuration is given in Table 1. The network consists of 5 layers. The ReLU activation function [68] is used for all layers except the output layer, where a linear activation produces the output.
Environment
The environment given in Fig. 2 is used for all unbiased comparisons. The state of the environment is the number of people in each room, together with the bottleneck value. All rooms contain 10 people (the exit is empty), which is the maximum possible number of people; we do this to test the agents under maximum stress. The fire starts in room 2 and spreads more towards room 1 than room 3 (shown in Fig. 2 with orange arrows). Room 4 is the exit. The total number of possible actions for this environment is 25, so the agent picks one of 25 actions at each step.
Training
The Adam optimizer [56], with default parameters and a fixed learning rate, is used for training all the agents. Each agent is trained for 500 episodes. Training was performed on a 4GB NVIDIA GTX 1050Ti GPU. The models were developed in Python with the help of TensorFlow [69] and Keras [70].
Implementation
Initially, the graph connections were represented as 2D arrays of the adjacency matrix. But as the building models' graphs get bigger, the adjacency matrices become increasingly sparse, making the 2D array representation inefficient. The most efficient and easiest way to implement the graph is as a dictionary, where each key represents a room and its value is an array listing all the rooms connected to it.
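The two representations can be contrasted directly. This is an illustrative 4-room line graph of our own choosing, not a building from the paper.

```python
import numpy as np

# Dense adjacency matrix of a 4-room line graph: 0 - 1 - 2 - 3.
adjacency_matrix = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
])

# Dictionary form: keys are rooms, values list the connected rooms.
adjacency_dict = {
    room: [j for j in range(adjacency_matrix.shape[0])
           if adjacency_matrix[room, j]]
    for room in range(adjacency_matrix.shape[0])
}
# adjacency_dict == {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
```

For a sparse building graph, the dictionary stores only the existing edges, while the matrix stores a full row of mostly zeros per room.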
Comparison Graphs
The comparison graphs shown from Fig. 3 to Fig. 10 have the total number of timesteps required for complete evacuation for an episode on the yaxis and the number of episodes on the xaxis. The comparisons shown in the graphs are different runs of our proposed agents with exactly the same environment settings used for all the other agents as well.
Type  Size  Activation 
Dense  128  ReLU 
Dense  256  ReLU 
Dense  256  ReLU 
Dense  256  ReLU 
Dense  No. of actions  Linear 
We first compare the Q-matrix pretrained versions of DQN and its variants with the original models. The graph based comparisons between models show the number of timesteps needed to evacuate all people on the y-axis and the episode number on the x-axis. We put an upper limit of 1000 timesteps per episode for computational reasons: the training loop breaks and a new episode begins once this limit is reached.
The graph comparing DQN with our proposed Q-matrix pretrained DQN (QMP-DQN) in Fig. 3 shows the difference in their performance on the fire evacuation environment. Although the DQN reaches the optimal number of timesteps quickly, it is not able to stay there: it drastically diverges from the solution and is highly unstable.
It is the same case with DDQN (Fig. 4) and Dueling DQN (Fig. 5), which, although they perform better than DQN, with fewer fluctuations and more time spent near the optimal solution, still show a large performance lag compared to their pretrained versions. These results suggest that pretraining ensures convergence and stability, and show that having some important information about the environment prior to training reduces the complexity of the learning task for an agent.
The original Q-learning based models are not able to cope with the dynamic and stochastic behaviour of the environment, and since they do not possess pretrained information, their learning process is even more difficult. Table 2 displays numerical results comparing DQN, DDQN and Dueling DQN, with and without Q-matrix pretraining, on the basis of the average number of timesteps over all 500 episodes, the minimum number of timesteps reached during training, and the training time per episode.
Model  Average TimeSteps  Average TimeSteps with Pretraining  Minimum TimeSteps  Minimum TimeSteps with Pretraining  Training Time (per episode)  Training time with Pretraining (per episode) 
DQN  228.2  76.356  63  61  10.117  6.87 
DDQN  134.62  71.118  61  60  12.437  8.11 
Dueling DQN  127.572  68.754  61  60  12.956  9.02 
As was also clear from Figs. 3, 4 and 5, the average number of timesteps is greatly reduced with pretraining, which makes the models more stable by reducing variance. For the environment given in Fig. 2, the minimum possible number of timesteps is 60. All the DQN based models come close to this, but pretraining pushes them further and achieves the minimum possible number of timesteps. Even though the difference seems small, in emergency situations even the smallest differences can matter greatly. The training time is also reduced with pretraining, as fewer timesteps are taken during training and the pretrained models start nearer to the optimum.
Next, we make comparisons between our proposed approach and state-of-the-art reinforcement learning algorithms. For these comparisons, we use the Q-matrix pretrained Dueling DQN model, abbreviated QMP-DQN. We also compare it with a random agent, shown in Fig. 6. The random agent performs random actions at each step, without any exploration strategy. Its poor performance of 956.33 average timesteps shows that finding the optimal evacuation plan, or even just evacuating all the people, is not a simple task.
The State-Action-Reward-State-Action (SARSA) algorithm is an on-policy reinforcement learning agent introduced in [3]. While Q-learning follows a greedy policy, SARSA takes the behaviour policy into account and incorporates it into its updates, using the action actually taken. On-policy methods like SARSA have the downside of getting trapped in local optima when a suboptimal policy is judged best. Off-policy methods like Q-learning, on the other hand, are flexible and simple, as they follow a greedy approach. As is clear from Fig. 7, SARSA behaves in a highly unstable manner, shows high variance and is unable to reach the optimal solution.
Policy gradient methods are highly preferred in many applications; however, they are not able to perform optimally on our fire evacuation environment. Since the optimal policy can change within a few timesteps in our dynamic environment, greedy action selection is probably the best approach: an evacuation path that seems best at a particular timestep can be extremely dangerous a few timesteps later, and a fixed routing policy cannot be followed continuously due to fire spread and/or bottlenecks. This is evident from Fig. 8, where we compare our approach to policy gradient methods, namely Proximal Policy Optimization (PPO) [71] and Vanilla Policy Gradient (VPG) [72]. Even though PPO shows promising movement, it is unable to reach the optimum.
Model  Average TimeSteps  Minimum TimeSteps  Training Time (per episode)  
SARSA  642.21  65  19.709  
PPO  343.75  112  16.821  
VPG  723.47  434  21.359  
A2C  585.92  64  25.174  
ACKTR  476.56  79  29.359  
Random Agent  956.33  741   
QMP-DQN  68.754  60  9.02 
Another major family of reinforcement learning algorithms is the actor-critic methods, a hybrid approach consisting of two neural networks: an actor which controls the policy (policy based) and a critic which estimates action values (value based). To further stabilize the model, an advantage function is introduced, which measures the improvement of an action over the average action in a particular state. Apart from the previously mentioned shortcomings of policy based methods on the fire evacuation environment, the advantage function would have high variance, since the best action at a particular state can change rapidly, leading to unstable performance. This is apparent from Fig. 9, where we compare the synchronous advantage actor-critic method (A2C) [73] with our proposed method. A2C gives near optimal performance in the beginning, but then diverges and fluctuates rapidly.
We do not compare our proposed method with the asynchronous advantage actor-critic method (A3C) [74], because A3C is simply an asynchronous version of A2C: it is more complex, as it creates many parallel copies of the environment, gives roughly the same performance, and is not as sample efficient as claimed [75]. The main advantage of A3C is that it exploits parallel and distributed CPU and GPU architectures, which boosts learning speed through asynchronous updates. However, learning speed is not the main focus of this paper, so we consider the comparison with A2C sufficient for actor-critic models.
Probably the best performing actor-critic model is ACKTR (Actor Critic with Kronecker-factored Trust Region) [76]. The algorithm applies trust region optimization using a Kronecker-factored approximation, and is the first scalable trust region natural gradient method for actor-critic models applicable to both continuous and discrete action spaces. Kronecker-factored Approximate Curvature (K-FAC) [77] is used to approximate the Fisher matrix in order to perform approximate natural gradient updates. We compare our method to ACKTR in Fig. 10. The results suggest that ACKTR does not converge within 500 episodes, due to its slow convergence rate, and is susceptible to the dynamic changes in the environment, as evidenced by the fluctuations. ACKTR is far more complex than our proposed method, which converges much faster and deals with the dynamic behaviour of the fire evacuation environment efficiently.
We summarize our results in Table 3. All the RL agents use the same network configuration given in Table 1 for an unbiased comparison. The training time for QMP-DQN is much lower than for the other algorithms because pretraining provides a better starting point, so fewer timesteps and memory updates are needed to reach the terminal state. SARSA and A2C come close to the minimum number of timesteps, but as the average number of timesteps suggests, they are unable to converge and exhibit highly unstable performance. Our proposed method, the Q-matrix pretrained Dueling Deep Q-Network, gives the best performance on the fire evacuation environment by a huge margin.
Note that, in all the comparison graphs, our proposed method comes close to the global optimum but never fully converges to it. This is because of the uncertainty probability $p$, which decides whether a selected action is actually performed and which models the uncertain crowd behaviour. Even though $p$ prevents complete convergence, it also prevents the model from memorizing an optimal path that might change as the fire spreads.
Multiple Fires Scenario
Now that we have shown that the proposed method outperforms state-of-the-art reinforcement learning algorithms, we test our model on a more complex and difficult environment setup. The environment configuration consists of multiple fires in different rooms and a more complex graph structure of 8 rooms, shown in Fig. 10. The green node is the exit, the red nodes are the rooms where fire is located, and the orange arrows depict the direction of fire spread.
As we can see from Fig. 10, the fire spreads in different directions from the different fire locations. This makes things especially difficult, because as the fire spreads, paths to the exit can become blocked. We do not change the configuration of our approach, except for the output layer of the network, since the number of possible actions is now 64. The state of the environment and the bottleneck value are given as input as before.
We employ the Q-matrix pretrained Dueling DQN model. Fig. 11 shows the graphical results on the multiple fires scenario. The initial fluctuations are due to $\epsilon$-greedy exploration: since this configuration of the environment is bigger and more complex, the agent explores the environment a little longer.
As the results in Fig. 11 suggest, the proposed model is able to converge very quickly. A few metrics for the proposed method on the multiple fires environment are given below:
 Average number of timesteps: 119.244
 Minimum number of timesteps: 110
 Training time (per episode): 15.628
Note that there is a difference of roughly 9 timesteps between the minimum (110) and the average (119.244) number of timesteps. This is because the average is taken over all 500 episodes, which includes the initial fluctuations due to exploration and the uncertainty probability $p$.
V Scalability: Large and Complex Real World Scenario - University of Agder Building
To prove that our method is capable of performing on large and complex building models, we simulate a real-world building, the University of Agder's A, B, C and D blocks, and perform evacuation with fire in random room(s).
This task is especially difficult because of the resulting complex graph structure of the building and the large discrete action space. We consider the A, B, C and D blocks, which are in the same building. The total number of rooms in this case is 91, which means that the number of all possible actions is $91^2 = 8281$. This discrete action space is many times larger than that of any other OpenAI gym or Atari game environment [11]. Even the game of Go has only $19^2 = 361$ actions.
Dealing with such a large action space would require either a huge agent model or a move to a multi-agent approach, dividing the environment into subsets with one agent per sub-environment. Both techniques would be computationally complex and difficult to implement for the fire evacuation environment.
Another option would be policy gradient methods, which are much more effective than value based methods at dealing with large action spaces. But handling such large action spaces would require an ensemble of neural networks and tree search algorithms as in [78], or extensive training from human interactions as in [79]. In a fire emergency we obviously cannot rely on human interactions, and we would like to solve the large action space issue without resorting to dramatically huge models. Moreover, as seen in the previous section, even though PPO performed much better than the other algorithms, it was unable to outperform our QMP-DQN method.
In [80], a new method was proposed to deal with extremely large discrete action spaces (up to 1 million actions). The method, called the Wolpertinger policy algorithm, uses an actor-critic architecture in which the actor proposes a proto-action in an action embedding space, from which the most similar actions are selected using the k-nearest neighbour algorithm. These actions are passed to the critic, which makes a greedy selection based on the learned Q-values. This technique shows promising results; however, it is highly complex.
We propose a much simpler approach to deal with a large number of actions. Our method consists of two stages: a One-Step Simulation (OSS) of all actions, resulting in an action importance vector, followed by element-wise addition of that vector with the DQN output during training. We explain our method in the following subsections.
V-A One-Step Simulation and Action Importance Vector
We make use of the pretraining environment instance shown in Algorithm 2 to calculate the action importance vector $I$, as shown in Algorithm 4. The one-step simulation function is implemented in the environment itself, so that the environment object can call the method and the function can access the environment variables.
The function simulates all possible actions from each room for one timestep in the pretraining environment. It stores the rewards received for these actions and returns, for each room, the best actions, i.e., those yielding the highest rewards.
The function is run for every room in the environment. The best actions returned by the function for all rooms are then mapped to unique indices in a single vector of actions. This is necessary because the DQN agent can take any appropriate action from any room at a particular timestep, so it outputs a single vector consisting of Q-values for all actions at each timestep.
Once every action in the environment has a unique index, we form the action importance vector by placing a zero at each index whose action appears among the best actions for some room, and a large negative number at every other index.
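As a concrete illustration, the construction described above can be sketched as follows; the `reward_fn` callback and the flat `room * n_rooms + action` indexing are assumptions for the sketch, not the paper's actual implementation:

```python
def build_action_importance(n_rooms, reward_fn, neg=-1e9):
    """Sketch of One-Step Simulation (OSS): reward_fn(room, action) stands in
    for simulating `action` from `room` for one timestep in the pre-training
    environment and returning the reward (a hypothetical interface)."""
    importance = [neg] * (n_rooms * n_rooms)  # large negative weight everywhere
    for room in range(n_rooms):
        rewards = [reward_fn(room, a) for a in range(n_rooms)]
        best = max(rewards)
        for a, r in enumerate(rewards):
            if r == best:  # best action(s) for this room get weight 0
                importance[room * n_rooms + a] = 0.0  # unique flat index
    return importance
```

Since the graph structure is fixed, this vector only needs to be computed once before training.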
The action importance vector can be thought of as a fixed weight vector which contains a weight of zero for good actions and a large negative weight for the others. It is then added element-wise to the output of the DQN to produce the final output on which the DQN is trained.
Q_final(s) = Q_DQN(s) + I    (28)
This makes the Q-values of the good actions remain the same while reducing the Q-values of the other actions to huge negative numbers. The method effectively reduces the action space from the full set of room-action pairs to only the best actions per room. In our experiments, we set this hyperparameter to the maximum degree of the vertices in the building model's graph. So, in our model, the action space is effectively reduced from 8281 actions to a much smaller set, a substantial decrease.
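The arithmetic of the reduction can be illustrated with assumed numbers: 8281 = 91² follows from the output layer in Table 4, but the maximum vertex degree `b` below is a made-up placeholder, since its actual value is not stated here.

```python
n_rooms = 91                              # 91 * 91 = 8281 Q-value outputs (Table 4)
full_action_space = n_rooms * n_rooms     # one Q-value per (room, action) pair
b = 5                                     # hypothetical maximum vertex degree
reduced_action_space = n_rooms * b        # at most b good actions per room
decrease_pct = 100 * (1 - reduced_action_space / full_action_space)
print(full_action_space, reduced_action_space, round(decrease_pct, 1))
```

Under these assumed numbers, 8281 actions shrink to 455, a roughly 94.5% decrease; any realistic maximum degree far below the number of rooms yields a similarly large reduction.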
Hence, our complete method consists of shortest-path pre-training using Q-matrix transfer learning, action space reduction by one-step simulation and the action importance vector, and finally DQN-based model training and execution. The shortest-path pre-training provides the model with global graph connectivity information, while the one-step simulation and action importance vector deliver local action selection information.
The action importance vector can also be thought of as an attention mechanism [37, 81, 82, 83]. Most attention mechanisms employ a neural network or some other technique to output an attention vector, which is then combined with the input or an intermediate output to convey attention information to a model. Unlike these methods, our proposed model combines the action importance vector with the output of the DQN. This means that the current action selection is based on a combination of the Q-values produced by the DQN and the action importance vector, while the training of the DQN is impacted by the attention vector only in the next iteration of training, since the final output of the current iteration is used as the label for training the model in the next iteration.
One major advantage of the attention mechanism used in our method is that, since the graph-based environment has a fixed structure, the attention vector needs to be calculated just once, at the beginning. We test our method on the University of Agder (UiA), Campus Grimstad building, with blocks A, B, C and D consisting of 91 rooms.
Note that, unlike the usual attention-based models, we do not perform element-wise multiplication of the attention vector with the output of a layer. Instead, we add the attention vector, because initially the DQN model will explore the environment and will have negative Q-values for almost all actions (if not all). This means that if we were to use a vector of ones and zeros for good and bad actions respectively, and multiply it element-wise with the output of a layer, then the Q-values of good actions would be copied as-is and the Q-values of the other actions would become zero. If the Q-values of good actions are negative in the beginning due to exploration (and lack of learning, since it is the start of training), then the max function in the Q-value selection equation will select bad actions, since they are zero while the good actions are negative. This would lead to catastrophic behaviour of the system and it would never converge. So, instead, we use addition: with zeros for good actions, so that they remain the same, and with large negative numbers for the other actions, so that their Q-values become too low to ever be selected.
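A tiny numeric example makes the difference between the two masking schemes explicit (the Q-values below are invented for illustration):

```python
# Early in training, all Q-values may be negative; suppose index 1 is the
# only good action for the current state.
q = [-5.0, -1.2, -7.3]

mult_mask = [0.0, 1.0, 0.0]      # multiplicative mask: 1 for good, 0 for bad
add_mask = [-1e9, 0.0, -1e9]     # additive mask: 0 for good, -1e9 for bad

masked_mult = [v * m for v, m in zip(q, mult_mask)]  # bad actions become 0.0
masked_add = [v + m for v, m in zip(q, add_mask)]    # bad actions become hugely negative

greedy_mult = masked_mult.index(max(masked_mult))    # picks a zeroed bad action
greedy_add = masked_add.index(max(masked_add))       # correctly picks action 1
```

With multiplication, the zeroed bad actions dominate the negative good action under the max; with addition, the good action always wins.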
V-B Fire Evacuation in the UiA building
The graph for UiA's building model is based on the actual structure of the 2nd floor of blocks A, B, C and D (the UiA building map can be found here: https://use.mazemap.com/#v=1&zlevel=2&left=8.5746533&right=8.5803711&top=58.3348318&bottom=58.3334208&campusid=225). The graph for the building model is shown in Fig. 13. It consists of 91 rooms, some of which are exits. We simulate the fire evacuation environment with multiple distributed fires in different rooms. The fire spread for each fire is individually simulated in a random direction, as shown by the yellow nodes in the graph.
As shown in Fig. 13, the building connectivity can be quite complex, and there has been limited research that deals with this aspect. The graph structure shows that these connections between rooms cannot possibly be captured by a grid-based or maze environment.
Also, note that the double-sided arrows in the graph enable transitions back and forth between rooms. This makes the environment more complicated for the agent, since the agent could simply move back and forth between 'safe' rooms, get stuck in a loop, and never converge. This makes pre-training even more indispensable.
Since the proposed method is able to reduce the action space considerably, the neural network does not need to be very large. The network configuration is given in Table 4. Note that the addition layer does not require any trainable parameters.
Type  Size  Activation 
Dense  512  ReLU 
Dense  1024  ReLU 
Dense  1024  ReLU 
Dense  1024  ReLU 
Dense  8281  Linear 
Addition     
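A minimal sketch of the forward pass for the configuration in Table 4, ending with the parameter-free addition layer; the input size (128), the random weights, and the set of "good" indices are assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [128, 512, 1024, 1024, 1024, 8281]  # 128 is an assumed input size; the rest follow Table 4
Ws = [rng.normal(0.0, 0.05, (a, b)) for a, b in zip(sizes, sizes[1:])]
bs = [np.zeros(b) for b in sizes[1:]]

# Hypothetical action importance vector: 0 at arbitrarily chosen "good"
# indices, a large negative number everywhere else.
I = np.full(8281, -1e9)
I[rng.choice(8281, size=455, replace=False)] = 0.0

def forward(x):
    for W, b in zip(Ws[:-1], bs[:-1]):
        x = np.maximum(x @ W + b, 0.0)  # Dense + ReLU
    q = x @ Ws[-1] + bs[-1]             # final Dense, linear activation
    return q + I                        # Addition layer: no trainable parameters

q_final = forward(rng.normal(size=128))
```

Even with untrained random weights, the greedy action over `q_final` is guaranteed to land on one of the "good" indices, which is the point of the addition layer.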
The neural network is trained using the Adam optimizer [56] with otherwise default hyperparameter settings. The training was performed on an NVIDIA DGX-2. The optimal number of steps for evacuation in the UiA building graph is around .
V-C Results
The results of our proposed method, consisting of shortest-path Q-matrix transfer learning to a Dueling-DQN model with one-step simulation and an action importance vector acting as an attention mechanism, applied to the University of Agder's A, B, C and D blocks with 91 rooms and 8281 actions (whose graph is shown in Fig. 13), are shown in Fig. 14. The performance numbers are given below:
- Average number of timesteps: 2234.5
- Minimum number of timesteps: 2000
- Training time (per episode): 32.18 s
The graph in Fig. 14 shows the convergence of our method, with evacuation timesteps on the y-axis and the episode number on the x-axis. It takes slightly longer to converge than in the previous small example environments, which is due to the size of the environment and its complex connectivity. But overall, the performance of our model is excellent.
After a number of episodes, the algorithm has almost converged. There are a few spikes suggesting fluctuations from the optimal behaviour due to the dynamic nature of the environment and the uncertainty in actions. After further episodes, the algorithm completely converges to a narrow range of timesteps for total evacuation. The method cannot converge to the minimum possible number of timesteps because of the fire spread dynamics, bottleneck conditions and action uncertainty.
The results clearly suggest that even though the proposed fire evacuation environment is dynamic, uncertain and full of constraints, our method, using the novel action reduction technique with an attention-based mechanism and transfer learning of shortest-path information, is able to achieve excellent performance on a large and complex real-world building model. This further confirms that, with the minute added overhead of the one-step simulation and action importance vector, our method is scalable to much larger and more complex building models.
VI Conclusion
In this paper, we propose the first realistic fire evacuation environment to train reinforcement learning agents. The environment is implemented in the OpenAI Gym format and has been developed to simulate realistic fire scenarios. It includes features like fire spread with the help of exponential decay reward and degree functions, bottlenecks, uncertainty in performing an action, and a graph-based structure for accurately mapping a building model.
We also propose a new reinforcement learning method for training on our environment. We use tabular Q-learning to generate Q-values for the shortest path to the exit using the adjacency matrix of the graph-based environment. Then, the result of Q-learning (after being offset) is used to pre-train the DQN weights to incorporate shortest-path information into the agent. Finally, the pre-trained DQN-based agents are trained on the fire evacuation environment.
We prove the faster convergence of our method using Task Transfer Q-learning theorems and the convergence of Q-learning for the shortest path task. The Q-matrix pretrained DQN agents (QMP-DQN) are compared with state-of-the-art reinforcement learning algorithms like DQN, DDQN, Dueling-DQN, PPO, VPG, A2C, ACKTR and SARSA on the fire evacuation environment. The proposed method outperforms all these models on our environment in terms of convergence, training time and stability. Also, comparisons of QMP-DQN with the original DQN-based models show clear improvements over the latter.
Finally, we show the scalability of our method by testing it on a large and complex real-world building model. In order to reduce the large action space (8281 actions), we use the one-step simulation technique on the pre-training environment instance to calculate the action importance vector, which can be thought of as an attention-based mechanism. The action importance vector gives the best actions a weight of zero, while the rest are assigned a large negative weight (to render their Q-values too low to be selected by the Q-function). This greatly reduces the action space, and our proposed QMP-DQN model is applied on this reduced action space. We test this method on the UiA, Campus Grimstad building, with the environment consisting of 91 rooms. The results show that this combination of methods works really well in a large real-world fire evacuation emergency environment.
Acknowledgment
The authors would like to thank Tore Olsen, Chief of the Grimstad Fire Department, and the Grimstad Fire Department for supporting us with their expertise regarding fire emergencies and evacuation procedures as well as giving feedback to improve our proposed environment and evacuation system. We would also like to thank Dr. Jaziar Radianti, Center for Integrated Emergency Management (CIEM), University of Agder, for her input to this research work.
References
 [1] Abbas Abdolmaleki, Mostafa Movahedi, Sajjad Salehi, Nuno Lau, and Luis Paulo Reis. A reinforcement learning based method for optimizing the process of decision making in fire brigade agents. In Luis Antunes and H. Sofia Pinto, editors, Progress in Artificial Intelligence, pages 340–351, Berlin, Heidelberg, 2011. Springer Berlin Heidelberg.
 [2] Fatemeh Pahlevan Aghababa, Masaru Shimizu, Francesco Amigoni, Amirreza Kabiri, and Arnoud Visser. Robocup 2018 robocup rescue simulation league virtual robot competition rules document, May 2018.
 [3] G. A. Rummery and M. Niranjan. Online Q-learning using connectionist systems. Technical Report TR 166, Cambridge University Engineering Department, Cambridge, England, 1994.
 [4] Daniel Moura and Eugenio Oliveira. Fighting fire with agents: An agent coordination model for simulated firefighting. In Proceedings of the 2007 Spring Simulation Multiconference  Volume 2, SpringSim ’07, pages 71–78, San Diego, CA, USA, 2007. Society for Computer Simulation International.
 [5] Haifeng Zhao and Stephan Winter. A timeaware routing map for indoor evacuation. Sensors, 16(1), 2016.
 [6] Ashley Wharton. Simulation and investigation of multiagent reinforcement learning for building evacuation scenarios *. 2009.
 [7] Mei Ling Chu, Paolo Parigi, Kincho Law, and JeanClaude Latombe. Modeling social behaviors in an evacuation simulator. Computer Animation and Virtual Worlds, 25(34):373–382.
 [8] V. J. Cassol, E. Smania Testa, C. Rosito Jung, M. Usman, P. Faloutsos, G. Berseth, M. Kapadia, N. I. Badler, and S. Raupp Musse. Evaluating and optimizing evacuation plans for crowd egress. IEEE Computer Graphics and Applications, 37(4):60–71, 2017.
 [9] LH Lee and YoungJun Son. Dynamic learning in human decision behavior for evacuation scenarios under bdi framework. In Proceedings of the 2009 INFORMS Simulation Society Research Workshop. INFORMS Simulation Society: Catonsville, MD, pages 96–100, 2009.
 [10] Seungho Lee, YoungJun Son, and Judy Jin. An integrated human decision making model for evacuation scenarios under a bdi framework. ACM Trans. Model. Comput. Simul., 20(4):23:1–23:24, November 2010.
 [11] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. Openai gym. CoRR, abs/1606.01540, 2016.
 [12] Arthur Juliani, VincentPierre Berges, Esh Vckay, Yuan Gao, Hunter Henry, Marwan Mattar, and Danny Lange. Unity: A general platform for intelligent agents. CoRR, abs/1809.02627, 2018.
 [13] Oriol Vinyals, Timo Ewalds, Sergey Bartunov, Petko Georgiev, Alexander Sasha Vezhnevets, Michelle Yeo, Alireza Makhzani, Heinrich Küttler, John Agapiou, Julian Schrittwieser, John Quan, Stephen Gaffney, Stig Petersen, Karen Simonyan, Tom Schaul, Hado van Hasselt, David Silver, Timothy P. Lillicrap, Kevin Calderone, Paul Keet, Anthony Brunasso, David Lawrence, Anders Ekermo, Jacob Repp, and Rodney Tsing. Starcraft II: A new challenge for reinforcement learning. CoRR, abs/1708.04782, 2017.
 [14] Pete Shinners. Pygame. http://pygame.org/, 2011.
 [15] Łukasz Kidziński, Sharada P Mohanty, Carmichael Ong, Jennifer Hicks, Sean Francis, Sergey Levine, Marcel Salathé, and Scott Delp. Learning to run challenge: Synthesizing physiologically accurate motion using deep reinforcement learning. In Sergio Escalera and Markus Weimer, editors, NIPS 2017 Competition Book. Springer, Springer, 2018.
 [16] Stephen Wolfram. Statistical mechanics of cellular automata. Rev. Mod. Phys., 55:601–644, Jul 1983.
 [17] S. I. Pak and T. Hayakawa. Forest fire modeling using cellular automata and percolation threshold analysis. In Proceedings of the 2011 American Control Conference, pages 293–298, June 2011.
 [18] Marco Wiering and Marco Dorigo. Learning to control forest fires. Utrecht University Repository, Jan 1998.
 [19] G. E. Sakr, I. H. Elhajj, G. Mitri, and U. C. Wejinya. Artificial intelligence for forest fire prediction. In 2010 IEEE/ASME International Conference on Advanced Intelligent Mechatronics, pages 1311–1316, July 2010.
 [20] Sriram Ganapathi Subramanian and Mark Crowley. Using spatial reinforcement learning to build forest wildfire dynamics models from satellite images. Frontiers in ICT, 5:6, 2018.

 [21] Amir R. Zamir, Alexander Sax, William B. Shen, Leonidas J. Guibas, Jitendra Malik, and Silvio Savarese. Taskonomy: Disentangling task transfer learning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2018.
 [22] Jeremy Howard and Sebastian Ruder. Universal language model finetuning for text classification. CoRR, abs/1801.06146, 2018.
 [23] Jacob Devlin, MingWei Chang, Kenton Lee, and Kristina Toutanova. BERT: pretraining of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805, 2018.
 [24] David Ha and Jürgen Schmidhuber. World models. CoRR, abs/1803.10122, 2018.
 [25] Yue Wang, Qi Meng, Wei Chen, Yuting Liu, Zhiming Ma, and Tie-Yan Liu. Target transfer Q-learning and its convergence analysis. CoRR, abs/1809.08923, 2018.
 [26] Richard S. Sutton and Andrew G. Barto. Introduction to Reinforcement Learning. MIT Press, Cambridge, MA, USA, 1st edition, 1998.
 [27] Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., New York, NY, USA, 1st edition, 1994.
 [28] Christopher J. C. H. Watkins and Peter Dayan. Q-learning. Machine Learning, 8(3-4):279–292, 1992.
 [29] Leemon Baird. Residual algorithms: Reinforcement learning with function approximation. In Machine Learning Proceedings 1995, pages 30 – 37. Morgan Kaufmann, San Francisco (CA), 1995.
 [30] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li FeiFei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
 [31] Kunihiko Fukushima. Neocognitron: A selforganizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, 36(4):193–202, 1980.
 [32] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for largescale image recognition. CoRR, abs/1409.1556, 2014.
 [33] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
 [34] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. Gradientbased learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, Nov 1998.
 [35] François Chollet. Xception: Deep learning with depthwise separable convolutions. CoRR, abs/1610.02357, 2016.
 [36] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc., 2012.
 [37] W. Chan, N. Jaitly, Q. Le, and O. Vinyals. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4960–4964, March 2016.
 [38] ChungCheng Chiu, Tara Sainath, Yonghui Wu, Rohit Prabhavalkar, Patrick Nguyen, Zhifeng Chen, Anjuli Kannan, Ron J. Weiss, Kanishka Rao, Katya Gonina, Navdeep Jaitly, Bo Li, Jan Chorowski, and Michiel Bacchiani. Stateoftheart speech recognition with sequencetosequence models. 2018.

 [39] Alex Graves and Navdeep Jaitly. Towards end-to-end speech recognition with recurrent neural networks. In Tony Jebara and Eric P. Xing, editors, Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 1764–1772. JMLR Workshop and Conference Proceedings, 2014.
 [40] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, and B. Kingsbury. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82–97, Nov 2012.
 [41] Alex Graves, Abdelrahman Mohamed, and Geoffrey E. Hinton. Speech recognition with deep recurrent neural networks. CoRR, abs/1303.5778, 2013.
 [42] T. N. Sainath, A. Mohamed, B. Kingsbury, and B. Ramabhadran. Deep convolutional neural networks for lvcsr. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 8614–8618, May 2013.
 [43] G. E. Dahl, D. Yu, L. Deng, and A. Acero. Contextdependent pretrained deep neural networks for largevocabulary speech recognition. IEEE Transactions on Audio, Speech, and Language Processing, 20(1):30–42, Jan 2012.
 [44] Xiang Zhang, Junbo Zhao, and Yann LeCun. Characterlevel convolutional networks for text classification. In Proceedings of the 28th International Conference on Neural Information Processing Systems  Volume 1, NIPS’15, pages 649–657, Cambridge, MA, USA, 2015. MIT Press.
 [45] Rohan Anil, Gabriel Pereyra, Alexandre Passos, Robert Ormándi, George E. Dahl, and Geoffrey E. Hinton. Large scale distributed neural network training through online distillation. CoRR, abs/1804.03235, 2018.
 [46] Mia Xu Chen, Orhan Firat, Ankur Bapna, Melvin Johnson, Wolfgang Macherey, George Foster, Llion Jones, Niki Parmar, Mike Schuster, Zhifeng Chen, Yonghui Wu, and Macduff Hughes. The best of both worlds: Combining recent advances in neural machine translation. CoRR, abs/1804.09849, 2018.
 [47] Lierni Sestorain, Massimiliano Ciaramita, Christian Buck, and Thomas Hofmann. Zeroshot dual machine translation. CoRR, abs/1805.10338, 2018.
 [48] Hany Hassan, Anthony Aue, Chang Chen, Vishal Chowdhary, Jonathan Clark, Christian Federmann, Xuedong Huang, Marcin JunczysDowmunt, William Lewis, Mu Li, Shujie Liu, TieYan Liu, Renqian Luo, Arul Menezes, Tao Qin, Frank Seide, Xu Tan, Fei Tian, Lijun Wu, Shuangzhi Wu, Yingce Xia, Dongdong Zhang, Zhirui Zhang, and Ming Zhou. Achieving human parity on automatic chinese to english news translation. CoRR, abs/1803.05567, 2018.
 [49] S. Lange and M. Riedmiller. Deep autoencoder neural networks in reinforcement learning. In The 2010 International Joint Conference on Neural Networks (IJCNN), pages 1–8, July 2010.
 [50] H. D. Patino and D. Liu. Neural networkbased model reference adaptive control system. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 30(1):198–204, Feb 2000.
 [51] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning. CoRR, abs/1312.5602, 2013. NIPS Deep Learning Workshop 2013.
 [52] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Humanlevel control through deep reinforcement learning. Nature, 518(7540):529–533, feb 2015.
 [53] Herbert Robbins and Sutton Monro. A stochastic approximation method. Ann. Math. Statist., 22(3):400–407, 09 1951.
 [54] T. Tieleman and G. Hinton. Lecture 6.5—RmsProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012.
 [55] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Technical Report UCB/EECS201024, EECS Department, University of California, Berkeley, Mar 2010.
 [56] D.P. Kingma and L.J. Ba. Adam: A method for stochastic optimization. In ICLR, International Conference on Learning Representations (ICLR), page 13, San Diego, CA, USA, 7–9 May 2015. Ithaca, NY: arXiv.org.
 [57] LongJi Lin. Reinforcement Learning for Robots Using Neural Networks. PhD thesis, Pittsburgh, PA, USA, 1992. UMI Order No. GAX9322750.
 [58] Hado V. Hasselt. Double Q-learning. In J. D. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems 23, pages 2613–2621. Curran Associates, Inc., 2010.
 [59] Hado van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double Q-learning. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, AAAI'16, pages 2094–2100. AAAI Press, 2016.
 [60] Ziyu Wang, Tom Schaul, Matteo Hessel, Hado Van Hasselt, Marc Lanctot, and Nando De Freitas. Dueling network architectures for deep reinforcement learning. In Proceedings of the 33rd International Conference on International Conference on Machine Learning  Volume 48, ICML’16, pages 1995–2003. JMLR.org, 2016.
 [61] Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML ’09, pages 41–48, New York, NY, USA, 2009. ACM.
 [62] Csaba Szepesvári. The asymptotic convergencerate of qlearning. In Advances in Neural Information Processing Systems, pages 1064–1070, 1998.
 [63] Tommi Jaakkola, Michael I. Jordan, and Satinder P. Singh. On the convergence of stochastic iterative dynamic programming algorithms. Neural Comput., 6(6):1185–1201, November 1994.
 [64] John N. Tsitsiklis. Asynchronous stochastic approximation and qlearning. Machine Learning, 16(3):185–202, Sep 1994.
 [65] E. Larsson. Evaluation of pretraining methods for deep reinforcement learning. 2018.
 [66] Todd Hester, Matej Vecerik, Olivier Pietquin, Marc Lanctot, Tom Schaul, Bilal Piot, Andrew Sendonaris, Gabriel DulacArnold, Ian Osband, John Agapiou, Joel Z. Leibo, and Audrunas Gruslys. Learning from demonstrations for real world reinforcement learning. CoRR, abs/1704.03732, 2017.
 [67] Todd Hester, Matej Vecerik, Olivier Pietquin, Marc Lanctot, Tom Schaul, Bilal Piot, Andrew Sendonaris, Gabriel DulacArnold, Ian Osband, John Agapiou, Joel Z. Leibo, and Audrunas Gruslys. Learning from demonstrations for real world reinforcement learning. CoRR, abs/1704.03732, 2017.
 [68] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse rectifier neural networks. In Geoffrey Gordon, David Dunson, and Miroslav Dudik, editors, Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, volume 15 of Proceedings of Machine Learning Research, pages 315–323, Fort Lauderdale, FL, USA, 11–13 Apr 2011. PMLR.
 [69] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Largescale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
 [70] François Chollet et al. Keras. https://keras.io, 2015.
 [71] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. CoRR, abs/1707.06347, 2017.
 [72] Richard S. Sutton, David McAllester, Satinder Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Proceedings of the 12th International Conference on Neural Information Processing Systems, NIPS’99, pages 1057–1063, Cambridge, MA, USA, 1999. MIT Press.
 [73] Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Tim Harley, Timothy P. Lillicrap, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In Proceedings of the 33rd International Conference on International Conference on Machine Learning  Volume 48, ICML’16, pages 1928–1937. JMLR.org, 2016.
 [74] Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Timothy P. Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. CoRR, abs/1602.01783, 2016.
 [75] Yuhuai Wu, Elman Mansimov, Shun Liao, Alec Radford, and John Schulman. Openai baselines: Acktr and a2c. https://openai.com/blog/baselinesacktra2c/, 2017.
 [76] Yuhuai Wu, Elman Mansimov, Shun Liao, Roger B. Grosse, and Jimmy Ba. Scalable trustregion method for deep reinforcement learning using kroneckerfactored approximation. CoRR, abs/1708.05144, 2017.
 [77] James Martens and Roger B. Grosse. Optimizing neural networks with kroneckerfactored approximate curvature. CoRR, abs/1503.05671, 2015.
 [78] David Silver, Aja Huang, Christopher J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. Mastering the game of go with deep neural networks and tree search. Nature, 529:484–503, 2016.
 [79] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of go without human knowledge. Nature, 550(7676):354, 2017.
 [80] Gabriel DulacArnold, Richard Evans, Peter Sunehag, and Ben Coppin. Deep reinforcement learning in large discrete action spaces. CoRR, abs/1512.07679, 2015.
 [81] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. CoRR, abs/1706.03762, 2017.
 [82] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. CoRR, abs/1502.03044, 2015.
 [83] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473, 2014. Accepted at ICLR 2015 as oral presentation.