1 Introduction
In recent years, there has been a huge amount of work on the application of deep learning techniques in combination with reinforcement learning (so-called
deep reinforcement learning) to the development of automatic agents for different kinds of games. Game-like environments provide realistic abstractions of real-life situations, creating new and innovative dimensions in the kind of problems that can be addressed by means of neural networks. Since the seminal work by Mnih et al.
[16] exploiting a combination of Q-learning and neural networks (DQN) applied to Atari games [5], the field has rapidly evolved, from improvements such as Double Q-learning [8] (correcting overestimations in the action value of the original version) to the recent breakthrough provided by the introduction of asynchronous methods, the so-called A3C model [15]. In this work, we apply a version of A3C to automatically move a player in the dungeons of the famous Rogue video game. Rogue was the ancestor of this genre of games, and the first application exploiting a procedural, random creation of its levels; we use it precisely in this way: as a generator of different kinds of labyrinths with a reasonable level of complexity. Of course, the full game offers many other challenges, comprising collecting objects, evolving the rogue, and fighting monsters of increasing power, but, at least for the moment, we are not addressing these aspects (although they may provide interesting cues for future developments).
We largely based this work on the learning environment that was previously created to this aim in [4, 3], and that allows a simple interaction with Rogue. At the same time, the extension to A3C forced a major revision of the environment, which will be discussed in Section 6.
The reasons for addressing Rogue, apart from the fascination of this vintage game, have been extensively discussed in [4, 3] (see also [6]); we just recall here the main motivations. In particular, Rogue's dungeons are a classical example of a Partially Observable Markov Decision Problem (POMDP), since each level is initially unknown and not entirely visible. Solving this kind of task is notoriously difficult and challenging [20], since it requires a substantial amount of exploration.
The other important characteristic that differentiates it from other, more modern, 3D dungeon-based games such as ViZDoom [11] or the Labyrinth in [15] is precisely the graphical interface, which in the case of Rogue is ASCII-based. Our claim is that, at the current state of knowledge, decoupling vision from more intelligent activities such as planning can only be beneficial, allowing attention to be focused on the really challenging aspects of the player's behavior.
1.1 Achievements overview
Rogue is a complex game, where the player (the “rogue”) is supposed to roam through many different levels of a dungeon trying to retrieve the amulet of Yendor. In this quest, the player must be able to:
1. explore the dungeon (partially visible when entering a new level);
2. defend himself from enemies, using the items scattered through the dungeon;
3. avoid traps;
4. avoid starvation, looking for and eating food inside the dungeon.
Currently, we are merely focusing on exploring the maze: as explained in Section 6, monsters and traps may be easily disabled in the game.
The dungeon consists of 26 floors (configurable) and each floor consists of up to 9 rooms of varying dimension and location, randomly connected through non-linear corridors and small mazes. To reach a new floor, the agent needs to find and go down the stairs, whose position is likely hidden from sight, located in a yet unexplored room and in a different spot at each new level. Finding and taking the stairs are the main ingredients governing the agent's movement: the only differences between the first floors and the subsequent ones are related to the frequency of meeting enemies, dark rooms, mazes or hidden doors. As a consequence, we organized the training process on the basis of a single level, terminating the episode as soon as the rogue takes the stairs. In the rest of the work, when we talk about the performance of an agent, we refer to the probability that it correctly completes a single level, finding and taking the stairs within a maximum of 500 moves.^{1} The performance is measured on a set of 200 consecutive (i.e. random) games. To have a first term of comparison, the performance of a completely random agent is around 7%.^{2} The performance of the agent presented in [4], based on DQN, was about 23%. The agent discussed in this work attains the remarkable performance of 98%.

^{1} For a good agent, on average, little more than one hundred moves is typically enough.
^{2} The mobility resulting from Brownian motion is always impressive.
Agent       | random | DQN [4] | this work
Performance | 7%     | 23%     | 98%
There are essentially three ingredients behind this achievement:
- the adoption of A3C as the base learning algorithm, in substitution of DQN; we discuss A3C at length in Section 3.2;
- an agent-centered, cropped representation of the state;
- a supervised partition of the problem into a predefined set of situations, each one delegated to a different A3C agent, all of them nevertheless sharing a common value function (i.e. a common evaluation of the state);^{3} we discuss situations in Section 4.1.

^{3} Source code and weights are publicly available at [2].
While the adoption of A3C and the idea of experimenting with situations were planned activities [3], the shift to an agent-centered view, as well as the choice of the agent's situations, have mostly been the result of trial and error, through an extremely long and painful experimentation process.
2 Related work
As we mentioned in the introduction, there is a huge amount of research around the application of deep reinforcement learning to video games. In this section we shall merely mention some recent works that, in addition to those already mentioned, have been a source of inspiration for our work, or the subject of different experiments we performed. A few more works that seem to offer promising developments [22, 18] will be discussed in the conclusions.
Our current bot is essentially a partitioned multi-task agent in the sense of [19]. Its tree-like structure may be reminiscent of hierarchical models [13, 7, 21], but they are in fact distinct notions. In hierarchical models a Master cooperates with one or more Workers by dictating macro-actions to them (e.g. “reach the next room”), which are taken by the Workers as their objectives. The Master typically gets rewards from the environment and gives ad hoc, possibly intrinsic bonuses to the Workers. The hope is to let top-level agents focus on planning while subparts of the hierarchy manage simple atomic actions, improving the learning process. In our case, we simply split the task according to the different situations the rogue may be faced with: a room, a corridor, the proximity to stairs/walls, etc. (see Section 4.1 for details). We did several experiments with hierarchical structures, but so far none of them gave satisfactory results.
We also experimented with several forms of intrinsic rewards [17], especially after passing to a rogue-centered view. Intrinsic motivations are stimuli received from the surrounding environment, different from explicit, extrinsic rewards, that can be used by the agent for alternative forms of training, learning to perform a particular action because it is inherently enjoyable. Examples are empowerment [12] or auxiliary tasks [9]. In this case too, we have not been able to obtain interesting results.
3 Reinforcement Learning Background
A Reinforcement Learning problem is usually formalized as a Markov Decision Process (MDP). In this setting, an agent interacts at discrete timesteps with an external environment. At each time step $t$, the agent observes a state $s_t$ and chooses an action $a_t$ according to some policy $\pi$, that is, a mapping (or more generally a probability distribution) from states to actions. As a result of its action, the environment changes to a new state $s_{t+1}$; moreover, the agent obtains a reward $r_t$ (see Fig. 2). The process is then iterated until a terminal state is reached. The future cumulative reward $R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}$ is the total accumulated reward starting from time $t$. $\gamma \in (0,1]$ is the so-called discount factor: it represents the difference in importance between present and future rewards.
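For a finite episode, the discounted return defined above can be computed directly; a minimal sketch (the function name is ours, for illustration only):

```python
def discounted_return(rewards, gamma):
    """Compute R_t = sum_k gamma^k * r_{t+k}, here for t = 0,
    over a finite list of rewards."""
    total = 0.0
    for k, r in enumerate(rewards):
        total += (gamma ** k) * r
    return total

# With gamma = 0.5 and rewards [1, 1, 1]:
# R_0 = 1 + 0.5 + 0.25 = 1.75
value = discounted_return([1.0, 1.0, 1.0], 0.5)
```

A discount factor closer to 1 makes the agent value future rewards almost as much as immediate ones, which matters in Rogue since the large stairs reward arrives only at the end of the episode.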
The goal of the agent is to maximize the expected return $\mathbb{E}[R_t]$ starting from an initial state $s_t = s$.
The action value $Q^{\pi}(s,a) = \mathbb{E}[R_t \mid s_t = s, a_t = a]$ is the expected return for selecting action $a$ in state $s$ and then following policy $\pi$.
Given a state $s$ and an action $a$, the optimal action value function $Q^{*}(s,a) = \max_{\pi} Q^{\pi}(s,a)$ is the best possible action value achievable by any policy.
Similarly, the value of state $s$ under a policy $\pi$ is $V^{\pi}(s) = \mathbb{E}[R_t \mid s_t = s]$ and the optimal value function is $V^{*}(s) = \max_{\pi} V^{\pi}(s)$.
3.1 Qlearning and DQN
The Q-function, similarly to the V-function, can be represented by suitable function approximators, e.g. neural networks. We shall use the notation $Q(s,a;\theta)$ to denote an approximate action-value function with parameters $\theta$.
In (one-step) Q-learning, we try to approximate the optimal action value function, $Q^{*}(s,a) \approx Q(s,a;\theta)$, by learning the parameters via backpropagation according to a sequence of loss functions defined as follows:
$$L_i(\theta_i) = \mathbb{E}\Big[\big(r + \gamma \max_{a'} Q(s',a';\theta_{i-1}) - Q(s,a;\theta_i)\big)^2\Big]$$
where $s'$ is the new state reached from $s$ by taking action $a$.
The previous loss function is motivated by the well known Bellman equation, which must be satisfied by the optimal Q-function:
$$Q^{*}(s,a) = \mathbb{E}_{s'}\big[r + \gamma \max_{a'} Q^{*}(s',a')\big]$$
Indeed, if we know the optimal state-action values $Q^{*}(s',a')$ for the next states, the optimal strategy is to take the action that maximizes $r + \gamma \max_{a'} Q^{*}(s',a')$.
Q-learning is an off-policy reinforcement learning algorithm. The main drawback of this method is that a reward only directly affects the value of the state-action pair $(s,a)$ that led to the reward. The values of other state-action pairs are affected only indirectly, through the updated value $Q(s,a)$. The backward propagation to relevant preceding states and actions may require several updates, slowing down the learning process.
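The slow, indirect propagation of rewards described above can be seen in a toy tabular setting; the sketch below is purely illustrative (it is not the DQN used in this work, which approximates Q with a neural network):

```python
from collections import defaultdict

def q_update(Q, s, a, r, s_next, actions, gamma=0.99, alpha=0.1):
    """One-step Q-learning: move Q(s,a) toward the Bellman target
    r + gamma * max_a' Q(s', a')."""
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    return Q[(s, a)]

Q = defaultdict(float)
actions = ["left", "right"]

# A reward obtained at ("s1", "right") directly updates only that entry...
q_update(Q, "s1", "right", 1.0, "terminal", actions)   # Q(s1,right) = 0.1
# ...while the preceding state "s0" is improved only indirectly, on a
# later update, via the max over Q(s1, .): 0.1 * 0.99 * 0.1 = 0.0099.
q_update(Q, "s0", "right", 0.0, "s1", actions)
```

Each additional state between the reward and the start of the episode needs at least one more sweep of updates before the reward's effect reaches it, which is exactly the slowdown mentioned above.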
3.2 ActorCritic and A3C
In contrast to value-based methods, policy-based methods directly parameterize the policy $\pi(a|s;\theta)$ and update the parameters $\theta$ by gradient ascent on $\mathbb{E}[R_t]$.
The standard REINFORCE [20] algorithm updates the policy parameters in the direction $\nabla_\theta \log \pi(a_t|s_t;\theta)\, R_t$, which is an unbiased estimate of $\nabla_\theta \mathbb{E}[R_t]$. It is possible to reduce the variance of this estimate, while keeping it unbiased, by subtracting from the return a learned function of the state $b_t(s_t)$, known as a baseline. The gradient is then $\nabla_\theta \log \pi(a_t|s_t;\theta)\,(R_t - b_t(s_t))$. A learned estimate of the value function is commonly used as the baseline, $b_t(s_t) \approx V^{\pi}(s_t)$. In this case, the quantity $R_t - b_t$ can be seen as an estimate of the advantage of action $a_t$ in state $s_t$ under policy $\pi$, defined as $A(a_t,s_t) = Q^{\pi}(a_t,s_t) - V^{\pi}(s_t)$, just because $R_t$ is an estimate of $Q^{\pi}(a_t,s_t)$ and $b_t$ is an estimate of $V^{\pi}(s_t)$.
This approach can be viewed as an actor-critic architecture where the policy $\pi$ is the actor and the baseline $b_t$ is the critic.
A3C [15] is a particular implementation of this technique based on the asynchronous interaction of several parallel Actor-Critic pairs. The experience of each agent is independent from that of the other agents, which stabilizes learning without the need for the experience replay adopted by DQN.
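A minimal numerical sketch of the policy gradient with baseline, for a softmax policy over a handful of actions (the function names are ours; A3C computes the same estimate with neural networks and asynchronous workers):

```python
import math

def softmax_policy(theta):
    """pi(a) proportional to exp(theta_a), for a small discrete action set."""
    z = [math.exp(t) for t in theta]
    s = sum(z)
    return [p / s for p in z]

def pg_with_baseline(theta, action, R, V):
    """Gradient of log pi(action) * (R - V) w.r.t. theta for a softmax
    policy. R - V is the advantage estimate: the sampled return minus
    the critic's value estimate."""
    pi = softmax_policy(theta)
    advantage = R - V
    # d/d theta_i of log pi(action) = 1{i == action} - pi_i
    return [((1.0 if i == action else 0.0) - pi[i]) * advantage
            for i in range(len(theta))]

# Uniform policy over two actions; return 2.0 exceeds the critic's
# estimate 1.5, so the gradient pushes probability toward action 0.
g = pg_with_baseline([0.0, 0.0], action=0, R=2.0, V=1.5)
```

Note that when the advantage is negative the same update pushes probability away from the taken action, which is what makes the baseline a variance-reduction device rather than a change to the objective.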
4 Neural Network Architecture
Our implementation is essentially based on A3C. In this section we describe a novel technique that partitions the sample space into a predefined set of situations, each one addressed by a different A3C agent. All of these agents contribute to build a common cumulative reward without sharing any other information, and for this reason they are highly independent. Each agent employs the same architecture, state representation and reward function. In this section we discuss: the situations (Sec. 4.1), the state representation (Sec. 4.2), how we shaped the reward function (Sec. 4.3), the neural network (Sec. 4.4), and hyperparameter tuning (Sec. 4.5).
4.1 Situations
In our work, the term situation denotes the environment state used to discriminate which situational agent should perform the next action. We experimented with the four situations listed below, from higher to lower priority:
1. the rogue (the agent) stands on a corridor;
2. the stairs are visible;
3. the rogue is next to a wall;
4. any other case.
The situations are determined programmatically and are not learned.
When multiple conditions in the above list are met, the one with the highest priority is selected. For example, if the stairs are visible but the rogue is walking on a corridor, the situation is determined to be 1 rather than 2, because the former has higher priority.
We define:
- s4 as the configuration comprising all the aforementioned situations;
- s1 as the configuration with no situations at all.
We believe that situations may be seen as a way to simplify the overall problem, breaking it down into easier subproblems.
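The priority-based selection of situations can be sketched as a hand-written dispatcher; the predicate names and toy policies below are illustrative placeholders for the actual (programmatic, not learned) classifier and the four A3C networks:

```python
def classify_situation(state):
    """Return a situation label, checking conditions from higher to
    lower priority, as described in Section 4.1."""
    if state.get("on_corridor"):
        return "corridor"       # situation 1
    if state.get("stairs_visible"):
        return "stairs"         # situation 2
    if state.get("next_to_wall"):
        return "wall"           # situation 3
    return "other"              # situation 4

class PartitionedAgent:
    """Dispatch each decision to the agent owning the current situation."""
    def __init__(self, policies):
        self.policies = policies  # one policy per situation

    def act(self, state):
        return self.policies[classify_situation(state)](state)

# Toy stand-ins for the four situational A3C networks.
agent = PartitionedAgent({
    "corridor": lambda s: "follow_corridor",
    "stairs":   lambda s: "go_to_stairs",
    "wall":     lambda s: "walk_along_wall",
    "other":    lambda s: "explore",
})
```

With this structure, a state where both the corridor and stairs conditions hold is routed to the corridor agent, mirroring the priority rule above.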
4.2 State representation
The state is a matrix corresponding to a cropped view of the map centered on the rogue (i.e. the rogue's position is always at the center of the matrix). This representation has the advantage of being sufficiently small to be fed to dense layers (possibly after convolutions); moreover, it does not require representing the rogue in the map. In our experiments we adopted two variations of the above matrix. The first (called c1) has a single channel, and it is filled with the following values:
- 4 for stairs;
- 8 for walls;
- 16 for doors and corridors;
- 0 everywhere else.
The second (called c2) is made of two channels (the stairs channel and the environment channel) of the same spatial size. The values used for c2 are the same as in c1.
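A possible sketch of how the c1 view could be built, assuming the map is already encoded with the values above; the crop radius and helper names are illustrative, not taken from the released code:

```python
import numpy as np

# Tile encoding of representation c1 (values from Section 4.2).
ENCODING = {"stairs": 4, "wall": 8, "door": 16, "corridor": 16}

def cropped_state(grid, rogue_pos, radius):
    """Return a (2*radius+1) x (2*radius+1) view of `grid` centered on
    the rogue. `grid` is a 2-D integer array already filled with
    ENCODING values; cells outside the map are padded with 0
    ("everywhere else"), and the rogue itself is not drawn."""
    padded = np.pad(grid, radius, mode="constant", constant_values=0)
    r, c = rogue_pos
    return padded[r:r + 2 * radius + 1, c:c + 2 * radius + 1]

# 5x5 toy map with stairs one cell to the right of the rogue.
grid = np.zeros((5, 5), dtype=int)
grid[2, 3] = ENCODING["stairs"]
view = cropped_state(grid, (2, 2), radius=1)  # 3x3 view around the rogue
```

The padding makes the crop well defined even when the rogue stands at the border of the map, and centering means the network never needs a separate channel or symbol for the rogue's own position.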
4.3 Reward shaping
We designed the following reward function:
1. a positive reward is given when using a door never used before;
2. a positive reward is given when, after an action, one or more new doors are found;
3. a huge positive reward is given when descending the stairs;
4. a small negative reward is given when taking an action that does not change the state (e.g. trying to cross a wall).
The chosen reward values are not arbitrary. Each floor contains at most 9 rooms and each room has a bounded number of doors, so on each floor the cumulative contribution of the rewards of types 1 and 2 cannot exceed a fixed bound. What normally happens, in the episodes with the best return, is that only a fraction of the cumulative reward comes from finding new rooms. At the same time, the negative rewards are enough to teach the agent not to take useless actions, while not significantly affecting the balance between room exploration and stair descent. The result is that the agent is encouraged both to descend the stairs and to explore the floor, and this impacts positively and significantly on its performance.
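The reward function can be sketched as follows; the numeric constants are placeholders chosen only to respect the signs and relative magnitudes described above (small door/room bonuses, a much larger stairs bonus, a small no-op penalty), since the exact values are not reported here:

```python
# Placeholder reward magnitudes: only signs and relative sizes follow
# the text; the actual constants differ.
NEW_DOOR_BONUS = 1.0      # reward type 1: using a door never used before
DOORS_FOUND_BONUS = 1.0   # reward type 2: new doors revealed by an action
STAIRS_BONUS = 10.0       # reward type 3: descending the stairs
NOOP_PENALTY = -0.01      # reward type 4: action did not change the state

def shaped_reward(used_new_door, new_doors_found, descended_stairs,
                  state_changed):
    """Combine the four reward components of Section 4.3."""
    r = 0.0
    if used_new_door:
        r += NEW_DOOR_BONUS
    if new_doors_found > 0:
        r += DOORS_FOUND_BONUS
    if descended_stairs:
        r += STAIRS_BONUS
    if not state_changed:
        r += NOOP_PENALTY
    return r
```

Because the door and room bonuses are capped by the floor layout while the stairs bonus is unbounded by exploration, the best-return episodes are necessarily dominated by stair descent, which is the balance the text describes.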
4.4 Neural Network
The neural network architecture we used is shown in Figure 3. This network consists of two convolutional layers followed by a dense layer, to process spatial dependencies, an LSTM layer to process temporal dependencies, and, finally, value and policy output layers. The convolutional layers have ReLU activations, kernels with unitary stride, and respectively 16 and 32 filters. The convolutional output is flattened and fed to a fully-connected ReLU layer with 256 units. We call this structure a tower.
The tower's input is the state representation described in Section 4.2, and its output is concatenated with a one-hot representation of the action taken in the previous state and the obtained reward. This concatenation is fed into an LSTM composed of 256 units. The idea of concatenating previous actions and rewards to the LSTM input comes from [9]. The output of the LSTM is then the input of the value and policy layers. A network with the aforementioned structure implements an agent for each situation described in Section 4.1. The loss is computed separately for each network, and corresponds to the A3C loss of [9].
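The concatenation fed to the LSTM can be sketched as follows; the feature size and helper name are illustrative (256 tower features matches the fully-connected layer described above, while the action count depends on the configured move set):

```python
import numpy as np

def lstm_input(tower_features, prev_action, n_actions, prev_reward):
    """Concatenate the tower output with a one-hot encoding of the
    previous action and the previous reward, as fed to the 256-unit
    LSTM (idea from [9])."""
    one_hot = np.zeros(n_actions)
    one_hot[prev_action] = 1.0
    return np.concatenate([tower_features, one_hot, [prev_reward]])

# 256 tower features + 4 actions + 1 reward scalar = 261 inputs.
x = lstm_input(np.zeros(256), prev_action=2, n_actions=4, prev_reward=0.5)
```

Feeding the previous action and reward lets the recurrent layer disambiguate aliased observations, which is valuable in a partially observable dungeon where different positions can produce identical cropped views.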
4.5 HyperParameters Tuning
Each episode lasts at most 500 steps/actions, and it may end either with success (i.e. descending the stairs) or by reaching the step limit. Thus, death is impossible for the agent, since in our experiments monsters and traps are disabled and 500 steps are not enough to die of starvation.
Most of the remaining hyperparameter values we adopted (for example the entropy regularization weight) came from [14], an open-source implementation of [9], except the following:
- discount factor γ: 0.95
- batch size: 60
We employed the same TensorFlow RMSProp optimizer [1] available in [14], with parameters:
- decay: 0.99
- momentum: 0
- epsilon: 0.1
- clip norm: 40
The learning rate is annealed over time according to the following equation: $\alpha_t = \alpha_0 \cdot \frac{T - t}{T}$, where $\alpha_0$ is the initial learning rate, $T$ is the maximum global step, and $t$ is the current global step.
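Assuming the standard linear schedule used by UNREAL-style implementations such as [14] (an assumption on our part, since the exact formula is not spelled out above), the annealing can be sketched as:

```python
def annealed_lr(initial_lr, step, max_step):
    """Linearly anneal the learning rate to zero over max_step global
    steps; clamped at zero if training runs past max_step."""
    return initial_lr * max(0.0, (max_step - step) / max_step)

# Starts at initial_lr, halves at the midpoint, reaches zero at max_step.
lr_start = annealed_lr(1e-3, 0, 100)
lr_mid = annealed_lr(1e-3, 50, 100)
lr_end = annealed_lr(1e-3, 100, 100)
```

Decaying the step size toward the end of training reduces the variance of the asynchronous updates once the policy is close to convergence.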
5 Evaluation
For evaluation purposes we want to measure how often the agent is able to descend the stairs and how well it explores the floor. In our experiments, the final state is reached when the agent descends the stairs. For this reason, a good evaluation metric for a Rogue-like exploration-only system should be based at least on:
- the success rate: the percentage of episodes in which the final state is reached (an equivalent of accuracy);
- the number of new tiles found during the exploration process;
- the number of steps taken to win an episode.
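Aggregating these metrics over a batch of evaluation episodes is straightforward; a small sketch (the record fields are illustrative, not the actual schema of our evaluation module):

```python
def evaluate(episodes):
    """Aggregate the three metrics over a list of episode records.

    Each record is a dict with 'success' (bool), 'new_tiles' (int),
    and 'steps' (int, meaningful only when success is True)."""
    n = len(episodes)
    wins = [e for e in episodes if e["success"]]
    return {
        "success_rate": len(wins) / n,
        "avg_new_tiles": sum(e["new_tiles"] for e in episodes) / n,
        "avg_steps_to_succeed": (sum(e["steps"] for e in wins) / len(wins)
                                 if wins else float("nan")),
    }

stats = evaluate([
    {"success": True, "new_tiles": 400, "steps": 100},
    {"success": False, "new_tiles": 200, "steps": 500},
])
```

Note that the step count is averaged only over successful episodes, since a failed episode always ends at the step limit and would otherwise dominate the average.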
We evaluated our systems using an average of the aforementioned metrics over 200 episodes. The results we achieved are summarized in Figure 4 and Table 1.^{4}

^{4} Source code and weights are publicly available at [2].
Agent                          | s1c2   | s2c2   | s4c1   | s4c2
Success rate                   | 0.03%  | 98%    | 96.5%  | 97.6%
Avg return                     | 16.16  | 17.97  | 17.66  | 17.99
Avg number of seen tiles       | 655.02 | 386.46 | 365.88 | 389.27
Avg number of steps to succeed | 2.11   | 111.48 | 108.22 | 110.26
Our best agent^{5} shows remarkable skill in exploring the dungeon and searching for the stairs. Using four situations instead of just two did not prove to be beneficial; however, adopting a separate situation (and hence a separate neural network) for the case when the stairs are visible was fundamental. In fact, as can be seen in Fig. 4, the policy learned by s1c2 completely ignored the stairs, thus achieving a very low success rate. The experiment with four situations resulted in the agent developing a peculiar inclination to walk alongside walls. Finally, state representation c2 induced faster learning, but only a slight increase in the resulting success rate.

^{5} A video of our agent playing is available at https://youtu.be/1j6_165Q46w
6 Refactoring the Rogue-in-a-Box library
With this article, we release a new version [2] of the Rogue-in-a-Box library [4, 3] that improves modularity, efficiency and usability with respect to the previous version. In particular, the old library was mainly centered around DQN agents, which at the time looked like the most promising approach for the application of deep reinforcement learning to this kind of game. With the advent of A3C and other techniques, we restructured the learning environment, neatly decoupling the interface with the game, supported by a suitable API, from the design of the agents.
Other innovative features comprise:
- a screen parser and frame memory;
- communication between Rogue and the library;
- enabling or disabling monsters and traps;
- an evaluation module.
Of particular note is the evaluation module, which provides statistics on the history of environment interactions, allowing a proper comparison of the policies of different agents.
7 Conclusions
In this article, we have shown how to address the Partially Observable Markov Decision Problem behind the exploration of Rogue's dungeons, achieving a success rate of 98% with a simple technique that partitions the sample space into situations. Each situation is handled by a different A3C agent, all of them sharing a common value function. The interest of Rogue is that the planar, ASCII-based, two-dimensional interface permits decoupling vision from more intelligent activities such as planning: in this way we may better investigate and understand the most challenging aspects of the player's behavior.
The current version of the agent works very well, but still has some problems in cul-de-sac situations, where the agent should backtrack along its path. Moreover, to completely solve Rogue's exploration problem, dark rooms and hidden doors also need to be handled. We predict that the main challenge will come from hidden doors, since they are almost completely unpredictable and hard to detect even for a human. Different aspects of the game, such as collecting objects and fighting, could also be taken into account, possibly delegating them to ad hoc situations.
In spite of the fact that the overall performance of our agent is really good, its design is not yet entirely satisfactory. Too much knowledge about the game is built in, both in the design of situations and, especially, in their identification and attribution to specific networks. The rogue-centered, cropped view also introduces a major simplification of the problem, completely bypassing the attention problem (see e.g. [10]) that, as discussed in [3], was one of the interesting aspects of Rogue.
Currently, our efforts are going in the direction of designing an unsupervised version of the work described in this paper, where the agent is able to autonomously detect interesting situations, delegating them to specific subnets.
References
 [1] RMSPropOptimizer. https://www.tensorflow.org/api_docs/python/tf/train/RMSPropOptimizer
 [2] Asperti, A., Cortesi, D., Sovrano, F.: Partitioned A3C for Rogue-in-a-Box. https://github.com/FrancescoSovrano/PartitionedA3CforRogueInABox
 [3] Asperti, A., Pieri, C.D., Maldini, M., Pedrini, G., Sovrano, F.: A modular deep-learning environment for Rogue. WSEAS Transactions on Systems and Control 12 (2017), http://www.wseas.org/multimedia/journals/control/2017/a785903070.php
 [4] Asperti, A., Pieri, C.D., Pedrini, G.: Rogue-in-a-box: an environment for roguelike learning. International Journal of Computers 2, 146–154 (2017), http://www.iaras.org/iaras/filedownloads/ijc/2017/0060022(2017).pdf
 [5] Bellemare, M.G., Naddaf, Y., Veness, J., Bowling, M.: The arcade learning environment: An evaluation platform for general agents. J. Artif. Intell. Res. (JAIR) 47, 253–279 (2013). https://doi.org/10.1613/jair.3912

 [6] Cerny, V., Dechterenko, F.: Rogue-like games as a playground for artificial intelligence – evolutionary approach. In: International Conference on Entertainment Computing. pp. 261–271. Springer (2015)
 [7] Dilokthanakul, N., Kaplanis, C., Pawlowski, N., Shanahan, M.: Feature control as intrinsic motivation for hierarchical reinforcement learning. CoRR abs/1705.06769 (2017), http://arxiv.org/abs/1705.06769
 [8] van Hasselt, H., Guez, A., Silver, D.: Deep reinforcement learning with double Q-learning. CoRR abs/1509.06461 (2015), http://arxiv.org/abs/1509.06461
 [9] Jaderberg, M., Mnih, V., Czarnecki, W.M., Schaul, T., Leibo, J.Z., Silver, D., Kavukcuoglu, K.: Reinforcement learning with unsupervised auxiliary tasks. CoRR abs/1611.05397 (2016), http://arxiv.org/abs/1611.05397

 [10] Jaderberg, M., Simonyan, K., Zisserman, A., Kavukcuoglu, K.: Spatial transformer networks. In: Cortes, C., Lawrence, N.D., Lee, D.D., Sugiyama, M., Garnett, R. (eds.) Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7–12, 2015, Montreal, Quebec, Canada. pp. 2017–2025 (2015), http://papers.nips.cc/paper/5854-spatial-transformer-networks
 [11] Kempka, M., Wydmuch, M., Runc, G., Toczek, J., Jaskowski, W.: ViZDoom: A Doom-based AI research platform for visual reinforcement learning. CoRR abs/1605.02097 (2016), http://arxiv.org/abs/1605.02097

 [12] Klyubin, A.S., Polani, D., Nehaniv, C.L.: Empowerment: a universal agent-centric measure of control. In: Proceedings of the IEEE Congress on Evolutionary Computation, CEC 2005, 2–4 September 2005, Edinburgh, UK. pp. 128–135 (2005). https://doi.org/10.1109/CEC.2005.1554676
 [13] Kulkarni, T.D., Narasimhan, K., Saeedi, A., Tenenbaum, J.B.: Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. CoRR abs/1604.06057 (2016), http://arxiv.org/abs/1604.06057
 [14] Miyoshi, K.: UNREAL implementation. https://github.com/miyosuda/unreal
 [15] Mnih, V., Badia, A.P., Mirza, M., Graves, A., Lillicrap, T.P., Harley, T., Silver, D., Kavukcuoglu, K.: Asynchronous methods for deep reinforcement learning. CoRR abs/1602.01783 (2016), http://arxiv.org/abs/1602.01783
 [16] Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A.A., Veness, J., Bellemare, M.G., Graves, A., Riedmiller, M.A., Fidjeland, A., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., Hassabis, D.: Human-level control through deep reinforcement learning. Nature 518(7540), 529–533 (2015). https://doi.org/10.1038/nature14236
 [17] Singh, S.P., Barto, A.G., Chentanez, N.: Intrinsically motivated reinforcement learning. In: Advances in Neural Information Processing Systems 17 [Neural Information Processing Systems, NIPS 2004, December 13–18, 2004, Vancouver, British Columbia, Canada]. pp. 1281–1288 (2004), http://papers.nips.cc/paper/2552-intrinsically-motivated-reinforcement-learning
 [18] Song, Y., Xu, M., Zhang, S., Huo, L.: Generalization tower network: A novel deep neural network architecture for multi-task learning. CoRR abs/1710.10036 (2017), http://arxiv.org/abs/1710.10036
 [19] Sun, R., Peterson, T.: Multi-agent reinforcement learning: weighting and partitioning. Neural Networks 12(4–5), 727–753 (1999). https://doi.org/10.1016/S0893-6080(99)00024-6
 [20] Sutton, R.S., Barto, A.G.: Introduction to Reinforcement Learning. MIT Press, Cambridge, MA, USA, 1st edn. (1998)
 [21] Vezhnevets, A.S., Osindero, S., Schaul, T., Heess, N., Jaderberg, M., Silver, D., Kavukcuoglu, K.: FeUdal networks for hierarchical reinforcement learning. CoRR abs/1703.01161 (2017), http://arxiv.org/abs/1703.01161
 [22] Wang, Z.: Sample efficient actor-critic with experience replay (2016)