I Introduction
Recent research has focused on forward models of games that can be learned either through heuristic methods or using deep neural network architectures. These learned models can then be used by traditional planning algorithms, or as part of the architecture in reinforcement learning. Using neural networks in planning algorithms can be difficult, as the accuracy of predicted state observations tends to decrease with the number of steps that are simulated. This results in diminishing efficacy of planning algorithms when longer rollouts are used. Recent neural network models also tend to rely on a fixed-dimensional observation input to predict the rewards and subsequent states, and therefore struggle to generalize to games with differently sized observation spaces.
Heuristic rule-based algorithms for learning forward models [1] offer high performance when they work, but require human input regarding the form the rules will take.
Recent work on a local approach to learning forward models [2] shares some similarities with the Neural Game Engine, in that both methods are able to generalise to levels of a different size than those seen during training. Compared to [2], the Neural Game Engine works directly with pixels rather than tiles, and for many games also does accurate reward prediction.
Grid-based arcade-style games, although simple for humans to understand, still present highly challenging environments for artificial intelligence. In this paper a grid-based game refers to a game that is based on a grid of discrete tiles such as walls, floors, boxes and other game-specific items. A single agent has a set of actions it can perform at each time step, such as movement or interaction with other tiles in the grid. The agent is restricted to performing a single action at each time step. Additionally, each environment may have different grid dimensions, leading to variable observation space sizes. These game environments can be represented by a fully observable Markov decision process in which the states are the pixels of the environment, the actions are those of the agent and the rewards are given by the game score.
This paper proposes a novel architecture, the Neural Game Engine, based on a modified Neural GPU [3] [4].
The Neural Game Engine (NGE) can learn grid-based arcade-style games of any dimensions with very high accuracy over arbitrary numbers of game ticks. Additionally, the architecture can scale to grid games of any number of tiles without loss of accuracy. The NGE is trained on several deterministic games from the GVGAI environment [5], an updated version of pyVGDL [6] which provides many grid-based games under the OpenAI Gym wrapper [7]. Pre-trained models are available through the OpenAI Gym interface and are publicly available for future research at https://github.com/Bam4d/NeuralGameEngine.
The paper is structured as follows. Section II covers recent related research on learning forward models and how these models have been used, for example in statistical forward planning algorithms or reinforcement learning.
II Background
II-A Learning Forward Models
Humans are able to model the outcome of their actions. This is achieved through an internal model of the environment in which different actions can be tested and appropriately chosen. This inherent ability allows humans to perform tasks such as planning or seeking intrinsic rewards [8]. For artificially intelligent agents, having access to a model of the environment, or learning one through experience, is arguably an unavoidable step towards achieving human-level intelligence.
Deep neural networks have been used successfully to estimate forward models for various use cases. For example, curiosity-driven exploration [9] makes use of a forward model to measure uncertainty about particular states. This measure of uncertainty is used as intrinsic motivation to drive the agent to explore actions that lead to states that have not been seen before.
Similarly, predictions about how much information the agent has access to in certain states have been used to try to maximise the empowerment of the agent. If an agent is more empowered, it typically has greater access to states that will end in high rewards [10].
In many cases, it is very difficult to learn a model of the environment which can perfectly reproduce the environment dynamics and observation frames over many time steps, especially if the environment has stochastic elements such as enemies that move in unpredictable ways.
In [11], a combination of autoencoders and recurrent neural networks is used to predict the next frames of several OpenAI Gym games. An autoencoder encodes the data of a single frame into a latent state $z_t$. This latent state is then used in combination with an LSTM, which stores information about previous states, to output a probability density function $P(z_{t+1} \mid a_t, z_t, h_t)$, where $a_t$ is the action applied at time $t$ and $h_t$ is the hidden state produced by the previous LSTM cell. This distribution can then be used to produce the next frame.
More recently, generative models have been used to predict frames of environments by sampling from a learned distribution over the state being predicted. Generative models can capture both the stochastic and deterministic dynamics of game states, and can even predict the actions of non-player characters (NPCs).
Generative state space models [12] [13] [14] [15] encode state information into a tensor, typically 3-dimensional, instead of a latent vector, allowing a richer representation of the underlying environment states. These models are commonly combined with recurrent neural network techniques such as LSTMs and GRUs and are generally more accurate than latent vector encodings of states. State space representations are also used in [16] in order to predict future states without performing step-by-step rollouts.
II-B Local Forward Modelling
A recent successful method of learning forward models of grid-based games is local modelling. Local modelling takes advantage of the fact that in many games, the mechanics can be applied to small areas of the game environment independently of others. The simplest example of a local model is a 2D cellular automaton. It has been shown that, using local modelling, forward models of basic cellular-automata-based games can be learned by focussing on the rules which modify the state of single tiles based on the surrounding tiles [17] [2]. In [18] the model used for learning the forward model of the imagined Sokoban game is the equivalent of a local feed-forward cellular automata model. This technique is also used in [12] [15] for state-space transitions and during encoding and decoding of pixels to the latent state-space.
Local modelling is also used in [19] as part of a model-free RL agent. Because this architecture works well with tasks that usually require agents to plan, it is argued that although the architecture is not explicitly trained to reproduce the underlying environment model, it is learning to plan implicitly.
II-C Neural GPU
The Neural GPU (NGPU) architecture introduced in [3] and improved upon in [4] can learn several unbounded binary operations, for example multiplication, addition, reversal and duplication. This is achieved by effectively learning 1D cellular automata rules which are then applied over a number of steps until the result is achieved. The number of steps $n$ is typically proportional to the size of the binary digits being processed. The Neural GPU applies the cellular automata rules to an embedded representation of the binary digits using a convolutional gated recurrent unit (CGRU) with hard nonlinearities. The CGRU itself is described by the following set of update rules:

$$
\begin{aligned}
u &= \sigma'(U' * s_t + B') \\
r &= \sigma'(U'' * s_t + B'') \\
s_{t+1} &= u \odot s_t + (1 - u) \odot \tanh'(U * (r \odot s_t) + B)
\end{aligned}
\tag{1}
$$

In the above equations $U$, $U'$, $U''$ are convolutional kernel banks and $B$, $B'$, $B''$ are learnable biases. The operator $*$ describes a convolution of the left parameter over the right; for example, $U * s$ denotes the kernels in $U$ convolved over the values in $s$. $\tanh'$ and $\sigma'$ represent the hard nonlinearity versions of the tanh and sigmoid functions respectively, and $\odot$ represents the Hadamard (or element-wise) product between two tensors. Details of the hard nonlinearities are given in the original paper [4]. When dealing with binary operations, the Neural GPU takes an input of arbitrary length, containing the binary encoded digits and the operation to perform. The binary digits and operation symbol are embedded into the initial state $s_0$. This state is then iterated through the CGRU for $n$ steps and the final state $s_n$ is read out using a softmax layer which predicts the binary result.
As the Neural GPU can be seen as a recurrent application of learnable cellular automata rules, it is well suited to learning the local rules of grid-world games. This architecture is comparable to other state-space architectures that use size-preserving layers [18], [12], [19], with the exceptions that parameters are shared between layers, no latent state information is shared between frames and different gating mechanisms are explored.
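To make the gated update concrete, the following is a minimal sketch of one CGRU iteration on a one-dimensional, single-channel state. It is an illustration under stated assumptions rather than the implementation used in this work: the function and parameter names are hypothetical, each kernel bank is reduced to a single width-3 kernel, and the hard nonlinearities are assumed to be simple clipped-linear functions.

```python
def hard_tanh(x):
    # Assumed clipped-linear stand-in for the hard tanh nonlinearity.
    return max(-1.0, min(1.0, x))

def hard_sigmoid(x):
    # Assumed clipped-linear stand-in for the hard sigmoid nonlinearity.
    return max(0.0, min(1.0, (x + 1.0) / 2.0))

def conv1d(kernel, state):
    """Width-3 convolution with zero padding, so the output keeps the input length."""
    padded = [0.0] + list(state) + [0.0]
    return [sum(k * padded[i + j] for j, k in enumerate(kernel))
            for i in range(len(state))]

def cgru_step(state, U, B, U_u, B_u, U_r, B_r):
    """One gated update: new = u * old + (1 - u) * candidate."""
    u = [hard_sigmoid(v + B_u) for v in conv1d(U_u, state)]  # update gate
    r = [hard_sigmoid(v + B_r) for v in conv1d(U_r, state)]  # reset gate
    gated = [ri * si for ri, si in zip(r, state)]
    candidate = [hard_tanh(v + B) for v in conv1d(U, gated)]
    return [ui * si + (1.0 - ui) * ci
            for ui, si, ci in zip(u, state, candidate)]

# With the update gate forced closed and the kernel [1, 0, 0], every cell
# copies its left neighbour, which is a simple cellular automaton rule.
shifted = cgru_step([0.5, -0.25, 1.0],
                    U=[1.0, 0.0, 0.0], B=0.0,
                    U_u=[0.0, 0.0, 0.0], B_u=-5.0,
                    U_r=[0.0, 0.0, 0.0], B_r=5.0)
```

Because the same kernels are applied at every cell, the number of parameters in this sketch does not depend on the length of the state.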
III Neural Game Engine
The Neural Game Engine is a neural network architecture with a modified Neural GPU at its core. The main modifications to the Neural GPU are outlined in this section. In the Neural Game Engine the state takes a two-dimensional shape whose width and height reflect the width and height, in tiles, of the game being trained, with an additional dimension for the channels. Each vector stored at a grid position represents a single tile in the grid environment. The convolutional kernel banks are also modified to be two-dimensional. The stride and zero-padding are kept the same as in the original paper. As there are no diagonal movements allowed in any GVGAI environment, the kernels are also masked to ignore the non-adjacent cells. As in the NGPU, an iteration of the CGRU unit applied to an input state produces a new state. The number of iterations of the CGRU cell per frame of the game state is tuned as a hyperparameter.
The width and height of the games in the GVGAI environment can be any positive integer value. Because changing the width and height does not change the number of parameters in the underlying Neural GPU, the Neural Game Engine can generalize to any width and height. This unbounded computation of game state is discussed further in section V-D.
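The size independence described above can be illustrated with a small sketch of a masked, size-preserving 2D convolution. This is a simplified single-channel example, not the trained model: the 3x3 kernel size, the plus-shaped mask and the name `masked_conv2d` are assumptions made for illustration.

```python
# Plus-shaped mask: the centre cell and its four orthogonal neighbours.
PLUS_MASK = [[0, 1, 0],
             [1, 1, 1],
             [0, 1, 0]]

def masked_conv2d(grid, kernel, mask=PLUS_MASK):
    """Apply one masked 3x3 kernel to a grid of any size, with zero padding."""
    h, w = len(grid), len(grid[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total = 0.0
            for ky in range(3):
                for kx in range(3):
                    yy, xx = y + ky - 1, x + kx - 1
                    if 0 <= yy < h and 0 <= xx < w:  # cells outside the grid count as zero
                        total += kernel[ky][kx] * mask[ky][kx] * grid[yy][xx]
            out[y][x] = total
    return out

kernel = [[1.0] * 3 for _ in range(3)]  # the same 9 parameters for every grid size
small = masked_conv2d([[1.0] * 3 for _ in range(2)], kernel)
large = masked_conv2d([[1.0] * 7 for _ in range(5)], kernel)
```

The same `kernel` is reused for the 2x3 and the 5x7 grid, and the output always has the shape of the input, which is the property that lets a local rule learned on small levels be applied to larger ones.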
In many reinforcement learning techniques, the rewards that the game provides to the player are augmented in order to aid exploration, modify the agent’s goals, or provide auxiliary losses to reduce training time [20], [21]. In some cases the rewards supplied by the environment are modified from their original values with a technique known as reward shaping [22].
Reward prediction in the Neural Game Engine aims to reproduce the original game rewards as accurately as possible, while treating reward prediction as a process entirely separate from the game mechanics.
At every time step the Neural GPU is applied to an encoded observation image and iterated for the configured number of steps; the result is then decoded to give the next observation state and reward.
The architecture for a single time-step calculation is shown in figure 1.
III-A Observation Encoder
In the GVGAI environment, tiles in the trained games all have the same width and height in pixels. This consistency allows the tiles to be embedded into a tensor with the same dimensions as the NGPU initial state. This tile embedding is achieved by using a convolutional neural network with kernel width, kernel height and stride all set to the tile size, input channels set to 3 to reflect the RGB components of the image, and output channels set to the number of channels in the NGPU state.
III-B Observation Decoder
To render the game pixels, a mapping from the underlying embedded tile representations to the pixel representations of the tiles is learned. This mapping takes the form of a transposed convolution with kernel size and stride set to the tile size. The number of output channels is set to 3 to reproduce the RGB components. This mapping recovers a tensor with the shape of the original image, which can be rendered.
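Setting the kernel size equal to the stride means that the encoder and decoder each treat every tile independently. The sketch below shows only that partitioning step, not the learned weights: `split_into_tiles` and `join_tiles` are hypothetical helper names, and pixels are scalars here rather than RGB triples to keep the example short.

```python
def split_into_tiles(image, tile):
    """Cut an H x W image (list of rows) into a grid of tile x tile patches."""
    h, w = len(image), len(image[0])
    assert h % tile == 0 and w % tile == 0, "image must be a whole number of tiles"
    return [[[row[x:x + tile] for row in image[y:y + tile]]
             for x in range(0, w, tile)]
            for y in range(0, h, tile)]

def join_tiles(tiles):
    """Inverse of split_into_tiles: paste the patches back into one image."""
    image = []
    for tile_row in tiles:
        for r in range(len(tile_row[0])):
            image.append([p for patch in tile_row for p in patch[r]])
    return image

# A 4 x 6 pixel image with 2 x 2 tiles gives a 2 x 3 grid of patches.
image = [[y * 10 + x for x in range(6)] for y in range(4)]
tiles = split_into_tiles(image, tile=2)
```

In the encoder each patch would be mapped to a channel vector by a learned kernel, and in the decoder each channel vector would be mapped back to a patch of pixels.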
III-C Action Conditioning
As the action needs to be considered as part of the local rule calculations in the NGPU, information about the action must be available in the state, along with the observations. To achieve this, the action is one-hot encoded and then embedded with a linear layer whose output size equals the number of channels in the NGPU state. This embedding is then added to each cell of the embedded observation. In practice this can be achieved by tiling the one-hot representation of the action into a tensor whose width and height match those of the NGPU state and whose depth is the cardinality of the set of actions for the game. This tensor can then be passed to a 1x1 convolutional layer with as many output channels as the NGPU state. The resulting tensor is added to the embedded observation, which gives the initial state of the Neural GPU.
III-D Reward Observation Encoder
The reward observation encoder consists of a tile embedding layer similar to the observation encoder, encoding each tile into a vector of channels and giving an embedded observation state with one vector per tile.
III-E Reward Action Conditioning
A separate action conditioning network encodes the action at each step as a one-hot vector, which is then embedded by a linear layer and added to each of the embedded tile vectors, giving the reward state. This process is identical to the NGPU action conditioning; the only difference is that the number of channels may differ depending on hyperparameter choices.
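The action conditioning used for both the NGPU state and the reward state can be sketched as follows. This is a minimal illustration with hypothetical names and hand-picked weights; a bias-free linear layer is assumed, and the state is stored as nested lists indexed as `state[y][x][channel]`.

```python
def one_hot(action, num_actions):
    return [1.0 if i == action else 0.0 for i in range(num_actions)]

def embed_action(action, weights):
    """Bias-free linear layer: weights[a][k] is channel k of the embedding of action a."""
    hot = one_hot(action, len(weights))
    channels = len(weights[0])
    return [sum(hot[a] * weights[a][k] for a in range(len(weights)))
            for k in range(channels)]

def condition_state(state, action, weights):
    """Add the same action embedding to every cell of the embedded observation."""
    emb = embed_action(action, weights)
    return [[[v + e for v, e in zip(cell, emb)] for cell in row] for row in state]

# 3 possible actions embedded into 2 channels (illustrative values only).
weights = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
embedded_observation = [[[0.0, 0.0] for _ in range(2)] for _ in range(2)]
initial_state = condition_state(embedded_observation, action=1, weights=weights)
```

Adding one embedding to every cell is equivalent to the tiled one-hot tensor followed by a 1x1 convolution, since a 1x1 convolution applies the same linear map at each grid position.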
III-F Reward Decoder
In order to decode the rewards from the reward state, a convolutional layer with kernel size 3 and padding 1 is used, followed by two convolutional layers with kernel size 1, no padding and a decreasing number of channels in each layer. A final convolutional layer with kernel size 3 reduces the number of channels to 16, leaving an arbitrary height and width. Global max pooling is applied across the remaining height and width dimensions, leaving 16 outputs. These 16 outputs are then trained with categorical cross-entropy loss to predict an 8-bit binary number corresponding to the reward. Predicting binary rewards in this way, instead of regressing a scalar value, reduces reward prediction to a classification problem. This has the effect that variance in underlying parameters can remain low. Negative reward values given in the original environment are currently ignored as the predicted binary number is unsigned. To support negative rewards, a sign bit or two’s complement encoding could be used. To predict fractional rewards, a float or double encoding could be used.
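The binary reward encoding can be illustrated with a short sketch. The pairing of the 16 outputs into 8 two-class decisions (one pair of scores per bit) is an assumption made for this example, as are the function names; the text above only states that 16 outputs predict an 8-bit unsigned number.

```python
def reward_to_bits(reward, num_bits=8):
    """Unsigned big-endian binary encoding of an integer reward."""
    if not 0 <= reward < 2 ** num_bits:
        raise ValueError("reward out of range for unsigned encoding")
    return [(reward >> i) & 1 for i in reversed(range(num_bits))]

def bits_to_reward(bits):
    value = 0
    for b in bits:
        value = (value << 1) | b
    return value

def decode_outputs(outputs, num_bits=8):
    """Assumed layout: outputs[2*i] scores bit i being 0, outputs[2*i + 1] scores it being 1."""
    bits = [1 if outputs[2 * i + 1] > outputs[2 * i] else 0 for i in range(num_bits)]
    return bits_to_reward(bits)

target_bits = reward_to_bits(5)
```

Under this layout each bit is an independent two-class classification target, which is what allows a cross-entropy loss to be used in place of a regression loss.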
IV Neural GPU enhancements
IV-A 2D Diagonal Gating
Diagonal gating, introduced in [4], is a technique used in the NGPU architecture to allow state cell values to be passed directly to neighbouring state cells. In the context of a grid-world game, it follows that information such as tile type could be transferred in this manner. The state of the original NGPU is a one-dimensional vector and thus its diagonal gating mechanism allows it to copy state information from the left and right cells. The state of the underlying NGPU in the Neural Game Engine is two-dimensional, which means that the diagonal gating mechanism can copy from above and below, as well as left and right. To achieve this, the state is now split into 5 parts and a 2D convolution operator with fixed kernels, as shown in equation 2, is used.
[Equation (2): fixed 2D convolution kernels for diagonal gating]
IV-B Selective Gating
One of the issues with diagonal gating is that the copying of state information is unidirectional for the state values in each cell. To illustrate this issue, consider the values in any one of the sub-states. The values in each sub-state are only shifted in a single direction. This means that the sub-states that are shifted in one direction are not the same sub-states that can be shifted in the other directions. This unidirectional flow does not allow consistent copying of state information across all directions. Intuitively, this means that if a tile moves upwards, the state information it brings to the cell above cannot be moved to the left, right or even back to the cell that it started in.
To alleviate this issue, a selective gating mechanism is proposed which allows values to be copied in any direction for any value in any cell.
The selection mechanism works by learning a classifier that, given the state tensor, outputs a selection tensor in which the choice of gating direction (up, down, left, right, center) for each state value is one-hot encoded into the last dimension. The selection tensor is created by applying a convolution operation to the state with a kernel size of 3x3, stride of 1, padding of 1 and 5 output channels for every state channel. The channels are then reshaped to separate out the 5 directions and a softmax is applied across the directions to give a selection for each of the values. The selection tensor is then multiplied by a tensor containing 5 directionally shifted versions of the original state, which gives the new state.
[Equation (3): new state as the selection-weighted combination of the 5 shifted states]
The shifting operation can be achieved by convolution with a fixed kernel that copies states from adjacent cells. Zero padding of 1 is applied so the state retains its original shape. For example:
[Equation (4): example fixed shift kernel]
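The shifting and selection steps can be sketched for a single-channel state as follows. This is an illustrative simplification rather than the trained mechanism: in the architecture the selection scores come from a learned convolution and are produced per state value, whereas here they are passed in directly, and all names are hypothetical.

```python
import math

# Offset (dx, dy) of the neighbour each direction copies from; y grows downwards.
DIRECTIONS = {"center": (0, 0), "up": (0, -1), "down": (0, 1),
              "left": (-1, 0), "right": (1, 0)}

def shift(grid, direction):
    """Each cell copies the value of its neighbour in `direction`; zero outside the grid."""
    dx, dy = DIRECTIONS[direction]
    h, w = len(grid), len(grid[0])
    return [[grid[y + dy][x + dx] if 0 <= y + dy < h and 0 <= x + dx < w else 0.0
             for x in range(w)] for y in range(h)]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def selective_gate(grid, scores):
    """scores[y][x] holds 5 selection scores, ordered as in DIRECTIONS."""
    shifted = [shift(grid, d) for d in DIRECTIONS]
    h, w = len(grid), len(grid[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            weights = softmax(scores[y][x])
            out[y][x] = sum(wt * s[y][x] for wt, s in zip(weights, shifted))
    return out

grid = [[1.0, 2.0], [3.0, 4.0]]
# A strongly peaked score for "left" makes every cell copy its left neighbour.
pick_left = [[[0.0, 0.0, 0.0, 50.0, 0.0] for _ in range(2)] for _ in range(2)]
gated = selective_gate(grid, pick_left)
```

Because the selection is made per cell, one cell can copy from the left while its neighbour copies from above, which is the flexibility that fixed directional sub-states lack.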
IV-C Evaluation Methodology
The aim of the experiments is to reach pixel-perfect reproduction of the original GVGAI games over arbitrarily long time frames for levels of any dimensions. In order to achieve this, the network must learn the game mechanics on a symbolic level and then be able to apply these to larger game states.
The results presented in this paper are obtained on the game Sokoban, as it is a good example of a GVGAI game with local rules.
In order to measure the accuracy of the reproduction of the game, two related measures are used: firstly, the mean-squared error of the raw pixel outputs at each step and secondly, a closest tile F1 measure. The closest tile measure is created by first taking a tile map of the original observation, with one element per grid cell, where each element corresponds to an index into the set of possible tiles. A second tile map is then created by finding the closest matching tile in the set of tiles for each tile-sized patch in the predicted observation. The closest tile F1 measure is calculated as the mean of the F1 scores for each of the tiles across the two tile maps. The F1 scores are generated by measuring the precision and recall of the tile predictions.
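The closest tile F1 measure can be sketched as below. The details are assumptions made for illustration: patches and reference tiles are flattened lists of pixel values, the closest tile is chosen by mean squared error, tile maps are flattened lists of tile indices, and tiles absent from both maps are skipped when averaging.

```python
def closest_tile(patch, tiles):
    """Index of the reference tile with the lowest mean squared error to `patch`."""
    def mse(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)
    return min(range(len(tiles)), key=lambda i: mse(patch, tiles[i]))

def macro_f1(true_map, pred_map, num_tiles):
    """Mean of the per-tile F1 scores computed from two flattened tile maps."""
    scores = []
    for t in range(num_tiles):
        tp = sum(1 for a, b in zip(true_map, pred_map) if a == t and b == t)
        fp = sum(1 for a, b in zip(true_map, pred_map) if a != t and b == t)
        fn = sum(1 for a, b in zip(true_map, pred_map) if a == t and b != t)
        if tp + fp + fn == 0:
            continue  # tile absent from both maps (assumed to be ignored)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        scores.append(2 * precision * recall / (precision + recall)
                      if precision + recall else 0.0)
    return sum(scores) / len(scores)

# Two reference tiles (dark and bright) and a slightly noisy predicted patch.
reference_tiles = [[0.0, 0.0], [1.0, 1.0]]
predicted_index = closest_tile([0.9, 0.8], reference_tiles)
score = macro_f1([0, 0, 1, 1], [0, 1, 1, 1], num_tiles=2)
```

Snapping each predicted patch to its nearest reference tile makes the measure insensitive to small pixel noise, so it reports whether the symbolic game state is correct rather than how sharp the rendering is.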
Alongside the pixel accuracy, the rewards given by the environment are learned. Reward error is measured by converting the real reward values to a binary representation and then calculating the cross-entropy loss. Reward accuracy is measured using precision, recall and F1 score of the binary classifications.
IV-D Training
In order to obtain accurate rollouts over long time periods, for any size network, the training data is generated in a way that does not bias towards game sizes, numbers of tiles (such as walls, boxes and holes in Sokoban), or particular RL or planning policies.
Level generation for GVGAI games has been explored in [23], [24] and [25]. However, these generators are aimed at producing levels that either help RL agents to learn or are pleasing to human players.
To generate levels for learning the environment dynamics, the probability of an agent interacting with different types of tiles must be evenly distributed. To achieve this, levels are randomly generated with height and width between configured minimum and maximum values. GVGAI environments typically contain 5 pre-built levels. These pre-built levels are used to calculate the probability of each tile being placed in the environment. Tiles are positioned with these calculated probabilities, with the caveat that wall tiles are always placed on the edges of the game state if this is consistent with the 5 pre-built levels. Additionally, tiles that only appear once in each level are placed only once in generated levels.
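A minimal sketch of this kind of level generator is given below. The tile symbols, the probabilities and the function name are hypothetical, and the sketch assumes that walls always border the level and that the agent is the only tile placed exactly once.

```python
import random

def generate_level(width, height, tile_probs, singletons=("A",), wall="w", rng=None):
    """Random level: wall border, interior tiles sampled from tile_probs,
    and each singleton tile placed exactly once."""
    rng = rng or random.Random()
    tiles = list(tile_probs)
    weights = [tile_probs[t] for t in tiles]
    level = [[wall] * width for _ in range(height)]  # start with walls everywhere
    interior = [(x, y) for y in range(1, height - 1) for x in range(1, width - 1)]
    for x, y in interior:
        level[y][x] = rng.choices(tiles, weights=weights)[0]
    # Singleton tiles overwrite distinct, randomly chosen interior cells.
    for tile, (x, y) in zip(singletons, rng.sample(interior, len(singletons))):
        level[y][x] = tile
    return level

rng = random.Random(0)
width, height = rng.randint(5, 8), rng.randint(5, 8)  # assumed size limits
level = generate_level(width, height, {".": 0.7, "w": 0.2, "b": 0.1}, rng=rng)
```

In practice the entries of `tile_probs` would be estimated from the tile frequencies of the pre-built levels rather than chosen by hand.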
A random agent is used to generate experience data in the environment. To improve the distribution of training data, each step is augmented by creating 8-way tile-symmetrical observations and actions. Each step of learning uses mini-batch gradient descent, where the batch contains the symmetrical experiences. Batch sizes are fixed at 32 state transitions, giving a total of 256 frame transitions per batch after augmentation. As in the Neural GPU, a saturation cost is calculated for the hard nonlinearities in the CGRU units; however, these saturation costs are averaged across the batch instead of summed, which increased the stability of training and improved training results overall. The saturation limit in all experiments is set at 0.99 and weighted at 0.001. Unlike in the original paper, the saturation cost is not clamped at any value with respect to the overall loss.
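The 8-way symmetry augmentation can be sketched on a tile grid as follows. The sketch assumes movement actions represented as `(dx, dy)` direction vectors with y growing downwards; the function names are hypothetical, and non-movement actions, which would be left unchanged, are not handled.

```python
def rot90(grid):
    """Rotate a grid (list of rows) 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]

def flip(grid):
    """Mirror a grid left to right."""
    return [row[::-1] for row in grid]

def rot90_action(action):
    dx, dy = action
    return (-dy, dx)  # right -> down -> left -> up

def flip_action(action):
    dx, dy = action
    return (-dx, dy)

def symmetries(grid, action):
    """All 8 rotations and reflections of a (grid, action) pair."""
    out = []
    for mirrored in (False, True):
        g, a = (flip(grid), flip_action(action)) if mirrored else (grid, action)
        for _ in range(4):
            out.append((g, a))
            g, a = rot90(g), rot90_action(a)
    return out

# Agent "A" moving right on a 3 x 2 level with one wall "#".
grid = [["A", "."], [".", "#"], [".", "."]]
augmented = symmetries(grid, (1, 0))
```

Each recorded transition therefore yields 8 training examples, which is how 32 state transitions become 256 frame transitions per batch. The action must be transformed together with the observation, otherwise the augmented pairs would describe moves that the game rules do not produce.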
As the observation predictions at each time step become the inputs for the next prediction, errors can build up over time and cause the rollout accuracy to decrease rapidly. During experiments, the Prediction Dependent Training (PDT) technique introduced in [26], coupled with a curriculum schedule, was employed, which increased accuracy and training stability. Observation noise is also added to the training data; this was integral to achieving high accuracy.
In order to evaluate the training progress, rollouts are performed every 200 epochs using real game levels from the GVGAI environment. 3 repeats of rollouts of length 100 are performed, and the pixel mean-squared error and closest tile F1 measure are calculated. All training and testing is performed on a single Ubuntu 18.04 machine with an NVIDIA 2080 Ti GPU, an Intel® Core™ i7-6800K CPU and OpenBLAS (0.2.20) libraries installed.
V Experiments and Results
V-A Comparison of gating mechanisms
In figure 2, the NGE architecture using the different NGPU gating mechanisms described in section IV is shown. Even with no diagonal or selective gating, the NGE can learn very accurate models of game environments. In the experiments, selective gating had a small advantage in stability over long time horizons; this is also reflected in table I.
V-B Comparison with other methods
The best Neural Game Engine (NGE) model is compared against several common networks from recent literature on the game Sokoban. Reward prediction is not analysed as it is performed by a separate network. The network architectures that are compared are the following:
Feed-Forward (FF)
This model replaces the NGPU module with two feed-forward convolutional layers with kernel size 3, stride 1 and padding 1. This is the equivalent of the basic block used in [18] when training Sokoban. The model compared does not use pool-and-inject layers as Sokoban has no long-distance dependencies that require global state changes. This model is commonly used as the deterministic component of generative state-space architectures and is well suited to deterministic grid environments.
Recurrent Environment Simulators (RES)
The state of the game is encoded into a latent state using an autoencoder. This latent state then forms the input to an LSTM unit which can store past state information in its hidden state. This model is equivalent to the Recurrent Environment Simulator (RES) [26] and to other models that use an autoencoder to create a latent state.
Stochastic State Space (sSS)
The most complex model which, like the NGE, heavily uses cellular-automata-like layers which encode pixel information into a compressed grid. The model differs from the NGE in that it works with continuous and stochastic environments, and therefore uses sampling in order to produce the output observations.
Figure 3 shows the comparison of these 4 methods with the same input data and number of epochs. The training in this experiment is limited to random grids of fixed size (10x10), because the RES and sSS models contain architectural components that cannot generalize to grids of different sizes. Each method trains to high accuracy very quickly, after which the error plateaus, leading to a maximum accuracy. In the case of Sokoban, the FF, sSS and NGE methods have a slight advantage as Sokoban is naturally suited to local modelling. However, the sSS model is disadvantaged by the fact that it contains stochastic components that are trying to model completely deterministic state transitions.
V-C Ablation Testing
An important feature of many games is that interactions between adjacent cells can depend on other surrounding cells. For example in Sokoban, when pushing a movable block against a wall, a 3x3 grid around the location of the agent will not take into account the wall when calculating the next state of the cell currently occupied by the agent. However, the NGPU accounts for these non-local interactions when it iterates during a single time step. This effectively lets cells share information during the processing of a single state. With more than one iteration, the NGPU can share information from more distant cells, encompassing the wall that the block cannot be moved past. Other models such as those used in [12] [18] use similar techniques, but use fixed networks with different convolutional network sizes and apply residual layers. Using an NGPU with multiple iterations removes the requirement for multiple layers of convolutions and residual connections, making the network much simpler and smaller.
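The growth of the region that can influence a cell as the local rule is iterated can be sketched as follows. This assumes a plus-shaped neighbourhood (orthogonal neighbours only), and the function name is hypothetical.

```python
def influence(n):
    """Offsets (dx, dy) of the cells that can affect a given cell after
    n iterations of a plus-shaped local rule."""
    cells = {(0, 0)}
    for _ in range(n):
        cells |= {(x + dx, y + dy) for x, y in cells
                  for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))}
    return cells

# Agent at (0, 0), box at (1, 0), wall at (2, 0).
wall = (2, 0)
seen_after_one = wall in influence(1)
seen_after_two = wall in influence(2)
```

With a single iteration the agent's cell can only be affected by cells at distance 1, so the wall behind the box is out of reach; with two iterations the reach extends to distance 2 and includes the wall.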
In [4], diagonal gating is used to share state information between adjacent cells. As described in section IV-B, this only allows single-direction information flow, which is reflected in the higher error rate of NGE models using diagonal gating.
To test that the iteration of the NGPU is vital for information flow in local interactions, two experiments are performed under the same conditions as the high-performing models. The first uses only a single NGPU step and no PDT during training. The second uses a single NGPU step together with the PDT described in section IV-D. The second experiment aims to rule out the possibility that local information could be transported through pixels. The results are shown in figure 4.
In both the 2-step and 2-step+PDT experiments, the accuracy achieved is very high, but with the single-step options, the accuracy plateaus at a much lower value and the prediction error remains high. This result shows that multiple steps of the NGPU are vital to achieving high accuracy. It is also important to note that the 2-layer FF model in figure 3 could not achieve this high accuracy either.
V-D Unbounded Computation
To test the generalization ability of the trained NGE, the models trained in section IV-D are used to play several levels with much larger dimensions than those seen during training. The NGE rollouts on these larger levels are then compared against the original GVGAI environment with an identical starting state and action list. The two measures of accuracy (pixel mean-squared error and closest tile F1) described in section IV-C are used. For each model, the two measures are calculated for each step up to 500 and averaged over 10 repeats. These results are shown in table I.

| Grid Size | Pixel MSE | Closest tile F1 |
|---|---|---|
| 30x30 | 7.5e-6 | 1.0 |
| 50x50 | 8.3e-6 | 1.0 |
| 70x70 | 7.9e-6 | 1.0 |
| 100x100 | 7.9e-6 | 1.0 |
V-E Results on GVGAI games
The results of training the Neural Game Engine on several GVGAI games are shown in table II. The rollouts follow the same setup described in section IV-C. Games that result in scores of 1.0 show that the underlying game rules are learned accurately and the NGE does not make any mistakes when tested. Reward F1 scores can be interpreted in the same way. Most of the tested games achieve high accuracy; however, there are some game mechanics that cannot be supported by the NGE without modifications. As an example, clusters completely fails to learn the reward function. Rewards are fairly common in the game and the forward model itself learns very accurately, so the reason for this is unclear. The game aliens is included as an example of a game that has stochastic (the enemies randomly shoot at the player) and partially observable (the enemies spawn from a location that has no visible markers) components. The reward function of aliens is partially learned by the NGE; however, the closest tile F1 score is 0.73, meaning that just over a quarter of the tiles are predicted incorrectly.

| Game | Pixel MSE | Closest tile F1 | Reward F1 |
|---|---|---|---|
| sokoban | 7.5e-6 | 1.0 | 1.0 |
| cookmepasta | 9e-4 | 0.98 | 0.83 |
| bait | 5.2e-4 | 0.97 | 0.99 |
| brainman | 3.6e-4 | 0.97 | 1.0 |
| labyrinth | 1.6e-5 | 0.97 | 1.0 |
| realsokoban | 1.8e-3 | 0.86 | 1.0 |
| painter | 4.6e-6 | 1.0 | 1.0 |
| clusters | 1.3e-5 | 1.0 | 0.0 |
| zenpuzzle | 8.2e-6 | 1.0 | 1.0 |
| aliens | 5.1e-3 | 0.73 | 0.85 |
VI Discussion
There are several interesting applications for games trained with the NGE architecture. For example, because games can be learned with very high accuracy over long time horizons, the learned models can be used in planning algorithms. Additionally, because these games run entirely on the GPU, their sample rate and ability to be parallelized mean that they can be used as efficient environments for reinforcement learning experimentation.
There are two main limitations of the NGE architecture: it cannot model stochastic game elements or global state changes. Further experimentation and research is required to overcome these limitations. One approach could be to use NGPU modules in place of the deterministic size-preserving layers in sSS models.
One large area for improvement for the Neural Game Engine is that the statistical method of level generation and random agent movement does not produce enough examples of some local patterns. In many cases, tweaking the random level generation parameters is enough to give the NGPU a distribution which greatly improves the accuracy of training. The data distribution of local states used to train the NGPU could therefore be greatly improved. Using curiosity-driven agents or planning agents may provide much better data distributions for learning rewards, but such agents may avoid areas of low reward and therefore not learn the full game dynamics.
Another limitation is that the Neural Game Engine only predicts a single time step into the future, so events that do not change the observational state are completely lost. For example, in some games the agent picks up a key and then the agent tile changes to show the agent holding a key. Once the agent has a key, the agent can open a door. The NGE learns these dynamics well and learns that if the agent lands on a tile with a key, it changes to an agent with a key and can then interact with a door. However, if the fact that the agent is holding a key does not change the agent tile, the NGE has no knowledge of this at the next step and the information is lost. This could be addressed by following the latent state space model training techniques used in [12], [27] and [28], where observations are predicted several steps into the future without decoding the visual information between steps.
VII Conclusion
In this paper, the Neural Game Engine architecture is proposed as a method of learning very accurate forward models for grid-world games. The Neural Game Engine architecture, which is built upon the Neural GPU, learns a set of underlying local rules that can be applied over several iterations rather than stacking layers with different parameters. Improvements to the Neural GPU architecture such as selective gating are introduced which enable it to be applied to predicting the forward dynamics of games. This paper shows that this method has many advantages: fast learning time, very high accuracy over long time horizons, and fast and easily parallelised execution. The Neural Game Engine shows higher accuracy at predicting the state transitions in the game Sokoban when compared to similar state space models that are used in several model-based reinforcement learning applications. Additionally, the Neural Game Engine is shown to generalize well to different game environment dimensions not seen during training.
References
 [1] A. Dockhorn and D. Apeldoorn, “Forward model approximation for general video game learning,” in 2018 IEEE Conference on Computational Intelligence and Games (CIG). IEEE, aug 2018, pp. 1–8. [Online]. Available: https://ieeexplore.ieee.org/document/8490411/
 [2] A. Dockhorn, S. M. Lucas, V. Volz, I. Bravi, R. D. Gaina, and D. Perez-Liebana, “Learning local forward models on unforgiving games,” accepted at IEEE Conference on Games (CoG), 2019.
 [3] L. Kaiser and I. Sutskever, “Neural GPUs learn algorithms,” arXiv, nov 2015. [Online]. Available: https://arxiv.org/abs/1511.08228
 [4] K. Freivalds and R. Liepins, “Improving the neural GPU architecture for algorithm learning,” arXiv, feb 2017. [Online]. Available: https://arxiv.org/abs/1702.08727
 [5] D. Perez-Liebana, J. Liu, A. Khalifa, R. D. Gaina, J. Togelius, and S. M. Lucas, “General video game AI: a multi-track framework for evaluating agents, games and content generation algorithms,” arXiv, feb 2018. [Online]. Available: https://arxiv.org/abs/1802.10363
 [6] T. Schaul, “A video game description language for model-based or interactive learning,” in 2013 IEEE Conference on Computational Intelligence in Games (CIG). IEEE, aug 2013, pp. 1–8. [Online]. Available: http://ieeexplore.ieee.org/document/6633610/
 [7] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba, “Openai gym,” 2016.
 [8] J. Schmidhuber, “Formal theory of creativity, fun, and intrinsic motivation (1990–2010),” IEEE Transactions on Autonomous Mental Development, vol. 2, no. 3, pp. 230–247, sep 2010. [Online]. Available: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=5508364
 [9] D. Pathak, P. Agrawal, A. A. Efros, and T. Darrell, “Curiosity-driven exploration by self-supervised prediction,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). IEEE, jul 2017, pp. 488–489. [Online]. Available: http://ieeexplore.ieee.org/document/8014804/
 [10] R. Houthooft, X. Chen, Y. Duan, J. Schulman, and P. Abbeel, “VIME: Variational information maximizing exploration,” may 2016. [Online]. Available: https://www.researchgate.net/publication/303698546_VIME_Variational_Information_Maximizing_Exploration
 [11] D. Ha and J. Schmidhuber, “World models,” arXiv, mar 2018. [Online]. Available: https://arxiv.org/abs/1803.10122
 [12] L. Buesing, T. Weber, S. Racaniere, S. M. A. Eslami, D. Rezende, D. P. Reichert, F. Viola, F. Besse, K. Gregor, D. Hassabis, and D. Wierstra, “Learning and querying fast generative models for reinforcement learning,” arXiv, feb 2018. [Online]. Available: https://arxiv.org/abs/1802.03006
 [13] K. Gregor, D. J. Rezende, F. Besse, Y. Wu, H. Merzic, and A. v. d. Oord, “Shaping belief states with generative environment models for RL,” arXiv, jun 2019. [Online]. Available: https://arxiv.org/abs/1906.09237v2
 [14] D. Hafner, T. Lillicrap, I. Fischer, R. Villegas, D. Ha, H. Lee, and J. Davidson, “Learning latent dynamics for planning from pixels,” arXiv, nov 2018. [Online]. Available: https://arxiv.org/abs/1811.04551
 [15] D. Hafner, T. Lillicrap, J. Ba, and M. Norouzi, “Dream to control: Learning behaviors by latent imagination,” arXiv, dec 2019. [Online]. Available: https://arxiv.org/abs/1912.01603
 [16] K. Gregor, G. Papamakarios, F. Besse, L. Buesing, and T. Weber, “Temporal difference variational autoencoder,” arXiv, jun 2018. [Online]. Available: https://arxiv.org/abs/1806.03107
 [17] S. M. Lucas, A. Dockhorn, V. Volz, C. Bamford, R. D. Gaina, I. Bravi, D. Perez-Liebana, S. Mostaghim, and R. Kruse, “A local approach to forward model learning: Results on the game of life game,” arXiv, mar 2019. [Online]. Available: https://arxiv.org/abs/1903.12508
 [18] T. Weber, S. Racanière, D. P. Reichert, L. Buesing, A. Guez, D. J. Rezende, A. P. Badia, O. Vinyals, N. Heess, Y. Li, R. Pascanu, P. Battaglia, D. Hassabis, D. Silver, and D. Wierstra, “Imagination-augmented agents for deep reinforcement learning,” arXiv, jul 2017. [Online]. Available: https://arxiv.org/abs/1707.06203
 [19] A. Guez, M. Mirza, K. Gregor, R. Kabra, S. Racanière, T. Weber, D. Raposo, A. Santoro, L. Orseau, T. Eccles, G. Wayne, D. Silver, and T. Lillicrap, “An investigation of model-free planning,” arXiv, jan 2019. [Online]. Available: https://arxiv.org/abs/1901.03559
 [20] E. Shelhamer, P. Mahmoudieh, M. Argus, and T. Darrell, “Loss is its own reward: Self-supervision for reinforcement learning,” arXiv, dec 2016. [Online]. Available: https://arxiv.org/abs/1612.07307
 [21] M. Andrychowicz, F. Wolski, A. Ray, J. Schneider, R. Fong, P. Welinder, B. McGrew, J. Tobin, P. Abbeel, and W. Zaremba, “Hindsight experience replay,” arXiv, jul 2017. [Online]. Available: https://arxiv.org/abs/1707.01495
 [22] M. Grzes, “Reward shaping in episodic reinforcement learning,” in Proceedings of the 16th Conference on Autonomous Agents and MultiAgent Systems, ser. AAMAS ’17. Richland, SC: International Foundation for Autonomous Agents and Multiagent Systems, 2017, pp. 565–573.
 [23] A. Khalifa, D. Perez-Liebana, S. M. Lucas, and J. Togelius, “General video game level generation,” in Proceedings of the 2016 on Genetic and Evolutionary Computation Conference - GECCO ’16, T. Friedrich, Ed. New York, New York, USA: ACM Press, jul 2016, pp. 253–259. [Online]. Available: http://dl.acm.org/citation.cfm?doid=2908812.2908920
 [24] O. Drageset, M. H. M. Winands, R. D. Gaina, and D. Perez-Liebana, “Optimising level generators for general video game AI,” in 2019 IEEE Conference on Games (CoG). IEEE, aug 2019, pp. 1–8. [Online]. Available: https://ieeexplore.ieee.org/document/8847961/
 [25] N. Justesen, R. R. Torrado, P. Bontrager, A. Khalifa, J. Togelius, and S. Risi, “Illuminating generalization in deep reinforcement learning through procedural level generation,” arXiv, jun 2018. [Online]. Available: https://arxiv.org/abs/1806.10729
 [26] S. Chiappa, S. Racaniere, D. Wierstra, and S. Mohamed, “Recurrent environment simulators,” arXiv, apr 2017. [Online]. Available: https://arxiv.org/abs/1704.02254
 [27] B. Amos, L. Dinh, S. Cabi, T. Rothörl, S. G. Colmenarejo, A. Muldal, T. Erez, Y. Tassa, N. de Freitas, and M. Denil, “Learning awareness models,” arXiv, apr 2018. [Online]. Available: https://arxiv.org/abs/1804.06318
 [28] M. G. Azar, B. Piot, B. A. Pires, J.-B. Grill, F. Altché, and R. Munos, “World discovery models,” arXiv, feb 2019. [Online]. Available: https://arxiv.org/abs/1902.07685