1 Introduction
The goal of reinforcement learning is to maximise some notion of feedback through interaction with an environment [23]. The environment can be known, which makes this learning process trivial, or have hidden state information, which typically increases the complexity of learning significantly. In model-free reinforcement learning, actions are sampled from some policy that is optimised indirectly through direct policy search (policy gradients), a state-value function (Q-learning), or a combination of these (actor-critic). There are many recent contributions to these algorithms that increase sample efficiency [8], reduce variance [10], and increase training stability [21].
It is challenging to deploy model-free methods in real-world environments because current state-of-the-art algorithms require millions of samples before any optimal policy is learned. Due to this, model-based reinforcement learning is an appealing approach because it has significantly better sample efficiency than model-free methods [17]. The goal of model-based algorithms is to learn a predictive model of the real environment that is used to learn the controller of an agent. The downside of model-based reinforcement learning is that the predictive model may become inaccurate for longer time-horizons, or collapse entirely in areas of the state-space that it has not observed.
We propose a model-based reinforcement learning approach for industry-near systems where a predictive model is learned without direct interaction with the environment. We use Automated Storage and Retrieval Systems (ASRS) to benchmark our proposed algorithm. Learning a predictive model of the environment is isolated from the physical environment, which guarantees safety during training. Once the predictive model is sufficiently trained, a model-free algorithm such as DQN [19] can be trained offline. Training can be done in a large-scale distributed setting, which significantly reduces the training time. When the model-free algorithm is sufficiently trained, it can replace a suboptimal expert system with minimal effort.
The paper is organised as follows. Section 2 discusses the current state of the art in model-based reinforcement learning and familiarises the reader with recent work on ASRS systems. Section 3 briefly outlines relevant background literature on reinforcement learning. Section 4 introduces the DVAE2 algorithm and details the architecture thoroughly. Section 5 proposes the Deep Warehouse, a novel high-performance environment for industry-near testing of reinforcement learning algorithms. Section 6 presents our results using DVAE2 in various environments, including complex environments such as Deep Warehouse, Deep RTS and Deep Line Wars. Finally, section 7 concludes our work and outlines a roadmap for our future work.
2 Literature Review
Reinforcement Learning is a maturing field in artificial intelligence, where a significant portion of the research is concerned with model-free approaches in virtual environments. Reinforcement learning methods in large-scale industry-near environments are virtually absent from the literature. The reason for this could be that (1) model-free methods do not give the sample efficiency required and that (2) there is little evidence that model-based approaches achieve reliable performance. In this section, we briefly discuss the previous work in ASRS systems and present promising results for model-based reinforcement learning.
2.1 Automated Storage and Retrieval Systems (ASRS)
There is, to our knowledge, no published work where reinforcement learning schemes are used to control taxi-agents in ASRS environments. The literature is focused on heuristic-based approaches, such as tree-search and traditional path-finding algorithms. In [20], a detailed survey of the advancements in ASRS systems categorises an ASRS system into five components: System Configuration, Storage Assignment, Batching, Sequencing, and Dwell-point. We adopt these categories in our search for a reinforcement learning approach for ASRS systems.

2.2 Model-based Reinforcement Learning
In model-based reinforcement learning, the goal is to learn state-transitions based on observations from the environment; this learned mapping is the predictive model. If the predictive model is stable, has low variance, and improves monotonically during training, it is, to some degree, possible to train model-free agents to act optimally in environments that they have never observed directly.
Perhaps the most sophisticated algorithm for model-based reinforcement learning is the Model-Based Policy Optimisation (MBPO) algorithm, proposed by Janner et al. [16]. The authors empirically show that MBPO performs significantly better in continuous control tasks compared to previous methods. MBPO is proven to improve monotonically, given that the following bound holds:
$\eta[\pi] \geq \hat{\eta}[\pi] - C$, where $\eta[\pi]$ denotes the returns in the real environment under a policy $\pi$, whereas $\hat{\eta}[\pi]$ denotes the returns in the predicted model under the same policy $\pi$. Furthermore, the authors show that as long as the bound $C$ can be reduced, the performance will increase monotonically [16].
Gregor et al. proposed a scheme to train expressive generative models to learn belief-states of complex 3D environments with little prior knowledge. Their method was effective in predicting multiple steps into the future (overshooting) and significantly improved sample efficiency. In the experiments, the authors illustrated model-free policy training in several environments, including DeepMind Lab. However, the authors found it difficult to use their predictive model in model-free agents directly [11].
The Neural Differential Information Gain Optimisation (NDIGO) algorithm by Azar et al. is a self-supervised exploration model that learns a world model representation from noisy data. The primary feature of NDIGO is its robustness to noise, due to its method of cancelling out negative loss and giving positive learning more value. The authors show in their maze environment that the model successfully converges towards an optimal world model even when noise is introduced. The authors claim that the algorithm outperforms the previous state of the art, the Recurrent World Model approach [4].
The Dreaming Variational Autoencoder (DVAE) is an end-to-end solution for predicting probable future states. The authors showed that the algorithm successfully predicted the next state in non-continuous environments and could, with some error, predict future states in continuous state-space environments such as the Deep Line Wars environment. In the experiments, the authors used DQN, PPO, and TRPO with an artificial buffer to feed states to the algorithms. In all cases, the DVAE algorithm was able to create buffers that were accurate enough to learn a near-optimal policy [3].

The VMAV-C algorithm is a combination of a VAE, an attention-based value function (AVF), and the mixture density network recurrent neural network (MDN-RNN) from [12]. This modification to the original World Models algorithm improved performance in the Cart Pole environment. The authors used the on-policy algorithm PPO to learn the optimal policy from the latent representation of the state-space [18].

Deep Planning Network (PlaNet) is a model-based agent that interprets the pixels of a state to learn a predictive model of an environment. The environment dynamics are stored in latent-space, where the agent samples actions based on the learned representation. The proposed algorithm showed significantly better sample efficiency compared to model-free algorithms such as A3C [14].
In Recurrent World Models Facilitate Policy Evolution, Ha and Schmidhuber proposed a novel architecture for training RL algorithms using variational autoencoders. The paper showed that agents could successfully learn the environment dynamics and use this as an exploration technique requiring no interaction with the target domain. The architecture consists of three main components: vision, model, and controller. The vision component is a variational autoencoder that outputs a latent-space variable of an observation. The latent-space variable is processed in the model component and fed into the controller for action decisions. Their algorithm shows state-of-the-art performance in self-supervised generative modelling for reinforcement learning agents [12].
Chua et al. proposed Probabilistic Ensembles with Trajectory Sampling (PETS). The algorithm uses an ensemble of bootstrapped neural networks to learn a dynamics model of the environment over future states. The algorithm then uses this model to predict the best action for future states. The authors show that the algorithm significantly lowers the sampling requirements for environments such as half-cheetah compared to SAC and PPO [9].
DARLA is an architecture for modelling the environment using a VAE [15]. The trained model was used to learn the optimal policy of the environment using algorithms such as DQN [19], A3C, and Episodic Control [5]. DARLA is, to the best of our knowledge, the first algorithm to introduce learning without access to the ground-truth environment during training.
3 Background
Markov decision processes (MDPs) are a mathematical framework commonly used to define reinforcement learning problems, as illustrated in Figure 1. In an MDP, we consider the tuple $(\mathcal{S}, \mathcal{A}, \mathcal{R}, \mathcal{T}, \gamma, \rho_0)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space available to the agent, $\mathcal{R}$ is the expected immediate reward function, $\mathcal{T}$ is the transition function, which defines the probability of moving from one state to the next, $\gamma$ is the discount factor, and $\rho_0$ is the probability distribution over the initial state $s_0$. The tuple is defined for discrete or continuous spaces.
The goal of a reinforcement learning agent is to encourage good behaviour and to discourage bad behaviour. Optimal behaviour is achieved when the agent finds a composition of parameters that maximises its performance, thus finding the optimal policy. Consider
$\pi^{*} = \operatorname*{arg\,max}_{\pi} J(\pi)$   (1)

where $J(\pi)$ is the objective function for maximising the expected discounted reward, defined as

$J(\pi) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t}\right]$   (2)

where $\gamma \in [0, 1]$ is the discounting factor for future rewards. If $\gamma = 1$, all future state rewards are accounted for equally, while with $\gamma = 0$, we are only concerned about the current state.
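The role of the discount factor in equation (2) can be made concrete with a short computation of discounted returns; the reward sequence and γ values below are illustrative examples, not taken from our experiments:

```python
def discounted_return(rewards, gamma):
    """Compute sum_t gamma^t * r_t for a finite reward sequence."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

rewards = [1.0, 1.0, 1.0, 1.0]

# gamma = 1: all future rewards count equally.
print(discounted_return(rewards, 1.0))  # 4.0
# gamma = 0: only the immediate reward matters (0**0 == 1 in Python).
print(discounted_return(rewards, 0.0))  # 1.0
# Intermediate gamma trades off the two extremes:
print(discounted_return(rewards, 0.9))  # 1 + 0.9 + 0.81 + 0.729 = 3.439
```
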
4 Learning policies using predictive models
The Dreaming Variational Autoencoder v2 (DVAE2) is an architecture for learning a predictive model of arbitrary environments [3]. In this work, we aim to improve the first version of DVAE for better performance in real-world environments. A common problem in model-based reinforcement learning is that it takes millions of samples to generalise well across sparse data. We aim to improve the sample efficiency of the original DVAE and, if possible, surpass the performance of model-free methods.
4.1 Motivation and Environment Safety
Figure 2
shows an abstract overview of DVAE2 training in an environment. In real-world, industry-near environments, there is little room for interruptions. In model-free reinforcement learning, the agent interacts with the environment to learn its policy. Because this is not possible in many real-world environments, the DVAE2 algorithm only observes during training. During training, the DVAE2 algorithm learns how the transition function behaves and learns an estimated state-value function that represents the value of being in the current state.

4.2 The Dreaming Variational Autoencoder v2
The original DVAE architecture had severe challenges with the modelling of continuous state-spaces [3], and many algorithms were added to the model to improve performance across various environments, including autoencoders, LSTMs, and fine-tuned variations of these. DVAE2 extends this with a split into three individual components, forming the View, Reason and Control (VRC) model. The VRC model embeds all improvements into a single model and learns which algorithms to use under certain conditions in an environment.
Figure 3 shows an overview of the proposed VRC. (1) A state is observed. During training, this observation stems from the real environment, while at inference time it stems from the predictive model. The observation is encoded in the view component (e.g. via an AE or GAN), which outputs an embedding of the state at the current time-step. (2) The reason component learns the time dynamics between state sequences. Encoded states are accumulated into a buffer and are then used to predict the hidden-state with respect to the encoded state sequence. The reason component typically consists of a model with an RNN-like structure that generalises well on sequence data. (3) The hidden state is then used to evaluate an action under the policy, and (4) the action is sent to the environment and to the view for the next iteration. (5) The decoder combines the hidden-state and the encoded state, producing the succeeding state. The prediction is then used in the next iteration as the current state, which leads back to (1). As an optional mechanism, the controller can use the output from the decoder instead of the hidden state information. This is beneficial when working with model-free algorithms such as deep Q-networks [19].
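The data flow in steps (1) to (4) can be sketched as follows. The linear maps below are toy stand-ins for an actual view encoder, recurrent reason model, and controller; all class names and dimensions are illustrative assumptions, not the DVAE2 implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

class View:
    """Toy encoder: maps a raw observation to an embedding (stand-in for an AE/GAN)."""
    def __init__(self, obs_dim, z_dim):
        self.W = rng.normal(size=(z_dim, obs_dim)) * 0.1
    def encode(self, s):
        return np.tanh(self.W @ s)

class Reason:
    """Toy recurrent model over encoded states: produces a hidden state from the sequence."""
    def __init__(self, z_dim, h_dim):
        self.Wz = rng.normal(size=(h_dim, z_dim)) * 0.1
        self.Wh = rng.normal(size=(h_dim, h_dim)) * 0.1
        self.h = np.zeros(h_dim)
    def step(self, z):
        self.h = np.tanh(self.Wz @ z + self.Wh @ self.h)
        return self.h

class Control:
    """Toy policy head: picks a discrete action from the hidden state."""
    def __init__(self, h_dim, n_actions):
        self.W = rng.normal(size=(n_actions, h_dim)) * 0.1
    def act(self, h):
        return int(np.argmax(self.W @ h))

# One forward pass through steps (1)-(4) of the VRC model.
obs = rng.normal(size=16)                            # (1) observed state
view, reason, control = View(16, 8), Reason(8, 8), Control(8, 4)
z = view.encode(obs)                                 # (2) embedding
h = reason.step(z)                                   # (3) hidden state from sequence dynamics
a = control.act(h)                                   # (4) action sent to the environment
print("action:", a)
```
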
4.3 Model selection
During technique selection in the components, we perform the following evaluation. An observation is sent to the view component of DVAE2. All of the view techniques are initially assumed to be uniformly qualified to encode and predict future states. For each iteration, the computed error is accumulated into a score, and during inference, the technique with the lowest score (i.e. the least accumulated error) is used. We use the same method for determining the best reasoning algorithm in a specific environment.
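A minimal sketch of this selection mechanism, assuming each technique reports a scalar prediction error per iteration; the technique names and error values below are hypothetical:

```python
class TechniqueSelector:
    """Track the accumulated prediction error of each candidate technique;
    at inference time, pick the one with the lowest accumulated error."""
    def __init__(self, names):
        # All techniques start uniformly qualified (zero accumulated error).
        self.scores = {name: 0.0 for name in names}
    def update(self, name, error):
        self.scores[name] += error
    def best(self):
        return min(self.scores, key=self.scores.get)

sel = TechniqueSelector(["vae", "gan", "ae"])
# Hypothetical per-iteration reconstruction errors:
for errors in [{"vae": 0.3, "gan": 0.5, "ae": 0.2},
               {"vae": 0.2, "gan": 0.4, "ae": 0.2}]:
    for name, e in errors.items():
        sel.update(name, e)
print(sel.best())  # "ae" (0.4 accumulated vs 0.5 and 0.9)
```
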
4.4 Implementation
The implementation of the DVAE2 algorithm with dynamic component selection enabled several significant improvements over the previous DVAE model [3]. Notably, the k-step model rollout from [16] is implemented to stabilise training. We found that using shorter model-rollouts provided better control policies, but at the cost of sample efficiency. Also, embedding time into the encoded state improved the model's stability and prediction capabilities [13]. The DVAE2 algorithm is defined as follows.
Algorithm 1 works as follows. (Line 1) We initialise the control policy and the predictive model (DVAE2) parameters. (Line 2) A variable denotes a finite set of sequential view model (ENC) predictions that are used to capture the time dependency between states in the reason model (TR). (Line 5) We collect samples from the real environment under a predefined policy, such as an expert system; see Figure 2. (Line 6) The predictive model is then trained on the collected data via maximum likelihood estimation. In our case, we use the mean squared error to measure the error distance. When the DVAE2 algorithm has trained sufficiently, the model-free algorithm trains for a number of epochs (Line 7) using the predictive model instead of the real environment. (Line 8) First, we sample the initial state uniformly from the real dataset. (Line 9) We then construct a prediction dataset and predict future states using the control policy (i.e. sampling from the predictive model). (Line 10) The parameterised control policy is then optimised using the transition pairs collected during rollouts.
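The control flow of Algorithm 1 can be sketched as follows. The `ToyModel` and `ToyPolicy` classes, the function name, and the trivial dynamics are illustrative stand-ins that mirror only the structure of the algorithm, not the actual DVAE2 components:

```python
import random

class ToyModel:
    """Stand-in predictive model: memorises transitions, trivial dynamics."""
    def __init__(self):
        self.data = []
    def fit(self, s, a, r, s_next):
        self.data.append((s, a, r, s_next))
    def predict(self, s, a):
        return 0.0, s  # placeholder reward and next-state prediction

class ToyPolicy:
    """Stand-in control policy."""
    def act(self, s):
        return 0
    def update(self, trajectory):
        pass  # a real policy would run e.g. a PPO update here

def train_dvae2(env_dataset, model, policy, n_epochs=3, rollout_length=5):
    # (Line 6) Fit the predictive model on pre-collected real transitions
    # (mean squared error in the paper; memorisation in this toy).
    for s, a, r, s_next in env_dataset:
        model.fit(s, a, r, s_next)
    # (Lines 7-10) Train the policy purely inside the predictive model.
    for _ in range(n_epochs):
        # (Line 8) Sample an initial state uniformly from the real dataset.
        s = random.choice(env_dataset)[0]
        trajectory = []
        # (Line 9) k-step rollout under the control policy.
        for _ in range(rollout_length):
            a = policy.act(s)
            r, s_next = model.predict(s, a)
            trajectory.append((s, a, r, s_next))
            s = s_next
        # (Line 10) Optimise the policy on the predicted transitions.
        policy.update(trajectory)

dataset = [(0, 0, 1.0, 1), (1, 0, 1.0, 2)]
model, policy = ToyModel(), ToyPolicy()
train_dvae2(dataset, model, policy)
print(len(model.data))  # 2
```
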
5 The Deep Warehouse Environment
Training algorithms in real-world environments is known to have severe safety challenges during training and suffers from low sampling speeds [6]. It is therefore practical to create a simulation of the real environment so that researchers can quickly test algorithm variations and get quick feedback on their performance.
This section presents the Deep Warehouse environment for discrete and continuous action and state spaces. The environment has a wide range of configurations for time and agent behaviour, giving it tolerable performance in simulating proprietary automated storage and retrieval systems. The deep warehouse environment is open-source and freely available at https://github.com/cair/deepwarehouse.
5.1 Motivation
In the context of warehousing, an Automated Storage and Retrieval System (ASRS) is a composition of computer programs working together to maximise the incoming and outgoing throughput of goods. There are many benefits to using an ASRS system, including high scalability, increased efficiency, reduced operating expenses, and operational safety. We consider a cube-based ASRS environment where each cell is stacked with item containers. On the surface of the cube, taxi-agents collect and deliver goods to delivery points placed throughout the surface. The taxi-agents are controlled by a computer program that reads sensory data from the taxi and determines the next action.
Although these systems are far better than manual-labour warehousing, there is still significant improvement potential in the current state of the art. Most ASRS systems are manually crafted expert systems which, due to the high complexity of multi-agent ASRS systems, perform only suboptimally [20].
5.2 Implementation
Figure 4 illustrates the state-space in the deep warehouse environment. In a simple cube-based ASRS configuration, the environment consists of (B) passive and (C) active delivery-points, (D) pickup-points, and (F) taxis. The simulator can also model other configurations, including advanced cube- and shelf-based automated storage and retrieval systems. In the deep warehouse environment, the goal is to store and retrieve goods from one location to another, where each cell represents several layers of containers that a taxi can pick up. A taxi (F) receives feedback based on the time used on the task it performs. A taxi can move using a discrete or continuous controller. In discrete mode, the agent can increase and decrease thrust and move in either direction, including the diagonals. In continuous mode, all of these actions are floating-point numbers between 0 (off) and 1 (on), giving a significantly harder action-space to learn. The simulator also features a continuous mode for the state-space, where actions are performed asynchronously to the game loop. It is possible to create custom support modules for mechanisms such as task scheduling, agent controllers, and fitness scoring.
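A discrete-mode interaction loop might look like the following sketch. The `DeepWarehouse` class below is a simplified toy stand-in with a single taxi and a hypothetical action set; it does not reflect the API of the published simulator:

```python
import random

class DeepWarehouse:
    """Toy discrete-mode grid: a single taxi moves on a size x size surface."""
    ACTIONS = ["up", "down", "left", "right", "noop"]
    def __init__(self, size=5):
        self.size = size
        self.reset()
    def reset(self):
        self.pos = [0, 0]
        self.goal = [self.size - 1, self.size - 1]
        return tuple(self.pos)
    def step(self, action):
        dx, dy = {"up": (0, -1), "down": (0, 1), "left": (-1, 0),
                  "right": (1, 0), "noop": (0, 0)}[action]
        # Clamp movement to the grid boundaries.
        self.pos[0] = min(max(self.pos[0] + dx, 0), self.size - 1)
        self.pos[1] = min(max(self.pos[1] + dy, 0), self.size - 1)
        done = self.pos == self.goal
        reward = 1.0 if done else -0.01  # time-based penalty per step
        return tuple(self.pos), reward, done

env = DeepWarehouse()
state, done, steps = env.reset(), False, 0
while not done and steps < 200:
    state, reward, done = env.step(random.choice(DeepWarehouse.ACTIONS))
    steps += 1
print("reached goal:", done, "after", steps, "steps")
```
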
A significant benefit of the deep warehouse is that it can accurately model real warehouse environments at high speed. The deep warehouse environment runs 1000 times faster on a single high-end processor core compared to real-world systems, measured by counting how many operations a taxi can perform per second. The simulator can be distributed across many processing units to increase performance further. In our benchmarks, the simulator was able to collect 1 million samples per second during the training of deep learning models using high-performance computing (HPC).
6 Experimental Results
In this section, we present our preliminary results of applied model-based reinforcement learning using DVAE2. We aim to answer the following questions.
(1) Does the DVAE2 algorithm improve sample efficiency compared to model-free methods? (2) How well does DVAE2 perform versus model-free methods in the deep warehouse environment? (3) Which of the DVAE2 VRC components is preferred by the model?
6.1 The importance of compute
According to AI pioneer Richard S. Sutton, “The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin.” [22]. It is therefore not surprising that compute is still the most decisive factor when training a large model, also for predictive models. DVAE2 was initially trained using two NVIDIA RTX 2080 TI GPU cards that, if tuned properly, can operate at approximately 26.9 TFLOPS. For simpler problems, such as small grid-warehouses and CartPole, this compute was enough to train the model in 5 minutes, but for larger environments, this time grew exponentially. To somewhat mitigate the computational issue for larger environments, we performed the experiments with approximately 1.25 PFLOPS of compute power. This led to significantly faster training speeds and made large experiments feasible, where we consider large experiments to consist of environments in which the agents require significant sampling to converge.
6.2 Results
Figure 5 shows the average return of DVAE2 training on four tasks: Deep RTS [2], Deep Warehouse, Deep Line Wars [1], and CartPole [7].
Deep Warehouse: This environment is a contribution of this paper for industry-near testing of autonomous agents. The DVAE2 algorithm outperforms both PPO and DQN in terms of sampling and performance during 150000 game steps. The score function counts how many tasks the agent has performed during the episode. If the agent manages to collect and retrieve 300 packages, the agent has sufficient performance to beat many hand-crafted algorithms in ASRS systems. The environment is multi-agent, and in this experiment, we used a grid with 20 taxis running the same policy.
Deep RTS is a flexible real-time strategy game (RTS) engine with multiple environments for unit control and resource management. In this experiment, we used the resource harvester environment, where the goal is to harvest 500 wood resources before the time limit is up. The score is measured from 500 to 0, where 0 is the best score. For every wood harvested, the score decreases by 1. We consider the task mastered if the agent has a score below 200 at the terminal state. DVAE2 outperforms the baseline algorithms in terms of sample efficiency but falls behind PPO in terms of score performance [2].
Deep Line Wars: Surprisingly, the DQN policy outperforms the DVAE2 and PPO policies in this discrete action-space environment. Because we used PPO as the policy for DVAE2, we still see a marginal improvement over the same algorithm in a model-free setting, yielding better performance and better sample efficiency. We found that DQN quickly learned the correct Q-values due to the small environment size. In future experiments, we would like to include larger map sizes that would increase the state-space significantly, hence making the Q-values more challenging to learn [1].
CartPole: As a simple baseline environment, we use CartPole from the OpenAI Gym environment suite [7]. The goal of this environment is to balance a pole on a moving cart using a discrete action-space of 2 actions. We found that DVAE2 and PPO had similar performance, but DVAE2 had marginally better sample efficiency after 25000 steps.
In terms of the VRC, the algorithm tended to choose Convolutional + LSTM, and Temporal Convolution + GAN for continuous control tasks (see Figure 1). It should be noted that PPO and DVAE2 are presented with the same hyper-parameters and are therefore directly comparable. We used PPO as our policy for DVAE2, and we see that DVAE2 is more sample efficient and performs equally well or better than model-free PPO in all tested scenarios.
7 Conclusion and Future Work
In this paper, we present DVAE2, a novel model-based reinforcement learning algorithm for improved sample efficiency in environments where sampling is not freely available. We also present the deep warehouse environment for training reinforcement learning agents in industry-near ASRS systems. This section concludes our work and defines future work for DVAE2.
Although the deep warehouse does not behave identically to a real-world system, it is adequate for determining training time and performance. DVAE2 is presented as a VRC model for training reinforcement learning algorithms with a learned model of the environment. The method is tested in the Deep Warehouse and several continuous game environments. Our algorithm reduces training time and depends less on data sampled from the real environment compared to model-free methods.
We find that a carefully tuned policy gradient algorithm can converge to near-optimal behaviour in simulated environments. Model-free algorithms are significantly harder to train in terms of sample efficiency and stability, but perform better if unlimited sampling is available from the environment.
Our work shows promising results for reinforcement learning agents in ASRS. There are, however, open research questions that are essential for safe deployment in real-world systems. We wish to pursue the following questions to achieve safe deployment in real-world environments. (1) How do we ensure that the agent acts within defined safety boundaries? (2) How would the agent act if parts of the state-space change to unseen data (i.e. a fire occurs, or a collision between agents)? (3) Can agents with a non-stationary policy function well in a multi-agent setting?
References
 [1] Andersen, P.A., Goodwin, M., Granmo, O.C.: Towards a deep reinforcement learning approach for tower line wars. In: Bramer, M., Petridis, M. (eds.) Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). vol. 10630 LNAI, pp. 101–114 (2017). https://doi.org/10.1007/9783319710785_8
 [2] Andersen, P.A., Goodwin, M., Granmo, O.C.: Deep RTS: A Game Environment for Deep Reinforcement Learning in RealTime Strategy Games. Proceedings of the IEEE International Conference on Computational Intelligence and Games (aug 2018), http://arxiv.org/abs/1808.05032
 [3] Andersen, P.A., Goodwin, M., Granmo, O.C.: The Dreaming Variational Autoencoder for Reinforcement Learning Environments. In: Bramer, M., Petridis, M. (eds.) Artificial Intelligence, vol. 11311, pp. 143–155. Springer, Cham (dec 2018). https://doi.org/10.1007/9783030041915_11, http://link.springer.com/10.1007/9783030041915_11
 [4] Azar, M.G., Piot, B., Pires, B.A., Grill, J.B., Altché, F., Munos, R.: World Discovery Models. arxiv preprint arXiv:1902.07685 (feb 2019), http://arxiv.org/abs/1902.07685
 [5] Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J.Z., Rae, J., Wierstra, D., Hassabis, D.: ModelFree Episodic Control. arxiv preprint arXiv:1606.04460 (jun 2016), http://arxiv.org/abs/1606.04460
 [6] Botvinick, M., Ritter, S., Wang, J.X., KurthNelson, Z., Blundell, C., Hassabis, D.: Reinforcement Learning, Fast and Slow. Trends in cognitive sciences 23(5), 408–422 (may 2019). https://doi.org/10.1016/j.tics.2019.02.006, http://www.ncbi.nlm.nih.gov/pubmed/31003893
 [7] Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: OpenAI Gym. arxiv preprint arXiv:1606.01540 (jun 2016), http://arxiv.org/abs/1606.01540
 [8] Buckman, J., Hafner, D., Tucker, G., Brevdo, E., Lee, H.: SampleEfficient Reinforcement Learning with Stochastic Ensemble Value Expansion. Advances in Neural Information Processing Systems 32 pp. 8224–8234 (jul 2018), http://arxiv.org/abs/1807.01675
 [9] Chua, K., Calandra, R., McAllister, R., Levine, S.: Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics Models. Advances in Neural Information Processing Systems 31 (may 2018), http://arxiv.org/abs/1805.12114

 [10] Greensmith, E., Bartlett, P.L., Baxter, J.: Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research 5(Nov), 1471–1530 (2004)
 [11] Gregor, K., Rezende, D.J., Besse, F., Wu, Y., Merzic, H., van den Oord, A.: Shaping Belief States with Generative Environment Models for RL. arxiv preprint arXiv:1906.09237 (jun 2019), http://arxiv.org/abs/1906.09237
 [12] Ha, D., Schmidhuber, J.: Recurrent World Models Facilitate Policy Evolution. Advances in Neural Information Processing Systems 31 (sep 2018), http://arxiv.org/abs/1809.01999
 [13] Ha, D., Schmidhuber, J.: World Models. arxiv preprint arXiv:1803.10122 (mar 2018). https://doi.org/10.5281/zenodo.1207631, https://arxiv.org/abs/1803.10122
 [14] Hafner, D., Lillicrap, T., Fischer, I., Villegas, R., Ha, D., Lee, H., Davidson, J.: Learning Latent Dynamics for Planning from Pixels. Proceedings of the 36 th International Conference on Machine Learning (nov 2018), http://arxiv.org/abs/1811.04551
 [15] Higgins, I., Matthey, L., Pal, A., Burgess, C., Glorot, X., Botvinick, M., Mohamed, S., Lerchner, A.: betaVAE: Learning Basic Visual Concepts with a Constrained Variational Framework. International Conference on Learning Representations (nov 2016), https://openreview.net/forum?id=Sy2fzU9gl
 [16] Janner, M., Fu, J., Zhang, M., Levine, S.: When to Trust Your Model: ModelBased Policy Optimization. arXiv preprint arXiv:1906.08253 (jun 2019), http://arxiv.org/abs/1906.08253
 [17] Kaelbling, L.P., Littman, M.L., Moore, A.W.: Reinforcement Learning: A Survey. Journal of Artificial Intelligence Research (apr 1996). https://doi.org/10.1.1.68.466, http://arxiv.org/abs/cs/9605103
 [18] Liang, X., Wang, Q., Feng, Y., Liu, Z., Huang, J.: VMAVC: A Deep Attentionbased Reinforcement Learning Algorithm for Modelbased Control. arxiv preprint arXiv:1812.09968 (dec 2018), http://arxiv.org/abs/1812.09968
 [19] Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., Riedmiller, M.: Playing Atari with Deep Reinforcement Learning. Neural Information Processing Systems (dec 2013), http://arxiv.org/abs/1312.5602
 [20] Roodbergen, K.J., Vis, I.F.A.: A survey of literature on automated storage and retrieval systems. European Journal of Operational Research (2009). https://doi.org/10.1016/j.ejor.2008.01.038
 [21] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal Policy Optimization Algorithms. arxiv preprint arXiv:1707.06347 (jul 2017), http://arxiv.org/abs/1707.06347
 [22] Sutton, R.S.: The Bitter Lesson (2019), http://www.incompleteideas.net/IncIdeas/BitterLesson.html
 [23] Sutton, R.S., Barto, A.G.: Reinforcement learning: An introduction. MIT Press (2018)