Producing fully independent agents that learn optimal behavior and improve over time purely through trial-and-error interaction with the surrounding environment is one of the prominent challenges in the field of artificial intelligence. Reinforcement learning (RL) is a mathematical framework that encapsulates the problem faced by these autonomous systems. Over the past few years, exceptional progress has been made in devising artificial agents that can learn and solve problems in a variety of domains using deep RL methods (Mnih et al., 2015; Schulman et al., 2015; Silver et al., 2016). However, these algorithms are perceived as extremely data inefficient. They are thought to require an immense amount of non-optimal interaction with the real environment before they begin to operate acceptably well (Irpan, 2018).
One of the most popular benchmarks for assessing the overall performance and data complexity of deep RL algorithms is the Arcade Learning Environment (Bellemare et al., 2013; Machado et al., 2018). The state-of-the-art model-free approaches, at least in the way they have been presented so far, need millions of frames to learn how to play these games acceptably well (Schulman et al., 2017; Hessel et al., 2018). This corresponds to days of play experience at the standard frame rate. Human players, however, can achieve the same within minutes (Tsividis et al., 2017).
A lot of work has been produced to circumvent these shortcomings. The most successful studies focus on model-based strategies inspired by the classical Dyna approach (Sutton, 1991) and action-conditional prediction methods (Oh et al., 2015; Leibfried et al., 2016). Although some of them manage to drastically reduce the amount of data required by the standard algorithms, they do so by greatly increasing both the conceptual and the computational complexity of the models.
In this paper, we argue, and demonstrate experimentally, that already existing model-free techniques can be much more data-efficient than is commonly assumed. We introduce a simple change to the state-of-the-art Rainbow DQN algorithm. In some environments, such as Pong or Hero, it can achieve the same results given only 5-10% of the data it is usually reported to need. Furthermore, it matches the data efficiency of the state-of-the-art model-based approaches while being much more stable, simpler, and requiring much less computation.
Following the introduction, section 2 gives a brief background on reinforcement learning with a focus on Q-learning and its deep learning equivalents. Section 3 provides an overview of recent studies aimed at improving data efficiency using model-based approaches. Section 4 argues that model-free methods can be much more efficient than they tend to be presented and that existing model-based techniques only give an illusion of efficiency. The description and analysis of experiments then follow in sections 5 and 6. Finally, section 7 concludes this study.
Reinforcement learning is the problem of learning a policy that maximises the reward signal for a given task. To define the RL setting we need a set of possible environment states $\mathcal{S}$, a set of available actions $\mathcal{A}$, and the relations between them. These relations are described by a transition function $T$ that defines the dynamics of transitions from one state to another, and a reward function $R$ that defines the real-valued reward signal. Together, $T$ and $R$ constitute the model of the environment. The goal of reinforcement learning is to find a policy $\pi$ that maximises the total cumulative reward over time. One of the most popular reinforcement learning algorithms is Q-learning (Watkins and Dayan, 1992). Q-learning decides on an optimal policy based on the state-action value function $Q(s, a)$ that maps a state $s$ and an action $a$ performed in that state to the expected total cumulative reward following the action. The algorithm chooses the action that maximises $Q$, i.e. $\pi(s) = \arg\max_{a} Q(s, a)$. $Q$ is learned in the process of interacting with the environment. At every step of the agent, a tuple $(s, a, r, s')$ is obtained and immediately used to update the $Q$ function. Because the state-action space is often too big or continuous to represent directly in a tabular manner, $Q$ is commonly approximated using different supervised learning algorithms. However, using deep learning to approximate $Q$ is not trivial because Q-learning breaks important assumptions required by neural networks. Namely, the $Q$ update is recursive and experience tuples are highly correlated when used sequentially.
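The tabular form of the update described above can be sketched as follows. This is a minimal illustration rather than the implementation used in the experiments: the learning rate `alpha`, the discount factor `gamma`, and the state and action names are assumptions introduced for the example.

```python
from collections import defaultdict


def greedy_action(q, state, actions):
    """Return the action with the highest estimated value in `state`."""
    return max(actions, key=lambda a: q[(state, a)])


def q_learning_update(q, state, action, reward, next_state, actions,
                      alpha=0.1, gamma=0.99, terminal=False):
    """Apply one tabular Q-learning update for the tuple (s, a, r, s')."""
    # Bootstrapped target: reward plus the discounted value of the best
    # action in the next state. This is the recursive part of the update.
    best_next = 0.0 if terminal else max(q[(next_state, a)] for a in actions)
    target = reward + gamma * best_next
    # Move the current estimate a fraction `alpha` towards the target.
    q[(state, action)] += alpha * (target - q[(state, action)])
    return q[(state, action)]


# Hypothetical two-action problem; all values start at zero.
q = defaultdict(float)
actions = [0, 1]
q_learning_update(q, "s0", 1, reward=1.0, next_state="s1", actions=actions)
```

Because the target is computed from the same table that is being updated, replacing the table with a neural network makes the regression target move with every gradient step, which is the instability discussed in the text.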
DQN (Mnih et al., 2015) bypassed this issue by introducing two concepts: a target network and a replay buffer. The target network is simply a fixed snapshot of the network that approximates the $Q$ value (the online network), taken at a fixed interval of steps. Instead of updating the online network towards itself, it is updated towards the target network. This approach maintains the logic of Q-learning while stopping the online network from diverging due to recursive updates. The replay buffer, on the other hand, ensures a much higher level of independence between experience tuples. They are no longer used immediately, one after another, but are stored in the replay buffer instead. Then, once per update period (a fixed number of environment steps), a single training step is performed, i.e. a mini-batch of experience sampled at random from the replay buffer is used to update the online network. This reduces the correlation between experience samples by breaking their ordering.
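The two mechanisms can be illustrated with a small, framework-free sketch. It is not the DQN or Dopamine implementation: the class and function names, the uniform sampling, and the `target_q` callable standing in for the frozen target network are all assumptions made for the example.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of experience tuples (s, a, r, s', done)."""

    def __init__(self, capacity):
        # A bounded deque evicts the oldest tuples once capacity is reached.
        self.storage = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.storage.append((state, action, reward, next_state, done))

    def sample(self, batch_size, rng=random):
        # Uniform sampling without replacement breaks the temporal ordering
        # of the stored experience.
        return rng.sample(list(self.storage), batch_size)

    def __len__(self):
        return len(self.storage)


def td_targets(batch, target_q, actions, gamma=0.99):
    """Compute regression targets using the frozen target network."""
    targets = []
    for _, _, reward, next_state, done in batch:
        # The bootstrap term comes from `target_q`, not the online network.
        bootstrap = 0.0 if done else max(target_q(next_state, a) for a in actions)
        targets.append(reward + gamma * bootstrap)
    return targets
```

In a full agent the online network would be regressed towards these targets, and `target_q` would be refreshed with a copy of the online network at a fixed interval.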
Rainbow DQN (Hessel et al., 2018) is a combination of several incremental improvements on top of DQN that increased both the sample efficiency and the total performance of the algorithm, achieving state-of-the-art results. We use this architecture as an example showing that current model-free deep RL is not as inefficient as is often stated. Throughout the paper, hyperparameters from Hessel et al. (2018) are employed unless stated otherwise.
3 Model-based reinforcement learning
The most successful approach to improving data efficiency of deep RL is based on the premise of model-based techniques (Sutton and Barto, 2018). Having access to transition and reward mechanics of the environment would make it possible to construct an artificial simulation where the agent could be trained without performing often costly interactions with the real environment. However, in most scenarios, the agent is not given any prior information about the model of its environment. This issue is often overcome by learning the model instead. Oh et al. (2015) and Leibfried et al. (2016) have shown that it is possible with a very high level of accuracy.
The ability to learn the model of the environment was subsequently leveraged to improve different aspects of deep RL (Racanière et al., 2017; Oh et al., 2017; Buesing et al., 2018; Ha and Schmidhuber, 2018). Azizzadenesheli et al. (2018), Holland et al. (2018), and Kaiser et al. (2019), however, focused directly on employing the learned models to increase the data efficiency of deep RL algorithms.
Azizzadenesheli et al. (2018) proposed Generative Adversarial Tree Search (GATS). Unlike the standard approach to learning the environment dynamics, GATS creates two separate models: a Generative Dynamics Model (GDM), based on a modified Pix2Pix (Isola et al., 2017), to learn the transition model $T$; and a Reward Predictor (RP), a simple 3-class classification architecture, to learn the reward model $R$. Both models learn from experience stored in DQN's replay buffer and are then used for bounded Monte Carlo tree search as in Silver et al. (2016). GATS is evaluated primarily on the game Pong, where it learns an optimal policy using around 42% of the data required by a standalone model-free agent, which is a small improvement compared to the methods described next.
Holland et al. (2018) explored the performance of the model-based approach given either a perfect model, a model pretrained on expert data (pretrained model), or a model learned alongside the agent's value function (online model). Both non-perfect models followed the standard architecture for the task (Oh et al., 2015; Leibfried et al., 2016). These models are then used to generate 100 samples of simulated experience for every interaction with the real environment. All three variations outperformed state-of-the-art Rainbow DQN in terms of data efficiency on 5 out of 6 games. Nevertheless, only the results of the online model are used for further discussion to ensure a fair comparison between the algorithms.
Kaiser et al. (2019) introduced Simulated Policy Learning (SimPLe). Similarly to the previous two architectures, it learns the model of the environment using a modified version of the architecture of Oh et al. (2015). It differs from previous approaches by employing PPO (Schulman et al., 2017) as its RL agent and by using the learned model much more exhaustively. It uses the model similarly to Holland et al. (2018); however, it generates at least 800k samples of artificial data after every 6.4k interactions. The approach is then evaluated on a range of 26 different Atari games. It provides results that greatly outperform both Holland et al. (2018) and Azizzadenesheli et al. (2018) in terms of data efficiency, achieving at least a 2x improvement on over half of the games and more than a 10x improvement on Freeway. To the best of our knowledge, SimPLe is the state of the art in terms of data-efficient deep reinforcement learning; thus, it will be used as the primary baseline throughout the rest of the paper.
4 Data efficiency of standard approaches
We argue that DQN-like model-free methods are not as data inefficient as they are often portrayed. They are simply used in a very inefficient way. Let us define the ratio $r$ of the number of training steps to the number of interactions with the environment. In the default setting, the update period is 4. This means that the algorithm performs a single update of the network for every 4 interactions with the environment, i.e., $r = 1/4$.
As explained in section 3, both the online-model-based algorithm from Holland et al. (2018) and SimPLe from Kaiser et al. (2019) first learn an approximate model of the environment. This approximation is then used to provide simulated samples of experience alongside the real data. Nevertheless, these samples can, in the best case, only provide as much real signal to the agent as was present in the original data. However, as a byproduct of the agent's interactions with the learned model, the ratio $r$ significantly increases. Holland et al. (2018) perform 100 simulated steps for each real step, causing $r \approx 25$. SimPLe executes 800k simulated steps after every 6.4k interactions with the real environment. Thus, if SimPLe were using DQN as its model-free component, the ratio $r$ would be even higher ($r = 31.5$).
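The arithmetic behind these ratios can be made explicit with a short sketch. It assumes that an agent trained on simulated experience keeps the default update period of 4 over real and simulated steps combined; this assumption is introduced here for illustration and is chosen because it is consistent with the 126-fold increase quoted for SimPLe in section 6. The function and parameter names are hypothetical.

```python
def update_ratio(real_steps, simulated_steps=0, update_period=4,
                 updates_per_step=1):
    """Network updates per interaction with the real environment."""
    # Updates are triggered by both real and simulated steps, but the ratio
    # is measured against real interactions only.
    total_steps = real_steps + simulated_steps
    return updates_per_step * total_steps / (update_period * real_steps)


default_ratio = update_ratio(1)                                # 0.25
holland_ratio = update_ratio(1, simulated_steps=100)           # 25.25
simple_ratio = update_ratio(6_400, simulated_steps=800_000)    # 31.5

# Increase relative to the default setting.
simple_increase = simple_ratio / default_ratio                 # 126.0
```

Under this assumption, the simulated data does not add new information about the real environment; it only changes how many updates are performed per real interaction.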
It seems unfair to allow model-based methods to perform more training steps for each gathered data point without letting model-free baselines do the same. However, of the studies discussed above, only Holland et al. (2018) performed tests allowing DQN extra updates (their results showed that the model-based approach with the online model indeed does not outperform the model-free approach with extra updates; however, that study was mainly interested in thorough analysis rather than in improving the state of the art). GATS was compared solely to the standard version of DQN, and SimPLe to the standard version of the PPO algorithm together with a Rainbow DQN that, as stated in the paper, was hypertuned for sample efficiency (HRainbow). However, the hyperparameters for HRainbow were not disclosed. We hypothesize that the main reason behind the improved data efficiency in the results is essentially the increased $r$.
5 Experimental setup
To test the above-mentioned hypothesis, we train a standard Rainbow DQN agent, as described in Hessel et al. (2018), with only a few small differences that increase the ratio $r$ of training steps to environment interactions. Firstly, we decrease the period between updates as much as possible, so that an update is performed after every interaction (thus $r = 1$). Then, because it is impossible to increase $r$ further using existing hyperparameters, we introduce a new parameter $k$ that specifies how many network updates should be performed at each update step (similarly to DQN Extra Updates from Holland et al. (2018)). We find that $k = 8$ produces the best results (hence $r = 8$). We also decrease the epsilon decay period to only 50K steps to make it compatible with low-data settings. We will refer to this modified version of Rainbow DQN as 'OTRainbow' (Overtrained Rainbow) throughout the rest of the paper.
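The resulting training schedule can be sketched as a loop that only counts updates. This is an illustration, not the modified Dopamine code: the function names are hypothetical, the environment interaction and network update are left as comments, and the start and final epsilon values in the decay helper are assumptions (only the 50K decay period comes from the text).

```python
def train_schedule(num_interactions, update_period=1, updates_per_step=8):
    """Count network updates performed over `num_interactions` real steps."""
    updates = 0
    for step in range(1, num_interactions + 1):
        # ... act in the environment and store the transition here ...
        if step % update_period == 0:
            for _ in range(updates_per_step):
                # ... sample a mini-batch and update the online network ...
                updates += 1
    return updates


def linear_epsilon(step, decay_period=50_000, eps_start=1.0, eps_final=0.01):
    """Linearly decay the exploration rate over `decay_period` steps."""
    fraction = min(step / decay_period, 1.0)
    return eps_start + fraction * (eps_final - eps_start)


# Overtrained schedule versus the default one on 100k interactions.
overtrained_updates = train_schedule(100_000)
default_updates = train_schedule(100_000, update_period=4, updates_per_step=1)
```

With these settings the overtrained schedule performs 32 times as many updates as the default one on the same amount of real experience, which is the increase in the ratio discussed in section 6.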
Existing code for Rainbow DQN from the Dopamine framework (Castro et al., 2018) was modified as explained above to obtain OTRainbow. Dopamine was used for two reasons: (i) it allows for quick and easy prototyping of new RL algorithms; (ii) it ensures the same implementation for each version of Rainbow DQN (whether it is OTRainbow, HRainbow, or standard Rainbow). OTRainbow was then evaluated on the same range of 26 Atari games from the Arcade Learning Environment as used by SimPLe in the original paper. We then compare the outcomes to multiple baselines: an agent that always chooses an action uniformly at random (Random), the human score as reported by Mnih et al. (2015) (Human), Rainbow DQN with the original hyperparameters from Hessel et al. (2018) (SRainbow), and the SimPLe and HRainbow scores as reported by Kaiser et al. (2019).
Similarly to Kaiser et al. (2019), sample efficiency is evaluated based on the mean score in the low-data regime of 100k interactions with the real environment (400k frames). This is again motivated by the fairness of the comparison between SimPLe and OTRainbow. On top of that, we compare the overall performance of all models depending on the amount of available data using the median human-normalized performance. That is, we normalize agent scores on each game such that 0% is the performance of the random agent and 100% corresponds to the human score.
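The normalization described above can be written down directly. The sketch below is illustrative: the function names are hypothetical and the scores in the example are made-up numbers, not results from this study.

```python
from statistics import median


def human_normalized(agent, random_score, human_score):
    """Map a raw score to a percentage: random agent = 0, human = 100."""
    return 100.0 * (agent - random_score) / (human_score - random_score)


def median_human_normalized(scores):
    """`scores` maps each game to a tuple (agent, random, human) of raw scores."""
    return median(human_normalized(*s) for s in scores.values())


# Made-up raw scores for three hypothetical games.
example_scores = {
    "game_a": (10.0, 0.0, 20.0),    # halfway between random and human
    "game_b": (5.0, 5.0, 25.0),     # no better than random
    "game_c": (50.0, 10.0, 30.0),   # twice the human margin over random
}
example_median = median_human_normalized(example_scores)
```

Using the median rather than the mean keeps a single game with a very large normalized score from dominating the summary.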
Section 6.1 analyzes data efficiency in detail. Section 6.2 then focuses on long-term performance. Overall, OTRainbow and SimPLe prove to be the best models for the 100k-interactions-only setting, with no clear winner between the two. Not surprisingly, SRainbow leads with regard to long-term performance, as it does not sacrifice exploration to achieve the best possible scores within the first 100k steps. When comparing SimPLe to the variations of Rainbow DQN with respect to computational complexity, SimPLe is orders of magnitude more expensive. As shown in section 4, using SimPLe increases the ratio of training steps to environment interactions 126 times, while the most computationally demanding variation of Rainbow, OTRainbow, increases it 32 times. Thus, when taking into account only the reinforcement learning part, SimPLe already requires almost 4 times more network updates. On top of that, however, SimPLe has to perform expensive training of the world model. As reported by Kaiser et al. (2019), the full version of SimPLe takes more than three weeks to complete training on 100k data points. Using the same amount of data, OTRainbow is able to finish within 24 hours (when running on 8 cores of an Intel Haswell CPU).
6.1 Data efficiency
Results presented in this section are obtained after running 100k training interactions of the agent with the real environment (excluding Human). This setting is unfair towards SRainbow, as it does not finish epsilon decay in that time. Nevertheless, its results are still provided as one of the baselines to make it clearly visible that, although SRainbow is more likely to produce the best results in the long run, it achieves very poor performance during the first few iterations. Numerical results for this setting are shown in Table 1. Moreover, Figure 1 compares OTRainbow and SimPLe directly, using a graphical convention similar to that of Kaiser et al. (2019). In this study, however, we use a logarithmic scale to denote the number of data samples needed to reach SimPLe's score. Doing so ensures that whether OTRainbow requires n times more experience or n times less, the visual absolute deviation from the SimPLe baseline is the same. Also, results are clipped to an absolute maximum deviation of 5x (i.e., 20k - 500k), as OTRainbow was evaluated on a maximum of 500k interactions due to computational constraints.
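The effect of the logarithmic scale and the clipping can be checked numerically. The sketch below is an assumption about how such a quantity could be computed, not the plotting code behind Figure 1; the function name, the base-10 logarithm, and the sign convention are choices made for the example.

```python
import math


def log_deviation(samples_needed, baseline=100_000, max_factor=5.0):
    """Signed log-scale deviation from the 100k-interaction baseline."""
    factor = samples_needed / baseline
    # Clip to [1/5, 5], i.e. 20k to 500k interactions for a 100k baseline.
    clipped = min(max(factor, 1.0 / max_factor), max_factor)
    return math.log10(clipped)


twice_as_many = log_deviation(200_000)   # needs 2x more experience
half_as_many = log_deviation(50_000)     # needs 2x less experience
```

Here `twice_as_many` and `half_as_many` have the same magnitude and opposite signs, so a bar for "2x more" is exactly as long as a bar for "2x less", which would not hold on a linear axis.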
We can see that both OTRainbow and SimPLe outperform Random on all 26 games. Interestingly, neither HRainbow nor SRainbow managed to do the same, although HRainbow falls behind Random only when playing Kangaroo. OTRainbow produces better scores than HRainbow on all games, which is a much better result than that of SimPLe, which beats HRainbow on only 20 out of 26 games. In a direct comparison, OTRainbow and SimPLe perform very evenly: OTRainbow outperforms SimPLe on exactly half of the games but is dominated by SimPLe on the remaining half. Interestingly, the original paper behind SimPLe reported that efficiency on Freeway benefits most from the model-based approach, with SimPLe being 10x more efficient than HRainbow. However, this result is improved even further by OTRainbow, which manages to score over 8 points higher. We also calculate the median human-normalized performance for each algorithm. Full numerical results of these calculations can be seen in Table 3 in Appendix A. The median human-normalized performance of OTRainbow exceeds that of SimPLe by over 10 percentage points; however, SimPLe achieves super-human performance on 3 games (Pong, CrazyClimber, Boxing), while OTRainbow manages to do so only on Krull.
Overall, although both OTRainbow and SimPLe can learn much better policies than all other models in the low-data regime, neither of them visibly outperforms the other. These results show that even the state-of-the-art model-based approach, highly tuned for achieving the best scores given a small number of interactions with the real environment, is not significantly more data-efficient than slightly modified existing techniques.
6.2 Different numbers of iterations
In addition to the score in the low-data regime, it is important that the agent can continue improving when performing further interactions with the environment. To evaluate that, we tested OTRainbow in settings with up to 500k interactions and provided SRainbow baselines for 500k, 1M, and 2M interactions. We were not able to execute experiments with different amounts of real experience for SimPLe or HRainbow, the reasons being the computational requirements of the former and the undisclosed hyperparameters of the latter. However, we try to draw a comparison with SimPLe based on the analysis provided in the original paper.
Figure 2 shows the median human-normalized performance for each of the evaluated methods. OTRainbow scores surprisingly high in both data settings, with its low-data version (100k) achieving a better median result than SRainbow after a full 1M steps. We hypothesize, however, that the improvement in OTRainbow's performance quickly slows down after the initial 500k steps, similarly to what was observed for SimPLe by Kaiser et al. (2019). This hypothesis is based on the change in performance between the two evaluations of OTRainbow, relative to the standard algorithm. That is, the improvement between OTRainbow (100k) and OTRainbow (500k) is barely over 1.6x, despite 5x more data. SRainbow, between the same data regimes, improves over 100x, followed by a further 5x improvement for each subsequent doubling of the data (from 500k to 1M, and from 1M to 2M). Nevertheless, this should be confirmed empirically in future work.
We presented an intuition for why previous research did not use fair baselines when comparing new advancements with existing methods. We suggested a way of using the state-of-the-art Rainbow DQN, namely OTRainbow, that leverages Rainbow's actual capabilities in terms of data efficiency. We showed experimentally that model-free OTRainbow is no worse than the state-of-the-art model-based approaches when given limited data, while requiring an order of magnitude less computation. This shows that recent work in sample-efficient deep reinforcement learning does not produce significant improvements over existing methods, upholding the position of model-free algorithms as the state of the art both in terms of data efficiency and total performance. Through these results, we aim to underline the importance of using appropriate model-free baselines, such as OTRainbow, in future research that tries to improve the data efficiency of deep RL approaches.
References
- Azizzadenesheli et al. (2018). Sample-efficient deep RL with generative adversarial tree search. arXiv preprint arXiv:1806.05780.
- Bellemare et al. (2013). The Arcade Learning Environment: an evaluation platform for general agents. Journal of Artificial Intelligence Research 47, pp. 253–279.
- Buesing et al. (2018). Woulda, coulda, shoulda: counterfactually-guided policy search. arXiv preprint arXiv:1811.06272.
- Castro et al. (2018). Dopamine: a research framework for deep reinforcement learning.
- Ha and Schmidhuber (2018). World models. arXiv preprint arXiv:1803.10122.
- Hessel et al. (2018). Rainbow: combining improvements in deep reinforcement learning. In Thirty-Second AAAI Conference on Artificial Intelligence.
- Holland et al. (2018). The effect of planning shape on Dyna-style planning in high-dimensional state spaces. arXiv preprint arXiv:1806.01825.
- Irpan (2018). Deep reinforcement learning doesn't work yet. https://www.alexirpan.com/2018/02/14/rl-hard.html
- Isola et al. (2017). Image-to-image translation with conditional adversarial networks. pp. 1125–1134.
- Kaiser et al. (2019). Model-based reinforcement learning for Atari. arXiv preprint arXiv:1903.00374.
- Leibfried et al. (2016). A deep learning approach for joint video frame and reward prediction in Atari games. arXiv preprint arXiv:1611.07078.
- Machado et al. (2018). Revisiting the Arcade Learning Environment: evaluation protocols and open problems for general agents. Journal of Artificial Intelligence Research 61, pp. 523–562.
- Mnih et al. (2015). Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529.
- Oh et al. (2015). Action-conditional video prediction using deep networks in Atari games. In Advances in Neural Information Processing Systems, pp. 2863–2871.
- Oh et al. (2017). Value prediction network. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), pp. 6118–6128.
- Racanière et al. (2017). Imagination-augmented agents for deep reinforcement learning. In Advances in Neural Information Processing Systems, pp. 5690–5701.
- Schulman et al. (2015). Trust region policy optimization. In International Conference on Machine Learning, pp. 1889–1897.
- Schulman et al. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
- Silver et al. (2016). Mastering the game of Go with deep neural networks and tree search. Nature 529 (7587), pp. 484.
- Sutton and Barto (2018). Reinforcement learning: an introduction. MIT Press.
- Sutton (1991). Dyna, an integrated architecture for learning, planning, and reacting. ACM SIGART Bulletin 2 (4), pp. 160–163.
- Tsividis et al. (2017). Human learning in Atari. In 2017 AAAI Spring Symposium Series.
- Watkins and Dayan (1992). Q-learning. Machine Learning 8 (3-4), pp. 279–292.
Appendix A Complete numerical results
| OTRainbow (100k) | OTRainbow (500k) | SimPLe (100k) | HRainbow (100k) | SRainbow (100k) |
| SRainbow (500k) | SRainbow (1M) | SRainbow (2M) | Human | Random |
| OTRainbow (100k) | OTRainbow (500k) | SimPLe (100k) | HRainbow (100k) |
| SRainbow (100k) | SRainbow (500k) | SRainbow (1M) | SRainbow (2M) |