A Human Mixed Strategy Approach to Deep Reinforcement Learning

04/05/2018 ∙ by Ngoc Duy Nguyen, et al. ∙ Deakin University

In 2015, Google's DeepMind announced an advancement in creating an autonomous agent based on deep reinforcement learning (DRL) that could beat a professional player in a series of 49 Atari games. However, the current manifestation of DRL is still immature and has significant drawbacks. One of DRL's imperfections is its lack of "exploration" during the training process, especially when working with high-dimensional problems. In this paper, we propose a mixed strategy approach that mimics the behavior of humans when interacting with an environment, and we create a "thinking" agent that allows for more efficient exploration in the DRL training process. The simulation results based on the Breakout game show that our scheme achieves a higher probability of obtaining a maximum score than the baseline DRL algorithm, i.e., the asynchronous advantage actor-critic method. The proposed scheme can therefore be applied effectively to solving complicated tasks in real-world applications.




I Introduction

Recent advances in deep learning have made reinforcement learning (RL) [2] a possible solution for creating an agent that can mimic human behaviors [3, 4, 5, 6]. In 2015, for the first time, Mnih et al. [7] succeeded in training an agent to surpass human performance in playing Atari games. By employing a convolutional layer [8], the agent directly perceives the environment’s state in the form of a graphical representation. Furthermore, the agent responds with a proper action for each perceived state to maximize the long-term reward. Specifically, Mnih et al. [7] created a novel structure, named deep Q-network (DQN), which simulated the human brain to take decisive actions in a series of 49 Atari games. As a result, DQN initiated a new research branch of machine learning called deep RL, which has recently attracted considerable research attention.

Since 2015, there have been extensive improvements to DQN. However, most of these variants substantially modify some aspect of the DQN structure to address a specific weakness. For example, van Hasselt et al. [9, 10] explored the idea of double Q-learning to stabilize the convergence of DQN; Schaul et al. [11] reduced correlated samples by assigning a priority to each transition in experience replay using temporal-difference (TD) error; Wang et al. [12] adjusted DQN’s policy network to direct the agent’s attention to only the important regions of the game; and Hausknecht and Stone [13] added a recurrent layer to DQN to prolong the agent’s memory. In 2016, Mnih et al. [14] proposed an asynchronous method of deep RL called asynchronous advantage actor-critic (A3C). A3C combines an actor-critic architecture [15], an advantage function, and multithreading to drastically improve on DQN in both training speed and score achievement. Therefore, in this paper, we compare our proposed scheme with A3C, which is considered a baseline deep RL algorithm.

Fig. 1: Breakout gameplay in the Arcade Learning Environment [16].

Our proposed scheme is motivated by how the human brain operates while playing a game. The human brain naturally divides a complicated task into a series of smaller and easier functional missions. This strategy, divide and conquer, appears in everyday human activities. In this paper, we integrate this strategy into deep RL to create a human-like agent. Furthermore, we demonstrate our proposed scheme in the Breakout game using the Arcade Learning Environment [16]. In Breakout, the player moves the red paddle left or right so that it catches the falling ball, as shown in Fig. 1. When the ball touches the paddle, it bounces back and breaks the bricks at the top of the screen. The goal is to break as many bricks as possible while keeping the ball above the paddle at all times. If the paddle misses the ball, the player loses a life. If the player misses the ball five times, the game is over.

Fig. 2: Human strategy when playing Breakout. a) If all bricks are still in place, we focus only on the ball. b) If few bricks are left, we focus on the ball, the speed and direction of the paddle, and the position of the bricks.

Fig. 2 illustrates in detail a human strategy to achieve a high score when playing Breakout. At the beginning of gameplay, all bricks remain at the top of the screen, and the strategy is to focus only on the ball motion to ensure a successful catch, as the probability of breaking a brick in this case is high. Towards the end of the game, when few bricks are left, the player gradually shifts attention to the bricks. This scenario is more complicated, because we focus not only on the ball, but also on its speed and the position of the bricks. In Atari games, deep RL algorithms often perceive an environment’s state as a whole. Therefore, any unimportant changes in the environment may cause unintended noise, and hence may degrade the algorithm’s performance. In this paper, we suggest an approach that partially eliminates this drawback, reduces input data density, and encourages further exploration.

In summary, the paper brings the following key contributions:

  • We take a first step towards integrating a human strategy into deep RL. Moreover, the proposed scheme is general in the sense that it can be used with any deep RL algorithm. In this paper, we examine our divide-and-conquer strategy for A3C, a state-of-the-art deep RL algorithm.

  • Although we only demonstrate our approach using the Breakout game because of the limited scope of this paper, our proposed scheme is extendable, i.e., it can be employed in any game as well as in real-world applications.

  • We provide helpful guidelines to solve a complicated task by using a divide-and-conquer strategy. Specifically, we divide a complicated task into smaller tasks, then use different strategies to conquer each task. Finally, we combine multiple policies using a generalized version of the stochastic $\epsilon$-greedy rule to produce a single mixed strategy policy. Therefore, the resulting agent becomes more human-like and more flexible in a stochastic environment.

The paper is organized as follows: the next section summarizes variants of the deep RL algorithm; Section III illustrates our proposed scheme; Section IV shows our simulation results; and Section V concludes our work.

II Related Work

As mentioned in Section I, DQN [7] is the first successful attempt to combine deep learning with RL. The key idea of DQN is to use a neural network to approximate the optimal value function by minimizing the following loss function over transitions $(s, a, r, s')$ with discount factor $\gamma$:

$$L(\theta) = \mathbb{E}\left[\left(r + \gamma \max_{a'} Q(s', a'; \theta') - Q(s, a; \theta)\right)^2\right], \tag{1}$$

where $\theta$ and $\theta'$ represent the parameters of the estimation network and target network, respectively. To break the correlation between samples and to stabilize the convergence, Mnih et al. [7] introduced an experience replay that stores past samples and a target network that is updated asynchronously every $C$ steps from the estimation network. Although DQN can solve challenging problems in the RL literature, it still has drawbacks, and it has been improved continually since its inception in 2015. First, van Hasselt et al. [10] proposed a double deep Q-network (DDQN) to reduce the overestimation problem in Q-learning by separating action evaluation from action selection. In other words, the loss function (1) is replaced by the following function:

$$L(\theta) = \mathbb{E}\left[\left(r + \gamma\, Q\big(s', \arg\max_{a'} Q(s', a'; \theta); \theta'\big) - Q(s, a; \theta)\right)^2\right].$$
To promote “rare” samples, Schaul et al. [11] proposed a prioritized experience replay that assigns each sample in the experience replay a priority based on its TD error. Finally, Wang et al. [12] introduced a dueling network architecture that breaks down a Q-value $Q(s, a)$ into a state value $V(s)$ and an action advantage value $A(s, a)$, as below:

$$Q(s, a) = V(s) + \left(A(s, a) - \frac{1}{|\mathcal{A}|} \sum_{a'} A(s, a')\right),$$

where $|\mathcal{A}|$ denotes the number of possible actions. The dueling network helps to stabilize the policy network, especially in environments with sparse rewards.
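The difference between the DQN target, the double DQN target, and the dueling aggregation can be illustrated with a minimal NumPy sketch. The function names and the toy Q-value arrays are hypothetical and chosen only for illustration; they are not taken from any of the cited implementations.

```python
import numpy as np


def dqn_target(reward, q_next_target, gamma=0.99, done=False):
    """DQN target: r + gamma * max_a' Q(s', a'; theta')."""
    if done:
        return reward
    return reward + gamma * float(np.max(q_next_target))


def double_dqn_target(reward, q_next_online, q_next_target, gamma=0.99, done=False):
    """Double DQN target: the estimation network selects the action,
    the target network evaluates it."""
    if done:
        return reward
    best_action = int(np.argmax(q_next_online))
    return reward + gamma * float(q_next_target[best_action])


def dueling_q_values(state_value, advantages):
    """Dueling aggregation: Q(s, a) = V(s) + (A(s, a) - mean_a' A(s, a'))."""
    advantages = np.asarray(advantages, dtype=float)
    return state_value + (advantages - advantages.mean())


# Toy Q-values for the next state s' (three actions, made up for the example).
q_online = np.array([1.0, 3.0, 2.0])   # Q(s', . ; theta)
q_target = np.array([2.5, 1.5, 2.0])   # Q(s', . ; theta')

print("DQN target:       ", dqn_target(1.0, q_target))
print("Double DQN target:", double_dqn_target(1.0, q_online, q_target))
print("Dueling Q-values: ", dueling_q_values(2.0, [1.0, 0.0, -1.0]))
```

In this toy case the DQN target takes the largest target-network value, whereas the double DQN target evaluates the action preferred by the estimation network, which yields a smaller and less optimistic target.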

Another major drawback of DQN is training time. DQN requires 7–8 days of training to surpass human performance in each Atari game. Therefore, in 2016, Mnih et al. [14] introduced an asynchronous version of DQN as well as A3C. Their simulations show that A3C drastically shortens the training process to only 1–3 days on a CPU. A3C has therefore become a baseline approach for deep RL. In this paper, we use A3C as the benchmark algorithm for comparison with our proposed scheme.

III Proposed Scheme

Fig. 3: A divide and conquer approach with two strategies using A3C.

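As mentioned above, we use A3C as the base deep RL algorithm to integrate our divide-and-conquer strategy. Note that A3C uses the actor-critic architecture, which was proposed by Konda and Tsitsiklis [17]. Therefore, there exist two networks in A3C: an actor network and a critic network. The actor network, parameterized by $\theta$, represents a stochastic policy, $\pi(a_t \mid s_t; \theta)$. It perceives state $s_t$ as input and produces probabilities for all possible actions as output. On the other hand, the critic network, parameterized by $\theta_v$, represents a value function at state $s_t$, $V(s_t; \theta_v)$. The overall objective of A3C is to minimize the following loss function [14]:

$$L(\theta, \theta_v) = -\log \pi(a_t \mid s_t; \theta)\,\big(R_t - V(s_t; \theta_v)\big) + \big(R_t - V(s_t; \theta_v)\big)^2 - \beta H\big(\pi(s_t; \theta)\big), \tag{2}$$

where $H$ denotes the entropy function, $\beta$ weights the entropy term (Beta in Table I), and:

$$R_t = \sum_{i=0}^{k-1} \gamma^i r_{t+i} + \gamma^k V(s_{t+k}; \theta_v).$$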

Fig. 4: A “local stuck” phenomenon in Breakout due to lack of exploration.

The entropy regularization term in (2) is used to encourage exploration in the training process. In practice, the actor network and critic network often share parameters in the convolutional layers. Therefore, we can treat the actor and critic networks as a single network with two output layers, denoted as $\pi$.
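To make the roles of the advantage and the entropy term concrete, the following is a minimal NumPy sketch of an A3C-style loss over a short rollout. It assumes the common form from [14] (policy term weighted by the n-step advantage, squared value error, and an entropy bonus weighted by beta); the function names and the unit weight on the value term are assumptions for illustration, not the authors' implementation.

```python
import numpy as np


def n_step_returns(rewards, bootstrap_value, gamma=0.99):
    """Discounted returns R_t, bootstrapped from the critic's value of the last state."""
    returns = np.zeros(len(rewards))
    running = bootstrap_value
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def a3c_loss(action_probs, actions, values, rewards, bootstrap_value,
             gamma=0.99, beta=0.1):
    """Scalar A3C-style loss summed over a rollout of T steps.

    action_probs: (T, n_actions) actor outputs pi(. | s_t)
    actions:      (T,) indices of the actions actually taken
    values:       (T,) critic outputs V(s_t)
    """
    action_probs = np.asarray(action_probs, dtype=float)
    values = np.asarray(values, dtype=float)
    returns = n_step_returns(rewards, bootstrap_value, gamma)
    # In a real implementation the advantage is treated as a constant
    # (no gradient flows through it) inside the policy term.
    advantages = returns - values
    chosen = action_probs[np.arange(len(actions)), actions]
    policy_loss = -np.sum(np.log(chosen) * advantages)
    value_loss = np.sum(advantages ** 2)
    entropy = -np.sum(action_probs * np.log(action_probs + 1e-8))
    return policy_loss + value_loss - beta * entropy


# Toy rollout of two steps with two actions (values made up for the example).
probs = [[0.6, 0.4], [0.3, 0.7]]
loss = a3c_loss(probs, actions=[0, 1], values=[0.5, 0.4],
                rewards=[0.0, 1.0], bootstrap_value=0.0)
print("A3C loss:", loss)
```

A larger beta lowers the loss for high-entropy (more uniform) action distributions, which is how the entropy term discourages premature convergence to a deterministic policy.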

Based on A3C, our proposed scheme (Fig. 3) can be described in the following three steps:

1) First, we train a policy network $\pi_1$ to learn Breakout using A3C. The state $s$ of the environment (a history of four frames) is converted to grayscale and fed directly to $\pi_1$. In this way, the policy $\pi_1$ is trained to learn all aspects of the game, including the position of the paddle, the ball motion, and the regions of the bricks. As explained earlier, any unimportant changes in $s$ may degrade the performance of the algorithm. The policy $\pi_1$ thus represents the human strategy shown in Fig. 2b.

2) Second, we use A3C to train a second network $\pi_2$. We remove all immutable objects from the input state $s$ before feeding it to $\pi_2$, and give a negative reward for any life lost. In our implementation, we blacken all immutable objects in state $s$. This policy is a life safeguard. It focuses only on the ball, and continues to catch the ball regardless of the presence or absence of bricks. Evidently, this strategy is only suitable at the beginning of the game, as shown in Fig. 2a. The use of this pure strategy can lead to a negative effect, which we name “local stuck”. This phenomenon occurs when the gameplay is stuck in an infinite loop. In Fig. 4, for example, the paddle moves only between two positions, so the ball repeats the same closed trajectory and the game becomes stuck. This phenomenon occurs due to the lack of exploration in the training process. It also occurs in policy $\pi_1$, but less frequently. The “local stuck” phenomenon can be observed easily in a recorded video (https://youtu.be/gbcdPSQP4XI).

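3) Finally, we combine $\pi_1$ and $\pi_2$ to create a stochastic policy $\pi$ using the following generalized version of the $\epsilon$-greedy rule with two strategies:

$$\pi(s) = \begin{cases} \pi_1(s) & \text{with probability } \rho, \\ \pi_2(s) & \text{with probability } 1 - \rho, \end{cases}$$

where $0 \le \rho \le 1$ is an adjustment factor. Because the two selection probabilities are non-negative and sum to one, $\pi$ is itself a valid stochastic policy, one that integrates two human strategies. The adjustment factor $\rho$ can be used to modify the tendency of the agent: if $\rho = 0$, $\pi$ becomes $\pi_2$, and vice versa. Moreover, the policy $\pi$ with $0 < \rho < 1$ can drastically reduce the “local stuck” phenomenon because of its stochastic behavior.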
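The combination step can be illustrated with a short NumPy sketch. It assumes that the rule follows one policy with probability rho and the other policy otherwise; the symbol rho, the function name, and the toy action distributions are assumptions made for illustration. Under that assumption, the action distribution of the combined policy is a convex mixture of the two actor outputs, so it always sums to one.

```python
import numpy as np


def mixed_action_probs(probs_1, probs_2, rho):
    """Action distribution of the combined policy under the assumed rule:
    follow policy 1 with probability rho, policy 2 with probability 1 - rho."""
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [0, 1]")
    probs_1 = np.asarray(probs_1, dtype=float)
    probs_2 = np.asarray(probs_2, dtype=float)
    return rho * probs_1 + (1.0 - rho) * probs_2


# Toy actor outputs for one state (three actions, made up for the example).
pi_1 = np.array([0.7, 0.2, 0.1])
pi_2 = np.array([0.1, 0.1, 0.8])

for rho in (0.0, 0.5, 1.0):
    mixed = mixed_action_probs(pi_1, pi_2, rho)
    print(f"rho={rho}: probs={mixed}, sum={mixed.sum():.3f}")
```

At the two extremes the mixture reduces to one of the pure policies, while any intermediate value keeps every action that either policy considers plausible, which is the source of the extra stochasticity.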

Name      Parameter                                        Value
Breakout  Episode max steps                                10000
Breakout  Number of skipped frames                         4
Breakout  Loss of life marks terminal state                No
Breakout  Repeat action probability                        0
Breakout  Pixel-wise maximum over two consecutive frames   No
A3C       Optimized learning rate                          0.004
A3C       Discount factor                                  0.99
A3C       Beta                                             0.1
A3C       Number of history frames                         4
A3C       Global normalization clipping                    40
A3C       Asynchronous training steps                      5
A3C       Number of threads                                8
A3C       Anneal learning rate                             Yes
A3C       Optimizer                                        RMS optimizer
A3C       RMS’s decay                                      0.99
A3C       RMS’s epsilon                                    1e-6
TABLE I: Environment & Algorithm settings

This mixed strategy approach can be extended to $n$ policies ($n > 2$) using the following guidelines:

1) Given a complicated task $T$, we divide $T$ into $n$ smaller tasks and solve these tasks using different strategies to achieve the best overall performance.

2) We select a suitable deep RL algorithm (such as A3C) and train $n$ policy networks. Each policy network corresponds to one human strategy.

3) For each policy $\pi_i$, we assign a priority $\rho_i$ so that $\rho_i \ge 0$ and $\sum_{i=1}^{n} \rho_i = 1$. We then infer a combined policy $\pi$ using the following generalized version of the $\epsilon$-greedy rule with $n$ strategies, as shown below:

$$\pi(s) = \pi_i(s) \quad \text{with probability } \rho_i, \quad i = 1, \dots, n.$$
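The guidelines above can be sketched as a sampling procedure. The sketch assumes that each policy is selected with a probability equal to its priority and that the priorities are non-negative and sum to one; the function names, the toy policies, and the priority values are hypothetical.

```python
import numpy as np


def select_policy(priorities, rng):
    """Pick a policy index with probability equal to its priority."""
    priorities = np.asarray(priorities, dtype=float)
    if np.any(priorities < 0) or not np.isclose(priorities.sum(), 1.0):
        raise ValueError("priorities must be non-negative and sum to 1")
    return int(rng.choice(len(priorities), p=priorities))


def mixed_strategy_action(policies, priorities, state, rng):
    """Sample a policy by priority, then sample an action from that policy."""
    index = select_policy(priorities, rng)
    action_probs = np.asarray(policies[index](state), dtype=float)
    action = int(rng.choice(len(action_probs), p=action_probs))
    return action, index


# Toy stand-ins for trained policy networks: each maps a state to action probabilities.
def always_left(state):
    return [1.0, 0.0, 0.0]


def always_right(state):
    return [0.0, 0.0, 1.0]


def uniform(state):
    return [1 / 3, 1 / 3, 1 / 3]


rng = np.random.default_rng(0)
policies = [always_left, always_right, uniform]
priorities = [0.5, 0.3, 0.2]

counts = np.zeros(len(policies), dtype=int)
for _ in range(10_000):
    _, chosen_index = mixed_strategy_action(policies, priorities, None, rng)
    counts[chosen_index] += 1
print("empirical selection frequencies:", counts / counts.sum())
```

Because the selection is redrawn at every step, the priorities can be changed between steps without retraining any of the underlying policies.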

Fig. 5: Reward distribution while training $\pi_1$ and $\pi_2$.
Fig. 6: Mixed strategy policy with different values of $\rho$. We run 120,000 steps and record minimum, maximum, and average median reward for each checkpoint.

IV Simulation Results

In this section, we use the Breakout game as an environment to examine our mixed strategy scheme. We train two different strategies, $\pi_1$ and $\pi_2$, using A3C. Table I summarizes the parameter settings of Breakout and A3C. The algorithm is run on a computer with a GTX 1080 Ti graphics card. Unlike $\pi_1$, in the training process of $\pi_2$ we allocate a negative reward for each life lost. Moreover, we train both policies for 70 million steps, which is equivalent to 280 million frames of Breakout.
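The two changes that distinguish the training of the second policy, blackening part of the input and penalizing a lost life, can be sketched as follows. Which screen regions count as immutable and how large the penalty is are not specified here, so the row range and the `life_penalty` argument are placeholders, and the function names are hypothetical.

```python
import numpy as np


def blacken_region(frame, top, bottom):
    """Return a copy of a grayscale frame with rows [top, bottom) set to black.
    The row range is a placeholder for whatever region should be hidden."""
    masked = np.array(frame, copy=True)
    masked[top:bottom, :] = 0
    return masked


def shaped_reward(reward, lives_before, lives_after, life_penalty):
    """Subtract a penalty from the environment reward when a life is lost."""
    if lives_after < lives_before:
        return reward - abs(life_penalty)
    return reward


# Toy 6x4 grayscale frame; rows 1-2 stand in for the region to hide.
frame = np.full((6, 4), 255, dtype=np.uint8)
masked = blacken_region(frame, top=1, bottom=3)
print(masked)

# A step that scores 1 point but loses a life, with an illustrative penalty of 1.
print(shaped_reward(1.0, lives_before=5, lives_after=4, life_penalty=1.0))
```

Both functions leave the environment itself untouched, so the same emulator settings from Table I can be reused for both policies.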

During the training process, we collect the total reward achieved in each episode and record it, as seen in Fig. 5. It is evident that a score of 800 is easier to achieve with policy $\pi_2$ than with $\pi_1$. Therefore, the proposed policy $\pi_2$ achieves a maximum score with a higher probability than $\pi_1$. However, the use of the pure strategy $\pi_2$ is not recommended because it is prone to the “local stuck” phenomenon, as mentioned in the previous section. Therefore, a mixed strategy is used to balance the maximum score achievement and the average number of steps per episode. Given the same score, it is preferable to use a policy that takes fewer steps. In Fig. 6, the adjustment parameter $\rho$ is used to balance the two strategies. We also keep in all cases. We see that, with , the probability of achieving a score of 800 at 60 million training steps is high. Therefore, it is desirable to assign in the Breakout game. In a real-world application, the choice of $\rho$ depends on the objective, and it can be altered in real time to adapt to the environment. Finally, Fig. 7 shows the average number of steps used in each episode with different values of $\rho$. As expected, the pure policy $\pi_2$ uses the highest number of steps, but this can be fixed by increasing $\rho$. In summary, with , we obtain a balanced performance between maximum score achievement and average number of steps per episode.

V Conclusions

In this paper, we introduce an extended approach that applies human strategy to deep reinforcement learning. This marks the first step towards building a human-like agent that can adapt to its environment using human strategies. Because of the limited scope of this paper, we only simulated our mixed strategy approach using the Breakout game, but the proposed scheme can be applied to other Atari games, as well as to real-world problems. The simulation results confirm that our mixed strategy approach is efficient and promising. We also provide helpful guidelines to solve a complicated task by mimicking the divide-and-conquer strategy of human behavior. Our future work will focus on building human-like agents that can automatically adapt to their environments using different learning strategies.

Fig. 7: Average number of steps in each episode with different values of $\rho$.


  • [1] L. Deng and D. Yu, “Deep learning: Methods and applications,” Found. Trends Signal Process., vol. 7, no. 3–4, pp. 197–387, 2014.
  • [2] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA: The MIT Press, 2012.
  • [3] J. Kober, J. A. Bagnell, and J. Peters, “Reinforcement learning in robotics: A survey,” Int. J. Robot. Res., vol. 32, no. 11, pp. 1238–1274, 2013.
  • [4] N. D. Nguyen, T. Nguyen, and S. Nahavandi, “System design perspective for human-level agents using deep reinforcement learning: A survey,” IEEE Access, vol. 5, pp. 27091–27102, 2017.
  • [5] D. Silver et al., “Mastering the game of Go with deep neural networks and tree search,” Nature, vol. 529, no. 7578, pp. 484–489, 2016.
  • [6] T. T. Nguyen, “A multi-objective deep reinforcement learning framework,” arXiv preprint arXiv:1803.02965, 2018.
  • [7] V. Mnih et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, 2015.
  • [8] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Adv. Neural Inf. Process. Syst., pp. 1097–1105, 2012.
  • [9] H. van Hasselt, “Double Q-learning,” in Adv. Neural Inf. Process. Syst., pp. 2613–2621, 2010.
  • [10] H. van Hasselt, A. Guez, and D. Silver, “Deep reinforcement learning with double Q-learning,” in AAAI, pp. 2094–2100, 2016.
  • [11] T. Schaul, J. Quan, I. Antonoglou, and D. Silver, “Prioritized experience replay,” in ICLR, 2016.
  • [12] Z. Wang, T. Schaul, M. Hessel, H. van Hasselt, M. Lanctot, and N. de Freitas, “Dueling network architectures for deep reinforcement learning,” in Int. Conf. Mach. Learn., pp. 1995–2003, 2016.
  • [13] M. Hausknecht and P. Stone, “Deep recurrent Q-learning for partially observable MDPs,” in AAAI Fall Symposium, 2015.
  • [14] V. Mnih et al., “Asynchronous methods for deep reinforcement learning,” in Int. Conf. Mach. Learn., pp. 1928–1937, 2016.
  • [15] V. R. Konda and J. N. Tsitsiklis, “Actor-critic algorithms,” in Adv. Neural Inf. Process. Syst., pp. 1008–1014, 2000.
  • [16] M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling, “The arcade learning environment: An evaluation platform for general agents,” J. Artif. Intell. Res., vol. 47, pp. 253–279, 2013.
  • [17] V. R. Konda and J. N. Tsitsiklis, “Actor-critic algorithms,” in Adv. Neural Inf. Process. Syst., pp. 1008–1014, 2000.