I Introduction
Recent advances in deep learning have made reinforcement learning (RL) [2] a promising approach for creating agents that can mimic human behaviors [3, 4, 5, 6]. In 2015, Mnih et al. [7] succeeded for the first time in training an agent to surpass human performance in playing Atari games. By employing convolutional layers [8], the agent directly perceives the environment's state in the form of a graphical representation and responds with a proper action for each perceived state to maximize the long-term reward. Specifically, Mnih et al. [7] created a novel structure, named deep Q-network (DQN), which simulated the human brain to take decisive actions in a series of 49 Atari games. As a result, DQN initiated a new research branch of machine learning called deep RL that has recently attracted considerable research attention.
Since 2015, there have been extensive improvements to DQN, most of which substantially modify the DQN structure to address its shortcomings. For example, Van Hasselt [9, 10] explored the idea of double Q-learning to stabilize the convergence of DQN; Schaul et al. [11] reduced correlated samples by assigning a priority to each transition in the experience replay using the temporal-difference (TD) error; Wang et al. [12] adjusted DQN's policy network to focus the agent's attention on only the important regions of the game; and Hausknecht and Stone [13] added a recurrent layer to DQN to prolong the agent's memory. In 2016, Mnih et al. [14] proposed an asynchronous method of deep RL called asynchronous advantage actor-critic (A3C). A3C combines the actor-critic architecture [15], an advantage function, and multithreading to drastically improve on DQN in two respects: training speed and score achievement. Therefore, in this paper, we compare our proposed scheme with A3C, which is considered the baseline deep RL algorithm.
Our proposed scheme is initially motivated by the activities of the human brain while playing a game. The human brain naturally divides a complicated task into a series of smaller and easier functional missions. This strategy, divide and conquer, appears in everyday human activities. In this paper, we integrate this strategy into deep RL to create a human-like agent. Furthermore, we demonstrate our proposed scheme in the Breakout game using the Arcade Learning Environment [16]. In Breakout, the player moves the red paddle to the left or the right so that it catches the falling ball, as shown in Fig. 1. When the ball touches the paddle, it bounces back and breaks the bricks at the top of the screen. The goal is to break as many bricks as possible while keeping the ball above the paddle at all times. If the paddle misses the ball, the player loses a life; after five lost lives, the game is over.
Fig. 2 illustrates in detail a human strategy to achieve a high score when playing Breakout. At the beginning of gameplay, all bricks remain at the top of the screen, and the strategy is to focus only on the ball's motion to ensure a successful catch, as the probability of breaking a brick in this case is high. Gradually, the player shifts attention to the bricks towards the end of the game, when only a few bricks are left. This scenario is more complicated, because the player must focus not only on the ball, but also on its speed and the positions of the remaining bricks. In Atari games, deep RL algorithms often perceive the environment's state as a whole. Therefore, any unimportant changes in the environment may cause unintended noise and hence degrade the algorithm's performance. In this paper, we suggest an approach that partially eliminates this drawback, reduces input data density, and encourages further exploration.
In summary, the paper brings the following key contributions:

We take a first step towards integrating a human strategy into deep RL. Moreover, the proposed scheme is general in the sense that it can be used with any deep RL algorithm. In this paper, we examine our divide-and-conquer strategy with A3C, a state-of-the-art deep RL algorithm.

Although we only demonstrate our approach using the Breakout game because of the limited scope of this paper, our proposed scheme is extendable, i.e., it can be employed in any game as well as in real-world applications.

We provide helpful guidelines for solving a complicated task using a divide-and-conquer strategy. Specifically, we divide a complicated task into smaller tasks and then use different strategies to conquer each task. Finally, we combine the multiple policies using a generalized version of the stochastic greedy rule to produce a single mixed-strategy policy. As a result, the agent becomes more human-like and more flexible in a stochastic environment.
II Related Work
As mentioned in Section I, DQN [7] is the first successful attempt to combine deep learning with RL. The key of DQN is the utilization of a neural network to approximate the optimal action-value function by minimizing the following loss function:

L(θ) = E[(r + γ max_a′ Q(s′, a′; θ⁻) − Q(s, a; θ))²]    (1)

where θ and θ⁻ represent the parameters of the estimation network and the target network, respectively. To break the correlation between samples and to stabilize convergence, Mnih et al. [7] introduced an experience replay that stores history samples and a target network that is updated from the estimation network every N steps. Although DQN can solve a challenging problem in the RL literature, it still has drawbacks, and it has been improved continually since its inception in 2015. First, Van Hasselt et al. [10] proposed the double deep Q-network (DDQN) to reduce the overestimation problem in Q-learning by separating action evaluation from action selection. In other words, the loss function (1) is replaced by the following function:

L(θ) = E[(r + γ Q(s′, argmax_a′ Q(s′, a′; θ); θ⁻) − Q(s, a; θ))²]

To promote "rare" samples, Schaul et al. [11] proposed a prioritized experience replay that assigns each sample in the experience replay a priority number based on its TD error. Finally, Wang et al. [12] introduced a dueling network architecture that decomposes a Q-value Q(s, a) into a state value V(s) and an advantage value A(s, a), as below:

Q(s, a) = V(s) + A(s, a) − (1/|A|) Σ_a′ A(s, a′)
where |A| denotes the number of possible actions. The dueling network helps to stabilize the policy network, especially in environments with sparse rewards.
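The difference between the DQN and double-DQN targets, together with the dueling aggregation, can be sketched in NumPy. All numbers below are illustrative placeholders, not values from the paper:

```python
import numpy as np

# Hypothetical Q-values for one next state s' under the estimation
# network (theta) and the target network (theta^-).
q_est = np.array([0.2, 0.9, 0.4])   # Q(s', a'; theta)
q_tgt = np.array([0.3, 0.5, 0.8])   # Q(s', a'; theta^-)
r, gamma = 1.0, 0.99

# DQN target: r + gamma * max_a' Q(s', a'; theta^-)
dqn_target = r + gamma * q_tgt.max()

# Double-DQN target: the estimation network selects the action,
# the target network evaluates it.
a_star = int(np.argmax(q_est))
ddqn_target = r + gamma * q_tgt[a_star]

# Dueling aggregation: Q(s,a) = V(s) + A(s,a) - mean_a' A(s,a')
v = 0.7                              # state value V(s)
adv = np.array([0.1, -0.2, 0.4])     # advantages A(s, a)
q_duel = v + adv - adv.mean()        # averages back to V(s)
```

Note how the double-DQN target differs from the DQN target whenever the two networks disagree on the best next action, which is exactly what tempers overestimation.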
Another major drawback of DQN is training time: DQN requires 7–8 days of training to surpass human performance in each Atari game. Therefore, in 2016, Mnih et al. [14] introduced asynchronous methods for deep RL, including A3C. Simulations show that A3C drastically speeds up the training process to only 1–3 days on a CPU, compared to DQN. As a result, A3C has become a baseline approach for deep RL, and we use it as the benchmark algorithm for comparison with our proposed scheme.
III Proposed Scheme
As mentioned above, we use A3C as the base deep RL algorithm to integrate our divide-and-conquer strategy. Note that A3C uses the actor-critic architecture proposed by Konda and Tsitsiklis [17]. Therefore, there exist two policies in A3C: one for the actor network and another for the critic network. The actor network, parameterized by θ, represents a stochastic policy π(a|s; θ). It perceives the state s as input and produces probabilities for all possible actions as output. On the other hand, the critic network, parameterized by θ_v, represents a value function V(s; θ_v) at state s. The overall objective of A3C is to minimize the following loss function [14]:

L(θ, θ_v) = −log π(a_t|s_t; θ)(R_t − V(s_t; θ_v)) + c(R_t − V(s_t; θ_v))² − β H(π(·|s_t; θ))    (2)

where H denotes the entropy function and R_t is the n-step return:

R_t = Σ_{i=0}^{n−1} γ^i r_{t+i} + γ^n V(s_{t+n}; θ_v)

The entropy regularization term in (2) is used to encourage exploration during training. In practice, the actor network and the critic network often share parameters in the convolutional layers. Therefore, we can assume that the actor and critic are actually a single network with two output layers.
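The A3C loss for a single rollout can be sketched in NumPy as below. The batch shapes, the function name, and the value-term coefficient `c` are our assumptions for illustration; the paper does not specify them here:

```python
import numpy as np

def a3c_loss(probs, actions, returns, values, beta=0.01, c=0.5):
    """A3C loss for one rollout (shapes are hypothetical).

    probs:   (T, A) action probabilities pi(a|s; theta)
    actions: (T,)   actions taken at each step
    returns: (T,)   n-step returns R_t
    values:  (T,)   critic estimates V(s_t; theta_v)
    """
    advantages = returns - values
    logp = np.log(probs[np.arange(len(actions)), actions])
    # Actor term: advantages act as constants (no gradient flows
    # through the critic in the policy update).
    policy_loss = -(logp * advantages).mean()
    # Critic term: squared n-step TD error, weighted by c.
    value_loss = c * (advantages ** 2).mean()
    # Entropy bonus H(pi) rewards spread-out action distributions,
    # encouraging exploration.
    entropy = -(probs * np.log(probs)).sum(axis=1).mean()
    return policy_loss + value_loss - beta * entropy
```

A larger `beta` penalizes near-deterministic policies more strongly, which is the exploration effect the entropy regularization term provides.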
Based on A3C, our proposed scheme (Fig. 3) can be described in the following three steps:
1) First, we train a policy network π_1 to learn Breakout using A3C. The state s of the environment (a history of four frames) is converted to grayscale and fed directly to π_1. In this way, the policy π_1 is trained to learn all aspects of the game, including the position of the paddle, the motion of the ball, and the regions of the bricks. As explained earlier, any unimportant changes in s may degrade the performance of the algorithm. Therefore, the policy π_1 represents the human strategy shown in Fig. 2b.
2) Second, we use A3C to train a second policy network π_2. We remove all immutable objects from the input state s before feeding it to π_2, and give a negative reward for any life lost. In our implementation, we blacken all immutable objects in the state s. This policy π_2 acts as a life safeguard: it focuses only on the ball and continues to catch the ball regardless of the presence or absence of bricks. Apparently, this strategy is only suitable in the beginning of the game, as shown in Fig. 2a. The use of this pure strategy can lead to a negative effect, which we name "local stuck". This phenomenon occurs when the gameplay is stuck in an infinite loop. In Fig. 4, for example, the paddle moves only between two different positions, which leads to an endless loop of ball motion, and the game becomes stuck. This phenomenon occurs because of a lack of exploration in the training process. It also occurs with policy π_1, but less frequently. The "local stuck" phenomenon can be observed at https://youtu.be/gbcdPSQP4XI.
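Steps 1 and 2 can be sketched as a small preprocessing routine. The frame size and the brick-region rows below are hypothetical placeholders, since the exact region depends on the emulator's frame layout, which the paper does not specify:

```python
import numpy as np

def preprocess(frame, mask_bricks=False, brick_rows=(20, 60)):
    """Convert an RGB frame to grayscale; optionally blacken the
    (immutable) brick region before feeding the state to pi_2.

    brick_rows is a hypothetical row range for the brick area.
    """
    gray = frame.mean(axis=2)          # simple RGB -> grayscale
    if mask_bricks:
        top, bottom = brick_rows
        gray[top:bottom, :] = 0.0      # remove immutable objects
    return gray

# A state s is a history of four preprocessed frames stacked together.
def make_state(frames, mask_bricks=False):
    return np.stack([preprocess(f, mask_bricks) for f in frames], axis=0)
```

With `mask_bricks=True`, π_2 only ever sees the moving objects (ball and paddle), which is what makes it a pure ball-catching safeguard.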
3) Finally, we combine π_1 and π_2 to create a stochastic policy π using the following generalized version of the greedy rule with two strategies:

π(a|s) = σ π_1(a|s) + (1 − σ) π_2(a|s)

where 0 ≤ σ ≤ 1. Because Σ_a π_1(a|s) = 1 and Σ_a π_2(a|s) = 1, we can easily infer that Σ_a π(a|s) = 1. Therefore, the combined policy π can be seen as a stochastic policy that integrates two human strategies. The adjustment factor σ can be used to modify the behavioral tendency of the agent: if σ = 1, π becomes π_1, and vice versa. Moreover, the policy π with 0 < σ < 1 can drastically reduce the "local stuck" phenomenon because of its stochastic behavior.
TABLE I: Parameter settings of Breakout and A3C.

Breakout
  Episode max steps: 10000
  Number of skipped frames: 4
  Loss of life marks terminal state: No
  Repeat action probability: 0
  Pixel-wise maximum over two consecutive frames: No

A3C
  Optimized learning rate: 0.004
  Discount factor: 0.99
  Beta: 0.1
  Number of history frames: 4
  Global normalization clipping: 40
  Asynchronous training steps: 5
  Number of threads: 8
  Anneal learning rate: Yes
  Optimizer: RMSProp
  RMSProp decay: 0.99
  RMSProp epsilon: 1e-6
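The RMSProp entries in Table I can be read as the following per-parameter update; this is a minimal sketch (the function name is ours), and in A3C the squared-gradient cache is shared across the learner threads:

```python
import numpy as np

def rmsprop_step(theta, grad, cache, lr=0.004, decay=0.99, eps=1e-6):
    """One RMSProp update using the Table I hyperparameters.

    cache holds the running average of squared gradients; it is
    updated in place of a per-thread optimizer state in A3C.
    """
    cache = decay * cache + (1.0 - decay) * grad ** 2
    theta = theta - lr * grad / (np.sqrt(cache) + eps)
    return theta, cache
```

The small `eps` keeps the division stable early in training, when the cache of squared gradients is still close to zero.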
This mixed-strategy approach can be extended to N policies (N ≥ 2) using the following guidelines:
1) Given a complicated task, we divide it into N smaller tasks and solve these tasks using N different strategies to achieve the best overall performance.
2) We select a suitable deep RL algorithm (such as A3C) and train N policy networks, one for each human strategy.
3) For each policy π_i, we assign a priority σ_i such that σ_i ≥ 0 and σ_1 + σ_2 + … + σ_N = 1. We then infer a combined policy π using the following generalized version of the greedy rule with N strategies, as shown below:

π(a|s) = σ_1 π_1(a|s) + σ_2 π_2(a|s) + … + σ_N π_N(a|s)
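Assuming each policy π_i yields a probability vector over the actions for the current state, the N-strategy combination rule can be sketched as:

```python
import numpy as np

def combine_policies(policies, priorities):
    """Mix N strategy policies: pi(a|s) = sum_i sigma_i * pi_i(a|s).

    policies:   (N, A) array, one probability vector per strategy.
    priorities: (N,) non-negative weights sigma_i summing to one.
    """
    policies = np.asarray(policies, dtype=float)
    sigma = np.asarray(priorities, dtype=float)
    # The constraints on sigma guarantee the mixture is itself a
    # valid probability distribution over actions.
    assert np.all(sigma >= 0) and np.isclose(sigma.sum(), 1.0)
    return sigma @ policies
```

The two-policy scheme used for Breakout is the special case N = 2 with priorities (σ, 1 − σ).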
IV Simulation Results
In this section, we use the Breakout game as an environment to examine our mixed-strategy scheme. We train two different strategy policies, π_1 and π_2, using A3C. Table I summarizes the parameter settings of Breakout and A3C. The algorithm is run on a computer with a GTX 1080 Ti graphics card. Unlike π_1, in the training process of π_2 we allocate a negative reward for each life lost. Moreover, we train both policies for 70 million steps, which is equivalent to 280 million frames of Breakout.
During the training process, we collect and record the total reward achieved in each episode, as seen in Fig. 5. It is evident that the safeguard policy π_2 achieves a score of 800 more easily than π_1, and hence reaches the maximum score with a higher probability. However, the use of the pure strategy π_2 is not recommended because it is prone to the "local stuck" phenomenon mentioned in the previous section. Therefore, a mixed strategy is used to balance the maximum score achievement against the average number of steps per episode: given the same score, a policy that uses fewer steps is preferable. In Fig. 6, the adjustment parameter σ is used to balance the two strategies; all other parameters are kept fixed. We see that, with a suitable choice of σ, the probability of achieving a score of 800 at 60 million training steps is high, so it is desirable to assign such a value of σ in the Breakout game. In a real-world application, the choice of σ depends on the goal objective, and it can be altered in real time to adapt to the environment. Finally, Fig. 7 shows the average number of steps used in each episode for different values of σ. As expected, the pure policy π_2 (σ = 0) uses the highest number of steps, but this can be remedied by increasing σ. In summary, a moderate σ yields a balanced performance between maximum score achievement and average number of steps per episode.
V Conclusions
In this paper, we introduce an extended approach that applies human strategies to deep reinforcement learning. This marks a first step towards building a human-like agent that can adapt to its environment using human strategies. Because of the limited scope of this paper, we only simulated our mixed-strategy approach using the Breakout game, but the proposed scheme can be applied to other Atari games as well as to real-world problems. The simulation results confirm that our mixed-strategy approach is efficient and promising. We also provide helpful guidelines for solving a complicated task by mimicking the divide-and-conquer strategy of human behavior. In future work, we will continue building human-like agents that can automatically adapt to their environments using different learning strategies.
References
 [1] L. Deng and D. Yu, “Deep learning: Methods and applications,” Found. Trends Signal Process., vol. 7, no. 3–4, pp. 197–387, 2014.
 [2] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA: The MIT Press, 2012.
 [3] J. Kober, J. A. Bagnell, and J. Peters, “Reinforcement learning in robotics: A survey,” Int. J. Robot. Res., vol. 32, no. 11, pp. 1238–1274, 2013.
 [4] N. D. Nguyen, T. Nguyen, and S. Nahavandi, “System design perspective for human-level agents using deep reinforcement learning: A survey,” IEEE Access, vol. 5, pp. 27091–27102, 2017.
 [5] D. Silver et al., “Mastering the game of Go with deep neural networks and tree search,” Nature, vol. 529, no. 7587, pp. 484–489, 2016.
 [6] T. T. Nguyen, “A multi-objective deep reinforcement learning framework,” arXiv preprint arXiv:1803.02965, 2018.
 [7] V. Mnih et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, 2015.

 [8] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Adv. Neural Inf. Process. Syst., pp. 1097–1105, 2012.
 [9] H. van Hasselt, “Double Q-learning,” in Adv. Neural Inf. Process. Syst., pp. 2613–2621, 2010.
 [10] H. van Hasselt, A. Guez, and D. Silver, “Deep reinforcement learning with double Q-learning,” in AAAI, pp. 2094–2100, 2016.
 [11] T. Schaul, J. Quan, I. Antonoglou, and D. Silver, “Prioritized experience replay,” in ICLR, 2016.
 [12] Z. Wang, T. Schaul, M. Hessel, H. van Hasselt, M. Lanctot, and N. de Freitas, “Dueling network architectures for deep reinforcement learning,” in Int. Conf. Mach. Learn., pp. 1995–2003, 2016.
 [13] M. Hausknecht and P. Stone, “Deep recurrent Q-learning for partially observable MDPs,” in AAAI Fall Symposium, 2015.
 [14] V. Mnih et al., “Asynchronous methods for deep reinforcement learning,” in Int. Conf. Mach. Learn., pp. 1928–1937, 2016.
 [15] V. R. Konda and J. N. Tsitsiklis, “Actorcritic algorithms,” in Adv. Neural Inf. Process. Syst., pp. 1008–1014, 2000.
 [16] M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling, “The arcade learning environment: An evaluation platform for general agents,” J. Artif. Intell. Res., vol. 47, pp. 253–279, 2013.
 [17] V. R. Konda and J. N. Tsitsiklis, “Actorcritic algorithms,” in Adv. Neural Inf. Process. Syst., pp. 1008–1014, 2000.