Reinforcement learning (RL) has attracted wide research interest due to its ability to plan a long-term solution for sequential decision making. However, traditional RL algorithms tend to work only on simple problems where the number of data dimensions is limited. To deal with high-dimensional environments, a deep neural network is used to approximate the probability distribution of all possible actions [3, 4]. However, the use of neural networks may cause divergence while estimating the Q-value function.
The primary reason for this instability is RL’s online learning, i.e., sequential data from the environment are strongly correlated. Therefore, every update of the Q-value function tends to approximate a biased value and forces the agent into local minimum solutions. As a result, a number of algorithms have been proposed in the literature to handle the correlation in the input data. For instance, correlated samples are first collected and stored in an experience replay memory, and are later selected randomly to estimate the Q-value function. Using a separate network (known as a target network) also reduces the effect of correlated samples because this network is updated only infrequently. Double Q-learning is another approach that performs action selection and action evaluation in two separate networks. To further dilute the correlation, each sample in the replay memory is assigned a priority level depending on its temporal difference error. Moreover, particularly in stochastic environments, an asynchronous method proves helpful, i.e., the goal is to create multiple, simultaneous agent-environment instances. In other words, agents are trained separately in different replica environments, but afterward they jointly update a shared policy network. Additionally, OpenAI uses a heuristic search to optimize an objective by conducting evolutionary generations through thousands of workers.
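As an illustration of two of the techniques mentioned above, the following minimal sketch shows a uniformly sampled experience replay memory and the Double Q-learning target, in which one set of Q-values selects the next action and a second set evaluates it. The class and function names, the discount factor, and the use of plain Python lists for Q-values are assumptions made for this example; they are not taken from the cited works.

```python
import random
from collections import deque


class ReplayMemory:
    """Fixed-size buffer; uniform random sampling breaks up temporal correlation."""

    def __init__(self, capacity):
        # The oldest transitions are discarded once capacity is reached.
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size, rng=random):
        # Draw transitions without replacement, ignoring their original order.
        return rng.sample(list(self.buffer), batch_size)


def double_q_target(reward, done, q_online_next, q_target_next, gamma=0.99):
    """Double Q-learning target for a single transition.

    q_online_next: Q-values of the next state from the online network (selects).
    q_target_next: Q-values of the next state from the target network (evaluates).
    """
    if done:
        return reward
    best_action = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    return reward + gamma * q_target_next[best_action]
```

Separating selection from evaluation means that an action overestimated by one network is not automatically scored with the same overestimate.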
However, what if an observed environment is profoundly biased (e.g., the Multiple Tank Defence game described in Section III), or if the biased data make up the major portion of the environment’s data distribution? In these cases, all the algorithms mentioned above may no longer work because the agent is tricked into following a greedy solution. In the worst cases, the agent turns into a “zombie” and behaves peculiarly. One straightforward solution is to extend the training time or to allocate more computational resources so that further exploration may be attained. However, extending the training time is not desirable, and may even prove intractable if the observed environment is fully biased. For instance, consider an agent that controls a fighter jet over a battlefield with a variety of enemies; the jet will crash if it collides with an enemy or if it runs out of fuel. The jet may replenish its fuel by flying over a fuel depot. However, shooting a fuel depot earns a high reward, which tricks the agent into shooting all the fuel depots to maximize its score. As a result, the agent rarely advances to the next stage because of fuel depletion and becomes stuck in a local minimum solution. Multiple Tank Defence is another example of such an environment and will be described in Section III.
In this paper, we tackle biased environments by using human strategies. Initially, we use human strategies to represent the environment as a goal map. In the goal map, we define a set of targets, each of which represents a human-centric goal. For instance, in the previous example, the desired agent should control the jet to shoot fuel depots to accumulate rewards but should recognize when fuel is low and refill it. The goal map in this case includes two essential targets: one target represents any obstacle that the jet can shoot to accumulate rewards, and the other represents the action of refilling the fuel. By using this approach, the player can prolong the jet’s lifetime and achieve a higher score in the long term. The next step is to train the agent to understand the goal map and achieve the predefined targets in the goal map. Therefore, we develop a software architecture called Multi-Target System (MTS) that is used to train agents to understand the goal map and to enable cooperation between multiple agents. Fig. 1 illustrates our proposed method to tackle a biased environment by using human strategies and a goal map. Section IV-D describes MTS in more detail.
In summary, the paper includes the following contributions:
-  We developed an environment named Multiple Tank Defence that serves as a testbed to evaluate the performance of deep RL methods in multi-agent settings. Moreover, Multiple Tank Defence is used in a similar way to the Arcade Learning Environment. As a result, any deep RL method can be applied to Multiple Tank Defence without changing existing parameters.
-  The goal map can be utilized in different practical scenarios, such as aiding the cooperation between multiple agents or controlling agent behaviors at run time without requiring extensive human feedback.
This paper includes the following sections. The next two sections summarize the related work and introduce the gameplay of Multiple Tank Defence. Section IV introduces the goal map and its implementation by using MTS. Section V presents the performance evaluation of the proposed method in Multiple Tank Defence. Finally, Section VI concludes our work.
II Related Work
Since 2015, there have been a number of improvements over DQN, such as the double Q-network, the dueling network, and prioritized experience replay. However, these extensions require a large amount of resources to train agents in terms of training time and memory allocation. In 2016, Mnih et al. introduced a light-weight approach based on the actor-critic architecture, namely asynchronous advantage actor-critic (A3C), which trains multiple agent-environment instances at the same time. This method reduces training time and memory usage and has become a state-of-the-art method in the Atari domain. As a result, we use the A3C algorithm as the baseline in this paper.
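To make the "advantage" in A3C concrete, the sketch below computes discounted n-step returns and the corresponding advantage estimates for one rollout segment collected by a single worker. It is a generic illustration of the advantage actor-critic idea, not the authors' code; the function name, argument layout, and discount factor are assumptions.

```python
def n_step_advantages(rewards, values, bootstrap_value, gamma=0.99):
    """Return (returns, advantages) for one rollout segment.

    rewards[t] and values[t] are the reward and the critic's value estimate
    at step t. bootstrap_value is the critic's estimate for the state after
    the last step, or 0.0 if the episode terminated there.
    """
    returns = []
    running = bootstrap_value
    # Accumulate discounted returns backwards from the end of the segment.
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    # The advantage is how much better the return was than the critic expected.
    advantages = [g - v for g, v in zip(returns, values)]
    return returns, advantages
```

In an asynchronous setting, each worker would compute such advantages on its own environment copy and then apply the resulting gradients to the shared network.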
Alongside the rapid increase in the complexity of applications, there has been a rising demand to provide human feedback in the agent training process. Christiano et al. reshape the reward signal by injecting human feedback during the training process. However, this approach requires an expert to observe hundreds to thousands of video clips of the agent. Our proposed method is more feasible, as we create a goal map that integrates human strategies before the training process. Moreover, the goal map is easily created by using an essential localization method.
Another challenge in RL is the navigation of environments with sparse rewards. In such cases, agents easily become stuck in local minimum solutions due to insufficient exploration. One solution is to split a complicated task into hierarchical subtasks in which a parent subtask has a higher level of abstraction than its successors. By combining hierarchical subtasks with the agent’s intrinsically motivated rewards, Kulkarni et al. successfully instructed an agent to accomplish its goal in the Atari game Montezuma’s Revenge. However, this approach requires training two policies simultaneously: one that estimates the goal-value function and one that estimates the action-value function. By adopting temporal abstraction with the target map, our proposed scheme does not require training an additional policy network over goals and enables smooth cooperation among agents and also between humans and agents. Moreover, as a target map is independent of policy networks and deep RL algorithms, our approach is more robust and can deal with a broader range of applications.
Finally, our proposed method is used to coordinate multiple agents in multi-agent systems. Previous studies [21, 22] train agents to communicate over a shared, limited-bandwidth channel. However, these methods limit the number of agents that can be trained. Recently, Gupta et al. scaled cooperation to a large number of agents by using a parameter-sharing network. This approach applies only to homogeneous systems. In this paper, we introduce a goal map that can be used to train agents of different types. Section IV-D discusses this issue in greater detail.
III The Multiple Tank Defence Environment
In this section, we introduce the gameplay of Multiple Tank Defence, as illustrated in Fig. 2. The game includes one or two player agents and five enemy tanks. If an enemy tank is destroyed, another enemy appears at a random location on the battlefield. This ensures the stochastic behavior of the game. The game is over if the base is destroyed or no players are left. Finally, a reward of 10 is given for destroying an enemy tank.
There are different terrains in the game, such as a pond tile, a soft wall tile, and a hard wall tile. A pond tile blocks a tank from moving, although a bullet can pass over it. A soft wall can be destroyed by a bullet, but a hard wall cannot be demolished. Different terrains can be constructed in different locations on the battlefield to enable a complicated strategy to protect the base. Multiple Tank Defence is a biased problem, as the agents greedily destroy the enemies to accumulate rewards without protecting the base. As a result, the game ends earlier.
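The game rules above can be summarized in a small sketch. The names (`Tile`, `tank_can_enter`, `bullet_step`, `is_game_over`) are hypothetical and do not come from the actual game implementation. Two behaviors are assumptions rather than statements from the text: that walls block tanks in the same way ponds do, and that a bullet is consumed when it destroys a soft wall.

```python
from enum import Enum

ENEMY_REWARD = 10  # reward for destroying one enemy tank, as stated in the text


class Tile(Enum):
    EMPTY = "empty"
    POND = "pond"
    SOFT_WALL = "soft_wall"
    HARD_WALL = "hard_wall"


def tank_can_enter(tile):
    # Ponds block tanks (from the text); walls blocking tanks is an assumption.
    return tile is Tile.EMPTY


def bullet_step(tile):
    """Return (bullet_continues, tile_after_impact) for a bullet reaching a tile."""
    if tile in (Tile.EMPTY, Tile.POND):
        return True, tile              # bullets pass over ponds
    if tile is Tile.SOFT_WALL:
        return False, Tile.EMPTY       # soft wall is destroyed (bullet consumed: assumption)
    return False, Tile.HARD_WALL       # hard walls cannot be demolished


def is_game_over(base_destroyed, players_alive):
    # The game ends when the base is destroyed or no players are left.
    return base_destroyed or players_alive == 0
```

Since the only reward in this sketch comes from `ENEMY_REWARD`, nothing in the reward signal itself encourages protecting the base, which is the bias the text describes.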
Moreover, Multiple Tank Defence includes the following features:
-  Multiple Tank Defence can have an unlimited number of stages, obtained by modifying the number of enemies, the number of players, and the layout of terrains on the battlefield.
-  The game has two different goals: one goal is to destroy enemies to accumulate rewards, and the other is to protect the base to prolong the game. Therefore, the game requires a complicated strategy to achieve a high score.
-  The game supports a human mode, i.e., a human can play with an AI agent to cooperatively protect the base. This requires the agent to follow a designated strategy.
IV Proposed Scheme
In this section, we introduce the concept of a goal map. We then describe the implementation detail of a goal map by using masks. Finally, we propose a network architecture that is used to train agents to understand the goal map.
IV-A Goal Map
A goal map is an abstract graphical representation that is used to present the human strategies while analyzing the problem. Fig. 3b illustrates an example of a goal map in Multiple Tank Defence. In the goal map, we define a set of targets for each agent. Specifically, the yellow agent’s targets are all enemies inside the yellow region and the green agent’s target is the enemy that is closest to the base.
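The two target rules in this example can be sketched as simple selection functions. This is only an illustration: the function names are hypothetical, the yellow region is assumed to be an axis-aligned rectangle, positions are assumed to be (x, y) pairs, and "closest" is assumed to mean Euclidean distance.

```python
import math


def targets_in_region(enemies, region):
    """Enemies inside a rectangular region given as (x_min, y_min, x_max, y_max)."""
    x_min, y_min, x_max, y_max = region
    return [e for e in enemies if x_min <= e[0] <= x_max and y_min <= e[1] <= y_max]


def closest_to_base(enemies, base):
    """The single enemy nearest to the base, or None if there are no enemies."""
    if not enemies:
        return None
    return min(enemies, key=lambda e: math.hypot(e[0] - base[0], e[1] - base[1]))


# Example: the yellow agent targets every enemy in its region, while the
# green agent targets only the enemy that currently threatens the base most.
enemies = [(1, 1), (5, 5), (9, 2)]
yellow_targets = targets_in_region(enemies, (0, 0, 4, 4))
green_target = closest_to_base(enemies, base=(8, 0))
```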
IV-B Implementation Details
To efficiently implement the goal map, we introduce the concept of a goal mask for each agent. The goal mask is a black image that contains a set of targets. Each target is represented by a white square, as shown in Fig. 3c and Fig. 3d.
Specifically, the goal masks have the same dimensions as the observed graphical state of the environment. The mask is then converted into grayscale and it is rescaled by a factor . Other information about the targets, such as working region, priority target, reward information, or target location, is stored in a metadata structure. These metadata are retrieved later in the training process to instruct agents to achieve the targets defined in the goal map.
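A minimal sketch of the goal mask construction described above is given below, using plain Python lists so that it is self-contained. The `Target` record and `make_goal_mask` function are hypothetical names, the metadata fields are a guess at a reasonable layout, and the mask is generated directly as a single-channel image. The rescaling step is omitted because the factor is not given here.

```python
from typing import NamedTuple


class Target(NamedTuple):
    """Hypothetical metadata record for one target in the goal map."""
    x: int               # left edge of the white square, in pixels
    y: int               # top edge of the white square, in pixels
    size: int            # side length of the square
    priority: int = 0    # assumed field: priority of this target
    reward: float = 0.0  # assumed field: reward for achieving this target


def make_goal_mask(height, width, targets):
    """Black image (0) of the state's dimensions with a white square (255) per target."""
    mask = [[0] * width for _ in range(height)]
    for t in targets:
        # Clip each square to the image bounds.
        for row in range(max(0, t.y), min(height, t.y + t.size)):
            for col in range(max(0, t.x), min(width, t.x + t.size)):
                mask[row][col] = 255
    return mask
```

Because the mask is regenerated from the target list at every step, changing a target's position or removing it changes the agent's input without touching the trained network.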
IV-C Training Process
In the training process, the key point is to create a separate policy network for each agent or for each set of agents that share the same strategy. Specifically, we create two policy networks in Multiple Tank Defence to represent the goal map described in the previous subsection. Fig. 4 illustrates a policy network for the yellow region. Initially, we generate the goal mask and the grayscale version of the original state. These data are fed into two separate convolutional networks. The outputs of the two networks are combined and then passed to the subsequent layers of the A3C network. Once the agent is trained to understand the goal map, we can modify the targets in the goal map to change the behaviors of the agent in real time without conducting a new training process.
IV-D Multi-Target System
We put the concept of a goal map into a high-level system architecture, i.e. MTS, as shown in Fig. 5. In particular, MTS retrieves human strategies and creates a corresponding goal map. The goal map is used with the observed state to generate the goal masks. Reward signals, target locations, and target boundary regions are stored in a metadata structure. These data are used together with the goal masks to train different policy networks. Each policy network is used to train a set of agents which have the same targets.
MTS can be applied in a variety of applications, especially in heterogeneous multi-agent systems. For each type of agent, we create a separate policy network. MTS then schedules the agents’ activities to cooperatively accomplish a designated task.
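The idea of one policy network per agent type can be sketched as a simple dispatch table. The function names and the representation of a policy as a callable taking an observation and a goal mask are assumptions for illustration; the actual scheduling logic of MTS is not specified at this level of detail.

```python
from collections import defaultdict


def group_agents_by_policy(agents):
    """Group agent ids by type; agents of the same type share one policy network.

    agents: list of (agent_id, agent_type) pairs.
    """
    groups = defaultdict(list)
    for agent_id, agent_type in agents:
        groups[agent_type].append(agent_id)
    return dict(groups)


def act_all(agents, policies, observations, goal_masks):
    """Query the policy of each agent's type with that agent's own inputs.

    policies: mapping from agent_type to a callable (observation, goal_mask) -> action.
    """
    return {
        agent_id: policies[agent_type](observations[agent_id], goal_masks[agent_id])
        for agent_id, agent_type in agents
    }
```

Keeping the mapping from agent type to policy outside the networks themselves is what allows agents of different types to coexist in one system.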
V Performance Evaluation
V-A Experimental Settings
In this section, we present the simulation using Multiple Tank Defence to evaluate the proposed scheme in two-player mode. As explained earlier, we use A3C as the baseline method. The A3C network parameters are kept the same as in , except for the following changes. Each A3C algorithm is run on an 8-core CPU. The learning rate is 0.004. The convolutional network includes two layers: one layer has 16 filters of size 8×8 with a stride of 4, and the other has 32 filters of size 4×4 with a stride of 2. Finally, an action repeat of 5 is used while processing input.
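For orientation, the following sketch works out the feature sizes implied by the two convolutional layers, reading the filter sizes as 8×8 and 4×4. Several points are assumptions rather than statements from the text: an 84×84 input, no padding, both input streams (state and goal mask) using the same two layers, and the two streams being combined by concatenating their flattened outputs.

```python
def conv_output_size(input_size, kernel, stride):
    """Spatial output size of a convolution without padding."""
    return (input_size - kernel) // stride + 1


def stream_feature_count(input_size, layers):
    """Number of flattened features after a stack of (filters, kernel, stride) layers."""
    size = input_size
    channels = None
    for filters, kernel, stride in layers:
        size = conv_output_size(size, kernel, stride)
        channels = filters
    return channels * size * size


LAYERS = [(16, 8, 4), (32, 4, 2)]  # (filters, kernel, stride) as listed in the text
INPUT_SIZE = 84                    # assumption: square 84x84 grayscale input

per_stream = stream_feature_count(INPUT_SIZE, LAYERS)  # 32 * 9 * 9
combined = 2 * per_stream                              # state stream + goal mask stream
```

Under these assumptions each stream yields 2592 features, so the layers after the merge would receive 5184 inputs.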
Each A3C variant is trained for 10 million steps. This takes 2 training days for the A3C baseline and 3 training days for A3C with a goal map. We evaluate the policy network at 20 different milestones during the training process. At each milestone, we measure the mean total reward and the mean total number of steps per episode by running the policy network for 50,000 steps. Finally, we include two human reference levels: a novice level and a competent level. The novice level is measured by letting a human play the game for 50,000 steps, and the competent level is measured by playing the game for up to 500,000 steps.
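The evaluation protocol can be sketched as follows. The even spacing of milestones is an assumption, since the text does not state where the 20 milestones fall, and the function names are hypothetical.

```python
def milestone_steps(total_steps, num_milestones):
    """Training step counts at which the policy is evaluated (assumed evenly spaced)."""
    interval = total_steps // num_milestones
    return [interval * (i + 1) for i in range(num_milestones)]


def summarize_episodes(episodes):
    """Mean total reward and mean number of steps per episode.

    episodes: list of (total_reward, num_steps) pairs collected during one
    evaluation run of the policy network.
    """
    count = len(episodes)
    mean_reward = sum(reward for reward, _ in episodes) / count
    mean_steps = sum(steps for _, steps in episodes) / count
    return mean_reward, mean_steps


milestones = milestone_steps(10_000_000, 20)
```

With 10 million training steps and 20 milestones, this spacing gives one evaluation every 500,000 steps.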
V-B A3C with Goal Map
We compare the performance of two schemes: A3C and A3C with the goal map. Fig. 6 shows that the A3C method with the goal map performs 200% better than the baseline. It also surpasses the human competent level. In the A3C baseline, the two players do not cooperate, as they greedily destroy enemies rather than protecting the base. In contrast, the goal map helps the two agents cooperate to protect the base. This strategy prolongs the base’s lifetime and hence obtains a high score in the long run.
Finally, we use MTS to modify agent behaviors and derive the following variants:
-  In the single-player mode, we change the targets of the yellow region in real time based on the current situation and human experience.
-  In the multi-player mode, we train the yellow agent using the goal map, and a human plays the role of the green player.
-  We modify the metadata structure to narrow the working region of the two players so that the yellow player focuses on shooting only on the right side and the green player focuses on shooting only on the left side. This scheme achieves the highest score among the three variants.
Table I summarizes the mean total reward and the mean total number of steps per episode for the three variants mentioned here.
|Scheme name||Mean reward||Mean steps/episode|
|A3C with a goal map (single-…||149||152|
|A3C with a … and a human||301||234|
|A3C with a goal map + …||363||295|
VI Conclusion
This paper proposed the novel concept of the goal map and the high-level system architecture MTS. By using MTS, agents’ behaviors can be modified in real time without conducting a new training process. Moreover, agents can achieve an expected solution without the burden of extensive human feedback. As a result, agents do not become stuck in local minimum solutions and act in a more human-like way in biased environments such as Multiple Tank Defence. Finally, the use of the goal map and MTS eases the cooperation between agents, or between humans and agents, in heterogeneous systems. Therefore, the study provides a promising framework for building human-like agents in large-scale systems with deep RL.
-  R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. The MIT Press, Cambridge, MA, 1998.
-  R. Veerabhadrappa, A. Bhatti, C. P. Lim, T. T. Nguyen, S. J. Tye, P. Monaghan, and S. Nahavandi. Statistical modelling of artificial neural network for sorting temporally synchronous spikes. In International Conference on Neural Information Processing, pages 261–272, 2015.
-  N. D. Nguyen, T. Nguyen, and S. Nahavandi. System design perspective for human-level agents using deep reinforcement learning: a survey. IEEE Access, 5:27091–27102, 2017.
-  J. N. Tsitsiklis and B. Van Roy. An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42:674–690, 1997.
-  V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
-  T. T. Nguyen. A multi-objective deep reinforcement learning framework. arXiv preprint arXiv:1803.02965, 2018.
-  H. V. Hasselt. Double q-learning. In Advances in Neural Information Processing Systems, pages 2613–2621, 2010.
-  T. Schaul, J. Quan, I. Antonoglou, and D. Silver. Prioritized experience replay. In International Conference on Learning Representations, 2016.
-  V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928–1937, 2016.
-  T. Salimans, J. Ho, X. Chen, and I. Sutskever. Evolution strategies as a scalable alternative to reinforcement learning. arXiv preprint arXiv:1703.03864, 2017.
-  M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.
-  Z. Wang, T. Schaul, M. Hessel, H. Hasselt, M. Lanctot, and N. Freitas. Dueling network architectures for deep reinforcement learning. In International Conference on Machine Learning, pages 1995–2003, 2016.
-  V. R. Konda and J. N. Tsitsiklis. Actor-critic algorithms. In Advances in Neural Information Processing Systems, pages 1008–1014, 2000.
-  N. D. Nguyen, S. Nahavandi, and T. Nguyen. A human mixed strategy approach to deep reinforcement learning. arXiv preprint arXiv:1804.01874, 2018.
-  P. F. Christiano, J. Leike, T. Brown, M. Martic, S. Legg, and D. Amodei. Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems, pages 4302–4310, 2017.
-  B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Learning deep features for discriminative localization. In Computer Vision and Pattern Recognition, pages 2921–2929, 2016.
-  T. G. Dietterich. Hierarchical reinforcement learning with the maxq value function decomposition. Journal of Artificial Intelligence Research, 13:227–303, 2000.
-  N. Chentanez, A. G. Barto, and S. P. Singh. Intrinsically motivated reinforcement learning. In Advances in Neural Information Processing Systems, pages 1281–1288, 2005.
-  T. D. Kulkarni, K. R. Narasimhan, A. Saeedi, and J. B. Tenenbaum. Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. In Advances in Neural Information Processing Systems, pages 3675–3683, 2016.
-  J. Foerster, Y. M. Assael, N. de Freitas, and S. Whiteson. Learning to communicate with deep multi-agent reinforcement learning. In Advances in Neural Information Processing Systems, pages 2137–2145, 2016.
-  S. Sukhbaatar and R. Fergus. Learning multiagent communication with backpropagation. In Advances in Neural Information Processing Systems, pages 2244–2252, 2016.
-  A. Khatami, M. Babaie, H. R. Tizhoosh, A. Khosravi, T. Nguyen, and S. Nahavandi. A sequential search-space shrinking using CNN transfer learning and a radon projection pool for medical image retrieval. Expert Systems with Applications, 100:224–233, 2018.
-  J. K. Gupta, M. Egorov, and M. Kochenderfer. Cooperative multi-agent control using deep reinforcement learning. In International Conference on Autonomous Agents and Multiagent Systems, pages 66–83, 2017.
-  C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Computer Vision and Pattern Recognition, pages 1–9, 2015.