I. Introduction
With the rapid development of wireless communications and artificial intelligence (AI), it is an inevitable trend for wireless networks to become intelligent, i.e., equipped with self-organization, self-configuration, and self-recovery abilities so as to accomplish missions that are tough for human beings. For example, wireless sensor networks (WSNs) are deployed to sense target areas that are usually hard for human beings to access. Due to limited power supply, it is essential that these networks organize themselves autonomously and make decisions intelligently at the node level. The same goes for flying ad hoc networks (FANETs), i.e., networks of multiple unmanned aerial vehicles (UAVs) organized in an ad hoc way. By carrying Internet of Things (IoT) devices, a FANET can accomplish many remarkable missions via the cooperation and coordination of the UAVs. Due to the large number of nodes and the vast spatial span of a FANET, making decisions autonomously at each UAV is undoubtedly the final answer for FANETs.
In the upcoming intelligent communication era, every device node is expected to be an agent, i.e., able to perceive the environment and autonomously make decisions toward one or several goals. For agents surrounded by a complicated and changeable wireless communication environment, the stochastically changing channels (e.g., the air-to-ground channel in UAV communications) and the dynamic changes of the accessible spectrum resources (e.g., cognitive radio in IoT) degrade the performance of traditional optimization algorithms, whose convergence cannot even be guaranteed. Fortunately, learning in AI (in particular reinforcement learning) is a powerful tool to obtain high payoffs in an unknown environment. Specifically, by learning, an agent explores the environment and exploits the feedback information to perform online learning and reasoning, and then makes an intelligent choice to cope with the unfriendly wireless environment. Nevertheless, there still exist several challenges for intelligent decision-making in wireless networks:

The individual decision-making by each agent may make network coordination difficult and reduce the system performance.

The network may be disordered or even corrupted in an adversarial context when malicious agents (e.g., jammers) appear.
Therefore, investigating the intelligent decision-making problem in multi-agent systems is essential and of practical significance for intelligent wireless networks.
Game theory is an extensively used mathematical tool to study and model the interactions of a group of decision makers, and has been widely applied in wireless communications [1]. Using game theory, we can analyse the impact of the decision-making interactions of the internal agents in the system, and also study the adversarial process against external hostile agents so as to realize a stable and optimal system. In the meantime, adopting AI makes it possible to deal with the complicated and changeable wireless environment and achieve high payoffs. The combination of these two great techniques is a promising direction for developing intelligent wireless networks.
There are various application cases of game theory or learning in AI for wireless networks, e.g., [1], [2]. However, there is a lack of a systematic framework and discussion when game theory and learning in AI are applied to intelligent wireless networks. In this paper, we first introduce the fundamental connections between game theory and learning in AI, and discuss the technical challenges and requirements of applying them to intelligent wireless networks. Then, a two-layer game-theoretic learning framework is introduced. Different from previous works, the framework considers not only the internal coordination, but also the external adversary. Last but not least, several game-theoretic learning methods and their applications are introduced, and a real-life testbed based on the framework is developed.
The rest of this article is organized as follows. We first discuss the connections, challenges and requirements of combining learning in AI with game theory. We then propose the game-theoretic learning framework for intelligent wireless networks, discuss its practical issues, and introduce typical applications of several game-theoretic learning methods. Next, we show a case study of the framework based on the experiment results of a real-life testbed. Finally, we discuss future research directions and conclude the article.
II. Connections Between Learning in AI and Game Theory, and Their Challenges and Requirements in Intelligent Wireless Networks
Intelligent wireless networks are expected to have learning and coordinating abilities so as to provide high quality-of-experience (QoE) service autonomously. However, several technical challenges and requirements arise when combining the techniques of learning in AI and game theory in wireless networks. In this section, as Fig. 1 shows, the connections between learning in AI and game theory are first introduced, and then the challenges and requirements are discussed.
II-A. Connections
Game theory and AI, different subjects notwithstanding, have fundamental connections. Learning is an essential part of both game theory and AI. Learning in AI can be roughly divided into three levels: pattern recognition, prediction and decision making. Herein, we mainly discuss the connections between game theory and AI at these three levels.
II-A1. Pattern Recognition
The main focus of pattern recognition is to capture the patterns of the input data so that their regularities are recognized and the data can be further processed, e.g., classified. A remarkable application of this first level of learning in AI together with game theory is generative adversarial networks (GAN) [3], which consist of a generative network G that captures the regularities of the data and generates mimic data, and a discriminative network D that strives to differentiate the mimic data from the real data. These two networks are involved in an adversarial process in which D tries to maximize the detection probability while G's goal is to minimize it, which pertains to a minimax game. The equilibrium of the game is reached when D fails to make a distinction, i.e., the detection probability is 1/2.

II-A2. Prediction
At the second level of learning in AI, it is easy to understand that once we recognize the patterns of the data, we can make predictions. However, there is another important prediction method which has close connections with game theory. Since the predictor interacts with the environment that generates the data, the prediction process between predictor and environment can be modeled as a repeated game, which greatly facilitates the analysis of prediction [4].
II-A3. Decision Making
The last, and most relevant, connection between game theory and learning in AI lies at the decision-making level. At this level, first of all, both game theory and AI focus on how the players/agents (in wireless communications, the players in game theory can be terminal equipment, base stations, routers, etc.; in this article, we define an agent as the equipment that serves users, and use agent, player, node and equipment interchangeably) deal with the complex context and make decisions toward it so as to optimize the system performance/payoff. Second, learning is the fundamental process for both game theory and AI to achieve their final goals. In game theory, players optimize their utilities and find the equilibrium by learning the other players' strategies. Learning in game theory can predict the behaviour of the players and the outcome of the game. It emphasizes explaining the existence of equilibrium at the theoretical analysis level and guiding the players toward the equilibrium, whereas learning in AI strives to find the optimal strategies at the practical operation level. From this perspective, combining AI with game theory and designing effective multi-agent intelligent decision-making schemes can not only overcome the complicated and dynamic context, but also coordinate the strategies between agents.
The first two levels have been widely studied in AI. However, at the decision-making level, there are still a lot of problems that need to be investigated in wireless networks. Therefore, in this paper, we mainly focus on the decision-making level of learning in AI.
II-B. Challenges
Nevertheless, due to the specific characteristics of wireless networks, many challenges arise when the combined techniques are applied to wireless communications. The challenges are summarized as follows.

Constrained. Different from the normal agents in AI, which are equipped with powerful capabilities, the agents in wireless networks usually have limited resources such as computation ability and energy. Therefore, these resource constraints are supposed to be taken into consideration when designing the utility functions and learning schemes.

Incomplete. Information is essential for learning to make intelligent decisions. But due to hardware limitations, the information obtained from the wireless environment may be incomplete, which makes it hard to guarantee the convergence, robustness and optimality of the algorithms.

Distributed. Many existing learning algorithms need information exchange between agents or the coordination of a central controller. However, owing to deep-fading effects and/or restrained transmission power, the direct communication links in the network may fail, in which case the nodes may have to ask other nodes to relay the information or increase their transmission power. Obviously, that is inefficient and impractical.

Dynamically scalable. The participants of the game are dynamic. For example, in a dynamic spectrum access scenario, those who have data to send are involved in the competition for accessing channels and leave the game after they finish. Besides, as the network size scales up, analysing the interaction of large-scale agents becomes challenging.

Heterogeneous. Most networks are constituted by a group of heterogeneous nodes produced by different enterprises. For example, depending on the mission, a FANET can be formed of diverse types of UAV, such as fixed-wing and rotary-wing, so as to meet different requirements. As a result, there may exist several different types of intelligent decision-making algorithms. How to design a mechanism that guarantees the convergence of those algorithms in the network is challenging.
II-C. Requirements
In a multi-agent system, we are unable to design every decision made by the agents during their interactions, but there must be rules and motivation to govern the agents' actions. The core of the intelligent network is to serve human beings, and by "intelligent" we mean serving people more intelligently rather than simply improving the quality of service (QoS), such as increasing the data rate or decreasing the transmission delay. Therefore, the design of a game-theoretic intelligent network is also required to satisfy some requirements.
First and foremost, the QoE is the main optimization objective of the intelligent network, which means the agents are supposed to be context-aware. In particular, the individual features of users need to be considered when designing the agents' algorithms. At the user level, the context includes the type of equipment (e.g., high-definition laptops and phones have different requirements for video definition), residual energy, location, users' demands and preferences, and the content of the service (e.g., online movies and real-time video calls have different delay requirements). At the network level, the context contains the priority of the service and the state of the spectrum and the network. As long as the network can provide personalized service, the QoE of users is enhanced.
Another important issue for the intelligent network is robustness. For instance, in an ad hoc network, some nodes may run out of energy or suddenly fail, so a fast self-recovery mechanism is necessary. More seriously, if there are malicious agents such as jammers, the network has to survive in the adversarial context and provide at least the minimum guaranteed service.
III. Game-Theoretic Learning Framework for Intelligent Wireless Networks
In this section, we propose a two-layer framework for intelligent networks, as shown in Fig. 2. The first layer performs problem analysis and game modeling; based on the first layer, the game-theoretic learning method is designed in the second layer.
III-A. Problem Analysis and Game Modeling
The first layer can be divided into two steps, i.e., problem analysis and game modeling. In Fig. 1, we take the FANET scenario as an example, where the network is clustered according to location. Note that, besides FANETs, many networks are of this kind, such as device-to-device networks, WSNs and so on. Internally, since the same frequency is reused in each cluster, the mutual interference must be considered. Externally, the UAVs must access jamming-free channels or increase their transmit power due to the existence of jammers. Therefore, the problem can be modeled as dynamic spectrum access, power control and anti-jamming.
With the help of problem analysis, the powerful mathematical tool of game theory is adopted to analyse and model the multi-agent system. Considering the distributed characteristic of the self-organized wireless network, in this paper we model the multi-agent system as a non-cooperative game. The model can be expressed as G = {N, {A_n}, {u_n}}, where N is the set of participant agents (note that the participant set is dynamic, as mentioned in the challenge of dynamic scalability, i.e., the number of participants is variable), A_n is the available action set of agent n (e.g., available channels, transmit power, transmit duration, etc.), and u_n denotes the utility function of agent n. For an agent, the utility function is the evaluation of its current decision. The goal of each agent is to maximize u_n by adjusting its decisions. However, in the network, users are mostly self-interested and aim to maximize their own utilities, which may result in low system performance. There are two reasons for adopting game theory, as shown in Fig. 1:

Distributed coordination: As mentioned in Section II-C, the rules and motivation of the game are vital; the dilemma of the self-interested game can be solved by coordinating the behaviors of the agents. Specifically, inside the network, if the utility function of each agent considers not only the payoff it can get, but also the impact of its decision on other agents, then the coordination of the network can be realized spontaneously and distributedly.

Adversarial decision-making: Game theory can also analyse and model the adversarial relationships among agents. Assume malicious users exist outside the network. The layer of legitimate users makes the best response against the jammer, aiming at maximizing the signal-to-interference-plus-noise ratio (SINR), while the layer of jammers makes the best response against the legitimate users, aiming at minimizing the SINR of the legitimate users. This kind of adversarial process can be modeled as a Stackelberg game [5], which will be introduced in Section IV-B.
As is well known, the Nash equilibrium (NE) is the stable outcome of non-cooperative games. By analysing the NE solutions, we can predict the behaviours of the agents and the performance of the game. However, not all games have an NE. Fortunately, this can be handled by designing the utility function, which determines all the properties of the game model. Due to its good properties, such as guaranteeing the existence of at least one NE, the potential game is extensively used in wireless communications [6]. The condition for a game to be a potential game is that a potential function exists which has the same variation trend as the utility function when an arbitrary agent unilaterally changes its action. Another promising property of the potential game is that all of the NEs are local or global optima of the potential function. This means that if the potential function is designed as the optimization objective of the network, the NEs of the potential game are local or global optimal solutions for the network.
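The exact potential game condition described above can be checked numerically. Below is a minimal sketch (not from the article) using an illustrative interference game in which each agent's utility is the negative number of co-channel neighbours, while the candidate potential counts co-channel pairs:

```python
import itertools

# A small numerical check of the exact potential game condition: whenever one
# agent unilaterally changes its channel, the change in its utility equals
# the change in the potential function. The utilities are illustrative (each
# agent pays one unit per co-channel neighbour), not taken from the article.

def utility(i, profile):
    """Agent i's utility: negative count of other agents on its channel."""
    return -sum(1 for j, c in enumerate(profile) if j != i and c == profile[i])

def potential(profile):
    """Candidate exact potential: negative count of co-channel pairs."""
    n = len(profile)
    return -sum(1 for i in range(n) for j in range(i + 1, n)
                if profile[i] == profile[j])

def is_exact_potential(n_agents, channels):
    """Verify the condition over every profile and unilateral deviation."""
    for profile in itertools.product(channels, repeat=n_agents):
        for i in range(n_agents):
            for alt in channels:
                dev = profile[:i] + (alt,) + profile[i + 1:]
                if (utility(i, dev) - utility(i, profile)
                        != potential(dev) - potential(profile)):
                    return False
    return True
```

Since only the co-channel pairs involving the deviating agent change, the check passes for any number of agents and channels, so every NE of this toy game is a local or global optimum of the potential.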
III-B. Game-Theoretic Learning Methods
The second layer of the game-theoretic learning framework tells the agents how to behave in the complicated network. The general game-theoretic learning model can be expressed as

a_n(k+1) = F_n(a_n(k), r_n(k)),    (1)

where a_n(k) and a_{-n}(k) denote the action of agent n and the action profile of all the agents except agent n at the k-th decision-making period, r_n(k) represents the reward of agent n, which is related to a_n(k) and a_{-n}(k), and F_n is the updating function of agent n's strategy. That is to say, the decision at the (k+1)-th period is adjusted according to the action and the received reward at the k-th period. This kind of online learning can overcome the difficulties of a dynamic, unknown environment. Although there are various kinds of learning methods in AI, several practical issues must be considered:

The learning algorithms must guarantee that the network converges to an NE. Moreover, there may exist more than one NE point, so it is best for the algorithms to achieve the optimal NE.

Different from original game theory, where a player adjusts its action by observing the others' decisions, an agent in a wireless network can hardly be informed of all the others' actions, as discussed in Section II-B. Therefore, learning methods that demand as little information about others as possible are practical for the intelligent wireless network.

In adversarial situations, both the legitimate agents and the malicious agents use intelligent techniques to compete with each other. The reward may vary across decision-making periods, which may create trouble for the convergence of the algorithms.
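The decision-reward-update loop in (1) can be made concrete with a runnable skeleton. The update rule below is a deliberately simple "win-stay, lose-shift" placeholder chosen only to show the loop structure, and the static per-action reward table is a hypothetical stand-in for the utility; the methods introduced next substitute more sophisticated updates:

```python
import random

# Skeleton of the update rule a_n(k+1) = F_n(a_n(k), r_n(k)): the agent
# observes only its own action and reward, never the other agents' profile.
# F_n here is a toy "win-stay, lose-shift" rule; the reward table is a
# hypothetical stand-in for u_n(a_n, a_-n).

def win_stay_lose_shift(action, reward, actions, threshold, rng):
    """F_n: keep the action if its reward is acceptable, else try another."""
    if reward >= threshold:
        return action
    return rng.choice([a for a in actions if a != action])

rng = random.Random(0)
actions = [0, 1, 2]
reward_of = {0: 0.1, 1: 0.2, 2: 0.9}   # hypothetical static rewards

a = rng.choice(actions)
for k in range(50):
    r = reward_of[a]                                   # decision -> reward
    a = win_stay_lose_shift(a, r, actions, 0.5, rng)   # reward -> update

# With static rewards, the rule settles on the only acceptable action (2).
```

Only the agent's own action and reward enter the update, which is exactly the uncoupled property demanded by the second practical issue above.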
Based on the above discussion, in the following we introduce several game-theoretic learning methods and their applications in wireless networks.
IV. Applications of Game-Theoretic Learning Methods
When combining learning in AI with game theory, it is challenging to theoretically prove that the algorithms converge to an NE, because the proof approach varies across applications. In this section, we introduce some game-theoretic learning methods whose attractive properties have been both theoretically and numerically proven. Note that all the learning methods introduced in this paper belong to the category of reinforcement learning, i.e., the decision-reward-update process shown in Fig. 3.
IV-A. Distributed Coordination
In this subsection, we introduce three uncoupled learning methods (requiring no information about the other agents) which have good NE properties in games.
IV-A1. Log-Linear Learning
The main idea of the log-linear learning (LLL) algorithm is the Boltzmann-Gibbs strategy [7], i.e., choosing high-utility strategies with high probability and low-utility strategies with low probability. During each iteration, LLL selects one agent to randomly explore its actions and update its decision-making probabilities based on the payoff achieved during the exploration. This learning algorithm is proved to converge to the globally optimal strategy. In [8], we consider a resource allocation problem in a multi-cell scenario. By setting the base stations and the joint power and channel allocations as the players and actions, respectively, a QoE-oriented game considering fairness and users' requirements is proposed. Note that this problem is a combinatorial optimization problem and NP-hard. Therefore, we resort to the LLL algorithm to obtain the globally optimal solution, which is much better than that of the QoS-based algorithm.

IV-A2. Stochastic Learning Automata
The above algorithm is asynchronous, i.e., only one agent learns during each iteration, so its convergence speed is slow and it is inapplicable to fast-changing environments. Hence, we introduce a synchronous and uncoupled algorithm, i.e., the stochastic learning automata (SLA) algorithm [9]. The main idea of SLA is that, based on the previous experience of actions, those leading to higher rewards tend to be repeated more frequently. Although SLA only converges to an arbitrary NE of the potential game, its synchronous and uncoupled properties are promising in practical applications. In [10], a dynamic computation offloading problem is investigated. Considering the dynamic number of active agents and the time-varying fading channels, the problem is modeled as a potential game, and the SLA algorithm is adopted to intelligently make the offloading decisions.
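A minimal sketch of the SLA idea (the linear reward-inaction update, with illustrative parameters not taken from [10]) for two agents sharing two channels might look as follows; each agent reinforces the channel it just used in proportion to its own normalized reward, with no information about the other agent:

```python
import random

# Stochastic learning automata sketch: two agents, two channels, linear
# reward-inaction update. Step size, reward and iteration count are
# illustrative.

def sample(probs, rng):
    """Draw an action index from a probability vector."""
    r, cum = rng.random(), 0.0
    for a, p in enumerate(probs):
        cum += p
        if r < cum:
            return a
    return len(probs) - 1

def sla_update(probs, action, reward, b=0.1):
    """Linear reward-inaction: shift probability mass toward the played action."""
    return [p + b * reward * ((1.0 if a == action else 0.0) - p)
            for a, p in enumerate(probs)]

rng = random.Random(1)
p1, p2 = [0.5, 0.5], [0.5, 0.5]        # mixed strategies over two channels
for _ in range(3000):
    a1, a2 = sample(p1, rng), sample(p2, rng)
    reward = 1.0 if a1 != a2 else 0.0  # interference-free transmission pays off
    p1 = sla_update(p1, a1, reward)
    p2 = sla_update(p2, a2, reward)
```

Only collision-free slots are reinforced, so both strategies drift toward a pure NE in which the agents occupy different channels, and both agents update synchronously in every slot.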
However, these learning algorithms are unable to converge in generic games. Next, we introduce a more practical algorithm called trial and error learning (TEL), which converges to an efficient NE of generic games.
IV-A3. Trial and Error Learning
TEL is an "emotional" algorithm. The state of each agent n at the k-th iteration is assumed to be z_n(k) = (m_n(k), b_n(k), w_n(k)), where b_n(k) and w_n(k) represent the benchmarks of strategy and reward, respectively, and m_n(k) denotes the mood of agent n. By personifying the agents, they are assumed to have four kinds of moods, which in descending order are content, hopeful, watchful and discontent. Through interaction with the environment and the rewards it receives, the agent updates its mood, benchmark strategy and benchmark reward [11]. Inspired by TEL, we proposed a QoE-aware game in heterogeneous wireless networks [12]. Based on the users' requirements, each user's throughput is mapped into several QoE levels, and the TEL algorithm is adopted in the QoE-aware game to achieve an efficient NE which effectively enhances the QoE.
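The mood mechanism can be illustrated with a much-simplified, hypothetical sketch: only two of the four moods (content and discontent) are modeled, and a better trial action is adopted deterministically, whereas the full algorithm in [11] also uses the hopeful and watchful moods and randomized acceptance rules:

```python
import random

# A deliberately simplified sketch of TEL's mood mechanism. Only the content
# and discontent moods are modeled; acceptance of a better trial action is
# deterministic. This is an illustration, not the full algorithm of [11].

class TELAgent:
    def __init__(self, actions, epsilon=0.2, rng=None):
        self.actions = actions
        self.epsilon = epsilon                        # probability of a trial
        self.rng = rng or random.Random(0)
        self.mood = "content"
        self.bench_action = self.rng.choice(actions)  # benchmark strategy
        self.bench_reward = None                      # benchmark reward

    def act(self):
        if self.mood == "discontent":
            return self.rng.choice(self.actions)      # random search
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.actions)      # occasional trial
        return self.bench_action                      # play the benchmark

    def update(self, action, reward):
        if self.bench_reward is None:                 # initialize benchmarks
            self.bench_action, self.bench_reward = action, reward
        elif self.mood == "content":
            if action != self.bench_action and reward > self.bench_reward:
                self.bench_action, self.bench_reward = action, reward
            elif action == self.bench_action and reward < self.bench_reward:
                self.mood = "discontent"              # payoff dropped
        elif reward >= self.bench_reward:             # discontent: resettle
            self.bench_action, self.bench_reward = action, reward
            self.mood = "content"

# Driving the agent with a static, hypothetical reward table.
rewards = {"a": 0.2, "b": 0.9}
agent = TELAgent(["a", "b"], rng=random.Random(3))
for _ in range(500):
    action = agent.act()
    agent.update(action, rewards[action])
```

Run against this static two-action environment, the benchmark settles on the high-reward action and the agent remains content.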
By now it is clear that in complex wireless networks, most of the optimization problems are NP-hard. The price of centralized algorithms is unacceptably high, while heuristic algorithms are unable to achieve the optimal solutions. Fortunately, we can turn to the learning algorithms. In conclusion, these algorithms provide the agents with cognitive computing abilities, which are summarized as follows:
Interactive. Driven by the goal of maximizing their utilities, the agents interact with the environment and/or each other, or even the adversaries, to obtain information.

Adaptive. The agents can learn the changes from the received information and respond to them adaptively and in a timely manner.

Predictive. The agents can "remember" the interaction experiences and accordingly make high-payoff decisions in time.
IV-B. Adversarial Decision-Making
If there are malicious users who try to disrupt the network, the dynamic process of competitive interaction cannot be directly modeled by the above game-theoretic learning methods. As mentioned, we can adopt the Stackelberg game for the analysis. In a Stackelberg game, there are a leader and a follower. The leader first takes an action, and the follower takes the best response against the leader. Then the leader again chooses the best action according to the follower. The competitive interaction continues until it converges to the Stackelberg equilibrium (SE), where neither the leader nor the follower can obtain a higher utility by unilaterally changing its strategy. In [5], we adopt the Stackelberg game to analyse an anti-jamming problem where a group of users and a malicious jammer exist. The game consists of two subgames: the follower subgame, in which the users adopt the SLA algorithm aiming at minimizing the impact of co-channel interference (CCI) and the jamming signal, and the leader subgame, in which the jammer uses the Q-learning algorithm trying to maximize the effect of its interference.
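As a toy illustration (not the hierarchical algorithm of [5]), the Stackelberg equilibrium of a small finite game can be computed by enumeration: the leader anticipates the follower's best response before committing. The 2x2 payoff matrices below are hypothetical, and ties in the follower's best response are broken in the leader's favor (the usual strong-Stackelberg convention):

```python
# Computing a Stackelberg equilibrium of a finite two-player game by
# enumeration. Rows index leader actions, columns index follower actions;
# the payoff matrices are hypothetical.

def stackelberg(leader_u, follower_u):
    """Return (leader action, follower action) at the Stackelberg equilibrium."""
    best = None
    for al in range(len(leader_u)):
        # Follower's best response to the leader's commitment al
        # (ties broken in the leader's favor).
        af = max(range(len(follower_u[al])),
                 key=lambda a: (follower_u[al][a], leader_u[al][a]))
        if best is None or leader_u[al][af] > leader_u[best[0]][best[1]]:
            best = (al, af)
    return best

leader_u   = [[3, 1],
              [2, 4]]
follower_u = [[1, 0],
              [0, 2]]

equilibrium = stackelberg(leader_u, follower_u)
```

Here the leader commits to its second action because it anticipates the follower will answer with its own second action, yielding the leader a payoff of 4 instead of 3; neither player gains by deviating unilaterally, which is exactly the SE property described above.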
Note that Q-learning is a classic reinforcement learning method which aims to maximize the long-term cumulative reward of a problem modeled as a Markov decision process (MDP), in which the agent sequentially makes decisions with a Markovian transition model in a stochastic environment. The learning process is as follows: the agent keeps a Q-value table which stores the evaluation of the actions under the current state. Through interaction with the environment, the agent updates the Q-value table, and the actions with higher Q-values are more likely to be selected. However, MDP is designed for the single-agent scenario. If each agent treats the other agents as part of the environment and performs individual Q-learning, the network may never converge to an equilibrium [13]. In the next section, based on the algorithm we proposed in [14] and a testbed we have developed, we choose multi-agent reinforcement learning (MARL) anti-jamming as a case study of the game-theoretic learning framework.

V. Case Study: A Testbed for Multi-Agent Reinforcement Learning Anti-Jamming
V-A. Multi-Agent Reinforcement Learning Anti-Jamming Algorithm
In a multi-agent system, we can resort to the Markov game [14] to analyse the multi-agent MDP. A Markov game can be modeled as {N, C, S, R, P}, where N is the legitimate agent set, C is the available channel set, the state S is defined as the action profile together with the channel that is currently being jammed, R is the reward function and P is the probability transition function. The reward is defined as positive if the agent chooses a channel that is occupied by neither the other agents nor the jammer; otherwise the reward is zero. In the system, the number of agents is two. As shown in Fig. 4 (a), the Q-learning algorithm is adopted by the agents to learn the jamming pattern and avoid CCI in a collaborative way. Specifically, if an agent successfully transmits a packet (confirmed by an ACK), it means the agent has chosen a channel without CCI and jamming signal, and the user gets a positive reward to update the Q-value. Then, the Q-values are exchanged between the agents to perform collaborative Q-learning. From Fig. 4 (b) we can see that the proposed algorithm achieves a better performance than the sensing-based method and avoids the non-convergence of independent Q-learning, which is a good case of the distributed coordination and adversarial decision-making of the game-theoretic learning framework.
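To make the decision-reward-update loop concrete, the sketch below trains a single Q-learning user against a sweep jammer: the state is the channel currently being jammed and the reward is 1 when the chosen channel escapes the jammer in the next slot. The parameters (three channels, step size, discount, exploration rate) are illustrative, and the collaborative two-user algorithm in [14] additionally exchanges Q-values to avoid CCI:

```python
import random

# Q-learning against a sweep jammer. State: channel jammed in the current
# slot. Action: channel chosen for the next slot. Reward: 1 if the chosen
# channel differs from the channel the jammer sweeps to next. All
# parameters are illustrative.

N_CHANNELS = 3
ALPHA, GAMMA, EPSILON = 0.5, 0.8, 0.2
rng = random.Random(0)

Q = [[0.0] * N_CHANNELS for _ in range(N_CHANNELS)]   # Q[state][action]

def greedy(row):
    return max(range(len(row)), key=row.__getitem__)

state = 0
for t in range(5000):
    # Epsilon-greedy channel selection.
    if rng.random() < EPSILON:
        action = rng.randrange(N_CHANNELS)
    else:
        action = greedy(Q[state])
    next_state = (state + 1) % N_CHANNELS             # jammer sweeps onward
    reward = 1.0 if action != next_state else 0.0     # escaped the jammer?
    # Standard Q-value update.
    Q[state][action] += ALPHA * (reward + GAMMA * max(Q[next_state])
                                 - Q[state][action])
    state = next_state
```

After training, the greedy policy avoids the channel about to be jammed in every state, i.e., the learner has internalized the periodic jamming pattern.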
V-B. A Testbed for Multi-Agent Reinforcement Learning Anti-Jamming
As Fig. 5 shows, we developed a multi-agent anti-jamming testbed based on the algorithm. The system consists of two users and one jammer, where each user is a transmitter-receiver pair. Assume there are five available channels and each transmitter tries to send a picture to its receiver. Each user is equipped with three universal software radio peripherals (USRPs), which serve as the transceiver and the wideband spectrum sensing (WBSS) equipment. The jammer USRP transmits the jamming signal in a sweeping jamming pattern, i.e., periodically jamming the channels in sequence.
Fig. 5 (b) and (c) are screenshots of the dynamic process run by the LabVIEW software. There are four windows in each receiver's interface, as shown in Fig. 5 (b): one shows the process of channel selection by user 1 and user 2, one shows the normalized throughput, one is the real-time spectrum display of the WBSS, where the two pulses are the users' signals and the middle one is the jamming signal, and one is the picture being received. As the experiment results show, the jamming signal and CCI can be effectively avoided, and the transmitted pictures can be successfully received.
VI. Future Research Directions and Conclusion
VI-A. Future Research Directions
It can be seen that the combination of game theory and learning in AI is highly promising for intelligent wireless networks, but current research still has a long way to go to achieve the expected goal. Herein, we summarize some future research directions as follows.

Traditional Q-learning is difficult to adopt under general real-world settings due to the tremendous state space, a problem known as "the curse of dimensionality". Due to its great approximation capability, deep reinforcement learning (deep Q-learning) overcomes this shortcoming and is more suitable for actual wireless networks. We have applied deep Q-learning to the single-agent anti-jamming scenario [15], as shown in Fig. 6. Based on the existing research on multi-agent Q-learning, we can study the properties of multi-agent deep Q-learning from a game-theoretic perspective.

In large-scale wireless networks, heterogeneous characteristics such as clustering and hierarchy can provide the system with effectiveness and efficiency. Therefore, it is of practical significance to investigate hybrid and hierarchical MARL that consists of several reinforcement learning algorithms in the multi-agent system. We can theoretically study the performance and convergence of hybrid and hierarchical MARL with the help of game theory.
VI-B. Conclusion
In this paper, we proposed a game-theoretic learning framework for intelligent wireless networks, which possesses the good properties of both game theory and learning in AI. First of all, we explained why the combination of game theory and learning in AI is necessary for intelligent networks. Then, the connections, challenges and requirements were discussed. To handle these problems, we introduced the proposed two-layer game-theoretic learning framework from the internal and external perspectives. Combined with our previous works, several promising game-theoretic learning methods were introduced. Moreover, a multi-agent anti-jamming testbed was developed, and the experiment results demonstrated the effectiveness of the game-theoretic learning method. Finally, some future research directions were discussed.
Acknowledgment
This work was supported by the National Natural Science Foundation of China under Grant No. 61771488, No. 61671473 and No. 61631020, in part by the Natural Science Foundation for Distinguished Young Scholars of Jiangsu Province under Grant No. BK20160034, and in part by the Open Research Foundation of Science and Technology on Communication Networks Laboratory.
References
[1] Z. Han, D. Niyato, W. Saad, T. Başar, and A. Hjørungnes, Game Theory in Wireless and Communication Networks: Theory, Models, and Applications. Cambridge University Press, 2012.
[2] J.-B. Wang, J. Wang, Y. Wu, J.-Y. Wang, H. Zhu, M. Lin, and J. Wang, "A machine learning framework for resource allocation assisted by cloud computing," IEEE Network, vol. 32, no. 2, pp. 144-151, 2018.
[3] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in International Conference on Neural Information Processing Systems, MIT Press, pp. 2672-2680.
[4] N. Cesa-Bianchi and G. Lugosi, Prediction, Learning, and Games. Cambridge University Press, 2006.
[5] F. Yao, L. Jia, Y. Sun, Y. Xu, S. Feng, and Y. Zhu, "A hierarchical learning approach to anti-jamming channel selection strategies," Wireless Networks, pp. 1-13, 2017.
[6] D. Monderer and L. S. Shapley, "Potential games," Games and Economic Behavior, vol. 14, pp. 124-143, 1996.
[7] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, 1998.
[8] J. Zheng, Y. Cai, Y. Liu, Y. Xu, B. Duan, and X. S. Shen, "Optimal power allocation and user scheduling in multicell networks: Base station cooperation using a game-theoretic approach," IEEE Trans. Wireless Commun., vol. 13, no. 12, pp. 6928-6942, 2014.
[9] P. Sastry, V. Phansalkar, and M. Thathachar, "Decentralized learning of Nash equilibria in multi-person stochastic games with incomplete information," IEEE Trans. Syst., Man, Cybern. B, vol. 24, no. 5, pp. 769-777, 1994.
[10] J. Zheng, Y. Cai, Y. Wu, and X. S. Shen, "Dynamic computation offloading for mobile cloud computing: A stochastic game-theoretic approach," IEEE Trans. Mobile Computing, 2018.
[11] B. Pradelski and H. P. Young, "Efficiency and equilibrium in trial and error learning," University of Oxford, Department of Economics, Economics Series Working Papers, no. 480, 2010.
[12] Z. Du, Q. Wu, P. Yang, Y. Xu, J. Wang, and Y.-D. Yao, "Exploiting user demand diversity in heterogeneous wireless networks," IEEE Trans. Wireless Commun., vol. 14, no. 8, pp. 4142-4155, 2015.
[13] N. Vlassis, "A concise introduction to multiagent systems and distributed artificial intelligence," Synthesis Lectures on Artificial Intelligence and Machine Learning, vol. 1, no. 1, pp. 1-71, 2007.
[14] F. Yao and L. Jia, "A collaborative multi-agent reinforcement learning anti-jamming algorithm in wireless networks," arXiv preprint arXiv:1809.04374, 2018.
[15] X. Liu, Y. Xu, L. Jia, Q. Wu, and A. Anpalagan, "Anti-jamming communications using spectrum waterfall: A deep reinforcement learning approach," IEEE Communications Letters, vol. 22, no. 5, pp. 998-1001, 2018.