With the rapid development of wireless communications and artificial intelligence (AI), wireless networks are inevitably becoming intelligent: equipped with self-organization, self-configuration, and self-recovery abilities, they can accomplish missions that are difficult for human beings. For example, wireless sensor networks (WSNs) are deployed to sense target areas that are usually hard for human beings to access. Because the power supply is limited, it is essential that such networks are organized autonomously and make decisions intelligently at the node level. The same goes for flying ad hoc networks (FANETs), i.e., networks of multiple unmanned aerial vehicles (UAVs) organized in an ad hoc way. By carrying Internet of Things (IoT) devices, a FANET can accomplish many remarkable missions via the cooperation and coordination of its UAVs. Owing to the large number of nodes and the vast spatial span of a FANET, autonomous decision-making by each UAV is the natural solution.
In the upcoming era of intelligent communications, every device node is expected to be an agent, i.e., an entity that perceives the environment and autonomously makes decisions in pursuit of one or several goals. The agents are surrounded by a complicated and changeable wireless environment: stochastically varying channels (e.g., the air-to-ground channel in UAV communications) and dynamically changing spectrum availability (e.g., cognitive radio in IoT) degrade the performance of traditional optimization algorithms, whose convergence may not even be guaranteed. Fortunately, learning in AI (in particular, reinforcement learning) is a powerful tool for obtaining a high payoff in an unknown environment. Specifically, a learning agent explores the environment, exploits the feedback information for online learning and reasoning, and then makes intelligent choices to cope with the hostile wireless environment. Nevertheless, several challenges remain for intelligent decision-making in wireless networks:
Individual decision-making by each agent may make network coordination difficult and reduce the system performance.
The network may be disrupted or even paralyzed in an adversarial context where malicious agents (e.g., jammers) appear.
Therefore, investigating intelligent decision-making in multi-agent systems is essential and of practical significance for intelligent wireless networks.
Game theory is an extensively used mathematical tool for studying and modeling the interactions of a group of decision makers, and it has been widely applied in wireless communications. Using game theory, we can analyse the impact of the decision-making interactions among the internal agents of the system, and we can also study the adversarial process against external hostile agents, so as to realize a stable and optimal system. Meanwhile, AI is able to deal with the complicated and changeable wireless environment and achieve a high payoff. The combination of these two techniques is a promising direction for developing intelligent wireless networks.
There are various applications of game theory or of learning in AI to wireless networks. However, a systematic framework and discussion are lacking for the case where game theory and learning in AI are applied together to intelligent wireless networks. In this paper, we first introduce the fundamental connections between game theory and learning in AI, and discuss the technical challenges and requirements of applying them to intelligent wireless networks. Then, a two-layer game-theoretic learning framework is introduced. Unlike previous works, the framework considers not only internal coordination but also external adversaries. Last but not least, several game-theoretic learning methods and their applications are introduced, and a real-life testbed based on the framework is developed.
The rest of this article is organized as follows. Section II discusses the connections, challenges and requirements of combining learning in AI with game theory. Section III proposes the game-theoretic learning framework for intelligent wireless networks and discusses its practical issues. Section IV introduces typical applications of several game-theoretic learning methods. Section V presents a case study of the framework based on the experimental results of a real-life testbed. Section VI discusses future research directions and concludes the article.
II Connections between Learning in AI and Game Theory, and Their Challenges and Requirements in Intelligent Wireless Networks
Intelligent wireless networks are expected to have learning and coordinating abilities so as to provide high quality-of-experience (QoE) service autonomously. However, combining the techniques of learning in AI and game theory in wireless networks raises several technical challenges and requirements. In this section, as Fig. 1 shows, the connections between learning in AI and game theory are first introduced, and then the challenges and requirements are discussed.
Game theory and AI, although different subjects, have fundamental connections. Learning is an essential part of both game theory and AI. Learning in AI can be roughly divided into three levels: pattern recognition, prediction and decision making. Herein, we discuss the connections between game theory and AI at these three levels.
II-A1 Pattern Recognition
The main focus of pattern recognition is to capture the patterns of the input data so that their regularities are recognized and the data can be further processed, e.g., classified. A remarkable application of this first level of learning in AI with game theory is the generative adversarial network (GAN), which consists of a generative network $G$ that captures the regularities of the data and generates mimic data, and a discriminative network $D$ that tries to differentiate the mimic data from the real data. The two networks are involved in an adversarial process in which $D$ tries to maximize the detection probability while $G$’s goal is to minimize it, which constitutes a minimax game. The equilibrium of the game is reached when $D$ fails to make a distinction, i.e., the detection probability is 1/2.
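The equilibrium value of 1/2 can be checked numerically on a toy example. The sketch below assumes a finite data support and uses the standard closed form of the best discriminator for a fixed generator, $D^*(x) = p_{data}(x) / (p_{data}(x) + p_g(x))$; the distributions and function names are hypothetical and serve only as an illustration.

```python
import math

def optimal_discriminator(p_data, p_gen):
    """Best response of the discriminator to a fixed generator distribution."""
    return [pd / (pd + pg) for pd, pg in zip(p_data, p_gen)]

def value(p_data, p_gen, d):
    """Minimax objective E_data[log D(x)] + E_gen[log(1 - D(x))] on a finite support."""
    return sum(pd * math.log(dx) + pg * math.log(1.0 - dx)
               for pd, pg, dx in zip(p_data, p_gen, d))

p_data = [0.5, 0.3, 0.2]   # hypothetical data distribution
p_bad = [0.2, 0.3, 0.5]    # hypothetical generator that has not converged

d_bad = optimal_discriminator(p_data, p_bad)   # can still tell the samples apart
d_eq = optimal_discriminator(p_data, p_data)   # generator matches the data

v_bad = value(p_data, p_bad, d_bad)
v_eq = value(p_data, p_data, d_eq)             # equals -log 4 at the equilibrium
```

When the generator reproduces the data distribution, every entry of `d_eq` is 1/2 and the objective drops to its minimum, which is the situation described above in which the discriminator can no longer make a distinction.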
II-A2 Prediction
At the second level of learning in AI, it is easy to understand that once the patterns of the data are recognized, predictions can be made. However, there is another important prediction method that has close connections with game theory. When the predictor interacts with the environment that generates the data, the prediction process between predictor and environment can be modeled as a repeated game, which greatly facilitates the analysis of prediction.
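One well-known instance of prediction viewed as a repeated game is prediction with expert advice. The following minimal sketch uses an exponentially weighted forecaster; the learning rate `eta`, the loss sequence and the function name are illustrative assumptions and are not taken from this article.

```python
import math

def exp_weights_forecaster(expert_losses, eta=0.5):
    """Play a repeated prediction game against the environment.

    expert_losses[t][i] is the loss in [0, 1] of expert i in round t, as
    chosen by the environment. Returns the forecaster's cumulative expected
    loss and its final probability vector over the experts.
    """
    num_experts = len(expert_losses[0])
    weights = [1.0] * num_experts
    total_loss = 0.0
    for losses in expert_losses:
        norm = sum(weights)
        probs = [w / norm for w in weights]
        # expected loss of following a randomly drawn expert in this round
        total_loss += sum(p * l for p, l in zip(probs, losses))
        # experts that performed badly lose weight exponentially fast
        weights = [w * math.exp(-eta * l) for w, l in zip(weights, losses)]
    norm = sum(weights)
    return total_loss, [w / norm for w in weights]

# Hypothetical environment: expert 0 is always right, expert 1 always wrong.
rounds = [[0.0, 1.0, 0.5]] * 20
loss, final_probs = exp_weights_forecaster(rounds)
```

The forecaster never sees how the environment generates the losses; it only reacts to the outcome of each round, which is what makes the repeated-game view convenient for analysis.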
II-A3 Decision Making
Last but most relevant, game theory and learning in AI are connected at the decision-making level. First of all, both game theory and AI focus on how the players/agents¹ deal with a complex context and make decisions that optimize the system performance/payoff. Second, learning is the fundamental process by which both game theory and AI achieve their final goals. In game theory, players optimize their utilities and find the equilibrium by learning the other players’ strategies. Learning in game theory can predict the behaviour of players and the outcome of the game; it emphasises explaining the existence of an equilibrium at the level of theoretical analysis and guiding the players towards that equilibrium. Learning in AI strives to find the optimal strategies at the level of practical operation. From this perspective, combining AI with game theory and designing effective multi-agent intelligent decision-making schemes can not only cope with the complicated and dynamic context, but also coordinate the strategies of the agents.
¹ In wireless communications, the players in game theory can be terminal equipment, base stations, routers, etc. In this article, we define an agent as a piece of equipment that serves users, and we use agent, player, node and equipment interchangeably.
The first two levels have been widely studied in AI. However, at the decision-making level, many problems still need to be investigated in wireless networks. Therefore, in this paper, we mainly focus on the decision-making level of learning in AI.
Nevertheless, due to the specific characteristics of wireless networks, many challenges arise when the combined techniques are applied to wireless communications. The challenges are summarized as follows.
Constrained. Unlike typical agents in AI, which are equipped with powerful capabilities, the agents in wireless networks usually have limited resources such as computation ability and energy. Therefore, these constraints should be taken into consideration when designing the utility functions and learning schemes.
Incomplete. Information is essential for learning to make intelligent decisions. However, due to hardware limitations, the information obtained from the wireless environment may be incomplete, which makes it hard to guarantee the convergence, robustness and optimality of the algorithms.
Distributed. Many existing learning algorithms need information exchange between agents or the coordination of a central controller. However, because of deep fading and/or limited transmission power, direct communication links in the network may fail, in which case nodes have to ask other nodes to relay the information or increase their transmission power. Obviously, this is inefficient and impractical.
Dynamically scalable. The set of participants in the game is dynamic. For example, in a dynamic spectrum access scenario, the nodes that have data to send join the competition for channel access and leave the game after they finish. Besides, as the network size scales up, analysing the interactions among a large number of agents becomes challenging.
Heterogeneous. Most networks consist of a group of heterogeneous nodes produced by different enterprises. For example, depending on the mission, a FANET can be formed of diverse types of UAVs, such as fixed-wing and rotary-wing UAVs, so as to meet different requirements. As a result, several different types of intelligent decision-making algorithms may coexist. Designing a mechanism that guarantees the convergence of those algorithms in the network is challenging.
In the multi-agent system, we are unable to design every decision made by the agents during their interaction, but there must be rules and motivation that govern the agents’ actions. The core purpose of the intelligent network is to serve human beings, and “intelligent” means serving people more intelligently rather than simply improving quality-of-service (QoS) metrics, e.g., increasing the data rate or decreasing the transmission delay. Therefore, the design of the game-theoretic intelligent network must also satisfy the following requirements.
First and foremost, QoE is the main optimization objective of the intelligent network, which means the agents are supposed to be context-aware. In particular, the individual features of users need to be considered when designing the agents’ algorithms. At the user level, the context includes the type of equipment (e.g., high-definition laptops and phones have different requirements for video definition), residual energy, location, user demand and preference, and service content (e.g., online movies and real-time video calls have different delay requirements). At the network level, the context includes the priority of service and the state of the spectrum and the network. When the network can provide personalized service, the QoE of users is enhanced.
Another important issue for the intelligent network is robustness. For instance, in an ad hoc network, some nodes may run out of energy or suddenly fail, so a fast self-recovery mechanism is necessary. More seriously, if there are malicious agents such as jammers, the network has to survive in the adversarial context and provide at least a minimum guaranteed service.
III Game-Theoretic Learning Framework for Intelligent Wireless Networks
In this section, we propose a two-layer framework for intelligent networks, as shown in Fig. 2. The first layer performs problem analysis and game modeling; based on the first layer, the second layer provides the game-theoretic learning method.
III-A Problem Analysing and Game Modeling
The first layer can be divided into two steps, i.e., problem analysing and game modeling. In Fig. 1, we take the FANET scenario as an example, in which the network is clustered according to location. Note that, besides FANETs, this kind of network has many other applications, such as device-to-device networks and WSNs. Internally, because the nodes in each cluster share the same frequency, mutual interference must be considered. Externally, because of the existence of jammers, UAVs must access jamming-free channels or increase their transmit power. Therefore, the problem can be modeled as dynamic spectrum access, power control and anti-jamming.
Building on the problem analysis, game theory, a powerful mathematical tool, is adopted to analyse and model the multi-agent system. Considering the distributed nature of self-organized wireless networks, in this paper we model the multi-agent system as a non-cooperative game. The model can be expressed as $\mathcal{G} = \{\mathcal{N}, \{\mathcal{A}_n\}_{n \in \mathcal{N}}, \{u_n\}_{n \in \mathcal{N}}\}$, where $\mathcal{N}$ is the set of participating agents², $\mathcal{A}_n$ is the available action set of agent $n$ (e.g., available channels, transmit power, transmit duration, etc.), and $u_n$ denotes the utility function of agent $n$. For an agent, the utility function evaluates its current decision, and the goal of each agent is to maximize $u_n$ by adjusting its decisions. However, users in the network are mostly self-interested and maximize only their own utilities, which may result in low system performance. There are two reasons for adopting game theory, as shown in Fig. 1:
² We note that the participant set is dynamic, as mentioned in the challenge of dynamic scalability, i.e., the number of agents in $\mathcal{N}$ is variable.
Distributed coordination: As mentioned in Section II-B, the rules and motivation of the game are vital, and the dilemma of the self-interested game can be solved by coordinating the behaviours of the agents. Specifically, inside the network, if the utility function of each agent considers not only the payoff the agent can get but also the impact of its decision on other agents, then the coordination of the network can be realized spontaneously and in a distributed manner.
Adversarial decision-making: Game theory can also analyse and model the adversarial relationship among agents. Assume that malicious users exist outside the network. The layer of legitimate users makes the best response against the jammers, aiming at maximizing the signal-to-interference-plus-noise ratio (SINR), while the layer of jammers makes the best response against the legitimate users, aiming at minimizing the SINR of the legitimate users. This kind of adversarial process can be modeled as a Stackelberg game, which will be introduced in Section IV-B.
As is well known, the Nash equilibrium (NE) is the stable outcome of a non-cooperative game. By analysing the NE solutions, we can predict the behaviours of the agents and the performance of the game. However, not all games have an NE. Fortunately, this can be handled by designing the utility function, which determines all the properties of the game model. Owing to good properties such as the guaranteed existence of at least one NE, potential games are extensively used in wireless communications. A game is a potential game if there exists a potential function that varies in the same direction as the utility function whenever an arbitrary agent unilaterally changes its action. Another attractive property of a potential game is that all of its NEs are local or global optimal solutions of the potential function. This means that if the potential function is designed to be the optimization objective of the network, the NEs of the potential game are local or global optimal solutions for the network.
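The potential-game condition can be verified by brute force on a small example. The sketch below assumes a hypothetical interference game in which the utility of an agent is minus the number of other agents sharing its channel, and it checks the stronger exact-potential condition (the change in utility equals the change in potential), which is a special case of the "same variation trend" condition stated above.

```python
from itertools import product

def utility(n, profile):
    """Hypothetical utility: minus the number of co-channel agents of agent n."""
    return -sum(1 for m, ch in enumerate(profile) if m != n and ch == profile[n])

def potential(profile):
    """Candidate potential: half the summed utilities (minus the colliding pairs)."""
    return 0.5 * sum(utility(n, profile) for n in range(len(profile)))

def deviate(profile, n, channel):
    """Profile obtained when agent n unilaterally switches to `channel`."""
    return profile[:n] + (channel,) + profile[n + 1:]

def is_exact_potential(num_agents, num_channels):
    """Check that every unilateral deviation changes utility and potential equally."""
    for profile in product(range(num_channels), repeat=num_agents):
        for n in range(num_agents):
            for ch in range(num_channels):
                new = deviate(profile, n, ch)
                du = utility(n, new) - utility(n, profile)
                dphi = potential(new) - potential(profile)
                if abs(du - dphi) > 1e-12:
                    return False
    return True

def pure_nash_equilibria(num_agents, num_channels):
    """All profiles in which no agent gains by a unilateral deviation."""
    return [
        profile
        for profile in product(range(num_channels), repeat=num_agents)
        if all(utility(n, profile) >= utility(n, deviate(profile, n, ch))
               for n in range(num_agents) for ch in range(num_channels))
    ]

equilibria = pure_nash_equilibria(num_agents=3, num_channels=2)
```

With three agents and two channels, the equilibria are exactly the profiles that split the agents two-to-one, and each of them attains the maximum of the potential function, in line with the property that NEs of a potential game are optima of the potential.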
III-B Game-Theoretic Learning Methods
The second layer of the game-theoretic learning framework tells the agents how to behave in the complicated network. The general game-theoretic learning model can be expressed as
$$a_n(t+1) = F_n\big(a_n(t),\, r_n(a_n(t), a_{-n}(t))\big),$$
where $a_n(t)$ and $a_{-n}(t)$ denote the action of agent $n$ and the action profile of all the agents except agent $n$ at the $t$th decision-making period, $r_n$ represents the reward of agent $n$, which depends on $a_n(t)$ and $a_{-n}(t)$, and $F_n$ is the update function of agent $n$’s strategy. That is to say, the decision at the $(t+1)$th period is adjusted according to the action taken and the reward received at the $t$th period. This kind of online learning can cope with a dynamic, unknown environment. Although there are various kinds of learning methods in AI, several practical situations must be considered:
The learning algorithms must guarantee that the network converges to an NE. Moreover, there may exist more than one NE, so it is best for the algorithms to reach the optimal NE.
Unlike classical game theory, in which a player adjusts its action by observing the others’ decisions, an agent in a wireless network can hardly be informed of all the others’ actions, as discussed in Section II-A. Therefore, learning methods that demand as little information about others as possible are practical for intelligent wireless networks.
In the adversarial situation, both the legitimate agents and the malicious agents use intelligent techniques to compete with each other. The reward may vary across decision-making periods, which may hinder the convergence of the algorithms.
Based on the above discussion, we next introduce several game-theoretic learning methods and their applications in wireless networks.
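The decision-reward-update recursion of the second layer can be sketched as a generic loop in which each agent's next action depends only on its own current action and its own reward. The reward function and the win-stay, lose-shift update rule below are hypothetical toy choices used purely for illustration; they are not the methods proposed in this article.

```python
import random

def run_learning(num_agents, actions, reward_fn, update_fn, periods, seed=0):
    """Run the decision-reward-update loop and return the final action profile."""
    rng = random.Random(seed)
    profile = [rng.choice(actions) for _ in range(num_agents)]
    for _ in range(periods):
        # each agent observes only its own reward, which depends on all actions
        rewards = [reward_fn(n, profile) for n in range(num_agents)]
        # next action = update(own action, own reward); others' actions are never read
        profile = [update_fn(profile[n], rewards[n], actions, rng)
                   for n in range(num_agents)]
    return profile

def collision_free(n, profile):
    """Hypothetical reward: 1 if agent n is alone on its channel, else 0."""
    return 1.0 if profile.count(profile[n]) == 1 else 0.0

def win_stay_lose_shift(action, reward, actions, rng):
    """Toy update rule: keep a rewarded action, otherwise pick a random one."""
    return action if reward > 0 else rng.choice(actions)

final_profile = run_learning(2, [0, 1, 2], collision_free,
                             win_stay_lose_shift, periods=200)
```

Because the update function receives no information about the other agents, the loop respects the incomplete-information constraint listed among the practical situations above.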
IV Applications of Game-Theoretic Learning Methods
When learning in AI is combined with game theory, it is challenging to prove theoretically that the algorithms converge to an NE, because the approach varies across applications. In this section, we introduce some game-theoretic learning methods whose attractive properties have been verified both theoretically and numerically. Note that all the learning methods introduced in this paper belong to the category of reinforcement learning, i.e., they follow the decision-reward-update process shown in Fig. 3.
IV-A Distributed Coordination
In this subsection, we introduce three uncoupled learning methods (i.e., methods that need no information about other agents) that have good NE properties in games.
IV-A1 Log-Logit Learning
The main idea of the log-logit learning algorithm is the Boltzmann-Gibbs strategy, i.e., choosing a high-utility strategy with high probability and a low-utility strategy with low probability. During each iteration, the algorithm selects one agent to randomly explore the actions in its available action set and to update its decision probabilities based on the payoff achieved during the exploration. This learning algorithm is proved to converge to the globally optimal strategy. In our previous work, we consider a resource allocation problem in a multi-cell scenario. By taking the base stations as the players and the joint power and channel allocation as their actions, a QoE-oriented game that considers fairness and users’ requirements is proposed. Note that this problem is a combinatorial optimization problem and is NP-hard. Therefore, we resort to the log-logit learning algorithm to obtain the globally optimal solution, which is much better than that of the QoS-based algorithm.
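A minimal sketch of a learning rule of this kind is given below, assuming a toy channel-selection game in which the utility of an agent is minus the number of co-channel agents. In each iteration one randomly selected agent tries a random channel and keeps it with a Boltzmann-Gibbs (logit) probability; the inverse temperature `beta`, the utility and all names are illustrative assumptions rather than the exact algorithm used in the cited work.

```python
import math
import random

def collisions(n, profile):
    """Number of other agents that share the channel of agent n."""
    return sum(1 for m, ch in enumerate(profile) if m != n and ch == profile[n])

def log_logit_learning(num_agents, num_channels, beta=20.0,
                       iterations=2000, seed=1):
    """Asynchronous exploration with a Boltzmann-Gibbs acceptance rule."""
    rng = random.Random(seed)
    profile = [0] * num_agents                 # worst start: all agents on channel 0
    for _ in range(iterations):
        n = rng.randrange(num_agents)          # only one agent learns per iteration
        trial = rng.randrange(num_channels)    # random exploration of one action
        explored = profile[:n] + [trial] + profile[n + 1:]
        u_old = -collisions(n, profile)
        u_new = -collisions(n, explored)
        # keep the explored action with probability
        # exp(beta*u_new) / (exp(beta*u_old) + exp(beta*u_new))
        p_new = 1.0 / (1.0 + math.exp(beta * (u_old - u_new)))
        if rng.random() < p_new:
            profile = explored
    return profile

final_profile = log_logit_learning(num_agents=3, num_channels=3)
```

A large `beta` makes the rule almost greedy, so the agents settle on a collision-free allocation; a smaller `beta` keeps more exploration at the price of occasionally leaving good allocations.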
IV-A2 Stochastic Learning Automata
The log-logit learning algorithm is asynchronous, i.e., only one agent learns during each iteration; as a result, its convergence is slow and it is unsuitable for fast-changing environments. Hence, we next introduce a synchronous and uncoupled algorithm, i.e., the stochastic learning automata (SLA) algorithm. The main idea of SLA is that, based on the previous experience of actions, those leading to higher rewards tend to be repeated more frequently. Although SLA converges to an arbitrary NE of a potential game rather than necessarily to the optimal one, its synchronous and uncoupled properties are still promising in practical applications. In prior work, a dynamic computation offloading problem is investigated. Considering the dynamic number of active agents and the time-varying fading channel, the problem is modeled as a potential game, and the SLA algorithm is adopted to make the offloading decisions intelligently.
However, the above learning algorithms are not guaranteed to converge in generic games. Next, we introduce a more practical algorithm called trial and error learning (TEL), which converges to an efficient NE of a generic game.
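The idea that rewarded actions are repeated more frequently can be sketched with the linear reward-inaction update commonly used in learning automata. The step size, the 0/1 collision reward and the two-agent, two-channel setting below are illustrative assumptions, not the configuration of the cited works.

```python
import random

def sla_update(probs, chosen, reward, step=0.1):
    """Linear reward-inaction rule: move probability mass towards the chosen
    action in proportion to the normalised reward in [0, 1]."""
    return [p + step * reward * ((1.0 if a == chosen else 0.0) - p)
            for a, p in enumerate(probs)]

def run_sla(num_agents=2, num_channels=2, step=0.1, iterations=3000, seed=0):
    """All agents act and update simultaneously, using only their own rewards."""
    rng = random.Random(seed)
    probs = [[1.0 / num_channels] * num_channels for _ in range(num_agents)]
    for _ in range(iterations):
        actions = [rng.choices(range(num_channels), weights=p)[0] for p in probs]
        for n in range(num_agents):
            collided = any(m != n and actions[m] == actions[n]
                           for m in range(num_agents))
            reward = 0.0 if collided else 1.0   # hypothetical 0/1 reward
            probs[n] = sla_update(probs[n], actions[n], reward, step)
    return probs

final_probs = run_sla()
preferred = [max(range(len(p)), key=lambda a: p[a]) for p in final_probs]
```

Each agent keeps only a probability vector over its own actions, so the rule is both synchronous and uncoupled; which of the two collision-free allocations is reached depends on the random trajectory.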
IV-A3 Trial and Error Learning
TEL is an “emotional” algorithm. The state of each agent $n$ at the $t$th iteration is assumed to be $z_n(t) = \{m_n(t), \bar{a}_n(t), \bar{u}_n(t)\}$, where $\bar{a}_n(t)$ and $\bar{u}_n(t)$ represent the benchmarks of strategy and reward, respectively, and $m_n(t)$ denotes the mood of agent $n$. By personifying the agents, they are assumed to have four kinds of moods, which in descending order are content, hopeful, watchful and discontent. As the agent interacts with the environment and receives rewards, its mood, its benchmark strategy and its benchmark reward are updated. Inspired by TEL, we proposed a QoE-aware game in heterogeneous wireless networks. Based on the users’ requirements, each user’s throughput is mapped to one of several QoE levels, and the TEL algorithm is adopted in the QoE-aware game to reach an efficient NE, which effectively enhances the QoE.
So far we have seen that, in complex wireless networks, most of the optimization problems are NP-hard. The cost of centralized algorithms is unacceptably high, while heuristic algorithms are unable to reach the optimal solutions. Fortunately, we can turn to learning algorithms. In conclusion, these algorithms provide the agents with cognitive computing abilities, which are summarized as follows:
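The mood-based state can be sketched as a small state machine. The code below is a simplified reading of the trial-and-error rules for the case in which the agent replays its benchmark action; the probabilistic experimentation of content agents and the random search of discontent agents are deliberately omitted, and the exact transitions should be checked against the cited TEL reference. All names are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TelState:
    mood: str            # "content", "hopeful", "watchful" or "discontent"
    bench_action: int    # benchmark strategy
    bench_reward: float  # benchmark reward

def update_without_experiment(state, reward):
    """Mood update after the agent replays its benchmark action and observes
    `reward` (simplified: experimentation and random search are not modelled)."""
    higher = reward > state.bench_reward
    lower = reward < state.bench_reward
    if state.mood == "content":
        mood = "hopeful" if higher else "watchful" if lower else "content"
        return TelState(mood, state.bench_action, state.bench_reward)
    if state.mood == "hopeful":
        if higher:
            # the improvement is confirmed: raise the reward benchmark
            return TelState("content", state.bench_action, reward)
        mood = "watchful" if lower else "content"
        return TelState(mood, state.bench_action, state.bench_reward)
    if state.mood == "watchful":
        mood = "hopeful" if higher else "discontent" if lower else "content"
        return TelState(mood, state.bench_action, state.bench_reward)
    return state  # discontent agents would search randomly (not modelled here)

start = TelState("content", bench_action=2, bench_reward=1.0)
after_drop = update_without_experiment(start, 0.5)          # becomes watchful
after_second_drop = update_without_experiment(after_drop, 0.5)  # becomes discontent
```

The intermediate hopeful and watchful moods make an agent wait for a second observation before it changes its benchmarks, which filters out reward changes caused by other agents' short-lived experiments.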
Interactive. Driven by the goal of maximizing their utilities, the agents interact with the environment, with each other, or even with adversaries to obtain information.
Adaptive. The agents can learn of changes from the received information and respond to them adaptively and in a timely manner.
Predictive. The agents can “remember” their interaction experiences and accordingly make high-payoff decisions in time.
IV-B Adversarial Decision-Making
If there are malicious users who try to disrupt the network, the dynamic process of competitive interaction cannot be directly modeled by the above game-theoretic learning methods. As mentioned, we can adopt a Stackelberg game for the analysis. In a Stackelberg game, there are a leader and a follower. The leader first takes an action, and the follower takes a best response to the leader’s action. Then the leader in turn chooses its best action according to the follower’s response. The competitive interaction continues until it converges to the Stackelberg equilibrium (SE), where neither the leader nor the follower can obtain a higher utility by changing its strategy unilaterally. In our previous work, we adopt the Stackelberg game to analyse an anti-jamming problem in which a group of users and a malicious jammer exist. The game consists of two sub-games: the follower sub-game, in which the users adopt the SLA algorithm to minimize the impact of co-channel interference (CCI) and the jamming signal, and the leader sub-game, in which the jammer uses the Q-learning algorithm to maximize the effect of its interference.
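For a small finite game, the leader-follower logic can be illustrated by brute force rather than by learning: the leader evaluates each of its actions under the follower's best response and commits to the best one. The channel rates and utilities below are hypothetical, with a jammer as the leader and a single user as the follower.

```python
def stackelberg_equilibrium(leader_actions, follower_actions,
                            u_leader, u_follower):
    """Return (leader action, follower action) found by backward induction."""
    best = None
    for a_l in leader_actions:
        # the follower observes the leader's action and best-responds to it
        a_f = max(follower_actions, key=lambda a: u_follower(a_l, a))
        candidate = (u_leader(a_l, a_f), a_l, a_f)
        # the leader keeps the action whose anticipated outcome is best for it
        if best is None or candidate[0] > best[0]:
            best = candidate
    return best[1], best[2]

rates = [3.0, 2.0, 1.0]   # hypothetical achievable rate on each channel

def user_utility(jammed, channel):
    """The user obtains the channel rate unless the channel is jammed."""
    return 0.0 if channel == jammed else rates[channel]

def jammer_utility(jammed, channel):
    """The jammer wants to minimise the user's utility."""
    return -user_utility(jammed, channel)

jammed, channel = stackelberg_equilibrium(range(3), range(3),
                                          jammer_utility, user_utility)
```

In this toy instance the jammer blocks the best channel and the user settles on the second-best one; in the learning-based setting described above, the same outcome has to be reached through repeated interaction instead of enumeration.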
Note that Q-learning is a classic reinforcement learning method that aims to maximize the long-term cumulative reward of a problem modeled as a Markov decision process (MDP)³. In the learning process, the agent keeps a Q-value table that stores the evaluation of the actions under the current state. Through interaction with the environment, the agent updates the Q-value table, and the actions with higher Q-values are more likely to be selected. However, the MDP is designed for the single-agent scenario. If each agent treats the other agents as part of the environment and performs individual Q-learning, the network may never converge to an equilibrium. In the next section, based on an algorithm we proposed and a testbed we have developed, we choose multi-agent reinforcement learning (MARL) anti-jamming as a case study of the game-theoretic learning framework.
³ An MDP models an agent that makes sequential decisions in a stochastic environment with a Markovian transition model.
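The Q-value table and its update can be sketched as follows. The learning rate `alpha`, the discount factor `gamma` and the epsilon-greedy selection are standard textbook choices assumed here for illustration; the article does not specify them.

```python
import random
from collections import defaultdict

class QLearner:
    """Tabular Q-learning agent with epsilon-greedy action selection."""

    def __init__(self, actions, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
        self.q = defaultdict(float)   # Q-value table indexed by (state, action)
        self.actions = list(actions)
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        self.rng = random.Random(seed)

    def act(self, state):
        """Mostly pick the action with the highest Q-value, sometimes explore."""
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[(state, a)])

    def update(self, state, action, reward, next_state):
        """Move Q(state, action) towards reward + gamma * max_a Q(next_state, a)."""
        best_next = max(self.q[(next_state, a)] for a in self.actions)
        target = reward + self.gamma * best_next
        self.q[(state, action)] += self.alpha * (target - self.q[(state, action)])

learner = QLearner(actions=[0, 1])
learner.update("s", 1, 1.0, "s")   # first rewarded use of action 1 in state "s"
learner.update("s", 1, 1.0, "s")   # the estimate grows with repeated rewards
```

The table is indexed by the agent's own state and action only, which is why independent learners that ignore each other's adaptation may fail to converge in a multi-agent setting.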
V Case Study: A Testbed for Multi-Agent Reinforcement Learning Anti-Jamming
V-A Multi-Agent Reinforcement Learning Anti-Jamming Algorithm
In a multi-agent system, we can resort to the Markov game to analyse the multi-agent MDP. A Markov game can be modeled as $\{\mathcal{N}, \mathcal{C}, \mathcal{S}, r, P\}$, where $\mathcal{N}$ is the set of legitimate agents, $\mathcal{C}$ is the set of available channels, the state $s \in \mathcal{S}$ is defined as the action profile together with the channel that is currently being jammed, $r$ is the reward function and $P$ is the state transition probability function. The reward is positive if the agent chooses a channel that is occupied neither by other agents nor by the jammer; otherwise the reward is zero. In the system, the number of agents is two. As shown in Fig. 4 (a), the Q-learning algorithm is adopted by the agents to learn the jamming pattern and avoid CCI in a collaborative way. Specifically, if an agent successfully transmits a packet (confirmed by an ACK), the agent has chosen a channel without CCI and jamming, and it gets a positive reward with which to update its Q-value. Then, the agents exchange their Q-values with each other to perform collaborative Q-learning. From Fig. 4 (b) we can see that the proposed algorithm achieves better performance than the sensing-based method and avoids the non-convergence of independent Q-learning, which makes it a good example of the distributed coordination and adversarial decision-making of the game-theoretic learning framework.
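The state and reward definitions above can be made concrete with a short sketch. It assumes a reward of 1 for a successful transmission and a jammer that visits the channels one by one in order; the function names and the numeric reward value are hypothetical.

```python
from itertools import islice

def sweeping_jammer(num_channels, start=0):
    """Yield the jammed channel of each slot: channels are jammed in sequence."""
    slot = start
    while True:
        yield slot % num_channels
        slot += 1

def rewards(actions, jammed):
    """Reward 1 for an agent whose channel is neither jammed nor shared, else 0."""
    return [1.0 if ch != jammed and actions.count(ch) == 1 else 0.0
            for ch in actions]

def make_state(actions, jammed):
    """State = action profile of the agents plus the currently jammed channel."""
    return (tuple(actions), jammed)

jammer = sweeping_jammer(num_channels=5)
first_slots = list(islice(jammer, 6))   # channels jammed in the first six slots

state = make_state([0, 1], jammed=1)
slot_rewards = rewards([0, 1], jammed=1)   # the agent on channel 1 is jammed
```

Because the state contains both the action profile and the jammed channel, a learner that observes it can in principle pick up the periodic sweeping pattern as well as the other agent's channel habit.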
V-B A Testbed for Multi-Agent Reinforcement Learning Anti-Jamming
As Fig. 5 shows, we developed a multi-agent anti-jamming testbed based on the algorithm. The system consists of two users and one jammer. Here, a user is a transmitter-receiver pair. We assume that there are five available channels and that each transmitter tries to send a picture to its receiver. Each user is equipped with three universal software radio peripherals (USRPs), which serve as the transceiver and the wide-band spectrum sensing (WBSS) equipment. The jammer USRP transmits the jamming signal in a sweeping pattern, i.e., it periodically jams the channels in sequence.
Fig. 5 (b) and (c) are screenshots of the dynamic process running in the LabVIEW software. There are four windows in each receiver’s interface, as marked in Fig. 5 (b). The first window shows the channel selection process of user 1 and user 2, the second shows the normalized throughput, the third is the real-time spectrum display of the WBSS, where the two pulses are the users’ signals and the one in the middle is the jamming signal, and the fourth is the picture being received. As the experimental results show, the jamming signal and CCI can be effectively avoided, and the transmitted pictures can be successfully received.
VI Future Research Directions and Conclusion
VI-A Future Research Directions
It can be seen that the combination of game theory and learning in AI is highly promising for intelligent wireless networks, but current research still has a long way to go to achieve the expected goal. Herein, we summarize some future research directions as follows.
Traditional Q-learning is difficult to adopt under general real-world settings because of the tremendous state space, which is known as “the curse of dimensionality”. Owing to its great approximation capability, deep reinforcement learning (deep Q-learning) overcomes this shortcoming and is more suitable for actual wireless networks. We have applied deep Q-learning to the single-agent anti-jamming scenario, as shown in Fig. 6. Based on the existing research on multi-agent Q-learning, the properties of multi-agent deep Q-learning can be studied from a game-theoretic perspective.
In large-scale wireless networks, heterogeneous structures such as clustering and hierarchy can make the system more effective and efficient. Therefore, it is of practical significance to investigate hybrid and hierarchical MARL schemes that consist of several reinforcement learning algorithms in the multi-agent system. With the help of game theory, we can theoretically study the performance and convergence of hybrid and hierarchical MARL.
VI-B Conclusion
In this paper, we proposed a game-theoretic learning framework for intelligent wireless networks, which possesses the good properties of both game theory and learning in AI. First of all, we explained why the combination of game theory and learning in AI is necessary for intelligent networks. Then, the connections, challenges and requirements were discussed. To handle these problems, we introduced the proposed two-layer game-theoretic learning framework from internal and external perspectives. Building on our previous works, several promising game-theoretic learning methods were introduced. Moreover, a multi-agent anti-jamming testbed was developed, and the experimental results demonstrated the effectiveness of the game-theoretic learning method. Finally, some future research directions were discussed.
This work was supported by the National Natural Science Foundation of China under Grant No. 61771488, No. 61671473 and No. 61631020, in part by the Natural Science Foundation for Distinguished Young Scholars of Jiangsu Province under Grant No. BK20160034, and in part by the Open Research Foundation of Science and Technology on Communication Networks Laboratory.
References
-  Z. Han, D. Niyato, W. Saad, T. Başar, and A. Hjørungnes, Game Theory in Wireless and Communication Networks: Theory, Models, and Applications. Cambridge University Press, 2012.
-  J.-B. Wang, J. Wang, Y. Wu, J.-Y. Wang, H. Zhu, M. Lin, and J. Wang, “A Machine Learning Framework for Resource Allocation Assisted by Cloud Computing,” IEEE Network, vol. 32, no. 2, pp. 144-151, 2018.
-  I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in International Conference on Neural Information Processing Systems, MIT Press, 2014, pp. 2672-2680.
-  N. Cesa-Bianchi and G. Lugosi, Prediction, Learning, and Games. Cambridge University Press, 2006.
-  F. Yao, L. Jia, Y. Sun, Y. Xu, S. Feng, and Y. Zhu, “A hierarchical learning approach to anti-jamming channel selection strategies,” Wireless Networks, pp. 1-13, 2017.
-  D. Monderer and L. S. Shapley, “Potential Games,” Games and Economic Behavior, vol. 14, pp. 124-143, 1996.
-  R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, 1998.
-  J. Zheng, Y. Cai, Y. Liu, Y. Xu, B. Duan, and X. S. Shen, “Optimal power allocation and user scheduling in multicell networks: Base station cooperation using a game-theoretic approach,” IEEE Trans. Wireless Commun., vol. 13, no. 12, pp. 6928-6942, 2014.
-  P. Sastry, V. Phansalkar, and M. Thathachar, “Decentralized learning of Nash equilibria in multi-person stochastic games with incomplete information,” IEEE Trans. Syst., Man, Cybern. B, vol. 24, no. 5, pp. 769-777, 1994.
-  J. Zheng, Y. Cai, Y. Wu, and X. S. Shen, “Dynamic Computation Offloading for Mobile Cloud Computing: A Stochastic Game-Theoretic Approach,” IEEE Trans. Mobile Computing, 2018.
-  B. Pradelski, and H. P. Young, “Efficiency and equilibrium in trial and error learning,” University of Oxford, Department of Economics, Economics Series Working Papers, vol. 480, 2010.
-  Z. Du, Q. Wu, P. Yang, Y. Xu, J. Wang, and Y.-D. Yao, “Exploiting user demand diversity in heterogeneous wireless networks,” IEEE Trans. Wireless Commun., vol. 14, no. 8, pp. 4142-4155, 2015.
-  N. Vlassis, “A concise introduction to multiagent systems and distributed artificial intelligence,” Synthesis Lectures on Artificial Intelligence and Machine Learning, vol. 1, no. 1, pp. 1-71, 2007.
-  F. Yao, and L. Jia, “A Collaborative Multi-agent Reinforcement Learning Anti-jamming Algorithm in Wireless Networks,” arXiv preprint, arXiv:1809.04374, 2018.
-  X. Liu, Y. Xu, L. Jia, Q. Wu and A. Anpalagan, “Anti-Jamming Communications Using Spectrum Waterfall: A Deep Reinforcement Learning Approach,” IEEE Communications Letters, vol. 22, no. 5, pp. 998-1001, 2018.