Nomenclature
5G  The 5th Generation Mobile Network 
AI  Artificial Intelligence 
AMC  Automatic Modulation Classification 
ANN  Artificial Neural Network 
AP  Access Point 
AWGN  Additive White Gaussian Noise 
BBU  Baseband Processing Unit 
BS  Base Station 
CDF  Cumulative Distribution Function 
CNN  Convolutional Neural Network 
CogNet  Cognitive Network 
CoMP  Coordinated Multi-Point 
CR  Cognitive Radio 
CRAN  Cloud Radio Access Network 
CRN  Cognitive Radio Network 
CSI  Channel State Information 
CSMA/CA  Carrier-Sense Multiple Access with Collision Avoidance 
CSMA/CD  Carrier-Sense Multiple Access with Collision Detection 
CS Mode  Client-Server Mode 
D2D  Device to Device 
DBN  Deep Belief Network 
DNN  Deep Neural Network 
DQN  Deep Q-Network 
EA  Energy Awareness 
EE  Energy Efficiency 
EH  Energy Harvesting 
ELP  Exponentially-weighted algorithm with Linear Programming 
EM  Expectation Maximization 
eMBB  enhanced Mobile Broadband 
ERM  Empirical Risk Minimization 
EXP3  EXPonential weights for EXPloration and EXPloitation 
FANET  Flying Ad Hoc Network 
FDA  Fisher Discriminant Analysis 
FDI  False Data Injection 
FSMC  Finite State Markov Channel 
GMM  Gaussian Mixture Model 
HetNet  Heterogeneous Network 
HMM  Hidden Markov Model 
ICA  Independent Component Analysis 
IEEE  Institute of Electrical and Electronics Engineers 
IoT  Internet of Things 
ITS  Intelligent Transportation System 
KNN  K-Nearest Neighbors 
LED  Light Emitting Diode 
LOS  Line of Sight 
LS  Least Square 
LSTM  Long Short Term Memory 
LTE  Long Term Evolution 
M2M  Machine to Machine 
MANET  Mobile Ad Hoc Network 
MAP  Maximum a Posteriori 
MDP  Markov Decision Process 
MIMO  Multiple-Input Multiple-Output 
MLE  Maximum Likelihood Estimation 
mMTC  massive Machine Type of Communication 
NB-IoT  Narrow-Band Internet of Things 
NB-M2M  Narrow-Band Machine to Machine 
NFV  Network Function Virtualization 
NGWN  Next-Generation Wireless Network 
NLOS  Non-Line of Sight 
NOMA  Non-Orthogonal Multiple Access 
OFDM  Orthogonal Frequency Division Multiplexing 
OSPF  Open Shortest Path First 
P2P  Peer to Peer 
PCA  Principal Component Analysis 
POMDP  Partially Observable Markov Decision Process 
PU  Primary User 
QoE  Quality of Experience 
QoS  Quality of Service 
RAT  Radio Access Technology 
RBM  Restricted Boltzmann Machine 
RBF  Radial Basis Function 
RFID  Radio Frequency IDentification 
RNN  Recurrent Neural Network 
RRU  Remote Radio Unit 
SDA  Stacked Denoising Autoencoder 
SDN  Software Defined Network 
SDR  Software Defined Radio 
SE  Spectrum Efficiency 
SG  Stochastic Geometry 
SRM  Structural Risk Minimization 
STBC  Space Time Block Code 
SU  Secondary User 
SVM  Support Vector Machine 
TAS  Transmit Antenna Selection 
TCP  Transmission Control Protocol 
TD  Temporal Difference 
TOA  Time of Arrival 
UAV  Unmanned Aerial Vehicle 
UDN  Ultra Dense Network 
uRLLC  ultra-Reliable Low-Latency Communication 
V2I  Vehicle to Infrastructure 
V2V  Vehicle to Vehicle 
V2X  Vehicle to Everything 
VANET  Vehicular Ad Hoc Network 
VLC  Visible Light Communication 
VR  Virtual Reality 
WANET  Wireless Ad Hoc Network 
WBAN  Wireless Body Area Network 
WLAN  Wireless Local Area Network 
WiFi  Wireless Fidelity 
WiMAX  Worldwide Interoperability for Microwave Access 
WMAN  Wireless Metropolitan Area Network 
WPAN  Wireless Personal Area Network 
WSN  Wireless Sensor Network 
WWAN  Wireless Wide Area Network 
I Introduction
Wireless networks support a variety of services in military operations, intelligent transportation, healthcare and other domains. To elaborate briefly, next-generation mobile networks are expected to support high data rate communication [1]. As a complement, wireless sensor networks (WSN) support sustained monitoring in unmanned or hostile environments relying on widely dispersed operating sensors [2]. Furthermore, the popular WiFi network provides convenient Internet access for various devices in indoor scenarios [3]. With the rapid proliferation of portable mobile devices and the demand for a high quality of service (QoS) and quality of experience (QoE), next-generation wireless networks (NGWN) will continue to support a broad range of compelling applications, where the users benefit from high-rate, low-latency, low-cost and reliable information services.
I-A Motivation
In contrast to today's operational wireless networks, NGWNs exhibit the following evolutionary tendencies [4, 5]:

Network Scale: The NGWN is associated with a tremendous network size comprising all kinds of entities, each of which has different service capabilities as well as requirements. Furthermore, interactions among these entities result in a diverse variety of traffic, such as text, voice, audio, images, video, etc.

Network Structure: On one hand, the NGWN tends to have a self-configuring element, where each entity cooperatively completes tasks. This characteristic is termed "being ad hoc". On the other hand, the NGWN is heterogeneous and hierarchical, having different network slices (in this paper, network slices are multiple logical networks running on top of a shared physical network infrastructure and operated by a control center). Furthermore, the mobility of entities results in a complex time-variant network structure, which requires dynamic time-space association.

Network Control: NGWNs facilitate convenient reconfiguration by software-based network management, hence improving network flexibility and efficiency.
Machine learning was first introduced as a popular technique of realizing artificial intelligence in the late 1950s [6]. Machine learning algorithms can learn from training data without being explicitly programmed. They are beneficial for classification/regression, prediction, clustering and decision making [7, 8, 9], whilst relying on the following three basic elements [10]:

Model: Mathematical or signal models are constructed from training data and expert knowledge, in order to statistically describe the characteristics of the given data set. In turn, relying on these trained models, machine learning can be used for classification, prediction and decision making. In case appropriate models are not available, feature extraction or knowledge discovery techniques can be developed to achieve the same goal.

Strategy: The criteria used for training mathematical models are called strategies. The selection of an appropriate strategy is closely associated with the training data. Empirical risk minimization [11] and structural risk minimization [12] constitute a pair of fundamental strategies, where the latter can beneficially avoid the notorious "overfitting" phenomenon.

Algorithm: Algorithms are constructed to find solutions based on the predetermined model and the selected strategy, which can be viewed as an optimization process. A powerful algorithm finds a globally optimal solution with high probability at low computational complexity and storage cost.
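To make the strategy element concrete, the following minimal Python sketch contrasts empirical risk minimization with structural risk minimization on a toy ridge-regression problem. All data, feature dimensions and the regularization weight are illustrative assumptions, not taken from the surveyed literature.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data: y = 2x + noise, with few samples and many
# polynomial features, so that plain ERM can overfit.
x = rng.uniform(-1, 1, size=12)
y = 2 * x + 0.1 * rng.standard_normal(12)
X = np.vander(x, N=10, increasing=True)   # degree-9 polynomial features

def fit(X, y, lam):
    """Minimize ||Xw - y||^2 + lam * ||w||^2 (closed form).

    lam = 0  -> empirical risk minimization (ERM)
    lam > 0  -> structural risk minimization (SRM): the penalty
                restricts model complexity and combats overfitting.
    """
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

w_erm = fit(X, y, lam=0.0)
w_srm = fit(X, y, lam=1e-2)

# The regularized solution has far smaller coefficients, i.e. a
# "simpler" hypothesis in the sense of SRM.
print(np.linalg.norm(w_erm) > np.linalg.norm(w_srm))  # True
```

The regularization term is the simplest instance of the SRM principle: it trades a slightly larger empirical risk for a markedly smaller hypothesis complexity.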
In the last thirty years, machine learning has been successfully applied to the fields of computer vision [13], automatic control [14], bioinformatics [15], etc. Considering the aforementioned characteristics of the NGWN, data-driven machine learning can also become a powerful technique of network association for substantially improving the network performance. This is achieved by learning the physical reality more accurately than traditional model-driven optimization algorithms, which rely on the assumptions detailed in [16]. More specifically,
The wireless traffic data torrent may be conveniently managed by the big data processing capability of machine learning [17]. For example, in 5G systems, the traffic volume generated by on-demand information and entertainment is predicted to substantially increase over the next decade, and an average smartphone may generate 4.4 GB of data per month by the year 2020 [18, 19, 20]. The massive amount of data constitutes a large training set, which can be statistically exploited for extracting the internal correlations and for conducting classification and prediction with the aid of machine learning algorithms.

Modeling and parameter estimation play an important role in NGWNs. For instance, in massive multiple-input multiple-output (MIMO) systems, an accurate estimate of the channel state information (CSI) may critically improve the whole system's capacity. Traditional mathematical models may not be able to accurately describe the system in typical time-varying scenarios. Machine learning provides an alternative technique of adaptive modeling and parameter estimation relying on learning from historical data.
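As an illustration of the classical model-based estimation that learning techniques aim to complement, the following sketch computes the textbook least-squares CSI estimate H_hat = Y X^H (X X^H)^{-1} for a toy flat-fading MIMO link. The dimensions, pilot design and noise level are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Dimensions of a toy flat-fading MIMO link (illustrative values).
n_rx, n_tx, n_pilots = 4, 2, 16

def crandn(*shape):
    """Unit-variance circularly symmetric complex Gaussian samples."""
    return (rng.standard_normal(shape) + 1j * rng.standard_normal(shape)) / np.sqrt(2)

# Unknown channel H and known pilot matrix X (one pilot symbol per
# column); the receiver observes Y = H X + N.
H = crandn(n_rx, n_tx)
X = crandn(n_tx, n_pilots)
N = 0.01 * crandn(n_rx, n_pilots)
Y = H @ X + N

# Least-squares estimate: H_hat = Y X^H (X X^H)^{-1}.
H_hat = Y @ X.conj().T @ np.linalg.inv(X @ X.conj().T)

# Relative estimation error is small at this noise level.
print(np.linalg.norm(H - H_hat) / np.linalg.norm(H))
```

Learning-aided estimators target exactly the regime where such closed-form models break down, e.g. rapidly time-varying channels.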

NGWNs require both individual node intelligence and swarm intelligence [21]. Moreover, as for resource allocation and management, we tend to strike a tradeoff among numerous factors, such as the capacity, power consumption, latency, interference, etc., rather than only considering a single aspect. Thanks to learning from trial and error, machine learning is conducive to supporting intelligent multi-objective decision making in the context of multi-agent collaborative network management. NGWNs also have the potential to enable more effective multi-agent artificial intelligence systems.

NGWNs tend to take into account human behaviors, for example the geographic deployment of access points (AP) in an ultra dense network (UDN), where user-centric designs have been conceived for reducing the cluster-edge effects. By mimicking human intelligence, machine learning may be deemed to be the most appropriate tool for adapting the network's structure and function to the human behaviors observed [22, 23].
In recent years, a range of surveys have been conceived on machine learning paradigms. Some of them focused their scope on a specific wireless scenario, such as WSNs [24, 25], cognitive radio networks (CRN) [26, 27, 28], the Internet of Things (IoT) [29], wireless ad hoc networks (WANET) [30], self-organizing cellular networks [31], etc. Specifically, Alsheikh et al. [24] provided an extensive overview of machine learning methods applied to WSNs, which improved the resource exploitation and prolonged the lifespan of the network. Kulkarni et al. [25] surveyed some common issues of WSNs solved by computational intelligence algorithms, such as data fusion, routing, task scheduling, localization, etc. Moreover, Bkassiny et al. [26] investigated decision-making and feature classification problems solved by both centralized and decentralized learning algorithms in CRNs operating in a non-Markovian environment. Gavrilovska et al. [27] studied the nature of the CRN's capability of reasoning and learning. Park et al. [29] reviewed a range of learning aided frameworks designed for adapting to the heterogeneous resource-constrained IoT environment. Forster [30] portrayed the advantages of using machine learning for the data routing problem of WANETs. Furthermore, a detailed literature review of the past fifteen years of machine learning techniques applied to self-configuration, self-optimization and self-healing was provided by Klaine et al. [31].
Some of the literature was restricted to a specific application [32, 33, 34, 35], whilst other treatises considered a single learning technique [36, 37, 38, 39]. To elaborate, AlRawi et al. [32] presented an overview of the features, methods and performance enhancement of learning-assisted routing schemes in the context of distributed wireless networks. Additionally, Fadlullah et al. [33] provided an overview of the state-of-the-art in learning aided network traffic control schemes as well as in deep learning aided intelligent routing strategies, while Nguyen et al. [34] focused their attention on the machine learning techniques conceived for Internet traffic classification. Machine learning and data mining assisted cyber intrusion detection were surveyed in [35], including the complexity comparison of each algorithm and a set of recommendations concerning the best methods applied to different cyber intrusion detection problems. As for exploring learning techniques, Usama et al. [36] provided an overview of the recent advances of unsupervised learning in the context of networking, such as traffic classification, anomaly detection, network optimization, etc. Yau et al. [37] investigated the employment of reinforcement learning invoked for achieving context awareness and intelligence in a variety of wireless network applications, such as data routing, resource allocation and dynamic channel selection. The authors of [38] and [39] focused their attention on the benefit of deep learning in wireless multimedia network applications, including ambient sensing, cybersecurity, resource optimization, etc. The main contributions of the existing surveys and tutorials on machine learning aided wireless networks are contrasted with this survey in Fig. 1.
I-B Contributions
Hence, our focus is on a comprehensive survey of machine learning aided NGWNs. Inspired by the above-mentioned challenges, in this article we review the development of machine learning aided wireless networks. We commence by investigating a series of popular learning algorithms and their compelling applications in NGWNs, and then provide some specific examples based on recent research results, followed by a range of promising open issues in the design of future networks. Our original contributions are summarized as follows:

We critically review the thirty-year history of machine learning. Depending on how the training data is used, we classify machine learning algorithms into three categories, i.e. supervised learning [40], unsupervised learning [41] and reinforcement learning [42]. In addition, we highlight the family of deep learning algorithms, given their success in the field of signal processing. 
The development of wireless networks is reviewed from their inception to NGWNs. Moreover, we summarize the evolution of wireless networking techniques and characterize a variety of representative scenarios for the NGWN.

We appraise a range of typical supervised, unsupervised and reinforcement learning as well as deep learning algorithms. Moreover, their compelling applications in wireless networks are surveyed for assisting the readers in refining the motivation of using machine learning in NGWNs, all the way from the physical layer to the application layer.

Relying on recent research results, we highlight a pair of examples conceived for wireless networks, which can help the readers to gain insight into hitherto unexplored scenarios and into their applications in NGWNs.
I-C Organization
The remainder of this article is outlined as follows. In Section II, we provide a brief overview of the history of machine learning and of the development of wireless networks. In Section III, we introduce a range of typical supervised learning algorithms and highlight their compelling applications in wireless networks. In Section IV, we investigate the family of unsupervised learning algorithms and their related applications. Some popular reinforcement learning algorithms are elaborated on in Section V. Moreover, we present two examples of how these reinforcement learning algorithms can improve the performance of wireless networks. In Section VI, we introduce some typical deep learning algorithms and their applications in NGWNs. Some future research ideas and our conclusions are provided in Section VII. The structure of this treatise is summarized at a glance in Fig. 2.
II A Brief Overview of Machine Learning and Wireless Networks
II-A The Thirty-Year Development of Machine Learning
The term "machine learning" was first proposed by Arthur Samuel in 1959 [6], referring to computer systems having the capability of learning from large amounts of previous tasks and data, as well as of self-optimizing computer algorithms. Hard-programmed algorithms are difficult to adapt to dynamically fluctuating demands and constantly renewed system states. By contrast, relying on learning from previous experiences, machine learning aided algorithms are beneficial for scientific decision making and task prediction, which is achieved by constructing a self-adapting model from sample inputs. To elaborate a little further, as for the concept of "learning", Tom M. Mitchell [43] provided the widely quoted description: "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E."
Machine learning began to flourish in the 1990s [8]. Before this era, logic- and knowledge-based schemes, such as inductive logic programming, expert systems, etc., dominated the artificial intelligence scene, relying on high-level human-readable symbolic representations of tasks and logic. Thanks to the development of statistics theory and stochastic approximation, machine learning schemes regained researchers' attention, leading to a range of beneficial probabilistic models. Researchers embarked on creating data-driven programs for analyzing large amounts of data and tried to draw conclusions or to learn from the data. During this era, machine learning algorithms such as neural networks as well as kernel methods became mature. During the 2000s, researchers gradually renewed their interest in deep learning with the aid of the advances in hardware-based computational capability, which made machine learning indispensable for supporting a wide range of services and applications.
Given the development of progressive learning techniques [44], at present the research focus of machine learning has shifted from "learning being the purpose" to "learning being the method". Specifically, machine learning algorithms no longer blindly pursue imitating the learning capability of human beings; instead they focus more on task-oriented intelligent data-driven analysis. Nowadays, thanks to the abundance of raw data and to the frequent interaction between exploration and exploitation, machine learning algorithms have prospered in the fields of computer vision, data mining, intelligent control, etc. NGWNs aim for providing ubiquitous information services for users in a variety of scenarios. However, the rapid growth in the number of users and the resulting explosive growth of teletraffic data pushes the limits of network capacity. As a remedy, machine learning aided network management and control can be viewed as a cornerstone of NGWNs in view of their limited power, spectrum and cost.
II-B Classifying Machine Learning Techniques
Again, depending on how the training data is used, machine learning algorithms can be grouped into three categories, i.e. supervised learning, unsupervised learning and reinforcement learning [45, 46]. In the following, we provide a brief description of the three types of algorithms.

Supervised Learning: The algorithms are trained on a certain amount of labeled data [40]. Both the input data and its desired label are known to the computer, resulting in a data-label pair. The goal is to infer a function that maps the input data to the output label relying on the training of sample data-label pairs. Specifically, consider a set of sample data-label pairs in the form of (x_i, y_i), where x_i is the i-th sample input data and y_i represents its label. Let X denote the input data set and Y represent the output label set. Usually, these sample pairs are independent and identically distributed (i.i.d.). The learning algorithms aim for seeking a function f: X -> Y that yields the highest value of the score function S(f), hence we have f* = arg max_f S(f). As a special case, if only part of the sample data-label pairs are known to the computer and some of the desired output labels of the input data are missing, the corresponding learning algorithms are termed semi-supervised learning (in this paper, semi-supervised learning algorithms are viewed as a specific category of supervised learning algorithms, although some of the literature lists semi-supervised learning as a separate member of the machine learning family). These supervised learning algorithms can be widely used in the context of classification, regression and prediction. 
Unsupervised Learning: Relying on unlabeled input data, unsupervised learning algorithms try to explore the hidden features or structure of the data [41, 47]. Given the lack of sample data-label pairs, there is no standard accuracy evaluation for the output of unsupervised learning algorithms, which is the main difference with respect to their supervised counterparts. For analyzing the input data X, a pair of popular methods has been conceived for revealing its underlying unknown features, namely density estimation [48] as well as feature extraction [49]. To elaborate, density estimation aided methods are characterized by explicitly building statistical models of how the underlying features might create the input. By contrast, feature extraction based techniques aim for directly extracting statistical regularities, or even sometimes irregularities, from the input data set.

Reinforcement Learning: In contrast to the aforementioned two learning techniques, reinforcement learning algorithms are conceived for decision making by learning from interaction with the environment, and they are trained by the data on the basis of trial and error [42, 50]. They neither try to identify a category as supervised learning algorithms do, nor do they aim for finding hidden structures as unsupervised learning algorithms do. Specifically, at each time step t, the system or environment is in some state s_t, and the agent selects a legitimate action a_t. The system responds at the next time step by moving into a new state s_{t+1} with a certain probability influenced both by the specific action chosen as well as by the system's inherent transitions. Meanwhile, the agent receives a corresponding reward r_t from the system as time evolves. Reinforcement learning algorithms aim for learning how to map situations into actions in order to attain the maximal cumulative weighted reward within the horizon in such a closed-loop fashion.
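The closed-loop state-action-reward interaction described above can be sketched with tabular Q-learning, one of the simplest reinforcement learning algorithms, on a hypothetical five-state chain environment. The environment, step sizes and episode count are illustrative assumptions, not an example from the surveyed works.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy chain MDP: states 0..4, actions 0 (left) / 1 (right);
# reaching state 4 yields reward 1 and ends the episode.
n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma, eps = 0.5, 0.9, 0.1   # step size, discount, exploration

def step(s, a):
    s2 = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    r = 1.0 if s2 == n_states - 1 else 0.0
    return s2, r, s2 == n_states - 1

for _ in range(500):                 # episodes of trial and error
    s, done = 0, False
    while not done:
        # epsilon-greedy action selection: exploration vs exploitation
        a = int(rng.integers(n_actions)) if rng.random() < eps else int(np.argmax(Q[s]))
        s2, r, done = step(s, a)
        # Q-learning update: move Q(s, a) toward r + gamma * max_a' Q(s2, a')
        Q[s, a] += alpha * (r + gamma * np.max(Q[s2]) * (not done) - Q[s, a])
        s = s2

# The greedy policy learned purely from interaction is "always move right".
print([int(np.argmax(Q[s])) for s in range(n_states - 1)])
```

Note how no labeled data is involved: the agent shapes its policy solely from the rewards obtained during interaction.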
As an important member of the machine learning family, deep learning has been booming since 2010, because it was found to be capable of handling the soaring growth of training data volume facilitated by the rapid development of computing hardware [51, 52]. Deep learning algorithms rely on a multiplelayer “network” consisting of interconnected nodes for feature extraction and transformation, which is inspired by the biological nervous system, namely the neural network. Each layer utilizes the output of the previous layer as its input. The term “deep” refers to having multiple layers in the network. Generally, relying on the way the training data is exploited, deep learning algorithms can also be classified into deep supervised learning, deep unsupervised learning as well as deep reinforcement learning [51]. Moreover, some deep learning network architectures, such as deep neural networks (DNN) [53], deep belief networks (DBN) [54], recurrent neural networks (RNN) [55] and convolutional neural networks (CNN) [56], have had success in a range of fields including computer vision, speech recognition, etc. They have also been invoked in compelling applications of wireless networks.
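The layered structure just described, where each layer consumes the previous layer's output, can be sketched as a bare-bones feedforward network in a few lines of numpy. The layer sizes and the ReLU activation are illustrative choices rather than a specific architecture from the literature.

```python
import numpy as np

rng = np.random.default_rng(3)

def relu(z):
    return np.maximum(z, 0.0)

# A "deep" network is simply a stack of layers; sizes are arbitrary examples.
layer_sizes = [8, 16, 16, 4]          # input -> two hidden layers -> output
weights = [rng.standard_normal((m, n)) * np.sqrt(2.0 / m)
           for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n) for n in layer_sizes[1:]]

def forward(x):
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = relu(h @ W + b)               # hidden layer: affine map + nonlinearity
    return h @ weights[-1] + biases[-1]   # linear output layer

x = rng.standard_normal(8)
print(forward(x).shape)  # (4,)
```

Trained variants of this structure (with task-specific layers such as convolutions or recurrence) underpin the DNN, CNN and RNN architectures cited above.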
Fig. 3 shows the involvement of machine learning in NGWNs based on the aforementioned four categories. Below we list a variety of popular learning algorithms and highlight their applications in NGWNs.
II-C Development of Wireless Networks
Just as the terminology implies, wireless networks connect various network nodes via electromagnetic waves. Relying on their coverage, wireless networks can be roughly classified into four categories, namely wireless personal area networks (WPAN) [57], wireless local area networks (WLAN) [58], wireless metropolitan area networks (WMAN) [59] and wireless wide area networks (WWAN) [60]. Correspondingly, a family of networking standards and their variants that cover most of the physical layer specifications have been established by the IEEE 802 Working Groups, including IEEE 802.15 for WPAN, IEEE 802.11 for WLAN, IEEE 802.16 for WMAN and IEEE 802.20 for WWAN standards. Furthermore, when considering the network's functions, some popular representatives of wireless networks include cellular networks [61], WSNs [62], WANETs [63], wireless body area networks (WBAN) [64], etc.
The first wireless network, namely ALOHANET, was developed at the University of Hawaii in 1969 and came into operation in 1971; it transmitted wireless data packets over a network for the first time [65]. The first commercial wireless network was the WaveLAN product family designed by the NCR Corporation in 1986. In 1997, the first IEEE 802.11 protocol was released for WLAN [58]. Afterwards, the emergence and progress of reliable and low-cost WiFi marked the maturity of wireless networking technologies at the end of the 20th century, facilitating Internet access for a range of WiFi compatible devices, including personal computers, smartphones, etc. NGWNs aim for providing high-rate, low-latency, full-coverage and low-cost yet reliable information services. Compared to traditional wireless networks connecting humans and their devices, NGWNs are expected to interconnect everything under the umbrella of the 'Internet of Everything'. Fig. 4 demonstrates the development of wireless networks in terms of their milestone techniques.
Wireless networks have evolved from the simple client-server (CS) mode to the distributed dense multi-layer CS mode, and finally to the ad hoc peer-to-peer (P2P) mode. The decentralization of network architectures grants more freedom both to the network nodes and to their protocols, which requires more sophisticated techniques for supporting efficient and reliable implementations. Furthermore, the soaring growth of both the type and the amount of data provides a promising field of applications for machine learning algorithms, which are beneficial for self-organized and self-adaptive network architectures.
II-D Representative Techniques in NGWNs
As shown in Fig. 5, we first of all portray the representative application scenarios and techniques of NGWNs. In the following, we briefly introduce a range of compelling techniques and their development trends in NGWNs, which are summarized in Fig. 6.
II-D.1 From MIMO to Massive MIMO
The MIMO technology, relying on multiple antennas at both the transmitter and the receiver, can be viewed as a breakthrough in terms of multiplying the capacity of a radio link compared to single-transmit single-receive antenna aided wireless systems subject to a variety of cost, technology and regulatory constraints [66]. Both single-user MIMO (SU-MIMO) and multi-user MIMO (MU-MIMO) schemes have been proposed. To elaborate, multiple data streams of the same source are sent to a single user in SU-MIMO, while a transmitter simultaneously serves multiple users on the same channel resource in MU-MIMO [67, 68].
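The capacity-multiplying benefit of MIMO can be illustrated numerically via the standard equal-power-allocation capacity formula C = log2 det(I + (SNR/N_t) H H^H) for an i.i.d. Rayleigh channel. The SNR, antenna counts and number of channel realizations below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)

def mimo_capacity(n_tx, n_rx, snr_linear, n_trials=2000):
    """Average capacity (bit/s/Hz) of an i.i.d. Rayleigh MIMO channel
    with equal power allocation: C = log2 det(I + (SNR/n_tx) H H^H)."""
    caps = []
    for _ in range(n_trials):
        H = (rng.standard_normal((n_rx, n_tx))
             + 1j * rng.standard_normal((n_rx, n_tx))) / np.sqrt(2)
        G = np.eye(n_rx) + (snr_linear / n_tx) * (H @ H.conj().T)
        # slogdet is numerically safer than det for the log-determinant
        caps.append(np.linalg.slogdet(G)[1] / np.log(2))
    return float(np.mean(caps))

snr = 10 ** (10 / 10)  # 10 dB
c_siso = mimo_capacity(1, 1, snr)
c_mimo = mimo_capacity(4, 4, snr)
print(c_mimo > 2 * c_siso)  # the 4x4 link multiplies the SISO capacity
```

At high SNR, the capacity of an N x N link grows roughly linearly with N, which is the multiplexing gain that motivated the move towards massive MIMO.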
II-D.2 From D2D, M2M to IoT
In the spirit of direct communication between nearby mobile devices without traversing base stations (BS) or core networks, device-to-device (D2D) communication networks have been widely investigated in recent years, and can be deemed to be important milestones on the road towards self-organization and P2P collaboration. In D2D networks, the same resource slots can be reused both by the D2D links as well as by the cellular links, which is capable of substantially improving the network capacity. Moreover, it is potentially beneficial in terms of enhancing the energy efficiency (EE), of reducing the transmission delay and of improving the network's fairness to users [73, 74], which is also closely related to machine-to-machine (M2M) communications. The corresponding massive machine type of communication (mMTC) [75] mode of the 5G network (in 2015, the International Telecommunication Union (ITU) officially defined three application scenarios of the 5G network, i.e. enhanced mobile broadband (eMBB), massive machine type communication (mMTC) and ultra-reliable low-latency communication (uRLLC)) is capable of supporting the sensing, transmission, fusion and processing of sensory data. Furthermore, M2M is also capable of supporting the smart home [76], the smart grid [77], etc.
Aiming for "connecting everything", the IoT was first defined in 1999 for enabling objects to connect and exchange data [78]. Furthermore, the IoT allows objects to be sensed and controlled remotely, creating opportunities for direct interaction between the physical world and computer-based virtual systems, which is beneficial in terms of improving operational efficiency and of reducing human intervention. Both WSNs and M2M communications can be viewed as a part of the IoT. Although the IoT faces a range of reliability, robustness and security challenges, there is no doubt that it will make our world ever smarter [79, 80].
II-D.3 From UDN to HetNet
In order to meet the demand of supporting massive data traffic, the so-called UDN architecture has been defined, where the density of BSs or APs potentially reaches or even exceeds the density of users [81, 82]. The UDN architecture is conducive to increasing the network capacity as well as simultaneously improving the user experience. However, the interference encountered in UDNs tends to be more severe and of higher volatility than that in traditional cellular networks because of the dense deployment of BSs and APs. Hence, the joint consideration of resource allocation, interference management and traffic routing is essential for UDNs [61, 83].
Considering a wide area network scenario, heterogeneous networks (HetNet) are characterized by the employment of multiple types of radio access technologies (RAT) [84]. Upon combining macrocells, microcells, picocells [85] and femtocells [86, 87], HetNets are capable of providing a seamless wireless coverage ranging from outdoor environments to office buildings and even to underground areas by selecting another RAT when a RAT fails, and HetNets can also provide load balancing in the face of a nonuniform spatial distribution of users [88].
II-D.4 From DBS to CRAN
Compared to the traditional BS, which integrates the baseband processing unit (BBU) and the remote radio unit (RRU) in a single cabinet (in some works, the RRU is also called the remote radio head (RRH)), distributed base station (DBS) aided systems separate the BBU as well as the RRU and connect them with optical fiber. The DBS system allows more flexibility in network planning and deployment, where RRUs can be placed a few hundred meters or a few kilometres away for enhancing the network's edge coverage.
Cloud radio access networks (CRAN) can be viewed as an evolution of the aforementioned DBS system, constituting a centralized processing and cloud computing aided radio access network architecture [89]. The principle of CRAN relies on gathering the BBUs from several BSs into a centralized BBU pool, whilst allowing hundreds of RRUs to connect to the centralized BBU pool [90]. Hence, resources can be allocated to each user based on joint dynamic scheduling. By exploiting coordination and virtualization, the spectral efficiency (SE), the system's flexibility and the load balancing capability are substantially improved. Moreover, the centralized management of resources reduces the cost of the system's operation and maintenance.
II-D.5 From SDN to NFV
Software-defined networking (SDN) is employed as a programmable network architecture in order to achieve cost-effective dynamic network configuration and monitoring [91, 92]. The SDN philosophy suggests centralizing network intelligence in a single network component by decoupling the control plane and the data plane, which disassociates network control from its forwarding functions. The two planes can communicate with the aid of the OpenFlow protocol (a communication protocol that gives access to the forwarding plane of a switch or router over the network), and the network resources can be managed logically and efficiently. An SDN connects decentralized users to cloud computing through a "network pipeline" [93, 94].
Relying on IT virtualization techniques, network function virtualization (NFV) transforms the entire set of network node functions into different building blocks, which separates the networking functions from specific hardware blocks [95]. Hence, NFV is eminently suitable for service diversification and promotes the standardization of networking equipment [96]. Explicitly, NFV can be viewed as a beneficial hardwareagnostic design in the application layer of SDN architectures.
II-D6 From EH to EA
Energy harvesting (EH) is an environmentally friendly process, which captures and stores ambient energy, such as solar power, thermal energy and wind energy, for low-power wireless devices [97], especially in WSNs and WBANs.
In NGWNs, energy optimization is a significant concern motivated by mitigating climate change. However, energy consumption is related both to the network's throughput and to its entire lifetime, with a tradeoff between them. As a remedy, instead of only focusing on EH, energy awareness (EA) at every stage of the network's design and management is the most promising approach to striking a tradeoff amongst the conflicting objectives of reducing energy consumption, improving the system's throughput and prolonging its lifetime, especially in energy-constrained networks [98, 99].
II-D7 From CR to CogNet
Cognitive radio (CR) constitutes a technique that allows us to dynamically and efficiently exploit the wireless spectral resources [100, 101, 102, 103]. By relying on spectrum sensing, CR is capable of achieving dynamic spectrum access and spectrum sharing. Specifically, in the process of spectrum sensing, the secondary user (SU) detects an empty slice of the spectrum, for example based on energy detection schemes. Then, in the process of spectrum access, power control is invoked by the SU for maximizing its capacity, whilst observing the interference power constraint in order to protect the primary user (PU). As a benefit, CR dynamically and flexibly exploits the scarce wireless spectral resources, hence substantially improving the spectral efficiency [104].
In contrast to CR techniques, which only deal with the issues of physical-layer spectrum sensing and data link-layer access, cognitive networks (CogNets) are characterized by a cognitive cross-layer process driven by their end-to-end goals, where the overall network conditions are monitored, and decisions are then made based on the perceived conditions as well as on the feedback and experience gleaned from previous actions [105]. The network's cognitive capability relies on a range of advanced techniques, such as knowledge representation and machine learning, which exploit the wealth of information generated within the network for improving the network management, the resource efficiency [106] and the energy efficiency [107].
II-D8 Interference Management
Interference constitutes the fundamental limiting factor of the overall wireless system performance, hence it is a key challenge faced by designers. Therefore, substantial efforts have been dedicated to exploiting the communication channel's state information (CSI), either at the transmitter (CSIT) or at the receiver (CSIR), for mitigating the effects of interference. Hence, diverse time/frequency/space division multiple access based resource allocation schemes have been conceived for avoiding interference by creating orthogonal resource units [108, 109, 110]. Creative efforts have also been dedicated to the conception of non-orthogonal access systems, as exemplified by a large variety of cognitive radio [111] and non-orthogonal multiple access (NOMA) schemes [112] relying on sophisticated transceiver designs. Additionally, multi-antenna based techniques, such as joint/partial pre/post-coding and antenna selection, have also been proposed for ameliorating the effects of interference by exploiting the benefits of spatial diversity [113].
A closely related issue in NGWNs is interference management, which is a particularly critical task in ultra-dense networks in the face of their stringent throughput, delay and reliability specifications. Hence, sophisticated resource allocation and interference management schemes are required. Therefore, a range of machine learning algorithms have also been invoked for interference management, relying on their environmental awareness and learning capability [114, 115, 116].
II-E Multi-Objective Metrics of NGWNs
The challenging real-world optimization problems encountered in NGWNs usually have to meet multiple objectives in order to arrive at an attractive solution [117]. In contrast to conventional single-objective optimization, where we find the global optimum relying on a single metric, multi-objective optimization aims for finding the globally optimal solution relying on the notion of Pareto optimality [118]. The aim of multi-objective optimization in NGWNs is that of generating a diverse set of Pareto-optimal solutions, where by definition it is only possible to improve any of the metrics considered at the cost of degrading at least one of the others. The collection of Pareto-optimal points is referred to as the Pareto front.
In terms of metrics, the wireless community has invested decades of research efforts into making near-capacity single-user operation a reality [119], which is however only possible at the cost of an ever-increasing delay, complexity and power consumption. However, in the context of next-generation wireless communication networks, we would like to be more ambitious than 'only' optimizing the network's capacity: for delay-sensitive services we would like to reduce the latency and/or the total energy consumption, as well as to improve the system's reliability and the user's QoS. By contrast, in wireless sensor networks we may concentrate on optimizing both the connectivity and the network's lifetime, just to name a few. In this context, the family of machine-learning techniques may be viewed as an attractive set of optimization tools for finding Pareto-optimal solutions of multi-objective optimization problems in NGWNs, which tend to have a large search space. To expound a little further, every time we incorporate an additional parameter into the objective function, the search space is expanded and the surface of optimal solutions may exhibit numerous locally optimal solutions. Hence, traditional gradient-based techniques routinely fail to find the global optimum. In this context, Fig. 7 portrays some popular metrics commonly used in constructing multi-objective optimization problems in NGWNs.
III Supervised Learning in NGWN
Having covered the networking basics, in this section we introduce some rudimentary supervised learning algorithms, namely regression, K-nearest neighbors (KNN), support vector machines (SVM) and Bayes classification, including their applications in NGWN. Table I summarizes some typical applications of these four supervised learning algorithms in NGWN.
III-A Regression and Its Applications
III-A1 Methods
Regression analysis is capable of estimating the relationships among variables. Relying on modeling the functional relationship between a dependent variable (objective) and one or more independent variables (predictors), regression constitutes a powerful statistical tool for predicting and forecasting a continuous-valued objective given a set of predictors.
In regression analysis, there are three types of variables, namely the:

Independent variables (predictors): $X$;

Dependent variable (objective): $Y$;

Other unknown parameters that affect the estimated value of the dependent variable: $\beta$.

The regression function models the functional $Y$ vs. $X$ relationship perturbed by $\beta$, which can be formulated as $Y = f(X, \beta)$. Usually, we characterize this relationship in terms of a specific regression function with the aid of its probability distribution. Moreover, the approximation is often modeled as $E(Y \mid X) = f(X, \beta)$.
When conducting regression analysis, first of all we have to determine the specific form of the regression function, which relies both on the common knowledge about the dependent vs. independent variables as well as on its convenient evaluation. Based on the specific form of the regression function, regression analysis methods can be classified as ordinary linear regression [120], logistic regression [121], polynomial regression [122], etc.

In linear regression, the dependent variable is a linear combination of the independent variables and the unknown parameters. Let us assume having $n$ random training samples and $p$ independent variables, formulated as $\{y_i, x_{i1}, \ldots, x_{ip}\}_{i=1}^{n}$. Then the linear regression function can be formulated as:

$$y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + \varepsilon_i, \quad (1)$$

where $\beta_0$ is termed the regression intercept, while $\varepsilon_i$ is the error term and $i = 1, \ldots, n$. Hence, Eq. (1) can be rewritten in matrix form as $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$, where $\mathbf{y} = (y_1, \ldots, y_n)^T$ is an observation vector of the dependent variable and $\boldsymbol{\beta} = (\beta_0, \beta_1, \ldots, \beta_p)^T$, while $\boldsymbol{\varepsilon} = (\varepsilon_1, \ldots, \varepsilon_n)^T$ and $\mathbf{X}$ represents the observation matrix of the independent variables, given by:

$$\mathbf{X} = \begin{pmatrix} 1 & x_{11} & \cdots & x_{1p} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & \cdots & x_{np} \end{pmatrix}.$$
Linear regression analysis [120] aims for estimating the unknown parameter vector $\boldsymbol{\beta}$ relying on the least squares (LS) criterion. The corresponding solution can be expressed as:

$$\hat{\boldsymbol{\beta}} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}. \quad (2)$$
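To make the LS criterion concrete, the following minimal sketch fits the single-predictor special case of the solution above in pure Python; the toy data set and the function name are merely illustrative.

```python
# Least-squares fit of the simple linear model y = b0 + b1*x,
# i.e. the single-predictor special case of the LS solution.

def fit_linear(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Closed-form LS estimates: b1 = Sxy / Sxx, b0 = mean_y - b1 * mean_x
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1

# Noise-free samples drawn from y = 2 + 3x recover the coefficients exactly.
b0, b1 = fit_linear([0, 1, 2, 3], [2, 5, 8, 11])
```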
By contrast, in logistic regression [121], the dependent variable is binary. In order to facilitate our analysis, in the following we consider the case of a binary dependent variable $y \in \{0, 1\}$, for example. The goal of binary logistic regression is to model the probability of the dependent variable having the value of $1$ or $0$, given the training samples. To elaborate a little further, let the binary dependent variable $y$ depend on $p$ independent variables $\mathbf{x} = (x_1, \ldots, x_p)$. The conditional distribution of $y$ under the condition of $\mathbf{x}$ obeys a Bernoulli distribution. Hence, the probability of $y = 1$ can be expressed in the form of a standard logistic function (a common “S”-shaped function, which is the cumulative distribution function (CDF) of the logistic distribution), also termed a sigmoid function:

$$P(y = 1 \mid \mathbf{x}) = h_{\boldsymbol{\theta}}(\mathbf{x}) = \frac{1}{1 + e^{-\boldsymbol{\theta}^T \mathbf{x}}}, \quad (3)$$

where $\boldsymbol{\theta}$ represents the regression coefficient vector. Similarly, we have:

$$P(y = 0 \mid \mathbf{x}) = 1 - h_{\boldsymbol{\theta}}(\mathbf{x}). \quad (4)$$

Relying on the aforementioned definitions, for a given dependent variable the probability of its value can be compactly expressed by $P(y \mid \mathbf{x}; \boldsymbol{\theta}) = [h_{\boldsymbol{\theta}}(\mathbf{x})]^{y}[1 - h_{\boldsymbol{\theta}}(\mathbf{x})]^{1-y}$. Given a set of training samples $\{\mathbf{x}_i, y_i\}_{i=1}^{n}$, we are capable of estimating the regression coefficient vector $\boldsymbol{\theta}$ with the aid of the maximum likelihood estimation (MLE) method. Explicitly, logistic regression can be deemed to form a special case of the generalized linear model family.
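The MLE step above can be carried out numerically, for example by stochastic gradient ascent on the log-likelihood; the following minimal sketch does so for a single predictor, where the toy data, step size and epoch count are illustrative assumptions.

```python
import math

def sigmoid(z):
    # Numerically safe logistic (sigmoid) function of Eq. (3)
    if z > 60:
        return 1.0
    if z < -60:
        return 0.0
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.5, epochs=2000):
    theta0, theta1 = 0.0, 0.0          # intercept and slope coefficients
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            p = sigmoid(theta0 + theta1 * x)
            # Per-sample gradient of the log-likelihood: (y - p) * [1, x]
            theta0 += lr * (y - p)
            theta1 += lr * (y - p) * x
    return theta0, theta1

# Separable toy data: y flips from 0 to 1 around x = 2.5.
theta0, theta1 = fit_logistic([0, 1, 2, 3, 4, 5], [0, 0, 0, 1, 1, 1])
p_low = sigmoid(theta0 + theta1 * 0)   # estimated P(y = 1 | x = 0)
p_high = sigmoid(theta0 + theta1 * 5)  # estimated P(y = 1 | x = 5)
```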
Furthermore, there exist numerous other useful regression models [122, 123, 124, 125]. When the dependent variable is a polynomial function of the independent variables, we refer to it as polynomial regression [122], where the best-fit line is a curve. Moreover, ridge regression [123], least absolute shrinkage and selection operator (LASSO) regression [124] and ElasticNet regression [125] are widely applied when the independent variables are of a multicollinear nature and highly correlated. Fig. 8 demonstrates the basic flow of a regression model.

III-A2 Applications
Regression models can be used for estimating, detecting and predicting physical layer radio parameters related to wireless network scenarios. Specifically, Chang et al. [126] proposed a novel regression-aided interference model, which characterized the relationship between the SINR and the packet reception ratio, and evaluated its accuracy statistically. Based on this model, they constructed an analytic framework for striking a tradeoff between the overhead imposed and the accuracy of interference measurement attained. In [127], Umebayashi et al. used regression analysis for formulating a deterministic-stochastic hybrid model for detecting the spectrum usage by PUs, which had a reduced number of parameters and yet maintained a high detection accuracy. In [128], Al Kalaa et al. used logistic regression for estimating the likelihood of WiFi and ZigBee wireless coexistence in the context of medical devices. Furthermore, Xiao et al. [129] constructed a logistic regression-aided physical layer authentication model for detecting spoofing attacks in wireless networks without relying on a known channel model, which exhibited a high detection accuracy, despite its low computational complexity.
Regression models can also be employed for solving both estimation and detection problems in the upper layers of the seven-layer OSI model. For example, Chang et al. derived a regression-based analytical model for estimating the contention success probability considering heterogeneous sensor-traffic demands, which beneficially improved the channel's exploitation in IoT [130]. Moreover, in [131], Chen et al. employed a regression model for reconstructing the radio map with the aid of signal strength models for the path planning and UAV location design in UAV-assisted wireless networks. As a further advance, Lei et al. [132] employed a logistic regression classifier for device-free localization relying on fingerprint signals, which yielded a low localization error.
III-B KNN and Its Applications
III-B1 Methods
KNN constitutes a non-parametric instance-based learning method, which can be used both for classification and regression. Proposed by Cover and Hart in 1967, the KNN algorithm is one of the simplest of all machine learning algorithms. By relying on the distance between the object and the training samples in a feature space, the KNN algorithm determines which class the object belongs to. Specifically, in a classification scenario, an object is categorized into a specific class by a majority vote of its $K$ nearest neighbors. If $K = 1$, the category of the object is the same as that of its nearest neighbor, in which case it is termed the one-nearest-neighbor classifier. By contrast, in a regression scenario, the output value of the object is calculated as the average of the values of its $K$ nearest neighbors. Fig. 9 illustrates the unweighted KNN mechanism.
Let us assume that there are $n$ training sample pairs $(\mathbf{x}_i, y_i)$, where $y_i$ is the property value or class label of the sample $\mathbf{x}_i$, $i = 1, \ldots, n$. Typically, we use the Euclidean distance or the Manhattan distance [133] for calculating the similarity between the object $\mathbf{x}$ and the training samples. Let each sample contain $m$ different features. Hence, the Euclidean distance between $\mathbf{x}$ and $\mathbf{x}_i$ can be expressed by:

$$d(\mathbf{x}, \mathbf{x}_i) = \sqrt{\sum_{j=1}^{m} \left( x^{(j)} - x_i^{(j)} \right)^2}, \quad (5)$$

while their Manhattan distance is calculated as [133]:

$$d(\mathbf{x}, \mathbf{x}_i) = \sum_{j=1}^{m} \left| x^{(j)} - x_i^{(j)} \right|. \quad (6)$$

Relying on the associated similarity, the class label of $\mathbf{x}$ can be determined by a (possibly distance-weighted) majority vote amongst its $K$ nearest neighbors, which is formulated as:

$$\hat{y} = \arg\max_{c} \sum_{\mathbf{x}_i \in N_K(\mathbf{x})} \mathbb{1}(y_i = c), \quad (7)$$

where $N_K(\mathbf{x})$ denotes the set of the $K$ nearest neighbors of $\mathbf{x}$ and $\mathbb{1}(\cdot)$ is the indicator function.
The performance of the KNN algorithm critically depends on the value of $K$, whilst the best choice of $K$ hinges upon the training samples. In general, a large $K$ is conducive to resisting the harmful influence of noise, but it blurs the class boundary between different categories. Fortunately, an appropriate value of $K$ can be determined by a variety of heuristic techniques based on the true characteristics of the training data set.
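The classification branch of the algorithm can be sketched in a few lines of Python; the toy two-class data set and the function name are illustrative assumptions.

```python
from collections import Counter

def knn_classify(train, x, k=3):
    """train: list of (feature_tuple, label) pairs; x: the object to classify."""
    # Euclidean distance of Eq. (5) between the object and a training sample
    def dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5
    # Unweighted majority vote amongst the k nearest neighbours, cf. Eq. (7)
    neighbours = sorted(train, key=lambda s: dist(s[0], x))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

train = [((0, 0), "A"), ((0, 1), "A"), ((1, 0), "A"),
         ((5, 5), "B"), ((5, 6), "B"), ((6, 5), "B")]
```

For $K = 1$ the call reduces to the one-nearest-neighbor classifier discussed above.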
III-B2 Applications
In KNN, an object can be classified into a specific category by a majority vote of the object's neighbors, with the object being assigned to the class that is the most common one among its $K$ nearest neighbors. Hence, as a simple and efficient classification algorithm, KNN is beneficial in terms of, for example, traffic prediction [134], anomaly detection [135, 136], missing data estimation [137], modulation classification [138] and interference elimination [139].
To elaborate, for the sake of capturing the dynamic characteristics of wireless resource demands, Feng et al. constructed a weighted KNN model by learning from a large-scale historical data set generated by cellular operators' networks, which was used for exploring both the temporal and spatial characteristics of radio resources [134]. In [135], Xie et al. proposed a novel KNN-aided online anomaly detection scheme based on a hypergrid intuition in the context of WSN applications for overcoming the 'lazy-learning' problem [140], especially when the computational resources and the communication cost, quantified in terms of bandwidth and energy, were constrained. Moreover, in [136], Onireti et al. proposed a KNN-based anomaly detection algorithm for improving the outage detection accuracy in dense heterogeneous networks. As for missing data estimation, a KNN-assisted missing data estimation algorithm was conceived on the basis of the temporal and spatial correlation features of sensor data, which jointly utilized the sensor data from multiple neighbor nodes [137]. Furthermore, Aslam et al. [138] combined genetic programming and KNN in order to improve the modulation classification accuracy, which can be viewed as a reliable modulation classification scheme for the SU in cognitive radio networks. In [139], the KNN algorithm was used both for extracting the environmental interference imposed by 5G WiFi signals and for reducing the computational complexity whilst improving the performance of indoor localization.

III-C SVM and Its Applications
III-C1 Methods
Firmly rooted in mathematical theory, SVM is another supervised learning model conceived for classification and regression, relying on constructing a hyperplane or a set of hyperplanes in a high-dimensional space. The best hyperplane is the one that results in the largest margin amongst the classes. However, the training data set may often be linearly non-separable in a finite-dimensional space. To address this issue, SVM is capable of mapping the original space into a higher-dimensional space, where the training data set can be more easily discriminated.
Considering a linear binary SVM, for example, there are $n$ training samples in the form of $(\mathbf{x}_i, y_i)$, where $y_i \in \{-1, +1\}$ indicates the class label of the point $\mathbf{x}_i$. SVM aims for searching for a hyperplane having the maximum possible separation from the training samples, which best discriminates the two classes associated with $y_i = +1$ and $y_i = -1$. Here, the maximum separation implies having the maximum possible distance between the nearest point and the hyperplane. The hyperplane is represented by:

$$\mathbf{w}^T \mathbf{x} + b = 0. \quad (8)$$

Hence, we can quantify the separation of the training sample $(\mathbf{x}_i, y_i)$ as:

$$\gamma_i = y_i \left( \frac{\mathbf{w}^T \mathbf{x}_i + b}{\|\mathbf{w}\|} \right). \quad (9)$$

Moreover, we assume having the correct classification if $\mathbf{w}^T \mathbf{x}_i + b > 0$ when $y_i = +1$, while $\mathbf{w}^T \mathbf{x}_i + b < 0$ when $y_i = -1$. Because we then have $\gamma_i > 0$, a higher separation implies a more reliable classification. Again, the SVM tries to find the optimal hyperplane that maximizes the minimum separation between the training samples and the hyperplane considered. Given a set of linearly separable training samples, after normalization, the SVM-based classification can be formulated as the following optimization problem:

$$\max_{\mathbf{w}, b} \; \gamma \quad \text{s.t.} \quad y_i \left( \frac{\mathbf{w}^T \mathbf{x}_i + b}{\|\mathbf{w}\|} \right) \geq \gamma, \; i = 1, \ldots, n, \quad (10)$$

where $\gamma = \min_i \gamma_i$. After some further mathematical manipulations, problem (10) can be reduced to an optimization problem having a convex quadratic objective function and linear constraints, which can be expressed as:

$$\min_{\mathbf{w}, b} \; \frac{1}{2} \|\mathbf{w}\|^2 \quad \text{s.t.} \quad y_i \left( \mathbf{w}^T \mathbf{x}_i + b \right) \geq 1, \; i = 1, \ldots, n. \quad (11)$$

Problem (11) is a typical convex optimization problem. Taking advantage of Lagrange duality [141], we can obtain the optimal $\mathbf{w}^*$ and $b^*$.
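As an alternative to solving the Lagrange dual explicitly, a soft-margin relaxation of objective (11) can be minimized by sub-gradient descent on the hinge loss (a Pegasos-style solver); the sketch below is illustrative only, and the toy data set, regularization weight and step-size schedule are assumptions.

```python
import random

def train_linear_svm(samples, lam=0.01, epochs=500, seed=0):
    """samples: list of (x, y) with x a feature tuple and y in {-1, +1}."""
    rng = random.Random(seed)
    dim = len(samples[0][0])
    w, b, t = [0.0] * dim, 0.0, 0
    for _ in range(epochs):
        for x, y in rng.sample(samples, len(samples)):  # shuffled pass
            t += 1
            eta = 1.0 / (lam * t)                       # decaying step size
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            # Sub-gradient step on (lam/2)||w||^2 + max(0, 1 - margin)
            w = [(1.0 - eta * lam) * wi for wi in w]
            if margin < 1:
                w = [wi + eta * y * xi for wi, xi in zip(w, x)]
                b += eta * y
    return w, b

def svm_predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1

pts = [((0, 0), -1), ((1, 0), -1), ((0, 1), -1),
       ((3, 3), 1), ((4, 3), 1), ((3, 4), 1)]
w, b = train_linear_svm(pts)
```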
Again, if the training samples are linearly non-separable, SVM is capable of mapping the data to a high-dimensional feature space in which they are linearly separable with a high probability. This may result in a nonlinear classification or regression in the original space. Fortunately, kernel functions play a critical role in avoiding the “curse of dimensionality” in the above-mentioned dimensionality-ascending procedure [142, 143]. To elaborate a little further, given the original input samples $\mathbf{x}$ and $\mathbf{z}$, we may be interested in learning some features $\phi(\mathbf{x})$. The corresponding kernel function is defined as:

$$K(\mathbf{x}, \mathbf{z}) = \phi(\mathbf{x})^T \phi(\mathbf{z}). \quad (12)$$

Fortunately, even though the high-dimensional feature mapping $\phi(\cdot)$ may be expensive to calculate, the kernel function, calculated relying on the inner product, can be easily obtained after some further mathematical manipulations.
There is a variety of alternative kernel functions, such as the linear kernel, the polynomial kernel, the radial basis function kernel, the neural network kernel, etc. Furthermore, some regularization methods have been conceived in order to make SVM less sensitive to outlier points.
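The kernel trick of Eq. (12) can be verified numerically for the homogeneous second-order polynomial kernel $K(\mathbf{x}, \mathbf{z}) = (\mathbf{x}^T\mathbf{z})^2$ in two dimensions, whose explicit feature map $\phi(\mathbf{x}) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$ is known in closed form; the sketch below, including the sample points, is purely illustrative.

```python
import math

def poly_kernel(x, z):
    # K(x, z) = (x . z)^2, evaluated directly in the original 2-D space
    return (x[0] * z[0] + x[1] * z[1]) ** 2

def feature_map(x):
    # Explicit mapping phi: R^2 -> R^3 whose inner product reproduces K
    return (x[0] ** 2, math.sqrt(2.0) * x[0] * x[1], x[1] ** 2)

def inner(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

x, z = (1.0, 2.0), (3.0, -1.0)
direct = poly_kernel(x, z)                      # cheap: one 2-D inner product
lifted = inner(feature_map(x), feature_map(z))  # expensive route via phi
```

The two routes agree, which is precisely why the explicit high-dimensional mapping never needs to be computed during SVM training.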
The specific choice of the kernel function plays a key role in machine learning [144], hence we have to design the kernel function beneficially. The construction of kernels can generally be developed by the inner product operations of feature mappings between the input samples over a Hilbert space, whose infinite number of dimensions allows an appropriate representation of big data that exploits their geometric properties. Such a Hilbert space associated with a kernel invoked for producing functions by calculating the inner product of the feature mappings is known as the reproducing kernel Hilbert space (RKHS) [145], and has been applied in diverse learning contexts [146, 147]. The RKHS therefore serves as a critical foundation of statistical learning theory. Fig. 10 provides a graphical illustration of the kernel-based method.

On the other hand, we may rely on statistical learning theory for appropriately constructing the signal space in order to identify sufficient statistics for reliable signal detection and estimation in statistical communication theory [148]. Inspired by Parzen [145], Kailath observed that RKHS may also be beneficially invoked both for detection and estimation [149]
by exploiting the one-to-one relationship between the RKHS and finite-variance linear functionals of a random process. Corresponding to the simplest setup of signal detection in additive white Gaussian noise (AWGN) using the Karhunen–Loève expansion [150], the RKHS representation associated with the noise covariance function is capable of providing an equivalent theoretical framework of statistical communication theory. After a series of efforts invested in different areas of signal detection and estimation, Kailath and Poor [151] conceived the RKHS approach for the detection of stochastic signals.

III-C2 Applications
As mentioned before, SVM hinges on a mapping that can transform the original training data into a higher dimension, where the events to be classified become linearly separable. It then searches for the optimal separating hyperplane delineating one class from another in this higher dimension. In this spirit, as highlighted in Fig. 11, SVM-aided learning models can be used for detecting and estimating network parameters, for learning and classifying environmental signals and the user's behavior, as well as for guiding decision making concerning channel selection and anomaly detection, for example [152, 153, 154, 155, 156, 157, 158, 159, 160].
As for detecting and estimating network parameters, Feng and Chang [152] constructed a hierarchical SVM (H-SVM) structure for multi-class data estimation. The H-SVM was constructed from a number of levels, each composed of a finite number of SVM classifiers. Feng and Chang used their H-SVM model both for estimating the physical locations of nodes in an indoor wireless network and the Gaussian channel's noise level in a MIMO-aided wireless network. Thanks to its hierarchical structure, the H-SVM was capable of providing an efficient distributed estimation procedure. Furthermore, Tran et al. proposed an SVM model for estimating the geographic location of sensor nodes in WSNs whilst only relying on their connectivity information, more precisely the hop counts [153]. It yielded fast convergence in a distributed manner, and the final estimation error can be upper-bounded by any small threshold upon relying on a sufficiently large training data set. Moreover, Sun and Guo [154] conceived a least squares SVM (LS-SVM) algorithm for estimating the user's position by correlating the time-of-arrival (TOA) of radio frequency signals at the BSs, without any detailed knowledge about the base stations' locations or about the propagation characteristics.
SVM can also be used for learning a user's behavior and for classifying environmental signals considering the complex spatio-temporal context and the diverse selection of devices. In [155], Donohoo et al. studied the context-aware energy-efficiency improvement options for smart devices. These solutions may become beneficial in terms of configuring the location-specific interface for heterogeneous networks (HetNets) constituted by diverse cells. In [156], by combining SVM and Fisher discriminant analysis (FDA), Joseph et al. learned the malicious sinking behavior in wireless ad hoc networks for finding security vulnerabilities and for designing novel intrusion detection schemes. Moreover, features such as the delay between data and acknowledgement, the number of retransmissions, etc., gleaned from the MAC layer were jointly considered with those from other layers, which constituted a correlated feature set. Furthermore, Pianegiani et al. [157] proposed an SVM-based binary classification solution for classifying acoustic signals emitted by vehicles relying on spectral analysis aided feature extraction, which was beneficial in terms of improving the classification accuracy, despite reducing the implementation complexity.
As for SVM's benefit in assisting decision making, in [158] a common control channel selection mechanism was conceived for SUs during a given frame, relying on an SVM-based learning technique proposed for a cognitive radio network, which was capable of implicitly and cooperatively learning the surrounding environment in an online way. Moreover, Yang et al. [159] investigated the spoofing attack detection problem based on the spatial correlation of the received signal strength gleaned from network nodes, where a cluster-based SVM mechanism was developed for determining the number of attackers. Relying on carefully designed training data, the SVM algorithm employed further improved the accuracy of determining the number of attackers. Rajasegarar et al. [160] also investigated the malicious activity detection issues of WSNs invoking a variety of SVM-based algorithms.
III-D Bayes Classification and Its Applications
III-D1 Methods
The Bayes classifier, a popular member of the probabilistic classifier family relying on Bayes' theorem, operates by computing the a posteriori probability distribution of the objective function values given a set of training samples. As a widely-used classification method, the naive Bayes classifier can be trained, for example, conditioned on a simple but strong feature-independence assumption. Furthermore, the complexity of training a naive Bayes model is linearly proportional to the training set size.
To elaborate a little further, let the vector $\mathbf{x} = (x_1, \ldots, x_p)$ represent $p$ independent features for a total of $K$ classes $C_1, \ldots, C_K$. For each of the possible class labels $C_k$, we have the conditional probability $P(C_k \mid \mathbf{x})$. Relying on Bayes' theorem, we decompose the conditional probability to yield the form of:

$$P(C_k \mid \mathbf{x}) = \frac{P(C_k) \, P(\mathbf{x} \mid C_k)}{P(\mathbf{x})}, \quad (13)$$

where $P(C_k \mid \mathbf{x})$ is the a posteriori probability, whilst $P(C_k)$ is the a priori probability of $C_k$. Given that $x_i$ is conditionally independent of $x_j$ for $i \neq j$, we have:

$$P(C_k \mid \mathbf{x}) = \frac{P(C_k) \prod_{j=1}^{p} P(x_j \mid C_k)}{P(\mathbf{x})}, \quad (14)$$

where $P(\mathbf{x})$ only depends on the independent features, hence it can be viewed as a constant.

The maximum a posteriori probability (MAP) criterion is used as the decision-making rule of the naive Bayes classifier. Given a feature vector $\mathbf{x}$, its label $\hat{y}$ can be determined according to:

$$\hat{y} = \arg\max_{k \in \{1, \ldots, K\}} P(C_k) \prod_{j=1}^{p} P(x_j \mid C_k). \quad (15)$$
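A compact sketch of the MAP rule of Eq. (15) for discrete features follows; the toy data set, function names and the simplified add-one smoothing (taken over the feature values seen per class) are illustrative assumptions rather than part of the original exposition.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(samples):
    """samples: list of (feature_tuple, label). Returns a classifier
    implementing the MAP rule of Eq. (15) in the log-domain."""
    priors = Counter(label for _, label in samples)
    counts = defaultdict(Counter)   # (label, feature index) -> value counts
    for features, label in samples:
        for j, v in enumerate(features):
            counts[(label, j)][v] += 1
    n = len(samples)

    def classify(x):
        best, best_score = None, float("-inf")
        for label, c in priors.items():
            # log P(C_k) + sum_j log P(x_j | C_k), with add-one smoothing
            score = math.log(c / n)
            for j, v in enumerate(x):
                num = counts[(label, j)][v] + 1
                den = c + len(counts[(label, j)])
                score += math.log(num / den)
            if score > best_score:
                best, best_score = label, score
        return best

    return classify

samples = [(("sunny", "hot"), "outdoor"), (("sunny", "mild"), "outdoor"),
           (("rainy", "mild"), "indoor"), (("rainy", "cold"), "indoor")]
classify = train_naive_bayes(samples)
```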
Despite their idealized simplifying assumptions, naive Bayes classifiers have enjoyed popularity in numerous complex real-world situations, such as outlier detection [161], spam filtering [162], etc.

III-D2 Applications
Based on Bayes' theorem, Bayes classifier techniques are particularly applicable to contexts where the dimensionality of the input is high. Despite their simplicity, they can often outperform other, more sophisticated classification methods. As for their applications in wireless networks, in the following we elaborate on some typical examples in different wireless scenarios, such as antenna selection, network association, anomaly detection, indoor localization and QoE prediction.
Specifically, in [163], He et al. modeled the transmit antenna selection (TAS) problem of MIMO wiretap channels as a multi-class classification problem. Then, they used the naive Bayes-based classification scheme to select the optimal antenna for enhancing the physical layer security of the system considered. In contrast to conventional TAS schemes, simulation results showed that the proposed scheme resulted in a reduced feedback overhead at a given secrecy performance. In [164], Abouzar et al. proposed an action-based network association technique for wireless body area networks (WBANs). Relying on the level of the received signal strength indicator of the on-body link, the naive Bayes algorithm was employed to recognize the ongoing action, which was beneficial in terms of scheduling the time slot assignment in the context of fixed power allocation on various links by the sink node under a specific data rate constraint. Moreover, Klassen et al. [165] used the naive Bayes classifier for detecting anomalies in ad hoc wireless networks involving the black hole attack, the denial of service (DoS) attack and the selective forwarding attack.
The Bayes classifier can also be applied to indoor location estimation. For example, in [166], a probabilistic model was conceived for characterizing the relationship between the received signal strength and the location with the aid of the naive Bayes generative learning method, which was used for learning the parameters of an initial probabilistic model, given a limited number of labeled samples. The proposed indoor location estimation method was capable of reducing the offline calibration efforts required, whilst maintaining a high location estimation accuracy. Furthermore, as for QoE prediction, in order to evaluate the impact of different networking and channel conditions on the QoE attained in the context of different network services, Charonyktakis et al. [167] proposed a modular algorithm for user-centric QoE prediction. They integrated multiple machine learning algorithms, including the Gaussian naive Bayes classifier, and conceived a nested cross-validation protocol for selecting the optimal classifier and its corresponding optimal hyperparameter value for the sake of accurate QoE prediction.
Paper  Application  Method  Description 

[126]  interference estimation  regression  strike a tradeoff between the overhead and accuracy of interference measurement 
[127]  spectrum sensing  regression  reduce the number of parameters and maintain a high detection accuracy 
[128]  wireless coexistence  regression  estimate the likelihood of the wireless coexistence of WiFi and ZigBee 
[129]  PHY authentication  regression  detect spoofing attacks without requiring an accurately known channel model 
[130]  traffic estimation  regression  estimate the contention success probability considering sensors’ heterogeneous traffic demands 
[131]  map reconstruction  regression  reconstruct the wireless radio map for UAV path planning and location design 
[132]  wireless localization  regression  logistic regression classifier for device-free localization relying on fingerprint signals 
[134]  traffic prediction  KNN  explore both the temporal and spatial characteristics of radio resources 
[135]  anomaly detection  KNN  rely on the hypergrid intuition in the context of WSN applications 
[137]  missing data estimation  KNN  rely on the temporal and spatial correlation feature of sensor data 
[138]  modulation classification  KNN  combine the genetic programming and KNN for improving the modulation classification accuracy 
[139]  interference elimination  KNN  extract environmental interference from WiFi signal and reduce computational complexity 
[152]  data estimation  SVM  provide an efficient estimation procedure in a distributed manner 
[153]  localization estimation  SVM  yield fast convergence performance and efficiently use the communication resources 
[154]  user location  SVM  without knowledge about base station location and environmental propagation characteristics 
[155]  data prediction  SVM  provide locationspecific interface configuration for HetNets 
[156]  behavior learning  SVM  combine both the superior accuracy of SVM and fast convergence speed of FDA 
[157]  signal classification  SVM  classify acoustic signals emitted by vehicles relying on feature extraction 
[158]  channel selection  SVM  propose a control channel selection mechanism for a cognitive radio network 
[159]  attacker counting  SVM  develop a clusterbased SVM mechanism for determining the number of attackers 
[163]  antenna selection  Bayes  enhance the physical layer security relying on Bayesbased optimal antenna selection 
[164]  network association  Bayes  schedule time slot assignment and fixed power allocation under data rate constraint 
[165]  anomaly detection  Bayes  detect anomaly involving black hole attack, DoS attack and selective forwarding attack 
[166]  indoor location  Bayes  characterize the relationship between the received signal strength and location 
[167]  QoE prediction  Bayes  accurate QoE prediction by selecting optimal classifier and optimal hyperparameter values 
IV Unsupervised Learning in NGWN
In this section, we will highlight some typical unsupervised learning algorithms, such as K-means clustering [168], expectation-maximization (EM) [169], principal component analysis (PCA) [170] and independent component analysis (ICA) [171], in terms of their methodology and their applications in NGWN. Table II summarizes some typical applications of the above-mentioned unsupervised learning algorithms in NGWN.
IV-A K-Means Clustering and Its Applications
IV-A1 Methods
K-means clustering is a distance-based clustering method that aims for partitioning unlabeled training samples into K cohesive clusters, where each sample belongs to exactly one cluster. To elaborate a little further, K-means clustering measures the similarity between two samples in terms of their distance, and it has two main steps, namely assigning each training sample to one of the K clusters according to the closest distance between the sample and the cluster centroids, and then updating each cluster centroid according to the mean of the samples assigned to it. The whole algorithm is hence implemented by repeatedly carrying out the above-mentioned pair of steps until convergence is achieved.
To elaborate a little further, given a set of N samples {x_1, ..., x_N}, where x_n is a d-dimensional vector, let {C_1, ..., C_K} represent the above-mentioned cluster set, and mu_k the mean of the samples in C_k. K-means clustering intends to find an optimal cluster-based segmentation, which solves the following optimization problem:

\min_{\{C_1, \ldots, C_K\}} \sum_{k=1}^{K} \sum_{x_n \in C_k} \| x_n - \mu_k \|^2 .   (16)
However, problem (16) is a non-deterministic polynomial-time hard (NP-hard) problem [172]. Fortunately, there is a range of efficient heuristic algorithms, which converge quickly to a local optimum.
One of the popular low-complexity iterative refinement algorithms suitable for K-means clustering is Lloyd’s algorithm [173], which often yields satisfactory performance after a low number of iterations. Specifically, given K initial cluster centroids \mu_1^{(0)}, \ldots, \mu_K^{(0)}, Lloyd’s algorithm arrives at the final cluster segmentation result by alternating between the following two steps,

Step 1: In iterative round t, assign each sample to a cluster. For n = 1, ..., N and k = 1, ..., K, if we have:

\| x_n - \mu_k^{(t)} \|^2 \le \| x_n - \mu_j^{(t)} \|^2, \quad \forall j = 1, \ldots, K,   (17)

then we assign the sample x_n to the cluster C_k^{(t)}, even if it could potentially be assigned to more than one cluster.

Step 2: Update the new centroids of the new clusters formulated in iterative round t + 1 relying on:

\mu_k^{(t+1)} = \frac{1}{|C_k^{(t)}|} \sum_{x_n \in C_k^{(t)}} x_n,   (18)

where |C_k^{(t)}| denotes the number of samples in cluster C_k^{(t)} in iterative round t.
Convergence is deemed to be attained when the assignment in Step 1 is stable. Explicitly, reaching convergence means that the clusters formed in the current round are the same as those formed in the previous round. Since this is a heuristic algorithm, there is no guarantee that it converges to the global optimum. Hence, the clustering result largely relies on the specific choice of the initial clusters and of their centroids.
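The above pair of steps can be sketched in a few lines of NumPy; this is a minimal illustration of Lloyd's algorithm, using random-sample initialization (one of several common choices), rather than a production clustering routine.

```python
import numpy as np

def lloyd_kmeans(X, K, n_iter=100, seed=0):
    """Lloyd's algorithm: alternate assignment (Eq. 17) and centroid update (Eq. 18)."""
    rng = np.random.default_rng(seed)
    # initialize the K centroids with K distinct random training samples
    centroids = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    prev = None
    for _ in range(n_iter):
        # Step 1 (Eq. 17): assign each sample to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assignment = dists.argmin(axis=1)
        if prev is not None and np.array_equal(assignment, prev):
            break  # clusters identical to the previous round: converged
        prev = assignment
        # Step 2 (Eq. 18): move each centroid to the mean of its assigned samples
        for k in range(K):
            members = X[assignment == k]
            if len(members):
                centroids[k] = members.mean(axis=0)
    return centroids, assignment
```

On well-separated data the loop typically terminates after only a handful of rounds, consistent with the convergence criterion above.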
IV-A2 Applications
K-means clustering aims for partitioning N samples into K clusters, where each sample belongs to the closest cluster. The clustering algorithm proceeds in an iterative manner, where the in-cluster differences are minimized by iteratively updating the cluster centroids, until convergence is achieved.
Clustering under uncertainty or incomplete information is a common problem in wireless networks, especially in scenarios associated with numerous small traffic cells, heterogeneous large and small cell structures relying on diverse carrier frequencies, diverse time-varying teletraffic, etc. First of all, the small cells have to be carefully clustered for avoiding excessive interference using coordinated multipoint transmission. Moreover, the devices and users should be beneficially clustered for the sake of achieving a high energy efficiency, maintaining an optimal access point association, obeying an efficient offloading policy, and of guaranteeing a high network security. In [174], a mixed integer programming problem was formulated for jointly optimizing both the gateway deployment and the virtual-channel allocation for optical/wireless hybrid networks, where Xia et al. designed an efficient K-means clustering based solution for iteratively solving this problem, which beneficially reduced the delay, as well as improved the network throughput. Moreover, in [175], Hajjar et al. proposed a K-means based relay selection algorithm for creating small cells under the umbrella of an oversailing LTE macro cell within a multi-cell scenario under the constraint of low power clusters. Relying on the proposed relay selection algorithm, the total capacity was increased by reusing the frequency in each low power cluster, which had the benefit of supporting high data rate services. Additionally, Cabria and Gondra [176] proposed a so-called potential-K-means scheme for partitioning data collection sensors into clusters and then for assigning each cluster to a storage center. The proposed K-means solution had the advantage of both balancing the storage center loads and minimizing the total network cost (optimizing the total number of sensors). Parwez et al. [177] invoked both K-means clustering and hierarchical clustering algorithms for their user-activity analysis and user-anomaly detection in a mobile wireless network, which verified the genuine identity of users in the face of their dynamic spatiotemporal activities. Furthermore, El-Khatib [178] designed a K-means classifier for selecting the optimal set of features of the MAC layer bearing in mind the specific relevance of each feature, which beneficially improved the accuracy of intrusion detection, despite reducing the learning complexity.

Clustering can also be used in signal detection for the sake of both reducing the detection complexity and improving the energy efficiency attained. In [179], the K-means clustering algorithm was invoked in a blind transceiver, where the training process was completely dispensed with at the transmitter for reducing its energy dissipation, since no pilot power was required. Furthermore, Zhao et al. [180] conceived an efficient K-means clustering algorithm for optical signal detection in the context of burst-mode data transmission.
IV-B EM and Its Applications
IV-B1 Methods
The EM algorithm is an iterative method conceived for finding the maximum likelihood estimate of the parameters of a statistical model. Typically, in addition to unknown parameters whose existence has been ascertained, the statistical model also has some latent variables. In this scenario it is an open challenge to derive a closed-form solution, because we are unable to find the derivatives of the likelihood function with respect to all the unknown parameters and latent variables. The iterative EM algorithm consists of two steps, as shown in Fig. 12. During the expectation step (E-step), it calculates the expected value of the log-likelihood function conditioned on the current parameter estimate and the latent variables, while in the maximization step (M-step), it updates the parameters by maximizing the specific log-likelihood expectation function considered.
More explicitly, upon considering a statistical model with observable variables X and latent variables Z, the unknown parameters are represented by theta. The log-likelihood function of the unknown parameters is given by:

L(\theta; X) = \log p(X \,|\, \theta) = \log \sum_{Z} p(X, Z \,|\, \theta).   (19)
Hence, the EM algorithm can be described as follows [169]:

E-step: Calculate the expected value of the log-likelihood function under the current estimate \theta^{(t)} of the parameters, i.e.

Q(\theta \,|\, \theta^{(t)}) = \mathbb{E}_{Z | X, \theta^{(t)}} \left[ \log p(X, Z \,|\, \theta) \right].   (20)
M-step: Maximize Eq. (20) with respect to theta for generating an updated estimate of theta, which can be formulated as:

\theta^{(t+1)} = \arg\max_{\theta} Q(\theta \,|\, \theta^{(t)}).   (21)
The EM algorithm plays a critical role in parameter estimation for many of the popular statistical models, such as the Gaussian mixture model (GMM) and the hidden Markov model (HMM), which are beneficial both for clustering and for prediction.
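For the GMM in particular, the E-step and M-step above take a simple closed form. The following one-dimensional sketch (with hypothetical data, unrelated to the systems of [181]–[186]) alternates between computing the posterior responsibilities and re-estimating the mixture parameters:

```python
import numpy as np

def em_gmm_1d(x, K=2, n_iter=100, seed=0):
    """EM for a 1-D Gaussian mixture: E-step (Eq. 20) followed by M-step (Eq. 21)."""
    rng = np.random.default_rng(seed)
    weights = np.full(K, 1.0 / K)                   # mixing coefficients
    means = rng.choice(x, size=K, replace=False)    # initialize at random samples
    variances = np.full(K, x.var())                 # broad initial variances
    for _ in range(n_iter):
        # E-step: posterior responsibility of component k for each sample
        dens = (np.exp(-0.5 * (x[:, None] - means) ** 2 / variances)
                / np.sqrt(2 * np.pi * variances))
        resp = weights * dens
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters by maximizing the expected log-likelihood
        Nk = resp.sum(axis=0)
        weights = Nk / len(x)
        means = (resp * x[:, None]).sum(axis=0) / Nk
        variances = (resp * (x[:, None] - means) ** 2).sum(axis=0) / Nk + 1e-9
    return weights, means, variances
```

Each iteration is guaranteed not to decrease the likelihood, although (as with K-means) only a local optimum is reached in general.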
IV-B2 Applications
The EM model can be readily invoked for a variety of parameter learning and estimation problems routinely encountered in wireless networks. Specifically, Wen et al. [181] estimated both the channel parameters of the desired links in a target cell and those of the interfering links in the adjacent cells relying on constructing a GMM, which was estimated with the aid of the EM algorithm. Choi et al. [182] modeled the cognitive radio system as an HMM, where the secondary users (SUs) estimated the channel parameters, such as the primary user’s (PU) sojourn time, signal strength, etc., based on the standard EM algorithm. Moreover, Assra et al. [183] also adopted the EM algorithm to jointly estimate the unknown channel frequency-domain responses as well as the noise variance and to detect the PU’s signal in a cooperative wideband cognitive system, which was shown to converge to the upper-bound solution based on maximum likelihood estimation under the idealized assumption of having perfect channel parameter estimation. Additionally, Zhang et al. [184] proposed an EM aided joint symbol detection and channel estimation algorithm for MIMO-OFDM systems in the presence of frequency selective fading, which provided a distribution-estimate for both the hidden symbols and the unknown channel parameters in an iterative manner. Li and Nehorai [185] built an asynchronous state-space model for connecting asynchronous observations with the most likely target state transition in the context of multi-sensor WSNs. Then, they adopted the EM algorithm for jointly estimating the sequential target state as well as the network’s synchronization state under the assumption of knowing the temporal order of the sensor clocks. Furthermore, Zhang et al. [186] used a variational EM iterative algorithm to recover the transmitted signals and to identify the active users in a low-activity code division multiple access based M2M communication system without knowledge of the user activity factor.
The EM algorithm can also be invoked for target or source localization, which can be viewed as a joint sparse signal recovery and parameter estimation problem [187] [188].
IV-C PCA & ICA and Their Applications
IV-C1 Methods
PCA and ICA constitute sophisticated dimensionality reduction methods in machine learning, which are capable of reducing both the computational complexity and the storage requirements.
PCA utilizes an orthogonal transformation for converting a set of potentially correlated features of the training samples into a set of uncorrelated features, which are termed the “principal components”. The number of principal components is expected to be lower than the number of the original features of the training samples, which hence provides a more compact representation of the original samples. More explicitly, fewer principal components can be used for representing the original samples in the transformed domain. In PCA, the first principal component has the largest variance, which indicates that it encapsulates most of the information of the original features, provided that these features were correlated. Similarly, each succeeding component has the next highest variance. These principal components can be generated by invoking the eigenvectors of the normalized covariance matrix.
Specifically, let us consider N training samples x_1, ..., x_N, where each x_n is composed of d different features. Let us first preprocess the samples by normalizing their mean and variance. Given a unit vector u, u^T x_n can be interpreted as the length of the projection of x_n onto the direction u. PCA attempts to maximize the variance of the projections, which is formulated as:

\max_{u : \| u \| = 1} \; \frac{1}{N} \sum_{n=1}^{N} (u^T x_n)^2 = u^T \Sigma u.   (22)
Given the covariance matrix \Sigma = \frac{1}{N} \sum_{n=1}^{N} x_n x_n^T, the solution of problem (22) is given by the principal eigenvector of the covariance matrix. If we denote the top k eigenvectors of \Sigma by u_1, ..., u_k, a dimensionality-reduced representation of x_n can be formulated as:

z_n = [\, u_1^T x_n, \; u_2^T x_n, \; \ldots, \; u_k^T x_n \,]^T,   (23)

where the elements of z_n are the first k principal components of the training sample x_n.
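The eigen-decomposition route of Eqs. (22)–(23) can be sketched as follows. This is a minimal illustration (mean-centering only; the per-feature variance normalization mentioned above is omitted for brevity):

```python
import numpy as np

def pca(X, k):
    """Return the top-k principal components z_n of each sample and the explained variances."""
    Xc = X - X.mean(axis=0)           # normalize the mean, as in the preprocessing step
    cov = Xc.T @ Xc / len(X)          # sample covariance matrix (d x d)
    vals, vecs = np.linalg.eigh(cov)  # eigh returns eigenvalues in ascending order
    order = np.argsort(vals)[::-1]    # sort descending: largest variance first
    U = vecs[:, order[:k]]            # top-k eigenvectors = principal directions
    return Xc @ U, vals[order[:k]]    # projections (Eq. 23) and their variances
```

For strongly correlated features, almost all of the variance concentrates in the first component, which is precisely what makes the reduced representation compact.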
By contrast, ICA attempts to find a new basis for representing the original samples, which are assumed to be a linear weighted superposition of some unknown latent variables. It aims for decomposing multivariate variables into a set of additive subcomponents, which are non-Gaussian variables and are statistically independent of each other. As for the independent components, also termed the latent variables, they exhibit the maximum possible “statistical independence”, which can be characterized either by the minimization of their mutual information quantified in terms of the Kullback-Leibler divergence metric and the maximum entropy criterion, or by the maximization of what is termed in parlance as the non-Gaussianity, relying on kurtosis and negentropy, for example.
Let us consider the linear noiseless ICA model in a simple example, where the multivariate training variables are denoted by x = (x_1, ..., x_d)^T. Its latent independent component vector is represented by s = (s_1, ..., s_m)^T. Each component x_i of x can be generated by a linearly weighted sum of the independent components, i.e. we have x_i = \sum_{j=1}^{m} a_{ij} s_j, where a_{ij} is the weighting coefficient. The vectorial form of x can be expressed as:

x = \sum_{j=1}^{m} a_j s_j,   (24)

where a_j = (a_{1j}, ..., a_{dj})^T. Furthermore, let A = [a_1, ..., a_m]. Then the original multivariate training variables can be rewritten as:

x = A s,   (25)

where the unknown matrix A is referred to as the mixing matrix. ICA algorithms attempt to estimate both the mixing matrix A and the independent component vector s relying on setting up a cost function, which again either maximizes the non-Gaussianity or minimizes the mutual information. Thus, we can recover the independent component vector by computing s = W x, where W is termed the ‘unmixing’ matrix. Usually, we assume that m = d, so that the mixing matrix A is a square matrix. Moreover, a priori knowledge of the probability distribution of s is beneficial in terms of formulating the cost function.
IV-C2 Applications
As for the application of PCA and ICA in wireless networks, Shi et al. [189] utilized PCA to extract the most relevant feature vectors from finegrained subchannel measurements for improving the localization and tracking accuracy in an indoor location tracking system. Moreover, Morell et al. [190] designed an efficient data aggregation method for WSNs based on PCA amalgamated with a noneigenvector projection basis, while keeping the reconstruction error below a predefined threshold. Quer et al. [191] exploited PCA for inferring the spatial and temporal features of a range of signals monitored by a WSN. Based on this they recovered the large original data set from a small observation set.
Additionally, Qiu [192] combined ICA with PCA in a smart grid scenario for recovering smart meter data, which were jointly capable of enhancing the transmission efficiency both by avoiding the channel estimation in each frame and by eliminating wideband interference or jamming signals. A semi-blind received signal detection method based on ICA was proposed by Lei et al. [193], which additionally estimated the channel information of a multi-cell multi-user massive MIMO system. Moreover, Sarperi et al. [194] proposed an ICA based blind receiver structure for MIMO OFDM systems, which approached the performance of its idealized counterpart relying on perfect CSI. ICA was also used for digital self-interference cancellation in a full duplex system [195], which relied on a reference signal used for estimating the leakage into the receiver. More explicitly, in full duplex systems the high-power transmit signal leaks into the receiver through a nonlinear leakage path and drowns out the low-power received signal, hence its cancellation requires substantial interference rejection. Furthermore, in [196], the Boolean ICA concept was proposed based on the integration of Boolean functions of binary signals for inferring the activities of the underlying latent signal sources. Specifically, it was shown that a given set of SUs can determine the activities of a larger number of PUs.
Paper  Application  Method  Description 

[174]  gateway deployment  K-means  reduce delay and improve network throughput for optical/wireless hybrid networks 
[175]  relay selection  K-means  create small cells in an LTE macro cell with low power cluster constraint 
[176]  sensor partitioning  K-means  balance the load of storage centers and minimize the total network cost 
[177]  anomaly detection  K-means  verify the genuineness of users with spatiotemporally varying activities relying on ground truth information 
[178]  intrusion detection  K-means  improve intrusion detection accuracy and reduce the learning complexity 
[179]  blind transceiver  K-means  require no pilot duration or pilot power, thereby saving energy 
[180]  signal detection  K-means  burst-mode data transmission with an unbalanced ratio of zero and one bits 
[181]  channel estimation  EM algorithm  construct a GMM to estimate channel parameters in both target cell and adjacent cells 
[182]  PU detection  EM algorithm  SUs estimate PU’s sojourn time and signal strength relying on a HMM model 
[183]  channel state detection  EM algorithm  jointly estimate channel frequency responses, noise variance and PU’s signal 
[184]  symbol detection  EM algorithm  joint symbol detection and channel estimation for MIMOOFDM systems 
[185]  network state detection  EM algorithm  jointly estimate the sequential target state and the network synchronization state 
[186]  active user detection  EM algorithm  detect active user for the lowactivity CDMA based M2M communications 
[187]  source localization  EM algorithm  formulate localization as a joint sparse signal recovery and parameter estimation problem 
[189]  indoor location  PCA  extract relevant feature vectors from finegrained subchannel measurements 
[190]  data aggregation  PCA  limit the reconstruction error based on a noneigenvector projection basis 
[191]  data recovery  PCA  exploit PCA to extract spatial and temporal features of real signals 
[192]  data recovery  ICA & PCA  enhance transmission efficiency by avoiding channel estimation and eliminating jamming signals 
[193]  channel estimation  ICA  differentiate and decode the received signal, and estimate the channel information 
[194]  blind receiver  ICA  yield an ideal performance close to that with perfect CSI 
[195]  interference cancellation  ICA  digital interference cancellation based on the reference signal from transmitter power amplifier 
[196]  signal detection  ICA  infer the activities of latent signal sources based on the Boolean functions 
V Reinforcement Learning in NGWN
Reinforcement learning deals with an agent interacting with its environment. Three specific reinforcement learning paradigms, namely the multi-armed bandit problem, the Markov decision process (MDP) and temporal-difference (TD) learning, can be very useful for NGWN. In the following, we explore the applications of these reinforcement learning algorithms in NGWN.
V-A Multi-Armed Bandit and Its Applications
V-A1 Methods
The multi-armed bandit technique, also called the K-armed bandit, models a decision-making problem, where an agent is faced with a choice among K different actions. After each choice, the agent receives a reward drawn from a stationary probability distribution that is associated with its decision. The agent attempts to maximize its expected total reward over a series of decision-making rounds by striking a balance between consulting existing knowledge and acquiring new knowledge when optimizing its decisions. The action of referring to existing knowledge to make decisions is termed “exploitation”, while the trial of acquiring new knowledge is referred to as “exploration”. Striking a tradeoff between exploration and exploitation is also sought by other reinforcement learning algorithms, where exploitation is the plausible action for maximizing the expected reward within the current round, while exploration may produce a greater reward in the long run.
In a K-armed bandit model, the K possible actions a_1, ..., a_K yield different rewards associated with the unknowns of the problem at hand, which may have different distributions with mean values of mu_1, ..., mu_K, respectively. The agent iteratively chooses an action a^{(t)} in round t and receives the corresponding reward r^{(t)}. Up to round t, the expected reward of an action a can be estimated by the sample average Q_t(a) of the rewards received when a was chosen. Upon striking a balance between exploration and exploitation, we may arrive at a simple bandit algorithm as follows, for example. In each decision-making round, we greedily opt for the action maximizing Q_t(a) with probability 1 - epsilon, whilst riskily embarking on a random action selection with probability epsilon, where epsilon is the probability of a brave attempt at exploring new knowledge.
In contrast to the above-mentioned greedy bandit algorithm, there are also more complex bandit algorithms, such as the gradient aided bandit algorithm, the associative-search bandit, the non-stationary bandit, etc. [42]. Moreover, the multi-armed bandit problem can be extended into a multi-play and multi-armed bandit problem [197], where the reward of each agent depends on the others’ actions, and each agent tries to find its optimal decision by predicting the future actions of the other agents relying on their previous decision-making strategies.
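The simple epsilon-greedy bandit described above can be sketched as follows, with three hypothetical Bernoulli-reward arms standing in for, e.g., candidate channels or APs (the arm means are invented for illustration):

```python
import random

def epsilon_greedy(arm_means, T=10000, epsilon=0.1, seed=0):
    """K-armed Bernoulli bandit with an epsilon-greedy policy and sample-average estimates."""
    rng = random.Random(seed)
    K = len(arm_means)
    counts, Q = [0] * K, [0.0] * K
    total = 0.0
    for _ in range(T):
        if rng.random() < epsilon:
            a = rng.randrange(K)                   # exploration: try a random arm
        else:
            a = max(range(K), key=lambda i: Q[i])  # exploitation: current best estimate
        r = 1.0 if rng.random() < arm_means[a] else 0.0
        counts[a] += 1
        Q[a] += (r - Q[a]) / counts[a]             # incremental sample mean of the reward
        total += r
    return Q, total / T
```

Over many rounds the sample averages Q_t(a) converge to the true arm means, so the greedy choice locks onto the best arm while the epsilon fraction of rounds keeps exploring.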
V-A2 Applications
As mentioned before, multi-armed bandit based techniques are capable of dealing with the uncertainties encountered in NGWNs, which arise because of the limited prior knowledge and the resource-thirsty feedback. Moreover, they are capable of beneficially modelling the selfishness of the users and the decision conflicts among them during the decision-making process. Hence, multi-armed bandit based algorithms have become powerful tools for rational decision making in wireless networks, both for distributed users and APs as well as for the central control center. Specifically, Maghsudi et al. [198] proposed a small cell activation scheme relying on the multi-armed bandit philosophy, given only limited information about the available energy of the small cell BS as well as about the number of users to be served. The overall heterogeneous network’s throughput was improved with the aid of an energy-efficient small cell on-off switching regime controlled by the macro BS, while the inter-cell interference level was reduced. Another compelling application of the multi-armed bandit regime in heterogeneous networks is constituted by dynamic network selection in the context of uncertain heterogeneous network state information. Wu et al. [199] formulated the optimal network selection problem as a continuous-time multi-armed bandit problem considering diverse traffic types. Moreover, the network access cost function and the QoE reward were defined as the metrics of evaluating the proposed network selection schemes. In [200], given the time-varying and user-dependent fading channels of wireless peer-to-peer (P2P) networks, a multi-armed bandit aided optimal distributed transmitter scheduling policy was conceived for multi-source multimedia transmission, which was beneficial in terms of maximizing the data transmission rate and reducing the related power consumption in the light of the realistic energy constraints of wireless mobile devices.
In addition to transmitter scheduling, Maghsudi and Stańczak applied the covariate multi-armed bandit regime [201] for solving the relay selection problem in wireless networks, where the geographical location of the relay nodes was assumed to be known by the source node, but no knowledge was assumed about the corresponding fading gains. The proposed covariate multi-armed bandit model is capable of dealing with the exploitation-exploration dilemma of the relay selection process. Lee et al. [202] proposed a greedy multi-armed bandit based framework for exploiting the gains provided by frequency diversity in WiFi channels. They struck a tradeoff between the achievable gain stemming from frequency diversity and the resource consumption imposed by channel estimation and coordination.
Given the open broadcast nature of the wireless channel environment and the access contention mechanism among multi-priority users, multi-armed bandit based techniques have played a special role in cognitive networks [203, 204, 205, 206, 207, 208]. For example, Zhao et al. [203] formulated a multi-armed restless bandit model for opportunistic multi-channel access, which approached the maximum attainable throughput by accurately predicting which channel is likely to become idle next. In [206], a channel selection scheme was investigated, which was capable of adapting to the link quality and hence of finding the optimal channel for avoiding interference and deep fading. Moreover, Gwon et al. [204] and Zhou et al. [208] further considered the choice of access strategy in the presence of both legitimate desired users and jamming cognitive radio nodes, which was resilient to adaptive jamming attacks of different strengths, spanning from the near no-attack to the full-attack regime across the entire spectrum. In contrast to only sensing and accessing a single channel, considering the correlated rewards of different arms, a sequential multi-armed bandit regime was conceived by Li et al. [205] for identifying multiple channels to be sensed in a carefully coordinated order. Furthermore, Avner and Mannor [207] studied multi-user coordination in cognitive networks, where each user’s successful channel selection relies both on the channel state as well as on the decisions of the other users.
V-A3 An Example
Visible light communication (VLC) systems have the compelling benefit of a wide unlicensed communication bandwidth as well as of innate security in downlink (DL) transmission scenarios, hence they may find their way into the construction of NGWNs. However, considering the limited coverage and dense deployment of light-emitting diodes (LEDs), traditional network association strategies are not readily applicable to VLC networks. Hence, by exploiting the power of online learning algorithms, the authors of [209] focused their attention on sophisticated multi-LED access point selection strategies conceived for hybrid indoor LiFi-WiFi communication systems with the aid of a multi-armed bandit model. Explicitly, since light-fidelity (LiFi) VLC transmissions are less suitable for uplink (UL) transmissions, a classic WiFi UL was used in this study.
To elaborate, in the indoor VLC system, the communication between the devices and the backbone network relies on the VLC DL as well as on the RF WiFi UL, which can hence be viewed as a hybrid LiFi-WiFi network. In the system model, it is assumed that there are L low-energy LED lamps in the indoor space considered. Moreover, regardless of their positions, the mobile devices are capable of accessing any of the indoor LED lamps and of downloading packets from the Internet via VLC. When a decision round is due, the access control strategy obeys the decision probability distribution p = (p_1, ..., p_L), with \sum_{i=1}^{L} p_i = 1, where p_i denotes the probability of accessing the i-th LED lamp. Furthermore, the service time of each LED lamp obeys the negative exponential distribution with a departure rate mu, while the interval between system access requests likewise obeys the negative exponential distribution with an arrival rate lambda. The VLC DL channel is characterized by a diffuse link, where the light beam is radiated within a certain angle. Thus, the indoor VLC channel can be modelled by combining the line of sight (LOS) path (Fig. 13 (a)) as well as a single one-hop reflected path (Fig. 13 (b)).

The expectation of the accumulated reward gap (regret) function is defined as the metric for characterizing the performance of the AP selection scheme, which represents the difference between the maximum theoretical reward and the actually acquired reward after T sequential decision-making rounds relying on the system’s decision probability distribution, and which is formulated as:

G(T) = \max_{i} \mathbb{E} \left[ \sum_{t=1}^{T} r_i(t) \right] - \mathbb{E} \left[ \sum_{t=1}^{T} r_{a_t}(t) \right],   (26)

where r_{a_t}(t) denotes the user rate achieved in the t-th decision round under the access decision taken at instant t, with a_t being the actual access decision.
Furthermore, in [209] a pair of multi-armed bandit learning techniques, i.e. the ‘exponential weights for exploration and exploitation’ (EXP3) as well as the ‘exponentially-weighted algorithm with linear programming’ (ELP), were advocated for updating the AP-assignment decision probability distribution at each time instant for the sake of improving the link throughput, i.e. of reducing the accumulated reward gap of (26). More explicitly, in contrast to the trial-and-error EXP3 algorithm, the ELP based AP selection algorithm was constructed for taking into account both the partially observed conditions of the APs as well as the network topology.
The theoretical upper bound of the expected value of the accumulated reward gap function of the EXP3- and ELP-based multi-armed bandit learning algorithms was also derived in [209]. In Fig. 14 and Fig. 15, the normalized throughput of the selected VLC links and of the whole system relying on the EXP3-based, ELP-based as well as on random LED AP selection schemes was compared. By contrast, the random selection scheme granted an identical decision probability of accessing each of the LED lamps at each decision-making instant. The departure of each downloading service was assumed to obey a negative exponential distribution, and the initial number of downloading services supported by each lamp was chosen at random. Upon increasing the number of decision rounds, the EXP3- and ELP-based selection schemes attained a higher accumulated normalized throughput than random selection. Furthermore, relying on more neighbor observation information as well as by exploiting the connections among the LED lamps, the ELP-based AP-selection scheme was shown to outperform that based on EXP3.
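For illustration only, a generic EXP3 update of the access-probability distribution can be sketched as below. The three "LED AP" arms and their Bernoulli rates are invented for the example and do not reproduce the system model or parameters of [209]:

```python
import math
import random

def exp3(arm_means, T=5000, gamma=0.1, seed=0):
    """EXP3: exponentially weighted arm selection with importance-weighted reward estimates."""
    rng = random.Random(seed)
    K = len(arm_means)
    weights = [1.0] * K
    picks = [0] * K
    for _ in range(T):
        total_w = sum(weights)
        # mix the weight-proportional distribution with uniform exploration
        probs = [(1 - gamma) * w / total_w + gamma / K for w in weights]
        a = rng.choices(range(K), weights=probs)[0]
        picks[a] += 1
        r = 1.0 if rng.random() < arm_means[a] else 0.0  # reward in [0, 1]
        x_hat = r / probs[a]                             # importance-weighted estimate
        weights[a] *= math.exp(gamma * x_hat / K)
        m = max(weights)
        weights = [w / m for w in weights]               # rescale to avoid overflow
    return probs, picks
```

The importance weighting ensures an unbiased reward estimate for every arm even though only one arm is observed per round, which is what makes EXP3 suitable for the adversarial (non-stochastic) bandit setting.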
V-B MDP & POMDP and Their Applications
V-B1 Methods
The classic Markov decision process (MDP) [210] constitutes a framework of making decisions in the context of a discretetime stochastic environment of Markov state transitions, which provides the decision maker with the optimal actions to opt for at each state. It has been used in a wide range of disciplines, especially in automatic control [211]. The goal of the decision maker, generally speaking, is to maximize the cumulative reward received over a long run and to find the corresponding optimal policy which represents a mapping from each state to the specific probabilities of choosing each legitimate action.
In an MDP model, the system’s state transition follows the Markovian property, whereby the system’s response at time epoch t + 1 depends exclusively on the current state and on the agent’s action at time epoch t. Mathematically, at time epoch t, the system is in a certain state s_t in S, where the agent selects a legitimate action a_t in A(s_t) that is available in the state s_t. As a result, the system then acts at the next time epoch by moving into a new state s_{t+1} relying on the system’s state transition probability P(s_{t+1} | s_t, a_t). At the same time, the decision maker receives the corresponding reward r_t = R(s_t, a_t). The associated value function is then defined for quantifying how well the agent carries out its actions over a long run commencing from the initial state s, which can be formulated as:

V^{\pi}(s) = \mathbb{E}_{\pi} \left[ \sum_{t=0}^{\infty} \gamma^t r_t \,\middle|\, s_0 = s \right],   (27)

where gamma in [0, 1) represents the discount factor and the mapping pi(a | s) represents the probability of opting for action a in the state s. Hence, the optimal policy pi* can be formulated by maximizing the value function considered, i.e. we have V^*(s) = \max_{\pi} V^{\pi}(s). The maximization of the value function can be reformulated as an iterative equation with the aid of Bellman’s optimality theorem [212], which is given by:
(28)  
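The Bellman recursion of (28) can be iterated to a fixed point and the optimal policy read off greedily. The sketch below does this for a hypothetical two-state, two-action MDP; the toy transition and reward tables are purely illustrative.

```python
def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Iterate Bellman's optimality equation, Eq. (28), to convergence.

    P[s][a] lists (next_state, probability) pairs; R[s][a] is the reward.
    """
    V = [0.0] * len(P)
    while True:
        delta = 0.0
        for s in range(len(P)):
            best = max(R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a])
                       for a in range(len(P[s])))
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            break
    # Extract the greedy optimal policy pi*(s) = argmax_a [r + gamma E V]
    policy = [max(range(len(P[s])),
                  key=lambda a: R[s][a] + gamma * sum(p * V[s2]
                                                      for s2, p in P[s][a]))
              for s in range(len(P))]
    return V, policy

# Toy MDP: in state 0, action 1 moves to the rewarding state 1,
# where action 0 keeps collecting a unit reward.
P = [[[(0, 1.0)], [(1, 1.0)]],
     [[(1, 1.0)], [(0, 1.0)]]]
R = [[0.0, 0.0], [1.0, 0.0]]
V, policy = value_iteration(P, R)   # V converges near [9.0, 10.0]
```

With $\gamma = 0.9$, the fixed point satisfies $V^{*}(1) = 1/(1-\gamma) = 10$ and $V^{*}(0) = \gamma V^{*}(1) = 9$, so the extracted policy moves to state 1 and stays there.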
By contrast, as an extension of the MDP, the partially observable Markov decision process (POMDP) only relies on partial knowledge of the hidden Markov system, which makes it eminently suitable for scenarios where the agent cannot directly observe the underlying system's state transitions. Hence, the agent has to construct belief states and the associated belief transition function by relying on a set of observations instead of the real system states. In a nutshell, the POMDP framework can be formulated as a quintuple $\langle S, B, A, \tau, R \rangle$, i.e.

System's State $S$: The system's state represents the system's legitimate states;

Belief State $B$: The belief state quantifies the degree of similarity between each of the system's legitimate states and the state estimated by the agent;

Action $A$: The action denotes the specific action that can be selected in the given state;

Belief Transition Function $\tau$: The belief transition function represents the probability of the belief state traversing from $b$ to $b'$, conditioned on selecting action $a$;

Reward Function $R$: The reward function quantifies the immediate reward received upon performing the selected action.
Similarly, the optimal policy can be obtained by solving the optimization problem of:

$$\pi^{*} = \arg\max_{\pi}\, \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^t\, r(b_t, a_t)\right]. \quad (29)$$
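The belief state of the quintuple above is maintained by a Bayes-filter update: after taking action $a$ and receiving observation $o$, the new belief is $b'(s') \propto O(o\,|\,s',a) \sum_{s} P(s'\,|\,s,a)\, b(s)$. A minimal sketch, with a hypothetical two-state example and an observation model of my own choosing:

```python
def belief_update(b, a, o, P, O):
    """Bayes-filter belief update: b'(s') ∝ O(o|s',a) · Σ_s P(s'|s,a) b(s).

    P[s][a][s2] is the hidden state-transition probability, while
    O[s2][a][o] is the probability of receiving observation o in state s2.
    """
    n = len(b)
    unnormalized = [O[s2][a][o] * sum(P[s][a][s2] * b[s] for s in range(n))
                    for s2 in range(n)]
    norm = sum(unnormalized)
    return [x / norm for x in unnormalized]

# Two hidden states, one action, and an observation that matches the true
# state with probability 0.8: a single observation sharpens a uniform belief.
P = [[[1.0, 0.0]], [[0.0, 1.0]]]   # static hidden state
O = [[[0.8, 0.2]], [[0.2, 0.8]]]
b1 = belief_update([0.5, 0.5], a=0, o=0, P=P, O=O)   # -> [0.8, 0.2]
```

Because the belief is a sufficient statistic of the observation history, planning can then proceed on the belief space exactly as (29) suggests.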
VB2 Applications
As another family of important decision-making tools, different from the multi-armed bandit solutions, MDP/POMDP techniques first have to model the environment relying on either fully or partially observed knowledge. To elaborate a little further, Massey et al. [213] proposed an MDP-based downlink service scheduling policy for wireless service providers. Considering the time-sensitive nature of wireless teletraffic patterns, their proposed scheduling policy was capable of maximizing the expected reward for the wireless service provider in the context of a multiplicity of services. In [214], Tang et al. resorted to the MDP approach for enhancing a basic node-misconduct detection method, where a novel reward-penalty function was defined as a function of both correct and wrong decisions. The resultant adaptive node-misconduct detector maximized this reward-penalty function in diverse network states. Moreover, Kong et al. conceived a discrete-time MDP (DTMDP) aided mechanism [215] for dynamically activating and deactivating certain resources of the BS in the context of time-varying network traffic. More explicitly, at each decision round, the DTMDP had the option of activating a new resource module, deactivating the currently active resource module, or performing no operation. The proposed switching mechanism reduced the power consumption, i.e. improved the energy efficiency of the BS.
As a further development, relying on the POMDP paradigm, Tseng et al. [216] designed a cell selection scheme for improving the network's capacity, where the full cell-loading status was not observable. Hence, it predicted the unavailable cell-loading information of the set of non-serving base stations and then took actions for improving various performance metrics, including the system's capacity, the handover time as well as the mobility management as a whole. Moreover, the belief state was defined for representing the state uncertainty in terms of the statistical probability of a cell's specific loading state. The simulation results of Tseng et al. [216] showed that their solution outperformed the conventional signal-strength-aided and load-balancing-based methods. In order to save the energy of sensors, Fei et al. [217] proposed a POMDP-aided K-sensor scheduling policy, which guaranteed the sensors' high-quality coverage and reduced the total energy consumption. Similarly, by striking a tradeoff between the detection performance and the energy consumption, Zois et al. [218] designed a POMDP-aided sensor-node selection scheme for WBANs by maximizing the system's lifetime as well as optimizing the physical-state detection accuracy. The main goal of the sensor-node selection was to devise a schedule under which the sensors alternated between the active state and the dormant state relying on the specific network activity. Relying on the decentralized POMDP (DEC-POMDP), Pajarinen et al. [219] proposed a MAC solution, which promptly adapted both to the spatial and temporal opportunities facilitated by the wireless network dynamics, yielding an increased throughput and a reduced latency compared to traditional carrier-sense multiple access with collision avoidance (CSMA/CA) methods. Here, the POMDP tackled the uncertainty both in the environment's evolution and in the associated inaccurate observations.
Thanks to cross-layer optimization, more information can be gleaned from the lower layers for enhanced network-condition estimation. Specifically, Xie et al. [220] used the POMDP model for solving the frame-size selection problem of the ubiquitous transmission control protocol (TCP) with the objective of improving the total estimated throughput by striking a tradeoff between the contention probability and the back-off time based on the current network condition. Furthermore, Michelusi and Mitra [221] conceived a cross-layer framework for jointly optimizing the spectrum-sensing and access processes of cognitive wireless networks with the objective of maximizing the throughput of the SU under a strict constraint on the maximal performance degradation imposed on the PU. Furthermore, the high complexity of the POMDP formulation was mitigated by a low-dimensional belief representation, which was achieved by minimizing the Kullback-Leibler divergence defined in [222].
VB3 An Example
As shown in Fig. 16, the 'super-WiFi' network concept was originally proposed for nationwide Internet access in the USA. However, a traditional mains power supply is not necessarily ubiquitous in this large-scale wireless network. Furthermore, the non-uniform geographic distribution both of the BSs and of the teletraffic requires carefully considered user association. Relying on the rapidly developing energy harvesting techniques, a POMDP-based access-point selection strategy was conceived in [223] for an energy-harvesting-aided super-WiFi network.
It was assumed that both the battery states as well as the user-access states were completely observable. However, in practice the solar radiation intensity changes over time throughout the year, as influenced by the weather conditions. Furthermore, the radiation sensors have a limited sampling rate, which makes it hard to simultaneously record the solar radiation intensity and to accurately estimate the system's battery state. Fortunately, relying on historical solar-radiation observation data provided by the University of Queensland, Australia [224], over a short period of time, say within an hour, the real-time harvested solar power can be modeled as $P = \bar{P} + \delta$, where $\bar{P}$ is constant for an hour, while $\delta$ is a small perturbation. Moreover, multiple factors, such as the effective irradiation area, the clouds' distribution, the sensors' operating status, etc. may independently affect the harvested power. Relying on the central-limit theorem, the perturbation $\delta$ can be regarded as being Gaussian distributed. Hence, the distribution of $\delta$ can be written as $\delta \sim \mathcal{N}(\mu, \sigma^2)$, where $\mu$ and $\sigma$ can be learned from the harvested data set.

Moreover, a queue-based user-association state model as well as a dynamic battery state model was established. Hence, the state of the system having $N$ APs is constituted both by the user-association states and by the battery states. Let $U = [u_1, \ldots, u_N]$ denote the user-association states, while $B = [b_1, \ldots, b_N]$ represents the AP battery states. Furthermore, the super-WiFi system state can be written as a $2N$-element vector $S = [U, B]$, which includes both the APs' user-association states and the APs' battery states. Assuming the independence of each AP's two sub-states, the system's state transition probability can be expressed as:

$$P(S'|S, A) = \prod_{i=1}^{N} P(u_i'|u_i, A)\, P(b_i'|b_i, A), \quad (30)$$

where $A$ represents the users' actions in terms of which available APs they request association with.
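Because the joint transition probability factorizes across APs, the whole system state can be advanced one AP at a time. The sketch below simulates this factorized dynamic together with the Gaussian harvested-power model; the user/battery dynamics and all numeric parameters are illustrative placeholders, not values from [223].

```python
import random

def step_ap(sub_state, action, p_bar=1.0, mu=0.0, sigma=0.1, capacity=10.0):
    """One AP's sub-state transition. The battery gains the harvested power
    P = P_bar + delta with delta ~ N(mu, sigma^2), as modeled above, and
    loses energy in proportion to the number of served users (a simplifying
    assumption for illustration only)."""
    users, battery = sub_state
    users = max(0, users + action)              # a user joins (+1) or leaves (-1)
    harvest = p_bar + random.gauss(mu, sigma)   # harvested solar power sample
    battery = min(capacity, max(0.0, battery + harvest - users))
    return (users, battery)

def step_system(state, action):
    """Joint system transition: since each AP's (user, battery) pair evolves
    independently, the joint probability factorizes as in Eq. (30), so the
    whole state can be advanced AP by AP."""
    return [step_ap(sub, action) for sub in state]

state = [(1, 5.0), (0, 8.0)]            # (users, battery) per AP
state = step_system(state, action=+1)   # admit one user at every AP
```

Setting `sigma=0.0` makes the transition deterministic, which is convenient for checking the battery bookkeeping.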
Since the requesting users only have partial knowledge of the entire super-WiFi system's state, relying on the above definitions and hypotheses, the POMDP decision-making model can be constructed in terms of the quintuple $\langle S, B, A, \tau, R \rangle$ mentioned above. The POMDP formulation can be reduced to a belief MDP with the aid of the belief state vector. Therefore, the expected reward of the system relying on strategy $\pi$ after an infinite number of time slots can be written as:

$$V^{\pi}(S_0) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^t\, r(b_t, a_t)\right], \quad (31)$$

where $S_0$ is the initial system state, while $b$ is the belief state vector reflecting the grade of similarity between the current estimated state and the legitimate system state $S$. Moreover, $r(b, a)$ is the immediate reward of the system and $\gamma$ represents the discount rate. Then, the optimal strategy $\pi^{*}$ can be constructed by invoking dynamic-programming-aided iterative algorithms for maximizing the expected reward function.
Bearing in mind the large values of $N$ and of the numbers of user-association and battery states, as well as the users' rapidly fluctuating arrival and departure rates, obtaining the optimal POMDP solution may face the curse of dimensionality. In order to reduce the computational complexity, a suboptimal algorithm was proposed in [223]. Explicitly, Algorithm 2 of [223] aimed for maximizing the expectation of the system's energy function, which was defined as:

$$\Phi = \mathbb{E}\left[\sum_{i=1}^{N} \min\big(E_i + E_h^{i} - E_c^{i},\; C_i\big)\right], \quad (32)$$

where $E_i$ represents the residual energy of AP $i$, while $E_h^{i}$ is its energy harvested under the assumption that the harvested power level remains quasi-static during the information transmission interval and $E_c^{i}$ denotes the energy consumption. Finally, $C_i$ is the capacity of the AP's battery.
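A sketch of such an energy-greedy selection rule, assuming the energy function rewards the residual energy clipped at the battery capacity. This is a simplified, hypothetical stand-in for Algorithm 2 of [223], not a reproduction of it; the per-AP tuples are made-up numbers.

```python
def residual_energy(E, E_h, E_c, C):
    """Post-transmission residual energy of one AP, clipped to the battery
    capacity: min(E + E_h - E_c, C), floored at zero."""
    return max(0.0, min(E + E_h - E_c, C))

def greedy_ap_selection(aps):
    """Select the AP index maximizing the residual-energy term -- a
    simplified, energy-greedy surrogate for the suboptimal algorithm."""
    return max(range(len(aps)), key=lambda i: residual_energy(*aps[i]))

# (E_i, E_h^i, E_c^i, C_i) per AP; the second AP retains the most energy.
aps = [(2.0, 0.5, 1.0, 5.0), (4.0, 1.0, 0.5, 5.0), (1.0, 0.2, 0.8, 5.0)]
best = greedy_ap_selection(aps)   # -> 1
```

Choosing the AP with the largest expected residual energy avoids the belief-space iteration entirely, which is what makes the heuristic cheap relative to the full POMDP solution.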
The efficiency of the AP selection algorithms proposed in [223] was compared in terms of the system's access efficiency, defined as the ratio of the total number of successful access attempts to the entire simulation time. In Fig. 17 and Fig. 18, multiple APs are considered, each supporting a maximum number of admitted users and a finite number of battery states, while the users depart at a fixed rate. We may conclude from Fig. 17 that a highly loaded system renders the carrier-sense multiple access with collision detection (CSMA/CD) method almost useless once the users' arrival rate reaches a certain value. As shown in Fig. 18, the system's access efficiency recorded for all the AP selection algorithms only increases with the solar radiation intensity over a relatively small range. However, the performance of CSMA/CD, of CSMA/CA^7, as well as of the random selection algorithm remains unchanged, regardless of the increase in solar radiation intensity. Moreover, the suboptimal Algorithm 2 of [223] is capable of outperforming the POMDP method at a strong solar radiation intensity, which may be deemed to be the result of the approximations and hypotheses inherent in the POMDP model.

^7 Strictly speaking, the CSMA/CD and CSMA/CA considered here differ from the Ethernet data-link-layer protocols; both simply represent access control mechanisms, and the same acronyms are used for convenience.
VC Temporal Difference Learning and Its Applications
VC1 Methods
Temporal-difference (TD) learning is a model-free reinforcement learning method, which is capable of directly gleaning knowledge from raw experience without requiring a model of the environment, and which may be viewed as a combination of Monte Carlo methods and dynamic programming. More specifically, it samples the environment like Monte Carlo methods do, and then updates the corresponding parameters relying on its current estimates, as dynamic programming does. In contrast to Monte Carlo methods, TD learning operates in an online fashion by relying on the result of a single time step, rather than waiting for the final outcome at the end of an episode. Moreover, it has an advantage over dynamic programming methods, since it does not require a model of the state transition probabilities. TD learning can be readily invoked for finding an optimal action policy for any finite MDP associated with an unknown system model. Fig. 19 illustrates the difference between MDP, POMDP and TD learning.
A pair of popular representatives of the TD learning family are constituted by Q-learning and by the "state-action-reward-state-action" (SARSA) technique, which interacts with the environment and updates the state-action value function, namely the Q-function, based on the action it takes. In contrast to SARSA, Q-learning updates the Q-function relying on the maximum reward attainable by one of its available actions. Specifically, the update of the Q-function in SARSA can be formulated as [50]:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\big[r_t + \gamma\, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)\big], \quad (33)$$

while in Q-learning, the update of the Q-function can be cast as [225]:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\big[r_t + \gamma \max_{a \in A} Q(s_{t+1}, a) - Q(s_t, a_t)\big], \quad (34)$$

where $s_t$ represents the system's state and $a_t$ is the action selected by the agent, whilst $A$ represents the set of available actions. Moreover, $\alpha$ is the update weighting coefficient (learning rate) and $\gamma$ denotes the discount factor. As for the convergence analysis, SARSA is capable of converging with probability 1 to an optimal policy as well as to an optimal state-action value function, provided that all state-action pairs are visited a sufficiently high number of times. However, since Q-learning decouples the action actually taken from the update of its Q-function, i.e. it constitutes an off-policy method, it tends to converge earlier than SARSA [42].
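The two update rules (33) and (34) differ only in their bootstrap term, which a short sketch makes concrete. A dictionary stands in for the Q-table, and the transition values are illustrative.

```python
def sarsa_update(Q, s, a, r, s2, a2, alpha=0.1, gamma=0.9):
    """On-policy update of Eq. (33): bootstrap on the action a2 that the
    agent actually takes in the next state s2."""
    td_target = r + gamma * Q.get((s2, a2), 0.0)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (td_target - Q.get((s, a), 0.0))

def q_learning_update(Q, s, a, r, s2, actions, alpha=0.1, gamma=0.9):
    """Off-policy update of Eq. (34): bootstrap on the greedy (maximum)
    action value in the next state, regardless of the action taken."""
    td_target = r + gamma * max(Q.get((s2, a2), 0.0) for a2 in actions)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (td_target - Q.get((s, a), 0.0))

Q = {(1, 0): 2.0, (1, 1): 5.0}   # Q-values of the next state's two actions
sarsa_update(Q, s=0, a=0, r=1.0, s2=1, a2=0)                 # uses Q(1,0) = 2.0
q_learning_update(Q, s=0, a=1, r=1.0, s2=1, actions=[0, 1])  # uses max = 5.0
```

With the same reward and discount, the two rules produce different targets whenever the behavior policy does not take the greedy next action, which is precisely the on-policy/off-policy distinction discussed above.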
VC2 Applications
As a benefit of being free from modeling the environment, TD learning is capable of providing competent decisions even in unknown environments. Table III summarizes a variety of compelling applications found in wireless networks for both SARSA and Qlearning along with their brief description.
Paper  Method  Scenario  Application & Description

[226]  reduced-state SARSA  cellular network  dynamic channel allocation considering both mobile traffic and call handoffs
[227]  on-policy SARSA  CR network  distributed multi-agent sensing policy relying on local interactions among SUs
[228]  on-policy SARSA  MANET  energy-aware reactive routing protocol for maximizing the network lifetime
[229]  on-policy SARSA  HetNet  resource management for maximizing resource utilization and guaranteeing QoS
[230]  approximate SARSA  P2P network  energy-harvesting-aided power allocation policy for maximizing the throughput
[231]  Q-learning  WBAN  power control scheme to mitigate interference and to improve throughput
[232]  Q-learning  OFDM system  adaptive modulation and coding without relying on offline training from the PHY
[233]  Q-learning  cooperative network  efficient relay selection scheme meeting the symbol error rate requirement
[234]  decentralized Q-learning  CR network  aggregated interference control without introducing signaling overhead
[235]  convergent Q-learning  WSN  sensors' sleep-scheduling scheme for minimizing the tracking error
VI Deep Learning in NGWN
VIA Deep Artificial Neural Networks and Their Applications
VIA1 Methods
Artificial neural networks [236] constitute a set of algorithms conceived by imitating the interactions between the neurons of the human brain, which are designed to extract features for clustering and classification tasks.
In a common artificial neural network (ANN) model [237], the input of each artificial neuron is a real-valued signal, and its output is calculated by a nonlinear function of the sum of its inputs. Artificial neurons and their connections typically use a weighting factor for adjusting the "speed" of the learning process. Moreover, artificial neurons are organized in layers, with different layers performing different kinds of transformations of their inputs. Basically, input signals travel from the first layer to the last, possibly via multiple hidden layers.
The deep neural network (DNN) is characterized by multiple hidden layers between the input and output layers, as shown in Fig. 20 (a), and is capable of modeling complex relationships of the processed data with the aid of multiple nonlinear transformations. In a DNN, the provision of extra layers facilitates the composition of features from lower layers, which is beneficial in terms of modeling complex data more accurately than a 'shallow' network having a single hidden layer. Furthermore, the DNN may be viewed as a type of feedforward network, where the processed data flows from the input layer to the output layer without looping back. Given the recent impressive applications of DNNs, their convergence behavior emerges as an important subject in machine learning.
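The feedforward flow described above can be sketched in a few lines: each hidden layer applies an affine transform followed by a nonlinearity, and the signal never loops back. The tiny two-layer network and its weights are illustrative.

```python
def dense(x, W, b):
    """One fully connected layer: y_i = sum_j W[i][j] * x[j] + b[i]."""
    return [sum(w * xj for w, xj in zip(row, x)) + bi
            for row, bi in zip(W, b)]

def forward(x, layers):
    """Feedforward DNN pass: the signal travels from the input layer through
    the hidden layers to the output layer without looping back, each hidden
    layer applying an affine transform followed by a ReLU nonlinearity."""
    for W, b in layers[:-1]:
        x = [max(0.0, v) for v in dense(x, W, b)]   # hidden layer + ReLU
    W, b = layers[-1]
    return dense(x, W, b)                           # linear output layer

# Toy network: one hidden neuron feeding one linear output neuron.
layers = [([[1.0, -1.0]], [0.0]),    # hidden layer: 2 inputs -> 1 neuron
          ([[2.0]], [0.5])]          # output layer: 1 -> 1
y = forward([3.0, 1.0], layers)      # -> [4.5]
```

Stacking more `(W, b)` pairs into `layers` is all that "deeper" means here; training those weights (e.g. by back-propagation) is a separate concern.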
By contrast, in a recurrent neural network (RNN) a neuron in one layer is capable of connecting to neurons in previous layers. Therefore, an RNN is capable of exploiting the dynamic temporal information hidden in a time sequence, and it exploits its "memory" inherited from previous time steps for processing future inputs, as shown in Fig. 20 (b). Popular algorithms used for training RNNs include the real-time recurrent learning technique of [238], the causal recursive back-propagation algorithm of [239], the back-propagation-through-time algorithm of [240], etc.

The convolutional neural network (CNN) is a class of feedforward deep artificial neural networks relying on the so-called weight-sharing architecture and on translation-invariance characteristics, which hence only requires modest preprocessing. As seen in Fig. 20 (c), a basic CNN architecture is composed of an input layer, an output layer as well as multiple hidden layers, the latter often referred to as convolutional layers, pooling layers and fully connected layers. More particularly, the convolutional layers invoke a convolution operation, also termed the cross-correlation operation, which generates a multi-dimensional feature map relying on a number of so-called filters. The CNN has been successfully used in both image and video recognition [241], [242], recommender systems [243], etc. Fig. 20 contrasts the basic architectures of the DNN, RNN and CNN, respectively.

VIA2 Applications
In this subsection, we consider the benefits of deep artificial neural network algorithms in a variety of wireless networking scenarios. As mentioned before, deep artificial neural networks are capable of capturing the nonlinear and often dynamically varying relationship between their inputs and outputs. Hence they have powerful prediction, inference and data-analysis capabilities relying on exploiting the vast amount of data generated both by the environment and by the users. As for learning from the environment, we are able to harness DNNs trained on data gleaned over the air for the sake of channel estimation [244], interference identification [245], localization [246, 247, 248, 249, 250], etc. By contrast, with regard to learning from the users or devices, DNN algorithms can also be used for predicting the users' behaviors, such as their content interests [251], mobility patterns [252], etc., in order to beneficially design the dynamic content caching of the BSs and to efficiently allocate the wireless resources, for example.
Traditional signal processing approaches supported by statistics and information theory in communication systems rely substantially on accurate and tractable mathematical models. Unfortunately, however, practical communication systems may have a range of imperfections and nonlinear factors, which are difficult to model mathematically. Given that DNN algorithms do not require a tractable model, they are capable of remedying the imperfections of the physical layer by learning both from the environment and from previous inputs relying on a specific hardware configuration. To elaborate, Ye et al. [244] proposed a DNN-aided channel estimation method for learning the wireless channel characteristics, such as the nonlinear distortion, interference and frequency selectivity. The DNN-aided channel estimation method was shown to be more robust than traditional methods, especially in the context of having fewer training pilots, in the absence of a cyclic prefix, as well as in the face of nonlinear clipping noise. Apart from estimating the channel characteristics, DNNs can also be used for classifying modulated signals in the physical layer. Rajendran et al. [253] conceived a data-driven automatic modulation classification (AMC) scheme hinging on the long short-term memory (LSTM) aided RNN, which captured the time-domain (TD) amplitude and phase information of the modulation schemes carried in the training data without expert knowledge. Their simulations showed that the novel AMC attained an attractive average classification accuracy for time-varying SNRs ranging from 0 dB to 20 dB. As for signal detection, Farsad and Goldsmith [254] developed a deep-learning-aided signal detector, where the transmitted signal can be efficiently estimated from its corrupted version observed at the receiver.
The detector was trained relying on known transmitted signals, but without any knowledge of the underlying wireless channel model, and estimated the likelihood of each symbol, which was beneficial for carrying out soft-decision error correction afterwards. In the application of interference identification, Schmidt et al. [245] proposed a feature-map assisted CNN-based wireless interference identification scheme. The CNN model learned the relevant features through self-optimization during the GPU-based training process, which was first designed in [255]. By carefully considering the realistic capability of wireless sensors, the model relied on time- and frequency-limited sensing snapshots having a duration of 12.8 µs as well as a bandwidth of 10 MHz. The proposed CNN-based wireless interference identifier was shown to have a higher identification accuracy than the state-of-the-art schemes at low SNRs.
Furthermore, we can use DNNs for modeling the entire physical layer of a communication system without any of the classic components, such as source coding, channel coding, modulation, equalization, etc. In [256], O'Shea et al. used a DNN to represent a simple communication system with one transmitter and one receiver, which can be trained as a so-called autoencoder without knowing the accurate channel model. Moreover, a CNN algorithm was conceived for modulation classification based on both sampled radio-frequency time-series data and expert knowledge integrated by radio transformer networks (RTN). Additionally, O'Shea et al. [257] extended the DNN-aided autoencoder to a single-user MIMO communication scenario, where the physical-layer encoding and decoding processes were jointly optimized as a single end-to-end self-learning task. Their simulation results showed that the autoencoder-based system outperformed the classic space-time block code (STBC) at 15 dB SNR. Furthermore, Dörner et al. [258] also developed a DNN-based prototype system solely composed of two unsynchronized off-the-shelf software-defined radios (SDR). This prototype system was capable of mitigating the current restriction on short block lengths.

DNNs also play a critical role in supporting a variety of compelling upper-layer applications, such as traffic prediction [259], packet routing [260] and control [261], traffic offloading [262], resource allocation [263] and attack detection [264], just to name a few. For instance, Wang et al. [259] presented a hybrid deep-learning-aided structure for spatio-temporal traffic modeling and prediction in cellular networks by mining information from the China Mobile dataset. It used a novel deep-learning-aided autoencoder for modeling the spatial features of wireless traffic, while using LSTM units for temporal modeling. Additionally, Kato et al. [260] proposed a supervised DNN-aided traffic routing scheme, which outperformed the classic open shortest path first (OSPF) scheme in terms of requiring a lower overhead, whilst maintaining a higher throughput and a lower delay. By contrast, a real-time deep-CNN-based traffic control mechanism learning from previous network anomalies was conceived by Tang et al. [261], which substantially reduced the average delay and packet loss rate. Hence, deep-learning-aided traffic control may indeed constitute a potential candidate for gradually replacing traditional routing protocols in future wireless networks. Furthermore, Li et al. [262] integrated both the DNN structure and the edge computing technique into the multimedia IoT, which was able to improve the efficiency of multimedia processing. Sun et al. [263] treated the power control problem of interference-limited wireless networks as a 'black box' and proposed an 'almost-real-time' power control algorithm relying on a DNN structure trained by simulated data. In comparison to traditional mathematical tools, the approximation error of the DNN-aided algorithm is closely related to the depth of the DNN considered. As for network security issues, He et al. [264] constructed a conditional deep belief network (CDBN) for the real-time detection of malicious false data injection (FDI) attacks in the smart grid, which was trained by historical measurement data. The simulations conducted using the IEEE 118-bus and IEEE 300-bus test systems showed that the CDBN-aided FDI detection scheme was resilient to environmental noise and had a higher detection accuracy than its SVM-aided counterparts.
As a successful example of learning from the environment, DNNs are beneficial in terms of extracting electromagnetic fingerprint information from the wireless channel for indoor localization. In [246, 247, 248], Wang et al. proposed a DNN having three hidden layers for training on the calibrated CSI phase data, where the fingerprint information was represented by the DNN's weights. Their experimental results showed that the DNN-aided localization scheme performed well in different propagation environments, including an empty living room and a laboratory in the presence of mobile users. In [249], Wang et al. proposed a deep learning method for supporting device-free wireless localization and activity recognition relying on learning from the wireless signals around the target, where a sparse autoencoder network was used for automatically learning the discriminative features of the wireless signals. Furthermore, a softmax-regression-based framework [265] was formulated for location and activity recognition based on the merged features. Moreover, in [250], Zhang et al. constructed a four-layer DNN for extracting reliable high-level features from massive WiFi data, which was pre-trained by a stacked denoising autoencoder. Additionally, an HMM-aided high-accuracy localization algorithm was proposed for smoothing the estimation variations. Their experimental results showed a substantial localization accuracy improvement in the context of widely fluctuating wireless signal strengths.
With regard to learning from the users or devices, Ouyang et al. [252] conceived a CNN-aided online learning architecture for understanding human mobility patterns relying on analyzing continuous mobile data streams. Al-Molegi et al. [266] integrated both the spatial features gleaned from GPS data and the temporal features extracted from the associated time stamps for predicting human mobility based on an RNN. Moreover, Song et al. [267] proposed an intelligent deep-LSTM-RNN-based system for predicting both human mobility and the specific transportation mode in a large-scale transportation network, which was beneficial in terms of providing accurate traffic control for intelligent transportation systems (ITS). Additionally, a mobility prediction technique relying on a complex extreme learning machine (CELM) was developed by Ghouti et al. [268] in order to jointly optimize both the bandwidth and the power of MANETs. In [269], both multi-layer perceptron and RNN models were employed by Agarwal et al. for characterizing the activity of primary users in CR networks, where three different traffic distributions, namely Poisson traffic, interrupted Poisson traffic and self-similar traffic, were used for training the related models.
Table IV lists a range of typical applications of DNNs along with a brief description.
Paper  Application  Method  Description

[244]  channel estimation  DNN  learns nonlinear distortion, interference and frequency selectivity of wireless channels
[253]  modulation classification  RNN  captures amplitude and phase information without expert knowledge
[254]  signal detection  DNN  transmit-signal detection from noisy, corrupted signals without the underlying CSI
[245]  interference identification  CNN  learns features through self-optimization during the GPU-based training process
[256]  PHY representation  DNN  represents a simple system having one transmitter and one receiver without accurate CSI
[257]  PHY representation  DNN  represents a single-user MIMO system relying on a DNN-aided autoencoder
[258]  software-defined radio  DNN  capable of easing the current restriction on short block lengths
[259]  traffic prediction  DNN  deep autoencoder and LSTM for modeling spatial and temporal features
[260]  packet routing  DNN  traffic routing scheme with little signaling overhead, large throughput and small delay
[261]  traffic control  CNN  considers previous network anomalies, lowers the average delay and packet loss rate
[262]  traffic offloading  DNN  integrates both the DNN structure and edge computing into the multimedia IoT
[263]  power control  DNN  an almost-real-time power control algorithm for interference-limited wireless networks
[264]  network security  DBN  real-time detection of malicious false data injection attacks in the smart grid
[246, 247, 248, 249, 250]  indoor localization  DNN  device-free wireless localization and recognition by learning from ambient wireless signals
[252]  mobility prediction  CNN  learns human mobility patterns by analyzing continuous mobile data streams
[266]  mobility prediction  RNN  integrates spatial features from GPS and temporal features from associated time stamps
[267]  transportation mode  RNN  predicts both human mobility and transportation mode for large-scale transport networks
[269]  activity prediction  RNN  characterizes primary users' activity in CR under different traffic distributions
VIB Deep Reinforcement Learning and Its Applications
VIB1 Methods
The deep reinforcement learning technique is constituted by the integration of the aforementioned DNNs and reinforcement learning. Explicitly, in deep reinforcement learning methods, DNNs are used for approximating certain components of reinforcement learning, including the state transition function, the reward function, the value function and the policy. These components can then be viewed as functions of the weights of the DNNs, which can be updated with the aid of classic stochastic gradient descent.
In particular, the deep Q-network (DQN) constitutes the first deep reinforcement learning solution, proposed by Mnih et al. in 2015 [53], which avoids the instability of reinforcement learning algorithms, which may even become divergent when the action-value function is approximated by a nonlinear function. To elaborate a little further, the DQN stabilizes the training of the action-value function approximation by relying on experience replay. Furthermore, the DQN only requires modest domain knowledge. The deep Q-learning algorithm of the DQN is a variant of the classic Q-learning algorithm integrated with a deep CNN model, where the convolutional filters seen in Fig. 20 (c) are used for representing the effects of receptive fields. Each output of the deep CNN yields the value of the Q-function for one of the possible actions. Beyond the DQN, substantial efforts have also been invested in improving its performance and stability, as exemplified by the double DQN [270] and the dueling DQN [271]. Thanks to the powerful feature-representation capability of DNNs and to the reinforcement learning algorithms, the DQN performs well in a range of compelling applications, as exemplified by AlphaGo, the first program to defeat a professional human Go player.
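The two stabilizing mechanisms of the DQN, experience replay and a periodically frozen target network, can be sketched independently of the function approximator. For clarity, a lookup table stands in below for the deep CNN of [53]; the buffer capacity and learning parameters are illustrative.

```python
import random

class ReplayBuffer:
    """Experience replay: store transitions and sample past experience at
    random, breaking the temporal correlations that destabilize training."""
    def __init__(self, capacity=1000):
        self.buffer, self.capacity = [], capacity

    def push(self, transition):          # transition = (s, a, r, s2, done)
        if len(self.buffer) >= self.capacity:
            self.buffer.pop(0)           # discard the oldest experience
        self.buffer.append(transition)

    def sample(self, batch_size):
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))

def dqn_step(Q, target_Q, batch, actions, alpha=0.1, gamma=0.9):
    """One DQN-style update on a replayed mini-batch. Targets come from the
    frozen copy target_Q, which a training loop would refresh only every few
    hundred steps; a real DQN regresses a neural network instead of a table."""
    for s, a, r, s2, done in batch:
        target = r if done else r + gamma * max(
            target_Q.get((s2, a2), 0.0) for a2 in actions)
        Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))

buf = ReplayBuffer()
buf.push((0, 0, 1.0, 1, True))           # one terminal transition
Q, target_Q = {}, {}
dqn_step(Q, target_Q, buf.sample(1), actions=[0, 1])
# Q[(0, 0)] moves a step alpha toward the terminal target r = 1.0
```

Because the bootstrap term reads `target_Q` rather than the table being updated, the regression target stays fixed between refreshes, which is the stabilizing trick the text describes.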
VI-B2 Applications
Deep reinforcement learning is eminently suitable for supporting interaction in autonomous systems relying on a higher-level understanding of the visual world, and it can be readily applied to a diverse range of analytically intractable problems in NGWNs.
Given the intrinsic advantages of reinforcement learning in interactive decision making, it may play a significant role in the field of control decisions [272, 273]. Specifically, Zhang et al. [272] proposed a model-free UAV trajectory control scheme relying on deep reinforcement learning for data collection in smart cities, where a powerful deep CNN was used for extracting the necessary features, while a DQN model was used for decision making. Given the sensing region and the related tasks, this algorithm supported efficient route planning for both the UAVs and the mobile charging stations involved. In [273], a deep reinforcement learning aided communication-based train control system was conceived by Zhu et al., which jointly optimized the communication handoff strategy and the control functions, while reducing the energy consumption. Real channel measurements and real-time train position information were used for training the DQN model, which resulted in optimal communication and control decisions.
Furthermore, the resource allocation problems of wireless networks, such as energy scheduling, traffic scheduling, caching decisions, user association, etc., can be efficiently solved by deep reinforcement learning at a low computational complexity [81, 274, 275, 276, 277, 278, 279]. For example, Zhang et al. [81] proposed a deep Q-learning model for the system's dynamic energy scheduling, which relied on the amalgam of a stacked autoencoder and a Q-learning model. More specifically, the stacked autoencoder was used for learning the state-action value function of each strategy in any of the available system states. Moreover, Xu et al. [274] proposed a deep reinforcement learning framework for power-efficient resource allocation in CRANs, which optimized the expected cumulative long-term power consumption, including the transmit power consumption, the sleep/active transition power consumption as well as the RRU's power consumption. A two-step deep reinforcement learning aided decision making scheme was conceived, where the learning agent first decides on activating/deactivating the sleeping mode of each RRU, and then determines the optimal beamformer's power allocation. As for traffic scheduling, Zhu et al. [275] designed a stacked autoencoder assisted deep learning model for packet transmission planning in the face of multiple contending channels in cognitive IoT networks, which aimed for maximizing the system's throughput. In this architecture, an MDP was used for modelling the system states. Given the large state-action space of the system, the stacked autoencoder was used for constructing the mapping between the states and the actions for accelerating the process of optimization. Furthermore, a deep Q-learning algorithm was conceived for designing both the cache allocation and the transmission rate in content-centric IoT networks for the sake of maximizing the long-term QoE [276], where He et al. considered both the networking cost as well as the users' mean opinion score.
In [277] and [278], He et al. proposed a DQN based user scheduling scheme for a cache-enabled opportunistic interference alignment (IA) assisted wireless network in the context of realistic time-varying channels formulated as a finite-state Markov model. More specifically, the DQN was constructed by relying on a sophisticated action-value function for the sake of reducing the computational complexity. Their simulation results demonstrated that the DQN aided IA assisted user scheduling was beneficial in terms of substantially improving the network's throughput vs energy efficiency tradeoff. To elaborate a little further, He et al. [279] utilized deep reinforcement learning for constructing their resource allocation policy relying on a joint optimization problem, which considered programmable networking, information-centric caching as well as mobile edge computing in the context of connected vehicular scenarios. Moreover, the ε-greedy policy was utilized for striking an attractive tradeoff between exploration and exploitation.
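The exploration/exploitation tradeoff mentioned above can be sketched with the commonly used ε-greedy rule: with probability ε take a random action (explore), otherwise take the best-known one (exploit). The two-armed bandit with fixed success rates below is invented for illustration.

```python
import random

random.seed(3)
EPS = 0.1                          # exploration probability
rates = (0.2, 0.7)                 # true success probability of each action
value = [0.0, 0.0]                 # running mean reward estimates
count = [0, 0]

for _ in range(5000):
    if random.random() < EPS:
        a = random.randrange(2)                    # explore
    else:
        a = 0 if value[0] > value[1] else 1        # exploit the current best
    r = 1.0 if random.random() < rates[a] else 0.0
    count[a] += 1
    value[a] += (r - value[a]) / count[a]          # incremental mean update

print(count[1] > count[0])   # the better arm ends up chosen far more often
```

A purely greedy policy (ε = 0) risks locking onto a poor action estimated early on, whereas a purely random policy never exploits what it has learned; ε-greedy interpolates between the two.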
In order to curb the potentially excessive computational complexity resulting from having a large state space and to deal with its partial observability in cognitive radio networks, Naparstek and Cohen developed a distributed dynamic spectrum access scheme relying on deep multi-user reinforcement learning, where each user maps his/her current state to spectrum access actions with the aid of a DQN for the sake of maximizing the network's utility, which was achieved without any message exchanges [280]. Additionally, Wang et al. [281] proposed an adaptive DQN algorithm for dynamic multichannel access, which was capable of achieving a near-optimal performance in complex scenarios, outperforming both the Myopic policy [282] (which simply optimizes the average immediate reward; it is termed 'myopic' because it considers only this single criterion, but it has the advantage of being easy to implement) and Whittle's Index-based heuristic algorithm [283] (a low-complexity index heuristic designed to solve otherwise NP-hard scheduling problems more efficiently than traditional methods).
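As a hypothetical sketch of the Myopic policy used as a baseline above, consider a two-channel Gilbert-Elliott model: each channel is free (1) or busy (0) and evolves with known Markov transition probabilities, and the policy greedily senses the channel with the highest current belief of being free (the transition values are invented for illustration).

```python
import random

random.seed(2)

P11, P01 = 0.8, 0.3       # P(free -> free), P(busy -> free)
belief = [0.5, 0.5]       # prior probability that each channel is free
state = [1, 0]            # true (hidden) channel states
reward = 0

for _ in range(1000):
    ch = max(range(2), key=lambda i: belief[i])    # myopic: best immediate belief
    reward += state[ch]                            # unit reward if channel is free
    observed = state[ch]
    # belief update: the sensed channel's state is known, the other one mixes
    belief[ch] = P11 if observed else P01
    other = 1 - ch
    belief[other] = belief[other] * P11 + (1 - belief[other]) * P01
    # channels evolve independently according to the Markov chain
    state = [1 if random.random() < (P11 if s else P01) else 0 for s in state]

print(400 < reward < 950)   # better than the stationary free-rate of 0.6
```

Because the policy maximizes only the immediate expected reward, it never senses a currently less promising channel to refresh its belief, which is precisely the shortcoming the DQN based approach of [281] addresses.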
Table V lists some typical applications of deep reinforcement learning in NGWN.
Paper  Scenario  Application  Description 
[272]  UAV network  trajectory control  a modelfree UAV trajectory control scheme in smart cities relying on DQN 
[273]  ITS  train control  jointly optimize the communication handoff strategy and control performances 
[81]  energyaware network  energy scheduling  associate the stacked autoencoder and the deep Qlearning model 
[274]  CRAN  power allocation  decide RRU’s sleeping mode and the optimal beamformer’s power allocation 
[275]  cognitive IoT  traffic scheduling  construct the mapping between states and actions relying on stacked autoencoder 
[276]  contentcentric IoT  cache allocation  jointly design cache allocation and transmission rate for maximizing longterm QoE 
[278]  IA network  user scheduling  obtain the actionvalue function relying on DQN for lowering complexity 
[279]  vehicular network  resource allocation  consider programmable SDN, informationcentric caching and mobile edge computing 
[280]  CR network  spectrum access  distributed spectrum access for maximizing network utility without message exchanges 
[281]  CR network  multichannel access  adaptive DQN aided multichannel access yielding a nearoptimal performance 
VII Future Research and Conclusions
In the following, we will list a range of future research ideas on promising applications of machine learning in NGWNs.

UAVaided networking: Given the agility of UAV nodes as well as the bursty and often unpredictable nature of terrestrial wireless traffic, machine learning models can be used both for predicting the traffic demand and for adaptively adjusting the UAVs’ location.

mMTC and uRLLC network: While wireless networks have primarily served communications among individuals, in the era of the IoT, wireless networking must also support myriads of machines and intelligent devices. In this era a pair of 5G operational modes, namely mMTC and uRLLC, is expected to play key roles [284]. Machine learning is capable of enhancing conventional networks designed for mMTC, for example by invoking reinforcement learning to appropriately select the access points of MTC [285]. The uRLLC mode of operation constitutes a rather young technological territory, which can be jointly designed with mMTC [286]. To reduce the network's latency from hundreds of milliseconds as experienced in state-of-the-art mobile communications to the desirable range of just a few milliseconds, machine learning is capable of supporting so-called anticipatory mobility management, which integrates the naive Bayesian classification of the previously used APs and geographical regression for the predictive analysis of data. Another disruptive technical trend is to investigate how wireless networking impacts the smart agents operated by machine learning.
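The naive Bayesian classification of previously used APs mentioned above can be sketched as follows: given a user's historical (time-of-day, access point) log, predict the most likely next AP for a given hour. The log, AP names and time bins are invented for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical association history: (time-of-day bin, access point used)
history = [("morning", "AP_home"), ("morning", "AP_home"), ("noon", "AP_office"),
           ("noon", "AP_office"), ("noon", "AP_cafe"), ("evening", "AP_home")]

prior = Counter(ap for _, ap in history)                  # counts giving P(AP)
cond = defaultdict(Counter)                               # counts giving P(hour | AP)
for hour, ap in history:
    cond[ap][hour] += 1

def predict(hour):
    # maximum a posteriori AP: P(AP) * P(hour | AP),
    # with Laplace smoothing over the three time-of-day bins
    def score(ap):
        return prior[ap] * (cond[ap][hour] + 1) / (prior[ap] + 3)
    return max(prior, key=score)

print(predict("noon"))     # the history suggests the office AP at noon
```

Anticipatory handover preparation toward the predicted AP is what shaves the association latency, since the target cell can be primed before the user actually moves.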

Narrow-band IoT (NB-IoT): NB-IoT allows a large number of low-power devices to connect to the cellular network, where the devices require long-term Internet access and dense wireless coverage. Machine learning algorithms are capable of supporting intelligent resource allocation, optimal AP deployment and efficient access.

Socially-aware wireless networking: The operation of socially-aware wireless networks relies on a variety of social attributes, where machine learning schemes are beneficial in terms of facilitating feature extraction, social group-of-interest formation, as well as the classification and prediction of these social attributes, such as human mobility, social relations and behavioral preferences.

Wireless Virtual reality (VR) networks: VR networks allow users to experience and interact with immersive environments, which requires flawless audio and video data processing capability. Machine learning algorithms have the potential of circumventing the conception of complex joint source and channel-coding schemes by further developing the autoencoder principles.

Network integration, representation and design: Machine learning may provide an alternative for network representation, where we can integrate the classic communication-theoretic blocks, including source and channel encoding, modulation, demodulation, decoding, etc., into a “black box”. By simply learning from and processing previous input and output signals, the receivers become capable of adaptively understanding the operational mechanism of the “black box” considered.

Wireless network tomography: State-of-the-art wireless networks support a vast number of nodes, such as those of the IoT, where the provision of global information is practically impossible for each node. Hence a new class of problems arises in the context of distributed wireless networks, which is related to the acquisition of network-related information. Classic network tomography [287] defines the problem as $\mathbf{y} = \mathbf{A}\mathbf{x}$, where $\mathbf{x}$ is an $n$-dimensional vector of the network's dynamically fluctuating parameters, such as the link delay or traffic activity, $\mathbf{y}$ is the $m$-dimensional vector of measurements and $\mathbf{A}$ is the $(m \times n)$-element routing matrix.
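A minimal numerical illustration of the tomography model $\mathbf{y} = \mathbf{A}\mathbf{x}$: infer per-link delays $\mathbf{x}$ from end-to-end path measurements $\mathbf{y}$ via least squares, with a made-up 3-path / 2-link routing matrix and noiseless measurements.

```python
# Invented routing matrix: rows are paths, columns are links
A = [[1, 0],            # path 1 traverses link 1 only
     [0, 1],            # path 2 traverses link 2 only
     [1, 1]]            # path 3 traverses both links
x_true = [2.0, 5.0]     # per-link delays (ground truth, only used to build y)
y = [sum(a * xt for a, xt in zip(row, x_true)) for row in A]

# gradient descent on the least-squares objective ||A x - y||^2 / 2
x = [0.0, 0.0]
for _ in range(2000):
    resid = [sum(a * xi for a, xi in zip(row, x)) - yi for row, yi in zip(A, y)]
    grad = [sum(A[p][l] * resid[p] for p in range(3)) for l in range(2)]
    x = [xi - 0.1 * g for xi, g in zip(x, grad)]

print([round(v, 3) for v in x])   # recovers the link delays
```

In practice the system is under-determined ($m < n$) and the measurements are noisy, which is where statistical priors and machine learning aided estimators become attractive.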