|5G||The 5th Generation Mobile Network|
|AMC||Automatic Modulation Classification|
|ANN||Artificial Neural Network|
|AWGN||Additive White Gaussian Noise|
|BBU||BaseBand processing Unit|
|CDF||Cumulative Distribution Function|
|CNN||Convolutional Neural Network|
|CoMP||Coordinated Multiple Points|
|C-RAN||Cloud Radio Access Network|
|CRN||Cognitive Radio Network|
|CSI||Channel State Information|
|CSMA/CA||Carrier-Sense Multiple Access with Collision Avoidance|
|CSMA/CD||Carrier-Sense Multiple Access with Collision Detection|
|C-S Mode||Client-Server Mode|
|D2D||Device to Device|
|DBN||Deep Belief Network|
|DNN||Deep Neural Network|
|ELP||Exponentially-weighted algorithm with Linear Programming|
|eMBB||enhanced Mobile Broad Band|
|ERM||Empirical Risk Minimization|
|EXP3||EXPonential weights for EXPloration and EXPloitation|
|FANET||Flying Ad Hoc Network|
|FDA||Fisher Discriminant Analysis|
|FDI||False Data Injection|
|FSMC||Finite State Markov Channel|
|GMM||Gaussian Mixture Model|
|HMM||Hidden Markov Model|
|ICA||Independent Component Analysis|
|IEEE||Institute of Electrical and Electronics Engineers|
|IoT||Internet of Things|
|ITS||Intelligent Transportation System|
|LED||Light Emitting Diode|
|LOS||Line of Sight|
|LSTM||Long Short Term Memory|
|LTE||Long Term Evolution|
|M2M||Machine to Machine|
|MANET||Mobile Ad Hoc Network|
|MAP||Maximum a Posteriori|
|MDP||Markov Decision Process|
|MIMO||Multiple-Input and Multiple-Output|
|MLE||Maximum Likelihood Estimation|
|mMTC||massive Machine Type of Communication|
|NB-IoT||NarrowBand Internet of Things|
|NB-M2M||NarrowBand Machine to Machine|
|NFV||Network Function Virtualization|
|NLOS||Non-Line of Sight|
|NGWN||Next-Generation Wireless Network|
|NOMA||Non-Orthogonal Multiple Access|
|OFDM||Orthogonal Frequency Division Multiplexing|
|OSPF||Open Shortest Path First|
|P2P||Peer to Peer|
|PCA||Principal Component Analysis|
|POMDP||Partially Observable Markov Decision Process|
|QoE||Quality of Experience|
|QoS||Quality of Service|
|RAT||Radio Access Technology|
|RBM||Restricted Boltzmann Machine|
|RBF||Radial Basis Function|
|RFID||Radio Frequency IDentification|
|RNN||Recurrent Neural Network|
|RRU||Remote Radio Unit|
|SDA||Stacked Denoising Auto-encoder|
|SDN||Software Defined Network|
|SDR||Software Defined Radio|
|SRM||Structural Risk Minimization|
|STBC||Space Time Block Code|
|SVM||Support Vector Machine|
|TAS||Transmit Antenna Selection|
|TCP||Transmission Control Protocol|
|TOA||Time of Arrival|
|UAV||Unmanned Aerial Vehicle|
|UDN||Ultra Dense Network|
|uRLLC||ultra-Reliable Low-Latency Communication|
|V2I||Vehicle to Infrastructure|
|V2V||Vehicle to Vehicle|
|V2X||Vehicle to Everything|
|VANET||Vehicular Ad Hoc Network|
|VLC||Visible Light Communication|
|WANET||Wireless Ad Hoc Network|
|WBAN||Wireless Body Area Network|
|WLAN||Wireless Local Area Network|
|WiMAX||Worldwide Interoperability for Microwave Access|
|WMAN||Wireless Metropolitan Area Network|
|WPAN||Wireless Personal Area Network|
|WSN||Wireless Sensor Network|
|WWAN||Wireless Wide Area Network|
Wireless networks have supported a variety of applications, such as military services, intelligent transportation, healthcare, etc. To elaborate briefly, next-generation mobile networks are expected to support high data rate communication. As a complement, wireless sensor networks (WSN) support sustained monitoring in unmanned or hostile environments relying on widely dispersed operating sensors. Furthermore, the popular Wi-Fi network provides convenient Internet access for various devices in indoor scenarios. With the rapid proliferation of portable mobile devices and the demand for a high quality of service (QoS) and quality of experience (QoE), next-generation wireless networks (NGWN) will continue to support a broad range of compelling applications, where the users benefit from high-rate, low-latency, low-cost and reliable information services.
Network Scale: The NGWN is associated with a tremendous network size including all kinds of entities, each of which has different service capabilities as well as requirements. Furthermore, interactions among these entities result in a diverse variety of traffic, such as text, voice, audio, images, video, etc.
Network Structure: On one hand, the NGWN tends to have a self-configuring element, where each entity cooperatively completes tasks. This characteristic is termed “being ad hoc”. On the other hand, the NGWN is heterogeneous and hierarchical, having different network slices (in this paper, network slices are multiple logical networks running on top of a shared physical network infrastructure and operated by a control center). Furthermore, the mobility of entities results in a complex time-variant network structure, which requires dynamic time-space association.
Network Control: NGWNs facilitate convenient reconfiguration by software-based network management, hence improving network flexibility and efficiency.
Machine learning was first introduced as a popular technique of realizing artificial intelligence in the late 1950s. Machine learning algorithms can learn from training data without being explicitly programmed. They are beneficial for classification/regression, prediction, clustering and decision making [7, 8, 9], whilst relying on the following three basic elements:
Model: Mathematical or signal models are constructed from training data and expert knowledge, in order to statistically describe the characteristics of the given data set. Relying on these trained models, machine learning can then be used for classification, prediction and decision making. If appropriate models are not available, feature extraction or knowledge discovery techniques can be developed to achieve the same goal.
Strategy: The criteria used for training mathematical models are called strategies. The selection of an appropriate strategy is closely associated with the training data. Empirical risk minimization and structural risk minimization constitute a pair of fundamental strategies, where the latter can beneficially avoid the notorious “over-fitting” phenomenon.
Algorithm: Algorithms are constructed to find solutions based on the predetermined model and the selected strategy, which can be viewed as an optimization process. A powerful algorithm can find a globally optimal solution with high probability at a low computational complexity and storage requirement.
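The interplay of the three elements can be illustrated by a minimal sketch. It assumes a one-parameter linear model y ≈ w·x, a squared-error empirical risk and, for the structural risk, an added quadratic penalty on w. The function names, the toy data and the penalty weight are hypothetical choices made purely for illustration.

```python
# Model: y is approximated by w * x, with a single trainable weight w.
# Strategy: ERM minimizes the squared error alone; SRM adds a penalty on w**2.
# Algorithm: both objectives have the closed-form minimizer used below.

def fit_slope(xs, ys, penalty=0.0):
    """Minimize sum((y - w*x)**2) + penalty * w**2 over w."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + penalty)

def empirical_risk(w, xs, ys):
    """Mean squared error of the model on the training pairs."""
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Hypothetical training data, roughly following y = 2x.
xs = [1.0, 2.0, 3.0]
ys = [2.1, 3.9, 6.2]

w_erm = fit_slope(xs, ys)               # empirical risk minimization
w_srm = fit_slope(xs, ys, penalty=1.0)  # structural risk minimization

print(w_erm, empirical_risk(w_erm, xs, ys))
print(w_srm, empirical_risk(w_srm, xs, ys))
```

The penalized solution accepts a slightly higher training error in exchange for a smaller weight, which is the mechanism by which structural risk minimization counteracts over-fitting in richer models.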
In the last thirty years, machine learning has been successfully applied to the fields of computer vision, automatic control, bioinformatics, etc. Considering the aforementioned characteristics of the NGWN, data-driven machine learning can also become a powerful technique of network association for substantially improving the network performance. This is achieved by learning the physical reality more accurately than traditional model-driven optimization algorithms, which rely on modeling assumptions. More specifically,
The wireless traffic data torrent may be conveniently managed by the big data processing capability of machine learning. For example, in 5G systems, the traffic volume generated by on-demand information and entertainment is predicted to substantially increase over the next decade, and an average smart phone may generate 4.4 GB of data per month by the year 2020 [18, 19, 20]. The massive amount of data constitutes a large training set, which can be statistically exploited for extracting the internal correlations and for conducting classification and prediction with the aid of machine learning algorithms.
Modeling and parameter estimation play an important role in NGWNs. For instance, in massive multiple-input and multiple-output (MIMO) systems, an accurate estimate of the channel state information (CSI) may critically improve the whole system’s capacity. Traditional mathematical models may not be able to accurately describe the system in typical time-varying scenarios. Machine learning provides an alternative technique of adaptive modeling and parameter estimation relying on learning from history.
NGWNs require both individual node intelligence and swarm intelligence. Moreover, in resource allocation and management, we tend to strike a trade-off among numerous factors, such as the capacity, power consumption, latency, interference, etc., rather than only considering a single aspect. Thanks to learning from trial and error, machine learning is conducive to supporting intelligent multi-objective decision making in the context of multi-agent collaborative network management. NGWNs have further potential to enable more effective multi-agent artificial intelligence systems.
NGWNs tend to take human behaviors into account, for example in the geographic deployment of access points (AP) in an ultra dense network (UDN), where user-centric designs have been conceived for reducing the cluster-edge effects. By mimicking human intelligence, machine learning may be deemed to be the most appropriate tool for adapting the network’s structure and function to the human behaviors observed [22, 23].
In recent years, a range of surveys have been conceived on machine learning paradigms. Some of them focused their scope on a specific wireless scenario, such as WSNs [24, 25], cognitive radio networks (CRN) [26, 27, 28], the Internet of Things (IoT), wireless ad hoc networks (WANET), self-organizing cellular networks, etc. Specifically, Alsheikh et al. provided an extensive overview of machine learning methods applied to WSNs which improved the resource exploitation and prolonged the lifespan of the network. Kulkarni et al. surveyed some common issues of WSNs solved by computational intelligence algorithms, such as data fusion, routing, task scheduling, localization, etc. Moreover, Bkassiny et al. investigated decision-making and feature classification problems solved by both centralized and decentralized learning algorithms in CRN in a non-Markovian environment. Gavrilovska et al. studied the nature of the CRN’s capability of reasoning and learning. Park et al. reviewed a range of learning aided frameworks designed for adapting to the heterogeneous resource-constrained IoT environment. Forster portrayed the advantages of using machine learning for the data routing problem of WANETs. Furthermore, a detailed literature review of the past fifteen years of machine learning techniques applied to self-configuration, self-optimization and self-healing was provided by Klaine et al.
Some of the literature was restricted to a specific application [32, 33, 34, 35], whilst others considered a single learning technique [36, 37, 38, 39]. To elaborate, Al-Rawi et al. presented an overview of the features, methods and performance enhancement of learning-assisted routing schemes in the context of distributed wireless networks. Additionally, Fadlullah et al. provided an overview of the state-of-the-art in learning aided network traffic control schemes as well as in deep learning aided intelligent routing strategies, while Nguyen et al. focused their attention on the machine learning techniques conceived for Internet traffic classification. Machine learning and data mining assisted cyber intrusion detection techniques have also been surveyed, including the complexity comparison of each algorithm and a set of recommendations concerning the best methods applied to different cyber intrusion detection problems. As for exploring learning techniques, Usama et al. provided an overview of the recent advances of unsupervised learning in the context of networking, such as traffic classification, anomaly detection, network optimization, etc. Yau et al. investigated the employment of reinforcement learning invoked for achieving context awareness and intelligence in a variety of wireless network applications such as data routing, resource allocation and dynamic channel selection. Two further surveys focused their attention on the benefit of deep learning in wireless multimedia network applications, including ambient sensing, cyber-security, resource optimization, etc. The main contributions of the existing machine learning aided wireless networks survey and tutorial papers are contrasted to this survey in Fig. 1.
Hence, our focus is on the comprehensive survey of machine learning aided NGWNs. Inspired by the above-mentioned challenges, in this article we review the development of machine learning aided wireless networks. We commence by investigating a series of popular learning algorithms and their compelling applications in NGWNs and then provide some specific examples based on some recent research results, followed by a range of promising open issues in the design of future networks. Our original contributions are summarized as follows:
We critically review the thirty-year history of machine learning. Depending on how we use training data, we classify machine learning algorithms into three categories, i.e. supervised learning, unsupervised learning and reinforcement learning. In addition, we highlight the family of deep learning algorithms, given their success in the field of signal processing.
The development of wireless networks is reviewed from their birth to NGWNs. Moreover, we summarize the evolution of wireless networking techniques, and characterize a variety of representative scenarios for the NGWN.
We appraise a range of typical supervised, unsupervised, reinforcement learning as well as deep learning algorithms. Moreover, their compelling applications in wireless networks are surveyed for assisting the readers in refining the motivation of machine learning in NGWN, all the way from the physical layer to the application layer.
Relying on recent research results, we highlight a pair of examples conceived for wireless networks, which can help the readers to gain insight into hitherto unexplored scenarios and into their applications in NGWNs.
The remainder of this article is outlined as follows. In Section II, we provide a brief overview of the history of machine learning and of the development of wireless networks. In Section III, we introduce a range of typical supervised learning algorithms and highlight their compelling applications in wireless networks. In Section IV, we investigate the family of unsupervised learning algorithms and their related applications. Some popular reinforcement learning algorithms are elaborated on in Section V. Moreover, we present two examples of how these reinforcement learning algorithms can improve the performance of wireless networks. In Section VI, we introduce some typical deep learning algorithms and their applications in NGWNs. Some future research ideas and our conclusions are provided in Section VII. The structure of this treatise is summarized at a glance in Fig. 2.
II A Brief Overview of Machine Learning and Wireless Networks
II-A The Thirty-Year Development of Machine Learning
The term “machine learning” was first proposed by Arthur Samuel in 1959, and referred to computer systems having the capability of learning from large amounts of previous tasks and data, as well as of self-optimizing computer algorithms. Hard-programmed algorithms are difficult to adapt to dynamically fluctuating demands and constantly renewed system states. By contrast, relying on learning from previous experiences, machine learning aided algorithms are beneficial for scientific decision making and task prediction, which is achieved by constructing a self-adaptive model from sample inputs. To elaborate a little further, as for the concept of “learning”, Tom M. Mitchell provided the widely quoted description: “A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.”
Machine learning began to flourish in the 1990s. Before this era, logic- and knowledge-based schemes, such as inductive logic programming, expert systems, etc. dominated the artificial intelligence scene relying on high-level human-readable symbolic representations of tasks and logic. Thanks to the development of statistics theory and stochastic approximation, machine learning schemes regained researchers’ attention, leading to a range of beneficial probabilistic models. Researchers embarked on creating data-driven programs for analyzing a large amount of data and tried to draw conclusions or to learn from the data. During this era, machine learning algorithms such as neural networks as well as kernel methods became mature. During the 2000s, researchers gradually renewed their interest in deep learning with the aid of the advances in hardware-based computational capability, which made machine learning indispensable for supporting a wide range of services and applications.
Given the development of progressive learning techniques, at present, the research focus of machine learning has shifted from “learning being the purpose” to “learning being the method”. Specifically, machine learning algorithms no longer blindly seek to imitate the learning capability of human beings; instead, they focus more on task-oriented intelligent data-driven analysis. Nowadays, thanks to the abundance of raw data and to the frequent interaction between exploration and exploitation, machine learning algorithms have prospered in the fields of computer vision, data mining, intelligent control, etc. NGWNs aim for providing ubiquitous information services for users in a variety of scenarios. However, the rapid growth in the number of users and the resulting explosive growth of tele-traffic data push the limits of network capacity. As a remedy, machine learning aided network management and control can be viewed as a cornerstone of NGWNs in view of their limited power, spectrum and cost.
II-B Classifying Machine Learning Techniques
Again, depending on how training data is used, machine learning algorithms can be grouped into three categories, i.e. supervised learning, unsupervised learning and reinforcement learning [45, 46]. In the following, we will provide a brief description of the three types of algorithms.
Supervised Learning: The algorithms are trained on a certain amount of labeled data. Both the input data and its desired label are known to the computer, resulting in a data-label pair. Their goal is to infer a function that maps the input data to the output label relying on the training of sample data-label pairs. Specifically, consider a set of $N$ sample data-label pairs in the form of $\{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$, where $x_i$ is the $i$-th sample input data and $y_i$ represents its label. Let $X$ denote the input data set and $Y$ represent the output label set. Usually, these sample pairs are independent and identically distributed (i.i.d.). The learning algorithms aim for seeking a function $g: X \rightarrow Y$ that yields the highest value of the score function $f: X \times Y \rightarrow \mathbb{R}$, hence we have
$$g(x) = \arg\max_{y \in Y} f(x, y).$$
As a special case, if only part of the sample data-label pairs are known to the computer and some of the desired output labels of input data are missing, the corresponding learning algorithms are termed semi-supervised learning (in this paper, semi-supervised learning algorithms are viewed as a specific category of supervised learning algorithms; however, in some of the literature, semi-supervised learning is listed as a separate member of the machine learning family). These supervised learning algorithms can be widely used in the context of classification, regression and prediction.
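As a minimal sketch of this formulation, the following example learns from labeled data-label pairs and then predicts the label that maximizes a score function. The score used here, the negative squared distance to a per-class centroid, is an assumption chosen for brevity rather than a method advocated by the survey, and the two-dimensional samples and the labels "A" and "B" are hypothetical.

```python
from collections import defaultdict

def train_centroids(pairs):
    """Average the input vectors of each label in the training pairs."""
    grouped = defaultdict(list)
    for x, y in pairs:
        grouped[y].append(x)
    return {
        y: tuple(sum(column) / len(xs) for column in zip(*xs))
        for y, xs in grouped.items()
    }

def score(x, y, centroids):
    """Score of label y for input x: negative squared distance to its centroid."""
    return -sum((a - b) ** 2 for a, b in zip(x, centroids[y]))

def predict(x, centroids):
    """Return the label with the highest score for input x."""
    return max(centroids, key=lambda y: score(x, y, centroids))

# Hypothetical labeled training set of data-label pairs.
training_pairs = [
    ((0.0, 0.1), "A"), ((0.2, -0.1), "A"),
    ((1.0, 1.1), "B"), ((0.8, 0.9), "B"),
]
centroids = train_centroids(training_pairs)

print(predict((0.1, 0.2), centroids))
print(predict((1.0, 0.8), centroids))
```

Any other score function, for instance a class posterior probability, could be substituted without changing the structure of `predict`.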
Unsupervised Learning: Relying on unlabeled input data, unsupervised learning algorithms try to explore the hidden features or structure of the data [41, 47]. Given the lack of sample data-label pairs, there is no standard accuracy evaluation for the output of unsupervised learning algorithms, which is the main difference compared to its supervised learning aided counterpart. By analyzing the input data, a pair of popular methods has been conceived for revealing the underlying unknown features of input data, namely density estimation as well as feature extraction. To elaborate, density estimation aided methods are characterized by explicitly building statistical models of how the underlying features might create the input. By contrast, feature extraction based techniques aim for directly extracting statistical regularities or even sometimes irregularities from the input data set.
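The density estimation branch can be sketched as follows, assuming that the unlabeled samples are modeled by a single one-dimensional Gaussian whose parameters are obtained by maximum likelihood estimation. The sample values are hypothetical, and practical schemes typically rely on richer models such as Gaussian mixtures.

```python
import math

def fit_gaussian(samples):
    """Maximum likelihood estimates of the mean and variance."""
    n = len(samples)
    mean = sum(samples) / n
    variance = sum((s - mean) ** 2 for s in samples) / n
    return mean, variance

def density(x, mean, variance):
    """Gaussian probability density evaluated at x."""
    exponent = -((x - mean) ** 2) / (2.0 * variance)
    return math.exp(exponent) / math.sqrt(2.0 * math.pi * variance)

# Hypothetical unlabeled measurements, no labels are used anywhere.
samples = [9.8, 10.1, 10.0, 10.2, 9.9]
mean, variance = fit_gaussian(samples)

# A typical value receives a far higher density than an irregular one.
print(density(10.0, mean, variance))
print(density(12.0, mean, variance))
```

Comparing the fitted density at new observations offers one simple way of flagging the irregularities mentioned above.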
Reinforcement Learning: In contrast to the aforementioned two learning techniques, reinforcement learning algorithms are conceived for decision making by learning from interaction with the environment, and are trained on data gathered by trial and error [42, 50]. They neither try to identify a category as supervised learning algorithms do, nor do they aim for finding hidden structures as unsupervised learning algorithms do. Specifically, at each time step, the system or environment is in some state $s$, and the agent selects a legitimate action $a$. The system responds at the next time step by moving into a new state $s'$ with a certain probability influenced both by the specific action chosen as well as by the system’s inherent transitions. Meanwhile, the agent receives a corresponding reward $r$ from the system, as time evolves. Reinforcement learning algorithms aim for learning how to map situations into actions in order to attain the maximal cumulative weighted reward within the horizon in such a closed-loop fashion.
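The closed loop of state, action, new state and reward can be sketched with tabular Q-learning, which is used here as one representative reinforcement learning algorithm. The environment is a hypothetical four-state corridor with deterministic transitions, and the learning rate, discount factor, episode count and purely random exploration policy are all illustrative assumptions.

```python
import random

N_STATES, GOAL = 4, 3   # states 0..3, where state 3 is terminal
ACTIONS = (0, 1)        # 0: move left, 1: move right

def step(state, action):
    """Toy environment: return the next state and the reward."""
    nxt = max(0, state - 1) if action == 0 else state + 1
    reward = 1.0 if nxt == GOAL else 0.0
    return nxt, reward

def q_learning(episodes=500, alpha=0.5, gamma=0.9, seed=0):
    """Learn action values by trial and error with random exploration."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0
        for _ in range(100):  # cap on the episode length
            action = rng.choice(ACTIONS)
            nxt, reward = step(state, action)
            # Move the estimate towards reward + discounted best next value.
            target = reward + gamma * max(q[nxt])
            q[state][action] += alpha * (target - q[state][action])
            state = nxt
            if state == GOAL:
                break
    return q

q_table = q_learning()
# Greedy mapping from each non-terminal state to its best action.
greedy_policy = [max(ACTIONS, key=lambda a: q_table[s][a]) for s in range(GOAL)]
print(greedy_policy)
```

Since Q-learning is an off-policy method, the random behaviour policy suffices for learning the greedy policy in this small example; larger problems usually balance exploration and exploitation more carefully.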
As an important member of the machine learning family, deep learning has been booming since 2010, because it was found to be capable of handling the soaring growth of training data volume facilitated by the rapid development of computing hardware [51, 52]. Deep learning algorithms rely on a multiple-layer “network” consisting of inter-connected nodes for feature extraction and transformation, which is inspired by the biological nervous system, namely the neural network. Each layer utilizes the output of the previous layer as its input. The term “deep” refers to having multiple layers in the network. Generally, relying on the way the training data is exploited, deep learning algorithms can also be classified into deep supervised learning, deep unsupervised learning as well as deep reinforcement learning. Moreover, some deep learning network architectures, such as deep neural networks (DNN), deep belief networks (DBN), recurrent neural networks (RNN) and convolutional neural networks (CNN), have had success in a range of fields including computer vision, speech recognition, etc. They have also been invoked in compelling applications of wireless networks.
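The layered structure, where each layer consumes the output of the previous one, can be sketched by a forward pass through a tiny two-layer network. The weights below are picked by hand, not trained, so that the network computes the XOR function, and the ReLU activation is an illustrative assumption.

```python
def relu(values):
    """Element-wise rectified linear activation."""
    return [max(0.0, v) for v in values]

def dense(inputs, weights, biases):
    """Fully connected layer: one weighted sum plus bias per output node."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]

# Hand-picked (weights, biases, activation) per layer; hypothetical values.
LAYERS = [
    ([[1.0, 1.0], [1.0, 1.0]], [0.0, -1.0], relu),  # hidden layer, 2 nodes
    ([[1.0, -2.0]], [0.0], None),                   # linear output layer
]

def forward(x):
    """Feed x through the layers, each using the previous layer's output."""
    out = list(x)
    for weights, biases, activation in LAYERS:
        out = dense(out, weights, biases)
        if activation is not None:
            out = activation(out)
    return out

for x in ([0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]):
    print(x, forward(x))
```

In practice the weights are learned from data, and a deep network stacks many more such layers.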
Fig. 3 shows the involvement of machine learning in NGWNs based on the aforementioned four categories. Below we list a variety of popular learning algorithms and highlight their applications in NGWNs.
II-C Development of Wireless Networks
Just as the terminology implies, wireless networks connect various network nodes via electromagnetic waves. Relying on their coverage, wireless networks can be roughly classified into four categories, namely wireless personal area networks (WPAN), wireless local area networks (WLAN), wireless metropolitan area networks (WMAN) and wireless wide area networks (WWAN). Correspondingly, a family of networking standards and their variants that cover most of the physical layer specifications have been established by the IEEE 802 working groups, including the IEEE 802.15 for WPAN, IEEE 802.11 for WLAN, IEEE 802.16 for WMAN and IEEE 802.20 for WWAN standards. Furthermore, when considering the network’s functions, some popular representatives of wireless networks include cellular networks, WSNs, WANETs, wireless body area networks (WBAN), etc.
The first wireless network, namely ALOHANET, was developed at the University of Hawaii in 1969 and came into operation in 1971, transmitting wireless data packets over a network for the first time. The first commercial wireless network was the WaveLAN product family designed by the NCR Corporation in 1986. In 1997, the first IEEE 802.11 protocol was released for WLAN. Afterwards, the emergence and progress of reliable and low-cost Wi-Fi marked the maturity of wireless networking technologies at the end of the 20th century, which facilitated Internet access for a range of Wi-Fi compatible devices including personal computers, smart phones, etc. NGWNs aim for providing high-rate, low-latency, full-coverage and low-cost yet reliable information services. Compared to traditional wireless networks connecting humans and their devices, NGWNs are expected to interconnect everything under the umbrella of the ‘Internet of Everything’. Fig. 4 demonstrates the development of wireless networks in terms of their milestone techniques.
Wireless networks have evolved from the simple client-server (C-S) mode to the distributed dense multi-layer C-S mode, and finally to the ad hoc peer-to-peer (P2P) mode. The decentralization of network architectures grants more freedom both for the network nodes and their protocols, which requires more sophisticated techniques for supporting efficient and reliable implementations. Furthermore, the soaring growth of both the type and the amount of data provides a promising field of applications for machine learning algorithms, which are beneficial for self-organized and self-adaptive network architectures.
II-D Representative Techniques in NGWNs
As shown in Fig. 5, we first of all portray the representative application scenarios and techniques of NGWNs. In the following, we will briefly introduce a range of compelling techniques and their development trends in NGWNs, which are summarized in Fig. 6.
II-D1 From MIMO to Massive MIMO
The MIMO technology relying on multiple antennas at both the transmitter and receiver can be viewed as a breakthrough in terms of multiplying the capacity of a radio link compared to the single-transmit single-receive antenna aided wireless system having a variety of cost, technology and regulatory constraints. Both single-user MIMO (SU-MIMO) and multi-user MIMO (MU-MIMO) schemes have been proposed. To elaborate, multiple data streams of the same source are sent to a single user in SU-MIMO, while a transmitter simultaneously serves multiple users on the same channel resource in MU-MIMO [67, 68].
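The capacity multiplication can be illustrated numerically by a minimal sketch. It assumes the standard capacity expression for equal power allocation across the transmit antennas, log2 det(I + (SNR / Nt) H H^H), which is not stated explicitly in this section. The 2x2 channel matrices and the SNR value are hypothetical.

```python
import math

def siso_capacity(h, snr):
    """Capacity in bit/s/Hz of a single-antenna link with channel gain h."""
    return math.log2(1.0 + snr * abs(h) ** 2)

def mimo_capacity_2x2(h, snr):
    """Capacity in bit/s/Hz of a 2x2 link with equal power per transmit antenna.

    Evaluates log2 det(I + (snr / 2) * H * H^H) for a 2x2 matrix h.
    """
    nt = 2
    gram = [
        [
            (1.0 if i == j else 0.0)
            + (snr / nt) * sum(h[i][k] * h[j][k].conjugate() for k in range(nt))
            for j in range(2)
        ]
        for i in range(2)
    ]
    det = gram[0][0] * gram[1][1] - gram[0][1] * gram[1][0]
    return math.log2(det.real)

snr = 10.0                              # linear scale, hypothetical
h_orthogonal = [[1, 0], [0, 1]]         # two perfectly separable paths
h_rank_one = [[1, 1], [1, 1]]           # fully correlated paths

print(siso_capacity(1, snr))
print(mimo_capacity_2x2(h_orthogonal, snr))
print(mimo_capacity_2x2(h_rank_one, snr))
```

Under these assumptions the well-conditioned 2x2 channel supports two parallel streams, whereas the rank-one channel offers only a power gain, which suggests why accurate channel knowledge matters in MIMO systems.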
II-D2 From D2D, M2M to IoT
In the spirit of direct communication between nearby mobile devices without traversing base stations (BS) or core networks, device-to-device (D2D) communication networks have been widely investigated in recent years, and can be deemed to be important milestones on the road towards self-organization and P2P collaboration. In D2D networks, the same resource slots can be reused both by the D2D links as well as the cellular links, which is capable of substantially improving the network capacity. Moreover, D2D communication is potentially beneficial in terms of enhancing the energy efficiency (EE), reducing the transmission delay and improving the network’s fairness to users [73, 74]. It is also closely related to machine-to-machine (M2M) communications. The corresponding massive machine type of communication (mMTC) mode of the 5G network is capable of supporting the sensing, transmission, fusion and processing of sensory data (in 2015, the International Telecommunication Union (ITU) officially defined three application scenarios of the 5G network, i.e. enhanced mobile broad band (eMBB), massive machine type communication (mMTC) and ultra reliable low latency communication (uRLLC)). Furthermore, M2M is also capable of supporting the smart home, smart grid, etc.
Aiming for “connecting everything”, IoT was first defined for enabling objects to connect and exchange data in 1999. Furthermore, the IoT allows objects to be sensed and controlled remotely, creating opportunities for direct interaction between the physical world and computer-based virtual systems, which is beneficial in terms of improving operational efficiency and of reducing human intervention. Both WSNs and M2M communications can be viewed as a part of the IoT. Although the IoT faces a range of reliability, robustness and security challenges, there is no doubt that it will make our world ever smarter [79, 80].
II-D3 From UDN to HetNet
In order to meet the demand of supporting massive data traffic, the so-called UDN architecture has been defined where the density of BSs or APs potentially reaches or even exceeds the density of users [81, 82]. The UDN architecture is conducive to increasing the network capacity as well as simultaneously improving the user experience. However, the interference encountered in UDNs tends to be more severe and of higher volatility than that in traditional cellular networks because of the dense deployment of BSs and APs. Hence, the joint consideration of resource allocation, interference management and traffic routing is essential for UDNs [61, 83].
Considering a wide area network scenario, heterogeneous networks (HetNet) are characterized by the employment of multiple types of radio access technologies (RAT). Upon combining macrocells, microcells, picocells and femtocells [86, 87], HetNets are capable of providing a seamless wireless coverage ranging from outdoor environments to office buildings and even to underground areas by selecting another RAT when a RAT fails, and HetNets can also provide load-balancing in the face of non-uniform spatial distribution of users.
II-D4 From DBS to C-RAN
Compared to the traditional BS, which integrates baseband processing units (BBU) and remote radio units (RRU, also called remote radio heads (RRH) in some works) in a single cabinet, distributed base station (DBS) aided systems separate the BBU from the RRU and connect them with optical fiber. The DBS system allows more flexibility in network planning and deployment, where RRUs can be placed a few hundred meters or a few kilometres away for enhancing the network’s edge-coverage.
Cloud-radio access networks (C-RAN) can be viewed as an evolution of the aforementioned DBS system, which is a centralized processing and cloud computing aided radio access network architecture. The principle of C-RAN relies on gathering the BBUs from several BSs into a centralized BBU pool, whilst allowing hundreds of RRUs to connect to the centralized BBU pool. Hence, resources can be allocated to each user based on joint dynamic scheduling. By exploiting coordination and virtualization, the spectral efficiency (SE), the system’s flexibility and the load balancing capability are substantially improved. Moreover, the centralized management of resources reduces the cost of the system’s operation and maintenance.
II-D5 From SDN to NFV
Software-defined networking (SDN) is employed as a programmable network architecture in order to achieve cost-effective dynamic network configuration and monitoring [91, 92]. The SDN philosophy suggests centralizing network intelligence in a single network component by decoupling the control plane and the data plane, which disassociates network control from its forwarding functions. The two planes can communicate with the aid of the OpenFlow protocol (a communication protocol that gives access to the forwarding plane of a switch or router over the network), and the network resources can be managed logically and efficiently. An SDN connects decentralized users to cloud computing through a “network pipeline”.
Relying on IT virtualization techniques, network function virtualization (NFV) transforms the entire set of network node functions into different building blocks, which separates the networking functions from specific hardware blocks. Hence, NFV is eminently suitable for service diversification and promotes the standardization of networking equipment. Explicitly, NFV can be viewed as a beneficial hardware-agnostic design in the application layer of SDN architectures.
II-D6 From EH to EA
Energy harvesting (EH) is an environmentally friendly process, which captures and stores ambient energy, such as solar power, thermal energy, wind energy, etc. for low-power wireless devices, especially in WSNs and WBANs.
In NGWNs, energy optimization is a significant concern motivated by mitigating climate change. However, energy consumption is related to both the network’s throughput and to its entire lifetime with a trade-off between them. As a remedy, instead of only focusing on EH, energy awareness (EA) at every stage of the network’s design and management is the most promising approach to striking a trade-off amongst the conflicting objectives of reducing energy consumption, improving the system’s throughput as well as prolonging its lifetime, especially in energy-constrained networks [98, 99].
Ii-D7 From CR to CogNet
Cognitive radio (CR) constitutes a technique that allows us to dynamically and efficiently exploit the wireless spectral resources [100, 101, 102, 103]. By relying on spectrum sensing, CR is capable of achieving dynamic spectrum access and spectrum sharing. Specifically, in the process of spectrum sensing, the secondary user (SU) detects an empty slice of spectrum, for example, based on energy detection schemes. Then, in the process of spectrum access, power control is invoked by the SU for maximizing its capacity, whilst observing the interference power constraint in order to protect the primary user (PU). As a benefit, CR dynamically and flexibly exploits the scarce wireless spectral resources, hence substantially improving the spectrum efficiency.
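As a rough illustration of the energy detection idea mentioned above, the following minimal sketch compares the average energy of a block of received samples against a threshold. The sample values, the threshold and the function names are hypothetical and chosen purely for illustration; a practical detector would derive the threshold from the noise statistics and from a target false-alarm probability.

```python
from typing import Sequence


def average_energy(samples: Sequence[complex]) -> float:
    """Average energy (mean squared magnitude) of a block of samples."""
    return sum(abs(s) ** 2 for s in samples) / len(samples)


def band_is_occupied(samples: Sequence[complex], threshold: float) -> bool:
    """Declare the band occupied if the average energy exceeds the threshold."""
    return average_energy(samples) > threshold


# Hypothetical baseband samples: noise only vs. noise plus a PU signal.
idle_band = [0.1 + 0.05j, -0.08 + 0.02j, 0.03 - 0.09j, -0.06 - 0.04j]
busy_band = [1.0 + 0.2j, -0.9 + 0.1j, 0.8 - 0.3j, -1.1 - 0.2j]
THRESHOLD = 0.1  # assumed value, not taken from the text

idle_decision = band_is_occupied(idle_band, THRESHOLD)
busy_decision = band_is_occupied(busy_band, THRESHOLD)
```

In this sketch the SU would regard the first block as an empty slice of spectrum and the second one as being occupied by the PU.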
In contrast to CR techniques, which only deal with the issues of physical-layer spectrum sensing and data link-layer access, cognitive networks (CogNet) are characterized by a cognitive cross-layer process according to their end-to-end goals, where the overall network conditions are monitored, and then decisions are made based on the perceived conditions as well as on the feedback and experience gleaned from previous actions. The network’s cognitive capability relies on a range of advanced techniques, such as knowledge representation and machine learning, which exploit the wealth of information generated within the network for improving the network management, the resource efficiency and the energy efficiency.
Ii-D8 Interference Management
Interference constitutes the fundamental limiting factor of the overall wireless system performance, hence it is a key challenge faced by designers. Therefore substantial efforts have been dedicated to exploiting the communication channel’s state information (CSI) either at the transmitter (CSIT) or at the receiver (CSIR) for mitigating the effects of interference. Hence diverse time/frequency/space division multiple access based resource allocation schemes have been conceived for avoiding interference by creating orthogonal resource units [108, 109, 110]. Creative efforts have also been dedicated to the conception of non-orthogonal access systems, as exemplified by a large variety of cognitive radio and non-orthogonal multiple access (NOMA) schemes relying on sophisticated transceiver designs. Additionally, multi-antenna based techniques, such as joint/partial pre/postcoding and antenna selection, have also been proposed for ameliorating the effects of interference by exploiting the benefits of spatial diversity.
A closely related issue in NGWNs is interference management, which is a particularly critical task in ultra-dense networks in the face of their stringent throughput, delay and reliability specifications. Hence sophisticated resource allocation and interference management schemes are required. Therefore a range of machine learning algorithms have also been invoked for interference management relying on their environmental awareness and learning capability [114, 115, 116].
Ii-E Multi-Objective Metrics of NGWNs
The challenging real-world optimization problems encountered in NGWNs usually have to meet multiple objectives in order to arrive at an attractive solution . In contrast to conventional single-objective optimization where we find the global optimum relying on a single metric, multi-objective optimization aims for finding the globally optimal solution relying on the notion of Pareto optimality . The aim of multi-objective optimization in NGWNs is that of generating a diverse set of Pareto-optimal solutions, where by definition it is only possible to improve any of the metrics considered at the cost of degrading at least one of the others. The collection of Pareto-optimal points is referred to as the Pareto front.
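To make the notion of Pareto optimality more tangible, the following minimal sketch extracts the Pareto front from a small set of candidate solutions. It assumes that every metric is to be minimized; the (latency, energy) pairs and the function names are hypothetical and serve purely as an illustration.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is no worse than b in every metric and strictly better in
    at least one of them (all metrics are assumed to be minimized)."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def pareto_front(points: List[Point]) -> List[Point]:
    """Return the points that are not dominated by any other point."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (latency in ms, energy in J) pairs of candidate configurations.
candidates = [(10.0, 5.0), (8.0, 7.0), (12.0, 4.0), (11.0, 6.0), (8.0, 9.0)]
front = pareto_front(candidates)
# front == [(10.0, 5.0), (8.0, 7.0), (12.0, 4.0)]
```

Here (11.0, 6.0) is dominated by (10.0, 5.0) and (8.0, 9.0) is dominated by (8.0, 7.0), whereas for each of the three remaining points a reduction in latency can only be achieved at the cost of a higher energy consumption, and vice versa.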
In terms of metrics, the wireless community has invested decades of research efforts into making near-capacity single-user operation a reality, which is however only possible at the cost of an ever-increasing delay, complexity and power consumption. However, in the context of next-generation wireless communication networks, we would like to be more ambitious than ’only’ optimizing the network’s capacity: for delay-sensitive services we would like to reduce the latency and/or reduce the total energy consumption, as well as to improve the system’s reliability and the user’s QoS. By contrast, in wireless sensor networks we may concentrate on optimizing both the connectivity and the network’s lifetime, just to name a few. In this context the family of machine-learning techniques may be viewed as an attractive set of optimization tools for finding Pareto-optimal solutions of multi-objective optimization problems in NGWNs, which tend to have a large search-space. To expound a little further, it is plausible that every time we incorporate an additional parameter into the objective function, the search-space is expanded and the surface of optimal solutions may exhibit numerous locally optimal solutions. Hence traditional gradient-based techniques routinely fail to find the global optimum. In this context Fig. 7 portrays some popular metrics commonly used in constructing multi-objective optimization problems in NGWNs.
Iii Supervised Learning in NGWN
Having covered the networking basics, in this section, we will introduce some rudimentary supervised learning algorithms, such as regression, K-nearest neighbors (KNN), support vector machines (SVM) and Bayes classification including their applications in NGWN. Table I summarizes some of the typical applications of the above-mentioned four supervised learning algorithms in NGWN.
Iii-a Regression and Its Applications
Regression analysis is capable of estimating the relationships among variables. Relying on modeling the functional relationship between a dependent variable (objective) and one or more independent variables (predictors), regression constitutes a powerful statistical tool for predicting and forecasting a continuous-valued objective given a set of predictors.
In regression analysis, there are three types of variables, namely the
Independent variables (predictors): $\mathbf{X}$
Dependent variable (objective): $Y$
Other unknown parameters that affect the estimated value of the dependent variable: $\boldsymbol{\beta}$
The regression function $f$ models the functional $Y$ vs $\mathbf{X}$ relationship perturbed by $\boldsymbol{\beta}$, which can be formulated as $Y \approx f(\mathbf{X}, \boldsymbol{\beta})$. Usually, we characterize this relationship in terms of a specific regression function with the aid of its probability distribution. Moreover, the approximation is often modeled as $E(Y \mid \mathbf{X}) = f(\mathbf{X}, \boldsymbol{\beta})$. When conducting regression analysis, first of all we have to determine the specific form of the regression function $f$, which relies on both the common knowledge about the dependent vs independent variables as well as on its convenient evaluation. Based on the specific form of the regression function, regression analysis methods can be classified as ordinary linear regression [121], polynomial regression, etc.
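In linear regression, the dependent variable is a linear combination of the independent variables or unknown parameters. Let us assume having $N$ random training samples and $M$ independent variables, formulated as $\{y_n, x_{n1}, \ldots, x_{nM}\}_{n=1}^{N}$. Then the linear regression function can be formulated as:
$$y_n = \beta_0 + \beta_1 x_{n1} + \cdots + \beta_M x_{nM} + e_n, \tag{1}$$
where $\beta_0$ is termed as the regression intercept, while $e_n$ is the error term and $n = 1, \ldots, N$. Hence, Eq. (1) can be rewritten in the form of a matrix as $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{e}$, where $\mathbf{y} = [y_1, \ldots, y_N]^T$ is an observation vector of the dependent variable and $\mathbf{e} = [e_1, \ldots, e_N]^T$, while $\boldsymbol{\beta} = [\beta_0, \beta_1, \ldots, \beta_M]^T$ and $\mathbf{X}$ represents the observation matrix of independent variables, given by:
$$\mathbf{X} = \begin{bmatrix} 1 & x_{11} & \cdots & x_{1M} \\ 1 & x_{21} & \cdots & x_{2M} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{N1} & \cdots & x_{NM} \end{bmatrix}.$$
Linear regression analysis aims for estimating the unknown parameter $\boldsymbol{\beta}$ relying on the least squares (LS) criterion. The corresponding solution can be expressed as:
$$\hat{\boldsymbol{\beta}} = \left(\mathbf{X}^T\mathbf{X}\right)^{-1}\mathbf{X}^T\mathbf{y}. \tag{2}$$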
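The LS criterion can be illustrated by the following minimal sketch, which forms the normal equations associated with the intercept-augmented observation matrix and solves them by Gaussian elimination using only the Python standard library. The toy data set and all function names are hypothetical; in practice, a numerical library relying on a QR or SVD based solver would be preferable for reasons of numerical stability.

```python
from typing import List


def solve_linear_system(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("X^T X is singular; the LS solution is not unique")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):  # back-substitution
        known = sum(aug[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (aug[r][n] - known) / aug[r][r]
    return z


def least_squares(features: List[List[float]], targets: List[float]) -> List[float]:
    """Return [intercept, coef_1, ..., coef_M] minimizing the squared error."""
    x = [[1.0] + row for row in features]  # prepend the intercept column
    m = len(x[0])
    xtx = [[sum(r[i] * r[j] for r in x) for j in range(m)] for i in range(m)]
    xty = [sum(r[i] * y for r, y in zip(x, targets)) for i in range(m)]
    return solve_linear_system(xtx, xty)


# Noise-free toy data generated from y = 1 + 2*x1 - 3*x2 (hypothetical).
features = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
targets = [1.0 + 2.0 * x1 - 3.0 * x2 for x1, x2 in features]
beta = least_squares(features, targets)  # approximately [1.0, 2.0, -3.0]
```

Since the toy data are noise-free, the estimated coefficients coincide with the generating ones up to rounding errors; in the presence of noise the same routine returns the coefficients minimizing the sum of squared errors.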
By contrast, in logistic regression, the dependent variable is binary. In order to facilitate our analysis, in the following we consider the case of a binary dependent variable, for example. The goal of the binary logistic regression is to model the probability of the dependent variable having the value of $0$ or $1$, given the training samples. To elaborate a little further, let the binary dependent variable $y$ depend on $M$ independent variables $x_1, \ldots, x_M$. The conditional distribution of $y$ under the condition of $\mathbf{x} = [1, x_1, \ldots, x_M]^T$ obeys a Bernoulli distribution. Hence, the probability of $y = 1$ can be expressed in the form of a standard logistic function (a common “S”-shaped function, which is the cumulative distribution function (CDF) of the logistic distribution), also termed as a sigmoid function:
$$P(y = 1 \mid \mathbf{x}) = \frac{1}{1 + e^{-\boldsymbol{\beta}^T\mathbf{x}}}, \tag{3}$$
where $\mathbf{x} = [1, x_1, \ldots, x_M]^T$ and $\boldsymbol{\beta} = [\beta_0, \beta_1, \ldots, \beta_M]^T$ represents the regression coefficient vector. Similarly, we have:
$$P(y = 0 \mid \mathbf{x}) = 1 - P(y = 1 \mid \mathbf{x}) = \frac{1}{1 + e^{\boldsymbol{\beta}^T\mathbf{x}}}. \tag{4}$$
Relying on the aforementioned definitions, we have $P(y = 1 \mid \mathbf{x}) / P(y = 0 \mid \mathbf{x}) = e^{\boldsymbol{\beta}^T\mathbf{x}}$. Hence, for a given dependent variable, the probability of its value being $y \in \{0, 1\}$ can be expressed by $P(y \mid \mathbf{x}) = P(y = 1 \mid \mathbf{x})^{y} \, P(y = 0 \mid \mathbf{x})^{1-y}$. Given a set of training samples $\{(\mathbf{x}_n, y_n)\}_{n=1}^{N}$, we are capable of estimating the regression coefficient vector $\boldsymbol{\beta}$ with the aid of the maximum likelihood estimation (MLE) method. Explicitly, logistic regression can be deemed to form a special case of the generalized linear regression family, namely the one relying on the logit link function.
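A minimal sketch of the MLE of the regression coefficients is given below, where the log-likelihood of the Bernoulli model is maximized by plain batch gradient ascent. The choice of gradient ascent, the learning rate, the number of iterations and the toy data set are assumptions made purely for illustration, since the text does not specify a particular optimizer.

```python
import math
from typing import List


def sigmoid(z: float) -> float:
    """Numerically stable standard logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(coef: List[float], x: List[float]) -> float:
    """P(y = 1 | x), where coef[0] is the intercept."""
    return sigmoid(sum(c * xi for c, xi in zip(coef, [1.0] + x)))


def fit_logistic(xs: List[List[float]], ys: List[int],
                 lr: float = 0.1, iters: int = 2000) -> List[float]:
    """Estimate the coefficients by gradient ascent on the log-likelihood."""
    coef = [0.0] * (len(xs[0]) + 1)
    for _ in range(iters):
        grad = [0.0] * len(coef)
        for x, y in zip(xs, ys):
            err = y - predict_proba(coef, x)  # gradient factor (y - p)
            for j, xj in enumerate([1.0] + x):
                grad[j] += err * xj
        coef = [c + lr * g / len(xs) for c, g in zip(coef, grad)]
    return coef


# Hypothetical data: a single predictor (e.g. a quantized SINR level) and a
# binary outcome (e.g. packet received or not). The classes overlap, hence
# the maximum-likelihood estimate is finite.
xs = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
ys = [0, 0, 1, 0, 1, 1]
coef = fit_logistic(xs, ys)
p_low = predict_proba(coef, [0.0])
p_high = predict_proba(coef, [5.0])
```

For this toy data set the fitted slope is positive, hence the predicted probability of the outcome being 1 increases with the predictor.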
Furthermore, there exist numerous other useful regression models [122, 123, 124, 125]. When the dependent variable is a polynomial function of the independent variables, we refer to it as polynomial regression, where the best-fit line is a curve. Moreover, ridge regression, least absolute shrinkage and selection operator (LASSO) regression and ElasticNet regression are widely applied, when the independent variables are of multi-collinear nature and highly correlated. Fig. 8 demonstrates the basic flow of a regression model.
The regression models can be used for estimating, detecting and predicting physical layer radio parameters related to wireless network scenarios. Specifically, Chang et al. proposed a novel regression-aided interference model, which characterized the relationship between the SINR and the packet reception ratio, and evaluated its accuracy relying on the statistics. Based on this model, they constructed an analytic framework for striking a trade-off between the overhead imposed and the accuracy of interference measurement attained. Umebayashi et al. used regression analysis for formulating a deterministic-stochastic hybrid model for detecting the spectrum usage by PUs, which had a reduced number of parameters and yet maintained a high detection accuracy. Al Kalaa et al. used logistic regression for estimating the likelihood of Wi-Fi and ZigBee wireless coexistence in the context of medical devices. Furthermore, Xiao et al. constructed a logistic regression-aided physical layer authentication model for detecting spoofing attacks in wireless networks without relying on a known channel model, which exhibited a high detection accuracy, despite its low computational complexity.
The regression models can also be employed for solving both estimation and detection problems in the upper layers of the seven-layer OSI model. For example, Chang et al. derived a regression-based analytical model for the sake of estimating the contention success probability considering heterogeneous sensor-traffic demands, which beneficially improved the channel’s exploitation in IoT. Moreover, Chen et al. employed a regression model for reconstructing the radio map with the aid of signal strength models for the path planning and UAV-location design in UAV-assisted wireless networks. As a further advance, Lei et al. employed a logistic regression classifier for device-free localization relying on fingerprint signals, which yielded a low localization error.
Iii-B KNN and Its Applications
KNN constitutes a non-parametric instance-based learning method, which can be used both for classification and regression. Proposed by Cover and Hart in 1967, the KNN algorithm is one of the simplest of all machine learning algorithms. By relying on the distance between the object and the training samples in a feature space, the KNN algorithm determines which class the object belongs to. Specifically, in a classification scenario, an object is categorized into a specific class by a majority vote of its $K$ nearest neighbors. If $K = 1$, the category of the object is the same as that of its nearest neighbor. In this case, it is termed as the one nearest neighbour classifier. By contrast, in a regression scenario, the output value of the object is calculated as the average of the values of its $K$ nearest neighbors. Fig. 9 illustrates the unweighted KNN mechanism associated with a given $K$.
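Let us assume that there are $N$ training sample pairs of $\{(\mathbf{x}_n, y_n)\}_{n=1}^{N}$, where $y_n$ is the property value or class label of the sample $\mathbf{x}_n$, $n = 1, \ldots, N$. Typically, we use the Euclidean distance or the Manhattan distance for calculating the similarity between the object $\mathbf{x}$ and the training samples. Let $\mathbf{x}_n = [x_{n1}, \ldots, x_{nM}]$ contain $M$ different features. Hence, the Euclidean distance between $\mathbf{x}$ and $\mathbf{x}_n$ can be expressed by:
$$d_E(\mathbf{x}, \mathbf{x}_n) = \sqrt{\sum_{m=1}^{M} \left(x_m - x_{nm}\right)^2}, \tag{5}$$
while their Manhattan distance is calculated as:
$$d_M(\mathbf{x}, \mathbf{x}_n) = \sum_{m=1}^{M} \left|x_m - x_{nm}\right|. \tag{6}$$
Relying on the associated similarity, the class label or property value $y$ of $\mathbf{x}$ can be voted on or first weighted and then voted on by its $K$ nearest neighbors, which is formulated as:
$$y = \mathrm{VOTE}\left\{ y_n : \mathbf{x}_n \in \mathcal{N}_K(\mathbf{x}) \right\}, \tag{7}$$
where $\mathcal{N}_K(\mathbf{x})$ denotes the set of the $K$ training samples closest to $\mathbf{x}$.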
The performance of the KNN algorithm critically depends on the value of $K$, whilst the best choice of $K$ hinges upon the training samples. In general, a large $K$ is conducive to resisting the harmful influence of noise, but it fuzzifies the class boundary between different categories. Fortunately, an appropriate value of $K$ can be determined by a variety of heuristic techniques based on the true characteristics of the training data set.
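The unweighted KNN classification rule may be sketched as follows, where either the Euclidean or the Manhattan distance can be plugged in. The two-dimensional training samples, the class labels and the function names are hypothetical; ties between classes are resolved in favour of the class whose neighbor is encountered first, which is an implementation choice rather than part of the algorithm's definition.

```python
import math
from collections import Counter
from typing import Callable, List, Sequence, Tuple

Sample = Tuple[Sequence[float], str]  # (feature vector, class label)


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def manhattan(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(abs(ai - bi) for ai, bi in zip(a, b))


def knn_classify(train: List[Sample], query: Sequence[float], k: int,
                 distance: Callable[..., float] = euclidean) -> str:
    """Label the query by an unweighted majority vote of its k nearest neighbors."""
    neighbors = sorted(train, key=lambda pair: distance(pair[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical two-class training set.
train = [((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.0), "A"),
         ((6.0, 6.0), "B"), ((7.0, 7.0), "B"), ((6.5, 5.0), "B")]

label_near_a = knn_classify(train, (2.0, 2.0), k=3)                      # "A"
label_near_b = knn_classify(train, (6.0, 5.5), k=3, distance=manhattan)  # "B"
```

Replacing the majority vote by the average of the neighbors' values would yield the regression variant of the same procedure.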
In KNN, an object can be classified into a specific category by a majority vote of the object’s neighbours, with the object being assigned to the class that is the most common one among its nearest neighbors. Hence, as a simple and efficient classification algorithm, KNN is beneficial in terms of, for example, traffic prediction, anomaly detection [135, 136], missing data estimation, modulation classification, interference elimination, etc.
To elaborate, for the sake of capturing the dynamic characteristics of wireless resource demands, Feng et al. constructed a weighted KNN model by learning from a large-scale historical data set generated by cellular operators’ networks, which was used for exploring both the temporal and spatial characteristics of radio resources. Xie et al. proposed a novel KNN aided online anomaly detection scheme based on hypergrid intuition in the context of WSN applications for overcoming the ‘lazy-learning’ problem, especially when the computational resource and the communication cost quantified in terms of bandwidth and energy were constrained. Moreover, Onireti et al. proposed a KNN based anomaly detection algorithm for improving the outage detection accuracy in dense heterogeneous networks. As for missing data estimation, a KNN assisted missing data estimation algorithm was conceived on the basis of the temporal and spatial correlation feature of sensor data, which jointly utilized the sensor data from multiple neighbor nodes. Furthermore, Aslam et al. combined genetic programming and the KNN in order to improve the modulation classification accuracy, which can be viewed as a reliable modulation classification scheme for the SU in cognitive radio networks. Finally, the KNN algorithm was used both for extracting the environmental interference imposed by 5G Wi-Fi signals and for reducing the computational complexity and yet improving the performance of indoor localization.
Iii-C SVM and Its Applications
Being constructed purely by mathematical theory, SVM is another supervised learning model conceived for classification and regression relying on constructing a hyperplane or a set of hyperplanes in a high-dimensional space. The best hyperplane is the one that results in the largest margin amongst the classes. However, the training data set may often be linearly non-separable in a finite dimensional space. To address this issue, SVM is capable of mapping the original space into a higher dimensional space, where the training data set can be more easily discriminated.
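Considering a linear binary SVM, for example, there are $N$ training samples in the form of $\{(\mathbf{x}_n, y_n)\}_{n=1}^{N}$, where $y_n \in \{-1, +1\}$ indicates the class label of the point $\mathbf{x}_n$. SVM aims for searching for a hyperplane having the maximum possible separation from the training samples, which best discriminates the two classes of $\mathbf{x}_n$ associated with $y_n = +1$ and $y_n = -1$. Here, the maximum separation implies having the maximum possible distance between the nearest point and the hyperplane. The hyperplane is represented by:
$$\mathbf{w}^T\mathbf{x} + b = 0. \tag{8}$$
Hence, we can quantify the separation of the training sample $(\mathbf{x}_n, y_n)$ as:
$$\gamma_n = y_n\left(\mathbf{w}^T\mathbf{x}_n + b\right). \tag{9}$$
Moreover, we assume having the correct classification if $\mathbf{w}^T\mathbf{x}_n + b > 0$ when $y_n = +1$, while $\mathbf{w}^T\mathbf{x}_n + b < 0$ when $y_n = -1$. Because we have $\gamma_n > 0$ for every correctly classified sample, a higher separation implies a more reliable classification. Again, the SVM tries to find the optimal hyperplane that maximizes the minimum separation between the training samples and the hyperplane considered. Given a set of linearly separable training samples, after the operation of normalization, the SVM based classification can be formulated as the following optimization problem:
$$\max_{\gamma, \mathbf{w}, b} \ \gamma \quad \text{s.t.} \quad y_n\left(\mathbf{w}^T\mathbf{x}_n + b\right) \geq \gamma, \ n = 1, \ldots, N, \quad \|\mathbf{w}\| = 1, \tag{10}$$
where we have $\gamma = \min_{n} \gamma_n$. After some further mathematical manipulations, the problem in (10) can be reduced to an optimization problem having a convex quadratic objective function and linear constraints, which can be expressed by:
$$\min_{\mathbf{w}, b} \ \frac{1}{2}\|\mathbf{w}\|^2 \quad \text{s.t.} \quad y_n\left(\mathbf{w}^T\mathbf{x}_n + b\right) \geq 1, \ n = 1, \ldots, N. \tag{11}$$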
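The maximum-margin criterion may be illustrated by the following minimal sketch, which evaluates the normalized separation of each training sample from a candidate hyperplane and compares two candidates in terms of their minimum separation. It does not solve the quadratic program itself; the toy samples, the two candidate hyperplanes and the function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Sample = Tuple[Sequence[float], int]  # (feature vector, label in {-1, +1})


def separation(w: Sequence[float], b: float,
               x: Sequence[float], y: int) -> float:
    """Signed distance y * (w^T x + b) / ||w|| of a sample from the hyperplane."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm


def min_separation(w: Sequence[float], b: float, samples: List[Sample]) -> float:
    """Minimum separation over all samples; positive if all are classified correctly."""
    return min(separation(w, b, x, y) for x, y in samples)


# Hypothetical linearly separable training samples.
samples = [((2.0, 2.0), +1), ((3.0, 3.0), +1), ((0.0, 0.0), -1), ((1.0, 0.0), -1)]

margin_a = min_separation((1.0, 1.0), -2.5, samples)  # hyperplane x1 + x2 = 2.5
margin_b = min_separation((1.0, 0.0), -1.5, samples)  # hyperplane x1 = 1.5
```

Both hyperplanes classify all four samples correctly, but the first one achieves a minimum separation of about 1.06 as opposed to 0.5, hence it would be preferred according to the maximum-margin criterion.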
Again, if the training samples are linearly non-separable, SVM is capable of mapping data to a high dimensional feature space with a high probability of being linearly separable. This may result in a non-linear classification or regression in the original space. Fortunately, kernel functions play a critical role in avoiding the “curse of dimensionality” in the above-mentioned dimensionality ascending procedure [142, 143]. To elaborate a little further, given the original input samples $\mathbf{x}$, we may be interested in learning some features $\phi(\mathbf{x})$. Let us assume having two input samples $\mathbf{x}$ and $\mathbf{z}$, hence the corresponding kernel function is defined as:
$$K(\mathbf{x}, \mathbf{z}) = \phi(\mathbf{x})^T\phi(\mathbf{z}). \tag{12}$$
Fortunately, even though the high dimensional feature mapping $\phi(\cdot)$ may be expensive to calculate, the kernel function, which relies on the inner product of the feature mappings, can be easily obtained after some further mathematical manipulations.
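The benefit of the kernel function can be verified numerically with the aid of a simple example. The sketch below considers the homogeneous second-order polynomial kernel, whose explicit feature mapping consists of all pairwise products of the input features, and confirms that the kernel evaluated in the original space equals the inner product in the feature space. The specific kernel, the input vectors and the function names are chosen purely for illustration.

```python
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(ai * bi for ai, bi in zip(a, b))


def phi(x: Sequence[float]) -> List[float]:
    """Explicit feature mapping: all M^2 pairwise products x_i * x_j."""
    return [xi * xj for xi in x for xj in x]


def poly_kernel(x: Sequence[float], z: Sequence[float]) -> float:
    """Second-order polynomial kernel evaluated in the original space."""
    return dot(x, z) ** 2


x = [1.0, 2.0, 3.0]
z = [4.0, 5.0, 6.0]

explicit = dot(phi(x), phi(z))  # inner product in the 9-dimensional feature space
implicit = poly_kernel(x, z)    # requires only a 3-dimensional inner product
# Both evaluate to 1024.0, since (1*4 + 2*5 + 3*6)^2 = 32^2.
```

The explicit mapping requires on the order of $M^2$ multiplications per sample, whereas the kernel only requires on the order of $M$, which is the reason why the feature mapping never has to be evaluated explicitly.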
There are a variety of alternative kernel functions, such as the linear kernel function, polynomial kernel function, radial basis kernel function, neural network kernel function, etc. Furthermore, some regularization methods have been conceived in order to make SVM less sensitive to outlier points.
The specific choice of the kernel function plays a key role in machine learning, hence we have to beneficially design the kernel function. The construction of kernels can be generally developed by the inner product operations of feature mappings between the input samples over the Hilbert space, whose infinite number of dimensions allow the appropriate representation of big data to exploit their geometric properties. Such a Hilbert space associated with a kernel invoked for producing functions by calculating the inner product of the feature mappings is known as the reproducing kernel Hilbert space (RKHS), and has been applied in diverse learning contexts [146, 147]. The RKHS therefore serves as a critical foundation in statistical learning theory. Fig. 10 provides a graphical illustration of the kernel-based method.
On the other hand, we may rely on statistical learning theory for appropriately constructing the signal space in order to identify sufficient statistics for reliable signal detection and estimation in statistical communication theory. Inspired by Parzen, Kailath observed that RKHS may also be beneficially invoked both for detection and estimation by exploiting the one-to-one relationship between RKHS and finite-variance linear functionals of a random process. Corresponding to the simplest setup of signal detection in additive white Gaussian noise (AWGN) using the Karhunen-Loeve expansion, the RKHS representation associated with the noise covariance function is capable of providing an equivalent theoretical framework of statistical communication theory. After a series of efforts invested in different areas of signal detection and estimation, Kailath and Poor conceived the RKHS approach for the detection of stochastic signals.
As mentioned before, SVM hinges on a mapping that can transform the original training data into a higher dimension, where the events to be classified do become linearly separable. Then it searches for the optimal separating hyperplane for delineating one class from another in this higher dimension considered. As highlighted in Fig. 11, in the spirit of this, SVM aided learning models can be used for detecting and estimating network parameters, for learning and classifying environmental signals and the user’s behavior, as well as for guiding decision making concerning channel selection and anomaly detection, for example [152, 153, 154, 155, 156, 157, 158, 159, 160].
As for detecting and estimating the network parameters, Feng and Chang constructed a hierarchical SVM (H-SVM) structure for multi-class data estimation. The H-SVM was constructed of a number of levels and each level was composed of a finite number of SVM classifiers. Feng and Chang used their H-SVM model both for estimating the physical locations of nodes in an indoor wireless network and the Gaussian channel’s noise level in a MIMO-aided wireless network. Thanks to its hierarchical structure, the H-SVM was capable of providing an efficient distributed estimation procedure. Furthermore, Tran et al. proposed an SVM model for estimating the geographic location of sensor nodes in WSNs whilst only relying on their connectivity information, more precisely the hop counts. It yielded fast convergence in a distributed manner. The final estimation error can be upper bounded by any small threshold upon relying on a sufficiently large training dataset. Moreover, Sun and Guo conceived a least square-SVM (LS-SVM) algorithm for estimating the user’s position by correlating the time-of-arrival (TOA) of radio frequency signals at the BSs without any detailed knowledge about the base station’s location as well as about the propagation characteristics.
SVM can also be used for learning a user’s behavior and for classifying environmental signals considering the complex spatio-temporal context and the diverse selection of devices. Donohoo et al. studied the context-aware energy-efficiency improvement options for smart devices. These solutions may become beneficial in terms of configuring their location-specific interface for heterogeneous networks (HetNets) constituted by diverse cells. By combining the SVM and Fisher discriminant analysis (FDA), Joseph et al. learned the malicious sinking behavior in wireless ad hoc networks for finding the security vulnerabilities and for designing a novel intrusion detection scheme. Moreover, features such as the delay between data and acknowledgement, the number of re-transmissions, etc. gleaned from the MAC layer were jointly considered with those from other layers, which constituted a correlated feature set. Furthermore, Pianegiani et al. proposed an SVM-based binary classification solution for classifying acoustic signals emitted by vehicles relying on spectral analysis aided feature extraction, which was beneficial in terms of improving the classification accuracy, despite reducing the implementation complexity.
As for the SVM’s benefit in assisting decision making, a common control channel selection mechanism was conceived for SUs during a given frame relying on an SVM-based learning technique proposed for a cognitive radio network, which was capable of implicitly and cooperatively learning the surrounding environment in an online way. Moreover, Yang et al. investigated the spoofing attack detection problem based on the spatial correlation of received signal strength gleaned from network nodes, where a cluster-based SVM mechanism was developed for determining the number of attackers. Relying on carefully designed training data, the SVM algorithm employed further improved the accuracy of determining the number of attackers. Rajasegarar et al. also investigated the malicious activity detection issues of WSNs invoking a variety of SVM based algorithms.
Iii-D Bayes Classification and Its Applications
The Bayes classifier, a popular member of the probabilistic classifier family relying on Bayes’ theorem, operates by computing the a posteriori probability distribution of the objective function values given a set of training samples. As a widely-used classification method, the naive Bayes classifier can be trained, for example, conditioned on a simple but strong independence assumption concerning the features. Furthermore, the complexity of training a naive Bayes model is linearly proportional to the training set size.
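To elaborate a little further, let the vector $\mathbf{x} = [x_1, \ldots, x_M]$ represent $M$ independent features for a total of $K$ classes $\{c_1, \ldots, c_K\}$. For each of the $K$ possible class labels $c_k$, we have the conditional probability of $p(c_k \mid x_1, \ldots, x_M)$. Relying on Bayes’ theorem, we decompose the conditional probability to yield the form of:
$$p(c_k \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid c_k)\, p(c_k)}{p(\mathbf{x})}, \tag{13}$$
where $p(c_k \mid \mathbf{x})$ is the posteriori probability, whilst $p(c_k)$ is the priori probability of $c_k$. Given that $x_i$ is conditionally independent of $x_j$ for $i \neq j$, we have:
$$p(c_k \mid \mathbf{x}) = \frac{p(c_k)}{p(\mathbf{x})} \prod_{m=1}^{M} p(x_m \mid c_k), \tag{14}$$
where $p(\mathbf{x})$ only depends on the independent features, which can be viewed as a constant.
The maximum a posteriori probability (MAP) is used as the decision making rule for the naive Bayes classifier. Given a feature vector $\mathbf{x} = [x_1, \ldots, x_M]$, its label $\hat{y}$ can be determined according to:
$$\hat{y} = \arg\max_{c_k,\ k \in \{1, \ldots, K\}} \ p(c_k) \prod_{m=1}^{M} p(x_m \mid c_k). \tag{15}$$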
Despite idealized simplifying assumptions, naive Bayes classifiers have enjoyed popularity in numerous complex real-world situations, such as outlier detection, spam filtering , etc.
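The MAP decision rule of the naive Bayes classifier may be sketched as follows for categorical features. The class priors and the per-feature conditional probabilities are estimated by counting, where add-one (Laplace) smoothing is included as an assumption for avoiding zero probabilities; the logarithm is used for avoiding numerical underflow. The feature values, the class labels and the function names are hypothetical.

```python
import math
from collections import Counter, defaultdict
from typing import Hashable, List, Sequence


def train_naive_bayes(samples: List[Sequence[Hashable]], labels: List[str],
                      alpha: float = 1.0):
    """Count class and per-feature value occurrences (alpha = smoothing constant)."""
    class_counts = Counter(labels)
    # feature_counts[c][m][v]: number of class-c samples whose m-th feature equals v
    feature_counts = defaultdict(lambda: defaultdict(Counter))
    values = defaultdict(set)  # distinct values observed for each feature
    for x, c in zip(samples, labels):
        for m, v in enumerate(x):
            feature_counts[c][m][v] += 1
            values[m].add(v)
    return class_counts, feature_counts, values, alpha


def predict_map(model, x: Sequence[Hashable]) -> str:
    """Return the class maximizing log p(c) + sum_m log p(x_m | c)."""
    class_counts, feature_counts, values, alpha = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for c, n_c in class_counts.items():
        score = math.log(n_c / total)  # log prior
        for m, v in enumerate(x):
            numerator = feature_counts[c][m][v] + alpha
            denominator = n_c + alpha * len(values[m])
            score += math.log(numerator / denominator)  # log likelihood term
        if score > best_score:
            best_label, best_score = c, score
    return best_label


# Hypothetical features: (received signal strength level, acknowledgement delay).
samples = [("high", "short"), ("high", "short"), ("high", "long"),
           ("low", "long"), ("low", "long"), ("low", "short")]
labels = ["normal", "normal", "normal", "attack", "attack", "attack"]

model = train_naive_bayes(samples, labels)
label_1 = predict_map(model, ("high", "short"))  # "normal"
label_2 = predict_map(model, ("low", "long"))    # "attack"
```

A single pass over the training set suffices for collecting the counts, which is in line with the training complexity being linearly proportional to the training set size.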
Based on Bayes’ theorem, Bayes classifier techniques are particularly applicable to the context where the dimensionality of the input is high. Despite their simplicity, they can often outperform other sophisticated classification methods. As for their applications in wireless networks, in the following, we will elaborate on some typical examples in different wireless scenarios, such as antenna selection, network association, anomaly detection, indoor location and QoE prediction.
Specifically, He et al. modeled the transmit antenna selection (TAS) problem of MIMO wiretap channels as a multi-class classification problem. Then, they used the naive Bayes-based classification scheme to select the optimal antenna for enhancing the physical layer security of the system considered. In contrast to conventional TAS schemes, simulation results showed that the proposed scheme resulted in a reduced feedback overhead at a given secrecy performance. Abouzar et al. proposed an action-based network association technique for wireless body area networks (WBANs). Relying on the level of the received signal strength indicator of the on-body link, the naive Bayes algorithm was employed to recognize the ongoing action, which was beneficial in terms of scheduling the time slot assignment in the context of fixed power allocation on various links by the sink node under a specific data rate constraint. Moreover, Klassen et al. used the naive Bayes classifier for detecting anomalies in ad hoc wireless networks involving the black hole attack, the denial of service (DoS) attack and the selective forwarding attack.
The Bayes classifier can also be applied to indoor location estimation. For example, a probabilistic model was conceived for characterizing the relationship between the received signal strength and location with the aid of the naive Bayes generative learning method, which was used for learning the parameters of an initial probabilistic model, given a limited number of labeled samples. The proposed indoor location estimation method was capable of reducing the off-line calibration efforts required, whilst maintaining a high location estimation accuracy. Furthermore, as for QoE prediction, in order to evaluate the impact of different networking and channel conditions on the QoE attained in the context of different network services, Charonyktakis et al. proposed a modular algorithm for user-centric QoE prediction. They integrated multiple machine learning algorithms, including the Gaussian naive Bayes classifier, and conceived a nested cross validation protocol for selecting the optimal classifier and its corresponding optimal hyper-parameter value for the sake of accurate QoE prediction.
Table I: Typical applications of supervised learning algorithms in NGWN
|Application||Method||Description|
|interference estimation||regression||strike a trade-off between the overhead and the accuracy of interference measurement|
|spectrum sensing||regression||reduce the number of parameters and maintain a high detection accuracy|
|wireless coexistence||regression||estimate the likelihood of the wireless coexistence of Wi-Fi and ZigBee|
|PHY authentication||regression||dispense with the assumption of an accurately known channel model|
|traffic estimation||regression||estimate the contention success probability considering sensors’ heterogeneous traffic demands|
|map reconstruction||regression||reconstruct the wireless radio map for UAV path planning and location design|
|wireless localization||regression||logistic regression classifier for device-free localization relying on fingerprint signals|
|traffic prediction||KNN||explore both the temporal and spatial characteristics of radio resources|
|anomaly detection||KNN||rely on the hypergrid intuition in the context of WSN applications|
|missing data estimation||KNN||rely on the temporal and spatial correlation feature of sensor data|
|modulation classification||KNN||combine genetic programming and KNN for improving the modulation classification accuracy|
|interference elimination||KNN||extract environmental interference from Wi-Fi signals and reduce the computational complexity|
|data estimation||SVM||provide an efficient estimation procedure in a distributed manner|
|localization estimation||SVM||yield fast convergence and efficiently use the communication resources|
|user location||SVM||operate without knowledge about the base station location and the environmental propagation characteristics|
|data prediction||SVM||provide location-specific interface configuration for HetNets|
|behavior learning||SVM||combine both the superior accuracy of SVM and the fast convergence speed of FDA|
|signal classification||SVM||classify acoustic signals emitted by vehicles relying on feature extraction|
|channel selection||SVM||propose a control channel selection mechanism for a cognitive radio network|
|attacker counting||SVM||develop a cluster-based SVM mechanism for determining the number of attackers|
|antenna selection||Bayes||enhance the physical layer security relying on Bayes-based optimal antenna selection|
|network association||Bayes||schedule time slot assignment and fixed power allocation under a data rate constraint|
|anomaly detection||Bayes||detect anomalies involving the black hole attack, the DoS attack and the selective forwarding attack|
|indoor location||Bayes||characterize the relationship between the received signal strength and location|
|QoE prediction||Bayes||accurate QoE prediction by selecting the optimal classifier and optimal hyper-parameter values|
Iv Unsupervised Learning in NGWN
In this section, we will highlight some typical unsupervised learning algorithms, such as $K$-means clustering, expectation-maximization (EM), principal component analysis (PCA) and independent component analysis (ICA) in terms of their methodology and their applications in NGWN. Table II summarizes some typical applications of the above-mentioned unsupervised learning algorithms in NGWN.
Iv-a -Means Clustering and Its Applications
-means clustering is a distance based clustering method that aims for partitioning unlabeled training samples into different cohesive clusters, where each sample belongs to one cluster. To elaborate a little further, -means clustering measures the similarity between two samples in terms of their distance and it has two main steps, namely assigning each training sample to one of clusters in terms of the closest distance between the sample and the cluster centroids, and then updating each cluster centroid according to the mean of the samples assigned to it. The whole algorithm is hence implemented by repeatedly carrying out the above-mentioned pair of steps until convergence is achieved.
To elaborate a little further, given a set of samples $\{x_1, \ldots, x_N\}$, where $x_n$ is a $d$-dimensional vector, let $S = \{S_1, \ldots, S_K\}$ represent the above-mentioned cluster set, and $\mu_k$ the mean of the samples in $S_k$. $K$-means clustering intends to find an optimal cluster-based segmentation, which solves the following optimization problem:
$$\min_{S} \sum_{k=1}^{K} \sum_{x_n \in S_k} \left\| x_n - \mu_k \right\|^2. \tag{16}$$
However, problem (16) is a non-deterministic polynomial-time hard (NP-hard) problem. Fortunately, there is a range of efficient heuristic algorithms, which converge quickly to a local optimum.
One of the popular low-complexity iterative refinement algorithms suitable for $K$-means clustering is Lloyd’s algorithm, which often yields satisfactory performance after a low number of iterations. Specifically, given the initial cluster centroids $\mu_1^{(0)}, \ldots, \mu_K^{(0)}$, Lloyd’s algorithm arrives at the final cluster segmentation result by alternating between the following two steps:
Step 1: In the iterative round $t$, assign each sample to a cluster. For $n = 1, \ldots, N$ and $k = 1, \ldots, K$, if we have:
$$\left\| x_n - \mu_k^{(t)} \right\|^2 \le \left\| x_n - \mu_j^{(t)} \right\|^2, \quad \forall j = 1, \ldots, K, \tag{17}$$
then we assign the sample $x_n$ to the cluster $S_k^{(t)}$, even if it could potentially be assigned to more than one cluster.
Step 2: Update the centroids of the new clusters formulated in the iterative round $t$ relying on:
$$\mu_k^{(t+1)} = \frac{1}{|S_k^{(t)}|} \sum_{x_n \in S_k^{(t)}} x_n, \tag{18}$$
where $|S_k^{(t)}|$ denotes the number of samples in cluster $S_k^{(t)}$ in iterative round $t$.
Convergence is deemed to be obtained when the assignment in Step 1 is stable. Explicitly, reaching convergence means that the clusters formulated in the current round are the same as those formed in the last round. Since this is a heuristic algorithm, there is no guarantee that it converges to the global optimum. Hence, the result of clustering largely relies on the specific choice of the initial clusters and on their centroids.
$K$-means clustering aims for partitioning samples into $K$ clusters. Each sample belongs to the closest cluster. The clustering algorithm proceeds in an iterative manner, where the in-cluster differences are minimized by iteratively updating the cluster centroid, until convergence is achieved.
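The two alternating steps of Lloyd's algorithm can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function name `lloyd_kmeans`, the toy data and the choice of the first and last sample as initial centroids are assumptions made purely for the example.

```python
import numpy as np

def lloyd_kmeans(x, init_centroids, max_iters=100):
    """Cluster the rows of x with Lloyd's algorithm from the given initial centroids."""
    centroids = np.array(init_centroids, dtype=float)
    k = len(centroids)
    labels = None
    for _ in range(max_iters):
        # Step 1: assign every sample to its closest centroid.
        # argmin resolves ties by picking the lowest cluster index.
        dists = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Converged once the assignment is the same as in the last round.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 2: move each centroid to the mean of the samples assigned to it.
        for j in range(k):
            members = x[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return labels, centroids

# Toy data (assumption): two well-separated groups of 2-D points.
rng = np.random.default_rng(1)
x = np.vstack([rng.normal(0.0, 0.3, (50, 2)), rng.normal(5.0, 0.3, (50, 2))])
# Initial centroids (assumption): one sample taken from each end of the data set.
labels, centroids = lloyd_kmeans(x, init_centroids=x[[0, -1]])
```

Because the outcome depends on the initial centroids, practical implementations often restart from several initializations and keep the segmentation with the lowest within-cluster distance.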
Clustering under uncertainty or incomplete information is a common problem in wireless networks, especially in the scenarios associated with numerous small traffic cells, heterogeneous large and small cell structures relying on diverse carrier frequencies, diverse time-varying tele-traffic, etc. First of all, the small cells have to be carefully clustered for avoiding excessive interference using coordinated multi-point transmission. Moreover, the devices and users should be beneficially clustered for the sake of achieving a high energy efficiency, maintaining an optimal access point association, obeying an efficient offloading policy, and guaranteeing a high network security. Xia et al. formulated a mixed integer programming problem for jointly optimizing both the gateway deployment and the virtual-channel allocation for optical/wireless hybrid networks, and designed an efficient $K$-means clustering based solution for iteratively solving this problem, which beneficially reduced the delay, as well as improved the network throughput. Moreover, Hajjar et al. proposed a $K$-means based relay selection algorithm for creating small cells under the umbrella of an oversailing LTE macro cell within a multi-cell scenario under the constraint of low power clusters. Relying on the proposed relay selection algorithm, the total capacity was increased by reusing the frequency in each low power cluster, which had the benefit of supporting high data rate services. Additionally, Cabria and Gondra proposed a so-called potential-$K$-means scheme for partitioning data collection sensors into clusters and then for assigning each cluster to a storage center. The proposed $K$-means solution had the advantage of both balancing the storage center loads and minimizing the total network cost (optimizing the total number of sensors). Parwez et al. invoked both $K$-means clustering and hierarchical clustering algorithms for their user-activity analysis and user-anomaly detection in a mobile wireless network, which verified the genuine identity of users in the face of their dynamic spatio-temporal activities. Furthermore, El-Khatib designed a $K$-means classifier for selecting the optimal set of features of the MAC layer bearing in mind the specific relevance of each feature, which beneficially improved the accuracy of intrusion detection, despite reducing the learning complexity.
Clustering can also be used in signal detection for the sake of both reducing the detection complexity and improving the energy efficiency attained. Specifically, the $K$-means clustering algorithm was invoked in a blind transceiver, where the training process was completely dispensed with at the transmitter for reducing its energy dissipation, since no pilot power was required. Furthermore, Zhao et al. conceived an efficient $K$-means clustering algorithm for optical signal detection in the context of burst-mode data transmission.
IV-B EM and Its Applications
The EM algorithm is an iterative method conceived for searching for the maximum likelihood estimate of parameters in a statistical model. Typically, in addition to unknown parameters whose existence has been ascertained, the statistical model also has some latent variables. In this scenario it is an open challenge to derive a closed-form solution, because we are unable to find the derivatives of the likelihood function with respect to all the unknown parameters and latent variables. The iterative EM algorithm consists of two steps, as shown in Fig. 12. During the expectation step (E-step), it calculates the expected value of the log likelihood function conditioned on the given parameters and latent variables, while in the maximization step (M-step), it updates the parameters by maximizing the specific log-likelihood expectation function considered.
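More explicitly, upon considering a statistical model with observable variables $X$ and latent variables $Z$, the unknown parameters are represented by $\theta$. The log-likelihood function of the unknown parameters is given by:
$$L(\theta; X) = \ln p(X \mid \theta) = \ln \sum_{Z} p(X, Z \mid \theta). \tag{19}$$
Hence, the EM algorithm can be described as follows:
E-step: Calculate the expected value of the log-likelihood function under the current estimate $\theta^{(t)}$ of $\theta$, i.e.
$$Q(\theta \mid \theta^{(t)}) = \mathbb{E}_{Z \mid X, \theta^{(t)}} \left[ \ln p(X, Z \mid \theta) \right]. \tag{20}$$
M-step: Maximize Eq. (20) with respect to $\theta$ for generating an updated estimate of $\theta$, which can be formulated as:
$$\theta^{(t+1)} = \arg\max_{\theta} Q(\theta \mid \theta^{(t)}). \tag{21}$$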
The EM algorithm plays a critical role in parameter estimation based on many of the popular statistical models, such as the Gaussian mixture model (GMM), hidden Markov model (HMM), etc. which are beneficial both for clustering and prediction.
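As an illustration of the E-step and M-step, the sketch below fits a one-dimensional two-component GMM, where the latent variable is the unobserved component that generated each sample. The function name `em_gmm_1d`, the synthetic data and the starting values are assumptions chosen for the example; the M-step uses the standard closed-form GMM updates.

```python
import numpy as np

def em_gmm_1d(x, mu, sigma, weights, n_iters=100):
    """Fit a 1-D Gaussian mixture by EM, starting from the given parameters."""
    mu, sigma, weights = (np.array(v, dtype=float) for v in (mu, sigma, weights))
    for _ in range(n_iters):
        # E-step: posterior probability (responsibility) of each component
        # for each sample under the current parameter estimate.
        dens = np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
        resp = weights * dens
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: closed-form maximizers of the expected log-likelihood.
        n_k = resp.sum(axis=0)
        weights = n_k / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / n_k
        sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / n_k)
    return mu, sigma, weights

# Synthetic data (assumption): 30% of samples from N(-2, 0.5^2), 70% from N(3, 1).
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 0.5, 300), rng.normal(3.0, 1.0, 700)])
mu, sigma, weights = em_gmm_1d(x, mu=[-1.0, 1.0], sigma=[1.0, 1.0], weights=[0.5, 0.5])
```

A fixed iteration count is used for brevity; monitoring the change of the log-likelihood between iterations is the more usual stopping rule.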
The EM algorithm can be readily invoked for a variety of parameter learning and estimation problems routinely encountered in wireless networks. Specifically, Wen et al. estimated both the channel parameters of the desired links in a target cell and those of the interfering links in the adjacent cells relying on constructing a GMM, which was estimated with the aid of the EM algorithm. Choi et al. modeled the cognitive radio system as an HMM, where the secondary users (SUs) estimated the channel parameters such as the primary user’s (PU) sojourn time, signal strength, etc. based on the standard EM algorithm. Moreover, Assra et al. also adopted the EM algorithm to jointly estimate the unknown channel frequency-domain responses as well as the noise variance and detected the PU’s signal in a cooperative wide-band cognitive system, which was shown to converge to the upper bound solution based on maximum likelihood estimation under the idealized assumption of having perfect channel parameter estimation. Additionally, Zhang et al. proposed an EM aided joint symbol detection and channel estimation algorithm for MIMO-OFDM systems in the presence of frequency selective fading, which provided a distribution-estimate for both the hidden symbols and the unknown channel parameters in an iterative manner. Li and Nehorai built an asynchronous state-space model for connecting asynchronous observations with the most likely target state transition in the context of multi-sensor WSNs. Then, they adopted the EM algorithm for jointly estimating the sequential target state as well as the network’s synchronization state under the assumption of knowing the temporal order of sensor clocks. Furthermore, Zhang et al. used a variational EM iterative algorithm to recover the transmitted signals and to identify the active users in low-activity code division multiple access based M2M communications without the knowledge of the user activity factor.
The EM algorithm can also be invoked for target or source localization, which can be viewed as a joint sparse signal recovery and parameter estimation problem.
IV-C PCA & ICA and Their Applications
PCA and ICA constitute sophisticated dimensionality reduction methods in machine learning, which are capable of reducing both the computational complexity and the storage requirements.
PCA utilizes an orthogonal transformation for converting a set of potentially correlated features of the training samples into a set of uncorrelated features, which are termed as the “principal components”. The number of principal components is expected to be lower than the number of the original features of the training samples, which hence provide a more compact representation of the original samples. More explicitly, fewer principal components can be used for representing the original samples in the transformed domain. In PCA, the first principal component tends to have the largest variance, which indicates that it encapsulates the most information of the original features provided that these features were correlated. Similarly, each succeeding component tends to have the next highest variance. These principal components can be generated by invoking the eigenvectors of the normalized covariance matrix.
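Specifically, let us consider $N$ training samples of $\{x_1, \ldots, x_N\}$, where $x_n$ is composed of $d$ different features. Let us first pre-process the samples by normalizing their mean and variance. Given a unit vector $u$, $x_n^T u$ can be interpreted as the length of the projection of $x_n$ onto the direction $u$. The PCA attempts to maximize the variance of the projections, which is formulated as:
$$\max_{u:\, \|u\| = 1} \; \frac{1}{N} \sum_{n=1}^{N} \left( x_n^T u \right)^2 = u^T \left( \frac{1}{N} \sum_{n=1}^{N} x_n x_n^T \right) u. \tag{22}$$
Given the covariance matrix $\Sigma = \frac{1}{N} \sum_{n=1}^{N} x_n x_n^T$, the solution of problem (22) is given by the principal eigenvector of the covariance matrix $\Sigma$, i.e. the eigenvector associated with its largest eigenvalue. If we denote the top $k$ eigenvectors of $\Sigma$ by $u_1, \ldots, u_k$ with $k < d$, a dimensionality reduction expression of $x_n$ can be formulated as:
$$y_n = \left[ u_1^T x_n, u_2^T x_n, \ldots, u_k^T x_n \right]^T, \tag{23}$$
where $u_1^T x_n, \ldots, u_k^T x_n$ are the first $k$ principal components of the training samples.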
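The eigenvector-based construction can be sketched as follows. The helper name `pca_reduce` and the synthetic three-feature data set are assumptions for illustration; the samples are normalized per feature, the covariance matrix is eigendecomposed, and the samples are projected onto the leading eigenvectors.

```python
import numpy as np

def pca_reduce(x, k):
    """Project the rows of x onto the top-k eigenvectors of the covariance matrix."""
    # Pre-process: zero mean and unit variance per feature.
    z = (x - x.mean(axis=0)) / x.std(axis=0)
    cov = z.T @ z / len(z)
    # eigh handles symmetric matrices and returns eigenvalues in ascending order.
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1][:k]
    u = eigvecs[:, order]  # columns are the top-k eigenvectors
    return z @ u, eigvals[order]

# Synthetic data (assumption): three strongly correlated features driven by one factor t.
rng = np.random.default_rng(0)
t = rng.normal(size=500)
x = np.column_stack([t, 2 * t, -t]) + 0.1 * rng.normal(size=(500, 3))
y, top_vals = pca_reduce(x, k=1)
```

Here the variance of the single retained component equals the largest eigenvalue, which is close to the total variance of the three normalized features, so one component represents the data almost completely.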
By contrast, the ICA attempts to find a new basis for representing original samples that are assumed to be a linear weighted superposition of some unknown latent variables. It aims for decomposing multivariate variables into a set of additive subcomponents, which are non-Gaussian variables and are statistically independent from each other. As for the independent components, also termed as the latent variables, they exhibit the maximum possible “statistical independence”, which can be commonly characterized by either the minimization of their mutual information quantified in terms of the Kullback-Leibler divergence metric and the maximum entropy criterion, or by the maximization of what is termed in parlance as the non-Gaussianity relying on kurtosis and negentropy, for example.
Let us consider the linear noiseless ICA model in a simple example, where the multivariate training variables are denoted by $x = [x_1, \ldots, x_N]^T$. Its latent independent component vector is represented by $s = [s_1, \ldots, s_M]^T$. Each component of $x$ can be generated by a linearly weighted sum of independent components, i.e. we have $x_i = a_{i1} s_1 + \cdots + a_{iM} s_M$, where $a_{ij}$ is the weighting coefficient. The vectorial form of $x$ can be expressed as:
$$x = \sum_{j=1}^{M} s_j a_j, \tag{24}$$
where $a_j = [a_{1j}, \ldots, a_{Nj}]^T$. Furthermore, let $A = [a_1, \ldots, a_M]$. Then the original multivariate training variables can be rewritten as:
$$x = A s, \tag{25}$$
where the unknown matrix $A$ is referred to as the mixing matrix. ICA algorithms attempt to estimate both the mixing matrix $A$ and the independent component vector $s$ relying on setting up a cost function, which again, either maximizes the non-Gaussianity or minimizes the mutual information. Thus, we can recover the independent component vector by computing $s = W x$, where $W = A^{-1}$ is termed as the ‘unmixing’ matrix. Usually, we assume that $N = M$ and that the mixing matrix $A$ is a square-shaped matrix. Moreover, the a priori knowledge of the probability distribution of $s$ is beneficial in terms of formulating the cost function.
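One way of maximizing non-Gaussianity is the fixed-point FastICA iteration, sketched below for a square mixing matrix. FastICA is used here only as one representative ICA algorithm and is not named in the text above; the function name `fast_ica`, the tanh contrast function, the source distributions and the mixing matrix are all assumptions made for the example.

```python
import numpy as np

def fast_ica(x, n_iters=200, seed=0):
    """Estimate independent components from the mixtures x (one signal per row)."""
    rng = np.random.default_rng(seed)
    n = x.shape[0]
    # Centre and whiten, so the mixtures become uncorrelated with unit variance.
    x = x - x.mean(axis=1, keepdims=True)
    d, e = np.linalg.eigh(np.cov(x))
    z = (e / np.sqrt(d)) @ e.T @ x
    # Random orthonormal starting point for the unmixing matrix.
    w, _ = np.linalg.qr(rng.normal(size=(n, n)))
    for _ in range(n_iters):
        # Fixed-point update driven by the tanh contrast function.
        g = np.tanh(w @ z)
        w_new = g @ z.T / z.shape[1] - np.diag((1 - g ** 2).mean(axis=1)) @ w
        # Symmetric decorrelation keeps the rows of w orthonormal.
        u, _, vt = np.linalg.svd(w_new)
        w = u @ vt
    return w @ z

# Two non-Gaussian sources (assumption) and a mixing matrix unknown to the algorithm.
rng = np.random.default_rng(1)
s = np.vstack([rng.uniform(-1, 1, 5000), rng.laplace(size=5000)])
a = np.array([[1.0, 0.5], [0.3, 1.0]])
s_hat = fast_ica(a @ s)
# Absolute correlation between every estimated and every true source.
corr = np.abs(np.corrcoef(s_hat, s)[:2, 2:])
```

The sources are recovered only up to permutation, sign and scale, which is why the comparison relies on absolute correlations rather than on the raw waveforms.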
As for the application of PCA and ICA in wireless networks, Shi et al.  utilized PCA to extract the most relevant feature vectors from fine-grained subchannel measurements for improving the localization and tracking accuracy in an indoor location tracking system. Moreover, Morell et al.  designed an efficient data aggregation method for WSNs based on PCA amalgamated with a non-eigenvector projection basis, while keeping the reconstruction error below a pre-defined threshold. Quer et al.  exploited PCA for inferring the spatial and temporal features of a range of signals monitored by a WSN. Based on this they recovered the large original data set from a small observation set.
Additionally, Qiu  combined ICA with PCA in a smart grid scenario for recovering smart meter data, which were jointly capable of enhancing the transmission efficiency both by avoiding the channel estimation in each frame and by eliminating wide-band interference or jamming signals. A semi-blind received signal detection method based on ICA was proposed by Lei et al. , which additionally estimated the channel information of a multicell multiuser massive MIMO system. Moreover, Sarperi et al.  proposed an ICA based blind receiver structure for MIMO OFDM systems, which approached the performance of its idealized counterpart relying on perfect CSI. ICA was also used for digital self-interference cancellation in a full duplex system , which relied on a reference signal used for estimating the leakage into the receiver. More explicitly, in full duplex systems the high-power transmit signal leaks into the receiver through a nonlinear leakage path and drowns out the low-power received signal. Hence its cancellation requires at least dB interference rejection. Furthermore, in , the Boolean ICA concept was proposed based on the integration of Boolean functions of binary signals for inferring the activities of the underlying latent signal sources. Specifically, it was shown that given SUs, the activities of up to PUs can be determined.
|||gateway deployment||$K$-means||reduce delay and improve network throughput for optical/wireless hybrid networks|
|||relay selection||$K$-means||create small cells in an LTE macro cell with low power cluster constraint|
|||sensor partitioning||$K$-means||balance the load of storage centers and minimize the total network cost|
|||anomaly detection||$K$-means||verify spatio-temporal varying users’ genuineness relying on ground truth information|
|||intrusion detection||$K$-means||improve intrusion detection accuracy and reduce the learning complexity|
|||blind transceiver||$K$-means||not require pilot duration and pilot power for saving energy consumption|
|||signal detection||$K$-means||burst-mode data transmission with an unbalanced ratio of bits zero and bits one|
|||channel estimation||EM algorithm||construct a GMM to estimate channel parameters in both target cell and adjacent cells|
|||PU detection||EM algorithm||SUs estimate PU’s sojourn time and signal strength relying on an HMM|
|||channel state detection||EM algorithm||jointly estimate channel frequency responses, noise variance and PU’s signal|
|||symbol detection||EM algorithm||joint symbol detection and channel estimation for MIMO-OFDM systems|
|||network state detection||EM algorithm||jointly estimate the sequential target state and network synchronization state|
|||active user detection||EM algorithm||detect active user for the low-activity CDMA based M2M communications|
|||source localization||EM algorithm||formulate localization as a joint sparse signal recovery and parameter estimation problem|
|||indoor location||PCA||extract relevant feature vectors from fine-grained subchannel measurements|
|||data aggregation||PCA||limit the reconstruction error based on a non-eigenvector projection basis|
|||data recovery||PCA||exploit PCA to extract spatial and temporal features of real signals|
|||data recovery||ICA & PCA||enhance transmission efficiency by avoiding channel estimation and eliminating jamming signals|
|||channel estimation||ICA||differentiate and decode the received signal, and estimate the channel information|
|||blind receiver||ICA||yield an ideal performance close to that with perfect CSI|
|||interference cancellation||ICA||digital interference cancellation based on the reference signal from transmitter power amplifier|
|||signal detection||ICA||infer the activities of latent signal sources based on the Boolean functions|
V Reinforcement Learning in NGWN
Reinforcement learning deals with an agent interacting with its environment. Three specific aspects of reinforcement learning, namely the multi-armed bandit problem, the Markov decision process (MDP) and temporal-difference (TD) learning, can be very useful for NGWNs. Below, we explore these reinforcement learning algorithms and their applications in NGWNs.
V-A Multi-Armed Bandit and Its Applications
The multi-armed bandit technique, also called $K$-armed bandit, models a decision making problem, where an agent is faced with a dilemma of $K$ different actions. After each choice, the agent receives a reward relying on a stationary probability distribution that is associated with its decision. The agent attempts to maximize its expected total reward over a series of decision making rounds by striking a trade-off between consulting existing knowledge and acquiring new knowledge when optimizing its decisions. The action of referring to existing knowledge to make decisions is termed as “exploitation”, while the trial of acquiring new knowledge is referred to as “exploration”. Striking a trade-off between exploration and exploitation is also sought by other reinforcement learning algorithms, where exploitation is the plausible action for maximizing the expected reward within the current round, while exploration may produce a greater reward in the long run.
In a $K$-armed bandit model, $K$ possible actions, $a_1, \ldots, a_K$, yield different rewards associated with the unknowns of the problem at hand, which may have different distributions with mean values of $\mu_1, \ldots, \mu_K$, respectively. The agent iteratively chooses an action $a_t$ at the round $t$ and receives the corresponding reward of $r_t$. Up to the round $t$, the expected reward of an action $a$ can be expressed as $Q_t(a) = \mathbb{E}[r_t \mid a_t = a]$. Upon striking a balance between the exploration and the exploitation, we may arrive at a simple bandit algorithm as follows, for example. In each decision-making round, we greedily opt for the action $\arg\max_{a} Q_t(a)$ relying on the probability of $1 - \epsilon$, whilst riskily embarking on a random action selection based on the probability of $\epsilon$, where $\epsilon$ is the probability of a brave attempt for exploring new knowledge.
In contrast to the above-mentioned $\epsilon$-greedy bandit algorithm, there are also more complex bandit algorithms, such as the gradient aided bandit algorithm, associative-search bandit, non-stationary bandit, etc. Moreover, the multi-armed bandit problem can be extended into a multi-play and multi-armed bandit problem, where the reward of each agent depends on others’ actions, and each agent tries to find its optimal decision by predicting the future actions of the other agents relying on previous decision making strategies.
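The simple greedy-versus-random selection rule can be sketched as follows. The function name `epsilon_greedy`, the three mean rewards and the unit-variance Gaussian reward model are assumptions for illustration; the expected reward of each action is estimated by an incremental sample average.

```python
import numpy as np

def epsilon_greedy(true_means, epsilon=0.1, n_rounds=5000, seed=0):
    """Play a stationary bandit, exploring with probability epsilon in each round."""
    rng = np.random.default_rng(seed)
    k = len(true_means)
    q = np.zeros(k)                  # running estimate of each action's expected reward
    counts = np.zeros(k, dtype=int)  # how often each action has been chosen
    total = 0.0
    for _ in range(n_rounds):
        if rng.random() < epsilon:
            a = int(rng.integers(k))        # exploration: uniformly random action
        else:
            a = int(np.argmax(q))           # exploitation: best action known so far
        r = rng.normal(true_means[a], 1.0)  # assumption: unit-variance Gaussian rewards
        counts[a] += 1
        q[a] += (r - q[a]) / counts[a]      # incremental sample-average update
        total += r
    return q, counts, total

# Hypothetical mean rewards of three actions; the third one is the best.
q, counts, total = epsilon_greedy([0.2, 0.5, 1.5])
```

A larger exploration probability identifies the best action sooner but keeps spending more rounds on inferior actions afterwards, which is the trade-off described above.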
As mentioned before, multi-armed bandit based techniques are capable of dealing with the uncertainties arising in NGWNs owing to limited prior knowledge and the associated resource-thirsty feedback. Moreover, they are beneficial for modelling the selfishness of the users and the decision conflicts among them during the decision making process. Hence, the multi-armed bandit based algorithms have become powerful tools for rational decision making in wireless networks both for distributed users and APs as well as for the central control center. Specifically, Maghsudi et al. proposed a small cell activation scheme relying on the multi-armed bandit philosophy given only limited information about the available energy of the small cell BS as well as the number of users to be served. The overall heterogeneous network’s throughput was improved with the aid of an energy-efficient small cell on-off switching regime controlled by the macro BS, while the inter-interference level was reduced. Another compelling application of the multi-armed bandit regime in the heterogeneous network is constituted by the dynamic network selection in the context of uncertain heterogeneous network state information. Wu et al. formulated the optimal network selection problem as a continuous-time multi-armed bandit problem considering diverse traffic types. Moreover, the network access cost function and the QoE reward were defined as the metrics of evaluating the proposed network selection schemes. Furthermore, given the time-varying and user-dependent fading channels of wireless peer-to-peer (P2P) networks, a multi-armed bandit aided optimal distributed transmitter scheduling policy was conceived for multi-source multimedia transmission, which was beneficial in terms of maximizing the data transmission rate and reducing the related power consumption in light of the realistic energy constraints of wireless mobile devices.
In addition to transmitter scheduling, Maghsudi and Stańczak applied the covariate multi-armed bandit regime for solving the relay selection problem in the wireless network, where the geographical location of relay nodes was assumed to be known by the source node, but no knowledge was assumed about the corresponding fading gains. The proposed covariate multi-armed bandit model is capable of dealing with the exploitation-exploration dilemma of the relay selection process. Lee et al. proposed an $\epsilon$-greedy multi-armed bandit based framework for exploiting the gains provided by frequency diversity in Wi-Fi channels. They struck a trade-off between the achievable gain stemming from frequency diversity and the resource consumption imposed by channel estimation and coordination.
Given the open broadcast nature of the wireless channel environment and the access contention mechanism among multi-priority users, multi-armed bandit based techniques have played a special role in cognitive networks [203, 204, 205, 206, 207, 208]. For example, Zhao et al. formulated a multi-armed restless bandit model for opportunistic multi-channel access, which approached the maximum attainable throughput by accurately predicting which channel is likely to become idle next. A channel selection scheme was also investigated, which was capable of adapting to the link quality and hence finding the optimal channel for avoiding interference and deep fading. Moreover, Gwon et al. and Zhou et al. further considered the choice of access strategy in the presence of both legitimate desired users and jamming cognitive radio nodes, which was resilient to adaptive jamming attacks with different strengths spanning from near no-attack to full-attack across the entire spectrum. In contrast to only sensing and accessing a single channel, considering the correlated rewards of different arms, a sequential multi-armed bandit regime was conceived by Li et al. for identifying multiple channels to be sensed in a carefully coordinated order. Furthermore, Avner and Mannor studied multi-user coordination in cognitive networks, where each user’s successful channel selection relies on both the channel state as well as on the decisions of the other users.
V-A3 An Example
Visible light communication (VLC) systems have the compelling benefit of a wide unlicensed communication bandwidth as well as innate security in downlink (DL) transmission scenarios, hence they may find their way into the construction of NGWNs. However, considering the limited coverage and dense deployment of light-emitting diodes (LED), traditional network association strategies are not readily applicable to VLC networks. Hence by exploiting the power of online learning algorithms, in , the authors focused their attention on sophisticated multi-LED access point selection strategies conceived for hybrid indoor LiFi-WiFi communication systems with the aid of a multi-armed bandit model. Explicitly, since light-fidelity (LiFi) VLC transmissions are less suitable for uplink (UL) transmissions, a classic WiFi UL was used in this study.
To elaborate, in the indoor VLC system, the communication between the devices and the backbone network relies on the VLC DL as well as on the RF WiFi UL, which hence can be viewed as a hybrid LiFi-WiFi network. In the system model, it is assumed that there are $N$ low-energy LED lamps in the indoor space considered. Moreover, regardless of their positions, the mobile devices are capable of accessing any of the indoor LED lamps and of downloading packets from the Internet via VLC. When a decision round is due, the access control strategy obeys the decision probability distribution of $p = [p_1, \ldots, p_N]$, with $\sum_{n=1}^{N} p_n = 1$, where $p_n$ denotes the probability of accessing the $n$th LED lamp. Furthermore, the service time of each LED lamp obeys the negative exponential distribution with a departure rate $\mu$, while the interval between system access requests, in the same way, obeys the negative exponential distribution with an arrival rate $\lambda$. The VLC DL channel is characterized by a diffuse link, where the light beam is radiated within a certain angle. Thus, the indoor VLC channel can be modelled by combining the line of sight (LOS) path (Fig. 13 (a)) as well as a single one-hop reflected path (Fig. 13 (b)).
The expectation of the accumulated reward gap function is defined as the metric for characterizing the performance of our AP selection scheme, which represents the difference between the maximum theoretical reward and the actually acquired reward after sequential decision making experiments relying on the system’s decision probability distribution, which is formulated as.
where denotes the user rate associated with the th decision round in terms of the access decision at the instant , with being the actual access decision.
Furthermore, in  a pair of multi-armed bandit learning techniques, i.e. the ‘exponential weights for exploration and exploitation’ (EXP3) as well as the ‘exponentially-weighted algorithm with linear programming’ (ELP), were advocated for updating the AP-assignment decision probability distribution of each AP at each time instant for the sake of improving the link throughput based on the probability distribution of (26). More explicitly, in contrast to the trial-and-error EXP3 algorithm, the ELP based AP selection algorithm was constructed for taking into account both the partially observed conditions of the APs as well as the network topology.
The theoretical upper bound of the expected value of the accumulated reward gap function of the EXP3- and ELP-based multi-armed bandit learning algorithms was also derived in . In Fig. 14 and Fig. 15, the normalized throughput of the selected VLC links and of the whole system relying on the EXP3-based, ELP-based as well as on random LED AP selection schemes was compared. By contrast, the random selection scheme granted an identical decision probability of accessing any of the LEDs, namely , for each lamp at each decision-making time instant. It was assumed that the negative exponential departure probability of each downloading service was . Moreover, the initial state of the number of downloading services supported by each lamp was randomly chosen between . Upon increasing the number of decision rounds , the EXP3- and ELP-based selection schemes had a higher accumulated normalized throughput than random selection. Furthermore, relying on more neighbor observation information as well as by exploiting the connection of the LED lamps, the ELP-based AP-selection scheme was shown to outperform that based on EXP3.
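For readers unfamiliar with EXP3, the sketch below shows the textbook form of the algorithm applied to a generic access-point selection task. It is not the scheme evaluated in the study discussed above: the function name `exp3`, the four hypothetical lamps, their success probabilities and the Bernoulli reward model are assumptions made purely for illustration.

```python
import numpy as np

def exp3(reward_fn, k, n_rounds, gamma=0.1, seed=0):
    """Textbook EXP3: exponential weights with importance-weighted reward estimates."""
    rng = np.random.default_rng(seed)
    weights = np.ones(k)
    choices = np.zeros(k, dtype=int)
    for _ in range(n_rounds):
        # Mix the weight-based distribution with uniform exploration.
        probs = (1 - gamma) * weights / weights.sum() + gamma / k
        arm = int(rng.choice(k, p=probs))
        reward = reward_fn(arm, rng)      # rewards must lie in [0, 1]
        estimate = reward / probs[arm]    # unbiased estimate of the arm's reward
        weights[arm] *= np.exp(gamma * estimate / k)
        weights /= weights.max()          # rescaling leaves probs unchanged, avoids overflow
        choices[arm] += 1
    return probs, choices

# Hypothetical per-lamp probabilities that a download in a given round succeeds.
success_prob = [0.2, 0.4, 0.8, 0.3]
probs, choices = exp3(lambda arm, rng: float(rng.random() < success_prob[arm]),
                      k=4, n_rounds=5000)
```

Only the reward of the selected arm is observed in each round, which is the trial-and-error behaviour contrasted above with schemes that also exploit side observations.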
V-B MDP & POMDP and Their Applications
The classic Markov decision process (MDP)  constitutes a framework of making decisions in the context of a discrete-time stochastic environment of Markov state transitions, which provides the decision maker with the optimal actions to opt for at each state. It has been used in a wide range of disciplines, especially in automatic control . The goal of the decision maker, generally speaking, is to maximize the cumulative reward received over a long run and to find the corresponding optimal policy which represents a mapping from each state to the specific probabilities of choosing each legitimate action.
In an MDP model, the system’s state transition follows the Markovian property, where the system’s response at time epoch $t+1$ depends exclusively on the current state and on the agent’s action at time epoch $t$. Mathematically, at time epoch $t$, the system is in a certain state $s_t$, where the agent selects a legitimate action $a_t$ that is available in the state $s_t$. As a result, the system then acts at the next time epoch by moving into a new state $s_{t+1}$ relying on the system’s state transition probability of $P(s_{t+1} \mid s_t, a_t)$. At the same time, the decision maker receives the corresponding reward $r(s_t, a_t)$. The associated value function $V^{\pi}(s)$ is then defined for quantifying how well the agent carries out its actions over a long run commencing from the initial state $s_0 = s$, which can be formulated as:
$$V^{\pi}(s) = \mathbb{E}_{\pi} \left[ \sum_{t=0}^{\infty} \gamma^{t} r(s_t, a_t) \,\Big|\, s_0 = s \right], \tag{27}$$
where $\gamma$ represents the discount factor and the mapping $\pi(a \mid s)$ represents the probability of opting for action $a$ in the state $s$. Hence, the optimal policy can be formulated by maximizing the value function considered, i.e. we have $\pi^{*} = \arg\max_{\pi} V^{\pi}(s)$. The maximization of the value function can be reformulated as an iterative equation with the aid of Bellman’s optimality theorem, which is given by:
$$V^{*}(s) = \max_{a} \left[ r(s, a) + \gamma \sum_{s'} P(s' \mid s, a) V^{*}(s') \right]. \tag{28}$$
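Bellman's optimality recursion can be solved numerically by value iteration, sketched below for a tabular MDP. The function name `value_iteration`, the array layout and the two-state toy problem are assumptions for illustration: action 0 keeps the current state, action 1 switches to the other state, and a unit reward is earned only by staying in state 1.

```python
import numpy as np

def value_iteration(p, r, gamma=0.9, tol=1e-8):
    """p[a, s, s2]: transition probabilities; r[s, a]: immediate rewards."""
    v = np.zeros(p.shape[1])
    while True:
        # Bellman backup: q[s, a] = r[s, a] + gamma * sum over s2 of p[a, s, s2] * v[s2]
        q = r + gamma * (p @ v).T
        v_new = q.max(axis=1)
        if np.max(np.abs(v_new - v)) < tol:
            return v_new, q.argmax(axis=1)
        v = v_new

# Toy MDP (assumption): action 0 = stay, action 1 = switch state.
p = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.0, 1.0], [1.0, 0.0]]])
r = np.array([[0.0, 0.0],
              [1.0, 0.0]])
v, policy = value_iteration(p, r)
```

With a discount factor of 0.9, staying in state 1 forever is worth 1 / (1 - 0.9) = 10, while state 0 is worth 0.9 × 10 = 9 because one round is spent switching, so the greedy policy is to switch in state 0 and to stay in state 1.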
By contrast, as an extension of MDP, the partially observable Markov decision process (POMDP) only relies on partial knowledge about the hidden Markov system, which is eminently suitable for scenarios where the agent cannot directly observe the underlying system’s state transitions. Hence, the agent has to constitute belief states and the associated belief transition function by relying on a set of observations instead of the real system states. In a nutshell, the POMDP framework can be formulated as a quintuple of $\langle S, B, A, T, R \rangle$, i.e.
System’s State $S$: The system’s state $s \in S$ represents the system’s legitimate state;
Belief State $B$: The belief state $b \in B$ benchmarks the degree of the similarity between each of the system’s legitimate states and the state estimated by the agent;
Action $A$: The action $a \in A$ denotes the specific action that can be selected in the given state;
Belief Transition Function $T$: The belief transition function $T(b' \mid b, a)$ represents the probability of the belief state traversing from $b$ to $b'$ conditioned on selecting action $a$;
Reward Function $R$: The reward function $R(b, a)$ quantifies the immediate reward received by performing the selected action.
Similarly, the optimal policy can be obtained by solving the optimization problem of:
$$\pi^{*} = \arg\max_{\pi} \mathbb{E}_{\pi} \left[ \sum_{t=0}^{\infty} \gamma^{t} R(b_t, a_t) \,\Big|\, b_0 \right], \tag{29}$$
where $b_t$ and $a_t$ denote the belief state and the action at time epoch $t$, while $\gamma$ is the discount factor.
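How a belief state evolves after an action and an observation can be sketched with the standard Bayesian belief update. The explicit observation model `obs` is an assumption, since the text above only refers to "a set of observations"; the function name `belief_update` and the two-state numbers are likewise hypothetical.

```python
import numpy as np

def belief_update(belief, action, observation, trans, obs):
    """Bayes update with trans[a, s, s2] = P(s2 | s, a) and obs[a, s2, o] = P(o | s2, a)."""
    # Prediction: push the current belief through the state transition model.
    predicted = belief @ trans[action]
    # Correction: weight each state by the likelihood of the received observation.
    unnormalized = obs[action, :, observation] * predicted
    return unnormalized / unnormalized.sum()

# Hypothetical two-state system with a single action and two possible observations.
trans = np.array([[[0.9, 0.1],
                   [0.2, 0.8]]])
obs = np.array([[[0.8, 0.2],
                 [0.3, 0.7]]])
b0 = np.array([0.5, 0.5])
b1 = belief_update(b0, action=0, observation=1, trans=trans, obs=obs)
```

Starting from a uniform belief, the predicted belief is [0.55, 0.45]; since observation 1 is more likely in state 1, the updated belief moves towards state 1 and becomes [0.11, 0.315] / 0.425.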
As another important family of decision-making tools, MDP/POMDP solutions differ from the multi-armed bandit solutions in that they first have to model the environment relying on either fully or partially observed knowledge. To elaborate a little further, Massey et al. proposed an MDP based downlink service scheduling policy for wireless service providers. Considering the time-sensitive nature of wireless tele-traffic patterns, their proposed scheduling policy was capable of maximizing the expected reward for the wireless service provider in the context of a multiplicity of services. Tang et al. resorted to the MDP approach for enhancing a basic node-misconduct detection method, where a novel reward-penalty function was defined as a function of both correct and wrong decisions. The resultant adaptive node-misconduct detector maximized this reward-penalty function in diverse network states. Moreover, Kong et al. conceived a discrete-time MDP (DTMDP) aided mechanism for dynamically activating and deactivating certain resources of the BS in the context of time-varying network traffic. More explicitly, at each decision round, the DTMDP had the option of activating a new resource module, deactivating the currently active resource module or performing no operation. The proposed switching mechanism reduced the power consumption, i.e. improved the energy efficiency at the BS.
As a further development, relying on the POMDP paradigm, Tseng et al.  designed a cell selection scheme for improving the network’s capacity, where the full cell loading status was not observable. Hence, their scheme predicted the unavailable cell loading information from a set of non-serving base stations and then took actions for improving the various performance metrics, including the system’s capacity, the handover time as well as the mobility management as a whole. Moreover, the belief state was defined for representing the state uncertainty in terms of the statistical probability of a cell’s specific loading state. The simulation results of Tseng et al.  showed that their solution outperformed the conventional signal-strength aided and load-balancing based methods. In order to save the energy of sensors, Fei et al.  proposed a POMDP aided K-sensor scheduling policy, which guaranteed the sensors’ high-quality coverage and reduced the total energy consumption. Similarly, by striking a trade-off between the detection performance and energy consumption, Zois et al.  designed a POMDP aided sensor node selection scheme for WBANs by maximizing the system’s lifetime as well as optimizing the physical state detection accuracy. The main goal of the sensor node selection was to devise a schedule under which the sensors alternated between the active state and the dormant state relying on the specific network activity. Relying on the decentralized POMDP (DEC-POMDP), Pajarinen et al.  proposed a MAC solution, which promptly adapted both to the spatial and temporal opportunities facilitated by the wireless network dynamics. This yielded an increased throughput and reduced latency compared to the traditional carrier-sense multiple access with collision avoidance (CSMA/CA) methods. Here, the POMDP tackled the uncertainty both in the environment’s evolution and in the associated inaccurate observations.
Thanks to the cross layer optimization employed, more information can be gleaned from the lower layers for enhanced network condition estimation. Then, Xie et al.  used the POMDP model for solving the frame size selection problem of the ubiquitous transmission control protocol (TCP) with the objective of improving the total estimated throughput by striking a tradeoff between the contention probability and back-off time based on the current network condition. Furthermore, Michelusi and Mitra  conceived a cross-layer framework for jointly optimizing the spectrum-sensing and access processes of cognitive wireless networks with the objective of maximizing the throughput of the SU under a strict constraint on the maximal performance degradation imposed on the PU. Furthermore, the high complexity of the POMDP formulation was mitigated by a low-dimensional belief representation, which was achieved by minimizing the Kullback-Leibler divergence defined in .
V-B3 An Example
As shown in Fig. 16, the ‘super-WiFi’ network concept was originally proposed for nationwide Internet access in the USA. However, the traditional mains power supply is not necessarily ubiquitous in this large-scale wireless network. Furthermore, the non-uniform geographic distribution of both the BSs and of the tele-traffic requires carefully considered user-association. Relying on the rapidly developing energy harvesting techniques, in , a POMDP-based access point selection strategy was conceived for an energy harvesting aided super-WiFi network.
It was assumed that both the battery states and the user access states were completely observable. However, in practice the solar radiation intensity changes over the course of a year and is influenced by the weather conditions. Furthermore, the radiation sensors have a limited sampling rate, which makes it hard to simultaneously record the solar radiation intensity and to accurately estimate the system’s battery state. Fortunately, relying on historical solar radiation observation data provided by the University of Queensland, Australia , within a short period of time, say an hour, the real-time harvested solar power can be modeled as the sum of a component that is constant for that hour and a small perturbation. Moreover, multiple factors, such as the effective irradiation area, the clouds’ distribution, the sensors’ operating status, etc. may independently affect the harvested power. Relying on the central-limit theorem, the perturbation can be regarded as being Gaussian distributed. Hence, the harvested power itself follows a Gaussian distribution, whose parameters can be learned from the harvested data set.
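The Gaussian model of the harvested power may be fitted as sketched below. This is only a minimal illustration: the samples are synthetic stand-ins for the measured data set, and the names `fit_gaussian`, `mu_hat` and `sigma_hat` as well as the numerical values are assumptions made here.

```python
import random
import statistics


def fit_gaussian(samples):
    """Maximum-likelihood estimates of the mean and standard deviation."""
    mu = statistics.fmean(samples)
    sigma = statistics.pstdev(samples, mu)
    return mu, sigma


# Synthetic hour of harvested-power samples: a constant level plus a small
# Gaussian perturbation (both values are hypothetical).
rng = random.Random(0)
constant_level, perturbation_std = 5.0, 0.2
samples = [constant_level + rng.gauss(0.0, perturbation_std)
           for _ in range(5000)]

mu_hat, sigma_hat = fit_gaussian(samples)
```

With enough samples, `mu_hat` approaches the hourly constant level, while `sigma_hat` approaches the spread of the perturbation.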
Moreover, a queue-based user-association state model as well as a dynamic battery state model was established. Hence, the system’s state having APs is constituted by both the user-association states as well as by the battery states. Let denote the user-association states, while represent the AP battery states, where and . Furthermore, the super-WiFi system state can be written as a -element vector , which includes both the APs’ user-association states and the APs’ battery states. Assuming the independence of each AP’s two sub-states, the system’s state transition probability can be expressed as:
where represents the users’ actions in terms of which available APs they request association with.
Since the requesting users only have partial knowledge of the entire super-WiFi system’s state, relying on the above definitions and hypotheses, we construct the POMDP decision-making model in terms of a quintuple of as mentioned above. The POMDP formulation can be reduced to a belief MDP with the aid of the belief state vector. Therefore, the expected reward of the system relying on strategy after an infinite number of time slots can be written as:
where is the initial system state, while is the belief state vector reflecting the grade of similarity between the current estimated state and the legitimate system state . Moreover, is the immediate reward of the system and represents the discount rate. Then, the optimal strategy can be constructed by invoking dynamic programming aided iterative algorithms for maximizing the expected reward function.
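For orientation, a discounted infinite-horizon expected reward of a belief MDP is commonly written in the following generic form. The symbols used here ($\pi$ for the strategy, $b_t$ for the belief state vector at time slot $t$, $r$ for the immediate reward and $\gamma$ for the discount rate) are assumed notation and may differ from that of the cited work.

```latex
V^{\pi}(b_0) = \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^{t}\, r\bigl(b_t, \pi(b_t)\bigr) \,\middle|\, b_0 \right],
\qquad
\pi^{*} = \arg\max_{\pi} V^{\pi}(b_0),
\qquad 0 \le \gamma < 1 .
```

The condition $\gamma < 1$ keeps the infinite sum finite whenever the immediate reward is bounded.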
Bearing in mind the large values of , and , as well as the users’ rapidly fluctuating arrival rate and departure rate , obtaining the optimal POMDP solution may face the curse of dimensionality. In order to reduce the computational complexity, a suboptimal algorithm was proposed in . Explicitly, Algorithm 2 of  aimed for maximizing the expectation of the system’s energy function, which was defined as:
where represents the residual energy of AP , while is its energy harvested under the assumption that the harvested power level remains quasi-static during the information transmission interval and denotes the energy consumption. Finally, is the capacity of the AP’s battery.
The efficiency of the AP selection algorithms proposed in  was compared in terms of the system’s access efficiency defined as , where is the total number of successful access attempts during the entire simulation time . In Fig. 17 and Fig. 18, multiple APs () are considered with the maximum number of admitted users being , while having a maximum number of battery states given by . Moreover, the departure rate is . We may conclude from Fig. 17 that a highly loaded system makes the carrier-sense multiple access with collision detection (CSMA/CD) method almost useless, when the users’ arrival rate reaches a certain value. As shown in Fig. 18, where , the system’s access efficiency recorded for all the AP selection algorithms only increases with the solar radiation intensity in a relatively small range. However, the performance of the CSMA/CD, CSMA/CA (see Footnote 7), as well as of the random selection algorithm remains unchanged, regardless of the increase in solar radiation intensity. Moreover, the suboptimal Algorithm 2 of  is capable of outperforming the POMDP method at a strong solar radiation intensity, which may be deemed to be the result of the approximations and hypotheses inherent in the POMDP model.
Footnote 7: Strictly speaking, the CSMA/CD and CSMA/CA in this paper are different from the Ethernet’s data link layer protocols. Here, both of them represent the access control mechanisms. We use the same acronyms CSMA/CD and CSMA/CA for convenience.
V-C Temporal Difference Learning and Its Applications
Temporal-difference (TD) learning is a model-free reinforcement learning method, which is capable of directly gleaning knowledge from raw experience without a model of the environment and without waiting for a delayed reward. It can typically be viewed as a combination of Monte Carlo methods and of dynamic programming. More specifically, it samples the environment like the Monte Carlo methods, and then updates the corresponding parameters relying on current estimates, as dynamic programming does. In contrast to the Monte Carlo methods, TD learning operates in an on-line fashion by relying on the result of a single time step, rather than waiting for the final outcome at the end of an episode. Moreover, it has an advantage over the dynamic programming methods, since it does not require a model of the state transition probabilities. TD learning can be readily invoked for finding an optimal action policy for any finite MDP associated with an unknown system model. Fig. 19 shows the difference between the MDP, POMDP and TD learning.
A pair of popular representatives of the TD learning family are Q-learning and the “state-action-reward-state-action” (SARSA) technique. SARSA interacts with the environment and updates the state-action value function, namely the Q-function, based on the action it actually takes. In contrast to SARSA, Q-learning updates the Q-function relying on the maximum value offered by any of the actions available in the next state. Specifically, the update of the Q-function in SARSA can be formulated as :
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma\, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right],$$
while in Q-learning, the update of the Q-function can be cast as :
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a \in \mathcal{A}} Q(s_{t+1}, a) - Q(s_t, a_t) \right],$$
where $s_t$ represents the system’s state and $a_t$ is the action selected by the agent at time step $t$, $r_{t+1}$ is the resultant immediate reward, whilst $\mathcal{A}$ represents the available set of actions. Moreover, $\alpha$ is the update weighting coefficient and $\gamma$ denotes the discount factor. As for the convergence analysis, SARSA is capable of converging with probability 1 to an optimal policy as well as to an optimal state-action value function, provided that all the state-action pairs are visited a sufficiently high number of times. By contrast, since Q-learning decouples the action actually taken from the action used for updating the Q-function, it tends to converge earlier than SARSA .
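The difference between the two update rules can be sketched in a few lines of tabular code. The function names, the dictionary-based Q-table and the toy numbers below are illustrative assumptions; only the update logic follows the standard SARSA and Q-learning definitions.

```python
from collections import defaultdict


def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """On-policy update: bootstrap on the action actually chosen next."""
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])


def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Off-policy update: bootstrap on the best action in the next state."""
    target = r + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])


actions = ["left", "right"]

# Two identical Q-tables, so that the two rules can be compared directly.
Q_sarsa = defaultdict(float, {("s1", "left"): 1.0, ("s1", "right"): 3.0})
Q_qlearn = defaultdict(float, {("s1", "left"): 1.0, ("s1", "right"): 3.0})

# The agent moves from s0 to s1, receives reward 1 and then picks "left".
sarsa_update(Q_sarsa, "s0", "right", 1.0, "s1", "left")
q_learning_update(Q_qlearn, "s0", "right", 1.0, "s1", actions)
```

Here SARSA bootstraps on the value of the exploratory action "left", whereas Q-learning bootstraps on the larger value of "right", so the two tables diverge after a single step.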
As a benefit of being free from modeling the environment, TD learning is capable of providing competent decisions even in unknown environments. Table III summarizes a variety of compelling applications found in wireless networks for both SARSA and Q-learning along with their brief description.
|Paper||Method||Scenario||Application & Description|
|||reduced-state SARSA||cellular network||dynamic channel allocation considering both mobile traffic and call handoffs.|
|||on-policy SARSA||CR network||distributed multiagent sensing policy relying on local interactions among SU|
|||on-policy SARSA||MANET||energy-aware reactive routing protocol for maximizing network lifetime|
|||on-policy SARSA||HetNet||resource management for maximizing resource utilization and guaranteeing QoS|
|||approximate SARSA||P2P network||energy harvesting aided power allocation policy for maximizing the throughput|
|||Q-learning||WBAN||power control scheme to mitigate interference and to improve throughput|
|||Q-learning||OFDM system||adaptive modulation and coding not relying on off-line training from PHY|
|||Q-learning||cooperative network||efficient relay selection scheme meeting the symbol error rate requirement|
|||decentralized Q-learning||CR network||aggregated interference control without introducing signaling overhead|
|||convergent Q-learning||WSN||sensors’ sleep scheduling scheme for minimizing the tracking error|
VI Deep Learning in NGWN
VI-A Deep Artificial Neural Networks and Their Applications
Artificial neural networks constitute a set of algorithms conceived by imitating the interaction between neurons in the human brain, which are designed to extract features for clustering and classification tasks.
In a common artificial neural network (ANN) model , the input of each artificial neuron is a real-valued signal, and the output of each artificial neuron is calculated by a non-linear function of the weighted sum of its inputs. The connections between artificial neurons are typically associated with weights, which are adjusted during the learning process and scale the strength of the signal conveyed by each connection. Moreover, artificial neurons are organized in the form of layers. Different layers perform different kinds of transformations of their inputs. Basically, input signals travel from the first layer to the last layer, possibly via multiple hidden layers.
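The layered computation described above may be sketched as follows. The sigmoid activation, the bias terms and all weight values are assumptions chosen for illustration, since the text does not prescribe a specific activation function.

```python
import math


def neuron(inputs, weights, bias):
    """Non-linear function (here a sigmoid) of the weighted sum of inputs."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def dense_layer(inputs, weight_rows, biases):
    """One layer: every neuron sees all the outputs of the previous layer."""
    return [neuron(inputs, w, b) for w, b in zip(weight_rows, biases)]


def forward(inputs, layers):
    """Signals travel from the first layer to the last one without loops."""
    signal = inputs
    for weight_rows, biases in layers:
        signal = dense_layer(signal, weight_rows, biases)
    return signal


# Hypothetical network: two inputs, one hidden layer of two neurons and a
# single output neuron.
layers = [
    ([[0.5, -0.5], [1.0, 1.0]], [0.0, -1.0]),  # hidden layer
    ([[1.0, -1.0]], [0.0]),                    # output layer
]
out = forward([1.0, 1.0], layers)
```

Training would adjust the entries of `layers`; this sketch only covers the forward pass.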
The deep neural network (DNN) is characterized by multiple hidden layers between the input and output layers as shown in Fig. 20 (a), which is capable of modeling complex relationships of the processed data with the aid of multiple non-linear transformations. In a DNN, the provision of extra layers facilitates the composition of features from lower layers, which is beneficial in terms of more accurately modeling complex data than a ‘shallow’ network having a single hidden layer. Furthermore, a DNN may be viewed as a type of feed-forward network, where the processed data flows in the direction from the input layer to the output layer without looping back. Given the recent impressive applications of DNNs, their convergence behavior emerges as an important subject in machine learning.
By contrast, in a recurrent neural network (RNN) a neuron in one layer is capable of connecting to the neurons in previous layers. Therefore, an RNN is capable of exploiting the dynamic temporal information hidden in a time sequence and it exploits its “memory” inherited from previous layers for processing the future inputs as shown in Fig. 20 (b). Popular algorithms used for training the RNN include the real time recurrent learning technique of , the causal recursive backpropagation algorithm of , the backpropagation through time algorithm of , etc.
The convolutional neural network (CNN) is a class of feed-forward deep artificial neural networks relying on the so-called weight-shared architecture and translation invariance characteristics, which hence only requires modest preprocessing. As seen in Fig. 20 (c), a basic CNN architecture is composed of an input layer, an output layer as well as multiple hidden layers, which typically comprise convolutional layers, pooling layers and fully connected layers. More particularly, the convolutional layers invoke a convolution operation, also termed the cross-correlation operation, which generates a multi-dimensional feature map relying on a number of so-called filters. The CNN has been successfully used in both image and video recognition [242], recommender systems , etc. Fig. 20 contrasts the basic architecture of DNN, RNN and CNN, respectively.
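The weight-sharing idea behind the convolutional layer can be illustrated by a one-dimensional toy example. The helper names `cross_correlate_1d` and `max_pool_1d`, the edge-detecting filter and the input signals are assumptions made here; practical CNNs operate on multi-dimensional inputs with learned filters.

```python
def cross_correlate_1d(signal, kernel):
    """Slide one shared filter over the signal ('valid' positions only)."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]


def max_pool_1d(features, size):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [max(features[i:i + size])
            for i in range(0, len(features) - size + 1, size)]


edge_filter = [-1, 1]  # responds to rising (+1) and falling (-1) edges

pulse = [0, 0, 1, 1, 0, 0]
shifted_pulse = [0, 0, 0, 1, 1, 0]

feature_map = cross_correlate_1d(pulse, edge_filter)
shifted_map = cross_correlate_1d(shifted_pulse, edge_filter)
pooled = max_pool_1d(feature_map, 2)
```

Because the same filter weights are reused at every position, shifting the input pulse by one sample simply shifts the feature map by one sample.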
In this subsection, we will consider the benefits of deep artificial neural network algorithms in a variety of wireless networking scenarios. As mentioned before, deep artificial neural networks are capable of capturing the non-linear and often dynamically varying relationship between the inputs and outputs. Hence they have a powerful prediction, inference and data analysis capability by exploiting the vast amount of data generated both by the environment and by the users. As for learning from the environment, we are able to harness DNNs trained by the data gleaned over the air for the sake of channel estimation , interference identification , localization [246, 247, 248, 249, 250], etc. By contrast, with regard to learning from the users or devices, DNN algorithms can also be used for predicting the users’ behaviors, such as their content interests , mobility patterns , etc. in order to beneficially design the dynamic content caching of BSs and to efficiently allocate wireless resources, for example.
Traditional signal processing approaches supported by statistics and information theory in communication systems substantially rely on accurate and tractable mathematical models. Unfortunately, however, practical communication systems may have a range of imperfections and non-linear factors, which are difficult to model mathematically. Given that DNN algorithms do not require a tractable model, they are capable of remedying the imperfections in the physical layer by learning both from the environment and from previous inputs relying on a specific hardware configuration. To elaborate, Ye et al.  proposed a DNN aided channel estimation method for learning the wireless channel characteristics, such as the nonlinear distortion, interference and frequency selectivity. The DNN aided channel estimation method was shown to be more robust than traditional methods, especially in the context of having fewer training pilots, in the absence of cyclic prefix, as well as in the face of nonlinear clipping noise. Apart from estimating the channel characteristics, DNNs can also be used for classifying modulated signals in the physical layer. Rajendran et al.  conceived a data-driven automatic modulation classification (AMC) scheme hinging on the long short term memory (LSTM) aided RNN, which captured the time-domain amplitude and phase information of modulation schemes carried in the training data without expert knowledge. Their simulations showed that the novel AMC had an average classification accuracy of about in the context of time-varying SNR ranging from 0dB to 20dB. As for signal detection, Farsad and Goldsmith  developed a deep learning aided signal detector, where the transmitted signal can be efficiently estimated from its corrupted version observed at the receiver.
The detector was trained relying on known transmitted signals, but without any knowledge of the underlying wireless channel model and estimated the likelihood of each symbol, which was beneficial for carrying out soft decision error correction afterwards. In the application of interference identification, Schmidt et al.  proposed a -feature-map assisted CNN based wireless interference identification scheme. The CNN model learned the relevant features through self-optimization during the GPU based training process, which was first designed in . By carefully considering the realistic capability of wireless sensors, the model relied on the time- and frequency-limited sensing snapshots having the duration of 12.8 as well as the bandwidth of 10MHz. The proposed CNN based wireless interference identifier was shown to have a higher identification accuracy than the state-of-the-art schemes in the context of low SNRs, such as dB, for example.
Furthermore, we can use DNNs for modelling the entire physical layer of a communication system without any classic components such as source coding, channel coding, modulation, equalization, etc. In , O’Shea et al. used a DNN to represent a simple communication system with one transmitter and one receiver that can be trained as a so-called auto-encoder without knowing the accurate channel model. Moreover, a CNN algorithm was conceived for modulation classification based on both sampled radio frequency time-series data and expert knowledge integrated by radio transformer networks (RTN). Additionally, O’Shea et al.  extended the DNN aided auto-encoder to a single user MIMO communication scenario, where the physical layer encoding and decoding processes were jointly optimized as a single end-to-end self-learning task. Their simulation results showed that the auto-encoder based system outperformed the classic space time block code (STBC) at 15dB SNR. Furthermore, Dörner et al.  also developed a DNN based prototype system solely composed of two unsynchronized off-the-shelf software-defined radios (SDR). This prototype system was capable of mitigating the current restriction on short block lengths.
DNNs also play a critical role in supporting a variety of compelling upper layer applications, such as traffic prediction , packet routing  and control , traffic offloading , resource allocation , attack detection , just to name a few. For instance, Wang et al.  presented a hybrid deep learning aided structure for spatio-temporal traffic modeling and prediction in cellular networks by mining information from the China Mobile dataset. It used a novel deep learning aided auto-encoder for modeling the spatial features of wireless traffic, while using LSTM units for temporal modeling. Additionally, Kato et al.  proposed a supervised DNN aided traffic routing scheme, which outperformed the classic open shortest path first (OSPF) scheme in terms of requiring a lower overhead, whilst maintaining a higher throughput and lower delay. By contrast, a real-time deep CNN based traffic control mechanism learning from previous network anomalies was conceived by Tang et al. , which substantially reduced the average delay and packet loss rate. Hence, deep learning aided traffic control may indeed constitute a potential candidate for gradually replacing traditional routing protocols in future wireless networks. Furthermore, Li et al.  integrated both the DNN structure and the edge computing technique into the multimedia IoT, which was able to improve the efficiency of multimedia processing. Sun et al.  treated the power control problem in interference-limited wireless networks as a ‘black box’. They proposed an ‘almost-real-time’ power control algorithm relying on a DNN structure trained by simulated data. In comparison to traditional mathematical tools, the approximation error of the DNN aided algorithm is closely related to the depth of the DNN considered. As for network security issues, for example, He et al.  constructed a conditional deep belief network (CDBN) for the real-time detection of malicious false data injection (FDI) attacks in the smart grid, which was trained by historical measurement data. The simulations conducted using the IEEE 118-bus test system and the IEEE 300-bus test system showed that the CDBN aided FDI detection scheme was resilient to the environmental noise and had a higher detection accuracy than its SVM aided counterparts.
As a successful example of learning from the environment, DNNs are beneficial in terms of extracting electromagnetic fingerprint information from the wireless channel for indoor localization. In [246, 247, 248], Wang et al. proposed a DNN having three hidden layers for training the calibrated CSI phase data, where the fingerprint information was represented by the DNN’s weights. Their experimental results showed that the DNN aided localization scheme performed well in different propagation environments, including an empty living room, and a laboratory in the presence of mobile users. In , Wang et al. proposed a deep learning method for supporting device-free wireless localization and activity recognition relying on learning from the wireless signals around the target, where a sparse auto-encoder network was used for automatically learning the discriminative features of wireless signals. Furthermore, a softmax-regression-based framework  was formulated for the location and activity recognition based on merged features. Moreover, in , Zhang et al. constructed a four-layer DNN for extracting reliable high level features from massive Wi-Fi data, which was pre-trained by the stacked denoising auto-encoder. Additionally, an HMM aided high-accuracy localization algorithm was proposed for smoothing the estimate variation. Their experimental results showed a substantial localization accuracy improvement in the context of a widely fluctuating wireless signal strength.
With regard to learning from users or devices, Ouyang et al.  conceived a CNN-aided online learning architecture for understanding human mobility patterns relying on analyzing continuous mobile data streams. Al-Molegi et al.  integrated both the spatial features gleaned from GPS data and the temporal features extracted from the associated time stamps for predicting human mobility based on an RNN. Moreover, Song et al.  proposed an intelligent deep LSTM RNN based system for predicting both human mobility and the specific transportation mode in a large-scale transportation network, which was beneficial in terms of providing accurate traffic control for intelligent transportation systems (ITS). Additionally, a mobility prediction technique relying on a complex extreme learning machine (CELM) was developed by Ghouti et al.  in order to jointly optimize both the bandwidth and the power of MANETs. In , both the multi-layer perceptron and RNN models were employed by Agarwal et al. for characterizing the activity of primary users in CR networks, where three different traffic distributions, namely Poisson traffic, interrupted Poisson traffic and self-similar traffic were used for training the related models.
Table IV lists a range of typical applications of DNNs along with a brief description.
|Paper||Application||Method||Description|
|||channel estimation||DNN||learn nonlinear distortion, interference and frequency selectivity of wireless channels|
|||modulation classification||RNN||capture amplitude and phase information without expert knowledge|
|||signal detection||DNN||transmit signal detection from noisy and corrupted signals without underlying CSI|
|||interference identification||CNN||learn features through self-optimization during the GPU based training process|
|||PHY representation||DNN||represent simple system having one transmitter and receiver without accurate CSI|
|||PHY representation||DNN||represent single user MIMO system relying on DNN aided auto-encoder|
|||software-defined radio||DNN||be capable of easing the current restriction on short block lengths|
|||traffic prediction||DNN||deep auto-encoder and LSTM for modeling spatial and temporal features|
|||packet routing||DNN||traffic routing scheme with little signal overhead, large throughput and small delay|
|||traffic control||CNN||consider previous network abnormalities, lower average delay and packet loss rate|
|||traffic offloading||DNN||integrate both DNN structure and edge computing technique into multimedia IoT|
|||power control||DNN||an almost-real-time power control algorithm in interference-limited wireless networks|
|||network security||DBN||a real-time detection of malicious false data injection attack in smart grid|
|[246, 247, 248, 249, 250]||indoor localization||DNN||device-free wireless localization and recognition by learning from ambient wireless signals|
|||mobility prediction||CNN||learn human mobility pattern relying on analyzing continuous mobile data stream|
|||mobility prediction||RNN||integrate spatial feature from GPS and temporal feature from associated time stamps|
|||transportation mode||RNN||predict both human mobility and transportation mode for large-scale transport networks|
|||activity prediction||RNN||characterize primary users’ activity in CR with different traffic distribution|
VI-B Deep Reinforcement Learning and Its Applications
The deep reinforcement learning technique is constituted by the integration of the aforementioned DNNs and reinforcement learning. Explicitly, in deep reinforcement learning methods, DNNs are used for approximating certain components of reinforcement learning, including the state transition function, reward function, value function and the policy. These components can be viewed as a function of the weights in these DNNs, which can be updated with the aid of the classic stochastic gradient descent.
In particular, the deep Q-Network (DQN), which was proposed by Mnih et al. in 2015 , constitutes the first deep reinforcement learning solution. It avoids the instability of the reinforcement learning algorithm, which may even become divergent when its action-value function is approximated relying on a non-linear function. To elaborate a little further, DQN stabilizes the training process of the action-value function approximation by relying on experience replay. Furthermore, DQN only requires modest domain knowledge. The deep Q-learning algorithm in DQN is a variant of the classical Q-learning algorithm, which is integrated with the deep CNN model, where the convolutional filters seen in Fig. 20 (c) are used for representing the effects of receptive fields. Each output of the deep CNN involved yields the specific value of the Q-function for a possible action. Beyond the DQN, substantial efforts have also been invested in improving the performance and stability, as exemplified by the double DQN  and the dueling DQN . Thanks to the powerful feature representation capability of DNNs and of the reinforcement learning algorithms, deep reinforcement learning performs well in a range of compelling applications, as exemplified by AlphaGo, which is the first program to defeat a professional human Go player.
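The experience replay mechanism mentioned above can be sketched as a fixed-size buffer, from which past transitions are drawn uniformly at random in order to decorrelate consecutive training samples. The class name `ReplayBuffer`, its interface and the toy transitions are assumptions introduced for illustration and do not reproduce any particular DQN implementation.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity store of (state, action, reward, next_state, done)."""

    def __init__(self, capacity, seed=None):
        # A deque with maxlen silently discards the oldest transition.
        self.buffer = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks temporal correlation.
        return self.rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


# Hypothetical usage: ten transitions are pushed into a buffer of capacity 5.
replay = ReplayBuffer(capacity=5, seed=0)
for t in range(10):
    replay.push(state=t, action=0, reward=1.0, next_state=t + 1, done=False)
batch = replay.sample(3)
```

In a full DQN training loop each sampled batch would be used for one gradient step on the Q-network; that step is deliberately omitted here.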
Deep reinforcement learning is eminently suitable for supporting the interaction in autonomous systems in terms of a higher level understanding of the visual world, which can be readily applied to diverse analytically intractable problems in NGWNs.
Given the intrinsic advantages of reinforcement learning in interactive decision making, it may play a significant role in the field of control decisions [272, 273]. Specifically, Zhang et al.  proposed a model-free UAV trajectory control scheme relying on deep reinforcement learning for data collection in smart cities, where a powerful deep CNN was used for extracting the necessary features, while a DQN model was used for decision making. Given the sensing region and the related tasks, this algorithm supported efficient route planning for both the UAVs and mobile charging stations involved. In , a deep reinforcement learning aided communication-based train control system was conceived by Zhu et al., which jointly optimized the communication handoff strategy and the control functions, while reducing the energy consumption. Real channel measurements and real-time train position information were used for training the DQN model, which resulted in optimal communication and control decisions.
Furthermore, the resource allocation problems of wireless networks, such as energy scheduling, traffic scheduling, caching decisions, user association, etc. can be efficiently solved by deep reinforcement learning at a low computational complexity [81, 274, 275, 276, 277, 278, 279]. For example, Zhang et al.  proposed a deep Q-learning model for the system’s dynamic energy scheduling, which relied on the amalgamated stacked auto-encoder and Q-learning model. More specifically, the stacked auto-encoder was used for learning the state-action value function of each strategy in any of the available system states. Moreover, Xu et al.  proposed a deep reinforcement learning framework for power-efficient resource allocation in C-RANs, which optimized the expected and cumulative long term power consumption, including the transmit power consumption, the sleep/active transition power consumption as well as the RRU’s power consumption. A two-step deep reinforcement learning aided decision making scheme was conceived, where the learning agent first decides on activating/deactivating the sleeping mode of each RRU, and then determines the optimal beamformer’s power allocation. As for traffic scheduling, Zhu et al.  designed a stacked auto-encoder assisted deep learning model for packet transmission planning in the face of multiple contending channels in cognitive IoT networks, which aimed for maximizing the system’s throughput. In this architecture, an MDP was used for modelling the system states. Given the large state-action space of the system, the stacked auto-encoder was used for constructing the mapping between the state and the action for accelerating the process of optimization. Furthermore, a deep Q-learning algorithm was conceived for designing both the cache allocation and the transmission rate in content-centric IoT networks for the sake of maximizing the long-term QoE , where He et al. considered both the networking cost as well as the users’ mean opinion score.
He et al. proposed a DQN based user scheduling scheme for a cache-enabled opportunistic interference alignment (IA) assisted wireless network in the context of realistic time-varying channels formulated as a finite-state Markov model. More specifically, the DQN was constructed by relying on a sophisticated action-value function for the sake of reducing the computational complexity. Their simulation results demonstrated that the DQN aided IA assisted user scheduling was beneficial in terms of substantially improving the network’s throughput vs energy efficiency trade-off. To elaborate a little further, He et al. utilized deep reinforcement learning for constructing their resource allocation policy relying on a joint optimization problem, which considered programmable networking, information-centric caching as well as mobile edge computing in the context of connected vehicular scenarios. Moreover, the $\epsilon$-greedy policy was utilized for striking an attractive trade-off between exploration and exploitation.
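The exploration versus exploitation trade-off of the greedy-type policy mentioned above can be illustrated by a minimal sketch. With probability `epsilon` the agent explores by picking a uniformly random action, and otherwise it exploits the action having the highest estimated value. The function name, the example value vector and the choice of `epsilon` are assumptions made for illustration only.

```python
import numpy as np


def epsilon_greedy(q_values, epsilon, rng):
    """Return an action index chosen by the epsilon-greedy rule.

    q_values: 1-D array of estimated action values for the current state.
    epsilon:  exploration probability in [0, 1] (assumed parameter).
    rng:      a numpy.random.Generator used for all random draws.
    """
    if rng.random() < epsilon:
        # Explore: uniformly random action.
        return int(rng.integers(len(q_values)))
    # Exploit: action with the highest estimated value.
    return int(np.argmax(q_values))


rng = np.random.default_rng(0)
action = epsilon_greedy(np.array([0.1, 0.8, 0.3]), epsilon=0.1, rng=rng)
```

A common design choice is to start with a large `epsilon` and reduce it as training proceeds, so that the agent explores widely at first and later relies on its learned values.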
In order to curb the potentially excessive computational complexity resulting from having a large state space and to deal with its partial observability in cognitive radio networks, Naparstek and Cohen developed a distributed dynamic spectrum access scheme relying on deep multi-user reinforcement learning, where each user maps their current state to spectrum access actions with the aid of a DQN for the sake of maximizing the network’s utility, which was achieved without any message exchanges. Additionally, Wang et al. proposed an adaptive DQN algorithm for dynamic multichannel access, which was capable of achieving a near-optimal performance, outperforming both the Myopic policy (see footnote 8) and the Whittle’s Index-based heuristic algorithms (see footnote 9) in complex scenarios.
Footnote 8: The Myopic policy is one that simply optimizes the average immediate reward. It is called ‘myopic’ in the sense that it considers only this single criterion, while it has the advantage of being easy to implement.
Footnote 9: The Whittle’s Index-based algorithm is an index heuristic algorithm, which is designed to solve a problem more efficiently than the traditional methods often used for solving NP-hard problems. More explicitly, the Whittle’s Index policy is a low-complexity heuristic policy.
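As a point of comparison for the DQN based schemes, the Myopic benchmark policy can be sketched in a few lines: it selects the channel that maximizes the expected immediate reward and ignores all future consequences. The belief vector (the probability that each channel is currently in a good state) and the two reward values are hypothetical inputs introduced purely for illustration. They are not taken from Wang et al.

```python
import numpy as np


def myopic_channel(belief_good, reward_good=1.0, reward_bad=0.0):
    """Pick the channel with the highest expected immediate reward.

    belief_good: 1-D array, assumed probability that each channel is good.
    reward_good / reward_bad: assumed rewards for accessing a good / bad channel.
    """
    belief_good = np.asarray(belief_good, dtype=float)
    expected_reward = belief_good * reward_good + (1.0 - belief_good) * reward_bad
    return int(np.argmax(expected_reward))


# Hypothetical beliefs for three channels.
chosen = myopic_channel([0.2, 0.7, 0.5])
```

Because only the immediate reward is considered, the rule is easy to implement, but it cannot learn to trade a small immediate loss for a larger long-term gain, which is where a DQN may outperform it.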
Table V lists some typical applications of deep reinforcement learning in NGWN.
|||UAV network||trajectory control||a model-free UAV trajectory control scheme in smart cities relying on DQN|
|||ITS||train control||jointly optimize the communication handoff strategy and control performances|
|||energy-aware network||energy scheduling||associate the stacked auto-encoder and the deep Q-learning model|
|||CRAN||power allocation||decide RRU’s sleeping mode and the optimal beamformer’s power allocation|
|||cognitive IoT||traffic scheduling||construct the mapping between states and actions relying on stacked auto-encoder|
|||content-centric IoT||cache allocation||jointly design cache allocation and transmission rate for maximizing long-term QoE|
|||IA network||user scheduling||obtain the action-value function relying on DQN for lowering complexity|
|||vehicular network||resource allocation||consider programmable SDN, information-centric caching and mobile edge computing|
|||CR network||spectrum access||distributed spectrum access for maximizing network utility without message exchanges|
|||CR network||multichannel access||adaptive DQN aided multichannel access yielding a near-optimal performance|
VII Future Research and Conclusions
In the following, we will list a range of future research ideas on promising applications of machine learning in NGWNs.
UAV-aided networking: Given the agility of UAV nodes as well as the bursty and often unpredictable nature of terrestrial wireless traffic, machine learning models can be used both for predicting the traffic demand and for adaptively adjusting the UAVs’ location.
mMTC and uRLLC network: While wireless networks have primarily served communications among individuals, in the era of the IoT, wireless networking also supports myriads of machines and intelligent devices. In this era a pair of 5G operational modes, namely mMTC and uRLLC, are expected to play key roles. Machine learning is capable of enhancing conventional networks designed for mMTC, for example by invoking reinforcement learning to appropriately select the access points of MTC. The uRLLC mode of operation constitutes a rather young technological territory, which can be jointly designed with mMTC. To reduce the network’s latency from hundreds of milliseconds as experienced in state-of-the-art mobile communications to the desirable range of just a few milliseconds, machine learning is capable of supporting so-called anticipatory mobility management, which integrates the naive Bayesian classification of the previously used APs and geographical regression for the predictive analysis of data. Another example of a disruptive technical trend is to investigate how wireless networking impacts the smart agents operated by machine learning.
Narrow band IoT (NB-IoT): NB-IoT allows a large number of low-power devices to connect to the cellular network, where the devices require long-term Internet access and dense wireless coverage. Machine learning algorithms are capable of supporting intelligent resource allocation, optimal AP deployment and efficient access.
Socially-aware wireless networking: The operation of socially-aware wireless networks relies on a variety of social attributes, where machine learning schemes are beneficial in terms of facilitating feature extraction, social group-of-interest formation, classification and prediction of these social attributes, such as human mobility, social relations, behavior preference, etc.
Wireless virtual reality (VR) networks: VR networks allow users to experience and interact with immersive environments, which requires flawless audio and video data processing capability. Machine learning algorithms have the potential of circumventing the conception of complex joint source- and channel-coding schemes by further developing the auto-encoder principles.
Network integration, representation and design: Machine learning may provide an alternative for network representation, where we can integrate all of the classic communication-theoretic blocks, including source- and channel-encoding, modulation, demodulation, decoding, etc. into a “black box”. By simply learning from and processing previous input and output signals, the receivers become capable of adaptively understanding the operational mechanism of the “black box” considered.
Wireless network tomography: State-of-the-art wireless networks support a vast number of nodes, such as those of the IoT, where the provision of global information is practically impossible for each node. Hence a new class of problems arises in the context of distributed wireless networks, which is related to the acquisition of network-related information. Classic network tomography  defines the problem as: , where is a -dimensional vector of the network’s dynamically fluctuating parameters, such as the link delay or traffic activity, is the -dimensional vector of measurements and