I. Introduction
Wireless traffic prediction is a key enabler for proactive resource optimisation in 5G and beyond [1, 2, 3, 4, 5]. Proactive optimisation can create user-centric quality-of-service (QoS) and quality-of-experience (QoE) improvements across 5G network slices [6, 7]. Direct prediction from historical data [8, 9, 10] and inference from proxy social media data [11] are important inputs to the proactive optimisation modules [6] being considered for 5G and beyond applications, such as interference management, load balancing, and multi-RAT offloading, with implementation at the edge or in C-RANs. We begin with a review of time-series forecasting algorithms used in wireless traffic prediction and identify a lack of research on both high and low extreme-value predictions, which is of critical importance to avoiding network congestion and inefficiencies.
I-A State-of-the-Art
Time-series prediction methods can be classified into several families, all of which place training data in high demand.
I-A1 Statistical Models
Statistical time-series modelling, using a variety of signal processing and machine learning approaches, has been widely applied to predict wireless traffic. Moving average models with smoothing weights and seasonality work well for univariate forecasting. For example, in [12], seasonal autoregressive integrated moving average (ARIMA) models with two periodicities were fitted to wireless cellular traffic for prediction. However, this model is insensitive to anomalous values, such as event-driven spike demand. Indeed, predicting and avoiding spike demand is critical to avoiding network outages and improving the consumer experience. Other methods rely on statistical generative functions assuming a quasi-static behaviour, such as the stable model [13] or the exponential model [14], but these do not offer the adaptivity of the machine learning techniques below.
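The seasonal-differencing idea behind models such as the seasonal ARIMA of [12] can be sketched in a few lines: remove the daily period, model the residual with a short moving average, then add the seasonal component back. The series, period length, and moving-average order below are illustrative toys, not the settings used in [12].

```python
# Toy sketch of the seasonal-differencing idea behind seasonal ARIMA:
# forecast the seasonally differenced residual with a moving average,
# then undo the differencing. (Hypothetical data; a real study would
# fit a full SARIMA model with estimated coefficients.)

def seasonal_ma_forecast(series, period, q=3):
    """One-step-ahead forecast: seasonal naive + moving average of residuals."""
    # Seasonally differenced residuals: x_t - x_{t-period}
    resid = [series[i] - series[i - period] for i in range(period, len(series))]
    # Forecast the next residual as the mean of the last q residuals
    resid_hat = sum(resid[-q:]) / q
    # Add back the value one period ago to undo the seasonal difference
    return series[len(series) - period] + resid_hat

# Two "days" of a toy periodic demand curve with a slight upward drift
day = [10, 12, 18, 25, 22, 15]
series = day + [v + 1.0 for v in day]
print(seasonal_ma_forecast(series, period=6))  # -> 12.0
```

Note that a pure moving-average residual model like this has no mechanism to anticipate an event-driven spike, which is exactly the weakness discussed above.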
I-A2 Machine Learning Models
In terms of machine learning approaches, artificial neural networks (ANNs) have been used to predict self-similar traffic with burstiness in [15]. Although ANN and deep learning approaches (CNN, LSTM, wavelet, and Elman neural networks) [9, 16, 17, 18, 19] perform well in cumulative learning and prediction accuracy, they cannot give a quantitative uncertainty due to their intermediate black-box processes. Alternatively, Gaussian processes (GPs) have been used [8] and have shown strong adaptivity to wireless traffic data. Nevertheless, traditional kernels are unable to capture long-range period-varying dependencies, which limits the efficiency of existing training data.
I-A3 Gaussian Process Review
The Gaussian process (GP) is widely used because of its adaptability to manifold data [20]. As a non-parametric machine learning method, a prior GP model is first established with compound kernel functions based on the background of the data. The hyperparameters are then optimised using the training data to extract the posterior distribution of the predicted outcome. The prediction results given by GPs quantify statistical significance, which is an important advantage over other black-box machine learning methods. As such, whilst GPs may not achieve the performance level of ANNs, they are able to quantify risk, and that risk can be interpreted back to the features of the data [21].
I-A4 Feature Extraction and Wireless Context
The features of traffic patterns may be correlated if the patterns are driven by the same events, e.g. rush hours, concerts, etc. In these cases, the key is to recover the implied event information from the current flow trend by identifying where its features are close to those in historical data, and hence to predict how the traffic demand will change based on how it changed in the past. The works in [22, 23, 24] addressed the problem of feature selection, in order to determine the most discriminative and relevant features of the classified data.
In the context of wireless traffic forecasting, the current literature employs classic kernel functions [8], which cannot memorize non-periodic data patterns over extended periods. This means the GP model does not make full use of the training data. Furthermore, wireless traffic forecasting is often interested in predicting extreme events, as opposed to the overall pattern of the traffic variation. Extreme demand values are useful in driving proactive network actions (e.g. extreme high demand requires spectrum aggregation and cognitive access [25, 26], whereas extreme low demand can lead to proactive sleep mode and coverage compensation [27, 28]). Therefore, what is needed is an adaptive kernel in the GP model that can trade off prediction accuracy between overall traffic variations and extreme values.
I-B Novelty and Contribution
In this paper, we propose to embed the relevant data features in a flexible kernel function, which enables the GP model to achieve this trade-off. We make three major contributions:
1) A novel feature embedding (FE) kernel GP model is proposed for forecasting wireless traffic. Specifically, fewer hyperparameters are required in this model, which reduces the computational burden compared with models that use classic hybrid kernels. Meanwhile, the learning rate is improved significantly for irregular training data;
2) The predicted results are quantified as a probability density function (PDF), which is more useful to plug into optimisation modules than the mean prediction value alone. Precisely, the predicted traffic is described by a weighted superposition of Gaussian distributions (a Gaussian mixture), instead of the single summed Gaussian of traditional GP;
3) Demonstrating our forecasting model on real wireless traffic data, the cumulative error curve of our model is compared against state-of-the-art algorithms used in the literature (seasonal ARIMA [12] and the traditional GP model [8]). Our model shows the best adaptivity and achieves the best trade-off between overall accuracy and extreme-value accuracy.
The remainder of this paper is organised as follows. In Section II, we build the system model step by step, from preprocessing to prediction. In Section III, we apply the model to real wireless traffic data and evaluate its performance. Section IV concludes the paper and proposes ideas for future work.
II. System Model
In this paper, we use a sliding window of historical traffic data to predict future traffic demand. In particular, we focus on wireless downlink (DL) traffic data demanded by end-users at 15-minute intervals over a two-week period (see Fig. 1).
II-A Data Decomposition
From our observations and the existing literature, the raw data is considered to be composed of a daily periodic pattern and an aperiodic pattern. By using a band-pass filter, the raw data can be decomposed into these two components, as shown in Fig. 1. In order to free the model from the domination of large-scale periodic patterns, we fix the daily periodic pattern derived from the historical data as the established baseline (we acknowledge that there are baseline variations between each day of the week, but we focus on the aperiodic prediction, which is the main challenge) and only make predictions on the remaining aperiodic pattern.
We assume that the aperiodic traffic consists of a noise flow and an event-driven flow, the latter of which has an implicit intrinsic correlation. The event-driven flow is predictable if we can identify the relevant features of this kind of flow amid the noise.
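The baseline-removal step above can be sketched as follows. The slot count and traffic values are toy stand-ins (the 15-minute sampling in this paper implies 96 slots per day); a per-slot average over historical days serves as a simple fixed daily baseline.

```python
# Sketch of the decomposition in Section II-A: estimate a fixed daily
# periodic baseline by averaging each time-of-day slot over the historical
# days, then subtract it so the model forecasts only the aperiodic part.
# (Toy data with 4 samples per "day"; the paper uses 96 slots of 15 min.)
import numpy as np

def daily_baseline(traffic, slots_per_day):
    days = traffic.reshape(-1, slots_per_day)   # one row per day
    return days.mean(axis=0)                    # average per time-of-day slot

def aperiodic_part(traffic, slots_per_day):
    base = daily_baseline(traffic, slots_per_day)
    reps = len(traffic) // slots_per_day
    return traffic - np.tile(base, reps)        # remove the periodic baseline

traffic = np.array([1., 4., 6., 2.,    # day 1
                    3., 4., 8., 2.])   # day 2 (event spike in slot 3)
print(daily_baseline(traffic, 4))      # per-slot baseline
print(aperiodic_part(traffic, 4))      # residual to be forecast
```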
II-B A Priori Gaussian Process Model
The DL traffic value $y(t)$ at time point $t$ is assumed to be a latent GP plus noise:

$$y(t) = f(t) + \epsilon, \qquad (1)$$

where $f(t)$ is the random variable (RV) which follows the distribution given by the GP, and $\epsilon$ is additive Gaussian noise with zero mean and variance $\sigma_n^2$. For a finite number of time points $t_1, \dots, t_N$ taken from the continuous time domain, the RVs $\mathbf{f} = [f(t_1), \dots, f(t_N)]^T$ can be assumed to follow the multivariate Gaussian [21]

$$\mathbf{f} \sim \mathcal{N}(\boldsymbol{\mu}, \mathbf{K}), \qquad (2)$$

where $\boldsymbol{\mu}$ is the mean function evaluated at the time points and $\mathbf{K}$ is the covariance matrix given by

$$\mathbf{K}_{ij} = k(t_i, t_j), \quad i, j = 1, \dots, N, \qquad (3)$$

where $k(t_i, t_j)$ is the covariance between $f(t_i)$ and $f(t_j)$, represented by the kernel function.
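A minimal sketch of how Eqs. (2)-(3) translate to code, using a plain RBF kernel on time as a stand-in (the FE kernel of the next subsection replaces the time distance with a feature distance); the amplitude and length-scale values are arbitrary:

```python
# Build the prior covariance matrix K of Eq. (3) by evaluating a kernel at
# every pair of time points. A plain RBF kernel on time is used here only
# as a placeholder for the feature-embedding kernel of Section II-C.
import numpy as np

def rbf(a, b, amp=1.0, length=1.0):
    return amp**2 * np.exp(-(a - b)**2 / (2 * length**2))

def cov_matrix(times, kernel):
    n = len(times)
    K = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = kernel(times[i], times[j])
    return K

t = np.array([0.0, 1.0, 2.0])
K = cov_matrix(t, rbf)
print(K)   # symmetric, unit diagonal, decaying off-diagonals
```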
II-C Feature Embedding Kernel Function
In a GP, the covariance between every two RVs is quantified by the kernel function, which interprets the potential correlation of the RVs in a high-dimensional space. Here we use the Gaussian radial basis function (RBF) kernel with a feature embedding (FE) norm:

$$k(t_i, t_j) = \sigma_f^2 \exp\!\left(-\frac{\|\mathbf{F}(t_i) - \mathbf{F}(t_j)\|^2}{2l^2}\right), \qquad (5)$$

where $\mathbf{F}(t)$ is defined as the weighted feature matrix of the RV at time point $t$:

$$\mathbf{F}(t) = \big[w_1\phi_1(t), w_2\phi_2(t), \dots, w_M\phi_M(t)\big]^T, \qquad (6)$$

where the $m$-th feature of the RV in the matrix, $\phi_m(t)$, is a function which can be either homogeneous or non-homogeneous in the former values $y(t-1), y(t-2), \dots$:

$$\phi_m(t) = g_m\big(y(t-1), y(t-2), \dots\big). \qquad (7)$$
Due to its symmetry, it can easily be proved that our new kernel function still satisfies the conditions of Mercer's theorem.
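Under the assumption that each time point is summarised by a feature vector with per-feature weights (the weight and feature values below are hypothetical), the FE kernel of Eq. (5) can be sketched as:

```python
# Sketch of the feature-embedding RBF kernel of Eq. (5): the kernel compares
# weighted feature vectors instead of raw time stamps. The weights w stand in
# for the Relief-style feature weights of Section II-C (hypothetical values).
import numpy as np

def fe_kernel(f_i, f_j, w, amp=1.0, length=1.0):
    """RBF kernel with a weighted (feature-embedding) norm."""
    d2 = np.sum(w * (f_i - f_j)**2)      # weighted squared feature distance
    return amp**2 * np.exp(-d2 / (2 * length**2))

w  = np.array([0.7, 0.2, 0.1])          # hypothetical feature weights
f1 = np.array([1.0, 0.5, 2.0])          # feature vector at time t_i
f2 = np.array([1.0, 0.5, 2.0])          # identical features -> k = amp^2
f3 = np.array([3.0, 0.5, 2.0])          # differs only in the first feature
print(fe_kernel(f1, f2, w))             # -> 1.0
print(fe_kernel(f1, f3, w))
```

Symmetry of the weighted distance is what preserves the Mercer condition noted above: swapping the arguments leaves the kernel value unchanged.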
In coordinated BS control systems (e.g. radio resource management or beamforming), understanding sharp changes in traffic demand (especially when demand rises above the cell capacity or falls significantly below economic profitability thresholds) is more important than average demand trends. As such, the proposed feature weighting process in this paper focuses on building a trade-off between general prediction accuracy and accuracy on the aforementioned extreme demand values.
To achieve this, we set a threshold on the traffic varying value $\Delta y(t)$ at each sample time point based on historical data, as shown in Fig. 2. If $\Delta y(t)$ is outside the confidence interval of the distribution of $\Delta y$, the associated feature $\mathbf{F}(t)$ will be tagged as an outlier and assigned to category $C_1$; otherwise it is assigned to category $C_0$. The Relief idea in [29] is utilized, whereby the feature weights are optimized by maximizing the sum of margins from each point to the nearest point with a different category. This process is expressed as:

$$\max_{\mathbf{w}} \sum_{i} \min_{j : c_j \neq c_i} \big\| \mathbf{F}(t_i) - \mathbf{F}(t_j) \big\|_{\mathbf{w}}, \qquad (8)$$

where $\|\cdot\|_{\mathbf{w}}$ is the weighted norm, which projects the high-dimensional feature vectors' distance onto one dimension, and $w_m$ is the weight of the $m$-th feature.

In the Gaussian RBF kernel, the feature space can be mapped to an infinite-dimensional kernel space. The hyperparameter $l$ controls the higher-dimensional attenuation rate, and $\sigma_f$ is the amplitude. The hyperparameters $\boldsymbol{\theta} = (\sigma_f, l, \sigma_n)$ are tuned by maximizing the corresponding log marginal likelihood, which is equivalent to minimizing the cost function [8]:

$$J(\boldsymbol{\theta}) = \frac{1}{2}\mathbf{y}^T \mathbf{C}^{-1} \mathbf{y} + \frac{1}{2}\log|\mathbf{C}| + \frac{N}{2}\log 2\pi, \qquad (9)$$

where $\mathbf{C} = \mathbf{K} + \sigma_n^2 \mathbf{I}$ and $\mathbf{y} = [y(t_1), \dots, y(t_N)]^T$ is the vector of known values. Quasi-Newton and gradient descent methods can be used for this optimization problem.
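The cost minimisation of Eq. (9) can be sketched with scipy; the toy series, the log-parameterisation, and the choice of L-BFGS-B (a quasi-Newton method, as mentioned above) are illustrative assumptions:

```python
# Sketch of the hyperparameter fit of Eq. (9): minimise the GP negative log
# marginal likelihood over (amplitude, length-scale, noise) with scipy.
# An RBF kernel on time stands in for the FE kernel; data are synthetic.
import numpy as np
from scipy.optimize import minimize

def neg_log_marginal_likelihood(theta, t, y):
    amp, length, noise = np.exp(theta)       # log-params keep values positive
    d = t[:, None] - t[None, :]
    C = amp**2 * np.exp(-d**2 / (2 * length**2)) + noise**2 * np.eye(len(t))
    # 0.5 y^T C^{-1} y + 0.5 log|C| + (N/2) log 2*pi, as in Eq. (9)
    alpha = np.linalg.solve(C, y)
    _, logdet = np.linalg.slogdet(C)
    return 0.5 * y @ alpha + 0.5 * logdet + 0.5 * len(t) * np.log(2 * np.pi)

t = np.linspace(0, 4, 20)
y = np.sin(t) + 0.1 * np.random.default_rng(0).standard_normal(20)
res = minimize(neg_log_marginal_likelihood, x0=np.zeros(3), args=(t, y),
               method="L-BFGS-B")            # quasi-Newton, as in the paper
print(res.fun, np.exp(res.x))
```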
II-D A Posteriori Prediction
After the hyperparameters are determined, the covariance between every two RVs in the training set can be quantified by $k(t_i, t_j; \boldsymbol{\theta}^*)$, where $\boldsymbol{\theta}^*$ are the optimized parameters. Let us assume that at a future time point $t_*$, the RV $f_* = f(t_*)$ follows the same model as the training set. Therefore, $\mathbf{k}_* = [k(t_1, t_*), \dots, k(t_N, t_*)]^T$ yields the covariance of $f_*$ with the historical RVs. The multivariate distribution for $\mathbf{y}$ and $f_*$ is

$$\begin{bmatrix} \mathbf{y} \\ f_* \end{bmatrix} \sim \mathcal{N}\!\left( \begin{bmatrix} \boldsymbol{\mu} \\ \mu_* \end{bmatrix}, \begin{bmatrix} \mathbf{C} & \mathbf{k}_* \\ \mathbf{k}_*^T & k(t_*, t_*) \end{bmatrix} \right), \qquad (10)$$

with the stacked mean vector and block covariance matrix shown above. So, given $\mathbf{y}$, the posterior distribution of $f_*$ can be derived as

$$f_* \,|\, \mathbf{y} \sim \mathcal{N}\big(\bar{f}_*, \sigma_*^2\big), \qquad (11)$$

with

$$\bar{f}_* = \mu_* + \mathbf{k}_*^T \mathbf{C}^{-1} (\mathbf{y} - \boldsymbol{\mu}), \qquad \sigma_*^2 = k(t_*, t_*) - \mathbf{k}_*^T \mathbf{C}^{-1} \mathbf{k}_*. \qquad (12)$$
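The posterior conditioning of Eqs. (10)-(12) can be sketched in code, assuming a zero prior mean and a plain RBF kernel as a stand-in for the FE kernel (toy training values below):

```python
# Sketch of the posterior prediction of Eqs. (10)-(12): condition the joint
# Gaussian on the observed values to get the predictive mean and variance
# at a new time point. Zero prior mean assumed; RBF kernel as a stand-in.
import numpy as np

def gp_posterior(t_train, y, t_star, amp=1.0, length=1.0, noise=0.1):
    k = lambda a, b: amp**2 * np.exp(-(a - b)**2 / (2 * length**2))
    C = k(t_train[:, None], t_train[None, :]) + noise**2 * np.eye(len(t_train))
    k_star = k(t_train, t_star)                   # covariance with new point
    mean = k_star @ np.linalg.solve(C, y)         # Eq. (12), zero prior mean
    var = k(t_star, t_star) - k_star @ np.linalg.solve(C, k_star)
    return mean, var

t_train = np.array([0.0, 1.0, 2.0])
y = np.array([0.0, 1.0, 0.0])
m, v = gp_posterior(t_train, y, 1.0)
print(m, v)   # mean close to the observation at t=1, small variance
```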
For each previous time point in this model, a posterior distribution component of $f_*$ can be generated. In naive GP, the predicted distribution is also a Gaussian distribution, which sums the influence of each previous point in its mean and variance [21]. In our proposed FE-GP forecasting model, the predicted distribution instead uses a Gaussian mixture model (GMM): the resultant PDF of $f_*$ is the superposition of the individual distribution components $\mathcal{N}(\bar{f}_{*,i}, \sigma_{*,i}^2)$ derived from each previous point, with a normalization coefficient $1/N$:

$$p(f_*) = \frac{1}{N} \sum_{i=1}^{N} \mathcal{N}\big(f_*; \bar{f}_{*,i}, \sigma_{*,i}^2\big). \qquad (13)$$

An example is shown in Fig. 3. The blue lines are distribution components, derived from the covariance of three previous points with the future point. Naive GP gives the average prediction result for this future point, i.e. also a Gaussian distribution, under the integrated impact of all components. In FE-GP, by contrast, the GMM prediction result for this future point is assumed to have an equal probability of following any one of these three distribution components. The purple line gives the overall PDF of FE-GP.
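The equal-weight superposition of Eq. (13) can be sketched as follows; the component means and variances below are hypothetical, chosen to show a bimodal forecast with mass both near the average level and near an extreme spike:

```python
# Sketch of the GMM predictive density of Eq. (13): superpose one Gaussian
# component per historical point, each weighted equally (normalisation 1/N),
# instead of collapsing to the single Gaussian of naive GP.
import math

def gmm_pdf(x, means, variances):
    """Equal-weight Gaussian mixture density at x."""
    total = 0.0
    for mu, var in zip(means, variances):
        total += math.exp(-(x - mu)**2 / (2 * var)) / math.sqrt(2 * math.pi * var)
    return total / len(means)          # normalisation coefficient 1/N

# Three hypothetical posterior components from three previous points:
# two near the average level, one near an extreme spike value.
means, variances = [0.2, 0.25, 1.5], [0.05, 0.04, 0.1]
print(gmm_pdf(0.22, means, variances), gmm_pdf(1.5, means, variances))
```

Unlike a single Gaussian, this density keeps a separate peak at the spike value, which is exactly the risk information a proactive optimisation module needs.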
III. Experimental Results and Discussion
III-A Data Source
The data we use for training comes from base stations (BSs) in a 4G metropolitan area. The anonymized data is provided by our industrial collaborator. It consists of aggregated downlink (DL) and uplink (UL) traffic demand volume per 15-minute interval over several weeks. We have selected a few example BSs at random to demonstrate our forecasting algorithm's performance.
III-B Feature Matrix in FE-GP
When applying FE-GP to wireless DL traffic forecasting, the first consideration is what the feature matrix consists of. In our experiment, the features are set to be:

$$\mathbf{F}(t) = \big[y(t-1), \dots, y(t-L), \sigma_t\big]^T, \qquad (14)$$

where $\sigma_t$ is the standard deviation of the elements of $[y(t-L), \dots, y(t-1)]$.
III-C Performance Metrics
We use the absolute cumulative error (ACE) as the performance metric:

$$\mathrm{ACE}(T) = \sum_{t=1}^{T} \big| \hat{y}(t) - y(t) \big|, \qquad (15)$$

where $\hat{y}(t)$ is the predicted DL traffic and $y(t)$ is the real data. For a fixed-value forecast (one-step-ahead forecasting of the DL traffic), we assign $\hat{y}(t)$ to be the value that has the maximum posterior probability.
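A minimal sketch of Eq. (15), with toy predicted and real values in place of the DL traffic:

```python
# Sketch of the absolute cumulative error of Eq. (15): accumulate
# |prediction - observation| over the horizon; in the paper the point
# forecast is the maximum a posteriori value of the predictive PDF.

def ace_curve(pred, real):
    """Running absolute cumulative error after each time point."""
    curve, total = [], 0.0
    for p, r in zip(pred, real):
        total += abs(p - r)
        curve.append(total)
    return curve

pred = [1.0, 2.0, 5.0, 3.0]   # toy one-step-ahead forecasts
real = [1.2, 1.8, 6.0, 3.0]   # toy observed traffic
print(ace_curve(pred, real))
```

The curve is non-decreasing by construction, which is why a flatter ACE curve in Figs. 4 and 6 indicates a better forecaster.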
III-D Results Analysis
Fig. 4 shows a comparison of forecasting algorithms over a week (672 points): (1) the proposed FE-GP, (2) classical naive GP, and (3) seasonal ARIMA, against real 4G DL data. The cumulative error is shown for two representative parts of the data: (left) average demand, where FE-GP and naive GP perform similarly; and (right) extreme spike demand, where FE-GP is superior to both naive GP and seasonal ARIMA. From the GP models' perspective, in the average part (left), both FE-GP and naive GP consider most of the traffic demand to be noise flow in the initial model, thus they perform similarly. In the extreme spike part (right), FE-GP can correctly recognize a potential event-driven flow that has happened before, using features from the last few points, whereas naive GP cannot; hence FE-GP gives a better prediction. In Fig. 6, we demonstrate that the proposed FE-GP performs the best overall due to its adaptivity to extreme values, even though it may be slightly worse on average demand at a few specific time stamps.
III-E Uncertainty Quantification
The posterior distributions of both models at a few representative points are given in Fig. 5. Unlike the single-peak Gaussian distribution predicted by naive GP, the GMM in FE-GP gives more general distributions for prediction. In the absence of a known periodicity, naive GP sums the effect of the last few time points, while FE-GP considers the effect of all time points according to their similarity in features with the predicted point. Consequently, there may be several peaks scattered over the forecast, which will inform proactive optimisation modules.
In a data-driven wireless resource proactive optimization system [1, 2, 3], we ought to focus not only on the benefits brought by the system's decisions, but also on the potential risks that drive regret functions, i.e., the occurrence of extreme demands. In our proposed FE-GP prediction model, the risks can be quantified from the posterior distribution. For example, in Fig. 5:
(1) Low traffic triggers proactive sleep mode and coverage compensation: our FE-GP predictions at points 318 and 672 demonstrate a clear non-negligible probability of low traffic, whilst the mean prediction is similar to that of naive GP. That is to say, we may need to proactively sleep selected cells to achieve more energy-efficient operations [30], whilst using other neighbouring cells across RATs to compensate [27, 28]. The risk of doing so is quantified by the posterior distribution (e.g., there is a small risk that demand is actually high and the compensated coverage is not enough).
(2) Spike traffic triggers proactive spectrum aggregation and offloading: at prediction point 368, there is a non-negligible high probability density area at an extreme value, far away from the predicted mean value. This can be used to inform proactive spectrum aggregation and the offloading of non-vital traffic to delay-tolerant RATs [25, 26]. The risk of doing so is quantified by the posterior distribution (e.g., there is a small risk that there is actually no demand for high capacity).
III-F Training Process
As the training set grows over time, the FE-GP model becomes more sensitive to spikes due to its adaptivity to features. Nevertheless, the cost of computation also grows with the size of the training set, so we have to set a size threshold on the training set. In naive GP, we can discard data in reverse chronological order without affecting the performance of the model. However, in FE-GP, we must make a trade-off between spike sensitivity and overall prediction accuracy, i.e., keeping more extreme-value time points makes the model more sensitive in spike prediction but may reduce overall performance. This needs to be done case by case with each pre-existing proactive resource optimization system.
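One possible retention policy reflecting the trade-off described above can be sketched as follows; the split ratio and the extremeness tag are hypothetical illustration choices, not the paper's method:

```python
# Hypothetical sketch of a capped training window for FE-GP (Section III-F):
# keep the newest points, but reserve part of the budget for older points
# tagged as extreme, so spike sensitivity is retained as data ages out.

def prune_training_set(points, max_size, keep_extreme=0.3):
    """points: list of (value, is_extreme). Keep newest, plus old extremes."""
    if len(points) <= max_size:
        return points
    budget = int(max_size * keep_extreme)          # slots reserved for extremes
    recent = points[-(max_size - budget):]         # newest points
    old = points[:-(max_size - budget)]
    extremes = [p for p in old if p[1]][-budget:]  # newest old extreme points
    return extremes + recent

# Toy stream: values above 8 tagged as extreme
pts = [(v, v > 8) for v in [1, 9, 2, 3, 10, 4, 5, 6]]
print(prune_training_set(pts, max_size=5))
```

Raising `keep_extreme` corresponds to the "more sensitive to spikes, possibly worse on average" end of the trade-off; setting it to zero recovers the naive GP policy of discarding strictly in reverse chronological order.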
IV. Conclusion and Future Work
Forecasting extreme demand spikes and troughs is essential to avoiding outages and improving energy efficiency. Proactive capacity provisioning can be achieved through extra bandwidth in predicted high demand areas (i.e., via spectrum aggregation and cognitive access techniques), and energy efficiency improvements can be achieved through sleep mode operations.
Whilst significant research into traffic forecasting using ARIMA, GPs, and ANNs has been conducted, current methods predominantly focus on overall performance and/or do not offer probabilistic uncertainty quantification. Here, we designed a feature embedding (FE) kernel for a Gaussian process (GP) model to forecast traffic demand. The FE kernel enabled us to trade off overall forecast accuracy against peak-trough accuracy. We compared its performance against both conventional GP and ARIMA models, and demonstrated the uncertainty quantification output. The advantage over neural network (e.g. CNN, LSTM) models is that the probabilistic forecast uncertainty can directly feed into decision processes in self-organizing-network (SON) modules, in the form of both predicted average KPI benefit and regret functions, using methods such as probabilistic numerics.
Our future work will focus on expanding to the spatial-temporal dimension [18] via Gaussian random field integration, considering multivariate forecasting across different service slices, as well as employing Bayesian training in deep Gaussian process (DGP) models [31, 32] to avoid catastrophic forgetting and to combat the dynamics of the traffic process.
V. Acknowledgement
The authors would like to thank Zhuangkun Wei for constructive discussions on feature embedding and Dr. Bowei Yang for the data support.
References
 [1] Z. Du, Y. Sun, W. Guo, Y. Xu, Q. Wu, and J. Zhang, “Datadriven deployment and cooperative selforganization in ultradense small cell networks,” IEEE Access, vol. 6, pp. 22 839–22 848, 2018.
 [2] N. Saxena, A. Roy, and H. Kim, “Trafficaware cloud ran: A key for green 5G networks,” IEEE Journal on Selected Areas in Communications, vol. 34, no. 4, pp. 1010–1021, April 2016.
 [3] R. Li, Z. Zhao, X. Zhou, J. Palicot, and H. Zhang, “The prediction analysis of cellular radio access network traffic: From entropy theory to networking practice,” IEEE Communications Magazine, vol. 52, no. 6, pp. 234–240, June 2014.

 [4] S. O. Somuyiwa, A. Gyorgy, and D. Gündüz, “A reinforcement-learning approach to proactive caching in wireless networks,” IEEE Journal on Selected Areas in Communications, vol. 36, no. 6, pp. 1331–1344, June 2018.
 [5] F. Shen, K. Hamidouche, E. Bastug, and M. Debbah, “A Stackelberg game for incentive proactive caching mechanisms in wireless networks,” in 2016 IEEE Global Communications Conference (GLOBECOM), Dec 2016, pp. 1–6.
 [6] V. Sciancalepore, K. Samdanis, X. CostaPerez, D. Bega, M. Gramaglia, and A. Banchs, “Mobile traffic forecasting for maximizing 5G network slicing resource utilization,” in IEEE Conference on Computer Communications (INFOCOM), May 2017, pp. 1–9.
 [7] L. Le, D. Sinh, L. Tung, and B. P. Lin, “A practical model for traffic forecasting based on big data, machinelearning, and network KPIs,” in 2018 15th IEEE Annual Consumer Communications Networking Conference (CCNC), Jan 2018, pp. 1–4.
 [8] Y. Xu, W. Xu, F. Yin, J. Lin, and S. Cui, “Highaccuracy wireless traffic prediction: A GPbased machine learning approach,” in IEEE Global Communications Conference (GLOBECOM), Dec 2017, pp. 1–6.
 [9] K. Zhang, G. Chuai, W. Gao, X. Liu, S. Maimaiti, and Z. Si, “A new method for traffic forecasting in urban wireless communication network,” EURASIP Journal on Wireless Communications and Networking, vol. 2019, no. 1, p. 66, Mar 2019.
 [10] X. Cao, Y. Zhong, Y. Zhou, J. Wang, C. Zhu, and W. Zhang, “Interactive temporal recurrent convolution network for traffic prediction in data centers,” IEEE Access, vol. 6, pp. 5276–5289, 2018.
 [11] B. Yang, W. Guo, B. Chen, G. Yang, and J. Zhang, “Estimating mobile traffic demand using Twitter,” IEEE Wireless Communications Letters, vol. 5, no. 4, pp. 380–383, Aug 2016.
 [12] F. Xu, Y. Lin, J. Huang, D. Wu, H. Shi, J. Song, and Y. Li, “Big data driven mobile traffic understanding and forecasting: A time series approach,” IEEE Transactions on Services Computing, vol. 9, no. 5, pp. 796–805, Sep. 2016.
 [13] R. Li, Z. Zhao, J. Zheng, C. Mei, Y. Cai, and H. Zhang, “The learning and prediction of applicationlevel traffic data in cellular networks,” IEEE Transactions on Wireless Communications, vol. 16, no. 6, pp. 3899–3912, June 2017.
 [14] “User data traffic analysis for 3G cellular networks,” in 2013 8th International Conference on Communications and Networking in China (CHINACOM), Aug 2013, pp. 468–472.
 [15] L. Xiang, X. Ge, C. Liu, L. Shu, and C. Wang, “A new hybrid network traffic prediction method,” in IEEE Global Telecommunications Conference (GLOBECOM), Dec 2010, pp. 1–5.
 [16] J. Feng, X. Chen, R. Gao, M. Zeng, and Y. Li, “Deeptp: An endtoend neural network for mobile cellular traffic prediction,” IEEE Network, vol. 32, no. 6, pp. 108–115, November 2018.
 [17] X. Wang, Z. Zhou, F. Xiao, K. Xing, Z. Yang, Y. Liu, and C. Peng, “Spatiotemporal analysis and prediction of cellular traffic in metropolis,” IEEE Transactions on Mobile Computing, pp. 1–1, 2018.
 [18] D. Miao, W. Sun, X. Qin, and W. Wang, “Msfs: Multiple spatiotemporal scales traffic forecasting in mobile cellular network,” in IEEE Intl Conf on Big Data Intelligence and Computing and Cyber Science and Technology Congress, Aug 2016, pp. 787–794.
 [19] F. Ni, Y. Zang, and Z. Feng, “A study on cellular wireless traffic modeling and prediction using elman neural networks,” in 2015 4th International Conference on Computer Science and Network Technology (ICCSNT), vol. 01, Dec 2015, pp. 490–494.
 [20] Y. Shu, M. Yu, O. Yang, J. Liu, and H. Feng, “Wireless traffic modeling and prediction using seasonal ARIMA models,” IEICE transactions on communications, vol. 88, no. 10, pp. 3992–3999, 2005.
 [21] A. G. Wilson, “Covariance kernels for fast automatic pattern discovery and extrapolation with gaussian processes,” Ph.D. dissertation, University of Cambridge, 2014.
 [22] B. Cao, D. Shen, J.T. Sun, Q. Yang, and Z. Chen, “Feature selection in a kernel space,” in Proceedings of the 24th international conference on Machine learning. ACM, 2007, pp. 121–128.
 [23] M. Ramona, G. Richard, and B. David, “Multiclass feature selection with kernel grammatrixbased criteria,” IEEE transactions on neural networks and learning systems, vol. 23, no. 10, pp. 1611–1623, 2012.

 [24] K.-P. Wu and S.-D. Wang, “Choosing the kernel parameters for support vector machines by the inter-cluster distance in the feature space,” Pattern Recognition, vol. 42, no. 5, pp. 710–717, 2009.
 [25] W. Zhang, C. Wang, X. Ge, and Y. Chen, “Enhanced 5G cognitive radio networks based on spectrum sharing and spectrum aggregation,” IEEE Transactions on Communications, vol. 66, no. 12, pp. 6304–6316, Dec 2018.
 [26] G. Yuan, R. C. Grammenos, Y. Yang, and W. Wang, “Performance analysis of selective opportunistic spectrum access with traffic prediction,” IEEE Transactions on Vehicular Technology, vol. 59, no. 4, pp. 1949–1959, May 2010.
 [27] W. Guo and T. O’Farrell, “Dynamic cell expansion with selforganizing cooperation,” IEEE Journal on Selected Areas in Communications, vol. 31, no. 5, pp. 851–860, May 2013.
 [28] S. Wang and W. Guo, “Energy and cost implications of a traffic aware and qualityofservice constrained sleep mode mechanism,” IET Communications, vol. 7, no. 18, pp. 2092–2101, December 2013.
 [29] Y. Sun, “Iterative RELIEF for feature weighting: algorithms, theories, and applications,” IEEE transactions on pattern analysis and machine intelligence, vol. 29, no. 6, pp. 1035–1051, 2007.
 [30] X. Xu, C. Yuan, W. Chen, X. Tao, and Y. Sun, “Adaptive cell zooming and sleeping for green heterogeneous ultradense networks,” IEEE Transactions on Vehicular Technology, vol. 67, no. 2, pp. 1612–1621, Feb 2018.
 [31] T. Bui, D. Hernández-Lobato, J. M. Hernández-Lobato, Y. Li, and R. Turner, “Deep Gaussian processes for regression using approximate expectation propagation,” in Proceedings of the 33rd International Conference on Machine Learning, 2016.

 [32] S. Lee, J. Kim, J. Jun, J. Ha, and B. Zhang, “Overcoming catastrophic forgetting by incremental moment matching,” in Advances in Neural Information Processing Systems (NIPS), 2017.