I Introduction
The internet continues to change how we connect with others, organize the flow of goods, and communicate information across the world. Global demand for network capacity has risen significantly as network technology has advanced and digital activities such as video streaming, remote conferencing, online gaming, and e-commerce have increased. Accurately predicting network traffic from historical trends is therefore indispensable for better Quality of Service (QoS), Quality of Experience (QoE), dynamic bandwidth reservation and allocation, congestion control, admission control [4], and privacy-preserving routing [3]. Considerable research has been devoted to efficient and accurate traffic prediction based on conventional statistical models [13] and modern machine learning or deep learning techniques.
Real internet traffic is a nonlinear time series [20], and developing an accurate prediction model is challenging due to its time variability, long-term correlation, self-similarity, suddenness, and chaos [6]. Despite these nonlinear characteristics, machine learning and deep learning methods have performed well. Most existing work assumes the data are independent and identically distributed (i.i.d.), i.e., that the train and test sets used for model development and evaluation come from the same distribution. In practice, however, the data distribution will not always be the same, owing to heterogeneous and anomalous internet traffic. In addition, there is growing evidence that deep learning and machine learning models exploit undesired shortcuts arising from selection biases, confounding factors, and other biases in the data [17]. Such biases prevent the learned prediction rules from generalizing to out-of-distribution data [16]. They also lead predictive models to minimize empirical risk by relying on correlation rather than causality. This is a serious concern for real-world AI (Artificial Intelligence) solutions in sensitive domains such as self-driving cars [1] and health care [12].
Efficient traffic prediction is also crucial for effective business decisions such as infrastructure expansion or reduction and new service adoption. Machine learning models [15] are an excellent fit for handling the dynamic nature of internet traffic and providing better predictions. However, a deployed traffic prediction tool is likely to face a data distribution that is slightly different from, or entirely unknown relative to, the distribution used for model training and testing. It is therefore essential to build a robust machine learning model that can handle a shift in the data distribution. This work proposes a hybrid machine learning model based on discrete wavelet transformation to improve out-of-distribution performance. The main contributions of our work are as follows:

Proposing a hybrid machine learning model architecture to analyze and predict real-world internet traffic.

Investigating different regressor models and their ensemble for single-step traffic prediction.

Evaluating the model performance over different feature lengths to identify the optimum window length.

Comparing the performance of standalone and hybrid models on four different traffic datasets to show the effectiveness of our proposed model for out-of-distribution generalization.
This paper is organized as follows. Section II reviews the literature on traffic prediction using machine learning models. Section III presents the methodology, including the dataset description and preprocessing steps, the discrete wavelet transform, the out-of-distribution setting, and the evaluation metrics. Section IV summarizes the performance of the different prediction models and compares standalone and hybrid models across the test distributions. Finally, Section V concludes the paper and sheds light on future research directions.
II Literature Review
A comparative performance analysis between a conventional statistical model, ARIMA, and a deep learning model, LSTM, is conducted in [2] for internet traffic prediction. The authors also used a signal decomposition technique, the discrete wavelet transform (DWT), to separate the linear and nonlinear components of the original data before feeding them into the prediction model. The performance of different deep sequence models is investigated in [14]. The authors treated anomalies in the time series data before feeding it into the prediction model and showed the effectiveness of outlier detection for internet traffic prediction. A hybrid model consisting of a statistical and a deep learning model is proposed in [9] for better performance than either standalone model. The authors applied the DWT to the time series to separate the linear and nonlinear components, which are modeled using an Auto-Regressive Integrated Moving Average (ARIMA) model and a Recurrent Neural Network (RNN), respectively. Since conventional statistical models such as ARMA and ARIMA cannot handle the nonlinear components of a time series, the signal decomposition technique lets the approach cope with complex, nonstationary internet traffic by combining the strengths of deep learning. According to their experimental results, the combination of ARIMA and RNN outperformed either individual model.
An artificial neural network (ANN) model combined with the multifractal DWT is proposed in [8]. The network traffic is decomposed into low-frequency and high-frequency components using the Haar wavelet; these components serve as targets for the ANN model, which takes the original traffic data as input. Finally, the model predictions are combined to reconstruct the actual value. Their model outperformed two existing methodologies. In [18], the authors performed a comparative analysis of different DWT methods and spline extrapolation for predicting the characteristics of multimedia IoT internet traffic. Spline extrapolation with B-splines performed best, giving a minimum forecast error of 5%, compared with the Haar wavelet and quadratic splines, whose prediction errors were 7-10% and 10%, respectively. In [19], the author developed a traffic prediction framework by combining several techniques: Mallat wavelet transformation, Hurst exponent analysis, model parameter optimization, and fusion of multiple prediction models. First, a three-level decomposition is carried out on the original traffic data to extract a set of approximate and detailed components. Then, the predictability of each component is analyzed using the Hurst exponent H, where a higher H indicates greater predictability. According to their study, the approximate component has a higher H than the detailed components. Consequently, a more capable machine learning model, the Least Squares Support Vector Machine (LS-SVM), is used to predict the detailed components, while the approximate component is modeled with ARIMA. The proposed method showed better prediction accuracy than the other models.
The use of signal decomposition for real-world internet traffic prediction is prevalent in the current literature. Existing work indicates that hybrid models capable of handling the linear and nonlinear components separately, using different types of models, outperform standalone ones. However, most research assumes that the model's data come from a single distribution, although the scenario is quite different in practice: once deployed in production, a prediction model is likely to see a distribution shift or a completely unknown distribution. According to recent studies, machine learning methods do not guarantee generalization to out-of-distribution data, and this issue has not been considered extensively in the existing literature. It is therefore vital to build a robust prediction model that can generalize to unknown distributions. In this research, we propose a hybrid machine learning model that generalizes better under distribution shift than the individual models.
III Methodology
III-A Dataset and Preprocessing Steps
Real internet traffic telemetry from several high-speed interfaces is used for this experiment. The data are collected every five minutes over a recent thirty-day period. A total of four different datasets are used in our experiment. The source domain dataset, Dataset A, consists of 8563 data samples and is used to build the predictive model for the source task. The other three datasets, Datasets B, C, and D, are comparatively small, with 363, 369, and 358 data instances, respectively. Time series data must be expressed in a proper format for supervised learning. Generally, a time series consists of tuples (time, value), which is inappropriate for feeding into a machine learning model, so we restructured the original time series using the sliding window technique. A high-level diagram of our methodology is shown in Fig. 1.
III-B Discrete Wavelet Transform
The wavelet transform is a mathematical way of finding hidden patterns in a dataset by decomposing the signal into the time-frequency domain. The process uses a wavelet, i.e., a wave-like oscillation, to extract multiple lower-resolution levels by controlling the scale and location of the wavelet [7]. There are two main types of wavelet transform: the Discrete Wavelet Transform (DWT) and the Continuous Wavelet Transform (CWT). In this work, we used the DWT for the decomposition step, because the CWT is very time-consuming and produces extra, redundant data. At each stage of the DWT, the signal is decomposed into two components: an approximate component a_j and a detailed component d_j, representing the general trend and the detailed events in the data, respectively. The approximate component at level j is used to calculate the components at the next level, j+1. A low-pass filter l and a high-pass filter h convolve the signal to generate the new a_{j+1} and d_{j+1}. The component equations are as follows [10]:

a_{j+1}[n] = \sum_{k} l[k] \, a_{j}[2n - k]    (1)

d_{j+1}[n] = \sum_{k} h[k] \, a_{j}[2n - k]    (2)

Here J is the total number of levels. We can reconstruct the original data by combining the detailed components from all levels with the approximate component from the last level, where A_J and D_j denote the components reconstructed to the original resolution:

x[n] = A_{J}[n] + \sum_{j=1}^{J} D_{j}[n]    (3)
Table I: Prediction accuracy (%) on Dataset A (test split from the training distribution) for each regressor, mother wavelet, and input window length.

       No Wavelet             dmey                   haar                   bior3.7
Model  6    9    12   15     6    9    12   15     6    9    12   15     6    9    12   15
XGB    96.1 96.0 95.9 95.9   97.4 97.4 97.4 97.3   97.2 97.3 97.3 97.2   97.2 97.2 97.2 97.3
LGB    96.2 96.2 96.1 96.2   97.4 97.4 97.4 97.4   97.3 97.3 97.3 97.3   97.1 97.2 97.2 97.2
SGD    96.0 96.1 96.2 96.1   96.7 96.9 97.0 97.0   96.7 96.6 96.9 96.9   96.7 96.8 96.8 96.9
GBR    96.1 96.1 96.2 96.2   97.1 97.1 97.1 97.2   97.1 97.1 97.1 97.1   97.1 97.2 97.2 97.2
CAB    96.3 96.1 96.2 96.2   97.3 97.2 97.2 97.2   97.2 97.1 97.1 97.1   97.2 97.2 97.2 97.1
ENS    96.1 96.3 96.3 96.4   97.4 97.4 97.3 97.3   97.2 97.2 97.1 97.2   97.1 97.1 97.1 97.0
Table II: Prediction accuracy (%) on Dataset B (out-of-distribution) for each regressor, mother wavelet, and input window length.

       No Wavelet             dmey                   haar                   bior3.7
Model  6    9    12   15     6    9    12   15     6    9    12   15     6    9    12   15
XGB    78.9 78.7 80.1 78.0   87.9 87.6 87.4 87.4   87.3 87.5 87.5 87.1   83.3 81.9 82.1 82.8
LGB    83.4 83.3 82.9 82.8   87.6 87.3 87.1 86.9   87.3 87.4 87.2 87.0   84.8 84.5 84.1 83.9
SGD    80.1 79.5 78.6 77.9   84.1 83.8 82.8 82.6   84.0 83.3 82.5 81.7   83.6 83.3 82.3 81.4
GBR    82.6 81.6 82.0 81.0   86.7 86.9 86.8 86.7   86.5 86.2 85.6 85.5   82.5 82.5 82.0 81.5
CAB    80.6 79.8 81.3 80.4   85.7 84.9 84.8 84.0   85.6 85.2 85.3 84.8   83.8 83.2 83.0 81.3
ENS    79.3 80.3 80.5 79.9   86.9 85.5 85.4 84.9   86.0 85.5 86.0 85.2   84.9 84.4 83.8 83.4
Table III: Prediction accuracy (%) on Dataset C (out-of-distribution) for each regressor, mother wavelet, and input window length.

       No Wavelet             dmey                   haar                   bior3.7
Model  6    9    12   15     6    9    12   15     6    9    12   15     6    9    12   15
XGB    72.6 71.7 72.9 71.0   84.0 83.6 83.2 83.1   82.2 82.5 82.3 82.3   80.7 79.7 79.6 80.2
LGB    75.6 75.9 75.4 75.1   83.5 83.3 83.2 83.1   82.2 82.2 82.0 81.9   82.0 81.7 81.4 81.1
SGD    73.9 73.6 73.3 72.6   79.7 79.5 78.9 78.9   78.6 78.4 77.9 77.3   79.6 79.2 78.9 78.1
GBR    74.1 72.8 73.9 73.0   83.1 83.1 82.6 82.8   80.9 81.1 80.8 80.6   79.1 78.7 78.7 78.9
CAB    74.0 73.2 73.6 73.8   82.5 81.4 81.8 81.1   80.8 80.6 80.4 80.0   80.9 80.9 81.1 80.0
ENS    73.5 74.4 74.3 73.7   82.6 81.2 80.7 80.9   80.8 80.6 80.5 80.0   81.4 81.2 80.3 80.4
Table IV: Prediction accuracy (%) on Dataset D (out-of-distribution) for each regressor, mother wavelet, and input window length.

       No Wavelet             dmey                   haar                   bior3.7
Model  6    9    12   15     6    9    12   15     6    9    12   15     6    9    12   15
XGB    85.3 84.9 85.1 85.0   89.4 89.4 89.1 89.4   89.1 89.1 89.1 89.1   87.8 87.2 87.2 87.4
LGB    86.6 86.5 86.4 86.6   89.6 89.5 89.4 89.6   89.3 89.2 89.1 89.3   89.8 89.5 89.3 89.4
SGD    85.0 84.5 84.5 84.7   87.5 87.2 87.0 87.1   87.4 86.9 86.9 87.0   88.2 87.7 87.6 87.7
GBR    86.5 86.0 86.3 86.2   88.6 88.7 88.7 88.7   88.7 88.6 88.5 88.8   87.0 87.0 86.9 87.1
CAB    86.3 86.2 86.0 86.2   89.6 89.5 89.6 89.7   89.3 89.4 89.2 89.3   89.5 89.1 88.7 88.6
ENS    86.1 85.9 86.0 86.1   89.2 88.6 88.6 88.3   88.6 88.4 88.2 88.2   89.2 89.1 88.5 89.1
III-C Out-of-distribution Generalization
A machine learning model deals with a feature set X and a corresponding target Y, which may be discrete (a classification task) or continuous (a regression task). The goal of a machine learning algorithm is to optimize the learnable parameters θ of a function f_θ. The parameters θ are optimized with respect to a loss function L(ŷ, y), which measures the gap between actual and predicted values, and the model seeks the parameters θ with minimum loss. Many machine learning models are developed under the assumption that the train and test data come from the same distribution, i.e., P_train(X, Y) = P_test(X, Y). But the train and test distributions can differ for several reasons, such as the temporal or spatial evolution of the data or sample selection bias in the data collection process. Out-of-distribution (OOD) generalization therefore studies machine learning methods under a distribution shift between P_train and P_test. In an OOD problem setting, we must define how the test distribution differs from the training distribution. Several types of distribution shift appear in the literature, but the most common is covariate shift, where the target generation process P(Y | X) stays the same while the marginal distribution P(X) shifts from the training set to the test set. In this work, we deal with covariate shifts in the data.
III-D Evaluation Metrics
We use the Weighted Average Percentage Error (WAPE) to estimate the performance of our traffic forecasting models. The metric measures the deviation of the predicted result from the original data: WAPE represents the total fluctuation between actual and predicted values as a percentage of the total actual volume. We define the metric, and the accuracy reported in our results, as follows:

\mathrm{WAPE} = \frac{\sum_{i=1}^{N} \lvert y_i - \hat{y}_i \rvert}{\sum_{i=1}^{N} \lvert y_i \rvert} \times 100\%    (4)

\mathrm{Accuracy} = 100\% - \mathrm{WAPE}    (5)

Here, \hat{y}_i and y_i are the predicted and original values, respectively, and N is the total number of test instances.
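A direct implementation of (4)-(5); the accuracy form, 100 minus WAPE, is our reading of how the percentages in the result tables are reported:

```python
import numpy as np

def wape(y_true, y_pred):
    """Weighted Average Percentage Error, Eq. (4): total absolute error
    normalized by total actual volume, in percent."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return 100.0 * np.abs(y_true - y_pred).sum() / np.abs(y_true).sum()

def accuracy(y_true, y_pred):
    """Prediction accuracy, Eq. (5): 100 - WAPE."""
    return 100.0 - wape(y_true, y_pred)

# Toy check: 4% total deviation on a flat series gives 96% accuracy.
actual = [100.0, 100.0, 100.0, 100.0]
predicted = [104.0, 96.0, 104.0, 96.0]
print(round(accuracy(actual, predicted), 1))  # -> 96.0
```

Unlike MAPE, WAPE weights errors by actual volume, so quiet periods with near-zero traffic do not dominate the score.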
III-E Software and Hardware Preliminaries
The models are implemented in Python using scikit-learn [11], and the wavelet decomposition uses the PyWavelets package [5].
IV Results and Discussion
IV-A Experimental setup
As mentioned earlier, our main research objective is to improve the machine learning model's performance under a distribution shift. We use four internet traffic datasets extracted from real-world telemetry. The datasets are not equivalent in data volume, and their distributions differ, as shown in Fig. 2. We therefore use the comparatively larger dataset, Dataset A, to build our traffic prediction model; Datasets B, C, and D are the test datasets with distribution shifts. Our experiment has three main parts: i) design and validate a traffic prediction model on Dataset A; ii) validate the model performance on the other three datasets, B, C, and D, which are entirely unknown to the model; and iii) finally, modify the models by adopting the discrete wavelet transformation to improve inference on Datasets B, C, and D.
IV-B Result analysis
We applied several machine learning models: eXtreme Gradient Boosting (XGB), Light Gradient Boosting Machine (LGB), CatBoost Regressor (CAB), Stochastic Gradient Descent (SGD), Gradient Boosting Regressor (GBR), and an ensemble model (ENS) based on the stacking technique. The performance of all models trained and validated on Dataset A is summarized in Table I. The group column 'No Wavelet' shows the standalone model performance, while the other three groups, 'dmey', 'haar', and 'bior3.7', show the hybrid model with the corresponding mother wavelet. We used four different feature window lengths, 6, 9, 12, and 15, to identify the optimum number of inputs for the model. Among the standalone models, XGBoost achieved its best prediction accuracy of 96.1% using six input features. The highest accuracies for LGB, SGD, GBR, and CAB are 96.2%, 96.2%, 96.2%, and 96.3%, using 6, 12, 12, and 6 inputs, respectively. The ensemble model gave the best prediction accuracy of 96.4% using 15 input features. The standalone models were then modified to accept the components obtained by discrete wavelet decomposition, varied over three mother wavelets. Overall, the hybrid models improved performance in every setting; the best hybrid result, 97.4%, was achieved by the ensemble model with the dmey wavelet, a 1% improvement over the standalone model.
After evaluating the models on a test set drawn from the training distribution, we used the three other datasets, drawn from different distributions, to validate model performance out of distribution. The results are summarized in Tables II, III, and IV for Datasets B, C, and D, respectively. Because the distributions of these three datasets differ from the training distribution, model performance decreases significantly. Fig. 3 depicts the performance trend of each prediction model across the four test distributions. Performance peaks when the model is evaluated on data from the training distribution, i.e., trained and tested on Dataset A, and drops from the 95-96% range to the 80-85% range when tested on a dataset with distribution shift; the figure shows a similar pattern for the other datasets.
Finally, we evaluated the hybrid models' out-of-distribution generalization. Figs. 4, 5, and 6 compare the standalone model against the three hybrid variants across mother wavelets and input sizes. Our experiment shows a significant performance gain from wavelet transformation: for example, the XGB model with the dmey wavelet improved accuracy by more than 8% on one shifted dataset compared with the standalone model, and the LGB model with the haar wavelet improved accuracy by more than 10% on another. Overall, the hybrid model outperforms the standalone model for every input set. The hybrid model with the dmey mother wavelet performs best on average, while bior3.7 performs worst, so choosing a proper mother wavelet is essential for good prediction results. The effect of selecting an appropriate window size is also evident in the diagrams. Fig. 7 depicts the hybrid models' performance improvement over the standalone models on the three datasets that differ from the training set. Each model shows better generalization power when combined with the decomposition technique.
V Conclusion
In this work, we evaluated machine learning models' performance under a distribution shift in internet traffic data. Most machine learning models for traffic prediction are assessed on data from a single distribution, but in practice a deployed prediction model will most likely encounter data that differ slightly or entirely from the data used for model development; real-world traffic is highly volatile, being susceptible to various internal and external factors. Our experiment considered a total of four datasets. The larger one was used for conventional machine learning model development and validation, with a 70%/30% train/test split. Six machine learning models were used for the traffic prediction task, with the ensemble model giving the best prediction accuracy of 96.4%. After integrating the decomposition technique, the best performance increased by 1%, to 97.4%, with the ensemble model and the dmey wavelet. We then used the same models for inference on the three smaller datasets, which come from distributions different from the training data and are entirely unknown to the models. Due to the distribution shift, model performance decreased significantly, illustrating a fundamental problem with machine learning solutions after real-world deployment. However, the hybrid models performed considerably better than the standalone models at out-of-distribution generalization. In the future, we would like to extend this work by adopting empirical mode decomposition to avoid the problem of choosing a suitable wavelet. We also plan to experiment with multi-step forecasting.
References
 [1] (2021) Self-driving cars: a survey. Expert Systems with Applications 165, pp. 113816. Cited by: §I.
 [2] (2021) An entropy-based hybrid mechanism for large-scale wireless network traffic prediction. In 2021 International Symposium on Networks, Computers and Communications (ISNCC), pp. 1-6. Cited by: §II.
 [3] (2019) A dynamic multipath scheme for protecting source-location privacy using multiple sinks in WSNs intended for IIoT. IEEE Transactions on Industrial Informatics 16 (8), pp. 5527-5538. Cited by: §I.
 [4] (2016) Adaptive LSTAR model for long-range variable bit rate video traffic prediction. IEEE Transactions on Multimedia 19 (5), pp. 999-1014. Cited by: §I.
 [5] (2019) PyWavelets: a python package for wavelet analysis. Journal of Open Source Software 4 (36), pp. 1237. Cited by: §III-E.
 [6] (2016) A new approach for chaotic time series prediction using recurrent neural network. Mathematical Problems in Engineering 2016. Cited by: §I.
 [7] (2016) Identifying long-term variations in vegetation and climatic variables and their scale-dependent relationships: a case study in southwest Germany. Global and Planetary Change 147, pp. 54-66. Cited by: §III-B.
 [8] (2019) Network traffic model with multifractal discrete wavelet transform in power telecommunication access networks. In International Conference on Simulation Tools and Techniques, pp. 53-62. Cited by: §II.
 [9] (2018) Predicting computer network traffic: a time series forecasting approach using DWT, ARIMA and RNN. In 2018 Eleventh International Conference on Contemporary Computing (IC3), pp. 1-5. Cited by: §II.
 [10] (2019) A framework for short-term traffic flow forecasting using the combination of wavelet transformation and artificial neural networks. Journal of Intelligent Transportation Systems 23 (1), pp. 60-71. Cited by: §III-B.
 [11] (2011) Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12, pp. 2825-2830. Cited by: §III-E.
 [12] (2018) Scalable and accurate deep learning with electronic health records. NPJ Digital Medicine 1 (1), pp. 1–10. Cited by: §I.
 [13] (2022) An empirical study on internet traffic prediction using statistical rolling model. IEEE. Note: (in press) Cited by: §I.
 [14] (2022) Deep sequence modeling for anomalous isp traffic prediction. IEEE. Note: (in press) Cited by: §II.
 [15] (2022) Towards an ensemble regressor model for anomalous isp traffic prediction. IEEE. Note: (in press) Cited by: §I.
 [16] (2021) Toward causal representation learning. Proceedings of the IEEE 109 (5), pp. 612–634. Cited by: §I.
 [17] (2021) Towards out-of-distribution generalization: a survey. arXiv preprint arXiv:2108.13624. Cited by: §I.
 [18] (2020) Multimedia traffic prediction based on wavelet- and spline-extrapolation. In 2020 IEEE International Black Sea Conference on Communications and Networking (BlackSeaCom), pp. 1-5. Cited by: §II.
 [19] (2020) Network traffic prediction method based on wavelet transform and multiple models fusion. International Journal of Communication Systems 33 (11), pp. e4415. Cited by: §II.
 [20] (2017) Network traffic prediction based on rbf neural network optimized by improved gravitation search algorithm. Neural Computing and Applications 28 (8), pp. 2303–2312. Cited by: §I.