Superiority of Simplicity: A Lightweight Model for Network Device Workload Prediction

07/07/2020 ∙ by Alexander Acker, et al. ∙ Berlin Institute of Technology (Technische Universität Berlin)

The rapid growth and distribution of IT systems increases their complexity and aggravates operation and maintenance. To sustain control over large sets of hosts and the connecting networks, monitoring solutions are employed and constantly enhanced. They collect diverse key performance indicators (KPIs) (e.g. CPU utilization, allocated memory, etc.) and provide detailed information about the system state. Storing such metrics over a period of time naturally raises the motivation of predicting future KPI progress based on past observations. Although a variety of time series forecasting methods exist, forecasting the progress of IT system KPIs is very hard. First, KPI types like CPU utilization or allocated memory are very different and hard to express with the same model. Second, system components are interconnected and constantly changing due to soft- or firmware updates and hardware modernization. Thus, frequent model retraining or fine-tuning must be expected. Therefore, we propose a lightweight solution for KPI series prediction based on historic observations. It consists of a weighted heterogeneous ensemble method composed of two models - a neural network and a mean predictor. As ensemble method a weighted summation is used, whereby a heuristic is employed to set the weights. The modelling approach is evaluated on the available FedCSIS 2020 challenge dataset and achieves an overall R^2 score of 0.10 on the preliminary 10% test data and 0.15 on the complete test data. We publish our code on the following GitHub repository:




I Introduction

IT systems are rapidly evolving to meet the growing demand for new applications and services in a variety of fields like industry, medicine or autonomous transportation. This entails an increasing number of interconnected devices, large networks and growing data centres to provide the required infrastructure. Although this trend accelerates innovations and business opportunities, it increases complexity and thus aggravates the operation and maintenance of these systems. Operators need assistance to maintain control over this complexity. Therefore, monitoring solutions are implemented. They constantly collect system KPIs like latency, throughput, or system resource utilization and provide detailed information about the monitored IT system. One particularly important aspect of system monitoring is the prediction of future system load based on historic observations. Several efforts were made to enable this, ranging from linear regression [3] over Bayesian statistics [2] to neural networks [12].

A precise prediction of future system load enables predictive decision making and thus, ahead-of-time optimization. Anomaly detection methods can be employed to compare the difference between the predicted and the actual state and raise alarms in case of unforeseen deviations [11]. Scaling up on imminent load peaks or scaling down during moderate or low utilization periods helps to optimize for cost and user experience [8]. Among many others, scheduling decisions [6], network routing and dimensioning [5], data centre cooling control [1] or predictive maintenance [14] all benefit from precise system load predictions.

A particular data source for load prediction are KPIs like CPU utilization, allocated memory or network throughput of individual system components. Sampled at fixed time intervals or aggregated over a period, they represent the evolution of system states as time series. Following this, the task of system load prediction can be formulated as a time series forecasting problem. Forecasting different types of KPI time series based on historic observations is very hard due to the properties of IT systems. First, different KPI types are highly non-uniform. While CPU utilization is usually very volatile, memory allocation is rarely overlaid by noise. Further, disk read and write operations expose bursty patterns due to buffering, resulting in flat sequences with sporadic peaks. The concrete pattern of these series depends on unknown external factors and a variety of internal factors. An example of an external factor is the difference in user behaviour based on the day of the week, night- and daytime hours, or occasional events like the Christmas days. Load on weekends can significantly differ from load on weekdays. The same applies to the difference between night- and daytime hours. Therefore, load usually follows seasonal patterns and long-term trends. Incorporating this information either explicitly or implicitly into the prediction model enhances the prediction performance. Also, the IT system itself is problematic from a modeling perspective due to its dynamic nature and high uncertainty. Frequent soft- and firmware updates or hardware modernization change system properties and usually require model retraining or fine-tuning. This does not only affect the changed component itself but might also propagate to connected devices. Co-locations, changing scheduling policies and maintenance operations are additional sources of uncertainty. This imposes the requirement of frequent and fast model adaption.

Related work on time series forecasting is diverse and ranges from traditional linear or non-linear regression [13] and stochastic methods [4] over deep learning models [7] to ensemble methods [15, 10]. Traditional regressive or statistical models are often not able to capture the underlying complex processes, which results in imprecise predictions. Models based on neural networks or ensemble methods usually provide more accurate predictions but suffer from high complexity and an accompanying high computational overhead required to train them. This is a major limitation for systems with large numbers of components such as data centres or large networks. Additionally, the previously described dynamic nature aggravates this limitation.

Considering this, we present our solution for this year's FedCSIS 2020 challenge. It proposes a model for network device workload prediction, whereby future KPI values have to be predicted based on historic data. Due to the high number of components and KPI time series, we focused on a lightweight modelling approach in order to keep the solution computationally feasible. The model combines the overall average of each time series with a prediction from a linear neural network. Furthermore, we employed heuristics to tackle numerical imprecision and enhance overall prediction performance. Our solution achieved an overall R^2 score of 0.1053 on the preliminary 10% test data and 0.1501 on the complete test data.

The rest of the paper is structured as follows. First, section II describes the problem of network device workload prediction and provides a preliminary analysis of the available training data set. Second, section III introduces our solution for workload prediction. It includes a formal problem definition based on time series forecasting and explains each element of our proposed method. Third, an evaluation of our method with respect to runtime and prediction performance is performed. The results are presented in section IV. Finally, section V concludes our paper.

II Network Device Workload Prediction

This year's FedCSIS 2020 challenge was to predict the future workload of network devices based on past workload observations. More specifically, the workload of a set of devices, referred to as hosts, was characterized by KPI series such as CPU utilization, incoming and outgoing network traffic or allocated main memory. The data were collected hourly over a period of 3 months with sporadically missing samples. Overall, 45 different KPIs were recorded from 3,716 hosts, whereby the workload of individual hosts was described by different KPI subsets. Each hourly KPI series sample consists of seven measurement aggregations over the respective hour. These are the number of collected measurements, the mean and standard deviation, and the first, last, highest and lowest measurement. All seven aggregations can be used as input but only the future mean value must be predicted, resulting in a possibly multivariate input but univariate output.

Fig. 1: Example of four KPIs for six hosts. A great diversity of values, both between and within KPIs, can be observed for the different hosts. This indicates the major challenge when forecasting the expected future values of the KPIs.

The plots in Fig. 1 show four different KPI mean values from six different hosts. Thereby, each series was split into weekly windows from Monday until Sunday and arranged by the hour of the week, resulting in ten aggregated weekly series for each plot. The dark blue line shows the mean value while the light blue area visualizes the confidence interval. It can be observed that KPI series are highly non-uniform, which indicates the major challenge when forecasting the expected future values of the KPIs. There are KPI series with high noise ("cpu_5s" of host 8139) while others remain fairly constant ("memory_used" of host 7159). Some KPIs follow long-term trends ("memory_used" of host 1279) and several series are periodical on a daily and weekly basis ("mem_by_proc" and "in_traffic" of hosts 4064 and 4289). Furthermore, these properties vary for the same KPI type depending on the host from which it was collected. While "cpu_5s" is fairly constant but noisy for host 7159, a clear seasonality can be observed for host 1279.
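The hour-of-week aggregation behind Fig. 1 can be sketched as follows; a minimal example assuming the KPI is held as an hourly pandas Series (the random data and all names are illustrative, not the challenge data):

```python
import numpy as np
import pandas as pd

# Hypothetical hourly KPI series over 10 full weeks, starting on a Monday.
rng = pd.date_range("2020-01-06", periods=24 * 7 * 10, freq="H")
kpi = pd.Series(np.random.rand(len(rng)), index=rng)

# Arrange by hour of the week (0 = Monday 00:00, ..., 167 = Sunday 23:00)
# and aggregate across weeks, yielding the mean line and the spread that
# the confidence band in the figure is drawn from.
hour_of_week = kpi.index.dayofweek * 24 + kpi.index.hour
profile = kpi.groupby(hour_of_week).agg(["mean", "std"])  # 168 rows
```

Each of the 168 rows summarizes one hour-of-week position across all observed weeks.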

III Lightweight Workload Prediction Model

In this section we present our method for lightweight workload prediction. Its concept and architecture were chosen based on the previously described observations and analyses in section II. In subsection III-A we describe the workload prediction problem in the form of time series forecasting and define all required preliminaries. After that, we present our method for workload prediction in subsection III-B, including the data preprocessing, feature selection and forecasting.

III-A Preliminaries

We define the task of workload prediction as a time series forecasting problem. A time series is a temporally ordered sequence of values $X = (x_1, x_2, \ldots, x_T)$, $x_i \in \mathbb{R}^d$, where $d$ is the dimensionality of each point. For $X$, we denote indices $i$ and $j$ with $1 \leq i < j \leq T$ as time series boundaries in order to slice a given series and acquire a subseries $X_{i:j} = (x_i, x_{i+1}, \ldots, x_j)$. The variable $t$ defines the time stamp of the last sample of the past observations. Additionally, we use the notation $x^{(k)}$ to refer to a certain dimension $k$ of a value $x$, with $1 \leq k \leq d$. Furthermore, meta information for each time series value $x_i$ is denoted as $m_i$.

The problem of workload prediction is modelled as the forecasting of a future univariate value $x_{t+h}^{(k)}$, with $h \geq 1$, conditioned on a sequence of past values $X_{t-w+1:t}$ and known meta information $m_{t+h}$ about the future time stamp. Therefore, the learning objective is to select a function $f: \mathbb{R}^p \rightarrow \mathbb{R}$, where $p$ is the dimensionality of the input, that results in a small generalization loss

$$\mathcal{L} = \sum_{h \in H} \ell\big(f(X_{t-w+1:t}, m_{t+h}),\; x_{t+h}^{(k)}\big),$$

where $\ell$ is a bounded loss function and $H$ is the set of offsets defining all future time stamps to predict.

III-B Lightweight Workload Prediction Model

The overall architecture of our method is depicted in Fig. 2. A future time series value $x_{t+h}$ should be predicted based on the history $X_{1:t}$ and its known meta information $m_{t+h}$. For the task of workload prediction, each time series represents a KPI. The respective dimensions of a sample $x_i$ are the aggregated values of that KPI between time $i-1$ and $i$. Due to their importance, we selectively denote the mean and last measurement of a sample as $x^{(\mu)}$ and $x^{(\lambda)}$, with $\mu, \lambda \in \{1, \ldots, d\}$. The mean value $x_{t+h}^{(\mu)}$ of the sample is the prediction target. Since many workload series are seasonal, we additionally add the encoded day of week and hour of day as meta information $m_{t+h}$. Subsequently, each model element is described in detail.

Fig. 2: Overall solution architecture.

Preprocessing. Initially, a rescaling of each value in the KPI series to a uniform range is performed. Furthermore, the values in $X$ are expected to be sampled hourly. If samples are missing, a linear interpolation is employed.
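A minimal sketch of this preprocessing step, assuming the series is held as a pandas Series with a datetime index; the concrete rescaling bounds are free parameters here, not values stated in the paper:

```python
import pandas as pd

def preprocess(series: pd.Series, lo: float, hi: float) -> pd.Series:
    """Resample to an hourly grid, fill gaps linearly, rescale to [lo, hi].

    The target range (lo, hi) is an assumption of this sketch; the paper
    only states that a uniform range is used.
    """
    hourly = series.resample("1H").mean()          # enforce hourly sampling
    hourly = hourly.interpolate(method="linear")   # fill missing samples
    span = hourly.max() - hourly.min()
    if span == 0:                                  # constant series: map to lo
        return pd.Series(lo, index=hourly.index)
    scaled = (hourly - hourly.min()) / span        # min-max to [0, 1]
    return scaled * (hi - lo) + lo
```

A gap at 02:00 in an otherwise hourly series is filled by the midpoint of its neighbours before the min-max rescaling is applied.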

Feature Selection. Due to the additional overhead that is introduced by automated feature selection methods, we choose to select a fixed subset of features manually. Furthermore, we focus on a minimal set of features to keep the model capacity low. The features are selected depending on the model that they are forwarded to. Therefore, we define one filter for the mean predictor and one for the neural network model (NN). The mean predictor filter includes only the mean values $x^{(\mu)}$ of the series. The NN filter applies two feature selection operations. First, out of the aggregated values in the last available series sample, we pick the mean and last value, i.e. $x_t^{(\mu)}$ and $x_t^{(\lambda)}$. Second, motivated by the seasonality of system load, we additionally use the mean values of the same hour of the week as the prediction target from previous weeks.
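The NN filter can be sketched as follows, assuming the per-hour mean and last aggregations are held as plain arrays; the helper name, indexing convention and number of history weeks are illustrative assumptions:

```python
import numpy as np

HOURS_PER_WEEK = 168

def nn_features(mean_series, last_series, t, n_weeks):
    """Filter for the NN model: the mean and last value of the most recent
    sample at position t, plus the mean values at the same hour-of-week
    position in each of the preceding n_weeks weeks. The prediction target
    is assumed to lie one hour after t."""
    feats = [mean_series[t], last_series[t]]
    target_pos = t + 1
    for w in range(1, n_weeks + 1):
        feats.append(mean_series[target_pos - w * HOURS_PER_WEEK])
    return np.asarray(feats, dtype=float)
```

For a target two weeks into the series, this yields a four-element vector: last sample's mean, last sample's last value, and the same hour-of-week mean from each of the two previous weeks.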

The Models. The mean predictor calculates the overall average over the filtered sample series. The NN model is a linear feed-forward neural network. It receives the preprocessed and filtered data, the meta-information values and the output of the mean model. These are combined into a flat input vector $x$. The learning objective of the NN model is to minimize the squared error loss between the prediction $\hat{y}_{NN}$ and the mean value of the target sample:

$$\mathcal{L}_{NN} = \big(\hat{y}_{NN} - x_{t+h}^{(\mu)}\big)^2.$$

Our employed network structure is depicted in Fig. 3. We use a fanning-out first hidden layer. Its size is fourfold of the input layer size. The subsequent layers are tapered, which works as regularization. Furthermore, we use a dropout between the first and second hidden layer as an additional regularization. A rectified linear unit (ReLU) activation is applied to the output value of the network. The outputs of the mean model and NN model are denoted as $\hat{y}_{avg}$ and $\hat{y}_{NN}$, respectively.
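A NumPy sketch of the described structure: a fanning-out first hidden layer of four times the input size, tapered subsequent layers, and a ReLU on the output. The halving taper, the hidden-layer activation and the initialization scale are assumptions of this sketch, and dropout (active only during training) is omitted in this inference-time view:

```python
import numpy as np

def build_net(d_in, seed=0):
    """Layer sizes for the assumed architecture: input -> 4*d_in, then
    each subsequent layer halves in size until reaching the scalar output."""
    sizes = [d_in, 4 * d_in]
    while sizes[-1] // 2 > 1:           # taper: halve each hidden layer
        sizes.append(sizes[-1] // 2)
    sizes.append(1)                     # scalar prediction
    rng = np.random.default_rng(seed)
    return [(rng.standard_normal((m, n)) * 0.1, np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(net, x):
    """Forward pass; a ReLU is applied to the final output as in the paper.
    The ReLU on hidden layers is an assumption of this sketch."""
    h = x
    for i, (W, b) in enumerate(net):
        h = h @ W + b
        if i < len(net) - 1:
            h = np.maximum(h, 0.0)      # hidden-layer activation (assumed)
    return np.maximum(h, 0.0)           # output ReLU per the paper
```

With a 10-dimensional input this produces the tapered size sequence 10, 40, 20, 10, 5, 2, 1 and a non-negative scalar prediction.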

Fig. 3: Structure of our feed-forward network.

Ensemble Layer. To combine the predictions of the mean model and the NN model, a weighted average over the model outputs is calculated:

$$\hat{y} = w_{avg} \cdot \hat{y}_{avg} + w_{NN} \cdot \hat{y}_{NN}, \quad w_{avg} + w_{NN} = 1.$$
The usage of two models is motivated by the non-uniformity of the KPI series. While the neural network is capable of predicting seasonal series fairly well, it fails to accurately predict constant but noisy series. A simple average over all mean values of a KPI resulted in good predictions for constant but noisy series but in bad predictions for seasonal series. By combining both, we expect to achieve a generally better result.
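The ensemble step itself reduces to a convex combination of the two outputs; a one-line sketch (the parameter name is illustrative):

```python
def ensemble(y_avg: float, y_nn: float, w_nn: float = 0.5) -> float:
    """Weighted summation of the two model outputs; the weights sum to one.
    w_nn = 0.5 gives the equal weighting, w_nn = 0.0 falls back entirely
    to the mean predictor, as selected by the heuristic in the evaluation."""
    return (1.0 - w_nn) * y_avg + w_nn * y_nn
```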

IV Evaluation

Given three months of historic data, the task is to predict the respective mean KPI values of the subsequent week. Overall, the future progress of 10,000 KPI series must be predicted. Samples are given hourly and thus, the predictions are expected hourly as well. This results in a sequence of 168 samples that have to be predicted for each series. In this section, we evaluate the proposed method in terms of runtime and prediction performance. Subsection IV-A describes the parametrization that was used to predict the submission results together with the training process. In subsection IV-B we provide a runtime analysis underlining the requirement of a lightweight modeling solution. Finally, subsection IV-C provides an overview of the achieved challenge scores.

IV-A Training and Parameterization

KPI series are diverse depending on the type and the host from which they were collected. Therefore, we choose to train individual models for each KPI series. The mean model calculates an overall average over all mean values from the available three months of data. Thus, its filter selects all available mean values of the respective KPI series.

Training of the NN model requires the definition of a training set. Therefore, a set of inputs and prediction targets is defined. The target is always a specific mean value $x^{(\mu)}$ at the prediction target time stamp. As target value meta information, its hour of day and day of week are used. To acquire the input data, the NN filter is applied to a fixed number of preceding weeks of the KPI training series, with 168 hours per week. Thereof, the mean and last value of the last sample before the target are selected. Further, respecting the seasonality of several KPI series, the mean values of the same hour of the week as the prediction target are added to the input. For the rescaling, fixed lower and upper bounds are defined. The neural network is trained via backpropagation, whereby the mean squared error is used as the optimization criterion and Adam as the optimizer, with a fixed learning rate and dropout probability. One individual model is trained for every KPI series. We set a fixed number of six epochs and do not use a validation set, in order to make use of all available data for training. Based on the above definition of training data creation, all possible input/target tuples are used and define one epoch of the training process.

IV-B Runtime Analysis

Due to their non-uniformity, we propose to train a separate model for each KPI series. Furthermore, frequent retraining can be expected due to the dynamic nature of IT systems. Considering this, we conduct a preliminary runtime analysis and compare our neural network to a recurrent version of it. For the recurrent network, we use long short-term memory (LSTM) cells instead of linear cells. The overall architecture remains the same as depicted in Fig. 3. We measure the training time per epoch on a bare-metal machine with an Intel(R) Core(TM) i5-9600K CPU @ 3.70GHz, 3x32 GB RAM and two Nvidia GeForce RTX 2080 Titan GPUs, whereof one was utilized during the runtime measurement experiments. Ubuntu 18.04.3 LTS with kernel version 5.3.0-51-generic is installed as OS, and Python version 3.6.7 and PyTorch version 1.4.0 are used to implement the networks. The result of the comparison is shown in Fig. 4.

Fig. 4: Runtime analysis results.

It can be observed that the LSTM version requires significantly more time for training than the network with linear cells. In comparison, the runtime increases by a factor of ten. Having six epochs per series and a total of 10,000 series to predict, the total required training time stays within hours for the linear version but grows to multiple days when using LSTM cells.

Although recurrent neural network architectures, especially with LSTM cells, are reported to perform well on sequential data prediction tasks [9], our runtime analysis shows that the required training time is very high. The task of training a model for each series is completely parallelizable. However, our access is limited to the above-described machine, and a training time of multiple days is infeasible for us. Therefore, the utilization of linear cells is chosen.

IV-C Prediction Results

The performance of the proposed workload prediction method is evaluated against the withheld test set by submitting the solution via the official FedCSIS 2020 challenge submission system. The submissions are scored by the R^2 score, defined as

$$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2},$$
where $\hat{y}_i$ denotes the prediction and $\bar{y}$ the overall average over all mean samples. Based on our observations, several KPI series are mainly constant with sporadic deviations, resulting in a very small normalization value (the denominator of the R^2 score). This results in large division values and thus, low scores even for small deviations of the predicted values. These values had a high impact on the overall score. Furthermore, several KPI series can be described as noise around a baseline. Such series are better predicted by their baseline instead of guessing random noise. This motivated us to implement a heuristic that chooses an adaptive weighting of the model outputs: either an equal weighting of 0.5 for each model is set, or the NN model weight is set to 0.0 and the average predictor weight to 1.0. The decision is made as follows. First, the neural network is trained. Second, the last available week is used as a prediction target and the respectively filtered data before that week as input. Since this last week was explicitly trained on, we assume precise prediction results, i.e. an R^2 score close to 1. If the neural network output results in a lower score than the output of the average predictor, we set the weight for the average predictor to 1.0 and the neural network weight to 0.0. Otherwise, both weights are set to 0.5.
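The scoring and the weight-selection heuristic can be sketched as follows; the function names and the held-out-week interface are illustrative assumptions:

```python
import numpy as np

def r2_score(y_true, y_pred):
    """R^2 against the overall mean of y_true. Near-constant series give a
    tiny denominator, which is what makes even small errors costly here."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def choose_nn_weight(y_last_week, nn_pred, avg_pred):
    """Heuristic from the paper: score both models on the last available
    week; if the NN scores worse than the mean predictor, disable it."""
    if r2_score(y_last_week, nn_pred) < r2_score(y_last_week, avg_pred):
        return 0.0   # fall back entirely to the average predictor
    return 0.5       # equal weighting otherwise
```

Predicting the overall mean yields an R^2 of exactly 0, so the heuristic keeps the NN only when it beats that trivial baseline on the held-out week.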

Finally, the prediction for the submission is done based on the filtered last available weeks in the training data set. The score results are listed in TABLE I.

                             baseline   1st      2nd      Ours
Preliminary test set (10%)   0.2267     0.1888   0.1841   0.1053
Complete test set (100%)     0.2295     0.1630   0.1515   0.1501
TABLE I: R^2 scores of the three best submissions together with the baseline.

None of the submitted results was able to reach the specified baseline. Two submissions achieved a better score than our solution, with 0.1888 and 0.1841 on the preliminary 10% of the test data and 0.1630 and 0.1515 on the complete test dataset. With our proposed lightweight model, we achieve an R^2 score of 0.1053 on the preliminary test data and 0.1501 on the complete test dataset. We did not carry out any attempts to optimize for the preliminary test data since it was not clear whether it is a general representation of the complete test dataset. Therefore, it is interesting for us to see that our solution is the only one - also including submissions below ours - achieving a better score on the complete dataset than on the preliminary one.

V Conclusion

We tackle the given challenge of network device workload prediction based on KPI data with a lightweight model that ensembles the predictions of a multi-layer linear neural network and an overall averaging predictor. The ensemble is a weighted summation, and a heuristic is used to selectively set the weights for both model predictions. The lightweight nature of the method allows training individual models for each KPI, respecting the diverse nature of different KPI types and hosts. From a practical perspective, frequent retraining needs to be feasible, which is likewise supported by the lightweight nature of the solution.

To evaluate our solution we conducted two types of experiments. First, we evaluated our solution on the FedCSIS 2020 challenge dataset. It consists of 45 different KPIs recorded from 3,716 hosts. The experiment results show that the lightweight approach predicts future KPI values with an overall R^2 score of 0.1053 on the preliminary test data and 0.1501 on the complete test data. Second, we provide a runtime analysis comparing LSTM and linear network cells and show that the usage of LSTM cells increases the training time by a factor of ten, which renders it infeasible for the given problem.

For future work, we regard further experimentation with different network types like convolutional neural networks or attention mechanisms as promising. Furthermore, a different numerical encoding of the currently used meta information and learning the summation weights when aggregating the overall average and neural network outputs are sources of potential optimization.


  • [1] F. Ahmad and T. Vijaykumar (2010) Joint optimization of idle and cooling power in data centers while maintaining response time. ACM Sigplan Notices 45 (3), pp. 243–256. Cited by: §I.
  • [2] S. Di, D. Kondo, and W. Cirne (2012) Host load prediction in a google compute cloud with a bayesian model. In SC’12: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, pp. 1–11. Cited by: §I.
  • [3] P. A. Dinda and D. R. O’Hallaron (2000) Host load prediction using linear models. Cluster Computing 3 (4), pp. 265–280. Cited by: §I.
  • [4] M. R. Hassan and B. Nath (2005) Stock market forecasting using hidden markov model: a new approach. In 5th International Conference on Intelligent Systems Design and Applications (ISDA’05), pp. 192–196. Cited by: §I.
  • [5] A. Howard, A. Zhmoginov, L. Chen, M. Sandler, and M. Zhu (2018) Inverted residuals and linear bottlenecks: mobile networks for classification, detection and segmentation. Cited by: §I.
  • [6] J. W. Jiang, T. Lan, S. Ha, M. Chen, and M. Chiang (2012) Joint vm placement and routing for data center traffic engineering. In 2012 Proceedings IEEE INFOCOM, pp. 2876–2880. Cited by: §I.
  • [7] B. Lim, S. O. Arik, N. Loeff, and T. Pfister (2019) Temporal fusion transformers for interpretable multi-horizon time series forecasting. arXiv preprint arXiv:1912.09363. Cited by: §I.
  • [8] M. Mao, J. Li, and M. Humphrey (2010) Cloud auto-scaling with deadline and budget constraints. In 2010 11th IEEE/ACM International Conference on Grid Computing, pp. 41–48. Cited by: §I.
  • [9] S. Nedelkoski, J. S. Cardoso, and O. Kao (2019) Anomaly detection and classification using distributed tracing and deep learning.. In CCGRID, pp. 241–250. Cited by: §IV-B.
  • [10] X. Qiu, Y. Ren, P. N. Suganthan, and G. A. Amaratunga (2017) Empirical mode decomposition based ensemble deep learning for load demand time series forecasting. Applied Soft Computing 54, pp. 246–255. Cited by: §I.
  • [11] F. Schmidt, F. Suri-Payer, A. Gulenko, M. Wallschläger, A. Acker, and O. Kao (2018) Unsupervised anomaly event detection for vnf service monitoring using multivariate online arima. In 2018 IEEE International Conference on Cloud Computing Technology and Science (CloudCom), pp. 278–283. Cited by: §I.
  • [12] B. Song, Y. Yu, Y. Zhou, Z. Wang, and S. Du (2018) Host load prediction with long short-term memory in cloud computing. The Journal of Supercomputing 74 (12), pp. 6554–6568. Cited by: §I.
  • [13] J. H. Stock and M. W. Watson (1998) A comparison of linear and nonlinear univariate models for forecasting macroeconomic time series. Technical report National Bureau of Economic Research. Cited by: §I.
  • [14] M. Yaseen, D. Swathi, and T. A. Kumar (2017) IoT based condition monitoring of generators and predictive maintenance. In 2017 2nd International Conference on Communication and Electronics Systems (ICCES), pp. 725–729. Cited by: §I.
  • [15] G. P. Zhang (2003) Time series forecasting using a hybrid arima and neural network model. Neurocomputing 50, pp. 159–175. Cited by: §I.