Forecasting Wireless Demand with Extreme Values using Feature Embedding in Gaussian Processes

05/15/2019
by   Chengyao Sun, et al.
University of Warwick

Wireless traffic prediction is a fundamental enabler of proactive network optimisation in 5G and beyond. Forecasting extreme demand spikes and troughs is essential to avoiding outages and improving energy efficiency. However, current forecasting methods predominantly focus on overall forecast performance and/or do not offer probabilistic uncertainty quantification. Here, we design a feature embedding (FE) kernel for a Gaussian Process (GP) model to forecast traffic demand. The FE kernel enables us to trade off overall forecast accuracy against peak-trough accuracy. Using real 4G base station data, we compare its performance against conventional GPs and ARIMA models, and demonstrate the uncertainty quantification output. The advantage over neural network (e.g. CNN, LSTM) models is that the probabilistic forecast uncertainty can feed directly into decision processes in self-organizing-network (SON) modules.


I Introduction

Wireless traffic prediction is a key enabler for proactive resource optimisation in 5G and beyond [1, 2, 3, 4, 5]. Proactive optimisation can create user-centric quality-of-service (QoS) and quality-of-experience (QoE) improvements across 5G network slices [6, 7]. Direct prediction from historical data [8, 9, 10] and inference from proxy social media data [11] are important inputs to the proactive optimisation modules [6] being considered for 5G and beyond applications, such as interference management, load balancing, and multi-RAT offloading, with implementation on the edge or in CRANs. We begin with a review of the time-series forecasting algorithms used in wireless traffic prediction and identify a lack of research on predicting both high and low extreme values, which is of critical importance to avoiding network congestion and inefficiencies.

I-A State-of-the-Art

Time-series prediction methods can be classified into several types, all of which depend heavily on training data.

I-A1 Statistical Models

Statistical time-series modelling using a variety of signal processing and machine learning approaches has been widely applied to predict wireless traffic. Moving average models with smoothing weights and seasonality work well for univariate forecasting. For example, in [12], seasonal auto-regressive integrated moving average (ARIMA) models were fitted to wireless cellular traffic with two periodicities for prediction. However, this model is insensitive to anomalous values, such as event-driven spike demand. Indeed, predicting and avoiding spike demand is critical to avoiding network outages and improving the consumer experience. Other methods rely on statistical generative functions assuming a quasi-static behaviour, such as the α-stable model [13] or the exponential model [14], but these do not offer the adaptivity of the machine learning techniques below.

I-A2 Machine Learning Models

In terms of machine learning approaches, artificial neural networks (ANNs) have been used to predict self-similar traffic with burstiness in [15]. Although ANNs and deep learning approaches (CNN, LSTM, wavelet, and Elman neural networks) [9, 16, 17, 18, 19] perform well in cumulative learning and prediction accuracy, they cannot give a quantitative uncertainty because of their black-box intermediate processing. Alternatively, Gaussian Processes (GPs) have been used [8] and showed strong adaptivity to wireless traffic data. Nevertheless, traditional kernels are unable to capture long-range, period-varying dependence, which limits how efficiently the existing training data is used.

I-A3 Gaussian Process Review

The Gaussian process (GP) is widely used because of its adaptability to manifold data [20]. As a non-parametric machine learning method, the prior GP model is first established with compound kernel functions chosen from the background of the data. The hyper-parameters are then optimized using the training data to extract the posterior distribution of the predicted outcome. The prediction results given by GPs quantify the statistical significance, which is an important advantage over black-box machine learning methods. As such, whilst GPs may not achieve the performance level of ANNs, they are able to quantify risk, and that risk can be interpreted back to the features of the data [21].

I-A4 Feature Extraction and Wireless Context

The features of traffic patterns may be correlated if the patterns are driven by the same kind of event, e.g. rush hours or concerts. In these cases, the key point is to infer the implied event information from the current flow trend by identifying where its features are close to those in historical data, and hence to predict how the traffic demand will change according to how it changed in the past. The works in [22, 23, 24] addressed the problem of feature selection, in order to determine the most discriminative and relevant features of the classified data.

In the context of wireless traffic forecasting, the current literature employs classic kernel functions [8], which cannot memorize non-periodic data patterns for extended periods. This means the GP model does not make full use of the training data. Furthermore, wireless traffic forecasting is often interested in predicting extreme events as opposed to the overall pattern of the traffic variation. Extreme demand values are useful in driving proactive network actions (e.g. extreme high demand requires spectrum aggregation and cognitive access [25, 26], whereas extreme low demand can lead to proactive sleep mode and coverage compensation [27, 28]). Therefore, what is needed is an adaptive kernel in GP models to trade off prediction accuracy between overall traffic variations and extreme values.

I-B Novelty and Contribution

In this paper, we propose to embed the relevant data features in a flexible kernel function, which enables the GP model to achieve this trade-off. We make three major contributions:

1) A novel feature embedding (FE) kernel GP model is proposed for forecasting wireless traffic. Specifically, this model requires fewer hyper-parameters, which reduces the computational burden compared with a model that uses classic hybrid kernels. Meanwhile, the learning rate is improved significantly for irregular training data;

2) The predicted results are quantified as a probability density function (PDF), which is more useful to plug into optimisation modules than the mean prediction value. Precisely, the predicted traffic is described as following a weighted superposition of Gaussian distributions (a Gaussian mixture) instead of the sum of those distributions, as in traditional GP;

3) We demonstrate our forecast model on real wireless traffic data and compare its cumulative error curve against the state-of-the-art algorithms used in the literature (seasonal ARIMA [12] and the traditional GP model [8]). Our model shows the best adaptivity and the best trade-off between overall accuracy and extreme value accuracy.

The remainder of this paper is organised as follows. In Section II, we build a system model step by step from pre-processing to prediction. In Section III, we apply the model to real wireless traffic data and evaluate its performance. Section IV concludes this paper and proposes ideas for future work.

II System Model

In this paper, we use a sliding window of historical traffic data to predict future traffic demand. We focus on wireless downlink (DL) traffic data demanded by end-users at 15-minute intervals over a two-week period (see Fig. 1).

Fig. 1: The traffic demand data is decomposed into daily periodic and aperiodic components.

II-A Data Decomposition

The raw data is considered to be composed of a daily periodic pattern and an aperiodic pattern, based on our observation and existing literature. Using a band-pass filter, the raw data can be decomposed into these two components, as shown in Fig. 1. To free the model from the domination of large-scale periodic patterns, we fix the daily periodic pattern derived from the historical data as the established baseline and only make predictions on the remaining aperiodic pattern. (Footnote 1: We acknowledge that there are baseline variations between each day of the week, but we focus on the aperiodic prediction, which is the main challenge.)

We assume that the aperiodic traffic consists of a noise flow and an event-driven flow which has an implicit intrinsic correlation. The latter is predictable if we can distinguish the relevant features of this kind of flow from the noise.
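The decomposition step can be illustrated with a minimal sketch. It assumes a per-slot daily average as the periodic baseline, which is a simpler stand-in for the band-pass filter described above; the function name `decompose_daily` and the toy series are hypothetical.

```python
def decompose_daily(series, slots_per_day=96):
    """Split a series into a fixed daily baseline and an aperiodic residual.

    Assumption: the baseline is the mean of each time-of-day slot across days,
    used here in place of the band-pass filter mentioned in the text.
    """
    if len(series) % slots_per_day != 0:
        raise ValueError("series must cover a whole number of days")
    n_days = len(series) // slots_per_day
    # Average every slot (e.g. 00:00, 00:15, ...) over all days.
    baseline = [
        sum(series[d * slots_per_day + s] for d in range(n_days)) / n_days
        for s in range(slots_per_day)
    ]
    # The aperiodic part is what remains after removing the repeated baseline.
    residual = [x - baseline[i % slots_per_day] for i, x in enumerate(series)]
    return baseline, residual


# Toy example: 2 "days" of 4 slots each, with a spike in slot 3 of day 2.
baseline, residual = decompose_daily(
    [1.0, 2.0, 3.0, 2.0, 1.0, 2.0, 7.0, 2.0], slots_per_day=4
)
```

With 15-minute samples, `slots_per_day=96` would apply; the residual is the series on which the forecasting model operates.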

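II-B Priori Gaussian Process Model

The DL traffic value $y(t)$ at time point $t$ is assumed to be a latent GP plus noise as

$$y(t) = f(t) + \epsilon(t), \qquad (1)$$

where $f(t)$ is the random variable (RV) which follows a distribution given by the GP, and $\epsilon(t)$ is the additive Gaussian noise with zero mean and variance $\sigma_n^2$. From the continuous time domain, a finite number of time points is taken as $\mathbf{t} = [t_1, \dots, t_n]$, and the RVs $\mathbf{f} = [f(t_1), \dots, f(t_n)]^T$ can be assumed to follow the multivariate Gaussian [21]

$$\mathbf{f} \sim \mathcal{N}\left(\mathbf{m}(\mathbf{t}), \mathbf{K}\right), \qquad (2)$$

where $\mathbf{m}(\mathbf{t})$ is the mean function and $\mathbf{K}$ is the covariance matrix given by

$$\mathbf{K} = \begin{bmatrix} k(t_1, t_1) & \cdots & k(t_1, t_n) \\ \vdots & \ddots & \vdots \\ k(t_n, t_1) & \cdots & k(t_n, t_n) \end{bmatrix}, \qquad (3)$$

where $k(t_i, t_j)$ is the covariance between $f(t_i)$ and $f(t_j)$ represented by the kernel function.

According to (1) and (2), the a priori GP probability model of the DL traffic $\mathbf{y} = [y(t_1), \dots, y(t_n)]^T$ can be expressed as

$$\mathbf{y} \sim \mathcal{N}\left(\mathbf{m}(\mathbf{t}), \mathbf{K} + \sigma_n^2 \mathbf{I}\right). \qquad (4)$$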

II-C Feature Embedding Kernel Function

In GP, the covariance between every two RVs is quantified by the kernel function which interprets the potential correlation of RVs in a high dimensional space. Here we use the Gaussian radial basis function (RBF) kernel with a feature embedding (FE) norm:

(5)

where is defined as the dimensions weighted feature matrix of the RV at time point :

(6)

where the feature of RV in the matrix is a function which can either be homogeneous or non-homogeneous of former values ():

(7)

Due to the symmetry, it can be easily proved that our new kernel function still meets the conditions of Mercer’s theorem.
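To make the idea concrete, the following is a minimal sketch of an RBF kernel evaluated on weighted feature vectors rather than on time stamps. The two features (previous value and previous increment), the parametrisation `amplitude**2 * exp(-d^2 / (2 * length_scale**2))`, and all names are assumptions for illustration, not the exact definitions of (5) to (7).

```python
import math


def lag_features(series, i, weights):
    """Weighted feature vector for time index i, built from former values.

    Hypothetical features: the previous value and the previous increment.
    """
    prev = series[i - 1]
    diff = series[i - 1] - series[i - 2]
    return [weights[0] * prev, weights[1] * diff]


def fe_rbf_kernel(fa, fb, amplitude=1.0, length_scale=1.0):
    """Gaussian RBF on the distance between two feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(fa, fb))
    return amplitude ** 2 * math.exp(-sq_dist / (2.0 * length_scale ** 2))


series = [0.0, 1.0, 3.0, 0.0, 1.0, 3.0]
w = [1.0, 1.0]
f2 = lag_features(series, 2, w)  # built from series[0] and series[1]
f5 = lag_features(series, 5, w)  # built from series[3] and series[4]
k_same = fe_rbf_kernel(f2, f5)   # identical recent history
k_diff = fe_rbf_kernel(f2, lag_features(series, 4, w))  # different history
```

Time points 2 and 5 are far apart in time but share the same recent history, so the kernel treats them as fully correlated, whereas a time-distance kernel would not.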

In BS (coordinated) control systems (e.g. radio resource management or beamforming), understanding sharp changes in traffic demand (especially when above the cell capacity or significantly below economic profitability thresholds) is more important than average demand trends. As such, the feature weighting process proposed in this paper focuses on building a trade-off between general prediction accuracy and accuracy on the aforementioned extreme demand values.

To achieve this, we set a threshold on the traffic variation at each sample time point based on historical data, as shown in Fig. 2. If the variation at a time point falls outside the confidence interval of its historical distribution, the associated feature is tagged as an outlier and assigned to the outlier category; otherwise it is assigned to the normal category. The Relief idea in [29] is utilized, whereby the feature weights are optimized by maximizing the sum of the margins from each point to the nearest point of a different category. This process is expressed as:

(8)

where

, which projects the high dimensional feature vectors’ norm onto one dimension, and

is the weight of the feature.

Fig. 2: Historical time points are collected into two categories according to their estimated Gaussian distribution.
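The tagging and weighting steps can be sketched as follows. This sketch uses the basic Relief update (nearest hit versus nearest miss) rather than the iterative variant of [29], and a single Gaussian fitted over all values rather than one per sample time point; both simplifications, and all function names, are assumptions.

```python
def tag_outliers(values, z=1.96):
    """Label each value 1 (outlier) or 0 (normal) against a Gaussian interval.

    Assumption: one mean/std over all values; the text fits one
    distribution per sample time point.
    """
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [1 if abs(v - mean) > z * std else 0 for v in values]


def relief_weights(features, labels):
    """Basic Relief: reward dimensions that separate a point from its nearest miss."""
    dims = len(features[0])
    w = [0.0] * dims

    def l1(a, b):
        return sum(abs(x - y) for x, y in zip(a, b))

    for i, (fi, ci) in enumerate(zip(features, labels)):
        hits = [f for j, (f, c) in enumerate(zip(features, labels))
                if c == ci and j != i]
        misses = [f for f, c in zip(features, labels) if c != ci]
        if not hits or not misses:
            continue
        hit = min(hits, key=lambda f: l1(fi, f))      # nearest same-category point
        miss = min(misses, key=lambda f: l1(fi, f))   # nearest other-category point
        for d in range(dims):
            w[d] += abs(fi[d] - miss[d]) - abs(fi[d] - hit[d])
    # Clip negative weights and normalise the rest to sum to one.
    w = [max(x, 0.0) for x in w]
    total = sum(w)
    return [x / total for x in w] if total > 0 else w


labels_demo = tag_outliers([1.0] * 9 + [10.0])
# Dimension 0 separates the two categories; dimension 1 does not.
features = [[0.0, 1.0], [0.1, 0.0], [1.0, 1.0], [0.9, 0.0]]
weights = relief_weights(features, [0, 0, 1, 1])
```

In the toy data, all weight ends up on the dimension that discriminates between the two categories, which is the behaviour the margin criterion is meant to produce.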

In the Gaussian RBF kernel , the feature space can be mapped to an infinite dimensional kernel space . The hyper-parameter controls the higher dimensional attenuation rate and has amplitude . Hyper-parameters are tuned by maximizing the corresponding log marginal likelihood function which is equivalent to minimizing the cost function [8]:

(9)

where and is the matrix of known values . The quasi-Newton and gradient descent methods can be used in this optimization problem.
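As an illustration of the tuning step, the sketch below evaluates the standard negative log marginal likelihood of a zero-mean GP and picks hyper-parameters by a coarse grid search. The grid search is a simple stand-in for the quasi-Newton or gradient descent methods mentioned above, and the kernel parametrisation, the fixed noise level and the toy data are assumptions.

```python
import math


def cholesky(a):
    """Lower-triangular L with L L^T = a, for a symmetric positive-definite matrix."""
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(a[i][i] - s)
            else:
                L[i][j] = (a[i][j] - s) / L[j][j]
    return L


def neg_log_marginal_likelihood(feats, y, amplitude, length_scale, noise_std):
    """0.5 y^T Ky^{-1} y + 0.5 log|Ky| + 0.5 n log(2 pi), zero prior mean assumed."""
    n = len(y)
    ky = [[amplitude ** 2
           * math.exp(-sum((a - b) ** 2 for a, b in zip(feats[i], feats[j]))
                      / (2.0 * length_scale ** 2))
           + (noise_std ** 2 if i == j else 0.0)
           for j in range(n)] for i in range(n)]
    L = cholesky(ky)
    # Forward substitution: solve L z = y, so that y^T Ky^{-1} y = z^T z.
    z = []
    for i in range(n):
        z.append((y[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i])
    log_det = 2.0 * sum(math.log(L[i][i]) for i in range(n))
    return (0.5 * sum(v * v for v in z) + 0.5 * log_det
            + 0.5 * n * math.log(2.0 * math.pi))


# Toy data: one-dimensional features and observed values (hypothetical).
feats = [[0.0], [1.0], [2.0], [3.0]]
y = [0.0, 0.8, 0.9, 0.1]
grid = [(a, l) for a in (0.5, 1.0, 2.0) for l in (0.5, 1.0, 2.0)]
best = min(grid, key=lambda p: neg_log_marginal_likelihood(feats, y, p[0], p[1], 0.1))
```

Here `best` holds the (amplitude, length scale) pair with the lowest cost on the grid; a gradient-based optimizer would refine the same objective.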

II-D Posteriori Prediction

After the hyper-parameters are determined, the covariance of every two RVs in the training set can be quantified by , where are the optimized parameters. Let us assume that at a future time point , the RV follows the same model as the training set. Therefore, yields the covariance of with historical RVs. The multivariate distribution for any and is

(10)

with mean and covariance matrix .

So given, the posterior distribution of can be derived as

(11)

with

(12)
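For reference, the standard GP regression posterior (the form the text calls naive GP, which combines all training points into one Gaussian) can be sketched as follows. A zero prior mean, the RBF parametrisation and the helper names are assumptions; this is not the per-component construction used by FE-GP.

```python
import math


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def rbf(fa, fb, amplitude, length_scale):
    d2 = sum((a - b) ** 2 for a, b in zip(fa, fb))
    return amplitude ** 2 * math.exp(-d2 / (2.0 * length_scale ** 2))


def gp_posterior(train_feats, y, test_feat,
                 amplitude=1.0, length_scale=1.0, noise_std=0.1):
    """Posterior mean and variance of the latent value at one test point."""
    n = len(y)
    ky = [[rbf(train_feats[i], train_feats[j], amplitude, length_scale)
           + (noise_std ** 2 if i == j else 0.0)
           for j in range(n)] for i in range(n)]
    k_star = [rbf(f, test_feat, amplitude, length_scale) for f in train_feats]
    alpha = solve(ky, y)        # Ky^{-1} y
    v = solve(ky, k_star)       # Ky^{-1} k_*
    mean = sum(k * a for k, a in zip(k_star, alpha))
    var = (rbf(test_feat, test_feat, amplitude, length_scale)
           - sum(k * w for k, w in zip(k_star, v)))
    return mean, var


# A test point with the same features as a noise-free training point
# reproduces the training value with zero variance.
mean_near, var_near = gp_posterior([[0.0]], [1.0], [0.0], noise_std=0.0)
# A test point with very different features falls back to the prior.
mean_far, var_far = gp_posterior([[0.0]], [1.0], [100.0], noise_std=0.0)
```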

For each previous time point in this model, a posterior distribution component of the future RV can be generated. In naive GP, the predicted distribution is also a Gaussian distribution, which sums the influence of each previous point on its mean and variance [21]. In our proposed FE-GP forecasting model, the predicted distribution is a Gaussian mixture model (GMM). The resultant GMM PDF of the future RV is the superposition of the individual distribution components from each previous time point, scaled by a normalization coefficient:

(13)
Fig. 3: The purple and orange shadows have the same area representing the same probability.

An example is shown in Fig. 3. The blue lines are distribution components, derived from the covariance of three previous points with the future point. Naive GP gives the average prediction result for this future point, i.e. a single Gaussian distribution, under the integrated impact of all components. In FE-GP, by contrast, the GMM prediction for this future point is assumed to have an equal probability of following any one of these three distribution components. The purple line gives the overall PDF of FE-GP.
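The difference between the two predictions can be sketched numerically. The three components below are hypothetical values chosen for illustration, with equal weights as in the example above; the mode is found by a simple grid search.

```python
import math


def normal_pdf(x, mean, std):
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))


def mixture_pdf(x, components, weights=None):
    """PDF of a Gaussian mixture; equal weights unless others are given."""
    if weights is None:
        weights = [1.0 / len(components)] * len(components)
    return sum(w * normal_pdf(x, m, s) for w, (m, s) in zip(weights, components))


def mixture_mode(components, lo, hi, steps=2000):
    """Grid search for the value with the highest mixture density."""
    xs = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    return max(xs, key=lambda x: mixture_pdf(x, components))


# Three hypothetical (mean, std) components: two agree on low demand,
# one indicates a spike.
components = [(1.0, 0.3), (1.2, 0.3), (5.0, 0.4)]
mean = sum(m for m, _ in components) / len(components)
mode = mixture_mode(components, 0.0, 8.0)
```

The averaged mean (about 2.4) lies in a region where the mixture has almost no probability mass, while the mixture keeps both the likely low-demand outcome (mode near 1.1) and the less likely spike near 5.0 visible.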

Fig. 4: The purple and orange shadows have the same area representing the same probability.
Fig. 5: Comparison of forecasts against 4G DL data. The cumulative error for 2 representative parts: (left) average demand; and (right) spike demand.

III Experimental Results and Discussion

III-A Data Source

The data we use for training comes from base stations (BSs) in a 4G metropolitan area. The anonymised data was provided by our industrial collaborator. It consists of aggregated downlink (DL) and uplink (UL) traffic demand volume per 15-minute interval over several weeks. We have selected a few example BSs at random to demonstrate our forecasting algorithm's performance.

III-B Feature Matrix in FE-GP

When applying FE-GP to wireless DL traffic forecasting, the first consideration is what the feature matrix consists of. In our experiment, the features are set to be:

(14)

where

is the standard deviation of elements of

.

III-C Performance Metrics

We use the absolute cumulative error (ACE) as the performance metric:

$$\mathrm{ACE}(n) = \sum_{i=1}^{n} \left| \hat{y}(t_i) - y(t_i) \right|, \qquad (15)$$

where $\hat{y}(t_i)$ is the predicted DL traffic and $y(t_i)$ is the real data. For a fixed value forecast (one-step-ahead forecasting of the DL traffic), we assign $\hat{y}(t_i)$ to be the value that has the maximum posterior probability.
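The metric can be computed as a running sum of absolute errors, giving one point of the cumulative error curve per forecast step. The function name and the toy values are illustrative only.

```python
def absolute_cumulative_error(predicted, actual):
    """Running sum of |prediction - truth|, one entry per forecast step."""
    total, curve = 0.0, []
    for p, a in zip(predicted, actual):
        total += abs(p - a)
        curve.append(total)
    return curve


curve = absolute_cumulative_error([1.0, 2.0, 3.0], [1.5, 2.0, 5.0])
```

A missed spike (the third point) shows up as a jump in the curve, which is why the curve separates models on extreme values even when their average behaviour is similar.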

III-D Results Analysis

Fig. 4 shows a comparison of forecasting algorithms over a week (672 points): (1) the proposed FE-GP, (2) classical Naive-GP, and (3) Seasonal ARIMA, against real 4G DL data. The cumulative error is shown for 2 different representative parts of the data: (left) average demand shows similar performance between FE-GP and Naive-GP; and (right) extreme spike demand shows superior performance by FE-GP against both Naive-GP and S-ARIMA. From the perspective of the GP models, in the average part (left), both FE-GP and Naive-GP consider most of the traffic demand as noise flow, i.e. in the initial model, and thus they perform similarly. In the extreme spike part (right), FE-GP can correctly recognize a potential event-driven flow that has happened before, using features from the last few points, whereas Naive-GP cannot; hence FE-GP gives a better prediction. In Fig. 6, we demonstrate that the proposed FE-GP performs the best overall due to its adaptivity to extreme values, even though it might be a little worse on average demand at a few specific time stamps.

Fig. 6: Cumulative error comparison between forecasting algorithms.

III-E Uncertainty Quantification

The posterior distributions of both models at a few representative points are given in Fig. 5. Unlike the single-peak Gaussian distribution predicted by Naive-GP, the GMM in FE-GP gives more general distributions for prediction. In the absence of a known periodicity, Naive-GP sums the effect of the last few time points, while FE-GP considers the effect of all time points according to the similarity of their features to those of the predicted point. Consequently, there may be several peaks scattered over the forecast, which will inform proactive optimisation modules.

In a data-driven proactive wireless resource optimization system [1, 2, 3], we ought to focus not only on the benefits brought by the system decision, but also on the potential risks that drive regret functions, i.e., the occurrence of extreme demands. In our proposed FE-GP prediction model, the risks can be quantified from the posterior distribution. For example, in Fig. 5:

(1) Low traffic triggers proactive sleep mode and coverage compensation: the FE-GP predictions at points 318 and 672 demonstrate a clear, non-negligible probability of low traffic, whilst the mean prediction is similar to that of the Naive-GP. That is to say, we may need to proactively put selected cells to sleep to achieve more energy efficient operations [30], whilst using other neighbouring cells across RATs to compensate [27, 28]. The risk of doing so is quantified by the posterior distribution (e.g., there is a small risk that there is actually high demand and the compensated coverage is not enough).

(2) Spike traffic triggers proactive spectrum aggregation and offloading: at prediction point 368, there is a non-negligible high probability density area at an extreme value, far away from the predicted mean value. This can be used to inform proactive spectrum aggregation and offloading of non-vital traffic to delay-tolerant RATs [25, 26]. The risk of doing so is quantified by the posterior distribution (e.g., there is a small risk that there is actually no demand for high capacity).
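As a sketch of how such risks could be read off a mixture posterior, the snippet below computes the probability that demand exceeds or falls below a threshold. The components and thresholds are hypothetical values, not taken from the experiments, and equal component weights are assumed.

```python
import math


def normal_cdf(x, mean, std):
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


def prob_above(threshold, components, weights=None):
    """P(demand > threshold) under a Gaussian mixture posterior."""
    if weights is None:
        weights = [1.0 / len(components)] * len(components)
    return sum(w * (1.0 - normal_cdf(threshold, m, s))
               for w, (m, s) in zip(weights, components))


# Hypothetical (mean, std) posterior components for one forecast point.
components = [(1.0, 0.3), (1.2, 0.3), (5.0, 0.4)]
risk_spike = prob_above(3.0, components)        # demand exceeds a capacity of 3.0
risk_low = 1.0 - prob_above(2.0, components)    # demand stays below 2.0
```

In this toy case roughly one third of the probability mass sits above the capacity threshold, a risk that a single Gaussian centred on the mean would largely hide.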

III-F Training Process

As the training set increases over time, the FE-GP model becomes more sensitive to spikes due to its adaptivity to features. Nevertheless, the cost of computing goes up with the size of the training set as well, so we have to set a size threshold on the training set. In Naive-GP, we can discard data in reverse chronological order without affecting the performance of the model. However, in FE-GP, we must make a trade-off between the sensitivity to spikes and the overall prediction accuracy, i.e., keeping more extreme value time points makes the model more sensitive to spikes but may reduce overall performance. This needs to be done case by case for each pre-existing proactive resource optimization system.
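One possible pruning policy is sketched below purely as an illustration: reserve a share of the size budget for the most recent extreme-value points and fill the rest with the most recent normal points. The policy, the `extreme_share` parameter and the function name are hypothetical; the appropriate split depends on the optimization system in use.

```python
def prune_training_set(is_extreme, max_size, extreme_share=0.5):
    """Return the indices to keep, at most max_size of them.

    Hypothetical policy: up to extreme_share of the budget goes to the newest
    extreme points; the remainder goes to the newest normal points.
    """
    extremes = [i for i, flag in enumerate(is_extreme) if flag]
    normals = [i for i, flag in enumerate(is_extreme) if not flag]
    extreme_budget = int(max_size * extreme_share)
    kept_extremes = extremes[max(0, len(extremes) - extreme_budget):]
    remaining = max_size - len(kept_extremes)
    kept_normals = normals[max(0, len(normals) - remaining):]
    return sorted(kept_extremes + kept_normals)


flags = [False, True, False, False, True, False, False, False, True, False]
kept = prune_training_set(flags, max_size=4)
```

Raising `extreme_share` keeps more spike history (higher spike sensitivity), while lowering it favours recent ordinary samples (better overall accuracy).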

IV Conclusion and Future Work

Forecasting extreme demand spikes and troughs is essential to avoiding outages and improving energy efficiency. Proactive capacity provisioning can be achieved through extra bandwidth in predicted high demand areas (i.e., via spectrum aggregation and cognitive access techniques), and energy efficiency improvements can be achieved through sleep mode operations.

Whilst significant research into traffic forecasting using ARIMA, GPs, and ANNs has been conducted, current methods predominantly focus on overall performance and/or do not offer probabilistic uncertainty quantification. Here, we designed a feature embedding (FE) kernel for a Gaussian Process (GP) model to forecast traffic demand. The FE kernel enabled us to trade off overall forecast accuracy against peak-trough accuracy. We compared its performance against conventional GPs and ARIMA models, and demonstrated the uncertainty quantification output. The advantage over neural network (e.g. CNN, LSTM) models is that the probabilistic forecast uncertainty can feed directly into decision processes in self-organizing-network (SON) modules in the form of both predicted average KPI benefit and regret functions, using methods such as probabilistic numerics.

Our future work will focus on expanding to the spatial-temporal dimension [18] via Gaussian random field integration, considering multi-variate forecasting across different service slices, and employing Bayesian training in Deep Gaussian Process (DGP) models [31, 32] to avoid catastrophic forgetting and to cope with the dynamic nature of the traffic process.

V Acknowledgement

The authors would like to thank Zhuangkun Wei for constructive discussions on feature embedding and Dr. Bowei Yang for the data support.

References

  • [1] Z. Du, Y. Sun, W. Guo, Y. Xu, Q. Wu, and J. Zhang, “Data-driven deployment and cooperative self-organization in ultra-dense small cell networks,” IEEE Access, vol. 6, pp. 22839–22848, 2018.
  • [2] N. Saxena, A. Roy, and H. Kim, “Traffic-aware cloud ran: A key for green 5G networks,” IEEE Journal on Selected Areas in Communications, vol. 34, no. 4, pp. 1010–1021, April 2016.
  • [3] R. Li, Z. Zhao, X. Zhou, J. Palicot, and H. Zhang, “The prediction analysis of cellular radio access network traffic: From entropy theory to networking practice,” IEEE Communications Magazine, vol. 52, no. 6, pp. 234–240, June 2014.
  • [4] S. O. Somuyiwa, A. Gyorgy, and D. Gündüz, “A reinforcement-learning approach to proactive caching in wireless networks,” IEEE Journal on Selected Areas in Communications, vol. 36, no. 6, pp. 1331–1344, June 2018.
  • [5] F. Shen, K. Hamidouche, E. Bastug, and M. Debbah, “A stackelberg game for incentive proactive caching mechanisms in wireless networks,” in 2016 IEEE Global Communications Conference (GLOBECOM), Dec 2016, pp. 1–6.
  • [6] V. Sciancalepore, K. Samdanis, X. Costa-Perez, D. Bega, M. Gramaglia, and A. Banchs, “Mobile traffic forecasting for maximizing 5G network slicing resource utilization,” in IEEE Conference on Computer Communications (INFOCOM), May 2017, pp. 1–9.
  • [7] L. Le, D. Sinh, L. Tung, and B. P. Lin, “A practical model for traffic forecasting based on big data, machine-learning, and network KPIs,” in 2018 15th IEEE Annual Consumer Communications Networking Conference (CCNC), Jan 2018, pp. 1–4.
  • [8] Y. Xu, W. Xu, F. Yin, J. Lin, and S. Cui, “High-accuracy wireless traffic prediction: A GP-based machine learning approach,” in IEEE Global Communications Conference (GLOBECOM), Dec 2017, pp. 1–6.
  • [9] K. Zhang, G. Chuai, W. Gao, X. Liu, S. Maimaiti, and Z. Si, “A new method for traffic forecasting in urban wireless communication network,” EURASIP Journal on Wireless Communications and Networking, vol. 2019, no. 1, p. 66, Mar 2019.
  • [10] X. Cao, Y. Zhong, Y. Zhou, J. Wang, C. Zhu, and W. Zhang, “Interactive temporal recurrent convolution network for traffic prediction in data centers,” IEEE Access, vol. 6, pp. 5276–5289, 2018.
  • [11] B. Yang, W. Guo, B. Chen, G. Yang, and J. Zhang, “Estimating mobile traffic demand using Twitter,” IEEE Wireless Communications Letters, vol. 5, no. 4, pp. 380–383, Aug 2016.
  • [12] F. Xu, Y. Lin, J. Huang, D. Wu, H. Shi, J. Song, and Y. Li, “Big data driven mobile traffic understanding and forecasting: A time series approach,” IEEE Transactions on Services Computing, vol. 9, no. 5, pp. 796–805, Sep. 2016.
  • [13] R. Li, Z. Zhao, J. Zheng, C. Mei, Y. Cai, and H. Zhang, “The learning and prediction of application-level traffic data in cellular networks,” IEEE Transactions on Wireless Communications, vol. 16, no. 6, pp. 3899–3912, June 2017.
  • [14] and and and, “User data traffic analysis for 3g cellular networks,” in 2013 8th International Conference on Communications and Networking in China (CHINACOM), Aug 2013, pp. 468–472.
  • [15] L. Xiang, X. Ge, C. Liu, L. Shu, and C. Wang, “A new hybrid network traffic prediction method,” in IEEE Global Telecommunications Conference (GLOBECOM), Dec 2010, pp. 1–5.
  • [16] J. Feng, X. Chen, R. Gao, M. Zeng, and Y. Li, “Deeptp: An end-to-end neural network for mobile cellular traffic prediction,” IEEE Network, vol. 32, no. 6, pp. 108–115, November 2018.
  • [17] X. Wang, Z. Zhou, F. Xiao, K. Xing, Z. Yang, Y. Liu, and C. Peng, “Spatio-temporal analysis and prediction of cellular traffic in metropolis,” IEEE Transactions on Mobile Computing, pp. 1–1, 2018.
  • [18] D. Miao, W. Sun, X. Qin, and W. Wang, “Msfs: Multiple spatio-temporal scales traffic forecasting in mobile cellular network,” in IEEE Intl Conf on Big Data Intelligence and Computing and Cyber Science and Technology Congress, Aug 2016, pp. 787–794.
  • [19] F. Ni, Y. Zang, and Z. Feng, “A study on cellular wireless traffic modeling and prediction using elman neural networks,” in 2015 4th International Conference on Computer Science and Network Technology (ICCSNT), vol. 01, Dec 2015, pp. 490–494.
  • [20] Y. Shu, M. Yu, O. Yang, J. Liu, and H. Feng, “Wireless traffic modeling and prediction using seasonal ARIMA models,” IEICE transactions on communications, vol. 88, no. 10, pp. 3992–3999, 2005.
  • [21] A. G. Wilson, “Covariance kernels for fast automatic pattern discovery and extrapolation with gaussian processes,” Ph.D. dissertation, University of Cambridge, 2014.
  • [22] B. Cao, D. Shen, J.-T. Sun, Q. Yang, and Z. Chen, “Feature selection in a kernel space,” in Proceedings of the 24th international conference on Machine learning.   ACM, 2007, pp. 121–128.
  • [23] M. Ramona, G. Richard, and B. David, “Multiclass feature selection with kernel gram-matrix-based criteria,” IEEE transactions on neural networks and learning systems, vol. 23, no. 10, pp. 1611–1623, 2012.
  • [24] K.-P. Wu and S.-D. Wang, “Choosing the kernel parameters for support vector machines by the inter-cluster distance in the feature space,” Pattern Recognition, vol. 42, no. 5, pp. 710–717, 2009.
  • [25] W. Zhang, C. Wang, X. Ge, and Y. Chen, “Enhanced 5G cognitive radio networks based on spectrum sharing and spectrum aggregation,” IEEE Transactions on Communications, vol. 66, no. 12, pp. 6304–6316, Dec 2018.
  • [26] G. Yuan, R. C. Grammenos, Y. Yang, and W. Wang, “Performance analysis of selective opportunistic spectrum access with traffic prediction,” IEEE Transactions on Vehicular Technology, vol. 59, no. 4, pp. 1949–1959, May 2010.
  • [27] W. Guo and T. O’Farrell, “Dynamic cell expansion with self-organizing cooperation,” IEEE Journal on Selected Areas in Communications, vol. 31, no. 5, pp. 851–860, May 2013.
  • [28] S. Wang and W. Guo, “Energy and cost implications of a traffic aware and quality-of-service constrained sleep mode mechanism,” IET Communications, vol. 7, no. 18, pp. 2092–2101, December 2013.
  • [29] Y. Sun, “Iterative RELIEF for feature weighting: algorithms, theories, and applications,” IEEE transactions on pattern analysis and machine intelligence, vol. 29, no. 6, pp. 1035–1051, 2007.
  • [30] X. Xu, C. Yuan, W. Chen, X. Tao, and Y. Sun, “Adaptive cell zooming and sleeping for green heterogeneous ultradense networks,” IEEE Transactions on Vehicular Technology, vol. 67, no. 2, pp. 1612–1621, Feb 2018.
  • [31] T. Bui, D. Hernández-Lobato, J. Hernández-Lobato, Y. Li, and R. Turner, “Deep gaussian processes for regression using approximate expectation propagation,” Proceedings of the 33rd International Conference on Machine Learning, 2016.
  • [32] S. Lee, J. Kim, J. Jun, J. Ha, and B. Zhang, “Overcoming catastrophic forgetting by incremental moment matching,” Advances in Neural Information Processing Systems (NIPS), 2017.