Training successful deep neural networks usually depends on a massive amount of data. With one-time design and training, deep neural network models can be easily deployed to solve specific problems. In recent years, much research has shown that such a traditional data-driven training method can quickly optimize the parameters of deep neural networks with the help of massive datasets and supercomputing resources. Models can reach or even exceed human-level cognitive skills in many application scenarios. However, even such a classic training method has three main disadvantages:
A training dataset with sufficient meaningful samples is the prerequisite for training a successful model. Data collection and preprocessing are extremely time- and money-consuming, and this exhausting process can prolong the preparation phase of a practical project. Sometimes, we have to postpone the start of model training due to a lack of data.
In this training setting, it is always assumed that the underlying data generation process is static. Based on this assumption, we can evaluate a model's generalization by comparing the errors among training, validation, and test datasets in the training phase. However, this static assumption is not always applicable to a constantly evolving and developing world. In this article, the context can be defined as a non-stationary scenario, where the generative probabilistic distribution of the input data or the target data changes over time. The changes can be grouped into three families: short periodical, long periodical, and non-periodical. Periodical changes could be due to insufficient samples in the training dataset, which prevent the model from capturing the entire sample space. Non-periodical changes could be caused by changes in the objective environment, broken physical devices, or unavoidable measurement errors. Deep neural network models should continually learn data with periodical changes to improve their cognition, just as humans do. Because non-periodical changes are hardly predictable or repetitive, we need to detect them and make the correct decisions for processing them.
The structure of a deep neural network model is generally fixed after deployment. This setting is unrealistic and inflexible in real-world applications because new targets can appear as the application environment keeps evolving. In this case, the model should extend its structure by increasing the number of outputs. A new target can be a new label in classification tasks or a new predicted object in regression tasks. Taking power forecasts in the context of smart power grids as an example, we can train models to provide forecasts of energy supply and demand for managing a regional power grid. As the power grid is extended, new power generators and consumers will inevitably be added to the list of forecasting targets. Training an individual model for each new target might also be a solution. However, in this case, we face the first problem again, i.e., we cannot start training until sufficient samples are collected.
One of the potential solutions for addressing these issues is Continual Learning (CL), also known as Continuous Learning, Incremental Learning, Sequential Learning, or Lifelong Learning, which is carried out to solve multiple related tasks and leads to a long-term vision of machine learning models. While the term is not well consolidated, the idea is to enable models to continually and adaptively learn about the world and overcome catastrophic forgetting. The knowledge of such models can be developed incrementally and become more sophisticated. Catastrophic forgetting refers to the phenomenon that models can forget the knowledge for solving old tasks while learning new ones. This forgetting problem reflects a more general problem of traditional neural networks, the so-called stability-plasticity dilemma: models must find a trade-off between the accumulation and integration of new knowledge and the retention of old knowledge. Numerous valuable research works have focused on CL algorithms, application scenarios, evaluation metrics for classification tasks, etc. However, the necessity of CL for regression tasks seems to have been ignored.
This article can be viewed as an abstract of my Ph.D. thesis. The contributions of the planned thesis mainly include:
To present the necessity and importance of CL for regression tasks;
To give an overview of the relevant research literature, including but not limited to CL algorithms, detection of novelty and non-stationarity in data streams, and explainable artificial intelligence (AI);
To explore the applicability of well-known CL algorithms for regression problems;
To analyze the shortcomings of common experimental setups as well as restrictions of general evaluation metrics;
To summarize the research challenges faced by our proposed CL framework and to develop the framework further;
To develop visualization utilities and propose comprehensive evaluation metrics to make CL explainable;
To evaluate the framework in power forecasting experiments with real-world datasets.
The remainder of this article starts with an overview of the requirements and relevant research questions of CL for regression problems. In Section 2, I propose a visualizable CL framework for regression problems and introduce its application in the two proposed CL scenarios with instances. Then I present three experimental datasets, which can be used to design power forecasting experiments to assess the proposed solutions. The article ends with a brief conclusion.
2 Continual Learning for Regression
2.1 Continual Learning Scenarios
CL has been widely applied to classification problems for learning new tasks sequentially while retaining the obtained knowledge. In , three CL scenarios for classification problems are proposed with a focus on object recognition:
New Instances: new samples of the previously known classes arrive in subsequent batches with new poses or conditions. In other words, these new samples carry novel information but still belong to the same labels. Models need to keep extending and accumulating knowledge regarding the learned labels.
New Classes: new samples belong to unknown classes. In this case, the model should be able to identify the objects of new classes while retaining its accuracy on old ones.
New Instances and Classes: new samples belong to both known and new classes.
Therefore, a new task in the context of classification problems can be defined as learning new instances belonging to known labels or learning to recognize new labels. However, one obvious difference between classification and regression problems lies in the models' targets, which are discrete labels in classification and continuous values in regression.
Two CL scenarios for regression are proposed in my previous work:
Data-domain incremental (DDI) scenario
refers to the situation where the underlying data generation process changes over time due to the non-stationarity of the data stream. A change in the probability distribution of either the input data or the target can trigger an update of a model trained on data from the out-of-date generation process. In the updating phase, the model learns to extract latent representations of the input data from the changed generation process. Besides, the model needs to adjust its weights to find a new, proper mapping from the new latent representations to the targets. The non-stationarity could result from insufficient samples in the pre-training process or from external objective factors.
Target-domain incremental (TDI) scenario
refers to the situation where the structure of the network model is extended as the number of prediction targets increases. Assume a multi-output deep neural network forecasts several independent targets based on the same input data; the network owns a shared hidden sub-network for learning non-linear latent representations and multiple output sub-networks for prediction. The model adds a new sub-network when a new target appears. The TDI scenario is a joint research topic among multi-task learning, transfer learning, and continual learning. On the one hand, the obtained knowledge of the shared network can be transferred to train the additional sub-network quickly, even without sufficient samples. On the other hand, CL algorithms can avoid decreasing the prediction accuracy of previously handled tasks while learning the new task by utilizing the free weights of the shared network, i.e., those unimportant for the other targets.
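The TDI extension described above can be sketched as follows. This is a minimal NumPy illustration, not the actual implementation: the class, layer sizes, and target names are hypothetical, and training is omitted.

```python
import numpy as np

class MultiHeadRegressor:
    """Minimal multi-output model: a shared layer plus one linear head per target."""

    def __init__(self, n_inputs, n_hidden, rng=None):
        self.rng = np.random.default_rng(0) if rng is None else rng
        # Shared hidden sub-network (a single tanh layer here).
        self.W_shared = self.rng.normal(0, 0.1, (n_inputs, n_hidden))
        self.heads = {}  # target name -> weights of the output sub-network

    def add_head(self, target):
        """TDI scenario: extend the model with a new output sub-network."""
        n_hidden = self.W_shared.shape[1]
        self.heads[target] = self.rng.normal(0, 0.1, n_hidden)

    def predict(self, x, target):
        h = np.tanh(x @ self.W_shared)   # shared latent representation
        return h @ self.heads[target]    # target-specific prediction


model = MultiHeadRegressor(n_inputs=7, n_hidden=16)
model.add_head("wind_farm_1")
model.add_head("pv_plant_new")   # a new target appears: only a head is added
y = model.predict(np.ones(7), "pv_plant_new")
```

When a new head is added, only its weights need training from scratch; the shared representation is reused, which is exactly the transfer effect the paragraph describes.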
For example, renewable power generation can be predicted based on regional weather conditions. Figure 1 illustrates the two proposed CL scenarios for power forecasts of regional renewable energy generators. As described above, the model has a shared network to learn the latent space and several prediction sub-networks for the targets.
The weather conditions are time-variant features that fluctuate periodically over time. In addition, a gradual change can exist in the weather data due to climate change, the dynamic behavior of noise from the weather prediction model, or other foreseeable factors. Such a smooth change is usually referred to as concept drift. The case marked as 1.a in Fig. 1 presents this challenge, which negatively impacts prediction accuracy, especially when sufficient samples are unavailable for pre-training.
The cases marked as 1.b correspond to the power generation capability, which is time-dependent and can be affected by, for example, upgrading or aging of devices in the long term or changes in the environment. Besides, residential power demand forecasting is another example that needs to be considered in this scenario. Generally, we predict the overall power demand of a residential area in a low-voltage power grid rather than the power demand of every single consumer in this region. The mapping is sensitive to changes in these consumers' power demand or consumption habits. Sometimes we have to update the prediction model due to these unpredictable factors. In the data-domain CL scenario, models should continually collect data and accumulate knowledge by learning the newly collected data. Regarding the TDI scenario in Fig. 1, a prediction sub-network for an additional photovoltaic generator is added to the prediction model.
In the proposed setting, a new task can thus be defined as (1) learning non-stationarity of the data stream, including the input data generation and the output data generation with given inputs, or (2) integrating new sub-networks to the existing model for predicting new targets without any negative effects on the prediction accuracy of other known targets. The red spots and lines in Fig. 1 correspond to the new tasks in both scenarios, respectively.
2.2 Research Questions
In the common CL experimental setting for classification tasks, the dataset contains disjoint subsets, each of which is assumed to be sampled separately from an independent and identical distribution. One subset represents a task, and the dataset as a whole is not independently and identically distributed, which differs from traditional supervised learning. Neural network models need to learn these unseen, independent tasks sequentially, with the identification information of the tasks given. Some CL algorithms allow models to revisit the previously learned tasks without restriction while learning a new one. This is called the replay CL strategy and will be introduced in the remainder of this article. The replayed data can be a subset of samples from the previous tasks stored in raw format, or data generated by a generative model. Although this setting is feasible for evaluating and comparing diverse CL algorithms, it cannot represent real-world cases.
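The replay strategy can be illustrated with a fixed-size buffer of raw samples. This is a generic sketch, not the mechanism of any cited approach; the class name and the reservoir-sampling policy are my own illustrative choices.

```python
import random

class ReplayBuffer:
    """Fixed-size buffer of raw samples from previous tasks (rehearsal).

    Reservoir sampling keeps an approximately uniform subset of the stream,
    so old tasks stay represented without storing the full history.
    """

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.samples = []
        self.seen = 0
        self.rng = random.Random(seed)

    def add(self, sample):
        self.seen += 1
        if len(self.samples) < self.capacity:
            self.samples.append(sample)
        else:
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.samples[j] = sample

    def replay_batch(self, k):
        """Old samples to mix with new data when updating on a new task."""
        return self.rng.sample(self.samples, min(k, len(self.samples)))


buf = ReplayBuffer(capacity=100)
for t in range(1000):        # a stream of 1000 samples
    buf.add(t)
batch = buf.replay_batch(16)
```

A bounded buffer like this also addresses the storage and privacy concerns raised below, since only a small, fixed number of raw samples is ever retained.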
First, the appearance of a new task is usually unpredictable in real-world applications, which means that prior knowledge of the new task is unavailable. A detection mechanism must be in place to identify the appearance and the type of new tasks. Second, data streams in real applications are infinite and contain both known and unknown tasks, which might not appear separately and in order. The model could identify a new task and update itself immediately, or store samples until enough are available for an update, depending on the adopted updating strategy. Third, unrestricted retraining on old tasks enables the model to remember the obtained knowledge but might make it prone to overfitting. Besides, storing all old tasks might burden the storage overhead and violate privacy laws.
Farquhar et al. introduce five core desiderata for evaluating CL algorithms and designing classification experiments. In previous work, we give five suggestions for designing CL regression experiments:
New tasks resemble the previous tasks;
The neural network model utilizes a single output for predicting the corresponding target and learns the changes in the DDI scenario;
New tasks appear unpredictably in the DDI scenario, and their appearance should be detected rather than announced in advance;
Privacy laws are taken into account, and previous datasets are revisited only with restrictions;
More than two tasks are involved, in either the DDI or the TDI scenario.
These suggestions will guide the development of updating methods and the design of experiments. Furthermore, the following research questions should be answered in the proposed thesis.
2.2.1 Question 1: When to trigger an update?
The trigger condition is the prerequisite for CL and determines the starting point of an update. According to the definition of a new task in Section 2.1, this question is more relevant to the DDI scenario, where new tasks appear unpredictably; new tasks in the TDI scenario depend on the objective requirements of projects and are added manually. To answer this question, my research will focus on novelty detection (concept shift and drift) using deterministic and probabilistic methods. For example, in , the trigger condition depends on the number of newly collected novel samples. Alternatively, updating could be triggered based on an estimate of the new samples' entropy. The design of update trigger conditions is the first significant step, affecting the updating results and the model's future performance.
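A count-based trigger of the kind mentioned above can be sketched as follows. The threshold, buffer capacity, helper names, and error values are hypothetical placeholders used purely for illustration.

```python
def classify_sample(error, threshold):
    """A sample counts as 'novel' when the model's error exceeds the threshold."""
    return "novelty" if error > threshold else "familiarity"

def should_update(novelty_buffer, capacity):
    """Trigger condition: update once the novelty buffer is full."""
    return len(novelty_buffer) >= capacity

novelty, familiarity = [], []
threshold, capacity = 0.05, 3
stream = [(0.01, "a"), (0.20, "b"), (0.02, "c"), (0.30, "d"), (0.10, "e")]

for error, sample in stream:
    if classify_sample(error, threshold) == "novelty":
        novelty.append(sample)
    else:
        familiarity.append(sample)
    if should_update(novelty, capacity):
        # ... retrain on buffered data, adjust the threshold ...
        novelty.clear()   # buffers are emptied after the update
```

With these numbers, three novel samples ("b", "d", "e") fill the buffer, an update fires, and the novelty buffer is emptied; the familiar samples "a" and "c" remain available for replay.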
2.2.2 Question 2: How to update models?
Updating methods are the core of learning tasks sequentially and continually. CL algorithms can be briefly categorized into three groups depending on how data is stored and reused:
Regularization-based approaches: The goal of these algorithms is to reduce storage demand and prioritize privacy. Instead of revisiting old data, a penalty regularization term is added to the loss function to consolidate the weights that are important for previous tasks while learning a new task. Delange et al. further divide these approaches into data-focused approaches [19, 31, 25] and prior-focused approaches [16, 30, 1].
Replay approaches: These approaches prevent forgetting by replaying previous samples that are either stored in raw format or generated by a generative model, known as rehearsal approaches [26, 2] and pseudo-rehearsal approaches [28, 18], respectively. A subset of these previous samples can be combined with new samples as the model's input while continually learning the new task, or used to constrain the optimization of the loss function.
Parameter isolation approaches: The concept of these approaches is to dedicate a subset of the model's parameters to a specific new task. For example, one can adjust the model's structure by growing a new branch to learn a new task if no constraint is imposed on the model's size. Alternatively, the part of the network shared by all tasks can stay static, and the parts belonging to previous tasks are masked out while learning new tasks [21, 6].
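A minimal sketch of the first family may help: a quadratic penalty anchors the weights that were important for previous tasks while the model fits a new one, in the spirit of prior-focused approaches such as EWC. The linear model, the importance vector, and the penalty strength are illustrative assumptions, not a faithful reimplementation of any cited algorithm.

```python
import numpy as np

def penalized_loss(w, x, y, w_old, importance, lam=1.0):
    """MSE on the new task plus a quadratic penalty that anchors weights
    important for previous tasks (prior-focused regularization)."""
    mse = np.mean((x @ w - y) ** 2)
    penalty = np.sum(importance * (w - w_old) ** 2)
    return mse + lam * penalty

rng = np.random.default_rng(0)
w_old = np.array([1.0, -2.0])              # weights after the old task
importance = np.array([10.0, 0.1])         # e.g. a diagonal Fisher estimate
x = rng.normal(size=(32, 2))
y = x @ np.array([1.5, -2.0])              # data from the new task

# Moving a highly important weight far from w_old is penalized heavily.
loss_near = penalized_loss(np.array([1.0, -1.5]), x, y, w_old, importance)
loss_far = penalized_loss(np.array([3.0, -2.0]), x, y, w_old, importance)
```

Because the first weight carries a large importance value, solutions that drift far from its old value incur a much larger total loss, which is exactly the consolidation effect described above.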
Note that not all well-known CL algorithms can be directly applied to regression tasks. For example, Li et al. use a Knowledge Distillation loss in their Learning Without Forgetting (LWF) algorithm to consolidate the obtained knowledge of previous tasks. This loss function is a variant of the cross-entropy loss, which is inapplicable to regression tasks. Thus, I plan to review these well-known algorithms and analyze their advantages and applicability. Moreover, further work will propose novel CL algorithms based on existing ones and the proposed experimental setup. Another interesting topic is ensemble CL, which investigates the collaboration of various CL algorithms to improve models' performance.
2.2.3 Question 3: How to evaluate the updated models?
Common evaluation metrics for regression tasks, such as Mean Square Error (MSE), Root Mean Square Error (RMSE), and Mean Absolute Error (MAE), can assess the fitting and prediction ability of a model. However, more specific metrics are required to evaluate updated models comprehensively in the CL setting. In [12, 13], the models are evaluated in terms of fitting error, prediction error, and forgetting ratio. We also consider training time a significant evaluation factor, especially in real-time applications. Besides, algorithm ranking metrics are proposed according to different desiderata in , including accuracy, forward/backward transfer, model size efficiency, sample storage size efficiency, and computational efficiency. Díaz-Rodríguez et al. fuse these desiderata into a single CL score for ranking purposes. A series of wide-ranging evaluation metrics can make CL explainable, which is the basis for visualizing the updating process and dynamically adjusting hyperparameters.
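The standard regression metrics and a simple forgetting measure can be computed as below. Note that this `forgetting_ratio` is only one plausible formalization for illustration, not necessarily the definition used in the cited works.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """The common fitting/prediction metrics named in the text."""
    err = np.asarray(y_pred) - np.asarray(y_true)
    mse = np.mean(err ** 2)
    return {"MSE": mse, "RMSE": np.sqrt(mse), "MAE": np.mean(np.abs(err))}

def forgetting_ratio(error_before, error_after):
    """Relative increase of the error on an old task after an update;
    positive values indicate forgetting (a hypothetical definition)."""
    return (error_after - error_before) / error_before

m = regression_metrics(y_true=[1.0, 2.0, 3.0], y_pred=[1.1, 1.9, 3.2])
fr = forgetting_ratio(error_before=0.10, error_after=0.12)
```

Here the old-task error grows from 0.10 to 0.12 after an update, i.e., a forgetting ratio of 0.2, which a CL evaluation would weigh against the gain on the new task.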
2.2.4 Question 4: How to explain the updating process?
The training process of either typical supervised learning or CL is generally a black box, whose results are opaque and incomprehensible to humans. Explainable artificial intelligence (XAI) refers to techniques that help humans understand and trust the results of machine learning models. It has been applied in many sectors, such as medical diagnosis and text analysis.
Due to stochasticity in re-learning, some updates could fail and lead to a worse predictive ability. XAI can visualize the updating process and interpret the reasons for such failures. Experts can monitor the updating process and analyze the updated model based on the given evaluation criteria. Furthermore, they can more easily decide to accept successful updates or reject failed ones, and take follow-up actions, for example, rolling back a failed model to the previous version, assembling multiple updated models for ensemble learning, or dynamically adjusting hyperparameters for a further update.
3 Visualizable Continual Learning Framework for Regression Tasks
The V-CLeaR framework, short for Visualizable Continual Learning for Regression tasks, is shown in Fig. 2. It consists of three main parts: (1) a preprocessing block, (2) the CLeaR framework, and (3) an explainability utility. The preprocessing block is responsible for processing the incoming data, including cleaning exceptional values, filling missing values, and scaling. Besides, due to concept drift in the data stream, the parameters of the used scaler might have to be updated, for example, the maximum and minimum of a min-max scaler or the mean and variance of a standard scaler.
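The scaler update mentioned above can be sketched as a streaming min-max scaler whose bounds are widened by each new batch; the class and method names are hypothetical, and a production version might instead track running means and variances for a standard scaler.

```python
import numpy as np

class StreamingMinMaxScaler:
    """Min-max scaler whose bounds are widened as new data arrives,
    so concept drift in the stream does not push values outside [0, 1]."""

    def __init__(self):
        self.min = None
        self.max = None

    def partial_fit(self, batch):
        batch = np.asarray(batch, dtype=float)
        lo, hi = batch.min(axis=0), batch.max(axis=0)
        self.min = lo if self.min is None else np.minimum(self.min, lo)
        self.max = hi if self.max is None else np.maximum(self.max, hi)

    def transform(self, batch):
        span = np.where(self.max > self.min, self.max - self.min, 1.0)
        return (np.asarray(batch, dtype=float) - self.min) / span


scaler = StreamingMinMaxScaler()
scaler.partial_fit([[0.0, 10.0], [5.0, 20.0]])
scaler.partial_fit([[8.0, 15.0]])            # a new maximum widens the range
scaled = scaler.transform([[4.0, 15.0]])
```

Widening bounds keeps previously scaled values interpretable, at the cost of slightly compressing the old value range, which is one of the trade-offs the preprocessing block must manage.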
The CLeaR framework is a continual learning framework based on buffered data that is grouped into novelty and familiarity by a deterministic or probabilistic novelty detector and stored in the corresponding buffer. Novelty from the infinite data stream indicates what the trained model cannot predict accurately and should continually learn. Familiarity is defined as data that the model is already familiar with; it can be obtained from the infinite data stream, from stored historical samples in raw format, or from a generative model. The storage and usage of the buffered data depend on the adopted CL strategies.
In this instance, the model consists of an autoencoder as the shared network for extracting the latent representations of the input, and fully connected networks as the predictors. Meanwhile, the two sub-networks are used for detecting changes in the input distribution and the target distribution, respectively. Novelty or familiarity is determined by comparing the MSE between the prediction (or reconstruction) and the ground truth to a preset, dynamically changeable threshold. The novelty buffer has a limited size, while the familiarity buffer is infinite. An update of a sub-network is triggered when the corresponding novelty buffer is filled. After updating the sub-network, the corresponding threshold is adjusted depending on the updating results, and its buffers are emptied. The core of this framework is the flexibility and customizability of its modules, including the novelty detector, the storage of the data, the available CL strategies, and the type of neural network models. Users can select the optimal components of the framework for their own applications.
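A stripped-down sketch of the dual detection logic may clarify the instance: reconstruction error monitors the input distribution while prediction error monitors the target distribution, each compared against its own threshold. The stand-in "networks" (an attenuating reconstruction and a sum-based predictor) and the threshold values are illustrative only.

```python
import numpy as np

def detect(x, y_true, reconstruct, predict, thr_in, thr_out):
    """Route a sample into novelty/familiarity for the two sub-networks:
    reconstruction error flags changes of the input distribution,
    prediction error flags changes of the target distribution."""
    rec_err = np.mean((reconstruct(x) - x) ** 2)
    pred_err = np.mean((predict(x) - y_true) ** 2)
    return {"input": "novelty" if rec_err > thr_in else "familiarity",
            "target": "novelty" if pred_err > thr_out else "familiarity"}

# Stand-in sub-networks for illustration; the real instance uses a trained
# autoencoder and fully connected predictors.
reconstruct = lambda x: 0.9 * x
predict = lambda x: np.sum(x)

x = np.array([1.0, 2.0])
result = detect(x, y_true=3.0, reconstruct=reconstruct, predict=predict,
                thr_in=0.01, thr_out=0.1)
```

In this toy case the reconstruction error exceeds its threshold (input novelty) while the prediction is exact (target familiarity), so only the autoencoder's novelty buffer would receive the sample.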
The explainability utility is designed as a visualization tool focusing on visualizing the updating process and explaining the updated model using the proposed evaluation metrics. The updating process is also supervised by experts, who can input instructions to assist the model in making decisions for the next move. Here, a decision is defined as anything that determines the following actions of the CLeaR instance. For example, the CL-algorithm-related hyperparameters can be adjusted for re-updating when the current updating results are not ideal, or the framework-related hyperparameters can be changed to trade off forgetting against prediction in future updates. Besides, considering factors such as storage and computational overhead, experts can decide to store or drop the updated models.
The development of the V-CLeaR framework can answer the four research questions listed in Section 2.2.
4 Datasets & Experiments
In the proposed thesis, I plan three experiments in the context of power forecasts, based on three real-world public datasets, to assess the framework's performance. In the remainder of this section, I briefly introduce the selected datasets and the experimental setup.
4.1 Wind Power Generation Forecasts
The dataset contains hourly averaged wind power generation time series for two consecutive years and the corresponding day-ahead meteorological forecasts provided by the European Centre for Medium-Range Weather Forecasts (ECMWF) weather model. The meteorological features comprise (1) wind speed at 100 m height, (2) wind speed at 10 m height, (3) zonal wind direction at 100 m height, (4) meridional wind direction at 100 m height, (5) air pressure, (6) air temperature, and (7) humidity. All features are scaled between 0 and 1. Additionally, the power generation time series is normalized with the wind farm's respective nominal capacity to enable a scale-free comparison and to mask the original characteristics of the wind farm. The dataset is pre-filtered to discard any period longer than 24 hours in which no energy was produced, as this indicates a wind farm malfunction.
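The pre-filtering step can be sketched with pandas as follows, assuming hourly resolution; the function name and exact filtering rule are my reconstruction of the described procedure, not the dataset authors' code.

```python
import pandas as pd

def drop_long_outages(df, power_col="power", max_gap_hours=24):
    """Remove any contiguous run of zero production longer than 24 hours,
    treated as a likely wind-farm malfunction (hourly data assumed)."""
    zero = df[power_col] == 0.0
    run_id = (zero != zero.shift()).cumsum()          # label contiguous runs
    run_len = zero.groupby(run_id).transform("size")  # length of each run
    return df[~(zero & (run_len > max_gap_hours))]

idx = pd.date_range("2020-01-01", periods=40, freq="h")
power = [0.5] * 5 + [0.0] * 30 + [0.4] * 5            # a 30-hour outage
df = pd.DataFrame({"power": power}, index=idx)
cleaned = drop_long_outages(df)
```

Short zero runs (e.g., calm hours) survive the filter; only runs exceeding the 24-hour limit are discarded, so legitimate low-wind periods are not removed.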
4.2 Solar Power Generation Forecasts
The PV facilities range from PV panels installed on rooftops to fully fledged solar farms; their installed nominal power ranges between 100 kW and 8500 kW. Historical numerical weather prediction (NWP) data and the produced power in a three-hour resolution for 990 days are available for each facility. The weather prediction series in the dataset are scaled between 0 and 1 using min-max normalization. Besides, there are three temporal features, the hour of the day, the month of the year, and the season of the year, which are normalized to the range of 0 to 1 using cosine and sine coding. The target variable, i.e., the measured power generation, is normalized using the nominal output capacity of the corresponding PV facility, which allows comparing the forecasting performance without taking the size of the PV facilities into account.
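The cyclic encoding of the temporal features might look like the following sketch, which maps a periodic value onto sine/cosine components and rescales them to [0, 1]; the exact coding used in the dataset may differ.

```python
import numpy as np

def encode_cyclic(value, period):
    """Encode a cyclic temporal feature so that, e.g., hour 23 and hour 0
    end up close together, then rescale from [-1, 1] into [0, 1]."""
    angle = 2.0 * np.pi * np.asarray(value) / period
    return (np.sin(angle) + 1.0) / 2.0, (np.cos(angle) + 1.0) / 2.0

hours = np.arange(24)
sin_h, cos_h = encode_cyclic(hours, period=24)   # hour-of-day features
```

Unlike a raw 0-23 encoding, this representation has no artificial discontinuity at midnight, which matters for models learning daily solar patterns.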
The experimental setup is the same as that of the wind power generation forecasting experiment. Namely, the NWP features are used as the input of V-CLeaR instances for forecasting each generator in the DDI scenario.
4.3 Power Supply and Demand Forecasts in a Regional Power Grid
The regional power grid dataset is collected from a real-world German regional flexibility market. It includes two years of NWP data; low-/medium-voltage power generation (e.g., wind and solar power generation) and consumption (e.g., residential and industrial consumption) measurements in the same period; and the geographic and electrical information of the power grid, as shown in Fig. 7.
The NWP data contains 13 24-hour-ahead numerical weather features with a 15-minute resolution. The power data contains historical samples of 11 renewable power generators, 55 local energy consumers, and 36 low-voltage residential consumers. Like the NWP data, the power data ranges from March 1, 2019, to March 31, 2021, with a 15-minute resolution. Both NWP and power data are scaled between 0 and 1 using min-max normalization.
The power grid information records all information regarding the regional energy market, such as the parameters of the energy generators and consumers, the topological structure of the power grid, and the connection points to higher or lower power grid levels. It can help create a virtual power grid using the open-source Python library pandapower to analyze the power grid's state and optimize power supply and demand.
Because all generators are located in the same region, the NWP features are viewed as identical input for predicting all power targets in the power grid. Therefore, we can build a multi-output neural network with a shared sub-network that extracts common representations, as shown in Fig. 1, to assess the V-CLeaR framework in the DDI and TDI scenarios. Additionally, the virtual power grid can help analyze the effect of continually updated prediction models on power grid management and optimization.
In conclusion, as a starting point of the dissertation, this proposal presents the existing research questions related to continual deep learning for regression tasks. Based on these questions and the requirements of real-world applications, this article proposes an explainable neural-network-based CL framework for solving data-domain or target-domain incremental regression tasks. Currently, the work is in progress: developing the modules of this framework and evaluating the functionality of each module in the application scenario of power forecasts. The V-CLeaR framework is expected to be modularized so that users can utilize a single module or customize the framework for their requirements. Our previous works have proven the applicability and necessity of the proposed framework.
This work was supervised by Prof. Dr. Bernhard Sick and supported within the Digital-Twin-Solar (03EI6024E) project, funded by BMWi: Deutsches Bundesministerium für Wirtschaft und Energie/German Federal Ministry for Economic Affairs and Energy.
-  Riemannian walk for incremental learning: understanding forgetting and intransigence. Proceedings of the European Conference on Computer Vision (ECCV), pp. 532–547. Cited by: 1st item.
-  (2020) Continual prototype evolution: learning online from non-stationary data streams. arXiv preprint arXiv:2009.00919. Cited by: 2nd item.
-  (2021) A continual learning survey: defying forgetting in classification tasks. IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–1. Cited by: §2.2.2.
-  (2018) Don’t forget, there is more than forgetting: new metrics for continual learning. arXiv preprint arXiv:1810.13166. Cited by: §2.2.3.
-  (2018) Towards robust evaluations of continual learning. arXiv preprint arXiv:1805.09733. Cited by: §2.2.
-  (2017) Pathnet: evolution channels gradient descent in super neural networks. arXiv preprint arXiv:1701.08734. Cited by: 3rd item.
-  (2016) GermanSolarFarm Data Set. Cited by: §4.2.
-  (2016) Deep learning for solar power forecasting—an approach using autoencoder and lstm neural networks. In 2016 IEEE international conference on systems, man, and cybernetics (SMC), pp. 002858–002865. Cited by: §4.1, §4.2.
-  (2016) EuropeWindFarm Data Set. Cited by: §4.1.
-  (2021) Novelty detection in continuously changing environments. Future Generation Computer Systems 114, pp. 138–154. Cited by: §2.1.
-  (2020) Continuous learning of deep neural networks to improve forecasts for regional energy markets. IFAC-PapersOnLine 53 (2), pp. 12175–12182. Cited by: §2.1, §2.2.3.
-  (2021) Toward application of continuous power forecasts in a regional flexibility market. Note: In press Cited by: item 2, §2.2.3, §4.3.
-  (2021) CLeaR: an adaptive continual learning framework for regression tasks. arXiv preprint arXiv:2101.00926. Cited by: item 5, §2.2.1, §2.2.3, §2.2, §3, §4.1.
-  (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531. Cited by: §2.2.2.
-  (2017) What do we need to build explainable ai systems for the medical domain?. arXiv preprint arXiv:1712.09923. Cited by: §2.2.4.
-  (2017) Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences 114 (13), pp. 3521–3526. Cited by: 1st item.
-  (2020) Continual learning with bayesian neural networks for non-stationary data. In International Conference on Learning Representations. Cited by: §2.2.2.
-  (2018) Continual classification learning using generative models. arXiv preprint arXiv:1810.10612. Cited by: 2nd item.
-  (2017) Learning without forgetting. IEEE transactions on pattern analysis and machine intelligence 40 (12), pp. 2935–2947. Cited by: 1st item, §2.2.2.
-  (2017) Core50: a new dataset and benchmark for continuous object recognition. In Conference on Robot Learning, pp. 17–26. Cited by: §2.1.
-  Packnet: adding multiple tasks to a single network by iterative pruning. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7765–7773. Cited by: 3rd item.
-  Virtual vector machine for bayesian online classification. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pp. 411–418. Cited by: §2.2.2.
-  (2018) Variational continual learning. In International Conference on Learning Representations. Cited by: §2.2.2.
-  (2019) Eve: explainable vector based embedding technique using wikipedia. Journal of Intelligent Information Systems 53 (1), pp. 137–165. Cited by: §2.2.4.
-  (2017) Encoder based lifelong learning. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1320–1328. Cited by: 1st item.
Icarl: incremental classifier and representation learning. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp. 2001–2010. Cited by: 2nd item.
-  (2016) Progressive neural networks. arXiv preprint arXiv:1606.04671. Cited by: 3rd item.
-  (2017) Continual learning with deep generative replay. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 2994–3003. Cited by: 2nd item.
-  (2018) Pandapower—an open-source python tool for convenient modeling, analysis, and optimization of electric power systems. IEEE Transactions on Power Systems 33 (6), pp. 6510–6521. Cited by: §4.3.
-  (2017) Continual learning through synaptic intelligence. In International Conference on Machine Learning, pp. 3987–3995. Cited by: 1st item.
-  (2020) Class-incremental learning via deep model consolidation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 1131–1140. Cited by: 1st item.