Deep learning techniques have achieved enormous success in a wide range of artificial intelligence (AI) and machine learning applications in recent years [1, 2]. However, most existing deep learning approaches assume a batch (offline) learning setting, in which the entire training dataset must be available before the model is trained. Such approaches scale poorly to many real-world tasks in which data instances arrive sequentially. Making deep learning applicable to streaming data is therefore an important goal in machine learning.
Unlike traditional batch learning, online learning is a family of algorithms designed to learn and optimize models incrementally over sequentially arriving data [3]. Its main advantage over offline learning is that the model can be updated efficiently whenever a new data instance arrives. Like batch learning algorithms, online learning can be applied to various real-world tasks, such as supervised classification [4] and unsupervised learning [5].
However, online learning algorithms generally cannot be applied directly to deep neural networks, because they must cope with difficult convergence problems such as vanishing gradients. Moreover, traditional shallow or fixed network structures scale poorly to most real-world applications, in which data instances arrive sequentially and the data distribution is non-stationary. A promising online deep learning framework should therefore learn effectively and rapidly in non-stationary environments.
It should also be noted that streaming data may exhibit concept drift, that is, the underlying probability distribution of the data changes over time. In this case, the learning algorithm must act to prevent large drift and encourage positive transfer; in other words, the learner should trade off new against old knowledge and alleviate catastrophic forgetting. The classical approaches to catastrophic forgetting are Elastic Weight Consolidation (EWC) [6] and its variants [7]. These algorithms address catastrophic forgetting by augmenting the objective function and thereby regularizing the whole network, so that the model weights balance the two factors, rather than by acting directly when forgetting occurs. This motivates us to enhance the latent representations at different depths and the ability to adapt rapidly to dynamically changing situations.
To this end, we devise a novel Bilevel Online Deep Learning (BODL) framework, which consists of three major components: an online ensemble classifier, concept drift detection, and bilevel online deep learning. BODL builds classifiers on latent feature representations at different levels of abstraction and combines them in an online ensemble, in which the importance weights of the base classifiers are updated by an online exponential gradient descent strategy. To address the convergence problem of the online ensemble framework, we apply a similarity constraint that helps generate favorable latent representations. In addition, a concept drift detection mechanism is devised based on the error rate of the base classifiers. When concept drift is detected, the BODL model adaptively updates its parameters via bilevel optimization, which prevents large drift and encourages positive transfer.
In summary, the main contributions of this paper are as follows:
1) We design an effective bilevel learning strategy. Specifically, when concept drift is detected, the model adaptively adjusts the parameters of all base classifiers and of the different-depth feature representations (described in Section 2) using bilevel optimization, which relies on a tiny episodic memory. The model can thereby avoid large drift and encourage positive transfer in non-stationary environments.
2) To address the convergence problem of the online ensemble framework, we impose a similarity constraint between the features of shallower and deeper layers, which helps generate favorable feature representations.
3) We conduct comparative experiments to verify the effectiveness of the proposed BODL algorithm and analyze the results of a variety of algorithms in terms of accuracy, precision, recall, and F1-score. The results indicate that BODL can exploit the different-depth feature representations and adapt to rapidly changing environments.
The remainder of this paper is organized as follows. Section 2 introduces the BODL algorithm in detail, covering its three parts: the online ensemble classifier, the concept drift detection mechanism, and bilevel learning for concept drift. Section 3 empirically compares BODL with several state-of-the-art online learning algorithms. Section 4 reviews related work. Section 5 summarizes the paper and outlines directions for future work.
2 Bilevel Online Deep Learning (BODL)
In this work, we present bilevel online deep learning, a conceptually novel framework for online learning based on bilevel optimization [8] and an online ensemble framework. The BODL architecture can be divided into three main parts: the online ensemble classifier, the concept drift detection mechanism, and bilevel learning for concept drift. The online deep ensemble classifier trades off the base classifiers at different levels and improves classification performance. The concept drift detection mechanism monitors changes in the non-stationary environment. When concept drift is detected, bilevel learning adaptively adjusts the model parameters so that the model can adapt to the change.
2.1 Online Ensemble Classifier
The online deep ensemble classifier is illustrated in Fig. 1, where each of the N base classifiers is assigned a weight that represents its importance. The classifier trades off the base classifiers at different levels via the Exponential Gradient Descent (EGD) algorithm in an online manner [4].
More specifically, we consider a Deep Neural Network (DNN) with multiple hidden layers. The final ensemble classifier is obtained by dynamically updating the weights of the base classifiers attached to the hidden layers according to their classification losses. The ensemble prediction function can be written as Eq. (1).
In a traditional network, only the feature representation produced by the final hidden layer is used as input to the classifier. Here, by contrast, the online ensemble framework builds the classifier from feature representations at different depths, which improves the prediction performance of the whole model. Note that all three sets of parameters in Eq. (1), namely the ensemble weights, the parameters of the base classifiers, and the parameters of the feature representations, can be learned online.
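As a hedged illustration of the form such an ensemble prediction can take, the following sketch follows the style of [4]. All symbols are assumptions introduced here for illustration and are not necessarily the notation of Eq. (1): $\alpha_n$ is the importance weight of the $n$-th base classifier $f_n$, $\Theta_n$ its parameters, $W_n$ the parameters of the $n$-th hidden layer, $h_n$ the corresponding hidden representation, and $\sigma$ an activation function. The softmax output is likewise assumed.

```latex
F(x) = \sum_{n=1}^{N} \alpha_n f_n(x), \qquad
f_n(x) = \mathrm{softmax}\left(\Theta_n h_n\right), \qquad
h_n = \sigma\left(W_n h_{n-1}\right), \quad h_0 = x,
\qquad \alpha_n \ge 0, \quad \sum_{n=1}^{N} \alpha_n = 1 .
```

Under this reading, every hidden layer contributes its own prediction, and the weights $\alpha_n$ decide how much each depth is trusted.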
Updating the ensemble weights. We update the weights of the base classifiers using exponential gradient descent. First, the weights are initialized from a uniform distribution, i.e., each base classifier has an equal probability of being picked. At each iteration, the prediction loss of the n-th base classifier is computed from the classifier's prediction and the target variable. The weight of each base classifier is then learned according to the loss it suffers, and the update rule is as follows:
where the hyperparameter of the update is set to 0.01 in our work. After each update, the importance weight of the trained base classifier is thus discounted by an exponential factor.
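A minimal sketch of an exponential (multiplicative) weight update of this kind is given below. It assumes the common exponentiated-gradient form, in which each weight is multiplied by exp(-eta * loss) and the weights are renormalized; the name `eta`, its default of 0.01, and the toy losses are assumptions for illustration rather than the exact rule used in this work.

```python
import numpy as np


def egd_update(alpha, losses, eta=0.01):
    """One exponentiated-gradient step on the ensemble weights.

    alpha  : current importance weights of the base classifiers (sums to 1)
    losses : loss suffered by each base classifier on the current instance
    eta    : assumed step size (hypothetical name and default)
    """
    alpha = np.asarray(alpha, dtype=float)
    losses = np.asarray(losses, dtype=float)
    scaled = alpha * np.exp(-eta * losses)   # larger loss, stronger discount
    return scaled / scaled.sum()             # renormalize onto the simplex


N = 4
alpha = np.full(N, 1.0 / N)                  # uniform initialization
losses = np.array([2.0, 1.0, 0.5, 0.1])      # toy per-classifier losses
for _ in range(100):
    alpha = egd_update(alpha, losses)
```

With these toy losses, classifiers that keep suffering smaller losses end up with larger weights, while the weights always remain a probability distribution.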
Updating the classifier parameters. The parameters of all base classifiers are updated using Stochastic Online Gradient Descent (SOGD), a process analogous to training a traditional feedforward network.
Updating the feature parameters. The update rule for the parameters of the different-depth feature representations differs from the traditional backpropagation framework. The objective function consists of two parts, the adaptive loss function and the similarity constraint, and is defined as follows:
where the first part of the loss function is the adaptive prediction loss. Note that the parameters of shallower layers tend to converge faster than those of deeper layers, which causes the deeper base classifiers to learn slowly [2]. We therefore incorporate a similarity constraint between the features of shallower and deeper layers, which helps generate favorable feature representations and improves the convergence rate and prediction performance of the deeper layers. In this work, the trade-off parameter of the constraint is set to 0.1. The similarity can be modelled in multiple ways; we choose the squared distance in this paper.
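The two-part objective described above can be sketched as follows. The sketch assumes that the adaptive loss is the weight-averaged cross-entropy of the base classifiers and that the similarity constraint is the squared distance between the features of adjacent hidden layers of equal width; the function and argument names (`two_part_objective`, `lam`, and so on) are hypothetical.

```python
import numpy as np


def cross_entropy(probs, label):
    """Cross-entropy of one predicted class distribution against a true label."""
    return -np.log(probs[label] + 1e-12)


def two_part_objective(alpha, class_probs, hidden, label, lam=0.1):
    """Adaptive prediction loss plus a squared-distance similarity term.

    alpha       : importance weights of the base classifiers
    class_probs : predicted class distribution of each base classifier
    hidden      : hidden feature vector of each layer (equal widths assumed)
    lam         : trade-off parameter (0.1 in the text)
    """
    adaptive = sum(a * cross_entropy(p, label) for a, p in zip(alpha, class_probs))
    similarity = sum(
        np.sum((hidden[n + 1] - hidden[n]) ** 2) for n in range(len(hidden) - 1)
    )
    return adaptive + lam * similarity


alpha = [0.5, 0.5]
class_probs = [np.array([0.5, 0.5]), np.array([0.25, 0.75])]
hidden = [np.array([1.0, 0.0]), np.array([1.0, 1.0])]
objective = two_part_objective(alpha, class_probs, hidden, label=1)
```

Pairing only adjacent layers is one possible design choice; other pairings of shallower and deeper features fit the same template.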
2.2 Bilevel Online Deep Learning
Because streaming data arrive gradually, the data distribution may change over time. We monitor such changes using the error rate of the classifier. This concept drift detection mechanism is similar to the drift detection method in [9], but we omit the warning phase in order to avoid sliding-window methods. In this section, we describe our adaptive online deep learning method based on bilevel optimization in detail. Figure 2 shows a flowchart of the bilevel online deep learning framework.
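As an illustration of error-rate-based drift detection without a warning phase, the sketch below tracks the running error rate and signals drift when it rises well above the best level seen so far. The specific test (error rate plus standard deviation exceeding the recorded minimum by three standard deviations), the burn-in of 30 samples, and all names are assumptions in the style of common drift detection methods, not the exact mechanism of this paper.

```python
import math


class ErrorRateDriftDetector:
    """Drift detector on the running error rate, without a warning phase."""

    def __init__(self, min_samples=30, threshold=3.0):
        self.min_samples = min_samples  # hypothetical burn-in length
        self.threshold = threshold      # hypothetical sensitivity, in std units
        self.reset()

    def reset(self):
        self.n = 0
        self.errors = 0
        self.p_min = math.inf
        self.s_min = math.inf

    def update(self, mistake):
        """Feed one 0/1 prediction error; return True if drift is signalled."""
        self.n += 1
        self.errors += int(mistake)
        p = self.errors / self.n                 # running error rate
        s = math.sqrt(p * (1.0 - p) / self.n)    # its binomial standard deviation
        if self.n < self.min_samples:
            return False
        if p + s < self.p_min + self.s_min:      # remember the best level seen
            self.p_min, self.s_min = p, s
        if p + s > self.p_min + self.threshold * self.s_min:
            self.reset()                         # start tracking a new concept
            return True
        return False


detector = ErrorRateDriftDetector()
# Toy stream: 10% errors for 200 steps, then every prediction is wrong.
stream = [i % 10 == 0 for i in range(200)] + [True] * 100
detections = [i for i, m in enumerate(stream) if detector.update(m)]
```

Because no statistics are kept beyond two counters and the recorded minimum, the detector needs no sliding window.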
2.2.1 Bilevel Learning
For each instance arriving in the online learning scenario, we detect concept drift using the error rate of the classifier. If concept drift is observed, the learning algorithm needs to act to prevent large drift and achieve online incremental learning. Specifically, when concept drift occurs, BODL initializes a memory weight to replay the knowledge stored in the memory. The trained memory weight is then used to update the model parameters, which prevents large drift and balances new and old knowledge in a non-stationary environment, as shown in Fig. 2.
Bearing this in mind, the objective function can be defined as the following bilevel optimization problem:
where the data term denotes the current training data in which concept drift has been detected. The memory weight is defined through an inner optimization problem, which the learner solves for the corresponding parameters. During bilevel learning, the agent first learns the memory weight by solving the inner problem and then solves the outer problem with respect to the main model parameters. We use the cross-entropy loss as the objective function for both the inner and the outer problem.
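For orientation, a bilevel problem has the generic form sketched below. The symbols are assumptions introduced for illustration: $\theta$ stands for the main model parameters, $\phi$ for the memory weight, and $\mathcal{L}^{\mathrm{out}}$ and $\mathcal{L}^{\mathrm{in}}$ for the outer and inner cross-entropy losses. How the current data and the episodic memory enter each loss is not fixed by this sketch.

```latex
\min_{\theta} \; \mathcal{L}^{\mathrm{out}}\bigl(\theta, \phi^{*}(\theta)\bigr)
\quad \text{subject to} \quad
\phi^{*}(\theta) = \arg\min_{\phi} \; \mathcal{L}^{\mathrm{in}}\bigl(\phi, \theta\bigr) .
```

The key feature is the nesting: the outer objective is evaluated at the solution of the inner problem, so the outer gradient with respect to $\theta$ must account for how $\phi^{*}$ depends on $\theta$.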
2.2.2 First Order Approximation
In general, data arrive gradually in a non-stationary environment, and the concept drift detection mechanism monitors changes online. When concept drift occurs, the learner adaptively adjusts the model parameters and balances new and old knowledge via bilevel learning.
Specifically, for an incoming training instance flagged as concept drift, the inner problem is solved by:
After the memory weight is obtained via Eq. (5), the outer update of the main model parameters can be derived by the chain rule.
Note that solving Eq. (6) is cumbersome in real-world scenarios because of the Hessian-vector product in the second term [10]. To improve computational efficiency, we apply a first-order approximation to simplify Eq. (6) [11, 12]. The outer update is thus obtained by interpolating only in parameter space:
We apply Eq. (7) to obtain a one-step look-ahead parameter from the current parameters, and then adjust the current parameters by linearly interpolating between them and the look-ahead parameter. Note that we only maintain the parameters of the main model; that is, the look-ahead parameter is discarded after every outer update. In this process, the inner optimization is carried out on a tiny episodic memory [13].
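The look-ahead and interpolation steps described above can be sketched on a toy model. The sketch assumes a logistic-regression model trained with binary cross-entropy, a single inner gradient step on the episodic memory, and an interpolation coefficient `beta`; the model, the step sizes, and all names are hypothetical stand-ins for the deep network used in this work.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def bce(w, X, y):
    """Mean binary cross-entropy of a logistic model with weights w."""
    p = np.clip(sigmoid(X @ w), 1e-12, 1.0 - 1e-12)
    return float(-np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))


def bce_grad(w, X, y):
    """Gradient of the mean binary cross-entropy with respect to w."""
    return X.T @ (sigmoid(X @ w) - y) / len(y)


def first_order_outer_update(theta, X_mem, y_mem, inner_lr=0.1, beta=0.5):
    # Inner step: one-step look-ahead copy trained on the episodic memory.
    phi = theta - inner_lr * bce_grad(theta, X_mem, y_mem)
    # Outer step: interpolate in parameter space; phi is then discarded.
    return theta + beta * (phi - theta)


rng = np.random.default_rng(0)
X_mem = rng.normal(size=(20, 3))                       # toy episodic memory
y_mem = (X_mem @ np.array([1.0, -2.0, 0.5]) > 0).astype(float)
theta = np.zeros(3)
new_theta = first_order_outer_update(theta, X_mem, y_mem)
```

No second-order quantities appear: the outer update only needs the difference between the look-ahead parameter and the current one, which is what makes the approximation cheap.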
2.2.3 Bilevel Online Deep Learning Algorithm
In this section, we show how our BODL algorithm learns effectively in non-stationary environments in an online manner.
Our proposed BODL algorithm is shown in Algorithm 1.
In the BODL algorithm, we first present an online ensemble framework that dynamically weights the classifiers at different depths; the weight of the base classifier attached to each hidden layer is updated online by the exponential gradient descent algorithm. In particular, we impose a similarity constraint between the features of shallower and deeper layers, which helps generate favorable feature representations and improves convergence.
In addition, the data distribution may change in real-world scenarios. A concept drift detection mechanism therefore monitors changes in the data according to the error rate of the classifier. Once drift is detected, the learner updates the model parameters via bilevel optimization, which effectively prevents large drift and alleviates catastrophic forgetting.
3 Experiments
In this section, we evaluate the baselines and the proposed BODL algorithm on various stationary and non-stationary datasets, and we report and analyze the experimental results in detail.
3.0.1 Experiment Setup
We use a neural network with 15 hidden layers of 30 units each and ReLU nonlinearities. In all experiments, the network parameters are updated by the Adam optimizer with a learning rate of 0.01. When drift is detected, the model adaptively learns its parameters from a tiny memory budget using the bilevel optimization strategy. Note that we evaluate all learning algorithms with a test-then-train strategy, casting each problem as a classification task.
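The test-then-train (prequential) protocol can be sketched as follows: each arriving instance is first used to test the current model and only afterwards to update it. The Perceptron learner and the synthetic, linearly separable stream are assumptions chosen to keep the example small; they are not the models or datasets of the experiments.

```python
import numpy as np


def prequential_accuracy(stream, predict, update):
    """Test-then-train: predict on each instance first, then learn from it."""
    correct = total = 0
    for x, y in stream:
        correct += int(predict(x) == y)   # test on the unseen instance
        update(x, y)                      # then train on it
        total += 1
    return correct / max(total, 1)


class Perceptron:
    """Toy online learner used only to exercise the protocol."""

    def __init__(self, dim):
        self.w = np.zeros(dim)

    def predict(self, x):
        return 1 if self.w @ x > 0 else -1

    def update(self, x, y):
        if self.predict(x) != y:          # mistake-driven update
            self.w += y * x


rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
score = X[:, 0] + X[:, 1]
keep = np.abs(score) > 1.0                # keep a margin so the toy task is easy
stream = [(x, 1 if s > 0 else -1) for x, s in zip(X[keep], score[keep])]
model = Perceptron(dim=2)
accuracy = prequential_accuracy(stream, model.predict, model.update)
```

Since every instance is scored before the model sees its label, the resulting accuracy needs no separate held-out set.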
We compare against several state-of-the-art baselines: the Perceptron, the Relaxed Online Maximum Margin Algorithm (ROMMA) [14], Online Gradient Descent (OGD) [15], the recently proposed Soft Confidence-Weighted (SCW) algorithms [16], Adaptive Regularization of Weight Vectors (AROW) [17], and the Confidence-Weighted (CW) learning algorithm [19]. BODL-Base denotes our online learning approach without the bilevel optimization strategy.
The learning performance of the BODL algorithm is validated numerically on both stationary and non-stationary data, since evolving data streams in real-world tasks are usually non-stationary. In our experiments, we select three non-stationary datasets and two stationary datasets for comparison. The datasets are obtained from the UCI repository, and their properties are detailed in Table 1.
3.0.3 Experimental Results
In this section, we report the results of all baselines and the proposed BODL algorithm under four evaluation metrics: average accuracy, average precision, F1-score, and recall (Table 2). In addition, to study the contribution of each component, we conduct a complete ablation study with three variants. BODL-2 is trained with both bilevel learning and the similarity constraint; BODL-1 is trained with the similarity constraint alone; BODL-Base is trained with neither.
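For reference, the four metrics can be computed from predictions as in the sketch below. It assumes a binary task with one designated positive class; for multi-class datasets some averaging over classes would be needed, which is not shown. The function name and the toy labels are hypothetical.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1-score for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = len(pairs) - tp - fp - fn
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
metrics = binary_metrics(y_true, y_pred)
```

On imbalanced streams, precision, recall, and F1-score are more informative than accuracy alone, which is why all four are reported.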
The experimental results show that BODL-2 achieves competitive performance across datasets and evaluation criteria. BODL-2 is slightly better than BODL-1 thanks to bilevel learning, which alleviates catastrophic forgetting when concept drift occurs. BODL-Base has lower accuracy than BODL-1, which indicates that the similarity constraint helps generate favorable feature representations.
Comparing with the state-of-the-art methods, we can draw several conclusions. In terms of average accuracy, traditional online learning techniques such as Perceptron and CW unsurprisingly achieve relatively poor performance on almost all datasets. Algorithms such as OGD obtain relatively competitive results on the MNIST dataset. However, they lack the ability to exploit depth or to adaptively adjust the model parameters when concept drift occurs, so they perform poorly on the weather and PIMA datasets. SCW and AROW achieve favorable accuracy on concept drift datasets such as weather and KDDCUP, but they produce poor results on the PIMA dataset, which is highly imbalanced and non-stationary. In contrast, BODL-2 exploits favorable feature representations at different levels through the deep learning framework. Moreover, when concept drift is observed, the learner adaptively adjusts the model parameters via a bilevel optimization strategy based on memory replay, which encourages positive transfer and prevents large drift.
In addition, BODL-2 outperforms all other approaches on the Magic, MNIST, and KDDCUP datasets in terms of accuracy. Our method also performs well on highly imbalanced data streams with concept drift, such as PIMA. On the weather dataset, its accuracy is only 1.22% below the highest one, while it achieves the highest average precision, F1-score, and recall. In conclusion, the experimental results demonstrate that BODL-2 is a promising online learning approach compared with state-of-the-art online methods.
4 Related Works
Recent years have witnessed the enormous success of deep neural networks. Compared with traditional offline learning, online learning is more suitable for many real-world tasks. Online learning algorithms are a class of scalable algorithms designed to optimize models incrementally as data instances arrive. The Perceptron is among the earliest online learning algorithms and was primarily developed to learn linear models. However, Perceptron-type algorithms perform poorly on samples that are not linearly separable. Perceptron algorithms with kernel functions were therefore developed [18], providing online learning techniques for nonlinear models. Although such approaches can solve nonlinear classification problems, determining the type and number of kernel functions remains an open challenge. Moreover, these approaches are not explicitly built to extract feature representations of the data at different depths. For this reason, Sahoo et al. present an online algorithm with a network of varying depth for evolving data streams [4]. However, it neglects the difficult problem of catastrophic forgetting and does not cope well with non-stationary environments. Recently, some algorithms have been developed specifically to handle concept drift in non-stationary environments. These methods incrementally update the model as each data instance arrives in the stream; examples include dynamic combination models, the Online Gradient Descent (OGD) algorithm [15], the Relaxed Online Maximum Margin Algorithm (ROMMA) and its aggressive version aROMMA [14], Adaptive Regularization of Weight Vectors (AROW) [17], the Confidence-Weighted (CW) learning algorithm [19], and the recently proposed Soft Confidence-Weighted (SCW) algorithms [16]. However, these methods update their models constantly, so the model evolves in an extremely regular manner regardless of concept drift.
5 Conclusion and future work
Concept drift is an inevitable problem when learning from evolving data streams, and it must be handled for a learned model to be practically useful. In this work, we proposed a novel Bilevel Online Deep Learning (BODL) framework for learning in non-stationary environments in an online manner. BODL builds an ensemble classifier from feature representations at different depths, in which the importance weight of each classifier is updated by an online exponential gradient descent strategy. To make the deeper layers converge faster and generate favorable feature representations, we impose a similarity constraint between the features of shallower and deeper layers. In addition, a concept drift detection mechanism is devised based on the error rate of the classifier. When concept drift is detected, BODL adaptively updates the model parameters via bilevel optimization based on a tiny episodic memory, which prevents large drift and encourages positive transfer.
Finally, we validated the proposed BODL algorithm through extensive experiments on various stationary and non-stationary datasets. The competitive numerical results show that BODL is a promising online learning approach.
In future work, we will consider the online learning problem for class-incremental learning. In addition, to obtain more favorable feature representations, we will consider incorporating recently proposed self-supervised learning and data augmentation methods.
This work was supported by the Science Foundation of China University of Petroleum, Beijing (No. 2462020YXZZ023).
- (1) Y. Bengio, A. Courville, P. Vincent, Representation learning: A review and new perspectives, IEEE transactions on pattern analysis and machine intelligence 35 (8) (2013) 1798–1828.
- (2) T. Chen, I. Goodfellow, J. Shlens, Net2net: Accelerating learning via knowledge transfer, arXiv preprint arXiv:1511.05641 (2015).
- (3) N. Cesa-Bianchi, G. Lugosi, Prediction, learning, and games, Cambridge university press, 2006.
- (4) D. Sahoo, H. Q. Pham, J. Lu, S. C. Hoi, Online deep learning: Learning deep neural networks on the fly, in: Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI 2018, July 13-19, Stockholm, 2018, pp. 2660–2666.
- (5) S. C. Hoi, D. Sahoo, J. Lu, P. Zhao, Online learning: A comprehensive survey, arXiv e-prints (2018) arXiv–1802.
- (6) J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, et al., Overcoming catastrophic forgetting in neural networks, Proceedings of the national academy of sciences 114 (13) (2017) 3521–3526.
- (7) A. Chaudhry, P. K. Dokania, T. Ajanthan, P. H. Torr, Riemannian walk for incremental learning: Understanding forgetting and intransigence, in: Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 532–547.
- (8) S. Jenni, P. Favaro, Deep bilevel learning, in: Proceedings of the European conference on computer vision (ECCV), 2018, pp. 618–633.
- (9) D. Kifer, S. Ben-David, J. Gehrke, Detecting change in data streams, in: VLDB, Vol. 4, Toronto, Canada, 2004, pp. 180–191.
- (10) Q. Pham, D. Sahoo, C. Liu, S. C. Hoi, Bilevel continual learning, arXiv preprint arXiv:2007.15553 (2020).
- (11) A. Nichol, J. Achiam, J. Schulman, On first-order meta-learning algorithms, arXiv preprint arXiv:1803.02999 (2018).
- (12) M. R. Zhang, J. Lucas, G. Hinton, J. Ba, Lookahead optimizer: k steps forward, 1 step back, arXiv preprint arXiv:1907.08610 (2019).
- (13) A. Chaudhry, M. Rohrbach, M. Elhoseiny, T. Ajanthan, P. K. Dokania, P. H. Torr, M. Ranzato, Continual learning with tiny episodic memories (2019).
- (14) Y. Li, P. M. Long, The relaxed online maximum margin algorithm, Machine Learning 46 (1) (2002) 361–387.
- (15) M. Zinkevich, Online convex programming and generalized infinitesimal gradient ascent, in: Proceedings of the 20th international conference on machine learning (icml-03), 2003, pp. 928–936.
- (16) S. C. Hoi, J. Wang, P. Zhao, Exact soft confidence-weighted learning, in: ICML, 2012.
- (17) K. Crammer, A. Kulesza, M. Dredze, Adaptive regularization of weight vectors, Machine learning 91 (2) (2013) 155–187.
- (18) J. Kivinen, A. J. Smola, R. C. Williamson, Online learning with kernels, IEEE transactions on signal processing 52 (8) (2004) 2165–2176.
- (19) K. Crammer, M. Dredze, F. Pereira, Exact convex confidence-weighted learning, in: Advances in Neural Information Processing Systems, Citeseer, 2008, pp. 345–352.