1 Introduction
Deep learning techniques have achieved enormous success in a wide range of artificial intelligence (AI) and machine learning applications in recent years [1, 2]. However, most existing deep learning approaches assume a batch learning or offline learning setting, where the entire training dataset must be available before a model is trained. Such approaches scale poorly to many real-world tasks in which data instances arrive sequentially. Making deep learning available for streaming data is therefore a desideratum in the field of machine learning.
Unlike traditional batch learning, online learning represents a significant family of learning algorithms designed to optimize and learn models incrementally over sequentially arriving streaming data [3]. Online learning has the notable advantage that, compared with the traditional offline learning fashion, models can be updated efficiently in an online manner when a new data instance arrives. Like batch learning algorithms, online learning can be applied to various real-world tasks, such as supervised classification [4], unsupervised learning [5], and so on.
However, in general, online learning algorithms cannot be directly applied to deep neural networks. They have to cope with intractable convergence problems, such as vanishing gradients. Moreover, a shallow or fixed network structure scales poorly to most real-world applications, where data instances arrive sequentially and the data distribution is nonstationary. Therefore, a promising online deep learning framework should be developed that can effectively and rapidly learn in nonstationary environments.
It should also be noted that the probability distribution of the streaming data may exhibit concept drift, i.e., the data distribution changes over time. In this circumstance, the learning algorithm must take actions to prevent large drift and encourage positive transfer; in other words, the learner should trade off new against old knowledge and alleviate catastrophic forgetting. The classical algorithms for catastrophic forgetting are Elastic Weight Consolidation (EWC) [6] and its variants [7], but these algorithms address catastrophic forgetting by augmenting the objective function and thereby regularizing the whole network, i.e., letting the model's weights balance the two factors, rather than directly taking actions against catastrophic forgetting. This fact highlights the importance of enhancing the different-depth latent representations and of the ability to rapidly adapt to dynamically changing situations.
To achieve this, in this work we devise a novel Bilevel Online Deep Learning (BODL) framework, which consists of three major components: an online ensemble classifier, concept drift detection, and bilevel online deep learning. Our BODL framework can effectively exploit latent feature representations at different abstraction levels to build classifiers via the online ensemble framework, where the importance weights of the base classifiers are updated by an online exponential gradient descent strategy. Considering the convergence problem of the online ensemble framework, we apply a similarity constraint to generate favorable latent representations. Besides, a concept drift detection mechanism is devised based on the error rate of the base classifiers. When concept drift is detected, our BODL model can adaptively update the model parameters via bilevel optimization and thereby prevent large drift and encourage positive transfer.
In summary, our main contributions in this paper are listed below:
1) We design an effective bilevel learning strategy. Specifically, if concept drift is detected, the model adaptively adjusts the parameters of all base classifiers and of the different-depth feature representations mentioned in Section 2 using bilevel optimization, where this process relies on a tiny episodic memory. As a result, the model can circumvent large drift and encourage positive transfer in nonstationary environments.
2) Considering the convergence problem of the online ensemble framework, we impose a similarity constraint between the shallower and deeper layers' features, which is beneficial for generating favorable feature representations.
3) Comparative experiments are devised to verify the effectiveness of the proposed BODL algorithm. We analyze the experimental results of a variety of algorithms from different perspectives in terms of accuracy, precision, recall score and F1 score, showing that our BODL algorithm can exploit the different-depth feature representations and adapt to rapidly changing environments.
The remainder of this paper is organized as follows. In Section 2, we introduce our BODL algorithm in detail; it consists of three parts: the online ensemble classifier, the concept drift detection mechanism, and bilevel learning for concept drift. In Section 3 we empirically compare the BODL algorithm with several state-of-the-art online learning algorithms. In Section 4 we elaborate on related work. In Section 5 we summarize the whole work and interesting directions for the future.
2 Bilevel Online Deep Learning (BODL)
In this work, we present bilevel online deep learning, a conceptually novel framework for online learning based on bilevel optimization [8] and an online ensemble framework. Our BODL architecture can be divided into three main parts: the online ensemble classifier, the concept drift detection mechanism, and bilevel learning for concept drift. The online deep ensemble classifier makes a tradeoff among the different-level base classifiers and improves classification performance; the concept drift detection mechanism monitors changes in the nonstationary environment; and when concept drift is detected, bilevel learning adaptively adjusts the model parameters so that the model can adapt to the change.
2.1 Online Ensemble Classifier
We illustrate the online deep ensemble classifier in Fig. 1, where α_n represents the importance weight of the n-th of the N base classifiers. The online deep ensemble classifier makes a tradeoff among the different-level base classifiers via the Exponential Gradient Descent (EGD) algorithm in an online manner [4].
More specifically, we consider a Deep Neural Network (DNN) with N hidden layers and attach a base classifier f_n to each hidden layer. The final ensemble classifier is obtained by dynamically updating the weights of the per-layer base classifiers based on their classification loss. The ensemble prediction function can be written as Eq. (1).
F(x) = Σ_{n=1}^{N} α_n f_n(x),   with Σ_{n=1}^{N} α_n = 1    (1)
Compared to a traditional network, in which the feature representation produced by the final hidden layer is the sole input to the classifier, here we obtain a favorable classifier through an online ensemble framework, which benefits from the different-depth feature representations and improves the prediction performance of the whole model. Note that the importance weights α, the classifier parameters Θ and the representation parameters W behind Eq. (1) can all be learned in an online fashion.
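For intuition, the ensemble forward pass can be sketched as follows. This is a minimal NumPy sketch under our own naming (`Ws` for the per-layer feature weights, `Thetas` for the per-layer classifier weights, `alpha` for the importance weights); biases and the paper's exact architecture are omitted.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def ensemble_forward(x, Ws, Thetas, alpha):
    """Deep ensemble forward pass: every hidden layer feeds its own base
    classifier; the final prediction is the alpha-weighted combination."""
    h = x
    probs = []
    for W, Theta in zip(Ws, Thetas):
        h = np.maximum(h @ W, 0.0)         # hidden feature of this layer (ReLU)
        probs.append(softmax(h @ Theta))   # base classifier on this layer
    return sum(a * p for a, p in zip(alpha, probs))
```

Because the alpha weights sum to one and each base classifier outputs a probability distribution, the combined output is again a valid distribution.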
Update the parameters α. We update the weights α of the base classifiers using exponential gradient descent [4]. Firstly, the weights are initialized using a uniform distribution, α_n = 1/N, i.e., each base classifier has equal probability of being picked. At each iteration t, the prediction loss of the n-th base classifier can be written as L(f_n(x_t), y_t), where f_n(x_t) and y_t represent the base classifier's prediction and the target variable, respectively. Then the weight of each base classifier is learned according to the loss suffered, and the update rule is given as follows:

α_n^{t+1} = α_n^t exp(−η L(f_n(x_t), y_t)) / Σ_{m=1}^{N} α_m^t exp(−η L(f_m(x_t), y_t))    (2)

where η is the learning rate of the exponential gradient step and is set to 0.01 in our work. In this way, each trained base classifier's importance weight is discounted by the exponential factor exp(−η L(f_n(x_t), y_t)).
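This update can be sketched as follows; the function name `egd_update` and the small floor `alpha_min`, which keeps every base classifier's weight alive, are our own assumptions rather than details stated above.

```python
import numpy as np

def egd_update(alpha, losses, eta=0.01, alpha_min=1e-3):
    """Discount each base classifier's weight by exp(-eta * loss),
    floor it so no classifier's weight vanishes, and renormalize."""
    alpha = np.asarray(alpha) * np.exp(-eta * np.asarray(losses))
    alpha = np.maximum(alpha, alpha_min)
    return alpha / alpha.sum()

# N base classifiers start from a uniform distribution.
N = 4
alpha = np.full(N, 1.0 / N)
# per-classifier prediction losses on the current instance
alpha = egd_update(alpha, losses=[0.2, 1.5, 0.9, 0.1])
```

After the update the weights still sum to one, and classifiers with smaller losses receive larger weights.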
Update the parameters Θ. The parameters Θ of all base classifiers are updated using Stochastic Online Gradient Descent (SOGD); this process is analogous to training a traditional feedforward network.
Update the parameters W. The update rule for the parameters W of the different-depth feature representations differs from the traditional backpropagation framework. The objective function consists of two parts, the adaptive prediction loss and the similarity constraint, defined as follows:

L(W) = Σ_{n=1}^{N} α_n L(f_n(x), y) + λ Σ_{n=1}^{N−1} ‖h_n − h_{n+1}‖²    (3)

where the first part of the loss function represents the adaptive prediction loss and h_n denotes the feature representation of the n-th hidden layer. Note that the parameters of shallower layers tend to converge faster than those of deeper layers, which can cause the deeper base classifiers to learn slowly [2]. Thus, we incorporate the similarity constraint between the shallower and deeper layers' features, which helps generate favorable feature representations and improves the convergence rate and prediction performance of the deeper layers. Here λ is a tradeoff parameter and is set to 0.1. Note that the similarity can be modelled in multiple ways; we choose the squared distance metric in this paper.
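The combined objective can be sketched as follows. The squared error stands in for the adaptive prediction loss, and the penalty ties each layer's features to those of the next deeper layer; both are our assumptions about details the text leaves unspecified.

```python
import numpy as np

def representation_loss(alpha, preds, y, feats, lam=0.1):
    """Alpha-weighted prediction loss of the base classifiers plus a
    squared-distance similarity constraint between the feature
    representations of adjacent hidden layers."""
    pred_loss = sum(a * np.mean((p - y) ** 2) for a, p in zip(alpha, preds))
    sim = sum(np.mean((feats[n] - feats[n + 1]) ** 2)
              for n in range(len(feats) - 1))
    return pred_loss + lam * sim
```

Setting `lam` to zero recovers the plain weighted prediction loss; a larger `lam` pulls the layer features closer together.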
2.2 Bilevel Online Deep Learning
As the streaming data arrives gradually, the data probability distribution may change. We monitor changes in the data distribution using the error rate of the classifier. This concept drift detection mechanism is similar to the drift detection method in [9], but no warning phase is arranged in this paper, in order to avoid sliding-window methods. In this section, we describe our adaptive online deep learning based on bilevel optimization in detail. Figure 2 shows a flowchart of the bilevel online deep learning framework.
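An error-rate monitor of this kind can be sketched as follows, in the spirit of the DDM family of detectors; the 3-sigma drift level, the 30-sample warm-up and the reset after a detection are our assumptions, since only the use of the classifier's error rate (without a warning phase) is stated above.

```python
import math

class ErrorRateDriftDetector:
    """Flags concept drift when the running error rate rises well above
    its best observed level (no warning phase)."""
    def __init__(self, drift_level=3.0, min_samples=30):
        self.drift_level = drift_level
        self.min_samples = min_samples
        self._reset()

    def _reset(self):
        self.n = 0
        self.errors = 0
        self.p_min = float("inf")
        self.s_min = float("inf")

    def update(self, mistake: bool) -> bool:
        self.n += 1
        self.errors += int(mistake)
        p = self.errors / self.n                     # running error rate
        s = math.sqrt(p * (1 - p) / self.n)          # its standard deviation
        if self.n < self.min_samples:
            return False
        if p + s < self.p_min + self.s_min:          # new best operating point
            self.p_min, self.s_min = p, s
        if p + s > self.p_min + self.drift_level * self.s_min:
            self._reset()                            # start fresh after drift
            return True
        return False
```

Usage: feed `update(prediction != label)` for every processed instance; a return value of `True` signals that the bilevel adaptation should be triggered.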
2.2.1 Bilevel Learning
For each arriving instance in the online learning scenario, we detect concept drift using the error rate of the classifier. If concept drift is observed, the learning algorithm obviously needs to take actions to prevent large drift and achieve online incremental learning. Specifically, when concept drift occurs, BODL initializes a memory weight φ to replay the knowledge stored in the memory. We then apply the trained memory weight φ to update the main parameters θ, so that the model prevents large drift and weighs new against old knowledge in a nonstationary environment, as shown in Fig. 2.
Bearing this in mind, the objective function can be defined as the following bilevel optimization problem:

min_θ L_out(φ*(θ); D_t)   s.t.   φ*(θ) = argmin_φ L_in(φ; M)    (4)

where D_t denotes the current training data on which concept drift is present and M denotes the tiny episodic memory. The inner optimization problem is parameterized by θ, and the learner optimizes the corresponding memory weight φ. During bilevel learning, the agent first learns the memory weight φ by solving the inner problem; after that, it solves the outer problem with respect to θ. In this process, we apply the cross-entropy loss as the objective function for both the inner and the outer problem.
2.2.2 First Order Approximation
Generally, data arrives gradually in a nonstationary environment and the concept drift detection mechanism monitors changes in an online manner. When concept drift occurs, the learner can adaptively adjust the model parameters via bilevel learning, weighing new against old knowledge.
Specifically, assume that concept drift is reported for an incoming batch of training data D_t. The inner problem is solved by:

φ* = argmin_φ L_in(φ; M),  with φ initialized from θ    (5)
After obtaining φ* via Eq. (5), the outer learning for the parameters θ can in principle be solved by the chain rule:

∇_θ L_out = ∇_φ L_out(φ*; D_t) · ∇_θ φ*(θ)    (6)
Note that solving Eq. (6) is cumbersome in real-world scenarios because of the Hessian-vector product hidden in the second factor [10]. To improve computational efficiency, we apply a first-order approximation to simplify Eq. (6) in this work [11, 12]. Thus, the outer learning is performed by interpolating only in the parameter space:
θ ← θ + γ (φ* − θ)    (7)
We obtain a one-step look-ahead parameter φ* from θ and then adjust θ by linearly interpolating between the current parameters and φ*, as in Eq. (7). Note that we only maintain the parameters θ of the main model, i.e., once the parameters φ* are obtained, they are discarded after every outer update. In this process, the inner optimization is carried out on a tiny experience memory [13].
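The two-level update can be sketched end to end as follows. A toy quadratic loss on the memory stands in for the cross-entropy inner problem, and the step sizes and step count are illustrative only.

```python
import numpy as np

def inner_adapt(theta, grad_fn, memory, beta=0.1, steps=5):
    """Inner problem: initialize phi from the main parameters theta and
    take a few gradient steps on the episodic-memory loss."""
    phi = theta.copy()
    for _ in range(steps):
        phi -= beta * grad_fn(phi, memory)
    return phi

def outer_update(theta, phi, gamma=0.5):
    """First-order outer step: linearly interpolate between the current
    parameters and the look-ahead parameters phi, then discard phi."""
    return theta + gamma * (phi - theta)

# toy memory loss ||w - m||^2, whose gradient is 2 (w - m)
theta = np.zeros(3)
memory = np.ones(3)
phi = inner_adapt(theta, lambda w, m: 2.0 * (w - m), memory)
theta = outer_update(theta, phi)
```

Because the outer step only interpolates in parameter space, no Hessian-vector product is ever formed.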
2.2.3 Bilevel Online Deep Learning Algorithm
In this section, we show that our BODL algorithm can effectively learn in a nonstationary environment in an online manner.
Our proposed BODL algorithm is shown in Algorithm 1.
In the BODL algorithm, we first present an online ensemble framework that dynamically weighs the different-depth classifiers, where the base classifier weights for each hidden layer are updated by the exponential gradient descent algorithm in an online manner. In particular, we impose the similarity constraint between the shallower and deeper layers' features, which is beneficial for generating favorable feature representations and improving convergence.
In addition, considering that the data probability distribution may change in real-world scenarios, a concept drift detection mechanism is used to monitor data changes based on the error rate of the classifier. Once drift is detected, the learner updates the model parameters via bilevel optimization, thereby effectively preventing large drift and alleviating catastrophic forgetting.
3 Experiments
In this section, we evaluate the baselines and our proposed BODL algorithm on various stationary and nonstationary datasets. We report and analyze the experimental results in detail.
3.0.1 Experiment Setup
We use a neural network architecture with 15 hidden layers of 30 units each with ReLU nonlinearities. In all experiments, the entire network's parameters are updated by the Adam optimizer with a learning rate of 0.01. When drift is detected, the model adaptively learns the parameters from the tiny memory budget, and this process uses the bilevel optimization strategy. It is worth noting that we apply a test-then-train strategy for evaluating the learning algorithms, casting this as a classification task.
We compare against several state-of-the-art baselines: Perceptron, the Relaxed Online Maximum Margin Algorithm (ROMMA) [14], OGD [15], the recently proposed Soft Confidence-Weighted algorithm (SCW) [16], Adaptive Regularization of Weight Vectors (AROW) [17], the Confidence-Weighted (CW) learning algorithm [19], and the Passive-Aggressive (PA) algorithm. Here, the BODLBase algorithm is regarded as an online learning approach without the bilevel optimization strategy.
3.0.2 Datasets
The learning performance of the BODL algorithm is numerically validated on stationary and nonstationary data; evolving data streams in real-world tasks usually exhibit nonstationary properties. Thus, in our experiments, we select three nonstationary datasets and two stationary datasets for comparison. The datasets are obtained from UCI repositories and their properties are shown in detail in Table 1.
Dataset  Size  Features  Type 

MNIST  70000  784  Stationary 
Magic  19020  10  Stationary 
PIMA  768  8  Nonstationary 
Weather  18140  8  Nonstationary 
KDDCUP  1036241  127  Nonstationary 
3.0.3 Experimental Results
In this section, the comparative experimental results of all baselines and the proposed BODL algorithm under four different metrics, average accuracy, average precision, F1-score and recall score, are reported in Table 2. In addition, in order to study the contribution of each component, complete ablation studies are conducted: BODL2 is trained using both the bilevel learning and the similarity constraint; BODL1 is trained using the similarity constraint alone; BODLBase is trained without the bilevel learning and the similarity constraint.
Method  Average Accuracy  

MNIST  Magic  PIMA  Weather  KDDCUP  
BODL2  92.00%  78.73%  74.36%  74.90%  99.68% 
BODL1  91.99%  78.49%  73.84%  73.28%  99.44% 
BODLBase  90.80%  78.31%  71.69%  72.34%  99.35% 
Perceptron  84.77%  70.60%  64.45%  65.85%  99.31% 
ROMMA  83.22%  66.67%  64.45%  65.63%  99.34% 
OGD  90.10%  78.72%  72.78%  72.70%  99.61% 
SCW  88.98%  78.64%  70.31%  76.12%  99.75% 
AROW  89.04%  78.71%  72.14%  75.15%  99.58% 
CW  86.88%  67.90%  63.41%  36.81%  99.62% 
PA  85.68%  70.13%  66.41%  65.74%  99.41% 
Method  Average precision  
MNIST  Magic  PIMA  Weather  KDDCUP  
BODL2  91.91%  74.65%  54.83%  77.35%  98.55% 
BODL1  91.89%  73.96%  54.38%  75.76%  97.46% 
BODLBase  90.70%  73.80%  51.88%  76.06%  96.99% 
Perceptron  84.61%  67.77%  64.03%  60.78%  98.96% 
ROMMA  82.99%  64.16%  63.74%  60.69%  99.08% 
OGD  89.99%  77.66%  71.77%  67.91%  99.35% 
SCW  88.83%  76.75%  69.18%  74.90%  99.55% 
AROW  88.92%  77.62%  70.84%  71.13%  99.31% 
CW  86.70%  64.75%  62.55%  53.55%  99.56% 
PA  85.45%  67.19%  65.06%  59.95%  99.01% 
Method  F1score  
MNIST  Magic  PIMA  Weather  KDDCUP  
BODL2  91.97%  65.81%  60.83%  82.62%  99.21% 
BODL1  91.97%  65.73%  63.30%  81.80%  98.61% 
BODLBase  90.78%  65.70%  59.66%  81.69%  98.38% 
Perceptron  84.78%  70.61%  65.27%  66.07%  99.31% 
ROMMA  83.21%  67.05%  65.27%  65.95%  99.33% 
OGD  90.08%  78.02%  73.39%  71.04%  99.61% 
SCW  88.97%  78.41%  70.96%  77.53%  99.75% 
AROW  88.98%  78.16%  72.73%  74.28%  99.66% 
CW  86.87%  67.86%  64.24%  29.00%  99.67% 
PA  85.66%  70.08%  67.12%  65.58%  99.41% 
The experimental results show that our BODL2 algorithm enjoys competitive performance on different datasets under different evaluation criteria. BODL2 is slightly better than BODL1 thanks to the bilevel learning, since it can alleviate catastrophic forgetting when concept drift occurs. BODLBase has lower accuracy than BODL1, which indicates that the similarity constraint is beneficial for generating favorable feature representations.
Compared to the state-of-the-art methods, we can draw several conclusions. In terms of average accuracy, first and unsurprisingly, traditional online learning techniques, such as Perceptron and CW, achieve relatively poor performance on almost all datasets. Next, we note that algorithms such as OGD obtain relatively competitive numerical results on the MNIST dataset; however, lacking the ability to further exploit the power of depth or to adaptively adjust the model parameters when concept drift occurs, they perform poorly on the Weather and PIMA datasets. SCW and AROW achieve favorable accuracy on concept drift datasets such as Weather and KDDCUP, but they produce poor results on the PIMA dataset, which is highly imbalanced and nonstationary. In contrast, our BODL2 algorithm can exploit the favorable different-level feature representations based on the deep learning framework; moreover, when concept drift is observed, the learner can adaptively adjust the model parameters via the bilevel optimization strategy based on memory replay, thereby encouraging positive transfer and preventing large drift.
In addition, the BODL2 algorithm outperforms all other approaches on the Magic, MNIST and KDDCUP datasets under the accuracy criterion. It is also notable that our method performs well on highly imbalanced data streams with concept drift, such as PIMA. On the Weather dataset its accuracy is only 1.22% below the best, yet it achieves the highest results under the average precision, F1-score and recall-score criteria. To conclude, the experimental results demonstrate that our BODL2 algorithm is a promising online learning approach compared to the state-of-the-art online methods.
4 Related Works
Recent years have witnessed enormous success of deep neural networks. Compared to traditional offline learning, online learning is more suitable for many real-world tasks. Online learning algorithms represent a class of scalable algorithms devised to optimize models incrementally as data instances arrive gradually. The Perceptron, based on maximum-margin classification, is the earliest online learning algorithm and was primarily developed to learn linear models. However, perceptron-style algorithms are fragile on samples that are not linearly separable. Thus, perceptron algorithms with kernel functions were developed [18], providing online learning techniques with nonlinear models. While such approaches can solve nonlinear classification, determining the type and number of kernel functions is an open challenge. Moreover, these approaches are not explicitly built to extract different-depth feature representations from the data instances. Based on this fact, Sahoo et al. present an online algorithm with different-depth networks for evolving data streams [4]. However, they neglect the intractable problem of catastrophic forgetting and cannot cope with nonstationary environments very well. Recently, some algorithms specifically handle concept drift in nonstationary environments. These methods concentrate on incrementally updating the model as data instances arrive in a stream, such as dynamic combination models; the Online Gradient Descent algorithm (OGD) [15]; the relaxed online maximum margin algorithm ROMMA and its aggressive version agg-ROMMA [14]; Adaptive Regularization of Weight Vectors (AROW) [17]; the Confidence-Weighted (CW) learning algorithm [19]; and the recently proposed Soft Confidence-Weighted algorithm (SCW) [16].
However, these methods update their models constantly, which makes the model evolve in an extremely regular manner regardless of concept drift.
5 Conclusion and future work
Concept drift is an inevitable problem when learning from evolving data streams and must be handled for the learned models to be practically useful. In this work, we proposed a novel Bilevel Online Deep Learning (BODL) framework to learn in nonstationary environments in an online manner. BODL creates an ensemble classifier from the different-depth feature representations, where the importance weight of each classifier is updated by an online exponential gradient descent strategy. In order to make the deeper layers converge faster and generate favorable feature representations, we impose the similarity constraint between the shallower and deeper layers' features. Besides, a concept drift detection mechanism is devised based on the error rate of the classifier. When concept drift is detected, our BODL algorithm can adaptively update the model parameters via bilevel optimization based on the tiny episodic memory, thereby preventing large drift and encouraging positive transfer.
Finally, we validated the proposed BODL algorithm through extensive experiments on various stationary and nonstationary datasets, and the competitive numerical results show that BODL is a promising online learning approach.
In future work, we will consider the online learning problem for class-incremental learning. Besides, in order to obtain more favorable feature representations, we will also consider incorporating recently proposed self-supervised learning and data augmentation methods.
6 Acknowledgements
This work was supported by the Science Foundation of China University of Petroleum, Beijing (No. 2462020YXZZ023).
References
 (1) Y. Bengio, A. Courville, P. Vincent, Representation learning: A review and new perspectives, IEEE transactions on pattern analysis and machine intelligence 35 (8) (2013) 1798–1828.
 (2) T. Chen, I. Goodfellow, J. Shlens, Net2net: Accelerating learning via knowledge transfer, arXiv preprint arXiv:1511.05641 (2015).
 (3) N. Cesa-Bianchi, G. Lugosi, Prediction, learning, and games, Cambridge University Press, 2006.
 (4) D. Sahoo, H. Q. Pham, J. Lu, S. C. Hoi, Online deep learning: Learning deep neural networks on the fly, in: Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence (IJCAI 2018), July 13-19, Stockholm, pp. 2660–2666.
 (5) S. C. Hoi, D. Sahoo, J. Lu, P. Zhao, Online learning: A comprehensive survey, arXiv eprints (2018) arXiv–1802.
 (6) J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, et al., Overcoming catastrophic forgetting in neural networks, Proceedings of the National Academy of Sciences 114 (13) (2017) 3521–3526.
 (7) A. Chaudhry, P. K. Dokania, T. Ajanthan, P. H. Torr, Riemannian walk for incremental learning: Understanding forgetting and intransigence, in: Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 532–547.
 (8) S. Jenni, P. Favaro, Deep bilevel learning, in: Proceedings of the European conference on computer vision (ECCV), 2018, pp. 618–633.
 (9) D. Kifer, S. Ben-David, J. Gehrke, Detecting change in data streams, in: VLDB, Vol. 4, Toronto, Canada, 2004, pp. 180–191.
 (10) Q. Pham, D. Sahoo, C. Liu, S. C. Hoi, Bilevel continual learning, arXiv preprint arXiv:2007.15553 (2020).
 (11) A. Nichol, J. Achiam, J. Schulman, On first-order meta-learning algorithms, arXiv preprint arXiv:1803.02999 (2018).
 (12) M. R. Zhang, J. Lucas, G. Hinton, J. Ba, Lookahead optimizer: k steps forward, 1 step back, arXiv preprint arXiv:1907.08610 (2019).
 (13) A. Chaudhry, M. Rohrbach, M. Elhoseiny, T. Ajanthan, P. K. Dokania, P. H. Torr, M. Ranzato, Continual learning with tiny episodic memories (2019).
 (14) Y. Li, P. M. Long, The relaxed online maximum margin algorithm, Machine Learning 46 (1) (2002) 361–387.
 (15) M. Zinkevich, Online convex programming and generalized infinitesimal gradient ascent, in: Proceedings of the 20th International Conference on Machine Learning (ICML-03), 2003, pp. 928–936.
 (16) S. C. Hoi, J. Wang, P. Zhao, Exact soft confidence-weighted learning, in: ICML, 2012.
 (17) K. Crammer, A. Kulesza, M. Dredze, Adaptive regularization of weight vectors, Machine learning 91 (2) (2013) 155–187.
 (18) J. Kivinen, A. J. Smola, R. C. Williamson, Online learning with kernels, IEEE transactions on signal processing 52 (8) (2004) 2165–2176.
 (19) K. Crammer, M. Dredze, F. Pereira, Exact convex confidence-weighted learning, in: Advances in Neural Information Processing Systems, Citeseer, 2008, pp. 345–352.