In the monitoring and security of critical infrastructures, which include large-scale, complex systems such as power and energy systems, water, transportation and telecommunication networks, the system state typically remains normal or healthy for a sustained period of time until an abnormal event occurs. Such abnormal events or faults can lead to serious degradation in performance or, even worse, to cascading overall system failure and breakdown. The consequences can be severe and may have a huge impact on everyday life and well-being. Examples include real-time prediction of hazardous events in environment monitoring systems and intrusion detection in computer networks. Because a critical infrastructure system is in a healthy state the majority of the time and failures are low-probability events, class imbalance is a major challenge in this area.
Class imbalance occurs when at least one data class is under-represented compared to others, thus constituting a minority class. It is a difficult problem because the skewed distribution makes a traditional learning algorithm ineffective: its predictive power is typically low for minority class examples and its generalisation ability is poor. The problem becomes significantly harder when class imbalance co-exists with concept drift. Only a handful of studies address online class imbalance learning. Focussing on binary classification problems, we introduce a novel algorithm, queue-based resampling, whose central idea is to selectively include in the training set a subset of the negative and positive examples by maintaining a separate queue for each class. Our study examines two popular benchmark datasets under various class imbalance rates, with and without the presence of drift. Queue-based resampling outperforms state-of-the-art methods in terms of learning speed and quality.
2 Background and Related Work
2.1 Online Learning
In online learning, a data generating process provides at each time step $t$ a sequence of examples $(x^t, y^t)$ from an unknown probability distribution $p^t(x, y)$, where $x^t \in \mathbb{R}^d$ is a $d$-dimensional input vector belonging to input space $X$ and $y^t \in Y$ is the class label, where $Y = \{c_1, \dots, c_K\}$ and $K$ is the number of classes. An online classifier is built that receives a new example $x^t$ at time step $t$ and makes a prediction $\hat{y}^t$. Specifically, assume a concept $h: X \to Y$ such that $\hat{y}^t = h(x^t)$. The classifier after some time receives the true label $y^t$, its performance is evaluated using a loss function $J = l(y^t, \hat{y}^t)$, and it is then trained, i.e. its parameters are updated based on the loss incurred. The example is then discarded to enable learning in high-speed data streaming applications. This process is repeated at each time step. Depending on the application, new examples do not necessarily arrive at regular and pre-defined intervals.
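The predict, evaluate, update and discard cycle described above can be sketched as follows. This is only an illustration under stated assumptions: `OnlineLogisticRegression`, `run_online` and `toy_stream` are hypothetical names, and a small logistic regression stands in for the classifier, which is not the model used in this work.

```python
import math
import random


class OnlineLogisticRegression:
    """Tiny stand-in classifier (an assumption, not the paper's model)."""

    def __init__(self, n_features, learning_rate=0.1):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = learning_rate

    def predict_proba(self, x):
        z = self.b + sum(wi * xi for wi, xi in zip(self.w, x))
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, x, y):
        # One stochastic gradient step on the binary cross-entropy loss.
        error = self.predict_proba(x) - y
        self.w = [wi - self.lr * error * xi for wi, xi in zip(self.w, x)]
        self.b -= self.lr * error


def run_online(stream, model):
    """Test-then-train loop: predict, observe the label, update, discard."""
    mistakes = 0
    for x, y in stream:
        y_hat = 1 if model.predict_proba(x) >= 0.5 else 0
        mistakes += int(y_hat != y)
        model.update(x, y)  # the example is not stored afterwards
    return mistakes


def toy_stream(n, seed=0):
    # Hypothetical stream: the label is 1 when the first feature is larger.
    rng = random.Random(seed)
    for _ in range(n):
        x = (rng.random(), rng.random())
        yield x, int(x[0] > x[1])
```

Because each example is used for one prediction and one update and then dropped, memory use stays constant regardless of the length of the stream.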
We distinguish online learning from chunk-based learning, in which a chunk of examples is received at each time step. Both approaches build a model incrementally; however, the design of chunk-based algorithms differs significantly and, therefore, the majority are typically not suitable for online learning tasks. This work focuses on online learning.
2.2 Class Imbalance and Concept Drift
Class imbalance constitutes a major challenge in learning and occurs when at least one data class is under-represented compared to others, thus constituting a minority class. Considering, for example, a binary classification problem, classes $y = 1$ (positive) and $y = 0$ (negative) constitute the minority and majority class respectively if $p(y = 1) \ll p(y = 0)$. Class imbalance has been extensively studied in offline learning, and techniques addressing the problem are typically split into two categories: data-level and algorithm-level techniques.
Data-level techniques consist of resampling techniques that alter the training set to deal with the skewed data distribution: oversampling techniques “grow” the minority class, while undersampling techniques “shrink” the majority class. The simplest and most popular resampling techniques are random oversampling and random undersampling, where data examples are randomly added or removed respectively [17, 16]. More sophisticated resampling techniques exist; for example, the use of Tomek links discards borderline examples, while the SMOTE algorithm generates new minority class examples based on their similarities to the original ones. Interestingly, sophisticated techniques do not always outperform the simpler ones. Furthermore, since their mechanism relies on identifying relations between training data, they are difficult to apply in online learning tasks, although some initial effort has recently been made.
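As a rough illustration of the two simplest data-level techniques, the sketch below balances an offline training set by random oversampling and random undersampling. The function names are hypothetical, and balancing to exactly equal class sizes is an assumption made for simplicity.

```python
import random


def random_oversample(majority, minority, rng):
    """Duplicate randomly chosen minority examples until the classes match."""
    n_extra = len(majority) - len(minority)
    extra = [rng.choice(minority) for _ in range(n_extra)]
    return majority, minority + extra


def random_undersample(majority, minority, rng):
    """Keep a random subset of the majority class of the minority's size."""
    return rng.sample(majority, len(minority)), minority
```

Both functions need the whole training set in memory, which is one reason such offline techniques do not transfer directly to online learning.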
Algorithm-level techniques modify the classification algorithm directly to deal with the imbalance problem. Cost-sensitive learning is widely adopted and assigns a different cost to each data class . Alternatives are threshold-moving  methods where the classifier’s threshold is modified such that it becomes harder to misclassify minority class examples. Contrary to resampling methods that are algorithm-agnostic, algorithm-level methods are not as widely used .
A challenge in online learning is that of concept drift, where the data generating process evolves over time. Formally, a drift corresponds to a change in the joint probability $p(x, y)$. Although drift can manifest itself in other forms, this work focuses on $p(y)$ drift (i.e. a change in the prior probability) because such a change can lead to class imbalance. Note that the true decision boundary remains unaffected when $p(y)$ drift occurs; however, the classifier’s learnt boundary may drift away from the true one.
2.3 Online Class Imbalance Learning
The majority of existing work addresses class imbalance in offline learning, while some others require chunk-based data processing [16, 8]. Little work deals with class imbalance in online learning and this section discusses the state-of-the-art.
The cost-sensitive online gradient descent (CSOGD) method uses a loss function in which an indicator function, which returns 1 if its condition is satisfied and 0 otherwise, selects the cost applied to each example according to its class; separate costs are defined for the positive and negative classes. The authors use the perceptron classifier and stochastic gradient descent, and apply the cost-sensitive modification to the hinge loss function, achieving excellent results. The downside of this method is that the costs need to be pre-defined; however, the extent of the class imbalance problem may not be known in advance. In addition, it cannot cope with concept drift as the pre-defined costs remain static. A cost-sensitive perceptron-based classifier with an adaptive cost strategy has also been introduced.
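To make the idea of a class-dependent cost concrete, here is a minimal sketch of a hinge loss scaled by a per-class cost. It is one plausible form only and is not claimed to be the exact loss of the cited method; the function name and the cost values in the usage below are assumptions.

```python
def cost_sensitive_hinge(y, score, cost_pos, cost_neg):
    """Hinge loss scaled by a class-dependent cost.

    y is 1 (positive) or 0 (negative); score is the raw classifier output.
    """
    sign = 1.0 if y == 1 else -1.0
    hinge = max(0.0, 1.0 - sign * score)
    cost = cost_pos if y == 1 else cost_neg
    return cost * hinge
```

With a larger `cost_pos` than `cost_neg`, an error on a positive example produces a larger loss and hence a larger update. Since both costs are fixed arguments, the sketch also shows why static costs cannot follow a changing imbalance rate.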
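A time-decayed class size metric has been defined where, for each class $c_k$, its size $s_k$ is updated at each time step $t$ according to the following equation:

$$ s_k^{t} = \theta \, s_k^{t-1} + (1 - \theta) \, I[y^t = c_k] \qquad (2) $$

where $0 < \theta < 1$ is a pre-defined time decay factor that gives less emphasis to older data, and $I[y^t = c_k]$ equals 1 if the true label of the example at time step $t$ is $c_k$ and 0 otherwise. This metric is used to determine the imbalance rate at any given time. For instance, for a binary classification problem where the positive class constitutes the minority, the imbalance rate at any given time is obtained from the sizes of the positive and negative classes.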
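A time-decayed class size can be tracked with a few lines of code. The sketch below assumes the usual exponentially weighted update, in which every class size is multiplied by the decay factor and the class of the current example receives the remaining weight; the class name `TimeDecayedClassSize` and the decay value used in the usage note are hypothetical.

```python
class TimeDecayedClassSize:
    """Exponentially decayed estimate of each class's share of the stream."""

    def __init__(self, classes, decay):
        self.decay = decay
        self.size = {c: 0.0 for c in classes}

    def update(self, label):
        # Every class is decayed; only the observed class gets new weight.
        for c in self.size:
            indicator = 1.0 if c == label else 0.0
            self.size[c] = (self.decay * self.size[c]
                            + (1.0 - self.decay) * indicator)
```

For example, feeding a stream in which one example in ten is positive with `decay=0.99` drives `size[1]` towards roughly 0.1 and `size[0]` towards roughly 0.9, and the estimate adapts if the proportion later changes.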
Oversampling-based online bagging (OOB) is an ensemble method that adaptively adjusts the learning bias from the majority to the minority class through resampling, utilising the time-decayed class size metric. An undersampling version (UOB) has also been proposed but was demonstrated to be unstable. OOB with 50 neural networks has been shown to have superior performance. To determine the effectiveness of resampling alone, the authors examine the special case where there exists only a single classifier. Compared against the aforementioned methods and others, this single-classifier variant has been shown to outperform the rest in the majority of the cases, leading to the conclusion that resampling is the main reason behind the effectiveness of the ensemble.
Another approach to address drift is the use of sliding windows. It can be viewed as adding a memory component to the online learner: given a window of size $W$, it keeps in memory the $W$ most recent examples. Despite being able to address concept drift, it is difficult to determine the window size a priori, as a larger window is better suited to a slow drift while a smaller window is suitable for a rapid drift. More sophisticated algorithms have been proposed, such as a window of adaptable size or the use of multiple windows of different sizes. The drawback of this approach is that it cannot handle class imbalance.
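A fixed-size sliding window is straightforward to express with a bounded double-ended queue. The sketch below is illustrative only; `SlidingWindow` is a hypothetical name and the window size in the usage note is arbitrary.

```python
from collections import deque


class SlidingWindow:
    """Keeps only the most recent examples, regardless of their class."""

    def __init__(self, size):
        # A deque with maxlen drops the oldest item when a new one arrives.
        self.buffer = deque(maxlen=size)

    def add(self, x, y):
        self.buffer.append((x, y))
        return list(self.buffer)  # current training set
```

With a window of size 3 and a stream whose only positive example arrived five steps ago, the returned training set contains negatives only, which illustrates why a plain window does not help with class imbalance.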
3 Queue-based Resampling
Online class imbalance learning is an emerging research topic, and this work proposes queue-based resampling, a novel algorithm that addresses this problem. Focussing on binary classification, the central idea of the proposed resampling algorithm is to selectively include in the training set a subset of the positive and negative examples that have appeared so far. The closest related work applies an analogous idea, but in the context of chunk-based learning.
The selection of the examples is achieved by maintaining at any given time $t$ two separate queues of equal length $L$, $q_n^t$ and $q_p^t$, that contain the negative and positive examples respectively. Within each queue, examples are kept in order of arrival. Queue-based resampling therefore stores the most recent example plus up to $2L - 1$ old ones. We will refer to the proposed algorithm by its queue length $L$. Of particular interest is the special case $L = 1$, as it has the major advantage of requiring just a single data point from the past.
An example demonstrating how the algorithm works is shown in Figure 1. The upper part shows the examples that arrive at each time step. Positive examples are shown in green. The bottom part shows the contents of each queue at each time step. Focussing on a single time step, we can see that the negative queue contains the two most recent negative examples, while the positive queue contains the most recent positive example, which is carried over from an earlier time step because no new positive example has arrived since.
The union of the two queues, $q^t = q_n^t \cup q_p^t$, is then taken to form the new training set for the classifier. The cost function in Equation 3 aggregates the loss over the examples in $q^t$. At each time step the classifier is updated once according to the cost incurred, i.e. a single update of the classifier’s weights is performed. The pseudocode of our algorithm is shown in Algorithm 1.
The effectiveness of queue-based resampling is attributed to a few important characteristics. Maintaining separate queues for each class helps to address the class imbalance problem. Including positive examples from the past in the most recent training set can be viewed as a form of oversampling. The fact that examples are propagated and carried over a series of time steps allows the classifier to ‘remember’ old concepts. Additionally, to address the challenge of concept drift, the classifier also needs to be able to ‘forget’ old concepts. This is achieved by bounding the length of the queues, which therefore essentially behave like sliding windows as well. As a result, the proposed queue-based resampling method can cope with both class imbalance and concept drift.
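The queue mechanism described in this section can be sketched as follows. This is a minimal illustration rather than the authors' pseudocode: `QueueResampler` and `mean_loss` are hypothetical names, labels are assumed to be 0 (negative) and 1 (positive), and averaging the loss over the training set is an assumption about how the cost is aggregated.

```python
from collections import deque


class QueueResampler:
    """One bounded queue per class; their union is the training set."""

    def __init__(self, queue_length):
        self.queues = {
            0: deque(maxlen=queue_length),  # negative examples
            1: deque(maxlen=queue_length),  # positive examples
        }

    def add(self, x, y):
        # The new example enters the queue of its own class only, so a
        # rare positive example is carried over until it is displaced.
        self.queues[y].append((x, y))
        return list(self.queues[0]) + list(self.queues[1])


def mean_loss(training_set, predict, loss):
    """Average loss over the queue contents (assumed aggregation)."""
    total = sum(loss(y, predict(x)) for x, y in training_set)
    return total / len(training_set)
```

For instance, with `queue_length=2` and the arrivals `("x1", 0)`, `("x2", 1)`, `("x3", 0)`, `("x4", 0)`, the training set after the last arrival holds the negatives `x3` and `x4` together with the older positive `x2`. A classifier would then perform a single weight update on the cost computed over this set.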
4 Experimental Setup
Our experimental study is based on two popular synthetic datasets from the literature, where in both cases a classifier attempts to learn a non-linear decision boundary. These are the Sine and Circle datasets, described below.
Sine. It consists of two uniformly distributed attributes, $x_1$ and $x_2$. The classification function is $x_2 = \sin(x_1)$. Instances below the curve are classified as positive and those above the curve as negative. Feature rescaling has been applied to both attributes.
Circle. It has two uniformly distributed attributes, $x_1$ and $x_2$. The circle function is given by $(x_1 - a)^2 + (x_2 - b)^2 = r^2$, where $(a, b)$ is its centre and $r$ its radius. A single circle with a fixed centre and radius is created. Instances inside the circle are classified as positive and those outside as negative.
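The two labelling rules, and a stream with a chosen proportion of positives, can be sketched as below. All numeric values are assumptions made for illustration: the attribute ranges, the circle centre `(0.5, 0.5)` and the radius `0.2` are hypothetical and not taken from this study, and rejection sampling is just one simple way to impose an imbalance rate.

```python
import math
import random


def sine_label(x1, x2):
    """Positive (1) when the point lies below the sine curve."""
    return int(x2 < math.sin(x1))


def circle_label(x1, x2, centre=(0.5, 0.5), radius=0.2):
    """Positive (1) when the point lies inside the circle."""
    return int((x1 - centre[0]) ** 2 + (x2 - centre[1]) ** 2 <= radius ** 2)


def sample_sine(rng):
    # Assumed ranges: one full period horizontally, [-1, 1] vertically.
    return rng.uniform(0.0, 2.0 * math.pi), rng.uniform(-1.0, 1.0)


def sample_circle(rng):
    # Assumed range: the unit square.
    return rng.random(), rng.random()


def imbalanced_stream(sample_point, label_fn, positive_rate, n, seed=0):
    """Yield n labelled points with roughly the requested share of positives."""
    rng = random.Random(seed)
    for _ in range(n):
        wanted = int(rng.random() < positive_rate)
        while True:  # rejection sampling until the class matches
            x = sample_point(rng)
            if label_fn(*x) == wanted:
                break
        yield x, wanted
```

An abrupt change in the prior could be simulated by chaining two such streams with different `positive_rate` values.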
Our baseline classifier is a neural network consisting of one hidden layer with eight neurons. Its configuration is as follows:
weight initialisation, backpropagation and the optimisation algorithms, learning rate of , 
as the activation function of the hidden neurons, sigmoid activation for the output neuron, and the binary cross-entropy loss function.
Table 1: Summary of the compared methods.
| Method | Class imbalance | Concept drift | Access to old data |
| Sliding window | no | yes | yes () |
For our study we implemented a series of state-of-the-art methods as described in Section 2.3. We implemented a cost sensitive version of the baseline which we will refer to as ; the cost of the positive class is set to as in . Furthermore, the sliding window method has been implemented with a window size of . Moreover, the has been implemented with the time decay factor set to for calculating the class size at any given time.
For the proposed resampling method we will use the special case and another case where . Section 5.1 performs an analysis of by examining how the queue length affects the behaviour and performance of queue-based resampling. For a fair comparison with the sliding window method, we will set the window size to i.e. both methods will have access to the same amount of old data examples. A summary of the compared methods is shown in Table 1 indicating which methods are suitable for addressing class imbalance and concept drift. It also indicates whether methods require access to old data and, if yes, it includes the maximum number in the brackets.
A popular and suitable metric for evaluating algorithms under class imbalance is the geometric mean, as it is not sensitive to the class distribution. It is defined as the geometric mean of recall and specificity. Recall is defined as the true positive rate ($R = TP / P$) and specificity as the true negative rate ($S = TN / N$), where $TP$ and $P$ are the number of true positives and positives respectively, and similarly $TN$ and $N$ for the true negatives and negatives. The geometric mean is then calculated as $G\text{-}mean = \sqrt{R \times S}$. To calculate recall and specificity online, we use prequential evaluation with a fixed fading factor. In all graphs we plot the prequential G-mean at every time step, averaged over 30 runs, with error bars showing the standard error around the mean.
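One way to compute a prequential G-mean with fading factors is sketched below. It assumes that the faded counts of correct and seen examples are kept separately per class and are only faded when an example of that class arrives; this bookkeeping choice, the class name `PrequentialGMean` and the fading factor value in the usage note are assumptions.

```python
import math


class PrequentialGMean:
    """Faded per-class accuracy combined into a geometric mean."""

    def __init__(self, fading_factor):
        self.alpha = fading_factor
        self.correct = {0: 0.0, 1: 0.0}  # faded count of correct predictions
        self.seen = {0: 0.0, 1: 0.0}     # faded count of examples per class

    def update(self, y_true, y_pred):
        hit = 1.0 if y_true == y_pred else 0.0
        self.correct[y_true] = hit + self.alpha * self.correct[y_true]
        self.seen[y_true] = 1.0 + self.alpha * self.seen[y_true]

    def value(self):
        recall = self.correct[1] / self.seen[1] if self.seen[1] else 0.0
        specificity = self.correct[0] / self.seen[0] if self.seen[0] else 0.0
        return math.sqrt(recall * specificity)
```

A classifier that always predicts the negative class obtains a G-mean of zero under this measure, however rare the positive class is, which is why the metric suits imbalanced streams. For example, calling `update(y_true, y_pred)` after every prediction with `fading_factor=0.99` gives a running value that emphasises recent performance.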
5 Experimental Results
5.1 Analysis of Queue-based Resampling
In this section we investigate the behaviour of resampling under various queue lengths () and examine how these affect its performance. Furthermore, we consider a balanced scenario (i.e. ) and a scenario with a severe class imbalance of (i.e. ).
Figures (a)a and (b)b depict the behaviour of the proposed method on the balanced and severely imbalanced scenario respectively for the Sine dataset. It can be observed from Figure (a)a that the larger the queue length the better the performance, specifically, the best performance is achieved when . It can be observed from Figure (b)b that the smaller the queue length the faster the learning speed. dominates in the first 500 time steps, however, its end performance is inferior to the rest. The method with dominates for over 3000 steps. Given additional learning time the method with achieves the best performance. The method with is unable to outperform the one with after 5000 steps, in fact, it performs similarly to .
It is important to emphasise that contrary to offline learning where the end performance is of particular concern, in online learning both the end performance and learning time are of high importance. For this reason, we have decided to focus on as it constitutes a reasonable trade-off between learning speed and performance. As already mentioned, we will also focus on as it has the advantage of requiring only one data example from the past.
5.2 Comparative Study
Figure (a)a depicts a comparative study of all the methods in the scenario involving class imbalance for the Circle dataset. The baseline method, as expected, does not perform well and only starts learning after about 3000 time steps. The proposed has the best performance at the expense of a late start. also outperforms the rest although towards the end other methods like close the gap. Similar results are obtained for the Sine dataset but are not presented here due to space constraints.
Figure (b) shows how the methods compare to each other in the  class imbalance scenario. Both the proposed methods outperform the state-of-the-art . Despite the fact that  performs considerably better than , it requires about 1500 time steps to surpass it. Additionally, we stress that the proposed method with a queue length of one only requires access to a single old example.
We now examine the behaviour of all methods in the presence of both class imbalance and drift. Figures (a) and (b) show the performance of all methods for the Sine and Circle datasets respectively. Initially, class imbalance is  but at time step  an abrupt drift occurs and this becomes . At the time of drift we reset the prequential G-mean to zero, thus ensuring that the performance observed is not affected by the performance prior to the drift. Similar results are observed for both datasets.  outperforms the rest at the expense of a late start.  starts learning fast, initially it outperforms other methods but their end performance is close.  is affected more by the drift in the Sine dataset but recovers soon. The baseline method outperforms its cost-sensitive version after the drift because the pre-defined costs of the cost-sensitive method are no longer suitable in the new situation.
Online class imbalance learning constitutes a new problem and an emerging research topic. We propose a novel algorithm, queue-based resampling, to address this problem. Focussing on binary classification problems, the central idea behind queue-based resampling is to selectively include in the training set a subset of the negative and positive examples by maintaining at any given time a separate queue for each class. It has been shown to outperform state-of-the-art methods, particularly in scenarios with severe class imbalance. It has also been demonstrated to work well when abrupt concept drift occurs. Future work will examine the behaviour of queue-based resampling under various other types of concept drift (e.g. gradual). A challenge faced in the area of monitoring of critical infrastructures is that the true label of examples can be noisy or even unavailable. We plan to address this challenge in the future.
This work has been supported by the European Union’s Horizon 2020 research and innovation programme under grant agreement No 739551 (KIOS CoE) and from the Republic of Cyprus through the Directorate General for European Programmes, Coordination and Development.
-  Ditzler, G., Roveri, M., Alippi, C., Polikar, R.: Learning in nonstationary environments: A survey. IEEE Computational Intelligence Magazine 10(4), 12–25 (2015)
-  Gama, J., Medas, P., Castillo, G., Rodrigues, P.: Learning with drift detection. In: Brazilian symposium on artificial intelligence. pp. 286–295. Springer (2004)
-  Gama, J., Sebastião, R., Rodrigues, P.P.: On evaluating stream learning algorithms. Machine Learning 90(3), 317–346 (2013)
-  Gao, J., Ding, B., Fan, W., Han, J., Yu, P.S.: Classifying data streams with skewed class distributions and concept drifts. IEEE Internet Computing 12(6) (2008)
-  Ghazikhani, A., Monsefi, R., Yazdi, H.S.: Recursive least square perceptron model for non-stationary and imbalanced data stream classification. Evolving Systems 4(2), 119–131 (2013)
-  He, H., Garcia, E.A.: Learning from imbalanced data. IEEE Transactions on Knowledge & Data Engineering (9), 1263–1284 (2008)
-  He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In: Proceedings of the IEEE international conference on computer vision. pp. 1026–1034 (2015)
-  Hoens, T.R., Polikar, R., Chawla, N.V.: Learning from streaming data with concept drift and imbalance: an overview. Progress in Artificial Intelligence 1(1), 89–101 (2012)
-  Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
-  Kyriakides, E., Polycarpou, M.: Intelligent monitoring, control, and security of critical infrastructure systems, vol. 565. Springer (2014)
-  Lazarescu, M.M., Venkatesh, S., Bui, H.H.: Using multiple windows to track concept drift. Intelligent data analysis 8(1), 29–59 (2004)
-  Maas, A.L., Hannun, A.Y., Ng, A.Y.: Rectifier nonlinearities improve neural network acoustic models. In: Proc. ICML. vol. 30, p. 3 (2013)
-  Mao, W., Wang, J., Wang, L.: Online sequential classification of imbalanced data by combining extreme learning machine and improved smote algorithm. In: Neural Networks (IJCNN), 2015 International Joint Conference on. pp. 1–8. IEEE (2015)
-  Wang, J., Zhao, P., Hoi, S.C.: Cost-sensitive online classification. IEEE Transactions on Knowledge and Data Engineering 26(10), 2425–2438 (2014)
-  Wang, S., Minku, L.L., Yao, X.: Resampling-based ensemble methods for online class imbalance learning. IEEE Transactions on Knowledge and Data Engineering 27(5), 1356–1368 (2015)
-  Wang, S., Minku, L.L., Yao, X.: A systematic study of online class imbalance learning with concept drift. IEEE Transactions on Neural Networks and Learning Systems (2018)
-  Zhou, Z.H., Liu, X.Y.: Training cost-sensitive neural networks with methods addressing the class imbalance problem. IEEE Transactions on Knowledge and Data Engineering 18(1), 63–77 (2006)