1 Introduction
Online changepoint detection is the problem of sequentially detecting distributional changes in data streams, as soon as such changes occur. It has numerous applications, ranging from statistical process control, e.g. financial time series and medical condition monitoring (Hawkins et al., 2003; Bansal and Zhou, 2002; Aminikhanghahi and Cook, 2017; Truong et al., 2018)
, to problems in machine learning which can involve training very complex and highly parametrized models from a sequence of learning tasks
(Ring, 1994; Robins, 1995; Schmidhuber, 2013; Kirkpatrick et al., 2017). In this latter application, referred to as continual learning, it is often desirable to train a complex neural network online from a stream of observations and simultaneously detect the changepoints at which a task change occurs. However, current algorithms for simultaneous online learning and changepoint detection are not well suited for models, such as neural networks, that can have millions of adjustable parameters. For instance, while state-of-the-art Bayesian online changepoint detection algorithms have been developed (Fearnhead, 2006; Fearnhead and Liu, 2007; Adams and MacKay, 2007; Caron et al., 2012; Yildirim et al., 2013)
, such techniques can be computationally too expensive to use alongside neural networks. This is because they are based on Bayesian inference procedures that require selecting suitable priors for all model parameters, and they rely on accurate online Bayesian inference, which is generally intractable unless the model has a simple conjugate form. For instance, the popular techniques in
Fearnhead and Liu (2007); Adams and MacKay (2007) are tested on simple Bayesian conjugate models where exact integration over the parameters is feasible. Clearly, such Bayesian computations are intractable or very expensive for highly nonlinear models such as neural networks, which can contain millions of parameters. In this article, we develop a framework for joint sequential changepoint detection and online model fitting that can be easily applied to arbitrary systems and is particularly suited for highly parametrized models such as neural networks. The key idea we introduce is to sequentially perform statistical hypothesis testing by evaluating predictive scores under cached model checkpoints. Such checkpoints are periodically updated copies of the model parameters, and they are used to detect distributional changes by performing predictions on future/unseen data (relative to the checkpoint), i.e. on data observed after a checkpoint and up to the present time. An illustration of the approach is given by Fig.
1, while a detailed description of the method is given in Section 3 and Algorithm 1. In statistical testing for change detection we use generalized likelihood ratio tests (Csorgo and Horváth, 1997; Jandhyala et al., 2002), where we bound the Type I error (false positive detection error) during the sequential testing process. The overall algorithm is easy to use and requires the user to specify two main hyperparameters: the desired bound on the Type I error and the testing window size between a checkpoint and the present time.
We demonstrate the efficiency of our method on time series as well as on continual learning of neural networks from a sequence of supervised tasks with unknown task changepoints. For these challenging continual learning problems we also consider a strong baseline for comparison, by using a variant of the Bayesian online changepoint detection algorithm of Adams and MacKay (2007) that is easily applicable to complex models. This is done by applying the Bayesian algorithm to the predictive scores provided by the neural network during online training. Our proposed method consistently outperforms this and other baselines (see Section 6), which shows that model checkpoints provide an easy-to-use and effective technique for changepoint detection in online learning of complex models.
The paper is organized as follows. Section 2 introduces the problem of changepoint detection and online model learning. Section 3 develops our framework for changepoint detection using checkpoints and Section 4 considers applications to continual learning with neural networks. Section 5 discusses related work, while Section 6 provides numerical experimental results and comparisons with other methods. The Appendix contains further details about the method and additional experimental results.
2 Problem Setup
2.1 Streaming Data with Unknown Changepoints
We consider the online learning problem with unknown changepoints in a stream of observations . Each
includes an input vector and possibly additional outputs, such as a class label or a real-valued response. For instance, in a supervised learning setting each observation takes the form
where is the input vector and the desired response, such as a class label, while in unsupervised learning
is an input vector alone, i.e.. In addition, for many applications, e.g. in deep learning
(LeCun et al., 2015), can be a small set or minibatch of individual i.i.d. observations, i.e. , that are received simultaneously at time . In the generation process of we assume that there exist certain times, referred to as changepoints and denoted by , that result in abrupt changes in the data distribution, so that and in general
(1) 
where and with the convention . Each is the segment- or task-specific distribution that generates the th data segment. These assumptions are often referred to as partial exchangeability or the product partition model (Barry and Hartigan, 1992). To learn from such data we wish to devise schemes that can adapt a parametrized model online without knowing the changepoints and the distributions . Accurate sequential detection of changepoints can be useful since, knowing them, the learning system can dynamically decide to switch to a different parametric model, add new parameters to a shared model, etc. In Section 3, we introduce a general online learning and changepoint detection algorithm suitable for arbitrary models, ranging from simple single-parameter models to complex deep networks having millions of parameters.
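As a concrete illustration of this generative process, the sketch below samples a 1-dimensional stream from a product-partition-style model with Gaussian segments. The Gaussian segment distributions, the minimum gap between changepoints, and the per-step change probability are all illustrative assumptions, not choices made in the paper.

```python
import random

def generate_stream(T, mean_range=(-5.0, 5.0), sigma=1.0,
                    min_gap=50, change_prob=0.02, seed=0):
    """Sample a stream of T scalars: i.i.d. Gaussian draws within each
    segment, with an abrupt mean shift at each changepoint."""
    rng = random.Random(seed)
    mu = rng.uniform(*mean_range)
    changepoints, since_last, xs = [], 0, []
    for t in range(T):
        if since_last >= min_gap and rng.random() < change_prob:
            mu = rng.uniform(*mean_range)   # new segment distribution
            changepoints.append(t)
            since_last = 0
        xs.append(rng.gauss(mu, sigma))
        since_last += 1
    return xs, changepoints
```

Note that consecutive changepoints are at least `min_gap` steps apart, matching the minimum-segment-size assumption made later for the detection algorithm.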
2.2 Online Model Learning with Changepoints
We consider a probabilistic model with parameters that we wish to train online and simultaneously use it to detect the next changepoint . Online training of means that for each observation we perform, for instance, a gradient update
(2) 
where is the step size or learning rate. Given the nonstationarity of the learning problem, this sequence should not satisfy the Robbins-Monro conditions (Robbins and Monro, 1951) and, for instance,
could be constant through time. The loss function in Eq. (
2) is typically the negative log-likelihood function, i.e. , and denotes the parameter values after having seen observations, including the most recent . We will refer to the evaluations of the loss , or any other score function, on any future data (relative to ), with , as prediction scores. Notice that, given that the s have been drawn i.i.d. from the same data segment, the prediction scores are also i.i.d. random variables.
Suppose at time the data segment or task is and we observe , where is the most recently detected changepoint, i.e. when the th task started as shown in Eq. (1). If we decide that data comes from a new task , we could set , instantiate a new model with a fresh set of parameters and repeat the process. All these models could have completely separate parameters, i.e. or allow parameter sharing, i.e. , as further discussed in Section 4 where we describe applications to continual learning.
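The online update of Eq. (2) and the prediction scores can be sketched for the simplest possible case, a single-parameter Gaussian-mean model with squared-error loss. The function names and the choice of loss are illustrative stand-ins for the paper's generic model and loss, not its actual notation.

```python
def sgd_step(theta, x, lr=0.1):
    """One online gradient step for the toy loss
    L(theta; x) = 0.5 * (x - theta)**2, whose gradient is (theta - x)."""
    return theta - lr * (theta - x)

def prediction_scores(theta_ckpt, future_xs):
    """Prediction scores: the loss of a *frozen* checkpoint evaluated on
    data it has never seen; i.i.d. within a segment if the data are."""
    return [0.5 * (x - theta_ckpt) ** 2 for x in future_xs]
```

If the stream's mean shifts after the checkpoint was cached, the scores computed under the frozen checkpoint jump, which is the signal the tests below look for.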
3 Changepoint Detection with Checkpoints
Throughout this section we are interested in detecting the next changepoint . Thus, to simplify notation, we will drop index and write this changepoint as and the current parameters as when it does not cause confusion. We will also assume that the previously detected changepoint is at time zero.
The iterative procedure for changepoint detection with checkpoints is illustrated in Fig. 1. The algorithm assumes that, together with the current parameter values , we cache in memory one or multiple copies of earlier values of the parameters, referred to as model checkpoints or simply checkpoints. Checkpoints are cached and deleted periodically, and statistical testing for changepoint detection is also performed periodically. When, at iteration where the model parameters are , we need to perform a changepoint detection test, the algorithm activates the checkpoint that was cached iterations earlier. This checkpoint forms predictions on all subsequent data (not seen by the checkpoint) in order to detect a change in the data distribution that possibly occurs in the data segment in . Pseudocode of the algorithm is provided in Algorithm 1.
In the remainder of this section we discuss in detail how Algorithm 1 works. Some useful summarizing remarks to keep in mind are the following. The algorithm caches checkpoints every iterations, with the first checkpoint cached at . is the window size, is the stride and is the minimum sample size when computing the testing statistics. The first test occurs at time , i.e. when the data buffer becomes full, and the first cached checkpoint used is the initial parameter values . This also constrains the minimum size of a data segment (i.e. the distance between two consecutive changepoints) to be . After the first test, testing occurs every iterations and, given that each checkpoint is deleted after a test, the number of checkpoints in memory is roughly .

3.1 Offline Changepoint Detection in a Window
Suppose a sliding window of observations (recall that can generally be a set or minibatch of i.i.d. individual observations) of size , i.e. all data observed strictly after the model checkpoint . Given a scalar prediction score function:
or for short, we consider the offline changepoint detection problem in the interval
with the following independence and normality assumption:

Assumption 1.

We consider the following hypothesis testing problem with unknown change time, mean and variance:

(null hypothesis, no change) ;

(alternative hypothesis, change at some time ) s.t. , ,

where is the minimum sample size in each segment of for estimating the mean and variance. Using a model checkpoint and applying the test to its predictions is important for the independence assumption on the scores to hold.
Following the generalized likelihood ratio (GLR) test, we denote by the likelihood ratio of the two hypotheses at a candidate changepoint of , with the unknown variables taking their maximum likelihood estimates,
(3) 
and compute the statistic as follows:
(4) 
where is the sample variance, denotes the set of all values , the values and their union.
The asymptotic distribution of the statistic as has been well studied in the literature for the normal distribution of (Csorgo and Horváth, 1997; Jandhyala et al., 2002). For a finite window size , we can also compute the critical region, at a given confidence level
, numerically; see the Appendix. When the null hypothesis is rejected, we claim there is a changepoint in the current window
, and the changepoint is selected with . It is important to note that the alternative hypothesis is not the complement of the null, and we consider candidate changepoints in a subset for a reliable estimate of the sample mean and variance. This means that when a true changepoint lies at the right border , it might cause a rejection and show up at the nearest location in the subset, i.e. , which subsequently could increase the error of the changepoint location estimate. To reduce this effect we can compute over the extended subset , and not reject if ( is still taken in ). Notice that there are samples on the right side when , satisfying our requirement for the minimum sample size.
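The GLR computation over a window can be sketched as follows for a change in mean and variance of i.i.d. Gaussian scores. This is a sketch in the spirit of Eqs. (3)-(4); the paper's exact statistic may differ by constant factors, and the minimum-segment restriction is expressed here by limiting candidates k to [h, n - h].

```python
import math

def glr_statistic(scores, h):
    """Max GLR statistic over candidate changepoints k in [h, n - h],
    comparing one Gaussian fit for the whole window against separate
    Gaussian fits for the two segments (MLE means and variances)."""
    n = len(scores)

    def mle_var(xs):
        m = sum(xs) / len(xs)
        return max(sum((x - m) ** 2 for x in xs) / len(xs), 1e-12)

    v_all = mle_var(scores)
    best_stat, best_k = float("-inf"), None
    for k in range(h, n - h + 1):
        stat = (n * math.log(v_all)
                - k * math.log(mle_var(scores[:k]))
                - (n - k) * math.log(mle_var(scores[k:])))
        if stat > best_stat:
            best_stat, best_k = stat, k
    return best_stat, best_k
```

A rejection then corresponds to `best_stat` exceeding a critical value calibrated at the chosen confidence level, with `best_k` as the estimated changepoint location.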
We repeat the offline detection using a sliding window of size with a stride . This ensures that every time location will be included in the candidate subset for exactly one test. An illustration of this is shown in Fig. 1 where the green border is precisely the location in the extended subset, which is ignored if the maximum occurs there, but it could be accepted in the next iteration where the green location becomes the first location in the new subset. Similarly, the possibility that a changepoint occurs in the left border can be detected in a previous window.
3.2 Online Changepoint Detection across Windows
As the interval between two changepoints spans multiple test windows, we would like to control, within every data segment, the overall probability of a false rejection of the null hypothesis, that is, of falsely claiming that the data distribution changes. Since the model checkpoints change at every test window, it is difficult to apply a standard sequential likelihood ratio test across windows. Instead, we select the confidence level with an annealing schedule so that the overall error is bounded:
(5) 
where is the confidence level for the th (0-based) test window after a new task is detected and is the decaying rate.
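One schedule satisfying a bound of the form of Eq. (5) is geometric decay, sketched below. The paper's exact schedule is not reproduced here; this is one concrete choice whose per-window levels sum (over all windows) to the total Type I error budget.

```python
def window_alpha(alpha_total, gamma, i):
    """Confidence level for the i-th (0-based) test window after the
    last detection: alpha_i = alpha_total * (1 - gamma) * gamma**i,
    so that sum_{i>=0} alpha_i = alpha_total (geometric series)."""
    return alpha_total * (1.0 - gamma) * gamma ** i
```

A union bound over windows, as in the proof of Proposition 1, then bounds the overall false-detection probability within a segment by `alpha_total`.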
Proposition 1.

Given that Assumption 1 holds, the probability of making a Type I error (false changepoint detection) by Algorithm 1 between two real changepoints is upper bounded by .

Proof.
Let be the segment of the data stream with the same distribution, where is the time of the last minibatch of data, . Offline changepoint detection is conducted in windows , , , , before a changepoint occurs, where is the maximum integer with . Under Assumption 1, the probability of rejecting the null hypothesis in the th testing window with input error argument is
(6) 
where the first inequality is due to the possibility of ignoring a rejection when , and the second equality follows from the definition of in Algorithm 2 as the quantile.
The probability that the null hypothesis is rejected in at least one testing window is then upper bounded with the union bound by
(7) 
∎
The time complexity of running Algorithm 1 on a data stream of length is , where is the minibatch size and is the stride of the sliding window, and the space complexity for storing the checkpoints and the data buffer is , where and denote the sizes of a model checkpoint and a data point, respectively.
3.3 Setting the Hyperparameters and Prediction Scores
There are a few hyperparameters in our proposed algorithm, including the window size , the minimum sample size in a window , the Type I error , and the error decaying factor . A large window size provides more data for every offline detection and usually leads to higher accuracy. However, the space complexity increases linearly with the window size, in order to keep a data buffer of size , and the window size is upper bounded by our prior assumption on the minimum distance between two consecutive changepoints. Also, when the score function requires a good model fit in order to be discriminative between tasks, a smaller window size can be beneficial at the beginning of a new task, because it updates the model checkpoint more frequently and thereby improves the discriminative power in detecting a changepoint more quickly. We study the effect of empirically in our experiments.
A sufficiently large minimum sample size is important to obtain a reliable estimate of the sample variance and stabilize the distribution of the statistic . But too large a value of reduces the range of candidate locations and decreases the power of a single offline detection test. Also, because the sliding window has a stride of , the time complexity increases with . In our experiments, we use a default value , giving . Notice that with such default settings only two checkpoints need to be kept in memory (since ), resulting in a small memory cost.
Given a total Type I error , the decaying factor controls the exponential distribution of the error across windows with mean . In principle, we would like the mass of the error to be distributed over the support of our prior about the changepoint frequency. Lacking this knowledge, we use in all the experiments.

The prediction score function must be discriminative with respect to data streams from different tasks. When the model parameters are well fitted to the current task, a properly chosen score function is usually sensitive to a change of task. Nevertheless, we emphasize that being fitted to the current task is not a necessary condition for our changepoint detection method to work. As demonstrated in the example of Section 6.1, in some cases our algorithm can detect the changepoints robustly regardless of the learning rate in the update rule in Eq. (2), which affects how well the model is fitted to the data.
A key assumption of our detection algorithm is the normal distribution of the score function defined on every minibatch of data. In experiments with continuous observations we use the average negative log-likelihood as the score , and in experiments with discrete observations we find it works better to apply another logarithm operation, as
where is a small jittering term for numerical stability. Fig. 2 shows typical histograms of scores in a testing window from our continual learning experiments with real-world data; see Section 6. We also apply D'Agostino and Pearson's normality test (d'Agostino Ralph B, 1971; D'Agostino and Pearson, 1973) to a sample of 100 scores in this setting and show the p-value in the caption of each plot. It is clear that normality improves with a larger minibatch size
due to the central limit theorem, and that the distribution of scores in the log domain is closer to a normal distribution. We show in the experiments that the performance of our detection improves significantly with the minibatch size.
Typical histograms of scores in the SplitMNIST experiment when no changepoint occurs. The top row uses the score of the mean negative log-likelihood, and the bottom row applies another logarithm transformation. The minibatch size from left to right is 10, 50 and 100, respectively. The p-value of the normality test based on 100 samples is shown in each subfigure caption.
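The log-domain score transform described above, together with a lightweight normality diagnostic, can be sketched as follows. The skewness check is a simple stand-in for the D'Agostino-Pearson test used in the paper, and the jitter value is illustrative.

```python
import math

def log_domain_score(nll_values, eps=1e-6):
    """Minibatch score for discrete observations: average negative
    log-likelihood followed by a second logarithm, with a small
    jitter eps for numerical stability."""
    s = sum(nll_values) / len(nll_values)
    return math.log(s + eps)

def sample_skewness(xs):
    """Rough normality diagnostic: skewness should be near 0 for
    approximately normal scores (population moments, biased)."""
    n = len(xs)
    m = sum(xs) / n
    s2 = sum((x - m) ** 2 for x in xs) / n
    return sum((x - m) ** 3 for x in xs) / (n * s2 ** 1.5)
```

Larger minibatches make the averaged score closer to normal by the central limit theorem, which the skewness of a window of scores can loosely confirm.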
4 Application to Continual Learning
To test our method on a challenging online model fitting and changepoint detection problem, we consider continual learning (CL) (Ring, 1994; Robins, 1995; Schmidhuber, 2013; Goodfellow et al., 2013), which requires training neural networks on a sequence of tasks. Many recent CL methods, see e.g. (Kirkpatrick et al., 2017; Nguyen et al., 2017; Rusu et al., 2016; Li and Hoiem, 2017; Farquhar and Gal, 2018), typically assume known task changepoints, also called task boundaries. Instead, here we wish to train a CL method without knowing the task boundaries and investigate whether we can accurately detect the changepoint locations that quantify when the data distribution changes from one task to the next. The sequential learning and changepoint detection Algorithm 1 can be easily incorporated into existing CL algorithms, since the essential component of the algorithm is the prediction score function used in hypothesis testing. In the following, we combine our algorithm with a standard experience replay CL method (Robins, 1995; Robins and McCallum, 1998; Lopez-Paz et al., 2017; Rebuffi et al., 2017)
, which regularizes stochastic gradient descent model training for the current task by replaying a small subset of observations from previous tasks, as detailed next.
As the main structure of the CL model we consider a feature vector , obtained by a neural network with parameters , where these parameters are shared across all tasks. For each detected th task there is a set of task-specific or private parameters , which are added dynamically into the model each time our algorithm returns a detected changepoint indicating the beginning of a new task. In the experiments, we consider CL problems where each task is a binary or multiclass classification problem, so that each is a different head, i.e. a set of final output parameters, attached to the main network consisting of the feature vector . For instance, if the task is multiclass classification, then is a matrix of size , where
denotes the number of classes. In this case, such task-specific parameters allow the computation of the softmax or multinomial logistic regression likelihood
that models the categorical probability distribution for classifying input data points from the
th task.

We assume that the CL model is continuously trained, so that tasks occur sequentially and are separated by random changepoints. For simplicity, we also assume that previously seen tasks never reoccur. At time , the shared parameters are initialized to some random value, and the parameters of the first task are also initialized arbitrarily, e.g. to zero or randomly. Then, learning of the model parameters progresses so that whenever a changepoint occurs a fresh set of task-specific parameters is instantiated, while all existing parameters, such as the shared parameters , maintain their current values, i.e. they are not reinitialized. However, this continuous updating can cause the shared feature vector to yield poor predictions on early tasks, a phenomenon known in the neural network literature as catastrophic forgetting (Robins, 1995; Goodfellow et al., 2013; Kirkpatrick et al., 2017), and preventing it is one major challenge CL methods need to deal with.
More specifically, at each time instance the model receives a minibatch of training examples , where is a class label and is an input vector. At each time step the current detected task is , so that we have attached to the shared feature vector so far heads, each with task-specific parameters , . The full set of currently instantiated parameters is denoted by to emphasize the dependence on the th task. Training on the current th task is performed by using the standard negative log-likelihood, i.e. cross-entropy loss, regularized by adding a sum of replay-buffer terms, which correspond to negative log-likelihood terms evaluated at small data subsets from all previous tasks,
(8) 
where is a regularization parameter and each is a sum of negative log-likelihood terms over the individual data points. Each is a small random subset of data from the th task that is stored as soon as this task is detected and is then used as experience replay (to avoid forgetting the th task) when training on future tasks.
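An Eq. (8)-style objective can be sketched with a toy 1-dimensional logistic model standing in for a network head; the function names, the model, and the regularization weight are illustrative assumptions.

```python
import math

def nll_logistic(w, batch):
    """Negative log-likelihood of a toy logistic model
    p(y=1|x) = sigmoid(w*x), on a batch of (x, y) pairs."""
    total = 0.0
    for x, y in batch:
        p = 1.0 / (1.0 + math.exp(-w * x))
        p = min(max(p, 1e-12), 1.0 - 1e-12)   # clip for numerical safety
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total

def replay_loss(w, current_batch, replay_buffers, lam=1.0):
    """Current-task NLL plus lam times the summed NLLs over the stored
    replay buffers of previous tasks, as in the experience-replay
    regularizer of Eq. (8)."""
    return nll_logistic(w, current_batch) + lam * sum(
        nll_logistic(w, buf) for buf in replay_buffers)
```

Gradient steps on `replay_loss` then trade off fitting the current task against not forgetting the replayed data from earlier tasks.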
Pseudocode of the whole procedure for training the CL model with simultaneous changepoint detection based on our checkpoint framework is outlined in Algorithm 3. For simplicity, in Algorithm 3 we assume that the replay buffers are global variables that affect the subroutine without having to be passed as inputs. A second simplification is that each task replay buffer is in practice actually created inside Algorithm 1, where a few data minibatches of the current task are stored into the fixed-size memory to form .¹

¹This second simplification was made to keep the structure of Algorithm 1 in its general form; the minor modification regarding the replay buffers is only needed for this specific CL application.
Finally, an interesting aspect of using checkpoints for changepoint detection in CL is that once the next changepoint is detected, and we thus need to instantiate new task parameters , we can reuse one of the checkpoints to avoid the full set of model parameters being contaminated by training updates using data from the new task in iterations , before the task change is known. Specifically, we can reset this full parameter vector to the nearest checkpoint that exists to the left of the changepoint location . This allows the checkpoint to act as a recovery state that can mitigate forgetting of the current th task parameters caused by these extra updates, i.e. for .
5 Related Work
Changepoint detection methods are categorized into offline and online settings (Aminikhanghahi and Cook, 2017; Truong et al., 2018). Offline algorithms such as the recent linear time dynamic programming algorithms (Killick et al., 2012; Maidstone et al., 2017)
operate similarly to the Viterbi algorithm in hidden Markov models
(Bishop, 2006), where they need to observe the full data sequence in order to retrospectively identify multiple changepoints. In contrast, in online changepoint detection the aim is to detect a change as soon as it occurs while data arrive online. Online detection has a long history in statistical process control (Page, 1957; Hawkins et al., 2003), where typically we want to detect a change in a mean parameter of a time series. More recently, Bayesian online changepoint detection methods have been developed (Fearnhead, 2006; Fearnhead and Liu, 2007; Adams and MacKay, 2007) that consider conjugate exponential family models and online Bayesian updates. These latter techniques can be extended to also allow online point estimation of some model parameters (Caron et al., 2012; Yildirim et al., 2013), but they remain computationally too expensive to use in deep learning, where models consist of neural networks. This is because they are based on Bayesian inference procedures that require selecting suitable priors over model parameters, and they rely on accurate online Bayesian inference, which is generally intractable unless the model has a simple conjugate form. Also, approximate inference can be too costly and inaccurate for highly nonlinear and heavily parametrized models.

The method we introduce differs from these previous approaches, since it relies on the idea of a checkpoint, which allows us to detect changepoints by performing multi-step-ahead predictions. This setup provides a stream of 1-dimensional numbers with a simple distribution, on which we can apply standard statistical tools to detect whether an abrupt change exists in a window of these predictions. The checkpoint is updated over time by slowly tracking (within a distance ) the actual model, which can improve the discriminative power over time as the task persists and the checkpoint becomes more specialized to the task distribution.
The method can be considered as a combination of offline and online detection (Aminikhanghahi and Cook, 2017) since, while model parameter learning is online, each testing with a checkpoint involves an offline subroutine; see Algorithm 2.
More distantly related is work from the recent continual learning literature, such as the so-called task-free or task-agnostic methods (Aljundi et al., 2018, 2019; Kaplanis et al., 2018; Zeno et al., 2018; Rao et al., 2019), which learn without knowing or assuming task boundaries. However, the objective there is typically not to explicitly detect changepoints, but instead to maintain overall good predictive performance by avoiding catastrophic forgetting of the neural network model. In contrast, our method aims to explicitly detect abrupt changes in arbitrary online learning systems, either traditional few-parameter models or neural networks used in continual learning. As we discussed in Section 4, and will demonstrate next in Section 6, our algorithm can be combined with existing continual learning methods and enhance them with the ability of changepoint detection.
6 Experiments
6.1 Time Series Example
Fig. 3 shows online changepoint detection on an artificial time series dataset, a small snapshot of which was used in the illustrative example in Fig. 1(b). The task is to track a data stream of 1-dimensional noisy observations (each is a scalar value) with abrupt changes in the mean. The model has a single parameter : a moving average that is updated as data arrive online, where the underlying loss is (which, up to a constant, is the negative log-likelihood of a normal distribution with fixed variance). Fig. 3 shows the changepoint detection achieved by the proposed algorithm. The panel in the top row shows the data, the moving average parameter and the testing windows, together with the corresponding checkpoints that lead to all seven detections. The panel in the bottom row shows the GLR statistic, , computed through time, which clearly attains maximal values at the changepoint locations. The window size was , and . All the changepoints are detected by our algorithm without a false alarm. Every changepoint corresponds to a clear spike in the GLR statistic, significantly higher than the normal range of values.
We also study the impact of a suboptimal choice of the learning rate of the tracking model on our changepoint detection algorithm in Fig. 4. As discussed in Section 3.3, as long as the score function, the negative log-likelihood in this example, is able to differentiate between different tasks, the changepoints can still be robustly detected regardless of whether the model underfits or overfits the data. We should point out, however, that in more complex models it might not be feasible to come up with such discriminative scores. For example, a neural network with random weights will most likely not be so discriminative, i.e. it will provide similar (random) predictions for data coming from different tasks. Nevertheless, with reasonable training the neural network can become more specialized to a certain task and provide predictions that significantly differ from those associated with data from other tasks that the network has not been trained on. Therefore, in more complex models the learning rate needs to be chosen carefully to allow quick adaptation to the task data, so that the score function, computed under checkpoints, can become more discriminative of task changes.
(a) SplitMNIST  (b) PermutedMNIST  (c) CIFAR100  (d) IncrClassMNIST
6.2 Experiments on Continual Learning
In all CL experiments throughout this section the proposed Algorithm 1, checkpoint-based changepoint detection (CheckpointCD), is applied in conjunction with Algorithm 3 and with the following settings:
Note that and are default values, while and were specified by a few preliminary runs on one of the datasets (SplitMNIST). That is, the cutoff value of the Type I error was set to
to maximize performance (Jaccard index) on SplitMNIST, while for all remaining experiments the same cutoff is used and is never re-optimized. The effect of the window size
is also analyzed in Fig. 6.

As a strong baseline for comparison we consider Bayesian online changepoint detection (BayesCD) by Adams and MacKay (2007); see also Fearnhead and Liu (2007). We define an instance of this method that is fully applicable to complex models such as deep neural networks. This is done by treating the one-step predictive scores (averaged over the minibatch at time so that they are close to normality) as sequential observations following a univariate Gaussian, , where the parameters are task-specific. Then, the algorithm detects when
undergoes an abrupt change, by performing full Bayesian inference and recursively estimating the marginal posterior probability of each time being a changepoint, i.e. of the so-called task or run length value returning to zero
(Adams and MacKay, 2007). This involves placing a conjugate normal inverse-gamma prior on ,² together with a prior distribution over changepoints, defined through a hazard function on the run length (Adams and MacKay, 2007), that models the prior probability of each time being a changepoint. Then Bayesian online learning requires marginalizing out all unknowns, i.e.
and the run length. Because of the conjugate and Markov structure of the model, all computations are analytic and the marginal posterior probability of a changepoint, , across time follows a simple and efficient recursion; see Adams and MacKay (2007) for full details. To apply the algorithm to CL we need to choose a cutoff threshold for that allows us to claim a changepoint. We consider a search over different cutoffs and report the best-performing values in Table 1. As the changepoint prior we consider a constant hazard .

²The values of the hyperparameters were chosen as .

We also included in the comparison a simpler baseline (SimpleCD) based on a purely online statistical testing scheme (without requiring the storage of checkpoints) using the one-step-ahead predictive score values , where here is a vector of values and is the minibatch size. Then a standard Welch's t-test can be used to detect a changepoint via a cutoff critical value. In all experiments we considered a set of different critical values and report the best-performing one in Table 1.
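The test used by the SimpleCD baseline can be sketched as Welch's two-sample t statistic on two groups of one-step prediction scores; the function name and the choice of which score groups to compare are illustrative, and the cutoff critical value is chosen separately as in the experiments.

```python
import math

def welch_t(a, b):
    """Welch's two-sample t statistic (unequal variances) comparing
    two groups of prediction scores; a large |t| suggests a shift
    in the score distribution, i.e. a candidate changepoint."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)  # unbiased variances
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    return (ma - mb) / math.sqrt(va / na + vb / nb)
```

Comparing, say, the most recent scores against the scores just before them and thresholding |t| gives the purely online detector described above.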
Furthermore, for both the BayesCD and SimpleCD algorithms we added the constraint that after a detection the algorithm must wait a minimum number of time steps before searching for a new changepoint, i.e. we imposed a minimum distance between two consecutive detections. Without this constraint the behaviour of these algorithms can become very noisy, resulting in many false positive detections around a previously detected changepoint. Note that this minimum-distance constraint is by definition satisfied by CheckpointCD, as shown in Algorithm 1, where the window size hyperparameter enforces it.
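The minimum-distance constraint can be sketched as a simple post-processing filter over raw detection times (an illustrative helper, not part of the authors' code):

```python
def enforce_min_distance(detections, min_gap):
    """Keep a detection only if it is at least `min_gap` steps after
    the previously kept one (the refractory constraint described above).
    `detections` is assumed sorted in increasing time order."""
    kept = []
    for t in detections:
        if not kept or t - kept[-1] >= min_gap:
            kept.append(t)
    return kept
```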
Dataset  method  batch size=10  batch size=20  batch size=50  batch size=100 

SplitMNIST  CheckpointCD  
BayesCD(0.3)  
BayesCD(0.4)  
BayesCD(0.5)  
BayesCD(0.6)  
SimpleCD  
PermutedMNIST  CheckpointCD  
BayesCD(0.3)  
BayesCD(0.4)  
BayesCD(0.5)  
BayesCD(0.6)  
SimpleCD  
SplitCIFAR100  CheckpointCD  
BayesCD(0.3)  
BayesCD(0.4)  
BayesCD(0.5)  
BayesCD(0.6)  
SimpleCD  
IncrClassMNIST  CheckpointCD  
BayesCD(0.3)  
BayesCD(0.4)  
BayesCD(0.5)  
BayesCD(0.6)  
SimpleCD  
IncrClassCIFAR44  CheckpointCD  
BayesCD(0.3)  
BayesCD(0.4)  
BayesCD(0.5)  
BayesCD(0.6)  
SimpleCD 
Average Jaccard index scores, with one standard deviation and a tolerance of 5, on all CL changepoint detection tasks. The numbers in brackets for the BayesCD method indicate different cutoffs on the changepoint posterior probability.

6.2.1 Datasets, CL tasks and results
We first applied the algorithms to three standard CL classification benchmarks: SplitMNIST, PermutedMNIST and SplitCIFAR100. SplitMNIST, introduced by Zenke et al. (2017), constructs five binary classification tasks from the original MNIST (LeCun and Cortes, 2010) handwritten digit classes and presents them sequentially to the algorithm in the following order: 0/1, 2/3, 4/5, 6/7, 8/9. Each task is a binary classification problem, so the task identity cannot be revealed by inspecting the binary labels of a minibatch. In PermutedMNIST (Goodfellow et al., 2013; Kirkpatrick et al., 2017), each task is a variant of the initial 10-class MNIST classification problem where all input pixels have undergone a fixed (random) permutation. A sequence of 10 tasks is considered, so that each task is a 10-class classification problem. For SplitCIFAR100 we assume a sequence of tasks of disjoint classes from the initial CIFAR100 dataset, which contains images of 100 visual categories. We follow Lopez-Paz et al. (2017), so that the first task contains the first group of classes, the second task the next group, and so on.
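As a small illustration of the SplitMNIST construction (the helper names below are ours), each task maps its two digits to binary labels, which is why the labels alone cannot reveal the task identity:

```python
def split_mnist_tasks():
    """The five SplitMNIST binary tasks, in presentation order."""
    return [(0, 1), (2, 3), (4, 5), (6, 7), (8, 9)]

def relabel(digit, task):
    """Map a digit within a task to a binary label 0/1, hiding which
    of the five tasks the example came from."""
    lo, hi = task
    assert digit in (lo, hi)
    return 0 if digit == lo else 1
```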
For Split and PermutedMNIST we consider a neural network with a shared representation obtained by a fully connected multilayer perceptron (MLP) with two hidden layers and rectified linear unit (ReLU) activations. For SplitCIFAR100 we used a much more complex residual network architecture
(He et al., 2016) with 18 layers (ResNet-18), as used by Lopez-Paz et al. (2017). We created a simulated experiment where a true task changepoint can occur independently at each time step with some small probability, with the additional constraint that we must observe at least a minimum number of minibatches from each true task before the next changepoint (in all experiments this minimum number, and the per-step changepoint probability after it, were fixed). We compare the proposed CheckpointCD method with BayesCD and SimpleCD under four different values of the minibatch size: 10, 20, 50, 100. Training of each CL model was based on Algorithm 3, modified accordingly for BayesCD and SimpleCD so as to apply their respective changepoint detection subroutines instead of Algorithm 1 used by CheckpointCD. The learning rate sequence in the stochastic gradient optimization of the objective in Eq. (8) was set in a dimension-wise manner using the Adam optimizer (Kingma and Ba, 2014), which is the standard approach for training neural networks; see the Appendix for further details.

To measure performance we used the intersection over union score (also called the Jaccard index), defined as the number of correctly detected changepoints divided by the size of the union of the detected and the true changepoints,
where larger scores are better. Note that the Jaccard index is the most stringent among related scores such as recall, precision and F1, which are softer upper bounds (i.e. closer to 1). For completeness, full tables with precision and recall values are given in the Appendix. When computing the Jaccard index we also allow some tolerance when declaring that a pair of true and detected locations correspond to the same true changepoint: a detection is considered correct only when it lies within a tolerance of 5 time steps of a true changepoint.

Furthermore, in order to create harder changepoint detection problems, we consider two class-incremental variants of MNIST and CIFAR, in which each task differs from the previous one by changing only a single class, without affecting the labeling of the remaining classes. This creates IncrClassMNIST with 9 tasks:
To speed up the experiment on CIFAR we consider only a reduced number of classes and create a very challenging changepoint detection problem, IncrClassCIFAR44, with tasks (1,2,3,4,5), (6,2,3,4,5), (6,7,3,4,5) and so on.

Table 1 reports all results, obtained from random repetitions of the experiments. The table shows that the proposed algorithm is consistently better than the other methods and provides accurate changepoint detection even for the smallest minibatch sizes. Notice also that, as expected, all methods improve as the minibatch size increases.
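As an illustrative sketch (not the authors' implementation), the Jaccard index with a detection tolerance described above could be computed as follows; the function name and the greedy one-to-one matching scheme are our own assumptions:

```python
def jaccard_with_tolerance(true_cps, detected_cps, tol=5):
    """Jaccard index between true and detected changepoints: a detection
    counts as correct if it lies within `tol` steps of a still-unmatched
    true changepoint (greedy one-to-one matching). The score is the
    number of matches divided by the size of the union of both sets."""
    unmatched = sorted(true_cps)
    matched = 0
    for d in sorted(detected_cps):
        for t in unmatched:
            if abs(d - t) <= tol:
                unmatched.remove(t)  # each true changepoint matches once
                matched += 1
                break
    union = len(true_cps) + len(detected_cps) - matched
    return matched / union if union else 1.0
```

For example, with true changepoints at 100, 200 and 300, detections at 102, 250 and 299 match two of the three, giving a Jaccard score of 2/4 = 0.5.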
Fig. 5 visualizes the GLR values in some of the runs with SplitMNIST, PermutedMNIST, CIFAR100 and IncrClassMNIST. Similarly to Fig. 3, most changepoints are detected by our algorithm, and every changepoint corresponds to a clear spike in the GLR statistic. Note that the plots in Fig. 5 are obtained for the most difficult case, i.e. the smallest minibatch size when fitting the CL model; for larger minibatch sizes the detection is more robust and the spikes of the GLR statistic become sharper.
Finally, Fig. 6 studies the effect of the window size on changepoint detection performance. It shows that too small a window can decrease performance, presumably because of the very small sample size available for each test. This corroborates our discussion in Section 3.3 that a larger window increases the power of the hypothesis test, although the window should not be larger than the minimum expected task length, based on prior knowledge, to avoid including multiple changepoints in the same testing window.
7 Discussion
We have introduced an algorithm for online changepoint detection that can be easily combined with online learning of complex nonlinear models such as neural networks. We have demonstrated the effectiveness of our method on challenging continual learning tasks, where it automatically detects the task changepoints. The use of checkpoints allowed us to define a sequential hypothesis testing procedure that controls a predetermined upper bound on the Type I error, and to evaluate empirically the overall Type I and Type II error performance using the Jaccard index and other metrics.
The simplicity of checkpoints means that practitioners can use them for changepoint detection without having to modify their preferred way of estimating or fitting models to data. For instance, in deep learning (LeCun et al., 2015) the dominant approach to model fitting is point parameter estimation with stochastic gradient descent (SGD), where the model is typically a neural network. As shown in this paper, this can easily be combined with checkpoints to detect changepoints, without modifying the standard SGD model fitting procedure. Similarly, checkpoints could also be combined with other ways of fitting models to data, e.g. Bayesian approaches, since the essence of the algorithm is a cached model representation (not necessarily a point parameter estimate) that, together with a prediction score, can detect changes. For instance, if we follow a Bayesian model estimation approach, online learning requires updating a posterior probability distribution through time. A checkpoint then becomes an earlier version of this posterior distribution, while the predictive score is obtained by averaging some function under this checkpoint posterior. In this setting, the use of the algorithm remains the same, and the only things we need to modify, to accommodate this Bayesian way of model fitting, are the online model update rule (i.e. the corresponding line in Algorithm 1) and the definition of the score function, which should now correspond to a Bayesian predictive score. While Bayesian model fitting is very difficult for complex models such as neural networks, it is certainly feasible for simple conjugate Bayesian models, where we could apply the checkpoint method as outlined above. We leave experimentation with this more Bayesian use of checkpoints for future work.
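A language-agnostic sketch of the checkpoint idea discussed above might look as follows; the model interface (score, update, clone) and the test function are hypothetical placeholders standing in for whatever model-fitting procedure and statistical test the practitioner prefers:

```python
def checkpoint_cd(stream, model, window, test_fn):
    """Sketch of checkpoint-based changepoint detection: cache a copy of
    the model every `window` steps, score each incoming batch under the
    cached checkpoint (i.e. on data unseen by it), and hand the window of
    checkpoint scores to a statistical test. `model` is assumed to expose
    score(batch), update(batch) and clone(); `test_fn` returns True when
    the checkpoint scores indicate a distributional change."""
    checkpoint = model.clone()
    ckpt_scores = []
    detections = []
    for t, batch in enumerate(stream):
        ckpt_scores.append(checkpoint.score(batch))  # predict before updating
        model.update(batch)                          # online model fitting
        if len(ckpt_scores) == window:
            if test_fn(ckpt_scores):
                detections.append(t)
            checkpoint = model.clone()               # refresh the checkpoint
            ckpt_scores = []
    return detections
```

Swapping in a Bayesian model only changes what `clone` caches (a posterior rather than a point estimate) and what `score` computes (a Bayesian predictive score); the detection loop itself is unchanged.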
Finally, another topic for future research is to use checkpoints to detect changes at different time scales, such as long-term and short-term changes.
References
 Adams and MacKay (2007) Adams RP, MacKay D (2007) Bayesian online changepoint detection. Tech. rep.
 Aljundi et al. (2018) Aljundi R, Kelchtermans K, Tuytelaars T (2018) Task-free continual learning. CoRR abs/1812.03596
 Aljundi et al. (2019) Aljundi R, Lin M, Goujaud B, Bengio Y (2019) Online continual learning with no task boundaries. CoRR abs/1903.08671
 Aminikhanghahi and Cook (2017) Aminikhanghahi S, Cook DJ (2017) A survey of methods for time series change point detection. Knowl Inf Syst 51(2):339–367
 Bansal and Zhou (2002) Bansal R, Zhou H (2002) Term structure of interest rates with regime shifts. The Journal of Finance 57(5):1997–2043
 Barry and Hartigan (1992) Barry D, Hartigan JA (1992) Product partition models for change point problems. Annals of Statistics 20(1):260–279
 Bishop (2006) Bishop CM (2006) Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag New York, Inc., Secaucus, NJ, USA
 Caron et al. (2012) Caron F, Doucet A, Gottardo R (2012) Online changepoint detection and parameter estimation with application to genomic data. Statistics and Computing 22(2):579–595
 Csörgő and Horváth (1997) Csörgő M, Horváth L (1997) Limit theorems in changepoint analysis. John Wiley & Sons, Chichester
 D’Agostino and Pearson (1973) D’Agostino R, Pearson E (1973) Tests for departure from normality. Empirical results for the distributions of b2 and √b1. Biometrika pp 613–622
 Farquhar and Gal (2018) Farquhar S, Gal Y (2018) Towards Robust Evaluations of Continual Learning. arXiv preprint arXiv:1805.09733
 Fearnhead (2006) Fearnhead P (2006) Exact and efficient Bayesian inference for multiple changepoint problems. Statistics and Computing 16(2):203–213
 Fearnhead and Liu (2007) Fearnhead P, Liu Z (2007) Online inference for multiple changepoint problems. Journal of the Royal Statistical Society, Series B 69:589–605
 Goodfellow et al. (2013) Goodfellow IJ, Mirza M, Xiao D, Courville A, Bengio Y (2013) An empirical investigation of catastrophic forgetting in gradient-based neural networks. arXiv preprint arXiv:1312.6211
 Hawkins et al. (2003) Hawkins DM, Qiu P, Kang CW (2003) The changepoint model for statistical process control. Journal of Quality Technology 35(4):355–366
 He et al. (2016) He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27–30, 2016, IEEE Computer Society, pp 770–778
 Jandhyala et al. (2002) Jandhyala VK, Fotopoulos SB, Hawkins DM (2002) Detection and estimation of abrupt changes in the variability of a process. Computational Statistics & Data Analysis 40(1):1–19
 Kaplanis et al. (2018) Kaplanis C, Shanahan M, Clopath C (2018) Continual reinforcement learning with complex synapses. arXiv preprint arXiv:1802.07239
 Killick et al. (2012) Killick R, Fearnhead P, Eckley I (2012) Optimal detection of changepoints with a linear computational cost. Journal of the American Statistical Association 107(500):1590–1598
 Kingma and Ba (2014) Kingma DP, Ba J (2014) Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980. Published as a conference paper at the 3rd International Conference on Learning Representations (ICLR), San Diego, 2015
 Kirkpatrick et al. (2017) Kirkpatrick J, Pascanu R, Rabinowitz N, Veness J, Desjardins G, Rusu AA, Milan K, Quan J, Ramalho T, GrabskaBarwinska A, et al. (2017) Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences p 201611835
 LeCun and Cortes (2010) LeCun Y, Cortes C (2010) MNIST handwritten digit database
 LeCun et al. (2015) LeCun Y, Bengio Y, Hinton G (2015) Deep learning. Nature 521(7553):436–444, DOI 10.1038/nature14539
 Li and Hoiem (2017) Li Z, Hoiem D (2017) Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence
 Lopez-Paz et al. (2017) Lopez-Paz D, et al. (2017) Gradient episodic memory for continual learning. In: Advances in Neural Information Processing Systems, pp 6470–6479
 Maidstone et al. (2017) Maidstone R, Hocking T, Rigaill G, Fearnhead P (2017) On optimal multiple changepoint algorithms for large data. Statistics and Computing 27(2):519–533
 Nguyen et al. (2017) Nguyen CV, Li Y, Bui TD, Turner RE (2017) Variational continual learning. arXiv preprint arXiv:1710.10628
 Page (1957) Page ES (1957) On problems in which a change in a parameter occurs at an unknown point. Biometrika 44(1–2):248–252
 D’Agostino (1971) D’Agostino RB (1971) An omnibus test of normality for moderate and large size samples. Biometrika 58(2):341–348
 Rao et al. (2019) Rao D, Visin F, Rusu A, Pascanu R, Teh YW, Hadsell R (2019) Continual unsupervised representation learning. In: Neural Information Processing Systems
 Rebuffi et al. (2017) Rebuffi SA, Kolesnikov A, Sperl G, Lampert CH (2017) iCaRL: Incremental classifier and representation learning. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, pp 5533–5542
 Ring (1994) Ring MB (1994) Continual learning in reinforcement environments. PhD thesis, University of Texas at Austin Austin, Texas 78712
 Robbins and Monro (1951) Robbins H, Monro S (1951) A stochastic approximation method. Ann Math Statist 22(3):400–407, DOI 10.1214/aoms/1177729586
 Robins (1995) Robins A (1995) Catastrophic forgetting, rehearsal and pseudorehearsal. Connection Science 7(2):123–146
 Robins and McCallum (1998) Robins A, McCallum S (1998) Catastrophic forgetting and the pseudorehearsal solution in Hopfield-type networks. Connection Science 10(2):121–135
 Rusu et al. (2016) Rusu AA, Rabinowitz NC, Desjardins G, Soyer H, Kirkpatrick J, Kavukcuoglu K, Pascanu R, Hadsell R (2016) Progressive neural networks. arXiv preprint arXiv:1606.04671
 Schmidhuber (2013) Schmidhuber J (2013) Powerplay: Training an increasingly general problem solver by continually searching for the simplest still unsolvable problem. Frontiers in psychology 4:313
 Truong et al. (2018) Truong C, Oudre L, Vayatis N (2018) Selective review of offline change point detection methods. arXiv preprint arXiv:1801.00718
 Yildirim et al. (2013) Yildirim S, Singh SS, Doucet A (2013) An online expectation–maximization algorithm for changepoint models. Journal of Computational and Graphical Statistics 22(4):906–926
 Zenke et al. (2017) Zenke F, Poole B, Ganguli S (2017) Continual learning through synaptic intelligence. arXiv preprint arXiv:1703.04200
 Zeno et al. (2018) Zeno C, Golan I, Hoffer E, Soudry D (2018) Task agnostic continual learning using online variational Bayes
Appendix A Quantile of statistics in Algorithm 2
For every window size, we compute the quantile of the statistic (the threshold in Algorithm 2) numerically with simulations, and fit a linear function to the result. The tables below show the computed threshold values as a function of the window size, border size, and error level. We also show the fitted line used in the experiments in Figure 7. We observe that the threshold is close to convergence for the larger window sizes.
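The Monte Carlo computation of such quantile thresholds can be sketched generically as follows; the simulator interface, sample count and seed are illustrative assumptions, not the exact procedure used to produce the tables:

```python
import random

def mc_quantile(simulate, error, n_sims=10000, seed=0):
    """Estimate the (1 - error) upper quantile of a test statistic by
    Monte Carlo: draw n_sims realisations from `simulate` (a callable
    taking a random.Random instance; assumed interface) and return the
    empirical quantile, usable as a detection threshold."""
    rng = random.Random(seed)
    draws = sorted(simulate(rng) for _ in range(n_sims))
    k = min(n_sims - 1, int((1.0 - error) * n_sims))
    return draws[k]
```

As in the tables, thresholds grow monotonically as the error level shrinks, since smaller error levels correspond to more extreme upper quantiles.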
Error  

0.1  10.024  9.245 
0.05  11.909  11.084 
0.01  16.089  15.173 
0.001  21.837  20.786 
0.0001  27.464  26.273 
1e05  33.124  31.706 
1e06  38.882  37.244 
Error  

0.1  10.661  9.318  8.948 
0.05  12.518  11.095  10.711 
0.01  16.633  15.043  14.635 
0.001  22.283  20.454  20.018 
0.0001  27.824  25.721  25.253 
1e05  33.200  30.771  30.429 
1e06  38.841  35.969  35.716 
Error  

0.1  11.270  10.303  9.726  9.244  8.789 
0.05  13.099  12.069  11.471  10.976  10.507 
0.01  17.143  15.971  15.334  14.818  14.332 
0.001  22.705  21.315  20.628  20.088  19.578 
0.0001  28.152  26.500  25.769  25.180  24.676 
1e05  33.576  31.591  30.852  30.249  29.740 
1e06  38.957  36.549  35.712  35.165  34.720 
Error  

0.1  11.780  10.257  9.505  8.833 
0.05  13.587  11.989  11.223  10.537 
0.01  17.588  15.817  15.028  14.323 
0.001  23.080  21.057  20.248  19.516 
0.0001  28.482  26.173  25.320  24.611 
1e05  33.874  31.153  30.254  29.514 
1e06  38.938  35.961  35.187  34.499 
Error  

0.1  12.023  10.466  9.769  9.171  8.877 
0.05  13.823  12.194  11.486  10.879  10.578 
0.01  17.799  16.007  15.285  14.666  14.357 
0.001  23.271  21.214  20.471  19.848  19.534 
0.0001  28.667  26.270  25.498  24.868  24.546 
1e05  34.061  31.411  30.528  29.960  29.650 
1e06  39.142  36.424  35.684  35.072  34.756 
Error  

0.1  12.182  10.431  9.682  9.017  8.907 
0.05  13.974  12.154  11.396  10.721  10.609 
0.01  17.935  15.957  15.187  14.496  14.382 
0.001  23.389  21.160  20.378  19.681  19.567 
0.0001  28.729  26.232  25.423  24.730  24.621 
1e05  34.091  31.130  30.309  29.661  29.568 
1e06  39.388  35.642  35.106  34.543  34.319 
Appendix B Further details and results
For all CL experiments in Section 6 we used the Adam optimizer (Kingma and Ba, 2014) with its default parameter settings and a base learning rate that depends on the minibatch size of each experiment. The hyperparameter in the loss function in Eq. (8) and the size of the replay buffer for each previous task were fixed across all experiments.
Table 8 provides the precision scores and Table 9 the recall scores for all algorithms applied to the CL benchmarks.
Dataset  method  batch size=10  batch size=20  batch size=50  batch size=100 

SplitMNIST  CheckpointCD  
BayesCD(0.3)  
BayesCD(0.4)  
BayesCD(0.5)  
BayesCD(0.6)  
SimpleCD  
PermutedMNIST  CheckpointCD  
BayesCD(0.3)  
BayesCD(0.4)  
BayesCD(0.5)  
BayesCD(0.6)  
SimpleCD  
SplitCIFAR100  CheckpointCD  
BayesCD(0.3)  
BayesCD(0.4)  
BayesCD(0.5)  
BayesCD(0.6)  
SimpleCD  
IncrClassMNIST  CheckpointCD  
BayesCD(0.3)  
BayesCD(0.4)  
BayesCD(0.5)  
BayesCD(0.6)  
SimpleCD  
IncrClassCIFAR44  CheckpointCD  
BayesCD(0.3)  
BayesCD(0.4)  
BayesCD(0.5)  
BayesCD(0.6)  
SimpleCD 