Continual Learning for Infinite Hierarchical Change-Point Detection

by Pablo Moreno-Muñoz, et al.

Change-point detection (CPD) aims to locate abrupt transitions in the generative model of a sequence of observations. When Bayesian methods are considered, the standard practice is to infer the posterior distribution of the change-point locations. However, for complex models (high-dimensional or heterogeneous), it is not possible to perform reliable detection. To circumvent this problem, we propose to use a hierarchical model, which yields observations that belong to a lower-dimensional manifold. Concretely, we consider a latent-class model with an unbounded number of categories, which is based on the Chinese-restaurant process (CRP). For this model we derive a continual learning mechanism that is based on the sequential construction of the CRP and the expectation-maximization (EM) algorithm with a stochastic maximization step. Our results show that the proposed method is able to recursively infer the number of underlying latent classes and perform CPD in a reliable manner.




1 Introduction

Change-point detection (CPD), which consists of locating abrupt transitions in the generative model of the observations, is a problem with a plethora of applications. For instance, CPD is widely used in finance [2, 4], the analysis of social networks [14, 9], and cognitive radio [3, 7]. The main focus of CPD methods has traditionally been on batch settings, where the entire sequence of observations is available and has to be segmented. However, CPD is most useful in online scenarios, where change points must be detected as new incoming samples are observed. Online CPD methods have two intertwined tasks to solve: i) segmentation of sequential data into partitions (or segments) and ii) estimation of the generative model parameters for the given partitions.

Since each partition has a different generative distribution, the identifiability of change points is related to the difference between such distributions. In this context, Bayesian inference is useful for inferring the distributions given a prior distribution in a reliable manner. The Bayesian online change-point detection (BOCPD) approach [1] uses this idea to perform density estimation recursively, which yields a more robust detection process because the propagation of uncertainty is taken into account. However, for complex likelihood models, whose number of parameters is much higher than the number of observations between two consecutive change points, reliable CPD becomes infeasible. This can be the case for, although it is not restricted to, high-dimensional and/or heterogeneous observations (a mixture of continuous and discrete variables), which usually have a prohibitive number of parameters.

To address the aforementioned issue, in [10] we presented a hierarchical probabilistic model based on latent classes, i.e., a mixture model. The CPD problem can be carried out directly on the lower-dimensional manifold, where the discrete latent variables lie. Hence, this method requires less evidence than the observational counterpart since the number of parameters is reduced, which yields faster and more reliable detections. However, [10] requires that the number of classes is fixed a priori.

The main contribution of this paper is to introduce a novel approach, based on continual learning [15, 16, 11], to recursively infer the underlying sequence of latent classes, their distributions, and the change points. The key idea of the proposed model is to allow for an unbounded order of the latent model, that is, the number of classes is not fixed and could even become infinite. In particular, we use the Chinese-restaurant process (CRP) [13], which is a well-known Bayesian nonparametric method, to model the latent variables with an unbounded number of classes. That is, the CRP may increase the number of classes as new observations come in. Moreover, as with any mixture model, the expectation-maximization (EM) algorithm [6] is used, but in this work the maximization step (M-step) is replaced by a stochastic M-step [5]. Finally, the experimental results on real data show that both the latent-class inference process and the change-point detection perform reliably.

2 Bayesian Online Change-Point Detection

We start by considering a time series that is divided into non-overlapping partitions. Each partition is separated from its neighbors by change points (cp). Based on [1], we assume that the data within each partition are independent and identically distributed (i.i.d.) according to some generative probability distribution whose parameter vector is unknown. Under this assumption, change points are determined by changes in the parameters, that is, the parameter vectors of two consecutive partitions differ.


The main idea in [1] is the run-length, which is defined as a discrete random variable that counts the number of time-steps since the last cp and may be seen as a proxy for change points. The objective of the BOCPD technique is to compute the posterior distribution of the run-length recursively, from which we will identify a cp if the probability mass accumulates near a run-length of zero.
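For concreteness, the run-length and its posterior can be written as follows. The symbols r_t (run-length at time t) and x_{1:t} (observations up to time t) follow the usual notation of [1] and are assumed here for illustration.

```latex
r_t =
\begin{cases}
0, & \text{if a change point occurs at time } t,\\
r_{t-1} + 1, & \text{otherwise},
\end{cases}
\qquad
p(r_t \mid x_{1:t}) = \frac{p(r_t, x_{1:t})}{\sum_{r_t'} p(r_t', x_{1:t})}.
```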

The posterior distribution of the run-length is obtained by marginalizing the joint distribution over all the values seen so far, which, in turn, is computed by marginalizing the model parameters. The learning of the parameters given the partition, required for this computation, is carried out using a multiple-thread inference mechanism induced by the run-length. For instance, to learn the parameters given a particular run-length, only the observations received since the corresponding change point are required. This parallel inference scheme is depicted in Figure 1.
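The recursion and its parallel threads can be sketched in a few lines. This is a minimal illustration in the spirit of [1], not the implementation used in this work. It assumes a Gaussian likelihood with known variance, a conjugate Gaussian prior on the mean, and a constant hazard rate; the function name `bocpd_gaussian` and all default values are hypothetical.

```python
import numpy as np

def bocpd_gaussian(x, hazard=0.01, mu0=0.0, var0=1.0, var_x=1.0):
    """Run-length posterior for Gaussian data with unknown mean.

    Illustrative assumptions: known observation variance var_x, a conjugate
    N(mu0, var0) prior on the mean, and a constant hazard rate.
    Returns R with R[t, r] = p(run-length = r | first t observations).
    """
    T = len(x)
    R = np.zeros((T + 1, T + 1))
    R[0, 0] = 1.0
    # One posterior (mean, variance) per run-length thread; thread 0 is the prior.
    mu = np.array([mu0])
    var = np.array([var0])
    for t in range(T):
        # Predictive density of x[t] under every thread.
        pred_var = var + var_x
        pred = np.exp(-0.5 * (x[t] - mu) ** 2 / pred_var) / np.sqrt(2 * np.pi * pred_var)
        # Growth: the run continues. Change point: the run-length resets to zero.
        growth = R[t, : t + 1] * pred * (1.0 - hazard)
        cp = np.sum(R[t, : t + 1] * pred * hazard)
        R[t + 1, 0] = cp
        R[t + 1, 1 : t + 2] = growth
        R[t + 1] /= R[t + 1].sum()
        # Conjugate update of every thread, plus a fresh thread from the prior.
        post_var = 1.0 / (1.0 / var + 1.0 / var_x)
        post_mu = post_var * (mu / var + x[t] / var_x)
        mu = np.concatenate(([mu0], post_mu))
        var = np.concatenate(([var0], post_var))
    return R
```

Each entry of `mu` and `var` plays the role of one inference thread: the thread with run-length r has only absorbed the last r observations.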

The inference in [1] may become infeasible when the complexity of the generative model increases, for instance, for high-dimensional and/or heterogeneous observations. That is, if the likelihood for the partition depends on an extremely large number of parameters, it would not be possible to obtain sufficient statistical evidence to detect change points. This problem may render the BOCPD method unusable in some applications.

3 CPD on Hierarchical Models

Figure 1: Illustration of the parallel inference threads for the estimation of the parameters conditioned on the run-length.

The aforementioned problem of the BOCPD for complex generative models can be overcome by introducing hierarchical models. We propose to use latent classes to obtain such a hierarchical model. These latent classes yield observations that belong to a lower-dimensional manifold, and allow us to write the generative distribution of the observations in terms of a latent variable. This latent variable is a categorical random variable with a given maximum number of classes (or categories), whose vector of parameters contains the probability of each class. This form of latent-class model can be seen as a mixture model.

Even assuming a hierarchical model, we are still interested in the posterior distribution of the run-length given the observations, which would require a marginalization over the latent variables.

However, for long sequences and a large number of classes, the marginalization in (3) is computationally infeasible due to the combinatorial sums. In [10], to avoid the marginalization, we assumed that we observe the latent variables, instead of marginalizing them, by directly plugging in the values of their maximum a posteriori (MAP) estimates, which are given in (4).


Now, using the MAP estimates as observations and assuming that the joint distribution on the right-hand side (r.h.s.) of (3) factorizes accordingly, we are effectively considering that the change points occurred on the sequence of latent classes. Using the extended recursion of [10], which combines the conditional prior with the predictive distribution of the present latent variable conditioned on previous data and the run-length, we have all the ingredients to compute the posterior distribution of the run-length given the MAP estimates, which determines the location of the change points.

3.1 Infinite-dimensional Hierarchical BOCPD

The problem of the hierarchical BOCPD algorithm presented above is that the number of classes must be known and fixed a priori. That is, it is not allowed to vary over time, which can be a stringent condition in some scenarios. In this section, we consider the more interesting case in which the number of classes is unknown and can be time-varying, i.e., new classes may appear as time progresses. Then, we cannot select the order of the latent-class model in advance. A naive idea would be to fix an upper bound on the number of classes and proceed as in the previous section. However, this upper bound might not be available and, even if it is, the performance can be poor, as we will see in Section 5. In the following, we present a method for an unbounded and time-varying number of classes, which is incremented when an unseen type of observation appears. This translates into a hierarchical BOCPD with an unbounded number of classes.

Using an unbounded number of classes results in the following problem when integrating over the class probabilities to compute the predictive distribution. Assuming a Dirichlet distribution for the class probabilities, which is the conjugate prior for categorical distributions and therefore yields a tractable integral in (6), the evidence vanishes as the number of classes grows. To overcome this issue, we can consider an exchangeable distribution over divisions of the latent variables into classes, where a division is independent of the temporal assignments, i.e., different assignment sequences may correspond to the same division of objects. This is often known as the exchangeability property [8, 13] and is a safe assumption in our setup, as we are interested in changes in the probabilities of the latent classes, not in the particular sequences.

The latent-class model with an unbounded dimension can be addressed using the Chinese-restaurant process (CRP) [13], which is a Bayesian nonparametric method [12]. The CRP is based on a metaphor in which clients (observations) are assigned to different tables (latent classes) in a sequential manner. The assignment of classes to objects in the CRP is determined by the predictive posterior distribution, which is given in (8). In this distribution, the counts record the number of assignments to each class up to the current time, only the classes already in use are considered, and a hyperparameter, which corresponds to the natural parameter of a symmetric Dirichlet prior distribution, controls how likely the appearance of a new class is.
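As an illustration of this sequential construction, the following minimal sketch evaluates the textbook form of the CRP predictive distribution: an existing class is chosen with probability proportional to its count, and a new class with probability proportional to the hyperparameter. The names `crp_predictive`, `sample_crp` and `alpha` are hypothetical, and the exact expression in (8) may be parameterized differently.

```python
import numpy as np

def crp_predictive(counts, alpha):
    """Probabilities of the existing classes followed by that of a new class."""
    counts = np.asarray(counts, dtype=float)
    total = counts.sum() + alpha
    return np.append(counts / total, alpha / total)

def sample_crp(n, alpha, seed=0):
    """Draw n sequential class assignments from the CRP."""
    rng = np.random.default_rng(seed)
    counts, z = [], []
    for _ in range(n):
        p = crp_predictive(counts, alpha)
        k = int(rng.choice(len(p), p=p))
        if k == len(counts):
            counts.append(1)   # a new class (table) is opened
        else:
            counts[k] += 1     # an existing class gains one assignment
        z.append(k)
    return z
```

For example, with counts `[3, 1]` and `alpha = 1.0`, the two existing classes receive probabilities 0.6 and 0.2, and a new class receives 0.2.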

Exploiting the aforementioned CRP construction, the computation of the predictive distribution in (5) is straightforward and is given in (9), where we now count the number of MAP estimates equal to each class up to the current time. Notice that this expression is analogous to (8) for a given run-length, i.e., for each parallel thread in Fig. 1. Then, we may proceed to compute the posterior distribution of the run-length.
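The per-thread counting can be sketched as follows. This is one possible reading of the construction, under two assumptions: the thread with run-length r only counts the last r MAP estimates, and a class that does not appear in that window is treated as a new class. The function name `predictive_per_run_length` and the argument names are hypothetical.

```python
import numpy as np

def predictive_per_run_length(z_map, z_new, alpha):
    """CRP-style predictive of label z_new for every run-length r = 0..t.

    z_map holds the t past MAP estimates. The thread with run-length r
    only counts occurrences of z_new among the last r of them.
    """
    t = len(z_map)
    probs = np.empty(t + 1)
    count = 0
    for r in range(t + 1):
        # Extend the window by one label further into the past.
        if r > 0 and z_map[t - r] == z_new:
            count += 1
        if count > 0:
            probs[r] = count / (r + alpha)   # class already seen in this thread
        else:
            probs[r] = alpha / (r + alpha)   # class is new for this thread
    return probs
```

These values would then act as the predictive terms in a run-length recursion over the sequence of MAP estimates.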

One final comment is in order. So far, we have derived a tractable recursive way to introduce latent-class models into Bayesian CPD methods with an unbounded number of classes. However, nothing has been said on how to compute the MAP estimates in a continual learning fashion, which are required in (7). This task is explored in Section 4.

4 Continual Learning of the CRP

1:  Input: Observe and initialize .
2:  Sample
3:  if  then
4:     Initialize
5:  end if
6:  Compute
7:  Compute
8:  Update parameters using (11)
9:  Calculate
10:  if  then
12:  end if
13:  for  to  do
14:     Evaluate using (9)
15:     Calculate
16:     Obtain
17:     Compute
18:     Update
19:  end for
20:  Return:
Algorithm 1 Infinite-dimensional Hierarchical BOCPD

In this section, we compute the MAP estimates of the latent variables in an online and recursive fashion. This task also involves the estimation of the parameters of the mapping between observations and latent variables, that is, the likelihood of an observation given its class. Here, the number of classes increases if the result of sampling from the CRP predictive distribution is a new class. That is, at the beginning of each iteration we create a new class with an emission probability given by (8), which is only kept if the MAP estimate is the new class.

Mixture models do not usually have closed-form solutions for the estimates of the parameters and the class assignments. Therefore, it is necessary to resort to the expectation-maximization (EM) algorithm [6], for which we need the log-likelihood of the complete data, in which the prior distribution of the latent variables factorizes sequentially. This factorization is possible due to the chain rule and the CRP construction described in Section 3.1. Once the complete-data log-likelihood is available, we may apply the expectation step (E-step) and the maximization step (M-step) of the EM algorithm. In this work, we have slightly modified the M-step to fit the proposed continual learning framework. Concretely, the estimation of the parameters at each step is simply performed by taking one iterate of a steepest descent method, yielding a stochastic M-step [5].

The E-step amounts to computing the expectation of the complete-data log-likelihood given the current estimate of the parameters, where we have exploited (8). In the M-step, the estimate of the parameters is updated based on the gradient, as in (11), with an (adaptive) learning rate for each class at each time step. In this expression, we have assumed that the same initial learning rate is chosen for all the parameters of a given class, but it is possible to select multiple learning rates per class. Once we have the E- and M-steps, we can compute the posterior of the latent variable and maximize it to obtain its MAP estimate as in (4). Finally, Algorithm 1 presents all the necessary computations of the proposed recursive method at each time instant, and a Python implementation is available for reproducibility purposes.
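One iteration of the E-step followed by a single gradient update can be sketched as follows for Gaussian emissions. This is a minimal illustration, not the implementation referred to above. It assumes that the class prior (for instance, the CRP predictive weights) is passed in as `prior`, and it parameterizes the variance through its logarithm to keep it positive, which is a choice made here for simplicity. The gradient step is written as ascent on the expected complete-data log-likelihood, which is equivalent to descent on its negative. All names and default learning rates are hypothetical.

```python
import numpy as np

def stochastic_em_step(x, mu, log_var, prior, lr_mu=0.1, lr_var=0.05):
    """One E-step and one gradient-based M-step for a scalar observation x."""
    var = np.exp(log_var)
    # E-step: responsibilities proportional to prior times Gaussian likelihood.
    log_lik = -0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var)
    log_post = np.log(prior) + log_lik
    log_post -= log_post.max()          # numerical stabilization
    resp = np.exp(log_post)
    resp /= resp.sum()
    # Stochastic M-step: a single gradient step per class.
    grad_mu = resp * (x - mu) / var
    grad_log_var = resp * 0.5 * ((x - mu) ** 2 / var - 1.0)
    mu_new = mu + lr_mu * grad_mu
    log_var_new = log_var + lr_var * grad_log_var
    # MAP estimate of the latent class for this observation.
    z_map = int(np.argmax(resp))
    return mu_new, log_var_new, resp, z_map
```

Because the gradients are weighted by the responsibilities, only the classes that plausibly generated the observation are adapted noticeably.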

5 Experiments

Figure 2: Upper row plots show the well-drilling univariate signal for the unbounded latent variable model (left) and the hierarchical CPD method with a fixed number of classes (right). The colors represent latent-class assignments. Bottom row plots show the MAP estimates of the run-length.

In this section we evaluate the performance of the proposed method. We apply the infinite-dimensional hierarchical BOCPD algorithm to real-world data, and in particular, to a sequence of raw nuclear magnetic response measurements taken during a well-drilling process. This data consists of real-valued univariate observations taken at a fixed sampling frequency. In the following, we assume that the time steps are ordered and discrete for simplicity.

To apply the proposed model, we choose the distribution of the observations given the latent class to be Gaussian with unknown mean and variance. Moreover, the model has two hyperparameters that we need to select. The first one, which is related to the CPD method, is the parameter of the hazard function that is used as the conditional prior. In the experiments, we have selected . The second one is the parameter involved in the CRP construction, which controls how likely the appearance of a new unseen class is. We set it to . For the stochastic M-step, we use two different adaptive learning rates for the mean and variance, whose initial values are given by and . Importantly, we made both learning rates decrease at a rate of per time-step if the corresponding class was selected as the most likely latent class. This choice avoids adapting very old parameters with new incoming data.

Figure 2 shows the results obtained for iterations. A video demonstrating the complete simulation of the algorithms is also available. The unbounded model is compared with the hierarchical CPD approach with an upper bound on the number of classes. In the upper figures we can see the well-drilling signals, as well as the latent-class assignments in different colors for both approaches. As can be observed, the final number of classes inferred by the CRP was . In the bottom figures we show the MAP estimates of the run-length. These figures show that the MAP estimation of the run-length aligns quite well with the signal transitions. Furthermore, it should be noted that the proposed method is more robust to outliers: an outlier is captured by the latent-class assignment, but a CP is not declared. In fact, the MAP estimate of the run-length is noisier for the method with a fixed number of classes than for the unbounded model.

Finally, it is important to note that, since the unbounded model creates new classes only as they become necessary, its computational complexity is smaller than that of the hierarchical CPD approach, which needs to estimate the parameters of all its classes at every time step.

6 Discussion and Future Work

This work has extended the Bayesian online change-point detection method to more complex scenarios by considering a hierarchical model, which is based on latent-class variables. To avoid the limitation of fixing the order of the hierarchical model a priori, we allow for an unbounded number of classes using the Chinese-restaurant process. Moreover, the inference of the class assignments is done with an expectation-maximization algorithm, where the M-step is carried out stochastically, that is, only one iteration of a steepest descent method is taken. Finally, the performance of the proposed method is validated empirically on real-world data, which shows its robustness and utility for the aforementioned purposes. In future work, it would be interesting to extend it to multi-channel settings with multivariate generative models.


  • [1] R. P. Adams and D. J. C. MacKay (2007) Bayesian online changepoint detection. arXiv preprint arXiv:0710.3742.
  • [2] E. Andersson, D. Bock, and M. Frisén (2006) Some statistical aspects of methods for detection of turning points in business cycles. Journal of Applied Statistics 33 (3), pp. 257–278.
  • [3] M. Arts, A. Bollig, and R. Mathar (2015) Quickest eigenvalue-based spectrum sensing using random matrix theory. arXiv preprint arXiv:1504.01628v1.
  • [4] I. Berkes, E. Gombay, L. Horváth, and P. Kokoszka (2004) Sequential change-point detection in GARCH (p, q) models. Econometric Theory 20 (6), pp. 1140–1167.
  • [5] O. Cappé and E. Moulines (2009) On-line expectation–maximization algorithm for latent data models. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 71 (3), pp. 593–613.
  • [6] A. P. Dempster, N. M. Laird, and D. B. Rubin (1977) Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological) 39 (1), pp. 1–38.
  • [7] L. Du, C. Liu, M. Laghate, and D. Cabric (2015) Sequential detection of number of primary users in cognitive radio networks. In Asilomar Conf. Signals, Systems and Computers.
  • [8] J. F. C. Kingman (1982) The coalescent. Stochastic Processes and their Applications 13 (3), pp. 235–248.
  • [9] V. Krishnamurthy (2012) Quickest detection POMDPs with social learning: interaction of local and global decision makers. IEEE Transactions on Information Theory 58 (8), pp. 5563–5587.
  • [10] P. Moreno-Muñoz, D. Ramírez, and A. Artés-Rodríguez (2018) Change-point detection on hierarchical circadian models. arXiv preprint arXiv:1809.04197.
  • [11] C. V. Nguyen, Y. Li, T. D. Bui, and R. E. Turner (2018) Variational continual learning. International Conference on Learning Representations (ICLR).
  • [12] P. Orbanz and Y. W. Teh (2010) Bayesian nonparametric models. Encyclopedia of Machine Learning, pp. 81–89.
  • [13] J. Pitman (2002) Combinatorial stochastic processes. Technical report, Dept. Statistics, UC Berkeley.
  • [14] M. Raginsky, R. M. Willett, C. Horn, J. Silva, and R. F. Marcia (2012) Sequential anomaly detection in the presence of noise and limited feedback. IEEE Transactions on Information Theory 58 (8), pp. 5544–5562.
  • [15] M. B. Ring (1994) Continual learning in reinforcement environments. Ph.D. Thesis, University of Texas at Austin.
  • [16] J. Schmidhuber (2013) Powerplay: Training an increasingly general problem solver by continually searching for the simplest still unsolvable problem. Frontiers in Psychology 4, pp. 313.