BOSD
None
view repo
Online detection of instantaneous changes in the generative process of a data sequence generally focuses on retrospective inference of such change points without considering their future occurrences. We extend the Bayesian Online Change Point Detection algorithm to also infer the number of time steps until the next change point (i.e., the residual time). This enables us to handle observation models which depend on the total segment duration, which is useful to model data sequences with temporal scaling. In addition, we extend the model by removing the i.i.d. assumption on the observation model parameters. The resulting inference algorithm for segment detection can be deployed in an online fashion, and we illustrate applications to synthetic and to two medical real-world data sets.
READ FULL TEXT VIEW PDFNone
An underlying assumption in time series models is often that the parameters of the data generative process remain fixed across the sequence of observations. Yet, it is not uncommon to encounter phenomena where this assumption does not hold.
Switching models (Chiappa, 2014), also known as change point models, account for the time-variability in the model by splitting a sequence of observations into non-overlapping segments, each with fixed model parameters. Change point detection (CPD) algorithms aim to identify the boundaries of these segments, given the observed time series and an underlying predictive model (UPM) from which the segments of observations are generated.
Many of the previous approaches to CPD focused on the offline setting (Harchaoui & Cappé, 2007; Stephens, 1994), i.e., the detection algorithm treats the observations in batch mode, and they are focused on retrospective segmentation as opposed to predictive modeling. Online methods for CPD were proposed by (Adams & MacKay, 2007; Fearnhead & Liu, 2007)
from a Bayesian perspective. The core idea behind Bayesian online change point detection (BOCPD) is to keep a probability distribution over the
run length , i.e., the elapsed time since the most recent change point (See Definition 1). When a new observation comes in, the belief over the run length is updated accordingly. Importantly, as we explain in Section 2.2, the updated run length distribution enables one-step-ahead prediction in contrast with the offline approaches.However, all these approaches have in common that they do not explicitly incorporate the modeling of the total segment duration into account. We argue that this is of high practical importance for certain applications. Consider the scenario of medical time series, where each person has a different, personalized duration pattern, e.g., in EEG or ECG time series. While real-time classification or segmentation of time series is often a challenging problem of practical importance and already some solutions exist, in many disciplines, planning of interventions has received significantly less attention. For instance, in sleep research it was shown (for more details we refer to section 4
) that acoustic stimulation has a positive effect only if the intervention is done in the up-shift of the slow wave. Thus, not only is online labeling required but likewise a prediction and uncertainty quantification of the current estimate is needed. While the time series in the case of EEG data might look rough and irregular at first, an endogenous 24h circadian clock ensures that even in this case certain personalized duration patterns indeed exist.
In order to provide a solution, particularly for sparse data regimes, we propose to further extend the BOCPD formalism along multiple directions in this work.
First of all, we incorporate explicitly into the framework the modeling of the total segment duration at a particular time-step. Leveraging on this, we can infer when the next change point will occur through a residual time variable , which is the remaining time until the next change point in the sequence of observations (see Definition 3). At the same time, this also opens the door to use UPMs that not only depend on the run length and the most recent observed data but also on the total segment duration (see Definition 2). This class of UPMs is suitable for modelling time series that share the same underlying functional shape but with a completely different time scale, e.g., EEG data from the same person but from different days or ECG data while walking and running. To address such scenario, the total segment duration is added as an input of the UPM, and the online inference scheme is extended to account for this.
Moreover, we add sequential structure to the generative model of the parameters governing the UPM. The original assumption in BOCPD is that all the parameters governing the different segments come from the same prior distribution. In contrast, we assume that there are
different distributions from which the UPM parameters could be sampled from. Furthermore, the distributions are chosen following a Markov chain. As with the segment duration
, the latent state denoting the model responsible of generating the parameters at time , denoted by , is incorporated into the online inference framework seamlessly. Notice that the resulting generative model is similar to an HSMM. The main difference is, however, that online segment inference is supported for such model.We prefer to use the term segment detection over change point detection to highlight the fact that the proposed approach not only infers the segment starting time but explicitly accounts for its ending position as well.
The segmentation and classification of time series has many applications in different fields such as predicting failures in an oil process plant, reconstructing trajectories in air traffic control, the identification of interaction scenarios in robotic environments, real-time brain computer interfaces, probabilistic forecasting of volcano eruptions or forecasting time series with multiple seasonal patterns (Molina et al., 2009; Gharghabi et al., 2018; Xiao et al., 2018; Cassisi et al., 2016; Gould et al., 2008). For a recent review of change point detection methods we refer to Aminikhanghahi & Cook (2017); Truong et al. (2018)
. These models have been especially popular in computer vision for human motion modeling
(Lafferty et al., 2001; Oh et al., 2008; Shi et al., 2008; Hoai et al., 2011; Lin et al., 2016; Karagiannaki et al., 2017). Regarding the modeling aspect, Gaussian processes have been used to model temporal structure within segments (Saatçi et al., 2010). Alternative methods, which are often based on HMMs, have recently been combined with recurrent and structured inference procedures (Dai et al., 2017; Liu et al., 2018; Kong et al., 2015; Ammar et al., 2014). In the following, we provide more details about the models we are building on top of: HSMMs and BOCPD, and we point out the differences concerning inference and comment on a setting which leads to a similar generative model.Hidden semi-Markov Models (HSMMs) represent a particular way to define a probability distribution over a data sequence . Similarly to Hidden Markov Models (HMMs), the observed data is assumed to be generated from a hidden sequence of discrete r.v. , where , through the so-called emission process. The semi-Markov property defines how the hidden variables are generated in a sequential manner. Namely, an HSMM explicitly models the number of consecutive time steps that the hidden process remains in a particular state through the duration model (Murphy, 2002). HSMMs are fully parameterized by the transition model represented by the matrix , the probability mass function (p.m.f.) of the initial hidden state , the duration model induced by matrix and the emission parameters . Given that the -th hidden state generates a segment of total duration at time step , the emission likelihood for the sequence of observations is
(1) |
Note that the observations in the sequence are not necessarily conditionally independent among each other given the corresponding hidden variables. Moreover, they could even potentially depend on the total segment duration . This allows to express more complex emission processes than with the HMM and to capture richer dependencies among distant observations.
The BOCPD (Adams & MacKay, 2007) generative model assumes that the segments composing a sequence of observations are non-overlapping product partitions. This means that every partition (i.e., segments for HSMMs) is generated i.i.d. given a fixed observation model . The parameters are also drawn i.i.d. across partitions from some fixed distribution parameterized by . Formally, the UPM represents the distribution which generates observations based on the current run length and observed values
. The original formulation of BOCPD did not provide a learning procedure since the model hyperparameters
were assumed known.Under this model a change point (CP) occurs when the underlying generative model changes and a new product partition is generated. Equivalently, the time step is regarded as a CP iff and are generated by different models. In order to find change points in an online manner, BOCPD relies on computing the posterior over the run length (Definition 1) given all the observations up to time .
Let the run length be a non-negative discrete quantity denoting the number of time steps elapsed at time since the last change point occurred in the sequence. It holds that at a change point.
The run length posterior can be conveniently written in terms of the joint for which the following recursion holds:
(2) | ||||
The predictive distribution for in (2) is remarkably only conditioned on the subset of observations since the last change point according to a particular run length . The subset is empty if . The CP prior model is defined through a hazard function which parameterizes the two possible outcomes for . Namely, or .
Note that the hazard function implicitly induces a distribution over the duration of the partitions composing an observation sequence. In the HSMM framework this is accounted for explicitly through the duration model introduced in section 2.1. For instance, a constant hazard function
induces a geometric distribution over the segment duration
. The latter is indeed the case for the standard HMM.The joint probability in (2) is not only insightful for detecting change points, but can also be used to perform predictions about future observations. The predictive distribution for the next observation (shown in (3)) is derived by marginalizing over the current run length posterior, which is obtained from (2).
(3) |
Note that an HSMM (Section 2.1) with a single hidden state leads to almost the same generative model of BOCPD. One of the key differences is that the HSMMs are traditionally parameterized in terms of the total segment duration, whereas the BOCPD parameterization is based on the run length through the hazard function instead of the duration model. As hinted earlier, the equivalence between the hazard function and the duration model is well-understood—see (Adams & MacKay, 2007) for details. However, the chosen parameterization has relevant implications for the UPMs, as discussed in Section 3.2.
The key difference between HSMMs and BOCPD is related with how inference is performed and its ultimate goal. HSMMs have been mostly used for batch processing of time series and retrospective inference of hidden states through the celebrated Forward-Backward and Viterbi algorithms (Rabiner, 1989). On the other hand, BOCPD focuses on the filtering setting (i.e., online updating) and on events which are relevant for the most recent observation. It does so by explicitly accounting for potentially incomplete segment realizations.
We now extend the BOCPD inference procedure in two ways. First, we bring in more structure to the generative process of the parameters which govern the underlying predictive model for different regimes. Second, we augment the inference procedure to estimate, in addition to the run length, the total segment duration, and the newly introduced generative hidden states.
As in the original formulation (Section 2.2), an observation sequence is composed of non-overlapping product partitions . But the parameters governing the partition are now sampled according to , where the discrete random value indexes a particular distributions over . We also constraint the hidden variables to follow a semi-Markov chain as described in 2.1. This is in contrast with BOCPD, where each is sampled i.i.d. from a single distribution with certain fixed parameters .
Additionally, we reformulate the model in terms of two extra variables for (Definition 2) and , in addition to the run length defined in section 2.2.
The goal for the proposed approach is to compute the posterior in an online manner. It turns out that we can write a similar recursion to (2) for the following joint
(4) | ||||
(5) | ||||
(6) | ||||
(7) |
Let the segment duration be the number of observations jointly emitted with the observation at time . In other words, the length of the product partition associated with .
The factor in (5) accounts for the generative process of observations (i.e., the UPM) and it only takes into account the relevant observations according to the run length . The expression in (6) encodes the underlying semi-Markov dynamics in the latent discrete process and it is further expanded in Equation 8 using a similar notation to the one used in explicit-duration modeling of Markov models (Chiappa & Peters, 2010).
It is remarkable that the joint in (7) has exactly the same form as the joint in (4) but with a temporal shift of one time step. This fact enables the recursive estimation of by processing one observation at a time and updating the joint in (4) accordingly, i.e., online updating. It also means that all the information present in the observations seen before can conveniently be summarized in the joint .
(8) |
The function is iff is equal to and otherwise.
The initial state p.m.f. does not appear anywhere in Equation (8
), however, it must be taken into account at the moment of defining the base case
for the recursion.The desired posterior distribution can be computed from (4) using the fact that . Given that , and are all discrete quantities, the posterior normalization constant is computed straightforwardly as . A similar recursion to (4) in terms of can also be derived. The latter recursion behaves better in terms of numerical stability given that it is a p.m.f.; therefore, the probability mass is distributed over a discrete set, as opposed to the joint . This scaling procedure is well-known in the HMM literature (Rabiner, 1989).
Prediction for future observations is performed similarly to the BOCPD prediction in (3), but marginalizing over the new latent variables, in addition to the run length.
(9) |
Let the residual time be the number of remaining observations after time to reach the end of the product partition at time .
The run length and the segment duration are closely related through a third quantity introduced in Definition 3 and for which the equality holds for all . Thus, the posterior , which allows us to infer the next change point occurrence through , can be easily written in terms of that we just derived using Equation 4. We will exploit this fact in the experiments reported in Section 4.
There are fundamental differences between the UPM contribution derived in our model (Expression 5) and , which is the analogous expression for BOCPD (Equation 2). Apart from the fact that we now account for a discrete number of observation models indexed by , we also handle UPMs which not only depend on the run length but could also potentially depend on the total segment duration . A dependence on is present, for instance, in temporal settings where the emission process is invariant to different time scales. We give a particular example in Section 4.1.
Notice that the UPM contribution in BOCPD is a special case of assuming that there is a single hidden state and the emission model is independent of the total segment duration given the run length .
Note that a naive implementation of Equation (4) would be impractical. To analyze the computational complexity, let and denote the maximum possible segment duration and the number of hidden states, respectively.
Using Dynamic Programming (DP) over the recursive structure of (4), in a similar way to HSMMs inference algorithms (Murphy, 2002), we devised an update algorithm with complexity per time-step for arbitrary UPMs. However, if we deal with UPMs whose likelihood function can be calculated incrementally through a set of sufficient statistics (e.g., exponential family likelihoods), then the cost per time-step reduces to because can be evaluated in constant time in such setting leveraging memoization.
Furthermore, if it turns out that the UPM is agnostic to the total segment duration , then the cost per online update can be further improved to using the same strategy proposed for efficient Forward-Backward algorithms in explicit-duration HMMs in (Yu & Kobayashi, 2003). Remarkably, this is the same complexity reported for BOCPD which, as opposed to the proposed method, does not estimate the occurrence of future change points. A particular example of this type of UPM is presented in Section 4.2 where we test our method with time series of length T=20000 with and .
We first illustrate the proposed inference procedure using a synthetic 2D data set. The data are generated using an HSMM with 4 hidden states and with some fixed initial p.m.f. , transition matrix , and duration matrix . The observation model for a hidden state and for a particular segment duration of observations has the form , where denote state-specific constants and
accounts for output noise with fixed variance
. Note that the sine functions input is always in the interval regardless of the total segment duration. This constraint encodes the property that realizations of the same hidden state with different length share the same functional shape. However, those trajectories might exhibit different time scales (i.e., linear time warping). This is an example of an observation model which jointly depends on the hidden state and the total segment duration, as described in section 3.2. A trajectory sampled from the described model is depicted in the top row of Figure 1.Note the alignment between the ground truth labels (GT) and the inferred most likely hidden state sequence (MAP) in the second row of Figure 1. The latter was picked using the marginal filtering distribution over states for all . The inferred sequence is slightly right-shifted relative to the true sequence because the inference procedure requires a few observations to gather enough evidence about the occurrence of a change point.
The run length and residual time inference are also plotted in Figure 1
as cumulative distributions (CDF) for each of the time steps. Each CDF is depicted as a column vector that goes from white (0 mass) to black (1 mass) as a function of the discrete values of the corresponding random variable.
Remarkably, the run length inference has almost no uncertainty as it can be seen from the sudden color change from white (0 CDF mass) to black (1 CDF mass) around the blue-dashed line depicting the true run length. This high confidence occurs as a consequence of using the actual generative models to process the synthetic observations. Similarly to the hidden state inference, the run length is sometimes overestimated (e.g., around time step 300), but it gets updated to the right value after observing the first few observations of a new segment.
The residual time posterior is also fairly confident but exhibits higher uncertainty during the first observations of a new segment (e.g., time step 700). However, note that it gets more confident after seeing a few more observations. This is a consequence of the observations being dependent on the total duration of the segment they belong to. Intuitively, the very first observations in a segment can give insights about the time scale governing their segment which in turn reduces the uncertainty about the remaining time until the next change point. In both cases, the true functions of run length and residual time are consistent with the inference made by our online segment detector.
Sleep staging is a fundamental but labor intensive process in any sleep laboratory and numerous sleep stage classification schemes have been proposed (Patanaik et al., 2018; Sunagawa et al., 2013). Reliable and online sleep scoring in combination with stage dependent interventions potentially allows for transcranial (Marshall et al., 2006) or acoustic stimulation (Ngo et al., 2013), which has been shown to have positive effects, e.g., for declarative memory. However, interventions require the phase tracking of raw EEG signals in real-time, which is in general a very difficult task (Patanaik et al., 2018) and especially in safety critical medical applications, a predictive model for planning is a key requirement.
In particular, sleep staging requires identifying different sleep states from time-varying measurements; typically electroencephalography (EEG) and electromyography (EMG). In mice, sleep researchers are usually interested in distinguishing among the following states: rapid eye movement sleep (REM), non-rapid eye movement sleep (NREM) and the wake status. The standard approach to stage the EEG/EMG time series is based on frequency-band features, and it has been traditionally applied by inspection of well-trained experts.
We consider a data set with three-channel measurements (2 EEG and 1 EMG) of 3 mice recorded at a frequency of 128Hz for 24h. The data are split into 4-second epochs yielding a sequence of 21600 epochs per subject^{1}^{1}1The dataset and the source code are available at https://github.com/DiegoAE/BOSD.
We choose an observation model which does not depend on the total segment duration for computational reasons given that it allows to handle large segment durations, as explained in Section 3.3. In particular, we assume the likelihood for a particular epoch to be , where denotes the frequency-band feature mapping and denotes the corresponding hidden state. As features we use the amplitudes present in different frequency bands similarly to (Sunagawa et al., 2013).
In Figure 2, we illustrate different inferences drawn from applying the proposed method to one of the subject’s recordings regarded as the test sequence. We use the other two sequences with their respective labels to train the model in a supervised fashion using maximum likelihood estimation (MLE). In the second row, we use colors to represent the sequence of expert-annotated labels provided for the test subject (GT) on top of the sequence of inferred labels (MAP). The latter was derived by taking the state that maximizes the marginal filtering distribution at each epoch .
BOSD | FASTER | |||||
---|---|---|---|---|---|---|
State | Precision | Recall | F1 | Precision | Recall | F1 |
Wake | 0.94 | 0.93 | 0.93 | 1.00 | 0.83 | 0.91 |
REM | 0.90 | 0.91 | 0.91 | 0.76 | 0.98 | 0.86 |
NREM | 0.81 | 0.87 | 0.84 | 0.57 | 0.37 | 0.45 |
AVG | 0.91 | 0.91 | 0.91 | 0.87 | 0.86 | 0.85 |
In Table 1, we compare the sleep state classification performance with respect to FASTER (Sunagawa et al., 2013), which is a fully-automated sleep staging methods designed for mice recordings. Notice that the occurrence of different sleep states is not balanced. For instance, the test sequence shown in Figure 2 has 11240 epochs labeled as wake, 8653 as NREM and only 1707 as REM. According to our experiments, FASTER does not perform well identifying the state NREM (F1 = 0.45), whereas our method gets a much higher score (F1 = 0.84) for the same class. Note that the relative differences in the averages reported in Table 1, which are weighted by the number of true instances of each state, would be even higher if the label imbalance is taken into account.
We emphasize that the proposed method does not explicitly optimize for discrimination performance. The inferences drawn from it respect consistency at the sequence level rather than at the epoch level. Remarkably, our method still outperforms FASTER in the described experimental setup. In addition we note that our algorithm performed online inference on the test data while FASTER considered the significantly easier offline case.
In the last two rows of Figure 2 we show the posterior CDFs over the run length and the residual time derived from the central object of our method . In both cases, the probability mass is represented by a gray scale ranging from white (0 mass) to black (1 mass). The obtained run length inference is clearly much more confident than the corresponding residual time inference given the sharp pattern of its probabilities as opposed to the shadowed pattern obtained in the last row of Figure 2. This is mostly due to the fact that the residual time prediction is inherently a predictive task, whereas the run length estimation accounts for an event that has already happened and from which there must be more evidence about. The run length estimation tends to be particularly accurate for long segments as shown by the inference performed over the consecutive runs of the red color (wake state).
Regarding the residual time prediction, we highlight that the confidence fades out in a way that is consistent with the ground truth. Consider the residual time posterior for the three longest segments of the wake state and notice how the range of plausible values is much more constrained at the end of such segments. In fact, the true remaining time function is within 2 standard deviations of the expected residual time for all the considered epochs (not shown). Note in addition that at the beginning of these segments most of the probability mass is shifted towards high duration values given that the mice tend to remain in this state for longer periods of time.
In contrast with the synthetic experiment, the observations themselves can not provide much insight about the residual time in this case because the emission process we are using for this experiment does not depend on the total segment duration. This is why we do not get a more peaked posterior over the remaining time after seeing the first few observations of a new segment.
The measurement of the duration cycle of ECG (Electrocardiogram) waves is a useful indicator to determine abnormalities (e.g., cardiac arrhythmia) in the heart function. Moreover, within any complete ECG cycle, there are three main events: the P-wave, the T-wave and the QRT complex, which encode different stages of the heart cycle, and their locations are useful markers for diagnosing heart rhythm problems, myocardial damage and even in the drug development process (Hughes et al., 2004).
We leverage two properties of the ECG waves to fully exploit our inference procedure. Namely, bell-shaped duration models and consistent temporal patterns associated with the different heart stages. Furthermore, the functional shape of such patterns exhibit time warping—i.e., temporal scaling. This motivates the use of an UPM that takes the total duration into account, as explained in Section 3.2.
Formally, we model the ECG signal in terms of a vector of basis functions and their corresponding weights
, with i.i.d. Gaussian noise added. In addition, a conjugate prior is assumed over the weights
, whose mean and covariance depend on the particular hidden state associated with the segment. Given the run length , the segment hidden state and the total stage duration , the observation model takes the form:(10) |
Note that the basis function vector is a function of the ratio , therefore holds. Intuitively, this means that the shape of the functions generated by this emission process can be temporally stretched, and they can still be well represented by the same underlying model. A similar model is known in the robotics community as Probabilistic Movement Primitives (Paraschos et al., 2013).
We use one of the ECG recordings (sel100) of the QT database (Laguna et al., 1997) to showcase the capability of the proposed method to jointly infer the heart cycle stage, the elapsed time on it, and the remaining time until the next stage transition in Figure 3. As with previous experiments, we assume an online setting where all the inferences are only based on previous observations and are updated efficiently when a new observation is feed in.
The QT database provides annotations over the ECG signal that account for the temporal location of the heart cycle events. In this experiment, we use the annotations corresponding to the normal beat event (R peak) and the end of the T-wave, labeled as “N” and “T)” respectively in the database. The ECG segment between a normal peak event and the end of the T-wave corresponds to Systole (i.e., contraction) denoted by S0 in Figure 3, and the segment that goes from the T-wave end to the R peak is Diastole (i.e., dilation) denoted by S1.
In the top row of Figure 3 we depict the ECG waves for the last 5 heart cycles of the subject sequence under study. We use the first part of the sequence with its corresponding annotations to learn the observation model parameters , the transition matrix () and the duration matrix () in a supervised manner. In the second row of Figure 3, we show the segmentation resulting from the annotations (GT) on top of the inferred one (MAP) by maximizing the marginal state posterior. Regarding classification performance, we get and () for S0 and and for S1 (). This shows that our method could be used to stage ECG waves in addition to detecting and predicting their change points. In the third and fourth rows, we illustrate the resulting inference over the run length and residual time after observing each ECG measurement . The inference is consistent with the ground truth values (blue-dashed line) except for some segments located around the time step 100 and time step 300.
We emphasize that the UPM introduced in (10) depends on the total segment duration as opposed to the one used for the EEG in Section 4.2. The main consequence of this is that, in the former case, the observations directly influence the residual time estimate, while in the latter case this happens only indirectly via the run length posterior. The residual time inference of the first S1 segment in Figure 3 serves as an example to show how, after observing the initial part of a segment, the uncertainty reduces abruptly around the true values. On top of that, correcting the residual time estimate in the presence of new evidence is also possible with a duration-dependent UPM, as it can be seen in the segment S1 around the time step 600 where the probability mass is shifted to a higher residual time value towards the end of the segment. Comparing the last rows of Figures 2 and 3, it is clear that in the duration agnostic case the estimates are more conservative, i.e., they have higher uncertainty, as opposed to the posteriors obtained in the duration dependent case where the inferred posteriors get confident early on in the segment.
We extend the Bayesian Online Change Point Detection algorithm to infer when the next change point will occur in a stream of observations. We show how this extension leads to a new family of observation models that were not supported by previous online change point detectors: emission models that depend on the total segment duration in addition to the run length. Moreover, the model is augmented to incorporate a discrete number of distributions from which the observation model parameters can be sampled from—in contrast with the previous i.i.d. assumption.
Remarkably, this general framework is agnostic to the particular type of emission process. We show, however, that there are computational trade-offs that arise from the emission choice when performing online inference. We illustrate the performance for different observation models, with synthetic and real world medical data sets. We leave downstream applications of our inference procedure, as planning and control, for future work.
This research was partially supported by the Max Planck ETH Center for Learning Systems. The authors thank Djordje Miladinovic for providing one of the datasets and members of MPI-IS Tuebingen for their useful feedback.
Conditional random field autoencoders for unsupervised structured prediction.
In Advances in Neural Information Processing Systems, pp. 3311–3319, 2014.Foundations and Trends in Machine Learning
, 7(6):803–886, 2014.Computer Vision and Pattern Recognition (CVPR)
, pp. 3265–3272, 2011.An online feature selection architecture for human activity recognition.
In International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2522–2526. IEEE, 2017.Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI
, pp. 2447–2453, 2018.