# Change-Point Detection on Hierarchical Circadian Models

This paper addresses the problem of change-point detection on sequences of high-dimensional and heterogeneous observations, which also possess a periodic temporal structure. Due to the dimensionality problem, when the time between change-points is on the order of the dimension of the model parameters, drifts in the underlying distribution can be misidentified as changes. To overcome this limitation we assume that the observations lie in a lower dimensional manifold that admits a latent variable representation. In particular, we propose a hierarchical model that is computationally feasible, widely applicable to heterogeneous data and robust to missing instances. Additionally, to deal with the observations' periodic dependencies, we employ a circadian model where the data periodicity is captured by non-stationary covariance functions. We validate the proposed technique on synthetic examples and we demonstrate its utility in the detection of changes for human behavior characterization.

• 10 publications
• 7 publications
• 17 publications
07/24/2020

### Multinomial Sampling for Hierarchical Change-Point Detection

Bayesian change-point detection, together with latent variable models, a...
10/22/2019

### Continual Learning for Infinite Hierarchical Change-Point Detection

Change-point detection (CPD) aims to locate abrupt transitions in the ge...
08/24/2012

### Changepoint detection for high-dimensional time series with missing data

This paper describes a novel approach to change-point detection when the...
11/18/2020

### Detecting Hierarchical Changes in Latent Variable Models

This paper addresses the issue of detecting hierarchical changes in late...
09/25/2018

### Optimal Change Point Detection and Localization in Sparse Dynamic Networks

We study the problem of change point detection and localization in dynam...
07/05/2019

### An Approximate Bayesian Approach to Surprise-Based Learning

Surprise-based learning allows agents to adapt quickly in non-stationary...
12/05/2021

### Another look at synthetic-type control charts

During the last two decades, in statistical process monitoring plentiful...

## Code Repositories

### HierCPD

Hierarchical Change-Point Detection

## 1 Introduction

Change-point detection (CPD) consists in locating abrupt transitions in the generative model of a set of observations. This problem has been long studied and appears in a vast amount of real-world scenarios such as finance (Andersson et al., 2006; Berkes et al., 2004), social networks (Raginsky et al., 2012; Krishnamurthy, 2012), control (Lai, 1995) or behavioral sciences (Park et al., 2017)

, to name a few. Regarding the type of data to be modeled, the general assumption is to consider low-dimensional and homogenous datasets where observations are also assumed to be independent and identically distributed (i.i.d.). However, in many current applications the problem becomes much more challenging because the data is usually high-dimensional and heterogeneous, that is, we observe a mix of continuous, categorical, binary, or discrete variables. The reason is that high-dimensional and heterogeneous observations result in more complicated pre-change and post-change distributions, which need to be estimated (explicitly or implicitly) to identify change points.

The CPD problem becomes even more difficult when the observations deviate from the i.i.d. assumption, that is, when they present some temporal structure. Concretely, this work focuses on data with periodic temporal structure. This particular structure appears in scenarios dealing with humans. Since we are strictly conditioned by 24-hours periods, almost any data generated by humans will present such underlying periodicities, which are related to the circadian rhythm in human behavior (Pentland and Liu, 1999; Eagle and Pentland, 2009).

Additionally, the CPD problem is also difficult because we have to deal with online data streams. Many methods proposed in the literature focus on the segmentation of a complete time-series in a fixed number of segments. This differs from the inference and estimation in streaming datasets because they need to be incrementally updated in real-time (as new input data comes in). This issue is particularly important for high-dimensional signals due to memory and computational costs, which cannot be handled easily with standard batch methods. Another challenge appears when sequences are incomplete or cannot be fully observed. In those cases, the first step is to obtain accurate estimates of the missing values in order to improve the detection performance.

The limited applicability of state-of-art methods to the aforementioned problems motivates this work, where we present a novel solution for performing change-point detection in the presence of such problems. Our model is robust to missing data and able to detect changes in heterogeneous and high-dimensional data with periodic temporal structure. Inspired in

Adams and MacKay (2007), we propose a generalization thereof that considers a latent-variable model, which addresses the high dimensionality and heterogeneity of the observations, as well as the periodic temporal structure. The main contribution of this work is to introduce a hierarchy in the underlying probabilistic model of the detector, which admits multiple heterogeneities, high dimensionality and periodic dependencies at several time scales. To the best of our knowledge, these problems have not been treated together by any change-point detection technique.

### 1.1 Related Work

Typical approaches to change-point detection focus on batch settings (Fearnhead, 2006; Harchaoui et al., 2009), where the number of changes is often unknown. In contrast, online methods were also introduced via particle filtering models (Fearnhead and Liu, 2007) and the Bayesian Online CPD algorithm (BOCPD) (Adams and MacKay, 2007). Both works infer the location of change points for univariate time series using maximum-a-posteriori (MAP) estimation. A different approach is developed in Li et al. (2015), where the changes are detected for sequences of samples whose time horizon can be either fixed or not.

There has been a considerable effort to estimate change points on different domains. For example, Höhle (2010) introduced an online cumulative sum detector that finds changes on categorical time series. Other approaches are able to model multivariate real data using undirected Gaussian graphical models (GGMs) (Xuan and Murphy, 2007) or mixture models (Alippi et al., 2015; Kuncheva, 2013) together with log-likelihood ratios or divergences. Following the Bayesian perspective, several extensions of the BOCPD have been developed, including adaptative sequential methods (Turner et al., 2009) or the Gaussian Process CPD model (Saatçi et al., 2010). This last work extends the BOCPD algorithm to locate change points from observations with arbitrary temporal structure. Additionally, unlike previous approaches, the non-exponential CPD model (Turner et al., 2013)

explored applications to new families of distributions, where computing posterior probabilities is intractable and variational inference is therefore required. However, there is a lack of methods that account for change-point detection on heterogeneous datasets where observations may contain several statistical data types.

Besides the heterogeneity problem, there is another key challenge in the state-of-art to scale up CPD methods for high-dimensional data, which has not been considered much in the past. One exception is the work of Xie et al. (2013), which approaches the problem by considering that the data belongs to a low-dimensional manifold embedded in some high-dimensional ambient space. This idea is similar to our approach since it reduces the complexity problem by focusing the change-point analysis to some latent space.

## 2 Change-Point Detection on Hierarchical Models

Based on Adams and MacKay (2007), we consider that the sequence of heterogeneous and high-dimensional observations, , may be divided into non-overlapping partitions separated by change points. We assume that the data within each partition is independent and identically distributed (i.i.d.) for some distribution . We also define the run-length as the number of time steps since the last change point appeared.

The objective of the BOCPD method is to perform the recursive estimation of . However, when the time between change points becomes of the order of magnitude (or less) of the number of the model parameters, the algorithm is unable to accumulate enough probability mass on low values of . This causes that drifts in the underlying distribution can be misidentified as changes. To overcome this limitation, we further assume that even if the data is high-dimensional, it belongs to a low-dimensional manifold, admitting a latent variable model of the form

 p(xt)=∫p(xt|zt)p(zt)dzt, (1)

where

is a vector of latent variables associated with the observation

. For the time being, we consider that is fixed and known a priori. Even for this latent variable model, we are still interested in the posterior , and we therefore need to extend the recursive estimation to account for the sequence of latent variable vectors,

. To compute this posterior distribution, we start with the joint distribution

, where is the vector of unknown parameters that modulates the distribution of the latent variables. From the joint distribution, may be directly obtained as

 p(rt|X1:t)=p(rt,X1:t)∑rtp(rt,X1:t), (2)

where

 p(rt,X1:t)=∫∫p(rt,Z1:t,X1:t,θt) dθtdZ1:t. (3)

This approach can be involved, mainly due to the integrals in (3), and also requires some manipulations to achieve a recursive expression. The first step is to factorize the joint distribution as

 p(rt,Z1:t,X1:t,θt)=p(X1:t|Z1:t)p(rt,Z1:t,θt), (4)

where we have exploited the assumption that is fixed and known a priori. Now, we continue by marginalizing out the parameters as

 (5)

where

 p(rt,Z1:t)=∫p(rt,Z1:t,θt) dθt. (6)

The recursive nature of the algorithm is a consequence of the factorization (see Appendix A)

 p(rt,Z1:t,θt)=∑rt−1p(rt|rt−1)p(zt|θt)p(θt|rt−1,Z(r)1:t−1)p(rt−1,Z1:t−1), (7)

where is the likelihood of the latent variables and denotes the subset of latent variables for a given partition (Adams and MacKay, 2007). Moreover, the last term in right-hand side of (7) can be obtained by marginalizing over the previous likelihood parameters, that is,

 p(rt−1,Z1:t−1)=∫p(rt−1,Z1:t−1,θt−1)dθt−1, (8)

and the prior of conditioned to , also known as the change-point prior, is given by

 p(rt|rt−1)={H(rt−1+1),rt=0,1−H(rt−1+1),rt=rt−1+1. (9)

Here, is the hazard

function, which we take to be constant with a given time-scale hyperparameter

. Finally, is the posterior of the parameters at time given the previous run-length and the corresponding partition of the latent variables .

Plugging now (7) into (6), the probability becomes

 p(rt,Z1:t)=∑rt−1p(rt|rt−1)π(r)tp(rt−1,Z1:t−1), (10)

where the predictive posterior (Adams and MacKay, 2007; Turner et al., 2009) is given by

 π(r)t=p(zt|rt−1,Z(r)1:t−1)=∫p(zt|θt)p(θt|rt−1,Z(r)1:t−1)dθt. (11)

So far, we have obtained a recursive expression, but we now face the first problem due to the marginalization. Concretely, the exact evaluation of is often intractable for non-Gaussian likelihoods and may require numerical solutions or approximate inference. This issue will be addressed later in Section 2.3.

The marginalization over the latent variables also poses a major problem. Additionally, the dimension of the latent space, and as a consequence that of the parameters , raises two issues. First, as we have already stated, the detector’s performance directly depends on this dimension and, second, the computation complexity of (11) also varies. Hence, we propose to use a simple latent class model, where the latent variables are given by a set of discrete values , with being the number of classes. Each observable object has therefore a single representation , which indicates the class where it belongs to. In practice, this means that we now need to compute , where is a univariate sequence of latent class indicators. This selection yields a (relatively) simpler marginalization and, since the latent variables are just class indicators, the parameter space is greatly reduced.

Based on this selection for the latent variable model, (3) becomes

 p(rt,X1:t)=∑z1:tp(rt,z1:t,X1:t)=∑rt−1p(rt|rt−1)∑ztp(xt|zt)∑z1:t−1ψ(r)t, (12)

where

 ψ(r)t=p(zt|rt−1,z(r)1:t−1)p(rt−1,z1:t−1)p(X1:t−1|z1:t−1), (13)

and we have factored

 p(X1:t|z1:t)=p(xt|zt)p(X1:t−1|z1:t−1). (14)

The values represent a function parametrized at each time step by all previous latent classes for a given run-length . Due to the recursive dependence in (12), marginalizing this whole function results in a combinatorial problem that can be computationally challenging. Concretely, for some sequence of length and with classes, the evaluation of requires operations. Thus, despite (12) provides certain advantage with respect to (3), it may not be possible to compute. This problem is further simplified next. Finally, Figure 1 depicts the proposed hierarchical model.

### 2.1 Simplified Hierarchical Models

We have shown how the computational complexity of the hierarchical model grows exponentially as the signal length, , and the number of categories, , increase, which becomes prohibitive very quickly. The main factor of this computationally complexity is the marginalization of , since it requires summing over all combinations of the latent classes . To overcome this issue, we simplify the previous model by considering alternative strategies. When dealing with latent variables, if collapsing the entire latent domain is too costly, we can directly observe it. That is, we may explicitly observe the values taken by the class indicators , which before were hidden. This sort of observation model has been previously explored by Nazábal et al. (2016)

for the combination of classifiers and it offers several ways to overcome computational costly operations.

#### 2.1.1 Full Posterior Observation

As a first approximation to the hierarchical model, we can use the full posterior probability as the input observation to the change-point detector. If we assume that the distribution over the complete sequence has been previously inferred (See Section 3

), we may take these probabilities as random variables similarly to

Nazábal et al. (2016). Thus, the new observations are , with since . Here, denotes the -dimensional simplex.

The main advantage of this approach is that even if the true classes are unknown, we can detect change points directly from the full posterior probability, which avoids the marginalization of the entire latent sequence. In particular, we show how this form allows to perform reliable detections similarly to the original hierarchical model in Section 2.3. The graphical representation of the posterior observation model can be seen in Figure 2.

#### 2.1.2 Point Estimate Observation

For the scenarios where the above approach is still too computationally demanding, another option appears if we are able to obtain good point estimates, , of the class indicators, . We can use them as an accurate approximation of the latent sequence instead of marginalizing . To adapt the hierarchical model to this new observed latent variables, we take the set of point estimates also as an input sequence to the detector. Note that we further assume the likelihood to be , so we are directly modeling changes on the observed point-estimates. In particular, we compute the point estimates using the maximum a posteriori criterion. That is . In practice, to infer the class posterior distribution , we will use the circadian model presented in Section 3.

### 2.2 Missing Temporal Data

Another important contribution of this work is that the proposed method is able to handle missing data. Concretely, the missing observations follow the model completely at random (MCAR) (Rubin, 1976), and are denoted by . Moreover, the samples that are fully observed are . Dividing the observations in two sets also translates to the sequence of latent variables, which are divided in lost observations , where all components of are missing, and observed variables that correspond to . The case where only some components of are missing is addressed in Section 3 from a different perspective.

For the missing latent variables, following a fully Bayesian approach, we marginalize them out in our hierarchical model. The corresponding predictive probability is therefore reduced to . The main advantage of this marginalization is that the detection algorithm is able to compute probabilities even when is missing. Moreover, the recursiveness remains unaltered since the conditional change-point prior, , is always evaluated sequentially. That is, the uncertainty is still propagated forwards as (6) becomes

 p(rt,Z1:t)=∑rt−1p(rt−1,Z1:t−1)p(rt|rt−1). (15)

This approach for incomplete sequences presents good performance if the missing entries do not appear as long bursts. If that were the case, the lack of randomness induced by the missing data may ruin the estimation of change points since would be computed from very old data.

In Figure 3 we compare the simplified detection model, the PEO model, for both complete and incomplete data sequences. First, this figure shows the MAP estimates of for the complete sequence (top row) and the incomplete one (middle row). Additionally, in the bottom row we plot the MAP estimate of for the cases of missing and non-missing data, as well as the ground truth. As can be seen, the performance with missing data is almost identical to the case of non-missing data.

### 2.3 Inference

In this section, we describe the inference procedures for the two simplified hierarchical models (see FPO and PEO in Figure 2).

#### 2.3.1 Approximate Inference for Full Posterior Observations

Under this model the likelihood function is and we take it as a Dirichlet distribution. Thus, our objective is to find , for which the hierarchical model described before must be reformulated, as well as the recursiveness in (7), which needs to be updated to accept the new sequence of observed latent probabilities . This is equivalent to using the BOCPD algorithm with inputs . As we will see later, conjugacy is unavailable in this case and, to simplify the inference process, we decompose the natural parameter of the Dirichlet distribution as

with inverse variance

and mean . This decomposition allows us to choose a Gamma prior for and a Dirichlet one with single natural parameter for the mean . Under this construction, we write the probability model as

 ~zt∼Dir(η,λ), (16)

where

 η ∼Ga(κ,ν), λ ∼Dir(β), (17)

with and .

We now need to obtain in a similar way as in (2). Hence, it is necessary to compute the predictive integral , which is given by an expression equivalent to (11). To compute this integral, we need , which reads as

 p(ηt,λt|rt−1,~Z(r)1:t−1)∝p(ηt)p(λt)rt−1∏τ=1p(~zτ|ηt,λt)=Ga(ηt|κ,ν)Dir(λt|β)rt−1∏τ=1Dir(~zτ|ηt,λt), (18)

where we have used the prior distributions defined above and considered the hyperparameters and fixed. Thus, the predictive integral becomes

 p(~zt|rt−1,~Z(r)1:t−1)=∫p(~zt|ηt,λt)p(ηt,λt|rt−1,~Z(r)1:t−1)dηtdλt. (19)

However, as we already pointed out, this integral is analytically intractable. Instead, we propose to solve it via Markov chain Monte Carlo (MCMC) methods, yielding

 p(~zt|rt−1,~Z(r)1:t−1)≈1SS∑s=1p(~zt|ηs,λs), (20)

where and are the th sample of and , with being the total number of samples. In particular, we use a Gibbs sampler to draw realizations from . The equations for the conditional probabilities are

 p(ηt|rt−1,~Z(r)1:t−1,λs−1t)∝p(ηt)p(~Z(r)1:t−1|rt−1,ηs−1t,λs−1t)=Ga(ηt|κ,ν)rt−1∏τ=1Dir(~zτ|ηs−1t,λs−1t), (21)

and

 p(λt|rt−1,~Z(r)1:t−1,ηs−1t)∝p(λt)p(~Z(r)1:t−1|rt−1,ηs−1t,λs−1t)=Dir(λt|β)rt−1∏τ=1Dir(~zτ|ηs−1t,λs−1t), (22)

where and are the realizations drawn in the previous iteration of the Gibbs sampler. Moreover, we propose to use the Gibbs-within-Metropolis Hastings sampler presented in Martino et al. (2015, 2018) since there is not direct way to get samples from the conditional distributions in (21) and (22). The main idea is to use the random-walk (RW) Metropolis-Hastings (MH) algorithm (Martino and Elvira, 2017) to generate samples from (22) since the distribution becomes very narrow when the number of latent classes, , is large. On the other hand, for the conditional in (21), it is possible to use a standard MH sampler with a Gamma proposal.

The complete change-point detection algorithm for the FPO simplified model, including the Gibbs sampler, is summarized in Algorithm 1.

#### 2.3.2 Exact Inference for Observed Point Estimates

To perform change-point detection for the second simplified model, we choose the likelihood function to be a categorical distribution with natural parameter . Similarly to the previous strategy, we place a Dirichlet prior on but with a single hyperparameter , that is,

 z⋆t∼Cat(π),π∼Dir(γ),

where and . Based on these priors, the predictive probabilities may be computed in closed form, which provides a significant reduction in the computational complexity and results in a simple method (See Algorithm 2). Actually, the simplified model is the most effective one in terms of computational cost, while at the same time the performance degradation is minimal. This is illustrated in Figure 4. Concretely, this figure shows the posterior probabilities of the latent variable (top row), the MAP estimates (middle row), and the outputs of the two simplified algorithms, as well as the ground truth. As can be seen in this example, the PEO algorithm detects the same change points, just with a slightly larger delay, but with a reduced computational complexity. The computational cost of the sampler also becomes prohibitive when the dimension of the latent space is large.

A final comment is in order. For the cases where is partially observed (e.g., only a few components of are lost), we are still able to compute the probabilities from the latent model, that is, the simplified models can still be used. Note that this differs from the previous scenario where the entire observation is missing. In this work, we assume that both types of incomplete values may appear for any dataset where we want to perform change-point detection. More details about the inference under missing entries on the set of observations are given in Section 3.

This section studies how to embed heterogeneous models into the hierarchical BOCPD technique presented before, as well as to handle periodic temporal structures. Concretely, it will become clear that this step addresses the issues regarding the heterogeneity and the unknown temporal structures of the data. We pay special attention to the application on human behavior characterization, where a ubiquitous circadian periodicity (24 hours) underlies every observation, but the ideas are valid for any dataset with known periodicity. In this context, there exists significant work related to capturing such circadian rhythms at different scales in time series. For example, its application to gene expression modeling (Hensman et al., 2013; Durrande et al., 2016) aims to identify the periodic dynamic cycles of cellular mechanisms in nature, a key point for the understanding of their hidden correlations.

To address all periodic dependencies, we shall arrange the data such that each sample for a given time step corresponds to the subset of observations in one period. Moreover, we allow our model to work with heterogenous objects by simply stacking observations from different statistical types on larger vectors , with being the number of heterogeneous data types. This essentially means that the sequence of observations summarizes consecutive periods of different heterogeneous measurements. Thus, for the description of one day, , we propose a heterogeneous mixture model for the likelihood of the form

 p(xt|zt,{ϕ1k,…,ϕMk}Kk=1)=K∏k=1M∏j=1p(xjt|ϕjk)I{zt=k}, (23)

which is formed by components, each one with its own parameters for every type of data with a given likelihood function , and the latent variables indicating which component is active. As can be seen in (23), we assume that given the class and the parameters, the different types observations, are independent. This representation is illustrated in Figure 5.

### 3.1 Circadian Gaussian-Bernoulli Mixture Model

Among heterogeneous statistical data types that may be incorporated to the daily representation , we consider two representative cases. We assume that each observation is composed by binary and real values, that is, where and . Here, denotes the dimensionality of each kind of observation and is given by a function of the period. For instance, if each component of represents the traveled distance during every hour, , which corresponds to one day, the period induced by the circadian rhythm.

As an example, in the following we focus on two types of observations, but the method is easily extended to more types of observations. Specifically, for the application that we have in mind in this paper, we propose to choose the composed likelihood

to be a mixture of Bernoulli and multivariate Gaussian distributions, that is,

 p(xt|zt=k,{ϕk}Kk=1)=p(xbint,xrealt|ϕk)=Ber(x%bint|μk)N(xrealt|0,Kk+D), (24)

where is the set of likelihood parameters for each latent class value . Note that we are exploiting the conditional independence of and to factorize across the two types of data. Here, are the means of the Bernoulli variables, is a common diagonal matrix to all latent clases, which is given by with , and is a positive-definite covariance matrix that corresponds to class . It is important to point out that the diagonal matrix

provides some heteroscedasticity to the model since we are assuming that the noise may vary within time inputs.

To capture the periodic (circadian) feature of , we employ a periodic covariance function, similar to the works in Durrande et al. (2016) and Solin and Särkkä (2014). The covariance matrix is therefore generated by a non-stationary periodic kernel, i.e., , where the time is assumed to be equally spaced (e.g, ). Further details about this function are provided in the following. Moreover, to obtain interesting insights, we propose that the distribution is generated using a hierarchy similarly to Hensman et al. (2013, 2015). Concretely, the mean is drawn from a zero-mean Gaussian, which yields

 f ∼N(0,Kk), xrealt|f ∼N(f,D). (25)

Note that the computation of can be avoided by marginalizing it out, and therefore, both expressions would be reduced to .

Following the standard approach used in mixture models, we need to compute the complete likelihood, which includes the prior probability on latent class indicators. By defining the prior distribution on class assignments

, (e.g., ), the complete likelihood at time is

 p(xt,zt=k|{ϕk}Kk=1)=p(zt=k)p(xt|zt=k,ϕk)=πkp(xt|ϕk).

Taking into account that observations are conditionally i.i.d., the complete joint log-likelihood becomes

 Lz,ϕ=logp(X1:t,z1:t|{ϕk,πk}Kk=1)=T∑t=1K∑k=1I{zt=k}logπk+T∑t=1K∑k=1I{zt=k}logp(xbint|μk)+T∑t=1K∑k=1I{zt=k}logp(xrealt|Kk,D). (26)

One key point of this circadian model is that if we needed to take observations at non-uniformly separated inputs , this approach would accept different for each in a straightforward manner.

### 3.2 Periodic Non-stationary Covariance Functions

The proposed model captures the circadian feature of data for each class through the temporal embedding and the periodic covariance functions , which must be non-stationary in our case. One possible solution to achieve non-stationarity in kernels is using an input-dependent mapping similar to the one used in Heinonen et al. (2016). Here, we have , where is a stationary periodic covariance function associated to class . Concretely, we take the periodic version of the exponential kernel (MacKay, 1998), which is given by

 ~kk(t−t′)=σ2akexp(−2sin2(π(|t−t′|/D)ℓ2k), (27)

and for , we use a squared Fourier series of order , with . This constraint imposes a limit on the smoothness and avoids overfitting. Thus, is

 sk(t)=(ak,02+C∑c=1[ak,ccos(2πcDt)+bk,csin(2πcDt)])2, (28)

where and are Fourier coefficients that parametrize the covariance matrix of the class, , together with the parameters of the exponential kernel, and .

To sum up, it is important to note that the defined positive function

allows multimodality along time. Moreover, we are simultaneously modeling the correlation among daily hours, which is presumed to be non-stationary, and, at the same time, capturing the individual variance of data along different moments of the period. This is confirmed by the experimental results presented in Section

4.

### 3.3 EM Algorithm for Heterogeneous Circadian Models with Missing Data

In this section we present an inference technique for the model described above. In particular, given the model and the complete log-likelihood in (26), we need to infer the assignments and the set of model parameters and hyperparameters , with .

Obtaining maximum likelihood estimates of our Gaussian-Bernoulli mixture-model parameters under heterogeneous likelihoods can be done through the expectation-maximization (EM) algorithm

(Dempster et al., 1977). However, there is a slight difference in our model. While the expectation step is computed from the complete mixture distribution, in the maximization step, it is possible to factorize and therefore to estimate the parameters of the Bernoulli and Gaussian distributions separately. This is a key point of the inference process, since as long as we introduce new likelihoods to the model, getting uncertainty measures about parameters will be always a set of independent tasks. Additionally, one important matter of using the EM algorithm here is that it allows us to handle missing components in any observation (Ghahramani and Jordan, 1994).

As previously mentioned, rather than dealing with sequences where we may find complete missing observations, sometimes we have to deal with just a few lost features, i.e., only a few components of are missing, making the vector incomplete. The proposed technique estimates the probabilities even if is only partially observed. Hence, we are interested in adapting the mixture model for dealing with missing data following a similar approach to Ghahramani and Jordan (1994), which computes expectations not only over latent variables but over missing values.

Concretely, the algorithm iterates between the following steps

 E-step: compute Q=Ez1:t,Xm1:t[Lz,ϕ] M-step: maximize Q w.r.t. ak,bk,σak,ℓk,σ maximize Q w.r.t. μ,π

where

denotes the matrix that collects the missing values of every observation. It is important to point out that there are no closed form expressions for the maximization w.r.t. the parameters of the Gaussian distribution. Thus, we use the conjugate gradient (CG) ascent method (See Appendix C). Finally, contrary to standard EM techniques, the proposed approach must combine closed-form estimators for the Bernoulli distribution parameters and numerical optimization for the hyperparameters of the periodic covariance function. Since the latter problem is highly non-convex, several initializations of the hyperparameters are required achieve a good (local) maximum.

## 4 Experiments

In this section, we demonstrate that our approach is capable of estimating change points properly in two different scenarios (one synthetic and one real) with heterogeneous datasets and high-dimensional observations in presence of strong periodic dependencies between samples. Concretely, only the algorithm based on the PEO model is used since the computational complexities of the full hierarchical approach and the FPO model are prohibitive.

### 4.1 Synthetic data

In our first experiment, we want to validate our hierarchical change-point detector on the embedded circadian model for a set of controlled synthetic data. Setting a categorical distribution as the likelihood distribution, that is, using the PEO simplified model, we generate a discrete sequence with instances and categorical values (future assignments) where four change points were introduced. Notice that this means that the sequence is divided into five partitions. Then, the generative parameters for each partition are generated by sampling where we fixed .

The resulting sequence corresponds to the latent classes in the corresponding hierarchical structure. Consequently, we generate pairs of multivariate Bernoulli-Gaussian samples where the covariance matrices are given by the non-stationary periodic kernel described in Section 3.2 and also considering a random selection of hyperparameters.

Note that, in our experiment we only observe the given stream of binary and real observations, never the sequence of latent classes . Then, assuming the Gaussian-Bernoulli mixture model from Section 3.1, we infer the posterior probabilities over class assignments by applying the EM algorithm previously described. For experimental purposes, we have truncated the number of Fourier coefficients to , while the data was generated with .

Once the resulting sequence of posterior probability vectors was obtained, we translated it into point-estimates by using a MAP estimator. Again, using the PEO simplified model, we performed hierarchical change-point detection as in Algorithm 2. We run the whole process for several classes in the Gaussian-Bernoulli mixture model and multiple initializations. Figure 6 shows how our approach detects change-points depending on the number of latent classes selected. Note that, as the number of categories decreases, we find a bit more delay mainly in the last two partitions. Our hypothesis is that as long as the circadian model gets larger amounts of latent classes (i.e., it is more representative), the change-point detection can differ better between adjacent partitions.

The interpretability of change-point detection diagrams is a key point of this paper (see Figures 3, 4 and 6). Usually, we represent the run length with highest probability at each time step . The resulting diagram shows ever-increasing run lengths while change-points are not detected. Note that ground truth plots always represent the exact instant of time when the change occurs. This is generally difficult to be detected without any sort of delay (i.e., a few samples from the post-change sequence must be observed to update the new distribution and then compare with the pre-change distribution). The usual case where this delay could be mostly reduced is given by the predictive probabilities in (11), that is, under extremely abrupt changing conditions, a new sample that does not fit with the old distribution (i.e. probability ), will give a very low probability and therefore will force the run-length to be zero, , immediately. In most cases, this does not occur so we look to the lowest value taken by that is usually a bit higher than zero. For example, in the last change-point detected in Figure 6, the maximum probability of at indicates , that is, there was a change-point 20 time steps before, at .

### 4.2 Anomalous Human Behavior

In this experiment, we evaluate our model on real data for medical applications. Since our work is highly motivated by the problem of change-point detection for personalized medicine, it is important to mention why detecting abnormal changes is a key milestone for the future of modern psychiatry.

#### 4.2.1 Change-point Detection in Psychiatry

Nearly all patients with one of the most prevalent mental health disorders such as schizophrenia or chronic depression are likely to suffer some sort of relapse in the next five years after their first episode. Despite the fact that significant early indicators may appear before an abrupt relapse, when recognizable symptoms are already present, it is usually too late for any preventive treatment.

The majority of these previous signs and triggering effects can be identified by continually monitoring changes in the patient’s behavior. Here, we refer to human behavior as the set of routines and patterns that are persistent during our daily life. In Eagle and Pentland (2009), it was demonstrated how these behavioral patterns are projected in the set of signals that we generate every day (i.e., mobility data, phone communications and logs of social interactions, etc.) and are mainly recognizable through the appearance of circadian features and periodicities. In this sense, the ubiquitous condition of smartphones and wearables, whose recording and memory capabilities are able to perform continuous monitoring of the aforementioned domains, opened a new window for analysing all these sources of information in chronic patients without the need of direct medical interventions. Our primary goal is to provide a robust solution to this paradigm by applying change-point detection on behavioral data obtained from monitored psychiatric patients.

#### 4.2.2 Mobility metrics

In our experiment, we used data which consists of location traces (latitude-longitude) recorded via smartphone during 275 consecutive days. It contains a bit more than 100,000 instances that corresponds to the user’s GPS coordinates every 3 minutes on average. We investigated how to obtain reliable metrics from these traces. Based on Canzian and Musolesi (2015), we considered two types of metrics: 1) a real-valued signal of the log-distance travelled per hour and 2) daily binary vectors of presence or absence at home.

The preprocessing step was particularly different for each one of the metrics. In the first case, we collected all location traces separately for every hour and every day. Once this was done, we calculated the approximated distance wandered between consecutive latitude-longitude pairs. It is important to remark that if the the maximum time without any location point (30 minutes) was exceeded for some hour, the total distance was considered as a missing value. For the binary case, we preprocessed all the locations in order to estimate patients home location, that is, the most usual latitude-longitude pair during noctural hours. To decide whether a patient was or not at home, we established a maximum distance range of 50 m. When one or more location traces show a relative distance to the estimated home position equal or lower than this range, then we assume that at this hour the patient was at home. For the hours with no location traces, we considered them also as missing data.

After the preprocessing step, we can see both signals in Figure 7 (upper plots), which correspond to the observations in the circadian model. For the binary metric, we have 24 hours length vectors for each day where at-home (black) and not-at-home (white) indicators are showed in contrast to the missing values (red). Note that it is especially easy to identify patterns and routines in this type of representations (Eagle and Pentland, 2009). An important detail to remark is that missing traces are mainly concentrated around night hours. This may occur when sensors run out of battery during night, which a priori could make sense. Moreover, visualizing segments of diurnal features across the sequence of days is interesting as long as we can infer the routine structure of consecutive working days and weekend periods.

#### 4.2.3 Detection of Behavioral changes

The results of the EM algorithm after observing the sequence of heterogeneous observations is shown in Figure 7. Similarly to the previous synthetic experiment presented in Section 4.1, the main objective is to obtain posterior probabilities . Following the simplied PEO model, we set point-estimates as the sequence of observed latent variables, see third row in Figure 7.

Results using classes, this is, a large number of day profiles, show that the second and fourth profiles are the most frequent. Looking at the detected change points with the PEO simplified model, we can see that abrupt transitions are produced by sudden changes in the proportions of latent classes. The hierarchical detector is capturing that in the first partition, the natural parameters indicate that the fourth and fifth profiles mainly dominate the routine. However, after a small secondary partition given by an unusual combination of classes, the third partition shows a new pattern of behavior where the second profile is the most frequent one.

One of the strengths of the proposed latent model is its ability to identify the 24 hours periodic cycles from the set of mobility metrics, mainly by estimating the Fourier coefficients of the covariance functions for the wandered distance values. In this process, after fitting all hyperparameters in the maximization step during the inference, we are able to reproduce the patterns that describe how a person usually behaves during each type of day. In the left column of Figure 8, we represent the set of (multimodal) periodic functions generated from the squared Fourier series, , for every profile, that is, each latent class . Moreover, we plot in the right-hand side column of the same figure the estimated vectors of probability of the Bernoulli likelihood, i.e., the probability of being at home for every type of day at different hourly frames.

The input dependent mapping can be understood by the reader as a periodic function that models mobility fluctuations during the whole day. That is, we may assume that large values of correspond to patterns with larger displacements and, on the contrary, values near zero mean no mobility at the given hour for that particular profile. We want to remark the interesting connection between resting states (at home) and profiles of wandered distance. Despite the fact that binary observations suffer from extremely high percentages of missing values during nocturnal hours, it is possible to infer if the patient’s mobility variance correspond or not to places near his/her home.

This circadian phenotypes also provide interesting information for psychiatric use. On one hand, the most likely profiles, such as and (see Figure 7, third row), are easily interpreted as routine days, probably working type ones, as can be seen in Figure 8. For instance, Profile 2 has most of its activity at commuting hours. On the other, other profiles, such as or , may be considered as leisure patterns since the activity is distributed more evenly during the day, as shown in Figure 8. This sort of analysis is particularly important because it demonstrates that heterogeneous circadian models are able to capture the circadian cycles of individuals during their daily life.

## 5 Conclusions

In this paper we have proposed a novel generalization of the Bayesian change-point algorithm (Adams and MacKay, 2007) to handle heterogeneous high-dimensional observations with unknown periodic structure. The main contribution is to introduce a hierarchical model for the underlying probabilities of the detector.

Additionally, we present a new probabilistic approach for detecting change-points from sequences with multiple missing values. This solution can be used in any variation of the BOCPD method (Saatçi et al., 2010; Turner et al., 2009, 2013) and is specially useful when using latent variable models. In addition, we include a circadian model that captures the periodic structure of the data, mainly focusing on the circadian rhythm.

The code is publicly available in the repository https://github.com/pmorenoz/HierCPD and it is fully written in Python. Thus, we expect the hierarchical detector to be utilised by researchers to detect change points for any possible latent variable model. In addition, we also include the heterogeneous circadian mixture model as well as the EM inference and the periodic non-stationary covariance functions.

The authors would like to thank Luca Martino for his useful comments and discussion. This work has been partly supported by Ministerio de Economía of Spain under projects: OTOSIS (TEC2013-41718-R), AID (TEC2014-62194-EXP), and the COMONSENS Network (TEC2015-69648-REDC), by the Ministerio de Economía of Spain jointly with the European Commission (ERDF) under projects ADVENTURE (TEC2015-69868-C2-1-R) and CAIMAN (TEC2017-86921-C2-2-R), and by the Comunidad de Madrid under project CASI-CAM-CM (S2013/ICE-2845). The work of P. Moreno-Muñoz has been supported by FPI grant BES-2016-077626.

## Appendix A. Hierarchical BOCPD Algorithm

The recursive decomposition of the joint probability distribution

is based on the original derivation of Adams and MacKay (2007) and can obtained as follows.

The joint distribution can be easily expanded by marginalizing over all values of the previous run length , that is,

 p(rt,Z1:t,X1:t,θt) = ∑rt−1p(rt,rt−1,Z1:t,X1:t,θt) (29) = ∑rt−1p(rt,rt−1,zt,Z1:t−1,xt,X1:t−1,θt),

where we have separated and to simplify the derivation. The last term can be also expressed as

 p(rt,rt−1,zt,Z1:t−1,xt,X1:t−1,θt)=p(rt,zt,xt,θt|rt−1,Z1:t−1,X1:t−1)p(rt−1,z1:t−1,X1:t−1).

where the right-hand side term may be obtained as

 p(rt−1,Z1:t−1,X1:t−1)=∫p(rt−1,Z1:t−1,X1:t−1,θt−1)dθt−1,

with being the factorized joint probability distribution at .

Since the current run length is only conditioned by its previous value , the previous expression can be written down as

 p(rt,z1:t,X1:t,θt)=∑rt−1p(rt|rt−1)p(zt,xt,θt|rt−1,Z1:t−1,X1:t−1)p(rt−1,z1:t−1,X1:t−1),

where the conditional distribution is the change-point prior (see Section 2). Note that useless conditioned variables have been omitted. At the same time, we may decompose

 p(zt,xt,θt|rt−1,Z1:t−1,X1:t−1)=p(xt|zt)p(zt|θt)p(θt|rt−1,Z1:t−1),

where we have taken into account that heterogeneous high-dimensional observations are only conditioned by its latent representation , which is modeled by the likelihood term given the posterior distribution on the parameters.

The resulting recursive expression is

 p(rt,z1:t,X1:t,θt)=∑rt−1p(rt|rt−1)p(xt|zt)p(zt|θt)p(θt|rt−1,Z1:t−1)p(rt−1,z1:t−1,X1:t−1),

and can be calculated sequentially at each time step .

## Appendix B. Heterogeneous Circadian Mixture Model

#### Gaussian Likelihood with Missing Data

The Gaussian distribution requires to handle missing or partial observations in this problem, based on Ghahramani and Jordan (1994), the likelihood is as follows,

 p(xreali|θk)=p(xo,reali,xm,reali|θk)=N(</