Objective Bayesian Analysis for Change Point Problems

by   Laurentiu Hinoveanu, et al.
University of Kent

In this paper we present an objective approach to change point analysis. In particular, we look at the problem from two perspectives. The first focuses on the definition of an objective prior when the number of change points is known a priori. The second contribution aims to estimate the number of change points by using an objective approach, recently introduced in the literature, based on losses. The latter considers change point estimation as a model selection exercise. We show the performance of the proposed approach on simulated data and on real data sets.




1 Introduction

There are several practical scenarios where it is inappropriate to assume that the distribution of the observations does not change. For example, financial data sets can exhibit alternating behaviours due to crisis periods. In this case it is sensible to assume changes in the underlying distribution. The change in the distribution can be either in the value of one or more of the parameters or, more generally, in the family of the distribution. In the latter case, for example, one may deem it appropriate to consider a normal density for the stagnation periods, while a Student-$t$, with relatively heavy tails, may be more suitable to represent observations in the more turbulent stages of a crisis. The task of identifying if, and when, one or more changes have occurred is not trivial and requires appropriate methods to avoid detection of a large number of changes or, at the opposite extreme, seeing no changes at all. The change point problem has been deeply studied from a Bayesian point of view. Chernoff and Zacks (1964) focused on the change in the means of normally distributed variables.

Smith (1975) looked into the single change point problem when different knowledge of the parameters of the underlying distributions is available: all known, some of them known or none of them known. Smith (1975) focuses on the binomial and normal distributions. In Muliere and Scarsini (1985) the problem is tackled from a Bayesian nonparametric perspective. The authors consider Dirichlet processes with independent base measures as underlying distributions. In this framework, Petrone and Raftery (1997) have shown that the Dirichlet process prior could have a strong effect on the inference and may lead to wrong conclusions in the case of a single change point. Raftery and Akman (1986) have approached the single change point problem in the context of a Poisson likelihood under both proper and improper priors for the model parameters. Carlin et al. (1992) build on the work of Raftery and Akman (1986) by considering a two level hierarchical model. Both papers illustrate the respective approaches by studying the well-known British coal-mining disaster data set. In the context of multiple change point detection, Loschi and Cruz (2005) have provided a fully Bayesian treatment for the product partitions model of Barry and Hartigan (1992). Their application focused on stock exchange data. Stephens (1994) has extended the Gibbs sampler introduced by Carlin et al. (1992) in the change point literature to handle multiple change points. Hannart and Naveau (2009) have used Bayesian decision theory, in particular 0-1 cost functions, to estimate multiple changes in homoskedastic normally distributed observations. Schwaller and Robin (2017) extend the product partition model of Barry and Hartigan (1992) by adding a graphical structure which could capture the dependencies between multivariate observations. Fearnhead and Liu (2007) proposed a filtering algorithm for the sequential multiple change point detection problem in the case of piecewise regression models. Henderson and Matthews (1993) introduced a partial Bayesian approach which involves the use of a profile likelihood, where the aim is to detect multiple changes in the mean of Poisson distributions with an application to haemolytic uraemic syndrome (HUS) data. The same data set was studied by Tian et al. (2009), who proposed a method which treats the change points as latent variables. Ko et al. (2015) have proposed an extension to the hidden Markov model of Chib (1998) by using a Dirichlet process prior on each row of the regime matrix. Their model is semiparametric, as the number of states is not specified in advance, but it grows according to the data size. Heard and Turcotte (2017) have proposed a new sequential Monte Carlo algorithm to infer multiple change points.

Whilst the literature covering change point analysis from a Bayesian perspective is vast when prior distributions are elicited, the documentation referring to analysis under minimal prior information is limited, see Moreno et al. (2005) and Girón et al. (2007). The former paper discusses the single change point problem in a model selection setting, whilst the latter paper, which is an extension of the former, tackles the multivariate change point problem in the context of linear regression models. Our work aims to contribute to the methodology for change point analysis under the assumption that the information about the number of change points and their location is minimal. First, we discuss the definition of an objective prior for change point location, both for single and multiple changes, assuming the number of changes is known a priori. Then, we define a prior on the number of change points via a model selection approach. Here, we assume that the change point coincides with one of the observations. As such, given $n$ data points, the change point location is discrete. To the best of our knowledge, the sole general objective approach to define prior distributions on discrete spaces is the one introduced by Villa and Walker (2015a).

To illustrate the idea, consider a probability distribution $f(x \mid m)$, where $m \in \mathbb{M}$ is a discrete parameter. Then, the prior $\pi(m)$ is obtained by objectively measuring what is lost if the value $m$ is removed from the parameter space when it is the true value. According to Berk (1966), if a model is misspecified, the posterior distribution asymptotically accumulates on the model which is the most similar to the true one, where the similarity is measured in terms of the Kullback–Leibler (KL) divergence. Therefore, $D_{KL}\big(f(\cdot \mid m) \,\|\, f(\cdot \mid m')\big)$, where $m'$ is the parameter characterising the nearest model to $f(x \mid m)$, represents the utility of keeping $m$. The objective prior is then obtained by linking the aforementioned utility via the self-information loss:

$$\pi(m) \propto \exp\left\{ \min_{m' \neq m} D_{KL}\big(f(\cdot \mid m) \,\|\, f(\cdot \mid m')\big) \right\} - 1, \qquad (1)$$

where the Kullback–Leibler divergence (Kullback and Leibler, 1951) from the sampling distribution with density $f$ to the one with density $g$ is defined as:

$$D_{KL}(f \,\|\, g) = \int f(x) \log \frac{f(x)}{g(x)} \, dx.$$
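Throughout the paper, the objective prior defined in equation (1) will be referenced as the loss-based prior. This approach is used to define an objective prior distribution when the number of change points is known a priori. To obtain a prior distribution for the number of change points, we adopt a model selection approach based on the results in Villa and Walker (2015b), where a method to define a prior on the space of models is proposed. To illustrate, let us consider $k$ Bayesian models:

$$M_j = \{ f_j(x \mid \theta_j), \, \pi_j(\theta_j) \}, \qquad j \in \{1, 2, \dots, k\},$$

where $f_j(x \mid \theta_j)$ is the sampling density characterised by $\theta_j$ and $\pi_j(\theta_j)$ represents the prior on the model parameter.

Assuming the prior on the model parameter, $\pi_j(\theta_j)$, is proper, the model prior probability $\Pr(M_j)$ is proportional to the expected minimum Kullback–Leibler divergence from $M_j$, where the expectation is considered with respect to $\pi_j(\theta_j)$. That is:

$$\Pr(M_j) \propto \exp\left\{ E_{\pi_j}\left[ \inf_{\theta_i, \, i \neq j} D_{KL}\big(f_j(x \mid \theta_j) \,\|\, f_i(x \mid \theta_i)\big) \right] \right\}, \qquad j = 1, \dots, k. \qquad (3)$$

The model prior probabilities defined in equation (3) can be employed to derive the model posterior probabilities through:

$$\Pr(M_i \mid x) = \left[ 1 + \sum_{j \neq i} \frac{\Pr(M_j)}{\Pr(M_i)} B_{ji} \right]^{-1}, \qquad (4)$$

where $B_{ji}$ is the Bayes factor between model $M_j$ and model $M_i$, defined as

$$B_{ji} = \frac{\int f_j(x \mid \theta_j) \, \pi_j(\theta_j) \, d\theta_j}{\int f_i(x \mid \theta_i) \, \pi_i(\theta_i) \, d\theta_i},$$

with $j \neq i \in \{1, 2, \dots, k\}$.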

This paper is structured as follows: in Section 2 we establish the way we set objective priors on both single and multiple change point locations. Section 3 shows how we define the model prior probabilities for the number of change point locations. Illustrations of the model selection exercise are provided in Sections 4 and 5, where we work with simulated and real data, respectively. Section 6 is dedicated to final remarks.

2 Objective Prior on the Change Point Locations

This section is devoted to the derivation of the loss-based prior when the number of change points is known a priori. Specifically, let $k$ be the number of change points and $m_1 < m_2 < \dots < m_k$ their locations. We introduce the idea in the simple case where we assume that there is only one change point in the data set (see Section 2.1). Then, we extend the results to the more general case where multiple change points are assumed (see Section 2.2).

A well-known objective prior for finite parameter spaces, in cases where there is no structure, is the uniform prior (Berger et al., 2012). As such, a natural choice for the prior on the change point locations is the uniform (Koop and Potter, 2009). The corresponding loss-based prior is indeed the uniform, as shown below, which is a reassuring result as the objective prior for a specific parameter space, if it exists, should be unique.

2.1 Single Change Point

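As mentioned above, we show that the loss-based prior for the single change point case coincides with the discrete uniform distribution over the set $\{1, 2, \dots, n-1\}$.

Let $X^{(n)} = (X_1, \dots, X_n)$ denote an $n$-dimensional vector of random variables, representing the random sample, and $m$ be our single change point location, that is $m \in \{1, 2, \dots, n-1\}$, such that

$$X_1, \dots, X_m \mid \tilde\theta_1 \sim f_1(\cdot \mid \tilde\theta_1), \qquad X_{m+1}, \dots, X_n \mid \tilde\theta_2 \sim f_2(\cdot \mid \tilde\theta_2).$$

Note that we assume that there is a change point in the series, as such the space of $m$ does not include the case $m = n$. In addition, we assume that $\tilde\theta_1 \neq \tilde\theta_2$ when $f_1 = f_2$. The sampling density for the vector of observations $x^{(n)} = (x_1, \dots, x_n)$ is:

$$f(x^{(n)} \mid m, \tilde\theta_1, \tilde\theta_2) = \prod_{i=1}^{m} f_1(x_i \mid \tilde\theta_1) \prod_{i=m+1}^{n} f_2(x_i \mid \tilde\theta_2).$$

Let $m' \neq m$. Then, the Kullback–Leibler divergence between the model parametrised by $m$ and the one parametrised by $m'$ is:

$$D_{KL}\big(f(x^{(n)} \mid m, \tilde\theta_1, \tilde\theta_2) \,\|\, f(x^{(n)} \mid m', \tilde\theta_1, \tilde\theta_2)\big) = \int f(x^{(n)} \mid m, \tilde\theta_1, \tilde\theta_2) \log \frac{f(x^{(n)} \mid m, \tilde\theta_1, \tilde\theta_2)}{f(x^{(n)} \mid m', \tilde\theta_1, \tilde\theta_2)} \, dx^{(n)}.$$

Without loss of generality, consider $m < m'$. In this case, note that

$$\frac{f(x^{(n)} \mid m, \tilde\theta_1, \tilde\theta_2)}{f(x^{(n)} \mid m', \tilde\theta_1, \tilde\theta_2)} = \prod_{i=m+1}^{m'} \frac{f_2(x_i \mid \tilde\theta_2)}{f_1(x_i \mid \tilde\theta_1)},$$

leading to

$$D_{KL}\big(f(x^{(n)} \mid m, \tilde\theta_1, \tilde\theta_2) \,\|\, f(x^{(n)} \mid m', \tilde\theta_1, \tilde\theta_2)\big) = \sum_{i=m+1}^{m'} \int f_2(x_i \mid \tilde\theta_2) \log \frac{f_2(x_i \mid \tilde\theta_2)}{f_1(x_i \mid \tilde\theta_1)} \, dx_i. \qquad (8)$$

On the right hand side of equation (8), we can recognise the Kullback–Leibler divergence from density $f_2$ to density $f_1$, thus getting:

$$D_{KL}\big(f(x^{(n)} \mid m, \tilde\theta_1, \tilde\theta_2) \,\|\, f(x^{(n)} \mid m', \tilde\theta_1, \tilde\theta_2)\big) = (m' - m) \, D_{KL}\big(f_2(\cdot \mid \tilde\theta_2) \,\|\, f_1(\cdot \mid \tilde\theta_1)\big). \qquad (9)$$

In a similar fashion, when $m > m'$, we have that:

$$D_{KL}\big(f(x^{(n)} \mid m, \tilde\theta_1, \tilde\theta_2) \,\|\, f(x^{(n)} \mid m', \tilde\theta_1, \tilde\theta_2)\big) = (m - m') \, D_{KL}\big(f_1(\cdot \mid \tilde\theta_1) \,\|\, f_2(\cdot \mid \tilde\theta_2)\big). \qquad (10)$$

In this single change point scenario, we can consider $m'$ as a perturbation of the change point location $m$, that is $m' = m \pm l$ where $l \in \mathbb{N}^{*}$, such that $1 \le m' \le n-1$. Then, taking into account equations (9) and (10), the Kullback–Leibler divergence becomes:

$$D_{KL}\big(f(x^{(n)} \mid m, \tilde\theta_1, \tilde\theta_2) \,\|\, f(x^{(n)} \mid m', \tilde\theta_1, \tilde\theta_2)\big) = \begin{cases} l \, D_{KL}\big(f_2(\cdot \mid \tilde\theta_2) \,\|\, f_1(\cdot \mid \tilde\theta_1)\big), & m < m', \\ l \, D_{KL}\big(f_1(\cdot \mid \tilde\theta_1) \,\|\, f_2(\cdot \mid \tilde\theta_2)\big), & m > m', \end{cases}$$

which is smallest for $l = 1$, so that

$$\min_{m' \neq m} D_{KL}\big(f(x^{(n)} \mid m, \tilde\theta_1, \tilde\theta_2) \,\|\, f(x^{(n)} \mid m', \tilde\theta_1, \tilde\theta_2)\big) = \min\left\{ D_{KL}\big(f_2(\cdot \mid \tilde\theta_2) \,\|\, f_1(\cdot \mid \tilde\theta_1)\big), \, D_{KL}\big(f_1(\cdot \mid \tilde\theta_1) \,\|\, f_2(\cdot \mid \tilde\theta_2)\big) \right\}. \qquad (11)$$

We observe that equation (11) is only a function of $\tilde\theta_1$ and $\tilde\theta_2$ and does not depend on $m$. Thus, $\pi(m) \propto 1$ and, therefore,

$$\pi(m) = \frac{1}{n-1}, \qquad m \in \{1, 2, \dots, n-1\}. \qquad (12)$$

This prior was used, for instance, in an econometric context by Koop and Potter (2009) with the rationale of giving equal weight to every possible change point location.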
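The argument of this section can be checked numerically. The sketch below is an illustration under stated assumptions, not the authors' code: both segments are taken to be unit-variance normal densities with hypothetical means `mu1` and `mu2`, for which the divergence between the two segment densities is the same in both directions. The divergence between two change point locations is accumulated observation by observation, and the loss-based weight exp{min KL} minus 1 is then normalised.

```python
import math

def kl_normal(mu_a, mu_b, sigma=1.0):
    """KL divergence between two normals with common standard deviation."""
    return (mu_a - mu_b) ** 2 / (2.0 * sigma ** 2)

def kl_between_locations(m, m_alt, n, mu1, mu2):
    """KL from the model with change point m to the one with change point m_alt.

    Observations are independent, so the divergence is the sum of the
    per-observation divergences between the densities each model assigns.
    """
    total = 0.0
    for i in range(1, n + 1):
        mean_ref = mu1 if i <= m else mu2
        mean_alt = mu1 if i <= m_alt else mu2
        total += kl_normal(mean_ref, mean_alt)
    return total

def change_point_prior(n, mu1, mu2):
    """Loss-based prior on m in {1, ..., n-1}; requires n >= 3."""
    locations = range(1, n)
    weights = []
    for m in locations:
        nearest = min(kl_between_locations(m, alt, n, mu1, mu2)
                      for alt in locations if alt != m)
        weights.append(math.exp(nearest) - 1.0)
    total = sum(weights)
    return [w / total for w in weights]

prior = change_point_prior(n=8, mu1=0.0, mu2=1.5)
print([round(p, 4) for p in prior])   # seven equal masses, 1/(n-1) each
```

Moving the change point by two positions doubles the divergence, and the nearest competitor is always one position away, so every location receives the same weight.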

2.2 Multivariate Change Point Problem

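In this section, we address the change point problem in its generality by assuming that there are $k$ change points. In particular, for the data $x^{(n)}$, we consider the following sampling distribution

$$f(x^{(n)} \mid \mathbf{m}, \tilde\theta) = \prod_{i=1}^{m_1} f_1(x_i \mid \tilde\theta_1) \prod_{j=2}^{k} \prod_{i=m_{j-1}+1}^{m_j} f_j(x_i \mid \tilde\theta_j) \prod_{i=m_k+1}^{n} f_{k+1}(x_i \mid \tilde\theta_{k+1}), \qquad (13)$$

where $\mathbf{m} = (m_1, \dots, m_k)$, $1 \le m_1 < m_2 < \dots < m_k < n$, is the vector of the change point locations and $\tilde\theta = (\tilde\theta_1, \dots, \tilde\theta_{k+1})$ is the vector of the parameters of the underlying probability distributions. Schematically:

$$X_1, \dots, X_{m_1} \mid \tilde\theta_1 \sim f_1(\cdot \mid \tilde\theta_1), \quad X_{m_1+1}, \dots, X_{m_2} \mid \tilde\theta_2 \sim f_2(\cdot \mid \tilde\theta_2), \quad \dots, \quad X_{m_k+1}, \dots, X_n \mid \tilde\theta_{k+1} \sim f_{k+1}(\cdot \mid \tilde\theta_{k+1}).$$

If $f_1 = f_2 = \dots = f_{k+1}$, then it is reasonable to assume that some of the $\tilde\theta$'s are different. Without loss of generality, we assume that $\tilde\theta_1 \neq \tilde\theta_2 \neq \dots \neq \tilde\theta_{k+1}$. In a similar fashion to the single change point case, we cannot assume $m_k = n$ since we require exactly $k$ change points.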

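In this case, due to the multivariate nature of the vector $\mathbf{m} = (m_1, \dots, m_k)$, the derivation of the loss-based prior is not as straightforward as in the one dimensional case. In fact, the derivation of the prior is based on heuristic considerations supported by Theorem 1 below (the proof of which is in the Appendix). In particular, we are able to prove an analogue of equations (9) and (10) when only one component is arbitrarily perturbed. Let us define the following functions:

$$d_j^{+1}(\tilde\theta) = D_{KL}\big(f_{j+1}(\cdot \mid \tilde\theta_{j+1}) \,\|\, f_j(\cdot \mid \tilde\theta_j)\big), \qquad d_j^{-1}(\tilde\theta) = D_{KL}\big(f_j(\cdot \mid \tilde\theta_j) \,\|\, f_{j+1}(\cdot \mid \tilde\theta_{j+1})\big),$$

where $j \in \{1, \dots, k\}$. The following Theorem is useful to understand the behaviour of the loss-based prior in the general case.

Theorem 1.

Let $f(x^{(n)} \mid \mathbf{m}, \tilde\theta)$ be the sampling distribution defined in equation (13) and consider $j \in \{1, \dots, k\}$. Let $\mathbf{m}'$ be such that $m'_i = m_i$ for $i \neq j$, and let the component $m'_j$ be such that $m'_j \neq m_j$ and $m_{j-1} < m'_j < m_{j+1}$. Therefore,

$$D_{KL}\big(f(x^{(n)} \mid \mathbf{m}, \tilde\theta) \,\|\, f(x^{(n)} \mid \mathbf{m}', \tilde\theta)\big) = |m'_j - m_j| \, d_j^{S}(\tilde\theta),$$

where $S = \mathrm{sgn}(m'_j - m_j)$.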

Note that Theorem 1 states that the minimum Kullback–Leibler divergence is achieved when $m'_j = m_j + 1$ or $m'_j = m_j - 1$. This result is not surprising since the Kullback–Leibler divergence measures the degree of similarity between two distributions. The smaller the perturbation caused by changes in one of the parameters is, the smaller the Kullback–Leibler divergence between the two distributions is. Although Theorem 1 makes a partial statement about the multiple change points scenario, it provides a strong argument for supporting the uniform prior. Indeed, if now we consider the general case of having $k$ change points, it is straightforward to see that the Kullback–Leibler divergence is minimised when only one of the components of the vector $\mathbf{m}$ is perturbed by (plus or minus) one unit. As such, the loss-based prior depends on the vector of parameters $\tilde\theta$ only, as in the one-dimensional case, yielding the uniform prior for $\mathbf{m}$.

Therefore, the loss-based prior on the multivariate change point location is

$$\pi(m_1, \dots, m_k) = \binom{n-1}{k}^{-1}, \qquad (14)$$

where $1 \le m_1 < m_2 < \dots < m_k \le n-1$. The denominator in equation (14) has the above form because, for every number of change points $k$, we are interested in the number of $k$-subsets from a set of $n-1$ elements, which is $\binom{n-1}{k}$. The same prior was also derived in a different way by Girón et al. (2007).
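The counting argument behind the normalising constant can be verified by brute force. The sketch below enumerates every ordered configuration of $k$ change points among the $n-1$ admissible locations and compares the count with the binomial coefficient; the function names and the small values of `n` and `k` are chosen purely for illustration.

```python
from itertools import combinations
from math import comb

def change_point_configurations(n, k):
    """All vectors (m_1 < ... < m_k) with entries in {1, ..., n-1}."""
    return list(combinations(range(1, n), k))

def uniform_location_prior(n, k):
    """Mass assigned to each configuration by the uniform prior."""
    return 1.0 / comb(n - 1, k)

n, k = 6, 2
configurations = change_point_configurations(n, k)
print(len(configurations), comb(n - 1, k))   # both counts agree
print(uniform_location_prior(n, k))
```

For long series the enumeration is impractical, but the closed form `comb(n - 1, k)` remains cheap to evaluate.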

3 Loss-based Prior on the Number of Change Points

Here, we approach the change point analysis as a model selection problem. In particular, we define a prior on the space of models, where each model represents a certain number of change points (including the case of no change points). The method adopted to define the prior on the space of models is the one introduced in Villa and Walker (2015b).

Figure 1: Diagram showing the way we specify our models. The arrows indicate that the respective change point locations remain fixed from the previous model to the current one.

We proceed as follows. Assume we have to select from $k+1$ possible models. Let $M_0$ be the model with no change points, $M_1$ the model with one change point and so on. Generalising, model $M_k$ corresponds to the model with $k$ change points. The idea is that the current model encompasses the change point locations of the previous model. As an example, in model $M_3$ the first two change point locations will be the same as in the case of model $M_2$. To illustrate the way we envision our models, we have provided Figure 1. It has to be noted that the construction of the possible models from $M_0$ to $M_k$ can be done in a different way from the one described here. Obviously, the approach to define the model priors stays unchanged. Consistently with the notation used in Section 1, $\theta_k = (\tilde\theta_1, \dots, \tilde\theta_{k+1}, m_1, \dots, m_k)$ represents the vector of parameters of model $M_k$, where $\tilde\theta_1, \dots, \tilde\theta_{k+1}$ are the model specific parameters and $m_1, \dots, m_k$ are the change point locations, as in Figure 1.

Based on the way we have specified our models, which are in direct correspondence with the number of change points and their locations, we state Theorem 2 (the proof of which is in the Appendix).

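Theorem 2.

Let $D_{KL}(M_i \,\|\, M_j)$ denote the Kullback–Leibler divergence from the sampling density of model $M_i$ to the sampling density of model $M_j$. For any integers $0 \le i < j \le k$, with $k < n$, and the convention $m_{j+1} = n$, we have the following:

$$D_{KL}(M_i \,\|\, M_j) = \sum_{q=i+1}^{j} (m_{q+1} - m_q) \, D_{KL}\big(f_{i+1}(\cdot \mid \tilde\theta_{i+1}) \,\|\, f_{q+1}(\cdot \mid \tilde\theta_{q+1})\big),$$

$$D_{KL}(M_j \,\|\, M_i) = \sum_{q=i+1}^{j} (m_{q+1} - m_q) \, D_{KL}\big(f_{q+1}(\cdot \mid \tilde\theta_{q+1}) \,\|\, f_{i+1}(\cdot \mid \tilde\theta_{i+1})\big).$$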

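The result in Theorem 2 is useful when the model selection exercise is implemented. Indeed, the Villa and Walker (2015b) approach requires the computation of the Kullback–Leibler divergences in Theorem 2. Recalling equation (3), the objective model prior probabilities are then given by:

$$\Pr(M_j) \propto \exp\left\{ E_{\pi_j}\left[ \inf_{\theta_i, \, i \neq j} D_{KL}\big(f(x \mid \theta_j) \,\|\, f(x \mid \theta_i)\big) \right] \right\}, \qquad j = 0, 1, \dots, k. \qquad (15)$$

For illustrative purposes, in the Appendix we derive the model prior probabilities to perform model selection among $M_0$, $M_1$ and $M_2$.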

It is easy to infer from equation (15) that model priors depend on the prior distribution assigned to the model parameters, that is on the level of uncertainty that we have about their true values. For the change point location, a sensible choice is the uniform prior which, as shown in Section 2, corresponds to the loss-based prior. For the model specific parameters, we have several options. If one wishes to pursue an objective analysis, intrinsic priors (Berger and Pericchi, 1996) may represent a viable solution since they are proper. Nonetheless, the method introduced by Villa and Walker (2015b) does not require, in principle, an objective choice as long as the priors are proper. Given that we use the latter approach, here we consider subjective priors for the model specific parameters.

Remark. In the case where the changes in the underlying sampling distribution are limited to the parameter values, the model prior probabilities defined in (15) follow the uniform distribution. That is, $\Pr(M_j) \propto 1$ for every $j$. In the real data example illustrated in Section 5.1, we indeed consider a problem where the above case occurs.

3.1 A special case: selection between $M_0$ and $M_1$

Let us consider the case where we have to estimate whether or not there is a change point in a set of observations. This implies that we have to choose between model $M_0$ (i.e. no change point) and $M_1$ (i.e. one change point). Following our approach, we have:




Now, let us assume independence between the prior on the change point location and the prior on the parameters of the underlying sampling distributions, that is $\pi(m_1, \tilde\theta_1, \tilde\theta_2) = \pi(m_1) \, \pi(\tilde\theta_1, \tilde\theta_2)$. Let us further recall that, as per equation (14), $\pi(m_1) = 1/(n-1)$. As such, we observe that the model prior probability on $M_1$ becomes:


We notice that the model prior probability for model $M_1$ increases with the sample size. This behaviour occurs whether or not there is a change point in the data. We propose to address the above problem by using a non-uniform prior for $m_1$. A reasonable alternative, which works quite well in practice, would be the following shifted binomial as prior:


To motivate the choice of (19), we note that, as $n$ increases, the probability mass will be more and more concentrated towards the upper end of the support. Therefore, from equations (17) and (19) it follows that:


For the more general case where we consider more than two models, the problem highlighted in equation (18) vanishes.
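The sample size effect described in this subsection can be illustrated with a short sketch. It rests on two loudly stated assumptions that are a reading of the text rather than a transcription of the paper's equations: that $\Pr(M_0) \propto 1$, and that the exponent in $\Pr(M_1)$ equals a constant `c` times the expected number of observations after the change point. The constant `c` stands in for the expected minimum per-observation divergence and its value is hypothetical.

```python
import math

def expected_tail_length(n):
    """E[n - m1] when m1 is uniform on {1, ..., n-1}."""
    return sum(n - m for m in range(1, n)) / (n - 1)

def prior_m1(n, c):
    """Pr(M1) under the assumed form Pr(M0) ∝ 1, Pr(M1) ∝ exp(c * E[n - m1])."""
    weight_m1 = math.exp(c * expected_tail_length(n))
    return weight_m1 / (1.0 + weight_m1)

c = 0.01  # hypothetical per-observation expected minimum divergence
for n in (10, 100, 1000):
    print(n, expected_tail_length(n), round(prior_m1(n, c), 4))
```

Under the uniform prior the expected tail length equals $n/2$, so the exponent grows linearly in $n$ and the prior mass on $M_1$ approaches one regardless of the data, which is the behaviour a prior concentrating near the upper end of the support is meant to avoid.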

4 Change Point Analysis on Simulated Data

In this section, we present the results of several simulation studies based on the methodologies discussed in Sections 2 and 3. We start with a scenario involving discrete distributions in the context of the one change point problem. We then show the results obtained when we consider continuous distributions for the case of two change points. The choice of the underlying sampling distributions is in line with Villa and Walker (2015b).

4.1 Single sample

Scenario 1.

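The first scenario concerns the choice between models $M_0$ and $M_1$. Specifically, for $M_0$ we have:

$$X_1, \dots, X_n \mid \tilde\theta_1 \sim f_1(\cdot \mid \tilde\theta_1),$$

and for $M_1$ we have:

$$X_1, \dots, X_{m_1} \mid \tilde\theta_1 \sim f_1(\cdot \mid \tilde\theta_1), \qquad X_{m_1+1}, \dots, X_n \mid \tilde\theta_2 \sim f_2(\cdot \mid \tilde\theta_2).$$

Let us denote with $f_1$ and $f_2$ the probability mass functions of the Geometric and the Poisson distributions, respectively. The priors for the parameters of $f_1$ and $f_2$ are and .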

In the first simulation, we sample observations from model $M_0$ with . To perform the change point analysis, we have chosen the following parameters for the priors on and : , , and . Applying the approach introduced in Section 3, we obtain $\Pr(M_0) = 0.47$ and $\Pr(M_1) = 0.53$. These model priors yield the posterior probabilities (refer to equation (4)) $\Pr(M_0 \mid x) = 0.92$ and $\Pr(M_1 \mid x) = 0.08$. As expected, the selection process strongly indicates the true model as $M_0$. Table 1 reports the above probabilities together with other information, such as the appropriate Bayes factors.

The second simulation looked at the opposite setup, that is we sample $n = 100$ observations from $M_1$, with and . We have sampled 50 data points from the Geometric distribution and the remaining 50 data points from the Poisson distribution. In Figure 2, we have plotted the simulated sample, where it is legitimate to assume a change in the underlying distribution. Using the same prior parameters as above, we obtain $\Pr(M_0 \mid x) = 0.06$ and $\Pr(M_1 \mid x) = 0.94$. Again, the model selection process is assigning heavy posterior mass to the true model $M_1$. These results are further detailed in Table 1.
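A setup of this kind can be reproduced with the standard library alone. The sketch below is not the authors' implementation: it assumes conjugate Beta and Gamma priors so that the marginal likelihoods are available in closed form, it uses a Geometric distribution on {0, 1, 2, ...} so that both segments share the same support, and every numerical value (`p = 0.8`, `lam = 6.0`, unit hyperparameters, the seed) is hypothetical. The change point location receives the uniform prior of Section 2.

```python
import math
import random

def sample_geometric(p, rng):
    """Failures before the first success, support {0, 1, 2, ...}."""
    u = 1.0 - rng.random()                      # u lies in (0, 1]
    return int(math.floor(math.log(u) / math.log(1.0 - p)))

def sample_poisson(lam, rng):
    """Knuth's multiplication method; adequate for small lam."""
    limit, count, prod = math.exp(-lam), 0, rng.random()
    while prod > limit:
        count += 1
        prod *= rng.random()
    return count

def log_beta(a, b):
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def log_marginal_geometric(xs, a, b):
    """Likelihood p^n (1-p)^sum(xs) integrated against a Beta(a, b) prior."""
    return log_beta(a + len(xs), b + sum(xs)) - log_beta(a, b)

def log_marginal_poisson(xs, c, d):
    """Poisson likelihood integrated against a Gamma(shape c, rate d) prior."""
    n, s = len(xs), sum(xs)
    return (c * math.log(d) - math.lgamma(c) + math.lgamma(c + s)
            - (c + s) * math.log(d + n)
            - sum(math.lgamma(x + 1) for x in xs))

def log_sum_exp(values):
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))

rng = random.Random(1)
n, true_m = 100, 50
data = ([sample_geometric(0.8, rng) for _ in range(true_m)]
        + [sample_poisson(6.0, rng) for _ in range(n - true_m)])
a, b, c, d = 1.0, 1.0, 1.0, 1.0                 # hypothetical hyperparameters

# M0: a single Geometric segment. M1: Geometric then Poisson, location m.
log_m0 = log_marginal_geometric(data, a, b)
log_joint = [math.log(1.0 / (n - 1))
             + log_marginal_geometric(data[:m], a, b)
             + log_marginal_poisson(data[m:], c, d)
             for m in range(1, n)]
log_m1 = log_sum_exp(log_joint)

log_bf_10 = log_m1 - log_m0                     # log Bayes factor of M1 vs M0
location_posterior = [math.exp(v - log_m1) for v in log_joint]
most_probable_m = 1 + max(range(n - 1), key=lambda i: location_posterior[i])
print(round(log_bf_10, 2), most_probable_m)
```

The log Bayes factor can be combined with any pair of model prior probabilities to obtain posterior model probabilities; working on the log scale avoids underflow for samples of this size.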

Figure 2: Scatter plot of the data simulated from model $M_1$ in Scenario 1.

                   True model
                   $M_0$     $M_1$
$\Pr(M_0)$         0.47      0.47
$\Pr(M_1)$         0.53      0.53
$B_{01}$           12.39     0.08
$B_{10}$           0.08      12.80
$\Pr(M_0 \mid x)$  0.92      0.06
$\Pr(M_1 \mid x)$  0.08      0.94

Table 1: Model prior, Bayes factor and model posterior probabilities for the change point analysis in Scenario 1. We considered samples from, respectively, model $M_0$ and model $M_1$.
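The entries of Table 1 can be cross-checked against each other. The sketch below computes posterior model probabilities from prior probabilities and Bayes factors expressed against a common reference model, which is an equivalent way of writing the relation between priors, Bayes factors and posteriors recalled in Section 1. The function name is an assumption, and the inputs are the rounded values reported in the table, so agreement is only expected to two decimal places.

```python
def posterior_model_probabilities(priors, bayes_factors_vs_reference):
    """Posterior probabilities from priors and Bayes factors B_{j,ref}.

    The reference model has Bayes factor 1 against itself.
    """
    weights = [p * b for p, b in zip(priors, bayes_factors_vs_reference)]
    total = sum(weights)
    return [w / total for w in weights]

priors = [0.47, 0.53]                       # Pr(M0), Pr(M1) from Table 1

# Sample drawn from M0: B_01 = 12.39, with M1 as the reference model.
post_m0_true = posterior_model_probabilities(priors, [12.39, 1.0])
# Sample drawn from M1: B_10 = 12.80, with M0 as the reference model.
post_m1_true = posterior_model_probabilities(priors, [1.0, 12.80])

print([round(p, 2) for p in post_m0_true])  # [0.92, 0.08]
print([round(p, 2) for p in post_m1_true])  # [0.06, 0.94]
```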
Scenario 2.

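In this scenario we consider the case where we have to select among three models, that is model $M_0$:

$$X_1, \dots, X_n \mid \tilde\theta_1 \sim f_1(\cdot \mid \tilde\theta_1),$$

model $M_1$:

$$X_1, \dots, X_{m_1} \mid \tilde\theta_1 \sim f_1(\cdot \mid \tilde\theta_1), \qquad X_{m_1+1}, \dots, X_n \mid \tilde\theta_2 \sim f_2(\cdot \mid \tilde\theta_2),$$

with $m_1$ being the location of the single change point, and model $M_2$:

$$X_1, \dots, X_{m_1} \mid \tilde\theta_1 \sim f_1(\cdot \mid \tilde\theta_1), \qquad X_{m_1+1}, \dots, X_{m_2} \mid \tilde\theta_2 \sim f_2(\cdot \mid \tilde\theta_2), \qquad X_{m_2+1}, \dots, X_n \mid \tilde\theta_3 \sim f_3(\cdot \mid \tilde\theta_3),$$

with $m_1 < m_2$ representing the locations of the two change points, such that $m_1$ corresponds exactly to the same location as in model $M_1$. Analogously to the previous scenario, we sample from each model in turn and perform the selection to detect the number of change points.

Let $f_1$, $f_2$ and $f_3$ represent the Weibull, Log-normal and Gamma densities, respectively, with $\tilde\theta_1$, $\tilde\theta_2$ and $\tilde\theta_3$ the corresponding parameter vectors. We assume a Normal prior on the location parameter of the Log-normal and Gamma priors on all the other parameters as follows: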

In the first exercise, we have simulated observations from model $M_0$, where we have set and . We obtain the following model priors: $\Pr(M_0) = 0.27$, $\Pr(M_1) = 0.39$ and $\Pr(M_2) = 0.34$, yielding the posteriors $\Pr(M_0 \mid x) = 0.96$, $\Pr(M_1 \mid x) = 0.04$ and $\Pr(M_2 \mid x) = 0.00$. We then see that the approach assigns high mass to the true model $M_0$. Table 2 reports the above probabilities and the corresponding Bayes factors.

                   True model
                   $M_0$     $M_1$     $M_2$
$\Pr(M_0)$         0.27      0.27      0.27
$\Pr(M_1)$         0.39      0.39      0.39
$\Pr(M_2)$         0.34      0.34      0.34
Bayes factors      50.44 55
$\Pr(M_0 \mid x)$  0.96      0.00      0.00
$\Pr(M_1 \mid x)$  0.04      0.98      0.00
$\Pr(M_2 \mid x)$  0.00      0.02      1.00

Table 2: Model prior, Bayes factor and model posterior probabilities for the change point analysis in Scenario 2. We considered samples from, respectively, model $M_0$, model $M_1$ and model $M_2$.

The second simulation was performed by sampling 50 observations from a Weibull with parameter values as in the previous exercise, and the remaining 50 observations from a Log-normal density with location parameter and scale parameter . The data is displayed in Figure 3.

Figure 3: Scatter plot of the observations simulated from model $M_1$ in Scenario 2.

The model posterior probabilities are $\Pr(M_0 \mid x) = 0.00$, $\Pr(M_1 \mid x) = 0.98$ and $\Pr(M_2 \mid x) = 0.02$, which are reported in Table 2. In this case as well, we see that the model selection procedure indicates $M_1$ as the true model, as expected.

Figure 4: Scatter plot of the observations simulated from model $M_2$ in Scenario 2.

Finally, for the third simulation exercise we sample 50 and 20 data points from, respectively, a Weibull and a Log-normal with parameter values as defined above, and the last 30 observations are sampled from a Gamma distribution with parameters and . From Table 2, we note that the posterior distribution on the model space accumulates on the true model $M_2$.

4.2 Frequentist Analysis

In this section, we perform a frequentist analysis of the performance of the proposed prior by drawing repeated samples from different scenarios. In particular, we look at a two change points problem where the sampling distributions are Student-$t$ with different degrees of freedom. In this scenario, we perform the analysis with 60 repeated samples generated by different densities with the same mean values.

Then, we repeat the analysis of Scenario 2 by selecting 100 samples for each of two sample sizes. We consider different sampling distributions with the same mean and variance. In this scenario, where we added the further constraint of equal variance, it is interesting to note that the change in distribution is captured when we increase the sample size, meaning that we learn more about the true sampling distributions.

We also compare the performance of the loss-based prior with that of the uniform prior when we analyse the scenario with different sampling distributions, namely Weibull/Log-normal/Gamma. It is interesting to note that the uniform prior is unable to capture the change in distribution even for the larger sample size. On the contrary, the loss-based prior is able to detect the number of change points at the larger sample size. Furthermore, at the smaller sample size, even though neither prior is able to detect the change points most of the time, the loss-based prior has a higher frequency of success than the uniform prior.

Scenario 3.

In this scenario, we consider the case where the sampling distributions belong to the same family, that is Student-$t$, where the true model has two change points. In particular, let $f_1(\cdot \mid \nu_1)$, $f_2(\cdot \mid \nu_2)$ and $f_3(\cdot \mid \nu_3)$ represent the densities of three standard $t$ distributions, respectively. We assume that $\nu_1$, $\nu_2$ and $\nu_3$ are positive integers strictly greater than one so as to have a defined mean for each density. Note that this allows us to compare distributions of the same family with equal mean. The priors assigned to the number of degrees of freedom assume a parameter space of positive integers strictly larger than 1. As such, we define them as follows:

In this experiment, we consider 60 repeated samples, each of size and with the following structure:

  • from a Student-$t$ distribution with ,

  • from a Student-$t$ distribution with ,

  • from a Student-$t$ distribution with .

Table 3 reports the frequentist results of the simulation study. First, note that $\Pr(M_0) = \Pr(M_1) = \Pr(M_2) = 1/3$ as per the Remark in Section 3. For all the simulated samples, the loss-based prior yields a posterior with the highest probability assigned to the true model $M_2$. We also note that the above posterior is on average 0.75 with a variance of 0.0190, making the inferential procedure extremely accurate.

Model    Mean posterior    Variance posterior    Freq. true model
$M_0$    0.01                                    0/60
$M_1$    0.24              0.0160                0/60
$M_2$    0.75              0.0190                60/60
Table 3: Average model posterior probabilities, variance and frequency of true model for the Scenario 3 simulation exercise.
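Summaries of this form are straightforward to compute once the posterior model probabilities of each repeated sample are stored. The sketch below is an illustration with made-up numbers, not the simulation output of the paper. It reads the frequency column as the number of samples in which a model attains the highest posterior probability, and it uses the population variance; both readings are assumptions, since the text does not state them.

```python
from statistics import mean, pvariance

def summarise_posteriors(posteriors):
    """Per-model mean, variance and number of wins across repeated samples.

    `posteriors` is a list of vectors, one per sample, each holding the
    posterior probabilities of the competing models.
    """
    n_models = len(posteriors[0])
    winners = [max(range(n_models), key=lambda j: p[j]) for p in posteriors]
    summary = []
    for j in range(n_models):
        column = [p[j] for p in posteriors]
        summary.append({
            "mean": mean(column),
            "variance": pvariance(column),
            "wins": winners.count(j),
        })
    return summary

# Hypothetical posterior vectors (M0, M1, M2) from three repeated samples.
runs = [[0.1, 0.2, 0.7], [0.0, 0.6, 0.4], [0.0, 0.1, 0.9]]
for j, row in enumerate(summarise_posteriors(runs)):
    print(f"M{j}: mean={row['mean']:.3f} var={row['variance']:.4f} "
          f"wins={row['wins']}/{len(runs)}")
```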
Scenario 4.

In this scenario, we perform repeated sampling from the setup described in Scenario 2 above, where the true model has two change points. In particular, we draw samples with and . For , the loss-based prior probabilities are , and . For , the loss-based prior probabilities are , and . The simulation results are reported, respectively, in Table 4 and in Table 5. The two change point locations for are at the st and st observations. For , the first change point is the st observation, while the second is at the st observation. We note that there is a sensible improvement in detecting the true model, using the loss-based prior, when the sample size increases. In particular, the frequency with which the true model is selected moves from 30/100 (Table 4) to 96/100 (Table 5).

Mean posterior    Variance posterior    Freq. true model
0.63              0.0749                70/100
0.37              0.0745                30/100
Table 4: Average model posterior probabilities, variance and frequency of true model for the Scenario 4 simulation exercise with and the loss-based prior.
Mean posterior    Variance posterior    Freq. true model
0.08              0.0200                4/100
0.92              0.0200                96/100
Table 5: Average model posterior probabilities, variance and frequency of true model for the Scenario 4 simulation exercise with and the loss-based prior.

To compare the loss-based prior with the uniform prior we have run the simulation on the same data samples used above. The results for the two sample sizes are in Table 6 and in Table 7, respectively. Although we can observe an improvement when the sample size increases, the uniform prior does not lead to a clear detection of the true model for either sample size.

Mean posterior    Variance posterior    Freq. true model
0.82              0.0447                91/100
0.18              0.0443                9/100
Table 6: Average model posterior probabilities, variance and frequency of true model for the Scenario 4 simulation exercise with and the uniform prior.
Mean posterior    Variance posterior    Freq. true model
0.501             0.1356                49/100
0.499             0.1356                51/100
Table 7: Average model posterior probabilities, variance and frequency of true model for the Scenario 4 simulation exercise with and the uniform prior.
Figure 5: The densities of Weibull(), Log-normal() and Gamma() with the same mean (equal to 5) and the same variance (equal to 2.5).

Finally, we conclude this section with a remark. One may wonder why the change point detection requires an increase in the sample size. The answer can be inferred from Figure 5, which displays the density functions of the distributions employed in this scenario. As can be observed, the densities are quite similar, which is not surprising since these distributions have the same mean and the same variance. The similarity can also be appreciated in terms of the Hellinger distance, see Table 8. In other words, from Figure 5 we can see that the main differences in the underlying distributions are in the tail areas. It is therefore necessary to have a relatively large number of observations in order to discern differences in the densities, because only then do we have a sufficient representation of the whole distribution.

Hellinger distances
                Weibull()    Log-normal()    Gamma()
Weibull()                    0.1411996       0.09718282
Log-normal()                                 0.04899711

Table 8: Hellinger distances between all the pairs formed from a Weibull(), Log-normal() and Gamma(). The six hyperparameters are such that the distributions have the same mean (equal to 5) and the same variance (equal to 2.5).
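Distances of this kind can be approximated numerically. The sketch below is a reconstruction under assumptions rather than the authors' computation: the parameters of each family are obtained by moment matching to mean 5 and variance 2.5 (closed form for the Gamma and Log-normal, bisection on the shape for the Weibull), the distance uses the convention $H = \sqrt{1 - \int \sqrt{f g}}$, and the integral is approximated by a midpoint rule on a truncated grid. A different convention for the distance would rescale the values.

```python
import math

MEAN, VAR = 5.0, 2.5

def weibull_params(mean, var):
    """Shape and scale matching the given mean and variance (bisection)."""
    target = var / mean ** 2
    def excess(shape):
        # squared coefficient of variation minus its target; decreasing in shape
        return (math.gamma(1 + 2 / shape) / math.gamma(1 + 1 / shape) ** 2
                - 1 - target)
    lo, hi = 1.0, 10.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if excess(mid) > 0:
            lo = mid
        else:
            hi = mid
    shape = 0.5 * (lo + hi)
    return shape, mean / math.gamma(1 + 1 / shape)

def weibull_pdf(x, shape, scale):
    z = x / scale
    return (shape / scale) * z ** (shape - 1) * math.exp(-z ** shape)

def lognormal_pdf(x, mu, sigma):
    return (math.exp(-(math.log(x) - mu) ** 2 / (2 * sigma ** 2))
            / (x * sigma * math.sqrt(2 * math.pi)))

def gamma_pdf(x, shape, rate):
    return math.exp(shape * math.log(rate) + (shape - 1) * math.log(x)
                    - rate * x - math.lgamma(shape))

def hellinger(f, g, upper=40.0, steps=40000):
    """sqrt(1 - Bhattacharyya coefficient), midpoint rule on (0, upper)."""
    h = upper / steps
    bc = sum(math.sqrt(f((i + 0.5) * h) * g((i + 0.5) * h))
             for i in range(steps)) * h
    return math.sqrt(max(0.0, 1.0 - bc))

w_shape, w_scale = weibull_params(MEAN, VAR)
ln_sigma = math.sqrt(math.log(1 + VAR / MEAN ** 2))
ln_mu = math.log(MEAN) - 0.5 * ln_sigma ** 2
g_shape, g_rate = MEAN ** 2 / VAR, MEAN / VAR

weibull = lambda x: weibull_pdf(x, w_shape, w_scale)
lognormal = lambda x: lognormal_pdf(x, ln_mu, ln_sigma)
gamma = lambda x: gamma_pdf(x, g_shape, g_rate)

d_wl = hellinger(weibull, lognormal)
d_wg = hellinger(weibull, gamma)
d_lg = hellinger(lognormal, gamma)
print(d_wl, d_wg, d_lg)
```

The three printed values can be compared with the corresponding entries of Table 8; the Weibull and Log-normal pair should be the most distant and the Log-normal and Gamma pair the closest.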

5 Change Point Analysis on Real Data

In this section, we illustrate the proposed approach applied to real data. We first consider a well known data set which has been extensively studied in the literature of the change point analysis, that is the British coal-mining disaster data (Carlin et al., 1992). The second set of data we consider refers to the daily returns of the S&P 500 index observed over a period of four years. The former data set will be investigated in Section 5.1, while the latter in Section 5.2.

5.1 British Coal-Mining Disaster Data

The British coal-mining disaster data consists of the yearly number of deaths for the British coal miners over the period 1851-1962. It is believed that the change in the working conditions, and in particular the enhancement of the security measures, led to a decrease in the number of deaths. This calls for a model which can take into account a change in the underlying distribution around a certain observed year. With the proposed methodology we wish to assess whether this assumption is appropriate, in particular whether a model with one change point is more suitable to represent the data than a model where no changes in the sampling distribution are assumed. Figure 6 shows the number of deaths per year in the British coal-mining industry from 1851 to 1962.

Figure 6: Scatter plot of the British coal-mining disaster data.

As in Chib (1998), we assume a Poisson sampling distribution with a possible change in the parameter value. That is