1 Introduction
The area of highdimensional inverse covariance (or precision) matrix estimation has developed considerably over the past years. This is partly due to the offdiagonal entries of the precision matrix encoding unscaled partial correlations and in the case of a multivariate normal distribution even conditional independence between two variables given all the others. The resulting graph, with variables as nodes and edges for nonzero offdiagonal entries is called a conditional independence graph or Gaussian graphical model (GGM), respectively
(Lauritzen 1996).In a lot of settings where the graphical structure of a set of observations is of interest, the data is in the form of a time series (or has another natural ordering, e.g. by space or genome location) and the underlying distribution might change over time. A lot of attention is given to the setting where there are structural breaks, so called change points, in between which the observations are identically distributed. Disregarding this structure and estimating a GGM for all observations leads to a fit of mixtures that does not model the true GGM well for any time point. In such cases to estimate the graphical structure, prior good estimation of the change points is key.
Real world data sets often suffer from missing values. In the presence of change points and without prior knowledge of them, usual imputation methods are not expected to perform well as they require observations from a homogeneous distribution. However, some kind of imputation method is necessary in order to apply any of the current methods to find change points, between which the distribution can be assumed to be homogeneous, leading to a chicken and egg kind of problem.
1.1 Related Work
For homogeneous observations, Meinshausen and Bühlmann (2006) proposed to use nodewise regression using the Lasso (Tibshirani 1996) to recover the conditional independence structure from data. Yuan and Lin (2007) introduced an estimator for the precision matrix (and thus the GGM) via maximisation of the penalized Gaussian loglikelihood over the set of positive definite matrices. The graphical Lasso (glasso) algorithm of Friedman et al. (2008) with subsequent improvements (Witten et al. 2011; Mazumder and Hastie 2012) gained popularity for computing such estimates such that the term glasso is now also associated with the corresponding estimator. Computational approaches for the estimator were also presented by Banerjee et al. (2008), theoretical properties of it were investigated by Yuan and Lin (2007), Rothman et al. (2008), Lam and Fan (2009) and Ravikumar et al. (2011).
For the nonhomogeneous case, Zhou et al. (2010) considered cases where the graph structure varies smoothly and Kolar and Xing (2011) investigated under which conditions the graph structure can be recovered consistently. In a setting with abrupt structural breaks, Kolar and Xing (2012) proposed to estimate the locations of change points with a total variation penalty for consecutive observations either using nodewise regression or a penalized likelihood approach. Gibberd and Nelson (2017) investigated groupfused and independentfused graphical Lasso methods (with a Frobenius norm or an fusion penalty). Some statistical analysis of the former was provided by Gibberd and Roy (2017). While Kolar and Xing (2012) used an accelerated gradient descent method, Gibberd and Nelson (2017) proposed an ADMM algorithm to calculate the estimator, similarly to Hallac et al. (2017), who considered larger classes of fusion type penalties. Other computational approaches include an approximate majorizeminimize algorithm (Bybee and Atchadé 2018) or a specific greedy search (Hallac et al. 2018). More thorough statistical analysis of similar proposals was conducted by Avanesov and Buzun (2018), Wang et al. (2017), Dette et al. (2018) and Roy et al. (2017).
In the presence of missing values, the maximisation of the penalized loglikelihood (see equation (5) later on) is no longer a convex problem, even for homogeneous observations. For this scenario of homogeneous observations with missing values, Städler and Bühlmann (2012) proposed to use an Expectation Maximisation algorithm coupled with the glasso to obtain an estimate of the precision matrix. The single example of the treatment of missing values in the context of change point detection we are aware was done by Xie et al. (2013). They however considered an online (sequential) setup and assumed that the observations lie close to a timevarying lowdimensional submanifold within the observation space. This assumption is appropriate in e.g. video surveillance, but it is unrealistic in the setting of GGMs, even if the underlying precision matrix is assumed to be sparse.
1.2 Possible applications
Applications of change point detection within graphical models include the analysis of environmental measurements, biological data and financial time series, which potentially encounter the problem of missing values. One specific motivating example for our proposals was the shallow groundwater monitoring data set, which we will discuss later on. To mention further concrete examples, multivariate change point detection methods (in GGMs) could also be useful for example in detecting the signal of abrupt climate changes imprinted simultaneously into multiple climate proxy records, e.g. ice cores in Antarctica or speleothems, all facing missing values (see e.g., Atsawawaranunt et al. 2018).
So far, it is common practice to discard variables with too much missingness and to perform simple imputations on the rest (see e.g., Matteson and James 2014). Also, sometimes univariate methods are used to detect change points for each variable separately (see e.g., Shirvani 2015)
. These approaches are suboptimal. Discarding noncomplete observations, especially with highdimensional data and inhomogeneity of the missingness structure with respect to time, is impractical as it results in a significant loss in information.
1.3 Our contribution
Our investigated setup is composed of three problems: change point detection, estimation of GGMs and the treatment of missing values. While there are many possible applications, there is currently no readily available method combining the three and thus capable of estimating change points in a GGM in the presence of missing values. We fill this gap and provide practitioners with practically usable, and in particular computationally tractable methods, which are implemented in an Rpackage (see the Supplementary material for the current version. The package is planned to be uploaded to CRAN after acceptance of the manuscript for publication).
We investigate different scenarios of missingness (both missing completely at random and with structures resembling real world scenarios), discuss the resulting difficulties and propose viable estimation approaches. Their performance is evaluated in a simulation study and applied to environmental and financial data.
2 Change point detection without missing values
Consider a sequence of independent Gaussian random variables
with means and covariance matrices such that the map is piecewise constant. Letbe the set of segment boundaries. We label the elements of by their natural order starting with zero such that consecutive pairs of elements in define segments within which the are i.i.d. For let denote the matrix of the observations , denote with their mean and let be the corresponding covariance matrix.
For define to be the family of possible sets of segment boundaries such that the minimal segment length is not smaller than . If is some normalized loss after fitting an adequate model to and a penalty parameter for the number of segments, an estimator for is
(1) 
This estimator can be computed using dynamic programming with evaluations of , see for example Hawkins (1976). This is computationally infeasible if is large, especially if the cost to evaluate is significant.
Binary Segmentation is a much faster greedy algorithm to estimate . For this define the gains function of some segment at some split point to be
(2) 
and define
(3) 
The search for a single change point in by solving (1) breaks down to finding and checking if . For the multiple change point case Binary Segmentation (BS) finds an approximate solution to (1) by recursively splitting segments using (3) until the resulting segment length is smaller than such that a split is no longer allowed or the corresponding gain is not bigger than , the minimally required gain to split. BS typically requires asymptotically evaluations of .
This can still be prohibitive if the cost of evaluating is large and is big. Instead of evaluating at every possible split in a full grid search to find its maximum, one can find one of its local maxima with smartly chosen evaluations, see Haubner (2018). The expected gain curve for common losses (e.g. the squared error loss) is piecewise convex between the true underlying segment boundaries, such that in particular all local maxima correspond to change points. Hence, splitting at a local maximum instead of the global one does not induce a false discovery and the missed global maximum can still be found in a later step. Doing BS with this adaptive search is called Optimistic Binary Segmentation (OBS) (Haubner 2018) and approximately requires evaluations of . Optimism is needed in noisy scenarios, when the idealized piecewise convex structure is distorted. In an extremely noisy case the optimistic search strategy might fail while the full grid search would possibly still perform reasonably. We will see later on how this influences the applicability of OBS in scenarios with lots of missing data.
Leonardi and Bühlmann (2016) applied BS to a highdimensional regression change point problem using the negative loglikelihood for Gaussian errors (sum of squared errors) resulting from a Lasso fit as a loss measure. They provided (under technical conditions) consistency results if the penalization parameter for the Lasso is adjusted by the inverse of the square root of the relative segment length, i.e. by using for some fixed for the segment .
In order to adapt this approach to the multivariate normal case, we set
and define
(4) 
This way is (up to a constant) the negative loglikelihood of a Gaussian with precision matrix given observations that is scaled with segment length. Furthermore, is the glasso estimator from Friedman et al. (2008) for the precision matrix of observations with penalization parameter . We prefer not to penalize the diagonal in . We then use the in sample loss
after fitting the glasso in equation (2) as the loss when greedily searching for optimal splits with BS or OBS.
We choose the graphical Lasso for the estimation of the precision matrix since some sparsity assumption seems crucial in highdimensional scenarios. (Avanesov and Buzun (2018) also relied on the graphical Lasso (or similar procedures for sparse precision matrices) and Roy et al. (2017) also had a sparsity assumption.) Additionally, the change point detection is expected to be more powerful and reasonable when placing the sparsity assumption on the precision matrix rather than the covariance matrix, in particular when considering sparse changes between conditional dependencies among a few variables. For lowdimensional scenarios nonsparse (e.g. ridge type) estimators could also be reasonable, but differences in estimated change point locations compared to our glasso based approach are expected to be small.
Overall the algorithms have three tuning parameters: for the minimal relative segment length, to control for the number of segments and for the sparsity of the precision matrices. Note that the procedure is similar to the one proposed by Angelosante and Giannakis (2011), but as a crucial difference the penalty is scaled according to the segment length with .
3 Adapting to missing data
For noncomplete data denote with the observed part of the th observation and with and the submatrix of and the subvector of corresponding to the observed variables of . Similarly denote with the subvector of corresponding to the variables of with missing values. The normalized negative Gaussian loglikelihood of observations from a segment is
(5) 
Set to be the empirical mean of the observed part of (discarding the missing values). Note that might have missing values itself if for some variable there is no observed value in the segment . Let
(6) 
The penalized maximum likelihood estimator for observations is then
Contrary to (4), this problem is no longer convex and cannot be solved efficiently with an updatebased approach like the glasso. The MissGlasso algorithm proposed by Städler and Bühlmann (2012)
combines the glasso and an Expectation Maximization (EM) algorithm to estimate the precision matrix in the presence of missing data. However, the algorithm needs complete observations for a good initialization, is computationally expensive due to a new glasso fit for each EM iteration and might get stuck in local optima. These features are especially critical in our setting of highdimensional change point detection. High computational cost is prohibitive since we do a lot of evaluations of the loss function. More importantly even, for each split, MissGlasso would be initialised slightly differently and in some situations it could converge to a very different local optimum. This would result in jumps in the gain curve from equation (
2) for some neighboring splits .While an accurate estimate is necessary for a good fit, this is not needed for change point detection, as it is sufficient to preserve the piecewise convex structure of the gains function (2). This key new idea helps to avoid the heavy computations and local optima of the EM algorithm when doing BS or OBS. In the following, we propose estimators that approximate based on . We can then use in the glasso estimator (4) to obtain instead of and use the resulting loglikelihood ) as in equation (6) with minor modifications (see Section 3.2) as a loss measure.
3.1 Missing value imputations
We propose three different estimators for data with missing values. These will lead to different results, computational costs and applicabilities.
The average imputation estimator
In a first attempt, we impute the missing values with the average value of the corresponding variables within the interval . Thus define centered variables
and estimate
This will underestimate the variance and covariance of variables where values are missing. Nonetheless the average method serves as a baseline.
The LohWainwright bias corrected estimator
We try to counteract the shortcomings of the average imputation method using the bias correction presented by Loh and Wainwright (2012). The authors show that if the th variable of an observation in
is discarded with probability
, settingthe matrix
is an unbiased estimator of
. Here denotes the Hadamard (pointwise) product for matrices. In practice, we have to estimate the based on the proportion of missing observations, leading to . Note that the resulting matrix is not necessary positive semidefinite which is required for the glasso algorithm (Friedman et al. 2008) to converge. We thus compute the closest positive semidefinite matrix to with respect to the Frobenius norm using the Higham algorithm (Higham 2002) and take this as our final LohWainwright (LW) bias corrected estimator . Note that the Higham algorithm has an asymptotic complexity of similar to the glasso and thus evaluating the gain using the LW estimator is computationally more expensive than the average imputation method.Pairwise covariance estimation
A third estimate can be based on pairwise covariance estimates, where the covariance between two variables is calculated from observations where both corresponding variables are available. Similar to the LW estimator, this gives a valid estimate of the variances and covariances of variables, even if the missingness structure in the data is not homogeneous. If there are less than two complete observations for a pair of variables in a segment, we set the covariance between these variables to be zero. Again, as with the LW approach, this might yield a matrix that is not positive semidefinite. We hence apply the Higham algorithm to compute the closest positive semidefinite matrix and denote the resulting estimator by .
3.2 Avoiding jumps in the gain curve
For all of the three proposals above, the estimation of the variance of a variable on a given segment requires at least two observations with non missing values. In order to obtain a meaningful estimate, more nonmissing observations are necessary. As a consequence, with a lot of missing values or small segments, we might only be able to estimate a submatrix of the full covariance matrix. For such segments, the loglikelihood can then only be computed for a submodel of the entire multivariate Gaussian distribution. When evaluating the gains function (
2) at some split point , it might thus happen that the loglikelihood of the segment is calculated based on a larger covariance matrix (and thus a model with more parameters) than for or . As the loglikelihoods of multivariate Gaussians of different dimensions are not easily comparable, this is especially problematic for split points such that the estimated covariance matrix for the neighboring split has a different size. In such scenarios, the gains curve often has jumps between and (see Figure 2). To alleviate this, we propose to restrict the loglikelihood of to the dimensions available for and . To make this more concrete, let us introduce some notation.Denote for a segment and some with the indices of the variables for which at least values are observed in the segment . For any imputation method let be the submatrix of where each variable has at least observed values. Finally denote with the obtained glasso fit for the submatrix. In our simulations we set the minimal number of required observations for keeping a variable to be .
A naive estimator for the gains function is
This naive estimator might be comparing loglikelihoods of multivariate normal distributions of different dimensions at some possible split points if or . See Figure 1 for an illustrative example. Here , and . The loglikelihood of the full segment would be using the first variable of the (s+1)st observation for evaluation, whereas the loglikelihood of the segment on the right would not.
To avoid this, we propose to use the slightly different estimator
Here we only use the variables and in the calculation for the loss of the full segment when splitting at . Hence, slightly different losses are used for the full segment depending on the split point. As we are not primarily interested in a segments’ own loss, but rather a fair estimate of the gain, this is a sensible approach.
This does not increase the computational cost significantly, as for repeated evaluations of we can keep the initial estimate . Without this mechanism the gain curve exhibits jumps at the boundaries of missing blocks, as illustrated in Figure 2. This might prevent BS and OBS from correctly estimating the change points. The severity of this problem depends on the block size. When the missingness structure is homogeneous in time, the block size is small such that this is not a pronounced issue. However, in many applications one faces blockwise missing data, where such jumps would be a pronounced issue without our proposal.
4 Model selection
A good choice of tuning parameters is essential for accurate estimation results. The parameter controls the form of the gain curve. When chosen too small, the glasso tends to overfit and the resulting gain curve has an inverse U shape independently of the underlying change points. When chosen too big, the glasso underfits, resulting in an almost constant gain curve. Even though we adjust the penalization () depending on the segment length, one global might not be able to approximate the shape (piecewise convex structure) of the population version of the gain curve simultaneously in all possible segments encountered during BS or OBS. If the sparsity pattern of the underlying graphical model in the segments differs strongly, selecting a new in each splitting step of BS or OBS is advocated to obtain good results.
The parameters (the penalty for the number of segments) and (the minimum relative segment size) on the other hand control the depth of the tree structure generated by BS and thus how many change points are found. A sufficiently large value of is also necessary to achieve stability of fits in highdimensional scenarios. Often, overly small segments are uninterpretable and thus uninteresting for practitioners, such that can be set to some predetermined value (typically around 0.1). Thus, is the key parameter to be chosen in order to avoid under or oversegmentation. Leonardi and Bühlmann (2016) proposed to choose values for and via 2fold cross validation, where the test data is taken from an equispaced grid across the entire sample. Since change points for can be regained from trees grown with smaller , it is only necessary to do one BS fit (with ) per fold and .
This approach did not yield satisfactory results in our settings even when only a small amount of values were missing. Often the value chosen for would correspond to the correct segmentation of the test data, but would underfit on the whole data. This could be explained by the fact that while our covariance estimation methods approximately preserve the structure of the gain curve, they do not reliably preserve its magnitude and thus might be incomparable between folds and different segments.
We thus propose an alternative method as a stopping criterion for splits as well as for choosing . This approach performed empirically much better than the one that was used by Leonardi and Bühlmann (2016) and has the strong computational advantage of only requiring one single BS or OBS fit. For each investigated segment we first apply 10fold cross validation (taking the test data from an equispaced grid of all the observations in ) to obtain an optimal value corresponding to the minimal attained cross validated loss, which we denote by . The loss that is minimized is the negative loglikelihood of the test data given the mean and the estimated precision matrix of the train data. We then use for the evaluation of the gain curve for that segment. In the standard setting, one would then check if to decide whether to split further at the found point or not. Instead, we compare the crossvalidated minimal loss on the full segment to the sum of crossvalidated minimal losses of the subsegments and . We keep the split if there is a positive improvement, i.e. if
Note that our approach is not a proper crossvalidation technique as the above described improvement at a candidate split is evaluated on the same data as was used to find the split point. Hence, this procedure might be optimistic regarding the improvements and thus slightly biased towards finding too many change points. This would not be a big problem in practice, as finding too many change points is preferred to finding too few. In Section 5 we investigate how our stopping criterion performs on simulated data both with and without change points. Contrary to expectations it does not tend to oversegment in scenarios without change points.
5 Simulations
It seems to be hard to provide theoretical guarantees in the setting of change point detection in GGMs with missing values. We think that the only semirealistic case to provide theory is the one with values missing completely at random, which seems to be far from realistic for most applications (see examples in Section 6). Moreover, keeping our tuning parameter fixed would be one of our technical assumptions, which however is clearly suboptimal for real and simulated data (as discussed in Section 4). We thus focus on the practical performance of our methods, which we demonstrate with the following simulations.
5.1 Setup
We first discuss how we generated data, how we deleted values and we introduce the performance measure used to evaluate the results.
Generating covariance matrices
In our simulations we consider two methods to draw precision matrices for the segments. The first one is the random graph model (Erdős and Rényi 1960), which was used to simulate highdimensional graphs (e.g. by Kolar et al. 2010). Here, the graph is generated by connecting nodes randomly with some probability , which we set to in the following to ensure sufficient sparsity. We create the corresponding precision matrix by assigning a constant value (taken here as
) to the entries corresponding to the chosen edges and then adding the absolute value of the smallest eigenvalue of the resulting matrix plus some increment (here
) to the diagonal. This is necessary to ensure positive definiteness of the constructed precision matrix.We used chain networks as a second model (see e.g. Example 4.1 in Fan et al. 2009). Here we set , where , and for . In the simulations we set and additionally draw . The inverse of the resulting matrix is tridiagonal. To generate precision matrices with differing sparsity patterns for different segments, we draw some permutation of and set . This breaks the tridiagonal structure but keeps the sparsity.
Types of missingness
We consider two ways to delete values for the simulations. The first one is missing completely at random (MCAR), where a given percentage of values is discarded uniformly at random. The second one is inspired by the missingness structure of environmental monitoring data (see bottom of Figure 6). Here a failure of a sensor leads to missing values over several consecutive observations, while replacement of them at multiple sampling locations might occur at the same time. Moreover, it is common that simultaneously several sites are newly installed or abandoned based on the available budget. This leads to blocks of observations missing, ranging over multiple variables as well as observations. In order to generate a similar blockwise missingness structure, we repeatedly select variables uniformly at random and delete for all of them a segment of length with the midpoint chosen uniformly between and . We repeat this procedure until the preset percentage of missing values is reached. An example with 30% missing values for and is shown at the bottom of Figure 2.
Performance measures
We use the adjusted Rand Index (Hubert and Arabie 1985), a common measure to compare clusterings, to measure performance in our simulation study (see Table 1). Given two partitions of observations, the Rand Index (Rand 1971) is the number of agreements (pairs of observations that are either in the same subset for both partitions or are in different subsets for both partitions) divided by the total number of pairs . The adjusted Rand Index is the difference between the Rand Index and its expectation when choosing partitions randomly, normalized by the difference between the maximum possible Rand Index and its expectation. The adjusted Rand Index is thus bounded by one and is expected to be zero when partitions are chosen randomly. We illustrate the estimation uncertainty of found change points corresponding to some adjusted Rand measures (taking true and estimated segments as the two partitions) via histograms in Figure 4.
Setup of the main simulation study
We will illustrate the behavior of our methods on settings with observations of dimension with three change points with segments of sizes and . We randomly permute the order of the segments to avoid systematic effects. Note that in the smallest segment the number of observations is smaller than the number of variables, resulting in a truly highdimensional setting when splitting. In each simulation, we generate a precision matrix (either random or chain network) for each segment and then draw observations independently from the corresponding centered multivariate normal distribution. The parameter is held fixed at and we vary the proportion of missing data in steps of between and . Note that in this setup in expectation there is less than one complete observation available per segment when deleting only of the values completely at random. Therefore, discarding incomplete observations is clearly not a viable option and some kind of imputation method is necessary.
5.2 Results
We analyze the estimation performance (using the model selection approach of Section 4) for our three methods (average, pairwise and LW, see Section 3.1) both using BS and OBS. We ran 100 simulations for each setting. The corresponding mean values of adjusted Rand Indices are displayed in Table 1
along with their standard deviations in parenthesis. To aid the interpretability, we present the adjusted Rand Indices of a selection of estimation results together with the true change points in Table
3. Note that finding the correct number of change points with an accuracy of around two observations each leads to an adjusted Rand Index of around 0.95. Finding all true change points plus a false positive one leads to an adjusted Rand Index of around 0.8, similar to finding only two of the three change points. Additionally we show histograms of all the change points found over 500 simulations in Figure 4, for which we needed to consider a fixed true change points scenario to be meaningful. Note that the yaxis is on a logscale.PERCENTAGE OF MISSING VALUES  

NW  MISS  METHOD  10%  20%  30%  40%  50%  
CN  MCAR  LW  BS  0.999 (0.003)  0.996 (0.006)  0.993 (0.010)  0.960 (0.060)  0.430 (0.314) 
CN  MCAR  LW  OBS  0.998 (0.004)  0.996 (0.006)  0.987 (0.025)  0.949 (0.068)  0.466 (0.301) 
CN  MCAR  av  BS  1.000 (0.001)  0.998 (0.004)  0.952 (0.072)  0.244 (0.256)  0.000 (0.000) 
CN  MCAR  av  OBS  0.999 (0.002)  0.998 (0.005)  0.957 (0.069)  0.226 (0.252)  0.000 (0.000) 
CN  MCAR  pair  BS  0.999 (0.003)  0.997 (0.006)  0.993 (0.010)  0.939 (0.118)  0.372 (0.297) 
CN  MCAR  pair  OBS  0.998 (0.004)  0.996 (0.006)  0.986 (0.026)  0.938 (0.074)  0.384 (0.290) 
CN  block  LW  BS  0.999 (0.004)  0.994 (0.012)  0.985 (0.020)  0.953 (0.051)  0.813 (0.208) 
CN  block  LW  OBS  0.998 (0.005)  0.991 (0.019)  0.977 (0.037)  0.923 (0.085)  0.727 (0.257) 
CN  block  av  BS  0.957 (0.086)  0.811 (0.214)  0.367 (0.340)  0.242 (0.267)  0.280 (0.245) 
CN  block  av  OBS  0.953 (0.123)  0.765 (0.243)  0.294 (0.326)  0.197 (0.268)  0.248 (0.253) 
CN  block  pair  BS  0.998 (0.006)  0.990 (0.017)  0.981 (0.025)  0.941 (0.069)  0.813 (0.170) 
CN  block  pair  OBS  0.997 (0.007)  0.990 (0.013)  0.958 (0.039)  0.923 (0.076)  0.732 (0.240) 
RN  MCAR  LW  BS  0.991 (0.012)  0.973 (0.051)  0.943 (0.083)  0.828 (0.116)  0.510 (0.228) 
RN  MCAR  LW  OBS  0.986 (0.025)  0.971 (0.046)  0.932 (0.090)  0.791 (0.146)  0.452 (0.270) 
RN  MCAR  av  BS  0.981 (0.041)  0.878 (0.102)  0.578 (0.221)  0.076 (0.184)  0.005 (0.050) 
RN  MCAR  av  OBS  0.983 (0.028)  0.858 (0.120)  0.499 (0.275)  0.063 (0.168)  0.011 (0.075) 
RN  MCAR  pair  BS  0.988 (0.014)  0.974 (0.050)  0.931 (0.086)  0.779 (0.141)  0.438 (0.253) 
RN  MCAR  pair  OBS  0.986 (0.025)  0.970 (0.050)  0.917 (0.094)  0.766 (0.156)  0.373 (0.266) 
RN  block  LW  BS  0.990 (0.021)  0.974 (0.038)  0.927 (0.103)  0.791 (0.208)  0.464 (0.328) 
RN  block  LW  OBS  0.979 (0.041)  0.949 (0.073)  0.904 (0.137)  0.741 (0.225)  0.386 (0.342) 
RN  block  av  BS  0.752 (0.210)  0.336 (0.311)  0.173 (0.250)  0.198 (0.238)  0.276 (0.226) 
RN  block  av  OBS  0.711 (0.243)  0.234 (0.287)  0.119 (0.207)  0.136 (0.224)  0.227 (0.244) 
RN  block  pair  BS  0.986 (0.027)  0.967 (0.049)  0.923 (0.100)  0.759 (0.221)  0.573 (0.313) 
RN  block  pair  OBS  0.980 (0.037)  0.949 (0.079)  0.863 (0.181)  0.700 (0.234)  0.483 (0.339) 
NUMBER OF OBSERVATIONS  

NW  METHOD  450  400  350  300  250 
CN  BS  1.000 (0.001)  1.000 (0.001)  0.999 (0.002)  1.000 (0.002)  0.997 (0.009) 
CN  OBS  1.000 (0.002)  0.999 (0.003)  0.998 (0.006)  0.999 (0.004)  0.996 (0.016) 
RN  BS  0.990 (0.019)  0.990 (0.018)  0.980 (0.039)  0.947 (0.077)  0.875 (0.095) 
RN  OBS  0.987 (0.021)  0.987 (0.019)  0.969 (0.051)  0.931 (0.083)  0.833 (0.084) 
TRUE CPTS  FOUND CPTS  ADJ. RAND 

120, 190, 310  120, 188, 310  0.992 
70, 190, 310  66, 191, 310  0.980 
120, 240, 430  118, 239, 423  0.9 49 
190, 260, 380  93, 190, 260, 380  0.804 
120, 240, 310  119, 243  0.757 
120, 240, 310  297  0.506 
70, 190, 380  380  0.363 
All three methods perform similarly if at most 30% of values for the chain network setup or 20% of values for the random network setup are deleted completely at random. For more challenging MCAR setups the average imputation method fails to estimate the change points reliably, whereas the LW and pairwise methods still perform reasonably well if up to 30% – 40% of values are missing. The performance of all methods tends to be worse if the values are deleted blockwise, as expected. Here, the average imputation method clearly underperforms compared to the other two methods even when only 10% of the values are deleted. The pairwise and LW methods perform very well for up to 30% – 40% of missing values. The LW method seems to perform somewhat better than the pairwise method. The scenarios with 50% missing values are all very challenging and the ranking of the methods may be slightly different compared to the easier scenarios.
Figure 3 provides a representative example comparing the gain curves recovered using our three methods. If the values are deleted completely at random, the piecewise convex structure is well conserved for all three methods. For blockwise deleted values it is somewhat distorted when using the LW and pairwise methods but local optima still occur at two of the true change points. The structure is not recoverable at all due to big jumps with the average imputation method, explaining the low mean adjusted Rand Index in this scenario. A key insight is that with the LW and pairwise methods the piecewise convex structure is approximately conserved, allowing for good estimation of at least one of the underlying change points, as necessary for BS and OBS. Note also that the imputation method (and in general also the fraction of missing values) additionally has an effect on the magnitude of the gain curve.
The simulation scenarios with random networks seem to be a lot harder for our methods to estimate change points compared to chain networks. As expected, at the price of higher computational cost, BS slightly outperforms OBS, especially in hard scenarios with a high fraction of missing values where we expect the piecewise convex structure of the gain curve to be strongly distorted. We also tested standard BS and OBS (as described in Section 2) on scenarios where no values are missing, but the total number of observations is scaled down by 10% – 50% (Table 2). Not surprisingly, we see that complete data are generally more informative than the analogous data with missing values. Also here, the random network scenarios are more challenging compared to chain networks.
In the middle scenario of Figure 4 with 30% missing values we estimate a total of 1343 change points with 1079 within five and 1190 within ten observations away from a true underlying change point. The analogous numbers for the settings with 40% missingness are 1118, 754 and 882 and for 50% missingness are 724, 353 and 447. The decreasing performance is also reflected in the average adjusted Rand Indices. In total there would be true underlying change points. This indicates a general trend, where our algorithm seems to underfit, i.e. select too few change points, rather than selecting incorrect ones.
To investigate our model selection procedure presented in Section 4, we did a complementary study in a setting of no true underlying change points. Testing all scenarios of Table 1 with 100 simulations each resulted in a total of simulations. In the setting where and and without true underlying change points our estimators found a total of 734 change points, 728 of which were found with the average imputation method for data where values were deleted blockwise, four with the pairwise method and two with the LW method, both with blockwise deleted values. Our model selection did not select any change points in any of the 6000 simulation with values deleted completely at random or any of the 4800 simulations with the LW or pairwise estimator where up to 30% of values were deleted blockwise. In an analogous simulation study with and (setting to obtain reasonable fits) again no change points were found in the simulations with values missing completely at random. In total 615 change points were found in the 12000 simulations. 109 were found with the average method, 285 were found with the LW method and 221 with the pairwise method. Only 3, 1 and 3 respectively were found in scenarios with up to 30% missing values.
Finally we indicate the computational cost of the different methods. Running on a single Intel Xenon 3.0 GHz processor core, with our current R implementation each BS search including model selection took around 130, 150 and 170 seconds for the average, LW and pairwise method for an average scenario of Table 1. Change point estimation with OBS took around 40 seconds for each of the three methods, thus enabling massive speedups.
6 Applications
We applied our methods to two real data examples with naturally occurring missing values. One contains monthly shallow groundwater level measurements at sampling locations between the Rivers Danube and Tisza within Hungary (Figure 5) from January 1951 to September 2013. The original measurements were seasonally adjusted for each sampling location individually. The second data set comprises of daily logreturns of the stocks currently listed in the S&P 500 index for the period January 1990 to March 2019. The two data sets have approximately 35% and 19% missing values, respectively.
Care needs to be taken as the assumption of underlying piecewise constant GGMs might not be fully valid in applications. In particular, the underlying model might also change smoothly over time. Such deviations of the model assumptions are usually visible in the shape of the gains curves.
6.1 Shallow groundwater levels
Due to the nature of the monitoring system there are a lot of missing values with a blockwise missingness structure that is strongly inhomogeneous in time, see bottom right of Figure 6. Out of the 136 sampling sites only 93 had nonmissing values for more than months, with some having values missing for three quarters of the months. The reduced set of 93 sites only has approximately 16% missing values. We first applied BS with our three proposed methods to this subset (left of Figure 6). We used crossvalidated for each segment as described in Section 4 and a small for better visualisation of the gain curves at the boundaries. Note that plotting the whole gain curves requires a full search, for which BS (including the proposed model selection) took roughly 30 minutes. While the order of the splits are different across the three methods, the finally found change points lie very close to each other. The three change points which all methods agree on for both the full and reduced data set with the highest crossvalidated improvements lie around November 1964, April 1983 and January 1996 (with up to only two observations difference). Besides these three, several smaller ones are found. The major change points have a clear hydrogeological interpretation. For example, around 1983, there was a drop in groundwater levels unequally impacting areas within the monitoring system. This presumably shows up jointly as a mean and covariance shift in our analysis. Note that our loglikelihood based approach will also detect shifts in mean.
Interestingly, applying BS to the very challenging full data set yields very similar results (right of Figure 6). This would not have been possible without the techniques preventing jumps in the gain curves discussed in Section 3.2 due to the big blocks of missing values. Even with our technique, the gain curve for the first split flattens out, showing the challenges in such complicated scenarios. Also note the jumps in the gain curve of the average imputation method around the boundaries of big missing blocks, similar to Figure 3.
6.2 Financial time series
Due to the high number of variables combined with a high number of observations, we needed to use OBS to estimate change points on the financial data set. The algorithm found a lot of change points with positive cross validated improvement. At least partially, this could be due to smooth changes in the graphical model besides the abrupt ones. Nonetheless, we would like to present our findings.
The pairwise and LW methods yielded similar results, with the found change points differing only by a few days. The average method agreed on some of these change points, but occasionally found change points differed by a few months. The five change points with the highest cross validated improvement (equal for both the LW and pairwise method) seem reasonable, with August 1998 possibly corresponding to the Russian financial crisis, June 2003 corresponding to the start of the US operations in Iraq, January 2008 and November 2009 to the beginning and end of the peak period of the 2008 financial crisis and August 2012 related to the European debt crisis. While these are possible explanations, more expertise in finance is needed for a more thorough analysis of all the found change points.
7 Conclusions
Our estimation methods enable practitioners to search for change points in settings where this was impossible before. The LohWainwright based imputation method presented in Section 3.1 has a stable performance even in challenging highdimensional scenarios with lots of missing values. Since its computational cost is not significantly higher than the other methods considered, we generally recommend its usage. The choice of BS or OBS should be based on computational resources. BS results in slightly better results and the possibility of better visualization by drawing the full gain curves, however at a considerably higher computational cost than OBS. We emphasize that technical adjustments regarding the evaluation of the gains introduced in Section 3.2 as well as the model selection procedure of Section 4 lie at the core of our methodology, enabling good performances both on simulated and real data with missing values.
Acknowledgements
We thank Tamás Garamhegyi, József Kovács (Department of Geology, Eötvös Loránd University, Budapest, Hungary) and József Szalai (General Directorate of Water Management, Budapest, Hungary) for the groundwater data set and for providing hydrogeological interpretation of the found change points. Solt Kovács and Peter Bühlmann were partially supported by the European Research Commission grant 786461 CausalStats  ERC2017ADG.
Supplementary material
 Rpackage

The Rpackage hdcdwithmissingvalues available at https://github.com/MalteLond/hdcd_NA implements our methods to find change points in a GGM with possibly missing values. The files used for simulations, the simulation results and the files used to draw the plots can be found in the folder simulation. The S&P500 data is also included. The groundwater data set is not public, the owners only permitted using it as an illustrating example. Hence, we cannot include the data set itself in the supplementary materials.
References
 Angelosante and Giannakis (2011) Angelosante, D. and Giannakis, G. B. (2011), “Sparse graphical modeling of piecewisestationary time series,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1960–1963.
 Atsawawaranunt et al. (2018) Atsawawaranunt, K., ComasBru, L., Amirnezhad Mozhdehi, S., Deininger, M., Harrison, S. P., Baker, A., Boyd, M., Kaushal, N., Ahmad, S. M., Ait Brahim, Y., Arienzo, M., Bajo, P., Braun, K., Burstyn, Y., Chawchai, S., Duan, W., Hatvani, I. G., Hu, J., Kern, Z., Labuhn, I., Lachniet, M., Lechleitner, F. A., Lorrey, A., PérezMejías, C., Pickering, R., Scroxton, N., and Members, S. W. G. (2018), “The SISAL database: a global resource to document oxygen and carbon isotope records from speleothems,” Earth System Science Data, 10, 1687–1713.
 Avanesov and Buzun (2018) Avanesov, V. and Buzun, N. (2018), “Changepoint detection in highdimensional covariance structure,” Electronic Journal of Statistics, 12, 3254–3294.

Banerjee et al. (2008)
Banerjee, O., El Ghaoui, L., and d’Aspremont, A. (2008), “Model
Selection Through Sparse Maximum Likelihood Estimation for Multivariate
Gaussian or Binary Data,”
Journal of Machine Learning Research
, 9, 485–516.  Bybee and Atchadé (2018) Bybee, L. and Atchadé, Y. (2018), “Changepoint Computation for Large Graphical Models: A Scalable Algorithm for Gaussian Graphical Models with Changepoints,” Journal of Machine Learning Research, 19, 440–477.
 Dette et al. (2018) Dette, H., Pan, G., and Yang, Q. (2018), “Estimating a change point in a sequence of very highdimensional covariance matrices,” arXiv:1807.10797.
 Erdős and Rényi (1960) Erdős, P. and Rényi, A. (1960), “On the Evolution of Random Graphs,” in Publication of the Mathematical Institute of the Hungarian Academy of Sciences, vol. 5, pp. 17–61.
 Fan et al. (2009) Fan, J., Feng, Y., and Wu, Y. (2009), “Network exploration via the adaptive LASSO and SCAD penalties,” The Annals of Applied Statistics, 3, 521–541.
 Friedman et al. (2008) Friedman, J., Hastie, T., and Tibshirani, R. (2008), “Sparse inverse covariance estimation with the graphical lasso,” Biostatistics, 9, 432–441.
 Gibberd and Nelson (2017) Gibberd, A. J. and Nelson, J. D. (2017), “Regularized Estimation of Piecewise Constant Gaussian Graphical Models: The GroupFused Graphical Lasso,” Journal of Computational and Graphical Statistics, 26, 623–634.
 Gibberd and Roy (2017) Gibberd, A. J. and Roy, S. (2017), “Multiple Changepoint Estimation in HighDimensional Gaussian Graphical Models,” arXiv:1712.05786.
 Hallac et al. (2018) Hallac, D., Nystrup, P., and Boyd, S. (2018), “Greedy Gaussian segmentation of multivariate time series,” Advances in Data Analysis and Classification.
 Hallac et al. (2017) Hallac, D., Park, Y., Boyd, S., and Leskovec, J. (2017), “Network Inference via the TimeVarying Graphical Lasso,” in Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 205–213.
 Haubner (2018) Haubner, L. (2018), “Optimistic binary segmentation: A scalable approach to changepoint detection in highdimensional graphical models,” Master Thesis, ETH Zurich.
 Hawkins (1976) Hawkins, D. M. (1976), “Point estimation of the parameters of piecewise regression models,” Journal of the Royal Statistical Society: Series C, 25, 51–57.
 Higham (2002) Higham, N. J. (2002), “Computing the nearest correlation matrix  a problem from finance,” IMA Journal of Numerical Analysis, 22, 329–343.
 Hubert and Arabie (1985) Hubert, L. and Arabie, P. (1985), “Comparing partitions,” Journal of Classification, 2, 193 – 218.
 Kolar et al. (2010) Kolar, M., Song, L., Ahmed, A., and Xing, E. P. (2010), “Estimating timevarying networks,” The Annals of Applied Statistics, 4, 94–123.

Kolar and Xing (2011)
Kolar, M. and Xing, E. P. (2011), “On Time Varying Undirected Graphs,”
in
Proceedings of the 14th International Conference on Artificial Intelligence and Statistics
, pp. 407–415.  Kolar and Xing (2012) — (2012), “Estimating networks with jumps,” Electronic Journal of Statistics, 6, 2069–2106.
 Lam and Fan (2009) Lam, C. and Fan, J. (2009), “Sparsistency and rates of convergence in large covariance matrix estimation,” The Annals of Statistics, 37, 4254–4278.
 Lauritzen (1996) Lauritzen, S. L. (1996), Graphical models, Oxford statistical science series, Oxford University Press.
 Leonardi and Bühlmann (2016) Leonardi, F. and Bühlmann, P. (2016), “Computationally efficient change point detection for highdimensional regression,” arXiv:1601.03704.
 Loh and Wainwright (2012) Loh, P.L. and Wainwright, M. J. (2012), “Structure estimation for discrete graphical models: Generalized covariance matrices and their inverses,” in Advances in Neural Information Processing Systems, pp. 2087–2095.
 Matteson and James (2014) Matteson, D. S. and James, N. A. (2014), “A Nonparametric Approach for Multiple Change Point Analysis of Multivariate Data,” Journal of the American Statistical Association, 109, 334–345.
 Mazumder and Hastie (2012) Mazumder, R. and Hastie, T. (2012), “The graphical lasso: New insights and alternatives,” Electronic Journal of Statistics, 6, 2125–2149.
 Meinshausen and Bühlmann (2006) Meinshausen, N. and Bühlmann, P. (2006), “Highdimensional graphs and variable selection with the lasso,” The Annals of Statistics, 34, 1436–1462.
 Rand (1971) Rand, W. M. (1971), “Objective criteria for the evaluation of clustering methods,” Journal of the American Statistical Association, 66, 846 – 850.
 Ravikumar et al. (2011) Ravikumar, P., Wainwright, M. J., Raskutti, G., and Yu, B. (2011), “Highdimensional covariance estimation by minimizing L1penalized logdeterminant divergence,” Electronic Journal of Statistics, 5, 935–980.
 Rothman et al. (2008) Rothman, A. J., Bickel, P. J., Levina, E., and Zhu, J. (2008), “Sparse permutation invariant covariance estimation,” Electronic Journal of Statistics, 2, 494–515.
 Roy et al. (2017) Roy, S., Atchadé, Y., and Michailidis, G. (2017), “Change point estimation in high dimensional Markov randomfield models,” Journal of the Royal Statistical Society: Series B, 79, 1187 – 1206.
 Shirvani (2015) Shirvani, A. (2015), “Change point analysis of mean annual air temperature in Iran,” Atmospheric Research, 160, 91–98.
 Städler and Bühlmann (2012) Städler, N. and Bühlmann, P. (2012), “Missing values: sparse inverse covariance estimation and an extension to sparse regression,” Statistics and Computing, 22, 219–235.
 Tibshirani (1996) Tibshirani, R. (1996), “Regression shrinkage and selection via the lasso,” Journal of the Royal Statistical Society, Series B, 58, 267–288.
 Wang et al. (2017) Wang, D., Yu, Y., and Rinaldo, A. (2017), “Optimal Covariance Change Point Localization in High Dimension,” arXiv:1712.09912.
 Witten et al. (2011) Witten, D. M., Friedman, J. H., and Simon, N. (2011), “New Insights and Faster Computations for the Graphical Lasso,” Journal of Computational and Graphical Statistics, 20, 892–900.
 Xie et al. (2013) Xie, Y., Huang, J., and Willett, R. (2013), “ChangePoint Detection for HighDimensional Time Series With Missing Data,” IEEE Journal of Selected Topics in Signal Processing, 7, 12–27.
 Yuan and Lin (2007) Yuan, M. and Lin, Y. (2007), “Model selection and estimation in the Gaussian graphical model,” Biometrika, 1–17.
 Zhou et al. (2010) Zhou, S., Lafferty, J., and Wasserman, L. (2010), “Time varying undirected graphs,” Machine Learning, 80, 295–319.
Comments
There are no comments yet.