1 Introduction
Chromosomes are tightly packed in nucleus and well maintained for threedimensional (3D) structures, which play a critical role in many genomic functions such as gene regulation, DNA replication, epigenetic modification and maintenance of genome stability. To understand the spatial information of chromosome organization, several sequencing methods have been developed to assess chromatin interactions. In particular, the chromosome conformation capture (3C) technology quantifies the frequencies of interactions between sites on a chromosome (Dekker et al., 2002). Circularized 3C (4C) technique can obtain the contact reads between one locus and all other loci on a chromosome (van de Werken et al., 2012). 3C carbon copy (5C) technology can detect all the contacts in some specific region of chromosome (Dostie et al., 2006). HiC technology extends 3C and is capable of generating genomewide chromosome interaction information (LiebermanAiden et al., 2009). These 3Cbased experimental techniques have significantly advanced our understanding of chromatin structure (Rao et al., 2015) and its relation to the regulation of gene expression, cancer translocations, copy number alterations (Fudenberg et al., 2011) and de novo genome assembly (Dudchenko et al., 2017).
The study of HiC data has revealed that chromosome can be partitioned into topologically associating domains (TADs) in human and mouse cells (Dixon et al., 2012; Sexton et al., 2012). As indicated by panel A in Figure 1, TADs are characterized as structural and functional units of mammalian chromosomes that engage in dynamic higherorder interTAD connections, playing a major role in regulatory transactions during DNAdependent processes (Hnisz et al., 2016). As TADs are characterized by pronounced longrange associations between loci located in adjacent domains, these studies suggest that chromosomes are composed of a string of domains that are topologically separated from each other. Subsequent studies show that TADs are separated by boundaries that seem to be genetically defined (Nora et al., 2012), and that such TADs are closely related to gene regulatory functions and illustrate the longrange associations between loci on a chromosome (Rao et al., 2015; Symmons et al., 2014). Furthermore, recent biological studies (PhillipsCremins et al., 2013; Zuin et al., 2014) show the distinct combinations of proteins such as CTCF, mediator and cohesin may also influence boundaries of TADs and subTADs, i.e., indicating that TADs have hierachical structures; see panel B in Figure 1.
The HiC technology generates read pairs corresponding to pairs of genomic loci that physically interact in the nucleus (LiebermanAiden et al., 2009). For a chromosome, the observed number of read pairs are used to construct a chromosomal contact map or a square matrix , where describes the frequency of interactions between position and position along the chromosome at a given resolution. By the definition of the contact matrix, , i.e., matrix is symmetric. Note that, a high interaction frequency indicates for the spatial proximity, which facilitates quantitative studies about the relationship between gene functions and genomic 3D structures. Hence, in the HiC contact profile, contact frequencies tend to interact more frequently with each other at loci located within a TAD, but much less frequently at loci of different TADs (Figure 1A). This indicates the HiC data matrix has a blockdiagonal structure. Therefore, the issue of detecting diagonal blocks can be considered as a problem of estimating multiple changepoints (or boundaries) on the diagonal of a matrix, and the problem of estimating hierarchical TAD structures is equivalent to the problem of making inference on hierarchical strucutures of changepoints in HiC profiles.
As the analysis of TADs and their hierarchical structures involves the problem of estimating multiple changepoints and their hierarchical strucutures, we note that various Bayesian and frequentist methods have been developed for multiple changepoint analysis of uni and multivariate data in statistics, econometrics, and engineering over the last several decades. In particular, the Bayesian methods assume multiple changepoints follow a stochastic process and solve the inference problem through Markov chain Monte Carlo (MCMC) simulations
(Albert and Chib, 1993; Liu and Lawrence, 1999; Wang and Zivot, 2000; Chib et al., 2002; Green, 1995). Apart from simulation based Bayesian methods, exact Bayesian inference approaches with statistically efficient approximations are also developed for various multiple changepoints problems in statistics, genomic data analysis, and finance
(Yao, 1984; Lai et al., 2008; Lai and Xing, 2011, 2013; Xing et al., 2012a, b; Xing and Chen, 2018). Different from the Bayesian methods which use either MCMC or exact approaches, the frequentist methods provide more technical tools for multiple changepoint analysis, which include the dynamic programming algorithm based on maximized likelihood (Bai and Perron, 1998, 2003; Perron and Qu, 2007), the binary segmentation procedure (Vostrikova, 1981; Olshen et al., 2004; Matteson and James, 2014), the model selection procedure based on information criteria (Yao, 1988; Siegmund, 2004; Davis et al., 2006; Zhang and Siegmund, 2007; Shen and Zhang, 2012), and the penalized likelihood or cost function approaches (Birge and Massart, 2001; Broman and Speed, 2002; Lavielle, 2005; Harchaoui and LévyLeduc, 2010).Although many approaches (as discussed above) have been developed for multiple changepoint analysis , none of them can be directly to analysis multiple changepoints and their hierarchical structures in HiC profiles. The key reason is that HiC data are essentially a type of matrixvariate data, which can not be easily transformed into a multivariate vector and analyzed by existing methods. Recently, several statistical methods were proposed to address this issue and estimate the boundaries (or changepoints) of TADs and their hierarchical structures. These methods can be roughly classified into three types. The first type of methods projects the 2dimensional (2D) matrix into a onedimensional (1D) vector and use a 1D test statistic to estimate TAD boundaries in the contact matrix
. In particular, Dixon et al. (2012) computed a 1D directionality index(DI) from the contact matrix, and used a hidden Markov model (HMM) to partition the genome into regions defined by changes in the DI values.
Eg Sauria et al. (2015) introduced a 1D statistic to capture sudden shifts in interaction preference, however, they did not explicitly pair their estimated boundaries into domains, leaving the domain structure ambiguous. We shall note that the projection of a 2D matrix into a 1D summary statistic incurs an information loss, and hence could not provide a best estimation of changepoints in HiC matrices. The second type of methods tries to identify TADs using the full 2D contact matrices. LévyLeduc et al. (2014) defined a blockwise segmentation model for the detection of TAD boundaries, and rephrased the maximization of the likelihood with respect to the block boundaries in terms of a 1D segmentation problem, for which the standard dynamic programming applies. Filippova et al. (2014)introduced a dynamicprogramming based algorithm to estimate TADs in HiC data matrices for a given domainlength scaling factor. Their method outputs the set of nonoverlapping domains that are most robust to changes in the parameter value. Assuming that a random symmetric matrix consists of random variables whose distribution changes from block to block.
Brault et al. (2018) developed a nonparametric homogeneity test to estimate locations of block boundaries (changepoints) of nonoverlapping blocks in the matrix. Haddad et al. (2017)introduced a hierarchical clustering algorithm by iteratively merging close pairs of columns in HiC matrix. Note that both the first and second types of methods assume that TADs are nonoverlapping, and ingore the hierarchical structures of TADs. The third type of methods focus on making inference on hierarchical structures of TADs, but the procedures are rather ad hoc. Specifically,
Weinreb and Raphael (2015) proposed a treebased algorithm to estimate hierarchical TADs, but the significance assesment on their estimate is missing. Yu et al. (2017)assumed that each contact frequency is a mixture of intradomain and interdomain contacts and proposed to fit a twocomponent Gaussian mixture model to the normalized HiC data matrix.
Yu et al. (2017)’s method provides us twolayer hierarchical TAD structure, ignoring the fact that the hierarchical TAD structure usually has layers more than two.To overcome the weakness of existing methods for multiple changepoint analysis of HiC matrices, we develop a theory of generalized likelihood ratio (GLR) tests for a known and a unknown changepoint in HiC matrices, and study the asymptotic properties of the test statistics. Since the topology of changepoints in a HiC matrice are different from that in uni or multivarate random variables, the test statistics from the GLR tests for HiC data are different from those obtained in uni or mulitvariate changepoint tests. We show that that the maximum of the GLR test statistics converges to a maxima on a Gaussian random field with twodimensional indices. We then make use of this result and propose a topdown and bottomup procedure to estimate the hierarchical TAD structures in HiC data. Specifically, by repeatedly applying the changepoint test for HiC data, we first use a topdown method to locate all changepoints (or boundaries of TADs) in a HiC matrix; then due to the fact that the hierarchical TADs in HiC matrix usually have more than one root, we apply the changepoint tests again and use a bottomup method to estimate hierarchical structures of TADs.
We then carry out two simulation studies and a real data analysis to test the performance of our method. Our first simulation study considers four different scenarios of HiC data and investigates the performance of the topdown method of estimating boundaries of TADs in simulated HiC data. In the second stimulation study, we apply the topdown and bottomup procedure to estimate hierarchical structures in a collection of benchmark datasets, which were simulated by other research group. In a real data analysis, we use our method to analyze HiC data of multiple cell lines that are produced from genomic experiments on chromatin analysis. By comparing our estimation results with real hierarchical structures of TADs in those cell lines, we show how our method can accurately estimate the hierarchical TAD structures in HiC profiles and demonstrate how the result obtained by our approach is consistent with biological phenomenon via integrative analysis of chromatin interactions and histone modifications.
The rest of the paper is organized as follows. Section 2 specifies a model for TADs in HiC profiles, and develop a generalized likelihood ratio test for a changepoint in HiC matrix. In Section 3, we propose a topdown and bottomup approach to estimate hierarchical TADs. Simulation studies are carried out in Section 4, and real data analysis are performed in Section 5. Section 6 provides some concluding remarks.
2 A generalized likelihood rato test for changepoints in HiC matrices
2.1 Model assumption
Let be a HiC matrix, where is the intensity of the interaction between positions and in the chromosome. By construction, the HiC data matrix is symmetric. Assume that all intensities are independent variables with distribution
(2.1) 
Suppose that consists of nonoverlapping TAD or diagonal blocks and the true locations of changepoint (or block boundaries) are . For convenience, we let and . We define the region of all positions , and the th diagonal block, . For two or more consecutive diagonal blocks , we define the complement part . Then the parameters are supposed to be blockwise constant, i.e.,
(2.2) 
An example of such a matrix with changepoints is displayed in the left panel of Figure 2. For distribution , we will consider three specifications:

Gaussian distribution, or ;

Poisson distribution, or ;

negative binomial distribution, or
.
The Gaussian distribution (G) can be assumed for normalized HiC data, which are the contact frequencies after removing the effect of distance decay. Poisson distribution (P) or negative binomial distribution (NB) is usually assumed for raw HiC data. We further assume that the parameters and in distributions (G) and (NB) are constant for all . Then given distributions and locations of changepoints, the loglikelihood of the data is given by
(2.3) 
Specifically, we have, for Guassian distributions (G),
for Poisson distributions (P),
and for negative bionomial distributions (NB),
2.2 Generalized likelihood ratio tests for a single changepoint in a HiC matrix
Suppose that the data matrix
has at most one changepoint, and we consider a generalized likelihood ratio (GLR) test for the null hypothesis
against the alternative . Suppose that, under the alternative, there exists an unknown changepoint at position such that , as shown in the right panel of Figure 2. Denote , and the sum of in sets , , and , respectively, i.e.,Provided the loglikelihood (2.3) for changepoints, the logarithm of GLR can be expressed as , or equivalently,
(2.4) 
In particular, the logarithms of the GLR statistics for Gaussian distributions (G), Poisson distributions (P), and negative binomial distributions (NB) are expressed as
(2.5)  
(2.6)  
(2.7) 
where represents the cardinality of the set, so we have , , , and . Then the exact GLR test rejects the null hypothesis if the maximum of the finite sample statistic , , exceeds certain threshold, where is some integer. Asymptotically, these exact tests for Guassian distributions, Poisson distributions, and negative binomial distributions are equivalent, which can be seen from the following theorem.
Theorem 1. The GLR statistics (2.5)(2.7) are asymptotically equivalent to the following test statistic
(2.8) 
where for all and , and , and are the sum of in sets , and , respectively. If the alternative hypothesis is that is a changepoint, then the null should be rejected if in (2.8) exceeds certain threshold. If the location of the changepoint is unknown, i.e., the alternative hypothesis is that there is a changepoint between and , then an asymptotic GLR test for Gaussian distributions (G), Poisson distributions (P), and negative binomial distributions (NB) rejects the null hypothesis of no changepoint if, for some , the maxima of , i.e.,
(2.9) 
exceeds certain threshold.
To find the distribution of the test statistics , one can use Monte Carlo simulations. As an exmaple, Figure 3 shows the histogram and the density function estimated by kernal smoothing methods for and . Another way of obtaining the (tail) distribution of is to consider the asymptotic distribution of and . In particular, we consider a Gaussian random field defined on the upper triangular part of a unit square, . The random field satisfies the following properties: (1) for and ,
are normally distributed as
; (2) for , are normally distributed as ; (3) for regions , , Cov. For region , we define the integralThen the asymptotic distribution of and can be described via the following theorem.
Theorem 2. Assume that as , then the regions , and converge to regions , , and , respectively (note that ). Correspondingly,
and if ,
Therefore, at confidence level, an asymptotic GLR test with the alternative that is a changepoint rejects the null hypothesis if , where Prob, and an asymptotic GLR test with the alternative that there is a changepoint between and rejects the null hypothesis if , where .
3 A topdown and bottomup approach to estimate hierarchical TADs
In this section, we propose an estimation procedure for hierarchical TADs in HiC profiles. The procedure consists of two steps. The first step uses a topdown method to identify multiple changepoints in HiC matrices, and the second step uses a bottomup approach to estimate hierarchical TADs in HiC profiles.
3.1 A topdown method of estimating multiple changepoints
The GLR test using the test statistic (2.9) suggests a binary (or topdown) segmentation procedure to sequentially identify changepoints (or block boundaries) in the HiC matrix. The essential idea is to test all the data to determine if there is at least one changepoint and iterate the procedure to find all locally significant changepoints. Assume that is the minimal size of a TAD in the HiC matrix, is the significance level of the GLR test, and is the upper
quantile of
such that . We can use the following topdown segmentation procedure to identify all changepoints in the HiC data matrix.
Step 1a. For , compute the GLR statistic and . If or the value of the GLR test is smaller than , we label the location such that as a changepoint and denoted as ; if there are more than one such locations, we random pick one of them as . If , stop the procedure.

Step 1b. Given the changepoint , we repeat the changepoint test in Step 1a for regions and and obtains changepoints and , respectively.

Step 1c. We repeat Step 1b for each segmented diagonal blocks until the size of the segmented blocks become smaller than the minimal size of a TAD in the HiC data.
Suppose that changepoints are obtained from the above procedure, ordering them as yields diagonal blocks .
We shall notice that the above procedure may ignore the similarity of two adjacent diagonal blocks segmented at different steps. Therefore one may need a pruning step to merge some adjacent diagonal blocks. To achieve this goal, we can use one of the following procedures. The first is to use Bayesian information criteria (BIC) to select the number of changepoints given , where is the maximal number of changepoints. Specifically, changepoints in a symmetric matrix implies parameters, hence
Then the number and locations of changepoints can be obtained by minimizing . Besides BIC, other information criteria can also be used. Another way is to run the GLR test to see if is a changepoint for the submatrix
If the test doesn’t reject the null then we shall remove the changepoint . Note that the purpose of a pruning step is to obtain a relatively smaller number of changepoints. If the purpose is to obtain a hierarchical structure of changepoints, as discussed in the next section, the pruning procedure is not necessary.
3.2 A bottomup approach to identify hierarchical TADs
The topdown approach in Section 3.1 does not just provide us estimates of multiple changepoints, the order of estimated changepoints naturally indicates a nested structural relationship among segments. In fact, Matteson and James (2014) used similar idea and permutation tests to estimate hierarchical changepoints with a single root in sequence of random vectors.
However, as indicated in Figures 1 and 2, hierarchical TADs in HiC profiles usually have multiple roots, suggesting topdown approaches are not appropriate for the inference of hierarchical structure. Therefore, we propose the following bottomup approach, which uses the test statistic (2.8), to identify the herarchical structure of TADs in HiC profiles. For convenience, we assume a series of GLR tests with confidence levels for layers of hierarchical structure, where . We also denote the estimated changepoints from Step 1 as and the corresponding block diagonals as , here the superscript indicates that the hierarchical structure of changepoints have not been assigned. We start from .

Step 2a. For , perform the GLR test with test statistic (2.8) at significance level to conclude if is a changepoint in block . Denote the test result as if the value of the test is smaller than , and otherwise. If the sequence contains one or several consecutive 0’s, for example, , we combine blocks , and their complementary regions into a larger block , i.e.,
For changepoints such that , we denote . Then we obtain the th order of the hierarchical structure. If , the procedure stops.

Step 2b. For each of the aggregated blocks in Step 2a, say , apply the GLR test with test statistic (2.8) to conclude if each of is a single changepoint with significance level . Use the topdown approach in Step 1 to obtain a hierachical strcuture for changepoints . This provides us a finer nested structure of TADs in .

Step 2c. Let and denote the newly obtained changepoints as . Repeat Steps 2a and 2b until no new blocks are generated or the obtained number of layes exceeds , the maximum number of layers.
Figure 4 provides a schematic illustration of the bottomup procedure. The left panel shows the estimated changepoints from the topdown method with 95% confidence level (i.e., ). Then as shown by the middle panel of Figure 4, we perform the GLR test with test statistic (2.8) at 99% confidence level (i.e., ) for blocks , respectively. The values show that . Hence we aggregate blocks 1, 2, 3, and 4 into a bigger block, , and blocks 6 and 7 into another block . In the next step, we apply the GLR test and topdown approach to obtain an inner hierarchical strucutures for and , respectively, as shown by the right panel of Figure 4. For higher layers of hierarchical structures, we repeat the above procedure for newly obtained changepoints . Notice that, since the total number of estimated changepoints are significantly smaller than the size of HiC profiles, the total number of GLR tests needed in the bottomup procedure is also much smaller than the size of HiC profiles.
4 Simulation studies
We perform two simulation studies in this section. The first considers four different scenarios of HiC data and investigates the performance of the topdown method of estimating boundaries of TADs in simulated HiC data. In the second we apply the topdown and bottomup procedure to estimate hierarchical structures in a collection of benchmark datasets, which were simulated by other research group.
4.1 Study one: Inference on multiple changepoints
The purpose of this study is to assess the performance of the topdown procedure of estimating multiple changepoints in HiC matrices. Let . Recall that is the number of changepoints along the diagonal, the (raw or transformed) contact frequencies follow the distribution (2.1), and the parameters are determined by (2.2). We then consider the following four scenarios for HiC matrices.

Evenly spaced changepoints with Gaussian distributions. Assume that , and each of 25 diagonal blocks have the same size, i.e., , . Following Dixon et al. (2012), we specify parameters and observation as follows. Assume that Gamma(4, 18) for , , and for . The values of are generated by the following distributions.
(4.1) 
Unevenly spaced changepoints with Gaussian distributions. assume that , changepoint locations are discrete uniformly drawn with smallest block size at least 5. The values of and are generated in the same way as in (S1).

Evenly spaced changepoints with Poisson and negative binomial distributions. Assume that , for , and Gamma(4, 18). To generate count data for , we use a negative binomial model for HiC data in Lun and Smyth (2015).
(4.2) Note that, in the complementary region follows a mixture of a point mass and a negative binomial distribution.

Unevenly spaced changepoints with Poisson and negative binomial distributions. The locations of changepoints are generated in the same as (S2). Given all changepoints, the values of are are generated in the same way as in (S3).
Among these four scenarios, the profiles of changepoints in (S1) and (S3) are relatively simpler than those in (S2) and (S4). In Scenarios (S3) and (S4), provides us a NB distribution with mean
and variance
, where the parameter , referred as the biological coefficient of variation (BCV) by Lun and Smyth (2015), varies from 0 to . Hence we use in (S3) and (S4). Note that corresponds to the case that follow a Poisson distribution with mean . To specify or the level of noise in (S1) and (S2), we note that a Gamma(4, 18) distribution has mean 72 for parameters , we then choose the value of such that . Then we use and , corresponding to four values of .We simulate 1000 HiC matrices for each scenario, and use the topdown approach with significance levels and in Section 3.1 to estimate the numbers and locations of changepoints, together with parameters . For (S1) and (S2), we assume that only follow Gaussian distributions and hence the GLR test is based on test statistics (2.5). For (S3) and (S4), we assume that follows Gaussian, Poisson, and negative binomial distributions, respectively, and hence the GLR tests for changepoints are based on test statistics (2.5), (2.6), and (2.7), respectively. We use four measures to evaluate the performance of changepoint estimation. Besides the estimated number of changepoints (denoted as ), we also compute the mean square error for estimated parameters (i.e., MSE), true positive rates (TPR) and false discovery rates (FDR) for the number of changepoints.
Setting  
MSE  TPR  FDR  MSE  TPR  FDR  
.17 (.01)  0.75 (.09)  1 (0)  .0068 (5e4)  .03 (.00)  0.14 (.04)  1 (0)  .001 (2e4)  
(S1)  G  .17 (.01)  0.69 (.07)  1 (0)  .0068 (5e4)  .03 (.01)  0.13 (.03)  1 (0)  .0011 (2e4)  
.17 (.01)  0.79 (.08)  1 (0)  .0065 (5e4)  .03 (.01)  0.15 (.03)  1 (0)  .0011 (2e4)  
.17 (.01)  0.74 (.08)  1.000 (4e5)  .0068 (5e4)  .02 (.00)  0.12 (.02)  1.000 (4e5)  .0010 (2e4)  
.15 (.02)  1.56 (.27)  1 (0)  .0046 (4e4)  .01 (.00)  0.14 (.10)  1 (0)  .0003 (9e5)  
(S2)  G  .13 (.01)  1.23 (.17)  1 (0)  .0041 (4e4)  .02 (.00)  0.16 (.05)  1 (0)  .0005 (1e4)  
.15 (.01)  1.32 (.19)  1 (0)  .0047 (4e4)  .03 (.00)  0.25 (.10)  1 (0)  .0008 (2e4)  
.13 (.01)  1.37 (.22)  .9998 (9e5)  .0043 (4e4)  .02 (.00)  0.30 (.07)  .9998 (9e5)  .0008 (2e4)  
P  .17 (.01)  0.90 (.11)  1.0000 (7e5)  .0070 (6e4)  .03 (.01)  0.20 (.04)  1.0000 (7e5)  .0014 (3e4)  
NB  .19 (.01)  0.84 (.09)  .9998 (8e5)  .0077 (6e4)  .03 (.01)  0.15 (.03)  .9998 (8e5)  .0013 (3e4)  
G  .18 (.01)  0.81 (.09)  1 (0)  .0071 (5e4)  .03 (.01)  0.21 (.05)  1 (0)  .0013 (2e4)  
P  .17 (.01)  0.90 (.10)  1.0000 (4e5)  .0069 (5e4)  .04 (.01)  0.19 (.04)  1.0000 (4e5)  .0015 (3e4)  
(S3)  NB  .19 (.01)  0.96 (.10)  .9995 (1e4)  .0078 (6e4)  .04 (.01)  0.31 (.06)  .9995 (1e4)  .0020 (3e4)  
G  .19 (.01)  0.95 (.09)  .9999 (6e5)  .0077 (6e4)  .03 (.01)  0.22 (.04)  .9999 (6e5)  .0014 (3e4)  
P  .17 (.01)  0.89 (.10)  .9998 (1e4)  .0073 (6e4)  .02 (.00)  0.12 (.02)  .9998 (1e4)  .0012 (3e4)  
NB  .17 (.01)  0.86 (.10)  .9997 (1e4)  .0068 (6e4)  .03 (.01)  0.18 (.05)  .9997 (1e4)  .0014 (3e4)  
G  .19 (.01)  0.81 (.08)  1 (0)  .0074 (6e4)  .03 (.01)  0.13 (.02)  1 (0)  .0010 (2e4)  
P  .15 (.01)  0.82 (.09)  .9999 (7e5)  .0062 (5e4)  .02 (.00)  0.14 (.03)  .9999 (7e5)  .0008 (2e4)  
NB  .17 (.01)  0.77 (.09)  .9997 (1e4)  .0070 (6e4)  .03 (.01)  0.22 (.05)  .9997 (1e4)  .0016 (3e4)  
G  .15 (.01)  0.68 (.07)  .9999 (6e5)  .0061 (5e4)  .02 (.00)  0.17 (.03)  .9999 (6e5)  .0010 (2e4)  
P  .15 (.01)  1.38 (.20)  .9998 (9e5)  .0050 (4e4)  .03 (.01)  0.29 (.08)  .9998 (9e5)  .0010 (2e4)  
NB  .16 (.01)  1.19 (.18)  .9993 (2e4)  .0056 (5e4)  .03 (.01)  0.31 (.08)  .9993 (2e4)  .0015 (3e4)  
G  .14 (.01)  1.54 (.26)  .9999 (7e5)  .0045 (4e4)  .02 (.00)  0.25 (.10)  .9999 (7e5)  .0007 (2e4)  
P  .15 (.01)  1.39 (.19)  .9998 (8e5)  .0047 (4e4)  .02 (.00)  0.23 (.05)  .9998 (8e5)  .0010 (2e4)  
(S4)  NB  .15 (.01)  1.18 (.19)  .9995 (1e4)  .0050 (5e4)  .03 (.01)  0.26 (.06)  .9995 (1e4)  .0013 (2e4)  
G  .16 (.01)  1.50 (.22)  .9998 (9e5)  .0050 (4e4)  .03 (.01)  0.18 (.05)  .9998 (9e5)  .0010 (2e4)  
P  .12 (.01)  1.22 (.27)  .9995 (1e4)  .0042 (4e4)  .03 (.01)  0.57 (.25)  .9995 (1e4)  .0013 (2e4)  
NB  .14 (.01)  1.38 (.21)  .9989 (2e4)  .0055 (5e4)  .03 (.01)  0.22 (.04)  .9989 (2e4)  .0020 (3e4)  
G  .15 (.01)  1.63 (.28)  .9998 (7e5)  .0048 (4e4)  .02 (.01)  0.46 (.16)  .9998 (7e5)  .0010 (2e4)  
P  .16 (.01)  1.42 (.20)  .9994 (1e4)  .0057 (4e4)  .04 (.01)  0.41 (.11)  .9994 (1e4)  .0017 (3e4)  
NB  .14 (.01)  1.54 (.36)  .9989 (2e4)  .0053 (4e4)  .03 (.01)  0.25 (.04)  .9989 (2e4)  .0020 (3e4)  
G  .14 (.01)  1.49 (.25)  .9998 (7e5)  .0055 (4e4)  .03 (.01)  0.35 (.14)  .9998 (7e5)  .0010 (2e4) 
Table 1 summarizes the means and standard errors (in parentheses) of the four measurements for changepoint estimation in scenarios (S1)(S4). The difference between estimated and true number of changepoints in all scenariois have average less then one. Compared to the noise level in the simulation, all MSE are quite small. All TPRs are close to one, and all FDRs range from 0.001 to 0.008, indicating the estimation for changepoints is satisfactory.
4.2 Study two: Inference on hierarchical TADs
To study whether our procedure of estimating hierarchical organization of TADs performs well or not, we use the HiC datasets simulated and released in Forcato et al. (2017). There are two main reasons that their simulated datasets shall be used. Forcato et al. (2017) used quasinegativebinomial generator modified from Lun and Smyth (2015) with each specifically designed, which approximates the real HiC data better. Second, Forcato et al. (2017) had compared several existing segmentation methods in their study, using the same datasets facilitates the comparison of our method to others. The simulated datasets in Forcato et al. (2017) consist of the following two types of specifications for HiC matrices.

Large size matrices with unnested TADs. This dataset consists of 25 simulated HiC matrices with noise levels 4%, 8%, 12%, 16%, and 20%. These simulated matrices contain no nested TAD structures, and each simulated HiC matrix has size around 4500 with 171 diagonal blocks (or TADs).

Large size matrices with nested TADs. This dataset consists of 25 simulated HiC matrices with noise levels 4%, 8%, 12%, 16%, and 20%. Different from (S5), these simulated matrices contain nested TAD structures. In particular, each simulated profile has size around 4,500 and 910 diagonal blocks (or TADs) with three layers of hierarchical structure.
Since matrices are simulated as approximations to the raw data produced from the sequencing experiment, the elements in follow the power law indicated by LiebermanAiden et al. (2009). Hence, before applying our method to estimate changepoints and hierarchical TAD structures, we follow LiebermanAiden et al. (2009) and transform the matrix to the matrix by , where the scaling parameter can be estimated from the data. For transformed matrices in (S5), we apply the topdown method in Section 3.1 with GLR test statistic (2.7) and significance level . To estimate the hierarchical structures in (S6), we use the topdown and bottomup procedure in Sections 3.1 and 3.2 with GLR test statistic (2.7) and significance levels . Note that the significant levels and here serve as tuning parameters of changepoints. In these studies, we assume that all follow Gaussian (G), Poisson (P), and negative binomial (NB) distributions.
0.2 (0.20)  1.6 (0.40)  18.6 (2.89)  83.4 (2.18)  124.2 (9.30)  
P  TPR  .9918 (.0040)  .9847 (.0040)  .9706 (.0032)  .9377 (.0069)  .8294 (.0428)  
FDR  .0070 (.0043)  .0243 (.0042)  .1235 (.0155)  .3692 (.0093)  .5195 (.0168)  
0 (0.32)  2 (0.45)  18.6 (2.89)  82.8 (3.07)  123.8 (9.65)  
(S5)  NB  TPR  .9918 (.0030)  .9836 (.0047)  .9659 (.0047)  .9247 (.0137)  .8259 (.0454) 
FDR  .0082 (.0044)  .0277 (.0049)  .1278 (.0153)  .3763 (.0120)  .5210 (.0178)  
0.2 (0.20)  1.8 (0.37)  19.4 (3.22)  82.6 (2.11)  121.6 (10.67)  
G  TPR  .9988 (.0012)  .9930 (.0022)  .9812 (.0039)  .9459 (.0087)  .8271 (.0478)  
FDR  .0000 (0)  .0173 (.0031)  .1176 (.0163)  .3618 (.0086)  .5169 (.0176)  
P  TPR  .8368 (.0087)  .7458 (.0091)  .6537 (.0059)  .5309 (.0111)  .3524 (.0128)  
FDR  .0671 (.0044)  .0974 (.0068)  .1355 (.0038)  .1877 (.0080)  .2961 (.0107)  
(S6)  NB  TPR  .8338 (.0093)  .7448 (.0107)  .6532 (.0048)  .5289 (.0093)  .3565 (.0135) 
FDR  .0716 (.0050)  .0998 (.0056)  .1344 (.0050)  .1865 (.0048)  .2986 (.0104)  
G  TPR  .8450 (.0070)  .7570 (.0090)  .6568 (.0044)  .5289 (.0118)  .3535 (.0125)  
FDR  .0590 (.0050)  .0867 (.0058)  .1343 (.0043)  .1928 (.0066)  .3013 (.0139) 
For matrices in (S5), we compute the difference between estimated and true numbers of changepoints , the TPR and FDR. These results, together with the standard errors, are presented in the upper panel of Table 2. Similar information on other methods can be found in Forcato et al. (2017). For comparison purpose we quote their testing results for TADtree (Weinreb and Raphael, 2015) and HiCseg (LévyLeduc et al., 2014) along with our method and ICFinder (Haddad et al., 2017) in Figure 5. We find that MuChPoint (Brault et al., 2018) is not applicable to large HiC datasets of this size so their method is not presented here. Our estimations for changepoints or boundaries of TADs are very accurate under moderate noise levels. Even in extremely high noise our performance is still reasonable.
For matrices in (S6), we follow Forcato et al. (2017) and present TPR and FDR for estimated changepoints. These measures indicate the capacity of estimating change points at the bottom layer. The lower panel of Table 2 summarizes the TPRs and FDRs of the changepoints in the bottom layer estimated by our topdown approach. For comparison purpose we quote the testing results for TADtree and HiCseg in Forcato et al. (2017) along with our method and ICFinder, in Figure 6. We find the following advantages of our method. First, for fair levels of noise we obtain higher TPR and lower FDR. Actually, no other method can generate TPR higher than 0.75. Second, even under high noise our FDR is smaller than other approaches. While our method cannot differentiate TADs in bottom layer from their top (or root) layer in high noise, when boundaries between TADs and the complementary region are blurred. This explains our lower TPR in high noise level. As for overall performance, we excel in this scenario.
To have a better idea on the performance of our method, we show in Figure 7 two samples from (S6) with noise level . The right case in Figure 7 was also tested in Forcato et al. (2017) so we quote their results for TADtree and HiCseg along with some other methods. Some of the approaches can idenfify top (or root) layer TADs, but only our method and TADtree capture hierarchical structures. Plus we are much better at detecting bottom layer blocks and achieve less false detection.
5 Real data analysis
5.1 HiC datasets
We use the procedures proposed in Section 3 to estimate hierarchical TAD structures in real data. In particular, we study the high resoluation in situ HiC data of seven cell lines produced by Rao et al. (2015), which can be downloaded from the Gene Expression Omnibus (GEO) database (http://www.ncbi.nlm.nih.gov/geo/) with the access number of GSE63525. These seven cell lines are GM12878, HMEC, HUVEC, IMR90, K562, KBM7, and NHEK. Rao et al. (2015) used a segmentation algorithm, Arrowhead, to analyze their data and we denote their estimated TAD structures as RaoTAD in the rest of the paper. To compare our results with RaoTAD, we define the following measures:

Boundary matching rates. If a boundary (changepoint) of RaoTAD and the closest among ours are within certain distance (number of bins in HiC matrix), we take them as matched. Then the boundary matching rate is defined as the proportion of matched boundaries among the total number of boundaries in RaoTAD. We further perform a hypergeometric test to measure the significance of estimated TADs by our approach consisting of estimated TADs in RaoTAD. In particular, we we treat the set of all bins (i.e., all columns or rows of the HiC matrix) as a population and the estimates of RaoTAD and our methods as successes and samples, respectively.

Overlapping rates. We consider the overlapping rate of TAD bodies. Denote as th RaoTAD, and as the TAD we estimated that share the longest overlapping region with . The overlapping rate is the proportion of overlapping region over the minimal length of and . Then we average that of all RaoTADs to get the overlapping rate for a whole cell line.
5.2 Identifying hierarchical architecture of chromosome organization
Following Weinreb and Raphael (2015), we define the root of a TAD as order 1. If the root has sublevel TADs, the the two subTADs are considered as order 2; similarly, we define subsubTADs as order 3, and so on. Applying the topdown and bottomup approach proposed in Section 3 with , we analyze the high resolution in situ HiC data of seven cell lines. Table 3 show total number of TADs we find in each order level. We find that, although the highest order TAD we find is up to order 7, most of them are exhibited in orders 1 and 2 . This indicates that TAD hierarchical structures are mainly consists of order 1 TADs and maybe their one level of subTADs (usually corresponding to CTCF loops).
Cell line  TADs  order 1  order 2  order 3  order 4 

GM12878  8586  7743 (90.18)  687 (8.00)  96 (1.12)  14 (0.16) 
HMEC  8200  7072 (86.24)  869 (10.60)  183 (2.23)  31 (0.38) 
HUVEC  8903  8053 (90.45)  690 (7.75)  98 (1.10)  16 (0.18) 
IMR90  9043  8202 (90.70)  689 (7.62)  94 (1.04)  12 (0.13) 
K562  8801  8046 (91.42)  618 (7.02)  87 (0.99)  4 (0.05) 
KBM7  6246  5402 (86.49)  647 (10.36)  131 (2.10)  21 (0.34) 
NHEK  7726  6935 (89.76)  652 (8.44)  87 (1.13)  6 (0.08) 
To validate our results on real HiC data, we compare our estimated hierarchical TADs with RaoTAD and find that our method can recall their results across different cell lines. we find the following. First, the number of TAD boundaries (or changepoints) estimated by our approach are propotionally allocated in 23 chromosomes similar to the that in RaoTAD (all values , chisquare test, two sides). Second, for boundary matching we find () our boundaries are exactly the same as RaoTAD which is very significant by onesided hypergeometric testing (All values 10). Considering that TAD boundaries are usually not sharply defined and may shift within a certain distance (Andrey et al., 2013), we then loose the criterion for matching by allowing certain bin shift. Specifically, the boundary matching rates dramatically increase to within onebin shift, within 2bin shift, and with a 3bin shift; see the boundary matching rates in Part A of Figure 8. Third, our overlapping rates are increased with more higher order subTADs included. Restricted to only order 1 TADs, our overlapping rates are . When subTADs of all hierarchical orders are included, our overlapping rates increase to ; see panel B of Figure 8.
5.3 Integrative analysis of chromatin interactions and biological signals
To further demonstrate the advantage of our approach, we show how our method can be to delineate chromosomal organization and gene regulation by an integrative analysis of TAD structure and epigenetic markers. Histone protein help packing DNA into structual units. Specifically, boundaries of TADs are primarily associated with CTCF and Rad21 binding peaks (Dixon et al., 2012; Rao et al., 2015). To demonstrate this, we take a 2M region of chromosome 3 from cell line GM12878 to analyze the relationship of TADs organization, histone modifications, and protein binding peaks. This region can be partitioned into three parts, where an active region is flanked by two repressed ones. The active region is exhibited by biological signals like RNAseq signals, pol2 binding peaks and multiple histone modifications such as H3K4me3 and H3K27ac as shown in Figure 9. Our estimated result, which has richer hierarchical structure, is better than RaoTAD. RaoTAD even failed to find any TAD in a large subregion (chromosome 3:1070000011300000) where clear blocks can be observed from the HiC interaction heatmap; see Figure 9. Our method reports four levels of TAD organization in the active region higher than that in the repressed ones, suggesting well spatially controlling for gene expressions.
Cell Line  State  Bins  Boundaries  value 

GM12878  active  29293  2490  5.15e18 
repressive  81220  5553  
HESC  active  55345  4540  7.92e4 
repressive  13010  950  
HUVEC  active  54380  4420  4.48e12 
repressive  53832  3739  
IMR90  active  64220  5494  2.98e58 
repressive  46400  2706  
K562  active  20960  1781  3.50e9 
repressive  89932  6488  
KBM7  active  34269  2090  4.83e15 
repressive  75936  3716 
In our early study, we have estimated chromosome regions as active and repressive for multiple cell lines (Chen et al., 2016). Hence we extend the above analysis to genomewide by comparing hierarchical TAD structures between active and repressive regions in six human cell lines. We find TAD boundaries are significantly enriched in active regions as shown in Table 4
with onesided fisher exact test. For instance, among the estimated 8,200 TAD boudnaries in cell line IMR590, 5,494 boundary points are located in active regions (64,220 bins of 40k), which are almost twice of the 2,706 boundary points in repressive regions (46,400 bins of 40k). Thus, these results of genomewide analysis clearly demonstrate that the active regions of cells usually exhibited more TADs that could be contributed to gene regulations.
6 Concluding remarks
Identification of TADs and their hierarchical structures is extremely important in the study of chromatin interaction. Since identification of TADs and their hierachical strucutres in HiC profiles is essentially equivalent to multiple changepoints anlaysis of matrixvariate data, existing multiple changepoint methods for uni and multivariate data cannot be directly use. Therefore, we develop a theory of GLR tests for analysis of changepoints in HiC matrices, and study the asymptotic properties of the test statistics. We show that the maximum of the GLR test statistic converges to a maxima on a Gaussian random field with twodimensional indices. Based on these GLR tests, we then propose a topdown and bottomup procedure to estimate the hierarchical TAD structures in HiC data. Our simulation studies shown that our method can precisely estimate hierarchical structures of TADs and is robust to various levels of noises. In the analysis of seven human cell lines, we demonstrated that our method performs better than several existing approaches, and furthermore, the hierarchical TAD structures obtained by our approach are consistent with divergent signals as histone modification and ChIPseq data.
Besides deveoping GLR tests for changepoints in HiC matrices, our method also makes the following contributions. First, as our changepointbased approach provides much more detailed structures of TADs than existing methods, the estimates obtained by our approach further reveals that active chromosomal regions contain more and higher order TADs than repressive regions. This provides new biological insights of precise gene regulation within active genomic regions. Second, since structural disorder of chromatin ineractions have been widely observed at different levels, our proposed approach can be applied to the comparative analysis of TAD structures in cancer and normal HiC data. Third, besides HiC data, the proposed changepoint test and segmentation procedures can also be applied to analyze 5C data, which are another type of data produced by 3Cbased experimental techniques.
Acknowledgemnt
The authors’ research is supported by NSFDMSxxxxxx, the Cancer Prevention and Research Institute of Texas (CPRIT) grant (RP180826) and Cecil H. and Ida Green endowment fund.
Reference
References
 Albert and Chib [1993] James H. Albert and Siddhartha Chib. Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association, 88(422):669–679, 1993.
 Andrey et al. [2013] Guillaume Andrey, Thomas Montavon, Bénédicte Mascrez, Federico Gonzalez Grassi, Daan Noordermeer, Marion Leleu, Didier Trono, Francois Spitz, and Denis Duboule. A switch between topological domains underlies hoxd genes collinearity in mouse limbs. Science, 340(6137), 2013.
 Bai and Perron [1998] Jushan Bai and Pierre Perron. Estimating and testing linear models with multiple structural changes. Econometrica, 66(1):47–78, 1998.
 Bai and Perron [2003] Jushan Bai and Pierre Perron. Computation and analysis of multiple structural change models. Journal of Applied Econometrics, 18(1):1–22, 2003.
 Birge and Massart [2001] L. Birge and P Massart. A generalized cp criterion for gaussian model selection. niversités de Paris 6 and Paris 7, 647:39, 2001.

Brault et al. [2018]
Vincent Brault, Sarah Ouadah, Laure Sansonnet, and Céline LévyLeduc.
Nonparametric homogeneity tests and multiple changepoint estimation
for analyzing large hic data matrices.
Journal of Multivariate Analysis
, 165:143–165, 2018.  Broman and Speed [2002] Karl W. Broman and Terence P. Speed. A model selection approach for the identification of quantitative trait loci in experimental crosses. Journal of the Royal Statistical Society Series B, 64(4):641–656, 2002.
 Chen et al. [2016] Yong Chen, Yunfei Wang, Zhenyu Xuan, Min Chen, and Michael Zhang. De novo deciphering threedimensional chromatin interaction and topological domains by wavelet transformation of epigenetic profiles. Nucleic Acids Research, 44:gkw225, 2016.
 Chib et al. [2002] Siddhartha Chib, Federico Nardari, and Neil Shephard. Markov chain monte carlo methods for stochastic volatility models. Journal of Econometrics, 108(2):281–316, 2002.
 Davis et al. [2006] Richard A Davis, Thomas C. M Lee, and Gabriel A RodriguezYam. Structural break estimation for nonstationary time series models. Journal of the American Statistical Association, 101(473):223–239, 2006.
 Dekker et al. [2002] Job Dekker, Karsten Rippe, Martijn Dekker, and Nancy Kleckner. Capturing chromosome conformation. Science, 295(5558):1306–1311, 2002.
 Dixon et al. [2012] Jesse Dixon, Siddarth Selvaraj, Feng Yue, Audrey Kim, Yan Li, Yin Shen, Ming Hu, Jun Liu, and Bing Ren. Topological domains in mammalian genomes identified by analysis of chromatin interactions. Nature, 485:376–80, 2012.
 Dostie et al. [2006] Josée Dostie, Todd Richmond, Ramy A Arnaout, Rebecca R Selzer, William L Lee, Tracey Honan, Eric D Rubio, Anton Krumm, Justin Lamb, Chad Nusbaum, Roland D Green, and Job Dekker. Chromosome conformation capture carbon copy (5c): A massively parallel solution for mapping interactions between genomic elements. Genome research, 16(10):1299–309, 2006.
 Dudchenko et al. [2017] Olga Dudchenko, Sanjit S. Batra, Arina Omer, Sarah Kate Nyquist, Marie Hoeger, Neva Durand, Saad Shamim, Ido Machol, Eric Steven Lander, Aviva Presser Aiden, and Erez Lieberman Aiden. De novo assembly of the aedes aegypti genome using hic yields chromosomelength scaffolds. Science, 356(6333):92–95, 2017.
 Eg Sauria et al. [2015] Michael Eg Sauria, Jennifer PhillipsCremins, Victor Corces, and James Taylor. Hifive: A tool suite for easy and efficient hic and 5c data analysis. Genome biology, 16:237, 2015.
 Filippova et al. [2014] Darya Filippova, Robert Patro, Geet Duggal, and Carl Kingsford. Identification of alternative topological domains in chromatin. Algorithms for molecular biology : AMB, 9:14, 2014.
 Forcato et al. [2017] Mattia Forcato, Chiara Nicoletti, Koustav Pal, Carmen Livi, Francesco Ferrari, and Silvio Bicciato. Comparison of computational methods for hic data analysis. Nature methods, 14, 2017.
 Fudenberg et al. [2011] Geoff Fudenberg, Gad Getz, Matthew L. Meyerson, and Leonid Mirny. High order chromatin architecture shapes the landscape of chromosomal alterations in cancer. Nature biotechnology, 29:1109–13, 2011.
 Green [1995] Peter J. Green. Reversible jump markov chain monto carlo computation and bayesian model determination. Biometrika, 82(4):711–32, 1995.
 Haddad et al. [2017] Noelle Haddad, Cédric Vaillant, and Daniel Jost. Icfinder: Inferring robustly the hierarchical organization of chromatin folding. Nucleic Acids Research, 45(10):e81, 2017.
 Harchaoui and LévyLeduc [2010] Zaid Harchaoui and Céline LévyLeduc. Multiple changepoint estimation with a total variation penalty. Journal of the American Statistical Association, 105(492):1480–1493, 2010.
 Hnisz et al. [2016] Denes Hnisz, Daniel Day, and Raina Young. Insulated neighborhoods: Structural and functional units of mammalian gene control. Cell, 167(5):P1188–1200, 2016.
 Lai and Xing [2013] T. L. Lai and H Xing. Stochastic changepoint arxgarch models and their applications to econometric time series. Statistica Sinica, 23(4):1573–1594, 2013.
 Lai and Xing [2011] Tze Leung Lai and Haipeng Xing. A simple bayesian approach to multiple changepoints. Statistica Sinica, 21(2):539–569, 2011.
 Lai et al. [2008] Tze Leung Lai, Haipeng Xing, and Nancy Zhang. Stochastic segmentation models for arraybased comparative genomic hybridization data analysis. Biostatistics, 9(2):290–307, 2008.
 Lavielle [2005] Marc Lavielle. Using penalized contrasts for the changepoint problem. Signal Processing, 85:1501–1510, 08 2005.
 LiebermanAiden et al. [2009] Erez LiebermanAiden, Nynke L van Berkum, Louise Williams, Maxim Imakaev, Tobias Ragoczy, Agnes Telling, Ido Amit, Bryan Lajoie, Peter Sabo, Michael Dorschner, Richard Sandstrom, Bradley Bernstein, MA Bender, Mark Groudine, Andreas Gnirke, John A. Stamatoyannopoulos, Leonid Mirny, Eric S Lander, and Job Dekker. Comprehensive mapping of longrange interactions reveals folding principles of the human genome. Science, 326(5950):289–293, 2009.
 Liu and Lawrence [1999] Jun Liu and Charles Lawrence. Bayesian inference on biopolymer models. Bioinformatics, 15:38–52, 1999.
 Lun and Smyth [2015] Aaron Lun and Gordon Smyth. diffhic: A bioconductor package to detect differential genomic interactions in hic data. BMC bioinformatics, 16:258, 2015.
 LévyLeduc et al. [2014] Céline LévyLeduc, Maud Delattre, T MaryHuard, and Stéphane Robin. Twodimensional segmentation for analyzing hic data. Bioinformatics, 30(17):i386–i392, 2014.
 Matteson and James [2014] David S. Matteson and Nicholas A. James. A nonparametric approach for multiple change point analysis of multivariate data. Journal of the American Statistical Association, 109(505):334–345, 2014.
 Nora et al. [2012] Elphège Nora, Bryan Lajoie, Edda Schulz, Luca Giorgetti, Ikuhiro Okamoto, Nicolas Servant, Tristan Piolot, Nynke L van Berkum, Johannes Meisig, John Sedat, Joost Gribnau, Emmanuel Barillot, Nils Blüthgen, Job Dekker, and Edith Heard. Spatial partitioning of the regulatory landscape of the xinactivation centre. Nature, 485:381–5, 2012.
 Olshen et al. [2004] Adam Olshen, E.S. Venkatraman, Robert Lucito, and Michael Wigler. Circular binary segmentation for the analysis of arraybased dna copy number data. Biostatistics, 5:557–72, 11 2004.
 Perron and Qu [2007] Pierre Perron and Zhongjun Qu. Estimating and testing multiple structural changes in multivariate regressions. Econometrica, 75:459–502, 02 2007.
 PhillipsCremins et al. [2013] Jennifer PhillipsCremins, Michael E G Sauria, Amartya Sanyal, Tatiana Gerasimova, Bryan Lajoie, Joshua S K Bell, ChinTong Ong, Tracy Hookway, Changying Guo, Yuhua Sun, Michael J Bland, William Wagstaff, Stephen Dalton, Todd McDevitt, Ranjan Sen, Job Dekker, James Taylor, and Victor Corces. Architectural protein subclasses shape 3d organization of genomes during lineage commitment. Cell, 153(6):1281–95, 2013.
 Rao et al. [2015] S.S.P. Rao, M.H. Huntley, Neva Durand, E.K. Stamenova, Ivan Bochkov, J.T. Robinson, A.L. Sanborn, I Machol, Arina Omer, E.S. Lander, and E.L. Aiden. A 3d map of the human genome at kilobase resolution reveals principles of chromatin looping. Cell, 159(7):1665–80, 2015.
 Sexton et al. [2012] Tom Sexton, Eitan Yaffe, Ephraim Kenigsberg, Frédéric Bantignies, Benjamin Leblanc, Michael Hoichman, Hugues Parrinello, Amos Tanay, and Giacomo Cavalli. Threedimensional folding and functional organization principles of the drosophila genome. Cell, 148:458–72, 2012.
 Shen and Zhang [2012] Jeremy Shen and Nancy Zhang. Changepoint model on nonhomogeneous poisson processes with application in copy number profiling by nextgeneration dna sequencing. Annals of Applied Statistics, 6:476–496, 2012.
 Siegmund [2004] David Siegmund. Model selection in irregular problems: Applications to mapping quantitative trait loci. Biometrika, 91:785–800, 12 2004.
 Symmons et al. [2014] Orsolya Symmons, Veli Vural Uslu, Taro Tsujimura, Sandra Ruf, Sonya Nassari, Wibke Schwarzer, Laurence Ettwiller, and Francois Spitz. Functional and topological characteristics of mammalian regulatory domains. Genome research, 24(3):390–400, 2014.
 van de Werken et al. [2012] Harmen van de Werken, Gilad Landan, Sjoerd J B Holwerda, Michael Hoichman, Petra Klous, Ran Chachik, Erik Splinter, Christian ValdesQuezada, Yuva Oz, Britta Bouwman, Marjon Verstegen, Elzo de Wit, Amos Tanay, and Wouter de Laat. Robust 4cseq data analysis to screen for regulatory dna interactions. Nature methods, 9:969–72, 2012.
 Vostrikova [1981] L.J Vostrikova. Detecting “disorder” in multidimensional random processes. Soviet Mathematics Doklady, 24:55–59, 1981.
 Wang and Zivot [2000] Jiahui Wang and Eric Zivot. A bayesian time series model of multiple structural changes in level, trend, and variance. Journal of Business and Economic Statistics, 18(3):374–386, 2000.
 Weinreb and Raphael [2015] Caleb Weinreb and Ben Raphael. Identification of hierarchical chromatin domains. Bioinformatics, 32(11):1601–9, 2015.
 Xing and Chen [2018] H. Xing and Y Chen. Dependence of structural breaks in rating transition dynamics on economic and market variations. Review of Economics and Finance, 11:1–18, 2018.
 Xing et al. [2012a] H. Xing, Y. Mo, W. Liao, and M Zhang. Genomewide localization of proteindna binding and histone modification by bcp with chipseq data. PLoS Computational Biology, 8(7):e1002613, 2012a.
 Xing et al. [2012b] H. Xing, N. Sun, and Y Chen. Credit rating dynamics in the presence of unknown structural breaks. Journal of Banking and Finance, 36(1):78–89, 2012b.
 Yao [1984] YiChing Yao. Estimation of a noisy discretetime step function: Bayes and empirical bayes approaches. Annals of Statistics, 12(4):1434–1447, 1984.

Yao [1988]
YiChing Yao.
Estimating the number of changepoints via schwarz’ criterion.
Statistics and Probability Letters
, 6:181–189, 02 1988.  Yu et al. [2017] Wenbao Yu, Bing He, and Kai Tan. Identifying topologically associating domains and subdomains by gaussian mixture model and proportion test. Nature Communications, 8, 2017.
 Zhang and Siegmund [2007] Nancy Zhang and David Siegmund. A modified bayes information criterion with applications to the analysis of comparative genomic hybridization data. Biometrics, 63:22–32, 2007.
 Zuin et al. [2014] J. Zuin, J.R. Dixon, M.I. van der Reijden, Z. Ye, P. Kolovos, R.W. Brouwer, M.P. van de Corput, H.J. van de Werken, T.A. Knoch, W.F. van IJcken, F.G. Grosveld, B. Ren, and K.S. Wendt. Cohesin and ctcf differentially affect chromatin architecture and gene expression in human cells. Proceedings of National Academy of Science, USA, 111:996–1001, 2014.
Appendix. Proofs of Theorems 1 and 2
Proof of Theorem 1. For the assumption of Gaussian distributions, some algebra shows directly that . For the assumption of Poission distributions, consider in region under the null, . The first order Taylor expansion of around the mean is , hence . Similarly, we have first order expansion for , , and . Hence, under the null hypothesis, we have
For the assumption of negative binomial distributions, we can argue analogously. For example, for in region under the null, . The first order Taylor expansion around the mean
Comments
There are no comments yet.