Chromosomes are tightly packed in nucleus and well maintained for three-dimensional (3D) structures, which play a critical role in many genomic functions such as gene regulation, DNA replication, epigenetic modification and maintenance of genome stability. To understand the spatial information of chromosome organization, several sequencing methods have been developed to assess chromatin interactions. In particular, the chromosome conformation capture (3C) technology quantifies the frequencies of interactions between sites on a chromosome (Dekker et al., 2002). Circularized 3C (4C) technique can obtain the contact reads between one locus and all other loci on a chromosome (van de Werken et al., 2012). 3C carbon copy (5C) technology can detect all the contacts in some specific region of chromosome (Dostie et al., 2006). Hi-C technology extends 3C and is capable of generating genome-wide chromosome interaction information (Lieberman-Aiden et al., 2009). These 3C-based experimental techniques have significantly advanced our understanding of chromatin structure (Rao et al., 2015) and its relation to the regulation of gene expression, cancer translocations, copy number alterations (Fudenberg et al., 2011) and de novo genome assembly (Dudchenko et al., 2017).
The study of Hi-C data has revealed that chromosome can be partitioned into topologically associating domains (TADs) in human and mouse cells (Dixon et al., 2012; Sexton et al., 2012). As indicated by panel A in Figure 1, TADs are characterized as structural and functional units of mammalian chromosomes that engage in dynamic higher-order inter-TAD connections, playing a major role in regulatory transactions during DNA-dependent processes (Hnisz et al., 2016). As TADs are characterized by pronounced long-range associations between loci located in adjacent domains, these studies suggest that chromosomes are composed of a string of domains that are topologically separated from each other. Subsequent studies show that TADs are separated by boundaries that seem to be genetically defined (Nora et al., 2012), and that such TADs are closely related to gene regulatory functions and illustrate the long-range associations between loci on a chromosome (Rao et al., 2015; Symmons et al., 2014). Furthermore, recent biological studies (Phillips-Cremins et al., 2013; Zuin et al., 2014) show the distinct combinations of proteins such as CTCF, mediator and cohesin may also influence boundaries of TADs and sub-TADs, i.e., indicating that TADs have hierachical structures; see panel B in Figure 1.
The Hi-C technology generates read pairs corresponding to pairs of genomic loci that physically interact in the nucleus (Lieberman-Aiden et al., 2009). For a chromosome, the observed number of read pairs are used to construct a chromosomal contact map or a square matrix , where describes the frequency of interactions between position and position along the chromosome at a given resolution. By the definition of the contact matrix, , i.e., matrix is symmetric. Note that, a high interaction frequency indicates for the spatial proximity, which facilitates quantitative studies about the relationship between gene functions and genomic 3D structures. Hence, in the Hi-C contact profile, contact frequencies tend to interact more frequently with each other at loci located within a TAD, but much less frequently at loci of different TADs (Figure 1A). This indicates the Hi-C data matrix has a block-diagonal structure. Therefore, the issue of detecting diagonal blocks can be considered as a problem of estimating multiple change-points (or boundaries) on the diagonal of a matrix, and the problem of estimating hierarchical TAD structures is equivalent to the problem of making inference on hierarchical strucutures of change-points in Hi-C profiles.
As the analysis of TADs and their hierarchical structures involves the problem of estimating multiple change-points and their hierarchical strucutures, we note that various Bayesian and frequentist methods have been developed for multiple change-point analysis of uni- and multi-variate data in statistics, econometrics, and engineering over the last several decades. In particular, the Bayesian methods assume multiple change-points follow a stochastic process and solve the inference problem through Markov chain Monte Carlo (MCMC) simulations(Albert and Chib, 1993; Liu and Lawrence, 1999; Wang and Zivot, 2000; Chib et al., 2002; Green, 1995)
. Apart from simulation based Bayesian methods, exact Bayesian inference approaches with statistically efficient approximations are also developed for various multiple change-points problems in statistics, genomic data analysis, and finance(Yao, 1984; Lai et al., 2008; Lai and Xing, 2011, 2013; Xing et al., 2012a, b; Xing and Chen, 2018). Different from the Bayesian methods which use either MCMC or exact approaches, the frequentist methods provide more technical tools for multiple change-point analysis, which include the dynamic programming algorithm based on maximized likelihood (Bai and Perron, 1998, 2003; Perron and Qu, 2007), the binary segmentation procedure (Vostrikova, 1981; Olshen et al., 2004; Matteson and James, 2014), the model selection procedure based on information criteria (Yao, 1988; Siegmund, 2004; Davis et al., 2006; Zhang and Siegmund, 2007; Shen and Zhang, 2012), and the penalized likelihood or cost function approaches (Birge and Massart, 2001; Broman and Speed, 2002; Lavielle, 2005; Harchaoui and Lévy-Leduc, 2010).
Although many approaches (as discussed above) have been developed for multiple change-point analysis , none of them can be directly to analysis multiple change-points and their hierarchical structures in Hi-C profiles. The key reason is that Hi-C data are essentially a type of matrix-variate data, which can not be easily transformed into a multivariate vector and analyzed by existing methods. Recently, several statistical methods were proposed to address this issue and estimate the boundaries (or change-points) of TADs and their hierarchical structures. These methods can be roughly classified into three types. The first type of methods projects the 2-dimensional (2D) matrix into a one-dimensional (1D) vector and use a 1D test statistic to estimate TAD boundaries in the contact matrix. In particular, Dixon et al. (2012) computed a 1D directionality index
(DI) from the contact matrix, and used a hidden Markov model (HMM) to partition the genome into regions defined by changes in the DI values.Eg Sauria et al. (2015) introduced a 1D statistic to capture sudden shifts in interaction preference, however, they did not explicitly pair their estimated boundaries into domains, leaving the domain structure ambiguous. We shall note that the projection of a 2D matrix into a 1D summary statistic incurs an information loss, and hence could not provide a best estimation of change-points in Hi-C matrices. The second type of methods tries to identify TADs using the full 2D contact matrices. Lévy-Leduc et al. (2014) defined a block-wise segmentation model for the detection of TAD boundaries, and rephrased the maximization of the likelihood with respect to the block boundaries in terms of a 1D segmentation problem, for which the standard dynamic programming applies. Filippova et al. (2014)
introduced a dynamic-programming based algorithm to estimate TADs in Hi-C data matrices for a given domain-length scaling factor. Their method outputs the set of non-overlapping domains that are most robust to changes in the parameter value. Assuming that a random symmetric matrix consists of random variables whose distribution changes from block to block.Brault et al. (2018) developed a nonparametric homogeneity test to estimate locations of block boundaries (change-points) of non-overlapping blocks in the matrix. Haddad et al. (2017)
introduced a hierarchical clustering algorithm by iteratively merging close pairs of columns in Hi-C matrix. Note that both the first and second types of methods assume that TADs are non-overlapping, and ingore the hierarchical structures of TADs. The third type of methods focus on making inference on hierarchical structures of TADs, but the procedures are rather ad hoc. Specifically,Weinreb and Raphael (2015) proposed a tree-based algorithm to estimate hierarchical TADs, but the significance assesment on their estimate is missing. Yu et al. (2017)
assumed that each contact frequency is a mixture of intra-domain and inter-domain contacts and proposed to fit a two-component Gaussian mixture model to the normalized Hi-C data matrix.Yu et al. (2017)’s method provides us two-layer hierarchical TAD structure, ignoring the fact that the hierarchical TAD structure usually has layers more than two.
To overcome the weakness of existing methods for multiple change-point analysis of Hi-C matrices, we develop a theory of generalized likelihood ratio (GLR) tests for a known and a unknown change-point in Hi-C matrices, and study the asymptotic properties of the test statistics. Since the topology of change-points in a Hi-C matrice are different from that in uni- or multi-varate random variables, the test statistics from the GLR tests for Hi-C data are different from those obtained in uni- or mulit-variate change-point tests. We show that that the maximum of the GLR test statistics converges to a maxima on a Gaussian random field with two-dimensional indices. We then make use of this result and propose a top-down and bottom-up procedure to estimate the hierarchical TAD structures in Hi-C data. Specifically, by repeatedly applying the change-point test for Hi-C data, we first use a top-down method to locate all change-points (or boundaries of TADs) in a Hi-C matrix; then due to the fact that the hierarchical TADs in Hi-C matrix usually have more than one root, we apply the change-point tests again and use a bottom-up method to estimate hierarchical structures of TADs.
We then carry out two simulation studies and a real data analysis to test the performance of our method. Our first simulation study considers four different scenarios of Hi-C data and investigates the performance of the top-down method of estimating boundaries of TADs in simulated Hi-C data. In the second stimulation study, we apply the top-down and bottom-up procedure to estimate hierarchical structures in a collection of benchmark datasets, which were simulated by other research group. In a real data analysis, we use our method to analyze Hi-C data of multiple cell lines that are produced from genomic experiments on chromatin analysis. By comparing our estimation results with real hierarchical structures of TADs in those cell lines, we show how our method can accurately estimate the hierarchical TAD structures in Hi-C profiles and demonstrate how the result obtained by our approach is consistent with biological phenomenon via integrative analysis of chromatin interactions and histone modifications.
The rest of the paper is organized as follows. Section 2 specifies a model for TADs in Hi-C profiles, and develop a generalized likelihood ratio test for a change-point in Hi-C matrix. In Section 3, we propose a top-down and bottom-up approach to estimate hierarchical TADs. Simulation studies are carried out in Section 4, and real data analysis are performed in Section 5. Section 6 provides some concluding remarks.
2 A generalized likelihood rato test for change-points in Hi-C matrices
2.1 Model assumption
Let be a Hi-C matrix, where is the intensity of the interaction between positions and in the chromosome. By construction, the Hi-C data matrix is symmetric. Assume that all intensities are independent variables with distribution
Suppose that consists of non-overlapping TAD or diagonal blocks and the true locations of change-point (or block boundaries) are . For convenience, we let and . We define the region of all positions , and the th diagonal block, . For two or more consecutive diagonal blocks , we define the complement part . Then the parameters are supposed to be block-wise constant, i.e.,
An example of such a matrix with change-points is displayed in the left panel of Figure 2. For distribution , we will consider three specifications:
The Gaussian distribution (G) can be assumed for normalized Hi-C data, which are the contact frequencies after removing the effect of distance decay. Poisson distribution (P) or negative binomial distribution (NB) is usually assumed for raw HiC data. We further assume that the parameters and in distributions (G) and (NB) are constant for all . Then given distributions and locations of change-points, the log-likelihood of the data is given by
Specifically, we have, for Guassian distributions (G),
for Poisson distributions (P),
and for negative bionomial distributions (NB),
2.2 Generalized likelihood ratio tests for a single change-point in a Hi-C matrix
Suppose that the data matrix
has at most one change-point, and we consider a generalized likelihood ratio (GLR) test for the null hypothesisagainst the alternative . Suppose that, under the alternative, there exists an unknown change-point at position such that , as shown in the right panel of Figure 2. Denote , and the sum of in sets , , and , respectively, i.e.,
Provided the log-likelihood (2.3) for change-points, the logarithm of GLR can be expressed as , or equivalently,
In particular, the logarithms of the GLR statistics for Gaussian distributions (G), Poisson distributions (P), and negative binomial distributions (NB) are expressed as
where represents the cardinality of the set, so we have , , , and . Then the exact GLR test rejects the null hypothesis if the maximum of the finite sample statistic , , exceeds certain threshold, where is some integer. Asymptotically, these exact tests for Guassian distributions, Poisson distributions, and negative binomial distributions are equivalent, which can be seen from the following theorem.
where for all and , and , and are the sum of in sets , and , respectively. If the alternative hypothesis is that is a change-point, then the null should be rejected if in (2.8) exceeds certain threshold. If the location of the change-point is unknown, i.e., the alternative hypothesis is that there is a change-point between and , then an asymptotic GLR test for Gaussian distributions (G), Poisson distributions (P), and negative binomial distributions (NB) rejects the null hypothesis of no change-point if, for some , the maxima of , i.e.,
exceeds certain threshold.
To find the distribution of the test statistics , one can use Monte Carlo simulations. As an exmaple, Figure 3 shows the histogram and the density function estimated by kernal smoothing methods for and . Another way of obtaining the (tail) distribution of is to consider the asymptotic distribution of and . In particular, we consider a Gaussian random field defined on the upper triangular part of a unit square, . The random field satisfies the following properties: (1) for and ,
are normally distributed as; (2) for , are normally distributed as ; (3) for regions , , Cov. For region , we define the integral
Then the asymptotic distribution of and can be described via the following theorem.
Theorem 2. Assume that as , then the regions , and converge to regions , , and , respectively (note that ). Correspondingly,
and if ,
Therefore, at confidence level, an asymptotic GLR test with the alternative that is a change-point rejects the null hypothesis if , where Prob, and an asymptotic GLR test with the alternative that there is a change-point between and rejects the null hypothesis if , where .
3 A top-down and bottom-up approach to estimate hierarchical TADs
In this section, we propose an estimation procedure for hierarchical TADs in Hi-C profiles. The procedure consists of two steps. The first step uses a top-down method to identify multiple change-points in Hi-C matrices, and the second step uses a bottom-up approach to estimate hierarchical TADs in Hi-C profiles.
3.1 A top-down method of estimating multiple change-points
The GLR test using the test statistic (2.9) suggests a binary (or top-down) segmentation procedure to sequentially identify change-points (or block boundaries) in the Hi-C matrix. The essential idea is to test all the data to determine if there is at least one change-point and iterate the procedure to find all locally significant change-points. Assume that is the minimal size of a TAD in the Hi-C matrix, is the significance level of the GLR test, and is the upper
-quantile ofsuch that . We can use the following top-down segmentation procedure to identify all change-points in the Hi-C data matrix.
Step 1a. For , compute the GLR statistic and . If or the -value of the GLR test is smaller than , we label the location such that as a change-point and denoted as ; if there are more than one such locations, we random pick one of them as . If , stop the procedure.
Step 1b. Given the change-point , we repeat the change-point test in Step 1a for regions and and obtains change-points and , respectively.
Step 1c. We repeat Step 1b for each segmented diagonal blocks until the size of the segmented blocks become smaller than the minimal size of a TAD in the Hi-C data.
Suppose that change-points are obtained from the above procedure, ordering them as yields diagonal blocks .
We shall notice that the above procedure may ignore the similarity of two adjacent diagonal blocks segmented at different steps. Therefore one may need a pruning step to merge some adjacent diagonal blocks. To achieve this goal, we can use one of the following procedures. The first is to use Bayesian information criteria (BIC) to select the number of change-points given , where is the maximal number of change-points. Specifically, change-points in a symmetric matrix implies parameters, hence
Then the number and locations of change-points can be obtained by minimizing . Besides BIC, other information criteria can also be used. Another way is to run the GLR test to see if is a change-point for the sub-matrix
If the test doesn’t reject the null then we shall remove the change-point . Note that the purpose of a pruning step is to obtain a relatively smaller number of change-points. If the purpose is to obtain a hierarchical structure of change-points, as discussed in the next section, the pruning procedure is not necessary.
3.2 A bottom-up approach to identify hierarchical TADs
The top-down approach in Section 3.1 does not just provide us estimates of multiple change-points, the order of estimated change-points naturally indicates a nested structural relationship among segments. In fact, Matteson and James (2014) used similar idea and permutation tests to estimate hierarchical change-points with a single root in sequence of random vectors.
However, as indicated in Figures 1 and 2, hierarchical TADs in Hi-C profiles usually have multiple roots, suggesting top-down approaches are not appropriate for the inference of hierarchical structure. Therefore, we propose the following bottom-up approach, which uses the test statistic (2.8), to identify the herarchical structure of TADs in Hi-C profiles. For convenience, we assume a series of GLR tests with confidence levels for layers of hierarchical structure, where . We also denote the estimated change-points from Step 1 as and the corresponding block diagonals as , here the superscript indicates that the hierarchical structure of change-points have not been assigned. We start from .
Step 2a. For , perform the GLR test with test statistic (2.8) at significance level to conclude if is a change-point in block . Denote the test result as if the -value of the test is smaller than , and otherwise. If the sequence contains one or several consecutive 0’s, for example, , we combine blocks , and their complementary regions into a larger block , i.e.,
For change-points such that , we denote . Then we obtain the th order of the hierarchical structure. If , the procedure stops.
Step 2b. For each of the aggregated blocks in Step 2a, say , apply the GLR test with test statistic (2.8) to conclude if each of is a single change-point with significance level . Use the top-down approach in Step 1 to obtain a hierachical strcuture for change-points . This provides us a finer nested structure of TADs in .
Step 2c. Let and denote the newly obtained change-points as . Repeat Steps 2a and 2b until no new blocks are generated or the obtained number of layes exceeds , the maximum number of layers.
Figure 4 provides a schematic illustration of the bottom-up procedure. The left panel shows the estimated change-points from the top-down method with 95% confidence level (i.e., ). Then as shown by the middle panel of Figure 4, we perform the GLR test with test statistic (2.8) at 99% confidence level (i.e., ) for blocks , respectively. The -values show that . Hence we aggregate blocks 1, 2, 3, and 4 into a bigger block, , and blocks 6 and 7 into another block . In the next step, we apply the GLR test and top-down approach to obtain an inner hierarchical strucutures for and , respectively, as shown by the right panel of Figure 4. For higher layers of hierarchical structures, we repeat the above procedure for newly obtained change-points . Notice that, since the total number of estimated change-points are significantly smaller than the size of Hi-C profiles, the total number of GLR tests needed in the bottom-up procedure is also much smaller than the size of Hi-C profiles.
4 Simulation studies
We perform two simulation studies in this section. The first considers four different scenarios of Hi-C data and investigates the performance of the top-down method of estimating boundaries of TADs in simulated Hi-C data. In the second we apply the top-down and bottom-up procedure to estimate hierarchical structures in a collection of benchmark datasets, which were simulated by other research group.
4.1 Study one: Inference on multiple change-points
The purpose of this study is to assess the performance of the top-down procedure of estimating multiple change-points in Hi-C matrices. Let . Recall that is the number of change-points along the diagonal, the (raw or transformed) contact frequencies follow the distribution (2.1), and the parameters are determined by (2.2). We then consider the following four scenarios for Hi-C matrices.
Evenly spaced change-points with Gaussian distributions. Assume that , and each of 25 diagonal blocks have the same size, i.e., , . Following Dixon et al. (2012), we specify parameters and observation as follows. Assume that Gamma(4, 18) for , , and for . The values of are generated by the following distributions.
Unevenly spaced change-points with Gaussian distributions. assume that , change-point locations are discrete uniformly drawn with smallest block size at least 5. The values of and are generated in the same way as in (S1).
Evenly spaced change-points with Poisson and negative binomial distributions. Assume that , for , and Gamma(4, 18). To generate count data for , we use a negative binomial model for Hi-C data in Lun and Smyth (2015).
Note that, in the complementary region follows a mixture of a point mass and a negative binomial distribution.
Unevenly spaced change-points with Poisson and negative binomial distributions. The locations of change-points are generated in the same as (S2). Given all change-points, the values of are are generated in the same way as in (S3).
Among these four scenarios, the profiles of change-points in (S1) and (S3) are relatively simpler than those in (S2) and (S4). In Scenarios (S3) and (S4), provides us a NB distribution with mean
and variance, where the parameter , referred as the biological coefficient of variation (BCV) by Lun and Smyth (2015), varies from 0 to . Hence we use in (S3) and (S4). Note that corresponds to the case that follow a Poisson distribution with mean . To specify or the level of noise in (S1) and (S2), we note that a Gamma(4, 18) distribution has mean 72 for parameters , we then choose the value of such that . Then we use and , corresponding to four values of .
We simulate 1000 Hi-C matrices for each scenario, and use the top-down approach with significance levels and in Section 3.1 to estimate the numbers and locations of change-points, together with parameters . For (S1) and (S2), we assume that only follow Gaussian distributions and hence the GLR test is based on test statistics (2.5). For (S3) and (S4), we assume that follows Gaussian, Poisson, and negative binomial distributions, respectively, and hence the GLR tests for change-points are based on test statistics (2.5), (2.6), and (2.7), respectively. We use four measures to evaluate the performance of change-point estimation. Besides the estimated number of change-points (denoted as ), we also compute the mean square error for estimated parameters (i.e., MSE), true positive rates (TPR) and false discovery rates (FDR) for the number of change-points.
|.17 (.01)||0.75 (.09)||1 (0)||.0068 (5e-4)||.03 (.00)||0.14 (.04)||1 (0)||.001 (2e-4)|
|(S1)||G||.17 (.01)||0.69 (.07)||1 (0)||.0068 (5e-4)||.03 (.01)||0.13 (.03)||1 (0)||.0011 (2e-4)|
|.17 (.01)||0.79 (.08)||1 (0)||.0065 (5e-4)||.03 (.01)||0.15 (.03)||1 (0)||.0011 (2e-4)|
|.17 (.01)||0.74 (.08)||1.000 (4e-5)||.0068 (5e-4)||.02 (.00)||0.12 (.02)||1.000 (4e-5)||.0010 (2e-4)|
|.15 (.02)||1.56 (.27)||1 (0)||.0046 (4e-4)||.01 (.00)||0.14 (.10)||1 (0)||.0003 (9e-5)|
|(S2)||G||.13 (.01)||1.23 (.17)||1 (0)||.0041 (4e-4)||.02 (.00)||0.16 (.05)||1 (0)||.0005 (1e-4)|
|.15 (.01)||1.32 (.19)||1 (0)||.0047 (4e-4)||.03 (.00)||0.25 (.10)||1 (0)||.0008 (2e-4)|
|.13 (.01)||1.37 (.22)||.9998 (9e-5)||.0043 (4e-4)||.02 (.00)||0.30 (.07)||.9998 (9e-5)||.0008 (2e-4)|
|P||.17 (.01)||0.90 (.11)||1.0000 (7e-5)||.0070 (6e-4)||.03 (.01)||0.20 (.04)||1.0000 (7e-5)||.0014 (3e-4)|
|NB||.19 (.01)||0.84 (.09)||.9998 (8e-5)||.0077 (6e-4)||.03 (.01)||0.15 (.03)||.9998 (8e-5)||.0013 (3e-4)|
|G||.18 (.01)||0.81 (.09)||1 (0)||.0071 (5e-4)||.03 (.01)||0.21 (.05)||1 (0)||.0013 (2e-4)|
|P||.17 (.01)||0.90 (.10)||1.0000 (4e-5)||.0069 (5e-4)||.04 (.01)||0.19 (.04)||1.0000 (4e-5)||.0015 (3e-4)|
|(S3)||NB||.19 (.01)||0.96 (.10)||.9995 (1e-4)||.0078 (6e-4)||.04 (.01)||0.31 (.06)||.9995 (1e-4)||.0020 (3e-4)|
|G||.19 (.01)||0.95 (.09)||.9999 (6e-5)||.0077 (6e-4)||.03 (.01)||0.22 (.04)||.9999 (6e-5)||.0014 (3e-4)|
|P||.17 (.01)||0.89 (.10)||.9998 (1e-4)||.0073 (6e-4)||.02 (.00)||0.12 (.02)||.9998 (1e-4)||.0012 (3e-4)|
|NB||.17 (.01)||0.86 (.10)||.9997 (1e-4)||.0068 (6e-4)||.03 (.01)||0.18 (.05)||.9997 (1e-4)||.0014 (3e-4)|
|G||.19 (.01)||0.81 (.08)||1 (0)||.0074 (6e-4)||.03 (.01)||0.13 (.02)||1 (0)||.0010 (2e-4)|
|P||.15 (.01)||0.82 (.09)||.9999 (7e-5)||.0062 (5e-4)||.02 (.00)||0.14 (.03)||.9999 (7e-5)||.0008 (2e-4)|
|NB||.17 (.01)||0.77 (.09)||.9997 (1e-4)||.0070 (6e-4)||.03 (.01)||0.22 (.05)||.9997 (1e-4)||.0016 (3e-4)|
|G||.15 (.01)||0.68 (.07)||.9999 (6e-5)||.0061 (5e-4)||.02 (.00)||0.17 (.03)||.9999 (6e-5)||.0010 (2e-4)|
|P||.15 (.01)||1.38 (.20)||.9998 (9e-5)||.0050 (4e-4)||.03 (.01)||0.29 (.08)||.9998 (9e-5)||.0010 (2e-4)|
|NB||.16 (.01)||1.19 (.18)||.9993 (2e-4)||.0056 (5e-4)||.03 (.01)||0.31 (.08)||.9993 (2e-4)||.0015 (3e-4)|
|G||.14 (.01)||1.54 (.26)||.9999 (7e-5)||.0045 (4e-4)||.02 (.00)||0.25 (.10)||.9999 (7e-5)||.0007 (2e-4)|
|P||.15 (.01)||1.39 (.19)||.9998 (8e-5)||.0047 (4e-4)||.02 (.00)||0.23 (.05)||.9998 (8e-5)||.0010 (2e-4)|
|(S4)||NB||.15 (.01)||1.18 (.19)||.9995 (1e-4)||.0050 (5e-4)||.03 (.01)||0.26 (.06)||.9995 (1e-4)||.0013 (2e-4)|
|G||.16 (.01)||1.50 (.22)||.9998 (9e-5)||.0050 (4e-4)||.03 (.01)||0.18 (.05)||.9998 (9e-5)||.0010 (2e-4)|
|P||.12 (.01)||1.22 (.27)||.9995 (1e-4)||.0042 (4e-4)||.03 (.01)||0.57 (.25)||.9995 (1e-4)||.0013 (2e-4)|
|NB||.14 (.01)||1.38 (.21)||.9989 (2e-4)||.0055 (5e-4)||.03 (.01)||0.22 (.04)||.9989 (2e-4)||.0020 (3e-4)|
|G||.15 (.01)||1.63 (.28)||.9998 (7e-5)||.0048 (4e-4)||.02 (.01)||0.46 (.16)||.9998 (7e-5)||.0010 (2e-4)|
|P||.16 (.01)||1.42 (.20)||.9994 (1e-4)||.0057 (4e-4)||.04 (.01)||0.41 (.11)||.9994 (1e-4)||.0017 (3e-4)|
|NB||.14 (.01)||1.54 (.36)||.9989 (2e-4)||.0053 (4e-4)||.03 (.01)||0.25 (.04)||.9989 (2e-4)||.0020 (3e-4)|
|G||.14 (.01)||1.49 (.25)||.9998 (7e-5)||.0055 (4e-4)||.03 (.01)||0.35 (.14)||.9998 (7e-5)||.0010 (2e-4)|
Table 1 summarizes the means and standard errors (in parentheses) of the four measurements for change-point estimation in scenarios (S1)-(S4). The difference between estimated and true number of change-points in all scenariois have average less then one. Compared to the noise level in the simulation, all MSE are quite small. All TPRs are close to one, and all FDRs range from 0.001 to 0.008, indicating the estimation for change-points is satisfactory.
4.2 Study two: Inference on hierarchical TADs
To study whether our procedure of estimating hierarchical organization of TADs performs well or not, we use the Hi-C datasets simulated and released in Forcato et al. (2017). There are two main reasons that their simulated datasets shall be used. Forcato et al. (2017) used quasi-negative-binomial generator modified from Lun and Smyth (2015) with each specifically designed, which approximates the real Hi-C data better. Second, Forcato et al. (2017) had compared several existing segmentation methods in their study, using the same datasets facilitates the comparison of our method to others. The simulated datasets in Forcato et al. (2017) consist of the following two types of specifications for Hi-C matrices.
Large size matrices with unnested TADs. This dataset consists of 25 simulated Hi-C matrices with noise levels 4%, 8%, 12%, 16%, and 20%. These simulated matrices contain no nested TAD structures, and each simulated Hi-C matrix has size around 4500 with 171 diagonal blocks (or TADs).
Large size matrices with nested TADs. This dataset consists of 25 simulated Hi-C matrices with noise levels 4%, 8%, 12%, 16%, and 20%. Different from (S5), these simulated matrices contain nested TAD structures. In particular, each simulated profile has size around 4,500 and 910 diagonal blocks (or TADs) with three layers of hierarchical structure.
Since matrices are simulated as approximations to the raw data produced from the sequencing experiment, the elements in follow the power law indicated by Lieberman-Aiden et al. (2009). Hence, before applying our method to estimate change-points and hierarchical TAD structures, we follow Lieberman-Aiden et al. (2009) and transform the matrix to the matrix by , where the scaling parameter can be estimated from the data. For transformed matrices in (S5), we apply the top-down method in Section 3.1 with GLR test statistic (2.7) and significance level . To estimate the hierarchical structures in (S6), we use the top-down and bottom-up procedure in Sections 3.1 and 3.2 with GLR test statistic (2.7) and significance levels . Note that the significant levels and here serve as tuning parameters of change-points. In these studies, we assume that all follow Gaussian (G), Poisson (P), and negative binomial (NB) distributions.
|-0.2 (0.20)||1.6 (0.40)||18.6 (2.89)||83.4 (2.18)||124.2 (9.30)|
|P||TPR||.9918 (.0040)||.9847 (.0040)||.9706 (.0032)||.9377 (.0069)||.8294 (.0428)|
|FDR||.0070 (.0043)||.0243 (.0042)||.1235 (.0155)||.3692 (.0093)||.5195 (.0168)|
|0 (0.32)||2 (0.45)||18.6 (2.89)||82.8 (3.07)||123.8 (9.65)|
|(S5)||NB||TPR||.9918 (.0030)||.9836 (.0047)||.9659 (.0047)||.9247 (.0137)||.8259 (.0454)|
|FDR||.0082 (.0044)||.0277 (.0049)||.1278 (.0153)||.3763 (.0120)||.5210 (.0178)|
|-0.2 (0.20)||1.8 (0.37)||19.4 (3.22)||82.6 (2.11)||121.6 (10.67)|
|G||TPR||.9988 (.0012)||.9930 (.0022)||.9812 (.0039)||.9459 (.0087)||.8271 (.0478)|
|FDR||.0000 (0)||.0173 (.0031)||.1176 (.0163)||.3618 (.0086)||.5169 (.0176)|
|P||TPR||.8368 (.0087)||.7458 (.0091)||.6537 (.0059)||.5309 (.0111)||.3524 (.0128)|
|FDR||.0671 (.0044)||.0974 (.0068)||.1355 (.0038)||.1877 (.0080)||.2961 (.0107)|
|(S6)||NB||TPR||.8338 (.0093)||.7448 (.0107)||.6532 (.0048)||.5289 (.0093)||.3565 (.0135)|
|FDR||.0716 (.0050)||.0998 (.0056)||.1344 (.0050)||.1865 (.0048)||.2986 (.0104)|
|G||TPR||.8450 (.0070)||.7570 (.0090)||.6568 (.0044)||.5289 (.0118)||.3535 (.0125)|
|FDR||.0590 (.0050)||.0867 (.0058)||.1343 (.0043)||.1928 (.0066)||.3013 (.0139)|
For matrices in (S5), we compute the difference between estimated and true numbers of change-points , the TPR and FDR. These results, together with the standard errors, are presented in the upper panel of Table 2. Similar information on other methods can be found in Forcato et al. (2017). For comparison purpose we quote their testing results for TADtree (Weinreb and Raphael, 2015) and HiCseg (Lévy-Leduc et al., 2014) along with our method and IC-Finder (Haddad et al., 2017) in Figure 5. We find that MuChPoint (Brault et al., 2018) is not applicable to large Hi-C datasets of this size so their method is not presented here. Our estimations for change-points or boundaries of TADs are very accurate under moderate noise levels. Even in extremely high noise our performance is still reasonable.
For matrices in (S6), we follow Forcato et al. (2017) and present TPR and FDR for estimated change-points. These measures indicate the capacity of estimating change points at the bottom layer. The lower panel of Table 2 summarizes the TPRs and FDRs of the change-points in the bottom layer estimated by our top-down approach. For comparison purpose we quote the testing results for TADtree and HiCseg in Forcato et al. (2017) along with our method and IC-Finder, in Figure 6. We find the following advantages of our method. First, for fair levels of noise we obtain higher TPR and lower FDR. Actually, no other method can generate TPR higher than 0.75. Second, even under high noise our FDR is smaller than other approaches. While our method cannot differentiate TADs in bottom layer from their top (or root) layer in high noise, when boundaries between TADs and the complementary region are blurred. This explains our lower TPR in high noise level. As for overall performance, we excel in this scenario.
To have a better idea on the performance of our method, we show in Figure 7 two samples from (S6) with noise level . The right case in Figure 7 was also tested in Forcato et al. (2017) so we quote their results for TADtree and HiCseg along with some other methods. Some of the approaches can idenfify top (or root) layer TADs, but only our method and TADtree capture hierarchical structures. Plus we are much better at detecting bottom layer blocks and achieve less false detection.
5 Real data analysis
5.1 Hi-C datasets
We use the procedures proposed in Section 3 to estimate hierarchical TAD structures in real data. In particular, we study the high resoluation in situ Hi-C data of seven cell lines produced by Rao et al. (2015), which can be downloaded from the Gene Expression Omnibus (GEO) database (http://www.ncbi.nlm.nih.gov/geo/) with the access number of GSE63525. These seven cell lines are GM12878, HMEC, HUVEC, IMR90, K562, KBM7, and NHEK. Rao et al. (2015) used a segmentation algorithm, Arrowhead, to analyze their data and we denote their estimated TAD structures as Rao-TAD in the rest of the paper. To compare our results with Rao-TAD, we define the following measures:
Boundary matching rates. If a boundary (change-point) of Rao-TAD and the closest among ours are within certain distance (number of bins in Hi-C matrix), we take them as matched. Then the boundary matching rate is defined as the proportion of matched boundaries among the total number of boundaries in Rao-TAD. We further perform a hypergeometric test to measure the significance of estimated TADs by our approach consisting of estimated TADs in Rao-TAD. In particular, we we treat the set of all bins (i.e., all columns or rows of the Hi-C matrix) as a population and the estimates of Rao-TAD and our methods as successes and samples, respectively.
Overlapping rates. We consider the overlapping rate of TAD bodies. Denote as th Rao-TAD, and as the TAD we estimated that share the longest overlapping region with . The overlapping rate is the proportion of overlapping region over the minimal length of and . Then we average that of all Rao-TADs to get the overlapping rate for a whole cell line.
5.2 Identifying hierarchical architecture of chromosome organization
Following Weinreb and Raphael (2015), we define the root of a TAD as order 1. If the root has sublevel TADs, the the two sub-TADs are considered as order 2; similarly, we define sub-sub-TADs as order 3, and so on. Applying the top-down and bottom-up approach proposed in Section 3 with , we analyze the high resolution in situ Hi-C data of seven cell lines. Table 3 show total number of TADs we find in each order level. We find that, although the highest order TAD we find is up to order 7, most of them are exhibited in orders 1 and 2 . This indicates that TAD hierarchical structures are mainly consists of order 1 TADs and maybe their one level of sub-TADs (usually corresponding to CTCF loops).
|Cell line||TADs||order 1||order 2||order 3||order 4|
|GM12878||8586||7743 (90.18)||687 (8.00)||96 (1.12)||14 (0.16)|
|HMEC||8200||7072 (86.24)||869 (10.60)||183 (2.23)||31 (0.38)|
|HUVEC||8903||8053 (90.45)||690 (7.75)||98 (1.10)||16 (0.18)|
|IMR90||9043||8202 (90.70)||689 (7.62)||94 (1.04)||12 (0.13)|
|K562||8801||8046 (91.42)||618 (7.02)||87 (0.99)||4 (0.05)|
|KBM7||6246||5402 (86.49)||647 (10.36)||131 (2.10)||21 (0.34)|
|NHEK||7726||6935 (89.76)||652 (8.44)||87 (1.13)||6 (0.08)|
To validate our results on real Hi-C data, we compare our estimated hierarchical TADs with Rao-TAD and find that our method can recall their results across different cell lines. we find the following. First, the number of TAD boundaries (or change-points) estimated by our approach are propotionally allocated in 23 chromosomes similar to the that in Rao-TAD (all -values , chi-square test, two sides). Second, for boundary matching we find () our boundaries are exactly the same as Rao-TAD which is very significant by one-sided hypergeometric testing (All -values -10). Considering that TAD boundaries are usually not sharply defined and may shift within a certain distance (Andrey et al., 2013), we then loose the criterion for matching by allowing certain bin shift. Specifically, the boundary matching rates dramatically increase to within one-bin shift, within 2-bin shift, and with a 3-bin shift; see the boundary matching rates in Part A of Figure 8. Third, our overlapping rates are increased with more higher order sub-TADs included. Restricted to only order 1 TADs, our overlapping rates are . When sub-TADs of all hierarchical orders are included, our overlapping rates increase to ; see panel B of Figure 8.
5.3 Integrative analysis of chromatin interactions and biological signals
To further demonstrate the advantage of our approach, we show how our method can be to delineate chromosomal organization and gene regulation by an integrative analysis of TAD structure and epigenetic markers. Histone protein help packing DNA into structual units. Specifically, boundaries of TADs are primarily associated with CTCF and Rad21 binding peaks (Dixon et al., 2012; Rao et al., 2015). To demonstrate this, we take a 2M region of chromosome 3 from cell line GM12878 to analyze the relationship of TADs organization, histone modifications, and protein binding peaks. This region can be partitioned into three parts, where an active region is flanked by two repressed ones. The active region is exhibited by biological signals like RNA-seq signals, pol2 binding peaks and multiple histone modifications such as H3K4me3 and H3K27ac as shown in Figure 9. Our estimated result, which has richer hierarchical structure, is better than Rao-TAD. Rao-TAD even failed to find any TAD in a large sub-region (chromosome 3:10700000-11300000) where clear blocks can be observed from the Hi-C interaction heatmap; see Figure 9. Our method reports four levels of TAD organization in the active region higher than that in the repressed ones, suggesting well spatially controlling for gene expressions.
In our early study, we have estimated chromosome regions as active and repressive for multiple cell lines (Chen et al., 2016). Hence we extend the above analysis to genome-wide by comparing hierarchical TAD structures between active and repressive regions in six human cell lines. We find TAD boundaries are significantly enriched in active regions as shown in Table 4
with one-sided fisher exact test. For instance, among the estimated 8,200 TAD boudnaries in cell line IMR590, 5,494 boundary points are located in active regions (64,220 bins of 40k), which are almost twice of the 2,706 boundary points in repressive regions (46,400 bins of 40k). Thus, these results of genome-wide analysis clearly demonstrate that the active regions of cells usually exhibited more TADs that could be contributed to gene regulations.
6 Concluding remarks
Identification of TADs and their hierarchical structures is extremely important in the study of chromatin interaction. Since identification of TADs and their hierachical strucutres in Hi-C profiles is essentially equivalent to multiple change-points anlaysis of matrix-variate data, existing multiple change-point methods for uni- and multi-variate data cannot be directly use. Therefore, we develop a theory of GLR tests for analysis of change-points in Hi-C matrices, and study the asymptotic properties of the test statistics. We show that the maximum of the GLR test statistic converges to a maxima on a Gaussian random field with two-dimensional indices. Based on these GLR tests, we then propose a top-down and bottom-up procedure to estimate the hierarchical TAD structures in Hi-C data. Our simulation studies shown that our method can precisely estimate hierarchical structures of TADs and is robust to various levels of noises. In the analysis of seven human cell lines, we demonstrated that our method performs better than several existing approaches, and furthermore, the hierarchical TAD structures obtained by our approach are consistent with divergent signals as histone modification and ChIP-seq data.
Besides deveoping GLR tests for change-points in Hi-C matrices, our method also makes the following contributions. First, as our change-point-based approach provides much more detailed structures of TADs than existing methods, the estimates obtained by our approach further reveals that active chromosomal regions contain more and higher order TADs than repressive regions. This provides new biological insights of precise gene regulation within active genomic regions. Second, since structural disorder of chromatin ineractions have been widely observed at different levels, our proposed approach can be applied to the comparative analysis of TAD structures in cancer and normal Hi-C data. Third, besides Hi-C data, the proposed change-point test and segmentation procedures can also be applied to analyze 5C data, which are another type of data produced by 3C-based experimental techniques.
The authors’ research is supported by NSF-DMSxxxxxx, the Cancer Prevention and Research Institute of Texas (CPRIT) grant (RP180826) and Cecil H. and Ida Green endowment fund.
- Albert and Chib  James H. Albert and Siddhartha Chib. Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association, 88(422):669–679, 1993.
- Andrey et al.  Guillaume Andrey, Thomas Montavon, Bénédicte Mascrez, Federico Gonzalez Grassi, Daan Noordermeer, Marion Leleu, Didier Trono, Francois Spitz, and Denis Duboule. A switch between topological domains underlies hoxd genes collinearity in mouse limbs. Science, 340(6137), 2013.
- Bai and Perron  Jushan Bai and Pierre Perron. Estimating and testing linear models with multiple structural changes. Econometrica, 66(1):47–78, 1998.
- Bai and Perron  Jushan Bai and Pierre Perron. Computation and analysis of multiple structural change models. Journal of Applied Econometrics, 18(1):1–22, 2003.
- Birge and Massart  L. Birge and P Massart. A generalized cp criterion for gaussian model selection. niversités de Paris 6 and Paris 7, 647:39, 2001.
Brault et al. 
Vincent Brault, Sarah Ouadah, Laure Sansonnet, and Céline Lévy-Leduc.
Nonparametric homogeneity tests and multiple change-point estimation
for analyzing large hi-c data matrices.
Journal of Multivariate Analysis, 165:143–165, 2018.
- Broman and Speed  Karl W. Broman and Terence P. Speed. A model selection approach for the identification of quantitative trait loci in experimental crosses. Journal of the Royal Statistical Society Series B, 64(4):641–656, 2002.
- Chen et al.  Yong Chen, Yunfei Wang, Zhenyu Xuan, Min Chen, and Michael Zhang. De novo deciphering three-dimensional chromatin interaction and topological domains by wavelet transformation of epigenetic profiles. Nucleic Acids Research, 44:gkw225, 2016.
- Chib et al.  Siddhartha Chib, Federico Nardari, and Neil Shephard. Markov chain monte carlo methods for stochastic volatility models. Journal of Econometrics, 108(2):281–316, 2002.
- Davis et al.  Richard A Davis, Thomas C. M Lee, and Gabriel A Rodriguez-Yam. Structural break estimation for nonstationary time series models. Journal of the American Statistical Association, 101(473):223–239, 2006.
- Dekker et al.  Job Dekker, Karsten Rippe, Martijn Dekker, and Nancy Kleckner. Capturing chromosome conformation. Science, 295(5558):1306–1311, 2002.
- Dixon et al.  Jesse Dixon, Siddarth Selvaraj, Feng Yue, Audrey Kim, Yan Li, Yin Shen, Ming Hu, Jun Liu, and Bing Ren. Topological domains in mammalian genomes identified by analysis of chromatin interactions. Nature, 485:376–80, 2012.
- Dostie et al.  Josée Dostie, Todd Richmond, Ramy A Arnaout, Rebecca R Selzer, William L Lee, Tracey Honan, Eric D Rubio, Anton Krumm, Justin Lamb, Chad Nusbaum, Roland D Green, and Job Dekker. Chromosome conformation capture carbon copy (5c): A massively parallel solution for mapping interactions between genomic elements. Genome research, 16(10):1299–309, 2006.
- Dudchenko et al.  Olga Dudchenko, Sanjit S. Batra, Arina Omer, Sarah Kate Nyquist, Marie Hoeger, Neva Durand, Saad Shamim, Ido Machol, Eric Steven Lander, Aviva Presser Aiden, and Erez Lieberman Aiden. De novo assembly of the aedes aegypti genome using hi-c yields chromosome-length scaffolds. Science, 356(6333):92–95, 2017.
- Eg Sauria et al.  Michael Eg Sauria, Jennifer Phillips-Cremins, Victor Corces, and James Taylor. Hifive: A tool suite for easy and efficient hic and 5c data analysis. Genome biology, 16:237, 2015.
- Filippova et al.  Darya Filippova, Robert Patro, Geet Duggal, and Carl Kingsford. Identification of alternative topological domains in chromatin. Algorithms for molecular biology : AMB, 9:14, 2014.
- Forcato et al.  Mattia Forcato, Chiara Nicoletti, Koustav Pal, Carmen Livi, Francesco Ferrari, and Silvio Bicciato. Comparison of computational methods for hi-c data analysis. Nature methods, 14, 2017.
- Fudenberg et al.  Geoff Fudenberg, Gad Getz, Matthew L. Meyerson, and Leonid Mirny. High order chromatin architecture shapes the landscape of chromosomal alterations in cancer. Nature biotechnology, 29:1109–13, 2011.
- Green  Peter J. Green. Reversible jump markov chain monto carlo computation and bayesian model determination. Biometrika, 82(4):711–32, 1995.
- Haddad et al.  Noelle Haddad, Cédric Vaillant, and Daniel Jost. Ic-finder: Inferring robustly the hierarchical organization of chromatin folding. Nucleic Acids Research, 45(10):e81, 2017.
- Harchaoui and Lévy-Leduc  Zaid Harchaoui and Céline Lévy-Leduc. Multiple change-point estimation with a total variation penalty. Journal of the American Statistical Association, 105(492):1480–1493, 2010.
- Hnisz et al.  Denes Hnisz, Daniel Day, and Raina Young. Insulated neighborhoods: Structural and functional units of mammalian gene control. Cell, 167(5):P1188–1200, 2016.
- Lai and Xing  T. L. Lai and H Xing. Stochastic change-point arx-garch models and their applications to econometric time series. Statistica Sinica, 23(4):1573–1594, 2013.
- Lai and Xing  Tze Leung Lai and Haipeng Xing. A simple bayesian approach to multiple change-points. Statistica Sinica, 21(2):539–569, 2011.
- Lai et al.  Tze Leung Lai, Haipeng Xing, and Nancy Zhang. Stochastic segmentation models for array-based comparative genomic hybridization data analysis. Biostatistics, 9(2):290–307, 2008.
- Lavielle  Marc Lavielle. Using penalized contrasts for the change-point problem. Signal Processing, 85:1501–1510, 08 2005.
- Lieberman-Aiden et al.  Erez Lieberman-Aiden, Nynke L van Berkum, Louise Williams, Maxim Imakaev, Tobias Ragoczy, Agnes Telling, Ido Amit, Bryan Lajoie, Peter Sabo, Michael Dorschner, Richard Sandstrom, Bradley Bernstein, MA Bender, Mark Groudine, Andreas Gnirke, John A. Stamatoyannopoulos, Leonid Mirny, Eric S Lander, and Job Dekker. Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science, 326(5950):289–293, 2009.
- Liu and Lawrence  Jun Liu and Charles Lawrence. Bayesian inference on biopolymer models. Bioinformatics, 15:38–52, 1999.
- Lun and Smyth  Aaron Lun and Gordon Smyth. diffhic: A bioconductor package to detect differential genomic interactions in hi-c data. BMC bioinformatics, 16:258, 2015.
- Lévy-Leduc et al.  Céline Lévy-Leduc, Maud Delattre, T Mary-Huard, and Stéphane Robin. Two-dimensional segmentation for analyzing hi-c data. Bioinformatics, 30(17):i386–i392, 2014.
- Matteson and James  David S. Matteson and Nicholas A. James. A nonparametric approach for multiple change point analysis of multivariate data. Journal of the American Statistical Association, 109(505):334–345, 2014.
- Nora et al.  Elphège Nora, Bryan Lajoie, Edda Schulz, Luca Giorgetti, Ikuhiro Okamoto, Nicolas Servant, Tristan Piolot, Nynke L van Berkum, Johannes Meisig, John Sedat, Joost Gribnau, Emmanuel Barillot, Nils Blüthgen, Job Dekker, and Edith Heard. Spatial partitioning of the regulatory landscape of the x-inactivation centre. Nature, 485:381–5, 2012.
- Olshen et al.  Adam Olshen, E.S. Venkatraman, Robert Lucito, and Michael Wigler. Circular binary segmentation for the analysis of array-based dna copy number data. Biostatistics, 5:557–72, 11 2004.
- Perron and Qu  Pierre Perron and Zhongjun Qu. Estimating and testing multiple structural changes in multivariate regressions. Econometrica, 75:459–502, 02 2007.
- Phillips-Cremins et al.  Jennifer Phillips-Cremins, Michael E G Sauria, Amartya Sanyal, Tatiana Gerasimova, Bryan Lajoie, Joshua S K Bell, Chin-Tong Ong, Tracy Hookway, Changying Guo, Yuhua Sun, Michael J Bland, William Wagstaff, Stephen Dalton, Todd McDevitt, Ranjan Sen, Job Dekker, James Taylor, and Victor Corces. Architectural protein subclasses shape 3d organization of genomes during lineage commitment. Cell, 153(6):1281–95, 2013.
- Rao et al.  S.S.P. Rao, M.H. Huntley, Neva Durand, E.K. Stamenova, Ivan Bochkov, J.T. Robinson, A.L. Sanborn, I Machol, Arina Omer, E.S. Lander, and E.L. Aiden. A 3d map of the human genome at kilobase resolution reveals principles of chromatin looping. Cell, 159(7):1665–80, 2015.
- Sexton et al.  Tom Sexton, Eitan Yaffe, Ephraim Kenigsberg, Frédéric Bantignies, Benjamin Leblanc, Michael Hoichman, Hugues Parrinello, Amos Tanay, and Giacomo Cavalli. Three-dimensional folding and functional organization principles of the drosophila genome. Cell, 148:458–72, 2012.
- Shen and Zhang  Jeremy Shen and Nancy Zhang. Change-point model on nonhomogeneous poisson processes with application in copy number profiling by next-generation dna sequencing. Annals of Applied Statistics, 6:476–496, 2012.
- Siegmund  David Siegmund. Model selection in irregular problems: Applications to mapping quantitative trait loci. Biometrika, 91:785–800, 12 2004.
- Symmons et al.  Orsolya Symmons, Veli Vural Uslu, Taro Tsujimura, Sandra Ruf, Sonya Nassari, Wibke Schwarzer, Laurence Ettwiller, and Francois Spitz. Functional and topological characteristics of mammalian regulatory domains. Genome research, 24(3):390–400, 2014.
- van de Werken et al.  Harmen van de Werken, Gilad Landan, Sjoerd J B Holwerda, Michael Hoichman, Petra Klous, Ran Chachik, Erik Splinter, Christian Valdes-Quezada, Yuva Oz, Britta Bouwman, Marjon Verstegen, Elzo de Wit, Amos Tanay, and Wouter de Laat. Robust 4c-seq data analysis to screen for regulatory dna interactions. Nature methods, 9:969–72, 2012.
- Vostrikova  L.J Vostrikova. Detecting “disorder” in multidimensional random processes. Soviet Mathematics Doklady, 24:55–59, 1981.
- Wang and Zivot  Jiahui Wang and Eric Zivot. A bayesian time series model of multiple structural changes in level, trend, and variance. Journal of Business and Economic Statistics, 18(3):374–386, 2000.
- Weinreb and Raphael  Caleb Weinreb and Ben Raphael. Identification of hierarchical chromatin domains. Bioinformatics, 32(11):1601–9, 2015.
- Xing and Chen  H. Xing and Y Chen. Dependence of structural breaks in rating transition dynamics on economic and market variations. Review of Economics and Finance, 11:1–18, 2018.
- Xing et al. [2012a] H. Xing, Y. Mo, W. Liao, and M Zhang. Genomewide localization of protein-dna binding and histone modification by bcp with chip-seq data. PLoS Computational Biology, 8(7):e1002613, 2012a.
- Xing et al. [2012b] H. Xing, N. Sun, and Y Chen. Credit rating dynamics in the presence of unknown structural breaks. Journal of Banking and Finance, 36(1):78–89, 2012b.
- Yao  Yi-Ching Yao. Estimation of a noisy discrete-time step function: Bayes and empirical bayes approaches. Annals of Statistics, 12(4):1434–1447, 1984.
Estimating the number of change-points via schwarz’ criterion.
Statistics and Probability Letters, 6:181–189, 02 1988.
- Yu et al.  Wenbao Yu, Bing He, and Kai Tan. Identifying topologically associating domains and subdomains by gaussian mixture model and proportion test. Nature Communications, 8, 2017.
- Zhang and Siegmund  Nancy Zhang and David Siegmund. A modified bayes information criterion with applications to the analysis of comparative genomic hybridization data. Biometrics, 63:22–32, 2007.
- Zuin et al.  J. Zuin, J.R. Dixon, M.I. van der Reijden, Z. Ye, P. Kolovos, R.W. Brouwer, M.P. van de Corput, H.J. van de Werken, T.A. Knoch, W.F. van IJcken, F.G. Grosveld, B. Ren, and K.S. Wendt. Cohesin and ctcf differentially affect chromatin architecture and gene expression in human cells. Proceedings of National Academy of Science, USA, 111:996–1001, 2014.
Appendix. Proofs of Theorems 1 and 2
Proof of Theorem 1. For the assumption of Gaussian distributions, some algebra shows directly that . For the assumption of Poission distributions, consider in region under the null, . The first order Taylor expansion of around the mean is , hence . Similarly, we have first order expansion for , , and . Hence, under the null hypothesis, we have
For the assumption of negative binomial distributions, we can argue analogously. For example, for in region under the null, . The first order Taylor expansion around the mean