# Copula-based semiparametric transformation model for bivariate data under general interval censoring

This research is motivated by discovering and underpinning genetic causes for the progression of a bilateral eye disease, Age-related Macular Degeneration (AMD), whose primary outcomes, progression times to late-AMD, are bivariate and interval-censored due to intermittent assessment times. We propose a novel class of copula-based semiparametric transformation models for bivariate data under general interval censoring, which includes case 1 interval censoring (i.e., current status data), case 2 interval censoring, and right censoring. Specifically, the joint likelihood is modeled through a two-parameter Archimedean copula, which can flexibly characterize the dependence between the two margins in both tails. The marginal distributions are modeled through semiparametric transformation models using sieves, with the proportional hazards or odds model being a special case. We develop a computationally efficient two-step sieve maximum likelihood estimation procedure for the unknown parameters, together with a generalized score test for the regression parameter(s). For the proposed sieve estimators of the finite-dimensional parameters, we establish their asymptotic normality and efficiency. Extensive simulations are conducted to evaluate the estimation and testing performance of the proposed method in finite samples. Finally, we apply our method to a genome-wide analysis of AMD progression using the Age-Related Eye Disease Study (AREDS) data, successfully identifying novel risk variants associated with disease progression. We also produce predicted joint and conditional progression-free probabilities for patients with different genetic characteristics.


## 1 Introduction

Bivariate time-to-event endpoints are frequently used as co-primary outcomes in biomedical and epidemiological fields. For example, two time-to-event endpoints often arise in clinical trials studying the progression (or recurrence) of bilateral diseases (e.g., eye diseases) or complex diseases (e.g., cancer and psychiatric disorders). The two endpoints are correlated as they come from the same individual. Bivariate interval-censored data arise when neither event is observed precisely due to intermittent assessment times. A further complication is that the event status can be indeterminate (i.e., right-censored) for individuals who are event-free at their last assessment time. The special case with only one assessment time, leading to bivariate current status data, can also occur for some individuals. Therefore, the bivariate data we are interested in modeling are under general interval censoring, which includes a mixture of case 1 (current status) interval censoring, case 2 interval censoring and right censoring.

Our motivating example of such bivariate general interval-censored data comes from a large clinical trial studying the progression of a bilateral eye disease, Age-related Macular Degeneration (AMD), in which the two eyes of each patient were periodically examined for late-AMD. The study aims to discover genetic variants that are significantly associated with AMD progression, as well as to characterize both the joint and conditional risks of AMD progression. For example, the joint 5-year progression-free probability for both eyes is a clinically important measure for grouping patients into different risk categories. Similarly, for patients with one eye already progressed, the conditional 5-year progression-free probability for the non-progressed eye (given its fellow eye has already progressed) is important to both clinicians and patients. Therefore, a desired statistical approach should be able to characterize and predict both joint and conditional risk profiles, as well as to assess the covariate effects on them.

There have been several popular approaches for modeling bivariate interval-censored data. For example, Goggins & Finkelstein (2000), Kim & Xue (2002), Chen et al. (2007), Tong et al. (2008) and Chen et al. (2013) fitted various marginal models for multivariate interval-censored data. All these approaches model the marginal distributions based on the working independence assumption, and thus cannot produce joint or conditional distributions. Another popular approach is based on frailty models (for example, Oakes, 1982), which are mixed effects models with a latent frailty variable applied to the conditional hazard functions. For example, Chen et al. (2009) and Chen et al. (2014) built frailty proportional hazards models with piecewise constant baseline hazards for multivariate current status data and interval-censored data, respectively. Wen & Chen (2013) and Wang et al. (2015) developed gamma-frailty proportional hazards models for bivariate interval-censored data through a nonparametric maximum likelihood estimation approach and bivariate current status data through a spline-based sieve estimation approach, respectively. Recently, Zhou et al. (2017) and Zeng et al. (2017) proposed frailty-based transformation models for bivariate or multivariate interval-censored data, and obtained parameter estimators through the nonparametric maximum likelihood estimation and sieve maximum likelihood estimation, respectively.

The third popular approach is based on copula models (Clayton, 1978, for example). Unlike the marginal or frailty approaches, the copula-based methods directly connect the two marginal distributions through a copula function to construct the joint distribution, whose dependence is determined by the copula parameter. This unique feature makes the modeling of the margins separable from the copula function, which is attractive for both modeling and interpretation. Moreover, the challenge from censoring can be naturally handled through the marginal distributions within the copula function. Both joint and conditional distributions can be obtained from copula models. Various copula models have been proposed in the literature.

Wang et al. (2008) used sieve estimation in a copula model with proportional hazards margins for bivariate current status data. Cook & Tolusso (2009) and Kor et al. (2013) developed estimating equations for copula models with piecewise constant baseline marginal hazards for clustered current status and interval-censored data, respectively. Hu et al. (2017) developed a sieve estimation approach for a copula model with proportional hazards margins for bivariate current status data. To date, most copula-based regression models handle only a specific interval censoring type (e.g., case 1 current status or case 2 interval censoring) and are often limited to the proportional hazards assumption. Also, the most frequently used copula models, such as Clayton, Gumbel and Frank, all use only one parameter to characterize the between-margin dependence, which can lack flexibility.

In this work, we propose a general class of copula-based semiparametric transformation models for bivariate data subject to general interval censoring. Specifically, we build a two-parameter copula model framework, which can handle more flexible dependence structures than one-parameter copulas. Our method incorporates a broad class of semiparametric regression models that do not assume any specific distribution for the margins. We approximate the infinite-dimensional nuisance parameters by sieves built from Bernstein polynomials, and propose a novel two-step maximum likelihood estimation procedure that is computationally stable and efficient. We establish the asymptotic normality and efficiency of the sieve estimators of the finite-dimensional model parameters using an M-theorem that can handle two separate infinite-dimensional parameters. Moreover, we develop a generalized score test with numerical approximations of the score function and observed Fisher information for testing covariate effects.

The paper is organized as follows. Section 2 introduces the model and the joint likelihood function. Section 3 presents the sieve maximum likelihood estimation procedure, the asymptotic properties and the generalized score test. Section 4 illustrates extensive simulation studies for the estimation and testing performances of our proposed methods. We analyze the AREDS data and present the findings in Section 5. Finally, we discuss and conclude in Section 6. The regularity conditions and the proofs of two theorems are included in the Appendix, with additional lemmas and a general M-theorem presented in the supplemental material.

## 2 Notation and Likelihood

### 2.1 Copula model for bivariate censored data

Let $(T_1, T_2)$ be the true bivariate event times, with marginal survival functions $S_1$ and $S_2$ and joint survival function $S$. Assume there are $n$ independent subjects in a study. For subject $i$ $(i = 1, \ldots, n)$, we observe $D_i = \{(L_{ij}, R_{ij}], Z_{ij};\ j = 1, 2\}$, where $(L_{ij}, R_{ij}]$ is the time interval that $T_{ij}$ lies in and $Z_{ij}$ is the covariate vector. When $R_{ij} = \infty$, $T_{ij}$ is right-censored, and when $L_{ij} = 0$, $T_{ij}$ is left-censored.

By Sklar's theorem (Sklar, 1959), so long as the marginal survival functions are continuous, there exists a unique copula function $C_\eta$ that connects the two marginal survival functions into the joint survival function: $S(t_1, t_2) = \mathrm{pr}(T_1 > t_1, T_2 > t_2) = C_\eta\{S_1(t_1), S_2(t_2)\}$. Here, the copula $C_\eta$ maps $[0, 1] \times [0, 1]$ onto $[0, 1]$, and its parameter $\eta$ measures the dependence between the two margins. A signature feature of the copula is that it allows the dependence to be modeled separately from the marginal distributions (Nelsen, 2006).

One popular copula family for bivariate censored data is the Archimedean copula family, which usually has an explicit formula. Two frequently used Archimedean copulas are the Clayton (Clayton, 1978) and Gumbel (Gumbel, 1960) copula models, which account for the lower or upper tail dependence between two margins using a single parameter.

Here, we consider a more flexible two-parameter Archimedean copula model, which is formulated as

$$C_{\alpha,\kappa}(u, v) = \left[1 + \left\{\left(u^{-1/\kappa} - 1\right)^{1/\alpha} + \left(v^{-1/\kappa} - 1\right)^{1/\alpha}\right\}^{\alpha}\right]^{-\kappa}, \quad \alpha \in (0, 1],\ \kappa \in (0, \infty), \qquad (1)$$

where $u$ and $v$ are two uniformly distributed margins. The two dependence parameters ($\alpha$ and $\kappa$) account for the correlation between $u$ and $v$ at the upper and lower tails, respectively, and they are explicitly connected to Kendall's $\tau$ through $\tau = 1 - 2\alpha\kappa/(2\kappa + 1)$. In particular, when $\alpha = 1$, the two-parameter copula (1) becomes the Clayton copula, and when $\kappa \to \infty$, it becomes the Gumbel copula. Thus, the two-parameter copula model provides more flexibility in characterizing the dependence than the Clayton or Gumbel copula.
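As a quick numerical sanity check on the Clayton reduction, the copula in (1) can be coded directly. A minimal sketch (all function names are ours, assuming only `numpy`):

```python
import numpy as np

def copula2(u, v, alpha, kappa):
    """Two-parameter Archimedean copula C_{alpha,kappa}(u, v) in (1)."""
    inner = ((u ** (-1.0 / kappa) - 1.0) ** (1.0 / alpha)
             + (v ** (-1.0 / kappa) - 1.0) ** (1.0 / alpha))
    return (1.0 + inner ** alpha) ** (-kappa)

def clayton(u, v, theta):
    """Clayton copula: (u^-theta + v^-theta - 1)^(-1/theta)."""
    return (u ** (-theta) + v ** (-theta) - 1.0) ** (-1.0 / theta)

# At alpha = 1, the inner 1/alpha and outer alpha exponents cancel,
# and (1) collapses to the Clayton copula with theta = 1/kappa.
u, v, kappa = 0.3, 0.7, 2.0
print(np.isclose(copula2(u, v, alpha=1.0, kappa=kappa),
                 clayton(u, v, theta=1.0 / kappa)))   # True
```

With $\alpha = 1$ the two exponents cancel algebraically, which is exactly the Clayton special case noted above.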

### 2.2 Joint likelihood for bivariate data under general interval censoring

Using the notation introduced in Section 2.1, the joint likelihood function using the two-parameter copula model can be written as

$$\begin{aligned} L_n(S_1, S_2, \alpha, \kappa \mid D) &= \prod_{i=1}^{n}\{\mathrm{pr}(T_{i1} > L_{i1}, T_{i2} > L_{i2} \mid Z_{i1}, Z_{i2}) - \mathrm{pr}(T_{i1} > L_{i1}, T_{i2} > R_{i2} \mid Z_{i1}, Z_{i2}) \\ &\quad - \mathrm{pr}(T_{i1} > R_{i1}, T_{i2} > L_{i2} \mid Z_{i1}, Z_{i2}) + \mathrm{pr}(T_{i1} > R_{i1}, T_{i2} > R_{i2} \mid Z_{i1}, Z_{i2})\} \\ &= \prod_{i=1}^{n}\{C_{\alpha,\kappa}(S_1(L_{i1} \mid Z_{i1}), S_2(L_{i2} \mid Z_{i2})) - C_{\alpha,\kappa}(S_1(L_{i1} \mid Z_{i1}), S_2(R_{i2} \mid Z_{i2})) \\ &\quad - C_{\alpha,\kappa}(S_1(R_{i1} \mid Z_{i1}), S_2(L_{i2} \mid Z_{i2})) + C_{\alpha,\kappa}(S_1(R_{i1} \mid Z_{i1}), S_2(R_{i2} \mid Z_{i2}))\}. \qquad (2) \end{aligned}$$

For a given subject $i$, if $T_{ij}$ is right-censored, then any term involving $S_j(R_{ij} \mid Z_{ij})$ becomes 0 (since $R_{ij}$ is set to be $\infty$). The likelihood contribution for subject $i$ then reduces to either only one term (if both $T_{i1}$ and $T_{i2}$ are right-censored) or two terms (if one of them is right-censored). The special case of current status data also fits into this model frame, where either $L_{ij}$ is 0 (if the event has already occurred before the examination time, which is $R_{ij}$ in this case) or $R_{ij}$ is $\infty$ (if the event has not occurred by the examination time, which is $L_{ij}$ in this case). Therefore, the likelihood function (2) can handle the general form of bivariate interval-censored data.
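To make the four-term structure of the likelihood concrete, here is a minimal sketch of one subject's contribution (the helper names and the exponential margins are our own illustrative choices, not the authors' implementation):

```python
import math

def copula2(u, v, alpha, kappa):
    """Two-parameter Archimedean copula; returns 0 when a margin is 0 (R = infinity)."""
    if u <= 0.0 or v <= 0.0:
        return 0.0
    inner = ((u ** (-1.0 / kappa) - 1.0) ** (1.0 / alpha)
             + (v ** (-1.0 / kappa) - 1.0) ** (1.0 / alpha))
    return (1.0 + inner ** alpha) ** (-kappa)

def subject_contribution(L1, R1, L2, R2, S1, S2, alpha, kappa):
    """pr(L1 < T1 <= R1, L2 < T2 <= R2): the four-term difference in (2)."""
    s = lambda Sj, t: 0.0 if math.isinf(t) else Sj(t)   # S_j(infinity) = 0
    return (copula2(s(S1, L1), s(S2, L2), alpha, kappa)
            - copula2(s(S1, L1), s(S2, R2), alpha, kappa)
            - copula2(s(S1, R1), s(S2, L2), alpha, kappa)
            + copula2(s(S1, R1), s(S2, R2), alpha, kappa))

S = lambda t: math.exp(-t)   # illustrative exponential margin
# margin 1 interval-censored on (1, 2]; margin 2 right-censored at 0.5:
# the two terms involving R2 vanish, as described above
p = subject_contribution(1.0, 2.0, 0.5, math.inf, S, S, alpha=0.8, kappa=1.0)
print(0.0 < p < 1.0)   # True: a valid probability
```

When both margins are right-censored, only the first term survives and the contribution is the joint survival function evaluated at $(L_{i1}, L_{i2})$.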

Next, we model the dependence parameters ($\alpha$, $\kappa$) and the two marginal survival functions ($S_1$, $S_2$) together.

### 2.3 Semiparametric linear transformation model for marginal survival functions

We consider the semiparametric transformation model for marginal survival functions:

$$S_j(t \mid Z_j) = \exp\left[-G_j\{\exp(Z_j^T \beta_j)\Lambda_j(t)\}\right], \quad j = 1, 2, \qquad (3)$$

where $G_j$ is a pre-specified strictly increasing function, $\beta_j$ is a vector of unknown regression coefficients, and $\Lambda_j(\cdot)$ is an unknown non-decreasing function of $t$. In model (3), the transformation function $G_j$, the regression parameter $\beta_j$ and the infinite-dimensional parameter $\Lambda_j$ are all denoted as margin-specific (indexed by $j$) for generality. In practice, some or all of them can be the same for the two margins, in which case the corresponding index can be dropped.

Model (3) contains a class of survival models. For example, when $G_j(x) = x$, the marginal survival function follows a proportional hazards model. When $G_j(x) = \log(1 + x)$, it becomes a proportional odds model. In practice, the transformation function can also be "estimated" from the data. For example, the commonly used Box-Cox transformations $G_j(x) = \{(1 + x)^r - 1\}/r$, $r \ge 0$ (with $G_j(x) = \log(1 + x)$ as the limit at $r = 0$), or the logarithmic transformations $G_j(x) = \log(1 + rx)/r$, $r \ge 0$ (with $G_j(x) = x$ as the limit at $r = 0$), can be assumed. The proportional hazards and proportional odds models are special cases of both transformation classes. The parameter $r$ in $G_j(\cdot)$ can then be estimated together with the other parameters in the likelihood, as we demonstrate in our simulation studies.
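The two named special cases can be verified numerically. A small sketch of model (3) with the Box-Cox class (function names are ours):

```python
import math

def G_boxcox(x, r):
    """Box-Cox transformation class: r = 1 gives G(x) = x (proportional hazards);
    r = 0 is taken as the limit G(x) = log(1 + x) (proportional odds)."""
    return math.log(1.0 + x) if r == 0 else ((1.0 + x) ** r - 1.0) / r

def marginal_survival(t, zbeta, Lam, r):
    """S_j(t | Z) = exp[-G{exp(Z^T beta) Lambda(t)}], as in model (3)."""
    return math.exp(-G_boxcox(math.exp(zbeta) * Lam(t), r))

Lam = lambda t: t   # a simple illustrative baseline

# r = 1: proportional hazards, S(t | Z) = exp(-e^{zb} Lambda(t))
assert math.isclose(marginal_survival(2.0, 0.3, Lam, r=1.0),
                    math.exp(-math.exp(0.3) * 2.0))
# r = 0: proportional odds, S(t | Z) = 1 / (1 + e^{zb} Lambda(t))
assert math.isclose(marginal_survival(2.0, 0.3, Lam, r=0.0),
                    1.0 / (1.0 + math.exp(0.3) * 2.0))
```

The $r = 0$ case follows because $\exp\{-\log(1 + x)\} = 1/(1 + x)$, which is precisely the proportional odds survival form.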

## 3 Estimation and Inference

### 3.1 Sieve likelihood with Bernstein polynomials

In our likelihood function, we are interested in estimating the unknown parameter $\theta = (\beta_1^T, \beta_2^T, \alpha, \kappa, \Lambda_1, \Lambda_2)^T$, which lies in the parameter space

$$\Theta = \{\theta = (\beta_1^T, \beta_2^T, \alpha, \kappa, \Lambda_1, \Lambda_2)^T \in \mathcal{B} \otimes \mathcal{M} \otimes \mathcal{M}\}.$$

Here $\mathcal{B}$ is a bounded subset of $R^{d+2}$, with $d$ being the dimension of $(\beta_1^T, \beta_2^T)^T$ and its bound $M$ being a positive constant. We denote by $\mathcal{M}$ the collection of all bounded, continuous, nondecreasing and nonnegative functions over $[c, u]$, where $0 \le c < u < \infty$. In practice, $[c, u]$ can be chosen as the range of all observed $L_{ij}$ and finite $R_{ij}$.

In our log-likelihood function $l_n(\theta; D) = \log L_n$, there are finite-dimensional parameters of interest $(\beta_1^T, \beta_2^T, \alpha, \kappa)$ and two infinite-dimensional nuisance parameters $(\Lambda_1, \Lambda_2)$, which need to be estimated simultaneously. Unlike right-censored data, interval-censored data do not admit tools such as partial likelihood and martingale theory, due to the absence of exact event times. Instead, following Huang & Rossini (1997), we employ the sieve approach and form a sieve likelihood. Specifically, similar to Zhou et al. (2017), we use Bernstein polynomials to build a sieve space $\Theta_n = \mathcal{B} \otimes \mathcal{M}_n \otimes \mathcal{M}_n$. Here, $\mathcal{M}_n$ is the space defined by Bernstein polynomials:

$$\mathcal{M}_n = \left\{\Lambda_{jn}(t) = \sum_{k=0}^{m_n} \phi_{jk} B_k(t, m_n, c, u) : \sum_{k=0}^{m_n} |\phi_{jk}| \le M_n;\ 0 \le \phi_{j0} \le \cdots \le \phi_{jm_n};\ j = 1, 2\right\},$$

where $B_k(t, m_n, c, u)$ represents the Bernstein basis polynomial defined as

$$B_k(t, m_n, c, u) = \binom{m_n}{k}\left(\frac{t - c}{u - c}\right)^k \left(1 - \frac{t - c}{u - c}\right)^{m_n - k}, \quad k = 0, \ldots, m_n, \qquad (4)$$

with degree $m_n = O(n^{\nu})$ for some $\nu \in (0, 1)$. We assume the basis polynomials are the same between the two margins, while the coefficients $\{\phi_{jk}\}$ can be margin-specific. In practice, with a pre-specified $m_n$, we solve for $\{\phi_{jk}\}$ together with the other parameters $(\beta_1^T, \beta_2^T, \alpha, \kappa)$.

The use of Bernstein polynomials naturally fits interval-censored data, as they are built based on intervals. One big advantage of Bernstein polynomials is that they can naturally achieve the non-negativity and monotonicity properties of $\Lambda_j$ through re-parameterization (Zhou et al., 2017). Another advantage is that they do not require the specification of interior knots, as seen from (4), making them easier to work with.
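A minimal sketch of the Bernstein sieve in (4) (names are ours), showing that ordered, nonnegative coefficients yield a nonnegative, nondecreasing $\Lambda_{jn}$:

```python
import math
import numpy as np

def bernstein_basis(t, m, c, u):
    """Bernstein basis polynomials B_k(t, m, c, u), k = 0, ..., m, from (4)."""
    s = (t - c) / (u - c)
    return np.array([math.comb(m, k) * s ** k * (1.0 - s) ** (m - k)
                     for k in range(m + 1)])

def Lambda_sieve(t, phi, c, u):
    """Lambda_jn(t) = sum_k phi_k B_k(t, m, c, u); the ordering constraint
    0 <= phi_0 <= ... <= phi_m makes Lambda_jn nonnegative and nondecreasing."""
    return float(np.dot(phi, bernstein_basis(t, len(phi) - 1, c, u)))

phi = np.array([0.0, 0.2, 0.5, 1.1, 2.0])   # ordered, nonnegative coefficients
grid = np.linspace(0.0, 5.0, 50)
vals = [Lambda_sieve(t, phi, c=0.0, u=5.0) for t in grid]
print(all(a <= b + 1e-12 for a, b in zip(vals, vals[1:])))   # monotone: True
```

At $t = c$ the sieve equals $\phi_{j0}$ and at $t = u$ it equals $\phi_{jm_n}$, so the coefficient constraints directly pin down the range of $\Lambda_{jn}$.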

With the sieve space defined above, $\Theta$ will be approximated by $\Theta_n$. In the next section, we propose an estimation procedure to maximize $l_n(\theta; D)$ over the sieve space $\Theta_n$ to obtain the sieve maximum likelihood estimators $\hat{\theta}_n$.

### 3.2 Estimation procedure for the sieve maximum likelihood estimators $\hat{\theta}_n$

We develop a novel two-step sieve maximum likelihood estimation procedure that applies to any choice of Archimedean copula and marginal models. In principle, we could obtain the sieve maximum likelihood estimators by maximizing the joint likelihood function (2) in one step. However, due to the complex structure of the joint likelihood function, we recommend using a separate step to obtain appropriate initial values for all the unknown parameters. In essence, the marginal parameters $(\beta_j, \Lambda_j)$, $j = 1, 2$, are first estimated marginally in step 1(a). Then their estimators are plugged into the joint likelihood to form a pseudo-likelihood. In step 1(b), the dependence parameters $(\alpha, \kappa)$ are estimated by maximizing the pseudo-likelihood function. Finally, using initial values from steps 1(a) and 1(b), we update all the unknown parameters simultaneously under the joint log-likelihood function in step 2. The estimation procedure is described below:

1. Obtain initial estimates of $(\beta_1, \beta_2, \Lambda_1, \Lambda_2, \alpha, \kappa)$:

   (a) $(\hat\beta_j^{(0)}, \hat\Lambda_j^{(0)}) = \arg\max_{\beta_j,\, \Lambda_j \in \mathcal{M}_n} l_{jn}(\beta_j, \Lambda_j)$, where $l_{jn}$ denotes the sieve log-likelihood for the marginal model, $j = 1, 2$;

   (b) $(\hat\alpha^{(0)}, \hat\kappa^{(0)}) = \arg\max_{\alpha, \kappa}\, l_n(\hat\beta_1^{(0)}, \hat\beta_2^{(0)}, \hat\Lambda_1^{(0)}, \hat\Lambda_2^{(0)}, \alpha, \kappa)$, where $\hat\beta_j^{(0)}$ and $\hat\Lambda_j^{(0)}$ are the initial estimates from (a), and $l_n$ is the joint sieve log-likelihood.

2. Simultaneously maximize the joint sieve log-likelihood to get the final estimates, $\hat\theta_n = \arg\max_{\theta \in \Theta_n} l_n(\theta; D)$, with initial values obtained from steps 1(a) and 1(b).
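The two steps above can be sketched with generic numerical optimizers. The sketch below is schematic (assuming `scipy`): the two likelihood callables, their parameter orderings and the starting values are placeholders to be supplied by the user, not the authors' implementation:

```python
import numpy as np
from scipy.optimize import minimize

def two_step_fit(marg_negloglik, joint_negloglik, init1, init2, init_dep):
    """Two-step sieve MLE sketch.
    marg_negloglik(p, margin): negative marginal sieve log-likelihood, margin = 1, 2;
    joint_negloglik(p): negative joint sieve log-likelihood over all parameters,
    ordered as (margin-1 params, margin-2 params, alpha, kappa)."""
    # Step 1(a): fit each margin separately (beta_j and Bernstein coefficients).
    fit1 = minimize(lambda p: marg_negloglik(p, 1), init1, method="BFGS")
    fit2 = minimize(lambda p: marg_negloglik(p, 2), init2, method="BFGS")
    # Step 1(b): hold the margins fixed; maximize the pseudo-likelihood in (alpha, kappa).
    dep = minimize(lambda d: joint_negloglik(np.concatenate([fit1.x, fit2.x, d])),
                   init_dep, method="Nelder-Mead")
    # Step 2: maximize the joint likelihood in all parameters from these initial values.
    start = np.concatenate([fit1.x, fit2.x, dep.x])
    return minimize(joint_negloglik, start, method="BFGS")
```

The inverse Hessian of the joint negative log-likelihood at the step-2 maximizer then supplies the variance-covariance estimate for all parameters, consistent with the description in the text.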

The two-step procedure converges reliably in practice, and our simulations show that the computing speed in step 2 is significantly improved by using initial values obtained from step 1. In fact, the initial estimates from step 1 are already consistent (Sun et al., 2006). Step 2 produces correct variance-covariance estimates for all the parameters using the joint likelihood. Standard optimization algorithms such as the Newton-Raphson algorithm or the conjugate gradient algorithm can be employed to obtain the maximizers and the Hessian matrix. Due to the complex structure of the joint sieve log-likelihood, instead of analytically deriving the first- and second-order derivatives, we propose to use Richardson's extrapolation (Lindfield & Penny, 1989) to approximate the score function and observed Fisher information numerically.
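For completeness, one Richardson extrapolation step for a first derivative looks as follows (a one-dimensional sketch with our own helper name; in practice it would be applied coordinate-wise to build the score vector and observed information):

```python
import math

def richardson_derivative(f, x, h=1e-3):
    """Central difference D(h) = {f(x + h) - f(x - h)} / (2h) has error O(h^2);
    combining step sizes h and h/2 cancels that term: (4 D(h/2) - D(h)) / 3."""
    D = lambda s: (f(x + s) - f(x - s)) / (2.0 * s)
    return (4.0 * D(h / 2.0) - D(h)) / 3.0

# the O(h^2) error of a plain central difference is reduced to roughly O(h^4)
print(abs(richardson_derivative(math.sin, 0.7) - math.cos(0.7)) < 1e-10)   # True
```

The same weighted combination applies to second differences, giving the observed Fisher information entries without analytic derivatives.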

### 3.3 Asymptotic Properties of Sieve Estimators

To establish the asymptotic properties of the sieve maximum likelihood estimators $\hat\theta_n$, we need to define a distance between two values of $\theta$. Let $|v|$ be the Euclidean norm for a vector $v$, and define the supremum norm $\|f\|_\infty = \sup_t |f(t)|$ for a function $f$. Also define $\|f\|_{L_2(P)} = \{\int f^2 \, dP\}^{1/2}$ for a function $f$ under a probability measure $P$. The $L_2$ norm $\|\Lambda_j\|_2$ for $\Lambda_j$ is then defined with respect to the measure induced by $F_j$, where $F_j$ denotes the joint cumulative distribution function of $L_j$ and $R_j$. Finally, we define the distance between $\theta_1 = (\beta_1^T, \alpha_1, \kappa_1, \Lambda_{11}, \Lambda_{21})^T$ and $\theta_2 = (\beta_2^T, \alpha_2, \kappa_2, \Lambda_{12}, \Lambda_{22})^T$ as

$$d(\theta_1, \theta_2) = \left(|\beta_1 - \beta_2|^2 + |\alpha_1 - \alpha_2|^2 + |\kappa_1 - \kappa_2|^2 + \|\Lambda_{11} - \Lambda_{12}\|_2^2 + \|\Lambda_{21} - \Lambda_{22}\|_2^2\right)^{1/2}.$$

Let $\theta_0 = (\beta_{10}^T, \beta_{20}^T, \alpha_0, \kappa_0, \Lambda_{10}, \Lambda_{20})^T$ denote the true value of $\theta$. The following theorems present the convergence rate, asymptotic normality and efficiency of the sieve estimators.

###### Theorem 1

(Convergence rate) Assume that Conditions 1-2 and 4-5 in the Appendix hold. Let $m_n = O(n^{\nu})$, where $\nu \in (0, 1)$, and let $q$ be the smoothness parameter of $\Lambda_{j0}$ as defined in Condition 4. Then we have

$$d(\hat\theta_n, \theta_0) = O_p\{n^{-\min(q\nu/2,\ (1-\nu)/2)}\}.$$

Theorem 1 states that the sieve estimator has a polynomial convergence rate. Although this overall convergence rate is slower than $n^{-1/2}$, the following theorem states that the proposed estimators of the finite-dimensional parameters are still asymptotically normal and semiparametrically efficient.

###### Theorem 2

(Asymptotic normality and efficiency) Suppose Conditions 1-5 in the Appendix hold. Define $b = (\beta_1^T, \beta_2^T, \alpha, \kappa)^T$ and $\hat b_n = (\hat\beta_{1n}^T, \hat\beta_{2n}^T, \hat\alpha_n, \hat\kappa_n)^T$. If the rate in Theorem 1 satisfies $\min(q\nu/2, (1-\nu)/2) > 1/4$, then

$$n^{1/2}(\hat b_n - b_0) = I^{-1}(b_0)\, n^{1/2}\, \mathbb{P}_n\, l^*(b_0, \Lambda_{10}, \Lambda_{20}; D) + o_p(1) \to_d N\{0, I^{-1}(b_0)\},$$

where $\mathbb{P}_n$ denotes the empirical measure, $I(b_0)$ is the information matrix for $b_0$, and $l^*$ is the efficient score function defined in the proof. Therefore, $\hat b_n$ is asymptotically normal and semiparametrically efficient.

### 3.4 Generalized score test

We now separate the regression parameter into two parts: $\beta_g$ and $\beta_{ng}$, where $\beta_g$ is the parameter set of interest for hypothesis testing and $\beta_{ng}$ denotes the rest of the regression coefficients. Likelihood-based tests such as the Wald, score and likelihood-ratio tests can be constructed to test $H_0: \beta_g = 0$, and they are asymptotically equivalent. In our motivating study, we aim to perform a GWAS of AMD progression, which involves testing millions of SNPs one at a time; therefore, computing speed is critical. We propose to use the generalized score test. One big advantage of the score test in a GWAS setting is that one only needs to estimate the unknown parameters once, under the null model without any SNP (i.e., $\beta_g = 0$), since the non-genetic risk factors are the same no matter which SNP is being tested. Therefore, the score test is faster than the likelihood ratio or Wald test. Moreover, since the estimation is performed only once, convergence is easily ensured.

With the sieve joint likelihood, we can obtain the restricted sieve maximum likelihood estimators under $H_0$ ($\beta_g = 0$, with the rest of the parameters left arbitrary), and then calculate the generalized score test statistic as defined in Cox & Hinkley (1979). As in our two-step estimation procedure, we propose to use Richardson's extrapolation to numerically approximate the first- and second-order derivatives when calculating the score test statistic.
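In matrix form, with the score vector and observed information evaluated at the restricted fit, the statistic for the tested block uses the efficient information. A sketch (our own helper; the inputs here are toy numbers, not outputs of the actual model):

```python
import numpy as np
from scipy.stats import chi2

def score_test(score, info, idx):
    """Generalized score test of H0: theta[idx] = 0 at the restricted MLE.
    score: full score vector; info: observed information matrix;
    idx: indices of the tested parameters (e.g., the SNP effect)."""
    score = np.asarray(score, dtype=float)
    info = np.asarray(info, dtype=float)
    keep = [i for i in range(len(score)) if i not in set(idx)]
    U = score[idx]
    # efficient information for the tested block: I_gg - I_ge I_ee^{-1} I_eg
    V = info[np.ix_(idx, idx)] - info[np.ix_(idx, keep)] @ np.linalg.solve(
        info[np.ix_(keep, keep)], info[np.ix_(keep, idx)])
    stat = float(U @ np.linalg.solve(V, U))
    return stat, chi2.sf(stat, df=len(idx))

# toy example: two parameters, test the first; U = 1, efficient information = 2
stat, pval = score_test([1.0, 0.0], [[2.0, 0.0], [0.0, 1.0]], idx=[0])
print(round(stat, 6))   # 0.5
```

Under $H_0$ the statistic is asymptotically chi-squared with degrees of freedom equal to the number of tested parameters, which is what makes the single null fit reusable across all SNPs.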

## 4 Simulation study

We first evaluated the parameter estimation of our proposed two-parameter copula sieve model for bivariate data under general interval censoring. Then we assessed the type-I error control, power performance and computational speed of the proposed generalized score test.

### 4.1 Generating bivariate interval-censored times

The data were simulated from various Archimedean copula models (i.e., Clayton, Frank, Ali–Mikhail–Haq (AMH) and Joe) with Loglogistic margins. We first generated bivariate true event times using the conditioning approach described in Sun et al. (2018). To obtain interval-censored data, we followed the censoring procedure in Kiani & Arasan (2012), which fits the study design of the AREDS data. Specifically, we assumed each subject was assessed multiple times, with the length between two adjacent assessment times following an Exponential distribution. In the end, for each subject, $L_{ij}$ was defined as the last assessment time before $T_{ij}$ and $R_{ij}$ was the first assessment time after $T_{ij}$. The overall right-censoring rate was controlled at a fixed target level.
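The assessment-time mechanism described above can be sketched as follows (the number of visits and the Exponential gap mean are our own illustrative values):

```python
import numpy as np

rng = np.random.default_rng(1)

def interval_censor(T, n_visits=8, gap_mean=1.0):
    """Turn a true event time T into an interval (L, R]:
    visit times accumulate Exponential gaps; L is the last visit before T,
    R is the first visit at or after T (infinity if T exceeds all visits)."""
    visits = np.cumsum(rng.exponential(gap_mean, size=n_visits))
    before = visits[visits < T]
    after = visits[visits >= T]
    L = float(before[-1]) if before.size else 0.0    # L = 0: left-censored
    R = float(after[0]) if after.size else np.inf    # R = inf: right-censored
    return L, R

L, R = interval_censor(3.0)
print(L < 3.0 <= R)   # True: the true time always lies in (L, R]
```

Current status data correspond to the special case of a single visit, which yields either $(0, R]$ or $(L, \infty)$.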

For the dependence strength between margins, we set Kendall's $\tau$ at 0.2 or 0.6, indicating weak or strong dependence. We assumed that the two event times share a common baseline distribution, for example Loglogistic or Weibull with given scale and shape parameters.

We included both genetic and non-genetic covariates in the simulations. Specifically, each SNP, coded as 0, 1 or 2, was generated from a multinomial distribution with probabilities determined by the minor allele frequency (MAF). We also included a margin-specific continuous variable, generated from a normal distribution, and a subject-specific binary variable, generated from a Bernoulli distribution.
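For the SNP covariate, a common concrete choice (and the one we assume here purely for illustration) is Hardy-Weinberg proportions for the multinomial probabilities:

```python
import numpy as np

rng = np.random.default_rng(7)

def simulate_snp(n, maf):
    """Additively coded genotypes 0/1/2 with multinomial probabilities
    {(1 - f)^2, 2 f (1 - f), f^2} for minor allele frequency f
    (Hardy-Weinberg proportions; an illustrative assumption here)."""
    probs = [(1.0 - maf) ** 2, 2.0 * maf * (1.0 - maf), maf ** 2]
    return rng.choice([0, 1, 2], size=n, p=probs)

g = simulate_snp(10_000, maf=0.4)
# the sample allele frequency g.mean() / 2 should sit close to the target MAF
```

Under this coding, the mean genotype equals twice the allele frequency, which is a convenient check on the simulated data.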

Under all scenarios, the sample size was held fixed. For simplicity, we assumed the same covariate effects for the two margins, consisting of marginal- and subject-specific non-genetic effects and the SNP effect $\beta_g$. For the estimation performance evaluation, we replicated 1,000 times. For the type-I error evaluation of testing $H_0: \beta_g = 0$, we replicated 100,000 times and evaluated at various tail levels: 0.05, 0.01, 0.001 and 0.0001. For the power evaluation, we replicated 1,000 times under each SNP effect size, where a range of $\beta_g$ values were selected to represent weak to strong SNP effects.

### 4.2 Simulation-I: parameter estimation

In this section, we evaluated the estimation performance of our proposed method under both correct and misspecified settings. For the sieve margins, we used the true linear transformation function. We assumed the same Bernstein polynomial degree $m_n$ for both margins $j = 1, 2$. For the event time range $[c, u]$, we chose $c = 0$ and set $u$ as the largest finite value of all $L_{ij}$ and $R_{ij}$ plus a constant.

In Table 1, the true model is the Clayton copula with Loglogistic (proportional odds) or Weibull (proportional hazards) margins, under a fixed Kendall's $\tau$. We fitted three models: the true parametric copula model (i.e., Clayton copula with Loglogistic or Weibull margins), a two-parameter copula model with sieve margins ("Copula2-S") and a marginal sieve model with the robust variance-covariance estimate ("Robust-S") (Zhou et al., 2017). We obtained estimation biases and 95% coverage probabilities for the regression coefficients and dependence parameters. Under the two-parameter copula model, the sieve maximum likelihood estimators are virtually unbiased and all empirical coverage probabilities are close to the nominal level. Moreover, their standard errors are virtually the same as those under the true parametric model, indicating that our proposed method works well. For the robust marginal sieve model, the regression coefficient estimates are also unbiased with correct coverage probabilities, but their standard errors are apparently larger.

In the real setting, the value of the transformation function parameter is often unknown. Therefore, we examined our methods in estimating the transformation function parameter together with the other parameters in our proposed model. Table 2 shows satisfactory estimation performance for all parameters including the transformation parameter in both proportional hazards and proportional odds settings.

We also simulated bivariate current status data, by restricting each subject to a single assessment time, to examine how the proposed method works in the special case of case 1 interval censoring. As shown in Table 3, Copula2-S works as well as the true model in this setting too. The larger standard errors are due to the lesser information in current status data as compared to standard case 2 interval-censored data.

We further evaluated the estimation performance of the proposed model on bivariate interval-censored data generated from copula models that do not belong to the two-parameter copula family, such as the Frank copula, the AMH copula (for which Kendall's $\tau$ is always restricted to be below 1/3) and the Joe copula. In Table 4, the regression coefficient estimates from the two-parameter copula are all unbiased with coverage probabilities close to 95%. The biases for the dependence estimates are also minimal with good coverage probabilities, except in the scenario where the data were generated from a Joe copula (coverage probability = 83%). Overall, the two-parameter copula model family demonstrates good robustness to misspecification of the copula function.

### 4.3 Simulation-II: type-I error

We evaluated the type-I error control of our proposed generalized score test under Copula2-S. Specifically, we tested the SNP effect under different dependence strengths (Kendall's $\tau$ = 0.2 or 0.6) and two different MAFs (40%, 5%). The true model is the Clayton copula with Loglogistic margins. We included score tests of two misspecified copula models, one with misspecified margins but the correct copula (i.e., Clayton copula with Weibull margins) and the other with a misspecified copula but correct margins (i.e., Gumbel copula with Loglogistic margins). We also included the score test under the true parametric copula model (i.e., Clayton copula with Loglogistic margins), which served as the benchmark model. In addition, we examined Wald tests from the marginal Loglogistic model with the variance-covariance taken either from the independence estimate or the robust sandwich estimate.

Table 5 shows type-I errors under different tail levels. In the top part, where the dependence is strong, our proposed score test controls type-I errors as well as the true parametric model at all tail levels and MAFs. However, type-I errors in the two misspecified copula models are inflated in all scenarios, especially when the margins are misspecified. It is not surprising that the greatest inflation occurs with the marginal approach under the independence assumption. After applying the robust variance-covariance estimate, the type-I errors seem to be controlled at MAF = 40%, but are still slightly inflated at MAF = 5%. Under the weaker dependence setting, the proposed two-parameter model still performs as well as the correct parametric model and outperforms the other models, although the type-I error inflations of the other models are smaller due to the weaker dependence.

### 4.4 Simulation-III: power

We compared the score test under our Copula2-S model with the score tests from two other models (that appropriately control the type-I error in Table 5): the true parametric copula model and the Robust-S model. Figure 1 presents the power curves of these three tests over a range of SNP effect sizes. Our proposed model yields the same power as the true parametric copula model, and is considerably more powerful than the robust marginal sieve model.

### 4.5 Simulation-IV: computing speed

We examined the computational advantage of our two-step sieve estimation procedure over one-step estimation in the setting where the data were simulated from a Clayton copula with Loglogistic margins. Across replications, the two-step procedure substantially reduced the computing time relative to the one-step method.

We also compared the computing speed of the three likelihood-based tests for testing 1,000 SNPs under three models: the true Clayton model with Loglogistic margins, our proposed Copula2-S model and the Robust-S model. The 1,000 genetic variants were simulated with a fixed MAF. The results are shown in Table 6. We found that the score test is about 3-5 times faster than the Wald test or the likelihood ratio test on average. Among the three score tests, although the score test under our Copula2-S model is the slowest due to model complexity, it is still faster than the Wald test under the Robust-S model. Given its advantages in robustness, type-I error control and power performance, we recommend the proposed Copula2-S model with the score test for large-scale testing of bivariate data subject to general censoring.

## 5 Real data analysis

We implemented our proposed method to analyze the AREDS data. AREDS was designed to assess the clinical course of, and risk factors for, the development and progression of AMD. DNA samples were collected from the consenting participants and genotyped by the International AMD Genomics Consortium (Fritsche et al., 2016). In this study, each participant was examined every 6 months in the first 6 years and then every 1 year after year 6. To measure the disease progression, a severity score, scaled from 1 to 12 (with a larger value indicating more severe AMD), was determined for each eye of each participant at every examination. The outcome of interest is the bivariate progression time-to-late-AMD, where late-AMD is defined as the stage at which the severity score reaches the late-AMD threshold. Both phenotype and genotype data of AREDS are available from the online repository dbGaP (accessions: phs000001.v3.p1 and phs001039.v1.p1, respectively). To date, all the studies that analyzed the AREDS data for AMD progression treated the outcome as right-censored (e.g., Ding et al. (2017), Yan et al. (2018), and Sun et al. (2018)), and some only used data from one eye for the analysis (e.g., Seddon et al. (2014)).

We analyzed 2718 Caucasian participants, including 2295 subjects who were free of late-AMD in both eyes at enrollment, i.e., time 0 (denoted as group A), and 423 subjects who had one eye already progressed to late-AMD by enrollment (denoted as group B). For the $j$th eye (free of late-AMD at time 0) of subject $i$, we observe $L_{ij}$, the last assessment time when the $j$th eye was still free of late-AMD, and $R_{ij}$, the first assessment time when the $j$th eye was already diagnosed as late-AMD. For an eye that did not progress to late-AMD by the end of the study follow-up, we assigned a large number to $R_{ij}$. Since there are two groups of subjects (groups A and B), we implemented a two-part model. Specifically, we created a covariate for each eye to indicate whether its fellow eye had already progressed by time 0 (i.e., 0 for both eyes of group A subjects and 1 for group B subjects). The joint likelihood is then the product of the copula sieve model for group A subjects and the marginal sieve model for group B subjects.

We included four risk factors as non-genetic covariates: the baseline age, severity score, smoking status, and fellow-eye progression status. We compared various transformation functions and Bernstein polynomial degrees, and chose the model that produced the smallest AIC, which is the proportional odds model for both margins.
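The model selection step above is a simple grid search over candidate (transformation, polynomial degree) pairs. A sketch, with hypothetical fitted log-likelihoods and parameter counts (the actual values come from fitting each candidate model to the data):

```python
# Hypothetical fitted results: (transformation, Bernstein degree) -> (loglik, n_params).
fits = {
    ("PH", 3): (-1520.4, 10),
    ("PO", 3): (-1508.9, 10),
    ("PO", 4): (-1508.1, 12),
}

def aic(loglik, k):
    # AIC = 2k - 2*loglik; smaller is better.
    return 2 * k - 2 * loglik

best = min(fits, key=lambda m: aic(*fits[m]))
```

Under these illustrative numbers, the proportional odds model with degree 3 wins: its small gain in log-likelihood over degree 4 does not justify the two extra parameters.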

We performed a GWAS on 6 million SNPs (either from the exome chip or imputed) with MAF above a prespecified threshold across the 22 autosomal chromosomes and plotted their p-values (on the −log10 scale) in Figure 2. As highlighted in the figure, the PLEKHA1–ARMS2–HTRA1 region on chromosome 10 and the CFH region on chromosome 1 have variants reaching the “genome-wide” significance level. Previously, these two regions were found to be significantly associated with AMD onset in multiple case-control studies (Fritsche et al., 2016). Moreover, we identified SNPs in a previously unrecognized region, ATF7IP2 on chromosome 16, showing moderate to strong association with AMD progression. As a comparison, we also fitted the robust marginal sieve model (Robust-S) and performed the corresponding score test for each SNP. Overall, the results are consistent with our Copula2-S model, but the p-values are generally larger (as shown in Table 7). Note that the CFH region did not reach the “genome-wide” significance level under the Robust-S model.

Table 7 presents the top significant variants in the three regions highlighted in Figure 2. The odds ratio of a SNP was calculated by fitting a model including that SNP together with the non-genetic risk factors. For example, a known AMD risk variant from the HTRA1 region has an estimated odds ratio of 1.66, which implies that its minor allele has a “harmful” effect on AMD progression. Under this model, the estimated dependence parameters correspond to a Kendall’s τ indicating moderate dependence in AMD progression between the two eyes. In addition, the estimated upper and lower tail Kendall’s τ values suggest a stronger between-eye dependence at the later stage of progression.
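Kendall's τ can be obtained from a fitted Archimedean copula through the standard generator identity τ = 1 + 4∫₀¹ φ(t)/φ′(t) dt. The sketch below evaluates it numerically for the Clayton generator, used here only as an illustrative stand-in for the two-parameter copula of the paper; for Clayton, the closed form τ = θ/(θ+2) provides a check.

```python
def kendall_tau(phi, dphi, n=100000):
    # Midpoint-rule approximation of tau = 1 + 4 * int_0^1 phi(t)/phi'(t) dt,
    # the Kendall's-tau identity for an Archimedean copula with generator phi.
    h = 1.0 / n
    total = 0.0
    for i in range(n):
        t = (i + 0.5) * h
        total += phi(t) / dphi(t)
    return 1.0 + 4.0 * total * h

theta = 2.0
phi = lambda t: (t ** (-theta) - 1.0) / theta   # Clayton generator
dphi = lambda t: -t ** (-theta - 1.0)           # its derivative

tau = kendall_tau(phi, dphi)  # closed form for Clayton: theta/(theta+2)
```

The same routine applies to any differentiable generator, so the fitted two-parameter generator could be plugged in directly.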

For this variant, we obtained both the estimated joint and conditional survival functions from the fitted Copula2-S model. The left panel of Figure 3 plots the joint progression-free probability contours for subjects who are smokers with the same age (68.6) and AMD severity score (3.0 for both eyes) but different genotypes of the variant. The right panel of Figure 3 plots the corresponding conditional progression-free probability over the remaining years (after year 5) for one eye, given that its fellow eye progressed by year 5. In both plots, the three genotype groups are clearly separated, with one genotype group having the largest progression-free probabilities. These estimated progression-free probabilities provide valuable information for characterizing or predicting the progression profiles of AMD patients with different genetic characteristics.
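Both quantities plotted in Figure 3 follow mechanically from a fitted copula and its margins. The sketch below uses a Clayton copula and a hypothetical exponential margin purely for illustration; the fitted Copula2-S model would supply the estimated copula and sieve margins instead.

```python
import math

def clayton(u, v, theta=2.0):
    # Clayton copula; illustrative stand-in for the fitted two-parameter copula.
    return (u ** (-theta) + v ** (-theta) - 1.0) ** (-1.0 / theta)

def s_marg(t, rate=0.08):
    # Hypothetical marginal progression-free probability (exponential).
    return math.exp(-rate * t)

def joint_pf(t1, t2):
    # Joint progression-free probability S(t1, t2) = C(S1(t1), S2(t2)),
    # i.e., the quantity contoured in the left panel of Figure 3.
    return clayton(s_marg(t1), s_marg(t2))

def cond_pf(t, t_fellow=5.0):
    # P(T2 > t | T1 <= t_fellow): progression-free probability of one eye
    # given that its fellow eye progressed by year t_fellow (right panel).
    return (s_marg(t) - joint_pf(t_fellow, t)) / (1.0 - s_marg(t_fellow))
```

At t = 0 the conditional probability is 1 by construction, and it decays as t grows, tracing out a curve like those in the right panel.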

## 6 Conclusion and Discussion

We proposed a flexible copula-based semiparametric transformation model for analyzing and testing bivariate (general) interval-censored data. We estimated the model parameters through a computationally efficient and stable two-step sieve estimation approach using Bernstein polynomials. We studied the asymptotic properties of the proposed sieve estimators and established asymptotic normality and efficiency for the finite-dimensional model parameters. Our extensive simulations demonstrated satisfactory estimation and testing performance of the proposed method under various practical settings. We implemented the proposed methods in R; the key functions can be found in the GitHub repository https://github.com/yingding99/Copula2S.
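The Bernstein-polynomial sieve mentioned above is easy to sketch: a nonparametric cumulative hazard is approximated by a linear combination of Bernstein basis polynomials, and nondecreasing coefficients guarantee monotonicity. The degree, support endpoints, and coefficient values below are hypothetical.

```python
from math import comb

def bernstein_basis(t, k, m, lo=0.0, hi=12.0):
    # k-th Bernstein basis polynomial of degree m on the interval [lo, hi].
    x = (t - lo) / (hi - lo)
    return comb(m, k) * x ** k * (1.0 - x) ** (m - k)

def cum_hazard(t, coef, lo=0.0, hi=12.0):
    # Sieve approximation Lambda(t) = sum_k coef[k] * B_{k,m}(t);
    # nondecreasing coefficients yield a nondecreasing Lambda.
    m = len(coef) - 1
    return sum(c * bernstein_basis(t, k, m, lo, hi) for k, c in enumerate(coef))

phi = [0.0, 0.2, 0.5, 1.1]  # hypothetical nondecreasing coefficients, degree 3
```

In the two-step procedure, these coefficients would be estimated jointly with the regression and dependence parameters, with the monotonicity constraint imposed on the coefficient sequence.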

As mentioned in the Introduction, frailty models are another popular approach for analyzing bivariate censored data. Goethals et al. (2008) present the connections and distinctions between copula and frailty models. For example, the Clayton copula has the same mathematical expression as the Gamma frailty model in terms of the joint survival distribution. However, their marginal survival functions are modeled differently: the marginal function under the Clayton model involves only the time and covariate effects, whereas the marginal function under the Gamma frailty model involves not only the time and covariate effects but also the frailty parameter. As a result, the joint distribution functions of the Clayton copula and Gamma frailty models are not equivalent; the two joint distribution functions are identical only when the two margins are independent. Moreover, the parameter estimation strategies usually differ between copula and frailty models. More discussion can be found in Wienke (2010). In this paper, the objectives of our real study led us to choose copula-based models, which offer a straightforward interpretation of the covariate effects and the dependence strength, and directly produce joint and conditional survival distributions.
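The Clayton/Gamma-frailty relationship described above can be made explicit. A sketch of the standard derivation, assuming a frailty $Z \sim \mathrm{Gamma}(1/\theta, 1/\theta)$ acting multiplicatively on the conditional hazards $Z\,\lambda_j(t)$:

```latex
% Clayton copula: margins S_j are free of the dependence parameter theta
S(t_1,t_2) = \left\{ S_1(t_1)^{-\theta} + S_2(t_2)^{-\theta} - 1 \right\}^{-1/\theta}.

% Gamma frailty: integrating out Z via its Laplace transform gives
S(t_1,t_2) = E\!\left[ e^{-Z\{\Lambda_1(t_1) + \Lambda_2(t_2)\}} \right]
           = \left[ 1 + \theta\{\Lambda_1(t_1) + \Lambda_2(t_2)\} \right]^{-1/\theta},
\qquad S_j(t) = \{ 1 + \theta\,\Lambda_j(t) \}^{-1/\theta}.
```

Substituting $S_j(t)^{-\theta} = 1 + \theta\,\Lambda_j(t)$ into the Clayton form recovers the frailty expression, so the two joint survival functions share the same functional form; the distinction is that the frailty-model margins $S_j$ themselves depend on $\theta$, as noted above.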

Several model selection procedures have been proposed for copula-based methods. For example, the AIC is widely used for model selection in copula models. Wang & Wells (2000) proposed a model selection procedure based on nonparametric estimation of the bivariate joint survival function within Archimedean copulas. For model diagnostics, Chen et al. (2010) proposed a penalized pseudo-likelihood ratio test for copula models with complete data. More recently, Zhang et al. (2016) proposed a goodness-of-fit test for copula models using the pseudo in-and-out-of-sample method. To the best of our knowledge, there is no existing goodness-of-fit test for copula models with bivariate interval-censored data. In our real data analysis, we used the AIC to guide model selection for simplicity. However, a formal goodness-of-fit test is desirable, especially for bivariate interval-censored data under the regression setting; this is a worthwhile direction for future research.

We applied our method to a GWAS of AMD progression and successfully identified variants from two known AMD risk regions (CFH on chromosome 1 and PLEKHA1–ARMS2–HTRA1 on chromosome 10) as significantly associated with AMD progression. Moreover, we discovered variants from a previously unreported region (ATF7IP2 on chromosome 16) showing moderate to strong association with AMD progression. In contrast, we found that some known AMD risk loci (e.g., in ARHGAP21 on chromosome 10) are not associated with AMD progression. Therefore, variants associated with a higher (or lower) risk of developing AMD are not necessarily associated with the disease progression, while some variants may be associated only with AMD progression but not with the disease onset. This is the first research on AMD progression that adopts a rigorous statistical model appropriately handling bivariate interval-censored data. Our findings provide new insights into the genetic causes of AMD progression, which are critical for the next step of establishing novel and reliable predictive models of AMD progression to accurately identify high-risk patients at an early stage. Our proposed method is applicable to general bilateral diseases and to complex diseases with co-primary endpoints.

## 7 Appendix

### 7.1 Regularity Conditions

In this section, we state the regularity conditions needed for Theorems 1 and 2.

Condition 1. (i) There exists such that ; (ii) The union of the supports for distributions of and is an interval with .

Condition 2. The distribution of covariate has a bounded support and is not concentrated on any proper subspace of , .

Condition 3. Let be the likelihood function with being substituted by . Define

$$v^{T}\dot{L}(\beta,\alpha,\kappa,y_1,y_2) = v_1^{T}\frac{\partial L}{\partial \beta} + v_2\frac{\partial L}{\partial \alpha} + v_3\frac{\partial L}{\partial \kappa} + v_4\frac{\partial L}{\partial y_1} + v_5\frac{\partial L}{\partial y_2},$$

with . There exist for which there are different sets of such that if with for each of these sets of values, then .

Condition 4. (i) The function is continuously differentiable up to order with in and satisfies for some positive constant . Also, is an interior point of . (ii) The transformation is a strictly increasing function with and is three times continuously differentiable in .

Condition 5. For every θ in a neighborhood of θ₀, , where l is the log-likelihood function given in Section 3 and “≲” means that the left-hand side is bounded above by a constant times the right-hand side.

Conditions 1, 2, 4(i), and 5 are commonly used in studies of interval-censored data (e.g., Huang & Rossini, 1997; Wen & Chen, 2013; Zhou et al., 2017). Condition 4(ii) comes from the definition of the linear transformation model (Cheng et al., 1995). Condition 3 ensures both the identifiability of the parameters and the positivity of the efficient Fisher information matrix (Chang et al., 2007; Wen & Chen, 2013; Zhou et al., 2017).

### 7.2 Proof of Theorem 1

We will derive the convergence rate by verifying Conditions C1–C3 of Theorem 1 in Shen & Wong (1994). Define , where is the collection of with smoothness as defined in our Condition 4. Similarly, is the corresponding sieve space containing . Then Condition C1 automatically holds due to our Condition 5. Next, we verify Condition C2 of Shen & Wong (1994). By arguments similar to those in the proof of Lemma A3 (using our Conditions 1, 2, and 4), we can show that for any ,

$$|l(\theta;D)-l(\theta_0;D)| \lesssim |b-b_0| + |\Lambda_1(L_1)-\Lambda_{10}(L_1)| + |\Lambda_1(R_1)-\Lambda_{10}(R_1)| + |\Lambda_2(L_2)-\Lambda_{20}(L_2)| + |\Lambda_2(R_2)-\Lambda_{20}(R_2)|,$$

where Then, it follows that for any

$$P\{l(\theta;D)-l(\theta_0;D)\}^2 \lesssim |b-b_0|^2 + P\big[\{\Lambda_1(L_1)-\Lambda_{10}(L_1)\}^2 + \{\Lambda_1(R_1)-\Lambda_{10}(R_1)\}^2\big] + P\big[\{\Lambda_2(L_2)-\Lambda_{20}(L_2)\}^2 + \{\Lambda_2(R_2)-\Lambda_{20}(R_2)\}^2\big] = d^2(\theta,\theta_0).
$$

This implies that

$$\sup_{\{d(\theta,\theta_0)\le\epsilon,\ \theta\in\Theta_{q_n}\}} \mathrm{var}\{l(\theta_0;D)-l(\theta;D)\} \le \sup_{\{d(\theta,\theta_0)\le\epsilon,\ \theta\in\Theta_{q_n}\}} P\{l(\theta_0;D)-l(\theta;D)\}^2 \lesssim \epsilon^2.$$

Thus, Condition C2 of Shen & Wong (1994) holds (with in their notation).
Finally, we verify Condition C3 of Shen & Wong (1994). By Lemma A3, for , we have , with being the dimensionality of . Using the fact that the covering number is bounded by the bracketing number, it follows that

Hence, Condition C3 of Shen & Wong (1994) on page 583 holds with the constants and in their notation.

Therefore, the constant in Theorem 1 of Shen & Wong (1994) on page 584 is . Since as , we can pick a slightly larger than such that for large . We still denote as so that . Since maximizes over , it satisfies inequality (1.1) in Shen & Wong (1994) with in their notation. By Lemma A2, there exists a such that . Thus, the sieve approximation error in Shen & Wong (1994) is . In addition, since is maximized at , its first derivative at equals zero. Then, applying a Taylor expansion of around and plugging in , the Kullback–Leibler pseudo-distance between and satisfies

$$\begin{aligned} K(\theta_0,\theta_{0,n}) &= -P\{l(\theta_{0,n};D)-l(\theta_0;D)\} \\ &= -\tfrac{1}{2}P\big\{\ddot{l}_{\Lambda_1\Lambda_1}(\theta_0;D)[\Lambda_{10,n}-\Lambda_{10},\Lambda_{10,n}-\Lambda_{10}] + \ddot{l}_{\Lambda_2\Lambda_2}(\theta_0;D)[\Lambda_{20,n}-\Lambda_{20},\Lambda_{20,n}-\Lambda_{20}] \\ &\qquad + 2\,\ddot{l}_{\Lambda_1\Lambda_2}(\theta_0;D)[\Lambda_{10,n}-\Lambda_{10},\Lambda_{20,n}-\Lambda_{20}]\big\} + o\big(d^2(\theta_{0,n},\theta_0)\big) \\ &\lesssim \|\Lambda_{10,n}-\Lambda_{10}\|_2^2 + \|\Lambda_{20,n}-\Lambda_{20}\|_2^2 + o\big(\|\Lambda_{10,n}-\Lambda_{10}\|_2^2 + \|\Lambda_{20,n}-\Lambda_{20}\|_2^2\big) \lesssim O(n^{-q\nu}). \end{aligned}$$

The second-to-last inequality holds due to the boundedness of all second-order derivatives of the log-likelihood in Lemma A1, together with derivations similar to Lemma A3. The last inequality holds since .
Therefore, . Hence, by Theorem 1 of Shen & Wong (1994), we obtain the convergence rate for as

$$d(\hat{\theta}_n,\theta_0) = O_p\big\{\max\big(n^{-(1-\nu)/2},\, n^{-q\nu/2},\, n^{-q\nu/2}\big)\big\} = O_p\big\{n^{-\min(q\nu/2,\ (1-\nu)/2)}\big\}.$$

This completes the proof of our Theorem 1.

### 7.3 Proof of Theorem 2

We will prove the theorem by checking Assumptions A1–A6 of the general theorem in the supplementary materials. Assumption A1 can be verified by applying our Theorem 1 with and the norm. A2 also holds automatically under the model assumptions. For Assumption A3, we need to verify both the existence of and the nonsingularity of the matrix . Following arguments similar to those in Wen & Chen (2013) (pages 402–405), the existence of can be verified; it satisfies, for all ,

 P{¨lbΛ1(b0,Λ10,Λ20)[h1]+¨lbΛ2(b0,Λ10,Λ20)[h2]−¨lΛ1Λ1(b0,Λ10,Λ20)[h∗