# Inhomogeneous Markov Survival Regression Models

We propose new regression models in survival analysis based on homogeneous and inhomogeneous phase-type distributions. The intensity function in this setting plays the role of the hazard function. For unidimensional intensity matrices, we recover the proportional hazard and accelerated failure time models, among others. However, when considering higher dimensions, the proposed methods are only asymptotically equivalent to their classical counterparts and enjoy greater flexibility in the body of the distribution. For their estimation, the latent path representation of semi-Markov models is exploited. Consequently, an adapted EM algorithm is provided and the likelihood is shown to increase at each iteration. We provide several examples of practical significance and outline relevant extensions. The practical feasibility of the models is illustrated on simulated and real-world datasets.

• 20 publications
• 7 publications
10/11/2021

12/27/2021

### Survival Analysis of the Compressor Station Based on Hawkes Process with Weibull Base Intensity

In this paper, we use the Hawkes process to model the sequence of failur...
03/24/2021

### Fitting phase-type frailty models

Frailty models are survival analysis models which account for heterogene...
10/31/2021

### Phase-type mixture-of-experts regression for loss severities

The task of modeling claim severities is addressed when data is not cons...
07/22/2022

### Estimating absorption time distributions of general Markov jump processes

The estimation of absorption time distributions of Markov jump processes...
05/29/2020

### Estimation of Semi-Markov Multi-state Models: A Comparison of the Sojourn Times and Transition Intensities Approaches

Semi-Markov models are widely used for survival analysis and reliability...
08/18/2019

### Modeling Time to Open of Emails with a Latent State for User Engagement Level

Email messages have been an important mode of communication, not only fo...

## 1. Introduction

Phase–type distributions are defined as absorption times of pure-jump Markov processes on a finite state space. Many classical distributions have these representations, such as the exponential, Erlang, Coxian, hyperexponential, and any finite mixture of them. In the works or Marcel F. Neuts and co-authors, cf. for instance [neuts75, neuts1981matrix], the modern framework of phase–type as matrix distributions was systematically studied in a unified framework. Perhaps one of the most attractive features of phase–type distributions from a statistical perspective is that they dense on the distributions on the positive real line in the sense of weak convergence, cf. [asmussen2008applied]

. However, for the latter convergence to manifest itself for a given histogram, often a large number of states – or phases – are required, making the estimation procedure a complex one. Several methods of estimation have been proposed in the literature, including method-of-moments in

[bobbio2005matching, horvath2007matching]

and references therein, Bayesian inference in

[bladt2003estimation], and maximum likelihood estimation via the expectation-maximisation (EM) algorithm in [asmussen1996fitting] and for the right-censored case in [olsson1996estimation]

. The latter method considers the full likelihood arising from the path representation of a phase–type distribution and has been the most popular approach for various applications in applied probability.

Recently, a resurgence of statistical modelling using absorption times of stochastic processes has been motivated by the main drawback of phase–type distributions: they all have exponentially decaying tails. In [bladt2017fitting] scale mixtures of phase–type were considered for fitting heavy-tailed distributions. In [mfph, mml, mmml], a time transform consisting of a deterministic scaling and a random stable factor was considered, leading to heavy-tailed distributions that enjoy the Markovian calculus of phase–type distributions but in a fractional derivative framework. Finally, in [albrecher2019inhomogeneous], general deterministic transforms were considered, and an extension which includes the multivariate case was given in [albrecher2020iphfit]. The latter distributions were coined inhomogeneous phase–type distributions (IPH) since they have the interpretation of being absorption times of inhomogeneous Markov processes where time evolves according to an underlying deterministic clock.

The first exploration into survival analysis with phase–type distributions can be found in [aalen1995phase]

, although covariates were not considered. Instead, the focus was on the hazard function, which was shown to be very flexible. Indeed, it follows from the denseness property that it can approximate any hazard shape arbitrarily closely. The focus was also on the identification of a phase–type survival model, that is, associating the phases and phase transitions as physical states that an individual goes through before deceasing. A generalisation of that rationale, and which makes phase–type distributions not only practical but also physically well-motivated is to think in terms of block transitions. We may model an individual’s lifetime and evolution through a fixed number of states before death by associating blocks of phases to each one of these states and considering the phases within each block as unobservable states. This philosophy extends to statistical modelling of lifetimes using IPH distributions and is the primary motivation for their consideration in this work.

In the life sciences, particular instances of phase–type distributions, namely Coxian phase–type distributions, were considered as survival regression models which fall into the accelerated failure time (AFT) specification, cf. [mcgrory2009fully, rizk2019alternative, tang2012modeling]

and references therein. Their popularity for several decades is the product of their natural interpretation in terms of states and their statistical efficacy. Markov Chain Monte Carlo (MCMC) methods seem to be particularly popular for this type of application, but we will not follow this estimation path.

In this paper, we propose survival models and survival regression models based on general IPH distributions. Under our construction, many classical methods are special cases, such as the proportional hazards model for any parametric family, the AFT model, and the Coxian regression. But the combination of inhomogeneous time transforming and considering general sub-intensity matrices makes our models much more flexible both in terms of tails and hazard function shapes. When considering regressing on the inhomogeneity parameters, matrix generalisations of models such as the one of Hsieh, cf. [hsieh2000sample], are available. We follow the footsteps of [albrecher2020iphfit, asmussen1996fitting, olsson1996estimation] to obtain estimation procedures based on the EM algorithm. Throughout, we also give relevant parametric examples of inhomogeneity functions (equivalently, the intensity function) which can be used in practice for a wide range of applications and generalise classical specifications.

The structure of the remainder of the paper is as follows. In Section 2, we revisit existing necessary background on IPH distributions with an emphasis on the intensity function, which plays a central role in subsequent sections. We also work out some new examples relevant to the survival analysis framework and give a survival curve fit to COVID-19 positive patients in Mexico. Section 3 deals with the specification of the proportional intensities model, derivation of its main properties and asymptotics, and the proposal of a modified EM algorithm which is subsequently shown to increase the likelihood at each iteration. A simulated example illustrates the flexibility that our method enjoys. The goodness of fit analysis and some important extensions are provided as well. Section 4

applies the AFT model to IPH distributed random variables and, together with previous models, illustrates how it is possible to seamlessly alleviate badly fitting survival regression models by simply considering their matrix version. In Section

5 we outline more specialized extensions of our models, and finally conclude in Section 6.

## 2. Inhomogeneous phase–type distributions

This section revisits some of the theory of inhomogeneous phase–type distributions, and on the way, provides new examples. We do not aim to be exhaustive, and we give special emphasis on the intensity function, which will later play an important role when considering covariates. In particular, we do not expand on the mathematical representations of the general inhomogeneous intensity matrix in terms of product integrals, nor on the form of the transition matrix of the underlying inhomogeneous Markov process. For the interested reader, we refer to the seminal paper [albrecher2019inhomogeneous].

### 2.1. Basic properties

Let denote a time–inhomogeneous Markov jump process on a state space , where states are transient and state is absorbing. In other words, has an intensity matrix of the form

 Λ(t)=(T(t)t(t)00),t≥0,

where is a matrix and is a

–dimensional column vector. Here, for any time

, , where is the –dimensional column vector of ones. Let , , be a probability vector corresponding to the initial distribution of the jump process, and assume that . Then we say that the time until absorption

 τ=inf{t≥0∣Jt=p+1}

has an inhomogeneous phase–type distribution with representation and we write This setting, however, is too general for applications. Hence, we will consider the following simplified version of the intensity matrix , where is some known non–negative real function and is a sub–intensity matrix. In this case we simply write

 τ∼IPH(π,T,λ).

Note that for one returns to the time–homogeneous case, which corresponds to the conventional phase–type distributions with notation ; a comprehensive and modern account of phase–type distributions can be found in Bladt & Nielsen [Bladt2017]. It is straightforward to see that if , then there exists a function such that

 (2.1) Y∼g(Z),

where . Specifically, is defined in terms of by

 g−1(y)=∫y0λ(s)ds,

or, equivalently,

 λ(s)=ddsg−1(s).

Two recurrent quantities that will be useful are the density and survival function of , which are given by

 fY(y) = λ(y)πexp(∫y0λ(s)ds T)t, SY(y) = πexp(∫y0λ(s)ds T)e.

For further reading on inhomogeneous phase–type distributions and motivations for their use in modelling we refer to Albrecher & Bladt [albrecher2019inhomogeneous]. See also [albrecher2020iphfit] for extensions to the multivariate case and [mfph, mml, mmml] for inhomogeneous phase–type distributions with stochastic clocks.

The tail behavior of IPH distributions can be derived in a straightforward manner from the corresponding tail behavior of a PH distribution. Thus, we first recall an explicit expression for the survival function of a PH distribution. The survival function of a phase–type distribution has the following analytic expression:

 (2.2)

where

are the eigenvalues of the Jordan blocks

of , with corresponding dimensions , , and and are constants depending on and . If is the largest real eigenvalue of and is the dimension of the Jordan block of . Then, it is easy to see from (2.2) that

 (2.3) S(y)∼cyn−1e−ηy,y→∞,

where is a positive constant. That is, all phase–type distributions have exponential tails with Erlang-like second order bias term.

As illustrated in [albrecher2019inhomogeneous], a number of IPH distributions can be expressed as classical distributions with matrix-valued parameters. For the representation of such distributions we make use of functional calculus. If is an analytic function and is a matrix, define

 u(A)=12πi∮γu(w)(wI−A)−1dw,

where is a simple path enclosing the eigenvalues of ; cf. [Bladt2017, Sec. 3.4] for details. In particular, the matrix exponential can be defined in this way.

We now revisit three examples which can be used as building blocks for our regression approach, and propose an additional novel distribution which is relevant for modelling heavy tails.

###### Example 2.1 (Matrix–Pareto distributions).

Let , where and . Then

 SY(y)=π(yθ+1)Te, fY(y)=π(yθ+1)T−It1θ.

Here and . Consequently,

 λ(s)=1s+θ.

We refer to the distribution of as a matrix–Pareto distribution. It is easy to see from (2.3) that as , where is a slowly varying function and is the largest real eigenvalue of .

###### Example 2.2 (Matrix–Weibull distributions).

Let , where and . Then

 SY(y)=πeTyθe, fY(y)=πeTyθtθyθ−1.

Here and . Thus,

 λ(s)=θsθ−1.

This is called matrix–Weibull distribution, since the scale parameter of the usual Weibull distribution is now replaced by a matrix. One can show that as , where . In particular, their tail belong to the Gumbel domain of attraction, and can be lighter or heavier than exponential.

###### Example 2.3 (Matrix–Gompertz distributions).

Let , where and . Then

 SY(y)=πeTβ−1(eβy−1)e, fY(y)=πeTβ−1(eβy−1)teβy.

Here and . Thus,

 λ(s)=eβy.

This is called matrix–Gompertz distribution, and has tails much lighter than exponential. Observe that from an inhomogeneous Markov point of view, the sub-intensity matrix of the underlying jump process is given by , which for scalar gives the hazard function of the Gompertz distribution. The latter has been shown to be effective for the modelling of human mortality over long age ranges, motivating the use of as a model for the lifetime of an individual (see [macdonald2018modelling] for a recent account on mortality modelling).

We now introduce a new type of IPH distribution that is tail equivalent to a Lognormal-type distribution

###### Example 2.4 (Matrix–Lognormal distributions).

Let , where and . Then

 SY(y)=πeT(log(y+1))γe, fY(y)=πeT(log(y+1))γtγ(log(y+1))γ−1y+1.

Here and . Thus,

 λ(s)=γ(log(s+1))γ−1s+1.

We call this the matrix–Lognormal distribution. Concerning its asymptotic behaviour, we have that as . This, in fact, corresponds to a Lognormal-type tail behavior where the heaviness of the tail is controlled by the parameter of the function. The usual lognormal tail behaviour arises when .

### 2.2. Parameter estimation

An EM algorithm for maximum likelihood estimation of IPH distributions was recently introduced in Albrecher et al. [albrecher2020iphfit]. The key is to apply a parameter-dependent transformation after each step of the EM algorithm. We describe briefly such an algorithm.

Let be an i.i.d. sample of an inhomogeneous phase–type distribution with representation , where is a parametric non–negative function depending on the vector . The EM algorithm for fitting is derived under the observation that , where is defined in terms of its inverse function . The EM algorithm for fitting works as follows.

###### Algorithm 2.5 (EM algorithm for IPH distributions).

0. Initialize with some “arbitrary” .

1. Transform the data into , , and apply the E– and M–steps of the EM algorithm of Asmussen et al. [asmussen1996fitting] for PH distributions by which we obtain the estimators .

2. Compute

 ^θ =argmaxθN∑i=1log(fY(yi;^π,^T,θ)) =argmaxθN∑i=1log(λ(yi;θ)^πexp(∫yi0λ(s;θ)ds ^T)^t).

3. Assign and GOTO 1.

###### Remark 2.1.

In [albrecher2020iphfit], it was shown that the likelihood function increases at each iteration, and consequently converges to a (possibly local) maximum. Moreover, such an algorithm can be employed to estimate distributions that are not IPH in a strict sense, but that are defined as the distribution of a transformed phase-type random variable. This is the case, for instance, of the matrix-GEV distribution; see [albrecher2020iphfit] for further details.

#### 2.2.1. Censored data

In many applications, a large proportion of the data is not entirely observed, or censored. It turns out that the EM Algorithm 2.5 works much in the same way as for uncensored data. In this paper, we consider only the case of right-censoring, since it arises naturally in the context of its applications in survival analysis. However, the cases of left-censoring and interval-censoring can be treated pretty much in the same way. The main difference is that we are no longer observe exact data points , but only . In particular, by monotonicity of we have that , which can be interpreted as a censored observation of a random variable with conventional phase–type distribution. Thus, we can employ the EM algorithm in Olsson [olsson1996estimation] to derive the following adaptation of the EM algorithm for inhomogeneous phase–type distributions in the presence of censored data.

###### Algorithm 2.6 (EM algorithm for IPH distributions with right-censoring).

0. Initialize with some “arbitrary” .

1. Transform the data into , , and apply the E– and M–steps of the EM algorithm of Olsson [olsson1996estimation] by which we obtain the estimators .

2. Compute

 ^θ =argmaxθN∑i : yi uncensoredlog(fY(yi;^π,^T,θ))+N∑i : yi censoredlog(SY(yi;^π,^T,θ)).

3. Assign and GOTO 1.

Similarly, as in [albrecher2020iphfit], one can show that the likelihood function increases for each iteration. In what follows, we will provide a proof of a more general algorithm that includes covariates, thus generalizing the previous algorithm.

###### Example 2.7.

We consider a respiratory disease dataset that can be found in the following link https://datos.gob.mx/busca/dataset/informacion-referente-a-casos-covid-19-en-mexico, which contains information from the Epidemiological Vigilance System of Viral Respiratory Diseases, of the government of Mexico.

The dataset is updated regularly, and the version used for this article is the one downloaded on October 11th, 2020. At the time of analysis, the data consisted of observations related to subjects deemed suspicious of having a viral respiratory disease upon arrival to the health sector’s medical units. Subsequently, the subjects were hospitalised or left ambulant. The dataset does not contain patients’ evolution, but it does record the time of entry, time of leaving, and possible death.

We consider only patients that were COVID-19 positive, that is, subjects. The variable measuring duration of hospitalisation can help us estimate the disease’s mortality on the patients that managed to arrive in the health sector.

Around

of the subjects died during their hospitalisation. The stay duration has an incredibly difficult shape to fit, since a large number of patients do not stay long. We consider fitting the following parametric models: Weibull, Pareto, Log-Logistic, Matrix–Pareto. For the latter, a Coxian structure for the a 3-dimensional matrix

is chosen. The implementation is summarised in Figure 2.1

. We see that a simple matrix enlargement for the intensity of the underlying distribution can help to capture the shape for upper and lower quantiles alike correctly. In short, it is not that the hazard is not asymptotically Pareto-like, but rather the optimisation was trying to make a global compromise in the scalar case.

An in-depth analysis of this dataset will be the subject of future research.

## 3. The proportional intensities model

This section considers a generalisation of the proportional hazards model for inhomogeneous phase–type distributions. The main idea is that the intensity will be the function which varies proportionally with the covariates.

Recall that the hazard function

is given by

 hY(t)=fY(t)SY(t).

We propose a regression model employing IPH distributions with representation and considering proportional intensities among subjects. Specifically, we assume that is a parametric non–negative function depending on the vector

and incorporate the predictor variables

by considering

 (3.1) λ(t∣X,θ)=λ(t;θ)exp(Xβ),

and we call this model a proportional intensities model. The interpretation in terms of the underlying inhomogeneous Markov structure is that time is changed proportionally among different subjects before time-transforming the associated PH distribution into an IPH one. In the sequel we will indistinctly denote by for a vector of covariates at the population level, or a matrix of covariates for finite samples. In the latter case, denotes the -th row of the matrix, corresponding to the -th individual in the sample.

###### Remark 3.1.

The role of the exponential function in the above specification is, in a sense, merely for illustration purposes. In the remainder of the paper it is possible to consider models where the multiplication with – be it in the intensity of on the variable itself – can be replaced with for any non-negative function (monotonicity is not required).

The density and survival function of a random variable with distribution

 IPH(π,T,λ(⋅;X,θ))

are given by

 f(y) = exp(Xβ)λ(y;θ)πexp(exp(Xβ)∫y0λ(s;θ)ds T)t, S(y) = πexp(exp(Xβ)∫y0λ(s;θ)ds T)e.

Note that the function in the representation of a IPH distribution is not the same as the hazard function. In fact, the hazard function is given by

 h(t)=λ(t)πexp(∫t0λ(s)ds T)tπexp(∫t0λ(s)ds T)e.

This distinction is subtle, but leads to increased flexibility in the interplay of hazards between subjects. A closely related function is the cumulative hazard function , which is given by

 H(t) =∫t0h(s)ds =−log(S(t)).

Thus, for an distribution, takes the form

 H(t) =−log(πexp(∫t0λ(s)ds T)e).
###### Remark 3.2.

In the proportional hazards model one assumes that the hazard function is such that

 h(t|X)=h(t)exp(Xβ).

The part of is sometimes called the baseline hazard function or the hazard function for a standard subject, corresponding to a subject with . Common parametric models for employed in this context are the exponential, Weibull and Lognormal distributions.

One of the advantages of the proportional intensities model is that the implied hazard functions between different subjects can deviate from proportionality in the body of the distribution (for ), as can be observed in Figure 3.1. This addresses one of the common practical drawbacks of the proportional hazards model. If we have that , with constant. More generally, is asymptotically equivalent to the hazard function in the upper tail, as it is shown in the following result.

###### Proposition 3.1.

Let , then the hazard function and cumulative hazard function of satisfy, respectively,

 h(t) ∼cλ(t),t→∞, H(t) ∼kg−1(t),t→∞,

where are positive constants.

###### Proof.

We have that if , then has a representation on the same form as (2.2) (cf. [Bladt2017, Section 4.1.1] for details), which only differs on the multiplicative constants. Thus,

 SY(t) ∼c1[g−1(t)]n−1e−η[g−1(t)], fY(y) ∼c2[g−1(t)]n−1e−η[g−1(t)]ddt[g−1(t)]

as , where is the largest real eigenvalue of , is the dimension of the Jordan block of and are positive constants. The first expression then follows by recalling that . The second one can be shown in an analogous manner and is omitted for brevity. ∎

Maximum likelihood estimation for the proportional intensities model is possible via an adapted EM algorithm. Let and be defined in terms of their inverse functions and

 g−1(y|X,θ)=∫y0λ(s|X,θ)ds=g−1(y;θ)exp(Xβ),

respectively. Note also that

 (3.2) g(y|X,θ)=g(yexp(−Xβ);θ).

In particular . Then, the EM algorithm for fitting the proportional intensities survival regression model works as follows.

###### Algorithm 3.2 (EM algorithm for IPH survival regression models).

0. Initialize with some “arbitrary” .

1. Transform the data into , , and apply the E– and M–steps of the EM algorithm of Olsson [olsson1996estimation] by which we obtain the estimators .

2. Compute

 (^θ,^β) =argmax(θ,β)N∑i : yi uncensoredlog(fY(yi;^π,^T,θ,β))+N∑i : yi censoredlog(SY(yi;^π,^T,θ,β)).

3. Assign and GOTO 1.

###### Remark 3.3.

The E-steps require the computation of matrix exponentials which can be performed in multiple ways (see [moler1978nineteen]). In Asmussen et al. [asmussen1996fitting] this is done by converting the problem into a system of ODEs, which are then solved via a Runge-Kutta method of fourth order (a C implementation, called EMpht, is available online [olsson1998empht]). We employed a new implementation in Rcpp of such a method for the illustration here and everywhere else in the paper. An important consideration is that the Runge-Kutta method requires the sample to be ordered. This is relevant since the transformed data of the above algorithm, will in general not be ordered anymore, given that covariate information varies across subjects. Thus, intermediate ordering steps have to be introduced before proceeding to the numerical implementation of the EM algorithm of [olsson1996estimation] by the Runge–Kutta method. The re-ordering also modifies the ties in the data, which must be accounted for. Additionally, the evaluation of the likelihood in step 2 may be done using by-products of step 1, which implies that re-ordering and re-weighing will also be necessary at each iteration within the optimisation inside step 2. These additional steps pose a significant computational burden relative to the non-covariate algorithm. The uniformization method ([albrecher2020iphfit]) is an alternative method, but the comparison of the two methods’ performance is beyond this paper’s scope.

###### Proposition 3.3.

Employing Algorithm 3.2, the likelihood function increases at each iteration. Consequently, it is guaranteed to converge to a (possibly local) maximum.

###### Proof.

Note that the data log-likelihood is given by

 l(π,T,θ,β;y,X) =N∑i : yi uncensoredXiβ+log(λ(yi;θ))+log(πexp(exp(Xiβ)g−1(yi;θ)T)t) +N∑i : yi censoredlog(πexp(exp(Xiβ)g−1(yi;θ)T)e).

Consider parameter values after the -th iteration. In the -th iteration, we first obtain in 1. so that

 N∑i : yi uncensoredlog(πnexp(exp(Xiβn)g−1(yi;θn)Tn)tn) +N∑i : yi censoredlog(πnexp(exp(Xiβn)g−1(yi;θn)Tn)e) ≤N∑i : yi uncensoredlog(πn+1exp(exp(Xiβn)g−1(yi;θn)Tn+1)tn+1) +N∑i : yi censoredlog(πn+1exp(exp(Xiβn)g−1(yi;θn)Tn+1)e).

Then, adding the remaining terms which are unchanged by the EM step 1., we obtain

 l(πn,Tn,θn,βn;y,X)≤l(πn+1,Tn+1,θn,βn;y,X).

Hence, by 2.,

 l(πn,Tn,θn,βn;y,X) ≤l(πn+1,Tn+1,θn,βn;y,X) ≤sup(θ,β)l(πn+1,Tn+1,θ,β;y,X)=l(πn+1,Tn+1,θn+1,βn+1;y,X),

proving that at each iteration the log-likelihood increases. ∎

Next we consider two important distributions which in presence of covariates, generalize the commonly employed models for parametric proportional hazards modeling: the exponential and the Weibull baseline hazard models.

###### Example 3.4 (Phase–type regression).

We consider , thus for all . The survival function takes the form

 S(y)=πexp(exp(Xβ)Ty)e.

Note also that , where . In particular, this implies that

 E(Y)=exp(−Xβ)E(Z)=exp(−Xβ)π(−T)−1e.
###### Example 3.5 (Matrix-Weibull regression).

We consider , thus for all . The survival function takes the form

 S(y)=πexp(exp(Xβ)Tyθ)e.

Note also that , where . The expression for the expected value is then given by

 E(Y)=exp(−Xβ/θ)E(Z1/θ)=exp(−Xβ/θ)Γ(1+1/θ)π(−T)−1/θe,

cf. [Bladt2017] for the derivation of non-integer moments of phase–type distributions.

###### Remark 3.4.

Note that with , one recovers the conventional proportional hazards models. Thus, our model can be considered a generalization of this type of model. Moreover, both cases have the interpretation that the expected values are proportional among subjects. In other words, for arbitrary matrix dimensions, when the baseline hazard is phase-type or matrix-Weibull, the proportional intensities model is equivalent to the accelerated failure time model. It is also not hard to see that homogeneous functions simultaneously satisfy both models. This feature is in agreement with the fact that for the proportional hazards model, the exponential and Weibull distributions also satisfy the accelerated failure time model, as we will discuss in section 4.

### 3.1. Goodness of fit

As in the conventional proportional hazards rate model, the quality of the fit can often be checked easily using graphical methods. For this purpose, nonparametric estimates of and are primary tools. The goodness of fit of the proportional intensities model can be done by considering a generalisation of the Cox–Snell residuals, as we now explain.

Define the residuals of a proportional intensities model by

 ri=−log(πexp(exp(Xiβ)∫yi0λ(s)ds T)e),i=1,…n.

Under right-censoring completely at random and the joint null hypothesis

 H0:Yi∼IPH(π,T,exp(Xiβ)λ),i=1,…,n,

the sample

 {(r1e1),(r2,e2),…,(rn,en)}

constitutes a right-censored unit exponential dataset. Hence, in practice we may consider the estimated versions , and their Kaplan-Meier survival curve

 ˆS(r)=∏i: ^ri≤r(1−dini),

where is the number of residuals tied at the value , and are all the residuals that have not been fully observed yet. The function

should fall within the confidence bounds, using that the variance can be estimated, for instance, using Greenwood’s formula

 ˆVar(ˆS(r))=ˆS(r)2∑i:^ri≤rdini(ni−di).

Alternatively, the plot of versus should align with a line of intercept zero and slope one. The Nelson-Aalen estimator can be used as an alternative to the Kaplan-Meier estimator (they are asymptotically equivalent).

### 3.2. Covariate-dependent inhomogeneity functions

Within the above setup, it is possible to add dependence of on . That is, we may consider the model

 (3.3) λ(t∣X,γ)=λ(t;exp(Xγ))exp(Xβ),

where is any uni-parametric form of the intensity. The residuals for assessing goodness of fit then become

 ri=−log(πexp(exp(Xiβ)∫yi0λ(s;exp(Xiγ))ds T)e),i=1,…n.

In this general setting one can also consider the replacement of with for any function is possible without any increase of computational complexity. Extensions to the multi-parameter specifications are straightforward to define.

A modified EM algorithm akin to Algorithm 3.2 applies for this model, where we also optimise with respect to in step 2. We omit the details for brevity.

### 3.3. An example

In this subsection we provide an example which illustrates the practical feasibility of the proportional intensities model. The data is simulated. A second and well-known dataset from the Veterans’ Administration Lung Cancer (which is publicly available in the R package survival) will be studied in the next section, in conjunction with the accelerated failure time model.

Thus, consider the following simple two-group setup. We study two groups of equal size, each. For the first group,

 Zi∼IPH⎛⎜⎝(1/4,1/2,1/4),⎛⎜⎝−10000−1000−1/10⎞⎟⎠,32s1/2⎞⎟⎠

while for the second

 Zi∼PH⎛⎜⎝(1/4,1/2,1/4),⎛⎜⎝−20000−2000−2/10⎞⎟⎠⎞⎟⎠.

Subsequently, we randomly censor the by exponential variables of rate by

 Yi=min{Zi,Ei},ei=1{Yi=Zi},i=1,…,1000,

leading to roughy right-censored observations. Notice that in the above setting the inhomogeneity function also changes with the covariates, such that we need to implement the fitting procedure associated to the model 3.3. Additionally, for comparison, we fitted two misspecified models: a standard Weibull proportional hazards model, and the scalar () version of model 3.3. The resulting empirical versus fitted survival curves for each group, as well as the hazard ratio between the two groups is given in Figure 3.2. We observe that only the correctly specified model recovers the wavy shape of the theoretical hazard ratio, and within each group the survival curve is better estimated in the tails of the distribution. Formally, the goodness of fit is evaluated in Figure 3.3. In the latter plot, we use the Kaplan-Meier and Nelson-Aalen non-parametric estimators of the survival curve of the residuals to visually assess standard exponential behaviour. We observe that only the correctly specified model provides an adequate fit in the sense of staying within the confidence bounds implied by Greenwood’s formula for the first approach, and by aligning itself with the identity for the second one.

## 4. Accelerated failure time models

The accelerated failure time (AFT) model specifies that the predictors act multiplicatively on the failure time or additively on the log failure time. The effect of a predictor is to alter the rate at which a subject proceeds along the time axis, i.e., to accelerate the time to failure. Specifically, letting be a random variable with distribution , the model states that

 (4.1) log(Y)=Xβ+σW0.

Thus, the survival function of is given by

 S(y)=S0((log(y)−Xβ)/σ).

Note that (4.1) is equivalent to

 (4.2) Y=exp(Xβ)eσW0.

Thus, an alternative way to represent is throughout the random variable . Moreover, denoting for the hazard function of , we have that

 hY(t)=h0(exp(−Xβ)t)exp(−Xβ)

Whereas in a proportional hazards model, the covariates act multiplicatively on the hazard, in an AFT model the covariates act multiplicatively on time. Common distributions used in the context of AFT modeling, include the Weibull, loglogistic and lognormal distribution. Moreover, the Weibull (and exponential) distribution satisfies the proportional hazard rate and the AFT model. Estimation of AFT models is typically done via MLE.

We now specify IPH distributions in the AFT model. Thus, we assume that is IPH distributed with representation , where is a parametric non–negative function depending on the vector . Then , where and is defined in terms of its inverse function . In particular,

 Yd=exp(Xβ)eσg0(Z;θ)

and

 E(Y)=exp(Xβ)E(eσg0(Z;θ))

Note that if we define , then . For our purposes, it is convenient to work with this last representation, i.e., we consider where is such that , thus making our model .

###### Example 4.1.

(Matrix–Weibull) We know that the matrix–Weibull with parameters , and has representation , where . Note also that has distribution of the same type as the matrix–Gumbel (see [albrecher2020iphfit]). Now consider the AFT model

 log(Y)=Xβ+σlog(Z).

Then

 Y=exp(Xβ)Zσ.

which corresponds to a reparametrization of the proportional intensities model. Thus, the matrix–Weibull satisfies the proportional intensities and the AFT model, in agreement with its scalar counterpart.

More generally, if is homogeneous of degree , i.e., , then

 g(Zexp(−Xβ);θ)=exp(−αXβ)g(Z;θ),

where . This implies (see 3.2) that satisfies the proportional intensities and the AFT model.

We now introduce matrix versions of the logistic and loglogistic distributions.

###### Example 4.2.

(Matrix–logistic) For with and .

 SY(y)=πexp(Tlog(1+exp(y−μν)))e.

This distribution can be interpreted as a matrix version of the logistic distribution. Note that this distribution is not IPH in the strict sense, but it is defined as the distribution of a transformation of a phase-type distributed random variable.

###### Example 4.3.

(Matrix–Loglogistic) For with .

 SY(y)=πexp(Tlog(1+exp(y)))e.

We also compare the estimated models against accelerated failure time specifications, which (except for the Weibull case) also account for non-proportional hazards.

The data consist of observations and variables coming from a randomised trial where two groups are given different treatment regimens for lung cancer. The variables of interest for our analysis are survival time, censoring status (binary), treatment (binary), prior therapy (binary), and Karnofsky performance score (from 0 to 100). Previous studies (cf. for instance, the vignette https://cran.rstudio.com/web/packages/survival/vignettes/timedep.pdf, pp. 15–22) have shown that only the Karnofsky score violates the proportional hazards assumption. Common ways to address this issue are stratifying the dataset into time-groups and considering continuous time-dependent regression coefficients. The moral of that story ends up being that the coefficient for treatment becomes significant when improving the model’s fit. Consequently, we presently focus only on the goodness of fit of four competing models introduced in the text. The latter measured by their log-likelihood and residuals.

We consider two classical models: a) Weibull proportional hazards model; b) Loglogistic accelerated failure time model. The distributions were selected according to performance. Given the modest size of the dataset, we will avoid, in the name of parsimony, large matrices for the IPH-based models. Thus, we present two models based on with a Coxian structure: c) Matrix-Weibull proportional intensities model; d) Matrix-Lognormal accelerated failure time model. We also fitted a Matrix-Loglogistic distribution, but it performed better than its scalar counterpart only for large matrices and is thus omitted.

The results are summarized in the table below and in Figure 4.1.

We observe that models b), c) and d) all seem to improve the fit to where the residuals are within bounds. Visually, there appears to be not much difference between the three last models, and the BIC reflects this as well. The log-likelihood and AIC are best for model c), which is both a proportional intensities and an AFT model. The take-away is that when faced with a dataset where the proportional hazards assumption does not seem to hold, it is worthwhile exploring the enlargement of the underlying parameter space to the matrix domain. In this enlarged space, the proportional intensities model allows for a more flexible specification of the hazard, with proportionality holding only asymptotically.

The AIC and BIC computed here for the matrix-distributions works well for comparison purposes given the low dimensions of the distributions. However, in general one should proceed with caution (especially in high dimensions) due to the well-known identifiability issue for PH distributions, which carries over to IPH survival models. Namely, different dimension and parameter configurations may lead to very similar density shapes.

## 5. Extensions and future research

The proportional intensities model can be modified to allow for more flexible models. We presently show how these extensions can be carried out similarly as in the proportional hazards model. For instance, we propose two immediate extensions: a stratified version and an additive model. We also discuss other possible lines of research, such as time-dependent covariates.

### 5.1. Stratified model

For a stratified version, we assume categories and we include different baseline intensities for each category. For illustration purposes, we assume that the effect of the covariates is the same multiplicative constant on the intensities across all groups, however different constants among the groups can be employed as well. More specifically, the intensities for the group are

 λ(t;X,θm)=λm(t;θm)exp(Xβ),

where , is the baseline intensity for the group , . To describe how maximum likelihood estimation via an EM algorithm similar to algorithm 3.2 can be performed in this case, we consider a sample , where observations corresponds to the group, , such that . Thus, the main changes are that the data should be transformed into , , where , , and that the densities and survival functions should be adapted to match the separation of the data into groups.

### 5.2. Parametric additive intensity models

We consider a parametric additive model similar to the one proposed to Aalen [aalen1989linear] employing IPH distributions. More specifically, we let

 λ(t;X,θ)=β0(t;θ0)+β1(t;θ1)X1+⋯+βd(t;θd)Xd,

where and is a parametric function depending on , . Here, represents the baseline intensity, which corresponds to a subject with all covariates equal to zero. Note that

 ∫y0λ(s;X,θ)ds=∫y0β0(s;θ0)ds+∫y0β1(s;θ1)X1ds+⋯+∫y0βd(s;θd)Xdds.

Thus, denoting , we have that

 g−1(y;X,θ)=B0(y;θ0)+B1(y;θ1)X1+⋯+Bd(y;θd)Xd.

Moreover, maximum likelihood estimation can be performed by employing a modified Algorithm 3.2 with the main difference being that the data is transformed into , , with as above. The density and survival function should be modified accordingly as well.

### 5.3. Time dependent covariates

We allow the covariates to depend on time, i.e., we consider . Thus, a proportional intensities model that allows for time dependent covariates would assume that

 λ(t;X,θ)=λ(t;θ)exp(X(t)β).

The main difficulty of this type of model is that

 g−1(y;X,θ)=∫y0λ(s;θ)exp(X(s)β)ds,

in general does not have an explicit solution. This is relevant since the estimation methods introduced in this paper make use of transformed data via . Possible solutions are:

• Look for specific structures that allow for explicit representations.

• Approximate numerically. This approach may come to a high computational cost, especially considering that the transformation must be applied in each iteration of the EM algorithm.

An in-depth exploration of these approaches represents a future line of research.

## 6. Conclusion

We have introduced several survival models which are based on absorption times of inhomogeneous Markov processes. Their interpretation in terms of underlying phase-switching is very natural for time-to-event data, and their denseness makes them exceptional candidates to consider for real-life applications. We have shown, using recent and new matrix-based distributions, that misspecified survival models can sometimes be just one matrix away from providing a good fit. The practical implementation of the methods in this paper relies on (a modified version of) the EM algorithm, which for large matrices and datasets can become significantly slow. It is our experience that the use of an inhomogeneous intensity function in the underlying stochastic jump process can alleviate the need for a large number of phases, thus making estimation feasible. Finally, we have outlined some extensions which further refine the regression models of this paper; their systematic study and application to large-scale data is a challenging and promising direction.

Acknowledgement. MB would like to acknowledge financial support from the Swiss National Science Foundation Project 200021_191984.