DeepAI

# Stochastic frailty models for modeling and forecasting mortality

In many countries life expectancy gains have been substantially higher than predicted by even recent forecasts. This is primarily due to increasing rates of improvement in old-age mortality not captured by existing models. In this paper we show how the concept of frailty can be used to model both changing rates of improvement and the deceleration of mortality at old ages, also seen in data. We present a "fragilization" method by which frailty can be added to standard mortality models. The aim is to improve the modeling and forecasting of old-age mortality while preserving the structure of the original model and the underlying stochastic processes. Estimation is based on a general pseudo-likelihood approach which allows the use of essentially any frailty distribution and mortality model. We also consider a class of generalized stochastic frailty models with both frailty and non-frailty terms, and we describe how these models can be estimated by the EM-algorithm. The method is applied to the Lee-Carter model and a parametric time-series model. For both applications the effect of adding frailty is illustrated with mortality data for US males.

02/09/2018

### Projecting UK Mortality using Bayesian Generalised Additive Models

Forecasts of mortality provide vital information about future population...
01/04/2023

### Mortality modeling at old-age: a mixture model approach

This paper presents a novel approach for modeling mortality rates above ...
01/31/2021

### Time-Series Forecasting of Mortality Rates using Deep Learning

The time-series nature of mortality rates lends itself to processing thr...
03/01/2018

### Mortality data reliability in an internal model

In this paper, we discuss the impact of some mortality data anomalies on...
05/14/2019

### Shifting attention to old age: Detecting mortality deceleration using focused model selection

The decrease in the increase in death rates at old ages is a phenomenon ...
09/24/2022

### Learning and forecasting of age-specific period mortality via B-spline processes with locally-adaptive dynamic coefficients

The overarching focus on predictive accuracy in mortality forecasting ha...
03/18/2021

### Deepening Lee-Carter for longevity projections with uncertainty estimation

Undoubtedly, several countries worldwide endure to experience a continuo...

## 1 Introduction

Mortality rates have been steadily declining in most of the industrialized world throughout the 20th century and there are no signs of improvements decelerating. On the contrary, populations where mortality rates are already very low still experience rates of improvement of the same, or even higher, magnitude than historically.

A great deal of models for modeling and projecting mortality rates have been proposed in recent years. However, the model formulated by leecar92 is still the most widely used. The Lee-Carter model describes the age-specific death rates (ASDRs) by the log-bilinear relation

 logμ(t,x)=ax+bxkt, (1)

where and are age-specific parameters and is a time-varying index. Projections are made by estimating the parameters of the model from observed data and extrapolate the time index by standard time series methods. Typically is modeled as a random walk with drift. The combination of a linear forecast of and structure (1) implies that ASDRs are forecasted to improve by the same age-specific factor each year.

The Lee-Carter model has gained widespread popularity due to its simplicity and ease of interpretation. Essentially, ASDRs are forecasted to improve by the average rate of improvement over the estimation period. However, the assumption that rates of improvement are constant over time is not generally satisfied. Indeed, the mortality experience of several countries has shown increasing mortality improvement rates over time, in particular for older age groups, see e.g. leemil01; booetal02; renhab03; bon05. Consequently, Lee-Carter forecasts have in many situations underestimated the gains in old-age mortality.

Left panel of Figure 1 shows ASDRs for US males from 1950 to 2010 based on data from the Human Mortality Database (www.mortality.org). The death rates are plotted on a logarithmic scale and increasing rates of improvement can be observed by the concavity, rather than linearity, of the age-specific curves. The consequence for Lee-Carter forecasts is exemplified in the right panel of Figure 1 which shows actual and Lee-Carter forecasted period remaining life expectancy of 60-year old US males.222The period (remaining) life expectancy is a summary measure for the mortality profile of a population in a given year. It is calculated on the basis of the age-specific mortality rates for that year. Assuming death rates continue to decline the life expectancy experienced by a given cohort will be higher. This quantity is termed the cohort life expectancy. Due to increasing rates of improvement the forecasts increase when a later estimation period is used. Still, all historic forecasts are below the actual life expectancy experience.

In the light of changing rates of mortality improvement it has been suggested to fit the Lee-Carter model to shorter periods of data for which death rates comply better with the log-linearity assumption, see e.g. leemil01; tuletal00; booetal02. In an environment of increasing rates of improvement the use of shorter, more recent data periods can indeed result in better forecasts. However, future increases in old-age mortality improvement rates — beyond what has been seen historically — cannot be captured by this method.

### 1.1 Stochastic frailty models

In this paper we investigate the use of frailty theory to address the pattern of changing rates of improvement. Frailty theory rests on the assumption that cohorts are heterogeneous and that some people are more susceptible to death (frail) than others. Frail individuals tend to die sooner than stronger individuals leading to old cohorts being dominated by low mortality individuals. In effect, the selection mechanism causes the cohort intensity to "slow down", i.e. to increase less rapidly than the individual intensity. This has been used to explain the logistic form of old-age mortality observed in data, see e.g. tha99.

In the model introduced by vauetal79 frailty enters as an unobservable, positive random quantify acting multiplicatively on an underlying baseline mortality intensity. The observable (aggregate) mortality intensity takes the form

 μ(t,x)=E[Z|t,x]μbase(t,x), (2)

where denotes mean frailty at time for age , and is the underlying baseline intensity for an individual with frailty one. We assume without loss of generality that mean frailty at birth is one. For young ages with low mortality mean frailty is close to one, while for old ages mean frailty gradually decreases as selection takes effect.

Frailty theory offers an explantation to the observed change in improvement rates of old-age mortality and it suggests that we might expect to see higher rates of improvements in the future. As the probability of attaining a given (old) age increases over time the selection effect weakens and mean frailty increases towards one. This increase partly offsets the underlying improvements in individual baseline intensity, cf. (

2). Initially when selection is high old-age mortality rates will improve slowly (or even increase), but over time as selection weakens improvements in old-age mortality will gradually increase towards baseline improvement rates.

In this paper we will use structure (2) as the basis for mortality modeling. We will show how a given model for aggregate mortality, e.g. the Lee-Carter model, can be "fragilized" by using it as a model for the underlying individual mortality instead. This procedure will enable us to model logistic-type old-age mortality and changing rates of improvement while preserving the structure of the original model and the underlying time-series dynamics. As an example, a "fragilized" Lee-Carter model takes the form

 μ(t,x)=E[Z|t,x]exp[ax+bxkt]. (3)

In Section 4.2 we will compare this model to the ordinary Lee-Carter model.

We show how to perform maximum likelihood estimation and how to forecast in models of form (2). Estimation will be based on a Poisson pseudo-likelihood which allows for general specifications of the frailty distribution, e.g. the generalized stable laws of hou86. We also generalize (2) to include an additive (non-frailty) term and show how models of this form can be estimated by the EM-algorithm. Methods for joint maximum likelihood estimation of the frailty distribution and the baseline intensity are derived based on the (conditional) likelihood for fixed frailty distribution.

In contrast to the typical use of frailty theory in mortality modeling our approach does not rely on a specific parametric model. Indeed, we show how essentially any combination of frailty distribution and parametric or semi-parametric model for baseline intensity can be estimated. The only requirement is that the underlying baseline mortality model can be estimated (by maximum likelihood) and that the Laplace transform of the frailty distribution is known.

Frailty theory is well-established in biostatics and survival analysis and several monographs are devoted to the topic, e.g. wie10; han11. In demographic and actuarial science frailty models are also known as heterogeneity models. They have been used in mortality modeling to fit the logistic form of old-age mortality, see e.g. wanbro98; buthab04; oli06; caibladow06; spretal13, and to allow for overdispersion in mortality data, cf. lietal09.

The present use of frailty theory is closest in spirit to jarkry11 in which a deterministic frailty model is used to describe the trend in international mortality. However, the methods developed in this paper greatly extend that work by incorporating frailty into any standard stochastic mortality model.

### 1.2 Outline

The rest of the paper is organized as follows. In Section 2 we give a brief introduction to frailty theory to establish the fundamental relation between cohort and individual mortality on which our approach relies. Section 3 is the main theoretical chapter in which stochastic frailty models are formalized and estimation and forecasting discussed. This is followed in Section 4 by applications to a parametric time-series model and the Lee-Carter model. Section 5 contains a generalization of stochastic frailty models and explains how they can be estimated by the EM-algorithm. Finally, Section 6 offers some concluding remarks.

## 2 Frailty theory

In this section we go through the basic theory and establish relations to be used in the following. For ease of exposition we consider only a single birth cohort in a continuous-time model. Later on we will add a time parameter. As our starting point we take the frailty model of vauetal79 in which frailty is modeled as an unobservable, non-negative quantity acting multiplicatively on an underlying, baseline (mortality) intensity. The intensity, or death rate, of an individual of age with frailty is given by

 μ(x;z)=zμbase(x), (4)

where is the baseline intensity describing the age effect. The corresponding survival function, i.e. the probability of the person surviving to age , is

 S(x;z)=exp(−∫x0μ(u;z)du)=exp(−zI(x)), (5)

where denotes the integrated baseline intensity,

 I(x)=∫x0μbase(u)du. (6)

The survival function of the cohort, i.e. the expected fraction of the cohort surviving to age , is then

 S(x)=E[S(x;Z)]=E[exp(−ZI(x))]=L(I(x)), (7)

where denotes the Laplace transform of the frailty distribution (at birth),

 L(s)=E[exp(−sZ)]. (8)

Assuming sufficient regularity the (conditional) mean frailty of the cohort at age is given by

 E[Z|x]=E[ZS(x;Z)]S(x)=E[Zexp(−ZI(x))]L(I(x))=−L′(I(x))L(I(x))=ν′(I(x)), (9)

where denotes differentiation and . It follows that the cohort intensity can be written as

 μ(x)=−ddxlogS(x)=ν′(I(x))I′(x)=E[Z|x]μbase(x). (10)

The derivations above relating mean frailty and cohort intensity to the frailty distribution and baseline intensity provide the usual tools for frailty modeling. For later use we make the additional observation that mean frailty can also be expressed in terms of the integrated cohort intensity,

 H(x)=∫x0μ(u)du. (11)

The cohort survival function (7) provides the following link between and ,

 S(x)=exp(−H(x))=L(I(x)), (12)

from which we obtain , and thereby . By (9) and (10) we then get the expression

 μ(x)=ν′(ν−1{H(x)})μbase(x). (13)

The idea later on is to replace by a non-parametric estimate obtained from data, i.e. the integrated observed (empirical) death rates. The substitution disentangles the frailty distribution and the baseline intensity which greatly simplifies the estimation procedure.

In order to apply the theory we need to identity suitable frailty distributions and their Laplace transforms. Typically, the baseline intensity contains a scale parameter in which case it is customary to assume that the frailty distribution has mean one to ensure identifiability. We will also make this assumption here.

### 2.1 Gamma and inverse Gaussian frailty

The frailty distributions most often employed in mortality modeling are the Gamma and inverse Gaussian distributions, see e.g.

vauetal79; hou84; manetal86; buthab04; jarkry11; spretal13. Below we state their Laplace transforms and comment on their use.

When frailty is Gamma-distributed with mean one and variance

the Laplace transform and mean frailty are given by

 L(s) =(1+σ2s)−1/σ2, (14) E[Z|x] =(1+σ2I(x))−1=exp(−σ2H(x)). (15)

It is well-known that Gamma frailty in combination with Gompertz or Makeham baseline intensity leads to a cohort intensity of logistic type, see e.g. Example 2.1 of jarkry11 for details.333A Gompertz intensity is of the form , while a Makeham intensity is of the form . A Makeham intensity is also sometimes referred to as a Gompertz-Makeham intensity. This is also known as the Perks model, per32, and it has been found to describe old-age mortality very well, see e.g. thaetal98; tha99; caibladow06. Gamma-distributed frailty is also mathematically tractable and allows explicit calculations of many quantities of interest, e.g. frailty among survivors at a given age is Gamma-distributed with known scale and shape parameters, cf. vauetal79.

When frailty follows an inverse Gaussian distribution with mean one and variance we have for the Laplace transform and mean frailty

 L(s) =exp[1−√1+2σ2sσ2], (16) E[Z|x] =(1+2σ2I(x))−1/2=(1+σ2H(x))−1. (17)

Similar to Gamma frailty inverse Gaussian frailty enjoys a number of nice theoretical properties, e.g. the frailty distribution among survivors of a given age is again inverse Gaussian and an explicit formula for the density exists, see hou84. hou84 shows further that for inverse Gaussian frailty the coefficient of variation among survivors is decreasing with age making the cohort more homogeneous as it gets older, while for Gamma frailty the coefficient of variation is constant. Thus the impact of frailty is smaller for inverse Gaussian frailty than for Gamma frailty. From (15) and (17) we also see directly how Gamma frailty "slows" down baseline intensity more than inverse Gaussian frailty.

It is generally found that Gamma frailty and the associated logistic form provides a better description of old-age mortality than inverse Gaussian frailty, see e.g. buthab04; spretal13. Furthermore, abbber07 show that for a large class of initial frailty distributions the frailty distribution among survivors converges to a Gamma distribution as the integrated intensity tends to infinity. Thus overall the Gamma distribution is a good default choice.

That being said, it is not immediately clear that this conclusion carries over to the case where frailty is introduced to model increasing rates of improvement over time.

### 2.2 Positive stable frailty

hou86 introduced a family of generalized stable laws which include the Gamma and inverse Gaussian distributions as special cases. The family is obtained by exponential tilting of stable densities with index

. The stable laws themselves only have moments of order strictly less than

, while moments of all orders exist for the exponentially tilted densities. From the original three-parameter family we obtain a two-parameter family by imposing the condition that mean frailty is one.

When follows a generalized stable law with index , mean one and variance the Laplace transform and mean frailty are given by444The stated formulaes are obtained from hou86 using the parametrization and .

 L(s) =exp[1−αα{1−[1+σ2s/(1−α)]ασ2}], (18) E[Z|x] =[1+σ21−αI(x)]α−1=[1+α1−ασ2H(x)](α−1)/α. (19)

We note that for (defined by continuity) the generalized stable law specializes to the Gamma distribution, while for it specializes to the inverse Gaussian distribution. While Gamma and inverse Gaussian densities are available in closed form, generally only series representations of (tilted) stable densities exist, cf. hou86 and Lemma XVII.6.1 of felII.

In either case, we have simple closed form expressions for mean frailty in terms of integrated baseline and cohort intensity which is all we need for estimation and forecasting purposes. Moreover, the fact that the parametric family spans the two most commonly used frailty distributions, Gamma and inverse Gaussian, opens for a unified approach for joint estimation of frailty and baseline parameters.

The family has been further generalized by aal88; aal92 to include also negative . Negative values of

correspond to compound Poisson distributions with positive probability of zero frailty. This is important for some applications when modeling the time to non-certain events, e.g. divorce or unemployment. The generalization is of no use in our context where it implies immortality for some individuals. However, it could be useful for course-specific mortality modeling as demonstrated by

aal88.

## 3 Stochastic frailty models

We first include frailty in a continuous-time model spanning multiple birth cohorts and show the effect on the dynamics of age-specific death rates over time. Then we add frailty to stochastic mortality models and discuss estimation and forecasting of the resulting models.

Again, we start with the multiplicative frailty model. With a slight abuse of notation we will reuse the notation of Section 2 with an added time parameter. Assume that the intensity for an individual of age at time with frailty has the form

 μ(t,x;z)=zμbase(t,x), (20)

where is a baseline intensity describing the period (time) and age effect. The cohort intensity is then given by

 μ(t,x)=E[Z|t,x]μbase(t,x), (21)

where denotes the (conditional) mean frailty of the cohort of age at time , i.e. the cohort born at time .

Assume that all cohorts have the same frailty distribution at birth, and denote the Laplace transform of this common distribution by . Let , and define the integrated baseline and cohort intensities as

 I(t,x) =∫x0μbase(u+t−x,u)du, (22) H(t,x) =∫x0μ(u+t−x,u)du. (23)

That is, we follow a specific birth cohort over time and age. We have as before the relations and

 E[Z|t,x]=ν′(I(t,x))=ν′(ν−1{H(t,x)}). (24)

Following the notation of bon05 we define the rate of improvement in (senescent) mortality as . It follows from (21) that

 ρ(t,x)=−∂logE[Z|t,x]∂t+ρbase(t,x), (25)

where denotes the rate of improvement of baseline intensity.

Suppose we model the period effect of mortality improvements by decreasing age-specific baseline mortality, i.e. . For fixed , the integrated baseline intensity will then also be decreasing over time, while the mean frailty will be increasing over time due to less and less selection. From (25) we see that this implies that the rate of improvement of cohort mortality will be smaller than the rate of improvement at the individual level.

At old ages where death rates, and thereby selection, is high the change in mean frailty over time can substantially offset improvements in baseline mortality causing rates of improvement of cohort mortality to be close to zero. As improvements continue to occur at the individual level the selection effect gradually disappears and rates of improvement of cohort mortality get closer to baseline rates of improvement. The resulting pattern of gradually changing rates of improvement of old-age mortality resembles what is seen in data. This will be illustrated in the applications in Section 4.

### 3.1 Data and terminology

Data are assumed to be of the form of death counts, , and corresponding exposures, , for a range of years, , and ages, . Data may or may not be gender-specific, but that will not be part of the notation. denotes the number of deaths occurring in calender year among people aged . Correspondingly, denotes the total number of years lived during calender year by people of age . For readers familiar with the Lexis diagram, counts the number of deaths in the square of the Lexis diagram and gives the corresponding exposure, i.e. we work with so-called A-groups.

From the death counts and exposures we form the observed (empirical) death rates

 m(t,x)=D(t,x)E(t,x). (26)

The death rate is an estimate of the cohort intensity, , which for modeling purposes is assumed constant over the square in the following.

### 3.2 Fragilization of mortality models

Consider as given a stochastic model for baseline mortality, . For ease of presentation we will assume that it is of the form

 μbase(t,x)=F(θt,ηx), (27)

where and are, possibly multi-dimensional, time- and age-dependent quantities and describes the functional dependence. We could have included also cohort effects and more general dependence structures if so desired.

The form (27) is chosen with two applications mind: with , and we get the Lee-Carter model; with we get parametric time-series models where the functional form of the age-profile is determined by . In the parametric case we could use for instance a Gompertz law, , or a logistic function, . The latter is the model considered by caibladow06 for aggregate mortality. As mentioned in Section 2.1 the logistic model already has a frailty interpretation, and it is therefore not an obvious candidate for further fragilization; but it can nevertheless be done.

For presentational convenience we assume that both and need to be estimated from data, although as just seen may in fact be fixed. Once estimated we can forecast by ARIMA time-series methods and thereby obtain forecasts of by inserting the forecasted values of together with the estimated values of into (27).

In order to proceed we assume that we have available a procedure for maximum likelihood estimation of in the model where death counts are independent with

 D(t,x)∼Poisson(μbase(t,x)E(t,x)). (28)

In the Lee-Carter case such a procedure is described in broetal02.

With these building blocks in place we now consider the following "fragilized" version of the baseline model in which death counts are independent with

 D(t,x) ∼Poisson(μ(t,x)E(t,x)), (29) μ(t,x) =E[Z|t,x]μbase(t,x), (30)

where is given by (27) and denotes the conditional mean frailty of the cohort of age at time . It is assumed that the frailty distribution at birth is the same for all cohorts and that the distribution belongs to a family indexed by . The parameters of the model are thus . The Laplace transform of the frailty distribution with index is denoted , and this is assumed available in explicit form. Further, as a matter of convention we assume that mean frailty is one at birth and we define .

Based on (22) and (24) we can write

 μ(t,x) =ν′ϕ(I(t,x))F(θt,ηx), (31) I(t,x) =x−1∑u=0F(θu+t−x,ηu), (32)

and insert this into (29). In principle, we can estimate all parameters jointly from the resulting likelihood function. However, the likelihood function is intractable with frailty and baseline parameters occurring in a complex mix. Consequently, estimation has to be handled on a case-by-case basis depending on the choice of frailty and baseline model. Below we propose an alternative, generally applicable pseudo-likelihood approach which greatly simplifies the estimation task.

### 3.3 Pseudo-likelihood function

From (24) we know that , where is the integrated cohort intensity . At first sight this does not seem to help much since is at least as complicated as . However, in contrast to we can obtain a (model free) estimate of directly from data as

 ~H(t,x)=x−1∑u=0m(u+t−x,u). (33)

We thus propose to base estimation of (29)–(30) on a likelihood function in which the term is replaced by . The resulting approximate likelihood function is referred to as the pseudo-likelihood function,555The idea of basing inference on a pseudo-likelihood function was first introduced in spatial statistics, see bes75.

 L(ϕ,θ,η) =∏t,xλ(t,x)D(t,x)D(t,x)!exp(−λ(t,x)), (34) λ(t,x) =ν′ϕ(ν−1ϕ{~H(t,x)})F(θt,ηx)E(t,x). (35)

Formally, it corresponds to estimating the modified model

 D(t,x) ∼Poisson(μ(t,x)E(t,x)), (36) μ(t,x) =ν′ϕ(ν−1ϕ{~H(t,x)})F(θt,ηx). (37)

In model (36)–(37) is separable in frailty and baseline parameters and the model is therefore considerably easier to handle than (29)–(32).666Equation (37) might still look daunting, but is simplifies in specific cases. If frailty is assumed Gamma distributed with variance (and mean one) we get , while inverse Gaussian frailty with variance (and mean one) yields . The separability allows for simple joint estimation based on the estimation procedure for the baseline model.

One issue remains before turning to estimation in Section 3.4. To form the (pseudo) likelihood function we need to compute for all and in the data window. However, these quantities depend in part on data outside the data window. In principle, we need to know the death rates from birth to the present/maximum age for all cohorts entering the estimation. The gray area of Figure 2 illustrates the "missing" death rates.

For the purpose of calculating we suggest to define the death rates before and below the data window by

 m(t,x)={m(tmin,x)for t

This corresponds to saying that selection prior to have happened according to initial rates (rather than actual rates), and that all cohorts have mean frailty one at age (rather than at birth). The latter assumption was also used by Vaupel in the contributed discussion part of tha99. With the extension of death rates and hence the likelihood function can be computed.

We note that in some situations, e.g. for many national data sets, we may have more data than we choose to use for estimation. In these situations some or all of the "missing" death rates may in fact be available and hence could be used when calculating . In the present paper, however, we will calculate using the extension (38) even in this case. Assuming no selection prior to also makes forecasting more straightforward.

### 3.4 Maximum likelihood estimation

We here discuss various ways to obtain maximum likelihood estimates of model (36)–(37) of Section 3.3. For computational efficiency and numeric stability one often considers the log-likelihood function (rather than the likelihood function),

 l(ϕ,θ,η)=logL(ϕ,θ,η)=∑t,x{D(t,x)logμ(t,x)−μ(t,x)E(t,x)}+constant, (39)

where is given by (37) and the constant depends on data only. The simplest and most general method to optimize (39) is to form the profile log-likelihood function

 l(ϕ)=l(ϕ,^θ(ϕ),^η(ϕ)), (40)

where and denote the maximum likelihood estimates of and for fixed frailty parameter . Assuming that we know how to perform maximum likelihood estimation of the baseline model, and can be obtained by estimating the baseline model with death counts and exposure . Since the frailty family is typically of low dimension, e.g. one or two dimensions, the profile log-likelihood function can normally be optimized reliably by general purpose optimization routines, e.g. the optimize or optim routines in the free statistical software package .

For a one-dimensional frailty family, e.g. Gamma with mean one and variance , we could use the following skeleton R-code

# model is a list with functions for calculating
# Htilde, mean frailty and MLE of the baseline model
frail.optim <- function(D,E,model) {

Htilde <- model$Htilde(D,E) profile.loglike <- function(phi) { EZ <- model$meanfrail(phi,Htilde)
ret <- model$MLEbase(D,EZ*E) ret$value   # value of maximized log-likelihood
}
optret  <- optimize(profile.loglike,interval=c(0,2),maximum=TRUE)
MLEphi  <- optret$maximum EZ <- model$meanfrail(MLEphi,Htilde)
ret     <- model$MLEbase(D,EZ*E) MLEbase <- ret$par  # MLE of baseline parameters
list(phi=MLEphi,base=MLEbase)
}


Alternatively, maximum likelihood estimation can also be implemented by a switching algorithm. Starting from an initial value, , of the frailty parameter and , the algorithm consists in repeating the following two steps until convergence. On each iteration is increased by one.

1. Calculate as the maximum likelihood estimate with fixed.

2. Calculate as the maximum likelihood estimate with fixed.

The estimation in the first step is, as before, performed by estimating the baseline model with death counts and exposure . The estimation in the second step is performed by maximizing as a function of . With the dimension of typically low, this is easy to do by either Newton-Raphson, steepest ascent, or, in the one-dimensional case, even a simple bisection algorithm.

For the case of Gamma frailty with mean one and variance we have for and its first two partial derivatives with respect to

 l(σ2,θ,η) =∑t,x{−σ2D~H−exp(−σ2~H)F(θt,ηx)E}+const., ∂l∂σ2(σ2,θ,η) =∑t,x{−D~H+~Hexp(−σ2~H)F(θt,ηx)E}, ∂2l∂(σ2)2(σ2,θ,η) =−∑t,x~H2exp(−σ2~H)F(θt,ηx)E<0,

where we have suppressed the dependence on and of , and to simplify notation. The calculations show that is strictly concave as a function of , and thereby that a local maximum is also a global maximum. Hence any gradient method is guaranteed to converge to the global maximum. Note however that the maximum may not be interior and therefore we do not necessarily have at the maximum.

The switching algorithm will always converge to a (local) maximum. However, many iterations may be required in particular if the likelihood has a "diagonal" ridge along frailty and baseline parameters, i.e. if frailty and baseline parameters to some extent describe the same feature of data. The advantage of the switching algorithm is that it allows a more detailed analysis which can guide the choice of optimization routine, like in the Gamma example above.

We finally note that estimation can also be carried out by Newton-Raphson sweeps over frailty and baseline parameters, similar to the algorithm described in broetal02 for the Lee-Carter model. This may be more efficient in terms of computing time, but it requires a substantial amount of tailor-made code for each combination of frailty and baseline model.

Overall, we find that optimization of the profile log-likelihood function is the method of choice. It is particularly well-suited in the exploratory phase due to its flexibility and ease of implementation. In our experience it is also sufficiently fast and robust to be of practical use.

### 3.5 Forecasting

We now assume that we have (maximum likelihood) estimates of and we denote these by . We will follow the usual approach in stochastic mortality modeling and forecast mortality based on a time-series model for . The time-series model is estimated on the basis of treating these as observed, rather than estimated, quantities. Typically a simple (multi-dimensional) random walk with drift is used, see e.g. leecar92 and caibladow06, but models with more structure can also be used.

Assume that we have a forecast for a given forecast horizon . The forecast can be either a deterministic (mean) forecast, or a stochastic realization from the time-series model. The aim is to forecast baseline and cohort mortality for ages ; the forecast region is illustrated as the shaded box to the right in Figure 2.

We first note that the estimated can be written

 ^μ(t,x)=ν′^ϕ(ν−1^ϕ{~H(t,x)})F(^θt,^ηx)=ν′^ϕ(~I(t,x))F(^θt,^ηx), (41)

where . Expressing mean frailty in terms of provides the necessary link for forecasting mean frailty, and thereby cohort mortality.

Baseline mortality is readily forecasted by inserting and into (27),

 μbase(t,x)=F(¯θt,^ηx), (42)

while cohort mortality is forecasted by

 μ(t,x)=ν′^ϕ(~I(t,x))F(¯θt,^ηx), (43)

where in the forecast region is given by the recursion

 ~I(t,x)=⎧⎪⎨⎪⎩0for x=xmin,~I(tmax,x−1)+F(^θtmax,^ηx−1)for % x>xmin,t=tmax+1,~I(t−1,x−1)+F(¯θt−1,^ηx−1)for x>xmin,t>tmax+1. (44)

Note that in the data window is defined by transformation of to ensure consistency with the estimated model, while in the forecast region it is defined recursively in terms of the forecasted baseline mortality.

For Gamma frailty with mean one and estimated variance we will forecast cohort mortality by

 μ(t,x)=F(¯θt,^ηx)1+^σ2~I(t,x), (45)

where in the data window is given by . Inverse Gaussian frailty with mean one and estimated variance yields the cohort mortality forecast

 μ(t,x)=F(¯θt,^ηx)√1+2^σ2~I(t,x), (46)

where in the data window is given by .

## 4 Applications to US male mortality

Broadly speaking stochastic mortality models fall in two main categories: parametric time-series models and semi-parametric models of Lee-Carter type. We consider two applications, one for each of these model classes. The applications are illustrative examples highlighting different aspects of the theory rather than full-blown statistical analyses.

In the first application we take a standard Makeham mortality intensity and show how the fit can be improved for different age groups by adding frailty. To give a flavor of the possibilities we consider models of the generalized, additive form presented in Section 5 in combination with the two-dimensional family of stable laws of Section 2.2. In the second application we add Gamma frailty to the Lee-Carter model and illustrate the effect on the forecasts.

Both applications use US male mortality data from 1950–2010 available at the Human Mortality Database (www.mortality.org). The first application compares a forecast based on the initial half of data with the actual mortality experience. The second application uses the latter half of data and makes a forecast into the future.

### 4.1 Parametric time-series application

The classical Makeham, or Gompertz-Makeham, mortality law states that age-specific mortality can be described by the functional form

 μ(x)=exp(θ1+θ2x)+exp(ζ), (47)

where denotes age. The additive structure suggests an interpretation of mortality as composed of an age-dependent (senescent) part describing mortality as people grow older, and a part describing background mortality, e.g. accidents, which happen independent of age. Despite its simplicity the Makeham intensity provides a reasonable description for a wide range of adult ages. However, for the very old and also for younger adults the description is not adequate. The right panel of Figure 3 shows observed death rates in 1980 for US males aged 20 to 100 with a fitted Makeham curve imposed (and two other curves which will be explained below). It is seen that the Makeham curve overstates old-age mortality and that the curvature is wrong for the younger ages.

For old-age mortality, the lack of fit of the Makeham law is well-known and logistic-type models have been proposed instead, see e.g. thaetal98. In caibladow06

a logit-model for death probabilities (rather than rates) is suggested and applied to England & Wales males above age 60 for which it fits well. However, in

caietal09 potential problems of fitting US male data with the pure logit-model is reported and a variant is considered.

#### 4.1.1 Stochastic frailty model

As an alternative to the logit-model and its variants we here consider the stochastic frailty model

 D(t,x) ∼Poisson(μ(t,x)E(t,x)), (48) μ(t,x) =E[Z|t,x]exp(θ1t+θ2tx)+exp(ζt), (49)

where the distribution of belongs to the positive stable family with index and variance , described in Section 2.2. The model is of the additive form introduced in Section 5 with frailty acting only on the senescent part of mortality. With this structure we keep the interpretation of the second term as background mortality to which everyone is equally susceptible.

We recall from Section 2.2 that the stable family includes both Gamma () and inverse Gaussian () as special cases. Gamma frailty leads to a logistic-type model, akin to the one used in tha99, while values of greater than zero yield old-age mortality "in between" logistic and exponential. The Makeham model with time-varying parameters is also included and corresponds to (for all values of ).

We apply the model to US male mortality data for the period 1950–1980 and ages 20–100. The left panel of Figure 3 shows a contour plot of the profile log-likelihood function,

 l(α,σ)=logL(α,σ,^θ(α,σ),^η(α,σ)), (50)

where is the pseudo-likelihood function of Section 5.2 and and denote the maximum likelihood estimates for fixed value of frailty parameters. The profile log-likelihood function is calculated by the EM-algorithm of Section 5.3.

Interestingly, the profile log-likelihood is maximized for . This implies that the best fit is obtained for a model which is quite far from a logistic form. A closer look at data reveals why this is the case. The right panel of Figure 3 shows the observed death rates in 1980 with three fitted curves imposed. The curve labeled ’Makeham with stable frailty’ is the one corresponding to the MLE, while the curve labeled ’Makeham’ is the fitted Makeham curve mentioned earlier. If we restrict attention to Gamma frailty, i.e. maximize the profile log-likelihood along the axis , we find the maximum . The corresponding fitted curve is the one labeled ’Makeham with Gamma frailty’.

Compared to the Makeham curve, the MLE curve gives an improved fit at young ages and a similar fit at old ages, while the Gamma curve gives an improved fit at old ages and a similar fit at young ages. Since the exposure is much bigger at young ages than at old ages the global MLE is the one fitting young ages the best. Ideally we would like a model which combines the MLE fit for young ages with the Gamma fit at old ages. Note that the observed death rates are actually decreasing from age 20 to around age 30. This however cannot be captured by any of the models considered.

#### 4.1.2 Forecast

Figure 4 shows the estimated and forecasted level, , and slope, , parameters of individual senescent mortality. The forecast is produced as the mean forecast of a fitted two-dimensional random walk with drift,

 θt=θt−1+ξ+Ut, (51)

where is the drift and are independent, identically distributed two-dimensional normal variables with mean zero. Qualitatively similar plots hold for the Gamma frailty model (not shown).

It appears that there is a structural break around 1970 where the annual drift in level and slope changes. With hindsight we now know that this change was genuine and that using only the period 1970–1980 would have resulted in a better forecast. Imagine standing in 1980, however, it is less obvious that we would have used only the last 10 years of data when forecasting. To illustrate the forecast the method would likely have produced in 1980 we therefore use the mean forecast based on the average drift for the full period 1950–1980, as shown in Figure 4.

Cohort mortality with stable frailty is forecasted by

 μ(t,x)=exp(¯θ1t+¯θ2tx)[1+11.7701−0.434~I(t,x)]0.434−1+exp(¯ζt), (52)

where denotes the mean forecast of , cf. Section 5.4. Background mortality shows very little variation over the data window and is therefore kept fixed at the last value in the forecast, . Similarly, cohort mortality with Gamma frailty is forecasted by

 μ(t,x)=exp(¯θ1t+¯θ2tx)1+0.113~I(t,x)+exp(¯ζt). (53)

The fit and forecast with the two frailty specifications are shown in Figure 5 together with observed death rates from 1950 to 2010. In the data window, the model with stable frailty is seen to provide a very good fit for all ages except the oldest (age 100), while Gamma frailty leads to a good fit for all ages above 50, but a poor fit below. None of the models are able to fully predict the improvements that occurred from 1980 onwards in particular for ages 60–80. Of the two models the Gamma frailty model does in fact come closest due to its more log-concave forecasts, cf. the discussion of Gamma versus inverse Gaussian frailty in Section 2.1.

Overall, the example shows that adding frailty can substantially improve the fit of a simple baseline model. However, as with all parsimonious models the fit is not perfect, in particular when we consider an age span as wide as 20 to 100 years. The example also shows that regarding both fit and forecast of old-age mortality Gamma frailty performs better than (other) stable frailties.

### 4.2 Lee-Carter application

The model of leecar92 assumes a log-bilinear structure of mortality,

 logμ(t,x)=ax+bxkt, (54)

where and are age-specific parameters and is a time-varying index. The index is typically modeled as a random walk with drift,

 kt=kt−1+ξ+Ut, (55)

where is the drift and are independent, identically distributed normal variables with mean zero. The combination of (54) and (55) implies that forecasted age-specific death rates decay exponentially at a constant rate. However, the experience of old-age mortality has shown increasing rates of improvement in many countries and, consequently, forecasts based on the Lee-Carter methodology have a tendency to underestimate the actual gains. In the example to follow we illustrate how the addition of frailty can be used to improve the forecast of old-age mortality.

We consider the Poisson version of the Lee-Carter model,

 D(t,x)∼Poisson(exp(ax+bxkt)E(t,x)), (56)

and we use the algorithm of broetal02 to obtain the maximum likelihood estimates of the parameters under the usual identifiability constraints,

 ∑tkt=0and∑xbx=1. (57)

We will use the Lee-Carter model with and without frailty to model mortality for ages 0 to 90, and assume a logistic form for higher ages,777For each , and are estimated by least squares based on the relation , for ages .

 μ(t,x)=exp(ct+dtx)1+exp(ct+dtx)for x=91,…,110. (58)

The extrapolation at the oldest ages is necessary to obtain reliable and stable rates, both historically where data are sparse and in forecasts. The left panel of Figure 6 shows the fit in 2010 for the full age-span from 0 to 110 years. The plot illustrates the flexibility and better fit to data of semi-parametric models, like the Lee-Carter model, over parametric models like the ones considered in Section 4.1. Parametric models, on the other hand, generally produce forecasts which better preserve the overall structure of data.

The four Lee-Carter forecasts used in the introductory example in Section 1 are produced as described above. The mean of the index, , is used for forecasting assuming an underlying random walk with drift. The period life expectancy used for illustration is calculated by the following formula with

 ex0(t)=∫110x0exp(−∫xx0μ(t,⌊y⌋)dy)dx, (59)

where denotes the integer part of . Arguably, the cohort life expectancy taking future improvements into account is of more interested in practise. However, for our purposes the period life expectancy is more useful since it can be compared with the actual experience.

#### 4.2.1 Stochastic frailty model

Consider the following multiplicative stochastic frailty model

 D(t,x) ∼Poisson(μ(t,x)E(t,x)), (60) μ(t,x) =E[Z|t,x]exp(ax+bxkt), (61)

where is assumed to be Gamma distributed with mean one and variance . Due to the assumption of Gamma frailty we have the relations

 E[Z|t,x]=(1+σ2I(t,x))−1=exp(−σ2H(t,x)), (62)

where and denote integrated cohort and baseline mortality respectively. For we have the ordinary Lee-Carter model.

We will deviate slightly from the theory presented so far and use the period version of and rather than the cohort version previously used. Hence we will replace by the quantity

 ˇH(t,x)=x−1∑u=0m(t,u), (63)

when estimating the model. Similarly, we calculate and forecast by its period version . Note that with the period version we do not need to extend the data window.

In the context of the Lee-Carter model the period version appears to stabilize the estimates and forecasts, and this is why we use it. We also note that the use of Gamma frailty on a period basis is consistent with the logistic extension above age 90.

For fixed value of model (60)–(61) is estimated as an ordinary Lee-Carter model with exposure . With the large number of parameters of the Lee-Carter model the addition of makes little difference for the fit. This is illustrated in the left panel of Figure 6 where the fit of the Lee-Carter model with and without frailty is indistinguishable.

Since frailty is introduced with the specific aim of improving the forecast we propose to estimate with a back-test approach, rather than by maximum likelihood estimation. Specifically, we define the forecast fit by

 f(σ2)=2010∑t=200190∑x=0{D(t,x)logμ(t,x;σ2)−μ(t,x;σ2)E(t,x)}, (64)

where denotes the forecast of model (60)–(61) estimated with data from 1970–2000 and ages 0–90 for fixed value of . The model is forecasted as described in Section 4.2.2 below. Plotting shows a unimodal function which is maximized at (plot not shown). The chosen approach is simple and serves as illustration, but many other choices for measure of fit and data period could have been made.

The right panel of Figure 6 shows the life expectancy of US males age 60 resulting from the forecast for different values of . The impact of frailty is initially modest but it gradually leads to substantial differences. Yet, even the presence of frailty cannot fully explain the rapid increase in life expectancy of the last decade. Note that the optimal frailty variance is not the one leading to the highest forecast, but the one giving the best fit overall for ages 0–90 in the ten years, 2001–2010, following the estimation period.

#### 4.2.2 Forecast

Forecasts of model (60)–(61) are obtained by

 μ(t,x)=exp(^ax+^bx¯kt)1+^σ2ˇI(t,x), (65)

where , and denote estimated parameters and is the forecasted index, either deterministic or stochastic. The integrated baseline intensity is given by in the data window and by

 ˇI(t,x)=x−1∑u=0exp(^au+^bu¯kt) (66)

in the forecast.

The left panel of Figure 7 shows the estimated index for the estimation period 1980–2010 and frailty variance . The estimated index looks close to linear and a linear forecast seems reasonable. The right panel of Figure 7 shows the corresponding life expectancy forecast for US males age 60 as the dashed line starting in 2010. The forecast is almost linear while the ordinary Lee-Carter forecast (solid line) curves downwards over time. For comparison the plot also shows Lee-Carter forecasts with and without frailty based on the estimation periods 1950–1980, 1960–1990 and 1970–2000.888To show the effect of the estimation period rather than the level of frailty the same value of is used for all forecasts. The pattern is the same with almost linear frailty forecasts and lower, downward-curving Lee-Carter forecasts. Note that all historic forecasts, both with and without frailty, are below the actual life expectancy evolution.

The effect of frailty in terms of the forecasted age-specific death rates is illustrated in Figure 8. For ages below 70 the Lee-Carter forecasts with and without frailty are almost identical. For higher ages the frailty effect gradually increases and gives rise to the previously announced curving forecasts. In particular, the frailty forecast predicts substantial improvements in mortality for the very old over the next 50 years while the ordinary Lee-Carter forecasts predicts essentially no improvements above age 90. The plot also illustrates that the forecasted age-profile is somewhat distorted. This is a general feature of semi-parametric methods, like the Lee-Carter model, due to the lack of structure.

History shows that mortality rates generally decline over time, but it also shows changing rates of improvement. Clearly, there are substantial period effects due to medical breakthroughs, changes in nutrition and working conditions etc., and frailty can at best be a contributing factor to the observed increasing rates of improvement. Indeed, the example shows that adding frailty to historic forecasts is not enough to explain the actual evolution. However, the addition of frailty leads to projected old-age mortality rates which at least partly accommodate future higher rates of improvement. The resulting almost linear life expectancy projections resemble the actual experience better than the downward-curving Lee-Carter forecasts.

## 5 Generalized stochastic frailty models

The theory presented in Section 3 allows for estimation and forecasting of multiplicative frailty models. This is a useful class of models, but it can also be relevant to consider a larger and more flexible class of additive models with both frailty and non-frailty components. With additive models we can distinguish between senescent mortality influenced by frailty and selection and "background" mortality due to e.g. accidents with no selection effects. The class introduced below contains e.g. logistic-type models with constant terms which is useful when modeling mortality for a wide age span.

In the following we define the class of generalized stochastic frailty models and we show how to estimate and forecast these models. Additive models complicates the theory somewhat and we will have to resort to an EM-algorithm for estimation. The exposition is rather brief and focuses on the areas where the theory differs from the one presented in Section 3. Many of the general remarks still apply, but they will not be repeated here.

The models we consider are based on the underlying assumption that the intensity for an individual of age at time with frailty is of the form

 μ(t,x;z)=zμbase(t,x)+μ%back(t,x), (67)

where is the baseline intensity influenced by individual frailty and is the background mortality common to all individuals independent of their frailty. The cohort intensity is then given by

 μ(t,x)=E[Z|t,x]μbase(t,x)+μback(t,x), (68)

where denotes the (conditional) mean frailty of the cohort of age at time , i.e. the cohort born at time .

Assume that all cohorts have the same frailty distribution at birth and denote its Laplace transform by . By a straightforward generalization of the results in Section 2 we have and

 E[Z|t,x]=ν′(I(t,x))=ν′(ν−1{H(t,x)}), (69)

where and

 I(t,x) =∫x0μbase(u+t−x,u)du, (70) H(t,x) =∫x0μ(u+t−x,u)−μback(u+t−x,u)du. (71)

Note that in contrast to earlier is not just the integrated cohort intensity. The frailty independent part of mortality, , needs to be subtracted to get the usual connection between and .

### 5.1 Model structure

Assume we have models for baseline and background mortality of the form

 μbase(t,x) =F(θt,ηx), (72) μback(t,x) =G(ζt,ωx), (73)

where and together with describe the period and age effects of frailty dependent mortality, and and together with describe the period and age effects of frailty independent mortality. All quantities can be multi-dimensional. Further generalizations are possible, but the chosen structure is sufficient to illustrate the idea.

For ease of presentation we assume that and are to be estimated from data, but they may in fact be fixed. As a simple example we could use a Gompertz law for baseline mortality, , and a constant for background mortality, . With no frailty this is the Gompertz-Makeham model, and when frailty is added to baseline mortality we get the logistic-type model used in Section 4.1.

A generalized stochastic frailty model is a model where death counts are independent with

 D(t,x) ∼Poisson(μ(t,x)E(t,x)), (74) μ(t,x) =E[Z|t,x]μbase(t,x)+μback(t,x), (75)

where and are given by (72)–(73) and denotes the conditional mean frailty of the cohort of age at time . The frailty distribution at birth is the same for all cohorts and it is assumed to belong to a family indexed by . The Laplace transform of the frailty distribution with index is denoted , and this is assumed available in explicit form. Further, as a matter of convention we assume that mean frailty is one at birth and we define .

### 5.2 Pseudo-likelihood function

Estimation will be based on a pseudo-likelihood function in which the problematic term is replaced by an estimate. This opens for a general, (relatively) easy estimation procedure which can be implemented whenever the baseline and background models can be estimated separately.

The idea is to estimate in the data window by

 ~H(t,x,ζ,ω)=x−1∑u=0~m(u+t−x,u), (76)

where

 ~m(t,x)=⎧⎨⎩~m(tmin,x)for t

The reason we need to extend by (77) is that the summation in (76) falls partly outside the data window. Apart from the subtraction of background mortality the construction is the same as in Section 3.3.

Appealing to (69) we propose to base estimation of model (74)–(75) on a likelihood function in which is replaced by . The resulting approximate likelihood function is referred to as the pseudo-likelihood function,

 L(ϕ,θ,η,ζ,ω) =∏t,xλ(t,x)D(t,x)D(t,x)!exp(−λ(t,x)), (78) λ(t,x) =[ν′ϕ(ν