Skewed link regression models for imbalanced binary response with applications to life insurance

by   Shuang Yin, et al.
University of Connecticut

For a portfolio of life insurance policies observed for a stated period of time, e.g., one year, mortality is typically a rare event. When we examine the outcome of dying or not from such portfolios, we have an imbalanced binary response. The popular logistic and probit regression models can be inappropriate for imbalanced binary response as model estimates may be biased, and if not addressed properly, it can lead to serious adverse predictions. In this paper, we propose the use of skewed link regression models (Generalized Extreme Value, Weibull, and Frec̀het link models) as more superior models to handle imbalanced binary response. We adopt a fully Bayesian approach for the generalized linear models (GLMs) under the proposed link functions to help better explain the high skewness. To calibrate our proposed Bayesian models, we use a real dataset of death claims experience drawn from a life insurance company's portfolio. Bayesian estimates of parameters were obtained using the Metropolis-Hastings algorithm and for Bayesian model selection and comparison, the Deviance Information Criterion (DIC) statistic has been used. For our mortality dataset, we find that these skewed link models are more superior than the widely used binary models with standard link functions. We evaluate the predictive power of the different underlying models by measuring and comparing aggregated death counts and death benefits.



There are no comments yet.


page 20


Parametric bootstrapping in a generalized extreme value regression model for binary response

Generalized extreme value (GEV) regression is often more adapted when we...

Hierarchical approaches for flexible and interpretable binary regression models

Binary regression models are ubiquitous in virtually every scientific fi...

Infinitely imbalanced binomial regression and deformed exponential families

The logistic regression model is known to converge to a Poisson point pr...

Flexible Modeling of Hurdle Conway-Maxwell-Poisson Distributions with Application to Mining Injuries

While the hurdle Poisson regression is a popular class of models for cou...

On the Predictive Properties of Binary Link Functions

This paper provides a theoretical and computational justification of the...

Aggregating the response in time series regression models, applied to weather-related cardiovascular mortality

In environmental epidemiology studies, health response data (e.g. hospit...

Using the Softplus Function to Construct Alternative Link Functions in Generalized Linear Models and Beyond

Response functions linking regression predictors to properties of the re...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

The prediction and interpretation of mortality rates have been an important aspect of risk management in the life insurance industry. For several decades, different models have been applied to actuarial work to investigate the possibilities of explaining mortality. The early work of Gompertz (1825) provided a parametric form that explains the exponential increase in mortality with age. More recent work includes Haberman and Renshaw (1996) and Zhu et al. (2001), which examined the applications of Cox proportional hazards model (Cox (1972)) in analyzing the mortality for life insurance policies. The Cox regression model is a tool for understanding the effects of the presence of risk factors such as age, gender, policy duration, and smoking habits, in mortality. Olbricht (2012) offered an interesting alternative approach using tree-based models to construct the life table. Accordingly, tree-based models have the advantages of “ease of predictability” and “transparency.” Meyricke and Sherris (2013)

proposed the generalized linear mixed models (GLMMs) framework for mortality modeling to incorporate longitudinal variables allowing for both underwriting factors and frailty.

Booth and Tickle (2008) discussed an array of work in the literature on mortality modeling especially those pertaining to forecasting.

This paper has a different objective for studying mortality. In particular, we are interested in tracking and monitoring the death claims experience of a portfolio of life insurance policies. As pointed out in Yin et al. (2020), such tracking mechanism can help insurers take necessary actions to mitigate against the immediate economic impact of any deviations from expectations. An interesting related work of Zhu et al. (2015)

applies logistic regression models to understand mortality trends, slopes, and differentials. An additional challenge of mortality modeling, especially related to periodic claims monitoring, is that it is considered a rare event with a very tiny probability; there is a significant difference between the number of deaths and the number of survivors. For example, the mortality rate, according to our insured portfolio dataset, observed during a policy year is only about 1.3%, which in a binary model setup, such a response is characterized as highly imbalanced (or skewed). In this paper, we focus on capturing this high skewness for which other methods may not be able to model the data appropriately.

It is well known that binary regression models fall within the class of generalized linear models (GLMs) using Bernoulli distribution. See

McCullagh and Nelder (1989)

. The commonly used functions linking the mean parameter to the predictor variables are the symmetric link functions, logit and probit, leading, respectively, to the logistic and probit regression models. Both link functions reasonably work for observations with balanced response binary outcomes. Suppose we are given a set of observations expressed as

, where is a binary outcome and is a set of predictor variables. The binary regression model has a latent variable representation where , for observation , is related to an unobserved variable as . , also called the latent variable, is directly linked to the predictor variables as a linear model with an error component as , where

is a vector of coefficients for the predictors. The error component,

has a pre-specified distribution directly related to the link function.

To deal with the limitations of symmetric link functions for binary outcomes, some studies extended the parametric class including the use of power transformation of the logit link (Aranda-Ordaz (1981)) and the Box-Cox transformation for binary regression (Guerrero and Johnson (1982)). Other model extensions include two-parameter classes of link functions such as two parameter generalization of logit link by Pregibon (1980), the two shape parameters link by Stukel (1988), and two tailed link by Czado (1994)

. However, the use of these extended classes of parametric link functions do not always fit the data well with extremely imbalanced response variable. For example, Stukel’s generalized logistic model may lead to improper posterior distribution under many types of noninformative improper prior.

Chen et al. (1999) suggested an asymmetric link function especially when the probability of a given binary response approaches 0 at a significantly different rate than it approaches 1. This model yields to proper posterior distribution under noninformative improper priors for the regression coefficients and can handle datasets with balanced and imbalanced response variable. However, the intercept and skewness parameters are confounded with each other in this model, although Kim et al. (2007) overcame this problem to some extent with the constraint that the skewness lies in the range of , which restricts the possibility of negative skewness of the response variable.

This open problem to some extent motivates us to investigate more flexible models to accommodate such imbalances. In order to appropriately model the large skewness, Wang and Dey (2010)

proposed binary regression with the generalized extreme value (GEV) link. The GEV distribution belongs to a family of continuous probability distributions developed from extreme value theory. The GEV distribution combines the Gumbel, Fréchet, and reversed Weibull distributions into a single family.

Caron et al. (2018) presented a Weibull link for the imbalanced binary regression, which is very flexible and capable to handle with different types of data. The Frećhet distribution, which is known as the Type II extreme value distribution, also has the shape parameter to control the direction and degree of skewness.

In this paper, we propose the use of this family of asymmetric link functions to binary regression models for analyzing life insurance data with highly imbalanced mortality. The model parameters are estimated using Bayesian approach. There are advantages to the use of these link functions that are suitable for our purposes:

  • With a continuous range of possible values of the shape parameter, the resulting models can automatically estimate the skewness parameter without any pre-assumption on the direction of the skewness and can thus better quantify the skewness;

  • The commonly used link functions, such as logit, probit, and complementary log-log, can be derived as special or limiting cases of this family of link functions;

  • In a Bayesian framework, even under the improper priors, the posterior distribution of the parameter is still proper; and

  • With the introduction of latent variables, the process can be computationally efficient whereby the Markov Chain Monte Carlo (MCMC) implementation can easily be used to sample the parameters.

This paper has been structured as follows. Section 2 provides a brief review of binary regression models and describes the latent variable interpretation of the binary regression models that is important for comparative purposes. We also describe and review the various skewed link models that are used in this paper. In Section 3

, we present ideas of Bayesian inference useful for the estimation and prediction based on our proposed skewed link regression models. For purposes of being self-contained, a detailed procedure of Metropolis-Hastings algorithm is described in this section. Section

4 discusses the simulation studies and shows the recovery work of the various proposed models. Section 5 is about the empirical data used in our model calibration and predictions. We conclude in Section 6.

2 Binary regression models

For a fixed period, our dataset can be described as where is a vector of responses, is a matrix of predictor variables, and is the total number of observations. The response is a binary outcome, with indicating death has been observed for policyholder and indicating survival otherwise. The vector of predictor variables is , where is the number of observed predictor variables in the dataset.

Let denote the probability of death, given the predictor variables. Since is a binary outcome, it can easily be deduced that the conditional expectation of given is also . It is well known that binary regression model falls within the class of generalized linear models (GLMs) with the link function where is a vector of coefficients. Therefore, and we would conveniently write this as . Note that it is customary to include an intercept coefficient in the linear relationship so that without loss of generality, we can assume that the vector of predictor variables is augmented as . See McCullagh and Nelder (1989) and Cox and Snell (1989).

2.1 Latent variable interpretation

The binary regression model can be re-expressed and re-interpreted in terms of latent (or unobserved) variables, herewith denoted by . Define these latent variables so that the binary outcome is




and . Here

is a cumulative distribution function of

, given

, which is often assumed to be a continuous random variable. It follows therefore that


Note that in the case where is the distribution function of a symmetric random variable with mean 0, then equation (3) becomes

In this case, determines the link function in the GLM framework. This latent variable representation is believed to be introduced by Albert and Chib (1993).

Examples of distributions that belong to the class of symmetric distributions include the standard logistic, standard normal, and the standard Gumbel distributions. This leads respectively to the three commonly used link functions in binary regression (without loss of generality, the subscript is dropped):

  • Logistic regression: When has a logistic distribution with mean 0 and scale parameter 1, i.e., , we have . The link function can be expressed as the logit function .

  • Probit regression: When

    has a normal distribution with mean 0 and scale parameter 1, i.e.,

    , we have . The link function can be expressed as the probit function .

  • cloglog regression: When has a Gumbel distribution with location 0 and scale parameter 1, we have . The link function can be expressed as the cloglog function .

There are several advantages to the latent variable representation of the binomial regression. First, it allows for a natural interpretation. To illustrate, in the mortality context where is death and is survival, then the latent variable may be viewed as a continuous measure of mortality. Second, the latent variable allows for a natural cutpoint, at zero, for splitting the prediction for into one of a binary outcome. Third, it has computational advantages particularly when Bayesian inference is used for estimation such as the use of an MCMC algorithm that may include Metropolis-Hastings and Gibbs sampling. See Gelman et al. (2014). While a direct likelihood estimation is sufficient for simple models such as logistic and probit regression, Bayesian computation is particularly helpful for other forms of link functions such as those asymmetric link functions introduced in the subsequent subsections.

2.2 Skewed link functions

The use of symmetric link functions in binary regression may be considered too restrictive especially for observations with imbalanced response, one where there is considerably a large proportion of one outcome. In the case of imbalanced data, the link function can be mis-specified, which can lead to inconsistent and biased model estimates and therefore increase in the values of mean squared errors. See Czado (1994). This paper focuses on using parametric link functions based on distributions that are not necessarily symmetric. In particular, we examine and adopt skewed link functions derived from three families of distributions that can accommodate large skewness to handle the imbalanced response. They are the generalized extreme value (GEV) distributions, the skewed Weibull distributions, and the Frećhet distributions.

For our purposes, we will consider three families of asymmetric or skewed link functions that are based on the generalized extreme value (GEV) distributions. Its cumulative distribution function has the general form


where provided it satisfies , and the parameters belong to , , and . This GEV distribution function can be derived from a limiting approximation to the distribution of the maxima of a sequence of random variables. This form has been attributed to the original work of McFadden (1978).

After some preliminary investigation of suitable link functions for our purposes, we have concluded to consider the following three families of skewed link functions:

  • Standard GEV link functions: When the location is and the scale , we will call this the standard GEV distribution so that its distribution function in this case is


    We will not preclude the case where , in which case the distribution has the simplified form of . This is the case of the Gumbel link functions, based on the Gumbel or Type I extreme value distribution. The link function can be expressed as .

  • Skewed Weibull link functions: In this case, we have


    defined only for ; its value will be zero elsewhere. This is actually the standard form of the Weibull distribution when the location is and the scale , which is indirectly implied from the GEV form in (4) above. The reverse Weibull, also called the Type III extreme value distribution can be derived directly from this GEV distribution; the derivation of the Weibull considers the minima of a sequence of random variables, in contrast to maxima as in the GEV. The link function can be expressed as .

  • Frećhet link functions: When the location is , the scale , , and some transformation to , we arrive at the Frećhet distribution function of the form


    provided ; it is defined to be 0 elsewhere. This distribution is sometimes referred to as the Type II extreme value distribution. The link function can be expressed as .

Briefly, let us examine some properties of each of these distributions. For the standard GEV, the skewness measure proposed by Arnold and Groeneveld (1995) is given by , where is the mode of the distribution. Based on this definition, the flexibility of the link function is reflected in the skewness expressed explicitly as , which depends only on the shape parameter . The mode is evaluated at .

For the skewed Weibull distribution, the shape parameter controls the skewness. Similar to GEV link, the skewness can also be computed based on the mode , which relies only on the shape parameter . Based on the formula , the skewness lies somewhere in the interval .

The complementary log-log (cloglog) model, which has been considered as an alternative to logistic and probit models, is frequently used when the probability of an event is relatively small or relatively large. The complementary log-log link is a special case of the Weibull link with corresponding cumulative distribution function . Consider the general form of the Weibull distribution expressed as . If we set and divide by , then we have


Although considered a skewed link, the cloglog, with skewness defined as , does not provide enough flexibility since the mode is a constant without incorporating a shape parameter.

Finally, for the Frećhet distribution, the skewness parameter can be estimated by the mode using the relationship , where the mode . Thus,the skewness of this distribution is , which relies only on the shape parameter and lies in the interval . This explains why this can be considered a flexible link function. Another important feature of the shape parameter arises from the fact that this kind of parameter affects the shape of the distribution by purely controlling the tail behavior than simply shifting it or stretching it.

Figure 1 displays the probability densities of asymmetric distributions described in this subsection (standard GEV, Weibull, and Frećhet) under different shape parameters. This allows us to visualize the type of skewness that can be derived from each of these distributions.

Figure 1: Density plots for the standard GEV with , skewed Weibull with , and Frećhet with

3 Bayesian inference

For the binary regression models with standard link functions that include the logit, probit, and cloglog, the parameters can be estimated based on the maximum likelihood using either the Newton’s method or the iteratively reweighted least squares (IRLS). See McCullagh and Nelder (1989). For our binary regression based on the skewed link models, we used the Bayesian approach with approximation of the posterior distribution based on Metropolis-Hastings algorithm, a type of a Markov chain Monte Carlo (MCMC) simulation method.

3.1 Posterior distribution and inference

In Bayesian statistics, the posterior distribution of the parameters is the conditional probability, given relevant evidence or information, which is usually the dataset. Let

denote the posterior distribution of the parameter , given the dataset , and denote the likelihood function of , where

is the probability density function of

, given

. Using the Bayes Theorem, if we assign a prior distribution

to , we have


where is the normalized constant that is free of . Suppose that we sample some number of independent, random values of from the posterior distribution ,

. Then according to the Law of Large Numbers,


where is any function of . As gets larger, the sample mean converges to its expectation. In this paper, the samples ’s are generated from the posterior distribution of equation (9) using MCMC (Metropolis-Hastings) algorithm which is discussed in details in section (3.3).

3.2 Choice of priors for the skewed link models

As earlier introduced, we denote our observed dataset as , and for purposes of the subsequent discussions, we shall denote our set of parameters as where without loss of generality, corresponds to the regression coefficients for the predictor variables and denotes any additional parameters arising from the skewed link models. Given the data , the likelihood of the parameters can be written as:


According to equation (9), if the joint prior of , , is given, then the joint posterior distribution for can be written as follows:


Consider the case of the standard GEV link model where the additional parameter is . If the independent uniform priors , are assigned to and , then . The resulting posterior distribution is still proper (Wang and Dey (2010)). Assuming independent priors for and , equation (12) becomes


In this paper, we assume the independent multivariate normal and univariate normal priors for and , respectively, with and .

Consider the case of the skewed Weibull link model where the additional parameter is . If the informative priors for and cannot be obtained, we can assign uniform prior and noninformative prior , for and , a fixed known constant. The resulting propriety of the posterior under improper priors, in this case, is proved in Caron et al. (2018). Assuming independent priors for and , equation (12) becomes


In this paper, we assume are priori independent and we assign a multivariate normal N and Gamma to and , respectively.

Finally, consider the case of the Frećhet link model where the additional parameter is . When the information of the prior is unavailable, we can consider the noninformative prior for and . Thus, , and is a constant. Given this noninformative prior, the posterior distribution explicitly written below as equation (15) is proper. The resulting propriety of the underlying posterior distribution is proved in the Appendix. Such results have not appeared in the literature and therefore we find it useful to provide enough details of the proof. Assuming independent priors for and , equation (12) becomes


In this paper, the multivariate normal and Gamma distributions are considered to be the priors of

and , respectively, with a similar parameterization of the Gamma distribution as in the case of the skewed Weibull link model.

3.3 Metropolis-Hastings algorithm

The Metropolis-Hastings algorithm is used to sample the regression coefficients and the shape parameters , , and for the standard GEV, skewed Weibull, and Frećhet link models, respectively. The Metropolis-Hastings algorithm (Hastings (1970)) can generate a random walk based on a proposed distribution and through a rejection method to determine the move. To fix ideas, let and denote posterior and proposed distributions with given complete data and let denote the parameters. The procedure of Metropolis-Hastings algorithm can be described in steps as follows:

  • Step 0: Choose an arbitrary starting point and set ;

  • Step 1: Generate a candidate point from the proposed distribution and from Uniform(0, 1);

  • Step 2: if then set , otherwise, set , the acceptance probability is given by:

  • Step 4: Update and repeat steps 1 and 2 until convergence.

The distribution is the kernel distribution. If is symmetric, then the conditional probability of and can be cancelled out in the calculation of . In this paper, we perform 20,000 iterations for the parameter estimation after a burn-in of 1,000 iterations. Let for denote the samples which are draw from Metropolis-Hastings, and by the Law of Large Number, the posterior mean of is


with variance estimate


We choose normal kernels N for the Metropolis-Hastings sampling, which is a normal distribution with mean of the previous state and variance with an appropriate value. The subsampling is also used in this paper, in which the samples are taken from every 50 steps in order to reduce the strong autocorrelation between states in order to further guarantee the stability of the estimates.

3.4 Model assessment and evaluation

For the model fitting in a Bayesian routine, we use the Deviance Information Criterion (DIC) which was proposed by Spiegelhalter et al. (2002) to perform the model comparison and assessment.

The Deviance is defined as , then


where is the Deviance evaluated under the value of the posterior mean of the corresponding parameters, which is the average estimated discrepancy for samples, and is the th sample generated from the posterior distribution . We also use to measure the effective model size and penalize complexity. The DIC is a Bayesian alternative to AIC and BIC. The model with smaller DIC will provide better fit to the data.

For the comparison of the frequentist models, we use the Bayesian Information Criterion (BIC), as developed in Schwarz (1978). Recall that , where is the maximum likelihood estimates and is the model dimension. We use Kolmogorov-Smirnov statistics (KS) as a measure of goodness of fit, which is defined as , the maximum absolute error of the predicted and observed values. Also, another statistics MAE, Mean Absolute Error, defined as , is used to measure the absolute deviations.

3.5 Posterior predictive distribution

The posterior predictive distribution is the distribution of possible unobserved values conditional on the observed values. For a given dataset

, the posterior predictive distribution of a new unobserved value is defined to be the following


If we assume the observed and unobserved data are conditionally independent given , then equation (19) can be written as


MCMC samples from the posterior predictive distribution of can be obtained following the procedure described below (Hoff (2009)). For each ,

  • sample ; and

  • sample .

It can be inferred that the sequence constitutes independent samples from the joint posterior distribution of and the sequence also constitutes independent samples generated from the posterior predictive distribution of .

4 Simulation studies and parameter recovery

In the simulation studies, we conducted two experimental trials and to investigate and assess the performance of the various link models for each of these experiments. For each of the simulated dataset, we deploy six models under different link functions, logit, probit, and cloglog, which are considered frequentist models, together with skewed link models, standard GEV, skewed Weibull, and Fréchet, which are considered Bayesian models. To summarize the posterior marginal densities of the parameters under a Bayesian framework, we use the highest posterior density interval (Box and Tiao (1973)), which has the shortest length for a given probability content. The calculations in this paper use the Chen and Shao (1999) algorithm to estimate an empirical Highest Posterior Density (HPD) interval of the parameters.

For experiment I, we generate 1000 independent response variables from the cloglog regression. The predictor

is a categorical variable with two levels (with respective probabilities

and ) and is a numerical attribute generated from N. The true values of the coefficients are set to be and the linear component of the regression model is . The dataset contains responses of 15 1’s and 985 0’s. The estimation results based on the various link models are summarized in Table 1, which shows that the skewed Weibull link model performs better than other Bayesian models with the lowest DIC and the largest log likelihood. The KS statistics for all the links are approximately close to each other, however, the Frećhet model outperforms all other models with respect to MAE. When it comes to parameter recovery, the asymmetric link models do not necessarily outperform the true probit model but are competitive enough against the other symmetric links. For the logit and cloglog models, the coefficient estimates of the significantly deviate from the true values; for all the skewed link models, most of the estimates fall within the 95% HPD interval.

For experiment II, we generate the response variables with sample size from the GEV regression model with and , and that produced responses of 12 1’s and 988 0’s. The predictors are set to be the same with the previous experiment. The output, summarized in Table 2, presents the model estimates under the six different link models. The data modeling of the standard GEV model performs well not only in view of recovery of parameters, but also in the model assessment. However, the skewed Weibull link is the most appropriate one since in both of the simulations, its estimated parameters are the ones that closely reach the true parameter values. The Frećhet link is the second best model in the sense of recovery of parameters. Therefore, again, the skewed links are very flexible to fit a dataset that is considered imbalanced because they automatically estimate the degree of skewness by controlling the shape parameter to improve their performance.

In each simulation experiment, multiple chains are generated to produce Bayesian estimates. Based on both experiments, we deduce that the skewed link regression models outperform the symmetric link regressions when the binary outcome is highly imbalanced.

logit probit cloglog standard GEV skewed Weibull Frećhet
Parameter Estimate Estimate Estimate Estimate Estimate Estimate
95% CI (HPD) 95% CI (HPD) 95% CI (HPD) 95% CI (HPD) 95% CI (HPD) 95% CI (HPD)
-5.8806 -2.8146 -5.8596 -5.8609 -4.6617 -4.9927
(-7.1661, -4.5951) (-3.2984,-2.3309) (-7.1206,-4.5987) (-8.2872 ,-3.6667) (-7.2326,-2.0174) (-6.9264 -3.0682)
1.169 0.4436 1.1694 1.0438 0.7564 0.9623
(-0.0638, 2.4018) (-0.0687,0.956) (-0.0347, 2.3736) (-0.0746,2.4326) (-0.2486, 1.8675) (-0.3557, 2.0823)
1.3656 0.5651 1.3237 1.2716 0.9187 0.8908
(0.6954, 2.0359) (0.2732, 0.8571) (1.1147, 2.4244) (0.6886, 1.9587) (0.2459, 1.6592) (0.383, 1.4744)
shape parameter 0.0032 1.2437 3.2626
(-0.1884, 0.1495) (0.7824, 1.8776) (2.3721, 4.406)
DIC 106.2756 103.5679 108.6163
-logLik 49.8878 49.8387 49.9225 50.0309 51.3653 51.5532
BIC 120.4988 120.4007 120.5683
KS 0.9972 0.9976 0.9972 0.9961 0.9967 0.8433
MAE 0.0209 0.02095 0.02088 0.02186 0.02123 0.1808
Table 1: Performance comparison of the various link models using simulation experiment I. The true model is based on the probit link with .
logit probit cloglog standard GEV skewed Weibull Frećhet
Parameter Estimate Estimate Estimate Estimate Estimate Estimate
95% CI (HPD) 95% CI (HPD) 95% CI (HPD) 95% CI (HPD) 95% CI (HPD) 95% CI (HPD)
-6.9426 -3.5930 -7.2263 -2.5713 -2.9694 -3.2936
(-8.6051, -5.2801) (-4.4801, -2.7059) (-9.1241, -5.3285) (-3.5931, -1.3858) (-3.9025, -2.0236) (-4.2359, -2.2512)
-0.6864 -0.2800 -0.5363 -0.2301 -0.2835 -0.5350
(-2.2951, 0.9223) (-1.0877, 0.5277) (-2.2800, 1.2074) (-0.6807, 0.2135) (-0.8473, 0.1638) (-1.3672, 0.0370)
2.4534 1.2739 2.6662 0.6581 0.5784 0.8103
(1.6712, 3.2357) (0.7798, 1.7679) (1.7058, 3.6265) (0.1729, 1.1070) (0.3802, 1.1798) (0.4761, 1.2215)
shape parameter -0.4523 1.8945 4.5823
(-0.7527, -0.2207) (1.2508, 2.6177) (3.3014, 6.1055)
DIC 81.8384 82.1210 86.8719
-logLik 56.3239 37.0525 37.3855 37.2829 39.6527 58.1976
BIC 133.3711 94.8283 95.4942
KS 0.9991 0.9925 0.9918 0.9870 0.9869 0.7983
MAE 0.0264 0.0191 0.0187 0.0211 0.0204 0.1481
Table 2: Performance comparison of the various link models using simulation experiment II. The true model is based on the standard GEV link with .

5 Empirical data on life insurance

To illustrate the applicability of the binary regression models previously discussed, we drew observations of policyholders during calendar year 2015 from an insurance portfolio of a major insurer. In order to preserve confidentiality, we took subsamples of 127,777 observations for training our models and 18,254 observations for testing the various estimated models. During the calendar year, we are able to observe whether the policyholder died or survived during the year. The observed mortality is highly imbalanced with 1,720 actual deaths observed in our training set and 247 actual deaths observed in our test set. Each observation is described by 9 attributes of which 6 are considered categorical and 3 are numerical variables.

5.1 Data description

Table 3 shows the description and summary statistics of each categorical variable used in our mortality investigation. We regroup the variable State into 8 different geographic regions and create a new categorical variable Regions in order to simplify the number of levels to describe geographic location. For many life insurance contracts, secondary guarantees are offered to ensure some death benefits are paid even if the cash surrender value of the policy falls to zero. From our portfolio, we observe six levels to describe the kind of secondary guarantees that is purchased with the contract. The insured’s sex indicator Gender is also a discrete variable with 2 levels, Male and Female, with 52.75% Male and the rest are Female. Smoker Status indicates the insured’s smoking status at policy issue with Smokers, Non-Smokers, and

Unismokers. It is not uncommon practice to have simplified underwriting classifying a policyholder as Unismoker, in the absence of information about smoking habit and for which a blended smoker and non-smoker mortality rates are used. The Line of business (LOB) variable broadly classifies the type of life insurance business the contract is issued. Finally, the categorical variable

Plan has 7 different levels: Universal Life with Long Term Care (UL_LTC), Term Insurance Plan (Term), Universal Life with Secondary Guarantees (UL_SG) and without Secondary Guarantees (UL_NSG), Variable Life with Secondary Guarantees (VUL), Whole Life Policy and Corporate-Owned Life insurance (COLI). This categorization can be quite unique to the type of contracts issued by an insurance company.

Summary statistics of the numerical variables are in Table 4. The issue ages range from as young as newborn to as old as 90, with an average age at issue of 37 years. The face amount is the amount of death benefit and although this amount may vary according to the type of contract, most of our policies in our dataset have level face amounts. The duration, measured in years, refers to the number of years since policy issue, as of the period of observation. All three continuous variables can have direct effect on mortality, for example, it is well known that mortality rate increases with age.

In this paper, we normalize these three continuous variables: Issue Age, Face Amount and Duration. For Issue Age and Duration, we rescale the variables so that the range is in , for which the general formula is given by:

where is the original value and is the normalized value. However, for Face Amount, the variable is highly skewed understandably because of some policies with unusually high values; its magnitude has to be narrowed down. In this case, we take logarithm of the total data followed by rescaling the data:

Categorical variables Description Proportions
Regions Region of issue Northest_NewEngland 5.57%
Northest_MidAtlantic 14.36
Midwest 23.13%
South_Atlantic 24.45%
South_SouthCentral 14.19%
West_Mountain 4.92%
West_Pacific 16.17%
Foreign 0.70%
SG_IND Secondary guarantee indicator Brokerage 0.45%
Legacy 1.03%
UL_LTC 6.97%
NSG 31.28%
No Information 52.11%
SG 3.16%
Gender Insured’s sex Female 42.75%
Male 57.25%
Smoker Status Insured’s smoking status Smoker 6.52%
Nonsmoker 75.19%
Unismoker 18.29%
LOB Line of business ISL 20.75%
TRAD 20.09%
EM 1.04%
No Information 52.11%
Plan Plan type UL_LTC 8.44%
Term 29.07%
UL_NSG 31.83%
UL_SG 1.05%
VUL 5.89%
Whole Life 22.68%
COLI 1.04%
Table 3: Mortality data description and summary of categorical variables
Numerical variables Minimum

1st Quantile

Mean Median 3rd Quantile Maximum
Issue Age 0 26 37 39 51 90
Face Amount (in $) 0 25,000 314,902 100,000 250,000 12,000,000
Duration (in years) 0.3 2.5 5.6 4.5 7.5 23.8
Table 4: Mortality data description and summary of numerical variables

5.2 Numerical results

We calibrated the binary regression models under the six different link functions as earlier described. We estimated the parameters using the training dataset; the test dataset has been used for validation. We present the results separately for the standard link functions (logit, probit, cloglog) and the skewed link functions (standard GEV, skewed Weibull, and Frećhet). We will call the standard link models as frequentist models while the skewed link models as Bayesian models. This differentiation has to do with reference to the approach used to estimate the parameters. Note that for the categorical variables, the reference level should be clear from the results presented. For example, in the case of Plan type, the category COLI has been used; in the case of Line of business (LOB), the category EM has been used.

Table 5 provides the parameter estimates for the reduced frequentist models. First, we fit a full model with all the predicted variables included in the model, then the reduced model is deployed with insignificant variables excluded in the model. Interestingly, in terms of model comparison statistics, the probit model has the worst log likelihood, KS, and MAE, but it has the best BIC statistic. As the probit model rejects many predictor variables, its low BIC appears to intuitively make sense. The probit model would be appealing to those seeking for the simplest possible model, and yet still considered to be competitive. The probit model chooses only IssueAge and Duration for significant variables, and the parameter estimates for this model intuitively make sense. The probability of death increases with IssueAge and also slightly increases with policy duration.

logit probit cloglog
Parameter Estimate (s.e.) Estimate (s.e.) Estimate (s.e.)
Intercept -8.1418 (0.8751) -4.8315 (0.1630) -10.8632 (0.5227)
GenderMale 0.4493 (0.1473) 0.4922 (0.1434)
IssueAge 7.5972 (0.5803) 3.2135 (0.2098) 8.1890 (0.5719)
SmokerStatusS 0.6936 (0.1940) 0.5500 (0.1834)
SmokerStatusUni -0.1080 (0.2607) -0.2090 (0.2545)
Duration 0.0767 (0.0074) 0.0423 (0.0022) 0.0937 (0.0066)
FaceAmount -3.0868 (1.0348)
Plan_typeUL_LTC -0.3088 (0.3109) -0.4393 (0.3144)
Plan_typeTERM -0.8083 (0.3736) -1.1180 (0.4024)
Plan_typeUL_NSG -0.1871 (0.2199) -0.0905 (0.2077)
Plan_typeUL_SG -1.4596 (1.0371) -0.0670 (0.4984)
Plan_typeVUL -0.7097 (0.5478) -0.3484 (0.4577)
-log Lik 1044.8173 1068.5629 1039.393
BIC 2207.3802 2166.5623 2186.719
KS 0.9994 0.9999 0.9992
MAE 0.0251 0.0254 0.0253
Table 5: Parameter estimates for the reduced frequentist models

Table 6 presents the parameter estimates for the reduced Bayesian models. As stated before, one statistic used to compare Bayesian models is the DIC, which describes the trade-off between quality of the model fit to the data and the complexity of the model. The model that gives the smaller DIC is generally preferred as it is better supported by the observed data. Among all three skewed link models, the skewed Weibull gives the best DIC value of 2126.3532. While this model has the worst KS statistic, it still outperforms the other two models in terms of loglikelihood and MAE values.

Standard GEV Skewed Weibull Frećhet
Parameter Estimate Estimate Estimate
95% HPD 95% HPD 95% HPD
Intercept -5.3912 -5.2117 -4.7430
(-7.9279, -3.9688) (-7.6501, -3.8083) (-8.5038, -3.2466)
IssueAge 2.0595 4.3174 1.8652
(1.0426, 3.4020) (2.8483, 6.9725) (0.9955, 3.2125)
SmokerStatusS 0.4927 0.3434 0.3651
(0.1658, 0.9199) (0.0792, 0.6566) (0.0749, 0.7524)
SmokerStatusUni 0.2028 -0.0226 0.0955
(-0.1612, 0.6155) (-0.3849, 0.2493) (-0.2651, 0.4372)
Duration 0.9030
(0.5776, 1.4798)
LOBISL 2.2296 1.9686
(0.8777, 4.0847) (0.6643, 4.2602)
LOBNoInformation 2.0458 1.6797
(0.7201, 3.6112) (0.6770, 4.4102)
LOBTRAD 1.6730 1.4174
(0.3112, 3.2024) (0.1673, 3.8592)
Plan type UL_LTC -1.5931 -0.2598 -1.6166
(-2.5892, -0.8178) (-0.6591, 0.0973) (-2.6182, -0.8015)
Plan type TERM -1.8061 -0.5262 -2.8171
(-3.1911, -0.9376) (-1.1886, -0.0119) (-5.0451, -1.2020)
Plan type UL NSG -1.2067 -0.2039 -1.2311
(-1.9193, -0.6662) (-0.4883, 0.0602) (-1.9962, -0.6076)
Plan type UL SG -2.1475 -1.4483 -2.4779
(-3.7416, -0.9047) (-3.6883, -0.0552) (-4.4446, -1.0303)
Plan type VUL -1.8975 -0.5691 -2.4099
(-3.2817, -0.9542) (-1.4931, 0.0668) (-4.1592, -1.0580)
shape parameter -0.1262 1.3185 3.8392
(-0.2498, 0.0266) (1.0112, 1.6081) (2.7269, 5.0725)
DIC 2338.5129 2126.3532 2348.6408
-log Lik 1153.3895 1055.0456 1165.1385
KS 0.9990 0.9997 0.9980
MAE 0.0263 0.0251 0.0262
Table 6: Parameter estimates for the reduced Bayesian models

Interestingly, as in the frequentist models, the skewed Weibull link model also is the simplest model among all three skewed link models as it rejected more predictors than any other. In particular, both policy duration and LOB are considered not significant predictors under the skewed Weibull link model. We can roughly interpret the estimated coefficients of the selected predictor variables for the skewed link model, and it appears to be quite intuitive. First, the positive sign for the IssueAge indicates a greater likelihood of death for older issue ages. Second, the positive sign for SmokerStatusS indicates a worse mortality for smokers than for non-smokers. The negative sign for SmokerStatusUni might be counterintuitive, however, this is more challenging to interpret as this smoking status is about the absence of smoking habits. Finally, all the signs for the Plan types are negative, which indicates that Plan type COLI has the best mortality among all plan types. Because it has the largest negative sign, Plan type UL SG has the worse mortality among all types. Notice further that for all Bayesian models, the FaceAmount was not considered a significant predictor of death. There are two possible explanations to this. First, the effect is depleted when we transform and normalize the variable. Second, the amount of death benefit can be directly linked to the degree of underwriting adopted prior to issue. In which case, the mortality selection effect can be explained by the policy duration.

If we wish to compare the frequentist models to the Bayesian models, we could examine the common model statistics such as the loglikehood, KS, and MAE values. For all models, the KS and MAE statistics are relatively comparable. However, in terms of loglikelihood, the Bayesian models outperform all the frequentist models. In particular, the skewed Weibull link model has the smallest value for all of them.

Amount of death benefits Death counts
Actual amount $ 33,992,610 Actual count 247
Estimated Error rate Estimated
logit $ 28,653,690 15.7% logit 19
probit $ 42,883,180 26.2% probit 17
cloglog $ 37,309,376 9.8% cloglog 19
Standard GEV $ 43,623,689 28.3% Standard GEV 252
Skewed Weibull $ 36,338,219 6.9% Skewed Weibull 248
Frećhet $ 39,838,452 17.2% Frećhet 250
Table 7: Aggregated amount of death benefits and death counts for the various link models
Figure 2:

Mean and confidence intervals of the predictions from the Bayesian models for cumulative count and amount

To further investigate the predictive power of the various models, we compare the prediction of death counts together with the corresponding death benefit amount. Table 7 presents the aggregated death benefits and counts for the various link models using the test dataset. Accordingly, our test dataset has a total death count of 247, out of 18,254 observations; the corresponding benefit amount sums to approximately 34 million dollars. In terms of death counts, as we expected, the frequentist models could hardly come close to the true actual death counts; all three models underestimate the number of deaths. On the other hand, the skewed link models have predictions that are generally close to the actual death counts as shown in the table. In particular, the skewed Weibull, not surprisingly, marginally outperforms the other two models in terms of death counts, but is far superior in terms of predicting the corresponding death benefits. A visualization of the superiority of the skewed Weibull link model in terms of prediction is presented in Figure 2.

6 Concluding remarks

This article was driven by a purpose of developing a mortality investigation for tracking and monitoring death claims experience of a portfolio of life insurance contracts over a specified period. In this case, our primary interest is in developing a suitable binary regression model for predicting mortality, given a set of available covariate information. Our period of investigation is primarily short term, in this case, one year. Such is typical in practice where insurers need to predict the level of mortality for the following year given characteristics of their portfolio at the beginning of the year. However, this is particularly a challenging task as the observed mortality for a cohort of policies is expected to be a rare event.

This paper presents a methodology to handle extremely imbalanced binary outcome. In particular, we focus on the latent variable interpretation of the binary regression model and examine alternative functions that link the mean of the binary outcome to the predictor variables. We find that suitable link functions are those that have the flexibility to handle skewness, and we find three possible skewed link models: standard GEV, skewed Weibull, and Frećhet. We performed simulation studies and calibrated models using empirical data observed from a portfolio of life insurance contracts. Based on the model estimates and predictions, we find that all three models generally outperform frequentist models with standard link functions that include logit, probit, and cloglog. We also presented a Bayesian approach to the estimation of the skewed link models. As pointed out in this paper, the latent variable interpretation connects the various link models but it also has computational advantages especially when Bayesian inference is used for estimation such as the use of Metropolis-Hastings algorithm, an MCMC method.

The models and theoretical framework presented here can be extended to mortality tracking and monitoring on a more frequent basis than annually, for instance quarterly. This is because insurance companies publish quarterly financials to rating analysts and death benefits are a significant portion of an insurer’s financials. The ability to understand and project death benefits on a quarterly basis is important to explain fluctuations in financials driven by mortality claims. However, when doing more frequent mortality tracking and monitoring than annual, it is important to distinguish between non-significant fluctuations in mortality and mortality fluctuations that represent a longer term trend. It would be interesting to explore this issue using the models we have developed in this paper. In addition, there is also the promise of using neural networks to handle imbalanced binary outcomes, and such would be an interesting future direction of research. See

Dreiseitl and Ohno-Machado (2002) and Wang et al. (2016).


The following useful results provide detailed proof that in the case of the Frećhet link model that the resulting posterior distribution is proper.

Proposition 1.

Let be the design matrix with full rank and be the matrix with the th row of , where , for . Then there exists a positive vector with such that . Under the prior given by with , the posterior distribution stated in (15) is proper.


Let be independent random variables with common Frećhet distribution with , and shape parameter . For , it can be shown that for . Observing that and , where is the indicator function. Then




Let . Using Fubini’s Theorem, we obtain


Directly following from Lemma 4.1 of Chen and Shao (2001), and under the condition that and for , there exists a constant such that



  • J. H. Albert and S. Chib (1993) Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association 88 (422), pp. 669–679. Cited by: §2.1.
  • F. J. Aranda-Ordaz (1981) On two families of transformations to additivity for binary response data. Biometrika 68 (2), pp. 357–363. Cited by: §1.
  • B. C. Arnold and R. A. Groeneveld (1995) Measuring skewness with respect to the mode. The American Statistician 49 (1), pp. 34–38. Cited by: §2.2.
  • H. Booth and L. Tickle (2008) Mortality modelling and forecasting: a review of methods. Annals of Actuarial Science 3 (1-2), pp. 3–43. Cited by: §1.
  • G. E.P. Box and G. C. Tiao (1973) Bayesian inference in statistical analysis. Addison-Wesley Publishing Company: Reading, MA. Cited by: §4.
  • R. Caron, D. Sinha, D. Dey, and A. Polpo (2018) Categorical data analysis using a skewed Weibull regression model. Entropy 20 (176), pp. 1–17. Cited by: §1, §3.2.
  • M. Chen, D. K. Dey, and Q. Shao (1999) A new skewed link model for dichotomous quantal response data. Journal of the American Statistical Association 94 (448), pp. 1172–1186. Cited by: §1.
  • M. Chen and Q. Shao (1999) Monte carlo estimation of Bayesian credible and HPD intervals. Journal of Computational and Graphical Statistics 8 (1), pp. 69–92. Cited by: §4.
  • M. Chen and Q. Shao (2001) Propriety of posterior distribution for dichotomous quantal response models. Proceedings of the American Mathematical Society 129 (1), pp. 293–302. Cited by: Appendix.
  • D. R. Cox and E.J. Snell (1989) Analysis of binary data. Chapman & Hall/CRC Press, Boca Raton, FL, USA. Cited by: §2.
  • D. R. Cox (1972) Regression models and life-tables. Journal of the Royal Statistical Society: Series B (Methodological) 34 (2), pp. 187–202. Cited by: §1.
  • C. Czado (1994) Parametric link modification of both tails in binary regression. Statistical Papers 35 (1), pp. 189–201. Cited by: §1, §2.2.
  • S. Dreiseitl and L. Ohno-Machado (2002) Logistic regression and artificial neural network classification models: a methodology review. Journal of Biomedical Informatics 35 (5-6), pp. 352–359. Cited by: §6.
  • A. Gelman, J. B. Carlin, H. S. Ster, D. B. Dunson, A. Vehtari, and D. B. Rubin (2014) Bayesian data analysis. CRC Press, Boca Raton, FL USA. Note: Citation count from Google Scholar: 723.

    This book ranked 15th in SIAM’s top selling titles in 2009. This book is also used as a textbook for the following course: ECOM 6349 Selected Topics in Artificial Intelligence: Cluster Analysis, Islamic University of Gaza (IUG), Palestine.

    Excerpts from a book review by Jin-Hong Park, Technometrics, 53(2) 2011: “…This monograph mainly focuses on a number of popular clustering algorithms and groups them according to some specific baseline methodologies, such as hierarchical, center-based, and search-based methods. … Overall, I am very impressed with this book loaded with lots of information on theory, algorithms, and application of data clustering. Data Clustering: Theory, Algorithms, and Applications is a nice piece of work for cluster analysis, containing many interesting theories, computational algorithms, and examples from various applications. Therefore, this monograph is good for statistics, applied mathematics, and computer science senior undergraduates and graduates. Also, it will be beneficial for research scientists who need cluster analysis to deal with real datasets.
    Excerpts from a book review by David J. Hand, International Statistical Review, 76(1), 2008: “ …The book also has excellent discussions of the basic elements which feed into cluster analysis tools, covering such things as data types, scale conversion, standardisation, visualisation, and distance and dissimilarity measures. …The quality of the book is such that I would expect it to be reprinted, or to appear in new editions….Apart from this, the book provides excellent coverage of the area, and I thoroughly recommend it. I have already ordered another copy for a student of mine who is beginning to work in clustering.
    Cited by: §2.1.
  • B. Gompertz (1825) On the nature of a function expressive of the law of human mortality, and on a new mode of determining the value of life contingencies. Philosophical Transactions of the Royal Society of London (Series A) 115, pp. 513–585. Cited by: §1.
  • V. M. Guerrero and R. A. Johnson (1982) Use of the Box-Cox transformation with binary response models. Biometrika 69 (2), pp. 309–314. Cited by: §1.
  • S. Haberman and A. E. Renshaw (1996) Generalized linear models and actuarial science. The Statistician 45 (4), pp. 407–436. Cited by: §1.
  • W. K. Hastings (1970) Monte carlo sampling methods using markov chains and their applications. Oxford University Press. Cited by: §3.3.
  • P. D. Hoff (2009) A first course in bayesian statistical methods. Springer: New York. Cited by: §3.5.
  • S. Kim, M. Chen, and D. K. Dey (2007) Flexible generalized -link models for binary response data. Biometrika 95 (1), pp. 93–106. Cited by: §1.
  • P. McCullagh and J. A. Nelder (1989) Generalized linear models. Chapman and Hall: London. Cited by: §1, §2, §3.
  • D. McFadden (1978) Modeling the choice of residential location. Transportation Research Record 673, pp. 72–77. Cited by: §2.2.
  • R. Meyricke and M. Sherris (2013) The determinants of mortality heterogeneity and implications for pricing annuities. Insurance: Mathematics and Economics 53, pp. 379–387. Cited by: §1.
  • W. Olbricht (2012) Tree-based methods: a useful tool for life insurance. European Actuarial Journal 2 (1), pp. 129–147. Cited by: §1.
  • D. Pregibon (1980) Goodness of link tests for generalized linear models. Journal of the Royal Statistical Society: Series C (Applied Statistics) 29 (1), pp. 15–24. Cited by: §1.
  • G. Schwarz (1978) Estimating the dimension of a model. The Annals of Statistics 6 (2), pp. 461–464. Cited by: §3.4.
  • D. J. Spiegelhalter, N. G. Best, B. P. Carlin, and A. Van Der Linde (2002) Bayesian measures of model complexity and fit. Journal of the Royal Statistical Ssociety: Series b (Statistical Methodology) 64 (4), pp. 583–639. Cited by: §3.4.
  • T. A. Stukel (1988) Generalized logistic models. Journal of the American Statistical Association 83 (402), pp. 426–431. Cited by: §1.
  • S. Wang, W. Liu, J. Wu, L. Cao, Q. Meng, and P. J. Kennedy (2016) Training deep neural networks on imbalanced data sets. In IJCNN 2016: Proceedings of the International Joint Conference on Neural Networks, pp. 4368–4374. Cited by: §6.
  • X. Wang and D. K. Dey (2010) Generalized extreme value regression for binary response data: an application to b2b electronic payments system adoption. The Annals of Applied Statistics 4 (4), pp. 2000–2023. Cited by: §1, §3.2.
  • S. Yin, G. Gan, E. A. Valdez, and J. Vadiveloo (2020) Cluster analysis of death claims experience. Note: Working paper Cited by: §1.
  • Z. Zhu, M. Hoag, S. Julien, and S. Cui (2001) Estimating mortality of insured advanced-age population with Cox regression model. Risk Management. Note: Transamerica Reinsurance, Working paper Cited by: §1.
  • Z. Zhu, Z. Li, D. Wylde, M. Failor, and G. Hrischenko (2015) Logistic regression for insured mortality experience studies. North American Actuarial Journal 19 (4), pp. 241–255. Cited by: §1.