1 Introduction
The prediction and interpretation of mortality rates have long been an important aspect of risk management in the life insurance industry. For several decades, different models have been applied in actuarial work to investigate the possibilities of explaining mortality. The early work of Gompertz (1825) provided a parametric form that explains the exponential increase in mortality with age. More recent work includes Haberman and Renshaw (1996) and Zhu et al. (2001), which examined applications of the Cox proportional hazards model (Cox (1972)) in analyzing the mortality of life insurance policies. The Cox regression model is a tool for understanding the effects of the presence of risk factors, such as age, gender, policy duration, and smoking habits, on mortality. Olbricht (2012) offered an interesting alternative approach using tree-based models to construct the life table. Accordingly, tree-based models have the advantages of “ease of predictability” and “transparency.” Meyricke and Sherris (2013) proposed the generalized linear mixed models (GLMMs) framework for mortality modeling to incorporate longitudinal variables, allowing for both underwriting factors and frailty.
Booth and Tickle (2008) discussed an array of work in the literature on mortality modeling, especially work pertaining to forecasting. This paper has a different objective for studying mortality. In particular, we are interested in tracking and monitoring the death claims experience of a portfolio of life insurance policies. As pointed out in Yin et al. (2020), such a tracking mechanism can help insurers take necessary actions to mitigate the immediate economic impact of any deviations from expectations. An interesting related work, Zhu et al. (2015), applies logistic regression models to understand mortality trends, slopes, and differentials. An additional challenge of mortality modeling, especially as related to periodic claims monitoring, is that death is a rare event with a very tiny probability; there is a significant difference between the number of deaths and the number of survivors. For example, the mortality rate observed during a policy year, according to our insured portfolio dataset, is only about 1.3%; in a binary model setup, such a response is characterized as highly imbalanced (or skewed). In this paper, we focus on capturing this high skewness, which other methods may not be able to model appropriately.
It is well known that binary regression models fall within the class of generalized linear models (GLMs) using the Bernoulli distribution. See McCullagh and Nelder (1989). The commonly used functions linking the mean parameter to the predictor variables are the symmetric link functions, logit and probit, leading, respectively, to the logistic and probit regression models. Both link functions work reasonably well for observations with balanced binary response outcomes. Suppose we are given a set of observations expressed as $\{(y_i, \mathbf{x}_i), i = 1, \ldots, n\}$, where $y_i$ is a binary outcome and $\mathbf{x}_i$ is a set of predictor variables. The binary regression model has a latent variable representation where $y_i$, for observation $i$, is related to an unobserved variable $z_i$ as $y_i = 1(z_i > 0)$. The latent variable $z_i$ is directly linked to the predictor variables as a linear model with an error component, $z_i = \mathbf{x}_i^{\top}\boldsymbol{\beta} + \epsilon_i$, where $\boldsymbol{\beta}$ is a vector of coefficients for the predictors. The error component $\epsilon_i$ has a pre-specified distribution directly related to the link function. To deal with the limitations of symmetric link functions for binary outcomes, some studies extended the parametric class, including the use of a power transformation of the logit link (Aranda-Ordaz (1981)) and the Box-Cox transformation for binary regression (Guerrero and Johnson (1982)). Other model extensions include two-parameter classes of link functions, such as the two-parameter generalization of the logit link by Pregibon (1980), the two-shape-parameter link by Stukel (1988), and the two-tailed link by Czado (1994). However, these extended classes of parametric link functions do not always fit data with an extremely imbalanced response variable well. For example, Stukel's generalized logistic model may lead to an improper posterior distribution under many types of noninformative improper priors.
Chen et al. (1999) suggested an asymmetric link function especially for when the probability of a given binary response approaches 0 at a significantly different rate than it approaches 1. This model yields a proper posterior distribution under noninformative improper priors for the regression coefficients and can handle datasets with balanced and imbalanced response variables. However, the intercept and skewness parameters are confounded with each other in this model. Kim et al. (2007) overcame this problem to some extent by constraining the range of the skewness parameter, which restricts the possibility of negative skewness of the response variable. This open problem to some extent motivates us to investigate more flexible models to accommodate such imbalances. In order to appropriately model the large skewness, Wang and Dey (2010)
proposed binary regression with the generalized extreme value (GEV) link. The GEV distribution belongs to a family of continuous probability distributions developed from extreme value theory. The GEV distribution combines the Gumbel, Fréchet, and reversed Weibull distributions into a single family.
Caron et al. (2018) presented a Weibull link for imbalanced binary regression, which is very flexible and capable of handling different types of data. The Fréchet distribution, also known as the Type II extreme value distribution, likewise has a shape parameter to control the direction and degree of skewness. In this paper, we propose the use of this family of asymmetric link functions in binary regression models for analyzing life insurance data with highly imbalanced mortality. The model parameters are estimated using a Bayesian approach. There are several advantages to the use of these link functions that make them suitable for our purposes:

With a continuous range of possible values of the shape parameter, the resulting models can automatically estimate the skewness parameter without any prior assumption on the direction of the skewness and can thus better quantify the skewness;

The commonly used link functions, such as logit, probit, and complementary log-log, can be derived as special or limiting cases of this family of link functions;

In a Bayesian framework, even under the improper priors, the posterior distribution of the parameter is still proper; and

With the introduction of latent variables, the process can be computationally efficient whereby the Markov Chain Monte Carlo (MCMC) implementation can easily be used to sample the parameters.
This paper is structured as follows. Section 2 provides a brief review of binary regression models and describes the latent variable interpretation of binary regression models that is important for comparative purposes. We also describe and review the various skewed link models that are used in this paper. In Section 3, we present ideas of Bayesian inference useful for estimation and prediction based on our proposed skewed link regression models. For purposes of being self-contained, a detailed procedure of the Metropolis-Hastings algorithm is described in this section. Section 4 discusses the simulation studies and shows the recovery work of the various proposed models. Section 5 is about the empirical data used in our model calibration and predictions. We conclude in Section 6.
2 Binary regression models
For a fixed period, our dataset can be described as $\{(y_i, \mathbf{x}_i), i = 1, \ldots, n\}$, where $\mathbf{y} = (y_1, \ldots, y_n)^{\top}$ is a vector of responses, $\mathbf{X} = (\mathbf{x}_1, \ldots, \mathbf{x}_n)^{\top}$ is a matrix of predictor variables, and $n$ is the total number of observations. The response $y_i$ is a binary outcome, with $y_i = 1$ indicating death has been observed for policyholder $i$ and $y_i = 0$ indicating survival otherwise. The vector of predictor variables is $\mathbf{x}_i = (x_{i1}, \ldots, x_{ik})^{\top}$, where $k$ is the number of observed predictor variables in the dataset.
Let $p_i = \Pr(y_i = 1 \mid \mathbf{x}_i)$ denote the probability of death, given the predictor variables. Since $y_i$ is a binary outcome, it can easily be deduced that the conditional expectation of $y_i$ given $\mathbf{x}_i$ is also $p_i$. It is well known that the binary regression model falls within the class of generalized linear models (GLMs) with a link function $g$ satisfying $g(p_i) = \mathbf{x}_i^{\top}\boldsymbol{\beta}$, where $\boldsymbol{\beta}$ is a vector of coefficients. Therefore, $p_i = g^{-1}(\mathbf{x}_i^{\top}\boldsymbol{\beta})$, and we would conveniently write this as $p_i = F(\mathbf{x}_i^{\top}\boldsymbol{\beta})$ with $F = g^{-1}$. Note that it is customary to include an intercept coefficient in the linear relationship, so that without loss of generality, we can assume that the vector of predictor variables is augmented as $\mathbf{x}_i = (1, x_{i1}, \ldots, x_{ik})^{\top}$. See McCullagh and Nelder (1989) and Cox and Snell (1989).
2.1 Latent variable interpretation
The binary regression model can be re-expressed and reinterpreted in terms of latent (or unobserved) variables, herewith denoted by $z_i$. Define these latent variables so that the binary outcome is

(1) $y_i = \begin{cases} 1, & z_i > 0, \\ 0, & z_i \le 0, \end{cases}$

where

(2) $z_i = \mathbf{x}_i^{\top}\boldsymbol{\beta} + \epsilon_i$

and $\epsilon_i \sim F$. Here $F$ is the cumulative distribution function of the error component $\epsilon_i$, which is often assumed to be a continuous random variable. It follows therefore that

(3) $p_i = \Pr(y_i = 1 \mid \mathbf{x}_i) = \Pr(z_i > 0 \mid \mathbf{x}_i) = 1 - F(-\mathbf{x}_i^{\top}\boldsymbol{\beta}).$

Note that in the case where $F$ is the distribution function of a symmetric random variable with mean 0, equation (3) becomes $p_i = F(\mathbf{x}_i^{\top}\boldsymbol{\beta})$. In this case, $F^{-1}$ determines the link function in the GLM framework. This latent variable representation is believed to have been introduced by Albert and Chib (1993).
Examples of distributions that belong to the class of symmetric distributions include the standard logistic, standard normal, and the standard Gumbel distributions. This leads, respectively, to the three commonly used link functions in binary regression (without loss of generality, the subscript $i$ is dropped):

Logistic regression: When $\epsilon$ has a logistic distribution with mean 0 and scale parameter 1, i.e., $\epsilon \sim \text{Logistic}(0, 1)$, we have $p = e^{\mathbf{x}^{\top}\boldsymbol{\beta}}/(1 + e^{\mathbf{x}^{\top}\boldsymbol{\beta}})$. The link function can be expressed as the logit function $g(p) = \log(p/(1-p))$.

Probit regression: When $\epsilon$ has a normal distribution with mean 0 and scale parameter 1, i.e., $\epsilon \sim N(0, 1)$, we have $p = \Phi(\mathbf{x}^{\top}\boldsymbol{\beta})$, where $\Phi$ denotes the standard normal distribution function. The link function can be expressed as the probit function $g(p) = \Phi^{-1}(p)$. 
cloglog regression: When $\epsilon$ has a Gumbel distribution with location 0 and scale parameter 1, we have $p = 1 - \exp(-\exp(\mathbf{x}^{\top}\boldsymbol{\beta}))$. The link function can be expressed as the cloglog function $g(p) = \log(-\log(1-p))$.
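As a concrete illustration, the three symmetric response curves $p = g^{-1}(\eta)$ described above can be computed directly; the following is a minimal Python sketch (the function names are ours, not from the paper):

```python
import math

def inv_logit(eta):
    # Logistic CDF: p = e^eta / (1 + e^eta)
    return 1.0 / (1.0 + math.exp(-eta))

def inv_probit(eta):
    # Standard normal CDF, written via the error function
    return 0.5 * (1.0 + math.erf(eta / math.sqrt(2.0)))

def inv_cloglog(eta):
    # Gumbel-based complementary log-log response: p = 1 - exp(-exp(eta))
    return 1.0 - math.exp(-math.exp(eta))
```

At $\eta = 0$ the logit and probit responses both give $p = 0.5$, while the cloglog response gives $p = 1 - e^{-1} \approx 0.632$, a first hint of its asymmetry.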
There are several advantages to the latent variable representation of the binary regression. First, it allows for a natural interpretation. To illustrate, in the mortality context where $y = 1$ is death and $y = 0$ is survival, the latent variable $z$ may be viewed as a continuous measure of mortality. Second, the latent variable allows for a natural cutpoint, at zero, for splitting the prediction for $z$ into one of the binary outcomes. Third, it has computational advantages, particularly when Bayesian inference is used for estimation, such as the use of an MCMC algorithm that may include Metropolis-Hastings and Gibbs sampling. See Gelman et al. (2014). While direct likelihood estimation is sufficient for simple models such as logistic and probit regression, Bayesian computation is particularly helpful for other forms of link functions, such as the asymmetric link functions introduced in the subsequent subsections.
2.2 Skewed link functions
The use of symmetric link functions in binary regression may be considered too restrictive, especially for observations with an imbalanced response, one where there is a considerably large proportion of one outcome. In the case of imbalanced data, the link function can be misspecified, which can lead to inconsistent and biased model estimates and therefore an increase in the values of mean squared errors. See Czado (1994). This paper focuses on using parametric link functions based on distributions that are not necessarily symmetric. In particular, we examine and adopt skewed link functions derived from three families of distributions that can accommodate large skewness to handle the imbalanced response. They are the generalized extreme value (GEV) distributions, the skewed Weibull distributions, and the Fréchet distributions.
For our purposes, we will consider three families of asymmetric or skewed link functions that are based on the generalized extreme value (GEV) distributions. The GEV cumulative distribution function has the general form

(4) $G_{\xi}(x) = \exp\left\{-\left[1 + \xi\left(\frac{x - \mu}{\sigma}\right)\right]^{-1/\xi}\right\},$

provided $1 + \xi(x - \mu)/\sigma > 0$, where the parameters satisfy $\mu \in \mathbb{R}$, $\sigma > 0$, and $\xi \in \mathbb{R}$. This GEV distribution function can be derived from a limiting approximation to the distribution of the maxima of a sequence of random variables. This form has been attributed to the original work of McFadden (1978).
After some preliminary investigation of suitable link functions for our purposes, we have concluded to consider the following three families of skewed link functions:

Standard GEV link functions: When the location is $\mu = 0$ and the scale is $\sigma = 1$, we will call this the standard GEV distribution, so that its distribution function in this case is

(5) $G_{\xi}(x) = \exp\left\{-(1 + \xi x)^{-1/\xi}\right\}.$

We will not preclude the case where $\xi = 0$, in which case the distribution has the simplified form $G_{0}(x) = \exp(-e^{-x})$. This is the case of the Gumbel link functions, based on the Gumbel or Type I extreme value distribution. With $p = 1 - G_{\xi}(-\mathbf{x}^{\top}\boldsymbol{\beta})$, the link function can be expressed as $g(p) = -G_{\xi}^{-1}(1 - p)$.

Skewed Weibull link functions: In this case, we have

(6) $F_{\gamma}(x) = 1 - \exp(-x^{\gamma}),$

defined only for $x > 0$; its value will be zero elsewhere. This is actually the standard form of the Weibull distribution when the location is $\mu = 0$ and the scale is $\sigma = 1$, which is indirectly implied from the GEV form in (4) above. The reverse Weibull, also called the Type III extreme value distribution, can be derived directly from this GEV distribution; the derivation of the Weibull considers the minima of a sequence of random variables, in contrast to the maxima as in the GEV. With $p = F_{\gamma}(\mathbf{x}^{\top}\boldsymbol{\beta})$, the link function can be expressed as $g(p) = [-\log(1-p)]^{1/\gamma}$.

Fréchet link functions: When the location is $\mu = 0$, the scale is $\sigma = 1$, $\xi > 0$, and after some transformation of the argument, we arrive at the Fréchet distribution function of the form

(7) $F_{\nu}(x) = \exp\left(-x^{-\nu}\right),$

provided $x > 0$; it is defined to be 0 elsewhere. This distribution is sometimes referred to as the Type II extreme value distribution. With $p = F_{\nu}(\mathbf{x}^{\top}\boldsymbol{\beta})$, the link function can be expressed as $g(p) = (-\log p)^{-1/\nu}$.
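To make the three candidate links concrete, here is a minimal Python sketch of the three distribution functions (standard GEV with shape $\xi$, skewed Weibull with shape $\gamma$, and Fréchet with shape $\nu$); the function names and parameter symbols are our own notation:

```python
import math

def gev_cdf(x, xi):
    """Standard GEV CDF (location 0, scale 1); xi = 0 gives the Gumbel case."""
    if xi == 0.0:
        return math.exp(-math.exp(-x))
    t = 1.0 + xi * x
    if t <= 0.0:
        # Outside the support: CDF is 0 (xi > 0) or 1 (xi < 0)
        return 0.0 if xi > 0 else 1.0
    return math.exp(-t ** (-1.0 / xi))

def weibull_cdf(x, gamma):
    """Standard Weibull CDF with shape gamma; zero for x <= 0."""
    return 1.0 - math.exp(-(x ** gamma)) if x > 0 else 0.0

def frechet_cdf(x, nu):
    """Standard Frechet CDF with shape nu; zero for x <= 0."""
    return math.exp(-(x ** (-nu))) if x > 0 else 0.0
```

Each function can serve as the response curve of the corresponding skewed link, with the shape parameter left free to be estimated from the data.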
Briefly, let us examine some properties of each of these distributions. For the standard GEV, the skewness measure proposed by Arnold and Groeneveld (1995) is given by $\gamma_M = 1 - 2F(M)$, where $M$ is the mode of the distribution. Based on this definition, the flexibility of the link function is reflected in the skewness expressed explicitly as $\gamma_M = 1 - 2e^{-(1+\xi)}$, which depends only on the shape parameter $\xi$. The mode is evaluated at $M = [(1+\xi)^{-\xi} - 1]/\xi$.
For the skewed Weibull distribution, the shape parameter $\gamma$ controls the skewness. Similar to the GEV link, the skewness can also be computed based on the mode $M = [(\gamma - 1)/\gamma]^{1/\gamma}$ (for $\gamma > 1$), which relies only on the shape parameter $\gamma$. Based on the formula $\gamma_M = 1 - 2F(M) = 2e^{-(\gamma-1)/\gamma} - 1$, the skewness lies somewhere in the interval $(2e^{-1} - 1, 1]$.
The complementary log-log (cloglog) model, which has been considered as an alternative to the logistic and probit models, is frequently used when the probability of an event is relatively small or relatively large. The complementary log-log link is a special case of the Weibull link, with corresponding cumulative distribution function $F(x) = 1 - \exp(-e^{x})$. Consider the general form of the Weibull distribution expressed as $F(x) = 1 - \exp\{-[(x - \mu)/\sigma]^{\gamma}\}$. If we set $\mu = -\gamma$ and divide by $\sigma = \gamma$, then letting $\gamma \to \infty$ we have

(8) $\lim_{\gamma \to \infty}\left[1 - \exp\left\{-\left(1 + \frac{x}{\gamma}\right)^{\gamma}\right\}\right] = 1 - \exp(-e^{x}).$

Although considered a skewed link, the cloglog, with skewness defined as $1 - 2F(M) = 2e^{-1} - 1$, does not provide enough flexibility since the mode $M = 0$ is a constant that does not incorporate a shape parameter.
Finally, for the Fréchet distribution, the skewness can be computed through the mode using the relationship $\gamma_M = 1 - 2F(M)$, where the mode is $M = [\nu/(1+\nu)]^{1/\nu}$. Thus, the skewness of this distribution is $1 - 2e^{-(1+\nu)/\nu}$, which relies only on the shape parameter $\nu$ and lies in the interval $(1 - 2e^{-1}, 1)$. This explains why this can be considered a flexible link function. Another important feature of the shape parameter arises from the fact that it affects the shape of the distribution by purely controlling the tail behavior, rather than simply shifting or stretching it.
Figure 1 displays the probability densities of the asymmetric distributions described in this subsection (standard GEV, Weibull, and Fréchet) under different shape parameters. This allows us to visualize the type of skewness that can be derived from each of these distributions.
3 Bayesian inference
For the binary regression models with standard link functions (logit, probit, and cloglog), the parameters can be estimated by maximum likelihood using either Newton's method or iteratively reweighted least squares (IRLS). See McCullagh and Nelder (1989). For our binary regression based on the skewed link models, we use a Bayesian approach, with approximation of the posterior distribution based on the Metropolis-Hastings algorithm, a type of Markov chain Monte Carlo (MCMC) simulation method.
3.1 Posterior distribution and inference
In Bayesian statistics, the posterior distribution of the parameters is the conditional probability of the parameters given relevant evidence or information, which is usually the dataset. Let $\pi(\boldsymbol{\theta} \mid D)$ denote the posterior distribution of the parameter $\boldsymbol{\theta}$, given the dataset $D$, and let $L(\boldsymbol{\theta} \mid D)$ denote the likelihood function of $\boldsymbol{\theta}$, where $f(D \mid \boldsymbol{\theta})$ is the probability density function of the data, given $\boldsymbol{\theta}$. Using Bayes' Theorem, if we assign a prior distribution $\pi(\boldsymbol{\theta})$ to $\boldsymbol{\theta}$, we have

(9) $\pi(\boldsymbol{\theta} \mid D) = \frac{L(\boldsymbol{\theta} \mid D)\,\pi(\boldsymbol{\theta})}{m(D)},$

where $m(D) = \int L(\boldsymbol{\theta} \mid D)\,\pi(\boldsymbol{\theta})\,d\boldsymbol{\theta}$ is the normalizing constant, which is free of $\boldsymbol{\theta}$. Suppose that we sample $T$ independent, random values $\boldsymbol{\theta}^{(1)}, \ldots, \boldsymbol{\theta}^{(T)}$ from the posterior distribution $\pi(\boldsymbol{\theta} \mid D)$. Then according to the Law of Large Numbers,

(10) $\frac{1}{T}\sum_{t=1}^{T} h(\boldsymbol{\theta}^{(t)}) \longrightarrow E[h(\boldsymbol{\theta}) \mid D] \quad \text{as } T \to \infty,$

where $h$ is any function of $\boldsymbol{\theta}$. As $T$ gets larger, the sample mean converges to its expectation. In this paper, the samples $\boldsymbol{\theta}^{(t)}$ are generated from the posterior distribution of equation (9) using the MCMC (Metropolis-Hastings) algorithm, which is discussed in detail in Section 3.3.
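Equation (10) is just a Monte Carlo average over posterior draws. The following toy sketch illustrates the convergence, with draws from a known normal distribution standing in for posterior samples (the distribution and sample size here are illustrative only):

```python
import random

random.seed(1)
# Stand-in "posterior draws": here simply N(2, 0.5^2) samples.
draws = [random.gauss(2.0, 0.5) for _ in range(100_000)]

def posterior_expectation(h, draws):
    # Law of Large Numbers: (1/T) * sum h(theta_t) -> E[h(theta) | D]
    return sum(h(t) for t in draws) / len(draws)

mean_est = posterior_expectation(lambda t: t, draws)           # near 2.0
second_moment = posterior_expectation(lambda t: t * t, draws)  # near 4.25
```

Any posterior summary (mean, variance, tail probability) is obtained by choosing the appropriate $h$.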
3.2 Choice of priors for the skewed link models
As earlier introduced, we denote our observed dataset as $D$, and for purposes of the subsequent discussions, we shall denote our set of parameters as $\boldsymbol{\theta} = (\boldsymbol{\beta}, \lambda)$, where without loss of generality, $\boldsymbol{\beta}$ corresponds to the regression coefficients for the predictor variables and $\lambda$ denotes any additional parameter arising from the skewed link models. Given the data $D$, the likelihood of the parameters can be written as:

(11) $L(\boldsymbol{\beta}, \lambda \mid D) = \prod_{i=1}^{n} p_i^{y_i} (1 - p_i)^{1 - y_i}.$

According to equation (9), if the joint prior of $(\boldsymbol{\beta}, \lambda)$, $\pi(\boldsymbol{\beta}, \lambda)$, is given, then the joint posterior distribution for $(\boldsymbol{\beta}, \lambda)$ can be written as follows:

(12) $\pi(\boldsymbol{\beta}, \lambda \mid D) \propto L(\boldsymbol{\beta}, \lambda \mid D)\,\pi(\boldsymbol{\beta}, \lambda).$
Consider the case of the standard GEV link model, where the additional parameter is the shape parameter $\xi$. If independent uniform priors are assigned to $\boldsymbol{\beta}$ and $\xi$, the resulting posterior distribution is still proper (Wang and Dey (2010)). Assuming independent priors for $\boldsymbol{\beta}$ and $\xi$, equation (12) becomes

(13) $\pi(\boldsymbol{\beta}, \xi \mid D) \propto L(\boldsymbol{\beta}, \xi \mid D)\,\pi(\boldsymbol{\beta})\,\pi(\xi).$

In this paper, we assume independent multivariate normal and univariate normal priors for $\boldsymbol{\beta}$ and $\xi$, respectively.
Consider the case of the skewed Weibull link model, where the additional parameter is the shape parameter $\gamma$. If informative priors for $\boldsymbol{\beta}$ and $\gamma$ cannot be obtained, we can assign a uniform prior for $\boldsymbol{\beta}$ and a noninformative prior for $\gamma$. The resulting propriety of the posterior under improper priors, in this case, is proved in Caron et al. (2018). Assuming independent priors for $\boldsymbol{\beta}$ and $\gamma$, equation (12) becomes

(14) $\pi(\boldsymbol{\beta}, \gamma \mid D) \propto L(\boldsymbol{\beta}, \gamma \mid D)\,\pi(\boldsymbol{\beta})\,\pi(\gamma).$

In this paper, we assume $\boldsymbol{\beta}$ and $\gamma$ are a priori independent, and we assign a multivariate normal prior to $\boldsymbol{\beta}$ and a Gamma prior to $\gamma$.
Finally, consider the case of the Fréchet link model, where the additional parameter is the shape parameter $\nu$. When prior information is unavailable, we can consider a noninformative prior for $\boldsymbol{\beta}$ and $\nu$. Given this noninformative prior, the posterior distribution explicitly written below as equation (15) is proper. The resulting propriety of the underlying posterior distribution is proved in the Appendix. Such results have not appeared in the literature, and therefore we find it useful to provide enough details of the proof. Assuming independent priors for $\boldsymbol{\beta}$ and $\nu$, equation (12) becomes

(15) $\pi(\boldsymbol{\beta}, \nu \mid D) \propto L(\boldsymbol{\beta}, \nu \mid D)\,\pi(\boldsymbol{\beta})\,\pi(\nu).$

In this paper, the multivariate normal and Gamma distributions are considered to be the priors of $\boldsymbol{\beta}$ and $\nu$, respectively, with a similar parameterization of the Gamma distribution as in the case of the skewed Weibull link model.
3.3 Metropolis-Hastings algorithm
The Metropolis-Hastings algorithm is used to sample the regression coefficients $\boldsymbol{\beta}$ and the shape parameters $\xi$, $\gamma$, and $\nu$ for the standard GEV, skewed Weibull, and Fréchet link models, respectively. The Metropolis-Hastings algorithm (Hastings (1970)) generates a random walk based on a proposal distribution, using a rejection method to determine each move. To fix ideas, let $\pi(\cdot \mid D)$ and $q(\cdot \mid \cdot)$ denote the posterior and proposal distributions, given the complete data, and let $\boldsymbol{\theta}$ denote the parameters. The procedure of the Metropolis-Hastings algorithm can be described in steps as follows:

Step 0: Choose an arbitrary starting point $\boldsymbol{\theta}^{(0)}$ and set $t = 0$;

Step 1: Generate a candidate point $\boldsymbol{\theta}^{*}$ from the proposal distribution $q(\cdot \mid \boldsymbol{\theta}^{(t)})$ and $u$ from Uniform(0, 1);

Step 2: If $u \le \alpha(\boldsymbol{\theta}^{(t)}, \boldsymbol{\theta}^{*})$, then set $\boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{*}$; otherwise, set $\boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)}$. The acceptance probability is given by:

$\alpha(\boldsymbol{\theta}, \boldsymbol{\theta}^{*}) = \min\left\{1, \frac{\pi(\boldsymbol{\theta}^{*} \mid D)\,q(\boldsymbol{\theta} \mid \boldsymbol{\theta}^{*})}{\pi(\boldsymbol{\theta} \mid D)\,q(\boldsymbol{\theta}^{*} \mid \boldsymbol{\theta})}\right\};$

Step 3: Update $t = t + 1$ and repeat Steps 1 and 2 until convergence.

The distribution $q$ is the kernel distribution. If $q$ is symmetric, then the conditional densities $q(\boldsymbol{\theta} \mid \boldsymbol{\theta}^{*})$ and $q(\boldsymbol{\theta}^{*} \mid \boldsymbol{\theta})$ cancel out in the calculation of $\alpha$. In this paper, we perform 20,000 iterations for the parameter estimation after a burn-in of 1,000 iterations. Let $\boldsymbol{\theta}^{(t)}$, for $t = 1, \ldots, T$, denote the samples drawn from Metropolis-Hastings; by the Law of Large Numbers, the posterior mean of $\boldsymbol{\theta}$ is
(16) $\hat{\boldsymbol{\theta}} = \frac{1}{T}\sum_{t=1}^{T} \boldsymbol{\theta}^{(t)},$

with variance estimate

(17) $\widehat{\operatorname{Var}}(\boldsymbol{\theta}) = \frac{1}{T - 1}\sum_{t=1}^{T} \left(\boldsymbol{\theta}^{(t)} - \hat{\boldsymbol{\theta}}\right)^{2}.$

We choose normal kernels $N(\boldsymbol{\theta}^{(t)}, \sigma^{2})$ for the Metropolis-Hastings sampling, i.e., a normal distribution with mean equal to the previous state and an appropriately chosen variance. Subsampling is also used in this paper, in which samples are taken every 50 steps, to reduce the strong autocorrelation between states and further guarantee the stability of the estimates.
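The steps above, together with the normal random-walk kernel, burn-in, and thinning, can be sketched as follows. This is a generic one-dimensional sampler; the standard normal target in the example is a stand-in log-density, not the paper's posterior:

```python
import math
import random

def metropolis_hastings(log_post, start, n_iter=20_000, burn_in=1_000,
                        thin=50, step_sd=1.0, seed=42):
    """Random-walk Metropolis-Hastings with a normal kernel N(theta_t, step_sd^2).

    The normal kernel is symmetric, so q cancels in the acceptance ratio and
    alpha = min(1, pi(theta*) / pi(theta_t)), computed here on the log scale.
    """
    rng = random.Random(seed)
    theta, log_p = start, log_post(start)
    draws = []
    for t in range(burn_in + n_iter):
        proposal = rng.gauss(theta, step_sd)            # Step 1: candidate point
        log_p_new = log_post(proposal)
        if math.log(rng.random()) < log_p_new - log_p:  # Step 2: accept or reject
            theta, log_p = proposal, log_p_new
        if t >= burn_in and (t - burn_in) % thin == 0:  # keep every 50th draw
            draws.append(theta)
    return draws

# Example: sample from a standard normal "posterior" (log-density up to a constant).
draws = metropolis_hastings(lambda x: -0.5 * x * x, start=0.0)
post_mean = sum(draws) / len(draws)                                     # cf. (16)
post_var = sum((d - post_mean) ** 2 for d in draws) / (len(draws) - 1)  # cf. (17)
```

In the paper's setting, `log_post` would be the log of the posterior in (13), (14), or (15), and the sampler would be run coordinate-wise over $\boldsymbol{\beta}$ and the shape parameter.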
3.4 Model assessment and evaluation
For model fitting in a Bayesian routine, we use the Deviance Information Criterion (DIC), proposed by Spiegelhalter et al. (2002), to perform the model comparison and assessment. The Deviance is defined as $D(\boldsymbol{\theta}) = -2\log L(\boldsymbol{\theta} \mid D)$; then

(18) $\mathrm{DIC} = D(\bar{\boldsymbol{\theta}}) + 2 p_D, \quad \text{where } p_D = \bar{D} - D(\bar{\boldsymbol{\theta}}) \text{ and } \bar{D} = \frac{1}{T}\sum_{t=1}^{T} D(\boldsymbol{\theta}^{(t)}),$

where $D(\bar{\boldsymbol{\theta}})$ is the Deviance evaluated at the posterior mean of the corresponding parameters, $\bar{D}$ is the average estimated discrepancy over the $T$ samples, and $\boldsymbol{\theta}^{(t)}$ is the $t$th sample generated from the posterior distribution $\pi(\boldsymbol{\theta} \mid D)$. We also use $p_D$ to measure the effective model size and penalize complexity. The DIC is a Bayesian alternative to the AIC and BIC. The model with the smaller DIC provides the better fit to the data.
For the comparison of the frequentist models, we use the Bayesian Information Criterion (BIC), as developed in Schwarz (1978). Recall that $\mathrm{BIC} = -2\log L(\hat{\boldsymbol{\theta}} \mid D) + k\log n$, where $\hat{\boldsymbol{\theta}}$ is the maximum likelihood estimate and $k$ is the model dimension. We use the Kolmogorov-Smirnov (KS) statistic as a measure of goodness of fit, defined as $\mathrm{KS} = \max_i |y_i - \hat{y}_i|$, the maximum absolute error of the predicted and observed values. Another statistic, the Mean Absolute Error (MAE), defined as $\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} |y_i - \hat{y}_i|$, is used to measure the absolute deviations.
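Following the definitions above (note that the KS statistic here is the maximum absolute error between predicted and observed values, not the classical two-sample statistic), a minimal sketch:

```python
def ks_stat(y_obs, y_pred):
    # Maximum absolute error between observed outcomes and predicted values
    return max(abs(o - p) for o, p in zip(y_obs, y_pred))

def mae(y_obs, y_pred):
    # Mean absolute error
    return sum(abs(o - p) for o, p in zip(y_obs, y_pred)) / len(y_obs)

# Hypothetical observed outcomes and fitted death probabilities:
y_obs = [0, 0, 1, 0]
y_pred = [0.05, 0.10, 0.60, 0.02]
# ks_stat -> |1 - 0.60| = 0.40;  mae -> (0.05 + 0.10 + 0.40 + 0.02) / 4 = 0.1425
```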
3.5 Posterior predictive distribution
The posterior predictive distribution is the distribution of possible unobserved values conditional on the observed values. For a given dataset $D$, the posterior predictive distribution of a new unobserved value $\tilde{y}$ is defined to be the following:

(19) $\pi(\tilde{y} \mid D) = \int \pi(\tilde{y} \mid \boldsymbol{\theta}, D)\,\pi(\boldsymbol{\theta} \mid D)\,d\boldsymbol{\theta}.$

If we assume the observed and unobserved data are conditionally independent given $\boldsymbol{\theta}$, then equation (19) can be written as

(20) $\pi(\tilde{y} \mid D) = \int \pi(\tilde{y} \mid \boldsymbol{\theta})\,\pi(\boldsymbol{\theta} \mid D)\,d\boldsymbol{\theta}.$
MCMC samples from the posterior predictive distribution of $\tilde{y}$ can be obtained following the procedure described below (Hoff (2009)). For each $t = 1, \ldots, T$:

sample $\boldsymbol{\theta}^{(t)} \sim \pi(\boldsymbol{\theta} \mid D)$; and

sample $\tilde{y}^{(t)} \sim \pi(\tilde{y} \mid \boldsymbol{\theta}^{(t)})$.

It can be inferred that the sequence $\{(\boldsymbol{\theta}^{(t)}, \tilde{y}^{(t)})\}$ constitutes independent samples from the joint posterior distribution of $(\boldsymbol{\theta}, \tilde{y})$, and the sequence $\{\tilde{y}^{(t)}\}$ also constitutes independent samples generated from the posterior predictive distribution of $\tilde{y}$.
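The two-step simulation can be sketched as follows. The posterior and response model below are illustrative stand-ins (a normal posterior for a single parameter and a logistic response), not the paper's fitted model:

```python
import math
import random

rng = random.Random(7)

def sample_posterior():
    # Stand-in posterior draw: theta | D ~ N(-2, 0.3^2)
    return rng.gauss(-2.0, 0.3)

def sample_outcome(theta):
    # Stand-in response model: y_new | theta ~ Bernoulli(p) with logistic p
    p = 1.0 / (1.0 + math.exp(-theta))
    return 1 if rng.random() < p else 0

# Posterior predictive draws: for each t, sample theta(t), then y(t) given theta(t).
pred = [sample_outcome(sample_posterior()) for _ in range(10_000)]
pred_rate = sum(pred) / len(pred)  # estimates Pr(y_new = 1 | D)
```

The empirical frequency of 1's among the draws approximates the posterior predictive death probability, which is how predicted mortality is obtained from the fitted skewed link models.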
4 Simulation studies and parameter recovery
In the simulation studies, we conducted two experimental trials, I and II, to investigate and assess the performance of the various link models for each of these experiments. For each simulated dataset, we deploy six models under different link functions: logit, probit, and cloglog, which are considered frequentist models, together with the skewed link models, standard GEV, skewed Weibull, and Fréchet, which are considered Bayesian models. To summarize the posterior marginal densities of the parameters under a Bayesian framework, we use the highest posterior density interval (Box and Tiao (1973)), which has the shortest length for a given probability content. The calculations in this paper use the Chen and Shao (1999) algorithm to estimate an empirical Highest Posterior Density (HPD) interval of the parameters.
For experiment I, we generate 1,000 independent response variables from the cloglog regression. The predictor $x_1$ is a categorical variable with two levels, generated with fixed respective probabilities, and $x_2$ is a numerical attribute generated from a normal distribution. The true values of the coefficients are fixed to form the linear component of the regression model. The dataset contains responses of 15 1's and 985 0's. The estimation results based on the various link models are summarized in Table 1, which shows that the skewed Weibull link model performs better than the other Bayesian models, with the lowest DIC and the largest log likelihood. The KS statistics for all the links are approximately close to each other; however, the Fréchet model outperforms all other models with respect to MAE. When it comes to parameter recovery, the asymmetric link models do not necessarily outperform the true probit model but are competitive enough against the other symmetric links. For the logit and cloglog models, the coefficient estimates significantly deviate from the true values; for all the skewed link models, most of the estimates fall within the 95% HPD interval.
For experiment II, we generate the response variables with sample size 1,000 from the GEV regression model, which produced responses of 12 1's and 988 0's. The predictors are set to be the same as in the previous experiment. The output, summarized in Table 2, presents the model estimates under the six different link models. The standard GEV model performs well, not only in view of the recovery of parameters, but also in the model assessment. However, the skewed Weibull link is the most appropriate one since, in both of the simulations, its estimated parameters are the ones that most closely reach the true parameter values. The Fréchet link is the second best model in the sense of recovery of parameters. Therefore, again, the skewed links are very flexible in fitting a dataset that is considered imbalanced because they automatically estimate the degree of skewness through the shape parameter to improve their performance.
In each simulation experiment, multiple chains are generated to produce Bayesian estimates. Based on both experiments, we deduce that the skewed link regression models outperform the symmetric link regressions when the binary outcome is highly imbalanced.
logit  probit  cloglog  standard GEV  skewed Weibull  Fréchet  
Parameter  Estimate  Estimate  Estimate  Estimate  Estimate  Estimate 
95% CI (HPD)  95% CI (HPD)  95% CI (HPD)  95% CI (HPD)  95% CI (HPD)  95% CI (HPD)  
5.8806  2.8146  5.8596  5.8609  4.6617  4.9927  
(7.1661, 4.5951)  (3.2984, 2.3309)  (7.1206, 4.5987)  (8.2872, 3.6667)  (7.2326, 2.0174)  (6.9264, 3.0682)  
1.169  0.4436  1.1694  1.0438  0.7564  0.9623  
(0.0638, 2.4018)  (0.0687, 0.956)  (0.0347, 2.3736)  (0.0746, 2.4326)  (0.2486, 1.8675)  (0.3557, 2.0823)  
1.3656  0.5651  1.3237  1.2716  0.9187  0.8908  
(0.6954, 2.0359)  (0.2732, 0.8571)  (1.1147, 2.4244)  (0.6886, 1.9587)  (0.2459, 1.6592)  (0.383, 1.4744)  
shape parameter  0.0032  1.2437  3.2626  
(0.1884, 0.1495)  (0.7824, 1.8776)  (2.3721, 4.406)  
DIC  106.2756  103.5679  108.6163  
logLik  49.8878  49.8387  49.9225  50.0309  51.3653  51.5532 
BIC  120.4988  120.4007  120.5683  
KS  0.9972  0.9976  0.9972  0.9961  0.9967  0.8433 
MAE  0.0209  0.02095  0.02088  0.02186  0.02123  0.1808 
logit  probit  cloglog  standard GEV  skewed Weibull  Fréchet  
Parameter  Estimate  Estimate  Estimate  Estimate  Estimate  Estimate 
95% CI (HPD)  95% CI (HPD)  95% CI (HPD)  95% CI (HPD)  95% CI (HPD)  95% CI (HPD)  
6.9426  3.5930  7.2263  2.5713  2.9694  3.2936  
(8.6051, 5.2801)  (4.4801, 2.7059)  (9.1241, 5.3285)  (3.5931, 1.3858)  (3.9025, 2.0236)  (4.2359, 2.2512)  
0.6864  0.2800  0.5363  0.2301  0.2835  0.5350  
(2.2951, 0.9223)  (1.0877, 0.5277)  (2.2800, 1.2074)  (0.6807, 0.2135)  (0.8473, 0.1638)  (1.3672, 0.0370)  
2.4534  1.2739  2.6662  0.6581  0.5784  0.8103  
(1.6712, 3.2357)  (0.7798, 1.7679)  (1.7058, 3.6265)  (0.1729, 1.1070)  (0.3802, 1.1798)  (0.4761, 1.2215)  
shape parameter  0.4523  1.8945  4.5823  
(0.7527, 0.2207)  (1.2508, 2.6177)  (3.3014, 6.1055)  
DIC  81.8384  82.1210  86.8719  
logLik  56.3239  37.0525  37.3855  37.2829  39.6527  58.1976 
BIC  133.3711  94.8283  95.4942  
KS  0.9991  0.9925  0.9918  0.9870  0.9869  0.7983 
MAE  0.0264  0.0191  0.0187  0.0211  0.0204  0.1481 
5 Empirical data on life insurance
To illustrate the applicability of the binary regression models previously discussed, we drew observations of policyholders during calendar year 2015 from an insurance portfolio of a major insurer. In order to preserve confidentiality, we took subsamples: 127,777 observations for training our models and 18,254 observations for testing the various estimated models. During the calendar year, we are able to observe whether each policyholder died or survived. The observed mortality is highly imbalanced, with 1,720 actual deaths observed in our training set and 247 actual deaths observed in our test set. Each observation is described by 9 attributes, of which 6 are categorical and 3 are numerical variables.
5.1 Data description
Table 3 shows the description and summary statistics of each categorical variable used in our mortality investigation. We regroup the variable State into 8 different geographic regions and create a new categorical variable, Regions, in order to simplify the number of levels describing geographic location. For many life insurance contracts, secondary guarantees are offered to ensure some death benefits are paid even if the cash surrender value of the policy falls to zero. From our portfolio, we observe six levels describing the kind of secondary guarantee purchased with the contract. The insured's sex indicator, Gender, is also a discrete variable with 2 levels, Male and Female, with 57.25% Male and the rest Female. Smoker Status indicates the insured's smoking status at policy issue, with levels Smoker, Nonsmoker, and Unismoker. It is not uncommon practice in simplified underwriting to classify a policyholder as Unismoker in the absence of information about smoking habits, in which case blended smoker and nonsmoker mortality rates are used. The Line of business (LOB) variable broadly classifies the type of life insurance business under which the contract is issued. Finally, the categorical variable Plan has 7 different levels: Universal Life with Long Term Care (UL_LTC), Term Insurance Plan (Term), Universal Life with Secondary Guarantees (UL_SG) and without Secondary Guarantees (UL_NSG), Variable Life with Secondary Guarantees (VUL), Whole Life Policy, and Corporate-Owned Life Insurance (COLI). This categorization can be quite unique to the type of contracts issued by an insurance company.
Summary statistics of the numerical variables are in Table 4. The issue ages range from as young as newborn to as old as 90, with an average age at issue of 37 years. The face amount is the amount of death benefit, and although this amount may vary according to the type of contract, most of the policies in our dataset have level face amounts. The duration, measured in years, refers to the number of years since policy issue, as of the period of observation. All three continuous variables can have a direct effect on mortality; for example, it is well known that mortality rates increase with age.
In this paper, we normalize these three continuous variables: Issue Age, Face Amount, and Duration. For Issue Age and Duration, we rescale the variables so that the range is in $[0, 1]$, for which the general formula is given by:

$x' = \frac{x - \min(x)}{\max(x) - \min(x)},$

where $x$ is the original value and $x'$ is the normalized value. However, for Face Amount, the variable is highly skewed, understandably because of some policies with unusually high values; its magnitude has to be narrowed down. In this case, we take the logarithm of the data and then apply the same rescaling.
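A sketch of both transformations in Python (using `log1p` to accommodate zero face amounts is our own assumption; the paper only states that a logarithm is taken before rescaling):

```python
import math

def minmax(values):
    # Rescale to [0, 1]: x' = (x - min) / (max - min)
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def log_minmax(values):
    # Take logarithms first to tame heavy skewness, then rescale to [0, 1].
    logged = [math.log1p(v) for v in values]
    return minmax(logged)

ages = [0, 26, 39, 90]
# minmax(ages) -> [0.0, 26/90, 39/90, 1.0]
```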
Table 3: Description and proportions of the categorical variables

Categorical variable | Description | Level | Proportion
Regions | Region of issue | Northeast_NewEngland | 5.57%
 | | Northeast_MidAtlantic | 14.36%
 | | Midwest | 23.13%
 | | South_Atlantic | 24.45%
 | | South_SouthCentral | 14.19%
 | | West_Mountain | 4.92%
 | | West_Pacific | 16.17%
 | | Foreign | 0.70%
SG_IND | Secondary guarantee indicator | Brokerage | 0.45%
 | | Legacy | 1.03%
 | | UL_LTC | 6.97%
 | | NSG | 31.28%
 | | No Information | 52.11%
 | | SG | 3.16%
Gender | Insured's sex | Female | 42.75%
 | | Male | 57.25%
Smoker Status | Insured's smoking status | Smoker | 6.52%
 | | Nonsmoker | 75.19%
 | | Unismoker | 18.29%
LOB | Line of business | ISL | 20.75%
 | | TRAD | 20.09%
 | | EM | 1.04%
 | | No Information | 52.11%
Plan | Plan type | UL_LTC | 8.44%
 | | Term | 29.07%
 | | UL_NSG | 31.83%
 | | UL_SG | 1.05%
 | | VUL | 5.89%
 | | Whole Life | 22.68%
 | | COLI | 1.04%
Table 4: Summary statistics of the numerical variables

Numerical variable | Minimum | 1st Quartile | Mean | Median | 3rd Quartile | Maximum
Issue Age | 0 | 26 | 37 | 39 | 51 | 90
Face Amount (in $) | 0 | 25,000 | 314,902 | 100,000 | 250,000 | 12,000,000
Duration (in years) | 0.3 | 2.5 | 5.6 | 4.5 | 7.5 | 23.8
5.2 Numerical results
We calibrated the binary regression models under the six different link functions described earlier. We estimated the parameters using the training dataset; the test dataset was used for validation. We present the results separately for the standard link functions (logit, probit, cloglog) and the skewed link functions (standard GEV, skewed Weibull, and Fréchet). We refer to the standard link models as frequentist models and the skewed link models as Bayesian models; this distinction reflects the approach used to estimate the parameters. Note that for the categorical variables, the reference level should be clear from the results presented. For example, in the case of Plan type, the category COLI has been used; in the case of Line of business (LOB), the category EM has been used.
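To make the role of the link concrete, the sketch below implements the inverse links, i.e., the functions mapping the linear predictor eta = x'b to P(y = 1). The function names are ours, and the GEV form p = 1 - F(-eta), with F the GEV distribution function and shape xi, is one common parameterization (in the spirit of Wang and Dey, 2010) rather than necessarily the paper's exact specification.

```python
import math

def inv_logit(eta):
    # Logistic CDF: symmetric around eta = 0
    return 1.0 / (1.0 + math.exp(-eta))

def inv_probit(eta):
    # Standard normal CDF, written via the error function
    return 0.5 * (1.0 + math.erf(eta / math.sqrt(2.0)))

def inv_cloglog(eta):
    # Complementary log-log: asymmetric, approaches 1 faster than 0
    return 1.0 - math.exp(-math.exp(eta))

def inv_gev(eta, xi):
    # Assumed GEV-type link: p = 1 - F(-eta), F the GEV CDF with shape xi
    t = 1.0 - xi * eta
    if t <= 0.0:
        return 1.0 if xi > 0 else 0.0
    return 1.0 - math.exp(-t ** (-1.0 / xi))

# logit and probit are symmetric links; cloglog is not:
p_pos, p_neg = inv_cloglog(1.0), inv_cloglog(-1.0)
```

With xi left free, the GEV link lets the data choose the degree of asymmetry, which is exactly what makes skewed links attractive when the event rate is around 1.3%.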
Table 5 provides the parameter estimates for the reduced frequentist models. We first fit a full model with all the predictor variables included, and then fit a reduced model that excludes the insignificant variables. Interestingly, in terms of model comparison statistics, the probit model has the worst log-likelihood, KS, and MAE, but it has the best BIC statistic. As the probit model rejects many predictor variables, its low BIC makes intuitive sense. The probit model would be appealing to those seeking the simplest possible model that is still competitive. It retains only IssueAge and Duration as significant variables, and the parameter estimates intuitively make sense: the probability of death increases with IssueAge and also increases slightly with policy duration.
Table 5: Parameter estimates for the reduced frequentist models

Parameter | logit Estimate (s.e.) | probit Estimate (s.e.) | cloglog Estimate (s.e.)
Intercept | -8.1418 (0.8751) | -4.8315 (0.1630) | -10.8632 (0.5227)
GenderMale | 0.4493 (0.1473) | | 0.4922 (0.1434)
IssueAge | 7.5972 (0.5803) | 3.2135 (0.2098) | 8.1890 (0.5719)
SmokerStatusS | 0.6936 (0.1940) | | 0.5500 (0.1834)
SmokerStatusUni | 0.1080 (0.2607) | | 0.2090 (0.2545)
Duration | 0.0767 (0.0074) | 0.0423 (0.0022) | 0.0937 (0.0066)
FaceAmount | 3.0868 (1.0348) | |
Plan_typeUL_LTC | 0.3088 (0.3109) | | 0.4393 (0.3144)
Plan_typeTERM | 0.8083 (0.3736) | | 1.1180 (0.4024)
Plan_typeUL_NSG | 0.1871 (0.2199) | | 0.0905 (0.2077)
Plan_typeUL_SG | 1.4596 (1.0371) | | 0.0670 (0.4984)
Plan_typeVUL | 0.7097 (0.5478) | | 0.3484 (0.4577)
log Lik | -1044.8173 | -1068.5629 | -1039.393
BIC | 2207.3802 | 2166.5623 | 2186.719
KS | 0.9994 | 0.9999 | 0.9992
MAE | 0.0251 | 0.0254 | 0.0253
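The BIC trade-off discussed above follows the usual definition BIC = k log(n) - 2 log L, which penalizes each extra parameter by log of the sample size. The sketch below uses illustrative numbers only (the sample size and log-likelihoods here are made up, not the paper's fits), showing how a model with fewer parameters can win on BIC despite a worse log-likelihood, as the probit model does in Table 5.

```python
import math

def bic(log_lik, n_params, n_obs):
    """Schwarz's Bayesian information criterion: smaller is better."""
    return n_params * math.log(n_obs) - 2.0 * log_lik

# Illustrative only: a 3-parameter model with a noticeably worse fit
# can still beat a 12-parameter model on BIC.
full = bic(log_lik=-1040.0, n_params=12, n_obs=50_000)
reduced = bic(log_lik=-1069.0, n_params=3, n_obs=50_000)
```

Here `reduced < full`, mirroring qualitatively how the sparse probit model attains the best BIC in Table 5.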
Table 6 presents the parameter estimates for the reduced Bayesian models. As stated before, one statistic used to compare Bayesian models is the DIC, which describes the trade-off between the quality of the model fit to the data and the complexity of the model. The model with the smaller DIC is generally preferred, as it is better supported by the observed data. Among the three skewed link models, the skewed Weibull gives the best DIC value of 2126.3532. While this model has the worst KS statistic, it outperforms the other two models in terms of log-likelihood and MAE.
Table 6: Parameter estimates, with 95% HPD intervals, for the reduced Bayesian models

Parameter | Standard GEV | Skewed Weibull | Fréchet
Intercept | -5.3912 (-7.9279, -3.9688) | -5.2117 (-7.6501, -3.8083) | -4.7430 (-8.5038, -3.2466)
IssueAge | 2.0595 (1.0426, 3.4020) | 4.3174 (2.8483, 6.9725) | 1.8652 (0.9955, 3.2125)
SmokerStatusS | 0.4927 (0.1658, 0.9199) | 0.3434 (0.0792, 0.6566) | 0.3651 (0.0749, 0.7524)
SmokerStatusUni | 0.2028 (0.1612, 0.6155) | -0.0226 (-0.3849, 0.2493) | 0.0955 (-0.2651, 0.4372)
Duration | 0.9030 (0.5776, 1.4798) | |
LOBISL | 2.2296 (0.8777, 4.0847) | | 1.9686 (0.6643, 4.2602)
LOBNoInformation | 2.0458 (0.7201, 3.6112) | | 1.6797 (0.6770, 4.4102)
LOBTRAD | 1.6730 (0.3112, 3.2024) | | 1.4174 (0.1673, 3.8592)
Plan type UL_LTC | -1.5931 (-2.5892, -0.8178) | -0.2598 (-0.6591, -0.0973) | -1.6166 (-2.6182, -0.8015)
Plan type TERM | -1.8061 (-3.1911, -0.9376) | -0.5262 (-1.1886, -0.0119) | -2.8171 (-5.0451, -1.2020)
Plan type UL_NSG | -1.2067 (-1.9193, -0.6662) | -0.2039 (-0.4883, -0.0602) | -1.2311 (-1.9962, -0.6076)
Plan type UL_SG | -2.1475 (-3.7416, -0.9047) | -1.4483 (-3.6883, -0.0552) | -2.4779 (-4.4446, -1.0303)
Plan type VUL | -1.8975 (-3.2817, -0.9542) | -0.5691 (-1.4931, -0.0668) | -2.4099 (-4.1592, -1.0580)
Shape parameter | -0.1262 (-0.2498, -0.0266) | 1.3185 (1.0112, 1.6081) | 3.8392 (2.7269, 5.0725)
DIC | 2338.5129 | 2126.3532 | 2348.6408
log Lik | -1153.3895 | -1055.0456 | -1165.1385
KS | 0.9990 | 0.9997 | 0.9980
MAE | 0.0263 | 0.0251 | 0.0262
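The DIC used to compare these Bayesian models can be computed directly from MCMC output. The sketch below follows the standard definition of Spiegelhalter et al. (2002); the deviance values are made-up illustrations, not the paper's actual chains.

```python
def dic(deviance_samples, deviance_at_posterior_mean):
    """Deviance information criterion: DIC = mean deviance + p_D,
    where p_D (effective number of parameters) is the mean deviance
    minus the deviance evaluated at the posterior mean."""
    d_bar = sum(deviance_samples) / len(deviance_samples)
    p_d = d_bar - deviance_at_posterior_mean
    return d_bar + p_d

# Illustrative deviance draws, D(theta) = -2 log L(theta), from a short chain
samples = [2115.0, 2120.0, 2125.0, 2130.0]
result = dic(samples, 2110.0)  # -> 2135.0
```

Smaller DIC is better: a model pays for extra effective parameters (p_D) on top of its average misfit.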
Interestingly, as with the frequentist models, the skewed Weibull link model is also the simplest among the three skewed link models, as it rejected more predictors than either of the others. In particular, both policy duration and LOB are considered not significant under the skewed Weibull link model. The estimated coefficients of the selected predictor variables for the skewed link model can be interpreted roughly, and they appear quite intuitive. First, the positive sign for IssueAge indicates a greater likelihood of death at older issue ages. Second, the positive sign for SmokerStatusS indicates worse mortality for smokers than for nonsmokers. The negative sign for SmokerStatusUni might appear counterintuitive, but it is more challenging to interpret because this status reflects the absence of information about smoking habits. Finally, the signs for all the Plan types are negative, which indicates that the reference category COLI has the highest likelihood of death among all plan types; Plan type UL SG, having the largest negative coefficient, has the lowest. Notice further that for all Bayesian models, FaceAmount was not considered a significant predictor of death. There are two possible explanations for this. First, the effect is diluted when we transform and normalize the variable. Second, the amount of the death benefit can be directly linked to the degree of underwriting adopted prior to issue, in which case the mortality selection effect may already be captured by policy duration.
To compare the frequentist models with the Bayesian models, we can examine the common model statistics: the log-likelihood, KS, and MAE values. Across all models, the KS and MAE statistics are relatively comparable. However, in terms of log-likelihood, the Bayesian models outperform the frequentist models, with the skewed Weibull link model giving the best value among them.
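For reference, the two distribution-free statistics used throughout can be computed as below for binary outcomes and predicted probabilities. This is a toy sketch; in particular, KS is taken here as the maximum distance between the empirical CDFs of predicted probabilities for deaths versus survivors, which is one common definition (the paper's exact variant is not stated).

```python
def mae(y, p):
    """Mean absolute error between 0/1 outcomes and predicted probabilities."""
    return sum(abs(yi - pi) for yi, pi in zip(y, p)) / len(y)

def ks_statistic(y, p):
    """Kolmogorov-Smirnov distance between the predicted-probability
    distributions of deaths (y = 1) and survivors (y = 0)."""
    deaths = [pi for yi, pi in zip(y, p) if yi == 1]
    survivors = [pi for yi, pi in zip(y, p) if yi == 0]
    best = 0.0
    for t in sorted(set(p)):
        f1 = sum(pi <= t for pi in deaths) / len(deaths)
        f0 = sum(pi <= t for pi in survivors) / len(survivors)
        best = max(best, abs(f1 - f0))
    return best

# Toy data: a model that separates the two deaths perfectly
y = [0, 0, 0, 0, 1, 1]
p = [0.01, 0.02, 0.02, 0.10, 0.60, 0.70]
```

On this toy data the separation is perfect, so KS is 1; a KS near 1 on the real portfolio likewise indicates strong separation of deaths from survivors.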
Table 7: Actual versus predicted death benefit amounts and death counts on the test dataset

Amount of death benefits (actual amount: $33,992,610)
Model | Estimated amount | Error rate
logit | $28,653,690 | 15.7%
probit | $42,883,180 | 26.2%
cloglog | $37,309,376 | 9.8%
Standard GEV | $43,623,689 | 28.3%
Skewed Weibull | $36,338,219 | 6.9%
Fréchet | $39,838,452 | 17.2%

Death counts (actual count: 247)
Model | Estimated count
logit | 19
probit | 17
cloglog | 19
Standard GEV | 252
Skewed Weibull | 248
Fréchet | 250
[Figure 2: Mean and confidence intervals of the predictions from the Bayesian models for cumulative count and amount.]
To further investigate the predictive power of the various models, we compare the predicted death counts together with the corresponding death benefit amounts. Table 7 presents the aggregated death benefits and counts for the various link models using the test dataset. Our test dataset has a total death count of 247 out of 18,254 observations; the corresponding benefit amount sums to approximately 34 million dollars. In terms of death counts, as expected, the frequentist models hardly come close to the actual death counts; all three models underestimate the number of deaths. On the other hand, the skewed link models have predictions that are generally close to the actual death counts, as shown in the table. In particular, the skewed Weibull, not surprisingly, marginally outperforms the other two models in terms of death counts, and is far superior in terms of predicting the corresponding death benefits. A visualization of the superiority of the skewed Weibull link model in terms of prediction is presented in Figure 2.
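A natural way to produce the estimates in Table 7 is to sum the predicted death probabilities over the test policies for the count, and their face-amount-weighted sum for the benefit amount. The paper does not spell out its aggregation mechanics, so treat this as our assumption; the three-policy example below is purely illustrative.

```python
def expected_claims(probs, face_amounts):
    """Expected death count and expected death benefit for a portfolio,
    given per-policy predicted death probabilities."""
    exp_count = sum(probs)
    exp_amount = sum(p * f for p, f in zip(probs, face_amounts))
    return exp_count, exp_amount

def error_rate(estimated, actual):
    """Relative error, as reported in Table 7."""
    return abs(estimated - actual) / actual

# Illustrative three-policy portfolio
probs = [0.01, 0.02, 0.015]
faces = [100_000, 250_000, 50_000]
count, amount = expected_claims(probs, faces)  # 0.045 deaths, $6,750
```

Applying `error_rate` to the skewed Weibull figures in Table 7 reproduces its reported 6.9% error on the benefit amount.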
6 Concluding remarks
This article was driven by the purpose of developing a mortality investigation for tracking and monitoring the death claims experience of a portfolio of life insurance contracts over a specified period. Our primary interest is in developing a suitable binary regression model for predicting mortality, given a set of available covariate information. Our period of investigation is primarily short term, in this case one year. This is typical in practice, where insurers need to predict the level of mortality for the following year given the characteristics of their portfolio at the beginning of the year. However, this is a particularly challenging task, as the observed mortality for a cohort of policies is expected to be a rare event.
This paper presents a methodology to handle extremely imbalanced binary outcomes. In particular, we focus on the latent variable interpretation of the binary regression model and examine alternative functions that link the mean of the binary outcome to the predictor variables. We find that suitable link functions are those with the flexibility to handle skewness, and we consider three possible skewed link models: standard GEV, skewed Weibull, and Fréchet. We performed simulation studies and calibrated models using empirical data observed from a portfolio of life insurance contracts. Based on the model estimates and predictions, we find that all three models generally outperform frequentist models with the standard link functions logit, probit, and cloglog. We also presented a Bayesian approach to the estimation of the skewed link models. As pointed out in this paper, the latent variable interpretation not only connects the various link models but also has computational advantages, especially when Bayesian inference is used for estimation, such as with the Metropolis-Hastings algorithm, an MCMC method.
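As a pointer to how such estimation runs in practice, here is a minimal random-walk Metropolis-Hastings sketch on a toy one-dimensional target (a standard normal log-density standing in for the posterior of a single coefficient). The function name and toy target are ours; the paper's actual sampler operates on the full coefficient vector under the skewed-link likelihood.

```python
import math
import random

def metropolis_hastings(log_post, init, n_iter=5000, step=1.0, seed=42):
    """Random-walk Metropolis-Hastings: propose theta' ~ N(theta, step^2)
    and accept with probability min(1, post(theta') / post(theta))."""
    rng = random.Random(seed)
    theta, lp = init, log_post(init)
    draws = []
    for _ in range(n_iter):
        prop = theta + rng.gauss(0.0, step)
        lp_prop = log_post(prop)
        if rng.random() < math.exp(min(0.0, lp_prop - lp)):
            theta, lp = prop, lp_prop  # accept the proposal
        draws.append(theta)
    return draws

# Toy target: standard normal log-density
draws = metropolis_hastings(lambda t: -0.5 * t * t, init=0.0)
mean = sum(draws) / len(draws)
```

With enough draws, the empirical mean and variance of the chain approach the target's 0 and 1; in the paper's setting the draws instead trace the posterior of the regression coefficients and shape parameter.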
The models and theoretical framework presented here can be extended to mortality tracking and monitoring on a more frequent basis than annually, for instance quarterly. This is because insurance companies publish quarterly financials to rating analysts, and death benefits are a significant portion of an insurer's financials. The ability to understand and project death benefits on a quarterly basis is important for explaining fluctuations in financials driven by mortality claims. However, when tracking and monitoring mortality more frequently than annually, it is important to distinguish between nonsignificant fluctuations in mortality and fluctuations that represent a longer-term trend. It would be interesting to explore this issue using the models we have developed in this paper. In addition, there is also the promise of using neural networks to handle imbalanced binary outcomes, which would be an interesting future direction of research; see Dreiseitl and Ohno-Machado (2002) and Wang et al. (2016).

Appendix

The following results provide a detailed proof that, in the case of the Fréchet link model, the resulting posterior distribution is proper.
Proposition 1.
Let X be the n × p design matrix with full rank and X* be the matrix whose ith row is x*_i = (2y_i − 1)x'_i, where y_i ∈ {0, 1}, for i = 1, …, n. Assume there exists a positive vector a = (a_1, …, a_n)' with a_i > 0 such that X*'a = 0. Under the prior given by π(β, λ) ∝ π(λ) with π(λ) proper, the posterior distribution stated in (15) is proper.
Proof.
Let be independent random variables with a common Fréchet distribution with , and shape parameter . For , it can be shown that for . Observe that and , where is the indicator function. Then
(21) 
and
(22) 
Let . Using Fubini’s Theorem, we obtain
(23)  
Directly following from Lemma 4.1 of Chen and Shao (2001), and under the condition that and for , there exists a constant such that
(24) 
∎
References
 Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association 88 (422), pp. 669–679. Cited by: §2.1.
 On two families of transformations to additivity for binary response data. Biometrika 68 (2), pp. 357–363. Cited by: §1.
 Measuring skewness with respect to the mode. The American Statistician 49 (1), pp. 34–38. Cited by: §2.2.
 Mortality modelling and forecasting: a review of methods. Annals of Actuarial Science 3 (1–2), pp. 3–43. Cited by: §1.
 Bayesian inference in statistical analysis. Addison-Wesley Publishing Company: Reading, MA. Cited by: §4.
 Categorical data analysis using a skewed Weibull regression model. Entropy 20 (176), pp. 1–17. Cited by: §1, §3.2.
 A new skewed link model for dichotomous quantal response data. Journal of the American Statistical Association 94 (448), pp. 1172–1186. Cited by: §1.
 Monte Carlo estimation of Bayesian credible and HPD intervals. Journal of Computational and Graphical Statistics 8 (1), pp. 69–92. Cited by: §4.
 Propriety of posterior distribution for dichotomous quantal response models. Proceedings of the American Mathematical Society 129 (1), pp. 293–302. Cited by: Appendix.
 Analysis of binary data. Chapman & Hall/CRC Press, Boca Raton, FL, USA. Cited by: §2.
 Regression models and life-tables. Journal of the Royal Statistical Society: Series B (Methodological) 34 (2), pp. 187–202. Cited by: §1.
 Parametric link modification of both tails in binary regression. Statistical Papers 35 (1), pp. 189–201. Cited by: §1, §2.2.
 Logistic regression and artificial neural network classification models: a methodology review. Journal of Biomedical Informatics 35 (5–6), pp. 352–359. Cited by: §6.

 Bayesian data analysis. CRC Press, Boca Raton, FL, USA. Cited by: §2.1.
 On the nature of a function expressive of the law of human mortality, and on a new mode of determining the value of life contingencies. Philosophical Transactions of the Royal Society of London (Series A) 115, pp. 513–585. Cited by: §1.
 Use of the BoxCox transformation with binary response models. Biometrika 69 (2), pp. 309–314. Cited by: §1.
 Generalized linear models and actuarial science. The Statistician 45 (4), pp. 407–436. Cited by: §1.
 Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57 (1), pp. 97–109. Cited by: §3.3.
 A first course in Bayesian statistical methods. Springer: New York. Cited by: §3.5.
 Flexible generalized link models for binary response data. Biometrika 95 (1), pp. 93–106. Cited by: §1.
 Generalized linear models. Chapman and Hall: London. Cited by: §1, §2, §3.
 Modeling the choice of residential location. Transportation Research Record 673, pp. 72–77. Cited by: §2.2.
 The determinants of mortality heterogeneity and implications for pricing annuities. Insurance: Mathematics and Economics 53, pp. 379–387. Cited by: §1.
 Treebased methods: a useful tool for life insurance. European Actuarial Journal 2 (1), pp. 129–147. Cited by: §1.
 Goodness of link tests for generalized linear models. Journal of the Royal Statistical Society: Series C (Applied Statistics) 29 (1), pp. 15–24. Cited by: §1.
 Estimating the dimension of a model. The Annals of Statistics 6 (2), pp. 461–464. Cited by: §3.4.
 Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 64 (4), pp. 583–639. Cited by: §3.4.
 Generalized logistic models. Journal of the American Statistical Association 83 (402), pp. 426–431. Cited by: §1.
 Training deep neural networks on imbalanced data sets. In IJCNN 2016: Proceedings of the International Joint Conference on Neural Networks, pp. 4368–4374. Cited by: §6.
 Generalized extreme value regression for binary response data: an application to B2B electronic payments system adoption. The Annals of Applied Statistics 4 (4), pp. 2000–2023. Cited by: §1, §3.2.
 Cluster analysis of death claims experience. Note: Working paper Cited by: §1.
 Estimating mortality of insured advanced-age population with Cox regression model. Risk Management. Note: Transamerica Reinsurance, Working paper. Cited by: §1.
 Logistic regression for insured mortality experience studies. North American Actuarial Journal 19 (4), pp. 241–255. Cited by: §1.