Setting the premium for customers (policyholders) is one of the most important problems in the insurance business. It is therefore crucial to predict the size of actual (currently unforeseeable) claims. In insurance premium prediction problems, the total claim amount for a covered risk usually has a continuous distribution on positive values, except for the possibility of an exact zero when no claim occurs. Such data cannot be transformed to normality by a power transformation, and special treatment of zero claims is often required. Jørgensen and Paes De Souza (1994) and Smyth and Jørgensen (2002) used generalized linear models (GLM; Nelder and Wedderburn, 1972) with a Tweedie distributed outcome, assuming Poisson arrival of claims and Gamma distributed amounts for individual claims, to simultaneously model the frequency and severity of insurance claims. The Tweedie GLM has been widely used in actuarial studies (Mildenhall, 1999; Murphy et al., 2000; Sandri and Zuccolotto, 2008). It has a major limitation, however: the structure of the logarithmic mean is restricted to a linear form, which can be too rigid for real applications. Zhang (2011) modeled the nonlinearity by adding splines, but low-degree splines are often inadequate to capture the nonlinearity in the data, while high-degree splines often result in overfitting. Generalized additive models (GAM; Hastie and Tibshirani, 1990; Wood, 2006) can model continuous variables by smooth functions estimated from data, but the structure of the model has to be determined a priori, i.e., one has to specify the main effects and interactions to be used in the model. Yang et al. (2017) modeled insurance claim size by a nonparametric Tweedie model and proposed a gradient tree-boosting algorithm (TDboost) to fit it.
Despite the popularity of the Tweedie model under linear or nonlinear logarithmic mean assumptions, modeling extremely unbalanced, zero-inflated claim data (with excess zeros) remains an open problem. There are two common approaches to handling such data sets: hurdle-at-zero models and zero-inflated models. Hurdle models (Cragg, 1971; Mullahy, 1986) model the occurrence of zero versus non-zero data with one model and the non-zero data with another (e.g., a truncated Poisson or truncated negative binomial; Mullahy, 1986). Lambert (1992)'s zero-inflated Poisson (ZIP) regression models inflate the number of zeros by mixing a point mass at zero with a Poisson distribution. Hall (2000) adapted Lambert's (1992) ZIP models to an upper-bounded count situation using zero-inflated binomial (ZIB) regression models. For continuous zero-inflated data sets, Ancelet et al. (2010) mixed a point mass at zero with a compound Poisson exponential distribution: a Poisson random sum of exponential variables.
In this article, we aim to model extremely unbalanced zero-inflated insurance claim sizes with a zero-inflated nonparametric Tweedie compound Poisson model (ZIF-Tweedie henceforth), and we propose an EM algorithm to fit this mixture model, in which the gradient tree-boosting algorithm (TDboost; Yang et al., 2017) is implemented to fit the nonparametric Tweedie component in the maximization step when estimating potential customers' claim sizes. We denote this algorithm ZIF-TDboost for simplicity.
The TDboost model is motivated by the proven success of boosting in machine learning for nonparametric regression and classification (Friedman, 2001, 2002). Boosting adaptively combines a large number of relatively simple prediction models, called base learners, into an ensemble learner to achieve high prediction performance. The seminal boosting algorithm, AdaBoost (Freund and Schapire, 1997), was originally proposed for classification problems. Later, Breiman et al. (1998) and Breiman (1999) pointed out an important connection between the AdaBoost algorithm and a functional gradient descent algorithm. Friedman et al. (2000) and Friedman (2001) developed a statistical view of boosting and proposed gradient boosting methods for both classification and regression.
We use the gradient tree-boosting Tweedie model for several reasons. First, the model structure of TDboost is learned from data rather than predetermined: nonlinearities, discontinuities, and complex higher-order interactions are naturally incorporated into the model, reducing potential modeling bias and producing high predictive performance. In addition, TDboost handles response variables of any type without the need for transformation, and it is highly robust to outliers. Missing values in the predictors are managed almost without loss of information (Elith et al., 2008).
Our ZIF-TDboost model assumes that with probability q the insurance claim size is exactly zero, and with probability 1 − q it follows a Tweedie distribution. For example, a region with a highly developed transportation system and a well-educated population is likely to have a larger q than a region with a poorly developed transportation system and a less-educated population. One advantage of our ZIF-TDboost model is that it is straightforward to fit. It also inherits all the above-mentioned advantages of the TDboost model and estimates the zero mass probability q accurately. On the other hand, ZIF-TDboost inherits the disadvantage of TDboost that its results are less interpretable than those of the Tweedie GLM. We also make the simplifying assumption that the probability q is a constant, not functionally related to the predictor variables. Nevertheless, a common q still has practical meaning, as the example above illustrates. We further give a heuristic score test with a threshold to tell whether the proportion of zero claims in the insurance data is too large for a Tweedie model to fit the data well, so that the zero-inflated Tweedie model needs to be introduced.
The rest of the article is organized as follows. Section 2 briefly presents the models implemented, with the Tweedie model in Section 2.1 and the zero-inflated Tweedie model in Section 2.2. Section 3 briefly reviews the TDboost model. We present the main methodology with implementation details in Section 4, including the main algorithm in Section 4.1, the data-motivated initialization in Section 4.2, and the penalized methodology in Section 4.4. In Section 5, we use simulations to show the high prediction accuracy of our ZIF-TDboost model. As an application, we analyze auto-insurance claim data with the new model in Section 6, with the score test in Section 6.4. A brief conclusion is given in Section 7.
2 Tweedie Model and Zero-Inflated Tweedie Model
2.1 Compound Poisson Distribution and Tweedie Model
In insurance premium prediction problems, the total claim amount for a covered risk usually has a highly right-skewed continuous distribution on nonnegative values, with an exact zero when no claim occurs. One standard approach in actuarial science to modeling such data is to use Tweedie compound Poisson models.
Let N be a random variable following a Poisson distribution, denoted Pois(λ), with mean λ, and let the Z_i (i = 1, 2, …) be i.i.d. random variables following a Gamma distribution, denoted Gamma(α, γ), with mean αγ and variance αγ². Assume N is independent of the Z_i's. Define a random variable Z by

Z = Z_1 + Z_2 + ⋯ + Z_N (with Z = 0 when N = 0).

Then Z is a Poisson sum of independent Gamma random variables. Note that when N = 0 we have Z = 0, so the distribution of Z has positive probability mass at zero: P(Z = 0) = exp(−λ). The compound Poisson distribution (Jørgensen and Paes De Souza, 1994; Smyth and Jørgensen, 2002) is closely connected to a special class of exponential dispersion models (EDM; Jørgensen, 1987) known as Tweedie models (Tweedie, 1984), which are defined by densities of the form

f_Z(z | θ, φ) = a(z, φ) exp{ (zθ − κ(θ)) / φ },
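These definitions are easy to check by simulation. A minimal sketch (parameter values illustrative, stdlib only) draws compound Poisson-Gamma variables and compares the observed zero proportion with exp(−λ) and the sample mean with λαγ:

```python
import math
import random

# Illustrative parameters: N ~ Pois(lam), Z_i ~ Gamma(alpha, gam) with mean alpha*gam.
lam, alpha, gam = 1.5, 2.0, 0.5
rng = random.Random(42)

def rcompois():
    """One compound Poisson-Gamma draw: Z = Z_1 + ... + Z_N, with Z = 0 when N = 0."""
    n, p, c, u = 0, math.exp(-lam), math.exp(-lam), rng.random()
    while u > c:                 # Poisson draw by inversion (fine for moderate lam)
        n += 1
        p *= lam / n
        c += p
    return sum(rng.gammavariate(alpha, gam) for _ in range(n))

draws = [rcompois() for _ in range(20000)]
prop_zero = sum(d == 0.0 for d in draws) / len(draws)
print(round(prop_zero, 3), round(math.exp(-lam), 3))         # zero mass near exp(-1.5)
print(round(sum(draws) / len(draws), 2), lam * alpha * gam)  # sample mean near 1.5
```

The exact zeros come only from the event N = 0, which is what gives the Tweedie model its point mass at zero.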
where a(·, ·) and κ(·) are given functions, θ is a parameter in ℝ, and φ is a parameter in ℝ₊. For Tweedie models, the mean and variance of Z have the property E(Z) = μ = κ′(θ) and Var(Z) = φκ″(θ), where κ′ and κ″ are the first and second derivatives of κ, respectively. If we further assume that Tweedie models have a power mean-variance relationship Var(Z) = φμ^ρ for some index parameter ρ ∈ (1, 2), then we have θ = μ^(1−ρ)/(1−ρ) and κ(θ) = μ^(2−ρ)/(2−ρ). If we introduce parameters (λ, α, γ) and reparameterize them by

λ = μ^(2−ρ) / (φ(2−ρ)),  α = (2−ρ)/(ρ−1),  γ = φ(ρ−1)μ^(ρ−1),
then the probability density function (2) of the compound Poisson distribution can be transformed to the Tweedie form

f_Z(z | μ, φ, ρ) = a(z, φ, ρ) exp{ (1/φ) [ z μ^(1−ρ)/(1−ρ) − μ^(2−ρ)/(2−ρ) ] },

with a normalizing function a(z, φ, ρ) that does not depend on μ. When z > 0, the function a(z, φ, ρ) involves the sum of an infinite series, an example of Wright's generalized Bessel function (Tweedie, 1984).
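The reparameterization above can be verified numerically: mapping (μ, φ, ρ) to (λ, α, γ) and computing the compound Poisson-Gamma mean λαγ and variance λα(1 + α)γ² should recover μ and φμ^ρ. A quick arithmetic sketch with illustrative values:

```python
import math

def tweedie_to_cpois(mu, phi, rho):
    """Map Tweedie parameters (mu, phi, rho), with 1 < rho < 2, to the
    compound Poisson-Gamma parameters (lambda, alpha, gamma)."""
    lam = mu ** (2 - rho) / (phi * (2 - rho))
    alpha = (2 - rho) / (rho - 1)
    gam = phi * (rho - 1) * mu ** (rho - 1)
    return lam, alpha, gam

mu, phi, rho = 3.0, 1.2, 1.5
lam, alpha, gam = tweedie_to_cpois(mu, phi, rho)
mean = lam * alpha * gam                      # E(Z) of the compound sum
var = lam * alpha * (1 + alpha) * gam ** 2    # Var(Z) of the compound sum
print(round(mean, 6))                            # recovers mu = 3.0
print(round(var, 6), round(phi * mu ** rho, 6))  # variance recovers phi * mu^rho
```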
2.2 Zero-Inflated Tweedie Model
The Tweedie model has positive probability mass at exactly zero, corresponding to the case in which no claim occurs. Despite the popularity of Tweedie models in actuarial studies, we find that the empirical distributions of some real data, such as car insurance claims, are extremely unbalanced, with a highly inflated mass at zero. This motivates us to consider a zero-inflated mixture model that combines an exact zero mass with probability q and the Tweedie distribution with probability 1 − q; we define the random variable Y by

Y = 0 with probability q,  Y ~ Tw(μ, φ, ρ) with probability 1 − q.

We call this mixture model the zero-inflated Tweedie model, denoted the ZIF model for simplicity. The probability density function of Y can be written as

f_ZIF(y | q, μ, φ, ρ) = q·1(y = 0) + (1 − q) f_Z(y | μ, φ, ρ).
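A short simulation sketch of the ZIF model (parameter values illustrative): the observed zero proportion should approach the total zero mass q + (1 − q)exp(−λ), and the sample mean should approach (1 − q) times the Tweedie mean λαγ:

```python
import math
import random

# Illustrative ZIF parameters: exact zero w.p. q, else a compound Poisson-Gamma draw.
q, lam, alpha, gam = 0.6, 1.5, 2.0, 0.5
rng = random.Random(7)

def rzif():
    if rng.random() < q:
        return 0.0
    n, p, c, u = 0, math.exp(-lam), math.exp(-lam), rng.random()
    while u > c:                  # Poisson draw by inversion
        n += 1
        p *= lam / n
        c += p
    return sum(rng.gammavariate(alpha, gam) for _ in range(n))

draws = [rzif() for _ in range(20000)]
prop_zero = sum(d == 0.0 for d in draws) / len(draws)
print(round(prop_zero, 3), round(q + (1 - q) * math.exp(-lam), 3))  # total zero mass ≈ 0.689
print(round(sum(draws) / len(draws), 2))  # mean ≈ (1 - q) * lam * alpha * gam = 0.6
```

Note that zeros now arise from two sources, the point mass and the Tweedie component's own mass at zero, which is exactly what the E-step of the EM algorithm will have to disentangle.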
3 Gradient-Tree Boosting Tweedie Compound Poisson Model
Boosting is a way of combining the performance of many “weak” learners (e.g. classification and regression trees, Breiman 1984) to produce a powerful “committee” (Friedman et al., 2001). Gradient boosting (Friedman, 2001)
is a recursive, nonparametric machine learning algorithm that is remarkably flexible in minimizing complex differentiable loss functions. By combining a large number of base learners, it can handle higher-order interactions and produce highly complex functional forms. It provides high prediction accuracy and often outperforms many competing methods, such as linear regression or classification models, bagging (Breiman, 1996), splines, and CART (Breiman, 1984).
The gradient-tree boosting Tweedie compound Poisson model (TDboost; Yang et al., 2017) is motivated by three advantages: the popularity of Tweedie models in actuarial studies, the proven success of boosting in machine learning for classification and regression problems, and the flexibility of gradient boosting in minimizing complex loss functions. TDboost uses the negative log-likelihood of the Tweedie model as the loss function and estimates the parameters by gradient boosting, using regression trees as the base learners.
Consider a portfolio of policies {(y_i, x_i, ω_i)} from n independent insurance contracts, where for the i-th contract, y_i is the policy pure premium, x_i is a p-dimensional vector of explanatory variables that characterize the policyholder and the risk being insured, and ω_i is the policy duration, i.e., the length of time that the policy remains in force. Assume that the expected pure premium μ(x_i) is determined by a predictor function F of x_i:

μ(x_i) = E(y_i | x_i) = exp{F(x_i)}.
Then the log-likelihood function of the Tweedie model can be written as:
For now, assume the parameters φ and ρ are given; our goal is then to estimate the optimal predictor function F by minimizing the negative log-likelihood function (9) of the Tweedie model, which can be viewed as the empirical risk of the observations {(y_i, x_i, ω_i)}_{i=1}^n:

min_F Σ_{i=1}^n Ψ(y_i, F),
where the risk function of the i-th observation is:

Ψ(y_i, F) = ω_i [ −y_i μ_i^(1−ρ)/(1−ρ) + μ_i^(2−ρ)/(2−ρ) ],
with μ_i = exp{F(x_i)}. Note that the risk function does not depend on φ. In the gradient boosting method, each candidate function F is assumed to be an ensemble of base learners

F(x) = F_0 + Σ_{m=1}^{M} β_m h(x; ξ_m),  (12)
where F_0 is a constant scalar, β_m is the expansion coefficient, and h(x; ξ_m) is the m-th base learner of the multivariate argument x, characterized by the parameter ξ_m. For gradient-tree boosting, we specifically use an L-terminal-node regression tree with parameter ξ_m = {R_{lm}, u_{lm}}_{l=1}^L as the base learner in equation (12):

h(x; {R_{lm}, u_{lm}}_{l=1}^L) = Σ_{l=1}^L u_{lm} 1(x ∈ R_{lm}),

where R_{1m}, …, R_{Lm} are the disjoint regions representing the terminal nodes of the tree and the constants u_{1m}, …, u_{Lm} are the values assigned to each region, respectively. The constant F_0 in (12) is chosen as the constant that minimizes the negative log-likelihood.
A forward stagewise algorithm (Friedman, 2001) can be adopted to approximately minimize the risk function (10); it builds up the components sequentially through a gradient-descent-like approach. At each iteration stage m, suppose that the current estimate of F is F^[m−1]. To update from F^[m−1] to F^[m], the gradient-tree boosting method first fits the m-th regression tree to the negative gradient vector by least-squares minimization:

ξ_m = argmin_ξ Σ_{i=1}^n [ g_{im} − h(x_i; ξ) ]²,

where g_m = (g_{1m}, …, g_{nm})′ is the current negative gradient vector of the risk with respect to F:

g_{im} = −[ ∂Ψ(y_i, F)/∂F ]_{F = F^[m−1](x_i)}.
When fitting the regression trees, we first use a fast top-down "best-fit" algorithm with a least-squares splitting criterion (Friedman et al., 2000) to find the splitting variables and the corresponding split locations that determine the terminal regions {R_{lm}}_{l=1}^L, and then estimate the terminal-node values u_{lm} by averaging the negative gradients (a property of least-squares minimization) falling in each region, respectively:

u_{lm} = (1/|{i : x_i ∈ R_{lm}}|) Σ_{x_i ∈ R_{lm}} g_{im}.

This fitted regression tree can be viewed as a tree-constrained approximation of the unconstrained negative gradient. Owing to the disjoint nature of the regions produced by regression trees, finding the expansion coefficients in the ensemble equation (12) reduces to solving for an optimal constant η_{lm} within each region R_{lm}:

η_{lm} = argmin_η Σ_{x_i ∈ R_{lm}} Ψ(y_i, F^[m−1](x_i) + η).
We then update the current estimate F^[m−1] in each region:

F^[m](x) = F^[m−1](x) + ν Σ_{l=1}^L η_{lm} 1(x ∈ R_{lm}),

where 0 < ν ≤ 1 is the shrinkage factor (Friedman, 2001) that controls the update step size. A small ν imposes more shrinkage, while ν = 1 gives complete negative gradient steps. Friedman (2001) found that the shrinkage factor reduces overfitting and improves predictive accuracy.
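The boosting recursion above can be illustrated end to end on toy one-dimensional data. This is a minimal sketch, not the TDboost package: a single feature, depth-one stumps as base learners, noise-free responses, and the closed-form terminal-node step log(Σ y_i μ_i^(1−ρ) / Σ μ_i^(2−ρ)) implied by the Tweedie loss (all parameter values illustrative):

```python
import math

# Toy one-dimensional data: the true mean jumps at x = 1.0.
xs = [i / 50.0 for i in range(100)]
ys = [math.exp(0.5) if x <= 1.0 else math.exp(1.5) for x in xs]  # noise-free targets

rho, nu, M = 1.5, 0.1, 200   # index parameter, shrinkage factor, boosting rounds

# F_0: the constant minimizing the Tweedie risk is the log of the sample mean
F = [math.log(sum(ys) / len(ys))] * len(xs)

for m in range(M):
    mu = [math.exp(f) for f in F]
    # negative gradient of the Tweedie risk w.r.t. F: mu^(1-rho) * (y - mu)
    g = [mu_i ** (1 - rho) * (y - mu_i) for y, mu_i in zip(ys, mu)]
    # depth-one stump: pick the split minimizing least squares on the gradient
    best = None
    for k in range(1, len(xs)):            # xs is sorted, so splits are contiguous
        L, R = range(k), range(k, len(xs))
        mL = sum(g[i] for i in L) / len(L)
        mR = sum(g[i] for i in R) / len(R)
        sse = sum((g[i] - mL) ** 2 for i in L) + sum((g[i] - mR) ** 2 for i in R)
        if best is None or sse < best[0]:
            best = (sse, L, R)
    _, L, R = best
    for region in (L, R):
        # closed-form terminal-node step for the Tweedie loss
        num = sum(ys[i] * mu[i] ** (1 - rho) for i in region)
        den = sum(mu[i] ** (2 - rho) for i in region)
        step = math.log(num / den)
        for i in region:
            F[i] += nu * step

# fitted means on each side of the jump recover exp(0.5) and exp(1.5)
print(round(math.exp(F[25]), 2), round(math.exp(F[75]), 2))  # → 1.65 4.48
```

With ν = 0.1 each node value moves a tenth of the way toward its closed-form target per round, which is the shrinkage behavior described above.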
To estimate the parameters (ρ, φ), we follow Dunn and Smyth (2005) and use the profile likelihood; these parameters jointly determine the mean-variance relation Var(y_i) = (φ/ω_i) μ(x_i)^ρ. We exploit the fact that, in Tweedie models, solving for the expected pure premium μ, i.e., the estimator F in (10), depends only on ρ and not on φ. Therefore, given a fixed ρ, the mean estimate can be obtained with the TDboost algorithm discussed above. Conditional on this ρ and the corresponding mean estimate, we can maximize the log-likelihood function (9) with respect to φ.
This univariate optimization problem can be solved using a combination of golden-section search and successive parabolic interpolation (Brent, 2013). In this way, we determine the corresponding φ̂ for each given ρ. We then acquire the optimal ρ̂ by maximizing the profile likelihood over a sequence of candidate ρ values. Finally, the optimal ρ̂ is applied in (9) and (19) to obtain the corresponding estimates.
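The golden-section half of that one-dimensional solver is easy to sketch. For brevity this minimizes a toy quadratic surrogate rather than the actual negative log-likelihood in φ; the function and interval are illustrative:

```python
import math

def golden_section_min(f, a, b, tol=1e-8):
    """Minimize a unimodal function f on [a, b] by golden-section search,
    the bracketing step that Brent's method combines with successive
    parabolic interpolation."""
    invphi = (math.sqrt(5) - 1) / 2          # inverse golden ratio ≈ 0.618
    c, d = b - invphi * (b - a), a + invphi * (b - a)
    while b - a > tol:
        if f(c) < f(d):                      # minimum lies in [a, d]
            b, d = d, c
            c = b - invphi * (b - a)
        else:                                # minimum lies in [c, b]
            a, c = c, d
            d = a + invphi * (b - a)
    return (a + b) / 2

# Toy surrogate for the profiled negative log-likelihood in phi:
phi_hat = golden_section_min(lambda p: (p - 1.3) ** 2, 0.01, 10.0)
print(round(phi_hat, 4))  # → 1.3
```

Each iteration shrinks the bracket by the constant factor 0.618, so the number of function evaluations needed is logarithmic in the desired tolerance.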
4 Our Proposal
In our ZIF model, we assume q(x) ≡ q. Although it may seem more reasonable and refined to model the zero mass probability as a function of the covariates x, we make this simplifying assumption for two reasons. First, there is a practical rationale behind it: the overall zero mass probability has an interpretable meaning in real data. For example, car insurance claim data in a country with a vast territory and sparse population tend to have a larger exact zero mass probability than those in a country with a limited territory and dense population; similarly, claim data in a province with a well-developed transportation system and infrastructure tend to have a larger zero mass probability than those in a poorly developed province. Second, although the more refined model with a covariate-dependent q(x) is meaningful and deserves attention, we do not take that nontrivial step at once. What we consider in this paper can be viewed as a transition: a better understanding of the characteristics of the ZIF model through its simplified version will benefit our future work. We will see shortly that this model can be solved easily by inserting the TDboost algorithm of Section 3 into the widely used EM algorithm.
Given ρ and a portfolio of n policies {(y_i, x_i, ω_i)} from independent insurance contracts, assume that each policy pure premium y_i follows the zero-inflated Tweedie model with an overall zero mass probability q:

y_i = 0 with probability q,  y_i ~ Tw(μ(x_i), φ/ω_i, ρ) with probability 1 − q,

and μ(x_i) is determined by a predictor function F of x_i:

μ(x_i) = exp{F(x_i)}.

Then the probability density function of y_i can be written as

f_ZIF(y_i | q, F, φ, ρ) = q·1(y_i = 0) + (1 − q) f_Z(y_i | μ(x_i), φ/ω_i, ρ),

where f_Z denotes the Tweedie density. The expected pure premium is given by

E(y_i | x_i) = (1 − q) μ(x_i).
4.1 Estimating F, q, and φ via Zero-Inflated TDboost
When finding the maximum likelihood estimator (MLE) for the mixture model (6), a straightforward approach is to use the EM algorithm (McLachlan and Krishnan, 2007). To develop the idea, we assume that the index parameter ρ is given for the time being. We estimate the predictor function F, the overall zero mass probability q, and the dispersion φ by introducing latent variables and utilizing an extended version of the EM algorithm, the block-wise coordinate descent method. When solving for the predictor function F, the TDboost algorithm discussed in Section 3 is inserted into the maximization substep. The joint estimation with ρ will be studied in Section 4.3.
Given ρ and a portfolio of n policies {(y_i, x_i, ω_i)}, our purpose is to maximize the log-likelihood of the ZIF model:

ℓ_ZIF(F, q, φ) = Σ_{i: y_i = 0} log[ q + (1 − q) f_Z(0 | μ(x_i), φ/ω_i, ρ) ] + Σ_{i: y_i > 0} log[ (1 − q) f_Z(y_i | μ(x_i), φ/ω_i, ρ) ].  (24)
To solve (24), we resort to the EM algorithm by introducing latent variables Π = (π_1, …, π_n), where π_i is the class variable: π_i = 1 if y_i is sampled from the Tweedie distribution, and π_i = 0 if y_i is sampled from the zero point mass. Then the log-likelihood of the joint distribution of (y, Π) is

Σ_{i=1}^n { π_i [ log(1 − q) + log f_Z(y_i | μ_i, φ/ω_i, ρ) ] + (1 − π_i) log q },  (25)
where μ_i = exp{F(x_i)}. Taking the posterior expectation of (25) with respect to Π, we have

Σ_{i=1}^n { π̂_i [ log(1 − q) + log f_Z(y_i | μ_i, φ/ω_i, ρ) ] + (1 − π̂_i) log q },  (26)

where the posterior expectation of the latent variable is

π̂_i = (1 − q) f_Z(y_i | μ_i, φ/ω_i, ρ) / [ q·1(y_i = 0) + (1 − q) f_Z(y_i | μ_i, φ/ω_i, ρ) ].  (27)

From the procedure of the EM algorithm, we know that maximizing the log-likelihood function (24) of the original mixture model can be accomplished by iteratively computing the posterior expectation (26) with respect to the latent variables and maximizing the expected log-likelihood of the completed model with respect to (F, q, φ). Since these estimators cannot be solved jointly, we use the extended version of the EM algorithm (McLachlan and Krishnan, 2007): we apply the block-wise coordinate descent method in the maximization step and solve for the estimators successively. Alternating with the expectation step, the estimators converge to a local maximum. This optimization problem is not convex, so a decent starting point deserves some extra effort and will be studied in Section 4.2. For now, we assume the initial parameters (F^(0), q^(0), φ^(0)) are chosen.
To update from the current stage t to the next stage t + 1, assume the current estimates are (F^(t), q^(t), φ^(t)). The maximization step then updates these to (F^(t+1), q^(t+1), φ^(t+1)) with the estimates of the latent variables fixed. We utilize the block-wise coordinate descent method and solve for F, q, and φ successively, viewing the other two parameters as fixed.
Conditional on the given ρ and the updated estimates, maximizing the log-likelihood function with respect to φ is a univariate optimization problem that can be solved using a combination of golden-section search and successive parabolic interpolation (Brent, 2013).
Then we perform the expectation step and find the current estimates of the latent variables by using the updated parameters from the M-step. Note from equation (27) that π̂_i = 1 if y_i > 0, so we only need to update π̂_i for the zero-response observations, i.e., those with y_i = 0.
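For zero responses, the only part of the Tweedie density the E-step needs is its point mass P(Z = 0) = exp(−λ) with λ = μ^(2−ρ)/(φ(2−ρ)), so the posterior in (27) has a simple closed form. A minimal sketch (parameter values illustrative):

```python
import math

def posterior_tweedie_weight(y, mu, phi, rho, q):
    """E-step sketch: posterior probability that observation y came from the
    Tweedie component (pi_i = 1) rather than from the exact-zero mass."""
    if y > 0:
        return 1.0                                # the point mass cannot produce y > 0
    lam = mu ** (2 - rho) / (phi * (2 - rho))     # P(Tweedie = 0) = exp(-lam)
    p_tw_zero = (1 - q) * math.exp(-lam)
    return p_tw_zero / (q + p_tw_zero)

print(posterior_tweedie_weight(2.7, 3.0, 1.2, 1.5, 0.6))  # → 1.0
w = posterior_tweedie_weight(0.0, 3.0, 1.2, 1.5, 0.6)
print(round(w, 3))  # small: a zero claim here most likely came from the point mass
```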
After one updating stage of the M-step and E-step, we plug the updated estimators into the log-likelihood function (24) and record the log-likelihood value. We also compute the maximum difference between the updated parameters and their previous values and compare it with a tolerance constant ε set beforehand. Iteration stops and the current estimators are returned once the difference at the current stage is smaller than ε. If this stopping criterion is not met after a maximum number of iterations T (set beforehand), we choose the iteration step that maximizes the log-likelihood over the T iterations and return the corresponding parameters as the optimal estimators.
4.2 Initializing Parameters via Observed Responses
In this section, we give a data-motivated method to initialize the parameters to be estimated. A carefully selected starting point close to a good local maximum, or even the global maximum, makes the algorithm efficient. The initial estimates of the latent variables are given by approximately setting π_i = 1(y_i > 0), i.e., treating all zeros as coming from the exact zero mass, which is reasonable for extremely unbalanced zero-inflated claim data. Maximizing equation (26) can then be approximated by maximizing the following objective:

Σ_{i: y_i > 0} [ log(1 − q) + log f_Z(y_i | μ_i, φ/ω_i, ρ) ] + n_0 log q,  (33)
where n_0 is the number of zero responses and μ_i = exp{F(x_i)}. To initialize F, we further assume that the initial function is chosen from the class of constant functions, i.e., F(x) ≡ F_0. Then the initial estimates of F_0, q, and φ can be obtained by successively maximizing the objective (33) with respect to one parameter, with the other two held fixed.
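Under this approximation, the initial q is simply the observed zero proportion, and a constant predictor F_0 makes the initial mean the average of the positive claims. A small sketch (data values illustrative):

```python
import math

def zif_init(ys):
    """Initialization sketch: treat every zero response as coming from the
    exact-zero mass, i.e. pi_i = 1 exactly when y_i > 0."""
    pi0 = [1 if y > 0 else 0 for y in ys]
    q0 = 1 - sum(pi0) / len(ys)                      # observed zero proportion
    positives = [y for y in ys if y > 0]
    F0 = math.log(sum(positives) / len(positives))   # constant predictor: mu_0 = mean positive claim
    return pi0, q0, F0

# Illustrative claim sizes, mostly zero:
ys = [0.0, 0.0, 0.0, 1.2, 0.0, 3.4, 0.0, 2.0, 0.0, 0.0]
pi0, q0, F0 = zif_init(ys)
print(q0)                      # → 0.7 (seven of ten claims are zero)
print(round(math.exp(F0), 2))  # → 2.2 (mean of the positive claims)
```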
Given these initial estimates, we can then initialize the latent variables by plugging them into equation (27):
In summary, the complete ZIF algorithm is shown in Algorithm 1.
Get the training set {(y_i, x_i, ω_i)}_{i=1}^n and the zero-response index set {i : y_i = 0}; set the parameters of the ZIF model, the maximum iteration step T, and the tolerance constant ε. Set t = 0. Set the initial parameter difference to +∞.
For t = 1, …, T, repeatedly do steps 3(a)-3(e):
Compute the log-likelihood of the ZIF model by equation (24):
Compute the maximum difference between the former estimators and the updated ones:
If the maximum difference is smaller than ε: return the current estimators and end the algorithm.
Otherwise, choose the optimal iteration step with respect to the maximum log-likelihood, return the corresponding estimators, and end the algorithm.
4.3 Estimating ρ via Profile Likelihood
We follow the method discussed in the last paragraph of Section 3, using the profile likelihood to estimate the index parameter ρ. Conditional on a given ρ, we follow Algorithm 1 to maximize the log-likelihood function (24) of the ZIF model, which gives the corresponding estimators (F̂_ρ, q̂_ρ, φ̂_ρ). Then we acquire the optimal estimate of ρ by maximizing the profile likelihood over a sequence of equally spaced candidate values on the interval (1, 2):

ρ̂ = argmax_ρ ℓ_ZIF(F̂_ρ, q̂_ρ, φ̂_ρ),

where (F̂_ρ, q̂_ρ, φ̂_ρ) denote the estimators returned by Algorithm 1 for the candidate value ρ. Finally, we apply the optimal ρ̂ in (24) again to obtain the corresponding estimators F̂, q̂, and φ̂.
4.4 Modified EM Algorithm: Penalizing on q
When applying the EM algorithm to maximize the log-likelihood function of the ZIF-TD model, we want to avoid the case in which the zero mass probability q shrinks to zero, i.e., the ZIF-TD model degenerates to the TD model, since it is a waste of computation and time to end up with the TD model after running the EM algorithm under the ZIF-TD assumption. In fact, in the real-data application we will introduce a score test with the TD model as the null hypothesis against the ZIF-TD model as the alternative, to answer the question of whether we should inflate the TD model to the ZIF-TD model. Once the test indicates that we should use the ZIF-TD model, we do not expect the estimators to degenerate to the TD model. Under these considerations, we add a penalization term to the log-likelihood function (24):

ℓ_p(F, q, φ) = ℓ_ZIF(F, q, φ) + τ log(q),  (40)
where log(q) is the regularization term and τ is a nonnegative regularization parameter. When maximizing the penalized log-likelihood function (40), smaller q is penalized more heavily because of the logarithmic property; the regularization term forces the zero mass probability toward larger values as τ increases. Similarly to the procedure in Section 4.1, we introduce the latent variables and take the posterior expectation of the penalized log-likelihood function with respect to Π:
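The effect of the penalty on the q-update is easy to see in closed form. With the latent variables held fixed, the q-part of the penalized expected log-likelihood is (n_0 + τ) log q + n_1 log(1 − q), where n_0 = Σ(1 − π̂_i) and n_1 = Σ π̂_i, and its maximizer is (n_0 + τ)/(n_0 + n_1 + τ). A sketch (counts illustrative):

```python
def penalized_q(n0, n1, tau):
    """Closed-form maximizer of (n0 + tau)*log(q) + n1*log(1 - q),
    the q-part of the penalized expected log-likelihood with pi fixed."""
    return (n0 + tau) / (n0 + n1 + tau)

print(penalized_q(70, 30, 0))    # → 0.7 (unpenalized: the plain zero-mass proportion)
print(penalized_q(70, 30, 10))   # ≈ 0.727: the penalty pushes q away from zero
```

Setting τ = 0 recovers the ordinary EM update, so the penalty can be switched off once the score test has settled the model choice.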