1 Introduction
Setting the premium for customers (policyholders) is one of the most important problems in the insurance business. It is therefore crucial to predict the size of actual (currently unforeseeable) claims. In insurance premium prediction problems, the total claim amount for a covered risk usually has a continuous distribution on positive values, except for the possibility of being exactly zero when no claim occurs. Such data cannot be transformed to normality by a power transformation, and special treatment of zero claims is often required. Jørgensen and Paes De Souza (1994) and Smyth and Jørgensen (2002) used generalized linear models (GLM; Nelder and Wedderburn, 1972) with a Tweedie-distributed outcome, assuming Poisson arrival of claims and Gamma-distributed individual claim amounts, to simultaneously model the frequency and severity of insurance claims. The Tweedie GLM has been widely used in actuarial studies (Mildenhall, 1999; Murphy et al., 2000; Sandri and Zuccolotto, 2008). It has a major limitation, however: the structure of the logarithmic mean is restricted to a linear form, which can be too rigid for real applications. Zhang (2011) models the nonlinearity by adding splines, but low-degree splines are often inadequate to capture the nonlinearity in the data, while high-degree splines often result in overfitting. Generalized additive models (GAM; Hastie and Tibshirani, 1990; Wood, 2006) can model continuous variables by smooth functions estimated from the data, but the structure of the model has to be determined a priori, i.e., one has to specify the main effects and interactions to be used in the model.
Yang et al. (2017) model insurance claim size by a nonparametric Tweedie model and propose a gradient tree-boosting algorithm (TDboost) to fit it. Despite the popularity of the Tweedie model under linear or nonlinear logarithmic mean assumptions, the problem of modeling extremely unbalanced, zero-inflated claim data (excess zeros) remains. Two methods are commonly used to handle data sets with excess zeros: hurdle-at-zero models and zero-inflated models. Hurdle models (Cragg, 1971; Mullahy, 1986) model whether the response is zero or nonzero with one model and then model the nonzero data with another (e.g., truncated Poisson, truncated negative binomial; Mullahy, 1986). Lambert (1992)'s zero-inflated Poisson (ZIP) regression models inflate the number of zeros by mixing a point mass at zero with a Poisson distribution.
Hall (2000) adapted Lambert's (1992) ZIP models to an upper-bounded count situation using zero-inflated binomial (ZIB) regression models. For continuous zero-inflated data sets, Ancelet et al. (2010) mixed a point mass at zero with a compound Poisson-exponential distribution: a Poisson random sum of exponential variables.
In this article, we aim to model extremely unbalanced, zero-inflated insurance claim sizes by a zero-inflated nonparametric Tweedie compound Poisson model (ZIF-Tweedie henceforth) and propose an EM algorithm to fit this mixture model, in which the gradient tree-boosting algorithm (TDboost; Yang et al., 2017) is implemented to fit the nonparametric Tweedie model as part of the Maximization step when estimating the potential customers' claim size. We denote this algorithm by ZIF-TDboost for simplicity.
The TDboost model is motivated by the proven success of boosting in machine learning for nonparametric regression and classification (Friedman, 2001, 2002). Boosting adaptively combines a large number of relatively simple prediction models, called base learners, into an ensemble learner to achieve high prediction performance. The seminal boosting algorithm AdaBoost (Freund and Schapire, 1997) was originally proposed for classification problems. Later, Breiman et al. (1998) and Breiman (1999) pointed out an important connection between the AdaBoost algorithm and a functional gradient descent algorithm. Friedman et al. (2000) and Friedman et al. (2001) developed a statistical view of boosting and proposed gradient boosting methods for both classification and regression. We use the gradient tree-boosting Tweedie models for several reasons. First, the model structure of TDboost is learned from data and not predetermined: nonlinearities, discontinuities, and complex higher-order interactions are naturally incorporated into the model to reduce potential modeling bias and to produce high predictive performance. In addition, TDboost handles response variables of any type without the need for transformation, and it is highly robust to outliers. Missing values in the predictors are managed almost without loss of information
(Elith et al., 2008). Our ZIF-TDboost model assumes that with some probability the insurance claim size is exactly zero, and that otherwise a Tweedie-distributed claim size is realized. For example, a region with a highly developed transportation system and well-educated citizens is likely to have a larger zero-mass probability than a region with a poorly developed transportation system. One advantage of our ZIF-TDboost model is that it is straightforward to fit. Moreover, it inherits all the above-mentioned advantages of the TDboost model and estimates the zero-mass probability accurately. On the other hand, ZIF-TDboost also inherits the disadvantage of TDboost that its results are not as easily interpretable as those from the Tweedie GLM. We also make the simplifying assumption that the zero-mass probability is a constant, not functionally related to the predictor variables. Nevertheless, a common zero-mass probability still has practical meaning, as the example above illustrates. We also give a heuristic score test with a threshold to determine whether the proportion of zero claims in the insurance data is too large for a Tweedie model to fit the data well, so that the zero-inflated Tweedie model needs to be introduced. The rest of the article is organized as follows. Section 2 briefly presents the models implemented, with the Tweedie model in Section 2.1 and the zero-inflated Tweedie model in Section 2.2. Section 3 briefly reviews the TDboost models. We present the main methodology with implementation details in Section 4, including the main algorithm in Section 4.1, the data-motivated initialization in Section 4.2 and the penalized methodology in Section 4.4. In Section 5, we use simulation to show the high prediction accuracy of our ZIF-TDboost model. As an application, we apply the new model to analyze an auto insurance claim data set in Section 6, with the score test in Section 6.4. A brief conclusion is given in Section 7.
2 Tweedie Model and Zero-Inflated Tweedie Model
2.1 Compound Poisson Distribution and Tweedie Model
In insurance premium prediction problems, the total claim amount for a covered risk usually has a highly right-skewed continuous distribution on nonnegative values, where an exact zero claim means that no claim occurred. One standard approach in actuarial science to modeling such data is to use Tweedie compound Poisson models.
Let N be a random variable following a Poisson distribution, denoted by Pois(λ), with mean λ, and let X_i (i = 1, …, N) be i.i.d. random variables following a Gamma distribution, denoted by Gamma(α, γ), with mean αγ and variance αγ². Assume N is independent of the X_i's. Define a random variable Z by

(1)   Z = X_1 + X_2 + \cdots + X_N.

Then Z is a Poisson sum of independent Gamma random variables. Note that when N = 0 we have Z = 0, so the distribution of Z has positive probability mass at zero: P(Z = 0) = \exp(-\lambda). The compound Poisson distribution (Jørgensen and Paes De Souza, 1994; Smyth and Jørgensen, 2002) is closely connected to a special class of exponential dispersion models (EDM; Jørgensen, 1987) known as Tweedie models (Tweedie, 1984), which are defined by the form

(2)   f(z \mid \theta, \phi) = a(z, \phi) \exp\left\{ \frac{z\theta - \kappa(\theta)}{\phi} \right\},
where a(·, ·) and κ(·) are given functions, θ is a parameter in ℝ and φ is a parameter in (0, ∞). For Tweedie models, the mean and variance of Z satisfy E(Z) = μ = κ'(θ) and Var(Z) = φκ''(θ), where κ' and κ'' are the first and second derivatives of κ, respectively. If we further assume that Tweedie models have a power mean–variance relationship Var(Z) = φμ^ρ for some index parameter ρ ∈ (1, 2), then we have θ = μ^{1-ρ}/(1-ρ) and κ(θ) = μ^{2-ρ}/(2-ρ). If we introduce the parameters (λ, α, γ) and reparameterize them by

(3)   \lambda = \frac{\mu^{2-\rho}}{\phi(2-\rho)}, \qquad \alpha = \frac{2-\rho}{\rho-1}, \qquad \gamma = \phi(\rho-1)\mu^{\rho-1},
then the probability density function (2) of the compound Poisson distribution can be transformed into the form

(4)   f(z \mid \mu, \phi, \rho) = a(z, \phi, \rho) \exp\left\{ \frac{1}{\phi}\left( \frac{z\mu^{1-\rho}}{1-\rho} - \frac{\mu^{2-\rho}}{2-\rho} \right) \right\},
where

(5)   a(z, \phi, \rho) = \begin{cases} 1, & z = 0, \\[4pt] \dfrac{1}{z} \displaystyle\sum_{j=1}^{\infty} \dfrac{z^{j\alpha}}{(\rho-1)^{j\alpha} (2-\rho)^{j} \phi^{j(1+\alpha)} \, j! \, \Gamma(j\alpha)}, & z > 0, \end{cases}

with α = (2 − ρ)/(ρ − 1). When z > 0, the sum of the infinite series is an example of Wright's generalized Bessel function (Tweedie, 1984).
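The compound Poisson mechanism in (1) is easy to simulate directly, which is useful for sanity-checking the zero mass P(Z = 0) = exp(−λ) and the mean E(Z) = λαγ. The following sketch uses an illustrative function name and parameter values of our own choosing:

```python
import numpy as np

def rcompois_gamma(n, lam, alpha, gam, rng=None):
    """Draw n samples of Z = X_1 + ... + X_N, where N ~ Pois(lam) and the X_i
    are i.i.d. Gamma(alpha, scale=gam); Z equals exactly zero whenever N = 0."""
    rng = np.random.default_rng(rng)
    counts = rng.poisson(lam, size=n)                  # claim counts N
    return np.array([rng.gamma(alpha, gam, k).sum() for k in counts])

z = rcompois_gamma(100_000, lam=1.0, alpha=2.0, gam=0.5, rng=0)
# empirically, (z == 0).mean() is close to exp(-1) and z.mean() close to 1.0
```

With these parameters the theoretical zero mass is exp(−1) ≈ 0.368 and the theoretical mean is λαγ = 1.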
2.2 Zero-Inflated Tweedie Model
The Tweedie model has positive probability mass at exactly zero, corresponding to the case in which no claim occurs. Despite the popularity of Tweedie models in actuarial studies, we find that the empirical distributions of some real data, such as car insurance claims, are extremely unbalanced, with a highly inflated mass at zero. This motivates us to consider a zero-inflated mixture model that combines an exact-zero mass, with probability q, and the Tweedie distribution, with probability 1 − q. We define the random variable Y by

(6)   Y = \begin{cases} 0, & \text{with probability } q, \\ Z \sim \mathrm{Tw}(\mu, \phi, \rho), & \text{with probability } 1 - q. \end{cases}
We call this mixture model the zero-inflated Tweedie model, denoted the ZIF model for simplicity. The probability density function of Y can be written as

(7)   f_{\mathrm{ZIF}}(y \mid q, \mu, \phi, \rho) = q\,\mathbb{1}(y = 0) + (1 - q)\, f(y \mid \mu, \phi, \rho).
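A draw from the ZIF model mixes two sources of zeros: the inflation component and the Poisson-zero event of the Tweedie component itself. A minimal sampler sketch (function name and parameter values are illustrative choices of ours):

```python
import numpy as np

def rzif(n, q, lam, alpha, gam, rng=None):
    """Sample from the zero-inflated compound Poisson mixture (6): with
    probability q the response is an exact zero from the inflation component,
    otherwise it is a Poisson-Gamma sum, which can also be zero when N = 0."""
    rng = np.random.default_rng(rng)
    inflated = rng.random(n) < q                       # latent class indicator
    counts = rng.poisson(lam, size=n)
    z = np.array([rng.gamma(alpha, gam, k).sum() for k in counts])
    z[inflated] = 0.0
    return z

z = rzif(200_000, q=0.4, lam=1.0, alpha=2.0, gam=0.5, rng=1)
# total zero mass is q + (1 - q) * exp(-lam), here about 0.62
```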
3 Gradient Tree-Boosting Tweedie Compound Poisson Model
Boosting is a way of combining the performance of many "weak" learners (e.g., classification and regression trees; Breiman, 1984) to produce a powerful "committee" (Friedman et al., 2001). Gradient boosting (Friedman, 2001) is a recursive, nonparametric machine learning algorithm that shows remarkable flexibility in minimizing complex but differentiable loss functions. By combining a large number of base learners, it can handle higher-order interactions and produce highly complex functional forms. It provides high prediction accuracy and often outperforms many competing methods, such as linear regression or classification, bagging (Breiman, 1996), splines, and CART (Breiman, 1984). The gradient tree-boosting Tweedie compound Poisson model (TDboost; Yang et al., 2017) is motivated by three advantages: the popularity of the Tweedie models in actuarial studies, the proven success of boosting in machine learning for classification and regression problems, and the flexibility of gradient boosting in minimizing complex loss functions. TDboost uses the negative log-likelihood function of the Tweedie model as the loss function and estimates the parameters by following the gradient boosting method, specifically with regression trees as the base learners.
Consider a portfolio of policies \{(y_i, x_i, w_i)\}_{i=1}^{n} from n independent insurance contracts, where for the i-th contract, y_i is the policy pure premium, x_i is a p-dimensional vector of explanatory variables that characterize the policyholder and the risk being insured, and w_i is the policy duration, i.e., the length of time that the policy remains in force. Assume that the expected pure premium μ(x_i) is determined by a predictor function F of x_i:

(8)   \log\{\mu(x_i)\} = F(x_i).

Then the log-likelihood function of the Tweedie model can be written as

(9)   \ell(F, \phi, \rho) = \sum_{i=1}^{n} \log f\big(y_i \mid \mu_i, \phi / w_i, \rho\big), \qquad \mu_i = \exp\{F(x_i)\}.
For now, assume that the parameters φ and ρ are given; our goal is then to estimate the optimal predictor function F by minimizing the negative log-likelihood (9) of the Tweedie model, which can be viewed as the empirical risk function of the observations \{(y_i, x_i, w_i)\}_{i=1}^{n}:

(10)   \hat{F} = \operatorname*{arg\,min}_{F} \sum_{i=1}^{n} \Psi\big(y_i, F(x_i)\big),

where the risk function of the i-th observation is

(11)   \Psi\big(y_i, F(x_i)\big) = w_i \left\{ -\frac{y_i e^{(1-\rho) F(x_i)}}{1-\rho} + \frac{e^{(2-\rho) F(x_i)}}{2-\rho} \right\},

with μ_i = \exp\{F(x_i)\}. Note that the risk function does not depend on φ. For the gradient boosting method, each candidate function is assumed to be an ensemble of base learners
(12)   F(x) = F_0 + \sum_{m=1}^{M} \beta_m h(x; \xi_m),

where F_0 is a constant scalar, β_m is the expansion coefficient and h(x; ξ_m) is the m-th base learner of the multivariate argument x, characterized by the parameter ξ_m. For gradient-tree boosting, we specifically use an L-terminal-node regression tree, with parameter ξ_m = \{R_{lm}, u_{lm}\}_{l=1}^{L}, as the base learner in equation (12):

(13)   h\big(x; \{R_{lm}, u_{lm}\}_{l=1}^{L}\big) = \sum_{l=1}^{L} u_{lm}\, \mathbb{1}(x \in R_{lm}),

where R_{1m}, …, R_{Lm} are the disjoint regions representing the terminal nodes of the tree and the constants u_{1m}, …, u_{Lm} are the values assigned to each region, respectively. The constant F_0 in (12) is chosen as the constant that minimizes the negative log-likelihood.
A forward stagewise algorithm (Friedman, 2001) can be adopted to approximately find the minimizer of the risk function (10); it builds up the components sequentially through a gradient-descent-like approach. At each iteration stage m, suppose that the current estimate of F is \hat{F}^{(m-1)}. To update from \hat{F}^{(m-1)} to \hat{F}^{(m)}, the gradient-tree boosting method first fits the m-th regression tree to the negative gradient vector by least squares:

(14)   \hat{\xi}_m = \operatorname*{arg\,min}_{\xi} \sum_{i=1}^{n} \big[ g_i^{(m)} - h(x_i; \xi) \big]^2,

where g_i^{(m)} is the current negative gradient of Ψ with respect to F(x_i):

(15)   g_i^{(m)} = -\left.\frac{\partial \Psi\big(y_i, F(x_i)\big)}{\partial F(x_i)}\right|_{F = \hat{F}^{(m-1)}} = w_i \left( y_i e^{(1-\rho)\hat{F}^{(m-1)}(x_i)} - e^{(2-\rho)\hat{F}^{(m-1)}(x_i)} \right).
When fitting the regression trees, we first use a fast top-down "best-fit" algorithm with a least-squares splitting criterion (Friedman et al., 2000) to find the splitting variables and the corresponding splitting locations that determine the terminal regions \{R_{lm}\}_{l=1}^{L}, then estimate the terminal-node values by averaging the negative gradients (a property of least-squares minimization) falling in each region, respectively:

(16)   \hat{u}_{lm} = \operatorname{mean}_{x_i \in R_{lm}} \big( g_i^{(m)} \big), \qquad l = 1, \ldots, L.
This fitted regression tree can be viewed as a tree-constrained approximation of the unconstrained negative gradient. Owing to the disjoint nature of the regions produced by regression trees, finding the expansion coefficient in ensemble equation (12) can be reduced to solving for optimal constants η_{lm} within each region R_{lm}:

(17)   \hat{\eta}_{lm} = \operatorname*{arg\,min}_{\eta} \sum_{x_i \in R_{lm}} \Psi\big(y_i, \hat{F}^{(m-1)}(x_i) + \eta\big).
We then update the current estimate in each region:

(18)   \hat{F}^{(m)}(x) = \hat{F}^{(m-1)}(x) + \nu \sum_{l=1}^{L} \hat{\eta}_{lm}\, \mathbb{1}(x \in R_{lm}),

where 0 < ν ≤ 1 is the shrinkage factor (Friedman, 2001) that controls the update step size. A small ν imposes more shrinkage, while ν = 1 gives complete negative-gradient steps. Friedman (2001) found that the shrinkage factor reduces overfitting and improves predictive accuracy.
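The updates (14)-(18) can be sketched in a few lines. The following is an illustrative re-implementation, not the TDboost package: it assumes unit durations (w_i = 1), a fixed ρ, and hyperparameter defaults of our own choosing, and it uses the closed-form solution of (17) for the Tweedie risk, η = log(Σ y_i e^{(1−ρ)F_i} / Σ e^{(2−ρ)F_i}), within each terminal region:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def tdboost_sketch(X, y, rho=1.5, n_trees=100, nu=0.05, depth=2, leaf=20):
    """Gradient tree boosting for the Tweedie deviance with a log link (sketch,
    unit durations). F approximates log(mu); each tree is fit to the negative
    gradient, then its terminal-node values are replaced by the one-step
    optimal constants and added with shrinkage nu."""
    F = np.full(len(y), np.log(y.mean() + 1e-8))       # constant initial fit
    for _ in range(n_trees):
        # negative gradient of the Tweedie risk, as in (15) with w_i = 1
        g = y * np.exp((1 - rho) * F) - np.exp((2 - rho) * F)
        tree = DecisionTreeRegressor(max_depth=depth, min_samples_leaf=leaf)
        tree.fit(X, g)
        region = tree.apply(X)                         # terminal-node labels
        eta = np.zeros(len(y))
        for r in np.unique(region):
            m = region == r
            num = (y[m] * np.exp((1 - rho) * F[m])).sum()
            den = np.exp((2 - rho) * F[m]).sum()
            eta[m] = np.log((num + 1e-12) / (den + 1e-12))
        # clip guards regions containing only zero claims; nu shrinks the step
        F = F + nu * np.clip(eta, -5.0, 5.0)
    return F
```

On simulated data with log μ(x) = x₁, the fitted F tends to track the log-mean surface closely after a few hundred small steps.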
To estimate the parameters (φ, ρ), we follow Dunn and Smyth (2005) and use the profile likelihood, since the two parameters jointly determine the mean–variance relation Var(Z) = φμ^ρ. We exploit the fact that, in Tweedie models, solving for the expected pure premium, i.e. the estimator function \hat{F} in (10), depends only on ρ and not on φ. Therefore, given a fixed ρ, the mean estimate \hat{F}_\rho can be obtained by the above-discussed TDboost algorithm. Conditional on this ρ and the corresponding \hat{F}_\rho, we can maximize the log-likelihood function (9) with respect to φ:

(19)   \hat{\phi}_\rho = \operatorname*{arg\,max}_{\phi} \ell\big(\hat{F}_\rho, \phi, \rho\big).
This univariate optimization problem can be solved using a combination of golden-section search and successive parabolic interpolation (Brent, 2013). In this way, we determine the corresponding \hat{\phi}_\rho for each given ρ. We then acquire the optimal \hat{\rho} by maximizing the profile likelihood over a sequence of candidate ρ values. Finally, the optimal \hat{\rho} is applied in (9) and (19) to obtain the corresponding \hat{F} and \hat{\phi}.

4 Our Proposal
In our ZIF model, we assume the zero-mass probability q to be a constant. Although it seems more reasonable and refined to model the zero-mass probability as a function of x, we make this simplifying assumption for two reasons. First, there is practical rationality behind the assumption: the overall zero-mass probability has an explainable meaning in real data. For example, car insurance claim data in a country with vast territory and a sparse population tend to have a larger exact-zero mass probability than those in a country with limited territory and a dense population. Claim data in a province with a well-developed transportation system and infrastructure tend to have a larger zero-mass probability than those in a province of poor development. Second, although a more refined model with a covariate-dependent zero-mass probability is meaningful and deserves further study, we do not take such a nontrivial step at once. What we consider in this paper can be viewed as a transition: a better understanding of the characteristics of the ZIF model through its simplified version will benefit our future work. We will see shortly that this model can easily be solved by inserting the TDboost algorithm of Section 3 into the widely used EM algorithm.
Given ρ and a portfolio of policies \{(y_i, x_i, w_i)\}_{i=1}^{n} from n independent insurance contracts, assume that each policy pure premium y_i follows the zero-inflated Tweedie model with an overall zero-mass probability q:

(20)   y_i \sim \mathrm{ZIF}\big(q, \mu_i, \phi / w_i, \rho\big),

and that μ_i is determined by a predictor function F of x_i:

(22)   \log(\mu_i) = F(x_i),

where μ_i is the mean of the Tweedie component. The expected pure premium is then given by

(23)   E(y_i) = (1 - q)\,\mu_i.
4.1 Estimating q, F and φ via Zero-Inflated TDboost
A straightforward approach to finding the maximum likelihood estimator (MLE) for the mixture model (6) is the EM algorithm (McLachlan and Krishnan, 2007). To develop the idea, we assume for the time being that the parameter ρ is given. We estimate the predictor function F, the overall zero-mass probability q and the dispersion φ by introducing latent class variables and utilizing an extended version of the EM algorithm, the blockwise coordinate descent method. When solving for the predictor function F, the TDboost algorithm discussed in Section 3 is inserted into the Maximization substep. The estimation of ρ will be studied in Section 4.3.
Given ρ and a portfolio of n policies \{(y_i, x_i, w_i)\}_{i=1}^{n}, our purpose is to maximize the log-likelihood of the ZIF model:

(24)   \ell_{\mathrm{ZIF}}(q, F, \phi) = \sum_{i=1}^{n} \log\big\{ q\,\mathbb{1}(y_i = 0) + (1 - q)\, f\big(y_i \mid \mu_i, \phi / w_i, \rho\big) \big\}.
To solve (24), we resort to the EM algorithm by introducing latent variables (δ_1, …, δ_n), where δ_i is the class variable: δ_i = 0 if y_i is sampled from the Tweedie component, and δ_i = 1 if y_i is sampled from the zero point mass. Then the log-likelihood of the joint distribution of (y_i, δ_i)_{i=1}^{n} is

(25)   \ell_c(q, F, \phi) = \sum_{i=1}^{n} \Big\{ \delta_i \log q + (1 - \delta_i) \big[ \log(1 - q) + \log f\big(y_i \mid \mu_i, \phi / w_i, \rho\big) \big] \Big\},

where μ_i = \exp\{F(x_i)\}. Taking the posterior expectation of (25) with respect to the δ_i, we have

(26)   Q(q, F, \phi) = \sum_{i=1}^{n} \Big\{ \pi_i \log q + (1 - \pi_i) \big[ \log(1 - q) + \log f\big(y_i \mid \mu_i, \phi / w_i, \rho\big) \big] \Big\},

where

(27)   \pi_i = E(\delta_i \mid y_i) = \frac{q\,\mathbb{1}(y_i = 0)}{q\,\mathbb{1}(y_i = 0) + (1 - q)\, f\big(y_i \mid \mu_i, \phi / w_i, \rho\big)},

and f(0 \mid \mu_i, \phi / w_i, \rho) = \exp(-\lambda_i) with \lambda_i = w_i \mu_i^{2-\rho} / \{\phi(2-\rho)\}. From the procedure of the EM algorithm, we know that maximizing the log-likelihood function (24) of the original mixture model can be accomplished by iteratively computing the posterior expectation (26) with respect to the latent variables and maximizing the expected log-likelihood of the completed model with respect to (q, F, φ). Since the estimators cannot be solved for jointly, we use an extended version of the EM algorithm (McLachlan and Krishnan, 2007): we apply the blockwise coordinate descent method in the Maximization step and solve for the estimators successively. Alternating this with the Expectation step, the estimators converge to a local optimum of the likelihood. The optimization objective is not convex, so a decent starting point deserves some extra effort and will be studied in Section 4.2. For now, we assume the initial parameters are chosen as (q^{(0)}, F^{(0)}, \phi^{(0)}).
To update from the current stage s to the next stage s + 1, let the current estimates be (q^{(s)}, F^{(s)}, \phi^{(s)}) with latent-variable estimates \pi_i^{(s)}. The Maximization step updates (q^{(s)}, F^{(s)}, \phi^{(s)}) to (q^{(s+1)}, F^{(s+1)}, \phi^{(s+1)}) with the estimates of the latent variables held fixed. We utilize the blockwise coordinate descent method and solve for F, φ and q successively, viewing the other two parameters as fixed.

Update F by the weighted TDboost fit:

(29)   F^{(s+1)} = \operatorname*{arg\,max}_{F} \sum_{i=1}^{n} \big(1 - \pi_i^{(s)}\big) \log f\big(y_i \mid \mu_i, \phi^{(s)} / w_i, \rho\big),

where μ_i = \exp\{F(x_i)\}.
Conditional on the given \pi_i^{(s)} and the updated F^{(s+1)}, maximizing the log-likelihood function with respect to φ is a univariate optimization problem that can be solved by a combination of golden-section search and successive parabolic interpolation (Brent, 2013), yielding \phi^{(s+1)}.

Update q by:

(30)   q^{(s+1)} = \frac{1}{n} \sum_{i=1}^{n} \pi_i^{(s)}.
Then we perform the Expectation step and find the current estimates \pi_i^{(s+1)} by using the updated (q^{(s+1)}, F^{(s+1)}, \phi^{(s+1)}) from the M-step. Note from equation (27) that \pi_i = 0 if y_i > 0, so we only need to update \pi_i for the zero-response observations, i.e., those with y_i = 0.

Update \pi_i by:

(31)   \pi_i^{(s+1)} = \frac{q^{(s+1)}}{q^{(s+1)} + \big(1 - q^{(s+1)}\big) \exp\big(-\lambda_i^{(s+1)}\big)}, \qquad y_i = 0,

(32)   \pi_i^{(s+1)} = 0, \qquad y_i > 0,

where \lambda_i^{(s+1)} = w_i \big(\mu_i^{(s+1)}\big)^{2-\rho} / \{\phi^{(s+1)} (2-\rho)\}.
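For a constant mean, the whole M-step has closed forms, which makes the E/M cycle easy to demonstrate. The sketch below is a deliberately simplified instance of the iteration (29)-(32): μ is a scalar rather than a TDboost fit, (ρ, φ) are held fixed and known, and w_i = 1, so the φ-update is skipped; the q-update and E-step mirror (30)-(32):

```python
import numpy as np

def em_zif_const(y, rho=1.5, phi=1.0, max_iter=500, tol=1e-10):
    """EM sketch for the ZIF mixture with a constant Tweedie mean mu and known
    (rho, phi). pi_i is the posterior probability that observation i came from
    the exact-zero component; it is zero for every positive response."""
    q, mu = 0.5, y[y > 0].mean()                # crude starting values
    for _ in range(max_iter):
        lam = mu ** (2 - rho) / (phi * (2 - rho))      # P(Tweedie = 0) = e^-lam
        pi = np.where(y == 0, q / (q + (1 - q) * np.exp(-lam)), 0.0)  # E-step
        q_new = pi.mean()                              # M-step for q, cf. (30)
        w = 1.0 - pi                                   # Tweedie-component weights
        mu_new = (w * y).sum() / w.sum()               # weighted MLE of scalar mu
        done = abs(q_new - q) + abs(mu_new - mu) < tol
        q, mu = q_new, mu_new
        if done:
            break
    return q, mu
```

The weighted-mean μ-update is the exact maximizer of the weighted Tweedie log-likelihood over constant means, which is what the weighted TDboost fit reduces to in this special case.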
After each stage of the M-step and E-step, we substitute the updated estimators into the log-likelihood function (24) and record the log-likelihood. We also compute the maximum difference between the updated parameters and the previous ones and compare it with a tolerance constant ε set beforehand. The iteration stops and returns the current estimators once the difference at the current stage is smaller than ε. If this stopping criterion is not met within a maximum number of iterations S (set beforehand), we choose the iteration step that maximizes the log-likelihood over the S iterations and return the corresponding parameters as the optimal estimators.
4.2 Initializing Parameters via Observed Responses
In this section, we give a data-motivated method to initialize the parameters to be estimated. A carefully selected starting point, close to a good local optimum or even the global optimum, makes for an efficient algorithm. The initial estimates are based on approximately viewing the latent variables as \delta_i = \mathbb{1}(y_i = 0), i.e., treating all zeros as coming from the exact-zero mass, which is reasonable for extremely unbalanced zero-inflated claim data. Maximizing equation (26) can then be approximately viewed as maximizing the following objective:

(33)   \sum_{i=1}^{n} \Big\{ \mathbb{1}(y_i = 0) \log q + \mathbb{1}(y_i > 0) \big[ \log(1 - q) + \log f\big(y_i \mid \mu_i, \phi / w_i, \rho\big) \big] \Big\},

where μ_i = \exp\{F(x_i)\}. To initialize F, we further assume that the initial function is chosen from the domain of constant functions, i.e., F^{(0)}(x) \equiv c. Then the initial estimates of q, c and φ can be solved for by successively maximizing the objective (33) with respect to one parameter while the other two are held fixed as constants.

Initialize q by the observed proportion of zero responses:

(34)   q^{(0)} = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}(y_i = 0).
Initialize F by the constant that maximizes (33), i.e., the log of the duration-weighted mean of the positive claims:

(35)   F^{(0)}(x) \equiv c^{(0)} = \log\left( \frac{\sum_{i:\, y_i > 0} w_i y_i}{\sum_{i:\, y_i > 0} w_i} \right),

where \mu^{(0)} = \exp\{c^{(0)}\}.

Initialize φ by:

(36)   \phi^{(0)} = \operatorname*{arg\,max}_{\phi} \sum_{i:\, y_i > 0} \log f\big(y_i \mid \mu^{(0)}, \phi / w_i, \rho\big).
Given (q^{(0)}, F^{(0)}, \phi^{(0)}), we can then initialize the \pi_i^{(0)} by substituting them into equation (27):

(37)   \pi_i^{(0)} = \frac{q^{(0)}}{q^{(0)} + \big(1 - q^{(0)}\big) \exp\big(-\lambda_i^{(0)}\big)}, \qquad y_i = 0,

(38)   \pi_i^{(0)} = 0, \qquad y_i > 0,

where \lambda_i^{(0)} = w_i \big(\mu^{(0)}\big)^{2-\rho} / \{\phi^{(0)} (2-\rho)\}.
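The closed-form part of this initialization, (34) and (35), amounts to two one-liners; a sketch follows (the φ-step (36) is a univariate search over the Tweedie likelihood and is omitted here):

```python
import numpy as np

def zif_init(y, w=None):
    """Initialize (q, F) as in (34)-(35): every zero is provisionally attributed
    to the exact-zero mass, and the constant predictor is the log of the
    duration-weighted mean of the positive claims."""
    w = np.ones_like(y, dtype=float) if w is None else w
    pos = y > 0
    q0 = 1.0 - pos.mean()                         # zero fraction, eq. (34)
    mu0 = (w[pos] * y[pos]).sum() / w[pos].sum()  # weighted positive mean, (35)
    return q0, np.log(mu0)                        # (q0, F0) with F0 = log(mu0)
```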
In summary, the complete ZIF algorithm is shown in Algorithm 1.

Get the training set \{(y_i, x_i, w_i)\}_{i=1}^{n} and the zero-response index set \{i : y_i = 0\}; set the parameters of the ZIF model, the maximum iteration step S and the tolerance constant ε. Set s = 0 and initialize the parameters as in Section 4.2. Set the initial parameter difference to +∞.

For s = 1, 2, …, S, repeatedly do steps 3(a)–3(e):

Compute the log-likelihood of the ZIF model by equation (24):

Compute the maximum difference between the former estimators and the updated ones:

If the maximum difference is smaller than ε: return the current estimators and end the algorithm.

Choose the optimal iteration step with respect to the maximum log-likelihood: return the corresponding estimators and end the algorithm.
4.3 Estimating ρ via Profile Likelihood
We follow the method discussed in the last paragraph of Section 3 and use the profile likelihood to estimate the index parameter ρ. Conditional on a given ρ, we can follow Algorithm 1 to maximize the log-likelihood function (24) of the ZIF model, which gives the corresponding estimators (q_\rho, F_\rho, \phi_\rho). Then we acquire the optimal estimate of ρ by maximizing the profile likelihood over a sequence of equally spaced candidate values on the interval (1, 2):

(39)   \hat{\rho} = \operatorname*{arg\,max}_{\rho} \ell_{\mathrm{ZIF}}\big(q_\rho, F_\rho, \phi_\rho\big).

Finally, we apply the optimal \hat{\rho} in (24) again to obtain the corresponding estimators \hat{q}, \hat{F} and \hat{\phi}.
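To make the profile search concrete, the sketch below implements it for the plain Tweedie special case (no zero inflation, constant mean, unit durations): the series (5) gives the log-likelihood, φ is maximized out for each candidate ρ by a bounded Brent-style search, and the best ρ on the grid is returned. The function names, the grid, and the φ-bounds are illustrative choices of ours:

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import gammaln, logsumexp

def tweedie_loglik(z, mu, phi, rho, jmax=500):
    """Tweedie log-likelihood for scalar mean mu and 1 < rho < 2, evaluating the
    series a(z, phi, rho) of (5) in log space for numerical stability."""
    alpha = (2 - rho) / (rho - 1)
    lam = mu ** (2 - rho) / (phi * (2 - rho))
    ll = -lam * (z == 0).sum()                  # zero observations: log f(0) = -lam
    zp = z[z > 0]
    j = np.arange(1, jmax + 1)
    logW = (np.outer(np.log(zp), j * alpha)
            - j * alpha * np.log(rho - 1)
            - j * np.log(2 - rho)
            - j * (1 + alpha) * np.log(phi)
            - gammaln(j + 1) - gammaln(j * alpha))
    log_a = -np.log(zp) + logsumexp(logW, axis=1)
    expo = (zp * mu ** (1 - rho) / (1 - rho) - mu ** (2 - rho) / (2 - rho)) / phi
    return ll + (log_a + expo).sum()

def profile_rho(z, rho_grid):
    """Profile likelihood over rho: phi is maximized out for each candidate rho
    by Brent-style bounded search (golden section + parabolic interpolation)."""
    mu = z.mean()                               # crude constant mean estimate
    best = (None, None, -np.inf)
    for rho in rho_grid:
        res = minimize_scalar(lambda p: -tweedie_loglik(z, mu, p, rho),
                              bounds=(0.05, 5.0), method="bounded")
        if -res.fun > best[2]:
            best = (rho, res.x, -res.fun)
    return best                                 # (rho_hat, phi_hat, loglik)
```

In the full ZIF setting, the inner call would be Algorithm 1 rather than a single φ-search, but the outer grid-and-maximize structure of (39) is the same.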
4.4 Modified EM Algorithm: Penalizing on q
When applying the EM algorithm to maximize the log-likelihood function of the ZIF-TD model, we want to avoid the case in which the zero-mass probability shrinks to zero, i.e., the ZIF-TD model degrades to the TD model, since it wastes computation and time to end up with the TD model after running the EM algorithm under the ZIF-TD model assumption. In fact, in the real-data application we will introduce a score test with the TD model as the null hypothesis against the ZIF-TD model as the alternative, to answer the question of whether we should inflate the TD model to the ZIF-TD model. Once the test indicates that the ZIF-TD model should be used, we do not expect the estimators to degrade to the TD model. Under such considerations, we add a penalty term on q to the log-likelihood function (24):

(40)   \ell_p(q, F, \phi) = \ell_{\mathrm{ZIF}}(q, F, \phi) + \nu \log q,

where ν log q is the regularization term and ν is a nonnegative regularization parameter. When maximizing the penalized log-likelihood function (40), smaller q is penalized more heavily because of the logarithmic property, and the regularization term forces the zero-mass probability away from zero as ν increases. Similar to the procedure in Section 4.1, we introduce the latent variables and take the posterior expectation of the penalized log-likelihood function with respect to them:
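Assuming, for illustration, that the penalty takes the logarithmic form ν log q suggested by the discussion above (the exact term is our assumption), the q-update of the M-step acquires a simple closed form: maximizing Σπ_i log q + (n − Σπ_i) log(1 − q) + ν log q gives

```python
import numpy as np

def q_update_penalized(pi, nu):
    """Penalized M-step update for q under a hypothetical nu*log(q) penalty:
    the stationarity condition (sum(pi) + nu)/q = (n - sum(pi))/(1 - q)
    yields q = (sum(pi) + nu) / (n + nu), which shrinks q away from zero."""
    return (pi.sum() + nu) / (len(pi) + nu)
```

With ν = 0 this reduces to the unpenalized update (30); as ν grows, q is pushed toward one, matching the described effect of the regularization term.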