1 Prelude
Building regression models for insurance claims presents several challenges. The process can be particularly difficult for individual insurance policies where a large proportion of the claims are zeros and, for those policies with a claim, the losses typically exhibit skewness. The degree of skewness in the positive loss distribution varies widely among different lines of insurance business. For example, within a class of automobile insurance policies, the losses arising from damage to property may be medium-tailed, but those arising from liability-related claims may be long-tailed. For short-term insurance contracts, the number of claims is referred to as the claim frequency, while the amount of claims, conditional on the occurrence of a claim, is the claim severity. The addition of covariates in a regression context within each component allows for capturing the heterogeneity of the individual policies. These two components are used as predictive models for pure premiums and for aggregate or total claims in a portfolio.
There is a long history of studying insurance claims within the two-part framework. Several well-known families of distributions for claim frequencies and claim severities are well documented in klugman2012. As more observed data have become available, additional complexities have been introduced to traditional linear regression models both in the literature and in practice; for example, zero-truncated and hurdle models have been discussed to accommodate some of the shortcomings of traditional models. Interestingly, Tweedie regression models for insurance claims have been particularly popular because they are adaptable to a mixture of zeros and nonnegative insurance claims. smyth2002fitting described the Tweedie compound Poisson regression models and calibrated such models using a Swedish third-party insurance claims dataset; the models were fitted to examine dispersion within the GLM framework. The Tweedie distribution can be treated as a reparameterization of the compound Poisson-gamma distribution, so that model calibration can be done within the GLM framework.
xacur2015generalised compared these two approaches and concluded that there is no clearly superior method; the paper also described the advantages and disadvantages of each. The Tweedie GLM produces a larger pure premium and implies both a larger claim frequency and claim severity due to the constant scale (dispersion) parameter; in other words, the mean increases with the variance. The constant scale parameter also forces an artificial relationship between the claim frequency and the claim severity. The Tweedie GLM does not have optimal coefficients, and it loses information because it ignores the number of claims. On the other hand, the Tweedie GLM has fewer parameters to estimate and is thus more parsimonious than the two-part framework. When insurance claims data present small losses due to low frequencies, the two-part framework most likely overlooks the internal connection between the low frequency and the subsequent small loss amount. For example, the frequency model often predicts zero claims for small-claim policies, which leads to a zero loss prediction. Additional work related to Tweedie regression can be found in frees2014predictive and jorgensen1987exponential.

Within the GLM framework, the claim frequency component is unable to accurately accommodate imbalances caused by the large proportion of zero claims. At the same time, a significant limitation is that the regression structure of the logarithmic mean is restricted to a linear form, which may be too inflexible for real applications. With the rapid expansion of data available for improved decision making, there is a growing appetite in the insurance industry for expanding its toolkit for data analysis and modeling. However, the industry is unarguably highly regulated, so there is pressure on actuaries and data scientists to provide adequate transparency in modeling techniques and results. To find a balance between modeling flexibility and interpretability, in this paper we propose a nonparametric model using tree-based models with a hybrid structure. Since breiman1984
introduced the Classification and Regression Tree (CART) algorithm, tree-based models have gained momentum as a powerful machine learning approach for decision making in several disciplines. The CART algorithm separates the explanatory-variable space into several mutually exclusive regions, thereby creating a nested hierarchy of branches resembling a tree structure. Each separation or branch is referred to as a node. Each of the bottom nodes of the decision tree, called terminal nodes, has a unique path by which observable data enter the region. Once the decision tree is constructed, these paths can be used to locate the region, or terminal node, to which a new set of explanatory variables belongs.
Prior to further exploring the hybrid structure, it is worth mentioning that a similar structure, called Model trees and built with the M5 algorithm, was first described in quinlan1992learning. The M5 algorithm constructs tree-based piecewise linear models. Regression trees assign a constant value to each terminal node as the fitted value; Model trees instead use a linear regression model at each terminal node to predict the fitted value for observations reaching that node. Regression trees are thus a special case of Model trees. Both regression trees and Model trees employ recursive binary splitting to build a fully grown tree. Thereafter, both algorithms use a cost-complexity pruning procedure to trim the fully grown tree back from each terminal node. The primary difference between the regression tree and Model tree algorithms is that, in the latter step, each terminal node is replaced by a regression model instead of a constant value. The explanatory variables that serve to build those regression models are generally those that participate in the splits within the subtree that will be pruned. In other words, the explanatory variables for each node are those located beneath the current terminal node in the fully grown tree.
It has been demonstrated that Model trees have advantages over regression trees in both model simplicity and prediction accuracy. Model trees produce decision trees that are not only relatively simple to understand but also efficient and robust. Additionally, they are able to exploit local linearity within the dataset. When comparing prediction performance, it is worth noting the difference in the range of predictions produced by traditional regression trees and Model trees: regression trees can only give predictions within the range of values observed in the training dataset, whereas Model trees can extrapolate beyond that range because of the regression models at the terminal nodes. For further discussion of Model trees and the M5 algorithm, see quinlan1992.
Inspired by the structure and advantages of Model trees over traditional regression trees, we develop a similar algorithm that is uniquely suited to insurance claims because of their two-part nature. In this paper, we present hybrid tree-based models as a two-step algorithm: the first step builds a classification tree to identify membership of claim occurrence, and the second step uses a penalized regression technique to determine the size of the claim at each terminal node, taking into account the available explanatory variables. In essence, the hybrid tree-based models for insurance claims described in this paper integrate the peculiarities of both classification trees and linear regression in the modeling process. These models are suitably described as hybrid structures.
We have organized the rest of this paper as follows. In Section 2, we provide technical details of the algorithm arising from hybrid tree-based models applicable to insurance claims, separating the modeling of claim frequency and claim severity. In Section 3, we create a synthetic dataset based on a simulation produced from a true Tweedie regression model to investigate the predictive accuracy of our hybrid tree-based models when compared to the Tweedie GLM; we introduce some noise into the simulated data in order to make a reasonable comparison. In Section 4, using empirical data drawn from a portfolio of general insurance policies, we present the estimation results and compare prediction accuracy based on a hold-out sample for validation. Section 5 provides visualization of the results from hybrid tree-based models to allow for better interpretation. We conclude in Section 6.
2 Description of hybrid tree-based models
Hybrid tree-based models utilize the information arising from both the frequency and the severity components. As already alluded to in Section 1, the procedure has two stages. In the first stage, we construct a classification tree-based model for frequency. In the subsequent stage, we employ a penalized linear regression model at each terminal node of the tree, based on the severity information. Models drawn from this hybrid structure, which forms an ecosystem, can accommodate a wide variety of datasets; the hybrid structure captures the advantageous features of both tree-based models and penalized regression models.
We can determine the type of classification tree to adopt for the frequency component according to the information available in the insurance dataset. If the dataset records only whether claims were reported or not, we can construct a classification tree for a binary response variable. If the dataset additionally records the number of claims, we can construct trees for a count response variable. For binary classification, we can employ some of the most popular CART-style algorithms, to list a few: (a) the efficient algorithm C4.5 (quinlan1992), with its current updated version C5.0, (b) unbiased approaches like Generalized, Unbiased, Interaction Detection and Estimation (GUIDE) (loh2009improving), or (c) Conditional Inference Trees (CTREE) (hothorn2006unbiased). For a count response variable, we can apply, to list a few: (a) piecewise linear Poisson using the CART-like algorithm SUPPORT (chaudhuri1995generalized), (b) Poisson regression using GUIDE (loh2006regression), or (c) MOdel-Based recursive partitioning (MOB) (zeileis2008model). After the completion of the classification tree structure, we then apply a linear regression model to each terminal node. The simplest choices include GLMs (nelder1972glm) with different families and regularized regression such as elastic net regularization (zou2005reg), among others.
Hybrid tree-based models build an ecosystem that utilizes modern techniques from both traditional statistics and machine learning. In the subsequent subsections, for simplicity, we present only a simple hybrid tree-based model without exploring all possible combinations of algorithms suitable at each stage of the procedure.
2.1 Claim frequency
To illustrate the use of hybrid tree-based models, we select the well-known CART algorithm for binary classification and least squares regression with elastic net regularization. Elastic net regularization can perform variable selection and improve prediction accuracy compared to traditional linear regression without a penalty. In order to be self-contained, we describe this simple hybrid tree-based model in sufficient detail. See also james2013.
We denote the response variable by $Y$, its sample space by $\mathcal{Y}$, and the number of observations by $n$. The $i$th sample, with $p$-dimensional explanatory variables, is denoted by $x_i$, which is sampled from the space $\mathcal{X}$. For example, we can separate each claim into a frequency component, the claim occurrence or the number of claims, and a severity component, the claim severity.
In the CART algorithm, a binary classification tree, denoted by $T$, is produced by partitioning the space of the explanatory variables into $M$ disjoint regions $R_1, \dots, R_M$ and then assigning a boolean $b_m \in \{0, 1\}$ to each region $R_m$, for $m = 1, \dots, M$. Given a classification tree, each observation can then be classified based on the expression

$$\hat{f}(x) = \sum_{m=1}^{M} b_m \, \mathbb{1}\{x \in R_m\}, \qquad (1)$$

where $(R_m, b_m)$ denotes the partition with the assigned boolean. To be more specific, the boolean $b_m = 1$ when the majority of observations in region $R_m$ have a claim; otherwise it is zero.
The traditional classification loss functions used in the classification tree, in Equation (1), are described as follows:

Misclassification error: $L = \min(p, 1 - p)$

Gini index: $L = 2p(1 - p)$

Cross-entropy or deviance: $L = -p \log p - (1 - p)\log(1 - p)$

where $p$ is the proportion of one class in the node. For multi-class loss functions, see hastie2009.
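As a quick illustration, all three node-level losses depend only on the within-node claim proportion $p$; a minimal sketch:

```python
import numpy as np

def node_losses(p):
    """Node impurity measures for a binary node, where p is the
    proportion of one class (e.g. policies with a claim) in the node."""
    misclass = min(p, 1 - p)                  # misclassification error
    gini = 2 * p * (1 - p)                    # Gini index
    # cross-entropy / deviance (0 by convention for a pure node)
    if p in (0.0, 1.0):
        entropy = 0.0
    else:
        entropy = -p * np.log(p) - (1 - p) * np.log(1 - p)
    return misclass, gini, entropy
```

All three losses are maximized at $p = 0.5$ (the most impure node) and vanish for a pure node, which is why any of them can serve as a splitting criterion.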
The default CART algorithm uses the Gini index as the loss function. The regions in the classification tree are determined by recursive binary splitting. The first step of the splitting process is the discovery of the one explanatory variable that best divides the data into two subregions; for example, these regions are the left node and the right node in the case of a continuous explanatory variable. This division is determined as the solution to

$$\min_{j,\,c}\ \big[\, w_L \, L(p_L) + w_R \, L(p_R) \,\big], \qquad (2)$$

where $w_L$ and $w_R$ are the weights for the subregions, determined by the number of observations split into each subregion divided by the total number of observations before the split, and $p_L$ and $p_R$ are the proportions of one class in each subregion. Subsequently, the algorithm looks for the next explanatory variable with the best division into two subregions, and this process is applied recursively until meeting some predefined threshold or reaching a minimum number of observations in the terminal node.
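The split search for a single continuous explanatory variable can be sketched as a simple exhaustive scan over candidate cut points (a deliberately simplified illustration; real implementations such as rpart also handle weights, surrogate splits, and categorical variables):

```python
import numpy as np

def gini(p):
    """Gini index for a binary node with class proportion p."""
    return 2 * p * (1 - p)

def best_split(x, y):
    """Exhaustive search for the cut point on one continuous explanatory
    variable x that minimizes the weighted Gini index of the two child
    nodes, where y holds 0/1 claim-occurrence indicators."""
    n = len(y)
    best = (np.inf, None)
    for c in np.unique(x)[:-1]:               # candidate cut points
        left, right = y[x <= c], y[x > c]
        w_l, w_r = len(left) / n, len(right) / n
        loss = w_l * gini(left.mean()) + w_r * gini(right.mean())
        if loss < best[0]:
            best = (loss, c)
    return best                               # (weighted loss, cut point)
```

On perfectly separable data the search attains a weighted loss of zero at the separating cut point.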
To control for model complexity, we can use cost-complexity pruning to trim the fully grown tree $T_0$. We denote the loss in region $R_m$ by $L_m$. For any subtree $T \subseteq T_0$, we denote the number of terminal nodes in this subtree by $|T|$. To control the number of terminal nodes, we introduce the tuning hyperparameter $\alpha \ge 0$ into the loss function by defining the new cost function as

$$C_\alpha(T) = \sum_{m=1}^{|T|} L_m + \alpha \, |T|. \qquad (3)$$

Clearly, according to this cost function, the tuning hyperparameter $\alpha$ penalizes large numbers of terminal nodes. The idea then is to find the subtree $T_\alpha \subseteq T_0$ for each $\alpha$ and choose the subtree that minimizes $C_\alpha(T)$ in Equation (3). Furthermore, the tuning hyperparameter $\alpha$ governs the tradeoff between the size of the tree (model complexity) and its goodness of fit to the data. Large values of $\alpha$ result in smaller trees (simpler models) and, as the notation suggests, $\alpha = 0$ leads to the fully grown tree $T_0$. Additional tuning is done to control the tree depth through maxdepth, the maximum depth of any node of the final tree.
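The tradeoff in Equation (3) can be checked with toy numbers: with illustrative terminal-node losses, a larger penalty favors the pruned subtree and a smaller one favors the fuller tree (the loss values below are made up for illustration):

```python
def cost_complexity(terminal_losses, alpha):
    """C_alpha(T) = sum of terminal-node losses + alpha * |T|,
    as in Equation (3); terminal_losses holds one loss per terminal node."""
    return sum(terminal_losses) + alpha * len(terminal_losses)

# A fuller tree fits better (smaller node losses) but pays a larger
# complexity penalty than a pruned two-leaf subtree:
full_losses = [0.02, 0.03, 0.01, 0.04]    # 4 terminal nodes
pruned_losses = [0.08, 0.07]              # 2 terminal nodes
```

At a large $\alpha$ the pruned subtree has the smaller cost, while at a small $\alpha$ the fully grown tree wins, which is exactly the behavior the pruning procedure exploits.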
2.2 Claim severity
After building the classification tree structure, we next apply a linear regression at the terminal nodes to model severity. To control model complexity, we set a threshold that determines whether we build a linear regression or directly assign zero to a terminal node. For example, if the threshold is 80%, then we directly assign zero to terminal nodes that contain more than 80% zero claims. Furthermore, if a terminal node contains fewer than a certain number of observations, say 40, we can directly use the mean as the prediction, as in regression trees. Otherwise, we build a linear regression at the terminal node. While any member of the GLM exponential family suitable for continuous claims can be used, we find that the special case of the Gaussian GLM, or ordinary linear regression, is sufficiently suitable.
At each terminal node $R_m$, the linear coefficient $\beta_m$ can be determined by

$$\hat{\beta}_m = \arg\min_{\beta} \sum_{i \in R_m} \ell_i(\beta), \qquad (4)$$

where $\ell_i(\beta)$ is the negative log-likelihood for sample $i$. For the Gaussian family, denoting the design matrix as $X$, the coefficient is well known as $\hat{\beta} = (X^\top X)^{-1} X^\top y$.
Ridge regression (hoerl1970ridge) achieves better prediction accuracy than ordinary least squares because of the bias-variance tradeoff: the reduction in the variance of the coefficient estimates is larger than the increase in their squared bias. It performs coefficient shrinkage and forces correlated explanatory variables toward similar coefficients. In ridge regression, at each terminal node $R_m$, the linear coefficient can be determined by

$$\hat{\beta}_m^{\text{ridge}} = \arg\min_{\beta} \left\{ \sum_{i \in R_m} \ell_i(\beta) + \lambda \|\beta\|_2^2 \right\}, \qquad (5)$$

where $\lambda \ge 0$ is a tuning hyperparameter that controls shrinkage and thus the number of selected explanatory variables. For the following discussion, we assume the values of the explanatory variables are standardized so that $\sum_i x_{ij} = 0$ and $\frac{1}{n}\sum_i x_{ij}^2 = 1$. If the explanatory variables do not have the same scale, the shrinkage may not be fair. In the case of ridge regression within the Gaussian family, the coefficient in Equation (5) can be shown to have the explicit form

$$\hat{\beta}^{\text{ridge}} = (X^\top X + \lambda I)^{-1} X^\top y,$$

where $I$ is the identity matrix of appropriate dimension.
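The ridge closed form can be sanity-checked numerically; a small sketch with simulated data (the dimensions, true coefficients, and $\lambda$ value are illustrative, and the intercept is omitted for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)

def ridge(X, y, lam):
    """Closed-form ridge coefficients (X'X + lam*I)^{-1} X'y;
    lam = 0 recovers ordinary least squares."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

beta_ols = ridge(X, y, 0.0)
beta_ridge = ridge(X, y, 10.0)   # coefficients shrink toward zero
```

Increasing $\lambda$ monotonically shrinks the norm of the coefficient vector, which is the variance-reduction mechanism described above.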
Ridge regression does not automatically select the important explanatory variables. However, LASSO regression (tibshirani1996regression) has the effect of sparsity, forcing the coefficients of the least important explanatory variables to zero and therefore making the regression model more parsimonious. In addition, LASSO performs coefficient shrinkage and selects only one explanatory variable from a group of correlated explanatory variables. In LASSO, at each terminal node $R_m$, the linear coefficient can be determined by

$$\hat{\beta}_m^{\text{lasso}} = \arg\min_{\beta} \left\{ \sum_{i \in R_m} \ell_i(\beta) + \lambda \|\beta\|_1 \right\}, \qquad (6)$$

where $\lambda \ge 0$ is a tuning hyperparameter that controls shrinkage. For regularized least squares with the LASSO penalty, Equation (6) leads to the following quadratic programming problem:

$$\min_{\beta}\ \frac{1}{2}\|y - X\beta\|_2^2 \quad \text{subject to} \quad \|\beta\|_1 \le t.$$
Originally, tibshirani1996regression used a combined quadratic programming method to numerically solve for the LASSO coefficients. In a later development, fu1998penalized proposed the shooting method, and friedman2007pathwise recast shooting as the coordinate descent algorithm, which is now a popular optimization algorithm.
To illustrate the coordinate descent algorithm, we consider updating a single coordinate $\beta_j$. Given $\beta_k$ for $k \ne j$, the optimal $\beta_j$ can be found by

$$\min_{\beta_j}\ \frac{1}{2n}\sum_{i} \Big( y_i - \sum_{k \ne j} x_{ik}\beta_k - x_{ij}\beta_j \Big)^2 + \lambda |\beta_j| + \lambda \sum_{k \ne j} |\beta_k|,$$

where the last term is constant in $\beta_j$ and can be dropped. We denote the partial residual $r_i^{(j)} = y_i - \sum_{k \ne j} x_{ik}\beta_k$. It follows that

$$\min_{\beta_j}\ \frac{1}{2n}\sum_{i} \big( r_i^{(j)} - x_{ij}\beta_j \big)^2 + \lambda |\beta_j|.$$

This can be solved by the soft-thresholding lemma discussed in donoho1995adapting.
Lemma 2.1. (Soft-thresholding Lemma) The optimization problem

$$\min_{\beta}\ \frac{1}{2}(\beta - z)^2 + \gamma |\beta|$$

has the solution

$$\hat{\beta} = S(z, \gamma) = \operatorname{sign}(z)\,(|z| - \gamma)_+.$$

Therefore, with standardized explanatory variables, $\hat{\beta}_j$ can be expressed as

$$\hat{\beta}_j = S\Big( \frac{1}{n}\sum_{i} x_{ij}\, r_i^{(j)},\ \lambda \Big).$$

For $j = 1, \dots, p$, update $\hat{\beta}_j$ by soft-thresholding, with the other coefficients held at their previously estimated values, and repeat the loop until convergence.
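The coordinate descent loop with soft-thresholding can be sketched in a few lines (an illustrative implementation, not a reference one; the divisor generalizes the update to columns that are not exactly standardized):

```python
import numpy as np

def soft_threshold(z, t):
    """S(z, t) = sign(z) * max(|z| - t, 0), the operator of Lemma 2.1."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=500):
    """Coordinate descent for (1/2n)||y - X b||^2 + lam * ||b||_1.
    Each pass soft-thresholds one coordinate at a time while the other
    coefficients are held at their previous values."""
    n, p = X.shape
    b = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ b + X[:, j] * b[j]        # partial residual
            z = X[:, j] @ r / n
            b[j] = soft_threshold(z, lam) / (X[:, j] @ X[:, j] / n)
    return b
```

For $\lambda$ at or above $\max_j |X_j^\top y|/n$, every coordinate is thresholded to zero on the first pass and the all-zero solution is returned, which matches the sparsity behavior described above.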
zou2005reg pointed out that LASSO has a few drawbacks. For example, when the number of explanatory variables $p$ is larger than the number of observations $n$, LASSO selects at most $n$ explanatory variables. Additionally, if there is a group of pairwise correlated explanatory variables, LASSO selects only one explanatory variable from the group and ignores the rest. zou2005reg also empirically showed that LASSO can have inferior prediction performance compared to ridge regression. Elastic net regularization was therefore proposed, using a convex combination of the ridge and LASSO penalties; it is better able to handle correlated explanatory variables while performing both variable selection and coefficient shrinkage. In elastic net regularization, at each terminal node $R_m$, the linear coefficient can be determined by

$$\hat{\beta}_m^{\text{enet}} = \arg\min_{\beta} \left\{ \sum_{i \in R_m} \ell_i(\beta) + \lambda \Big( \alpha \|\beta\|_1 + \frac{1-\alpha}{2} \|\beta\|_2^2 \Big) \right\}, \qquad (7)$$

where $\alpha \in [0, 1]$ controls the elastic net penalty and bridges the gap between LASSO (when $\alpha = 1$) and ridge regression (when $\alpha = 0$). If explanatory variables are correlated in groups, an $\alpha$ around $0.5$ tends to select the groups in or out altogether. For penalized least squares with the elastic net penalty, the objective becomes

$$\min_{\beta}\ \frac{1}{2n}\|y - X\beta\|_2^2 + \lambda \Big( \alpha \|\beta\|_1 + \frac{1-\alpha}{2} \|\beta\|_2^2 \Big).$$
This problem can also be solved using the coordinate descent algorithm. We omit the detailed solution, which follows the same process used to solve LASSO by coordinate descent. However, it is worth mentioning that, with standardized explanatory variables, the coordinate update for $\hat{\beta}_j$ takes the form

$$\hat{\beta}_j = \frac{S\big( \frac{1}{n}\sum_{i} x_{ij}\, r_i^{(j)},\ \lambda\alpha \big)}{1 + \lambda(1-\alpha)}.$$
The mathematical foundation for the elastic net regularization can be found in de2009elastic. It should also be noted that all the regularization schemes mentioned previously have simple Bayesian interpretations. See hastie2009 and james2013. A review paper for statistical learning applications in general insurance can be found in parodi2012computational.
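The elastic net coordinate update combines both penalties in one step: soft-threshold by the L1 part, then shrink by the L2 part. A one-function sketch (glmnet-style parameterization with the $(1-\alpha)/2$ factor on the L2 term, assuming standardized explanatory variables):

```python
import numpy as np

def enet_update(z, lam, alpha):
    """Elastic net coordinate update for a single coefficient, where z
    is the (scaled) inner product of the column with the partial
    residual: soft-threshold by lam*alpha, shrink by 1 + lam*(1-alpha)."""
    s = np.sign(z) * max(abs(z) - lam * alpha, 0.0)
    return s / (1.0 + lam * (1.0 - alpha))
```

Setting $\alpha = 1$ recovers the pure LASSO soft-thresholding step, while $\alpha = 0$ yields pure ridge-style proportional shrinkage.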
Finally, we conclude that the hybrid tree-based models can be expressed in the following functional form:

$$\hat{f}(x) = \sum_{m=1}^{M} \big( x^\top \hat{\beta}_m \big)\, \mathbb{1}\{x \in R_m\}, \qquad (8)$$

where $\hat{\beta}_m$ is the fitted coefficient vector at terminal node $R_m$ (set to zero for nodes assigned a zero prediction). From Equation (8), we see that hybrid tree-based models can also be viewed as piecewise linear regression models: the tree-based algorithm divides the dataset into subsamples (or risk classes), and linear regression models are then applied to these subsamples.
Algorithm 1 summarizes the details of implementing hybrid tree-based models based on the rpart and glmnetUtils packages in R.
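As a language-neutral illustration of the two-stage logic (the paper's Algorithm 1 is implemented in R with rpart and glmnetUtils), here is a hypothetical, deliberately tiny Python version with a single Gini-based split playing the role of the classification tree and ordinary least squares in place of the elastic net at the terminal nodes:

```python
import numpy as np

def fit_hybrid_stump(x, y, zero_threshold=0.8):
    """Stage 1: pick the split on one explanatory variable x that best
    separates claim occurrence (y > 0) by weighted Gini. Stage 2: at
    each of the two terminal nodes, either predict zero (if zeros
    dominate) or fit an ordinary least squares line on y."""
    occ = (y > 0).astype(float)
    n = len(y)
    gini = lambda p: 2.0 * p * (1.0 - p)
    best = (np.inf, None)
    for c in np.unique(x)[:-1]:                  # candidate cut points
        l, r = occ[x <= c], occ[x > c]
        loss = len(l) / n * gini(l.mean()) + len(r) / n * gini(r.mean())
        if loss < best[0]:
            best = (loss, c)
    cut = best[1]
    node_model = {}
    for side, idx in (("L", x <= cut), ("R", x > cut)):
        if (y[idx] == 0).mean() > zero_threshold:
            node_model[side] = None              # zero-claim node
        else:
            A = np.column_stack([np.ones(idx.sum()), x[idx]])
            node_model[side], *_ = np.linalg.lstsq(A, y[idx], rcond=None)
    return cut, node_model

def predict_hybrid(cut, node_model, x_new):
    """Route each new observation to its node and apply that node's model."""
    out = np.zeros(len(x_new))
    for i, xi in enumerate(x_new):
        beta = node_model["L" if xi <= cut else "R"]
        if beta is not None:
            out[i] = beta[0] + beta[1] * xi
    return out
```

On data where small-coverage policies never claim and large-coverage claims grow linearly, the stump routes the former to an exact zero prediction and fits the line only on the latter, mirroring the zero-threshold rule of Section 2.2.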
3 Simulation study
In this section, we conduct a simulation study to assess the performance of the simple hybrid tree-based model proposed in this paper relative to that of the Tweedie GLM. The synthetic dataset has been created to replicate a real dataset: it contains continuous and categorical explanatory variables with a continuous response variable sampled from a Tweedie-like distribution. In some sense, the observed values of the response variable behave like a Tweedie, but some noise was introduced to the data for a fair comparison.
We begin with the continuous explanatory variables by generating sample observations from the multivariate normal distribution with covariance structure

$$\Sigma_{jk} = \rho^{|j-k|},$$

where $0 < \rho < 1$. Continuous explanatory variables close to each other have higher correlations, and the correlation effect diminishes with increasing distance between two explanatory variables. In real datasets, we can easily observe correlated explanatory variables among hundreds of possible features, and it is very challenging to detect correlation between explanatory variables when an insurance company merges in third-party datasets. Multicollinearity reduces the precision of the estimated coefficients of correlated explanatory variables, and it can be problematic to use a GLM on real data without handling multicollinearity.
Categorical explanatory variables are generated by random sampling from a set of integers with equal probabilities; they do not have any correlation structure among them.
We create observations (with a large percentage of zero claims) using several groups of explanatory variables: continuous explanatory variables with relatively larger coefficients, continuous explanatory variables with relatively smaller coefficients, continuous explanatory variables with coefficients of zero, meaning no relevance, and, analogously, categorical explanatory variables with relatively larger, relatively smaller, and zero coefficients. In effect, there are three groups of explanatory variables according to the size of the linear coefficients: strong-signal, weak-signal, and noise explanatory variables. The first elements of the true coefficient vectors refer to the intercepts. Because of the equivalence of the compound Poisson-gamma and Tweedie distributions, as discussed in Section 1, we generated samples drawn from a Poisson for frequency and a gamma for severity.
To illustrate possible drawbacks of the Tweedie GLM, we use different linear coefficients for the frequency part and the severity part: the absolute values of the linear coefficients are the same but the signs differ. In real life, there is no guarantee that the explanatory variables have equal coefficients in the frequency and severity parts. The response variable is generated within the compound Poisson-gamma framework, using a modified rtweedie function from the tweedie R package, with the following frequency and severity components:

Frequency part: Poisson distribution with a log link function:
$$N_i \sim \text{Poisson}\big( \exp(x_i^\top \beta^{(f)}) \big).$$

Severity part: gamma distribution with a log link function:
$$Z_{ik}^{*} \sim \text{Gamma}\big( \phi,\ \exp(x_i^\top \beta^{(s)})/\phi \big), \quad k = 1, \dots, N_i,$$
where $\phi$ is the gamma shape parameter.

Response variable:
$$Y_i = \sum_{k=1}^{N_i} Z_{ik}^{*},$$
where the superscript $*$ has been used to distinguish the response and the gamma variables.
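A minimal sketch of the compound Poisson-gamma generation, with fixed illustrative means in place of the log-linear predictors and an assumed gamma shape of 2 (not the paper's actual settings):

```python
import numpy as np

rng = np.random.default_rng(7)

def simulate_claim(mu_freq, mu_sev, shape=2.0):
    """One compound Poisson-gamma response: draw N ~ Poisson(mu_freq)
    claims, then sum N gamma severities with mean mu_sev. In the full
    simulation, mu_freq and mu_sev would come from log-linear predictors
    with the frequency and severity coefficient vectors."""
    n_claims = rng.poisson(mu_freq)
    if n_claims == 0:
        return 0.0                    # policies with no claim: exact zero
    return rng.gamma(shape, mu_sev / shape, size=n_claims).sum()

claims = np.array([simulate_claim(0.5, 100.0) for _ in range(20000)])
```

With a Poisson mean of 0.5, a proportion $e^{-0.5} \approx 0.61$ of the simulated policies have an exact-zero claim, reproducing the point mass at zero that makes such data awkward for a single continuous model.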
We introduce noise into the true Tweedie model in the following ways: multicollinearity among the continuous variables, zero coefficients for non-relevant explanatory variables, different coefficients for the frequency and severity parts, and white noise added to positive claims. This noise is what makes real-life data deviate from the true model in practice, and it yields an impartial dataset that allows for a fair comparison.
Once we have produced the simulated (or synthetic) dataset, we apply both the Tweedie GLM and hybrid tree-based models to predict values of the response variable, given the available set of explanatory variables. The prediction performance of the two models is then compared using several validation measures, as summarized in Table 1. Table 5 in Appendix A describes the validation measures in detail.
                          Validation measures
Tweedie GLM               0.95   0.16   0.47   0.89   0.07
Hybrid Tree-based Model   0.97   0.29   0.36   0.81   0.06
From Table 1, we find that the Tweedie GLM performs worse even under the true model assumption, albeit with the presence of noise. Hybrid tree-based models with elastic net regression are able to automatically perform coefficient shrinkage via the L2 norm and variable selection via the tree-structure algorithm and the L1 norm. Hybrid tree-based models provide more promising prediction results than the now-popular Tweedie GLM; even under the linear assumption of our simulated dataset, hybrid tree-based models still perform better.
4 Empirical application
We compare the Tweedie GLM with hybrid tree-based models using one coverage group from the LGPIF data. This coverage group, called building and contents (BC), provides insurance for the buildings and properties of local government entities, including counties, cities, towns, villages, school districts, and other miscellaneous entities. We draw observations from years 2006 to 2010 as the training dataset and year 2011 as the test dataset. Table 2 and Table 3 summarize the training and test datasets, respectively.
Table 2. Summary statistics of the training dataset (2006-2010).

Response variable      Description                                    Min.  1st Q  Mean  Median  3rd Q  Max.
ClaimBC                BC claim amount in millions.

Continuous variables
CoverageBC             Log of BC coverage amount.
lnDeductBC             Log of BC deductible level.

Categorical variables                                                 Proportions
NoClaimCreditBC        Indicator for no BC claims in previous year.   32.93%
TypeCity               EntityType: City.                              14.03%
TypeCounty             EntityType: County.                            5.80%
TypeMisc               EntityType: Miscellaneous.                     10.81%
TypeSchool             EntityType: School.                            28.25%
TypeTown               EntityType: Town.                              17.33%
TypeVillage            EntityType: Village.                           23.78%
Table 3. Summary statistics of the test dataset (2011).

Response variable      Description                                    Min.  1st Q  Mean  Median  3rd Q  Max.
ClaimBC                BC claim amount in millions.

Continuous variables
CoverageBC             Log of BC coverage amount.
lnDeductBC             Log of BC deductible level.

Categorical variables                                                 Proportions
NoClaimCreditBC        Indicator for no BC claims in previous year.   50.87%
TypeCity               EntityType: City.                              14.06%
TypeCounty             EntityType: County.                            6.48%
TypeMisc               EntityType: Miscellaneous.                     11.42%
TypeSchool             EntityType: School.                            27.67%
TypeTown               EntityType: Town.                              16.53%
TypeVillage            EntityType: Village.                           23.84%
4.1 Tweedie GLM
We replicate the Tweedie GLM as noted in frees2016mv; the linear coefficients for the Tweedie GLM have been provided in Table A11 of frees2016mv. It was pointed out that the Tweedie GLM may not be ideal for this dataset after visualizing the cumulative density function plot of jittered aggregate losses, as depicted in Figure A6 of frees2016mv.
4.2 CART algorithm
We perform a grid search on the training dataset with 10-fold cross-validation. The final model sets the minimum number of observations in a region for a recursive binary split to be attempted to 8 and the cost-complexity pruning parameter to 0.05.
4.3 Calibration results of hybrid tree-based models
We use two hybrid tree-based models for this application. For the binary classification, we use the CART algorithm, and at each terminal node we use either a GLM with Gaussian family or elastic net regression with Gaussian family. As with the CART part, we perform a grid search with 10-fold cross-validation to find the optimal set of hyperparameter values. The hybrid tree-based model with a simple Gaussian GLM (HTGlm) tunes the minimum node size, the maximum depth of the tree (maxdepth), and the threshold on the percentage of zeros in a node that determines whether to build the linear model or assign zero as the prediction. The hybrid tree-based model with elastic net regression (HTGlmnet) additionally tunes the balance between the L1 norm and L2 norm, set to 1 by the grid search, and the size of the regularization parameter chosen by the best cross-validation result. These hyperparameter settings are chosen to attain the optimal RMSE in model calibration.
4.4 Model performance
We use several validation measures to examine the performance of the different models, just as was done in the simulation. To make for an easier comparison, we add a heat map in addition to the prediction accuracy table based on the various validation measures, presenting results separately for the training and test datasets. A rescaled value of 100 (colored blue) indicates the best model among those compared; a rescaled value of 0 (colored red) indicates the worst. For further details on such heat maps for validation, see quan2018predictive.
From Figure 1(a), we find that the Tweedie GLM has the worst model fit in general, while the CART algorithm and HTGlm fit better. From Figure 1(b), we find that HTGlm has the best performance on the test dataset. The results also show that hybrid tree-based models prevent overfitting compared to traditional regression trees using the CART algorithm; they perform consistently well on both the training and test datasets. Because hybrid tree-based models are piecewise linear models, this consistency is inherited from the linear model. We can also note that HTGlmnet performs worse than HTGlm on both the training and test datasets: the advantage of regularization does not have a significant effect on the LGPIF dataset because of the limited number of explanatory variables. In the simulation study, however, we showed the advantage of regularization when the data have a larger number of explanatory variables, and the simulated data may be a better replication of real-life situations. In summary, hybrid tree-based models outperform the Tweedie GLM.
5 Visualization and interpretation
The frequency component of hybrid tree-based models can be visualized as a tree structure, like a simple CART. Here, we use HTGlm to illustrate the visualization and interpretation of hybrid tree-based models. In Appendix B, Figure 5 shows the classification tree model for the frequency. We can follow the tree paths and construct the density plot of the response variable at each node. Figure 2 shows the classification tree, with the nodes up to two depths away from the root highlighted (blue); consequently, we have seven nodes, including the root node. We label the nodes with numbers for identification. Node four (4) is a terminal node and the others are intermediate nodes. Figure 3 shows the density of the log of the response variable for the seven nodes identified in Figure 2. The root node has a point mass at zero, as expected, and the black dashed line marks the mean of the response variable at each node. After the first split, using CoverageBC, the means shift in different directions and the percentages of the point mass at zero also change in different directions. At the fourth terminal node, we can barely see any positive claims, with the mean shifting towards zero; this terminal node ultimately assigns zero as the prediction. We can see that the classification tree algorithm, through recursive binary splitting, divides the dataset into subspaces containing more similar observations. After dividing the space of explanatory variables into subspaces, it may be more appropriate to make distributional assumptions and apply linear models to the subsamples. Hybrid tree-based models, in some sense, employ a 'divide and conquer' algorithm and can thus be a promising solution for imbalanced datasets.
Figure 2 shows the paths to the terminal nodes, each labeled with a zero or a one; a linear model is then fitted at each terminal node labeled one. The tree structure and paths indicate the risk factors underlying the splits of the explanatory variables. Figure 4 shows the variable importance calculated from the classification tree model for the frequency part. Coverage and deductible information are much more important than location and previous claim history for identifying whether a claim occurred. At each terminal node, we can also extract the regression coefficients and interpret the estimated linear regression model. Table 4 provides the regression coefficients at each of the non-zero terminal nodes, with the node numbers labeled as in Figure 2. For example, terminal node 245 is reached by following its sequence of splits from the root, and the estimated linear model at that node has the regression coefficients reported in Table 4. At each terminal node, the linear model is built relying solely on the prediction accuracy from cross-validation and on the regularized algorithm. It does not rely on statistical inference, and indeed, some regression coefficients are not statistically significant. However, we can tune hybrid tree-based models by balancing prediction accuracy and statistical inference.
Table 4: Regression coefficient estimates at the non-zero terminal nodes.

Terminal node     245      31       123      27       203      103     187
(Intercept)       172854   24263    7922014  150264   1907177  155878  303658
TypeCity          17218
TypeCounty
TypeMisc          62714
TypeSchool
TypeTown
CoverageBC        15251    26601    1102629  9454     447654   2772    287021
lnDeductBC        16777    15412    1232514  25246    70542    1526
NoClaimCreditBC   16254    34160    21825    67030    13229    13632
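The per-node coefficients in Table 4 arise from fitting a separate linear model within each terminal node that predicts a claim. A minimal sketch of that hybrid step, on simulated data with hypothetical covariates (not the authors' actual implementation, which uses a regularized GLM tuned by cross-validation), is:

```python
# Sketch of the hybrid step: within each terminal node whose majority class
# is "claim", fit a separate regularized linear model for log severity.
# Synthetic data; covariate names are illustrative placeholders only.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)
n = 4000
X = rng.normal(size=(n, 3))  # placeholders for e.g. coverage, deductible, claim history
occurred = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))
severity = np.where(occurred == 1,
                    np.exp(1.0 + 0.5 * X[:, 0] + rng.normal(0, 0.3, n)), 0.0)

# Step 1: classification tree for the frequency component.
freq_tree = DecisionTreeClassifier(max_depth=2, min_samples_leaf=100, random_state=0)
freq_tree.fit(X, occurred)

# Step 2: one severity model per terminal node that predicts a claim.
node_models = {}
leaves = freq_tree.apply(X)
for leaf in np.unique(leaves):
    in_leaf = leaves == leaf
    positives = in_leaf & (severity > 0)
    if occurred[in_leaf].mean() > 0.5 and positives.sum() > 30:
        m = Ridge(alpha=1.0).fit(X[positives], np.log(severity[positives]))
        node_models[leaf] = m
        print(f"node {leaf}: intercept={m.intercept_:.2f}, coefs={np.round(m.coef_, 2)}")
```

Each fitted `node_models` entry corresponds to one column of a table like Table 4: an intercept plus one coefficient per covariate, interpretable within that risk class.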
6 Conclusions
With the development of data science and increased computational capacity, insurance companies can benefit from large datasets collected over long periods of time. On the other hand, building models on such "big data" is challenging. This is especially true when modeling claims of short-term insurance contracts, since the data present many unique characteristics. For example, claim datasets typically have a large proportion of zero claims, which leads to imbalanced classification problems; a large number of explanatory variables with high correlations among the continuous variables; and many missing values. Directly applying traditional approaches such as the Tweedie GLM, without modifications that address these characteristics, can be problematic. Hence, it is often not surprising to find less desirable prediction accuracy from conventional methods on real-life datasets.
In this paper, to fully capture the information available in large insurance datasets, we propose the use of hybrid tree-based models as a learning machinery and an expansion of the actuarial toolkit for claims prediction. We have reiterated the many benefits that can be derived from such a hybrid structure. Using classification trees to model the frequency component has the advantages of handling imbalanced classes and naturally subdividing a portfolio into risk classes, in addition to the many general benefits of tree-based models; see also quan2018predictive. Using a regression model for the severity component provides the flexibility of identifying factors that are important predictors, for both accuracy and interpretation. Finally, the hybrid specification captures the benefits of tuning hyperparameters at each step of the algorithm. The tuning parameters can be adjusted not only for prediction accuracy but also for reaching specific business objectives in the decision-making process. This transparency and interpretability are advantages for convincing interested parties, such as investors and regulators, that hybrid tree-based models are much more than just a black box.
We examined the prediction performance of our proposed models using datasets produced from a simulation study and a real-life insurance portfolio. We focused on comparisons with the Tweedie GLM, since it has become widespread for modeling claims arising from short-term insurance policies. Broadly speaking, hybrid tree-based models can outperform the Tweedie GLM without loss of interpretability. However, there remain opportunities to improve and widen the scope of this hybrid tree structure in further research.
Acknowledgment
We gratefully acknowledge the financial support of the Society of Actuaries through its Centers of Actuarial Excellence (CAE) grant for our research project on Applying Data Mining Techniques in Actuarial Science.
Appendix A. Performance validation measures
The following table provides details of the various validation measures that were used throughout this paper to compare different predictive models.
Validation measure (interpretation): Definition

Gini Index (higher is better): $\mathrm{Gini} = \frac{2\sum_{i=1}^{n} i\, y_{(i)}}{n \sum_{i=1}^{n} y_i} - \frac{n+1}{n}$, where $y_{(i)}$ is the observed value $y_i$ corresponding to the $i$th smallest predicted value after ranking the predicted values $\hat{y}_i$.

Coefficient of Determination (higher is better): $R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}$, where $\hat{y}_i$ are the predicted values and $\bar{y}$ is the mean of the observed values.

Concordance Correlation Coefficient (higher is better): $\mathrm{CCC} = \frac{2\rho\,\sigma_y \sigma_{\hat{y}}}{\sigma_y^2 + \sigma_{\hat{y}}^2 + (\mu_y - \mu_{\hat{y}})^2}$, where $\mu_y$ and $\mu_{\hat{y}}$ are the means, $\sigma_y^2$ and $\sigma_{\hat{y}}^2$ are the variances, and $\rho$ is the correlation coefficient between the observed and predicted values.

Root Mean Squared Error (lower is better): $\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$.

Mean Absolute Error (lower is better): $\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|$.

Mean Absolute Percentage Error (lower is better): $\mathrm{MAPE} = \frac{100}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right|$.

Mean Percentage Error (lower is better): $\mathrm{MPE} = \frac{100}{n}\sum_{i=1}^{n}\frac{y_i - \hat{y}_i}{y_i}$.
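For reference, the error-based measures above can be computed as follows. These are standard textbook forms, written here as a sketch rather than the authors' exact code; the simple ranked form of the Gini index is assumed.

```python
# Standard implementations of several validation measures from the table above.
import numpy as np

def rmse(y, yhat):
    # Root mean squared error: lower is better.
    return float(np.sqrt(np.mean((y - yhat) ** 2)))

def mae(y, yhat):
    # Mean absolute error: lower is better.
    return float(np.mean(np.abs(y - yhat)))

def r2(y, yhat):
    # Coefficient of determination: higher is better.
    return float(1 - np.sum((y - yhat) ** 2) / np.sum((y - y.mean()) ** 2))

def ccc(y, yhat):
    # Lin's concordance correlation coefficient: higher is better.
    cov = np.mean((y - y.mean()) * (yhat - yhat.mean()))
    return float(2 * cov / (y.var() + yhat.var() + (y.mean() - yhat.mean()) ** 2))

def gini(y, yhat):
    # Gini index based on ranking observations by predicted values.
    y_sorted = y[np.argsort(yhat)]
    n = len(y)
    ranks = np.arange(1, n + 1)
    return float(2 * np.sum(ranks * y_sorted) / (n * y_sorted.sum()) - (n + 1) / n)

y = np.array([0.0, 0.0, 1.0, 3.0])
yhat = np.array([0.1, 0.2, 0.9, 2.8])
print(rmse(y, yhat), mae(y, yhat), round(r2(y, yhat), 3))
```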