Hybrid Tree-based Models for Insurance Claims

by Zhiyu Quan, et al.
University of Connecticut

Two-part models and Tweedie generalized linear models (GLMs) have been used to model loss costs for short-term insurance contracts. For most portfolios of insurance claims, there is typically a large proportion of zero claims, leading to imbalances that result in inferior prediction accuracy for these traditional approaches. This article proposes tree-based models with a hybrid structure, built from a two-step algorithm, as an alternative to these traditional models. The first step constructs a classification tree to build the probability model for frequency. In the second step, we employ elastic net regression models at each terminal node of the classification tree to build the distribution model for severity. This hybrid structure captures the benefits of tuning hyperparameters at each step of the algorithm, which allows for improved prediction accuracy, and tuning can be performed to meet specific business objectives. We examine and compare the predictive performance of this hybrid tree-based structure relative to the traditional Tweedie model using both real and synthetic datasets. Our empirical results show that these hybrid tree-based models produce more accurate predictions without loss of intuitive interpretation.




1 Prelude

Building regression models for insurance claims presents several challenges. The process can be particularly difficult for individual insurance policies where a large proportion of the claims are zeros and for those policies with a claim, the losses typically exhibit skewness. The degree of skewness in the positive loss distribution varies widely among different lines of insurance business. For example, within a class of automobile insurance policies, the losses arising from damages to property may be medium-tailed but those arising from liability-related claims may be long-tailed. For short-term insurance contracts, the number of claims is referred to as the claim frequency while the amount of claims, conditional on the occurrence of a claim, is the claim severity. The addition of covariates in a regression context within each component allows for capturing the heterogeneity of the individual policies. These two components are used as predictive models for pure premiums and for aggregate or total claims in a portfolio.

There is a long history of studying insurance claims based on the two-part framework. Several well-known families of distributions for claim frequencies and claim severities are well documented in klugman2012. As more observed data become available, additional complexities beyond traditional linear regression models have been introduced both in the literature and in practice; for example, the use of zero-truncated or hurdle models has been discussed to accommodate some of the shortcomings of traditional models. Interestingly, Tweedie regression models for insurance claims have been particularly popular because they are adaptable to a mixture of zeros and non-negative insurance claims. smyth2002fitting described the Tweedie compound Poisson regression model and calibrated such models using a Swedish third-party insurance claims dataset; the models were fitted to examine dispersion within the GLM framework. The Tweedie distribution can be treated as a reparameterization of the compound Poisson-gamma distribution, so that model calibration can be done within the GLM framework.


The literature has compared these two approaches and concluded that there is no clearly superior method, describing the advantages and disadvantages of each. The Tweedie GLM produces a larger pure premium, implying both a larger claim frequency and a larger claim severity, because of its constant scale (dispersion) parameter; in other words, the mean increases with the variance. The constant scale parameter also forces an artificial relationship between the claim frequency and the claim severity. The Tweedie GLM does not yield optimal coefficients, and it leads to a loss of information because it ignores the number of claims. On the other hand, the Tweedie GLM has fewer parameters to estimate and is thus more parsimonious than the two-part framework. When the insurance claims data present small losses driven by low frequencies, the two-part framework most likely overlooks the internal connection between the low frequency and the subsequent small loss amount. For example, the frequency model often indicates a zero number of claims for small-claim policies, which leads to a zero loss prediction. Additional works related to Tweedie regression can be found in frees2014predictive and jorgensen1987exponential.

Within the GLM framework, the claim frequency component is unable to accurately accommodate the imbalance caused by the large proportion of zero claims. At the same time, a significant limitation is that the regression structure of the logarithmic mean is restricted to a linear form, which may be too inflexible for real applications. With the rapid expansion of available data for improved decision making, there is a growing appetite in the insurance industry for expanding its toolkit for data analysis and modeling. However, the industry is unarguably highly regulated, so there is pressure on actuaries and data scientists to provide adequate transparency in modeling techniques and results. To find a balance between modeling flexibility and interpretability, in this paper we propose a nonparametric approach using tree-based models with a hybrid structure. Since breiman1984 introduced the Classification and Regression Tree (CART) algorithm, tree-based models have gained momentum as a powerful machine learning approach for decision making in several disciplines. The CART algorithm separates the explanatory variable space into several mutually exclusive regions and, as a result, creates a nested hierarchy of branches resembling a tree structure. Each separation or branch is referred to as a node. Each of the bottom nodes of the decision tree, called terminal nodes, has a unique path for observable data to enter the region. Once the decision tree is constructed, these paths can be used to locate the region, or terminal node, to which a new set of explanatory variables belongs.

Prior to further exploring the hybrid structure, it is worth mentioning that a similar structure called Model trees, built with the M5 algorithm, was first described in quinlan1992learning. The M5 algorithm constructs tree-based piecewise linear models. Regression trees assign a constant value to each terminal node as the fitted value; Model trees instead use a linear regression model at each terminal node to predict the fitted value for observations that reach that node. Regression trees are thus a special case of Model trees. Both regression trees and Model trees employ recursive binary splitting to build a fully grown tree. Thereafter, both algorithms use a cost-complexity pruning procedure to trim the fully grown tree back from each terminal node. The primary difference between the two algorithms lies in this latter step: for Model trees, each terminal node is replaced by a regression model instead of a constant value. The explanatory variables that serve to build those regression models are generally those that participate in the splits within the subtree that would be pruned; in other words, the explanatory variables for each node are those located beneath the current terminal node in the fully grown tree.

It has been demonstrated that Model trees have advantages over regression trees in both model simplicity and prediction accuracy. Model trees produce decision trees that are not only relatively simple to understand but also efficient and robust. Additionally, they are able to exploit local linearity within the dataset. When prediction performance is compared, it is worth noting the difference in the range of predictions produced by traditional regression trees and Model trees. Regression trees can only give predictions within the range of observed values in the training dataset, whereas Model trees can extrapolate beyond that range because of the regression models at the terminal nodes. For further discussion of Model trees and the M5 algorithm, see quinlan1992.

Inspired by the structure and advantages of Model trees when compared to traditional regression trees, we develop a similar algorithm that can be uniquely applied to insurance claims because of its two-part nature. In this paper, we present the hybrid tree-based models as a two-step algorithm: the first step builds a classification tree to identify membership of claim occurrence and the second step uses a penalized regression technique to determine the size of the claim at each of the terminal nodes, taking into account the available explanatory variables. In essence, hybrid tree-based models for insurance claims as described in this paper integrate the peculiarities of both the classification tree and the linear regression in the modeling process. These sets of models are suitably described as hybrid structures.

We have organized the rest of this paper as follows. In Section 2, we provide technical details of the algorithm arising from hybrid tree-based models applicable to insurance claims, separating the modeling of claim frequency and claim severity. In Section 3, we create a synthetic dataset based on a simulation produced from a true Tweedie regression model to investigate the predictive accuracy of our hybrid tree-based models when compared to the Tweedie GLM; we introduce some noise into the simulated data in order to make a reasonable comparison. In Section 4, using empirical data drawn from a portfolio of general insurance policies, we present the estimation results and compare prediction accuracy based on a hold-out sample for validation. Section 5 provides visualizations of the results from hybrid tree-based models to allow for better interpretation. We conclude in Section 6.

2 Description of hybrid tree-based models

Hybrid tree-based models utilize the information arising from both the frequency and the severity components. As already alluded to in Section 1, it is a two-stage procedure. In the first stage, we construct a classification tree-based model for frequency. In the subsequent stage, we employ a penalized linear regression model at each terminal node of the tree-based model, based on the severity information. A variety of models can be drawn from this hybrid structure, which serves as an ecosystem, to accommodate the characteristics of the dataset. The hybrid structure captures the advantageous features of both tree-based models and penalized regression models.

We can determine the type of classification tree to adopt for the frequency component according to the information recorded in the insurance dataset. If the dataset records only whether claims were reported or not, we can construct a classification tree for a binary response variable. If the dataset additionally records the number of claims, we can construct trees for a count response variable. For binary classification, we can employ some of the most popular algorithms, to list a few: (a) the efficient algorithm C4.5 (quinlan1992), with its current updated version C5.0, (b) unbiased approaches like Generalized, Unbiased, Interaction Detection, and Estimation (GUIDE) (loh2009improving), or (c) Conditional Inference Trees (CTREE) (hothorn2006unbiased). For a count response variable, we can apply, to list a few: (a) piecewise linear Poisson using the CART-like algorithm SUPPORT (chaudhuri1995generalized), (b) Poisson regression using GUIDE (loh2006regression), or (c) MOdel-Based recursive partitioning (MOB) (zeileis2008model).

After the completion of the classification tree structure, we then fit a linear regression model at each of the terminal nodes. The simplest choices include GLMs (nelder1972glm) with different families and regularized regressions such as elastic net regularization (zou2005reg), amongst others.

Hybrid tree-based models build an ecosystem that utilizes modern techniques from both traditional statistics and machine learning. In the subsequent subsections, for simplicity, we only show a simple hybrid tree-based model without exploring all of the possible combinations of different algorithms suitable at each stage of the procedure.

2.1 Claim frequency

To illustrate the use of hybrid tree-based models, we select the well-known CART algorithm for binary classification and least squares regression with elastic net regularization. Elastic net regularization can perform variable selection and improve the prediction accuracy when compared to traditional linear regressions without penalty. In order to be self-contained, we illustrate this simple hybrid tree-based model with sufficient details. See also james2013.

We denote the response variable as Y and the number of observations as n. The i-th sample, with p-dimensional explanatory variables, is denoted as x_i, which is sampled from the explanatory variable space X. For example, we can separate each claim Y into N, the claim occurrence (or the number of claims), and S, the claim severity.

In the CART algorithm, a binary classification tree, denoted by T, is produced by partitioning the space of the explanatory variables into M disjoint regions R_1, ..., R_M and then assigning a boolean c_m to each region R_m, for m = 1, ..., M. Given a classification tree, each observation can then be classified based on the expression

T(x) = Σ_{m=1}^{M} c_m I{x ∈ R_m},     (1)

where I{x ∈ R_m} is the indicator that x falls in the region R_m with its assigned boolean. To be more specific, the boolean c_m = 1 when the majority of observations in region R_m have a claim; otherwise it is zero.

The traditional classification loss functions used in the classification tree, as in Equation (1), are described as follows:

  • Misclassification error: 1 - max(p, 1 - p)

  • Gini index: 2p(1 - p)

  • Cross-entropy or deviance: -p log(p) - (1 - p) log(1 - p)

where p is the proportion of one class in the node. For multi-class loss functions, see hastie2009.
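As a concrete illustration, the three impurity measures above can be written directly as functions of the node proportion p (a minimal Python sketch; the paper's own implementation is in R):

```python
from math import log

def misclassification(p):
    # 1 - max(p, 1 - p): the share of the minority class in the node
    return 1.0 - max(p, 1.0 - p)

def gini(p):
    # 2p(1 - p): expected error rate of a randomized classification
    return 2.0 * p * (1.0 - p)

def cross_entropy(p):
    # -p log p - (1 - p) log(1 - p), with the convention 0 log 0 = 0
    if p in (0.0, 1.0):
        return 0.0
    return -p * log(p) - (1.0 - p) * log(1.0 - p)
```

All three measures vanish for a pure node (p = 0 or 1) and are maximized at p = 0.5, the most impure node.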

The default CART algorithm uses the Gini index as its loss function. The regions in the classification tree are determined by recursive binary splitting. The first step in the splitting process is the discovery of the one explanatory variable that best divides the data into two subregions; for example, in the case of a continuous explanatory variable j with split point s, these are the left node R_L and the right node R_R. This division is determined as the solution to

min_{j,s} { w_L L(R_L) + w_R L(R_R) },     (2)

where w_L and w_R are the weights for the subregions, each determined by the number of observations split into the subregion divided by the total number of observations before the split, and the loss L is evaluated at the proportion of one class in the subregion. Subsequently, the algorithm looks for the next explanatory variable with the best division into two subregions, and this process is applied recursively until meeting some predefined threshold or reaching a minimum number of observations in the terminal node.
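The greedy search for the best split point can be sketched as follows (illustrative Python for a single continuous explanatory variable with binary labels; candidate thresholds are taken as midpoints between consecutive sorted values, one common convention):

```python
def gini(p):
    # Gini impurity of a binary node with class proportion p
    return 2.0 * p * (1.0 - p)

def node_gini(labels):
    if not labels:
        return 0.0
    return gini(sum(labels) / len(labels))

def weighted_impurity(left, right):
    # w_L * L(R_L) + w_R * L(R_R), weights proportional to subregion size
    n = len(left) + len(right)
    return len(left) / n * node_gini(left) + len(right) / n * node_gini(right)

def best_split(xs, ys):
    """Scan midpoints between consecutive sorted values; return the
    threshold minimizing the weighted Gini impurity."""
    pairs = sorted(zip(xs, ys))
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    best_imp, best_thr = float("inf"), None
    for i in range(1, len(xs)):
        if xs[i] == xs[i - 1]:
            continue  # no threshold exists between equal values
        thr = (xs[i] + xs[i - 1]) / 2
        imp = weighted_impurity(ys[:i], ys[i:])
        if imp < best_imp:
            best_imp, best_thr = imp, thr
    return best_thr
```

For example, with claims occurring only above x = 2.5, the search recovers that threshold exactly; the full CART algorithm repeats this scan over every explanatory variable and recurses on the resulting subregions.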

To control for model complexity, we can use cost-complexity pruning to trim the fully grown tree T_0. We define the loss in region R_m as L(R_m), the chosen classification loss evaluated at that node. For any subtree T of T_0, we denote the number of terminal nodes in this subtree by |T|. To control the number of terminal nodes, we introduce the tuning hyperparameter α into the loss function by defining the new cost function as

C_α(T) = Σ_{m=1}^{|T|} L(R_m) + α |T|.     (3)

Clearly, according to this cost function, the tuning hyperparameter α penalizes large numbers of terminal nodes. The idea then is to find the subtree T_α for each α, and choose the subtree that minimizes C_α(T) in Equation (3). Furthermore, the tuning hyperparameter α governs the tradeoff between the size of the tree (model complexity) and its goodness of fit to the data. Large values of α result in smaller trees (simpler models) and, as the notation suggests, α = 0 leads to the fully grown tree T_0. Additional tuning is done to control the tree depth through maxdepth, the maximum depth of any node of the final tree.
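The pruning criterion in Equation (3) can be illustrated with a toy example (the subtree losses and leaf counts below are hypothetical numbers, not from the paper):

```python
def pruning_cost(loss, n_leaves, alpha):
    # C_alpha(T) = total node loss + alpha * number of terminal nodes
    return loss + alpha * n_leaves

# Hypothetical nested subtrees of a fully grown tree,
# listed as (total training loss, number of terminal nodes):
subtrees = [(10.0, 8), (12.0, 4), (16.0, 2), (22.0, 1)]

def best_subtree(alpha):
    # The subtree minimizing the cost-complexity criterion for this alpha
    return min(subtrees, key=lambda t: pruning_cost(*t, alpha))
```

As α grows, the minimizer moves from the fully grown tree toward the root, which is exactly the complexity-versus-fit tradeoff described above.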

2.2 Claim severity

After building the classification tree structure, we next apply a linear regression at the terminal nodes to model severity. In controlling for model complexity, we set a threshold ω to determine whether we should build a linear regression or directly assign zero to the terminal node. For example, if ω is 80%, then we directly assign zero to terminal nodes that contain more than 80% zero claims. Furthermore, if the terminal node contains fewer than a certain number of observations, say 40, we can directly use the node mean as the prediction, similar to regression trees. Otherwise, we build a linear regression at the terminal node. While any member of the GLM exponential family that is suitable for continuous claims can be used, we find that the special case of the Gaussian GLM, or just ordinary linear regression, is sufficiently suitable.
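The node-level decision rule just described can be sketched as follows (thresholds of 80% zeros and 40 observations taken from the text; the function name is illustrative):

```python
def node_model_choice(claims, omega=0.8, min_obs=40):
    """Choose the prediction rule for a terminal node, per the thresholds
    in the text: omega = 80% zero claims, minimum 40 observations."""
    zero_prop = sum(1 for c in claims if c == 0) / len(claims)
    if zero_prop > omega:
        return "zero"         # node dominated by zero claims
    if len(claims) < min_obs:
        return "mean"         # too few observations for a stable regression
    return "regression"       # fit a (regularized) linear severity model
```

For instance, a node of ten observations with nine zeros is assigned zero outright, a small node with mostly positive claims falls back to its mean, and only large, claim-rich nodes receive a fitted regression.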

At each terminal node R_m, the linear coefficient can be determined by

β_m = argmin_β Σ_{i: x_i ∈ R_m} l(y_i, x_i; β),     (4)

where l is the negative log-likelihood for sample i. For the Gaussian family, denoting the design matrix as X, the coefficient is well known to be β = (X^T X)^{-1} X^T y.

Ridge regression (hoerl1970ridge) achieves better prediction accuracy than ordinary least squares because of the bias-variance trade-off: the reduction in the variance of the coefficient estimates is larger than the increase in their squared bias. It performs coefficient shrinkage and forces correlated explanatory variables to have similar coefficients. In ridge regression, at each terminal node R_m, the linear coefficient can be determined by

β_m^ridge = argmin_β { Σ_i (y_i - x_i^T β)^2 + λ Σ_j β_j^2 },     (5)

where λ ≥ 0 is a tuning hyperparameter that controls shrinkage and thus the effective number of selected explanatory variables. For the following discussion, we assume the values of the explanatory variables are standardized so that (1/n) Σ_i x_ij = 0 and (1/n) Σ_i x_ij^2 = 1. If the explanatory variables do not have the same scale, the shrinkage may not be fair. In the case of ridge regression within the Gaussian family, the coefficient in Equation (5) can be shown to have the explicit form

β^ridge = (X^T X + λI)^{-1} X^T y,

where I is the identity matrix of appropriate dimension.
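The explicit ridge solution can be verified numerically; the sketch below uses synthetic data and illustrative coefficients:

```python
import numpy as np

rng = np.random.default_rng(7)

def ridge_beta(X, y, lam):
    # Closed-form ridge coefficients: (X'X + lam * I)^{-1} X'y
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Illustrative data: 200 observations, 5 predictors on a common scale
X = rng.standard_normal((200, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.standard_normal(200)

beta_small = ridge_beta(X, y, 0.01)   # close to ordinary least squares
beta_large = ridge_beta(X, y, 100.0)  # heavily shrunk toward zero
```

The norm of the coefficient vector shrinks as λ grows, illustrating the variance reduction that drives the bias-variance trade-off.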

Ridge regression does not automatically select the important explanatory variables. LASSO regression (tibshirani1996regression), however, has a sparsity effect that forces the coefficients of the least important explanatory variables to zero, thereby making the regression model more parsimonious. In addition, LASSO performs coefficient shrinkage and selects only one explanatory variable from a group of correlated explanatory variables. In LASSO, at each terminal node R_m, the linear coefficient can be determined by

β_m^lasso = argmin_β { (1/2) Σ_i (y_i - x_i^T β)^2 + λ Σ_j |β_j| },     (6)

where λ ≥ 0 is a tuning hyperparameter that controls shrinkage. For regularized least squares with the LASSO penalty, Equation (6) leads to a quadratic programming problem.

Originally, tibshirani1996regression used a combined quadratic programming method to numerically solve for the LASSO coefficients. In a later development, fu1998penalized proposed the shooting method, and friedman2007pathwise recast shooting as the coordinate descent algorithm, which is now a popular optimization algorithm.

To illustrate the coordinate descent algorithm, consider optimizing over a single coordinate β_j with all other coordinates held fixed at their current values. Given β_k for k ≠ j, the optimal β_j can be found by minimizing

(1/2) Σ_i (r_ij - x_ij β_j)^2 + λ |β_j| + const,

where the constant collects the terms not involving β_j and can be dropped, and r_ij = y_i - Σ_{k ≠ j} x_ik β_k denotes the partial residual. It follows that each coordinate update is a univariate penalized least squares problem. This can be solved by the Soft-thresholding Lemma discussed in donoho1995adapting.

Lemma 2.1.

(Soft-thresholding Lemma) The following optimization problem

min_β { (1/2)(β - z)^2 + γ |β| }

has the solution

S(z, γ) = sign(z)(|z| - γ)_+.

Therefore, β_j can be expressed as

β_j = S( Σ_i x_ij r_ij, λ ) / Σ_i x_ij^2.

For j = 1, ..., p, update β_j by soft-thresholding while the other coordinates take their previously estimated values. Repeat the loop until convergence.
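The full coordinate descent loop can be sketched in NumPy as follows (an illustrative implementation using the 1/(2n) least squares scaling common in software such as glmnet, which may differ from the paper's scaling by a constant factor):

```python
import numpy as np

rng = np.random.default_rng(3)

def soft_threshold(z, gamma):
    # S(z, gamma) = sign(z) * (|z| - gamma)_+
    return np.sign(z) * max(abs(z) - gamma, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for (1/2n)||y - X beta||^2 + lam * ||beta||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual: remove all fitted effects except coordinate j
            r = y - X @ beta + X[:, j] * beta[j]
            z = X[:, j] @ r / n
            beta[j] = soft_threshold(z, lam) / (X[:, j] @ X[:, j] / n)
    return beta

X = rng.standard_normal((100, 3))
y = 2.0 * X[:, 0]                # only the first variable carries signal

beta_mild = lasso_cd(X, y, 0.01)  # nearly unpenalized fit
beta_heavy = lasso_cd(X, y, 5.0)  # penalty exceeds every signal
```

A mild penalty recovers the true coefficient on the signal variable while leaving the noise variables near zero; a penalty larger than every marginal correlation drives the whole coefficient vector to exactly zero, illustrating the sparsity effect.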

zou2005reg pointed out that LASSO has a few drawbacks. For example, when the number of explanatory variables p is larger than the number of observations n, LASSO selects at most n explanatory variables. Additionally, if there is a group of pairwise correlated explanatory variables, LASSO arbitrarily selects only one explanatory variable from this group and ignores the rest. zou2005reg also empirically showed that LASSO can have inferior prediction performance compared to ridge regression. Elastic net regularization was therefore proposed, using a convex combination of the ridge and LASSO penalties. Elastic net regularization is able to better handle correlated explanatory variables while performing both variable selection and coefficient shrinkage. In elastic net regularization, at each terminal node R_m, the linear coefficient can be determined by

β_m^enet = argmin_β { (1/2) Σ_i (y_i - x_i^T β)^2 + λ [ α Σ_j |β_j| + ((1 - α)/2) Σ_j β_j^2 ] },     (7)

where α ∈ [0, 1] controls the elastic net penalty and bridges the gap between LASSO (when α = 1) and ridge regression (when α = 0). If explanatory variables are correlated in groups, an α around 0.5 tends to select the groups in or out altogether. The penalized least squares problem with the elastic net penalty can also be solved using the coordinate descent algorithm. We omit the detailed solution, which follows a process similar to solving LASSO by coordinate descent. It is worth mentioning, however, that the coordinate update for β_j takes the form

β_j = S( Σ_i x_ij r_ij, λα ) / ( Σ_i x_ij^2 + λ(1 - α) ).

The mathematical foundation for elastic net regularization can be found in de2009elastic. It should also be noted that all the regularization schemes mentioned previously have simple Bayesian interpretations; see hastie2009 and james2013. A review of statistical learning applications in general insurance can be found in parodi2012computational.

Finally, we conclude that hybrid tree-based models can be expressed in the following functional form:

f(x) = Σ_{m=1}^{M} I{x ∈ R_m} g_m(x),     (8)

where g_m is the model fitted at terminal node R_m (zero, the node mean, or a penalized linear regression). From Equation (8), we see that hybrid tree-based models can also be viewed as piecewise linear regression models. The tree-based algorithm divides the dataset into subsamples (or risk classes), and linear regression models are then applied to these subsamples.

Algorithm 1 summarizes the details of implementing hybrid tree-based models based on the rpart and glmnetUtils packages in R.

Input: Formula for the tree, formula for the linear model, training dataset x, cost-complexity parameter cp, maximum depth of the final tree maxdepth, threshold ω for the proportion of zeros in a node below which a linear model is built, elastic net mixing parameter α, penalty size λ.
Output: Tree structure, linear models at each node, node information, α, λ, fitted dataset.
1 Grow a tree on the training dataset using recursive binary splitting, with a stopping criterion on the minimum node size;
2 Prune the tree using cost-complexity pruning with cp;
3 Assign node information to each observation;
4 for each terminal node do
5       If the proportion of zero claims exceeds ω in the node, assign zero as the prediction;
6       Else, build a linear model using the observations in this node: if the number of observations in the node is smaller than 40, assign the average of the response variable as the prediction; otherwise, fit ordinary least squares with elastic net regularization;
7 end for
8 Return Tree structure, linear models at each node, node information, α, λ, fitted dataset;
Algorithm 1 Implementation of the hybrid tree-based models
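As a compact end-to-end illustration of Algorithm 1, the sketch below substitutes a single Gini-optimal split for the full CART step and ordinary least squares for the elastic net at the leaves; all data and function names are illustrative, not the paper's rpart/glmnetUtils implementation:

```python
import numpy as np

rng = np.random.default_rng(1)

def gini(labels):
    if len(labels) == 0:
        return 0.0
    p = float(np.mean(labels))
    return 2.0 * p * (1.0 - p)

def fit_hybrid_stump(x, y, omega=0.8, min_obs=40):
    """Step 1: a single Gini-optimal split on claim occurrence (a depth-one
    stand-in for the classification tree). Step 2: a severity model per leaf,
    following the zero / node-mean / regression rule of Algorithm 1."""
    occurred = (y > 0).astype(float)
    n = len(y)
    def weighted_gini(t):
        left, right = occurred[x < t], occurred[x >= t]
        return len(left) / n * gini(left) + len(right) / n * gini(right)
    t_best = min(np.unique(x)[1:], key=weighted_gini)
    leaves = {}
    for side, mask in (("left", x < t_best), ("right", x >= t_best)):
        yy, xx = y[mask], x[mask]
        if np.mean(yy == 0) > omega:
            leaves[side] = ("zero", 0.0)        # node dominated by zeros
        elif len(yy) < min_obs:
            leaves[side] = ("mean", yy.mean())  # too few observations
        else:
            A = np.column_stack([np.ones_like(xx), xx])
            coef, *_ = np.linalg.lstsq(A, yy, rcond=None)
            leaves[side] = ("ols", coef)        # per-leaf linear severity model
    return t_best, leaves

def predict(model, x_new):
    t_best, leaves = model
    kind, value = leaves["left"] if x_new < t_best else leaves["right"]
    return value[0] + value[1] * x_new if kind == "ols" else value

# Illustrative data: no claims for x < 0, linear severity 1 + 3x otherwise
x = rng.uniform(-1.0, 1.0, 400)
y = np.where(x < 0, 0.0, 1.0 + 3.0 * x)
model = fit_hybrid_stump(x, y)
```

The fitted model assigns zero to the all-zero leaf and recovers the linear severity on the other leaf, which is precisely the piecewise linear form of Equation (8).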

3 Simulation study

In this section, we conduct a simulation study to assess the performance of the simple hybrid tree-based model proposed in this paper relative to that of the Tweedie GLM. The synthetic dataset has been created to replicate a real dataset: it contains continuous explanatory variables and categorical explanatory variables, with a continuous response variable sampled from a Tweedie-like distribution. In some sense, the observed values of the response variable behave like a Tweedie, but some noise was introduced into the data for a fair comparison.

We begin with the continuous explanatory variables by generating sample observations from the multivariate normal distribution with covariance structure Σ_ij = ρ^|i - j|, where 0 < ρ < 1, so that continuous explanatory variables close to each other have higher correlations, with the correlation effect diminishing as the distance between two explanatory variables increases. In real datasets, we can easily observe correlated explanatory variables among hundreds of possible features. It is also very challenging to detect correlation between explanatory variables when an insurance company merges in third-party datasets. Multicollinearity reduces the precision of the estimated coefficients of correlated explanatory variables, and it can be problematic to use a GLM on real data without handling multicollinearity.
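Such a correlation structure can be simulated as follows (assuming an AR(1)-type covariance Σ_ij = ρ^|i-j|, consistent with the decay-with-distance description; the parameter values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def ar1_cov(p, rho):
    # Sigma_ij = rho^|i - j|: correlation decays with distance between variables
    idx = np.arange(p)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def sample_correlated_normal(n, p, rho):
    # Multivariate normal draws via the Cholesky factor of the covariance
    L = np.linalg.cholesky(ar1_cov(p, rho))
    return rng.standard_normal((n, p)) @ L.T

X = sample_correlated_normal(20000, 5, 0.5)
corr = np.corrcoef(X, rowvar=False)
```

With ρ = 0.5, adjacent columns have empirical correlation near 0.5, and the correlation decays geometrically with the distance between columns.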

Categorical explanatory variables are generated by random sampling from the set of integers with equal probabilities. Categorical explanatory variables do not have any correlation structure among them.

We create observations (with a large proportion of zero claims) with explanatory variables in three groups of each type: continuous explanatory variables with relatively larger coefficients, continuous explanatory variables with relatively smaller coefficients, continuous explanatory variables with coefficients of zero (meaning no relevance), and categorical explanatory variables with relatively larger, relatively smaller, and zero coefficients. In effect, there are three groups of explanatory variables according to the size of the linear coefficients: strong-signal explanatory variables, weak-signal explanatory variables, and noise explanatory variables. Here are the true linear coefficients β_F and β_S used:

The first coefficients refer to the intercepts. Because of the equivalence of the compound Poisson-gamma and Tweedie distributions, as discussed in Section 1, we generated samples drawn from a Poisson distribution for frequency and a gamma distribution for severity.

To illustrate possible drawbacks of the Tweedie GLM, we use different linear coefficients for the frequency part and the severity part: the absolute values of the linear coefficients are the same but the signs differ. In real life, there is no guarantee that the explanatory variables have equal coefficients in the frequency and severity parts. The response variable is generated within the compound Poisson-gamma distribution framework, using a modified rtweedie function from the tweedie R package, with the frequency and severity components as follows:

Frequency part: Poisson distribution with a log link function,

N_i ~ Poisson(λ_i), where λ_i = exp(x_i^T β_F).

Severity part: Gamma distribution with a log link function,

S_i^(k) ~ Gamma, with mean E[S_i^(k)] = exp(x_i^T β_S).

Response variable:

Y_i = Σ_{k=1}^{N_i} S_i^(k),

where the superscript (k) has been used to distinguish the response and the gamma variables.
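The compound Poisson-gamma sampling scheme can be sketched in NumPy as a stand-in for the modified rtweedie call (the Poisson and gamma parameters below are illustrative, not those used in the paper):

```python
import numpy as np

rng = np.random.default_rng(42)

def compound_poisson_gamma(lam, shape, scale, size):
    """Y = sum of N gamma severities, with N ~ Poisson(lam); Y = 0 when N = 0."""
    n_claims = rng.poisson(lam, size)
    return np.array([rng.gamma(shape, scale, k).sum() for k in n_claims])

# Illustrative parameters: mean frequency 0.5, gamma mean shape * scale = 200
y = compound_poisson_gamma(lam=0.5, shape=2.0, scale=100.0, size=10000)
```

The resulting sample has a point mass of exact zeros of size roughly e^(-λ) ≈ 0.61 together with skewed positive losses, reproducing the shape that motivates the Tweedie model.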

We introduce noise into the true Tweedie model in the following ways: multicollinearity among the continuous variables, zero coefficients for non-relevant explanatory variables, different coefficients for the frequency and severity parts, and white noise added to the positive claims. These are the kinds of noise that make real-life data deviate from the true model in practice, and they yield an impartial dataset that provides for a fair comparison.

Once we have produced the simulated (or synthetic) dataset, we apply both Tweedie GLM and hybrid tree-based models to predict values of the response variables, given the available set of explanatory variables. The prediction performance between the two models is then compared using several validation measures as summarized in Table 1. In Appendix A, Table 5 describes details for validation measures.

Validation measure
Tweedie GLM 0.95 0.16 0.47 0.89 0.07
Hybrid Tree-based Model 0.97 0.29 0.36 0.81 0.06
Table 1: Prediction accuracy using the simulation dataset.

From Table 1, we find that the Tweedie GLM performs worse even under the true model assumption, albeit with noise present. The hybrid tree-based model with elastic net regression, by contrast, is able to automatically perform coefficient shrinkage via the L2 norm and variable selection via the tree-structure algorithm and the L1 norm. Hybrid tree-based models provide more promising prediction results than the now-popular Tweedie GLM; even under the linear assumption built into our simulation dataset, hybrid tree-based models still perform better.

4 Empirical application

We compare the Tweedie GLM with hybrid tree-based models using one coverage group from the LGPIF data. This coverage group, called building and contents (BC), provides insurance for buildings and properties of local government entities, including counties, cities, towns, villages, school districts, and other miscellaneous entities. We draw observations from years 2006 to 2010 as the training dataset and year 2011 as the test dataset. Table 2 and Table 3 summarize the training dataset and test dataset, respectively.

variables Description Min. 1st Q Mean Median 3rd Q Max.

BC claim amount in million.


Log of BC coverage amount.

Log of BC deductible level.

variables Proportions

Indicator for no BC claims in previous year. 32.93 %

EntityType: City. 14.03 %

EntityType: County. 5.80 %

EntityType: Miscellaneous. 10.81 %

EntityType: School. 28.25 %

EntityType: Town. 17.33 %

EntityType: Village. 23.78 %

Table 2: Summary statistics of the variables for the training dataset, 2006-2010.

variables Description Min. 1st Q Mean Median 3rd Q Max.

BC claim amount in million.


Log of BC coverage amount.

Log of BC deductible level.

variables Proportions

Indicator for no BC claims in previous year. 50.87 %

EntityType: City. 14.06 %

EntityType: County. 6.48 %

EntityType: Miscellaneous. 11.42 %

EntityType: School. 27.67 %

EntityType: Town. 16.53 %

EntityType: Village. 23.84 %

Table 3: Summary statistics of the variables for the test dataset, 2011.

4.1 Tweedie GLM

We replicate the Tweedie GLM as noted in frees2016mv. The linear coefficients for the Tweedie GLM have been provided in Table A11 of frees2016mv. It was pointed out there that the Tweedie GLM may not be ideal for this dataset after visualizing the cumulative density function plot of jittered aggregate losses, as depicted in Figure A6 of frees2016mv.

4.2 CART algorithm

We perform a grid search on the training dataset with 10-fold cross-validation. The final model has hyperparameters set to a minimum of 8 observations in a region for a recursive binary split to be attempted and a cost-complexity pruning parameter (cp) of 0.05.

4.3 Calibration results of hybrid tree-based models

We use two hybrid tree-based models for this application. For the binary classification, we use the CART algorithm, and at the terminal nodes we use either a GLM with Gaussian family or an elastic net regression with Gaussian family. As with the CART part, we perform a grid search with 10-fold cross-validation to find the optimal set of hyperparameter values. The hybrid tree-based model with a simple Gaussian GLM (HTGlm) has its cost-complexity parameter (cp), maximum depth of the tree (maxdepth), and threshold for the percentage of zeros in a node that determines whether to build the linear model or assign zero as the prediction (ω) chosen by this search. The hybrid tree-based model with elastic net regression (HTGlmnet) additionally sets the balance between the L1 norm and L2 norm (α) to 1 and the size of the regularization parameter (λ) by the best cross-validation result. These hyperparameter settings are chosen to achieve optimal RMSE in model calibration.

4.4 Model performance

We use several validation measures to examine the performance of the different models, just as was done in the simulation study. To make for an easier comparison, we add a heat map in addition to the prediction accuracy table based on the various validation measures. We present results separately for the training and test datasets. A rescaled value of 100 (colored blue) indicates the best model among those compared; on the other hand, a rescaled value of 0 (colored red) indicates the worst one. For further details on this type of heat map for validation, see quan2018predictive.
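The 0-to-100 rescaling behind the heat map can be sketched as below. The model scores are made-up numbers for illustration only; the essential detail is that "higher is better" and "lower is better" measures must be handled with opposite orientations before rescaling:

```python
# Per-measure min-max rescaling: best model -> 100, worst model -> 0.
import numpy as np

scores = {                         # hypothetical values, for illustration only
    "Tweedie": {"Rsq": 0.10, "RMSE": 3400.0},
    "CART":    {"Rsq": 0.25, "RMSE": 3100.0},
    "HTGlm":   {"Rsq": 0.24, "RMSE": 3000.0},
}
higher_is_better = {"Rsq": True, "RMSE": False}

def rescale(measure):
    vals = np.array([scores[m][measure] for m in scores])
    if not higher_is_better[measure]:
        vals = -vals                       # flip so larger is always better
    lo, hi = vals.min(), vals.max()
    return {m: 100.0 * (v - lo) / (hi - lo) for m, v in zip(scores, vals)}

print(rescale("RMSE"))   # Tweedie -> 0.0, HTGlm -> 100.0
```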

(a) Model performance based on training dataset
(b) Model performance based on test dataset
Figure 1: A heat map of model comparison based on various validation measures.

From Figure 1(a), we find that the Tweedie GLM generally has the worst model fit, while the CART algorithm and HTGlm fit the training data better. From Figure 1(b), we find that HTGlm has the best performance on the test dataset. It also shows that the hybrid tree-based models prevent overfitting compared to traditional regression trees built with the CART algorithm. Hybrid tree-based models perform consistently well on both the training and test datasets. Because hybrid tree-based models can be considered piecewise linear models, this consistency is inherited from the linear model. We also note that HTGlmnet performs worse than HTGlm on both the training and test datasets. The advantage of regularization does not have a significant effect on the LGPIF dataset because of the limited number of explanatory variables. In the simulation study, however, we showed the advantage of regularization when the data contain a larger number of explanatory variables; the simulated data may better replicate real-life situations. In summary, hybrid tree-based models outperform the Tweedie GLM.

5 Visualization and interpretation

The frequency component of hybrid tree-based models can be clearly visualized as a tree structure, like a simple CART. Here, we use HTGlm to illustrate the visualization and interpretation of hybrid tree-based models. In Appendix B, Figure 5 shows the classification tree model for the frequency. We can follow the tree paths and construct the density plot of the response variable at each node. Figure 2 shows the classification tree with the nodes within two depths of the root highlighted in blue. Consequently, we have seven such nodes, including the root node. Here we number the nodes for identification. Node four (4) is a terminal node and the others are intermediate nodes. Figure 3 shows the density of the log of the response variable for the seven nodes highlighted in Figure 2. The root node has a point mass at zero, as expected, and the black dashed line marks the mean of the response variable at each node. After the first split, on CoverageBC, the means of the child nodes shift in different directions, and the percentage of the point mass at zero also changes in different directions. At terminal node four, we can barely see any positive claims, the mean shifts towards zero, and this terminal node ultimately assigns zero as the prediction. We can see that the classification tree algorithm, through recursive binary splitting, divides the dataset into subspaces that contain more similar observations. After dividing the space of explanatory variables into subspaces, it may be more appropriate to make distributional assumptions and apply a linear model to the subsamples. Hybrid tree-based models, in some sense, employ a 'divide and conquer' algorithm, and thus they can be a promising solution for imbalanced datasets.
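The per-node summaries behind these density plots can be reproduced mechanically: route every observation to its node with the fitted tree, then compute each node's point mass at zero and the mean of the log positive responses. The sketch below uses synthetic data and scikit-learn's tree, not the paper's fitted model:

```python
# Per-node zero mass and mean log severity, as summarized in Figures 2-3.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 2))
y = np.where(X[:, 0] > 0, rng.gamma(2.0, 500.0, 500), 0.0)

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y > 0)
leaf = tree.apply(X)                     # node id for each observation

for node in np.unique(leaf):
    sub = y[leaf == node]
    zero_mass = np.mean(sub == 0)        # proportion of zero claims in the node
    log_mean = np.log(sub[sub > 0]).mean() if (sub > 0).any() else float("nan")
    print(f"node {node}: zero mass {zero_mass:.2f}, mean log severity {log_mean:.2f}")
```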

Figure 2: Tree paths with highlighted nodes.
Figure 3: Density of the log response variable at the highlighted nodes.
Figure 4: Variable importance for the claim frequency.

Figure 2 shows the paths to the terminal nodes, each of which is labeled with a zero or a one; a linear model is then fitted at each terminal node labeled one. The tree structure and paths indicate risk factors through the splits on the explanatory variables. Figure 4 shows the variable importance calculated from the classification tree model for the frequency part. Coverage and deductible information are much more important than location and previous claim history for identifying whether a claim happened. At each terminal node, we can also extract the regression coefficients and interpret the estimated linear regression model. Table 4 provides the regression coefficients at each of the non-zero terminal nodes; the node numbers match the labels in Figure 2. For example, terminal node 245 is reached by following a sequence of splits from the root, and the linear model estimated at that node has the regression coefficients shown in Table 4. The linear model built at each terminal node relies solely on the prediction accuracy from cross-validation and on the regularized algorithm. It does not rely on statistical inference, and indeed, some regression coefficients are not statistically significant. However, we can tune hybrid tree-based models to balance prediction accuracy and statistical inference.
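The variable importance ranking in Figure 4 can be extracted directly from a fitted classification tree. The sketch below uses scikit-learn's impurity-based importances on synthetic data; the variable names are those mentioned in the text, but the fitted values here are not the paper's:

```python
# Extracting and ranking impurity-based variable importances from a CART fit.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(4)
names = ["CoverageBC", "lnDeductBC", "NoClaimCreditBC", "TypeCity"]
X = rng.normal(size=(500, 4))
# Synthetic claim indicator driven mostly by the first two covariates.
y = (2.0 * X[:, 0] - X[:, 1] + 0.2 * rng.normal(size=500) > 0).astype(int)

tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)
ranking = sorted(zip(names, tree.feature_importances_), key=lambda t: -t[1])
for name, imp in ranking:
    print(f"{name:16s} {imp:.3f}")
```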

Terminal node       245      31      123      27       203     103      187

(Intercept)     -172854   24263  7922014  150264  -1907177  155878  -303658
TypeCity          17218
TypeMisc          62714
CoverageBC        15251   26601  1102629    9454    447654    2772   287021
lnDeductBC        16777  -15412 -1232514  -25246    -70542   -1526
NoClaimCreditBC  -16254  -34160   -21825  -67030    -13229  -13632

Table 4: Regression coefficients at the terminal nodes.

6 Conclusions

With the development of data science and increased computational capacity, insurance companies can benefit from the large datasets they have collected over extended periods of time. On the other hand, it is also challenging to build a model on such "big data." This is especially true for modeling claims of short-term insurance contracts, since these datasets present many unique characteristics. For example, claim datasets typically have a large proportion of zero claims that leads to imbalanced classification problems, a large number of explanatory variables with high correlations among the continuous variables, and many missing values. It can be problematic to directly apply traditional approaches, such as the Tweedie GLM, without modifications that address these characteristics. Hence, it is often not surprising to find less desirable prediction accuracy from conventional methods on real-life datasets.

In this paper, to fully capture the information available in large insurance datasets, we propose the use of hybrid tree-based models as a learning machinery and an expansion of the available set of actuarial toolkits for claims prediction. We have reiterated the many benefits that can be derived from such a hybrid structure. The use of classification trees to model the frequency component has the advantages of handling imbalanced classes and naturally subdividing the portfolio into risk classes, along with the many other benefits of tree-based models; see also quan2018predictive. The use of a regression model for the severity component provides the flexibility of identifying factors that are important predictors, for both accuracy and interpretation. Finally, the hybrid specification captures the benefits of tuning hyperparameters at each step of the algorithm. The tuning parameters can be adjusted not only for prediction accuracy but also to reach specific business objectives in the decision-making process. Transparency and interpretability are advantages when convincing interested parties, such as investors and regulators, that hybrid tree-based models are much more than just a black box.

We examine the prediction performance of our proposed models using datasets produced from a simulation study and a real-life insurance portfolio. We focus the comparison on the Tweedie GLM since it has become widespread for modeling claims arising from short-term insurance policies. Broadly speaking, hybrid tree-based models can outperform the Tweedie GLM without loss of interpretability. However, there are opportunities to improve and widen the scope of this hybrid tree structure that can be explored in further research.


Acknowledgment

We are grateful for the financial support of the Society of Actuaries through its Centers of Actuarial Excellence (CAE) grant for our research project on Applying Data Mining Techniques in Actuarial Science.

Appendix A. Performance validation measures

The following table provides details of the various validation measures that were used throughout this paper to compare different predictive models.

Validation measure: Gini Index. Definition: $\mathrm{Gini} = \frac{2\sum_{i=1}^{n} i\, y_{(i)}}{n\sum_{i=1}^{n} y_i} - \frac{n+1}{n}$, where $y_{(i)}$ is the $y_i$ corresponding to $\hat{y}_i$ after ranking the predicted values $\hat{y}_i$. Higher is better.

Validation measure: Coefficient of Determination. Definition: $R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}$, where $\hat{y}_i$ are the predicted values. Higher is better.

Validation measure: Concordance Correlation Coefficient. Definition: $\mathrm{CCC} = \frac{2\rho\,\sigma_y\sigma_{\hat{y}}}{\sigma_y^2 + \sigma_{\hat{y}}^2 + (\mu_y - \mu_{\hat{y}})^2}$, where $\mu_y$ and $\mu_{\hat{y}}$ are the means, $\sigma_y^2$ and $\sigma_{\hat{y}}^2$ are the variances, and $\rho$ is the correlation coefficient. Higher is better.

Validation measure: Root Mean Squared Error. Definition: $\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$. Lower is better.

Validation measure: Mean Absolute Error. Definition: $\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|$. Lower is better.

Validation measure: Mean Absolute Percentage Error. Definition: $\mathrm{MAPE} = \frac{100}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right|$. Lower is better.

Validation measure: Mean Percentage Error. Definition: $\mathrm{MPE} = \frac{100}{n}\sum_{i=1}^{n}\frac{y_i - \hat{y}_i}{y_i}$. Lower is better.

Table 5: Performance validation measures.
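A few of these measures can be sketched in code. The forms below are assumed standard definitions (the Gini index is the rank-based version in which actual values are ordered by their predicted values), not necessarily the exact variants used in the paper's tables:

```python
# Reference implementations of selected validation measures from Table 5.
import numpy as np

def gini(y, y_hat):
    """Rank-based Gini: actuals sorted by predicted values."""
    order = np.argsort(y_hat)
    y_sorted = np.asarray(y, float)[order]
    n = len(y_sorted)
    i = np.arange(1, n + 1)
    return 2.0 * np.sum(i * y_sorted) / (n * np.sum(y_sorted)) - (n + 1.0) / n

def rmse(y, y_hat):
    return float(np.sqrt(np.mean((np.asarray(y) - np.asarray(y_hat)) ** 2)))

def mae(y, y_hat):
    return float(np.mean(np.abs(np.asarray(y) - np.asarray(y_hat))))

def ccc(y, y_hat):
    """Concordance correlation coefficient."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    my, mp = y.mean(), y_hat.mean()
    vy, vp = y.var(), y_hat.var()
    cov = np.mean((y - my) * (y_hat - mp))
    return 2.0 * cov / (vy + vp + (my - mp) ** 2)

y_true = [0.0, 1.0, 2.0, 3.0]
print(rmse(y_true, y_true), ccc(y_true, y_true))   # 0.0 and 1.0 for a perfect fit
```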

Appendix B. Classification tree for the claim frequency

Figure 5: Classification tree for the claim frequency.