1 Introduction
Calculating the risk premium is a prime objective for insurance pricing in nonlife actuarial science. The risk premium consists of two parts: the pure premium, which is used to compensate the expected value of future losses, and the risk loading, which is used to cover the excess part of future losses over the pure premium. To estimate the risk loading correctly and at the same time allow classification by tariff features, in this study, we develop a new general framework to calculate the individual risk premiums, including risk loadings, based on an arbitrary set of covariates.
A rich variety of premium principles has been proposed in the actuarial literature for predicting the risk premium of individual policies, for example, Bühlmann (1970), Mack (1997), Wang et al. (1997), Kudryavtsev (2009), and Heras et al. (2018). The standard approach for predicting the risk premium involves a separate analysis of its two parts: the pure premium and the risk loading. The traditional approach is based on generalized linear models (GLMs) (De Jong and Heller, 2008), which provide estimates of the expected losses of individual policies given a number of risk factors. Risk loading is then derived by applying various premium principles, for example, the expected value premium principle, standard deviation premium principle, and Wang premium principle.
Assuming that the random variable $Y_i$ denotes the aggregate claim amount for individual policy $i$, the risk premium of policy $i$, denoted by $H(Y_i)$, can be expressed as a distortion function of the random variable $Y_i$. In the expected value premium principle, the risk premium equals the pure premium plus a percentage of the pure premium, that is,
(1.1) $H(Y_i) = E(Y_i) + \alpha E(Y_i)$
where $\alpha$ denotes the risk loading parameter and $\alpha E(Y_i)$ denotes the corresponding risk loading. In the standard deviation premium principle, the risk premium equals the pure premium plus a percentage of the standard deviation, that is,
(1.2) $H(Y_i) = E(Y_i) + \alpha \sqrt{\mathrm{Var}(Y_i)}$
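As a minimal numerical sketch of the expected value and standard deviation premium principles, the following Python code evaluates both on a small sample of equally likely outcomes. The sample, the loading value, the symbol name alpha, and the function names are hypothetical and serve only to illustrate the arithmetic.

```python
from statistics import fmean, pstdev


def expected_value_premium(losses, alpha):
    """Eq. (1.1): pure premium plus a fraction alpha of the pure premium."""
    pure_premium = fmean(losses)
    return pure_premium + alpha * pure_premium


def standard_deviation_premium(losses, alpha):
    """Eq. (1.2): pure premium plus a fraction alpha of the standard deviation."""
    return fmean(losses) + alpha * pstdev(losses)


# Hypothetical equally likely outcomes: no claim in 4 of 5 scenarios.
sample = [0.0, 0.0, 0.0, 0.0, 1000.0]
print(expected_value_premium(sample, alpha=0.1))      # mean 200 plus 10% of the mean
print(standard_deviation_premium(sample, alpha=0.1))  # mean 200 plus 10% of the sd 400
```

With the same loading value, the two principles charge different amounts because the standard deviation of a zero-inflated loss is typically larger than its mean.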
An alternative approach for predicting the risk premium is to consider the risk premium as a whole by applying the value at risk (VaR) premium principle or the Wang premium principle; see, for example, Wang (1995, 2000), Wang et al. (1997), and Kudryavtsev (2009). Based on the Wang premium principle, the risk premium is expressed as follows:
(1.3) $H(Y_i) = \int_0^{\infty} \Phi\left[\Phi^{-1}\left(S_{Y_i}(y)\right) + \lambda\right] dy$
where $\Phi$ and $\Phi^{-1}$ denote the standard normal cumulative distribution function and its inverse function, respectively; $S_{Y_i}(y)$ represents the survival function of the aggregate claim amount, and $\lambda$ denotes a risk aversion factor.
The VaR premium principle in quantile regression for ratemaking is first discussed in Kudryavtsev (2009). The risk premium is calculated as a quantile of the aggregate claim amount, as follows:
(1.4) $H(Y_i) = Q_{Y_i}(\tau)$
where $Q_{Y_i}(\tau)$ denotes the quantile of the aggregate claim amount and $\tau$ is a given quantile level, such as 95% or 99%. Risk loading is denoted as $Q_{Y_i}(\tau) - E(Y_i)$, which is expressed as the difference between the quantile and the pure premium. This premium principle explains the need for risk loading quite well, as it estimates the maximum possible loss that an individual policy may incur with a given probability $\tau$ during the forecasting period.
Following the VaR premium principle, the quantile premium principle for classification ratemaking is proposed by Heras et al. (2018), and the corresponding risk premium is calculated as follows:
(1.5) $H(Y_i) = E(Y_i) + \alpha \left[Q_{Y_i}(\tau) - E(Y_i)\right]$
where $Q_{Y_i}(\tau)$ denotes the $\tau$-th quantile of the aggregate claim amount, $\alpha$ is the risk loading parameter, and $\alpha [Q_{Y_i}(\tau) - E(Y_i)]$ represents the risk loading, which is the difference between the $\tau$-th quantile of the aggregate claim amount and the pure premium scaled by $\alpha$. The main difference between the VaR premium principle in Eq. (1.4) and the quantile premium principle in Eq. (1.5) is that the risk loading in the quantile premium principle is adjusted by the risk loading parameter $\alpha$.
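Of the principles above, the Wang transform is the least direct to evaluate. The following Python sketch computes it for the empirical distribution of a loss sample by distorting the empirical survival probabilities with $g(u) = \Phi[\Phi^{-1}(u) + \lambda]$. The sample, the value of lam, and the function names are hypothetical; the code illustrates the transform itself and is not the estimation procedure used in this study.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def wang_distortion(u, lam):
    """g(u) = Phi(Phi^{-1}(u) + lam), with g(0) = 0 and g(1) = 1."""
    if u <= 0.0:
        return 0.0
    if u >= 1.0:
        return 1.0
    return _STD_NORMAL.cdf(_STD_NORMAL.inv_cdf(u) + lam)


def wang_premium(losses, lam):
    """Wang premium of the empirical distribution of nonnegative losses."""
    ys = sorted(losses)
    n = len(ys)
    premium = 0.0
    for k, y in enumerate(ys):  # k = 0, ..., n - 1 in ascending order
        # Distorted probability mass: g(P(Y >= y_k)) - g(P(Y > y_k)).
        weight = wang_distortion((n - k) / n, lam) - wang_distortion((n - k - 1) / n, lam)
        premium += y * weight
    return premium


sample = [0.0, 0.0, 0.0, 0.0, 1000.0]
print(wang_premium(sample, lam=0.0))   # no distortion: equals the sample mean
print(wang_premium(sample, lam=0.25))  # positive lam shifts weight to large losses
```

Setting lam to zero recovers the pure premium, which gives a convenient sanity check for any implementation.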
Recently, Baione and Biancalana (2019) propose a two-part quantile premium principle, that is,
(1.6) 
where $Q_{\tilde{Y}_i}(\tau_i)$ denotes the $\tau_i$-th quantile of the aggregate claim amount given that at least one claim has been incurred and $\pi_i$ denotes the probability of incurring at least one claim.
In actuarial practice, some parameters, namely, $\alpha$, $\lambda$, and $\tau$ in Eqs. (1.1)-(1.6), which are called risk loading parameters in this study, need to be determined in advance. To estimate the risk loading parameters, Bühlmann (1985) proposes a top-down method for insurance companies by first controlling the probability of ruin at an acceptable level and then imposing this stability criterion regarding the yield of invested capital. This allows insurance companies to find a total premium to be charged for the whole portfolio and then split it in a fair way among all the individual risks.
While the top-down method is well developed (see, for example, Cossette et al. (2012) and Heras et al. (2018)), the use of covariate information to estimate the risk loading parameters through generalized linear models and quantile regression models has received much less attention. Following this line of study, Baione and Biancalana (2019) extend the work of Heras et al. (2018) by developing a down-top-down method for risk premium calculation in classification ratemaking. They first apply two-part GLMs and the expected value premium principle to calculate the risk premium for each policy at the individual level and then obtain the total risk premium of the whole portfolio by simply aggregating the risk premiums of all individual policies. Finally, the risk loading parameter is defined such that the total risk premiums for all policies are sufficient to cover the total expected losses. However, this approach is debatable because it ignores the risk diversification effect of combining all individual policies, which might result in an overestimated total risk premium for the whole portfolio. Moreover, the total risk premium often relies on the distributional assumption of the GLMs at the individual level; for example, Baione and Biancalana (2019) apply a gamma (GA) regression to fit the nonzero aggregate claim amounts, which might not be appropriate for practical insurance portfolios (Heller et al., 2006; Eling, 2012; Laudagé et al., 2019).
Our work is motivated by the recent works of Heras et al. (2018) and Baione and Biancalana (2019). We extend this branch of the literature by developing a more general top-down framework to calculate the risk loading parameters. We first derive the total risk premium of the portfolio by implementing the bootstrap method, thereby allowing us to obtain the entire distribution of the total risk premium at the collective level, instead of exploring the distribution at the individual level. Given an acceptable confidence level, this approach provides a useful tool for estimating the VaR of a portfolio.
Our method permits estimating risk loading parameters uniquely for various premium principles at the individual level. In this approach, the total risk premium is distributed to the individual policies based on the risk contribution of each policy, so that the sum of the risk premiums of all individual policies is equal to the total risk premium of the whole portfolio, which is shown to be an efficient method in ratemaking by Bühlmann (1985). The risk premiums of different tariff classes can be estimated by either GLMs or quantile regression models that incorporate the covariate information. For comparison, GLMs are used as a benchmark. Traditional quantile regression, fully parametric quantile regression, and quantile regression with coefficient functions are constructed to calculate the risk premium of each individual policy.
Thus, our approach has two advantages: (1) it controls the probability that the aggregate claim amount of the entire portfolio exceeds the total risk premium to an acceptable level; (2) it provides a general framework to determine risk loading parameters objectively for all types of models, such as two-part GLMs and two-part quantile regression models.
The remainder of the article is structured as follows. Sections 2 and 3 summarize the methods to calculate the risk premium based on two-part GLMs and two-part quantile regression models, respectively, at the individual level. Section 4 presents an analysis of the calculation of the total risk premium of a portfolio and its allocation to individual policies. Section 5 applies the proposed method to an empirical data set. Section 6 summarizes and concludes.
2 Risk Premiums Based on Two-Part GLMs
Suppose an insurance portfolio contains $n$ policies. For policy $i$, $I_i$ indicates whether or not the policy has a claim submitted, $Y_i$ represents its aggregate claim amount, $w_i$ denotes its exposure, and $x_i$ stands for a vector of covariates, $i = 1, \dots, n$.
In actuarial practice, the observed aggregate claim amounts of a portfolio usually have a probability mass at zero. In this study, we first implement two-part GLMs to accommodate the probability mass at zero. In a two-part GLM framework, the zero component models the probability of incurring no claim, and the continuous component models the aggregate claim amount given that at least one claim has been incurred. It is common practice to separate the claim probability and the nonzero aggregate claim amount in pricing non-life insurance contracts; see, for example, Frees (2009) and Frees et al. (2013).
For the claim probability, we assume that $I_i$ follows the binomial distribution with parameter $\pi_i$, and consider the conventional logistic regression model:
(2.1) $\mathrm{logit}(\pi_i / w_i) = x_i^\top \beta$
where $\mathrm{logit}(\cdot)$ is the logit function, $x_i$ represents the vector of covariates, and $\beta$ denotes the corresponding regression coefficients to be estimated. The left-hand side of Eq. (2.1) is the log odds ratio per exposure. The logistic regression model in Eq. (2.1) is corrected for risk exposure $w_i$; see De Jong and Heller (2008). Correspondingly, the probability of at least one claim occurring can be obtained by $\pi_i = w_i \exp(x_i^\top \beta) / [1 + \exp(x_i^\top \beta)]$.
For the nonzero aggregate claim amount, we employ the gamma distribution (GA) and the inverse Gaussian distribution (IG) to model its skewness and heavy tail (see Appendix A for further details). Using the log link function, we obtain the following regression model for the mean parameter of the GA and IG distributions:
(2.2) $\ln(\mu_i) = x_i^\top \gamma$
where $x_i$ represents a vector of covariates and $\gamma$ denotes the corresponding regression coefficients to be estimated. The mean parameter is obtained by $\mu_i = \exp(x_i^\top \gamma)$.
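Under the assumption of the GA and IG distributions, the pure premium of policy $i$ is given by
(2.3) $E(Y_i) = \pi_i \mu_i$
where $\pi_i$ is the probability of at least one claim and $\mu_i$ is the mean of the nonzero aggregate claim amount. Specifically, we derive the risk premium of policy $i$ by applying the expected value premium principle in Eq. (1.1):
(2.4) $H(Y_i) = (1 + \alpha) \pi_i \mu_i$
where $\alpha$ is the risk loading parameter in the expected value premium principle.
Similarly, under the standard deviation premium principle in Eq. (1.2), the risk premium of policy $i$ is given by
(2.5) $H(Y_i) = \pi_i \mu_i + \alpha \sqrt{\mathrm{Var}(Y_i)}, \quad \mathrm{Var}(Y_i) = \pi_i \mathrm{Var}(\tilde{Y}_i) + \pi_i (1 - \pi_i) \mu_i^2$
where $\tilde{Y}_i$ denotes the nonzero aggregate claim amount, whose variance is determined by $\mu_i$ and the scale parameter of the GA or IG distribution, and $\alpha$ is the risk loading parameter in the standard deviation premium principle.
The risk premium by applying the Wang transform in Eq. (1.3) is given by
(2.6) $H(Y_i) = \int_0^{\infty} \Phi\left[\Phi^{-1}\left(1 - F_{Y_i}(y)\right) + \lambda\right] dy$
where $\Phi$ is the standard normal cumulative distribution function, $\Phi^{-1}$ is its inverse function, $F_{Y_i}$ denotes the cumulative distribution function of the aggregate claim amount under the two-part GLMs, and $\lambda$ represents the risk aversion parameter of the Wang premium principle.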
3 Risk Premiums Based on Two-part Quantile Regression Models
3.1 Risk Premiums Based on Two-part Quantile Regression Models
To assess the risk premium of individual policies, it is common practice to implement a quantile regression framework, which was introduced by Kudryavtsev (2009) and has been applied in actuarial practice; see Heras et al. (2018) and Baione and Biancalana (2019).
Following the two-part quantile premium principle proposed by Baione and Biancalana (2019), in this study the risk premium of policy $i$ is simply given by
(3.1) 
where $x_i$ stands for a vector of covariates, $p_i$ denotes the probability that policy $i$ incurs no claim, $\tilde{Y}_i$ represents the nonzero aggregate claim amount given that policy $i$ incurs at least one claim, and $Q_{\tilde{Y}_i}(\tau_i^* \mid x_i)$ denotes the $\tau_i^*$-th quantile of $\tilde{Y}_i$.
It is clear that
(3.2) $P(Y_i \le y \mid x_i) = p_i + (1 - p_i) P(\tilde{Y}_i \le y \mid x_i), \quad y \ge 0$
which means that the $\tau$-th quantile function of $Y_i$ is equivalent to the $\tau_i^*$-th quantile function of $\tilde{Y}_i$, that is
(3.3) $Q_{Y_i}(\tau \mid x_i) = Q_{\tilde{Y}_i}(\tau_i^* \mid x_i)$
where
(3.4) $\tau_i^* = (\tau - p_i) / (1 - p_i)$
for a real number $\tau$ in the interval $(p_i, 1)$, which denotes the risk loading parameter in the two-part quantile premium principle and needs to be given in advance. It is worth noting that although Baione and Biancalana (2019) suggest fixing a unique quantile level for $\tilde{Y}_i$ associated with the $i$-th risk class (see Eq. (1.6)), in this study we suggest fixing a unique quantile level $\tau$ for all individual policies (see Eq. (3.1)), which follows the same assumption as Heras et al. (2018). Therefore, it is quite important to directly control the risk loading by choosing the quantile level $\tau$ in the classification ratemaking process.
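The mapping between the quantile level of the aggregate claim amount (including zeros) and the quantile level of the nonzero claim amount can be sketched in a few lines of Python. The function name is hypothetical; the no-claim probabilities are taken from Table 4, and the level 0.95 is only an illustrative choice.

```python
def conditional_quantile_level(tau, p_no_claim):
    """Quantile level of the nonzero claim amount: (tau - p) / (1 - p)."""
    if not 0.0 <= p_no_claim < 1.0:
        raise ValueError("p_no_claim must lie in [0, 1)")
    if not p_no_claim < tau < 1.0:
        # For tau <= p the tau-th quantile of the aggregate claim amount is zero.
        raise ValueError("tau must lie strictly between p_no_claim and 1")
    return (tau - p_no_claim) / (1.0 - p_no_claim)


# Risk classes 1 and 17 in Table 4 have no-claim probabilities 0.798 and 0.871.
for p in (0.798, 0.871):
    print(p, conditional_quantile_level(0.95, p))
```

A common level for the aggregate claim amount therefore translates into a lower level for the nonzero claim amount in classes where claims are rarer.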
In non-life ratemaking, the log link function is quite popular because it is well connected with the multiplicative framework; see Mack (1997) and Kudryavtsev (2009), among others. Similar to the GLMs in Eq. (2.2), we apply the quantile regression model with the log link function, which is given by
(3.6) $\ln Q_{\tilde{Y}_i}(\tau_i^* \mid x_i) = x_i^\top \beta(\tau_i^*)$
where $x_i$ represents the vector of covariates in the quantile regression and $\beta(\tau_i^*)$ denotes the corresponding regression coefficients to be estimated. Note that the vectors of regression coefficients are not the same for different risk classes because of their different quantile levels.
In the following subsections, we discuss how to apply traditional quantile regression, parametric quantile regression, and quantile regression with coefficient functions to determine the risk premiums of individual policies.
3.2 Traditional Quantile Regression Model
Given the quantile level $\tau_i^*$ of policy $i$, we have the following traditional quantile regression:
(3.7) 
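The regression coefficients in Eq. (3.7) can be estimated by solving the following minimization problem with the R package quantreg (Quantile Regression); see Koenker and Bassett (1978) and Koenker and Hallock (2001):
(3.8) $\hat{\beta}(\tau_i^*) = \arg\min_{\beta} \sum_{j} \rho_{\tau_i^*}\left(\ln \tilde{y}_j - x_j^\top \beta\right)$
where the sum runs over the policies with at least one claim, $\tilde{y}_j$ is the observed nonzero aggregate claim amount of policy $j$, and $\rho_{\tau}(u) = u[\tau - I(u < 0)]$ is the check function of Koenker and Bassett (1978).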
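To illustrate the check-loss criterion behind this minimization, the following Python sketch treats the intercept-only special case, in which the minimizer is a sample quantile. It does not reproduce the quantreg implementation; the data and function names are hypothetical.

```python
def check_loss(u, tau):
    """Koenker-Bassett check function: u * (tau - 1{u < 0})."""
    return u * (tau - (1.0 if u < 0 else 0.0))


def sample_quantile_by_check_loss(ys, tau):
    """Minimise sum_i check_loss(y_i - q, tau) over q.

    The objective is piecewise linear in q, so a minimiser can always be
    found among the observed values.
    """
    candidates = sorted(set(ys))
    return min(candidates, key=lambda q: sum(check_loss(y - q, tau) for y in ys))


ys = [3.0, 1.0, 4.0, 1.0, 5.0]
print(sample_quantile_by_check_loss(ys, tau=0.5))  # the sample median
print(sample_quantile_by_check_loss(ys, tau=0.9))  # an upper quantile
```

With covariates, the scalar q is replaced by a linear predictor and the same objective is minimised over the regression coefficients.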
3.3 Parametric Quantile Regression Model
Parametric quantile regression models allow us to apply a wide range of skewed and heavy-tailed distributions to capture flexible shapes and tail behavior in insurance claim data. These distributions include the generalized beta of the second kind distribution (Cummins et al., 1990), the generalized t distribution (McDonald and Newey, 1988), and the generalized gamma (GG) distribution (Noufaily and Jones, 2013).
Compared with traditional quantile regression, parametric quantile regression allows us to consider the impact of covariates on the entire distribution, not merely on its conditional mean. Furthermore, the monotonicity of the quantile function in parametric quantile regression is strictly guaranteed, because the inverse cumulative distribution function of a distribution is itself a quantile function, which obviates the problem of quantile crossing in traditional quantile regression.
To develop a framework of parametric quantile regression for predicting the risk premium in non-life insurance ratemaking, we adopt the GG distribution used in Noufaily and Jones (2013). Since the GG distribution is defined on a real support, we assume that the log of the aggregate claim amount of the $i$-th policy that has at least one claim follows the GG distribution with a location parameter $\mu_i$, a scale parameter, and a shape parameter, with its probability density function given by Stacy et al. (1962):
(3.10) 
where $\tilde{y}_i$ is the observed aggregate claim amount for the $i$-th policy that has at least one claim. We consider only the linear regression form for the location parameter $\mu_i$ of the GG distribution:
(3.11) $\mu_i = x_i^\top \beta$
where $x_i$ represents the vector of covariates and $\beta$ denotes the corresponding regression coefficients to be estimated. It should be noted that the vector of regression coefficients stays the same for different quantile levels.
The quantile function that is associated with density in Eq.(3.10) is given by
(3.12) 
where is an incomplete gamma function, that is, .
Employing the maximum likelihood method, we obtain the estimates of the parameters in the GG regression model with the optim function in R. The log-likelihood of the GG regression model is given by
(3.13) 
3.4 Quantile Regression with Coefficient Functions
One problem associated with a quantile regression model is that its coefficients depend on the quantile level; see Frumento and Bottai (2016, 2017). To solve this problem, Frumento and Bottai (2016) propose a parametric model for the coefficients in the quantile regression and adopt quantile regression coefficients modeling. Specifically, they express the regression coefficients as parametric functions of the quantile level. Quantile regression with coefficient functions has some advantages, including parsimony, efficiency, and simple interpretation. To develop a framework of quantile regression with coefficient functions for predicting the risk premium in non-life insurance ratemaking, we adopt notation similar to that of Frumento and Bottai (2016) as follows:
(3.15) $Q(\tau \mid x_i, \theta) = x_i^\top \beta(\tau \mid \theta)$
where $x_i$ represents the $q$-dimensional vector of covariates, and $\beta(\tau \mid \theta)$ denotes the corresponding coefficient vector as a function of the quantile level $\tau$ and the finite-dimensional parameters $\theta$, namely,
(3.16) $\beta(\tau \mid \theta) = \theta b(\tau)$
where $b(\tau) = [b_1(\tau), \dots, b_k(\tau)]^\top$ is a set of $k$ known functions of the quantile level $\tau$, and $\theta$ is a $q \times k$ matrix with entries $\theta_{jh}$, $j = 1, \dots, q$, $h = 1, \dots, k$.
Note that the quantile regression coefficient associated with the $j$-th covariate is given by
(3.17) $\beta_j(\tau \mid \theta) = \theta_{j1} b_1(\tau) + \dots + \theta_{jk} b_k(\tau)$
where $\theta_j = (\theta_{j1}, \dots, \theta_{jk})$ is the corresponding vector of coefficients to be estimated, and some entries of $\theta_j$ may be set to 0 to allow the regression coefficients to be functions of possibly different subsets of $b(\tau)$.
Thus, the conditional quantile function is given by
(3.18) $Q(\tau \mid x_i, \theta) = x_i^\top \theta b(\tau)$
Note that Eq. (3.18) is associated with the choice of the function $b(\tau)$. In practice, the choice of $b(\tau)$ must ensure that the quantile function is monotonically increasing in $\tau$. For instance, polynomials, splines, trigonometric functions, and the quantile function of the standard normal distribution could be used in practice:
(3.19) 
Estimating the $\tau$-th quantile regression coefficients under model (3.18) requires minimizing the following loss function
(3.20) 
where $I(\cdot)$ is the indicator function. The estimation procedure can be implemented with the R package qrcm (quantile regression coefficients modeling); see Gilchrist (2000) and Frumento and Bottai (2016, 2017).
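The idea of coefficient functions can be sketched numerically as follows. The code assumes, purely for illustration, the basis $b(p) = [1, \Phi^{-1}(p)]$ and a hypothetical $2 \times 2$ parameter matrix; it evaluates the coefficient vector as the matrix times the basis and the resulting conditional quantile. It is not the qrcm estimation routine.

```python
from statistics import NormalDist


def basis(p):
    """Hypothetical basis b(p) = [1, Phi^{-1}(p)] of known functions of p."""
    return [1.0, NormalDist().inv_cdf(p)]


def coefficients(theta, p):
    """beta(p | theta) = theta b(p); one row of theta per covariate."""
    b = basis(p)
    return [sum(t * bh for t, bh in zip(row, b)) for row in theta]


def conditional_quantile(x, theta, p):
    """Q(p | x, theta) = x' theta b(p)."""
    return sum(xj * bj for xj, bj in zip(x, coefficients(theta, p)))


# Hypothetical parameters: the intercept varies with p, the slope does not
# (its second entry is set to 0).
theta = [[7.0, 1.0],
         [0.3, 0.0]]
x = [1.0, 1.0]
for p in (0.1, 0.5, 0.9):
    print(p, conditional_quantile(x, theta, p))
```

Because the only p-dependent term here is an increasing function with a positive weight, the fitted quantiles cannot cross, which is the monotonicity requirement discussed above.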
4 Calculation of Total Risk Premium and Risk Loading Parameters
Sections 2 and 3 show that regardless of whether the two-part GLMs or the quantile regression models are used to calculate the individual risk premium, some risk loading parameters (e.g., the loading parameter $\alpha$, the risk aversion parameter $\lambda$, and the quantile level $\tau$) have to be given subjectively in advance. To overcome this subjectivity, we propose a top-down method to calculate the total risk premium of the portfolio and the risk loading parameters in this section.
4.1 Calculating Total Risk Premium
In Solvency II regulation, the probability that the aggregate claim amount of the whole portfolio exceeds the total risk premium charged by the insurance company should be kept small, such as less than 0.5%.
For the whole portfolio, suppose that the aggregate claim amount $S = \sum_{i=1}^{n} Y_i$ has the cumulative distribution function $F_S$ and its total risk premium is denoted by $\Pi$; then, the probability that the aggregate claim amount of the whole portfolio exceeds its total risk premium is given by
(4.1) $\varepsilon = P(S > \Pi) = 1 - F_S(\Pi)$
From Eq. (4.1), we obtain the total risk premium of the whole portfolio that the insurance company will charge as follows
(4.2) $\Pi = F_S^{-1}(1 - \varepsilon) = Q_S(1 - \varepsilon)$
where $Q_S(1 - \varepsilon)$ denotes the $(1 - \varepsilon)$-th quantile of $S$. In other words, if the probability that the aggregate claim amount for the whole portfolio exceeds the total risk premium is small enough, such as $\varepsilon = 0.5\%$, then the total risk premium for the whole portfolio is the 99.5% quantile of its aggregate claim amount $S$. Hence, the key for controlling the probability $\varepsilon$ and calculating the total risk premium of the whole portfolio is to derive the entire distribution of $S$.
In this subsection, we propose a bootstrap method to calculate the total risk premium of the whole portfolio. First, we generate a sequence of pseudo individual aggregate claim amounts and then predict the total risk premium of the whole portfolio according to the following procedure.
Step 1: Simulate a pseudo-response $y_i^{(b)}$ of the aggregate claim amount for policy $i$ from the density function $f(y_i)$, $i = 1, \dots, n$. Note that the density function $f$ can be the two-part GA distribution or the two-part IG distribution of Eqs. (A.1) and (A.4) in the Appendix, respectively. Hence, a simulation of the aggregate claim amount for the whole portfolio is $S^{(b)} = \sum_{i=1}^{n} y_i^{(b)}$.
Step 2: Use the pseudoresponses to form the bootstrap sample from which to derive the bootstrap replication of by applying the twopart GLMs framework.
Step 3: Repeating these two steps for $b = 1, \dots, B$, we obtain a predictive distribution of aggregate claim amounts for the whole portfolio. As a result, the total risk premium for the whole portfolio is the $(1 - \varepsilon)$-th quantile of the aggregate claim amount for the whole portfolio, and the total pure premium for the whole portfolio is the mean of the aggregate claim amount for the whole portfolio. The total risk loading for the whole portfolio is calculated as the difference between the total risk premium and the total pure premium.
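The simulation part of this procedure can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the claim probabilities, severity means, and gamma shape are hypothetical, the severity is taken to be gamma distributed, and the refitting of the two-part model in Step 2 is omitted, so parameter uncertainty is not propagated.

```python
import math
import random


def simulate_portfolio_totals(claim_probs, severity_means, shape, n_sims, seed=0):
    """Simulate the aggregate claim amount of the whole portfolio n_sims times."""
    rng = random.Random(seed)
    totals = []
    for _ in range(n_sims):
        total = 0.0
        for prob, mean in zip(claim_probs, severity_means):
            if rng.random() < prob:  # at least one claim occurs
                # Gamma with shape k and scale mean / k has the requested mean.
                total += rng.gammavariate(shape, mean / shape)
        totals.append(total)
    return totals


def empirical_quantile(values, level):
    """Smallest observed value whose empirical CDF is at least `level`."""
    ys = sorted(values)
    index = min(len(ys) - 1, max(0, math.ceil(level * len(ys)) - 1))
    return ys[index]


# Hypothetical homogeneous portfolio of 200 policies.
claim_probs = [0.13] * 200
severity_means = [1650.0] * 200
totals = simulate_portfolio_totals(claim_probs, severity_means, shape=0.87, n_sims=2000)

total_pure = sum(totals) / len(totals)          # total pure premium
total_risk = empirical_quantile(totals, 0.995)  # total risk premium
total_loading = total_risk - total_pure         # total risk loading
print(total_pure, total_risk, total_loading)
```

By construction, at most 0.5% of the simulated portfolio totals exceed the total risk premium obtained this way.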
4.2 Calculating Risk Loading Parameters
In the expected value premium principle and the standard deviation premium principle, the risk premium for each individual policy is related to a risk loading parameter $\alpha$. In the Wang premium principle, the risk premium is related to a risk aversion factor $\lambda$. In the quantile premium principle, the risk premium is related to a quantile level $\tau$. These parameters need to be given directly or indirectly to calculate risk premiums.
In the existing literature, these parameters in premium principles are subjectively given. For instance, Heras et al. (2018) propose the quantile level in quantile regression models (see Eq. (1.5)). In the VaR premium principle that Kudryavtsev (2009) proposes (see Eq. (1.4)), the 95% quantile of the aggregate claim amount of an individual policy is used as its individual risk premium and the sum of the individual risk premiums is used as the total risk premium for the whole portfolio. The shortcoming of this approach is that, while it can guarantee that the aggregate claim amount of each policy exceeds its risk premium with a probability of no more than 5%, the probability that the aggregate claim amount of the whole portfolio exceeds its total risk premium may be much less than 5%, owing to the risk diversification effect between individual policies. In other words, the total risk premium obtained by this method may be higher than what is appropriate.
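The diversification effect described above can be illustrated with a small simulation. The sketch below assumes independent, identically distributed policies with a hypothetical claim probability and gamma severities, and compares the sum of the individual 95% quantiles with the 95% quantile of the portfolio total.

```python
import math
import random


def empirical_quantile(values, level):
    """Smallest observed value whose empirical CDF is at least `level`."""
    ys = sorted(values)
    return ys[min(len(ys) - 1, max(0, math.ceil(level * len(ys)) - 1))]


def compare_quantiles(n_policies=50, n_sims=4000, level=0.95, seed=42):
    """Return (sum of individual quantiles, quantile of the portfolio total)."""
    rng = random.Random(seed)
    # Hypothetical two-part losses: claim probability 0.2, gamma severity with mean 1000.
    draws = [
        [rng.gammavariate(2.0, 500.0) if rng.random() < 0.2 else 0.0
         for _ in range(n_policies)]
        for _ in range(n_sims)
    ]
    sum_individual = sum(
        empirical_quantile([row[j] for row in draws], level) for j in range(n_policies)
    )
    portfolio = empirical_quantile([sum(row) for row in draws], level)
    return sum_individual, portfolio


sum_individual_q, portfolio_q = compare_quantiles()
print(sum_individual_q, portfolio_q)
```

In this setting the sum of the individual quantiles is several times the portfolio quantile, which is the overcharging that a top-down allocation is meant to avoid.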
In this subsection, we first calculate the total risk premium for the whole portfolio and then distribute it to individual policies by solving the following equation:
(4.3) $\Pi = \sum_{i=1}^{n} H(Y_i)$
where $\Pi$ is the total risk premium of the whole portfolio and $H(Y_i)$ denotes the risk premium for the $i$-th policy. Table 1 shows the equations for calculating the risk premiums for individual policies under various premium principles. The risk loading parameters in the expected value premium principle and the standard deviation premium principle can be obtained by applying two-part GLMs. The corresponding quantile level in the quantile premium principle and the risk aversion factor in the Wang premium principle may be solved by numerical algorithms. For policy $i$, the unique $\tau$ in Table 1 denotes the quantile level of its aggregate claim amount that contains zero claims, while $\tau_i^*$ represents the quantile level of its nonzero aggregate claim amount.
Premium Principle  Allocation Equation  Relevant Parameters  










5 Application to Ratemaking
The data set we use in this study contains information on full comprehensive Australian insurance policies between 2004 and 2005, which comes from De Jong and Heller (2008); the same data set is analyzed in Heras et al. (2018) and Baione and Biancalana (2019). The insurance portfolio contains 67,856 policies, of which 4,624 have at least one claim. Each record consists of an aggregate claim amount (Claimcst0), the number of claims (Numclaims), the occurrence of a claim (Clm), the exposure, and several covariates, such as age of policyholder, age of vehicle, value of vehicle, area of residence, and body type of vehicle. For simplification and comparative purposes, we consider the same covariates as Heras et al. (2018) in the following application: age of vehicle (Veh_age) and age of driver (Agecat).
The variables in the data set are listed in Table 2. For each policy, we define the aggregate claim amount as the sum of the cost of all claims submitted by the policyholder, assuming that the aggregate amount is zero if the policy has no claim. A histogram of the (positive) aggregate claim amount is given in the left panel of Figure 1. For clarity, the horizontal axis is truncated at $15,000. A total of 65 claims between $15,000 and $57,000 are omitted from this display. A barplot of the claim numbers for those policies that have one or more claims is given in the right panel of Figure 1. In this portfolio, 93.19% of the policies have no claim, and only 0.002947% have four claims each.
Variables  Type  Description 

Agecat  Categorical  Driver’s age category: 1 (youngest), 2, 3, 4, 5, 6 
Veh_age  Categorical  Age of vehicle: 1 (youngest), 2, 3, 4 
Exposure  Continuous  Policy years (between 0 and 1) 
Clm  Discrete  Occurrence of claim (0 = no, 1 = yes) 
Numclaims  Discrete  Number of claims (0, 1, 2, 3, ...) 
Claimcst0  Continuous  Aggregate claim amount of a policy (0 if no claim) 
5.1 Total Risk Premium of the Portfolio
To obtain the total risk premium of the portfolio, we first establish two-part GLMs by assuming that the nonzero aggregate claim amounts follow the GA distribution or the IG distribution, both using two rating factors, Veh_age and Agecat.
Table 3 shows the parameter estimates and the corresponding P-values for both models. For the logistic regression part, the estimates of the two models are identical and all the parameters are highly significant, except for the first level of Veh_age and the sixth level of Agecat; this result is equivalent to that of the two-part model in Heras et al. (2018). Table 3 shows that the IG regression model is more appropriate for fitting the nonzero aggregate claim amounts of individual policies, since its Akaike information criterion (AIC) and Bayesian information criterion (BIC) are much smaller than those of the GA regression model.
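Models  Parameters  Two-part GA: Estimate  Two-part GA: P-value  Two-part IG: Estimate  Two-part IG: P-value
Logistic regression  (Intercept)  -1.907  0.001  -1.907  0.001
Logistic regression  Veh_age: 1  -0.031  0.535  -0.031  0.535
Logistic regression  Veh_age: 3  -0.127  0.004  -0.127  0.004
Logistic regression  Veh_age: 4  -0.221  0.001  -0.221  0.001
Logistic regression  Agecat: 1  0.533  0.001  0.533  0.001
Logistic regression  Agecat: 2  0.334  0.001  0.334  0.001
Logistic regression  Agecat: 3  0.272  0.001  0.272  0.001
Logistic regression  Agecat: 4  0.230  0.001  0.230  0.001
Logistic regression  Agecat: 6  -0.003  0.966  -0.003  0.966
Nonzero aggregate claim amount regression  (Intercept)  7.420  0.001  7.411  0.001
Nonzero aggregate claim amount regression  Veh_age: 1  -0.051  0.323  -0.056  0.445
Nonzero aggregate claim amount regression  Veh_age: 3  0.027  0.546  0.033  0.608
Nonzero aggregate claim amount regression  Veh_age: 4  0.118  0.012  0.130  0.060
Nonzero aggregate claim amount regression  Agecat: 1  0.439  0.001  0.453  0.001
Nonzero aggregate claim amount regression  Agecat: 2  0.215  0.001  0.223  0.008
Nonzero aggregate claim amount regression  Agecat: 3  0.104  0.072  0.106  0.179
Nonzero aggregate claim amount regression  Agecat: 4  0.119  0.040  0.127  0.110
Nonzero aggregate claim amount regression  Agecat: 6  0.084  0.269  0.091  0.387
Nonzero aggregate claim amount regression  Scale parameter  1.149  0.001  0.037  0.001
Log-likelihood  -55900.58  -54844.71
AIC  111839.20  109727.40
BIC  112012.50  109900.80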
Table 4 shows the probability of incurring no claims (ProbNC) for individual policies and the pure premiums for the 24 risk classes obtained with the two-part IG regression model. The total number of policies and the total number of claims are given in columns 4 and 5, respectively. The estimates of the probability of having no claims are the same as those of Heras et al. (2018), but the pure premiums are slightly different, because we use the IG regression model instead of the GA regression model, and the former shows better goodness of fit than the latter does.
RiskClass  Veh_age  Agecat  Npolicies  Nclaims  ProbNC  PureP 

1  2  1  1504  159  0.798  524.99 
2  1  1  1283  111  0.803  484.29 
3  3  1  1643  140  0.818  489.82 
4  2  2  3167  288  0.828  354.88 
5  4  1  1312  115  0.831  499.21 
6  1  2  2160  178  0.833  327.06 
7  2  3  3741  295  0.837  299.95 
8  1  3  2706  212  0.841  276.37 
9  2  4  3919  324  0.843  295.68 
10  3  2  3956  280  0.846  329.89 
11  1  4  2935  180  0.847  272.39 
12  3  3  4826  386  0.853  278.54 
13  4  2  3592  254  0.857  335.36 
14  3  4  4760  349  0.859  274.39 
15  4  3  4494  296  0.865  282.96 
16  4  4  4575  332  0.870  278.61 
17  2  5  2635  182  0.871  213.82 
18  2  6  1621  106  0.871  233.62 
19  1  5  2042  122  0.874  196.81 
20  1  6  1131  73  0.875  215.02 
21  3  5  3088  183  0.884  197.75 
22  3  6  1791  108  0.885  216.05 
23  4  5  2971  161  0.894  200.32 
24  4  6  2004  103  0.894  218.85 

Notes: Column 4 reports the number of policies, column 5 the number of claims, column 6 the probability of having no claims, and column 7 the pure premiums. The 24 risk classes are ordered by the probability of having no claims.
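As a rough check on how the entries of Table 4 follow from the two-part IG model, the following Python sketch recomputes ProbNC and PureP for two risk classes from the rounded estimates in Table 3. Several points are assumptions rather than statements from the source: the signs of the coefficients (chosen so that the results match Table 4), the reading of Veh_age 2 and Agecat 5 as the reference levels, an exposure of one policy year, and the exposure-adjusted form of the claim probability. Small discrepancies arise from rounding of the coefficients.

```python
from math import exp

# Rounded estimates from Table 3 (two-part IG model); signs are assumed.
LOGIT_INTERCEPT = -1.907    # assumed reference class: Veh_age 2, Agecat 5
LOGIT_AGECAT_1 = 0.533
SEVERITY_INTERCEPT = 7.411
SEVERITY_AGECAT_1 = 0.453


def claim_probability(eta, exposure=1.0):
    """Exposure-adjusted logistic model: exposure * exp(eta) / (1 + exp(eta))."""
    return exposure / (1.0 + exp(-eta))


def pure_premium(eta_claim, eta_severity, exposure=1.0):
    """Claim probability times the mean nonzero claim amount (log link)."""
    return claim_probability(eta_claim, exposure) * exp(eta_severity)


# Risk class 17 (Veh_age 2, Agecat 5): Table 4 reports 0.871 and 213.82.
prob_nc_17 = 1.0 - claim_probability(LOGIT_INTERCEPT)
pure_17 = pure_premium(LOGIT_INTERCEPT, SEVERITY_INTERCEPT)

# Risk class 1 (Veh_age 2, Agecat 1): Table 4 reports 0.798 and 524.99.
prob_nc_1 = 1.0 - claim_probability(LOGIT_INTERCEPT + LOGIT_AGECAT_1)
pure_1 = pure_premium(LOGIT_INTERCEPT + LOGIT_AGECAT_1,
                      SEVERITY_INTERCEPT + SEVERITY_AGECAT_1)

print(round(prob_nc_17, 3), round(pure_17, 2))
print(round(prob_nc_1, 3), round(pure_1, 2))
```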
Finally, we approximate the predictive distribution of the aggregate claim amounts by bootstrapping 10,000 times based on the two-part IG regression model. For the current portfolio with 67,856 policies, Figure 2 shows the predictive distribution and Q-Q plots for the aggregate claim amount of the whole portfolio. The mean of the predictive distribution is $18,765,168 and the 99.5% quantile is $20,563,196, which means that if the total risk premium is set to $20,563,196, then the probability that the aggregate claim amount of the whole portfolio is greater than the total risk premium is less than 0.5%.
In the following subsection, we assume that the portfolio remains unchanged and that the total risk premium of the whole portfolio charged by the insurance company is $20,563,196.
5.2 Classification Risk Premiums Based on Two-part GLMs
In two-part GLMs, we can obtain not only the pure premium, but also the standard deviation for individual policies. The risk premium for individual policies can be obtained using the expected value premium principle or the standard deviation premium principle. If the sum of the risk premiums for individual policies equals the total risk premium $\Pi$ of the portfolio, then the risk loading parameter in the expected value premium principle can be expressed as
(5.1) $\alpha = \dfrac{\Pi - \sum_{i=1}^{n} E(Y_i)}{\sum_{i=1}^{n} E(Y_i)}$
The risk loading parameter in the standard deviation premium principle is expressed as
(5.2) $\alpha = \dfrac{\Pi - \sum_{i=1}^{n} E(Y_i)}{\sum_{i=1}^{n} \sqrt{\mathrm{Var}(Y_i)}}$
where $E(Y_i)$ and $\sqrt{\mathrm{Var}(Y_i)}$ are the mean and standard deviation, respectively, of the aggregate claim amount of policy $i$.
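Both loading parameters follow from requiring that the individual risk premiums add up to the total risk premium. A minimal Python sketch with hypothetical means, standard deviations, and total risk premium (function names are illustrative):

```python
def expected_value_loading(total_risk_premium, means):
    """Loading for the expected value principle: (total - sum of means) / sum of means."""
    total_pure = sum(means)
    return (total_risk_premium - total_pure) / total_pure


def standard_deviation_loading(total_risk_premium, means, sds):
    """Loading for the standard deviation principle: (total - sum of means) / sum of sds."""
    return (total_risk_premium - sum(means)) / sum(sds)


# Hypothetical three-policy portfolio.
means = [200.0, 300.0, 500.0]
sds = [400.0, 600.0, 1000.0]
total = 1100.0

alpha_ev = expected_value_loading(total, means)
alpha_sd = standard_deviation_loading(total, means, sds)

# Either allocation reproduces the total risk premium.
print(sum((1.0 + alpha_ev) * m for m in means))
print(sum(m + alpha_sd * s for m, s in zip(means, sds)))
```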
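Similarly, the risk aversion parameter $\lambda$ in the Wang premium principle can be solved from the following equation:
(5.3) $\sum_{i=1}^{n} \int_0^{\infty} \Phi\left[\Phi^{-1}\left(1 - F_{Y_i}(y)\right) + \lambda\right] dy = \Pi$
where $F_{Y_i}$ is the cumulative distribution function of the aggregate claim amount of policy $i$ evaluated at the estimated parameters, and $\Pi$ is the total risk premium of the portfolio.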
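Because the Wang premium increases with the risk aversion parameter, the equation can be solved by a one-dimensional root search. The sketch below uses bisection on a toy portfolio of discrete loss distributions; the policies, the target total, the bracketing interval, and the function names are all hypothetical, and the numerical algorithm actually used in this study is not specified here.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def wang_distortion(u, lam):
    """g(u) = Phi(Phi^{-1}(u) + lam), with g(0) = 0 and g(1) = 1."""
    if u <= 0.0:
        return 0.0
    if u >= 1.0:
        return 1.0
    return _STD_NORMAL.cdf(_STD_NORMAL.inv_cdf(u) + lam)


def wang_premium(outcomes, lam):
    """Wang premium of a discrete loss given as (loss, probability) pairs."""
    premium, survival = 0.0, 1.0  # survival = P(Y >= current loss)
    for loss, prob in sorted(outcomes):
        next_survival = max(0.0, survival - prob)
        premium += loss * (wang_distortion(survival, lam) - wang_distortion(next_survival, lam))
        survival = next_survival
    return premium


def solve_lambda(policies, total_risk_premium, lo=0.0, hi=5.0, tol=1e-10):
    """Bisection for lam such that the individual premiums sum to the total."""
    def gap(lam):
        return sum(wang_premium(p, lam) for p in policies) - total_risk_premium

    if gap(lo) > 0.0 or gap(hi) < 0.0:
        raise ValueError("total risk premium is not bracketed by [lo, hi]")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if gap(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Two hypothetical policies with total pure premium 400 and target total 440.
policies = [
    [(0.0, 0.8), (1000.0, 0.2)],
    [(0.0, 0.9), (2000.0, 0.1)],
]
lam = solve_lambda(policies, total_risk_premium=440.0)
print(lam, sum(wang_premium(p, lam) for p in policies))
```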
Table 5 presents the risk premiums of the 24 risk classes predicted using the two-part IG regression model under various premium principles. We find that the risk premiums of the 24 risk classes are significantly different, and the risk loadings are very close for the expected value premium principle, standard deviation premium principle, and Wang premium principle. Although only the Wang premium principle is a coherent risk measure, the risk premiums obtained from these three premium principles differ little in this case.
RiskClass  ProbNC  PureP  EVPP RiskP  EVPP RiskL  SDPP RiskP  SDPP RiskL  WPP RiskP  WPP RiskL
1  0.798  524.99  542.51  17.52  543.74  18.75  543.17  18.18 
2  0.803  484.29  500.30  16.01  501.59  17.30  501.03  16.74 
3  0.818  489.82  507.30  17.48  507.32  17.50  507.20  17.38 
4  0.828  354.88  366.64  11.76  367.55  12.68  367.22  12.34 
5  0.831  499.21  518.52  19.30  517.04  17.83  517.40  18.18 
6  0.833  327.06  337.82  10.76  338.74  11.68  338.42  11.36 
7  0.837  299.95  309.72  9.76  310.67  10.71  310.35  10.39 
8  0.841  276.37  285.30  8.93  286.24  9.87  285.93  9.57 
9  0.843  295.68  305.57  9.89  306.25  10.56  306.03  10.34 
10  0.846  329.89  341.60  11.71  341.67  11.78  341.64  11.76 
11  0.847  272.39  281.43  9.04  282.12  9.73  281.90  9.52 
12  0.853  278.54  288.25  9.71  288.49  9.95  288.44  9.89 
13  0.857  335.36  348.25  12.89  347.34  11.98  347.62  12.26 
14  0.859  274.39  284.22  9.83  284.19  9.80  284.22  9.83 
15  0.865  282.96  293.64  10.67  293.07  10.11  293.27  10.31 
16  0.870  278.61  289.41  10.80  288.56  9.95  288.85  10.24 
17  0.871  213.82  221.38  7.56  221.46  7.64  221.48  7.66 
18  0.871  233.62  242.17  8.55  241.97  8.34  242.08  8.45 
19  0.874  196.81  203.72  6.92  203.84  7.03  203.85  7.04 
20  0.875  215.02  222.85  7.82  222.71  7.68  222.80  7.77 
21  0.884  197.75  205.24  7.50  204.81  7.06  205.00  7.25 
22  0.885  216.05  224.53  8.48  223.77  7.72  224.05  8.00 
23  0.894  200.32  208.54  8.22  207.48  7.16  207.85  7.53 
24  0.894  218.85  228.16  9.30  226.67  7.82  227.16  8.31 

Notes: Column 2 reports the probabilities of having no claims. Column 3 reports the pure premium. RiskL and RiskP denote risk loadings and risk premiums, respectively. EVPP, SDPP, and WPP denote the expected value premium principle, standard deviation premium principle, and Wang premium principle. The risk loading factor is 3.56% in EVPP and 0.713% in SDPP. The risk aversion parameter in WPP is 0.0159%.
5.3 Classification Risk Premiums Based on Two-Part Quantile Regression Models
In this section, we apply quantile regression models to calculate the risk premiums for individual policies by using the two-part quantile premium principle in Eq.(3.1). In quantile regression models, the risk loading is implicitly included in the risk premium.
The response variable in the quantile regression is the log-transformed nonzero aggregate claim amounts of individual policies (log(claimcst0)). From Eq.(3.4), we observe that to obtain the quantile of the aggregate claim amounts of individual policies that contain zeroes, we need to focus only on those policies that submit at least one claim; then, the quantile level is given by
(5.4) 
where is the quantile level of the nonzero aggregate claim amount for individual policy and is the quantile level of the aggregate claim amount that contains zeroes. The probability of having no claim is estimated using the logistic regression model in Eq.(2.1).
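The mapping between the two quantile levels can be illustrated with a short sketch. It assumes the standard zero-inflated relationship P(Y <= y) = p + (1 - p) P(Y <= y | Y > 0) for y >= 0, where p is the probability of no claim, so that the level among policies with claims is (tau - p) / (1 - p). The function name is hypothetical, and the overall level 0.9618 is hand-picked near the roughly 96% reported in the text; it is not a value stated in the paper.

```python
def nonzero_quantile_level(tau: float, prob_no_claim: float) -> float:
    """Quantile level among policies with at least one claim.

    Assumes a point mass prob_no_claim at zero, so the overall tau-quantile
    equals the tau*-quantile of the nonzero amounts with
    tau* = (tau - prob_no_claim) / (1 - prob_no_claim).
    """
    if not 0.0 <= prob_no_claim < 1.0:
        raise ValueError("prob_no_claim must lie in [0, 1)")
    if tau <= prob_no_claim:
        # The overall quantile is zero, so no nonzero level is defined.
        raise ValueError("tau must exceed prob_no_claim")
    return (tau - prob_no_claim) / (1.0 - prob_no_claim)


# Illustrative overall level (an assumption, chosen near 96%).
tau = 0.9618

# ProbNC values of risk classes 1 and 24 as reported in the tables.
level_class_1 = nonzero_quantile_level(tau, 0.798)
level_class_24 = nonzero_quantile_level(tau, 0.894)
```

With this assumed overall level, the two results round to 0.811 and 0.640, which agree with the first quantile-level column of Table 6 for risk classes 1 and 24.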
Before applying the quantile regression models, we need to choose an appropriate quantile level . In this study, given the total risk premium , the quantile level can be solved from the following equation:
(5.5) 
where is given in Eq.(5.4).
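Solving for the quantile level that reproduces a given total risk premium is a one-dimensional root-finding problem, because the total is nondecreasing in the quantile level. The sketch below uses bisection and, purely as an assumption for illustration, a lognormal distribution for the nonzero claim amounts in place of a fitted quantile regression. All names, exposures, and parameters are hypothetical.

```python
from math import exp
from statistics import NormalDist

STD_NORMAL = NormalDist()


def class_premium(tau, prob_no_claim, log_mean, log_sd):
    """tau-quantile of a zero-inflated lognormal claim amount (assumed model)."""
    if tau <= prob_no_claim:
        return 0.0  # the quantile falls inside the point mass at zero
    tau_star = (tau - prob_no_claim) / (1.0 - prob_no_claim)
    return exp(log_mean + log_sd * STD_NORMAL.inv_cdf(tau_star))


def total_premium(tau, classes):
    """Sum of class premiums weighted by the number of policies per class."""
    return sum(n * class_premium(tau, p, m, s) for n, p, m, s in classes)


def solve_quantile_level(target, classes, lo=0.5, hi=0.999, tol=1e-10):
    """Bisection on tau so that total_premium(tau) matches the target."""
    if not total_premium(lo, classes) <= target <= total_premium(hi, classes):
        raise ValueError("target is outside the bracketed range")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if total_premium(mid, classes) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Hypothetical classes: (policies, prob_no_claim, log_mean, log_sd).
classes = [(1000, 0.80, 7.0, 1.2), (2000, 0.85, 6.8, 1.1), (1500, 0.89, 6.6, 1.0)]

# Build a target from a known level, then recover that level.
target = total_premium(0.96, classes)
tau_hat = solve_quantile_level(target, classes)
```

Bisection is used here because it needs only monotonicity, which also holds when some class premiums are zero at low quantile levels.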
For a given quantile level, we apply the traditional quantile regression, parametric quantile regression, and quantile regression with coefficient functions. The response variable is the log-transformed nonzero aggregate claim amounts of individual policies that submit at least one claim, and the covariates are Veh_age and Agecat, which are the same as those of the mean regression models in the previous section. In the traditional quantile regression model, the covariates are introduced into the log-transformed quantile as follows:
(5.6) 
For parametric quantile regression, we assume that the log-transformed nonzero aggregate claim amounts follow a GG distribution and the covariates are introduced into its mean parameter as follows:
(5.7) 
The quantile regression with coefficient functions is given by
(5.8) 
where is a polynomial function for capturing the relationship between quantile levels and the coefficients of the quantile regression model.
Table 6 reports the risk premiums of 24 risk classes by using traditional quantile regression, parametric quantile regression, and quantile regression with coefficient functions. For the given total risk premium of the portfolio, the appropriate quantile levels are around 96% in these three quantile regression models.
RiskClass  ProbNC  QR Quantile Level  QR RiskP  PQR Quantile Level  PQR RiskP  QRCF Quantile Level  QRCF RiskP
1  0.798  0.811  797.92  0.820  638.76  0.806  845.17  
2  0.803  0.806  634.81  0.816  573.88  0.801  685.69  
3  0.818  0.790  770.09  0.801  566.41  0.785  757.71  
4  0.828  0.777  385.55  0.788  390.23  0.772  400.24  
5  0.831  0.773  736.64  0.785  540.97  0.767  728.17  
6  0.833  0.771  308.46  0.783  349.49  0.766  325.59  
7  0.837  0.766  339.56  0.777  348.96  0.760  342.36  
8  0.841  0.759  267.92  0.771  312.21  0.753  279.09  
9  0.843  0.757  285.29  0.769  319.87  0.751  300.39  
10  0.846  0.752  362.66  0.765  341.59  0.746  350.27  
11  0.847  0.751  225.15  0.763  285.98  0.744  245.04  
12  0.853  0.739  304.62  0.752  304.18  0.733  298.23  
13  0.857  0.732  358.19  0.745  322.98  0.725  332.13  
14  0.859  0.729  254.65  0.743  277.99  0.723  259.88  
15  0.865  0.717  304.88  0.731  286.66  0.710  281.95  
16  0.870  0.706  244.03  0.721  261.36  0.699  244.63  
17  0.871  0.704  180.81  0.719  206.28  0.697  174.54  
18  0.871  0.703  177.61  0.718  218.04  0.696  184.96  
19  0.874  0.696  147.80  0.711  183.65  0.688  143.99  
20  0.875  0.695  152.39  0.711  194.11  0.688  153.14  
21  0.884  0.669  156.17  0.686  176.15  0.661  148.02  
22  0.885  0.669  165.21  0.685  186.14  0.660  159.00  
23  0.894  0.641  146.12  0.659  163.27  0.632  136.56  
24  0.894  0.640  153.23  0.658  172.50  0.631  148.14 

Notes: This table reports the probability of having no claims and risk premiums under different quantile regression models. RiskP denotes risk premiums. QR, PQR and QRCF denote traditional quantile regression, fully parametric quantile regression, and quantile regression with coefficient functions.
5.4 Relationship Between Probability and Quantile Level
The total risk premium of the portfolio should cover the actual aggregate claim amount at the probability level or more. In this subsection, we discuss the choice of probability in Eq.(4.2) for the insurance company and check how that affects the total risk premium of the whole portfolio and the quantile level . We focus on the impact of different probabilities on predicting risk premiums of different risk classes.
Figure 3 shows the range of the total risk premium based on the parametric bootstrap method proposed in Section 4. We observe that if the probability varies between 0.5% and 25%, then the total risk premium of the portfolio is between $20,050,581 and $20,563,196, a spread of $512,615, or about 2.6% of the lower bound.
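The idea of reading the total risk premium off a simulated distribution of aggregate claims can be sketched as follows. This is a simplified Monte Carlo stand-in and not the parametric bootstrap of Section 4: the parameters are held fixed, the policies are assumed to be independent and identical, and the gamma severity, the portfolio size, and all names are hypothetical.

```python
import math
import random


def simulate_total_claims(n_policies, prob_no_claim, shape, scale, n_sims, seed=0):
    """Sorted simulated portfolio totals under an assumed two-part model."""
    rng = random.Random(seed)
    totals = []
    for _ in range(n_sims):
        total = 0.0
        for _ in range(n_policies):
            # A policy has a claim with probability 1 - prob_no_claim.
            if rng.random() >= prob_no_claim:
                total += rng.gammavariate(shape, scale)
        totals.append(total)
    totals.sort()
    return totals


def empirical_quantile(sorted_values, prob):
    """Smallest simulated value whose empirical CDF is at least prob."""
    n = len(sorted_values)
    index = min(n - 1, max(0, math.ceil(prob * n) - 1))
    return sorted_values[index]


# Hypothetical portfolio: 200 policies, 85% chance of no claim each.
totals = simulate_total_claims(200, 0.85, 0.8, 2500.0, 2000, seed=42)

# Total premium that covers aggregate claims with each probability.
premiums = {prob: empirical_quantile(totals, prob) for prob in (0.75, 0.90, 0.995)}
```

A higher coverage probability can only move the premium up in this construction, since each premium is a quantile of the same simulated sample.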
Figure 4 shows the range of the quantile level obtained by traditional quantile regression, parametric quantile regression, and quantile regression with coefficient functions. For these three quantile regression models, as the probability increases from 75% to 99.5%, the quantile level increases only slightly and remains close to 96%.
Generally, as the portfolio size (number of policies) increases, the risk loading ratio, defined as the ratio of total risk loading to total pure premium under the top-down method, should decrease due to the diversification effect. Figure 5 shows the risk premiums of the 24 risk classes under the different probabilities using the three quantile regression models. The differences across the three cases are small. We conclude that, although the probability controls the risk loading of the whole portfolio at the collective level, it has only a small impact on the quantile level and on the risk premiums of different risk classes at the individual level due to the diversification effect, which is consistent with the conclusion drawn from Figure 4.
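The diversification effect on the risk loading ratio can be illustrated with a back-of-the-envelope calculation. The sketch assumes independent, identically distributed policies and a normal approximation to the portfolio total, so that the premium covering the aggregate claim amount with a given probability is n × mean + z × sd × sqrt(n). This approximation, the function name, and the per-policy mean and standard deviation are assumptions for illustration and are not the paper's bootstrap procedure.

```python
from math import sqrt
from statistics import NormalDist


def loading_ratio(n_policies: int, mean: float, std_dev: float, prob: float) -> float:
    """Total risk loading divided by total pure premium (normal approximation)."""
    z = NormalDist().inv_cdf(prob)  # standard normal quantile at the coverage level
    total_pure = n_policies * mean
    total_loading = z * std_dev * sqrt(n_policies)
    return total_loading / total_pure


# Hypothetical per-policy mean and standard deviation, 75% coverage probability.
ratios = {n: loading_ratio(n, 300.0, 2000.0, 0.75) for n in (1_000, 10_000, 100_000)}
```

Under these assumptions the ratio scales with 1 / sqrt(n), so quadrupling the portfolio size halves the loading ratio.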
We conclude that the top-down method proposed in this study guarantees that the total risk premium covers the aggregate claim amount with a probability of 75% or more, and that the classification risk premiums are only weakly affected by the probability selected in advance, which means that the method is robust.
5.5 Comparative Analysis
Heras et al. (2018) propose the quantile premium principle to calculate the risk premiums of individual policies, that is,
(5.9) 
where is the quantile level for the th risk class and ; is the probability of having no claims that can be predicted by a logistic regression model; is the th quantile of the aggregate claim amount; is the pure premium of the th risk class; represents risk loading, which is the difference between the 95% quantile of the aggregate claim amount and the pure premium; is the risk loading parameter.
For ease of comparison with the results of Heras et al. (2018), we assume the total risk premium of the portfolio is , and the risk premiums of different risk classes are recalculated using the quantile premium principle in Heras et al. (2018) with the corresponding risk loading parameter .
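The recalculation described above can be sketched under the reading given in the text: the risk premium of a class is its pure premium plus a loading parameter times the gap between the 95% quantile of the aggregate claim amount and the pure premium. Weighting each class by its number of policies is an assumption made here, and the function names and the two-class portfolio are hypothetical.

```python
def quantile_principle_loading(total_premium, exposures, pure_premiums, quantiles):
    """Loading parameter that makes the class premiums add up to total_premium."""
    total_pure = sum(n * m for n, m in zip(exposures, pure_premiums))
    total_gap = sum(n * (q - m) for n, q, m in zip(exposures, quantiles, pure_premiums))
    return (total_premium - total_pure) / total_gap


def quantile_principle_premiums(theta, pure_premiums, quantiles):
    """Pure premium plus theta times (quantile minus pure premium), per class."""
    return [m + theta * (q - m) for m, q in zip(pure_premiums, quantiles)]


# Hypothetical two-class portfolio (illustrative numbers only).
exposures = [100, 200]            # policies per class
pure_premiums = [500.0, 300.0]    # expected claim amount per policy
quantiles = [2500.0, 1800.0]      # assumed 95% quantiles per policy
total_premium = 115_000.0

theta = quantile_principle_loading(total_premium, exposures, pure_premiums, quantiles)
premiums = quantile_principle_premiums(theta, pure_premiums, quantiles)
```

In this example the loading parameter is 0.01, giving class premiums of 520 and 315, which reproduce the assumed portfolio total when weighted by exposure.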
Table 7 shows the risk premiums of the 24 risk classes using different models. The risk premiums using the expected value premium principle, standard deviation premium principle, and Wang premium principle are very close to those of the quantile regression model in Heras et al. (2018). In other words, for two-part GLMs, given the total risk premium of the whole portfolio , the choice of premium principle has little impact on the risk premiums of individual risk classes.
It is well known that frequently used loss functions, e.g., the root mean square error (RMSE), are not appropriate for measuring prediction accuracy here, because the loss distributions contain a high proportion of zeros and are heavily right-tailed. The use of loss functions is further limited because the observed risk premiums of the different risk classes are unknown. Therefore, we turn to alternative statistical measures: the ordered Lorenz curve and the associated Gini index.
The Gini index is a statistical measure of dispersion developed by the Italian statistician Corrado Gini in 1912. It is often used as a gauge of economic inequality, measuring wealth distribution among a population. The index ranges from 0% to 100%, with 0% representing perfect equality and 100% representing perfect inequality. The subsequent literature is extensive. For example, Frees et al. (2011) develop theoretical properties of this Gini index, and Shi and Yang (2018) apply it to measure the discrepancy between the premium and loss distributions in nonlife ratemaking. In this study, we use the original definition of the Gini index developed by Corrado (1921).
The ordered Lorenz curve plots the cumulative proportion of risk exposure on the horizontal axis against the cumulative distribution of predicted risk premiums on the vertical axis. The associated Gini index is defined as twice the area between the ordered Lorenz curve and the line of equality. A higher Gini index indicates greater heterogeneity among risk classes, with high-risk-premium individuals accounting for much larger percentages of the total risk premium.
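The definition of the Gini index as twice the area between the Lorenz curve and the line of equality can be sketched directly. The sketch assumes that risk classes are ordered by predicted premium per policy from low to high and that the curve is piecewise linear between classes; the ordering convention, the function names, and the example numbers are assumptions rather than details confirmed by the paper.

```python
def lorenz_points(exposures, premiums):
    """Cumulative exposure share (x) and cumulative premium share (y)."""
    order = sorted(range(len(premiums)), key=lambda i: premiums[i])
    total_exposure = sum(exposures)
    total_premium = sum(e * p for e, p in zip(exposures, premiums))
    xs, ys = [0.0], [0.0]
    cum_exposure = cum_premium = 0.0
    for i in order:
        cum_exposure += exposures[i]
        cum_premium += exposures[i] * premiums[i]
        xs.append(cum_exposure / total_exposure)
        ys.append(cum_premium / total_premium)
    return xs, ys


def gini_index(exposures, premiums):
    """Twice the area between the line of equality and the Lorenz curve."""
    xs, ys = lorenz_points(exposures, premiums)
    # Trapezoid rule for the area under the piecewise-linear curve.
    area_under = sum((xs[k + 1] - xs[k]) * (ys[k] + ys[k + 1]) / 2.0
                     for k in range(len(xs) - 1))
    return 1.0 - 2.0 * area_under


# Hypothetical examples: identical premiums versus unequal premiums.
gini_equal = gini_index([1.0, 1.0, 1.0], [250.0, 250.0, 250.0])
gini_unequal = gini_index([1.0, 1.0], [1.0, 3.0])
```

When every class has the same premium the curve coincides with the line of equality and the index is zero; the two-class example with premiums 1 and 3 and equal exposures gives an index of 0.25.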
Figure 7 displays the ordered Lorenz curves and the corresponding Gini indices for the risk premium predictions reported in Table 7, calculated by ranking the risk exposure from large to small. As expected, the three Gini indices from the two-part quantile regression models are the largest of all, exceeding those of the two-part GLMs, which means that quantile regression reveals the heterogeneity of different risk classes more effectively and thus yields more reasonable risk premiums for individual policies. As a graphical comparison that confirms the Gini index results, Figure 6 shows the predicted risk premiums of the 24 risk classes based on the three proposed models. We observe that the risk premiums calculated by the quantile regression models differ more markedly across risk classes.
RiskClass 



