
## 1 Introduction

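Assuming that the random variable $Y_i$ denotes the aggregate claim amount for individual policy $i$, the risk premium of policy $i$ can be expressed as a distortion function of the random variable $Y_i$. In the expected value premium principle, the risk premium equals the pure premium plus a percentage of the pure premium, that is,

$$H(Y_i) = E(Y_i) + \varphi E(Y_i), \tag{1.1}$$

whereas in the standard deviation premium principle the loading is proportional to the standard deviation of the aggregate claim amount:

$$H(Y_i) = E(Y_i) + \varphi \sqrt{\mathrm{Var}(Y_i)}. \tag{1.2}$$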

An alternative approach for predicting the risk premium is to consider the risk premium as a whole by applying the value at risk (VaR) premium principle and the Wang premium principle; see, for example, Wang (1995, 2000), Wang et al. (1997), and Kudryavtsev (2009). Based on the Wang premium principle, the risk premium is expressed as follows:

$$H(Y_i) = \int_0^{\infty} \Phi\left[\Phi^{-1}\big(S_{Y_i}(y)\big) + \rho\right] dy, \tag{1.3}$$

where $\Phi$ and $\Phi^{-1}$ denote the standard normal cumulative distribution function and its inverse function, respectively; $S_{Y_i}(y) = 1 - F_{Y_i}(y)$ represents the survival function of the aggregate claim amount, and $\rho$ denotes a risk factor.

The VaR premium principle in quantile regression for ratemaking is first discussed in Kudryavtsev (2009). The risk premium is calculated as a quantile of the aggregate claim amount, as follows:

$$H(Y_i) = Q_{Y_i}(\tau) = \inf\{u \in \mathbb{R} : F_{Y_i}(u) \ge \tau\}, \tag{1.4}$$

where $Q_{Y_i}(\tau)$ denotes the $\tau$-th quantile of the aggregate claim amount and $\tau$ is a given quantile level, such as 95% or 99%. Risk loading is denoted as $Q_{Y_i}(\tau) - E(Y_i)$, which is expressed as the difference between the quantile and the pure premium. This premium principle explains the need for risk loading quite well, as it estimates the maximum possible loss that an individual policy may incur with a given probability $\tau$ during the forecasting period.

Following the VaR premium principle, the quantile premium principle for classification ratemaking is proposed by Heras et al. (2018), and the corresponding risk premium is calculated as follows:

$$H(Y_i) = E(Y_i) + \varphi\left[Q_{Y_i}(\tau) - E(Y_i)\right]. \tag{1.5}$$

Recently, Baione and Biancalana (2019) propose a two-part quantile premium principle, that is,

$$H(Y_i) = (1 - p_i)\, Q_{Y_i^*}(\tau), \tag{1.6}$$

where $Q_{Y_i^*}(\tau)$ denotes the $\tau$-th quantile of the aggregate claim amount given that at least one claim has been incurred and $1 - p_i$ denotes the probability of incurring at least one claim.

In actuarial practice, some parameters, namely $\varphi$, $\rho$, and $\tau$ in Eqs.(1.1)-(1.6), which are called risk loading parameters in this study, need to be determined in advance. To estimate the risk loading parameters, Bühlmann (1985) proposes a top-down method for insurance companies: first control the probability of ruin at an acceptable level, and then impose this stability criterion on the yield of invested capital. This allows insurance companies to find a total premium to be charged for the whole portfolio and then split it in a fair way among all the individual risks.

Our work is motivated by the recent works of Heras et al. (2018) and Baione and Biancalana (2019). We extend this branch of the literature by developing a more general top-down framework to calculate the risk loading parameters. We first derive the total risk premium of the portfolio by implementing the bootstrap method, which allows us to obtain the entire distribution of the total risk premium at the collective level instead of exploring the distribution at the individual level. Given an acceptable confidence level, this approach provides a useful tool for estimating the VaR of a portfolio.

Our method permits estimating risk loading parameters uniquely for various premium principles at the individual level. In this approach, the total risk premium is distributed to the individual policies based on the risk contribution of each policy, so that the sum of the risk premiums of all individual policies is equal to the total risk premium of the whole portfolio, which is shown to be an efficient method in ratemaking by Bühlmann (1985). The risk premiums of different tariff classes can be estimated by either GLMs or quantile regression models that incorporate covariate information. For comparison, GLMs are used as a benchmark. Traditional quantile regression, fully parametric quantile regression, and quantile regression with coefficient functions are constructed to calculate the risk premium of each individual policy.

Thus, our approach has two advantages: (1) it controls the probability that the aggregate claim amount of the entire portfolio exceeds the total risk premium to an acceptable level; (2) it provides a general framework to determine risk loading parameters objectively for all types of models, such as two-part GLMs and two-part quantile regression models.

The remainder of the article is structured as follows. Sections 2 and 3 summarize the methods to calculate the risk premium based on two-part GLMs and two-part quantile regression models, respectively, at the individual level. Section 4 presents an analysis of the calculation of the total risk premium of a portfolio and its allocation to individual policies. Section 5 applies the proposed method to an empirical data set. Section 6 summarizes and concludes.

## 2 Risk Premiums Based on Two-Part GLMs

Suppose an insurance portfolio contains $N$ policies. For policy $i$ ($i = 1, \dots, N$), $R_i$ indicates whether or not a claim has been submitted, $Y_i$ represents its aggregate claim amount, $w_i$ denotes its exposure, and $x_i$ stands for a vector of covariates.

In actuarial practice, the observed aggregate claim amounts of a portfolio usually have a probability mass at zero. In this study, we first implement two-part GLMs to accommodate the probability mass at zero. In a two-part GLM framework, the zero component models the probability of incurring no claim, and the continuous component models the aggregate claim amount given that at least one claim has been incurred. It is common practice to separate claim probability and non-zero aggregate claim amount in pricing non-life insurance contracts; see, for example, Frees (2009) and Frees et al. (2013).

For claim probability, we assume that $R_i$ follows the binomial distribution with parameter $1 - p_i$, and consider the conventional logistic regression model:

$$\mathrm{logit}\left[\frac{1 - p_i}{w_i}\right] = x_i^{R} \alpha, \tag{2.1}$$

where $\mathrm{logit}(\cdot)$ is the logit function, $x_i^{R}$ represents the vector of covariates, and $\alpha$ denotes the corresponding regression coefficients to be estimated. The left-hand side of Eq.(2.1) is the log odds ratio per exposure. The logistic regression model in Eq.(2.1) is corrected for risk exposure $w_i$; see De Jong and Heller (2008). Correspondingly, the probability of at least one claim occurring can be obtained by $1 - p_i = w_i \exp(x_i^{R}\alpha) / [1 + \exp(x_i^{R}\alpha)]$.

For the non-zero aggregate claim amount, we employ the Gamma distribution (GA) and the inverse Gaussian distribution (IG) to model its skewness and heavy tail (see Appendix A for further details). Using the log link function, we obtain the following regression model for the mean parameter of the GA and IG distributions:

$$\log(\mu_i) = x_i^{\mu} \beta, \tag{2.2}$$

where $x_i^{\mu}$ represents a vector of covariates and $\beta$ denotes the corresponding regression coefficients to be estimated. The mean parameter is obtained by $\mu_i = \exp(x_i^{\mu}\beta)$.

Under the assumption of a GA or IG distribution, the pure premium of policy $i$ is given by

$$E[Y_i;\, p_i, \mu_i] = (1 - p_i)\mu_i. \tag{2.3}$$

Specifically, we derive the risk premium of policy $i$ by applying the expected value premium principle in Eq.(1.1):

$$H(Y_i;\, p_i, \mu_i) = (1 - p_i)\mu_i + \varphi (1 - p_i)\mu_i. \tag{2.4}$$

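Similarly, under the standard deviation premium principle in Eq.(1.2), the risk premium of policy $i$ is given by

$$H(Y_i;\, p_i, \mu_i, \sigma) = \begin{cases} (1 - p_i)\mu_i + \varphi \mu_i \sqrt{(1 - p_i)(p_i + \sigma^2)}, & Y_i \mid R_i = 1 \sim \mathrm{GA}, \\ (1 - p_i)\mu_i + \varphi \mu_i \sqrt{(1 - p_i)(p_i + \mu_i \sigma^2)}, & Y_i \mid R_i = 1 \sim \mathrm{IG}, \end{cases} \tag{2.5}$$

where $\sigma$ is the scale parameter in the GA and IG distributions, and $\varphi$ is the risk loading parameter in the standard deviation premium principle.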
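As an illustration of Eqs.(2.3)-(2.5), the following Python sketch evaluates the pure premium and the two loaded premiums for a single policy. The function names and the numerical inputs are hypothetical and are not taken from the study.

```python
import math


def pure_premium(p, mu):
    """Eq.(2.3): p is the probability of no claim, mu the mean positive claim amount."""
    return (1.0 - p) * mu


def expected_value_premium(p, mu, phi):
    """Eq.(2.4): pure premium plus a proportion phi of the pure premium."""
    return (1.0 + phi) * pure_premium(p, mu)


def std_dev_premium(p, mu, sigma, phi, family="GA"):
    """Eq.(2.5): pure premium plus phi times the standard deviation of Y_i."""
    if family == "GA":
        spread = p + sigma ** 2
    elif family == "IG":
        spread = p + mu * sigma ** 2
    else:
        raise ValueError("family must be 'GA' or 'IG'")
    return pure_premium(p, mu) + phi * mu * math.sqrt((1.0 - p) * spread)


# Hypothetical inputs, chosen only for illustration.
p, mu, sigma, phi = 0.93, 2000.0, 1.2, 0.1
print(pure_premium(p, mu))
print(expected_value_premium(p, mu, phi))
print(std_dev_premium(p, mu, sigma, phi, family="GA"))
```

Setting `phi = 0` reduces both loaded premiums to the pure premium, which is a quick sanity check on any implementation.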

The risk premium obtained by applying the Wang transform in Eq.(1.3) is given by

$$H(Y_i;\, p_i, \mu_i, \sigma) = \int_0^{\infty} \Phi\left[\Phi^{-1}\big(1 - F_{Y_i}(y;\, p_i, \mu_i, \sigma)\big) + \rho\right] dy, \tag{2.6}$$

where $\Phi$ is the standard normal cumulative distribution function, $\Phi^{-1}$ is its inverse function, $F_{Y_i}$ denotes the cumulative distribution function of aggregate claim amounts under the two-part GLMs, and $\rho$ represents the risk aversion parameter of the Wang premium principle.
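The integral in Eq.(2.6) generally has to be evaluated numerically. The sketch below does so with the trapezoidal rule, under a simplifying assumption that is not part of the study: the positive claim amount is exponential with mean `mu` (a Gamma distribution with unit shape), so that the survival function has a closed form. The function names, the truncation point, and the step count are likewise hypothetical.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def survival(y, p, mu):
    """Pr(Y > y), y >= 0, when Y = 0 with probability p and is Exponential(mean mu) otherwise."""
    return (1.0 - p) * math.exp(-y / mu)


def distorted_survival(y, p, mu, rho):
    """Integrand of Eq.(2.6): Phi[Phi^{-1}(1 - F(y)) + rho]."""
    s = survival(y, p, mu)
    if s <= 0.0:
        return 0.0
    if s >= 1.0:
        return 1.0
    return _STD_NORMAL.cdf(_STD_NORMAL.inv_cdf(s) + rho)


def wang_premium(p, mu, rho, upper_multiple=60.0, steps=20_000):
    """Trapezoidal approximation of Eq.(2.6) on [0, upper_multiple * mu]."""
    upper = upper_multiple * mu
    h = upper / steps
    total = 0.5 * (distorted_survival(0.0, p, mu, rho)
                   + distorted_survival(upper, p, mu, rho))
    for j in range(1, steps):
        total += distorted_survival(j * h, p, mu, rho)
    return total * h


# With rho = 0 the distortion vanishes and the premium is the pure premium (1 - p) * mu.
print(wang_premium(0.93, 2000.0, 0.0))
print(wang_premium(0.93, 2000.0, 0.3))
```

A positive `rho` shifts probability mass toward large losses, so the resulting premium exceeds the pure premium.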

## 3 Risk Premiums Based on Two-part Quantile Regression Models

### 3.1 Risk Premiums Based on Two-part Quantile Regression Models

To assess the risk premium of individual policies, it is common practice to implement a quantile regression framework, which was introduced by Kudryavtsev (2009) and has been applied in actuarial practice; see Heras et al. (2018) and Baione and Biancalana (2019).

Following the two-part quantile premium principle proposed by Baione and Biancalana (2019), in this study the risk premium of policy $i$ is simply given by

$$H(Y_i) = (1 - p_i)\, Q_{Y_i^*}(\tau_i^* \mid x_i), \tag{3.1}$$

where $x_i$ stands for a vector of covariates, $p_i$ denotes the probability that policy $i$ incurs no claim, $Y_i^*$ represents the non-zero aggregate claim amount given that policy $i$ incurs at least one claim, and $Q_{Y_i^*}(\tau_i^* \mid x_i)$ denotes the $\tau_i^*$-th quantile of $Y_i^*$.

It is clear that

$$F_{Y_i}(y_i \mid x_i) = \Pr(Y_i = 0 \mid x_i) + \left[1 - \Pr(Y_i = 0 \mid x_i)\right] F_{Y_i^*}(y_i \mid x_i), \tag{3.2}$$

which means that the $\tau_i^*$-th quantile function of $Y_i^*$ is equivalent to the $\tau$-th quantile function of $Y_i$, that is

$$Q_{Y_i^*}(\tau_i^* \mid x_i) = Q_{Y_i}(\tau \mid x_i), \tag{3.3}$$

where

$$\tau_i^* = \frac{\tau - p_i}{1 - p_i}, \tag{3.4}$$

for a real number $\tau$ in the interval $(p_i, 1)$, which denotes the risk loading parameter in the two-part quantile premium principle and needs to be given in advance. It is worth noting that although Baione and Biancalana (2019) suggest fixing a unique quantile level associated with the $i$-th risk class (see Eq.(1.6)), in this study we suggest fixing a unique quantile level $\tau$ for all individual policies (see Eq.(3.1)), which follows the same assumption as the work of Heras et al. (2018). Therefore, it is quite important to directly control the risk loading by choosing the quantile level $\tau$ in the classification ratemaking process.
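The conversion in Eq.(3.4) is easy to check numerically. The short sketch below maps a portfolio-wide quantile level to the policy-specific level for the non-zero claim amount; the function name and the example values are hypothetical.

```python
def conditional_quantile_level(tau, p):
    """Eq.(3.4): quantile level of the non-zero claim amount implied by tau.

    tau is the quantile level of the aggregate claim amount including zeros,
    p is the probability of incurring no claim. Requires p < tau < 1.
    """
    if not (p < tau < 1.0):
        raise ValueError("tau must lie strictly between p and 1")
    return (tau - p) / (1.0 - p)


# Hypothetical values: a 96% level for Y_i and a no-claim probability of 92%.
tau_star = conditional_quantile_level(0.96, 0.92)
print(tau_star)

# Consistency with Eq.(3.2): p + (1 - p) * tau_star recovers tau.
print(0.92 + (1.0 - 0.92) * tau_star)
```

Because $p_i$ differs across policies, a single $\tau$ translates into different levels $\tau_i^*$ for different policies.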

Using Eqs.(3.1) and (3.3), the risk premium of policy $i$ can also be obtained as

$$H(Y_i) = (1 - p_i)\, Q_{Y_i}(\tau \mid x_i) = (1 - p_i)\, Q_{Y_i^*}(\tau_i^* \mid x_i). \tag{3.5}$$

In non-life ratemaking, the log link function is quite popular because it is well connected with the multiplicative framework; see Mack (1997) and Kudryavtsev (2009), among others. Similar to the GLMs in Eq.(2.2), we apply the quantile regression model with the log link function, which is given by

$$\log Q_{Y_i^*}\big(\tau_i^* \mid x_i^{Q}\big) = x_i^{Q} \gamma_{\tau_i^*}, \tag{3.6}$$

where $x_i^{Q}$ represents the $(k+1)$-dimensional vector of covariates in the quantile regression and $\gamma_{\tau_i^*}$ denotes the corresponding regression coefficients to be estimated. Note that the vectors of regression coefficients are not the same for different risk classes because of their different quantile levels.

In the following subsections, we discuss how to apply traditional quantile regression, parametric quantile regression, and quantile regression with coefficient functions to determine the risk premiums of individual policies.

### 3.2 Traditional Quantile Regression Model

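Given the quantile level $\tau_i^*$ of policy $i$, we have the following traditional quantile regression:

$$\log Q_{Y_i^*}\big(\tau_i^* \mid x_i^{Q}\big) = x_i^{Q} \gamma_{\tau_i^*}. \tag{3.7}$$

The estimates of the regression coefficients in Eq.(3.7) can be obtained by solving the following minimization problem with the R package quantreg: Quantile Regression; see Koenker and Bassett (1978) and Koenker and Hallock (2001):

$$\min_{\gamma_{\tau_i^*} \in \mathbb{R}^{k+1}} \left[ \sum_{\log(y_i^*) \ge x_i^{Q}\gamma_{\tau_i^*}} \tau_i^* \left| \log(y_i^*) - x_i^{Q}\gamma_{\tau_i^*} \right| + \sum_{\log(y_i^*) < x_i^{Q}\gamma_{\tau_i^*}} (1 - \tau_i^*) \left| \log(y_i^*) - x_i^{Q}\gamma_{\tau_i^*} \right| \right]. \tag{3.8}$$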
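To make the objective of the traditional quantile regression concrete, the sketch below evaluates the asymmetric absolute loss of Koenker and Bassett (1978) for one fixed quantile level. It is only an evaluation of the loss, not the fitting routine of the quantreg package, and all names and data are hypothetical.

```python
def check_loss(gamma, log_y, X, tau):
    """Asymmetric absolute loss for a candidate coefficient vector gamma.

    log_y : log-transformed positive claim amounts
    X     : rows of covariates (the first entry may be 1.0 for an intercept)
    tau   : a single quantile level in (0, 1)
    """
    total = 0.0
    for yi, xi in zip(log_y, X):
        fitted = sum(g * x for g, x in zip(gamma, xi))
        resid = yi - fitted
        # Residuals at or above the fit are weighted by tau, those below by 1 - tau.
        total += tau * resid if resid >= 0 else (1.0 - tau) * (-resid)
    return total


# Hypothetical intercept-only data.
log_y = [1.0, 2.0, 3.0, 4.0, 10.0]
X = [[1.0] for _ in log_y]

# At tau = 0.5 the sample median (3.0) gives a smaller loss than the sample mean (4.0).
print(check_loss([3.0], log_y, X, 0.5), check_loss([4.0], log_y, X, 0.5))
```

Raising `tau` penalizes under-prediction more heavily, which pushes the minimizer toward the upper tail of the data.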

According to Eqs.(3.1) and (3.7), the risk premium of policy $i$ is given by

$$H(Y_i;\, p_i, \gamma) = (1 - p_i) \exp\big(x_i^{Q} \gamma_{\tau_i^*}\big). \tag{3.9}$$

### 3.3 Parametric Quantile Regression Model

Parametric quantile regression models allow us to apply a wide range of skewed and heavy-tailed distributions to capture flexible shapes and tail behavior in insurance claim data. These distributions include the generalized beta of the second kind distribution (Cummins et al., 1990), the generalized-t distribution (McDonald and Newey, 1988), and the generalized gamma (GG) distribution (Noufaily and Jones, 2013).

Compared with traditional quantile regression, parametric quantile regression allows us to consider the impact of covariates on the entire distribution, not merely on its conditional mean. Furthermore, the monotonicity of the quantile function in parametric quantile regression can be strictly guaranteed, because the inverse cumulative distribution function of a distribution is itself a quantile function, which obviates the problem of quantile crossing in the traditional quantile regression.

To develop a framework of parametric quantile regression for predicting the risk premium in non-life insurance ratemaking, we adopt the GG distribution used in Noufaily and Jones (2013). Since the GG distribution is defined on a real support, we assume that the log of the aggregate claim amount of the $i$-th policy that has at least one claim follows the GG distribution with location parameter $\eta_i$, scale parameter $\omega$, and shape parameter $k$, with its probability density function given by Stacy et al. (1962):

$$f_{\log(Y_i^*)}(y_i;\, \eta_i, \omega, k) = \frac{k^{k - 1/2}}{\omega\, \Gamma(k)} \exp\left[ \sqrt{k}\, \frac{\log(y_i) - \eta_i}{\omega} - k \exp\left( \frac{1}{\sqrt{k}}\, \frac{\log(y_i) - \eta_i}{\omega} \right) \right], \tag{3.10}$$

for $\omega > 0$ and $k > 0$, where $y_i$ is the observed aggregate claim amount for the $i$-th policy that has at least one claim. We consider only the linear regression form for the location parameter of the GG distribution:

$$\eta_i = x_i^{Q} \gamma, \tag{3.11}$$

where $x_i^{Q}$ represents the $(k+1)$-dimensional vector of covariates and $\gamma$ denotes the corresponding regression coefficients to be estimated. It should be noted that the vector of regression coefficients stays the same for different quantile levels.

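The quantile function that is associated with the density in Eq.(3.10) is given by

$$Q_{Y_i^*}\big(\tau_i^* \mid x_i^{Q}\big) = \exp(\eta_i) \left\{ \frac{\Gamma(\tau_i^*, k)}{k} \right\}^{\omega/\sqrt{k}}, \tag{3.12}$$

where $\Gamma(\tau_i^*, k)$ is an incomplete gamma function.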

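Employing the maximum likelihood method, we obtain the estimates of the parameters in the GG regression model with the optim function in R. The log-likelihood of the GG regression model is given by

$$\ell(\gamma, \omega, k) = \sum_{i=1}^{N} \left[ \left(k - \tfrac{1}{2}\right) \log k - \log \omega - \log \Gamma(k) + \sqrt{k}\, \frac{\log y_i - x_i^{Q}\gamma}{\omega} - k \exp\left( \frac{\log y_i - x_i^{Q}\gamma}{\omega \sqrt{k}} \right) \right]. \tag{3.13}$$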

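According to Eqs.(3.1) and (3.12), the risk premium of policy $i$ is given by

$$H(Y_i;\, p_i, \gamma, k, \omega) = (1 - p_i) \exp\big(x_i^{Q}\gamma\big) \left\{ \frac{\Gamma(\tau_i^*, k)}{k} \right\}^{\omega/\sqrt{k}}. \tag{3.14}$$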

### 3.4 Quantile Regression with Coefficient Functions

One problem associated with a quantile regression model is that its coefficients depend on the quantile level; see Frumento and Bottai (2016, 2017). To solve this problem, Frumento and Bottai (2016) propose a parametric model for the coefficients in the quantile regression, known as quantile regression coefficients modeling. Specifically, they express the regression coefficients as parametric functions of the quantile level. Quantile regression with coefficient functions has several advantages, including parsimony, efficiency, and simple interpretation. To develop a framework of quantile regression with coefficient functions for predicting the risk premium in non-life insurance ratemaking, we adopt notation similar to that of Frumento and Bottai (2016), as follows:

$$\log\left[ Q_{Y_i^*}\big(\tau_i^* \mid x_i^{Q}, \theta\big) \right] = x_i^{Q}\, \gamma(\tau_i^* \mid \theta), \tag{3.15}$$

where $x_i^{Q}$ represents the $(k+1)$-dimensional vector of covariates and $\gamma(\tau_i^* \mid \theta)$ denotes the corresponding coefficient vector as a function of the quantile level $\tau_i^*$ and finite-dimensional parameters $\theta$, namely,

$$\gamma(\tau_i^* \mid \theta) = \theta\, b(\tau_i^*), \tag{3.16}$$

where $b(\tau_i^*) = [b_1(\tau_i^*), \dots, b_q(\tau_i^*)]^{\top}$ is a set of $q$ known functions of the quantile level $\tau_i^*$, and $\theta$ is a matrix with entries given by

$$\theta = \begin{bmatrix} \theta_{11} & \theta_{12} & \cdots & \theta_{1q} \\ \vdots & \vdots & \ddots & \vdots \\ \theta_{k1} & \theta_{k2} & \cdots & \theta_{k,q} \\ \theta_{k+1,1} & \theta_{k+1,2} & \cdots & \theta_{k+1,q} \end{bmatrix}_{(k+1) \times q}.$$

Note that the quantile regression coefficient associated with the $j$-th covariate is given by

$$\gamma_j(\tau_i^* \mid \theta) = \theta_{j1} b_1(\tau_i^*) + \cdots + \theta_{jq} b_q(\tau_i^*), \tag{3.17}$$

where $\theta_j = (\theta_{j1}, \dots, \theta_{jq})$ is the corresponding vector of coefficients to be estimated, and some entries of $\theta_j$ may be set to 0 to allow the regression coefficients to be functions of possibly different subsets of $b(\tau_i^*)$.

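Thus, the conditional quantile function is given by

$$\log\left[ Q_{Y_i^*}\big(\tau_i^* \mid x_i^{Q}, \theta\big) \right] = x_i^{Q}\, \gamma(\tau_i^* \mid \theta) = x_i^{Q}\, \theta\, b(\tau_i^*). \tag{3.18}$$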

Note that Eq.(3.18) depends on the choice of the function $b(\tau_i^*)$. In practice, the choice of $b(\tau_i^*)$ must ensure that the quantile is monotonically increasing. For instance, polynomials, splines, trigonometric functions, and the quantile function of the standard normal distribution could be used in practice:

$$b_j(\tau_i^*) = \begin{cases} (\tau_i^*)^2 \\ (\tau_i^*)^3 \\ \Phi^{-1}(1 - \tau_i^*) \\ \cos(2\pi \tau_i^*) \end{cases}, \quad j = 1, \cdots, J. \tag{3.19}$$
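The matrix form in Eqs.(3.16)-(3.18) can be sketched in a few lines of Python. The example assumes a quadratic polynomial basis $b(\tau) = (1, \tau, \tau^2)$, similar in spirit to the polynomial option above; the basis, the entries of `theta`, and the covariate vector are all hypothetical, and the grid check is only a crude numerical test of monotonicity.

```python
def basis(tau):
    """Hypothetical basis b(tau) = (1, tau, tau^2)."""
    return [1.0, tau, tau ** 2]


def coefficients(theta, tau):
    """Eq.(3.16): gamma(tau | theta) = theta b(tau), one entry per row of theta."""
    b = basis(tau)
    return [sum(t * bj for t, bj in zip(row, b)) for row in theta]


def log_quantile(x, theta, tau):
    """Eq.(3.18): log-quantile x theta b(tau) for covariate vector x."""
    return sum(xi * gi for xi, gi in zip(x, coefficients(theta, tau)))


def is_increasing_on_grid(x, theta, grid_size=99):
    """Check on a grid that the fitted quantile increases with tau."""
    values = [log_quantile(x, theta, (j + 1) / (grid_size + 1)) for j in range(grid_size)]
    return all(a < b for a, b in zip(values, values[1:]))


# Hypothetical (k+1) x q matrix with k + 1 = 2 rows and q = 3 basis functions.
theta = [[7.0, 1.0, 0.5],
         [0.2, 0.0, 0.1]]
x = [1.0, 1.0]  # intercept and one dummy covariate

print(coefficients(theta, 0.5))
print(log_quantile(x, theta, 0.5))
print(is_increasing_on_grid(x, theta))
```

Setting an entry of `theta` to zero removes the corresponding basis function from that coefficient, as described after Eq.(3.17).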

Estimating the $\tau_i^*$-th quantile regression coefficients under model (3.18) requires minimizing the following loss function

 (3.20)

where $I(\cdot)$ is the indicator function. The estimation procedure can be implemented with the R package qrcm: quantile regression coefficients modeling; see Gilchrist (2000) and Frumento and Bottai (2016, 2017).

According to Eqs.(3.1) and (3.18), the risk premium of policy $i$ is given by

$$H(Y_i;\, p_i, \theta) = (1 - p_i) \exp\left[ x_i^{Q}\, \theta\, b(\tau_i^*) \right]. \tag{3.21}$$

Sections 2 and 3 show that regardless of whether the two-part GLMs or the quantile regression models are used to calculate the individual risk premium, some risk loading parameters (e.g., $\varphi$, $\rho$, and $\tau$) have to be given subjectively in advance. To overcome this subjectivity, we propose a top-down method to calculate the total risk premium of the portfolio and the risk loading parameters in this section.

### 4.1 Calculating Total Risk Premium

In Solvency II regulation, the probability that the aggregate claim amount of the whole portfolio exceeds the total risk premium charged by the insurance company should be kept small, for example less than 0.5%.

For the whole portfolio, suppose that the aggregate claim amount $S$ has the cumulative distribution function $F_S$ and its total risk premium is denoted by $C$; then, the probability that the aggregate claim amount of the whole portfolio exceeds its total risk premium is given by

$$\Psi = \Pr[S > C] = 1 - F_S(C). \tag{4.1}$$

From Eq.(4.1), we obtain the total risk premium of the whole portfolio that the insurance company will charge as follows:

$$C = F_S^{-1}(1 - \Psi), \tag{4.2}$$

where $F_S^{-1}(1 - \Psi)$ denotes the $(1 - \Psi)$-th quantile of $S$. In other words, if the probability that the aggregate claim amount for the whole portfolio exceeds the total risk premium is small enough, such as $\Psi = 0.5\%$, then the total risk premium for the whole portfolio is the 99.5% quantile of its aggregate claim amount $S$. Hence, the key to controlling the probability $\Psi$ and calculating the total risk premium of the whole portfolio is to derive the entire distribution of $S$.

In this subsection, we propose a bootstrap method to calculate the total risk premium of the whole portfolio. We generate a sequence of pseudo individual aggregate claim amounts and then predict the total risk premium of the whole portfolio according to the following procedure.

Step 1: Simulate a pseudo-response of the aggregate claim amount for policy $i$ from its fitted density function, $i = 1, \dots, N$. Note that the density function can be the two-part GA distribution or the two-part IG distribution of Eqs.(A.1) and (A.4) in the Appendix, respectively. Hence, a simulation of the aggregate claim amount for the whole portfolio is the sum of the $N$ pseudo-responses.

Step 2: Use the pseudo-responses to form the bootstrap sample from which to derive the bootstrap replication of the parameter estimates by applying the two-part GLMs framework.

Step 3: Repeating these two steps a large number of times, we obtain a predictive distribution of aggregate claim amounts for the whole portfolio. As a result, the total risk premium for the whole portfolio is the $(1 - \Psi)$-th quantile of the aggregate claim amount for the whole portfolio, and the total pure premium for the whole portfolio is the mean of the aggregate claim amount for the whole portfolio. The total risk loading for the whole portfolio is the difference between the total risk premium and the total pure premium.
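The procedure can be sketched in Python as follows. This is a simplified illustration rather than the procedure used in the study: it simulates from a two-part Gamma model with fixed, hypothetical parameters and omits the refitting in Step 2, so it reflects process variability only. The function names, the parameterization (shape $1/\sigma^2$, scale $\mu\sigma^2$), and all numbers are assumptions.

```python
import math
import random


def simulate_portfolio_total(policies, rng):
    """One simulated aggregate claim amount S for the whole portfolio (Step 1).

    policies: iterable of (p, mu, sigma), with p the no-claim probability and
    (mu, sigma) the mean and scale of a Gamma positive claim amount.
    """
    total = 0.0
    for p, mu, sigma in policies:
        if rng.random() >= p:  # a claim occurs with probability 1 - p
            shape = 1.0 / sigma ** 2
            total += rng.gammavariate(shape, mu / shape)  # mean = shape * scale = mu
    return total


def empirical_quantile(values, level):
    """Smallest order statistic whose empirical CDF is at least `level`."""
    ordered = sorted(values)
    index = min(len(ordered) - 1, max(0, math.ceil(level * len(ordered)) - 1))
    return ordered[index]


def total_risk_premium(policies, psi, replications, seed=0):
    """Return (C, total pure premium, total risk loading) as described in Step 3."""
    rng = random.Random(seed)
    totals = [simulate_portfolio_total(policies, rng) for _ in range(replications)]
    pure = sum(totals) / len(totals)
    c = empirical_quantile(totals, 1.0 - psi)
    return c, pure, c - pure


# Hypothetical portfolio of 200 identical policies.
portfolio = [(0.93, 2000.0, 1.2)] * 200
print(total_risk_premium(portfolio, psi=0.005, replications=1000, seed=1))
```

Adding Step 2 would require refitting the two-part model to each simulated sample and simulating again from the refitted parameters, which also propagates parameter uncertainty into the predictive distribution.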

In the expected value premium principle and the standard deviation premium principle, the risk premium for each individual policy is related to a risk loading parameter $\varphi$. In the Wang premium principle, the risk premium is related to a risk aversion factor $\rho$. In the quantile premium principle, the risk premium is related to a quantile level $\tau$. These parameters need to be given, directly or indirectly, to calculate risk premiums.

In the existing literature, these parameters in premium principles are given subjectively. For instance, Heras et al. (2018) prespecify the quantile level in quantile regression models (see Eq.(1.5)). In the VaR premium principle that Kudryavtsev (2009) proposes (see Eq.(1.4)), the 95% quantile of the aggregate claim amount of an individual policy is used as its individual risk premium and the sum of the individual risk premiums is used as the total risk premium for the whole portfolio. The shortcoming of this approach is that, while it guarantees that the aggregate claim amount of each policy exceeds its risk premium with a probability of no more than 5%, the probability that the aggregate claim amount of the whole portfolio exceeds its total risk premium may be much less than 5%, because of the risk diversification effect between individual policies. In other words, the total risk premium obtained by this method may be higher than what is appropriate.

In this subsection, we first calculate the total risk premium for the whole portfolio and then distribute it to individual policies by solving the following equation:

$$\sum_{i=1}^{N} H(Y_i) = C, \tag{4.3}$$

where $H(Y_i)$ denotes the risk premium for the $i$-th policy. Table 1 shows the equations for calculating the risk premiums for individual policies under various premium principles. The risk loading parameters in the expected value premium principle and the standard deviation premium principle can be obtained by applying two-part GLMs. The corresponding quantile level in the quantile premium principle and the risk aversion factor in the Wang premium principle may be solved for by numerical algorithms. For policy $i$, the unique $\tau$ in Table 1 denotes the quantile level of its aggregate claim amount that contains zero claims, while $\tau_i^*$ represents the quantile level of its non-zero aggregate claim amount.
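One possible numerical algorithm for the quantile level is bisection, because the left-hand side of Eq.(4.3) increases with $\tau$. The sketch below solves $\sum_i (1 - p_i)\, Q_{Y_i^*}(\tau_i^*) = C$ for $\tau$ under a stand-in assumption that is not taken from the study: the positive claim amount of policy $i$ is lognormal, so that $Q_{Y_i^*}(u) = \exp(\eta_i + s\,\Phi^{-1}(u))$. All names and values are hypothetical.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def portfolio_premium(tau, policies, s):
    """Sum of two-part quantile premiums (1 - p_i) * Q_{Y_i^*}(tau_i^*).

    policies: list of (p, eta), with p the no-claim probability and eta the
    location of log Y_i^*; s is a common log-scale (hypothetical lognormal model).
    """
    total = 0.0
    for p, eta in policies:
        tau_star = (tau - p) / (1.0 - p)  # Eq.(3.4)
        total += (1.0 - p) * math.exp(eta + s * _STD_NORMAL.inv_cdf(tau_star))
    return total


def solve_tau(target, policies, s, tol=1e-10):
    """Bisection for the tau at which the portfolio premium equals `target`."""
    lo = max(p for p, _ in policies) + 1e-12  # tau must exceed every p_i
    hi = 1.0 - 1e-12
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if portfolio_premium(mid, policies, s) < target:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)


# Hypothetical three-policy portfolio; the target is generated at tau = 0.97.
policies = [(0.90, 7.0), (0.93, 7.2), (0.95, 7.5)]
target = portfolio_premium(0.97, policies, 1.0)
print(solve_tau(target, policies, 1.0))
```

The same root-finding scheme applies to the risk aversion factor in the Wang premium principle once the portfolio premium is written as a function of $\rho$.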

## 5 Application to Ratemaking

The data set we use in this study contains information on full comprehensive Australian insurance policies between the years 2004 and 2005, and comes from De Jong and Heller (2008); the same data set is analyzed in Heras et al. (2018) and Baione and Biancalana (2019). The insurance portfolio contains 67,856 policies, of which 4,624 have at least one claim. Each claim record consists of an aggregate claim amount (Claimcst0), claim numbers (Numclaims), occurrence of claim (Clm), exposure, and several covariates, such as age of policyholder, age of vehicle, value of vehicle, area of residence, and body type of vehicle. For simplification and comparative purposes, we consider the same covariates as Heras et al. (2018) in the following application: age of vehicle (Veh_age) and age of driver (Agecat).

The variables in the data set are listed in Table 2. For each policy, we define the aggregate claim amount as the sum of the cost of all claims submitted by the policyholder, assuming that the aggregate amount is zero if the policy has no claim. A histogram of the (positive) aggregate claim amount is given in the left panel of Figure 1. For clarity, the horizontal axis is truncated at \$15,000. A total of 65 claims between \$15,000 and \$57,000 are omitted from this display. A bar-plot of the claim numbers for those policies that have one or more claims is given in the right panel of Figure 1. In this portfolio, most of these policies, up to 93.19%, have only one claim each and only 0.002947% have four claims each.

Figure 1: Histogram of the positive aggregate claim amount (left panel) and bar-plot of the claim numbers for policies with at least one claim (right panel).

### 5.1 Total Risk Premium of the Portfolio

To obtain the total risk premium of the portfolio, we first establish two-part GLMs by assuming that the non-zero aggregate claim amounts follow the GA distribution or the IG distribution, both using two rating factors, Veh_age and Agecat. Table 3 shows the parameter estimates and the corresponding P-values for both models. For the logistic regression part, the estimates of the two models are identical and all the parameters are highly significant, except for the first level of Veh_age and the sixth level of Agecat; this result is consistent with that of the two-part model in Heras et al. (2018). Table 3 shows that the IG regression model is more appropriate for fitting the non-zero aggregate claim amounts of individual policies, since its Akaike information criterion and Bayesian information criterion are much smaller than those of the GA regression model.

Table 4 shows the probability of incurring no claims ($p_i$) for individual policies and the pure premiums for 24 risk classes by using the two-part IG regression model. The total number of policies and the total number of claims are given in columns 4 and 5, respectively. The estimates of the probability of having no claims are the same as those of Heras et al. (2018), but the pure premiums are slightly different, because we use the IG regression model instead of the GA regression model, and the former shows better goodness of fit than the latter.

Finally, we approximate the predictive distribution of the aggregate claim amounts by bootstrapping 10,000 times based on the two-part IG regression model. For the current portfolio with 67,856 policies, Figure 2 shows the predictive distribution and QQ-plot for the aggregate claim amount of the whole portfolio. The mean of the predictive distribution is \$18,765,168 and the 99.5% quantile is \$20,563,196, which means that if the total risk premium is determined as $C$ = \$20,563,196, then the probability that the aggregate claim amount of the whole portfolio is greater than the total risk premium is less than 0.5%.

Figure 2: Predictive distribution of aggregate claim amount (left panel) and QQ-plot of aggregate claim amount (right panel) of the portfolio.

In the following subsection, we assume that the portfolio remains unchanged and that the total risk premium of the whole portfolio charged by the insurance company is \$20,563,196.

### 5.2 Classification Risk Premiums Based on Two-part GLMs

In two-part GLMs, we can obtain not only the pure premium but also the standard deviation for individual policies. The risk premium for individual policies can be obtained using the expected value premium principle or the standard deviation premium principle. If the sum of the risk premiums for individual policies equals the total risk premium of the portfolio, then the risk loading parameter in the expected value premium principle can be expressed as

$$\hat{\varphi} = \frac{C - \sum_{i=1}^{N} \left[(1 - \hat{p}_i)\hat{\mu}_i\right]}{\sum_{i=1}^{N} \left[(1 - \hat{p}_i)\hat{\mu}_i\right]}. \tag{5.1}$$

The risk loading parameter in the standard deviation premium principle is expressed as

$$\hat{\varphi} = \frac{C - \sum_{i=1}^{N} \left[(1 - \hat{p}_i)\hat{\mu}_i\right]}{\sum_{i=1}^{N} \left[\sqrt{(1 - \hat{p}_i)\hat{\mu}_i^2 (\hat{p}_i + \hat{\mu}_i \hat{\sigma}^2)}\right]}, \tag{5.2}$$

where $(1 - \hat{p}_i)\hat{\mu}_i$ and $\sqrt{(1 - \hat{p}_i)\hat{\mu}_i^2 (\hat{p}_i + \hat{\mu}_i \hat{\sigma}^2)}$ are the mean and standard deviation, respectively, of the aggregate claim amount of policy $i$. Similarly, the risk aversion parameter in the Wang premium principle can be solved from the following equation:

$$\sum_{i=1}^{N} \int_0^{\infty} \Phi\left[\Phi^{-1}\big(1 - F_{Y_i}(y_i;\, \hat{\mu}_i, \hat{p}_i, \hat{\sigma})\big) + \rho\right] dy_i = C, \tag{5.3}$$

where $F_{Y_i}(y_i;\, \hat{\mu}_i, \hat{p}_i, \hat{\sigma})$ is the cumulative distribution function of the aggregate claim amount of policy $i$ with estimated parameters $(\hat{\mu}_i, \hat{p}_i, \hat{\sigma})$.

Table 5 presents the risk premiums of 24 risk classes predicted using the two-part IG regression model under various premium principles. We find that the risk premiums of the 24 risk classes are significantly different, and the risk loadings are very close for the expected value premium principle, standard deviation premium principle, and Wang premium principle. Although only the Wang premium principle is a coherent risk measure, the risk premiums obtained from these three premium principles differ little in this case.
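The closed-form loadings in Eqs.(5.1) and (5.2) can be checked with a short Python sketch. The three-policy portfolio, the target premium, and the function names below are hypothetical; the check at the end confirms that the individual premiums add up to $C$.

```python
import math


def loading_expected_value(c, policies):
    """Eq.(5.1): policies is a list of (p_hat, mu_hat)."""
    pure = sum((1.0 - p) * mu for p, mu in policies)
    return (c - pure) / pure


def loading_std_dev_ig(c, policies, sigma):
    """Eq.(5.2), with the IG standard deviation of the two-part claim amount."""
    pure = sum((1.0 - p) * mu for p, mu in policies)
    sd_sum = sum(math.sqrt((1.0 - p) * mu ** 2 * (p + mu * sigma ** 2))
                 for p, mu in policies)
    return (c - pure) / sd_sum


# Hypothetical fitted values and target total risk premium.
policies = [(0.90, 1500.0), (0.93, 2000.0), (0.95, 2600.0)]
c = 460.0
sigma = 0.03

phi_ev = loading_expected_value(c, policies)
phi_sd = loading_std_dev_ig(c, policies, sigma)

# Individual premiums under each principle sum back to C.
total_ev = sum((1.0 + phi_ev) * (1.0 - p) * mu for p, mu in policies)
total_sd = sum((1.0 - p) * mu
               + phi_sd * math.sqrt((1.0 - p) * mu ** 2 * (p + mu * sigma ** 2))
               for p, mu in policies)
print(phi_ev, phi_sd, total_ev, total_sd)
```

Both loadings share the same numerator, the total risk loading $C$ minus the total pure premium, and differ only in how that amount is spread across policies.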
### 5.3 Classification Risk Premiums Based on Two-Part Quantile Regression Models

In this section, we apply quantile regression models to calculate the risk premiums for individual policies by using the two-part quantile premium principle in Eq.(3.1). In quantile regression models, the risk loading is implicitly included in the risk premium. The response variable in the quantile regression is the log-transformed non-zero aggregate claim amount of individual policies ($\log$(Claimcst0)). From Eq.(3.4), we observe that to obtain the quantile of the aggregate claim amounts of individual policies that contains zeroes, we need to focus only on those policies that submit at least one claim; then, the quantile level is given by

$$\tau_i^* = \frac{\tau - p_i}{1 - p_i}, \tag{5.4}$$

where $\tau_i^*$ is the quantile level of the non-zero aggregate claim amount for individual policy $i$ and $\tau$ is the quantile level of the aggregate claim amount that contains zeroes. The probability of having no claim $p_i$ is estimated using the logistic regression model in Eq.(2.1).

Before applying the quantile regression models, we need to choose an appropriate quantile level $\tau$. In this study, given the total risk premium $C$, the quantile level $\tau$ can be solved from the following equation:

$$\sum_{i=1}^{N} (1 - \hat{p}_i)\, Q_{Y_i^*}\big(\tau_i^* \mid x_i^{Q}\big) = C, \tag{5.5}$$

where $\tau_i^*$ is given in Eq.(5.4).

For a given quantile level, we apply the traditional quantile regression, parametric quantile regression, and quantile regression with coefficient functions. The response variable is the log-transformed non-zero aggregate claim amount of individual policies that submit at least one claim, and the covariates are Veh_age and Agecat, which are the same as those of the mean regression models in the previous section. In the traditional quantile regression model, the covariates are introduced into the log-transformed quantile as follows:

$$\log\left[Q_{Y_i^*}\big(\tau_i^* \mid x_i^{Q}\big)\right] = \gamma_{\tau_i^* 0} + \gamma_{\tau_i^* 1}\,\mathrm{Veh\_age1} + \gamma_{\tau_i^* 2}\,\mathrm{Veh\_age3} + \cdots + \gamma_{\tau_i^* 5}\,\mathrm{Agecat1} + \cdots + \gamma_{\tau_i^* 9}\,\mathrm{Agecat6}. \tag{5.6}$$

For parametric quantile regression, we assume that the log-transformed non-zero aggregate claim amounts follow the GG distribution and the covariates are introduced into its location parameter as follows:

$$\begin{aligned} \log\left[Q_{Y_i^*}\big(\tau_i^* \mid x_i^{Q}\big)\right] &= \eta_i + \frac{\omega}{\sqrt{k}} \log\left\{\frac{\Gamma(\tau_i^*, k)}{k}\right\}, \\ \eta_i &= \gamma_0 + \gamma_1\,\mathrm{Veh\_age1} + \gamma_2\,\mathrm{Veh\_age3} + \cdots + \gamma_5\,\mathrm{Agecat1} + \cdots + \gamma_9\,\mathrm{Agecat6}. \end{aligned} \tag{5.7}$$

The quantile regression with coefficient functions is given by

$$\begin{aligned} \log\left[Q_{Y_i^*}\big(\tau_i^* \mid x_i^{Q}\big)\right] &= \gamma_0(\tau_i^*) + \gamma_1(\tau_i^*)\,\mathrm{Veh\_age1} + \gamma_2(\tau_i^*)\,\mathrm{Veh\_age3} + \cdots + \gamma_5(\tau_i^*)\,\mathrm{Agecat1} + \cdots + \gamma_9(\tau_i^*)\,\mathrm{Agecat6}, \\ \gamma_j(\tau_i^*) &= \theta_{0j} + \theta_{1j}\tau_i^* + \theta_{2j}{\tau_i^*}^2, \quad j = 0, 1, \cdots, 9, \end{aligned} \tag{5.8}$$

where $\gamma_j(\tau_i^*)$ is a polynomial function for capturing the relationship between quantile levels and the coefficients of the quantile regression model.

Table 6 reports the risk premiums of 24 risk classes by using traditional quantile regression, parametric quantile regression, and quantile regression with coefficient functions. For the given total risk premium of the portfolio, the appropriate quantile levels are around 96% in these three quantile regression models.

### 5.4 Relationship between probability Ψ and quantile level τ

The total risk premium of the portfolio should cover the actual aggregate claim amount at the probability level $1 - \Psi$ or more. In this subsection, we discuss the choice of the probability $\Psi$ in Eq.(4.2) for the insurance company and check how it affects the total risk premium of the whole portfolio and the quantile level $\tau$. We focus on the impact of different probabilities $\Psi$ on predicting the risk premiums of different risk classes. Figure 3 shows the range of the total risk premium based on the parametric bootstrap method proposed in Section 4. We observe that if the probability $\Psi$ varies between 0.5% and 25%, then the total risk premium of the portfolio is between \$20,050,581 and \$20,563,196, which shows a noticeable difference among these assumptions.

Figure 3: Total Risk Premium of the Whole Portfolio at Probabilities 1−Ψ from 75% to 99.5%.

Figure 4 shows the range of the quantile level $\tau$ obtained by traditional quantile regression, parametric quantile regression, and quantile regression with coefficient functions. For these three quantile regression models, as the probability $1 - \Psi$ increases from 75% to 99.5%, the quantile level $\tau$ increases only slightly and remains around 96%.

Generally, as the portfolio size (number of policies) increases, the risk loading ratio, which is defined as the ratio of total risk loading to total pure premium under the top-down method, should decrease because of the diversification effect. Figure 5 shows the risk premiums of 24 risk classes under the different probabilities using three quantile regression models. We can see that the differences among the three cases are small. We conclude that, although the probability $\Psi$ controls the risk loading of the whole portfolio at the collective level, it has only a small impact on the quantile level $\tau$ and on the risk premiums for different risk classes at the individual level because of the diversification effect, which is consistent with the previous conclusion from Figure 4.

We conclude that the top-down method proposed in this study guarantees that the total risk premium covers the aggregate claim amount with a probability of 75% or more, and that the classification risk premiums are only weakly affected by the probability selected in advance, which indicates that the method is robust.

### 5.5 Comparative Analysis

Heras et al. (2018) propose the quantile premium principle to calculate the risk premiums of individual policies, that is,

 H(Yi)=E(Yi)+φ[QYi(τ∗i)−E(Yi)], (5.9)

where τ∗i is the quantile level for the i-th risk class, which depends on the probability of having no claims (this probability can be predicted by a logistic regression model); QYi(τ∗i) is the τ∗i-th quantile of the aggregate claim amount; E(Yi) is the pure premium of the i-th risk class; QYi(τ∗i)−E(Yi) represents the risk loading, which is the difference between the 95% quantile of the aggregate claim amount and the pure premium; and φ is the risk loading parameter.

For ease of comparison with the results of Heras et al. (2018), we assume a given total risk premium for the portfolio, and the risk premiums of the different risk classes are recalculated using the quantile premium principle in Heras et al. (2018) with the corresponding risk loading parameter φ.
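One way to obtain a risk loading parameter that matches a given portfolio total follows from summing Eq. (5.9) over all policies: the total equals the total pure premium plus φ times the total of the quantile-minus-mean loadings, which can be solved for φ. The sketch below assumes that the pure premiums and quantiles are already available; the function names and the toy numbers are hypothetical and are not taken from Table 7.

```python
def calibrate_phi(pure_premiums, quantiles, total_risk_premium):
    """Solve sum_i [E_i + phi * (Q_i - E_i)] = total_risk_premium for phi."""
    total_pure = sum(pure_premiums)
    total_loading = sum(q - e for q, e in zip(quantiles, pure_premiums))
    if total_loading == 0:
        raise ValueError("quantiles equal pure premiums; phi is not identified")
    return (total_risk_premium - total_pure) / total_loading


def quantile_premiums(pure_premiums, quantiles, phi):
    """Apply H(Y_i) = E(Y_i) + phi * [Q_i - E(Y_i)] to each policy or risk class."""
    return [e + phi * (q - e) for e, q in zip(pure_premiums, quantiles)]


# Toy portfolio with three entries (hypothetical values, for illustration only).
pure = [100.0, 200.0, 300.0]
quant = [400.0, 500.0, 900.0]
target_total = 780.0

phi = calibrate_phi(pure, quant, target_total)
premiums = quantile_premiums(pure, quant, phi)
print(phi, premiums, sum(premiums))
```

Because φ is a single scalar, the calibration fixes only the portfolio total; the relative premiums across risk classes are still driven by the estimated quantiles and pure premiums.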

Table 7 shows the risk premiums of the 24 risk classes using different models. The risk premiums using the expected value premium principle, standard deviation premium principle, and Wang premium principle are very close to those of the quantile regression model in Heras et al. (2018). In other words, for two-part GLMs, given the total risk premium of the whole portfolio, the choice of premium principle has little impact on the risk premiums of individual risk classes.

To measure prediction accuracy, it is well known that frequently used loss functions, e.g., the root mean square error (RMSE), are not appropriate for capturing the difference between the predicted values and the corresponding outcomes, because of the high proportion of zeros and the heavy right tail of the loss distributions. Moreover, the use of loss functions is limited here because the observed risk premiums of the different risk classes are unknown. Therefore, we turn to alternative statistical measures: the ordered Lorenz curve and the associated Gini index.

The Gini index is a statistical measure of distribution developed by the Italian statistician Corrado Gini in 1912. It is often used as a gauge of economic inequality, measuring wealth distribution among a population. The index ranges from 0% to 100%, with 0% representing perfect equality and 100% representing perfect inequality. The subsequent literature is extensive. For example, Frees et al. (2011) develop theoretical properties of this Gini index, and Shi and Yang (2018) apply it to measure the discrepancy between the premium and loss distributions in non-life ratemaking. In this study, we use the original definition of the Gini index developed by Corrado (1921). The ordered Lorenz curve plots the cumulative proportion of risk exposure on the horizontal axis against the cumulative distribution of the predicted risk premiums on the vertical axis. The associated Gini index is defined as twice the area between the ordered Lorenz curve and the line of equality. A higher Gini index indicates greater heterogeneity among risk classes, with high-risk-premium individuals accounting for a much larger share of the total risk premium of the risk exposure.
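The definition above can be sketched numerically as follows. The example builds the Lorenz curve from cumulative exposure shares and cumulative premium shares, integrates it with the trapezoidal rule, and returns twice the area between the curve and the line of equality. Sorting the classes by premium per unit of exposure, from lowest to highest, is an assumption of this sketch, as are the function name `gini_index` and the sample inputs; exposures must be strictly positive.

```python
def gini_index(exposures, premiums):
    """Twice the area between the line of equality and the Lorenz curve."""
    # Assumed ordering: premium per unit of exposure, lowest first.
    classes = sorted(zip(exposures, premiums), key=lambda c: c[1] / c[0])
    total_exposure = sum(e for e, _ in classes)
    total_premium = sum(p for _, p in classes)

    area_under_curve = 0.0
    x_prev = y_prev = 0.0
    cum_exposure = cum_premium = 0.0
    for exposure, premium in classes:
        cum_exposure += exposure
        cum_premium += premium
        x = cum_exposure / total_exposure  # cumulative share of exposure
        y = cum_premium / total_premium    # cumulative share of premium
        # Trapezoid between consecutive points of the Lorenz curve.
        area_under_curve += (x - x_prev) * (y + y_prev) / 2.0
        x_prev, y_prev = x, y

    # The area under the line of equality is 0.5.
    return 2.0 * (0.5 - area_under_curve)


# Hypothetical exposures and predicted premiums for four risk classes.
print(gini_index([120.0, 80.0, 50.0, 10.0], [9000.0, 8000.0, 7500.0, 4000.0]))
```

When every class has the same premium per unit of exposure, the curve coincides with the line of equality and the index is zero; the more the premiums concentrate in a small share of the exposure, the closer the index moves to one.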

Figure 7 displays the ordered Lorenz curves corresponding to the Gini indices of the predicted risk premiums reported in Table 7, which are calculated by ranking the risk exposures from largest to smallest. Relative to the two-part GLMs, the Gini indices of the two-part quantile regression models are the three largest, as expected, which means that quantile regression reveals the heterogeneity of the different risk classes more effectively and can thus yield more reasonable risk premiums for individual policies. As a graphical comparison that confirms the Gini index results, Figure 6 shows the predicted risk premiums of the 24 risk classes based on the three proposed models. We observe that the risk premiums calculated by the quantile regression models differ more markedly across risk classes.