    # Regression for Copula-linked Compound Distributions with Applications in Modeling Aggregate Insurance Claims

In actuarial research, a task of particular interest and importance is to predict the loss cost for individual risks so that informative decisions are made in various insurance operations such as underwriting, ratemaking, and capital management. The loss cost is typically viewed to follow a compound distribution where the summation of the severity variables is stopped by the frequency variable. A challenging issue in modeling such outcome is to accommodate the potential dependence between the number of claims and the size of each individual claim. In this article, we introduce a novel regression framework for compound distributions that uses a copula to accommodate the association between the frequency and the severity variables, and thus allows for arbitrary dependence between the two components. We further show that the new model is very flexible and is easily modified to account for incomplete data due to censoring or truncation. The flexibility of the proposed model is illustrated using both simulated and real data sets. In the analysis of granular claims data from property insurance, we find substantive negative relationship between the number and the size of insurance claims. In addition, we demonstrate that ignoring the frequency-severity association could lead to biased decision-making in insurance operations.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

In actuarial research on nonlife insurance, a task of particular interest and importance is to predict the loss cost for individual risks in an insurer’s book of business. Interpretation and prediction of loss cost of individual policyholders deepens the insurer’s understanding of the risk profile of the entire portfolio, which further leads to better-informed decisions in various insurance operations such as underwriting, ratemaking, and capital management.

The loss cost of a policyholder is jointly determined by the number of claims and the amount of each claim during the contract period. As a result, researchers and practitioners typically view the loss cost outcome to follow a compound or generalized distribution (see Karlis and Xekalaki (2005) and Johnson et al. (2005)). Specifically, the loss cost per policy year, denoted by , can be represented as:

 S=Y1+⋯+YN, (1)

where

is a counting random variable and represents the number of claims, and

(

) is a non-negative continuous random variable and represents the size of the

th claim. The sequence of is further assumed to be independently and identically distributed. Compound distributions have been extensively used in the actuarial science literature for modeling aggregate losses in an insurance system (see, for example, Klugman et al. (2012), Lin (2014), and Albrecher et al. (2017)). In insurance applications, and are referred to as the “frequency” and “severity” components respectively.

In this article, we focus on the regression method for compound distributions when both and are observed. A challenging issue in modeling such outcomes in the regression setting is to accommodate the potential dependence between the number of claims and the size of each individual claim. The goal of this work is to introduce a simple yet flexible regression framework to allow for arbitrary dependence between the frequency and severity distributions.

The current regression approach to studying the aggregate loss relies on the independence assumption between and each . Under such independence assumption, one develops regression models for the number and size of claims separately, which is known as the frequency-severity or two-part model. See Frees (2014) for discussions on various types of two-part models. As a special case, when the frequency is a Poisson variable and the severity is a gamma variable, the loss cost is known to follow a Tweedie distribution (Tweedie (1984)). Jørgensen and de Souza (1994) and Smyth and Jørgensen (2002) have explored fitting the Tweedie’s compound Poisson model to the loss cost data in property insurance.

In addition to actuarial science and insurance, regression models based on compound distributions have been used in many other disciplines as well. In health economics, the two-part model was used to study an individual’s total number of doctor visits resulting from multiple spells of illness in a given period (see, for instance, Silva and Windmeijer (2001)). In marketing, Tellis (1988) employed a special case of the frequency-severity model to study the effect of repetitive advertising on consumer purchasing choices; Aribarg et al. (2010) studied consumer advertisement recognition where an individual’s attention is formulated as a compound model determined by eye fixation frequency and gaze duration. In operational risk, the compound distribution for aggregate losses is the foundation for the determination of the operational risk capital required by the Basel capital framework for banks (Panjer (2006) and Shevchenko (2010)). In psychology, Smithson and Shou (2014) pointed out the applications of this type of model in different areas of psychology such as perception and decision making, where a psychological process is thought to be serially summed from observable component process outputs.

The two-part models in different scientific fields described above employ some common key assumptions, including:

1. [nolistsep]

2. The distribution of does not depend on the values of for ;

3. Conditional on , are independently distributed random variables;

4. Conditional on , the common distribution of does not depend on .

The (conditional) independence assumption between and certainly leads to tractable statistical inference because it allows one to build regression models separately for the frequency and severity components. However, if and

are correlated, ignoring the association between them will lead to serious biases in the inference. First, the regression coefficients in the severity regression model will be inconsistent estimates of the marginal effect of explanatory variables. Second, there is a persistent error in the prediction for the severity given the frequency. Third, the misspecification will introduce bias in the inference for the compound distribution.

Motivated by the above observations, we introduce a novel copula-linked compound distribution and the associated two-part regression framework that allow for arbitrary dependence between the frequency and severity components. Specifically, we employ a parametric copula to construct the joint distribution of frequency and severity variables, thus relax the independence assumption in standard methods. We show that the resulting copula regression framework is able to nest several commonly used approaches as special cases, including the hurdle model, the selection model, and the frequency-severity model, among others. Furthermore, we extend the basic model to accommodate the case of incomplete data due to censoring or truncation. Because of the parametric nature, likelihood-based approaches are proposed for estimation, inference, and diagnostics.

The flexibility of the proposed model is illustrated using both simulated and real data sets. In the numerical experiments, we showcase the impact of ignoring the frequency and severity dependence on the resulting compound distribution. In the empirical study, we apply the proposed method to granular claims data in property insurance. Our analysis detects substantive negative dependency between the number and the size of insurance claims. In addition, we demonstrate the importance of such dependency in some key insurance functions, including underwriting and ratemaking, loss reserving, and capital management. The results suggest that ignorance of frequency-severity dependence could lead to biased decision-making in insurance operations.

To the best of our knowledge, this work is among the first efforts to explicitly incorporate the dependence between the frequency and severity variables of a compound distribution in a regression setting. Recent literature has made some development in this direction, for example, see Czado et al. (2012), Krämer et al. (2013), and Garrido et al. (2016) among others. The fundamental difference between our work and existing studies is that the aforementioned studies examined the relation between the frequency and the average severity , while the proposed method directly looks into the relation between the frequency and the individual severity . Alternative mechanisms for introducing dependence between the frequency and individual severity variables include the correlated random-effect framework as in Olsen and Schafer (2001) and the conditional approach as in Frees et al. (2011). The difficulty with both methods compared to the proposed copula approach is that it is not straightforward to handle incomplete data which is not unusual in insurance applications because of various coverage modifications.

Given that our work fits in the broader literature on multivariate modeling in insurance, it is worth discussing their differences and connections. The current literature on dependence modeling of insurance claims focuses on the joint modeling of multiple outcomes of loss cost that could arise from multiple lines of business (see Frees et al. (2016)), multiple coverage in a single business line (see Shi et al. (2016)), or multiple peril types covered by a policy (see Shi and Yang (2018)). In this line of studies, each loss cost outcome is formulated using either a Tweedie model or a two-part model. Both can be viewed in the framework of the compound distribution (1) where the and each are assumed to be independent with each other. Apart from the association among multiple loss cost outcomes, this work examines a single loss cost outcome, and the focus is on the dependence between the frequency and severity components in the compound model.

The rest of the paper is structured as follows. Section 2 introduces the dependent frequency-severity regression model for the compound distribution and discusses its extension for incomplete data due to censoring and truncation. The likelihood-based methods for estimation, inference, and diagnostics are further discussed. Section 3 provides numerical experiments to show the impact of ignoring the frequency-severity dependence under various settings. Section 4 applies the proposed approach to the loss cost data in property insurance and shows the importance of the frequency-severity dependence in insurance operations. Section 5 concludes the article. The supplementary materials contain additional technical examples, numerical studies, and detailed data analysis.

### 2.1 Basic Model

In the basic setup, we assume that complete information on is observed for each subject, where is a count variable, and are continuous variables. For simplicity, we suppress the subject index in the following presentation. The joint distribution of is built upon the assumption that are conditionally i.i.d. given as opposed to the unconditional i.i.d. assumption in the standard compound distribution. There are several implications of this assumption. First, conditional independence of given introduces correlation among , which departs from the i.i.d. assumption in the standard model; Second, identical distribution of given implies identical marginal distribution of , which is consistent with the i.i.d. assumption in the standard model; Third, the bivariate distribution of are identical given , which nests the independent case in the standard model.

To facilitate presentation, we denote as the variable associated with the common distribution of the sequence . Note that is only defined in the sense of a distribution, not in the sense of a random variable. Under the conditional independence assumption, the associated pmf/pdf function is:

 fN,Y(n,y1,…,yn) ={Pr(N=0)n=0∂∂y1⋯∂ynPr(N=n,Y1≤y1,⋯,Yn≤yn)n>0 =[fN(0)]I(n=0)[fN(n)×fY|N(y1,…,yn|n)]I(n>0) =fN(n)×[n∏j=1fY|N(yj|n)]I(n>0), (2)

where is an indictor function.

The central component to define (2) is the joint distribution of and . To allow for flexible dependence between and , we take a parametric approach and employ a bivariate parametric copula to construct their joint distribution. Refer to Nelsen (2006) and Joe (2014) for an introduction to dependence modeling with copulas. According to the Sklar’s theorem, the joint distribution of and can be expressed in terms of a bivariate copula :

 FN,Y(n,y)=Pr(N≤n,Y≤y)=C(FN(n),FY(y)). (3)

Denote , it follows that

 fN,Y(n,y) =∂∂yPr(N=n,Y≤y) =fY(y)[h(FN(n),FY(y))−h(FN(n−1),FY(y))]. (4)

From above, one finds the conditional distribution of given as:

 FY|N(y|n) =Pr(Y≤y|N=n) =1fN(n)[C(FN(n),FY(y))−C(FN(n−1),FY(y))], (5) fY|N(y|n) =∂∂yPr(Y≤y|N=n) =fY(y)fN(n)[h(FN(n),FY(y))−h(FN(n−1),FY(y))]. (6)

In a regression context, one wants to incorporate exogenous explanatory variables to account for observed heterogeneity in both and . Thus, the marginal models for both and are defined conditional on covariates. For example, in generalized linear models, one could specify and , where is the subject index,

is the vector of covariates,

is the regression coefficients, and denotes the link function. Superscripts and indicate the frequency and severity components respectively.

As a special case, when the copula in (3) is an independence copula, i.e., and each are independent, model (2) reduces to:

 fN,Y(n,y1,…,yn)=fN(n)×[n∏j=1fY(yj)]I(n>0), (7)

where the marginal models of and are totally separable. Since (2) nests (7) as a special case, the usual goodness-of-fit statistics such as the likelihood ratio test could be used to test whether the independence assumption between and is supported by the data.

It is worth stressing several observations in model (2). First, the independence assumption of given implies a specific dependence among the sequence . As pointed out by Liu and Wang (2017), other types of dependence might exists between and . Indeed, more flexible relation among could be accommodated by further specifying a joint distribution of given . Since the focus of this work is the association between and each rather than the association within , we leave this potential generalization of the current model for future investigation. Second, the proposed model is flexible such that several commonly used two-part models can be viewed in the copula framework. Specific examples include the hurdle model (Mullahy (1986)), the selection model (Smith (2003)), and the frequency-severity model (Frees (2014)). Detailed discussions can be found in Section S.1 of the supplementary material. Third, the current representation assumes to be a nonnegative continuous outcome. However, the framework is ready to accommodate discrete outcomes with suitable modifications for (2.1). For instance, could be a count variable in the study of health care utilization under multiple spells of illness.

### 2.2 Incomplete Data

Insurance contracts typically contain some cost sharing features such as deductible and policy limit to reduce the cost of insurers. Due to such coverage modifications, and/or are often not completely observed. Motivated by such observations, we extend the basic copula model to accommodate incomplete data.

Presumably the contract has a per-occurrence deductible and a policy limit . The deductible refers to the maximal amount of loss assumed by the policyholder, and the policy limit represents the maximal possible indemnification from the insurer. Note that both quantities vary by policyholders. Given that deductible and policyholder will affect the frequency and severity observed by the insurer, we denote and as the corresponding modified variables. Hence the modified aggregate loss to the insurer is:

 ˜S=˜Y1+⋯+˜Y˜N.

We consider two cases of incomplete data. The first one corresponds to the per-loss scenario as defined in Klugman et al. (2012). This scenario assumes that all accidents are reported to the insurer regardless of whether the loss amount exceeds the deductbile. In this case, the frequency component is not affected by coverage modifications, thus . However, the severity component will be adjusted by:

 ˜Y=⎧⎪⎨⎪⎩0Y≤dY−ddl.

Thus, the joint distribution of can be shown as:

 f˜N,˜Y(n,y1,…,yn) =[f˜N(0)]I(n=0)[f˜N(n)×f˜Y|˜N(y1,…,yn|n)]I(n>0) =[f˜N(0)]I(n=0)[f˜N(n)×n∏j=1f˜Y|˜N(yj|n)]I(n>0), (8)

where , and

 f˜Y|˜N(y|n) =⎧⎪ ⎪ ⎪ ⎪⎨⎪ ⎪ ⎪ ⎪⎩Pr(˜Y=0|˜N=n)y=0∂∂yPr(˜Y≤y|˜N=n)0

As pointed out by one reviewer, the copula between and stays unchanged since censoring is a monotone increasing function of .

The second one corresponds to the per-payment scenario as defined in Klugman et al. (2012). Differing from the former scenario, the accident with a loss amount below the deductible is unobservable to the insurer. Hence both frequency and severity are modified by coverage modifications. The relation between the original and modified variables are:

 ˜N=I(Y1>d)+⋯+I(YN>d),

and

 ˜Y=⎧⎪⎨⎪⎩−Y≤dY−ddl

To derive the distribution of , we assume, without loss of generality, the first claims are below maximum indemnification, and the rest claims receives maximum payments, i.e. and . Then, we have:

 f˜N,˜Y(n,y1,…,yn) = ∂k∂y1⋯∂ykPr(˜N=n,˜Y1≤y1,⋯,˜Yk≤yk,˜Yk+1=⋯=˜Yn=l−d) = E[∂k∂y1⋯∂ykPr(˜N=n,˜Y1≤y1,⋯,˜Yk≤yk,˜Yk+1=⋯=˜Yn=l−d|N)] = E[(Nn)∂k∂y1⋯∂ykPr(dl,Yn+1,⋯,YN≤d|N)] = E⎡⎣(Nn)k∏j=1Pr(Yj=yj+d)n∏j=k+1Pr(Yj>l)N∏j=n+1Pr(Yj≤d)⎤⎦ = E[(Nn)k∏j=1fY|N(yj+d|N)[1−FY|N(l|N)]n−k[FY|N(d|N)]N−n]. (9)

Though motivated by insurance applications, the above cases are representative of two common mechanisms for incomplete observations, censoring and truncation. Our method relies on the assumption that censoring or truncation is exogenous, i.e. the underlying distribution of and are not affected by such mechanisms.

### 2.3 Inference

Because of the parametric nature of the proposed copula model, parameters can be estimated using likelihood-based approach. Denote model parameters by , where is the vector of parameters in the frequency model, is the vector of parameters in the severity model, and represents association parameters in the bivariate copula. For complete data and censored data, one could employ either two-stage MLE or full MLE. However, for truncated data, only full MLE is available. In the following, we give detailed estimation procedures for the case of complete data. The procedures for the censored and truncated data are similar and thus omitted.

Using the basic model (2), the log likelihood function for subject is shown as:

 li(θ) =logfN(ni)+I(ni>0)×ni∑j=1logfY|N(yij|ni).

Given a random sample , the full log likelihood for the case of complete data can be written as

 L(θ) =m∑i=1logfN(ni)+∑{i:ni>0}ni∑j=1logfY|N(yij|ni) =m∑i=1logfN(ni)−∑{i:ni>0}nilogfN(ni) +∑{i:ni>0}ni∑j=1{logfY(yij)+log[h(FN(ni),FY(yij))−h(FN(ni−1),FY(yij))]}.

One estimation strategy is the full information likelihood method. The full MLE can be obtained as the maximizer of the full log likelihood function . Under regularity conditions, e.g. Theorem 3.3 in Newey and McFadden (1994), is consistent and asymptotically normal. The asymptotic covariance matrix of can be consistently estimated using the inverse of observed information at the full MLE , i.e. .

The above likelihood function also suggests a two-stage estimation strategy. Denote the two stage MLE by , and further denote:

 L1(θf) =m∑i=1logfN(ni) L2(θ) =m∑i=1I(ni>0)×[ni∑j=1logfY|N(yij|ni)],

we have . In the first stage, one estimates the count regression model to obtain by solving . Fixing the parameters in first part , the second stage estimates the conditional model to obtain and by solving . Under the regularity conditions of Theorem 6.1 in Newey and McFadden (1994), is consistent and asymptotically normal. However, the asymptotic covariance matrix of can be tedious to calculate. The advantage of the two-stage MLE is its computational efficiency. Thus, to speed up the computation, we first obtain and then use it as the initial point for the maximization of the full likelihood.

The proposed two-stage approach differs from the inference functions for marginals (IFM) method that is widely used in copula regression (Joe (2005)). The IFM first estimates parameters in the univariate marginal models and then estimates the association parameters in the copula. In our case, the parameters in the severity component and the copula shall be estimated simultaneously. Applying IFM estimation to the proposed copula model will lead to inconsistent estimation because the marginal likelihood for is not observed when .

For model comparison, one could refer to information-based criteria such as AIC or BIC. To assess the goodness-of-fit of the copula model, we suggest the following steps. The adequacy of fit for the count regression can be examined using the standard Pearson’s chi-squared test. The usual diagnostic analysis for neither the marginal distribution of nor the bivariate copula is applicable in our case, for the same reason that the two pieces must be estimated jointly. Therefore, we employ a procedure based on the conditional distribution . Specifically, we calculate the fitted distribution for and . One expects the sequence to be a sample of uniform

provided that the copula model is correctly specified. In addition, one could visualize the adequacy of fit with a normal QQ plot by graphing the empirical quantiles from

against the theoretical quantiles from a standard normal distribution. We demonstrate in detail the usage of the proposed diagnostic tools in Section S.3 of the supplementary material.

## 3 Numerical Experiments

### 3.1 Impact of Dependence between N and Y

This section presents two numerical experiments to emphasize the importance of the dependency between and . Consider a compound distribution , where , , and joint distribution of and is specified by a parametric copula. This setting is of particular interest because of the special case where

is known as Tweedie compound Poisson distribution when

and are independent. As noted by Jorgensen (1987), under parameterizations , , and

, this distribution can be expressed in the form of the exponential dispersion model with a power variance function

for .

The first experiment demonstrates the effect of frequency-severity dependency on the distribution of aggregate loss. The distribution of is calculated using Monte Carlo simulation and is displayed in Figure 1. The first panel uses the Gaussian copula with different levels of dependence measured by Kendall’s . When , the copula model reduces to the independence case which is equivalent to a Tweedie distribution (). The positive (negative) dependence leads to a longer (shorter) tail in the aggregate loss distribution. The second panel compares three copulas (Gaussian, Clayton, and Gumbel) with the same Kendall’s . One observes the effect of tail dependence (upper for Gumbel and lower for Clayton), although not substantial. Figure 1: Empirical CDF of aggregate loss. The left panel simulates data from the Gaussian copula with different Kendall’s tau, and the right panel simulates data from different copulas with the same Kendall’s tau.

The second experiment examines the effect of frequency-severity dependence on the conditional severity distribution. Figure 2 reports the distribution of given at different levels of dependence. In each panel, we show densities , , and . The former two cases correspond to the common practice where the claim amount is not affected by the number of claims given occurrence. The result is indicative of severe misspecification bias when the dependence between frequency and severity is ignored. Figure 2: The conditional distribution of loss amount given number of claims. The four panels correspond to different levels of dependence between claim frequency and severity.

### 3.2 Estimation based on the Joint Distribution of N and Y

This simulation study examines the finite-sample performance of the estimations based on the joint distribution of and , and further demonstrates the inference bias incurred by ignoring the frequency-severity dependence. We consider the Gaussian copula compound model in a regression context. The primary distribution is Poisson and the secondary distribution is gamma with:

 Poisson: log(E(Ni))=log(λi)=βf0+βf1X1i+βf2X2i Gamma: log(E(Yij))=log(αγi)=βs0+βs1X1i+βs2X2i,

where and are i.i.d. and and . In the Gaussian copula, we consider different degrees of dependence. The copula model is estimated using both the two-stage method and the joint MLE, and the results are summarized in Table 1. We report the relative bias and the root mean squared error. The calculations are based on a sample size of 500 with 250 replications. There is no substantial difference in the estimates from the two approaches. For comparison, we also report in the table the results of the standard two-part model where and are assumed to be independent. As anticipated, the estimates for the frequency model is consistent with the copula approach. However, the estimation assuming conditional independence introduces a long-term bias in the severity model, and this bias positively correlates with the association between and .

Additional simulation studies are provided in Section S.2 of the supplementary material to illustrate the estimation for incomplete data. We emphasize that, in contrast to the cases of complete data and censored data, independence estimation will introduce persistent bias in both frequency and severity components of the model when data are truncated.

## 4 Modeling Aggregate Insurance Claims

In nonlife insurance (including property, casualty, and health), the compound distribution (1) is a common approach to modeling aggregate losses in an insurance system. Examples of an insurance system include a single policyholder, a line of business, or a portfolio of contracts. The compound distribution is known as collective risk model in the actuarial literature, and the frequency and severity components are the two building blocks of the model (Klugman et al. (2012)).

In this application, we examine the Wisconsin local government property fund which provides property insurance for local government entities in the state of Wisconsin, such as court houses, school districts, fire stations, etc. We consider the building and contents coverage where the building element covers for the physical structure of a property including its permanent fixtures and fittings, and the contents element covers possessions and valuables within the property that are detached and removable. Similar to most nonlife insurance product, the contract provided by the property fund has a one-year term.

The insurance system in this context corresponds to a policyholder, that is a local government entity. The outcome of interest is the aggregate loss for an entity during the policy year, which is determined by both the number and the size of claims. As discussed in Section 1, the collective risk model implies a frequency-severity approach for modeling the aggregate loss for each policyholder, and the current practice relies on the independence assumption between the two building blocks and in the collective risk model.

The purpose of the analysis below is two-folded: first, we provide empirical evidence of significant negative association between the frequency and severity of insurance claims; second, we show that ignoring the frequency-severity dependence could lead to biased decision-making in insurance operations. In the following sections, we use term “independence model” to refer to the standard frequency-severity model that assumes independence between the frequency and severity components, and “copula model” to refer to the proposed copula approach in Section 2.1 that allows for flexible dependence between the frequency and severity components.

Granular insurance claim data are collected for a portfolio of local government entities for years 2009-2011. For each policyholder, one observes the number of claims and the ground-up loss of each claim during each year. We use data of 2009 and 2010 to develop the model, and data of 2011 for model validation. There are 2,080 and 1,017 policy-year observations in the training data and validation data, respectively.

### 4.1 Exploring Frequency-Severity Association

To explore the relationship between claim frequency and severity, we display in Figure 3 the violin plot of claim size by the number of claims for the portfolio of government entities. To account for exposure, the claim size is normalized by the amount of coverage. First, one observes that given occurrence, the distribution of claim severity correlates with claim frequency. Second, the violin plot suggests a negative relation between claim severity and frequency, i.e. the amount of claims tends to be smaller for policyholders who have more claims. Figure 3: Violin plot of claim amount per \$1,000 coverage by the number of claims.

To further motivate the usage of the proposed copula model, we perform some preliminary analyses to examine the role of frequency-severity dependence in model fitting. Our starting point is the Tweedie model given it is the industry standard in property-casualty insurance for modeling semi-continuous loss cost. Recall that the Tweedie distribution is a Poisson sum of gamma variables where the Poisson and gamma variables are assumed to be independent. To examine the role of dependence, we further allow the Possion and gamma variables in the Tweedie distribution to be correlated. Specifically, we fit a copula model for the aggregate loss where the frequency is a Poisson variable, and the severity is a gamma variable, and their joint distribution is specified by a bivariate Gaussian copula. The association parameter in the Gaussian copula is estimated to be

with a standard error of

. This result is consistent with the pattern suggested by Figure 3.

To compare the Tweedie and copula models, we present in Figure 4 two goodness-of-fit plots. Denote

as the cumulative distribution function (CDF) of aggregate loss. The left figure shows the fitted CDF of the aggregate loss from the two parametric models along with the empirical estimate. Since the plot of CDF emphasizes the center of the distribution, it is not ideal to visualize the effects of extremal large values. To further investigate the tail fit, the right figure plots

between the empirical distribution and the two parametric (Tweedie and copula) models.

On one hand, both plots indicate that the copula model exhibits superior fit to the Tweedie model, emphasizing the importance of frequency-severity dependence. On the other hand, there is still room for improvement of goodness-of-fit in both the center and the tail of the distribution. This suggests considering more flexible distributions for marginal behavior. To illustrate, we fit another copula model using zero-one inflated negative binomial distribution for claim frequency, the generalized beta of the second kind (GB2) distribution for claim severity, and a Gaussian copula between the two components. The estimated association parameter is

with a standard error of . The corresponding goodness-of-fit plots are also shown in Figure 4. As anticipated, refined marginal models improve the fit, especially in the heavy right tail.
Overall, the preliminary analyses suggests that there is significant negative dependence between claim frequency and severity, and accounting for such association enhances the goodness-of-fit for the aggregate loss distribution. Figure 4: Comparison between empirical and parametric Cumulative Distribution Functions (CDF, denoted by FS(s)) of aggregate loss.

### 4.2 Empirical Analysis

The observation in Section 4.1 motivates us to jointly examine the frequency and severity components in the collective risk model. Differing from the earlier preliminary analysis, first, we explore using more flexible marginal distributions for modeling the number and the size of insurance claims; second, we incorporate covariates to account for observed heterogeneity, and thus the relation between frequency and severity is interpreted as residual dependence; third, we consider various copula that offer different types of dependence in modeling the frequency-severity relationship.

To facilitate model specification, we examine the distributions of both claim frequency and severity, as well as their relationship with available explanatory variables. The insurance database contains policyholder-specific and claim-specific information that one could use to account for the variation in claim frequency and severity. Details of such covariate information are provided in Section S.3 of the supplementary material. For claim frequency, we consider the policy-level characteristics, including entity type (whether a policyholder is a city, county, township, village, or others), alarm credit (whether a policyholder receives a credit for alarm system), the level of deductible, and the amount of coverage. For illustration, we exhibit in Table 2 the empirical distribution of the number of claims per policyholder in the training data. As usually observed in insurance claims data, the majority of policyholders (about 70%) has zero claims over the year. However, this percentage is much smaller than private lines of business such as personal automobile insurance. Another striking feature of claim counts is there is an excess of ones in addition to the zero inflation. We further break down the frequency distribution by entity type, as shown in Table 2 and visualized in Figure 5. The substantial variation suggests that entity type is an important predictor for the claim count.

Table 3

summarizes the empirical quantiles of claim amounts. There are in total 1,381 claims in the sampling period. The descriptive statistics indicates that claim amount is skewed and heavy-tailed distributed. For claim severity, besides policy-level information, we look into the effects of claim-level information such as peril type, occurrence time, and reporting delay. As an example, Table

3 shows the empirical distribution of claim amount by peril type and by occurrence time. The claim amount due to fire and water damages tends to be larger compared to other perils, and the loss events occurred in the summer is more likely to result in higher claims. The pattern is also displayed in the violin plot of the claim severity in log scale in Figure 6. The plot reinforces the skewness in the severity distribution and stresses the heterogeneity across occurrence and peril type. Figure 6: Violin plots of claim severity. The left and right panels show severity distributions by peril and occurrence respectively.

In the final model, we consider a zero-one inflated negative binomial regression for claim frequency:

 fN(ni)=p0iI(ni=0)+p1iI(ni=1)+(1−p0i−p1i)gN(ni), (10)

where (

) is specified using a multinomial logistic regression:

 pki=exp(x′iβfk)1+∑1k=0exp(x′iβfk),  k=0,1,

and is a standard negative binomial model:

 gN(ni)=Γ(η+ni)Γ(η)Γ(ni+1)[ηη+exp(xiβf)]η[exp(xiβf)η+exp(xiβf)]ni,

with being the dispersion parameter. This specification allows to accommodate the excess of both zeros and ones in the claim count. To accommodate the skewness and heavy-tails, a parametric regression based on GB2 distribution is employed for claim severity (for instance, see Shi (2014) for details on GB2 regression):

 fY(yij)=[exp(wij)]ϕ1yij|σ|B(ϕ1,ϕ2)[1+exp(wij)]ϕ2, (11)

where and are shape parameters, is the scale parameter, and . A parametric bivariate copula is employed to construct the joint distribution of and . We consider commonly used bivariate copulas from the elliptical and Archimedean families, including Gaussian, , Clayton, Frank, Gumbel, and Joe. For the Archimedean copulas that only allow for positive association, we consider the associated 90 and 270 degree rotated copulas.

The copulas models are estimated using likelihood-based estimation described in Section 2.3. The corresponding goodness-of-fit statistics are reported in Table 4. The independence model is presented as a benchmark. Model selection criteria AIC and BIC recommend the Gaussian copula model. It appears that the tail dependence is not a concern in this context. The implied Kendall’s , reported in the table, reinforces the negative frequency-severity dependence obtained in the earlier analysis, indicating that the claim frequency and severity are correlated after controlling for the covariates. Because the independence model is nested by the copula model, we perform a likelihood ratio test to formally evaluate the goodness-of-fit of the copula models against the independence model. The large statistics confirm the statistical significance of the negative frequency-severity dependence.

The specification for the dependent frequency-severity model, including both the marginals and the copula, is a result of a series of model comparisons, diagnostic analysis, and robust checks. The detailed analysis is provided in Section S.3 of the supplementary material.

Table 5 reports the estimated parameters for the selected Gaussian copula model. The association parameter in the Gaussian copula is -0.29 and -0.30 using two-stage and full MLE respectively. Given that the rating variables in insurance are highly regulated, one should regard the observed frequency-severity dependence as a result of unobserved heterogeneity, and thus the sign of the dependence could be either positive and negative. Our focus is to provide a data-driven method to capture such relationship and to show the detrimental effects of ignorant supposition of independence on statistical inference and hence insurance operations. For comparison, we also report in Table 5 the estimation results for the independence model. For the frequency component, one anticipates no essential difference in estimates of regression coefficients between the independence and copula models. We observed that the two-stage MLE is identical to the independence model, and we attribute the difference from the full MLE to the finite sample property. In contrast, the difference in the estimates for the severity component is substantial between the independence and copula models (both two-stage and full MLE), which is in line with the significant negative dependence between and . The analysis indicates that ignoring the frequency-severity dependence could introduce significant bias in parameter estimation.

### 4.3 Implications on Insurance Operations

The previous section shows the statistical significance of the dependence between frequency and severity in the collective risk model. This section focuses on the substantive significance of the frequency-severity dependence and demonstrates its impacts on the decision-making in some key insurance operations (Frees (2015)).

The first operation that we consider is underwriting and ratemaking. They are two basic functions in insurance companies and are closely related to each other. The former deals with the selection of risks, and latter deals with the determination of the price for the risks accepted. To achieve the underwriting profit target, the central task in underwriting and ratemaking is to quantify the risks of potential customers, which provides the insurer a risk score of policyholders to facilitate portfolio selection. To compare performance of the independence and the copula models, we look to the policyholders in the validation data of 2011 and examine which method leads to a more profitable portfolio construction.

For the purpose of underwriting, we use the coefficient of variation to measure the risk of policyholders. For each of the 1,017 policyholders in year 2011, we calculate the coefficient of variation of the loss cost, denoted for the th policyholder. Given that the aggregate loss cost is specified using a collective risk model (1), the mean and variance of is calculated by:

 E[S] =E[NE[Y|N]]independence=E[N]E[Y] Var[S] =E[NVar[Y|N]]+Var[NE[Y|N]]independence=E[N]Var[Y]+Var[N](E[Y])2

The above calculation emphasizes the role of the dependency between the two building blocks, frequency and severity. We calculate the distribution of aggregate loss for each policyholder based on 10,000 Monte Carlo simulations. The upper panel of Figure 7 compares the risk ranking between the independence and the copula models. The first plot is the scatter plot of the ranking for each policyholder by the two methods. The second plot shows the realized aggregate losses (in log scale) with the same ranking from the two models. The risk scores from the two models are highly correlated yet there are considerable difference in their rankings.

To evaluate whether the risk ranking points to a profitable portfolio selection strategy, we display in the lower panel of Figure 7 the cumulative loss distribution () versus the cumulative premium distribution (), both ordered by the riskiness of the policyholders . This curve is known as the ordered Lorenz curve in Frees et al. (2011). In Figure 7, the loss and premium distributions are calibrated using the realized losses of the policyholders and the actual premiums charged by the insurer in year 2011, respectively. The area between the curve and the 45 degree line is interpreted as an average profit or loss for the portfolio, with a convex curve for profit and a concave curve for loss. If one thinks of each underwriting strategy as retaining policies with riskiness less than or equal to , the area represents an average profit in the sense that we are taking an expectation over all decision-making strategies. Furthermore, twice the area is known as the Gini index which thus has a natural economic interpretation. The Lorenz curve for the independence model is close to the 45 degree line. In contrast, the Lorenz curve for the copula model suggests a much higher average profit. Specifically, the Gini indices are 10.55% and 33.24% for the independence and the copula models, respectively. Therefore, a better underwriting strategy could be formed using the copula model, given that each policyholder is charged the contract premium. Figure 7: Risk ranking and portfolio selection using the independent and the copula models. The top two figures compare risk score ranking and the corresponding realized losses between the independence and copula models respectively. The bottom two figures compare ordered Lorenz curves between the independence and copula models where the dashed line indicates perfect equality.

We next compare the rates suggested by the independence and the copula models. A fair rate commensurate with the policyholder’s risk mitigates adverse selection against the insurer. We perform a out-of-sample validation based on the Gini correlation in Frees et al. (2011). Two base premiums are considered, the constant premium and the contract premium. The former charges average cost to each policyholder, and the latter is the premium that the property fund charges based on the basic rating variables. Table 6 presents the Gini correlation coefficients for the independence and the copula models. For both premium bases, the copula model shows a higher index, implying a more refined risk classification than the independence model.

The proposed copula model can also provide insights for the practice of claims reserving. In property casualty insurance, it is typical that a loss event won’t be reported to the insurer immediately upon occurrence. For instance, a hail damage to the roof might be discovered by the policyholder several month later. After being reported, it further takes time for the insurer to decide coverage and finally settle the claim. Because of the long reporting and settlement delays, an insurer could be responsible for future payments associated with the loss events occurred in the policy period even post the expiration of the contract. Claims reserving or loss reserving is the process of estimating outstanding payments or the ultimate payments that an insurer is responsible for. Reserves are determined at both claim level and portfolio level (see, for example, Antonio and Beirlant (2008) and Pigeon et al. (2014)). At claim level, an insurer estimates the amount for which a particular claim will ultimately be settled or adjudicated, also known as case reserve. At portfolio level, an insurer also estimates its future liabilities for the entire book of business. To emphasize its importance, loss reserves typically represent the largest liability item on the balance sheet of nonlife insurers.

For reserving purposes, one is interested in the claims amount given occurrence of the loss events. As pointed out by Wüthrich and Merz (2008), because of the introduction of new supervisory guidelines (Solvency II) and financial reporting standards (IFRS 4 Phase II), the measurement of future cash flows and their uncertainty becomes more important. In this application, we examine the predictive distribution of given . For illustration, we display in Figure 8 the 95% prediction intervals of the claims amount for four representative risks, “poor”, “good”, “average”, and “superior”. The bar is determined by the 2.5th and 97.5th percentiles of the predictive distribution, and the solid dot indicates the predictive mean. The four risks are selected from the validation data based on the expected number of claims . Specifically, they expect to have 2.37, 0.76, 0.37, 0.15 claims per year which corresponding to the 95th, 75th, 50th, 25th percentiles of the frequency distribution, respectively. For comparison, we impose the corresponding prediction interval from the independence model in the figure as indicated by the dashed line. First, as expected, the predictive distribution of claim amounts given frequency is skewed and long-tailed. This observation emphasizes that a range estimate of reserves is more informative than a point estimate for managers to set appropriate reserves, because an insurer doesn’t want to overestimate nor underestimate its outstanding liabilities. Over-reserving could inflate the price and make the product less competitive, while under-reserving increases the solvency risk. Second, because of the significant negative relation between claim frequency and severity, the claims amount becomes smaller as the number of claims increases. A dynamic viewpoint is that an insurer updates its knowledge on the severity distribution based on frequency information. Third, it is apparent that ignoring the frequency-severity dependence will introduce significant bias in the reserving estimates. Under the independence assumption, not only that the claim severity is invariant with respect to claim frequency, but also the magnitude of the prediction could lead to poor decision making. For example, the results suggest that managers relying on the independence model tend to over reserve for better risks. In particular, the over-reserving risk is substantial for superior risks. As described earlier, there will be negative effects on both pricing and reserving. Over prediction of unpaid losses lead to increase in price which could cause the insurer to lose profitable business. Figure 8: Prediction interval of conditional claims amount for four representative risks.

We further test the prediction of ultimate losses given occurrence for all the policyholders in the hold-out sample. To compare the prediction from the independence model to the copula model, we employ the continuous ranked probability score (CRPS) in

Gneiting and Raftery (2007) and Czado et al. (2009). The CRPS is a proper scoring rule that assesses the quality of probabilistic forecasts. For reserving purpose, we focus on policyholders with at least one claims, and we evaluate the prediction of the aggregate loss distribution . The predictive distribution is derived for each policyholder based on 10,000 Monte Carlo simulations where the aggregated loss is generated conditional on occurrence of claims. Then the CRPS assigns a numerical score that measures the distance between the cumulative predictive distribution and the realized losses in the hold-out sample. For 73.34% of the policies in the hold-out sample, the copula model outperforms the independence model. A binomial test suggest the superior prediction of the copula model to the independence model is statistically significant.

In the third application, we briefly demonstrate implications of the frequency-severity dependence on capital management. Insurance is a highly regulated industry. To mitigate solvency risk and protect public interest, insurers are required to hold minimum amount of risk capital as a buffer in case of some unexpected catastrophic events. We have already seen the consequences when the dependence between frequency and severity is unaccounted for at the individual policy level. This example emphasizes its relevance at the portfolio level since the risk capital is determined for the entire book of business.

To calculate the risk capital, we consider the value-at-risk (VaR), a risk measure widely used in the insurance and banking industry. The VaR focuses on the tail of the distribution, and specifically VaR is defined as the th percentile. Our interest is the aggregate losses for the insurance portfolio, defined as , where , the loss cost for policyholder , is specified using the collective risk model (1). The distribution of is estimated using 10,000 Monte Carlo simulations. Table 7

reports the risk measure at 90%, 95%, and 99% levels for both the independence and copula models. To quantify the simulation uncertainty, we replicate the simulation 100 times to obtain the 95% confidence interval. The results implies that ignoring the frequency-severity dependence in the collective risk model leads to significant underestimate of the tail risk for the portfolio.

## 5 Conclusion

The two-part regression model based on compound distributions is commonly used in various disciplines, including insurance, economics, marketing, and psychology, among others. The current practice is to perform a marginal regression on the primary (frequency) outcome, and a separate regression on the positive portion of the secondary (severity) outcome. This practice relies on the (conditional) independence assumption and causes significant biases in inference in the presence of frequency-severity dependence.

Motivated by the wide application of this type of model, this article represents the first attempt at accommodating the association between the frequency and severity components in the compound distribution and the associated regression models. We proposed the novel idea of using a parametric copula to construct the joint distribution of and in the compound distribution. The copula regression is simple yet enjoys several advantages: First, the copula model allows for an arbitrary dependence between frequency and severity, and thus includes the (conditional) independence model as a special case. Second, separating the marginal from the joint distribution, the copula model can easily accommodate nonstandard marginal regressions for complicated data structure, for instance, regressions for zero/one-inflated data or the incomplete data due to censoring and truncation. Third, the parametric nature of the model implies straightforward likelihood-based inference and thus facilitates data-driven model specification and diagnostics, which is critical to the applications with complex and big data.

This work was motivated by the applications in insurance, where the complex and unique features of claims data provide a general setting to investigate the frequency-severity dependence in the context of the two-part model. For example, the standard count regression is not sufficient to capture the features in claim frequency; and the modifications on insurance coverage often cause observations to be incomplete. Although our empirical analysis emphasized the consequences of ignoring the frequency-severity dependence on the operations in insurance companies, the proposed model is general enough and ready to apply to other disciplines. It’ll be interesting to see the implications of the frequency-severity dependence on decision making in other fields as well.

Finally, we conclude the paper with some discussions on the dependence between the frequency and severity in the proposed copula model. First, the proposed copula model relies on a simplifying assumption for the dependence, i.e. the association parameter in the copula is constant and does not vary across covariates. A potential extension is to use a conditional copula approach to allow the association in the copula to be dependent on covariates. See, for example, Patton (2006), Acar et al. (2011), Veraverbeke et al. (2011), Fermanian and Wegkamp (2012), and Castro-Camilo et al. (2018) for some recent development. We note that some domain knowledge is usually needed to support the conditional copula approach, for instance, the dependence among stock markets could be time-varying. We leave it as a future research topic to investigate the conditional dependence in insurance data. Second, we attribute the observed dependence in frequency and severity to unobserved heterogeneity. Regarding whether such relation is positive or negative, we think of this more as an empirical question to investigate. Often there are competing theories to support both positive and negative relationships. For the property insurance in our paper, one example of unobserved heterogeneity that induces dependence is weather related hazard. One can think of a geographical region that has frequent but modest storms versus another region that has infrequent but very severe storms. Another example of unobserved heterogeneity is the social-economic factors. One can think of some areas with frequent but minor crimes versus other areas with infrequent but severe crimes. Thus it is important for the model to offer the flexibility to accommodate both positive and negative relationship, and thus to allow for an empirical test of alternative theories.

## Supplementary Material

### S.1 Special Cases of the Copula Models

We show that several widely used two-part models can be viewed in the proposed copula framework. The first is the hurdle model in the health economics literature (see, for instance, Mullahy (1986) and Pohlmeier and Ulrich (1995)). The hurdle model considers the count measure of health care utilisation as a result of two different decision processes. The first part specifies the decision to seek care by individuals, and the second decision concerns the quantity of health care consumed which is partly determined by physicians. In the copula model, define and for a latent count variable . The copula for and is shown to be . In this case, the copula model (2) becomes:

 fN,Y(n,y1,…,yn) =[fN(0)]I(n=0)[fN(1)×fY|N(y|1)]I(n=1) =FY∗(0)1−FY∗(0)[FY∗(0)]1−n[1−FY∗(0)]nhurdle×[fY∗(y)1−FY∗(0)]n%truncated. (S.1)

This gives the standard hurdle model where the hurdle component can be a logit or a probit regression, and the truncated component is usually a Poisson or a negative binomial model. Governed by two different sets of parameters, the two components are separately and independently estimated. The same framework has also been used to study the semi-continuous health care expenditures where a log-linear or generalized linear model is often employed for the truncated component (see

Mullahy (1998)).

The second special case is the selection model. Assuming is a binary outcome and is determined by a latent continuous variable through relation , one has:

 fN(0)=Pr(N=0) =Pr(N∗≤0)=FN∗(0) fN(1)=Pr(N=1) =Pr(N∗>0)=1−FN∗(0)

Denote as the copula that uniquely defines the joint distribution of and , the bivariate density of and can be expressed as:

 fN∗,Y(n∗,y)=fN∗(n∗)fY(y)c(FN∗(n∗),FY(y))

where . Then the distribution of and is:

 fN,Y(0,y) =∂∂yPr(N∗≤0,Y≤y)=∫0−∞fN∗,Y(s,y)ds =fY(y)h(FN∗(0),FY(y)) fN,Y(1,y) =∂∂yPr(N∗>0,Y≤y)=∫+∞0fN∗,Y(s,y)ds =fY(y)(1−h(FN∗(0),FY(y)))

Note that the above is a selection model in that is observed only if . See Smith (2003) and Prieger (2002) for discussions on copula-based selection model. When and are joint normal distribution, the copula model further reduces to the classic Heckman model (Heckman (1979)).

Under this setting, the joint distribution of is shown:

 fN,Y(n,y1,…,yn) =[fN(0)]1−n[fN,Y(1,y)]n =fY(y)fN(1)fN(0)1−nfN(1)nfN(n)×(fY(y)fN(1)[1−h(FN(0),FY(y))])nfY|N(y|1)

The model has a natural two-part interpretation where the joint distribution decouples into the product of frequency and severity distributions. However, the two components cannot be estimated separately because they are not independent with each other.

The last related case is the frequency-severity model in the actuarial science literature (Frees (2014)), where the joint distribution of is expressed as:

 fN,Y(n,y1,…,yn)=fN(n)×n∏j=1fY|N>0(yj). (S.2)

The frequency component describes the number of claims and is specified as a count regression. The severity component governs the size of claims given occurrence and employs generalized linear models to account for the skewness and heavy tails. The model assumes that given , the distribution of does not depend on . Thus the two pieces can be estimated separately. Note the difference between distributions and . To be more specific,

 fY|N