Predictive Modeling of Multivariate Longitudinal Insurance Claims Using Pair Copula Construction

05/18/2018 ∙ by Peng Shi, et al. ∙ University of Notre Dame 0

The bundling feature of a nonlife insurance contract often leads to multiple longitudinal measurements of an insurance risk. Assessing the association among the evolution of the multivariate outcomes is critical to the operation of property-casualty insurers. One complication in the modeling process is the non-continuousness of insurance risks. Motivated by insurance applications, we propose a general framework for modeling multivariate repeated measurements. The framework easily accommodates different types of data, including continuous, discrete, as well as mixed outcomes. Specifically, the longitudinal observations of each response is separately modeled using pair copula constructions with a D-vine structure. The multiple D-vines are then joined by a multivariate copula. A sequential approach is employed for inference and its performance is investigated under a simulated setting. In the empirical analysis, we examine property risks in a government multi-peril property insurance program. The proposed method is applied to both policyholders' claim count and loss cost. The model is validated based on out-of-sample predictions.



There are no comments yet.


page 6

page 7

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

In the past decade, insurance industry, especially property and casualty insurance, has been advancing the use of analytics to leverage big data and improve business performance. Statistical modeling of insurance claims has become an important component in the data-driven decision making in various insurance operations (Frees (2015)). This study focuses on predictive modeling for nonlife insurance products with a “bundling” feature.

Bundling is a common design in modern short-term nonlife insurance contracts and it takes different forms in practice. For instance, in automobile insurance, a comprehensive policy provides coverage for both collision and third-party liability; in property insurance, an open peril policy covers losses from different causes subject to certain exclusions; furthermore, in personal lines of business, car insurance and homeowner insurance are often marketed to households as a package. In the context of predictive modeling, the bundling feature of nonlife insurance contracts often leads to multiple longitudinal measurements of an insurance risk. The goal of this study is to promote a modeling framework for evaluating the association among the evolution of the multivariate risk outcomes.

Assessing the association among multivariate longitudinal insurance claims is critical to the operation of property-casualty insurers. Two types of association among claims are in particular of interest to researchers and have been studied in separate strands of literature. On one hand, one is interested in the temporal relation of insurance claims. The availability of repeated measurements over time allows an insurer to adjust a policyholder’s premium based on the claim history, known as experience rating in insurance (Pinquet (2013)). Understanding the temporal dependence is essential to the derivation of the predictive distribution of future claims and thus the determination of the optimal rating scheme (see, for example, Dionne and Vanasse (1992) and Boucher and Inoussa (2014)). On the other hand, insurers are also interested in the contemporaneous association among the bundling risks. Correlated claims induce concentration risk in the insurance portfolio on the liability side of an insurer’s book in contrast to the investment portfolio on the asset side. Measuring the contemporaneous dependency is necessary to the quantification of the inherent risk in the bundling products, which offers valuable inputs to the insurer’s claim management and risk financing practice (Frees and Valdez (2008) and Frees et al. (2009)).

As noted above, the majority of existing studies tends to examine the contemporaneous and temporal dependence in insurance claims in distinct contexts. We argue that an analysis of both types of association in a unified framework will deliver unique insights to the decision making that cannot be gained from a silo approach. For instance, a joint analysis of multivariate longitudinal outcomes provides a prospective measure of the association among dependent risks as opposed to a retrospective measure, which is more relevant to the insurer’s operation such as ratemaking, risk retention, and portfolio management. Despite of the appealing benefit, studies in this line are still sparse, with one recent exception of Shi et al. (2016). There are several possible explanations: first, a separate analysis provides more focus in problem-driven studies; second, researchers might not have access to the relevant data; third, the discreteness in insurance claims data brings additional challenge to the modeling process.

Setting apart from the existing literature, this work examines the contemporaneous and temporal association embedded in the multivariate longitudinal insurance claims in a unified framework. Specifically, we adopt a regression approach based on parametric copulas, where the longitudinal observations of each response is separately modeled using pair copula constructions with a D-vine structure, and the multiple D-vines are then joined by a multivariate copula. The proposed framework is very general and has a much broader audience in that it easily accommodates different types of data, including continuous, discrete, as well as mixed outcomes. In our context, we are interested in both the number of claims (discrete outcome) and the loss cost of claims (mixed outcome) of individual policyholders.

A number of approaches to joint modeling of multivariate longitudinal outcomes have been found in statistical literature. We refer to Verbeke et al. (2014) for a recent review. All the methods imply specific structures on the serial association of repeated measurements of a response, the contemporaneous association among multiple responses at the same time point, as well as the lead-lag association between the multiple responses at different points in time. In general, dependence among multivariate longitudinal outcomes can be accommodated via three mechanisms.

The first is to use random effects models where latent variables at either time-dimension or outcome dimension are specified to introduce association. For example, Reinsel (1984) and Shah et al. (1997) focused on linear models but limited with balanced data, and Roy and Lin(2000, 2002

) relaxed this restriction to allow for unbalanced data. In theory, non-Gaussian outcomes can be accommodated using generalized linear mixed models. However, their applications to multivariate longitudinal data are far less found in the literature, arguable due to the difficulties in parameter interpretation, model diagnostics, and computational complexity.

The second approach is to directly specify the multivariate distribution for the outcome vector. The conventional case is the multivariate linear regression for Gaussian outcomes where the flexible dependence can be captured by appropriately structuring the covariance matrix (see, for instance,

Galecki (1994)). Full specification of the distribution for discrete outcomes is more challenging and difficult to implement, with Molenberghs and Lesaffre (1994) being one example on ordinal categorical data. A more general class of models that is particularly useful for non-Gaussian outcomes is the copula regression. See Nelsen (2006) for an introduction to copula and Joe (2014) for recent advances. Applications of copulas for analyzing multivariate longitudinal outcomes include Lambert and Vandenhende (2002), Shi (2012), and Shi et al. (2016).

The third strategy is marginal models using generalized estimating equations(GEE) (

Liang and Zeger (1986)). Compared to the above two methods, the advantage of using GEE is that the specification of full distribution can be avoid and inference for regression parameters is robust with respect to misspecification. For example, Rochon (1996) considered a bivariate model to jointly analyze a binary outcome and a continuous outcome that are both repeatedly measured over time. Gray and Brookmeyer (1998) and Gray and Brookmeyer (2000)

proposed multivariate longitudinal models for continuous and discrete/time-to-event response variables respectively.

For the purpose of our application, we adopt the copula approach. On one hand, common measurements for insurance risks are neither Gaussian nor continuous, for example, the number of claims and the loss cost of claims in this study. Due to some unique features such as zero-inflation and heavy tails in these measurements, standard statistical methods including generalized linear models are not ready to apply. In this case, implementing a random effect approach is not straightforward and computationally challenging. On the other hand, prediction plays a central role in insurance business. From the inference perspective, forecasting carries no less weight than estimating regression coefficients for the current task. Hence the GEE approach is not appropriate since it treats association as nuisance and measures it using working correlation.

The proposed copula model in this paper is novel and differs from the literature in that pair copula construction offers a wider range of dependency compared to existing studies on multivariate longitudinal data that limit to the elliptical family of copulas. For example, Shi et al. (2016) employed a factor model to specify the dispersion matrix of elliptical copulas. The factor model implies restricted dependence structure although it is particularly helpful when the data exhibit a hierarchical feature. Lambert and Vandenhende (2002) used two separate copulas in the model formulation, one for the temporal association and the other for the (conditional) contemporaneous association. With an extra copula, the model adds more flexibility to the dependence structure. However, both temporal and contemporaneous association are still limited to the modeling of the correlation matrix in a Gaussian copula. This work contributes to the literature from both the methodological and the applied perspectives. From the methodological standpoint, we introduce a unified framework for modeling dependence in multivariate longitudinal outcomes based on pair copula construction. From the applied standpoint, the unique predictive application in property insurance motivates the proposed modeling framework and advocates its usage for the general two-dimensional data and in much broader disciplines.

The rest of the article is structured as follows. Section 2 describes the Wisconsin local government property insurance fund that drives the demand for advanced statistical methods. Section 3 proposes the pair copula construction approach to modeling multivariate longitudinal outcomes and discusses inference issues. Section 4 investigates the performance of the estimation procedure using simulation studies. Section 5 presents results of the multivariate analysis on the number of claims and the loss cost of claims for the policyholders in the Wisconsin government property insurance program. Section 6 concludes the article.

2 Motivating Dataset

This study is motivated by the operation of the local government property insurance fund from the state of Wisconsin. The fund was established to provide property insurance for local government entities that include counties, cities, towns, villages, school districts, fire departments, and other miscellaneous entities, and is administered by the Wisconsin Office of the Insurance Commissioner. It is a residual market mechanism in that the fund cannot deny coverage, although local government units can secure insurance in the open market. See Frees et al. (2016) and Shi and Yang (2017) for more detailed description of the insurance property fund data.

We focus on the coverage for buildings and contents that is similar to a combined home insurance policy, where the building element covers for the physical structure of a property including its permanent fixtures and fittings, and the contents element covers possessions and valuables within the property that are detached and removable. It is an open-peril policy such that the policy insures against loss to covered property from all causes with certain exclusions. Examples of such exclusions are earthquake, war, wear and tear, nuclear reactions, and embezzlement.

For the sustainable operation of the property fund, the fund manager is particularly interested in two measurements for each policyholder, the claim frequency and the loss cost. The former measures the riskiness of the policyholder and provides insights for practice on loss control. The latter serves as the basis for determining the premium rate charged for a given risk. Policyholder level data are collected for 1019 local government entities over a six-year period from 2006 to 2011. Because of the role of residual market of the property fund, attrition is not a concern for the current longitudinal study. The open-peril policy type allows us further decompose the claim count and loss cost by peril types which provide a natural multivariate context to examine the longitudinal outcomes. Following

Shi and Yang (2017), we consider three perils, water, fire, and other. We use the data in years 2006-2010 to develop the model and the data in year 2011 for validation.

One common feature shared by the claim count and the loss cost is the zero inflation, i.e. a significant portion of zeros associated with policyholders of no claims. Table 1

presents the descriptive statistics of the two outcomes by peril. Specifically, the table reports the percentage of zeros for each peril, and the mean and standard deviation of the outcome given there are at least one claim over the year. On average, about 85% government entities have zero claim per year for a given peril. Conditional on occurrence, we observe over-dispersion in the claim frequency except for fire peril, and skewness and heavy tails in the claim severity for all perils.

Claim Count Loss Cost
% of Zeros Mean SD Mean SD
Water 83.81 3.35 13.16 48,874 520,147
Fire 86.20 1.41 0.82 33,705 118,149
Other 87.95 1.60 2.86 50,824 263,241
Table 1: Descriptive statistics of claim count and loss cost by peril

The property fund data also contain basic risk classification variables that allow us to control for observed heterogeneity in claim count and loss cost. There are two categorical variables, entity type and alarm credit. Entity type is time constant, indicating whether the covered buildings belong to a city, county, school, town, village, or a miscellaneous entity such as fire stations. Alarm credit is time varying, reflecting the discount in premium received by a policyholder based on the features of the fire alarm system in the building. Available levels of discounts are 5%, 10%, and 15%. There is an additional continuous explanatory variable, amount of coverage, that measures the exposure of the policyholder. Due to its skewness, we use the coverage amount in log scale in the regression analysis.

Unobserved heterogeneity in the multivariate longitudinal outcomes is accommodated by modeling the temporal association within each peril, and the contemporaneous and lead-lag association between different perils. To obtain intuitive knowledge of dependence, we visualize in Figure 1 and Figure 2 the pair-wise rank correlation for claim count and loss cost respectively. First, it is not surprising to observe modest serial correlation for each peril. This is consistent with the presence of unobserved peril-specific effect that is often attributed to the imperfection of the insurer’s risk classification. Second, there exists moderate contemporaneous dependence among the three perils, which suggests some relation among the peril-specific effects and thus a diversification effect on the insurer’s liability portfolio. In addition, the lead-lag correlation across perils implies that the temporal and contemporaneous dependence are interrelated. Similar dependence patterns are observed for the claim count and the loss cost, which is explained by the presence of excess zeros in both outcomes.

Figure 1: Rank correlation matrix for the claim count.
Figure 2: Rank correlation matrix for the loss cost.

3 Statistical Method

3.1 Modeling Framework

Let denote the th () response in the th () period for subject (), and be the associated vector of explanatory variables. In this study, the outcomes of interest are the number of claims and the loss cost of claims from multiple perils for local government entities, and these outcomes are repeated observed over years. Specifically, index corresponds to government entity, peril type, and policy year.

Due to the unique features exhibited in the data, standard regression models often fail for insurance data. Therefore, we look into customized models for capturing such interesting data characteristics. For the claim count, we consider a zero-one inflated count regression:


where (

) is specified using a multinomial logistic regression:

and is a standard count regression such as Poisson or negative binomial models. This specification allows to accommodate the excess of both zeros and ones in the claim count.

For the loss cost, we employ a mixture formulation:



corresponds to a binary regression such as logit or probit for accommodating the mass probability at zero, and

is a long-tailed regression to capture the skewness and heavy tails. Commonly used examples are generalized linear models or parametric survival models.

To introduce the dependence model for the multiple longitudinal outcomes, we adopt the pair copula construction approach based on the graphical model known as vine (see Bedford and Cooke (2001, 2002)). Mainly due to its flexibility especially in terms of tail dependence and asymmetric dependence, vine copula has received extensive attention in the recent literature of dependence modeling (see, for instance, Joe and Kurowicka (2011)). In particular, Kurowicka and Cooke (2006) and Aas et al. (2009) are among the first to exploit the idea of building a multivariate model through a series of bivariate copulas and the original effort has been focused on the continuous data. Building on the framework for continuous outcome, Panagiotelis et al. (2012) introduced the pair copula construction approach to the discrete case, Stöber et al. (2015) examined the vine method for multivariate responses including both continuous and discrete variables, Shi and Yang (2017) employed the pair copula construction to model the temporal dependence among hybrid variables, and Barthel et al. (2018) further extended the vine approach to the event time data with censoring. Despite of existing studies, using pair copula construction for data with multilevel structure is still sparse in the literature. Recently, Brechmann and Czado (2015) and Smith (2015) discussed possible strategies of constructing vine trees for multivariate time series. In contrast, we propose below a strategy for multiple longitudinal outcomes.

Define as the history of the th outcome prior to time for subject , i.e. , and denote . Letting

, we express the joint distribution for subject



In the above in (3) denotes either joint (conditional) pdf or (conditional) pmf depending on the scale of the outcomes. Note that this conditional decomposition is generic and it does not impose any constraint on model specification.

Let denote the cdf, be it either univariate or multivariate. The joint distribution of conditioning on in (3) is specified by a -variate copula :


The conditional distribution , in (4) can be derived from the joint distribution of the outcomes of type . To allow for flexible temporal dependence, we further construct this distribution using the pair copula construction approach based on a D-vine structure:


We emphasize that the above proposed framework is general in that it accommodates marginals of different types, including continuous, discrete, and mixed distributions. Thus similar to (3), function in (5) could denote either pdf or pmf depending on the scale of the response variable. Below we discuss in detail the specifications for both continuous and non-continuous including discrete and semi-continuous cases.

3.1.1 Continuous Case

To introduce the idea, we start with the case of continuous outcomes. Using a -variate copula to accommodate the dependence among the multiple outcomes, the joint density associated with (4) becomes


In (6), denotes the density of copula . It could be specified using an elliptical copula with an unstructured dispersion matrix to allow for flexible pair-wise association. As an alternative, one could use pair copula construction with a regular vine, especially when the dimension is high or tail and asymmetric dependence are of more interest.

Next, for each of the multiple outcome, we employ a unique D-vine structure to model the temporal association. The conditional bivariate distribution in (5) can be expressed as


where denotes the bivariate copula density for the th responses in period and conditioning on the responses in the time periods in between. The D-vine specification offers a balance between interpretability and complexity. First, the natural ordering embedded in the longitudinal observations motivates the D-vine structure in modeling the temporal dependence for each outcome. Second, the conditional distribution for in (6) is an intermediate product in the evaluation of (5). This link between the two components of the proposed dependence model implies no extra computational complexity in the evaluation of the likelihood function.

3.1.2 Non-continuous Case

Building upon the above idea, we modify the model to accommodate both discrete outcome and semi-continuous outcome that are more relevant to our application. For discrete outcomes such as the number of insurance claims, (6) will be replaced by


and in (5), (3.1.1) will be replaced by


where is the bivariate copula associated with .

For semi-continuous outcomes, one can think of a hybrid variable where a mass probability of zero is incorporated into an otherwise continuous variable, for instance, a policyholder’s loss cost. In this case, (6) will be replaced by


In the second scenario in (3.1.2), we define

where, without loss of generality, we assume that the first outcomes are positive and the remaining are zero. Further, in (5), (3.1.1) will be replaced by


where for .

3.2 Inference

Because of the parametric nature of the model, we perform a likelihood-based estimation. The log-likelihood function for a portfolio of policyholders can be expressed as:


where the marginal and dependence models are specified respectively by (1) and (8)(3.1.2) for claim count, and (2) and (3.1.2)(3.1.2) for loss cost. The hierarchical nature of the model and the separability assumption for dependence parameters motivate us to adopt a stage-wise estimation strategy in the same spirit of inference function for margins (IFM) (Joe (2005)). In the first stage, one estimates the parameters in the marginal regression models assuming independence among observations both cross-sectionally and intertemporally. The second stage concerns the inference for the D-vine for each outcome. Either a simultaneous estimation or a tree-based estimation can be implemented. In the third stage, we estimate the copula for the contemporaneous association holding parameters in both marginal and temporal dependence models fixed. It is noted that estimation by stage allows us to gain substantial computational efficiency but at the cost of small loss in statistical efficiency. Our argument is that the stage-wise estimation is more feasible for predictive applications where the statistical efficiency is of secondary concern, especially in the case of big data. We investigate the performance of the estimation using simulation studies. Below is a detailed summary of the algorithms that are used to evaluate the likelihood:

Step I: For , we evaluate the following:

(i) For , evaluate and using the marginal models of the outcomes.

(ii) For , evaluate using (3.1.1) for continuous outcome or (3.1.2) for discrete outcome or (3.1.2) for semi-continuous outcome.

(iii) For , evaluate the following for recursively:

(a) Calculate and using:

(b) Calculate using:




(c) Following the similar procedure in (b), calculate .




(d) Calculate using (3.1.1), (3.1.2), and (3.1.2) for continuous, discrete, and semi-continuous outcomes respectively.

Step II: For , calculate the following for :

(i) Calculate using:

(ii) Calculate using: