Analysis of Prescription Drug Utilization with Beta Regression Models

07/31/2020
by   Guojun Gan, et al.
University of Connecticut

The healthcare sector in the U.S. is complex and large, generating about 20% of the country's gross domestic product. Healthcare analytics has been used by researchers and practitioners to better understand the industry. In this paper, we examine and demonstrate the use of Beta regression models to study the utilization of brand name drugs in the U.S. and to understand the variability of brand name drug utilization across different areas. The models are fitted to public datasets obtained from the Centers for Medicare & Medicaid Services and the Internal Revenue Service. Integrated Nested Laplace Approximation (INLA) is used to perform the inference. The numerical results show that Beta regression models can fit the brand name drug claim rates well and that including spatial dependence improves their performance. Such models can be used to reflect the effect of prescription drug utilization when updating an insured's health risk in a risk scoring model.


1 Introduction

In the pharmaceutical industry, a brand name drug is a prescription drug that is developed and patented by a company. The patent gives the company a period of market exclusivity, during which the company sells the brand name drug at a significantly high price. For more information about drug pricing, readers are referred to Schoonveld (2015). When the patent expires, generic versions of the brand name drug are marketed at lower prices by other companies. Generally, most states allow pharmacists to substitute brand name drugs with generic versions unless otherwise directed by physicians. Nevertheless, according to the Association for Accessible Medicines (AAM) (Klinger, 2017), generic drugs account for 89% of prescriptions dispensed but only 26% of total drug costs in the U.S. Annual prescription spending in the United States has also been on the rise. According to the NHE (National Health Expenditure) fact sheet (https://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/NationalHealthExpendData/NHE-Fact-Sheet, accessed on July 6, 2020), prescription drug spending in the U.S. increased 2.5% to $335.0 billion in 2018, faster than the 1.4% growth in 2017.

This large disparity in costs between brand name and generic drugs motivated us to investigate the feature variables, e.g., location, income, and demographics, that drive the decision to choose one type over the other. It is also our intent to propose a model that health insurers can replicate to reflect these decision drivers in a risk adjustment program (Duncan, 2011), a mechanism widely used for assessing the health risk of an insured (Hileman et al., 2016). This mechanism is an umbrella of commercially designed risk assessment models that are sometimes called risk scoring models. While the models vary, they share a common objective of predicting the healthcare cost per member per year using the prior year's medical utilization and expenditures as well as additional risk factors such as location and demographics.

The decision to prescribe or purchase generic versus brand name drugs has been studied before. Brinberg and Cummings (1984) used two behavioral intention models to examine the decision to purchase generic prescription drugs and found several differences between individuals who intend to purchase generic prescription drugs and those who do not. For example, nonintenders were less likely than intenders to believe that generic drugs are a safe product. Hamann et al. (2013) studied psychiatrists' decision making between generic and branded antipsychotics and found that psychiatrists were more likely to choose branded drugs when imagining choosing the drug for themselves. Hassali et al. (2014) reviewed experiences of implementing generic medicine policy in eight countries, including the U.S. The review indicates that pharmacists play an essential role in promoting generic medicines, as they substituted 83.8% of prescriptions that allowed substitution, and that generic prescribing is still not common practice in the U.S. because many physicians have negative perceptions about generic medicines and lack in-depth knowledge about the bioequivalence concept applied in the U.S. Kesselheim et al. (2016) studied variations in patients' perceptions and use of generic drugs based on a national survey and found a substantial shift toward more patients holding positive views of generic drugs.

In the aforementioned work, Brinberg and Cummings (1984), Hamann et al. (2013), and Kesselheim et al. (2016) used survey data to study the behavior of patients and physicians. The review conducted by Hassali et al. (2014) was based on a literature search using several electronic databases and search engines, such as the ISI (Institute for Scientific Information) Web of Knowledge and Google. Public pharmaceutical drug utilization data have not been used to study the factors driving the decision to choose one type over the other.

In this paper, we investigate the rates of brand name drug claims in the U.S. In particular, we investigate the variation of the brand name drug claim rates in different areas of the U.S. Our study supplements the aforementioned studies based on survey data. Since rates are values in the interval $(0, 1)$, the Beta regression model (Ferrari and Cribari-Neto, 2004) is well suited, as covariates can be introduced to account for heterogeneity.

The remaining part of the paper is organized as follows. In Section 2, we review some academic work related to health insurance. In Section 3, we describe the data used in our study. In particular, we describe how the data is processed before models are fitted. In Section 4, we present four Beta regression models for fitting the data. In Section 5, we provide numerical results of the proposed Beta regression models. In Section 6, we conclude the paper with some remarks.

2 Related work

Healthcare is an important practicing area of actuaries due to the existence of health insurance. In this section, we review some work that is related to health insurance analytics.

Early work includes Bolnick (2004), Petertil (2005), and Bachler et al. (2006). Bolnick (2004) proposed a framework for long-term healthcare cost projections by incorporating key healthcare cost drivers such as life expectancy, biological morbidity, and economic morbidity. Using the framework, Bolnick (2004) considered various plausible future scenarios that encompass a reasonable range of healthcare cost outcomes. Petertil (2005) studied the relative significance of aging as a driver of healthcare cost beyond age fifty and used Medicare data to draw general conclusions that utilization and cost differ by age. Bachler et al. (2006) studied the impact of chronic and nonchronic insured members on cost trends and found that classification of chronic members can affect the trends significantly.

Recently, there have been many more studies on healthcare analytics than ten years ago. Duncan (2011) devoted a book to healthcare analytics for actuaries. In particular, this book covers the essentials of health risk and case studies on the use of predictive modeling in risk adjustment. For example, Chapter 14 of this book discusses the risk adjustment method used by the Centers for Medicare and Medicaid Services (CMS). Frees et al. (2011) extended the standard two-part model to predict the frequency and amount of healthcare expenditures and used data from the Medical Expenditure Panel Survey (MEPS) to calibrate and test the model. Shi and Zhang (2013) built a game-theoretic model based on copulas to study the effect of managed care on healthcare utilization compared to traditional fee-for-service plans in the private health insurance market.

Getzen (2016) developed a method to evaluate projections of future medical spending used by the U.S. Medicare and Medicaid programs and found that the recent set of projections (1998–2010) is more accurate than the older set of projections (1986–1995) because the recent projections incorporate lagged macroeconomic effects. Duncan et al. (2016) investigated several statistical techniques (e.g., Lasso GLM, multivariate adaptive regression splines, random forests, decision trees, and boosted trees) for modeling future healthcare costs and found that the traditional regression approach does not perform well.

Huang et al. (2017) compared different models and model selection methods for health costs in Episode Treatment Groups (ETGs) and found that random forest feature selection is preferable in terms of efficiency and accuracy.

Richardson and Hartman (2018) proposed a Bayesian nonparametric regression model for predicting healthcare claims. Their numerical results show that the Bayesian nonparametric model outperforms standard linear regression and generalized Beta regression in terms of predictive accuracy.

Brockett et al. (2018) investigated the use of Data Envelopment Analysis (DEA), which is a method developed in management science to measure the relative efficiency of organizations, to assess the potential savings of Medicare. Their analysis shows that Medicare Advantage plans are more efficient in reducing health expenditures than the original Medicare. In Brockett et al. (2019), the authors used linear regression models to estimate the effect of the medical loss ratio (MLR) and efficiency on the quality of care for Medicaid managed care plans. The results show that the effect of medical services efficiency on the quality of care is insignificant.

Kan et al. (2019) explored the use of machine learning techniques for risk adjustment and found that penalized regression performs better than ordinary least squares in predicting healthcare costs.

The work presented in this paper relates to healthcare analytics but differs from the work mentioned above. In particular, we examine Beta regression models with spatial dependence to model the percentages of brand name drug claims. The methods and findings of this paper can be used by practitioners in their risk modeling and adjustment.

3 Description of the Data

We use a public dataset called the Part D Prescriber Public Use File (PUF) from the Centers for Medicare & Medicaid Services (CMS) and another public dataset called Individual Income Tax Statistics from the IRS (Internal Revenue Service).

The data in the Part D Prescriber PUF cover calendar years 2013 through 2016 and contain information on prescription drug events (PDEs) incurred by Medicare beneficiaries with a Part D prescription drug plan. The data consist of a detailed data file and two summary tables. The “Part D Prescriber Summary Table” contains overall drug utilization, drug costs, and beneficiary counts organized by National Provider Identifier (NPI). In this paper, we use the 2016 Part D Prescriber Summary Table, which contains 1,131,550 records and 84 variables (the file name is PartD_Prescriber_PUF_NPI_16.txt; it is available from https://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/Medicare-Provider-Charge-Data/PartD2016, accessed on Jan 20, 2020). The Part D Prescriber Summary Table contains information about the brand and generic drug claims at the provider level. Other information includes the zip code of providers, the count of beneficiaries, the average age of beneficiaries, and the average risk score of beneficiaries.

In this paper, our interest is in studying variations of the brand name drug claim rates in different areas of the U.S. To that end, we need to aggregate the data by area. However, the data can be aggregated at different levels: the state level, the zip code level, or some customized level between the two. On the one hand, aggregating the data to the state level is not desirable for the following reasons:

  • The brand name drug claim rate exhibits variations within individual states.

  • The U.S. has only 50 states and aggregating the data to the state level produces only 50 data points.

On the other hand, aggregating the data to the zip code level is also not desirable:

  • The dataset contains about 20,000 different zip codes. Aggregating the data to the zip code level will produce about 20,000 data points, which causes challenges in modeling the spatial effects, as a large number of sites requires a large number of parameters.

  • Some zip codes do not correspond to geographical areas but to large-volume customers or post office boxes.

  • Aggregating the data to the zip code level produces highly volatile brand name drug claim rates; that is, the rates vary greatly between zip codes.

As a result, we decide to aggregate the data to some customized level between the state level and the zip code level.

Figure 1: A mesh of the 48 contiguous states with 530 triangles.

To aggregate the data to a customized level, we first need to split the U.S. land into small areas, i.e., create a mesh of the U.S. land. This can be done conveniently by using the R function inla.mesh.2d from the INLA package. Figure 1 shows a mesh of the 48 contiguous states of the U.S. that consists of 530 triangles. Aggregating the data to the 530 triangular areas will produce a dataset with about 530 data points. This is computationally reasonable for modeling spatial effects.
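As an illustration, such a mesh can be produced along the following lines; this is a minimal sketch in which the max.edge and cutoff values are illustrative placeholders rather than the settings behind Figure 1, and coords is assumed to be a two-column matrix of zip code longitudes and latitudes.

```r
library(INLA)

# Sketch of the mesh construction; max.edge and cutoff are illustrative
# values, not the settings used for Figure 1. 'coords' is assumed to be
# a matrix with columns (longitude, latitude) of the zip codes.
mesh <- inla.mesh.2d(loc = coords, max.edge = c(3, 6), cutoff = 1)
mesh$n      # number of vertices in the triangulation
plot(mesh)  # visualize the mesh, as in Figure 1
```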

Aggregating the data to the triangles shown in Figure 1 is done as follows. First, we obtain the longitudes and the latitudes of zip codes from the R package noncensus, which contains regional information and demographic data determined by the U.S. Census Bureau. Note that each zip code is associated with a longitude and a latitude. Second, we determine the triangles to which the zip codes belong by using the longitudes and the latitudes. Finally, we aggregate the data based on the indices of the triangles.
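A minimal sketch of the second step is given below, assuming zip_ll is a two-column matrix of zip code longitudes and latitudes; it assigns each point to a triangle with a barycentric point-in-triangle test using the mesh components mesh$loc (vertex coordinates) and mesh$graph$tv (triangle-vertex indices).

```r
# Assign each (longitude, latitude) point to the index of the mesh
# triangle containing it, using barycentric coordinates. Points outside
# every triangle keep NA.
assign_triangle <- function(pts, mesh) {
  tv  <- mesh$graph$tv      # one row of 3 vertex indices per triangle
  vx  <- mesh$loc[, 1:2]    # vertex coordinates
  idx <- rep(NA_integer_, nrow(pts))
  for (k in seq_len(nrow(tv))) {
    a <- vx[tv[k, 1], ]; b <- vx[tv[k, 2], ]; c <- vx[tv[k, 3], ]
    d  <- (b[2] - c[2]) * (a[1] - c[1]) + (c[1] - b[1]) * (a[2] - c[2])
    l1 <- ((b[2] - c[2]) * (pts[, 1] - c[1]) + (c[1] - b[1]) * (pts[, 2] - c[2])) / d
    l2 <- ((c[2] - a[2]) * (pts[, 1] - c[1]) + (a[1] - c[1]) * (pts[, 2] - c[2])) / d
    l3 <- 1 - l1 - l2
    hit <- l1 >= 0 & l2 >= 0 & l3 >= 0 & is.na(idx)
    idx[hit] <- k
  }
  idx
}

# zip_ll is an assumed matrix with columns (longitude, latitude):
# tri_id <- assign_triangle(zip_ll, mesh)
```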

The Part D Summary Table contains average ages and average risk scores of the beneficiaries. The average risk scores are average HCC (Hierarchical Condition Category) risk scores, which estimate how beneficiaries’ FFS (Fee-For-Service) spending will compare to the overall average of the entire Medicare population. Before aggregating the data, we convert average ages and average risk scores to total ages and total risk scores by multiplying the averages by the count of beneficiaries. After aggregation is done, we convert total ages and total risk scores back to average ages and average risk scores by dividing the totals by the aggregated count of beneficiaries. The Part D Summary Table also contains missing values; we remove records with missing values before aggregating the data.
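The average-to-total-to-average conversion can be sketched as follows; the data frame partd and its column names (tri_id, bene_count, avg_age, avg_score, brand_claims, total_claims) are hypothetical stand-ins for the actual Part D fields.

```r
# Convert averages to totals, aggregate by triangle, then convert back.
partd$total_age   <- partd$avg_age   * partd$bene_count
partd$total_score <- partd$avg_score * partd$bene_count

agg <- aggregate(cbind(bene_count, total_age, total_score,
                       brand_claims, total_claims) ~ tri_id,
                 data = partd, FUN = sum)

agg$avgage    <- agg$total_age    / agg$bene_count
agg$avgscore  <- agg$total_score  / agg$bene_count
agg$brandrate <- agg$brand_claims / agg$total_claims  # response in Table 1
```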

Figure 2: Distributions of the brand name drug claim rates. White triangles mean that data are not available in these areas.

Figure 2(a) shows the distribution of the brand name drug claim rates in different triangular areas. From the figure, we see that 19 triangular areas have no data. As a result, the aggregated dataset contains 511 observations. From the figure, we also see that some triangles in the middle are light red. This suggests that modeling the spatial effects may improve the fit of Beta regression models to the data.

Min 1st Q Median Mean 3rd Q Max
brandrate 0.1253 0.1784 0.1886 0.1890 0.1986 0.4118
avgage 65.79 70.06 70.89 70.98 71.87 80.00
avgscore 0.9129 1.2102 1.3240 1.3170 1.4199 1.8166
Table 1: Summary statistics of the Part D data.

Table 1 shows the summary statistics of the aggregated data. From the table, we see that the brand name drug claim rate varies from 12.53% to 41.18% among the 511 areas. The range of the average age of beneficiaries is from 65.79 to 80. The average risk score ranges from 0.9129 to 1.8166. The closeness of the median and mean values indicates that these variables are fairly symmetric.

Another data source we use in our study is the 2016 individual income tax statistics by zip code (the file name is 16zpallagi.csv; it is available from https://www.irs.gov/statistics/soi-tax-stats-individual-income-tax-statistics-2016-zip-code-data-soi, accessed on Jan 12, 2020). This dataset contains 29,874 records, each of which is described by 144 variables. It contains information about the number of returns and the amount of returns in different categories. For example, it contains the number of returns with unemployment compensation and the unemployment compensation amount. The return counts and amounts are aggregated to the zip code level. Although this dataset does not contain all information about an area, it does provide some demographic and economic information, which might be useful for explaining the variation of the brand name drug claim rate. For example, the number of returns from an area is related to the population of the area.

We aggregate the tax data to the 530 triangular areas in the same way as we aggregate the Part D data. After aggregating the data, we divide the total dollar amount of an area by the corresponding number of returns to get the average dollar amount of the area. The total number of returns and the average amounts in different categories are used in the data analysis. Since this dataset has 144 variables, we do not show the summary statistics of these variables here.

4 Models

In this section, we describe some Beta regression models. In particular, we present four models: the basic Beta regression model, the Beta regression model with random effects, the Beta-Besag model, and the Beta-BYM model.

The density function of the Beta distribution is typically defined as (Klugman et al., 2012):

$$f(y; \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}\, y^{\alpha - 1}(1-y)^{\beta - 1}, \quad 0 < y < 1, \qquad (1)$$

where $\alpha > 0$ and $\beta > 0$ are shape parameters. The Beta distribution defined in Equation (1) can be reparameterized by using its mean $\mu$ and dispersion $\phi$ as parameters:

$$f(y; \mu, \phi) = \frac{\Gamma(\phi)}{\Gamma(\mu\phi)\Gamma\big((1-\mu)\phi\big)}\, y^{\mu\phi - 1}(1-y)^{(1-\mu)\phi - 1}, \quad 0 < y < 1, \qquad (2)$$

where $\mu = \alpha/(\alpha + \beta)$ and $\phi = \alpha + \beta$. The shape parameters can be obtained from the mean and the dispersion as follows: $\alpha = \mu\phi$ and $\beta = (1-\mu)\phi$.
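The reparameterization is easy to exercise with base R's dbeta(); the values in the example call are purely illustrative.

```r
# Evaluate the Beta density in the (mu, phi) parameterization by
# converting back to the shape parameters alpha = mu*phi and
# beta = (1 - mu)*phi.
dbeta_mu_phi <- function(y, mu, phi) {
  dbeta(y, shape1 = mu * phi, shape2 = (1 - mu) * phi)
}
dbeta_mu_phi(0.19, mu = 0.19, phi = 390)  # illustrative values
```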

The basic Beta regression model is described as follows. Suppose that we have $n$ observations. For $i = 1, 2, \ldots, n$, let $\mathbf{x}_i = (x_{i1}, \ldots, x_{ik})^\top$ and $y_i$ be the vector of $k$ regressors and the response in the $i$th case, respectively. The responses are assumed to form a random sample such that

$$y_i \sim \mathrm{Beta}(\mu_i, \phi), \quad i = 1, 2, \ldots, n.$$

The mean $\mu_i$ is linked to the regressors as follows:

$$g(\mu_i) = \mathbf{x}_i^\top \boldsymbol{\beta}, \qquad (3)$$

where $g(\cdot)$ is the link function and $\boldsymbol{\beta} = (\beta_1, \ldots, \beta_k)^\top$ is the vector of regression coefficients. The variance of $y_i$ can be estimated as

$$\mathrm{Var}(y_i) = \frac{\mu_i(1-\mu_i)}{1 + \phi}. \qquad (4)$$

Random effects models are commonly used to analyze areal or spatial data. For example, Tufvesson et al. (2019) applied random effects models to model car insurance data with geographical locations of the policyholders. There are two types of random effects models: unstructured and structured. Unstructured random effects models assume independence of the random effects, while structured random effects models allow for spatial dependence.

The second model we consider is a Beta regression model with a single random effect. In this model, a random effect is introduced to the linear predictor as follows:

$$g(\mu_i) = \mathbf{x}_i^\top \boldsymbol{\beta} + v_i, \qquad (5)$$

where the $v_i$'s are i.i.d. Gaussian noise, i.e.,

$$v_i \sim N\!\left(0, \tau_v^{-1}\right), \qquad (6)$$

where $\tau_v$ is a precision parameter. This model helps to control for unobserved heterogeneity when the heterogeneity is not correlated with the independent variables. The basic Beta regression model, in contrast, assumes that all heterogeneity is explained by the independent variables.

The third model we consider is the Besag model, which introduces a structured random effect to the linear predictor:

$$g(\mu_i) = \mathbf{x}_i^\top \boldsymbol{\beta} + u_i, \qquad (7)$$

where the $u_i$'s follow a CAR(1) model, i.e., an order-1 conditional autoregressive model (Besag and Kooperberg, 1995). Building a CAR model requires a neighborhood graph, which tells which areas are neighbors of each other. Figure 3 shows a neighborhood graph where areas that are neighbors are connected by blue lines. In our models, we assume that triangles that share a vertex or a side are neighbors of each other.

Figure 3: The neighborhood graph used to model spatial effects.
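A sketch of constructing this neighborhood graph from the mesh is given below, using the rule that two triangles are neighbors if they share at least one vertex (which also covers triangles sharing a side); the resulting adjacency matrix is what the spatial model terms consume later.

```r
# Build the triangle adjacency matrix from the mesh triangulation:
# triangles sharing a vertex are marked as neighbors.
tv    <- mesh$graph$tv
n_tri <- nrow(tv)
adj   <- matrix(0L, n_tri, n_tri)
for (i in seq_len(n_tri - 1)) {
  for (j in (i + 1):n_tri) {
    if (length(intersect(tv[i, ], tv[j, ])) > 0) {
      adj[i, j] <- 1L
      adj[j, i] <- 1L
    }
  }
}
```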

Given a neighborhood graph, the CAR(1) model assumes that the random effect in an area is related to the random effects in the area's neighbors through a conditional Gaussian distribution:

$$u_i \mid \mathbf{u}_{-i} \sim N\!\left(\frac{1}{|N(i)|}\sum_{j \in N(i)} u_j,\; \frac{1}{|N(i)|\,\tau_u}\right), \qquad (8)$$

where $N(i)$ is the set of indices of neighbors of the $i$th area, $|N(i)|$ denotes the number of elements in $N(i)$, and $\tau_u$ is a precision parameter. Note that the CAR(1) model can be written as a multivariate Gaussian model for $\mathbf{u} = (u_1, \ldots, u_n)^\top$ (Besag and Kooperberg, 1995; Wang et al., 2018):

$$\mathbf{u} \sim N\!\left(\mathbf{0},\; \tau_u^{-1}\mathbf{Q}^{-}\right), \qquad (9)$$

where $\mathbf{Q}^{-}$ denotes the generalized inverse of $\mathbf{Q}$, a highly sparse matrix defined by

$$Q_{ij} = \begin{cases} |N(i)|, & i = j, \\ -1, & j \in N(i), \\ 0, & \text{otherwise.} \end{cases}$$
The fourth model is the BYM (Besag-York-Mollié) model, which combines an unstructured random effect and a structured random effect (Besag et al., 1991):

$$g(\mu_i) = \mathbf{x}_i^\top \boldsymbol{\beta} + u_i + v_i, \qquad (10)$$

where the $u_i$'s are modeled by a CAR(1) model as specified in Equation (8) and the $v_i$'s are i.i.d. Gaussian noise as specified in Equation (6). The structured effects $u_i$ model the spatial effect. The unstructured effects $v_i$ model additional random variation that is not explained by geographical location. Between the two effects $u$ and $v$, a larger effect of $v$ in the model means that less exchange of information between areas is allowed. The relative strengths of the unstructured effects and the structured effects are controlled by the two hyperparameters $\tau_v$ and $\tau_u$, respectively.

Model name Linear predictor
BetaReg Equation (3)
BetaRE Equation (5)
BetaBesag Equation (7)
BetaBYM Equation (10)
Table 2: Linear predictors of the four models.

Table 2 summarizes the four Beta regression models described above. The first two models do not consider spatial dependence, while the last two models allow for spatial dependence.
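In R-INLA notation, the four linear predictors of Table 2 can be sketched as follows. This is a hedged sketch reflecting our reading of the R-INLA interface: tri_id (the triangle index of each observation), adj (the adjacency matrix built from Figure 3), and the use of the covariates later selected in Table 3 are assumptions of the sketch.

```r
# Fixed-effect part: the covariates listed in Table 3.
f_betareg <- brandrate ~ avgscore + VITA + A00900 + A02300 + A03150 +
  A03230 + A18450 + A18800 + A07230 + A85770 + A11070 + A11902

# Unstructured (iid), structured (Besag/CAR), and combined (BYM) random
# effects, indexed by the triangle id.
f_betare    <- update(f_betareg, . ~ . + f(tri_id, model = "iid"))
f_betabesag <- update(f_betareg, . ~ . + f(tri_id, model = "besag", graph = adj))
f_betabym   <- update(f_betareg, . ~ . + f(tri_id, model = "bym", graph = adj))
```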

There are several choices of the link function $g(\cdot)$ (Ferrari and Cribari-Neto, 2004):

  • the logit link: $g(\mu) = \log\{\mu/(1-\mu)\}$;

  • the probit link: $g(\mu) = \Phi^{-1}(\mu)$, where $\Phi(\cdot)$ is the standard normal distribution function;

  • the log-log link: $g(\mu) = -\log\{-\log(\mu)\}$;

  • the complementary log-log link: $g(\mu) = \log\{-\log(1-\mu)\}$;

  • the Cauchy link: $g(\mu) = \tan\{\pi(\mu - 0.5)\}$.

Figure 4 shows the plots of these link functions. The logit, probit, and Cauchy link functions are symmetric around the point $\mu = 0.5$. The other two link functions are not symmetric around this point.

Figure 4: Some link functions for Beta regression models.

5 Results

In this section, we present the results of fitting the four Beta regression models (see Table 2) to the brand name drug claim data.

5.1 Data transformation

Before we fit the models, we transform the covariates by using the Yeo–Johnson transformation introduced by Yeo and Johnson (2000). The purpose of transforming the covariates is to reduce their magnitudes for the convenience of model fitting. The Yeo–Johnson transformation is similar to the power transformation but can handle zero and negative values. In this transformation, a value $x$ is transformed as follows:

$$\psi(x, \lambda) = \begin{cases} \{(x+1)^{\lambda} - 1\}/\lambda, & x \ge 0,\; \lambda \ne 0, \\ \log(x+1), & x \ge 0,\; \lambda = 0, \\ -\{(-x+1)^{2-\lambda} - 1\}/(2-\lambda), & x < 0,\; \lambda \ne 2, \\ -\log(-x+1), & x < 0,\; \lambda = 2. \end{cases} \qquad (11)$$

We transform all covariates except for the average risk score before fitting the models.
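Equation (11) can be implemented directly; the lambda value in the example call is arbitrary, not the value used in the paper.

```r
# Yeo-Johnson transformation of Equation (11).
yeo_johnson <- function(x, lambda) {
  out <- numeric(length(x))
  p <- x >= 0
  out[p] <- if (abs(lambda) > 1e-8) {
    ((x[p] + 1)^lambda - 1) / lambda
  } else {
    log(x[p] + 1)
  }
  out[!p] <- if (abs(lambda - 2) > 1e-8) {
    -((-x[!p] + 1)^(2 - lambda) - 1) / (2 - lambda)
  } else {
    -log(-x[!p] + 1)
  }
  out
}
yeo_johnson(c(-3, 0, 10, 1000), lambda = 0.5)  # arbitrary lambda
```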

After transforming the data, we split the dataset into two sets: one for fitting the models and one for validating the models. We use 80% of the data for fitting and the remaining 20% for evaluation. Figure 5 shows the distribution of the brand name drug claim rates for the training set and the test set.
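The split can be sketched as follows; the seed and the data frame name agg_data are illustrative choices, not the paper's.

```r
# 80/20 split of the 511 aggregated observations.
set.seed(2020)                       # arbitrary seed
n_obs     <- nrow(agg_data)
train_idx <- sample(n_obs, size = floor(0.8 * n_obs))
train     <- agg_data[train_idx, ]
test      <- agg_data[-train_idx, ]
```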

(a) The training set.
(b) The test set.
Figure 5: The distribution of the brand name drug claim rates for the training and test sets.

5.2 Selection of covariates

The data contain 145 covariates, which include 143 from the individual income tax file and 2 (i.e., avgage and avgscore) from the Part D Summary Table. Including all these covariates in the models causes parameter identification problems because many of these covariates are highly correlated.

To select covariates for fitting the models, we use lasso regularized generalized linear models (GLMs), which are implemented in the R package glmnet (Friedman et al., 2010). Tufvesson et al. (2019) used this method to select covariates. However, the package glmnet does not support Beta regression models. To circumvent this problem, we first create a new binary response variable by comparing the brand name drug claim rates to the average rate. If a rate is above the average, the corresponding response value is 1; otherwise, it is 0. We then use the lasso regularized logistic regression supported by glmnet to select covariates. We assume that the covariates that help to separate high and low rates can also serve as useful predictors of the rates themselves.
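A hedged sketch of this screening step with glmnet is shown below; covariate_names, a vector naming the 145 candidate covariate columns, is an assumption of the sketch.

```r
library(glmnet)

# Binary response: 1 if the claim rate is above the training average.
y_bin <- as.integer(train$brandrate > mean(train$brandrate))
x_mat <- as.matrix(train[, covariate_names])   # candidate covariates (assumed)

# Cross-validated lasso-regularized logistic regression (alpha = 1).
cvfit <- cv.glmnet(x_mat, y_bin, family = "binomial", alpha = 1)
plot(cvfit)   # cross-validation curve, as in Figure 6

# Covariates with nonzero coefficients at the one-standard-error penalty.
sel <- as.matrix(coef(cvfit, s = "lambda.1se"))
setdiff(rownames(sel)[sel[, 1] != 0], "(Intercept)")
```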

Figure 6: Results from a cross-validation run of the lasso regularized logistic GLM based on the training set.

Figure 6 shows the results from a cross-validation run of the lasso regularized logistic GLM with different penalties $\lambda$. In the figure, the left dashed vertical line corresponds to the penalty that minimizes the cross-validation error. The second dashed vertical line corresponds to the largest penalty whose error is within one standard error of the minimal error. We use the model with the latter penalty to select covariates. This gives us 12 covariates, which are described in Table 3. These 12 selected covariates are used to fit the Beta regression models.

Covariate Description
avgscore Average risk score
VITA Number of volunteer income tax assistance (VITA) prepared returns
A00900 Business or professional net income (less loss) amount
A02300 Unemployment compensation amount
A03150 Individual retirement arrangement payments amount
A03230 Tuition and fees deduction amount
A18450 State and local general sales tax amount
A18800 Personal property taxes amount
A07230 Nonrefundable education credit amount
A85770 Total premium tax credit amount
A11070 Additional child tax credit amount
A11902 Overpayments refunded amount
Table 3: Description of the selected covariates.
Figure 7: Correlation of the selected covariates and the response variable.

Figure 7 shows a correlation plot of the response variable and the covariates. From the figure, we see that many of these variables are positively correlated. The variable A11070 is negatively correlated with a few other variables.

5.3 Fitting of the models

Bayesian methods have been widely used to analyze spatial data since around 2000, following the advent of Markov chain Monte Carlo (MCMC) simulation methods (Blangiardo and Cameletti, 2015). In fact, the advent of MCMC has allowed researchers to use Bayesian methods to develop complex models on large datasets. However, one major drawback of MCMC methods is that they are computationally demanding, especially for large datasets. INLA (Integrated Nested Laplace Approximation) is an alternative to MCMC for Bayesian inference (Rue and Held, 2005; Rue et al., 2009). A major advantage of INLA is that it is a deterministic algorithm capable of producing accurate results quickly. Since INLA was made available in R through the R package INLA, it has become popular among researchers and practitioners. For example, Tufvesson et al. (2019) used INLA to model car insurance claim frequency and severity. In this paper, we use the R package INLA (Blangiardo and Cameletti, 2015; Wang et al., 2018) to fit the four Beta regression models.
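A hedged sketch of fitting the BetaBYM model with R-INLA is shown below, reusing the formula sketched after Table 2; the way the link is passed through control.family reflects our reading of the INLA interface and should be checked against the installed version, and the default priors are kept, as in the paper.

```r
library(INLA)

# Fit the BetaBYM model with the log-log link; DIC and WAIC are computed
# for comparison as in Table 4.
fit_bym <- inla(f_betabym,
                family            = "beta",
                data              = train,
                control.family    = list(control.link = list(model = "loglog")),
                control.compute   = list(dic = TRUE, waic = TRUE),
                control.predictor = list(compute = TRUE, link = 1))

summary(fit_bym)                        # posterior summaries as in Tables 6-9
c(DIC = fit_bym$dic$dic, WAIC = fit_bym$waic$waic)
```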

The four models described in Section 4 can be formulated as Bayesian hierarchical models. For example, the BetaBYM model can be expressed as

$$y_i \mid \mu_i, \phi \sim \mathrm{Beta}\big(\mu_i\phi, (1-\mu_i)\phi\big), \qquad (12a)$$
$$g(\mu_i) = \mathbf{x}_i^\top \boldsymbol{\beta} + u_i + v_i, \qquad (12b)$$
$$\boldsymbol{\beta} \sim N\!\left(\mathbf{0}, \tau_\beta^{-1}\mathbf{I}\right), \qquad (12c)$$
$$\mathbf{u} \sim N\!\left(\mathbf{0}, \tau_u^{-1}\mathbf{Q}^{-}\right), \quad v_i \sim N\!\left(0, \tau_v^{-1}\right), \qquad (12d)$$
$$(\tau_u, \tau_v) \sim \pi(\tau_u, \tau_v), \qquad (12e)$$

where $\pi$ denotes a prior distribution for the two hyperparameters $\tau_u$ and $\tau_v$. Common choices for the prior distribution of $(\tau_u, \tau_v)$ include independent gamma distributions. In this paper, we use the default priors in INLA.

5.4 Model selection and validation

The deviance information criterion (DIC) of Spiegelhalter et al. (2002) and the Watanabe-Akaike information criterion (WAIC) of Watanabe (2010) are commonly used to select Bayesian models. The DIC is motivated by the Akaike information criterion (AIC) and is defined as

$$\mathrm{DIC} = \overline{D} + p_D, \qquad (13)$$

where $\overline{D} = E_{\theta \mid \mathbf{y}}[D(\theta)]$ is the posterior mean of the deviance of the model and $p_D$ denotes the effective number of parameters, which is defined as

$$p_D = \overline{D} - D(\overline{\theta}),$$

with $\overline{\theta}$ the posterior mean of the parameters. In a Bayesian model, the deviance is a random variable, and the expected deviance under the posterior distribution is used as a measure of fit. Between two models, the model with a lower DIC value is preferred.

The WAIC is similar to the DIC, but the effective number of parameters is calculated differently. The WAIC is defined as

$$\mathrm{WAIC} = -2\sum_{i=1}^{n} \log E_{\theta \mid \mathbf{y}}\big[p(y_i \mid \theta)\big] + 2\,p_{\mathrm{WAIC}}, \qquad (14)$$

where

$$p_{\mathrm{WAIC}} = \sum_{i=1}^{n} \mathrm{Var}_{\theta \mid \mathbf{y}}\big[\log p(y_i \mid \theta)\big].$$

The WAIC is interpreted in the same way as the DIC: the lower the WAIC, the more preferable the model. Between the DIC and the WAIC, Gelman et al. (2014) argue that the WAIC is preferred.

We also use two measures to validate the out-of-sample performance of the models. The first measure is the concordance correlation coefficient (CCC), which measures the agreement between two variables. The CCC is defined as (Lin, 1989)

$$\mathrm{CCC} = \frac{2\rho\,\sigma_o\sigma_p}{\sigma_o^2 + \sigma_p^2 + (\mu_o - \mu_p)^2}, \qquad (15)$$

where $\rho$ is the correlation between the observed values and the predicted values, $\sigma_o$ and $\mu_o$ are the standard deviation and the mean of the observed values, respectively, and $\sigma_p$ and $\mu_p$ are the standard deviation and the mean of the predicted values, respectively. The value of the CCC ranges from $-1$ to 1, with a value of 1 indicating perfect agreement between the predicted values and the observed values. Between two models, a higher CCC value means a better model.

The second measure is the well-known residual standard error (RSE), which is defined as

$$\mathrm{RSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}, \qquad (16)$$

where $y_i$ and $\hat{y}_i$ represent the $i$th observed value and the $i$th predicted value, respectively. Between two models, the model with a lower RSE value is better.
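The two validation measures are straightforward to code; the helpers below implement Equations (15) and (16) as written above.

```r
# Concordance correlation coefficient, Equation (15).
ccc <- function(obs, pred) {
  2 * cor(obs, pred) * sd(obs) * sd(pred) /
    (var(obs) + var(pred) + (mean(obs) - mean(pred))^2)
}

# Residual standard error, Equation (16).
rse <- function(obs, pred) {
  sqrt(mean((obs - pred)^2))
}
```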

5.5 Results

Table 4 shows the DIC and the WAIC from fitting the four Beta regression models with different link functions to the training dataset. If we look at the rows, we see that the performance across the five link functions is quite similar, except for the Cauchy link. If we look at the columns, we see that the BetaBYM model performs the best. In terms of the DIC, the BetaBYM model with the Cauchy link performs the best; in terms of the WAIC, the BetaBYM model with the log-log link is the best. However, the performance of the BetaBYM model is quite similar across the first four link functions, so using the logit link function is not a bad choice.

Model Logit Probit Loglog Cloglog Cauchy
DIC:
BetaReg -2103.82 -2102.92 -2101.53 -2104.36 -2109.25
BetaRE -2103.60 -2102.20 -2100.60 -2104.10 -2109.79
BetaBesag -2105.79 -2109.89 -2109.58 -2107.41 -2114.87
BetaBYM -2116.93 -2115.13 -2112.55 -2118.06 -2132.92
WAIC:
BetaReg -2070.47 -2070.84 -2071.36 -2069.72 -2062.49
BetaRE -2070.60 -2071.07 -2071.56 -2069.88 -2063.02
BetaBesag -2077.25 -2081.93 -2083.42 -2077.41 -2069.28
BetaBYM -2089.08 -2091.05 -2091.60 -2088.89 -2087.56
Table 4: In-sample performance of the models with different link functions (top block: DIC; bottom block: WAIC).
Model Logit Probit Loglog Cloglog Cauchy
CCC:
BetaReg 0.42846 0.42810 0.42749 0.42801 0.42138
BetaRE 0.42834 0.42790 0.42733 0.42786 0.42126
BetaBesag 0.45289 0.46855 0.47168 0.45622 0.44610
BetaBYM 0.48595 0.48954 0.49127 0.48584 0.47814
RSE:
BetaReg 0.01380 0.01376 0.01371 0.01383 0.01412
BetaRE 0.01380 0.01376 0.01371 0.01383 0.01413
BetaBesag 0.01364 0.01352 0.01342 0.01367 0.01412
BetaBYM 0.01359 0.01352 0.01340 0.01367 0.01445
Table 5: Out-of-sample performance of the models with different link functions (top block: CCC; bottom block: RSE).

Table 5 shows the out-of-sample performance of the models with different link functions on the test set. In terms of the CCC, the BetaBYM model with the log-log link performs the best because the corresponding CCC value is the highest among the 20 cases. In terms of the RSE, the BetaBYM model with the log-log link is also the best, as it has the lowest RSE value.
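The entries of Table 5 can be reproduced along the following lines. One common R-INLA pattern is to refit the model on the full dataset with the test responses set to NA, so that their fitted values are out-of-sample predictions; the sketch below assumes such a fit, fit_full, and an index vector test_rows for the test observations (both are assumptions, not objects defined in the paper).

```r
# Out-of-sample CCC and RSE for the BetaBYM model; 'fit_full' is assumed
# to be an inla() fit on the full data with test responses set to NA.
pred_test <- fit_full$summary.fitted.values$mean[test_rows]
c(CCC = ccc(test$brandrate, pred_test),
  RSE = rse(test$brandrate, pred_test))
```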

(a) The basic Beta regression model.
(b) The BetaRE model.
(c) The BetaBesag model.
(d) The BetaBYM model.
Figure 8: Scatter plots of the observed brand name drug claim rates and those predicted by the four models with the log-log link function. The out-of-sample predictions are plotted as blue dots, while the in-sample fitted values are plotted as black circles.

Figure 8 shows the scatter plots of the observed brand name drug claim rates and the predicted values produced by the four models with the log-log link function. From the four scatter plots, we see that the predicted values of the four models are quite similar. We also see that none of the four models fits the large rates well: the highest predicted value is around 0.25, while the largest observed value is around 0.41 (see Table 1). Figure 9 shows the distributions of the brand name drug claim rates predicted by the four models with the log-log link function across the triangular areas. The maps look quite similar and it is hard to see the differences. In addition, these maps look similar to the map of the observed rates shown in Figure 2(a).

(a) The basic Beta regression model.
(b) The BetaRE model.
(c) The BetaBesag model.
(d) The BetaBYM model.
Figure 9: Distributions of the brand name drug claim rates predicted by the four models with the log-log link function.
Mean Std 0.025 Q 0.5 Q 0.975 Q.
(Intercept) -0.554 0.106 -0.763 -0.554 -0.345
avgscore 0.061 0.027 0.009 0.061 0.113
VITA 0.003 0.001 0.001 0.003 0.006
A00900 0.004 0.010 -0.017 0.004 0.024
A02300 0.066 0.014 0.039 0.066 0.093
A03150 0.003 0.011 -0.018 0.003 0.025
A03230 -0.002 0.013 -0.027 -0.002 0.023
A18450 -0.044 0.013 -0.069 -0.044 -0.019
A18800 0.002 0.029 -0.055 0.002 0.060
A07230 -0.198 0.044 -0.285 -0.198 -0.111
A85770 0.012 0.011 -0.010 0.012 0.035
A11070 -0.234 0.103 -0.437 -0.234 -0.031
A11902 0.132 0.039 0.057 0.133 0.208
$\phi$ (Beta precision) 388.740 27.220 337.310 388.040 444.170
Table 6: Coefficients and hyperparameters from fitting the basic Beta regression model with the log-log link.
Mean Std 0.025 Q 0.5 Q 0.975 Q.
(Intercept) -0.553 0.107 -0.763 -0.553 -0.344
avgscore 0.061 0.027 0.008 0.061 0.113
VITA 0.003 0.001 0.001 0.003 0.006
A00900 0.004 0.010 -0.017 0.004 0.024
A02300 0.066 0.014 0.039 0.066 0.093
A03150 0.003 0.011 -0.018 0.003 0.025
A03230 -0.002 0.013 -0.027 -0.002 0.023
A18450 -0.044 0.013 -0.069 -0.044 -0.019
A18800 0.002 0.029 -0.055 0.002 0.060
A07230 -0.198 0.045 -0.285 -0.198 -0.111
A85770 0.012 0.011 -0.010 0.012 0.035
A11070 -0.234 0.104 -0.438 -0.234 -0.030
A11902 0.133 0.039 0.057 0.133 0.208
$\phi$ (Beta precision) 390.650 27.960 337.970 389.880 448.040
$\tau_v$ (unstructured precision) 30149.160 22026.050 5634.800 24634.420 87681.270
Table 7: Coefficients and hyperparameters from fitting the Beta regression model with random effects. The log-log link function was used.
Mean Std 0.025 Q 0.5 Q 0.975 Q.
(Intercept) -0.574 0.109 -0.788 -0.574 -0.361
avgscore 0.067 0.027 0.013 0.067 0.121
VITA 0.003 0.001 0.001 0.003 0.006
A00900 0.002 0.011 -0.018 0.002 0.023
A02300 0.065 0.014 0.037 0.065 0.093
A03150 0.004 0.011 -0.017 0.004 0.026
A03230 0.000 0.013 -0.025 0.000 0.025
A18450 -0.036 0.014 -0.064 -0.036 -0.007
A18800 -0.012 0.031 -0.074 -0.012 0.049
A07230 -0.215 0.048 -0.310 -0.215 -0.121
A85770 0.012 0.011 -0.011 0.012 0.034
A11070 -0.206 0.110 -0.421 -0.206 0.011
A11902 0.134 0.039 0.057 0.134 0.210
$\phi$ (Beta precision) 411.830 31.980 350.770 411.430 476.320
$\tau_u$ (structured precision) 3483.030 5884.000 461.760 1839.520 16609.730
Table 8: Coefficients and hyperparameters from fitting the BetaBesag model with the log-log link.
Mean Std 0.025 Q 0.5 Q 0.975 Q.
(Intercept) -0.580 0.111 -0.800 -0.580 -0.362
avgscore 0.071 0.028 0.016 0.071 0.126
VITA 0.003 0.001 0.001 0.003 0.006
A00900 0.001 0.011 -0.020 0.001 0.023
A02300 0.065 0.015 0.036 0.065 0.093
A03150 0.005 0.011 -0.017 0.005 0.026
A03230 0.002 0.013 -0.024 0.002 0.028
A18450 -0.031 0.014 -0.059 -0.031 -0.002
A18800 -0.021 0.031 -0.083 -0.021 0.041
A07230 -0.225 0.049 -0.321 -0.225 -0.128
A85770 0.011 0.012 -0.012 0.011 0.034
A11070 -0.195 0.114 -0.419 -0.196 0.028
A11902 0.135 0.040 0.056 0.135 0.213
$\phi$ (Beta precision) 425.630 33.890 362.380 424.470 495.770
$\tau_v$ (unstructured precision) 6583.970 2917.140 2614.040 6021.720 13846.680
$\tau_u$ (structured precision) 1435.850 1101.810 384.650 1118.950 4360.820
Table 9: Coefficients and hyperparameters from fitting the BetaBYM model with the log-log link.

Tables 6, 7, 8, and 9 show the fitted coefficients and hyperparameters of the four models with the log-log link function. In particular, the tables show the mean, the standard deviation, the 2.5% quantile, the median, and the 97.5% quantile of the estimated parameters. From the 2.5% and the 97.5% quantiles, we get the 95% credibility intervals of the fitted parameters. If we look at the credibility intervals of the fitted coefficients, we see that the credibility intervals of avgscore, VITA, A02300, and A11902 contain only positive values. This means that these variables tend to have a positive impact on the response, i.e., the brand name drug claim rate.

From these tables, we also see two variables whose 95% credibility intervals contain only negative values: A07230 and A18450. From Table 3, A07230 is the nonrefundable education credit amount and A18450 is the state and local general sales tax amount. Since the fitted coefficients of these two variables are negative, the two variables tend to have a negative impact on the brand name drug claim rates.

Tables 7, 8, and 9 also show the estimated hyperparameters of the unstructured random effects and the structured random effects. Note that these hyperparameters are precision parameters, i.e., the inverses of the variances of the random effects: the higher the precision, the lower the variance of the random effects. In Table 7, we see that the estimated precision $\tau_v$ is very large. This means that the random effects component of the model has a very small variance and does not contribute much to explaining the variation of the response variable.

In Table 8, we see that the fitted precision parameter $\tau_u$ of the structured random effects is much lower than the fitted $\tau_v$ in Table 7. In Table 9, we see the same pattern: the fitted $\tau_u$ is much lower than the fitted $\tau_v$. This means that the structured random effects help more than the unstructured random effects in terms of explaining the response variation.

In summary, the above results show that modeling the spatial effect improves the performance of the Beta regression model, while including unstructured random effects improves the performance only marginally. To further benchmark our Beta regression models, we provide additional results using the models suggested in Duncan et al. (2016); see Appendix A.

6 Concluding Remarks

The healthcare sector in the U.S. is a large sector that generates about 20% of the country’s gross domestic product. This significance of healthcare has motivated researchers to develop enhanced tools and approaches to better understand the industry through healthcare analytics. According to the 2019 Predictive Analytics in Health Care Trend Forecast (Society of Actuaries, 2019), predictive analytics is poised to reshape the healthcare industry by achieving three aims: improved patient outcomes, higher quality of care, and reduced costs. In particular, McKinsey estimates that big data analytics can help the U.S. healthcare industry save more than $300 billion per year, with about two thirds of the savings coming from an approximately 8% reduction in national healthcare expenditures.

In this paper, we examined and demonstrated the use of Beta regression models to study the utilization of brand name drugs in the U.S. in order to understand the variability of the brand name drug claim rates across different areas of the U.S. We studied different Beta regression models with and without spatial effects and fitted these models to public datasets obtained from the CMS and the IRS. The numerical results show that Beta regression models are able to fit the brand name drug claim rates well and that modeling the spatial effects improves the performance of the Beta regression models. We also found some significant variables that help to explain the variation of the brand name drug claim rates across different areas. The methods and findings in this paper are useful for healthcare actuaries in their data analysis. By reflecting the effect of prescription drug utilization, these models can be used to update an insured’s risk score in a risk adjustment model. Specifically, healthcare actuaries can incorporate the geographic variation in the models used to select preferred providers.

References

  • R. Bachler, I. Duncan, and I. Juster (2006). A comparative analysis of chronic and nonchronic insured commercial member cost trends. North American Actuarial Journal 10(4), pp. 76–89.
  • J. Besag and C. Kooperberg (1995). On conditional and intrinsic autoregression. Biometrika 82(4), pp. 733–746.
  • J. Besag, J. York, and A. Mollié (1991). Bayesian image restoration, with two applications in spatial statistics. Annals of the Institute of Statistical Mathematics 43, pp. 1–20.
  • M. Blangiardo and M. Cameletti (2015). Spatial and Spatio-Temporal Bayesian Models with R-INLA. Wiley, Hoboken, NJ.
  • H. J. Bolnick (2004). A framework for long-term actuarial projections of health care costs: the importance of population aging and other factors. North American Actuarial Journal 8(4), pp. 1–29.
  • D. Brinberg and V. Cummings (1984). Purchasing generic prescription drugs: an analysis using two behavioral intention models. In NA - Advances in Consumer Research, Vol. 11, pp. 229–234.
  • P. Brockett, L. Golden, C. C. Yang, and D. Young (2019). Medicaid managed care: efficiency, medical loss ratio, and quality of care. North American Actuarial Journal, pp. 1–16.
  • P. L. Brockett, L. L. Golden, and C. C. Yang (2018). Potential “savings” of Medicare: the analysis of Medicare Advantage and accountable care organizations. North American Actuarial Journal 22(3), pp. 458–472.
  • I. Duncan, M. Loginov, and M. Ludkovski (2016). Testing alternative regression frameworks for predictive modeling of health care costs. North American Actuarial Journal 20(1), pp. 65–87.
  • I. Duncan (2011). Health Risk Adjustment and Predictive Modeling. ACTEX Publications, Winsted, CT.
  • S. Ferrari and F. Cribari-Neto (2004). Beta regression for modelling rates and proportions. Journal of Applied Statistics 31(7), pp. 799–815.
  • E. W. Frees, J. Gao, and M. A. Rosenberg (2011). Predicting the frequency and amount of health care expenditures. North American Actuarial Journal 15(3), pp. 377–392.
  • J. Friedman, T. Hastie, and R. Tibshirani (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software 33(1), pp. 1–22.
  • A. Gelman, J. Hwang, and A. Vehtari (2014). Understanding predictive information criteria for Bayesian models. Statistics and Computing 24, pp. 997–1016.
  • T. E. Getzen (2016). Accuracy of long-range actuarial projections of health care costs. North American Actuarial Journal 20(2), pp. 101–113.
  • J. Hamann, R. Mendel, W. Kissling, and S. Leucht (2013). Psychiatrists’ decision making between branded and generic drugs. European Neuropsychopharmacology 23(7), pp. 686–690.
  • M. A. Hassali, A. A. Alrasheedy, A. McLachlan, T. A. Nguyen, S. K. AL-Tamimi, M. I. M. Ibrahim, and H. Aljadhey (2014). The experiences of implementing generic medicine policy in eight countries: a review and recommendations for a successful promotion of generic medicine use. Saudi Pharmaceutical Journal 22(6), pp. 491–503.
  • G. R. Hileman, S. M. Mehmud, and M. A. Rosenberg (2016). Risk scoring in health insurance: a primer. Society of Actuaries. https://www.soa.org/research-reports/2016/2016-risk-scoring-primer/.
  • S. Huang, B. Hartman, and V. Brazauskas (2017). Model selection and averaging of health costs in episode treatment groups. ASTIN Bulletin 47(1), pp. 153–167.
  • H. J. Kan, H. Kharrazi, H. Chang, D. Bodycombe, K. Lemke, and J. P. Weiner (2019). Exploring the use of machine learning for risk adjustment: a comparison of standard and penalized linear regression models in predicting health care costs in older adults. PLOS ONE 14, pp. 1–13.
  • A. S. Kesselheim, J. J. Gagne, J. M. Franklin, W. Eddings, L. A. Fulchino, J. Avorn, and E. G. Campbell (2016). Variations in patients’ perceptions and use of generic drugs: results of a national survey. Journal of General Internal Medicine 31(6), pp. 609–614.
  • E. Klinger (2017). 2017 generic drug access and savings in the U.S. report. https://accessiblemeds.org/resources/blog/2017-generic-drug-access-and-savings-us-report. Accessed on 9/27/2018.
  • S. A. Klugman, H. H. Panjer, and G. E. Willmot (2012). Loss Models: From Data to Decisions. 4th edition, Wiley, Hoboken, NJ.
  • L. I. Lin (1989). A concordance correlation coefficient to evaluate reproducibility. Biometrics 45(1), pp. 255–268.
  • J. P. Petertil (2005). Aging curves for health care costs in retirement. North American Actuarial Journal 9(3), pp. 22–49.
  • R. Richardson and B. Hartman (2018). Bayesian nonparametric regression models for modeling and predicting healthcare claims. Insurance: Mathematics and Economics 83, pp. 1–8.
  • H. Rue and L. Held (2005). Gaussian Markov Random Fields: Theory and Applications. CRC Press, Boca Raton, FL.
  • H. Rue, S. Martino, and N. Chopin (2009). Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 71(2), pp. 319–392.
  • E. Schoonveld (2015). The Price of Global Health: Drug Pricing Strategies to Balance Patient Access and the Funding of Innovation. 2nd edition, Gower Publishing Company, Burlington, VT.
  • P. Shi and W. Zhang (2013). Managed care and health care utilization: specification of bivariate models using copulas. North American Actuarial Journal 17(4), pp. 306–324.
  • Society of Actuaries (2019). 2019 predictive analytics in health care trend forecast. https://www.soa.org/globalassets/assets/Files/programs/predictive-analytics/2019-health-care-trend.pdf. Accessed on 2019-8-23.
  • D. J. Spiegelhalter, N. G. Best, B. P. Carlin, and A. Van Der Linde (2002). Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 64(4), pp. 583–639.
  • O. Tufvesson, J. Lindström, and E. Lindström (2019). Spatial statistical modelling of insurance risk: a spatial epidemiological approach to car insurance. Scandinavian Actuarial Journal, pp. 1–15.
  • X. Wang, Y. R. Yue, and J. Faraway (2018). Bayesian Regression Modeling with INLA. CRC Press, Boca Raton, FL.
  • S. Watanabe (2010). Asymptotic equivalence of Bayes cross validation and widely applicable information criterion in singular learning theory. Journal of Machine Learning Research 11, pp. 3571–3594.
  • I. Yeo and R. A. Johnson (2000). A new family of power transformations to improve normality or symmetry. Biometrika 87(4), pp. 954–959.

Appendix A Additional results

Duncan et al. (2016) tested several alternative regression frameworks for predictive modeling of health care costs. In particular, they tested multiple linear regression models, lasso, multivariate adaptive regression splines (MARS), random forests, M5 decision trees, and boosted trees. For details of these models, readers are referred to Duncan et al. (2016) and the references therein. In this section, we present the out-of-sample prediction results of these models.

Model R Package Parameters
Linear model lm Multiple linear regression without interactions
Lasso glmnet Gaussian family
MARS earth Interactions up to degree = 1; Generalized Cross Validation penalty = 1
Random forests randomForest Number of trees = 100; minimum node size = 10
M5 decision trees Cubist Number of committees = 25
Boosted trees gbm Number of trees = 5000; Shrinkage = 0.01
Table 10: Parameters and the R packages of the additional models.
Model CCC RSE
Linear model 0.4129 0.0141
Lasso 0.0000 0.0154
MARS 0.4420 0.0143
Random forest 0.4657 0.0126
M5 decision tree 0.4330 0.0131
Boosted trees 0.4658 0.0133
Table 11: Out-of-sample prediction results of the additional models.

Table 10 shows the parameters and R packages we used to test the additional models. Table 11 shows the out-of-sample prediction results obtained by these models. Figure 10 shows the scatter plots of the observed brand name drug claim rates and the predicted values by the additional models. The variables input to these models are the same as those selected for the Beta regression models described in Table 3.
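For concreteness, two of the benchmark fits can be sketched as follows, using the tuning parameters stated in Table 10; train and test refer to the 80/20 split of Section 5.1, the formula reuses the covariates in Table 3, and the evaluation reuses the ccc() and rse() helpers sketched in Section 5.4.

```r
library(randomForest)
library(gbm)

form <- brandrate ~ avgscore + VITA + A00900 + A02300 + A03150 + A03230 +
  A18450 + A18800 + A07230 + A85770 + A11070 + A11902

# Random forest with 100 trees and minimum node size 10 (Table 10).
fit_rf <- randomForest(form, data = train, ntree = 100, nodesize = 10)

# Boosted trees with 5000 trees and shrinkage 0.01 (Table 10).
fit_gbm <- gbm(form, data = train, distribution = "gaussian",
               n.trees = 5000, shrinkage = 0.01)

pred_rf  <- predict(fit_rf,  newdata = test)
pred_gbm <- predict(fit_gbm, newdata = test, n.trees = 5000)
c(CCC = ccc(test$brandrate, pred_rf), RSE = rse(test$brandrate, pred_rf))
```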

From Table 11, we see that the lasso model produced the worst result. The reason is that the lasso regression produced an intercept-only model for the data. Figure 10(b) also confirms this, as the predicted values are constant. Comparing Tables 5 and 11, we see that none of these models performs better than the best Beta regression model (BetaBYM with the log-log link) in terms of the CCC. However, the tree-based models (i.e., random forests, M5 decision trees, and boosted trees) produced lower RSE values than the Beta regression models.

(a) The linear regression model.
(b) The lasso model.
(c) The MARS model.
(d) The random forests model.
(e) The M5 decision trees model.
(f) The boosted trees model.
Figure 10: Scatter plots of the observed brand name drug claim rates and those predicted by the additional models.