1 Introduction
In the pharmaceutical industry, a brand name drug is a prescription drug that is developed and patented by a company. The patent gives the company a period of market exclusivity, during which the company sells the brand name drug at a significantly higher price. For more information about drug pricing, readers are referred to Schoonveld (2015). When the patent expires, generic versions of the brand name drug are marketed at lower prices by other companies. Most states allow pharmacists to substitute brand name drugs with generic versions unless otherwise directed by physicians. Nevertheless, according to the Association for Accessible Medicines (AAM) (Klinger, 2017), generic drugs account for 89% of prescriptions dispensed but only 26% of total drug costs in the U.S. Annual prescription spending in the United States has been on the rise. According to the NHE (National Health Expenditure) fact sheet (https://www.cms.gov/ResearchStatisticsDataandSystems/StatisticsTrendsandReports/NationalHealthExpendData/NHEFactSheet, accessed on July 6, 2020), prescription drug spending in the U.S. increased 2.5% to $335.0 billion in 2018, faster than the 1.4% growth in 2017.
This large disparity in the costs between brand name and generic drugs motivated us to investigate the presence of feature variables (e.g., location, income, demographics) that drive the decision to choose one type over the other. It is also our intent to prescribe a model that health insurers can replicate to incorporate these decision drivers into a risk adjustment program (Duncan, 2011), a mechanism widely used for assessing the health risk of an insured (Hileman et al., 2016). This mechanism is an umbrella of commercially designed risk assessment models that are sometimes called risk scoring models. While the models vary, they share a common objective of predicting the health care cost per member per year using the prior year's medical utilization and expenditures as well as additional risk factors such as location and demographics.
The decision to prescribe or purchase generic versus brand name drugs has been studied before. Brinberg and Cummings (1984) used two behavioral intention models to examine the decision to purchase generic prescription drugs and found several differences between individuals who intend to purchase generic prescription drugs and those who do not. For example, nonintenders were less likely than intenders to believe that generic drugs were a safe product. Hamann et al. (2013) studied psychiatrists' decision making between generic and branded antipsychotics and found that psychiatrists were more likely to choose branded drugs when imagining choosing the drug for themselves. Hassali et al. (2014) reviewed experiences of implementing generic medicine policy in eight countries, including the U.S. The review indicates that pharmacists play an essential role in promoting generic medicines, as they substituted 83.8% of prescriptions that allowed substitution, and that generic prescribing is still not common practice in the U.S. because many physicians have negative perceptions about generic medicines and lack in-depth knowledge about the bioequivalence concept applied in the U.S. Kesselheim et al. (2016) studied variations in patients' perceptions and use of generic drugs based on a national survey and found a substantial shift toward more patients holding positive views of generic drugs.
In the aforementioned work, Brinberg and Cummings (1984), Hamann et al. (2013), and Kesselheim et al. (2016) used survey data to study the behavior of patients and physicians. The review conducted by Hassali et al. (2014) was based on a literature search using several electronic databases and search engines, such as ISI (Institute for Scientific Information) Web of Knowledge and Google. Public pharmaceutical drug utilization data have not been used to study the factors driving the decision to choose one type over the other.
In this paper, we investigate the rates of brand name drug claims in the U.S. In particular, we investigate the variations of the brand name drug claim rates in different areas of the U.S. Our study supplements the aforementioned studies based on survey data. Since rates are values in the interval (0, 1), the Beta regression model (Ferrari and Cribari-Neto, 2004) is most suitable, as covariates can be introduced to account for heterogeneity.
The remaining part of the paper is organized as follows. In Section 2, we review some academic work related to health insurance. In Section 3, we describe the data used in our study. In particular, we describe how the data is processed before models are fitted. In Section 4, we present four Beta regression models for fitting the data. In Section 5, we provide numerical results of the proposed Beta regression models. In Section 6, we conclude the paper with some remarks.
2 Related work
Healthcare is an important practicing area of actuaries due to the existence of health insurance. In this section, we review some work that is related to health insurance analytics.
Early work includes Bolnick (2004), Petertil (2005), and Bachler et al. (2006). Bolnick (2004) proposed a framework for long-term healthcare cost projections by incorporating key healthcare cost drivers such as life expectancy, biological morbidity, and economic morbidity. Using the framework, Bolnick (2004) considered various plausible future scenarios that encompass a reasonable range of healthcare cost outcomes. Petertil (2005) studied the relative significance of aging as a driver of healthcare cost beyond age fifty and used Medicare data to draw general conclusions that utilization and cost differ by age. Bachler et al. (2006) studied the impact of chronic and non-chronic insured members on cost trends and found that the classification of chronic members can affect the trends significantly.
Recently, there have been many more studies on healthcare analytics than ten years ago. Duncan (2011) devoted a book to healthcare analytics for actuaries. In particular, this book covers the essentials of health risk and case studies on the use of predictive modeling in risk adjustment. For example, Chapter 14 of this book discusses the risk adjustment method used by the Centers for Medicare and Medicaid Services (CMS). Frees et al. (2011) extended the standard two-part model to predict the frequency and amount of healthcare expenditures and used data from the Medical Expenditure Panel Survey (MEPS) to calibrate and test the model. Shi and Zhang (2013) built a game-theoretic model based on copulas to study the effect of managed care on healthcare utilization compared to traditional fee-for-service plans in the private health insurance market.
Getzen (2016) developed a method to evaluate projections of future medical spending used by the U.S. Medicare and Medicaid programs and found that the recent set of projections (1998–2010) is more accurate than the older set (1986–1995) because the recent projections incorporate lagged macroeconomic effects. Duncan et al. (2016) investigated several statistical techniques (e.g., lasso GLM, multivariate adaptive regression splines, random forests, decision trees, and boosted trees) for modeling future healthcare costs and found that the traditional regression approach does not perform well.
Huang et al. (2017) compared different models and model selection methods for health costs in Episode Treatment Groups (ETGs) and found that random forest feature selection is preferable in terms of efficiency and accuracy.
Richardson and Hartman (2018) proposed a Bayesian nonparametric regression model for predicting healthcare claims. Their numerical results show that the Bayesian nonparametric model outperforms standard linear regression and generalized Beta regression in terms of predictive accuracy.
Brockett et al. (2018) investigated the use of Data Envelopment Analysis (DEA), which is a method developed in management science to measure relative efficiency of organizations, to assess the potential savings of Medicare. Their analysis shows that Medicare Advantage plans are more efficient in reducing health expenditures than the original Medicare. In Brockett et al. (2019), the authors used linear regression models to estimate the effect of medical loss ratio (MLR) and efficiency on the quality of care for the Medicaid managed care plans. The results show that the effect of medical services efficiency on the quality of care is insignificant.
Kan et al. (2019) explored the use of machine learning techniques for risk adjustment and found that penalized regression performs better than ordinary least squares in predicting healthcare costs.
Our work presented in this paper is related to healthcare analytics but differs from the work mentioned above. In particular, we examine Beta regression models with spatial dependence to model the percentages of brand name drug claims. The methods and findings of this paper can be used by practitioners in their risk modeling and adjustment.
3 Description of the Data
We use a public dataset called the Part D Prescriber Public Use File (PUF) from the Centers for Medicare & Medicaid Services (CMS) and another public dataset called Individual Income Tax Statistics from the IRS (Internal Revenue Service).
The data in the Part D Prescriber PUF cover calendar years 2013 through 2016 and contain information on prescription drug events (PDEs) incurred by Medicare beneficiaries with a Part D prescription drug plan. The data consist of a detailed data file and two summary tables. The "Part D Prescriber Summary Table" contains overall drug utilization, drug costs, and beneficiary counts organized by National Provider Identifier (NPI). In this paper, we use the 2016 Part D Prescriber Summary Table, which contains 1,131,550 records and 84 variables (the file name is PartD_Prescriber_PUF_NPI_16.txt, available from https://www.cms.gov/ResearchStatisticsDataandSystems/StatisticsTrendsandReports/MedicareProviderChargeData/PartD2016; accessed on Jan 20, 2020). The Part D Prescriber Summary Table contains information about the brand and generic drug claims at the provider level. Other information includes the zip code of providers, the count of beneficiaries, the average age of beneficiaries, and the average risk score of beneficiaries.
In this paper, our interest is in studying variations of the brand name drug claim rates in different areas of the U.S. To that end, we need to aggregate the data to different areas. However, the data can be aggregated to different levels: the state level, the zip code level, or some customized level between the two. On the one hand, aggregating the data to the state level is not desirable for the following reasons:

- The brand name drug claim rate exhibits variations within individual states.

- The U.S. has only 50 states, so aggregating the data to the state level produces only 50 data points.

On the other hand, aggregating the data to the zip code level is also not desirable:

- The dataset contains about 20,000 different zip codes, so aggregating the data to the zip code level will produce about 20,000 data points. This causes challenges in modeling the spatial effects, as a large number of sites requires a large number of parameters.

- Some zip codes correspond not to geographical areas but to large-volume customers or post office boxes.

- Aggregating the data to the zip code level produces highly volatile brand name drug claim rates; that is, the rates fluctuate strongly between zip codes.

As a result, we decide to aggregate the data to a customized level between the state level and the zip code level.
To aggregate the data to a customized level, we first need to split the U.S. land into small areas, i.e., create a mesh of the U.S. land. This can be done conveniently by using the R function inla.mesh.2d from the INLA package. Figure 1 shows a mesh of the 48 contiguous states of the U.S. that consists of 530 triangles. Aggregating the data to the 530 triangular areas will produce a dataset with about 530 data points. This is computationally reasonable for modeling spatial effects.
Aggregating the data to the triangles shown in Figure 1 is done as follows. First, we obtain the longitudes and the latitudes of zip codes from the R package noncensus, which contains regional information and demographic data determined by the U.S. Census Bureau. Note that each zip code is associated with a longitude and a latitude. Second, we determine the triangles to which the zip codes belong by using the longitudes and the latitudes. Finally, we aggregate the data based on the indices of the triangles.
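As a rough illustration of this assignment step (the paper performs it in R with the INLA mesh and noncensus coordinates; the Python below is a hypothetical sketch with illustrative names), a zip code's (longitude, latitude) point can be matched to a triangle with a standard signed-area test:

```python
def side(p, a, b):
    """Signed-area test: which side of segment a-b the point p lies on."""
    return (p[0] - b[0]) * (a[1] - b[1]) - (a[0] - b[0]) * (p[1] - b[1])

def in_triangle(p, tri):
    """True if point p = (lon, lat) lies inside triangle tri = (a, b, c)."""
    a, b, c = tri
    d1, d2, d3 = side(p, a, b), side(p, b, c), side(p, c, a)
    has_neg = (d1 < 0) or (d2 < 0) or (d3 < 0)
    has_pos = (d1 > 0) or (d2 > 0) or (d3 > 0)
    # Inside when all three tests agree in sign (points on an edge count too).
    return not (has_neg and has_pos)

def assign_triangle(point, triangles):
    """Index of the first triangle containing the point, or None if outside
    the mesh (a point on a shared edge is assigned to the first match)."""
    for idx, tri in enumerate(triangles):
        if in_triangle(point, tri):
            return idx
    return None
```

Records are then grouped by the returned triangle index before aggregation.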
The Part D Summary Table contains average ages and average risk scores of the beneficiaries. The average risk scores are average HCC (Hierarchical Condition Category) risk scores, which estimate how beneficiaries' FFS (Fee-For-Service) spending compares to the overall average of the entire Medicare population. Before aggregating the data, we convert average ages and average risk scores to total ages and total risk scores by multiplying the averages by the counts of beneficiaries. After aggregation, we convert the total ages and total risk scores back to averages by dividing the totals by the aggregated counts of beneficiaries. The Part D Summary Table also contains missing values, which we remove before aggregating the data.
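This convert-to-totals-then-back step is simply a beneficiary-count-weighted mean; a minimal sketch (illustrative function name, not from the paper):

```python
def weighted_average(groups):
    """Combine per-provider averages into an area-level average by weighting
    each average with its beneficiary count. `groups` is a list of
    (average, count) pairs; returns None for an empty area."""
    total = sum(avg * cnt for avg, cnt in groups)   # averages -> totals
    count = sum(cnt for _, cnt in groups)
    return total / count if count else None         # totals -> average
```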
Figure 1(a) shows the distribution of the brand name drug claim rates in the different triangular areas. From the figure, we see that 19 triangular areas contain no data. As a result, the aggregated dataset contains 511 observations. From the figure, we also see that some triangles in the middle are light red. This suggests that modeling the spatial effects may improve the fit of Beta regression models to the data.
Variable  Min  1st Q  Median  Mean  3rd Q  Max

brandrate  0.1253  0.1784  0.1886  0.1890  0.1986  0.4118
avgage  65.79  70.06  70.89  70.98  71.87  80.00
avgscore  0.9129  1.2102  1.3240  1.3170  1.4199  1.8166
Table 1 shows the summary statistics of the aggregated data. From the table, we see that the brand name drug claim rate varies from 12.53% to 41.18% among the 511 areas. The average age of beneficiaries ranges from 65.79 to 80, and the average risk score ranges from 0.9129 to 1.8166. The median and mean values indicate that these variables are fairly symmetric.
Another data source we use in our study is the 2016 individual income tax statistics by zip code (the file name is 16zpallagi.csv, available from https://www.irs.gov/statistics/soitaxstatsindividualincometaxstatistics2016zipcodedatasoi; accessed on Jan 12, 2020). This dataset contains 29,874 records, each of which is described by 144 variables. It contains information about the number of returns and the amount of returns in different categories. For example, it contains the number of returns with unemployment compensation and the unemployment compensation amount. The return counts and amounts are aggregated to the zip code level. Although this dataset does not contain all information about an area, it does provide some demographic and economic information, which might be useful for explaining the variation of the brand name drug claim rate. For example, the number of returns from an area is related to the population of the area.
We aggregate the tax data to the 530 triangular areas in the same way as we aggregate the Part D data. After aggregating the data, we divide the total dollar amount of an area by the corresponding number of returns to get the average dollar amount of the area. The total number of returns and the average amounts in different categories are used in the data analysis. Since this dataset has 144 variables, we do not show the summary statistics of these variables here.
4 Models
In this section, we describe some Beta regression models. In particular, we present four models: the basic Beta regression model, the Beta regression model with random effects, the BetaBesag model, and the BetaBYM model.
The density function of the Beta distribution is typically defined as (Klugman et al., 2012):

f(y; \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} y^{\alpha - 1} (1 - y)^{\beta - 1}, \quad 0 < y < 1,   (1)

where \alpha > 0 and \beta > 0 are shape parameters. The Beta distribution defined in Equation (1) can be reparameterized by using its mean \mu and dispersion \phi as parameters:

f(y; \mu, \phi) = \frac{\Gamma(\phi)}{\Gamma(\mu\phi)\Gamma((1 - \mu)\phi)} y^{\mu\phi - 1} (1 - y)^{(1 - \mu)\phi - 1}, \quad 0 < y < 1,   (2)

where \mu = \alpha/(\alpha + \beta) and \phi = \alpha + \beta. The shape parameters can be obtained from the mean and the dispersion as follows: \alpha = \mu\phi and \beta = (1 - \mu)\phi.
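To make the two parameterizations concrete, a small sketch converts between them and checks that the moments of Beta(\alpha, \beta) match the mean \mu and the variance \mu(1 - \mu)/(1 + \phi):

```python
def beta_shapes(mu, phi):
    """Map the mean/dispersion parameters (mu, phi) to the usual shape
    parameters (alpha, beta) of the Beta distribution."""
    return mu * phi, (1.0 - mu) * phi

def beta_moments(alpha, beta):
    """Mean and variance of Beta(alpha, beta)."""
    mean = alpha / (alpha + beta)
    var = alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1.0))
    return mean, var
```

For example, \mu = 0.2 and \phi = 10 give \alpha = 2 and \beta = 8, whose variance equals 0.2 \cdot 0.8 / 11.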
The basic Beta regression model is described as follows. Suppose that we have n observations. For i = 1, 2, \ldots, n, let \mathbf{x}_i and y_i be the vector of regressors and the response in the ith case, respectively. The responses are assumed to form a random sample such that y_i \sim \mathrm{Beta}(\mu_i, \phi). The mean \mu_i is linked to the regressors as follows:

g(\mu_i) = \mathbf{x}_i^{T} \boldsymbol{\beta},   (3)

where g is the link function and \boldsymbol{\beta} is the vector of regression coefficients. The variance of y_i can be estimated as

\mathrm{Var}(y_i) = \frac{\mu_i (1 - \mu_i)}{1 + \phi}.   (4)
Random effects models are commonly used to analyze areal or spatial data. For example, Tufvesson et al. (2019) applied random effects models to model car insurance data with geographical locations of the policyholders. There are two types of random effects models: unstructured and structured. Unstructured random effects models assume independence of the random effects, while structured random effects models allow for spatial dependence.
The second model we consider is a Beta regression model with a single random effect. In this model, a random effect is introduced into the linear predictor as follows:

g(\mu_i) = \mathbf{x}_i^{T} \boldsymbol{\beta} + v_i,   (5)

where v_1, v_2, \ldots, v_n are i.i.d. Gaussian noise, i.e.,

v_i \sim N(0, \tau_v^{-1}),   (6)

where \tau_v is a precision parameter. This model helps to control for unobserved heterogeneity when the heterogeneity is not correlated with the independent variables. The basic Beta regression model, in contrast, assumes that the heterogeneity is fully captured by the independent variables.
The third model we consider is the Besag model, which introduces a structured random effect into the linear predictor:

g(\mu_i) = \mathbf{x}_i^{T} \boldsymbol{\beta} + u_i,   (7)

where the u_i's follow a CAR(1) model, i.e., an order-1 conditional autoregressive model (Besag and Kooperberg, 1995). Building a CAR model requires a neighborhood graph, which specifies which areas are neighbors of each other. Figure 3 shows a neighborhood graph in which neighboring areas are connected by blue lines. In our models, we assume that triangles that share a vertex or a side are neighbors of each other.

Given a neighborhood graph, the CAR(1) model assumes that the random effect in an area is related to the random effects in the area's neighbors through a conditional Gaussian distribution:

u_i \mid \mathbf{u}_{-i} \sim N\left( \frac{1}{n_i} \sum_{j \in \delta_i} u_j, \; \frac{1}{n_i \tau_u} \right),   (8)

where \delta_i is the set of indices of the neighbors of the ith area, n_i denotes the number of elements in \delta_i, and \tau_u is a precision parameter. Note that the CAR(1) model can be written as a multivariate Gaussian model for \mathbf{u} = (u_1, u_2, \ldots, u_n)^{T} (Besag and Kooperberg, 1995; Wang et al., 2018):

\mathbf{u} \sim N\left( \mathbf{0}, (\tau_u \mathbf{R})^{-1} \right),   (9)

where \mathbf{R} is a highly sparse matrix defined by R_{ii} = n_i, R_{ij} = -1 if j \in \delta_i, and R_{ij} = 0 otherwise.
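The structure matrix of the CAR(1) model has diagonal entries equal to the neighbor counts and -1 in the positions of neighbors. It can be assembled directly from the neighbor lists of the triangles; the sketch below builds it densely for readability (in practice R-INLA stores it sparsely):

```python
def car_structure(neighbors):
    """Structure matrix R of the CAR(1) model from a neighborhood list:
    R[i][i] = number of neighbors of area i, R[i][j] = -1 if j is a
    neighbor of i, 0 otherwise. The precision matrix of u is tau_u * R."""
    n = len(neighbors)
    R = [[0.0] * n for _ in range(n)]
    for i, nbrs in enumerate(neighbors):
        R[i][i] = float(len(nbrs))
        for j in nbrs:
            R[i][j] = -1.0
    return R
```

For a symmetric neighborhood graph the row sums of R are zero, reflecting that the intrinsic CAR prior is improper (it penalizes only differences between neighboring effects).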
The fourth model is the BYM (Besag-York-Mollié) model, which combines an unstructured random effect and a structured random effect (Besag et al., 1991):

g(\mu_i) = \mathbf{x}_i^{T} \boldsymbol{\beta} + u_i + v_i,   (10)

where the u_i's are modeled by a CAR(1) model as specified in Equation (8) and the v_i's are i.i.d. Gaussian noise as specified in Equation (6). The structured effects u_i model the spatial effect. The unstructured effects v_i model additional random variation that is not explained by geographical locations. Between the two effects, a larger contribution of v in the model means that less exchange of information between areas is allowed. The relative strengths of the unstructured effects and the structured effects are controlled by the two hyperparameters \tau_v and \tau_u, respectively.

Model name  Linear predictor

BetaReg  Equation (3)
BetaRE  Equation (5)
BetaBesag  Equation (7)
BetaBYM  Equation (10)
Table 2 summarizes the four Beta regression models described above. The first two models do not consider spatial dependence, while the last two models allow for spatial dependence.
There are several choices of the link function g (Ferrari and Cribari-Neto, 2004):

- the logit link: g(\mu) = \log(\mu / (1 - \mu));

- the probit link: g(\mu) = \Phi^{-1}(\mu), where \Phi is the standard normal distribution function;

- the loglog link: g(\mu) = -\log(-\log(\mu));

- the complementary loglog link: g(\mu) = \log(-\log(1 - \mu));

- the Cauchy link: g(\mu) = \tan(\pi(\mu - 0.5)).
Figure 4 shows the plots of these link functions. The logit, probit, and Cauchy link functions are symmetric around the point \mu = 0.5. The other two link functions are not symmetric around that point.
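For reference, the five link functions can be written down directly in a plain-Python sketch (the probit link uses the standard normal inverse CDF from the statistics module):

```python
import math
from statistics import NormalDist

# Each link maps the open interval (0, 1) onto the real line.

def logit(mu):
    return math.log(mu / (1.0 - mu))

def probit(mu):
    return NormalDist().inv_cdf(mu)          # Phi^{-1}(mu)

def loglog(mu):
    return -math.log(-math.log(mu))

def cloglog(mu):
    return math.log(-math.log(1.0 - mu))     # complementary loglog

def cauchy(mu):
    return math.tan(math.pi * (mu - 0.5))
```

At \mu = 0.5 the three symmetric links return 0, whereas the loglog and complementary loglog links return nonzero values of opposite sign, which is the asymmetry visible in Figure 4.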
5 Results
In this section, we present the results of fitting the four Beta regression models (see Table 2) to the brand drug claim data.
5.1 Data transformation
Before we fit the models, we transform the covariates by using the Yeo–Johnson transformation introduced by Yeo and Johnson (2000). The purpose of transforming the covariates is to reduce their magnitudes for the convenience of model fitting. The Yeo–Johnson transformation is similar to the power transformation but can handle zero and negative values. In this transformation, a value x is transformed as follows:

\psi(x, \lambda) =
\begin{cases}
\left( (x + 1)^{\lambda} - 1 \right) / \lambda, & \lambda \neq 0, \; x \geq 0, \\
\log(x + 1), & \lambda = 0, \; x \geq 0, \\
-\left( (-x + 1)^{2 - \lambda} - 1 \right) / (2 - \lambda), & \lambda \neq 2, \; x < 0, \\
-\log(-x + 1), & \lambda = 2, \; x < 0.
\end{cases}   (11)
We transform all covariates except for the average risk score before fitting the models.
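A direct scalar transcription of the Yeo–Johnson transformation in Equation (11) (illustrative; in practice \lambda is estimated per covariate):

```python
import math

def yeo_johnson(x, lam):
    """Yeo-Johnson transformation of a single value x with parameter lam
    (Yeo and Johnson, 2000); handles zero and negative inputs."""
    if x >= 0:
        if abs(lam) > 1e-12:
            return ((x + 1.0) ** lam - 1.0) / lam
        return math.log(x + 1.0)          # lam = 0 branch
    if abs(lam - 2.0) > 1e-12:
        return -(((-x + 1.0) ** (2.0 - lam)) - 1.0) / (2.0 - lam)
    return -math.log(-x + 1.0)            # lam = 2 branch
```

Note that \psi(0, \lambda) = 0 for every \lambda, and \lambda = 1 leaves nonnegative values unchanged.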
After transforming the data, we split the dataset into two sets: one for fitting the models and one for validating the models. We use 80% of the data for fitting and the remaining 20% for evaluation. Figure 5 shows the distribution of the brand name drug claim rates for the training set and the test set.
5.2 Selection of covariates
The data contain 145 covariates: 143 from the individual income tax file and 2 (i.e., avgage and avgscore) from the Part D Summary Table. Including all of these covariates in the models causes parameter identification problems because many of them are highly correlated.
To select covariates for fitting the models, we use lasso regularized generalized linear models (GLMs), which are implemented in the R package glmnet (Friedman et al., 2010). Tufvesson et al. (2019) used this method to select covariates. However, the package glmnet does not support Beta regression models. To circumvent this problem, we first create a new binary response variable by comparing the brand name drug claim rates to the average rate: if a rate is above the average, the corresponding response value is 1; otherwise, it is 0. We then use the lasso regularized logistic regression supported by glmnet to select covariates. We assume that the covariates that help to separate high and low rates can also serve as useful predictors.

Figure 6 shows the results from a cross-validation run of the lasso regularized logistic GLM with different penalties \lambda. In the figure, the left dashed vertical line corresponds to the penalty that minimizes the cross-validation error. The second dashed vertical line corresponds to the largest penalty whose error is within one standard error of the minimal error. We use the model with this penalty to select covariates. This gives us 12 covariates, which are described in Table 3. These 12 selected covariates are used to fit the Beta regression models.

Covariate  Description

avgscore  Average risk score 
VITA  Number of volunteer income tax assistance (VITA) prepared returns 
A00900  Business or professional net income (less loss) amount 
A02300  Unemployment compensation amount 
A03150  Individual retirement arrangement payments amount 
A03230  Tuition and fees deduction amount 
A18450  State and local general sales tax amount 
A18800  Personal property taxes amount 
A07230  Nonrefundable education credit amount 
A85770  Total premium tax credit amount 
A11070  Additional child tax credit amount 
A11902  Overpayments refunded amount 
Figure 7 shows a correlation plot of the response variable and the covariates. From the figure, we see that many of these variables are positively correlated. The variable A11070 is negatively correlated with a few other variables.
5.3 Fitting of the models
Bayesian methods have been widely used to analyze spatial data since around 2000, facilitated by Markov chain Monte Carlo (MCMC) simulation methods (Blangiardo and Cameletti, 2015). In fact, the advent of MCMC has allowed researchers to use Bayesian methods to develop complex models on large datasets. However, one major drawback of MCMC methods is that they are computationally demanding, especially for large datasets. INLA (Integrated Nested Laplace Approximation) is an alternative to MCMC for Bayesian inference (Rue and Held, 2005; Rue et al., 2009). A major advantage of INLA is that it is a deterministic algorithm capable of producing accurate results quickly. Since INLA was embedded into R through the R package INLA, it has become popular among researchers and practitioners. For example, Tufvesson et al. (2019) used INLA to model car insurance frequency and severity. In this paper, we use the R package INLA (Blangiardo and Cameletti, 2015; Wang et al., 2018) to fit the four Beta regression models.

The four models described in Section 4 can be formulated as Bayesian hierarchical models. For example, the BetaBYM model can be expressed as
y_i \mid \mu_i, \phi \sim \mathrm{Beta}(\mu_i, \phi),   (12a)
g(\mu_i) = \mathbf{x}_i^{T} \boldsymbol{\beta} + u_i + v_i,   (12b)
\mathbf{u} \mid \tau_u \sim N\left( \mathbf{0}, (\tau_u \mathbf{R})^{-1} \right),   (12c)
v_i \mid \tau_v \sim N(0, \tau_v^{-1}),   (12d)
(\tau_u, \tau_v) \sim \pi(\tau_u, \tau_v),   (12e)

where \pi denotes a prior distribution for the two hyperparameters \tau_u and \tau_v. Common choices for the prior distribution of (\tau_u, \tau_v) include independent gamma distributions. In this paper, we use the default priors in INLA.
5.4 Model selection and validation
The deviance information criterion (DIC) of Spiegelhalter et al. (2002) and the Watanabe-Akaike information criterion (WAIC) of Watanabe (2010) are commonly used to select Bayesian models. The DIC is motivated by the Akaike information criterion (AIC) and is defined as

\mathrm{DIC} = \bar{D} + p_D,   (13)

where \bar{D} = E[D(\boldsymbol{\theta}) \mid \mathbf{y}] is the posterior mean of the deviance D(\boldsymbol{\theta}) of the model and p_D denotes the effective number of parameters, defined as p_D = \bar{D} - D(\bar{\boldsymbol{\theta}}), with \bar{\boldsymbol{\theta}} being the posterior mean of the parameters. In a Bayesian model, the deviance is a random variable, and the expected deviance under the posterior distribution is used as a measure of fit. Between two models, the model with the lower DIC value is preferred.
The WAIC is similar to the DIC, but the effective number of parameters is calculated differently. The WAIC is defined as

\mathrm{WAIC} = -2 \sum_{i=1}^{n} \log E_{\mathrm{post}}\left[ p(y_i \mid \boldsymbol{\theta}) \right] + 2 p_{\mathrm{WAIC}},   (14)

where p_{\mathrm{WAIC}} = \sum_{i=1}^{n} \mathrm{Var}_{\mathrm{post}}\left( \log p(y_i \mid \boldsymbol{\theta}) \right) is the effective number of parameters. The WAIC is interpreted in the same way as the DIC; that is, the lower the WAIC, the more preferable the model. Between the DIC and the WAIC, Gelman et al. (2014) argue that the WAIC is preferred.
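As a hedged illustration (INLA computes both criteria internally; this sketch only shows how they could be obtained from posterior draws), the DIC needs sampled deviances plus the deviance at the posterior mean, and the WAIC needs a matrix of pointwise log-likelihoods:

```python
import math

def dic(deviance_draws, deviance_at_mean):
    """DIC = mean posterior deviance + p_D, where
    p_D = mean(D) - D(posterior-mean parameters)."""
    d_bar = sum(deviance_draws) / len(deviance_draws)
    p_d = d_bar - deviance_at_mean
    return d_bar + p_d

def waic(loglik):
    """WAIC from pointwise log-likelihoods: loglik[s][i] = log p(y_i | theta_s)
    for posterior draw s. Returns -2 * (lppd - p_waic)."""
    S, n = len(loglik), len(loglik[0])
    lppd, p_waic = 0.0, 0.0
    for i in range(n):
        col = [loglik[s][i] for s in range(S)]
        m = max(col)  # log-sum-exp trick for numerical stability
        lppd += m + math.log(sum(math.exp(c - m) for c in col) / S)
        mean = sum(col) / S
        p_waic += sum((c - mean) ** 2 for c in col) / (S - 1)
    return -2.0 * (lppd - p_waic)
```

When all posterior draws coincide, p_WAIC is zero and the WAIC reduces to minus twice the log-likelihood, matching Equation (14).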
We also use two measures to validate the out-of-sample performance of the models. The first measure is the concordance correlation coefficient (CCC), which is used to measure the agreement between two variables. The CCC is defined as (Lin, 1989):

\mathrm{CCC} = \frac{2 \rho \sigma_x \sigma_y}{\sigma_x^2 + \sigma_y^2 + (\mu_x - \mu_y)^2},   (15)

where \rho is the correlation between the observed values and the predicted values, \sigma_x and \mu_x are the standard deviation and the mean of the observed values, respectively, and \sigma_y and \mu_y are the standard deviation and the mean of the predicted values, respectively. The value of the CCC ranges from -1 to 1, with a value of 1 indicating perfect agreement between the predicted values and the observed values. Between two models, a higher CCC value means a better model.
The second measure is the well-known residual standard error (RSE), which is defined as

\mathrm{RSE} = \sqrt{ \frac{1}{n - 2} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 },   (16)

where y_i and \hat{y}_i represent the ith observed value and the ith predicted value, respectively. Between two models, the model with the lower RSE value is better.
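A small sketch of the two validation measures (the n - 2 divisor in `rse` mirrors the simple-regression convention and is an assumption, as the paper does not spell out the divisor):

```python
import math

def ccc(observed, predicted):
    """Concordance correlation coefficient (Lin, 1989), computed from the
    sample moments of the two series."""
    n = len(observed)
    mx = sum(observed) / n
    my = sum(predicted) / n
    sx2 = sum((x - mx) ** 2 for x in observed) / n
    sy2 = sum((y - my) ** 2 for y in predicted) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(observed, predicted)) / n
    # Equivalent to 2*rho*sx*sy / (sx^2 + sy^2 + (mx - my)^2).
    return 2.0 * sxy / (sx2 + sy2 + (mx - my) ** 2)

def rse(observed, predicted, dof=2):
    """Residual standard error: sqrt(RSS / (n - dof))."""
    n = len(observed)
    rss = sum((y - yhat) ** 2 for y, yhat in zip(observed, predicted))
    return math.sqrt(rss / (n - dof))
```

A systematic shift between predictions and observations lowers the CCC even when the correlation is perfect, which is why the CCC is stricter than plain correlation.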
5.5 Results
Table 4 shows the DIC and the WAIC from fitting the four Beta regression models with different link functions to the training dataset. Looking at the rows, we see that the performance across the five link functions is quite similar, except for the Cauchy link. Looking at the columns, we see that the BetaBYM model performs the best. In terms of the DIC, the BetaBYM model with the Cauchy link performs the best, whereas in terms of the WAIC, the BetaBYM model with the loglog link is the best. However, the performance of the BetaBYM model is quite similar for the first four link functions, so using the logit link function is not a bad choice.
DIC:

Model  Logit  Probit  Loglog  Cloglog  Cauchy

BetaReg  -2103.82  -2102.92  -2101.53  -2104.36  -2109.25
BetaRE  -2103.60  -2102.20  -2100.60  -2104.10  -2109.79
BetaBesag  -2105.79  -2109.89  -2109.58  -2107.41  -2114.87
BetaBYM  -2116.93  -2115.13  -2112.55  -2118.06  -2132.92

WAIC:

BetaReg  -2070.47  -2070.84  -2071.36  -2069.72  -2062.49
BetaRE  -2070.60  -2071.07  -2071.56  -2069.88  -2063.02
BetaBesag  -2077.25  -2081.93  -2083.42  -2077.41  -2069.28
BetaBYM  -2089.08  -2091.05  -2091.60  -2088.89  -2087.56
CCC:

Model  Logit  Probit  Loglog  Cloglog  Cauchy

BetaReg  0.42846  0.42810  0.42749  0.42801  0.42138
BetaRE  0.42834  0.42790  0.42733  0.42786  0.42126
BetaBesag  0.45289  0.46855  0.47168  0.45622  0.44610
BetaBYM  0.48595  0.48954  0.49127  0.48584  0.47814

RSE:

BetaReg  0.01380  0.01376  0.01371  0.01383  0.01412
BetaRE  0.01380  0.01376  0.01371  0.01383  0.01413
BetaBesag  0.01364  0.01352  0.01342  0.01367  0.01412
BetaBYM  0.01359  0.01352  0.01340  0.01367  0.01445
Table 5 shows the out-of-sample performance of the models with different link functions on the test set. In terms of the CCC, the BetaBYM model with the loglog link performs the best, as the corresponding CCC value is the highest among the 20 cases. In terms of the RSE, the BetaBYM model with the loglog link is also the best, as it has the lowest RSE value.
Figure 8 shows the scatter plots of the observed brand name drug claim rates against the predicted values produced by the four models with the loglog link function. From the four scatter plots, we see that the predicted values of the four models are quite similar. We also see that none of the four models fits the large rates well: the highest predicted value is around 0.25, while the largest observed value is around 0.41 (see Table 1). Figure 9 shows the distributions of the brand name drug claim rates predicted by the four models with the loglog link function across the triangular areas. The maps look quite similar, and it is hard to see the differences. In addition, these maps look similar to the map of the observed rates shown in Figure 1(a).
Table 6: Fitted parameters of the BetaReg model with the loglog link function.

Parameter  Mean  Std  0.025 Q  0.5 Q  0.975 Q

(Intercept)  0.554  0.106  0.763  0.554  0.345 
avgscore  0.061  0.027  0.009  0.061  0.113 
VITA  0.003  0.001  0.001  0.003  0.006 
A00900  0.004  0.010  0.017  0.004  0.024 
A02300  0.066  0.014  0.039  0.066  0.093 
A03150  0.003  0.011  0.018  0.003  0.025 
A03230  0.002  0.013  0.027  0.002  0.023 
A18450  0.044  0.013  0.069  0.044  0.019 
A18800  0.002  0.029  0.055  0.002  0.060 
A07230  0.198  0.044  0.285  0.198  0.111 
A85770  0.012  0.011  0.010  0.012  0.035 
A11070  0.234  0.103  0.437  0.234  0.031 
A11902  0.132  0.039  0.057  0.133  0.208 
\phi  388.740  27.220  337.310  388.040  444.170
Table 7: Fitted parameters of the BetaRE model with the loglog link function.

Parameter  Mean  Std  0.025 Q  0.5 Q  0.975 Q

(Intercept)  0.553  0.107  0.763  0.553  0.344 
avgscore  0.061  0.027  0.008  0.061  0.113 
VITA  0.003  0.001  0.001  0.003  0.006 
A00900  0.004  0.010  0.017  0.004  0.024 
A02300  0.066  0.014  0.039  0.066  0.093 
A03150  0.003  0.011  0.018  0.003  0.025 
A03230  0.002  0.013  0.027  0.002  0.023 
A18450  0.044  0.013  0.069  0.044  0.019 
A18800  0.002  0.029  0.055  0.002  0.060 
A07230  0.198  0.045  0.285  0.198  0.111 
A85770  0.012  0.011  0.010  0.012  0.035 
A11070  0.234  0.104  0.438  0.234  0.030 
A11902  0.133  0.039  0.057  0.133  0.208 
\phi  390.650  27.960  337.970  389.880  448.040
\tau_v  30149.160  22026.050  5634.800  24634.420  87681.270
Table 8: Fitted parameters of the BetaBesag model with the loglog link function.

Parameter  Mean  Std  0.025 Q  0.5 Q  0.975 Q

(Intercept)  0.574  0.109  0.788  0.574  0.361 
avgscore  0.067  0.027  0.013  0.067  0.121 
VITA  0.003  0.001  0.001  0.003  0.006 
A00900  0.002  0.011  0.018  0.002  0.023 
A02300  0.065  0.014  0.037  0.065  0.093 
A03150  0.004  0.011  0.017  0.004  0.026 
A03230  0.000  0.013  0.025  0.000  0.025 
A18450  0.036  0.014  0.064  0.036  0.007 
A18800  0.012  0.031  0.074  0.012  0.049 
A07230  0.215  0.048  0.310  0.215  0.121 
A85770  0.012  0.011  0.011  0.012  0.034 
A11070  0.206  0.110  0.421  0.206  0.011 
A11902  0.134  0.039  0.057  0.134  0.210 
\phi  411.830  31.980  350.770  411.430  476.320
\tau_u  3483.030  5884.000  461.760  1839.520  16609.730
Table 9: Fitted parameters of the BetaBYM model with the loglog link function.

Parameter  Mean  Std  0.025 Q  0.5 Q  0.975 Q

(Intercept)  0.580  0.111  0.800  0.580  0.362 
avgscore  0.071  0.028  0.016  0.071  0.126 
VITA  0.003  0.001  0.001  0.003  0.006 
A00900  0.001  0.011  0.020  0.001  0.023 
A02300  0.065  0.015  0.036  0.065  0.093 
A03150  0.005  0.011  0.017  0.005  0.026 
A03230  0.002  0.013  0.024  0.002  0.028 
A18450  0.031  0.014  0.059  0.031  0.002 
A18800  0.021  0.031  0.083  0.021  0.041 
A07230  0.225  0.049  0.321  0.225  0.128 
A85770  0.011  0.012  0.012  0.011  0.034 
A11070  0.195  0.114  0.419  0.196  0.028 
A11902  0.135  0.040  0.056  0.135  0.213 
425.630  33.890  362.380  424.470  495.770  
6583.970  2917.140  2614.040  6021.720  13846.680  
1435.850  1101.810  384.650  1118.950  4360.820 
show the fitted coefficients and hyperparameters of the four models with the log-log link function. In particular, the tables show the mean, the standard deviation, the 2.5% quantile, the median, and the 97.5% quantile of the estimated parameters. From the 2.5% and 97.5% quantiles, we obtain the 95% credibility intervals of the fitted parameters. Looking at these credibility intervals, we see that the intervals of avgscore, VITA, A02300, and A11902 contain only positive values; these variables therefore tend to have a positive impact on the response, i.e., the brand name drug claim rate. From these tables, we also see two variables, A07230 and A18450, whose 95% credibility intervals contain only negative values. From Table 3, both variables are related to education spending, and their negative fitted coefficients indicate that they tend to lower the brand name drug claim rate.
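The sign check described above can be automated: a covariate is flagged at the 95% level whenever its 2.5% and 97.5% posterior quantiles share a sign. A minimal Python sketch (the signed quantile values below are illustrative placeholders, since the printed tables omit the signs of the estimates):

```python
# Sketch: flag covariates whose 95% credibility interval excludes zero,
# i.e. both posterior quantiles share the same sign.

def excludes_zero(q025, q975):
    """True if the interval (q025, q975) lies entirely on one side of zero."""
    return (q025 > 0 and q975 > 0) or (q025 < 0 and q975 < 0)

posterior_quantiles = {          # variable: (2.5% quantile, 97.5% quantile)
    "avgscore": (0.016, 0.126),  # interval entirely positive
    "A18450": (-0.059, -0.002),  # interval entirely negative
    "A00900": (-0.020, 0.023),   # interval straddles zero
}

significant = {v: q for v, q in posterior_quantiles.items() if excludes_zero(*q)}
print(sorted(significant))  # → ['A18450', 'avgscore']
```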
Tables 7, 8, and 9 also show the estimated hyperparameters of the unstructured random effects and the structured random effects. Note that the hyperparameters are precision parameters, i.e., the reciprocals of the variances of the random effects: the higher the precision, the lower the variance. In Table 7, the estimated precision of the unstructured random effects is very large. This means that this random effects component has a very small variance and contributes little to explaining the variation of the response variable.
In Table 8, the fitted precision of the structured random effects is much lower than the precision fitted in Table 7. Table 9 shows the same pattern: the fitted precision of the structured random effects is much lower than that of the unstructured random effects. This means that the structured random effects contribute more than the unstructured random effects to explaining the variation of the response.
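Because INLA-style output reports precisions rather than variances, converting the posterior mean precisions to standard deviations (sd = 1/sqrt(tau)) makes the comparison concrete. A minimal Python sketch using the posterior means quoted in the tables above:

```python
import math

# Sketch: random-effect hyperparameters are reported as precisions
# (tau = 1/variance); a large precision implies a tiny random effect.

def precision_to_sd(tau):
    """Standard deviation implied by a precision parameter tau."""
    return 1.0 / math.sqrt(tau)

# Posterior mean precisions taken from Tables 7 and 8.
print(round(precision_to_sd(30149.16), 4))  # unstructured (Table 7): 0.0058
print(round(precision_to_sd(3483.03), 4))   # structured (Table 8):   0.0169
```

The larger implied standard deviation of the structured component is what the text means by the structured random effects "helping more".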
In summary, the above results show that modeling the spatial effect improves the performance of the Beta regression model, while including unstructured random effects improves the performance only marginally. To further assess the comprehensiveness of our Beta regression models, we provide additional results using the models suggested in Duncan et al. (2016); see Appendix A.
6 Concluding Remarks
The healthcare sector in the U.S. is a large sector that generates about 20% of the country's gross domestic product. This significance has motivated researchers to develop enhanced tools and approaches to better understand the industry through healthcare analytics. According to the 2019 Predictive Analytics in Health Care Trend Forecast (Society of Actuaries, 2019), predictive analytics is poised to reshape the healthcare industry by achieving three aims: improved patient outcomes, higher quality of care, and reduced costs. In particular, McKinsey estimates that big data analytics could help the U.S. healthcare industry save more than $300 billion per year, two thirds of which would come from a reduction of approximately 8% in national healthcare expenditures.
In this paper, we examined and demonstrated the use of Beta regression models to study the utilization of brand name drugs in the U.S. in order to understand the variability of brand name drug claim rates across different areas of the country. We studied Beta regression models with and without spatial effects and fitted them to public datasets obtained from the CMS and the IRS. The numerical results show that Beta regression models fit the brand name drug claim rates well and that modeling the spatial effects improves their performance. We also identified several significant variables that help explain the variation of the claim rates across different areas. The methods and findings in this paper are useful for healthcare actuaries in their data analysis. By reflecting the effect of prescription drug utilization, these models can be used to update an insured's risk score in a risk adjustment model. Specifically, healthcare actuaries can incorporate the geographic variation into the models used to select preferred providers.
References
 A comparative analysis of chronic and non-chronic insured commercial member cost trends. North American Actuarial Journal 10 (4), pp. 76–89. External Links: Document Cited by: §2.
 On conditional and intrinsic autoregression. Biometrika 82 (4), pp. 733–746. External Links: ISSN 00063444, Link Cited by: §4, §4.
 Bayesian image restoration, with two applications in spatial statistics. Annals of the Institute of Statistical Mathematics 43, pp. 1–20. Cited by: §4.
 Spatial and spatio-temporal Bayesian models with R-INLA. Wiley, Hoboken, NJ. Cited by: §5.3.
 A framework for long-term actuarial projections of health care costs: the importance of population aging and other factors. North American Actuarial Journal 8 (4), pp. 1–29. External Links: Document Cited by: §2.
 Purchasing generic prescription drugs: an analysis using two behavioral intention models. In NA - Advances in Consumer Research, Vol. 11, pp. 229–234. Cited by: §1.
 Medicaid managed care: efficiency, medical loss ratio, and quality of care. North American Actuarial Journal 0 (0), pp. 1–16. External Links: Document Cited by: §2.
 Potential “savings” of Medicare: the analysis of Medicare advantage and accountable care organizations. North American Actuarial Journal 22 (3), pp. 458–472. External Links: Document Cited by: §2.
 Testing alternative regression frameworks for predictive modeling of health care costs. North American Actuarial Journal 20 (1), pp. 65–87. External Links: Document Cited by: Appendix A, §2, §5.5.
 Health risk adjustment and predictive modeling. ACTEX Publications, Winsted, CT. Cited by: §1, §2.
 Beta regression for modelling rates and proportions. Journal of Applied Statistics 31 (7), pp. 799–815. Cited by: §1, §4.
 Predicting the frequency and amount of health care expenditures. North American Actuarial Journal 15 (3), pp. 377–392. External Links: Document Cited by: §2.
 Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, Articles 33 (1), pp. 1–22. External Links: Document, Link Cited by: §5.2.
 Understanding predictive information criteria for Bayesian models. Statistics and Computing 24, pp. 997–1016. Cited by: §5.4.
 Accuracy of long-range actuarial projections of health care costs. North American Actuarial Journal 20 (2), pp. 101–113. External Links: Document Cited by: §2.
 Psychiatrists’ decision making between branded and generic drugs. European Neuropsychopharmacology 23 (7), pp. 686 – 690. Cited by: §1, §1.
 The experiences of implementing generic medicine policy in eight countries: a review and recommendations for a successful promotion of generic medicine use. Saudi Pharmaceutical Journal 22 (6), pp. 491 – 503. Cited by: §1, §1.
 Risk scoring in health insurance: a primer. Note: https://www.soa.org/researchreports/2016/2016riskscoringprimer/, Society of Actuaries. Cited by: §1.
 Model selection and averaging of health costs in episode treatment groups. ASTIN Bulletin 47 (1), pp. 153–167. External Links: Document Cited by: §2.
 Exploring the use of machine learning for risk adjustment: a comparison of standard and penalized linear regression models in predicting health care costs in older adults. PLOS ONE 14, pp. 1–13. External Links: Link Cited by: §2.
 Variations in patients’ perceptions and use of generic drugs: results of a national survey. Journal of general internal medicine 31 (6), pp. 609 – 614. Cited by: §1, §1.
 2017 generic drug access and savings in the U.S. report. Note: https://accessiblemeds.org/resources/blog/2017genericdrugaccessandsavingsusreport. Accessed on 9/27/2018. Cited by: §1.
 Loss models: from data to decisions. 4th edition, Wiley, Hoboken, NJ. Cited by: §4.
 A concordance correlation coefficient to evaluate reproducibility. Biometrics 45 (1), pp. 255–268. Cited by: §5.4.
 Aging curves for health care costs in retirement. North American Actuarial Journal 9 (3), pp. 22–49. External Links: Document Cited by: §2.
 Bayesian nonparametric regression models for modeling and predicting healthcare claims. Insurance: Mathematics and Economics 83, pp. 1 – 8. External Links: ISSN 01676687, Document Cited by: §2.
 Gaussian Markov random fields: theory and applications. CRC Press, Boca Raton, FL. Cited by: §5.3.
 Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 71 (2), pp. 319–392. External Links: Document Cited by: §5.3.
 The price of global health: drug pricing strategies to balance patient access and the funding of innovation. 2nd edition, Gower Publishing Company, Burlington, VT. Cited by: §1.
 Managed care and health care utilization: specification of bivariate models using copulas. North American Actuarial Journal 17 (4), pp. 306–324. External Links: Document Cited by: §2.
 2019 Predictive analytics in health care trend forecast. Note: https://www.soa.org/globalassets/assets/Files/programs/predictiveanalytics/2019healthcaretrend.pdf. Accessed: 2019-8-23. Cited by: §6.
 Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 64 (4), pp. 583–639. External Links: Document Cited by: §5.4.
 Spatial statistical modelling of insurance risk: a spatial epidemiological approach to car insurance. Scandinavian Actuarial Journal 0 (0), pp. 1–15. External Links: Document Cited by: §4, §5.2, §5.3.
 Bayesian regression modeling with INLA. CRC Press, Boca Raton, FL. Cited by: §4, §5.3.
 Asymptotic equivalence of Bayes cross-validation and widely applicable information criterion in singular learning theory. Journal of Machine Learning Research 11, pp. 3571–3594. Cited by: §5.4.
 A new family of power transformations to improve normality or symmetry. Biometrika 87 (4), pp. 954–959. External Links: Link Cited by: §5.1.
Appendix A Additional results
Duncan et al. (2016) tested several alternative regression frameworks for predictive modeling of health care costs. In particular, they tested multiple linear regression models, the lasso, multivariate adaptive regression splines (MARS), random forests, M5 decision trees, and boosted trees. For details of these models, readers are referred to Duncan et al. (2016) and the references therein. In this section, we present the out-of-sample prediction results of these models.
Model              R Package     Parameters
Linear model       lm            Multiple linear regression without interactions
Lasso              glmnet        Gaussian family
MARS               earth         Interactions up to degree = 1; generalized cross-validation penalty = 1
Random forests     randomForest  Number of trees = 100; minimum node size = 10
M5 decision trees  Cubist        Number of committees = 25
Boosted trees      gbm           Number of trees = 5000; shrinkage = 0.01
Model
Linear model      0.4129  0.0141
Lasso             0.0000  0.0154
MARS              0.4420  0.0143
Random forest     0.4657  0.0126
M5 decision tree  0.4330  0.0131
Boosted trees     0.4658  0.0133
Table 10 shows the parameters and R packages we used to test the additional models. Table 11 shows the outofsample prediction results obtained by these models. Figure 10 shows the scatter plots of the observed brand name drug claim rates and the predicted values by the additional models. The variables input to these models are the same as those selected for the Beta regression models described in Table 3.
From Table 11, we see that the lasso model produced the worst result; the reason is that the lasso regression produced an intercept-only model for the data. Figure 9(b) also confirms this, as the predicted values are constant. Comparing Tables 5 and 11, we see that none of these models performs better than the Beta regression model in terms of the first metric reported in Table 11. However, the tree-based models (i.e., random forests, M5 decision trees, and boosted trees) produced lower values of the second metric than the Beta regression model.
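For reference, Lin's concordance correlation coefficient (Lin, 1989), one of the evaluation measures cited in Section 5.4, can be computed directly. A minimal pure-Python sketch with illustrative data (population, i.e. 1/n, moments):

```python
# Sketch: Lin's concordance correlation coefficient between observed
# and predicted values; equals 1 only for perfect agreement y = x.

def ccc(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    return 2 * cov / (vx + vy + (mx - my) ** 2)

obs = [0.20, 0.25, 0.22, 0.30, 0.27]   # illustrative claim rates
pred = [0.21, 0.24, 0.23, 0.28, 0.26]  # illustrative predictions
print(round(ccc(obs, pred), 3))  # → 0.914
```

Note that a constant prediction, such as the lasso's intercept-only fit, has zero covariance with the observations, so any concordance-type measure of this form is exactly zero.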