Penalized Likelihood Methods for Modeling of Reading Count Data

by   Minh Thu Bui, et al.

The paper considers parameter estimation in count data models using penalized likelihood methods. The motivating data consists of multiple independent count variables with a moderate sample size per variable. The data were collected during the assessment of oral reading fluency (ORF) in school-aged children. Specifically, a sample of fourth-grade students was given one of ten possible to read with passages differing in length and difficulty. The observed number of words read incorrectly (WRI) is used to measure ORF. The goal of this paper is to efficiently estimate passage difficulty as measured by the expected proportion of words read incorrectly. Three models are considered for WRI scores, namely the binomial, the zero-inflated binomial, and the beta-binomial. Two types of penalty functions are considered for penalized likelihood, respectively with the goal of shrinking parameter estimates either closer to zero or closer to one another. A simulation study evaluates the efficacy of the shrinkage estimates using Mean Square Error (MSE) as a metric, with big reductions in MSE relative to maximum likelihood in some instances. The paper concludes by presenting an analysis of the motivating ORF data.



There are no comments yet.


page 1

page 2

page 3

page 4


Liu Estimator in the Multinomial Logistic Regression Model

This paper considers the Liu estimator in the multinomial logistic regre...

On Data Enriched Logistic Regression

Biomedical researchers usually study the effects of certain exposures on...

Tuning in ridge logistic regression to solve separation

Separation in logistic regression is a common problem causing failure of...

Efficiency for Regularization Parameter Selection in Penalized Likelihood Estimation of Misspecified Models

It has been shown that AIC-type criteria are asymptotically efficient se...

Proportional hazards model with partly interval censoring and its penalized likelihood estimation

This paper considers the problem of semi-parametric proportional hazards...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

The definition of Oral Reading Fluency (ORF) is “the oral translation of text with speed and accuracy,” see for example fuchs2001oral and shinn1992curriculum. Reading fluency is a skill developed during childhood that is needed to understand the meaning of texts and literary pieces. There is a strong correlation between reading fluency and reading comprehension, see allington1983fluency; johns1983informal; samuels1988decoding; schreiber1991understanding. According to disalle2017impact, once a student has identified a word and read it correctly, their focus generally shifts from word recognition (attempting to recognize the word) to comprehension (making meaning of the word). This leads to overall understanding of the text. These authors have claimed that the incompetent ORF levels are the cause of up to 90% of reading fluency issues. If a child is not fluent in their reading, their ability to read comprehensively is hindered and they will have trouble in grasping the meaning of texts. Thus, ORF is a method of evaluating whether a child is at their appropriate reading level compared to their peers and provides a quantifiable score to identify at-risk students with poor reading skills.

In this paper, we analyze ORF data collected from a sample of fourth-grade students. Each child was given one of ten possible passage to read and the number of words read incorrectly (WRI) was recorded. Reading sessions were recorded so that observer error in counting the number of words read correctly and incorrectly can be minimized. Strong readers tend to have low WRI scores and weak readers tend to have high WRI scores. However, as the passages may not all be equal in difficulty, it is important to be cautious in using WRI scores obtained from different passages to measure overall ORF levels in a classroom setting.

This paper develops a likelihood-based approach for estimating passage difficulty as measured by the expected WRI proportion. The passages in ORF assessment are generally designed to be comparable in difficulty level. Passages are also designed to not be overly challenging for proficient readers. These two passage properties can be incorporated in the estimation of WRI proportions using penalty functions. Specifically, penalty functions are considered that encourage the estimated passage-specific WRI proportions to be close to one another and/or close to zero. The use of parameter shrinkage is further motivated by a small sample size per passage relative to the number of passages. Here, there are approximately 50 WRI measurements for each of the 10 passages.

There is, of course, a rich literature on parameter shrinkage in various statistical models. One of the definitive examples in the multivariable setting is the James-Stein estimator of the mean, see stein1956inadmissibility. This estimator is often described as “borrowing” information between variables to obtain a more efficient estimator. Other applications of shrinkage include pandey1985bayes and jani1991class

who considered univariate Bayes-type shrinkage in, respectively, a Weibull distribution and an exponential distribution. In the bivariate setting, shrinkage was used to estimate probabilities of the form

for underlying exponential distributions, see baklizi2003shrinkage.

One of the most frequently encountered applications of shrinkage is in regression models with a large number of predictor variables. The lasso, developed by

tibshirani1996regression, is one such technique which revolutionized parameter estimation in generalized linear models (GLMs). The lasso shrinks regression parameters towards zero using an

penalty, resulting in predictors being “dropped” from the model by setting the corresponding coefficients equal to zero. The lasso was predated by ridge regression using an

penalty, see hoerl1970ridge. This approach can result in some regression coefficients being very close to zero, but unlike the lasso does not eliminate potential predictor variables from the model altogether. Other examples of shrinkage applied to GLMs include maansson2013developing and qasim2020new who developed Liu-type estimators for, respectively, a zero-inflated negative binomial regression model and a Poisson regression model. Shrinkage estimation of fixed effects in a random-effects zero-inflated negative binomial model was considered by zandi2021using. The monographs gruber2017improving and hastie2019statistical are very good resources for further exploration of shrinkage in regression models.

In this paper, the parameters of interest are success proportions (with a success being that a word has been read incorrectly during an assessment). Shrinkage as applied to the estimation of proportions has received limited attention in the literature. In the univariate case, (lemmer1981note) considered three different estimators for a binomial success probability, and lemmer1981ordinary proposed estimators of the type where is an a priori guess. However, neither of these papers consider likelihood-based methods nor provide guidance on selecting the amount of shrinkage. Some work close in spirit to this paper is that of hansen2016efficient who considers three shrinking approaches, namely restricted maximum likelihood, an efficient minimum distance approach, and a projection approach. The work of Hansen requires the specification of a shrinkage direction, which is similar to the selection of a penalty function.

The remainder of this paper proceeds as follows: In Section 2, the penalized likelihood approach is more fully developed, emphasizing the binomial distribution for clarity of exposition. In Section 3, V-fold cross-validation is presented as an approach for selecting the amount of shrinkage. Section 4 presents results from extensive simulation studies and the motivating data are analyzed in Section 5.

2 Shrinkage through Penalized Likelihood Methods

2.1 Shrinkage through Penalized Likelihood Estimation

Consider collection of random variables

, , , with mutually independent . Here, denotes a distribution function with -dimensional parameter . Let denote the parameter space associated with the collection of parameters . Also let denote the log-likelihood of the data and let denote a specified subset of the parameter space that is of interest. Finally, for , let be a norm and define . That is, is the minimum distance between a point and the space as measured by the norm . Note that whenever , the point is closer to the region than the point .

In this context, parameter shrinkage is said to be any estimation method that balances adherence to the data-generating model as measured by and the closeness of any estimator to as measured by . One approach the “balancing” these functions is penalized likelihood. Adopting the convention that denote the penalty function, the penalized likelihood estimator is found by minimizing


with a specified constant. The two component functions of often exist in some kind of tension; minimizing gives the maximum likelihood estimator (MLE), while attains a minimum for any in where the desired parameter constraint is fully satisfied. The tension can be ascribed to the MLE not necessarily being close to subset of interest . The magnitude of determines the balance between these at times competing interest.

Calculation of the penalized likelihood estimator requires the specification of a generating model, a penalty function, and a value for the parameter . Throughout this paper, generating models closely related to the binomial distribution are considered. The remainder of Section 2 will consider possible choices of the penalty function while assuming is known. In Section 3, the use of V-fold cross-validation to select is discussed. Note that when it comes to the selection of a penalty function, it will often be the case that the client presents the statistician with a non-mathematical description of . There may be multiple ways of constructing a set and a penalty function that satisfies the description. Therefore, the penalty functions considered in this paper should in no way be considered an exhaustive enumeration of the possibilities. Rather, these are intended to illustrate the large number of ways in which shrinking can be implemented.

2.2 Shrinkage to Zero in Binomial Models

Let , denote observed realizations of independent random variables , . Assume that the number of binomial trials are known and that estimation of the success probabilities , is of interest. Here, the log-likelihood is given by

Now, say that the client has expressed a strong belief that the success probabilities should all be “small.” In the context of the WRI data, this is equivalent to expecting that only a small proportion of words will be read incorrectly by a grade-level reader. This is consistent with setting . There are numerous penalty functions that can assess the closeness of a potential parameter value to . For example, both the and norms


are candidates worth considering. In the context of binomial success probabilities, both of these functions are bounded, having Figure 1 below illustrates the shapes of these penalties for the case .

Figure 1: norm (left) and norm (right) penalty functions for binomial success probabilities.

Note that as for all , the norm will more “aggressively” shrink success probabilities to than the norm. Due to the resemblance of the norm to the commonly-used lasso penalty in regression, it should however be pointed out that the function will never result in shrinkage to . In fact, the penalized negative log-likelihood function has unique solution

where and is the unpenalized MLE. While it is not necessarily intuitive from the form of the penalized estimator, it can easily be verified that for all . The solution to the penalty function is also easy to compute, but no general closed-form expression is possible as it requires solving a cubic polynomial.

The bounded nature of and in (2) may not appeal to some. One choice of an unbounded penalty is

This penalty has a lower bound of , but has no upper bound. For an illustration when , see Figure 2. The solution to the corresponding penalized likelihood problem is

None of the penalties considered so far have the lasso-like property of shrinking parameters to for a finite value of . However, it is possible to find a penalty that achieves this. Consider

also illustrated in Figure 2 for . This penalty function is bounded above, but has no lower bound as the individual approach . In fact, this penalty function is not associated with a norm as defined in Section 2.1, putting it somewhat outside the framework in which our estimation problem has been formulated. The latter point notwithstanding, the corresponding penalized likelihood estimator is

for . Perhaps this penalty can appropriately be described as “greedy” in the sense that it has the potential to dominate the data and result in a shrinkage estimator of even when there are observed successes suggesting otherwise.

Figure 2: Penalties (left) and (right) penalty functions, respectively unbounded from above and below, for binomial success probabilities.
Figure 3: Schematic representation of four different penalized estimators shrinking closer to .

The four penalized solutions above all corresponding to some notion of success probabilities being “close to ” or “not too large” . Figure 3 shows a schematic representation of the behavior of these estimators as a function of .

2.3 Other Shrinkage Configurations

The penalized estimators of Section 2.2 all revolve around the goal of ensuring that the estimates are close to . If, on the other hand, it were desired the have estimates close to , then by symmetry all of the examples considered could replace the in each of the penalty functions by . Of course, many other types of penalties could also be of interest. For example, say a client expressed the belief that all of the should be close to some specified value . Then, for this specified , define

Note that this penalty function has a minimum when all the are equal to , and is unbounded above whenever one of the approach either or . This penalty therefore shrinks the towards the specified value. The penalized estimators are

For the variable, this estimator is a linear combination of the MLE and . The careful reader may also notice that this estimator has much in common with the Bayesian estimator of a binomial success probability with a beta prior. Here, controls whether the strength of evidence lies with the empirical estimator or with the pre-specified reference . Similarly, say the client believes the success probabilities should all be “close” to one another, but unlike in , no value has been specified. For the WRI data, this is equivalent to expecting the passage success proportions to be similar to one another. For this, define bounded penalty function

Alternatively, if an unbounded penalty function is preferred, one could use


is the standard normal quantile function. Neither of these penalties results in closed-form solutions for the shrinkage estimators

, .

3 Data-Driven Shrinkage

In Section 2, different penalized likelihood approaches were considered for estimating binomial success probabilities. However, in the presentation, the value of the parameter was assumed known. As controls the relative importance of the penalty function, it is important to choose a value resulting in parameter estimates with small MSE. Here, one selection approach using V-fold cross-validation (VFCV) is presented.

Consider an dataset consisting of independently sampled variables, with the th variable consisting of independent observations. Let denote the observations corresponding to the th variable. VFCV proceeds by partitioning the data into V subsets of roughly equal size. For the th variable, let , denote a partition of the indices such that and for all .

VFCV repeatedly creates subsets of the data for model training, in each instance leaving out one of the subsets per variable. The subsets left out in each iteration are then used for model validation. More specifically, the model building data subsets are used to estimate penalized parameter estimates for various degrees of penalty enforcement, say possible values of satisfying . The negative log-likelihood function is then evaluated using penalized estimators corresponding to each possible value of and using the validation subsets. This is repeated times, and the optimal value of is chosen as where the negative log-likelihood function averaged over the validation subsets is minimized.

Algorithmically, implementation of VFCV proceeds as follows. For each fold :

  • For the variable, form a training dataset by excluding the th fold, , and let the th fold equal to the validation set, . Let denote the number of observations in . Also let and denote the collection of the training and validation sets for all variables.

  • For each value , find the estimators that minimize the penalized negative log-likelihood function

    where .

  • Calculate the validation function by evaluate the negative log-likelihood at this estimator,

The VFCV score is then defined as


and the optimal shrinkage level is taken to be the that minimizes , i.e. where . Note that after the optimal penalty level has been chosen using VFCV, penalized estimators are calculated one more time using the full dataset. The penalized likelihood estimator with data-driven shrinkage, denoted , is the value that minimizes

where . The literature on cross-validation recommends various choices for , with common values ranging from to . The choice is equivalent to leave-one-out cross-validation and can become computationally expensive. As discussed in (arlot2010survey), the size of the validation set has an effect on the bias of the penalized estimator, while the number of folds

controls for the variance of the estimated penalization parameter. These authors also discuss some asymptotic considerations of cross-validation. If

denotes the size of the training set, then for , cross-validation is asymptotically equivalent to Mallows’ and therefore asymptotically optimal. Furthermore, if , then asymptotically the model is equivalent to Mallows’ multiplied by (or over-penalized by) a factor . Asymptotics notwithstanding, throughout the remainder of this paper, an approach of is used. This strikes a balance having larger training sets and reasonable computational costs.

4 Simulation Studies

In Section 2, various shrinkage estimators for the binomial distribution were considered, while Section 3 described a V-fold cross-validation approach for choosing the penalty parameter. Of course, the binomial model is not the only count model of interest. In this section, shrinkage estimation is considered both for the binomial model as well as two related models, the zero-inflated binomial and the beta-binomial. In most scenarios investigated here, no closed-form solutions for the penalized estimators are available. Even so, these simulation studies are very useful for investigating the properties of different penalty functions as they impact estimation for the three models. Note that simulations are restricted here to independent variables, each consisting of trials and having independent observations per variable. This choice was made so that the simulations would, at least in part, closely resemble the real data motivating this work.

4.1 The Binomial Model

A sample was generated with independent observations with , , and . The binomial success probabilities

were sampled from different scaled beta distributions. Specifically, for success probability lower and upper bounds


, three shapes of the success probability distribution were considered, namely a skewed distribution

, a very flat distribution , and a bell-shaped distribution . The three success probability distributions are illustrated in Figure 4 below. The term controlling how aggressively the penalty gets enforced was chosen using cross-validation using possible values ranging from to spaced approximately equi-distant on a logarithmic scale. VFCV was used to determine the for each penalty function under consideration.

Figure 4: Success probability distributions considered in the simulation study.

In addition to the estimators resulting from the use of different penalty functions, maximum likelihood estimators were also calculated. In total, samples were generated for each configuration of success probability bounds and Beta shape parameters. Summarized in the tables below are the Monte Carlo estimates of the MSE ratios. For the th sample , let denote the true success probabilities simulated from a specified scaled Beta distribution. Let denote the MLE and let denote a penalized estimator found using VFCV. Define Sum of Squared Deviations . Then, the Monte Carlo MSE ratios are defined as

where the subscript “” emphasizes the specific penalty function used to obtain the estimators. An MSE ratio less than indicates superior performance of the penalized estimator.

In Table 4.1, the results of shrinking success probabilities to zero are presented. Each of the penalties , from Section 2.2 were considered. In Table 4.1, the results of shrinking success probabilities closer to one another are presented. The penalties and from Section 2.2 were considered.

MSE ratios comparing penalized parameter estimates to maximum likelihood when shrinking estimators to 0. Penalty Shape Pen Pen Pen Pen Skew 0.999 0.956 0.988 1.382 Flat 0.999 0.968 0.995 1.012 Bell 0.999 0.961 0.997 1.011 Skew 0.999 0.977 0.995 1.013 Flat 1.000 0.982 0.994 1.004 Bell 1.000 0.978 0.996 1.002 Skew 0.998 0.998 1.015 0.999 Flat 0.999 0.998 1.037 1.000 Bell 1.000 0.999 1.027 1.003

MSE ratios comparing penalized parameter estimates to maximum likelihood when shrinking estimators closer to one another. Penalty Shape L Q Skew 0.928 0.906 Flat 0.935 0.942 Bell 0.705 0.704 Skew 0.960 0.952 Flat 0.969 0.973 Bell 0.854 0.856 Skew 0.411 0.411 Flat 0.652 0.652 Bell 0.292 0.293

In Table 4.1, the best-performing penalty function is . Even so, the relative improvement in efficiency is small throughout. The only penalty that consistently leads to worse performance than maximum likelihood is . Recall that this penalty function is not associated with a norm and is able to very aggressively shrink success probabilities to . This simulation suggests that, at least in the scenarios considered, this penalty shrinks too aggressively. For the other three estimators though, VFCV results in penalized estimators with slightly better performance than MLE. In Table 4.1, the performance of the and penalties is nearly indistinguishable. When shrinking parameters closer to one another, large gains in efficiency are sometimes realized. This is especially notable when the Beta shape from which the success probabilities are generated is bell-shaped, i.e. the are close to one another. In all instances, VFCV results in penalized estimators with performance superior to maximum likelihood. Altogether, these simulations illustrate that both the average success probability and the spacing of the relative to that average are important in determining the reduction in MSE. For penalties shrinking the closer to one another, an MSE ratios below was realized, showing dramatic improvement due to shrinkage.

4.2 The Zero-inflated Binomial Distribution

The pmf of the zero-inflated binomial (ZIB) distribution is

where represents the excess zero probability, and and are the binomial success probabilities and number of trials. For , it follows that . Consequently, the parameter is the expected proportion of successes in a ZIB model. The parameter is of primary interest when considering possible penalty functions, especially under the assumption that the different ZIB distributions are “similar” to one another.

In the simulation study, a sample was generated with independent ZIB variables, . As with the binomial model simulation, , , and . The overall expected success proportions and the excess zero probabilities were sampled from scaled beta distributions as per Figure 4 with the specific bounds listed in the table below. Letting , , the ZIB simulation considered three penalty functions: , , and . The first of these, termed zero shrinkage, results in success proportions closer to . The second, termed mean shrinkage, results in mean success proportions closer to one another. The third, termed full shrinkage, shrinks all closer to one another and all closer to one another. While both the penalties and have the goal of estimating models that are “similar” to one another, the second penalty is much more strict. To see this, consider two parameters . Under the first penalty, these contribution of their squared difference is . However, even with the corresponding parameters and may be very different.

In addition to using VFCV to select the level of shrinkage for the above three penalities, a combined estimator, termed minCV, was calculated by selecting the model parameters associated with the penalty function having minimum VFCV score. The same set of values ranging from to were used. The Monte Carlo MSE ratios for the success proportions are in Table 4.2. The MSE ratios for and were also calculated, and these can be found in the Supplemental Material.

MSE ratios for ZIB success proportions comparing penalized parameter estimates to maximum likelihood for different penalization approaches. Penalty Shape Zero Mean Full minCV Skew 0.957 0.888 0.981 0.958 Flat 0.977 0.942 0.979 0.983 Bell 0.964 0.668 0.836 0.755 Skew 0.968 0.364 0.368 0.356 Flat 0.971 0.562 0.526 0.523 Bell 0.968 0.258 0.246 0.239 Skew 1.006 0.969 0.860 0.885 Flat 1.010 1.005 0.808 0.819 Bell 1.009 0.821 0.873 0.899 Skew 0.963 0.203 0.635 0.273 Flat 0.955 0.223 0.934 0.259 Bell 0.951 0.183 0.372 0.245

These MSE ratios in Table 4.2 are based on simulated datasets for each possible simulation configuration. While the zero shrinkage penalty does result in some efficiency gains in most scenarios, overall MSE ratios close to 1 suggest little improvement from from using this penalty. On the other hand, both mean and full shrinkage result in large decreases in the MSE ratios. Overall, it cannot be said that either mean and full shrinkage performs best. This makes sense, as it depends on the configuration of all parameters and not just the mean parameters. Finally, while minCV does not always have the smallest MSE ratio, it is generally close to the minimum. This suggests that data-driven selection of the level of shrinkage as well as the penalty function leads to good performance for the model.

4.3 The Beta-Binomial Model

The probability mass function of the beta-binomial distribution is given by

where is the Beta function and is the number of trials. The parameters and control the mean and variance of the model. Specifically, for and , the mean and variance can be written as and . In this parameterization, and denote, respectively, the expected proportion of successes and the the over-dispersion of the model relative to the binomial.

Samples were generated with independent Beta-Binomial variables, , with , , and . The overall expected success proportions and the overdispersion were sampled from scaled beta distributions as per Figure 4 with the specific bounds listed in the table below. As in the ZIB case, thie BetaBin simulation considered three penalty functions. Letting , , these were: , , and . These are again termed, respectively, zero shrinkage, mean shrinkage, and full shrinkage. Again, after choosing the shrinkage levels using VFCV, one final estimator, termed minCV, was calculated by selecting among the three penalized estimators the one with the smallest CV score.

For each parameter configuration, a total of samples were generated. The MSE ratios are reported in Table 4.3. The table shows the results for the success proportions , and the equivalent results for and can be found in the Supplemental Material.

MSE ratios for Beta-Binomial success proportions comparing penalized parameter estimates to maximum likelihood for different penalization approaches. Penalty Shape Zero Mean Full minCV Skew 0.917 0.474 0.428 0.429 Flat 0.928 0.702 0.591 0.604 Bell 0.921 0.290 0.270 0.271 Skew 0.974 0.722 0.726 0.708 Flat 0.977 0.903 0.948 0.889 Bell 0.973 0.476 0.466 0.463 Skew 0.971 0.301 0.400 0.331 Flat 0.971 0.445 0.762 0.481 Bell 0.968 0.217 0.242 0.227 Skew 0.905 0.170 0.469 0.211 Flat 0.891 0.188 0.733 0.213 Bell 0.893 0.155 0.187 0.175

When looking at the results in Table 4.3, it comes as no surprise that zero shrinkage is the least effective approach here, even while still being more effective than maximum likelihood. For most of the simulation configurations, MSE ratios under mean and full shrinkage are comparable. Here, the minCV approach is also very impressive, in most instances nearly matching the best-performing method. This reaffirms that VFCV can be effectively used to choose both the level of shrinkage for a specific penalty function, but then also choose from among competing penalty functions.

5 Data Analysis

The work presented in this paper was motivated by the reading data collection a sample of 508 elementary-school aged children. Each child read one randomly selected passage out of ten possible passages. This resulted in roughly 50 Words Read Incorrectly (WRI) scores per passage. The passage lengths varied from 44 to 69 words with an average length of 51 words. Of interest is to accurately and efficiently estimate passage difficulty as measured by the average proportion of words read incorrectly. Note that higher WRI proportions indicate that a passage is more difficult. Figure 5 provides information about the passage-specific WRI proportions. The solid dot in each violin plot represents the mean WRI proportion, a typical estimate of passage difficulty.

Figure 5: Proportions of WRI divided by length of passages to compare between different passages

Note that the mean WRI proportions in Figure 5 appear fairly close to one another, meaning passages are crafted to be within a narrow range of difficulty to reinforce fairness in grading and modeling. Thus, as one expects the WRI proportions to be close to one another, appropriate shrinkage may result in improved estimates.

Three models and three types of shrinkage were considered for the data at hand. In each instance, the same set of partitions were used to select a smoothing parameter with 10-fold cross validation. Table 5 reports the cross-validation scores as defined in (3).

10-fold CV scores and optimal values for the three distributions considered. Distribution Penalty CV Score Binomial None Zero Mean ZIB None Zero Mean Full BetaBin None Zero Mean Full

Beta-binomial parameter estimates for the WRI data. Maximum Likelihood Mean Shrinkage Full Shrinkage Passage P1 1.28 66.34 0.019 1.20 52.67 0.022 0.70 27.45 0.025 P2 1.51 45.15 0.032 1.50 47.47 0.031 0.91 27.44 0.032 P3 0.84 19.85 0.040 0.80 22.81 0.034 0.96 27.44 0.034 P4 2.47 160.0 0.015 2.25 123.5 0.018 0.67 27.45 0.024 P5 1.17 42.54 0.027 1.17 41.65 0.027 0.83 27.45 0.030 P6 1.18 29.10 0.039 1.13 32.52 0.034 0.97 27.44 0.034 P7 0.53 19.48 0.026 0.53 18.75 0.027 0.74 27.44 0.026 P8 0.87 25.37 0.033 0.86 27.20 0.031 0.88 27.44 0.031 P9 0.85 32.25 0.026 0.84 30.74 0.027 0.79 27.45 0.028 P10 0.65 17.03 0.037 0.63 19.14 0.032 0.89 27.44 0.031

It is evident from Table 5 that all variations of the beta-binomial model have much lower cross-validation scores than either the binomial or zero-inflated binomial models. With regards to penalty type, full shrinkage works best for this model with mean shrinkage being a distant second choice. Table 5 shows the beta-binomial parameter estimates obtained using maximum likelihood as well as penalized likelihood with mean shrinkage and full shrinkage. It is interesting to note that in the full shrinkage solution, the values have all been shrunk to within of a common value, but the still exhibit a fair spread of values. In the maximum likelihood sense, the estimated success proportions range from to , while the full shrinkage values range from to . The latter shows much more adherence to the idea that the passages are similar in terms of difficulty.

Figure 6: Beta-binomial parameter estimates under mean shrinkage and full shrinkage. Dashed line indicates optimal shrinkage. Scale value to improve full shrinkage plot readability is .

For the interested reader, Figure 6 shows the penalized likelihood estimate trajectories for mean shrinkage and full shrinkage as a function of . The estimates of are presented on a logarithmic scale. For mean shrinkage, the horizontal scale is and for full shrinkage it is with . These adjustments were all made to improve readability of the plots. Dashed vertical lines indicate the optimal shrinkage solutions as determined by VFCV.

Under mean shrinkage, the passage-specific and still exhibit a large spread even when the success proportions are close to one another. Under full shrinkage, the values are very quickly shrunk to a nearly common value while the still exhibit some spread.

6 Conclusions

The goal of this project was the penalized estimation of multiple independent success proportions from multivariable count data. The application of interest considered WRI scores realized by students during an ORF assessment. The simulation results showed that across the scenarios considered, large decreases in MSE were often achieved. There is also very little risk in using penalized likelihood, as using V-fold cross validation never resulted in a large increase in MSE. When applying the methodology to the data of interest, it is seen that the resulting penalized estimators of the success proportions have a much tighter spread. This affirms the expectation that the passages are very similar in difficulty, with estimated difficulty scores ranging from to of words expected to be read incorrectly. Even so, this does highlight one important avenue for future research. If students are reading different passages to assess ORF, it is desirable to have a method that standardizes WRI scores to be independent of passage difficulty. Also, in practice students typically read multiple passages, so exploring methods accounting for correlated WRI scores need to be considered.


Supplemental Material

In the main paper, simulation results are only presented for the success proportion estimates . Of course, it is reasonable to assess the effect of shrinkage on other model parameters as well. The MSE ratios as defined in Section 3 are presented here when estimating the parameters in the zero-inflated binomial model, see Table 7, and when estimating the parameters in the beta-binomial model, see Table 8. These results confirm that shrinkage can be very beneficial to the estimation of incidental model parameters as well.

MSE ratios for Zero-Inflated Binomial success proportions and comparing penalized parameter estimates to maximum likelihood for different penalization approaches. Shape Zero Mean Full Min CV Zero Mean Full Min CV (0.01,0.05) (0.10,0.14) Skew 1.027 1.000 0.647 0.741 0.947 0.979 0.890 0.928 Flat 1.021 0.992 0.551 0.606 0.973 0.979 0.893 0.921 Bell 1.033 0.998 0.834 0.959 0.964 0.845 0.894 0.889 (0.04,0.06) (0.20,0.30) Skew 1.024 0.855 0.171 0.238 0.979 0.692 0.430 0.446 Flat 1.022 0.868 0.252 0.336 0.981 0.800 0.698 0.701 Bell 1.022 0.770 0.130 0.167 0.980 0.659 0.302 0.318 (0.15,0.30) (0.04,0.06) Skew 1.029 1.059 0.434 0.495 0.998 0.945 1.069 1.071 Flat 1.031 1.047 0.471 0.494 0.997 0.971 1.038 1.039 Bell 1.036 1.009 0.394 0.487 0.998 0.871 1.190 1.204 (0.05,0.06) (0.20,0.70) Skew 1.010 0.610 0.775 0.710 0.983 0.710 0.926 0.779 Flat 0.994 0.517 0.905 0.551 0.991 0.794 0.982 0.805 Bell 0.993 0.478 0.567 0.535 0.985 0.739 0.760 0.746

MSE ratios for Beta-Binomial success proportions and comparing penalized parameter estimates to maximum likelihood for different penalization approaches. Shape Zero Mean Full Min CV Zero Mean Full Min CV (0.05,0.10) (4,6) Skew 0.993 0.922 0.201 0.319 1.037 0.718 0.097 0.192 Flat 0.991 0.932 0.318 0.41 1.037 0.792 0.171 0.254 Bell 0.992 0.922 0.144 0.209 1.039 0.662 0.071 0.119 (0.12,0.22) (2,5) Skew 0.995 0.98 0.69 0.85 1.006 0.954 0.637 0.81 Flat 0.997 1.015 0.999 1.015 1.006 1.01 0.999 1.01 Bell 0.993 0.917 0.186 0.373 1.014 0.845 0.154 0.322 (0.17,0.22) (3,8) Skew 0.992 0.928 0.593 0.746 1.017 0.832 0.544 0.664 Flat 0.994 0.96 1.077 1.019 1.011 0.91 1.058 0.958 Bell 0.991 0.931 0.283 0.401 1.021 0.791 0.241 0.333 (0.05,0.06) (2,10) Skew 0.993 0.988 0.934 0.99 1.017 0.866 0.901 0.866 Flat 0.996 0.958 1.015 0.96 1.007 0.787 1.007 0.788 Bell 0.992 0.91 0.376 0.517 1.046 0.557 0.229 0.306