1 Introduction
The definition of Oral Reading Fluency (ORF) is “the oral translation of text with speed and accuracy,” see for example fuchs2001oral and shinn1992curriculum. Reading fluency is a skill developed during childhood that is needed to understand the meaning of texts and literary pieces. There is a strong correlation between reading fluency and reading comprehension, see allington1983fluency; johns1983informal; samuels1988decoding; schreiber1991understanding. According to disalle2017impact, once a student has identified a word and read it correctly, their focus generally shifts from word recognition (attempting to recognize the word) to comprehension (making meaning of the word), which leads to overall understanding of the text. These authors claim that inadequate ORF levels underlie up to 90% of reading comprehension issues. If a child is not a fluent reader, their ability to comprehend what they read is hindered and they will have trouble grasping the meaning of texts. Thus, ORF assessment is a method of evaluating whether a child is at the appropriate reading level compared to their peers, and it provides a quantifiable score to identify at-risk students with poor reading skills.
In this paper, we analyze ORF data collected from a sample of fourth-grade students. Each child was given one of ten possible passages to read, and the number of words read incorrectly (WRI) was recorded. Reading sessions were recorded so that observer error in counting the numbers of words read correctly and incorrectly could be minimized. Strong readers tend to have low WRI scores and weak readers tend to have high WRI scores. However, as the passages may not all be equal in difficulty, caution is warranted when using WRI scores obtained from different passages to measure overall ORF levels in a classroom setting.
This paper develops a likelihood-based approach for estimating passage difficulty as measured by the expected WRI proportion. The passages in an ORF assessment are generally designed to be comparable in difficulty. Passages are also designed not to be overly challenging for proficient readers. These two passage properties can be incorporated into the estimation of WRI proportions using penalty functions. Specifically, penalty functions are considered that encourage the estimated passage-specific WRI proportions to be close to one another and/or close to zero. The use of parameter shrinkage is further motivated by a small sample size per passage relative to the number of passages. Here, there are approximately 50 WRI measurements for each of the 10 passages.
There is, of course, a rich literature on parameter shrinkage in various statistical models. One of the definitive examples in the multivariable setting is the James-Stein estimator of the mean, see stein1956inadmissibility. This estimator is often described as “borrowing” information between variables to obtain a more efficient estimator. Other applications of shrinkage include pandey1985bayes and jani1991class, who considered univariate Bayes-type shrinkage in, respectively, a Weibull distribution and an exponential distribution. In the bivariate setting, shrinkage was used to estimate probabilities of the form $P(X < Y)$ for underlying exponential distributions, see baklizi2003shrinkage.

One of the most frequently encountered applications of shrinkage is in regression models with a large number of predictor variables. The lasso, developed by tibshirani1996regression, is one such technique which revolutionized parameter estimation in generalized linear models (GLMs). The lasso shrinks regression parameters towards zero using an $L_1$ penalty, resulting in predictors being “dropped” from the model by setting the corresponding coefficients equal to zero. The lasso was predated by ridge regression using an $L_2$ penalty, see hoerl1970ridge. This approach can result in some regression coefficients being very close to zero, but unlike the lasso it does not eliminate potential predictor variables from the model altogether. Other examples of shrinkage applied to GLMs include maansson2013developing and qasim2020new, who developed Liu-type estimators for, respectively, a zero-inflated negative binomial regression model and a Poisson regression model. Shrinkage estimation of fixed effects in a random-effects zero-inflated negative binomial model was considered by zandi2021using. The monographs gruber2017improving and hastie2019statistical are very good resources for further exploration of shrinkage in regression models.

In this paper, the parameters of interest are success proportions (with a “success” being that a word has been read incorrectly during an assessment). Shrinkage as applied to the estimation of proportions has received limited attention in the literature. In the univariate case, lemmer1981note considered three different estimators for a binomial success probability, and lemmer1981ordinary proposed estimators of the type $\tilde{p} = w \hat{p} + (1 - w) p_0$, where $p_0$ is an a priori guess and $w \in [0, 1]$. However, neither of these papers considers likelihood-based methods or provides guidance on selecting the amount of shrinkage. Some work close in spirit to this paper is that of hansen2016efficient, who considers three shrinkage approaches, namely restricted maximum likelihood, an efficient minimum distance approach, and a projection approach. The work of Hansen requires the specification of a shrinkage direction, which is similar to the selection of a penalty function.
The remainder of this paper proceeds as follows. In Section 2, the penalized likelihood approach is more fully developed, emphasizing the binomial distribution for clarity of exposition. In Section 3, V-fold cross-validation is presented as an approach for selecting the amount of shrinkage. Section 4 presents results from extensive simulation studies, and the motivating data are analyzed in Section 5.
2 Shrinkage through Penalized Likelihood Methods
2.1 Shrinkage through Penalized Likelihood Estimation
Consider a collection of random variables $X_{ij}$, $i = 1, \ldots, K$, $j = 1, \ldots, n_i$, with the $X_{ij}$ mutually independent and $X_{ij} \sim F_i(\,\cdot\,; \boldsymbol{\theta}_i)$. Here, $F_i$ denotes a distribution function with $d_i$-dimensional parameter $\boldsymbol{\theta}_i$. Let $\Theta$ denote the parameter space associated with the collection of parameters $\boldsymbol{\theta} = (\boldsymbol{\theta}_1, \ldots, \boldsymbol{\theta}_K)$. Also let $\ell(\boldsymbol{\theta})$ denote the log-likelihood of the data and let $\Theta_0 \subseteq \Theta$ denote a specified subset of the parameter space that is of interest. Finally, for $\boldsymbol{\theta} \in \Theta$, let $\|\cdot\|$ be a norm and define $d(\boldsymbol{\theta}) = \min_{\boldsymbol{\theta}_0 \in \Theta_0} \|\boldsymbol{\theta} - \boldsymbol{\theta}_0\|$. That is, $d(\boldsymbol{\theta})$ is the minimum distance between a point $\boldsymbol{\theta}$ and the space $\Theta_0$ as measured by the norm $\|\cdot\|$. Note that whenever $d(\boldsymbol{\theta}_1^*) < d(\boldsymbol{\theta}_2^*)$, the point $\boldsymbol{\theta}_1^*$ is closer to the region $\Theta_0$ than the point $\boldsymbol{\theta}_2^*$.

In this context, parameter shrinkage is said to be any estimation method that balances adherence to the data-generating model as measured by $\ell(\boldsymbol{\theta})$ and the closeness of an estimator to $\Theta_0$ as measured by $d(\boldsymbol{\theta})$. One approach to balancing these functions is penalized likelihood. Adopting the convention that $J(\boldsymbol{\theta})$ denotes the penalty function, the penalized likelihood estimator is found by minimizing
$$Q_\lambda(\boldsymbol{\theta}) = -\ell(\boldsymbol{\theta}) + \lambda J(\boldsymbol{\theta}), \qquad (1)$$
with $\lambda \geq 0$ a specified constant. The two component functions of $Q_\lambda$ often exist in some kind of tension; minimizing $-\ell(\boldsymbol{\theta})$ alone gives the maximum likelihood estimator (MLE), while $J(\boldsymbol{\theta})$ attains its minimum for any $\boldsymbol{\theta}$ in $\Theta_0$, where the desired parameter constraint is fully satisfied. The tension can be ascribed to the MLE not necessarily being close to the subset of interest $\Theta_0$. The magnitude of $\lambda$ determines the balance between these at times competing interests.
Calculation of the penalized likelihood estimator requires the specification of a generating model, a penalty function, and a value for the parameter $\lambda$. Throughout this paper, generating models closely related to the binomial distribution are considered. The remainder of Section 2 will consider possible choices of the penalty function while assuming $\lambda$ is known. In Section 3, the use of V-fold cross-validation to select $\lambda$ is discussed. Note that when it comes to the selection of a penalty function, it will often be the case that the client presents the statistician with a non-mathematical description of $\Theta_0$. There may be multiple ways of constructing a set $\Theta_0$ and a penalty function that satisfy the description. Therefore, the penalty functions considered in this paper should in no way be considered an exhaustive enumeration of the possibilities. Rather, these are intended to illustrate the large number of ways in which shrinking can be implemented.
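To make the general recipe in (1) concrete, the following minimal sketch computes a penalized estimator numerically for binomial data with a user-supplied penalty. This is an illustrative sketch only (Python with NumPy/SciPy assumed; all function and variable names are our own, not from any existing package):

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_lik(p, x, m):
    """Binomial negative log-likelihood; row i of x holds the observed
    success counts for variable i, each out of m trials."""
    p = np.clip(p, 1e-10, 1 - 1e-10)   # guard against log(0)
    s = x.sum(axis=1)                  # total successes per variable
    N = m * x.shape[1]                 # total trials per variable
    return -np.sum(s * np.log(p) + (N - s) * np.log(1 - p))

def penalized_estimate(x, m, penalty, lam):
    """Minimize -loglik(p) + lam * penalty(p) over p in (0, 1)^K, cf. (1)."""
    K = x.shape[0]
    obj = lambda p: neg_log_lik(p, x, m) + lam * penalty(p)
    res = minimize(obj, np.full(K, 0.5), bounds=[(1e-6, 1 - 1e-6)] * K)
    return res.x

# Example: an L1-type penalty shrinking all success probabilities toward zero.
rng = np.random.default_rng(0)
x = rng.binomial(50, [0.02, 0.05, 0.10], size=(40, 3)).T   # K = 3 variables
print(penalized_estimate(x, 50, lambda p: p.sum(), lam=25.0))
```

Any of the penalty functions discussed below can be passed in place of the anonymous function in the final line.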
2.2 Shrinkage to Zero in Binomial Models
Let $x_{ij}$, $j = 1, \ldots, n_i$, denote observed realizations of the independent random variables $X_{ij} \sim \mathrm{Bin}(m_i, p_i)$, $i = 1, \ldots, K$. Assume that the numbers of binomial trials $m_i$ are known and that estimation of the success probabilities $p_i$, $i = 1, \ldots, K$, is of interest. Here, the log-likelihood is given by
$$\ell(\boldsymbol{p}) = \sum_{i=1}^{K} \sum_{j=1}^{n_i} \left[ x_{ij} \log p_i + (m_i - x_{ij}) \log(1 - p_i) \right] + C,$$
where $C$ does not depend on $\boldsymbol{p} = (p_1, \ldots, p_K)$.
Now, say that the client has expressed a strong belief that the success probabilities should all be “small.” In the context of the WRI data, this is equivalent to expecting that only a small proportion of words will be read incorrectly by a grade-level reader. This is consistent with setting $\Theta_0 = \{\boldsymbol{0}\}$. There are numerous penalty functions that can assess the closeness of a potential parameter value to $\Theta_0$. For example, penalties based on the $L_1$ and squared $L_2$ norms,
$$J_1(\boldsymbol{p}) = \sum_{i=1}^{K} p_i \quad \text{and} \quad J_2(\boldsymbol{p}) = \sum_{i=1}^{K} p_i^2, \qquad (2)$$
are candidates worth considering. In the context of binomial success probabilities, both of these functions are bounded, with $0 \leq J_1(\boldsymbol{p}) \leq K$ and $0 \leq J_2(\boldsymbol{p}) \leq K$. Figure 1 below illustrates the shapes of these penalties.
Note that as $p_i^2 \leq p_i$ for all $p_i \in [0, 1]$, the $L_1$ penalty will more “aggressively” shrink success probabilities to $0$ than the squared $L_2$ penalty. Due to the resemblance of the $J_1$ penalty to the commonly-used lasso penalty in regression, it should however be pointed out that this function will never result in shrinkage to exactly $0$. In fact, the penalized negative log-likelihood function $-\ell(\boldsymbol{p}) + \lambda J_1(\boldsymbol{p})$ has unique solution
$$\tilde{p}_i = \frac{(N_i + \lambda) - \sqrt{(N_i + \lambda)^2 - 4 \lambda s_i}}{2 \lambda},$$
where $s_i = \sum_{j=1}^{n_i} x_{ij}$, $N_i = n_i m_i$, and $\hat{p}_i = s_i / N_i$ is the unpenalized MLE. While it is not necessarily intuitive from the form of the penalized estimator, it can easily be verified that $0 < \tilde{p}_i \leq \hat{p}_i$ for all $\lambda > 0$ whenever $s_i > 0$. The solution under the $J_2$ penalty is also easy to compute numerically, but no general closed-form expression is possible as it requires solving a cubic polynomial.
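As a quick numerical illustration of this closed form (a minimal sketch; the quadratic-root formula is as given above, and all names are illustrative):

```python
import numpy as np

def l1_shrunk(s, N, lam):
    """Smaller root of lam*p^2 - (N + lam)*p + s = 0, the minimizer of
    -loglik + lam * p for a single binomial variable."""
    return ((N + lam) - np.sqrt((N + lam) ** 2 - 4.0 * lam * s)) / (2.0 * lam)

s, N = 12, 2500                          # 12 successes out of 2500 trials
for lam in [1.0, 10.0, 100.0, 1000.0]:
    print(lam, l1_shrunk(s, N, lam))     # decreases with lam, never exactly 0
```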
The bounded nature of $J_1$ and $J_2$ in (2) may not appeal to some. One choice of an unbounded penalty is
$$J_3(\boldsymbol{p}) = -\sum_{i=1}^{K} \log(1 - p_i).$$
This penalty has a lower bound of $0$, but has no upper bound as any $p_i$ approaches $1$. For an illustration, see Figure 2. The solution to the corresponding penalized likelihood problem is
$$\tilde{p}_i = \frac{s_i}{N_i + \lambda}.$$
None of the penalties considered so far has the lasso-like property of shrinking parameters to $0$ for a finite value of $\lambda$. However, it is possible to find a penalty that achieves this. Consider
$$J_4(\boldsymbol{p}) = \sum_{i=1}^{K} \log p_i,$$
also illustrated in Figure 2. This penalty function is bounded above, but has no lower bound as the individual $p_i$ approach $0$. In fact, this penalty function is not associated with a norm as defined in Section 2.1, putting it somewhat outside the framework in which our estimation problem has been formulated. The latter point notwithstanding, the corresponding penalized likelihood estimator is
$$\tilde{p}_i = \max\left\{ \frac{s_i - \lambda}{N_i - \lambda},\, 0 \right\}$$
for $\lambda < N_i$. Perhaps this penalty can appropriately be described as “greedy” in the sense that it has the potential to dominate the data and result in a shrinkage estimator of $\tilde{p}_i = 0$ even when there are observed successes suggesting otherwise.
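The exact-zero thresholding of this estimator can be seen directly in a small sketch (illustrative code, using the closed form above):

```python
def log_pen_shrunk(s, N, lam):
    """Minimizer of -loglik + lam * log(p): exactly zero once lam >= s."""
    return max((s - lam) / (N - lam), 0.0)

s, N = 12, 2500
for lam in [0.0, 5.0, 11.0, 12.0, 50.0]:
    print(lam, log_pen_shrunk(s, N, lam))  # hits exactly 0 at lam = s = 12
```

Note that the estimate is driven to exactly $0$ as soon as $\lambda$ reaches the observed success count $s_i$, regardless of how many successes were observed.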
The four penalized solutions above all correspond to some notion of success probabilities being “close to $0$” or “not too large.” Figure 3 shows a schematic representation of the behavior of these estimators as a function of $\lambda$.
2.3 Other Shrinkage Configurations
The penalized estimators of Section 2.2 all revolve around the goal of ensuring that the estimates are close to $0$. If, on the other hand, it were desired to have estimates close to $1$, then by symmetry all of the examples considered could replace the $p_i$ in each of the penalty functions by $1 - p_i$. Of course, many other types of penalties could also be of interest. For example, say a client expressed the belief that all of the $p_i$ should be close to some specified value $p_0 \in (0, 1)$. Then, for this specified $p_0$, define
$$J_5(\boldsymbol{p}) = -\sum_{i=1}^{K} \left[ p_0 \log p_i + (1 - p_0) \log(1 - p_i) \right].$$
Note that this penalty function has a minimum when all the $p_i$ are equal to $p_0$, and is unbounded above whenever one of the $p_i$ approaches either $0$ or $1$. This penalty therefore shrinks the $p_i$ towards the specified value. The penalized estimators are
$$\tilde{p}_i = \frac{s_i + \lambda p_0}{N_i + \lambda} = \frac{N_i}{N_i + \lambda} \hat{p}_i + \frac{\lambda}{N_i + \lambda} p_0.$$
For the $i$th variable, this estimator is a linear combination of the MLE $\hat{p}_i$ and $p_0$. The careful reader may also notice that this estimator has much in common with the Bayesian estimator of a binomial success probability under a beta prior. Here, $\lambda$ controls whether the strength of evidence lies with the empirical estimator or with the pre-specified reference value $p_0$.
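A small sketch makes the Bayesian connection explicit (illustrative code; the correspondence assumes a $\mathrm{Beta}(\lambda p_0,\, \lambda(1 - p_0))$ prior, whose posterior mean is algebraically identical to the estimator above):

```python
import numpy as np

def shrink_to_p0(s, N, lam, p0):
    """Blend of the MLE s/N and the reference value p0; identical to the
    posterior mean under a Beta(lam*p0, lam*(1 - p0)) prior."""
    return (s + lam * p0) / (N + lam)

s, N, p0 = np.array([3.0, 12.0, 30.0]), 2500, 0.01
print(shrink_to_p0(s, N, lam=0.0, p0=p0))    # lam = 0 returns the MLEs
print(shrink_to_p0(s, N, lam=1e6, p0=p0))    # huge lam returns p0 everywhere
```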
Similarly, say the client believes the success probabilities should all be “close” to one another, but unlike in $J_5$, no value $p_0$ has been specified. For the WRI data, this is equivalent to expecting the passage success proportions to be similar to one another. For this, define the bounded penalty function
$$J_6(\boldsymbol{p}) = \sum_{i=1}^{K} (p_i - \bar{p})^2, \qquad \bar{p} = \frac{1}{K} \sum_{i=1}^{K} p_i.$$
Alternatively, if an unbounded penalty function is preferred, one could use
$$J_7(\boldsymbol{p}) = \sum_{i=1}^{K} \left( \Phi^{-1}(p_i) - \frac{1}{K} \sum_{j=1}^{K} \Phi^{-1}(p_j) \right)^2,$$
where $\Phi^{-1}$ is the standard normal quantile function. Neither of these penalties results in closed-form solutions for the shrinkage estimators $\tilde{p}_i$, $i = 1, \ldots, K$.

3 Data-Driven Shrinkage
In Section 2, different penalized likelihood approaches were considered for estimating binomial success probabilities. However, in that presentation, the value of the penalty parameter $\lambda$ was assumed known. As $\lambda$ controls the relative importance of the penalty function, it is important to choose a value resulting in parameter estimates with small mean squared error (MSE). Here, one selection approach using V-fold cross-validation (VFCV) is presented.
Consider a dataset consisting of $K$ independently sampled variables, with the $i$th variable consisting of $n_i$ independent observations. Let $\boldsymbol{x}_i = (x_{i1}, \ldots, x_{i n_i})$ denote the observations corresponding to the $i$th variable. VFCV proceeds by partitioning the data into $V$ subsets of roughly equal size. For the $i$th variable, let $S_{i1}, \ldots, S_{iV}$ denote a partition of the indices $\{1, \ldots, n_i\}$ such that $\bigcup_{v=1}^{V} S_{iv} = \{1, \ldots, n_i\}$ and $S_{iu} \cap S_{iv} = \emptyset$ for all $u \neq v$.

VFCV repeatedly creates subsets of the data for model training, in each instance leaving out one of the $V$ subsets per variable. The subsets left out in each iteration are then used for model validation. More specifically, the model-building data subsets are used to compute penalized parameter estimates for various degrees of penalty enforcement, say $L$ possible values of $\lambda$ satisfying $\lambda_1 < \lambda_2 < \cdots < \lambda_L$. The negative log-likelihood function is then evaluated on the validation subsets using the penalized estimators corresponding to each possible value of $\lambda$. This is repeated $V$ times, and the optimal value of $\lambda$ is chosen as the value for which the negative log-likelihood function averaged over the validation subsets is minimized.
Algorithmically, implementation of VFCV proceeds as follows. For each fold $v = 1, \ldots, V$:

1. For the $i$th variable, form a training set by excluding the $v$th fold, $\boldsymbol{x}_i^{(-v)} = \{x_{ij} : j \notin S_{iv}\}$, and let the $v$th fold equal the validation set, $\boldsymbol{x}_i^{(v)} = \{x_{ij} : j \in S_{iv}\}$. Let $n_i^{(v)}$ denote the number of observations in $\boldsymbol{x}_i^{(v)}$. Also let $\mathcal{T}^{(-v)}$ and $\mathcal{V}^{(v)}$ denote the collections of the training and validation sets for all variables.

2. For each value $\lambda_r$, $r = 1, \ldots, L$, find the estimators $\tilde{\boldsymbol{\theta}}^{(-v)}(\lambda_r)$ that minimize the penalized negative log-likelihood function
$$-\ell\left(\boldsymbol{\theta}; \mathcal{T}^{(-v)}\right) + \lambda_r J(\boldsymbol{\theta}),$$
where $\ell(\boldsymbol{\theta}; \mathcal{T}^{(-v)})$ denotes the log-likelihood evaluated on the training sets.

3. Calculate the validation function by evaluating the negative log-likelihood at this estimator,
$$\mathrm{CV}_v(\lambda_r) = -\ell\left(\tilde{\boldsymbol{\theta}}^{(-v)}(\lambda_r); \mathcal{V}^{(v)}\right).$$
The VFCV score is then defined as
$$\mathrm{CV}(\lambda_r) = \frac{1}{V} \sum_{v=1}^{V} \mathrm{CV}_v(\lambda_r), \qquad (3)$$
and the optimal shrinkage level is taken to be the $\lambda_r$ that minimizes $\mathrm{CV}(\lambda_r)$, i.e. $\hat{\lambda} = \lambda_{\hat{r}}$ where $\hat{r} = \operatorname{arg\,min}_{r} \mathrm{CV}(\lambda_r)$. Note that after the optimal penalty level has been chosen using VFCV, penalized estimators are calculated one more time using the full dataset. The penalized likelihood estimator with data-driven shrinkage, denoted $\tilde{\boldsymbol{\theta}}(\hat{\lambda})$, is the value that minimizes
$$-\ell(\boldsymbol{\theta}; \boldsymbol{x}) + \hat{\lambda} J(\boldsymbol{\theta}),$$
where $\boldsymbol{x}$ denotes the full dataset.
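The following sketch outlines how the VFCV procedure above might be implemented for the binomial model. It is a minimal illustration (Python with NumPy/SciPy assumed; the function names and the random fold assignment are our own choices, not a prescribed implementation):

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_lik(p, x, m):
    """Binomial negative log-likelihood; x is a list of per-variable arrays."""
    p = np.clip(p, 1e-10, 1 - 1e-10)
    return -sum(np.sum(xi * np.log(pi) + (m - xi) * np.log(1 - pi))
                for xi, pi in zip(x, p))

def fit_penalized(x, m, penalty, lam, K):
    obj = lambda p: neg_log_lik(p, x, m) + lam * penalty(p)
    return minimize(obj, np.full(K, 0.5), bounds=[(1e-6, 1 - 1e-6)] * K).x

def vfcv_select(x, m, penalty, lambdas, V=10, seed=0):
    """Choose lambda by minimizing the VFCV score (3), then refit on all data."""
    K = len(x)
    rng = np.random.default_rng(seed)
    folds = [rng.integers(0, V, size=len(xi)) for xi in x]  # random fold labels
    scores = np.zeros(len(lambdas))
    for v in range(V):
        train = [xi[fi != v] for xi, fi in zip(x, folds)]
        valid = [xi[fi == v] for xi, fi in zip(x, folds)]
        for r, lam in enumerate(lambdas):
            p_hat = fit_penalized(train, m, penalty, lam, K)
            scores[r] += neg_log_lik(p_hat, valid, m) / V
    lam_best = lambdas[int(np.argmin(scores))]
    return lam_best, fit_penalized(x, m, penalty, lam_best, K)

# Usage: K = 10 passages, m = 50 words each, 50 readers per passage,
# with a penalty shrinking the success proportions toward one another.
rng = np.random.default_rng(1)
x = [rng.binomial(50, 0.04, size=50) for _ in range(10)]
lam, p_tilde = vfcv_select(x, 50, lambda p: np.sum((p - p.mean()) ** 2),
                           lambdas=np.logspace(-2, 4, 13))
```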
The literature on cross-validation recommends various choices for $V$, with common values ranging from $V = 5$ to $V = 10$. The choice $V = n$ is equivalent to leave-one-out cross-validation and can become computationally expensive. As discussed in arlot2010survey, the size of the validation set has an effect on the bias of the penalized estimator, while the number of folds $V$ controls the variance of the estimated penalization parameter. These authors also discuss some asymptotic properties of cross-validation. If $n_t$ denotes the size of the training set, then for $n_t / n \to 1$, cross-validation is asymptotically equivalent to Mallows' $C_p$ and therefore asymptotically optimal. Furthermore, if $n_t / n$ converges to a limit strictly less than one, then cross-validation is asymptotically equivalent to Mallows' $C_p$ multiplied by (or overpenalized by) a constant factor greater than one. Asymptotics notwithstanding, throughout the remainder of this paper, $V = 10$ is used. This strikes a balance between having larger training sets and reasonable computational costs.

4 Simulation Studies
In Section 2, various shrinkage estimators for the binomial distribution were considered, while Section 3 described a V-fold cross-validation approach for choosing the penalty parameter. Of course, the binomial model is not the only count model of interest. In this section, shrinkage estimation is considered both for the binomial model as well as for two related models, the zero-inflated binomial and the beta-binomial. In most scenarios investigated here, no closed-form solutions for the penalized estimators are available. Even so, these simulation studies are very useful for investigating the properties of different penalty functions as they impact estimation in the three models. Note that the simulations are restricted here to $K = 10$ independent variables, each consisting of $m = 50$ trials and having $n = 50$ independent observations per variable. This choice was made so that the simulations would, at least in part, closely resemble the real data motivating this work.
4.1 The Binomial Model
A sample was generated with $K = 10$ independent binomial variables, each with $n = 50$ independent observations and $m = 50$ trials. The binomial success probabilities $p_1, \ldots, p_K$ were sampled from different scaled beta distributions. Specifically, for specified success probability lower and upper bounds, three shapes of the success probability distribution were considered, namely a skewed distribution, a very flat distribution, and a bell-shaped distribution. The three success probability distributions are illustrated in Figure 4 below. The term $\lambda$ controlling how aggressively the penalty is enforced was chosen using cross-validation, with candidate values spaced approximately equidistantly on a logarithmic scale. VFCV was used to determine the optimal $\lambda$ for each penalty function under consideration.

In addition to the estimators resulting from the use of different penalty functions, maximum likelihood estimators were also calculated. In total, $B$ samples were generated for each configuration of success probability bounds and beta shape parameters. Summarized in the tables below are the Monte Carlo estimates of the MSE ratios. For the $b$th sample, let $\boldsymbol{p}^{(b)}$ denote the true success probabilities simulated from the specified scaled beta distribution. Let $\hat{\boldsymbol{p}}^{(b)}$ denote the MLE and let $\tilde{\boldsymbol{p}}^{(b)}$ denote a penalized estimator found using VFCV. Define the sum of squared deviations $\mathrm{SSD}(\boldsymbol{q}^{(b)}) = \sum_{i=1}^{K} (q_i^{(b)} - p_i^{(b)})^2$. Then, the Monte Carlo MSE ratios are defined as
$$R_J = \frac{\sum_{b=1}^{B} \mathrm{SSD}\left(\tilde{\boldsymbol{p}}^{(b)}\right)}{\sum_{b=1}^{B} \mathrm{SSD}\left(\hat{\boldsymbol{p}}^{(b)}\right)},$$
where the subscript “$J$” emphasizes the specific penalty function used to obtain the estimators. An MSE ratio less than $1$ indicates superior performance of the penalized estimator.
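For concreteness, a minimal sketch of the MSE ratio computation (illustrative code, assuming the true values and both sets of estimates are stored as $B \times K$ arrays):

```python
import numpy as np

def mse_ratio(p_true, p_pen, p_mle):
    """Monte Carlo MSE ratio from (B, K) arrays of true values, penalized
    estimates, and MLEs; values below 1 favor the penalized estimator."""
    return np.sum((p_pen - p_true) ** 2) / np.sum((p_mle - p_true) ** 2)
```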
In Table 4.1, the results of shrinking success probabilities to zero are presented. Each of the penalties $J_1$ through $J_4$ from Section 2.2 was considered. In Table 4.1, the results of shrinking success probabilities closer to one another are also presented. The penalties $J_6$ and $J_7$ from Section 2.3 were considered.
In Table 4.1, even the best-performing penalty function yields only a small relative improvement in efficiency. The only penalty that consistently leads to worse performance than maximum likelihood is $J_4$. Recall that this penalty function is not associated with a norm and is able to very aggressively shrink success probabilities to $0$. This simulation suggests that, at least in the scenarios considered, this penalty shrinks too aggressively. For the other three estimators, though, VFCV results in penalized estimators with slightly better performance than the MLE. In Table 4.1, the performance of the $J_6$ and $J_7$ penalties is nearly indistinguishable. When shrinking parameters closer to one another, large gains in efficiency are sometimes realized. This is especially notable when the beta distribution from which the success probabilities are generated is bell-shaped, i.e. when the $p_i$ are close to one another. In all instances, VFCV results in penalized estimators with performance superior to maximum likelihood. Altogether, these simulations illustrate that both the average success probability and the spacing of the $p_i$ relative to that average are important in determining the reduction in MSE. For penalties shrinking the $p_i$ closer to one another, MSE ratios well below one were realized, showing dramatic improvement due to shrinkage.
4.2 The Zero-Inflated Binomial Distribution
The pmf of the zero-inflated binomial (ZIB) distribution is
$$P(X = x) = \pi \, \mathbf{1}(x = 0) + (1 - \pi) \binom{m}{x} p^x (1 - p)^{m - x}, \qquad x = 0, 1, \ldots, m,$$
where $\pi$ represents the excess zero probability, and $p$ and $m$ are the binomial success probability and number of trials. For $X$ following a ZIB distribution, it follows that $E(X) = (1 - \pi) m p$. Consequently, the parameter $\mu = (1 - \pi) p$ is the expected proportion of successes in a ZIB model. The parameter $\mu$ is of primary interest when considering possible penalty functions, especially under the assumption that the different ZIB distributions are “similar” to one another.
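A minimal sketch of the ZIB log-pmf and mean proportion (illustrative code; the mixture form follows the pmf above):

```python
import numpy as np
from scipy.stats import binom

def zib_logpmf(x, m, p, pi):
    """Log-pmf of the zero-inflated binomial: point mass at zero with
    probability pi mixed with a Bin(m, p) component."""
    base = binom.pmf(x, m, p)
    return np.log(np.where(x == 0, pi + (1 - pi) * base, (1 - pi) * base))

m, p, pi = 50, 0.06, 0.2
print((1 - pi) * p)                                          # mu = 0.048
print(np.exp(zib_logpmf(np.arange(m + 1), m, p, pi)).sum())  # pmf sums to 1
```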
In the simulation study, a sample was generated with $K = 10$ independent ZIB variables. As with the binomial model simulation, $n = 50$ and $m = 50$. The overall expected success proportions and the excess zero probabilities were sampled from scaled beta distributions as per Figure 4, with the specific bounds listed in the table below. Letting $\mu_i = (1 - \pi_i) p_i$, $i = 1, \ldots, K$, the ZIB simulation considered three penalty functions: $J_{\mathrm{zero}} = \sum_{i=1}^{K} \mu_i^2$, $J_{\mathrm{mean}} = \sum_{i=1}^{K} (\mu_i - \bar{\mu})^2$, and $J_{\mathrm{full}} = \sum_{i=1}^{K} (p_i - \bar{p})^2 + \sum_{i=1}^{K} (\pi_i - \bar{\pi})^2$. The first of these, termed zero shrinkage, results in success proportions closer to $0$. The second, termed mean shrinkage, results in mean success proportions closer to one another. The third, termed full shrinkage, shrinks all $p_i$ closer to one another and all $\pi_i$ closer to one another. While both the penalties $J_{\mathrm{mean}}$ and $J_{\mathrm{full}}$ have the goal of estimating models that are “similar” to one another, the second penalty is much more strict. To see this, consider two variables with $\mu_1 = \mu_2$. Under the first penalty, the contribution of their squared difference is $0$. However, even with $\mu_1 = \mu_2$, the corresponding parameter pairs $(p_1, \pi_1)$ and $(p_2, \pi_2)$ may be very different.
In addition to using VFCV to select the level of shrinkage for the above three penalties, a combined estimator, termed minCV, was calculated by selecting the model parameters associated with the penalty function having the minimum VFCV score. The same grid of $\lambda$ values as in Section 4.1 was used. The Monte Carlo MSE ratios for the success proportions $\mu_i$ are in Table 4.2. The MSE ratios for $p_i$ and $\pi_i$ were also calculated, and these can be found in the Supplemental Material.
The MSE ratios in Table 4.2 are based on $B$ simulated datasets for each possible simulation configuration. While the zero shrinkage penalty does result in some efficiency gains in most scenarios, overall MSE ratios close to 1 suggest little improvement from using this penalty. On the other hand, both mean and full shrinkage result in large decreases in the MSE ratios. Overall, it cannot be said that either mean or full shrinkage performs best. This makes sense, as relative performance depends on the configuration of all parameters and not just the mean parameters. Finally, while minCV does not always have the smallest MSE ratio, it is generally close to the minimum. This suggests that data-driven selection of the level of shrinkage as well as of the penalty function leads to good performance for the model.
4.3 The Beta-Binomial Model
The probability mass function of the beta-binomial distribution is given by
$$P(X = x) = \binom{m}{x} \frac{B(x + \alpha,\, m - x + \beta)}{B(\alpha, \beta)}, \qquad x = 0, 1, \ldots, m,$$
where $B(\cdot\,, \cdot)$ is the beta function and $m$ is the number of trials. The parameters $\alpha$ and $\beta$ control the mean and variance of the model. Specifically, for $\mu = \alpha / (\alpha + \beta)$ and $\rho = 1 / (\alpha + \beta + 1)$, the mean and variance can be written as $E(X) = m \mu$ and $\mathrm{Var}(X) = m \mu (1 - \mu) [1 + (m - 1) \rho]$. In this parameterization, $\mu$ and $\rho$ denote, respectively, the expected proportion of successes and the overdispersion of the model relative to the binomial.
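A short sketch relating the $(\mu, \rho)$ parameterization above to the usual shape parameters (illustrative code using scipy.stats.betabinom):

```python
from scipy.stats import betabinom

def ab_from_mu_rho(mu, rho):
    """Map (mean proportion, overdispersion) to shape parameters, using
    mu = a / (a + b) and rho = 1 / (a + b + 1)."""
    total = 1.0 / rho - 1.0            # a + b
    return mu * total, (1.0 - mu) * total

m, mu, rho = 50, 0.05, 0.10
a, b = ab_from_mu_rho(mu, rho)
X = betabinom(m, a, b)
print(X.mean() / m)                    # recovers mu = 0.05
print(X.var(), m * mu * (1 - mu) * (1 + (m - 1) * rho))  # identical variances
```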
Samples were generated with $K = 10$ independent beta-binomial variables, with $n = 50$ and $m = 50$. The overall expected success proportions and the overdispersion parameters were sampled from scaled beta distributions as per Figure 4, with the specific bounds listed in the table below. As in the ZIB case, the beta-binomial simulation considered three penalty functions. Letting $\mu_i = \alpha_i / (\alpha_i + \beta_i)$ and $\rho_i = 1 / (\alpha_i + \beta_i + 1)$, $i = 1, \ldots, K$, these were: $J_{\mathrm{zero}} = \sum_{i=1}^{K} \mu_i^2$, $J_{\mathrm{mean}} = \sum_{i=1}^{K} (\mu_i - \bar{\mu})^2$, and $J_{\mathrm{full}} = \sum_{i=1}^{K} (\mu_i - \bar{\mu})^2 + \sum_{i=1}^{K} (\rho_i - \bar{\rho})^2$. These are again termed, respectively, zero shrinkage, mean shrinkage, and full shrinkage. Again, after choosing the shrinkage levels using VFCV, one final estimator, termed minCV, was calculated by selecting among the three penalized estimators the one with the smallest CV score.
For each parameter configuration, a total of $B$ samples was generated. The MSE ratios are reported in Table 4.3. The table shows the results for the success proportions $\mu_i$; the equivalent results for the remaining model parameters can be found in the Supplemental Material.
When looking at the results in Table 4.3, it comes as no surprise that zero shrinkage is the least effective approach here, even while still being more effective than maximum likelihood. For most of the simulation configurations, the MSE ratios under mean and full shrinkage are comparable. Here, the minCV approach is also very impressive, in most instances nearly matching the best-performing method. This reaffirms that VFCV can be used effectively to choose both the level of shrinkage for a specific penalty function and, beyond that, to choose from among competing penalty functions.
5 Data Analysis
The work presented in this paper was motivated by reading data collected from a sample of 508 elementary-school-aged children. Each child read one randomly selected passage out of ten possible passages. This resulted in roughly 50 Words Read Incorrectly (WRI) scores per passage. The passage lengths varied from 44 to 69 words, with an average length of 51 words. Of interest is to accurately and efficiently estimate passage difficulty as measured by the average proportion of words read incorrectly. Note that higher WRI proportions indicate that a passage is more difficult. Figure 5 provides information about the passage-specific WRI proportions. The solid dot in each violin plot represents the mean WRI proportion, a typical estimate of passage difficulty.
Note that the mean WRI proportions in Figure 5 appear fairly close to one another, as passages are crafted to be within a narrow range of difficulty to reinforce fairness in grading and modeling. Thus, as one expects the WRI proportions to be close to one another, appropriate shrinkage may result in improved estimates.
Three models and three types of shrinkage were considered for the data at hand. In each instance, the same set of partitions was used to select a shrinkage parameter with 10-fold cross-validation. Table 5 reports the cross-validation scores as defined in (3).
It is evident from Table 5 that all variations of the beta-binomial model have much lower cross-validation scores than either the binomial or zero-inflated binomial models. With regards to penalty type, full shrinkage works best for this model, with mean shrinkage a distant second choice. Table 5 also shows the beta-binomial parameter estimates obtained using maximum likelihood as well as penalized likelihood with mean shrinkage and full shrinkage. It is interesting to note that in the full shrinkage solution, the $\mu_i$ have all been shrunk to within a very small tolerance of a common value, but the $\rho_i$ still exhibit a fair spread of values. The estimated success proportions span a much narrower range under full shrinkage than under maximum likelihood, showing much more adherence to the idea that the passages are similar in terms of difficulty.
For the interested reader, Figure 6 shows the penalized likelihood estimate trajectories for mean shrinkage and full shrinkage as a function of $\lambda$. The estimates are presented on a logarithmic scale, and the horizontal axes have been transformed to improve readability of the plots. Dashed vertical lines indicate the optimal shrinkage solutions as determined by VFCV.
Under mean shrinkage, the passage-specific estimates of $\rho_i$ still exhibit a large spread even when the success proportions $\mu_i$ are close to one another. Under full shrinkage, the $\mu_i$ values are very quickly shrunk to a nearly common value while the $\rho_i$ still exhibit some spread.
6 Conclusions
The goal of this project was the penalized estimation of multiple independent success proportions from multivariable count data. The application of interest considered WRI scores realized by students during an ORF assessment. The simulation results showed that, across the scenarios considered, large decreases in MSE were often achieved. There is also very little risk in using penalized likelihood, as using V-fold cross-validation never resulted in a large increase in MSE. When applying the methodology to the data of interest, it is seen that the resulting penalized estimators of the success proportions have a much tighter spread. This affirms the expectation that the passages are very similar in difficulty, with all estimated difficulty scores falling within a narrow range of expected proportions of words read incorrectly. Even so, this does highlight one important avenue for future research. If students read different passages during an ORF assessment, it is desirable to have a method that standardizes WRI scores to be independent of passage difficulty. Also, in practice students typically read multiple passages, so methods accounting for correlated WRI scores should also be explored.
References
Supplemental Material
In the main paper, simulation results are only presented for the success proportion estimates. Of course, it is reasonable to assess the effect of shrinkage on the other model parameters as well. The MSE ratios as defined in Section 4 are presented here when estimating the remaining parameters in the zero-inflated binomial model, see Table 7, and when estimating the remaining parameters in the beta-binomial model, see Table 8. These results confirm that shrinkage can be very beneficial to the estimation of incidental model parameters as well.