Statistical learning techniques (a.k.a. machine learning or predictive modelling; we use the terms interchangeably) are increasingly advocated for the analysis of psychological data, see for example Yarkoni and Westfall (2017); Tonidandel et al. (2016); Putka et al. (2018); Chapman et al. (2016); McNeish (2015). We, the authors, think that this pronounced emphasis on these techniques is in general a good idea. While studying the concepts we found that many ideas go back to the early psychometric literature.
Four core elements in much of the statistical learning literature are the bias variance trade-off, cross-validation, regularization, and basis expansion. These four elements are important for finding an optimal model for a given data set. We provide a brief overview below; more detailed discussions are provided in Hastie et al. (2009) and Berk (2008).
The bias variance trade-off deals with avoiding over- and underfitting. With overfitting we capture too much noise; with underfitting, too little signal. In both cases, the fitted model does not generalize well to new observations from the same population. With an overfitted model the bias reduces to zero, that is, if we were to fit this model to repeated samples from the population, the average of all fitted models would be representative of the true population model. However, this model may have large variance, meaning that the estimated parameters vary a lot from sample to sample. With an underfitted model, on the other hand, the variance from sample to sample decreases at the cost of an increased bias.
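To make the trade-off concrete, the following minimal sketch estimates squared bias and variance by refitting two extreme prediction rules on repeated samples. Everything in it is an illustrative assumption rather than part of the reviewed literature: the population curve `true_f`, the sample size, the noise level, and the two rules (a mean-only rule that underfits and a one-nearest-neighbour rule that overfits).

```python
import math
import random

def true_f(x):
    """Hypothetical population relationship between predictor and response."""
    return math.sin(2 * math.pi * x)

def draw_sample(rng, n=30, noise_sd=0.3):
    xs = [rng.random() for _ in range(n)]
    ys = [true_f(x) + rng.gauss(0, noise_sd) for x in xs]
    return xs, ys

def predict_mean(xs, ys, x0):
    """Underfitting: ignore the predictor and always predict the sample mean."""
    return sum(ys) / len(ys)

def predict_nearest(xs, ys, x0):
    """Overfitting: copy the response of the single nearest training point."""
    nearest = min(range(len(xs)), key=lambda i: abs(xs[i] - x0))
    return ys[nearest]

def bias_variance(predict, x_grid, n_rep=500, seed=1):
    """Average squared bias and variance of the predictions over repeated samples."""
    rng = random.Random(seed)
    preds = [[] for _ in x_grid]
    for _ in range(n_rep):
        xs, ys = draw_sample(rng)
        for k, x0 in enumerate(x_grid):
            preds[k].append(predict(xs, ys, x0))
    bias2 = var = 0.0
    for k, x0 in enumerate(x_grid):
        mean_pred = sum(preds[k]) / n_rep
        bias2 += (mean_pred - true_f(x0)) ** 2
        var += sum((p - mean_pred) ** 2 for p in preds[k]) / n_rep
    return bias2 / len(x_grid), var / len(x_grid)

x_grid = [0.1 * k for k in range(1, 10)]
bias2_mean, var_mean = bias_variance(predict_mean, x_grid)
bias2_nn, var_nn = bias_variance(predict_nearest, x_grid)
print(f"mean-only rule:   bias^2 = {bias2_mean:.3f}, variance = {var_mean:.3f}")
print(f"nearest-neighbour: bias^2 = {bias2_nn:.3f}, variance = {var_nn:.3f}")
```

Under these assumptions the mean-only rule should show a large squared bias with little variance, and the nearest-neighbour rule the reverse, which is the pattern described in the paragraph above.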
Machine learners try to find the optimal model by avoiding underfitting and overfitting. In practice this often boils down to fitting a series of models and selecting the optimal one. The selection of an optimal model is done using cross-validation. This technique trades off bias and variance by finding the model that best generalizes to unseen data.
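The selection step can be sketched as follows. This is a minimal illustration, not a reference implementation: the three candidate models, the simulated data, and the helper name `k_fold_mse` are all assumptions made for the example.

```python
import random

def fit_constant(xs, ys):
    """Too simple: always predict the training mean."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y

def fit_linear(xs, ys):
    """Simple least squares regression line."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return lambda x: my + slope * (x - mx)

def fit_nearest(xs, ys):
    """Too flexible: copy the response of the nearest training point."""
    def predict(x):
        i = min(range(len(xs)), key=lambda j: abs(xs[j] - x))
        return ys[i]
    return predict

def k_fold_mse(fit, xs, ys, k=10, seed=0):
    """Mean squared prediction error of `fit`, estimated by k-fold cross-validation."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[f::k] for f in range(k)]
    sq_err = 0.0
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        sq_err += sum((ys[i] - model(xs[i])) ** 2 for i in held_out)
    return sq_err / len(xs)

# Hypothetical data: a linear signal plus noise.
rng = random.Random(42)
xs = [rng.random() for _ in range(200)]
ys = [3 * x + rng.gauss(0, 0.5) for x in xs]

candidates = {"constant": fit_constant, "linear": fit_linear, "nearest": fit_nearest}
cv_mse = {name: k_fold_mse(fit, xs, ys) for name, fit in candidates.items()}
best = min(cv_mse, key=cv_mse.get)
print(cv_mse, "->", best)
```

Each candidate is always evaluated on observations that were not used to fit it, so a model that merely memorizes the training data gains no advantage.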
Early researchers focussed largely on linear models because they were computationally feasible in the pre-computer era. In terms of over- and underfitting, the linear model might be too complex, capturing noise, or too simple, capturing too little signal. In the latter case, the relationship in the population is nonlinear and the linear model does not capture this. Starting from a linear model, the search for an optimal model can proceed in two directions: towards more flexible or towards less flexible models.
Let us start with less flexible models. Modern statistical learning techniques add a penalty part to the (least squares) loss function that is minimized in order to obtain regression weights. In lasso regression the penalty part consists of the sum of absolute values of the regression weights (Tibshirani, 1996), in ridge regression the penalty part consists of the sum of squared values of the regression weights (Hoerl and Kennard, 1970), and in the elastic net it is a combination of the lasso and ridge penalties (Zou and Hastie, 2005). Another way of understanding these regularization techniques is that the method searches for least squares estimates, but only in a small region of the parameter space; the form of the region is defined by the penalty function; see Hastie et al. (2015, pages 22 and 58) for graphical examples of these regions.
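The three penalized criteria can be written out as follows. The notation is an assumption introduced here for illustration: $N$ observations, $P$ predictors $x_{ij}$, intercept $\beta_0$, weights $\beta_j$, penalty strength $\lambda \geq 0$, and mixing parameter $0 \leq \alpha \leq 1$. The elastic net is shown in one common parametrization; software packages may scale the ridge part differently.

```latex
\begin{align*}
\text{lasso:} \quad
  & \min_{\beta_0,\,\beta}\; \sum_{i=1}^{N}\Bigl(y_i - \beta_0 - \sum_{j=1}^{P} x_{ij}\beta_j\Bigr)^2
    + \lambda \sum_{j=1}^{P} |\beta_j| \\
\text{ridge:} \quad
  & \min_{\beta_0,\,\beta}\; \sum_{i=1}^{N}\Bigl(y_i - \beta_0 - \sum_{j=1}^{P} x_{ij}\beta_j\Bigr)^2
    + \lambda \sum_{j=1}^{P} \beta_j^2 \\
\text{elastic net:} \quad
  & \min_{\beta_0,\,\beta}\; \sum_{i=1}^{N}\Bigl(y_i - \beta_0 - \sum_{j=1}^{P} x_{ij}\beta_j\Bigr)^2
    + \lambda \Bigl[\alpha \sum_{j=1}^{P} |\beta_j| + (1-\alpha) \sum_{j=1}^{P} \beta_j^2\Bigr]
\end{align*}
```

The "small region" view corresponds to the equivalent constrained form, for instance minimizing the residual sum of squares subject to $\sum_j |\beta_j| \leq t$ for the lasso or $\sum_j \beta_j^2 \leq t$ for ridge regression, where a smaller bound $t$ plays the role of a larger $\lambda$.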
We can also make the model more flexible. One obvious possibility is to add polynomial terms of the predictor to the regression equation, so not only using $x$, but also $x^2$, $x^3$, etc. Such an operation is called basis expansion, because a single predictor is blown up to multiple predictors. Other basis expansion operations are spline regression models or kernel regression models.
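A polynomial basis expansion takes only a few lines. The sketch below is illustrative; the function name `polynomial_basis` and the example values are assumptions.

```python
def polynomial_basis(x, degree=3):
    """Expand one predictor value into the columns x, x^2, ..., x^degree."""
    return [x ** d for d in range(1, degree + 1)]

xs = [0.5, 1.0, 2.0]                       # a single predictor, three subjects
design = [polynomial_basis(x) for x in xs]  # three predictors per subject
for row in design:
    print(row)
```

The expanded columns can then be used as ordinary predictors in a linear regression, which is how a linear fitting procedure yields a nonlinear prediction rule.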
In the next section we delve into the history of psychometrics with a focus on cross-validation, the bias variance trade-off, and regularization. We will cite and quote from several papers starting in the 1920s, and we bounded (regularized) our search at 1980. We will see that a lot of ideas originated from this early psychometric literature. Moreover, investigating these old papers and relating conclusions from that work to current psychometric knowledge led us to two new questions/ideas, which we will discuss in Section 3. We conclude this paper with some discussion and ideas for teaching and future work.
In this section we present some early papers about cross-validation and regularization in the psychometric literature. Most of this literature is concerned with prediction; more specifically, with the question of how to combine item scores of a psychological test or test scores from a test battery into a composite score to select persons for a job, or students for a school. We provide quotes from these papers to give some idea of the research tendencies at that time. We tried to find origins in the psychometric literature, but are by no means sure we found them. Moreover, we probably missed some contributions, for which we apologise in advance. This section is divided into two subsections, one about the early roots of cross-validation and the other about the early roots of regularization.
Selmer C. Larson, 1931 ‘It has been recognized by theoretical statisticians for some time that when a coefficient of multiple correlation ($R$) is derived for a given set of data, its value is likely to be deceptively large. If the computations have been correct, the value will hold rigidly for the set of data from which the regression equation was derived. If, however, the equation should be applied to a second set of data, even though strictly comparable, it has been supposed that the yield in this latter case would, except for errors due to sampling, be less than in the first. Moreover, it has been supposed that the more variables contained in the regression equation, the greater this shrinkage will be. This is particularly significant because ordinarily the practical employment of a regression equation involves its use with data other than those from which it was derived.’
Because of the lack of automated computational power, Larson (1931) works out an adjustment to the multiple correlation coefficient. This was the start of a long series of papers with adjustments to the multiple correlation coefficient (Wherry, 1951; Darlington, 1968; Browne, 1975b, a; Rozeboom, 1978; Claudy, 1978). To derive these adjustments distributional assumptions need to be made, which is a severe drawback of these indices, see:
Paul A. Herzberg, 1969 ‘However, these formulas require assumptions which are often difficult to satisfy and, therefore, many early investigators estimated the population correlation by applying to a second sample the regression weights calculated in an original sample. They found that the correlation between the regression function and the criterion in the second sample was less than the original sample multiple correlation. This technique became known as cross-validation of the predictor weights or simply as cross-validation (Mosier, 1951). The correlation in the second sample is called the cross-validity. ’
At the 1950 Psychometric Society meeting there was a symposium about cross-validation with Mosier, Cureton, Katzell, and Wherry as presenters. Their contributions appeared in Educational and Psychological Measurement in 1951. An important contribution was made by Mosier, who was essentially the first to introduce $K$-fold cross-validation, with $K = 2$, to overcome the main problem of split-sample validation, the loss of efficiency. Two quotes from these papers read:
Charles I. Mosier, 1951 ‘If the combining weights of a set of predictors have been determined from the statistics of one sample, the effectiveness of the predictor-composite must be determined on a separate, independent sample.’
Edward E. Cureton, 1951 ‘Psychologists have long been accustomed to grinding out multiple-regression equations and asserting that the regression coefficients so obtained are the best predictor weights that can be determined from the sample data. In one sense they are correct. The least-squares regression weights are the best weights to use in predicting the criterion scores of the validation-sample subjects. Their use maximizes the multiple correlation in that particular sample. It does so by giving optimal weights to everything which, in that sample, will contribute to prediction. “Everything” includes, however, the sampling errors in the validation-sample data. Hence the least-squares process over-fits; it fits the errors as well as the systematic trends in the data.’
We see that the origins of cross-validation lie far back in history and, as a matter of fact, in the psychometric literature. We also note that the term overfitting was already used in 1951. These papers indicate that overfitting is more pronounced for small samples, many predictor variables, and small effect sizes. These conditions were then, and are now, the rule in psychological research.
Regularization is commonly understood as making linear models less flexible by imposing constraints. Popular machine learning methods are lasso regression (Tibshirani, 1996) and elastic-net regression (Zou and Hastie, 2005). By imposing a penalty in the fitting procedure the variance from sample to sample is decreased and hopefully this trades off against the increased bias. In machine learning, regression weights are shrunken towards zero. In the psychometric literature we found the following quote about choosing the regression weights and verifying the prediction accuracy:
Edward E. Cureton, 1951 ‘Hence, we will probably achieve a higher aggregate correlation in a second sample if we weight the standard predictor scores equally, rather than by their beta regression coefficients as determined from the validation sample.’
In a similar fashion, Lawshe and Schucker (1959) investigated the following four weighting schemes:
1. simple addition of raw scores;
2. weighting raw scores by their standard deviation;
3. weighting of raw scores by the inverse of the standard deviation;
4. weighting by the least squares regression weights;
and found no evidence in favor of one of them over the others. Similar results can be found in Schmidt (1971) and Wainer (1976). The idea of equal weighting versus least squares weighting goes back to the early beginnings of psychometrics, see for example Burt (1950) who quoted
Frank N. Freeman, 1926 ‘weighting has come to be far less commonly used than it was a number of years ago.’ and Joy P. Guilford, 1936 ‘weighting is not worth the trouble.’
The four weighting schemes described above can be understood using the bias variance trade-off: equal weighting obviously does not give an unbiased estimate of the true regression equation, but because the data are not used for estimation the sample to sample variance is zero. This may be beneficial in some data analysis situations while harmful in others. This caused an argument between Wainer (1976) and Pruzek and Frederick (1978): Wainer claimed that equal regression weights are beneficial in almost any circumstance, while Pruzek and Frederick claimed that equal weighting is beneficial only in very limited situations.
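Both phenomena discussed in this section, the shrinkage of the multiple correlation in a second sample and the competitiveness of equal weights, can be reproduced in a small simulation. The sketch below rests on assumptions chosen for illustration only: five uncorrelated standard normal predictors, a set of hypothetical population weights of similar size, a first sample of 30 subjects, and a second sample of 500.

```python
import math
import random

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def ols_weights(rows, ys):
    """Least squares weights (intercept first) via the normal equations."""
    design = [[1.0] + row for row in rows]
    p = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(p)]
    return solve(xtx, xty)

def composite(weights, row):
    return weights[0] + sum(w * x for w, x in zip(weights[1:], row))

def correlation(u, v):
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)

def draw_sample(rng, n, true_weights, noise_sd=1.0):
    rows = [[rng.gauss(0, 1) for _ in true_weights] for _ in range(n)]
    ys = [sum(w * x for w, x in zip(true_weights, row)) + rng.gauss(0, noise_sd)
          for row in rows]
    return rows, ys

rng = random.Random(7)
true_weights = [0.3, 0.25, 0.2, 0.15, 0.1]   # hypothetical population weights
in_sample, cross_ols, cross_equal = [], [], []
for _ in range(300):
    rows1, ys1 = draw_sample(rng, 30, true_weights)    # first (derivation) sample
    rows2, ys2 = draw_sample(rng, 500, true_weights)   # second sample
    w = ols_weights(rows1, ys1)
    in_sample.append(correlation([composite(w, r) for r in rows1], ys1))
    cross_ols.append(correlation([composite(w, r) for r in rows2], ys2))
    cross_equal.append(correlation([sum(r) for r in rows2], ys2))

def mean(values):
    return sum(values) / len(values)

print("multiple R in first sample:      ", round(mean(in_sample), 3))
print("cross-validity, OLS weights:     ", round(mean(cross_ols), 3))
print("cross-validity, equal weights:   ", round(mean(cross_equal), 3))
```

In this particular setting the equal-weight composite is expected to cross-validate better than the least squares composite, because the population weights are similar and the first sample is small. With strongly unequal population weights or a large first sample the ordering can reverse, which is the substance of the disagreement between Wainer (1976) and Pruzek and Frederick (1978).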
Darlington (1978) was the first author who investigated these attempts in terms of the bias variance trade-off. He entitled his paper “reduced variance regression” in order to draw attention to the positive aspect (reducing variance) instead of the negative one (increasing bias). For a simple regression model with population parameter $\beta$ and the unbiased least squares estimator $\hat{\beta}$ of $\beta$, he showed that the expected squared error of a shrunken estimator $c\hat{\beta}$ with $0 \leq c \leq 1$ equals
$$\mathrm{E}\left[(c\hat{\beta} - \beta)^2\right] = c^2\,\mathrm{Var}(\hat{\beta}) + (1 - c)^2\beta^2,$$
where the first term represents the variance (which is equal to zero when $c = 0$) and the second term the squared bias (which is equal to zero when $c = 1$). Minimizing this function for $c$ gives
$$c = \frac{\beta^2}{\beta^2 + \mathrm{Var}(\hat{\beta})},$$
which is always smaller than 1, i.e. the least squares estimator is never optimal. Darlington further worked out this latter formula into an expression in the sample size and $\rho$, the population correlation between the predictor and response, which clearly shows the dependency of the amount of shrinkage on the strength of the population relationship and the sample size. The parameters on which $c$ depends are generally unknown and therefore the adjustment cannot readily be made in data analysis.
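The minimization step for the simple regression case can be written out as a short worked derivation. The symbols are chosen here for illustration and need not match Darlington's own notation: $\hat{\beta}$ is the unbiased least squares estimator of $\beta$ and $c$ is the shrinkage factor.

```latex
\begin{align*}
\mathrm{E}\bigl[(c\hat{\beta}-\beta)^2\bigr]
  &= \mathrm{Var}(c\hat{\beta}) + \bigl(\mathrm{E}[c\hat{\beta}]-\beta\bigr)^2
   = c^2\,\mathrm{Var}(\hat{\beta}) + (1-c)^2\beta^2, \\
\frac{\partial}{\partial c}\,\mathrm{E}\bigl[(c\hat{\beta}-\beta)^2\bigr]
  &= 2c\,\mathrm{Var}(\hat{\beta}) - 2(1-c)\beta^2 = 0
  \quad\Longrightarrow\quad
  c^{*} = \frac{\beta^2}{\beta^2+\mathrm{Var}(\hat{\beta})}.
\end{align*}
```

As a numerical illustration with made-up values, $\beta = 0.5$ and $\mathrm{Var}(\hat{\beta}) = 0.05$ give $c^{*} = 0.25 / 0.30 \approx 0.83$. Because $\mathrm{Var}(\hat{\beta})$ is positive in any finite sample, $c^{*}$ is strictly below 1, and it moves towards 1 as the sampling variance shrinks.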
After this analysis of simple regression, Darlington investigated results for multiple regression and discussed the Stein estimator and ridge regression in terms of bias and variance. Moreover, he gave an interpretation of ridge regression in terms of principal component regression and validity concentration.
3 Two New Ideas
While studying the statistical learning papers and the early psychometric papers about cross-validation and regularization, two questions arose. The first question is about the relationship between reliability of the test score and its predictive validity. The second is about regularization towards equal regression weights, as opposed to shrinkage towards zero as in the lasso regression model.
3.1 Idea I: Validity - Reliability relationship
In standard psychometric textbooks we often read that the reliability is an upper bound to the criterion or predictive validity and that with decreasing reliability the predictive validity also goes down. From the recent machine learning literature and from the papers cited in Section 2 we conclude that to say something objective about predictive validity we should not use the multiple correlation or another in-sample measure, but a cross-validated measure. Furthermore, in order to optimize predictive accuracy regression parameters should be shrunken. The question now arises how reliability, shrinkage, and cross-validated prediction accuracy (or error) are related.
Suppose we have a test score $X$ which is not a perfectly reliable score, i.e. $X = T + E$, where the true score $T$ and the measurement error $E$ are uncorrelated. The reliability of the test score can then be defined as
$$\rho_{XX'} = \frac{\sigma^2_T}{\sigma^2_X} = \frac{\sigma^2_T}{\sigma^2_T + \sigma^2_E},$$
where $\sigma^2$ denotes the variance.
Furthermore, suppose the regression on the observed test score is
where ; the error variance increases due to unreliability of the test score. In terms of reduced variance regression (Darlington, 1978), we note that the optimal shrinkage factor for unreliable test scores becomes smaller, i.e. more shrinkage is needed. One may, however, wonder what the effect is of reliability on predictive validity once this shrinkage is taken into account. To answer this question we set up a simulation experiment.
We generate a calibration set and a validation set, also known as training and test set, respectively. In this experiment the true score ($T$) follows a standard normal distribution. The criterion ($Y$) is related to the true score by
with and is drawn from a normal distribution with standard deviation equal to . The correlation between the true scores and criterion scores in the population equals , which we call the effect size. Observed test scores are obtained by
$$X = T + E,$$
where $E$ is a normally distributed variable with mean zero and variance $\sigma^2_E$. In our case, with $\sigma^2_T = 1$, this variance ($\sigma^2_E$) is related to the reliability ($\rho_{XX'}$) through
$$\sigma^2_E = \frac{1 - \rho_{XX'}}{\rho_{XX'}}.$$
We investigated reliabilities ranging from 1 to 0.5, i.e., from a perfectly reliable test to an unreliable test. Calibration data were generated with sample sizes of 25, 50, 100, and 200. Validation data were generated following the same model, but with a sample size of 1000. We replicated the procedure 1000 times.
In the calibration set we use 10-fold cross-validation to find an optimal value of the shrinkage parameter $c$, that is, the value that minimizes the prediction error. Then we fitted the linear regression model with this value of the shrinkage parameter to the complete calibration set in order to find an estimated intercept and slope. Using the estimated intercept and slope from the calibration set and the observed test scores in the validation set, we derive predicted criterion scores in the validation set. The prediction error is defined as the mean squared difference between the predicted and the observed criterion scores in the validation set.
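One replication of such an experiment might look like the sketch below. It is not the authors' R code, which is available on request; several ingredients are assumptions made for illustration: the slope `beta` and the error standard deviation (chosen so that the effect size is 0.5), the grid of candidate shrinkage factors, the interleaved fold assignment, and in particular the way shrinkage is applied (the least squares slope is multiplied by the factor and the intercept is recomputed).

```python
import math
import random

def simulate(rng, n, beta, error_sd, reliability):
    """Draw true scores T, criterion Y = beta*T + e, and observed scores X = T + E."""
    error_var = (1 - reliability) / reliability   # measurement error variance when var(T) = 1
    xs, ys = [], []
    for _ in range(n):
        t = rng.gauss(0, 1)
        ys.append(beta * t + rng.gauss(0, error_sd))
        xs.append(t + rng.gauss(0, math.sqrt(error_var)))
    return xs, ys

def fit_shrunken(xs, ys, c):
    """Least squares line whose slope is multiplied by the shrinkage factor c."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = c * sxy / sxx
    return my - slope * mx, slope

def mse(intercept, slope, xs, ys):
    return sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys)) / len(xs)

def choose_c(xs, ys, grid, k=10):
    """Pick the shrinkage factor with the smallest k-fold cross-validated error."""
    cv_error = []
    for c in grid:
        total = 0.0
        for f in range(k):
            train = [i for i in range(len(xs)) if i % k != f]
            test = [i for i in range(len(xs)) if i % k == f]
            a, b = fit_shrunken([xs[i] for i in train], [ys[i] for i in train], c)
            total += sum((ys[i] - a - b * xs[i]) ** 2 for i in test)
        cv_error.append(total / len(xs))
    return grid[cv_error.index(min(cv_error))]

rng = random.Random(2019)
beta, error_sd = 0.5, math.sqrt(0.75)    # hypothetical values giving effect size 0.5
grid = [g / 20 for g in range(21)]       # candidate shrinkage factors 0, 0.05, ..., 1
cal_x, cal_y = simulate(rng, 100, beta, error_sd, reliability=0.8)    # calibration set
val_x, val_y = simulate(rng, 1000, beta, error_sd, reliability=0.8)   # validation set

c_opt = choose_c(cal_x, cal_y, grid)
intercept, slope = fit_shrunken(cal_x, cal_y, c_opt)
prediction_error = mse(intercept, slope, val_x, val_y)
print("optimal shrinkage factor:", c_opt)
print("prediction error in validation set:", round(prediction_error, 3))
```

Wrapping the last block in loops over reliabilities, calibration sample sizes, and replications would give the full design described above.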
For every replication the optimal value for the shrinkage parameter and the corresponding prediction error are computed. In Figure 1 we see that, as expected, with decreasing reliability the optimal shrinkage factor becomes smaller, i.e. more shrinkage is necessary. More specifically, we see that with small effect sizes and small sample sizes the optimal value for $c$ is very low and sometimes even 0 (in which case predictions are based only on the estimated intercept). With larger effect sizes or sample sizes the amount of shrinkage is lower (values for $c$ are closer to 1).
Figure 2 shows the relationship between reliability and predictive validity. For small effect sizes, as they often occur in psychology, the effect of reliability on prediction error is minimal: whatever the sample size, the prediction error is about equal across reliabilities. For larger effect sizes prediction error increases when reliability decreases (right-hand panels). The latter effect is stronger for larger sample sizes.
3.2 Idea II: Shrinkage to equal regression weights
In statistical learning the regression weights are often shrunken towards zero. In the early psychometric papers, in contrast, the weights were shrunken towards their mean. The idea arose that we can combine the two, i.e. use standard software for penalized regression models in order to shrink towards equal regression weights.
Therefore, let us define the regression model with equal weights as
$$y_i = \beta_0 + \beta S_i + e_i,$$
where $S_i = \sum_{j=1}^{P} x_{ij}$ is the sum score for subject $i$.
Also define the multiple regression model as
$$y_i = \beta_0 + \sum_{j=1}^{P} \beta_j x_{ij} + e_i. \tag{1}$$
Here we have $P$ regression weights $\beta_j$ from which we can define $\bar{\beta} = \frac{1}{P}\sum_{j} \beta_j$ (the average weight) and $\delta_j = \beta_j - \bar{\beta}$. Therefore the same multiple regression model can be written as
$$y_i = \beta_0 + \bar{\beta} S_i + \sum_{j=1}^{P} \delta_j x_{ij} + e_i.$$
Furthermore, define $d_{ij} = x_{ij} - \bar{x}_i$, with $\bar{x}_i$ the person average score, and replace $x_{ij}$ in the equation above with $d_{ij}$ (which is allowed because $\sum_j \delta_j = 0$) to obtain
$$y_i = \beta_0 + \bar{\beta} S_i + \sum_{j=1}^{P} \delta_j d_{ij} + e_i. \tag{2}$$
Using these equations we rewrote the multiple regression model (equation 1) into another form where the predictors are a sum score ($S_i$) and the deviations of the item scores from the person mean item score ($d_{ij}$) (equation 2). These two models are equivalent in OLS terms, that is, they have the same amount of variance explained. This latter regression cannot be estimated using standard least squares procedures, because the predictor variables are perfectly collinear (so we would need to remove one of the $d_j$’s). However, penalized regression models have no difficulty with multicollinearity; in fact, that is why they were designed in the first place. Therefore, we can apply $L_1$ or $L_2$ penalties on the $\delta$-coefficients, and this can be done simply in standard software such as the glmnet package in R (Friedman et al., 2010).
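The reparametrization can be checked numerically with a small sketch. Note the swap of technique: instead of the lasso fitted with glmnet, the sketch uses a ridge-type (squared) penalty, because that has a closed-form solution that fits in a few lines without external libraries. The data, the population weights, and all function names are hypothetical. The penalty is placed only on the deviation coefficients, leaving the intercept and the common weight unpenalized.

```python
import random

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def penalized_ls(design, ys, penalties):
    """Minimise the residual sum of squares plus sum_j penalties[j] * coef[j]^2."""
    p = len(design[0])
    lhs = [[sum(d[i] * d[j] for d in design) + (penalties[i] if i == j else 0.0)
            for j in range(p)] for i in range(p)]
    rhs = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(p)]
    return solve(lhs, rhs)

def reparametrize(items):
    """Columns: intercept, sum score, and item deviations from the person mean."""
    total = sum(items)
    person_mean = total / len(items)
    return [1.0, total] + [x - person_mean for x in items]

def implied_item_weights(coef):
    """Translate (intercept, common weight, deviation weights) into one weight per item."""
    common, deltas = coef[1], coef[2:]
    mean_delta = sum(deltas) / len(deltas)
    return [common + d - mean_delta for d in deltas]

rng = random.Random(3)
true_weights = [0.6, 0.4, 0.3, 0.1]          # hypothetical population weights
P = len(true_weights)
rows = [[rng.gauss(0, 1) for _ in range(P)] for _ in range(200)]
ys = [1.0 + sum(w * x for w, x in zip(true_weights, row)) + rng.gauss(0, 1)
      for row in rows]

# Ordinary least squares in the standard parametrization (intercept dropped).
ols = penalized_ls([[1.0] + row for row in rows], ys, [0.0] * (P + 1))[1:]

# Reparametrized model with a light and a heavy penalty on the deviation weights.
design2 = [reparametrize(row) for row in rows]
light = implied_item_weights(penalized_ls(design2, ys, [0.0, 0.0] + [1e-4] * P))
heavy = implied_item_weights(penalized_ls(design2, ys, [0.0, 0.0] + [1e6] * P))

print("OLS item weights:       ", [round(w, 3) for w in ols])
print("light penalty (implied):", [round(w, 3) for w in light])
print("heavy penalty (implied):", [round(w, 3) for w in heavy])
```

With a negligible penalty the implied item weights essentially reproduce the least squares weights, and with a very large penalty they collapse to a single common weight, so the penalty strength moves the solution between the two classical weighting schemes. In glmnet the analogous setup would presumably rely on per-variable penalty factors to leave the sum score unpenalized; that detail is not stated in the text and is mentioned here only as a possibility.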
Let us verify how this penalizing towards an equal regression coefficient works in an empirical application. Garnefski and Kraaij (2007) describe a questionnaire for the assessment of cognitive emotion regulation. It consists of 36 five-point Likert items measuring nine conceptually different subscales: self-blame, other-blame, rumination, catastrophising, putting into perspective, positive refocusing, positive reappraisal, acceptance, and planning. Each subscale is measured by four items and has possible scores ranging from 4 to 20. Garnefski and Kraaij (2007) have two criterion variables: depressive and anxiety symptoms as measured by the SCL-90. We will use the depression subscale, which is measured by 16 items with possible scores between 16 and 80.
We compare two lasso regression models: a standard model where the coefficients of the predictor variables are shrunken towards zero, and the reparametrization in which the coefficients are shrunken towards equality, i.e. the model of equation 2. We used 10-fold cross-validation as implemented in the glmnet package (Friedman et al., 2010).
The results of fitting the model in the calibration set are shown in Figure 3 where on the left hand side the results are given for the standard lasso model and on the right hand side for the lasso as defined in equation 2. In the 10-fold cross-validation plots (upper and middle row) we see that the mean squared error for the newly proposed model is somewhat lower than for the standard lasso model. In the lower plots we see the regression coefficients as functions of the penalty parameter.
In Figure 3 we see that the mean squared errors obtained by shrinkage towards equal coefficients are lower than those from the standard lasso (most clearly seen in the middle row of the figure). The minimum mean squared error for the standard lasso model is 92.16, whereas that for the new parametrization is 88.45. Table 1 shows the estimated regression coefficients in the standard regression parametrization and the parametrization of equation 2.
| Refocus on Planning | -0.11 | 0 | 0 | -0.48 | -0.14 | 0 |
| Putting into Perspective | -0.10 | 0 | 0 | -0.48 | -0.18 | 0 |
4 Conclusions and Discussion
Machine learning methods are becoming increasingly popular in psychological research. The main aim of these models is to find prediction rules that have good predictive performance. Predictive performance is often measured using cross-validation techniques. Compared to linear models, the prediction rules are based on statistical models that are either more or less flexible. More flexibility is obtained by basis expansions, whereas less flexibility is obtained by regularization.
We showed that the field of psychometrics considered cross-validation and regularization already at the beginning of the 20th century as viable approaches to obtain generalizable results, i.e. prediction rules that would achieve better performance in practice. As such we can say that psychometrics is really at the origin of statistical learning theory.
On the other hand, it is sad to see how little of this rich history is portrayed in our common wisdom. A standard psychometric book covers the topic of predictive validity, but (as far as we know) does not cover regularization, the bias variance trade-off, or cross-validation. Hence, many psychologists keep studying predictive validity in a suboptimal manner (using in-sample estimates) and provide overly optimistic estimates of predictive validity. We therefore suggest that this theory be re-incorporated in basic psychometric textbooks.
Based on our review of the early psychometric literature we wondered about two issues.
In psychometric textbooks it is often written that with decreasing reliability of a test the predictive validity also goes down. Given that the unreliability of a test score finds its way into the error of a regression model, we concluded that more shrinkage is needed for unreliable tests. We tested this in a simulation study, and found that this is indeed correct. Furthermore, in this simulation study we found that there is a very weak relationship between reliability and predictive accuracy (see Figure 2) for small effect sizes. When the effect size is larger the relationship between reliability and predictive accuracy becomes stronger.
This finding has an important consequence for test assessment. The Dutch Committee for Test Evaluation (COTAN), for example, uses the criterion that the reliability of a test should be at least .90 in order to be used as a selection test. Such a criterion is based on the idea that predictive validity goes up with reliability. We showed that this is not a linear relationship and that for weak effect sizes, as often found in psychology, measurement error hardly influences predictive accuracy. So, to assess psychological tests that are used for selection it is important to explicitly focus on prediction error or accuracy instead of using the surrogate of reliability.
The second issue is that in early psychometrics regularization was often in terms of an equal or even unit regression weight for items in a test or tests in a test battery. Modern regularization techniques often shrink coefficients towards zero, not towards the mean coefficient. We rewrote the linear regression model in terms of a sum score and item deviation scores, i.e. the item score minus the person average over the items. Using this rewritten model we are able to shrink towards an equal regression weight for all items. Using an empirical example we showed that such a regularization might indeed be beneficial.
We would like to note two more things here. Mosier (1951) discussed validity generalization, the question whether a regression equation derived on a sample from one population generalizes to a second sample from another population. Such a second population may differ from the first population in terms of location (e.g. regression equation derived in Europe, validation sample from Asia) or time (regression equation derived in 2000, validation sample in 2019). Usual cross-validation techniques, such as the $K$-fold cross-validation often employed nowadays, do not test for this type of generalization, and therefore even cross-validated results obtained on a sample should always be taken with a grain of salt.
Darlington (1968) already noted that the well known relationship between mean squared error and the multiple correlation does not hold out-of-sample. When making out-of-sample predictions the mean value is not calibrated, and therefore the usual in-sample relationships between mean squared error, variance explained, and correlation are no longer true. We might obtain, for example, that for a set of new observations two prediction rules both give predictions that are perfectly linearly related to the observed criterion values, while the predictions of the second rule lie much closer to the observed values than those of the first. If we were to use the correlation between predicted value and observed value as a measure of predictive validity, both sets of predictions would have predictive validity equal to 1. The second set of predictions is, however, much better than the first, which would be evident when using the average squared difference. More information is given in Alexander et al. (2015). Darlington (1968) wrote: “The correlation coefficient is more useful in ‘fixed quota’ situations, and the mean square error is more useful in ‘flexible quota’ situations”. We would like to add that a test is often used for single person decisions and in such a case having a correlation measure is not helpful.
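The difference between the two measures is easy to demonstrate. The numbers below are hypothetical and chosen only to make the contrast visible; they are not the values used in the original example.

```python
import math

def correlation(u, v):
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)

def mean_squared_error(predicted, observed):
    return sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(observed)

observed = [1.0, 2.0, 3.0, 4.0, 5.0]        # hypothetical criterion values
rule_a = [10.0, 20.0, 30.0, 40.0, 50.0]      # right ordering, badly calibrated
rule_b = [1.1, 2.1, 3.1, 4.1, 5.1]           # right ordering, nearly exact

corr_a, corr_b = correlation(rule_a, observed), correlation(rule_b, observed)
mse_a, mse_b = mean_squared_error(rule_a, observed), mean_squared_error(rule_b, observed)
print("correlations:", corr_a, corr_b)
print("mean squared errors:", mse_a, mse_b)
```

Both rules rank the persons identically, which is all a fixed-quota selection needs, but only the second rule gives usable predictions of an individual's criterion score.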
We thank Nadia Garnefski and Vivian Kraaij for providing us with the empirical data.
The R code for the experiment and for the analysis of the empirical data can be requested from the first author.
- Alexander et al. (2015) Alexander, D. L. J., Tropsha, A., and Winkler, D. A. (2015). Beware of $R^2$: simple, unambiguous assessment of the prediction accuracy of QSAR and QSPR models. Journal of Chemical Information and Modeling, 55:1316–1322.
- Berk (2008) Berk, R. (2008). Statistical learning from a regression perspective. Springer, New York.
- Browne (1975a) Browne, M. W. (1975a). A comparison of single sample and cross-validation methods for estimating the mean squared error of prediction in multiple linear regression. British Journal of Mathematical and Statistical Psychology, 28:112–120.
- Browne (1975b) Browne, M. W. (1975b). Predictive validity of a linear regression equation. British Journal of Mathematical and Statistical Psychology, 28:79–87.
- Burt (1950) Burt, C. (1950). The influence of differential weighting. British Journal of Psychology, Statistical Section, 3:105–125.
- Chapman et al. (2016) Chapman, B. P., Weiss, A., and Duberstein, P. R. (2016). Statistical learning theory for high dimensional prediction: Application to criterion-keyed scale development. Psychological Methods, 21:603–620.
- Claudy (1978) Claudy, J. G. (1978). Multiple regression and validity estimation in one sample. Applied Psychological Measurement, 2:595–607.
- Cureton (1951) Cureton, E. E. (1951). Approximate linear restraints and best predictor weights. Educational and Psychological Measurement, 11:12–15.
- Darlington (1968) Darlington, R. (1968). Multiple regression in psychological research and practice. Psychological Bulletin, 69:161–182.
- Darlington (1978) Darlington, R. (1978). Reduced-variance regression. Psychological Bulletin, 85:1238–1255.
- Freeman (1926) Freeman, F. N. (1926). Mental Tests: Their History, Principles and Applications. Houghton Mifflin, Oxford, England.
- Friedman et al. (2010) Friedman, J., Hastie, T., and Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1–22.
- Garnefski and Kraaij (2007) Garnefski, N. and Kraaij, V. (2007). The cognitive emotion regulation questionnaire: Psychometric features and prospective relationships with depression and anxiety in adults. European Journal of Psychological Assessment, 23:141–149.
- Guilford (1936) Guilford, J. P. (1936). Psychometric methods. McGraw-Hill, New York.
- Hastie et al. (2009) Hastie, T., Tibshirani, R., and Friedman, J. (2009). The elements of statistical learning, 2nd edition. Springer, New York.
- Hastie et al. (2015) Hastie, T., Tibshirani, R., and Wainwright, M. (2015). Statistical Learning with Sparsity: The Lasso and Generalizations. Chapman and Hall/CRC, Boca Raton, FL.
- Herzberg (1969) Herzberg, P. A. (1969). The parameters of cross-validation. Psychometrika, 34:Monograph.
- Hoerl and Kennard (1970) Hoerl, A. E. and Kennard, R. W. (1970). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12:55–67.
- Larson (1931) Larson, S. C. (1931). The shrinkage of the coefficient of multiple correlation. Journal of Educational Psychology, 22:45–55.
- Lawshe and Schucker (1959) Lawshe, C. and Schucker, R. (1959). The relative efficiency of four test weighting methods in multiple prediction. Educational and Psychological Measurement, 14:103–114.
- McNeish (2015) McNeish, D. M. (2015). Using lasso for predictor selection and to assuage overfitting: A method long overlooked in behavioral sciences. Multivariate Behavioral Research, 50:471–484.
- Mosier (1951) Mosier, C. I. (1951). Problems and design of cross-validation. Educational and Psychological Measurement, 11:5–11.
- Pruzek and Frederick (1978) Pruzek, R. and Frederick, B. (1978). Weighting predictors in linear models: Alternatives to least squares and limitations of equal weights. Psychological Bulletin, 85:254–266.
- Putka et al. (2018) Putka, D. J., Beatty, A. S., and Reeder, M. C. (2018). Modern prediction methods: new perspectives on common problems. Organizational Research Methods, 21:689–732.
- Rozeboom (1978) Rozeboom, W. W. (1978). Estimation of cross-validated multiple correlation: A clarification. Psychological Bulletin, 85:1348–1351.
- Schmidt (1971) Schmidt, F. (1971). The relative efficiency of regression and simple unit predictor weights in applied differential psychology. Educational and Psychological Measurement, 31(3):699–714.
- Tibshirani (1996) Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, B, 58:267–288.
- Tonidandel et al. (2016) Tonidandel, S., King, E. B., and Cortina, J. M. (2016). Big data at work: The data science revolution and organizational psychology. Routledge, New York, NY.
- Wainer (1976) Wainer, H. (1976). Estimating coefficients in linear models: It don’t make no nevermind. Psychological Bulletin, 83:213–217.
- Wherry (1951) Wherry, R. J. (1951). IV. Comparison of cross-validation with statistical inference of betas and multiple $R$ from a single sample. Educational and Psychological Measurement, 11:23–28.
- Yarkoni and Westfall (2017) Yarkoni, T. and Westfall, J. (2017). Choosing prediction over explanation in psychology: Lessons from machine learning. Perspectives on Psychological Science, 12:1100–1122.
- Zou and Hastie (2005) Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic-net. Journal of the Royal Statistical Society, B, 67:301–320.