If a parametric model fits the available data well, is it because the model captures structure that is specific to the observed data, or because the model is so flexible that it would fit almost all conceivable data?
This paper provides a quantitative measure of model restrictiveness that can distinguish between the two explanations above and is easy to compute across a variety of applications. We test the restrictiveness of a model by simulating hypothetical data sets and seeing how well the model fits them. A restrictive model performs poorly on most of the hypothetical data sets, while an unrestrictive model approximates almost all conceivable data.
What the analyst views as conceivable reflects their ex-ante knowledge or intuition. For example, the analyst might think that everyone prefers more money to less, or that players are less likely to choose strictly dominated actions. To measure restrictiveness, we propose that the analyst first stipulates some basic application-dependent restrictions on the data, and then generates random data sets that obey these properties. Our measure of restrictiveness, based on the model’s performance on this hypothetical data, tells us how much the model restricts behaviors beyond these background restrictions.
We complement the evaluation of restrictiveness, which is based solely on hypothetical data, with an evaluation of the model’s performance on actual data, using the measure of completeness proposed in Fudenberg et al. (2019). If a model is very unrestrictive, then its completeness on the real data does not directly speak to its relevance. In contrast, a model that is simultaneously restrictive and complete encodes important structure.
Our restrictiveness measure can be computed from data without analytical characterizations of the model's implications or empirical content, so it can be used in settings where no such results are available. (There are representation theorems for many non-parametric theories of individual choice, and some analytic results for the sets of equilibria in games, but we are unaware of representation theorems for the functional forms that are commonly used in applied work.)
We provide estimators for restrictiveness and completeness, and characterize their asymptotic distributions and standard errors. We then apply our method and estimators to evaluate parametric models from two classic settings in experimental economics: predicting certainty equivalents for binary lotteries and predicting initial play in matrix games. In each of these domains, these measures reveal new insights about the models we examine.
In our first application, we evaluate the restrictiveness of a popular three-parameter specification of Cumulative Prospect Theory (CPT), using a set of binary lotteries from Bruhin et al. (2010). In addition to the reported certainty equivalents, we generate many hypothetical data sets of certainty equivalents (restricted to satisfy first-order stochastic dominance). We find that while CPT is nearly complete, it is not very restrictive: it is able to fit the hypothetical certainty-equivalent data quite well, even though it has only three free parameters. This highlights an important difference between restrictiveness and notions of complexity based on parameter counts (see Section 2 for further comparison). CPT's relatively low restrictiveness is important to keep in mind when interpreting its striking predictive performance on real data.
To investigate the role of the parameters in CPT, we next compare the initial three-parameter specification of CPT to alternative specifications from the literature that have fewer parameters. We find that using only the two nonlinear probability weighting parameters approximates the performance of the three-parameter specification on actual data, while being substantially more restrictive. These results point to the importance of the nonlinear probability weighting parameters in CPT.
Our second application is to the prediction of initial play in matrix games from Fudenberg and Liang (2019). We evaluate the restrictiveness of the Poisson Cognitive Hierarchy Model (PCHM) (Camerer et al., 2004) by generating hypothetical distributions of play and evaluating how well the PCHM fits the hypothetical data. We find that in contrast to CPT, the PCHM is very restrictive: most hypothetical distributions are poorly fit by the PCHM for any parameter values. At the same time, the PCHM's performance on the actual data is substantially better than its performance on the hypothetical data. These findings suggest that the PCHM precisely isolates a systematic regularity in real behavior.
We next compare the PCHM with two alternative models: logit level-1, which models the distribution of play as a logistic best reply to the uniform distribution, and logit PCHM, which allows for logistic best replies in the PCHM (Wright and Leyton-Brown, 2014). We find that logit level-1 not only fits the actual data better than the PCHM, but is also more restrictive. Moreover, logit level-1 performs almost as well as the more complex logit PCHM on the actual data, and is substantially more restrictive.
Our measure of restrictiveness provides a new perspective on the problem of how richly to parameterize a model. Minimizing cross-validated prediction error can help, as overparameterized models can overfit the training data and perform poorly on test data. But cross-validation, like other techniques for guarding against overfitting, tends to favor increasingly flexible models as the available data grow. In contrast, our approach supposes an intrinsic preference for more parsimonious models. As we show, models with a small number of parameters, such as the three-parameter specification of CPT that we examine, can allow for a large range of behaviors, and models with the same number of parameters (PCHM versus logit level-1) can differ substantially in their restrictiveness. Understanding the range of behaviors a model permits tells us how much of its success on real data is due to flexibility and how much comes from capturing regularities that are specifically present in the data.
2 Related Work
Koopmans and Reiersol (1950) defined a model to be observationally restrictive if the distributions of observables it allows are a proper subset of the distributions that would otherwise be possible. Their definition is relative to an ambient family of outcome distributions; when this ambient family consists of every distribution, a non-restrictive theory cannot be refuted by data. (As Koopmans and Reiersol (1950) point out, a special case of an observationally restrictive specification is an overidentifying restriction; see e.g. Sargan (1958), Hausman (1978), Hansen (1982), and Chen and Santos (2018) for econometric tests of overidentification.)
Selten (1991) subsequently proposed measuring the restrictiveness of a model by the fraction of possible data sets that it can exactly explain. To compute this measure, the analyst needs to know which data sets are consistent with the model, which can be a demanding criterion. The criterion is satisfied in some cases, e.g. evaluating whether individual choices from budget sets are consistent with maximization of a utility function (Beatty and Crawford, 2011), or whether individual choices between certain pairs of lotteries are consistent with expected utility or one of its generalizations (Hey, 1998; Harless and Camerer, 1994). However, we do not know which distributions of initial play are consistent with the PCHM, so it is difficult to compute a measure such as Selten's (1991) for this parametric model.
In contrast, our proposed measure of restrictiveness is based on approximate rather than exact fit to a model, and we compute the model's fit numerically. In this respect, our approach is closer to revealed-preference papers that measure the distribution of the Afriat index. Choi et al. (2007) and Polisson et al. (2020) relax the implications of expected utility maximization using Afriat's "efficiency index" as an analog of our loss function, and then compare the distribution of the efficiency indices of the actual subjects with the distribution of efficiency indices in randomly generated data. Those papers evaluate the nonparametric restrictions of revealed preference, while our approach is designed to evaluate parametric models; it can be seen as extending a similar idea to other problem domains and loss functions.
Our use of simulated data to evaluate restrictiveness is similar in spirit to the use of simulated data to evaluate the power of a hypothesis test, as in Bronars (1987)’s numerical evaluation of a test of GARP proposed by Varian (1982), but it is not linked to hypothesis testing. We also provide statistical estimators for our proposed measures and standard errors for these estimates. These results tell us, for example, how many hypothetical data sets need to be generated in order to achieve a given level of approximation to our measure of restrictiveness.
Our work complements the representation theorems of decision theory, which describe the empirical content of different models. Currently, there are no representation theorems for many parametric economic models (including commonly used parameterizations of Cumulative Prospect Theory and the Poisson Cognitive Hierarchy Model). For example, although there are theorems that characterize which data are consistent with a general Cumulative Prospect Theory specification (Quiggin, 1982; Yaari, 1987), we know of no representation theorems for the popular functional form we use here. Moreover, even if a representation theorem is available, it can be computationally challenging to determine whether a given data set is consistent with the characterization. (For example, the Harless and Camerer (1994) exercise would be much harder on larger menus of binary lotteries, on lotteries with more outcomes, or if subjects had been asked to report real-valued certainty equivalents.)
Our paper is also related to the vast literature in statistics and econometrics on model selection, which dates back to Cox (1961, 1962). Unlike classic measures such as AIC and BIC, restrictiveness is not based on observed data, and it is not designed to guard against overfitting. Instead, it provides a practical procedure for evaluating the restrictiveness of a parametric model within a class of permissible models. (This paper has a different goal than the extensive econometric literature that studies how the "restrictiveness" of an econometric model may affect the identification of parameters and the efficiency of estimators.) Similarly, although the VC dimension, which provides another measure of the "span" of a model, is related to our restrictiveness measure at a high level, it is generally nontrivial to determine the VC dimension of a given model. (The VC dimension is known for very few economic models; a recent exception is the work of Basu and Echenique (2020) on various models of decision-making under uncertainty.) In contrast, our metric is, by design, easy to compute. Finally, to derive standard errors for our estimator of completeness, the paper utilizes a recent development in the statistics literature (Austern and Zhou, 2020) on the asymptotic theory of the cross-validation risk estimator.
Let $X$ be an observable (random) feature vector taking values in a finite set $\mathcal{X}$, and let $Y$ be an observable random outcome variable taking values in a finite-dimensional set $\mathcal{Y}$. We use $P$ to denote the joint distribution of $(X, Y)$, $P_X$ to denote the marginal distribution of $X$, and $P_{Y|X}$ to denote the conditional distribution of $Y$ given $X$. We assume that the marginal distribution $P_X$ is known to the analyst, while the conditional distribution $P_{Y|X}$ is not. (For example, in a decision theory experiment the experimenter knows the distribution over menus that the subjects will face.)
The analyst wants to learn a function of the conditional distribution: for each $x$, the quantity of interest is a statistic of $P_{Y|X=x}$ taking values in a finite-dimensional set $\Omega$. We call any function $f : \mathcal{X} \to \Omega$ a predictive mapping, or simply mapping, and denote the true mapping by $f^*$. The set of all possible mappings is denoted by $\mathcal{F}$.
We focus on two leading cases of this problem whose structure makes our methods easier to explain; Section 7 explains how to extend our approach to more general problems.
Prediction of a Conditional Expectation.
When the statistic of interest is the conditional expectation, $f^*(x) = \mathbb{E}[Y \mid X = x]$, the analyst's objective is to learn the average outcome for each realization of $X$. To evaluate the error of predicting $f(x)$ when the realized outcome is $y$, we use squared loss $(y - f(x))^2$. The expected error of a mapping $f$ is then $e(f) = \mathbb{E}\big[(Y - f(X))^2\big]$, which is minimized by the true mapping $f^*$. We show in Appendix A that the difference between the error of an arbitrary mapping $f$ and the best possible error is

$e(f) - e(f^*) = \mathbb{E}\big[(f(X) - f^*(X))^2\big]$,   (3.1)

i.e. the expected mean-squared difference between the predicted outcomes.
Our first application, predicting the average reported certainty equivalent for binary lotteries, is an example of this case. Each lottery is described as a tuple $(x, y, p)$, and the feature space $\mathcal{X}$ consists of the 50 tuples associated with lotteries in a data set from Bruhin et al. (2010). The outcome space $\mathcal{Y}$ is the set of possible certainty equivalents, and we seek to predict the population average of the certainty equivalents reported for each lottery $(x, y, p)$. A predictive mapping for this problem specifies an average certainty equivalent for each of the 50 binary lotteries.
Prediction of a Conditional Distribution.
Here the statistic of interest is the conditional distribution itself, $f^*(x) = P_{Y|X=x}$, so the analyst's objective is to learn the conditional distribution. To evaluate the error of predicting the distribution $f(x)$ when the realized outcome is $y$, we use the negative (conditional) log-likelihood $-\log f(x)(y)$. The expected error of mapping $f$ is $e(f) = \mathbb{E}\big[-\log f(X)(Y)\big]$, which is minimized by the true conditional distribution $f^*$. As we show in Appendix A, the difference between the error of an arbitrary mapping $f$ and the best possible error $e(f^*)$ is

$e(f) - e(f^*) = \mathbb{E}\big[D_{KL}\big(f^*(X) \,\|\, f(X)\big)\big]$,   (3.2)

i.e. the expected Kullback-Leibler divergence between $f(X)$ and the true distribution.
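When $\mathcal{X}$ is finite with a known marginal, both of these error gaps reduce to simple weighted averages. The following toy sketch (our own illustration, not code from the paper) computes each one:

```python
import numpy as np

# Toy illustration of the two error gaps on a finite feature set.
# p_x is the known marginal over features; f and g are mappings.

def d_squared(f, g, p_x):
    """Expected squared difference between two conditional-mean mappings."""
    return float(np.sum(p_x * (f - g) ** 2))

def d_kl(f, g, p_x):
    """Expected KL divergence D(g(x) || f(x)); each row of f and g is a
    predicted distribution over outcomes at one feature value."""
    return float(np.sum(p_x * np.sum(g * np.log(g / f), axis=1)))

p_x = np.array([0.5, 0.5])  # known marginal over two feature values
assert d_squared(np.array([1.0, 2.0]), np.array([1.0, 4.0]), p_x) == 2.0
assert d_kl(np.array([[0.5, 0.5], [0.2, 0.8]]),
            np.array([[0.5, 0.5], [0.2, 0.8]]), p_x) == 0.0
```

The KL gap is zero exactly when the predicted conditional distributions agree on every feature value with positive probability.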
Our second application, predicting initial play in matrix games, is an example of this case. Here the feature space $\mathcal{X}$ consists of the 466 unique matrix games from Fudenberg and Liang (2019), each described as a vector of payoffs. The outcome space $\mathcal{Y}$ is the set of row-player actions, and the analyst seeks to predict the conditional distribution over $\mathcal{Y}$ for each game, interpreted as the distribution of choices made by a population of subjects playing the same game. Thus $\Omega = \Delta(\mathcal{Y})$, the set of all distributions over row-player actions. A predictive mapping is any function taking the 466 games into predicted distributions of play.
Our goal is to evaluate the restrictiveness of parametric models $\mathcal{F}_\Theta = \{f_\theta : \theta \in \Theta\} \subseteq \mathcal{F}$, where the permitted mappings are indexed by a finite-dimensional parameter $\theta$ and $\Theta$ is a compact set. If the model contains a mapping $f_\theta$ that can approximate the predictions of the true mapping $f^*$, then $e(f_\theta)$ also approximates the true mapping's error $e(f^*)$. Given enough data, such a model will predict about as well as possible, but a good fit to the data could be because the model includes the "right" regularities, or because it is simply flexible enough to accommodate any pattern of behavior (i.e. $\mathcal{F}_\Theta$ includes most mappings).
Our strategy for determining the restrictiveness of a model is to generate random mappings from a primitive distribution $\mu$ over $\mathcal{F}$. In our applications below, we choose $\mu$ to be uniform over a set of "permissible mappings," which encodes prior knowledge or intuition about the setting. For example, when predicting certainty equivalents for lotteries, we may assume that people prefer more money to less.
We treat both the permissible set and the distribution $\mu$ over it as primitives. In a sense, their role is analogous to the choice of which alternatives to consider when computing the power of a statistical test: in both cases, the right choice is guided by intuition and prior knowledge, and not derived from formal considerations. (We note that in many settings where a "correct" distribution does not exist, uniform distributions are used as a default. For example, in computational complexity, the average-case time complexity of an algorithm measures the amount of time used by the algorithm, averaged over all possible inputs (Goldreich and Vadhan, 2007).) For this reason, it can be instructive to compute restrictiveness with respect to different choices of $\mu$, including those supported on different permissible sets, as we do in Appendix B.2.
We then evaluate how well the generated mappings can be approximated using the model $\mathcal{F}_\Theta$. When predicting conditional expectations, we define

$d(f, g) = \mathbb{E}\big[(f(X) - g(X))^2\big]$,

which extends the expression in (3.1) from the pair $(f, f^*)$ to arbitrary pairs of mappings. When predicting a conditional distribution, we define

$d(f, g) = \mathbb{E}\big[D_{KL}\big(g(X) \,\|\, f(X)\big)\big]$,

extending (3.2) in the same way. Since our subsequent statements hold for both of these functions, we simply write $d$, understanding that it means the first expression when predicting a conditional expectation and the second when predicting a conditional distribution.
The model's approximation error to a generated mapping $f$ is then $d(\mathcal{F}_\Theta, f) = \min_{\theta \in \Theta} d(f_\theta, f)$. We normalize this raw error relative to a benchmark naive mapping $f^{\text{naive}}$ chosen to suit the problem. We interpret the naive mapping as a lower bound that any sensible model should outperform. For example, in our application to predicting initial play in games, we define the naive mapping to predict a uniform distribution of play in every game. Normalizing relative to a naive benchmark returns a unit-free measure of approximation, which is easier to interpret. (See further discussion in Section 3.4.)
Definition 3.1. The $f$-discrepancy of model $\mathcal{F}_\Theta$ is

$\delta_f = \dfrac{d(\mathcal{F}_\Theta, f)}{d(f^{\text{naive}}, f)}$.
Since $f^{\text{naive}}$ is assumed to be an element of $\mathcal{F}_\Theta$, the $f$-discrepancy of $\mathcal{F}_\Theta$ is bounded above by 1, and since $d$ is nonnegative, the $f$-discrepancy is also bounded below by zero. Thus, the $f$-discrepancy in any problem must fall between 0 and 1. Large values of $\delta_f$ imply that the model does not approximate $f$ much better than the naive mapping does. Since the naive mapping itself has no free parameters and therefore does not have the flexibility to accommodate most mappings, concentration of the distribution of $\delta_f$ around large values implies that the model rules out many kinds of regularities.
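To make the definition concrete, here is a toy sketch (our own construction) of an $f$-discrepancy computation for a one-parameter linear model under squared loss, with the naive mapping included in the model so that the ratio is automatically bounded by one:

```python
import numpy as np

# Toy sketch: the f-discrepancy of the model f_theta(x) = theta * x
# on two feature points, under squared loss.
x_feat = np.array([1.0, 2.0])
p_x = np.array([0.5, 0.5])
f_naive = 0.0 * x_feat                    # naive mapping, here theta = 0

def d(f, g):
    """Squared-loss discrepancy between two mappings, as in (3.1)."""
    return float(np.sum(p_x * (f - g) ** 2))

def delta(f, thetas=np.linspace(-5.0, 5.0, 2001)):
    """min over the model of d(f_theta, f), normalized by the naive error."""
    best = min(d(theta * x_feat, f) for theta in thetas)
    return best / d(f_naive, f)

# A mapping inside the model has discrepancy ~0; every mapping has delta
# in [0, 1] because the naive mapping is itself an element of the model.
assert delta(np.array([2.0, 4.0])) < 1e-6
assert abs(delta(np.array([3.0, 1.0])) - 0.5) < 1e-3
```

In applications, the grid search over `thetas` would be replaced by whatever numerical minimization the model calls for.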
The restrictiveness of model $\mathcal{F}_\Theta$ is its average $f$-discrepancy.

Definition 3.2. The restrictiveness of model $\mathcal{F}_\Theta$ is

$r = \mathbb{E}_{f \sim \mu}\big[\delta_f\big]$.
If $\mathcal{F}_\Theta = \mathcal{F}$ (so that the model is completely unrestrictive), then $r = 0$ for every choice of $\mu$ with support on $\mathcal{F}$.
While restrictive models are desirable, a restrictive model is not particularly useful if it fails to predict real data. We would like models to embody regularities that are present in actual behavior, and rule out conceivable regularities that are not. We thus evaluate models from the dual perspectives of how restrictive they are and how well they predict actual data. The latter can be measured using the $f^*$-discrepancy of the model, where $f^*$ is the true mapping. This measure is tightly linked to the notion of completeness introduced in Fudenberg et al. (2019).
Definition 3.3 (Fudenberg et al., 2019).
The completeness of model $\mathcal{F}_\Theta$ is

$\kappa = \dfrac{e(f^{\text{naive}}) - e(\mathcal{F}_\Theta)}{e(f^{\text{naive}}) - e(f^*)}$,

where $e(\mathcal{F}_\Theta) = \min_{\theta \in \Theta} e(f_\theta)$.

Completeness is the complement of the $f^*$-discrepancy, since $\kappa = 1 - \delta_{f^*}$.
A model's completeness can be interpreted as the ratio of the reduction in error achieved by the model (relative to the naive baseline) to the largest achievable reduction. By construction, the measure is scale-free and lies within the unit interval. A large $\kappa$ suggests that the model is able to approximate the real data well: at the extremes, a model with $\kappa = 1$ matches the true mapping exactly, while a model with $\kappa = 0$ is no better at matching it than the naive model. We will report both restrictiveness and completeness for each of the models that we consider.
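In code, completeness is just this ratio of error reductions; a minimal sketch (our own illustration):

```python
# Completeness as the normalized error reduction (our own minimal sketch):
# kappa = (e_naive - e_model) / (e_naive - e_best).
def completeness(e_naive, e_model, e_best):
    return (e_naive - e_model) / (e_naive - e_best)

assert completeness(10.0, 4.0, 2.0) == 0.75   # model achieves 6 of 8 possible units
assert completeness(10.0, 10.0, 2.0) == 0.0   # no better than the naive mapping
assert completeness(10.0, 2.0, 2.0) == 1.0    # matches the best achievable error
```

The three assertions trace out the interpretation in the text: $\kappa = 1$ at the best achievable error and $\kappa = 0$ at the naive benchmark.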
3.4 Discussion of Measures
An alternative “area” measure.
Selten's area measure of model flexibility is $\lambda(\mathcal{F}_\Theta) / \lambda(\mathcal{F})$, where $\lambda$ is the Lebesgue measure; i.e., it is the fraction of possible mappings that are exactly consistent with the model. Our measure of restrictiveness differs both by normalizing with respect to the performance of a naive model, and by measuring how well the model approximates a randomly drawn mapping in $\mathcal{F}$, which allows us to quantify the degree of error. A model that does not include most mappings from $\mathcal{F}$ would be considered highly restrictive under the Selten measure, but would have low restrictiveness by our measure if it approximated most mappings very well.
Role of the normalization.
We define restrictiveness to be the average value of $\delta_f$, rather than of its un-normalized counterpart $d(\mathcal{F}_\Theta, f)$. Normalizing relative to a naive mapping has several advantages over the unit-dependent raw error: if we were to scale up the payoffs in the binary lotteries in our first application, then $d(\mathcal{F}_\Theta, f)$ would mechanically scale up as well, even though the flexibility of the model has not changed, which makes it hard to say what constitutes a "large" value. Normalizing by the naive error returns a unitless quantity that is easier to interpret and can more easily be compared across problems that use different error metrics.
Sensitivity to $\mu$.
We might prefer that the restrictiveness measure not respond too sensitively to small changes in $\mu$. We demonstrate now that it does not. For any two measures $\mu$ and $\mu'$,

$|r_{\mu} - r_{\mu'}| \le d_{TV}(\mu, \mu')$,   (3.4)

where $d_{TV}$ is the total variation distance. Thus for any two measures that are close in total variation distance, the corresponding restrictiveness measures must also be close. We complement this theoretical bound with a numerical sensitivity check in Section 5.2, where we evaluate restrictiveness with respect to beta distributions that are close to our uniform specification of $\mu$. The resulting variation in restrictiveness is quite small.
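A sketch of why the bound holds, using the layer-cake representation and the fact that each $\delta_f$ lies in $[0, 1]$ (our own derivation of the stated inequality):

```latex
\begin{aligned}
|r_{\mu} - r_{\mu'}|
  &= \left| \int \delta_f \, d\mu(f) - \int \delta_f \, d\mu'(f) \right| \\
  &= \left| \int_0^1 \big[ \mu(\{f : \delta_f > t\}) - \mu'(\{f : \delta_f > t\}) \big] \, dt \right| \\
  &\le \sup_{A} |\mu(A) - \mu'(A)| = d_{TV}(\mu, \mu').
\end{aligned}
```

The final inequality uses that the integrand is bounded by $d_{TV}(\mu, \mu')$ for each $t$ and that the interval of integration has length one.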
Combining $\kappa$ and $r$.
Ideal models have high completeness $\kappa$, so they approximate the real data well, but also high restrictiveness $r$, so they rule out regularities that could have been present but are not. These two criteria generate a partial order on models, and there are many ways to complete it. One possibility is to use a lexicographic ordering, where models are ordered first by $\kappa$ and then by $r$. Another is to impose a functional form that combines $\kappa$ and $r$, such as the difference $\kappa - (1 - r)$. (Selten (1991) provided an axiomatic characterization of the similar aggregator $m - a$, where $m$ is the pass rate of the model on the actual data and $a$ is the area measure we discussed above.) Yet another possibility is to use the probability that the model fits the actual data better than it fits a randomly generated data set, namely the quantile of $\delta_{f^*}$ under the distribution of $f$-discrepancies. In the present paper, we report $\kappa$ and $r$ separately, and leave it to the analyst's discretion whether or how to combine these two metrics.
Point-Identified and Set-Identified Models.
Note that $f$-discrepancy, restrictiveness, and completeness are well-defined regardless of whether the parametric model is point-identified or set-identified: the definitions of $\delta_f$, restrictiveness, and $\kappa$ do not rely on the uniqueness of the minimizers. In other words, we evaluate the parametric model through the minimized values $\min_{\theta \in \Theta} d(f_\theta, f)$ and $\min_{\theta \in \Theta} e(f_\theta)$, so our measures do not differentiate point-identified models from set-identified models that yield the same minimized values.
4 Estimates and Test Statistics
We now discuss how to implement our approach in practice.
4.1 Computing Restrictiveness
We provide an algorithm for computing $r$: sample $f_1, \dots, f_m$ independently from the distribution $\mu$ on $\mathcal{F}$, and for each sampled $f_i$, compute the discrepancy $\delta_{f_i}$. The sample mean $\hat{r} = \frac{1}{m} \sum_{i=1}^m \delta_{f_i}$ is an estimator for restrictiveness. In principle, the number of simulations $m$ can be taken as large as we want, so $\hat{r}$ can be made arbitrarily close to $r$ by the Law of Large Numbers. Moreover, the approximation error under a given finite $m$ can be quantified using standard statistical inference methods. We focus on the case where the distribution of $\delta_f$ is nondegenerate.
Assumption 1. The distribution of $\delta_f$ induced by $f \sim \mu$ is non-degenerate.
Assumption 1 is a very mild condition that can be easily verified: it suffices that $\delta_f$ and $\delta_{f'}$ are distinct for some pair of mappings $f, f'$ in the support of $\mu$.
One-sided hypothesis tests on $r$, e.g. for the null that $r = 0$ so that the model is completely unrestrictive, can also be carried out in standard ways. We again note that the confidence intervals here simply measure the approximation error of $\hat{r}$ based on a finite number of simulations, and do not reflect randomness in experimental data.
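The full recipe (sample, compute discrepancies, average, attach a simulation standard error) can be sketched as follows; the function names are ours, and `draw_mapping` and `delta` stand in for the application-specific pieces:

```python
import numpy as np

# Sketch of the estimation recipe: draw m mappings from mu, compute each
# discrepancy, and report the sample mean with a simulation standard error.
def estimate_restrictiveness(draw_mapping, delta, m, seed=0):
    rng = np.random.default_rng(seed)
    deltas = np.array([delta(draw_mapping(rng)) for _ in range(m)])
    r_hat = float(deltas.mean())
    se = float(deltas.std(ddof=1) / np.sqrt(m))  # valid under Assumption 1
    return r_hat, se

# Toy check: if the discrepancies were uniform on [0, 1], r_hat should be ~0.5.
r_hat, se = estimate_restrictiveness(lambda rng: rng.uniform(), lambda f: f, m=10_000)
assert 0.45 < r_hat < 0.55 and se < 0.01
```

The standard error shrinks at rate $1/\sqrt{m}$, which indicates how many hypothetical mappings are needed for a given precision.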
4.2 Estimating Completeness
In this section, we show how to estimate completeness $\kappa$.
Suppose that the analyst has access to a finite sample of $n$ observations $(x_i, y_i)$ drawn from the unknown true distribution $P$. To estimate completeness, we use $K$-fold cross-validation to estimate the out-of-sample prediction error of the model. (In our applications, we take the standard choice of $K = 10$.) Specifically, we randomly divide the observations into $K$ (approximately) equal-sized groups; to simplify notation, assume that $n/K$ is an integer. Let $k(i)$ denote the group number of observation $i$, and for each group $k$, let $\hat{f}^{-k}$ be the mapping from $\mathcal{F}_\Theta$ that minimizes error for prediction of the observations outside of group $k$. This estimated mapping is used for prediction of the $k$-th test set, and its out-of-sample error on that test set is its average loss over the observations in group $k$. The average test error across the $K$ folds is an estimator for the unobservable expected error of the best mapping from class $\mathcal{F}_\Theta$.
Applying this procedure with the class of mappings set to $\mathcal{F}_\Theta$, $\{f^{\text{naive}}\}$, or $\mathcal{F}$, we can compute $\hat{e}(\mathcal{F}_\Theta)$, $\hat{e}(f^{\text{naive}})$, and $\hat{e}(\mathcal{F})$ from the data, leading to the following estimator for $\kappa$:

$\hat{\kappa} = \dfrac{\hat{e}(f^{\text{naive}}) - \hat{e}(\mathcal{F}_\Theta)}{\hat{e}(f^{\text{naive}}) - \hat{e}(\mathcal{F})}$.
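A schematic version of the cross-validation step (our own sketch; `fit` and `loss` are placeholders for the application-specific estimation and error functions):

```python
import numpy as np

# Schematic K-fold cross-validation (K = 10 in the applications). `fit` maps
# a training sample to an estimated mapping; `loss` scores predictions.
def cv_error(x, y, fit, loss, K=10, seed=0):
    rng = np.random.default_rng(seed)
    folds = rng.permutation(len(x)) % K              # random assignment to K groups
    errs = []
    for k in range(K):
        train, test = folds != k, folds == k
        f_k = fit(x[train], y[train])                # estimated outside group k
        errs.append(loss(f_k(x[test]), y[test]))     # out-of-sample error on fold k
    return float(np.mean(errs))

def completeness_hat(e_naive, e_model, e_best):
    # the denominator must stay bounded away from zero (Assumption 2)
    return (e_naive - e_model) / (e_naive - e_best)
```

Running `cv_error` three times, with `fit` restricted to the model, to the naive mapping, and to the unrestricted class, yields the three error estimates entering the formula for the completeness estimator.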
It is crucial that the denominator in $\hat{\kappa}$ does not vanish asymptotically, so we impose the following assumption:
Assumption 2 (Naive Rule is Imperfect). $e(f^{\text{naive}}) > e(f^*)$.
This assumption is quite weak, as it simply says that the naive mapping performs strictly worse in expectation than the best mapping. Under additional technical conditions, we show, by applying and adapting Proposition 5 in Austern and Zhou (2020), that $\hat{\kappa}$ is asymptotically normal. See Appendix C for details.
To obtain the standard error, we use a variance estimator adapted from Proposition 1 in Austern and Zhou (2020). Specifically, for the $k$-th test set, let $\hat{f}^{-k}$ and $\hat{g}^{-k}$ be the estimated mappings from models $\mathcal{F}_\Theta$ and $\mathcal{F}$, respectively. For each observation in test fold $k$, we compute the difference in these two mappings' test errors, and we average the differences across the observations in the fold. The sample variance of these differences then yields a variance estimator for $\hat{\kappa}$.
We establish the asymptotic distribution of our proposed estimators via the following theorem.
5 Application 1: Certainty Equivalents
Our first application is to predicting certainty equivalents for a set of 25 binary lotteries from Bruhin et al. (2010). Each lottery is described as a tuple $(x, y, p)$, where $x > y \ge 0$ are the two possible prizes, and $p$ is the probability of the larger prize $x$. (See Appendix B.1 for our analysis of the Bruhin et al. (2010) lotteries in the loss domain, which is qualitatively very similar.) The feature space $\mathcal{X}$ consists of the 25 tuples associated with lotteries in the Bruhin et al. (2010) data, and the outcome space $\mathcal{Y}$ is the set of possible certainty equivalents. Each observation in the data is a pair consisting of a lottery and the certainty equivalent reported for it by a given subject; the variation in reported certainty equivalents for a fixed lottery reflects the fact that different subjects report different values. (In Appendix B.5, we discuss how to extend our approach to allow for subject-level heterogeneity.)
We seek to predict the average of the certainty equivalents (over subjects) reported for each lottery. A mapping for this problem is any function from the 25 lotteries to predicted average certainty equivalents. We define $d$ to be the expected mean-squared distance between two mappings' predictions, as in (3.1).
The economic model that we evaluate is a three-parameter version of Cumulative Prospect Theory indexed by $\theta = (\alpha, \gamma, \delta)$, which predicts the certainty equivalent

$f_\theta(x, y, p) = v^{-1}\big(w(p)\, v(x) + (1 - w(p))\, v(y)\big)$,

where $v(z) = z^\alpha$ is a value function for money, and

$w(p) = \dfrac{\delta p^{\gamma}}{\delta p^{\gamma} + (1 - p)^{\gamma}}$

is a probability weighting function. (This parametric form for $w$ was first suggested by Goldstein and Einhorn (1987) and Lattimore et al. (1992). Following Bruhin et al. (2010) and much of the literature, we estimate separate values of these parameters for losses (see Appendix B.1), so in a sense the "overall CPT model" has six parameters.) We specify $\mathcal{F}_\Theta$ as the set of all such functions $f_\theta$, and refer to this model simply as CPT. As a naive benchmark, we consider the function that maps each lottery into its expected value, corresponding to $\alpha = \gamma = \delta = 1$ in CPT.
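A sketch of this specification in code, under our reading of the power-utility and Goldstein-Einhorn weighting forms (treat the exact functional form here as an assumption of the illustration rather than a verbatim reproduction):

```python
# Sketch of the three-parameter CPT certainty-equivalent prediction.
def cpt_ce(x, y, p, alpha, gamma, delta):
    """Predicted certainty equivalent of the lottery paying x with
    probability p and y otherwise (x > y >= 0)."""
    w = delta * p**gamma / (delta * p**gamma + (1.0 - p)**gamma)   # weight on x
    return (w * x**alpha + (1.0 - w) * y**alpha) ** (1.0 / alpha)  # invert v(z) = z**alpha

# alpha = gamma = delta = 1 recovers expected value, the naive benchmark.
assert abs(cpt_ce(20.0, 0.0, 0.5, 1.0, 1.0, 1.0) - 10.0) < 1e-9
```

Fitting the model to a mapping then amounts to minimizing the mean-squared distance between `cpt_ce` predictions and the mapping's certainty equivalents over $(\alpha, \gamma, \delta)$.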
CPT is 95% complete for predicting these data, so the model achieves almost all of the possible improvement in prediction accuracy over the naive baseline; equivalently, the estimated $f^*$-discrepancy of this model is 0.05. (A similar result was reported in Fudenberg et al. (2019) for the pooled sample of gain-domain and loss-domain lotteries, and this finding is consistent with Peysakhovich and Naecker (2017)'s result that CPT approximates the predictive performance of lasso regression trained on a high-dimensional set of features.) One explanation is that CPT is a very good model of risk preferences; another possibility is that the model is flexible enough to mimic most functions from binary lotteries to certainty equivalents. These explanations have very different implications for how to interpret CPT's empirical success.
To distinguish between these explanations, we now compute CPT's restrictiveness. Our primitive distribution $\mu$ is a uniform distribution over the set of all mappings $f$ satisfying the following criteria (this uniform distribution is well-defined because the permissible set is a bounded subset of a finite-dimensional Euclidean space):
(1) the certainty equivalent of each lottery lies between its two prizes: $f(x, y, p) \in [y, x]$;
(2) if $x' \ge x$ and $y' \ge y$, then $f(x', y', p) \ge f(x, y, p)$;
(3) if $p' \ge p$, then $f(x, y, p') \ge f(x, y, p)$.
Constraint (1) requires that the certainty equivalent is within the range of the possible payoffs, while constraints (2) and (3) require $f$ to respect first-order stochastic dominance. Note that in the Bruhin et al. (2010) lottery data, there are many pairs of lotteries that can be compared via (2) and (3), so these conditions are not vacuous.
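One simple way to draw mappings from such a permissible set is rejection sampling: draw certainty equivalents within each lottery's prize range and discard draws that violate dominance. A sketch (the lottery list, tolerances, and acceptance logic are our own illustration, not the paper's code):

```python
import numpy as np

# Rejection sampling of hypothetical certainty-equivalent mappings that
# satisfy the range constraint (1) and the FOSD constraints (2)-(3).
def draw_mapping(lotteries, rng):
    """lotteries: list of (x, y, p) with x > y; returns CEs or None if rejected."""
    ce = np.array([rng.uniform(y, x) for (x, y, p) in lotteries])  # constraint (1)
    for i, (xi, yi, pi) in enumerate(lotteries):
        for j, (xj, yj, pj) in enumerate(lotteries):
            # lottery i dominates j when it is at least as good coordinate-wise
            if (xi, yi, pi) != (xj, yj, pj) and xi >= xj and yi >= yj and pi >= pj:
                if ce[i] < ce[j]:
                    return None                     # violates constraints (2)-(3)
    return ce

rng = np.random.default_rng(0)
lots = [(20.0, 0.0, 0.5), (20.0, 0.0, 0.75), (40.0, 10.0, 0.5)]
draws = [m for _ in range(2000) if (m := draw_mapping(lots, rng)) is not None]
assert len(draws) > 0
```

With many comparable lottery pairs the acceptance rate falls, and a practical implementation might instead sample the dominance-ordered certainty equivalents jointly; the rejection version above suffices to illustrate the construction.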
Below we plot the distribution of $f$-discrepancies for 100 random mappings drawn from $\mu$.
The restrictiveness of the model (i.e. the average $f$-discrepancy) is $\hat{r} = 0.29$, so on average CPT's approximation error is less than a third of the error of the naive (expected-value) mapping. Thus CPT is quite flexible, as it rules out very few regularities that are not already restricted by first-order stochastic dominance.
CPT's restrictiveness suggests an explanation of its completeness that is intermediate to the two explanations proposed above: CPT is quite flexible, as its average completeness is 71% on the hypothetical data, but it is even more complete on the real data (95%). Taking both measures into account via a composite such as the difference between completeness on the real data and average completeness on the hypothetical data, CPT's high completeness on real data somewhat compensates for its moderately high completeness on hypothetical data, so we conclude that it encodes some, but not very much, structure beyond first-order stochastic dominance.
The restrictiveness measure depends on the choice of distribution $\mu$, which we chose to be uniform. The uniform distribution is the same as the Beta(1,1) distribution, so to test the sensitivity of the restrictiveness measure we consider nearby beta distributions, with parameters $(a, b)$ sampled from a uniform distribution around $(1, 1)$. For each pair, we generate certainty equivalents from a Beta$(a, b)$ distribution rescaled to the prize range, again keeping only those functions that satisfy FOSD. Over 100 such distributions $\mu$, the average restrictiveness is 0.30, with a minimum value of 0.17 and a maximum value of 0.41. Thus our finding that CPT is quite flexible is robust to these perturbations in $\mu$. (The variation in restrictiveness is bounded by the total variation distance between the primitive choices of $\mu$ (see (3.4)), but it can be difficult to compute the total variation distance between complex choices of $\mu$.)
Next, in Appendix B.2, we compute the restrictiveness of the model with respect to a different background constraint, dropping the FOSD restrictions in (2) and (3) while keeping the range restriction in (1). We would expect the restrictiveness of CPT to increase in this case, since (for all parameter values) CPT obeys first-order stochastic dominance. We find, however, that the restrictiveness of CPT relative to this larger permissible set, 0.35, is only slightly higher than the restrictiveness of 0.29 that we find for the main specification of $\mu$. (Normalization plays an important role here: CPT's errors are substantially higher when we drop FOSD, but so are the errors of the naive benchmark (expected value). CPT's performance relative to the naive benchmark is comparable whether or not we impose FOSD.) This reinforces our finding that CPT is not very restrictive.
Our analysis so far leaves open the possibility that the flexibility of the 4-parameter CPT model is specific to the domain of binary lotteries. In Appendix B.4, we evaluate the restrictiveness of CPT on a set of 3-outcome lotteries from Bernheim and Sprenger (2020). We find that CPT is indeed more restrictive on this domain, but still quite flexible: its restrictiveness on these lotteries is 0.496. In particular, CPT remains much less restrictive than the models of initial play that we study in Section 6.
5.3 Comparing Models
One way to evaluate the value of additional parameters is to compare the increase in completeness that they permit against the decrease in restrictiveness. As an illustration, we compare the three-parameter specification of CPT with more restrictive special cases that have been studied in the literature: the specification of Tversky and Kahneman (1992); a specification that shuts down utility curvature, corresponding to a risk-neutral CPT agent whose utility function over money is linear but who exhibits nonlinear probability weighting; and a specification that shuts down probability weighting, corresponding to an Expected Utility decision-maker whose utility function is as given in (5.1). We associate each of these models with its free parameters, and refer to the original three-parameter specification simply as CPT. The distributions of discrepancies under these more restrictive models are shown in Figure 2 below.
Less general specifications are always at least weakly more restrictive, but the restrictiveness of a model must be considered jointly with its ability to explain the actual data. Table 3 reports restrictiveness and completeness for all four specifications of CPT, and Figure 3 plots these measures.
We find that the probability-weighting-only specification achieves a higher completeness than the curvature-only specification, and does so despite being more restrictive. This suggests to us that it is a better model of risk preferences. Adding the risk-aversion parameter to the nonlinear probability weighting parameters leads to only a slight improvement in completeness (which increases from 0.91 to 0.95), but results in a substantial drop in restrictiveness (which falls from 0.49 to 0.29). This suggests that the probability weighting parameters are more useful than the utility curvature parameter. (These qualitative comparisons also hold when we consider lotteries on the loss domain; see Appendix B.1.) Our finding is consistent with previous studies which find that probability distortions play an important role in explaining field data (Snowberg and Wolfers, 2010; Barseghyan et al., 2013), and adds a perspective on how much flexibility these parameters introduce. The Tversky and Kahneman (1992) specification is less complete but more restrictive than the probability-weighting-only specification, so these two models cannot be directly ranked.
6 Application 2: The Distribution of Initial Play
Our second application is to predicting the distribution of initial play in games. Here the feature space consists of the 466 unique matrix games from Fudenberg and Liang (2019).[16] This data includes a meta data-set of experimental data aggregated in Wright and Leyton-Brown (2014) from six experimental game theory papers, in addition to Mechanical Turk data from new experiments in Fudenberg and Liang (2019). Each game is described as a vector of payoffs. The outcome space is the set of row-player actions, and the analyst seeks to predict the conditional distribution over actions for each game, interpreted as the choices made by a population of subjects facing the same game; the space of predictions is thus the set of all distributions over row-player actions. A mapping for this problem is any function taking the 466 games into predicted distributions of play. We define the discrepancy between two mappings to be the expected Kullback–Leibler divergence between their predicted distributions, as in (3.2).
We define the naive mapping to predict the uniform distribution for every game. We then consider three economic models for this prediction task. The Poisson Cognitive Hierarchy Model (PCHM) of Camerer et al. (2004) supposes that there is a distribution over players of differing levels of sophistication: the level-0 player randomizes uniformly over his available actions; the level-1 player best responds to level-0 play (Stahl and Wilson, 1994, 1995; Nagel, 1995); and each higher-level player best responds to a perceived Poisson distribution over (lower) opponent levels. The Poisson rate parameter is the only free parameter of the model, and the naive mapping is nested as the limiting case in which the rate parameter goes to zero.
We also evaluate a model that we call logit level-1, which has a single free parameter governing the precision of a logit response. For each action, the predicted frequency with which it is played is a logit transformation of its expected payoff against a uniformly randomizing opponent. This model nests the prediction of uniform play (our naive rule) when the precision parameter is zero, and predicts a degenerate distribution on the level-1 action when the parameter is sufficiently large.
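A minimal sketch of the logit level-1 prediction rule follows; this is our reading of the elided display, and the exact scaling of payoffs (averaging over the opponent's uniform play) is an assumption.

```python
import numpy as np

def logit_level1(payoffs, lam):
    """Logit level-1 prediction for the row player.

    payoffs: (n_row, n_col) row-player payoff matrix. A level-1 player
    responds to uniform play by the opponent; lam scales the logit
    precision. lam = 0 gives the uniform (naive) prediction, and large
    lam concentrates play on the level-1 action.
    """
    payoffs = np.asarray(payoffs, dtype=float)
    u = payoffs.mean(axis=1)          # expected payoff vs. a uniform opponent
    w = np.exp(lam * (u - u.max()))   # stabilized logit weights
    return w / w.sum()

# Hypothetical 2x2 game: action 2 is the level-1 action (expected
# payoffs vs. uniform opponent are 4.5 and 7).
game = [[9, 0], [7, 7]]
```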
Finally, we consider a model that we call logit PCHM (see e.g. Wright and Leyton-Brown (2014)), which replaces the assumption of exact maximization in the PCHM with a logit best response. This model has two free parameters: the logit precision and the Poisson rate. The level-0 player randomizes uniformly, as in the PCHM. Recursively, each higher-level player plays a logit response to the expected payoffs of his actions against a player whose level is distributed according to the truncated Poisson distribution over lower levels, as defined in (6.1); this yields the distribution of play at each level. We aggregate across levels using a Poisson distribution with the given rate parameter.
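The recursion just described can be sketched as follows. The truncated-Poisson beliefs over lower levels follow the description in the text, while the cap on levels and the symmetric treatment of the column player are modeling details we assume for illustration.

```python
import math
import numpy as np

def logit_pchm(row_payoffs, col_payoffs, lam, tau, max_level=8):
    """Sketch of logit PCHM predictions for the row player.

    Level 0 randomizes uniformly; each level k >= 1 plays a logit
    response (precision lam) to opponents drawn from the truncated
    Poisson(tau) distribution over levels 0..k-1; the final prediction
    aggregates levels with Poisson(tau) weights.
    """
    R = np.asarray(row_payoffs, float)
    C = np.asarray(col_payoffs, float)
    n_row, n_col = R.shape

    def logit(u):
        w = np.exp(lam * (u - u.max()))
        return w / w.sum()

    pois = np.array([math.exp(-tau) * tau**k / math.factorial(k)
                     for k in range(max_level + 1)])
    row_levels = [np.full(n_row, 1 / n_row)]   # level-0 play
    col_levels = [np.full(n_col, 1 / n_col)]
    for k in range(1, max_level + 1):
        w = pois[:k] / pois[:k].sum()                 # truncated beliefs
        col_mix = (w[:, None] * col_levels).sum(0)    # perceived opponent play
        row_mix = (w[:, None] * row_levels).sum(0)
        row_levels.append(logit(R @ col_mix))         # level-k row play
        col_levels.append(logit(C.T @ row_mix))       # level-k column play
    agg = pois / pois.sum()
    return (agg[:, None] * row_levels).sum(0)
```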
The models PCHM, logit level-1, and logit PCHM turn out to be 43.6%, 72.7%, and 72.9% complete on the actual data. (Equivalently, their discrepancies are 0.564, 0.273, and 0.271.) Thus, as observed in a related study by Wright and Leyton-Brown (2014), logit PCHM provides much better predictions of the distribution of play than the baseline PCHM does. Perhaps surprisingly, we find that almost all of this improvement is obtained by simply adding the logit parameter to the level-1 model; the further improvement from allowing for multiple levels of sophistication is negligible.
The strong performance of logit level-1 for predicting initial play is consistent with the earlier result of Fudenberg and Liang (2019) that the level-1 model provides a good prediction of the modal action. It is harder to predict the full distribution of play, so it is not obvious from that result that level-1 play with a logit noise parameter would perform so well for prediction of the distribution of play. The strong performance of level-1 for predicting modal play, combined with our new observation that logit level-1 does a good job predicting the distribution of play, suggests that initial play in many games is rather unstrategic.[17] In Fudenberg and Liang (2019), we found that modal play in some sorts of games is better described by equilibrium notions than by level-1. Since such regularities cannot be accommodated by the logit level-1 model, they may explain the gap between the completeness of logit level-1 and full completeness. Costa-Gomes et al. (2001) find a sizable fraction of level-2 players in their experimental data, which may further help to explain this gap.
We turn now to evaluating the restrictiveness of these models. Compared to the case of preferences over binary lotteries, economic theory provides very little in the way of a priori restrictions on initial play.[18] Classic game theory alone would suggest that dominant strategies have probability 1 and dominated strategies have probability 0, but this is inconsistent with our data (and with most experimental data on play in games). We thus define the permissible set to include all mappings satisfying the following very weak conditions:
If an action is strictly dominated, then the frequency with which it is chosen does not exceed 1/3.[19] In the actual data, the median strictly dominated action receives a frequency of 0.03 and the maximum frequency is 0.35.
If an action is strictly dominant, then the frequency with which it is chosen is at least a given threshold.[20] In the actual data, the median strictly dominant action receives a frequency of 0.86 and the minimum frequency is 0.69.
For each of the PCHM, logit level-1, and logit PCHM, we generate 100 mappings from a uniform distribution over the set of permissible mappings, and evaluate the discrepancies with respect to these mappings.[21] The set of permissible mappings can be embedded in a finite-dimensional Euclidean space, so the uniform distribution over it is well-defined. The distributions of discrepancies are shown in the figure below.
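The generation of permissible mappings just described can be sketched by rejection sampling game by game. The 1/3 cap on dominated actions follows condition (1); the 0.5 floor for dominant actions is a placeholder of our own for the elided bound in condition (2).

```python
import numpy as np

rng = np.random.default_rng(1)

def random_permissible_play(n_actions, dominated=(), dominant=None,
                            cap=1/3, floor=0.5):
    """Rejection-sample one game's distribution of play from the uniform
    distribution on the simplex, keeping only draws that obey the
    background restrictions. `floor` is an assumed stand-in for the
    paper's lower bound on dominant actions."""
    while True:
        p = rng.dirichlet(np.ones(n_actions))  # uniform on the simplex
        if any(p[i] > cap for i in dominated):
            continue
        if dominant is not None and p[dominant] < floor:
            continue
        return p
```

A full permissible mapping is then one such draw per game, with the dominated/dominant action sets computed from each game's payoff matrix.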
We find that the PCHM’s restrictiveness is 0.915 and logit PCHM’s restrictiveness is 0.822, with logit level-1 more restrictive still. Indeed, across all of these mappings and models, the discrepancy is always at least 0.72. Equivalently, the completeness of these models across the simulated mappings is bounded above by 0.28. Since the completeness of these models on the actual data ranged from 0.436 to 0.729, each of these models is a much better predictor of the real data than of our hypothetical data sets.
Simply comparing the completeness of the PCHM, 0.436, against the completeness of CPT, 0.95, suggests that the PCHM is a “worse” model of initial play than CPT is of certainty equivalents for lotteries. The contrast in their restrictiveness (0.915 vs. 0.29) tells us that while the PCHM does not capture all of the observed behaviors, it more successfully rules out behaviors that we do not observe. These two perspectives are depicted in Figure 5: the discrepancy on real data is smaller for CPT than for the PCHM, implying that CPT better fits the real data, but the distribution of discrepancies computed from simulated data is concentrated at substantially larger values for the PCHM, so it is the more restrictive model.[22] The figure naturally suggests composite measures, such as the difference between the average discrepancy computed on hypothetical data and the discrepancy computed on real data, or the fraction of sampled mappings whose discrepancy exceeds the real-data discrepancy (as proposed in Section 3.4). By either of these composite measures, the PCHM is the “better” model, but we do not know what the right composite measure is.
Table 2 summarizes completeness and restrictiveness measures for all three models.
From Table 2 we see that logit level-1 is more complete and also more restrictive than the PCHM. Logit level-1 is also substantially more restrictive than logit PCHM, at the cost of only a slight and statistically insignificant decrease in completeness. These observations suggest that logit level-1 may be preferable to the PCHM and logit PCHM for predicting initial play.[23] We suspect that PCHM and logit PCHM would outperform logit level-1 for predicting the actions of subjects who played these games several times and learned from feedback. Note, however, that the restrictiveness of the models would not change.
7 Application to General Prediction Problems
In the two leading cases analyzed in the main text (Section 3.1), the discrepancy function is derived from a primitive loss function. We call the general property that permits this derivation decomposability.
Definition 7.1 (Decomposability).
Consider an arbitrary loss function, and define the error of a mapping to be its expected loss. For any distribution over the data, let the error-minimizing mapping be the mapping that minimizes this expected loss. Say that the problem is decomposable if there exists a discrepancy function such that, for every distribution (with fixed marginal distribution over features), the discrepancy between a mapping and the error-minimizing mapping equals the difference between the error of that mapping and the error of the error-minimizing mapping.
In general, prediction problems need not be decomposable. For example, suppose the objective is to predict the conditional median, and the loss function is the absolute error rather than squared loss. The expected error is then the mean absolute deviation, and the error-minimizing mapping takes each feature vector into the median value of the outcome at those features. We might want to use the expected absolute difference between two mappings' predictions as a measure of how different the predictions are, but this function does not satisfy (7.1). For the absolute value loss function, there is in fact no discrepancy function that satisfies (7.1), because the difference in errors cannot be determined from the two mappings alone, but depends on further properties of the conditional distribution of the outcome. (See Appendix D.2 for more details.)
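A small numerical check makes this concrete: under squared loss, the error gap between two fixed predictions is the same for any outcome distribution with the same conditional mean, while under absolute loss the gap varies with the shape of the distribution, so no function of the two mappings alone can reproduce it. The two distributions below are arbitrary illustrations we chose for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two conditional outcome distributions, both with mean 0 and median 0.
y_narrow = rng.normal(0.0, 0.1, 200_000)
y_wide = rng.uniform(-5.0, 5.0, 200_000)

f_best, f_alt = 0.0, 1.0  # error-minimizing prediction and an alternative

def error_gap(y, loss):
    """Difference in expected loss between the alternative and the
    error-minimizing prediction."""
    return np.mean(loss(y, f_alt)) - np.mean(loss(y, f_best))

squared = lambda y, c: (y - c) ** 2
absolute = lambda y, c: np.abs(y - c)

# Squared loss: the gap is (f_alt - f_best)^2 = 1 for BOTH distributions.
gap_sq_narrow = error_gap(y_narrow, squared)
gap_sq_wide = error_gap(y_wide, squared)

# Absolute loss: the gap differs sharply across the two distributions,
# even though both share the same median.
gap_ab_narrow = error_gap(y_narrow, absolute)
gap_ab_wide = error_gap(y_wide, absolute)
```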
When the problem is decomposable, as in the cases analyzed in the main text, our approach applies without change by setting the discrepancy to be the function satisfying (7.1). If the problem is not decomposable, we take the discrepancy function as a primitive, rather than deriving it from the loss function. The key concepts of discrepancies and restrictiveness are defined as in the main text, using this primitive discrepancy. What we lose is the equivalence between discrepancy and completeness described in (3.3). One can report restrictiveness (based on the primitive discrepancy) and completeness (based on the loss function), understanding that there is no inherent relationship between these concepts; larger values can still be interpreted as more restrictive and more complete models. A second alternative is to report the real-data discrepancy instead of completeness. Since restrictiveness is derived from the discrepancy, this second approach does not require specification of a loss function at all. A new estimation procedure is needed, however, as our approach in Section 4.2 makes use of the relationship between discrepancy and completeness; we provide an alternative estimator in Appendix D.1 for this purpose.
When a theory fits the data well, it matters whether this is because the theory captures important regularities in the data, or because the theory is so flexible that it can explain any behavior at all. We provide a practical, algorithmic approach for evaluating the restrictiveness of a theory, and demonstrate that it reveals new insights into models from two economic domains. The method is easily applied to models from other domains.[24] For example, to measure the restrictiveness of rational aggregate demand, one could generate random demand functions on a finite collection of budget sets, and compute the “distance” between these functions and ones that satisfy GARP. (We thank Tilman Börgers for this suggestion.)
We conclude with a few final comments.
Why prefer restrictive theories?
Completely unrestrictive theories, such as the theory of utility maximization with unrestricted dependence of preferences on the menu, can explain any data and so are vacuous. A theory is falsifiable if there is at least one potential observation that it could not explain. We can view restrictiveness as a quantitative extension of the idea of falsifiability. Just as we prefer falsifiable theories to vacuous ones, we prefer theories that are more restrictive, though this is not quite the same as “more falsifiable”: restrictiveness replaces the binary evaluation of whether or not a data set refutes the theory with a quantitative evaluation of how well the theory approximates the data.
Comparing the predictions of two models.
A common practice for distinguishing the empirical content of two models is to find instances where the models make different predictions. We do not compare models here, although our approach can be extended to compare the predictions of two models on the hypothetical data sets. Specifically, instead of evaluating the discrepancy between the estimated model and the best mapping, one could evaluate the discrepancy between the estimated models from two parametric families. The average discrepancy in this case would then represent an average disagreement between the two models on hypothetical data. We leave development of such concepts to future work.
- Austern and Zhou (2020) Austern, M. and W. Zhou (2020): “Asymptotics of Cross-Validation,” arXiv preprint arXiv:2001.11111.
- Barseghyan et al. (2013) Barseghyan, L., F. Molinari, T. O’Donoghue, and J. C. Teitelbaum (2013): “The Nature of Risk Preferences: Evidence from Insurance Choices,” American Economic Review, 103, 2499–2529.
- Basu and Echenique (2020) Basu, P. and F. Echenique (2020): “On the falsifiability and learnability of decision theories,” Theoretical Economics, forthcoming.
- Beatty and Crawford (2011) Beatty, T. and I. Crawford (2011): “How Demanding Is the Revealed Preference Approach to Demand?” American Economic Review, 101, 2782–95.
- Bernheim and Sprenger (2020) Bernheim, D. and C. Sprenger (2020): “Direct Tests of Cumulative Prospect Theory,” Working Paper.
- Bronars (1987) Bronars, S. (1987): “The Power of Nonparametric Tests of Preference Maximization,” Econometrica, 55, 693–698.
- Bruhin et al. (2010) Bruhin, A., H. Fehr-Duda, and T. Epper (2010): “Risk and Rationality: Uncovering Heterogeneity in Probability Distortion,” Econometrica, 78, 1375–1412.
- Camerer et al. (2004) Camerer, C. F., T.-H. Ho, and J.-K. Chong (2004): “A cognitive hierarchy model of games,” The Quarterly Journal of Economics, 119, 861–898.
- Chen and Santos (2018) Chen, X. and A. Santos (2018): “Overidentification in regular models,” Econometrica, 86, 1771–1817.
- Choi et al. (2007) Choi, S., R. Fisman, D. Gale, and S. Kariv (2007): “Consistency and Heterogeneity of Individual Behavior under Uncertainty,” American Economic Review, 97, 1–15.
- Costa-Gomes et al. (2001) Costa-Gomes, M., V. P. Crawford, and B. Broseta (2001): “Cognition and behavior in normal-form games: An experimental study,” Econometrica, 69, 1193–1235.
- Cox (1961) Cox, D. R. (1961): “Tests of separate families of hypotheses,” in Proceedings of the fourth Berkeley symposium on mathematical statistics and probability, vol. 1, 105–123.
- Cox (1962) ——— (1962): “Further results on tests of separate families of hypotheses,” Journal of the Royal Statistical Society: Series B (Methodological), 24, 406–424.
- Fudenberg et al. (2019) Fudenberg, D., J. Kleinberg, A. Liang, and S. Mullainathan (2019): “Measuring the Completeness of Theories,” Working Paper.
- Fudenberg and Liang (2019) Fudenberg, D. and A. Liang (2019): “Predicting and Understanding Initial Play,” American Economic Review, 109, 4112–4141.
- Goldreich and Vadhan (2007) Goldreich, O. and S. Vadhan (2007): “Special Issue on Worst-case Versus Average-case Complexity.”
- Goldstein and Einhorn (1987) Goldstein, W. M. and H. J. Einhorn (1987): “Expression theory and the preference reversal phenomena,” Psychological review, 94, 236–254.
- Hansen (1982) Hansen, L. P. (1982): “Large sample properties of generalized method of moments estimators,” Econometrica, 50, 1029–1054.
- Harless and Camerer (1994) Harless, D. and C. Camerer (1994): “The Predictive Utility of Generalized Expected Utility Theories,” Econometrica, 62, 1251–1289.
- Hausman (1978) Hausman, J. A. (1978): “Specification tests in econometrics,” Econometrica, 46, 1251–1271.
- Hey (1998) Hey, J. D. (1998): “An application of Selten’s measure of predictive success,” Mathematical Social Sciences, 35, 1–15.
- Koopmans and Reiersol (1950) Koopmans, T. and O. Reiersol (1950): “The Identification of Structural Characteristics,” The Annals of Mathematical Statistics, 21, 165–181.
- Lattimore et al. (1992) Lattimore, P. K., J. R. Baker, and A. D. Witte (1992): “The influence of probability on risky choice: A parametric examination,” Journal of Economic Behavior & Organization, 17, 315–436.
- Nagel (1995) Nagel, R. (1995): “Unraveling in Guessing Games: An Experimental Study,” American Economic Review, 85, 1313–1326.
- Peysakhovich and Naecker (2017) Peysakhovich, A. and J. Naecker (2017): “Using methods from machine learning to evaluate behavioral models of choice under risk and ambiguity,” Journal of Economic Behavior and Organization, 133, 373–384.
- Polisson et al. (2020) Polisson, M., J. K.-H. Quah, and L. Renou (2020): “Revealed Preferences over Risk and Uncertainty,” American Economic Review, 110, 1782–1820.
- Quiggin (1982) Quiggin, J. (1982): “A Theory of Anticipated Utility,” Journal of Economic Behavior and Organization, 3, 323–343.
- Sargan (1958) Sargan, J. D. (1958): “The estimation of economic relationships using instrumental variables,” Econometrica, 26, 393–415.
- Selten (1991) Selten, R. (1991): “Properties for a Measure of Predictive Success,” Mathematical Social Sciences, 21, 153–167.
- Snowberg and Wolfers (2010) Snowberg, E. and J. Wolfers (2010): “Explaining the Favorite-Long Shot Bias: Is It Risk-Love or Misperceptions?” Journal of Political Economy, 118, 723–746.
- Stahl and Wilson (1994) Stahl, D. O. and P. W. Wilson (1994): “Experimental evidence on players’ models of other players,” Journal of Economic Behavior and Organization, 25, 309–327.
- Stahl and Wilson (1995) ——— (1995): “On players’ models of other players: Theory and experimental evidence,” Games and Economic Behavior, 10, 218–254.
- Tversky and Kahneman (1992) Tversky, A. and D. Kahneman (1992): “Advances in Prospect Theory: Cumulative Representation of Uncertainty,” Journal of Risk and Uncertainty, 5, 297–323.
- Varian (1982) Varian, H. (1982): “The Nonparametric Approach to Demand Analysis,” Econometrica, 50, 945–973.
- Wright and Leyton-Brown (2014) Wright, J. R. and K. Leyton-Brown (2014): “Level-0 meta-models for predicting human behavior in games,” Proceedings of the fifteenth ACM conference on Economics and computation, 857–874.
- Yaari (1987) Yaari, M. (1987): “The Dual Theory of Choice under Risk,” Econometrica, 55, 95–115.
Appendix A Supplementary Material to Section 3.1
Mean-Squared Error. Suppose $\mathcal{Y} \subseteq \mathbb{R}$ and the loss function is $\ell(y, \hat{y}) = (y - \hat{y})^2$. Writing $e(f)$ for the expected loss of a mapping $f$ and $f^*(x) = \mathbb{E}[y \mid x]$ for the error-minimizing mapping, the following decomposition is standard:
$$e(f) - e(f^*) = \mathbb{E}_x\big[(f(x) - f^*(x))^2\big].$$

Negative Log-Likelihood. Suppose $\mathcal{Y}$ is a finite set, and the loss of a mapping $f$ that predicts a conditional distribution $f(x) \in \Delta(\mathcal{Y})$ is $\ell(y, f(x)) = -\log f(x)(y)$. Then, with $f^*(x)$ the true conditional distribution,
$$e(f) - e(f^*) = \mathbb{E}_x\big[D_{\mathrm{KL}}\big(f^*(x) \,\|\, f(x)\big)\big].$$
So the discrepancy is the expected Kullback–Leibler divergence, as desired.
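The negative log-likelihood case can be verified numerically at a single feature value; the two distributions below are arbitrary illustrations.

```python
import numpy as np

p_true = np.array([0.6, 0.3, 0.1])   # true conditional distribution at some x
q_model = np.array([0.4, 0.4, 0.2])  # a model's predicted distribution

def expected_nll(q):
    """Expected negative log-likelihood under the true distribution."""
    return -(p_true * np.log(q)).sum()

kl = (p_true * np.log(p_true / q_model)).sum()
gap = expected_nll(q_model) - expected_nll(p_true)
# gap equals KL(p_true || q_model), confirming the decomposition.
```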
Appendix B Supplementary Material for Application 1
B.1 Loss Domain Results
Below we repeat the analysis of Section 5 for the 25 binary lotteries over the loss domain from Bruhin et al. (2010). Again each lottery pays one of two prizes with a stated probability of the first prize, where the prizes are now weakly negative. We evaluate a 3-parameter version of CPT for this domain, using a probability weighting function and a power utility function over losses analogous to those in the main text.
We report below the equivalent of Table 3 for this domain. The qualitative findings are very similar to what we found in the main text. In particular, CPT’s restrictiveness is 0.35 (compared to our previous estimate of 0.29), so CPT is fairly unrestrictive on this set of lotteries as well. Additionally, we again find that the probability-weighting-only specification is simultaneously more complete and more restrictive than the curvature-only specification, and that augmenting the probability-weighting-only specification with the utility curvature parameter only marginally improves completeness while substantially decreasing restrictiveness.
B.2 Different Specification for the Permissible Set
Consider the alternative permissible set consisting of all mappings whose certainty equivalents lie within the range of each lottery’s prizes, without imposing FOSD. We sample 100 times from a uniform distribution over this set and report the distribution of discrepancies in the figure below:
The average discrepancy, 0.35, tells us that the model is more restrictive on this expanded set of mappings, but not substantially so.[25] Even though the errors are substantially higher than when we require the permissible mappings to respect FOSD, the estimated restrictiveness is almost the same because the naive error also increases. Specifically, the mean naive error is 343.32 (compared to 178.73 under the original permissible set), while the mean CPT error is 110.73 (compared to 58.21 under the original set).
B.3 Parameter Estimates
We report below the estimated parameters for each of the models that we consider. In the first column, we report the estimated parameters on the actual data. In the second, we report the average parameter estimates across our generated mappings.
(Table: for each model, its free parameters, the parameter estimates on the real data, and the average estimates on the generated mappings.)
B.4 Three-Outcome Lotteries
We use a set of 18 three-outcome lotteries from Bernheim and Sprenger (2020) (listed below) and evaluate the restrictiveness of Cumulative Prospect Theory for predicting certainty equivalents for these lotteries.
The prizes are ordered from smallest to largest. On the domain of three-outcome lotteries, CPT predicts a rank-dependent certainty equivalent for each lottery (Tversky and Kahneman, 1992). We use the functional forms for the utility and probability weighting functions given in the main text.
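For reference, the standard rank-dependent form on the gain domain, which we believe matches the elided display, is sketched below; the weighting function $w$ and utility $u$ are as in the main text, with prizes ordered $z_1 \le z_2 \le z_3$ and probabilities $p_1, p_2, p_3$:

```latex
u(CE) = \bigl(1 - w(p_2 + p_3)\bigr)\, u(z_1)
      + \bigl(w(p_2 + p_3) - w(p_3)\bigr)\, u(z_2)
      + w(p_3)\, u(z_3)
```

The decision weight on each prize is the weighted probability of receiving that prize or better, minus the weighted probability of receiving strictly better; for binary lotteries this reduces to the familiar two-outcome formula.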
A predictive mapping is a map from these 18 lotteries into average certainty equivalents. The set of permissible mappings is again defined to satisfy: (1) each certainty equivalent has to be in the range of the lottery outcomes, and (2) if a lottery first-order stochastically dominates another, then its certainty equivalent must be higher. We generate 100 random mappings from a uniform distribution over mappings satisfying these properties.
Below, we compare the distribution of discrepancies from Figure 6 with the distribution of discrepancies that we find for these three-outcome lotteries.
The restrictiveness of CPT on this set of three-outcome lotteries is 0.496, with a standard error of 0.018. Thus CPT is about 1.5 times as restrictive as a model of certainty equivalents for three-outcome lotteries as it is for binary lotteries. Besides imposing FOSD, CPT imposes rank dependence on the domain of three-outcome lotteries, a restriction that has no bite for binary lotteries; this may explain part of the increase in restrictiveness. Even this higher restrictiveness, however, is substantially less than what we find for models of initial play.
B.5 Heterogeneous Risk Preferences
Our analysis in the main text considered representative agent models. In some cases, the analyst may have auxiliary data on the subjects that can be used to improve predictions. We show now how completeness and restrictiveness can be evaluated in this case.
Specifically, we return to our first application and group subjects into three clusters identified by Bruhin et al. (2010). We fit CPT for each cluster, allowing parameter values to vary across groups. Table 4 reports completeness measures cluster by cluster.
| | Cluster 1 | Cluster 2 | Cluster 3 |
| --- | --- | --- | --- |
| Best Achievable Error | 29.59 | 36.30 | 67.05 |
The performance of the naive expected value rule, the best achievable performance, and the performance of CPT all vary substantially across clusters. For example, the behavior of subjects in cluster 1 is roughly consistent with expected value (the error of the naive rule is 39.90), while the behavior of subjects in cluster 2 departs substantially from this benchmark (the error of the naive rule is 99.94). The best achievable prediction error for these groups of subjects is also very different (ranging from 29.59 to 67.05), as is the error of CPT (ranging from 30.74 to 69.62).
The average completeness, weighted by the proportion of observations in each cluster, is 0.91, which is very close to what we found for the representative agent model. This may seem surprising at first, since allowing parameters to vary across subjects improves the accuracy of predictions. But the best mapping from the extended feature space, which includes cluster membership, is also more predictive than the best mapping considered previously. Thus what we find is that the completeness of CPT with three clusters, relative to the best three-cluster mapping, is comparable to the completeness of the representative-agent version of CPT, relative to the best representative-agent mapping.
Similarly, when measuring restrictiveness, we extend the set of permissible mappings to the enlarged feature space, so each generated pattern of behavior is a triple of mappings of the original kind, one per cluster. We ask how well these triples can be approximated using mappings from CPT. It is straightforward to see that the restrictiveness of the three-cluster CPT is identical to the restrictiveness of the representative-agent model.[26] Note that this is true for any number of exogenously specified clusters.
Appendix C Estimation of Completeness
C.1 Preliminary Definitions
We now introduce some definitions and notation that will be useful in the derivation of the asymptotic distribution of the CV-based completeness estimator.
C.1.1 Finite-Sample Out-of-Sample Error
Let the data set be a random sample of observations, and let an additional observation denote a random variable with the same distribution that is independent of this sample. For a given data set and a given model, we define the conditional out-of-sample error (given the data set) as the expected loss, over the additional observation, of the mapping selected by an estimator, that is, an algorithm that selects a mapping within the model based on the data. We also define the unconditional out-of-sample error by taking the expectation over different possible data sets.
From the definition of the K-fold cross-validation estimator, its expectation can be related to the out-of-sample error defined above. As a result, the asymptotic distribution of